With the rise of large language models (LLMs), many businesses are eager to adopt LLM-powered AI assistants and chatbots, drawn by the promise of higher revenue through better customer experience or lower staffing costs.
While modern LLMs are powerful tools, they often struggle to provide accurate, consistent and contextually relevant responses. This necessitates methods like retrieval-augmented generation (RAG), which queries external knowledge sources such as databases and articles for supplementary information to feed the LLM.
These additional references reduce the inherent randomness of LLM outputs and produce contextually rich responses. RAG has emerged as a common approach to building AI assistants for most businesses.
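The retrieve-then-augment flow can be sketched in a few lines. This is an illustrative toy, not our production pipeline: the word-overlap scoring stands in for a real embedding-based retriever, and `build_rag_prompt` shows only the shape of the augmented prompt.

```python
# Toy sketch of the RAG flow: retrieve relevant snippets, then prepend them
# to the query as context for the LLM. Word overlap is a stand-in for a
# real retriever; a production system would rank by embedding similarity.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble the context-augmented prompt sent to the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"
```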
However, while many businesses manage to leverage RAG to create promising LLM proofs-of-concept, the leap to production often remains elusive. From developing chatbots for our customers, we’ve learned firsthand that three major challenges stand in the way of production-ready LLM applications: tuning prompts and context, catching and routing hallucinations, and systematically evaluating output quality.
In this post, we'll delve into these challenges and share our hard-won learnings from building enterprise-grade chatbots, particularly those powered by GPT-3.5 Turbo and RAG.
Building an effective AI chatbot requires carefully tuning prompts and providing the right context using RAG to produce good responses. This tuning process is more art than science - you need to quickly iterate and test different prompt and context variations to get the desired output.
For example, we found that giving the LLM a persona, e.g. "You are a helpful shop assistant that only provides replies that you are 100% sure of", goes a long way toward improving chatbot performance.
The non-deterministic nature of large language models like GPT-3.5 makes this process tricky. Even for the same input query, you may get different responses each time, demanding constant iteration and testing.
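One common mitigation for this non-determinism (a general technique, not necessarily the only one we used) is to sample the model several times and keep the most frequent answer. A sketch, with `generate` standing in for any LLM call:

```python
from collections import Counter

# Self-consistency sketch: call the (non-deterministic) model n times for
# the same prompt and return the majority answer. `generate` is any
# callable that takes a prompt and returns a string.

def majority_answer(generate, prompt: str, n_samples: int = 5) -> str:
    samples = [generate(prompt) for _ in range(n_samples)]
    answer, _count = Counter(samples).most_common(1)[0]
    return answer
```

This trades latency and cost for consistency, so it is best reserved for queries where a stable answer matters.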
LLMs, though powerful, can sometimes generate erroneous or "hallucinated" results. For critical actions in a production environment, routing these scenarios to a human-in-the-loop (HITL) or a conventional keyword-matching approach becomes essential.
The architecture diagram above highlights our approach to using these methods in our chatbot. Steps 1-4 illustrate our implemented RAG approach to reduce hallucination.
Initially, we had trouble retrieving the right context for a given query. We overcame this by using a separate embedding model that specialises in context retrieval, text-embedding-ada-002.
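Embedding-based retrieval boils down to ranking document vectors by cosine similarity to the query vector. A self-contained sketch with toy vectors (in practice the vectors would come from the embedding model):

```python
import math

# Rank documents by cosine similarity between their embedding and the
# query's embedding. Vectors here are plain lists of floats for clarity.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]
```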
Steps 4-5 involve feeding the retrieved context and our query into our LLM, based on GPT-3.5 Turbo, to generate the answer.
In our case, we often encounter the scenario where the retrieved context is correct but the LLM still manages to hallucinate when synthesizing the response (see steps 4 and 5 in the architecture diagram). We dealt with this by implementing a keyword-matching function to verify output accuracy.
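A keyword-matching check of this kind can be very simple: before returning a synthesized answer, verify it contains key facts drawn from the retrieved context. The function below is an illustrative sketch; the keywords themselves are defined per use case.

```python
# Verify a synthesized response against required keywords before returning
# it to the user. A failed check can trigger a retry, a fallback, or HITL.

def passes_keyword_check(response: str, required_keywords: list[str]) -> bool:
    """True if every required keyword appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(kw.lower() in lowered for kw in required_keywords)
```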
In other cases, where this approach falls short, an alternative model or prompting strategy might be required. The system also needs to discern and route these different scenarios, which makes tailoring chatbots to novel use cases challenging.
You first need to deeply understand the new use cases, evaluate where the base system fails and then determine how to route those failure cases to a human or an alternate model. Comprehensive testing is also needed to catch edge cases where routing might fail.
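The routing described above can be sketched as a small decision function. The intent categories and confidence threshold below are illustrative assumptions, not our production values:

```python
# Route a query based on its intent and the retriever/classifier confidence:
# critical actions always get a human in the loop, low-confidence queries
# fall back to keyword matching, and the LLM handles the rest.

CRITICAL_INTENTS = {"refund", "account_deletion"}  # hypothetical examples

def route(intent: str, confidence: float) -> str:
    if intent in CRITICAL_INTENTS:
        return "human"             # HITL for critical actions
    if confidence < 0.5:
        return "keyword_fallback"  # too uncertain for free-form generation
    return "llm"
```

Comprehensive testing of a router like this means enumerating edge cases (ambiguous intents, borderline confidence scores) and checking each lands on the intended path.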
On top of the chatbot's inherent randomness, many factors make systematic tuning challenging. The system has many moving parts that affect output quality, including the system prompt, the reference context for RAG, few-shot examples, and detailed RAG variables such as chunk size and number of chunks.
Tweaking one element might enhance performance for one scenario but hamper another. Ensuring consistent quality demands rigorous testing across all queries. To make things worse, certain situations mandate human evaluations due to the limitations of automated output assessors using LLMs. Training automated output assessors also requires the initial generation of model answer sets by humans.
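One way to make this tuning systematic is to sweep the RAG knobs and score every configuration against a human-written model-answer set. A sketch, where `evaluate` stands in for whatever scorer (automated assessor or human review) is available:

```python
import itertools

# Grid-sweep two RAG variables (chunk size, number of chunks) and return
# the configuration with the highest evaluation score. `evaluate` is any
# callable mapping (chunk_size, n_chunks) to a score.

def best_config(evaluate, chunk_sizes, n_chunks_options):
    """Return the (chunk_size, n_chunks) pair with the highest score."""
    configs = itertools.product(chunk_sizes, n_chunks_options)
    return max(configs, key=lambda cfg: evaluate(*cfg))
```

Because tweaking one knob can help one query set and hurt another, the scorer should aggregate over the full test-query suite, not a single scenario.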
In summary, achieving optimal performance in production from state-of-the-art LLMs isn't just a technical endeavor—it's an art that demands nuanced prompt tuning, iterative testing, and continuous refinement.
When building for new use cases, it's crucial to plan which actions are critical and how to handle them before delving into the prompt-engineering phase. Vigilant testing during development can preempt potential pitfalls. Even after deployment, it remains essential to monitor and refine the system as real-world interactions reveal unforeseen challenges.
A major challenge we found is the need for human evaluators at different stages of the training process. At SUPA, we help customers working with LLMs to build better AI. Learn more about our use cases here.
We’re also leveraging our expertise to help businesses with their AI chatbot needs! Sign up for early access here.