Large Language Models (LLMs) are revolutionizing how humans use artificial intelligence (AI) for natural language processing (NLP) tasks.
However, traditional LLMs pose a risk: users can rely on a model's output without verifying its accuracy. The danger is greatest when an LLM hallucinates, generating responses with no grounding in reality, which can spread misinformation if left unchecked.
That’s where Retrieval Augmented Generation (RAG) comes into play - a technique that overcomes these issues by grounding LLM responses in factual information.
The article delves deeper into the workings of RAG and how it significantly improves LLM performance. It will also demonstrate implementing RAG and discuss practical tips for beginners.

What is Retrieval Augmented Generation?
RAG is a technique for improving the accuracy and relevance of an LLM's responses by augmenting them with information retrieved from external sources.
Regular LLMs generate responses using parametric knowledge gained through extensive training on a large dataset.
RAG adds an extra step where the LLM uses its pre-trained knowledge and integrates retrieved information from external sources, such as documents, articles, websites, etc., to produce a context-specific response.
The basic architecture consists of an embedding model that converts user queries and external sources into vectors. When a user provides a query, the architecture searches a database and finds the most relevant vectors matching the user query based on a similarity function.
The next step is to provide the relevant vectors to the LLM as context and tell it to produce a suitable response.
Implementing advanced RAG involves several techniques and components for ensuring a smooth user experience. Unlike basic RAG, advanced RAG consists of enhanced contextualization and methods to generate responses based on user intent.
The following section will explain its implementation in detail and provide practical tips to help you learn about RAG from the ground up.
The implementation uses llama-index, an open-source library, to build a RAG pipeline.
You can begin by installing the library through the Anaconda prompt using the following command:
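For example, using the package name as published on PyPI:

```shell
pip install llama-index
```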
Next, open a Python file in Jupyter Notebooks and clone the following git repository:
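Assuming the official run-llama repository on GitHub, the clone command looks like:

```shell
git clone https://github.com/run-llama/llama_index.git
```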
You must install pipx through the Anaconda prompt using the following commands:
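Following pipx's standard installation instructions:

```shell
python -m pip install --user pipx
python -m pipx ensurepath
```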
You must install poetry through the Anaconda prompt to ensure you have all the package dependencies. However, you must first change the prompt's path to the llama-index directory you cloned from GitHub.
You can locate the directory on your local machine and then use the cd command with the directory’s location.
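For example (the path below is a placeholder - replace it with the actual location of the cloned folder on your machine):

```shell
cd C:\Users\your-username\llama_index
```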
Install poetry using the following commands:
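With pipx available, poetry installs as follows:

```shell
pipx install poetry
```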
In the Anaconda prompt, create a virtual environment as follows:
Install relevant package dependencies using the following command in the Anaconda prompt:
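Inside the activated environment, poetry reads the project's dependency file and installs everything it lists:

```shell
poetry install
```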
Go to Jupyter Notebooks and create a new sub-folder called data in the llama-index folder.
Download a sample essay from https://paulgraham.com/worked.html and store it as a .txt file in the data folder.
In the llama-index folder, create a new Python file, name it starter.py, and import the relevant packages.
Once you import the relevant packages, you must get an OpenAI API key, since llama-index uses OpenAI's models as the backend by default.
OpenAI provides free API credits for the first 3 months. For extended use, you can set up paid API billing. You can sign up using this link and generate your key.
The next step is to set the API key as an environment variable using the following commands in Jupyter Notebook:
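A minimal way to do this (the key string below is a placeholder - substitute your own key):

```python
import os

# Placeholder value - replace with the API key generated from your OpenAI account
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"
```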
Next, import the following packages:
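Assuming a pre-0.10 llama-index release (which exposes these names at the package root), the imports might look like:

```python
from llama_index import (
    VectorStoreIndex,       # builds and queries the vector store
    SimpleDirectoryReader,  # loads files from a local folder
    ServiceContext,         # configures chunking and model settings
)
```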
Finally, load all the documents you want the model to refer to for better context. In our case, it is just a single file stored in the data folder. You can load it using the following command:
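A sketch of the loading step, assuming the pre-0.10 llama-index API:

```python
from llama_index import SimpleDirectoryReader

# Reads every file in the ./data folder - here, the single essay .txt file
documents = SimpleDirectoryReader("data").load_data()
```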
You are now set to implement an advanced RAG pipeline in Python. The diagram below illustrates the advanced RAG architecture.
The following sections will explain all the components with their implementation techniques to help you understand how RAG works in theory and practice.
Firstly, we need to split our external sources into small chunks. Chunking reduces hallucination by providing the LLM with targeted snippets from the source documents related to the user prompt.
As such, the technique helps the model better understand the context and speeds up the search process by keeping retrieved text within the LLM's context window.
You can define the chunk size with the ServiceContext library. The following code sets the chunk size to 500.
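A sketch using the pre-0.10 ServiceContext API:

```python
from llama_index import ServiceContext

# Split source documents into chunks of at most 500 tokens
service_context = ServiceContext.from_defaults(chunk_size=500)
```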
You should choose a chunk size based on the nature of your content, the embedding model, and the expected length of user queries.
For instance, technical content based on books and journal articles may require larger chunk sizes to give LLMs more context for optimal responses.
The next step is to convert the text chunks into vector embeddings. Embeddings are high-dimensional vector representations of text in which semantically similar passages map to nearby vectors, letting the model compare meaning numerically.
Vectorization allows the model to perform a similarity search between information in external sources and user queries in semantic space. For instance, the model can compute cosine similarity between the query vector and vectors for the external data.
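As a minimal illustration (using toy 3-dimensional vectors rather than real embeddings, which typically have hundreds of dimensions), cosine similarity between a query vector and document vectors can be computed like this:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and two document chunks
query = [0.1, 0.9, 0.2]
doc_close = [0.2, 0.8, 0.3]  # semantically similar chunk
doc_far = [0.9, 0.1, 0.0]    # unrelated chunk

# The similar chunk scores higher and would be retrieved first
cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far)  # True
```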
Once you create embedding vectors, you must store them in a database that indexes the vectors and performs the similarity search against the user query.
Llama-index uses SimpleVectorStore by default, which stores all embeddings in an in-memory dictionary.
The following code converts your documents into embeddings and stores them in a vector index for retrieval:
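A sketch assuming the pre-0.10 llama-index API, combining the loading and chunking steps from above:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

documents = SimpleDirectoryReader("data").load_data()
service_context = ServiceContext.from_defaults(chunk_size=500)

# Embeds each chunk and stores the vectors in the default in-memory store
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```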
You can create custom indices based on your needs. For instance, hierarchical indexing involves creating two indices - one consisting of summaries for each document and the other consisting of the text chunks behind each summary. The technique is suitable when you have several large documents and directly searching all text chunks is time-consuming.
Another technique involves generating hypothetical questions through the LLM based on vectorized chunks. During the search, the LLM fetches the most relevant question and then retrieves the related chunk for generating the response. The method is useful when you can anticipate the kinds of queries users are likely to ask.
Lastly, you can implement context enrichment - a technique that retrieves small, precise chunks but adds their surrounding text as extra context before asking the LLM to generate a response.
In addition to a similarity search using a vector index, you can include the traditional keyword-based search mechanism to improve LLM’s responses.
Fusion retrieval is one way to combine the two search methods.
The technique involves ranking the text chunks retrieved by each search method and returning the top-k results.
This means you need a method to combine the rankings from the keyword-based and semantic retrievers into an overall rank.
One method is reciprocal rank fusion - a popular algorithm that sums the reciprocal of each chunk's rank across the different retrievers and uses the final scores to re-rank the retrieved results.
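As a minimal, self-contained sketch (independent of llama-index, with hypothetical chunk ids), reciprocal rank fusion can be implemented as follows. The constant k is the standard smoothing term from the algorithm, commonly set to 60:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (best first) into one overall ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly by multiple retrievers score highest.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort chunk ids by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a keyword retriever and a semantic retriever
keyword_ranking = ["chunk_b", "chunk_a", "chunk_c"]
semantic_ranking = ["chunk_a", "chunk_d", "chunk_b"]

# chunk_a ranks first overall: it places near the top in both lists
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```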
After re-ranking, you can apply postprocessing methods to refine the results further.
The following implementation uses the Best Match 25 (BM25) algorithm for keyword-based retrieval. BM25 is an enhanced version of the basic term-frequency inverse-document-frequency (tf-idf) method.
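A sketch assuming a pre-0.10 llama-index release (BM25Retriever additionally requires the rank_bm25 package) and the `index` built earlier:

```python
from llama_index.retrievers import BM25Retriever, QueryFusionRetriever
from llama_index.postprocessor import SimilarityPostprocessor
from llama_index.query_engine import RetrieverQueryEngine

# Keyword-based retriever over the same nodes as the vector index
bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=2)

# Semantic (embedding-based) retriever
vector_retriever = index.as_retriever(similarity_top_k=2)

# Combine both retrievers and re-rank with reciprocal rank fusion
fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=1,  # skip LLM-generated query variations
    mode="reciprocal_rerank",
)

# Postprocessing discards chunks whose similarity score falls below 0.001
query_engine = RetrieverQueryEngine.from_args(
    fusion_retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.001)],
)
```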
The similarity_top_k=2 setting means a retriever will return the top 2 chunks based on similarity, while postprocessing discards results with a similarity score below 0.001.
Fusion retrieval is suitable when you require LLMs to understand user intent more deeply. LLMs can better comprehend what a user is looking for through the query’s keywords and its semantic meaning.
The method is effective when building LLM applications for e-commerce sites where a user can provide specific queries to search for relevant products.
The previous section is a basic RAG implementation with no agentic behavior. As the initial diagram illustrates, you can add agents to your RAG pipeline for more advanced capabilities.
Agents are automated modules that decide the most appropriate action by analyzing a user’s query. We will discuss implementing a simple query agent that transforms and routes user queries.
You can use query transformation to improve an LLM’s reasoning functionality. The method involves modifying a user query to allow for better retrieval quality.
For instance, an LLM can break down a complex query into multiple simpler sub-queries and retrieve relevant chunks based on a similarity search. It will use the results as context to generate a final response.
Query Routing consists of an LLM deciding on the most appropriate action based on a user query. For instance, the LLM can generate a summary if the user asks for summarization or a short response if the user asks a straightforward question.
Llama-index allows you to implement sub-queries using the following code:
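A sketch assuming the pre-0.10 llama-index API and the `index` built earlier; the tool name and description are illustrative:

```python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Query engine over the vector index (retrieval plus response synthesis)
vector_engine = index.as_query_engine(similarity_top_k=2)

vector_tool = QueryEngineTool(
    query_engine=vector_engine,
    metadata=ToolMetadata(
        name="vector_tool",
        description="Answers questions about the loaded documents",
    ),
)

# Breaks a complex query into sub-questions and answers each with vector_tool
sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[vector_tool],
)
```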
Note that the code defines a new variable, vector_engine, and uses it as an argument when creating the vector_tool variable.
The vector_engine variable includes the retriever and postprocessing steps from the previous section. In addition, the sub_query_engine breaks down a complex query into simpler sub-queries to generate a response based on the source vectors.
You can add a query routing step where an agent will decide whether to summarize a response or provide more context to LLM for the final answer.
The following code implements the routing logic:
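A sketch assuming the pre-0.10 llama-index API (SummaryIndex is the library's list-style index) and the `documents` loaded earlier; the tool descriptions are illustrative:

```python
from llama_index import SummaryIndex, VectorStoreIndex
from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

# list_index summarizes; vector_index retrieves context-specific chunks
list_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

list_tool = QueryEngineTool.from_defaults(
    query_engine=list_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarization questions about the documents",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    description="Useful for retrieving specific context from the documents",
)

# An LLM-based selector routes each query to the more suitable tool
router_query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[list_tool, vector_tool],
)
```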
The code defines a list_index and a vector_index. The list_index summarizes the information, while vector_index adds more context.
The router_query_engine defines the routing logic that the agent will use to choose between the list_index or the vector_index.
We can combine the sub_query_engine and router_query_engine through OpenAI’s agent with the following code:
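A sketch assuming the pre-0.10 llama-index agent API, wrapping the two engines defined above as tools (names and descriptions are illustrative, as is the sample question):

```python
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool

sub_query_tool = QueryEngineTool.from_defaults(
    query_engine=sub_query_engine,
    name="sub_query_tool",
    description="Decomposes complex questions into simpler sub-questions",
)
router_tool = QueryEngineTool.from_defaults(
    query_engine=router_query_engine,
    name="router_tool",
    description="Chooses between summarization and context retrieval",
)

# The agent picks which tool to call for each user message
agent = OpenAIAgent.from_tools([sub_query_tool, router_tool], verbose=True)
response = agent.chat("Summarize what the author worked on before college.")
print(response)
```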
The response variable stores the agent’s final output.
While agentic systems provide greater functionality, the additional operational overhead increases response latency.
For instance, the code above creates an agent that breaks down the user's query into multiple sub-queries. Then, for each sub-query, it decides whether to summarize or add more context using source vectors.
The pipeline is suitable in cases where you expect the user to ask complex questions over multiple documents for a simplified response.
The last step in building a RAG pipeline is to evaluate response quality and fine-tune the LLM to improve output if it does not meet the desired criteria.
One useful metric to assess in RAG is response relevance. The approach lets you evaluate whether the responses match the query and context.
Llama-index lets you evaluate relevance using the following code:
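A sketch assuming the pre-0.10 llama-index evaluation API, the `service_context` and `query_engine` from earlier, and an illustrative sample query:

```python
from llama_index.evaluation import RelevancyEvaluator

# Uses an LLM to judge whether a response answers the query in context
evaluator = RelevancyEvaluator(service_context=service_context)

query = "What did the author work on before college?"
response = query_engine.query(query)

eval_result = evaluator.evaluate_response(query=query, response=response)
print(eval_result.passing)  # True if the response is judged relevant
```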
You can refer to the documentation here to learn about fine-tuning.
Covering fine-tuning and evaluation tips for LLMs requires a separate discussion since it involves multiple factors, such as evaluation methodology, human feedback strategies, prompt engineering, etc.
The critical point to remember regarding evaluating LLMs is to use several automated scores with human-level feedback to assess response relevance, truthfulness, and toxicity.
Due to RAG’s ability to add more relevance and accuracy to LLM responses, RAG-based LLMs have multiple applications across industries. The following discusses a few significant real-life use cases.
Building a domain-specific LLM with RAG requires annotated data that improves the model’s response relevance and accuracy against a particular user query.
A significant aspect of the process is to assess how precise the responses are in fetching the relevant information from external sources. A practical way to do this involves humans evaluating response quality based on comparisons with pre-defined templates and expert judgment.
That’s where SUPA comes into play. The platform employs several human evaluators who assess response quality manually by inspecting how an LLM with RAG responds to a range of pre-answered questions.
The approach allows checking response accuracy and whether the LLM fetches information based on user query. Additionally, SUPA’s extensive network of annotators allows you to efficiently label large datasets belonging to different domains.
In short, SUPA provides a human-centric solution to boost your labeling and training operations for building LLMs with RAG. Book a demo now to find out more.
This post was guest-written by Haziqa Sajid.