January 23, 2024

Retrieval Augmented Generation: Techniques and Use Cases (2024)

Large Language Models (LLMs) are revolutionizing how humans use artificial intelligence (AI) for natural language processing (NLP) tasks.

However, traditional LLMs pose a risk as users can rely on the model's information without verifying its accuracy. Of course, the real danger occurs when a regular LLM hallucinates, generating responses that have no grounding in reality. This can lead to the spread of misinformation if left unchecked.

That’s where Retrieval Augmented Generation (RAG) comes into play - a technique that helps overcome these issues by grounding LLM responses in factual information.

The article delves deeper into the workings of RAG and how it significantly improves LLM performance. It will also demonstrate implementing RAG and discuss practical tips for beginners.

What is Retrieval Augmented Generation?

RAG is a technique for improving the accuracy and relevance of an LLM by augmenting its responses with information retrieved from external sources.

Regular LLMs generate responses using parametric knowledge gained through extensive training on a large dataset.

RAG adds an extra step where the LLM uses its pre-trained knowledge and integrates retrieved information from external sources, such as documents, articles, websites, etc., to produce a context-specific response.

Basic RAG Architecture. Source: TowardsAI

The basic architecture consists of an embedding model that converts user queries and external sources into vectors. When a user provides a query, the architecture searches a database and finds the most relevant vectors matching the user query based on a similarity function.

The next step is to provide the text behind the most relevant vectors to the LLM as context and instruct it to produce a suitable response.
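As a rough illustration of this flow, the sketch below retrieves the closest document for a query and places it in a prompt. The word-overlap `similarity` function, the sample documents, and the prompt wording are all assumptions made for the example; a real pipeline would compare embedding vectors from a trained model and query a vector database.

```python
def similarity(query, document):
    """Toy similarity: number of shared words (a stand-in for comparing embedding vectors)."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    ranked = sorted(documents, key=lambda doc: similarity(query, doc), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."


# Hypothetical knowledge source
documents = [
    "the warranty covers battery defects for two years",
    "shipping takes five business days",
]
query = "how long is the battery warranty"
prompt = build_prompt(query, retrieve(query, documents))
```

The resulting `prompt` string is what you would send to the LLM in place of the bare user query.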

How Does Retrieval Augmented Generation Work?

Implementing advanced RAG involves several techniques and components for ensuring a smooth user experience. Unlike basic RAG, advanced RAG consists of enhanced contextualization and methods to generate responses based on user intent.

The following section will explain its implementation in detail and provide practical tips to help you learn about RAG from the ground up.

The implementation uses llama-index, an open-source library, to build a RAG pipeline.


Step 1

You can begin by installing the library through the Anaconda prompt using the following command:

pip install llama-index

Next, open a new notebook in Jupyter and clone the following git repository:

!git clone https://github.com/jerryjliu/llama_index.git 

Step 2

You must install pipx through the Anaconda prompt using the following commands:

pip install --user pipx
pipx ensurepath

Step 3

You must install poetry through the Anaconda prompt to ensure you have all the package dependencies. However, you must first change the prompt's path to the llama-index directory you cloned from GitHub.

You can locate the directory on your local machine and then use the cd command with the directory’s location.

Install poetry using the following commands:

pipx install poetry

Step 4

In the Anaconda prompt, create a virtual environment as follows:

poetry shell

Step 5

Install relevant package dependencies using the following command in the Anaconda prompt:

poetry install

Step 6

Go to Jupyter Notebooks and create a new sub-folder called data in the llama-index folder.

Step 7

Download a sample essay from https://paulgraham.com/worked.html and store it as a .txt file in the data folder.

Step 8

In the llama-index folder, create a new Python file, name it starter.py, and import the relevant packages.

from llama_index import VectorStoreIndex, SimpleDirectoryReader

Step 9

Once you import the relevant packages, you must get the OpenAI API key, since llama_index uses the API as the backend model by default.

OpenAI provides free API credits for the first 3 months. For extended use, you can buy the API subscription. You can sign up using this link and generate your key.

Step 10

The next step is to set the API key as an environment variable using the following commands in Jupyter Notebook:

import os
os.environ['OPENAI_API_KEY'] = "your API key"

Step 11

Next, import the following packages:

from llama_index import ServiceContext, LLMPredictor, OpenAIEmbedding, PromptHelper
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import set_global_service_context

Step 12

Finally, load all the documents you want the model to refer to for better context. In our case, it is just a single file stored in the data folder. You can load it using the following command:

documents = SimpleDirectoryReader("data").load_data()

You are now set to implement an advanced RAG pipeline in Python. The diagram below illustrates the advanced RAG architecture.

Advanced RAG Architecture. Source: TowardsAI

The following sections will explain all the components with their implementation techniques to help you understand how RAG works in theory and practice.

Chunk Size

First, we need to split our external sources into small chunks. Chunking reduces the risk of hallucination by giving the LLM targeted snippets from the source documents that relate to the user prompt.

As such, the technique helps the model better understand the context, speeds up the search process, and keeps the retrieved text within the LLM's context window.


You can define the chunk size with the ServiceContext class. The following code sets the chunk size to 500.

from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(chunk_size=500)


You should choose a chunk size based on the nature of your content, the embedding model, and the expected length of user queries.

For instance, technical content based on books and journal articles may require larger chunk sizes to give LLMs more context for optimal responses.
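To make the effect of chunk size more concrete, here is a minimal sketch of a chunker. It is not how llama-index splits text internally: the `chunk_text` helper, the use of whitespace-separated words as tokens, and the `overlap` parameter are assumptions made for illustration.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size words, with consecutive chunks sharing `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()  # whitespace words, a simplification of a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


# Small numbers make the behavior easy to inspect
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_text(sample, chunk_size=5, overlap=2)
```

A larger `chunk_size` gives each chunk more surrounding context but makes retrieval less targeted, which is the trade-off described above.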

Embedding Vectors and Indexing

The next step is to convert the text chunks into vector embeddings. Embeddings are high-dimensional vector representations of your documents, where the elements of each vector together capture properties of the text's meaning.

Vectorization allows the model to perform a similarity search between information in external sources and user queries in semantic space. For instance, the model can compute cosine similarity between the query vector and vectors for the external data.

Embedding Vectors. Source: OpenAI

Once you create embedding vectors, you must store them in a database that indexes the vectors and performs the similarity search against the user query.
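The sketch below shows one simple way such a search could work: embeddings sit in a plain dictionary and the query is compared against each one using cosine similarity. The three-dimensional vectors and chunk ids are made up for the example; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(store, query_vector, top_k=2):
    """Return the top_k (chunk_id, score) pairs, highest similarity first."""
    scored = [(chunk_id, cosine_similarity(query_vector, vec)) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings keyed by chunk id
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = search(store, query_vector=[1.0, 0.1, 0.0])
```

A brute-force scan like this is fine for small collections; dedicated vector databases use approximate indexes to keep the search fast at scale.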


Llama-index uses SimpleVectorStore by default, which stores all embeddings in an in-memory dictionary.

The following code converts your documents into embeddings and stores them in a vector index for retrieval:

index = VectorStoreIndex.from_documents(documents)


You can create custom indices based on your needs. For instance, hierarchical indexation involves creating two indices - one consisting of summaries for each document and the other consisting of text chunks relating to each summary. The technique is suitable when you have several large documents and directly searching text chunks is time-consuming.
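A minimal sketch of the two-level idea follows. It first picks the best document by its summary and then ranks only that document's chunks. The word-overlap score, the `hierarchical_search` helper, and the sample data are assumptions for illustration; a real setup would use embedding similarity at both levels.

```python
def overlap(query, text):
    """Toy relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_search(query, summaries, chunks_by_doc, top_k=1):
    """Pick the best document by its summary, then rank only that document's chunks."""
    best_doc = max(summaries, key=lambda doc_id: overlap(query, summaries[doc_id]))
    ranked = sorted(chunks_by_doc[best_doc], key=lambda chunk: overlap(query, chunk), reverse=True)
    return best_doc, ranked[:top_k]


# Hypothetical summary index and chunk index
summaries = {
    "manual": "product manual covering setup and battery care",
    "policy": "returns policy covering refunds and shipping",
}
chunks_by_doc = {
    "manual": ["charge the battery fully before first use", "press the power button to start setup"],
    "policy": ["refunds are issued within ten days", "shipping fees are not refundable"],
}
doc_id, top_chunks = hierarchical_search("how do I charge the battery", summaries, chunks_by_doc)
```

Because the second search only touches one document's chunks, the cost of the chunk-level search no longer grows with the total number of documents.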

Another technique involves generating hypothetical questions through the LLM based on vectorized chunks. During the search, the LLM fetches the most relevant question and then retrieves the related chunk for generating the response. The method is useful when you are familiar with user queries.

Lastly, you can implement context enrichment - a technique that retrieves smaller chunks for better search quality and then adds their surrounding text as context before asking the LLM to generate a response.

Fusion Retrieval, Ranking, and Postprocessing

In addition to a similarity search using a vector index, you can include a traditional keyword-based search mechanism to improve the LLM’s responses.

Fusion retrieval is one way to combine the two search methods.

The technique ranks the text chunks retrieved by each search method and returns the top-k results.

This means you need a way to combine the rankings from the keyword-based and semantic retrievers into an overall rank.

One method is reciprocal rank fusion (RRF) - a popular algorithm that, for each chunk, adds up the reciprocals of its ranks across the different retrievers. It uses the final scores to re-rank the retrieved results.
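The reciprocal-rank scoring described above can be sketched in a few lines. This is an illustrative version rather than llama-index's internal code: the `reciprocal_rank_fusion` helper and the chunk ids are hypothetical, and the smoothing constant `k=60` is a commonly used default that the text above does not specify.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=None):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return fused[:top_k] if top_k is not None else fused


vector_ranking = ["c1", "c2", "c3"]   # hypothetical semantic-search order
keyword_ranking = ["c2", "c4", "c1"]  # hypothetical keyword-search order
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking], top_k=2)
```

Chunks that appear near the top of both lists accumulate the highest score, so `c2` edges out `c1` here even though `c1` was first in the semantic ranking.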

After re-ranking, you can apply postprocessing methods to refine the results further.


The following implementation uses the Best Match 25 (BM25) algorithm for keyword-based retrieval. BM25 is an enhanced version of the basic term-frequency inverse-document-frequency (tf-idf) method.

from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import BM25Retriever
from llama_index.retrievers import QueryFusionRetriever
from llama_index.postprocessor import SimilarityPostprocessor
import nest_asyncio

# Allow nested event loops so async retrieval can run inside Jupyter
nest_asyncio.apply()
vector_retriever = index.as_retriever(similarity_top_k=2)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=2)
# Fuse the semantic and keyword results with reciprocal re-ranking
retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    mode="reciprocal_rerank",
)
# Discard fused results that score below the cutoff
node_postprocessors = [SimilarityPostprocessor(similarity_cutoff=0.001)]

The similarity_top_k=2 means a retriever will return the top 2 chunks based on similarity. Postprocessing here discards results with a similarity score below 0.001.
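To show what the top-k selection and score cutoff do to a list of results, here is a small standalone sketch. The `postprocess` helper and the scores are hypothetical and only mirror the behavior described above, not llama-index's implementation.

```python
def postprocess(scored_chunks, similarity_cutoff=0.001, top_k=2):
    """Drop chunks scoring below the cutoff, then keep the top_k highest-scoring ones."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= similarity_cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical (chunk id, fused score) pairs
scored = [("a", 0.03), ("b", 0.0005), ("c", 0.016), ("d", 0.02)]
final = postprocess(scored)
```

Raising the cutoff trades recall for precision: fewer chunks reach the LLM, but the ones that do are more likely to be relevant.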


Fusion retrieval is suitable when you require LLMs to understand user intent more deeply. LLMs can better comprehend what a user is looking for through the query’s keywords and its semantic meaning.

The method is effective when building LLM applications for e-commerce sites where a user can provide specific queries to search for relevant products.
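To make the keyword side of fusion retrieval more concrete, the sketch below scores documents with a common formulation of BM25. It is a simplified illustration, not the BM25Retriever implementation: the `bm25_scores` helper, whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the product descriptions are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 formulation."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "red running shoes for men",
    "blue running jacket",
    "red leather wallet",
]
scores = bm25_scores("red shoes", documents)
```

The first product matches both query terms and scores highest, the wallet matches only "red", and the jacket matches nothing, which is the kind of exact-term signal that complements semantic search on a product catalog.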

Agents for Query Transformation and Routing using Query Engines

The previous section is a basic RAG implementation with no agentic behavior. As the initial diagram illustrates, you can add agents to your RAG pipeline for more advanced capabilities.

Agents are automated modules that decide the most appropriate action by analyzing a user’s query. We will discuss implementing a simple query agent that transforms and routes user queries.

Query Transformation

You can use query transformation to improve an LLM’s reasoning functionality. The method involves modifying a user query to improve retrieval quality.

For instance, an LLM can break down a complex query into multiple simpler sub-queries and retrieve relevant chunks based on a similarity search. It will use the results as context to generate a final response.
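The sub-query idea can be sketched without an LLM by swapping in a simple rule. In the example below, `decompose` splits a compound question on the word "and" as a stand-in for the LLM call, and `keyword_retrieve` is a hypothetical retriever over two made-up chunks; both are assumptions for illustration only.

```python
def decompose(query):
    """Stand-in for an LLM call: split a compound question on ' and '."""
    parts = [part.strip(" ?") for part in query.split(" and ")]
    return [part + "?" for part in parts if part]


def gather_context(query, retrieve):
    """Retrieve chunks for each sub-query and merge them, dropping duplicates."""
    context = []
    for sub_query in decompose(query):
        for chunk in retrieve(sub_query):
            if chunk not in context:
                context.append(chunk)
    return context


def keyword_retrieve(sub_query):
    """Hypothetical retriever: return chunks whose trigger word appears in the sub-query."""
    corpus = {
        "refund": "Refunds are accepted within 30 days.",
        "shipping": "Shipping takes five business days.",
    }
    return [text for word, text in corpus.items() if word in sub_query.lower()]


question = "What is the refund window and what is the shipping time?"
context = gather_context(question, keyword_retrieve)
```

The merged `context` list would then be passed to the LLM together with the original question to produce the final response.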

Query Routing

Query Routing consists of an LLM deciding on the most appropriate action based on a user query. For instance, the LLM can generate a summary if the user asks for summarization or a short response if the user asks a straightforward question.
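As a rough sketch of the routing decision, the example below chooses between two engines using keyword cues. In a real pipeline the LLM makes this choice; the `route` function, the cue words, and the two placeholder engines are assumptions that only illustrate the dispatch pattern.

```python
def route(query):
    """Stand-in for an LLM selector: choose an engine name from simple keyword cues."""
    summary_cues = ("summarize", "summary", "overview")
    if any(cue in query.lower() for cue in summary_cues):
        return "summary"
    return "vector"


def answer(query, engines):
    """Dispatch the query to whichever engine the router selects."""
    return engines[route(query)](query)


# Placeholder engines that just tag the query they receive
engines = {
    "summary": lambda q: f"[summary engine] {q}",
    "vector": lambda q: f"[vector engine] {q}",
}
reply = answer("Summarize the essay", engines)
```

Keeping the selector separate from the engines means you can add a new engine by registering it in the dictionary and teaching the selector when to pick it.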


Llama-index allows you to implement sub-queries using the following code:

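from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Query engine built on the retriever and postprocessors from the previous section
vector_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=node_postprocessors,
)
vector_tool = [
    QueryEngineTool(
        query_engine=vector_engine,
        metadata=ToolMetadata(
            name="pg_essay",  # assumed placeholder name
            description="Paul Graham essay on What I Worked On",
        ),
    ),
]
sub_query_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=vector_tool)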

Note that the code defines a new variable vector_engine, and uses that as an argument in the vector_tool variable.

The vector_engine variable includes the retriever and postprocessing steps from the previous section. In addition, the sub_query_engine breaks down a complex query into simpler sub-queries to generate a response based on the source vectors.

You can add a query routing step where an agent will decide whether to summarize a response or provide more context to the LLM for the final answer.

The following code implements the routing logic:

from llama_index.query_engine.router_query_engine import RouterQueryEngine
from llama_index.selectors.pydantic_selectors import PydanticSingleSelector
from llama_index import SummaryIndex
from llama_index import StorageContext

nodes = service_context.node_parser.get_nodes_from_documents(documents)
storage_context = StorageContext.from_defaults()

summary_index = SummaryIndex(nodes, storage_context=storage_context)

# Assumption: default query-engine settings
list_query_engine = summary_index.as_query_engine()
list_tool = QueryEngineTool.from_defaults(
    query_engine=list_query_engine,
    description="Useful for summarization questions relevant to the data source",
)

vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

vector_query_engine = vector_index.as_query_engine()

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for fetching relevant context from the data.",
)

# Assumption: the imported PydanticSingleSelector picks between the two tools
router_query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[list_tool, vector_tool],
)

The code defines a summary_index and a vector_index. The summary_index summarizes the information, while the vector_index adds more context.

The router_query_engine defines the routing logic that the agent will use to choose between the summary_index and the vector_index.

We can combine the sub_query_engine and router_query_engine through OpenAI’s agent with the following code:

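from llama_index.agent import OpenAIAgent

query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=sub_query_engine,
        name="sub_query_engine",  # assumed placeholder name
        description="Provides information based on sub-queries.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=router_query_engine,
        name="router_query_engine",  # assumed placeholder name
        description="Provides summary information.",
    ),
]

agent = OpenAIAgent.from_tools(query_engine_tools, verbose=True)
response = agent.query("What was Paul Graham's life?")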

The response variable stores the agent’s final output.


While agentic systems provide greater functionality, the additional operational overhead increases response latency.

For instance, the code above creates an agent that breaks down the user query into multiple sub-queries. Then, for each sub-query, it decides whether to summarize or add more context using source vectors.

The pipeline is suitable in cases where you expect the user to ask complex questions over multiple documents for a simplified response.

Retrieval Augmented Generation vs. Fine-tuning - Evaluation

The last step in building a RAG pipeline is to evaluate response quality and fine-tune the LLM to improve output if it does not meet the desired criteria.

One useful metric to assess in RAG is response relevance. The approach lets you evaluate whether the responses match the query and context.


Llama-index lets you evaluate relevance using the following code:

from llama_index.evaluation import RelevancyEvaluator
evaluator = RelevancyEvaluator(service_context=service_context)

query_str = "What was Paul Graham's life?"
eval_result = evaluator.evaluate_response(query=query_str, response=response)

You can refer to the documentation here to learn about fine-tuning.


Covering fine-tuning and evaluation tips for LLMs requires a separate discussion since it involves multiple factors, such as evaluation methodology, human feedback strategies, prompt engineering, etc.

The critical point to remember regarding evaluating LLMs is to use several automated scores with human-level feedback to assess response relevance, truthfulness, and toxicity.

Practical Use Cases

Due to RAG’s ability to add more relevance and accuracy to LLM responses, RAG-based LLMs have multiple applications across industries. The following discusses a few significant real-life use cases.

  • Document Question Answering Systems: The previous section shows that RAG-based LLMs perform exceptionally well in Q&A-type use cases by referring to factual sources before generating a response. The feature helps in retrieval augmented generation for knowledge-intensive NLP tasks, such as dialogue generation, fact-checking, research, etc.
  • Conversational Agents: You can develop RAG-based agents over product manuals or terms of service. The method will help users understand the technicalities through normal conversation instead of reading entire documents.
  • Content Generation: While traditional LLMs can generate content for multiple use cases, RAGs enhance content quality by using up-to-date resources and preventing the LLM from hallucinating. RAG enhances an LLM’s capability to generate content for specific applications.
  • Real-time Event Commentary: You can build an application where you feed a RAG-based LLM real-time information on specific events. Using text-to-speech models and automated prompt generation, you can make the application work as a virtual commentator to cover live events such as sports.
  • E-Learning: RAGs can help develop e-learning platforms where students can interact with RAG-based agents to learn about concepts in different disciplines. You can enhance knowledge sources by adding relevant books and articles.
  • Medical Diagnosis: Healthcare professionals can use RAGs to summarize complex patient reports and ask RAG agents to perform diagnoses based on the information in these reports. The system can streamline medical workflows and help healthcare providers achieve better patient outcomes. 
  • Virtual Assistants for E-commerce sites: Visitors to an e-commerce site often find it challenging to search for relevant products. Integrating the site with a RAG-based virtual chatbot allows visitors to interact with the bot and ask questions regarding their desired items. The bot can also provide more context-specific recommendations than a regular LLM-based framework.
  • Financial Services: Political-economic events significantly affect financial outcomes in the real world. You can feed these events into a RAG system to help you make financial decisions based on the most accurate and latest market information. Of course, you must first fine-tune an LLM to understand financial data and perform correct analysis.
  • Interacting with Company Data Using RAG: You can train an LLM that uses RAG to understand your company data stored in enterprise resource planning (ERP) systems. Employees can use the system to ask domain-specific questions and quickly get relevant stats with recommendations for faster decision-making.
  • Employee Performance Reviews: Companies can build LLM applications that use RAG to manage performance reviews more efficiently. The RAG module in the app can use employee and manager feedback as the basis for evaluating performance. Personnel management teams can query the application to generate strategic recommendations for improving employee morale and manager-employee relationships.

Closing Thoughts: Implementing RAG in LLM Applications?

Building a domain-specific LLM with RAG requires annotated data that improves the model’s response relevance and accuracy against a particular user query. 

A significant aspect of the process is to assess how precise the responses are in fetching the relevant information from external sources. A practical way to do this involves humans evaluating response quality based on comparisons with pre-defined templates and expert judgment.

That’s where SUPA comes into play. The platform employs several human evaluators who assess response quality manually by inspecting how an LLM with RAG responds to a range of pre-answered questions.

The approach lets you check response accuracy and whether the LLM fetches information relevant to the user query. Additionally, SUPA’s extensive network of annotators allows you to efficiently label large datasets belonging to different domains.


In short, SUPA provides a human-centric solution to boost your labeling and training operations for building LLMs with RAG. Book a demo now to find out more.
This post was guestwritten by
Haziqa Sajid