RAG Architecture

Last Updated : 08 Sep, 2025

Retrieval-Augmented Generation (RAG) is an architecture that enhances the capabilities of Large Language Models (LLMs) by integrating them with external knowledge sources. This integration gives LLMs access to up-to-date, domain-specific information, which improves the accuracy and relevance of the generated responses. RAG is effective in addressing challenges such as hallucinations and outdated knowledge.

The Retrieval-Augmented Generation (RAG) architecture is a two-part process involving a retriever component and a generator component.

1. Retrieval Component

The retrieval component identifies relevant data to assist in generating accurate responses. Dense Passage Retrieval (DPR) is a commonly used model for this step. Let's see how DPR works (a minimal encoding sketch follows the list):

  • Query Encoding: When we submit a query such as a question or prompt, an encoder converts it into a dense vector. This vector represents the query's semantic meaning in a high-dimensional space.
  • Passage Encoding: Each document in the knowledge base is also encoded into a vector. The system performs this encoding offline and stores the resulting vectors so that retrieval is fast at query time.
  • Retrieval: Upon receiving the query, the system compares the query vector with the vectors of all the documents in the knowledge base and retrieves the most relevant passages.
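
To make this concrete, here is a minimal sketch of DPR-style encoding and retrieval using the DPR checkpoints published on the Hugging Face Hub. The query and passage texts are illustrative placeholders, and a real system would store the passage vectors in a vector index rather than a plain tensor.

Python
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)
import torch

# DPR uses separate encoders for queries and passages
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Illustrative knowledge base
passages = ["RAG pairs a retriever with a generator.",
            "DPR encodes queries and passages into dense vectors."]

with torch.no_grad():
    # Passage encoding: done once, offline, and stored for fast lookup
    p_vecs = c_enc(**c_tok(passages, padding=True, return_tensors="pt")).pooler_output
    # Query encoding: done at request time
    q_vec = q_enc(**q_tok("What does DPR do?", return_tensors="pt")).pooler_output

# Retrieval: rank passages by dot-product similarity with the query vector
scores = (q_vec @ p_vecs.T).squeeze(0)
print(scores.argsort(descending=True))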

2. Generative Component

Once the retriever has identified the relevant passages, they are passed to the generative component, which is based on a Transformer architecture such as BART or GPT. The generated response combines the retrieved information with newly generated output from the model.

The generative component uses one of two main fusion strategies: Fusion-in-Decoder (FiD) and Fusion-in-Encoder (FiE). Both combine the retrieved information with the user's input to generate the final response; they differ in where that fusion happens (a FiD-style sketch follows the list below).

  • FiD (Fusion-in-Decoder): The retrieval and generation processes are kept separate. The generative model only merges the retrieved information during the decoding phase. This allows the model to focus on the most relevant parts of each document when generating the final response, offering greater flexibility in the integration of retrieved data.
  • FiE (Fusion-in-Encoder): FiE combines the query and the retrieved passages at the beginning of the process. Both are processed simultaneously by the encoder. While this method can be more efficient, it offers less flexibility in integrating the retrieved information compared to FiD.
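
To illustrate the FiD idea, the sketch below encodes each (question, passage) pair separately and then lets the decoder attend over the concatenated encoder states. It uses a plain facebook/bart-base checkpoint purely to show the mechanics; a real FiD system uses a model fine-tuned for this setup, so the generated text here will not be meaningful.

Python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

question = "What is Retrieval-Augmented Generation?"
passages = ["RAG combines a retriever and a generator.",
            "It grounds answers in retrieved documents."]

# FiD keeps retrieved passages separate during encoding:
# each (question, passage) pair is encoded independently.
encoder_states = []
with torch.no_grad():
    for passage in passages:
        ids = tokenizer(f"question: {question} context: {passage}",
                        return_tensors="pt").input_ids
        encoder_states.append(model.get_encoder()(input_ids=ids).last_hidden_state)

# Fusion happens in the decoder: cross-attention sees the
# concatenation of all encoder states at once.
fused = BaseModelOutput(last_hidden_state=torch.cat(encoder_states, dim=1))
output_ids = model.generate(encoder_outputs=fused, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))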

Let's see the key differences between FiD and FiE:

Aspect | Fusion-in-Decoder (FiD) | Fusion-in-Encoder (FiE)
--- | --- | ---
Fusion Point | Fusion occurs during the decoding phase. | Fusion happens during the encoding phase, before decoding.
Process Separation | Retrieval and generation are kept separate. | Retrieval and generation are processed together.
Efficiency | Slower, due to separate retrieval and generation steps. | Faster, due to simultaneous processing in the encoder phase.
Complexity | More complex | Simpler
Performance | Higher-quality responses | Quicker response generation

Workflow of a Retrieval-Augmented Generation (RAG) system

The RAG architecture’s workflow can be broken down into the following steps:

[Figure: Retrieval-Augmented Generation architecture]
  1. Query Processing: The input query, which could be a natural language question or prompt, is first pre-processed (e.g. cleaned and tokenized).
  2. Embedding Model: The pre-processed query is passed through an embedding model that transforms it into a high-dimensional vector capturing its semantic meaning.
  3. Vector Database Retrieval: The query vector is used to search a vector database; the system finds the stored documents whose vectors are most similar to the query.
  4. Retrieved Contexts: The system retrieves the documents that are closest to the query. These documents are then forwarded to the generative model to help it craft a response.
  5. LLM Response Generation: The LLM combines the original query with the retrieved context, using its trained knowledge alongside the fresh data to generate a contextually accurate and coherent answer.
  6. Response: A response that blends the model's inherent knowledge with the up-to-date information retrieved during the process is then presented. This makes the response more accurate and detailed.

Implementing the Working of RAG

Let's walk through a simplified example of how a RAG pipeline works.

Step 1: Install Dependencies

We will install the required libraries: FAISS for vector search, Transformers and PyTorch for the language model, and LangChain for the prompt template and conversation memory.

Python
!pip install faiss-cpu transformers torch langchain

Step 2: Initialize Vector Index and Add Contextual Embeddings

We will initialize the FAISS index and add simulated document embeddings:

  • faiss.IndexFlatL2(768) creates a flat FAISS index for 768-dimensional vectors using L2 distance (Euclidean distance) as the similarity metric.
  • np.random.seed(42) ensures reproducibility by fixing the random seed.
  • np.random.random((100, 768)) generates 100 random embeddings of size 768, simulating document embeddings.
  • index.add(context_data) adds the generated vectors into the FAISS index.
  • index.ntotal gives the total number of vectors currently stored in the index.
Python
import faiss
import numpy as np

index = faiss.IndexFlatL2(768)

np.random.seed(42)
context_data = np.random.random((100, 768)).astype('float32')

index.add(context_data)

print(f"Indexed {index.ntotal} context vectors.")

Output:

Indexed 100 context vectors.

Step 3: Define Semantic Search Function

We will define a helper function for vector similarity search:

  • semantic_search() is a helper function to perform vector similarity search.
  • query_embedding is the embedding of the user’s query (shape: [1, 768]).
  • index.search(query_embedding, top_k) retrieves the top_k most similar vectors from the FAISS index.
  • It returns indices, which point to the most relevant documents for context.
Python
def semantic_search(query_embedding, index, top_k=5):
    distances, indices = index.search(query_embedding, top_k)
    return indices

Step 4: Example Query Embedding and Retrieval

We simulate a query vector, perform semantic search and get indices.

  • np.random.random((1, 768)) simulates a single query embedding.
  • semantic_search() retrieves the top matching documents.
  • retrieved_indices contains the list of document indices that best match the query.
Python
query_embedding = np.random.random((1, 768)).astype('float32')
retrieved_indices = semantic_search(query_embedding, index)
print(f"Retrieved document indices: {retrieved_indices}")

Output:

Retrieved document indices: [[26 38 11 78 12]]
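
The query embedding here is simulated with random numbers. In a real pipeline, documents and queries would be embedded by the same model before indexing and searching. Below is a minimal sketch, assuming the extra sentence-transformers dependency and the all-MiniLM-L6-v2 checkpoint (which outputs 384-dimensional vectors, so the index dimension changes accordingly).

Python
# Assumes an extra dependency: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import faiss

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

documents = [
    "Retrieval-Augmented Generation combines a retriever and a generator.",
    "It reduces hallucinations by grounding answers in retrieved documents."
]

# Build the index with the model's true dimensionality (384, not 768)
doc_vectors = embedder.encode(documents).astype("float32")
real_index = faiss.IndexFlatL2(doc_vectors.shape[1])
real_index.add(doc_vectors)

# Embed the query with the same model and search
query_vector = embedder.encode(["What is RAG?"]).astype("float32")
distances, indices = real_index.search(query_vector, 2)
print(indices)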

Step 5: Initialize Tokenizer and LLM Model

We load the GPT-2 tokenizer and model:

  • GPT2Tokenizer.from_pretrained('gpt2') loads the GPT-2 tokenizer, which converts text into tokens.
  • GPT2LMHeadModel.from_pretrained('gpt2') loads the GPT-2 model for text generation.
  • torch.device('cuda' if torch.cuda.is_available() else 'cpu') checks if GPU is available; if not, defaults to CPU.
  • model.to(device) moves the model to the chosen device for computation.
Python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

Output:

[Output: the GPT-2 tokenizer and model files are downloaded and loaded onto the selected device]

Step 6: Create Prompt with Retrieval Context

Combine the user's question, the retrieved context passages and the conversation history into a single prompt.

  • chat_history: keeps track of conversation for continuity.
  • context: text from retrieved documents.
  • question: user input.
Python
from langchain.prompts import PromptTemplate

context_texts = [
    "Retrieval-Augmented Generation combines a retriever and generator.",
    "It reduces hallucinations by grounding answers in retrieved documents.",
    "Uses Dense Passage Retrieval for semantic search.",
    "Employs Fusion-in-Decoder and Fusion-in-Encoder techniques.",
    "Provides up-to-date and domain-specific responses."
]

prompt_template = PromptTemplate(
    input_variables=["chat_history", "question", "context"],
    template="{chat_history}\nQuestion: {question}\nContext: {context}\nAnswer:"
)

Step 7: Initialize Memory and Build Chat Function

We will add conversation memory to our system and build the chat function:

  • memory_key: identifies where conversation history is stored.
  • return_messages=False: returns text as plain string instead of structured messages.
  • Memory enables the agent to remember previous questions/answers.
  • Retrieve documents: semantic_search finds relevant context.
  • Load conversation: includes past Q&A from memory.
  • Format prompt: combines context, history and user query.
  • Generate response: GPT-2 predicts answer.
  • Post-process: clean up newlines and spaces.
  • Update memory: conversation history is saved for future queries.
Python
import re
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=False)


def chat(question):
    # Simulated query embedding; a real system would embed the question text
    query_embedding = np.random.rand(1, 768).astype("float32")
    retrieved_indices = semantic_search(query_embedding, index)
    # Placeholder context strings; a real system would look up the stored document text by index
    context_texts = [f"Document {i}" for i in retrieved_indices[0]]

    chat_history = memory.load_memory_variables({}).get("chat_history", "")

    prompt = prompt_template.format(
        chat_history=chat_history,
        question=question,
        context="\n".join(context_texts)
    )

    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response[len(prompt):].strip()
    response = re.sub(r"[\r\n]+", " ", response)

    memory.chat_memory.add_user_message(question)
    memory.chat_memory.add_ai_message(response)

    return response

Step 8: Generate Response

We will test the system and see how memory carries context across turns:

Python
print(chat("What is Retrieval-Augmented Generation (RAG)?"))
print(chat("Explain the role of memory in this system."))

Output:

[Output: generated responses for both queries]

Advantages of RAG Architecture

  • Up-to-Date Responses: RAG enables LLMs to generate answers based on the most current external data rather than being limited to pre-trained knowledge that may be outdated.
  • Reduced Hallucinations: By grounding the LLM's response in reliable external knowledge, RAG reduces the risk of hallucinations or incorrect generated content, making responses more factually accurate.
  • Domain-Specific Responses: RAG allows LLMs to provide answers that are more relevant to specific organizational needs or industries without retraining.
  • Efficiency: RAG is cost-effective compared to traditional fine-tuning, as it allows models to be updated with new data without full retraining.
