
How to Build RAG Pipelines for LLM Projects?

Last Updated : 23 Jul, 2025

Large Language Models (LLMs) have transformed the landscape of natural language processing (NLP) by enabling machines to understand and generate human-like text. These models, such as GPT-3 and BERT, have been trained on massive datasets and can perform a wide range of tasks, from answering questions to generating content. However, despite their impressive capabilities, LLMs are limited by the data they were trained on and often struggle to provide real-time, context-specific information.


In this article, we will explore how integrating Retrieval-Augmented Generation (RAG) pipelines can enhance the capabilities of LLMs by incorporating external knowledge sources. We will discuss the core concepts behind LLMs, RAG, and how they work together in a RAG pipeline. Additionally, we will provide a practical guide on how to build and implement your own RAG pipeline for LLM-based projects, ensuring your model is equipped to handle both general and domain-specific queries.

What is an LLM?

Large Language Models (LLMs) are machine learning models trained on large volumes of text data to perform natural language understanding and generation tasks. These models are built on architectures like transformers, which utilize attention mechanisms to focus on different parts of the input text for context-aware processing. LLMs can perform a wide range of tasks, such as language translation, summarization, question answering, and text generation, all of which rely on their vast training datasets.

Despite their capabilities, LLMs face challenges when dealing with dynamic or niche information that wasn't included during training. They often generate responses based on patterns learned from historical data, making it difficult to answer real-time or highly specific queries. This is where the integration of external knowledge sources, such as a RAG pipeline, can significantly enhance their functionality.

What is RAG?

Retrieval-Augmented Generation (RAG) is a method designed to enhance the capabilities of traditional large language models (LLMs) by integrating them with external information retrieval systems. In a RAG setup, a retrieval system, such as a search engine or a vector database, fetches relevant information from a vast corpus of data. This external knowledge is then used to guide the generation process of the LLM, resulting in more accurate, contextually relevant answers. The key advantage of RAG is that it allows the model to access up-to-date, domain-specific, or niche knowledge that it might not have encountered during training, blending retrieval with generation to produce more informative and precise responses.

A RAG pipeline consists of three key components: retrieval, augmentation, and generation, each playing an essential role in producing accurate, context-aware outputs. A minimal code sketch of this flow follows the list below.

  • Retrieval: The retrieval step is where the system searches an external knowledge base to gather relevant information. This can include documents, articles, or web resources. Using techniques like keyword matching or embedding-based methods, the system quickly identifies and retrieves information that is similar to the user’s query. This external data helps the model overcome its limitations by allowing it to access real-time or specialized knowledge that it wasn't trained on.
  • Augmentation: Once the relevant data is retrieved, the augmentation step kicks in. Here, the retrieved information is used as additional context for the LLM, helping it generate a more accurate and relevant response. This step ensures that the model’s answer is not only based on its inherent knowledge but also enriched by the external data, making the response more comprehensive and contextually appropriate.
  • Generation: The final stage is generation, where the language model processes the augmented data and creates a coherent, context-aware response. By synthesizing the new external context with its own pre-trained knowledge, the model produces a more precise and informative answer. This step allows the RAG system to generate answers that are both grammatically correct and highly relevant to the user's query.
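To make the interplay of these three steps concrete, here is a minimal, self-contained sketch of the retrieve-augment-generate flow. The tiny keyword-based retriever and the placeholder generate() function are purely illustrative stand-ins for a real vector search and a real LLM call, not part of the pipeline built later in this article.

Python
from typing import List

knowledge_base = [
    "RAG combines a retriever with a language model.",
    "Transformers use attention to build context-aware representations.",
    "Chroma is a vector database for storing embeddings.",
]

def retrieve(query: str, k: int = 2) -> List[str]:
    # Retrieval: rank documents by naive keyword overlap with the query
    query_terms = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    # Generation: a real system would call an LLM here
    return f"[LLM answer conditioned on]\n{prompt}"

query = "What does RAG combine?"
context = "\n".join(retrieve(query))                  # Retrieval
prompt = f"Context:\n{context}\n\nQuestion: {query}"  # Augmentation
print(generate(prompt))                               # Generation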

RAG Pipeline Architecture

1. Data Collection

The first stage of a RAG pipeline involves gathering unstructured data from various sources, such as documents, online articles, databases, and emails. This data is typically raw and unorganized, so it needs to be collected and prepared for subsequent steps. Tools like LangChain and custom data loaders are commonly employed in this stage to handle different data formats, such as PDFs, CSV files, and web pages. This process centralizes the data, making it accessible for further processing and retrieval tasks.
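As a rough sketch of this stage (separate from the pipeline built later), LangChain's document loaders can pull PDFs, CSV files, and web pages into a single list of documents. The file paths and URL below are placeholders, and PyPDFLoader additionally requires the pypdf package.

Python
from langchain.document_loaders import PyPDFLoader, CSVLoader, WebBaseLoader

def collect_documents():
    """Gather documents from mixed sources into one list."""
    documents = []
    documents += PyPDFLoader("reports/annual_report.pdf").load()    # placeholder PDF path
    documents += CSVLoader("data/faq.csv").load()                    # placeholder CSV path
    documents += WebBaseLoader(["https://example.com/docs"]).load()  # placeholder URL
    print(f"Collected {len(documents)} documents")
    return documents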

2. Data Preprocessing

Once the data is collected, it often requires pre-processing to extract the relevant textual content. Raw data sources like PDFs or web pages may contain a mix of text, images, tables, and other elements, so it’s important to clean and extract just the useful information. Tools like AWS Textract or open-source libraries can assist in extracting readable text from complex documents. This stage ensures that the pipeline only works with structured, clean text, which is essential for efficient retrieval and response generation in the following stages.
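The exact cleaning steps depend on the source format, but as a simple sketch, the helper below strips HTML tags with BeautifulSoup and normalizes whitespace, which is often enough for scraped web pages (the beautifulsoup4 package is assumed to be installed).

Python
import re
from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

def clean_html(raw_html: str) -> str:
    # Remove scripts and styles, keep only the visible text
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_html("<html><body><h1>RAG</h1><p>Clean   this   text.</p></body></html>"))
# Output: RAG Clean this text.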

3. Data Transformation

After the data is cleaned, it needs to be transformed into a format suitable for embedding and subsequent retrieval. This step often involves splitting documents into smaller chunks, known as chunking. Chunking is crucial because many models, especially embedding-based ones, have token limits that require breaking large text blocks into smaller, manageable pieces. This step is also important for maintaining semantic coherence across smaller chunks. In cases where documents are complex or lengthy, the challenge is to ensure that chunking doesn't lose context, as coherent segments are essential for quality retrieval and response generation.
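For a quick feel of how chunking behaves, the sketch below splits a long string with LangChain's RecursiveCharacterTextSplitter. The chunk size and overlap values are illustrative (they mirror the ones used in the pipeline later) and usually need tuning to the embedding model's token limit.

Python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative values; tune chunk_size and chunk_overlap to your embedding model's limits
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)

long_text = ("Retrieval-Augmented Generation pairs a retriever with a language model "
             "so that answers can draw on external documents. ") * 20
chunks = splitter.split_text(long_text)

print(f"Produced {len(chunks)} chunks")
print(chunks[0][:80], "...")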

4. Embedding and Representation

In this stage, the chunks are transformed into high-dimensional vectors, or embeddings. These vectors represent the meaning of the text in a format that makes it easy for the system to search for similar content in a vector database. Embedding models like OpenAI's text-embedding-ada-002, or alternatives from providers such as Mistral AI, generate these embeddings. The vectors allow the system to perform efficient similarity searches and retrieve the most relevant pieces of data based on a user's query. Generating accurate embeddings is critical for the retrieval system's performance, as it directly impacts the quality of the data returned for response generation.
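As a hedged sketch of this step, the snippet below embeds a query and a passage locally with the sentence-transformers library (using the same BAAI/bge-base-en-v1.5 model that the pipeline later accesses through the HuggingFace API) and compares them with cosine similarity; the example sentences are illustrative.

Python
from sentence_transformers import SentenceTransformer, util  # requires sentence-transformers

# Loads the same BAAI/bge-base-en-v1.5 model the pipeline later calls via the HuggingFace API
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

query = "How do recurrent neural networks work?"
passage = "An RNN processes a sequence one step at a time, carrying a hidden state forward."

query_vec, passage_vec = model.encode([query, passage])
print("Cosine similarity:", util.cos_sim(query_vec, passage_vec).item())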

5. Storage and Persistence

Once the data has been embedded, it is stored in a specialized vector database designed for high-dimensional data. Vector databases are optimized to quickly handle large volumes of embeddings and efficiently perform similarity searches. This stage ensures that the embeddings are stored in a structured and indexed format, making them easily accessible for future queries. Additionally, the persistence layer needs to store metadata (such as document IDs or source links) alongside the embeddings to keep track of the context of each retrieved chunk. Maintaining this structured storage is crucial for fast, real-time response generation.
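A minimal sketch of persisting embeddings with metadata in Chroma is shown below; the API token, chunk IDs, metadata fields, and persist_directory path are placeholders (older Chroma versions may also need an explicit vectorstore.persist() call).

Python
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import Chroma

texts = ["RNNs carry a hidden state across time steps.",
         "LSTMs add gates that control what the hidden state remembers."]
metadatas = [{"source": "rnn_article", "chunk_id": 0},
             {"source": "rnn_article", "chunk_id": 1}]

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key="hf_...",                      # placeholder HuggingFace token
    model_name="BAAI/bge-base-en-v1.5"
)
vectorstore = Chroma.from_texts(
    texts=texts,
    embedding=embeddings,
    metadatas=metadatas,                            # keeps source info alongside each embedding
    ids=["rnn_article_chunk_0", "rnn_article_chunk_1"],  # stable IDs make later updates easy
    persist_directory="./chroma_store"              # placeholder path for on-disk persistence
)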

6. Updating and Refreshing

Over time, new data becomes available, and existing documents may change. This stage of the pipeline addresses the need for regular updates to the stored embeddings. As fresh data is ingested and processed, the embeddings must be updated in the vector database to maintain the relevance of responses. Without regular refreshing, the pipeline may generate outdated responses, diminishing its accuracy and effectiveness. The refreshing process ensures that the system remains synchronized with the latest information, improving the reliability of the model's output as it continuously adapts to new content.
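One way to refresh the store, continuing the storage sketch above and assuming each chunk was indexed with a stable ID, is to delete the stale entries and re-add the re-embedded chunks; the IDs and replacement text here are hypothetical.

Python
from langchain.schema import Document

# Continues the storage sketch above: `vectorstore` is the existing Chroma store,
# and the chunk below was originally stored under the ID "rnn_article_chunk_0".
stale_ids = ["rnn_article_chunk_0"]
updated_chunks = [Document(
    page_content="RNNs carry a hidden state; GRUs and LSTMs are common gated variants.",
    metadata={"source": "rnn_article", "chunk_id": 0},
)]

vectorstore.delete(ids=stale_ids)                         # drop the outdated embeddings
vectorstore.add_documents(updated_chunks, ids=stale_ids)  # re-embed and store the new text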

Building a RAG Pipeline for LLM

1. Importing the Libraries

The required libraries for loading documents, chunking text, embeddings, vector storage, and model generation are imported. This includes LangChain for workflow management, HuggingFace for embeddings and model generation, and Chroma for vector database functionality.

Python
import os
from typing import List
from getpass import getpass

# Note: in newer LangChain releases these modules have moved to the langchain_community package
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.schema import Document

2. Loading the Data

The WebContentLoader class loads content from provided URLs using LangChain’s WebBaseLoader and converts it into documents. It includes error handling to ensure content is successfully loaded.

Python
class WebContentLoader:
    def __init__(self, urls: List[str]):
        self.urls = urls
        
    def load_content(self) -> List[Document]:
        loader = WebBaseLoader(self.urls)
        try:
            documents = loader.load()
            print(f"Successfully loaded content from {len(self.urls)} URLs")
            return documents
        except Exception as e:
            print(f"Error loading content: {str(e)}")
            return []

3. Chunking the Text

The DocumentChunker class splits documents into smaller chunks using LangChain’s RecursiveCharacterTextSplitter. This allows for better processing by creating manageable text pieces with overlap for context preservation.

Python
class DocumentChunker:
    def __init__(self, chunk_size: int = 256, chunk_overlap: int = 50):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
    
    def create_chunks(self, documents: List[Document]) -> List[Document]:
        chunks = self.splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks from {len(documents)} documents")
        return chunks

4. Embeddings

The HuggingFaceEmbeddings class uses HuggingFace’s API to convert text into vector embeddings, capturing semantic meaning for effective similarity-based search.

Python
class HuggingFaceEmbeddings:
    def __init__(self, model_name: str = "BAAI/bge-base-en-v1.5"):
        # Get HuggingFace token if not already set
        if 'HUGGINGFACEHUB_API_TOKEN' not in os.environ:
            hf_token = getpass("Enter your HuggingFace API token: ")
            os.environ['HUGGINGFACEHUB_API_TOKEN'] = hf_token
            
        self.embeddings = HuggingFaceInferenceAPIEmbeddings(
            api_key=os.environ['HUGGINGFACEHUB_API_TOKEN'],
            model_name=model_name
        )
    
    def get_embeddings(self):
        return self.embeddings

5. Vector Database

The VectorStore class stores embeddings in Chroma, enabling efficient querying by creating a searchable vector store from the documents.

Python
class VectorStore:
    def __init__(self, embeddings):
        self.embeddings = embeddings
        self.vectorstore = None
    
    def create_store(self, documents: List[Document]) -> Chroma:
        self.vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=self.embeddings
        )
        return self.vectorstore

6. Retrieval

The Retriever class uses Chroma to retrieve relevant documents based on a query, applying Maximum Marginal Relevance (MMR) to optimize for relevance and diversity.

Python
class Retriever:
    def __init__(self, vectorstore: Chroma, k: int = 3):
        self.retriever = vectorstore.as_retriever(
            search_type="mmr",  # Maximum Marginal Relevance
            search_kwargs={"k": k}
        )
    
    def get_relevant_documents(self, query: str) -> List[Document]:
        return self.retriever.get_relevant_documents(query)

7. Prompt Templates

The PromptManager class creates prompts in the Zephyr format, providing context to the model and guiding it to generate accurate responses.

Python
class PromptManager:
    @staticmethod
    def create_zephyr_prompt(query: str, context: str = "") -> str:
        return f"""
<|system|>
You are an AI Assistant that follows instructions extremely well.
Please be truthful and give direct answers. Say 'I don't know' if the user's question is not answered by the context.
</s>
<|user|>
Context: {context}

Question: {query}
</s>
<|assistant|>
"""

8. Model Generation

The ResponseGenerator class uses HuggingFaceHub to load a language model, which processes the query and retrieved context to generate responses.

Python
class ResponseGenerator:
    def __init__(self, model_id: str = "HuggingFaceH4/zephyr-7b-alpha"):
        self.model = HuggingFaceHub(
            repo_id=model_id,
            model_kwargs={
                "temperature": 0.5,    # lower values give more focused answers
                "max_new_tokens": 512  # upper bound on the length of the generated answer
            }
        )
        
    def create_qa_chain(self, retriever) -> RetrievalQA:
        return RetrievalQA.from_chain_type(
            llm=self.model,
            retriever=retriever,
            chain_type="stuff"
        )

9. RAG Pipeline

The RAGPipeline class integrates all the components (loading, chunking, embeddings, retrieval, prompt creation, and model generation) into a unified pipeline that processes queries and generates responses based on relevant documents.

Python
class RAGPipeline:
    def __init__(self, urls: List[str]):
        self.loader = WebContentLoader(urls)
        self.chunker = DocumentChunker()
        self.embeddings = HuggingFaceEmbeddings()
        self.vectorstore = None
        self.retriever = None
        self.generator = None

    def build(self):
        documents = self.loader.load_content()
        chunks = self.chunker.create_chunks(documents)
        vector_store = VectorStore(self.embeddings.get_embeddings())
        self.vectorstore = vector_store.create_store(chunks)
        retriever_component = Retriever(self.vectorstore)
        self.retriever = retriever_component.retriever
        self.generator = ResponseGenerator()
        self.qa_chain = self.generator.create_qa_chain(self.retriever)
        
    def query(self, question: str) -> str:
        # Wrap the question in the Zephyr chat format; the RetrievalQA chain then
        # retrieves relevant chunks and stuffs them in as context before generation
        prompt = PromptManager.create_zephyr_prompt(question)
        response = self.qa_chain(prompt)
        return response['result']

Example Usage

The main function demonstrates using the RAGPipeline, processing the data and generating a response for the query, "What is a recurrent neural network?"

Python
def main():
    # Using two of our amazing articles as examples
    urls = [
        "https://www.geeksforgeeks.org/nlp/stock-price-prediction-project-using-tensorflow/",
        "https://www.geeksforgeeks.org/deep-learning/training-of-recurrent-neural-networks-rnn-in-tensorflow/"
    ]
    
    pipeline = RAGPipeline(urls)
    pipeline.build()
    
    query = "What is a recurrent neural network?"
    response = pipeline.query(query)
    
    print(f"\nQuery: {query}")
    print(f"Response: {response}")

if __name__ == "__main__":
    main()

Conclusion

This article covered the key steps in building a RAG pipeline with LangChain, from loading and chunking text to creating embeddings and storing them in a vector database like Chroma. By pairing the retriever with a language model such as Zephyr-7B, we demonstrated how to fetch relevant documents and generate grounded, context-aware responses. These techniques form the foundation for chatbots and other AI-driven applications that need to answer user queries with up-to-date, domain-specific knowledge.

