From the course: LLMOps in Practice: A Deep Dive

Retrieval augmented generation (RAG)


Welcome back to our LLMOps course. In chapters one and two, we built a chatbot using Node.js and the OpenAI APIs whose job it was to help you be a better public speaker. We explored logging and the basics of how to capture data for reinforcement learning from human feedback, or RLHF. And while you're not going to retrain the underlying models the way the folks at OpenAI or Google might, by capturing that data, you are in a position to make informed choices about how to better serve your customers. In a simple chatbot like that one, that might mean changing the underlying model or the model version you're using and testing it against further feedback, or it could mean rewriting your system prompt and testing different ones against the feedback your users give.

In this video, we're going to switch gears to help you build a better application that's an expert on a specific topic. To do that, we'll dive into an exciting technique called retrieval augmented generation, or RAG for short. We'll also look at how you can implement RAG using Node.js, LangChain, and Chroma. We'll cover the fundamentals of RAG, its benefits, and the high-level steps to build a RAG system. We're going to keep things conceptual for now, but don't worry, in future videos we'll get our hands dirty with code and practical implementation details.

But before we start, let's explain what RAG actually is. It's a powerful technique that combines the strengths of large language models with external knowledge retrieval. So let's break down how it works. First, we create a knowledge base of information that we want our system to access. This could be anything from documentation and articles to entire books or databases. When the user asks a question, the system processes it to vectorize the contents of the user's query.
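To make "vectorize" a bit more concrete before moving on: an embedding is just an array of numbers, and closeness between two embeddings is usually measured with cosine similarity. Here's a minimal Node.js sketch — the function name is mine, not from any library, and the three-dimensional vectors are toys standing in for real embeddings with hundreds or thousands of dimensions:

```javascript
// Cosine similarity between two vectors of equal length.
// A score near 1 means the vectors point the same way (similar meaning);
// a score near 0 means they're unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings":
console.log(cosineSimilarity([1, 2, 3], [1, 2, 3])); // ~1 (identical direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0 (orthogonal, unrelated)
```

In a real RAG system you never compute these scores by hand — the vector store does it for you — but this is the arithmetic happening under the hood.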
The system will then search the knowledge base for relevant information related to that query, and it finds this using something called vector similarity. If the vectors are close to each other, then the underlying information is going to be semantically similar. The retrieved information is then used to augment the prompt that's sent to the language model, providing additional context for the model to work with. The language model generates a response based on both its training data and the retrieved context, and the system returns that generated response to the user.

For the purposes of this course, in this video, we're going to take a look at the text of a novel that I wrote over ten years ago. Shortly after publication, the publisher folded, and as a result the novel wasn't widely distributed, which means it wasn't used in the training set for GPT, Gemini, Claude, or any other large language model. So if I were to ask these models about characters or circumstances in the book, they'd have no idea. In fact, they might hallucinate a lot. And this makes it ideal for RAG. We'll create a local database of the book, and then we'll use that to augment queries to the LLM. The power of the LLM can then provide reasoning ability across that text to help us understand and parse it. I hope this helps you understand the power of RAG a little better.

Some of the benefits of RAG are improved accuracy. By providing relevant context, RAG helps the model generate more accurate and informed responses. Up-to-date information. The knowledge base can be updated regularly, allowing the system to access current information beyond the model's training data. Domain-specific knowledge. RAG can incorporate specialized knowledge that might not be well represented in the model's general training. Reduced hallucinations.
By grounding responses in retrieved information, RAG can help minimize the model's tendency to generate false or misleading information. Transparency. RAG systems can often provide sources for the information used in their responses, increasing trust and traceability.

The next question, of course, is how does one get started building something like that? Well, we're going to go through it step by step from a high level, and in later videos we'll implement it. The first step is data ingestion: taking your data and getting it ready to store. In this course, we're going to use the example of taking a full book in PDF format. We'll need to extract the text from the PDF and break it down into smaller, manageable chunks. Each chunk will be processed to create a meaningful representation that can then be searched efficiently.

To make our text searchable, we'll convert each chunk into a numerical representation called an embedding. Embeddings are high-dimensional vectors that capture the semantic meaning of the text. We'll use pre-trained models to generate these embeddings, which will allow us to perform semantic similarity searches later on. We'll need to store these embeddings in a way that's efficient and easy to search. For this, I'm going to use Chroma as the vector store to efficiently index and search our embeddings. Chroma is designed for fast similarity search in high-dimensional spaces, which makes it ideal for our RAG system. We'll show you how to set up Chroma and add our processed book chunks to it. Chroma will allow us to quickly find the most relevant text chunks for any given query.

Next comes query processing. When the user asks a question, we'll need to process it in the same way as our knowledge base text. This involves generating an embedding for the user's query, which we can then use to search our vector store for similar embeddings.
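To give a feel for what these ingestion and storage steps amount to, here's a toy Node.js sketch: a chunker that splits text into overlapping pieces, plus a tiny in-memory store that indexes embeddings and returns the closest matches. All the names here are mine, and the store is only a stand-in for what Chroma and a pre-trained embedding model actually do — it's meant to illustrate the mechanics, not to be used in place of them:

```javascript
// Split text into fixed-size chunks with overlap, so a sentence that
// falls on a boundary still appears whole in at least one chunk.
function chunkText(text, size = 500, overlap = 100) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

// A toy stand-in for a vector store like Chroma: it keeps
// (embedding, text) pairs and returns the k nearest by cosine similarity.
class ToyVectorStore {
  constructor() { this.entries = []; }
  add(embedding, text) { this.entries.push({ embedding, text }); }
  query(embedding, k = 3) {
    return this.entries
      .map(e => ({ text: e.text, score: cosine(embedding, e.embedding) }))
      .sort((a, b) => b.score - a.score)  // highest similarity first
      .slice(0, k);
  }
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

A real pipeline replaces the toy store with a Chroma collection and generates the embeddings by calling an embedding model, but the shape of the operations — add chunks, then query by a vector — is the same.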
So, for example, if I were to ask about a particular character and get details on them, their name would be in the query, and the vector similarity search would extract passages from the book that also contain their name. We'll then retrieve the embeddings that are similar to our query, performing a similarity search in Chroma to find the most relevant chunks from our knowledge base. We'll explore different retrieval methods such as k-nearest neighbors, or KNN, and we'll discuss how to fine-tune them for better results. We'll also look at techniques like reranking to improve the quality of retrieved information.

Now it's time to prompt the LLM. Once we have our retrieved context, we'll need to construct an augmented prompt. This involves combining the user's original question with the most relevant retrieved information, and we'll discuss strategies for effectively integrating this context without overwhelming the language model. We'll then use LangChain to integrate with language models like GPT-3.5 or GPT-4. LangChain provides a convenient abstraction layer that simplifies working with different language models, and we'll show how to send the augmented prompt to the model and process the response.

In the upcoming videos, we'll dive deeper into all of these steps and provide hands-on demonstrations and code examples. You'll learn how to build a robust RAG system that can enhance your chatbot's capabilities and provide more accurate, context-aware responses. We'll start by ingesting and processing our example book into Chroma, showing you how to handle PDF extraction, text chunking, and embedding generation. From there, we'll build out each component of the RAG system, integrating it with our existing chatbot application.
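The prompt-augmentation step just described can be sketched as a small Node.js helper. The function name and template wording below are mine, not the course's — this is one reasonable way to stitch retrieved chunks into the prompt, with a cap on how many chunks are included so the context doesn't overwhelm the model:

```javascript
// Build an augmented prompt: retrieved chunks become context the model
// is instructed to ground its answer in. Numbering and delimiting the
// chunks keeps the context clearly separated from the question.
function buildAugmentedPrompt(question, retrievedChunks, maxChunks = 3) {
  const context = retrievedChunks
    .slice(0, maxChunks)                     // cap context size
    .map((chunk, i) => `[${i + 1}] ${chunk}`)
    .join("\n\n");
  return [
    "Answer the question using only the context below.",
    'If the context does not contain the answer, say "I don\'t know."',
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Example: the two strings stand in for chunks a vector search returned.
const prompt = buildAugmentedPrompt(
  "Who is the protagonist?",
  ["Chapter 1 introduces the narrator...", "Later chapters reveal..."]
);
console.log(prompt);
```

This string would then be sent to the model through LangChain; the instruction to rely only on the supplied context is one of the simple strategies for keeping answers grounded in the retrieved text.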
By the end of this chapter, you'll have a powerful knowledge-augmented chatbot that combines the flexibility of large language models with the precision of information retrieval. So get ready to take your skills to the next level with retrieval augmented generation. Stay tuned for our next video where we're going to begin our hands-on journey into the world of RAG. I'll see you there.
