From the course: Build with AI: Create Custom Chatbots with n8n
Retrieval-augmented generation (RAG) in five minutes
- [Instructor] So, in this short presentation, I'll take you on a five-minute journey through the essentials of retrieval-augmented generation, or RAG for short. We'll take a look at how it works, what it solves, and how to use it in real-world applications. Let's begin by framing the core idea. LLMs essentially work with the context we provide in a prompt, alongside the usual elements like the model's role, its goal, the specific task, and any formatting instructions. We can also inject additional information directly into the prompt as structured knowledge. This combination forms the context that gets passed into the LLM, which then generates a more relevant and grounded response. Now, the problem with that approach is: how do we deal with huge amounts of information, or information that is changing constantly? Feeding everything into a prompt doesn't scale, and keeping it all up-to-date manually isn't feasible. So, we need a smarter way to keep LLMs informed without overwhelming them or us. The key idea here is smart context augmentation. Instead of dumping entire documents into the prompt, we provide the model with just the pieces of information that are required to solve the current task, like answering the current question. For example, we can take some document sources, like PDFs, Word docs, or any other type, and break them up into smaller chunks of information, like individual pages or sections. Then, when someone asks a question, we only pull in the bits that actually matter. In this example, maybe page 3 from one file and page 2 from another. Those chunks go into the knowledge part of the prompt, and that's what gets passed to the LLM. The benefits of that approach are that it's way more efficient than just pasting entire documents into your prompt. You are only including what's actually needed, which allows you to work with much larger data sources.
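To make the chunking idea concrete, here's a minimal sketch in Python (an illustration, not part of the course's n8n workflows) that splits text into fixed-size chunks with a small overlap so context isn't lost at the boundaries:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of roughly chunk_size characters.

    A small overlap between neighboring chunks helps keep context
    that would otherwise be cut off at a chunk boundary.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than a full chunk
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Stand-in for a real document page; in practice you'd load a PDF or Word doc.
page = "RAG keeps the model grounded in your own documents. " * 30
print(len(chunk_text(page)), "chunks")
```

Real systems often chunk by paragraphs, sections, or pages instead of raw character counts, but the principle is the same: small retrievable units rather than whole documents.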
It also tends to give better quality responses because the model's not distracted or overwhelmed by irrelevant information. And best of all, it plays nicely with existing search tools, so you don't have to reinvent the wheel. On the downside, this process can get complicated quickly, especially as you scale up and need to manage how chunks are created, retrieved, and inserted into the prompt, which is why we have RAG. RAG stands for retrieval-augmented generation, and it's a powerful design paradigm that helps us manage this complexity. It gives us a structured way to store information, retrieve just what we need, and use that to generate accurate, up-to-date responses from a language model. Now, let's look at the RAG process from a high level. Step 1: We store our documents in a searchable database. These can be PDFs, Word files, really anything that contains useful information. Step 2: When someone asks a question, the system retrieves only the most relevant parts, or chunks, from that database. And finally, Step 3: The LLM uses that retrieved information, combined with its own reasoning abilities, to generate a well-grounded response. So, instead of guessing or hallucinating, the model stays connected to real, trusted knowledge sources in real time. Now, let's zoom in on Step 1: storage. Before anything can be retrieved, we first need to break down our documents into smaller, more manageable chunks, like individual paragraphs, sections, or pages. Then, these chunks are prepped for fast and efficient search, usually by storing them in a database. They're typically saved in two ways: as plain text and as numerical vector embeddings, which we'll talk more about in a moment. This is what sets the foundation for everything that follows in the RAG pipeline. Now, what are these embeddings? Embeddings really are just numeric representations of text.
They turn each chunk of content into a series of numbers that capture its meaning, not just the exact words. These embeddings are usually generated by a dedicated embedding model, which works similarly to an LLM. For example, a sentence about dogs and one about wolves will end up with vectors that are close together because they're semantically related. In contrast, they'll be much farther away from sentences about apples and bananas, which, again, would cluster near each other. The big advantage of this numeric format is that we can now calculate distances between chunks, and that lets us pull out, say, the top five most relevant pieces for any given query. And that's really the heart of semantic search, which we'll get into next. But for now, just know that to store all these high-dimensional vectors and search through them quickly, we typically use a special kind of database called a vector store or vector database, and that's what enables us to search not just by keywords but by meaning. Now, once everything is stored and embedded, the next step is search. There are two main ways we can perform the search operation. First, there's the traditional approach, keyword search. This works great if you're looking for exact words or phrases, like names, codes, or IDs. It works well when you know exactly what you're looking for, but of course, it's limited to the exact vocabulary used in the documents. On the right, we've got semantic search, and this is where things get really powerful. Instead of just matching words, it matches meaning. So, if you search for employee benefits, it might surface results that talk about staff perks, even if those exact words never appear. It understands what you're trying to say, not just what you typed. Now, instead of choosing between keyword search and semantic search, we can actually combine both, and that's what we call hybrid search. It gives us the best of both worlds: precise keyword matching and meaning-based retrieval.
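As a rough illustration of the distance idea, here's a small Python sketch using cosine similarity on hand-made two-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the numbers below are purely illustrative, not output from any actual embedding model:

```python
import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 2-D vectors standing in for real embeddings:
# "dog" and "wolf" point in a similar direction, fruit points elsewhere.
dog, wolf = [0.9, 0.1], [0.8, 0.2]
apple, banana = [0.1, 0.9], [0.2, 0.8]

print(cosine_similarity(dog, wolf))   # semantically close -> near 1.0
print(cosine_similarity(dog, apple))  # unrelated -> much lower
```

A vector store performs essentially this comparison, just at scale and with indexing tricks so it doesn't have to compare against every stored vector one by one.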
This approach is especially useful in real-world scenarios where queries can be messy, ambiguous, or a mix of structured and natural language. By combining both methods, we make our system more flexible, more accurate, and more reliable. So, let's look a little deeper into how semantic search actually works. First, we convert the user's question, just like the documents, into an embedding. Next, we compare that vector to all the document embeddings in our vector store and find the ones that are closest in meaning. Finally, we return the top k chunks, that is, the most semantically relevant pieces of information to include in the prompt. This process makes it possible to match intent even when the user's wording doesn't exactly match the source material. Once the basic RAG setup is working, you can start layering on more advanced techniques. First up, chat history context. Instead of treating each question in isolation, you include recent conversation history to give the model a better sense of continuity. Next, there's retrieval optimization. This means tuning how you rank, filter, or re-rank results so you don't just pull in the top-scoring chunks, but the most useful ones. Then, there's iterative retrieval. In some cases, the system might retrieve once, generate a follow-up query, and then retrieve again, refining the answer step by step. And finally, business logic. You can plug in custom rules, like prioritizing recent documents, avoiding certain sources entirely, or only showing information that the user has access to, all to align the behavior with real-world constraints. These upgrades help make RAG more dynamic, reliable, and production-ready. Of course, RAG isn't perfect. There are some real limitations to be aware of. First, there's numbers and calculations. RAG isn't great at doing math or comparing values, especially when numbers are pulled from different chunks. Second, it struggles with comprehensive analysis.
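The three semantic-search steps described above can be sketched end to end in Python. This is a toy illustration, not the course's n8n setup: the embed function here just counts words from a tiny made-up vocabulary, whereas a real embedding model would also place synonyms like "benefits" and "perks" close together, which word counting cannot do:

```python
import math

# Tiny made-up vocabulary for the stand-in embedding function.
VOCAB = ("benefits", "vacation", "shipping", "refund")

def embed(text):
    """Stand-in embedding: per-term counts over a tiny fixed vocab."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "Employee benefits include health insurance and vacation days.",
    "Shipping takes 3 to 5 business days.",
    "Our refund policy allows returns within 30 days.",
]
# Storage: each chunk kept as plain text plus its embedding.
index = [(chunk, embed(chunk)) for chunk in chunks]

def search(query, k=2):
    query_vec = embed(query)                 # step 1: embed the question
    ranked = sorted(index,                   # step 2: compare to every chunk
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]  # step 3: return the top k

print(search("What benefits and vacation do employees get?"))
```

The retrieved chunks would then be inserted into the knowledge section of the prompt before it's sent to the LLM; a production system swaps the toy embed and list scan for an embedding model and a vector database.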
If your query is something like, "find all X that meet condition Y," RAG might not retrieve everything you need or might miss edge cases entirely. It's also very data dependent, meaning the quality of the output heavily relies on the quality and structure of your input documents. And lastly, it can be really maintenance-heavy. You'll need processes to manage data freshness, chunking strategies, embedding updates, and more. So, while RAG is quite powerful, it's not a silver bullet. It shines best when used thoughtfully and with the right expectations. Now, let's look at the other side of the coin and see what makes RAG stand out and so valuable. First, it's always up-to-date. You can update the knowledge base anytime, with no need to retrain any AI model. It also scales effortlessly to massive knowledge bases. Whether you've got hundreds or millions of documents, retrieval keeps things efficient. Another big plus: traceable answers. You know exactly where a piece of information came from, making it easier to verify and debug. And finally, it's surprisingly quick to set up. With the right tools, you can get a basic RAG system running in just a few hours, not weeks. To wrap it up, let's take a look at where RAG really shines in practice. First, customer service. RAG helps support agents, or even chatbots, pull in precise, up-to-date answers from internal docs, FAQs, and policies. It saves time and improves accuracy. Second, legal research. Lawyers and analysts can use RAG to sift through huge volumes of legal texts and case files, surfacing only the most relevant excerpts without needing to scan everything manually. And third, knowledge management. Whether it's for internal wikis, product documentation, or compliance records, RAG helps teams surface what they need, when they need it, from massive content repositories. So, in short, whenever there's a lot of text and a need for fast, accurate answers, RAG fits right in.