From the course: Defending and Deploying AI by Pearson

Introducing retrieval augmented generation (RAG)

Let's go over retrieval augmented generation (RAG). This is a technique that is pretty much everywhere right now. This is where you enhance the capability of a large language model, whether it's GPT, Claude, or an open source model like Llama or Mistral, by combining it with external knowledge retrieval. And this is extremely popular because there are different ways that you can implement LLM-based applications. One is that you can train a model from scratch. That is extremely expensive, requires a lot of expertise, requires a lot of infrastructure. The second one is that you can fine-tune a model. That is becoming more accessible, so it's a little bit cheaper than, of course, training from scratch, but it still requires a lot of expertise and infrastructure. So retrieval augmented generation is an AI framework that integrates information retrieval from different places, like a vector database or an API and so on, with text generation (or image generation, in some cases), to improve the accuracy, the relevancy, and the freshness of the information you get from pre-trained models. Now, retrieving relevant information from an external knowledge base is a crucial aspect of RAG. What you're seeing on your screen is a diagram. Let me make it a little bit bigger. It's a diagram from my becomingahacker.org blog, and in it I explain, from a high level, how retrieval augmented generation works. So what you're seeing here is that you have some data, in this case security best practices, security advisories, Common Vulnerabilities and Exposures (CVE) records, Common Weakness Enumeration (CWE) entries, and it can be anything, right? Since I'm a security guy, of course I'm using security data here. Then you need an embedding model, or an embedding generator. Basically, you take this text, whether it's unstructured data from a PDF or a JSON file and so on, and you convert this data into numbers.
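To make that text-to-numbers step concrete, here is a toy sketch in Python. This is not a real embedding model; production systems use learned models (for example, sentence-transformer or OpenAI embedding models). The vocabulary and document below are made up purely for illustration:

```python
# Toy "embedding generator": turns text into a fixed-length vector of numbers.
# Real RAG systems use learned embedding models; this bag-of-words sketch only
# illustrates the idea of converting text into a vector.
import re

# Hypothetical security-flavored vocabulary (illustrative only)
VOCAB = ["vulnerability", "cve", "patch", "advisory", "encryption", "firewall"]

def embed(text: str) -> list[float]:
    """Count how often each vocabulary term appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]

doc = "Apply the patch described in the security advisory for this CVE."
print(embed(doc))  # one number per vocabulary term
```

A learned embedding model does the same job, mapping a chunk of text to a vector, but the vector dimensions capture meaning rather than raw word counts, which is what makes semantic search possible.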
And then you put those vectors in a vector database, things like Pinecone, ChromaDB, MongoDB Atlas Vector Search, or FAISS. There are many, many of them out there, and I'm going to introduce a few of them later in the course. What this allows you to do is that after a human or a machine asks a question or gives a prompt, you can do semantic search on this data, provide additional context to the LLM, rank those results, and then provide an answer that is a lot better, and you're not relying only on the pre-trained data from the LLM. Orchestration libraries like LangChain allow you to integrate these components very, very easily, and we covered LangChain earlier in the course. So again, some of the benefits of RAG include improved accuracy by grounding the responses. Grounding is a technique in machine learning to reduce the likelihood of hallucinations and improve the factual correctness of the responses. The other thing is that you're, of course, using your own data, not just the pre-trained data that was used to train that model. And then there's up-to-date information: RAG can incorporate current data, overcoming the limitation of LLMs being frozen in time after training. The other thing is, of course, domain-specific knowledge. Just like you're seeing on the screen with security data, you can use retrieval augmented generation to let the LLM leverage proprietary or specialized information sources to give you a better response. Now, again, how RAG works: it converts the documents into vector embeddings. You can store them in a vector database, or you can retrieve information from other tools or other data pipelines using APIs and so on. Then you encode the user query as a vector, retrieve the relevant documents using semantic search, and augment the LLM prompt with the retrieved information. Now, the perfect example of seeing this in action is LangChain and their chatbot.
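The retrieval step just described can be sketched in a few lines of Python: rank the stored chunks by cosine similarity against the query vector and keep the top-k. The chunks and vectors below are invented for illustration; a real deployment delegates this ranking to a vector database like the ones just mentioned:

```python
# Minimal semantic-search sketch: cosine similarity + top-k selection.
# The chunks and their vectors are made up; in practice an embedding
# model produces them and a vector database does the ranking.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """store: list of (chunk_text, vector) pairs already 'in the database'."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

store = [
    ("Rotate credentials after a breach.", [0.9, 0.1, 0.0]),
    ("CVE-2024-0001 affects the VPN gateway.", [0.1, 0.9, 0.2]),
    ("Enable MFA for all admin accounts.", [0.8, 0.2, 0.1]),
]
context = top_k([0.0, 1.0, 0.1], store, k=1)
# The retrieved chunk(s) then get pasted into the LLM prompt as context.
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: ..."
print(context)
```

This is the whole trick: the LLM never searches anything itself; the application retrieves the most similar chunks and simply places them in the prompt.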
If you go to chat.langchain.com, it will take you to this page, and what this is, is basically a perfect example of how you can use retrieval augmented generation. LangChain basically vectorized their Python documentation, and now you have a chatbot where you can ask questions related to that documentation. Like, for example, what does RunnablePassthrough do, right? Or what is an agent? Let's actually put that in there: what is an agent? And what it's doing is going over all the documents, or the chunks of the documents, of the Python documentation that they vectorized, and it gives you an answer. And not only does it give you an answer, it gives you references to the sections of the documentation where it talks about this topic. So if you click on any of them, you see that it takes you directly to the documentation. Now, of course, this allows you to select from different models. I was using GPT-4o mini, but you can use Claude, Gemini Pro, Mistral, Llama, or Cohere. And this, of course, is continuing to change literally by the minute; more models are being introduced, more capabilities are being introduced, and so on. But again, this is a perfect example of a retrieval augmented generation implementation. Now, there are different types of retrieval augmented generation solutions out there, if you will. One is called RAG Fusion, which takes traditional RAG a little bit further by combining different techniques to provide better results, augment that data, and reduce the likelihood of hallucinations, right? The other one is called RAPTOR. Now again, on becomingahacker.org, I have a write-up related to all three, but real quick, if you look at this diagram, let me make it a little bit bigger here. In RAG Fusion, basically you start with a query, and you retrieve the different components from a vector database and so on.
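One nice thing the chatbot does is cite the documentation sections it used. A rough sketch of how an application might assemble a prompt that asks the model for those citations looks like this. The chunk text and URL below are invented, and this is not LangChain's actual implementation, just the general pattern:

```python
# Sketch of a citation-aware augmented prompt, in the spirit of
# documentation chatbots. The chunk text and source URL are made up.
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk and ask the model to cite by number."""
    context = "\n".join(
        f"[{i + 1}] {c['text']} (source: {c['url']})" for i, c in enumerate(chunks)
    )
    return (
        "Use ONLY the numbered context below to answer, and cite the\n"
        "bracketed numbers of the chunks you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    {"text": "An agent uses an LLM to choose a sequence of actions.",
     "url": "https://example.com/docs/agents"},
]
print(build_prompt("What is an agent?", chunks))
```

Because each chunk carries its source URL through the prompt, the application can turn the model's bracketed citations back into the clickable documentation links you see in the chatbot.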
But you have three different retrieval systems in this case, right? You can have multiple, many more if you want to. And then basically you're comparing, or evaluating, and fusing, that's the term here, fusing and re-ranking those results. And after that re-ranking, you're sending that to the LLM to provide additional context. This is called Reciprocal Rank Fusion, and I have references on how this actually works, and as a matter of fact, even references related to how to implement it, with different GitHub repositories, different tutorials that exist online, and so on. The other approach is called RAPTOR. In this case, let me actually make this bigger. It takes the information from your retrieval augmented generation vector stores, from the vector databases, and basically forms a tree: going and getting all those chunks, re-ranking those chunks, clustering and summarizing those chunks, and then sending the result to the actual model. And I have tons of examples here, even some write-ups and videos from the creators of LangChain that demonstrate these concepts, and there's another video here by the creators of LlamaIndex. So now, there are a few differences that I have in here. Traditional RAG, with no re-ranking, basically uses the top-k chunks. So basically, it retrieves those chunks of information from the vector store, but it doesn't do any type of re-ranking, and it doesn't do any type of fusing of those documents. RAG Fusion then adds that extra step of generating multiple queries from the original query, retrieving documents for each of them, fusing the results, and then providing that to the LLM. RAPTOR uses a tree-based retrieval method that clusters and summarizes the text it retrieves before sending it to the LLM.
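Reciprocal Rank Fusion itself is a small, well-known formula: each retriever contributes 1 / (k + rank) for every document in its ranked list, with k = 60 being the constant commonly used, and the sums decide the fused order. A minimal sketch, with made-up document IDs:

```python
# Reciprocal Rank Fusion (RRF): each retrieval system returns a ranked
# list of document IDs; a document's fused score is the sum of
# 1 / (k + rank) across all lists (k = 60 is the usual constant).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three retrievers disagree; RRF rewards documents that rank
# consistently well across several lists.
fused = rrf([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
])
print(fused)  # doc_b wins: it places highly in all three lists
```

Notice that doc_b never needs to be first everywhere; consistent presence near the top of multiple lists beats a single first-place finish, which is exactly why fusing multiple query variants reduces the impact of any one bad retrieval.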
Now, a few things here: with RAG Fusion and with RAPTOR, the cost for you to do the retrieval, the re-ranking, and then, of course, the queries to the LLM is significantly higher, depending on the implementation, in comparison to traditional RAG, where you're just doing one retrieval and so on. However, the results are a lot better. It's something that you have to take into consideration whenever you're thinking about cost, and there's not a silver-bullet answer as to how much it's actually going to cost. It all depends on the memory, it depends on your data, and so on. One other thing that I want to talk about is your data. Whenever you put your data into a vector database, it has to be properly indexed in order for you to be able to retrieve those chunks of your documents and do semantic search successfully. And there are many other techniques out there, including hybrid search. As a matter of fact, I have a write-up on hybrid search in my GitHub repository, as well as on the becomingahacker.org blog. So these are different techniques that are evolving as we speak. In a nutshell, that is retrieval augmented generation: it's basically providing additional context to the LLM so that you get better results and reduce the likelihood of hallucinations.
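The idea behind hybrid search is to blend a lexical (keyword) score with a semantic (vector) score. Here is a minimal sketch of that blending; the documents, the precomputed vector scores, and the 50/50 weight are all illustrative, and production systems typically combine BM25 with dense-vector search inside the database itself:

```python
# Hybrid search sketch: blend keyword overlap with a semantic score.
# Documents, vector_scores, and the alpha weight are illustrative only.
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that literally appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, docs, vector_scores, alpha=0.5):
    """vector_scores: stand-in for per-document embedding similarity."""
    scored = [
        (alpha * keyword_score(query, doc) + (1 - alpha) * vec, doc)
        for doc, vec in zip(docs, vector_scores)
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]

docs = ["patch the vpn gateway", "rotate admin credentials"]
print(hybrid_rank("patch vpn", docs, vector_scores=[0.4, 0.9]))
```

The exact-keyword match pulls the first document to the top even though its semantic score is lower, which is the point of hybrid search: lexical matching catches identifiers like CVE numbers that pure semantic search can miss.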