From the course: LLMOps in Practice: A Deep Dive

BYOD to a VectorDB - Python Tutorial


Now that you've installed and set up a Chroma DB, let's explore what you'll need to do to get your data into it. The core scenario for retrieval-augmented generation is when an LLM may not have all of the facts about a particular topic. For example, you might have data that's private to you and wasn't used in the training set for a model like GPT or Gemini. So if you want a chatbot that's an expert on that content, you would either have to train an entirely new model with it or use the reasoning powers of an LLM across that content. And that's where RAG comes in. The idea here is that when a user sends a prompt to an LLM, you can also send it snippets of relevant information from your content, and the model will combine those with what it knows from its training data. In this video, we're going to use an interesting example of this, where you'll take a science fiction novel that was not used in the training data for any LLMs, and you're going to chat with an LLM about it. The novel is called Space Cadets, and it's a near-future story of the first Space Academy and the generation of young people who occupy it. Here's an image that I painted of one of the characters from this novel. Her name is Soo-Kyung Kim, and she's the first from her country to attend this academy. She's from the Democratic People's Republic of Korea, or, as it's often known, North Korea. I chose her as an example because if you were to ask a chatbot about her, it would have no idea, as in this case with Gemini, or worse, it could hallucinate about her, as in this case with GPT-4o. But if you were to ask one using RAG, then it could pull text from the book about her, such as her hometown, her likes and dislikes, her history, that kind of thing, and it would then give you a much better profile of her. So how would we do this?
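To make the idea concrete, here's a toy sketch (not from the course code) of the augmentation step: retrieved snippets are simply prepended to the user's question before the prompt goes to the LLM. The function name and prompt wording are illustrative assumptions, not a real API.

```python
# A toy illustration of RAG prompt augmentation: snippets retrieved from
# your content are prepended as context ahead of the user's question.
# build_rag_prompt is a hypothetical helper, not part of any library.

def build_rag_prompt(question, snippets):
    # Format each retrieved snippet as a bullet so the model can
    # distinguish the supplied context from the question itself.
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "Where is Soo-Kyung Kim from?",
    ["Soo-Kyung Kim is the first cadet from the DPRK."],
)
```

The resulting string is what you would send to the LLM in place of the bare question.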
Well, the first step is to get the contents of the book into a database where we can start doing similarity searches against it, and we'll explore that next. As the book is a PDF, the first step in this process will be to read a PDF. In Python, the pypdf library has support for this. PDFs are usually stored as pages; it was a format designed for print, after all. So when we want to get all of the text from one, we have to iterate through each page and append the contents of that page to our overall text, so that when the loop completes, the text variable will contain the entire text of the book. Next, in order to store the text as embeddings in the database, we have to split it into smaller chunks. Each of these chunks will have an embedding associated with it, and that's simply a numeric vector that represents the text in such a way that similar pieces of text have similar vectors. So when you have a prompt going to an LLM, you can search the database for text similar to that prompt, and then you can use what you found to augment the prompt. For example, if your prompt was something about Soo-Kyung, then of course snippets from the book mentioning her would be retrieved, and these could include details about her hometown or other things that the LLM does understand, which could then be used to inform the answer. To split into chunks, you can use the RecursiveCharacterTextSplitter class from LangChain. This simply splits the text into chunks like this, but it is important to understand the chunk size and chunk overlap properties. Their goal is to help you determine the best chunks to use, and that's going to take a bit of experimentation. They're measured in characters. The overlap helps you avoid losing semantics in your text by chopping things off in the middle of a sentence. So once you have the chunks, you need to calculate the embeddings for them.
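A minimal sketch of these two steps might look like the following. The PDF-reading function assumes the pypdf package is installed; the chunker below it is a deliberately simplified stdlib stand-in for LangChain's RecursiveCharacterTextSplitter, shown only to illustrate what chunk_size and chunk_overlap mean in characters.

```python
# Sketch: extract all text from a PDF, then split it into overlapping chunks.

def pdf_to_text(path):
    # Assumes the pypdf package; imported locally so the chunking demo
    # below still runs in environments without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        # extract_text() can return None for image-only pages.
        text += page.extract_text() or ""
    return text

def chunk_text(text, chunk_size=1000, chunk_overlap=150):
    # Simplified stand-in for RecursiveCharacterTextSplitter: slide a
    # window of chunk_size characters, stepping back by chunk_overlap each
    # time so a sentence cut at one boundary appears whole in the next chunk.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("All along the watchtower. " * 100,
                    chunk_size=200, chunk_overlap=40)
```

In the real pipeline you would call `chunk_text(pdf_to_text("space_cadets.pdf"))` (filename assumed) or, better, use LangChain's splitter, which also tries to break on paragraph and sentence boundaries rather than at fixed character offsets.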
Remember, similarity search uses these numeric values to make it much easier to find content in your database that's close to the prompt. And the LangChain OpenAI library includes an embedding function that you can use. For this, you'll need an OpenAI API key, and this code will iterate through each of your chunks and calculate the embeddings for them, returning them as a list. Now all you need to do is store these embeddings in the database. With Chroma DB, you do this with what's called a collection. Let's take a look at the code. It's pretty straightforward, but I'm going to break it down piece by piece. Earlier, you calculated the embeddings for your text chunks using an embedding function from OpenAI with a library from LangChain. When using Chroma DB, you have to use its embedding functions to specify what embedding your data uses. So you have to import the library. You'll then specify which embedding function from the Chroma DB library you want to use. Be careful here: this must match what you used to calculate the embeddings from your text, or the results will be meaningless. If you have errors in retrieval, I've always found this is the first thing you should check. Once you have that, you can create the collection by giving it a name and specifying what embedding function it uses. Here, we're creating a collection called PDF embeddings, and we're specifying the embedding function from above, which, of course, is the OpenAI text embedding function. Now storing the chunks becomes as easy as just adding them to this collection. Note that the collection stores the embeddings and the original raw data, which it calls a document. So when you retrieve, you don't need to decode the embedding; you simply read the document. Great. Now you can see how you can slice up a PDF document to store it in a database. Next, I'm going to show you this code in operation so you can see how it all works.
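The steps above can be sketched roughly as follows. This is an assumption-laden outline, not the course's exact code: it assumes the chromadb package, an OPENAI_API_KEY environment variable, and uses "pdf_embeddings" and "text-embedding-ada-002" as illustrative names for the collection and embedding model.

```python
# Sketch: store text chunks in a Chroma DB collection, letting Chroma's
# OpenAI embedding function compute the embeddings as chunks are added.
import os

def store_chunks(chunks, collection_name="pdf_embeddings"):
    # Imported locally so the sketch can be read without chromadb installed.
    import chromadb
    from chromadb.utils import embedding_functions

    # This embedding function MUST match the one used for any embeddings
    # you computed elsewhere, or similarity search will be meaningless.
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=os.environ["OPENAI_API_KEY"],
        model_name="text-embedding-ada-002",  # assumed model name
    )

    client = chromadb.Client()
    collection = client.create_collection(
        name=collection_name,
        embedding_function=openai_ef,
    )

    # Chroma stores both the embedding and the raw text (the "document"),
    # so retrieval returns readable text, not vectors to decode.
    collection.add(
        documents=chunks,
        ids=[f"chunk-{i}" for i in range(len(chunks))],
    )
    return collection
```

Because the collection was created with an embedding function, `collection.add` embeds each document for you; you could equally pass precomputed vectors via the `embeddings` argument, as long as they came from the same model.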
After that, we're going to look at how you can use Node to talk to this database and retrieve documents that are close to another string. And you'll have the beginnings of a RAG chatbot that's an expert on this book.