From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)

Prepare data for the knowledge base

We will now acquire a data source and prepare it for adding to the knowledge base. The data source is a PDF document called Large Language Models.pdf, which is available as part of the exercise files. This document contains text extracted from Wikipedia about LLMs. To load and process the document, we will use LangChain, which has loaders for several data sources; we will use the PDFMinerLoader for this purpose. Given that this is a clean document, we do not need any filtering or cleansing. Similarly, since we are using only one data source, there is no need for standardization. Let us load up this document.

Next, we proceed to chunking. For chunking, we use LangChain's RecursiveCharacterTextSplitter; LangChain also provides several other chunking methods. We initialize the splitter first, with a chunk size of 512 and a chunk overlap of 32, so consecutive chunks overlap and continuity is preserved between them. The length function is used by the splitter to measure chunk size; we can also plug in a custom function here. Using the splitter, we split the document into chunks and add all the chunks to a list. Finally, we print the count of chunks and a sample chunk's text. Let's run this code now. There are 23 chunks in this document.

Once we have the chunks, we also need to create embeddings for them. We follow the same process as in the earlier chapters to initialize the OpenAI embeddings, then use this embedding model to create a corresponding embedding for each chunk. For chunk IDs, we create a list of increasing integers starting from zero, one per chunk. We now have data for all three fields (chunk IDs, chunk text, and embeddings), and we are ready for loading. The chunks are created and ready.
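Putting these steps together, here is a minimal sketch of the flow described above. It assumes the exercise file Large Language Models.pdf is in the working directory and that an OpenAI API key is set in the OPENAI_API_KEY environment variable; import paths vary across LangChain versions, so older installs may use langchain.document_loaders and langchain.embeddings instead.

```python
# Minimal sketch of the load -> chunk -> embed flow described above.
# Assumes: "Large Language Models.pdf" is in the working directory,
# and OPENAI_API_KEY is set in the environment.
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load the PDF from the exercise files (requires pdfminer.six)
loader = PDFMinerLoader("Large Language Models.pdf")
docs = loader.load()

# Initialize the splitter: 512-character chunks with a 32-character overlap,
# measured by the built-in len(); a custom length function can be passed instead
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=32,
    length_function=len,
)

# Split the document and collect the chunk texts into a list
chunks = splitter.split_documents(docs)
chunk_texts = [chunk.page_content for chunk in chunks]
print("Number of chunks:", len(chunk_texts))
print("Sample chunk:", chunk_texts[0])

# Create one embedding per chunk with the OpenAI embedding model
embedding_model = OpenAIEmbeddings()
chunk_embeddings = embedding_model.embed_documents(chunk_texts)

# Chunk IDs: increasing integers starting from zero, one per chunk
chunk_ids = list(range(len(chunk_texts)))
```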
