From the course: OpenAI API for Python Developers
Load and split documents
- [Instructor] Now, the next step is to load documents as a source of information. In this example, we build an AI assistant that can understand natural language, respond to user queries, and retrieve relevant information. So we want to build a virtual assistant that is driven by AI and trained on custom data. We're going to use Chroma, an open-source vector database, and below you'll find a basic example to get started. It works in two steps: first, we load the documents as a source of information, and second, we split these documents into chunks. So the first thing we do is add these import statements. Let's go back to our example right here, and make sure to comment this one out, because in the next example we'll see how to load and split the documents. So let's comment these lines out for the moment. We go back, take those two lines, and add them right here. This corresponds to the first step, loading the documents. We replace the path here with the directory where we keep our information, which is the docs directory. Inside, you'll find a text document that includes frequently asked questions about your products and services, for example. With this information, your AI assistant will be able to answer questions about the type of products you offer, the shipping conditions, the return policy, and also the loyalty program. So let's make sure we load this document: I'll replace the file path with the docs directory, and inside we find this FAQ text file. Here we go. The second step is to split these documents into chunks, so let's take these two lines and add them right here. And now we'll run a quick demo by printing the docs.
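The two steps just described can be sketched in plain Python. The course itself uses loader and splitter classes from a framework, so treat the helper names below (`load_documents`, `split_into_chunks`) and the `docs/faq.txt` path as illustrative, not as the course's actual code:

```python
from pathlib import Path


def load_documents(directory: str) -> list[str]:
    """Step 1: read every .txt file under `directory` into a list of strings."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(directory).glob("*.txt"))]


def split_into_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Step 2: cut `text` into fixed-size pieces; neighbours share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# Load the FAQ document and split it into chunks, then print the result,
# mirroring the demo in the video (the docs/ path is an assumption).
docs = load_documents("docs")
for doc in docs:
    print(split_into_chunks(doc))
```

The small overlap between neighbouring chunks is a common choice so that a sentence cut at a chunk boundary still appears intact in at least one chunk.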
And actually, I'm going to print just docs to see the whole thing. After the process of loading and splitting the documents into chunks, we can see the result: a big block of text under page content. But let's make a few adjustments. You'll see this setting right here, chunk size, which corresponds to the size of each chunk. We'd rather have a smaller size for every chunk because we have a fairly small document, so we can afford to load smaller chunks of information. That will be more economical and also more performant when we load the chunks of information into the vector store. So let's try that again. This time you can see that the chunks created are much smaller, and here it warns that one chunk is slightly longer than one hundred. That's because the chunk size is a target average: each chunk will be around, or slightly over, one hundred characters. So that was the first step, splitting the documents into chunks, so that we can next create embeddings and then load the embeddings into the vector store. We use embeddings to measure the relatedness of text strings: when you submit a text input, we search by similarity between the embeddings. That's the next step, to create embeddings and load them into the vector store, and we're going to use Chroma as our vector store.
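To preview the similarity search that the vector store will perform in the next step, here is a minimal sketch of measuring relatedness between embedding vectors with cosine similarity. The function names and the tiny hand-written vectors are illustrative; in practice the vectors come from an embedding model and the search is handled by Chroma:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Relatedness of two embedding vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(query_vec: list[float], chunk_vecs: list[list[float]]) -> int:
    """Index of the chunk embedding closest to the query embedding."""
    return max(range(len(chunk_vecs)), key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]))


# Toy example: the query vector points roughly the same way as the second chunk,
# so a similarity search would retrieve that chunk.
query = [1.0, 0.2]
chunks = [[0.0, 1.0], [1.0, 0.1]]
print(most_similar(query, chunks))
```

This is the core idea behind "search by similarity between the embeddings": the chunk whose vector is closest to the query's vector is the most relevant piece of the FAQ to hand to the model.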