From the course: Vector Databases in Practice: Deep Dive
Solution: Import Wikipedia data chunks
- [Instructor] Okay, here's my solution. Let's see if we arrived at similar solutions. First, of course, we need to create a collection, so hopefully this process is familiar to you by now. We define a name for our collection, which we'll call WikiChunk, a vectorizer for creating vectors, and a generative module for RAG queries. Then we create some properties: a title with the text data type and a chunk with the same text data type. I'm also going to add a chunk number as an integer. This isn't in the original dataset, but it will come in handy, because without it you wouldn't be able to tell where in the source document each chunk comes from. This creates the collection definition, so now we'll get the newly created collection before inserting our chunks into it.

Remember our hint from before: the chunked pages object is a dictionary where the key is the title of each document, so we can iterate through it. The outer loop goes through the chunked pages dictionary, giving us each page name and its values, which are the page chunks. For each page we create a list of data objects, the chunk objects, and an inner loop iterates through that page's chunks.

Now comes the question of how to generate the UUID for each object. Remember that in our movie database we used the row number, but that won't necessarily work here, because there are multiple chunks with the same chunk number. Since there are multiple pages, there will be, for example, multiple chunk zeros. So what I do instead is use the page name together with the chunk number to generate the ID. That ensures the seed for our UUID is unique. Once we've made that decision, we generate the data object similarly to what you've done before. We use the title and the chunk as properties, and we pass on the chunk number as well so that we can retrieve it with each chunk. Remember to pass the UUID on too. Once each chunk object has been appended to our list of chunk objects for the page, we insert the whole list, and we print out a message telling us, "Hey, we've finished importing this particular page." And of course, remember to close the client connection.

If you run that code, it'll populate your instance with a few hundred Wikipedia chunks. That means you'll now have your own searchable mini set of Wikipedia pages that you can use for RAG or any other searches. Just as a note, this type of individual page downloading isn't really suitable for bulk downloads, for various reasons. If you want a large Wikipedia corpus, you can download Wikipedia's official data dumps that we've discussed before. If you're interested in building a vector database with a big dataset, that could actually be a very good option.

These chunking and data import techniques are useful and applicable to many types and sources of data, and you now know how to apply them yourself end to end, from creating a database to importing the data. This is really exciting, and I'd encourage you to build your own databases with whatever data you're interested in.
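For reference, here's a minimal sketch of the workflow described above, using the Weaviate Python client (v4). The local connection, the OpenAI vectorizer and generative modules, and the placeholder chunked_pages dictionary are assumptions for illustration; in the exercise, chunked_pages is the dictionary of page titles to chunk lists built in the previous video.

```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.data import DataObject
from weaviate.util import generate_uuid5

# Placeholder standing in for the dictionary built earlier:
# {page title: [chunk strings]}
chunked_pages = {
    "Example page": ["First chunk of text...", "Second chunk of text..."],
}

# Connection details are an assumption; adjust for your own instance.
client = weaviate.connect_to_local()

# Create the collection with a vectorizer and a generative module for RAG.
client.collections.create(
    name="WikiChunk",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),  # assumed module
    generative_config=Configure.Generative.openai(),           # assumed module
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="chunk", data_type=DataType.TEXT),
        # Not in the source data, but lets us locate each chunk within its page.
        Property(name="chunk_number", data_type=DataType.INT),
    ],
)

# Get the newly created collection before inserting chunks into it.
chunks = client.collections.get("WikiChunk")

# Outer loop: one pass per page; inner loop: one pass per chunk of that page.
for page_name, page_chunks in chunked_pages.items():
    chunk_objs = []
    for chunk_no, chunk_text in enumerate(page_chunks):
        chunk_objs.append(
            DataObject(
                properties={
                    "title": page_name,
                    "chunk": chunk_text,
                    "chunk_number": chunk_no,
                },
                # Seed the UUID with page name + chunk number,
                # so chunk 0 of different pages never collides.
                uuid=generate_uuid5(f"{page_name}_{chunk_no}"),
            )
        )
    chunks.data.insert_many(chunk_objs)
    print(f"Finished importing: {page_name}")

client.close()
```

Note that generate_uuid5 is deterministic, so the same page name and chunk number always produce the same ID, which is why combining the two gives every chunk a unique seed.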