From the course: Vector Databases in Practice: Deep Dive

Basic data import in Weaviate

- [Instructor] It's time to add data to our database. In many ways, we've done a lot of the hard work already. We've created a database and created a scaffold or blueprint by defining our movie collection. All we need to do now is to load our data and pass it to Weaviate in a way that matches the collection definition. So let's take a look.

What we're going to do first is to load the dataset using a library called pandas. This line loads our movie data from a CSV into a nice tabular format called a DataFrame. But of course, you can load your data any way you like. We'll connect to our database and then get the movie collection that we just defined.

Next, we're going to create a list of data objects, one for each movie, by iterating through the movie rows. For each row, we'll create a dictionary of properties. This is the data that's going into our collection. The keys on the left match the collection property names and the values on the right come from the columns of our DataFrame. Then we'll generate a unique identifier, also called a UUID, for each object. UUIDs are how Weaviate internally identifies objects. We'll use this helper function here to generate a deterministic identifier using the movie ID as the original source. And then we can use the data properties that we created and the UUID to create a data object instance for each movie.

Specifying an ID like this allows us to prevent duplication. Because each movie's ID is unique, the UUID will be as well. But another important factor is that this is a deterministic, or predictable, way of generating IDs. As a result, if the data changes, all we need to do is to insert it again with the same UUID generated from the same movie ID. Doing so will overwrite the existing object, so you won't end up with any duplicates. Next, we'll append the object to our list and then insert our dataset with the insert_many method. Note the syntax: it's called as movies.data.insert_many.
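To make the deterministic-ID idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the code shown in the video: the sample CSV, the column names (`movie_id`, `overview`, `year`), the property names, and the `deterministic_uuid`, `build_objects`, and `upsert` helpers are all assumptions for illustration. `uuid.uuid5` stands in for the client's helper function, and a plain dictionary stands in for the collection.

```python
import csv
import io
import uuid

# Hypothetical sample standing in for the course's movie CSV.
SAMPLE_CSV = """movie_id,title,overview,year
862,Toy Story,Toys come to life.,1995
603,The Matrix,A hacker learns the truth.,1999
"""


def deterministic_uuid(movie_id: str) -> str:
    """Derive a repeatable UUID from the movie ID (stand-in for the client helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(movie_id)))


def build_objects(csv_text: str) -> list[dict]:
    """Create one object per row: a properties dict plus a deterministic UUID."""
    objects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keys are (assumed) collection property names; values come from the row.
        properties = {
            "title": row["title"],
            "description": row["overview"],
            "year": int(row["year"]),
        }
        objects.append(
            {"uuid": deterministic_uuid(row["movie_id"]), "properties": properties}
        )
    return objects


def upsert(store: dict, objects: list[dict]) -> None:
    """Insert by UUID; an existing UUID is overwritten rather than duplicated."""
    for obj in objects:
        store[obj["uuid"]] = obj["properties"]


store = {}  # in-memory stand-in for the movie collection
upsert(store, build_objects(SAMPLE_CSV))

# Re-import after the source data changes: same movie IDs, same UUIDs.
updated_csv = SAMPLE_CSV.replace("Toys come to life.", "A cowboy doll feels threatened.")
upsert(store, build_objects(updated_csv))

print(len(store))  # 2
```

Because the UUID depends only on the movie ID, importing the edited data replaces the earlier entry instead of adding a third one.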
We've used something like movies.query a number of times before for our searches, but now we're using movies.data. What the Python client does here is to separate functions or methods into submodules like query, data, or generate to make usage easier for developers. We'll close the connection and then run this code. When we run this code with the insert_many method, it'll populate Weaviate with the movie data. Note that while Weaviate does so, it'll contact the OpenAI API. Remember that we defined the vectorizer module earlier to obtain vectors to represent each object. You probably also remember that we provided the OpenAI API key in our connection code earlier. That's the key that Weaviate is going to use during the import to create the vectors. Now we'll run the code. It shouldn't take long. And just like that, you've built a fully functioning vector database. This means you can already run any of the queries you've learned about earlier, whether they be vector, keyword, or hybrid searches, as well as filters. In fact, I'd encourage you to try running some of those queries on your own database. So that's it for basic imports, or object insertions. We're almost done building our database. Before we wrap up, let's move on to references, which will let us establish relationships between collections.
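As a rough sketch of how the data and query submodules sit side by side, the function below inserts objects and then runs one hybrid search. It assumes the v4 Weaviate Python client, a locally running instance reached with `connect_to_local`, a collection named `Movie`, an `OPENAI_API_KEY` environment variable, and hypothetical column and property names; the course's own connection code and schema may differ. `to_properties` and `import_and_query` are illustrative names, not part of the client.

```python
import os


def to_properties(row: dict) -> dict:
    """Map source columns (hypothetical names) onto collection property names."""
    return {
        "title": row["title"],
        "description": row["overview"],
        "year": int(row["year"]),
    }


def import_and_query(rows: list[dict]) -> None:
    """Insert rows via movies.data, then search them via movies.query."""
    # Imported here so the helper above can be used without the client installed.
    import weaviate
    from weaviate.classes.data import DataObject
    from weaviate.util import generate_uuid5

    # Assumption: a local instance; the API key is passed as a request header
    # so Weaviate can call OpenAI to vectorize objects during the import.
    client = weaviate.connect_to_local(
        headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
    )
    try:
        movies = client.collections.get("Movie")  # collection name is assumed

        objects = [
            DataObject(
                properties=to_properties(row),
                uuid=generate_uuid5(row["movie_id"]),  # deterministic per movie
            )
            for row in rows
        ]

        # Write path: the data submodule.
        result = movies.data.insert_many(objects)
        if result.has_errors:
            print(result.errors)

        # Read path: the query submodule (the query text is just an example).
        response = movies.query.hybrid(query="space adventure", limit=3)
        for obj in response.objects:
            print(obj.properties["title"])
    finally:
        client.close()  # always release the connection
```

Calling `import_and_query(rows)` with a list of row dictionaries would need a running Weaviate instance and a valid key, so treat it as a starting point to adapt rather than a drop-in script.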
