From the course: Introduction to AI-Native Vector Databases
Adding data to a vector database
In this video, we'll look at getting started with Weaviate, an open source vector database. First, we're going to import the libraries that we're going to use: the requests library and the JSON library. The first one is going to help us grab our data, and the second one is going to help us print out the data. The data set is hosted remotely, so we're going to use this URL to get access to it, and we're going to load it up.
Once we've loaded the data set into our Python environment, we're going to go ahead and examine it. First, we want to look at the type of data that we've just brought in, and we're also going to look at how much data we have, so we're going to look at the length of the data object. We find out that the data object is a Python list, and there are ten objects within that list. We're also going to print out what the first object looks like. Because this is a JSON file, we're going to use json.dumps, access the first element in the data set, and set the indent to two. And so this is the first object in our data set. Let's examine the properties. We've got a category, we've got the question itself, and we've got the answer to the question. Notice how the category is the concept that the question is based on.
Because this JSON printing is something that we're going to be doing a lot, we're going to wrap it into a Python function, and we're going to use this function to print out the rest of our data. To do that, we simply call the function on the rest of our data and run this line of code. Notice how we've got all ten questions and answers extracted and on the screen. The majority of the questions have to do with science and animals. So now that our data is set up and ready to go, we need to instantiate Weaviate and get this data into the vector database.
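The inspection steps described above can be sketched roughly as follows. This is a minimal sketch, not the notebook from the video: the records are hypothetical placeholders standing in for the remotely hosted data set (whose URL is not given in this transcript), the key names `Category`, `Question`, and `Answer` are assumptions, and `json_print` is simply a plausible name for the helper.

```python
import json


def json_print(data):
    """Pretty-print any JSON-serializable object with a two-space indent."""
    print(json.dumps(data, indent=2))


# Hypothetical placeholder records. In the video the list is fetched from a
# remote URL with the requests library and parsed as JSON; an inline list is
# used here so the sketch runs offline. The key names are assumptions.
data = [
    {"Category": "SCIENCE", "Question": "Placeholder question 1", "Answer": "Placeholder answer 1"},
    {"Category": "ANIMALS", "Question": "Placeholder question 2", "Answer": "Placeholder answer 2"},
]

print(type(data))    # the container type of the loaded data
print(len(data))     # how many records were loaded
json_print(data[0])  # the first record, nicely indented
json_print(data)     # every record
```

Wrapping `json.dumps` in a small helper keeps later calls short, since nearly every response from the database is printed the same way.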
So, to get started with Weaviate, we're going to import it. In this particular instance, we're going to use Weaviate in the embedded mode, which means that it's going to run locally for us. To get a client instantiated, we can simply create a Weaviate client: we tell Weaviate that we want a client and that we want to run it in embedded mode, and we can do so like this. We're also going to pass in a third-party API key by passing in additional headers here. The additional header that we want to pass in is one for OpenAI, and the naming convention that we use here is like so. In order to set up your API key, you can go into the Readme file for this course's repository, where I explain exactly how you can set it up so that you'll be able to use it as I'm specifying it here. I'm going into my environment variables, and I'm just grabbing the OpenAI API key because I've already set this up as a part of the setup for the course. So, now, we're going to go ahead and run that, and it lets us know that embedded Weaviate started with a process ID. We can ignore these warnings for now.
To verify that everything is good, we're going to check that it's running. In order to do that, we're going to call an endpoint called the get meta endpoint. So we're going to JSON print. A lot of our talking to the Weaviate instance is going to go through this JSON print function so that it's formatted nicely. We're going to go into the client and run the get meta endpoint. This is just going to show us some metadata around this particular Weaviate instance, so we can see the hostname and the modules that we have available. We don't need to understand these details. This just makes sure that we can talk to the instance and get its metadata. As long as you get some response here, that means that your Weaviate instance is up and running.
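As a rough illustration of the client setup just described, here is a sketch that assumes the v3 `weaviate-client` Python package (with `weaviate.Client`, `EmbeddedOptions`, and `get_meta`) and an `OPENAI_API_KEY` environment variable. The helper names `build_headers`, `make_embedded_client`, and `is_running` are hypothetical and do not come from the video.

```python
import os


def build_headers(env=os.environ):
    """Build the additional headers that carry the OpenAI API key."""
    # Assumes the key is stored in an OPENAI_API_KEY environment variable.
    return {"X-OpenAI-Api-Key": env["OPENAI_API_KEY"]}


def make_embedded_client(headers):
    """Start Weaviate in embedded (local) mode and return a client."""
    # Imported here so the other helpers can be used without the package.
    # Assumes the v3 Python client; newer client versions use a different API.
    import weaviate
    from weaviate.embedded import EmbeddedOptions

    return weaviate.Client(
        embedded_options=EmbeddedOptions(),
        additional_headers=headers,
    )


def is_running(client):
    """Return True if the instance answers the meta request with any data."""
    return bool(client.get_meta())
```

With the package installed and the key exported, `client = make_embedded_client(build_headers())` followed by printing `client.get_meta()` through the JSON print helper would mirror the check shown in the video.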
The next thing we're going to do is check whether, in this instance, we have a questions class already up and running. Because if there is one and we haven't created it, we want to get rid of it. So this line of code is going to check whether that class exists, and if it does exist, this line of code is going to delete that class. The reason why we delete that class is because we want to create our own class, and we want to instantiate it from scratch. So, I'm going to go ahead and run those two lines of code to get rid of any classes that are already there, because I want to create my own class using this class object. The first thing I'm going to do is specify the name of the class. Secondly, I'm going to specify the vectorizer here. I'm going to use OpenAI because this is the model that's going to be used to convert my data into vectors. We'll talk more about this in detail later, but for now, we just specify it. And then we're going to pass in this class object to create our schema. To do that, we can tell Weaviate that we want to create a schema using this class object. So, that's all set up.
Now that we have our data locally available and we've got Weaviate running locally, the next thing we want to do is take the data and pass it into Weaviate. To do that, we're going to batch our data, and we're going to loop through the data, one data point at a time. I've got a print statement here that just lets you know when we've accessed a particular data point. The important thing here is to specify the correct properties that our data has. Earlier, we saw that every single data point had a category, a question, and an answer. We want to specify that and let the database know that those three distinct properties exist. We're going to do that over here. So we're going to specify properties, and this is going to be a dictionary. And we're going to say that the first property here is answer.
And that can be extracted from this particular data point over here. So, this is going to refer back to our original data in the JSON file. We want to pass in the answer. We want to pass in the question itself, so we let the database know that it's got a question incoming as well, and that's where we're going to grab the question. And then we're also going to pass in the category. So, these are the three properties that we had in our data, and we're letting the database know what to expect as we're passing it in.
Now that we've specified the properties, we need to add these objects one by one to Weaviate. The way we do that is by calling the add data object method. So we can go into the batch and call the add data object method here, and we can pass in an individual data object. We can say the data object is equal to this properties object that we've set up here, like so, and we can tell it which class to append this data to. That class is what we specified here, questions. So we can let it know exactly where to put the data by specifying the class name here, and this has to match the class that we've created up here. So that looks all good. We can run this, and it lets us know that it's imported question one all the way up to question ten. So, all of our questions are now in Weaviate.
The next thing we're going to do is check that the database has actually registered all ten data points. Again, any talking to the database is going to be filtered through this JSON print function to format it nicely. We're going to go ahead and query our client. Querying is a concept where we ask a question of our client. In this case, the question is, how many data points do you have inside the database at this time? This is known as an aggregation query. We're going to tell it what class we want to aggregate data from.
And that's the question class. And then here, we're going to use the with meta count query, and we're going to tell it to execute this query. When we run this query, we get information back. What we're interested in here is the meta field that tells us that the count of objects is ten, which agrees with what we saw earlier. We have ten questions and answers, we passed them all in, the program told us that all of them had been registered, and we can see that they're all accounted for in the database itself.
So, the last thing we're going to do is extract three questions and answers. Any three questions will do, just to make sure that the data that we extract from the database aligns well with the data that we saw at the beginning of this notebook. So, here we're going to write a query that extracts three questions and answers. Once again, any conversation that we're going to have with the database is going to be filtered through this JSON print function so that we can format it nicely. We're going to query the client, and we're going to let it know which class we want to extract the data from. We're also going to let it know which properties we want to extract. Here we're only interested in extracting the question and the answer. We're not too interested in extracting the category information that we passed in earlier. Then we're going to specify how many data points we want to extract. That can be done using this with limit method, and we're going to let it know that we only want three data points back. We tell it to perform this operation, and then we can go ahead and run this. And so, here, it went into the questions class, it extracted the question and answer, and it did this for three data points. We can see that we got three questions and answers back, and you can even see that these three questions can be found in the list of data that we have here.
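The schema setup, batch import, and the two queries walked through above might look something like the sketch below. It assumes a v3-style `weaviate.Client` is passed in as `client`, that the class is named `Question`, that the OpenAI vectorizer module is called `text2vec-openai`, and that the source records use the keys `Answer`, `Question`, and `Category`. The function names are hypothetical, and none of this is confirmed by the transcript itself.

```python
CLASS_NAME = "Question"  # assumed name for the class the video calls "questions"


def reset_class(client, class_name=CLASS_NAME):
    """Delete the class if it already exists, then create it from scratch."""
    if client.schema.exists(class_name):
        client.schema.delete_class(class_name)
    class_obj = {
        "class": class_name,
        "vectorizer": "text2vec-openai",  # assumed OpenAI vectorizer module name
    }
    client.schema.create_class(class_obj)


def import_questions(client, data, class_name=CLASS_NAME):
    """Add every record to the class through the client's batch interface."""
    with client.batch as batch:
        for i, d in enumerate(data):
            print(f"importing question: {i + 1}")
            # Map the source keys (assumed capitalized) to the stored properties.
            properties = {
                "answer": d["Answer"],
                "question": d["Question"],
                "category": d["Category"],
            }
            batch.add_data_object(data_object=properties, class_name=class_name)


def count_objects(client, class_name=CLASS_NAME):
    """Run an aggregation query that returns the object count for the class."""
    return client.query.aggregate(class_name).with_meta_count().do()


def fetch_sample(client, limit=3, class_name=CLASS_NAME):
    """Fetch the question and answer properties for a limited number of objects."""
    return client.query.get(class_name, ["question", "answer"]).with_limit(limit).do()
```

Taking the client as a parameter keeps each step independent of how the instance was started. The class name passed to the import step has to match the one used when the class was created, which is why a single constant is shared here.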
In this video, we set up a vector database running locally on our computer and added ten objects to it. In the next video, we'll see where the vectors for these data objects are and how we can use them to perform vector search.