From the course: Vector Databases in Practice: Deep Dive
Messiness of real data
- [Instructor] Up until now, we've dealt with relatively neat data sets that have been pretty straightforward to work with. Our movie data, for example, is contained in a tabular CSV format, and we haven't had to do much more processing than deciding which columns to use and what the corresponding properties might look like in our database. When it comes to dealing with real data, though, things can look a little bit trickier. Data, especially semantic data like text, can often be quite messy.
Take a look at this web page, for example, from Wikipedia. There's a great deal of really useful information in here, but it doesn't quite lend itself to a simple data import, like the ones we've done so far, for a few reasons.
First of all, you'll see that the page is actually quite long. Remember that each data object will be saved with one vector that represents its meaning. What this means is that in this particular case, we might not want to save the entire page as one data object, but instead break it down into a series of smaller portions, so that a search can find the relevant sections with better precision. This is called chunking, and we'll cover some examples of how to chunk data very soon.
Another is a matter of context. If we're reading a book, we would be aware of the title of the book and perhaps already have an idea of what it's about. This information is a very useful part of understanding the overall meaning of a data object. So we might want to consider providing this type of overall, higher-level information to each chunk as well. But how much of it should we provide? Should we, for example, save the page title for each chunk? What about its section headings, or the URL if it's a website? In some cases, you might also want to save a summary of the overall page, as well as the chunk itself.
Another is just a matter of formatting. This web page is nicely presented in our browser, but take a look at the underlying code. Yikes. Well, that's not very readable at all, is it? All of this extra code defines how the page is organized and displayed in our web browser so that we as humans find it easier to read. But it's not really semantically useful. So what we need to do is find some way of extracting the information such that key aspects, like section headings and paragraph markers, are identified, and we can discard the purely cosmetic things, like typefaces and background colors, as well as layouts.
The good news is, we won't need to build tools from the ground up to perform these tasks. Because this is a fairly common challenge, many tools already exist that can help us.
So in summary, these are some of the typical things you'll be doing with real data: chunking, deciding what contextual information to include with each chunk, and extracting semantic information. Over the next couple of sections, we'll take a look at how we might be able to tackle some of these challenges.
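As a rough illustration of the three tasks in that summary, here is a minimal Python sketch that uses only the standard library. It pulls the title, headings, and paragraphs out of a small HTML page, drops the cosmetic markup, splits the paragraphs into chunks, and attaches the page title and section heading to each chunk. The names `Chunk`, `SectionExtractor`, `chunk_page`, and `text_to_embed`, the `max_chars` limit, and the sample page are all made up for this illustration and do not come from the course. A real project would more likely rely on one of the existing tools mentioned above.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Made-up sample page standing in for a real web page.
SAMPLE_HTML = """
<html><head><title>History of computing</title>
<style>p { font-family: serif; }</style></head>
<body>
<h1>History of computing</h1>
<p style="color: #333">Computing has a <b>long</b> history.</p>
<h2>Early devices</h2>
<p>The abacus was an early aid for arithmetic.</p>
<p>Mechanical calculators followed many centuries later.</p>
</body></html>
"""


@dataclass
class Chunk:
    page_title: str  # higher-level context shared by every chunk
    section: str     # nearest preceding heading
    text: str        # the chunk itself


class SectionExtractor(HTMLParser):
    """Keeps text found in <title>, headings, and <p>; ignores everything else."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.page_title = ""
        self.paragraphs = []  # list of (heading, paragraph text)
        self._heading = ""
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="..." are deliberately discarded.
        if tag in self.HEADINGS or tag in ("title", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only collect text while inside a tag we care about.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return  # e.g. </b> inside a paragraph, or </style>
        text = " ".join("".join(self._buffer).split())
        self._current_tag = None
        if not text:
            return
        if tag == "title":
            self.page_title = text
        elif tag in self.HEADINGS:
            self._heading = text
        else:
            self.paragraphs.append((self._heading, text))


def chunk_page(html, max_chars=200):
    """Split each paragraph on word boundaries into chunks of at most max_chars."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    chunks = []
    for heading, text in parser.paragraphs:
        current = ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(Chunk(parser.page_title, heading, current))
                current = word
            else:
                current = candidate
        if current:
            chunks.append(Chunk(parser.page_title, heading, current))
    return chunks


def text_to_embed(chunk):
    """One possible way to give a chunk its context before vectorizing it."""
    return f"{chunk.page_title} | {chunk.section} | {chunk.text}"


if __name__ == "__main__":
    for chunk in chunk_page(SAMPLE_HTML):
        print(text_to_embed(chunk))
```

Prepending the title and heading is only one answer to the "how much context?" question. Storing them as separate properties on each object, or adding a page summary, are equally reasonable choices depending on the search behavior you want.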