From the course: Vector Databases in Practice: Deep Dive
Chunking longer texts
- [Instructor] Once you've extracted the text data, there's just one more thing to consider before adding the data to a vector database. You need to decide whether and how to split up the source data into smaller sections. This topic is known as chunking. At a high level, you can think of chunking as a way to define a unit of information. When it comes to databases, chunking defines the smallest amount of retrievable information. At a library, a unit of information might be a book, but in a book's index, the unit of information might be a page. In a database, a unit of information is a data object. So the question is, how much information will each data object contain? This is an especially important topic in vector databases, where each chunk is going to be represented by a vector. You should also know that there are trade-offs depending on the size of each chunk. What might happen if each chunk was large, like if each one contained a chapter of text? Well, the good news is that there would not be too many chunks in our database, and each chunk would contain a lot of rich contextual information. That might make it easier to retrieve the right chunk with a search and then to understand what that chunk is about. But it also means that finding specific information could be challenging, like getting an entire book when what you actually want is a specific passage from within it. And if each chunk was too small, it would be like looking for an index card or a sticky note in a very big pile of them. This would make it easier to find the specific, granular passages we're looking for, but it could lead to the opposite problem. Short passages can often be confusing when taken out of context, like a baffling sticky note from months and months ago. So you can see that chunking is a nuanced topic and, unfortunately, there's no one-size-fits-all answer, but we can provide some general tips and starting points that work relatively well. Let's go back to the example of our Wikipedia page. One really great chunking method is to use available section markers. Here, the text helpfully includes headings like these, and the text within each section contains a related, coherent idea. That naturally makes each section a good candidate for a text chunk. The section titles can even be used as additional structured information. But this method might not always be available, as extracting section titles isn't possible in every text example, as you've seen before. So another really good approach is to split the source text simply by length. This is a simple but quite effective and robust approach. You can set a maximum word count or character count for each chunk and split the text accordingly. As a rule of thumb, a good starting point would be something like 100 to 150 words or 500 to 700 characters per chunk, with an optional overlap of, say, 10 to 15%. This will lead to chunks that have enough information to be meaningful, and the overlap will provide some robustness against awkward splits mid-sentence or mid-word. Chunking longer text for ingestion into a database is an important and nuanced task. Although there's no one-size-fits-all answer, you can start with chunking by sections or by length, as we discussed. These guidelines will give you a pretty good starting point from which you can make adjustments to suit your specific needs. The short sketches below illustrate both approaches.
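
Here is a minimal sketch of the section-based approach, assuming the extracted text marks its sections with Markdown-style "## Heading" lines (as a Wikipedia-style export might). The function name is illustrative, not from any particular library. Keeping the title with each chunk lets you store it as additional structured information alongside the vector.

```python
import re

def chunk_by_sections(text):
    """Split text on level-2 Markdown headings; return (title, body) pairs."""
    sections = []
    title = "Introduction"   # covers any text before the first heading
    body_lines = []
    for line in text.splitlines():
        match = re.match(r"^##\s+(.*)", line)
        if match:
            # A new heading closes the previous section, if it had any text.
            if body_lines:
                sections.append((title, "\n".join(body_lines).strip()))
            title = match.group(1).strip()
            body_lines = []
        else:
            body_lines.append(line)
    # Don't forget the final section.
    if body_lines:
        sections.append((title, "\n".join(body_lines).strip()))
    return sections
```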
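
And here is a minimal sketch of the length-based approach, using the rule-of-thumb numbers from above (about 125 words per chunk with a 10 to 15% overlap) as defaults. Again, the names are purely illustrative, and the file in the usage example is a stand-in for whatever text you've extracted.

```python
def chunk_by_words(text, chunk_size=125, overlap_pct=0.10):
    """Split text into chunks of roughly chunk_size words, where each
    chunk overlaps the previous one by overlap_pct of its length."""
    words = text.split()
    overlap = int(chunk_size * overlap_pct)
    step = max(chunk_size - overlap, 1)   # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunk_words = words[start:start + chunk_size]
        if chunk_words:
            chunks.append(" ".join(chunk_words))
        if start + chunk_size >= len(words):
            break   # the last chunk already reaches the end of the text
    return chunks

# Example usage with a hypothetical plain-text file:
with open("article.txt") as f:
    chunks = chunk_by_words(f.read(), chunk_size=125, overlap_pct=0.12)
```

Splitting on whole words rather than raw characters avoids cutting words in half, and the overlapping window means a sentence that straddles a boundary still appears intact in at least one chunk.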