From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)
Distance measure considerations
When doing semantic search with vector databases, a key design consideration is the distance measure, and it's critical to understand how it behaves for your specific use case. As seen in the earlier code examples, a vector search will always return hits as long as there are records available in the database. If we set a limit of 10 in the query, it will return 10 records as long as there are at least 10 records in the database. The results are sorted by the distance between the embedding of the search string and the embeddings of the strings in the database.

So how do we determine if there is actually a match? We need to use distance or similarity thresholds. A distance threshold is the maximum distance at which we still consider a result to be a match. When a search is executed in Milvus, we can set the radius search parameter to this value so the search only returns those results where the distance is below the radius.

What exactly do we mean by similar when comparing two strings? How close should these two strings be? Is it okay if they are just about the same topic, or should there be an exact match? This is determined by the specific use case. Hence, the similarity or distance threshold should also be determined by the use case.

Do note that the embedding model and the metric type both affect the distance values, and hence the thresholds being set. A threshold tuned for one model or metric should not be reused as-is for another. Sometimes custom, domain-specific embeddings may be used for special purposes. In that case, the distance thresholds may be lower, as we expect much closer matches. Let's continue this discussion in the next video.
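
To make the thresholding idea concrete, here is a minimal sketch in plain Python. It is not Milvus code: the `search` function, the two-dimensional toy vectors, and the threshold of 0.5 are all made up for illustration, and L2 (Euclidean) distance is assumed as the metric. It shows that a search without a threshold returns every record up to the limit, while a radius keeps only the records that are close enough to count as a match.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, records, limit=10, radius=None):
    """Brute-force search over (text, vector) records (hypothetical helper).

    Without a radius, the closest `limit` records are returned no matter
    how far away they are. With a radius, only records whose distance is
    strictly below it are kept.
    """
    hits = sorted(
        ((l2_distance(query, vec), text) for text, vec in records),
        key=lambda hit: hit[0],
    )
    if radius is not None:
        hits = [hit for hit in hits if hit[0] < radius]
    return hits[:limit]

# Toy 2-dimensional "embeddings", invented for this example.
records = [
    ("refund policy", [0.9, 0.1]),
    ("return an item", [0.8, 0.3]),
    ("office opening hours", [0.1, 0.9]),
]
query = [1.0, 0.0]

unfiltered = search(query, records, limit=10)            # all 3 records come back
filtered = search(query, records, limit=10, radius=0.5)  # only the 2 close ones

for distance, text in filtered:
    print(f"{distance:.3f}  {text}")
```

In a real setup the filtering would be done by the database through its radius search parameter rather than in application code. Also keep in mind that with a similarity metric, where larger values mean closer matches, the comparison runs in the opposite direction, so the threshold becomes a minimum rather than a maximum.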