From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)

Cache management

A cache, once set up, has a long life, and it may not achieve optimal behavior right from the start. Let's go through some best practices to maximize the effectiveness of caching with vector databases.

First, measure the cache hit ratio. This is the ratio of the number of prompts served from the cache to the total number of prompts. The higher the hit ratio, the more efficient the cache is. Some use cases benefit a lot because users ask similar questions, while other use cases may not benefit at all.

Next, it's also important to find the right similarity threshold for the distance. If the distance threshold is too small, few prompts will match a cached entry, and we will use the LLM more often. If the distance threshold is too high, prompts will match cached entries that are not really similar, and we will be returning inaccurate results from the cache. It's important to run benchmarks with a dataset of prompts and their correct responses to determine the right similarity threshold. At this value, the cache should return accurate responses while maximizing the cache hit ratio.

A cache can grow too big over time, impacting the efficiency and relevancy of the results. Set a limit for the cache size and manage it over time. It's recommended to add a last-used timestamp to the cache collection and update it every time a cached entry is returned to the user. This helps track which entries are used often and which ones are not. To control the cache size, prune entries in the cache. It is recommended to prune them based on their age as well as when they were last used.

It is also a good practice to get user feedback on whether the answers returned from the cache are correct and relevant. This feedback can be in the form of a thumbs up or thumbs down in the user interface.
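
To make these practices concrete, here is a minimal sketch in Python. It is not taken from the course and does not use any particular vector database: it assumes a small in-memory list in place of a cache collection, Euclidean distance as the distance measure, and hypothetical names (`SemanticCache`, `distance_threshold`, `max_age_seconds`, `max_idle_seconds`) chosen only for illustration. It shows the hit ratio counter, the distance threshold check, the last-used timestamp, and pruning by age and by last use.

```python
import math
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CacheEntry:
    embedding: List[float]   # vector for the cached prompt
    response: str            # cached LLM response
    created_at: float        # when the entry was stored
    last_used: float         # when the entry was last returned


@dataclass
class SemanticCache:
    # All parameter names here are hypothetical, for illustration only.
    distance_threshold: float   # largest distance that still counts as a hit
    max_age_seconds: float      # prune entries older than this
    max_idle_seconds: float     # prune entries not used for this long
    entries: List[CacheEntry] = field(default_factory=list)
    hits: int = 0
    lookups: int = 0

    def lookup(self, embedding: List[float], now: Optional[float] = None) -> Optional[str]:
        """Return the cached response for the nearest entry, or None on a miss."""
        now = time.time() if now is None else now
        self.lookups += 1
        best = min(
            self.entries,
            key=lambda e: math.dist(e.embedding, embedding),
            default=None,
        )
        if best is not None and math.dist(best.embedding, embedding) <= self.distance_threshold:
            best.last_used = now   # update the last-used timestamp on every hit
            self.hits += 1
            return best.response
        return None                # caller falls back to the LLM and stores the result

    def store(self, embedding: List[float], response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.entries.append(CacheEntry(list(embedding), response, now, now))

    def hit_ratio(self) -> float:
        """Prompts served from the cache divided by total prompts looked up."""
        return self.hits / self.lookups if self.lookups else 0.0

    def prune(self, now: Optional[float] = None) -> int:
        """Drop entries that are too old or unused for too long; return how many were removed."""
        now = time.time() if now is None else now
        before = len(self.entries)
        self.entries = [
            e for e in self.entries
            if now - e.created_at <= self.max_age_seconds
            and now - e.last_used <= self.max_idle_seconds
        ]
        return before - len(self.entries)
```

One way to use a sketch like this for benchmarking is to replay the same dataset of prompts through caches built with different `distance_threshold` values, then compare `hit_ratio()` alongside how many of the returned responses were actually correct. In a real deployment, the nearest-neighbor search and the timestamp fields would live in the vector database rather than in a Python list.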
