From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)

LLMs and caching

Let's now explore how to use a vector database to cache prompts and responses from large language models, or LLMs for short. First, let's review some shortcomings of LLMs and how caching can help with these issues. Large language models took the world by storm in 2023, and there is huge interest in using them for business purposes. A lot of innovation is happening, and several business applications powered by LLMs are being built. But the problem is the cost of LLMs. It takes a lot of resources to build, deploy, maintain, and use an LLM, so businesses are staying away from building their own models from scratch. On the other hand, when they use cloud LLMs, the cost per inference call is also high. This restricts LLMs to only those use cases where the returns justify the cost. In addition, LLMs generate one token at a time due to how the decoder in the transformer architecture functions. This results in high latency, especially when the responses are long.

How can caching help? In a given organization or context, users send similar prompts to the LLM, resulting in similar responses. There is a lot of overlap across users in what they use LLMs for. So instead of sending every prompt to the LLM and incurring high cost and latency, a cache can be used to store prompts and their responses. If a prompt and its response are cached and a similar prompt is later seen from another user, the response can be served from the cache instead of going to the LLM. Prompt and response caching is becoming an essential component of generative AI applications built on LLMs. In this chapter, we will discuss how to use a vector database as a cache for LLMs.
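To make the idea concrete, here is a minimal sketch of semantic prompt caching, not the course's implementation. It assumes the sentence-transformers package for embeddings, uses a hypothetical call_llm() function as a stand-in for whatever LLM API you call, and keeps cached entries in an in-memory list; a real setup would store the embeddings in a vector database instead.

```python
# Minimal sketch of a semantic cache for LLM prompts and responses.
# Assumptions: sentence-transformers is installed, and call_llm() is a
# hypothetical placeholder for your actual LLM API call. A production
# setup would use a vector database rather than this in-memory list.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, prompt, response) tuples

SIMILARITY_THRESHOLD = 0.9  # tune for your use case


def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real (and costly) LLM inference call.
    return f"LLM response for: {prompt}"


def get_response(prompt: str) -> str:
    # Embed the incoming prompt; normalized vectors let us use a dot
    # product as cosine similarity.
    query_vec = embedder.encode(prompt, normalize_embeddings=True)

    # Check the cache for a semantically similar prompt.
    for cached_vec, cached_prompt, cached_response in cache:
        similarity = float(np.dot(query_vec, cached_vec))
        if similarity >= SIMILARITY_THRESHOLD:
            return cached_response  # cache hit: skip the LLM call

    # Cache miss: call the LLM and store the result for future prompts.
    response = call_llm(prompt)
    cache.append((query_vec, prompt, response))
    return response


print(get_response("How do I reset my password?"))
print(get_response("What are the steps to reset my password?"))  # likely a cache hit
```

The second call illustrates the payoff: a differently worded but semantically similar prompt can be answered from the cache, avoiding the cost and latency of another LLM call.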
