From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)
Prompt caching workflow
How does the prompt caching process work with LLMs and vector databases? Let's walk through a workflow for prompt caching. In prompt caching, we cache the prompt, the embedding for the prompt, and the response in a vector database. This database serves as a local cache. A user issues prompts to the LLM through a user interface; prompts can also be triggered by applications.

The workflow receives the input prompt from the user. First, the input prompt is converted to its equivalent embedding. We need to use the same embedding model that was used for the embeddings already in the cache. We then compare this prompt embedding with the other prompt embeddings in the cache to see if there are similar prompts. We use a distance threshold to make that decision: if the distance between the incoming prompt and a cached prompt falls below this threshold, we treat them as similar. Do note that, based on the metric used, the range of distances may differ.

If a similar prompt is found in the cache below the distance threshold, then the cached response for that prompt is returned to the user. If a similar prompt is not found, then we go to the LLM and fetch the response for that prompt. This, of course, incurs additional cost and latency. The response from the LLM is returned to the user. In addition, the prompt, the prompt embedding, and the response are added to the cache, again using the same embedding model as for the input prompt.

Initially, the cache is empty and most prompts go to the LLM, but as the cache builds up through this workflow, more responses will be returned from the cache. We will implement a simple cache with Milvus and OpenAI in this chapter.
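To make the workflow concrete before we build it out, here is a minimal sketch of a semantic prompt cache using the pymilvus MilvusClient (with Milvus Lite as a local, file-backed database) and the OpenAI Python client. The collection name prompt_cache, the embedding model text-embedding-3-small, the chat model gpt-4o-mini, the L2 distance metric, and the threshold value are all illustrative assumptions, not necessarily the exact settings used later in this chapter.

```python
# A sketch of the prompt caching workflow, assuming:
#   - Milvus Lite via pymilvus's MilvusClient (local file-backed instance)
#   - OpenAI's text-embedding-3-small (1536 dims) and gpt-4o-mini models
#   - an L2 distance metric, where a smaller distance means a closer match
# The collection name, field names, and threshold are illustrative choices.
from openai import OpenAI
from pymilvus import MilvusClient

oai = OpenAI()                              # reads OPENAI_API_KEY from the environment
cache = MilvusClient("./prompt_cache.db")   # Milvus Lite: local file-backed cache

EMBED_MODEL = "text-embedding-3-small"      # must match the model used for cached embeddings
CHAT_MODEL = "gpt-4o-mini"
DISTANCE_THRESHOLD = 0.3                    # tune per embedding model and distance metric

if not cache.has_collection("prompt_cache"):
    cache.create_collection(
        collection_name="prompt_cache",
        dimension=1536,                     # embedding size of text-embedding-3-small
        metric_type="L2",                   # lower distance = more similar
        auto_id=True,
    )

def embed(text: str) -> list[float]:
    """Convert a prompt to its embedding, using the same model as the cached embeddings."""
    return oai.embeddings.create(model=EMBED_MODEL, input=[text]).data[0].embedding

def cached_completion(prompt: str) -> str:
    """Return a cached response for a similar prompt, or call the LLM and cache the result."""
    embedding = embed(prompt)

    # Step 1: look for a similar prompt already in the cache.
    hits = cache.search(
        collection_name="prompt_cache",
        data=[embedding],
        limit=1,
        output_fields=["prompt", "response"],
    )[0]
    if hits and hits[0]["distance"] <= DISTANCE_THRESHOLD:
        return hits[0]["entity"]["response"]    # cache hit: no LLM call needed

    # Step 2: cache miss -> fetch the response from the LLM (extra cost and latency).
    response = oai.chat.completions.create(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Step 3: add the prompt, its embedding, and the response to the cache.
    cache.insert(
        collection_name="prompt_cache",
        data=[{"vector": embedding, "prompt": prompt, "response": response}],
    )
    return response

print(cached_completion("What is a vector database?"))   # first call goes to the LLM
print(cached_completion("What's a vector database?"))    # likely served from the cache
```

With a sketch like this, the first call for a given question goes to the LLM, while later, semantically similar prompts are answered from the cache, which is exactly the build-up behavior described above.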