From the course: LLM Foundations: Vector Databases for Caching and Retrieval Augmented Generation (RAG)
Inference process and caching
Now, we will exercise the prompt caching workflow when a user enters a prompt. We begin by setting up the OpenAI key. As discussed before, it's recommended to use your own OpenAI key for this purpose. We create an LLM object for the actual model that will generate the responses. In addition, we also set up the OpenAI embeddings model to get the embedding vectors. The similarity threshold is set to 0.3. This is the maximum distance permitted for a match. With L2, distances start from zero for an exact match between the input prompt and the cached prompt. We set the threshold to 0.3, so only matches with a distance of less than 0.3 are considered cache hits. We then set up the search parameters, setting the radius parameter to the similarity threshold. This ensures that only matches with distances below this threshold are returned by Milvus.

Next, we define a function for the inference loop. This function returns a response for a given prompt, either from the cache or from the LLM, and we also time this operation. We start by converting the prompt to its embedding. Then we perform a search on the prompt embedding field with the input prompt's embedding, looking only at the top result and requesting the prompt and the response text as output fields. We first check whether any result has come back from the cache with a distance less than the similarity threshold. If so, we return that response to the user. If no results are found, we send the prompt to the LLM, get a response, and return it to the user. Additionally, we save this prompt, its embedding, and its response into the cache for future use. We are not explicitly calling a flush, as we are using the same connection object for both queries and inserts. Finally, we print the time taken and the outputs.

Let's run this code now. You can ignore this warning. We then send a series of five prompts with different general knowledge questions to the get_response function. This builds up the cache, as there are no matches in the cache yet. Let's run this code and populate the cache. You can see that the responses are indeed returned by the LLM. Next, we send a couple of prompts that are similar to some of the five prompts we sent earlier. Let's run the new prompts. Here, we see that there is a cache hit and the answers are indeed returned from the cache. So the prompt "How tall is an elephant?" is semantically similar to "What is the typical height of an elephant?", and the answer is fetched from the cache. You may not see a huge difference in latency between the cached responses and the LLM responses, because the prompts are small and the vector DB is running locally. But in real production scenarios, this difference can be significant. Using a cache also helps save expensive inference calls to the cloud LLM.
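To make the flow concrete, here is a minimal sketch of the inference loop described above, assuming the openai Python SDK and the pymilvus MilvusClient. The collection name ("prompt_cache"), the field names ("prompt_embedding", "prompt_text", "response_text"), and the model names are placeholders, not the course notebook's exact schema or models.

```python
import time
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()                           # reads OPENAI_API_KEY from the environment
milvus_client = MilvusClient(uri="http://localhost:19530")

SIMILARITY_THRESHOLD = 0.3                         # maximum L2 distance treated as a cache hit
SEARCH_PARAMS = {
    "metric_type": "L2",
    "params": {"radius": SIMILARITY_THRESHOLD},    # Milvus returns only matches below this distance
}

def embed(text: str) -> list[float]:
    """Convert a prompt into its embedding vector."""
    result = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text)
    return result.data[0].embedding

def get_response(prompt: str) -> str:
    """Return a response from the cache if a close match exists, otherwise from the LLM."""
    start = time.time()
    prompt_embedding = embed(prompt)

    # Search the cache on the prompt-embedding field; keep only the top result.
    hits = milvus_client.search(
        collection_name="prompt_cache",
        data=[prompt_embedding],
        anns_field="prompt_embedding",
        search_params=SEARCH_PARAMS,
        limit=1,
        output_fields=["prompt_text", "response_text"],
    )

    if hits and hits[0] and hits[0][0]["distance"] < SIMILARITY_THRESHOLD:
        # Cache hit: return the stored response.
        response = hits[0][0]["entity"]["response_text"]
        source = "CACHE"
    else:
        # Cache miss: call the LLM, then store the prompt, its embedding, and the response.
        completion = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        response = completion.choices[0].message.content
        milvus_client.insert(
            collection_name="prompt_cache",
            data=[{
                "prompt_embedding": prompt_embedding,
                "prompt_text": prompt,
                "response_text": response,
            }],
        )
        source = "LLM"

    print(f"[{source}] {time.time() - start:.2f}s : {prompt}")
    return response
```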
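The demo calls would then look roughly like the snippet below. The elephant prompts are the ones mentioned in this section; the other general knowledge questions in the course notebook may differ.

```python
# First pass: these prompts miss the cache, go to the LLM, and populate the cache.
get_response("What is the typical height of an elephant?")

# Second pass: a semantically similar prompt should now be answered from the cache.
get_response("How tall is an elephant?")
```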