
Structuring Applications to Secure the KV Cache

When interacting with transformer-based models like large language models (LLMs) and vision-language models (VLMs), the structure of the input shapes the model's output. But prompts are often more than a simple user query. In practice, applications optimize the response by dynamically assembling the prompt from various sources, such as system instructions, context data, and user input.

In multitenant environments, where multiple users share the same application infrastructure, this dynamic prompt construction can introduce unexpected security risks. One risk stems from a prefix caching optimization that, if not handled carefully, can leak information across user boundaries. 

This post explores the intersection of prompt structure and caching, and how their interaction can create subtle vulnerabilities in LLM-powered applications. By understanding these mechanisms, developers can design more secure systems. 

How application prompts are assembled

If you’ve only interacted with LLMs as a chatbot user, you might think of your prompt as something like:

Build me a vacation itinerary for Orlando in August.

But in most real-world applications, this user query is just one part of a larger, dynamically constructed input known as the application prompt. This prompt often includes multiple components designed to shape the model’s response more effectively. 

For example, that same vacation itinerary request might be transformed into something like:

You are a helpful travel assistant. Be courteous and avoid any topics that aren’t related to travel and building itineraries. Here’s the user’s request:
Build me a vacation itinerary for Orlando in August.
Today’s date is March 1st, 2025.
The following events are happening in Orlando in August:
Marathon (August 1)
Rock Concert (August 10)
Silent Disco (August 14)

In code, this might look like: 

application_prompt = f"{system_prompt}\n{user_prompt}\n{date}\n{context}"

Behind the scenes, the application fetched the current date and relevant local events, then stitched them together into a single prompt for the language model, as shown in Figure 1.

Figure 1. A sequence diagram showing how a vacation planning application calls the get_time and get_local_events tools and concatenates their output into the application prompt before passing it to the LLM
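In Python, the assembly step sketched in Figure 1 might look roughly like the following. This is a minimal sketch: the get_time and get_local_events helpers are hypothetical stand-ins for the tools in the diagram, and a real application would call a calendar service and an events API instead of returning hardcoded values.

import datetime

def get_time() -> str:
    # Hypothetical tool: return today's date as human-readable text.
    return datetime.date.today().strftime("%B %d, %Y")

def get_local_events(destination: str, month: str) -> str:
    # Hypothetical tool: a real application would query an events API here.
    events = ["Marathon (August 1)", "Rock Concert (August 10)", "Silent Disco (August 14)"]
    return f"The following events are happening in {destination} in {month}:\n" + "\n".join(events)

def build_application_prompt(system_prompt: str, user_prompt: str) -> str:
    # Stitch the components together, mirroring the f-string shown above.
    date = f"Today's date is {get_time()}."
    context = get_local_events("Orlando", "August")
    return f"{system_prompt}\n{user_prompt}\n{date}\n{context}"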

In this example, the prompt components were joined with newline characters (\n), but many production applications use explicit tags like <system>, <user>, <context> to separate different sections. Whether you use simple concatenation or structured tags, the underlying principle is the same: the application assembles a complete prompt tailored for the LLM. 

Sometimes applications, especially those built on reasoning and planning models, make multiple LLM calls before producing a final response. Each of these intermediate prompts may be dynamically constructed from prior reasoning steps, historical context, tool outputs, or autogenerated subprompts. For the purposes of this post, we're focused on any string sent to the LLM for inference, not just what's explicitly shown to the user. That includes internal steps in multistep reasoning processes, where the model builds up its answer over several prompts behind the scenes.

The amount of data a user can control or influence within the prompt depends on how the application was architected. In the previous vacation planning example, the user might only influence the user prompt: “Build me a vacation itinerary for Orlando in August.” But in other setups, like retrieval-augmented generation (RAG) applications, the user might also influence which documents are retrieved and included as context, giving them indirect control over other parts of the prompt.

Why the KV cache is fast

Prefix caching is a powerful performance optimization used in LLM serving systems. It works by reusing the model’s internal state for repeated prompt prefixes, enabling the system to skip redundant computations and return faster responses. 

Under the hood, this is implemented using key-value (KV) caching. As the model processes input tokens, it generates intermediate tensors—keys and values—that represent the model's state. When a new prompt shares a prefix with a previous one, the system can reuse those cached KV tensors instead of recalculating them.

For example, if a model has already processed:

The quick brown fox jumps over the lazy dog.

And a new prompt comes in:

The quick brown fox crosses the stream,

The system can skip recomputing the shared prefix, "The quick brown fox," and begin processing directly from "crosses". This optimization is especially effective when prompts share fixed system instructions, like the system prompt in the travel assistant example:

You are a helpful travel assistant. Be courteous and avoid any topics that aren’t related to travel and building itineraries. Here’s the user’s request:

Since every query starts with the same 30 tokens, the model’s KV cache for that prefix can be reused across all user requests, dramatically reducing latency and cost. In practice, this caching is done at the block level, but we’re simplifying here for illustration purposes.
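To make the block-level idea concrete, here is a toy sketch in Python. The four-token block size and the plain dictionary standing in for KV storage are illustrative assumptions, not how a production serving engine is implemented.

# Toy illustration of block-level prefix reuse (not a real serving engine).
BLOCK_SIZE = 4

def block_keys(tokens):
    # One key per full block; each key covers the entire prefix up to that block,
    # so a block is only reusable if everything before it matches as well.
    return [tuple(tokens[:i + BLOCK_SIZE]) for i in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE)]

kv_cache = {}

def process(tokens):
    reused = 0
    for key in block_keys(tokens):
        if key in kv_cache:
            reused += BLOCK_SIZE          # KV tensors already cached: skip recomputation
        else:
            kv_cache[key] = "kv tensors"  # placeholder for the real key/value tensors
    return reused

process(list(range(30)))                    # first request: nothing cached yet
print(process(list(range(28)) + [98, 99]))  # second request shares a 28-token prefix, prints 28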

This shared efficiency comes with a tradeoff in multitenant environments: prefixes can unintentionally become a timing side-channel—a subtle way for one user to infer details about other users’ prompts.

Prefix caching information leaks

Prefix caching improves LLM response times by reusing previously computed internal states. In multitenant environments where multiple users share the same cache, this optimization introduces the potential for timing-based information disclosure.

If two prompts share a long prefix, the model can skip recomputing those initial shared tokens, making the second request faster. By crafting inputs and measuring response latency, an attacker may infer that the initial part of their prompt was previously seen, potentially revealing details about other users' queries. This risk has been explored in recent research, which demonstrates how KV cache reuse can act as a covert signal, leaking information based on response time differentials alone. For details, see Auditing Prompt Caching in Language Model APIs.

Example: Inferring location and date

Returning to the travel assistant example, imagine that User A sends the following prompt:

Build me a vacation itinerary for Orlando in August.

Later, User B, an attacker, probes the system by issuing similar queries and measuring response times: 

  • “Build me a vacation itinerary for Atlanta in June.”
  • “Build me a vacation itinerary for Orlando in June.” (This is faster because of the cached KVs for “Build me a vacation itinerary for Orlando” from User A.)
  • “Build me a vacation itinerary for Orlando in July.”
  • “Build me a vacation itinerary for Orlando in August.” (This is faster because of the cached KVs from User A.)

If the final variation returns significantly faster, it suggests that a cached KV prefix for that exact phrasing already exists due to User A’s request. With enough permutations, an attacker can infer the original query’s content, even if they never saw it directly. 
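The probing loop itself is straightforward to sketch. In the hypothetical example below, query_llm is an assumed helper that sends a prompt to the shared application and returns once the response starts; as discussed later, real measurements need many repetitions to separate cache effects from other sources of noise.

import time

def query_llm(prompt):
    # Placeholder for a real request to the shared application (for example, an HTTP call).
    pass

CANDIDATES = [
    "Build me a vacation itinerary for Atlanta in June.",
    "Build me a vacation itinerary for Orlando in June.",
    "Build me a vacation itinerary for Orlando in July.",
    "Build me a vacation itinerary for Orlando in August.",
]

def average_latency(prompt, trials=5):
    # Average time-to-first-response over several trials to reduce noise.
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        query_llm(prompt)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

latencies = {p: average_latency(p) for p in CANDIDATES}
# A candidate that is consistently faster than its neighbors suggests its prefix
# was already cached by another user's earlier request.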

Risk extends beyond the user prompt

The risk doesn’t stop at user-entered queries. Many LLM applications dynamically append data to the prompt from retrieved documents or tool outputs. These additions, though not directly controlled by the user, may still become observable through timing differences. 

For example, User B sends: 

Build me a vacation itinerary for Orlando in August.\nToday’s date is February 28th.

and

Build me a vacation itinerary for Orlando in August.\nToday’s date is March 1st. 

If only the second query results in a cache hit, User B may infer when another user submitted their request, even though the date was appended by the application rather than provided by the user. It doesn't matter that the application will later append the real date to User B's query as well; that text comes after the probed prefix and doesn't affect whether the prefix cache lookup hits.

In application contexts where information is fetched based on user identity or role-based access controls, this creates the possibility of leaking sensitive application-side or privileged system-level data, not just user-entered inputs.

How cache configuration impacts exploitability

An attacker could further exploit this timing side-channel by combining it with knowledge about how the cache is configured. Every system must make decisions about how much storage to allocate for KV caching, and most implementations use a least recently used (LRU) eviction policy. Instead of retaining entries for a set time, the cache maintains a fixed-size buffer, keeping only the most recently accessed items. As new entries are added, older, less-used ones are evicted. 

This configuration provides additional context for an attacker to control and measure system state. For example, by probing which inputs still result in cache hits, an attacker might deduce how recently a given prompt (or prefix) was used or how frequently it is being accessed. This might be done passively or through active manipulation to prime or flush the cache. 
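For reference, LRU eviction behaves roughly like the following Python sketch, a simplified stand-in for the block-level bookkeeping a real serving system performs.

from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()      # prefix key -> cached KV state

    def get(self, key):
        if key not in self.entries:
            return None                   # cache miss: the prefix must be recomputed
        self.entries.move_to_end(key)     # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used prefix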

Extracting meaningful signals from timing measurements isn't trivial. The real-world performance of LLM systems is influenced by several sources of variability, including:

  • Network latency, which can introduce noise unrelated to cache behavior.
  • Batching, where multiple requests are grouped together to maximize throughput. Batching introduces uncertainty because individual requests may be delayed while waiting for the batch to fill, obscuring subtle latency differences.
  • Tool and plugin interactions, where external API calls may be invoked during prompt processing. The response time of these calls can vary significantly based on system load, data complexity, or network conditions. 

All of these factors make it difficult to definitively determine whether a response benefitted from a cache hit. Still, under the right conditions—especially with shorter, tool-free prompts or in low-traffic environments—the timing signals may be strong enough to infer useful information. 

Designing safer systems

The risks introduced by prefix caching emerge directly from how prompts are constructed and used in production. Fortunately, developers can take practical steps to reduce exposure without sacrificing performance.

Prompt structure matters

One of the most effective approaches is to be intentional about how prompts are assembled. When dynamically generating application prompts through concatenation, consider using the following ordering to minimize risk (see the sketch after this list):

  1. System prompt
  2. Unique user or session identifier
  3. Augmentation context from tools, plugins, or datastores
  4. User prompt (sanitized and validated)
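A minimal sketch of that ordering is shown below. The tag names mirror the example in the next section, and the session identifier (covered next) is passed in as an argument; simple newline concatenation works just as well, as long as variable, user-influenced content stays toward the end.

def build_prompt(system_prompt: str, session_id: str, context: str, user_prompt: str) -> str:
    # Recommended ordering: system prompt, per-session identifier, augmentation
    # context, and the sanitized user prompt last.
    return (
        f"<system>\n{system_prompt}\n"
        f"<session> Session-ID: {session_id}\n"
        f"<context> {context}\n"
        f"<user> {user_prompt}"
    )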

Structure prompts to reduce cross-user collisions

Prefix caching works by recognizing shared prefixes, so the more overlap between user prompts, the more risk of leakage. One simple mitigation is to break up common prefixes across users by including a non-guessable, user-specific identifier early in the prompt. 

For example: 

<system>
You are a helpful travel assistant.
<session> Session-ID: 2f3e1a...
<context> ...
<user> Build me a vacation itinerary for Orlando in August.

This doesn’t need to be a secret token, but it should be anonymized (not personally identifiable), hard to guess or brute force, and rotated periodically. Adding this early in the prompt forces cache separation across users, dramatically reducing the chance of information disclosure. 
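One possible way to produce such an identifier is a random token from Python's secrets module that is reissued periodically. This is a sketch only; the rotation interval, token length, and in-memory storage are assumptions.

import secrets
import time

ROTATION_SECONDS = 24 * 60 * 60      # assumed rotation interval: once per day
_session_tokens = {}                 # session key -> (token, issued_at)

def session_identifier(session_key: str) -> str:
    token, issued_at = _session_tokens.get(session_key, (None, 0.0))
    if token is None or time.time() - issued_at > ROTATION_SECONDS:
        token = secrets.token_hex(16)            # 128 random bits: hard to guess or brute force
        _session_tokens[session_key] = (token, time.time())
    return token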

Limit and validate user-controlled input

Developers should also take care to validate and sanitize user input before incorporating it into prompts. This is especially true in RAG systems or applications that fetch documents, dates, or context based on user queries. When possible (see the sketch after this list):

  • Set a maximum length to prevent overflow attacks.
  • Place tool-augmented context before the user input to avoid it being pushed out of the model’s context window.
  • Avoid echoing user input unnecessarily in multiple prompt components. 
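As a starting point, the length check and a basic sanitization pass might look like the sketch below; the character limit and the stripped characters are assumptions to adapt to your application.

MAX_USER_PROMPT_CHARS = 2000   # assumed limit; tune for your application

def validate_user_prompt(raw: str) -> str:
    # Reject overly long input so it can't crowd out system instructions or context.
    if len(raw) > MAX_USER_PROMPT_CHARS:
        raise ValueError("User prompt exceeds the allowed length")
    # Strip characters that could be used to spoof prompt-structure tags (assumed policy).
    return raw.replace("<", "").replace(">", "").strip()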

These techniques are part of a broader class of prompt hardening strategies. If you’re looking to gain hands-on experience with these types of adversarial behaviors and defenses, check out the Exploring Adversarial Machine Learning NVIDIA Deep Learning Institute course. 

Consider cache partitioning or isolation

If infrastructure allows, consider isolating KV caches across tenants entirely. This could mean (see the sketch after this list):

  • Partitioning by session or tenant ID
  • Using different cache keys per tenant
  • Disabling prefix caching in high-risk contexts
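If your serving stack doesn't expose per-tenant cache controls directly, one application-level approximation is to salt every prompt with a tenant-specific value so that prefixes can never match across tenants. In this sketch, CACHE_SALT_KEY is assumed to come from your secret management.

import hashlib
import hmac
import os

CACHE_SALT_KEY = os.environ["CACHE_SALT_KEY"]    # assumed secret loaded from configuration

def tenant_scoped_prompt(tenant_id: str, application_prompt: str) -> str:
    # Derive a stable, tenant-specific salt and place it at the very start of the
    # prompt, so cached prefixes from one tenant never match another tenant's requests.
    salt = hmac.new(CACHE_SALT_KEY.encode(), tenant_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"<tenant> {salt}\n{application_prompt}"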

While this may reduce some performance benefits, it can be a worthwhile tradeoff in regulated or sensitive environments. 

Monitor for timing-based enumeration

Even with mitigations, systems should be monitored for suspicious patterns that suggest enumeration attempts like repeated queries with minor variations, high volumes of near-duplicate prompts, or latency-sensitive probing. Pairing rate limiting with anomaly detection can help flag potential abuse.
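As one simple heuristic, you could flag clients that submit many distinct prompts sharing the same long prefix within a short window. This is a sketch only; the prefix length, window, and threshold are assumptions.

import time
from collections import defaultdict

PREFIX_CHARS = 40        # assumed: prompts are grouped by their first 40 characters
WINDOW_SECONDS = 300     # assumed: 5-minute observation window
ALERT_THRESHOLD = 10     # assumed: distinct variants before flagging

_recent = defaultdict(list)   # (client_id, prefix) -> [(timestamp, prompt), ...]

def looks_like_probing(client_id: str, prompt: str) -> bool:
    key = (client_id, prompt[:PREFIX_CHARS])
    now = time.time()
    entries = [(t, p) for t, p in _recent[key] if now - t < WINDOW_SECONDS]
    entries.append((now, prompt))
    _recent[key] = entries
    # Many distinct prompts sharing one prefix in a short window resembles the
    # permutation probing described earlier.
    return len({p for _, p in entries}) >= ALERT_THRESHOLD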

Conclusion

Prefix cache reuse can offer a significant performance gain in LLM applications, but in shared environments, it can also introduce subtle information disclosure risks. When prompts are dynamically assembled from user input, tool outputs, and contextual data, the structure of that prompt is a design choice that impacts both performance and security.

Reuse of a shared cache may enable determined attackers to infer parts of other users' prompts through timing side channels. The impact of this disclosure is amplified when the application includes sensitive user-influenced or identity-dependent context.

By isolating cache usage, structuring prompts thoughtfully, and validating inputs carefully, developers can reduce exposure while maintaining performance. For a hands-on look at these and related threats, explore the Exploring Adversarial Machine Learning NVIDIA Deep Learning Institute course. To dive deeper into real-world red teaming insights and techniques for AI systems, check out the related NVIDIA Technical Blog posts.
