Most people talk about AI agents Very few talk about context engineering Context engineering is quickly becoming the skill that separates agents that hallucinate from agents that reliably execute multi step workflows. To make this concrete, look at Manus AI. They were among the first to ship a browser agent that can navigate, extract, summarize, and act inside the web environment. And they published one of the clearest deep dives into how a production agent actually manages context at scale. After studying their pipeline, I put together a set of flashcards as a quick primer. Here is the distilled version. 1. Context is not a prompt. It is a system. Manus structures context across instructions, tool outputs, files, memory, and agent states. They treat it like an engineered data flow rather than a single block of text. When a system runs hundreds of steps in a loop, this precision matters. 2. KV cache stability is free performance. A stable prefix lets the model reuse its internal cache, which reduces latency and cost significantly. Even small variations, like a random timestamp, break the cache. Optimizing this alone changes the economics of long running agents. 3. Tools should be masked, not removed. Removing a tool mid loop breaks schema predictability. Manus keeps the toolset constant and masks tools contextually. This makes agent behavior more stable and more interpretable. 4. External memory beats cramming. When browser actions generate large observations, Manus writes them to files and stores only the references in context. This avoids context bloat and keeps the agent focused. 5. Recitation increases reliability. When loops get long, agents forget. Reciting goals into a todo file, summarizing progress, and placing objectives near the attention window helps the agent stay grounded. 6. Failures belong in context. They keep errors, stack traces, and broken actions. This allows the model to observe its own mistakes and correct future steps. It is the closest thing to real self improvement agents have today. All of this makes one thing clear. As agents move from novelty to infrastructure, the real differentiation will not come from prompting. It will come from engineering the context that shapes the model’s reasoning. If you want a quick way to learn these concepts, I turned my notes into flashcards. 〰️〰️〰️ ♻️ Share this with your network 🔔 Follow me (Aishwarya Srinivasan) for more data science and AI insights
Strategies for Managing Context in Large Language Models
Explore top LinkedIn content from expert professionals.
Summary
Strategies for managing context in large language models involve carefully organizing and delivering the right information so AI systems stay focused and accurate. Context engineering is the process of selecting, structuring, and updating the data that a model "sees" during tasks, which helps prevent confusion and supports reliable results.
- Organize input data: Clean up and structure the information you feed to the model so it can access key facts without distractions or unnecessary clutter.
- Use memory wisely: Distinguish between short-term data for current tasks and long-term memory for ongoing projects to avoid overwhelming the model and maintain relevance.
- Refine context: Regularly summarize, filter, and update the working information to prevent the model from losing critical details or being misled by outdated or redundant content.
-
-
Context engineering is quickly becoming one of the most critical skills in applied AI. Not prompt tweaking. Not model fine-tuning. But knowing what information a model needs, and when to give it. That is the real unlock behind AI agents that actually work. At its core, context engineering is about delivering the right information to the model, at the right time, in the right format; so it can reason effectively. It pushes developers to think more intentionally about how they shape a model’s inputs: 🔸What does the model need to know for this task? 🔸Where should that information come from? 🔸How do we fit it within the limits of the context window? 🔸And how do we prevent irrelevant or conflicting signals from getting in the way? Why does this matter so much? In practice, most agent failures are not due to weak models. They happen because the model did not have the context it needed. It missed a key fact, relied on stale data, or was overloaded with noise. Context engineering addresses this directly. It forces you to design the flow of information step by step, not just what the model sees, but how and when it sees it. This context can come from many places: 🔹Long- and short-term memory (such as prior conversations or user history) 🔹Retrieved data from APIs, vector stores, or internal systems 🔹Tool definitions and their recent outputs 🔹Structured formats or schemas that define how information is used 🔹Global state shared across multi-step workflows Frameworks like LlamaIndex, LangGraph AI, LangChain, are evolving to support this shift, giving developers the tools to manage context with much more precision. And there are now better resources than ever to help teams write, select, compress, and organize context with real control. Image from Langchain blog. #contextengineering #llms #generativeai #artificialintelligence #technology
-
The hardest part isn’t the model. It’s the context. The biggest myth about LLMs is that success depends on picking the “right” model. In reality, models like GPT, Claude, or Gemini perform similarly for most use cases. What makes or breaks an application is the context around the model. That context comes down to three things: - The quality of the data you feed it. - How clearly you structure prompts and instructions. - How you evaluate and refine both inputs (prompts) and outputs (retrieval and responses). I’ve seen this first-hand. We once built a color-palette tool for designers. The model worked fine, but we had no way of knowing if the results were actually good. We had to bring in real designers to judge the outputs. Without that expert context, the system would have been useless. Also, models can only process a limited number of tokens. If you feed them too much context, performance drops. If you feed them too little, they will use their own knowledge and maybe hallucinate. The real challenge is finding the right balance, providing enough useful signal without overwhelming the system with noise. So instead of wasting weeks swapping models, focus on what you can actually control like clean and well-structured data, clear and concise prompts, and strong evaluation sets that help you measure progress over time. Remember, the model is just a tool. Context is what turns it into something useful.
-
LLMs/ SLMs are inherently stateless, but the future of AI and AI Agents is stateful, personalized, and persistent. The critical discipline enabling this shift is Context Engineering and it is much more than just prompt engineering. Context Engineering is the process of dynamically assembling and managing all information within an LLM’s/ SLM’s context window. Think of it as the ‘mise en place’ for your agent, ensuring it has only the most relevant, high-quality ingredients for every turn. 🏛️The Two Pillars of Stateful AI:- 1. Sessions:- These govern the ‘now’. A session is the container for a single, continuous conversation, holding the chronological dialogue history and working memory. You can view it as the temporary workbench for a project. 2. Memory:- This is the mechanism for long-term persistence across multiple sessions. Memory captures and consolidates key information, acting as an organized filing cabinet that provides a continuous, personalized experience. 🐒The Production Challenge:- Combating Context Rot A major hurdle is managing the ever-growing conversation history, which increases cost, latency, and leads to ‘context rot’ (the model's diminished ability to pay attention to critical information). ℹ️ To solve this, Context Engineering employs compaction strategies:- • Token-Based Truncation:- Simply cutting off older messages to stay within a predefined token limit. • Recursive Summarization:- Using an LLM to periodically summarize the oldest parts of the conversation, preserving context in a condensed form. 💡The Key Production Insight:- Memory generation itself? the process of Extraction (distilling key facts) and Consolidation (integrating new facts, resolving conflicts, and deleting redundant data), must be run as an asynchronous background process. This ensures the agent is snappy, responsive, and doesn't keep the user waiting while it's ‘thinking’ about what to remember. Context Engineering is the foundation for building trusted, adaptive assistants that truly learn and grow with the user. What are your biggest challenges in moving your LLM proof-of-concept into a stateful production environment? #LLMOps #AIEngineering #ContextEngineering #GenAI #MachineLearning #LLMDevelopment
-
Large context windows are now becoming a major part of model marketing. 1 million tokens. 2 million tokens. But the important question is not: “How much context can the model technically accept?” The better question is: “How much context can the model use reliably?” Those are very different things. Even when models advertise very large context windows, serious benchmarks show that reasoning quality often starts degrading much earlier — frequently somewhere around the 100K–200K token range, depending on the task. The evidence is becoming fairly consistent. Chroma’s Context Rot study tested 18 frontier models and found that every model degraded as input length increased, even on relatively simple retrieval tasks. The NoLiMa benchmark from LMU Munich and Adobe, accepted at ICML 2025, removed easy keyword-matching shortcuts and showed that 11 of 13 models dropped below 50% of their baseline accuracy at just 32K tokens. In code-heavy workloads, the degradation can be even sharper. You also pay more above 200K tokens because the model now has to process more information. Why does this happen? Three forces compound. 1. Attention dilution - As the context gets larger, the model has to distribute attention across more tokens. Specific facts become harder to retrieve reliably 2. Lost-in-the-middle behavior - Models tend to attend more strongly to the beginning and end of the input, and less reliably to information buried in the middle. 3. Distractor interference - Irrelevant but semantically similar content can actively mislead the model. This matters because real enterprise context is rarely clean. It contains duplicate documents, stale decisions, old chat history, similar tickets, outdated specs, partial tool outputs, and contradictory references. This is why context window should not be treated as memory. It is better understood as working surface area. And like any working surface, it becomes less useful when it is cluttered. A more practical concept is Maximum Effective Context Window — the amount of context a model can use with acceptable reliability for a given task. That number is usually much smaller than the advertised maximum. For high-stakes workflows — legal review, regulated document intelligence, production code agents, financial analysis, enterprise search — the answer is not simply to use a bigger window. The answer is better context engineering: - targeted retrieval - hard relevance filtering - structured chunking - reranking - pruning of stale context - separation of memory from working context - task-specific context assembly before each call A dense 100K-token context with the right information will usually outperform a diluted 1M-token context filled with chat history, logs, tool outputs, and loosely related documents. 1M tokens is a ceiling, not a destination. #EnterpriseAI #AITransformation #Trainingledtransformation
-
The interview is for a Generative AI Engineer role at Cohere. Interviewer: "Your client complains that the LLM keeps losing track of earlier details in a long chat. What's happening?" You: "That's a classic context window problem. Every LLM has a fixed memory limit - say 8k, 32k, or 200k tokens. Once that's exceeded, earlier tokens get dropped or compressed, and the model literally forgets." Interviewer: "So you just buy a bigger model?" You: "You can, but that's like using a megaphone when you need a microphone. A larger context window costs more, runs slower, and doesn't always reason better." Interviewer: "Then how do you manage long-term memory?" You: 1. Summarization memory - periodically condense earlier chat segments into concise summaries. 2. Vector memory - store older context as embeddings; retrieve only the relevant pieces later. 3. Hybrid memory - combine summaries for continuity and retrieval for precision. Interviewer: "So you’re basically simulating memory?" You: "Yep. LLMs are stateless by design. You build memory on top of them - a retrieval layer that acts like long-term memory. Otherwise, your chatbot becomes a goldfish." Interviewer: "And how do you know if the memory strategy works?" You: "When the system recalls context correctly without bloating cost or latency. If a user says, 'Remind me what I told you last week,' and it answers from stored embeddings - that’s memory done right." Interviewer: "So context management isn’t a model issue - it’s an architecture issue?" You: "Exactly. Most think 'context length' equals intelligence. But true intelligence is recall with relevance - not recall with redundancy." #ai #genai #llms #rag #memory
-
Most engineers treat AI context windows like infinite RAM. Your agent fails not because the model is bad, but because you're flooding 200K tokens with noise and wondering why it hallucinates. After building agentic systems for production teams, I've learned: 𝗔 𝗳𝗼𝗰𝘂𝘀𝗲𝗱 𝗮𝗴𝗲𝗻𝘁 𝗶𝘀 𝗮 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝘁 𝗮𝗴𝗲𝗻𝘁. Context engineering isn't about cramming more information in. It's about systematic management of what goes in and what stays out. 𝗧𝗵𝗲 𝗥𝗲𝗱𝘂𝗰𝗲 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆: 𝗦𝘁𝗼𝗽 𝗪𝗮𝘀𝘁𝗶𝗻𝗴 𝗧𝗼𝗸𝗲𝗻𝘀 𝗧𝗵𝗲 𝗠𝗖𝗣 𝗦𝗲𝗿𝘃𝗲𝗿 𝗧𝗿𝗮𝗽: Most teams load every MCP server by default. I've seen 24,000+ tokens (12% of context) wasted on tools the agent never uses. 𝗧𝗵𝗲 𝗙𝗶𝘅: • Delete your default MCP.json file • Load MCP servers explicitly per task • Measure token cost before adding anything permanent This one change saves 20,000+ tokens instantly. 𝗧𝗵𝗲 𝗖𝗟𝗔𝗨𝗗𝗘.𝗺𝗱 𝗣𝗿𝗼𝗯𝗹𝗲𝗺: Teams build massive memory files that grow forever. 23,000 tokens of "always loaded" context that's 70% irrelevant to the current task. 𝗧𝗵𝗲 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: • Shrink CLAUDE.md to absolute universal essentials only • Build `/prime` commands for different task types • Load context dynamically based on what you're actually doing 𝗘𝘅𝗮𝗺𝗽𝗹𝗲: ``` /prime-bug → Bug investigation context /prime-feature → Feature development context /prime-refactor → Refactoring-specific context ``` Dynamic context beats static memory every time. 𝗧𝗵𝗲 𝗠𝗲𝗻𝘁𝗮𝗹 𝗠𝗼𝗱𝗲𝗹 𝗦𝗵𝗶𝗳𝘁 Stop thinking: "How do I get more context in?" Start thinking: "How do I keep irrelevant context out?" 𝗪𝗵𝗮𝘁 𝗦𝗲𝗽𝗮𝗿𝗮𝘁𝗲𝘀 𝗪𝗶𝗻𝗻𝗲𝗿𝘀 𝗳𝗿𝗼𝗺 𝗟𝗼𝘀𝗲𝗿𝘀: ✓ Winners: Measure token usage per agent operation ✗ Losers: "Just throw everything in the context" ✓ Winners: Design context architecture before writing prompts ✗ Losers: Keep adding to claude.md when agents fail Your agent's intelligence ceiling is your context management ceiling. --- What's the biggest waste of tokens in your AI setup right now? #ContextEngineering #AgenticEngineering #AIAgents #DeveloperProductivity #SoftwareArchitecture [Human Generated, Human Approved]
-
You can’t 𝐛𝐮𝐢𝐥𝐝 𝐫𝐞𝐥𝐢𝐚𝐛𝐥𝐞 𝐀𝐈 𝐚𝐠𝐞𝐧𝐭𝐬 without mastering this core skill: 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠. → Here’s the full picture in one visual. 👇 Here’s what’s really going on when you want your agent to reason, retrieve, interact with tools, and stay efficient over time: 🔹 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 is the process of selecting and feeding the right information into the LLM’s context window. It guides the model’s output during reasoning or task execution. It includes: → 𝐈𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧𝐬 (prompts, few-shot, tool specs) → 𝐊𝐧𝐨𝐰𝐥𝐞𝐝𝐠𝐞 (facts, memories) → 𝐓𝐨𝐨𝐥𝐬 (feedback from tool calls) 🔍 When applied to AI Agents, it solves a major 𝐩𝐚𝐢𝐧 𝐩𝐨𝐢𝐧𝐭: > Repeated LLM + tool interleaving in long tasks = large token usage ⚠️ Token overload leads to: → Exceeding context window → Increased latency and cost → Poorer agent reasoning 🔹 To manage that, here are 4 context engineering 𝐬𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬: 1. 𝐖𝐫𝐢𝐭𝐞 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 (scratchpads, memories, tool call state) 2. 𝐒𝐞𝐥𝐞𝐜𝐭 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 (choose what to load using RAG, memory types) 3. 𝐂𝐨𝐦𝐩𝐫𝐞𝐬𝐬 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 (summarization, trimming) 4. 𝐈𝐬𝐨𝐥𝐚𝐭𝐞 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 (multi-agent systems) 🧩 It's not just about stuffing more into a prompt—it’s about orchestrating memory, retrieval, and structure across the agent's workflow. Source: 👇
-
To build effective agents, you need sophisticated context engineering. But to achieve sophisticated context engineering at scale, you need agentic systems managing that context ⁉️ Everyone assumes larger context windows solve the problem. They don't. Transformers have an n² attention problem: every token attends to every other token. As context grows, the model's ability to capture these pairwise relationships gets stretched thin. Why Manual Curation Fails at Scale Consider a real agent workflow: multi-hour codebase migration, complex research synthesis, or financial analysis across hundreds of documents. Your agent generates: → Thousands of tool outputs → Multi-step reasoning chains → Execution traces with success/failure signals → Architectural decisions and dependencies → Domain-specific heuristics discovered through trial-and-error A human cannot process this velocity of information and make real-time decisions about what to compress, persist to memory, or discard. The cognitive load exceeds human reaction time capabilities. The Agentic Context Engineering Solution Research from Stanford's ACE (Agentic Context Engineering) framework proves this approach works in production. They implement a three-agent architecture: Generator: Produces reasoning trajectories and surfaces effective strategies Reflector: Critiques execution traces to extract concrete lessons Curator: Synthesizes updates into structured, itemized contexts Results: 10.6% improvement on agent benchmarks, 8.6% on domain-specific tasks. They matched IBM's production-level system while using smaller open-source models. The Technical Mechanisms That Matter Three core techniques emerged across all research: 1️⃣ Incremental Delta Updates: Instead of rewriting entire contexts (which causes "context collapse"), use structured bullets with metadata. Update only relevant sections. ACE reduced adaptation latency by 87% using this approach. 2️⃣ Just-in-Time Retrieval: Don't pre-load everything. Agents maintain lightweight identifiers (file paths, graph entity IDs) and dynamically load data using tools. Anthropic's Claude Code demonstrates this: it uses commands like head, tail, and grep to analyze large datasets without loading full objects into context. 3️⃣ Grow-and-Refine with De-duplication: Let contexts expand adaptively while using semantic embeddings to prune redundancy. This prevents both information loss and context bloat. GEPA (Genetic-Pareto prompt evolution) demonstrates this with reflective optimization. An agent analyzes execution traces, identifies which context elements were useful or misleading, and autonomously proposes improvements. It achieved 10-19% better performance than reinforcement learning while using 35× fewer rollouts. Knowledge graphs are essentially pre-computed indexes of high-signal relationships. Instead of hoping an LLM extracts relationships from unstructured text in context, you make them explicit and queryable.
-
A common challenge when building long-horizon agents is managing context. Tool results, the model's own reasoning, and user messages all accumulate, and eventually you either hit the token limit or start paying for context that isn't helping anymore. There are many levers for managing context, including compaction, tool-result clearing, and memory, which each come with benefits and trade-offs. Very excited to share this new cookbook that compares context engineering strategies for long-running agents! It covers under-the-hood implementations of core primitives, plus trajectory plots showing their real impact on a long-running research agent using Anthropic's APIs. What this cookbook covers: - How to cap in-session token growth when an agent's context is dominated by large, re-fetchable tool results like file reads and API responses - How to keep long conversations going with server-side compaction - How to persist agent knowledge across sessions by implementing a file-backed memory handler that the model drives itself - How to implement each primitive most effectively, replacing the default compaction prompt to preserve what your agent needs, guiding what the agent writes to memories, and testing clearing configs against your own workload's tool-use pattern - How to diagnose which part of the context problem your workload actually has, and pick the primitive that targets it, with a framework for mapping workload characteristics to the right tool Read more here: https://lnkd.in/g_Zfkc2P