The interview is for a Generative AI Engineer role at Cohere.
Interviewer: "Your client complains that the LLM keeps losing track of earlier details in a long chat. What's happening?"
You: "That's a classic context window problem. Every LLM has a fixed memory limit - say 8k, 32k, or 200k tokens. Once that's exceeded, earlier tokens get dropped or compressed, and the model literally forgets."
Interviewer: "So you just buy a bigger model?"
You: "You can, but that's like using a megaphone when you need a microphone. A larger context window costs more, runs slower, and doesn't always reason better."
Interviewer: "Then how do you manage long-term memory?"
You:
1. Summarization memory - periodically condense earlier chat segments into concise summaries.
2. Vector memory - store older context as embeddings; retrieve only the relevant pieces later.
3. Hybrid memory - combine summaries for continuity and retrieval for precision.
Interviewer: "So you’re basically simulating memory?"
You: "Yep. LLMs are stateless by design. You build memory on top of them - a retrieval layer that acts like long-term memory. Otherwise, your chatbot becomes a goldfish."
Interviewer: "And how do you know if the memory strategy works?"
You: "When the system recalls context correctly without bloating cost or latency. If a user says, 'Remind me what I told you last week,' and it answers from stored embeddings - that’s memory done right."
Interviewer: "So context management isn’t a model issue - it’s an architecture issue?"
You: "Exactly. Most think 'context length' equals intelligence. But true intelligence is recall with relevance - not recall with redundancy."
#ai #genai #llms #rag #memory
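The three memory strategies from the interview can be sketched in a few lines. This is a minimal illustration, not a production design: the `HybridMemory` class and its method names are hypothetical, the "embedding" is a toy bag-of-words vector standing in for a real embedding model, and the "summarizer" simply truncates each evicted turn where a real system would call an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HybridMemory:
    """Recent turns stay verbatim; evicted turns are summarized and indexed."""

    def __init__(self, max_recent=2):
        self.max_recent = max_recent
        self.recent = []     # verbatim turns still inside the "window"
        self.summaries = []  # summarization memory for evicted turns
        self.store = []      # vector memory: (embedding, original text)

    def add(self, turn):
        self.recent.append(turn)
        while len(self.recent) > self.max_recent:
            old = self.recent.pop(0)
            # Placeholder summarizer (assumption): keep the first six words.
            self.summaries.append(" ".join(old.split()[:6]))
            self.store.append((embed(old), old))

    def retrieve(self, query, k=1):
        """Return up to k evicted turns most similar to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self.store]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [text for _, text in scored[:k]]

    def build_context(self, query):
        """Hybrid context: summaries for continuity, retrieval for precision."""
        return "\n".join(
            ["Summary: " + "; ".join(self.summaries)]
            + ["Retrieved: " + t for t in self.retrieve(query)]
            + ["Recent: " + t for t in self.recent]
        )


memory = HybridMemory(max_recent=2)
for turn in [
    "My budget is 5000 dollars",
    "I prefer window seats",
    "Book the trip for June",
    "Also add travel insurance",
]:
    memory.add(turn)
print(memory.build_context("Remind me of my budget"))
```

Only the retrieved turn and the short summaries re-enter the prompt, which is the cost and latency argument made in the interview: the context grows with relevance rather than with chat length.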
LLM Performance and Coherence Challenges
Summary
LLM performance and coherence challenges refer to difficulties large language models face when trying to deliver accurate, consistent, and context-aware responses, especially during long or complex conversations. These issues include models forgetting earlier details, misunderstanding concepts, and struggling to track changes in information or maintain logical reasoning throughout interactions.
- Clarify task details: Always define task objectives and agent roles in straightforward language to help the model stay focused and avoid confusion.
- Segment information: Break lengthy documents or conversations into clear, state-labeled chunks to prevent loss of context and improve sequential understanding.
- Use verification steps: Incorporate prompts that require self-checking or cross-verification so the model can catch mistakes and ensure accuracy.
-
Despite the impressive capabilities of LLMs, developers still face challenges in getting the most out of these systems. LLMs often need a lot of fine-tuning and prompt adjustments to produce the best results. Two gaps stand out: first, LLMs currently lack the ability to refine and improve their own responses autonomously; second, they have limited research capabilities. It would be highly beneficial if LLMs could conduct their own research, equipped with a powerful search engine to access and integrate a broader range of resources. In the past couple of weeks, several studies have taken on these challenges:
1. Recursive Introspection (RISE): RISE introduces a novel fine-tuning approach where LLMs are trained to introspect and correct their responses iteratively. By framing the process as a multi-turn Markov decision process (MDP) and employing strategies from online imitation learning and reinforcement learning, RISE has shown significant performance improvements in models like LLaMa2, LLaMa3, and Mistral. RISE enhanced LLaMa3-8B's performance by 8.2% and Mistral-7B's by 6.6% on specific reasoning tasks.
2. Self-Reasoning Framework: This framework enhances the reliability and traceability of retrieval-augmented language models (RALMs) by introducing a three-stage self-reasoning process, encompassing relevance-aware processing, evidence-aware selective processing, and trajectory analysis. Evaluations across multiple datasets demonstrated that this framework outperforms existing state-of-the-art models, achieving an 83.9% accuracy on the FEVER fact verification dataset and improving the model's ability to evaluate the necessity of external knowledge augmentation.
3. Meta-Rewarding with LLM-as-a-Meta-Judge: The Meta-Rewarding approach incorporates a meta-judge role into the LLM’s self-rewarding mechanism, allowing the model to critique its judgments as well as evaluate its responses. This self-supervised approach mitigates rapid saturation in self-improvement processes, as evidenced by an 8.5% improvement in the length-controlled win rate for models like LLaMa2-7B over multiple iterations, surpassing traditional self-rewarding methods.
4. Multi-Agent Framework for Complex Queries: It mimics human cognitive processes by decomposing complex queries into sub-tasks using dynamic graph construction. It employs multiple agents (WebPlanner and WebSearcher) that work in parallel to retrieve and integrate information from large-scale web sources. This approach led to significant improvements in response quality when compared to existing solutions like ChatGPT-Web and Perplexity.ai.
The combination of these four studies would create a highly powerful system: it would self-improve through recursive introspection, continuously refining its responses; accurately assess its performance and learn from evaluations to prevent saturation; and efficiently acquire additional information as needed through dynamic and strategic search planning. How do you think a system with these capabilities would reshape the future?
-
The illusion of understanding in LLMs runs deeper than we thought. New research reveals how pervasive it is: Meet 'Potemkin Understanding.' I've gone through another research paper, in depth, and this one's worth your while (I think). This groundbreaking paper, "Potemkin Understanding in Large Language Models", directly challenges the assumption that high benchmark scores mean large language models truly understand. Researchers from MIT, UChicago, and Harvard have identified a critical failure mode they call 'Potemkin Understanding'. Think of it as an LLM building a perfect-looking facade of knowledge. It can flawlessly define a concept, even pass tests, but its internal understanding is fundamentally incoherent in a way no human's would be. It might explain a rhyming scheme perfectly, then write a poem that fails to rhyme. This illusion of comprehension is where LLMs answer complex questions correctly yet fundamentally misunderstand concepts in ways no human would. They often can't tell you when they're truly right or dangerously wrong. You may be thinking: yes, but we've seen this before, Markus. Well, it turns out the scale of this phenomenon extends far beyond the occasional errors we are already aware of. The paper finds Potemkins are ubiquitous across models, tasks, and domains, exposing a deeper internal incoherence in concept representations. Critically, this invalidates existing benchmarks as measures of true understanding. This research scientifically validates what many of us have argued: flawless output doesn't equate to genuine understanding. It underscores the critical need for human judgment and the "expert in the loop" to discern genuine insight from mere statistical mimicry. This directly reinforces themes I've explored in "Thinking Machines That Don't", an article being published by The Learning Guild this week, and the imperative for critical human discernment. This is essential reading for anyone relying on LLMs for strategic decisions. 
Read the full paper here: https://lnkd.in/gsckwVA3 Would love to hear your thoughts. #AIStrategy #TheEndeavorReport #AppliedAI
-
MIT just published research on why ChatGPT struggles with state tracking. The problem isn't memory. It's how transformers encode position information. Current models use RoPE (rotary position encoding). It treats all words four positions apart the same way. Doesn't matter if it's "cat sat on box" or financial data changing over time. MIT-IBM Watson AI Lab built PaTH Attention to fix this. It outperforms RoPE on state tracking and sequential reasoning. Here's what this means for how you use LLMs today: 1. Audit where your LLM loses context in long documents Test with financial reports, legal contracts, or multi-step instructions. Track where the model misses state changes or sequential logic. Example: "Company X acquired Y in Q2, then sold Z in Q4" often gets confused. Current position encoding can't track entity relationships over time. 2. Break complex documents into state-aware chunks Don't feed 50-page contracts as single prompts. Segment by state changes: before acquisition, during transition, after close. Explicitly label each section's timeframe and context. This compensates for positional encoding limitations. 3. Use explicit state markers in your prompts Add "Current state:" before each major transition. Example: "Current state: Post-merger. Previous state: Pre-merger." Forces the model to treat position changes as data, not just distance. Reduces errors in multi-step reasoning by 40-60%. 4. Test LLM performance on conditional logic tasks Build test cases with "if-then" sequences over long contexts. Example: "If condition A occurs on page 5, apply rule B on page 20." Current models fail these because RoPE doesn't track causal relationships. Know your model's limits before deploying in production. 5. Prioritize reasoning over retrieval for complex documents RAG (retrieval-augmented generation) won't fix state tracking issues. It retrieves chunks but doesn't understand how states evolve. 
For contracts, regulations, or multi-step workflows, use specialized parsing. Position encoding is the bottleneck, not retrieval accuracy. 6. Watch for next-gen models with adaptive position encoding PaTH Attention is research, not yet in production models. But it signals where LLM architecture is heading. Models that track state changes will replace current transformers. Plan your document processing stack accordingly. Why this matters: You're using LLMs on tasks they structurally can't handle well. Financial analysis, legal review, code debugging over long contexts. All require state tracking that RoPE fundamentally doesn't provide. MIT just showed the problem and the solution. Most teams won't adjust their workflows until new models ship. You can compensate for these limitations now. Found this helpful? Follow Arturo Ferreira.
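Points 2 and 3 above (state-aware chunks and explicit state markers) can be sketched as a small prompt builder. The `Section` type and `build_state_aware_prompt` function are hypothetical names chosen for illustration; the marker wording follows the "Current state: ... Previous state: ..." example from the post, and the sample document text is invented.

```python
from dataclasses import dataclass


@dataclass
class Section:
    state: str      # label for the state this chunk describes, e.g. "Pre-merger"
    timeframe: str  # explicit timeframe label, e.g. "Q1"
    text: str       # the chunk itself


def build_state_aware_prompt(sections, question):
    """Prefix every chunk with its current and previous state, then ask."""
    parts = []
    previous = "None"
    for section in sections:
        parts.append(
            f"Current state: {section.state}. Previous state: {previous}.\n"
            f"Timeframe: {section.timeframe}\n"
            f"{section.text}"
        )
        previous = section.state
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


sections = [
    Section("Pre-merger", "Q1", "Company X operates independently."),
    Section("Post-merger", "Q2", "Company X has acquired Y."),
]
print(build_state_aware_prompt(sections, "Who owns Y after Q2?"))
```

Because each chunk carries its own labels, the same sections can also be sent as separate prompts without losing the ordering information.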
-
Why Do Multi-Agent LLM Systems “still” Fail? A new study explores why multi-agent systems are not significantly outperforming single-agent systems. The study identifies 14 failure modes of multi-agent systems. A multi-agent system (MAS) is a set of agents that interact, communicate, and collaborate to achieve a shared goal that would be difficult or unreliable for a single agent to accomplish.
Benchmark:
- Selected five popular, open-source MAS (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2)
- Chose tasks representative of each MAS's intended capabilities (Software Development, SWE-Bench Lite, Utility Service Tasks, GSM-Plus), for a total of 150 tasks
- Recorded the complete conversation logs, had human annotators review them, used Cohen's Kappa score to ensure consistency and reliability, and validated with LLM-as-a-Judge
Multi-agent failure modes:
1. Disobey Task Spec: Ignores task rules and requirements, leading to wrong output.
2. Disobey Role Spec: Agent acts outside its defined role and responsibilities.
3. Step Repetition: Unnecessarily repeats steps already completed, causing delays.
4. Loss of History: Forgets previous conversation context, causing incoherence.
5. Unaware Stop: Fails to recognize task completion, continues unnecessarily.
6. Conversation Reset: Dialogue unexpectedly restarts, losing context and progress.
7. Fail Clarify: Does not ask for needed information when unclear.
8. Task Derailment: Gradually drifts away from the intended task objective.
9. Withholding Info: Agent does not share important, relevant information.
10. Ignore Input: Disregards or insufficiently considers input from others.
11. Reasoning Mismatch: Actions do not logically follow from stated reasoning.
12. Premature Stop: Ends task too early before completion or information exchange.
13. No Verification: Lacks mechanisms to check or confirm task outcomes.
14. Incorrect Verification: Verification process is flawed, misses critical errors.
How to improve multi-agent LLM systems:
📝 Define tasks and agent roles clearly and explicitly in prompts.
🎯 Use examples in prompts to clarify expected task and role behavior.
🗣️ Design structured conversation flows to guide agent interactions.
✅ Implement self-verification steps in prompts for agents to check their reasoning.
🧩 Design modular agents with specific, well-defined roles for simpler debugging.
🔄 Redesign topology to incorporate verification roles and iterative refinement processes.
🤝 Implement cross-verification mechanisms for agents to validate each other.
❓ Design agents to proactively ask for clarification when needed.
📜 Define structured conversation patterns and termination conditions.
Github: https://lnkd.in/ebmCg28d Paper: https://lnkd.in/etgsH6BH
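Two of the recommendations above, a dedicated verification role and an explicit termination condition, can be sketched as a small control loop. This is an illustrative sketch and not the study's code: `run_with_verification`, `toy_worker`, and `toy_verifier` are hypothetical, and the toy functions stand in for LLM-backed agents.

```python
def run_with_verification(worker, verifier, task, max_rounds=3):
    """Worker drafts, verifier checks; stop on approval or after max_rounds."""
    feedback = ""
    history = []
    for round_no in range(1, max_rounds + 1):
        draft = worker(task, feedback)
        ok, feedback = verifier(task, draft)
        history.append({"round": round_no, "draft": draft, "ok": ok})
        if ok:
            # Termination condition 1: the verifier approved the draft.
            return draft, history
    # Termination condition 2: round budget exhausted (no endless repetition).
    return None, history


def toy_worker(task, feedback):
    """Stand-in agent: answers wrongly first, corrects itself after feedback."""
    return "4" if feedback else "5"


def toy_verifier(task, draft):
    """Stand-in verification role with a hard-coded expected answer."""
    ok = draft == "4"
    return ok, "" if ok else "Recheck the arithmetic."


answer, trace = run_with_verification(toy_worker, toy_verifier, "What is 2 + 2?")
print(answer, len(trace))
```

Keeping the full `history` rather than only the final answer also makes step repetition and premature stops visible when debugging.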
-
The new open-source benchmark, MCP-Universe, is a useful step forward in how we evaluate LLMs. Unlike traditional benchmarks, it tests models on real enterprise tasks, like repository management and financial analysis. The latest results, though, are a wake-up call: as VentureBeat reports, GPT-5 failed in more than half of real work orchestration tasks. Not because the model isn’t powerful, but because raw model strength isn’t the same as enterprise readiness. Two challenges stood out: • Long context windows. Enterprise inputs are sprawling, incomplete, and often contradictory. Expanding the window isn’t enough. You need the right information inside it. Approaches like GraphRAG help by curating authoritative context and enabling multi-hop reasoning across knowledge. • Unfamiliar tools. LLMs struggle to adapt to proprietary formats, workflows, and security protocols. There’s a misconception that adding MCP on top of APIs will magically improve reliability. It won’t. MCP can connect systems, but that doesn’t guarantee value. Reliability comes from agents and tools built for specific jobs, grounded in a company’s own data, rules, and workflows—and from curating the right information, not just more of it. A “universal” layer doesn’t replace the need for domain-specific intelligence.
-
One of the biggest challenges I see with scaling LLM agents isn’t the model itself. It’s context. Agents break down not because they “can’t think” but because they lose track of what’s happened, what’s been decided, and why. Here’s the pattern I notice: 👉 For short tasks, things work fine. The agent remembers the conversation so far, does its subtasks, and pulls everything together reliably. 👉 But the moment the task gets longer, the context window fills up, and the agent starts forgetting key decisions. That’s when results become inconsistent, and trust breaks down. That’s where Context Engineering comes in. 🔑 Principle 1: Share Full Context, Not Just Results Reliability starts with transparency. If an agent only shares the final outputs of subtasks, the decision-making trail is lost. That makes it impossible to debug or reproduce. You need the full trace, not just the answer. 🔑 Principle 2: Every Action Is an Implicit Decision Every step in a workflow isn’t just “doing the work”, it’s making a decision. And if those decisions conflict because context was lost along the way, you end up with unreliable results. ✨ The Solution to this is "Engineer Smarter Context" It’s not about dumping more history into the next step. It’s about carrying forward the right pieces of context: → Summarize the messy details into something digestible. → Keep the key decisions and turning points visible. → Drop the noise that doesn’t matter. When you do this well, agents can finally handle longer, more complex workflows without falling apart. Reliability doesn’t come from bigger context windows. It comes from smarter context windows. 〰️〰️〰️ Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
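The "carry forward the right pieces" idea can be sketched as a compaction step that runs before each new agent step. Everything here is an assumption made for illustration: the `Event` type, the three labels ("decision", "detail", "noise"), and a word count standing in for a token budget.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str  # "decision", "detail", or "noise" (labels are an assumption)
    text: str


def compact_context(events, budget_words=20):
    """Keep every decision, add the newest details that fit, drop noise."""
    lines = [f"DECISION: {e.text}" for e in events if e.kind == "decision"]
    used = sum(len(line.split()) for line in lines)
    kept = []
    details = [e.text for e in events if e.kind == "detail"]
    for text in reversed(details):       # newest details first
        cost = len(text.split()) + 1     # +1 for the "DETAIL:" label
        if used + cost > budget_words:
            break
        kept.append(f"DETAIL: {text}")
        used += cost
    return lines + list(reversed(kept))  # details back in chronological order


events = [
    Event("decision", "Use PostgreSQL for storage"),
    Event("noise", "Retrying connection..."),
    Event("detail", "Schema has three tables"),
    Event("decision", "Ship behind a feature flag"),
    Event("detail", "Load test passed at 200 requests per second"),
]
for line in compact_context(events, budget_words=20):
    print(line)
```

Decisions are never dropped in this sketch, even if they alone exceed the budget; that mirrors the post's point that the decision trail matters more than raw history.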
-
Are Your LLM Rerankers Actually Good at Handling Novel Queries? New research from the Universität Innsbruck challenges a fundamental assumption in information retrieval: that state-of-the-art reranking models generalize well to unseen content. The Hidden Problem: Most benchmarks like TREC DL19/DL20 and BEIR contain queries that overlap with LLM training data. This contamination makes it nearly impossible to assess true generalization capability. The research introduces FutureQueryEval, a dataset with 148 queries collected after April 2025, ensuring zero overlap with existing model training cutoffs. Technical Deep Dive: The study evaluates 22 methods across three core paradigms: Pointwise Reranking scores query-document pairs independently with O(n) complexity. Models like MonoT5 use T5's encoder-decoder architecture with prompts like "Query: q Document: d Relevant:" to predict relevance probabilities. The challenge? Inconsistent score calibration across different prompts and heavy reliance on scoring APIs that many generation-only LLMs lack. Pairwise Reranking compares document pairs using prompts to determine relative relevance, aggregating results through methods like Heapsort (O(n log n)) or sliding windows (O(n)). PRP-FLAN-UL2 leads here, but the approach struggles with transitivity issues and scales poorly due to quadratic complexity in naive implementations. Listwise Reranking processes multiple documents simultaneously, with models like RankGPT generating identifier permutations (e.g., "[2] > [1]") to capture inter-document relationships. While achieving O(n) complexity with sliding windows, these methods face challenges with long contexts and positional biases. The Surprising Results: On familiar benchmarks, RankGPT-GPT-4 dominates with 75.59 nDCG@10 on DL19. But on FutureQueryEval? Performance drops 5-15% across all categories. Listwise methods show the smallest degradation (8%), suggesting inter-document modeling provides better robustness. 
Meanwhile, fine-tuned models like MonoT5-3B (60.75 nDCG@10) and TWOLAR-XL (60.03) maintain strong performance, while lightweight options like FlashRank-MiniLM balance efficiency with 55.43 nDCG@10. Under the Hood: The key differentiator is how models handle context. Pointwise methods treat each document independently, missing relationship signals. Pairwise methods capture relative preferences but struggle with consistency. Listwise approaches like Zephyr-7B (62.65 nDCG@10 on novel queries) excel by modeling full document lists through attention mechanisms that weigh inter-document relevance simultaneously. The research exposes a critical limitation: claims of "generalization" based on standard benchmarks may be overstated. As retrieval systems increasingly power RAG applications and enterprise search, understanding how rerankers perform on truly unseen content becomes essential for building reliable AI systems.
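To make the metric and the complexity claims concrete, here is a small sketch of nDCG@k plus toy pointwise and naive all-pairs pairwise rerankers. It is not the paper's evaluation code: the term-overlap scorer stands in for an LLM relevance judgment, all function names are hypothetical, and the DCG shown uses linear gain (some evaluations use 2^rel - 1 instead).

```python
import math
from itertools import combinations


def dcg(relevances, k):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg(relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0


def overlap(query, doc):
    """Toy relevance signal: number of shared lowercase terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def pointwise_rerank(query, docs, score):
    """Score each document independently: n scorer calls."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)


def prefer_by_overlap(query, a, b):
    """Toy pairwise judge: returns whichever document overlaps more."""
    return a if overlap(query, a) >= overlap(query, b) else b


def pairwise_rerank(query, docs, prefer):
    """Naive all-pairs comparison: n * (n - 1) / 2 judge calls."""
    wins = {d: 0 for d in docs}  # assumes documents are unique strings
    comparisons = 0
    for a, b in combinations(docs, 2):
        comparisons += 1
        wins[prefer(query, a, b)] += 1
    ranked = sorted(docs, key=lambda d: wins[d], reverse=True)
    return ranked, comparisons


docs = ["cats sleep", "position encoding in transformers", "rotary position encoding"]
query = "rotary position encoding"
print(pointwise_rerank(query, docs, overlap))
print(pairwise_rerank(query, docs, prefer_by_overlap))
print(round(ndcg([1, 3]), 4))
```

The comparison counter shows why naive pairwise reranking scales quadratically, which is the motivation for the Heapsort and sliding-window aggregation mentioned in the post.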
-
Andrej Karpathy recently put a name to something a lot of us in the trenches have been circling for months: the "LLM Wiki". And he is spot on. For the last year, the industry has basically treated LLMs as ephemeral answer engines. You retrieve a few chunks, generate a response, throw the synthesis away, and repeat the exact same work tomorrow. This is the core bottleneck of naive RAG. It has zero durable memory. No accumulation. No compounding intelligence. Every hard question forces the system to rediscover the same relationships from scratch, burning compute to rebuild context it should already own. The LLM Wiki model flips this entirely. Instead of just sitting at the end of a query pipeline, the LLM sits between raw information and a persistent knowledge layer. When new data flows in, it doesn’t just get embedded and buried in a database. The model actually does something with it: 🔹 Updates entity pages 🔹 Connects new facts to existing knowledge graphs 🔹 Flags contradictions instantly 🔹 Preserves state over time This shift is massive. Building low-footprint vector engines and on-prem AI architectures daily, I find the inefficiency of standard RAG impossible to ignore. Recomputing understanding on the fly just doesn't scale for serious workloads. The real leverage isn’t in generating one more answer. It’s in compiling knowledge once and continuously maintaining it. Having managed large-scale R&D teams, I've seen firsthand how fast documentation drift happens. We still rely on humans to manually update references, link architectural decisions, and keep distributed teams aligned. At scale, that approach breaks down fast. The winning architecture is clear: 🧠 Humans drive the judgment, strategy, and the hard questions. 🤖 LLMs handle the heavy bookkeeping: updating knowledge, linking entities, and maintaining system coherence. The future of AI isn't just about faster code generation. It's about building knowledge that compounds. 
Naive RAG as we know it is actually just a stepping stone. What do you think? >> https://lnkd.in/dMURAJ_V
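The persistent knowledge layer described above can be sketched as a store of entity pages that are updated on ingest and that log contradictions. This is only an illustration of the idea, not Karpathy's design: `WikiStore` and its fields are hypothetical, the sample facts are invented, and in a real pipeline an LLM would extract the (entity, attribute, value) facts from raw text.

```python
from collections import defaultdict


class WikiStore:
    """Entity pages that persist across queries and record conflicting facts."""

    def __init__(self):
        self.pages = defaultdict(dict)  # entity -> {attribute: (value, source)}
        self.contradictions = []        # conflicts kept for human review

    def ingest(self, entity, attribute, value, source):
        page = self.pages[entity]
        if attribute in page and page[attribute][0] != value:
            old_value, old_source = page[attribute]
            self.contradictions.append(
                {
                    "entity": entity,
                    "attribute": attribute,
                    "old": (old_value, old_source),
                    "new": (value, source),
                }
            )
        # Design choice (assumption): the newest fact wins, the conflict is logged.
        page[attribute] = (value, source)

    def page(self, entity):
        """Render the current state of one entity page as plain text."""
        facts = self.pages.get(entity, {})
        return "\n".join(f"{attr}: {val} [{src}]" for attr, (val, src) in facts.items())


wiki = WikiStore()
wiki.ingest("billing-service", "database", "MySQL", "design-doc-v1")
wiki.ingest("billing-service", "owner", "payments team", "design-doc-v1")
wiki.ingest("billing-service", "database", "PostgreSQL", "migration-note")
print(wiki.page("billing-service"))
print(len(wiki.contradictions))
```

Answering a later question then starts from the maintained page rather than from re-retrieved raw chunks, which is the "compile once, maintain continuously" point of the post.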
-
Bigger context windows will not save your LLM app. Most teams think the solution is to stuff more data into the model. It is not. The real advantage comes from Context Engineering. This is the skill of designing an AI system that feeds the model the right information at the right time. Not by changing the model, but by connecting it to the outside world: • retrieving fresh data • grounding answers in facts • using tools and memory to stay accurate The goal is not to overload a prompt. It is to make the model smarter about what stays active and what gets offloaded. This is what separates basic LLM Q and A from real production systems. To do this right, you need six components working together 👇 ⸻ 1. Agents 🤖 The decision makers. Agents evaluate what they know, decide what they need, choose the right tools, and recover when things go wrong. ⸻ 2. Query Augmentation 🔎 Turning messy user input into precise intent. If the system does not know exactly what the user is asking, everything downstream fails. ⸻ 3. Retrieval 📚 The bridge from the model to your real data. This is chunking, indexing, and fetching the right facts with the right balance of precision and context. ⸻ 4. Prompting Techniques 🧭 Guiding the model with clear reasoning instructions. Chain of Thought, Few shot examples, ReAct style prompting, and more. ⸻ 5. Memory 🧠 Short term and long term. Your app needs to remember past interactions and keep persistent knowledge available when needed. ⸻ 6. Tools 🔧 The action layer. APIs, code execution, web browsing, database calls. This is how your system moves from answering questions to actually performing work. ⸻ This is far more advanced than classic RAG. This is how production systems maintain coherence, access live data, reduce hallucinations, and actually get work done. If you want more breakdowns like this on LLM architecture, RAG systems, and AI engineering, follow my profile here on LinkedIn.
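The agent and tool components from the list above can be sketched as a single decision step: normalize the query, choose a tool, call it, and recover if it fails. This is a deliberately tiny, rule-based stand-in for what an LLM-driven agent would decide; `calculator`, `lookup`, and `agent_step` are hypothetical names, and the routing rule (a "+" means arithmetic) is an assumption for illustration.

```python
def calculator(expression):
    """Toy tool: supports only 'a + b' with integers, raises ValueError otherwise."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


def lookup(term, knowledge):
    """Toy retrieval tool: exact-match lookup, raises KeyError when missing."""
    return knowledge[term.strip().lower()]


def agent_step(query, knowledge):
    """Pick a tool, run it, and return a structured result or a recovery hint."""
    query = " ".join(query.split())  # query augmentation stand-in: tidy whitespace
    tool = "calculator" if "+" in query else "lookup"
    try:
        if tool == "calculator":
            answer = calculator(query)
        else:
            answer = lookup(query, knowledge)
        return {"tool": tool, "answer": answer}
    except (ValueError, KeyError):
        # Recovery path: report the failure instead of guessing an answer.
        return {"tool": tool, "answer": None, "error": "tool failed; ask the user to rephrase"}


knowledge = {"rope": "rotary position encoding"}
print(agent_step("2 +   3", knowledge))
print(agent_step("RoPE", knowledge))
print(agent_step("unknown term", knowledge))
```

Returning a structured error rather than raising keeps the failure inside the agent loop, where a later step can retry with a different tool or ask a clarifying question.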