Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection. Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows: Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it. Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement. Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses. Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend: - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023) - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024) [Original text: https://lnkd.in/g4bTuWtU ]
LLM System Optimization
Explore top LinkedIn content from expert professionals.
-
-
Small variations in prompts can lead to very different LLM responses. Research that measures LLM prompt sensitivity uncovers what matters, and the strategies to get the best outcomes. A new framework for prompt sensitivity, ProSA, shows that response robustness increases with factors including higher model confidence, few-shot examples, and larger model size. Some strategies you should consider given these findings: 💡 Understand Prompt Sensitivity and Test Variability: LLMs can produce different responses with minor rephrasings of the same prompt. Testing multiple prompt versions is essential, as even small wording adjustments can significantly impact the outcome. Organizations may benefit from creating a library of proven prompts, noting which styles perform best for different types of queries. 🧩 Integrate Few-Shot Examples for Consistency: Including few-shot examples (demonstrative samples within prompts) enhances the stability of responses, especially in larger models. For complex or high-priority tasks, adding a few-shot structure can reduce prompt sensitivity. Standardizing few-shot examples in key prompts across the organization helps ensure consistent output. 🧠 Match Prompt Style to Task Complexity: Different tasks benefit from different prompt strategies. Knowledge-based tasks like basic Q&A are generally less sensitive to prompt variations than complex, reasoning-heavy tasks, such as coding or creative requests. For these complex tasks, using structured, example-rich prompts can improve response reliability. 📈 Use Decoding Confidence as a Quality Check: High decoding confidence—the model’s level of certainty in its responses—indicates robustness against prompt variations. Organizations can track confidence scores to flag low-confidence responses and identify prompts that might need adjustment, enhancing the overall quality of outputs. 📜 Standardize Prompt Templates for Reliability: Simple, standardized templates reduce prompt sensitivity across users and tasks. For frequent or critical applications, well-designed, straightforward prompt templates minimize variability in responses. Organizations should consider a “best-practices” prompt set that can be shared across teams to ensure reliable outcomes. 🔄 Regularly Review and Optimize Prompts: As LLMs evolve, so may prompt performance. Routine prompt evaluations help organizations adapt to model changes and maintain high-quality, reliable responses over time. Regularly revisiting and refining key prompts ensures they stay aligned with the latest LLM behavior. Link to paper in comments.
-
𝐘𝐨𝐮𝐫 𝐋𝐋𝐌 𝐢𝐬 𝐧𝐨𝐭 𝐛𝐫𝐨𝐤𝐞𝐧. 𝐘𝐨𝐮𝐫 𝐪𝐮𝐞𝐫𝐲 𝐩𝐫𝐞𝐩 𝐢𝐬. Here is what nobody tells you about why your RAG system keeps hallucinating 👇 Most engineers obsess over prompts and temperature settings. Meanwhile, their queries are doing this: - Arriving vague and contextless - Missing critical semantic variations - Trying to answer 5 questions at once - Getting routed to the wrong knowledge base 𝐓𝐡𝐞 𝐟𝐢𝐱? 𝐒𝐭𝐨𝐩 𝐭𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐪𝐮𝐞𝐫𝐢𝐞𝐬 𝐥𝐢𝐤𝐞 𝐭𝐡𝐫𝐨𝐰𝐚𝐰𝐚𝐲 𝐢𝐧𝐩𝐮𝐭𝐬. Here is the actual architecture that separates production RAG from toy demos: 𝟏. 𝐐𝐮𝐞𝐫𝐲 𝐑𝐞𝐰𝐫𝐢𝐭𝐢𝐧𝐠 Turn "API not working" into "authentication failure modes in REST endpoints" One captures intent. The other actually retrieves useful context. 𝟐. 𝐐𝐮𝐞𝐫𝐲 𝐄𝐱𝐩𝐚𝐧𝐬𝐢𝐨𝐧 Your vector DB does not know that "LLM" and "large language model" mean the same thing. Add variants. Boost recall. Stop missing obvious matches. 𝟑. 𝐐𝐮𝐞𝐫𝐲 𝐃𝐞𝐜𝐨𝐦𝐩𝐨𝐬𝐢𝐭𝐢𝐨𝐧 "How do I fine-tune Llama 3 for customer support and deploy it cost-effectively?" That is not one query. That is three. Break it. Parallelize it. Actually answer it. 𝟒. 𝐐𝐮𝐞𝐫𝐲 𝐀𝐠𝐞𝐧𝐭𝐬 This is where it gets interesting. Before you touch your retriever: - Analyze intent - Route intelligently - Validate what came back - Decide if you even have enough to generate 𝟓. 𝐓𝐡𝐞 𝐃𝐞𝐜𝐢𝐬𝐢𝐨𝐧 𝐋𝐚𝐲𝐞𝐫 𝐄𝐯𝐞𝐫𝐲𝐨𝐧𝐞 𝐒𝐤𝐢𝐩𝐬 Weak context? → Loop back and refine Strong context? → Generate with confidence Incomplete? → Don't hallucinate. Go get more. 𝐇𝐞𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐭𝐡𝐢𝐧𝐠: The best LLM systems don't start with "write a better prompt." They start with "did we even ask the right question?" Real talk: What breaks first in your system? - Query rewriting catching garbage input? - Retrieval returning irrelevant chunks? - Orchestration making the wrong routing call? 𝐃𝐫𝐨𝐩 𝐲𝐨𝐮𝐫 𝐰𝐚𝐫 𝐬𝐭𝐨𝐫𝐢𝐞𝐬 𝐛𝐞𝐥𝐨𝐰. 𝐋𝐞𝐭'𝐬 𝐝𝐞𝐛𝐮𝐠 𝐭𝐡𝐢𝐬 𝐭𝐨𝐠𝐞𝐭𝐡𝐞𝐫. 👇 ♻️ Repost this to help your network get started ➕ Follow Anurag(Anu) Karuparti for more PS: If you found this valuable, join my weekly newsletter where I document the real-world journey of AI transformation. ✉️ Free subscription: https://lnkd.in/esF52fm5 #AgenticAI #AIAgents #AILLMS
-
LLM pro tip to reduce hallucinations and improve performance: instruct the language model to ask clarifying questions in your prompt. Add a directive like "If any part of the question/task is unclear or lacks sufficient context, ask clarifying questions before providing an answer" to your system prompt. This will: (1) Reduce ambiguity - forcing the model to acknowledge knowledge gaps rather than filling them with hallucinations (2) Improve accuracy - enabling the model to gather necessary details before committing to an answer (3) Enhance interaction - creating a more natural, iterative conversation flow similar to human exchanges This approach was validated in the 2023 CALM paper, which showed that selectively asking clarifying questions for ambiguous inputs increased question-answering accuracy without negatively affecting responses to unambiguous queries https://lnkd.in/gnAhZ5zM
-
Herels How To Use HyDE When RAG Fails: A Practical Workflow --- I have been following the advice of advice of Avi Chawla and Daily Dose of Data Science. They offered the great graphic below. RAG works great for me. Right up until someone asks it a vague question. In customer support, internal search, or training, users rarely use the right technical terms. They say “the screen freezes” instead of “memory leak in main thread.” Standard RAG misses these queries because the "vector distance" (sorry for the suddent Geek Speak) is too wide. HyDE (Hypothetical Document Embedding) fixes this by searching with a hypothetical answer instead of the raw question. Here is the step-by-step workflow: STEP 1. Generate a Hypothetical Answer Pass the user’s question to an LLM first. Prompt it to write a short, plausible answer using proper vocabulary. The answer can be wrong—you just need realistic language and structure. Example prompt: “Write a short paragraph that answers this question as an expert would, using correct technical terms.” STEP 2. Embed the Hypothetical Answer Run the generated text through your embedding model. This creates a vector that sits much closer to your technical documentation than the original vague question. Now I get it, before you mention it. Yes, there might be hallucinations here. STEP 3. Retrieve with the Hypothetical Vector (yep, Geek Speak returns briefly) Search your vector database using the hypothetical answer’s embedding. This shifts retrieval from Question-to-Answer to Answer-to-Answer, which dramatically boosts similarity scores. STEP 4. Generate the Final Response Feed the retrieved documents back into the LLM with a grounded prompt: “Use only the context below to answer the original question accurately and clearly.” Where This Works For Me (and probably will for you) Customer Support: Turn “It’s slow” into “Potential database indexing issue” and find the right KB article. Enterprise Search: Map “quarterly sales slide” to the actual “Q3 Revenue Dashboard Template.” Technical Documentation: Connect “how to connect my app” to “OAuth2 implementation guide.” AI Training: Bridge novice symptom-language to expert root-cause language instantly. HyDE makes search feel less like a keyword matcher and more like a senior teammate who translates intent into execution.
-
System prompts are getting outdated! Here's a counterintuitive lesson from building real-world Agents: Writing giant system prompts doesn't improve an Agent's performance; it often makes it worse. For example, you add a rule about refund policies. Then one about tone. Then another about when to escalate. Before long, you have a 2,000-word instruction manual. But here’s what we’ve learned: LLMs are extremely poor at handling this. Recent research also confirms what many of us experience. There’s a “Curse of Instructions.” The more rules you add to a prompt, the worse the model performs at following any single one. Here’s a better approach: contextually conditional guidelines. Instead of one giant prompt, break your instructions into modular pieces that only load into the LLM when relevant. ``` agent.create_guideline( condition="Customer asks about refunds", action="Check order status first to see if eligible", tools=[check_order_status], ) ``` Each guideline has two parts: - Condition: When does it get loaded? - Action: What should the agent do? The magic happens behind the scenes. When a query arrives, the system evaluates which guidelines are relevant to the current conversation state. Only those guidelines get loaded into the model’s context. This keeps the LLM’s cognitive load minimal because instead of juggling 50 rules, it focuses on just 3-4 that actually matter at that point. This results in dramatically better instruction-following. This approach is called Alignment Modeling. Structuring guidance contextually so agents stay focused, consistent, and compliant. Instead of waiting for an allegedly smaller model, what matters is having an architecture that respects how LLMs fundamentally work. This approach is actually implemented in Parlant - a recently trending open-source framework (13k+ stars). You can see the full implementation and try it yourself. But the core insight applies regardless of what tools you use: Be more methodical about context engineering and actually explaining what you expect the behavior to be in special cases you care about. Then agents can become truly focused and useful. I’ve shared the repo link in the first comment! ___ Share this with your network if you found this insightful ♻️ Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
-
Nice paper combining the strength of Skills and RAG. Most RAG systems retrieve on every query, whether the model needs help or not. This is wasteful when the model already knows the answer, and often too late when it does not. New research introduces Skill-RAG, a failure-state-aware retrieval system. It uses hidden-state probing to detect when an LLM is approaching a knowledge failure, then routes the query to a specialized retrieval strategy matched to the gap. Evaluated on HotpotQA, Natural Questions, and TriviaQA, the approach improves over uniform RAG baselines on both efficiency and accuracy. Why does it matter? RAG is moving from a single monolithic pipeline to a suite of skills an agent selects between. Knowing when to retrieve and what kind of retrieval to run will matter more than raw retriever quality as agents take on multi-step reasoning, where a single bad lookup derails the whole chain.
-
Current benchmarks for Large Language Models (LLMs) fail to account for the dynamic, interactive nature fundamental to LLM-based software systems. A new control theoretic approach could revolutionize how we steer these systems towards desired outcomes. 𝐖𝐡𝐲 𝐂𝐨𝐧𝐭𝐫𝐨𝐥 𝐓𝐡𝐞𝐨𝐫𝐲 (𝐂𝐓)? Traditionally, LLM performance is measured using benchmarks like “HellaSwag,” “MMLU,” “TruthfulQA,” or “MATH.” These evaluate how well an LLM answers questions requiring knowledge, reasoning, and mathematical skills. However, these benchmarks overlook the dynamic interactions in LLM-based systems, such as chatbots, where multiple question-answer interactions occur. Users typically steer the LLM in a specific direction, refocusing it when it moves off course. Large context windows in modern LLMs build an internal state over interactions. Understanding and optimizing these dynamic interactions is crucial for developing better LLM systems. This is where control theory (CT) comes in. Originating from engineering, CT studies how to influence a systemtowards a desired state using a “control signal”. CT is widely applicable, from electrical engineering to biological systems and disease control. 𝐊𝐞𝐲 𝐪𝐮𝐞𝐬𝐭𝐢𝐨𝐧𝐬 𝐂𝐓 𝐚𝐝𝐝𝐫𝐞𝐬𝐬𝐞𝐬 1) When is control possible? 2) What is the cost of control? 3) How computationally intensive is control? These are critical questions for LLM systems. Researchers now presented new results on controlling LLM systems (see comments). 𝐊𝐞𝐲 𝐜𝐨𝐧𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧𝐬 𝐨𝐟 𝐭𝐡𝐞 𝐩𝐚𝐩𝐞𝐫 1) Highlighting Differences from Classical Control Theory: LLM systems are discrete in state and time, unlike systems described by ordinary differential equations. Their state space grows exponentially with the number of tokens, and there is mutual exclusion on control input and generated tokens—at any time, you can either input or receive output from the LLM. 2) Defining Control Theory for LLMs: The focus is on analyzing the “reachable set” of output tokens (see image). 3) Theoretical Results: Upper bounds on the reachable set for self-attention layers show which outputs cannot be reached within the next k tokens given a context. 4) Empirical Results: Demonstrations of lower bounds on the reachable set for popular LLMs reveal that likelihood-based metrics, such as cross-entropy loss, cannot ensure exclusion from the reachable output set, highlighting gaps in our understanding of LLM systems and control theory. The paper concludes with exciting research questions: 1) Can LLMs learn to control each other? 2) Can we find controllable subspaces such as in classical control theory? 3) Can we compose control modules and subsystems into an interpretable, predictable, and effective whole? Exploring these questions may shift our approach from individual models to integrated systems and lead to new ideas beyond LLMs. #genai #llm #machinelearning #ai
-
🤔 What if, instead of using prompts, you could fine-tune LLMs to incorporate self-feedback and improvement mechanisms more effectively? Self-feedback and improvement have been shown to be highly beneficial for LLMs and agents, allowing them to reflect on their behavior and reasoning and correct their mistakes as more computational resources or interactions become available. The authors mention that frequently used test-time methods like prompt tuning and few-shot learning that are used for self-improvement, often fail to enable models to correct their mistakes in complex reasoning tasks. ⛳ The paper introduces RISE: Recursive Introspection, an approach to improve LLMs by teaching them how to introspect and improve their responses iteratively. ⛳ RISE leverages principles from online imitation learning and reinforcement learning to develop a self-improvement mechanism within LLMs. By treating each prompt as part of a multi-turn Markov decision process (MDP), RISE allows models to learn from their previous attempts and refine their answers over multiple turns, ultimately improving their problem-solving capabilities. ⛳It models the fine-tuning process as a multi-turn Markov decision process, where the initial state is the prompt, and subsequent states involve recursive improvements. ⛳It employs a reward-weighted regression (RWR) objective to learn from both high- and low-quality rollouts, enabling models to improve over turns. The approach uses data generated by the learner itself or more capable models to supervise improvements iteratively. RISE significantly improves the performance of LLMs like LLaMa2, LLaMa3, and Mistral on math reasoning tasks, outperforming single-turn strategies with the same computational resources. Link: https://lnkd.in/e2JDQr8M
-
Want to prompt like the top AI startups? 👇 YC shared tips how the top AI startups in their portfolio are prompting LLMs: Key learnings: 1/ Be Hyper-Specific & Detailed (The “Manager” Style) Treat your LLM like a new employee. Provide long, detailed prompts that define their role, task, constraints, and desired output. Example: Parahelp uses a 6+ page prompt for their AI customer support agent! 2/ Assign a Clear Role (Set a Persona) Start with: “You are a [role].” This sets the context, tone, and expected expertise. This helps the LLM adopt the desired style and reasoning for the tasks. 3/ Outline the Task + Provide the Steps Clearly state the LLM's primary task Break down complex tasks into a step-by-step plan. This Improves reliability and makes complex operations more manageable for the LLM. 4/ Structure Your Prompt (and Output) Use Markdown, bullet points, XML tags to structure your instructions Clear format helps with consistent and reliable outputs. Example: Parahelp, for instance, uses tags like <manager_verify> to enforce response format. 5/ Meta-Prompting (LLM, Improve Thyself) Yes, you can ask the LLM to help you write or refine prompts. Give it your current prompt. Ask it to make your prompt better or critique it. LLMs often suggest effective improvements you might not think of. 6/ Provide Examples For complex tasks, include a few high-quality examples of input-output pairs directly in the prompt. This improves the LLM's ability to understand and replicate desired behavior. Example: Jazzberry (AI bug finder) feeds hard examples to guide the LLM. 7/ Prompt Folding & Dynamic Generation Design prompts that generate specialized sub-prompts on the fly. Use this in multi-step workflows to break down complexity and adapt based on prior output. Example: A tool classification prompt that outputs a more targeted follow-up prompt like: “Now write a bug triage report for a frontend UI error.” 8/ Add an Escape Hatch Build fail-safes right into your prompt: Example: If you’re unsure or missing info, say ‘I don’t know’ and ask for clarification. This reduces hallucinations. Increases trust. 9/ Use Debug Info & Thinking Traces Ask the model to explain its reasoning “Include a section called ‘debug_info’ where you explain the logic behind your answer.” This is great for debugging and fine-tuning. 10/ Treat Evals Like Gold Yes, prompts matter. But evals are your most imp IP. Evals are essential for knowing why a prompt works and for iterating effectively. 11/ Consider Model "Personalities" & Distillation Different LLMs have different "personalities" Use the most powerful model to write and refine prompts. Then distill the optimized prompt for speed/post for production use. Know someone building AI agents? Share this with them! Let’s level up our prompt engineering together 🔥 🔗 Source in comments #Startups #ArtificialIntelligence #PromptEngineering #AgenticAI #EnterpriseAI