I started by asking AI to do everything. Six months later, 65% of my agent’s workflow nodes run as non-AI code.

The first version was fully agentic: every task went to an LLM. LLMs would confidently progress through tasks, though not always accurately. So I added tools to constrain what the LLM could call and limit its ability to deviate. I added a Discovery tool to help the AI find those tools. Better, but not enough.

Then I found Stripe’s minion architecture. Their insight: deterministic code handles the predictable; LLMs tackle the ambiguous. I implemented blueprints, workflow charts written in code. Each blueprint specifies nodes, transitions between them, trigger conditions for matching tasks, and explicit error handling.

This differs from skills or prompts. A skill tells the LLM what to do. A blueprint tells the system when to involve the LLM at all. Each blueprint is a directed graph of nodes. Nodes come in two types: deterministic (code) and agentic (LLM). Transitions between nodes can branch based on conditions.

Deal pipeline updates, chat messages, and email routing account for 29% of workflows, all without a single LLM call. Company research, newsletter processing, and person research need the LLM for extraction and synthesis only: another 36%. These workflows run 67-91% as code. The LLM sees only what it needs: a chunk of text to summarize, a list to categorize, processed in one to three turns with constrained tools.

Blog posts, document analysis, and bug fixes are genuinely hybrid: 21% of workflows. Multiple LLM calls iterate toward quality. Only 14% remain fully agentic: data transforms and error investigations. These tend to be coding tasks rather than evaluating a decision point in a workflow; the LLM needs freedom to explore.

AI started doing everything. Now it handles routing, exceptions, research, planning, and coding. The rest runs without it. Is AI doing less? Yes. Is the system doing more? Also yes.
The blueprints, the tools, the skills might be temporary scaffolding. With each new model release, capabilities expand. Tasks that required deterministic code six months ago might not tomorrow.
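The blueprint idea above can be sketched in a few lines: a directed graph whose nodes are either deterministic code or LLM calls, with transitions chosen by conditions. The names and structure here are a minimal illustration of the pattern, not the post author's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    name: str
    kind: str                        # "deterministic" (code) or "agentic" (LLM)
    run: Callable[[dict], dict]      # takes workflow state, returns updated state

@dataclass
class Blueprint:
    trigger: Callable[[dict], bool]  # does this blueprint match the incoming task?
    nodes: dict = field(default_factory=dict)
    # transitions: node name -> function picking the next node name (None = done)
    transitions: dict = field(default_factory=dict)

    def execute(self, task: dict, start: str) -> dict:
        state, current = dict(task), start
        while current is not None:
            state = self.nodes[current].run(state)
            state["path"] = state.get("path", []) + [current]
            current = self.transitions.get(current, lambda s: None)(state)
        return state

# A deterministic node: route an email by sender domain -- no LLM involved.
def route_email(state: dict) -> dict:
    state["queue"] = "sales" if state["sender"].endswith("@bigco.com") else "general"
    return state

bp = Blueprint(trigger=lambda t: t.get("type") == "email")
bp.nodes["route"] = Node("route", "deterministic", route_email)
bp.transitions["route"] = lambda s: None   # terminal node

result = bp.execute({"type": "email", "sender": "ceo@bigco.com"}, start="route")
print(result["queue"])   # -> sales
```

An agentic node would have the same `Node` shape, with `run` wrapping an LLM call; the graph decides when that happens, which is the whole point of the pattern.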
Understanding LLM Workflow Variability
Summary
Understanding LLM workflow variability means recognizing why large language models (LLMs) may produce different results for the same task, even when settings and prompts appear unchanged. This variability is affected not just by the model itself, but also by the surrounding system—such as workflow design, prompt structure, and operational conditions—making it crucial to design, monitor, and troubleshoot for reliable AI output.
- Control system architecture: Map out and manage each component in your workflow to identify where variability may be introduced, from prompt construction to tool integration.
- Standardize evaluation practices: Regularly test and review prompts and system conditions, using consistent templates and example-driven prompts to reduce unpredictable responses.
- Monitor production environment: Pay attention to batching, scaling, and hardware operations in live deployments, since these factors can influence LLM output even with fixed parameters.
Small variations in prompts can lead to very different LLM responses. Research that measures LLM prompt sensitivity uncovers what matters, and the strategies to get the best outcomes. A new framework for prompt sensitivity, ProSA, shows that response robustness increases with factors including higher model confidence, few-shot examples, and larger model size. Some strategies you should consider given these findings:

💡 Understand Prompt Sensitivity and Test Variability: LLMs can produce different responses with minor rephrasings of the same prompt. Testing multiple prompt versions is essential, as even small wording adjustments can significantly impact the outcome. Organizations may benefit from creating a library of proven prompts, noting which styles perform best for different types of queries.

🧩 Integrate Few-Shot Examples for Consistency: Including few-shot examples (demonstrative samples within prompts) enhances the stability of responses, especially in larger models. For complex or high-priority tasks, adding a few-shot structure can reduce prompt sensitivity. Standardizing few-shot examples in key prompts across the organization helps ensure consistent output.

🧠 Match Prompt Style to Task Complexity: Different tasks benefit from different prompt strategies. Knowledge-based tasks like basic Q&A are generally less sensitive to prompt variations than complex, reasoning-heavy tasks, such as coding or creative requests. For these complex tasks, using structured, example-rich prompts can improve response reliability.

📈 Use Decoding Confidence as a Quality Check: High decoding confidence—the model’s level of certainty in its responses—indicates robustness against prompt variations. Organizations can track confidence scores to flag low-confidence responses and identify prompts that might need adjustment, enhancing the overall quality of outputs.
📜 Standardize Prompt Templates for Reliability: Simple, standardized templates reduce prompt sensitivity across users and tasks. For frequent or critical applications, well-designed, straightforward prompt templates minimize variability in responses. Organizations should consider a “best-practices” prompt set that can be shared across teams to ensure reliable outcomes.

🔄 Regularly Review and Optimize Prompts: As LLMs evolve, so may prompt performance. Routine prompt evaluations help organizations adapt to model changes and maintain high-quality, reliable responses over time. Regularly revisiting and refining key prompts ensures they stay aligned with the latest LLM behavior.

Link to paper in comments.
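A prompt-sensitivity check in the spirit of the findings above can be very small: run several paraphrases of the same question and measure how often the answers agree. The `ask` function below is a stub standing in for a real model call, and the agreement metric is an illustrative assumption, not ProSA's actual measure.

```python
from collections import Counter

def ask(prompt: str) -> str:
    # Stub: a real system would call an LLM here with pinned sampling settings.
    return "paris" if "capital" in prompt.lower() else "unknown"

def robustness(paraphrases: list[str]) -> float:
    """Fraction of paraphrases yielding the modal answer (1.0 = fully robust)."""
    answers = [ask(p).strip().lower() for p in paraphrases]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

variants = [
    "What is the capital of France?",
    "Name France's capital city.",
    "France's capital is which city?",
]
print(robustness(variants))   # 1.0 when every phrasing agrees
```

Running this over a prompt library makes "which styles perform best" a number you can track rather than a hunch.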
-
"𝐖𝐡𝐲 𝐢𝐬 𝐦𝐲 𝐋𝐋𝐌 𝐠𝐢𝐯𝐢𝐧𝐠 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐭 𝐚𝐧𝐬𝐰𝐞𝐫𝐬 𝐭𝐨 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐪𝐮𝐞𝐬𝐭𝐢𝐨𝐧?" If you have asked this in the last month, here is your Debugging Playbook.

Most teams treat inconsistent LLM outputs as a Model Problem. It is almost never the Model. It is your System Architecture exposing variability you did not know existed. After debugging 40+ production AI systems, I have developed a 6-Step Framework that isolates the real culprit:

Step 1: Confirm the Inconsistency Is Real
• Compare responses across identical prompts
• Control temperature, top-p, and randomness
• Check prompt versions and hidden changes
• Goal: Rule out noise before debugging the system

Step 2: Break the Output into System Drivers
• Decompose your response pipeline into components: prompt structure, retrieved context (RAG), tool calls, model version, system instructions
• Use a "dropped metric" approach to test each driver independently
• Goal: Identify where variability can be introduced

Step 3: Analyze Variability per Driver
• Inspect each driver independently for instability
• Does retrieval return different chunks? Are tool outputs non-deterministic? Are prompts dynamically constructed?
• Test drivers across the same period vs the previous period
• Goal: Isolate the component causing divergence

Step 4: Segment by Execution Conditions
• Slice outputs by environment or context: user input variants, model updates/routing, time-based data changes, token limits or truncation
• Look for patterns in when inconsistency spikes
• Goal: Find the conditions under which inconsistency spikes

Step 5: Compare Stable vs Unstable Runs
• Contrast successful outputs with failing ones
• Same prompt/different output, same context/different reasoning, same goal/different execution
• Goal: Surface the exact difference that matters

Step 6: Form and Test Hypotheses
• Turn findings into testable explanations
• Hypothesis: retrieval drift, prompt ambiguity, tool response variance
• Goal: Move from suspicion to proof

The pattern I see repeatedly: Teams jump straight to "let's try a different model" or "let's add more examples." But inconsistent outputs are rarely a model issue; they are usually a system issue.
• Your retrieval is pulling different documents.
• Your tool is returning non-deterministic results.
• Your prompt is being constructed differently based on context length.

The 6-step framework forces you to treat LLM systems like the distributed systems they actually are. Which step do most teams skip? Step 1. They assume inconsistency without proving it. Control your variables first.

♻️ Repost this to help your network get started
➕ Follow Anurag (Anu) Karuparti for more
PS: If you found this valuable, join my weekly newsletter where I document the real-world journey of AI transformation.
✉️ Free subscription: https://lnkd.in/exc4upeq
#GenAI #AIAgents
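Step 1 above can be automated in a few lines: fire the identical prompt N times with sampling pinned down and hash the responses. `generate` is a placeholder for your actual model call (temperature 0, fixed seed); the function name and shape are illustrative, not from the post.

```python
import hashlib

def generate(prompt: str) -> str:
    # Stub: substitute a real model call with temperature=0 and a fixed seed.
    return f"answer to: {prompt}"

def is_reproducible(prompt: str, n: int = 10) -> bool:
    """True if n identical calls produce byte-identical outputs."""
    digests = {
        hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        for _ in range(n)
    }
    return len(digests) == 1

print(is_reproducible("Summarize ticket #123"))   # True for this deterministic stub
```

If this returns False with everything pinned, the inconsistency is real and you move to Step 2; if it returns True, the variability is coming from somewhere upstream of the model call.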
-
It shouldn’t surprise people that LLMs are not fully deterministic; they can’t be. Even when you set temperature to zero, fix the seed, and send the exact same prompt, you can still get different outputs in production.

There’s a common misconception that nondeterminism in LLMs comes only from sampling strategies. In reality, part of the variability comes from how inference is engineered at scale. In production systems, requests are often batched together to optimize throughput and cost. Depending on traffic patterns, your prompt may be grouped differently at different times. That changes how certain low-level numerical operations are executed on hardware. And because floating-point arithmetic is not perfectly associative, tiny numerical differences can accumulate and lead to different token choices. The model weights haven’t changed, and neither has the prompt. But the serving context has.

Enterprise teams often evaluate models assuming reproducibility is guaranteed if parameters are fixed. But reliability in LLM systems is not only a modeling problem; it is a systems engineering problem. You can push toward stricter determinism, but doing so may require architectural trade-offs in latency, cost, or scaling flexibility. The point is not that LLMs are unreliable, but that nondeterminism is part of the stack. If you are deploying AI in production, you need to understand where it enters, and design your evaluation, monitoring, and governance around it.
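The floating-point point above is easy to demonstrate in any language: addition is not associative, so summing the same values in a different (batch-dependent) order can yield a different result, which is exactly how serving context leaks into token choices.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left, right, left == right)   # differ in the last bit -> False
```

Scale that last-bit difference across billions of accumulations inside a forward pass, and two logits that were nearly tied can swap order, flipping the sampled token even at temperature zero.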
-
In an interesting study from our newsletter sponsor Qevlar AI, they discuss a fundamental problem that keeps getting glossed over: LLMs are non-deterministic. Run the same security investigation twice, get different results. Sometimes dramatically different. These inconsistencies are baked into how LLMs work.

Qevlar AI quantified this problem. They ran 18,000 investigation attempts on 180 real security alerts: same inputs, different outputs. The numbers are interesting:
→ Even simple 3-step investigations only followed the same path 75% of the time
→ Complex alerts (15-20 steps) generated 90 unique investigation paths across 100 attempts
→ The canonical path appeared in just 3% of complex cases
→ Critical enrichment steps like CTI queries were randomly skipped 17% of the time

In production SOCs, this means:
1. Identical alerts get different severity ratings depending on which path the LLM decides to take
2. Investigation quality becomes a dice roll
3. You can't establish consistent baselines or SOPs
4. False negatives vary unpredictably

Mature SOC processes depend on consistency; it is paramount in how we train analysts, maintain quality, and ensure nothing gets missed in the SOC. Qevlar's approach is that they're not trying to prompt-engineer their way out of this. They built a graph orchestration layer that enforces deterministic investigation paths. The LLM performs analysis at each step, but the workflow itself is predictable and repeatable. The study is linked below. Worth a read if you're evaluating autonomous SOC tools or building AI-powered security workflows. https://lnkd.in/g4rCcYnp
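The orchestration idea described above can be sketched simply: the step sequence is fixed by code, and the LLM only fills in the analysis at each step. All names here are illustrative assumptions, not Qevlar's actual implementation.

```python
# The investigation path is deterministic by construction: the orchestrator,
# not the model, decides which steps run and in what order.
INVESTIGATION_STEPS = ["parse_alert", "cti_lookup", "asset_enrichment", "verdict"]

def analyze(step: str, context: dict) -> str:
    # Stub for the per-step LLM analysis call; only this part is model-driven.
    return f"analysis for {step} of alert {context['alert']['id']}"

def investigate(alert: dict) -> dict:
    context = {"alert": alert, "findings": {}}
    for step in INVESTIGATION_STEPS:        # no step can be skipped or reordered
        context["findings"][step] = analyze(step, context)
    return context

result = investigate({"id": "A-42", "severity_hint": "high"})
print(list(result["findings"]))   # identical step order on every run
```

Contrast this with a fully agentic loop, where the model chooses the next step each turn: that choice is exactly where the 90-unique-paths variability enters.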
-
Over the past few weeks, I validated several patterns that reveal how AI agents truly behave in production. Autonomy is impressive, but structure still delivers the most consistent results.

In a traditional LLM workflow where logic and reasoning are fully orchestrated, the same model ran twice as fast and used twelve times fewer tokens than an agentic setup. Efficiency scales best when reasoning is guided, not left open-ended. When deterministic logic was moved into the orchestration layer, the agent gained flexibility, but it came at a cost: more time and higher token usage. Predictable performance, yet less efficient overall.

The biggest insight came from reasoning models themselves. GPT 5, with its superior compression and contextual efficiency, outperformed GPT 4o not because it was larger, but because it reasoned more precisely.

What my findings validated: For simple and well-defined use cases, LLM workflows can achieve over 99% reliability without complex agent logic. A verifier layer (a lightweight “check my work” agent) can further improve reliability and confidence. For complex, critical, or regulated processes, orchestration remains faster, cheaper, and more auditable.

Autonomy sounds exciting, but it isn’t always the optimal path. The smartest systems know when to act independently and when to rely on structured reasoning. AI agents perform best within boundaries that balance adaptability with control. Use them where discovery and contextual reasoning create value. Rely on orchestration where precision, governance, and cost efficiency are non-negotiable.
-
Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection. Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows: Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it. Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. 
And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement. Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses. Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results.

If you’re interested in learning more about reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
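The generate/criticize/rewrite loop described above fits in a few lines. The `llm` function below is a canned stub so the sketch is runnable; in practice it would be a real chat-completion call, and the prompts are illustrative rather than taken from the post.

```python
def llm(prompt: str) -> str:
    # Stub with canned responses; replace with a real model call.
    if "criticism" in prompt:
        return "Add input validation."
    if "feedback" in prompt:
        return "def double(x):\n    assert x >= 0\n    return x * 2"
    return "def double(x):\n    return x * 2"

def reflect(task: str, rounds: int = 1) -> str:
    """Generate a draft, then iterate: critique it, rewrite using the critique."""
    draft = llm(f"Write code for: {task}")
    for _ in range(rounds):
        critique = llm(
            f"Here is code intended for {task}:\n{draft}\n"
            "Check it carefully and give constructive criticism."
        )
        draft = llm(
            f"Rewrite the code for {task} using this feedback:\n{critique}\n"
            f"Code:\n{draft}"
        )
    return draft

print(reflect("double a non-negative number"))
```

The two-agent variant is the same loop with the critique coming from a separately prompted "critic" model; adding unit-test execution between critique rounds turns self-reflection into tool-assisted reflection.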
-
One of the biggest challenges I see with scaling LLM agents isn’t the model itself. It’s context. Agents break down not because they “can’t think” but because they lose track of what’s happened, what’s been decided, and why.

Here’s the pattern I notice:
👉 For short tasks, things work fine. The agent remembers the conversation so far, does its subtasks, and pulls everything together reliably.
👉 But the moment the task gets longer, the context window fills up, and the agent starts forgetting key decisions. That’s when results become inconsistent, and trust breaks down.

That’s where Context Engineering comes in.

🔑 Principle 1: Share Full Context, Not Just Results
Reliability starts with transparency. If an agent only shares the final outputs of subtasks, the decision-making trail is lost. That makes it impossible to debug or reproduce. You need the full trace, not just the answer.

🔑 Principle 2: Every Action Is an Implicit Decision
Every step in a workflow isn’t just “doing the work”, it’s making a decision. And if those decisions conflict because context was lost along the way, you end up with unreliable results.

✨ The solution to this is “Engineer Smarter Context”. It’s not about dumping more history into the next step. It’s about carrying forward the right pieces of context:
→ Summarize the messy details into something digestible.
→ Keep the key decisions and turning points visible.
→ Drop the noise that doesn’t matter.

When you do this well, agents can finally handle longer, more complex workflows without falling apart. Reliability doesn’t come from bigger context windows. It comes from smarter context windows.

〰️〰️〰️
Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg
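The three bullets above (summarize, keep decisions, drop noise) can be sketched as a small compression pass over the agent's history. The message schema with a `kind` field is an assumption made for illustration, not a standard format.

```python
def compress_context(history: list[dict]) -> list[dict]:
    """Keep decisions verbatim, summarize the work, drop the chit-chat."""
    decisions = [m for m in history if m.get("kind") == "decision"]
    work = [m for m in history if m.get("kind") == "work"]
    summary = {"kind": "summary", "text": f"{len(work)} work steps completed."}
    # Noise (anything else, e.g. kind == "chatter") is intentionally dropped.
    return decisions + [summary]

history = [
    {"kind": "work", "text": "fetched 3 documents"},
    {"kind": "decision", "text": "use the EU pricing table"},
    {"kind": "chatter", "text": "sure, happy to help!"},
    {"kind": "work", "text": "drafted section 1"},
]
compact = compress_context(history)
print([m["kind"] for m in compact])   # ['decision', 'summary']
```

Feeding `compact` rather than `history` into the next step keeps the decision trail visible long after the raw transcript would have overflowed the window.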
-
If your legal AI makes a “judgment,” it’s not just following rules. 𝗜𝘁’𝘀 𝗲𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝗻𝗴 𝗮 𝘄𝗼𝗿𝗹𝗱𝘃𝗶𝗲𝘄. Most software operates on if/then logic. Input X produces Output Y. Every time. Deterministic. LLMs are fundamentally different. They're probabilistic, and new research shows just how much that matters. A recent paper, "Evaluative Fingerprints" by Wajid N., studied 9 frontier LLMs evaluating the same content using the same rubric. The results are striking: 𝗜𝗻𝘁𝗲𝗿-𝗺𝗼𝗱𝗲𝗹 𝗮𝗴𝗿𝗲𝗲𝗺𝗲𝗻𝘁 𝘄𝗮𝘀 𝗻𝗲𝗮𝗿 𝘇𝗲𝗿𝗼. Yet individual models were remarkably consistent 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲𝗺𝘀𝗲𝗹𝘃𝗲𝘀, just not with each other. The researchers could identify which model produced an evaluation with 𝟴𝟵.𝟵% 𝗮𝗰𝗰𝘂𝗿𝗮𝗰𝘆 based solely on its scoring patterns. Even GPT-4.1 and GPT-5.2 (same provider, different versions) were distinguishable 99.6% of the time. The paper calls this the "reliability paradox": models don't agree on what "good" means, but 𝘁𝗵𝗲𝘆'𝗿𝗲 𝘀𝗼 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁 𝗶𝗻 𝗵𝗼𝘄 𝘁𝗵𝗲𝘆 𝗱𝗶𝘀𝗮𝗴𝗿𝗲𝗲 that their evaluation patterns function as fingerprints. 𝗪𝗵𝘆 𝘀𝗵𝗼𝘂𝗹𝗱 𝗹𝗲𝗴𝗮𝗹 𝗰𝗮𝗿𝗲? In many ways, LLMs behave like people, who are also probabilistic. We've always known that different lawyers assess risk differently, interpret contract language differently, prioritize issues differently. Now we're deploying AI with the same characteristics. Consider a legal department implementing an agentic workflow that escalates matters based on risk assessment. This research suggests the choice of model isn't an implementation detail. It's a substantive decision that shapes outcomes. As we build agentic processes that assert judgment (contract review, risk triage, compliance monitoring), we need to understand that model selection is a methodological choice with real consequences. The question isn't whether to use AI for legal judgment. 𝗜𝘁'𝘀 𝘄𝗵𝗲𝘁𝗵𝗲𝗿 𝘄𝗲 𝘂𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝘄𝗵𝗮𝘁 𝘁𝗵𝗲𝗼𝗿𝘆 𝗼𝗳 𝗷𝘂𝗱𝗴𝗺𝗲𝗻𝘁 𝘄𝗲'𝗿𝗲 𝗲𝗻𝗰𝗼𝗱𝗶𝗻𝗴 𝘄𝗵𝗲𝗻 𝘄𝗲 𝗽𝗶𝗰𝗸 𝗮 𝗺𝗼𝗱𝗲𝗹.
-
The challenge of integrating multiple large language models (LLMs) in enterprise AI isn’t just about picking the best model; it’s about choosing the right mix for each specific scenario. When I was tasked with leveraging Azure AI Foundry alongside Microsoft 365 Copilot, Copilot Studio, Claude Sonnet 4, and Opus 4.1 to enhance workflows, the advice I heard was to double down on a single, well‑tuned model for simplicity. In our environment, that approach started to break down at scale.

Model pluralism turned out to be the unexpected solution: using multiple LLMs in parallel, each optimised for different tasks. The complexity was daunting at first, from integration overhead to security and governance concerns. But this approach let us tighten data grounding and security in ways a single model couldn’t. For example, routing the most sensitive tasks to Opus 4.1 helped us measurably reduce security exposure in our internal monitoring, while Claude Sonnet 4 noticeably improved the speed and quality of customer‑facing interactions.

In practice, the chain looked like this: we integrated multiple LLMs, mapped each one to the tasks it handled best, and saw faster execution on specialised workloads, fewer security and compliance issues, and a clear uplift in overall workflow effectiveness. Just as importantly, the architecture became more robust: if one model degraded or failed, the others could pick up the slack, which matters in a high‑stakes enterprise environment.

The lesson? The “obvious” choice, standardising on a single model for simplicity, can overlook critical realities like security, governance, and scalability. Model pluralism gave us the flexibility and resilience we needed once we moved beyond small pilots into real enterprise scale. For those leading enterprise AI initiatives, how are you balancing the trade‑off between operational simplicity and a pluralistic, multi‑model architecture? What does your current model mix look like?
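The routing-plus-fallback pattern described above reduces to a small table plus a health check. The model names echo the post, but the task labels, routing table, and fallback order are illustrative assumptions, not the author's production configuration.

```python
# Route each task class to its best-fit model, with fallbacks for resilience.
ROUTES = {
    "sensitive":       "claude-opus-4.1",   # tightest data handling
    "customer_facing": "claude-sonnet-4",   # speed + quality
    "default":         "m365-copilot",
}
FALLBACKS = ["claude-sonnet-4", "m365-copilot"]  # used when the primary is down

def pick_model(task_class: str, healthy: set[str]) -> str:
    primary = ROUTES.get(task_class, ROUTES["default"])
    if primary in healthy:
        return primary
    # Degrade gracefully: take the first healthy fallback.
    return next(m for m in FALLBACKS if m in healthy)

healthy = {"claude-sonnet-4", "m365-copilot"}    # opus unavailable in this scenario
print(pick_model("sensitive", healthy))          # falls back to claude-sonnet-4
```

The routing table is also where governance lives: which data classes may reach which model becomes an auditable line of configuration rather than an emergent property of a single shared deployment.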