Layers of observability in AI systems, explained visually:

If you’re deploying LLM-powered apps to real users, you need to know what’s happening inside your pipeline at every step. Here’s the mental model (see the attached diagram):

Think of your AI pipeline as a series of steps. For simplicity, consider RAG. A user asks a question, it flows through multiple components, and eventually a response comes out. Each of those steps takes time, each step can fail, and each step has its own cost. If you’re only looking at the input and output of the entire system, you will never have full visibility.

This is where traces and spans come in.

> A trace captures the entire journey, from the moment a user submits a query to when they get a response. Look at the "Trace" column in the diagram below: one continuous bar that encompasses everything.

> Spans are the individual operations within that trace. Each colored box on the right represents a span.

Let’s walk through what each span captures in this case:

- Query span: the user submits a question. This is where your trace begins. You capture the raw input, timestamp, and session info.
- Embedding span: the query hits the embedding model and becomes a vector. This span tracks token count and latency. If your embedding API is slow or hitting rate limits, you’ll catch it here.
- Retrieval span: the vector goes to your database for similarity search. In our experience, this is where most RAG problems hide; the most common culprits are bad chunks, low relevance scores, wrong top-k values, etc. The retrieval span exposes all of it.
- Context span: the retrieved chunks are assembled with your system prompt. This span shows you exactly what’s being fed to the LLM, so if the context is too long, you’ll see it here.
- Generation span: finally, the LLM produces a response. This span is usually the longest and most expensive.
Input tokens, output tokens, latency, reasoning traces (if any): everything is logged for cost tracking and debugging.

This should make it clear why debugging is almost impossible without span-level tracing. You would know only that the response was bad; you would never know whether it was due to bad retrieval, bad context, or the LLM hallucinating.

Cost tracking is another big one. Span-level tracking lets you see where the money is actually going.

One more thing: AI systems degrade over time. What worked last month might not work today. Span-level metrics let you catch drift early and tune each component independently.

If you want to see how component-level observability + evals are implemented in practice, I have shared a snippet in the comments that uses the DeepEval open-source framework.

____

Find me → Avi Chawla

Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
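The span breakdown above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real tracing library: the `Span` class, the stand-in embedding/retrieval/generation calls, and the metadata keys are all invented for this example.

```python
import time
import uuid

class Span:
    """One operation inside a trace: a name, timing, and arbitrary metadata."""
    def __init__(self, trace_id, name):
        self.trace_id, self.name = trace_id, name
        self.meta = {}
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        # Record how long the operation inside the `with` block took.
        self.latency_ms = (time.perf_counter() - self.start) * 1000

def answer(question, spans):
    trace_id = str(uuid.uuid4())  # one id ties all spans to the same trace
    with Span(trace_id, "embedding") as s:
        vector = [0.1, 0.2]              # stand-in for a real embedding call
        s.meta["token_count"] = len(question.split())
        spans.append(s)
    with Span(trace_id, "retrieval") as s:
        chunks = ["chunk-a", "chunk-b"]  # stand-in for a vector-DB search
        s.meta["top_k"] = len(chunks)
        spans.append(s)
    with Span(trace_id, "generation") as s:
        response = "stub answer"         # stand-in for the LLM call
        s.meta["output_tokens"] = len(response.split())
        spans.append(s)
    return response

spans = []
answer("What is observability?", spans)
print([s.name for s in spans])  # → ['embedding', 'retrieval', 'generation']
```

Real frameworks (OpenTelemetry, DeepEval, etc.) do this with far richer context propagation, but the core idea is the same: every step emits a timed, attributed span tied to one trace id.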
Understanding Observability in AI Systems
Summary
Understanding observability in AI systems means making sure you can monitor, track, and explain how AI makes decisions and handles data from start to finish. Observability allows teams to spot issues early, understand why something happened, and ensure AI systems remain trustworthy and reliable over time.
- Track every step: Set up detailed monitoring that captures inputs, internal processes, and outputs so you can pinpoint where problems arise and see exactly how decisions are made.
- Monitor for change: Use tools to catch data drift and shifting behaviors, ensuring you spot issues before users or regulators do.
- Keep records accessible: Store logs and decision histories in a way that supports easy review, compliance, and troubleshooting whenever you need answers.
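The "monitor for change" point can be made concrete with a small drift check. The sketch below implements the Population Stability Index (PSI) in plain Python; the 0.1/0.2 thresholds are a common industry rule of thumb, not a standard, and the data here is synthetic.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live sample.

    Rule of thumb (convention, not a law): PSI > 0.2 suggests significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample, i):
        in_bin = sum(edges[i] <= x < edges[i + 1] for x in sample)
        if i == bins - 1:                     # include the right edge in the last bin
            in_bin += sum(x == hi for x in sample)
        return max(in_bin / len(sample), 1e-6)  # avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i)) * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

random.seed(0)
baseline = [random.gauss(0, 1) for _ in range(2000)]   # training-time inputs
same     = [random.gauss(0, 1) for _ in range(2000)]   # live traffic, no drift
shifted  = [random.gauss(1.5, 1) for _ in range(2000)] # live traffic, drifted

print(psi(baseline, same) < 0.1)      # stable distribution
print(psi(baseline, shifted) > 0.2)   # drift flagged
```

In practice you would run this per feature (or per embedding dimension summary) on a schedule, alerting before users or regulators notice the change.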
-
Agentic AI is maturing. But the real bottleneck? Debugging the ghosts.

We’ve seen toolkits. We’ve seen use cases. What we haven’t seen, until now, is a way to understand how agents behave once they’re deployed and left to operate on their own.

Because here’s the problem:
→ LLM-based agents are inherently stochastic
→ Same input, different outputs, unpredictable tool invocations
→ “Works in demo” doesn’t scale to production

The authors propose a solution: treat every agent trajectory (tool calls, decisions, delegation patterns) as a process log. Then apply process mining and causal discovery to see what’s consistent, and what’s not.

Why this matters: most failures in multi-agent setups aren’t logic bugs. They’re mismatches between what the developer intended and what the agent improvised.

→ You thought only the Calculator could call math tools
→ But the Manager quietly started using them too
→ Why? The prompt was too vague. The role permissions too soft.

Using causal models, LLM-based static analysis, and trajectory logging, this approach reveals:
→ “Breaches of responsibility” between agents
→ Hidden variability in execution flows
→ Ambiguity in natural language prompts that leads to divergence
→ Unstable behavior even with temperature = 0

This isn’t just academic. It’s the early foundation for something we don’t yet have: DevOps for agentic systems.

Implications for enterprise AI teams:
→ You need observability pipelines for your AI agents, not just dashboards for humans
→ Prompt engineering is not enough: you need static validation and runtime tracing
→ Failure analysis must shift from error messages to behavioral forensics

Just like we had to build test harnesses, CI/CD, and tracing for microservices, we’ll now need:
→ Agent trajectory logs
→ Causal maps of tool flows
→ Static analysis of prompt intent vs. observed actions

Because in agentic systems, debugging isn’t about fixing code. It’s about understanding emergent behavior.
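A minimal sketch of the trajectory-as-process-log idea: count distinct execution variants and flag tool calls that breach declared role permissions. The agent names, tools, and policy table are invented for illustration; real process mining and causal discovery go far beyond this.

```python
from collections import Counter

# Each trajectory is the ordered sequence of (agent, tool) events from one run.
trajectories = [
    [("Manager", "delegate"), ("Calculator", "math.add")],
    [("Manager", "delegate"), ("Calculator", "math.add")],
    [("Manager", "math.add")],  # the Manager improvised a tool call
]

# Declared policy: which agent is supposed to use which tools.
allowed = {"Calculator": {"math.add"}, "Manager": {"delegate"}}

# Variant analysis: how many distinct execution paths did the "same" task take?
variants = Counter(tuple(t) for t in trajectories)

# Breach detection: observed behavior vs. declared responsibility.
breaches = [
    (agent, tool)
    for t in trajectories
    for agent, tool in t
    if tool not in allowed.get(agent, set())
]

print(len(variants), breaches)  # → 2 [('Manager', 'math.add')]
```

Even this toy version surfaces the two failure classes the post describes: hidden variability (multiple variants for one task) and breaches of responsibility (the Manager quietly calling math tools).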
Would love to hear from:
→ Builders working with CrewAI, LangGraph, AutoGen
→ Teams deploying autonomous workflows in production
→ Researchers thinking about agent alignment and runtime guarantees

What would your agent observability stack look like? And who owns the problem when the AI decides to go off-script?
-
Why Agent Observability Actually Matters

AI agents in enterprise workflows are fundamentally different from regular software. Traditional apps are predictable: you enter X and get Y. Agents, on the other hand, are probabilistic. They adapt, reason, and make decisions with little oversight, which creates risks that standard monitoring tools often miss.

In practice, your APM dashboard might show that everything is running smoothly: latency is good, error rates are low, and resources are stable. Meanwhile, your financial planning agent could be underweighting expense categories because the data changed, and you might not notice until the forecast is wrong months later. True agent observability should capture the agent’s reasoning, not just its output. What options did it consider, how confident was it, and what probabilities influenced the decision?

Drift is a major concern and comes in different forms. Input drift occurs when new data differs from what the model was trained on. Model drift is when the link between inputs and outputs changes over time. Semantic drift is harder to spot, because the agent’s understanding of your instructions can shift, especially as it learns from ongoing use. In systems with multiple agents or swarms, drift can compound and cause unexpected problems further down the line.

Decision retention is equally important. For example, when an agent assigns a vendor payment to a GL code, you need to record that decision so you can review it later: what inputs the agent used, its confidence level, other options it considered, and whether anyone corrected it afterward. This provides audit trails, supports root-cause analysis, and helps you find patterns that are hard to see when looking at decisions one by one.

The real value comes when you link observability data, drift signals, and decision history in a way you can explore.
Instead of only asking what happened, you can ask why it happened and whether you have seen this pattern before. This turns agents from black boxes into systems you can understand and manage.

When implementing these systems, you need to manage storage and performance carefully, as collecting all this data generates a large amount of telemetry. The solution is to sample wisely, keep detailed records for important decisions, and compress routine data.

Last but not least, regulators are also watching closely. It is arguably only a matter of time before SOX, the SEC, HIPAA, and the EU AI Act all require some form of this: if you use agents in critical workflows, you must be able to demonstrate how decisions were made. Organizations that build this infrastructure now will be better prepared when regulations become stricter.

The TLDR is that as agents take on more critical operations, observability isn't optional. You either build systems you can explain and audit, or you end up running black boxes you can't trust.
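A decision-retention record like the one described (the GL-code example) could look like this. This is a sketch: the field names and values are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """Everything needed to audit one agent decision after the fact."""
    agent: str
    action: str                 # e.g. "assign_gl_code"
    inputs: dict                # what the agent saw
    chosen: str                 # what it decided
    confidence: float           # how sure it was
    alternatives: list          # other options considered, with scores
    corrected_by_human: bool = False
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

rec = DecisionRecord(
    agent="ap-agent",
    action="assign_gl_code",
    inputs={"vendor": "Acme Corp", "amount": 1250.00},
    chosen="6100-software",
    confidence=0.71,
    alternatives=[("6400-consulting", 0.22), ("6900-misc", 0.07)],
)

audit_log = [asdict(rec)]        # an append-only store in a real system
print(audit_log[0]["chosen"])    # → 6100-software
```

Because each record carries the alternatives and confidence, you can later query for patterns (e.g. low-confidence decisions that humans kept correcting) instead of reviewing decisions one by one.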
-
As I finish sketching my “AI in 2026” observations, this last one ties everything together: as autonomy scales, responsibility becomes harder to locate.

Once AI systems act continuously, coordinate with other agents, transact economically, and operate across organizational and jurisdictional boundaries, responsibility no longer maps cleanly to a single prompt, model, or human decision. Actions emerge from interactions. Decisions unfold over time. Outcomes are shaped by systems, not moments.

When an agent triggers a financial loss, teams want to know what happened, why it happened, and where intervention was possible. When behavior drifts gradually, leaders need visibility into how decisions are being shaped by memory, incentives, and prior actions. Static policies and post-hoc audits don’t provide that clarity. This is why adaptive governance is becoming a practical design requirement.

You can already see signals across research and product ecosystems. Recent work on autonomous agent oversight emphasizes runtime monitoring, traceability of decision paths, and intervention mechanisms that operate while systems are active. Explainability is moving closer to behavior itself: which tools were invoked, which memories were retrieved, and which constraints influenced an action.

Startups are converging on the same needs from the ground up:
⭐ AgentOps.ai focuses on observability for agentic systems, tracing execution and surfacing failure modes in production.
⭐ CrewAI emphasizes role clarity and structured collaboration to make multi-agent behavior legible.
⭐ Portal26 and similar efforts focus on policy enforcement and auditability at the system level rather than trust in individual components.
⭐ Credo AI addresses governance from the organizational layer, helping enterprises operationalize AI policy, risk management, and accountability across models and systems.

Responsibility shifts toward runtime visibility and control.
Organizations begin to define responsibility across various layers, including agent behavior, orchestration logic, memory and data access, economic constraints, and human oversight. Governance becomes something systems participate in. Escalation paths are designed in advance. Intervention points are explicit. Logs and traces are preserved with intent, not just for debugging. This reaches beyond engineering. Legal teams, risk functions, procurement, and insurance increasingly ask for evidence of control rather than assurances of intent. Accountability becomes something that can be inspected and tested. By 2026, responsibility becomes a first-order design constraint. The organizations that scale autonomy successfully will build systems that can explain themselves, surface risk early, and invite intervention when boundaries are approached. Governance becomes part of the architecture. This is where AI stops being experimental capability and becomes institutional infrastructure.
-
🚀 Rethinking AI Risk Through the Lens of Control Theory: Introducing an Agentic AI Risk Assessment Framework

As we enter the agentic AI era (systems that actively pursue complex, multi-step goals in open environments), traditional risk frameworks feel like using a thermometer to navigate a spaceship. These are dynamic, non-linear, goal-directed systems. The only serious way to govern them is control-theoretic governance.

Here’s the framework I’ve been refining, built explicitly on classical and modern control theory.

🌀 Controllability: Can we reliably steer the agent from any state to a desired safe state in finite time, even under uncertainty or adversarial inputs? (Think: rank of the controllability Gramian in discrete-time systems, or the existence of a stabilizing feedback policy under partial observability.)

👁️ Observability & Interpretability: Can we reconstruct the internal goal representation, planning horizon, and latent intentions from observable outputs alone? Weak observability → emergent deception or reward hacking becomes undetectable.

🎯 Stability (Robustness to Perturbations): Is the agent’s behavior BIBO stable (bounded input → bounded output) under distribution shift, goal misspecification, or malicious prompting? More critically: is it asymptotically stable around the intended objective, or does it exhibit chaotic or runaway amplification?

🔄 Feedback Bandwidth & Correction Latency: How quickly can human-in-the-loop or automated guardrails detect and correct deviations? A system with high control delay is effectively uncontrollable in fast-moving environments (e.g., recursive self-improvement scenarios).

🛡️ Disturbance Rejection & Adversarial Robustness: What is the H∞ norm of the closed-loop system? In plain English: how much worst-case disruption (prompt injection, data poisoning, objective tampering) can the system tolerate before catastrophic failure?
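For readers who want the formal versions of these conditions, here is a compact restatement for a discrete-time linear system \(x_{k+1} = A x_k + B u_k,\; y_k = C x_k\) with \(n\) states. This is a deliberate simplification: real agents are nonlinear, so these serve as intuition rather than directly applicable tests.

```latex
% Controllability: the pair (A, B) is controllable iff the Gramian has full rank
W_c = \sum_{k=0}^{n-1} A^k B B^\top (A^\top)^k,
\qquad \operatorname{rank}(W_c) = n

% Observability: the dual condition on the pair (A, C)
W_o = \sum_{k=0}^{n-1} (A^\top)^k C^\top C A^k,
\qquad \operatorname{rank}(W_o) = n

% BIBO stability: every bounded input produces a bounded output
\exists\, M, N < \infty:\quad
\|u_k\| \le M \;\;\forall k
\;\Longrightarrow\;
\|y_k\| \le N \;\;\forall k
```

The duality between \(W_c\) and \(W_o\) is the formal version of the post's pairing: what you can steer (controllability) mirrors what you can reconstruct from outputs (observability).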
Control theory gives us what today’s governance lacks: provable worst-case bounds, formal verification tools, and the actual engineering language used for rockets, grids, and reactors.

Bank for International Settlements – BIS leaders (Trichet, Haldane, Carstens, Borio, others) have used exactly these concepts for 15+ years to explain why some financial systems survive crises and others explode. The Monetary Authority of Singapore (MAS) Nov 2025 consultation paper on Responsible Use of AI explicitly adopts the FEAT Principles I proposed in 2018, and its sections on generative/autonomous systems are effectively demanding this control-theoretic approach.

We already know how to build controllable, observable, stable systems at scale. Will we finally treat agentic AI with nuclear-reactor seriousness instead of consumer-app casualness?

Is control theory the bridge we need for scalable oversight? Or do mesa-optimisation, ontology shifts, etc. break it? Thoughts welcome.

#AgenticAI #AISafety #ControlTheory #AIGovernance #Alignment #FEATPrinciples #MAS #BIS
-
Here's what I've learned from building 5 LLM-related AI projects in 12 months: architecting the observability pipeline is harder than building the product. But it’s absolutely critical.

When you tweak a prompt, add a new feature, or fix a bug, you need to be certain you haven’t broken anything else. Constantly testing this manually is a huge waste of time and energy. So, what's the workaround?

In the PhiloAgents project, we tackled this challenge by building a robust observability pipeline that unifies monitoring and evaluation into a single, cohesive system. We use Opik (by Comet) to:
• Monitor prompt usage and versioning live
• Capture detailed traces of user inputs, agent actions, tool calls, and outputs
• Track key latency metrics like Time to First Token and Tokens per Second
• Run offline evaluations with LLM-as-a-Judge on curated datasets
• Store evaluation metadata for tracking performance over time

The pipeline consists of two main parts:
𝗢𝗻𝗹𝗶𝗻𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲: Gives real-time visibility into agent behavior, helping us quickly identify bugs, regressions, or performance issues in production.
𝗢𝗳𝗳𝗹𝗶𝗻𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲: Runs systematic, reproducible tests to measure overall agent quality, track trends, and validate improvements before deployment.

Together, these pipelines create a feedback loop that supports both immediate debugging and long-term continuous improvement. If you want to build AI agents that survive beyond proofs of concept and scale reliably in production, mastering observability is non-negotiable.

For a practical deep dive into how we architected this system, check out Lesson 5 of the PhiloAgents course.
🔗 Here's the link: https://lnkd.in/dRYgHyid
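The streaming latency metrics mentioned above (Time to First Token, Tokens per Second) reduce to simple arithmetic over a token stream. A sketch, with the LLM response simulated by a generator; any real tracing tool computes these the same way:

```python
import time

def stream_metrics(token_stream):
    """Compute Time to First Token and Tokens per Second from a token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            # Elapsed time until the very first token arrived.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else 0.0
    return ttft, tps, count

def fake_stream():
    """Stand-in for an LLM streaming response (10 ms per token)."""
    for tok in ["observability", "is", "non-negotiable"]:
        time.sleep(0.01)
        yield tok

ttft, tps, count = stream_metrics(fake_stream())
print(count)  # → 3
```

TTFT tells you how long the user stares at a blank screen; TPS tells you how fast the answer fills in once it starts. The two often regress independently, which is why both are worth tracking per span.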
-
Buying an observability platform doesn't give you observability. Just like buying a gym membership doesn't make you fit.

Tools matter, but observability is a system made up of people, processes, and instrumentation. It requires consistency, conventions, and collaboration across teams.

Observability becomes a system when you have:
🔹 Instrumentation discipline: Services emit structured, meaningful telemetry, not whatever each developer prefers.
🔹 Semantic conventions: Attributes, span names, and error formats are consistent across services.
🔹 A reliable pipeline: OpenTelemetry Collectors route data predictably and safely.
🔹 Operational workflows: Engineers know how to investigate outages, not just where to click.
🔹 Ownership: Teams maintain what they instrument and review observability as part of delivery.

Without these pieces, even the best tool becomes little more than a data sink.

🧩 Example: When Observability Fails as a Tool
Imagine a company buys a premium observability platform. They hook up a few logs and metrics. Dashboards are created. Alerts are set. Then an incident happens. Engineers jump into dashboards and see CPU spikes but no correlated traces. They search logs, but every service logs differently. They pull up metrics, but have no context for which user flows are impacted. Everyone spends hours guessing. Why? Because they bought a tool, but never built a system.

🧩 Example: When Observability Works as a System
Another team invests in:
• Consistent OTel instrumentation across services
• Shared semantic conventions
• A unified collector pipeline
• Playbooks for incident response
• Regular observability reviews in sprint cycles

When something breaks, engineers instantly see:
• The failing service
• The impacted user flows
• The exact span where latency spikes began
• Related logs with matching attributes
• Recent deployments that touched that code path

They don't just detect the issue, they understand it. That's observability as a system.
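The "semantic conventions" piece can even be enforced mechanically. Here is a toy lint for emitted spans, assuming a team-agreed attribute list and naming pattern; both are invented for this sketch, and a real team would start from the published OpenTelemetry semantic conventions instead.

```python
import re

# Shared conventions the team agrees on (names here are illustrative).
REQUIRED_ATTRS = {"service.name", "http.method", "http.status_code"}
SPAN_NAME_PATTERN = r"^[a-z][a-z0-9_.]*$"   # lowercase dotted names only

def lint_span(name, attributes):
    """Return a list of convention violations for one emitted span."""
    problems = []
    if not re.match(SPAN_NAME_PATTERN, name):
        problems.append(f"bad span name: {name!r}")
    missing = REQUIRED_ATTRS - attributes.keys()
    if missing:
        problems.append(f"missing attributes: {sorted(missing)}")
    return problems

ok = lint_span(
    "checkout.charge_card",
    {"service.name": "payments", "http.method": "POST", "http.status_code": 200},
)
bad = lint_span("ChargeCard!!", {"service.name": "payments"})

print(ok, len(bad))  # → [] 2
```

Running a check like this in CI, or inside an OpenTelemetry Collector processor, is one way the "instrumentation discipline" bullet stops being a wish and becomes a gate.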
🎯 Bottom Line Observability isn't what you buy. It's what you build over time. Tools give you capabilities. Systems give you outcomes. 💬 How have you built observability beyond just tools in your organization? #Observability #OpenTelemetry #PlatformEngineering #SRE #O11yEngineering
-
As AI agents move from experiments into real, multi-agent systems, it's important to ask: how do we know they’re actually working, and toward the outcomes we care about? That’s where AI agent observability becomes the next measured leap (https://deloi.tt/4szsO3I).

As organizations move from “human in the loop” to “human on the loop,” agents stop being tools and start behaving more like digital teammates, with execution giving way to supervision. Productivity gives way to accountability. Intuition alone stops being enough. Within this reframing, observability isn’t just a technical capability. It’s a technology-enabled discipline that lets organizations see, understand, and continuously improve how agents perform against goals, not just system metrics.

We’re seeing this shift across a number of functions:
🟢 From execution to oversight. Agents can take on repeatable work, but humans don’t disappear; their role evolves. Oversight, judgment, and intervention become the differentiators.
🟢 From legacy KPIs to agent-native metrics. Traditional measures don’t translate cleanly to autonomous systems. New KPIs need to appraise impact, productivity, and risk.
🟢 From one-off deployments to agent operations. Observability, governance, and tuning have to scale across use cases, not get rebuilt every time.

When we get human oversight and control right, we can enable our organizations to move faster with confidence (and avoid surprises) as agents take on more responsibility.

Great job Prakul Sharma, Parth Patwari and Brijraj Limbad!
-
CSIRO researchers just unveiled AgentOps, a groundbreaking DevOps paradigm that tackles the black-box problem in LLM agents by enabling comprehensive observability across their entire lifecycle.

While LLM agents show immense potential for automating complex tasks, their autonomous and non-deterministic behavior raises significant AI safety concerns. AgentOps addresses this by introducing a systematic way to trace and monitor agent behavior.

Paper highlights:
(1) Artifact relationship model: maps out the complex interactions between different components of an agent system, from reasoning and planning to execution and evaluation
(2) Comprehensive taxonomy: provides a template for developers to implement proper monitoring and logging of agent activities
(3) Systematic tracing approach: enables agent developers to monitor agent behavior, track artifacts, detect anomalies, and assign accountability

This study is particularly relevant as most existing DevOps tools focus only on LLM-specific metrics and prompt management, leaving a critical gap in agent-specific observability.

Paper: https://lnkd.in/gnqCMWJ3
More posts on AI Agents: https://lnkd.in/gpeDupnj
-
Aren’t #evals and #observability the same thing? I read a post (https://lnkd.in/gkYeYbmj) by Aakash Gupta & Aman Khan and was curious to dig in a bit. Here is a quick cheat sheet.

Short answer: no. They are two sides of building trustworthy AI.

Evals = Exams
-> Structured tests before launch.
-> Golden datasets, win-rate tests, bias checks.
-> Ask: “Does my AI meet the bar I set?”

Observability = Health Monitoring
-> Continuous monitoring in production.
-> Tracks anomalies, drift, repeated outputs, hallucinations.
-> Ask: “Is my AI behaving as expected in the wild?”

Together, they cover the full lifecycle:
1. Prototype: Evals light your way.
2. Pre-launch: Evals prevent bad surprises.
3. Launch: Observability catches real-world issues.
4. Scale: Both ensure reliability + trust.

The takeaway: evals prevent; observability detects. You need both to avoid “it worked in the demo, but failed in production” moments.

Curious to hear: how are you balancing evals vs. observability in your AI product workflow?

#AIProductManagement #Evals #Observability #AI
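The evals-vs-observability split can be sketched in a dozen lines. Everything here is illustrative: the golden dataset, the stubbed model, and the crude repeated-output check stand in for real eval frameworks and production monitors.

```python
# Evals: a pre-launch exam against a golden dataset (examples are invented).
golden = [
    {"q": "Capital of France?", "expected": "paris"},
    {"q": "2 + 2?", "expected": "4"},
]

def model(q):
    """Stand-in for the real LLM call."""
    return {"Capital of France?": "Paris", "2 + 2?": "4"}.get(q, "")

def run_evals(dataset):
    """Fraction of golden cases the model passes: 'does it meet the bar?'"""
    passed = sum(model(c["q"]).strip().lower() == c["expected"] for c in dataset)
    return passed / len(dataset)

# Observability: a health check on live production outputs.
def monitor(outputs, max_repeat_frac=0.5):
    """Flag one crude anomaly: the model repeating a single answer too often."""
    if not outputs:
        return False
    top = max(outputs.count(o) for o in set(outputs))
    return top / len(outputs) > max_repeat_frac

print(run_evals(golden))                         # → 1.0 (exam passed pre-launch)
print(monitor(["ok", "ok", "ok", "different"]))  # → True (anomaly in the wild)
```

Note the asymmetry the post describes: the eval runs once against a fixed bar before launch, while the monitor runs continuously against whatever production throws at it. Passing one says nothing about the other.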