I was debugging an agent a few weeks ago when I hit a problem that made me realize something fundamental about the shift we’re undergoing. The script had run, consumed a hundred thousand tokens, and returned an answer. But the answer was wrong. Not catastrophically wrong, just subtly, dangerously off.
The issue wasn’t that the model was bad. It was that I had no idea what the agent had thought while producing that answer. Which tools had it called? What information had it retrieved? What reasoning path had it wandered down? I had the input and the output, but the middle, the actual decision-making process, was a black box.
This mirrors the challenge I described in Everything Becomes an Agent. If our future architecture is a mesh of interacting agents, we cannot afford for them to be inscrutable. A single black box is a mystery; a system of black boxes is chaos.
This is the Observability Gap, and it is the first wall you hit when you move from prototype to production. You can build a working agent in an afternoon. You can give it tools, wire up a nice ReAct loop, and watch it dazzle you. But the moment you rely on it for something that matters, you realize you’re flying blind.
How do you know if your agent is working well? And more importantly, how do you fix it when it’s not?
Earlier in this series, I wrote about building guardrails and the Policy Engine that keeps agents from doing dangerous things. Observability is the complement to those guardrails. Guardrails define the boundaries; observability tells you whether the agent is respecting them, struggling against them, or quietly finding ways around them. One without the other is incomplete. A guardrail you can’t monitor is just a hope.
The Chain of Thought Problem
When you’re building traditional software, debugging is an exercise in logic. You set breakpoints, inspect variables, and trace execution. The flow is deterministic: if Input A produces Output B today, it will produce Output B tomorrow.
Agents don’t work that way. The same input can produce wildly different outputs depending on which tools the agent decides to call, how it interprets the results, and what “thought” it generates in that split second. The agent’s logic isn’t written in code; it’s written in natural language, scattered across multiple LLM calls, tool invocations, and iterative refinements.
I learned this the hard way with my Podcast RAG system. I’d ask it a question about a specific episode, and sometimes it would nail it, pulling the exact segment and synthesizing a perfect answer. Other times, it would search with the wrong keywords, get back irrelevant chunks, and confidently synthesize nonsense.
The model wasn’t hallucinating in the traditional sense. It was following a process. But I couldn’t see that process, so I couldn’t fix it.
That experience taught me the most important lesson about production agents: the final answer is the least interesting part. What matters is the chain of thought that produced it, every tool call, every intermediate result, every reasoning trace. Think of it as a flight recorder. When the plane lands at the wrong airport, the only way to understand what went wrong is to replay the entire flight.
Four Layers of Seeing
When I started building that flight recorder, I realized that “log everything” isn’t actually a strategy. You need structure. Through trial and error, and by studying how platforms like Langfuse and Arize Phoenix approach the problem, I’ve come to think of agent observability as having four distinct layers.
The first is the reasoning layer: the agent’s internal monologue where it decomposes your request into sub-tasks. This is where you catch the subtle bugs. When my Podcast RAG agent searched for the wrong keywords, the failure wasn’t in the tool call itself (which returned a perfectly valid HTTP 200). The failure was in the reasoning that chose those keywords. Without visibility into the “Thought” step of the ReAct loop, that kind of error is indistinguishable from an external system failure.
The second is the execution layer: the actual tool calls, their arguments, and the raw results. This is where you catch a different class of bug, one that’s becoming increasingly important. Tool hallucination. Not the model making up facts in prose, but the model calling a tool that doesn’t exist (you provided shell_tool but the model confidently calls bash_tool), fabricating a file path that isn’t real, or passing a string to a parameter that expects an integer. These are operational failures that cascade. I’ve seen an agent confidently pass a hallucinated document ID to a retrieval tool, get back an error, and then re-hallucinate a different invalid ID rather than change strategy. You only catch this if you’re logging the schema validation at the boundary between the model and the tool.
The third is the state layer: the contents of the agent’s context window at each decision point. Agents are stateful creatures. Their behavior at step ten is shaped by everything that happened in steps one through nine. And context windows are not infinite. As verbose tool outputs accumulate, relevant information gets pushed further and further from the model’s attention, a phenomenon researchers call “context drift” or the “Lost in the Middle” effect. Snapshotting the context at critical decision points lets you “time travel” during debugging. You can see exactly what the agent could see when it made its bad call.
The fourth is the feedback layer: error codes, user corrections, and signals from any critic or evaluator models. This layer tells you whether the agent is actually learning from its environment within a session, or just ignoring failure signals and looping. In frameworks like Reflexion, this feedback is explicitly wired into the next reasoning step. Watching this layer is how you know if your self-correction mechanisms are actually correcting.
But capturing these four layers independently isn’t enough. You need to bundle them into sessions: discrete, self-contained records of a single task from the moment the user makes a request to the moment the agent delivers (or fails to deliver) its result. A session is your unit of analysis. It’s the difference between having a pile of timestamped log lines and having a story you can read from beginning to end. When something goes wrong, you don’t want to grep through millions of events hoping to reconstruct what happened. You want to pull up session #47832 and replay the agent’s entire decision-making journey: what it thought, what it tried, what it saw, and how it responded to each result along the way.
This session-level thinking changes how you build your infrastructure. Every trace, every tool call, every context snapshot gets tagged with a session ID. Your dashboards stop showing you aggregate metrics and start showing you individual narratives. You can sort sessions by outcome (success, failure, abandonment), by cost (token consumption), or by duration, and immediately drill into the ones that matter. It’s the observability equivalent of going from reading a box score to watching the game film.
Making It Concrete
Here’s what this looks like in practice. Suppose you ask your agent to “check my calendar and suggest a time for a meeting.”
Without observability, you see:
```
Input:  "Check my calendar and suggest a time for a meeting"
Output: "How about Thursday at 2pm?"
```
With observability across all four layers, you see the mind at work:
```
[REASONING] User wants to schedule a meeting. I need to:
  1. Check their calendar for availability
  2. Consider team availability
  3. Suggest an optimal time
[TOOL CALL] get_calendar(user_id="allen", days=7)
[TOOL RESULT] Returns 45 events over next 7 days
[STATE] Context window: 2,847 tokens used
[REASONING] Analyzing free slots. User has:
  - Monday 2pm-4pm free
  - Thursday 2pm-4pm free
  - Friday all day booked
[TOOL CALL] get_team_availability()
[TOOL RESULT] Team members mostly available Thursday afternoon
[REASONING] Thursday 2pm works for both user and team.
[FEEDBACK] No errors. Response generated.
[RESPONSE] "How about Thursday at 2pm?"
```
Suddenly, the black box is transparent. If the suggestion is wrong, you can see exactly why. Maybe the calendar tool returned incomplete data. Maybe the team availability check failed silently. Maybe the agent’s definition of “optimal” means “soonest” rather than “best for focus time.”
This kind of visibility saved me countless hours when building Gemini Scribe. Users would report that the agent “didn’t understand” their request, which is about as useful as telling your mechanic “the car sounds funny.” But when I turned on debug logging and pulled up the console output, I could see exactly where the confusion happened, usually in how the agent interpreted the file context or which notes it decided were relevant. The fix was never a mystery once I could see the reasoning. All of this logging is to the developer console and off by default, which is an important distinction. You want observability for yourself as the builder, not surveillance of your users.
The Standards Are Coming
For my own production agents, I’ve settled on a layered approach. Structured logging captures every action in machine-parseable JSON. A unique trace ID stitches together every LLM call and tool invocation into a single narrative flow.
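Here’s a minimal sketch of that setup using only the standard library. The field names, layer labels, and logger name are my own conventions, not any framework’s; the point is simply that every event carries the session ID and one of the four layers, so a single task can be reassembled into a narrative later.

```python
import json
import logging
import time
import uuid


def new_session_id() -> str:
    """Mint a unique ID that ties every event in one task together."""
    return uuid.uuid4().hex


def log_event(session_id: str, layer: str, payload: dict) -> str:
    """Emit one machine-parseable JSON line tagged with the session ID.

    `layer` is one of: "reasoning", "execution", "state", "feedback".
    Returns the serialized line so callers can also store or inspect it.
    """
    record = {
        "ts": time.time(),
        "session_id": session_id,
        "layer": layer,
        "payload": payload,
    }
    line = json.dumps(record)
    logging.getLogger("agent.trace").info(line)
    return line


session = new_session_id()
line = log_event(session, "execution", {
    "tool": "get_calendar",
    "args": {"user_id": "allen", "days": 7},
})
```

Because every line is JSON with a `session_id`, replaying a session is a one-line `grep` plus a `jq` sort by timestamp.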
But we are also seeing the industry mature beyond “roll your own.” The critical development here is the adoption of the OpenTelemetry (OTel) standard for GenAI. The OTel community has published semantic conventions that define a standard schema for agent traces: things like gen_ai.system (which provider), gen_ai.request.model (which exact model version), gen_ai.tool.name (which tool was called), and gen_ai.usage.input_tokens (how many tokens were consumed at each step).
This matters because it means an agent built with LangChain in Python and an agent built with Semantic Kernel in C# can produce traces that look structurally identical. You can pipe both into the same Datadog or Langfuse dashboard and analyze them side by side. You aren’t locked into a proprietary debugging tool; you can stream your agent’s thoughts into the same infrastructure you use for the rest of your stack.
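To make the schema concrete, here is what those semantic-convention attributes look like on a single tool-call span. I’m using a plain dict rather than the actual OTel SDK so the sketch stays self-contained, and the provider and model values are illustrative, not the official enum values.

```python
def tool_call_span(provider: str, model: str, tool: str, input_tokens: int) -> dict:
    """One tool-call span as a flat attribute map, keyed with the
    OpenTelemetry GenAI semantic-convention names mentioned above."""
    return {
        "gen_ai.system": provider,                  # which provider
        "gen_ai.request.model": model,              # which exact model version
        "gen_ai.tool.name": tool,                   # which tool was called
        "gen_ai.usage.input_tokens": input_tokens,  # tokens consumed at this step
    }


span = tool_call_span("gemini", "gemini-2.0-flash", "get_calendar", 2847)
```

In a real pipeline these become attributes on an OTel span; any backend that speaks the convention can render them without knowing which framework produced them.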
It also enables what I think of as “boundary tracing,” where you instrument the stable interfaces (the HTTP calls, the tool invocations) rather than hacking into the agent’s internal logic. You get visibility without coupling your observability to a specific framework. That’s important, because if there’s one thing I’ve learned building in this space, it’s that frameworks change fast.
If you’re wondering where to start, here’s my honest advice: don’t wait for the perfect stack. Start with structured JSON logs and a session ID that ties each task together end-to-end. That alone gives you something you can grep, filter, and replay. Once you outgrow that (and you will, faster than you expect), graduate to an OTel-based pipeline. The good news is that many agent frameworks are adding robust hook mechanisms that let you tap into the agent lifecycle (before and after tool calls, on reasoning steps, on errors) without modifying your core logic. These hooks make it straightforward to plug in your telemetry from the start. The key is to instrument early, even if you’re only logging to a local file. Retrofitting observability into an agent that’s already in production is significantly harder than building it in from the beginning.
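The hook mechanism is worth sketching, because it’s what keeps telemetry out of your core logic. This is a generic before/after registry of my own design, not any particular framework’s API, but most frameworks’ hook systems reduce to something like it.

```python
from typing import Any, Callable


class ToolHooks:
    """Minimal lifecycle hooks around tool calls, so telemetry lives
    outside the agent's core logic."""

    def __init__(self) -> None:
        self.before: list[Callable[[str, dict], None]] = []
        self.after: list[Callable[[str, dict, Any], None]] = []

    def run_tool(self, name: str, args: dict, tool: Callable[..., Any]) -> Any:
        for hook in self.before:
            hook(name, args)          # e.g. log the call and its arguments
        result = tool(**args)
        for hook in self.after:
            hook(name, args, result)  # e.g. log the raw result
        return result


events: list[str] = []
hooks = ToolHooks()
hooks.before.append(lambda name, args: events.append(f"before:{name}"))
hooks.after.append(lambda name, args, result: events.append(f"after:{name}"))

result = hooks.run_tool("add", {"a": 2, "b": 3}, lambda a, b: a + b)
```

Swapping the lambdas for the `log_event`-style emitter gives you execution-layer tracing without touching a single tool implementation.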
The Price of Transparency
Here’s the tension no one wants to talk about: full observability is expensive.
Autonomous agents are verbose by nature. A single reasoning step might generate hundreds of tokens of internal monologue. A RAG retrieval might pull megabytes of document context. If you log the full payload for every transaction, your storage costs can rival the cost of the LLM inference itself. I’ve seen reports of evaluation runs consuming over 100 million tokens, with more than 60% of the cost attributed to hidden reasoning tokens.
In production, you need sampling strategies. The approach I’ve landed on borrows from traditional distributed systems. Keep 100% of traces that result in errors or negative user feedback, because every failure is a learning opportunity. Keep traces that exceed your latency threshold (P95 or P99), because slow agents are often stuck agents. And for everything else, a small random sample (1-5%) is enough to establish your baseline and spot trends.
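The whole sampling policy fits in one function. This is a sketch of my decision rule, with the threshold arguments as assumptions you’d tune for your own traffic.

```python
import random


def keep_trace(error: bool, negative_feedback: bool,
               latency_ms: float, p95_latency_ms: float,
               baseline_rate: float = 0.05) -> bool:
    """Decide whether to retain a session's full trace."""
    if error or negative_feedback:
        return True                    # keep 100% of failures
    if latency_ms > p95_latency_ms:
        return True                    # keep everything over the latency threshold
    return random.random() < baseline_rate  # small random sample of the rest
```

Run at trace-flush time, this keeps every failure and every slow session while letting the boring successes mostly fall away.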
For storage, I use a tiered approach. Recent and failed traces go into a fast database for immediate querying. Older successful traces get compressed and moved to cold storage, where they can be pulled back if needed for deeper analysis. It’s not glamorous, but it keeps costs manageable without sacrificing the ability to debug the things that matter. In my own setup, this sampling and tiering strategy keeps observability overhead to roughly 15-20% of my inference spend. Without it, I was on track to spend more on storing agent thoughts than on generating them.
Evaluation Beyond Unit Tests
Logging tells you what happened. Evaluation tells you if it was any good.
This is where agents diverge sharply from traditional software. You can’t write a unit test that asserts function(x) == y. The whole point of an agent is to make decisions, and decisions must be evaluated on quality, not just syntax.
As Gemini Scribe grew more capable, I had to develop a new kind of test suite. I track Task Success Rate (did the agent accomplish what the user asked?), Tool Use Accuracy (did it read the right files and use the right tools for the job?), and Efficiency (did it burn 50 steps to do a 2-step task?).
But here’s the number that keeps me up at night. Because agents are non-deterministic, a single run is statistically meaningless. You have to run the same evaluation multiple times and look at distributions. Researchers distinguish between Pass@k (the probability that at least one of k attempts succeeds) and Pass^k (the probability that all k attempts succeed). Pass@k measures potential. Pass^k measures reliability.
The math is sobering. If your agent has a 70% success rate on a single attempt, its Pass^3 (succeeding three times in a row) drops to about 34%. Scale that to a real workflow where the agent needs to perform ten sequential steps correctly, and even a 95% per-step success rate gives you only about a 60% chance of completing the full task. This is the compounding probability of failure, and it’s why “works most of the time” isn’t good enough for production.
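The arithmetic behind those numbers is two lines, assuming attempts are independent with per-attempt success rate p:

```python
def pass_hat_k(p: float, k: int) -> float:
    """Pass^k: probability that all k independent attempts succeed."""
    return p ** k


def pass_at_k(p: float, k: int) -> float:
    """Pass@k: probability that at least one of k attempts succeeds."""
    return 1 - (1 - p) ** k


reliability = pass_hat_k(0.70, 3)   # a 70% agent, three in a row: ~0.343
workflow = pass_hat_k(0.95, 10)     # ten sequential 95% steps: ~0.599
potential = pass_at_k(0.70, 3)      # at least one success in three: ~0.973
```

Notice the gap between `potential` and `reliability` for the same agent: 97% versus 34%. That gap is exactly what a single impressive demo hides.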
This kind of evaluation framework pays for itself the moment a new model drops. When Google released Flash 2.0, I was excited about the cost savings, but would it perform as well as Pro? I ran my eval suite on the same tasks with both models, and the results were more nuanced than I expected. For simple tasks like reformatting text or fixing grammar, Flash was just as good. For complex multi-step reasoning, particularly in my Podcast RAG system, Pro was noticeably better. The eval suite gave me the data to keep Pro where it mattered.
Then Flash 3 came out, and the eval suite surprised me in the other direction. I ran the same benchmarks expecting similar trade-offs, but Flash 3 handled the Podcast RAG tasks so well that I moved the entire system off of 2.5 Pro. Without evals, I might have assumed the old trade-off still held and kept paying for a model I no longer needed. The point isn’t that one model is always better. The point is that you can’t know without measuring, and the landscape shifts under your feet with every release.
The real breakthrough in my own workflow came when I started using an agent to evaluate itself. I built a separate “Evaluation Agent” that reviews the logs of the “Worker Agent.” It scores performance based on a rubric I defined: did it confirm the action before executing? Was the response grounded in retrieved context? Was the tone appropriate?
This LLM-as-a-Judge pattern is powerful, but it comes with caveats. Research shows these evaluator models have their own biases, particularly a tendency to prefer longer answers regardless of quality and a bias toward their own outputs. To calibrate mine, I built a small “golden dataset” of traces that I graded by hand, then tuned the evaluator’s prompt until its scores matched mine. It’s not perfect, but it spots patterns I miss, like a tendency to over-rely on search when a simple calculation would do.
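The calibration step is simple enough to show. This is a sketch of how I check the judge against the golden dataset; the scores and tolerance are illustrative.

```python
def judge_agreement(human: list[int], judge: list[int], tolerance: int = 0) -> float:
    """Fraction of golden-dataset traces where the judge's score falls
    within `tolerance` points of the hand-assigned grade."""
    assert len(human) == len(judge)
    hits = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return hits / len(human)


# Hand grades vs. the evaluator's scores on five traces (1-5 rubric).
human_grades = [5, 3, 4, 2, 5]
judge_scores = [5, 4, 4, 2, 4]
exact = judge_agreement(human_grades, judge_scores)                    # 0.6
within_one = judge_agreement(human_grades, judge_scores, tolerance=1)  # 1.0
```

If agreement is low, you tune the evaluator’s prompt and re-run; if it’s low only on certain trace types, that tells you where the judge’s biases live.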
When Things Go Wrong
The research into agentic failure modes has identified three patterns that I see constantly in my own work.
The first is looping. The agent searches for “pricing,” gets no results, then searches for “pricing” again with exactly the same parameters. It’s stuck in a local optimum of reasoning, unable to update its strategy based on the observation that it failed. The simplest fix is a state hash: you hash the (Thought, Action, Observation) tuple at each step and check it against a sliding window of recent steps. If you see a repeat, you force the agent to try something different. For “soft” loops where the agent slightly rephrases but semantically repeats itself, embedding similarity between consecutive reasoning steps catches the pattern. And above all, production agents need circuit breakers: hard limits on steps, tool calls, or tokens per session. When the breaker trips, the agent escalates to a human rather than continuing to burn resources.
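Here’s a minimal sketch of that state-hash check, with a sliding window for repeat detection and a step-count circuit breaker. The window size and step limit are illustrative defaults, and the embedding-similarity check for “soft” loops is omitted to keep it standard-library only.

```python
import hashlib
from collections import deque


class LoopDetector:
    """State-hash loop detection over a sliding window, plus a circuit breaker."""

    def __init__(self, window: int = 5, max_steps: int = 25) -> None:
        self.recent: deque[str] = deque(maxlen=window)
        self.max_steps = max_steps
        self.steps = 0

    def check(self, thought: str, action: str, observation: str) -> str:
        self.steps += 1
        if self.steps > self.max_steps:
            return "escalate"   # breaker tripped: hand off to a human
        digest = hashlib.sha256(
            f"{thought}|{action}|{observation}".encode()
        ).hexdigest()
        if digest in self.recent:
            return "loop"       # exact repeat: force a strategy change
        self.recent.append(digest)
        return "ok"


detector = LoopDetector()
first = detector.check("find pricing", "search('pricing')", "no results")
second = detector.check("find pricing", "search('pricing')", "no results")
```

The agent loop calls `check` after each (Thought, Action, Observation) tuple; on `"loop"` you inject a forced instruction to change strategy, and on `"escalate"` you stop burning tokens and page a human.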
The second is tool hallucination. I mentioned this earlier, but it deserves its own spotlight. The most robust defense is constrained decoding, where libraries like Outlines or Instructor use the tool’s JSON schema to build a finite state machine that masks out invalid tokens during generation. If the schema expects an integer, the system sets the probability of all non-digit tokens to zero. It mathematically guarantees that the agent’s tool call will be valid. This moves validation from “check after the fact” to “ensure during generation,” which is a fundamentally better position. A practical note: full constrained decoding (the FSM approach) requires control over the inference engine, so it works with locally-hosted models or providers that expose logit-level access. If you’re calling a hosted API like Gemini or OpenAI, Instructor-style libraries can still enforce schema validation by wrapping the response in a Pydantic model and retrying on parse failure. It’s not as elegant as preventing bad tokens from ever being generated, but it catches the same class of errors.
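The validate-and-retry pattern is easy to sketch without dependencies. The schema, tool arguments, and stand-in “model” below are hypothetical; in practice Instructor does the validation with a Pydantic model, but the shape of the loop is the same: feed the violations back into the next prompt.

```python
from typing import Callable

# Expected argument schema for a hypothetical retrieval tool.
SCHEMA = {"doc_id": int, "max_results": int}


def validate(args: dict) -> list[str]:
    """Return a list of schema violations (empty means valid)."""
    errors = [f"missing: {k}" for k in SCHEMA if k not in args]
    errors += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in SCHEMA.items()
        if k in args and not isinstance(args[k], t)
    ]
    return errors


def call_with_retries(generate: Callable[[list[str]], dict], max_tries: int = 3) -> dict:
    """Ask the model for arguments, re-prompting with violations on failure."""
    errors: list[str] = []
    for _ in range(max_tries):
        args = generate(errors)   # violations get folded into the next prompt
        errors = validate(args)
        if not errors:
            return args
    raise ValueError(f"gave up after {max_tries} tries: {errors}")


# A stand-in "model" that fixes its mistake once the error is fed back.
attempts = iter([{"doc_id": "abc-123", "max_results": 5},  # hallucinated string ID
                 {"doc_id": 4173, "max_results": 5}])      # corrected
args = call_with_retries(lambda errs: next(attempts))
```

The bounded retry count matters: it converts the re-hallucination spiral I described above into at most three cheap round trips before a clean failure.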
The third is silent abandonment. The agent hits an ambiguity or a tool failure, and instead of trying an alternative, it politely apologizes and gives up. “I’m sorry, I couldn’t find that information.” This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response to uncertainty. The Reflexion pattern combats this by forcing the agent to generate a self-critique when it fails (“I searched with the wrong term”) and storing that critique in a short-term memory buffer. The next reasoning step is conditioned on this reflection, pushing the agent to generate a new plan rather than surrender. Research shows this kind of “verbal reinforcement” can improve success rates on complex tasks from 80% to over 90%.
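A minimal sketch of that reflection buffer might look like this. The class name and prompt wording are my own, not the Reflexion paper’s; the essential move is that stored critiques are prepended to the next attempt instead of being discarded.

```python
class ReflexionMemory:
    """Short-term buffer of self-critiques that conditions the next attempt."""

    def __init__(self) -> None:
        self.reflections: list[str] = []

    def record_failure(self, critique: str) -> None:
        """Store the agent's self-critique after a failed attempt."""
        self.reflections.append(critique)

    def next_prompt(self, task: str) -> str:
        """Condition the next attempt on everything learned so far."""
        if not self.reflections:
            return task
        lessons = "\n".join(f"- {r}" for r in self.reflections)
        return (f"{task}\n\nLessons from previous failed attempts:\n{lessons}\n"
                f"Try a different approach instead of giving up.")


memory = ReflexionMemory()
memory.record_failure("I searched with the wrong term; try the episode title instead.")
prompt = memory.next_prompt("Find the segment about guitar pedals.")
```

The buffer is per-session and cheap; it costs a few hundred tokens of context and buys you an agent that treats failure as information rather than a cue to apologize.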
The Self-Improving System
Moving from prototype to production isn’t about adding features; it’s about shifting your mindset. A prototype proves that something can work. A production system proves that it works reliably, measurably, and transparently. But the real unlock comes when you realize that production isn’t the end of the development lifecycle. It’s the beginning of something more powerful.
Remember those sessions I mentioned, the bundled records of every task your agent attempts? Once you have a critical mass of them, you’re sitting on a goldmine. And this is where I think the story gets really interesting: you can point a different AI system at your session archive and ask it to find the patterns you’re missing.
I’ve started doing this with my own agents. The workflow is straightforward: I have a script that runs weekly, pulls the last seven days of sessions from my trace store, filters for failures and anything above P90 latency, and exports them as structured JSON. I then feed that batch to a separate, more capable evaluator model. Not the lightweight rubric-scorer I use for real-time evaluation, but a model with a broader mandate and a carefully written prompt: look across these sessions and tell me what you see. Where is the agent consistently struggling? Which tool calls tend to precede failures? Are there categories of user requests that reliably lead to abandonment or looping? I ask it to return its findings as a ranked list of patterns with supporting session IDs, so I can verify each observation myself.
The results have been genuinely surprising. The evaluator flagged a cluster of sessions where users were asking questions about the corpus itself, things like “how many of these podcasts are about guitars?” or “which shows cover AI the most?” The agent would gamely try to answer by searching transcripts, but it was never going to get there because I hadn’t indexed podcast descriptions. Each individual session just looked like a search that came up short. It was only in aggregate that the pattern became clear: users wanted to explore the collection, not just search within it. That finding led me to index descriptions as a new data source, and a whole category of previously failing queries started working.
This is what the industry calls the Data Flywheel: production data feeding back into development, continuously tightening the loop between user intent and agent capability. Your prompt logs become your reality check, revealing how users actually talk to your system versus how you imagined they would. When you cluster those real-world prompts (something as straightforward as embedding them and running HDBSCAN), you start finding these gaps systematically. That’s your roadmap for what to build next.
And the flywheel compounds. Better observability produces richer sessions. Richer sessions give the evaluator more to work with. Better evaluations lead to targeted improvements. Targeted improvements produce better outcomes, which produce more informative sessions. Each rotation makes the system a little smarter, a little more aligned with what users actually need.
To be clear: this isn’t the agent autonomously rewriting itself. I’m the one who reads the evaluator’s findings, verifies them against the session data, and decides what to change. Maybe I update a system prompt, add a new tool, or adjust a circuit breaker threshold. The AI surfaces the patterns; the human decides what to do about them. It’s the same human-on-the-loop philosophy I described in the last post, applied to the development cycle itself.
Together, these layers transform a clever demo into a system you can trust. Because in the age of agents, trust isn’t built on magic. It’s built on the ability to see the trick.
Throughout this series, we’ve been building up the theory: what agents are, how they think, what tools they need, how to keep them safe, and now how to make sure they’re actually working. In the next installment, I want to move from theory to practice. We’ll look at agents in the wild, real-world case studies in customer support, software development, and personal productivity, and what they tell us about how this technology is actually changing the way we work.