A luminous geometric sphere with sections of its outer shell breaking apart to reveal glowing concentric rings and internal mechanisms, set against a dark navy background.

The Observability Gap

I was debugging an agent a few weeks ago when I hit a problem that made me realize something fundamental about the shift we’re undergoing. The script had run, consumed a hundred thousand tokens, and returned an answer. But the answer was wrong. Not catastrophically wrong, just subtly, dangerously off.

The issue wasn’t that the model was bad. The problem was that I had no idea what the agent had thought while producing that answer. Which tools had it called? What information had it retrieved? What reasoning path had it wandered down? I had the input and the output, but the middle, the actual decision-making process, was a black box.

This mirrors the challenge I described in Everything Becomes an Agent. If our future architecture is a mesh of interacting agents, we cannot afford for them to be inscrutable. A single black box is a mystery; a system of black boxes is chaos.

This is the Observability Gap, and it is the first wall you hit when you move from prototype to production. You can build a working agent in an afternoon. You can give it tools, wire up a nice ReAct loop, and watch it dazzle you. But the moment you rely on it for something that matters, you realize you’re flying blind.

How do you know if your agent is working well? And more importantly, how do you fix it when it’s not?

Earlier in this series, I wrote about building guardrails and the Policy Engine that keeps agents from doing dangerous things. Observability is the complement to those guardrails. Guardrails define the boundaries; observability tells you whether the agent is respecting them, struggling against them, or quietly finding ways around them. One without the other is incomplete. A guardrail you can’t monitor is just a hope.

The Chain of Thought Problem

When you’re building traditional software, debugging is an exercise in logic. You set breakpoints, inspect variables, and trace execution. The flow is deterministic: if Input A produces Output B today, it will produce Output B tomorrow.

Agents don’t work that way. The same input can produce wildly different outputs depending on which tools the agent decides to call, how it interprets the results, and what “thought” it generates in that split second. The agent’s logic isn’t written in code; it’s written in natural language, scattered across multiple LLM calls, tool invocations, and iterative refinements.

I learned this the hard way with my Podcast RAG system. I’d ask it a question about a specific episode, and sometimes it would nail it, pulling the exact segment and synthesizing a perfect answer. Other times, it would search with the wrong keywords, get back irrelevant chunks, and confidently synthesize nonsense.

The model wasn’t hallucinating in the traditional sense. It was following a process. But I couldn’t see that process, so I couldn’t fix it.

That experience taught me the most important lesson about production agents: the final answer is the least interesting part. What matters is the chain of thought that produced it, every tool call, every intermediate result, every reasoning trace. Think of it as a flight recorder. When the plane lands at the wrong airport, the only way to understand what went wrong is to replay the entire flight.

Four Layers of Seeing

When I started building that flight recorder, I realized that “log everything” isn’t actually a strategy. You need structure. Through trial and error, and by studying how platforms like Langfuse and Arize Phoenix approach the problem, I’ve come to think of agent observability as having four distinct layers.

The first is the reasoning layer: the agent’s internal monologue where it decomposes your request into sub-tasks. This is where you catch the subtle bugs. When my Podcast RAG agent searched for the wrong keywords, the failure wasn’t in the tool call itself (which returned a perfectly valid HTTP 200). The failure was in the reasoning that chose those keywords. Without visibility into the “Thought” step of the ReAct loop, that kind of error is indistinguishable from an external system failure.

The second is the execution layer: the actual tool calls, their arguments, and the raw results. This is where you catch a different class of bug, one that's becoming increasingly important: tool hallucination. Not the model making up facts in prose, but the model calling a tool that doesn't exist (you provided shell_tool but the model confidently calls bash_tool), fabricating a file path that isn't real, or passing a string to a parameter that expects an integer. These are operational failures that cascade. I've seen an agent confidently pass a hallucinated document ID to a retrieval tool, get back an error, and then re-hallucinate a different invalid ID rather than change strategy. You only catch this if you're logging the schema validation at the boundary between the model and the tool.

The third is the state layer: the contents of the agent’s context window at each decision point. Agents are stateful creatures. Their behavior at step ten is shaped by everything that happened in steps one through nine. And context windows are not infinite. As verbose tool outputs accumulate, relevant information gets pushed further and further from the model’s attention, a phenomenon researchers call “context drift” or the “Lost in the Middle” effect. Snapshotting the context at critical decision points lets you “time travel” during debugging. You can see exactly what the agent could see when it made its bad call.

The fourth is the feedback layer: error codes, user corrections, and signals from any critic or evaluator models. This layer tells you whether the agent is actually learning from its environment within a session, or just ignoring failure signals and looping. In frameworks like Reflexion, this feedback is explicitly wired into the next reasoning step. Watching this layer is how you know if your self-correction mechanisms are actually correcting.

But capturing these four layers independently isn’t enough. You need to bundle them into sessions: discrete, self-contained records of a single task from the moment the user makes a request to the moment the agent delivers (or fails to deliver) its result. A session is your unit of analysis. It’s the difference between having a pile of timestamped log lines and having a story you can read from beginning to end. When something goes wrong, you don’t want to grep through millions of events hoping to reconstruct what happened. You want to pull up session #47832 and replay the agent’s entire decision-making journey: what it thought, what it tried, what it saw, and how it responded to each result along the way.

This session-level thinking changes how you build your infrastructure. Every trace, every tool call, every context snapshot gets tagged with a session ID. Your dashboards stop showing you aggregate metrics and start showing you individual narratives. You can sort sessions by outcome (success, failure, abandonment), by cost (token consumption), or by duration, and immediately drill into the ones that matter. It’s the observability equivalent of going from reading a box score to watching the game film.

Making It Concrete

Here’s what this looks like in practice. Suppose you ask your agent to “check my calendar and suggest a time for a meeting.”

Without observability, you see:

Input: "Check my calendar and suggest a time for a meeting"
Output: "How about Thursday at 2pm?"

With observability across all four layers, you see the mind at work:

[REASONING] User wants to schedule a meeting. I need to:
1. Check their calendar for availability
2. Consider team availability
3. Suggest an optimal time
[TOOL CALL] get_calendar(user_id="allen", days=7)
[TOOL RESULT] Returns 45 events over next 7 days
[STATE] Context window: 2,847 tokens used
[REASONING] Analyzing free slots. User has:
- Monday 2pm-4pm free
- Thursday 2pm-4pm free
- Friday all day booked
[TOOL CALL] get_team_availability()
[TOOL RESULT] Team members mostly available Thursday afternoon
[REASONING] Thursday 2pm works for both user and team.
[FEEDBACK] No errors. Response generated.
[RESPONSE] "How about Thursday at 2pm?"

Suddenly, the black box is transparent. If the suggestion is wrong, you can see exactly why. Maybe the calendar tool returned incomplete data. Maybe the team availability check failed silently. Maybe the agent’s definition of “optimal” means “soonest” rather than “best for focus time.”

This kind of visibility saved me countless hours when building Gemini Scribe. Users would report that the agent “didn’t understand” their request, which is about as useful as telling your mechanic “the car sounds funny.” But when I turned on debug logging and pulled up the console output, I could see exactly where the confusion happened, usually in how the agent interpreted the file context or which notes it decided were relevant. The fix was never a mystery once I could see the reasoning. All of this logging is to the developer console and off by default, which is an important distinction. You want observability for yourself as the builder, not surveillance of your users.

The Standards Are Coming

For my own production agents, I’ve settled on a layered approach. Structured logging captures every action in machine-parseable JSON. A unique trace ID stitches together every LLM call and tool invocation into a single narrative flow.
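To make that concrete, here is a minimal sketch of what a structured logger like this might look like. The record fields (session ID, trace ID, the four-layer tag) mirror the approach described here, but the exact schema and the `make_logger` helper are illustrative assumptions, not a real library's API.

```python
import json
import time
import uuid


def make_logger(session_id: str):
    """Return a log function that tags every event with a session ID and
    a trace ID. A minimal sketch: a real setup would ship records to a
    log pipeline rather than stdout, but the record shape is the point.
    """
    trace_id = uuid.uuid4().hex

    def log(layer: str, event: str, **fields):
        record = {
            "ts": time.time(),
            "session_id": session_id,
            "trace_id": trace_id,
            "layer": layer,  # reasoning | execution | state | feedback
            "event": event,
            **fields,
        }
        print(json.dumps(record))  # machine-parseable JSON, one event per line
        return record

    return log


log = make_logger("session-47832")
rec = log("execution", "tool_call", tool="get_calendar", args={"days": 7})
```

Because every record carries the same session ID, pulling up "session #47832" is a single filter, and the trace ID stitches the individual LLM and tool calls into one narrative flow.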

But we are also seeing the industry mature beyond “roll your own.” The critical development here is the adoption of the OpenTelemetry (OTel) standard for GenAI. The OTel community has published semantic conventions that define a standard schema for agent traces: things like gen_ai.system (which provider), gen_ai.request.model (which exact model version), gen_ai.tool.name (which tool was called), and gen_ai.usage.input_tokens (how many tokens were consumed at each step).
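To show the shape of the data without pulling in an SDK, here is a sketch of the attributes one span in an agent trace might carry under those conventions. The attribute names follow the OTel GenAI semantic conventions listed above; the specific values and the `validate_span` helper are illustrative assumptions.

```python
# One span's attributes, as a plain dict. Model and token values are
# made up for illustration.
span_attributes = {
    "gen_ai.system": "gemini",
    "gen_ai.request.model": "gemini-2.0-flash",
    "gen_ai.tool.name": "get_calendar",
    "gen_ai.usage.input_tokens": 2847,
    "gen_ai.usage.output_tokens": 112,
}


def validate_span(attrs: dict) -> bool:
    """Check a span carries the minimum fields a cross-framework
    dashboard would need to render it alongside spans from other stacks."""
    required = {"gen_ai.system", "gen_ai.request.model"}
    return required.issubset(attrs)


assert validate_span(span_attributes)
```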

This matters because it means an agent built with LangChain in Python and an agent built with Semantic Kernel in C# can produce traces that look structurally identical. You can pipe both into the same Datadog or Langfuse dashboard and analyze them side by side. You aren’t locked into a proprietary debugging tool; you can stream your agent’s thoughts into the same infrastructure you use for the rest of your stack.

It also enables what I think of as “boundary tracing,” where you instrument the stable interfaces (the HTTP calls, the tool invocations) rather than hacking into the agent’s internal logic. You get visibility without coupling your observability to a specific framework. That’s important, because if there’s one thing I’ve learned building in this space, it’s that frameworks change fast.

If you’re wondering where to start, here’s my honest advice: don’t wait for the perfect stack. Start with structured JSON logs and a session ID that ties each task together end-to-end. That alone gives you something you can grep, filter, and replay. Once you outgrow that (and you will, faster than you expect), graduate to an OTel-based pipeline. The good news is that many agent frameworks are adding robust hook mechanisms that let you tap into the agent lifecycle (before and after tool calls, on reasoning steps, on errors) without modifying your core logic. These hooks make it straightforward to plug in your telemetry from the start. The key is to instrument early, even if you’re only logging to a local file. Retrofitting observability into an agent that’s already in production is significantly harder than building it in from the beginning.

The Price of Transparency

Here’s the tension no one wants to talk about: full observability is expensive.

Autonomous agents are verbose by nature. A single reasoning step might generate hundreds of tokens of internal monologue. A RAG retrieval might pull megabytes of document context. If you log the full payload for every transaction, your storage costs can rival the cost of the LLM inference itself. I’ve seen reports of evaluation runs consuming over 100 million tokens, with more than 60% of the cost attributed to hidden reasoning tokens.

In production, you need sampling strategies. The approach I’ve landed on borrows from traditional distributed systems. Keep 100% of traces that result in errors or negative user feedback, because every failure is a learning opportunity. Keep traces that exceed your latency threshold (P95 or P99), because slow agents are often stuck agents. And for everything else, a small random sample (1-5%) is enough to establish your baseline and spot trends.
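The decision logic above fits in a few lines. This is a sketch of that sampling policy; the function name, signature, and thresholds are illustrative assumptions, not a standard API.

```python
import random
from typing import Optional


def should_keep_trace(
    status: str,
    latency_ms: float,
    feedback: Optional[str],
    p95_latency_ms: float,
    baseline_rate: float = 0.05,
) -> bool:
    """Decide whether to retain a full trace, per the strategy above:
    keep all failures and negative feedback, keep slow traces, and
    random-sample a small baseline of everything else.
    """
    if status == "error" or feedback == "negative":
        return True  # every failure is a learning opportunity
    if latency_ms > p95_latency_ms:
        return True  # slow agents are often stuck agents
    return random.random() < baseline_rate


# A failed session is always kept, regardless of the sample rate.
assert should_keep_trace("error", 800, None, p95_latency_ms=2000)
```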

For storage, I use a tiered approach. Recent and failed traces go into a fast database for immediate querying. Older successful traces get compressed and moved to cold storage, where they can be pulled back if needed for deeper analysis. It’s not glamorous, but it keeps costs manageable without sacrificing the ability to debug the things that matter. In my own setup, this sampling and tiering strategy keeps observability overhead to roughly 15-20% of my inference spend. Without it, I was on track to spend more on storing agent thoughts than on generating them.

Evaluation Beyond Unit Tests

Logging tells you what happened. Evaluation tells you if it was any good.

This is where agents diverge sharply from traditional software. You can’t write a unit test that asserts function(x) == y. The whole point of an agent is to make decisions, and decisions must be evaluated on quality, not just syntax.

As Gemini Scribe grew more capable, I had to develop a new kind of test suite. I track Task Success Rate (did the agent accomplish what the user asked?), Tool Use Accuracy (did it read the right files and use the right tools for the job?), and Efficiency (did it burn 50 steps to do a 2-step task?).

But here’s the number that keeps me up at night. Because agents are non-deterministic, a single run is statistically meaningless. You have to run the same evaluation multiple times and look at distributions. Researchers distinguish between Pass@k (the probability that at least one of k attempts succeeds) and Pass^k (the probability that all k attempts succeed). Pass@k measures potential. Pass^k measures reliability.

The math is sobering. If your agent has a 70% success rate on a single attempt, its Pass^3 (succeeding three times in a row) drops to about 34%. Scale that to a real workflow where the agent needs to perform ten sequential steps correctly, and even a 95% per-step success rate gives you only about a 60% chance of completing the full task. This is the compounding probability of failure, and it’s why “works most of the time” isn’t good enough for production.
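The two metrics are simple enough to compute directly (assuming independent attempts), and the numbers above fall straight out:

```python
def pass_at_k(p: float, k: int) -> float:
    """Pass@k: probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Pass^k: probability that all k independent attempts succeed."""
    return p ** k


# The figures from the text: a 70% agent succeeds three times in a row
# about 34% of the time, and ten sequential 95% steps complete the full
# task only about 60% of the time.
print(round(pass_hat_k(0.70, 3), 2))   # 0.34
print(round(pass_hat_k(0.95, 10), 2))  # 0.6
```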

This kind of evaluation framework pays for itself the moment a new model drops. When Google released Flash 2.0, I was excited about the cost savings, but would it perform as well as Pro? I ran my eval suite on the same tasks with both models, and the results were more nuanced than I expected. For simple tasks like reformatting text or fixing grammar, Flash was just as good. For complex multi-step reasoning, particularly in my Podcast RAG system, Pro was noticeably better. The eval suite gave me the data to keep Pro where it mattered.

Then Flash 3 came out, and the eval suite surprised me in the other direction. I ran the same benchmarks expecting similar trade-offs, but Flash 3 handled the Podcast RAG tasks so well that I moved the entire system off of 2.5 Pro. Without evals, I might have assumed the old trade-off still held and kept paying for a model I no longer needed. The point isn’t that one model is always better. The point is that you can’t know without measuring, and the landscape shifts under your feet with every release.

The real breakthrough in my own workflow came when I started using an agent to evaluate itself. I built a separate “Evaluation Agent” that reviews the logs of the “Worker Agent.” It scores performance based on a rubric I defined: did it confirm the action before executing? Was the response grounded in retrieved context? Was the tone appropriate?

This LLM-as-a-Judge pattern is powerful, but it comes with caveats. Research shows these evaluator models have their own biases, particularly a tendency to prefer longer answers regardless of quality and a bias toward their own outputs. To calibrate mine, I built a small “golden dataset” of traces that I graded by hand, then tuned the evaluator’s prompt until its scores matched mine. It’s not perfect, but it spots patterns I miss, like a tendency to over-rely on search when a simple calculation would do.

When Things Go Wrong

The research into agentic failure modes has identified three patterns that I see constantly in my own work.

The first is looping. The agent searches for “pricing,” gets no results, then searches for “pricing” again with exactly the same parameters. It’s stuck in a local optimum of reasoning, unable to update its strategy based on the observation that it failed. The simplest fix is a state hash: you hash the (Thought, Action, Observation) tuple at each step and check it against a sliding window of recent steps. If you see a repeat, you force the agent to try something different. For “soft” loops where the agent slightly rephrases but semantically repeats itself, embedding similarity between consecutive reasoning steps catches the pattern. And above all, production agents need circuit breakers: hard limits on steps, tool calls, or tokens per session. When the breaker trips, the agent escalates to a human rather than continuing to burn resources.
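The state-hash defense is only a few lines. Here is a minimal sketch; the window size and the class interface are illustrative choices, and a real implementation would also add the embedding-similarity check for soft loops.

```python
import hashlib
from collections import deque


class LoopDetector:
    """Detect 'hard' loops by hashing each (thought, action, observation)
    tuple and checking it against a sliding window of recent steps.
    """

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)

    def step(self, thought: str, action: str, observation: str) -> bool:
        """Return True if this exact step was seen recently (a loop)."""
        digest = hashlib.sha256(
            f"{thought}|{action}|{observation}".encode()
        ).hexdigest()
        looping = digest in self.recent
        self.recent.append(digest)
        return looping


detector = LoopDetector()
detector.step("find pricing", "search('pricing')", "no results")  # first time: not a loop
repeated = detector.step("find pricing", "search('pricing')", "no results")
print(repeated)  # True: identical step repeated, force a strategy change
```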

The second is tool hallucination. I mentioned this earlier, but it deserves its own spotlight. The most robust defense is constrained decoding, where libraries like Outlines or Instructor use the tool’s JSON schema to build a finite state machine that masks out invalid tokens during generation. If the schema expects an integer, the system sets the probability of all non-digit tokens to zero. It mathematically guarantees that the agent’s tool call will be valid. This moves validation from “check after the fact” to “ensure during generation,” which is a fundamentally better position. A practical note: full constrained decoding (the FSM approach) requires control over the inference engine, so it works with locally-hosted models or providers that expose logit-level access. If you’re calling a hosted API like Gemini or OpenAI, Instructor-style libraries can still enforce schema validation by wrapping the response in a Pydantic model and retrying on parse failure. It’s not as elegant as preventing bad tokens from ever being generated, but it catches the same class of errors.
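For the hosted-API case, the "check after the fact" position still catches a lot. Here is a sketch of boundary validation against a tool registry; the registry format and helper are assumptions for illustration, not any particular library's API.

```python
def validate_tool_call(call: dict, registry: dict) -> list:
    """Validate a model-proposed tool call against a registry of tool
    schemas, catching the failure modes above: unknown tool names and
    arguments of the wrong type.
    """
    errors = []
    schema = registry.get(call.get("tool"))
    if schema is None:
        errors.append(f"unknown tool: {call.get('tool')!r}")
        return errors
    for name, expected_type in schema.items():
        value = call.get("args", {}).get(name)
        if not isinstance(value, expected_type):
            errors.append(
                f"arg {name!r}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


registry = {"shell_tool": {"command": str}, "get_calendar": {"days": int}}

# The model hallucinated 'bash_tool', then passed a string where an int belongs.
print(validate_tool_call({"tool": "bash_tool"}, registry))
print(validate_tool_call({"tool": "get_calendar", "args": {"days": "7"}}, registry))
```

On a validation failure, the usual move is to feed the error list back to the model and retry, rather than letting the bad call reach the tool.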

The third is silent abandonment. The agent hits an ambiguity or a tool failure, and instead of trying an alternative, it politely apologizes and gives up. “I’m sorry, I couldn’t find that information.” This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response to uncertainty. The Reflexion pattern combats this by forcing the agent to generate a self-critique when it fails (“I searched with the wrong term”) and storing that critique in a short-term memory buffer. The next reasoning step is conditioned on this reflection, pushing the agent to generate a new plan rather than surrender. Research shows this kind of “verbal reinforcement” can improve success rates on complex tasks from 80% to over 90%.
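The Reflexion loop can be sketched in a few lines. The `attempt_fn`/`critique_fn` interface below stands in for real LLM calls and is an assumption for illustration; the structure (fail, critique, retry conditioned on the critique) is the pattern itself.

```python
def run_with_reflexion(task, attempt_fn, critique_fn, max_attempts=3):
    """On failure, generate a self-critique, store it in a short-term
    memory buffer, and condition the next attempt on those reflections,
    pushing the agent toward a new plan instead of surrender.
    """
    reflections = []
    for _ in range(max_attempts):
        ok, result = attempt_fn(task, reflections)
        if ok:
            return result
        reflections.append(critique_fn(task, result))
    return None  # escalate to a human rather than apologizing and stopping


# Stub: fails until a reflection is available, then succeeds.
def attempt(task, reflections):
    return (bool(reflections), "answer" if reflections else "no results")


def critique(task, result):
    return f"attempt returned {result!r}; try a different search term"


print(run_with_reflexion("find pricing", attempt, critique))  # answer
```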

The Self-Improving System

Moving from prototype to production isn’t about adding features; it’s about shifting your mindset. A prototype proves that something can work. A production system proves that it works reliably, measurably, and transparently. But the real unlock comes when you realize that production isn’t the end of the development lifecycle. It’s the beginning of something more powerful.

Remember those sessions I mentioned, the bundled records of every task your agent attempts? Once you have a critical mass of them, you’re sitting on a goldmine. And this is where I think the story gets really interesting: you can point a different AI system at your session archive and ask it to find the patterns you’re missing.

I’ve started doing this with my own agents. The workflow is straightforward: I have a script that runs weekly, pulls the last seven days of sessions from my trace store, filters for failures and anything above P90 latency, and exports them as structured JSON. I then feed that batch to a separate, more capable evaluator model. Not the lightweight rubric-scorer I use for real-time evaluation, but a model with a broader mandate and a carefully written prompt: look across these sessions and tell me what you see. Where is the agent consistently struggling? Which tool calls tend to precede failures? Are there categories of user requests that reliably lead to abandonment or looping? I ask it to return its findings as a ranked list of patterns with supporting session IDs, so I can verify each observation myself.

The results have been genuinely surprising. The evaluator flagged a cluster of sessions where users were asking questions about the corpus itself, things like “how many of these podcasts are about guitars?” or “which shows cover AI the most?” The agent would gamely try to answer by searching transcripts, but it was never going to get there because I hadn’t indexed podcast descriptions. Each individual session just looked like a search that came up short. It was only in aggregate that the pattern became clear: users wanted to explore the collection, not just search within it. That finding led me to index descriptions as a new data source, and a whole category of previously failing queries started working.

This is what the industry calls the Data Flywheel: production data feeding back into development, continuously tightening the loop between user intent and agent capability. Your prompt logs become your reality check, revealing how users actually talk to your system versus how you imagined they would. When you cluster those real-world prompts (something as straightforward as embedding them and running HDBSCAN), you start finding these gaps systematically. That's your roadmap for what to build next.
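To show the clustering step without a dependency, here is a toy stand-in for the embed-then-HDBSCAN workflow: a greedy cosine grouping over two-dimensional vectors. Real prompt embeddings have hundreds of dimensions and HDBSCAN handles noise and density far better; this only illustrates the idea that similar prompts fall into the same bucket.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def greedy_cluster(vectors, threshold=0.9):
    """Group prompt embeddings by similarity: each vector joins the first
    cluster whose seed it resembles, else starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_indices)
    for i, v in enumerate(vectors):
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]


# Toy embeddings: two 'explore the corpus' prompts and one ordinary search.
vecs = [(1.0, 0.1), (0.9, 0.12), (0.05, 1.0)]
print(greedy_cluster(vecs))  # [[0, 1], [2]]
```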

And the flywheel compounds. Better observability produces richer sessions. Richer sessions give the evaluator more to work with. Better evaluations lead to targeted improvements. Targeted improvements produce better outcomes, which produce more informative sessions. Each rotation makes the system a little smarter, a little more aligned with what users actually need.

To be clear: this isn’t the agent autonomously rewriting itself. I’m the one who reads the evaluator’s findings, verifies them against the session data, and decides what to change. Maybe I update a system prompt, add a new tool, or adjust a circuit breaker threshold. The AI surfaces the patterns; the human decides what to do about them. It’s the same human-on-the-loop philosophy I described in the last post, applied to the development cycle itself.

Together, these layers transform a clever demo into a system you can trust. Because in the age of agents, trust isn’t built on magic. It’s built on the ability to see the trick.

Throughout this series, we’ve been building up the theory: what agents are, how they think, what tools they need, how to keep them safe, and now how to make sure they’re actually working. In the next installment, I want to move from theory to practice. We’ll look at agents in the wild, real-world case studies in customer support, software development, and personal productivity, and what they tell us about how this technology is actually changing the way we work.

A conceptual illustration showing sound waves passing through a prism and refracting into a 3D scatter plot of colored clusters, representing different speaker identities in vector space.

The Fingerprint of Sound


Last year, I spent a lot of time obsessed with the concept of embeddings. I wrote about how they act as a bridge, transforming the messy, unstructured world of human language into a clean, numerical landscape that computers can understand. In my series on the topic, I explored how text embeddings allow us to map concepts in space—how they let us mathematically prove that “king” is close to “queen,” or find a podcast episode about “economic growth” even if the specific keywords never appear in the transcript.

For me, grasping text embeddings was a watershed moment. It turned AI from a black box into a geometry problem I could solve. But recently, my friend Pete Warden released a post that clicked another piece of the puzzle into place for me, moving that geometry from the page to the ear.

In his post, Speech Embeddings for Engineers, Pete tackles the problem of diarization—the technical term for figuring out “who spoke when” in an audio recording. If you’ve followed my podcast archive project, you know this has been a thorn in my side. I have thousands of transcripts, but they are largely monolithic blocks of text. I know what was said, but often I lose the context of who said it.

Pete’s explanation is brilliant because it leverages the exact same intuition we developed for text. Just as a text embedding captures the semantic “fingerprint” of a sentence, a speech embedding captures the vocal fingerprint of a speaker.

The mental shift is fascinating. When we embed text, we are mapping meaning. We want the vector for “dog” to be close to “puppy” and far from “motorcycle.” But when we embed speech for diarization, we don’t care about the meaning of the words at all. A speaker could be whispering a love sonnet or screaming a grocery list; semantically, those are worlds apart. But acoustically—in terms of timbre, pitch, and cadence—they share an undeniable identity.

Pete includes a Colab notebook that demonstrates this beautifully. It’s a joy to run through because it demystifies the process entirely. He walks you through taking short clips of audio, running them through a model, and visualizing the output.

Suddenly, you aren’t looking at waveforms anymore. You’re looking at clusters. You can see, visually, where one voice ends and another begins. It turns the murky problem of distinguishing speakers in a crowded room into a clean clustering algorithm, something any engineer can wrap their head around.
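The clustering intuition fits in a few lines. This toy sketch assigns an audio segment's embedding to the nearest known speaker centroid; the two-dimensional vectors and speaker names are made up for illustration (real speaker embeddings have hundreds of dimensions).

```python
import math


def nearest_speaker(embedding, centroids):
    """Assign a segment's embedding to the closest speaker centroid,
    turning 'who spoke when' into a distance comparison.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centroids, key=lambda name: dist(embedding, centroids[name]))


centroids = {"host": (0.9, 0.1), "guest": (0.1, 0.8)}
print(nearest_speaker((0.85, 0.2), centroids))  # host
```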

This reinforces a recurring theme for me: the power of small, composable tools. We often look for massive, end-to-end APIs to solve our problems—a “magic box” that takes audio and returns a perfect script. But understanding the primitives is where the real power lies. By understanding speech embeddings, we aren’t just consumers of a transcription service; we are architects who can build systems that listen, identify, and understand the nuance of conversation.

If you’ve ever wrestled with audio data, or if you just want to see how the concept of embeddings extends beyond text, I highly recommend finding a quiet hour to work through Pete’s notebook. It might just change how you hear the data.

Great Video on Gemini Scribe and Obsidian

I was recently looking through the feedback in the Gemini Scribe repository when I noticed a few insightful comments from a user named Paul O’Malley. Curiosity got the better of me (I love seeing who is actually pushing the boundaries of the tools I build), so I took a look at his YouTube page. I quickly found myself deep into a walkthrough titled “I Built a Second Brain That Organises Itself.”

What caught my eye wasn’t just another productivity system; we’ve all seen the “shiny new app” cycle that leads to digital bankruptcy. It was seeing Gemini Scribe being used as the engine for a fully automated Obsidian vault.

The Friction of Digital Maintenance

Paul hits on a fundamental truth: most systems fail because the friction of maintenance—the tagging, the filing, the constant admin—eventually outweighs the benefit. He argues that what we actually need is a system that “bridges the gap in our own executive function”.

In his setup, he uses Obsidian as the chassis because it relies on Markdown. I’ve long believed that Markdown is the native language of AI, and seeing it used here to create a “seamless bridge” between messy human thoughts and structured AI processing was incredibly satisfying.

Gemini Scribe as the Engine

It was a bit surreal to watch Paul walk through the installation of Gemini Scribe as the core engine for this self-organizing brain. He highlights a few features that I poured a lot of heart into:

  • Session History as Knowledge: By saving AI interactions as Markdown files, they become a searchable part of your knowledge base. You can actually ask the AI to reflect on past conversations to find patterns in your own thinking.
  • The Setup Wizard: He uses a “Setup Wizard” to convert the AI from a generic chatbot into a specialized system administrator. Through a conversational interview, the agent learns your profession and hobbies to tailor a project taxonomy (like the PARA method) specifically to you.
  • Agentic Automation: The video demonstrates the “Inbox Processor,” where the AI reads a raw note, gives it a proper title, applies tags, and physically moves it to the right folder.

Beyond the Tool: A Human in the Loop

One thing Paul emphasized that really resonated with my own philosophy of Guiding the Agent’s Behavior is the “Human in the Loop”. When the agent suggests a change or creates a new command, it writes to a staging file first.

As Paul puts it, you are the boss and the AI is the junior employee—it can draft the contract, but you have to sign it before it becomes official. You always remain in control of the files that run your life.

Small Tools, Big Ideas

Seeing the Gemini CLI mentioned as a “cleaner and slightly more powerful” alternative for power users was another nice nod. It reinforces the idea that small, sharp tools can be composed into something transformative.

Building tools in a vacuum is one thing, but seeing them live in the wild, helping someone clear their “mental RAM” and close their loop at the end of the day, is one of the reasons I do this. It’s a reminder that the best technology doesn’t try to replace us; it just makes the foundations a little sturdier.

A photorealistic image shows an old wooden-handled hammer on a cluttered workbench transforming into a small, multi-armed mechanical robot with glowing blue eyes, holding various miniature tools.

Everything Becomes an Agent

I’ve noticed a pattern in my coding life. It starts innocently enough. I sit down to write a simple Python script, maybe something to tidy up my Obsidian vault or a quick CLI tool to query an API. “Keep it simple,” I tell myself. “Just input, processing, output.”

But then, the inevitable thought creeps in: It would be cool if the model could decide which file to read based on the user’s question.

Two hours later, I’m not writing a script anymore. I’m writing a while loop. I’m defining a tools array. I’m parsing JSON outputs and handing them back to the model. I’m building memory context windows.

I’m building an agent. Again.

(For those keeping track: my working definition of an “agent” is simple: a model running in a loop with access to tools. I explored this in depth in my Agentic Shift series, but that’s the core of it.)
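That working definition is almost literal code. Here is a minimal sketch of it; the `model` interface (returning either a final answer or a tool invocation) is an assumption for illustration, stubbed out rather than a real LLM call.

```python
def run_agent(model, tools, user_request, max_steps=10):
    """A model running in a loop with access to tools: call the model,
    execute any tool it asks for, feed the result back, repeat until it
    answers or the step budget runs out.
    """
    history = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        decision = model(history)
        if decision["type"] == "answer":
            return decision["content"]
        result = tools[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "content": str(result)})
    return None  # circuit breaker: out of steps


# Stub model: reads a file on step one, answers on step two.
def stub_model(history):
    if history[-1]["role"] == "user":
        return {"type": "tool", "tool": "read_file", "args": {"path": "notes.md"}}
    return {"type": "answer", "content": f"summary of {history[-1]['content']}"}


tools = {"read_file": lambda path: f"contents of {path}"}
print(run_agent(stub_model, tools, "summarize my notes"))
```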

As I sit here writing this in January of 2026, I realize that almost every AI project I worked on last year ultimately became an agent. It feels like a law of nature: Every AI project, given enough time, converges on becoming an agent. In this post, I want to share some of what I’ve learned, and the cases where you might skip the intermediate steps and jump straight to building an agent.

The Gravitational Pull of Autonomy

This isn’t just feature creep. It’s a fundamental shift in how we interact with software. We are moving past the era of “smart typewriters” and into the era of “digital interns.”

Take Gemini Scribe, my plugin for Obsidian. When I started, it was a glorified chat window. You typed a prompt, it gave you text. Simple. But as I used it, the friction became obvious. If I wanted Scribe to use another note as context for a task, I had to take a specific action, usually creating a link to that note from the one I was working on, to make sure it was considered. I was managing the model’s context manually.

I was the “glue” code. I was the context manager.

The moment I gave Scribe access to the read_file tool, the dynamic changed. Suddenly, I wasn’t micromanaging context; I was giving instructions. “Read the last three meeting notes and draft a summary.” That’s not a chat interaction; that’s a delegation. And to support delegation, the software had to become an agent, capable of planning, executing, and iterating.

From Scripts to Sudoers

The Gemini CLI followed a similar arc. There were many of us on the team experimenting with Gemini on the command line. I was working on iterative refinement, where the model would ask clarifying questions to create deeper artifacts. Others were building the first agentic loops, giving the model the ability to run shell commands.

Once we saw how much the model could do with even basic tools, we were hooked. Suddenly, it wasn’t just talking about code; it was writing and executing it. It could run tests, see the failure, edit the file, and run the tests again. It was eye-opening how much we could get done as a small team.

But with great power comes great anxiety. As I explored in my Agentic Shift post on building guardrails and later in my post about the Policy Engine, I found myself staring at a blinking cursor, terrified that my helpful assistant might accidentally rm -rf my project.

This is the hallmark of the agentic shift: you stop worrying about syntax errors and start worrying about judgment errors. We had to build a “sudoers” file for our AI, a permission system that distinguishes between “read-only exploration” and “destructive action.” You don’t build policy engines for scripts; you build them for agents.
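A "sudoers" file for an agent can start very small. Here is a sketch of the idea; the command lists are illustrative, not from any real policy engine, and the important design choice is that the default is to ask a human, not to allow:

```python
# A sketch of a "sudoers"-style policy for agent shell commands.
# The command lists are illustrative; a real policy engine would be
# far richer. Unknown commands default to human approval.

READ_ONLY = {"ls", "cat", "grep", "git status", "git diff"}
DESTRUCTIVE = {"rm", "mv", "git push", "git reset"}

def classify(command: str) -> str:
    words = command.split()
    head = " ".join(words[:2])   # match two-word commands like "git status"
    first = words[0]
    if head in READ_ONLY or first in READ_ONLY:
        return "allow"
    if head in DESTRUCTIVE or first in DESTRUCTIVE:
        return "ask_human"       # destructive: pause for explicit approval
    return "ask_human"           # default-deny: unknown commands need approval
```

The interesting work is in the gray areas: is `git checkout` exploration or destruction? That judgment call is exactly what the policy has to encode.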

The Classifier That Wanted to Be an Agent

Last year, I learned to recognize a specific code smell: the AI classifier.

In my Podcast RAG project, I wanted users to search across both podcast descriptions and episode transcripts. Different databases, different queries. So I did what felt natural: I built a small classifier using Gemini Flash Lite. It would analyze the user’s question and decide: “Is this a description search or a transcript search?” Then it would call the appropriate function.

It worked. But something nagged at me. I had written a classifier to make a decision that a model is already good at making. Worse, the classifier was brittle. What if the user wanted both? What if their intent was ambiguous? I was encoding my assumptions about user behavior into branching logic, and those assumptions were going to be wrong eventually.

The fix was almost embarrassingly simple. I deleted the classifier and gave the agent two tools: search_descriptions and search_episodes. Now, when a user asks a question, the agent decides which tool (or tools) to use. It can search descriptions first, realize it needs more detail, and then dive into transcripts. It can do both in parallel. It makes the call in context, not based on my pre-programmed heuristics. (You can try it yourself at podcasts.hutchison.org.)
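Concretely, the classifier's branching logic became two tool declarations. The format below mimics common function-calling APIs rather than the exact Gemini SDK syntax; the point is that the descriptions now do the routing work the classifier used to do:

```python
# Two tool declarations replacing the classifier. The schema format
# mimics common function-calling APIs; it is a sketch, not the exact
# Gemini SDK syntax. The descriptions carry the routing logic.

TOOL_DECLARATIONS = [
    {
        "name": "search_descriptions",
        "description": (
            "Search podcast show descriptions. Best for questions about "
            "what a podcast is about, its hosts, or its overall topic."
        ),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "search_episodes",
        "description": (
            "Search episode transcripts. Best for questions about what "
            "was actually said within specific episodes."
        ),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```

Notice there is no routing code at all: the model reads both descriptions and picks one, the other, or both.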

I saw the same pattern in Gemini Scribe. Early versions had elaborate logic for context harvesting, code that tried to predict which notes the user would need based on their current document and conversation history. I was building a decision tree for context, and it was getting unwieldy.

When I moved Scribe to a proper agentic architecture, most of that logic evaporated. The agent didn’t need me to pre-fetch context; it could use a read_file tool to grab what it needed, when it needed it. The complex anticipation logic was replaced by simple, reactive tool calls. The application got simpler and more capable at the same time.

Here’s the heuristic I’ve landed on: If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent. Deconstruct those branches into tools, give the agent really good descriptions of what those tools can do, and then let the model choose its own adventure.

You might be thinking: “What about routing queries to different models? Surely a classifier makes sense there.” I’m not so sure anymore. Even model routing starts to look like an orchestration problem, and a lightweight orchestrator with tools for accessing different models gives you the same flexibility without the brittleness. The question isn’t whether an agent can make the decision better than your code. It’s whether the agent, with access to the actual data in the moment, can make a decision at least as good as what you’re trying to predict when you’re writing the code. The agent has context you don’t have at development time.

The “Human-on-the-Loop”

We are transitioning from Human-in-the-Loop (where we manually approve every step) to Human-on-the-Loop (where we set the goals and guardrails, but let the system drive).

This shift is driven by a simple desire: we want partners, not just tools. As I wrote back in April about waiting for a true AI coding partner, a tool requires your constant attention. A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.

This freedom comes with a new responsibility: clarity. If your agent is going to work overnight, you need to make sure it’s working on something productive. You need to be precise about the goal, explicit about the boundaries, and thoughtful about what happens when things go wrong. Without the right guardrails, an agent can get stuck waiting for your input, and you’ll lose that time. Or worse, it can get sidetracked and spend hours on something that wasn’t what you intended.

The goal isn’t to remove the human entirely. It’s to move us from the execution layer to the supervision layer. We set the destination and the boundaries; the agent figures out the route. But we have to set those boundaries well.

Embracing the Complexity (Or Lack Thereof)

Here’s the counterintuitive thing: building an agent isn’t always harder than building a script. Yes, you have to think about loops, tool definitions, and context window management. But as my classifier example showed, an agentic architecture can actually delete complexity. All that brittle branching logic, all those edge cases I was trying to anticipate: gone. Replaced by a model that can reason about what it needs in the moment.

The real complexity isn’t in the code; it’s in the trust. You have to get comfortable with a system that makes decisions you didn’t explicitly program. That’s a different kind of engineering challenge, less about syntax, more about guardrails and judgment.

But the payoff is a system that grows with you. A script does exactly what you wrote it to do, forever. An agent does what you ask it to do, and sometimes finds better ways to do it than you’d considered.

So, if you find yourself staring at your “simple script” and wondering if you should give it a tools definition… just give in. You’re building an agent. It’s inevitable. You might as well enjoy the company.

A central, glowing blue polyhedral node suspended in a dark void, connected to several smaller satellite nodes by taut, luminous blue data filaments and orbital arcs, illustrating a network of interconnected AI agents.

When Agents Talk to Each Other

Welcome back to The Agentic Shift. Over the past eight installments, we’ve built our agent from the ground up, giving it a brain to think, memory to learn, a toolkit to act, instructions to follow, guardrails for safety, and a framework to build on. But there’s been an elephant in the room this whole time: our agent is alone.

I was sitting at my desk late last night, staring at three different windows on my monitor, feeling like a digital switchboard operator from the 1950s.

In one window, I had Helix, my text editor, where I was writing a Python script. In the second, I had a terminal running a deep research agent I’d built for Gemini CLI. In the third, I had a browser open to a documentation page.

Here’s the thing: Gemini CLI is brilliant, but it’s blind. It couldn’t see the code I had open in Helix. It couldn’t read the documentation in my browser. When it found a critical library update, I had to manually copy-paste the relevant code into the terminal. When I wanted it to understand an error, I had to copy-paste the stack trace. I was the glue, the slow, error-prone, context-losing glue.

We have spent this entire series building a digital Robinson Crusoe. In Part 1, we gave our agent a brain. In Part 4, we gave it tools. But watching my own workflow fragment into disjointed copy-paste loops, I realized we’ve hit a wall. We have built brilliant, isolated sparks of intelligence, but we haven’t built the wiring to connect them.

This fragmentation is the single biggest bottleneck in the agentic shift. But that is changing. We are witnessing the birth of the protocols that will turn these isolated islands into a network. We are moving from building agents to building the Internet of Agents.

The Struggle Before Standards

I tried to fix this myself, of course. We all have. I wrote brittle Python scripts to wrap my CLI tools. I tried building a mega-agent that had every possible API key hardcoded into its environment variables. I even built my own agentic TUI that explored many interesting ideas, but ultimately wasn’t the right solution.

My lowest moment came when I spent several evenings and weekends building an Electron-based AI research and writing application. The vision was grand: a unified workspace where I could query multiple AI models, organize research into projects, and write drafts with AI assistance, all in one window. I built a beautiful sidebar for project navigation, a markdown editor with live preview, a chat interface that could talk to Gemini, and a “sources” panel for managing references. By the time I stepped back to evaluate what I’d built, I had thousands of lines of TypeScript, a complex state management system, and an app that was slower than just using the terminal. Worse, it didn’t actually solve my problem. I still couldn’t get the AI to see what was in my other tools. I’d built a new silo, not a bridge. The repo still sits on my hard drive, unopened.

Every solution felt like a band-aid. The problem wasn’t that I couldn’t write the code; it was that I was trying to solve an ecosystem problem with a point solution.

The Anatomy of Connection

To solve this, we don’t just need “better agents.” We need a common language. The industry is converging on three distinct protocols, each solving a different layer of the communication stack: MCP for tools, ACP for interfaces, and A2A for collaboration.

Why three protocols instead of one? For the same reason the internet isn’t just “one protocol.” Think of it like the networking stack: TCP/IP handles reliable data transmission, HTTP handles document requests, and SMTP handles email. Each layer solves a distinct problem, and trying to collapse them into one mega-protocol would create an unmaintainable mess. The same logic applies here. MCP solves the “how do I use this tool?” problem. ACP solves the “how do I show this to a human?” problem. A2A solves the “how do I collaborate with another agent?” problem. They’re designed to compose, not compete.

The Internal Wiring of MCP

The Model Context Protocol (MCP), championed by Anthropic, represents the agent’s Internal Wiring. It answers the fundamental question: How does an agent perceive, act upon, and understand the world?

It’s easy to dismiss MCP as just “standardized tool calling,” but that misses the architectural shift. MCP creates a universal substrate for context, built on three distinct pillars. First, there are Resources, the agent’s sensory input that allows it to read data (files, logs, database rows) passively. Crucially, MCP supports subscriptions, meaning an agent can “watch” a log file and wake up the moment an error appears. Next are Tools, the agent’s hands, allowing for action: executing a SQL query, hitting an API, or writing a file. Finally, there are Prompts, perhaps the most overlooked feature, which allow domain experts to bake workflows directly into the server. A “Git Server” doesn’t just expose git commit; it can expose a generate_commit_message prompt that inherently knows your team’s style guide and grabs the current diff automatically.

Here is what that “handshake” looks like: the response to a tools/list request, following Anthropic’s MCP specification. It’s not magic; it’s a strict contract that turns an opaque binary into a discoverable capability:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "query_database",
        "description": "Execute a SELECT query against the local Postgres instance",
        "inputSchema": {
          "type": "object",
          "properties": {
            "sql": { "type": "string" }
          }
        }
      }
    ]
  }
}

Now, any agent (whether it’s running in Claude Desktop, Cursor, or a custom script) can “plug in” to my Postgres server and immediately know how to use it. It solves the N × M integration problem forever.
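The other half of the handshake is invocation. Once a client has discovered query_database, it invokes it with a tools/call request. Here is a sketch of building that message in Python; the transport (stdio or HTTP, depending on the server) is left out:

```python
import json

# After discovery via `tools/list`, a client invokes a tool with a
# `tools/call` JSON-RPC request. This builds the message; sending it
# over stdio or HTTP depends on the transport.

def make_tools_call(request_id: int, tool_name: str, arguments: dict) -> str:
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)

msg = make_tools_call(1, "query_database", {"sql": "SELECT count(*) FROM users"})
```

The symmetry is the point: discovery tells the agent what exists and invocation is mechanical, so no per-integration glue code survives.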

A skeptical reader might ask: “How is this different from REST or OpenAPI?” It’s a fair question. On the surface, MCP looks like “JSON-RPC with a schema,” and that’s not wrong. But the difference is what gets standardized. OpenAPI describes how to call an endpoint; MCP describes how an agent should understand and use a capability. The schema isn’t just for validation. It’s for reasoning. An MCP tool description is a prompt fragment that teaches the model when and why to use the tool, not just how.

But here’s where I need to offer some nuance, because protocol boosterism can obscure practical reality.

As Simon Willison observed in his year-end review, MCP’s explosive adoption may have been partly a timing accident. It launched right as models got reliable at tool-calling, leading some to confuse “MCP support” with “tool-calling ability.” More pointedly, he notes that for coding agents, “the best possible tool for any situation is Bash.” If your agent can run shell commands, it can use gh for GitHub, curl for APIs, and psql for databases, no MCP server required.

I’ve felt this myself. When I’m working in Gemini CLI, I rarely reach for an MCP server. The GitHub CLI (gh) is faster and more capable than any MCP wrapper I’ve tried. The same goes for git, docker, and most developer tools with good CLIs.

So when does MCP make sense? I see three clear cases. First, when there’s no CLI (for example with my MCP service for Google Workspace), since many SaaS products expose APIs but no command-line interface. An MCP server is the natural wrapper. Second, when you need subscriptions, since MCP’s ability to “watch” a resource and push updates to the agent is something CLIs can’t do cleanly. Third, when you’re crossing network boundaries, since an MCP server can run on a remote machine and expose capabilities securely, which is harder to orchestrate with raw shell access.

The real insight here is about context engineering. MCP servers bring along a lot of context for every tool (descriptions, schemas, the full capability surface). For some workflows, that richness is valuable. But Anthropic themselves acknowledged the overhead with their Skills mechanism, a simpler approach where a Skill is just a Markdown file in a folder, optionally with some executable scripts. Skills are lightweight and only load when needed. MCP and Skills aren’t competing; they’re different tools for different context budgets.

Giving the Agent a Seat at the Keyboard

If MCP is the agent’s internal wiring, the Agent Client Protocol (ACP) is its window to the world.

I like to think of this as the LSP (Language Server Protocol) moment for the agentic age. Before LSP, if you wanted to support a new language in an IDE, you had to write a custom parser for every single editor. It was a nightmare of N × M complexity. ACP solves the same problem for intelligence. It decouples the “brain” from the “UI.”

This is why the collaboration between Zed and Google is so critical. When Zed announced bring your own agent with Google Gemini CLI integration, they weren’t just shipping features. They were standardizing the interface between the client (the editor) and the server (the agent). Intelligence became swappable. I can run a local Gemini instance through the same UI that powers a remote Claude agent.

The core of ACP is Symmetry. It’s not just the editor sending prompts to the agent. Through ACP, an editor like Zed (the reference implementation) can tell the agent exactly where your cursor is, what files you have open, and even feed it the terminal output from a failed build. The agent, in turn, can request to edit a specific line or show you a diff for approval.

I’ve been seriously thinking about building ACP support for Obsidian. I already built Gemini Scribe, an agent that lives inside Obsidian for research and writing assistance, but it’s hardcoded to Gemini. With ACP, I could make Obsidian a universal agent host, letting users bring whatever intelligence they prefer into their knowledge management workflow.

This turns the editor into the ultimate guardrail. Because the agent communicates its intent through a standardized protocol, the editor can pause, show the user exactly what’s about to happen, and wait for that “Approve” click. It’s the infrastructure that makes autonomous coding safe.

But the real magic isn’t just safety; it’s ubiquity. ACP liberates the agent from the tool. It means you can bring your preferred intelligence to whatever surface helps you flow. We are already seeing the ecosystem explode beyond just Zed.

For the terminal die-hards, there is Toad, a framework dedicated entirely to running ACP agents in a unified CLI. And for the VIM crowd, the CodeCompanion project has brought full ACP support to Neovim. This is the promise of the protocol: write the agent once, and let the user decide if they want to interact with it in a modern GUI, a raw terminal, or a modal editor from the 90s. The intelligence remains the same; only the glass changes.

When Agents Meet Strangers

Finally, we have the “Internet” layer: Agent-to-Agent (A2A).

While MCP connects an agent to a thing, and ACP connects an agent to a person, A2A connects an agent to society. It addresses the “lonely agent” problem by establishing a standard for horizontal, peer-to-peer collaboration.

This protocol, pushed forward by Google and the Linux Foundation, introduces a profound shift in how we think about distributed systems: Opaque Execution.

In traditional software, if Service A talks to Service B, Service A needs to know exactly how to call the API. In A2A, my agent doesn’t care about the how; it cares about the goal. My “Travel Agent” can ask a “Calendar Agent” to “find a slot for a meeting,” without knowing if that Calendar Agent is running a simple SQL query, consulting a complex rules engine, or even asking a human secretary for help.

This negotiation happens through the Agent Card, a machine-readable identity file hosted at a standard /.well-known/agent.json endpoint. It solves the “Theory of Mind” gap, allowing one agent to understand the capabilities of another. Here’s what one looks like:

{
  "name": "Calendar Agent",
  "description": "Manages scheduling, finds available slots, and coordinates meetings across time zones.",
  "url": "https://calendar.example.com",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true
  },
  "skills": [
    {
      "id": "find-meeting-slot",
      "name": "Find Meeting Slot",
      "description": "Given a list of participants and constraints, finds optimal meeting times.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "participants": { "type": "array", "items": { "type": "string" } },
          "duration_minutes": { "type": "integer" },
          "preferred_time_range": { "type": "string" }
        }
      }
    }
  ],
  "authentication": {
    "schemes": ["oauth2", "api_key"]
  }
}


When my Travel Agent encounters a scheduling problem, it doesn’t need to know how the Calendar Agent works internally. It reads this card, understands the agent can “find meeting slots,” and delegates the task. The Calendar Agent might use Google Calendar, Outlook, or a custom database. My agent doesn’t care.
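Consuming a card is deliberately mundane. Here is a sketch in Python; the card is inlined rather than fetched from its well-known URL so the example is self-contained, and it is trimmed to the fields actually used:

```python
import json

# A sketch of reading an A2A Agent Card and selecting a skill.
# In practice the card is fetched from the agent's well-known URL;
# here it is inlined (and trimmed) so the example is self-contained.

AGENT_CARD = json.loads("""{
  "name": "Calendar Agent",
  "skills": [
    {"id": "find-meeting-slot",
     "name": "Find Meeting Slot",
     "description": "Finds optimal meeting times."}
  ]
}""")

def find_skill(card: dict, wanted: str):
    # Return the first skill whose id matches, or None if absent.
    for skill in card.get("skills", []):
        if skill["id"] == wanted:
            return skill
    return None

skill = find_skill(AGENT_CARD, "find-meeting-slot")
```

If find_skill comes back empty, the requesting agent simply moves on to the next candidate in its registry; no integration code was written or broken.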

But the real breakthrough is the Task Lifecycle. A2A tasks aren’t just request-response loops; they are stateful, modeled as a finite state machine with well-defined transitions:

  • Submitted: The task has been received but work hasn’t started.
  • Working: The agent is actively processing the request.
  • Input-Required: The agent needs clarification before continuing. This is the key innovation: the agent can pause, ask “Do you prefer aisle or window?”, and wait indefinitely.
  • Completed: The task finished successfully.
  • Failed: Something went wrong. The response includes an error message and optional retry hints.
  • Canceled: The requesting agent (or human) aborted the task.

This state machine brings the asynchronous, messy reality of human collaboration to the machine world. A task might sit in Input-Required for hours while waiting for a human to respond. It might transition from Working to Failed and back to Working after a retry. The protocol handles all of this gracefully.
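The lifecycle above can be sketched as a small transition table. The allowed transitions here are my reading of the states as described, not a verbatim copy of the spec:

```python
# The A2A task lifecycle as a finite state machine. The transition
# table is an interpretation of the states described above, not a
# verbatim copy of the specification.

TRANSITIONS = {
    "submitted":      {"working", "canceled"},
    "working":        {"input-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled"},
    "failed":         {"working"},   # a retry moves the task back to working
    "completed":      set(),         # terminal
    "canceled":       set(),         # terminal
}

def transition(state: str, new_state: str) -> str:
    # Enforce the lifecycle: illegal moves fail loudly instead of silently.
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state
```

Making the table explicit is what lets a task sit in input-required for hours, or bounce from failed back to working, without either side losing track of where things stand.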

Finding Agents You Can Trust

But let’s not declare victory just yet. We are seeing the very beginning of this shift, and the “Internet of Agents” brings its own set of dangers.

As we move from tens of agents to millions, we face a massive Discovery Problem. In a global network of opaque execution, how do you find the right agent? And more importantly, how do you trust it?

It’s not enough to just connect. You need safety guarantees. You need to know that the “Travel Agent” you just hired isn’t going to hallucinate a non-refundable booking or, worse, exfiltrate your credit card data to a malicious third party.

This is the focus of recent research on multi-agent security, which highlights that protocol compliance is only the first step. We need mechanisms for Behavioral Verification, ensuring that an agent does what it says it does.

What does verification look like in practice? Today, it’s mostly manual and ad-hoc. You might:

  • Audit the agent’s logs to see what actions it actually took versus what it claimed.
  • Run it in a sandbox with fake data before trusting it with real resources.
  • Require human approval for high-stakes actions (the “Human-in-the-Loop” pattern we explored in Part 6).
  • Check reputation signals: who built this agent? What’s their track record?

But these are stopgaps. The dream is automated verification: cryptographic proofs that an agent behaved according to its advertised policy, or sandboxed execution environments that can mathematically guarantee an agent never accessed unauthorized data. We’re not there yet.

Whether the solution looks like a decentralized “Web of Trust” (where agents vouch for each other, like PGP key signing) or a centralized “App Store for Agents” (where a trusted authority vets and signs off on agents) remains to be seen. My bet is we’ll see both: curated marketplaces for enterprise use cases, and open registries for the long tail. But solving the discovery and safety problem is the only way we move from a toy ecosystem to a production economy.

The Foundation of the Future

What excites me most isn’t just the code. It’s the governance.

We have seen this movie before. In the early days of the web, proprietary browser wars threatened to fracture the internet. We risked a world where “This site only works in Internet Explorer” became the norm. We avoided that fate because of open standards.

The same risk exists for agents. We cannot afford a future where an “Anthropic Agent” refuses to talk to an “OpenAI Agent” that won’t talk to a “Google Agent.”

That is why the formation of the Agentic AI Foundation by the Linux Foundation is the most important news you might have missed. By bringing together AI pioneers like OpenAI and Anthropic alongside infrastructure giants like Google, Microsoft, and AWS under a neutral banner, we are ensuring that the “Internet of Agents” remains open. This foundation will oversee the development of protocols like A2A, ensuring they evolve as shared public utilities rather than walled gardens. It is the guarantee that the intelligence we build today will be able to talk to the intelligence we build tomorrow.

The New Architecture of Work

When we combine these three protocols, the fragmentation dissolves.

Imagine I am back in Zed (connected via ACP). I ask my coding agent to “Add a secure user profile page.” Zed sends my cursor context to the agent. The agent reaches for MCP to query my local database schema and understand the users table. Realizing this touches PII, it autonomously pings a “Security Guardrail Agent” via A2A to review the proposed code. Approval comes back, and my local agent writes the code directly into my buffer.

I didn’t switch windows once.

But what happens when things go wrong? Let’s say the Security Guardrail Agent rejects the code because it detected a SQL injection vulnerability. The A2A task transitions to Failed with a structured error: {"reason": "sql_injection_detected", "line": 42, "suggestion": "Use parameterized queries"}. My local agent receives this, understands the failure, and either fixes the issue automatically or surfaces it to me with context. The rejection isn’t a dead end; it’s a conversation.

Or imagine the MCP server for my database is unreachable. The agent doesn’t just hang. It receives a timeout error and can decide to retry, fall back to cached schema information, or ask me whether to proceed without database context. Robust failure handling is baked into the protocols, not bolted on as an afterthought.
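That retry-or-degrade behavior is ordinary code on the agent side. Here is a sketch; fetch_schema and cached_schema are hypothetical stand-ins for an MCP resource read and a local cache:

```python
import time

# A sketch of the retry-then-fall-back pattern described above.
# `fetch_schema` and `cached_schema` are hypothetical stand-ins for
# an MCP resource read and a locally cached copy of the schema.

def get_schema_with_fallback(fetch_schema, cached_schema, retries=2, delay=0.0):
    for _ in range(retries + 1):
        try:
            return fetch_schema(), "live"
        except TimeoutError:
            time.sleep(delay)        # back off before the next attempt
    # The server stayed unreachable: degrade to cached context rather
    # than hanging, and let the caller decide whether that is enough.
    return cached_schema, "cached"
```

The "cached" marker matters: the agent (or the human supervising it) can decide whether stale schema information is good enough for the task at hand.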

Where We Are Today

I want to be honest about maturity. These protocols are real and shipping, but the ecosystem is young.

MCP is the most mature. Just about everything supports it now: coding tools, virtualization environments, editors, even mobile apps. There are hundreds of community MCP servers for everything from Notion to Kubernetes. If you want to try this today, MCP is the on-ramp.

ACP is newer but moving fast. Zed is the reference implementation, with Neovim (via CodeCompanion) and terminal clients (via Toad) close behind. There are also robust client APIs for many languages, making ACP an interesting interface for controlling local agentic applications. If your editor doesn’t support ACP yet, you’ll likely be using proprietary plugin APIs for now.

A2A is the most nascent. Google and partners announced it in mid-2025, and the specification is still evolving. There aren’t many production A2A deployments yet. Most multi-agent systems today use custom protocols or framework-specific solutions like CrewAI or LangGraph. But the spec is public, the governance is in place, and early adopters are building.

If you’re starting a project today, my advice is: use MCP for tool integration, use whatever your editor supports for the UI layer, and keep an eye on A2A for future multi-agent workflows. The pieces are coming together, but we’re still early.

And yet, this isn’t science fiction. The protocols are here today. The “Internet of Agents” is booting up, and for the first time, our digital Robinson Crusoes are finally getting a radio.

But a radio is only as good as the conversations it enables. In our next post, we’ll move from protocols to practice and explore what happens when agents don’t just connect, but actually collaborate: forming teams, delegating tasks, and solving problems no single agent could tackle alone.

A laptop sits on a dark wooden desk under the warm glow of an Edison bulb; above the screen, a stream of glowing, holographic research papers and data visualizations cascades downward like a waterfall, physically dissolving into lines of green and white markdown text as they enter the open terminal window.

Bringing Deep Research to the Terminal

I lost the report somewhere between browser tabs. One moment it was there in the Gemini app, a detailed deep research analysis on how AI agents communicate with each other, complete with citations and a synthesis I’d spent an hour reviewing. The next moment, gone. Along with the draft blog post I’d been weaving it into.

I was working on part nine of my Agentic Shift series, trying to answer the question of what happens when agents start talking to each other instead of just talking to us. The research was sprawling—academic papers on multi-agent systems, documentation from LangGraph and AutoGen, blog posts from researchers at DeepMind and OpenAI. I’d been using Gemini’s deep research feature in the app to help synthesize all of this, and it was genuinely useful. The AI would spend minutes thinking through the question, querying sources, building a structured report. But then I had to move that report into my text-based workflow. Copy, paste, reformat, lose formatting, copy again. Somewhere in that dance between the browser and my terminal, I lost everything.

I stared at the empty browser tab for a moment. I could start over, rerun the research in the Gemini app, be more careful about saving this time. But this wasn’t the first time I’d hit this friction. Every time I used deep research in the browser, I had to bridge two worlds: the app where the AI did its thinking, and the terminal where I actually write and build.

What looked like yak shaving was actually a prerequisite. I needed deep research capabilities in my terminal workflow, not just wanted them. I couldn’t keep jumping between environments. And I was in luck. Just a few weeks earlier, Google had announced that deep research was now available through the Gemini API. The capability I’d been using in the browser could be accessed programmatically.

When Features Live in the Wrong Place

I’m not going to pretend this was built based on demand from the community. I needed this. Specifically, I needed to stop context-switching between the Gemini app and my terminal, because every time I did, I was introducing friction and risk. The lost report was just the most recent symptom of a workflow that was fundamentally broken for how I work.

I live in the terminal. My notes are markdown files. My drafts are plain text. My build process, my git workflow, my entire development environment assumes I’m working with files and command-line tools. When I have to move work from a browser back into that environment, I’m not just inconvenienced—I’m fighting against the grain of everything else I do.

Deep research is powerful. It works. But living in a web app meant it was disconnected from the places where I actually needed it. Sure, other people might benefit from having this integrated into MCP-compatible tools, but that’s a nice side effect. The real reason I built this was simpler: I had to finish part nine of the Agentic Shift series, and I couldn’t do that without fixing my workflow first.

The Model Context Protocol made this possible. It’s a standard for exposing AI capabilities as tools that can plug into different environments. Google’s API gave me the primitives. I just needed to connect them to where I actually work.

Building the Missing Piece

The extension wraps Gemini’s deep research capabilities into the Model Context Protocol, which means it integrates seamlessly with Gemini CLI and any other MCP-compatible client. The architecture is deliberately simple, but it supports two distinct workflows depending on what you need.

The first workflow is straightforward: you have a research question, and you want a deep investigation. You can kick off research with a simple command, but if you use the bundled /deep-research:start slash command, the model first guides you through refining your question to get the most out of deep research. The agent then spends tens of minutes, or as much time as it needs, planning the investigation, querying sources, and synthesizing findings into a detailed report with citations you can follow up on.
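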

The second workflow is for when you want to ground the research in your own documents. You use /deep-research:store-create to set up a file search store, then /deep-research:store-upload to index your files. Once they’re uploaded, you have two options: you can include that dataset in the deep research process so the agent grounds its investigation in your specific sources, or you can query against it directly for a simpler RAG experience. This is the same File Search capability I wrote about in November when I rebuilt my Podcast RAG system, but now it’s accessible from the terminal as part of my normal workflow.

The extension maintains local state in a workspace cache, so you don’t have to remember arcane resource identifiers or lose track of running research jobs. The whole thing is designed to feel as natural as running a grep command or kicking off a build—it’s just another tool in the environment where I already work.
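The state-tracking idea is simple to sketch: a small JSON file in the workspace maps human-friendly names to resource identifiers and running job IDs. This is my own illustration of the shape of the idea, not the extension's real cache format:

```python
# A minimal sketch of a local workspace cache: map friendly names to
# resource identifiers so the user never has to remember them.
# File name and structure are illustrative, not the extension's real format.
import json
from pathlib import Path

CACHE_PATH = Path(".deep_research_cache.json")  # hypothetical location

def load_cache() -> dict:
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())
    return {"stores": {}, "jobs": {}}

def remember(kind: str, name: str, resource_id: str) -> None:
    """Record a resource under a friendly name and persist to disk."""
    cache = load_cache()
    cache[kind][name] = resource_id
    CACHE_PATH.write_text(json.dumps(cache, indent=2))

CACHE_PATH.unlink(missing_ok=True)  # start fresh for the demo
remember("jobs", "stonehenge", "operations/abc123")
print(load_cache()["jobs"]["stonehenge"])  # operations/abc123
```

With something like this in place, a slash command can refer to "stonehenge" instead of an opaque operation ID.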

So did it actually work?

The first time I ran it, I asked for a deep dive into Stonehenge construction. I’d been reading Ken Follett’s novel Circle of Days and found myself curious about the scientific evidence behind the story: what do we actually know about how it was built, and who built it?

I kicked off the query and watched something fascinating happen. The model understood that deep research takes time. Instead of waiting silently, it kept checking in to see if the research was done, almost like checking the oven to see if dinner was ready. Twenty minutes later, a markdown file appeared in my filesystem with a comprehensive research report, complete with citations to academic sources, isotope analysis, and archaeological evidence. I didn’t have to copy anything from a browser. I didn’t lose any formatting. It was just there, ready to reference.

The report mentioned the Bell Beaker culture and what happened to the Neolithic builders around 2500 BCE, which sent me down another rabbit hole. I immediately ran a second research query on that transition. Same seamless experience. That’s when I knew this was exactly what I needed.
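That oven-checking behavior is just a poll-until-done loop. A hedged sketch of the pattern, with a stand-in `get_status` function where the real API call would go:

```python
# Sketch of the poll-until-done pattern for a long-running research job.
# `get_status` stands in for the real API call; here it is simulated so the
# snippet runs on its own.
import time

def get_status(job_id: str, _state={"ticks": 0}) -> str:
    """Simulated status check: reports 'running' twice, then 'done'."""
    _state["ticks"] += 1
    return "done" if _state["ticks"] >= 3 else "running"

def wait_for_report(job_id: str, interval_s: float = 0.01) -> str:
    """Check in periodically until the job finishes, then return its status."""
    while get_status(job_id) != "done":
        time.sleep(interval_s)  # don't hammer the API between checks
    return "done"

print(wait_for_report("operations/abc123"))  # done
```

A real client would also want a timeout and backoff, but the core loop is exactly this simple.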

What This Actually Means

I think extensions like this represent something important about where AI development is heading. We’re past the proof-of-concept phase where every AI interaction is a magic trick. Now we’re in the phase where AI capabilities need to integrate into actual workflows—not replace them, but augment them in ways that feel natural.

This is what I wrote about in November when I talked about the era of Personal Software. We’ve crossed a threshold where building a bespoke tool is often faster—and certainly less frustrating—than trying to adapt your workflow to someone else’s software. I didn’t build this extension for the community. I built it because I needed it. I had lost work, and I needed to stop context-switching between environments. If other people find it useful, that’s a nice side effect, but it’s fundamentally software for an audience of one.

The key insight for me was that the Model Context Protocol isn’t just a technical standard; it’s a design pattern for making AI tools composable. Instead of building a monolithic research application with its own UI and workflow, I built a small, focused extension that does one thing well and plugs into the environment where I already work. That composability matters because it means the tool can evolve with my workflow rather than forcing my workflow to evolve around the tool.

There’s also something interesting happening with how we think about AI capabilities. Deep research isn’t about making the model smarter—it’s about giving it time and structure. The same model that gives you a superficial answer in three seconds can give you a genuinely insightful report if you let it think for tens of minutes and provide it with the right sources. We’re learning that intelligence isn’t just about raw capability; it’s about how you orchestrate that capability over time.

What Comes Next

The extension is live on GitHub now, and I’m using it daily for my own research workflows. The immediate next step is adding better control over the research format—right now you can specify broad categories like “Technical Deep Dive” or “Executive Brief,” but I want more granular control over structure and depth. I’m also curious about chaining multiple research tasks together, where the output of one investigation becomes the input for the next.

But the bigger question I’m sitting with is what other AI capabilities are hiding in plain sight, waiting for someone to make them accessible. Deep research was always there in the Gemini API; it just needed a wrapper that made it feel like a natural part of the development workflow. What else is out there?

If you want to try it yourself, you’ll need a Gemini API key (get one at ai.dev) and to set the GEMINI_DEEP_RESEARCH_API_KEY environment variable. Deep research runs on Gemini 3.0 Pro, and you can find the current pricing here. It’s charged based on token consumption for the research process plus any tool usage fees.

Install the extension with:

gemini extensions install https://github.com/allenhutchison/gemini-cli-deep-research --auto-update

The full source is on GitHub.

As for me, I still need to finish part nine of the Agentic Shift series. But now I can get back to it with the confidence that I’m working in my preferred environment, with the tools I need accessible right from the terminal. Fair warning: once you start using AI for actual deep research, it’s hard to go back to the shallow stuff.

A close-up photograph on a wooden workbench shows a hand-carved wooden tool handle resting on a MacBook Pro keyboard. The handle transitions into a glowing blue and orange digital wireframe where it extends over the laptop's screen, which displays lines of green code. Wood shavings, chisels, and other traditional tools are scattered around the laptop. A warm desk lamp illuminates the scene from the right.

The Era of Personal Software

I was sitting in a coffee shop this afternoon, nursing a cappuccino and doing a quick triage of the GitHub repositories I maintain. It was supposed to be a quick check-in, but I was surprised to find a pile of issues I hadn’t seen before. They had slipped through the cracks of my notifications.

My immediate reaction wasn’t just annoyance; it was an itch to fix the process. I needed a way to monitor a configurable set of repos and get a consolidated report of new activity—something bespoke. For my smaller projects, I want to see everything. For the big, noisy ones, I only care if I’m assigned or mentioned.

So, I opened up my terminal. I fired up the Gemini CLI and started describing what I needed.

Twenty minutes later, I had a working command-line tool. It did exactly what I described, filtering the noise exactly how I wanted. I ran it, verified the output, and added it to my daily workflow. I closed my laptop and went on with my day.

But on the walk home, I realized something strange had happened. Or rather, something hadn’t happened.

I never opened Google. I never searched GitHub for “activity monitor CLI.” I didn’t spend an hour trawling through “Top 10 GitHub Tools” blog posts, or installing three different utilities only to find out one was deprecated and the other required a subscription.

I just built the thing I needed and moved on.

We are entering the era of Personal Software. This is software written for an audience of one. It’s an application or a script built to solve a specific problem for a specific person, with no immediate intention of scaling, monetizing, or even sharing.

Looking back at my recent work, I realize I’ve been living in this category for a while. In many ways, this is the active evolution of the “Small Tools, Big Ideas” concept I explored earlier this year. Instead of just finding these sharp, focused tools, I’m now building them. Gemini Scribe started because I wanted a better way to write in Obsidian. Podcast Rag exists solely because I wanted to search my own podcast history. My github-activity-reporter from this afternoon? Pure personal necessity. Even adh-cli was just a sandbox for me to test ideas for the Gemini CLI.

We have crossed a threshold where building a bespoke application is often faster—and certainly less frustrating—than finding an off-the-shelf solution that mostly works. The friction of creation has dropped so low that it is now competing with the friction of discovery.

There is a profound freedom in this approach. When you build for an audience of one, the software does exactly what you want and nothing more. There is no feature bloat, no upsell, no UI clutter. You are the product manager, the engineer, and the customer. If your workflow changes next week, you don’t have to file a feature request and hope it gets upvoted; you just change the code. You don’t have to convince anyone else that your problem is worth solving.

But this freedom comes with a new kind of responsibility. When you step outside the walled garden of managed software, you are on your own. If you get stuck, there is no support ticket to file. If an API changes and breaks your tool, you are on the hook to fix it.

There is also the “trap of success.” Sometimes, your personal software is so useful that it accidentally becomes non-personal. Friends ask for it. Colleagues want to fork it. Suddenly, you aren’t just a user anymore; you’re a maintainer. You have to decide if you’re willing to take on the burden of supporting others, or if you’re comfortable saying, “This works for me, good luck to you.”

Not every problem is a nail for this particular hammer, of course. Over time, I’ve started to develop a rubric for what makes for good Personal Software.

The sweet spot is usually glue and logic. If you need to connect two APIs that don’t talk to each other, or parse a messy data export into a clean report, AI can write that script in seconds. My GitHub activity reporter is a perfect example: it’s just fetching data, filtering it against my specific rules, and printing text.
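The core of a tool like that reporter is just a few lines of filtering. A sketch, with hypothetical per-repo rules like the ones I described (watch everything on small repos, only assigned or mentioned items on big ones); the rule structure and sample data are illustrative:

```python
# Sketch of the "glue and logic" core of a GitHub activity reporter:
# filter fetched items against per-repo rules, then print. The rule
# structure, usernames, and sample data are all hypothetical.
RULES = {
    "me/small-project": {"mode": "everything"},
    "bigorg/noisy-repo": {"mode": "mentions_only"},
}

def wanted(item: dict, me: str = "allen") -> bool:
    """Decide whether an activity item belongs in the report."""
    rule = RULES.get(item["repo"], {"mode": "mentions_only"})
    if rule["mode"] == "everything":
        return True
    return me in item.get("assignees", []) or me in item.get("mentions", [])

items = [
    {"repo": "me/small-project", "title": "New bug report"},
    {"repo": "bigorg/noisy-repo", "title": "Unrelated chatter"},
    {"repo": "bigorg/noisy-repo", "title": "Needs my review", "assignees": ["allen"]},
]
for item in items:
    if wanted(item):
        print(f"{item['repo']}: {item['title']}")
```

Everything else, fetching from the GitHub API and formatting output, is plumbing the model writes in one pass.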

It’s also great for ephemeral workflows. If you have a task you need to do fifty times today but might never do again—like renaming a batch of files based on their content or scraping a specific webpage for research—building a throwaway tool is vastly superior to doing it manually.

Another fantastic category is quick web applications. We used to think of web apps as heavy projects requiring frameworks and hosting headaches. But modern platforms like Google Cloud Run or Vercel have made deployment trivial. Tools like Google AI Studio take this even further—offering a free “vibe coding” platform that can take you from a rough idea to a hosted application in minutes. My boxing workout app is a prime example: I didn’t write a line of infrastructure code; I just described the workout timer I needed, and it was live before I even put on my gloves.

Where Personal Software falls short is in infrastructure and security. I wouldn’t build my own password manager or roll my own encryption tools, no matter how good the model is. The stakes are too high, and the “audience of one” means there are no other eyes on the code to catch critical vulnerabilities. Similarly, if a problem requires a complex, interactive GUI or high-availability hosting, the maintenance burden usually outweighs the benefits of customization.

Despite the downsides, I find this shift fascinating. For decades, software development was an industrial process—building generic tools for mass consumption. Now, it’s becoming a craft again. We are returning to a time where we build our own tools, fitting the handle perfectly to our own grip.

So, I want to turn the question over to you. What are you building just for yourself? Are there small, nagging problems you’ve solved with a script only you will ever see? I’d love to hear about the kinds of personal software you’re creating in this new era. Let me know in the comments or reach out—I’m genuinely curious to see what handles you’re crafting.

A retro computer monitor displaying the Gemini CLI prompt "> Ask Gemini to scaffold a web app" inside a glowing neon blue and pink holographic wireframe box, representing a digital sandbox.

The Guardrails of Autonomy

I still remember the first time I let an LLM execute a shell command on my machine. It was a simple ls -la, but my finger hovered over the Enter key for a solid ten seconds.

There is a visceral, lizard-brain reaction to giving an AI that level of access. We all know the horror stories—or at least the potential horror stories. One hallucinated argument, one misplaced flag, and a helpful cleanup script becomes rm -rf /. This fear creates a central tension in what I call the Agentic Shift. We want agents to be autonomous enough to be useful—fixing a bug across ten files while we grab coffee—but safe enough to be trusted with the keys to the kingdom.

Until now, my approach with the Gemini CLI was the blunt instrument of “Human-in-the-Loop.” Any tool call with a side effect—executing shell commands, writing code, or editing files—required a manual y/n confirmation. It was safe, sure. But it was also exhausting.

I vividly remember asking Gemini to “fix all the linting errors in this project.” It brilliantly identified the issues and proposed edits for twenty different files. Then I sat there, hitting yyy… twenty times.

The magic evaporated. I wasn’t collaborating with an intelligent agent; I was acting as a slow, biological barrier for a very expensive macro. This feeling has a name—“Confirmation Fatigue”—and it’s the silent killer of autonomy. I realized I needed to move from micromanagement to strategic oversight. I didn’t want to stop the agent; I wanted to give it a leash.

The Policy Engine

The solution I’ve built is the Gemini CLI Policy Engine.

Think of it as a firewall for tool calls. It sits between the LLM’s request and your operating system’s execution. Every time the model reaches for a tool—whether it’s to read a file, run a grep command, or make a network request—the Policy Engine intercepts the call and evaluates it against a set of rules.

The system relies on three core actions:

  1. allow: The tool runs immediately.
  2. deny: The AI gets a “Permission denied” error.
  3. ask_user: The default manual approval.

A Hierarchy of Trust

The magic isn’t just in blocking or allowing things; it’s in the hierarchy. Instead of a flat list of rules, I built a tiered priority system that functions like layers of defense.

At the base, you have the Default Safety Net. These are the built-in rules that apply to everyone—basic common sense like “always ask before overwriting a file.”

Above that sits the User Layer, which is where I define my personal comfort zone. This allows me to customize the “personality” of my safety rails. On my personal laptop, I might be a cowboy, allowing git commands to run freely because I know I can always undo a bad commit. But on a production server, I might lock things down tighter than a vault.

Finally, at the top, is the Enterprise/Admin Layer. These are the immutable laws of physics for the agent. In an enterprise setting, this is where you ensure that no matter how “creative” the agent gets, it can never curl data to an external IP or access sensitive directories.
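The resolution logic is easy to sketch: gather every rule that matches the tool call, then let the highest-priority decision win, falling back to ask_user when nothing matches. This is my own illustration of the idea, not the engine's actual implementation; the field names follow the TOML rules I use:

```python
# Sketch of tiered policy evaluation: among the rules that match a tool
# call, the highest-priority decision wins; if nothing matches, ask the
# user. Field names mirror the TOML format; the code is illustrative.
def evaluate(rules: list[dict], tool: str, command: str) -> str:
    matching = [
        r for r in rules
        if r["toolName"] == tool
        and any(command.startswith(p) for p in r["commandPrefix"])
    ]
    if not matching:
        return "ask_user"  # default safety net
    # Admin-layer rules carry the highest priorities, so they always win.
    return max(matching, key=lambda r: r["priority"])["decision"]

rules = [
    {"toolName": "run_shell_command", "commandPrefix": ["git "],
     "decision": "allow", "priority": 100},
    {"toolName": "run_shell_command", "commandPrefix": ["git push --force"],
     "decision": "deny", "priority": 999},
]
print(evaluate(rules, "run_shell_command", "git status"))        # allow
print(evaluate(rules, "run_shell_command", "git push --force"))  # deny
print(evaluate(rules, "run_shell_command", "rm -rf /"))          # ask_user
```

Note how the layering falls out of a single `max` over priorities: a user rule can loosen the defaults, but an admin rule with a higher priority is unbeatable.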

Safe Exploration

In practice, this means I can trust the agent to look but ask it to verify before it touches. I generally trust the agent to check the repository status, review history, or check if the build passed. I don’t need to approve every git log or gh run list.

[[rule]]
toolName = "run_shell_command"
commandPrefix = [
  "git status",
  "git log",
  "git diff",
  "gh issue list",
  "gh pr list",
  "gh pr view",
  "gh run list"
]
decision = "allow"
priority = 100

Yolo Mode

Sometimes, I’m working in a sandbox and I just want speed. I can use the dedicated yolo mode to take the training wheels off. There is a distinct feeling of freedom—and a slight thrill of danger—when you watch the terminal fly by, commands executing one after another.

However, even in Yolo mode, I want a final sanity check before I push code or open a PR. While Yolo mode is inherently permissive, I define specific high-priority rules to catch critical actions. I also explicitly block docker commands—I don’t want the agent spinning up (or spinning down) containers in the background without me knowing.

# Exception: Always ask before committing or creating a PR
[[rule]]
toolName = "run_shell_command"
commandPrefix = ["git commit", "gh pr create"]
decision = "ask_user"
priority = 900
modes = ["yolo"]

# Exception: Never run docker commands automatically
[[rule]]
toolName = "run_shell_command"
commandPrefix = "docker"
decision = "deny"
priority = 999
modes = ["yolo"]

The Hard Stop

And then there are the things that should simply never happen. I don’t care how confident the model is; I don’t want it rebooting my machine. These rules are the “break glass in case of emergency” protections that let me sleep at night.

[[rule]]
toolName = "run_shell_command"
commandRegex = "^(shutdown|reboot|kill)"
decision = "deny"
priority = 999

Decoupling Capability from Control

The significance of this feature goes beyond just saving me from pressing y. It fundamentally changes how we design agents.

I touched on this concept in my series on autonomous agents, specifically in Building Secure Autonomous Agents, where I argued that a “policy engine” is essential for scaling from one agent to a fleet. Now, I’m bringing that same architecture to the local CLI.

Previously, the conversation around AI safety often presented a binary choice: you could have a capable agent that was potentially dangerous, or a safe agent that was effectively useless. If I wanted to ensure the agent wouldn’t accidentally delete my home directory, the standard advice was to simply remove the shell tool. But that is a false choice. It confuses the tool with the intent. Removing the shell doesn’t just stop the agent from doing damage; it stops it from running tests, managing git, or installing packages—the very things I need it to do.

With the Policy Engine, I can give the agent powerful tools but wrap them in strict policies. I can give it access to kubectl, but only for get commands. I can let it edit files, but only on specific documentation sites.

This is how we bridge the gap between a fun demo and a production-ready tool. It allows me to define the sandbox in which the AI plays, giving me the confidence to let it run autonomously within those boundaries.

Defining Your Own Rules

The Policy Engine is available now in the latest release of Gemini CLI. You can dive into the full documentation here.

If you want to see exactly what rules are currently active on your system—including the built-in defaults and your custom additions—you can simply run /policies list from inside the Gemini CLI.

I’m currently running a mix of “Safe Exploration” and “Hard Stop” rules. It’s quieted the noise significantly while keeping my file system intact. I’d love to hear how you configure yours—are you a “deny everything” security maximalist, or are you running in full “allow” mode?

A stylized, dark digital illustration of an open laptop displaying lines of blue code. Floating above the laptop are three glowing, neon blue wireframe icons: a document on the left, a calendar in the center, and an envelope on the right. The icons appear to be formed from streams of digital particles rising from the laptop screen, symbolizing the integration of digital tools. The overall aesthetic is futuristic and high-tech, with dramatic lighting emphasizing the connection between the code and the applications.

Bringing the Office to the Terminal

There is a specific kind of friction that every developer knows. It’s the friction of the “Alt-Tab.”

You’re deep in the code, holding a complex mental model of a system in your head, when you realize you need to check a requirement. That requirement lives in a Google Doc. Or maybe you need to see if you have time to finish a feature before your next meeting. That information lives in Google Calendar.

So you leave the terminal. You open the browser. You navigate the tabs. You find the info. And in those thirty seconds, the mental model you were holding starts to evaporate. The flow is broken.

But it’s not just the context switch that kills your momentum—it’s the ambush. The moment you open that browser window, the red dots appear. Chat pings, new emails, unresolved comments on a doc you haven’t looked at in two days—they all clamor for your attention. Before you know it, the quick thing you needed to look up has morphed into an hour of answering questions and putting out fires. You didn’t just lose your place in the code; you lost your afternoon.

I’ve been thinking a lot about this friction lately, especially as I’ve moved more of my workflow into the Gemini CLI. If we want AI to be a true partner in our development process, it can’t just live in a silo. It needs access to the context of our work—and for most of us, that context is locked away in the cloud, in documents, chats, and calendars.

That’s why I built the Google Workspace extension for Gemini CLI.

Giving the Agent “Senses”

We often talk about AI agents in the abstract, but their utility is defined by their boundaries. An agent that can only see your code is a great coding partner. An agent that can see your code and your design documents and your team’s chat history? That’s a teammate.

This extension connects the Gemini CLI to the Google Workspace APIs, effectively giving your terminal-based AI a set of digital senses and hands. It’s not just about reading data; it’s about integrating that data into your active workflow.

Here is what that looks like in practice:

1. Contextual Coding

Instead of copying and pasting requirements from a browser window, you can now ask Gemini to pull the context directly.

“Find the ‘Project Atlas Design Doc’ in Drive, read the section on API authentication, and help me scaffold the middleware based on those specs.”

2. Managing the Day

I often get lost in work and lose track of time. Now, I can simply ask my terminal:

“Check my calendar for the rest of the day. Do I have any blocks of free time longer than two hours to focus on this migration?”

3. Seamless Communication

Sometimes you just need to drop a quick note without leaving your environment.

“Send a message to the ‘Core Eng’ chat space letting them know the deployment is starting now.”

The Accidental Product

Truth be told, I didn’t set out to build a product. When I first joined Google DeepMind, this was simply my “starter project.” My manager suggested I spend a few weeks experimenting with Google Workspace and our agentic capabilities, and the Gemini CLI seemed like the perfect sandbox for that kind of exploration.

I started building purely for myself, guided by my own daily friction. I wanted to see if I could check my calendar without leaving the terminal. Then I wanted to see if I could pull specs from a Doc. I followed the path of my own curiosity, adding tools one by one.

But when I shared this little experiment with a few colleagues, the reaction was immediate. They didn’t just think it was cool; they wanted to install it. That’s when I realized this wasn’t just a personal hack—it was a shared need. It snowballed from a few scripts into a full-fledged extension that we knew we had to ship.

Under the Hood

The extension is built as a Model Context Protocol (MCP) server, which means it runs locally on your machine. It uses your own OAuth credentials, so your data never passes through a third-party server. It’s direct communication between your local CLI and the Google APIs.

It currently supports a wide range of tools across the Workspace suite:

  • Docs & Drive: Search for files, read content, and even create new docs from markdown.
  • Calendar: List events, find free time, and schedule meetings.
  • Gmail: Search threads, read emails, and draft replies.
  • Chat: Send messages and list spaces.

Why This Matters

This goes back to the idea of “Small Tools, Big Ideas.” Individually, a command-line tool to read a calendar isn’t revolutionary. But when you combine that capability with the reasoning engine of a large language model, it becomes something else entirely.

It turns your terminal into a cockpit for your entire digital work life. It allows you to script interactions between your code and your company’s knowledge base. It reduces the friction of context switching, letting you stay where you are most productive.

If you want to try it out, the extension is open source and available now. You can install it directly into the Gemini CLI:

gemini extensions install https://github.com/gemini-cli-extensions/workspace

I’m curious to see how you all use this. Does it change your workflow? Does it keep you in the flow longer? Give it a spin and let me know.

A central glowing crystal, representing a core AI, is connected by light pathways to four floating spheres. Each sphere contains a holographic blueprint for an AI framework: Google ADK, LangChain, and CrewAI, set against a dark, futuristic background with circuit patterns.

Choosing Your Agent Framework

Welcome back to The Agentic Shift. Over the past seven installments, we’ve carefully dissected the anatomy of an agent, peered into its different modes of thinking, mapped out its memory systems, examined its toolkit, learned how to guide its behavior, erected necessary safety guardrails, and tackled the challenge of managing its finite attention. We’ve essentially built a conceptual blueprint for an autonomous AI partner.

The ‘Simple Loop’ Fallacy

While some people will say that an agent is nothing more than a model running in a loop and using tools, if you implement your own agent you will find that this simple statement hides a lot of detail. Every model has its quirks: handling parallel tool calls, progressive context compression, and input context window management all come with frustrations, and that’s before you start to deal with the unique features of each model SDK. Each time, you’ll find yourself spending 80% of your time on what one might call “undifferentiated heavy lifting”: the complex but repetitive plumbing that every agent needs but that adds no unique value to your specific application.

This brings to mind something François Chollet said recently: to truly understand a concept, you have to “invent” it yourself. Understanding is an “active, high-agency, self-directed process of creating and debugging your own mental models”. And anyone who builds an agent from scratch has definitely been creating and debugging. The hands-on struggle teaches precisely where the real engineering challenges lie: state persistence across sessions, secure tool execution, intelligent context curation, and robust error handling.

The lesson is clear: build from scratch once to truly understand the fundamentals, but use a framework for everything after that. The real decision isn’t if you should use one, but which one to choose. This choice hinges on your specific needs—from single vs. multi-agent architectures to your existing cloud ecosystem, production timeline, and desired level of control. While my personal stack now favors the Google ADK for production and raw code for learning, I’ve learned the framework is just the scaffolding, not the building itself.

What Frameworks Actually Solve

The complexity frameworks address isn’t trivial. Based on my experience and research, they solve four fundamental challenges:

1. The Stateless → Stateful Transformation

LLM APIs are fundamentally stateless—each call has no memory of the previous one. Creating an agent that remembers your preferences, learns from interactions, and maintains context requires sophisticated external memory architecture, as we explored in Part 3. Frameworks provide battle-tested solutions, from simple conversation buffers to complex integrations with vector databases for semantic memory and knowledge graphs for entity relationships.

Take a customer service agent that needs to remember a user’s issue across multiple sessions. Without a framework, you’re writing database schemas, managing session state, implementing conversation history pruning, and building retrieval pipelines. Frameworks like LangGraph, however, handle much of this with just a few lines of configuration.
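The simplest version of that external memory is a conversation buffer persisted per session, so each stateless API call can be replayed with history. A sketch under no particular framework's API; the file naming and pruning strategy are my own illustration:

```python
# Sketch of the stateless-to-stateful transformation: persist a per-session
# conversation buffer so each (stateless) LLM call can include the history.
# Naive pruning keeps only the most recent turns. Purely illustrative.
import json
from pathlib import Path

class SessionMemory:
    def __init__(self, session_id: str, max_turns: int = 20):
        self.path = Path(f"session_{session_id}.json")  # hypothetical storage
        self.max_turns = max_turns
        self.turns = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def add(self, role: str, content: str) -> None:
        """Append a turn, prune old history, and persist to disk."""
        self.turns.append({"role": role, "content": content})
        self.turns = self.turns[-self.max_turns:]  # naive pruning
        self.path.write_text(json.dumps(self.turns))

Path("session_demo.json").unlink(missing_ok=True)  # start clean for the demo
mem = SessionMemory("demo")
mem.add("user", "My issue is a broken login.")
mem.add("assistant", "Got it, looking into the login flow.")
# A fresh object for the same session sees the earlier turns:
print(len(SessionMemory("demo").turns))  # 2
```

Real frameworks replace the JSON file with databases, vector stores, and smarter pruning, but the contract is the same: rehydrate state before the call, persist it after.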

2. The Tool Orchestration Loop

Giving an agent “hands,” as we discussed in Part 4, means building a robust runtime that can generate machine-readable tool definitions, parse the LLM’s tool-calling decisions, validate and securely execute calls, handle errors gracefully, and feed results back into reasoning. I’ve written this loop several times. Each time, I discovered new edge cases. What happens when a tool times out? When the LLM hallucinates a non-existent function? When it tries to pass a string to an integer parameter? Frameworks have discovered these edge cases already and handle them elegantly.
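Those edge cases, the hallucinated function and the string-where-an-integer-goes, are exactly what the validation layer of the loop has to catch. A stripped-down sketch of that layer; the tool registry and error strings are mine, not any framework's:

```python
# Sketch of the validation step in a tool-orchestration loop: confirm the
# model's requested tool exists and its arguments match the declared types
# before executing, and turn failures into messages the model can read.
TOOLS = {
    "read_file": {"fn": lambda path: f"<contents of {path}>",
                  "params": {"path": str}},
    "add": {"fn": lambda a, b: a + b, "params": {"a": int, "b": int}},
}

def execute_tool_call(name: str, args: dict) -> str:
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'"  # model hallucinated a function
    for param, expected in tool["params"].items():
        if param not in args or not isinstance(args[param], expected):
            return f"Error: bad argument '{param}' (expected {expected.__name__})"
    try:
        return str(tool["fn"](**args))  # result is fed back into reasoning
    except Exception as exc:            # the tool itself failed at runtime
        return f"Error: tool raised {exc!r}"

print(execute_tool_call("add", {"a": 2, "b": 3}))    # 5
print(execute_tool_call("add", {"a": "2", "b": 3}))  # Error: bad argument 'a' (expected int)
print(execute_tool_call("summarize", {}))            # Error: unknown tool 'summarize'
```

The crucial design choice is that every failure path returns a readable error string rather than raising, so the model can see what went wrong and retry.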

3. The Context Window Crisis

As I explored in Part 7, context windows are finite and precious. Without active management, they fill with noise: old conversations, verbose tool outputs, redundant information. Frameworks offer automated strategies like recursive summarization and intelligent pruning that maintain signal while discarding noise.
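One common automated strategy is to keep the most recent turns verbatim and collapse everything older into a single summary message. A sketch with a stand-in summarizer; a real system would call the model to produce the summary:

```python
# Sketch of context-window management: keep the last N turns verbatim and
# collapse everything older into one summary message. The summarizer is a
# stand-in; a real implementation would ask the model to summarize.
def summarize(turns: list[str]) -> str:
    return f"[summary of {len(turns)} earlier turns]"

def compress_context(turns: list[str], keep_recent: int = 3) -> list[str]:
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(10)]
print(compress_context(history))
# ['[summary of 7 earlier turns]', 'turn 7', 'turn 8', 'turn 9']
```

Applied recursively (summarizing the summaries), this keeps the context roughly constant in size no matter how long the session runs.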

4. The Security Minefield

An agent that can delete files, send emails, or execute code is a loaded weapon. The attack vectors we covered in Part 6 (prompt injection, tool manipulation, data exfiltration) are novel. Frameworks provide architectural patterns for sandboxing, human-in-the-loop approval, and policy enforcement that would take months to build from scratch.

Different Philosophies for Different Problems

The agent framework landscape isn’t just diverse; it’s philosophically fragmented. Each framework embodies a distinct worldview about how agents should be built and managed.

The Enterprise Architects: Google ADK and Microsoft’s Unified Framework

Google’s Agent Development Kit represents the “enterprise-first” philosophy. It treats agents as first-class software artifacts—with proper testing, versioning, and observability. The framework’s hierarchical multi-agent support is invaluable for scaling from a single agent to a team of specialists. The code can be verbose and the learning curve steep, but the production reliability is a key feature.

Microsoft’s newly unified Agent Framework (merging AutoGen’s innovation with Semantic Kernel’s enterprise features) takes a different approach: “conversation as computation.” Instead of explicit orchestration, agents collaborate through structured dialogue. It’s fascinating to watch, almost like a Slack channel where AI team members actually get work done.

The Developer Experience Champions: LangChain’s Evolution

LangChain’s journey mirrors the entire field’s maturation. It started with “chains”, linear sequences of operations that were intuitive but limited. The introduction of LangChain Expression Language (LCEL) formalized this into a powerful pipe syntax: prompt | model | parser.

But then came LangGraph, acknowledging what many developers learned the hard way: agents need cycles, not just chains. This directly relates to the cognitive patterns we discussed in Part 2, where simple linear “Plan-and-Execute” patterns give way to more complex, graph-based reasoning. LangGraph models workflows as stateful graphs where nodes are functions and edges are conditional logic. It’s more complex but infinitely more powerful, and has become a popular choice for developers who need fine-grained control over agent behavior.

The Minimalist’s Choice: OpenAI Agents SDK

OpenAI’s official open-source, lightweight framework takes a refreshingly minimal approach. It’s model-agnostic and provides just enough structure to be helpful without being prescriptive. Perfect for developers who want to build custom, multi-agent logic from the ground up without fighting framework opinions.

The Intuitive Collaborators: CrewAI’s Role-Playing Revolution

CrewAI took a radically different approach: what if we just described agents like job postings? You define a “Senior Research Analyst” with a goal and backstory, a “Technical Writer” with their own expertise, and let them collaborate naturally.

This model has proven to be remarkably effective for content creation pipelines, such as having a Researcher, Writer, and Editor collaborate on a blog post. The framework is designed to handle delegation, task management, and inter-agent communication transparently. In effect, you write what feels like HR documentation and get a functioning multi-agent system.
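CrewAI’s real API is richer, but the “job posting” style of definition can be sketched with a simple dataclass. Everything below, the class, the `work` method, the sequential handoff, is illustrative, not CrewAI’s actual interface:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    role: str
    goal: str
    backstory: str

    def work(self, task: str) -> str:
        # A real agent would call an LLM here; this stub just labels its output
        return f"[{self.role}] {task}"

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find accurate sources on the topic",
    backstory="Ten years of market research experience",
)
writer = Agent(
    role="Technical Writer",
    goal="Turn research into a clear draft",
    backstory="Writes developer documentation for a living",
)

# Sequential handoff: the researcher's output becomes the writer's input
notes = researcher.work("gather notes on agent frameworks")
post = writer.work(f"draft a post from: {notes}")
```

Notice how the definitions really do read like HR documentation; the framework’s job is to turn that description into prompts, delegation, and message passing.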

The Pythonic Pragmatists: Phidata

Phidata embodies “AI assistants as code”—clean, object-oriented Python with minimal magic. It’s a “batteries included” framework where an Assistant can be instantiated, configured with tools and a knowledge base, and deployed with a built-in UI.

For Retrieval-Augmented Generation (RAG) applications, Phidata is particularly well-suited. It is designed to handle vector database complexity, provides pre-built knowledge base classes, and manages the retrieval pipeline transparently.

Visual and Node-Based Builders: Democratizing Development

Platforms like Voiceflow, Botpress, and MindStudio represent a philosophy of visual programming for AI agents. They’re not just no-code—they’re thoughtfully designed for non-programmers to build sophisticated conversational agents, offering visual canvases, drag-and-drop logic, and built-in integrations. While you can hit walls when you need custom logic, many use cases don’t need it.

A powerful middle ground between these conversational builders and pure code exists with node-based automation platforms like n8n. These tools also use a visual, graph-based canvas, but are designed for complex data workflows, integrations, and backend logic, allowing developers to visually stitch together APIs, databases, and AI models in a way that is more robust than no-code and more accessible than a pure framework.

The Trade-offs: What You Give Up for Convenience

While frameworks accelerate development, they aren’t a free lunch, and it’s important to consider what you’re trading away. You are, in effect, trading the complexity of building from scratch for the complexity of learning a framework’s specific abstractions, such as internalizing LangGraph’s graph-based mental model or ADK’s enterprise patterns.

There’s also the matter of framework overhead: these layers of abstraction can introduce performance hits or higher token usage. You might see simple tasks consume two to three times more tokens through a framework than in a hand-rolled implementation, depending on the approach.

You also risk the “leaky abstraction” problem, where you eventually hit a wall that the framework’s design simply doesn’t fit. For instance, a developer might spend a week fighting a framework’s complex delegation logic when a simple round-robin task assignment is all that’s needed. Finally, in this breakneck-speed field, you’re betting on maturity and stability. APIs change, frameworks pivot, and what works today might be deprecated tomorrow—a phenomenon seen in LangChain’s multiple major architectural shifts over just two years.
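For contrast, the round-robin assignment mentioned above really is a few lines of standard-library Python (the worker and task names are placeholders):

```python
from itertools import cycle

workers = ["researcher", "writer", "editor"]
tasks = ["task-1", "task-2", "task-3", "task-4", "task-5"]

# Pair each task with the next worker in rotation; cycle() repeats the
# worker list indefinitely, and zip() stops when the tasks run out.
assignments = dict(zip(tasks, cycle(workers)))
# {'task-1': 'researcher', 'task-2': 'writer', 'task-3': 'editor',
#  'task-4': 'researcher', 'task-5': 'writer'}
```

If this is all your delegation logic needs to do, a framework’s negotiation machinery is pure overhead.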

Choosing Your Framework: A Practical Guide

My decision-making process for choosing a framework has evolved into a practical decision tree.

Start with Architecture: Single Agent or Multi-Agent?

For single, complex agents, your choice depends on your needs. LangGraph is ideal if you need explicit control over reasoning patterns, while the OpenAI Agents SDK is perfect if you want minimal abstractions. If your application is built around RAG, Phidata is a strong contender. For multi-agent systems, the options are philosophically different. CrewAI excels at intuitive, role-based teams, while the Microsoft Agent Framework is built for conversation-driven collaboration. For hierarchical, production-grade systems, Google ADK is the most robust choice.

Consider Your Constraints

Your production timeline is a major factor. If you’re prototyping this week, the rapid setup of CrewAI or LangChain is invaluable. If you’re building for a production launch next quarter, the enterprise-grade architecture of Google ADK or the Microsoft Agent Framework is a safer bet. For timelines somewhere in between, LangGraph or Phidata offer a balance of power and speed.

Team expertise also matters. Deep Python experience is a good fit for any code-first framework, but CrewAI’s natural language approach can be easier for mixed technical teams. For non-technical stakeholders, no-code platforms are the most accessible.

Finally, your choice depends heavily on your ecosystem and specific goals. If you’re building for production scale within Google Cloud, the Google ADK is the clear choice for its built-in Vertex AI integration and enterprise observability. Similarly, if your organization is built on Azure, the Microsoft Agent Framework provides native service integration and AutoGen’s powerful multi-agent patterns. For those needing rapid prototyping, LangChain and LangGraph offer a massive component library. If your goal is intuitive multi-agent collaboration, CrewAI’s role-based design is remarkably effective, while the OpenAI Agents SDK is perfect for those who want minimal, clean abstractions. For visual, no-code workflow design, platforms like Voiceflow and Botpress democratize deployment; for node-based visual automation, platforms like n8n bridge the gap; and for a Python-native, full-stack experience, Phidata’s object-oriented approach is excellent.

My Personal Stack

For personal projects where I want complete control—like my Gemini Scribe agent for Obsidian—I still build from scratch. The entire agent is ~500 lines of TypeScript, perfectly tailored to my workflow. You can see exactly what’s happening at every step, and there’s no framework magic to debug when things go wrong.

But for more complex systems, Google ADK has become my go-to. My recent adh-cli TUI for Gemini is built with ADK, allowing me to spend more time thinking about the unique concepts I want to explore and less time on the boilerplate of agent development.

The choice ultimately depends on your specific context and is a strategic decision that reflects and reinforces your approach to AI development.

Beyond The Scaffolding

Here’s what I’ve come to believe: the choice between frameworks isn’t about features—it’s about philosophy. Each framework embodies assumptions about how agents should think, collaborate, and evolve.

Adopting any framework provides immediate benefits: velocity to skip the boilerplate and focus on your unique logic; reliability from leveraging battle-tested patterns and error handling; community to tap into collective knowledge and shared solutions; and governance by enforcing architectural best practices automatically. The open-source nature of many of these frameworks means that even if you encounter a novel edge case, it’s likely you’re not alone, and a solution may already be in progress within the community.

But the real value is subtler. Frameworks don’t just accelerate development—they shape how you think about agents. LangGraph teaches you to model cognition as state machines. CrewAI makes you consider role-based decomposition. ADK asks you to think about production from day one.

The frameworks are the scaffolding necessary to build the next generation of intelligent applications. They’re transforming agent development from an artisanal craft into a systematic engineering discipline.

When One Agent Isn’t Enough

But what happens when a single agent, even one built on a sophisticated framework, hits its limits? Complex problems often require multiple perspectives, specialized expertise, and collaborative problem-solving.

In Part 9: Building an Agentic Team, we’ll explore the fascinating world of multi-agent systems. We’ll dive into orchestration patterns, examine how agents negotiate and delegate, and uncover the emergent behaviors that arise when AI agents work in teams.

If you thought managing one agent’s context window was challenging, wait until you see five agents trying to agree on a shared goal while maintaining their own specialized knowledge and constraints.

The future isn’t just agentic, it’s collaborative. And it’s already here.