
MCP Isn’t Dead. You Just Aren’t the Target Audience

I was debugging a connection issue between Gemini Scribe and the Google Calendar integration in my Workspace MCP server last month when a friend sent me a link. “Have you seen this? MCP is dead apparently.” It was Eric Holmes’ post, “MCP is dead. Long live the CLI,” which had just hit the top of Hacker News. I read it while waiting for a server restart, which felt appropriate.

His argument is clean and persuasive: CLI tools are simpler, more reliable, and battle-tested. LLMs are trained on millions of man pages and Stack Overflow answers, so they already know how to use gh and kubectl and aws. MCP introduces flaky server processes, opinionated authentication, and an all-or-nothing permissions model. His conclusion is that companies should ship a good API, then a good CLI, and skip MCP entirely.

I agree with about half of that. And the half I agree with is the part that doesn’t matter.

The Shell is a Privilege

Holmes is writing from the perspective of a developer sitting in a terminal. From that vantage point, everything he says is correct. If your agent is Claude Code or Gemini CLI, running in a shell session on your laptop with your credentials loaded, then yes, gh pr view is faster and more capable than any MCP wrapper around the GitHub API. I made exactly this observation in my own post on the Internet of Agents. Simon Willison said as much in his year-end review, noting that for coding agents, “the best possible tool for any situation is Bash.”

But here’s the thing: not every agent has a shell. And not every agent is an interactive coding assistant.

I wrote in Everything Becomes an Agent that the agentic pattern is showing up everywhere: classifiers that need to call tools, data pipelines that need to make decisions, background processes that orchestrate workflows without a human watching. The “MCP is dead” argument treats agents as though they are all developer tools running in a terminal session. That’s one pattern, and it’s the pattern that gets the most attention because developers are writing the blog posts. But the agentic shift is much broader than that.

I’ve been building Gemini Scribe for nearly a year and a half now. It’s an AI agent that lives inside Obsidian, a note-taking application built on Electron. On desktop, Gemini Scribe runs in the renderer process of a sandboxed app. It has no terminal. It has no $PATH. It cannot reliably shell out to gh or kubectl or anything else. Its entire world is the Obsidian plugin API, the vault on disk, and whatever external capabilities I wire up for it. And on mobile, the constraints are even tighter. Obsidian runs on iOS and Android, where there is no shell at all, no subprocess spawning, no local binary execution. The app sandbox on mobile is absolute. If your answer to “how does an agent use tools?” begins with “just call the CLI,” you’ve already lost half your user base.

When I wanted Gemini Scribe to be able to read my Google Calendar, search my email, or pull context from Google Drive, I didn’t have the option of “just use the CLI.” There is no gcal CLI that runs inside a browser runtime. There is no gmail binary I can spawn from an Electron sandbox, let alone from an iPhone. MCP gave me a way to expose those capabilities through a protocol that works over stdio or HTTP, regardless of where my agent happens to be running.

The same is true of my Podcast RAG system. The query agent runs on the server, orchestrating retrieval, re-ranking, and synthesis in a Python process that has no interactive shell session. I could wire up every capability as a bespoke function call, and in some cases I do. But when I want that same retrieval pipeline to be accessible from Gemini CLI on my laptop, from Gemini Scribe in Obsidian, and from the web frontend, MCP gives me one implementation that serves all three. The alternative is writing and maintaining three separate integration layers.

Or consider a less obvious case: a background agent that monitors a codebase for security vulnerabilities and files tickets when it finds them. This agent runs on a schedule, not in response to a human typing a command. It needs to read files from a repository, query a vulnerability database, and create issues in a project tracker. You could give it a shell, but you shouldn’t. An autonomous agent running unattended with shell access is a privilege escalation vector. A crafted comment in a pull request, a malicious string in a dependency manifest, any of these could become a prompt injection that turns bash into an attack surface. Structured tool protocols are the natural interface for this kind of autonomous workflow precisely because they constrain what the agent can do. The agent gets read_file and create_issue, not bash -c. The narrower the interface, the smaller the blast radius.
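
To make the "narrow interface" point concrete, here is a minimal sketch of a constrained tool registry for an unattended agent. The tool names (read_file, create_issue) follow the examples above; the registry itself, the repository mount point, and the issue-tracker stub are all hypothetical, not any particular framework's API.

```python
from pathlib import Path

class ToolRegistry:
    """Expose a fixed allowlist of tools; anything else is refused."""

    def __init__(self):
        self._tools = {}

    def register(self, fn):
        self._tools[fn.__name__] = fn
        return fn

    def call(self, name, **kwargs):
        if name not in self._tools:
            # The agent asked for a tool outside the allowlist (e.g. bash).
            raise PermissionError(f"tool {name!r} is not exposed to this agent")
        return self._tools[name](**kwargs)

registry = ToolRegistry()

@registry.register
def read_file(path: str) -> str:
    # Read-only access, confined to the repository checkout.
    repo_root = Path("/srv/scanner/repo")  # hypothetical mount point
    target = (repo_root / path).resolve()
    if not str(target).startswith(str(repo_root)):
        raise PermissionError("path escapes the repository")
    return target.read_text()

@registry.register
def create_issue(title: str, body: str) -> dict:
    # Stub: a real implementation would POST to the tracker's API.
    return {"title": title, "body": body, "status": "created"}
```

A prompt-injected request for `bash` never reaches a shell; it fails at the registry boundary, where it can also be logged and alerted on.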

The N-by-M Problem Doesn’t Go Away

Holmes frames MCP as solving a problem that doesn’t exist. CLIs already work, so why add a protocol?

But CLIs work for a very specific topology: one human (or one human-like agent) driving one tool at a time through a shell. The moment you step outside that topology, CLIs stop being the answer.

Even if every service had a CLI (and Holmes is right that more should), you still have the consumer problem. A CLI is consumable by exactly one kind of agent: one with shell access. The moment you need that same capability accessible from an Electron plugin, a mobile app, a server-side orchestrator, and a terminal agent, you’re back to writing integration code for each consumer. MCP lets you write the server once and expose it to all of them through a common protocol.

This is the same insight behind LSP, which I wrote about in the context of ACP. Before LSP, every editor had to implement its own Python linter, its own Go formatter, its own TypeScript type-checker. The N-by-M integration problem was a nightmare. LSP didn’t replace the underlying tools. It standardized the interface between the tools and the editors. MCP does the same thing for the interface between capabilities and agents.

Holmes might respond that the N-by-M problem is overstated, that most developers just need one agent talking to a handful of tools. Fair enough for a personal workflow. But the industry isn’t building personal workflows. It’s building platforms where agents need to discover and compose capabilities dynamically, where the set of available tools changes based on the user’s permissions, their organization’s policies, and the context of the current task. That’s the world MCP is designed for.

Authentication is the Feature, Not the Bug

One of Holmes’ sharpest critiques is that MCP is “unnecessarily opinionated about auth.” CLI tools, he notes, use battle-tested flows like gh auth login and AWS SSO that work the same whether a human or an agent is driving.

This is true when the agent is acting as you. But the moment the agent stops acting as you and starts acting on behalf of other people, everything changes.

Imagine you’re building a product where an AI assistant helps your customers manage their calendars. Each customer has their own Google account. You cannot ask each of them to run gcloud auth login in a terminal. You need per-user OAuth tokens, tenant isolation, and an auditable record of every action the agent takes on each user’s behalf. This is not a niche enterprise concern. This is the basic architecture of any multi-tenant agent system.

Or think about something simpler: a shared documentation service protected by OAuth. Your team’s internal knowledge base, your company’s Confluence, your organization’s Google Drive. An agent that needs to search those resources on behalf of a user has to present that user’s credentials, not the developer’s, not a shared service account. This is a solved problem in the web world (every SaaS app does it), but it requires a protocol that understands identity delegation. curl with a hardcoded token doesn’t cut it.

MCP’s authentication specification isn’t trying to replace gh auth login for developers who already have credentials loaded. It’s trying to solve the problem of how an agent running in a hosted environment acquires and manages credentials for users who will never see a terminal. Dismissing this as unnecessary complexity is like dismissing HTTPS because curl works fine over HTTP on your local network.

Where I Actually Agree

I want to be clear that Holmes isn’t wrong about the pain points. MCP server initialization is genuinely flaky. I’ve lost hours to servers that didn’t start, connections that dropped, and state that got corrupted between restarts. The tooling is immature. The debugging experience is terrible. As I wrote in my post on the observability gap, the moment you rely on an agent for something that matters, you realize you’re flying blind. MCP’s opacity makes that worse.

And the context window overhead is real. Benchmarks from ScaleKit show that an MCP agent injecting 43 tool definitions consumed 44,026 tokens before doing any work, while a CLI agent doing the same task needed 1,365. When you’re paying per token, that’s not an abstraction tax you can ignore.

But these are maturity problems, not architecture problems. The early days of LSP were rough too. Language servers crashed, features were spotty, and half the community said “just use the built-in tooling.” The protocol won anyway, because the abstraction was right even when the implementation wasn’t.

The Bridge Pattern

Here’s what I think the mature answer looks like, and it’s neither “use MCP for everything” nor “use CLIs for everything.” It’s building your core capability as a shared library, then exposing it through multiple transports.

Think about how you’d design a tool that queries your internal knowledge base. The business logic (authentication, retrieval, re-ranking) lives in a Python module or a Go package. From that shared core, you generate three thin wrappers. A streaming HTTP MCP server for agents running in web runtimes and hosted environments. A local stdio MCP server for desktop agents like Gemini Scribe or Claude Desktop that communicate over standard input/output. And a CLI binary for developers who want to pipe results through jq or use it from Gemini CLI’s bash tool.

All three share the same code paths. A bug fix in the retrieval logic propagates everywhere. The auth layer adapts to context: the CLI reads your local credentials, the HTTP server handles OAuth tokens, and the stdio server inherits the host process’s permissions. You get the CLI’s simplicity where a shell exists, and MCP’s universality where it doesn’t.
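
A compressed sketch of that shape, with every name (query_knowledge_base, the wrapper functions, the toy corpus) invented for illustration — the point is only that the three transports are thin shims over one core:

```python
import json

# --- shared core (would live in its own package) ---
def query_knowledge_base(query: str, limit: int = 5) -> list[dict]:
    """Retrieval + re-ranking lives here, exactly once."""
    # Placeholder ranking: a real core would call the retrieval pipeline.
    corpus = [{"doc": "mcp-overview", "text": "MCP exposes tools over stdio/HTTP"},
              {"doc": "cli-guide", "text": "The CLI reads local credentials"}]
    words = query.lower().split()
    hits = [d for d in corpus if any(w in d["text"].lower() for w in words)]
    return hits[:limit]

# --- wrapper 1: CLI entry point, pipeable through jq ---
def cli_main(argv: list[str]) -> str:
    return json.dumps(query_knowledge_base(" ".join(argv)))

# --- wrapper 2: stdio MCP tool handler (handler shape is illustrative) ---
def mcp_tool_handler(params: dict) -> dict:
    return {"results": query_knowledge_base(params["query"], params.get("limit", 5))}

# --- wrapper 3: HTTP handler, framework-agnostic ---
def http_handler(request_json: dict) -> dict:
    return {"status": 200, "body": mcp_tool_handler(request_json)}
```

A retrieval bug fixed in `query_knowledge_base` is fixed for the terminal, the desktop plugin, and the hosted agent in one commit.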

This isn’t hypothetical. It’s what I’m already doing. My gemini-utils library is the shared core: it handles file uploads, deep research, audio transcription, and querying against Gemini’s APIs. It exposes all of that as a set of CLI commands (research, transcribe, query, upload) that I use directly from the terminal every day. But when I wanted those same research capabilities available to Gemini CLI as an agent tool, I built gemini-cli-deep-research, an extension that wraps the same underlying library as an MCP service. The core logic is shared. The CLI is for me at a terminal. The MCP server is for agents that need to invoke deep research as a tool in a larger workflow. Same capability, different transports, each suited to its context.

I think this is the pattern that tool developers should be building toward. The best agent tools of the next few years won’t be “MCP servers” or “CLI tools.” They’ll be capability libraries with multiple faces.

The Real Question

The CLI-vs-MCP debate, as Tobias Pfuetze argued, is the wrong fight. The question isn’t “which is better?” It’s “where does each one belong?”

For a developer in a terminal with their own credentials, driving a coding agent? Use the CLI. It’s faster, cheaper, and the agent already knows how. Holmes is right about that.

For an agent embedded in an application runtime without shell access? For a multi-tenant platform where the agent acts on behalf of users who will never open a terminal? For a system where you need one capability implementation discoverable by multiple heterogeneous agent hosts? That’s where MCP earns its complexity.

And for the tool developer who wants to serve all of these audiences? Build the core once, expose it three ways: CLI, stdio MCP, and streaming HTTP MCP. Let the runtime decide.

The mistake is assuming that because your agent has a shell, every agent has a shell. The terminal is one runtime among many. And as agents move from developer tools into products that serve non-technical users, the fraction of agents that can rely on a $PATH and a .bashrc is going to shrink rapidly.

MCP isn’t dead. It’s just not for you yet. But it might be soon.


The Observability Gap

I was debugging an agent a few weeks ago when I hit a problem that made me realize something fundamental about the shift we’re undergoing. The script had run, consumed a hundred thousand tokens, and returned an answer. But the answer was wrong. Not catastrophically wrong, just subtly, dangerously off.

The issue wasn’t that the model was bad. The problem was that I had no idea what the agent had thought while producing that answer. Which tools had it called? What information had it retrieved? What reasoning path had it wandered down? I had the input and the output, but the middle, the actual decision-making process, was a black box.

This mirrors the challenge I described in Everything Becomes an Agent. If our future architecture is a mesh of interacting agents, we cannot afford for them to be inscrutable. A single black box is a mystery; a system of black boxes is chaos.

This is the Observability Gap, and it is the first wall you hit when you move from prototype to production. You can build a working agent in an afternoon. You can give it tools, wire up a nice ReAct loop, and watch it dazzle you. But the moment you rely on it for something that matters, you realize you’re flying blind.

How do you know if your agent is working well? And more importantly, how do you fix it when it’s not?

Earlier in this series, I wrote about building guardrails and the Policy Engine that keeps agents from doing dangerous things. Observability is the complement to those guardrails. Guardrails define the boundaries; observability tells you whether the agent is respecting them, struggling against them, or quietly finding ways around them. One without the other is incomplete. A guardrail you can’t monitor is just a hope.

The Chain of Thought Problem

When you’re building traditional software, debugging is an exercise in logic. You set breakpoints, inspect variables, and trace execution. The flow is deterministic: if Input A produces Output B today, it will produce Output B tomorrow.

Agents don’t work that way. The same input can produce wildly different outputs depending on which tools the agent decides to call, how it interprets the results, and what “thought” it generates in that split second. The agent’s logic isn’t written in code; it’s written in natural language, scattered across multiple LLM calls, tool invocations, and iterative refinements.

I learned this the hard way with my Podcast RAG system. I’d ask it a question about a specific episode, and sometimes it would nail it, pulling the exact segment and synthesizing a perfect answer. Other times, it would search with the wrong keywords, get back irrelevant chunks, and confidently synthesize nonsense.

The model wasn’t hallucinating in the traditional sense. It was following a process. But I couldn’t see that process, so I couldn’t fix it.

That experience taught me the most important lesson about production agents: the final answer is the least interesting part. What matters is the chain of thought that produced it, every tool call, every intermediate result, every reasoning trace. Think of it as a flight recorder. When the plane lands at the wrong airport, the only way to understand what went wrong is to replay the entire flight.

Four Layers of Seeing

When I started building that flight recorder, I realized that “log everything” isn’t actually a strategy. You need structure. Through trial and error, and by studying how platforms like Langfuse and Arize Phoenix approach the problem, I’ve come to think of agent observability as having four distinct layers.

The first is the reasoning layer: the agent’s internal monologue where it decomposes your request into sub-tasks. This is where you catch the subtle bugs. When my Podcast RAG agent searched for the wrong keywords, the failure wasn’t in the tool call itself (which returned a perfectly valid HTTP 200). The failure was in the reasoning that chose those keywords. Without visibility into the “Thought” step of the ReAct loop, that kind of error is indistinguishable from an external system failure.

The second is the execution layer: the actual tool calls, their arguments, and the raw results. This is where you catch a different class of bug, one that’s becoming increasingly important. Tool hallucination. Not the model making up facts in prose, but the model calling a tool that doesn’t exist (you provided shell_tool but the model confidently calls bash_tool), fabricating a file path that isn’t real, or passing a string to a parameter that expects an integer. These are operational failures that cascade. I’ve seen an agent confidently pass a hallucinated document ID to a retrieval tool, get back an error, and then re-hallucinate a different invalid ID rather than change strategy. You only catch this if you’re logging the schema validation at the boundary between the model and the tool.
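
That boundary check can be very small. Here is a sketch of schema validation for a model-proposed tool call, with hypothetical tool schemas; real systems would typically derive these from JSON Schema or Pydantic models rather than hand-written dicts:

```python
TOOL_SCHEMAS = {
    "shell_tool": {"cmd": str},
    "retrieve_doc": {"doc_id": int},
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of validation errors; empty list means the call is sound."""
    errors = []
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        # The model invented a tool, e.g. bash_tool when you provided shell_tool.
        errors.append(f"unknown tool {name!r}")
        return errors
    for param, expected in schema.items():
        if param not in args:
            errors.append(f"missing argument {param!r}")
        elif not isinstance(args[param], expected):
            errors.append(f"{param!r} should be {expected.__name__}, "
                          f"got {type(args[param]).__name__}")
    return errors
```

Logging the returned errors alongside the raw model output is what makes a hallucinated tool name distinguishable from a genuine downstream failure.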

The third is the state layer: the contents of the agent’s context window at each decision point. Agents are stateful creatures. Their behavior at step ten is shaped by everything that happened in steps one through nine. And context windows are not infinite. As verbose tool outputs accumulate, relevant information gets pushed further and further from the model’s attention, a phenomenon researchers call “context drift” or the “Lost in the Middle” effect. Snapshotting the context at critical decision points lets you “time travel” during debugging. You can see exactly what the agent could see when it made its bad call.

The fourth is the feedback layer: error codes, user corrections, and signals from any critic or evaluator models. This layer tells you whether the agent is actually learning from its environment within a session, or just ignoring failure signals and looping. In frameworks like Reflexion, this feedback is explicitly wired into the next reasoning step. Watching this layer is how you know if your self-correction mechanisms are actually correcting.

But capturing these four layers independently isn’t enough. You need to bundle them into sessions: discrete, self-contained records of a single task from the moment the user makes a request to the moment the agent delivers (or fails to deliver) its result. A session is your unit of analysis. It’s the difference between having a pile of timestamped log lines and having a story you can read from beginning to end. When something goes wrong, you don’t want to grep through millions of events hoping to reconstruct what happened. You want to pull up session #47832 and replay the agent’s entire decision-making journey: what it thought, what it tried, what it saw, and how it responded to each result along the way.

This session-level thinking changes how you build your infrastructure. Every trace, every tool call, every context snapshot gets tagged with a session ID. Your dashboards stop showing you aggregate metrics and start showing you individual narratives. You can sort sessions by outcome (success, failure, abandonment), by cost (token consumption), or by duration, and immediately drill into the ones that matter. It’s the observability equivalent of going from reading a box score to watching the game film.
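
The plumbing for this is modest. A minimal sketch of a session recorder that tags every event across the four layers with one session ID (the class and its layer names are illustrative, not any vendor's API):

```python
import json
import time
import uuid

class SessionRecorder:
    """Bundle every event of a single task under one session ID."""

    def __init__(self):
        self.session_id = str(uuid.uuid4())
        self.events = []

    def log(self, layer: str, payload: dict):
        # layer is one of: reasoning, execution, state, feedback
        self.events.append({"session_id": self.session_id,
                            "ts": time.time(),
                            "layer": layer,
                            **payload})

    def replay(self) -> str:
        """Render the session as a readable narrative, oldest first."""
        lines = []
        for e in self.events:
            body = {k: v for k, v in e.items()
                    if k not in ("session_id", "ts", "layer")}
            lines.append(f"[{e['layer'].upper()}] {json.dumps(body)}")
        return "\n".join(lines)
```

In production you would ship each event to your trace store instead of a list, but the invariant is the same: no event exists without a session ID.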

Making It Concrete

Here’s what this looks like in practice. Suppose you ask your agent to “check my calendar and suggest a time for a meeting.”

Without observability, you see:

Input: "Check my calendar and suggest a time for a meeting"
Output: "How about Thursday at 2pm?"

With observability across all four layers, you see the mind at work:

[REASONING] User wants to schedule a meeting. I need to:
1. Check their calendar for availability
2. Consider team availability
3. Suggest an optimal time
[TOOL CALL] get_calendar(user_id="allen", days=7)
[TOOL RESULT] Returns 45 events over next 7 days
[STATE] Context window: 2,847 tokens used
[REASONING] Analyzing free slots. User has:
- Monday 2pm-4pm free
- Thursday 2pm-4pm free
- Friday all day booked
[TOOL CALL] get_team_availability()
[TOOL RESULT] Team members mostly available Thursday afternoon
[REASONING] Thursday 2pm works for both user and team.
[FEEDBACK] No errors. Response generated.
[RESPONSE] "How about Thursday at 2pm?"

Suddenly, the black box is transparent. If the suggestion is wrong, you can see exactly why. Maybe the calendar tool returned incomplete data. Maybe the team availability check failed silently. Maybe the agent’s definition of “optimal” means “soonest” rather than “best for focus time.”

This kind of visibility saved me countless hours when building Gemini Scribe. Users would report that the agent “didn’t understand” their request, which is about as useful as telling your mechanic “the car sounds funny.” But when I turned on debug logging and pulled up the console output, I could see exactly where the confusion happened, usually in how the agent interpreted the file context or which notes it decided were relevant. The fix was never a mystery once I could see the reasoning. All of this logging goes to the developer console and is off by default, which is an important distinction. You want observability for yourself as the builder, not surveillance of your users.

The Standards Are Coming

For my own production agents, I’ve settled on a layered approach. Structured logging captures every action in machine-parseable JSON. A unique trace ID stitches together every LLM call and tool invocation into a single narrative flow.

But we are also seeing the industry mature beyond “roll your own.” The critical development here is the adoption of the OpenTelemetry (OTel) standard for GenAI. The OTel community has published semantic conventions that define a standard schema for agent traces: things like gen_ai.system (which provider), gen_ai.request.model (which exact model version), gen_ai.tool.name (which tool was called), and gen_ai.usage.input_tokens (how many tokens were consumed at each step).

This matters because it means an agent built with LangChain in Python and an agent built with Semantic Kernel in C# can produce traces that look structurally identical. You can pipe both into the same Datadog or Langfuse dashboard and analyze them side by side. You aren’t locked into a proprietary debugging tool; you can stream your agent’s thoughts into the same infrastructure you use for the rest of your stack.
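
To show the shape of such a trace, here is a sketch of a span record carrying the gen_ai.* attributes named above. I'm emitting it as a plain dict so the schema is easy to see; in practice you would set these via `span.set_attribute` on an OpenTelemetry SDK span, and the attribute values here (provider name, model version) are illustrative:

```python
def genai_tool_span(model: str, tool: str,
                    input_tokens: int, output_tokens: int) -> dict:
    """Build a span record following the OTel GenAI semantic conventions."""
    return {
        "name": "execute_tool",
        "attributes": {
            "gen_ai.system": "gemini",                 # which provider
            "gen_ai.request.model": model,             # exact model version
            "gen_ai.tool.name": tool,                  # which tool was called
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }
```

Because the keys are standardized, a Datadog or Langfuse dashboard can aggregate these spans without knowing which framework produced them.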

It also enables what I think of as “boundary tracing,” where you instrument the stable interfaces (the HTTP calls, the tool invocations) rather than hacking into the agent’s internal logic. You get visibility without coupling your observability to a specific framework. That’s important, because if there’s one thing I’ve learned building in this space, it’s that frameworks change fast.

If you’re wondering where to start, here’s my honest advice: don’t wait for the perfect stack. Start with structured JSON logs and a session ID that ties each task together end-to-end. That alone gives you something you can grep, filter, and replay. Once you outgrow that (and you will, faster than you expect), graduate to an OTel-based pipeline. The good news is that many agent frameworks are adding robust hook mechanisms that let you tap into the agent lifecycle (before and after tool calls, on reasoning steps, on errors) without modifying your core logic. These hooks make it straightforward to plug in your telemetry from the start. The key is to instrument early, even if you’re only logging to a local file. Retrofitting observability into an agent that’s already in production is significantly harder than building it in from the beginning.

The Price of Transparency

Here’s the tension no one wants to talk about: full observability is expensive.

Autonomous agents are verbose by nature. A single reasoning step might generate hundreds of tokens of internal monologue. A RAG retrieval might pull megabytes of document context. If you log the full payload for every transaction, your storage costs can rival the cost of the LLM inference itself. I’ve seen reports of evaluation runs consuming over 100 million tokens, with more than 60% of the cost attributed to hidden reasoning tokens.

In production, you need sampling strategies. The approach I’ve landed on borrows from traditional distributed systems. Keep 100% of traces that result in errors or negative user feedback, because every failure is a learning opportunity. Keep traces that exceed your latency threshold (P95 or P99), because slow agents are often stuck agents. And for everything else, a small random sample (1-5%) is enough to establish your baseline and spot trends.
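
That policy fits in one function. A sketch, with the thresholds and sample rate as illustrative placeholders you would tune to your own traffic:

```python
import random

P99_LATENCY_MS = 8000        # illustrative latency threshold
BASELINE_SAMPLE_RATE = 0.02  # 2% random sample of healthy traffic

def should_keep_trace(status, latency_ms, user_feedback=None, rng=random.random):
    """Decide whether a finished session's full trace is retained."""
    # Keep every failure: each one is a learning opportunity.
    if status == "error" or user_feedback == "negative":
        return True
    # Keep slow sessions: slow agents are often stuck agents.
    if latency_ms > P99_LATENCY_MS:
        return True
    # Otherwise, a small random sample establishes the baseline.
    return rng() < BASELINE_SAMPLE_RATE
```

The `rng` parameter is there so the decision is testable; in production you'd leave the default.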

For storage, I use a tiered approach. Recent and failed traces go into a fast database for immediate querying. Older successful traces get compressed and moved to cold storage, where they can be pulled back if needed for deeper analysis. It’s not glamorous, but it keeps costs manageable without sacrificing the ability to debug the things that matter. In my own setup, this sampling and tiering strategy keeps observability overhead to roughly 15-20% of my inference spend. Without it, I was on track to spend more on storing agent thoughts than on generating them.

Evaluation Beyond Unit Tests

Logging tells you what happened. Evaluation tells you if it was any good.

This is where agents diverge sharply from traditional software. You can’t write a unit test that asserts function(x) == y. The whole point of an agent is to make decisions, and decisions must be evaluated on quality, not just syntax.

As Gemini Scribe grew more capable, I had to develop a new kind of test suite. I track Task Success Rate (did the agent accomplish what the user asked?), Tool Use Accuracy (did it read the right files and use the right tools for the job?), and Efficiency (did it burn 50 steps to do a 2-step task?).

But here’s the number that keeps me up at night. Because agents are non-deterministic, a single run is statistically meaningless. You have to run the same evaluation multiple times and look at distributions. Researchers distinguish between Pass@k (the probability that at least one of k attempts succeeds) and Pass^k (the probability that all k attempts succeed). Pass@k measures potential. Pass^k measures reliability.

The math is sobering. If your agent has a 70% success rate on a single attempt, its Pass^3 (succeeding three times in a row) drops to about 34%. Scale that to a real workflow where the agent needs to perform ten sequential steps correctly, and even a 95% per-step success rate gives you only about a 60% chance of completing the full task. This is the compounding probability of failure, and it’s why “works most of the time” isn’t good enough for production.

This kind of evaluation framework pays for itself the moment a new model drops. When Google released Flash 2.0, I was excited about the cost savings, but would it perform as well as Pro? I ran my eval suite on the same tasks with both models, and the results were more nuanced than I expected. For simple tasks like reformatting text or fixing grammar, Flash was just as good. For complex multi-step reasoning, particularly in my Podcast RAG system, Pro was noticeably better. The eval suite gave me the data to keep Pro where it mattered.

Then Flash 3 came out, and the eval suite surprised me in the other direction. I ran the same benchmarks expecting similar trade-offs, but Flash 3 handled the Podcast RAG tasks so well that I moved the entire system off of 2.5 Pro. Without evals, I might have assumed the old trade-off still held and kept paying for a model I no longer needed. The point isn’t that one model is always better. The point is that you can’t know without measuring, and the landscape shifts under your feet with every release.

The real breakthrough in my own workflow came when I started using an agent to evaluate itself. I built a separate “Evaluation Agent” that reviews the logs of the “Worker Agent.” It scores performance based on a rubric I defined: did it confirm the action before executing? Was the response grounded in retrieved context? Was the tone appropriate?

This LLM-as-a-Judge pattern is powerful, but it comes with caveats. Research shows these evaluator models have their own biases, particularly a tendency to prefer longer answers regardless of quality and a bias toward their own outputs. To calibrate mine, I built a small “golden dataset” of traces that I graded by hand, then tuned the evaluator’s prompt until its scores matched mine. It’s not perfect, but it spots patterns I miss, like a tendency to over-rely on search when a simple calculation would do.

When Things Go Wrong

The research into agentic failure modes has identified three patterns that I see constantly in my own work.

The first is looping. The agent searches for “pricing,” gets no results, then searches for “pricing” again with exactly the same parameters. It’s stuck in a local optimum of reasoning, unable to update its strategy based on the observation that it failed. The simplest fix is a state hash: you hash the (Thought, Action, Observation) tuple at each step and check it against a sliding window of recent steps. If you see a repeat, you force the agent to try something different. For “soft” loops where the agent slightly rephrases but semantically repeats itself, embedding similarity between consecutive reasoning steps catches the pattern. And above all, production agents need circuit breakers: hard limits on steps, tool calls, or tokens per session. When the breaker trips, the agent escalates to a human rather than continuing to burn resources.
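
The exact-repeat detector plus circuit breaker can be sketched in a few lines (the class, window size, and step limit are illustrative; the embedding-similarity check for soft loops is omitted here):

```python
import hashlib
from collections import deque

class LoopGuard:
    """Detect exact repeats of (thought, action, observation); cap total steps."""

    def __init__(self, window: int = 10, max_steps: int = 50):
        self.recent = deque(maxlen=window)  # sliding window of step hashes
        self.max_steps = max_steps
        self.steps = 0

    def check(self, thought: str, action: str, observation: str) -> str:
        self.steps += 1
        if self.steps > self.max_steps:
            return "escalate"            # circuit breaker: hand off to a human
        h = hashlib.sha256(f"{thought}|{action}|{observation}".encode()).hexdigest()
        if h in self.recent:
            return "force_new_strategy"  # exact repeat within the window
        self.recent.append(h)
        return "continue"
```

The verdict string would feed back into the agent loop: "force_new_strategy" injects an instruction to change approach, "escalate" stops the run.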

The second is tool hallucination. I mentioned this earlier, but it deserves its own spotlight. The most robust defense is constrained decoding, where libraries like Outlines or Instructor use the tool’s JSON schema to build a finite state machine that masks out invalid tokens during generation. If the schema expects an integer, the system sets the probability of all non-digit tokens to zero. It mathematically guarantees that the agent’s tool call will be valid. This moves validation from “check after the fact” to “ensure during generation,” which is a fundamentally better position. A practical note: full constrained decoding (the FSM approach) requires control over the inference engine, so it works with locally-hosted models or providers that expose logit-level access. If you’re calling a hosted API like Gemini or OpenAI, Instructor-style libraries can still enforce schema validation by wrapping the response in a Pydantic model and retrying on parse failure. It’s not as elegant as preventing bad tokens from ever being generated, but it catches the same class of errors.

The third is silent abandonment. The agent hits an ambiguity or a tool failure, and instead of trying an alternative, it politely apologizes and gives up. “I’m sorry, I couldn’t find that information.” This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response to uncertainty. The Reflexion pattern combats this by forcing the agent to generate a self-critique when it fails (“I searched with the wrong term”) and storing that critique in a short-term memory buffer. The next reasoning step is conditioned on this reflection, pushing the agent to generate a new plan rather than surrender. Research shows this kind of “verbal reinforcement” can improve success rates on complex tasks from 80% to over 90%.
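
The memory half of that pattern is simple to sketch. Here the critique is passed in as a string; in the Reflexion paper it would itself be generated by an LLM call, and the class name and prompt format below are my own illustration:

```python
class ReflexionBuffer:
    """Short-term memory of self-critiques, prepended to the next attempt."""

    def __init__(self, max_reflections: int = 3):
        self.reflections = []
        self.max_reflections = max_reflections

    def add(self, critique: str):
        self.reflections.append(critique)
        # Keep only the most recent few, so the buffer doesn't bloat context.
        self.reflections = self.reflections[-self.max_reflections:]

    def augment_prompt(self, task: str) -> str:
        if not self.reflections:
            return task
        notes = "\n".join(f"- {r}" for r in self.reflections)
        return f"{task}\n\nLessons from earlier failed attempts:\n{notes}"
```

Conditioning the next attempt on "I searched with the wrong term" is what turns an apology into a revised plan.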

The Self-Improving System

Moving from prototype to production isn’t about adding features; it’s about shifting your mindset. A prototype proves that something can work. A production system proves that it works reliably, measurably, and transparently. But the real unlock comes when you realize that production isn’t the end of the development lifecycle. It’s the beginning of something more powerful.

Remember those sessions I mentioned, the bundled records of every task your agent attempts? Once you have a critical mass of them, you’re sitting on a goldmine. And this is where I think the story gets really interesting: you can point a different AI system at your session archive and ask it to find the patterns you’re missing.

I’ve started doing this with my own agents. The workflow is straightforward: I have a script that runs weekly, pulls the last seven days of sessions from my trace store, filters for failures and anything above P90 latency, and exports them as structured JSON. I then feed that batch to a separate, more capable evaluator model. Not the lightweight rubric-scorer I use for real-time evaluation, but a model with a broader mandate and a carefully written prompt: look across these sessions and tell me what you see. Where is the agent consistently struggling? Which tool calls tend to precede failures? Are there categories of user requests that reliably lead to abandonment or looping? I ask it to return its findings as a ranked list of patterns with supporting session IDs, so I can verify each observation myself.
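The filtering step is plain data plumbing. Here's a sketch of the selection logic, with session field names of my own invention:

```python
import json
from statistics import quantiles

def select_for_review(sessions: list[dict]) -> str:
    """Keep failures and anything above P90 latency; export JSON for the evaluator."""
    latencies = [s["latency_ms"] for s in sessions]
    # quantiles(n=10) returns nine cut points; the last one is P90
    p90 = quantiles(latencies, n=10)[-1]
    flagged = [s for s in sessions
               if s["outcome"] == "failure" or s["latency_ms"] > p90]
    return json.dumps(flagged, indent=2)
```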

The results have been genuinely surprising. The evaluator flagged a cluster of sessions where users were asking questions about the corpus itself, things like “how many of these podcasts are about guitars?” or “which shows cover AI the most?” The agent would gamely try to answer by searching transcripts, but it was never going to get there because I hadn’t indexed podcast descriptions. Each individual session just looked like a search that came up short. It was only in aggregate that the pattern became clear: users wanted to explore the collection, not just search within it. That finding led me to index descriptions as a new data source, and a whole category of previously failing queries started working.

This is what the industry calls the Data Flywheel: production data feeding back into development, continuously tightening the loop between user intent and agent capability. Your prompt logs become your reality check, revealing how users actually talk to your system versus how you imagined they would. When you cluster those real-world prompts (something as straightforward as embedding them and running HDBSCAN), you start finding these gaps systematically. That’s your roadmap for what to build next.
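HDBSCAN needs an external library, but the intuition fits in a toy: greedy clustering by cosine similarity over precomputed embedding vectors. This is a stand-in for illustration, not a substitute for a real density-based clusterer:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def greedy_cluster(vectors: list[list[float]], threshold: float = 0.9) -> list[int]:
    """Assign each vector to the first cluster whose seed it resembles.
    Near-duplicate prompts land in the same cluster; novel ones start a new one."""
    seeds, labels = [], []
    for v in vectors:
        for i, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(i)
                break
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels
```

Each resulting cluster is a candidate category of user intent; the big ones your agent fails on are the roadmap.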

And the flywheel compounds. Better observability produces richer sessions. Richer sessions give the evaluator more to work with. Better evaluations lead to targeted improvements. Targeted improvements produce better outcomes, which produce more informative sessions. Each rotation makes the system a little smarter, a little more aligned with what users actually need.

To be clear: this isn’t the agent autonomously rewriting itself. I’m the one who reads the evaluator’s findings, verifies them against the session data, and decides what to change. Maybe I update a system prompt, add a new tool, or adjust a circuit breaker threshold. The AI surfaces the patterns; the human decides what to do about them. It’s the same human-on-the-loop philosophy I described in the last post, applied to the development cycle itself.

Together, these layers transform a clever demo into a system you can trust. Because in the age of agents, trust isn’t built on magic. It’s built on the ability to see the trick.

Throughout this series, we’ve been building up the theory: what agents are, how they think, what tools they need, how to keep them safe, and now how to make sure they’re actually working. In the next installment, I want to move from theory to practice. We’ll look at agents in the wild, real-world case studies in customer support, software development, and personal productivity, and what they tell us about how this technology is actually changing the way we work.

A conceptual illustration showing sound waves passing through a prism and refracting into a 3D scatter plot of colored clusters, representing different speaker identities in vector space.

The Fingerprint of Sound

Last year, I spent a lot of time obsessed with the concept of embeddings. I wrote about how they act as a bridge, transforming the messy, unstructured world of human language into a clean, numerical landscape that computers can understand. In my series on the topic, I explored how text embeddings allow us to map concepts in space—how they let us mathematically prove that “king” is close to “queen,” or find a podcast episode about “economic growth” even if the specific keywords never appear in the transcript.

For me, grasping text embeddings was a watershed moment. It turned AI from a black box into a geometry problem I could solve. But recently, my friend Pete Warden published a post that clicked another piece of the puzzle into place for me, moving that geometry from the page to the ear.

In his post, Speech Embeddings for Engineers, Pete tackles the problem of diarization—the technical term for figuring out “who spoke when” in an audio recording. If you’ve followed my podcast archive project, you know this has been a thorn in my side. I have thousands of transcripts, but they are largely monolithic blocks of text. I know what was said, but often I lose the context of who said it.

Pete’s explanation is brilliant because it leverages the exact same intuition we developed for text. Just as a text embedding captures the semantic “fingerprint” of a sentence, a speech embedding captures the vocal fingerprint of a speaker.

The mental shift is fascinating. When we embed text, we are mapping meaning. We want the vector for “dog” to be close to “puppy” and far from “motorcycle.” But when we embed speech for diarization, we don’t care about the meaning of the words at all. A speaker could be whispering a love sonnet or screaming a grocery list; semantically, those are worlds apart. But acoustically—in terms of timbre, pitch, and cadence—they share an undeniable identity.

Pete includes a Colab notebook that demonstrates this beautifully. It’s a joy to run through because it demystifies the process entirely. He walks you through taking short clips of audio, running them through a model, and visualizing the output.

Suddenly, you aren’t looking at waveforms anymore. You’re looking at clusters. You can see, visually, where one voice ends and another begins. It turns the murky problem of distinguishing speakers in a crowded room into a clean clustering algorithm, something any engineer can wrap their head around.
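You can get a feel for why the clustering works with a toy example: made-up low-dimensional "voice vectors" where two clips from the same speaker score a much higher cosine similarity than clips from different speakers. Real speech embeddings have hundreds of dimensions, but the geometry is the same:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Made-up vectors: two clips of speaker A, one clip of speaker B
clip_a1 = [0.9, 0.1, 0.2]
clip_a2 = [0.85, 0.15, 0.25]
clip_b = [0.1, 0.9, 0.3]

same_speaker = cosine(clip_a1, clip_a2)  # high: same vocal fingerprint
diff_speaker = cosine(clip_a1, clip_b)   # low: different voice entirely
# That gap between "same" and "different" is all a clustering step needs
```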

This reinforces a recurring theme for me: the power of small, composable tools. We often look for massive, end-to-end APIs to solve our problems—a “magic box” that takes audio and returns a perfect script. But understanding the primitives is where the real power lies. By understanding speech embeddings, we aren’t just consumers of a transcription service; we are architects who can build systems that listen, identify, and understand the nuance of conversation.

If you’ve ever wrestled with audio data, or if you just want to see how the concept of embeddings extends beyond text, I highly recommend finding a quiet hour to work through Pete’s notebook. It might just change how you hear the data.

Great Video on Gemini Scribe and Obsidian

I was recently looking through the feedback in the Gemini Scribe repository when I noticed a few insightful comments from a user named Paul O’Malley. Curiosity got the better of me (I love seeing who is actually pushing the boundaries of the tools I build), so I took a look at his YouTube page. I quickly found myself deep into a walkthrough titled “I Built a Second Brain That Organises Itself.”

What caught my eye wasn’t just another productivity system; we’ve all seen the “shiny new app” cycle that leads to digital bankruptcy. It was seeing Gemini Scribe used as the engine for a fully automated Obsidian vault.

The Friction of Digital Maintenance

Paul hits on a fundamental truth: most systems fail because the friction of maintenance—the tagging, the filing, the constant admin—eventually outweighs the benefit. He argues that what we actually need is a system that “bridges the gap in our own executive function”.

In his setup, he uses Obsidian as the chassis because it relies on Markdown. I’ve long believed that Markdown is the native language of AI, and seeing it used here to create a “seamless bridge” between messy human thoughts and structured AI processing was incredibly satisfying.

Gemini Scribe as the Engine

It was a bit surreal to watch Paul walk through the installation of Gemini Scribe as the core engine for this self-organizing brain. He highlights a few features that I poured a lot of heart into:

  • Session History as Knowledge: By saving AI interactions as Markdown files, they become a searchable part of your knowledge base. You can actually ask the AI to reflect on past conversations to find patterns in your own thinking.
  • The Setup Wizard: He uses a “Setup Wizard” to convert the AI from a generic chatbot into a specialized system administrator. Through a conversational interview, the agent learns your profession and hobbies to tailor a project taxonomy (like the PARA method) specifically to you.
  • Agentic Automation: The video demonstrates the “Inbox Processor,” where the AI reads a raw note, gives it a proper title, applies tags, and physically moves it to the right folder.

Beyond the Tool: A Human in the Loop

One thing Paul emphasized that really resonated with my own philosophy of Guiding the Agent’s Behavior is the “Human in the Loop”. When the agent suggests a change or creates a new command, it writes to a staging file first.

As Paul puts it, you are the boss and the AI is the junior employee—it can draft the contract, but you have to sign it before it becomes official. You always remain in control of the files that run your life.

Small Tools, Big Ideas

Seeing the Gemini CLI mentioned as a “cleaner and slightly more powerful” alternative for power users was another nice nod. It reinforces the idea that small, sharp tools can be composed into something transformative.

Building tools in a vacuum is one thing, but seeing them live in the wild, helping someone clear their “mental RAM” and close their loop at the end of the day, is one of the reasons I do this. It’s a reminder that the best technology doesn’t try to replace us; it just makes the foundations a little sturdier.

A photorealistic image shows an old wooden-handled hammer on a cluttered workbench transforming into a small, multi-armed mechanical robot with glowing blue eyes, holding various miniature tools.

Everything Becomes an Agent

I’ve noticed a pattern in my coding life. It starts innocently enough. I sit down to write a simple Python script, maybe something to tidy up my Obsidian vault or a quick CLI tool to query an API. “Keep it simple,” I tell myself. “Just input, processing, output.”

But then, the inevitable thought creeps in: It would be cool if the model could decide which file to read based on the user’s question.

Two hours later, I’m not writing a script anymore. I’m writing a while loop. I’m defining a tools array. I’m parsing JSON outputs and handing them back to the model. I’m building memory context windows.

I’m building an agent. Again.

(For those keeping track: my working definition of an “agent” is simple: a model running in a loop with access to tools. I explored this in depth in my Agentic Shift series, but that’s the core of it.)
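That definition fits in about a dozen lines. A sketch, with `call_model`, the decision format, and the tool names all invented for illustration:

```python
def run_agent(call_model, tools: dict, goal: str, max_steps: int = 10) -> str:
    """A model running in a loop with access to tools: that's the whole trick."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = call_model("\n".join(history))  # model sees the transcript so far
        if decision["type"] == "final":
            return decision["text"]
        # Otherwise it's a tool call: execute it and feed the result back
        result = tools[decision["tool"]](**decision["args"])
        history.append(f"Tool {decision['tool']} returned: {result}")
    raise RuntimeError("Step limit reached")
```

Everything else, from memory to guardrails, is elaboration on this loop.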

As I sit here writing this in January of 2026, I realize that almost every AI project I worked on last year ultimately became an agent. It feels like a law of nature: Every AI project, given enough time, converges on becoming an agent. In this post, I want to share some of what I’ve learned, and the cases where you might skip the intermediate steps and jump straight to building an agent.

The Gravitational Pull of Autonomy

This isn’t just feature creep. It’s a fundamental shift in how we interact with software. We are moving past the era of “smart typewriters” and into the era of “digital interns.”

Take Gemini Scribe, my plugin for Obsidian. When I started, it was a glorified chat window. You typed a prompt, it gave you text. Simple. But as I used it, the friction became obvious. If I wanted Scribe to use another note as context for a task, I had to take a specific action, usually creating a link to that note from the one I was working on, to make sure it was considered. I was managing the model’s context manually.

I was the “glue” code. I was the context manager.

The moment I gave Scribe access to the read_file tool, the dynamic changed. Suddenly, I wasn’t micromanaging context; I was giving instructions. “Read the last three meeting notes and draft a summary.” That’s not a chat interaction; that’s a delegation. And to support delegation, the software had to become an agent, capable of planning, executing, and iterating.

From Scripts to Sudoers

The Gemini CLI followed a similar arc. There were many of us on the team experimenting with Gemini on the command line. I was working on iterative refinement, where the model would ask clarifying questions to create deeper artifacts. Others were building the first agentic loops, giving the model the ability to run shell commands.

Once we saw how much the model could do with even basic tools, we were hooked. Suddenly, it wasn’t just talking about code; it was writing and executing it. It could run tests, see the failure, edit the file, and run the tests again. It was eye-opening how much we could get done as a small team.

But with great power comes great anxiety. As I explored in my Agentic Shift post on building guardrails and later in my post about the Policy Engine, I found myself staring at a blinking cursor, terrified that my helpful assistant might accidentally rm -rf my project.

This is the hallmark of the agentic shift: you stop worrying about syntax errors and start worrying about judgment errors. We had to build a “sudoers” file for our AI, a permission system that distinguishes between “read-only exploration” and “destructive action.” You don’t build policy engines for scripts; you build them for agents.

The Classifier That Wanted to Be an Agent

Last year, I learned to recognize a specific code smell: the AI classifier.

In my Podcast RAG project, I wanted users to search across both podcast descriptions and episode transcripts. Different databases, different queries. So I did what felt natural: I built a small classifier using Gemini Flash Lite. It would analyze the user’s question and decide: “Is this a description search or a transcript search?” Then it would call the appropriate function.

It worked. But something nagged at me. I had written a classifier to make a decision that a model is already good at making. Worse, the classifier was brittle. What if the user wanted both? What if their intent was ambiguous? I was encoding my assumptions about user behavior into branching logic, and those assumptions were going to be wrong eventually.

The fix was almost embarrassingly simple. I deleted the classifier and gave the agent two tools: search_descriptions and search_episodes. Now, when a user asks a question, the agent decides which tool (or tools) to use. It can search descriptions first, realize it needs more detail, and then dive into transcripts. It can do both in parallel. It makes the call in context, not based on my pre-programmed heuristics. (You can try it yourself at podcasts.hutchison.org.)

I saw the same pattern in Gemini Scribe. Early versions had elaborate logic for context harvesting, code that tried to predict which notes the user would need based on their current document and conversation history. I was building a decision tree for context, and it was getting unwieldy.

When I moved Scribe to a proper agentic architecture, most of that logic evaporated. The agent didn’t need me to pre-fetch context; it could use a read_file tool to grab what it needed, when it needed it. The complex anticipation logic was replaced by simple, reactive tool calls. The application got simpler and more capable at the same time.

Here’s the heuristic I’ve landed on: If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent. Deconstruct those branches into tools, give the agent really good descriptions of what those tools can do, and then let the model choose its own adventure.
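Here's the before-and-after in miniature, with invented function names standing in for the real implementations. The branching logic disappears into tool declarations whose descriptions do the work:

```python
# Before: my assumptions baked into branching logic
def answer_with_classifier(question, classify, search_descriptions, search_episodes):
    kind = classify(question)  # brittle: what if the user wants both?
    if kind == "description":
        return search_descriptions(question)
    return search_episodes(question)

# After: just declare the tools; the agent picks one, the other, or both in context
TOOLS = [
    {
        "name": "search_descriptions",
        "description": "Search podcast show descriptions. Best for questions about "
                       "what a show covers, its topics, or the collection overall.",
    },
    {
        "name": "search_episodes",
        "description": "Search episode transcripts. Best for questions about what "
                       "was actually said in a specific conversation.",
    },
]
```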

You might be thinking: “What about routing queries to different models? Surely a classifier makes sense there.” I’m not so sure anymore. Even model routing starts to look like an orchestration problem, and a lightweight orchestrator with tools for accessing different models gives you the same flexibility without the brittleness. The question isn’t whether an agent can make the decision better than your code. It’s whether the agent, with access to the actual data in the moment, can make a decision at least as good as what you’re trying to predict when you’re writing the code. The agent has context you don’t have at development time.

The “Human-on-the-Loop”

We are transitioning from Human-in-the-Loop (where we manually approve every step) to Human-on-the-Loop (where we set the goals and guardrails, but let the system drive).

This shift is driven by a simple desire: we want partners, not just tools. As I wrote back in April about waiting for a true AI coding partner, a tool requires your constant attention. A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.

This freedom comes with a new responsibility: clarity. If your agent is going to work overnight, you need to make sure it’s working on something productive. You need to be precise about the goal, explicit about the boundaries, and thoughtful about what happens when things go wrong. Without the right guardrails, an agent can get stuck waiting for your input, and you’ll lose that time. Or worse, it can get sidetracked and spend hours on something that wasn’t what you intended.

The goal isn’t to remove the human entirely. It’s to move us from the execution layer to the supervision layer. We set the destination and the boundaries; the agent figures out the route. But we have to set those boundaries well.

Embracing the Complexity (Or Lack Thereof)

Here’s the counterintuitive thing: building an agent isn’t always harder than building a script. Yes, you have to think about loops, tool definitions, and context window management. But as my classifier example showed, an agentic architecture can actually delete complexity. All that brittle branching logic, all those edge cases I was trying to anticipate: gone. Replaced by a model that can reason about what it needs in the moment.

The real complexity isn’t in the code; it’s in the trust. You have to get comfortable with a system that makes decisions you didn’t explicitly program. That’s a different kind of engineering challenge, less about syntax, more about guardrails and judgment.

But the payoff is a system that grows with you. A script does exactly what you wrote it to do, forever. An agent does what you ask it to do, and sometimes finds better ways to do it than you’d considered.

So, if you find yourself staring at your “simple script” and wondering if you should give it a tools definition… just give in. You’re building an agent. It’s inevitable. You might as well enjoy the company.

A central, glowing blue polyhedral node suspended in a dark void, connected to several smaller satellite nodes by taut, luminous blue data filaments and orbital arcs, illustrating a network of interconnected AI agents.

When Agents Talk to Each Other

Welcome back to The Agentic Shift. Over the past eight installments, we’ve built our agent from the ground up, giving it a brain to think, memory to learn, a toolkit to act, instructions to follow, guardrails for safety, and a framework to build on. But there’s been an elephant in the room this whole time: our agent is alone.

I was sitting at my desk late last night, staring at three different windows on my monitor, feeling like a digital switchboard operator from the 1950s.

In one window, I had Helix, my text editor, where I was writing a Python script. In the second, I had a terminal running a deep research agent I’d built for Gemini CLI. In the third, I had a browser open to a documentation page.

Here’s the thing: Gemini CLI is brilliant, but it’s blind. It couldn’t see the code I had open in Helix. It couldn’t read the documentation in my browser. When it found a critical library update, I had to manually copy-paste the relevant code into the terminal. When I wanted it to understand an error, I had to copy-paste the stack trace. I was the glue, the slow, error-prone, context-losing glue.

We have spent this entire series building a digital Robinson Crusoe. In Part 1, we gave our agent a brain. In Part 4, we gave it tools. But watching my own workflow fragment into disjointed copy-paste loops, I realized we’ve hit a wall. We have built brilliant, isolated sparks of intelligence, but we haven’t built the wiring to connect them.

This fragmentation is the single biggest bottleneck in the agentic shift. But that is changing. We are witnessing the birth of the protocols that will turn these isolated islands into a network. We are moving from building agents to building the Internet of Agents.

The Struggle Before Standards

I tried to fix this myself, of course. We all have. I wrote brittle Python scripts to wrap my CLI tools. I tried building a mega-agent that had every possible API key hardcoded into its environment variables. I even built my own agentic TUI that explored many interesting ideas, but ultimately wasn’t the right solution.

My lowest moment came when I spent several evenings and weekends building an Electron-based AI research and writing application. The vision was grand: a unified workspace where I could query multiple AI models, organize research into projects, and write drafts with AI assistance, all in one window. I built a beautiful sidebar for project navigation, a markdown editor with live preview, a chat interface that could talk to Gemini, and a “sources” panel for managing references. By the time I stepped back to evaluate what I’d built, I had thousands of lines of TypeScript, a complex state management system, and an app that was slower than just using the terminal. Worse, it didn’t actually solve my problem. I still couldn’t get the AI to see what was in my other tools. I’d built a new silo, not a bridge. The repo still sits on my hard drive, unopened.

Every solution felt like a band-aid. The problem wasn’t that I couldn’t write the code; it was that I was trying to solve an ecosystem problem with a point solution.

The Anatomy of Connection

To solve this, we don’t just need “better agents.” We need a common language. The industry is converging on three distinct protocols, each solving a different layer of the communication stack: MCP for tools, ACP for interfaces, and A2A for collaboration.

Why three protocols instead of one? For the same reason the internet isn’t just “one protocol.” Think of it like the networking stack: TCP/IP handles reliable data transmission, HTTP handles document requests, and SMTP handles email. Each layer solves a distinct problem, and trying to collapse them into one mega-protocol would create an unmaintainable mess. The same logic applies here. MCP solves the “how do I use this tool?” problem. ACP solves the “how do I show this to a human?” problem. A2A solves the “how do I collaborate with another agent?” problem. They’re designed to compose, not compete.

The Internal Wiring of MCP

The Model Context Protocol (MCP), championed by Anthropic, represents the agent’s Internal Wiring. It answers the fundamental question: How does an agent perceive, act upon, and understand the world?

It’s easy to dismiss MCP as just “standardized tool calling,” but that misses the architectural shift. MCP creates a universal substrate for context, built on three distinct pillars. First, there are Resources, the agent’s sensory input that allows it to read data (files, logs, database rows) passively. Crucially, MCP supports subscriptions, meaning an agent can “watch” a log file and wake up the moment an error appears. Next are Tools, the agent’s hands, allowing for action: executing a SQL query, hitting an API, or writing a file. Finally, there are Prompts, perhaps the most overlooked feature, which allow domain experts to bake workflows directly into the server. A “Git Server” doesn’t just expose git commit; it can expose a generate_commit_message prompt that inherently knows your team’s style guide and grabs the current diff automatically.

Here is what that “handshake” looks like: the response to a tools/list request, per Anthropic’s MCP specification. It’s not magic; it’s a strict contract that turns an opaque binary into a discoverable capability:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "query_database",
        "description": "Execute a SELECT query against the local Postgres instance",
        "inputSchema": {
          "type": "object",
          "properties": {
            "sql": { "type": "string" }
          },
          "required": ["sql"]
        }
      }
    ]
  }
}

Now, any agent (whether it’s running in Claude Desktop, Cursor, or a custom script) can “plug in” to my Postgres server and immediately know how to use it. It solves the N × M integration problem forever.

A skeptical reader might ask: “How is this different from REST or OpenAPI?” It’s a fair question. On the surface, MCP looks like “JSON-RPC with a schema,” and that’s not wrong. But the difference is what gets standardized. OpenAPI describes how to call an endpoint; MCP describes how an agent should understand and use a capability. The schema isn’t just for validation. It’s for reasoning. An MCP tool description is a prompt fragment that teaches the model when and why to use the tool, not just how.

But here’s where I need to offer some nuance, because protocol boosterism can obscure practical reality.

As Simon Willison observed in his year-end review, MCP’s explosive adoption may have been partly a timing accident. It launched right as models got reliable at tool-calling, leading some to confuse “MCP support” with “tool-calling ability.” More pointedly, he notes that for coding agents, “the best possible tool for any situation is Bash.” If your agent can run shell commands, it can use gh for GitHub, curl for APIs, and psql for databases, no MCP server required.

I’ve felt this myself. When I’m working in Gemini CLI, I rarely reach for an MCP server. The GitHub CLI (gh) is faster and more capable than any MCP wrapper I’ve tried. The same goes for git, docker, and most developer tools with good CLIs.

So when does MCP make sense? I see three clear cases. First, when there’s no CLI (for example with my MCP service for Google Workspace), since many SaaS products expose APIs but no command-line interface. An MCP server is the natural wrapper. Second, when you need subscriptions, since MCP’s ability to “watch” a resource and push updates to the agent is something CLIs can’t do cleanly. Third, when you’re crossing network boundaries, since an MCP server can run on a remote machine and expose capabilities securely, which is harder to orchestrate with raw shell access.

The real insight here is about context engineering. MCP servers bring along a lot of context for every tool (descriptions, schemas, the full capability surface). For some workflows, that richness is valuable. But Anthropic themselves acknowledged the overhead with their Skills mechanism, a simpler approach where a Skill is just a Markdown file in a folder, optionally with some executable scripts. Skills are lightweight and only load when needed. MCP and Skills aren’t competing; they’re different tools for different context budgets.

Giving the Agent a Seat at the Keyboard

If MCP is the agent’s internal wiring, the Agent Client Protocol (ACP) is its window to the world.

I like to think of this as the LSP (Language Server Protocol) moment for the agentic age. Before LSP, if you wanted to support a new language in an IDE, you had to write a custom parser for every single editor. It was a nightmare of N × M complexity. ACP solves the same problem for intelligence. It decouples the “brain” from the “UI.”

This is why the collaboration between Zed and Google is so critical. When Zed announced bring your own agent with Google Gemini CLI integration, they weren’t just shipping features. They were standardizing the interface between the client (the editor) and the server (the agent). Intelligence became swappable. I can run a local Gemini instance through the same UI that powers a remote Claude agent.

The core of ACP is Symmetry. It’s not just the editor sending prompts to the agent. Through ACP, an editor like Zed (the reference implementation) can tell the agent exactly where your cursor is, what files you have open, and even feed it the terminal output from a failed build. The agent, in turn, can request to edit a specific line or show you a diff for approval.

I’ve been seriously thinking about building ACP support for Obsidian. I already built Gemini Scribe, an agent that lives inside Obsidian for research and writing assistance, but it’s hardcoded to Gemini. With ACP, I could make Obsidian a universal agent host, letting users bring whatever intelligence they prefer into their knowledge management workflow.

This turns the editor into the ultimate guardrail. Because the agent communicates its intent through a standardized protocol, the editor can pause, show the user exactly what’s about to happen, and wait for that “Approve” click. It’s the infrastructure that makes autonomous coding safe.

But the real magic isn’t just safety; it’s ubiquity. ACP liberates the agent from the tool. It means you can bring your preferred intelligence to whatever surface helps you flow. We are already seeing the ecosystem explode beyond just Zed.

For the terminal die-hards, there is Toad, a framework dedicated entirely to running ACP agents in a unified CLI. And for the VIM crowd, the CodeCompanion project has brought full ACP support to Neovim. This is the promise of the protocol: write the agent once, and let the user decide if they want to interact with it in a modern GUI, a raw terminal, or a modal editor from the 90s. The intelligence remains the same; only the glass changes.

When Agents Meet Strangers

Finally, we have the “Internet” layer: Agent-to-Agent (A2A).

While MCP connects an agent to a thing, and ACP connects an agent to a person, A2A connects an agent to society. It addresses the “lonely agent” problem by establishing a standard for horizontal, peer-to-peer collaboration.

This protocol, pushed forward by Google and the Linux Foundation, introduces a profound shift in how we think about distributed systems: Opaque Execution.

In traditional software, if Service A talks to Service B, Service A needs to know exactly how to call the API. In A2A, my agent doesn’t care about the how; it cares about the goal. My “Travel Agent” can ask a “Calendar Agent” to “find a slot for a meeting,” without knowing if that Calendar Agent is running a simple SQL query, consulting a complex rules engine, or even asking a human secretary for help.

This negotiation happens through the Agent Card, a machine-readable identity file hosted at a standard /.well-known/agent.json endpoint. It solves the “Theory of Mind” gap, allowing one agent to understand the capabilities of another. Here’s what one looks like:

{
  "name": "Calendar Agent",
  "description": "Manages scheduling, finds available slots, and coordinates meetings across time zones.",
  "url": "https://calendar.example.com",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true
  },
  "skills": [
    {
      "id": "find-meeting-slot",
      "name": "Find Meeting Slot",
      "description": "Given a list of participants and constraints, finds optimal meeting times.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "participants": { "type": "array", "items": { "type": "string" } },
          "duration_minutes": { "type": "integer" },
          "preferred_time_range": { "type": "string" }
        }
      }
    }
  ],
  "authentication": {
    "schemes": ["oauth2", "api_key"]
  }
}


When my Travel Agent encounters a scheduling problem, it doesn’t need to know how the Calendar Agent works internally. It reads this card, understands the agent can “find meeting slots,” and delegates the task. The Calendar Agent might use Google Calendar, Outlook, or a custom database. My agent doesn’t care.

But the real breakthrough is the Task Lifecycle. A2A tasks aren’t just request-response loops; they are stateful, modeled as a finite state machine with well-defined transitions:

  • Submitted: The task has been received but work hasn’t started.
  • Working: The agent is actively processing the request.
  • Input-Required: The agent needs clarification before continuing. This is the key innovation: the agent can pause, ask “Do you prefer aisle or window?”, and wait indefinitely.
  • Completed: The task finished successfully.
  • Failed: Something went wrong. The response includes an error message and optional retry hints.
  • Canceled: The requesting agent (or human) aborted the task.

This state machine brings the asynchronous, messy reality of human collaboration to the machine world. A task might sit in Input-Required for hours while waiting for a human to respond. It might transition from Working to Failed and back to Working after a retry. The protocol handles all of this gracefully.
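That lifecycle is easy to enforce in code. Here is a small Python sketch of the state machine; the transition table is my reading of the states listed above, not a copy of the spec's exact rules.

```python
# Sketch of the A2A task lifecycle as a finite state machine.
# The transition table is an interpretation of the states described above.

ALLOWED = {
    "submitted": {"working", "canceled"},
    "working": {"input-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled"},
    "failed": {"working"},   # a retry moves the task back to working
    "completed": set(),      # terminal
    "canceled": set(),       # terminal
}

class Task:
    def __init__(self) -> None:
        self.state = "submitted"

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state

task = Task()
task.transition("working")
task.transition("input-required")  # agent pauses to ask a question
task.transition("working")         # answer arrives, work resumes
task.transition("completed")
print(task.state)  # completed
```

Note that "Failed" is not terminal in this sketch: a retry legally moves the task back to "Working," which is exactly the messy, resumable behavior described above.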

Finding Agents You Can Trust

But let’s not declare victory just yet. We are seeing the very beginning of this shift, and the “Internet of Agents” brings its own set of dangers.

As we move from tens of agents to millions, we face a massive Discovery Problem. In a global network of opaque execution, how do you find the right agent? And more importantly, how do you trust it?

It’s not enough to just connect. You need safety guarantees. You need to know that the “Travel Agent” you just hired isn’t going to hallucinate a non-refundable booking or, worse, exfiltrate your credit card data to a malicious third party.

This is the focus of recent research on multi-agent security, which highlights that protocol compliance is only the first step. We need mechanisms for Behavioral Verification, ensuring that an agent does what it says it does.

What does verification look like in practice? Today, it’s mostly manual and ad-hoc. You might:

  • Audit the agent’s logs to see what actions it actually took versus what it claimed.
  • Run it in a sandbox with fake data before trusting it with real resources.
  • Require human approval for high-stakes actions (the “Human-in-the-Loop” pattern we explored in Part 6).
  • Check reputation signals: who built this agent? What’s their track record?

But these are stopgaps. The dream is automated verification: cryptographic proofs that an agent behaved according to its advertised policy, or sandboxed execution environments that can mathematically guarantee an agent never accessed unauthorized data. We’re not there yet.

Whether the solution looks like a decentralized “Web of Trust” (where agents vouch for each other, like PGP key signing) or a centralized “App Store for Agents” (where a trusted authority vets and signs off on agents) remains to be seen. My bet is we’ll see both: curated marketplaces for enterprise use cases, and open registries for the long tail. But solving the discovery and safety problem is the only way we move from a toy ecosystem to a production economy.

The Foundation of the Future

What excites me most isn’t just the code. It’s the governance.

We have seen this movie before. In the early days of the web, proprietary browser wars threatened to fracture the internet. We risked a world where “This site only works in Internet Explorer” became the norm. We avoided that fate because of open standards.

The same risk exists for agents. We cannot afford a future where an “Anthropic Agent” refuses to talk to an “OpenAI Agent” that won’t talk to a “Google Agent.”

That is why the formation of the Agentic AI Foundation by the Linux Foundation is the most important news you might have missed. By bringing together AI pioneers like OpenAI and Anthropic alongside infrastructure giants like Google, Microsoft, and AWS under a neutral banner, we are ensuring that the “Internet of Agents” remains open. This foundation will oversee the development of protocols like A2A, ensuring they evolve as shared public utilities rather than walled gardens. It is the guarantee that the intelligence we build today will be able to talk to the intelligence we build tomorrow.

The New Architecture of Work

When we combine these three protocols, the fragmentation dissolves.

Imagine I am back in Zed (connected via ACP). I ask my coding agent to “Add a secure user profile page.” Zed sends my cursor context to the agent. The agent reaches for MCP to query my local database schema and understand the users table. Realizing this touches PII, it autonomously pings a “Security Guardrail Agent” via A2A to review the proposed code. Approval comes back, and my local agent writes the code directly into my buffer.

I didn’t switch windows once.

But what happens when things go wrong? Let’s say the Security Guardrail Agent rejects the code because it detected a SQL injection vulnerability. The A2A task transitions to Failed with a structured error: {"reason": "sql_injection_detected", "line": 42, "suggestion": "Use parameterized queries"}. My local agent receives this, understands the failure, and either fixes the issue automatically or surfaces it to me with context. The rejection isn’t a dead end; it’s a conversation.

Or imagine the MCP server for my database is unreachable. The agent doesn’t just hang. It receives a timeout error and can decide to retry, fall back to cached schema information, or ask me whether to proceed without database context. Robust failure handling is baked into the protocols, not bolted on as an afterthought.
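A hedged sketch of what that client-side failure handling could look like: the error shape follows the {"reason": …} example above, while the function names and fallback logic are purely illustrative.

```python
# Illustrative handling for the two failure modes described above:
# a rejected A2A task with a structured error, and an MCP timeout.

def handle_review_result(task: dict) -> str:
    """React to the Security Guardrail Agent's verdict."""
    if task["state"] == "completed":
        return "apply_patch"
    if task["state"] == "failed":
        err = task.get("error", {})
        if err.get("reason") == "sql_injection_detected":
            # Structured errors let the agent repair and retry on its own.
            return f"rewrite_line_{err['line']}_then_retry"
        return "surface_to_user"
    return "wait"

def schema_with_fallback(fetch, cache: dict) -> dict:
    """Use the live MCP server if possible, else fall back to cached schema."""
    try:
        return fetch()
    except TimeoutError:
        return cache  # degrade gracefully instead of hanging

rejection = {
    "state": "failed",
    "error": {"reason": "sql_injection_detected", "line": 42,
              "suggestion": "Use parameterized queries"},
}
print(handle_review_result(rejection))  # rewrite_line_42_then_retry
```

The structured error is what makes the automatic repair path possible; a bare "failed" string would force everything down the surface-to-user branch.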

Where We Are Today

I want to be honest about maturity. These protocols are real and shipping, but the ecosystem is young.

MCP is the most mature. Just about everything supports it now: coding tools, virtualization environments, editors, even mobile apps. There are hundreds of community MCP servers for everything from Notion to Kubernetes. If you want to try this today, MCP is the on-ramp.

ACP is newer but moving fast. Zed is the reference implementation, with Neovim (via CodeCompanion) and terminal clients (via Toad) close behind. There are also robust client APIs for many languages, making ACP an interesting interface for controlling local agentic applications. If your editor doesn’t support ACP yet, you’ll likely be using proprietary plugin APIs for now.

A2A is the most nascent. Google and partners announced it in mid-2025, and the specification is still evolving. There aren’t many production A2A deployments yet. Most multi-agent systems today use custom protocols or framework-specific solutions like CrewAI or LangGraph. But the spec is public, the governance is in place, and early adopters are building.

If you’re starting a project today, my advice is: use MCP for tool integration, use whatever your editor supports for the UI layer, and keep an eye on A2A for future multi-agent workflows. The pieces are coming together, but we’re still early.

And yet, this isn’t science fiction. The protocols are here today. The “Internet of Agents” is booting up, and for the first time, our digital Robinson Crusoes are finally getting a radio.

But a radio is only as good as the conversations it enables. In our next post, we’ll move from protocols to practice and explore what happens when agents don’t just connect, but actually collaborate: forming teams, delegating tasks, and solving problems no single agent could tackle alone.

A laptop sits on a dark wooden desk under the warm glow of an Edison bulb; above the screen, a stream of glowing, holographic research papers and data visualizations cascades downward like a waterfall, physically dissolving into lines of green and white markdown text as they enter the open terminal window.

Bringing Deep Research to the Terminal

I lost the report somewhere between browser tabs. One moment it was there in the Gemini app, a detailed deep research analysis on how AI agents communicate with each other, complete with citations and a synthesis I’d spent an hour reviewing. The next moment, gone. Along with the draft blog post I’d been weaving it into.

I was working on part nine of my Agentic Shift series, trying to answer the question of what happens when agents start talking to each other instead of just talking to us. The research was sprawling—academic papers on multi-agent systems, documentation from LangGraph and AutoGen, blog posts from researchers at DeepMind and OpenAI. I’d been using Gemini’s deep research feature in the app to help synthesize all of this, and it was genuinely useful. The AI would spend minutes thinking through the question, querying sources, building a structured report. But then I had to move that report into my text-based workflow. Copy, paste, reformat, lose formatting, copy again. Somewhere in that dance between the browser and my terminal, I lost everything.

I stared at the empty browser tab for a moment. I could start over, rerun the research in the Gemini app, be more careful about saving this time. But this wasn’t the first time I’d hit this friction. Every time I used deep research in the browser, I had to bridge two worlds: the app where the AI did its thinking, and the terminal where I actually write and build.

What looked like yak shaving was actually a prerequisite. I needed deep research capabilities in my terminal workflow, not just wanted them. I couldn’t keep jumping between environments. And I was in luck. Just a few weeks earlier, Google had announced that deep research was now available through the Gemini API. The capability I’d been using in the browser could be accessed programmatically.

When Features Live in the Wrong Place

I’m not going to pretend this was built based on demand from the community. I needed this. Specifically, I needed to stop context-switching between the Gemini app and my terminal, because every time I did, I was introducing friction and risk. The lost report was just the most recent symptom of a workflow that was fundamentally broken for how I work.

I live in the terminal. My notes are markdown files. My drafts are plain text. My build process, my git workflow, my entire development environment assumes I’m working with files and command-line tools. When I have to move work from a browser back into that environment, I’m not just inconvenienced—I’m fighting against the grain of everything else I do.

Deep research is powerful. It works. But living in a web app meant it was disconnected from the places where I actually needed it. Sure, other people might benefit from having this integrated into MCP-compatible tools, but that’s a nice side effect. The real reason I built this was simpler: I had to finish part nine of the Agentic Shift series, and I couldn’t do that without fixing my workflow first.

The Model Context Protocol made this possible. It’s a standard for exposing AI capabilities as tools that can plug into different environments. Google’s API gave me the primitives. I just needed to connect them to where I actually work.

Building the Missing Piece

The extension wraps Gemini’s deep research capabilities into the Model Context Protocol, which means it integrates seamlessly with Gemini CLI and any other MCP-compatible client. The architecture is deliberately simple, but it supports two distinct workflows depending on what you need.

The first workflow is straightforward: you have a research question, and you want a deep investigation. You can kick off research with a simple command, but if you use the bundled /deep-research:start slash command, the model actually guides you through a step to optimize your question to get the most out of deep research. The agent then spends tens of minutes—or as much time as it needs—planning the investigation, querying sources, and synthesizing findings into a detailed report with citations you can follow up on.

The second workflow is for when you want to ground the research in your own documents. You use /deep-research:store-create to set up a file search store, then /deep-research:store-upload to index your files. Once they’re uploaded, you have two options: you can include that dataset in the deep research process so the agent grounds its investigation in your specific sources, or you can query against it directly for a simpler RAG experience. This is the same File Search capability I wrote about in November when I rebuilt my Podcast RAG system, but now it’s accessible from the terminal as part of my normal workflow.

The extension maintains local state in a workspace cache, so you don’t have to remember arcane resource identifiers or lose track of running research jobs. The whole thing is designed to feel as natural as running a grep command or kicking off a build—it’s just another tool in the environment where I already work.

So did it actually work?

The first time I ran it, I asked for a deep dive into Stonehenge construction. I’d been reading Ken Follett’s novel Circle of Days and found myself curious about the scientific evidence behind the story: what do we actually know about how it was built and who built it? I kicked off the query and watched something fascinating happen. The model understood that deep research takes time. Instead of just waiting silently, it kept checking in to see if the research was done, almost like checking the oven to see if dinner was ready. Twenty minutes later, a markdown file appeared in my filesystem with a comprehensive research report, complete with citations to academic sources, isotope analysis, and archaeological evidence. I didn’t have to copy anything from a browser. I didn’t lose any formatting. It was just there, ready to reference. The report mentioned the Bell Beaker culture and what happened to the Neolithic builders around 2500 BCE, which sent me down another rabbit hole. I immediately ran a second research query on that transition. Same seamless experience. That’s when I knew this was exactly what I needed.

What This Actually Means

I think extensions like this represent something important about where AI development is heading. We’re past the proof-of-concept phase where every AI interaction is a magic trick. Now we’re in the phase where AI capabilities need to integrate into actual workflows—not replace them, but augment them in ways that feel natural.

This is what I wrote about in November when I talked about the era of Personal Software. We’ve crossed a threshold where building a bespoke tool is often faster—and certainly less frustrating—than trying to adapt your workflow to someone else’s software. I didn’t build this extension for the community. I built it because I needed it. I had lost work, and I needed to stop context-switching between environments. If other people find it useful, that’s a nice side effect, but it’s fundamentally software for an audience of one.

The key insight for me was that the Model Context Protocol isn’t just a technical standard; it’s a design pattern for making AI tools composable. Instead of building a monolithic research application with its own UI and workflow, I built a small, focused extension that does one thing well and plugs into the environment where I already work. That composability matters because it means the tool can evolve with my workflow rather than forcing my workflow to evolve around the tool.

There’s also something interesting happening with how we think about AI capabilities. Deep research isn’t about making the model smarter—it’s about giving it time and structure. The same model that gives you a superficial answer in three seconds can give you a genuinely insightful report if you let it think for tens of minutes and provide it with the right sources. We’re learning that intelligence isn’t just about raw capability; it’s about how you orchestrate that capability over time.

What Comes Next

The extension is live on GitHub now, and I’m using it daily for my own research workflows. The immediate next step is adding better control over the research format—right now you can specify broad categories like “Technical Deep Dive” or “Executive Brief,” but I want more granular control over structure and depth. I’m also curious about chaining multiple research tasks together, where the output of one investigation becomes the input for the next.

But the bigger question I’m sitting with is what other AI capabilities are hiding in plain sight, waiting for someone to make them accessible. Deep research was always there in the Gemini API; it just needed a wrapper that made it feel like a natural part of the development workflow. What else is out there?

If you want to try it yourself, you’ll need a Gemini API key (get one at ai.dev) and to set the GEMINI_DEEP_RESEARCH_API_KEY environment variable. Deep research runs on Gemini 3.0 Pro, and you can find the current pricing here. It’s charged based on token consumption for the research process plus any tool usage fees.

Install the extension with:

gemini extensions install https://github.com/allenhutchison/gemini-cli-deep-research --auto-update

The full source is on GitHub.

As for me, I still need to finish part nine of the Agentic Shift series. But now I can get back to it with the confidence that I’m working in my preferred environment, with the tools I need accessible right from the terminal. Fair warning: once you start using AI for actual deep research, it’s hard to go back to the shallow stuff.

A close-up photograph on a wooden workbench shows a hand-carved wooden tool handle resting on a MacBook Pro keyboard. The handle transitions into a glowing blue and orange digital wireframe where it extends over the laptop's screen, which displays lines of green code. Wood shavings, chisels, and other traditional tools are scattered around the laptop. A warm desk lamp illuminates the scene from the right.

The Era of Personal Software

I was sitting in a coffee shop this afternoon, nursing a cappuccino and doing a quick triage of the GitHub repositories I maintain. It was supposed to be a quick check-in, but I was surprised to find a pile of issues I hadn’t seen before. They had slipped through the cracks of my notifications.

My immediate reaction wasn’t just annoyance; it was an itch to fix the process. I needed a way to monitor a configurable set of repos and get a consolidated report of new activity—something bespoke. For my smaller projects, I want to see everything. For the big, noisy ones, I only care if I’m assigned or mentioned.

So, I opened up my terminal. I fired up the Gemini CLI and started describing what I needed.

Twenty minutes later, I had a working command-line tool. It did exactly what I described, filtering the noise exactly how I wanted. I ran it, verified the output, and added it to my daily workflow. I closed my laptop and went on with my day.

But on the walk home, I realized something strange had happened. Or rather, something hadn’t happened.

I never opened Google. I never searched GitHub for “activity monitor CLI.” I didn’t spend an hour trawling through “Top 10 GitHub Tools” blog posts, or installing three different utilities only to find out one was deprecated and the other required a subscription.

I just built the thing I needed and moved on.

We are entering the era of Personal Software. This is software written for an audience of one. It’s an application or a script built to solve a specific problem for a specific person, with no immediate intention of scaling, monetizing, or even sharing.

Looking back at my recent work, I realize I’ve been living in this category for a while. In many ways, this is the active evolution of the “Small Tools, Big Ideas” concept I explored earlier this year. Instead of just finding these sharp, focused tools, I’m now building them. Gemini Scribe started because I wanted a better way to write in Obsidian. Podcast RAG exists solely because I wanted to search my own podcast history. My github-activity-reporter from this afternoon? Pure personal necessity. Even adh-cli was just a sandbox for me to test ideas for the Gemini CLI.

We have crossed a threshold where building a bespoke application is often faster—and certainly less frustrating—than finding an off-the-shelf solution that mostly works. The friction of creation has dropped so low that it is now competing with the friction of discovery.

There is a profound freedom in this approach. When you build for an audience of one, the software does exactly what you want and nothing more. There is no feature bloat, no upsell, no UI clutter. You are the product manager, the engineer, and the customer. If your workflow changes next week, you don’t have to file a feature request and hope it gets upvoted; you just change the code. You don’t have to convince anyone else that your problem is worth solving.

But this freedom comes with a new kind of responsibility. When you step outside the walled garden of managed software, you are on your own. If you get stuck, there is no support ticket to file. If an API changes and breaks your tool, you are on the hook to fix it.

There is also the “trap of success.” Sometimes, your personal software is so useful that it accidentally becomes non-personal. Friends ask for it. Colleagues want to fork it. Suddenly, you aren’t just a user anymore; you’re a maintainer. You have to decide if you’re willing to take on the burden of supporting others, or if you’re comfortable saying, “This works for me, good luck to you.”

Not every problem is a nail for this particular hammer, of course. Over time, I’ve started to develop a rubric for what makes for good Personal Software.

The sweet spot is usually glue and logic. If you need to connect two APIs that don’t talk to each other, or parse a messy data export into a clean report, AI can write that script in seconds. My GitHub activity reporter is a perfect example: it’s just fetching data, filtering it against my specific rules, and printing text.
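As an illustration of that "glue and logic" sweet spot, here is roughly what the core of such a reporter could look like in Python. The data shape and filter rules are invented for the example, not taken from my actual tool.

```python
# Sketch of the "glue and logic" pattern: fetch, filter against
# personal rules, print. Data shape and rules are invented here.

RULES = {
    "my-small-project": lambda issue: True,                     # see everything
    "big-noisy-repo": lambda issue: "me" in issue["mentions"],  # only if pinged
}

def triage(issues: list[dict]) -> list[str]:
    """Return one report line per issue that passes its repo's rule."""
    report = []
    for issue in issues:
        rule = RULES.get(issue["repo"])
        if rule and rule(issue):
            report.append(f"[{issue['repo']}] #{issue['number']}: {issue['title']}")
    return report

issues = [
    {"repo": "my-small-project", "number": 7, "title": "Crash on start", "mentions": []},
    {"repo": "big-noisy-repo", "number": 901, "title": "Refactor", "mentions": []},
    {"repo": "big-noisy-repo", "number": 902, "title": "Review?", "mentions": ["me"]},
]
for line in triage(issues):
    print(line)
```

In a real version, the issues list would come from the GitHub API or gh; the point is that the bespoke part is just a dictionary of per-repo rules.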

It’s also great for ephemeral workflows. If you have a task you need to do fifty times today but might never do again—like renaming a batch of files based on their content or scraping a specific webpage for research—building a throwaway tool is vastly superior to doing it manually.

Another fantastic category is quick web applications. We used to think of web apps as heavy projects requiring frameworks and hosting headaches. But modern platforms like Google Cloud Run or Vercel have made deployment trivial. Tools like Google AI Studio take this even further—offering a free “vibe coding” platform that can take you from a rough idea to a hosted application in minutes. My boxing workout app is a prime example: I didn’t write a line of infrastructure code; I just described the workout timer I needed, and it was live before I even put on my gloves.

Where Personal Software falls short is in infrastructure and security. I wouldn’t build my own password manager or roll my own encryption tools, no matter how good the model is. The stakes are too high, and the “audience of one” means there are no other eyes on the code to catch critical vulnerabilities. Similarly, if a problem requires a complex, interactive GUI or high-availability hosting, the maintenance burden usually outweighs the benefits of customization.

Despite the downsides, I find this shift fascinating. For decades, software development was an industrial process—building generic tools for mass consumption. Now, it’s becoming a craft again. We are returning to a time where we build our own tools, fitting the handle perfectly to our own grip.

So, I want to turn the question over to you. What are you building just for yourself? Are there small, nagging problems you’ve solved with a script only you will ever see? I’d love to hear about the kinds of personal software you’re creating in this new era. Let me know in the comments or reach out—I’m genuinely curious to see what handles you’re crafting.

A retro computer monitor displaying the Gemini CLI prompt "> Ask Gemini to scaffold a web app" inside a glowing neon blue and pink holographic wireframe box, representing a digital sandbox.

The Guardrails of Autonomy

I still remember the first time I let an LLM execute a shell command on my machine. It was a simple ls -la, but my finger hovered over the Enter key for a solid ten seconds.

There is a visceral, lizard-brain reaction to giving an AI that level of access. We all know the horror stories—or at least the potential horror stories. One hallucinated argument, one misplaced flag, and a helpful cleanup script becomes rm -rf /. This fear creates a central tension in what I call the Agentic Shift. We want agents to be autonomous enough to be useful—fixing a bug across ten files while we grab coffee—but safe enough to be trusted with the keys to the kingdom.

Until now, my approach with the Gemini CLI was the blunt instrument of “Human-in-the-Loop.” Any tool call with a side effect—executing shell commands, writing code, or editing files—required a manual y/n confirmation. It was safe, sure. But it was also exhausting.

I vividly remember asking Gemini to “fix all the linting errors in this project.” It brilliantly identified the issues and proposed edits for twenty different files. Then I sat there, hitting y, y, y… twenty times.

The magic evaporated. I wasn’t collaborating with an intelligent agent; I was acting as a slow, biological barrier for a very expensive macro. This feeling has a name—“Confirmation Fatigue”—and it’s the silent killer of autonomy. I realized I needed to move from micromanagement to strategic oversight. I didn’t want to stop the agent; I wanted to give it a leash.

The Policy Engine

The solution I’ve built is the Gemini CLI Policy Engine.

Think of it as a firewall for tool calls. It sits between the LLM’s request and your operating system’s execution. Every time the model reaches for a tool—whether it’s to read a file, run a grep command, or make a network request—the Policy Engine intercepts the call and evaluates it against a set of rules.

The system relies on three core actions:

  1. allow: The tool runs immediately.
  2. deny: The AI gets a “Permission denied” error.
  3. ask_user: The default manual approval.

A Hierarchy of Trust

The magic isn’t just in blocking or allowing things; it’s in the hierarchy. Instead of a flat list of rules, I built a tiered priority system that functions like layers of defense.

At the base, you have the Default Safety Net. These are the built-in rules that apply to everyone—basic common sense like “always ask before overwriting a file.”

Above that sits the User Layer, which is where I define my personal comfort zone. This allows me to customize the “personality” of my safety rails. On my personal laptop, I might be a cowboy, allowing git commands to run freely because I know I can always undo a bad commit. But on a production server, I might lock things down tighter than a vault.

Finally, at the top, is the Enterprise/Admin Layer. These are the immutable laws of physics for the agent. In an enterprise setting, this is where you ensure that no matter how “creative” the agent gets, it can never curl data to an external IP or access sensitive directories.
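Mechanically, the whole hierarchy reduces to "highest-priority matching rule wins." Here is a Python sketch of that evaluation loop; it mirrors the rule fields used in the TOML snippets in this post, but it is an illustration, not the Gemini CLI's actual implementation.

```python
# Sketch: "highest-priority matching rule wins" policy evaluation.
# Rule fields mirror the TOML examples; not the CLI's real implementation.

def evaluate(rules: list[dict], tool: str, command: str) -> str:
    best = {"decision": "ask_user", "priority": -1}  # default safety net
    for rule in rules:
        if rule["toolName"] != tool:
            continue
        prefixes = rule.get("commandPrefix", [])
        # An empty prefix list matches any command for this tool.
        if prefixes and not any(command.startswith(p) for p in prefixes):
            continue
        if rule["priority"] > best["priority"]:
            best = rule
    return best["decision"]

rules = [
    {"toolName": "run_shell_command", "commandPrefix": ["git status", "git log"],
     "decision": "allow", "priority": 100},
    {"toolName": "run_shell_command", "commandPrefix": ["git push"],
     "decision": "ask_user", "priority": 900},
]

print(evaluate(rules, "run_shell_command", "git log --oneline"))  # allow
print(evaluate(rules, "run_shell_command", "rm -rf /tmp/x"))      # ask_user
```

The tiering falls out of the priority numbers: user rules simply carry higher priorities than the defaults, and admin rules higher still.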

Safe Exploration

In practice, this means I can trust the agent to look but ask it to verify before it touches. I generally trust the agent to check the repository status, review history, or check if the build passed. I don’t need to approve every git log or gh run list.

[[rule]]
toolName = "run_shell_command"
commandPrefix = [
  "git status",
  "git log",
  "git diff",
  "gh issue list",
  "gh pr list",
  "gh pr view",
  "gh run list"
]
decision = "allow"
priority = 100

Yolo Mode

Sometimes, I’m working in a sandbox and I just want speed. I can use the dedicated yolo mode to take the training wheels off. There is a distinct feeling of freedom—and a slight thrill of danger—when you watch the terminal fly by, commands executing one after another.

However, even in Yolo mode, I want a final sanity check before I push code or open a PR. While Yolo mode is inherently permissive, I define specific high-priority rules to catch critical actions. I also explicitly block docker commands—I don’t want the agent spinning up (or spinning down) containers in the background without me knowing.

# Exception: Always ask before committing or creating a PR
[[rule]]
toolName = "run_shell_command"
commandPrefix = ["git commit", "gh pr create"]
decision = "ask_user"
priority = 900
modes = ["yolo"]

# Exception: Never run docker commands automatically
[[rule]]
toolName = "run_shell_command"
commandPrefix = "docker"
decision = "deny"
priority = 999
modes = ["yolo"]

The Hard Stop

And then there are the things that should simply never happen. I don’t care how confident the model is; I don’t want it rebooting my machine. These rules are the “break glass in case of emergency” protections that let me sleep at night.

[[rule]]
toolName = "run_shell_command"
commandRegex = "^(shutdown|reboot|kill)"
decision = "deny"
priority = 999

Decoupling Capability from Control

The significance of this feature goes beyond just saving me from pressing y. It fundamentally changes how we design agents.

I touched on this concept in my series on autonomous agents, specifically in Building Secure Autonomous Agents, where I argued that a “policy engine” is essential for scaling from one agent to a fleet. Now, I’m bringing that same architecture to the local CLI.

Previously, the conversation around AI safety often presented a binary choice: you could have a capable agent that was potentially dangerous, or a safe agent that was effectively useless. If I wanted to ensure the agent wouldn’t accidentally delete my home directory, the standard advice was to simply remove the shell tool. But that is a false choice. It confuses the tool with the intent. Removing the shell doesn’t just stop the agent from doing damage; it stops it from running tests, managing git, or installing packages—the very things I need it to do.

With the Policy Engine, I can give the agent powerful tools but wrap them in strict policies. I can give it access to kubectl, but only for get commands. I can let it edit files, but only on specific documentation sites.
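Following the same rule format as the snippets above, a read-only kubectl policy might look like this. This is a sketch: tune the priorities to wherever they belong in your own hierarchy.

```toml
# Allow read-only kubectl; deny everything else under the kubectl prefix.
[[rule]]
toolName = "run_shell_command"
commandPrefix = ["kubectl get", "kubectl describe"]
decision = "allow"
priority = 200

# Lower priority, so the read-only allow above wins when both match
[[rule]]
toolName = "run_shell_command"
commandPrefix = "kubectl"
decision = "deny"
priority = 150
```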

This is how we bridge the gap between a fun demo and a production-ready tool. It allows me to define the sandbox in which the AI plays, giving me the confidence to let it run autonomously within those boundaries.

Defining Your Own Rules

The Policy Engine is available now in the latest release of Gemini CLI. You can dive into the full documentation here.

If you want to see exactly what rules are currently active on your system—including the built-in defaults and your custom additions—you can simply run /policies list from inside the Gemini CLI.

I’m currently running a mix of “Safe Exploration” and “Hard Stop” rules. It’s quieted the noise significantly while keeping my file system intact. I’d love to hear how you configure yours—are you a “deny everything” security maximalist, or are you running in full “allow” mode?

A stylized, dark digital illustration of an open laptop displaying lines of blue code. Floating above the laptop are three glowing, neon blue wireframe icons: a document on the left, a calendar in the center, and an envelope on the right. The icons appear to be formed from streams of digital particles rising from the laptop screen, symbolizing the integration of digital tools. The overall aesthetic is futuristic and high-tech, with dramatic lighting emphasizing the connection between the code and the applications.

Bringing the Office to the Terminal

There is a specific kind of friction that every developer knows. It’s the friction of the “Alt-Tab.”

You’re deep in the code, holding a complex mental model of a system in your head, when you realize you need to check a requirement. That requirement lives in a Google Doc. Or maybe you need to see if you have time to finish a feature before your next meeting. That information lives in Google Calendar.

So you leave the terminal. You open the browser. You navigate the tabs. You find the info. And in those thirty seconds, the mental model you were holding starts to evaporate. The flow is broken.

But it’s not just the context switch that kills your momentum—it’s the ambush. The moment you open that browser window, the red dots appear. Chat pings, new emails, unresolved comments on a doc you haven’t looked at in two days—they all clamor for your attention. Before you know it, the quick thing you needed to look up has morphed into an hour of answering questions and putting out fires. You didn’t just lose your place in the code; you lost your afternoon.

I’ve been thinking a lot about this friction lately, especially as I’ve moved more of my workflow into the Gemini CLI. If we want AI to be a true partner in our development process, it can’t just live in a silo. It needs access to the context of our work—and for most of us, that context is locked away in the cloud, in documents, chats, and calendars.

That’s why I built the Google Workspace extension for Gemini CLI.

Giving the Agent “Senses”

We often talk about AI agents in the abstract, but their utility is defined by their boundaries. An agent that can only see your code is a great coding partner. An agent that can see your code and your design documents and your team’s chat history? That’s a teammate.

This extension connects the Gemini CLI to the Google Workspace APIs, effectively giving your terminal-based AI a set of digital senses and hands. It’s not just about reading data; it’s about integrating that data into your active workflow.

Here is what that looks like in practice:

1. Contextual Coding

Instead of copying and pasting requirements from a browser window, you can now ask Gemini to pull the context directly.

“Find the ‘Project Atlas Design Doc’ in Drive, read the section on API authentication, and help me scaffold the middleware based on those specs.”

2. Managing the Day

I often get lost in work and lose track of time. Now, I can simply ask my terminal:

“Check my calendar for the rest of the day. Do I have any blocks of free time longer than two hours to focus on this migration?”

3. Seamless Communication

Sometimes you just need to drop a quick note without leaving your environment.

“Send a message to the ‘Core Eng’ chat space letting them know the deployment is starting now.”

The Accidental Product

Truth be told, I didn’t set out to build a product. When I first joined Google DeepMind, this was simply my “starter project.” My manager suggested I spend a few weeks experimenting with Google Workspace and our agentic capabilities, and the Gemini CLI seemed like the perfect sandbox for that kind of exploration.

I started building purely for myself, guided by my own daily friction. I wanted to see if I could check my calendar without leaving the terminal. Then I wanted to see if I could pull specs from a Doc. I followed the path of my own curiosity, adding tools one by one.

But when I shared this little experiment with a few colleagues, the reaction was immediate. They didn’t just think it was cool; they wanted to install it. That’s when I realized this wasn’t just a personal hack—it was a shared need. It snowballed from a few scripts into a full-fledged extension that we knew we had to ship.

Under the Hood

The extension is built as a Model Context Protocol (MCP) server, which means it runs locally on your machine. It uses your own OAuth credentials, so your data never passes through a third-party server. It’s direct communication between your local CLI and the Google APIs.

It currently supports a wide range of tools across the Workspace suite:

  • Docs & Drive: Search for files, read content, and even create new docs from markdown.
  • Calendar: List events, find free time, and schedule meetings.
  • Gmail: Search threads, read emails, and draft replies.
  • Chat: Send messages and list spaces.
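The “find free time” capability is a good example of how simple these tools are under the hood. Here is a sketch of the gap-finding logic, not the extension's actual implementation: given a day's busy intervals (a stand-in for Calendar API data, assumed sorted and non-overlapping), it returns the open blocks longer than a threshold.

```python
from datetime import datetime, timedelta

def free_blocks(events, day_start, day_end, min_hours=2.0):
    """Find gaps between busy intervals at least `min_hours` long.

    `events` is a list of (start, end) datetime pairs, assumed sorted
    and non-overlapping -- an illustrative stand-in for Calendar data.
    """
    blocks, cursor = [], day_start
    for start, end in events:
        if (start - cursor) >= timedelta(hours=min_hours):
            blocks.append((cursor, start))
        cursor = max(cursor, end)
    if (day_end - cursor) >= timedelta(hours=min_hours):
        blocks.append((cursor, day_end))
    return blocks

day = datetime(2025, 1, 15)
busy = [
    (day.replace(hour=10), day.replace(hour=11)),  # standup
    (day.replace(hour=15), day.replace(hour=16)),  # 1:1
]
for start, end in free_blocks(busy, day.replace(hour=9), day.replace(hour=18)):
    print(f"{start:%H:%M} - {end:%H:%M}")
# 11:00 - 15:00 and 16:00 - 18:00 are the >= 2h focus blocks
```

When the model answers “do I have a two-hour block to focus on this migration?”, it is reasoning over exactly this kind of structured output from the tool call.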

Why This Matters

This goes back to the idea of “Small Tools, Big Ideas.” Individually, a command-line tool to read a calendar isn’t revolutionary. But when you combine that capability with the reasoning engine of a large language model, it becomes something else entirely.

It turns your terminal into a cockpit for your entire digital work life. It allows you to script interactions between your code and your company’s knowledge base. It reduces the friction of context switching, letting you stay where you are most productive.

If you want to try it out, the extension is open source and available now. You can install it directly into the Gemini CLI:

gemini extensions install https://github.com/gemini-cli-extensions/workspace

I’m curious to see how you all use this. Does it change your workflow? Does it keep you in the flow longer? Give it a spin and let me know.