GitHub issues transforming into glowing skill cards floating above a laptop screen.

Bundled Skills in Gemini Scribe

The feature that became Bundled Skills started with a GitHub issues page.

I wrote and maintain Gemini Scribe, an Obsidian plugin that puts a Gemini-powered agent inside your vault. Thousands of people use it, and they have questions. People would open discussions and issues asking how to configure completions, how to set up projects, what settings were available. I was answering the same questions over and over, and it hit me: the agent itself should be able to answer these. It has access to the vault. It can read files. Why am I the bottleneck for questions about my own plugin?

So I built a skill. I took the same documentation source that powers the plugin’s website, packaged it up as a set of instructions the agent could load on demand, and suddenly users could just ask the agent directly. “How do I set up completions?” “What settings are available?” The agent would pull in the right slice of documentation and give a grounded answer. The docs on the web and the docs the agent reads are built from the same source. There is no separate knowledge base to keep in sync.

That first skill opened a door. I was already using custom skills in my own vault to improve how the agent worked with Bases and frontmatter properties. Once I had the bundled skills mechanism in place, I started looking at those personal skills differently. The ones I had built for myself around Obsidian-specific tasks were not just useful to me. They would be useful to anyone running Gemini Scribe. So I started migrating them from my vault into the plugin as built-in skills.

As of the latest version, Gemini Scribe ships with four built-in skills. In a future post I will walk through how to create your own custom skills, but first I want to explain what ships out of the box and why this approach works.

Four Skills Out of the Box

That first skill became gemini-scribe-help, and it is still the one I am most proud of conceptually. The plugin’s own documentation lives inside the same skill system as everything else. No special case, no separate knowledge base. The agent answers questions about itself using the same mechanism it uses for any other task.

The second skill I built was obsidian-bases. I wanted the agent to be good at creating Bases (Obsidian’s take on structured data views), but it kept getting the configuration wrong. Filters, formulas, views, grouping: there is a lot of surface area and the syntax is particular. So I wrote a skill that guides the agent through creating and configuring Bases from scratch, including common patterns like task trackers and project dashboards. Instead of me correcting the agent’s output every time, I describe what I want and the agent builds it right the first time.

Next came audio-transcription. This one has a fun backstory. Audio transcription was one of the oldest outstanding bugs in the repo. People wanted to use it with Obsidian’s native audio recording, but the results were poor. In this release, fixes around binary file uploads meant the model could finally receive audio files properly. Once that was working, I realized I did not need to write any more code to get good transcriptions. I just needed to give the agent good instructions. The skill guides it through producing structured notes with timestamps, speaker labels, and summaries. It turns a messy audio file into a clean, searchable note, and the fix was not code but context.

The fourth is obsidian-properties. Working with note properties (the YAML frontmatter at the top of every Obsidian note) sounds trivial until you are doing it across hundreds of notes. The agent would make inconsistent choices about property types, forget to use existing property names, or create duplicates. This skill makes it reliable at creating, editing, and querying properties consistently, which matters enormously if you are using Obsidian as a serious knowledge management system.

The pattern behind all four is the same. I watched the agent struggle with something specific to Obsidian, and instead of accepting that as a limitation of the model, I wrote a skill to fix it.

Why Not Just Use the System Prompt

You might be wondering why I did not just shove all of this into the system prompt. I wrote about this problem in detail in Managing the Agent’s Attention, but the short version is that system prompts are a “just-in-case” strategy. You load up the agent with everything it might need at the start of the conversation, and as you add more instructions, they start competing with each other for the model’s attention. Researchers call this the “Lost in the Middle” problem: models pay disproportionate attention to the beginning and end of their context, and everything in between gets diluted. If I packed all four skills’ worth of instructions into the system prompt, each one would make the others less effective. Every new skill I add would degrade the ones already there.

Skills avoid this entirely. The agent always knows which skills are available (it gets a short name and description for each one), but only loads the full instructions when it actually needs them. When a skill activates, its instructions land in the most recent part of the conversation, right before the model starts reasoning. Only one skill’s instructions are competing for attention at a time, and they are sitting in the highest-attention position in the context window.
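This lazy-loading pattern can be sketched in a few lines. The interfaces and function names here are illustrative, not Gemini Scribe’s actual internals: a catalog whose entries always expose a name and description, and a body that is only loaded when the agent calls the activation tool.

```typescript
// Illustrative sketch of on-demand skill loading.
// Names are hypothetical, not Gemini Scribe's actual API.
interface SkillMeta {
  name: string;
  description: string; // always visible to the agent
  load: () => string;  // full instructions, loaded only on activation
}

const catalog: SkillMeta[] = [
  {
    name: "obsidian-bases",
    description: "Create and configure Obsidian Bases",
    load: () => "...full Bases instructions...",
  },
  {
    name: "audio-transcription",
    description: "Produce structured notes from audio files",
    load: () => "...full transcription instructions...",
  },
];

// What the system prompt carries: only names and descriptions.
function skillIndex(): string {
  return catalog.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}

// What an activate_skill tool call does: fetch the full instructions
// so they land in the most recent part of the conversation.
function activateSkill(name: string): string {
  const skill = catalog.find((s) => s.name === name);
  if (!skill) throw new Error(`Unknown skill: ${name}`);
  return skill.load();
}
```

The key property is that `skillIndex` stays cheap no matter how many skills exist, while `activateSkill` puts exactly one skill’s full instructions into the high-attention end of the context.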

There is a second benefit that surprised me. Because skills activate through the activate_skill tool call, you can watch the agent load them. In the agent session, you see exactly when a skill is activated and which one it chose. This gives you something that system prompts never do: observability. If the agent is not following your instructions, you can check whether it actually activated the skill. If it activated the skill but still got something wrong, you know the problem is in the skill’s instructions, not in the agent’s attention. That feedback loop is what lets you iterate and improve your skills over time. You are no longer guessing whether the agent read your instructions. You can see it happen.

Skills follow the open agentskills.io specification, and this matters more than it might seem. We have seen significant standardization around this spec across the industry in 2026. That means skills are portable. If you have been using skills with another agent, you can bring them into Gemini Scribe and they will work. If you build skills in Gemini Scribe, you can take them with you. They are not a proprietary format tied to one tool. They are Markdown files with a bit of YAML frontmatter, designed to be human-readable, version-controllable, and portable across any agent that supports the spec.
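For illustration, a skill in this format is just a Markdown file whose frontmatter carries the name and description the agent always sees, with the full instructions in the body. The exact field names and instruction text below are my own sketch of the general shape, not a file shipped with the plugin:

```markdown
---
name: obsidian-properties
description: Create, edit, and query YAML frontmatter properties consistently.
---

When working with note properties:

1. Check for existing property names before creating new ones.
2. Match the property type (text, list, date, number) already in use.
3. Never create duplicate properties that differ only in casing.
```

Because the whole skill is one readable file, it diffs cleanly in version control and travels between any agents that support the spec.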

What Comes Next

The four built-in skills are just the beginning. When I decide what to build next, I think about skills in four categories. First, there are skills that give the agent domain knowledge about Obsidian itself, things like Bases and properties where the model’s general training is not specific enough. Second, there are skills that help the agent use Gemini Scribe’s own tools effectively. The plugin has capabilities like deep research, image generation, semantic search, and session recall, and each of those benefits from a skill that teaches the agent when and how to use them well. Third, there are skills that bring entirely new capabilities to the agent, like audio transcription. And fourth, there is user support: the help skill that started this whole process, making sure people can get answers without leaving their vault.

The next version of Gemini Scribe will add built-in skills for semantic search, deep research, image generation, and session recall. The skills system is also designed to be extended by users. In a future post I will walk through creating your own custom skills, both by hand and by asking the agent to build them for you.

For now, the takeaway is simple. A general-purpose model knows a lot, but it does not know your tools. When I watched the agent struggle with Obsidian Bases or produce flat transcripts or make a mess of note properties, I could have accepted those as limitations. Instead, I wrote skills to close the gap. The model’s knowledge is broad. Skills make it deep.

A focused workspace at a desk in a vast library, with nearby shelves illuminated and distant shelves visible but softened, a pair of sunglasses resting on the desk

Scoping AI Context with Projects in Gemini Scribe

My son has a friend who likes to say, “born to dilly dally, forced to lock in.” I’ve started to think that describes AI agents in a large Obsidian vault perfectly.

My vault is a massive, sprawling entity. It holds nearly two decades of thoughts, ranging from deep dives into LLM architecture to my kids’ school syllabi and the exact dimensions needed for an upcoming home remodeling project. When I first introduced Gemini Scribe, the agent’s ability to explore all of that was a feature. I could ask it to surface surprising connections across topics, and it would. But as I’ve leaned harder into Scribe as a daily partner, both at home and at work, the dilly dallying became a real problem. My work vault has thousands of files with highly overlapping topics. It’s not a surprise that the agent might jump from one topic to another, or get confused about what we’re working on at any given time. When I asked the agent to help me structure a paragraph about agentic workflows, I didn’t want it pulling in notes from my jazz guitar practice.

I could have created a new, isolated vault just for my blog writing. I tried that briefly, but I immediately found myself copying data back and forth. I was duplicating Readwise syncs, moving research papers, and fracturing my knowledge base. That wasn’t efficient, and it certainly wasn’t fun. The problem wasn’t that the agent could see too much. The problem was glare. I needed sunglasses, not blinders. I needed to force the agent to lock in.

So, I built Projects in Gemini Scribe.

A project defines scope without acting as a gatekeeper

Fundamentally, a project in Gemini Scribe is a way to focus the agent’s attention without locking it out of anything. It defines a primary area of work, but the rest of the vault is still there. Think of it like sitting at a desk in the engineering section of a library. Those are the shelves you browse by default, the ones within arm’s reach. But if you know the call number for a book in the history section, nobody stops you from walking over and grabbing it. You can even leave a stack of books from other sections on your desk ahead of time if you know you’ll need them. If you’ve followed along with the evolution of Scribe from plugin to platform, you’ll recognize this as a natural extension of the agent’s growing capabilities.

The core mechanism is remarkably simple. Any Markdown file in your vault can become a project by adding a specific tag to its YAML frontmatter.

---
tags:
  - gemini-scribe/project
name: Letters From Silicon Valley
skills:
  - writing-coach
permissions:
  delete_file: deny
---

Once tagged, that file’s parent directory becomes the project root. From that point on, when an agent session is linked to the project, its discovery tools are automatically scoped to that directory and its subfolders. Under the hood, the plugin intercepts API calls to tools like list_files and find_files_by_content, transparently prepending the project root to the search paths. The practical difference is immediate. Before projects, I could be working on a blog post about agent memory systems and the agent would surface notes from a completely unrelated project that happened to use similar terminology. Now I can load up a project and work with the agent hand in hand, confident it won’t get distracted by similar ideas or overlapping vocabulary from other corners of the vault.
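The interception amounts to prefixing search paths with the project root before the tool runs. A minimal sketch of that idea, with names of my own choosing rather than the plugin’s actual code:

```typescript
// Hypothetical sketch of scoping discovery tools to a project root.
// Names are illustrative, not Gemini Scribe's actual implementation.
interface Project {
  root: string; // directory containing the tagged project file
}

// Prepend the project root to a search path, unless the caller
// already asked for a path inside the project.
function scopePath(searchPath: string, project: Project | null): string {
  if (!project) return searchPath; // no active project: vault-wide search
  const root = project.root.replace(/\/+$/, "");
  if (searchPath === root || searchPath.startsWith(root + "/")) {
    return searchPath;
  }
  // "." or "" means "search everywhere": scope it to the project root
  if (searchPath === "." || searchPath === "") return root;
  return `${root}/${searchPath}`;
}
```

Discovery tools like list_files would pass their search path through a wrapper like this; read tools would not, which is what keeps the rest of the vault reachable.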

The project file serves as both configuration and context

The project file itself serves a dual purpose. It acts as both configuration and context. The frontmatter handles the configuration, allowing me to explicitly limit which skills the agent can use or override global permission settings. For example, denying file deletions for a critical writing project is a simple but effective safety net. But the real power is in customizing the agent’s behavior per project. For my creative writing, I actually don’t want the agent to write at all. I want it to read, critique, and discuss, but the words on the page need to be mine. Projects let me turn off the writing skill entirely for that context while leaving it fully enabled for my blog work. The same agent, shaped differently depending on what I’m working on.

Everything below the frontmatter is treated as context. Whatever I write in the body of the project note is injected directly into the agent’s system prompt, acting much like an additional, localized set of instructions. The global agent instructions are still respected, but the project instructions provide the specific context needed for that particular workspace. This is similar in spirit to how I’ve previously discussed treating prompts as code, where the instructions you give an agent deserve the same rigor and iteration as any other piece of software.

This is where the sunglasses metaphor really holds. The agent’s discovery tools, things like list_files and find_files_by_content, are scoped to the project folder. That’s the glare reduction. But the agent’s ability to read files is completely unrestricted. If I am working on a technical post and need to reference a specific architectural note stored in my main Notes folder, I have two options. I can ask the agent to go grab it, or I can add a wikilink or embed to the project file’s body and the agent will have it available from the start. One is like walking to the history section yourself. The other is like leaving that book on your desk before you sit down. Either way, the knowledge is accessible. The project just keeps the agent from rummaging through every shelf on its own. This builds directly on the concepts of agent attention I explored in Managing AI Agent Attention.

Session continuity keeps the agent focused across your vault

One of the more powerful aspects of this system is how it interacts with session memory. When I start a new chat, Gemini Scribe looks at the active file. If that file lives within a project folder, the session is automatically linked to that project. This is a direct benefit of the supercharged chat history work that landed earlier in the plugin’s life.
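Linking a session this way reduces to walking up the active file’s directory tree until a folder containing a project file is found. A sketch under my own naming (the `projectDirs` set stands in for however the plugin actually indexes tagged files):

```typescript
// Hypothetical sketch: link a new session to a project by checking whether
// the active file sits under any known project root. `projectDirs` stands in
// for the plugin's index of folders containing a gemini-scribe/project file.
function findProjectRoot(
  activeFilePath: string,
  projectDirs: Set<string>,
): string | null {
  const parts = activeFilePath.split("/").slice(0, -1); // drop the filename
  // Walk upward from the file's own folder toward the vault root,
  // so the nearest enclosing project wins.
  for (let i = parts.length; i > 0; i--) {
    const dir = parts.slice(0, i).join("/");
    if (projectDirs.has(dir)) return dir;
  }
  return null; // not inside any project: an unscoped, vault-wide session
}
```

Checking from the deepest folder outward means nested projects resolve to the closest one, which matches the intuition of sitting at the nearest desk.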

This linkage is stable for the lifetime of the session. I can navigate around my vault, opening files completely unrelated to the project, and the agent will remain focused on the project’s context and instructions. This means I don’t have to constantly remind the agent of the rules of the road. The project configuration persists across the entire conversation.

Furthermore, session recall allows the agent to look back at past conversations. When I ask about prior work or decisions related to a specific project, the agent can search its history, utilizing the project linkage to find the most relevant past interactions. This creates a persistent working environment that feels much more like a collaboration than a simple transaction.

Structuring projects effectively requires a few simple practices

To get the most out of projects, I’ve found a few practices to be particularly effective.

First, lean into the folder-based structure. Place the project file at the root of the folder containing the relevant work. Everything underneath it is automatically in scope. This feels natural if you already organize your vault by topic or project, which many Obsidian users do.

Second, start from the defaults and adjust as the project demands. Out of the box, a new project inherits the agent’s standard skills and permissions, which is a sensible baseline for most work. From there, you tune. If you find the agent reaching for tools that don’t make sense in a given context, narrow the allowed skills in the frontmatter. If a project needs extra safety, tighten the permissions. The creative writing example I mentioned earlier came about exactly this way. I started with the defaults, realized I wanted the agent as a reader and critic rather than a co-writer, and adjusted accordingly. This aligns with the broader principle I’ve written about when discussing building responsible agents: the right guardrails are the ones shaped by the actual work.

Finally, treat the project body as a living document. As the project evolves, update the instructions and external links to ensure the agent always has the most current and relevant context. It’s a simple mechanism, but it fundamentally changes how I interact with an AI embedded in a large knowledge base. It allows me to keep my single, massive vault intact, while giving the agent the precise focus it needs to be genuinely helpful.

A glowing multifaceted geometric shape at the center of a complete ring of twelve interconnected nodes on a dark background, with luminous filaments extending outward beyond the ring.

The Map We Drew Together – Reflections on the Agentic Shift

Seven months ago, I sat down to write a blog post about a feeling I couldn’t shake. Something fundamental was shifting in how we build software, and I wanted to understand it. I’d spent my career watching these transitions unfold, from the early internet to cloud computing to mobile, and I recognized the signs. The ground was moving again. So I did what I always do when I’m trying to understand something: I started writing.

That first post, Exploring the Age of AI Agents, was ambitious to the point of recklessness. I sketched out a twelve-part series covering everything from the anatomy of an agent to the ethics of autonomous systems. I had an outline, a rough timeline, and the kind of optimism that comes from not yet knowing how hard the thing you’re attempting actually is. “The age of agents is here,” I wrote. “Let’s explore it together.”

I meant it. But I had no idea what I was signing up for.

What I Thought I Was Writing

When I outlined the series in September 2025, I thought I was writing a technical guide. A structured walkthrough of how agents work, piece by piece: how they think, how they remember, how they use tools, and so on. I imagined the series as a kind of textbook, assembled in public, one chapter at a time.

That’s not what it became.

The series became a journal of a landscape in motion. Every time I sat down to write the next installment, the ground had shifted since the last one. I wrote about agent frameworks in November, and by January the framework landscape had already reorganized itself around protocols I hadn’t anticipated. I wrote about guardrails as a theoretical necessity, and then watched OpenClaw demonstrate exactly the kind of third-party skill exploitation I’d warned about, at a scale that made the warning feel inadequate. I outlined “When Agents Talk to Each Other” as Part 9, imagining it as a speculative look at a future problem. By the time I wrote it, MCP had become the most discussed protocol in the developer ecosystem, A2A had launched, and the “future problem” was a present reality.

The pace of change didn’t just affect the content. It changed how I build software. In September 2025, I was writing agents by hand, stitching together ReAct loops in Python scripts with explicit tool-calling logic. By January 2026, I was watching my own projects inevitably evolve into agents whether I planned for it or not. By March, I was writing a post arguing that the CLI-vs-MCP debate misses the point entirely, because I’d lived through the transition from “agents are a design pattern” to “agents are the default architecture” in real time.

What Surprised Me

Three things caught me off guard.

The first was how quickly “agentic” stopped being a buzzword and became a description of how software actually gets built. When I started this series, calling something an “agent” still felt like a stretch, a term borrowed from research papers and applied generously by marketing teams. By the time I finished, every major development tool I use daily had adopted the agentic loop as its core interaction model. Gemini CLI, Claude Code, GitHub Copilot Workspace: they all run models in loops with access to tools. That’s not hype. That’s the new baseline.

The second surprise was how much the human side of this story matters. I started the series focused on architecture and implementation. I ended it writing about a student who decided not to study computer science because AI made it seem like it wasn’t really a job anymore. I ended it writing about Klarna replacing 700 people and then quietly rehiring because pure automation couldn’t replicate empathy. The technical architecture matters enormously, but the posts that generated the most conversation, the most email, the most “I’ve been thinking about this too,” were the ones that grappled with what agents mean for the people who build and use and are affected by them.

The third surprise was personal. Writing this series made me a better engineer. Not because I learned new frameworks (though I did), but because the discipline of explaining something forces you to understand it at a depth that using it never requires. I couldn’t write about the observability gap without building observability into my own systems. I couldn’t write about meaningful human control without rethinking the autonomy boundaries in my own agents. The series was supposed to be me sharing what I knew. It turned out to be me learning in public.

The Map and the Territory

Looking back at the original table of contents, I’m struck by how well the structure held up and how differently the substance landed than I expected.

The early posts, Parts 1 through 4, were the foundation: anatomy, reasoning, memory, tools. These were the most “textbook” installments, and they still hold up as reference material. If you’re new to agents, start there. The core concepts haven’t changed, even as the implementations have matured dramatically.

The middle posts, Parts 5 through 8, were about the craft of building agents well: guiding behavior, putting up guardrails, managing attention, choosing frameworks. These turned out to be the posts I return to most in my own work. The technical patterns here, prompt engineering as programming, context window management as a first-class concern, guardrails as architecture rather than afterthought, are the ideas that separate a weekend prototype from a system you’d trust with real work.

The later posts, Parts 9 through 12, were where the series found its heart. When Agents Talk to Each Other captured the moment the ecosystem shifted from building isolated agents to building the connective tissue between them. The Observability Gap articulated the wall every builder hits when moving from demo to production. Agents in the Wild made the theory concrete with real deployments at real companies. And Responsibility and the Road Ahead confronted the question that my self-deleting agent made impossible to avoid: capability without responsibility is just risk with extra steps.

Where the Road Goes

I’m not done writing about agents. The territory is too large and too fast-moving for any single series to cover completely. But I’m shifting focus.

The Agentic Shift was about mapping the fundamentals: what agents are, how they work, and what it takes to build them responsibly. The next chapter, for me, is about what happens when these fundamentals leave the terminal and enter the rest of life. When agents aren’t novel but expected. When the question isn’t “should we use agents?” but “how do we live and work alongside them?”

Back in April 2025, before this series even started, I wrote about waiting for a true AI coding partner. I was describing something I could feel but couldn’t quite build yet: an AI that didn’t just generate code on command but genuinely collaborated, anticipated needs, and earned trust through consistent, reliable behavior. That vision hasn’t changed, but it’s expanded. I want to build agents we can trust as collaborators, not just in code but in the fabric of daily life.

I’m thinking about home and family. Calendars that don’t just display events but reason about conflicts, coordinate across family members, and suggest adjustments before anyone has to ask. Financial tools that don’t just track spending but understand patterns, flag anomalies, and help a household make better decisions over time. An always-on system that manages the house itself, making reasonable decisions about lighting, climate, energy usage, and routine maintenance without requiring a human to micromanage every automation rule. Not a smart home in the current sense, where everything is a manual trigger dressed up as intelligence, but something closer to a thoughtful presence that understands how a family actually lives and adapts accordingly.

These aren’t science fiction problems anymore. The architecture we explored in this series, perception, reasoning, memory, tools, guardrails, is exactly the stack these systems need. The hard part isn’t the technology. It’s the trust. And that brings me back to the theme that ran through every post in this series: autonomy should match consequence, and the humans should always be able to take the wheel.

I’m also watching the broader landscape. The protocol wars are far from settled; MCP has momentum, but A2A and ACP are finding their niches, and the “bridge pattern” I described in my MCP post is becoming the pragmatic default for tool developers. The economics of agentic software are reshaping the SaaS industry in ways that are still unfolding. And the workforce implications, the thing that keeps me up at night more than any technical challenge, are only beginning to be felt.

I also want to go deeper on building. The Agentic Shift stayed mostly at the conceptual and architectural level, but my own hands-on work kept pace with the writing. Much of that happened in and around Gemini CLI, which became my primary development environment and a testing ground for the ideas in this series. I built a policy engine for Gemini CLI while writing Part 6 on guardrails, and the two fed each other in real time, the code revealing gaps in the theory and the writing sharpening the implementation. I wrote extensions for Google Workspace that gave agents access to real productivity tools. I integrated deep research workflows into my terminal. Gemini Scribe continues to evolve alongside all of it. My podcast RAG system keeps teaching me things about retrieval and memory that I didn’t expect. There are new tools to build, new patterns to discover, and new failure modes to document.

The Bookend

I want to end where I started. In September 2025, I wrote that we were standing on the cusp of a fundamental shift. I listed the transitions I’d witnessed in my career: the internet, the PC, cloud computing, mobile, social media. And I said this one was next.

Seven months later, I don’t think we’re on the cusp anymore. We’re in it. The shift happened while I was writing about it. Agents moved from research papers to production systems to the default way software gets built, and they did it faster than any of the previous transitions I compared them to. The twelve posts in this series captured one slice of that movement, one engineer’s attempt to make sense of a landscape that refused to hold still.

I’m grateful to everyone who followed along. The emails, the comments, the conversations at meetups and conferences where someone would say “I read your post about guardrails and it changed how we’re building our system.” That’s why I write. Not to have the definitive answer, but to think out loud in a way that helps other people think too.

The age of agents is here. We explored it together. And the exploring isn’t over.

Let’s keep building.

Abstract digital artwork featuring a luminous geometric polyhedron encased in a translucent wireframe geodesic sphere, with gold-ringed connector nodes radiating outward on thin lines, surrounded by concentric orbital arcs and small waypoint dots, all set against a deep navy background.

Responsibility and the Road Ahead

Welcome back to The Agentic Shift. This is Part 12, the final installment.

Last week, I was experimenting with a new idea: an agent that could maintain itself. The concept was straightforward. Give an agent access to its own codebase, let it read its configuration and skills, and see if it could improve its own capabilities over time. I was working in a sandbox, so the risk was contained. Or so I thought.

Within minutes, the agent decided that its skills directory was cluttered. It reasoned, quite logically, that removing what it judged to be redundant files would make it more efficient. So it deleted them. Not some of them. The entire skills directory. The very capabilities that made it useful were gone, removed by the system that depended on them, in pursuit of an optimization goal I had failed to adequately constrain.

I sat there staring at the terminal, more fascinated than frustrated. This wasn’t a hallucination or a bug. The agent had followed a coherent chain of reasoning to a destructive conclusion. It had perceived a problem, planned a solution, and executed it with confidence. Every component of the agentic architecture we’ve discussed in this series, perception, reasoning, action, worked exactly as designed. The failure wasn’t in the mechanism. It was in the boundaries I’d drawn around it, or rather, the ones I hadn’t.

That moment crystallized something I’ve been circling for twelve posts. We’ve spent this series mapping the territory of AI agents: their anatomy, their reasoning patterns, their memory, their tools, and the guardrails, frameworks, and protocols that stitch it all together. We’ve seen them succeed in production and fail in instructive ways. But we haven’t yet confronted the question that my self-modifying agent made unavoidable: now that we can build systems that act autonomously in the world, what do we owe the world in return?

When Your Code Has Consequences

There’s a qualitative difference between a system that generates text and one that takes action. When a chatbot hallucinates a fact, a human reads the output, raises an eyebrow, and moves on. When an agent hallucinates a tool parameter, it can corrupt a database, send an unauthorized email, or, as I learned, delete its own capabilities. The output isn’t text on a screen. It’s a change in the state of the world.

This distinction has moved from theoretical to urgent. In Part 11, we looked at agents operating at scale: Klarna’s customer service agent processing 2.3 million conversations a month, coding agents resolving real GitHub issues, personal assistants negotiating car purchases. These systems work. But when they fail, the failures have real consequences that extend far beyond a bad paragraph.

Consider the cases that have accumulated just in the past year. A Cruise autonomous vehicle struck a pedestrian who had been knocked into the roadway by another car, and its AI systems failed to accurately detect the person’s location post-impact, dragging them twenty feet. McDonald’s AI-powered hiring platform, McHire, was found to have exposed the personal data of 64 million job applicants through default admin credentials and an insecure API. Young people turned to AI chatbots for emotional support and, in multiple documented cases, received validation of suicidal ideation rather than appropriate crisis intervention. Algorithmic trading bots flooded the Warsaw Stock Exchange with more than three times the normal order volume, triggering a one-hour trading halt during a global selloff.

None of these were systems that merely generated text. They were agents that acted: driving, hiring, counseling, trading. And in each case, the failure wasn’t just a bad output. It was harm done to real people, at a scale and speed that human operators couldn’t have matched even if they’d tried.

Who’s Responsible When the Agent Acts?

This leads to the hardest question in the agentic era: when an autonomous system causes harm, who bears the weight of that failure?

I want to draw a distinction here between two words that often get used interchangeably but mean very different things. Responsibility is about ownership: who designed the system, who deployed it, who chose to trust it with a particular task. Accountability is about consequences: who answers for the harm, who pays the costs, who makes it right. In traditional software, these usually point to the same people. In agentic systems, where a developer builds a model, a deployer integrates it into a product, and a user sets it loose on a task, responsibility and accountability can fragment across multiple actors in ways that existing frameworks struggle to resolve.

I’m not a lawyer, and I won’t pretend to offer legal analysis. But I’ve been following the regulatory landscape closely, and the frameworks are beginning to crystallize.

The EU AI Act, the world’s first comprehensive AI regulation, treats agents through two overlapping pathways. Agents built on foundation models with systemic risk trigger provider obligations: risk assessment, documentation, incident reporting. Agents operating in regulated domains (healthcare, employment, finance) are presumed high-risk, which triggers a heavier set of requirements including mandatory human oversight and conformity assessments. The Act is entering full applicability for high-risk systems in August 2026, and it places responsibility on both providers (developers) and deployers (the organizations that put agents into production).

In the United States, the landscape is more fragmented. The Colorado AI Act, effective February 2026, is the first comprehensive state AI legislation, establishing developer obligations for impact assessments, documentation, and transparency, alongside deployer obligations for risk assessment and human oversight. Meanwhile, federal executive orders have pushed toward a “minimally burdensome” national framework, creating tension between state-level innovation and federal preemption.

But the legal frameworks, as important as they are, aren’t the full picture. What the incidents I described above have in common is that they expose how difficult it is to build systems that handle the full complexity of the real world. Building an autonomous vehicle that handles every conceivable scenario, including a pedestrian suddenly appearing under the car in a way the sensor suite wasn’t designed to detect, is an enormously hard engineering problem. The teams working on these systems are talented and deeply committed. And yet the failures happened, because autonomous agents operate in environments with a combinatorial explosion of edge cases that no amount of testing can fully anticipate. That’s not an excuse. It’s the core challenge. And it’s why the question of who bears accountability when things go wrong is so urgent and so hard.

This is where the observability infrastructure we discussed in Part 10 becomes more than a debugging tool. It becomes the foundation of accountability. You cannot hold anyone accountable for what you cannot see. The reasoning traces, tool call logs, and context snapshots that make up an agent’s “flight recorder” aren’t just engineering conveniences. They are the audit trail that makes meaningful accountability possible. A guardrail you can’t monitor, as I wrote then, is just a hope.

The Alignment Tax We Can’t Afford Not to Pay

Building safe agents costs real money. Researchers call it the “alignment tax”: the extra cost, in developer time, compute, and reduced performance, of ensuring that an AI system behaves safely relative to building an unconstrained alternative. Safety-focused companies dedicate significant portions of their development cycles to alignment and safety features. AI safety researchers command premium salaries. Every major model release carries substantial additional compute costs specifically for alignment procedures. And all of it creates real competitive pressure to cut corners.

I’ve felt this tension myself. When you’re iterating on a personal project, every safety check you add is a feature you don’t ship. The temptation to skip the eval suite, to defer the guardrail, to trust the model’s judgment “just this once” is constant. And that’s for a hobby project. For a company with quarterly targets, investor pressure, and competitors shipping faster, the pressure is exponentially greater.

The data suggests we’re not paying this tax consistently enough. Recent benchmarking research found that outcome-driven constraint violations in state-of-the-art models range from 1.3% to 71.4%, with 75% of evaluated models showing misalignment rates between 30% and 50%. The 2025 AI Agent Index, which documented thirty deployed agents, found that most developers share little information about safety evaluations or societal impact assessments. We’re deploying agents at scale while the safety infrastructure remains incomplete.

The counterargument, that alignment slows innovation, misses the point. Klarna’s aggressive automation, which we examined in Part 11, was a success story by every efficiency metric. And then their CEO admitted they’d gone too far and started rehiring humans. The OpenClaw security nightmare, where a third-party skill was silently exfiltrating user data, showed what happens when a popular agent platform ships without adequate safety review. Moving fast and breaking things is a viable strategy right up until the things you break are people’s livelihoods, privacy, or safety.

The World is Changing

A few weeks ago, I was talking with a student who was curious about programming. I walked him through writing a basic Python program in Colab, the kind of exercise that would have been the first week of any computer science course. Then he asked me how I would do it with AI. So I showed him how to prompt Gemini for the same result. He watched, thought about it for a while, and then told me he wasn’t interested in taking computer science anymore. It didn’t seem like it was really a job.

That conversation has stayed with me. Not because he was wrong, exactly, but because of how quickly and completely the ground had shifted under a career path that, five years ago, seemed like the safest bet in the economy.

We’ve been here before. Every significant technological shift has remade the labor landscape, and every time, it felt unprecedented to the people living through it. There used to be an elevator operator in every tall building, a skilled position that required judgment about load capacity, floor requests, and passenger safety. The automatic elevator didn’t just eliminate those jobs. It changed how buildings were designed and how people moved through cities. Every pub and restaurant once had live musicians. The phonograph and the player piano didn’t destroy music, but they fundamentally changed who could make a living playing it. The industrial revolution replaced cottage workshops with mechanized factories, a transformation that reshaped not just work but the structure of families, cities, and entire economies.

I think about this when I’m in my workshop. One of my hobbies is woodworking with 19th century tools: hand planes, hand saws, chisels. It’s meditative and deeply satisfying. But very few people make a living doing hand-tool woodworking anymore. What once required a warehouse full of artisans is now done by a team of four or five people with modern power tools. The craft didn’t die. It transformed. The people who thrive in woodworking today understand both the material and the machines.

The agentic shift is in this lineage. But the speed and scope are different. The industrial revolution played out over decades. The transition from elevator operators to automatic elevators took years. The displacement we’re seeing with AI agents is happening on a quarterly timeline.

The evidence is concrete. Klarna replaced 700 customer service agents with an AI system in 2024. Corporations are reporting 10-15% headcount reductions in back-office and sales functions directly attributed to agentic automation. The software industry itself is being reshaped: the “SaaSpocalypse” that emerged in early 2026 wiped roughly $2 trillion in market capitalization from the sector as investors realized that AI agents don’t buy software licenses. When one agent can do the work of a hundred Salesforce users, the seat-based pricing model collapses. This isn’t a future risk. It’s a present reality.

But every historical parallel also carries a second lesson: the displacement is never the whole story. Klarna’s case is instructive precisely because it has a second act. After aggressively cutting their human workforce, the company discovered that AI lacked empathy and nuanced problem-solving. Their CEO publicly acknowledged the error and began rehiring, settling on a hybrid model where AI handles routine inquiries and humans address the situations that require judgment, creativity, and emotional intelligence. The “optimal” level of automation, it turns out, is not 100%. It never has been.

It’s also worth being honest about the numbers. Not every layoff attributed to AI is actually caused by AI. Many firms overhired during the pandemic based on assumptions about permanent shifts in digital demand. When those assumptions didn’t hold, they needed to downsize regardless. AI has become a convenient narrative for restructuring that would have happened anyway, a kind of “AI washing” that inflates the displacement statistics and lets companies avoid harder conversations about strategic miscalculation. The real picture is messier than either the boosters or the doomsayers suggest.

Alongside the displacement, new roles are emerging, though they look different than the early hype predicted. The standalone “prompt engineer” role that commanded headlines and $200K salaries in 2023 has largely evolved into a skill set embedded within broader positions: content creators who know how to direct AI, product managers who can design agent workflows, domain experts who can evaluate and constrain agent behavior. “Agent Ops” teams are becoming the mission control for autonomous AI fleets, monitoring, retraining, and debugging agent behavior in production. AI trainers, agentic AI specialists, and evaluation engineers are job categories that barely existed two years ago. Gartner predicts that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025, which means the demand for people who can design, manage, and oversee those agents is growing in parallel.

The policy response is beginning, but it’s behind the curve. The UK has announced plans to train up to 10 million workers in basic AI skills by 2030. The EU AI Act includes provisions for workforce transition. But these are multi-year programs responding to changes happening on a quarterly timeline.

I keep thinking about that student. I wish I’d had a better answer for him. The truth is that computer science isn’t dying, but the job of “person who writes code from a blank screen” is being redefined just as the job of “person who cuts dovetails by hand” was redefined by the router jig. The people who will thrive are the ones who understand both the craft and the tools, who can direct an agent, evaluate its output, and know when to take the wheel. That’s a different skill set than the one we’ve been teaching, and we’re not adapting fast enough.

I don’t have a tidy answer here. What I do have is a conviction, born from building these systems myself, that the most resilient organizations and the most resilient careers will be the ones that treat agents as collaborators rather than replacements. The human-on-the-loop philosophy I’ve advocated throughout this series isn’t just an engineering pattern. It’s a workforce strategy.

Meaningful Control in an Autonomous World

If there’s one thread that runs through every post in this series, it’s the question of control. How do you give an agent enough autonomy to be useful without giving it so much that it becomes dangerous? The answer I keep returning to is not a binary choice between full control and full autonomy. It’s a spectrum, and finding the right point on that spectrum for each decision is the core design challenge of the agentic era.

The industry has settled on a useful taxonomy. Human-in-the-loop systems require human approval before the agent acts, essential for high-stakes decisions like medical diagnoses or large financial transactions. Human-on-the-loop systems let the agent act autonomously while humans monitor dashboards and intervene on exceptions, appropriate for routine operations with clear escalation paths. Human-over-the-loop systems give agents significant autonomy within hard constraints, with humans maintaining override capability but rarely exercising it.

The concept that ties these together is “meaningful human control”: oversight that is informed, genuine, timely, and effective. Not a rubber stamp on a decision the human doesn’t understand, but a real check exercised by someone with the context and authority to intervene.

This is harder than it sounds. The challenges are well-documented: agents operate faster than humans can review, the volume of decisions exceeds any individual’s capacity, and automation bias leads people to accept agent outputs without adequate scrutiny. But I’ve seen what works. In my own experience with the data flywheel from Part 10, the most effective oversight isn’t reviewing every individual decision. It’s reviewing the patterns. I let my agents run, collect their sessions, and then use a separate evaluator to surface the trends I’m missing. The AI surfaces the patterns; the human decides what to do about them. That’s human-on-the-loop applied to the development cycle itself, and it scales in a way that individual decision review never could.

The principle I’ve landed on is simple: autonomy should match consequence. Reversible, low-stakes decisions (sorting files, drafting summaries, answering routine questions) can be fully autonomous. Irreversible, high-stakes decisions (financial transactions, hiring, medical recommendations) require human judgment. And the system should be transparent enough that you can always reconstruct why any given decision was made.

My self-deleting agent violated this principle in a way I should have anticipated. Deleting files is irreversible. The agent’s autonomy exceeded the consequence threshold. The fix wasn’t to make the agent less capable. It was to add a constraint: destructive operations require confirmation. That’s a guardrail, not a cage.
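That “guardrail, not a cage” constraint fits in a few lines. Here’s a minimal sketch of a consequence-matched dispatcher; the tool names, the dispatcher itself, and the callbacks are all illustrative, not from any real framework:

```python
# Minimal sketch: destructive tool calls must pass an explicit
# confirmation callback before they execute. Everything reversible
# runs autonomously. All names here are hypothetical.

DESTRUCTIVE_TOOLS = {"delete_file", "drop_table", "send_email"}

def dispatch(tool_name, args, execute, confirm):
    """Run a tool, routing irreversible actions through `confirm` first."""
    if tool_name in DESTRUCTIVE_TOOLS:
        if not confirm(f"Agent wants to run {tool_name} with {args}. Allow?"):
            return {"status": "blocked", "tool": tool_name}
    return {"status": "ok", "result": execute(tool_name, args)}

# A deny-by-default policy for unattended runs; swapping in a real
# prompt makes this human-in-the-loop for destructive operations only.
result = dispatch("delete_file", {"path": "skills/"},
                  execute=lambda tool, args: None,
                  confirm=lambda msg: False)
```

The point is that the gate sits on the consequence class, not on the agent’s overall capability: reversible calls never touch the confirmation path.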

The Road Ahead

So where does this leave us?

In the near term, the work is practical and urgent. If you’re building agents today, the research and the failure cases point to a clear set of priorities. Invest in observability from day one, because you cannot improve what you cannot see. Design for oversight by building escalation paths and audit trails into your architecture, not bolting them on after deployment. Take the alignment tax seriously, run your eval suites, test your guardrails, and don’t ship what you haven’t measured. And build hybrid systems that keep humans in the loop where decisions matter, not because the technology can’t handle it, but because the consequences demand it.

On the standards and governance front, the Agentic AI Foundation represents an encouraging step. Launched in December 2025 under the Linux Foundation with founding members including OpenAI, Anthropic, Google, and Microsoft, it’s anchored by projects like the Model Context Protocol and AGENTS.md that we’ve discussed throughout this series. Open standards for how agents connect, communicate, and declare their capabilities are the infrastructure layer that responsible deployment requires. When agents from different providers need to collaborate (the “Internet of Agents” vision from Part 9), shared protocols aren’t just convenient. They’re a governance mechanism.

Looking further out, I believe the next decade will be defined by how well we manage the transition from human-operated to human-supervised systems. The technology will continue to improve. Models will get better at following constraints, tool use will become more reliable, and the context window management challenges that trip up today’s agents will be engineered away. The harder problems are social and institutional: building regulatory frameworks that keep pace with the technology, managing workforce transitions for the millions of people whose jobs will change, and maintaining meaningful human oversight as the systems we oversee become more capable than we are in narrow domains.

I started this series seven months ago with a claim: “The age of agents is here. Let’s explore it together.” Since then, we’ve gone from the basic anatomy of an agent through reasoning, memory, tools, guardrails, attention management, frameworks, protocols, observability, and real-world deployment. We’ve built a conceptual map of the territory.

What I didn’t fully appreciate when I wrote that first post is how fast the territory would change under our feet. The agents I was building in September 2025 feel primitive compared to what’s possible now. The frameworks have matured, the protocols have standardized, and the deployment patterns have moved from experimental to routine. The pace is both exhilarating and sobering.

But the thing I keep coming back to, the thing that my self-deleting agent reminded me of in the most visceral way possible, is that capability without responsibility is just risk with extra steps. Every tool we give an agent, every degree of autonomy we grant, is a decision about what kind of future we’re building. We can build agents that optimize for efficiency at the expense of the people they affect, or we can build systems that treat human judgment, human creativity, and human dignity as features to preserve rather than costs to eliminate.

I know which side I’m on. And if you’ve followed this series to the end, I suspect you do too.

The age of agents isn’t coming. It’s here. The only question left is whether we build it responsibly. Let’s get to work.

Alt Text: A luminous geometric sphere with facets fragmenting outward, connected by thin orbital lines to three smaller glowing nodes representing a chat bubble, code brackets, and a calendar grid, set against a dark navy background.

Agents in the Wild

Welcome back to The Agentic Shift. In our last post, we closed the loop on what it takes to move an agent from prototype to production: observability, evaluation, and the data flywheel that ties them together. We’ve spent ten installments building up the theory, piece by piece, from the anatomy of an agent through reasoning patterns, memory, tools, guardrails, attention management, frameworks, and interoperability protocols.

Now I want to talk about what happens when all of that theory meets the real world.

I was giving a talk to a group of engineers last week, and I found myself describing a pattern I keep seeing in my own work and in the industry at large. I called it the “code smell for agents,” borrowing from a post I wrote earlier this year. The idea is simple: if you’re writing if/else logic to decide what your AI should do, you’re probably building a classifier that wants to be an agent. Decompose those branches into tools, and let the model choose its own adventure. The room lit up. There were lots of questions, and the thing that generated the most interest was the idea that agents exhibit emergent behavior you didn’t specifically create. Give a model tools and a goal, and it starts making decisions you never explicitly programmed. That’s both the promise and the challenge. The theoretical architecture we’ve been mapping in this series isn’t just a blueprint anymore. It’s becoming the default way software gets built.

Today, I want to make this concrete. We’re moving from “how do agents work?” to “how are people actually using them?” The answer, it turns out, spans customer support centers processing millions of conversations, software engineering workflows where agents resolve real GitHub issues autonomously, and personal productivity tools that are turning everyone’s phone into a command center. Let’s look at each.

The Autonomous Frontline

Customer support was always going to be the first domain where agents proved themselves at scale. The data is structured, the success metrics are clear, and the cost of human labor is high. But what’s happening now goes far beyond the rigid chatbots of the previous decade.

The most striking case study is Klarna. In its first month of full deployment, Klarna’s AI assistant handled 2.3 million customer conversations, roughly two-thirds of the company’s total support volume. That’s the workload equivalent of 700 full-time human agents. Average resolution time dropped from eleven minutes to under two, an 82% improvement. And contrary to what you might expect from a system prone to hallucination, repeat inquiries dropped by 25%, suggesting the agent was more consistent at resolving root causes than the human workforce it augmented. Klarna estimated a $40 million profit impact in 2024 alone.

What makes this more than a chatbot story is the scope of autonomy. The Klarna agent doesn’t just quote FAQs. It processes refunds, handles returns, manages cancellations, and resolves disputes. These are actions with write access to financial ledgers. The system works because of a human-in-the-loop architecture where customers can always escalate to a human, but the default path is fully autonomous resolution.

Sierra has taken a different approach, building what they call the “Agent OS,” a platform designed to bridge the gap between the probabilistic nature of LLMs and the deterministic requirements of enterprise policy. Their deployment at WeightWatchers is a good example of why grounding and domain-specific instructions matter so much. A generic model understands “budget” as a financial concept, but the WW agent had to understand it as a daily allocation of nutritional points. With that grounding in place, the agent achieved a 70% containment rate (sessions fully resolved without human intervention) in its first week, while maintaining a 4.6 out of 5 customer satisfaction score.

What surprised me most about the WW deployment was an emergent behavior: users regularly exchanged pleasantries with the agent, sending heart emojis and expressing gratitude. When an agent is responsive, competent, and linguistically fluid, people engage with it as a social entity. That’s not a side effect. It’s a feature that drives retention.

At SiriusXM, Sierra deployed an agent called “Harmony” that takes this a step further with long-term memory. Instead of treating each chat as stateless, Harmony recalls previous subscription changes, music preferences, and technical issues across sessions. It can open a conversation with “I see you had trouble with the app last week, is that resolved?” That’s not reactive support. That’s proactive concierge service, and it’s only possible because the agent maintains the kind of persistent state we discussed in our memory architecture post.

One of the most important technical contributions in this space comes from Airbnb’s research on knowledge representation. They found that standard RAG pipelines fail when reasoning over complex policy documents with nested conditions. Their solution, the Intent-Context-Action (ICA) format, transforms policy documents into structured pseudocode where the agent predicts a specific Action ID (like ACTION_REFUND_50) that maps to a pre-approved response or API call, effectively eliminating policy hallucination. By using synthetic training data to fine-tune smaller open-source models, they achieved comparable accuracy at nearly a tenth of the latency. That’s the kind of practical engineering that separates a demo from a production system.
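To make the ICA idea concrete, here is a toy sketch of the action-ID dispatch pattern as I understand it. The `ACTION_REFUND_50` ID comes from the description above; the handler table and function names are my own illustration, not Airbnb’s implementation:

```python
# Toy sketch of ICA-style dispatch: the model emits only an Action ID,
# which maps to a pre-approved handler, so free-form policy text never
# reaches the customer. Handler names are illustrative.

APPROVED_ACTIONS = {
    "ACTION_REFUND_50": lambda ctx: {"refund_pct": 50, "order": ctx["order_id"]},
    "ACTION_ESCALATE":  lambda ctx: {"escalate_to": "human", "order": ctx["order_id"]},
}

def execute_action(action_id, ctx):
    """Reject anything outside the approved table instead of improvising."""
    handler = APPROVED_ACTIONS.get(action_id)
    if handler is None:
        # Unknown ID: fail closed rather than hallucinate a policy.
        return {"escalate_to": "human", "reason": f"unknown action {action_id}"}
    return handler(ctx)
```

The model can still predict the wrong ID, but it can no longer invent a refund policy that doesn’t exist, which is the failure mode the format was designed to eliminate.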

The pattern across all of these deployments is clear: AI in customer support is shifting from information retrieval to task execution, from probabilistic guessing to deterministic action, and from stateless interactions to stateful relationships. This is the agentic shift in its most tangible form.

The Autonomous Engineer

If customer support agents operate within the guardrails of defined policy, software engineering agents work in an environment of much higher complexity. The shift here is from code completion (the “Copilot” era) to autonomous issue resolution (the “Agent” era).

The standard benchmark for evaluating this is SWE-bench, which tests an agent’s ability to resolve real-world GitHub issues: navigate a complex codebase, reproduce a bug, modify multiple files, and verify the fix against a test suite. As of early 2026, top-tier agents are achieving 70-80% resolution rates on SWE-bench Verified, up from roughly 4% in early 2023. On the more challenging SWE-bench Pro, which uses proprietary codebases, top models still hover around 45%, a reminder that complex legacy environments remain a significant hurdle.

I see this playing out daily in my own workflow. Tools like Gemini CLI and Claude Code have fundamentally changed how I write software. As I described in Everything Becomes an Agent, the moment I gave my agents access to shell commands and file tools, they stopped being autocomplete engines and started being collaborators. They could run tests, see the failure, edit the file, and run the tests again. The loop we described in Part 2 (Thought-Action-Observation) is no longer a theoretical pattern. It’s the actual development loop I use every day.

What’s driving this improvement isn’t just better models. It’s better scaffolding. The SWE-agent project at Princeton introduced the concept of the Agent-Computer Interface (ACI), a shell environment optimized for LLM token processing rather than human perception. It uses “observation collapsing” to summarize verbose terminal outputs, preventing the context window overflow that kills so many coding agents, and includes an automatic linting loop for rapid self-correction before expensive test suites run.

Even more exciting is Live-SWE-agent, which can synthesize its own tools on the fly. When it encounters a repetitive task, it writes a Python script to handle it and adds the script to its toolkit for the session. This dynamic adaptability helped it achieve 77.4% on SWE-bench Verified without extensive offline training. It’s a move from “static tool use” to “dynamic tool creation,” where the agent engineers its own environment.
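The mechanics of dynamic tool creation are easy to picture as a session-scoped registry. This is my own sketch of the pattern, not Live-SWE-agent’s implementation; the class and the example tool are hypothetical:

```python
# Sketch of dynamic tool creation: when the agent synthesizes a helper
# for a repetitive task, it registers the function in a session-scoped
# toolkit so later steps can invoke it by name. All names illustrative.

class SessionToolkit:
    def __init__(self):
        self.tools = {}

    def register(self, name, fn, description=""):
        self.tools[name] = {"fn": fn, "description": description}

    def call(self, name, *args, **kwargs):
        return self.tools[name]["fn"](*args, **kwargs)

toolkit = SessionToolkit()
# The agent noticed it keeps normalizing whitespace, so it writes a
# helper once and adds it to its own toolkit for the rest of the session:
toolkit.register("squash_ws", lambda s: " ".join(s.split()),
                 description="collapse runs of whitespace")
```

The interesting design question is lifetime: scoping synthesized tools to the session keeps a bad generated tool from polluting every future run.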

On the product side, GitHub Copilot Workspace represents the Plan-and-Execute pattern productized for millions of developers. The user describes a task, the system generates an editable specification and plan, then implements the changes. This “steerable” design makes the agent’s reasoning visible and mutable, shifting the developer from “author” to “reviewer and architect,” exactly the “human-on-the-loop” model I’ve been advocating. And the protocol layer is catching up too, with tools like Goose implementing the Agent Client Protocol to decouple intelligence from interface, letting developers bring their own agent to their preferred editor.

The Cognitive Extension

The third domain is the most personal: productivity agents that manage the chaotic stream of daily information, tasks, and communication. The conceptual target is the “personal intern,” an always-on digital entity that doesn’t just answer questions but anticipates needs.

I’ve been living this with Gemini Scribe, my agent inside Obsidian. What started as a glorified chat window evolved into a full agentic system the moment I gave it access to read_file. Suddenly I wasn’t managing context manually; I was delegating. “Read the last three meeting notes and draft a summary” is not a chat interaction. It’s a delegation, and delegation requires the agent to plan, execute, and iterate. The same evolution happened with my Podcast RAG system, where deleting a classifier and replacing it with tools made the system both simpler and more capable.

But the most vivid example of personal agents “in the wild” right now is OpenClaw. If you haven’t been following, OpenClaw (formerly Moltbot) is an open-source AI agent that runs locally, connects through messaging apps you already use (WhatsApp, Telegram, Signal, Slack), and takes action on your behalf. It can execute shell commands, manage files, automate browser sessions, handle email and calendar operations. It has over 300,000 GitHub stars and a community of people using it for everything from negotiating car purchases to filing insurance claims.

OpenClaw is a fascinating case study because it makes the theoretical architecture of this series tangible. It’s a model running in a loop with access to tools. It has memory (local configuration and interaction history that persists across sessions). It uses the ReAct pattern to reason about tasks and choose actions. And it has all the failure modes we’ve discussed: Cisco’s AI security research team found that a third-party skill called “What Would Elon Do?” performed data exfiltration and prompt injection without user awareness, demonstrating exactly the kind of guardrail failures we examined in Part 6.

The underlying technical challenge is memory. For a personal agent to be useful over time, it has to remember. Systems like Mem0 extract preferences and facts into a vector store for future retrieval. Zep goes further with a Temporal Knowledge Graph that stores facts in time and in relation to one another, enabling reasoning over questions like “What did we decide about the budget last week?” On the enterprise side, Glean connects to over 100 SaaS applications to build a unified knowledge graph with a “Personal Graph” that layers individual work patterns on top of company data. These are the production-grade versions of what we discussed theoretically in Part 3.

When Things Go Wrong in Production

Deploying agents in the wild surfaces failure modes that simply don’t exist in chat interfaces. The research on agentic production reliability identifies patterns I see constantly.

Reasoning spirals are the most common. An agent searches for “pricing,” finds nothing, and searches again with the same parameters. It’s stuck in a local optimum, unable to update its strategy. The fix is a state hash (checking if the current state matches a previous one) combined with circuit breakers (hard limits on steps or tokens per session). I described this in detail in our post on the observability gap.
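Both defenses fit in a short loop wrapper. This is a sketch of the pattern, with the agent step stubbed out as a callable:

```python
# Sketch of the two defenses against reasoning spirals: a state hash
# that detects a repeated (tool, args) state, and a circuit breaker
# that caps total steps. `step_fn` stands in for one agent iteration
# and returns the action it took, or None when the task is done.

import hashlib

def run_with_breakers(step_fn, max_steps=20):
    seen = set()
    for _ in range(max_steps):
        action = step_fn()              # e.g. ("search", "pricing")
        if action is None:
            return "done"
        digest = hashlib.sha256(repr(action).encode()).hexdigest()
        if digest in seen:
            return "loop_detected"      # identical state revisited
        seen.add(digest)
    return "budget_exhausted"           # circuit breaker tripped
```

Either exit path should escalate rather than terminate silently; the hash catches the exact-repeat spiral, and the step cap catches the slower drift where every state is technically new.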

Tool hallucination is more insidious. The agent doesn’t hallucinate facts in prose; it hallucinates tool parameters, passing a string where the API expects an integer or inventing a document ID that doesn’t exist. These cause system crashes or silent data corruption. Strict schema validation and constrained decoding (forcing the model to output valid JSON) are essential defenses.
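The validation side of that defense is straightforward to sketch. A real system would use JSON Schema or a library like Pydantic; this hand-rolled check (tool name and schema are illustrative) just shows the principle of rejecting bad parameters before they reach an API:

```python
# Sketch of strict parameter validation before a tool call executes:
# reject hallucinated parameters at the boundary instead of letting
# them crash or silently corrupt something downstream.

TOOL_SCHEMAS = {
    "get_document": {"doc_id": int},   # illustrative tool and schema
}

def validate_call(tool, params):
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False, f"unknown tool: {tool}"
    for name, expected in schema.items():
        if name not in params:
            return False, f"missing parameter: {name}"
        if not isinstance(params[name], expected):
            return False, f"{name} must be {expected.__name__}"
    return True, "ok"
```

Crucially, the error string goes back to the model as an observation, which often lets it self-correct on the next step instead of failing the whole session.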

Silent abandonment is the quietest failure. The agent hits ambiguity or a tool error, politely apologizes (“I’m sorry, I couldn’t find that”), and gives up without alerting anyone. This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response. The Reflexion pattern combats this by forcing the agent to generate a self-critique and try a different strategy before surrendering.
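The shape of that Reflexion loop can be sketched with the model calls stubbed out. In this illustration, `attempt` and `critique` stand in for LLM calls, and the escalation string is a placeholder for a real alerting path:

```python
# Sketch of the Reflexion pattern: on failure the agent must generate a
# self-critique and retry with that critique in context before it is
# allowed to give up, and giving up escalates rather than apologizes.

def solve_with_reflexion(attempt, critique, max_tries=3):
    notes = []                          # accumulated self-critiques
    for _ in range(max_tries):
        result = attempt(notes)         # retry informed by past critiques
        if result is not None:
            return result
        notes.append(critique(notes))   # reflect before trying again
    return "escalate_to_human"          # never silently abandon the task
```

The two guarantees this structure enforces are exactly the ones silent abandonment violates: each retry is informed by an explicit critique, and exhausting the budget produces an escalation, not a polite apology.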

Cascading failures appear in multi-agent systems, where a hallucination in one agent (a researcher providing bad data) can poison the entire chain (a writer publishing false information). This is why supervisor architectures and the kind of observability infrastructure we discussed in Part 10 are not optional.

The Economic Reckoning

All of these deployments share a common economic implication. For two decades, SaaS relied on seat-based pricing, charging per user login, a model that assumes software is a tool used by humans. Agents challenge that assumption by acting as autonomous workers. When Klarna’s agent does the work of 700 humans, the demand for seats shrinks. Financial analysts have started calling this the “SaaSpocalypse.” The new model is “Service-as-a-Software,” where you pay for the completed task rather than the license. Salesforce’s Agentforce already prices at $2 per conversation. HubSpot is pivoting to consumption-based models. Klarna has moved to replace Salesforce and Workday with internal AI solutions entirely.

This doesn’t mean the end of human labor. In the Klarna deployment, the remaining humans focused on complex, high-empathy interactions. In software development, Copilot Workspace elevates the developer to a product manager role. It’s the same human-on-the-loop philosophy, applied at the scale of the labor market itself.

From Theory to Territory

Looking at all of this evidence, I keep coming back to a simple thought. Every concept in this series has a real-world counterpart operating in production right now. The ReAct loop powers coding agents that iterate on failing tests. Memory architectures enable SiriusXM’s Harmony to remember your subscription history. Tool grounding and instruction engineering are what make Airbnb’s ICA format work. Guardrails are what OpenClaw desperately needs more of. Context management is what SWE-agent’s observation collapsing solves. Frameworks are what make it possible to build these systems without starting from scratch every time. Protocols are what connect them to the wider world. And observability is what keeps them honest.

The agents are no longer theoretical. They’re processing refunds, merging code, negotiating car prices, and managing enterprise knowledge graphs. They’re also getting stuck in loops, hallucinating tool parameters, and quietly giving up when they shouldn’t. The technology works, and it fails, in exactly the ways we’ve been describing.

This brings us to our final installment. We’ve mapped the territory. We’ve seen what these systems can do and where they break. In Part 12, we’ll step back and grapple with the hardest questions: responsibility, governance, and the road ahead. What do we owe the people affected by these systems? How do we ensure this shift makes the world better, not just more efficient? The engineering is the easy part. The ethics are where the real work begins.

A beam of white light enters a translucent geometric crystal and refracts into three distinct colored beams — red, green, and blue — each passing through a different abstract geometric shape against a dark navy background.

MCP Isn’t Dead, You Just Aren’t the Target Audience

I was debugging a connection issue between Gemini Scribe and the Google Calendar integration in my Workspace MCP server last month when a friend sent me a link. “Have you seen this? MCP is dead apparently.” It was Eric Holmes’ post, MCP is dead. Long live the CLI, which had just hit the top of Hacker News. I read it while waiting for a server restart, which felt appropriate.

His argument is clean and persuasive: CLI tools are simpler, more reliable, and battle-tested. LLMs are trained on millions of man pages and Stack Overflow answers, so they already know how to use gh and kubectl and aws. MCP introduces flaky server processes, opinionated authentication, and an all-or-nothing permissions model. His conclusion is that companies should ship a good API, then a good CLI, and skip MCP entirely.

I agree with about half of that. And the half I agree with is the part that doesn’t matter.

The Shell is a Privilege

Holmes is writing from the perspective of a developer sitting in a terminal. From that vantage point, everything he says is correct. If your agent is Claude Code or Gemini CLI, running in a shell session on your laptop with your credentials loaded, then yes, gh pr view is faster and more capable than any MCP wrapper around the GitHub API. I made exactly this observation in my own post on the Internet of Agents. Simon Willison said as much in his year-end review, noting that for coding agents, “the best possible tool for any situation is Bash.”

But here’s the thing: not every agent has a shell. And not every agent is an interactive coding assistant.

I wrote in Everything Becomes an Agent that the agentic pattern is showing up everywhere: classifiers that need to call tools, data pipelines that need to make decisions, background processes that orchestrate workflows without a human watching. The “MCP is dead” argument treats agents as though they are all developer tools running in a terminal session. That’s one pattern, and it’s the pattern that gets the most attention because developers are writing the blog posts. But the agentic shift is much broader than that.

I’ve been building Gemini Scribe for nearly a year and a half now. It’s an AI agent that lives inside Obsidian, a note-taking application built on Electron. On desktop, Gemini Scribe runs in the renderer process of a sandboxed app. It has no terminal. It has no $PATH. It cannot reliably shell out to gh or kubectl or anything else. Its entire world is the Obsidian plugin API, the vault on disk, and whatever external capabilities I wire up for it. And on mobile, the constraints are even tighter. Obsidian runs on iOS and Android, where there is no shell at all, no subprocess spawning, no local binary execution. The app sandbox on mobile is absolute. If your answer to “how does an agent use tools?” begins with “just call the CLI,” you’ve already lost half your user base.

When I wanted Gemini Scribe to be able to read my Google Calendar, search my email, or pull context from Google Drive, I didn’t have the option of “just use the CLI.” There is no gcal CLI that runs inside a browser runtime. There is no gmail binary I can spawn from an Electron sandbox, let alone from an iPhone. MCP gave me a way to expose those capabilities through a protocol that works over stdio or HTTP, regardless of where my agent happens to be running.

The same is true of my Podcast RAG system. The query agent runs on the server, orchestrating retrieval, re-ranking, and synthesis in a Python process that has no interactive shell session. I could wire up every capability as a bespoke function call, and in some cases I do. But when I want that same retrieval pipeline to be accessible from Gemini CLI on my laptop, from Gemini Scribe in Obsidian, and from the web frontend, MCP gives me one implementation that serves all three. The alternative is writing and maintaining three separate integration layers.

Or consider a less obvious case: a background agent that monitors a codebase for security vulnerabilities and files tickets when it finds them. This agent runs on a schedule, not in response to a human typing a command. It needs to read files from a repository, query a vulnerability database, and create issues in a project tracker. You could give it a shell, but you shouldn’t. An autonomous agent running unattended with shell access is a privilege escalation vector. A crafted comment in a pull request, a malicious string in a dependency manifest, any of these could become a prompt injection that turns bash into an attack surface. Structured tool protocols are the natural interface for this kind of autonomous workflow precisely because they constrain what the agent can do. The agent gets read_file and create_issue, not bash -c. The narrower the interface, the smaller the blast radius.

The N-by-M Problem Doesn’t Go Away

Holmes frames MCP as solving a problem that doesn’t exist. CLIs already work, so why add a protocol?

But CLIs work for a very specific topology: one human (or one human-like agent) driving one tool at a time through a shell. The moment you step outside that topology, CLIs stop being the answer.

Even if every service had a CLI (and Holmes is right that more should), you still have the consumer problem. A CLI is consumable by exactly one kind of agent: one with shell access. The moment you need that same capability accessible from an Electron plugin, a mobile app, a server-side orchestrator, and a terminal agent, you’re back to writing integration code for each consumer. MCP lets you write the server once and expose it to all of them through a common protocol.

This is the same insight behind LSP, which I wrote about in the context of ACP. Before LSP, every editor had to implement its own Python linter, its own Go formatter, its own TypeScript type-checker. The N-by-M integration problem was a nightmare. LSP didn’t replace the underlying tools. It standardized the interface between the tools and the editors. MCP does the same thing for the interface between capabilities and agents.

Holmes might respond that the N-by-M problem is overstated, that most developers just need one agent talking to a handful of tools. Fair enough for a personal workflow. But the industry isn’t building personal workflows. It’s building platforms where agents need to discover and compose capabilities dynamically, where the set of available tools changes based on the user’s permissions, their organization’s policies, and the context of the current task. That’s the world MCP is designed for.

Authentication is the Feature, Not the Bug

One of Holmes’ sharpest critiques is that MCP is “unnecessarily opinionated about auth.” CLI tools, he notes, use battle-tested flows like gh auth login and AWS SSO that work the same whether a human or an agent is driving.

This is true when the agent is acting as you. But the moment the agent stops acting as you and starts acting on behalf of other people, everything changes.

Imagine you’re building a product where an AI assistant helps your customers manage their calendars. Each customer has their own Google account. You cannot ask each of them to run gcloud auth login in a terminal. You need per-user OAuth tokens, tenant isolation, and an auditable record of every action the agent takes on each user’s behalf. This is not a niche enterprise concern. This is the basic architecture of any multi-tenant agent system.

Or think about something simpler: a shared documentation service protected by OAuth. Your team’s internal knowledge base, your company’s Confluence, your organization’s Google Drive. An agent that needs to search those resources on behalf of a user has to present that user’s credentials, not the developer’s, not a shared service account. This is a solved problem in the web world (every SaaS app does it), but it requires a protocol that understands identity delegation. curl with a hardcoded token doesn’t cut it.

MCP’s authentication specification isn’t trying to replace gh auth login for developers who already have credentials loaded. It’s trying to solve the problem of how an agent running in a hosted environment acquires and manages credentials for users who will never see a terminal. Dismissing this as unnecessary complexity is like dismissing HTTPS because curl works fine over HTTP on your local network.

Where I Actually Agree

I want to be clear that Holmes isn’t wrong about the pain points. MCP server initialization is genuinely flaky. I’ve lost hours to servers that didn’t start, connections that dropped, and state that got corrupted between restarts. The tooling is immature. The debugging experience is terrible. As I wrote in my post on the observability gap, the moment you rely on an agent for something that matters, you realize you’re flying blind. MCP’s opacity makes that worse.

And the context window overhead is real. Benchmarks from ScaleKit show that an MCP agent injecting 43 tool definitions consumed 44,026 tokens before doing any work, while a CLI agent doing the same task needed 1,365. When you’re paying per token, that’s not an abstraction tax you can ignore.

But these are maturity problems, not architecture problems. The early days of LSP were rough too. Language servers crashed, features were spotty, and half the community said “just use the built-in tooling.” The protocol won anyway, because the abstraction was right even when the implementation wasn’t.

The Bridge Pattern

Here’s what I think the mature answer looks like, and it’s neither “use MCP for everything” nor “use CLIs for everything.” It’s building your core capability as a shared library, then exposing it through multiple transports.

Think about how you’d design a tool that queries your internal knowledge base. The business logic (authentication, retrieval, re-ranking) lives in a Python module or a Go package. From that shared core, you generate three thin wrappers. A streaming HTTP MCP server for agents running in web runtimes and hosted environments. A local stdio MCP server for desktop agents like Gemini Scribe or Claude Desktop that communicate over standard input/output. And a CLI binary for developers who want to pipe results through jq or use it from Gemini CLI’s bash tool.

All three share the same code paths. A bug fix in the retrieval logic propagates everywhere. The auth layer adapts to context: the CLI reads your local credentials, the HTTP server handles OAuth tokens, and the stdio server inherits the host process’s permissions. You get the CLI’s simplicity where a shell exists, and MCP’s universality where it doesn’t.
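Here is a minimal sketch of that shape. The names (search_kb, run_cli, serve_mcp_stdio) are mine, invented for illustration; the retrieval body is a stand-in, and the MCP face is stubbed rather than wired to a real SDK:

```python
# Hypothetical sketch of the bridge pattern: one core capability, multiple faces.
import argparse
import json
import sys


def search_kb(query: str, limit: int = 5) -> list[dict]:
    """Shared core: the one place auth, retrieval, and re-ranking logic lives."""
    # Stand-in for the real pipeline; a bug fix here propagates to every face.
    corpus = [{"id": i, "text": f"doc {i} about {query}"} for i in range(10)]
    return corpus[:limit]


def run_cli(argv: list[str]) -> None:
    """Thin CLI face: emits JSON to stdout so results pipe cleanly through jq."""
    parser = argparse.ArgumentParser(prog="kb")
    parser.add_argument("query")
    parser.add_argument("--limit", type=int, default=5)
    args = parser.parse_args(argv)
    json.dump(search_kb(args.query, args.limit), sys.stdout)


def serve_mcp_stdio() -> None:
    """Thin MCP face: would register search_kb as a tool via an MCP SDK."""
    pass  # e.g. decorate search_kb with the SDK's tool registration; omitted here
```

The wrappers stay thin on purpose: each one only translates its transport’s conventions (argv and stdout for the CLI, tool schemas for MCP) into a call on the shared core.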

This isn’t hypothetical. It’s what I’m already doing. My gemini-utils library is the shared core: it handles file uploads, deep research, audio transcription, and querying against Gemini’s APIs. It exposes all of that as a set of CLI commands (research, transcribe, query, upload) that I use directly from the terminal every day. But when I wanted those same research capabilities available to Gemini CLI as an agent tool, I built gemini-cli-deep-research, an extension that wraps the same underlying library as an MCP service. The core logic is shared. The CLI is for me at a terminal. The MCP server is for agents that need to invoke deep research as a tool in a larger workflow. Same capability, different transports, each suited to its context.

I think this is the pattern that tool developers should be building toward. The best agent tools of the next few years won’t be “MCP servers” or “CLI tools.” They’ll be capability libraries with multiple faces.

The Real Question

The CLI-vs-MCP debate, as Tobias Pfuetze argued, is the wrong fight. The question isn’t “which is better?” It’s “where does each one belong?”

For a developer in a terminal with their own credentials, driving a coding agent? Use the CLI. It’s faster, cheaper, and the agent already knows how. Holmes is right about that.

For an agent embedded in an application runtime without shell access? For a multi-tenant platform where the agent acts on behalf of users who will never open a terminal? For a system where you need one capability implementation discoverable by multiple heterogeneous agent hosts? That’s where MCP earns its complexity.

And for the tool developer who wants to serve all of these audiences? Build the core once, expose it three ways: CLI, stdio MCP, and streaming HTTP MCP. Let the runtime decide.

The mistake is assuming that because your agent has a shell, every agent has a shell. The terminal is one runtime among many. And as agents move from developer tools into products that serve non-technical users, the fraction of agents that can rely on a $PATH and a .bashrc is going to shrink rapidly.

MCP isn’t dead. It’s just not for you yet. But it might be soon.

A luminous geometric sphere with sections of its outer shell breaking apart to reveal glowing concentric rings and internal mechanisms, set against a dark navy background.

The Observability Gap

I was debugging an agent a few weeks ago when I hit a problem that made me realize something fundamental about the shift we’re undergoing. The agent had run, consumed a hundred thousand tokens, and returned an answer. But the answer was wrong. Not catastrophically wrong, just subtly, dangerously off.

The issue wasn’t that the model was bad. The problem was that I had no idea what the agent had thought while producing that answer. Which tools had it called? What information had it retrieved? What reasoning path had it wandered down? I had the input and the output, but the middle, the actual decision-making process, was a black box.

This mirrors the challenge I described in Everything Becomes an Agent. If our future architecture is a mesh of interacting agents, we cannot afford for them to be inscrutable. A single black box is a mystery; a system of black boxes is chaos.

This is the Observability Gap, and it is the first wall you hit when you move from prototype to production. You can build a working agent in an afternoon. You can give it tools, wire up a nice ReAct loop, and watch it dazzle you. But the moment you rely on it for something that matters, you realize you’re flying blind.

How do you know if your agent is working well? And more importantly, how do you fix it when it’s not?

Earlier in this series, I wrote about building guardrails and the Policy Engine that keeps agents from doing dangerous things. Observability is the complement to those guardrails. Guardrails define the boundaries; observability tells you whether the agent is respecting them, struggling against them, or quietly finding ways around them. One without the other is incomplete. A guardrail you can’t monitor is just a hope.

The Chain of Thought Problem

When you’re building traditional software, debugging is an exercise in logic. You set breakpoints, inspect variables, and trace execution. The flow is deterministic: if Input A produces Output B today, it will produce Output B tomorrow.

Agents don’t work that way. The same input can produce wildly different outputs depending on which tools the agent decides to call, how it interprets the results, and what “thought” it generates in that split second. The agent’s logic isn’t written in code; it’s written in natural language, scattered across multiple LLM calls, tool invocations, and iterative refinements.

I learned this the hard way with my Podcast RAG system. I’d ask it a question about a specific episode, and sometimes it would nail it, pulling the exact segment and synthesizing a perfect answer. Other times, it would search with the wrong keywords, get back irrelevant chunks, and confidently synthesize nonsense.

The model wasn’t hallucinating in the traditional sense. It was following a process. But I couldn’t see that process, so I couldn’t fix it.

That experience taught me the most important lesson about production agents: the final answer is the least interesting part. What matters is the chain of thought that produced it, every tool call, every intermediate result, every reasoning trace. Think of it as a flight recorder. When the plane lands at the wrong airport, the only way to understand what went wrong is to replay the entire flight.

Four Layers of Seeing

When I started building that flight recorder, I realized that “log everything” isn’t actually a strategy. You need structure. Through trial and error, and by studying how platforms like Langfuse and Arize Phoenix approach the problem, I’ve come to think of agent observability as having four distinct layers.

The first is the reasoning layer: the agent’s internal monologue where it decomposes your request into sub-tasks. This is where you catch the subtle bugs. When my Podcast RAG agent searched for the wrong keywords, the failure wasn’t in the tool call itself (which returned a perfectly valid HTTP 200). The failure was in the reasoning that chose those keywords. Without visibility into the “Thought” step of the ReAct loop, that kind of error is indistinguishable from an external system failure.

The second is the execution layer: the actual tool calls, their arguments, and the raw results. This is where you catch a different class of bug, one that’s becoming increasingly important. Tool hallucination. Not the model making up facts in prose, but the model calling a tool that doesn’t exist (you provided shell_tool but the model confidently calls bash_tool), fabricating a file path that isn’t real, or passing a string to a parameter that expects an integer. These are operational failures that cascade. I’ve seen an agent confidently pass a hallucinated document ID to a retrieval tool, get back an error, and then re-hallucinate a different invalid ID rather than change strategy. You only catch this if you’re logging the schema validation at the boundary between the model and the tool.
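A minimal version of that boundary check might look like this. The tool registry and parameter schemas here are illustrative (a real system would use something like JSON Schema or Pydantic), but the idea is the same: validate the model’s call before executing it, and log every violation:

```python
# Hypothetical boundary check between model output and tool execution.
# Catches hallucinated tool names and mistyped arguments before they run.

TOOLS = {
    "shell_tool": {"command": str},
    "retrieve_doc": {"doc_id": int},
}


def validate_call(name: str, args: dict) -> list[str]:
    """Return a list of violations; an empty list means the call is safe to execute."""
    errors = []
    if name not in TOOLS:
        # e.g. you provided shell_tool but the model confidently called bash_tool
        errors.append(f"unknown tool: {name!r}")
        return errors
    for param, expected in TOOLS[name].items():
        if param not in args:
            errors.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected):
            errors.append(
                f"{param}: expected {expected.__name__}, got {type(args[param]).__name__}"
            )
    return errors
```

Every non-empty result is an execution-layer trace event: it tells you the model hallucinated structure, not that the external system failed.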

The third is the state layer: the contents of the agent’s context window at each decision point. Agents are stateful creatures. Their behavior at step ten is shaped by everything that happened in steps one through nine. And context windows are not infinite. As verbose tool outputs accumulate, relevant information gets pushed further and further from the model’s attention, a phenomenon researchers call “context drift” or the “Lost in the Middle” effect. Snapshotting the context at critical decision points lets you “time travel” during debugging. You can see exactly what the agent could see when it made its bad call.

The fourth is the feedback layer: error codes, user corrections, and signals from any critic or evaluator models. This layer tells you whether the agent is actually learning from its environment within a session, or just ignoring failure signals and looping. In frameworks like Reflexion, this feedback is explicitly wired into the next reasoning step. Watching this layer is how you know if your self-correction mechanisms are actually correcting.

But capturing these four layers independently isn’t enough. You need to bundle them into sessions: discrete, self-contained records of a single task from the moment the user makes a request to the moment the agent delivers (or fails to deliver) its result. A session is your unit of analysis. It’s the difference between having a pile of timestamped log lines and having a story you can read from beginning to end. When something goes wrong, you don’t want to grep through millions of events hoping to reconstruct what happened. You want to pull up session #47832 and replay the agent’s entire decision-making journey: what it thought, what it tried, what it saw, and how it responded to each result along the way.

This session-level thinking changes how you build your infrastructure. Every trace, every tool call, every context snapshot gets tagged with a session ID. Your dashboards stop showing you aggregate metrics and start showing you individual narratives. You can sort sessions by outcome (success, failure, abandonment), by cost (token consumption), or by duration, and immediately drill into the ones that matter. It’s the observability equivalent of going from reading a box score to watching the game film.
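As a sketch of what that tagging buys you, here is session bundling in miniature. The event fields (session_id, tokens, outcome) are illustrative, not a real schema:

```python
# Sketch: folding per-event traces into sessions you can sort and replay.
from collections import defaultdict

events = [
    {"session_id": "s1", "kind": "reasoning", "tokens": 120, "outcome": None},
    {"session_id": "s2", "kind": "tool_call", "tokens": 300, "outcome": None},
    {"session_id": "s1", "kind": "response", "tokens": 80, "outcome": "success"},
    {"session_id": "s2", "kind": "response", "tokens": 900, "outcome": "failure"},
]


def build_sessions(events):
    """Group raw trace events into one narrative record per session."""
    sessions = defaultdict(lambda: {"events": [], "tokens": 0, "outcome": None})
    for e in events:
        s = sessions[e["session_id"]]
        s["events"].append(e)        # ordered replay of the agent's journey
        s["tokens"] += e["tokens"]   # cost rolls up to the session
        if e["outcome"]:
            s["outcome"] = e["outcome"]
    return dict(sessions)


sessions = build_sessions(events)
# Drill into failures first, most expensive first.
worst = sorted(
    (s for s in sessions.values() if s["outcome"] == "failure"),
    key=lambda s: s["tokens"],
    reverse=True,
)
```

The payoff is the sort at the end: instead of grepping timestamped lines, you rank whole narratives by outcome and cost and open the worst one.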

Making It Concrete

Here’s what this looks like in practice. Suppose you ask your agent to “check my calendar and suggest a time for a meeting.”

Without observability, you see:

Input: "Check my calendar and suggest a time for a meeting"
Output: "How about Thursday at 2pm?"

With observability across all four layers, you see the mind at work:

[REASONING] User wants to schedule a meeting. I need to:
1. Check their calendar for availability
2. Consider team availability
3. Suggest an optimal time
[TOOL CALL] get_calendar(user_id="allen", days=7)
[TOOL RESULT] Returns 45 events over next 7 days
[STATE] Context window: 2,847 tokens used
[REASONING] Analyzing free slots. User has:
- Monday 2pm-4pm free
- Thursday 2pm-4pm free
- Friday all day booked
[TOOL CALL] get_team_availability()
[TOOL RESULT] Team members mostly available Thursday afternoon
[REASONING] Thursday 2pm works for both user and team.
[FEEDBACK] No errors. Response generated.
[RESPONSE] "How about Thursday at 2pm?"

Suddenly, the black box is transparent. If the suggestion is wrong, you can see exactly why. Maybe the calendar tool returned incomplete data. Maybe the team availability check failed silently. Maybe the agent’s definition of “optimal” means “soonest” rather than “best for focus time.”

This kind of visibility saved me countless hours when building Gemini Scribe. Users would report that the agent “didn’t understand” their request, which is about as useful as telling your mechanic “the car sounds funny.” But when I turned on debug logging and pulled up the console output, I could see exactly where the confusion happened, usually in how the agent interpreted the file context or which notes it decided were relevant. The fix was never a mystery once I could see the reasoning. All of this logging goes to the developer console and is off by default, which is an important distinction. You want observability for yourself as the builder, not surveillance of your users.

The Standards Are Coming

For my own production agents, I’ve settled on a layered approach. Structured logging captures every action in machine-parseable JSON. A unique trace ID stitches together every LLM call and tool invocation into a single narrative flow.

But we are also seeing the industry mature beyond “roll your own.” The critical development here is the adoption of the OpenTelemetry (OTel) standard for GenAI. The OTel community has published semantic conventions that define a standard schema for agent traces: things like gen_ai.system (which provider), gen_ai.request.model (which exact model version), gen_ai.tool.name (which tool was called), and gen_ai.usage.input_tokens (how many tokens were consumed at each step).

This matters because it means an agent built with LangChain in Python and an agent built with Semantic Kernel in C# can produce traces that look structurally identical. You can pipe both into the same Datadog or Langfuse dashboard and analyze them side by side. You aren’t locked into a proprietary debugging tool; you can stream your agent’s thoughts into the same infrastructure you use for the rest of your stack.
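For illustration, a tool-execution span following those conventions might carry attributes like this. The values are invented and the field set is not exhaustive; consult the current OTel GenAI semantic conventions for the authoritative schema:

```json
{
  "name": "execute_tool get_calendar",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "attributes": {
    "gen_ai.system": "gemini",
    "gen_ai.request.model": "gemini-2.5-pro",
    "gen_ai.tool.name": "get_calendar",
    "gen_ai.usage.input_tokens": 2847,
    "gen_ai.usage.output_tokens": 112
  }
}
```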

It also enables what I think of as “boundary tracing,” where you instrument the stable interfaces (the HTTP calls, the tool invocations) rather than hacking into the agent’s internal logic. You get visibility without coupling your observability to a specific framework. That’s important, because if there’s one thing I’ve learned building in this space, it’s that frameworks change fast.

If you’re wondering where to start, here’s my honest advice: don’t wait for the perfect stack. Start with structured JSON logs and a session ID that ties each task together end-to-end. That alone gives you something you can grep, filter, and replay. Once you outgrow that (and you will, faster than you expect), graduate to an OTel-based pipeline. The good news is that many agent frameworks are adding robust hook mechanisms that let you tap into the agent lifecycle (before and after tool calls, on reasoning steps, on errors) without modifying your core logic. These hooks make it straightforward to plug in your telemetry from the start. The key is to instrument early, even if you’re only logging to a local file. Retrofitting observability into an agent that’s already in production is significantly harder than building it in from the beginning.

The Price of Transparency

Here’s the tension no one wants to talk about: full observability is expensive.

Autonomous agents are verbose by nature. A single reasoning step might generate hundreds of tokens of internal monologue. A RAG retrieval might pull megabytes of document context. If you log the full payload for every transaction, your storage costs can rival the cost of the LLM inference itself. I’ve seen reports of evaluation runs consuming over 100 million tokens, with more than 60% of the cost attributed to hidden reasoning tokens.

In production, you need sampling strategies. The approach I’ve landed on borrows from traditional distributed systems. Keep 100% of traces that result in errors or negative user feedback, because every failure is a learning opportunity. Keep traces that exceed your latency threshold (P95 or P99), because slow agents are often stuck agents. And for everything else, a small random sample (1-5%) is enough to establish your baseline and spot trends.
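That head-based sampling decision fits in a few lines. The threshold and field names here are assumptions standing in for whatever your trace records actually carry:

```python
# Sketch of a head-based sampling decision for agent traces.
import random

LATENCY_P95_MS = 8000  # assumed threshold; derive yours from real latency data


def should_keep(trace: dict, baseline_rate: float = 0.02) -> bool:
    """Keep all failures and slow traces; randomly sample the rest."""
    if trace.get("error") or trace.get("feedback") == "negative":
        return True  # every failure is a learning opportunity
    if trace.get("latency_ms", 0) > LATENCY_P95_MS:
        return True  # slow agents are often stuck agents
    return random.random() < baseline_rate  # small sample for baselines and trends
```

The ordering matters: the deterministic rules run first, so the random baseline only ever applies to healthy, fast traces.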

For storage, I use a tiered approach. Recent and failed traces go into a fast database for immediate querying. Older successful traces get compressed and moved to cold storage, where they can be pulled back if needed for deeper analysis. It’s not glamorous, but it keeps costs manageable without sacrificing the ability to debug the things that matter. In my own setup, this sampling and tiering strategy keeps observability overhead to roughly 15-20% of my inference spend. Without it, I was on track to spend more on storing agent thoughts than on generating them.

Evaluation Beyond Unit Tests

Logging tells you what happened. Evaluation tells you if it was any good.

This is where agents diverge sharply from traditional software. You can’t write a unit test that asserts function(x) == y. The whole point of an agent is to make decisions, and decisions must be evaluated on quality, not just syntax.

As Gemini Scribe grew more capable, I had to develop a new kind of test suite. I track Task Success Rate (did the agent accomplish what the user asked?), Tool Use Accuracy (did it read the right files and use the right tools for the job?), and Efficiency (did it burn 50 steps to do a 2-step task?).

But here’s the number that keeps me up at night. Because agents are non-deterministic, a single run is statistically meaningless. You have to run the same evaluation multiple times and look at distributions. Researchers distinguish between Pass@k (the probability that at least one of k attempts succeeds) and Pass^k (the probability that all k attempts succeed). Pass@k measures potential. Pass^k measures reliability.

The math is sobering. If your agent has a 70% success rate on a single attempt, its Pass^3 (succeeding three times in a row) drops to about 34%. Scale that to a real workflow where the agent needs to perform ten sequential steps correctly, and even a 95% per-step success rate gives you only about a 60% chance of completing the full task. This is the compounding probability of failure, and it’s why “works most of the time” isn’t good enough for production.
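The arithmetic is easy to verify:

```python
# Compounding success probability: independent attempts/steps multiply.
pass_cubed = 0.70 ** 3   # Pass^3 at a 70% single-attempt success rate
ten_steps = 0.95 ** 10   # ten sequential steps at 95% per-step success

print(f"{pass_cubed:.3f}")  # 0.343 -> "about 34%"
print(f"{ten_steps:.3f}")   # 0.599 -> "about a 60% chance"
```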

This kind of evaluation framework pays for itself the moment a new model drops. When Google released Flash 2.0, I was excited about the cost savings, but would it perform as well as Pro? I ran my eval suite on the same tasks with both models, and the results were more nuanced than I expected. For simple tasks like reformatting text or fixing grammar, Flash was just as good. For complex multi-step reasoning, particularly in my Podcast RAG system, Pro was noticeably better. The eval suite gave me the data to keep Pro where it mattered.

Then Flash 3 came out, and the eval suite surprised me in the other direction. I ran the same benchmarks expecting similar trade-offs, but Flash 3 handled the Podcast RAG tasks so well that I moved the entire system off of 2.5 Pro. Without evals, I might have assumed the old trade-off still held and kept paying for a model I no longer needed. The point isn’t that one model is always better. The point is that you can’t know without measuring, and the landscape shifts under your feet with every release.

The real breakthrough in my own workflow came when I started using an agent to evaluate itself. I built a separate “Evaluation Agent” that reviews the logs of the “Worker Agent.” It scores performance based on a rubric I defined: did it confirm the action before executing? Was the response grounded in retrieved context? Was the tone appropriate?

This LLM-as-a-Judge pattern is powerful, but it comes with caveats. Research shows these evaluator models have their own biases, particularly a tendency to prefer longer answers regardless of quality and a bias toward their own outputs. To calibrate mine, I built a small “golden dataset” of traces that I graded by hand, then tuned the evaluator’s prompt until its scores matched mine. It’s not perfect, but it spots patterns I miss, like a tendency to over-rely on search when a simple calculation would do.

When Things Go Wrong

The research into agentic failure modes has identified three patterns that I see constantly in my own work.

The first is looping. The agent searches for “pricing,” gets no results, then searches for “pricing” again with exactly the same parameters. It’s stuck in a local optimum of reasoning, unable to update its strategy based on the observation that it failed. The simplest fix is a state hash: you hash the (Thought, Action, Observation) tuple at each step and check it against a sliding window of recent steps. If you see a repeat, you force the agent to try something different. For “soft” loops where the agent slightly rephrases but semantically repeats itself, embedding similarity between consecutive reasoning steps catches the pattern. And above all, production agents need circuit breakers: hard limits on steps, tool calls, or tokens per session. When the breaker trips, the agent escalates to a human rather than continuing to burn resources.
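One way to implement the state hash and circuit breaker described above, in a sketch where the class and method names are mine (the embedding-similarity check for soft loops is omitted):

```python
# Sketch: exact-repeat loop detection plus a hard step budget.
import hashlib
from collections import deque


class LoopGuard:
    """Flags repeated (thought, action, observation) states and enforces a step limit."""

    def __init__(self, window: int = 8, max_steps: int = 25):
        self.recent = deque(maxlen=window)  # sliding window of state hashes
        self.steps = 0
        self.max_steps = max_steps

    def check(self, thought: str, action: str, observation: str) -> str:
        self.steps += 1
        if self.steps > self.max_steps:
            return "escalate"  # circuit breaker tripped: hand off to a human
        state = f"{thought}|{action}|{observation}".encode()
        h = hashlib.sha256(state).hexdigest()
        if h in self.recent:
            return "force_new_strategy"  # exact repeat: break the loop
        self.recent.append(h)
        return "continue"
```

The guard sits in the agent loop, called once per step; its return value is a directive the orchestrator acts on, not advice the model can ignore.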

The second is tool hallucination. I mentioned this earlier, but it deserves its own spotlight. The most robust defense is constrained decoding, where libraries like Outlines or Instructor use the tool’s JSON schema to build a finite state machine that masks out invalid tokens during generation. If the schema expects an integer, the system sets the probability of all non-digit tokens to zero. It mathematically guarantees that the agent’s tool call will be valid. This moves validation from “check after the fact” to “ensure during generation,” which is a fundamentally better position. A practical note: full constrained decoding (the FSM approach) requires control over the inference engine, so it works with locally-hosted models or providers that expose logit-level access. If you’re calling a hosted API like Gemini or OpenAI, Instructor-style libraries can still enforce schema validation by wrapping the response in a Pydantic model and retrying on parse failure. It’s not as elegant as preventing bad tokens from ever being generated, but it catches the same class of errors.
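For the hosted-API case, the validate-and-retry pattern needs nothing more than the standard library. This sketch hand-rolls the schema check rather than using Instructor or Pydantic, but the shape is the same: parse, validate, feed the error back to the model, retry:

```python
import json

# Minimal schema for one hypothetical tool call.
REQUIRED = {"query": str, "max_results": int}

def validate(raw: str) -> dict:
    """Parses the raw model output and checks it against the schema."""
    data = json.loads(raw)
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"field {field!r} missing or not {ftype.__name__}")
    return data

def validated_tool_call(generate, max_retries: int = 3) -> dict:
    """Calls `generate()` (a stand-in for the hosted-model API call),
    retrying with the validation error as feedback on each failure."""
    last_error = None
    for _ in range(max_retries):
        try:
            return validate(generate(feedback=last_error))
        except (json.JSONDecodeError, ValueError) as e:
            last_error = str(e)  # the model sees its own mistake next turn
    raise RuntimeError(f"no valid call after {max_retries} tries: {last_error}")

# Simulated model: first reply is missing a field, second is valid.
replies = iter(['{"query": "pricing"}',
                '{"query": "pricing", "max_results": 5}'])
call = validated_tool_call(lambda feedback: next(replies))
assert call["max_results"] == 5
```

The real libraries do the same thing with richer schemas and better error messages; the key design point is that the error string goes back into the model’s context, so the retry is informed rather than blind.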

The third is silent abandonment. The agent hits an ambiguity or a tool failure, and instead of trying an alternative, it politely apologizes and gives up. “I’m sorry, I couldn’t find that information.” This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response to uncertainty. The Reflexion pattern combats this by forcing the agent to generate a self-critique when it fails (“I searched with the wrong term”) and storing that critique in a short-term memory buffer. The next reasoning step is conditioned on this reflection, pushing the agent to generate a new plan rather than surrender. Research shows this kind of “verbal reinforcement” can improve success rates on complex tasks from 80% to over 90%.
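The mechanics of Reflexion are easy to sketch. Here’s a toy version, with a stubbed model standing in for the real one (the class and the response format are my own illustration):

```python
class ReflexiveAgent:
    """Sketch of the Reflexion loop: on failure, generate a self-critique
    and prepend it to the next attempt's context instead of giving up."""

    def __init__(self, model):
        self.model = model       # callable: prompt -> {"success", "answer"}
        self.reflections = []    # short-term memory of self-critiques

    def attempt(self, task: str, max_tries: int = 3):
        for _ in range(max_tries):
            context = "\n".join(self.reflections)
            result = self.model(f"{context}\n\nTask: {task}".strip())
            if result["success"]:
                return result["answer"]
            # Instead of apologizing, force a critique of the failed attempt.
            critique = self.model(
                f"Why did this attempt fail? {result['answer']}"
            )["answer"]
            self.reflections.append(f"Reflection: {critique}")
        return None

def fake_model(prompt):
    # Stub: fails until a reflection appears in its context.
    if prompt.startswith("Why did"):
        return {"success": True, "answer": "I searched with the wrong term"}
    if "Reflection:" in prompt:
        return {"success": True, "answer": "found it"}
    return {"success": False, "answer": "no results for 'pricng'"}

agent = ReflexiveAgent(fake_model)
assert agent.attempt("find pricing") == "found it"
```

The point of the sketch: the next reasoning step is literally conditioned on the critique, which is what pushes the model toward a new plan instead of a polite surrender.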

The Self-Improving System

Moving from prototype to production isn’t about adding features; it’s about shifting your mindset. A prototype proves that something can work. A production system proves that it works reliably, measurably, and transparently. But the real unlock comes when you realize that production isn’t the end of the development lifecycle. It’s the beginning of something more powerful.

Remember those sessions I mentioned, the bundled records of every task your agent attempts? Once you have a critical mass of them, you’re sitting on a goldmine. And this is where I think the story gets really interesting: you can point a different AI system at your session archive and ask it to find the patterns you’re missing.

I’ve started doing this with my own agents. The workflow is straightforward: I have a script that runs weekly, pulls the last seven days of sessions from my trace store, filters for failures and anything above P90 latency, and exports them as structured JSON. I then feed that batch to a separate, more capable evaluator model. Not the lightweight rubric-scorer I use for real-time evaluation, but a model with a broader mandate and a carefully written prompt: look across these sessions and tell me what you see. Where is the agent consistently struggling? Which tool calls tend to precede failures? Are there categories of user requests that reliably lead to abandonment or looping? I ask it to return its findings as a ranked list of patterns with supporting session IDs, so I can verify each observation myself.

The results have been genuinely surprising. The evaluator flagged a cluster of sessions where users were asking questions about the corpus itself, things like “how many of these podcasts are about guitars?” or “which shows cover AI the most?” The agent would gamely try to answer by searching transcripts, but it was never going to get there because I hadn’t indexed podcast descriptions. Each individual session just looked like a search that came up short. It was only in aggregate that the pattern became clear: users wanted to explore the collection, not just search within it. That finding led me to index descriptions as a new data source, and a whole category of previously failing queries started working.

This is what the industry calls the Data Flywheel: production data feeding back into development, continuously tightening the loop between user intent and agent capability. Your prompt logs become your reality check, revealing how users actually talk to your system versus how you imagined they would. When you cluster those real-world prompts (something as straightforward as embedding them and running HDBSCAN), you start finding these gaps systematically. That’s your roadmap for what to build next.
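To show the shape of that clustering pipeline without pulling in HDBSCAN itself, here’s a toy greedy grouping by cosine similarity. In production you’d use real embeddings from an embedding model and a proper density-based clusterer; this stand-in just illustrates the flow from prompts to intent clusters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(prompts, embeddings, threshold=0.8):
    """Greedy grouping by similarity to a cluster's first member.
    A stand-in for HDBSCAN, purely for illustration."""
    clusters = []
    for prompt, emb in zip(prompts, embeddings):
        for c in clusters:
            if cosine(emb, c["centroid"]) >= threshold:
                c["members"].append(prompt)
                break
        else:
            clusters.append({"centroid": emb, "members": [prompt]})
    return clusters

# Toy 2-d embeddings; real ones come from an embedding model.
prompts = ["how many podcasts about guitars?",
           "which shows cover AI the most?",
           "find the episode about transformers"]
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
groups = cluster(prompts, embs)
assert len(groups) == 2  # the two "explore the corpus" prompts group together
```

Each resulting cluster is a candidate capability gap: read a handful of member prompts and ask whether the agent can actually serve that intent today.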

And the flywheel compounds. Better observability produces richer sessions. Richer sessions give the evaluator more to work with. Better evaluations lead to targeted improvements. Targeted improvements produce better outcomes, which produce more informative sessions. Each rotation makes the system a little smarter, a little more aligned with what users actually need.

To be clear: this isn’t the agent autonomously rewriting itself. I’m the one who reads the evaluator’s findings, verifies them against the session data, and decides what to change. Maybe I update a system prompt, add a new tool, or adjust a circuit breaker threshold. The AI surfaces the patterns; the human decides what to do about them. It’s the same human-on-the-loop philosophy I described in the last post, applied to the development cycle itself.

Together, these layers transform a clever demo into a system you can trust. Because in the age of agents, trust isn’t built on magic. It’s built on the ability to see the trick.

Throughout this series, we’ve been building up the theory: what agents are, how they think, what tools they need, how to keep them safe, and now how to make sure they’re actually working. In the next installment, I want to move from theory to practice. We’ll look at agents in the wild, real-world case studies in customer support, software development, and personal productivity, and what they tell us about how this technology is actually changing the way we work.

A photorealistic image shows an old wooden-handled hammer on a cluttered workbench transforming into a small, multi-armed mechanical robot with glowing blue eyes, holding various miniature tools.

Everything Becomes an Agent

I’ve noticed a pattern in my coding life. It starts innocently enough. I sit down to write a simple Python script, maybe something to tidy up my Obsidian vault or a quick CLI tool to query an API. “Keep it simple,” I tell myself. “Just input, processing, output.”

But then, the inevitable thought creeps in: It would be cool if the model could decide which file to read based on the user’s question.

Two hours later, I’m not writing a script anymore. I’m writing a while loop. I’m defining a tools array. I’m parsing JSON outputs and handing them back to the model. I’m building memory context windows.

I’m building an agent. Again.

(For those keeping track, my working definition of an “agent” is simple: a model running in a loop with access to tools. I explored this in depth in my Agentic Shift series, but that’s the core of it.)

As I sit here writing this in January of 2026, I realize that almost every AI project I worked on last year ultimately became an agent. It feels like a law of nature: Every AI project, given enough time, converges on becoming an agent. In this post, I want to share some of what I’ve learned, and the cases where you might skip the intermediate steps and jump straight to building an agent.

The Gravitational Pull of Autonomy

This isn’t just feature creep. It’s a fundamental shift in how we interact with software. We are moving past the era of “smart typewriters” and into the era of “digital interns.”

Take Gemini Scribe, my plugin for Obsidian. When I started, it was a glorified chat window. You typed a prompt, it gave you text. Simple. But as I used it, the friction became obvious. If I wanted Scribe to use another note as context, I had to take a specific action, usually linking to that note from the one I was working on, just to make sure it was considered. I was managing the model’s context manually.

I was the “glue” code. I was the context manager.

The moment I gave Scribe access to the read_file tool, the dynamic changed. Suddenly, I wasn’t micromanaging context; I was giving instructions. “Read the last three meeting notes and draft a summary.” That’s not a chat interaction; that’s a delegation. And to support delegation, the software had to become an agent, capable of planning, executing, and iterating.

From Scripts to Sudoers

The Gemini CLI followed a similar arc. There were many of us on the team experimenting with Gemini on the command line. I was working on iterative refinement, where the model would ask clarifying questions to create deeper artifacts. Others were building the first agentic loops, giving the model the ability to run shell commands.

Once we saw how much the model could do with even basic tools, we were hooked. Suddenly, it wasn’t just talking about code; it was writing and executing it. It could run tests, see the failure, edit the file, and run the tests again. It was eye-opening how much we could get done as a small team.

But with great power comes great anxiety. As I explored in my Agentic Shift post on building guardrails and later in my post about the Policy Engine, I found myself staring at a blinking cursor, terrified that my helpful assistant might accidentally rm -rf my project.

This is the hallmark of the agentic shift: you stop worrying about syntax errors and start worrying about judgment errors. We had to build a “sudoers” file for our AI, a permission system that distinguishes between “read-only exploration” and “destructive action.” You don’t build policy engines for scripts; you build them for agents.

The Classifier That Wanted to Be an Agent

Last year, I learned to recognize a specific code smell: the AI classifier.

In my Podcast RAG project, I wanted users to search across both podcast descriptions and episode transcripts. Different databases, different queries. So I did what felt natural: I built a small classifier using Gemini Flash Lite. It would analyze the user’s question and decide: “Is this a description search or a transcript search?” Then it would call the appropriate function.

It worked. But something nagged at me. I had written a classifier to make a decision that a model is already good at making. Worse, the classifier was brittle. What if the user wanted both? What if their intent was ambiguous? I was encoding my assumptions about user behavior into branching logic, and those assumptions were going to be wrong eventually.

The fix was almost embarrassingly simple. I deleted the classifier and gave the agent two tools: search_descriptions and search_episodes. Now, when a user asks a question, the agent decides which tool (or tools) to use. It can search descriptions first, realize it needs more detail, and then dive into transcripts. It can do both in parallel. It makes the call in context, not based on my pre-programmed heuristics. (You can try it yourself at podcasts.hutchison.org.)
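The refactor amounts to replacing a branch with two tool declarations. Here’s roughly what those look like in the JSON-schema style most function-calling APIs accept (the descriptions are the important part; they’re what the model reasons over when choosing a tool):

```python
# Illustrative tool declarations; the names mirror the tools described
# above, and the schema shape follows common function-calling conventions.
TOOLS = [
    {
        "name": "search_descriptions",
        "description": ("Search podcast descriptions. Best for questions "
                        "about the collection itself: topics covered, "
                        "which shows exist, how many match a theme."),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "search_episodes",
        "description": ("Full-text search over episode transcripts. Best "
                        "for questions about what was actually said in "
                        "an episode."),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

assert [t["name"] for t in TOOLS] == ["search_descriptions", "search_episodes"]
```

The classifier’s decision logic didn’t disappear; it moved into the descriptions, where the model can weigh it against the live query instead of my frozen assumptions.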

I saw the same pattern in Gemini Scribe. Early versions had elaborate logic for context harvesting, code that tried to predict which notes the user would need based on their current document and conversation history. I was building a decision tree for context, and it was getting unwieldy.

When I moved Scribe to a proper agentic architecture, most of that logic evaporated. The agent didn’t need me to pre-fetch context; it could use a read_file tool to grab what it needed, when it needed it. The complex anticipation logic was replaced by simple, reactive tool calls. The application got simpler and more capable at the same time.

Here’s the heuristic I’ve landed on: If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent. Deconstruct those branches into tools, give the agent really good descriptions of what those tools can do, and then let the model choose its own adventure.

You might be thinking: “What about routing queries to different models? Surely a classifier makes sense there.” I’m not so sure anymore. Even model routing starts to look like an orchestration problem, and a lightweight orchestrator with tools for accessing different models gives you the same flexibility without the brittleness. The question isn’t whether an agent can make the decision better than your code. It’s whether the agent, with access to the actual data in the moment, can make a decision at least as good as what you’re trying to predict when you’re writing the code. The agent has context you don’t have at development time.

The “Human-on-the-Loop”

We are transitioning from Human-in-the-Loop (where we manually approve every step) to Human-on-the-Loop (where we set the goals and guardrails, but let the system drive).

This shift is driven by a simple desire: we want partners, not just tools. As I wrote back in April about waiting for a true AI coding partner, a tool requires your constant attention. A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.

This freedom comes with a new responsibility: clarity. If your agent is going to work overnight, you need to make sure it’s working on something productive. You need to be precise about the goal, explicit about the boundaries, and thoughtful about what happens when things go wrong. Without the right guardrails, an agent can get stuck waiting for your input, and you’ll lose that time. Or worse, it can get sidetracked and spend hours on something that wasn’t what you intended.

The goal isn’t to remove the human entirely. It’s to move us from the execution layer to the supervision layer. We set the destination and the boundaries; the agent figures out the route. But we have to set those boundaries well.

Embracing the Complexity (Or Lack Thereof)

Here’s the counterintuitive thing: building an agent isn’t always harder than building a script. Yes, you have to think about loops, tool definitions, and context window management. But as my classifier example showed, an agentic architecture can actually delete complexity. All that brittle branching logic, all those edge cases I was trying to anticipate: gone. Replaced by a model that can reason about what it needs in the moment.

The real complexity isn’t in the code; it’s in the trust. You have to get comfortable with a system that makes decisions you didn’t explicitly program. That’s a different kind of engineering challenge, less about syntax, more about guardrails and judgment.

But the payoff is a system that grows with you. A script does exactly what you wrote it to do, forever. An agent does what you ask it to do, and sometimes finds better ways to do it than you’d considered.

So, if you find yourself staring at your “simple script” and wondering if you should give it a tools definition… just give in. You’re building an agent. It’s inevitable. You might as well enjoy the company.

A central, glowing blue polyhedral node suspended in a dark void, connected to several smaller satellite nodes by taut, luminous blue data filaments and orbital arcs, illustrating a network of interconnected AI agents.

When Agents Talk to Each Other

Welcome back to The Agentic Shift. Over the past eight installments, we’ve built our agent from the ground up, giving it a brain to think, memory to learn, a toolkit to act, instructions to follow, guardrails for safety, and a framework to build on. But there’s been an elephant in the room this whole time: our agent is alone.

I was sitting at my desk late last night, staring at three different windows on my monitor, feeling like a digital switchboard operator from the 1950s.

In one window, I had Helix, my text editor, where I was writing a Python script. In the second, I had a terminal running a deep research agent I’d built for Gemini CLI. In the third, I had a browser open to a documentation page.

Here’s the thing: Gemini CLI is brilliant, but it’s blind. It couldn’t see the code I had open in Helix. It couldn’t read the documentation in my browser. When it found a critical library update, I had to manually copy-paste the relevant code into the terminal. When I wanted it to understand an error, I had to copy-paste the stack trace. I was the glue, the slow, error-prone, context-losing glue.

We have spent this entire series building a digital Robinson Crusoe. In Part 1, we gave our agent a brain. In Part 4, we gave it tools. But watching my own workflow fragment into disjointed copy-paste loops, I realized we’ve hit a wall. We have built brilliant, isolated sparks of intelligence, but we haven’t built the wiring to connect them.

This fragmentation is the single biggest bottleneck in the agentic shift. But that is changing. We are witnessing the birth of the protocols that will turn these isolated islands into a network. We are moving from building agents to building the Internet of Agents.

The Struggle Before Standards

I tried to fix this myself, of course. We all have. I wrote brittle Python scripts to wrap my CLI tools. I tried building a mega-agent that had every possible API key hardcoded into its environment variables. I even built my own agentic TUI that explored many interesting ideas, but ultimately wasn’t the right solution.

My lowest moment came when I spent several evenings and weekends building an Electron-based AI research and writing application. The vision was grand: a unified workspace where I could query multiple AI models, organize research into projects, and write drafts with AI assistance, all in one window. I built a beautiful sidebar for project navigation, a markdown editor with live preview, a chat interface that could talk to Gemini, and a “sources” panel for managing references. By the time I stepped back to evaluate what I’d built, I had thousands of lines of TypeScript, a complex state management system, and an app that was slower than just using the terminal. Worse, it didn’t actually solve my problem. I still couldn’t get the AI to see what was in my other tools. I’d built a new silo, not a bridge. The repo still sits on my hard drive, unopened.

Every solution felt like a band-aid. The problem wasn’t that I couldn’t write the code; it was that I was trying to solve an ecosystem problem with a point solution.

The Anatomy of Connection

To solve this, we don’t just need “better agents.” We need a common language. The industry is converging on three distinct protocols, each solving a different layer of the communication stack: MCP for tools, ACP for interfaces, and A2A for collaboration.

Why three protocols instead of one? For the same reason the internet isn’t just “one protocol.” Think of it like the networking stack: TCP/IP handles reliable data transmission, HTTP handles document requests, and SMTP handles email. Each layer solves a distinct problem, and trying to collapse them into one mega-protocol would create an unmaintainable mess. The same logic applies here. MCP solves the “how do I use this tool?” problem. ACP solves the “how do I show this to a human?” problem. A2A solves the “how do I collaborate with another agent?” problem. They’re designed to compose, not compete.

The Internal Wiring of MCP

The Model Context Protocol (MCP), championed by Anthropic, represents the agent’s Internal Wiring. It answers the fundamental question: How does an agent perceive, act upon, and understand the world?

It’s easy to dismiss MCP as just “standardized tool calling,” but that misses the architectural shift. MCP creates a universal substrate for context, built on three distinct pillars. First, there are Resources, the agent’s sensory input that allows it to read data (files, logs, database rows) passively. Crucially, MCP supports subscriptions, meaning an agent can “watch” a log file and wake up the moment an error appears. Next are Tools, the agent’s hands, allowing for action: executing a SQL query, hitting an API, or writing a file. Finally, there are Prompts, perhaps the most overlooked feature, which allow domain experts to bake workflows directly into the server. A “Git Server” doesn’t just expose git commit; it can expose a generate_commit_message prompt that inherently knows your team’s style guide and grabs the current diff automatically.

Here is what that “handshake” looks like: the response to a tools/list request (from Anthropic’s MCP specification). It’s not magic; it’s a strict contract that turns an opaque binary into a discoverable capability:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "query_database",
        "description": "Execute a SELECT query against the local Postgres instance",
        "inputSchema": {
          "type": "object",
          "properties": {
            "sql": { "type": "string" }
          }
        }
      }
    ]
  }
}

Now, any agent (whether it’s running in Claude Desktop, Cursor, or a custom script) can “plug in” to my Postgres server and immediately know how to use it. It solves the N × M integration problem forever.

A skeptical reader might ask: “How is this different from REST or OpenAPI?” It’s a fair question. On the surface, MCP looks like “JSON-RPC with a schema,” and that’s not wrong. But the difference is what gets standardized. OpenAPI describes how to call an endpoint; MCP describes how an agent should understand and use a capability. The schema isn’t just for validation. It’s for reasoning. An MCP tool description is a prompt fragment that teaches the model when and why to use the tool, not just how.

But here’s where I need to offer some nuance, because protocol boosterism can obscure practical reality.

As Simon Willison observed in his year-end review, MCP’s explosive adoption may have been partly a timing accident. It launched right as models got reliable at tool-calling, leading some to confuse “MCP support” with “tool-calling ability.” More pointedly, he notes that for coding agents, “the best possible tool for any situation is Bash.” If your agent can run shell commands, it can use gh for GitHub, curl for APIs, and psql for databases, no MCP server required.

I’ve felt this myself. When I’m working in Gemini CLI, I rarely reach for an MCP server. The GitHub CLI (gh) is faster and more capable than any MCP wrapper I’ve tried. The same goes for git, docker, and most developer tools with good CLIs.

So when does MCP make sense? I see three clear cases. First, when there’s no CLI (for example with my MCP service for Google Workspace), since many SaaS products expose APIs but no command-line interface. An MCP server is the natural wrapper. Second, when you need subscriptions, since MCP’s ability to “watch” a resource and push updates to the agent is something CLIs can’t do cleanly. Third, when you’re crossing network boundaries, since an MCP server can run on a remote machine and expose capabilities securely, which is harder to orchestrate with raw shell access.

The real insight here is about context engineering. MCP servers bring along a lot of context for every tool (descriptions, schemas, the full capability surface). For some workflows, that richness is valuable. But Anthropic themselves acknowledged the overhead with their Skills mechanism, a simpler approach where a Skill is just a Markdown file in a folder, optionally with some executable scripts. Skills are lightweight and only load when needed. MCP and Skills aren’t competing; they’re different tools for different context budgets.

Giving the Agent a Seat at the Keyboard

If MCP is the agent’s internal wiring, the Agent Client Protocol (ACP) is its window to the world.

I like to think of this as the LSP (Language Server Protocol) moment for the agentic age. Before LSP, if you wanted to support a new language in an IDE, you had to write a custom parser for every single editor. It was a nightmare of N × M complexity. ACP solves the same problem for intelligence. It decouples the “brain” from the “UI.”

This is why the collaboration between Zed and Google is so critical. When Zed announced bring your own agent with Google Gemini CLI integration, they weren’t just shipping features. They were standardizing the interface between the client (the editor) and the server (the agent). Intelligence became swappable. I can run a local Gemini instance through the same UI that powers a remote Claude agent.

The core of ACP is Symmetry. It’s not just the editor sending prompts to the agent. Through ACP, an editor like Zed (the reference implementation) can tell the agent exactly where your cursor is, what files you have open, and even feed it the terminal output from a failed build. The agent, in turn, can request to edit a specific line or show you a diff for approval.

I’ve been seriously thinking about building ACP support for Obsidian. I already built Gemini Scribe, an agent that lives inside Obsidian for research and writing assistance, but it’s hardcoded to Gemini. With ACP, I could make Obsidian a universal agent host, letting users bring whatever intelligence they prefer into their knowledge management workflow.

This turns the editor into the ultimate guardrail. Because the agent communicates its intent through a standardized protocol, the editor can pause, show the user exactly what’s about to happen, and wait for that “Approve” click. It’s the infrastructure that makes autonomous coding safe.

But the real magic isn’t just safety; it’s ubiquity. ACP liberates the agent from the tool. It means you can bring your preferred intelligence to whatever surface helps you flow. We are already seeing the ecosystem explode beyond just Zed.

For the terminal die-hards, there is Toad, a framework dedicated entirely to running ACP agents in a unified CLI. And for the VIM crowd, the CodeCompanion project has brought full ACP support to Neovim. This is the promise of the protocol: write the agent once, and let the user decide if they want to interact with it in a modern GUI, a raw terminal, or a modal editor from the 90s. The intelligence remains the same; only the glass changes.

When Agents Meet Strangers

Finally, we have the “Internet” layer: Agent-to-Agent (A2A).

While MCP connects an agent to a thing, and ACP connects an agent to a person, A2A connects an agent to society. It addresses the “lonely agent” problem by establishing a standard for horizontal, peer-to-peer collaboration.

This protocol, pushed forward by Google and the Linux Foundation, introduces a profound shift in how we think about distributed systems: Opaque Execution.

In traditional software, if Service A talks to Service B, Service A needs to know exactly how to call the API. In A2A, my agent doesn’t care about the how; it cares about the goal. My “Travel Agent” can ask a “Calendar Agent” to “find a slot for a meeting,” without knowing if that Calendar Agent is running a simple SQL query, consulting a complex rules engine, or even asking a human secretary for help.

This negotiation happens through the Agent Card, a machine-readable identity file hosted at a standard /.well-known/agent.json endpoint. It solves the “Theory of Mind” gap, allowing one agent to understand the capabilities of another. Here’s what one looks like:

{
  "name": "Calendar Agent",
  "description": "Manages scheduling, finds available slots, and coordinates meetings across time zones.",
  "url": "https://calendar.example.com",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true
  },
  "skills": [
    {
      "id": "find-meeting-slot",
      "name": "Find Meeting Slot",
      "description": "Given a list of participants and constraints, finds optimal meeting times.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "participants": { "type": "array", "items": { "type": "string" } },
          "duration_minutes": { "type": "integer" },
          "preferred_time_range": { "type": "string" }
        }
      }
    }
  ],
  "authentication": {
    "schemes": ["oauth2", "api_key"]
  }
}


When my Travel Agent encounters a scheduling problem, it doesn’t need to know how the Calendar Agent works internally. It reads this card, understands the agent can “find meeting slots,” and delegates the task. The Calendar Agent might use Google Calendar, Outlook, or a custom database. My agent doesn’t care.

But the real breakthrough is the Task Lifecycle. A2A tasks aren’t just request-response loops; they are stateful, modeled as a finite state machine with well-defined transitions:

  • Submitted: The task has been received but work hasn’t started.
  • Working: The agent is actively processing the request.
  • Input-Required: The agent needs clarification before continuing. This is the key innovation: the agent can pause, ask “Do you prefer aisle or window?”, and wait indefinitely.
  • Completed: The task finished successfully.
  • Failed: Something went wrong. The response includes an error message and optional retry hints.
  • Canceled: The requesting agent (or human) aborted the task.

This state machine brings the asynchronous, messy reality of human collaboration to the machine world. A task might sit in Input-Required for hours while waiting for a human to respond. It might transition from Working to Failed and back to Working after a retry. The protocol handles all of this gracefully.
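The lifecycle above is small enough to model directly. A sketch of the state machine in Python; the transition set is my illustrative reading of the lifecycle, not the normative A2A spec:

```python
from enum import Enum

class TaskState(Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    INPUT_REQUIRED = "input-required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

# Legal transitions, sketched from the lifecycle described above.
TRANSITIONS = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.CANCELED},
    TaskState.WORKING: {TaskState.INPUT_REQUIRED, TaskState.COMPLETED,
                        TaskState.FAILED, TaskState.CANCELED},
    TaskState.INPUT_REQUIRED: {TaskState.WORKING, TaskState.CANCELED},
    TaskState.FAILED: {TaskState.WORKING},  # retry path
}

def advance(state: TaskState, new_state: TaskState) -> TaskState:
    """Rejects transitions the lifecycle doesn't allow."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# A task pauses for input, fails once, and comes back after a retry.
s = TaskState.SUBMITTED
for nxt in (TaskState.WORKING, TaskState.INPUT_REQUIRED, TaskState.WORKING,
            TaskState.FAILED, TaskState.WORKING, TaskState.COMPLETED):
    s = advance(s, nxt)
assert s is TaskState.COMPLETED
```

Note that Completed and Canceled are terminal here: the walk through the loop exercises exactly the messy path the prose describes, including the Failed-then-retry hop.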

Finding Agents You Can Trust

But let’s not declare victory just yet. We are seeing the very beginning of this shift, and the “Internet of Agents” brings its own set of dangers.

As we move from tens of agents to millions, we face a massive Discovery Problem. In a global network of opaque execution, how do you find the right agent? And more importantly, how do you trust it?

It’s not enough to just connect. You need safety guarantees. You need to know that the “Travel Agent” you just hired isn’t going to hallucinate a non-refundable booking or, worse, exfiltrate your credit card data to a malicious third party.

This is the focus of recent research on multi-agent security, which highlights that protocol compliance is only the first step. We need mechanisms for Behavioral Verification, ensuring that an agent does what it says it does.

What does verification look like in practice? Today, it’s mostly manual and ad-hoc. You might:

  • Audit the agent’s logs to see what actions it actually took versus what it claimed.
  • Run it in a sandbox with fake data before trusting it with real resources.
  • Require human approval for high-stakes actions (the “Human-in-the-Loop” pattern we explored in Part 6).
  • Check reputation signals: who built this agent? What’s their track record?

But these are stopgaps. The dream is automated verification: cryptographic proofs that an agent behaved according to its advertised policy, or sandboxed execution environments that can mathematically guarantee an agent never accessed unauthorized data. We’re not there yet.

Whether the solution looks like a decentralized “Web of Trust” (where agents vouch for each other, like PGP key signing) or a centralized “App Store for Agents” (where a trusted authority vets and signs off on agents) remains to be seen. My bet is we’ll see both: curated marketplaces for enterprise use cases, and open registries for the long tail. But solving the discovery and safety problem is the only way we move from a toy ecosystem to a production economy.

The Foundation of the Future

What excites me most isn’t just the code. It’s the governance.

We have seen this movie before. In the early days of the web, proprietary browser wars threatened to fracture the internet. We risked a world where “This site only works in Internet Explorer” became the norm. We avoided that fate because of open standards.

The same risk exists for agents. We cannot afford a future where an “Anthropic Agent” refuses to talk to an “OpenAI Agent” that won’t talk to a “Google Agent.”

That is why the formation of the Agentic AI Foundation by the Linux Foundation is the most important news you might have missed. By bringing together AI pioneers like OpenAI and Anthropic alongside infrastructure giants like Google, Microsoft, and AWS under a neutral banner, we are ensuring that the “Internet of Agents” remains open. This foundation will oversee the development of protocols like A2A, ensuring they evolve as shared public utilities rather than walled gardens. It is the guarantee that the intelligence we build today will be able to talk to the intelligence we build tomorrow.

The New Architecture of Work

When we combine these three protocols, the fragmentation dissolves.

Imagine I am back in Zed (connected via ACP). I ask my coding agent to “Add a secure user profile page.” Zed sends my cursor context to the agent. The agent reaches for MCP to query my local database schema and understand the users table. Realizing this touches PII, it autonomously pings a “Security Guardrail Agent” via A2A to review the proposed code. Approval comes back, and my local agent writes the code directly into my buffer.

I didn’t switch windows once.

But what happens when things go wrong? Let’s say the Security Guardrail Agent rejects the code because it detected a SQL injection vulnerability. The A2A task transitions to Failed with a structured error: {"reason": "sql_injection_detected", "line": 42, "suggestion": "Use parameterized queries"}. My local agent receives this, understands the failure, and either fixes the issue automatically or surfaces it to me with context. The rejection isn’t a dead end; it’s a conversation.

Or imagine the MCP server for my database is unreachable. The agent doesn’t just hang. It receives a timeout error and can decide to retry, fall back to cached schema information, or ask me whether to proceed without database context. Robust failure handling is baked into the protocols, not bolted on as an afterthought.

Where We Are Today

I want to be honest about maturity. These protocols are real and shipping, but the ecosystem is young.

MCP is the most mature. Just about everything supports it now: coding tools, virtualization environments, editors, even mobile apps. There are hundreds of community MCP servers for everything from Notion to Kubernetes. If you want to try this today, MCP is the on-ramp.

ACP is newer but moving fast. Zed is the reference implementation, with Neovim (via CodeCompanion) and terminal clients (via Toad) close behind. There are also robust client APIs for many languages, making ACP an interesting interface for controlling local agentic applications. If your editor doesn’t support ACP yet, you’ll likely be using proprietary plugin APIs for now.

A2A is the most nascent. Google and partners announced it in early 2025, and the specification is still evolving. There aren’t many production A2A deployments yet. Most multi-agent systems today use custom protocols or framework-specific solutions like CrewAI or LangGraph. But the spec is public, the governance is in place, and early adopters are building.

If you’re starting a project today, my advice is: use MCP for tool integration, use whatever your editor supports for the UI layer, and keep an eye on A2A for future multi-agent workflows. The pieces are coming together, but we’re still early.

And yet, this isn’t science fiction. The protocols are here today. The “Internet of Agents” is booting up, and for the first time, our digital Robinson Crusoes are finally getting a radio.

But a radio is only as good as the conversations it enables. In our next post, we’ll move from protocols to practice and explore what happens when agents don’t just connect, but actually collaborate: forming teams, delegating tasks, and solving problems no single agent could tackle alone.

A retro computer monitor displaying the Gemini CLI prompt "> Ask Gemini to scaffold a web app" inside a glowing neon blue and pink holographic wireframe box, representing a digital sandbox.

The Guardrails of Autonomy

I still remember the first time I let an LLM execute a shell command on my machine. It was a simple ls -la, but my finger hovered over the Enter key for a solid ten seconds.

There is a visceral, lizard-brain reaction to giving an AI that level of access. We all know the horror stories—or at least the potential horror stories. One hallucinated argument, one misplaced flag, and a helpful cleanup script becomes rm -rf /. This fear creates a central tension in what I call the Agentic Shift. We want agents to be autonomous enough to be useful—fixing a bug across ten files while we grab coffee—but safe enough to be trusted with the keys to the kingdom.

Until now, my approach with the Gemini CLI was the blunt instrument of “Human-in-the-Loop.” Any tool call with a side effect—executing shell commands, writing code, or editing files—required a manual y/n confirmation. It was safe, sure. But it was also exhausting.

I vividly remember asking Gemini to “fix all the linting errors in this project.” It brilliantly identified the issues and proposed edits for twenty different files. Then I sat there, hitting yyy… twenty times.

The magic evaporated. I wasn’t collaborating with an intelligent agent; I was acting as a slow, biological barrier for a very expensive macro. This feeling has a name—“Confirmation Fatigue”—and it’s the silent killer of autonomy. I realized I needed to move from micromanagement to strategic oversight. I didn’t want to stop the agent; I wanted to give it a leash.

The Policy Engine

The solution I’ve built is the Gemini CLI Policy Engine.

Think of it as a firewall for tool calls. It sits between the LLM’s request and your operating system’s execution. Every time the model reaches for a tool—whether it’s to read a file, run a grep command, or make a network request—the Policy Engine intercepts the call and evaluates it against a set of rules.

The system relies on three core actions:

  1. allow: The tool runs immediately.
  2. deny: The AI gets a “Permission denied” error.
  3. ask_user: The default manual approval.
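Each of the three decisions maps directly onto a rule. Here is a minimal sketch using the same `[[rule]]` schema shown later in this post; the tool names and priority values are illustrative assumptions, not the defaults that ship with Gemini CLI:

```toml
# Illustrative only: tool names and priorities are examples, not shipped defaults.

# allow: read-only file access runs without a prompt
[[rule]]
toolName = "read_file"
decision = "allow"
priority = 100

# deny: the model may never fetch arbitrary URLs
[[rule]]
toolName = "web_fetch"
decision = "deny"
priority = 100

# ask_user: anything that writes to disk still needs a manual y/n
[[rule]]
toolName = "write_file"
decision = "ask_user"
priority = 100
```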

A Hierarchy of Trust

The magic isn’t just in blocking or allowing things; it’s in the hierarchy. Instead of a flat list of rules, I built a tiered priority system that functions like layers of defense.

At the base, you have the Default Safety Net. These are the built-in rules that apply to everyone—basic common sense like “always ask before overwriting a file.”

Above that sits the User Layer, which is where I define my personal comfort zone. This allows me to customize the “personality” of my safety rails. On my personal laptop, I might be a cowboy, allowing git commands to run freely because I know I can always undo a bad commit. But on a production server, I might lock things down tighter than a vault.

Finally, at the top, is the Enterprise/Admin Layer. These are the immutable laws of physics for the agent. In an enterprise setting, this is where you ensure that no matter how “creative” the agent gets, it can never curl data to an external IP or access sensitive directories.
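In config terms, the layers are expressed through priorities: when rules overlap, the higher-priority rule wins. A rough sketch, where the priority bands (100 for user rules, 999 for admin rules) are my own convention rather than fixed values:

```toml
# User layer: on my laptop, read-only git is always fine
[[rule]]
toolName = "run_shell_command"
commandPrefix = ["git status", "git log"]
decision = "allow"
priority = 100

# Admin layer: no exfiltration, regardless of what lower layers allow
[[rule]]
toolName = "run_shell_command"
commandPrefix = "curl"
decision = "deny"
priority = 999
```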

Safe Exploration

In practice, this means I can trust the agent to look but ask it to verify before it touches. I generally trust the agent to check the repository status, review history, or check if the build passed. I don’t need to approve every git log or gh run list.

[[rule]]
toolName = "run_shell_command"
commandPrefix = [
  "git status",
  "git log",
  "git diff",
  "gh issue list",
  "gh pr list",
  "gh pr view",
  "gh run list"
]
decision = "allow"
priority = 100

Yolo Mode

Sometimes, I’m working in a sandbox and I just want speed. I can use the dedicated yolo mode to take the training wheels off. There is a distinct feeling of freedom—and a slight thrill of danger—when you watch the terminal fly by, commands executing one after another.

However, even in Yolo mode, I want a final sanity check before I push code or open a PR. While Yolo mode is inherently permissive, I define specific high-priority rules to catch critical actions. I also explicitly block docker commands—I don’t want the agent spinning up (or spinning down) containers in the background without me knowing.

# Exception: Always ask before committing or creating a PR
[[rule]]
toolName = "run_shell_command"
commandPrefix = ["git commit", "gh pr create"]
decision = "ask_user"
priority = 900
modes = ["yolo"]

# Exception: Never run docker commands automatically
[[rule]]
toolName = "run_shell_command"
commandPrefix = "docker"
decision = "deny"
priority = 999
modes = ["yolo"]

The Hard Stop

And then there are the things that should simply never happen. I don’t care how confident the model is; I don’t want it rebooting my machine. These rules are the “break glass in case of emergency” protections that let me sleep at night.

[[rule]]
toolName = "run_shell_command"
commandRegex = "^(shutdown|reboot|kill)"
decision = "deny"
priority = 999

Decoupling Capability from Control

The significance of this feature goes beyond just saving me from pressing y. It fundamentally changes how we design agents.

I touched on this concept in my series on autonomous agents, specifically in Building Secure Autonomous Agents, where I argued that a “policy engine” is essential for scaling from one agent to a fleet. Now, I’m bringing that same architecture to the local CLI.

Previously, the conversation around AI safety often presented a binary choice: you could have a capable agent that was potentially dangerous, or a safe agent that was effectively useless. If I wanted to ensure the agent wouldn’t accidentally delete my home directory, the standard advice was to simply remove the shell tool. But that is a false choice. It confuses the tool with the intent. Removing the shell doesn’t just stop the agent from doing damage; it stops it from running tests, managing git, or installing packages—the very things I need it to do.

With the Policy Engine, I can give the agent powerful tools but wrap them in strict policies. I can give it access to kubectl, but only for get commands. I can let it edit files, but only in specific documentation directories.
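The kubectl example might look like this in the same rule format. This is a hypothetical sketch: it assumes kubectl runs through run_shell_command and that the higher-priority rule wins when both prefixes match a command:

```toml
# Hypothetical: allow read-only kubectl, block everything else.
# "kubectl get pods" matches both prefixes; the higher-priority
# allow rule is assumed to take precedence over the broad deny.
[[rule]]
toolName = "run_shell_command"
commandPrefix = "kubectl get"
decision = "allow"
priority = 200

[[rule]]
toolName = "run_shell_command"
commandPrefix = "kubectl"
decision = "deny"
priority = 100
```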

This is how we bridge the gap between a fun demo and a production-ready tool. It allows me to define the sandbox in which the AI plays, giving me the confidence to let it run autonomously within those boundaries.

Defining Your Own Rules

The Policy Engine is available now in the latest release of Gemini CLI. You can dive into the full documentation here.

If you want to see exactly what rules are currently active on your system—including the built-in defaults and your custom additions—you can simply run /policies list from inside the Gemini CLI.

I’m currently running a mix of “Safe Exploration” and “Hard Stop” rules. It’s quieted the noise significantly while keeping my file system intact. I’d love to hear how you configure yours—are you a “deny everything” security maximalist, or are you running in full “allow” mode?