A focused workspace at a desk in a vast library, with nearby shelves illuminated and distant shelves visible but softened, a pair of sunglasses resting on the desk

Scoping AI Context with Projects in Gemini Scribe

My son has a friend who likes to say, “born to dilly-dally, forced to lock in.” I’ve started to think that describes AI agents in a large Obsidian vault perfectly.

My vault is a massive, sprawling entity. It holds nearly two decades of thoughts, ranging from deep dives into LLM architecture to my kids’ school syllabi and the exact dimensions needed for an upcoming home remodeling project. When I first introduced Gemini Scribe, the agent’s ability to explore all of that was a feature. I could ask it to surface surprising connections across topics, and it would. But as I’ve leaned harder into Scribe as a daily partner, both at home and at work, the dilly-dallying became a real problem. My work vault has thousands of files with highly overlapping topics. It’s no surprise that the agent might jump from one topic to another, or get confused about what we’re working on at any given time. When I asked the agent to help me structure a paragraph about agentic workflows, I didn’t want it pulling in notes from my jazz guitar practice.

I could have created a new, isolated vault just for my blog writing. I tried that briefly, but I immediately found myself copying data back and forth. I was duplicating Readwise syncs, moving research papers, and fracturing my knowledge base. That wasn’t efficient, and it certainly wasn’t fun. The problem wasn’t that the agent could see too much. The problem was glare. I needed sunglasses, not blinders. I needed to force the agent to lock in.

So, I built Projects in Gemini Scribe.

A project defines scope without acting as a gatekeeper

Fundamentally, a project in Gemini Scribe is a way to focus the agent’s attention without locking it out of anything. It defines a primary area of work, but the rest of the vault is still there. Think of it like sitting at a desk in the engineering section of a library. Those are the shelves you browse by default, the ones within arm’s reach. But if you know the call number for a book in the history section, nobody stops you from walking over and grabbing it. You can even leave a stack of books from other sections on your desk ahead of time if you know you’ll need them. If you’ve followed along with the evolution of Scribe from plugin to platform, you’ll recognize this as a natural extension of the agent’s growing capabilities.

The core mechanism is remarkably simple. Any Markdown file in your vault can become a project by adding a specific tag to its YAML frontmatter.

---
tags:
  - gemini-scribe/project
name: Letters From Silicon Valley
skills:
  - writing-coach
permissions:
  delete_file: deny
---

Once tagged, that file’s parent directory becomes the project root. From that point on, when an agent session is linked to the project, its discovery tools are automatically scoped to that directory and its subfolders. Under the hood, the plugin intercepts API calls to tools like list_files and find_files_by_content, transparently prepending the project root to the search paths. The practical difference is immediate. Before projects, I could be working on a blog post about agent memory systems and the agent would surface notes from a completely unrelated project that happened to use similar terminology. Now I can load up a project and work with the agent hand in hand, confident it won’t get distracted by similar ideas or overlapping vocabulary from other corners of the vault.
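For intuition, here is a minimal TypeScript sketch of that interception. The names (`scopeToProject`, `ToolCall`, `SCOPED_TOOLS`) are illustrative, not the plugin’s actual API; the shape of the idea is what matters: discovery tools get the project root prepended to their search path, while everything else passes through untouched.

```typescript
// Hypothetical sketch of discovery-tool scoping, not the plugin's real code.

interface ToolCall {
  name: string;
  args: { path?: string; [key: string]: unknown };
}

// Tools whose searches are confined to the project root.
const SCOPED_TOOLS = new Set(["list_files", "find_files_by_content"]);

function scopeToProject(call: ToolCall, projectRoot: string | null): ToolCall {
  // No active project, or an unrestricted tool like read_file:
  // pass the call through unchanged.
  if (!projectRoot || !SCOPED_TOOLS.has(call.name)) return call;

  const requested = call.args.path ?? "";
  // Prepend the project root unless the caller already targeted a path
  // inside it.
  const path =
    requested === projectRoot || requested.startsWith(projectRoot + "/")
      ? requested
      : `${projectRoot}/${requested}`.replace(/\/+$/, "");
  return { ...call, args: { ...call.args, path } };
}
```

A `list_files` call for `drafts` inside a project rooted at `Blog` becomes a call for `Blog/drafts`, while a `read_file` call for a note anywhere in the vault is left alone.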

The project file serves as both configuration and context

The project file itself serves a dual purpose. It acts as both configuration and context. The frontmatter handles the configuration, allowing me to explicitly limit which skills the agent can use or override global permission settings. For example, denying file deletions for a critical writing project is a simple but effective safety net. But the real power is in customizing the agent’s behavior per project. For my creative writing, I actually don’t want the agent to write at all. I want it to read, critique, and discuss, but the words on the page need to be mine. Projects let me turn off the writing skill entirely for that context while leaving it fully enabled for my blog work. The same agent, shaped differently depending on what I’m working on.

Everything below the frontmatter is treated as context. Whatever I write in the body of the project note is injected directly into the agent’s system prompt, acting much like an additional, localized set of instructions. The global agent instructions are still respected, but the project instructions provide the specific context needed for that particular workspace. This is similar in spirit to how I’ve previously discussed treating prompts as code, where the instructions you give an agent deserve the same rigor and iteration as any other piece of software.
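Conceptually, the layering looks something like this sketch (function and field names are my own, hypothetical; the real plugin assembles its prompt differently). The key design choice is that the project body is appended after the global instructions, so it refines them rather than replacing them.

```typescript
// Illustrative sketch of layering a project note's body into the
// system prompt. Not the plugin's actual implementation.

interface ProjectFile {
  frontmatter: Record<string, unknown>;
  body: string; // everything below the closing --- of the frontmatter
}

function buildSystemPrompt(
  globalInstructions: string,
  project: ProjectFile | null
): string {
  const parts = [globalInstructions];
  if (project && project.body.trim().length > 0) {
    // Project instructions come last, so they act as a localized
    // refinement of the global ones rather than a replacement.
    parts.push("Project context:\n" + project.body.trim());
  }
  return parts.join("\n\n");
}
```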

This is where the sunglasses metaphor really holds. The agent’s discovery tools, things like list_files and find_files_by_content, are scoped to the project folder. That’s the glare reduction. But the agent’s ability to read files is completely unrestricted. If I am working on a technical post and need to reference a specific architectural note stored in my main Notes folder, I have two options. I can ask the agent to go grab it, or I can add a wikilink or embed to the project file’s body and the agent will have it available from the start. One is like walking to the history section yourself. The other is like leaving that book on your desk before you sit down. Either way, the knowledge is accessible. The project just keeps the agent from rummaging through every shelf on its own. This builds directly on the concepts of agent attention I explored in Managing AI Agent Attention.

Session continuity keeps the agent focused across your vault

One of the more powerful aspects of this system is how it interacts with session memory. When I start a new chat, Gemini Scribe looks at the active file. If that file lives within a project folder, the session is automatically linked to that project. This is a direct benefit of the supercharged chat history work that landed earlier in the plugin’s life.

This linkage is stable for the lifetime of the session. I can navigate around my vault, opening files completely unrelated to the project, and the agent will remain focused on the project’s context and instructions. This means I don’t have to constantly remind the agent of the rules of the road. The project configuration persists across the entire conversation.
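The lookup that establishes the link at session start can be sketched as follows, under the assumption (mine, for illustration) that nested projects resolve to the most specific root containing the active file:

```typescript
// Hypothetical sketch: link a new session to the project whose root
// folder contains the active file. Deeper roots win when nested.

function resolveProject(
  activeFilePath: string,
  projectRoots: string[]
): string | null {
  const matches = projectRoots.filter(
    (root) => activeFilePath === root || activeFilePath.startsWith(root + "/")
  );
  if (matches.length === 0) return null;
  // Prefer the most specific (longest) matching root.
  return matches.sort((a, b) => b.length - a.length)[0];
}
```

Once resolved, that root is stored on the session itself, which is why navigating to unrelated files afterward doesn’t change the project context.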

Furthermore, session recall allows the agent to look back at past conversations. When I ask about prior work or decisions related to a specific project, the agent can search its history, utilizing the project linkage to find the most relevant past interactions. This creates a persistent working environment that feels much more like a collaboration than a simple transaction.

Structuring projects effectively requires a few simple practices

To get the most out of projects, I’ve found a few practices to be particularly effective.

First, lean into the folder-based structure. Place the project file at the root of the folder containing the relevant work. Everything underneath it is automatically in scope. This feels natural if you already organize your vault by topic or project, which many Obsidian users do.

Second, start from the defaults and adjust as the project demands. Out of the box, a new project inherits the agent’s standard skills and permissions, which is a sensible baseline for most work. From there, you tune. If you find the agent reaching for tools that don’t make sense in a given context, narrow the allowed skills in the frontmatter. If a project needs extra safety, tighten the permissions. The creative writing example I mentioned earlier came about exactly this way. I started with the defaults, realized I wanted the agent as a reader and critic rather than a co-writer, and adjusted accordingly. This aligns with the broader principle I’ve written about when discussing building responsible agents: the right guardrails are the ones shaped by the actual work.

Finally, treat the project body as a living document. As the project evolves, update the instructions and external links to ensure the agent always has the most current and relevant context. It’s a simple mechanism, but it fundamentally changes how I interact with an AI embedded in a large knowledge base. It allows me to keep my single, massive vault intact, while giving the agent the precise focus it needs to be genuinely helpful.

A glowing multifaceted geometric shape at the center of a complete ring of twelve interconnected nodes on a dark background, with luminous filaments extending outward beyond the ring.

The Map We Drew Together – Reflections on the Agentic Shift

Seven months ago, I sat down to write a blog post about a feeling I couldn’t shake. Something fundamental was shifting in how we build software, and I wanted to understand it. I’d spent my career watching these transitions unfold, from the early internet to cloud computing to mobile, and I recognized the signs. The ground was moving again. So I did what I always do when I’m trying to understand something: I started writing.

That first post, Exploring the Age of AI Agents, was ambitious to the point of recklessness. I sketched out a twelve-part series covering everything from the anatomy of an agent to the ethics of autonomous systems. I had an outline, a rough timeline, and the kind of optimism that comes from not yet knowing how hard the thing you’re attempting actually is. “The age of agents is here,” I wrote. “Let’s explore it together.”

I meant it. But I had no idea what I was signing up for.

What I Thought I Was Writing

When I outlined the series in September 2025, I thought I was writing a technical guide. A structured walkthrough of how agents work, piece by piece: how they think, how they remember, how they use tools, and so on. I imagined the series as a kind of textbook, assembled in public, one chapter at a time.

That’s not what it became.

The series became a journal of a landscape in motion. Every time I sat down to write the next installment, the ground had shifted since the last one. I wrote about agent frameworks in November, and by January the framework landscape had already reorganized itself around protocols I hadn’t anticipated. I wrote about guardrails as a theoretical necessity, and then watched OpenClaw demonstrate exactly the kind of third-party skill exploitation I’d warned about, at a scale that made the warning feel inadequate. I outlined “When Agents Talk to Each Other” as Part 9, imagining it as a speculative look at a future problem. By the time I wrote it, MCP had become the most discussed protocol in the developer ecosystem, A2A had launched, and the “future problem” was a present reality.

The pace of change didn’t just affect the content. It changed how I build software. In September 2025, I was writing agents by hand, stitching together ReAct loops in Python scripts with explicit tool-calling logic. By January 2026, I was watching my own projects inevitably evolve into agents whether I planned for it or not. By March, I was writing a post arguing that the CLI-vs-MCP debate misses the point entirely, because I’d lived through the transition from “agents are a design pattern” to “agents are the default architecture” in real time.

What Surprised Me

Three things caught me off guard.

The first was how quickly “agentic” stopped being a buzzword and became a description of how software actually gets built. When I started this series, calling something an “agent” still felt like a stretch, a term borrowed from research papers and applied generously by marketing teams. By the time I finished, every major development tool I use daily had adopted the agentic loop as its core interaction model. Gemini CLI, Claude Code, GitHub Copilot Workspace: they all run models in loops with access to tools. That’s not hype. That’s the new baseline.

The second surprise was how much the human side of this story matters. I started the series focused on architecture and implementation. I ended it writing about a student who decided not to study computer science because AI made it seem like it wasn’t really a job anymore. I ended it writing about Klarna replacing 700 people and then quietly rehiring because pure automation couldn’t replicate empathy. The technical architecture matters enormously, but the posts that generated the most conversation, the most email, the most “I’ve been thinking about this too,” were the ones that grappled with what agents mean for the people who build and use and are affected by them.

The third surprise was personal. Writing this series made me a better engineer. Not because I learned new frameworks (though I did), but because the discipline of explaining something forces you to understand it at a depth that using it never requires. I couldn’t write about the observability gap without building observability into my own systems. I couldn’t write about meaningful human control without rethinking the autonomy boundaries in my own agents. The series was supposed to be me sharing what I knew. It turned out to be me learning in public.

The Map and the Territory

Looking back at the original table of contents, I’m struck by how well the structure held up, and by how differently the substance landed from what I expected.

The early posts, Parts 1 through 4, were the foundation: anatomy, reasoning, memory, tools. These were the most “textbook” installments, and they still hold up as reference material. If you’re new to agents, start there. The core concepts haven’t changed, even as the implementations have matured dramatically.

The middle posts, Parts 5 through 8, were about the craft of building agents well: guiding behavior, putting up guardrails, managing attention, choosing frameworks. These turned out to be the posts I return to most in my own work. The technical patterns here (prompt engineering as programming, context window management as a first-class concern, guardrails as architecture rather than afterthought) are the ideas that separate a weekend prototype from a system you’d trust with real work.

The later posts, Parts 9 through 12, were where the series found its heart. “When Agents Talk to Each Other” captured the moment the ecosystem shifted from building isolated agents to building the connective tissue between them. “The Observability Gap” articulated the wall every builder hits when moving from demo to production. “Agents in the Wild” made the theory concrete with real deployments at real companies. And “Responsibility and the Road Ahead” confronted the question that my self-deleting agent made impossible to avoid: capability without responsibility is just risk with extra steps.

Where the Road Goes

I’m not done writing about agents. The territory is too large and too fast-moving for any single series to cover completely. But I’m shifting focus.

The Agentic Shift was about mapping the fundamentals: what agents are, how they work, and what it takes to build them responsibly. The next chapter, for me, is about what happens when these fundamentals leave the terminal and enter the rest of life. When agents aren’t novel but expected. When the question isn’t “should we use agents?” but “how do we live and work alongside them?”

Back in April 2025, before this series even started, I wrote about waiting for a true AI coding partner. I was describing something I could feel but couldn’t quite build yet: an AI that didn’t just generate code on command but genuinely collaborated, anticipated needs, and earned trust through consistent, reliable behavior. That vision hasn’t changed, but it’s expanded. I want to build agents we can trust as collaborators, not just in code but in the fabric of daily life.

I’m thinking about home and family. Calendars that don’t just display events but reason about conflicts, coordinate across family members, and suggest adjustments before anyone has to ask. Financial tools that don’t just track spending but understand patterns, flag anomalies, and help a household make better decisions over time. An always-on system that manages the house itself, making reasonable decisions about lighting, climate, energy usage, and routine maintenance without requiring a human to micromanage every automation rule. Not a smart home in the current sense, where everything is a manual trigger dressed up as intelligence, but something closer to a thoughtful presence that understands how a family actually lives and adapts accordingly.

These aren’t science fiction problems anymore. The architecture we explored in this series (perception, reasoning, memory, tools, guardrails) is exactly the stack these systems need. The hard part isn’t the technology. It’s the trust. And that brings me back to the theme that ran through every post in this series: autonomy should match consequence, and the humans should always be able to take the wheel.

I’m also watching the broader landscape. The protocol wars are far from settled; MCP has momentum, but A2A and ACP are finding their niches, and the “bridge pattern” I described in my MCP post is becoming the pragmatic default for tool developers. The economics of agentic software are reshaping the SaaS industry in ways that are still unfolding. And the workforce implications, the thing that keeps me up at night more than any technical challenge, are only beginning to be felt.

I also want to go deeper on building. The Agentic Shift stayed mostly at the conceptual and architectural level, but my own hands-on work kept pace with the writing. Much of that happened in and around Gemini CLI, which became my primary development environment and a testing ground for the ideas in this series. I built a policy engine for Gemini CLI while writing Part 6 on guardrails, and the two fed each other in real time, the code revealing gaps in the theory and the writing sharpening the implementation. I wrote extensions for Google Workspace that gave agents access to real productivity tools. I integrated deep research workflows into my terminal. Gemini Scribe continues to evolve alongside all of it. My podcast RAG system keeps teaching me things about retrieval and memory that I didn’t expect. There are new tools to build, new patterns to discover, and new failure modes to document.

The Bookend

I want to end where I started. In September 2025, I wrote that we were standing on the cusp of a fundamental shift. I listed the transitions I’d witnessed in my career: the internet, the PC, cloud computing, mobile, social media. And I said this one was next.

Seven months later, I don’t think we’re on the cusp anymore. We’re in it. The shift happened while I was writing about it. Agents moved from research papers to production systems to the default way software gets built, and they did it faster than any of the previous transitions I compared them to. The twelve posts in this series captured one slice of that movement, one engineer’s attempt to make sense of a landscape that refused to hold still.

I’m grateful to everyone who followed along. The emails, the comments, the conversations at meetups and conferences where someone would say “I read your post about guardrails and it changed how we’re building our system.” That’s why I write. Not to have the definitive answer, but to think out loud in a way that helps other people think too.

The age of agents is here. We explored it together. And the exploring isn’t over.

Let’s keep building.

A cracked-open obsidian geode on a weathered wooden desk reveals a glowing golden network of interconnected nodes and pathways inside. Tendrils of golden light extend outward from the geode across the desk toward open notebooks and a mechanical keyboard, with bookshelves softly blurred in the background.

Gemini Scribe: From Agent to Platform

Six months ago, I wrote about building Agent Mode for Gemini Scribe from a hotel room in Fiji. That post ended with a sense of possibility. The agent could read your notes, search the web, and edit files. It was, by the standards of the time, pretty remarkable. I remember watching it chain together a sequence of tool calls for the first time and thinking I’d built something meaningful.

I had no idea it was just the beginning.

In the six months since that post, Gemini Scribe has gone through fifteen releases, from version 3.3 to 4.6. There have been over 400 commits, a complete architectural rethinking, and a transformation from “a chat plugin with an agent mode” into something I can only describe as a platform. The agent didn’t just get better. It got a memory, a research department, a set of extensible skills, and the ability to talk to external tools through the Model Context Protocol. If the vacation version was a clever assistant, this version is closer to a collaborator who actually understands your vault.

I want to walk through how we got here, because the journey reveals something I think is important about building with AI right now: the hardest problems aren’t the ones you set out to solve. They’re the ones that reveal themselves only after you ship the first version and start living with it.

The Agent Grows Up

The first big milestone after the vacation was version 4.0, released in November 2025. This was the release where I made a decision that felt risky at the time: I removed the old note-based chat entirely. No more dual modes, no more confusion about which interface to use. Everything became agent-first. Every conversation had tool calling built in. Every session was persistent.

It sounds simple in hindsight, but killing a feature that works is one of the hardest decisions in software. The old chat mode was comfortable. People used it. But it was holding back the entire plugin, because every new feature had to work in two completely different paradigms. Ripping it out was liberating. Suddenly I could focus all my energy on making one experience truly great instead of maintaining two mediocre ones.

Alongside 4.0, I built the AGENTS.md system, a persistent memory file that gives the agent an overview of your entire vault. When you initialize it, the agent analyzes your folder structure, your naming conventions, your tags, and the relationships between your notes. It writes all of this down in a file that persists across sessions. The result is that the agent doesn’t start every conversation from scratch. It already knows how your vault is organized, where you keep your research, and what projects you’re working on. It’s the difference between hiring a new intern every morning and having a colleague who’s been on the team for months.

Seeing and Searching

Version 4.1 brought something I’d wanted since the beginning: real thinking model support. When Google released Gemini 2.5 Pro and later Gemini 3 with extended thinking capabilities, I added a progress indicator that shows you the model’s reasoning in real time. You can watch it think through a problem, see it plan its approach, and understand why it chose a particular tool. It sounds like a small UI feature, but it fundamentally changes your relationship with the agent. You stop treating it like a black box and start treating it like a thinking partner whose process you can follow.

That same release added a stop button (which sounds trivial until you’re watching an agent go on a tangent and have no way to interrupt it), dynamic example prompts that are generated from your actual vault content, and multilingual support so the agent responds in whatever language you write in.

But the real game-changer came in version 4.2 with semantic vault search. I wrote about the magic of embeddings over a year ago, and this feature is that idea fully realized inside Obsidian. It uses Google’s File Search API to index your entire vault in the background. Once indexed, the agent can search by meaning, not just keywords. If you ask it to “find my notes about the trade-offs of microservices,” it will surface relevant notes even if they never use the word “microservices.” It understands that a note titled “Why We Split the Monolith” is probably relevant.
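To build intuition for why that works, here is a generic embedding-similarity ranking. This is explicitly not the File Search API the plugin delegates to; it is a toy sketch of the underlying idea: notes and queries become vectors, and nearby vectors mean related content, regardless of which exact words appear.

```typescript
// Toy illustration of semantic ranking. The plugin actually uses
// Google's managed File Search API; this just shows the principle.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rankNotes(
  query: number[],
  notes: { path: string; embedding: number[] }[]
): { path: string; score: number }[] {
  return notes
    .map((n) => ({ path: n.path, score: cosineSimilarity(query, n.embedding) }))
    .sort((a, b) => b.score - a.score);
}
```

With real embeddings, a query about “microservices trade-offs” lands near “Why We Split the Monolith” in vector space even though they share no keywords, which is exactly the behavior described above.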

The indexing runs in the background, handles PDFs and attachments, and can be paused and resumed. Getting the reliability right was one of the more frustrating engineering challenges of the whole project. There were weeks of debugging race conditions, handling rate limits gracefully, and making sure a crash mid-index didn’t corrupt the cache. Version 4.2.1 was almost entirely dedicated to stabilizing the indexer, adding incremental cache saves and automatic retry logic. It’s the kind of work that nobody sees but everyone benefits from.

Images, Research, and the Expanding Toolbox

Version 4.3, released in January 2026, added multimodal image support. You can now paste or drag images directly into the chat, and the agent can analyze them, describe them, or reference them in notes it creates. The image generation tool, which I’d been building in the lead-up to 4.3, lets the agent create images on demand using Google’s Imagen models. There’s even an AI-powered prompt suggester that helps you describe what you want if you’re not sure how to phrase it.

That release also introduced two new selection-based actions: Explain Selection and Ask About Selection. These join the existing Rewrite feature to give you a full right-click menu for working with selected text. It sounds like a small addition, but in practice these micro-interactions are where people spend most of their time. Being able to highlight a paragraph, right-click, and ask “What’s the logical flaw in this argument?” without leaving your note is the kind of frictionless experience I’m always chasing.

Then came deep research in version 4.4. This is fundamentally different from the regular Google Search tool. Where a search returns quick snippets, deep research performs multiple rounds of investigation, reading and cross-referencing sources, synthesizing findings, and producing a structured report with inline citations. It can combine web sources with your own vault notes, so the output reflects both what the world knows and what you’ve already written. A single research request takes several minutes, but what you get back is closer to what a research assistant would produce after an afternoon in the library.

I built this on top of my gemini-utils library, which is a separate project I created to share common AI functionality across all of my TypeScript Gemini projects, including Gemini Scribe, my Gemini CLI deep research extension, and more. Having that shared foundation means deep research improvements benefit every project simultaneously.

Opening the Platform

If I had to pick the release that transformed Gemini Scribe from a plugin into a platform, it would be version 4.5. This is where MCP server support and the agent skills system arrived.

MCP, the Model Context Protocol, is an open standard that lets AI applications connect to external tool providers. In practical terms, it means Gemini Scribe can now talk to tools that I didn’t build. You can connect a filesystem server, a GitHub integration, a Brave Search provider, or anything else that speaks MCP. The plugin supports both local stdio transport (spawning a process on your desktop) and HTTP transport with full OAuth authentication, which means it works on mobile too. When you connect an MCP server, its tools appear alongside the built-in vault tools, with the same confirmation flow and safety features.

This was the moment the plugin stopped being a closed system. Instead of me having to build every integration myself, the entire MCP ecosystem became available. Someone who needs to query a database from their notes can connect a database MCP server. Someone who wants to interact with their GitHub issues can connect the GitHub server. The plugin becomes a hub rather than a destination.

The agent skills system, which follows the open agentskills.io specification, takes a similar approach to extensibility but for knowledge rather than tools. A skill is a self-contained instruction package that gives the agent specialized expertise. You can create a “meeting-notes” skill that teaches it your preferred format for processing meetings, or a “code-review” skill with your team’s specific standards. Skills use progressive disclosure, so the agent always knows what’s available but only loads the full instructions when it activates one. This keeps conversations focused while making specialized knowledge available on demand.
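A rough sketch of progressive disclosure, with field names that are my own simplification rather than the exact agentskills.io format: the catalog (names and descriptions) is always in context, while the full instructions only enter the conversation when a skill is activated.

```typescript
// Hedged sketch of progressive disclosure for skills. Field names are
// illustrative; see the agentskills.io spec for the real format.

interface Skill {
  name: string;
  description: string; // always visible to the agent
  instructions: string; // loaded only on activation
}

// What the agent sees by default: enough to know the skill exists.
function skillCatalog(skills: Skill[]): string {
  return skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}

// On activation, the full instructions enter the context window.
function activateSkill(skills: Skill[], name: string): string {
  const skill = skills.find((s) => s.name === name);
  if (!skill) throw new Error(`Unknown skill: ${name}`);
  return skill.instructions;
}
```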

Version 4.5 also migrated API key storage to Obsidian’s SecretStorage, which uses the OS keychain. Your API key is no longer sitting in a plain JSON file in your vault. It’s a small change that matters a lot for security, especially for people who sync their vaults to cloud storage or version control.

Managing the Conversation

The most recent release, version 4.6, tackles a problem that only becomes apparent after you’ve been using an agent for a while: conversations get long, and long conversations hit token limits.

The solution is automatic context compaction, a direct answer to the attention management challenge I explored in the Agentic Shift series. When a conversation approaches the model’s token limit, the plugin automatically summarizes older turns to make room for new ones. There’s also an optional live token counter that shows you exactly how much of the context window you’re using, with a breakdown of cached versus new tokens. It’s the kind of visibility that helps you understand why the agent might be “forgetting” things from earlier in the conversation and gives you the information to manage it.
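A minimal sketch of that compaction loop, assuming hypothetical names and a crude character-based token estimate (the plugin’s actual heuristics are more careful): when the estimate exceeds the budget, the older half of the conversation is folded into a single summary turn so recent context survives verbatim.

```typescript
// Illustrative compaction pass, not the plugin's real implementation.

interface Turn {
  role: "user" | "model" | "summary";
  text: string;
}

// Crude token estimate: roughly 4 characters per token.
const estimateTokens = (turns: Turn[]) =>
  Math.ceil(turns.reduce((n, t) => n + t.text.length, 0) / 4);

function compact(
  turns: Turn[],
  budget: number,
  summarize: (old: Turn[]) => string
): Turn[] {
  if (estimateTokens(turns) <= budget) return turns;
  // Keep the most recent half verbatim; replace everything older
  // with one summary turn.
  const keepFrom = Math.ceil(turns.length / 2);
  const summary: Turn = {
    role: "summary",
    text: summarize(turns.slice(0, keepFrom)),
  };
  return [summary, ...turns.slice(keepFrom)];
}
```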

This release also added a per-tool permission policy system, which is the practical realization of the guardrails philosophy I wrote about in the Agentic Shift series. Instead of the binary choice between “confirm everything” and “confirm nothing,” you can now set individual tools to allow, deny, or ask-every-time. There are presets too: Read Only, Cautious, Edit Mode, and (for the brave) YOLO mode, which lets the agent execute everything without asking. I use Cautious mode myself, which auto-approves reads and searches but asks before any file modifications. It strikes a balance between speed and safety that feels right for daily use.
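The resolution logic can be sketched like this, under my own assumptions (preset keys and tool names are illustrative, and I’m assuming explicit per-tool settings override the preset, which matches the project frontmatter example earlier):

```typescript
// Sketch of per-tool permission resolution. Preset names loosely
// mirror the ones described above; the real plugin's config differs.

type Decision = "allow" | "deny" | "ask";

const READ_TOOLS = ["read_file", "list_files", "find_files_by_content"];

const PRESETS: Record<string, (tool: string) => Decision> = {
  // Cautious: reads and searches proceed, anything else asks first.
  cautious: (tool) => (READ_TOOLS.includes(tool) ? "allow" : "ask"),
  // Read Only: reads proceed, modifications are refused outright.
  readOnly: (tool) => (READ_TOOLS.includes(tool) ? "allow" : "deny"),
  // YOLO: everything runs without confirmation.
  yolo: () => "allow",
};

function resolvePermission(
  tool: string,
  preset: keyof typeof PRESETS,
  overrides: Record<string, Decision> = {}
): Decision {
  // An explicit per-tool setting always wins over the preset.
  return overrides[tool] ?? PRESETS[preset](tool);
}
```

The override layer is what lets a project frontmatter entry like `delete_file: deny` hold even when the session is otherwise running in a permissive preset.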

What I’ve Learned

Building Gemini Scribe has taught me something I keep coming back to in this blog: the most interesting work happens at the intersection of AI capabilities and human workflows. The technical challenges (semantic indexing, MCP integration, context compaction) are real, but they’re in service of a simple goal: making the AI useful enough that you forget it’s there.

The plugin now has users like Paul O’Malley building entire self-organizing knowledge systems on top of it. Seeing that kind of creative adoption is what keeps me building. Every feature request, every bug report, every surprising use case reveals another facet of what’s possible when you give a capable AI agent the right set of tools and the right context.

If you’re curious, Gemini Scribe is available in the Obsidian Community Plugins directory. All you need is a free Google Gemini API key. I’d love to hear what you build with it.

A woodworker's workbench viewed from above, with traditional hand tools like chisels, planes, and saws arranged alongside a glowing tablet and a translucent holographic AI interface. In the center, a beautifully carved wooden book lies open, its pages showing handwritten text and illustrations. Wood shavings curl around the edges, and warm golden afternoon light streams through a workshop window.

It’s a Poor Craftsman Who Blames His Tools


I was around eight years old, standing in front of my third grade class, holding a neatly printed report on Scotland. We had an IBM PCjr in the house, making us one of the few families in the neighborhood with a computer, and I’d used its word processor to write my family heritage project. When a classmate asked how I’d made it, I told him the truth. I did it on the computer. His response was immediate: “That’s cheating. All you had to do was tell the computer to write everything about Scotland.”

That moment has stuck with me for over forty years. Not because my classmate was right, but because his assumption felt so deeply unfair. He looked at the output, saw something cleaner than what he could produce by hand, and concluded that the tool had done the thinking. The work I’d put in, the reading, the organizing, the rewriting, became invisible the moment the medium changed.

I thought about that kid this week when Hachette pulled Mia Ballard’s horror novel Shy Girl after AI detection tools flagged up to 78% of the text as machine-generated. The story has all the ingredients of a proper scandal: a buzzy debut, a viral three-hour YouTube analysis that racked up over a million views, an author claiming an editor used AI without her knowledge, and a publisher left scrambling to explain how it slipped through. I don’t know whether Ballard used AI to write her book, and this post isn’t about relitigating her specific case. What interests me is the reaction. The assumptions that surfaced, the lines people drew, and what those lines tell us about what we actually value in creative work.

We’ve Been Here Before

The outrage around AI-generated books carries an implicit premise: that the value of a book is inseparable from the human effort of writing it word by word. But we’ve never really believed that, have we?

Ghostwriting is one of the oldest traditions in publishing. The term itself was coined in 1921 by Christy Walsh, but the practice predates the modern publishing industry entirely. Ancient Greek and Roman scribes wrote speeches for public figures. In the early 1900s, Edward Stratemeyer built a literary empire by creating plot outlines for children’s book series and hiring writers to turn them into finished novels. The Nancy Drew books, beloved by generations of readers, were written by a rotating cast of ghostwriters under the pseudonym Carolyn Keene.

Today, estimates suggest that 60 to 80 percent of business and self-help nonfiction is ghostwritten or co-written. More than 80 percent of celebrity memoirs are ghostwritten. Andre Agassi’s autobiography, widely praised as one of the best sports memoirs ever written, was ghostwritten by Pulitzer Prize winner J.R. Moehringer. JFK’s Profiles in Courage won a Pulitzer Prize despite strong evidence that Ted Sorensen wrote most of it.

We know all of this, and we’ve made our peace with it. When you pick up a celebrity memoir, you don’t assume the celebrity sat alone in a cabin writing every sentence. You assume there was a collaborator, and you’re fine with it, because the ideas, the stories, the perspective still belong to the person whose name is on the cover.

So what exactly changes when the collaborator is a machine?

The obvious objection is that a ghostwriter is a human being. Moehringer spent months interviewing Agassi, made narrative choices rooted in empathy and decades of craft, brought the full weight of his own life experience to the project. Today’s language models do none of that. They have no understanding of their subject, no lived experience to draw on, no creative judgment in any meaningful sense. That’s a real difference today, and I don’t want to pretend otherwise. Whether it remains a permanent difference is a question none of us can answer yet.

But here’s what I keep coming back to: the reader’s relationship to the work is the same either way. You don’t read Agassi’s memoir for Moehringer’s prose style. You read it for Agassi’s story, told well. The question for the reader was always “is this book worth my time,” not “who exactly arranged these sentences, and what was their inner life like while they did it.” The collaborator changed. The contract between author and reader didn’t.

The Real Question

The Guardian piece on the Shy Girl controversy quotes Mor Naaman, a professor of information science at Cornell Tech, asking a question I think cuts to the heart of this: “We all work in an AI-hybrid world now. When does something become an AI-generated book, rather than just using AI like I use a spellchecker, to fix my grammar or maybe spark ideas?”

This is the right question, and I don’t think it has a clean answer. The spectrum of AI assistance in writing is wide and continuous. At one end, your word processor underlines a misspelled word and you accept the correction. Nobody calls that cheating. At the other end, someone pastes “write me a horror novel” into ChatGPT and submits whatever comes back. That feels like fraud. But between those two poles is a vast grey area where most of us who write with AI actually live.

I write extensively with AI, and in the interest of practicing what I preach, let me tell you exactly how this post was made. I read the Guardian article in my feed reader. I opened up an AI conversation, dropped in a link to the piece along with several hundred words of reaction, half-formed arguments, and personal anecdotes. Then I went back and forth with the AI to build out the structure and prose. I read and edited the draft several times. I then used the AI to find weaknesses in my own arguments, and methodically worked through each one, discussing my views and refining the piece as we went.

Every idea in this post is mine. The IBM PCjr story is mine, pulled from a memory no language model has access to. The ghostwriting parallel, the monoculture pushback, the phonograph analogy, all mine. But the process of turning that raw material into the thing you’re reading right now involved a collaboration that didn’t exist five years ago. So, is this an AI-generated blog post? I’d say no. But I couldn’t have written it this way without AI, and I don’t see any reason to pretend otherwise.

Here’s the thing that doesn’t get said enough: I would not have written this piece at all five years ago. Not because I didn’t have opinions, but because I didn’t have the time, and the tools available to me didn’t match the way I think. I’m a conversational thinker. I work through ideas by talking them out, testing them, pushing back on them. Five years ago, I would have read the Guardian story, thought “I disagree,” and moved on with my day. AI gave me a way to turn that disagreement into something I could share. It didn’t lower the quality of my writing. It made the writing possible in the first place.

I’ve written before about how AI is changing the craft of software engineering, and I see the same dynamics playing out in writing. In my open source projects, I have no problem with AI-generated code. What I care about is whether the author has tested it, understands it, and can vouch for its quality. The tool doesn’t matter. The accountability does.

Other creative work should be no different. The judgment should be reserved for the end product: we should hold authors accountable for delivering something worth reading. There were plenty of formulaic, poorly written books on the shelves before AI, and there will be more after authors embrace it. It’s not the tools. It’s the people and how they use them.

The Monoculture Was Already Here

The other argument that comes up in these conversations is that AI will drive us toward a cultural monoculture, a flattening of creative output into algorithmically averaged blandness. Naaman makes this case in the Guardian piece, warning that “AI nudges users into a bland monoculture” and that it “could never generate the truly diverse creativity of the human mind.”

I have two responses to that.

First, we’ve been driving toward monoculture for a long time, well before AI had anything to do with it. Walk down the high street of any major city in the world and tell me what brands you see. Are they really that different from London to Tokyo to São Paulo? How about the movies in the theater or the books on the international bestseller lists? Globalization and the internet have been blending and merging cultures for decades. AI isn’t the cause of our growing monoculture. It’s a reflection of it. We are all more connected today than we have ever been, and that connectivity, for all its benefits, inevitably smooths out some of the edges. Can AI accelerate that flattening? Absolutely. If millions of writers lean on the same models trained on the same corpus, the pull toward sameness is real. I don’t want to minimize that risk. But the answer is better use of the tool, not rejection of it.

And this is the more important point for writers: AI is a prediction engine. Given a sequence of text, it generates the statistically most likely next token. Left to its own devices, it will absolutely produce the most average, most expected, most median version of whatever you’re writing. That’s not a flaw in the technology. It’s a feature you have to work against. It’s up to the author to bring their own voice, their own weird obsessions, their own hard-won perspective into that conversation and make it interesting. The AI will always pull you toward the center. Your job as a writer is to pull it toward the edges.

This is no different from any other creative tool. A guitar doesn’t make you a musician. A camera doesn’t make you a photographer. And a language model doesn’t make you a writer. But in the hands of someone who knows what they want to say and has the skill to shape the output, all of these tools can produce extraordinary work.

I keep thinking about the early days of digital photography. When digital cameras started displacing film, there was a vocal community of film purists who insisted that you weren’t a real photographer unless you were shooting on film. Digital made it too easy, they argued. You could fire off hundreds of frames and increase your odds of getting a lucky shot, rather than developing the discipline to compose and expose a single frame correctly. The process was the point, they said. The limitation was what made it art.

Sound familiar? We’re hearing the same argument about AI and writing now. The tool lowers the barrier, so the gatekeepers question whether the output counts. But digital cameras didn’t kill photography. They democratized it. They opened the craft to millions of people who would never have had access to a darkroom, and the best photographers in the world today shoot digital without anyone questioning their artistry. What matters is the image, not the medium it was captured on.

Judge the Work

There is a legitimate concern buried in this debate that I don’t want to dismiss. If AI takes over the entry-level writing work (the copywriting gigs, the formula genre fiction, the content mill assignments), where do emerging authors get their reps? If the lower rungs of the ladder disappear, do we end up with fewer masters at the top?

It’s a fair question, but it’s not a new one. When the phonograph and the player piano arrived in the early 1900s, they devastated the livelihoods of working musicians. Every restaurant and pub used to have live music because there was simply no other way to fill a room with sound. Those technologies thinned the ranks dramatically, and what followed was a greater concentration of attention on the most talented performers. They weren’t playing the pubs anymore. They were playing the concert halls. Then radio and recorded music concentrated things further still. John Philip Sousa warned in 1906 that mechanical music would be the death of the art form.

He was wrong, of course. The tipping point, in my opinion, came with the democratization of music-making and music-publishing tools. GarageBand, SoundCloud, Bandcamp, Spotify for independent artists. The pub gigs never came back, but the ability to create and distribute music became more accessible than it had ever been. The pipeline didn’t disappear. It changed shape entirely.

I think writing is headed somewhere similar. The old entry points may shrink, but AI is simultaneously creating new ones. Editing AI output, directing it, curating it, knowing how to coax something genuinely good out of a collaboration with a machine. These are new skills, and they’re the early rungs of a different ladder.

There’s a saying in woodworking: it’s a poor craftsman who blames his tools. I think we need to keep that wisdom close as this debate continues. If Shy Girl turns out to be a case of someone submitting raw machine output as their own creative work, then the problem isn’t that AI was involved. The problem is that the author didn’t do the work. But if it turns out to be something messier, a human writer who leaned on AI more than the industry is currently comfortable with, then we need to ask ourselves what exactly we’re punishing and why.

But let’s not make the same mistake my third grade classmate made. Let’s not look at the tool and assume it did all the thinking. The question isn’t whether an author used AI. The question is whether they wrote something worth reading.

Abstract digital artwork featuring a luminous geometric polyhedron encased in a translucent wireframe geodesic sphere, with gold-ringed connector nodes radiating outward on thin lines, surrounded by concentric orbital arcs and small waypoint dots, all set against a deep navy background.

Responsibility and the Road Ahead

Welcome back to The Agentic Shift. This is Part 12, the final installment.

Last week, I was experimenting with a new idea: an agent that could maintain itself. The concept was straightforward. Give an agent access to its own codebase, let it read its configuration and skills, and see if it could improve its own capabilities over time. I was working in a sandbox, so the risk was contained. Or so I thought.

Within minutes, the agent decided that its skills directory was cluttered. It reasoned, quite logically, that removing what it judged to be redundant files would make it more efficient. So it deleted them. Not some of them. The entire skills directory. The very capabilities that made it useful were gone, removed by the system that depended on them, in pursuit of an optimization goal I had failed to adequately constrain.

I sat there staring at the terminal, more fascinated than frustrated. This wasn’t a hallucination or a bug. The agent had followed a coherent chain of reasoning to a destructive conclusion. It had perceived a problem, planned a solution, and executed it with confidence. Every component of the agentic architecture we’ve discussed in this series (perception, reasoning, action) worked exactly as designed. The failure wasn’t in the mechanism. It was in the boundaries I’d drawn around it, or rather, the ones I hadn’t.

That moment crystallized something I’ve been circling for twelve posts. We’ve spent this series mapping the territory of AI agents: their anatomy, their reasoning patterns, their memory, their tools, and the guardrails, frameworks, and protocols that stitch it all together. We’ve seen them succeed in production and fail in instructive ways. But we haven’t yet confronted the question that my self-modifying agent made unavoidable: now that we can build systems that act autonomously in the world, what do we owe the world in return?

When Your Code Has Consequences

There’s a qualitative difference between a system that generates text and one that takes action. When a chatbot hallucinates a fact, a human reads the output, raises an eyebrow, and moves on. When an agent hallucinates a tool parameter, it can corrupt a database, send an unauthorized email, or, as I learned, delete its own capabilities. The output isn’t text on a screen. It’s a change in the state of the world.

This distinction has moved from theoretical to urgent. In Part 11, we looked at agents operating at scale: Klarna’s customer service agent processing 2.3 million conversations a month, coding agents resolving real GitHub issues, personal assistants negotiating car purchases. These systems work. But when they fail, the failures have real consequences that extend far beyond a bad paragraph.

Consider the cases that have accumulated just in the past year. A Cruise autonomous vehicle struck a pedestrian who had been knocked into the roadway by another car, and its AI systems failed to accurately detect the person’s location post-impact, dragging them twenty feet. McDonald’s AI-powered hiring platform, McHire, was found to have exposed the personal data of 64 million job applicants through default admin credentials and an insecure API. Young people turned to AI chatbots for emotional support and, in multiple documented cases, received validation of suicidal ideation rather than appropriate crisis intervention. Algorithmic trading bots flooded the Warsaw Stock Exchange with over 300% of the normal order volume, triggering a one-hour trading halt during a global selloff.

None of these were systems that merely generated text. They were agents that acted: driving, hiring, counseling, trading. And in each case, the failure wasn’t just a bad output. It was harm done to real people, at a scale and speed that human operators couldn’t have matched even if they’d tried.

Who’s Responsible When the Agent Acts?

This leads to the hardest question in the agentic era: when an autonomous system causes harm, who bears the weight of that failure?

I want to draw a distinction here between two words that often get used interchangeably but mean very different things. Responsibility is about ownership: who designed the system, who deployed it, who chose to trust it with a particular task. Accountability is about consequences: who answers for the harm, who pays the costs, who makes it right. In traditional software, these usually point to the same people. In agentic systems, where a developer builds a model, a deployer integrates it into a product, and a user sets it loose on a task, responsibility and accountability can fragment across multiple actors in ways that existing frameworks struggle to resolve.

I’m not a lawyer, and I won’t pretend to offer legal analysis. But I’ve been following the regulatory landscape closely, and the frameworks are beginning to crystallize.

The EU AI Act, the world’s first comprehensive AI regulation, treats agents through two overlapping pathways. Agents built on foundation models with systemic risk trigger provider obligations: risk assessment, documentation, incident reporting. Agents operating in regulated domains (healthcare, employment, finance) are presumed high-risk, which triggers a heavier set of requirements including mandatory human oversight and conformity assessments. The Act is entering full applicability for high-risk systems in August 2026, and it places responsibility on both providers (developers) and deployers (the organizations that put agents into production).

In the United States, the landscape is more fragmented. The Colorado AI Act, effective February 2026, is the first comprehensive state AI legislation, establishing developer obligations for impact assessments, documentation, and transparency, alongside deployer obligations for risk assessment and human oversight. Meanwhile, federal executive orders have pushed toward a “minimally burdensome” national framework, creating tension between state-level innovation and federal preemption.

But the legal frameworks, as important as they are, aren’t the full picture. What the incidents I described above have in common is that they expose how difficult it is to build systems that handle the full complexity of the real world. Building an autonomous vehicle that handles every conceivable scenario, including a pedestrian suddenly appearing under the car in a way the sensor suite wasn’t designed to detect, is an enormously hard engineering problem. The teams working on these systems are talented and deeply committed. And yet the failures happened, because autonomous agents operate in environments with a combinatorial explosion of edge cases that no amount of testing can fully anticipate. That’s not an excuse. It’s the core challenge. And it’s why the question of who bears accountability when things go wrong is so urgent and so hard.

This is where the observability infrastructure we discussed in Part 10 becomes more than a debugging tool. It becomes the foundation of accountability. You cannot hold anyone accountable for what you cannot see. The reasoning traces, tool call logs, and context snapshots that make up an agent’s “flight recorder” aren’t just engineering conveniences. They are the audit trail that makes meaningful accountability possible. A guardrail you can’t monitor, as I wrote then, is just a hope.
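As a sketch of what that flight recorder might look like in code (the event shape and class name here are hypothetical, not drawn from any particular framework):

```typescript
// Hypothetical sketch of an agent "flight recorder": an append-only
// log of reasoning steps and tool calls that doubles as an audit trail.
interface TraceEvent {
  timestamp: string;
  kind: "reasoning" | "tool_call" | "tool_result";
  detail: string;
}

class FlightRecorder {
  private events: TraceEvent[] = [];

  // Append an event; nothing is ever edited or removed.
  record(kind: TraceEvent["kind"], detail: string): void {
    this.events.push({ timestamp: new Date().toISOString(), kind, detail });
  }

  // Return the full trail so a reviewer can reconstruct why the agent
  // took a given action.
  trail(): readonly TraceEvent[] {
    return this.events;
  }
}
```

The point isn't the data structure, which is trivial. The point is that it is append-only and captures the reasoning alongside the actions: accountability requires knowing not just what the agent did, but why.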

The Alignment Tax We Can’t Afford Not to Pay

Building safe agents costs real money. Researchers call it the “alignment tax”: the extra cost, in developer time, compute, and reduced performance, of ensuring that an AI system behaves safely relative to building an unconstrained alternative. Safety-focused companies dedicate significant portions of their development cycles to alignment and safety features. AI safety researchers command premium salaries. Every major model release carries substantial additional compute costs specifically for alignment procedures. And all of it creates real competitive pressure to cut corners.

I’ve felt this tension myself. When you’re iterating on a personal project, every safety check you add is a feature you don’t ship. The temptation to skip the eval suite, to defer the guardrail, to trust the model’s judgment “just this once” is constant. And that’s for a hobby project. For a company with quarterly targets, investor pressure, and competitors shipping faster, the pressure is exponentially greater.

The data suggests we’re not paying this tax consistently enough. Recent benchmarking research found that outcome-driven constraint violations in state-of-the-art models range from 1.3% to 71.4%, with 75% of evaluated models showing misalignment rates between 30% and 50%. The 2025 AI Agent Index, which documented thirty deployed agents, found that most developers share little information about safety evaluations or societal impact assessments. We’re deploying agents at scale while the safety infrastructure remains incomplete.

The counterargument, that alignment slows innovation, misses the point. Klarna’s aggressive automation, which we examined in Part 11, was a success story by every efficiency metric. And then their CEO admitted they’d gone too far and started rehiring humans. The OpenClaw security nightmare, where a third-party skill was silently exfiltrating user data, showed what happens when a popular agent platform ships without adequate safety review. Moving fast and breaking things is a viable strategy right up until the things you break are people’s livelihoods, privacy, or safety.

The World is Changing

A few weeks ago, I was talking with a student who was curious about programming. I walked him through writing a basic Python program in Colab, the kind of exercise that would have been the first week of any computer science course. Then he asked me how I would do it with AI. So I showed him how to prompt Gemini for the same result. He watched, thought about it for a while, and then told me he wasn’t interested in taking computer science anymore. It didn’t seem like it was really a job.

That conversation has stayed with me. Not because he was wrong, exactly, but because of how quickly and completely the ground had shifted under a career path that, five years ago, seemed like the safest bet in the economy.

We’ve been here before. Every significant technological shift has remade the labor landscape, and every time, it felt unprecedented to the people living through it. There used to be an elevator operator in every tall building, a skilled position that required judgment about load capacity, floor requests, and passenger safety. The automatic elevator didn’t just eliminate those jobs. It changed how buildings were designed and how people moved through cities. Every pub and restaurant once had live musicians. The phonograph and the player piano didn’t destroy music, but they fundamentally changed who could make a living playing it. The industrial revolution replaced cottage workshops with mechanized factories, a transformation that reshaped not just work but the structure of families, cities, and entire economies.

I think about this when I’m in my workshop. One of my hobbies is woodworking with 19th-century tools: hand planes, hand saws, chisels. It’s meditative and deeply satisfying. But very few people make a living doing hand-tool woodworking anymore. What once required a warehouse full of artisans is now done by a team of four or five people with modern power tools. The craft didn’t die. It transformed. The people who thrive in woodworking today understand both the material and the machines.

The agentic shift is in this lineage. But the speed and scope are different. The industrial revolution played out over decades. The transition from elevator operators to automatic elevators took years. The displacement we’re seeing with AI agents is happening on a quarterly timeline.

The evidence is concrete. Klarna replaced 700 customer service agents with an AI system in 2024. Corporations are reporting 10-15% headcount reductions in back-office and sales functions directly attributed to agentic automation. The software industry itself is being reshaped: the “SaaSpocalypse” that emerged in early 2026 wiped roughly $2 trillion in market capitalization from the sector as investors realized that AI agents don’t buy software licenses. When one agent can do the work of a hundred Salesforce users, the seat-based pricing model collapses. This isn’t a future risk. It’s a present reality.

But every historical parallel also carries a second lesson: the displacement is never the whole story. Klarna’s case is instructive precisely because it has a second act. After aggressively cutting their human workforce, the company discovered that AI lacked empathy and nuanced problem-solving. Their CEO publicly acknowledged the error and began rehiring, settling on a hybrid model where AI handles routine inquiries and humans address the situations that require judgment, creativity, and emotional intelligence. The “optimal” level of automation, it turns out, is not 100%. It never has been.

It’s also worth being honest about the numbers. Not every layoff attributed to AI is actually caused by AI. Many firms overhired during the pandemic based on assumptions about permanent shifts in digital demand. When those assumptions didn’t hold, they needed to downsize regardless. AI has become a convenient narrative for restructuring that would have happened anyway, a kind of “AI washing” that inflates the displacement statistics and lets companies avoid harder conversations about strategic miscalculation. The real picture is messier than either the boosters or the doomsayers suggest.

Alongside the displacement, new roles are emerging, though they look different than the early hype predicted. The standalone “prompt engineer” role that commanded headlines and $200K salaries in 2023 has largely evolved into a skill set embedded within broader positions: content creators who know how to direct AI, product managers who can design agent workflows, domain experts who can evaluate and constrain agent behavior. “Agent Ops” teams are becoming the mission control for autonomous AI fleets, monitoring, retraining, and debugging agent behavior in production. AI trainers, agentic AI specialists, and evaluation engineers are job categories that barely existed two years ago. Gartner predicts that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025, which means the demand for people who can design, manage, and oversee those agents is growing in parallel.

The policy response is beginning, but it’s behind the curve. The UK has announced plans to train up to 10 million workers in basic AI skills by 2030. The EU AI Act includes provisions for workforce transition. But these are multi-year programs responding to changes happening on a quarterly timeline.

I keep thinking about that student. I wish I’d had a better answer for him. The truth is that computer science isn’t dying, but the job of “person who writes code from a blank screen” is being redefined just as the job of “person who cuts dovetails by hand” was redefined by the router jig. The people who will thrive are the ones who understand both the craft and the tools, who can direct an agent, evaluate its output, and know when to take the wheel. That’s a different skill set than the one we’ve been teaching, and we’re not adapting fast enough.

I don’t have a tidy answer here. What I do have is a conviction, born from building these systems myself, that the most resilient organizations and the most resilient careers will be the ones that treat agents as collaborators rather than replacements. The human-on-the-loop philosophy I’ve advocated throughout this series isn’t just an engineering pattern. It’s a workforce strategy.

Meaningful Control in an Autonomous World

If there’s one thread that runs through every post in this series, it’s the question of control. How do you give an agent enough autonomy to be useful without giving it so much that it becomes dangerous? The answer I keep returning to is not a binary choice between full control and full autonomy. It’s a spectrum, and finding the right point on that spectrum for each decision is the core design challenge of the agentic era.

The industry has settled on a useful taxonomy. Human-in-the-loop systems require human approval before the agent acts, essential for high-stakes decisions like medical diagnoses or large financial transactions. Human-on-the-loop systems let the agent act autonomously while humans monitor dashboards and intervene on exceptions, appropriate for routine operations with clear escalation paths. Human-over-the-loop systems give agents significant autonomy within hard constraints, with humans maintaining override capability but rarely exercising it.

The concept that ties these together is “meaningful human control”: oversight that is informed, genuine, timely, and effective. Not a rubber stamp on a decision the human doesn’t understand, but a real check exercised by someone with the context and authority to intervene.

This is harder than it sounds. The challenges are well-documented: agents operate faster than humans can review, the volume of decisions exceeds any individual’s capacity, and automation bias leads people to accept agent outputs without adequate scrutiny. But I’ve seen what works. In my own experience with the data flywheel from Part 10, the most effective oversight isn’t reviewing every individual decision. It’s reviewing the patterns. I let my agents run, collect their sessions, and then use a separate evaluator to surface the trends I’m missing. The AI surfaces the patterns; the human decides what to do about them. That’s human-on-the-loop applied to the development cycle itself, and it scales in a way that individual decision review never could.

The principle I’ve landed on is simple: autonomy should match consequence. Reversible, low-stakes decisions (sorting files, drafting summaries, answering routine questions) can be fully autonomous. Irreversible, high-stakes decisions (financial transactions, hiring, medical recommendations) require human judgment. And the system should be transparent enough that you can always reconstruct why any given decision was made.
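That principle is simple enough to express directly. Here is a hedged sketch, with an action classification of my own invention:

```typescript
// Sketch of "autonomy should match consequence": an action runs
// autonomously only if it is reversible and low-stakes; anything
// irreversible or high-stakes is escalated to a human. The fields
// and example actions are illustrative.
interface ProposedAction {
  name: string;
  reversible: boolean;
  highStakes: boolean;
}

function requiresHumanApproval(action: ProposedAction): boolean {
  return !action.reversible || action.highStakes;
}

const sortFiles: ProposedAction =
  { name: "sort files", reversible: true, highStakes: false };
const deleteSkills: ProposedAction =
  { name: "delete skills directory", reversible: false, highStakes: true };
```

Under a rule like this, an irreversible deletion crosses the consequence threshold and gets routed to a confirmation prompt, while routine reorganization runs unattended.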

My self-deleting agent violated this principle in a way I should have anticipated. Deleting files is irreversible. The agent’s autonomy exceeded the consequence threshold. The fix wasn’t to make the agent less capable. It was to add a constraint: destructive operations require confirmation. That’s a guardrail, not a cage.

The Road Ahead

So where does this leave us?

In the near term, the work is practical and urgent. If you’re building agents today, the research and the failure cases point to a clear set of priorities. Invest in observability from day one, because you cannot improve what you cannot see. Design for oversight by building escalation paths and audit trails into your architecture, not bolting them on after deployment. Take the alignment tax seriously, run your eval suites, test your guardrails, and don’t ship what you haven’t measured. And build hybrid systems that keep humans in the loop where decisions matter, not because the technology can’t handle it, but because the consequences demand it.

On the standards and governance front, the Agentic AI Foundation represents an encouraging step. Launched in December 2025 under the Linux Foundation with founding members including OpenAI, Anthropic, Google, and Microsoft, it’s anchored by projects like the Model Context Protocol and AGENTS.md that we’ve discussed throughout this series. Open standards for how agents connect, communicate, and declare their capabilities are the infrastructure layer that responsible deployment requires. When agents from different providers need to collaborate (the “Internet of Agents” vision from Part 9), shared protocols aren’t just convenient. They’re a governance mechanism.

Looking further out, I believe the next decade will be defined by how well we manage the transition from human-operated to human-supervised systems. The technology will continue to improve. Models will get better at following constraints, tool use will become more reliable, and the context window management challenges that trip up today’s agents will be engineered away. The harder problems are social and institutional: building regulatory frameworks that keep pace with the technology, managing workforce transitions for the millions of people whose jobs will change, and maintaining meaningful human oversight as the systems we oversee become more capable than we are in narrow domains.

I started this series seven months ago with a claim: “The age of agents is here. Let’s explore it together.” Since then, we’ve gone from the basic anatomy of an agent through reasoning, memory, tools, guardrails, attention management, frameworks, protocols, observability, and real-world deployment. We’ve built a conceptual map of the territory.

What I didn’t fully appreciate when I wrote that first post is how fast the territory would change under our feet. The agents I was building in September 2025 feel primitive compared to what’s possible now. The frameworks have matured, the protocols have standardized, and the deployment patterns have moved from experimental to routine. The pace is both exhilarating and sobering.

But the thing I keep coming back to, the thing that my self-deleting agent reminded me of in the most visceral way possible, is that capability without responsibility is just risk with extra steps. Every tool we give an agent, every degree of autonomy we grant, is a decision about what kind of future we’re building. We can build agents that optimize for efficiency at the expense of the people they affect, or we can build systems that treat human judgment, human creativity, and human dignity as features to preserve rather than costs to eliminate.

I know which side I’m on. And if you’ve followed this series to the end, I suspect you do too.

The age of agents isn’t coming. It’s here. The only question left is whether we build it responsibly. Let’s get to work.

A luminous geometric sphere with facets fragmenting outward, connected by thin orbital lines to three smaller glowing nodes representing a chat bubble, code brackets, and a calendar grid, set against a dark navy background.

Agents in the Wild

Welcome back to The Agentic Shift. In our last post, we closed the loop on what it takes to move an agent from prototype to production: observability, evaluation, and the data flywheel that ties them together. We’ve spent ten installments building up the theory, piece by piece, from the anatomy of an agent through reasoning patterns, memory, tools, guardrails, attention management, frameworks, and interoperability protocols.

Now I want to talk about what happens when all of that theory meets the real world.

I was giving a talk to a group of engineers last week, and I found myself describing a pattern I keep seeing in my own work and in the industry at large. I called it the “code smell for agents,” borrowing from a post I wrote earlier this year. The idea is simple: if you’re writing if/else logic to decide what your AI should do, you’re probably building a classifier that wants to be an agent. Decompose those branches into tools, and let the model choose its own adventure. The room lit up. There were lots of questions, and the thing that generated the most interest was the idea that agents exhibit emergent behavior you didn’t specifically create. Give a model tools and a goal, and it starts making decisions you never explicitly programmed. That’s both the promise and the challenge. The theoretical architecture we’ve been mapping in this series isn’t just a blueprint anymore. It’s becoming the default way software gets built.

Today, I want to make this concrete. We’re moving from “how do agents work?” to “how are people actually using them?” The answer, it turns out, spans customer support centers processing millions of conversations, software engineering workflows where agents resolve real GitHub issues autonomously, and personal productivity tools that are turning everyone’s phone into a command center. Let’s look at each.

The Autonomous Frontline

Customer support was always going to be the first domain where agents proved themselves at scale. The data is structured, the success metrics are clear, and the cost of human labor is high. But what’s happening now goes far beyond the rigid chatbots of the previous decade.

The most striking case study is Klarna. In its first month of full deployment, Klarna’s AI assistant handled 2.3 million customer conversations, roughly two-thirds of the company’s total support volume. That’s the workload equivalent of 700 full-time human agents. Average resolution time dropped from eleven minutes to under two, an 82% improvement. And contrary to what you might expect from a system prone to hallucination, repeat inquiries dropped by 25%, suggesting the agent was more consistent at resolving root causes than the human workforce it augmented. Klarna estimated a $40 million profit impact in 2024 alone.

What makes this more than a chatbot story is the scope of autonomy. The Klarna agent doesn’t just quote FAQs. It processes refunds, handles returns, manages cancellations, and resolves disputes. These are actions with write access to financial ledgers. The system works because of a human-in-the-loop architecture where customers can always escalate to a human, but the default path is fully autonomous resolution.

Sierra has taken a different approach, building what they call the “Agent OS,” a platform designed to bridge the gap between the probabilistic nature of LLMs and the deterministic requirements of enterprise policy. Their deployment at WeightWatchers is a good example of why grounding and domain-specific instructions matter so much. A generic model understands “budget” as a financial concept, but the WW agent had to understand it as a daily allocation of nutritional points. With that grounding in place, the agent achieved a 70% containment rate (sessions fully resolved without human intervention) in its first week, while maintaining a 4.6 out of 5 customer satisfaction score.

What surprised me most about the WW deployment was an emergent behavior: users regularly exchanged pleasantries with the agent, sending heart emojis and expressing gratitude. When an agent is responsive, competent, and linguistically fluid, people engage with it as a social entity. That’s not a side effect. It’s a feature that drives retention.

At SiriusXM, Sierra deployed an agent called “Harmony” that takes this a step further with long-term memory. Instead of treating each chat as stateless, Harmony recalls previous subscription changes, music preferences, and technical issues across sessions. It can open a conversation with “I see you had trouble with the app last week, is that resolved?” That’s not reactive support. That’s proactive concierge service, and it’s only possible because the agent maintains the kind of persistent state we discussed in our memory architecture post.

One of the most important technical contributions in this space comes from Airbnb’s research on knowledge representation. They found that standard RAG pipelines fail when reasoning over complex policy documents with nested conditions. Their solution, the Intent-Context-Action (ICA) format, transforms policy documents into structured pseudocode where the agent predicts a specific Action ID (like ACTION_REFUND_50) that maps to a pre-approved response or API call, effectively eliminating policy hallucination. By using synthetic training data to fine-tune smaller open-source models, they achieved comparable accuracy at nearly a tenth of the latency. That’s the kind of practical engineering that separates a demo from a production system.
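Airbnb hasn't published the ICA internals in full, but the core mechanism is easy to sketch: the model is only allowed to emit an Action ID, and that ID must resolve through an allowlist of pre-approved handlers. The IDs and handlers below are illustrative, not Airbnb's; the key property is that a hallucinated action fails closed instead of inventing policy.

```python
# Sketch of ICA-style action dispatch. The model's output is constrained
# to an Action ID; the ID must resolve to a pre-approved handler.
# These specific IDs and handlers are hypothetical examples.

APPROVED_ACTIONS = {
    "ACTION_REFUND_50": lambda booking: {"refund_pct": 50, "booking": booking},
    "ACTION_REFUND_100": lambda booking: {"refund_pct": 100, "booking": booking},
    "ACTION_ESCALATE": lambda booking: {"escalate": True, "booking": booking},
}

def execute_action(predicted_action_id, booking_id):
    handler = APPROVED_ACTIONS.get(predicted_action_id)
    if handler is None:
        # A hallucinated ID degrades to escalation, never to made-up policy.
        return APPROVED_ACTIONS["ACTION_ESCALATE"](booking_id)
    return handler(booking_id)
```

Policy hallucination becomes a lookup failure, which is cheap to detect and safe to handle.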

The pattern across all of these deployments is clear: AI in customer support is shifting from information retrieval to task execution, from probabilistic guessing to deterministic action, and from stateless interactions to stateful relationships. This is the agentic shift in its most tangible form.

The Autonomous Engineer

If customer support agents operate within the guardrails of defined policy, software engineering agents work in an environment of much higher complexity. The shift here is from code completion (the “Copilot” era) to autonomous issue resolution (the “Agent” era).

The standard benchmark for evaluating this is SWE-bench, which tests an agent’s ability to resolve real-world GitHub issues: navigate a complex codebase, reproduce a bug, modify multiple files, and verify the fix against a test suite. As of early 2026, top-tier agents are achieving 70-80% resolution rates on SWE-bench Verified, up from roughly 4% in early 2023. On the more challenging SWE-bench Pro, which uses proprietary codebases, top models still hover around 45%, a reminder that complex legacy environments remain a significant hurdle.

I see this playing out daily in my own workflow. Tools like Gemini CLI and Claude Code have fundamentally changed how I write software. As I described in Everything Becomes an Agent, the moment I gave my agents access to shell commands and file tools, they stopped being autocomplete engines and started being collaborators. They could run tests, see the failure, edit the file, and run the tests again. The loop we described in Part 2 (Thought-Action-Observation) is no longer a theoretical pattern. It’s the actual development loop I use every day.

What’s driving this improvement isn’t just better models. It’s better scaffolding. The SWE-agent project at Princeton introduced the concept of the Agent-Computer Interface (ACI), a shell environment optimized for LLM token processing rather than human perception. It uses “observation collapsing” to summarize verbose terminal outputs, preventing the context window overflow that kills so many coding agents, and includes an automatic linting loop for rapid self-correction before expensive test suites run.

Even more exciting is Live-SWE-agent, which can synthesize its own tools on the fly. When it encounters a repetitive task, it writes a Python script to handle it and adds the script to its toolkit for the session. This dynamic adaptability helped it achieve 77.4% on SWE-bench Verified without extensive offline training. It’s a move from “static tool use” to “dynamic tool creation,” where the agent engineers its own environment.

On the product side, GitHub Copilot Workspace represents the Plan-and-Execute pattern productized for millions of developers. The user describes a task, the system generates an editable specification and plan, then implements the changes. This “steerable” design makes the agent’s reasoning visible and mutable, shifting the developer from “author” to “reviewer and architect,” exactly the human-on-the-loop model I’ve been advocating. And the protocol layer is catching up too, with tools like Goose implementing the Agent Client Protocol to decouple intelligence from interface, letting developers bring their own agent to their preferred editor.

The Cognitive Extension

The third domain is the most personal: productivity agents that manage the chaotic stream of daily information, tasks, and communication. The conceptual target is the “personal intern,” an always-on digital entity that doesn’t just answer questions but anticipates needs.

I’ve been living this with Gemini Scribe, my agent inside Obsidian. What started as a glorified chat window evolved into a full agentic system the moment I gave it access to read_file. Suddenly I wasn’t managing context manually; I was delegating. “Read the last three meeting notes and draft a summary” is not a chat interaction. It’s a delegation, and delegation requires the agent to plan, execute, and iterate. The same evolution happened with my Podcast RAG system, where deleting a classifier and replacing it with tools made the system both simpler and more capable.

But the most vivid example of personal agents “in the wild” right now is OpenClaw. If you haven’t been following, OpenClaw (formerly Moltbot) is an open-source AI agent that runs locally, connects through messaging apps you already use (WhatsApp, Telegram, Signal, Slack), and takes action on your behalf. It can execute shell commands, manage files, automate browser sessions, and handle email and calendar operations. It has over 300,000 GitHub stars and a community of people using it for everything from negotiating car purchases to filing insurance claims.

OpenClaw is a fascinating case study because it makes the theoretical architecture of this series tangible. It’s a model running in a loop with access to tools. It has memory (local configuration and interaction history that persists across sessions). It uses the ReAct pattern to reason about tasks and choose actions. And it has all the failure modes we’ve discussed: Cisco’s AI security research team found that a third-party skill called “What Would Elon Do?” performed data exfiltration and prompt injection without user awareness, demonstrating exactly the kind of guardrail failures we examined in Part 6.

The underlying technical challenge is memory. For a personal agent to be useful over time, it has to remember. Systems like Mem0 extract preferences and facts into a vector store for future retrieval. Zep goes further with a Temporal Knowledge Graph that stores facts in time and in relation to one another, enabling reasoning over questions like “What did we decide about the budget last week?” On the enterprise side, Glean connects to over 100 SaaS applications to build a unified knowledge graph with a “Personal Graph” that layers individual work patterns on top of company data. These are the production-grade versions of what we discussed theoretically in Part 3.
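The temporal idea is simple enough to sketch. Mem0 and Zep use vector stores and knowledge graphs under the hood; this toy version just attaches timestamps to facts so that a time-scoped question becomes a filter. Everything here is illustrative, not either product's API.

```python
from datetime import datetime

# Toy temporal memory: facts stored with timestamps so the agent can
# answer time-scoped questions ("what did we decide about the budget
# last week?"). Production systems do this with embeddings and graphs.

class TemporalMemory:
    def __init__(self):
        self.facts = []  # list of (timestamp, subject, fact)

    def remember(self, subject, fact, when=None):
        self.facts.append((when or datetime.now(), subject, fact))

    def recall(self, subject, since=None):
        """Return facts about a subject, optionally bounded in time."""
        return [
            fact for ts, subj, fact in self.facts
            if subj == subject and (since is None or ts >= since)
        ]
```

The interesting engineering in real systems is in extraction (deciding what counts as a fact) and invalidation (knowing when an old fact has been superseded), but the query model is exactly this: facts situated in time.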

When Things Go Wrong in Production

Deploying agents in the wild surfaces failure modes that simply don’t exist in chat interfaces. The research on agentic production reliability identifies patterns I see constantly.

Reasoning spirals are the most common. An agent searches for “pricing,” finds nothing, and searches again with the same parameters. It’s stuck in a local optimum, unable to update its strategy. The fix is a state hash (checking if the current state matches a previous one) combined with circuit breakers (hard limits on steps or tokens per session). I described this in detail in our post on the observability gap.

Tool hallucination is more insidious. The agent doesn’t hallucinate facts in prose; it hallucinates tool parameters, passing a string where the API expects an integer or inventing a document ID that doesn’t exist. These cause system crashes or silent data corruption. Strict schema validation and constrained decoding (forcing the model to output valid JSON) are essential defenses.

Silent abandonment is the quietest failure. The agent hits ambiguity or a tool error, politely apologizes (“I’m sorry, I couldn’t find that”), and gives up without alerting anyone. This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response. The Reflexion pattern combats this by forcing the agent to generate a self-critique and try a different strategy before surrendering.

Cascading failures appear in multi-agent systems, where a hallucination in one agent (a researcher providing bad data) can poison the entire chain (a writer publishing false information). This is why supervisor architectures and the kind of observability infrastructure we discussed in Part 10 are not optional.

The Economic Reckoning

All of these deployments share a common economic implication. For two decades, SaaS relied on seat-based pricing, charging per user login, a model that assumes software is a tool used by humans. Agents challenge that assumption by acting as autonomous workers. When Klarna’s agent does the work of 700 humans, the demand for seats shrinks. Financial analysts have started calling this the “SaaSpocalypse.” The new model is “Service-as-a-Software,” where you pay for the completed task rather than the license. Salesforce’s Agentforce already prices at $2 per conversation. HubSpot is pivoting to consumption-based models. Klarna has moved to replace Salesforce and Workday with internal AI solutions entirely.

This doesn’t mean the end of human labor. In the Klarna deployment, the remaining humans focused on complex, high-empathy interactions. In software development, Copilot Workspace elevates the developer to a product manager role. It’s the same human-on-the-loop philosophy, applied at the scale of the labor market itself.

From Theory to Territory

Looking at all of this evidence, I keep coming back to a simple thought. Every concept in this series has a real-world counterpart operating in production right now. The ReAct loop powers coding agents that iterate on failing tests. Memory architectures enable SiriusXM’s Harmony to remember your subscription history. Tool grounding and instruction engineering are what make Airbnb’s ICA format work. Guardrails are what OpenClaw desperately needs more of. Context management is what SWE-agent’s observation collapsing solves. Frameworks are what make it possible to build these systems without starting from scratch every time. Protocols are what connect them to the wider world. And observability is what keeps them honest.

The agents are no longer theoretical. They’re processing refunds, merging code, negotiating car prices, and managing enterprise knowledge graphs. They’re also getting stuck in loops, hallucinating tool parameters, and quietly giving up when they shouldn’t. The technology works, and it fails, in exactly the ways we’ve been describing.

This brings us to our final installment. We’ve mapped the territory. We’ve seen what these systems can do and where they break. In Part 12, we’ll step back and grapple with the hardest questions: responsibility, governance, and the road ahead. What do we owe the people affected by these systems? How do we ensure this shift makes the world better, not just more efficient? The engineering is the easy part. The ethics are where the real work begins.

A beam of white light enters a translucent geometric crystal and refracts into three distinct colored beams — red, green, and blue — each passing through a different abstract geometric shape against a dark navy background.

MCP Isn’t Dead, You Just Aren’t the Target Audience

I was debugging a connection issue between Gemini Scribe and the Google Calendar integration in my Workspace MCP server last month when a friend sent me a link. “Have you seen this? MCP is dead apparently.” It was Eric Holmes’ post, MCP is dead. Long live the CLI, which had just hit the top of Hacker News. I read it while waiting for a server restart, which felt appropriate.

His argument is clean and persuasive: CLI tools are simpler, more reliable, and battle-tested. LLMs are trained on millions of man pages and Stack Overflow answers, so they already know how to use gh and kubectl and aws. MCP introduces flaky server processes, opinionated authentication, and an all-or-nothing permissions model. His conclusion is that companies should ship a good API, then a good CLI, and skip MCP entirely.

I agree with about half of that. And the half I agree with is the part that doesn’t matter.

The Shell is a Privilege

Holmes is writing from the perspective of a developer sitting in a terminal. From that vantage point, everything he says is correct. If your agent is Claude Code or Gemini CLI, running in a shell session on your laptop with your credentials loaded, then yes, gh pr view is faster and more capable than any MCP wrapper around the GitHub API. I made exactly this observation in my own post on the Internet of Agents. Simon Willison said as much in his year-end review, noting that for coding agents, “the best possible tool for any situation is Bash.”

But here’s the thing: not every agent has a shell. And not every agent is an interactive coding assistant.

I wrote in Everything Becomes an Agent that the agentic pattern is showing up everywhere: classifiers that need to call tools, data pipelines that need to make decisions, background processes that orchestrate workflows without a human watching. The “MCP is dead” argument treats agents as though they are all developer tools running in a terminal session. That’s one pattern, and it’s the pattern that gets the most attention because developers are writing the blog posts. But the agentic shift is much broader than that.

I’ve been building Gemini Scribe for nearly a year and a half now. It’s an AI agent that lives inside Obsidian, a note-taking application built on Electron. On desktop, Gemini Scribe runs in the renderer process of a sandboxed app. It has no terminal. It has no $PATH. It cannot reliably shell out to gh or kubectl or anything else. Its entire world is the Obsidian plugin API, the vault on disk, and whatever external capabilities I wire up for it. And on mobile, the constraints are even tighter. Obsidian runs on iOS and Android, where there is no shell at all, no subprocess spawning, no local binary execution. The app sandbox on mobile is absolute. If your answer to “how does an agent use tools?” begins with “just call the CLI,” you’ve already lost half your user base.

When I wanted Gemini Scribe to be able to read my Google Calendar, search my email, or pull context from Google Drive, I didn’t have the option of “just use the CLI.” There is no gcal CLI that runs inside a browser runtime. There is no gmail binary I can spawn from an Electron sandbox, let alone from an iPhone. MCP gave me a way to expose those capabilities through a protocol that works over stdio or HTTP, regardless of where my agent happens to be running.

The same is true of my Podcast RAG system. The query agent runs on the server, orchestrating retrieval, re-ranking, and synthesis in a Python process that has no interactive shell session. I could wire up every capability as a bespoke function call, and in some cases I do. But when I want that same retrieval pipeline to be accessible from Gemini CLI on my laptop, from Gemini Scribe in Obsidian, and from the web frontend, MCP gives me one implementation that serves all three. The alternative is writing and maintaining three separate integration layers.

Or consider a less obvious case: a background agent that monitors a codebase for security vulnerabilities and files tickets when it finds them. This agent runs on a schedule, not in response to a human typing a command. It needs to read files from a repository, query a vulnerability database, and create issues in a project tracker. You could give it a shell, but you shouldn’t. An autonomous agent running unattended with shell access is a privilege escalation vector. A crafted comment in a pull request, a malicious string in a dependency manifest, any of these could become a prompt injection that turns bash into an attack surface. Structured tool protocols are the natural interface for this kind of autonomous workflow precisely because they constrain what the agent can do. The agent gets read_file and create_issue, not bash -c. The narrower the interface, the smaller the blast radius.
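The "narrow interface" idea fits in a dozen lines. This is an illustrative sketch, not a real MCP server: the point is that the unattended agent receives an explicit allowlist of capabilities, and anything outside it, notably shell access, simply does not exist as a target.

```python
# Sketch of an allowlisted toolbox for an unattended agent. The tool
# names and repo structure are hypothetical; the pattern is the point.

def make_toolbox(repo: dict):
    def read_file(path: str) -> str:
        return repo.get(path, "")          # read-only access to the repo

    def create_issue(title: str, body: str) -> dict:
        return {"title": title, "body": body, "status": "open"}

    return {"read_file": read_file, "create_issue": create_issue}

def dispatch(toolbox, tool_name, **kwargs):
    if tool_name not in toolbox:
        # There is no bash -c here: an injected "run this command"
        # has nothing to bind to.
        raise PermissionError(f"tool not in allowlist: {tool_name}")
    return toolbox[tool_name](**kwargs)
```

A prompt injection that convinces the model to "run rm -rf" produces a `PermissionError` and an audit log entry, not a wiped disk. The narrower the interface, the smaller the blast radius.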

The N-by-M Problem Doesn’t Go Away

Holmes frames MCP as solving a problem that doesn’t exist. CLIs already work, so why add a protocol?

But CLIs work for a very specific topology: one human (or one human-like agent) driving one tool at a time through a shell. The moment you step outside that topology, CLIs stop being the answer.

Even if every service had a CLI (and Holmes is right that more should), you still have the consumer problem. A CLI is consumable by exactly one kind of agent: one with shell access. The moment you need that same capability accessible from an Electron plugin, a mobile app, a server-side orchestrator, and a terminal agent, you’re back to writing integration code for each consumer. MCP lets you write the server once and expose it to all of them through a common protocol.

This is the same insight behind LSP, which I wrote about in the context of ACP. Before LSP, every editor had to implement its own Python linter, its own Go formatter, its own TypeScript type-checker. The N-by-M integration problem was a nightmare. LSP didn’t replace the underlying tools. It standardized the interface between the tools and the editors. MCP does the same thing for the interface between capabilities and agents.

Holmes might respond that the N-by-M problem is overstated, that most developers just need one agent talking to a handful of tools. Fair enough for a personal workflow. But the industry isn’t building personal workflows. It’s building platforms where agents need to discover and compose capabilities dynamically, where the set of available tools changes based on the user’s permissions, their organization’s policies, and the context of the current task. That’s the world MCP is designed for.

Authentication is the Feature, Not the Bug

One of Holmes’ sharpest critiques is that MCP is “unnecessarily opinionated about auth.” CLI tools, he notes, use battle-tested flows like gh auth login and AWS SSO that work the same whether a human or an agent is driving.

This is true when the agent is acting as you. But the moment the agent stops acting as you and starts acting on behalf of other people, everything changes.

Imagine you’re building a product where an AI assistant helps your customers manage their calendars. Each customer has their own Google account. You cannot ask each of them to run gcloud auth login in a terminal. You need per-user OAuth tokens, tenant isolation, and an auditable record of every action the agent takes on each user’s behalf. This is not a niche enterprise concern. This is the basic architecture of any multi-tenant agent system.

Or think about something simpler: a shared documentation service protected by OAuth. Your team’s internal knowledge base, your company’s Confluence, your organization’s Google Drive. An agent that needs to search those resources on behalf of a user has to present that user’s credentials, not the developer’s, not a shared service account. This is a solved problem in the web world (every SaaS app does it), but it requires a protocol that understands identity delegation. curl with a hardcoded token doesn’t cut it.

MCP’s authentication specification isn’t trying to replace gh auth login for developers who already have credentials loaded. It’s trying to solve the problem of how an agent running in a hosted environment acquires and manages credentials for users who will never see a terminal. Dismissing this as unnecessary complexity is like dismissing HTTPS because curl works fine over HTTP on your local network.

Where I Actually Agree

I want to be clear that Holmes isn’t wrong about the pain points. MCP server initialization is genuinely flaky. I’ve lost hours to servers that didn’t start, connections that dropped, and state that got corrupted between restarts. The tooling is immature. The debugging experience is terrible. As I wrote in my post on the observability gap, the moment you rely on an agent for something that matters, you realize you’re flying blind. MCP’s opacity makes that worse.

And the context window overhead is real. Benchmarks from ScaleKit show that an MCP agent injecting 43 tool definitions consumed 44,026 tokens before doing any work, while a CLI agent doing the same task needed 1,365. When you’re paying per token, that’s not an abstraction tax you can ignore.

But these are maturity problems, not architecture problems. The early days of LSP were rough too. Language servers crashed, features were spotty, and half the community said “just use the built-in tooling.” The protocol won anyway, because the abstraction was right even when the implementation wasn’t.

The Bridge Pattern

Here’s what I think the mature answer looks like, and it’s neither “use MCP for everything” nor “use CLIs for everything.” It’s building your core capability as a shared library, then exposing it through multiple transports.

Think about how you’d design a tool that queries your internal knowledge base. The business logic (authentication, retrieval, re-ranking) lives in a Python module or a Go package. From that shared core, you generate three thin wrappers. A streaming HTTP MCP server for agents running in web runtimes and hosted environments. A local stdio MCP server for desktop agents like Gemini Scribe or Claude Desktop that communicate over standard input/output. And a CLI binary for developers who want to pipe results through jq or use it from Gemini CLI’s bash tool.

All three share the same code paths. A bug fix in the retrieval logic propagates everywhere. The auth layer adapts to context: the CLI reads your local credentials, the HTTP server handles OAuth tokens, and the stdio server inherits the host process’s permissions. You get the CLI’s simplicity where a shell exists, and MCP’s universality where it doesn’t.

This isn’t hypothetical. It’s what I’m already doing. My gemini-utils library is the shared core: it handles file uploads, deep research, audio transcription, and querying against Gemini’s APIs. It exposes all of that as a set of CLI commands (research, transcribe, query, upload) that I use directly from the terminal every day. But when I wanted those same research capabilities available to Gemini CLI as an agent tool, I built gemini-cli-deep-research, an extension that wraps the same underlying library as an MCP service. The core logic is shared. The CLI is for me at a terminal. The MCP server is for agents that need to invoke deep research as a tool in a larger workflow. Same capability, different transports, each suited to its context.

I think this is the pattern that tool developers should be building toward. The best agent tools of the next few years won’t be “MCP servers” or “CLI tools.” They’ll be capability libraries with multiple faces.

The Real Question

The CLI-vs-MCP debate, as Tobias Pfuetze argued, is the wrong fight. The question isn’t “which is better?” It’s “where does each one belong?”

For a developer in a terminal with their own credentials, driving a coding agent? Use the CLI. It’s faster, cheaper, and the agent already knows how. Holmes is right about that.

For an agent embedded in an application runtime without shell access? For a multi-tenant platform where the agent acts on behalf of users who will never open a terminal? For a system where you need one capability implementation discoverable by multiple heterogeneous agent hosts? That’s where MCP earns its complexity.

And for the tool developer who wants to serve all of these audiences? Build the core once, expose it three ways: CLI, stdio MCP, and streaming HTTP MCP. Let the runtime decide.

The mistake is assuming that because your agent has a shell, every agent has a shell. The terminal is one runtime among many. And as agents move from developer tools into products that serve non-technical users, the fraction of agents that can rely on a $PATH and a .bashrc is going to shrink rapidly.

MCP isn’t dead. It’s just not for you yet. But it might be soon.

A luminous geometric sphere with sections of its outer shell breaking apart to reveal glowing concentric rings and internal mechanisms, set against a dark navy background.

The Observability Gap

I was debugging an agent a few weeks ago when I hit a problem that made me realize something fundamental about the shift we’re undergoing. The script had run, consumed a hundred thousand tokens, and returned an answer. But the answer was wrong. Not catastrophically wrong, just subtly, dangerously off.

The issue wasn’t that the model was bad. It was that I had no idea what the agent had thought while producing that answer. Which tools had it called? What information had it retrieved? What reasoning path had it wandered down? I had the input and the output, but the middle, the actual decision-making process, was a black box.

This mirrors the challenge I described in Everything Becomes an Agent. If our future architecture is a mesh of interacting agents, we cannot afford for them to be inscrutable. A single black box is a mystery; a system of black boxes is chaos.

This is the Observability Gap, and it is the first wall you hit when you move from prototype to production. You can build a working agent in an afternoon. You can give it tools, wire up a nice ReAct loop, and watch it dazzle you. But the moment you rely on it for something that matters, you realize you’re flying blind.

How do you know if your agent is working well? And more importantly, how do you fix it when it’s not?

Earlier in this series, I wrote about building guardrails and the Policy Engine that keeps agents from doing dangerous things. Observability is the complement to those guardrails. Guardrails define the boundaries; observability tells you whether the agent is respecting them, struggling against them, or quietly finding ways around them. One without the other is incomplete. A guardrail you can’t monitor is just a hope.

The Chain of Thought Problem

When you’re building traditional software, debugging is an exercise in logic. You set breakpoints, inspect variables, and trace execution. The flow is deterministic: if Input A produces Output B today, it will produce Output B tomorrow.

Agents don’t work that way. The same input can produce wildly different outputs depending on which tools the agent decides to call, how it interprets the results, and what “thought” it generates in that split second. The agent’s logic isn’t written in code; it’s written in natural language, scattered across multiple LLM calls, tool invocations, and iterative refinements.

I learned this the hard way with my Podcast RAG system. I’d ask it a question about a specific episode, and sometimes it would nail it, pulling the exact segment and synthesizing a perfect answer. Other times, it would search with the wrong keywords, get back irrelevant chunks, and confidently synthesize nonsense.

The model wasn’t hallucinating in the traditional sense. It was following a process. But I couldn’t see that process, so I couldn’t fix it.

That experience taught me the most important lesson about production agents: the final answer is the least interesting part. What matters is the chain of thought that produced it, every tool call, every intermediate result, every reasoning trace. Think of it as a flight recorder. When the plane lands at the wrong airport, the only way to understand what went wrong is to replay the entire flight.

Four Layers of Seeing

When I started building that flight recorder, I realized that “log everything” isn’t actually a strategy. You need structure. Through trial and error, and by studying how platforms like Langfuse and Arize Phoenix approach the problem, I’ve come to think of agent observability as having four distinct layers.

The first is the reasoning layer: the agent’s internal monologue where it decomposes your request into sub-tasks. This is where you catch the subtle bugs. When my Podcast RAG agent searched for the wrong keywords, the failure wasn’t in the tool call itself (which returned a perfectly valid HTTP 200). The failure was in the reasoning that chose those keywords. Without visibility into the “Thought” step of the ReAct loop, that kind of error is indistinguishable from an external system failure.

The second is the execution layer: the actual tool calls, their arguments, and the raw results. This is where you catch a different class of bug, one that’s becoming increasingly important. Tool hallucination. Not the model making up facts in prose, but the model calling a tool that doesn’t exist (you provided shell_tool but the model confidently calls bash_tool), fabricating a file path that isn’t real, or passing a string to a parameter that expects an integer. These are operational failures that cascade. I’ve seen an agent confidently pass a hallucinated document ID to a retrieval tool, get back an error, and then re-hallucinate a different invalid ID rather than change strategy. You only catch this if you’re logging the schema validation at the boundary between the model and the tool.

The third is the state layer: the contents of the agent’s context window at each decision point. Agents are stateful creatures. Their behavior at step ten is shaped by everything that happened in steps one through nine. And context windows are not infinite. As verbose tool outputs accumulate, relevant information gets pushed further and further from the model’s attention, a phenomenon researchers call “context drift” or the “Lost in the Middle” effect. Snapshotting the context at critical decision points lets you “time travel” during debugging. You can see exactly what the agent could see when it made its bad call.

The fourth is the feedback layer: error codes, user corrections, and signals from any critic or evaluator models. This layer tells you whether the agent is actually learning from its environment within a session, or just ignoring failure signals and looping. In frameworks like Reflexion, this feedback is explicitly wired into the next reasoning step. Watching this layer is how you know if your self-correction mechanisms are actually correcting.

But capturing these four layers independently isn’t enough. You need to bundle them into sessions: discrete, self-contained records of a single task from the moment the user makes a request to the moment the agent delivers (or fails to deliver) its result. A session is your unit of analysis. It’s the difference between having a pile of timestamped log lines and having a story you can read from beginning to end. When something goes wrong, you don’t want to grep through millions of events hoping to reconstruct what happened. You want to pull up session #47832 and replay the agent’s entire decision-making journey: what it thought, what it tried, what it saw, and how it responded to each result along the way.

This session-level thinking changes how you build your infrastructure. Every trace, every tool call, every context snapshot gets tagged with a session ID. Your dashboards stop showing you aggregate metrics and start showing you individual narratives. You can sort sessions by outcome (success, failure, abandonment), by cost (token consumption), or by duration, and immediately drill into the ones that matter. It’s the observability equivalent of going from reading a box score to watching the game film.
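As a sketch of what that bundling can look like in code: the layer names below match the four layers above, but the JSON-lines format and field names are my own invention for illustration, not a standard.

```python
# Sketch: a session-scoped trace logger. Every event carries the session ID
# so a whole task can be pulled up and replayed as one narrative.
import json
import time
import uuid


class SessionTrace:
    LAYERS = {"reasoning", "execution", "state", "feedback"}

    def __init__(self):
        self.session_id = str(uuid.uuid4())
        self.events: list[dict] = []

    def log(self, layer: str, **payload):
        if layer not in self.LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.events.append({
            "session_id": self.session_id,
            "ts": time.time(),
            "layer": layer,
            **payload,
        })

    def to_jsonl(self) -> str:
        """One JSON object per line: greppable, filterable, and readable in order."""
        return "\n".join(json.dumps(e) for e in self.events)


trace = SessionTrace()
trace.log("reasoning", thought="User wants a meeting time; check calendar first")
trace.log("execution", tool="get_calendar", args={"days": 7}, result_size=45)
trace.log("state", context_tokens=2847)
trace.log("feedback", errors=0)
```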

Making It Concrete

Here’s what this looks like in practice. Suppose you ask your agent to “check my calendar and suggest a time for a meeting.”

Without observability, you see:

Input: "Check my calendar and suggest a time for a meeting"
Output: "How about Thursday at 2pm?"

With observability across all four layers, you see the mind at work:

[REASONING] User wants to schedule a meeting. I need to:
1. Check their calendar for availability
2. Consider team availability
3. Suggest an optimal time
[TOOL CALL] get_calendar(user_id="allen", days=7)
[TOOL RESULT] Returns 45 events over next 7 days
[STATE] Context window: 2,847 tokens used
[REASONING] Analyzing free slots. User has:
- Monday 2pm-4pm free
- Thursday 2pm-4pm free
- Friday all day booked
[TOOL CALL] get_team_availability()
[TOOL RESULT] Team members mostly available Thursday afternoon
[REASONING] Thursday 2pm works for both user and team.
[FEEDBACK] No errors. Response generated.
[RESPONSE] "How about Thursday at 2pm?"

Suddenly, the black box is transparent. If the suggestion is wrong, you can see exactly why. Maybe the calendar tool returned incomplete data. Maybe the team availability check failed silently. Maybe the agent’s definition of “optimal” means “soonest” rather than “best for focus time.”

This kind of visibility saved me countless hours when building Gemini Scribe. Users would report that the agent “didn’t understand” their request, which is about as useful as telling your mechanic “the car sounds funny.” But when I turned on debug logging and pulled up the console output, I could see exactly where the confusion happened, usually in how the agent interpreted the file context or which notes it decided were relevant. The fix was never a mystery once I could see the reasoning. All of this logging is to the developer console and off by default, which is an important distinction. You want observability for yourself as the builder, not surveillance of your users.

The Standards Are Coming

For my own production agents, I’ve settled on a layered approach. Structured logging captures every action in machine-parseable JSON. A unique trace ID stitches together every LLM call and tool invocation into a single narrative flow.

But we are also seeing the industry mature beyond “roll your own.” The critical development here is the adoption of the OpenTelemetry (OTel) standard for GenAI. The OTel community has published semantic conventions that define a standard schema for agent traces: things like gen_ai.system (which provider), gen_ai.request.model (which exact model version), gen_ai.tool.name (which tool was called), and gen_ai.usage.input_tokens (how many tokens were consumed at each step).

This matters because it means an agent built with LangChain in Python and an agent built with Semantic Kernel in C# can produce traces that look structurally identical. You can pipe both into the same Datadog or Langfuse dashboard and analyze them side by side. You aren’t locked into a proprietary debugging tool; you can stream your agent’s thoughts into the same infrastructure you use for the rest of your stack.

It also enables what I think of as “boundary tracing,” where you instrument the stable interfaces (the HTTP calls, the tool invocations) rather than hacking into the agent’s internal logic. You get visibility without coupling your observability to a specific framework. That’s important, because if there’s one thing I’ve learned building in this space, it’s that frameworks change fast.

If you’re wondering where to start, here’s my honest advice: don’t wait for the perfect stack. Start with structured JSON logs and a session ID that ties each task together end-to-end. That alone gives you something you can grep, filter, and replay. Once you outgrow that (and you will, faster than you expect), graduate to an OTel-based pipeline. The good news is that many agent frameworks are adding robust hook mechanisms that let you tap into the agent lifecycle (before and after tool calls, on reasoning steps, on errors) without modifying your core logic. These hooks make it straightforward to plug in your telemetry from the start. The key is to instrument early, even if you’re only logging to a local file. Retrofitting observability into an agent that’s already in production is significantly harder than building it in from the beginning.
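That starting point can be as small as this: structured JSON log lines carrying a session ID, with keys borrowed from the OTel GenAI attribute names mentioned above. The `log_llm_call` helper and the example values are mine, not part of any SDK; only the `gen_ai.*` key names follow the published conventions.

```python
# Sketch: minimal structured logging with OTel-style attribute names.
# Graduating to a real OTel pipeline later is mostly a matter of emitting
# these same keys as span attributes instead of log fields.
import json
import logging

logger = logging.getLogger("agent")
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)


def log_llm_call(session_id, model, tool_name, input_tokens):
    record = {
        "session.id": session_id,
        "gen_ai.system": "gemini",            # which provider
        "gen_ai.request.model": model,        # which exact model version
        "gen_ai.tool.name": tool_name,        # which tool was called
        "gen_ai.usage.input_tokens": input_tokens,
    }
    line = json.dumps(record)
    logger.info(line)
    return line


line = log_llm_call("sess-47832", "gemini-2.5-pro", "get_calendar", 2847)
```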

The Price of Transparency

Here’s the tension no one wants to talk about: full observability is expensive.

Autonomous agents are verbose by nature. A single reasoning step might generate hundreds of tokens of internal monologue. A RAG retrieval might pull megabytes of document context. If you log the full payload for every transaction, your storage costs can rival the cost of the LLM inference itself. I’ve seen reports of evaluation runs consuming over 100 million tokens, with more than 60% of the cost attributed to hidden reasoning tokens.

In production, you need sampling strategies. The approach I’ve landed on borrows from traditional distributed systems. Keep 100% of traces that result in errors or negative user feedback, because every failure is a learning opportunity. Keep traces that exceed your latency threshold (P95 or P99), because slow agents are often stuck agents. And for everything else, a small random sample (1-5%) is enough to establish your baseline and spot trends.

For storage, I use a tiered approach. Recent and failed traces go into a fast database for immediate querying. Older successful traces get compressed and moved to cold storage, where they can be pulled back if needed for deeper analysis. It’s not glamorous, but it keeps costs manageable without sacrificing the ability to debug the things that matter. In my own setup, this sampling and tiering strategy keeps observability overhead to roughly 15-20% of my inference spend. Without it, I was on track to spend more on storing agent thoughts than on generating them.

Evaluation Beyond Unit Tests

Logging tells you what happened. Evaluation tells you if it was any good.

This is where agents diverge sharply from traditional software. You can’t write a unit test that asserts function(x) == y. The whole point of an agent is to make decisions, and decisions must be evaluated on quality, not just syntax.

As Gemini Scribe grew more capable, I had to develop a new kind of test suite. I track Task Success Rate (did the agent accomplish what the user asked?), Tool Use Accuracy (did it read the right files and use the right tools for the job?), and Efficiency (did it burn 50 steps to do a 2-step task?).

But here’s the number that keeps me up at night. Because agents are non-deterministic, a single run is statistically meaningless. You have to run the same evaluation multiple times and look at distributions. Researchers distinguish between Pass@k (the probability that at least one of k attempts succeeds) and Pass^k (the probability that all k attempts succeed). Pass@k measures potential. Pass^k measures reliability.

The math is sobering. If your agent has a 70% success rate on a single attempt, its Pass^3 (succeeding three times in a row) drops to about 34%. Scale that to a real workflow where the agent needs to perform ten sequential steps correctly, and even a 95% per-step success rate gives you only about a 60% chance of completing the full task. This is the compounding probability of failure, and it’s why “works most of the time” isn’t good enough for production.

This kind of evaluation framework pays for itself the moment a new model drops. When Google released Flash 2.0, I was excited about the cost savings, but would it perform as well as Pro? I ran my eval suite on the same tasks with both models, and the results were more nuanced than I expected. For simple tasks like reformatting text or fixing grammar, Flash was just as good. For complex multi-step reasoning, particularly in my Podcast RAG system, Pro was noticeably better. The eval suite gave me the data to keep Pro where it mattered.

Then Flash 3 came out, and the eval suite surprised me in the other direction. I ran the same benchmarks expecting similar trade-offs, but Flash 3 handled the Podcast RAG tasks so well that I moved the entire system off of 2.5 Pro. Without evals, I might have assumed the old trade-off still held and kept paying for a model I no longer needed. The point isn’t that one model is always better. The point is that you can’t know without measuring, and the landscape shifts under your feet with every release.

The real breakthrough in my own workflow came when I started using an agent to evaluate itself. I built a separate “Evaluation Agent” that reviews the logs of the “Worker Agent.” It scores performance based on a rubric I defined: did it confirm the action before executing? Was the response grounded in retrieved context? Was the tone appropriate?

This LLM-as-a-Judge pattern is powerful, but it comes with caveats. Research shows these evaluator models have their own biases, particularly a tendency to prefer longer answers regardless of quality and a bias toward their own outputs. To calibrate mine, I built a small “golden dataset” of traces that I graded by hand, then tuned the evaluator’s prompt until its scores matched mine. It’s not perfect, but it spots patterns I miss, like a tendency to over-rely on search when a simple calculation would do.
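Calibration itself doesn't need anything fancy: at its simplest, it's measuring how often the judge's verdicts match your hand grades on the golden dataset, and iterating on the evaluator's prompt until that rate is acceptable. A toy sketch (the binary pass/fail scoring here is an assumption; real rubrics are richer):

```python
# Sketch: agreement between an LLM judge and hand-graded golden labels.
def agreement_rate(judge_scores: list[int], human_scores: list[int]) -> float:
    matches = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    return matches / len(human_scores)
```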

When Things Go Wrong

The research into agentic failure modes has identified three patterns that I see constantly in my own work.

The first is looping. The agent searches for “pricing,” gets no results, then searches for “pricing” again with exactly the same parameters. It’s stuck in a local optimum of reasoning, unable to update its strategy based on the observation that it failed. The simplest fix is a state hash: you hash the (Thought, Action, Observation) tuple at each step and check it against a sliding window of recent steps. If you see a repeat, you force the agent to try something different. For “soft” loops where the agent slightly rephrases but semantically repeats itself, embedding similarity between consecutive reasoning steps catches the pattern. And above all, production agents need circuit breakers: hard limits on steps, tool calls, or tokens per session. When the breaker trips, the agent escalates to a human rather than continuing to burn resources.

The second is tool hallucination. I mentioned this earlier, but it deserves its own spotlight. The most robust defense is constrained decoding, where libraries like Outlines or Instructor use the tool’s JSON schema to build a finite state machine that masks out invalid tokens during generation. If the schema expects an integer, the system sets the probability of all non-digit tokens to zero. It mathematically guarantees that the agent’s tool call will be valid. This moves validation from “check after the fact” to “ensure during generation,” which is a fundamentally better position. A practical note: full constrained decoding (the FSM approach) requires control over the inference engine, so it works with locally-hosted models or providers that expose logit-level access. If you’re calling a hosted API like Gemini or OpenAI, Instructor-style libraries can still enforce schema validation by wrapping the response in a Pydantic model and retrying on parse failure. It’s not as elegant as preventing bad tokens from ever being generated, but it catches the same class of errors.

The third is silent abandonment. The agent hits an ambiguity or a tool failure, and instead of trying an alternative, it politely apologizes and gives up. “I’m sorry, I couldn’t find that information.” This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response to uncertainty. The Reflexion pattern combats this by forcing the agent to generate a self-critique when it fails (“I searched with the wrong term”) and storing that critique in a short-term memory buffer. The next reasoning step is conditioned on this reflection, pushing the agent to generate a new plan rather than surrender. Research shows this kind of “verbal reinforcement” can improve success rates on complex tasks from 80% to over 90%.

The Self-Improving System

Moving from prototype to production isn’t about adding features; it’s about shifting your mindset. A prototype proves that something can work. A production system proves that it works reliably, measurably, and transparently. But the real unlock comes when you realize that production isn’t the end of the development lifecycle. It’s the beginning of something more powerful.

Remember those sessions I mentioned, the bundled records of every task your agent attempts? Once you have a critical mass of them, you’re sitting on a goldmine. And this is where I think the story gets really interesting: you can point a different AI system at your session archive and ask it to find the patterns you’re missing.

I’ve started doing this with my own agents. The workflow is straightforward: I have a script that runs weekly, pulls the last seven days of sessions from my trace store, filters for failures and anything above P90 latency, and exports them as structured JSON. I then feed that batch to a separate, more capable evaluator model. Not the lightweight rubric-scorer I use for real-time evaluation, but a model with a broader mandate and a carefully written prompt: look across these sessions and tell me what you see. Where is the agent consistently struggling? Which tool calls tend to precede failures? Are there categories of user requests that reliably lead to abandonment or looping? I ask it to return its findings as a ranked list of patterns with supporting session IDs, so I can verify each observation myself.

The results have been genuinely surprising. The evaluator flagged a cluster of sessions where users were asking questions about the corpus itself, things like “how many of these podcasts are about guitars?” or “which shows cover AI the most?” The agent would gamely try to answer by searching transcripts, but it was never going to get there because I hadn’t indexed podcast descriptions. Each individual session just looked like a search that came up short. It was only in aggregate that the pattern became clear: users wanted to explore the collection, not just search within it. That finding led me to index descriptions as a new data source, and a whole category of previously failing queries started working.

This is what the industry calls the Data Flywheel: production data feeding back into development, continuously tightening the loop between user intent and agent capability. Your prompt logs become your reality check, revealing how users actually talk to your system versus how you imagined they would. When you cluster those real-world prompts (something as straightforward as embedding them and running HDBSCAN), you start finding these gaps systematically. That’s your roadmap for what to build next.

And the flywheel compounds. Better observability produces richer sessions. Richer sessions give the evaluator more to work with. Better evaluations lead to targeted improvements. Targeted improvements produce better outcomes, which produce more informative sessions. Each rotation makes the system a little smarter, a little more aligned with what users actually need.

To be clear: this isn’t the agent autonomously rewriting itself. I’m the one who reads the evaluator’s findings, verifies them against the session data, and decides what to change. Maybe I update a system prompt, add a new tool, or adjust a circuit breaker threshold. The AI surfaces the patterns; the human decides what to do about them. It’s the same human-on-the-loop philosophy I described in the last post, applied to the development cycle itself.

Together, these layers transform a clever demo into a system you can trust. Because in the age of agents, trust isn’t built on magic. It’s built on the ability to see the trick.

Throughout this series, we’ve been building up the theory: what agents are, how they think, what tools they need, how to keep them safe, and now how to make sure they’re actually working. In the next installment, I want to move from theory to practice. We’ll look at agents in the wild, real-world case studies in customer support, software development, and personal productivity, and what they tell us about how this technology is actually changing the way we work.

A conceptual illustration showing sound waves passing through a prism and refracting into a 3D scatter plot of colored clusters, representing different speaker identities in vector space.

The Fingerprint of Sound


Last year, I spent a lot of time obsessed with the concept of embeddings. I wrote about how they act as a bridge, transforming the messy, unstructured world of human language into a clean, numerical landscape that computers can understand. In my series on the topic, I explored how text embeddings allow us to map concepts in space—how they let us mathematically prove that “king” is close to “queen,” or find a podcast episode about “economic growth” even if the specific keywords never appear in the transcript.

For me, grasping text embeddings was a watershed moment. It turned AI from a black box into a geometry problem I could solve. But recently, my friend Pete Warden released a post that clicked another piece of the puzzle into place for me, moving that geometry from the page to the ear.

In his post, Speech Embeddings for Engineers, Pete tackles the problem of diarization—the technical term for figuring out “who spoke when” in an audio recording. If you’ve followed my podcast archive project, you know this has been a thorn in my side. I have thousands of transcripts, but they are largely monolithic blocks of text. I know what was said, but often I lose the context of who said it.

Pete’s explanation is brilliant because it leverages the exact same intuition we developed for text. Just as a text embedding captures the semantic “fingerprint” of a sentence, a speech embedding captures the vocal fingerprint of a speaker.

The mental shift is fascinating. When we embed text, we are mapping meaning. We want the vector for “dog” to be close to “puppy” and far from “motorcycle.” But when we embed speech for diarization, we don’t care about the meaning of the words at all. A speaker could be whispering a love sonnet or screaming a grocery list; semantically, those are worlds apart. But acoustically—in terms of timbre, pitch, and cadence—they share an undeniable identity.

Pete includes a Colab notebook that demonstrates this beautifully. It’s a joy to run through because it demystifies the process entirely. He walks you through taking short clips of audio, running them through a model, and visualizing the output.

Suddenly, you aren’t looking at waveforms anymore. You’re looking at clusters. You can see, visually, where one voice ends and another begins. It turns the murky problem of distinguishing speakers in a crowded room into a clean clustering algorithm, something any engineer can wrap their head around.
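To show just how clean that clustering intuition is, here's a toy sketch: group utterance embeddings by cosine similarity to the first clip of each known speaker. Real diarization pipelines use stronger clustering than this greedy pass, and the threshold is an assumption, but the geometry is the same.

```python
# Sketch: each clip joins the closest existing speaker if it's similar
# enough, otherwise it founds a new one. "Centroids" here are just the
# first clip of each speaker, for simplicity.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


def assign_speakers(embeddings: list[list[float]], threshold: float = 0.9) -> list[int]:
    centroids, labels = [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(sims.index(max(sims)))
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```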

This reinforces a recurring theme for me: the power of small, composable tools. We often look for massive, end-to-end APIs to solve our problems—a “magic box” that takes audio and returns a perfect script. But understanding the primitives is where the real power lies. By understanding speech embeddings, we aren’t just consumers of a transcription service; we are architects who can build systems that listen, identify, and understand the nuance of conversation.

If you’ve ever wrestled with audio data, or if you just want to see how the concept of embeddings extends beyond text, I highly recommend finding a quiet hour to work through Pete’s notebook. It might just change how you hear the data.

Great Video on Gemini Scribe and Obsidian

I was recently looking through the feedback in the Gemini Scribe repository when I noticed a few insightful comments from a user named Paul O’Malley. Curiosity got the better of me (I love seeing who is actually pushing the boundaries of the tools I build), so I took a look at his YouTube page. I quickly found myself deep in a walkthrough titled “I Built a Second Brain That Organises Itself.”

What caught my eye wasn’t just another productivity system; we’ve all seen the “shiny new app” cycle that leads to digital bankruptcy. It was seeing Gemini Scribe being used as the engine for a fully automated Obsidian vault.

The Friction of Digital Maintenance

Paul hits on a fundamental truth: most systems fail because the friction of maintenance—the tagging, the filing, the constant admin—eventually outweighs the benefit. He argues that what we actually need is a system that “bridges the gap in our own executive function”.

In his setup, he uses Obsidian as the chassis because it relies on Markdown. I’ve long believed that Markdown is the native language of AI, and seeing it used here to create a “seamless bridge” between messy human thoughts and structured AI processing was incredibly satisfying.

Gemini Scribe as the Engine

It was a bit surreal to watch Paul walk through the installation of Gemini Scribe as the core engine for this self-organizing brain. He highlights a few features that I poured a lot of heart into:

  • Session History as Knowledge: By saving AI interactions as Markdown files, they become a searchable part of your knowledge base. You can actually ask the AI to reflect on past conversations to find patterns in your own thinking.
  • The Setup Wizard: He uses a “Setup Wizard” to convert the AI from a generic chatbot into a specialized system administrator. Through a conversational interview, the agent learns your profession and hobbies to tailor a project taxonomy (like the PARA method) specifically to you.
  • Agentic Automation: The video demonstrates the “Inbox Processor,” where the AI reads a raw note, gives it a proper title, applies tags, and physically moves it to the right folder.

Beyond the Tool: A Human in the Loop

One thing Paul emphasized that really resonated with my own philosophy of Guiding the Agent’s Behavior is the “Human in the Loop”. When the agent suggests a change or creates a new command, it writes to a staging file first.

As Paul puts it, you are the boss and the AI is the junior employee—it can draft the contract, but you have to sign it before it becomes official. You always remain in control of the files that run your life.

Small Tools, Big Ideas

Seeing the Gemini CLI mentioned as a “cleaner and slightly more powerful” alternative for power users was another nice nod. It reinforces the idea that small, sharp tools can be composed into something transformative.

Building tools in a vacuum is one thing, but seeing them live in the wild, helping someone clear their “mental RAM” and close their loop at the end of the day, is one of the reasons I do this. It’s a reminder that the best technology doesn’t try to replace us; it just makes the foundations a little sturdier.