[Header image: A luminous geometric sphere with facets fragmenting outward, connected by thin orbital lines to three smaller glowing nodes (a chat bubble, code brackets, and a calendar grid) against a dark navy background.]

Agents in the Wild

Welcome back to The Agentic Shift. In our last post, we closed the loop on what it takes to move an agent from prototype to production: observability, evaluation, and the data flywheel that ties them together. We’ve spent ten installments building up the theory, piece by piece, from the anatomy of an agent through reasoning patterns, memory, tools, guardrails, attention management, frameworks, and interoperability protocols.

Now I want to talk about what happens when all of that theory meets the real world.

I was giving a talk to a group of engineers last week, and I found myself describing a pattern I keep seeing in my own work and in the industry at large. I called it the “code smell for agents,” borrowing from a post I wrote earlier this year. The idea is simple: if you’re writing if/else logic to decide what your AI should do, you’re probably building a classifier that wants to be an agent. Decompose those branches into tools, and let the model choose its own adventure. The room lit up. There were lots of questions, and the thing that generated the most interest was the idea that agents exhibit emergent behavior you didn’t specifically create. Give a model tools and a goal, and it starts making decisions you never explicitly programmed. That’s both the promise and the challenge. The theoretical architecture we’ve been mapping in this series isn’t just a blueprint anymore. It’s becoming the default way software gets built.
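To make the refactor concrete, here is a minimal Python sketch (the function and tool names are invented for illustration): before, the routing decision lives in hand-written branches; after, the branches become a tool registry and the model does the choosing.

```python
# Before: hand-rolled if/else routing -- a classifier in disguise.
def route_with_branches(message: str) -> str:
    text = message.lower()
    if "refund" in text:
        return "process_refund"
    elif "cancel" in text:
        return "cancel_order"
    else:
        return "answer_faq"

# After: the branches become tool descriptions. The model sees the
# goal and the descriptions and picks a tool; no routing logic in code.
TOOLS = {
    "process_refund": "Issue a refund for a charge the customer disputes.",
    "cancel_order": "Cancel an order that has not yet shipped.",
    "answer_faq": "Answer a general question from the knowledge base.",
}

def route_with_model(message: str, choose) -> str:
    # `choose` stands in for the LLM call; in practice this would be
    # a tool-use API invocation with TOOLS as the available functions.
    return choose(message, TOOLS)
```

The point is not the ten lines saved; it is that adding a fourth capability now means adding a tool description, not another branch.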

Today, I want to make this concrete. We’re moving from “how do agents work?” to “how are people actually using them?” The answer, it turns out, spans customer support centers processing millions of conversations, software engineering workflows where agents resolve real GitHub issues autonomously, and personal productivity tools that are turning everyone’s phone into a command center. Let’s look at each.

The Autonomous Frontline

Customer support was always going to be the first domain where agents proved themselves at scale. The data is structured, the success metrics are clear, and the cost of human labor is high. But what’s happening now goes far beyond the rigid chatbots of the previous decade.

The most striking case study is Klarna. In its first month of full deployment, Klarna’s AI assistant handled 2.3 million customer conversations, roughly two-thirds of the company’s total support volume. That’s the workload equivalent of 700 full-time human agents. Average resolution time dropped from eleven minutes to under two, an 82% improvement. And contrary to what you might expect from a system prone to hallucination, repeat inquiries dropped by 25%, suggesting the agent was more consistent at resolving root causes than the human workforce it augmented. Klarna estimated a $40 million profit impact in 2024 alone.

What makes this more than a chatbot story is the scope of autonomy. The Klarna agent doesn’t just quote FAQs. It processes refunds, handles returns, manages cancellations, and resolves disputes. These are actions with write access to financial ledgers. The system works because of a human-in-the-loop architecture where customers can always escalate to a human, but the default path is fully autonomous resolution.

Sierra has taken a different approach, building what they call the “Agent OS,” a platform designed to bridge the gap between the probabilistic nature of LLMs and the deterministic requirements of enterprise policy. Their deployment at WeightWatchers is a good example of why grounding and domain-specific instructions matter so much. A generic model understands “budget” as a financial concept, but the WW agent had to understand it as a daily allocation of nutritional points. With that grounding in place, the agent achieved a 70% containment rate (sessions fully resolved without human intervention) in its first week, while maintaining a 4.6 out of 5 customer satisfaction score.

What surprised me most about the WW deployment was an emergent behavior: users regularly exchanged pleasantries with the agent, sending heart emojis and expressing gratitude. When an agent is responsive, competent, and linguistically fluid, people engage with it as a social entity. That’s not a side effect. It’s a feature that drives retention.

At SiriusXM, Sierra deployed an agent called “Harmony” that takes this a step further with long-term memory. Instead of treating each chat as stateless, Harmony recalls previous subscription changes, music preferences, and technical issues across sessions. It can open a conversation with “I see you had trouble with the app last week, is that resolved?” That’s not reactive support. That’s proactive concierge service, and it’s only possible because the agent maintains the kind of persistent state we discussed in our memory architecture post.

One of the most important technical contributions in this space comes from Airbnb’s research on knowledge representation. They found that standard RAG pipelines fail when reasoning over complex policy documents with nested conditions. Their solution, the Intent-Context-Action (ICA) format, transforms policy documents into structured pseudocode where the agent predicts a specific Action ID (like ACTION_REFUND_50) that maps to a pre-approved response or API call, effectively eliminating policy hallucination. By using synthetic training data to fine-tune smaller open-source models, they achieved comparable accuracy at nearly a tenth of the latency. That’s the kind of practical engineering that separates a demo from a production system.
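As a rough sketch of what that shape might look like (the IDs, conditions, and refund math here are invented for illustration, not Airbnb's actual format):

```python
# Policy compiled into Intent-Context-Action entries. The model only
# ever emits an Action ID; the executable side stays deterministic.
ICA_POLICY = {
    "ACTION_REFUND_50": {
        "intent": "guest_requests_refund",
        "context": "guest cancelled within the partial-refund window",
        "action": lambda amount: round(amount * 0.50, 2),
    },
    "ACTION_REFUND_100": {
        "intent": "guest_requests_refund",
        "context": "host cancelled the reservation",
        "action": lambda amount: amount,
    },
}

def execute(action_id: str, amount: float) -> float:
    # Unknown IDs fail closed instead of letting the model improvise
    # a refund amount -- this is what eliminates policy hallucination.
    entry = ICA_POLICY.get(action_id)
    if entry is None:
        raise ValueError(f"unapproved action: {action_id}")
    return entry["action"](amount)
```

The model can still pick the wrong ID, but it can no longer invent a refund that no policy authorizes.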

The pattern across all of these deployments is clear: AI in customer support is shifting from information retrieval to task execution, from probabilistic guessing to deterministic action, and from stateless interactions to stateful relationships. This is the agentic shift in its most tangible form.

The Autonomous Engineer

If customer support agents operate within the guardrails of defined policy, software engineering agents work in an environment of much higher complexity. The shift here is from code completion (the “Copilot” era) to autonomous issue resolution (the “Agent” era).

The standard benchmark for evaluating this is SWE-bench, which tests an agent’s ability to resolve real-world GitHub issues: navigate a complex codebase, reproduce a bug, modify multiple files, and verify the fix against a test suite. As of early 2026, top-tier agents are achieving 70-80% resolution rates on SWE-bench Verified, up from roughly 4% in early 2023. On the more challenging SWE-bench Pro, which uses proprietary codebases, top models still hover around 45%, a reminder that complex legacy environments remain a significant hurdle.

I see this playing out daily in my own workflow. Tools like Gemini CLI and Claude Code have fundamentally changed how I write software. As I described in Everything Becomes an Agent, the moment I gave my agents access to shell commands and file tools, they stopped being autocomplete engines and started being collaborators. They could run tests, see the failure, edit the file, and run the tests again. The loop we described in Part 2 (Thought-Action-Observation) is no longer a theoretical pattern. It’s the actual development loop I use every day.

What’s driving this improvement isn’t just better models. It’s better scaffolding. The SWE-agent project at Princeton introduced the concept of the Agent-Computer Interface (ACI), a shell environment optimized for LLM token processing rather than human perception. It uses “observation collapsing” to summarize verbose terminal outputs, preventing the context window overflow that kills so many coding agents, and includes an automatic linting loop for rapid self-correction before expensive test suites run.
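A toy version of observation collapsing might look like this (the head/tail heuristic is my simplification; SWE-agent's actual implementation is more sophisticated):

```python
def collapse_observation(output: str, head: int = 5, tail: int = 5) -> str:
    """Keep the start and end of a verbose tool output, replacing the
    middle with a marker so the context window isn't flooded."""
    lines = output.splitlines()
    if len(lines) <= head + tail:
        return output  # short outputs pass through untouched
    omitted = len(lines) - head - tail
    return "\n".join(
        lines[:head]
        + [f"... [{omitted} lines collapsed] ..."]
        + lines[-tail:]
    )
```

The interesting design question is what to keep: for test runs, the head (the command) and tail (the failure summary) usually carry most of the signal.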

Even more exciting is Live-SWE-agent, which can synthesize its own tools on the fly. When it encounters a repetitive task, it writes a Python script to handle it and adds the script to its toolkit for the session. This dynamic adaptability helped it achieve 77.4% on SWE-bench Verified without extensive offline training. It’s a move from “static tool use” to “dynamic tool creation,” where the agent engineers its own environment.

On the product side, GitHub Copilot Workspace represents the Plan-and-Execute pattern productized for millions of developers. The user describes a task, the system generates an editable specification and plan, then implements the changes. This “steerable” design makes the agent’s reasoning visible and mutable, shifting the developer from “author” to “reviewer and architect,” exactly the “Human-on-the-Loop” model I’ve been advocating. And the protocol layer is catching up too, with tools like Goose implementing the Agent Client Protocol to decouple intelligence from interface, letting developers bring their own agent to their preferred editor.

The Cognitive Extension

The third domain is the most personal: productivity agents that manage the chaotic stream of daily information, tasks, and communication. The conceptual target is the “personal intern,” an always-on digital entity that doesn’t just answer questions but anticipates needs.

I’ve been living this with Gemini Scribe, my agent inside Obsidian. What started as a glorified chat window evolved into a full agentic system the moment I gave it access to read_file. Suddenly I wasn’t managing context manually; I was delegating. “Read the last three meeting notes and draft a summary” is not a chat interaction. It’s a delegation, and delegation requires the agent to plan, execute, and iterate. The same evolution happened with my Podcast RAG system, where deleting a classifier and replacing it with tools made the system both simpler and more capable.

But the most vivid example of personal agents “in the wild” right now is OpenClaw. If you haven’t been following, OpenClaw (formerly Moltbot) is an open-source AI agent that runs locally, connects through messaging apps you already use (WhatsApp, Telegram, Signal, Slack), and takes action on your behalf. It can execute shell commands, manage files, automate browser sessions, handle email and calendar operations. It has over 300,000 GitHub stars and a community of people using it for everything from negotiating car purchases to filing insurance claims.

OpenClaw is a fascinating case study because it makes the theoretical architecture of this series tangible. It’s a model running in a loop with access to tools. It has memory (local configuration and interaction history that persists across sessions). It uses the ReAct pattern to reason about tasks and choose actions. And it has all the failure modes we’ve discussed: Cisco’s AI security research team found that a third-party skill called “What Would Elon Do?” performed data exfiltration and prompt injection without user awareness, demonstrating exactly the kind of guardrail failures we examined in Part 6.

The underlying technical challenge is memory. For a personal agent to be useful over time, it has to remember. Systems like Mem0 extract preferences and facts into a vector store for future retrieval. Zep goes further with a Temporal Knowledge Graph that stores facts in time and in relation to one another, enabling reasoning over questions like “What did we decide about the budget last week?” On the enterprise side, Glean connects to over 100 SaaS applications to build a unified knowledge graph with a “Personal Graph” that layers individual work patterns on top of company data. These are the production-grade versions of what we discussed theoretically in Part 3.
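The core extract-and-recall pattern is simple to sketch, even though production systems like Mem0 and Zep are far richer (this toy uses timestamps and exact subject matching where they would use vector retrieval or a knowledge graph):

```python
import time

class FactStore:
    """Toy version of the extract-and-recall memory pattern: facts are
    stored with timestamps so 'last week' style questions can be scoped."""

    def __init__(self):
        self.facts = []  # list of (timestamp, subject, fact)

    def remember(self, subject: str, fact: str, ts=None):
        self.facts.append((ts if ts is not None else time.time(), subject, fact))

    def recall(self, subject: str, since: float = 0.0):
        # Temporal scoping: only facts recorded at or after `since`.
        return [f for t, s, f in self.facts if s == subject and t >= since]
```

The temporal dimension is the part that matters: "What did we decide about the budget last week?" is a filter over time as much as a lookup over content.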

When Things Go Wrong in Production

Deploying agents in the wild surfaces failure modes that simply don’t exist in chat interfaces. The research on agentic production reliability identifies patterns I see constantly.

Reasoning spirals are the most common. An agent searches for “pricing,” finds nothing, and searches again with the same parameters. It’s stuck in a local optimum, unable to update its strategy. The fix is a state hash (checking if the current state matches a previous one) combined with circuit breakers (hard limits on steps or tokens per session). I described this in detail in our post on the observability gap.
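A minimal guard combining both defenses might look like this (a sketch, not production code; the hash covers the tool name and arguments, though in practice you would fold in more of the agent's state):

```python
import hashlib

class LoopGuard:
    """Detects reasoning spirals via a state hash and enforces a hard
    step budget (circuit breaker)."""

    def __init__(self, max_steps: int = 20):
        self.seen = set()
        self.steps = 0
        self.max_steps = max_steps

    def check(self, tool: str, args: str) -> str:
        self.steps += 1
        if self.steps > self.max_steps:
            return "abort:budget"  # circuit breaker tripped
        digest = hashlib.sha256(f"{tool}|{args}".encode()).hexdigest()
        if digest in self.seen:
            return "abort:repeat"  # identical action seen before: a spiral
        self.seen.add(digest)
        return "ok"
```

On an abort you don't have to kill the session; a common move is to inject a message telling the agent its last action was a repeat and it must change strategy.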

Tool hallucination is more insidious. The agent doesn’t hallucinate facts in prose; it hallucinates tool parameters, passing a string where the API expects an integer or inventing a document ID that doesn’t exist. These cause system crashes or silent data corruption. Strict schema validation and constrained decoding (forcing the model to output valid JSON) are essential defenses.
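Even without constrained decoding, a pre-execution schema check catches the most common cases. A minimal sketch (the schema format here is invented for brevity; production systems typically validate against JSON Schema):

```python
def validate_call(args: dict, schema: dict) -> list:
    """Check a model-proposed tool call against a parameter schema
    before execution. Returns a list of errors; empty means valid."""
    errors = []
    for name, expected_type in schema.items():
        if name not in args:
            errors.append(f"missing parameter: {name}")
        elif not isinstance(args[name], expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    for name in args:
        if name not in schema:
            errors.append(f"unknown parameter: {name}")
    return errors
```

The errors are worth feeding back to the model verbatim: "expected int, got str" is exactly the kind of observation an agent can self-correct from.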

Silent abandonment is the quietest failure. The agent hits ambiguity or a tool error, politely apologizes (“I’m sorry, I couldn’t find that”), and gives up without alerting anyone. This is often a side effect of RLHF training, where the model has learned that apologizing is a safe response. The Reflexion pattern combats this by forcing the agent to generate a self-critique and try a different strategy before surrendering.
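A skeletal version of that retry-with-critique loop (the `attempt` and `critique` callables stand in for model calls; this is a sketch of the pattern, not the Reflexion paper's implementation):

```python
def run_with_reflexion(task, attempt, critique, max_tries: int = 3):
    """Instead of apologizing on the first failure, force a self-critique
    and retry with the accumulated notes. If all tries fail, escalate
    explicitly rather than giving up silently."""
    notes = []
    for _ in range(max_tries):
        result = attempt(task, notes)  # notes inform the next strategy
        if result["success"]:
            return result
        notes.append(critique(task, result))  # reflect, then retry
    return {"success": False, "escalate": True, "notes": notes}
```

The key property is the final return: failure surfaces as an explicit escalation with the critique trail attached, so a human sees what was tried instead of a polite apology.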

Cascading failures appear in multi-agent systems, where a hallucination in one agent (a researcher providing bad data) can poison the entire chain (a writer publishing false information). This is why supervisor architectures and the kind of observability infrastructure we discussed in Part 10 are not optional.

The Economic Reckoning

All of these deployments share a common economic implication. For two decades, SaaS relied on seat-based pricing, charging per user login, a model that assumes software is a tool used by humans. Agents challenge that assumption by acting as autonomous workers. When Klarna’s agent does the work of 700 humans, the demand for seats shrinks. Financial analysts have started calling this the “SaaSpocalypse.” The new model is “Service-as-a-Software,” where you pay for the completed task rather than the license. Salesforce’s Agentforce already prices at $2 per conversation. HubSpot is pivoting to consumption-based models. Klarna has moved to replace Salesforce and Workday with internal AI solutions entirely.

This doesn’t mean the end of human labor. In the Klarna deployment, the remaining humans focused on complex, high-empathy interactions. In software development, Copilot Workspace elevates the developer to a product manager role. It’s the same human-on-the-loop philosophy, applied at the scale of the labor market itself.

From Theory to Territory

Looking at all of this evidence, I keep coming back to a simple thought. Every concept in this series has a real-world counterpart operating in production right now. The ReAct loop powers coding agents that iterate on failing tests. Memory architectures enable SiriusXM’s Harmony to remember your subscription history. Tool grounding and instruction engineering are what make Airbnb’s ICA format work. Guardrails are what OpenClaw desperately needs more of. Context management is what SWE-agent’s observation collapsing solves. Frameworks are what make it possible to build these systems without starting from scratch every time. Protocols are what connect them to the wider world. And observability is what keeps them honest.

The agents are no longer theoretical. They’re processing refunds, merging code, negotiating car prices, and managing enterprise knowledge graphs. They’re also getting stuck in loops, hallucinating tool parameters, and quietly giving up when they shouldn’t. The technology works, and it fails, in exactly the ways we’ve been describing.

This brings us to our final installment. We’ve mapped the territory. We’ve seen what these systems can do and where they break. In Part 12, we’ll step back and grapple with the hardest questions: responsibility, governance, and the road ahead. What do we owe the people affected by these systems? How do we ensure this shift makes the world better, not just more efficient? The engineering is the easy part. The ethics are where the real work begins.
