Welcome back to The Agentic Shift. We’ve spent the last few posts assembling our AI agent piece by piece: giving it senses (Perception), different ways to think (Reasoning Patterns), ways to remember (Memory), and hands to act (Toolkit). We’ve even considered how to keep it safe (Guardrails). Our agent is becoming quite capable.
But there’s a hidden bottleneck, a cognitive constraint that governs everything an agent does: its attention span. Think of an agent’s context window—the amount of information it can hold in its “mind” at once—like a craftsperson’s workbench. A tiny bench limits the tools and materials you can have ready, forcing you to constantly swap things in and out. A massive bench might seem like the solution, but if it’s cluttered with every tool you own, finding the right one becomes a nightmare. You spend more time searching than working.
For an AI agent, its context window is this workbench. It’s arguably its most precious resource. Every instruction, every piece of conversation history, every tool description, every retrieved document—they all compete for space on this limited surface. And just like a cluttered workbench hinders a craftsperson, a crowded context window can cripple an agent’s performance.
This isn’t just about running out of space. It’s about the very nature of how these models “pay attention.” Let’s explore why simply throwing more context at an agent isn’t the answer, and why mastering the art of managing its attention is the key to building truly effective autonomous systems.
The Illusion of Infinite Space
In our industry, we have a tendency to race toward bigger numbers. This has led to an arms race for enormous context windows—millions of tokens, capable of holding entire books or codebases in memory. It’s tempting to see this as the solution to an agent’s limitations. Just pour everything in, right?
Unfortunately, it’s not that simple. There’s a critical distinction to be made between ingesting data and interacting with it. Models like Gemini have shown incredible capability in understanding a vast, static context dumped in all at once—an entire codebase, a full video, or a library of books. This is the “read-only” use case, and it’s powerful for both one-off and multi-shot analysis, precisely because the data in the context is never overwritten or superseded by new, conflicting information as the agent works.
But agentic work is rarely read-only. An agent changes things. It writes new code, it modifies files, it holds a conversation. And this is where the cracks appear. The moment the context becomes dynamic, with the agent adding its own thoughts, observations, and new file versions, performance can begin to degrade. The problem isn’t just size; it’s churn. This churn, this constant modification of the workbench, leads to three fundamental problems.
First, there’s the simple physics of attention. At their core, most modern LLMs rely on a mechanism called “self-attention,” first introduced in the foundational “Attention Is All You Need” paper. It’s what allows them to weigh the importance of different words and understand long-range connections in text. But this power comes at a cost: the computation required scales quadratically with the length of the input. Doubling the context doesn’t double the work; it quadruples it. This leads to slower responses (latency) and higher operational costs, hitting practical limits long before theoretical token limits are reached. Adding to this, the “KV cache”—a sort of short-term memory for processing—also grows linearly with context, demanding huge amounts of expensive GPU memory just to keep the conversation going. Optimizations like FlashAttention tame the cost of the attention computation itself, and KV-cache managers like vLLM’s PagedAttention reduce memory waste, but neither fundamentally eliminates the growth.
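To make the scaling concrete, here’s a back-of-the-envelope sketch. The numbers are purely illustrative (the 8,000-token baseline is arbitrary) and real systems complicate the picture, but the ratios capture why long contexts get expensive fast:

```python
# Back-of-the-envelope scaling: only the ratios matter, not the absolute numbers.
BASELINE = 8_000  # an arbitrary reference context size

def relative_cost(tokens: int) -> tuple[float, float]:
    """Return (attention compute, KV-cache memory) relative to the baseline."""
    scale = tokens / BASELINE
    return scale ** 2, scale  # compute grows ~quadratically, KV cache ~linearly

for n in (8_000, 32_000, 128_000, 1_000_000):
    compute, memory = relative_cost(n)
    print(f"{n:>9,} tokens -> ~{compute:,.0f}x attention compute, "
          f"~{memory:,.0f}x KV-cache memory")
```

Going from 8,000 to 1,000,000 tokens is a 125x jump in context, but roughly a 15,000x jump in attention compute under this simple model.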
We don’t even need to look at the architecture to see this; we can just follow the money. Many model providers have different pricing tiers for the same model, with a steep cliff for requests that use a very large context. This isn’t just a business decision; it’s a direct reflection of the resource cost. As builders, we can use this as a practical heuristic. If we design our agent’s main reasoning loop to stay under that pricing cliff—say, in the cheapest 20% of the context window—we not only save significant cost, but we’re also implicitly aligning with the model’s most efficient operational range, which often correlates with higher reliability and performance.
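In practice, that heuristic can be as simple as a budget check before every call in the reasoning loop. Here’s a minimal sketch; the window size, the pricing-cliff threshold, and the four-characters-per-token estimate are all stand-in assumptions you’d replace with your provider’s real numbers and tokenizer:

```python
# Hypothetical numbers: a 1M-token window with a pricing cliff at 200k tokens.
CONTEXT_WINDOW = 1_000_000
PRICING_CLIFF = 200_000                      # requests above this bill at a higher rate
TARGET_BUDGET = int(CONTEXT_WINDOW * 0.20)   # aim for the cheapest ~20% of the window

def estimate_tokens(text: str) -> int:
    """Crude stand-in for your provider's tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)

def within_budget(messages: list[str]) -> bool:
    """True if the assembled prompt stays under our self-imposed budget."""
    total = sum(estimate_tokens(m) for m in messages)
    return total <= min(TARGET_BUDGET, PRICING_CLIFF)

# If this returns False, it's a signal to curate (summarize, prune, delegate)
# before calling the model, rather than silently paying the premium tier.
```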
Second, even with infinite computing power, we run into a curious cognitive blind spot. Research has revealed a flaw in how LLMs use long contexts. The “Lost in the Middle” paper famously showed that models have a strong bias towards information at the very beginning and very end of their context window. Information buried in the middle often gets ignored or forgotten, regardless of its importance. It’s like trying to remember the middle chapters of a very long book – the beginning and end stick, but the details in between get fuzzy. This means a bloated context window doesn’t just slow things down; it can actively hide critical information from the model’s attention, leading to mistakes and task failures.
Finally, all this clutter ends up drowning the signal. From an information theory perspective, the context window is a communication channel. We’re trying to send a “signal” (the important instructions and data) to the model. Everything else is “noise.” A crowded context window, filled with redundant chat history, verbose tool outputs, or irrelevant retrieved documents, lowers the signal-to-noise ratio. The crucial instructions get drowned out. The agent loses track of its original goals, misinterprets commands, or gets stuck in loops because the essential information is obscured by the clutter. This phenomenon, sometimes called “context pollution,” is perhaps the most insidious problem, as it degrades the agent’s reasoning quality subtly over time.
A classic, painful example of this is a coding agent. In the course of its work, it might generate five or six slightly different versions of the same file as it tries to fix a bug. If all of these versions remain in the context, the agent can become deeply confused. Worse, we often inject these file versions without any indicator or metadata specifying which is the most recent. For a model’s attention mechanism, which might be biased by what’s “Lost in the Middle,” it’s just a sea of text. It could easily start referring to an old, obsolete version of a file while generating new code, simply because of its position in the context. Which version is the ‘latest’? Which one has the bug? Which one was the dead end? The critical “signal” (the correct file version) is drowned in the “noise” (the five incorrect ones). This is the digital equivalent of a workbench so covered in drafts and scrap paper that you can no longer find the final blueprint.
These three factors—computational cost, cognitive biases, and information overload—converge to create the “crowded context problem.” It forces us to recognize that effective context management isn’t about maximizing quantity, but about optimizing quality and relevance. The goal isn’t just to give the agent information, but to carefully curate its attention.
This challenge isn’t all that different from our own cognitive limits. We humans are serial taskers, not parallel processors. Once our own working memory gets cluttered with too many facts, instructions, and interruptions, our performance degrades. We make simple mistakes. We forget the original goal. We handle this by externalizing our memory—we write things down, we make lists, we refer to our notes. Imagine trying to do your taxes entirely in your head, only getting to look at each document once. That’s a crowded context. The strategies we’re forced to develop for agents, like summarization and retrieval, are really just digital versions of the notebooks and ledgers we’ve relied on for centuries.
The Art of Curation
If we can’t just use a bigger bench, we must become better organizers. The art of building capable agents is, in large part, the art of context curation. This entire endeavor is a constant game of trade-offs, primarily between token cost and agent performance. Sometimes, as we’ll see, you must strategically spend more tokens on curation—like paying for an extra summarization call—to achieve a better long-term outcome in performance, reliability, and overall cost. Our goal isn’t just to use the fewest tokens, but to use them in the smartest way.
This “smart” usage is why we see frontier labs provide a spectrum of models, from high-performance “Pro” versions to incredibly fast “Flash” or “Lite” versions. We can design systems that use a cheap, fast model for the high-frequency work of context curation (like summarizing a conversation), saving the expensive, powerful model for the core reasoning task. This is a perfect example of the trade-off: we’re increasing our total token count by using two models, but we’re slashing our overall cost and latency by using the right tool for the job.
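Here’s a rough sketch of that split. The model names, the pricing figures, and the call_llm helper are hypothetical placeholders rather than any particular SDK; the point is the division of labor:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    cost_per_1k_tokens: float  # illustrative numbers, not real pricing

# Hypothetical tiers; substitute whatever your provider actually offers.
FLASH = ModelConfig(name="fast-cheap-model", cost_per_1k_tokens=0.0001)
PRO = ModelConfig(name="frontier-model", cost_per_1k_tokens=0.01)

def call_llm(model: ModelConfig, prompt: str) -> str:
    """Placeholder for a real SDK call; returns a canned string here."""
    return f"[{model.name} response to {len(prompt)} chars of prompt]"

def summarize_history(history: list[str]) -> str:
    # High-frequency, low-stakes curation work goes to the cheap model.
    prompt = "Summarize this conversation so far:\n" + "\n".join(history)
    return call_llm(FLASH, prompt)

def answer(latest_request: str, history: list[str]) -> str:
    # The expensive model only ever sees a lean, curated prompt.
    summary = summarize_history(history)
    prompt = f"Conversation summary:\n{summary}\n\nCurrent request:\n{latest_request}"
    return call_llm(PRO, prompt)
```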
But wait, you might be thinking, didn’t we already solve this? In Part 3, we gave our agent a “Memory.” How is this different? This is a crucial distinction. Memory is the long-term, external filing cabinet. It’s the vector database, the SQL store, the document collection. It’s vast, persistent, and “cold.” The Context Window is the workbench. It’s the “hot,” active, in-the-moment workspace. An agent can’t think about something that isn’t on the workbench.
Context management, then, is the process of moving information between the filing cabinet and the workbench. A memory system on its own is useless; it’s the RAG (Retrieval-Augmented Generation) pipeline that finds the right document in the cabinet and places it on the bench, right when the agent needs it. The summarization techniques we’re about to discuss are like taking old notes from the bench, clipping them together, and filing them away, replacing them with a single summary document.
The goal is to keep the workbench clean, using the long-term memory as our strategic reserve. With that distinction, let’s look at the core techniques.
The most straightforward approach is Summarization. Instead of feeding the entire conversation history back into the context window with every turn, we use a separate, small LLM call to create a running summary of what’s been discussed. As the conversation gets longer, the oldest messages are consumed by this summarization process and replaced by a concise narrative. It’s the equivalent of finishing a step, putting all the specialized tools for that step into a drawer, and just leaving a label that says “Step 1: Assembled the frame.” The trade-off, of course, is a small cost in latency and tokens for the summarization call itself. You’re spending a little compute now to save a lot of compute later by keeping the main loop efficient. Deciding when it’s “worth it” is a practical question: if the conversation is short, it’s overkill. But for an agent designed to have a long-running, stateful interaction, it’s essential. Popular frameworks like LangChain and LlamaIndex have long offered “conversation buffer” utilities that handle this summarization and pruning logic out of the box, making it a standard pattern.
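A minimal sketch of this pattern might look like the following. The summarize stub stands in for that cheap LLM call, and the buffer size is an arbitrary choice, not a recommendation from any particular framework:

```python
def summarize(existing_summary: str, messages: list[str]) -> str:
    """Stand-in for the cheap LLM call: 'fold these messages into the running summary'."""
    folded = "; ".join(m[:60] for m in messages)
    return (existing_summary + " | " + folded).strip(" |")

class SummarizingHistory:
    """Keeps recent messages verbatim and folds older ones into a running summary."""

    def __init__(self, max_recent: int = 10):
        self.summary = ""            # condensed narrative of older turns
        self.recent: list[str] = []  # the most recent messages, kept verbatim
        self.max_recent = max_recent

    def add(self, message: str) -> None:
        self.recent.append(message)
        if len(self.recent) > self.max_recent:
            # The oldest half of the buffer gets consumed by the summarizer.
            cut = self.max_recent // 2
            self.summary = summarize(self.summary, self.recent[:cut])
            self.recent = self.recent[cut:]

    def as_context(self) -> str:
        """What actually goes back onto the workbench each turn."""
        return (f"Summary of earlier conversation:\n{self.summary}\n\n"
                "Recent messages:\n" + "\n".join(self.recent))
```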
A more surgical approach is History Pruning and Filtering. Not all messages are created equal. A user’s core instruction (“Find me the best price on flights to Tokyo”) is far more important than the five conversational turns that follow (“Okay, searching now,” “What dates?,” “October 10th to the 20th,” “Got it,” “Here’s what I found…”). We can be ruthless. We can design systems to tag messages by importance or type—like system_instruction, user_request, agent_thought, tool_output—and then selectively prune the least important ones (like verbose tool_output or intermediate agent_thought steps) as the context fills up. This keeps the high-signal, high-importance messages while aggressively removing the noise.
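Here’s one way to sketch that tagging-and-pruning logic. The tag names mirror the ones above; the priority order and the four-characters-per-token estimate are illustrative assumptions:

```python
from dataclasses import dataclass

# Lower numbers get pruned first. The system instruction is never pruned.
PRUNE_PRIORITY = {"tool_output": 0, "agent_thought": 1, "user_request": 2}

@dataclass
class Message:
    kind: str      # "system_instruction", "user_request", "agent_thought", or "tool_output"
    content: str

def estimate_tokens(messages: list[Message]) -> int:
    # Crude estimate: roughly four characters per token.
    return sum(len(m.content) // 4 for m in messages)

def prune(messages: list[Message], token_budget: int) -> list[Message]:
    """Drop the least important, oldest messages until the history fits the budget."""
    kept = list(messages)
    for kind in sorted(PRUNE_PRIORITY, key=PRUNE_PRIORITY.get):
        while estimate_tokens(kept) > token_budget:
            # Remove the oldest message of this (most disposable) kind, if any remain.
            idx = next((i for i, m in enumerate(kept) if m.kind == kind), None)
            if idx is None:
                break
            kept.pop(idx)
        if estimate_tokens(kept) <= token_budget:
            break
    return kept
```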
We can also get clever with Strategic Re-ordering. This tactic directly combats the “Lost in the Middle” problem. If we know the model pays most attention to the beginning and end of the context, we can engineer the prompt to put the most critical information in those “premium” slots. It’s a hack that plays to the model’s known biases. The agent’s core identity, rules, and primary objective go at the very top (the “primacy” bias). The user’s very last instruction and the most recent tool outputs go at the very bottom (the “recency” bias). The long, noisy conversation history and retrieved documents get placed in the middle, where they’re accessible if needed but are much less likely to obscure the primary goal. It’s like putting the one thing you must not forget right by the front door.
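A sketch of assembling a prompt with those premium slots in mind; the section labels and parameter names are illustrative, not a prescribed format:

```python
def assemble_prompt(
    system_identity: str,       # the agent's role, rules, and primary objective
    history_summary: str,       # condensed conversation so far
    retrieved_docs: list[str],  # RAG results and other bulky reference material
    recent_tool_output: str,    # the freshest observation
    latest_user_message: str,   # the instruction the agent must act on right now
) -> str:
    """Put critical content at the top (primacy) and bottom (recency);
    bulky, lower-priority material goes in the middle."""
    middle = "\n\n".join([history_summary, *retrieved_docs])
    return (
        f"{system_identity}\n\n"                                  # top slot: primacy bias
        f"--- Reference material ---\n{middle}\n\n"               # middle: present, low priority
        f"--- Latest observation ---\n{recent_tool_output}\n\n"
        f"--- Current instruction ---\n{latest_user_message}"     # bottom slot: recency bias
    )
```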
Finally, we can solve the file versioning problem with External State Management, or what I call “The Scratchpad.” Instead of forcing the agent to hold entire files in its context, we give it tools to interact with an external “workspace”—a fancy term for a directory on a file system. The agent is given read_file(path), write_file(path, content), and list_files() tools. Its internal monologue becomes: “My task is to refactor main.py. First, I’ll read it.” It calls read_file("main.py"). The file’s content enters its context. It thinks. “Okay, I need to change the foo function.” It generates the new code. “Now, I’ll save the new version.” It calls write_file("main.py", new_content). The new file is saved to disk, overwriting the old one. The content of the file is now out of its context, but the agent’s knowledge (“I have successfully updated main.py”) remains. The “latest version” is simply the version that exists on disk. This pattern keeps the context window clean, containing only pointers (filenames) and the specific chunk of code being edited right now, rather than five confusing, slightly different full-file drafts.
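A bare-bones version of those scratchpad tools might look like this. The sandboxing is deliberately minimal and illustrative; a production agent would need stricter path validation and concurrency handling:

```python
from pathlib import Path

WORKSPACE = Path("./agent_workspace")
WORKSPACE.mkdir(exist_ok=True)

def _resolve(path: str) -> Path:
    """Keep every file operation confined to the workspace directory."""
    full = (WORKSPACE / path).resolve()
    if WORKSPACE.resolve() not in full.parents:
        raise ValueError(f"Path escapes the workspace: {path}")
    return full

def read_file(path: str) -> str:
    """Tool: return the current contents of a file so the agent can reason about it."""
    return _resolve(path).read_text()

def write_file(path: str, content: str) -> str:
    """Tool: persist a new version to disk; the disk, not the context, is the source of truth."""
    target = _resolve(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"Wrote {len(content)} characters to {path}"

def list_files() -> list[str]:
    """Tool: list everything currently in the workspace."""
    return [str(p.relative_to(WORKSPACE)) for p in WORKSPACE.rglob("*") if p.is_file()]
```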
The Shop of Specialists
While these techniques help, they still operate on a single workbench. The most powerful strategy for managing cognitive load is to not do the work yourself. This brings us to a critical architectural decision: the difference between an atomic tool and a sub-agent.
An “atomic” tool is just a function call. It’s a simple, known-quantity operation. Think of get_current_weather("San Francisco"). The agent calls this tool, and it returns a small, predictable piece of data. It’s like a screwdriver: you pick it up, you turn a screw, you put it down. It’s a self-contained, deterministic action that doesn’t add much cognitive overhead. It doesn’t “think”; it just does.
A sub-agent is something entirely different. It’s a specialized “worker” agent that is, itself, a tool in the main agent’s toolkit. Instead of giving it a simple instruction, the main “orchestrator” agent gives it a goal. This sub-agent is a fully capable agent in its own right, and to be effective, it also uses all the context management strategies we just discussed—summarization, pruning, and its own scratchpad—within its own, private workbench. With this step, we don’t just have a single workbench; we have an entire shop filled with specialized benches and dedicated craftspeople ready to work.
Imagine the main agent’s task is to “Write a market analysis report on the future of renewable energy.” An atomic-tool approach, even with a scratchpad, would be a cognitive nightmare for the main agent. It would have to call search_web, get a messy list of results, write them to a file to clean its context, read the file back to pick one, scrape the URL, get a huge block of text, write that to another file, and repeat this messy loop, all while its main context is filled with the intermediate agent_thought steps of “what do I do next?”.
The sub-agent approach is far cleaner. The main agent has a tool called research_specialist. It calls this tool with a single goal: research_specialist.run("market analysis for renewable energy").
This research_specialist is a complete agent in its own right. It has its own context window, its own set of tools (search_web, scrape_url, write_file), and its own reasoning loop. It performs all the messy steps—searching, scraping, reading, summarizing, getting lost, backtracking—in its own workspace. The main agent’s context window remains clean, containing only one entry: “Waiting for research_specialist to return.”
Finally, the sub-agent finishes its work and returns a single, clean, final answer: a concise, multi-paragraph analysis. This is the only piece of information that enters the main agent’s context.
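In code, the delegation boundary can be as simple as the sketch below. The specialist’s internals are deliberately elided (the reasoning loop is just a placeholder), and all the names here are hypothetical; what matters is that only the condensed final answer ever crosses back into the orchestrator’s context:

```python
class ResearchSpecialist:
    """A sub-agent: it has its own tools, its own context, and its own reasoning loop."""

    def __init__(self, tools: dict, max_steps: int = 20):
        self.tools = tools             # e.g. search_web, scrape_url, write_file
        self.context: list[str] = []   # private workbench, never seen by the caller
        self.max_steps = max_steps

    def run(self, goal: str) -> str:
        self.context.append(f"Goal: {goal}")
        for _ in range(self.max_steps):
            # In a real agent, an LLM call would pick the next tool and its arguments,
            # and all the messy intermediate results would pile up in self.context.
            pass
        # Only a condensed final answer crosses the boundary back to the caller.
        return self.summarize_findings()

    def summarize_findings(self) -> str:
        return "A concise, multi-paragraph analysis distilled from the private context."


# From the orchestrator's point of view, the sub-agent is just another tool:
research_specialist = ResearchSpecialist(tools={})
report = research_specialist.run("market analysis for renewable energy")
# The orchestrator's context now holds only `report`, not the hundreds of
# intermediate searches, scrapes, and drafts the specialist worked through.
```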
This hierarchical approach is one of the most important patterns in modern agent architecture. It’s the ultimate act of cognitive delegation. It’s the difference between a CEO trying to personally write every line of code for the company website and a CEO hiring a VP of Engineering and trusting them to return a finished product.
This also clarifies the line between what we’re discussing here and the topic of Part 9, multi-agent systems. A sub-agent, as we’ve defined it, is a hierarchical tool. It is called by a superior orchestrator, it performs a task, and it returns a result. A true multi-agent system, as we’ll explore in Part 9, implies collaboration between peers. These agents may negotiate, compete, or work in parallel, often without a single “boss” dictating their actions. What we’ve built here is the foundation for that: a single, capable orchestrator, which we will soon teach to collaborate.
A Curation Playbook
We’ve covered a lot of strategies, from simple pruning to complex delegation. As a practical builder, you might be wondering: “When do I use each one?”
The answer depends on the complexity of your agent’s task. We can think of it as a series of levels, a playbook for deciding which strategy to deploy.
Level 0: The “Fire-and-Forget” Agent. For simple, one-shot tasks (e.g., “Summarize this article” or “Classify this email”), you don’t need complex context management. The prompt is the entire context. It’s self-contained and requires no memory of the past.
Level 1: The “Long-Running Conversation.” The moment your agent needs to remember what was said five messages ago, you’ve hit Level 1. This is the baseline for any stateful chatbot or assistant. Your non-negotiable strategy here is Summarization or History Pruning. Without it, the agent will quickly lose the thread of the conversation, and performance will degrade.
Level 2: The “Stateful Workspace.” The instant your agent needs to modify an external resource—like our coding agent that edits files—a conversation summary is no longer enough. You’re at Level 2. The agent must have an External State Management system (our “Scratchpad”). It needs to be able to read and write to a reliable source of truth so it isn’t confused by its own drafts and intermediate steps.
Level 3: The “Complex, Multi-Step Project.” When the task stops being a single goal and becomes a messy, multi-step project (e.g., “Do a complete market analysis,” “Plan a product launch,” or “Refactor this entire codebase”), the main agent’s cognitive load will become too high, even with a scratchpad. This is your trigger to move to Level 3: Sub-Agent Delegation. You don’t ask one agent to do everything; you ask one orchestrator agent to hire and manage a team of specialists.
As you can see, this logic—deciding when to summarize, what to prune, how to manage a scratchpad, and how to orchestrate sub-agents—is complex. It’s a lot of scaffolding to build by hand for every new project. And that is precisely why agent frameworks exist.
Attention is the New Scarcity
In the early days of computing, we were constrained by memory and processing power. In the age of AI agents, the new frontier of scarcity is attention. The “Lost in the Middle” problem and the quadratic scaling of self-attention aren’t just implementation details; they are fundamental physical and cognitive limits we must design around.
A common question is whether these techniques are just “hacks” for today’s flawed models. Will techniques like strategic re-ordering be obsolete when the “Lost in the Middle” problem is solved? Perhaps. But while specific tactics may fade, the principle of managing cognitive load is timeless.
Two things tell us this is a permanent challenge. First, there’s the simple matter of performance. Even if an agent had an infinite and perfect context window, it doesn’t have infinite time. Reviewing a million tokens to find one relevant fact is profoundly inefficient. We want our agents to be speedy and responsive, and that means designing them to carry only the context necessary for the immediate task.
Second, this problem doesn’t disappear with AGI. On the contrary, it may become more critical. As we discussed, humans—the only general intelligences we know—are hobbled by cognitive load. We invented notebooks, calendars, and filing cabinets to manage our own attention. It’s a reasonable assumption that even a super-intelligence will require similar external systems to help it focus, manage its goals, and not get lost in an infinite sea of its own thoughts.
An agent’s effectiveness, then, will always be defined not just by how much information it can access, but by how efficiently it can access and focus on the right information at the right time.
Building successful agents requires us to shift our thinking from “how big can we make the context window?” to “how can we design a system that needs as little context as possible?”
By distinguishing between long-term memory and the active workbench, by managing our agent’s state in an external scratchpad, and, most importantly, by delegating complex tasks to specialized sub-agents, we can build systems that are not just more powerful, but more reliable, efficient, and intelligent.
Next in The Agentic Shift: We’ve now assembled all the core concepts of a single, capable agent. But how do you actually build one without starting from scratch? In Part 8, we’ll explore the landscape of agent frameworks—the scaffolding that provides pre-built components for memory, tools, and context management, helping us go from idea to implementation.