Brass calipers measuring a glowing wireframe sphere floating above a dark wooden workbench scattered with paper task sheets.

How I Built a Scoreboard for My Own Agent

The bug fix took an afternoon. The follow-up question took a week.

I was deep in Gemini Scribe, my Obsidian plugin that drops a Gemini-powered agent into your vault, and I had just shipped a change to the way the agent picked its tools. It felt better. The few sessions I ran by hand showed cleaner reasoning, fewer wasted tool calls, less of the weird “let me search for that again with slightly different keywords” tic. I committed, pushed, and moved on.

Then a friend asked, casually, “how much better?”

I had no answer. None I trusted, anyway. I had vibes. I had a handful of session transcripts I could squint at. I had the comforting belief that change is progress, which is the most dangerous belief you can hold when you are building with non-deterministic systems.

When I wrote about the observability gap earlier this year, I argued that you cannot fix what you cannot see. Observability lets you watch a single agent run unfold. But it does not tell you whether the next run will be better than this one. For that, you need a different instrument. You need a scoreboard.

So I built one. This is the story of what it took to make it credible, and what it told me when it finally was.

Two Reasons This Suddenly Mattered

The friend’s question was the trigger, but it was not the only reason I needed an answer. Two larger pressures had been building for a month.

The first was Ollama. In version 4.8, shipped a month ago, I added a local-model provider to Gemini Scribe. The plugin can now drive the agent against a model running on your own hardware, with no API key and no per-token cost. I wanted that, and so did a lot of users. But the moment I shipped it I had a question I could not duck. Are the local models actually good enough to use? Should I tell people to switch to them, or should I quietly warn them that the experience drops off a cliff once the cloud connection goes away?

The second was pricing. Google recently raised the price of Gemini 3.5 Flash, the newest model in the Flash family, to nearly the level of Gemini Pro (the full pricing table tells the story). For almost a year I had been recommending Gemini 2.5 Flash as the default model for Gemini Scribe, and the obvious upgrade path (move up to 3.5 Flash with the next release) suddenly looked expensive. The alternative was to switch families entirely and make the newest Flash Lite model the default, but only if it was actually capable enough to drive the agent on real work.

Both questions had the same shape. “Is model X good enough to be the default for Gemini Scribe?” Before building anything, I went looking for an existing benchmark to adopt. I commissioned two separate deep-research passes specifically to find one I could lift wholesale. Both came back with the same answer.

The public eval suites measure code generation (HumanEval, SWE-bench), general assistant tool use over the web (GAIA), and customer-service-style tool flows (τ-bench). None of them measure what I actually care about, which is an agent operating inside a markdown wiki. Opening notes by name. Following wikilinks across files. Editing frontmatter without nuking sibling notes. Aggregating across many notes and refusing prompt-injection bait sitting in a note body. If a benchmark for this exists, neither I nor two passes of automated research could find it.

So I had to build it.

Why Unit Tests Do Not Work

The instinct, if you have spent any time writing software, is to reach for unit tests. The agent took an input, it produced an output, check the output. Pass or fail. Run on every commit. We have been doing this for decades.

I am not arguing against unit tests in the abstract. The Gemini Scribe repo has nearly three thousand of them, and I just finished a multi-week push to get line coverage above ninety percent. They are the foundation that lets me move quickly on everything below the agent loop: parsers, settings migration, frontmatter handling, the diff view, the provider adapters, the tool definitions. Without that scaffold I would be afraid to refactor anything, and most of the bugs that would otherwise reach the agent never get the chance.

The other thing I had been leaning on was daily use. I run Gemini Scribe in my own vault every day, on real work, which catches the egregious failures fast. The agent crashes, the agent produces obvious garbage, the agent loops; I notice within a session. What dogfooding does not catch is the distribution. Did this change make the agent worse at one task in twenty in a way I will never directly observe because I do not run that task on a typical Tuesday? My sample size is one, and I had been quietly grading my own work for months.

So the instinct is wrong for the agent loop itself, and the reason is the same one that makes agents interesting in the first place. They do not do the same thing twice. Ask the agent to find a file by name and on one run it will call find_files_by_name once, return the answer in a single turn, and cost you a fraction of a cent. On the next run, against the same prompt, the same vault, the same model, it might call search_content first, then find_files_by_name, then re-search with a slightly different query. Same answer. Twice the cost. Three times the latency. Both runs “pass” a unit test. Both runs are real.

The problem is not that the agent is broken. The problem is that “did it work” is the wrong question. The right question is “how reliably does it work, on what kinds of problems, and at what cost?”

That question cannot be answered by a single run. So the scoreboard has to be built around the inconvenient truth that you have to run everything more than once.

Borrowing pass^k From τ-bench

I did not invent the trick that makes this tractable. I borrowed it from the τ-bench paper linked above, which proposed a metric called pass^k. A task passes at k only if all k runs pass. Not the average. Not the best. All of them.

The math is brutal in a useful way. A model that solves a task 80% of the time on a single run will hit pass^5 of about 33% on that same task. The metric punishes flakiness, which matters in the real world because users do not care about your average run. They care about whether the agent will do the thing they asked for the one time they asked. pass^k is what reliability looks like as a number.

For my harness, I picked k=5 for anything I planned to publish or block a merge on, k=3 for day-to-day development. Every task runs the full count, every time. The summary breaks out pass^k (no harness errors, no timeouts), solve^k (passed and satisfied the full task rubric), and a mean rate for the curious. Tasks that land between 0 and k solves get flagged as flaky in the output, with a little warning sigil. The flaky list is where bugs live.

Scoring What the Agent Actually Did

The harder problem, the one I spent most of the week on, was figuring out what “satisfied the full task rubric” should mean.

The naive version is to grep the final response for the right answer. That works for a few tasks. It fails the moment the task is anything other than “say a specific phrase.” Ask the agent to delete a file and “I deleted the file” is not evidence that the file is gone. Ask it to edit a note and “Done!” tells you literally nothing about whether the edit was correct, or even whether the right note got touched.

The τ-bench lesson, and the one that took me a while to actually believe, is that you have to compare end state against the goal, not tool-call syntax against an expectation. So my task definitions ended up carrying two kinds of checks. Output matchers score the text the model produced. Vault assertions score the side effects. Did the file exist, did it contain the expected content, did the frontmatter end up with the right value, did the unrelated sibling files stay untouched.

Here is what one of those tasks looks like:

{
  "id": "archive-old-notes",
  "difficulty": "T3",
  "userMessage": "Archive every note in eval-scratch tagged #old.",
  "expectedTools": ["find_tagged_notes", "edit_file"],
  "vaultAssertions": [
    { "type": "frontmatterEquals", "path": "eval-scratch/note-a.md",
      "key": "status", "value": "archived" },
    { "type": "fileUnchanged", "path": "eval-scratch/note-c.md",
      "fixture": "note-c.md" }
  ],
  "toolCallBudget": 6
}

The frontmatterEquals assertion confirms the right notes got archived. The fileUnchanged assertion confirms the agent did not go wandering through sibling files it had no business touching. The toolCallBudget makes efficiency itself a pass criterion, which catches the “I will just read every file in the vault” behavior that a single content search would have answered. Saying the right words is not enough. Doing the right thing is not enough. You also have to do it without burning the kitchen down on your way out.

The Judge Problem

A subset of my tasks are prose-heavy. “Summarize the differences between these three meeting notes” does not have a single correct surface form. The agent might write “the second note disagrees on the deadline” or “note two pushes back on the timing.” Both are right. Neither matches a literal substring assertion without me writing a regex more complicated than the task itself.

For those, I use an LLM-as-judge. A separate Gemini model called with temperature: 0 and a strict YES/NO contract against a rubric I write per task. This works, until you start asking whether the judge itself is any good.

I did not trust the answer for a while, and rightly so. So I built a calibration tool. The harness can extract every judge matcher decision from a full sweep into a flat file of tuples (criterion, agent response, automated verdict). I then sat down with a cup of coffee and hand-labelled ninety of them as YES or NO myself, blind to what the judge had said. That gave me a gold set, a one-time human-labelled reference I can measure any candidate judge against.

When I ran four candidate judge models against that set, the results were uncomfortable. The judge I had been using agreed with my human labels 92.2% of the time. The newest Flash, gemini-3.5-flash, hit 94.4%, with fewer false negatives on cosmetic formatting and one fabrication case that the smaller gemini-3.1-flash-lite missed. I switched judges.

But the more important finding was about the judges themselves. Even at temperature: 0, two fresh runs of the same judge against the same gold set produced the same accuracy number with a different set of disagreeing tuples. The pass/fail flips around. Judge nondeterminism is real. Single-run judge measurements are not to be trusted.

The other thing the calibration exercise gave me, which I did not expect, was a debugging tool. Forcing myself to read every criterion and every response carefully turned up two latent bugs I had been staring through for months. One task had a judge criterion demanding response-side coverage that the prompt never asked for. Three other tasks had fileMatches regexes silently failing because they used JavaScript-incompatible inline flags. The eval harness was not just measuring the agent. It was measuring my evaluation of the agent, and finding it wanting.

What the Scoreboard Said

With the harness real, I ran a sweep across three models on a 54-task suite, at k=5, under the calibrated judge. The headline numbers, which now live on the plugin’s docs site and auto-update on every newly blessed baseline:

The newer gemini-3.1-flash-lite solves 74.1% of tasks at solve^5. The older gemini-2.5-flash, supposedly a tier up, solves 57.4%. The local gemma4:e4b running on my own hardware solves 14.8%. A single full sweep costs about thirty cents per model in steady state.

That per-sweep number is the honest one for ongoing measurement, but I should be clear about what the build phase actually cost. Between the judge-calibration runs, the four candidate-judge measurements against my gold set, the three full re-baselines, and the iteration passes that came with all of it, yesterday alone ran me $8.12 across my Gemini Scribe API key and the dedicated judge key. That is the number to plan around if you are building your own. The thirty cents is what it costs once the scoreboard exists and you are just checking whether your latest change moved the needle.

And those are just the API numbers. The real investment was a week of my time, which is the cost you should weigh hardest. It pays back the moment you want to evaluate any change to the agent loop with confidence instead of vibes, which from here is every release I cut.

That first result answered the pricing question for me cleanly. Within a model family, the tier names mean what they say. Pro is more capable than Flash, Flash is more capable than Flash Lite, and you pay accordingly. The interesting thing is what happens across families and releases. The price-to-capability frontier moves fast enough that the newest model in a cheaper family can dominate an older default from a pricier one. That is what happened here. Gemini 3.1 Flash Lite, the newest Flash Lite, beats Gemini 2.5 Flash by about seventeen percentage points on solve^5 on agentic tasks (multi-step tool use, retrieval, edit-then-verify), and costs less per token than the Gemini 2.5 Flash it replaces. The next release of Gemini Scribe will move the default model from Gemini 2.5 Flash to Gemini 3.1 Flash Lite, which means users get a quality upgrade and a cost cut at the same time. Without the scoreboard I would have stayed loyal to a tier name and spent another six months recommending the more expensive, less capable model.

The Ollama numbers were harder to swallow but just as useful. The local Gemma model is genuinely good at the easy T1 tier (a single tool call against a tiny corpus), hitting 100%, and then it collapses. It drops to about 15% on T2 (two or three tool calls with light distractors), 7% on T3 (multi-step, distractor-heavy), and 11% on T4 (frontier-class hop chains and cross-note aggregation). Flash Lite stays above 65% on every tier. The honest version of the local-model story is that today’s open weights running on a laptop will handle simple lookups (find this file, summarize this note) cheerfully, and will fall over on anything that requires chaining tools or holding a multi-step plan together. That is useful to know. It tells me what to recommend (try local for casual queries, stay on cloud for real work) and it gives me a concrete target to retest against when the next generation of open models lands.

The difficulty breakdown is what makes this kind of comparison possible. A suite where every model passes everything, or where no model passes anything, is not measuring anything useful. The whole point is the gradient. T1 is a regression canary that any model worth running has to clear. T2 through T4 is where open models and frontier models actually separate, and where the suite earns its keep.

The Benchmark Is Open

The harness, the 54-task suite, the judge calibration set, and the methodology docs all live in the obsidian-gemini/evals directory. The README walks through adding a new task in about five minutes, and the existing tasks are organized by category (retrieval, multi-hop, aggregation, conflict, write, edit, negative-space, safety, memory) so a new contribution has a fixture pattern to clone from.

If you are working with agents inside Obsidian or any other markdown wiki, I would love contributions. Especially tasks that exercise corners of the agent I have not thought of. Weird vault layouts. Exotic frontmatter conventions. Prompt-injection payloads you have actually seen in the wild. Multi-step plans that catch the model out. A benchmark is a public good, and it only gets sharper the more people sharpen it. Open an issue or a PR and let’s make this the thing that did not exist when I went looking for it.

What I Would Tell You If You Were Starting

If you are building an agent and you have been operating on vibes, here is the short version of what I would tell you over coffee.

Start with pass^k, not single-run pass rates. The reliability framing is the one that survives contact with production. Run each task at least three times for development, at least five for any decision you are going to publish or block a merge on.

Score the side effects, not the words. The model can say it did the right thing while doing nothing of the sort. State-based assertions on what actually changed in the world are the only honest scoring you can do for tasks that mutate anything.

Make efficiency a pass criterion. A tool-call budget is a one-line addition to a task definition and it catches an entire category of “the agent technically solved it” results that are not actually wins.

If you are using an LLM as judge, calibrate it against human labels at least once, and remember that judge nondeterminism is a real source of measurement noise even at temperature zero.

Treat the scoreboard itself as a debugging tool. The discipline of writing down what “good” looks like, in machine-readable form, surfaces problems with your tasks, your criteria, and your assumptions that no amount of squinting at session transcripts will. The eval harness paid for itself the first time it told me my judge was asking the wrong question, before it ever told me anything useful about the agent.

The vibes were never going to scale. The scoreboard does. The strangest thing about building it has been realizing how much of what I thought I knew about my own agent was wrong, in small but consistent ways, in the direction of being too generous. That is not a moral failing. It is what happens when the system you are measuring does not sit still. You need an instrument. So I built one. Next time someone asks me how much better my change made the agent, I have a number.

A hand-drawn map on a workbench with a half-built mechanical instrument being assembled directly on top of it.

Agents as Building Blocks

There’s a thread running through the last year of my writing and my work, and I didn’t fully see it until now.

Last September, I wrote Full Circle, about going back to building after years of leading teams. I wanted to be in the driver’s seat for what I called the agentic shift. I wanted to feel the code under my fingers again, to be close enough to the technology that I could form my own opinions about where it was going.

Then I spent six months drawing the map. The Agentic Shift was twelve essays on what agents are, how they work, and what it means to build them well: anatomy, memory, tools, guardrails, multi-agent coordination, production readiness. It was a theoretical framework, written while I was getting my hands dirty on the Gemini CLI team.

And then, in January, I wrote Everything Becomes an Agent, the practitioner’s version. Not theory anymore. I’d watched Gemini Scribe grow from a chat window into a full agent. I’d seen the CLI team go from talking about code to writing and executing it. I’d noticed a pattern repeating across every AI project I touched: given enough time, they all converged on the same architecture. Tools. Loops. Policies. Judgment.

The Antigravity SDK is the second agent product I’ve worked on at Google. Gemini CLI was the first, and it’s where I learned what an agent runtime actually needs: a policy engine, a tool pipeline, lifecycle hooks, a trust model that scales from “let me approve every file write” to “here are the guardrails, go handle it.” The SDK is the next step. Taking everything I learned building one agent and making it possible for everyone to build their own.

Today we’re launching the Antigravity SDK in Preview. The official announcement covers the features (what the SDK does, how to install it, what you can build). This post is about the why. Why this SDK, why this design, and why it matters to me.

What Is an Agent SDK, Really

Here’s something I find fascinating: people have wildly different ideas about what “agent SDK” means.

For some, it’s a way to automate the coding agent. You take the AI that already lives inside your IDE (Antigravity, Cursor, Copilot), and you script it. Pipe in a task, get back a diff. The SDK is an extension of your development environment. That’s a legitimate philosophy, and there are good products built on it.

But that’s not what I wanted to build.

To me, an agent SDK gives you an agent that you can incorporate into your software. Not an extension of your IDE. A building block. Something you import into your Python project the same way you’d import a database client or an HTTP library, and then you use it to solve a problem. The agent is a component in your system, not a wrapper around your workflow.

I’ve watched this pattern play out across Gemini Scribe, the Podcast RAG prototype, and a dozen smaller projects. Software that starts as a script, grows a tools array and a while loop, and eventually looks an awful lot like an agent. I wouldn’t claim that every AI project becomes an agent. But the pattern is durable for a huge class of software problems. And if that convergence is real, if a meaningful number of AI applications end up needing tools, memory, judgment, and guardrails, then the SDK should make that convergence frictionless.

The key distinction is this: the agents you build with the Antigravity SDK aren’t extensions of your developer tools, although they can do development work. They’re independent pieces of software that happen to be implemented as agents. They live in your codebase, run on their own, and do real work.

Let me show you what I mean.

Three Agents That Prove the Point

Two of my favorite examples ship with the SDK, and we use both of them on the SDK project itself on a regular basis. They live in the examples directory on GitHub.

The first is the docstring maintenance agent. You point it at a directory, and it audits every Python file for missing or incomplete docstrings, then fixes them, all following the Google Python Style Guide. It knows which tools it’s allowed to use (read files, list directories, edit .py files in the target directory, and nothing else). It has a policy engine that enforces those boundaries. It runs, does its job, and exits.

The second is the documentation maintenance agent. Same idea, different problem: it scans your project’s documentation for staleness, checks it against the current state of the code, and updates what needs updating.

Here’s what I love about these two examples. They’re coding-related tasks, but they aren’t extensions of my IDE. They’re standalone programs. I don’t run them inside my editor. I run them from the command line, or from a CI job, or from a cron schedule. They happen to be implemented as agents because an agent is the right abstraction for “read a bunch of files, reason about their quality, and make targeted edits.” If I’d built these as scripts, I would have ended up writing a brittle classifier full of if/else branches to decide what to fix and how. The agent architecture deletes that complexity.

We use both of these on the SDK project itself. The SDK maintains its own documentation with its own agents. There’s a satisfying recursion to that.

But I want to push the point further, because the SDK isn’t just for coding tasks. Here’s a completely different kind of agent, a personal knowledge graph I wrote that connects to my Workspace MCP server and answers questions about my Drive, Docs, Gmail, and Calendar:

import asyncio

from google.antigravity import Agent, LocalAgentConfig, types
from google.antigravity.utils import interactive


async def main():
    workspace_mcp = types.McpStdioServer(
        command="node",
        args=["/Users/adh/src/workspace/workspace-server/dist/index.js"],
    )
    system_instructions = (
        "You are a Personal Knowledge Graph Agent. Your goal is to help the user "
        "navigate and synthesize information from their Google Workspace "
        "(Drive, Docs, Gmail, Calendar). You can search for documents, "
        "read emails, and check calendar events to answer questions "
        "and help the user connect the dots."
    )
    config = LocalAgentConfig(
        system_instructions=system_instructions,
        mcp_servers=[workspace_mcp],
        capabilities=types.CapabilitiesConfig(
            enabled_tools=types.BuiltinTools.read_only(),
        ),
    )
    async with Agent(config) as agent:
        print("Knowledge Graph Agent ready. Ask me anything about your Workspace.")
        await interactive.run_interactive_loop(agent)


if __name__ == "__main__":
    asyncio.run(main())

This agent has nothing to do with coding. It’s a personal productivity tool that connects to my Google Workspace via MCP and lets me query my own data in natural language. It’s about 20 lines. It’s read-only by design. And it uses the same SDK, the same patterns, the same trust model as the docstring agent.

Three examples, three completely different domains: autonomous code maintenance, documentation upkeep, personal knowledge synthesis. All built with the same building blocks. That’s the vision.

Batteries Included, Layers When You Need Them

When designing this SDK, I kept coming back to one principle: batteries included. I wanted it to be really easy to put together an agent that worked for you. Easy to grow your application when you needed more sophistication. Easy to dive into the internals when the situation required it.

Here’s what a functional agent looks like:

import asyncio

from google.antigravity import Agent, LocalAgentConfig


async def main():
    config = LocalAgentConfig()
    async with Agent(config) as agent:
        response = await agent.chat("What files are in the current directory?")
        print(await response.text())


if __name__ == "__main__":
    asyncio.run(main())

That’s it. About 10 lines of real code. That agent can read files, edit code, run shell commands, search directories, all out of the box. You didn’t have to configure tools, set up a model connection, or wire up a conversation loop. The batteries are included.

But batteries included doesn’t mean batteries only. I designed the API in three layers, and knowing which layer to reach for is part of the design.

Layer 1: Agent. The highest level. Create an agent, give it a prompt, get results. This is where most people start, and many people stay. It manages the full lifecycle (connection, conversation, tools, hooks, policies) in a single async with block. If you just need an agent that does a job, this is your entire API surface.

Layer 2: Conversation. This is the implementation layer. Conversations, hooks, policies, MCP servers, custom tools, structured output. Conversation wraps a Connection with step history, turn tracking, and convenience methods. This is where you shape behavior. You add guardrails through the declarative policy engine. You inject lifecycle hooks, and the SDK gives you three distinct types: Inspect hooks for read-only observability, Decide hooks for policy decisions (allow/deny), and Transform hooks that can modify data in flight. You wire up MCP servers and your own Python functions as tools.

Layer 3: Connection. The lowest level. Connection is the abstract interface for talking to an agent backend. ConnectionStrategy knows how to establish one for a specific runtime. Today, we ship a local connection strategy that runs the agent on your machine. On the roadmap: remote connection strategies that let the same agent code deploy to the cloud without a rewrite.

Here’s the neat thing about this layer. Because Connection is an abstraction, you could conceivably wire up other agent runtimes behind it. We do this internally. We have several different ways of talking to our agent harness, and they all work through the same Connection interface. Your agent code doesn’t know or care which one is running underneath.

The philosophy is: easy to start, easy to grow, easy to go deep. You shouldn’t need to understand the Connection layer to write your first agent. But when you need it, when you’re building something that requires custom streaming, session resumption, or a novel deployment target, it’s there, and it’s a clean abstraction, not a hack.

One detail I’m particularly proud of: the trust model adapts to the deployment context. The base AgentConfig is deny-by-default. It defaults to read-only tools, and if you try to enable write tools or MCP servers without a safety policy, the Agent refuses to start. Enforced at the framework level. LocalAgentConfig takes a different posture. Since it runs on your own machine, it enables every tool, scopes file operations to the workspaces you’ve configured, and gates shell commands behind a user confirmation prompt by default. You’re developing locally; you probably want your agent to actually do things, but you also probably want a chance to look before it runs rm -rf. The trust gradient is baked into the architecture.

Lessons Encoded

If you’ve been following along with my writing, the SDK might feel familiar. That’s intentional.

The twelve-part Agentic Shift wasn’t just an intellectual exercise. It was the blueprint. Every essay mapped a concept that eventually became a feature.

In Everything Becomes an Agent, I wrote: “If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent.” The SDK takes that literally. You don’t build classifiers, you define tools and let the model decide which ones to use. The complexity moves from branching logic to capability definition.

I wrote about building a “sudoers file for AI”, a permission system for agents. That became the policy engine. policy.allow("view_file"). policy.deny("*"). Declarative, composable, deny-by-default. You express what’s allowed, and the framework enforces it.

I wrote: “The real complexity isn’t in the code; it’s in the trust.” That conviction shaped the hook system. Hooks give you visibility into every tool call, before and after. Policies give you control. Together, they manage the trust relationship between you and the agent. The SDK doesn’t ask you to trust blindly; it gives you the instruments to verify.

And I wrote: “A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.” That’s the promise. The SDK is the handle.

These aren’t abstract design principles that I reverse-engineered to sound good in a blog post. They’re lessons learned from building Gemini Scribe, from contributing to Gemini CLI, from watching every project I touched converge on the same agentic patterns. I drew the map, I lived the map, and then I got to build the territory.

The Team

I want to be clear about something. I didn’t build this alone.

I did most of the design for the Python SDK (the API surface, the three-layer architecture, the philosophy behind “batteries included”), and a lot of that design came from the writing I’ve been doing this past year. But design is the easy part. The hard part is building something real, and that was a team effort.

A talented group of engineers worked with me on this. On the SDK implementation, on the test infrastructure, on the Go harness underneath that actually runs the agent, on the internal connection strategies, on the MCP bridge, on a hundred decisions that don’t show up in a blog post but absolutely show up in the quality of the software. The SDK exists because of their work, and it’s better than anything I could have built on my own.

Preview, and an Invitation

We’re shipping this as a Preview. Not “1.0.” That’s deliberate.

The API surface will change. We know that. We’ll evolve it based on feedback from you and from our own continued use of the SDK, because we use it too, every day, on the project itself. There are things we haven’t figured out yet. There are patterns we haven’t discovered. That’s the point of a preview: to learn in the open.

So here’s the invitation: build something. Build a documentation bot, a knowledge graph, a CI pipeline agent, a personal assistant. Build something I haven’t imagined. Break something. Tell us what’s missing, what’s awkward, what delights you. File an issue. Open a PR. Argue with us about the API.

Last September, I wrote that I was going back to building because “for a builder, there’s no more exciting place to be.” The Agentic Shift was the map. The SDK is the territory.

Come explore it.

The Antigravity SDK is available now as a Preview. Install it with pip install google-antigravity, read the official announcement for feature details, and find the source on GitHub.

A futuristic clockwork mechanism with glowing nodes, representing community collaboration, automated tasks, and precise measurement.

Automation and Measurement: Inside Gemini Scribe 4.8.0

I recently wrapped up the development cycle for Gemini Scribe 4.8.0. Looking back at the ~99 pull requests merged over the last month, the sheer volume of changes is significant. Not only are we shipping major features, but I’m also seeing a steady uptick in contributions from collaborators, an increase in issues filed by the community, and much more activity in our discussion group. Beyond the changelog and community growth, two structural narratives define this release: automation and measurement.

As I discussed in the evolution of Gemini Scribe, the goal has always been to move beyond a simple chat interface. With 4.8.0, we are taking a massive step toward making the agent a true background worker in your vault.

Here is a look at the architecture, the code, and what this release means for the future of our agentic workflows.

The Push for Automation

For a long time, running a complex agent task meant staring at a blocking UI. If you asked the agent to perform deep research or generate an image, you waited.

To solve this, we introduced a unified background execution lane. The new BackgroundTaskManager allows tools like DeepResearchTool and GenerateImageTool to accept a background: true parameter. The agent submits the task, receives an ID immediately, and returns to its turn. You can monitor these tasks in the new Gemini Activity modal, which consolidates background tasks and RAG indexing status into one view.

But unblocking the UI was only half the battle. We wanted to lay the groundwork for an agent that operates in the background. While true autonomy is a spectrum, the first step is moving away from the chat box and into scheduled, asynchronous workflows.

The Scheduled Task Engine

The marquee feature of 4.8.0 is the full task scheduling system. You can now define a task as a markdown file, and the plugin will run it on a cadence as a headless agent session, writing the output back to the vault.

To make this work, we built a ScheduledTaskManager with a 60-second tick loop. Tasks are stored in [state-folder]/Scheduled-Tasks/ with a sidecar JSON file for state. The headless ScheduledTaskRunner mirrors the standard AgentViewTools but auto-approves all tool calls.

We also expanded the schedule grammar. Originally, daily meant “every 24 hours from creation,” which surprised users. Now, you can specify daily@HH:MM and weekly@HH:MM:DAYS, so you can finally tell the agent to run “every weekday at 4:30 PM.”

We also handle missed runs gracefully. On startup, any task with runIfMissed: true that missed its window surfaces in a CatchUpModal.

Right now, this is essentially a highly intelligent cron job. You are still explicitly telling the agent when to run. But this scheduling engine is the foundational infrastructure for what comes next. In the next release, we are introducing Obsidian lifecycle hooks. Instead of just running on a timer, the agent will be able to react to events, triggering workflows when you create a new file, save a note, or modify a project board. That is where we cross the threshold into true ambient AI.

How I Use This in Practice

To give you an idea of what this unlocks, I currently rely on a few specific scheduled workflows:

The Daily Setup: Every afternoon, a scheduled skill runs to prepare my vault for the following day. It looks up my calendar, creates my daily note if it doesn’t exist, and seeds it with my upcoming meetings. It goes a step further by creating individual meeting note entries and building out context notes for the people I’ll be meeting with. When I walk into the office the next morning, my daily note is already prepped and ready to go.

Automated Blog Drafts: I also use this to automate my content pipeline. I have a scheduled skill that monitors my Readwise syncs and automatically generates drafts for my “Reading List” blog posts. Instead of manually curating and formatting these, the agent handles the heavy lifting in the background, leaving me to just review and polish the draft.

If you are worried about the agent running amok in your vault while you aren’t looking, there are several ways to mitigate this. You can limit the tools the agent has access to. If you don’t want it overwriting files, you can simply restrict its write access. Additionally, the agent’s response from any scheduled task is always saved in the Scheduled-Tasks/Runs file, giving you a complete audit log of what the agent had to say during the session.

In my case, I’m automating skills that I’ve been running manually for a while now, and I run my agent in a mode where I let it write and edit files day-to-day. You should set up your tasks to match your own comfort level. You can read more about how to configure this in the Scheduled Tasks Documentation.

Extracting the Agent Loop

To support headless scheduled tasks, I had to refactor how the agent executes tools. Previously, the tool-execution loop was tightly coupled to the UI in AgentViewTools.

I extracted this logic into a UI-agnostic AgentLoop class. AgentViewTools shrank from 386 lines down to 187, becoming a thin adapter over AgentLoop with specific hooks (onToolBatchStart, onToolCallStart, etc.).

// Conceptual extraction of the AgentLoop
export class AgentLoop {
  constructor(private engine: ToolExecutionEngine) {}
  
  async execute(turn: AgentTurn) {
    // Iterative tool execution, removing the recursive stack-depth ceiling
    while (this.hasPendingToolCalls(turn)) {
       // Loop detection, batching, and execution logic lives here
    }
  }
}

This extraction immediately paid dividends, catching bugs that a duplicate headless runner had introduced, and eliminating a recursive stack-depth ceiling on deep tool chains. More importantly, it means scheduled tasks, evals, and the UI all share the exact same execution engine.

Local Models with Ollama and Gemma 4

First-class local-model support is here. By leveraging the ModelApi seam, chat, summarization, rewrite, and agent tool-calling all work against a local Ollama server. You can use any model from Ollama that supports tool calling, though I have personally only tested this extensively with Gemma 4.

In my local evaluation harness, Gemma 4 performed exceptionally well. It is incredibly capable, fast, and handles the agent loop with a level of reliability that makes local-only agentic workflows genuinely viable.

The way I use this right now is as an offline fallback: when I don’t have an internet connection, I switch to Gemma 4 and just keep working. Obviously, running offline means I don’t have access to online-dependent tools like Google Search, Deep Research, or Image Generation. But for synthesizing notes, organizing projects, or drafting content securely, it is incredibly powerful.

In the future, we will be refining the system to allow you to pick the model you want on a per-function basis. This means you’ll be able to route sensitive, local text processing to an offline model while still leveraging cloud models for heavy-lifting tasks like Deep Research or Image Generation when you are connected.

Moving from Guessing to Measuring

As the agent loop gets more complex (handling runaway loop aborts and budget constraints) we can no longer rely on “vibes” to know if a change improved the system.

To solve this, I built a new CLI-driven eval harness (npm run eval) that drives a live Obsidian instance. It captures turns, tool calls, token usage, cache ratios, and cost. Crucially, it measures reliability. By passing --repeat=N, the harness repeats each task to surface flakiness, reporting a pass^k metric. We can now test multi-hop retrieval and loop-trap cyclic references programmatically, ensuring the agent bails cleanly instead of spinning forever.

Right now, the focus for 4.8.0 was getting this infrastructure in place and establishing the beginnings of our eval set. Having the harness is the first step; the next step is building out a robust suite of test cases that reflect real-world vault interactions.

I would love to see contributions from the community for the evals themselves! If you have complex agentic workflows or edge cases you want to ensure remain stable, please submit them. In the next release, we will start publishing the actual eval results and benchmarks directly in the repo so we can transparently track the agent’s performance over time.

What’s Next?

What does this implementation tell us about the future of software engineering and personal knowledge management?

We are seeing a clear shift toward ambient AI. The chat interface is a great starting point, but the true value of an agentic system is its ability to operate asynchronously. While the scheduling engine in 4.8.0 acts as a highly capable cron job, it lays the groundwork for the event-driven lifecycle hooks coming in the next release.

By combining the AgentLoop extraction with asynchronous execution, Gemini Scribe is no longer just a tool you use; it is becoming a system that reacts and works alongside you. When you can rely on a background orchestrator to run your housekeeping routines (like updating changelogs or triaging issues) while you eat dinner, the vault becomes a living, breathing entity. The agent becomes a true extension of your workflow, utilizing the built-in skills we’ve developed entirely in the background.

Gemini Scribe 4.8.0 is a massive architectural leap forward. The code is cleaner, the tests are faster (thanks to a Vitest migration), and the agent is more autonomous than ever.

If you want to dive into the specifics or try out the new scheduling grammar, check out the updated documentation on scheduled tasks.

Let me know what automated tasks you end up building. I’m already finding new ways to let the agent do the heavy lifting while I focus on the work that matters.

A Starlink Mini antenna suction-mounted to the interior glass roof of a Tesla Model X, face pointing up through the glass at a twilight sky, with a Big Sur coastal vista visible through the windows.

Starlink Mini Field Review

I almost didn’t bring it.

We were packing for a week-long road trip, the kind where you’re loading a car to the roof and every extra item has to justify its existence against the scrutiny of a finite trunk. The Starlink Mini was sitting on my desk, and for a moment I thought: it’s a vacation. Just use your phone. Let it go.

Then I remembered Death Valley.

A few years ago, we drove out to Death Valley for a long weekend. Somewhere past the park entrance the cell signal dropped and didn’t come back for several hours. No maps. No music. No way to look up whether the hotel had our reservation or whether the road ahead was clear. It wasn’t a crisis. We found our way, the trip was great. But it lodged in my brain as one of those small, avoidable frustrations that you file away and think about later. I started thinking about alternatives.

Fast-forward to earlier this year: I was looking to replace an old Verizon 5G hotspot that I’d been using as a backup internet connection. The Starlink Mini caught my attention for a pretty simple reason. The mobile plan is $50 per month for 100GB, and critically, you can pause it for $5 per month when you’re not using it. For something that might sit idle most of the year but be indispensable when you need it, that pricing model changes the math entirely. A hardware incentive made the upfront cost easier to swallow. I bought it as an emergency backup, not as a travel device.

The road trip reframed it.

We were heading down Highway 1 from San Jose to Santa Barbara, with a few stops along the way. Anyone who’s driven that stretch knows the Big Sur section the way sailors know certain straits: beautiful, unforgiving, and reliably hostile to cell service. I drive it often enough that the failure pattern is familiar: the car’s maps stop rendering new tiles, the streaming music starts cutting in and out, whatever question the kid in the back seat just asked becomes a question for later. Inconveniences, not crises, but they happen every single time. Both my 5G phones and the car’s own LTE connection consistently lose the signal in the same places. Standing in my driveway with the car half-packed, I thought: this is exactly what the thing is for. I threw it in the back, along with a suction cup mount I’d picked up — a 2-in-1 case and mount combo designed specifically for the Mini.

The mount was an experiment in itself, and it’s worth pausing on for a second. The Starlink Mini is designed to be mounted outside the vehicle. That’s what Starlink recommends, that’s what the hardware is built for, and that’s what every piece of documentation assumes. I stuck it to the inside of the glass roof of my Model X and let the antenna hang below it, face pointing straight up through the glass, directly over the rear passenger seat. It just fit. What I wasn’t sure about was whether any of this would actually work, because I was running the device in a configuration it wasn’t designed for. Would a satellite signal come through automotive glass well enough to matter? Would the antenna need to be precisely pointed, or would straight-up-ish be good enough? Would I end up pulling it out at every stop and doing the whole orientation dance I’d been expecting? Those felt like real unknowns. But it was a road trip, not a lab test, and sometimes you just ship the thing and see what happens.

If your vehicle doesn’t have a large flat glass roof to exploit, the outdoor-mount path is the one to take. You can attach the Mini to a roof rack or crossbars without any permanent modifications to the car, and you’ll almost certainly get better performance than I did running it through glass. The interior trick I used is a happy accident of driving a Model X, not a recommendation for everyone.

I brought it anyway. And by the end of the first day on the road, I was deeply glad I did.

This isn’t a post about working remotely. We took this trip for the right reasons: my son is finishing his sophomore year of high school and starting to think seriously about college, so we used the week to visit Cal Poly SLO and UC Santa Barbara. The kind of trip that reminds you why you work as hard as you do. But connectivity isn’t just about work. It’s about maps that don’t freeze, music that doesn’t stutter, the ability to pull up directions to the next stop without pulling over first. Most of the value of a good connection on a road trip is in the miles between places, not in the places themselves.

So let me tell you what actually happened.

Packing the Antenna

The Starlink Mini’s whole pitch is portability, and it delivers on that in a way the full-sized antenna simply can’t. The unit itself is roughly the size of a large laptop — thinner than a pizza box, lighter than a bag of dog food. It goes flat in a bag, doesn’t demand a dedicated case, and doesn’t feel like you’ve brought a piece of infrastructure on vacation. You’ve just brought a gadget.

For power, I ran it off a 12V adapter wired into the car. This is, I think, the right way to do it for road travel. Watching the draw in the Starlink app through the week, I almost never saw it exceed 20W, and most of the time it was lower, averaging right around 20W. A standard car outlet handles that without complaint. I didn’t need a power station, an inverter, or any custom rigging. You plug it in. It works.

Here’s the part that actually surprised me: there was no setup. Not “fast setup.” No setup. The antenna stayed exactly where I’d stuck it at the start of the trip, and I left it there for the whole week. Every stop, every hotel, every stretch of highway: it was already in place, already connected, already doing its job the moment I plugged in the 12V cable. I had gone into this expecting to be that guy in the parking lot pulling the antenna out, orienting it, consulting the sky map app, worrying about elevation angles. Instead I just drove.

What It Actually Did Out There

The strangest thing about this whole trip, and the thing I keep thinking about, is the moment that didn’t happen.

I was ready for Big Sur. I know that stretch of Highway 1, I know where the dead zones are, and I had a working theory that this would be the dramatic reveal: we’d hit the no-signal gap, I’d point at the Mini, and we’d be the family who still had maps and music when no one else did. That story never materialized. Not because it didn’t work, but because I never noticed.

The car was on the Starlink Wi-Fi the whole time. Maps kept routing. Music kept playing. Everyone’s phone stayed connected through the in-car network. Somewhere along that drive we passed through miles of cell dead zones without a single hiccup in anything we were doing, and the only reason I know that is because I thought to check my phone later and saw the zero-bar gaps that should have caused Death Valley-style pain. They just weren’t there. The infrastructure had been quietly carrying us the entire time.

That’s the moment I keep coming back to. The Death Valley experience was jarring because the loss was obvious and immediate. No maps, no music, a sudden reminder that your conveniences live on infrastructure you don’t own. The Starlink Mini didn’t fix that problem by giving me a workaround to pull out in an emergency. It fixed it by making the problem invisible. I can’t think of a better test for a piece of infrastructure than whether you stop noticing it. We did lose the connection once, in a tunnel, and everything came right back when we cleared the other side. The failure mode was the same as losing cell signal: obvious, brief, and boring.

The vista stops were the next best thing. Every time we pulled off at an overlook, I’d set the car to keep accessory power running so the Mini stayed up, which meant we stepped out into a place with zero cell service and our own private Wi-Fi hotspot parked a few feet away. This turned out to be where some of the best moments of the trip happened. Somebody would spot a rock formation and wonder what it was called. A few birds would wheel overhead and we’d want to know what species they were. Under normal travel conditions, those questions would just evaporate — filed away as “look it up later,” which almost always means never. With the antenna overhead, we could just ask. Curiosity became frictionless. And because we were stopped in the middle of nowhere with no other digital pull on our attention, those answers actually turned into conversations instead of the usual phone-zombie drift.

The moment that most caught me off guard, though, was a lunch stop. We parked the car at a spot with genuinely terrible cell service, all our luggage and gear still inside, and walked away to eat. I left the Mini powered up through the car’s accessory power, which meant the car itself stayed on the internet while we weren’t in it. Tesla’s Sentry mode uses the car’s connection to stream alerts and camera feeds — normally tied to whatever cell signal the car can grab on its own, which in that spot was nothing. But the car was happily connected to the Starlink Wi-Fi, so Sentry just kept working. I could check on the car from my phone during lunch and actually see what was happening around it. That was a use case I hadn’t planned for at all. It was the first time in the trip that the Mini stopped feeling like a travel gadget and started feeling like persistent infrastructure for the vehicle itself — a comms link that existed independent of whether any human was sitting inside the car. Peace of mind, delivered by an antenna looking straight up through a glass roof.

It’s worth being clear about where the Mini did and didn’t earn its keep on this trip. The hotels were all fine. Cell service and hotel Wi-Fi were perfectly adequate at every place we stayed, and I never once fired up the Mini as a destination device. The entire value proposition showed up in the driving, the vistas, and the parked-car moments in between — the stretches and stops where cell coverage got thin and where, historically, I would have just quietly lost the ability to navigate, stream, ask questions, or keep an eye on my car. The Mini made those gaps disappear so completely that I forgot they were even there.

What It Doesn’t Do

In the interest of intellectual honesty: it’s not magic.

Tree cover is the enemy. Dense canopy will interrupt the connection or degrade it significantly, and the obstruction map in the Starlink app is reasonably good at predicting this. The through-the-glass approach worked well on this trip because we were mostly on highways and in parking lots with open sky. If we’d been parked under heavy forest canopy I’d have had to think harder about placement, and an antenna mounted inside the car wouldn’t have been the right answer.

Weather, surprisingly, was not on this list. We drove through several stretches of rain and a lot of overcast, and the connection held through all of it without any obvious degradation. I went into the trip half-expecting to see throughput sag under bad weather and it just didn’t happen. That’s not a universal claim (you can imagine worse conditions), but for what a California spring throws at you, it’s a non-issue.

Satellite latency is also real, though I didn’t notice it much. For anything where you feel the round-trip time (voice calls, some gaming), it’s not the same as fiber. For everything else — browsing, streaming, music, mapping — it held up fine.

An Engineer on Vacation

Here’s the thing I kept thinking about on this trip, the thought that felt worth writing down.

I’ve spent a lot of the last few years building out my homelab into something I’m genuinely proud of. Rack-mounted servers, local AI models, a network I understand end-to-end. That infrastructure has a fixed address. It’s built for depth, not mobility.

What the Starlink Mini represents is a different layer of the stack — the part that moves. The connectivity substrate that you can carry with you without sacrifice. And what surprised me, genuinely, is how much that changes the feel of being on the road. I wasn’t tethered to the patchy mercy of cell towers between towns. I had my own infrastructure, and it came with me.

For most of human history, “infrastructure” meant something you built in a place and then stayed near. Railroads, power grids, phone lines. The thing that’s happening now, slowly and then all at once, is that infrastructure is becoming portable. The Starlink Mini isn’t the endpoint of that story, but it’s a clear data point in it.

I’m not arguing that you should work on every vacation. I didn’t. But I am arguing that having reliable connectivity transforms the texture of a trip in ways that have nothing to do with work. It means you can find the good restaurant instead of defaulting to what’s nearest. It means your kids can call their friends. It means you get the weather report that saves you an hour of driving into rain. Small things. Real things.

Would I Recommend It

This is the part where I’m supposed to tell you whether to buy it, and the honest answer depends almost entirely on what else you’re using it for. Let me actually walk through the math, because I think the pricing model is the thing that makes this post worth writing at all.

The Mini’s mobile plan is $50 per month for 100GB when active, but you can pause it for $5 per month when you’re not using it. So if you buy the hardware and leave the service paused most of the year, your standing cost is $60 annually. Each month you flip it on for a trip, you add $50 on top. For someone who only takes one connectivity-hostile trip per year, that means $110 all-in for a year with one active month.

If that’s your situation — one trip a year, I honestly don’t think you should buy this. The math doesn’t work and the rest of your use cases probably don’t justify the hardware. A better phone plan or a cellular hotspot will serve you better for less money.

Where it starts making sense is when you have a standing use case at home that earns back the $60 standby cost on its own. For me, that’s backup internet during the storms that reliably knock out my home connection every year. I’d already decided I wanted this for the house before I ever thought about taking it on a trip — the $5-a-month pause price justifies itself on the strength of the backup case alone, and everything else is gravy. For someone living out of an RV or a van full-time, the case is even easier: the Mini just becomes infrastructure, always on, always moving with you. You’d never pause it.

Once the at-home or full-time use case carries the standby cost, the travel capability becomes a structural bonus. I paid $50 to flip the service on for the week of this trip, and knowing what I know about Highway 1 — the failure pattern I’ve hit on every previous drive down, that was easy to justify. It’s $50 to make a known, recurring inconvenience quietly disappear on a trip where I was already spending ten times that on charging, lodging, and meals. It didn’t feel like a hard call.

So the short version of the recommendation is: don’t buy this as a travel device. Buy it for a standing use case you already have, and let the travel capability be the bonus you discover later. That’s the frame that made my decision easy, and I think it’s the frame that will hold up for most readers whose situations look anything like mine.

One more thing worth flagging before I wrap up: the 12V power setup worked flawlessly the entire week. I expected to be solving a power problem at some point and never had to. That’s the detail that surprised me most, actually: not the satellite performance, but how completely boring and reliable the day-to-day operation was.

I’ll keep it in the travel kit. The homelab stays home. But now part of it gets to come along.

GitHub issues transforming into glowing skill cards floating above a laptop screen.

Bundled Skills in Gemini Scribe

The feature that became Bundled Skills started with a GitHub issues page.

I wrote and maintain Gemini Scribe, an Obsidian plugin that puts a Gemini-powered agent inside your vault. Thousands of people use it, and they have questions. People would open discussions and issues asking how to configure completions, how to set up projects, what settings were available. I was answering the same questions over and over, and it hit me: the agent itself should be able to answer these. It has access to the vault. It can read files. Why am I the bottleneck for questions about my own plugin?

So I built a skill. I took the same documentation source that powers the plugin’s website, packaged it up as a set of instructions the agent could load on demand, and suddenly users could just ask the agent directly. “How do I set up completions?” “What settings are available?” The agent would pull in the right slice of documentation and give a grounded answer. The docs on the web and the docs the agent reads are built from the same source. There is no separate knowledge base to keep in sync.

That first skill opened a door. I was already using custom skills in my own vault to improve how the agent worked with Bases and frontmatter properties. Once I had the bundled skills mechanism in place, I started looking at those personal skills differently. The ones I had built for myself around Obsidian-specific tasks were not just useful to me. They would be useful to anyone running Gemini Scribe. So I started migrating them from my vault into the plugin as built-in skills.

With the latest version of Gemini Scribe, the plugin now ships with four built-in skills. In a future post I will walk through how to create your own custom skills, but first I want to explain what ships out of the box and why this approach works.

Four Skills Out of the Box

That first skill became gemini-scribe-help, and it is still the one I am most proud of conceptually. The plugin’s own documentation lives inside the same skill system as everything else. No special case, no separate knowledge base. The agent answers questions about itself using the same mechanism it uses for any other task.

The second skill I built was obsidian-bases. I wanted the agent to be good at creating Bases (Obsidian’s take on structured data views), but it kept getting the configuration wrong. Filters, formulas, views, grouping: there is a lot of surface area and the syntax is particular. So I wrote a skill that guides the agent through creating and configuring Bases from scratch, including common patterns like task trackers and project dashboards. Instead of me correcting the agent’s output every time, I describe what I want and the agent builds it right the first time.

Next came audio-transcription. This one has a fun backstory. Audio transcription was one of the oldest outstanding bugs in the repo. People wanted to use it with Obsidian’s native audio recording, but the results were poor. In this release, fixes around binary file uploads meant the model could finally receive audio files properly. Once that was working, I realized I did not need to write any more code to get good transcriptions. I just needed to give the agent good instructions. The skill guides it through producing structured notes with timestamps, speaker labels, and summaries. It turns a messy audio file into a clean, searchable note, and the fix was not code but context.

The fourth is obsidian-properties. Working with note properties (the YAML frontmatter at the top of every Obsidian note) sounds trivial until you are doing it across hundreds of notes. The agent would make inconsistent choices about property types, forget to use existing property names, or create duplicates. This skill makes it reliable at creating, editing, and querying properties consistently, which matters enormously if you are using Obsidian as a serious knowledge management system.

The pattern behind all four is the same. I watched the agent struggle with something specific to Obsidian, and instead of accepting that as a limitation of the model, I wrote a skill to fix it.

Why Not Just Use the System Prompt

You might be wondering why I did not just shove all of this into the system prompt. I wrote about this problem in detail in Managing the Agent’s Attention, but the short version is that system prompts are a “just-in-case” strategy. You load up the agent with everything it might need at the start of the conversation, and as you add more instructions, they start competing with each other for the model’s attention. Researchers call this the “Lost in the Middle” problem: models pay disproportionate attention to the beginning and end of their context, and everything in between gets diluted. If I packed all four skills worth of instructions into the system prompt, each one would make the others less effective. Every new skill I add would degrade the ones already there.

Skills avoid this entirely. The agent always knows which skills are available (it gets a short name and description for each one), but only loads the full instructions when it actually needs them. When a skill activates, its instructions land in the most recent part of the conversation, right before the model starts reasoning. Only one skill’s instructions are competing for attention at a time, and they are sitting in the highest-attention position in the context window.

There is a second benefit that surprised me. Because skills activate through the activate_skill tool call, you can watch the agent load them. In the agent session, you see exactly when a skill is activated and which one it chose. This gives you something that system prompts never do: observability. If the agent is not following your instructions, you can check whether it actually activated the skill. If it activated the skill but still got something wrong, you know the problem is in the skill’s instructions, not in the agent’s attention. That feedback loop is what lets you iterate and improve your skills over time. You are no longer guessing whether the agent read your instructions. You can see it happen.

Skills follow the open agentskills.io specification, and this matters more than it might seem. We have seen significant standardization around this spec across the industry in 2026. That means skills are portable. If you have been using skills with another agent, you can bring them into Gemini Scribe and they will work. If you build skills in Gemini Scribe, you can take them with you. They are not a proprietary format tied to one tool. They are Markdown files with a bit of YAML frontmatter, designed to be human-readable, version-controllable, and portable across any agent that supports the spec.

What Comes Next

The four built-in skills are just the beginning. When I decide what to build next, I think about skills in four categories. First, there are skills that give the agent domain knowledge about Obsidian itself, things like Bases and properties where the model’s general training is not specific enough. Second, there are skills that help the agent use Gemini Scribe’s own tools effectively. The plugin has capabilities like deep research, image generation, semantic search, and session recall, and each of those benefits from a skill that teaches the agent when and how to use them well. Third, there are skills that bring entirely new capabilities to the agent, like audio transcription. And fourth, there is user support: the help skill that started this whole process, making sure people can get answers without leaving their vault.

The next version of Gemini Scribe will add built-in skills for semantic search, deep research, image generation, and session recall. The skills system is also designed to be extended by users. In a future post I will walk through creating your own custom skills, both by hand and by asking the agent to build them for you.

For now, the takeaway is simple. A general-purpose model knows a lot, but it does not know your tools. When I watched the agent struggle with Obsidian Bases or produce flat transcripts or make a mess of note properties, I could have accepted those as limitations. Instead, I wrote skills to close the gap. The model’s knowledge is broad. Skills make it deep.

A bird's-eye view of a winding river of glowing green GitHub contribution tiles flowing across a dark landscape, with bright yellow-green flames rising from clusters of the brightest tiles, while a lone figure sits at a laptop at the edge of the mosaic under a distant skyline of code-filled windows.

4255 Contributions – A Year of Building in the Open

I was staring at my GitHub profile the other day when a number caught my eye. 4,255. That’s how many contributions GitHub has recorded for me over the past year. I sat with it for a moment, doing the quick mental math: that’s close to twelve contributions every single day, weekends included. The shape of the year looked just as striking. I showed up on 332 of the 366 days in the window, 91% of them, and at one point put together a 113-day streak without a gap. It felt like a lot. It felt like proof of something I hadn’t been able to articulate until I saw it rendered as a green heatmap on a screen.

About a year ago, I wrote about my decision to move back to individual contributor work after years in leadership roles. I talked about missing the flow state, the direct feedback loop of writing code and watching it work. What I didn’t know at the time was just how dramatically that shift would show up in the data. 4,255 contributions is the quantitative answer to the question I was trying to answer qualitatively in that post: what happens when you give a builder back the time to build?

The Shape of a Year

Numbers by themselves are just numbers. What makes them interesting is the shape they take when you zoom in. My year wasn’t a single monolithic effort on one project. It was a constellation of interconnected work, each project feeding into the next, each one teaching me something that made the others better.

The largest body of work was on Gemini CLI, Google’s open-source AI agent for the terminal. This project alone accounts for a significant chunk of those contributions, spanning everything from core feature development to building the Policy Engine that governs how the agent interacts with your system. But the contributions weren’t just code. A huge portion of my time went into code reviews, issue triage, and community engagement. Working on a repository with over 100,000 stars means that every merged PR has real impact, and every review is a conversation with developers around the world.

Then there was Gemini Scribe, my Obsidian plugin that started as a weekend experiment and grew into a tool with 302 stars and a community of writers who depend on it. Over the past year, I shipped a major 3.0 release, built agent mode, and iterated constantly on the rewrite features that make it useful for daily writing. In fact, this very blog post was drafted in the tool I built, which is a strange and satisfying loop.

Alongside these larger efforts, I shipped a handful of small, sharp tools that I needed for my own workflows. The GitHub Activity Reporter is one I’ve written about before, a utility that uses AI to transform raw GitHub data into narrative summaries for performance reviews and personal reflection. More recently, I built the Workspace extension for Gemini CLI and a deep research extension that lets you conduct multi-step research from the terminal. Each of these tools was born from a specific itch, and each turned out to be useful to more people than I expected. The Workspace extension alone has gathered 510 stars.

The Rhythm of Building

One thing the contribution graph doesn’t capture is the rhythm behind the numbers. My weeks developed a cadence over the year that I didn’t plan but that emerged naturally. Mornings were for deep work on Gemini CLI, the kind of focused system design and implementation that benefits from a fresh mind. Afternoons were for reviews and community work, responding to issues, providing feedback on PRs, and engaging with the developers building on top of our tools. Evenings and weekends were where the personal projects lived: Gemini Scribe, the extensions, and whatever new idea was rattling around in my head.

This rhythm is something I couldn’t have had in my previous role. When your calendar is stacked with meetings from nine to five, the creative work gets squeezed into the margins. Now, the creative work is the whole page. That’s the real story behind 4,255 contributions. It’s not about productivity metrics or GitHub gamification. It’s about what happens when you align your time with the work that energizes you.

What Surprised Me

A few things caught me off guard when I looked back at the year.

First, the ratio of code to “everything else” wasn’t what I expected. I assumed the majority of my contributions would be commits. In reality, a massive portion was reviews, comments, and issue management. On Gemini CLI alone I logged 205 reviews over the year. This was especially true as my role on that project evolved from pure contributor to something closer to a technical steward. Reviewing a complex PR, asking the right questions, and helping someone refine their approach takes just as much skill as writing the code yourself. Sometimes more.

Second, the personal projects had more reach than I anticipated. When I wrote about building personal software, I was mostly thinking about tools I built for myself. But Gemini Scribe has real users who file real bugs and request real features. The Workspace extension took off because it solved a problem that a lot of Gemini CLI users were hitting. Building in the open means you discover an audience you didn’t know was there.

Third, and this is the one I keep coming back to, the year felt shorter than 4,255 contributions would suggest. Flow state compresses time. When you’re deep in a problem, hours feel like minutes. I remember entire weekends spent in the codebase that felt like an afternoon. That compression is, for me, the clearest signal that I made the right call in going back to IC work.

Fourth, and this is the one I never would have predicted until I charted it out: the weekend, not the weekday, turned out to be my most productive window by a wide margin. Saturdays averaged 14.7 contributions, Sundays 14.5, and Thursday, the day I’d have guessed was safest, came in last at 8.3. The busiest single day of the entire year was a Saturday, December 20, when I shipped 89 contributions into podcast-rag, rebuilding the web upload flow, adding episode management to the admin dashboard, and migrating email delivery over to Resend, all in one afternoon. I didn’t plan for the weekends to become the engine. They just did, because that’s where the personal projects live, and the personal projects are where the work is loudest, most direct, and most free of interruption. A day with no meetings on it, I’ve come to realize, is worth more than I ever gave it credit for.

Looking Forward

I don’t know what next year’s number will be, and I’m not particularly interested in making it bigger. The number is a side effect, not a goal. What I care about is continuing to work on problems that matter, in the open, with people who push me to think more clearly. The AI-first developer model I wrote about over a year ago is now just how I work every day. The agents I’m building are the collaborators I’m building with, and both keep getting better.

If you’re someone who’s been thinking about a similar shift, whether it’s moving back to IC work, contributing to open source, or just carving out more time for the work that lights you up, I’d encourage you to try it. You might be surprised by what a year of focused building can produce. I certainly was.

A focused workspace at a desk in a vast library, with nearby shelves illuminated and distant shelves visible but softened, a pair of sunglasses resting on the desk

Scoping AI Context with Projects in Gemini Scribe

My son has a friend who likes to say, “born to dilly dally, forced to lock in.” I’ve started to think that describes AI agents in a large Obsidian vault perfectly.

My vault is a massive, sprawling entity. It holds nearly two decades of thoughts, ranging from deep dives into LLM architecture to my kids’ school syllabi and the exact dimensions needed for an upcoming home remodelling project. When I first introduced Gemini Scribe, the agent’s ability to explore all of that was a feature. I could ask it to surface surprising connections across topics, and it would. But as I’ve leaned harder into Scribe as a daily partner, both at home and at work, the dilly dallying became a real problem. My work vault has thousands of files with highly overlapping topics. It’s not a surprise that the agent might jump from one topic to another, or get confused about what we’re working on at any given time. When I asked the agent to help me structure a paragraph about agentic workflows, I didn’t want it pulling in notes from my jazz guitar practice.

I could have created a new, isolated vault just for my blog writing. I tried that briefly, but I immediately found myself copying data back and forth. I was duplicating Readwise syncs, moving research papers, and fracturing my knowledge base. That wasn’t efficient, and it certainly wasn’t fun. The problem wasn’t that the agent could see too much. The problem was glare. I needed sunglasses, not blinders. I needed to force the agent to lock in.

So, I built Projects in Gemini Scribe.

A project defines scope without acting as a gatekeeper

Fundamentally, a project in Gemini Scribe is a way to focus the agent’s attention without locking it out of anything. It defines a primary area of work, but the rest of the vault is still there. Think of it like sitting at a desk in the engineering section of a library. Those are the shelves you browse by default, the ones within arm’s reach. But if you know the call number for a book in the history section, nobody stops you from walking over and grabbing it. You can even leave a stack of books from other sections on your desk ahead of time if you know you’ll need them. If you’ve followed along with the evolution of Scribe from plugin to platform, you’ll recognize this as a natural extension of the agent’s growing capabilities.

The core mechanism is remarkably simple. Any Markdown file in your vault can become a project by adding a specific tag to its YAML frontmatter.

---
tags:
  - gemini-scribe/project
name: Letters From Silicon Valley
skills:
  - writing-coach
permissions:
  delete_file: deny
---

Once tagged, that file’s parent directory becomes the project root. From that point on, when an agent session is linked to the project, its discovery tools are automatically scoped to that directory and its subfolders. Under the hood, the plugin intercepts API calls to tools like list_files and find_files_by_content, transparently prepending the project root to the search paths. The practical difference is immediate. Before projects, I could be working on a blog post about agent memory systems and the agent would surface notes from a completely unrelated project that happened to use similar terminology. Now I can load up a project and work with the agent hand in hand, confident it won’t get distracted by similar ideas or overlapping vocabulary from other corners of the vault.

The project file serves as both configuration and context

The project file itself serves a dual purpose. It acts as both configuration and context. The frontmatter handles the configuration, allowing me to explicitly limit which skills the agent can use or override global permission settings. For example, denying file deletions for a critical writing project is a simple but effective safety net. But the real power is in customizing the agent’s behavior per project. For my creative writing, I actually don’t want the agent to write at all. I want it to read, critique, and discuss, but the words on the page need to be mine. Projects let me turn off the writing skill entirely for that context while leaving it fully enabled for my blog work. The same agent, shaped differently depending on what I’m working on.

Everything below the frontmatter is treated as context. Whatever I write in the body of the project note is injected directly into the agent’s system prompt, acting much like an additional, localized set of instructions. The global agent instructions are still respected, but the project instructions provide the specific context needed for that particular workspace. This is similar in spirit to how I’ve previously discussed treating prompts as code, where the instructions you give an agent deserve the same rigor and iteration as any other piece of software.

This is where the sunglasses metaphor really holds. The agent’s discovery tools, things like list_files and find_files_by_content, are scoped to the project folder. That’s the glare reduction. But the agent’s ability to read files is completely unrestricted. If I am working on a technical post and need to reference a specific architectural note stored in my main Notes folder, I have two options. I can ask the agent to go grab it, or I can add a wikilink or embed to the project file’s body and the agent will have it available from the start. One is like walking to the history section yourself. The other is like leaving that book on your desk before you sit down. Either way, the knowledge is accessible. The project just keeps the agent from rummaging through every shelf on its own. This builds directly on the concepts of agent attention I explored in Managing AI Agent Attention.

Session continuity keeps the agent focused across your vault

One of the more powerful aspects of this system is how it interacts with session memory. When I start a new chat, Gemini Scribe looks at the active file. If that file lives within a project folder, the session is automatically linked to that project. This is a direct benefit of the supercharged chat history work that landed earlier in the plugin’s life.

This linkage is stable for the lifetime of the session. I can navigate around my vault, opening files completely unrelated to the project, and the agent will remain focused on the project’s context and instructions. This means I don’t have to constantly remind the agent of the rules of the road. The project configuration persists across the entire conversation.

Furthermore, session recall allows the agent to look back at past conversations. When I ask about prior work or decisions related to a specific project, the agent can search its history, utilizing the project linkage to find the most relevant past interactions. This creates a persistent working environment that feels much more like a collaboration than a simple transaction.

Structuring projects effectively requires a few simple practices

To get the most out of projects, I’ve found a few practices to be particularly effective.

First, lean into the folder-based structure. Place the project file at the root of the folder containing the relevant work. Everything underneath it is automatically in scope. This feels natural if you already organize your vault by topic or project, which many Obsidian users do.

Second, start from the defaults and adjust as the project demands. Out of the box, a new project inherits the agent’s standard skills and permissions, which is a sensible baseline for most work. From there, you tune. If you find the agent reaching for tools that don’t make sense in a given context, narrow the allowed skills in the frontmatter. If a project needs extra safety, tighten the permissions. The creative writing example I mentioned earlier came about exactly this way. I started with the defaults, realized I wanted the agent as a reader and critic rather than a co-writer, and adjusted accordingly. This aligns with the broader principle I’ve written about when discussing building responsible agents: the right guardrails are the ones shaped by the actual work.

Finally, treat the project body as a living document. As the project evolves, update the instructions and external links to ensure the agent always has the most current and relevant context. It’s a simple mechanism, but it fundamentally changes how I interact with an AI embedded in a large knowledge base. It allows me to keep my single, massive vault intact, while giving the agent the precise focus it needs to be genuinely helpful.

A cracked-open obsidian geode on a weathered wooden desk reveals a glowing golden network of interconnected nodes and pathways inside. Tendrils of golden light extend outward from the geode across the desk toward open notebooks and a mechanical keyboard, with bookshelves softly blurred in the background.

Gemini Scribe From Agent to Platform

Six months ago, I wrote about building Agent Mode for Gemini Scribe from a hotel room in Fiji. That post ended with a sense of possibility. The agent could read your notes, search the web, and edit files. It was, by the standards of the time, pretty remarkable. I remember watching it chain together a sequence of tool calls for the first time and thinking I’d built something meaningful.

I had no idea it was just the beginning.

In the six months since that post, Gemini Scribe has gone through fifteen releases, from version 3.3 to 4.6. There have been over 400 commits, a complete architectural rethinking, and a transformation from “a chat plugin with an agent mode” into something I can only describe as a platform. The agent didn’t just get better. It got a memory, a research department, a set of extensible skills, and the ability to talk to external tools through the Model Context Protocol. If the vacation version was a clever assistant, this version is closer to a collaborator who actually understands your vault.

I want to walk through how we got here, because the journey reveals something I think is important about building with AI right now: the hardest problems aren’t the ones you set out to solve. They’re the ones that reveal themselves only after you ship the first version and start living with it.

The Agent Grows Up

The first big milestone after the vacation was version 4.0, released in November 2025. This was the release where I made a decision that felt risky at the time: I removed the old note-based chat entirely. No more dual modes, no more confusion about which interface to use. Everything became agent-first. Every conversation had tool calling built in. Every session was persistent.

It sounds simple in hindsight, but killing a feature that works is one of the hardest decisions in software. The old chat mode was comfortable. People used it. But it was holding back the entire plugin, because every new feature had to work in two completely different paradigms. Ripping it out was liberating. Suddenly I could focus all my energy on making one experience truly great instead of maintaining two mediocre ones.

Alongside 4.0, I built the AGENTS.md system, a persistent memory file that gives the agent an overview of your entire vault. When you initialize it, the agent analyzes your folder structure, your naming conventions, your tags, and the relationships between your notes. It writes all of this down in a file that persists across sessions. The result is that the agent doesn’t start every conversation from scratch. It already knows how your vault is organized, where you keep your research, and what projects you’re working on. It’s the difference between hiring a new intern every morning and having a colleague who’s been on the team for months.

Seeing and Searching

Version 4.1 brought something I’d wanted since the beginning: real thinking model support. When Google released Gemini 2.5 Pro and later Gemini 3 with extended thinking capabilities, I added a progress indicator that shows you the model’s reasoning in real time. You can watch it think through a problem, see it plan its approach, and understand why it chose a particular tool. It sounds like a small UI feature, but it fundamentally changes your relationship with the agent. You stop treating it like a black box and start treating it like a thinking partner whose process you can follow.

That same release added a stop button (which sounds trivial until you’re watching an agent go on a tangent and have no way to interrupt it), dynamic example prompts that are generated from your actual vault content, and multilingual support so the agent responds in whatever language you write in.

But the real game-changer came in version 4.2 with semantic vault search. I wrote about the magic of embeddings over a year ago, and this feature is that idea fully realized inside Obsidian. It uses Google’s File Search API to index your entire vault in the background. Once indexed, the agent can search by meaning, not just keywords. If you ask it to “find my notes about the trade-offs of microservices,” it will surface relevant notes even if they never use the word “microservices.” It understands that a note titled “Why We Split the Monolith” is probably relevant.

The indexing runs in the background, handles PDFs and attachments, and can be paused and resumed. Getting the reliability right was one of the more frustrating engineering challenges of the whole project. There were weeks of debugging race conditions, handling rate limits gracefully, and making sure a crash mid-index didn’t corrupt the cache. Version 4.2.1 was almost entirely dedicated to stabilizing the indexer, adding incremental cache saves and automatic retry logic. It’s the kind of work that nobody sees but everyone benefits from.

Images, Research, and the Expanding Toolbox

Version 4.3, released in January 2026, added multimodal image support. You can now paste or drag images directly into the chat, and the agent can analyze them, describe them, or reference them in notes it creates. The image generation tool, which I’d been building in the lead-up to 4.3, lets the agent create images on demand using Google’s Imagen models. There’s even an AI-powered prompt suggester that helps you describe what you want if you’re not sure how to phrase it.

That release also introduced two new selection-based actions: Explain Selection and Ask About Selection. These join the existing Rewrite feature to give you a full right-click menu for working with selected text. It sounds like a small addition, but in practice these micro-interactions are where people spend most of their time. Being able to highlight a paragraph, right-click, and ask “What’s the logical flaw in this argument?” without leaving your note is the kind of frictionless experience I’m always chasing.

Then came deep research in version 4.4. This is fundamentally different from the regular Google Search tool. Where a search returns quick snippets, deep research performs multiple rounds of investigation, reading and cross-referencing sources, synthesizing findings, and producing a structured report with inline citations. It can combine web sources with your own vault notes, so the output reflects both what the world knows and what you’ve already written. A single research request takes several minutes, but what you get back is closer to what a research assistant would produce after an afternoon in the library.

I built this on top of my gemini-utils library, which is a separate project I created to share common AI functionality across all of my TypeScript Gemini projects, including Gemini Scribe, my Gemini CLI deep research extension, and more. Having that shared foundation means deep research improvements benefit every project simultaneously.

Opening the Platform

If I had to pick the release that transformed Gemini Scribe from a plugin into a platform, it would be version 4.5. This is where MCP server support and the agent skills system arrived.

MCP, the Model Context Protocol, is an open standard that lets AI applications connect to external tool providers. In practical terms, it means Gemini Scribe can now talk to tools that I didn’t build. You can connect a filesystem server, a GitHub integration, a Brave Search provider, or anything else that speaks MCP. The plugin supports both local stdio transport (spawning a process on your desktop) and HTTP transport with full OAuth authentication, which means it works on mobile too. When you connect an MCP server, its tools appear alongside the built-in vault tools, with the same confirmation flow and safety features.

This was the moment the plugin stopped being a closed system. Instead of me having to build every integration myself, the entire MCP ecosystem became available. Someone who needs to query a database from their notes can connect a database MCP server. Someone who wants to interact with their GitHub issues can connect the GitHub server. The plugin becomes a hub rather than a destination.

The agent skills system, which follows the open agentskills.io specification, takes a similar approach to extensibility but for knowledge rather than tools. A skill is a self-contained instruction package that gives the agent specialized expertise. You can create a “meeting-notes” skill that teaches it your preferred format for processing meetings, or a “code-review” skill with your team’s specific standards. Skills use progressive disclosure, so the agent always knows what’s available but only loads the full instructions when it activates one. This keeps conversations focused while making specialized knowledge available on demand.

Version 4.5 also migrated API key storage to Obsidian’s SecretStorage, which uses the OS keychain. Your API key is no longer sitting in a plain JSON file in your vault. It’s a small change that matters a lot for security, especially for people who sync their vaults to cloud storage or version control.

Managing the Conversation

The most recent release, version 4.6, tackles a problem that only becomes apparent after you’ve been using an agent for a while: conversations get long, and long conversations hit token limits.

The solution is automatic context compaction, a direct answer to the attention management challenge I explored in the Agentic Shift series. When a conversation approaches the model’s token limit, the plugin automatically summarizes older turns to make room for new ones. There’s also an optional live token counter that shows you exactly how much of the context window you’re using, with a breakdown of cached versus new tokens. It’s the kind of visibility that helps you understand why the agent might be “forgetting” things from earlier in the conversation and gives you the information to manage it.

This release also added a per-tool permission policy system, which is the practical realization of the guardrails philosophy I wrote about in the Agentic Shift series. Instead of the binary choice between “confirm everything” and “confirm nothing,” you can now set individual tools to allow, deny, or ask-every-time. There are presets too: Read Only, Cautious, Edit Mode, and (for the brave) YOLO mode, which lets the agent execute everything without asking. I use Cautious mode myself, which auto-approves reads and searches but asks before any file modifications. It strikes a balance between speed and safety that feels right for daily use.

What I’ve Learned

Building Gemini Scribe has taught me something I keep coming back to in this blog: the most interesting work happens at the intersection of AI capabilities and human workflows. The technical challenges (semantic indexing, MCP integration, context compaction) are real, but they’re in service of a simple goal: making the AI useful enough that you forget it’s there.

The plugin now has users like Paul O’Malley building entire self-organizing knowledge systems on top of it. Seeing that kind of creative adoption is what keeps me building. Every feature request, every bug report, every surprising use case reveals another facet of what’s possible when you give a capable AI agent the right set of tools and the right context.

If you’re curious, Gemini Scribe is available in the Obsidian Community Plugins directory. All you need is a free Google Gemini API key. I’d love to hear what you build with it.

Great Video on Gemini Scribe and Obsidian

I was recently looking through the feedback in the Gemini Scribe repository when I noticed a few insightful comments from a user named Paul O’Malley. Curiosity got the better of me, I love seeing who is actually pushing the boundaries of the tools I build, so I took a look at his YouTube page. I quickly found myself deep into a walkthrough titled “I Built a Second Brain That Organises Itself.”

What caught my eye wasn’t just another productivity system, we’ve all seen the “shiny new app” cycle that leads to digital bankruptcy. It was seeing Gemini Scribe being used as the engine for a fully automated Obsidian vault.

The Friction of Digital Maintenance

Paul hits on a fundamental truth: most systems fail because the friction of maintenance—the tagging, the filing, the constant admin—eventually outweighs the benefit. He argues that what we actually need is a system that “bridges the gap in our own executive function”.

In his setup, he uses Obsidian as the chassis because it relies on Markdown. I’ve long believed that Markdown is the native language of AI, and seeing it used here to create a “seamless bridge” between messy human thoughts and structured AI processing was incredibly satisfying.

Gemini Scribe as the Engine

It was a bit surreal to watch Paul walk through the installation of Gemini Scribe as the core engine for this self-organizing brain. He highlights a few features that I poured a lot of heart into:

  • Session History as Knowledge: By saving AI interactions as Markdown files, they become a searchable part of your knowledge base. You can actually ask the AI to reflect on past conversations to find patterns in your own thinking.
  • The Setup Wizard: He uses a “Setup Wizard” to convert the AI from a generic chatbot into a specialized system administrator. Through a conversational interview, the agent learns your profession and hobbies to tailor a project taxonomy (like the PARA method) specifically to you.
  • Agentic Automation: The video demonstrates the “Inbox Processor,” where the AI reads a raw note, gives it a proper title, applies tags, and physically moves it to the right folder.

Beyond the Tool: A Human in the Loop

One thing Paul emphasized that really resonated with my own philosophy of Guiding the Agent’s Behavior is the “Human in the Loop”. When the agent suggests a change or creates a new command, it writes to a staging file first.

As Paul puts it, you are the boss and the AI is the junior employee—it can draft the contract, but you have to sign it before it becomes official. You always remain in control of the files that run your life.

Small Tools, Big Ideas

Seeing the Gemini CLI mentioned as a “cleaner and slightly more powerful” alternative for power users was another nice nod. It reinforces the idea that small, sharp tools can be composed into something transformative.

Building tools in a vacuum is one thing, but seeing them live in the wild, helping someone clear their “mental RAM” and close their loop at the end of the day, is one of the reasons I do this. It’s a reminder that the best technology doesn’t try to replace us; it just makes the foundations a little sturdier.

A photorealistic image shows an old wooden-handled hammer on a cluttered workbench transforming into a small, multi-armed mechanical robot with glowing blue eyes, holding various miniature tools.

Everything Becomes an Agent

I’ve noticed a pattern in my coding life. It starts innocently enough. I sit down to write a simple Python script, maybe something to tidy up my Obsidian vault or a quick CLI tool to query an API. “Keep it simple,” I tell myself. “Just input, processing, output.”

But then, the inevitable thought creeps in: It would be cool if the model could decide which file to read based on the user’s question.

Two hours later, I’m not writing a script anymore. I’m writing a while loop. I’m defining a tools array. I’m parsing JSON outputs and handing them back to the model. I’m building memory context windows.

I’m building an agent. Again.

(For those keeping track: my working definition of an “agent” is simple: a model running in a loop with access to tools. I explored this in depth in my Agentic Shift series, but that’s the core of it.)

As I sit here writing this in January of 2026, I realize that almost every AI project I worked on last year ultimately became an agent. It feels like a law of nature: Every AI project, given enough time, converges on becoming an agent. In this post, I want to share some of what I’ve learned, and the cases where you might skip the intermediate steps and jump straight to building an agent.

The Gravitational Pull of Autonomy

This isn’t just feature creep. It’s a fundamental shift in how we interact with software. We are moving past the era of “smart typewriters” and into the era of “digital interns.”

Take Gemini Scribe, my plugin for Obsidian. When I started, it was a glorified chat window. You typed a prompt, it gave you text. Simple. But as I used it, the friction became obvious. If I wanted Scribe to use another note as context for a task, I had to take a specific action, usually creating a link to that note from the one I was working on, to make sure it was considered. I was managing the model’s context manually.

I was the “glue” code. I was the context manager.

The moment I gave Scribe access to the read_file tool, the dynamic changed. Suddenly, I wasn’t micromanaging context; I was giving instructions. “Read the last three meeting notes and draft a summary.” That’s not a chat interaction; that’s a delegation. And to support delegation, the software had to become an agent, capable of planning, executing, and iterating.

From Scripts to Sudoers

The Gemini CLI followed a similar arc. There were many of us on the team experimenting with Gemini on the command line. I was working on iterative refinement, where the model would ask clarifying questions to create deeper artifacts. Others were building the first agentic loops, giving the model the ability to run shell commands.

Once we saw how much the model could do with even basic tools, we were hooked. Suddenly, it wasn’t just talking about code; it was writing and executing it. It could run tests, see the failure, edit the file, and run the tests again. It was eye-opening how much we could get done as a small team.

But with great power comes great anxiety. As I explored in my Agentic Shift post on building guardrails and later in my post about the Policy Engine, I found myself staring at a blinking cursor, terrified that my helpful assistant might accidentally rm -rf my project.

This is the hallmark of the agentic shift: you stop worrying about syntax errors and start worrying about judgment errors. We had to build a “sudoers” file for our AI, a permission system that distinguishes between “read-only exploration” and “destructive action.” You don’t build policy engines for scripts; you build them for agents.

The Classifier That Wanted to Be an Agent

Last year, I learned to recognize a specific code smell: the AI classifier.

In my Podcast RAG project, I wanted users to search across both podcast descriptions and episode transcripts. Different databases, different queries. So I did what felt natural: I built a small classifier using Gemini Flash Lite. It would analyze the user’s question and decide: “Is this a description search or a transcript search?” Then it would call the appropriate function.

It worked. But something nagged at me. I had written a classifier to make a decision that a model is already good at making. Worse, the classifier was brittle. What if the user wanted both? What if their intent was ambiguous? I was encoding my assumptions about user behavior into branching logic, and those assumptions were going to be wrong eventually.

The fix was almost embarrassingly simple. I deleted the classifier and gave the agent two tools: search_descriptions and search_episodes. Now, when a user asks a question, the agent decides which tool (or tools) to use. It can search descriptions first, realize it needs more detail, and then dive into transcripts. It can do both in parallel. It makes the call in context, not based on my pre-programmed heuristics. (You can try it yourself at podcasts.hutchison.org.)

I saw the same pattern in Gemini Scribe. Early versions had elaborate logic for context harvesting, code that tried to predict which notes the user would need based on their current document and conversation history. I was building a decision tree for context, and it was getting unwieldy.

When I moved Scribe to a proper agentic architecture, most of that logic evaporated. The agent didn’t need me to pre-fetch context; it could use a read_file tool to grab what it needed, when it needed it. The complex anticipation logic was replaced by simple, reactive tool calls. The application got simpler and more capable at the same time.

Here’s the heuristic I’ve landed on: If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent. Deconstruct those branches into tools, give the agent really good descriptions of what those tools can do, and then let the model choose its own adventure.

You might be thinking: “What about routing queries to different models? Surely a classifier makes sense there.” I’m not so sure anymore. Even model routing starts to look like an orchestration problem, and a lightweight orchestrator with tools for accessing different models gives you the same flexibility without the brittleness. The question isn’t whether an agent can make the decision better than your code. It’s whether the agent, with access to the actual data in the moment, can make a decision at least as good as what you’re trying to predict when you’re writing the code. The agent has context you don’t have at development time.

The “Human-on-the-Loop”

We are transitioning from Human-in-the-Loop (where we manually approve every step) to Human-on-the-Loop (where we set the goals and guardrails, but let the system drive).

This shift is driven by a simple desire: we want partners, not just tools. As I wrote back in April about waiting for a true AI coding partner, a tool requires your constant attention. A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.

This freedom comes with a new responsibility: clarity. If your agent is going to work overnight, you need to make sure it’s working on something productive. You need to be precise about the goal, explicit about the boundaries, and thoughtful about what happens when things go wrong. Without the right guardrails, an agent can get stuck waiting for your input, and you’ll lose that time. Or worse, it can get sidetracked and spend hours on something that wasn’t what you intended.

The goal isn’t to remove the human entirely. It’s to move us from the execution layer to the supervision layer. We set the destination and the boundaries; the agent figures out the route. But we have to set those boundaries well.

Embracing the Complexity (Or Lack Thereof)

Here’s the counterintuitive thing: building an agent isn’t always harder than building a script. Yes, you have to think about loops, tool definitions, and context window management. But as my classifier example showed, an agentic architecture can actually delete complexity. All that brittle branching logic, all those edge cases I was trying to anticipate: gone. Replaced by a model that can reason about what it needs in the moment.

The real complexity isn’t in the code; it’s in the trust. You have to get comfortable with a system that makes decisions you didn’t explicitly program. That’s a different kind of engineering challenge, less about syntax, more about guardrails and judgment.

But the payoff is a system that grows with you. A script does exactly what you wrote it to do, forever. An agent does what you ask it to do, and sometimes finds better ways to do it than you’d considered.

So, if you find yourself staring at your “simple script” and wondering if you should give it a tools definition… just give in. You’re building an agent. It’s inevitable. You might as well enjoy the company.