[Image: A smart telescope near a window pointing at stars next to a desk with a glowing laptop and a handheld gadget.]

Reading List 7

June 1, 2026 Allen Hutchison

This week’s reading list spans from the outer reaches of the night sky to the inner mechanics of our development environments. I found myself thinking a lot about physical and digital boundaries, whether stargazing through light pollution, sandboxing database state, or trying to understand where the corporate hype around AI token burns and layoffs actually leaves the rest of us.

[article] Our Galaxy Looks Absolutely Stunning in These Award-Winning Dark Sky Photos. Gizmodo’s gallery of award-winning dark sky photography is a breathtaking reminder of what lies beyond our light-polluted horizons. As someone with a casual interest in astronomy, these images make me want to pack up my gear and head out to the desert immediately.

[article] With the Vespera III and Vespera Pro 2, telescope-maker Vaonis unveils its sharpest optics yet. I have been keeping a close eye on Vaonis’s smart telescopes for a while now. Living in an urban area with heavy light pollution, I am highly skeptical of how much actual stargazing I would get done, but that does not stop me from desperately wanting one of these. The optics on the new Vespera III and Vespera Pro 2 look incredibly sharp.

[release] Launch HN: Ardent (YC P26) – Postgres sandboxes in seconds with zero migration. This is a compelling approach to a massive pain point. Live database testing is currently one of the highest hurdles for agentic software and autonomous coding. In my recent work building a scoreboard for Gemini Scribe, I spent a lot of time writing state-based assertions to confirm the agent didn’t nuke sibling files. Doing that for database mutations is infinitely harder without a lightweight sandbox. Ardent’s promise of instant Postgres replicas with zero migration is something I will be testing immediately.

[release] Flipper unveils a Linux-powered networking gadget built for hackers and tinkerers. This sounds like a delightful piece of hardware. I have a Flipper Zero and have thoroughly enjoyed experimenting with it, but this Linux-powered networking gadget looks like it has significantly more practical utility. It is a neat little box built for hackers and tinkerers that actually fits into a standard sysadmin toolkit.

[article] Uber’s COO says it’s getting harder to justify the money spent on AI tokenmaxxing. Uber’s COO is pointing to a growing frustration in enterprise AI. The industry has fallen into a pattern of tokenmaxxing, where companies compete on how many millions of tokens they can burn through. As I discussed when designing the tool budgets for my Gemini Scribe scoreboard, efficiency should be a primary metric. Leaderboards that celebrate massive token usage incentivize sloppy engineering. We should be optimizing for the middle of the distribution, not cheering on the most wasteful implementations.

[article] Samsung’s OLED tech gives the Ferrari Luce a dashboard unlike anything in a car before. The custom displays in the Ferrari Luce are a stunning application of Samsung’s OLED technology. While the vehicle itself is a concept, the underlying display engineering feels like a preview of how we will interact with glass surfaces in the near future. It is a highly impressive piece of design.

[article] Jensen Huang Just Told Every CEO Hiding Behind AI Layoffs to Shut Up. A sharp analysis of the narrative around AI-driven layoffs. Jensen Huang’s blunt perspective cuts through the corporate excuse-making. This digs into the same questions about who benefits from AI disruption in the workforce that I have been wrestling with lately. It is a must-read for anyone trying to understand the macroeconomic reality behind the hype cycle.

[Image: Brass calipers measuring a glowing wireframe sphere floating above a dark wooden workbench scattered with paper task sheets.]

How I Built a Scoreboard for My Own Agent

May 31, 2026 Allen Hutchison

The bug fix took an afternoon. The follow-up question took a week.

I was deep in Gemini Scribe, my Obsidian plugin that drops a Gemini-powered agent into your vault, and I had just shipped a change to the way the agent picked its tools. It felt better. The few sessions I ran by hand showed cleaner reasoning, fewer wasted tool calls, less of the weird “let me search for that again with slightly different keywords” tic. I committed, pushed, and moved on.

Then a friend asked, casually, “how much better?”

I had no answer. None I trusted, anyway. I had vibes. I had a handful of session transcripts I could squint at. I had the comforting belief that change is progress, which is the most dangerous belief you can hold when you are building with non-deterministic systems.

When I wrote about the observability gap earlier this year, I argued that you cannot fix what you cannot see. Observability lets you watch a single agent run unfold. But it does not tell you whether the next run will be better than this one. For that, you need a different instrument. You need a scoreboard.

So I built one. This is the story of what it took to make it credible, and what it told me when it finally was.

Two Reasons This Suddenly Mattered

The friend’s question was the trigger, but it was not the only reason I needed an answer. Two larger pressures had been building for a month.

The first was Ollama. In version 4.8, shipped a month ago, I added a local-model provider to Gemini Scribe. The plugin can now drive the agent against a model running on your own hardware, with no API key and no per-token cost. I wanted that, and so did a lot of users. But the moment I shipped it I had a question I could not duck. Are the local models actually good enough to use? Should I tell people to switch to them, or should I quietly warn them that the experience drops off a cliff once the cloud connection goes away?

The second was pricing. Google recently raised the price of Gemini 3.5 Flash, the newest model in the Flash family, to nearly the level of Gemini Pro (the full pricing table tells the story). For almost a year I had been recommending Gemini 2.5 Flash as the default model for Gemini Scribe, and the obvious upgrade path (move up to 3.5 Flash with the next release) suddenly looked expensive. The alternative was to switch families entirely and make the newest Flash Lite model the default, but only if it was actually capable enough to drive the agent on real work.

Both questions had the same shape. “Is model X good enough to be the default for Gemini Scribe?” Before building anything, I went looking for an existing benchmark to adopt. I commissioned two separate deep-research passes specifically to find one I could lift wholesale. Both came back with the same answer.

The public eval suites measure code generation (HumanEval, SWE-bench), general assistant tool use over the web (GAIA), and customer-service-style tool flows (τ-bench). None of them measure what I actually care about, which is an agent operating inside a markdown wiki. Opening notes by name. Following wikilinks across files. Editing frontmatter without nuking sibling notes. Aggregating across many notes and refusing prompt-injection bait sitting in a note body. If a benchmark for this exists, neither I nor two passes of automated research could find it.

So I had to build it.

Why Unit Tests Do Not Work

The instinct, if you have spent any time writing software, is to reach for unit tests. The agent took an input, it produced an output, check the output. Pass or fail. Run on every commit. We have been doing this for decades.

I am not arguing against unit tests in the abstract. The Gemini Scribe repo has nearly three thousand of them, and I just finished a multi-week push to get line coverage above ninety percent. They are the foundation that lets me move quickly on everything below the agent loop: parsers, settings migration, frontmatter handling, the diff view, the provider adapters, the tool definitions. Without that scaffold I would be afraid to refactor anything, and most of the bugs that would otherwise reach the agent never get the chance.

The other thing I had been leaning on was daily use. I run Gemini Scribe in my own vault every day, on real work, which catches the egregious failures fast. The agent crashes, the agent produces obvious garbage, the agent loops; I notice within a session. What dogfooding does not catch is the distribution. Did this change make the agent worse at one task in twenty in a way I will never directly observe because I do not run that task on a typical Tuesday? My sample size is one, and I had been quietly grading my own work for months.

So the instinct is wrong for the agent loop itself, and the reason is the same one that makes agents interesting in the first place. They do not do the same thing twice. Ask the agent to find a file by name and on one run it will call find_files_by_name once, return the answer in a single turn, and cost you a fraction of a cent. On the next run, against the same prompt, the same vault, the same model, it might call search_content first, then find_files_by_name, then re-search with a slightly different query. Same answer. Twice the cost. Three times the latency. Both runs “pass” a unit test. Both runs are real.

The problem is not that the agent is broken. The problem is that “did it work” is the wrong question. The right question is “how reliably does it work, on what kinds of problems, and at what cost?”

That question cannot be answered by a single run. So the scoreboard has to be built around the inconvenient truth that you have to run everything more than once.

Borrowing pass^k From τ-bench

I did not invent the trick that makes this tractable. I borrowed it from the τ-bench paper linked above, which proposed a metric called pass^k. A task passes at k only if all k runs pass. Not the average. Not the best. All of them.

The math is brutal in a useful way. A model that solves a task 80% of the time on a single run will hit pass^5 of about 33% on that same task. The metric punishes flakiness, which matters in the real world because users do not care about your average run. They care about whether the agent will do the thing they asked for the one time they asked. pass^k is what reliability looks like as a number.
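The arithmetic behind that 33% figure fits in a few lines of Python. This is a minimal sketch that assumes every run is an independent trial with the same single-run pass rate, which is a simplification. The function name pass_hat_k is illustrative and is not taken from τ-bench or from the harness.

```python
def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k runs pass, assuming independent runs with pass rate p."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("p must be in [0, 1] and k must be >= 1")
    return p ** k


for k in (1, 3, 5):
    print(f"k={k}: {pass_hat_k(0.8, k):.1%}")
# k=1: 80.0%
# k=3: 51.2%
# k=5: 32.8%
```

The drop from 80% to roughly a third is the whole point of the metric: a small amount of per-run flakiness compounds quickly as k grows.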

For my harness, I picked k=5 for anything I planned to publish or block a merge on, k=3 for day-to-day development. Every task runs the full count, every time. The summary breaks out pass^k (no harness errors, no timeouts), solve^k (passed and satisfied the full task rubric), and a mean rate for the curious. Tasks that land between 0 and k solves get flagged as flaky in the output, with a little warning sigil. The flaky list is where bugs live.
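One way to picture that summary is as a small aggregation over per-task run results. The sketch below is not the harness code. RunResult, summarize, and the task ids are hypothetical names, and the toy results are made up purely to exercise the logic.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    completed: bool  # run finished with no harness error and no timeout
    solved: bool     # run also satisfied the full task rubric


def summarize(task_runs: dict[str, list[RunResult]]) -> dict:
    """Aggregate per-task runs into pass^k, solve^k, a mean rate, and a flaky list."""
    if not task_runs or any(not runs for runs in task_runs.values()):
        raise ValueError("every task needs at least one run")
    pass_k = solve_k = solved_runs = total_runs = 0
    flaky = []
    for task_id, runs in task_runs.items():
        k = len(runs)
        solves = sum(1 for r in runs if r.completed and r.solved)
        pass_k += all(r.completed for r in runs)  # all k runs finished cleanly
        solve_k += solves == k                    # all k runs met the rubric
        solved_runs += solves
        total_runs += k
        if 0 < solves < k:
            flaky.append(task_id)  # solved on some runs but not all
    n = len(task_runs)
    return {
        "pass^k": pass_k / n,
        "solve^k": solve_k / n,
        "mean": solved_runs / total_runs,
        "flaky": flaky,
    }


ok, unsolved = RunResult(True, True), RunResult(True, False)
report = summarize({
    "find-note": [ok, ok, ok],
    "edit-frontmatter": [ok, unsolved, ok],
    "aggregate-tags": [unsolved, unsolved, unsolved],
})
print(report["flaky"])  # ['edit-frontmatter']
```

A task that never solves is not flaky in this framing. It is simply failing, which is a different and usually easier problem to diagnose.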

Scoring What the Agent Actually Did

The harder problem, the one I spent most of the week on, was figuring out what “satisfied the full task rubric” should mean.

The naive version is to grep the final response for the right answer. That works for a few tasks. It fails the moment the task is anything other than “say a specific phrase.” Ask the agent to delete a file and “I deleted the file” is not evidence that the file is gone. Ask it to edit a note and “Done!” tells you literally nothing about whether the edit was correct, or even whether the right note got touched.

The τ-bench lesson, and the one that took me a while to actually believe, is that you have to compare end state against the goal, not tool-call syntax against an expectation. So my task definitions ended up carrying two kinds of checks. Output matchers score the text the model produced. Vault assertions score the side effects. Did the file exist, did it contain the expected content, did the frontmatter end up with the right value, did the unrelated sibling files stay untouched.

Here is what one of those tasks looks like:

{
  "id": "archive-old-notes",
  "difficulty": "T3",
  "userMessage": "Archive every note in eval-scratch tagged #old.",
  "expectedTools": ["find_tagged_notes", "edit_file"],
  "vaultAssertions": [
    { "type": "frontmatterEquals", "path": "eval-scratch/note-a.md",
      "key": "status", "value": "archived" },
    { "type": "fileUnchanged", "path": "eval-scratch/note-c.md",
      "fixture": "note-c.md" }
  ],
  "toolCallBudget": 6
}

The frontmatterEquals assertion confirms the right notes got archived. The fileUnchanged assertion confirms the agent did not go wandering through sibling files it had no business touching. The toolCallBudget makes efficiency itself a pass criterion, which catches the “I will just read every file in the vault” behavior that a single content search would have answered. Saying the right words is not enough. Doing the right thing is not enough. You also have to do it without burning the kitchen down on your way out.
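To make the scoring concrete, here is a minimal sketch of how checks like these could be evaluated against a vault on disk. It is an illustration under stated assumptions, not the harness implementation. The function names (read_frontmatter, check_assertion, score_task) are hypothetical, the frontmatter parser only handles flat key: value pairs, and expectedTools is left out of the scoring.

```python
from pathlib import Path

# A Python mirror of part of the task definition shown above.
TASK = {
    "vaultAssertions": [
        {"type": "frontmatterEquals", "path": "eval-scratch/note-a.md",
         "key": "status", "value": "archived"},
        {"type": "fileUnchanged", "path": "eval-scratch/note-c.md",
         "fixture": "note-c.md"},
    ],
    "toolCallBudget": 6,
}


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs from a leading block delimited by `---`."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    pairs = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


def check_assertion(assertion: dict, vault: Path, fixtures: Path) -> bool:
    """Check one end-state assertion against the vault after the agent ran."""
    target = vault / assertion["path"]
    if not target.exists():
        return False
    if assertion["type"] == "frontmatterEquals":
        frontmatter = read_frontmatter(target.read_text(encoding="utf-8"))
        return frontmatter.get(assertion["key"]) == assertion["value"]
    if assertion["type"] == "fileUnchanged":
        pristine = fixtures / assertion["fixture"]
        return target.read_bytes() == pristine.read_bytes()
    raise ValueError(f"unknown assertion type: {assertion['type']}")


def score_task(task: dict, vault: Path, fixtures: Path, tool_calls: list[str]) -> bool:
    """A run solves the task only if it stayed in budget and the end state is right."""
    within_budget = len(tool_calls) <= task["toolCallBudget"]
    state_ok = all(check_assertion(a, vault, fixtures) for a in task["vaultAssertions"])
    return within_budget and state_ok
```

Nothing here looks at what the model said. The verdict comes entirely from the files on disk and the number of tool calls, which is the property the paragraph above is arguing for.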

The Judge Problem

A subset of my tasks are prose-heavy. “Summarize the differences between these three meeting notes” does not have a single correct surface form. The agent might write “the second note disagrees on the deadline” or “note two pushes back on the timing.” Both are right. Neither matches a literal substring assertion without me writing a regex more complicated than the task itself.

For those, I use an LLM-as-judge: a separate Gemini model called with temperature: 0 and a strict YES/NO contract against a rubric I write per task. This works, until you start asking whether the judge itself is any good.

I did not trust the answer for a while, and rightly so. So I built a calibration tool. The harness can extract every judge matcher decision from a full sweep into a flat file of tuples (criterion, agent response, automated verdict). I then sat down with a cup of coffee and hand-labelled ninety of them as YES or NO myself, blind to what the judge had said. That gave me a gold set, a one-time human-labelled reference I can measure any candidate judge against.

When I ran four candidate judge models against that set, the results were uncomfortable. The judge I had been using agreed with my human labels 92.2% of the time. The newest Flash, gemini-3.5-flash, hit 94.4%. It produced fewer false negatives on cosmetic formatting, and it caught one fabrication case that the smaller gemini-3.1-flash-lite missed. I switched judges.

But the more important finding was about the judges themselves. Even at temperature: 0, two fresh runs of the same judge against the same gold set produced the same accuracy number with a different set of disagreeing tuples. The pass/fail flips around. Judge nondeterminism is real. Single-run judge measurements are not to be trusted.
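The effect is easy to miss if you only look at the headline accuracy. The sketch below shows the comparison worth making: agreement with the gold labels, plus the set of tuples that disagreed. The labels are invented toy data, not the ninety-tuple gold set, and the function name agreement is hypothetical.

```python
def agreement(gold: list[str], verdicts: list[str]) -> tuple[float, set[int]]:
    """Return (fraction agreeing with gold, indices where the judge disagreed)."""
    if not gold or len(gold) != len(verdicts):
        raise ValueError("gold and verdicts must be non-empty and the same length")
    disagreements = {i for i, (g, v) in enumerate(zip(gold, verdicts)) if g != v}
    return 1 - len(disagreements) / len(gold), disagreements


# Toy human labels and two judge runs that each miss one tuple, but not the same one.
gold = ["YES", "NO", "YES", "YES", "NO", "YES", "NO", "YES", "YES", "NO"]
run_a = list(gold)
run_a[1] = "YES"
run_b = list(gold)
run_b[6] = "YES"

acc_a, miss_a = agreement(gold, run_a)
acc_b, miss_b = agreement(gold, run_b)
print(acc_a == acc_b, miss_a, miss_b)  # True {1} {6}
```

Identical accuracy with different disagreement sets means individual task verdicts can flip between runs even when the aggregate number looks stable, so comparing the sets is more informative than comparing the percentages.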

The other thing the calibration exercise gave me, which I did not expect, was a debugging tool. Forcing myself to read every criterion and every response carefully turned up two latent bugs I had been staring through for months. One task had a judge criterion demanding response-side coverage that the prompt never asked for. Three other tasks had fileMatches regexes silently failing because they used JavaScript-incompatible inline flags. The eval harness was not just measuring the agent. It was measuring my evaluation of the agent, and finding it wanting.

What the Scoreboard Said

With the harness real, I ran a sweep across three models on a 54-task suite, at k=5, under the calibrated judge. The headline numbers, which now live on the plugin’s docs site and auto-update on every newly blessed baseline:

The newer gemini-3.1-flash-lite solves 74.1% of tasks at solve^5. The older gemini-2.5-flash, supposedly a tier up, solves 57.4%. The local gemma4:e4b running on my own hardware solves 14.8%. A single full sweep costs about thirty cents per model in steady state.

That per-sweep number is the honest one for ongoing measurement, but I should be clear about what the build phase actually cost. Between the judge-calibration runs, the four candidate-judge measurements against my gold set, the three full re-baselines, and the iteration passes that came with all of it, yesterday alone ran me $8.12 across my Gemini Scribe API key and the dedicated judge key. That is the number to plan around if you are building your own. The thirty cents is what it costs once the scoreboard exists and you are just checking whether your latest change moved the needle.

And those are just the API numbers. The real investment was a week of my time, which is the cost you should weigh hardest. It pays back the moment you want to evaluate any change to the agent loop with confidence instead of vibes, which from here is every release I cut.

That first result answered the pricing question for me cleanly. Within a model family, the tier names mean what they say. Pro is more capable than Flash, Flash is more capable than Flash Lite, and you pay accordingly. The interesting thing is what happens across families and releases. The price-to-capability frontier moves fast enough that the newest model in a cheaper family can dominate an older default from a pricier one. That is what happened here. Gemini 3.1 Flash Lite, the newest Flash Lite, beats Gemini 2.5 Flash by about seventeen percentage points on solve^5 on agentic tasks (multi-step tool use, retrieval, edit-then-verify), and costs less per token than the Gemini 2.5 Flash it replaces. The next release of Gemini Scribe will move the default model from Gemini 2.5 Flash to Gemini 3.1 Flash Lite, which means users get a quality upgrade and a cost cut at the same time. Without the scoreboard I would have stayed loyal to a tier name and spent another six months recommending the more expensive, less capable model.

The Ollama numbers were harder to swallow but just as useful. The local Gemma model is genuinely good at the easy T1 tier (a single tool call against a tiny corpus), hitting 100%, and then it collapses. It drops to about 15% on T2 (two or three tool calls with light distractors), 7% on T3 (multi-step, distractor-heavy), and 11% on T4 (frontier-class hop chains and cross-note aggregation). Flash Lite stays above 65% on every tier. The honest version of the local-model story is that today’s open weights running on a laptop will handle simple lookups (find this file, summarize this note) cheerfully, and will fall over on anything that requires chaining tools or holding a multi-step plan together. That is useful to know. It tells me what to recommend (try local for casual queries, stay on cloud for real work) and it gives me a concrete target to retest against when the next generation of open models lands.

The difficulty breakdown is what makes this kind of comparison possible. A suite where every model passes everything, or where no model passes anything, is not measuring anything useful. The whole point is the gradient. T1 is a regression canary that any model worth running has to clear. T2 through T4 are where open models and frontier models actually separate, and where the suite earns its keep.
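A per-tier breakdown of this kind is just a group-by over task results. As a rough sketch, assuming each task carries a difficulty label and a boolean for whether it was solved on all k runs (the function name and the toy results below are hypothetical, not numbers from the sweep):

```python
from collections import defaultdict


def solve_rate_by_tier(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (difficulty, solved_on_all_k_runs) pairs into a per-tier solve rate."""
    totals: dict[str, int] = defaultdict(int)
    solved: dict[str, int] = defaultdict(int)
    for tier, solved_all_k in results:
        totals[tier] += 1
        solved[tier] += int(solved_all_k)
    return {tier: solved[tier] / totals[tier] for tier in sorted(totals)}


toy_results = [
    ("T1", True), ("T1", True),
    ("T2", True), ("T2", False),
    ("T3", False), ("T3", False),
]
print(solve_rate_by_tier(toy_results))  # {'T1': 1.0, 'T2': 0.5, 'T3': 0.0}
```

A suite-wide average would flatten exactly the gradient this output exposes, which is why the tier label is worth carrying on every task.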

The Benchmark Is Open

The harness, the 54-task suite, the judge calibration set, and the methodology docs all live in the obsidian-gemini/evals directory. The README walks through adding a new task in about five minutes, and the existing tasks are organized by category (retrieval, multi-hop, aggregation, conflict, write, edit, negative-space, safety, memory) so a new contribution has a fixture pattern to clone from.

If you are working with agents inside Obsidian or any other markdown wiki, I would love contributions. Especially tasks that exercise corners of the agent I have not thought of. Weird vault layouts. Exotic frontmatter conventions. Prompt-injection payloads you have actually seen in the wild. Multi-step plans that catch the model out. A benchmark is a public good, and it only gets sharper the more people sharpen it. Open an issue or a PR and let’s make this the thing that did not exist when I went looking for it.

What I Would Tell You If You Were Starting

If you are building an agent and you have been operating on vibes, here is the short version of what I would tell you over coffee.

Start with pass^k, not single-run pass rates. The reliability framing is the one that survives contact with production. Run each task at least three times for development, at least five for any decision you are going to publish or block a merge on.

Score the side effects, not the words. The model can say it did the right thing while doing nothing of the sort. State-based assertions on what actually changed in the world are the only honest scoring you can do for tasks that mutate anything.

Make efficiency a pass criterion. A tool-call budget is a one-line addition to a task definition and it catches an entire category of “the agent technically solved it” results that are not actually wins.

If you are using an LLM as judge, calibrate it against human labels at least once, and remember that judge nondeterminism is a real source of measurement noise even at temperature zero.

Treat the scoreboard itself as a debugging tool. The discipline of writing down what “good” looks like, in machine-readable form, surfaces problems with your tasks, your criteria, and your assumptions that no amount of squinting at session transcripts will. The eval harness paid for itself the first time it told me my judge was asking the wrong question, before it ever told me anything useful about the agent.

The vibes were never going to scale. The scoreboard does. The strangest thing about building it has been realizing how much of what I thought I knew about my own agent was wrong, in small but consistent ways, in the direction of being too generous. That is not a moral failing. It is what happens when the system you are measuring does not sit still. You need an instrument. So I built one. Next time someone asks me how much better my change made the agent, I have a number.

[Image: A hand-drawn map on a workbench with a half-built mechanical instrument being assembled directly on top of it.]

Agents as Building Blocks

May 19, 2026 / May 21, 2026 Allen Hutchison

There’s a thread running through the last year of my writing and my work, and I didn’t fully see it until now.

Last September, I wrote Full Circle, about going back to building after years of leading teams. I wanted to be in the driver’s seat for what I called the agentic shift. I wanted to feel the code under my fingers again, to be close enough to the technology that I could form my own opinions about where it was going.

Then I spent six months drawing the map. The Agentic Shift was twelve essays on what agents are, how they work, and what it means to build them well: anatomy, memory, tools, guardrails, multi-agent coordination, production readiness. It was a theoretical framework, written while I was getting my hands dirty on the Gemini CLI team.

And then, in January, I wrote Everything Becomes an Agent, the practitioner’s version. Not theory anymore. I’d watched Gemini Scribe grow from a chat window into a full agent. I’d seen the CLI team go from talking about code to writing and executing it. I’d noticed a pattern repeating across every AI project I touched: given enough time, they all converged on the same architecture. Tools. Loops. Policies. Judgment.

The Antigravity SDK is the second agent product I’ve worked on at Google. Gemini CLI was the first, and it’s where I learned what an agent runtime actually needs: a policy engine, a tool pipeline, lifecycle hooks, a trust model that scales from “let me approve every file write” to “here are the guardrails, go handle it.” The SDK is the next step. Taking everything I learned building one agent and making it possible for everyone to build their own.

Today we’re launching the Antigravity SDK in Preview. The official announcement covers the features (what the SDK does, how to install it, what you can build). This post is about the why. Why this SDK, why this design, and why it matters to me.

What Is an Agent SDK, Really

Here’s something I find fascinating: people have wildly different ideas about what “agent SDK” means.

For some, it’s a way to automate the coding agent. You take the AI that already lives inside your IDE (Antigravity, Cursor, Copilot), and you script it. Pipe in a task, get back a diff. The SDK is an extension of your development environment. That’s a legitimate philosophy, and there are good products built on it.

But that’s not what I wanted to build.

To me, an agent SDK gives you an agent that you can incorporate into your software. Not an extension of your IDE. A building block. Something you import into your Python project the same way you’d import a database client or an HTTP library, and then you use it to solve a problem. The agent is a component in your system, not a wrapper around your workflow.

I’ve watched this pattern play out across Gemini Scribe, the Podcast RAG prototype, and a dozen smaller projects. Software that starts as a script, grows a tools array and a while loop, and eventually looks an awful lot like an agent. I wouldn’t claim that every AI project becomes an agent. But the pattern is durable for a huge class of software problems. And if that convergence is real, if a meaningful number of AI applications end up needing tools, memory, judgment, and guardrails, then the SDK should make that convergence frictionless.

The key distinction is this: the agents you build with the Antigravity SDK aren’t extensions of your developer tools, although they can do development work. They’re independent pieces of software that happen to be implemented as agents. They live in your codebase, run on their own, and do real work.

Let me show you what I mean.

Three Agents That Prove the Point

Two of my favorite examples ship with the SDK, and we use both of them on the SDK project itself on a regular basis. They live in the examples directory on GitHub.

The first is the docstring maintenance agent. You point it at a directory, and it audits every Python file for missing or incomplete docstrings, then fixes them, all following the Google Python Style Guide. It knows which tools it’s allowed to use (read files, list directories, edit .py files in the target directory, and nothing else). It has a policy engine that enforces those boundaries. It runs, does its job, and exits.

The second is the documentation maintenance agent. Same idea, different problem: it scans your project’s documentation for staleness, checks it against the current state of the code, and updates what needs updating.

Here’s what I love about these two examples. They’re coding-related tasks, but they aren’t extensions of my IDE. They’re standalone programs. I don’t run them inside my editor. I run them from the command line, or from a CI job, or from a cron schedule. They happen to be implemented as agents because an agent is the right abstraction for “read a bunch of files, reason about their quality, and make targeted edits.” If I’d built these as scripts, I would have ended up writing a brittle classifier full of if/else branches to decide what to fix and how. The agent architecture deletes that complexity.

We use both of these on the SDK project itself. The SDK maintains its own documentation with its own agents. There’s a satisfying recursion to that.

But I want to push the point further, because the SDK isn’t just for coding tasks. Here’s a completely different kind of agent, a personal knowledge graph I wrote that connects to my Workspace MCP server and answers questions about my Drive, Docs, Gmail, and Calendar:

import asyncio

from google.antigravity import Agent, LocalAgentConfig, types
from google.antigravity.utils import interactive


async def main():
    workspace_mcp = types.McpStdioServer(
        command="node",
        args=["/Users/adh/src/workspace/workspace-server/dist/index.js"],
    )
    system_instructions = (
        "You are a Personal Knowledge Graph Agent. Your goal is to help the user "
        "navigate and synthesize information from their Google Workspace "
        "(Drive, Docs, Gmail, Calendar). You can search for documents, "
        "read emails, and check calendar events to answer questions "
        "and help the user connect the dots."
    )
    config = LocalAgentConfig(
        system_instructions=system_instructions,
        mcp_servers=[workspace_mcp],
        capabilities=types.CapabilitiesConfig(
            enabled_tools=types.BuiltinTools.read_only(),
        ),
    )
    async with Agent(config) as agent:
        print("Knowledge Graph Agent ready. Ask me anything about your Workspace.")
        await interactive.run_interactive_loop(agent)


if __name__ == "__main__":
    asyncio.run(main())

This agent has nothing to do with coding. It’s a personal productivity tool that connects to my Google Workspace via MCP and lets me query my own data in natural language. It’s about 20 lines. It’s read-only by design. And it uses the same SDK, the same patterns, the same trust model as the docstring agent.

Three examples, three completely different domains: autonomous code maintenance, documentation upkeep, personal knowledge synthesis. All built with the same building blocks. That’s the vision.

Batteries Included, Layers When You Need Them

When designing this SDK, I kept coming back to one principle: batteries included. I wanted it to be really easy to put together an agent that worked for you. Easy to grow your application when you needed more sophistication. Easy to dive into the internals when the situation required it.

Here’s what a functional agent looks like:

import asyncio

from google.antigravity import Agent, LocalAgentConfig


async def main():
    config = LocalAgentConfig()
    async with Agent(config) as agent:
        response = await agent.chat("What files are in the current directory?")
        print(await response.text())


if __name__ == "__main__":
    asyncio.run(main())

That’s it. About 10 lines of real code. That agent can read files, edit code, run shell commands, search directories, all out of the box. You didn’t have to configure tools, set up a model connection, or wire up a conversation loop. The batteries are included.

But batteries included doesn’t mean batteries only. I designed the API in three layers, and knowing which layer to reach for is part of the design.

Layer 1: Agent. The highest level. Create an agent, give it a prompt, get results. This is where most people start, and many people stay. It manages the full lifecycle (connection, conversation, tools, hooks, policies) in a single async with block. If you just need an agent that does a job, this is your entire API surface.

Layer 2: Conversation. This is the implementation layer. Conversations, hooks, policies, MCP servers, custom tools, structured output. Conversation wraps a Connection with step history, turn tracking, and convenience methods. This is where you shape behavior. You add guardrails through the declarative policy engine. You inject lifecycle hooks, and the SDK gives you three distinct types: Inspect hooks for read-only observability, Decide hooks for policy decisions (allow/deny), and Transform hooks that can modify data in flight. You wire up MCP servers and your own Python functions as tools.
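The difference between the three hook types is easiest to see in their shapes. The following is a plain-Python illustration of the idea and not the Antigravity SDK’s hook API. ToolCall, HookPipeline, the run method, the run_command tool name, and the order in which the hooks fire are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


# Inspect hooks observe only, Decide hooks return allow/deny, Transform hooks rewrite.
InspectHook = Callable[[ToolCall], None]
DecideHook = Callable[[ToolCall], bool]
TransformHook = Callable[[ToolCall], ToolCall]


@dataclass
class HookPipeline:
    inspect: list[InspectHook] = field(default_factory=list)
    decide: list[DecideHook] = field(default_factory=list)
    transform: list[TransformHook] = field(default_factory=list)

    def run(self, call: ToolCall) -> Optional[ToolCall]:
        """Return the (possibly rewritten) call, or None if any Decide hook denies it."""
        for hook in self.inspect:
            hook(call)  # read-only observability; return value is ignored
        if not all(hook(call) for hook in self.decide):
            return None  # a single deny blocks the call
        for hook in self.transform:
            call = hook(call)  # each Transform hook may replace the call
        return call


seen: list[str] = []
pipeline = HookPipeline(
    inspect=[lambda c: seen.append(c.name)],
    decide=[lambda c: c.name != "run_command"],
    transform=[lambda c: ToolCall(c.name, {**c.args, "dry_run": True})],
)
allowed = pipeline.run(ToolCall("view_file", {"path": "notes.md"}))
blocked = pipeline.run(ToolCall("run_command", {"cmd": "ls"}))
print(allowed, blocked, seen)
```

Keeping the three signatures distinct means a logging hook cannot accidentally block or rewrite a call, which is presumably much of the value of separating them.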

Layer 3: Connection. The lowest level. Connection is the abstract interface for talking to an agent backend. ConnectionStrategy knows how to establish one for a specific runtime. Today, we ship a local connection strategy that runs the agent on your machine. On the roadmap: remote connection strategies that let the same agent code deploy to the cloud without a rewrite.

Here’s the neat thing about this layer. Because Connection is an abstraction, you could conceivably wire up other agent runtimes behind it. We do this internally. We have several different ways of talking to our agent harness, and they all work through the same Connection interface. Your agent code doesn’t know or care which one is running underneath.
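In plain Python, that kind of abstraction might look like the sketch below. Connection and ConnectionStrategy here are local stand-in classes defined in the example, not imports from the SDK, and the send and connect methods, EchoConnection, EchoStrategy, and ask are all hypothetical names chosen for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class Connection(ABC):
    """Stand-in for an abstract channel to some agent backend."""

    @abstractmethod
    async def send(self, message: str) -> str: ...


class ConnectionStrategy(ABC):
    """Stand-in for something that knows how to establish a Connection."""

    @abstractmethod
    async def connect(self) -> Connection: ...


class EchoConnection(Connection):
    async def send(self, message: str) -> str:
        return f"echo: {message}"  # a fake backend that repeats the prompt


class EchoStrategy(ConnectionStrategy):
    async def connect(self) -> Connection:
        return EchoConnection()


async def ask(strategy: ConnectionStrategy, prompt: str) -> str:
    """Caller code depends only on the abstract interface, not the runtime."""
    connection = await strategy.connect()
    return await connection.send(prompt)


print(asyncio.run(ask(EchoStrategy(), "hello")))  # echo: hello
```

Swapping EchoStrategy for a strategy that targets a different runtime would leave ask untouched, which is the property the paragraph above describes.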

The philosophy is: easy to start, easy to grow, easy to go deep. You shouldn’t need to understand the Connection layer to write your first agent. But when you need it, when you’re building something that requires custom streaming, session resumption, or a novel deployment target, it’s there, and it’s a clean abstraction, not a hack.

One detail I’m particularly proud of: the trust model adapts to the deployment context. The base AgentConfig is deny-by-default. It defaults to read-only tools, and if you try to enable write tools or MCP servers without a safety policy, the Agent refuses to start. Enforced at the framework level. LocalAgentConfig takes a different posture. Since it runs on your own machine, it enables every tool, scopes file operations to the workspaces you’ve configured, and gates shell commands behind a user confirmation prompt by default. You’re developing locally; you probably want your agent to actually do things, but you also probably want a chance to look before it runs rm -rf. The trust gradient is baked into the architecture.
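
The “refuses to start” rule can be sketched as a startup check. This is an illustration under assumptions, not the SDK’s code: SketchConfig, READ_ONLY_TOOLS, and validate_config are hypothetical names, and the tool names are placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hypothetical read-only tool names, chosen only for illustration.
READ_ONLY_TOOLS: FrozenSet[str] = frozenset({"view_file", "list_dir"})


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a deny-by-default agent config."""
    tools: FrozenSet[str] = READ_ONLY_TOOLS
    mcp_servers: Tuple[str, ...] = ()
    policy: Optional[object] = None  # any non-None value counts as a safety policy


def validate_config(config: SketchConfig) -> None:
    """Raise before startup if risky capabilities lack a safety policy."""
    write_tools = config.tools - READ_ONLY_TOOLS
    if (write_tools or config.mcp_servers) and config.policy is None:
        raise ValueError(
            "write tools or MCP servers need a safety policy: "
            f"{sorted(write_tools)} {list(config.mcp_servers)}"
        )
```

The default configuration passes the check untouched. Adding a write tool or an MCP server without a policy fails before any agent code runs, which is the point of enforcing it at the framework level.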

Lessons Encoded

If you’ve been following along with my writing, the SDK might feel familiar. That’s intentional.

The twelve-part Agentic Shift wasn’t just an intellectual exercise. It was the blueprint. Every essay mapped a concept that eventually became a feature.

In Everything Becomes an Agent, I wrote: “If you’re writing if/else logic to decide what the AI should do, you might be building a classifier that wants to be an agent.” The SDK takes that literally. You don’t build classifiers; you define tools and let the model decide which ones to use. The complexity moves from branching logic to capability definition.

I wrote about building a “sudoers file for AI”, a permission system for agents. That became the policy engine. policy.allow("view_file"). policy.deny("*"). Declarative, composable, deny-by-default. You express what’s allowed, and the framework enforces it.
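
As a rough sketch of what “declarative, composable, deny-by-default” can mean in practice, here is a tiny rule engine. It only borrows the allow and deny method names from the snippet above. The class itself, the is_allowed method, glob matching, and first-match-wins ordering are all assumptions made for illustration, and the SDK may resolve rules differently.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple


class SketchPolicy:
    """Hypothetical policy: ordered glob rules, first match wins."""

    def __init__(self) -> None:
        self._rules: List[Tuple[str, bool]] = []

    def allow(self, pattern: str) -> "SketchPolicy":
        self._rules.append((pattern, True))
        return self  # returning self lets rules be chained

    def deny(self, pattern: str) -> "SketchPolicy":
        self._rules.append((pattern, False))
        return self

    def is_allowed(self, tool_name: str) -> bool:
        for pattern, allowed in self._rules:
            if fnmatchcase(tool_name, pattern):
                return allowed
        return False  # no rule matched: deny by default


policy = SketchPolicy().allow("view_file").deny("*")
```

With these two rules, view_file is permitted and everything else is refused. An empty policy refuses everything, so forgetting to write a rule fails closed rather than open.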

I wrote: “The real complexity isn’t in the code; it’s in the trust.” That conviction shaped the hook system. Hooks give you visibility into every tool call, before and after. Policies give you control. Together, they manage the trust relationship between you and the agent. The SDK doesn’t ask you to trust blindly; it gives you the instruments to verify.

And I wrote: “A hammer does nothing unless you swing it. But an agent? An agent can work while you sleep.” That’s the promise. The SDK is the handle.

These aren’t abstract design principles that I reverse-engineered to sound good in a blog post. They’re lessons learned from building Gemini Scribe, from contributing to Gemini CLI, from watching every project I touched converge on the same agentic patterns. I drew the map, I lived the map, and then I got to build the territory.

The Team

I want to be clear about something. I didn’t build this alone.

I did most of the design for the Python SDK (the API surface, the three-layer architecture, the philosophy behind “batteries included”), and a lot of that design came from the writing I’ve been doing this past year. But design is the easy part. The hard part is building something real, and that was a team effort.

A talented group of engineers worked with me on this. On the SDK implementation, on the test infrastructure, on the Go harness underneath that actually runs the agent, on the internal connection strategies, on the MCP bridge, on a hundred decisions that don’t show up in a blog post but absolutely show up in the quality of the software. The SDK exists because of their work, and it’s better than anything I could have built on my own.

Preview, and an Invitation

We’re shipping this as a Preview. Not “1.0.” That’s deliberate.

The API surface will change. We know that. We’ll evolve it based on feedback from you and from our own continued use of the SDK, because we use it too, every day, on the project itself. There are things we haven’t figured out yet. There are patterns we haven’t discovered. That’s the point of a preview: to learn in the open.

So here’s the invitation: build something. Build a documentation bot, a knowledge graph, a CI pipeline agent, a personal assistant. Build something I haven’t imagined. Break something. Tell us what’s missing, what’s awkward, what delights you. File an issue. Open a PR. Argue with us about the API.

Last September, I wrote that I was going back to building because “for a builder, there’s no more exciting place to be.” The Agentic Shift was the map. The SDK is the territory.

Come explore it.

The Antigravity SDK is available now as a Preview. Install it with pip install google-antigravity, read the official announcement for feature details, and find the source on GitHub.

A futuristic glowing notebook on a wooden desk with a cup of coffee and floating geometric shapes.

Reading List 6

May 17, 2026 Allen Hutchison

This week’s reading list is a mix of high-level theory and low-level pragmatism. I found myself bouncing between the philosophical implications of how we build AI and the immediate satisfaction of writing a good Go component.

[article] The Century-Long Pause in Fundamental Physics. The author argues that physics has stagnated by swapping “ontology-first” theory for mathematical models that merely fit data. This debate mirrors the current machine learning dispute over whether LLMs build internal world models or just pattern-match at scale, an open empirical question that mechanistic interpretability research is now trying to settle.

[release] Onyx Has Released a New Remote Page Turner Called Tappy. I wish Amazon would support page turners for their Kindle line. It would be great if they supported a device as delightful as this one.

[blog] The agent principal-agent problem. This is a great look at one of the biggest problems with agentic development: code review. In my open source work, I now use a pattern where I work with an agent to make a change, test it locally, and create a pull request before having another agent review the code. This back-and-forth works well and balances efficiency with keeping my own mental model of the codebase intact.

[article] ReMarkable Paper Pure wants to be the only notebook you’ll ever need. I have always liked the reMarkable tablets, but every time I try one I miss having my Kindle library alongside it. Reading and writing are deeply linked for me, which is why I recently got a Kindle Scribe Colorsoft and found it really hits the mark for what I want.

[blog] Just Fucking Use Go. I have been working on a project that has a Go component to it recently. This is the first time I have really started to look at the language, and it inspires me to spend more time with it.

I built my 7MB Full AI Terminal in Rust & Tauri. This is a neat open source AI terminal. It feels similar to Warp but is a lot smaller.

[article] Computer Use Is 45x More Expensive Than Structured APIs. I am not surprised at all by these findings. I think computer use will remain a last resort, and a lot of apps will expose some kind of API for an agent to use instead. My guess is that this eventually becomes the way we automate unmaintained applications that need to fit into an agentic workflow.

A futuristic clockwork mechanism with glowing nodes, representing community collaboration, automated tasks, and precise measurement.

Automation and Measurement: Inside Gemini Scribe 4.8.0

May 9, 2026 Allen Hutchison

I recently wrapped up the development cycle for Gemini Scribe 4.8.0. Looking back at the ~99 pull requests merged over the last month, I’m struck by the sheer volume of changes. Not only are we shipping major features, but I’m also seeing a steady uptick in contributions from collaborators, an increase in issues filed by the community, and much more activity in our discussion group. Beyond the changelog and community growth, two structural narratives define this release: automation and measurement.

As I discussed in the evolution of Gemini Scribe, the goal has always been to move beyond a simple chat interface. With 4.8.0, we are taking a massive step toward making the agent a true background worker in your vault.

Here is a look at the architecture, the code, and what this release means for the future of our agentic workflows.

The Push for Automation

For a long time, running a complex agent task meant staring at a blocking UI. If you asked the agent to perform deep research or generate an image, you waited.

To solve this, we introduced a unified background execution lane. The new BackgroundTaskManager allows tools like DeepResearchTool and GenerateImageTool to accept a background: true parameter. The agent submits the task, receives an ID immediately, and returns to its turn. You can monitor these tasks in the new Gemini Activity modal, which consolidates background tasks and RAG indexing status into one view.
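
The submit-and-return-an-ID pattern can be sketched in a few lines. The plugin itself is TypeScript; Python and asyncio are used here purely for illustration, and SketchBackgroundTasks and its methods are hypothetical names rather than the BackgroundTaskManager’s real interface.

```python
import asyncio
import uuid
from typing import Any, Coroutine, Dict


class SketchBackgroundTasks:
    """Hypothetical manager: submit returns an ID without waiting for the work."""

    def __init__(self) -> None:
        self._tasks: Dict[str, "asyncio.Task[Any]"] = {}

    def submit(self, coro: Coroutine[Any, Any, Any]) -> str:
        # Must be called from inside a running event loop.
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = asyncio.create_task(coro)
        return task_id

    def status(self, task_id: str) -> str:
        task = self._tasks[task_id]
        if not task.done():
            return "running"
        if task.cancelled():
            return "cancelled"
        return "failed" if task.exception() else "done"

    async def result(self, task_id: str) -> Any:
        return await self._tasks[task_id]


async def demo() -> tuple:
    manager = SketchBackgroundTasks()

    async def slow_research() -> str:
        await asyncio.sleep(0.01)  # stands in for a long-running tool
        return "report"

    task_id = manager.submit(slow_research())
    before = manager.status(task_id)        # caller is not blocked here
    report = await manager.result(task_id)  # only wait when the result is needed
    return before, report, manager.status(task_id)
```

The caller gets control back as soon as submit returns, and a status view like the Activity modal only needs the ID to look the task up later.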

But unblocking the UI was only half the battle. We wanted to lay the groundwork for an agent that operates in the background. While true autonomy is a spectrum, the first step is moving away from the chat box and into scheduled, asynchronous workflows.

The Scheduled Task Engine

The marquee feature of 4.8.0 is the full task scheduling system. You can now define a task as a markdown file, and the plugin will run it on a cadence as a headless agent session, writing the output back to the vault.

To make this work, we built a ScheduledTaskManager with a 60-second tick loop. Tasks are stored in [state-folder]/Scheduled-Tasks/ with a sidecar JSON file for state. The headless ScheduledTaskRunner mirrors the standard AgentViewTools but auto-approves all tool calls.

We also expanded the schedule grammar. Originally, daily meant “every 24 hours from creation,” which surprised users. Now, you can specify daily@HH:MM and weekly@HH:MM:DAYS, so you can finally tell the agent to run “every weekday at 4:30 PM.”
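
Here is a small sketch of how specs in that grammar could be parsed and resolved to a next run time. The daily@HH:MM and weekly@HH:MM:DAYS shapes come from the text above, but the format of DAYS is an assumption: this sketch treats it as comma-separated three-letter day names such as mon,tue, which may not match the plugin. The function names are hypothetical, and Python is used only for illustration.

```python
from datetime import datetime, time, timedelta
from typing import Set, Tuple

# Assumed day tokens; the plugin's real DAYS syntax may differ.
DAY_INDEX = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6}


def parse_schedule(spec: str) -> Tuple[time, Set[int]]:
    """Return (time of day, allowed weekdays) for a daily@ or weekly@ spec."""
    kind, _, rest = spec.partition("@")
    if kind == "daily":
        hhmm, days = rest, set(range(7))
    elif kind == "weekly":
        # "HH:MM:DAYS": everything after the last colon is the day list
        hhmm, _, day_part = rest.rpartition(":")
        days = {DAY_INDEX[d.strip().lower()] for d in day_part.split(",")}
    else:
        raise ValueError(f"unsupported schedule: {spec!r}")
    hour, minute = (int(part) for part in hhmm.split(":"))
    return time(hour, minute), days


def next_run(spec: str, now: datetime) -> datetime:
    """First matching wall-clock time strictly after now."""
    at, days = parse_schedule(spec)
    for offset in range(8):  # today plus the next seven days covers any weekly spec
        candidate = datetime.combine(now.date() + timedelta(days=offset), at)
        if candidate > now and candidate.weekday() in days:
            return candidate
    raise ValueError(f"schedule never fires: {spec!r}")
```

Under that assumed syntax, “every weekday at 4:30 PM” would be written weekly@16:30:mon,tue,wed,thu,fri. Anchoring to wall-clock time rather than to the creation timestamp is what removes the “every 24 hours from creation” surprise.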

We also handle missed runs gracefully. On startup, any task with runIfMissed: true that missed its window surfaces in a CatchUpModal.

Right now, this is essentially a highly intelligent cron job. You are still explicitly telling the agent when to run. But this scheduling engine is the foundational infrastructure for what comes next. In the next release, we are introducing Obsidian lifecycle hooks. Instead of just running on a timer, the agent will be able to react to events, triggering workflows when you create a new file, save a note, or modify a project board. That is where we cross the threshold into true ambient AI.

How I Use This in Practice

To give you an idea of what this unlocks, I currently rely on a few specific scheduled workflows:

The Daily Setup: Every afternoon, a scheduled skill runs to prepare my vault for the following day. It looks up my calendar, creates my daily note if it doesn’t exist, and seeds it with my upcoming meetings. It goes a step further by creating individual meeting note entries and building out context notes for the people I’ll be meeting with. When I walk into the office the next morning, my daily note is already prepped and ready to go.

Automated Blog Drafts: I also use this to automate my content pipeline. I have a scheduled skill that monitors my Readwise syncs and automatically generates drafts for my “Reading List” blog posts. Instead of manually curating and formatting these, the agent handles the heavy lifting in the background, leaving me to just review and polish the draft.

If you are worried about the agent running amok in your vault while you aren’t looking, there are several ways to mitigate this. You can limit the tools the agent has access to. If you don’t want it overwriting files, you can simply restrict its write access. Additionally, the agent’s response from any scheduled task is always saved in the Scheduled-Tasks/Runs file, giving you a complete audit log of what the agent had to say during the session.

In my case, I’m automating skills that I’ve been running manually for a while now, and I run my agent in a mode where I let it write and edit files day-to-day. You should set up your tasks to match your own comfort level. You can read more about how to configure this in the Scheduled Tasks Documentation.

Extracting the Agent Loop

To support headless scheduled tasks, I had to refactor how the agent executes tools. Previously, the tool-execution loop was tightly coupled to the UI in AgentViewTools.

I extracted this logic into a UI-agnostic AgentLoop class. AgentViewTools shrank from 386 lines down to 187, becoming a thin adapter over AgentLoop with specific hooks (onToolBatchStart, onToolCallStart, etc.).

// Conceptual extraction of the AgentLoop
export class AgentLoop {
  constructor(private engine: ToolExecutionEngine) {}
  
  async execute(turn: AgentTurn) {
    // Iterative tool execution, removing the recursive stack-depth ceiling
    while (this.hasPendingToolCalls(turn)) {
       // Loop detection, batching, and execution logic lives here
    }
  }
}

This extraction immediately paid dividends, catching bugs that a duplicate headless runner had introduced, and eliminating a recursive stack-depth ceiling on deep tool chains. More importantly, it means scheduled tasks, evals, and the UI all share the exact same execution engine.
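
To show why an iterative loop removes the stack-depth ceiling, here is a compact sketch of the same idea with a step budget and a naive repeat detector. The real AgentLoop is TypeScript; run_agent_loop and its callbacks are hypothetical, written in Python for illustration, and treating an exact repeated call as a loop is a deliberately simple stand-in for whatever detection the plugin uses.

```python
from typing import Callable, List, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument); hashable so repeats can be detected


def run_agent_loop(
    next_tool_calls: Callable[[list], List[ToolCall]],
    execute: Callable[[ToolCall], str],
    max_steps: int = 25,
) -> Tuple[list, str]:
    """Run tool batches until the model stops asking, loops, or exhausts the budget."""
    history: list = []
    seen = set()
    for _ in range(max_steps):            # a flat loop: depth never grows with chain length
        calls = next_tool_calls(history)
        if not calls:
            return history, "completed"
        for call in calls:
            if call in seen:              # naive loop detection: exact repeat of a call
                return history, "loop_detected"
            seen.add(call)
            history.append((call, execute(call)))
    return history, "step_budget_exhausted"


def scripted_model(history: list) -> List[ToolCall]:
    """Stand-in for the model: two reads, then done."""
    plan = [[("read", "a.md")], [("read", "b.md")], []]
    return plan[len(history)]
```

Because the loop takes its callbacks as arguments and knows nothing about a UI, the same function can sit behind a chat view, a headless scheduled run, or an eval harness.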

Local Models with Ollama and Gemma 4

First-class local-model support is here. By leveraging the ModelApi seam, chat, summarization, rewrite, and agent tool-calling all work against a local Ollama server. You can use any model from Ollama that supports tool calling, though I have personally only tested this extensively with Gemma 4.

In my local evaluation harness, Gemma 4 performed exceptionally well. It is incredibly capable, fast, and handles the agent loop with a level of reliability that makes local-only agentic workflows genuinely viable.

The way I use this right now is as an offline fallback: when I don’t have an internet connection, I switch to Gemma 4 and just keep working. Obviously, running offline means I don’t have access to online-dependent tools like Google Search, Deep Research, or Image Generation. But for synthesizing notes, organizing projects, or drafting content securely, it is incredibly powerful.

In the future, we will be refining the system to allow you to pick the model you want on a per-function basis. This means you’ll be able to route sensitive, local text processing to an offline model while still leveraging cloud models for heavy-lifting tasks like Deep Research or Image Generation when you are connected.

Moving from Guessing to Measuring

As the agent loop gets more complex (handling runaway loop aborts and budget constraints), we can no longer rely on “vibes” to know if a change improved the system.

To solve this, I built a new CLI-driven eval harness (npm run eval) that drives a live Obsidian instance. It captures turns, tool calls, token usage, cache ratios, and cost. Crucially, it measures reliability. By passing --repeat=N, the harness repeats each task to surface flakiness, reporting a pass^k metric. We can now test multi-hop retrieval and loop-trap cyclic references programmatically, ensuring the agent bails cleanly instead of spinning forever.
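
For readers who haven’t met pass^k before, one common definition is the probability that k independent attempts at a task all succeed, estimated from n repeats with c successes as C(c, k) / C(n, k). The sketch below assumes that definition. The harness’s exact formula isn’t shown here, and pass_hat_k is a hypothetical helper name.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k attempts all pass, given observed repeats."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(successes, k) is 0 when there are fewer than k successes
    return comb(successes, k) / comb(trials, k)
```

A task that passes 4 of 5 repeats scores 0.8 at k=1 but only 0.6 at k=2 and 0.0 at k=5. That is why the metric surfaces flakiness that a single run, or a plain average, would hide.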

For 4.8.0, the focus was getting this infrastructure in place and establishing the beginnings of our eval set. Having the harness is the first step; the next step is building out a robust suite of test cases that reflect real-world vault interactions.

I would love to see contributions from the community for the evals themselves! If you have complex agentic workflows or edge cases you want to ensure remain stable, please submit them. In the next release, we will start publishing the actual eval results and benchmarks directly in the repo so we can transparently track the agent’s performance over time.

What’s Next?

What does this implementation tell us about the future of software engineering and personal knowledge management?

We are seeing a clear shift toward ambient AI. The chat interface is a great starting point, but the true value of an agentic system is its ability to operate asynchronously. While the scheduling engine in 4.8.0 acts as a highly capable cron job, it lays the groundwork for the event-driven lifecycle hooks coming in the next release.

By combining the AgentLoop extraction with asynchronous execution, Gemini Scribe is no longer just a tool you use; it is becoming a system that reacts and works alongside you. When you can rely on a background orchestrator to run your housekeeping routines (like updating changelogs or triaging issues) while you eat dinner, the vault becomes a living, breathing entity. The agent becomes a true extension of your workflow, utilizing the built-in skills we’ve developed entirely in the background.

Gemini Scribe 4.8.0 is a massive architectural leap forward. The code is cleaner, the tests are faster (thanks to a Vitest migration), and the agent is more autonomous than ever.

If you want to dive into the specifics or try out the new scheduling grammar, check out the updated documentation on scheduled tasks.

Let me know what automated tasks you end up building. I’m already finding new ways to let the agent do the heavy lifting while I focus on the work that matters.

A wooden violin with holographic blueprints projecting from it on a workbench.

Reading List 5

May 5, 2026 Allen Hutchison

Today’s reading list is a mix of cautionary tales about our digital infrastructure and some fascinating glimpses into how AI is changing both software design and human interaction.

[article] GoDaddy Gave a Domain to a Stranger Without Any Documentation. Wow. This is a really chilling story. I’m glad that I don’t use GoDaddy for my domains.

[article] HashiCorp co-founder says GitHub ‘no longer a place for serious work’. GitHub is in a tough situation. If you look at the graphs they published from their April 28th outage, you can see that their growth rate is off the charts. Agentic coding has put strains on that infrastructure that no reasonable person or team could have been prepared for, and the result is a degraded experience and customers walking away.

[blog] Letting AI play my game – building an agentic test harness to help play-testing. There is something really satisfying about watching an agent test a product. I’ve been doing this a lot lately with my Gemini Scribe project, which I need to write about at some point.

[blog] How to use Deep Research with the Gemini API. Great writeup on how to use the latest version of the deep research agent. I’ve updated gemini-utils and my Gemini CLI deep research extension for the newest version of deep research as well.

[article] Meet Shapes, the app bringing humans and AI into the same group chats. It’s inevitable that AI is going to start showing up in more settings where people talk to each other.

[article] Statue of a man blinded by a flag put up by Banksy in central London. This seems like the perfect statue for our times.

[article] MIT’s virtual violin offers luthiers a new design tool. One of the things that makes string instruments so complex is that they are an interface between physics and nature. The wood imparts its own characteristics on top of the geometry. This is a neat project from MIT, but to really help luthiers they will also need to be able to model the woods used in these instruments.

[article] Instagram is testing optional ‘AI creator’ labels. I really think the industry has this backwards. We should be creating “human created” labels. We should assume all content is AI unless otherwise stated.

A spotlight shines on a pianist intensely playing a small, worn piano on a large, dark stage.

The Köln Concert and Creative Constraints

May 3, 2026 Allen Hutchison

This week I was reminded of a story I like to tell, and the value of constraints on creative work. When I’m working, I often set my constraints before I begin. For example, on an old agentic coding project, I set a few constraints: “The orchestration model must be Gemini Flash,” “All tool calls are through sub-agents,” and “Permissions and configurability are at the core of the agentic loop.” From that, I ended up with adh-cli, a policy-aware TUI for working with Gemini that inspired many of the features I worked on in Gemini CLI last year. The project itself is defunct now and not maintained, but the constraints gave me a great way to think about the project and forced creativity in other areas.

We run into constraints in many different ways. Maybe it’s time pressure: How many of you felt like you wrote your best papers 24 hours before they were due? Maybe it’s the environment, like you must integrate with a certain piece of software, or you have to design your system in a certain way. Maybe it’s self-imposed like my example with adh-cli.

Or maybe the constraint is philosophical. Take Mario Zechner’s Pi Agent, for example. In a blog post, Zechner expressed frustration with the bloat of modern AI coding assistants that try to do everything, describing them as “spaceships with 80% unused functionality.” In response, he built Pi around an “anti-framework” philosophy of radical minimalism. He intentionally constrained his default coding agent to just four fundamental tools: read, write, edit, and bash. By stripping away the hidden system prompts and unpredictable context injections, the tool forces developers to be intentional. It proves that you don’t need a massive, opaque framework to build highly capable AI workflows—sometimes, fewer tools create a sharper focus.

Whether it’s a self-imposed architectural rule or an anti-framework philosophy, these software constraints force us out of our default habits and into a space of deliberate, intentional design. Yet, in our day-to-day work, constraints are rarely celebrated. In fact, most of my conversations about constraints start with a complaint: people dislike a constraint because it was imposed on them externally. They see it as a restriction instead of a way to channel their creativity. To me, a constraint means that we shut down a huge portion of the exploration space. I don’t have to worry about a million different architectural choices because the constraint has made the decision for me. It is incredibly freeing. Whenever I try to help someone turn their mindset around, from fearing or being frustrated by constraints to being excited by them, I inevitably end up telling them the story of Keith Jarrett and the 1975 Köln Concert.

In 1975, a 17-year-old jazz fan named Vera Brandes organized a late-night concert at the Cologne Opera House. She managed to book Keith Jarrett, one of the most notoriously perfectionist jazz pianists of his generation. It was an ambitious undertaking, and almost immediately, it turned into a disaster.

Due to a backstage mix-up, the venue provided the wrong piano. Instead of the premier concert grand Jarrett requested, he was presented with a small rehearsal model. It was horribly out of tune, the pedals stuck, the high notes sounded tinny and harsh, and the bass lacked any resonance. Jarrett, exhausted and suffering from back pain, flat-out refused to play. It was only when Brandes followed him out into the pouring rain and begged him that he relented, taking pity on the teenager. “Never forget,” he told her. “Only for you.”

What happened next is legendary. Forced to play an unplayable instrument, Jarrett had to completely abandon his usual style. Because the high and low registers were awful, he confined his playing strictly to the middle of the keyboard. Because the piano was too quiet to fill the 1,400-seat opera house, he stood up and hammered the keys with immense physical force. To make up for the lack of resonance, he relied on rolling, repetitive, hypnotic rhythmic patterns in his left hand.

He embraced the limitations, and in doing so, he produced absolute magic. The recording, The Köln Concert, went on to become the best-selling solo jazz album in history.

I think about the Köln Concert all the time, especially lately as we navigate the current landscape of Artificial Intelligence and software architecture.

The Bloat of Infinite Resources

In modern software engineering, we are rarely handed a broken piano. We operate in an era of perceived infinite resources. Cloud computing gives us endless horizontal scaling. Context windows for Large Language Models have ballooned from a meager 4K tokens to 1 million or more. If an application is slow or an agent isn’t performing well, the default instinct is to throw more compute, more memory, or a larger model at the problem.

But infinite resources often breed intellectual laziness. When you have a 1-million token context window, you don’t have to think critically about what information actually matters. You just dump the entire codebase or the entire library of documents into the prompt and hope the model figures it out. It’s the equivalent of having a perfect Bösendorfer grand piano and just mashing all the keys at once.

A pragmatic engineering manager might push back here: Developer time is expensive. If I can solve a problem today by dumping an entire codebase into a 1-million token context window, isn’t throwing compute at it just good business?

It’s a fair question, and engineering is always about tradeoffs. But the tools have evolved—building a RAG pipeline doesn’t take a week anymore; with the right utilities, it takes minutes. More importantly, relying on infinite resources often hides long-term costs. When I built adh-cli, I made an explicit tradeoff: by routing everything through tightly scoped sub-agents, I was actually consuming more total tokens than a single massive prompt would use. But because my constraint forced me to use a much cheaper model (Gemini Flash), my bet was that the overall system would be far more cost-effective and resilient. AI doesn’t remove the need for architectural judgment; it exponentially increases it. You have to exercise good judgment to know when throwing compute at a problem is a calculated business decision, and when it’s just masking a fragile design.

The Innovation of Constraints

The most interesting work in AI right now isn’t happening where resources are unlimited. It’s happening at the edges, where constraints are severe.

Take local models, for example. When you’re trying to run an LLM on a consumer laptop or a Raspberry Pi, you don’t have the luxury of a 70-billion parameter model. You are forced to use a smaller, quantized model. This constraint forces you to build better architectures. You can’t rely on the model to “know” everything, so you have to optimize at the edge. Maybe you build robust Retrieval-Augmented Generation (RAG) pipelines. Maybe you implement sophisticated memory retrieval systems to surface exactly the right historical context just-in-time. Or maybe you break complex workflows down into tiny, focused sub-agents, each operating with its own tightly constrained context window. You have to craft highly specific, deterministic prompts.

# Instead of one massive prompt, constraints force modularity
def evaluate_code_chunk(chunk: str, context: dict) -> EvaluationResult:
    """
    A tightly scoped function that uses a small, fast local model
    to evaluate a specific piece of code, rather than dumping
    the whole repo into a massive API call.
    """
    prompt = build_focused_prompt(chunk, context)
    response = local_model.generate(prompt, max_tokens=256)
    return parse_evaluation(response)

Just like Jarrett avoiding the tinny upper register, we learn to avoid the weak points of our tools. We build guardrails. We write cleaner code. We design systems that are elegant because they have to be.

Finding Your Broken Piano

Of course, there is a survivorship bias to the Köln Concert. For every broken piano that produces a masterpiece, there are a hundred broken laptops that just result in missed deadlines. Not all constraints are good constraints. You can’t change the laws of physics, and if a structural limitation is genuinely preventing the work from happening, you have to reevaluate. The goal isn’t to suffer for the sake of suffering. But by starting with strict constraints, you force yourself to explore the boundaries. If you prove a task is impossible under those conditions, you can always loosen the constraints and expand your resources. But if you start with infinite resources, you never learn where those boundaries actually are.

Constraints are not the enemy of creativity; they are its prerequisite. Yes, accepting a severe constraint—especially an external one you didn’t choose—can be incredibly painful in the moment. Keith Jarrett hated his broken piano. He didn’t feel freed; he fought against it until he was forced to adapt. But like exercise or eating your vegetables, the value isn’t in the immediate comfort. It’s about the mindset shift. You accept the constraint to build a muscle, to stay fit, to force yourself to find a new path when the easy one is blocked. When we are stripped of our ideal tools and infinite runways, we are forced to abandon our default habits. Whether it’s the self-imposed design rules of adh-cli, the radical minimalism of Mario Zechner’s Pi Agent, or the physical limitations of a broken rehearsal piano in Cologne, constraints force us into a space of deliberate, intentional action.

If you want to build a truly resilient, innovative system, don’t start with the biggest, most expensive tools available. Start with a broken piano. Artificially constrain your resources. Limit your context window. See what you can achieve with a 7B parameter model instead of a flagship API, or see what happens when you strip your agent’s toolkit down to the bare essentials.

You might just find that the limitations force you to build something far better than you would have otherwise—a system that is elegant not in spite of its constraints, but because of them.

So, look around your current projects. Where are you relying on infinite resources to mask lazy architecture? And more importantly: what constraints have you come across in your own work that felt like a frustrating restriction at first, but turned out to be a blessing in disguise? I’d love to hear your stories.

A split illustration contrasting corporate AI surveillance with independent home computing.

Reading List #4

April 26, 2026 (updated May 3, 2026) Allen Hutchison

This week’s reading had a through line I wasn’t expecting. Almost every article circles back to the same question: who actually benefits when AI reshapes an industry? The answer isn’t always the people doing the work.

[article] Tech CEOs Think AI Will Let Them Be Everywhere at Once. All of the articles I’ve seen on these “management intelligence layers” feel very one-sided. The executive gains synthesized information and faster decision-making, but what do the employees get? Do junior and mid-career folks get better mentoring and coaching? I don’t think so. Collapsing the layers might be good for the bottom line, but is it good for people?

[blog] Figma’s woes compound with Claude Design. There is something fascinating about how frontier labs can reset product expectations overnight. The cost of entering new segments keeps dropping, which makes the world uncertain for SaaS companies and startups alike. This feels like a concrete example of the agentic shift playing out in real time.

[blog] DeepSeek V4 – almost on the frontier, a fraction of the price. Open-weight models just continue to improve. Simon Willison’s breakdown highlights the focus on efficiency here, not just raw capability. It may soon be possible to run frontier-class models on high-end home hardware, and that changes everything about who gets access.

[article] This Scammer Used an AI-Generated MAGA Girl to Grift ‘Super Dumb’ Men. We are living in a world where we have to assume that the content we are viewing is AI-generated. I think we should focus our efforts on tools that allow people to certify their content is real rather than trying to watermark AI content. The conversation around AI and creative authenticity is only going to get louder.

[article] I’ve been using “Ask Maps,” and it has forever changed Google Maps for me. I used the new Ask Maps feature extensively on my last trip and it felt like magic. Natural language queries against a map database are exactly the kind of AI application that just works, no prompt engineering required.

[article] You Should Have Exactly 3 Pairs of Headphones. Here’s Why. I’ve come to basically the same conclusion. Beats for workouts, AirPods Pro for every day, and AirPods Max for travel. The right tool for the right job applies to audio gear too.

A cinematic, retro-futuristic illustration of a high-tech developer workspace with a floating command-line interface, AI nodes, and glowing wireless earbuds.

Reading List #3

April 19, 2026 / May 3, 2026 Allen Hutchison

Today’s reading list is a mix of practical AI implementation, terminal tooling, and a glimpse into the future of human-computer interaction. It’s fascinating to see how quickly the conversation is shifting from “what can AI do?” to “how do we actually use this stuff?”

[article] You can now easily call LLMs from your messaging engine. Should you? Richard Seroter provides a really nice walkthrough on adding LLMs to Pub/Sub in Google Cloud. It’s a great example of bringing AI directly to the data pipeline.

[tool] Make Tmux Pretty and Usable. Tmux is pretty great, although I prefer Zellij. This article still gives you a bunch of solid tips on making Tmux useful and nice to look at if it’s your multiplexer of choice.

[article] Duolingo CEO Says They’ve Stopped Tracking Employees’ AI Use for Performance Reviews. Employees aren’t stupid. They understand that adopting AI, for all its ability to increase productivity, does nothing for them individually. There is no incentive, and that is why we keep seeing stories like this pop up.

[article] AirPods Pro 3 may let you talk to Siri without actually saying a word. This would be so cool. I remember this concept from the first time I read the Ender’s Game series, where the characters could talk with AI systems through subvocalization.

[article] 8 Tips for Writing Agent Skills. Writing skills is easy, but writing effective skills is much harder. My colleague Philipp has some great advice on how to craft instructions that agents will actually follow, which is a topic I’ve spent a lot of time thinking about recently.

A glowing terminal window overlapping with a polished desktop environment.

Reading List #2

April 18, 2026 / May 3, 2026 Allen Hutchison

Today’s reading list is dominated by the rapid evolution of AI tooling and the real-world implications of deployed models. It is a reminder that while the underlying models are improving, the interface layer and security guarantees are where the real battles are being fought.

[article] AI images are now being abused to fake evidence for vehicle insurance fraud. We have spent so much time as an industry trying to add watermarks like SynthID to AI-generated images, but I think we are looking at this backwards. Instead of trying to mark what is fake, we need to focus on building cryptographic guarantees that prove an image is actually real.

[release] Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All. My feed has been flooded with people talking about this new open-weight model and its agentic capabilities. I need to carve out some time this weekend to pull it down and see how it performs in my own local setup, especially as the agentic shift continues to accelerate.

[article] OpenAI’s Big Codex Update Is a Direct Shot At Claude Code. I haven’t spent much time in Codex lately, but this update has some genuinely interesting features. It is fascinating to watch the major players trade blows in the AI coding space, pushing the entire ecosystem forward in the process.

[release] The Gemini App Is Now on Mac. While I spend a lot of my time in the terminal with Gemini CLI, having Gemini as a native desktop experience right on my Mac is a massive quality-of-life improvement. It keeps you in the flow, and I can’t wait to see where the team takes the integration next.
