[Hero image: a cracked-open obsidian geode on a weathered wooden desk, revealing a glowing golden network of interconnected nodes, its light spilling toward open notebooks and a keyboard.]

Gemini Scribe: From Agent to Platform

Six months ago, I wrote about building Agent Mode for Gemini Scribe from a hotel room in Fiji. That post ended with a sense of possibility. The agent could read your notes, search the web, and edit files. It was, by the standards of the time, pretty remarkable. I remember watching it chain together a sequence of tool calls for the first time and thinking I’d built something meaningful.

I had no idea it was just the beginning.

In the six months since that post, Gemini Scribe has gone through fifteen releases, from version 3.3 to 4.6. There have been over 400 commits, a complete architectural rethinking, and a transformation from “a chat plugin with an agent mode” into something I can only describe as a platform. The agent didn’t just get better. It got a memory, a research department, a set of extensible skills, and the ability to talk to external tools through the Model Context Protocol. If the vacation version was a clever assistant, this version is closer to a collaborator who actually understands your vault.

I want to walk through how we got here, because the journey reveals something I think is important about building with AI right now: the hardest problems aren’t the ones you set out to solve. They’re the ones that reveal themselves only after you ship the first version and start living with it.

The Agent Grows Up

The first big milestone after the vacation was version 4.0, released in November 2025. This was the release where I made a decision that felt risky at the time: I removed the old note-based chat entirely. No more dual modes, no more confusion about which interface to use. Everything became agent-first. Every conversation had tool calling built in. Every session was persistent.

It sounds simple in hindsight, but killing a feature that works is one of the hardest decisions in software. The old chat mode was comfortable. People used it. But it was holding back the entire plugin, because every new feature had to work in two completely different paradigms. Ripping it out was liberating. Suddenly I could focus all my energy on making one experience truly great instead of maintaining two mediocre ones.

Alongside 4.0, I built the AGENTS.md system, a persistent memory file that gives the agent an overview of your entire vault. When you initialize it, the agent analyzes your folder structure, your naming conventions, your tags, and the relationships between your notes. It writes all of this down in a file that persists across sessions. The result is that the agent doesn’t start every conversation from scratch. It already knows how your vault is organized, where you keep your research, and what projects you’re working on. It’s the difference between hiring a new intern every morning and having a colleague who’s been on the team for months.
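To make the idea concrete, here is the kind of thing a generated AGENTS.md might contain. Everything below is a hypothetical vault, not output from the plugin; the actual sections and wording depend entirely on your vault's structure:

```markdown
# Vault Overview

## Structure
- `Projects/` — one note per active project, named `YYYY-MM <name>.md`
- `Research/` — source notes and literature summaries
- `Daily/` — daily notes in `YYYY-MM-DD` format

## Conventions
- Tags: `#status/active`, `#status/someday`, `#type/reference`
- New research notes always link back to their project note

## Active Projects
- [[2026-01 Plugin Roadmap]] — release planning and feature notes
```

Because this file persists across sessions, a request like "file this under the right project" works on the first message instead of after five clarifying questions.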

Seeing and Searching

Version 4.1 brought something I’d wanted since the beginning: real thinking model support. When Google released Gemini 2.5 Pro and later Gemini 3 with extended thinking capabilities, I added a progress indicator that shows you the model’s reasoning in real time. You can watch it think through a problem, see it plan its approach, and understand why it chose a particular tool. It sounds like a small UI feature, but it fundamentally changes your relationship with the agent. You stop treating it like a black box and start treating it like a thinking partner whose process you can follow.

That same release added a stop button (which sounds trivial until you’re watching an agent go on a tangent and have no way to interrupt it), dynamic example prompts that are generated from your actual vault content, and multilingual support so the agent responds in whatever language you write in.

But the real game-changer came in version 4.2 with semantic vault search. I wrote about the magic of embeddings over a year ago, and this feature is that idea fully realized inside Obsidian. It uses Google’s File Search API to index your entire vault in the background. Once indexed, the agent can search by meaning, not just keywords. If you ask it to “find my notes about the trade-offs of microservices,” it will surface relevant notes even if they never use the word “microservices.” It understands that a note titled “Why We Split the Monolith” is probably relevant.
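The underlying idea is easy to sketch. In the real plugin, Google's File Search API does the embedding and retrieval; the toy vectors and helper functions below are illustrative stand-ins that show why meaning-based ranking finds what keyword search misses:

```typescript
// Meaning-based search sketch: notes and the query are embedded as
// vectors, and results are ranked by cosine similarity rather than
// keyword overlap. The vectors here are hand-picked stand-ins.

type IndexedNote = { path: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function semanticSearch(query: number[], index: IndexedNote[], topK = 3): string[] {
  return [...index]
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, topK)
    .map((n) => n.path);
}

// A note that never says "microservices" still sits near the
// "microservices trade-offs" query in embedding space.
const index: IndexedNote[] = [
  { path: "Why We Split the Monolith.md", embedding: [0.9, 0.1, 0.2] },
  { path: "Sourdough Starter Log.md", embedding: [0.0, 0.8, 0.1] },
];
const query = [0.85, 0.15, 0.25]; // embedding of "microservices trade-offs"
console.log(semanticSearch(query, index, 1)); // → ["Why We Split the Monolith.md"]
```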

The indexing runs in the background, handles PDFs and attachments, and can be paused and resumed. Getting the reliability right was one of the more frustrating engineering challenges of the whole project. There were weeks of debugging race conditions, handling rate limits gracefully, and making sure a crash mid-index didn’t corrupt the cache. Version 4.2.1 was almost entirely dedicated to stabilizing the indexer, adding incremental cache saves and automatic retry logic. It’s the kind of work that nobody sees but everyone benefits from.
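The retry half of that stabilization work follows a standard pattern worth showing. This is a generic exponential-backoff sketch, not the plugin's actual code, and the attempt counts and delays are illustrative defaults:

```typescript
// Retry a flaky async operation with exponential backoff. Transient
// failures (rate limits, network blips) get retried with growing
// delays; persistent failures are rethrown after the last attempt.

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Backoff doubles each time: 500ms, 1000ms, 2000ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

Paired with saving the index cache after every file rather than at the end, this is why a crash or a rate-limit storm mid-index costs you seconds of progress instead of the whole run.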

Images, Research, and the Expanding Toolbox

Version 4.3, released in January 2026, added multimodal image support. You can now paste or drag images directly into the chat, and the agent can analyze them, describe them, or reference them in notes it creates. The image generation tool, which I’d been building in the lead-up to 4.3, lets the agent create images on demand using Google’s Imagen models. There’s even an AI-powered prompt suggester that helps you describe what you want if you’re not sure how to phrase it.

That release also introduced two new selection-based actions: Explain Selection and Ask About Selection. These join the existing Rewrite feature to give you a full right-click menu for working with selected text. It sounds like a small addition, but in practice these micro-interactions are where people spend most of their time. Being able to highlight a paragraph, right-click, and ask “What’s the logical flaw in this argument?” without leaving your note is the kind of frictionless experience I’m always chasing.

Then came deep research in version 4.4. This is fundamentally different from the regular Google Search tool. Where a search returns quick snippets, deep research performs multiple rounds of investigation, reading and cross-referencing sources, synthesizing findings, and producing a structured report with inline citations. It can combine web sources with your own vault notes, so the output reflects both what the world knows and what you’ve already written. A single research request takes several minutes, but what you get back is closer to what a research assistant would produce after an afternoon in the library.
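The shape of that multi-round process can be sketched as a convergence loop. Everything here is a stub — the real pipeline involves the model deciding what to search next and writing the final report — but it shows the structural difference from a one-shot search:

```typescript
// Toy deep-research loop: each round searches with everything already
// known, keeps only genuinely new findings, and stops when a round
// adds nothing. A one-shot search is this loop with maxRounds = 1.

type Finding = { claim: string; source: string };

function researchLoop(
  query: string,
  searchRound: (q: string, known: Finding[]) => Finding[],
  maxRounds = 5,
): Finding[] {
  const findings: Finding[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const fresh = searchRound(query, findings).filter(
      (f) => !findings.some((k) => k.claim === f.claim),
    );
    if (fresh.length === 0) break; // converged: nothing new this round
    findings.push(...fresh);
  }
  return findings;
}
```

The `source` field is also where the vault integration lives: a finding can cite a web page or one of your own notes, and the final report keeps both kinds of citation.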

I built this on top of my gemini-utils library, which is a separate project I created to share common AI functionality across all of my TypeScript Gemini projects, including Gemini Scribe, my Gemini CLI deep research extension, and more. Having that shared foundation means deep research improvements benefit every project simultaneously.

Opening the Platform

If I had to pick the release that transformed Gemini Scribe from a plugin into a platform, it would be version 4.5. This is where MCP server support and the agent skills system arrived.

MCP, the Model Context Protocol, is an open standard that lets AI applications connect to external tool providers. In practical terms, it means Gemini Scribe can now talk to tools that I didn’t build. You can connect a filesystem server, a GitHub integration, a Brave Search provider, or anything else that speaks MCP. The plugin supports both local stdio transport (spawning a process on your desktop) and HTTP transport with full OAuth authentication, which means it works on mobile too. When you connect an MCP server, its tools appear alongside the built-in vault tools, with the same confirmation flow and safety features.
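To make the two transports concrete, here is a hypothetical shape for a server entry — the plugin's actual settings schema may differ, but the stdio/HTTP split looks something like this:

```typescript
// Hypothetical MCP server configuration: a discriminated union over
// the two transports. stdio spawns a local process (desktop only);
// HTTP reaches a remote server and can carry OAuth, so it works on
// mobile too.

type McpServerConfig =
  | {
      transport: "stdio";
      command: string;
      args?: string[];
    }
  | {
      transport: "http";
      url: string;
      oauth?: { clientId: string };
    };

const servers: Record<string, McpServerConfig> = {
  filesystem: {
    transport: "stdio",
    command: "npx",
    args: ["@modelcontextprotocol/server-filesystem", "/path/to/vault"],
  },
  github: {
    transport: "http",
    url: "https://example.com/mcp", // placeholder endpoint
    oauth: { clientId: "my-client-id" },
  },
};
```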

This was the moment the plugin stopped being a closed system. Instead of me having to build every integration myself, the entire MCP ecosystem became available. Someone who needs to query a database from their notes can connect a database MCP server. Someone who wants to interact with their GitHub issues can connect the GitHub server. The plugin becomes a hub rather than a destination.

The agent skills system, which follows the open agentskills.io specification, takes a similar approach to extensibility but for knowledge rather than tools. A skill is a self-contained instruction package that gives the agent specialized expertise. You can create a “meeting-notes” skill that teaches it your preferred format for processing meetings, or a “code-review” skill with your team’s specific standards. Skills use progressive disclosure, so the agent always knows what’s available but only loads the full instructions when it activates one. This keeps conversations focused while making specialized knowledge available on demand.
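A skill is just a folder with a SKILL.md file whose frontmatter carries the always-visible part. The example below is a hypothetical meeting-notes skill, not one that ships with the plugin; only the frontmatter is loaded up front, and the body loads when the skill activates:

```markdown
---
name: meeting-notes
description: Process raw meeting transcripts into my standard summary format.
---

# Meeting Notes Skill

When the user asks to process a meeting:
1. Extract attendees, decisions, and action items.
2. Write the summary to `Meetings/YYYY-MM-DD <topic>.md`.
3. Link any mentioned projects with [[wikilinks]].
```

That split between frontmatter and body is the progressive disclosure: the agent scans dozens of one-line descriptions cheaply and pays the full token cost only for the skill it actually uses.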

Version 4.5 also migrated API key storage to Obsidian’s SecretStorage, which uses the OS keychain. Your API key is no longer sitting in a plain JSON file in your vault. It’s a small change that matters a lot for security, especially for people who sync their vaults to cloud storage or version control.

Managing the Conversation

The most recent release, version 4.6, tackles a problem that only becomes apparent after you’ve been using an agent for a while: conversations get long, and long conversations hit token limits.

The solution is automatic context compaction, a direct answer to the attention management challenge I explored in the Agentic Shift series. When a conversation approaches the model’s token limit, the plugin automatically summarizes older turns to make room for new ones. There’s also an optional live token counter that shows you exactly how much of the context window you’re using, with a breakdown of cached versus new tokens. It’s the kind of visibility that helps you understand why the agent might be “forgetting” things from earlier in the conversation and gives you the information to manage it.
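The mechanism is simple to sketch. The 80% threshold, the number of recent turns kept, and the `summarize` stub below are illustrative choices, not the plugin's actual values — in the real thing the model itself writes the summary:

```typescript
// Context compaction sketch: when the running token total nears the
// limit, older turns collapse into a single summary turn while the
// most recent turns stay verbatim.

type Turn = { role: "user" | "model" | "summary"; text: string; tokens: number };

function countTokens(turns: Turn[]): number {
  return turns.reduce((sum, t) => sum + t.tokens, 0);
}

function summarize(turns: Turn[]): Turn {
  // Stub: a real implementation asks the model for a summary.
  return { role: "summary", text: `[summary of ${turns.length} turns]`, tokens: 50 };
}

function compact(turns: Turn[], tokenLimit: number, keepRecent = 4): Turn[] {
  if (countTokens(turns) < tokenLimit * 0.8 || turns.length <= keepRecent) {
    return turns; // plenty of headroom: leave history untouched
  }
  const older = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  return [summarize(older), ...recent];
}
```

This is also exactly what the token counter makes visible: the summary turn is "cached" cheaply while the verbatim recent turns are where the budget goes.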

This release also added a per-tool permission policy system, which is the practical realization of the guardrails philosophy I wrote about in the Agentic Shift series. Instead of the binary choice between “confirm everything” and “confirm nothing,” you can now set individual tools to allow, deny, or ask-every-time. There are presets too: Read Only, Cautious, Edit Mode, and (for the brave) YOLO mode, which lets the agent execute everything without asking. I use Cautious mode myself, which auto-approves reads and searches but asks before any file modifications. It strikes a balance between speed and safety that feels right for daily use.
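Resolving a tool against such a policy is a one-line lookup; the sketch below uses hypothetical tool names and an invented shape for the Cautious preset, just to show the allow/deny/ask structure:

```typescript
// Per-tool permission sketch: each tool maps to a decision, with a
// policy-wide default for anything unlisted.

type Decision = "allow" | "deny" | "ask";
type Policy = { defaultDecision: Decision; tools: Record<string, Decision> };

// Roughly what a "Cautious" preset implies: reads and searches
// auto-approve, modifications confirm, destructive actions refuse.
const cautious: Policy = {
  defaultDecision: "ask",
  tools: {
    read_file: "allow",
    search_vault: "allow",
    write_file: "ask",
    delete_file: "deny",
  },
};

function decide(policy: Policy, tool: string): Decision {
  return policy.tools[tool] ?? policy.defaultDecision;
}
```

The "ask" default for unknown tools matters more than it looks: an MCP server you connected yesterday can expose new tools today, and they inherit the conservative behavior until you explicitly loosen it.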

What I’ve Learned

Building Gemini Scribe has taught me something I keep coming back to in this blog: the most interesting work happens at the intersection of AI capabilities and human workflows. The technical challenges (semantic indexing, MCP integration, context compaction) are real, but they’re in service of a simple goal: making the AI useful enough that you forget it’s there.

The plugin now has users like Paul O’Malley building entire self-organizing knowledge systems on top of it. Seeing that kind of creative adoption is what keeps me building. Every feature request, every bug report, every surprising use case reveals another facet of what’s possible when you give a capable AI agent the right set of tools and the right context.

If you’re curious, Gemini Scribe is available in the Obsidian Community Plugins directory. All you need is a free Google Gemini API key. I’d love to hear what you build with it.
