Abstract digital visualization of glowing lines and nodes converging on a central geometric shape labeled 'AGENTS.md', symbolizing interconnected AI systems and a unifying standard.

On Context, Agents, and a Path to a Standard

When we were first designing the Gemini CLI, one of the foundational ideas was the importance of context. For an AI to be a true partner in a software project, it can’t just be a stateless chatbot; it needs a “worldview” of the codebase it’s operating in. It needs to understand the project’s goals, its constraints, and its key files. This philosophy isn’t unique; many agentic tools use similar mechanisms. In our case, it led to the GEMINI.md context system (which was first introduced in this commit): a simple Markdown file that acts as a charter, guiding the AI’s behavior within a specific repository.

At its core, GEMINI.md is designed for clarity and flexibility. It gives developers a straightforward way to provide durable instructions and file context to the model. We also recognized that not every project is the same, so we made the system adaptable. For instance, if you prefer a different convention, you can easily change the name of your context file with a simple setting.
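To make that concrete, here’s roughly what that setting looks like in a project-level settings file today (a minimal sketch; check the configuration docs for the exact key name and location):

    $ cat .gemini/settings.json
    {
      "contextFileName": "AGENTS.md"
    }

With that one setting, the CLI treats AGENTS.md as its context file instead of GEMINI.md.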

This approach has worked well, but I’ve always been mindful that bespoke solutions, however effective, can lead to fragmentation. In the open, collaborative world of software development, standards are the bridges that connect disparate tools into a cohesive ecosystem.

That’s why I’ve been following the emergence of the Agents.md specification with great interest. We have several open issues in the Gemini CLI repo (like #406 and #12345) from users asking for Agents.md support, so there’s clear community interest. The idea of a universal standard for defining an AI’s context is incredibly appealing. A shared format would mean that a context file written for one tool could work seamlessly in another, allowing developers to move between tools without friction. I would love for Gemini CLI to become a first-class citizen in that ecosystem.

However, as I’ve considered a full integration, I’ve run into a few hurdles—not just technical limitations, but patterns of use that a standard would need to address. This has led me to a more concrete set of proposals for what an effective standard would need.

So, what would it take to bridge this gap? With a few key additions, I believe Agents.md could become the robust standard we need. Here’s a more detailed breakdown of what I think is required:

  1. A Standard for @file Includes: From my perspective, this is mandatory. In any large project, you need the ability to break down a monolithic context file into smaller, logical, and more manageable parts—much like a C/C++ #include. A simple @file directive, which GEMINI.md and some other systems support, would provide the modularity needed for real-world use (see the sketch after this list).
  2. A Pragma System for Model-Specific Instructions: Developers will always want to optimize prompts for specific models. To accommodate this without sacrificing portability, the standard could introduce a pragma system. This could leverage standard Markdown callouts to tag instructions that only certain models should pay attention to, while others ignore them. For example:

    > [!gemini]
    > Gemini only instructions here

    > [!claude]
    > Claude only instructions here

    > [!codex]
    > Codex only instructions here
  3. Clear Direction on Context Hierarchy: We need clear rules for how an agentic application should discover and apply context. Based on my own work, I’d propose a hierarchical strategy. When an agent is invoked, it should read the context in its current directory and all parent directories. Then, when it’s asked to read a specific file, it should first apply the context from that file’s local directory before applying the broader, inherited context. This ensures that the most specific instructions are always considered first, creating a predictable and powerful system.
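To make the first and third items concrete, here’s a hedged sketch of what a repository-root context file using an @file directive might look like; the syntax, file names, and project details are all hypothetical, since the standard would define the real rules:

    # Project charter

    This is a TypeScript monorepo. Prefer pnpm, keep modules small, and never commit generated code.

    ## Included context

    @file ./docs/architecture.md
    @file ./packages/api/AGENTS.md

Under the hierarchy rules in the third item, an agent invoked inside packages/api would apply packages/api/AGENTS.md first and then fall back to this root file for anything it doesn’t cover.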

If the Agents.md standard were to incorporate these three features, I believe it would unlock a new level of interoperability for AI developer tools. It would create a truly portable and powerful way to define AI context, and I would be thrilled to move Gemini CLI to a model of first-class support.

The future of AI-assisted development is collaborative, and shared standards are the bedrock of that collaboration. I’ve begun outreach to the Agents.md maintainers to discuss these proposals, and I’m optimistic that with community feedback, we can get there. If you have your own opinions on this, I’d love to hear them in the discussion on our repo.

Unlocking the Future of Coding: Introducing the Gemini CLI

Back in April, I wrote about waiting for the true AI coding partner. I articulated a vision for an AI that transcends mere code generation, one that truly understands context, acts autonomously within our development environments, and collaborates with us iteratively. Today, I’m thrilled to announce a significant step towards that vision: the launch of the Gemini CLI.

For too long, AI coding assistance has felt disconnected. While dedicated AI-powered IDEs like Cursor have made great strides, the common experience still involves copy-pasting code into a separate interface or stepping out of your editor to get suggestions. That breaks flow, loses context, and frankly, isn’t how truly collaborative partners work. We need an AI that lives where we live—in the terminal, within our projects, and deeply integrated into our workflow.

This is precisely what the Gemini CLI sets out to achieve. It’s not just a fancy chatbot for your command line; it’s an experimental interface designed to bring the power of Gemini directly into your development loop, enabling intelligent, contextual, and actionable AI assistance.

It’s for this very reason that I’ve been quite heads-down over the last few months, working with a super talented team to bring this application to life. It has genuinely been one of my most fun experiences at Google in the 20+ years that I’ve been here, and I feel incredibly fortunate to have had the chance to collaborate with such brilliant people across the company.

The Power of Small Tools, Amplified by AI

In May, I explored the concept of small tools, big ideas. The premise was simple: complex problems are often best tackled by composing many small, powerful, and specialized tools. This philosophy is at the very heart of the Gemini CLI’s design.

Instead of a monolithic AI trying to do everything at once, the Gemini CLI empowers Gemini with a suite of familiar command-line tools. Imagine an AI that can:

  • Read and Write Files: Using read_file and write_file, it can inspect your codebase, understand existing logic, and propose modifications directly to your files.
  • Navigate Your Project: With list_directory and grep, it can explore your project structure, locate relevant files, or find specific patterns across your repository, just like you would.
  • Execute Shell Commands: The run_shell_command tool allows Gemini to execute commands, build your project, run tests, or even interact with external services, providing real-time feedback.
  • Search the Web: Need to look up an API, debug an error message, or find best practices? The google_web_search tool lets Gemini leverage the vastness of the internet to inform its responses and actions.
  • Edit with Precision: Beyond simple file writes, the edit_file tool allows for granular, diff-based modifications, ensuring changes are precise and reviewable.

This approach means Gemini isn’t guessing; it’s acting. It’s using the same building blocks you use every day, but with its powerful reasoning capabilities to orchestrate them towards your goals.

A Truly Contextual and Collaborative Partner

The Gemini CLI maintains a persistent session, remembering your conversation history, the files it has examined, and the results of previous tool executions. This “conversational memory” and contextual understanding are critical. It allows for a natural, iterative back-and-forth, where the AI builds on prior interactions and its understanding of your project state.

You can ask Gemini to:

  • “Find all JavaScript files in this directory that import React.” (Leveraging list_directory and grep)
  • “Refactor this component to use hooks.” (Involving read_file, edit_file, and potentially run_shell_command to run tests).
  • “What’s the best way to implement X in Python given these files?” (Using read_file to understand your existing code and google_web_search for best practices).
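These same requests also work as one-shot prompts when you’d rather script them than type them interactively; a minimal sketch, assuming the -p/--prompt flag for non-interactive runs:

    # Ask the first question above from the repo root, without entering the interactive UI
    gemini -p "Find all JavaScript files in this directory that import React."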

The workflow is truly interactive. Gemini proposes actions, and you have the power to approve them or guide it further. This human-in-the-loop design ensures you’re always in control, fostering a collaborative partnership rather than a black-box operation.

Built by Gemini CLI, For Everyone

It’s particularly exciting to share that this project was started by a small and scrappy team, and we leveraged Gemini CLI itself to help write Gemini CLI. Many of us now work almost exclusively within Gemini CLI, often using our IDEs only for viewing diffs.

And while its origins are in coding, Gemini CLI is incredibly versatile for many tasks outside of traditional development. Personally, I love using it to manage my home lab, to bulk rename and reformat files for my podcast project, and to generally act as a seamless go-between for anything complicated in GitHub. Increasingly, I’ve also been using Gemini CLI with Obsidian to understand and extract insights from my vault. With over 9000 files in my work vault alone, Gemini CLI lets me ask questions of the entire vault and even make large refactoring-style changes across the entire thing.

Beyond Today: Extensibility

One of the most exciting aspects of the Gemini CLI, and a direct nod to the “small tools, big ideas” philosophy, is its extensibility. The underlying architecture allows developers to define custom tools. This means you can teach Gemini to interact with your specific internal systems, proprietary APIs, or niche development tools. The possibilities are endless, transforming Gemini into an AI assistant perfectly tailored to your unique development environment.

Get Started Today

The Gemini CLI represents a significant leap forward in bringing intelligent AI assistance directly to where developers work most effectively: the command line. It’s a practical realization of the “true AI coding partner” vision, built on the principle that small, well-designed tools can achieve big ideas when orchestrated by a powerful intelligence.

Ready to try it out? Head over to the Gemini CLI GitHub repository to get started. Explore the commands, experiment with its capabilities, and let’s shape the future of AI-powered development together.

I’m incredibly excited about what this means for developer productivity and the evolving role of AI in our daily coding lives. Let me know what you build with it!

Docker Did Nothing Wrong (But I’m Trying Podman Anyway)

Hey everyone, welcome back to the homelab series! One of the constant themes in managing a growing homelab is figuring out the best way to run and orchestrate all the different services we rely on. For me, this has meant evolving my setup over time into distinct systems to keep things scalable and maintainable.

My current homelab nerve center is spread across a few key machines: ns1 and ns2 handle critical DNS redundancy, beluga is the fortress for storage and archives, and bubba acts as the powerhouse for all my AI experiments and compute-heavy tasks.

Up until now, Docker has been the backbone for deploying and managing services across these systems. Whether it’s containerizing AI models on bubba or managing my core network services, it’s been indispensable for packaging applications and keeping dependencies tidy. It’s served me well, allowing for rapid deployment and relatively easy management.

However, the tech landscape is always shifting, and exploring new tools is part of the homelab fun, right? Lately, I’ve been hearing more about Podman as a powerful, open-source alternative to Docker. Recent changes in the container world and simple curiosity led me to check out an excellent video overview of Podman (which I highly recommend watching!).

This video really illuminated what Podman brings to the table and sparked a ton of ideas about how it could potentially fit into, and even improve, my homelab workflow. So, in this post, I want to walk through my current Docker-based setup in a bit more detail, share the specific Podman features from the video that caught my eye, and outline some experiments I’m planning for the future. Let’s dive in!

My Current Homelab: A Multi-Server Approach

As I mentioned, to keep things organized as my homelab grew, I settled on dedicating specific roles to my main servers using Proxmox VE as the foundation for virtualization:

  • ns1 and ns2 — The Backbone of Service Discovery: These identical servers run my critical internal DNS, ensuring all my services can find each other reliably. Redundancy here is key – if one fails, the other keeps everything connected.
  • bubba — The AI Workhorse: This is my compute powerhouse, equipped with a GPU and plenty of RAM. It’s dedicated to running local AI models like LLMs via Ollama and interacting with them through tools like Open WebUI. It handles tasks like podcast transcription, embeddings, and inference workloads.
  • beluga — The Keeper of the Archives: With its focus on storage, beluga houses my media library, data archives, and backups. It’s the long-term home for files and feeds data to bubba when needed.

This separation of duties has been crucial for keeping things maintainable and scalable.

Docker’s Role in My Current Setup

So, how do I actually run the services on these different machines? Docker and Docker Compose are absolutely central to making this multi-server setup manageable. Here’s a glimpse into how it’s wired together:

  • Base Services Everywhere: I have a base Docker Compose file that runs on most, if not all, of these servers. It includes essential plumbing:
    • Traefik: My go-to reverse proxy, handling incoming traffic and routing it to the correct service container, plus managing SSL certificates.
    • Portainer Agent: Allows me to manage the Docker environment on each host from a central Portainer instance (the agent itself is part of the Portainer ecosystem).
    • Watchtower: Automatically updates containers. (I use this cautiously – often pinning major versions in my compose files while letting Watchtower handle minor updates, though for rapidly evolving things like Ollama, I sometimes let it pull latest.)
    • Dozzle Agent: Feeds container logs to a central Dozzle instance for easy viewing (the agent enables the main Dozzle UI).
  • DNS Servers (ns1/ns2): On top of the base services, the DNS servers have a dedicated compose file that adds CoreDNS, specifically using the coredns_omada project which cleverly mirrors DHCP hostnames from my TP-Link Omada network gear into DNS – super handy! ns1 also runs the central dozzle instance (the log viewer UI) and Heimdall as my main homelab dashboard, providing a single pane of glass overview. Docker makes running these critical but relatively lightweight infrastructure services incredibly straightforward.
  • AI Workloads (bubba): On the AI workhorse bubba, Docker is essential for managing the AI stack. I run ollama to serve LLMs and open-webui as a frontend, all containerized. This simplifies deployment, dependency management, and allows me to easily experiment with different models and tools without polluting the host system.
  • Storage Server Utilities (beluga): Even the storage server beluga runs containers. I have PostgreSQL running here, which primarily backs the Speedtest-Tracker service but also serves as my go-to relational database for any other containers or services that need one. Again, Docker neatly packages these distinct applications.

Essentially, Docker Compose defines the what and how for each service stack on each server, and Docker provides the runtime environment. This containerization strategy is what allows me to easily deploy, update, and manage this diverse set of applications across my specialized hardware.
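As a sketch of one way that layering can play out at deploy time (the file names here are made up, but the -f merging is standard Compose behavior):

    # On ns1/ns2: shared base services plus the DNS-specific stack (illustrative file names)
    docker compose -f docker-compose.base.yml -f docker-compose.dns.yml up -d

    # On bubba: the same base, layered with the AI stack instead
    docker compose -f docker-compose.base.yml -f docker-compose.ai.yml up -d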

Video Insights: How Podman Could Fit into This Picture

Watching the video overview of Podman did more than just introduce another tool; it sparked concrete ideas about how its specific features could integrate with, and perhaps even improve, my current homelab operations distributed across ns1/ns2, bubba, and beluga.

Perhaps the most compelling concept showcased was Podman’s native support for Pods. While Docker Compose helps manage multiple containers, the idea of grouping tightly coupled containers – like my ollama and open-webui stack on bubba, potentially along with a future vector database – into a single, network-integrated unit feels intrinsically cleaner. Managing this AI application suite as one atomic Pod could simplify networking and lifecycle management significantly. I could even see potential benefits in treating the base services running on each host (traefik, portainer-agent, etc.) as a coherent Pod.
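As a rough sketch of what that could look like for the bubba stack (image tags, ports, and names here are just illustrative):

    # Create the pod and publish the ports the whole stack needs
    podman pod create --name ai-stack -p 11434:11434 -p 3000:8080

    # Run Ollama and Open WebUI inside the same pod so they share localhost
    podman run -d --pod ai-stack --name ollama docker.io/ollama/ollama
    podman run -d --pod ai-stack --name open-webui \
        -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
        ghcr.io/open-webui/open-webui:main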

Another significant architectural difference highlighted is Podman’s daemonless nature. Running without a central, privileged daemon is interesting for a couple of reasons. While bubba has resources to spare, my leaner DNS servers (ns1/ns2) might benefit from even slight resource savings, though that needs practical testing. More importantly, this architecture often makes running containers as non-root (rootless) more straightforward. This has direct security appeal, especially for the complex AI applications processing data on bubba or the critical DNS infrastructure on ns1/ns2, potentially reducing the attack surface compared to running everything through a root daemon.
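In practice, rootless just means the containers run under my normal user account rather than through a root-owned daemon; a quick sketch:

    # No sudo and no daemon: this runs entirely under my unprivileged user
    podman run -d --name web -p 8080:80 docker.io/library/nginx

    # The container's processes belong to my user, not root
    podman top web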

Furthermore, the video demonstrated Podman’s ability to generate Kubernetes YAML manifests directly from running containers or pods. This feature is particularly exciting for a homelabber keen on learning! It presents a practical pathway to experimenting with Kubernetes distributions like K3s or Minikube. I could define my AI stack on bubba using Podman Pods and then export it to a Kubernetes-native format, greatly lowering the barrier to entry for learning K8s concepts with my existing workloads. Even outside of a full K8s deployment, having standardized YAML definitions could make my application deployments more portable and consistent.
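The export itself is a one-liner; assuming the ai-stack pod from the sketch above, it would look something like this:

    # Snapshot a running pod as Kubernetes YAML to study or replay in K3s/minikube later
    podman generate kube ai-stack > ai-stack.yaml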

Of course, for those who prefer a graphical interface, the video also touched upon Podman Desktop. While I currently use Portainer, exploring Podman Desktop could offer a different management perspective, perhaps one more focused on visualizing and managing these Pods. And crucially, knowing that Podman aims for Docker CLI compatibility for many common commands makes the idea of experimenting much less daunting – it suggests I wouldn’t have to relearn everything from scratch.

So, rather than just being ‘another container tool’, the video positioned Podman as offering specific solutions – particularly around multi-container application management via Pods, security posture through its daemonless design, and bridging towards Kubernetes – that seem highly relevant to the challenges and opportunities in my own homelab setup.

Future Homelab Goals: Experimenting with Podman

So, all this reflection on my current setup and the potential benefits highlighted in the video leads to the obvious next question: what am I actually going to do about it? While I’m not planning a wholesale migration away from Docker immediately – it’s deeply integrated and works well – the possibilities offered by Podman are too compelling not to explore.

My plan is to dip my toes into the Podman waters with a few specific, manageable experiments, leveraging the flexibility of my Proxmox setup:

  1. Dedicated Test Environment: Instead of installing Podman directly onto one of my existing servers like bubba initially, I’ll spin up a fresh virtual machine using Proxmox dedicated solely to Podman testing. This is one of the huge advantages of using Proxmox – I can create an isolated sandbox environment easily. This clean slate will be perfect for getting Podman installed, getting comfortable with the basic CLI commands (leveraging the Docker compatibility mentioned earlier), and working out any kinks without impacting my operational services.
  2. Migrating a Stack to a Pod: Once the test VM is set up, the real test will be taking my current ollama and open-webui Docker Compose stack (conceptually, at least) and recreating it as a Podman Pod within that VM. This will directly evaluate the Pod concept for managing related services and let me see how the networking and management feel compared to Compose in a controlled environment.
  3. Testing a Simple Service: To get a feel for basic container management and the daemonless architecture in this new VM, I’ll deploy a simpler, standalone service using Podman. Perhaps I’ll containerize a small utility or pull down a common image like postgres or speedtest-tracker just to compare the basic workflow.
  4. Generating Kubernetes Manifests: Once I (hopefully!) have the AI stack running in a Podman Pod in the test VM, I definitely want to try the Kubernetes YAML generation feature. Even if I don’t deploy it immediately, I want to see how Podman translates the Pod definition into Kubernetes resources within this testbed. This feels like a practical homework assignment for my K8s learning goals.
  5. Exploring Podman Desktop: Finally, I’ll likely install and explore Podman Desktop within the test VM. I’m curious to see what its visualization and management capabilities look like, especially for Pods, compared to my usual tools.

This isn’t about finding a ‘winner’ between Docker and Podman right now, but rather about hands-on learning in a safe, isolated environment thanks to Proxmox. It’s about understanding the practical advantages and disadvantages of Podman’s approach before considering if or how I might integrate it into my primary homelab systems (ns1/ns2, bubba, beluga) later on. I’m looking forward to experimenting and, of course, I’ll be sure to share my findings and experiences here in future posts!

That’s the plan for now! Docker continues to be a vital part of my homelab, but exploring tools like Podman is essential for learning and potentially improving how things run. The video provided some great insights, and I’m excited to see how these experiments turn out.

What about you? Are you using Docker, Podman, or something else in your homelab? Have you experimented with Pods or rootless containers? Let me know your thoughts and experiences in the comments below!

Cracking the Code: Exploring Transcription Methods for My Podcast Project

In previous posts, I outlined the process of downloading and organizing thousands of podcast episodes for my AI-driven project. After addressing the chaos of managing and cleaning up nearly 7,000 files, the next hurdle became clear: transcription. Converting all of these audio files into readable, searchable text would unlock the real potential of my dataset, allowing me to analyze, tag, and connect ideas across episodes. Since then, I’ve expanded my collection to over 10,000 episodes, further increasing the importance of finding a scalable transcription solution.

Why is transcription so critical? Most AI tools available today aren’t optimized to handle audio data natively. They need input in a format they can process—typically text. Without transcription, it would be nearly impossible for my models to work with the podcast content, limiting their ability to understand the material, extract insights, or generate meaningful connections. Converting audio into text not only makes the data usable by AI models but also allows for deeper analysis, such as searching across episodes, generating summaries, and identifying recurring themes.

In this post, I’ll explore the various transcription methods I considered, from cloud services to local AI solutions, and how I ultimately arrived at the right balance of speed, accuracy, and cost.

What Makes a Good Transcription?

Before diving into the transcription options I explored, it’s important to outline what I consider to be the key elements of a good transcription. When working with large amounts of audio data—like podcasts—the quality of the transcription can make or break the usability of the resulting text. Here are the main criteria I looked for:

  • Accuracy: The most obvious requirement is that the transcription needs to be accurate. It should capture what is said without altering the meaning. Misinterpretations, skipped words, or incorrect phrasing can lead to significant misunderstandings, especially when trying to analyze data from hours of dialogue.
  • Speaker Diarization: Diarization is the process of distinguishing and labeling different speakers in an audio recording. Many of the podcasts in my dataset feature multiple speakers, and a good transcription should clearly indicate who is speaking at any given time. This makes the conversation easier to follow and is essential for both readability and for further processing, like analyzing individual speaker contributions or summarizing conversations.
  • Punctuation and Formatting: Transcriptions need to be more than a raw dump of words. Proper punctuation and sentence structure make the resulting text more readable and usable for downstream tasks like summarization or natural language processing.
  • Identifying Music and Sound Effects: Many podcasts feature music, sound effects, or background ambiance that are integral to the listening experience. A good transcription should be able to note when these elements occur, providing context about their role in the episode. This is especially important for audio that is heavily produced, as these non-verbal elements often contribute to the overall meaning or mood.
  • Scalability: Finally, when dealing with tens of thousands of podcast episodes, scalability becomes critical. A transcription tool should not only work well for a single episode but also maintain performance when scaled to thousands of hours of audio. The ability to process large volumes of data efficiently without sacrificing quality is a key factor for a project of this scale.

These criteria shaped my approach to evaluating different transcription tools, helping me determine what worked—and what didn’t—for my specific needs.

Using Gemini for Transcription: A First Attempt

Since I work with Gemini and its APIs professionally (about me), I saw this transcription project as an opportunity to deepen my understanding of the system’s capabilities. My early experiments with Gemini were promising; the model produced highly accurate, diarized transcriptions for the first few podcast episodes I tested. I was excited by the results and the prospect of integrating Gemini into my workflow for this project. It seemed like a perfect fit—Gemini was delivering exactly what I needed in terms of transcription accuracy, making me optimistic about scaling this approach.

Early Success and Optimism

In those initial tests, Gemini excelled in several areas. The transcriptions were accurate, the diarization was clear, and the output was well-formatted. Given Gemini’s strength in understanding context and language, the transcripts felt polished, even in conversations with overlapping speech or complex dialogue. This early success gave me confidence that I had found a tool that could handle my vast dataset of podcasts while maintaining high quality.

The Challenges of Scaling

As I continued to test Gemini on a larger scale, I encountered two key issues that ultimately made the tool unsuitable for this project.

The biggest challenge was recitation errors. The Gemini API includes a mechanism that prevents it from returning text if it detects that it might be reciting copyrighted information. While this is an understandable safeguard, it became a major roadblock for my use case. Given that my project is dependent on converting copyrighted audio content into text, it wasn’t surprising that Gemini flagged some of this content during its recitation checks. However, when this error occurred, Gemini didn’t return any transcription, making the tool unreliable for my needs. I required a solution that could consistently transcribe all the audio I was working with, not just portions of it.

That said, when Gemini did return transcriptions, the quality was excellent. For instance, here’s a sample from one of the podcasts I processed using Gemini:

Where Does All The TSA Stuff Go?
0:00 - Intro music playing.
1:00 - [SOUND] Transition to podcast
1:01 - Kimberly: Hi, this is Kimberly, and we're at New York airport, and we just had our snow globe confiscated.
1:08 - Kimberly: Yeah, we're so pissed, and we want to know who gets all of the confiscated stuff, where does it go, and will we ever be able to even get our snow globe back?

In addition to the recitation issue, I didn’t want to rely on Gemini for some transcriptions and another tool for the rest. For this project, it was important to have a consistent output format across all my transcriptions. Switching between tools would introduce inconsistencies in the formatting and potentially complicate the next stages of analysis. I needed a single solution that could handle the entire podcast archive.

Using Whisper for High-Quality AI Transcription

After experiencing challenges with Gemini, I turned to OpenAI’s Whisper, a model specifically designed for speech recognition and transcription. Whisper is an open-source tool known for its accuracy in handling complex audio environments. Given that my podcast collection spans a variety of formats and sound qualities, Whisper quickly emerged as a viable solution.

Why Whisper?

  • Accuracy: Whisper consistently delivered highly accurate transcriptions, even in cases with challenging audio quality, background noise, or overlapping speakers. It also performed well with speakers of different accents and speech patterns, which is critical for the diversity of content I’m working with.
  • Diarization: While Whisper doesn’t have diarization built-in, its accuracy with speech segmentation allowed for easy integration with additional tools to identify and separate speakers. This flexibility allowed me to maintain clear, speaker-specific transcripts.
  • Open Source Flexibility: Whisper’s open-source nature allowed me to deploy it locally on my Proxmox setup, leveraging the full power of my NVIDIA RTX 4090 GPU. This setup made it possible to transcribe podcasts in near real-time, which was crucial for processing a large dataset efficiently.

Performance on My Homelab Setup

By running Whisper locally with GPU acceleration, I saw significant improvements in processing time. For shorter podcasts, Whisper was able to transcribe episodes in a matter of minutes, while longer episodes could be transcribed in near real-time. This speed, combined with its accuracy, made Whisper a strong contender for handling my entire collection of over 10,000 episodes.
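Whatever wrapper you build around it, the underlying invocation looks roughly like this (a hedged sketch: the filename is just an example, and the model size, device, and output options all depend on the episode and the hardware):

    # Transcribe one episode on the GPU and write a plain-text transcript (example filename)
    whisper episode_0421.mp3 --model medium.en --device cuda \
        --output_format txt --output_dir transcripts/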

For instance, here’s the same podcast episode that was transcribed with Whisper:

Hi, this is Kimberly.
And we're at Newark Airport.
And we just had our snow globe confiscated.
Yeah, we're so pissed.
And we want to know who gets all of the confiscated stuff.
Where does it go?
And will we ever be able to even get our snow globe back?

Challenges and Considerations

While Whisper excelled in many areas, one consideration is its resource demand. Running Whisper locally with GPU acceleration requires substantial computational resources, which could be a limitation for users without access to powerful hardware. Whisper also lacks built-in diarization, so separating speakers requires additional post-processing or integration with other tools. For my setup, though, those trade-offs were worth it: they let me keep full control over the transcription process without relying on external services.

Comparing Transcription Methods and Moving Forward

After testing both Gemini and Whisper, it became clear that each tool has its strengths, but Whisper ultimately emerged as the best option for my project’s needs. While Gemini delivered higher-quality transcriptions overall, the recitation errors and lack of reliability when dealing with copyrighted material made it unsuitable for handling my entire dataset. Whisper, on the other hand, provided consistent, highly accurate transcriptions across the board and scaled well to the volume of audio I needed to process.

Gemini’s Strengths and Limitations

  • Strengths: Gemini produced extremely polished and accurate transcriptions, outperforming Whisper in many cases. The diarization was clear, and the formatting made the transcripts easy to read and analyze.
  • Limitations: Despite its transcription quality, Gemini’s API recitation checks became a major roadblock, which made it unreliable for my use case. Additionally, I needed a single solution that could provide consistent output across all episodes, which Gemini couldn’t guarantee due to these errors.

Whisper’s Strengths and Limitations

  • Strengths: Whisper stood out for its high accuracy, scalability, and open-source flexibility. Running Whisper locally allowed me to transcribe thousands of episodes efficiently, while its robust handling of varied audio content—from background noise to multiple speakers—was a major advantage.
  • Limitations: Whisper lacks built-in diarization, which means it cannot automatically differentiate between speakers. This requires additional post-processing or integration with other tools to achieve the same level of speaker clarity. Additionally, Whisper demands significant computational resources, which could be a barrier for users without access to powerful hardware.

Final Thoughts

As I move forward with this project, Whisper will be my go-to tool for transcribing the remaining episodes. Its ability to process large amounts of audio data reliably and consistently has made it the clear winner. While there may still be room for further exploration—particularly around post-processing clean-up or integrating diarization tools—Whisper has given me the foundation I need to turn my podcast archive into a fully searchable, AI-powered dataset.

In my next post, I’ll outline how I built my transcription system using Whisper to handle all of these episodes. It was a unique experience, as I used a model to write the entire application for this project. Stay tuned for a deep dive into the system’s architecture and the steps I took to automate the transcription process at scale.

My 9,000 File Problem: How Gemini and Linux Saved My Podcast Project

We live in a world awash in data, a tidal wave of information that promises to unlock incredible insights and fuel a new generation of AI-powered applications. But as anyone who has ever waded into the deep end of a data-intensive project knows, this abundance can quickly turn into a curse. My own foray into building a podcast recommendation system recently hit a major snag when my meticulously curated dataset went rogue. The culprit? A sneaky infestation of duplicate embedding files, hiding among thousands of legitimate ones, each with “_embeddings” endlessly repeating in their file names. Manually tackling this mess would have been like trying to drain the ocean with a teaspoon. I needed a solution that could handle massive amounts of data and surgically extract the problem files.

Gemini: The AI That Can Handle ALL My Data

Faced with this mountain of unruly data, I knew I needed an extraordinary tool. I’d experimented with other large language models in the past, but they weren’t built for this. My file list, containing nearly 9,000 filenames (about 100,000 input tokens in this case), proved too much for them to handle. That’s when I turned to Gemini, with its incredible ability to handle large context windows. With a touch of trepidation, I pasted the entire list into Gemini 1.5 Pro in AI Studio, hoping it wouldn’t buckle under the weight of all those file paths. To my relief, Gemini didn’t even blink. It calmly ingested the massive list, ready for my instructions. With a mix of hope and skepticism, I posed my question: “Can you find the files in this list that don’t match the _embeddings.txt pattern?” In a matter of seconds, Gemini delivered. It presented a concise list of the offending filenames, each one a testament to its remarkable pattern recognition skills.

To be honest, I hadn’t expected it to work. Pasting in such a huge list felt like a shot in the dark, and when I later tried the same task with other models I just got errors. But that’s one of the things I love about working with large models like Gemini. The barrier to entry for experimentation is so low. You can quickly iterate, trying different prompts and approaches to see what sticks. In this case, it paid off spectacularly.

From AI Insights to Linux Action

Gemini didn’t just leave me with a list of bad filenames; it went a step further, offering a solution. When I asked, “What Linux command can I use to delete these files?”, it provided the foundation for my command. I wanted an extra layer of safety, so instead of deleting the files outright, I first moved them to a temporary directory using this command:

find /srv/podcasts/Invisibilia -type f -name "*_embeddings_embeddings*" -exec mv {} /tmp \;

This command uses find with -exec to run a command on each file it matches. Here’s how it works:

  • -type f: Restricts the search to regular files.
  • -name "*_embeddings_embeddings*": Matches only filenames containing the doubled _embeddings suffix.
  • -exec: Tells find to execute a command on each match.
  • mv: The move command.
  • {}: A placeholder that represents the found filename.
  • /tmp: The destination directory for the moved files.
  • \;: Terminates the -exec command.

By moving the files to /tmp, I could examine them one last time before purging them from my system. This extra step gave me peace of mind, knowing that I could easily recover the files if needed.
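Once I’d reviewed the quarantined files, the final purge was the mirror image of the move; a minimal sketch using GNU find’s -delete:

    # After checking /tmp, remove the duplicate embedding files for good
    find /tmp -maxdepth 1 -type f -name "*_embeddings_embeddings*" -delete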

Reflecting on the AI-Powered Solution

In the end, what could have been a tedious and error-prone manual cleanup became a quick and efficient process, thanks to the combined power of Gemini and Linux. Gemini’s ability to understand my request, process my massive file list, and suggest a solution was remarkable. It felt like having an AI sysadmin by my side, guiding me through the data jungle.

That was especially welcome for someone like me who started their career as a Unix sysadmin. Back then, cleanups like this involved hours poring over man pages and carefully crafting bash scripts, especially when deleting files. I even had a friend who accidentally ran rm -r / as root, watching in horror as his system rapidly erased itself. Needless to say, I’ve developed a healthy respect for the destructive power of the command line! In this instance, I would have easily spent an hour writing and double-checking a careful script to make sure I got it right. With Gemini, I solved the problem in about 10 minutes and was on my way. The sheer amount of time saved continues to amaze me about these new approaches to AI.

More than just solving this immediate problem, this experience opened my eyes to the transformative potential of large language models for data management. Tasks that once seemed impossible or overwhelmingly time-consuming are now within reach, thanks to tools like Gemini.

Conclusion: A Journey of Discovery and Innovation

This experience was a powerful reminder that we’re living in an era of incredible technological advancements. Large language models like Gemini are no longer just fascinating research projects; they are becoming practical tools that can significantly enhance our productivity and efficiency. Gemini’s ability to handle enormous datasets, understand complex requests, and provide actionable solutions is truly game-changing. For me, this project was a perfect marriage of my early Unix sysadmin days and the exciting new world of AI. Gemini’s insights, combined with the precision of Linux commands, allowed me to quickly and safely solve a data problem that would have otherwise cost me significant time and effort.

This is just the first in an occasional series where I’ll be exploring the ways I’m using large models in my everyday work and hobbies. I’m eager to hear from you, my readers! How are you using AI to make your life easier? What would you like to be able to do with AI that you can’t do today? Share your thoughts and ideas in the comments below – let’s learn from each other and build the future of AI together!

Building My Homelab: The Journey from Gemma on a Laptop to a Rack Mounted Powerhouse

In the ever-evolving landscape of AI, there are moments when new technologies capture your imagination and set you on a path of exploration and innovation. For me, one of those moments came with the release of the Gemma models. These models, with their promise of enhanced capabilities and local deployment, ignited my curiosity and pushed me to take a significant step in my homelab journey—building a system powerful enough to run these AI models locally.

The Allure of Local AI

I’ve spent the better part of 30 years immersed in the world of machine learning and artificial intelligence. My journey began in the 90s when I was an AI major in the cognitive science program at Indiana University. Back then, AI was a field full of promise, but the tools and technologies we take for granted today were still in their infancy. Fast forward a few decades, and I found myself at Google Maps, leading teams that used machine learning to transform raw imagery into structured data, laying the groundwork for many of the services we rely on daily.

By 2021, I had transitioned to the Core ML group at Google, where my focus shifted to the nuts and bolts of AI—low-level ML infrastructure like XLA, ML runtimes, and performance optimization. The challenges were immense, but so were the opportunities to push the boundaries of what AI could do. Today, as the leader of the AI Developer team at Google, I work with some of the brightest minds in the industry, building systems and technologies that empower developers to use AI in solving meaningful, real-world problems.

Despite all these experiences, the release of the Gemma models reignited a spark in me—a reminder of the excitement I felt as a student, eager to experiment and explore the limits of AI. These models offered something unique: the ability to run sophisticated AI directly on local hardware. For someone like me, who has always believed in the power of experimentation, this was an opportunity too good to pass up.

However, I quickly realized that while I could run these models on my Mac at home, I wanted something more—something that could serve as a shared resource for my family, a system that would be plugged in and available all the time. I envisioned a platform that not only supported these AI models but also provided the flexibility to build and explore other projects. To fully engage with this new wave of AI and create a hub for ongoing experimentation, I needed a machine that could handle the load and grow with our ambitions.

That’s when I decided to take the plunge and build a powerful homelab. I started by carefully spec’ing out the components, aiming to create a system that wasn’t just about raw power but also about versatility and future-proofing. Eventually, I turned to Steiger Dynamics to bring my vision to life. Their expertise in crafting high-performance, custom-built systems made them the perfect partner for this project. But before diving into the specifics of the build, let me share why the concept of local AI holds such a special allure for someone who has been in this field for as long as I have.

Spec’ing Out the Perfect Homelab

Building a homelab is both a science and an art. It’s about balancing performance with practicality, ensuring that every component serves a purpose while also leaving room for future expansion. With the goal of creating a platform capable of handling advanced AI models like Gemma, as well as other projects that might come along, I began the process of selecting the right hardware.

The Heart of the System: CPU and GPU

At the core of any powerful AI system are the CPU and GPU. After researching various options, I decided to go with the AMD Ryzen 9 7900X3D, a 12-core, 24-thread processor that offers the multithreaded performance necessary for AI workloads while still being efficient enough for a range of homelab tasks. But the real workhorse of this system would be the NVIDIA GeForce RTX 4090. This GPU, with its 24 GB of VRAM and immense processing power, was selected to handle the computational demands of AI training, simulations, and real-time applications.

The RTX 4090 wasn’t just about raw power; it was about flexibility. This GPU allows me to experiment with larger datasets, more complex models, and even real-time AI applications. Whether I’m working on image recognition, natural language processing, or generative AI, the RTX 4090 is more than capable of handling the task.

Memory and Storage: Speed and Capacity

To complement the CPU and GPU, I knew I needed ample memory and fast storage. I opted for 128GB of DDR5 5600 MT/s RAM to ensure that the system could handle multiple tasks simultaneously without bottlenecks. This is particularly important when working with large datasets or running several virtual machines at once—a common scenario in a versatile homelab environment.

For storage, I selected two 4 TB Samsung 990 PRO Gen4 NVMe SSDs. These drives provide the speed needed for active projects, with read and write speeds of 7,450 and 6,900 MB/s, respectively, ensuring quick access to data and fast boot times. The choice of separate drives rather than a RAID configuration allows me to manage my data more flexibly, adapting to different projects as needed.

Cooling and Power: Reliability and Efficiency

Given the power-hungry components, proper cooling and a reliable power supply were non-negotiable. I chose a Quiet 360mm AIO CPU Liquid Cooling system, equipped with six temperature-controlled, pressure-optimized 120mm fans in a push/pull configuration. This setup ensures that temperatures remain in check, even during prolonged AI training sessions that can generate significant heat.

The power supply is a 1600 Watt Platinum unit with a semi-passive fan that remains silent during idle periods and stays quiet under load. This ensures stable power delivery to all components, providing the reliability needed for a system that will be running almost constantly.

Building for the Future

Finally, I wanted to ensure that this homelab wasn’t just a short-term solution but a platform that could grow with my needs. The ASUS ProArt X670E-Creator WiFi motherboard I selected provides ample expansion slots, including two full-length PCIe slots that can run at x8/x8, which is perfect for future upgrades, whether that means adding more GPUs or expanding storage. With 10G Ethernet and Wi-Fi 6E, this system is also well-equipped for high-speed networking, both wired and wireless.

Throughout this process, my choices were heavily influenced by this Network Chuck video. His insights into building a system for local AI, particularly the importance of choosing the right balance of power and flexibility, resonated with my own goals. Watching his approach to hosting AI models locally helped solidify my decisions around components and made me confident that I was on the right track.

With all these components selected, I turned to Steiger Dynamics to assemble the system. Their expertise in custom builds meant that I didn’t have to worry about the finer details of putting everything together; I could focus on what mattered most—getting the system up and running so I could start experimenting.

Bringing the System to Life: Initial Setup and First Experiments

Once the system arrived, I was eager to get everything up and running. Unboxing the hardware was an exciting moment—seeing all the components I had carefully selected come together in a beautifully engineered machine was incredibly satisfying. But as any tech enthusiast knows, the real magic happens when you power on the system for the first time.

Setting Up Proxmox and Virtualized Environments

For this build, I chose to run Proxmox as the primary operating system. Proxmox is a powerful open-source virtualization platform that allows me to create and manage multiple virtual machines (VMs) on a single physical server. This choice provided the flexibility to run different operating systems side by side, making the most of the system’s powerful hardware.

To streamline the setup process, I utilized some excellent Proxmox helper scripts available on GitHub. These scripts made it easier to configure and manage my virtual environments, saving me time and ensuring that everything was optimized for performance right from the start.

The first VM I set up was Ubuntu 22.04 LTS, which would serve as the main environment for AI development. Ubuntu’s long-term support and robust package management make it an ideal choice for a homelab focused on AI and development. The installation process within Proxmox was smooth, and soon I had a fully functional virtual environment ready for configuration.

I started by installing the necessary drivers and updates, ensuring that the NVIDIA RTX 4090 and other components were operating at peak performance. The combination of the AMD Ryzen 9 7900X3D CPU and the RTX 4090 GPU provided a seamless experience, handling everything I threw at it with ease. With the virtualized Ubuntu environment fully updated and configured, it was time to dive into my first experiments.
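The first sanity check I run in a fresh GPU VM looks something like this (a sketch; the exact driver package depends on the Ubuntu release and on how the GPU is passed through from Proxmox):

    # Install the recommended NVIDIA driver, then reboot
    sudo ubuntu-drivers autoinstall
    sudo reboot

    # After the reboot, confirm the RTX 4090 and driver are visible
    nvidia-smi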

Running the First AI Models

With the system ready, I turned my attention to running AI models locally using Ollama as the model management system. Ollama provided an intuitive way to manage and deploy models on my new setup, ensuring that I could easily switch between different models and configurations depending on the project at hand.

The first model I downloaded was the 24B Gemma model. The process was straightforward, thanks to the ample power and storage provided by the new setup. The RTX 4090 handled the model with impressive speed, allowing me to explore its capabilities in real-time. I could experiment with different parameters, tweak the model, and see the results almost instantaneously.
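For anyone following along, getting a model in place through Ollama is only a couple of commands (a sketch; substitute whatever Gemma tag matches the variant you want to run):

    # Pull a Gemma variant and start an interactive session with it (example tag)
    ollama pull gemma2:27b
    ollama run gemma2:27b

    # List the models that are downloaded and their size on disk
    ollama list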

Exploring Practical Applications: Unlocking the Potential of My Homelab

With the system fully operational and the Gemma model successfully deployed, I began exploring the practical applications of my new homelab. The flexibility and power of this setup meant that the possibilities were virtually endless, and I was eager to dive into projects that could take full advantage of the capabilities I now had at my disposal.

Podcast Archive Project

One of the key projects I’ve been focusing on is my podcast archive project. With the large Gemma model running locally, I’ve been able to experiment with using AI to transcribe, analyze, and categorize vast amounts of podcast content. The speed and efficiency of the RTX 4090 have transformed what used to be a time-consuming process into something I can manage seamlessly within my homelab environment.

The ability to run complex models locally has also allowed me to iterate rapidly on how I approach the organization and retrieval of podcast data. I’ve been experimenting with different methods for tagging and indexing content, making it easier to search and interact with large archives. This project has been particularly rewarding, as it combines my love of podcasts with the cutting-edge capabilities of AI.

General Conversational Interfaces

Another area I’ve been exploring is setting up general conversational interfaces. With the Gemma model’s conversational abilities, I’ve been able to create clients that facilitate rich, interactive dialogues. Whether for casual conversation, answering questions, or exploring specific topics, these interfaces have proven to be incredibly versatile.

Getting the models up and running with these clients was a straightforward process, and I’ve been experimenting with different use cases—everything from personal assistants to educational tools. The flexibility of the Gemma model allows for a wide range of conversational applications, making this an area ripe for further exploration.

Expanding the Homelab’s Capabilities

While I’m already taking full advantage of the system’s current capabilities, I’m constantly thinking about ways to expand and optimize the homelab further. Whether it’s adding more storage, integrating additional GPUs for even greater computational power, or exploring new software platforms that can leverage the hardware, the possibilities are exciting.

The Journey Continues

This is just the beginning of my exploration into what this powerful homelab can do. With the hardware now in place, I’m eager to dive into a myriad of projects, from refining my podcast archive system to pushing the boundaries of conversational AI. The possibilities are endless, and the excitement of discovering new applications and optimizing workflows keeps me motivated.

As I continue to explore and experiment, I’ll be sharing my experiences, insights, and challenges along the way. There’s a lot more to come, and I’m excited to see where this journey takes me. I invite you, my readers, to come along for the ride—whether you’re building your own homelab, curious about AI, or just interested in technology. Together, we’ll see just how far we can push the boundaries of what’s possible with this incredible setup.