A futuristic glowing notebook on a wooden desk with a cup of coffee and floating geometric shapes.

Reading List 6

This week’s reading list is a mix of high-level theory and low-level pragmatism. I found myself bouncing between the philosophical implications of how we build AI and the immediate satisfaction of writing a good Go component.

[article] The Century-Long Pause in Fundamental Physics. The author argues that physics has stagnated by swapping “ontology-first” theory for mathematical models that merely fit data. This debate perfectly mirrors current machine learning disputes about whether LLMs build internal world models or just pattern-match at scale, which is the open empirical front currently being adjudicated in mechanistic interpretability.

[release] Onyx Has Released a New Remote Page Turner Called Tappy. I wish Amazon would support page turners for their Kindle line. It would be great if they supported a device as delightful as this one.

[blog] The agent principal-agent problem. This is a great look at one of the biggest problems with agentic development: code review. In my open source work, I now use a pattern where I work with an agent to make a change, test it locally, and create a pull request before having another agent review the code. This back-and-forth works well and strikes a good balance between efficiency and keeping a mental model of the codebase.

[article] ReMarkable Paper Pure wants to be the only notebook you’ll ever need. I have always liked the reMarkable tablets, but every time I try one I miss having my Kindle library alongside it. Reading and writing are deeply linked for me, which is why I recently got a Kindle Scribe Colorsoft and found it really hits the mark for what I want.

[blog] Just Fucking Use Go. I have recently been working on a project that has a Go component. This is the first time I have really started to look at the language, and the post inspires me to spend more time with it.

I built my 7MB Full AI Terminal in Rust & Tauri. This is a neat open source AI terminal. It feels similar to Warp but is a lot smaller.

[article] Computer Use Is 45x More Expensive Than Structured APIs. I am not surprised at all by these findings. I think computer use will remain a last resort, and a lot of apps will expose some kind of API for an agent to use instead. My guess is that computer use eventually becomes the way we automate unmaintained applications that need to fit into an agentic workflow.

A conceptual illustration showing sound waves passing through a prism and refracting into a 3D scatter plot of colored clusters, representing different speaker identities in vector space.

The Fingerprint of Sound

Last year, I spent a lot of time obsessed with the concept of embeddings. I wrote about how they act as a bridge, transforming the messy, unstructured world of human language into a clean, numerical landscape that computers can understand. In my series on the topic, I explored how text embeddings allow us to map concepts in space: how they let us measure that “king” sits close to “queen,” or find a podcast episode about “economic growth” even if the specific keywords never appear in the transcript.
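To make "close" concrete, here is a minimal sketch of the usual measure, cosine similarity. The three-dimensional vectors are invented for illustration; real text embeddings come from a model and have hundreds or thousands of dimensions. The helper names `cosine_similarity` and `nearest` are my own, not from any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "motorcycle": [0.10, 0.05, 0.95],
}


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


if __name__ == "__main__":
    for other in ("queen", "motorcycle"):
        score = cosine_similarity(toy_vectors["king"], toy_vectors[other])
        print(f"king vs {other}: {score:.3f}")
    print("nearest to king:", nearest("king", toy_vectors))
```

With these toy numbers, "king" scores far higher against "queen" than against "motorcycle". Semantic search over transcripts rests on the same comparison, just with model-generated vectors.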

For me, grasping text embeddings was a watershed moment. It turned AI from a black box into a geometry problem I could solve. But recently, my friend Pete Warden published a post that clicked another piece of the puzzle into place for me, moving that geometry from the page to the ear.

In his post, Speech Embeddings for Engineers, Pete tackles the problem of diarization—the technical term for figuring out “who spoke when” in an audio recording. If you’ve followed my podcast archive project, you know this has been a thorn in my side. I have thousands of transcripts, but they are largely monolithic blocks of text. I know what was said, but often I lose the context of who said it.

Pete’s explanation is brilliant because it leverages the exact same intuition we developed for text. Just as a text embedding captures the semantic “fingerprint” of a sentence, a speech embedding captures the vocal fingerprint of a speaker.

The mental shift is fascinating. When we embed text, we are mapping meaning. We want the vector for “dog” to be close to “puppy” and far from “motorcycle.” But when we embed speech for diarization, we don’t care about the meaning of the words at all. A speaker could be whispering a love sonnet or screaming a grocery list; semantically, those are worlds apart. But acoustically—in terms of timbre, pitch, and cadence—they share an undeniable identity.

Pete includes a Colab notebook that demonstrates this beautifully. It’s a joy to run through because it demystifies the process entirely. He walks you through taking short clips of audio, running them through a model, and visualizing the output.

Suddenly, you aren’t looking at waveforms anymore. You’re looking at clusters. You can see, visually, where one voice ends and another begins. It turns the murky problem of distinguishing speakers in a crowded room into a clean clustering problem, something any engineer can wrap their head around.
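The clustering step can be sketched in a few lines. This is not the code from Pete's notebook. It is a simple greedy threshold clustering I am swapping in for illustration, and it assumes each audio segment has already been turned into an embedding by some speech model. The segment vectors, the start times, the `assign_speakers` name, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: a segment joins the most similar known speaker
    if the similarity reaches `threshold`; otherwise it starts a new speaker."""
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments assigned to each speaker
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            # Update the running mean so the centroid tracks the speaker.
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
    return labels


# Hypothetical (start_time_seconds, embedding) pairs, invented for illustration.
segments = [
    (0.0, [0.90, 0.10, 0.20]),
    (2.5, [0.85, 0.15, 0.25]),
    (5.0, [0.10, 0.90, 0.30]),
    (7.5, [0.88, 0.12, 0.18]),
    (10.0, [0.15, 0.95, 0.25]),
]

if __name__ == "__main__":
    labels = assign_speakers([emb for _, emb in segments])
    for (start, _), label in zip(segments, labels):
        print(f"{start:5.1f}s  speaker {label}")
```

On this toy data the segments fall into two speakers, labeled `[0, 0, 1, 0, 1]`, which is a "who spoke when" timeline in miniature. A real pipeline would likely use a sturdier clustering method and a tuned threshold, but the shape of the problem is the same.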

This reinforces a recurring theme for me: the power of small, composable tools. We often look for massive, end-to-end APIs to solve our problems—a “magic box” that takes audio and returns a perfect script. But understanding the primitives is where the real power lies. By understanding speech embeddings, we aren’t just consumers of a transcription service; we are architects who can build systems that listen, identify, and understand the nuance of conversation.

If you’ve ever wrestled with audio data, or if you just want to see how the concept of embeddings extends beyond text, I highly recommend finding a quiet hour to work through Pete’s notebook. It might just change how you hear the data.