The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs

There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense: the model knows what it wants to produce but has no way to emit it. That distinction matters.

4 March 2026 · 17 min · Sebastian Spicker
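The gap is easy to confirm from Python's standard library: `unicodedata.lookup` resolves official Unicode character names, and no character named for a seahorse exists. (This check is an aside for illustration, not taken from the article.)

```python
import unicodedata

# Unicode has a dolphin (U+1F42C), but no seahorse at any codepoint.
print(unicodedata.lookup("DOLPHIN"))

try:
    unicodedata.lookup("SEAHORSE")
except KeyError as err:
    # lookup() raises KeyError when no character carries that name
    print("no such character:", err)
```

A model asked for the seahorse emoji therefore has no token it could correctly produce; the best it can do is emit a nearby animal or admit defeat.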

The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting

Everyone has a Downloads folder full of “scan0023.pdf” and “document(3)-final-FINAL.pdf”. Renaming them by content sounds trivial: read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why cheap heuristics should run before any model call. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.

22 March 2025 · 9 min · Sebastian Spicker

Three Rs in Strawberry: What the Viral Counting Test Actually Reveals

In September 2024, OpenAI revealed that its new o1 model had been code-named “Strawberry” internally, the same word whose letters language models famously fail to count. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.

7 October 2024 · 6 min · Sebastian Spicker
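The symptom comes from subword tokenization: the model sees token IDs, not letters. The toy tokenizer below uses an invented vocabulary and greedy longest-match, far simpler than the learned BPE merges real models use, but it shows how a ten-letter word can collapse into two opaque units.

```python
# Toy illustration only: the vocabulary is invented, and real tokenizers
# learn their subword merges from data rather than using a fixed list.
TOY_VOCAB = ["straw", "berry", "st", "raw"]

def toy_tokenize(word: str, vocab=TOY_VOCAB) -> list[str]:
    """Greedy longest-match segmentation, purely for demonstration."""
    tokens = []
    i = 0
    while i < len(word):
        for piece in sorted(vocab, key=len, reverse=True):
            if word.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(toy_tokenize("strawberry"))  # ['straw', 'berry']
```

A model operating on `['straw', 'berry']` never directly observes the three r's; counting them requires recovering character-level structure the tokenizer deliberately threw away.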