<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Tokenisation on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/tokenisation/</link>
    <description>Recent content in Tokenisation on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Wed, 04 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/tokenisation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs</title>
      <link>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</guid>
      <description>There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense — the model knows what it wants to output but cannot output it. That distinction matters.</description>
      <content:encoded><![CDATA[<p>Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.</p>
<p>That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody&rsquo;s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.</p>
<p>I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.</p>
<h2 id="the-incident">The incident</h2>
<p>In 2025, Vgel published an analysis of exactly this failure <a href="#ref-1">[1]</a>. The piece is worth reading in full, but the core finding deserves unpacking here.</p>
<p>When a model is asked to produce a seahorse emoji, something specific happens at the level of the model&rsquo;s internal representations. Using logit lens analysis — a technique for inspecting the model&rsquo;s intermediate layer activations as if they were already projecting into vocabulary space <a href="#ref-4">[4]</a> — it is possible to track what the model&rsquo;s &ldquo;working answer&rdquo; looks like at each layer of the transformer. What Vgel found is that in the late layers, the model does construct something that functions like a &ldquo;seahorse + emoji&rdquo; representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.</p>
<p>The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model&rsquo;s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.</p>
<p>What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between &ldquo;I found the seahorse emoji&rdquo; and &ldquo;I found the best available approximation to the seahorse emoji.&rdquo; The output is produced with the same confidence either way.</p>
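<p>The shape of that projection step can be made concrete with a toy sketch. The following is illustrative numpy, not any real model&rsquo;s weights: a hypothetical unembedding matrix whose rows stand in for token embeddings, and a random query vector standing in for the assembled &ldquo;seahorse + emoji&rdquo; representation.</p>

```python
import numpy as np

# Toy stand-ins: a hypothetical unembedding matrix (rows = token
# embeddings) and a query vector for the model's semantic intent.
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
W_U = rng.normal(size=(vocab_size, d_model))
query = rng.normal(size=d_model)

logits = W_U @ query
best_token = int(np.argmax(logits))

# argmax always succeeds: some token is "nearest" whether or not any
# token actually matches the intent. The size of the gap is computed
# here and then thrown away.
```

<p>Nothing in this step can fail. The argmax returns a token id no matter how poor the best match is, which is the structural point: the quality of the approximation is present in the logits and then discarded at the interface.</p>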
<p>Vgel&rsquo;s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 were all in the mix. The specific wrong answer differed between models, which itself is revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model&rsquo;s fallback lands somewhere different in the emoji neighbourhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: &ldquo;I know what you want, and it does not exist as a token I can emit.&rdquo;</p>
<p>Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vgel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (&ldquo;wait, that is a dolphin, not a seahorse&rdquo;) rather than vocabulary-knowledge (&ldquo;there is no seahorse token&rdquo;), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.</p>
<h2 id="the-structural-failure-vocabulary-completeness-assumption">The structural failure: vocabulary completeness assumption</h2>
<p>Here is the core conceptual point, stated as cleanly as I can.</p>
<p>Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.</p>
<p>The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. <a href="#ref-5">[5]</a> and discussed in more detail in my <a href="/posts/strawberry-tokenisation/">tokenisation post</a>. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.</p>
<p>The problem is that the model treats these two representations as if they were the same. If world-knowledge says &ldquo;seahorses should have emoji,&rdquo; the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between &ldquo;I know X exists&rdquo; and &ldquo;I can express X.&rdquo; I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.</p>
<p>This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.</p>
<p>The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model&rsquo;s internal knowledge with verified external knowledge. These are hard problems, but they are the right class of remedies for factual hallucination.</p>
<p>When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.</p>
<p>A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write &ldquo;there is no direct English equivalent; the closest is approximately&hellip;&rdquo; and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.</p>
<h2 id="a-formal-language-perspective">A formal language perspective</h2>
<p>For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary <code>V = {t₁, …, tₙ}</code> <a href="#ref-5">[5]</a>. Every output is an element of <code>V</code>. The type system has no room for a &ldquo;near miss&rdquo; or an &ldquo;I cannot express this precisely&rdquo; — the output is always a token, drawn from the inventory established at training time.</p>
<p>This is a closed-world assumption in the formal sense <a href="#ref-6">[6]</a>: the system treats any concept not representable as an element of <code>V</code> as simply absent. There is no seahorse emoji token, so the model&rsquo;s generation step has no way to represent &ldquo;seahorse emoji&rdquo; as a distinct, exact concept. It can only represent &ldquo;nearest token to seahorse emoji in embedding space,&rdquo; which it does silently, with the same confidence it would report for a precise match.</p>
<p>The mismatch is between two representations: the model&rsquo;s internal semantic space — continuous, high-dimensional, geometrically capable of representing &ldquo;seahorse + emoji&rdquo; as a coherent position — and its output type, which is a discrete, finite categorical distribution. The lm_head projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.</p>
<p>A richer output type would distinguish precise matches from approximations — an <code>Exact&lt;Token&gt;</code> versus an <code>Approximate&lt;Token&gt;</code>, or in standard option-type terms, a generation step that can return <code>None</code> when no token in <code>V</code> adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation <a href="#ref-4">[4]</a>. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why &ldquo;make the model more calibrated about facts&rdquo; addresses the wrong layer of the problem.</p>
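<p>Stated as code, such an output type might look like the sketch below. This is illustrative Python, not an existing architecture: the generation step gets an option type, returning a token id only when the nearest token is a genuinely close match, and <code>None</code> otherwise. The cosine metric and the threshold value are assumptions made for the sake of the sketch.</p>

```python
import numpy as np
from typing import Optional

def decode_with_abstention(query: np.ndarray, W_U: np.ndarray,
                           min_similarity: float = 0.9) -> Optional[int]:
    # Return the nearest token id only if it is a close match in cosine
    # terms; otherwise abstain. Metric and threshold are illustrative.
    sims = (W_U @ query) / (np.linalg.norm(W_U, axis=1)
                            * np.linalg.norm(query) + 1e-9)
    best = int(np.argmax(sims))
    return best if sims[best] >= min_similarity else None

rng = np.random.default_rng(1)
W_U = rng.normal(size=(1000, 64))
exact = decode_with_abstention(W_U[42], W_U)             # a perfect match
miss = decode_with_abstention(rng.normal(size=64), W_U)  # no close token
```

<p>The standard lm_head is this function with the threshold removed: it always returns <code>best</code>, and the similarity score never reaches the interface.</p>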
<h2 id="the-ununderstandable-companion">The &ldquo;ununderstandable&rdquo; companion</h2>
<p>Shortly after the seahorse emoji incident circulated, a Reddit thread titled &ldquo;it&rsquo;s just the seahorse emoji all over again&rdquo; collected user reports of a structurally similar failure on the English word &ldquo;ununderstandable&rdquo; <a href="#ref-2">[2]</a>. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.</p>
<p>&ldquo;Ununderstandable&rdquo; is morphologically valid English. The prefix <em>un-</em> combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. &ldquo;Understandable&rdquo; is an unambiguous adjective. &ldquo;Ununderstandable&rdquo; means what it looks like it means, constructed by exactly the same rule that gives you all the other <em>un-</em> words. There is nothing wrong with it grammatically or semantically.</p>
<p>It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled &ldquo;ununderstandable&rdquo; as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.</p>
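<p>A toy segmentation makes the fragmentation concrete. This is a greedy longest-match sketch over a hypothetical subword inventory (a simplification of real BPE, which merges rather than matches), but it shows the shape of the problem: the pieces are all present, the whole word is not.</p>

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match segmentation: at each position, take the
    # longest vocabulary entry, falling back to single characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical inventory: common subwords, but no whole-word entry.
vocab = {"un", "under", "stand", "able", "standable"}
print(greedy_segment("ununderstandable", vocab))
# → ['un', 'under', 'standable']
```

<p>The model has seen each of those pieces millions of times, but almost no examples of this particular combination, which is exactly the condition under which a robust whole-word representation fails to form.</p>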
<p>The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that <em>un-</em> prefixation is productive, and it has learned that &ldquo;understandable&rdquo; is a word. But its trained representations do not include &ldquo;ununderstandable&rdquo; as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define &ldquo;ununderstandable,&rdquo; models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with &ldquo;misunderstandable&rdquo; or &ldquo;incomprehensible&rdquo; in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.</p>
<p>The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying &ldquo;I cannot recognise this because it is not in my training data,&rdquo; which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.</p>
<p>The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: &ldquo;this thing does not exist and I should say so&rdquo; versus &ldquo;this thing exists but I cannot produce it cleanly.&rdquo; In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the &ldquo;ununderstandable&rdquo; case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.</p>
<h2 id="what-this-means-for-users">What this means for users</h2>
<p>The practical implication is narrow but important. Asking a language model &ldquo;do you have X?&rdquo; — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser&rsquo;s vocabulary table.</p>
<p>This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.</p>
<p>The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the <a href="/posts/car-wash-grounding/">grounding failure post</a> and its companion piece on <a href="/posts/car-wash-walk/">pragmatic inference</a> for related examples, and the <a href="/posts/ai-detectors-systematic-minds/">AI detectors post</a> for a case where self-knowledge failures about writing style have real social consequences.</p>
<p>The fix is not &ldquo;make models more honest&rdquo; in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.</p>
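<p>The shape of such a check is not exotic. Python&rsquo;s own Unicode name table is an example of exactly this kind of explicit, queryable metadata: the lookup either succeeds or fails loudly, with no silent nearest-neighbour fallback. (This queries Unicode character names, not any model&rsquo;s token vocabulary, but the interface is the point.)</p>

```python
import unicodedata

def unicode_char_exists(name: str) -> bool:
    # Explicit metadata lookup: succeeds or raises, never approximates.
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False

print(unicode_char_exists("DOLPHIN"))   # True  (U+1F42C exists)
print(unicode_char_exists("SEAHORSE"))  # False (no such character)
```

<p>A generation pipeline with access to an equivalent lookup over its own vocabulary could answer &ldquo;can I emit this?&rdquo; as a query rather than as a guess generated by the same weights that contain the gap.</p>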
<h2 id="the-last-mile">The last mile</h2>
<p>There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of &ldquo;seahorse emoji&rdquo; is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model&rsquo;s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between &ldquo;I successfully retrieved the seahorse emoji&rdquo; and &ldquo;I retrieved the nearest available approximation to what I was looking for.&rdquo; From the model&rsquo;s operational perspective, it completed the task.</p>
<p>This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.</p>
<p>The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work <a href="#ref-3">[3]</a> that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.</p>
<p>For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Vogel, T. (2025). <em>Why do LLMs freak out over the seahorse emoji?</em> <a href="https://vgel.me/posts/seahorse/">https://vgel.me/posts/seahorse/</a></p>
<p><span id="ref-2"></span>[2] Reddit user (2025). It&rsquo;s just the seahorse emoji all over again. <em>r/OpenAI</em>. <a href="https://www.reddit.com/r/OpenAI/comments/1rkbeel/">https://www.reddit.com/r/OpenAI/comments/1rkbeel/</a> (reported; not independently verified)</p>
<p><span id="ref-3"></span>[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. <em>Transformer Circuits Thread</em>. <a href="https://transformer-circuits.pub/2021/framework/index.html">https://transformer-circuits.pub/2021/framework/index.html</a></p>
<p><span id="ref-4"></span>[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. <a href="https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/">https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/</a></p>
<p><span id="ref-5"></span>[5] Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine translation of rare words with subword units. <em>Proceedings of ACL 2016</em>, 1715–1725.</p>
<p><span id="ref-6"></span>[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire &amp; J. Minker (Eds.), <em>Logic and Data Bases</em> (pp. 55–76). Plenum Press, New York.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-01</strong>: Updated reference [1]: author name to &ldquo;Vogel, T.&rdquo; and title to the published blog post title &ldquo;Why do LLMs freak out over the seahorse emoji?&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
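<p>The dispatch between the two paths reduces to a small decision. A minimal sketch, with a hypothetical threshold rather than the tool&rsquo;s actual constant:</p>

```python
def needs_ocr(extracted_text: str, min_chars: int = 100) -> bool:
    # If native extraction yields almost no text, the pages are probably
    # scanned images and the file should be routed to OCR instead.
    return len(extracted_text.strip()) < min_chars

print(needs_ocr(""))                                      # True: pure scan
print(needs_ocr("Rechnung Nr. 42 vom 03.02.2025 " * 10))  # False: use it
```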
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving comfortable headroom for the
prompt and response in most configurations. For documents that exceed this, the extraction
is truncated — typically to the first N characters, on the reasonable
assumption that titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
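<p>As a sketch (the function names are illustrative; the four-characters-per-token rule and the 28,000-token default come from the discussion above):</p>

```python
def estimate_tokens(text: str) -> int:
    # English-text heuristic: roughly four characters per token.
    return len(text) // 4

def truncate_to_budget(text: str, max_tokens: int = 28_000) -> str:
    # Keep the front of the document, where titles, dates, and document
    # types tend to appear, and stay within the token budget.
    return text[: max_tokens * 4]

print(estimate_tokens("a" * 30_000))  # 7500, matching the example above
```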
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
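<p>The scoring idea can be sketched as follows. This is illustrative code in the spirit of the description above (the rule tables, weights, and threshold are invented for the example, not taken from the repository):</p>

```python
import re

# Hypothetical rules: each one that fires adds to a score and may fill
# a metadata field; a threshold then decides whether the LLM is needed.
DATE_PATTERNS = [
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"), "{0}{1}{2}"),    # 2026-03-04
    (re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b"), "{2}{1}{0}"),  # 04.03.2026
]
TYPE_MARKERS = {
    "invoice": ["rechnung", "invoice"],
    "receipt": ["beleg", "receipt"],
    "credit-note": ["gutschrift"],
}

def score_document(text: str) -> tuple[int, dict]:
    score, meta = 0, {}
    for pattern, fmt in DATE_PATTERNS:
        m = pattern.search(text)
        if m:
            meta["date"] = fmt.format(*m.groups())  # normalise to YYYYMMDD
            score += 2
            break
    lowered = text.lower()
    for category, markers in TYPE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            meta["category"] = category
            score += 2
            break
    return score, meta

def needs_llm(text: str, threshold: int = 4) -> bool:
    """True when the heuristic pass is not confident enough on its own."""
    score, _ = score_document(text)
    return score < threshold
```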
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
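<p>A sketch of that defensive parsing (hypothetical; the real <code>llm_parsing.py</code> will differ in detail, and the allowed category set here is invented):</p>

```python
import json
import re

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract", "misc"}

def parse_llm_response(raw: str) -> dict:
    """Parse the model's reply into the schema, with fallbacks for
    malformed JSON and out-of-range values."""
    # Models sometimes wrap JSON in markdown fences or prose; pull out
    # the first {...} block before parsing.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    try:
        data = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        data = {}
    return {
        "date": data.get("date") or None,
        "category": (data.get("category")
                     if data.get("category") in ALLOWED_CATEGORIES
                     else "misc"),
        "keywords": [str(k) for k in (data.get("keywords") or [])][:3],
        "summary": str(data.get("summary") or "")[:60],
    }
```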
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
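<p>These checks are cheap to express. A hedged sketch (the function name and exact rules are illustrative, not the repository&rsquo;s <code>llm_schema.py</code>):</p>

```python
from datetime import date

def validate_fields(meta: dict, source_text: str) -> dict:
    """Drop values the model may have invented rather than extracted."""
    cleaned = dict(meta)
    # A date in the future cannot have come from a real document header.
    d = cleaned.get("date")
    if d and (len(d) != 8 or not d.isdigit()
              or d > date.today().strftime("%Y%m%d")):
        cleaned["date"] = None
    # Keywords must actually occur in the extracted text.
    lowered = source_text.lower()
    cleaned["keywords"] = [k for k in cleaned.get("keywords", [])
                           if k.lower() in lowered]
    return cleaned
```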
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
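<p>Both steps fit in a few lines. A sketch under the conventions described above (function names are illustrative, not the actual <code>filename.py</code> API):</p>

```python
import re

def sanitise(part: str) -> str:
    """Lowercase, hyphens only: anything outside [a-z0-9] collapses
    to a single hyphen, and leading/trailing hyphens are stripped."""
    part = part.lower()
    part = re.sub(r"[^a-z0-9]+", "-", part)
    return part.strip("-")

def resolve_collision(taken: set[str], stem: str, suffix: str = ".pdf") -> str:
    """Append _01, _02, ... until the target name is free, and record
    the name as taken."""
    candidate, counter = f"{stem}{suffix}", 0
    while candidate in taken:
        counter += 1
        candidate = f"{stem}_{counter:02d}{suffix}"
    taken.add(candidate)
    return candidate
```

<p>One way to get the order-independence the post describes is to sort the candidate records canonically before resolving, so the counter suffixes are assigned in a fixed order rather than depending on filesystem traversal.</p>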
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small model in the 3–8B range running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
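<p>The request itself is a plain OpenAI-style completions call. A sketch (the endpoint shape follows the OpenAI-compatible API that Ollama exposes; the <code>max_tokens</code> and <code>temperature</code> values here are illustrative choices, not the tool&rsquo;s defaults):</p>

```python
import json
from urllib import request

ENDPOINT = "http://127.0.0.1:11434/v1/completions"  # local Ollama default

def build_payload(prompt: str, model: str = "qwen2.5:3b") -> dict:
    """Assemble an OpenAI-style completions request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 256,    # the JSON reply is small; cap the response
        "temperature": 0.0,   # extraction wants determinism, not creativity
    }

def call_local_llm(prompt: str) -> str:
    """Send the request to the local endpoint; no external traffic."""
    req = request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```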
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Most people
get this right on the first try; most large language models get it wrong,
returning &ldquo;two&rdquo; with apparent confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
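<p>A toy illustration of the effect (this uses greedy longest-match over an invented mini-vocabulary; real BPE applies learned merge rules in order, but the resulting split of &ldquo;strawberry&rdquo; is the same):</p>

```python
# Invented mini-vocabulary, chosen so that "strawberry" splits the way
# the cl100k_base tokeniser splits it.
VOCAB = {"str", "aw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

def segment(word: str) -> list[str]:
    """Greedy longest-match segmentation: at each position, take the
    longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens
```

<p>Token by token, the &ldquo;r&rdquo; counts are 1, 0, and 2; the model&rsquo;s input is the three pieces, never the ten characters.</p>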
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
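<p>The merge loop itself is short. A minimal sketch of the training side, following Sennrich et al. (2016) but ignoring word-frequency weighting and byte-level details:</p>

```python
from collections import Counter

def learn_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]        # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))      # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the fused symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                i += 1
    return merges
```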
<p>What BPE optimises for is compression efficiency, not character-level
transparency. The token &ldquo;straw&rdquo; encodes the sequence s-t-r-a-w as a
unit, but that character sequence is not explicitly represented anywhere
inside the model once the embedding lookup has occurred. The model
receives a vector for &ldquo;straw,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;straw&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned what &ldquo;straw&rdquo; means, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>The characteristic wrong answer is &ldquo;two r&rsquo;s,&rdquo; not &ldquo;one&rdquo; or &ldquo;four&rdquo; or
&ldquo;none.&rdquo; This is not random noise. It is a systematic error, and systematic
errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or ones close enough to it, often enough that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
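<p>The same scaffolding, written as code rather than a prompt (a trivial sketch, but it is precisely the operation the character-by-character expansion buys):</p>

```python
def count_with_expansion(word: str, target: str) -> int:
    """Mimic the 'spell it out first' prompt: expand the word into
    individual characters, then count over the flat list."""
    characters = list(word)   # s, t, r, a, w, b, e, r, r, y
    return sum(1 for c in characters if c == target)
```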
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). <strong>Neural machine
translation of rare words with subword units.</strong> <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
