<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Tokenisation on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/tokenisation/</link>
    <description>Recent content in Tokenisation on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Wed, 04 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/tokenisation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs</title>
      <link>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</guid>
      <description>There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense — the model knows what it wants to output but cannot output it. That distinction matters.</description>
      <content:encoded><![CDATA[<p>Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.</p>
<p>That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody&rsquo;s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.</p>
<p>I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.</p>
<h2 id="the-incident">The incident</h2>
<p>In 2025, Vgel published an analysis of exactly this failure <a href="#ref-1">[1]</a>. The piece is worth reading in full, but the core finding deserves unpacking here.</p>
<p>When a model is asked to produce a seahorse emoji, something specific happens at the level of the model&rsquo;s internal representations. Using logit lens analysis — a technique for inspecting the model&rsquo;s intermediate layer activations as if they were already projecting into vocabulary space <a href="#ref-4">[4]</a> — it is possible to track what the model&rsquo;s &ldquo;working answer&rdquo; looks like at each layer of the transformer. What Vgel found is that in the late layers, the model does construct something that functions like a &ldquo;seahorse + emoji&rdquo; representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.</p>
<p>The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model&rsquo;s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.</p>
<p>What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between &ldquo;I found the seahorse emoji&rdquo; and &ldquo;I found the best available approximation to the seahorse emoji.&rdquo; The output is produced with the same confidence either way.</p>
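<p>The shape of that projection step can be made concrete with a toy sketch. The following is illustrative numpy, not any real model&rsquo;s weights: a hypothetical unembedding matrix whose rows stand in for token embeddings, and a random query vector standing in for the assembled &ldquo;seahorse + emoji&rdquo; representation.</p>

```python
import numpy as np

# Toy stand-ins: a hypothetical unembedding matrix (rows = token
# embeddings) and a query vector for the model's semantic intent.
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
W_U = rng.normal(size=(vocab_size, d_model))
query = rng.normal(size=d_model)

logits = W_U @ query
best_token = int(np.argmax(logits))

# argmax always succeeds: some token is "nearest" whether or not any
# token actually matches the intent. The size of the gap is computed
# here and then thrown away.
```

<p>Nothing in this step can fail. The argmax returns a token id no matter how poor the best match is, which is the structural point: the quality of the approximation is present in the logits and then discarded at the interface.</p>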
<p>Vgel&rsquo;s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 were all in the mix. The specific wrong answer differed between models, which itself is revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model&rsquo;s fallback lands somewhere different in the emoji neighbourhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: &ldquo;I know what you want, and it does not exist as a token I can emit.&rdquo;</p>
<p>Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vgel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (&ldquo;wait, that is a dolphin, not a seahorse&rdquo;) rather than vocabulary-knowledge (&ldquo;there is no seahorse token&rdquo;), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.</p>
<h2 id="the-structural-failure-vocabulary-completeness-assumption">The structural failure: vocabulary completeness assumption</h2>
<p>Here is the core conceptual point, stated as cleanly as I can.</p>
<p>Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.</p>
<p>The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. <a href="#ref-5">[5]</a> and discussed in more detail in my <a href="/posts/strawberry-tokenisation/">tokenisation post</a>. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.</p>
<p>The problem is that the model treats these two representations as if they were the same. If world-knowledge says &ldquo;seahorses should have emoji,&rdquo; the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between &ldquo;I know X exists&rdquo; and &ldquo;I can express X.&rdquo; I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.</p>
<p>This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.</p>
<p>The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model&rsquo;s internal knowledge with verified external knowledge. These are hard problems, but they are the right class of remedies for factual hallucination.</p>
<p>When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.</p>
<p>A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write &ldquo;there is no direct English equivalent; the closest is approximately&hellip;&rdquo; and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.</p>
<h2 id="a-formal-language-perspective">A formal language perspective</h2>
<p>For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary <code>V = {t₁, …, tₙ}</code> <a href="#ref-5">[5]</a>. Every output is an element of <code>V</code>. The type system has no room for a &ldquo;near miss&rdquo; or an &ldquo;I cannot express this precisely&rdquo; — the output is always a token, drawn from the inventory established at training time.</p>
<p>This is a closed-world assumption in the formal sense <a href="#ref-6">[6]</a>: the system treats any concept not representable as an element of <code>V</code> as simply absent. There is no seahorse emoji token, so the model&rsquo;s generation step has no way to represent &ldquo;seahorse emoji&rdquo; as a distinct, exact concept. It can only represent &ldquo;nearest token to seahorse emoji in embedding space,&rdquo; which it does silently, with the same confidence it would report for a precise match.</p>
<p>The mismatch is between two representations: the model&rsquo;s internal semantic space — continuous, high-dimensional, geometrically capable of representing &ldquo;seahorse + emoji&rdquo; as a coherent position — and its output type, which is a discrete, finite categorical distribution. The lm_head projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.</p>
<p>A richer output type would distinguish precise matches from approximations — an <code>Exact&lt;Token&gt;</code> versus an <code>Approximate&lt;Token&gt;</code>, or in standard option-type terms, a generation step that can return <code>None</code> when no token in <code>V</code> adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation <a href="#ref-4">[4]</a>. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why &ldquo;make the model more calibrated about facts&rdquo; addresses the wrong layer of the problem.</p>
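<p>Stated as code, such an output type might look like the sketch below. This is illustrative Python, not an existing architecture: the generation step gets an option type, returning a token id only when the nearest token is a genuinely close match, and <code>None</code> otherwise. The cosine metric and the threshold value are assumptions made for the sake of the sketch.</p>

```python
import numpy as np
from typing import Optional

def decode_with_abstention(query: np.ndarray, W_U: np.ndarray,
                           min_similarity: float = 0.9) -> Optional[int]:
    # Return the nearest token id only if it is a close match in cosine
    # terms; otherwise abstain. Metric and threshold are illustrative.
    sims = (W_U @ query) / (np.linalg.norm(W_U, axis=1)
                            * np.linalg.norm(query) + 1e-9)
    best = int(np.argmax(sims))
    return best if sims[best] >= min_similarity else None

rng = np.random.default_rng(1)
W_U = rng.normal(size=(1000, 64))
exact = decode_with_abstention(W_U[42], W_U)             # a perfect match
miss = decode_with_abstention(rng.normal(size=64), W_U)  # no close token
```

<p>The standard lm_head is this function with the threshold removed: it always returns <code>best</code>, and the similarity score never reaches the interface.</p>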
<h2 id="the-ununderstandable-companion">The &ldquo;ununderstandable&rdquo; companion</h2>
<p>Shortly after the seahorse emoji incident circulated, a Reddit thread titled &ldquo;it&rsquo;s just the seahorse emoji all over again&rdquo; collected user reports of a structurally similar failure on the English word &ldquo;ununderstandable&rdquo; <a href="#ref-2">[2]</a>. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.</p>
<p>&ldquo;Ununderstandable&rdquo; is morphologically valid English. The prefix <em>un-</em> combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. &ldquo;Understandable&rdquo; is an unambiguous adjective. &ldquo;Ununderstandable&rdquo; means what it looks like it means, constructed by exactly the same rule that gives you all the other <em>un-</em> words. There is nothing wrong with it grammatically or semantically.</p>
<p>It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled &ldquo;ununderstandable&rdquo; as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.</p>
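<p>A toy segmentation makes the fragmentation concrete. This is a greedy longest-match sketch over a hypothetical subword inventory (a simplification of real BPE, which merges rather than matches), but it shows the shape of the problem: the pieces are all present, the whole word is not.</p>

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match segmentation: at each position, take the
    # longest vocabulary entry, falling back to single characters.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical inventory: common subwords, but no whole-word entry.
vocab = {"un", "under", "stand", "able", "standable"}
print(greedy_segment("ununderstandable", vocab))
# → ['un', 'under', 'standable']
```

<p>The model has seen each of those pieces millions of times, but almost no examples of this particular combination, which is exactly the condition under which a robust whole-word representation fails to form.</p>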
<p>The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that <em>un-</em> prefixation is productive, and it has learned that &ldquo;understandable&rdquo; is a word. But its trained representations do not include &ldquo;ununderstandable&rdquo; as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define &ldquo;ununderstandable,&rdquo; models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with &ldquo;misunderstandable&rdquo; or &ldquo;incomprehensible&rdquo; in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.</p>
<p>The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying &ldquo;I cannot recognise this because it is not in my training data,&rdquo; which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.</p>
<p>The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: &ldquo;this thing does not exist and I should say so&rdquo; versus &ldquo;this thing exists but I cannot produce it cleanly.&rdquo; In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the &ldquo;ununderstandable&rdquo; case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.</p>
<h2 id="what-this-means-for-users">What this means for users</h2>
<p>The practical implication is narrow but important. Asking a language model &ldquo;do you have X?&rdquo; — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser&rsquo;s vocabulary table.</p>
<p>This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.</p>
<p>The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the <a href="/posts/car-wash-grounding/">grounding failure post</a> and its companion piece on <a href="/posts/car-wash-walk/">pragmatic inference</a> for related examples, and the <a href="/posts/ai-detectors-systematic-minds/">AI detectors post</a> for a case where self-knowledge failures about writing style have real social consequences.</p>
<p>The fix is not &ldquo;make models more honest&rdquo; in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.</p>
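<p>The shape of such a check is not exotic. Python&rsquo;s own Unicode name table is an example of exactly this kind of explicit, queryable metadata: the lookup either succeeds or fails loudly, with no silent nearest-neighbour fallback. (This queries Unicode character names, not any model&rsquo;s token vocabulary, but the interface is the point.)</p>

```python
import unicodedata

def unicode_char_exists(name: str) -> bool:
    # Explicit metadata lookup: succeeds or raises, never approximates.
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False

print(unicode_char_exists("DOLPHIN"))   # True  (U+1F42C exists)
print(unicode_char_exists("SEAHORSE"))  # False (no such character)
```

<p>A generation pipeline with access to an equivalent lookup over its own vocabulary could answer &ldquo;can I emit this?&rdquo; as a query rather than as a guess generated by the same weights that contain the gap.</p>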
<h2 id="the-last-mile">The last mile</h2>
<p>There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of &ldquo;seahorse emoji&rdquo; is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model&rsquo;s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between &ldquo;I successfully retrieved the seahorse emoji&rdquo; and &ldquo;I retrieved the nearest available approximation to what I was looking for.&rdquo; From the model&rsquo;s operational perspective, it completed the task.</p>
<p>This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.</p>
<p>The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work <a href="#ref-3">[3]</a> that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.</p>
<p>For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Vogel, T. (2025). <em>Why do LLMs freak out over the seahorse emoji?</em> <a href="https://vgel.me/posts/seahorse/">https://vgel.me/posts/seahorse/</a></p>
<p><span id="ref-2"></span>[2] Reddit user (2025). It&rsquo;s just the seahorse emoji all over again. <em>r/OpenAI</em>. <a href="https://www.reddit.com/r/OpenAI/comments/1rkbeel/">https://www.reddit.com/r/OpenAI/comments/1rkbeel/</a> (reported; not independently verified)</p>
<p><span id="ref-3"></span>[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. <em>Transformer Circuits Thread</em>. <a href="https://transformer-circuits.pub/2021/framework/index.html">https://transformer-circuits.pub/2021/framework/index.html</a></p>
<p><span id="ref-4"></span>[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. <a href="https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/">https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/</a></p>
<p><span id="ref-5"></span>[5] Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine translation of rare words with subword units. <em>Proceedings of ACL 2016</em>, 1715–1725.</p>
<p><span id="ref-6"></span>[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire &amp; J. Minker (Eds.), <em>Logic and Data Bases</em> (pp. 55–76). Plenum Press, New York.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-01</strong>: Updated reference [1]: author name to &ldquo;Vogel, T.&rdquo; and title to the published blog post title &ldquo;Why do LLMs freak out over the seahorse emoji?&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
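<p>The dispatch between the two paths reduces to a small decision. A minimal sketch, with a hypothetical threshold rather than the tool&rsquo;s actual constant:</p>

```python
def needs_ocr(extracted_text: str, min_chars: int = 100) -> bool:
    # If native extraction yields almost no text, the pages are probably
    # scanned images and the file should be routed to OCR instead.
    return len(extracted_text.strip()) < min_chars

print(needs_ocr(""))                                      # True: pure scan
print(needs_ocr("Rechnung Nr. 42 vom 03.02.2025 " * 10))  # False: use it
```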
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving comfortable headroom for the
prompt and response in most configurations. For documents that exceed this, the extraction
is truncated — typically to the first N characters, on the reasonable
assumption that titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
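<p>As a sketch (the function names are illustrative; the four-characters-per-token rule and the 28,000-token default come from the discussion above):</p>

```python
def estimate_tokens(text: str) -> int:
    # English-text heuristic: roughly four characters per token.
    return len(text) // 4

def truncate_to_budget(text: str, max_tokens: int = 28_000) -> str:
    # Keep the front of the document, where titles, dates, and document
    # types tend to appear, and stay within the token budget.
    return text[: max_tokens * 4]

print(estimate_tokens("a" * 30_000))  # 7500, matching the example above
```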
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
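<p>The scoring idea can be sketched as follows. This is illustrative code in the spirit of the description above (the rule tables, weights, and threshold are invented for the example, not taken from the repository):</p>

```python
import re

# Hypothetical rules: each one that fires adds to a score and may fill
# a metadata field; a threshold then decides whether the LLM is needed.
DATE_PATTERNS = [
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"), "{0}{1}{2}"),    # 2026-03-04
    (re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b"), "{2}{1}{0}"),  # 04.03.2026
]
TYPE_MARKERS = {
    "invoice": ["rechnung", "invoice"],
    "receipt": ["beleg", "receipt"],
    "credit-note": ["gutschrift"],
}

def score_document(text: str) -> tuple[int, dict]:
    score, meta = 0, {}
    for pattern, fmt in DATE_PATTERNS:
        m = pattern.search(text)
        if m:
            meta["date"] = fmt.format(*m.groups())  # normalise to YYYYMMDD
            score += 2
            break
    lowered = text.lower()
    for category, markers in TYPE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            meta["category"] = category
            score += 2
            break
    return score, meta

def needs_llm(text: str, threshold: int = 4) -> bool:
    """True when the heuristic pass is not confident enough on its own."""
    score, _ = score_document(text)
    return score < threshold
```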
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
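<p>A sketch of that defensive parsing (hypothetical; the real <code>llm_parsing.py</code> will differ in detail, and the allowed category set here is invented):</p>

```python
import json
import re

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract", "misc"}

def parse_llm_response(raw: str) -> dict:
    """Parse the model's reply into the schema, with fallbacks for
    malformed JSON and out-of-range values."""
    # Models sometimes wrap JSON in markdown fences or prose; pull out
    # the first {...} block before parsing.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    try:
        data = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        data = {}
    return {
        "date": data.get("date") or None,
        "category": (data.get("category")
                     if data.get("category") in ALLOWED_CATEGORIES
                     else "misc"),
        "keywords": [str(k) for k in (data.get("keywords") or [])][:3],
        "summary": str(data.get("summary") or "")[:60],
    }
```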
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
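<p>These checks are cheap to express. A hedged sketch (the function name and exact rules are illustrative, not the repository&rsquo;s <code>llm_schema.py</code>):</p>

```python
from datetime import date

def validate_fields(meta: dict, source_text: str) -> dict:
    """Drop values the model may have invented rather than extracted."""
    cleaned = dict(meta)
    # A date in the future cannot have come from a real document header.
    d = cleaned.get("date")
    if d and (len(d) != 8 or not d.isdigit()
              or d > date.today().strftime("%Y%m%d")):
        cleaned["date"] = None
    # Keywords must actually occur in the extracted text.
    lowered = source_text.lower()
    cleaned["keywords"] = [k for k in cleaned.get("keywords", [])
                           if k.lower() in lowered]
    return cleaned
```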
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
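<p>Both steps fit in a few lines. A sketch under the conventions described above (function names are illustrative, not the actual <code>filename.py</code> API):</p>

```python
import re

def sanitise(part: str) -> str:
    """Lowercase, hyphens only: anything outside [a-z0-9] collapses
    to a single hyphen, and leading/trailing hyphens are stripped."""
    part = part.lower()
    part = re.sub(r"[^a-z0-9]+", "-", part)
    return part.strip("-")

def resolve_collision(taken: set[str], stem: str, suffix: str = ".pdf") -> str:
    """Append _01, _02, ... until the target name is free, and record
    the name as taken."""
    candidate, counter = f"{stem}{suffix}", 0
    while candidate in taken:
        counter += 1
        candidate = f"{stem}_{counter:02d}{suffix}"
    taken.add(candidate)
    return candidate
```

<p>One way to get the order-independence the post describes is to sort the candidate records canonically before resolving, so the counter suffixes are assigned in a fixed order rather than depending on filesystem traversal.</p>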
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small model in the 3–8B range running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
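<p>The request itself is a plain OpenAI-style completions call. A sketch (the endpoint shape follows the OpenAI-compatible API that Ollama exposes; the <code>max_tokens</code> and <code>temperature</code> values here are illustrative choices, not the tool&rsquo;s defaults):</p>

```python
import json
from urllib import request

ENDPOINT = "http://127.0.0.1:11434/v1/completions"  # local Ollama default

def build_payload(prompt: str, model: str = "qwen2.5:3b") -> dict:
    """Assemble an OpenAI-style completions request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 256,    # the JSON reply is small; cap the response
        "temperature": 0.0,   # extraction wants determinism, not creativity
    }

def call_local_llm(prompt: str) -> str:
    """Send the request to the local endpoint; no external traffic."""
    req = request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```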
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Most people
get this right on the first try; most large language models get it wrong,
returning &ldquo;two&rdquo; with apparent confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
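<p>A toy illustration of the effect (this uses greedy longest-match over an invented mini-vocabulary; real BPE applies learned merge rules in order, but the resulting split of &ldquo;strawberry&rdquo; is the same):</p>

```python
# Invented mini-vocabulary, chosen so that "strawberry" splits the way
# the cl100k_base tokeniser splits it.
VOCAB = {"str", "aw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

def segment(word: str) -> list[str]:
    """Greedy longest-match segmentation: at each position, take the
    longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens
```

<p>Token by token, the &ldquo;r&rdquo; counts are 1, 0, and 2; the model&rsquo;s input is the three pieces, never the ten characters.</p>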
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
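<p>The merge loop itself is short. A minimal sketch of the training side, following Sennrich et al. (2016) but ignoring word-frequency weighting and byte-level details:</p>

```python
from collections import Counter

def learn_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]        # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))      # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the pair with the fused symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                i += 1
    return merges
```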
<p>What BPE optimises for is compression efficiency, not character-level
transparency. The token &ldquo;straw&rdquo; encodes the sequence s-t-r-a-w as a
unit, but that character sequence is not explicitly represented anywhere
inside the model once the embedding lookup has occurred. The model
receives a vector for &ldquo;straw,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;straw&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned what &ldquo;straw&rdquo; means, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>The characteristic wrong answer is &ldquo;two r&rsquo;s,&rdquo; not &ldquo;one&rdquo; or &ldquo;four&rdquo; or
&ldquo;none.&rdquo; This is not random noise. It is a systematic error, and systematic
errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or ones close enough to it, often enough that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
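<p>The same scaffolding, written as code rather than a prompt (a trivial sketch, but it is precisely the operation the character-by-character expansion buys):</p>

```python
def count_with_expansion(word: str, target: str) -> int:
    """Mimic the 'spell it out first' prompt: expand the word into
    individual characters, then count over the flat list."""
    characters = list(word)   # s, t, r, a, w, b, e, r, r, y
    return sum(1 for c in characters if c == target)
```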
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). <strong>Neural machine
translation of rare words with subword units.</strong> <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
