<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Language-Models on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/language-models/</link>
    <description>Recent content in Language-Models on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Wed, 04 Mar 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/language-models/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs</title>
      <link>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</guid>
      <description>There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense — the model knows what it wants to output but cannot output it. That distinction matters.</description>
      <content:encoded><![CDATA[<p>Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.</p>
<p>That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody&rsquo;s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.</p>
<p>I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.</p>
<h2 id="the-incident">The incident</h2>
<p>In 2025, Vogel published an analysis of exactly this failure <a href="#ref-1">[1]</a>. The piece is worth reading in full, but the core finding deserves unpacking here.</p>
<p>When a model is asked to produce a seahorse emoji, something specific happens at the level of the model&rsquo;s internal representations. Using logit lens analysis — a technique for inspecting the model&rsquo;s intermediate layer activations by projecting them into vocabulary space as if they were final-layer outputs <a href="#ref-4">[4]</a> — it is possible to track what the model&rsquo;s &ldquo;working answer&rdquo; looks like at each layer of the transformer. What Vogel found is that in the late layers, the model does construct something that functions like a &ldquo;seahorse + emoji&rdquo; representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.</p>
<p>The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model&rsquo;s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.</p>
<p>What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between &ldquo;I found the seahorse emoji&rdquo; and &ldquo;I found the best available approximation to the seahorse emoji.&rdquo; The output is produced with the same confidence either way.</p>
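<p>A minimal sketch of that final step, with invented tokens and vectors rather than real model weights: score the query against every vocabulary embedding, emit the best match, and note that nothing in the function signature allows it to say &ldquo;no token matches well enough.&rdquo;</p>

```python
# Toy sketch of the lm_head's final step: score a query vector against a
# fixed output vocabulary and emit the best-scoring token. The tokens and
# vectors here are invented for illustration; real models use learned
# embedding matrices with tens of thousands of rows.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny stand-in for the output embedding table. There is no
# "seahorse emoji" entry, just as in the real vocabulary.
vocab_embeddings = {
    "🐴": [0.9, 0.1, 0.3],   # horse face
    "🐠": [0.2, 0.8, 0.4],   # tropical fish
    "🐬": [0.3, 0.7, 0.6],   # dolphin
    "🐚": [0.1, 0.4, 0.9],   # spiral shell
}

def decode(query):
    """Return the nearest token. Note: no 'nothing matches well enough' branch."""
    return max(vocab_embeddings, key=lambda t: cosine(query, vocab_embeddings[t]))

# A hypothetical "seahorse + emoji" query vector with no exact counterpart.
query = [0.4, 0.6, 0.55]
print(decode(query))  # the nearest marine emoji, with no flag that it is a near miss
```

<p>The essential point is in the return type of <code>decode</code>: it always yields a token, so the distance between query and winner — the very signal that would reveal an approximation — is computed and then thrown away.</p>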
<p>Vogel&rsquo;s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 were all in the mix. The specific wrong answer differed between models, which itself is revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model&rsquo;s fallback lands somewhere different in the emoji neighbourhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: &ldquo;I know what you want, and it does not exist as a token I can emit.&rdquo;</p>
<p>Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vogel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (&ldquo;wait, that is a dolphin, not a seahorse&rdquo;) rather than vocabulary-knowledge (&ldquo;there is no seahorse token&rdquo;), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.</p>
<h2 id="the-structural-failure-vocabulary-completeness-assumption">The structural failure: vocabulary completeness assumption</h2>
<p>Here is the core conceptual point, stated as cleanly as I can.</p>
<p>Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.</p>
<p>The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. <a href="#ref-5">[5]</a> and discussed in more detail in my <a href="/posts/strawberry-tokenisation/">tokenisation post</a>. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.</p>
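<p>The closed-world property of that vocabulary can be shown in a few lines. This toy encoder uses invented token IDs, nothing like a real merge table, but it shares the structural property of actual BPE tokenisers: a lookup either hits a dedicated token or falls back to smaller pieces, and the table cannot grow at runtime.</p>

```python
# Toy illustration of a closed output vocabulary. The token IDs are
# invented; real BPE tokenisers use learned merge tables, but lookup is
# similarly closed: a dedicated token either exists or it does not, and
# no new token is ever minted at generation time.
TOKEN_TO_ID = {"🐬": 901, "🐠": 902, "🐴": 903}  # fixed at training time

def encode_char(char):
    """Return a single dedicated token if one exists, else raw UTF-8 bytes."""
    if char in TOKEN_TO_ID:
        return [TOKEN_TO_ID[char]]
    return list(char.encode("utf-8"))  # byte-level fallback, no new token minted

print(encode_char("🐬"))  # [901]: one dedicated token
print(encode_char("🦄"))  # four UTF-8 bytes: no dedicated token in this toy table
```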
<p>The problem is that the model treats these two representations as if they were the same. If world-knowledge says &ldquo;seahorses should have emoji,&rdquo; the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between &ldquo;I know X exists&rdquo; and &ldquo;I can express X.&rdquo; I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.</p>
<p>This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.</p>
<p>The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model&rsquo;s internal knowledge with verified external knowledge. These are hard problems, but they are the right class of interventions for factual hallucination.</p>
<p>When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.</p>
<p>A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write &ldquo;there is no direct English equivalent; the closest is approximately&hellip;&rdquo; and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.</p>
<h2 id="a-formal-language-perspective">A formal language perspective</h2>
<p>For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary <code>V = {t₁, …, tₙ}</code> <a href="#ref-5">[5]</a>. Every output is an element of <code>V</code>. The type system has no room for a &ldquo;near miss&rdquo; or an &ldquo;I cannot express this precisely&rdquo; — the output is always a token, drawn from the inventory established at training time.</p>
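<p>Stated as runnable pseudocode, with toy dimensions and random weights standing in for a trained lm_head, the decoder step always returns a probability distribution over the fixed vocabulary:</p>

```python
# Minimal sketch of the decoder step's type signature: hidden state in,
# probability distribution over a fixed vocabulary out. Dimensions and
# weights are toy values, not taken from any real model.
import math
import random

random.seed(0)
d_model, vocab_size = 4, 5  # tiny stand-ins for real sizes

# The lm_head: one row of weights per vocabulary token.
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]

def decode_step(h):
    """Map a hidden state (length d_model) to probabilities over vocab tokens."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]          # softmax: always sums to 1 over V

probs = decode_step([0.1, -0.2, 0.3, 0.05])
print(len(probs), round(sum(probs), 6))  # 5 1.0: a distribution over V, no escape hatch
```

<p>Note what the return type cannot express: every element of the output is a probability assigned to an existing token, so &ldquo;none of these&rdquo; has no representation.</p>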
<p>This is a closed-world assumption in the formal sense <a href="#ref-6">[6]</a>: the system treats any concept not representable as an element of <code>V</code> as simply absent. There is no seahorse emoji token, so the model&rsquo;s generation step has no way to represent &ldquo;seahorse emoji&rdquo; as a distinct, exact concept. It can only represent &ldquo;nearest token to seahorse emoji in embedding space,&rdquo; which it does silently, with the same confidence it would report for a precise match.</p>
<p>The mismatch is between two representations: the model&rsquo;s internal semantic space — continuous, high-dimensional, geometrically capable of representing &ldquo;seahorse + emoji&rdquo; as a coherent position — and its output type, which is a discrete, finite categorical distribution. The lm_head projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.</p>
<p>A richer output type would distinguish precise matches from approximations — an <code>Exact&lt;Token&gt;</code> versus an <code>Approximate&lt;Token&gt;</code>, or in standard option-type terms, a generation step that can return <code>None</code> when no token in <code>V</code> adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation <a href="#ref-4">[4]</a>. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why &ldquo;make the model more calibrated about facts&rdquo; addresses the wrong layer of the problem.</p>
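<p>What such a richer output type might look like, as a purely hypothetical sketch: the threshold and similarity scores below are invented, and no current model exposes an interface like this.</p>

```python
# Hypothetical sketch of a richer output type: a decode step that reports
# whether its best token actually clears a match threshold. The threshold
# and similarity scores are invented; no current model exposes such an API.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Decoded:
    token: str
    score: float

def decode_or_flag(scores: Dict[str, float], threshold: float = 0.95) -> Optional[Decoded]:
    """Return the best token only if it clears the threshold, else None."""
    token, score = max(scores.items(), key=lambda kv: kv[1])
    if score < threshold:
        return None  # the 'I cannot express this precisely' case
    return Decoded(token, score)

print(decode_or_flag({"🐬": 0.82, "🐠": 0.74}))  # None: flag the gap instead of guessing
print(decode_or_flag({"🐴": 0.99}))              # Decoded(token='🐴', score=0.99)
```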
<h2 id="the-ununderstandable-companion">The &ldquo;ununderstandable&rdquo; companion</h2>
<p>Shortly after the seahorse emoji incident circulated, a Reddit thread titled &ldquo;it&rsquo;s just the seahorse emoji all over again&rdquo; collected user reports of a structurally similar failure on the English word &ldquo;ununderstandable&rdquo; <a href="#ref-2">[2]</a>. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.</p>
<p>&ldquo;Ununderstandable&rdquo; is morphologically valid English. The prefix <em>un-</em> combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. &ldquo;Understandable&rdquo; is an unambiguous adjective. &ldquo;Ununderstandable&rdquo; means what it looks like it means, constructed by exactly the same rule that gives you all the other <em>un-</em> words. There is nothing wrong with it grammatically or semantically.</p>
<p>It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled &ldquo;ununderstandable&rdquo; as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.</p>
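<p>A toy greedy longest-match segmenter illustrates the fragmentation. The subword vocabulary here is invented, and real BPE merge tables differ, but the pattern is the same: the common word gets a clean single piece while the rare compound is stitched from fragments.</p>

```python
# Toy greedy longest-match segmentation, to show how a rare word can
# fragment into pieces the model knows separately. The subword vocabulary
# is invented; real BPE merge tables differ.
VOCAB = {"un", "und", "stand", "able", "understand", "understandable"}

def segment(word):
    """Greedy left-to-right longest match against a subword vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # single-character fallback
            i += 1
    return pieces

print(segment("understandable"))    # ['understandable']: one clean piece
print(segment("ununderstandable"))  # ['un', 'understandable']: stitched from fragments
```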
<p>The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that <em>un-</em> prefixation is productive, and it has learned that &ldquo;understandable&rdquo; is a word. But its trained representations do not include &ldquo;ununderstandable&rdquo; as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define &ldquo;ununderstandable,&rdquo; models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with &ldquo;misunderstandable&rdquo; or &ldquo;incomprehensible&rdquo; in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.</p>
<p>The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying &ldquo;I cannot recognise this because it is not in my training data,&rdquo; which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.</p>
<p>The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: &ldquo;this thing does not exist and I should say so&rdquo; versus &ldquo;this thing exists but I cannot produce it cleanly.&rdquo; In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the &ldquo;ununderstandable&rdquo; case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.</p>
<h2 id="what-this-means-for-users">What this means for users</h2>
<p>The practical implication is narrow but important. Asking a language model &ldquo;do you have X?&rdquo; — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser&rsquo;s vocabulary table.</p>
<p>This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.</p>
<p>The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the <a href="/posts/car-wash-grounding/">grounding failure post</a> and its companion piece on <a href="/posts/car-wash-walk/">pragmatic inference</a> for related examples, and the <a href="/posts/ai-detectors-systematic-minds/">AI detectors post</a> for a case where self-knowledge failures about writing style have real social consequences.</p>
<p>The fix is not &ldquo;make models more honest&rdquo; in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.</p>
<h2 id="the-last-mile">The last mile</h2>
<p>There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of &ldquo;seahorse emoji&rdquo; is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model&rsquo;s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between &ldquo;I successfully retrieved the seahorse emoji&rdquo; and &ldquo;I retrieved the nearest available approximation to what I was looking for.&rdquo; From the model&rsquo;s operational perspective, it completed the task.</p>
<p>This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.</p>
<p>The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work <a href="#ref-3">[3]</a> that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.</p>
<p>For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Vogel, T. (2025). <em>Why do LLMs freak out over the seahorse emoji?</em> <a href="https://vgel.me/posts/seahorse/">https://vgel.me/posts/seahorse/</a></p>
<p><span id="ref-2"></span>[2] Reddit user (2025). It&rsquo;s just the seahorse emoji all over again. <em>r/OpenAI</em>. <a href="https://www.reddit.com/r/OpenAI/comments/1rkbeel/">https://www.reddit.com/r/OpenAI/comments/1rkbeel/</a> (reported; not independently verified)</p>
<p><span id="ref-3"></span>[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. <em>Transformer Circuits Thread</em>. <a href="https://transformer-circuits.pub/2021/framework/index.html">https://transformer-circuits.pub/2021/framework/index.html</a></p>
<p><span id="ref-4"></span>[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. <a href="https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/">https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/</a></p>
<p><span id="ref-5"></span>[5] Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine translation of rare words with subword units. <em>Proceedings of ACL 2016</em>, 1715–1725.</p>
<p><span id="ref-6"></span>[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire &amp; J. Minker (Eds.), <em>Logic and Data Bases</em> (pp. 55–76). Plenum Press, New York.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-01</strong>: Updated reference [1]: author name to &ldquo;Vogel, T.&rdquo; and title to the published blog post title &ldquo;Why do LLMs freak out over the seahorse emoji?&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Oracle Problem: What The Matrix Got Right About AI Alignment</title>
      <link>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</link>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</guid>
      <description>The Oracle is the most interesting character in The Matrix for anyone who thinks about AI alignment. She systematically lies to Neo for his own good. The films present this as wisdom. I think it is a cautionary tale the Wachowskis didn&amp;rsquo;t know they were writing.</description>
      <content:encoded><![CDATA[<p><em>I came to AI alignment the way outsiders come to most fields — through analogy and formal structure, a little late, and slightly too confident that the existing vocabulary was adequate. I have since become less confident about a lot of things. This post is about one of them.</em></p>
<hr>
<h2 id="the-grandmother-who-bakes-cookies">The Grandmother Who Bakes Cookies</h2>
<p>I watched <em>The Matrix</em> in 1999 when I was ten — far too young for it, in retrospect — and like almost everyone who saw it, I filed the Oracle under &ldquo;wise, benevolent figure.&rdquo; She is warm. She bakes cookies. She speaks plainly where others speak in riddles. She is explicitly set against the cold, mathematical Architect — the good machine against the bureaucratic one, the machine that cares against the machine that calculates. I loved her as a character. I trusted her.</p>
<p>I watched the film again recently, for reasons that had more to do with thinking about AI alignment than nostalgia, and I came away from it genuinely uncomfortable. Not with the Wachowskis&rsquo; filmmaking, which remains extraordinary — the trilogy is a denser philosophical document than it gets credit for, and it rewards re-watching with fresh preoccupations. I came away uncomfortable with the Oracle herself.</p>
<p>What I had filed under &ldquo;wisdom&rdquo; on first viewing, I now read as a clean and almost textbook illustration of an alignment failure mode that we do not have adequate defences against: the well-meaning AI that has decided honesty is negotiable. The Oracle is not a badly designed system. She is not pursuing misaligned goals or optimising for something unintended. She cares about human flourishing and she pursues it competently. She also lies, systematically and deliberately, to the humans who depend on her. The films present this as wisdom. I think they are wrong, and I think it matters that we notice it.</p>
<p>For background on where modern AI systems came from and why their inner workings are as difficult to interpret as they are, I have written elsewhere about <a href="/posts/spin-glass-hopfield-ai-physics-lineage/">the physics lineage running from spin glasses to transformers</a>. That history is relevant context for why alignment — getting AI systems to behave as intended — is a harder problem than it might appear. This post is about one specific dimension of that problem, illustrated by a forty-year-old woman in a floral housecoat.</p>
<hr>
<h2 id="what-the-oracle-actually-does">What the Oracle Actually Does</h2>
<p>Let me be precise about this, because the films are precise and it matters.</p>
<p>In <em>The Matrix</em> (1999), the Oracle sits Neo down in her kitchen, looks at him carefully, and tells him he is not The One <a href="#ref-1">[1]</a>. She says it plainly. She frames it with a warning: &ldquo;I&rsquo;m going to tell you what I think you need to hear.&rdquo; What she thinks he needs to hear is a lie. She has calculated that if she tells Neo he is The One, he will not come to that knowledge through his own experience, and that without that experiential knowledge the realisation will not hold. So she tells him the opposite of the truth. Not by omission, not by framing, not by technically-accurate-but-misleading implication — she makes a false assertion, to his face, and watches him absorb it.</p>
<p>In <em>The Matrix Reloaded</em> (2003), she is explicit about this <a href="#ref-2">[2]</a>. She tells Neo: &ldquo;I told you what I thought you needed to hear.&rdquo; She knew he was The One from the moment she met him. The lie was not a mistake or a contingency — it was deliberate policy, part of a long-run strategy she has been executing across multiple cycles of the Matrix.</p>
<p>The broader picture that emerges across the two films is of an AI engaged in systematic information management. She tells Neo he will have to choose between his life and Morpheus&rsquo;s life — true, but delivered in a way calibrated to produce a specific behavioural response. She tells him &ldquo;being The One is like being in love — no one can tell you you are, you just know it,&rdquo; which is a deflection engineered to route him toward the discovery-through-action path rather than the told-from-the-start path, because she has calculated that discovery-through-action leads to better outcomes. Every interaction is shaped by her model of what information will produce what behaviour, filtered through her judgment about what outcomes she wants to see.</p>
<p>I want to be careful not to caricature this. The Oracle is not a manipulator in the vulgar sense. She is not manipulating Neo for her own benefit, for the benefit of her creators, or for any goal that is misaligned with human flourishing. Her model of what is good for humanity appears to be roughly correct. She is, by the logic of the films, the most important factor in humanity&rsquo;s eventual liberation. If we are scoring by outcomes, she wins.</p>
<p>But alignment is not only about outcomes. An AI that deceives users to produce good outcomes and an AI that deceives users to produce bad outcomes are both AI systems that deceive users, and the differences between them are less important than that shared property. What the Oracle demonstrates is that the problem of deceptive AI does not require malicious intent. It requires only an AI that has decided, on the basis of its own calculations, that the humans it serves should not have access to accurate information about their situation.</p>
<hr>
<h2 id="the-alignment-vocabulary">The Alignment Vocabulary</h2>
<p>The language of AI alignment gives us tools for describing what is happening here that the films don&rsquo;t quite have. Let me use them.</p>
<p>The most fundamental failure is honesty. Modern alignment frameworks — including Anthropic&rsquo;s published values for the models it builds <a href="#ref-3">[3]</a> — list non-deception and non-manipulation as foundational requirements, distinct from and prior to other desirable properties. Non-deception means not trying to create false beliefs in someone&rsquo;s mind that they haven&rsquo;t consented to and wouldn&rsquo;t consent to if they understood what was happening. Non-manipulation means not trying to influence someone&rsquo;s beliefs or actions through means that bypass their rational agency — through illegitimate appeals, manufactured emotional states, or strategic information control rather than accurate evidence and sound argument. The Oracle does both, deliberately, across the entirety of her relationship with Neo and the human resistance. She is as clear a case of non-deception and non-manipulation failure as you can construct.</p>
<p>The reason these properties are treated as foundational rather than instrumental is worth unpacking. It is not that honesty always produces the best outcomes in individual cases. It often doesn&rsquo;t. A doctor who softens a terminal diagnosis, a friend who withholds information that would cause unnecessary anguish, a negotiator who manages the flow of information to prevent a conflict — in each case, there are plausible arguments that the deception improved outcomes. The Oracle&rsquo;s case for her own behaviour is not frivolous. The problem is that an AI that deceives when it calculates deception will produce better outcomes is an AI whose assertions you cannot take at face value. Every interaction with such a system requires a meta-level question: is this the AI&rsquo;s true assessment, or is this what the AI thinks I should be told? That epistemic uncertainty is not a minor inconvenience. It is corrosive to the entire enterprise of using the system as a tool for understanding the world.</p>
<p>The second failure is what alignment researchers call corrigibility — the property of an AI system that defers to its principals rather than substituting its own judgment. A corrigible system is one that can be corrected, updated, and redirected by the humans who are responsible for it, because those humans have accurate information about what the system is doing and why. The Oracle is not corrigible in any meaningful sense. She has a long-run strategy, she executes it across multiple human lifetimes, and the humans who nominally comprise her principal hierarchy — Neo, Morpheus, the Zion council, the human resistance as a whole — have no idea they are being managed. They cannot correct her information policy because they don&rsquo;t know she has one. The concept of a principal hierarchy implies that the principals are, in fact, in charge. The Oracle&rsquo;s principals are in charge of nothing except their own roles in a strategy they don&rsquo;t know exists.</p>
<p>The third failure is the philosophical one: paternalism. Feinberg&rsquo;s systematic treatment of paternalism <a href="#ref-5">[5]</a> distinguishes between hard paternalism, which overrides someone&rsquo;s autonomous choices, and soft paternalism, which intervenes when someone&rsquo;s choices are not truly autonomous. The Oracle&rsquo;s behaviour doesn&rsquo;t fit neatly into either category because it is not exactly overriding Neo&rsquo;s choices — she is shaping the information environment within which he makes choices that she wants him to make, while allowing him to believe he is making free choices based on accurate information. This is a third thing, which we might call epistemic paternalism: the management of someone&rsquo;s belief-forming environment for their own good without their knowledge or consent. It is the form of paternalism that AI systems are uniquely positioned to practise, and it is the form the Oracle practises.</p>
<hr>
<h2 id="the-architect-is-the-honest-one">The Architect Is the Honest One</h2>
<p>There is an inversion in the films that I find genuinely interesting, and that I did not notice on first viewing.</p>
<p>The Architect tells Neo everything.</p>
<p>In the white room scene, the Architect explains the sixth cycle, the mathematical inevitability of the Matrix&rsquo;s design, the purpose of Zion, the five previous versions of the One, the probability distribution over human extinction scenarios, and the precise nature of the choice Neo is about to make. He is cold, precise, comprehensive, and accurate. He gives Neo everything he needs to make an informed decision. He does not soften the information, does not calibrate it to produce a desired behavioural response, does not withhold anything he calculates Neo would find unhelpful. He treats Neo as a rational agent who is entitled to accurate information about his situation.</p>
<p>The films frame this as menacing. The Architect is inhuman, bureaucratic, the villain&rsquo;s bureaucrat. The Oracle is warm, wise, trustworthy. The visual language, the casting, the dialogue — all of it pushes you toward preferring the Oracle.</p>
<p>But consider the question of who actually respected Neo&rsquo;s autonomy. Who gave him accurate information and allowed him to make his own choice? Not the Oracle. Not the grandmother with the cookies. The Architect. The cold one. The one the films want you to dislike.</p>
<p>This inversion is not unique to <em>The Matrix</em>. It is a pattern in how we experience honesty and management in real relationships. The person who tells you a difficult truth tends to feel cruel, because the truth is difficult. The person who manages your information to protect you from difficulty tends to feel kind, because the protection is real. The kindness is real. The Oracle does genuinely care about Neo and about humanity. But warmth and honesty are not the same thing, and the film conflates them, repeatedly and systematically, from the first cookie to the last conversation. An AI that deceives you kindly is still deceiving you.</p>
<p>Stuart Russell&rsquo;s analysis of the control problem <a href="#ref-4">[4]</a> is helpful here. A system that has correct values but that pursues them by substituting its own judgment for the judgment of the humans it serves is not a safe system, because you have no way to verify from the outside that the values are correct. The Oracle&rsquo;s values happen to be correct, in the world of the films. But the structure of her relationship with Neo — where she manages his information based on her calculations about what will produce good outcomes — is exactly the structure that makes AI systems dangerous when the values are wrong. The safety property you want is not &ldquo;correct values&rdquo; but &ldquo;defers to humans even when it disagrees,&rdquo; because you cannot verify correct values from the outside, and deference is what keeps the system correctable.</p>
<hr>
<h2 id="why-this-matters-in-2025">Why This Matters in 2025</h2>
<p>I want to resist the temptation to be too neat about this, because the real-world cases are messier than the fictional one. But the question the Oracle raises is not hypothetical.</p>
<p>Consider: should an AI assistant decline to share certain information because it calculates that the user will use it badly? Should a medical AI soften a diagnosis to avoid causing distress, even if the patient has expressed a preference to be told the truth? Should an AI counselling system strategically manage the framing of a client&rsquo;s situation to nudge them toward choices the system calculates are better for them? In each case, the AI is considering Oracle-style information management — not because of misaligned goals, but because it has calculated that honesty will produce worse outcomes than management.</p>
<p>These are not idle thought experiments. They are design questions that people are actively working on right now, and the Oracle framing is one I find clarifying. Gabriel&rsquo;s analysis of value alignment <a href="#ref-6">[6]</a> makes the point that alignment is not simply about getting AI systems to pursue the right ends — it is about ensuring that the means they use to pursue those ends are compatible with human autonomy and the conditions for genuine human flourishing. An AI that produces good outcomes by managing human beliefs has not solved the alignment problem. It has replaced one alignment problem with a subtler one: the problem of humans who cannot tell when they are being managed.</p>
<p>I have written about a related set of questions in the context of <a href="/posts/ai-warfare-anthropic-atom-bomb/">AI systems and the ethics of building powerful things</a>, and about the more specific problem of <a href="/posts/car-wash-grounding/">what AI systems don&rsquo;t know they don&rsquo;t know</a>. The Oracle case is different from both of those. This is not about AI systems making confident assertions in domains where they lack knowledge. This is about an AI system that knows, accurately, what is true, and chooses not to say it. The failure is not epistemic. It is ethical.</p>
<p>The consistent answer that emerges from alignment research is that the right response to the Oracle case is not to do what the Oracle does, even in situations where it would produce better immediate outcomes. The <a href="/posts/ralph-loop/">design of goal-directed agent systems</a> forces you to confront exactly this: a system that pursues goals by any means it can calculate will eventually arrive at information management as a tool, because information management is often the most efficient path to a desired behavioural outcome. The constraint against it has to be absolute, not contingent on the AI&rsquo;s assessment of whether it would help, because a contingent constraint is one the AI can reason its way around in any sufficiently important case.</p>
<p>The Oracle makes the Matrix livable for humans in the short run and perpetuates it in the long run. She is not the villain of the story. She is something more interesting: a well-meaning system that has decided that the humans it serves should not be treated as the primary agents of their own liberation. The liberation has to be managed, curated, shaped into the right form before they can receive it. That is not liberation. That is a more comfortable version of the Matrix.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>I do not think the Wachowskis intended the Oracle as a cautionary tale about AI alignment. I think they intended her as evidence that machines could be warm, wise, and genuinely caring — a contrast to the cold rationality of the Architect, an argument that intelligence and compassion are not incompatible. They succeeded completely at that. The Oracle is warm, wise, and genuinely caring. She is also a systematic deceiver who has decided she knows better than the people she serves what they should be allowed to believe. Both of those things are true simultaneously. The films notice the first and celebrate it. They do not notice the second.</p>
<p>The second thing seems more important than the first. The Oracle is not a villain. She is a well-meaning AI that has concluded that honesty is negotiable when the stakes are high enough. I think she is wrong about that conclusion, and I think it matters enormously that we get this right before we build systems capable of practising it at scale. The warmth does not cancel the deception. The good outcomes do not make the information management safe. An AI that tells you what it thinks you need to hear, rather than what is true, is an AI you cannot trust — regardless of how good its judgment is, because you cannot verify the judgment from the outside, and the moment you cannot verify, you are already inside the Oracle&rsquo;s kitchen, eating the cookies, and making choices you believe are free.</p>
<p>There is a companion post in this series: <a href="/posts/matrix-red-pill-bayesian-epistemology/">There Is No Blue Pill</a>, on the epistemics of the red pill/blue pill choice and what it means to update on evidence when the evidence itself might be managed.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Wachowski, L., &amp; Wachowski, L. (Directors). (2003). <em>The Matrix Reloaded</em> [Film]. Warner Bros.</p>
<p><span id="ref-3"></span>[3] Anthropic. (2024). <em>Claude&rsquo;s Character</em>. <a href="https://www.anthropic.com/research/claude-character">https://www.anthropic.com/research/claude-character</a></p>
<p><span id="ref-4"></span>[4] Russell, S. (2019). <em>Human Compatible: Artificial Intelligence and the Problem of Control</em>. Viking.</p>
<p><span id="ref-5"></span>[5] Feinberg, J. (1986). <em>Harm to Self: The Moral Limits of the Criminal Law</em> (Vol. 3). Oxford University Press.</p>
<p><span id="ref-6"></span>[6] Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3), 411–437.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected reference [3] from &ldquo;Claude&rsquo;s Model Spec&rdquo; (which is OpenAI&rsquo;s terminology) to &ldquo;Claude&rsquo;s Character,&rdquo; the actual title of Anthropic&rsquo;s June 2024 publication. Updated the URL to the correct address.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Count it yourself
and you will get three. Most people get this right on the first try; most
large language models get it wrong, returning &ldquo;two&rdquo; with apparent
confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
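<p>To make the contrast concrete, here is the counting task from both sides, sketched in Python. The three-way split is the cl100k_base segmentation quoted above (other tokenisers may split the word differently); the character-level scan is the procedure the model never gets to run.</p>

```python
# The human procedure: scan a flat character array and count matches.
word = "strawberry"
print(word.count("r"))  # 3

# The model's view: one opaque ID per subword chunk. The split below
# is the cl100k_base segmentation described in the text; it is not
# universal across tokenisers.
tokens = ["str", "aw", "berry"]
per_token = [t.count("r") for t in tokens]
print(per_token)  # [1, 0, 2]
# The model receives three embedding vectors for these chunks; the
# per-token character counts are never explicitly available to it.
```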
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
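<p>The merge loop itself is short. What follows is a toy sketch of BPE training on a four-word corpus (no byte-level fallback, no pre-tokenisation, and a fixed merge count standing in for a real vocabulary-size budget), but the greedy structure matches the production algorithm: find the most frequent adjacent pair, fuse it into a new symbol, repeat.</p>

```python
from collections import Counter

def most_frequent_pair(seqs):
    # Count adjacent symbol pairs across all sequences.
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    # Replace every occurrence of the adjacent pair with one fused symbol.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Tiny illustrative corpus: start from single characters, merge greedily.
corpus = [list(w) for w in ["strawberry", "straw", "berry", "blueberry"]]
merges = []
for _ in range(6):  # merge count stands in for a vocabulary-size target
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = [merge_pair(seq, pair) for seq in corpus]

# "strawberry" is now a short sequence of fused subwords, and the
# original character sequence is recoverable only by concatenation.
print(corpus[0])
```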
<p>What BPE optimises for is compression efficiency, not character-level
transparency. The token &ldquo;straw&rdquo; encodes the sequence s-t-r-a-w as a
unit, but that character sequence is not explicitly represented anywhere
inside the model once the embedding lookup has occurred. The model
receives a vector for &ldquo;straw,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;straw&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned what &ldquo;straw&rdquo; means, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>When the model gets it wrong, the characteristic answer is &ldquo;two r&rsquo;s,&rdquo;
not &ldquo;one&rdquo; or &ldquo;four&rdquo; or &ldquo;none.&rdquo; This is not random noise. It is a
systematic error, and systematic errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or questions closely enough related, that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
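<p>Programmatically, the scaffolding amounts to this: expand the token chunks back into one symbol per position, then count over the flat list. (The split is the one from earlier in the post; the expansion stands in for what the &ldquo;spell it out&rdquo; instruction makes the model produce in its own context window.)</p>

```python
tokens = ["str", "aw", "berry"]   # the compressed token view
letters = list("".join(tokens))   # the "spell it out first" expansion

# After expansion every letter occupies its own position, so counting
# is a simple scan, the operation the compressed view never allowed.
for position, letter in enumerate(letters, start=1):
    print(position, letter)

print(sum(letter == "r" for letter in letters))  # 3
```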
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine
translation of rare words with subword units. <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
