Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.
That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody’s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.
I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.
The incident
In 2025, Vgel published an analysis of exactly this failure [1]. The piece is worth reading in full, but the core finding is worth unpacking here.
When a model is asked to produce a seahorse emoji, something specific happens at the level of the model’s internal representations. Using logit lens analysis — a technique that projects the model’s intermediate-layer activations into vocabulary space, as if each layer were already producing output [4] — it is possible to track what the model’s “working answer” looks like at each layer of the transformer. What Vgel found is that in the late layers, the model does construct something that functions like a “seahorse + emoji” representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.
The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model’s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.
What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between “I found the seahorse emoji” and “I found the best available approximation to the seahorse emoji.” The output is produced with the same confidence either way.
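The mechanics can be sketched in a few lines of Python (the tokens and embedding vectors below are invented for illustration): the lm_head scores the query vector against every token embedding and softmaxes the result, and nothing in the pipeline marks whether the winner was an exact match or merely the nearest available neighbour.

```python
import math

# Invented toy embeddings for illustration. The lm_head is a matrix of
# token embeddings; decoding scores the hidden state against every row
# and softmaxes, so the output is always some token from this table.
TOKEN_EMBEDDINGS = {
    "🐟": [0.9, 0.1, 0.0],  # tropical fish
    "🐴": [0.1, 0.9, 0.0],  # horse face
    "🐬": [0.7, 0.2, 0.1],  # dolphin
}

def decode(hidden_state):
    """Project a hidden state onto the vocabulary; return (token, prob)."""
    logits = {
        tok: sum(h * e for h, e in zip(hidden_state, emb))
        for tok, emb in TOKEN_EMBEDDINGS.items()
    }
    z = sum(math.exp(v) for v in logits.values())
    probs = {tok: math.exp(v) / z for tok, v in logits.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# A "seahorse + emoji" query with no exact match in the table: the
# decoder still returns its nearest neighbour, at ordinary confidence,
# and nothing records that this was an approximation.
token, prob = decode([0.8, 0.5, 0.0])
```

The return type is the point: a token and a probability, with no field that could distinguish a hit from a near miss.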
Vgel’s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 among them. The specific wrong answer differed between models, which is itself revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model’s fallback lands somewhere different in the emoji neighbourhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: “I know what you want, and it does not exist as a token I can emit.”
Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vgel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (“wait, that is a dolphin, not a seahorse”) rather than vocabulary-knowledge (“there is no seahorse token”), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.
The structural failure: vocabulary completeness assumption
Here is the core conceptual point, stated as cleanly as I can.
Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.
The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. [5] and discussed in more detail in my tokenisation post. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.
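One way to see that the gap sits at the Unicode level, not the training level, is to query Python’s `unicodedata` module, which ships the official character-name table (checked against current Unicode data; a future Unicode version could of course add the character):

```python
import unicodedata

# unicodedata carries the official Unicode name table; lookup() raises
# KeyError for a name that corresponds to no assigned character.
def has_char(name):
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False

assert has_char("DOLPHIN")        # U+1F42C is an assigned character
assert has_char("SPIRAL SHELL")   # U+1F41A is an assigned character
assert not has_char("SEAHORSE")   # no such character exists, hence no token
```

If the character does not exist in Unicode, it cannot exist in any model’s training corpus as itself, and therefore cannot exist in any tokeniser’s vocabulary.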
The problem is that the model treats these two representations as if they were the same. If world-knowledge says “seahorses should have emoji,” the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between “I know X exists” and “I can express X.” I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.
This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.
The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model’s internal knowledge with verified external knowledge. These are hard problems, but they are the right class of remedies for factual hallucination.
When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.
A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write “there is no direct English equivalent; the closest is approximately…” and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.
A formal language perspective
For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary V = {t₁, …, tₙ} [5]. Every output is an element of V. The type system has no room for a “near miss” or an “I cannot express this precisely” — the output is always a token, drawn from the inventory established at training time.
This is a closed-world assumption in the formal sense [6]: the system treats any concept not representable as an element of V as simply absent. There is no seahorse emoji token, so the model’s generation step has no way to represent “seahorse emoji” as a distinct, exact concept. It can only represent “nearest token to seahorse emoji in embedding space,” which it does silently, with the same confidence it would report for a precise match.
The mismatch is between two representations: the model’s internal semantic space — continuous, high-dimensional, geometrically capable of representing “seahorse + emoji” as a coherent position — and its output type, which is a discrete, finite categorical distribution. The lm_head projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.
A richer output type would distinguish precise matches from approximations — an Exact<Token> versus an Approximate<Token>, or in standard option-type terms, a generation step that can return None when no token in V adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation [4]. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why “make the model more calibrated about facts” addresses the wrong layer of the problem.
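As a sketch of what that richer output type could look like (the names Decoded and APPROX_THRESHOLD are invented, and a real system would need a learned or calibrated threshold rather than a hard-coded one):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical richer decode result: Exact vs Approximate collapsed into
# a flag, plus an option-type None when nothing is representable at all.
@dataclass
class Decoded:
    token: str
    distance: float  # quantisation error: distance from query to token
    exact: bool      # True only when that error is negligible

APPROX_THRESHOLD = 0.1  # invented cutoff for this sketch

def decode(query, vocab) -> Optional[Decoded]:
    """vocab maps token -> embedding. Returns Decoded, or None if empty."""
    if not vocab:
        return None
    def dist(emb):
        return sum((q - e) ** 2 for q, e in zip(query, emb)) ** 0.5
    token = min(vocab, key=lambda t: dist(vocab[t]))
    d = dist(vocab[token])
    return Decoded(token=token, distance=d, exact=d < APPROX_THRESHOLD)
```

A caller can then branch on `exact` and surface “closest available: 🐬” instead of a silent substitution.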
The “ununderstandable” companion
Shortly after the seahorse emoji incident circulated, a Reddit thread titled “it’s just the seahorse emoji all over again” collected user reports of a structurally similar failure on the English word “ununderstandable” [2]. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.
“Ununderstandable” is morphologically valid English. The prefix un- combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. “Understandable” is an unambiguous adjective. “Ununderstandable” means what it looks like it means, constructed by exactly the same rule that gives you all the other un- words. There is nothing wrong with it grammatically or semantically.
It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled “ununderstandable” as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.
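A toy greedy longest-match tokeniser makes the fragmentation concrete (the subword inventory below is invented; real BPE merge tables differ per model): with no whole-word entry, the rare word falls apart into pieces the model must reassemble with almost no training signal.

```python
# Invented subword inventory; real BPE vocabularies differ per model.
SUBWORDS = {"un", "under", "understand", "stand", "able"}

def tokenize(word):
    """Greedy longest-match segmentation, a stand-in for BPE decoding."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown: fall back to one character
            i += 1
    return pieces
```

Here `tokenize("ununderstandable")` comes out as `["un", "understand", "able"]`: three fragments, no unified lexical entry for the whole word.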
The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that un- prefixation is productive, and it has learned that “understandable” is a word. But its trained representations do not include “ununderstandable” as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define “ununderstandable,” models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with “misunderstandable” or “incomprehensible” in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.
The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying “I cannot recognise this because it is not in my training data,” which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.
The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: “this thing does not exist and I should say so” versus “this thing exists but I cannot produce it cleanly.” In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the “ununderstandable” case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.
What this means for users
The practical implication is narrow but important. Asking a language model “do you have X?” — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser’s vocabulary table.
This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.
The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the grounding failure post and its companion piece on pragmatic inference for related examples, and the AI detectors post for a case where self-knowledge failures about writing style have real social consequences.
The fix is not “make models more honest” in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.
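A minimal sketch of such a generation-time guard (VOCAB and report_or_emit are invented names): the vocabulary table is an ordinary queryable data structure at the system level, even though the model’s weights cannot introspect it.

```python
# Hypothetical generation-time guard; all names here are invented for
# this sketch. The frozen token inventory from training time is a plain
# set that the surrounding system, unlike the model, can query directly.
VOCAB = {"🐟", "🐴", "🐬", "🐚"}

def report_or_emit(concept, token):
    """token: what a vocabulary lookup resolved `concept` to, or None."""
    if token in VOCAB:
        return token
    return f"I understand {concept!r}, but no token for it exists in my vocabulary."
```

The honest answer in the seahorse case is the second branch, which no current model can reach on its own.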
The last mile
There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of “seahorse emoji” is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model’s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between “I successfully retrieved the seahorse emoji” and “I retrieved the nearest available approximation to what I was looking for.” From the model’s operational perspective, it completed the task.
This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.
The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work [3] that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.
For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.
References
[1] Vogel, T. (2025). Why do LLMs freak out over the seahorse emoji? https://vgel.me/posts/seahorse/
[2] Reddit user (2025). It’s just the seahorse emoji all over again. r/OpenAI. https://www.reddit.com/r/OpenAI/comments/1rkbeel/ (reported; not independently verified)
[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html
[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/
[5] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of ACL 2016, 1715–1725.
[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire & J. Minker (Eds.), Logic and Data Bases (pp. 55–76). Plenum Press, New York.
Changelog
- 2026-04-01: Updated reference [1]: author name to “Vogel, T.” and title to the published blog post title “Why do LLMs freak out over the seahorse emoji?”