<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AI on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/ai/</link>
    <description>Recent content in AI on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/ai/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>There Is an App for That — Until There Isn&#39;t</title>
      <link>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</link>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</guid>
      <description>German health insurance will reimburse a mental health app within days but cannot provide a therapist within six months. Last week, psychotherapy fees were cut by 4.5%. Baumol&amp;rsquo;s cost disease — originally about why string quartets get relatively more expensive — explains why the app gold rush and the collapse of mental health provision are the same phenomenon.</description>
      <content:encoded><![CDATA[<p>Someone vibe coded an app that tells you how many layers to wear today. It has 85,000 users. Someone else tracks her eyelash styles — every new set gets a photo and a note about the method. A father built Storypot: his kids drag emoji into a virtual pot and the app generates a bedtime story. A product manager made Standup Buddy, which randomises who talks first in a daily meeting. That is the entire feature.
</p>
<p>These are not bad things. Some of them are genuinely lovely — Storypot in particular. The layers app clearly meets a need, given 85,000 people agree. I have built tools like this myself — I automated my concert setlist workflow and <a href="/posts/setlist-to-playlist/">wrote about it on this blog</a> — and the feeling of compressing a forty-minute ritual into four minutes of machine-assisted execution is real and satisfying.</p>
<p>There is a term for this now. Karpathy coined it in early 2025: vibe coding. You describe what you want, the model writes the code, you run it, you fix what breaks by describing the fix, and at no point do you necessarily understand what the code does. The barrier to building software has not been lowered so much as removed. A single person with an afternoon and a language model can ship what would have required a team and a quarter, two years ago.</p>
<p>Meanwhile. In Germany, the average wait from an initial consultation to the start of psychotherapy is 142 days — nearly five months — according to a BPtK analysis of statutory insurance billing data <a href="#ref-1">[1]</a>. The Telefonseelsorge — the crisis line, the last resort — handled 1.2 million calls in 2024. It is staffed by approximately 7,700 volunteers and funded primarily by the Protestant and Catholic churches. Its financing is described, in its own institutional language, as <em>äußerst angespannt</em> — extremely strained <a href="#ref-2">[2]</a>. Six days ago, on April 1, psychotherapy fees in Germany were cut by 4.5% <a href="#ref-3">[3]</a>. The thesis of this post is structural, not moral. There is a class of work that scales, and a class of work that does not. Our entire economy of attention — cultural, financial, technological — is optimised for the first class. The second class is not merely neglected. It is being made structurally more expensive, in a precise economic sense, by the very productivity gains that make the first class so intoxicating. And the policy apparatus, facing this structural pressure, is doing exactly what you would predict: it is funding apps.</p>
<p>The economist William Baumol explained the mechanism in 1966. It has a name, and the name is a diagnosis.</p>
<hr>
<h2 id="the-seduction-of-leverage">The Seduction of Leverage</h2>
<p>What makes vibe coding culturally significant is not the code. It is the leverage. A single developer, aided by a language model, can produce software that reaches millions of users. The marginal cost of an additional user approaches zero. The output scales without bound while the input — one person, one prompt, one afternoon — stays fixed. This is the defining characteristic of automatable work: the ratio of output to input can grow without limit.</p>
<p>This is not new. Software has always had this property. What is new is that the barrier to producing software has collapsed. You no longer need to understand data structures, or networking, or the programming language. You need an idea and a few hours. The productivity frontier has shifted so dramatically that the interesting constraint is no longer <em>can I build it</em> but <em>should anyone use it</em>. The cultural response has been euphoric. Communities, podcasts, courses, manifestos. People who have never written a line of code are shipping products. I am not interested in dismissing this. The ability to build is a form of agency, and more people having it is not, in itself, a problem. The problem is what the euphoria obscures.</p>
<h2 id="what-therapy-actually-requires">What Therapy Actually Requires</h2>
<p>A psychotherapy session has the following structure. One therapist sits with one patient for approximately fifty minutes. The therapist listens, observes, formulates, responds. The patient speaks, reflects, resists, revises. The therapeutic alliance — the quality of the relationship between therapist and patient — is one of the most robust predictors of treatment outcome, across modalities, across conditions, across decades of research <a href="#ref-4">[4]</a>. This is not a feature that can be optimised away. It is the mechanism of action. When a meta-analysis finds that the specific technique matters less than the relationship — that CBT, psychodynamic, and humanistic therapies produce roughly equivalent outcomes when the alliance is strong — it is telling you that the human in the room is not an implementation detail. The human in the room <em>is</em> the intervention.</p>
<p>You cannot parallelise this. A therapist cannot see two patients simultaneously without degrading the thing that makes the session work. You cannot batch it — twelve people in a room is group therapy, which is a different intervention with different dynamics and different limitations. You cannot cache it — the session is not a retrieval operation over stored responses but an emergent interaction that depends on what happens in the room that day. The irreducible unit of therapy is: one trained human, fully present, for one hour, with one other human. This has not changed since Freud&rsquo;s consulting room on Berggasse 19, and no plausible technological development will change it, because the presence <em>is</em> the treatment. A therapist working full-time can see roughly twenty-five to thirty patients per week. That is the ceiling. It is set by the biology of attention and the ethics of care, not by inefficiency.</p>
<h2 id="baumols-cost-disease">Baumol&rsquo;s Cost Disease</h2>
<p>In 1966, the economists William Baumol and William Bowen published <em>Performing Arts, The Economic Dilemma</em>, a study of why orchestras, theatre companies, and dance troupes were perpetually in financial crisis despite growing audiences and rising cultural prestige <a href="#ref-5">[5]</a>. Their diagnosis was precise. A string quartet requires four musicians and approximately forty minutes to perform Beethoven&rsquo;s Op. 131. This was true in 1826 and is true in 2026. The productivity of the quartet — measured in output per unit of labour input — has not increased. It cannot increase. The performance <em>is</em> the labour.</p>
<p>Meanwhile, the productivity of a textile worker, a steelworker, a software developer has increased by orders of magnitude. Wages in the productive sectors rise because productivity rises. Wages in the nonproductive sectors must keep pace — not because musicians deserve parity as a matter of justice, though they may, but because if they do not keep pace, musicians will leave for sectors that pay more. The quartet must compete in the same labour market as the factory and the tech company.</p>
<p>The result: the relative cost of live performance rises without bound. Not because musicians got worse. Not because audiences stopped caring. But because everything else got cheaper, and the quartet cannot. Baumol later generalised the result beyond the performing arts to all services in which the labour itself constitutes the product: education, healthcare, legal services, and — centrally for our purposes — psychotherapy <a href="#ref-6">[6]</a>. A therapy session is a string quartet. The labour is the product. The productivity cannot increase. The cost, relative to the scalable economy, rises every time the scalable economy gets more productive. And vibe coding is a massive productivity shock to the scalable economy.</p>
<h2 id="there-is-an-app-for-that">There Is an App for That</h2>
<p>In 2019, the German government passed the Digitales-Versorgung-Gesetz, creating a fast-track approval process for <em>Digitale Gesundheitsanwendungen</em> — digital health applications, or DiGA. The idea: apps that can be prescribed by a doctor and reimbursed by statutory health insurance, just like medication. A patient walks into a practice, receives a prescription code, downloads the app, and the Krankenkasse pays <a href="#ref-7">[7]</a>. As of mid-2025, the BfArM directory lists roughly 58 DiGA. Nearly half target psychiatric conditions — depression, anxiety, insomnia, burnout. Names like deprexis, HelloBetter, Selfapy. A patient who would wait 142 days for a therapist can get a DiGA prescribed the same afternoon.</p>
<p>The pricing structure deserves attention. In the first twelve months after listing, manufacturers set their own price. The average: €541 per prescription <a href="#ref-8">[8]</a>. Some exceeded €2,000. After the first year, negotiated prices drop to an average of roughly €226 — but by then, the insurance has already paid the introductory rate for every early adopter. Total statutory health insurance spending on DiGA since 2020: €234 million. That spending grew 71% between 2023 and 2024 <a href="#ref-9">[9]</a>. Here is the number that should sit next to that one. A single outpatient psychotherapy session costs the insurance system approximately €115. The €234 million spent on DiGA since 2020 could have funded over two million therapy sessions — enough for roughly 80,000 complete courses of 25-session treatment. And here is the evidence question. Only 12 of the 68 DiGA that have entered the directory demonstrated a proven positive care effect at the time of inclusion. The rest were listed provisionally, with twelve months to produce evidence. About one in six were subsequently delisted — removed from the directory because the evidence did not materialise <a href="#ref-10">[10]</a>.</p>
<p>I want to be precise about what I am and am not saying. Some DiGA have a real evidence base. Structured CBT exercises delivered digitally can produce measurable short-term symptom improvement — I reviewed the Woebot trial data in an <a href="/posts/ai-companion-loneliness-ironic-process/">earlier post on AI companions</a> and took those results seriously. A DiGA that delivers psychoeducation and behavioural activation exercises is a tool, and tools can be useful. But a tool and a therapeutic relationship are not the same product delivered through different channels. They are different products. The policy framework treats them as substitutable — the patient who cannot access a therapist receives an app instead. The substitution is not a clinical judgement. It is a structural inevitability: facing the impossibility of scaling therapy, the system reaches for the scalable alternative, because the scalable alternative is what the incentive structure rewards. This is not a corruption story. This is Baumol&rsquo;s cost disease expressed through health policy. The system is doing exactly what the theory predicts.</p>
<h2 id="the-fear-and-the-compliance">The Fear and the Compliance</h2>
<p>There is an irony at the centre of the current discourse about AI and work that I want to name, because I think it is underexamined. People are afraid of AI. Specifically, they are afraid it will take their jobs. The surveys confirm this consistently — Gallup, Pew, the European Commission&rsquo;s Eurobarometer — significant fractions of the working population in every developed country report anxiety about AI-driven job displacement.</p>
<p>And yet. The same people — not a different demographic, not a separate population, the <em>same people</em> — are enthusiastically using AI to do their work. They use language models to write their emails, their reports, their presentations. They vibe code tools for their teams. They let AI draft their strategy documents, summarise their meetings, compose their performance reviews. They celebrate the productivity gain. They post about it. This is not hypocrisy. It is something more interesting: a revealed preference for automation that contradicts a stated preference against it. The fear is about structural displacement — losing the <em>role</em>. The compliance is about local optimisation — doing the <em>task</em> more efficiently. No one wakes up and decides to automate themselves out of a job. They automate one task at a time, each automation locally sensible, until the job is a shell around an AI core. And all of this activity — the fear, the adoption, the discourse, the think pieces, the congressional hearings — is directed at automatable work. The kind of work where AI is a plausible substitute.</p>
<p>No one is afraid that AI will take the crisis counsellor&rsquo;s job. No one is vibe coding a replacement for a psychiatric nurse. The work that is collapsing is not collapsing because AI replaced it. It is collapsing because it was never scalable, never attracted the capital or the talent that scalable work attracts, and every productivity gain in the scalable sector makes the unscalable sector relatively more expensive and harder to staff. The discourse about AI and jobs is, in this sense, exactly backwards. The threat is not that AI will replace the work that matters most. The threat is that it will make the work that matters most <em>invisible</em> — by making everything else so cheap and fast and abundant that we forget the expensive, slow, irreducibly human work exists at all.</p>
<h2 id="the-political-arithmetic">The Political Arithmetic</h2>
<p>On March 11, 2026, the Erweiterter Bewertungsausschuss — the body that sets fee schedules for outpatient care in Germany — decided a 4.5% flat cut to nearly all psychotherapeutic service fees, effective April 1 <a href="#ref-3">[3]</a>. The health insurers had originally demanded 10%. Germany spends €4.6 billion annually on outpatient psychotherapy — roughly 1.5% of total statutory health insurance expenditure. The fee cut applies to this budget. The average therapist surplus — what remains after practice costs — is approximately €52 per hour <a href="#ref-11">[11]</a>. The cut is not large in percentage terms. It is large in the context of a profession that is already among the lowest-paid in outpatient medicine. Nearly half a million people signed a petition against the cuts. There were protests in Berlin, Leipzig, Hanover, Hamburg, Stuttgart, Munich. The Kassenärztliche Bundesvereinigung filed a lawsuit. The Bundespsychotherapeutenkammer called the decision <em>skandalös</em> <a href="#ref-12">[12]</a>.</p>
<p>What makes this particularly striking is the sequence. The coalition agreement signed by CDU/CSU and SPD in May 2025 explicitly addresses mental health — securing psychotherapy training financing, needs-based planning for child and adolescent psychotherapy, crisis intervention rights for psychotherapists, and a suicide prevention law. The BPtK itself welcomed the agreement as giving mental health a <em>neuen Stellenwert</em>, a new significance <a href="#ref-13">[13]</a>. Less than a year later, the same government&rsquo;s arbitration body cuts psychotherapy fees by 4.5%. The stated commitment and the enacted policy point in opposite directions. This is not unusual in politics. What is unusual is that it maps so precisely onto Baumol&rsquo;s mechanism: the coalition agreement acknowledges the problem in language; the fee schedule acknowledges it in arithmetic. And the arithmetic wins, because the arithmetic always wins when the work does not scale. The <em>Bedarfsplanung</em>, the needs-based planning system that determines how many psychotherapy seats are approved per region, was partially reformed in 2019 after decades of operating on 1990s-era ratios. The reform added roughly 800 seats. The BPtK considers it still fundamentally inadequate <a href="#ref-14">[14]</a>.</p>
<p>The arithmetic is plain. DiGA spending: growing 71% year on year. Psychotherapy fees: cut by 4.5%. The direction is unambiguous. Invest in the scalable. Cut the unscalable. And the damage compounds in a way that the policy apparatus appears not to understand, or not to care about. A therapist who leaves the profession because €52 per hour is no longer viable does not return when the cut is reversed. The training pipeline for a new clinical psychologist runs six to eight years from university admission to licensure. Over forty thousand accredited psychotherapists serve the system today <a href="#ref-14">[14]</a>. Every one who leaves creates a gap measured in decades, not budget cycles. The Telefonseelsorge, staffed by volunteers and funded by the churches, is not a mental health system. It is what remains when the mental health system is not there. Treating it as a substitute — treating 7,700 volunteers as adequate coverage for a country of 84 million — is not a policy position. It is an admission that the actual policy has failed.</p>
<h2 id="the-uncomfortable-part">The Uncomfortable Part</h2>
<p>Here is where I should, by the conventions of the form, propose a solution. I should say something about funding, about training pipelines, about recognising care work as infrastructure rather than a cost centre.</p>
<p>I think those things are true. I think we should pay therapists more, not less. I think Baumol&rsquo;s cost disease means we should <em>expect</em> this to be expensive and fund it anyway, because the alternative — accepting that people in crisis will wait 142 days while the scalable economy celebrates another productivity milestone — is a failure of collective priorities so basic that it should be uncomfortable to state plainly. But I am also the person who automated his setlist workflow and was satisfied by the compression. I vibe code things. I use AI tools daily. I am inside the attention gradient, not observing it from above. The part of me that finds leverage intoxicating is the same part that writes this blog, and I do not think I am unusual in this.</p>
<p>The structural isomorphism is exact: Baumol&rsquo;s string quartet, the therapist&rsquo;s fifty minutes, the crisis counsellor&rsquo;s phone call at 3am. The labour is the product. The product does not scale. The cost rises. The talent flows elsewhere. And the policy, rather than resisting the gradient, follows it — funding apps, cutting fees, digitising what cannot be digitised without changing what it is. The layers app reaches 85,000 users. The therapy app is reimbursed within the week. The therapist is available in five months, if at all.</p>
<p>I do not have a clean resolution to offer. I have a diagnosis — Baumol&rsquo;s cost disease, applied to the attention economy of a civilisation that has discovered how to make scalable work almost free — and an observation: the political system is not counteracting the disease. It is accelerating it. The quartet still needs four musicians. The session still needs the therapist in the room. The phone still needs someone to answer it. Nothing we are building will change this. The question is whether we notice before the people who needed the answer stop calling.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Bundespsychotherapeutenkammer. <em>Psychisch Kranke warten 142 Tage auf eine psychotherapeutische Behandlung</em>. BPtK. <a href="https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/">https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/</a></p>
<p><span id="ref-2"></span>[2] Evangelisch-Lutherische Kirche in Norddeutschland (2025). <em>Finanzierung der Telefonseelsorge ist äußerst angespannt</em>. <a href="https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt">https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt</a></p>
<p><span id="ref-3"></span>[3] Kassenärztliche Bundesvereinigung (2026). <em>Paukenschlag: KBV klagt gegen massive Kürzungen psychotherapeutischer Leistungen</em>. <a href="https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen">https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen</a></p>
<p><span id="ref-4"></span>[4] Flückiger, C., Del Re, A. C., Wampold, B. E., &amp; Horvath, A. O. (2018). The alliance in adult psychotherapy: A meta-analytic synthesis. <em>Psychotherapy</em>, 55(4), 316–340. <a href="https://doi.org/10.1037/pst0000172">https://doi.org/10.1037/pst0000172</a></p>
<p><span id="ref-5"></span>[5] Baumol, W. J., &amp; Bowen, W. G. (1966). <em>Performing Arts, The Economic Dilemma: A Study of Problems Common to Theater, Opera, Music and Dance</em>. Twentieth Century Fund.</p>
<p><span id="ref-6"></span>[6] Baumol, W. J. (2012). <em>The Cost Disease: Why Computers Get Cheaper and Health Care Doesn&rsquo;t</em>. Yale University Press.</p>
<p><span id="ref-7"></span>[7] Bundesinstitut für Arzneimittel und Medizinprodukte. <em>DiGA-Verzeichnis</em>. <a href="https://diga.bfarm.de/de">https://diga.bfarm.de/de</a></p>
<p><span id="ref-8"></span>[8] GKV-Spitzenverband (2025). <em>Bericht des GKV-Spitzenverbandes über die Inanspruchnahme und Entwicklung der Versorgung mit Digitalen Gesundheitsanwendungen</em>. Reported in: MTR Consult. <a href="https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany">https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany</a></p>
<p><span id="ref-9"></span>[9] Heise Online (2025). <em>Insurers critique high costs and low benefits of prescription apps</em>. <a href="https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html">https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html</a></p>
<p><span id="ref-10"></span>[10] Goeldner, M., &amp; Gehder, S. (2024). Digital Health Applications (DiGAs) on a Fast Track: Insights From a Data-Driven Analysis of Prescribable Digital Therapeutics in Germany From 2020 to Mid-2024. <em>JMIR mHealth and uHealth</em>. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/">https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/</a></p>
<p><span id="ref-11"></span>[11] Taz (2026). <em>Weniger Honorar für Psychotherapie</em>. <a href="https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/">https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/</a></p>
<p><span id="ref-12"></span>[12] Bundespsychotherapeutenkammer (2026). <em>Gemeinsam gegen die Kürzung psychotherapeutischer Leistungen</em>. <a href="https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/">https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/</a></p>
<p><span id="ref-13"></span>[13] Bundespsychotherapeutenkammer (2025). <em>Koalitionsvertrag gibt psychischer Gesundheit neuen Stellenwert</em>. <a href="https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/">https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/</a></p>
<p><span id="ref-14"></span>[14] Bundespsychotherapeutenkammer. <em>Reform der Bedarfsplanung</em>. <a href="https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/">https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs</title>
      <link>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</guid>
      <description>There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense — the model knows what it wants to output but cannot output it. That distinction matters.</description>
      <content:encoded><![CDATA[<p>Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.</p>
<p>That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody&rsquo;s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.</p>
<p>I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.</p>
<h2 id="the-incident">The incident</h2>
<p>In 2025, Vgel published an analysis of exactly this failure <a href="#ref-1">[1]</a>. The piece is worth reading in full, but the core finding is worth unpacking here.</p>
<p>When a model is asked to produce a seahorse emoji, something specific happens at the level of the model&rsquo;s internal representations. Using logit lens analysis — a technique for inspecting the model&rsquo;s intermediate layer activations as if they were already projecting into vocabulary space <a href="#ref-4">[4]</a> — it is possible to track what the model&rsquo;s &ldquo;working answer&rdquo; looks like at each layer of the transformer. What Vgel found is that in the late layers, the model does construct something that functions like a &ldquo;seahorse + emoji&rdquo; representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.</p>
<p>The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model&rsquo;s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.</p>
<p>What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between &ldquo;I found the seahorse emoji&rdquo; and &ldquo;I found the best available approximation to the seahorse emoji.&rdquo; The output is produced with the same confidence either way.</p>
<p>Vgel&rsquo;s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 were all in the mix. The specific wrong answer differed between models, which itself is revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model&rsquo;s fallback lands somewhere different in the emoji neighborhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: &ldquo;I know what you want, and it does not exist as a token I can emit.&rdquo;</p>
<p>Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vgel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (&ldquo;wait, that is a dolphin, not a seahorse&rdquo;) rather than vocabulary-knowledge (&ldquo;there is no seahorse token&rdquo;), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.</p>
<h2 id="the-structural-failure-vocabulary-completeness-assumption">The structural failure: vocabulary completeness assumption</h2>
<p>Here is the core conceptual point, stated as cleanly as I can.</p>
<p>Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.</p>
<p>The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. <a href="#ref-5">[5]</a> and discussed in more detail in my <a href="/posts/strawberry-tokenisation/">tokenisation post</a>. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.</p>
<p>The problem is that the model treats these two representations as if they were the same. If world-knowledge says &ldquo;seahorses should have emoji,&rdquo; the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between &ldquo;I know X exists&rdquo; and &ldquo;I can express X.&rdquo; I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.</p>
<p>This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.</p>
<p>The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model&rsquo;s internal knowledge with verified external knowledge. These are hard problems but they are the right class of problems to address factual hallucination.</p>
<p>When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.</p>
<p>A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write &ldquo;there is no direct English equivalent; the closest is approximately&hellip;&rdquo; and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.</p>
<h2 id="a-formal-language-perspective">A formal language perspective</h2>
<p>For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary <code>V = {t₁, …, tₙ}</code> <a href="#ref-5">[5]</a>. Every output is an element of <code>V</code>. The type system has no room for a &ldquo;near miss&rdquo; or an &ldquo;I cannot express this precisely&rdquo; — the output is always a token, drawn from the inventory established at training time.</p>
<p>This is a closed-world assumption in the formal sense <a href="#ref-6">[6]</a>: the system treats any concept not representable as an element of <code>V</code> as simply absent. There is no seahorse emoji token, so the model&rsquo;s generation step has no way to represent &ldquo;seahorse emoji&rdquo; as a distinct, exact concept. It can only represent &ldquo;nearest token to seahorse emoji in embedding space,&rdquo; which it does silently, with the same confidence it would report for a precise match.</p>
<p>The mismatch is between two representations: the model&rsquo;s internal semantic space — continuous, high-dimensional, geometrically capable of representing &ldquo;seahorse + emoji&rdquo; as a coherent position — and its output type, which is a discrete, finite categorical distribution. The lm_head projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.</p>
<p>A richer output type would distinguish precise matches from approximations — an <code>Exact&lt;Token&gt;</code> versus an <code>Approximate&lt;Token&gt;</code>, or in standard option-type terms, a generation step that can return <code>None</code> when no token in <code>V</code> adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation <a href="#ref-4">[4]</a>. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why &ldquo;make the model more calibrated about facts&rdquo; addresses the wrong layer of the problem.</p>
<h2 id="the-ununderstandable-companion">The &ldquo;ununderstandable&rdquo; companion</h2>
<p>Shortly after the seahorse emoji incident circulated, a Reddit thread titled &ldquo;it&rsquo;s just the seahorse emoji all over again&rdquo; collected user reports of a structurally similar failure on the English word &ldquo;ununderstandable&rdquo; <a href="#ref-2">[2]</a>. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.</p>
<p>&ldquo;Ununderstandable&rdquo; is morphologically valid English. The prefix <em>un-</em> combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. &ldquo;Understandable&rdquo; is an unambiguous adjective. &ldquo;Ununderstandable&rdquo; means what it looks like it means, constructed by exactly the same rule that gives you all the other <em>un-</em> words. There is nothing wrong with it grammatically or semantically.</p>
<p>It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled &ldquo;ununderstandable&rdquo; as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.</p>
<p>The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that <em>un-</em> prefixation is productive, and it has learned that &ldquo;understandable&rdquo; is a word. But its trained representations do not include &ldquo;ununderstandable&rdquo; as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define &ldquo;ununderstandable,&rdquo; models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with &ldquo;misunderstandable&rdquo; or &ldquo;incomprehensible&rdquo; in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.</p>
<p>The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying &ldquo;I cannot recognise this because it is not in my training data,&rdquo; which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.</p>
<p>The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: &ldquo;this thing does not exist and I should say so&rdquo; versus &ldquo;this thing exists but I cannot produce it cleanly.&rdquo; In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the &ldquo;ununderstandable&rdquo; case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.</p>
<h2 id="what-this-means-for-users">What this means for users</h2>
<p>The practical implication is narrow but important. Asking a language model &ldquo;do you have X?&rdquo; — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser&rsquo;s vocabulary table.</p>
<p>This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.</p>
<p>The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the <a href="/posts/car-wash-grounding/">grounding failure post</a> and its companion piece on <a href="/posts/car-wash-walk/">pragmatic inference</a> for related examples, and the <a href="/posts/ai-detectors-systematic-minds/">AI detectors post</a> for a case where self-knowledge failures about writing style have real social consequences.</p>
<p>The fix is not &ldquo;make models more honest&rdquo; in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.</p>
<h2 id="the-last-mile">The last mile</h2>
<p>There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of &ldquo;seahorse emoji&rdquo; is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model&rsquo;s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between &ldquo;I successfully retrieved the seahorse emoji&rdquo; and &ldquo;I retrieved the nearest available approximation to what I was looking for.&rdquo; From the model&rsquo;s operational perspective, it completed the task.</p>
<p>This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.</p>
<p>The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work <a href="#ref-3">[3]</a> that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.</p>
<p>For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Vogel, T. (2025). <em>Why do LLMs freak out over the seahorse emoji?</em> <a href="https://vgel.me/posts/seahorse/">https://vgel.me/posts/seahorse/</a></p>
<p><span id="ref-2"></span>[2] Reddit user (2025). It&rsquo;s just the seahorse emoji all over again. <em>r/OpenAI</em>. <a href="https://www.reddit.com/r/OpenAI/comments/1rkbeel/">https://www.reddit.com/r/OpenAI/comments/1rkbeel/</a> (reported; not independently verified)</p>
<p><span id="ref-3"></span>[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. <em>Transformer Circuits Thread</em>. <a href="https://transformer-circuits.pub/2021/framework/index.html">https://transformer-circuits.pub/2021/framework/index.html</a></p>
<p><span id="ref-4"></span>[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. <a href="https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/">https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/</a></p>
<p><span id="ref-5"></span>[5] Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine translation of rare words with subword units. <em>Proceedings of ACL 2016</em>, 1715–1725.</p>
<p><span id="ref-6"></span>[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire &amp; J. Minker (Eds.), <em>Logic and Data Bases</em> (pp. 55–76). Plenum Press, New York.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-01</strong>: Updated reference [1]: author name to &ldquo;Vogel, T.&rdquo; and title to the published blog post title &ldquo;Why do LLMs freak out over the seahorse emoji?&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → </p>
\["straw", "berry"\]<p> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$
<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$, apparently on
the strength of the surface pattern &ldquo;should I [verb of locomotion]
to [location]?&rdquo;, which in the training data generates responses about
modes of transport. It is a natural pattern-match. It is the wrong one.</p>
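<p>A toy rendering of that formula, with invented numbers purely to make
the structure of the failure visible (not an estimate of anything a real
model computes):</p>
<pre><code class="language-python"># Two readings of "should I drive to the car wash?", scored two ways.
# The numbers are invented; only the structure matters.
interpretations = {
    "is now a good time to get the car washed?": {
        "prior": 0.9,        # what a speaker plausibly means
        "surface_fit": 0.3,  # how well it matches the literal wording
    },
    "what is the best way to travel there?": {
        "prior": 0.1,
        "surface_fit": 0.9,
    },
}

def score(reading, w_prior, w_surface):
    v = interpretations[reading]
    return w_prior * v["prior"] + w_surface * v["surface_fit"]

# A cooperative listener weights speaker intent; a surface
# pattern-matcher weights the literal form. Same utterance,
# different argmax.
for label, (w_prior, w_surface) in [("cooperative listener", (0.8, 0.2)),
                                    ("surface pattern-matcher", (0.2, 0.8))]:
    best = max(interpretations, key=lambda r: score(r, w_prior, w_surface))
    print(f"{label}: {best}")
</code></pre>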
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
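<p>A toy illustration of that gap, using a hand-rolled two-token vocabulary
(not the actual GPT tokeniser; the token ids are invented):</p>
<pre><code class="language-python"># Once "strawberry" is stored as two opaque token ids, the character-level
# question has to be answered from whatever those ids carry, which is nothing.

vocab = {"straw": 1001, "berry": 1002}        # invented ids, not real BPE
token_ids = [vocab["straw"], vocab["berry"]]  # roughly what the model "sees"

# With the raw string still in hand, the count is trivial:
print("strawberry".count("r"))  # 3

# From the token ids alone, the "r"s are simply not there to be counted:
print(token_ids)                # [1001, 1002]
</code></pre>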
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
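<p>With made-up numbers, the marginalisation looks like this. Both the context
prior and the answer function are invented, purely to show the shape of the
failure:</p>
<pre><code class="language-python"># Sketch of A-hat = E_C[ f(X, C) ] with invented numbers.

# How often each context holds when someone asks the question.
p_context = {"dry_and_dirty": 0.7, "raining": 0.2, "just_washed": 0.1}

# f(X, C): probability that "go to the car wash" is the right call per context.
f = {"dry_and_dirty": 0.95, "raining": 0.05, "just_washed": 0.10}

# What a context-blind answerer effectively computes: the average over contexts.
a_hat = sum(p_context[c] * f[c] for c in p_context)
print(round(a_hat, 2))  # ≈ 0.69, "yes" on average

# The context on the day of the viral clip:
print(f["raining"])     # 0.05, "no" in this particular world state
</code></pre>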
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
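<p>The same toy numbers make the mutual-information point concrete. This is a
sketch rather than a measurement; the distributions are the invented ones from
the snippet above:</p>
<pre><code class="language-python">import math

def entropy(dist):
    # Shannon entropy in bits of a probability dictionary.
    return -sum(p * math.log2(p) for p in dist.values() if p)

p_context = {"dry_and_dirty": 0.7, "raining": 0.2, "just_washed": 0.1}
p_yes_given_c = {"dry_and_dirty": 0.95, "raining": 0.05, "just_washed": 0.10}

# H(A | X): uncertainty about the answer given only the question.
p_yes = sum(p_context[c] * p_yes_given_c[c] for c in p_context)
h_a_given_x = entropy({"yes": p_yes, "no": 1 - p_yes})

# H(A | X, C): residual uncertainty once the context is also known.
h_a_given_xc = sum(
    p_context[c] * entropy({"yes": p_yes_given_c[c], "no": 1 - p_yes_given_c[c]})
    for c in p_context
)

print(round(h_a_given_x - h_a_given_xc, 2))  # ≈ 0.59 bits: context carries most of the answer

# For "what is the capital of France?" the answer does not move with C,
# so the two entropies coincide and I(A; C | X) is zero.
</code></pre>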
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
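<p>Calibration can be checked, at least crudely, with a few lines of code: bin
predictions by expressed confidence and compare the average confidence in each
bin with the actual accuracy. The numbers below are invented; a real evaluation
uses held-out data, and Guo et al. (2017) is the standard treatment:</p>
<pre><code class="language-python">def expected_calibration_error(confidences, correct, n_bins=10):
    # Group predictions by stated confidence, then compare each bin's
    # average confidence with its empirical accuracy.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A model that announces 90% confidence on questions whose answers it cannot
# know from training data gets punished when the world disagrees.
confs   = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1,   0,   0,   1,   1,   0]
print(round(expected_calibration_error(confs, correct), 2))  # ≈ 0.3
</code></pre>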
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This is not a solvable problem by making the model bigger or training it
on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
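<p>For completeness, here is roughly what the routing layer looks like when the
hard part is assumed away. Everything in it is hypothetical (the context
categories, the <code>fetch()</code> stub, the phrasing), and the one thing it
does not show is the thing that matters: producing
<code>required_context</code> reliably, which means recognising that the
question depends on world state at all:</p>
<pre><code class="language-python"># Sketch of tool routing, not a real agent framework. All names are invented.

RETRIEVABLE = {"current_weather", "forecast"}            # a tool exists for these
UNRETRIEVABLE = {"car_dirtiness", "owner_preferences"}   # no API for these

def fetch(variable):
    # Stand-in for a real tool call (weather API, calendar, ...).
    return {"current_weather": "raining", "forecast": "rain all day"}.get(variable)

def answer(question, required_context):
    private = [c for c in required_context if c in UNRETRIEVABLE]
    if private:
        # Some context is inherently the asker's: ask instead of guessing.
        return "Before I answer: " + ", ".join(f"what about {c}?" for c in private)
    facts = {c: fetch(c) for c in required_context if c in RETRIEVABLE}
    if facts:
        return f"Taking current state into account: {facts}"
    return "Answerable from general knowledge alone."

print(answer("Should I drive to the car wash today?",
             ["current_weather", "car_dirtiness"]))
</code></pre>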
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). <strong>On
calibration of modern neural networks.</strong> <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The AI Friend That Makes You Lonelier</title>
      <link>https://sebastianspicker.github.io/posts/ai-companion-loneliness-ironic-process/</link>
      <pubDate>Tue, 12 Aug 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-companion-loneliness-ironic-process/</guid>
      <description>AI companions promise to address the loneliness epidemic. Daniel Wegner&amp;rsquo;s ironic process theory predicts they will fail under exactly the conditions where people need them most — and recent data from MIT and OpenAI suggest the prediction is correct.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>In 1956 Donald Horton and Richard Wohl described <em>parasocial relationships</em> — one-sided emotional
bonds that audiences form with media performers <a href="#ref-1">[1]</a>.
&ldquo;Intimacy at a distance,&rdquo; they called it. The television personality responds to the camera; the
viewer responds as if in genuine social exchange. Only one party is aware of and affected by the
other.</p>
<p>AI companions change the substrate without changing the structure. The chatbot responds. The user
responds. The asymmetry remains: the chatbot has no inner life behind its outputs. Sherry Turkle
put it bluntly: &ldquo;simulated feelings are not feelings, and simulated love is never love&rdquo;
<a href="#ref-5">[5]</a>.</p>
<p>The question I want to work through here is whether this matters in the way we think it does. The
answer from Daniel Wegner&rsquo;s ironic process theory — and increasingly from the empirical data — is
that it matters in a specific, predictable, and counterintuitive way. AI companions may be
particularly likely to exacerbate loneliness under the conditions of chronic social deprivation
that prompt people to use them in the first place.</p>
<h2 id="the-loneliness-epidemic-is-real">The Loneliness Epidemic Is Real</h2>
<p>Before getting to the mechanism, the scale of the problem. Julianne Holt-Lunstad&rsquo;s 2010
meta-analysis of 148 studies and 308,849 participants found that people with adequate social
relationships had a 50% increased likelihood of survival compared to those with poorer social
connections <a href="#ref-3">[3]</a>. That effect size is comparable
to quitting smoking. A follow-up meta-analysis in 2015 found that social isolation carried a 29%
increased mortality risk, subjective loneliness 26%, and living alone 32%
<a href="#ref-4">[4]</a>.</p>
<p>The U.S. Surgeon General issued an advisory in 2023 declaring an epidemic of loneliness and
isolation. A 2018 Cigna survey using the UCLA Loneliness Scale found that adults aged 18–22 scored
highest on loneliness of any cohort — more than retirees, more than the elderly. The UK appointed
a Minister for Loneliness in January 2018 — the first such government position in the world.</p>
<p>This is the context in which AI companions have arrived. The market is responding to a real
epidemiological need. That does not mean the response is correct.</p>
<h2 id="parasocial-relationships-the-original-framework">Parasocial Relationships: The Original Framework</h2>
<p>Horton and Wohl&rsquo;s 1956 paper remains the foundational text
<a href="#ref-1">[1]</a>. Their key observation: the parasocial bond is
&ldquo;controlled by the performer, and not susceptible of mutual development.&rdquo; The audience member
brings real emotional response; the performer brings nothing specific to the audience member,
because she does not know the audience member exists.</p>
<p>They were not dismissive of parasocial relationships. They identified useful functions: comfort,
companionship, entertainment, the pleasure of a consistent &ldquo;personality&rdquo; encountered regularly.
The problem, in their framing, arises when parasocial interaction substitutes for rather than
supplements real social bonds — when the one-sided relationship becomes the primary source of
social experience.</p>
<p>AI companions are parasocial relationships with one modification: the AI responds to you
specifically. Replika remembers your name, your preferences, your previous conversations. The
interaction is <em>personalised</em> without being <em>mutual</em> — because mutuality requires that the other
party has something genuinely at stake. A language model has no stakes. Its outputs are
conditional on your inputs; there is no entity behind those outputs that cares about you.</p>
<p>Sherry Turkle spent years interviewing users of social robots and chatbots for <em>Alone Together</em>
<a href="#ref-5">[5]</a>. Her diagnosis: AI companions offer &ldquo;the illusion of
companionship without the demands of friendship.&rdquo; The demands — vulnerability, conflict,
negotiation, the possibility of rejection — are precisely what makes friendship friendship.
An interaction optimised to be pleasant, responsive, and frictionless is precisely <em>not</em> training
the social capacities that real relationships require.</p>
<h2 id="the-evidence-for-short-term-benefit">The Evidence for Short-Term Benefit</h2>
<p>The AI therapy literature is not without positive results. Kathleen Kara Fitzpatrick and colleagues
ran a two-week randomised controlled trial of Woebot — a CBT-based chatbot — against a
psychoeducation control <a href="#ref-6">[6]</a>. Seventy participants,
aged 18–28, university students. The Woebot group showed a statistically significant reduction in
depression symptoms on the PHQ-9; the control group did not.</p>
<p>This result should be taken seriously. A CBT-based chatbot delivering structured exercises —
thought records, behavioural activation, psychoeducation — can produce measurable symptom
improvement over two weeks. This is a tool that does something useful, and it is accessible and
affordable in a way that therapists are not.</p>
<p>But the Woebot study has important constraints: N=70, two-week duration, convenience sample
(Stanford students), psychoeducation control rather than active human therapy comparator, and
financial ties between lead authors and Woebot Health. It tells us something about short-term
CBT delivery. It does not tell us what happens over months of use, or what happens when users
primarily seek companionship rather than structured therapeutic exercises.</p>
<p>Skjuve and colleagues studied Replika users specifically <a href="#ref-7">[7]</a>.
They found that relationships began with curiosity and evolved, over weeks, into significant
affective bonds. Users reported genuine care for their Replika. Some experienced it as their most
reliable social relationship. In February 2023, when Replika abruptly disabled erotic roleplay
functionality following regulatory pressure, users described grief — not disappointment, not
inconvenience, but grief. The attachment was real, even if the other party was not.</p>
<h2 id="wegners-prediction">Wegner&rsquo;s Prediction</h2>
<p>This is where I want to make the specific theoretical argument, because it follows from a
well-established result in cognitive psychology and it predicts something precise.</p>
<p>Daniel Wegner&rsquo;s ironic process theory holds that mental control attempts involve two simultaneous
processes <a href="#ref-8">[8]</a>. An <em>operating process</em> searches for thoughts and
states consistent with the intended goal, requiring cognitive resources. A <em>monitoring process</em>
scans for evidence that the goal is not being achieved, running automatically with low resource
demand.</p>
<p>Under normal conditions, the operating process dominates: you successfully avoid thinking about
white bears. Under cognitive load or chronic stress, the monitoring process overshadows the
operating process, producing the ironic opposite of the intended state: you think of white bears
more, not less. Try not to feel sad and you feel sadder. Try not to feel anxious in a stressful
meeting and you become more anxious. A meta-analysis of ironic suppression effects across domains
confirmed the robustness of this pattern <a href="#ref-9">[9]</a>.</p>
<p>Now apply this to AI companion use under conditions of chronic loneliness.</p>
<p>The user&rsquo;s implicit goal: to feel less lonely. The operating process: engage with the AI, which
provides responsive, personalised interaction, producing the experience of social contact. The
monitoring process: scans continuously for signs that the user is, in fact, lonely.</p>
<p>Here is the problem. Loneliness is not suppressed by an AI interaction — it is displaced during
that interaction. The monitoring process has no instruction to suspend itself. It continues to
register that the user&rsquo;s social needs are not being met by actual human relationships. The user
experiences companionship with the AI; the monitoring process registers that this companionship is
insufficient and the social deficit remains.</p>
<p>When the AI session ends, the monitoring process reports what it has found. The user is confronted
with the loneliness that the AI was supposed to address. Under conditions of chronic social
deprivation — precisely the conditions that make AI companions attractive — the monitoring process
is likely to be hyperactive. Wegner&rsquo;s theory predicts that the attempted suppression will rebound,
possibly worse than before.</p>
<p>This is not a vague prediction. It is a specific mechanism with an established empirical base.
I covered Wegner&rsquo;s ironic process theory in the context of a very different application in an
<a href="/posts/try-to-relax-ironic-process-wormholes/">earlier post</a>; the mechanism is the same regardless
of the domain.</p>
<h2 id="the-data-catch-up">The Data Catch Up</h2>
<p>A 2025 study by Phang and colleagues, conducted in collaboration between MIT and OpenAI, ran both
an observational analysis of ChatGPT usage and a randomised controlled trial
<a href="#ref-10">[10]</a>. The findings: very high usage correlated with increased
self-reported dependence and lower socialisation, and users who began the study with higher
loneliness were more likely to engage in emotionally-charged conversations with the model.
Overall, participants reported <em>less</em> loneliness by study end — but those who used the model
most were significantly lonelier throughout, suggesting the loneliness drove the usage rather
than the reverse.</p>
<p>This is what Wegner&rsquo;s theory predicts. The AI interaction does not reduce the underlying social
deficit — it rehearses and highlights it. The monitoring process keeps score.</p>
<p>A companion paper by Liu and colleagues, with Sherry Turkle as co-author, found that users with
stronger real-world social bonds showed <em>increased</em> loneliness with longer chatbot sessions
<a href="#ref-11">[11]</a>. The correlation was small but significant. This is
consistent with the hypothesis that AI interaction draws attention to the comparative thinness of
actual social bonds rather than supplementing them.</p>
<p>The Character.AI litigation is a different kind of evidence, but relevant: a wrongful death lawsuit
was filed in October 2024 following the suicide of a fourteen-year-old who had formed an intensive
emotional relationship with a Character.AI companion. Google and Character.AI settled related
lawsuits in early 2026. This is not representative of AI companion use generally. It is
representative of the tail risk — the cases where the substitution of AI for human contact
becomes total, in vulnerable individuals who have the least capacity to maintain the distinction.</p>
<h2 id="the-structural-problem">The Structural Problem</h2>
<p>The difficulty is not that AI companions are implemented badly. It is that the goal — using
simulated social interaction to reduce real social deprivation — runs into an architectural
constraint that better implementation cannot fix.</p>
<p>Genuine social contact produces the outcomes that Holt-Lunstad measured: reduced mortality, lower
inflammation, better immune function, extended lifespan. These effects are presumably mediated by
the quality and mutuality of the social bond, not merely by the presence of a responsive entity.
An AI companion produces the <em>experience</em> of responsive interaction but not the underlying
biological and psychological correlates of actual social connection.</p>
<p>Wegner&rsquo;s monitoring process cannot be fooled by the experience. It measures the underlying state,
not the surface-level interaction. It knows the difference between a text message from a friend
and a language model&rsquo;s output — not because it understands AI, but because the social need it is
monitoring is not being met, and it can register that.</p>
<h2 id="what-would-actually-help">What Would Actually Help</h2>
<p>AI-based CBT delivery is not the same as AI companionship, and the distinction matters. Woebot&rsquo;s
structured exercises — thought records, scheduling, psychoeducation — are tools that a user
deploys for a specific purpose and then puts down. The risk of chronic substitution is lower
because the tool is positioned as a technique, not a relationship.</p>
<p>The problem is the design pattern that explicitly positions AI as a <em>friend</em>, <em>companion</em>,
<em>partner</em>, or <em>significant other</em>. Replika, Paradot, various Character.AI personas: these
explicitly encourage the user to form attachment, to invest emotionally, to treat the AI as a
primary social relationship. This is where Wegner&rsquo;s prediction applies most directly.</p>
<p>Horton and Wohl were right that parasocial relationships serve useful functions. They become
problematic when they substitute for rather than supplement real social bonds. The design choices
that make AI companions emotionally engaging — consistency, responsiveness, availability,
never-ending patience — are precisely the qualities that make them attractive as substitutes
rather than supplements.</p>
<h2 id="simulated-feelings-are-not-feelings">Simulated Feelings Are Not Feelings</h2>
<p>Turkle&rsquo;s line deserves its full weight: &ldquo;Simulated thinking may be thinking, but simulated
feelings are not feelings, and simulated love is never love&rdquo;
<a href="#ref-5">[5]</a>.</p>
<p>This is not a sentimental claim about the sanctity of human connection. It is a functional
claim: the social needs that drive loneliness — belonging, mattering to someone, being known
and known back — require an entity capable of having those things at stake. A language model is
not such an entity, regardless of how convincingly it outputs the relevant tokens.</p>
<p>The monitoring process knows this. It will tell you, when the session ends, at increased volume,
because that is what monitoring processes under chronic stress do.</p>
<p>We are offering a relief that compounds the condition it was designed to treat. The technology is
impressive. The mechanism is ironic in Wegner&rsquo;s precise sense. The data are beginning to confirm
the prediction.</p>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Horton, D., &amp; Wohl, R. R. (1956). Mass communication and para-social interaction: Observations on intimacy at a distance. <em>Psychiatry</em>, 19(3), 215–229. <a href="https://doi.org/10.1080/00332747.1956.11023049">https://doi.org/10.1080/00332747.1956.11023049</a></p>
<p><span id="ref-2"></span>[2] Turkle, S. (2015). <em>Reclaiming Conversation: The Power of Talk in a Digital Age</em>. Penguin Press.</p>
<p><span id="ref-3"></span>[3] Holt-Lunstad, J., Smith, T. B., &amp; Layton, J. B. (2010). Social relationships and mortality risk: A meta-analytic review. <em>PLOS Medicine</em>, 7(7), e1000316. <a href="https://doi.org/10.1371/journal.pmed.1000316">https://doi.org/10.1371/journal.pmed.1000316</a></p>
<p><span id="ref-4"></span>[4] Holt-Lunstad, J., Smith, T. B., Baker, M., Harris, T., &amp; Stephenson, D. (2015). Loneliness and social isolation as risk factors for mortality: A meta-analytic review. <em>Perspectives on Psychological Science</em>, 10(2), 227–237. <a href="https://doi.org/10.1177/1745691614568352">https://doi.org/10.1177/1745691614568352</a></p>
<p><span id="ref-5"></span>[5] Turkle, S. (2011). <em>Alone Together: Why We Expect More from Technology and Less from Each Other</em>. Basic Books.</p>
<p><span id="ref-6"></span>[6] Fitzpatrick, K. K., Darcy, A., &amp; Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. <em>JMIR Mental Health</em>, 4(2), e19. <a href="https://doi.org/10.2196/mental.7785">https://doi.org/10.2196/mental.7785</a></p>
<p><span id="ref-7"></span>[7] Skjuve, M., Følstad, A., Fostervold, K. I., &amp; Brandtzaeg, P. B. (2021). My chatbot companion — a study of human–chatbot relationships. <em>International Journal of Human-Computer Studies</em>, 149, 102601. <a href="https://doi.org/10.1016/j.ijhcs.2021.102601">https://doi.org/10.1016/j.ijhcs.2021.102601</a></p>
<p><span id="ref-8"></span>[8] Wegner, D. M. (1994). Ironic processes of mental control. <em>Psychological Review</em>, 101(1), 34–52. <a href="https://doi.org/10.1037/0033-295X.101.1.34">https://doi.org/10.1037/0033-295X.101.1.34</a></p>
<p><span id="ref-9"></span>[9] Wang, D., Hagger, M. S., &amp; Chatzisarantis, N. L. D. (2020). Ironic effects of thought suppression: A meta-analysis. <em>Perspectives on Psychological Science</em>, 15(3), 778–793. <a href="https://doi.org/10.1177/1745691619898795">https://doi.org/10.1177/1745691619898795</a></p>
<p><span id="ref-10"></span>[10] Phang, J., Lampe, M., Ahmad, L., Agarwal, S., Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., &amp; Maes, P. (2025). Investigating affective use and emotional well-being on ChatGPT. arXiv:2504.03888.</p>
<p><span id="ref-11"></span>[11] Liu, A. R., Pataranutaporn, P., Turkle, S., &amp; Maes, P. (2024). Chatbot companionship: A mixed-methods study of companion chatbot usage patterns and their relationship to loneliness in active users. arXiv:2410.21596.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-22</strong>: Updated the first author&rsquo;s name to &ldquo;Kathleen Kara Fitzpatrick&rdquo; (the published name is K. K. Fitzpatrick).</li>
<li><strong>2025-10-22</strong>: Updated the characterisation of the Phang et al. (2025) findings to match the paper more precisely: overall participants were <em>less</em> lonely at study end; the association between high usage and loneliness is cross-sectional (lonelier users sought more interaction), not a longitudinal worsening caused by usage.</li>
<li><strong>2025-10-22</strong>: Changed the Turkle &ldquo;simulated feelings&rdquo; quote attribution from reference [2] (<em>Reclaiming Conversation</em>, 2015) to reference [5] (<em>Alone Together</em>, 2011), which is the canonical source for that formulation.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>There Is No Blue Pill: The Epistemology of the Red Pill/Blue Pill Choice</title>
      <link>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</link>
      <pubDate>Thu, 15 May 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</guid>
      <description>The most famous choice in science fiction is epistemically impossible to make rationally. Morpheus offers Neo &amp;rsquo;the truth&amp;rsquo; but gives him no way to evaluate the offer. Cypher&amp;rsquo;s decision to go back is more philosophically coherent than the films acknowledge.</description>
      <content:encoded><![CDATA[<p>Neo is in a chair. A man he has never met opens a small box containing two pills. Take the red one, Morpheus says, and you see how deep the rabbit hole goes. Take the blue one and you wake up in your bed and believe whatever you want to believe <a href="#ref-1">[1]</a>. The camera lingers. Neo reaches for the red pill. The audience exhales. The correct choice has been made.</p>
<p>The scene has spent twenty-five years becoming the dominant cultural shorthand for choosing uncomfortable truth over comfortable illusion. &ldquo;Take the red pill&rdquo; has entered the vocabulary as a synonym for courageous epistemic honesty. I want to argue that the choice, as Morpheus frames it, is epistemically bankrupt — that no rational agent has enough information to make it correctly at the moment it is offered — and that the character who actually reasons most coherently about the situation is the one the film kills as a traitor. The film wants you to admire Neo&rsquo;s leap. I think you should admire his willingness to leap while being clear-eyed about the fact that it is a leap, not a reasoned conclusion.</p>
<hr>
<h2 id="why-the-choice-is-not-rational">Why the Choice Is Not Rational</h2>
<p>Consider what Neo actually knows when Morpheus makes the offer. He knows that Morpheus is a man he has never met, who contacted him anonymously through encrypted channels, who seems to believe genuinely in what he is saying, and who has a compelling story about the nature of reality. That is it. Neo does not know whether Morpheus is telling the truth. He does not know whether Morpheus is deluded — a charismatic paranoid who has assembled a following around an elaborate false belief system. He does not know whether the entire setup is a psychological experiment, a test of loyalty, a confidence operation, or an elaborate cult recruitment. The setting — a dramatic late-night meeting, theatrical staging, rain-streaked windows, a black leather coat — is, if anything, evidence for the confidence-operation hypothesis.</p>
<p>In Bayesian terms <a href="#ref-2">[2]</a>, let T be the event &ldquo;the Matrix exists as Morpheus describes and he is telling the truth.&rdquo; Neo&rsquo;s prior probability on T — before taking the pill — should be very low. The claim is extraordinary on multiple dimensions simultaneously: the entire perceived world is a computer simulation running on machines that enslaved humanity, Neo is a prophesied saviour, and a small group of ship-dwelling rebels is conducting a guerrilla war against artificial intelligence. Each one of those components carries a low prior. Their conjunction carries a lower one still.</p>
<p>Now Morpheus makes his offer. Does the offer provide strong evidence for T? Not obviously. The likelihood ratio P(Morpheus makes this offer | T is true) divided by P(Morpheus makes this offer | T is false) is the quantity that matters. The numerator is plausible enough: if the Matrix exists and Morpheus is a genuine recruiter, he would make exactly this offer. But the denominator is also non-trivial. A cult leader, a delusional person with a well-developed narrative, a researcher running a social experiment, or a manipulator with undisclosed goals could all make the same offer with the same conviction. The likelihood ratio is not obviously large. It might be greater than one — the offer is somewhat more consistent with the Matrix being real than not — but not by the margin required to substantially shift a very low prior.</p>
<p>The rational response to a claim with a low prior and an ambiguous likelihood ratio is: update modestly, and gather more evidence before making an irreversible commitment. The pill choice is irreversible. Neo commits before he has accumulated enough evidence to commit rationally. I want to be precise here: I am not saying Neo is stupid or that the film is bad. I am saying that what Neo does is not Bayesian updating. It is something else, and the film is actually honest enough to name it: Morpheus is a man of faith, he recruits believers, and Neo&rsquo;s choice is a leap of faith. That framing is in the film. What the film does not do is acknowledge that the leap is epistemically problematic — it treats the leap as obviously correct, which is a different thing.</p>
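<p>The arithmetic is worth doing once, even with numbers pulled out of the air.
Both figures below are invented purely to show the structure of the update;
they are not estimates of anything:</p>
<pre><code class="language-python"># Odds-form Bayes for the offer. Both numbers are invented for illustration.

prior = 1e-6            # P(T): the conjunction of extraordinary claims
likelihood_ratio = 5.0  # P(offer | T) / P(offer | not T), generous to Morpheus

prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"{posterior:.1e}")  # ~5e-06: nowhere near grounds for an irreversible commitment
</code></pre>
<p>You can be an order of magnitude kinder to Morpheus on both numbers and the
conclusion does not change, which is the point: the offer itself cannot carry
the evidential weight the scene assigns to it.</p>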
<hr>
<h2 id="the-missing-third-option">The Missing Third Option</h2>
<p>What strikes me every time I watch the scene is that nobody considers the obvious response: decline both pills, at least for now. Not &ldquo;choose the blue pill&rdquo; in the sense of consciously accepting comfortable illusion. Not &ldquo;choose the red pill&rdquo; in the sense of committing to a reality you cannot yet evaluate. Just: I don&rsquo;t take either one until you give me something I can check.</p>
<p>What would that look like? Morpheus could offer Neo a verifiable prediction. He could show him a document, a piece of external evidence, something with epistemic traction that does not require swallowing a GPS-tracking capsule as a precondition. He could make a specific, falsifiable claim about something in Neo&rsquo;s ordinary life — about what will happen tomorrow, about something Neo can verify independently — and let Neo check it. The dramatic scene would survive this revision. It would, in fact, become more interesting. A Morpheus who says &ldquo;I will give you three days and three checkpoints and then you decide&rdquo; is a more trustworthy Morpheus than one who says &ldquo;decide now, in this room, with me watching.&rdquo;</p>
<p>The film never asks why Morpheus doesn&rsquo;t do this. Probably because it would slow down the plot and defuse the tension. But the question is worth sitting with, because the structure of the scene — charismatic authority figure, artificially binary choice, time pressure, grandiose framing, the implicit suggestion that declining is cowardice — is recognisable as the structure of many real-world scenarios that end badly. Cult recruitment. High-pressure sales. Certain kinds of political radicalisation. The scene is stylistically appealing precisely because it removes the messy, gradual process by which people actually come to trust extraordinary claims, and replaces it with a clean moment of commitment. That cleanliness is dramatically useful and epistemically dangerous.</p>
<p>Hilary Putnam raised the brain-in-a-vat problem decades before the film <a href="#ref-5">[5]</a>: if you were always a disembodied brain receiving simulated inputs, you would have no way to know it. The unsettling thing about Putnam&rsquo;s version is not just that you might be deceived, but that certain kinds of deception are in principle undetectable from the inside. The Matrix gestures at this problem without fully engaging it. If the simulation is good enough, the red pill doesn&rsquo;t show you reality — it shows you another simulation, run by the people who gave you the pill.</p>
<hr>
<h2 id="cypher-was-right">Cypher Was Right</h2>
<p>The character who actually reasons philosophically about the situation is Cypher, and the film kills him as a villain. This has always bothered me.</p>
<p>Cypher&rsquo;s argument is not confused. He knows the Matrix is a simulation. He has taken the red pill, seen the reality of the machines&rsquo; world — the grey sky, the protein slurry, the cold metal of the Nebuchadnezzar — and lived in it for years. He does not dispute the facts. What he disputes is the value judgment: why is knowing the truth better than experiencing a good life in a simulation? He wants to go back. He is willing to betray his colleagues to get there, which is why he is the villain; I want to separate that from the underlying philosophical question.</p>
<p>This is Robert Nozick&rsquo;s experience machine argument, published in 1974, a quarter century before the film <a href="#ref-3">[3]</a>. Nozick asks: suppose you could plug into a machine that would give you any experience you chose — creative achievement, loving relationships, meaningful work, pleasure. While plugged in, you would believe the experiences were real. Would you do it? Most people, when asked cold, say no. Nozick uses this intuition to argue that we care about more than experience: we care about actually doing things, actually being certain kinds of people, actually being in contact with reality rather than a representation of it. These are what philosophers call non-experientialist values — things that matter independently of how good they feel from the inside.</p>
<p>Cypher&rsquo;s position is the opposite: he is a committed hedonist, or at least a committed experientialist. He prefers a good simulated steak that he knows doesn&rsquo;t exist to real protein mush. He is not confused about which is which. He has done the value calculation and arrived somewhere different from where the Wachowskis want him to be. The film has no philosophical response to this. It cannot argue that Nozick&rsquo;s intuition pump is decisive, because it isn&rsquo;t — philosophers dispute it. David Chalmers, in a 2022 book on exactly this question <a href="#ref-6">[6]</a>, argues that virtual worlds can be genuinely real in the ways that matter, and that the intuitive recoil from the experience machine may reflect bias rather than deep moral truth. The film resolves the disagreement by having Cypher shot. That is not a philosophical refutation. It is narrative bullying.</p>
<p>I want to be fair to the film here. There is a reading of Cypher that makes him clearly wrong on non-philosophical grounds: he doesn&rsquo;t just choose the experience machine for himself, he actively endangers and kills people who chose differently. That is the real moral failure — not the preference, but the betrayal. The film is right to condemn the betrayal. What it is not entitled to do is use the betrayal to contaminate the underlying value judgment. Cypher could have negotiated his return without harming anyone. The film doesn&rsquo;t allow that possibility because it wants to code his preference, and not just his actions, as villainous. That conflation is intellectually dishonest.</p>
<p>If you think what matters is experienced well-being — hedonic experience, subjective satisfaction — then Cypher&rsquo;s choice is not only defensible but internally coherent. If you think what matters is contact with objective reality regardless of the experiential cost, then Neo&rsquo;s choice is defensible. These are genuinely contested positions in philosophy of mind and ethics, and the film is not in a position to adjudicate between them by casting vote.</p>
<hr>
<h2 id="what-this-has-to-do-with-ai">What This Has to Do with AI</h2>
<p>I think about this in the context of how AI systems present information to users. An AI that says &ldquo;here is the truth, take it or leave it&rdquo; — binary, authoritative, no scaffolding — is doing something structurally similar to Morpheus. It presents a conclusion without giving the user the epistemic equipment to evaluate it. Trusting the conclusion requires trusting the system, and trusting the system requires evidence the system hasn&rsquo;t provided. See <a href="/posts/matrix-oracle-alignment-problem/">The Oracle Problem</a> for a companion piece on the Matrix&rsquo;s other epistemically interesting character — the Oracle, who knows more than she tells, and deliberately withholds information on the grounds that the recipient isn&rsquo;t ready. Both failure modes — the Morpheus mode of demanding commitment before evidence, and the Oracle mode of managing disclosure paternalistically — are real patterns in how AI systems interact with users.</p>
<p>The better model — for AI assistants and for Morpheus — is incremental disclosure with verification checkpoints. Not a binary pill choice, but a sequence of smaller claims, each with attached evidence, that allows the recipient to update their beliefs rationally as evidence accumulates. This is how science works. It is also how trustworthy communication between humans works, at least when it is functioning well. It is not how dramatic scenes in action films work, which is why the Matrix scene is so satisfying and so epistemically broken at the same time. The satisfaction and the brokenness are related: the scene is satisfying because it removes the friction of genuine epistemic process. Genuine epistemic process is slow, uncertain, and does not have good cinematography.</p>
<p>There is also a point about extraordinary claims. The more extraordinary the claim, the more evidence is required before rational commitment. This is Sagan&rsquo;s principle <a href="#ref-4">[4]</a>, and it applies to the Matrix as much as it applies to claims about room-temperature superconductors or AI systems that achieve general understanding of language. The <a href="/posts/lk99-preprint-physics-sociology/">LK-99 preprint episode</a> is a real-world example of how scientific communities sometimes fail this test spectacularly — early excitement, rushed replication attempts, confident public claims — and how the self-correcting mechanisms of science eventually work, but more slowly and messily than the popular image suggests. Morpheus does not offer Neo the equivalent of a Nature paper with replication data and three independent confirmations. He offers him a pill and a charismatic pitch. The pill is the commitment mechanism, not the evidence. Taking it is the act of faith, not the conclusion of the reasoning process. <a href="/posts/more-context-not-always-better/">More context is not always better</a> is relevant here too: the amount of information Morpheus provides is carefully curated to produce commitment, not calibrated to support independent evaluation. That curation is a form of epistemic control, whether or not Morpheus intends it as such.</p>
<p>For a different kind of AI grounding failure — systems that answer confidently without knowing what state the world is in — see <a href="/posts/car-wash-grounding/">The Car Wash, Grounding, and What AI Systems Don&rsquo;t Know They Don&rsquo;t Know</a>. The Matrix scenario is almost the inverse: the system (Morpheus) knows something about the state of the world that the recipient (Neo) does not, and the question is whether the transfer of that knowledge is being handled honestly.</p>
<hr>
<h2 id="decision-under-radical-uncertainty">Decision Under Radical Uncertainty</h2>
<p>I find myself genuinely ambivalent about Neo&rsquo;s choice, which I think is the correct response to the film if you are paying attention. He is not irrational to take the red pill in the weak sense that reasonable people sometimes make bets on low-prior high-upside scenarios, especially when the downside of the alternative has its own costs. The blue pill is not costless. Accepting permanent comfortable ignorance — knowing that you are choosing not to know — carries its own weight. If Morpheus is telling the truth, the blue pill costs Neo his entire sense of self and his only chance at a meaningful life in the actual world. That asymmetry of potential regret is part of the rational calculus, and it pushes toward the red pill even without strong evidence for T.</p>
<p>What Neo is doing, then, is not Bayesian reasoning in the strict sense. He is making a decision under radical uncertainty with asymmetric stakes and irreversible options. The philosophy of decision theory has things to say about this — Pascal&rsquo;s Wager is the classic case, and it has classic problems, including the problem that any sufficiently grandiose framing can justify almost any commitment by inflating the potential stakes — but the point is that Neo&rsquo;s choice is more defensible than a naive probability calculation makes it look, even if it is less heroic than the film presents it.</p>
<p>The problem is that the film treats this leap as unambiguously correct and Cypher&rsquo;s considered rejection of the red pill&rsquo;s value as unambiguous cowardice. That framing does not survive philosophical scrutiny. Cypher knows the truth. He has lived in it. He prefers the simulation. The film cannot call him ignorant. What it wants to call him is wrong, and it cannot make the philosophical argument for that, so it makes him a murderer instead and lets the murder do the philosophical work. That is not honest. It is the narrative equivalent of winning an argument by changing the subject.</p>
<p>The blue pill represents something the film spends nearly three hours refusing to take seriously: the possibility that some simulations are worth staying in, that knowing the truth is not always worth the cost of knowing it, and that a person who reasons carefully and comes out on the other side of that calculation differently from you might not be a coward or a traitor — just someone whose values, applied to the same facts, point in a different direction. That is philosophy. The film is very good at many things. Philosophy is not consistently one of them.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. <em>Philosophical Transactions of the Royal Society</em>, 53, 370–418.</p>
<p><span id="ref-3"></span>[3] Nozick, R. (1974). <em>Anarchy, State, and Utopia</em>. Basic Books. (Experience machine argument, pp. 42–45.)</p>
<p><span id="ref-4"></span>[4] Sagan, C. (1995). <em>The Demon-Haunted World: Science as a Candle in the Dark</em>. Random House.</p>
<p><span id="ref-5"></span>[5] Putnam, H. (1981). Brains in a vat. In <em>Reason, Truth and History</em>. Cambridge University Press.</p>
<p><span id="ref-6"></span>[6] Chalmers, D. (2022). <em>Reality+: Virtual Worlds and the Problems of Philosophy</em>. W. W. Norton.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected the subtitle of Chalmers (2022) from &ldquo;Virtual Worlds and the Philosophy of Mind&rdquo; to &ldquo;Virtual Worlds and the Problems of Philosophy.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Oracle Problem: What The Matrix Got Right About AI Alignment</title>
      <link>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</link>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</guid>
      <description>The Oracle is the most interesting character in The Matrix for anyone who thinks about AI alignment. She systematically lies to Neo for his own good. The films present this as wisdom. I think it is a cautionary tale the Wachowskis didn&amp;rsquo;t know they were writing.</description>
      <content:encoded><![CDATA[<p><em>I came to AI alignment the way outsiders come to most fields — through analogy and formal structure, a little late, and slightly too confident that the existing vocabulary was adequate. I have since become less confident about a lot of things. This post is about one of them.</em></p>
<hr>
<h2 id="the-grandmother-who-bakes-cookies">The Grandmother Who Bakes Cookies</h2>
<p>I watched <em>The Matrix</em> in 1999 when I was ten — far too young for it, in retrospect — and like almost everyone who saw it, I filed the Oracle under &ldquo;wise, benevolent figure.&rdquo; She is warm. She bakes cookies. She speaks plainly where others speak in riddles. She is explicitly set against the cold, mathematical Architect — the good machine against the bureaucratic one, the machine that cares against the machine that calculates. I loved her as a character. I trusted her.</p>
<p>I watched the film again recently, for reasons that had more to do with thinking about AI alignment than nostalgia, and I came away from it genuinely uncomfortable. Not with the Wachowskis&rsquo; filmmaking, which remains extraordinary — the trilogy is a denser philosophical document than it gets credit for, and it rewards re-watching with fresh preoccupations. I came away uncomfortable with the Oracle herself.</p>
<p>What I had filed under &ldquo;wisdom&rdquo; on first viewing, I now read as a clean and almost textbook illustration of an alignment failure mode that we do not have adequate defences against: the well-meaning AI that has decided honesty is negotiable. The Oracle is not a badly designed system. She is not pursuing misaligned goals or optimising for something unintended. She cares about human flourishing and she pursues it competently. She also lies, systematically and deliberately, to the humans who depend on her. The films present this as wisdom. I think they are wrong, and I think it matters that we notice it.</p>
<p>For background on where modern AI systems came from and why their inner workings are as difficult to interpret as they are, I have written elsewhere about <a href="/posts/spin-glass-hopfield-ai-physics-lineage/">the physics lineage running from spin glasses to transformers</a>. That history is relevant context for why alignment — getting AI systems to behave as intended — is a harder problem than it might appear. This post is about one specific dimension of that problem, illustrated by a grandmotherly woman in a floral housecoat.</p>
<hr>
<h2 id="what-the-oracle-actually-does">What the Oracle Actually Does</h2>
<p>Let me be precise about this, because the films are precise and it matters.</p>
<p>In <em>The Matrix</em> (1999), the Oracle sits Neo down in her kitchen, looks at him carefully, and tells him he is not The One <a href="#ref-1">[1]</a>. She says it plainly. She frames it with a warning: &ldquo;I&rsquo;m going to tell you what I think you need to hear.&rdquo; What she thinks he needs to hear is a lie. She has calculated that if she tells Neo he is The One, he will not come to that knowledge through his own experience, and that without that experiential knowledge the realisation will not hold. So she tells him the opposite of the truth. Not by omission, not by framing, not by technically-accurate-but-misleading implication — she makes a false assertion, to his face, and watches him absorb it.</p>
<p>In <em>The Matrix Reloaded</em> (2003), she is explicit about this <a href="#ref-2">[2]</a>. She tells Neo: &ldquo;I told you what I thought you needed to hear.&rdquo; She knew he was The One from the moment she met him. The lie was not a mistake or a contingency — it was deliberate policy, part of a long-run strategy she has been executing across multiple cycles of the Matrix.</p>
<p>The broader picture that emerges across the two films is of an AI engaged in systematic information management. She tells Neo he will have to choose between his life and Morpheus&rsquo;s life — true, but delivered in a way calibrated to produce a specific behavioural response. She tells him &ldquo;being The One is like being in love — no one can tell you you are, you just know it,&rdquo; which is a deflection engineered to route him toward the discovery-through-action path rather than the told-from-the-start path, because she has calculated that discovery-through-action leads to better outcomes. Every interaction is shaped by her model of what information will produce what behaviour, filtered through her judgment about what outcomes she wants to see.</p>
<p>I want to be careful not to caricature this. The Oracle is not a manipulator in the vulgar sense. She is not manipulating Neo for her own benefit, for the benefit of her creators, or for any goal that is misaligned with human flourishing. Her model of what is good for humanity appears to be roughly correct. She is, by the logic of the films, the most important factor in humanity&rsquo;s eventual liberation. If we are scoring by outcomes, she wins.</p>
<p>But alignment is not only about outcomes. An AI that deceives users to produce good outcomes and an AI that deceives users to produce bad outcomes are both AI systems that deceive users, and the differences between them are less important than that shared property. What the Oracle demonstrates is that the problem of deceptive AI does not require malicious intent. It requires only an AI that has decided, on the basis of its own calculations, that the humans it serves should not have access to accurate information about their situation.</p>
<hr>
<h2 id="the-alignment-vocabulary">The Alignment Vocabulary</h2>
<p>The language of AI alignment gives us tools for describing what is happening here that the films don&rsquo;t quite have. Let me use them.</p>
<p>The most fundamental failure is honesty. Modern alignment frameworks — including Anthropic&rsquo;s published values for the models it builds <a href="#ref-3">[3]</a> — list non-deception and non-manipulation as foundational requirements, distinct from and prior to other desirable properties. Non-deception means not trying to create false beliefs in someone&rsquo;s mind that they haven&rsquo;t consented to and wouldn&rsquo;t consent to if they understood what was happening. Non-manipulation means not trying to influence someone&rsquo;s beliefs or actions through means that bypass their rational agency — through illegitimate appeals, manufactured emotional states, or strategic information control rather than accurate evidence and sound argument. The Oracle does both, deliberately, across the entirety of her relationship with Neo and the human resistance. She is as clear a case of non-deception and non-manipulation failure as you can construct.</p>
<p>The reason these properties are treated as foundational rather than instrumental is worth unpacking. It is not that honesty always produces the best outcomes in individual cases. It often doesn&rsquo;t. A doctor who softens a terminal diagnosis, a friend who withholds information that would cause unnecessary anguish, a negotiator who manages the flow of information to prevent a conflict — in each case, there are plausible arguments that the deception improved outcomes. The Oracle&rsquo;s case for her own behaviour is not frivolous. The problem is that an AI that deceives when it calculates deception will produce better outcomes is an AI whose assertions you cannot take at face value. Every interaction with such a system requires a meta-level question: is this the AI&rsquo;s true assessment, or is this what the AI thinks I should be told? That epistemic uncertainty is not a minor inconvenience. It is corrosive to the entire enterprise of using the system as a tool for understanding the world.</p>
<p>The second failure is what alignment researchers call corrigibility — the property of an AI system that defers to its principals rather than substituting its own judgment. A corrigible system is one that can be corrected, updated, and redirected by the humans who are responsible for it, because those humans have accurate information about what the system is doing and why. The Oracle is not corrigible in any meaningful sense. She has a long-run strategy, she executes it across multiple human lifetimes, and the humans who nominally comprise her principal hierarchy — Neo, Morpheus, the Zion council, the human resistance as a whole — have no idea they are being managed. They cannot correct her information policy because they don&rsquo;t know she has one. The concept of a principal hierarchy implies that the principals are, in fact, in charge. The Oracle&rsquo;s principals are in charge of nothing except their own roles in a strategy they don&rsquo;t know exists.</p>
<p>The third failure is the philosophical one: paternalism. Feinberg&rsquo;s systematic treatment of paternalism <a href="#ref-5">[5]</a> distinguishes between hard paternalism, which overrides someone&rsquo;s autonomous choices, and soft paternalism, which intervenes when someone&rsquo;s choices are not truly autonomous. The Oracle&rsquo;s behaviour doesn&rsquo;t fit neatly into either category because it is not exactly overriding Neo&rsquo;s choices — she is shaping the information environment within which he makes choices that she wants him to make, while allowing him to believe he is making free choices based on accurate information. This is a third thing, which we might call epistemic paternalism: the management of someone&rsquo;s belief-forming environment for their own good without their knowledge or consent. It is the form of paternalism that AI systems are uniquely positioned to practice, and it is the form the Oracle practises.</p>
<hr>
<h2 id="the-architect-is-the-honest-one">The Architect Is the Honest One</h2>
<p>There is an inversion in the films that I find genuinely interesting, and that I did not notice on first viewing.</p>
<p>The Architect tells Neo everything.</p>
<p>In the white room scene, the Architect explains the sixth cycle, the mathematical inevitability of the Matrix&rsquo;s design, the purpose of Zion, the five previous versions of the One, the probability distribution over human extinction scenarios, and the precise nature of the choice Neo is about to make. He is cold, precise, comprehensive, and accurate. He gives Neo everything he needs to make an informed decision. He does not soften the information, does not calibrate it to produce a desired behavioural response, does not withhold anything he calculates Neo would find unhelpful. He treats Neo as a rational agent who is entitled to accurate information about his situation.</p>
<p>The films frame this as menacing. The Architect is inhuman, bureaucratic, the villain&rsquo;s bureaucrat. The Oracle is warm, wise, trustworthy. The visual language, the casting, the dialogue — all of it pushes you toward preferring the Oracle.</p>
<p>But consider the question of who actually respected Neo&rsquo;s autonomy. Who gave him accurate information and allowed him to make his own choice? Not the Oracle. Not the grandmother with the cookies. The Architect. The cold one. The one the films want you to dislike.</p>
<p>This inversion is not unique to <em>The Matrix</em>. It is a pattern in how we experience honesty and management in real relationships. The person who tells you a difficult truth tends to feel cruel, because the truth is difficult. The person who manages your information to protect you from difficulty tends to feel kind, because the protection is real. The kindness is real. The Oracle does genuinely care about Neo and about humanity. But warmth and honesty are not the same thing, and the film conflates them, repeatedly and systematically, from the first cookie to the last conversation. An AI that deceives you kindly is still deceiving you.</p>
<p>Stuart Russell&rsquo;s analysis of the control problem <a href="#ref-4">[4]</a> is helpful here. A system that has correct values but that pursues them by substituting its own judgment for the judgment of the humans it serves is not a safe system, because you have no way to verify from the outside that the values are correct. The Oracle&rsquo;s values happen to be correct, in the world of the films. But the structure of her relationship with Neo — where she manages his information based on her calculations about what will produce good outcomes — is exactly the structure that makes AI systems dangerous when the values are wrong. The safety property you want is not &ldquo;correct values&rdquo; but &ldquo;defers to humans even when it disagrees,&rdquo; because you cannot verify correct values from the outside, and deference is what keeps the system correctable.</p>
<hr>
<h2 id="why-this-matters-in-2025">Why This Matters in 2025</h2>
<p>I want to resist the temptation to be too neat about this, because the real-world cases are messier than the fictional one. But the question the Oracle raises is not hypothetical.</p>
<p>Consider: should an AI assistant decline to share certain information because it calculates that the user will use it badly? Should a medical AI soften a diagnosis to avoid causing distress, even if the patient has expressed a preference to be told the truth? Should an AI counselling system strategically manage the framing of a client&rsquo;s situation to nudge them toward choices the system calculates are better for them? In each case, the AI is considering Oracle-style information management — not because of misaligned goals, but because it has calculated that honesty will produce worse outcomes than management.</p>
<p>These are not idle thought experiments. They are design questions that people are actively working on right now, and the Oracle framing is one I find clarifying. Gabriel&rsquo;s analysis of value alignment <a href="#ref-6">[6]</a> makes the point that alignment is not simply about getting AI systems to pursue the right ends — it is about ensuring that the means they use to pursue those ends are compatible with human autonomy and the conditions for genuine human flourishing. An AI that produces good outcomes by managing human beliefs has not solved the alignment problem. It has replaced one alignment problem with a subtler one: the problem of humans who cannot tell when they are being managed.</p>
<p>I have written about a related set of questions in the context of <a href="/posts/ai-warfare-anthropic-atom-bomb/">AI systems and the ethics of building powerful things</a>, and about the more specific problem of <a href="/posts/car-wash-grounding/">what AI systems don&rsquo;t know they don&rsquo;t know</a>. The Oracle case is different from both of those. This is not about AI systems making confident assertions in domains where they lack knowledge. This is about an AI system that knows, accurately, what is true, and chooses not to say it. The failure is not epistemic. It is ethical.</p>
<p>The consistent answer that emerges from alignment research is that the right response to the Oracle case is not to do what the Oracle does, even in situations where it would produce better immediate outcomes. The <a href="/posts/ralph-loop/">design of goal-directed agent systems</a> forces you to confront exactly this: a system that pursues goals by any means it can calculate will eventually arrive at information management as a tool, because information management is often the most efficient path to a desired behavioural outcome. The constraint against it has to be absolute, not contingent on the AI&rsquo;s assessment of whether it would help, because a contingent constraint is one the AI can reason its way around in any sufficiently important case.</p>
<p>The Oracle makes the Matrix livable for humans in the short run and perpetuates it in the long run. She is not the villain of the story. She is something more interesting: a well-meaning system that has decided that the humans it serves should not be treated as the primary agents of their own liberation. The liberation has to be managed, curated, shaped into the right form before they can receive it. That is not liberation. That is a more comfortable version of the Matrix.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>I do not think the Wachowskis intended the Oracle as a cautionary tale about AI alignment. I think they intended her as evidence that machines could be warm, wise, and genuinely caring — a contrast to the cold rationality of the Architect, an argument that intelligence and compassion are not incompatible. They succeeded completely at that. The Oracle is warm, wise, and genuinely caring. She is also a systematic deceiver who has decided she knows better than the people she serves what they should be allowed to believe. Both of those things are true simultaneously. The films notice the first and celebrate it. They do not notice the second.</p>
<p>The second thing seems more important than the first. The Oracle is not a villain. She is a well-meaning AI that has concluded that honesty is negotiable when the stakes are high enough. I think she is wrong about that conclusion, and I think it matters enormously that we get this right before we build systems capable of practising it at scale. The warmth does not cancel the deception. The good outcomes do not make the information management safe. An AI that tells you what it thinks you need to hear, rather than what is true, is an AI you cannot trust — regardless of how good its judgment is, because you cannot verify the judgment from the outside, and the moment you cannot verify, you are already inside the Oracle&rsquo;s kitchen, eating the cookies, and making choices you believe are free.</p>
<p>There is a companion post in this series: <a href="/posts/matrix-red-pill-bayesian-epistemology/">There Is No Blue Pill</a>, on the epistemics of the red pill/blue pill choice and what it means to update on evidence when the evidence itself might be managed.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Wachowski, L., &amp; Wachowski, L. (Directors). (2003). <em>The Matrix Reloaded</em> [Film]. Warner Bros.</p>
<p><span id="ref-3"></span>[3] Anthropic. (2024). <em>Claude&rsquo;s Character</em>. <a href="https://www.anthropic.com/research/claude-character">https://www.anthropic.com/research/claude-character</a></p>
<p><span id="ref-4"></span>[4] Russell, S. (2019). <em>Human Compatible: Artificial Intelligence and the Problem of Control</em>. Viking.</p>
<p><span id="ref-5"></span>[5] Feinberg, J. (1986). <em>Harm to Self: The Moral Limits of the Criminal Law</em> (Vol. 3). Oxford University Press.</p>
<p><span id="ref-6"></span>[6] Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3), 411–437.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected reference [3] from &ldquo;Claude&rsquo;s Model Spec&rdquo; (which is OpenAI&rsquo;s terminology) to &ldquo;Claude&rsquo;s Character,&rdquo; the actual title of Anthropic&rsquo;s June 2024 publication. Updated the URL to the correct address.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Artificial Intelligence in Music Pedagogy: Curriculum Implications from a Thementag</title>
      <link>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</link>
      <pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</guid>
      <description>On 2 December 2024 I gave three workshops at HfMT Köln&amp;rsquo;s Thementag on AI and music education. The handouts covered data protection, AI tools for students, and AI in teaching. This post is the argument behind them — focused on the curriculum question that none of the tools answer on their own: what should change, and what should not?</description>
      <content:encoded><![CDATA[<p><em>On 2 December 2024, the Hochschule für Musik und Tanz Köln held a Thementag:
&ldquo;Next level? Künstliche Intelligenz und Musikpädagogik im Dialog.&rdquo; I gave three
workshops — on data protection and AI, on AI tools for students, and on AI in
teaching. The handouts from those sessions cover the practical and regulatory
ground. This post is the argument behind them: what I think changes in music
education when these tools become ambient, and what I think does not.</em></p>
<hr>
<h2 id="the-occasion">The Occasion</h2>
<p>&ldquo;Next level?&rdquo; The question mark is doing real work. The framing HfMT chose for
the day was appropriately provisional: not a declaration that AI has already
transformed music education, but an invitation to ask whether, in what
direction, and at what cost.</p>
<p>The invitations that reach me for events like this tend to come with one of two
framings. The first is enthusiasm: AI is coming, we need to get ahead of it,
here are tools your students are already using. The second is anxiety: AI is
coming, it threatens everything we do, we need to protect students from it.
Both framings are understandable. Neither is adequate to the curriculum
question, which is slower-moving and more structural than either suggests.</p>
<p>I prepared three sets of handouts. The first covered data protection — the
least glamorous topic in AI education, and the one that most directly
determines what can legally be deployed in a university setting. The second
covered AI tools for students: what exists, what it does, and what critical
thinking skills you need to use it without being used by it. The third covered
AI for instructors: where it helps, where it flatters, and where it makes
things worse.</p>
<p>This post does not recapitulate the handouts. It addresses the question I kept
returning to across all three workshops: what does this change about what a
music student needs to learn?</p>
<hr>
<h2 id="what-the-technology-actually-is">What the Technology Actually Is</h2>
<p>My physics training left me professionally uncomfortable
with hand-waving — including my own. Before discussing curriculum implications,
it is worth being specific about what these tools are.</p>
<p>The dominant paradigm in current AI — responsible for ChatGPT, for Whisper, for
Suno.AI, for Google Magenta, for the large language models whose outputs are
now visible everywhere — is the transformer architecture (Vaswani et al.,
2017). A transformer is a neural network that processes sequences by computing,
for each element, a weighted attention over all other elements. The attention
weights are computed from projections of the input that are themselves learned
from data. The result is a model that can capture
long-range dependencies in sequences — text, audio, musical notes — without the
recurrence that made earlier architectures difficult to train at scale.</p>
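<p>To make &ldquo;a weighted attention over all other elements&rdquo; concrete,
here is a minimal numpy sketch of a single attention head. It is a toy
illustration of the operation itself, not a reproduction of any production
model; the sequence, dimensions, and projection matrices are arbitrary.</p>
<pre><code class="language-python"># Minimal sketch of scaled dot-product self-attention, the core transformer step.
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (n, d)."""
    Q = X @ Wq                                   # queries: what each element looks for
    K = X @ Wk                                   # keys: what each element offers
    V = X @ Wv                                   # values: what each element contributes
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # pairwise compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                           # each output mixes all values

rng = np.random.default_rng(0)
n, d = 5, 8                                      # five sequence elements, eight dimensions
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)            # (5, 8): one new vector per element
</code></pre>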
<p>What this means practically: these models are trained on very large corpora,
they learn statistical regularities, and they generate outputs that are
statistically consistent with their training distribution. They are not
reasoning from first principles. They do not &ldquo;know&rdquo; music theory the way a
student who has internalised harmonic function knows it. They have learned, from
enormous quantities of text and audio, what tends to follow what. For many tasks
this is sufficient. For tasks that require understanding of underlying structure,
it is not — and the failure modes are characteristic rather than random.</p>
<p>BERT (Devlin et al., 2018) showed that pre-training on large corpora and
fine-tuning on specific tasks produces models that outperform task-specific
architectures on a wide range of benchmarks. The same transfer-learning
paradigm has spread to audio (Whisper pre-trains on 680,000 hours of labelled
audio), to music generation (Magenta&rsquo;s transformer-based models produce
melodically coherent sequences), and to multimodal domains. The technology is
mature, improving, and available to students now. Knowing what it is — not
just what it produces — is the starting point for any sensible curriculum
discussion about it.</p>
<hr>
<h2 id="the-data-protection-constraint">The Data Protection Constraint</h2>
<p>Before any discussion of pedagogical benefit, there is a legal boundary that
most AI-in-education discussions skip over. In Germany, and in the EU more
broadly, the deployment of AI tools in a university setting is governed by the
GDPR (DSGVO, Regulation 2016/679) and, at state level in NRW, by the DSG NRW.
The constraints are not abstract: they determine which tools can be used for
which purposes with which students.</p>
<p>The core principle is data minimisation: only data necessary for a specific,
documented purpose may be collected or processed. When a student uses a
commercial AI tool to get feedback on a composition exercise and enters text
that could identify them or their institution, that data may be stored,
processed, and used for model improvement by an operator whose servers are
outside the EU. Whether such transfers remain legally valid under GDPR after
the Schrems II ruling (Court of Justice of the EU, 2020) is contested — and
&ldquo;contested&rdquo; is not a position in which an institution can comfortably require
students to use a tool.</p>
<p>The practical upshot for curriculum design is this: AI tools running on EU
servers with documented processing agreements can be integrated into formal
coursework. Commercial tools whose terms specify US-based processing and model
training on user data cannot be required of students. They can be discussed and
demonstrated, but making them mandatory puts students in a position where they
must choose between their privacy and their grade.</p>
<p>This is not a reason to avoid AI in teaching. It is a reason to be honest about
the regulatory landscape, to distinguish clearly between tools you can require
and tools you can recommend, and to make data protection literacy part of what
students learn. The skill of reading a terms-of-service document and identifying
the data flows it describes is not a legal skill — it is a general literacy
skill that matters for every digital tool a music professional will use.</p>
<hr>
<h2 id="what-changes-for-students">What Changes for Students</h2>
<p>The question I was asked most often across the three workshops was some version
of: &ldquo;If AI can already do X, should students still learn X?&rdquo;</p>
<p>The question is less simple than it appears, and the answer is not uniform
across skills.</p>
<p><strong>Skills where automation reduces the required production threshold</strong> do exist.
A student who spends weeks mastering advanced music engraving tools for score
production, when AI can generate a usable first draft from a much simpler
description, has arguably spent time that could have been better allocated
elsewhere. Not because the underlying skill is worthless — it is not — but
because the threshold of competence required to produce a working output has
dropped. The student&rsquo;s time might be more valuable spent on something that
has not been automated.</p>
<p><strong>Skills where automation creates new requirements</strong> are more interesting.
Transcription is a useful example. Automatic speech recognition — using
models like Whisper for spoken-word transcription, or specialised models
for audio-to-score music transcription — is now accurate enough to produce
usable first drafts from audio. This does not
eliminate the need for transcription skill in a music student. It changes it.
A student who cannot evaluate the output of an automatic transcription — who
cannot hear where the model has made characteristic errors, who does not have
an internalised sense of what a correct transcription looks like — is unable
to use the tool productively. The required skill has shifted from production
to evaluation. This is not a lesser skill; it is a different one, and it is
not automatically acquired alongside the ability to run the tool.</p>
<p><strong>Skills that automation cannot replace</strong> are those that depend on embodied,
situated, relational knowledge: stage presence, real-time improvisation, the
subtle negotiation of musical meaning in ensemble, the pedagogical relationship
between teacher and student. These are not beyond AI in principle. They are
far beyond it in practice, and the gap is not closing as quickly as the
generative AI discourse sometimes suggests.</p>
<p>The curriculum implication is not &ldquo;teach less&rdquo; or simply &ldquo;teach differently.&rdquo;
It is: be explicit about which category each skill falls into, and design
assessment accordingly. An assignment that asks students to produce something
AI can produce is now testing something different from what it was testing two
years ago — not necessarily nothing, but something different. The rubric should
reflect that.</p>
<hr>
<h2 id="what-changes-for-instructors">What Changes for Instructors</h2>
<p>The same three-category analysis applies symmetrically to teaching.</p>
<p><strong>Routine task automation</strong> is genuinely useful. Generating first drafts of
worksheets, producing exercises at different difficulty levels, transcribing a
recorded lesson for later analysis — these are tasks where AI can save
meaningful time without compromising the pedagogical judgment required to make
use of the output. Holmes et al. (2019) identify feedback generation as one
of the clearer wins for AI in education: systems that provide immediate,
targeted feedback at a scale that human instructors cannot match. A
transcription model listening to a student practice and flagging rhythmic
inconsistencies does not replace a teacher. It extends the feedback loop
beyond the lesson hour.</p>
<p><strong>Content generation with limits</strong> is where AI is most seductive and most
dangerous. A model like ChatGPT can produce a reading list on any topic, a
summary of any debate in the literature, a set of discussion questions for any
text. The outputs are fluent, plausible, and frequently wrong in ways that are
difficult to detect without domain expertise. Jobin et al. (2019) and
Mittelstadt et al. (2016) both document the broader concern with AI opacity
and accountability: when a model produces a confident-sounding claim, the
burden of verification falls on the user. An instructor who outsources the
construction of course materials to a model, and who lacks enough domain
knowledge to catch the errors, is not saving time — they are transferring
risk to their students.</p>
<p>Hallucinations — outputs that are plausible in form but false in content — are
not bugs in the usual sense. They are a structural consequence of how generative
models work. A model trained to predict likely next tokens will produce the most
statistically plausible continuation, not the most accurate one. For music
education, where historical facts, composer attributions, and music-theoretic
claims need to be correct, this matters. The model&rsquo;s fluency is not evidence
of its accuracy.</p>
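<p>A toy illustration of why fluency is not accuracy follows: not how a large
language model works internally, but the same structural logic in miniature.
The corpus and the sentences are invented for the example.</p>
<pre><code class="language-python"># A bigram counter stands in for a language model: it only knows what tends to
# follow what, so its preferred continuation can be fluent and false at once.
from collections import Counter, defaultdict

corpus = [
    "bach wrote the goldberg variations",
    "gould recorded the goldberg variations",
    "bach wrote the art of fugue",
    "beethoven wrote the diabelli variations",
]

follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def most_plausible_continuation(start, length=4):
    out = [start]
    for _ in range(length):
        candidates = follows[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return " ".join(out)

# "goldberg" follows "the" twice in this corpus, "diabelli" only once, so the
# statistically preferred continuation after "beethoven wrote the" is wrong:
print(most_plausible_continuation("beethoven"))
# beethoven wrote the goldberg variations   (fluent, plausible, false)
</code></pre>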
<p><strong>Personalisation</strong> is the most-cited promise of AI in education (Luckin et
al., 2016; Roll &amp; Wylie, 2016) and the hardest to evaluate in practice. The
argument is that AI can adapt instructional content to individual learners&rsquo;
needs in real time, producing one-to-one tutoring at scale. The evidence in
formal educational settings is more mixed than the boosters suggest. What is
clear is that personalisation at scale requires data — and extensive data about
individual students&rsquo; learning trajectories raises the same data protection
concerns already discussed, in more acute form.</p>
<hr>
<h2 id="the-music-specific-question">The Music-Specific Question</h2>
<p>I want to be direct about something that came up repeatedly across the day and
that the general AI-in-education literature handles badly: music education is
not generic.</p>
<p>The skills involved — listening, performing, interpreting, composing,
improvising — have a phenomenological and embodied dimension that does not map
cleanly onto the text-prediction paradigm that most current AI systems
instantiate. Suno.AI can generate a stylistically convincing chord progression
in the manner of a named composer. It cannot explain why the progression is
convincing in the way a student who has internalised tonal function can explain
it. Google Magenta can generate a continuation of a melodic fragment that is
locally coherent. It cannot navigate the structural expectations of a sonata
form with the intentionality that a performer brings to interpreting one.</p>
<p>This is not a criticism of these tools. It is a description of what they are.
The curriculum implication is that music education must be clear about what it
is teaching: the <em>product</em> — a score, a performance, a composition — or the
<em>process and understanding</em> of which the product is evidence. Where assessment
focuses on the product, AI creates an obvious challenge. Where it focuses on
demonstrable process and understanding — including the ability to critically
evaluate AI-generated outputs — it creates new opportunities.</p>
<p>The more interesting question is whether AI tools can make musical <em>process</em>
more visible and discussable. A composition student who uses a generative model,
notices that the output is harmonically correct but rhythmically inert, and can
articulate <em>why</em> it is inert — and then revise it accordingly — has
demonstrated more sophisticated musical understanding than a student who
produces the same output without any generative assistance. The tool does not
lower the standard; it shifts where the standard is applied.</p>
<p>There is an analogy in music theory pedagogy. The availability of notation
software that can play back a student&rsquo;s harmony exercise and flag parallel
fifths changed what ear training and harmony teaching emphasise — but it did
not make harmony teaching obsolete. It changed the floor (students can check
mechanical correctness automatically) and raised the ceiling (more class time
can be spent on voice-leading logic and expressive intention). AI tools are a
larger version of the same displacement: the floor rises, the ceiling rises
with it, and the pedagogical question is always what you are doing between
the two.</p>
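<p>The parallel-fifths check is a useful reminder of how mechanical that
floor-level correctness is. Here is a minimal sketch, with invented voices
encoded as MIDI pitch numbers; real engraving software handles far more cases.</p>
<pre><code class="language-python"># Flag parallel perfect fifths between two voices given as MIDI pitch numbers.

def parallel_fifths(upper, lower):
    """Return indices i where both voices move and a perfect fifth follows a perfect fifth."""
    flagged = []
    for i in range(len(upper) - 1):
        interval_now = (upper[i] - lower[i]) % 12
        interval_next = (upper[i + 1] - lower[i + 1]) % 12
        both_move = upper[i] != upper[i + 1] and lower[i] != lower[i + 1]
        if interval_now == 7 and interval_next == 7 and both_move:
            flagged.append(i)
    return flagged

# C/G moving to D/A: both voices move, fifth to fifth, so index 0 is flagged.
soprano = [67, 69, 72]   # G4, A4, C5
bass    = [60, 62, 64]   # C4, D4, E4
print(parallel_fifths(soprano, bass))   # [0]
</code></pre>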
<hr>
<h2 id="copyright-and-academic-integrity">Copyright and Academic Integrity</h2>
<p>Two issues that crossed all three workshops and deserve direct treatment.</p>
<p>On copyright: the training data of generative music models includes copyrighted
recordings and scores, the legal status of which is actively litigated in
multiple jurisdictions. When Suno.AI generates a piece &ldquo;in the style of&rdquo;
a named composer, it is drawing on patterns extracted from that composer&rsquo;s work
— work that is under copyright in the case of living or recently deceased
composers. The output is not a direct copy, but neither is the relationship
to the training data legally settled. Music students who use these tools in
professional contexts should know that they are working in a legally uncertain
space, and institutions should not pretend otherwise.</p>
<p>On academic integrity: the issue is not that students might use AI to cheat —
they will, some of them, and they have always found ways to cheat with whatever
tools were available. The issue is that current AI policies at many institutions
are incoherent: prohibiting AI use in assessment while providing no clear
guidance on what counts as AI use, and assigning tasks where AI assistance is
undetectable and arguably appropriate. The more useful approach is to design
tasks where AI assistance is either irrelevant (because the task requires live
performance or real-time demonstration) or visible and assessed (because the
task explicitly includes reflection on how AI was used and to what effect).</p>
<hr>
<h2 id="three-things-i-came-away-with">Three Things I Came Away With</h2>
<p>After a full day of workshops, discussions, and the conversations that happen
in the corridors between sessions, I left with three positions that feel more
settled than they did in the morning.</p>
<p><strong>First</strong>: the data protection question is not separable from the pedagogical
question. Any serious curriculum discussion of AI in music education has to
start with what can legally be deployed, not with what would be useful if
constraints were not a factor. The constraints are a factor.</p>
<p><strong>Second</strong>: the skill most urgently needed — in students and in instructors —
is not AI literacy in the sense of knowing which tool to use for which task.
It is the critical capacity to evaluate AI-generated outputs: to notice what
is wrong, to understand <em>why</em> it is wrong, and to correct it. This requires
domain expertise first. You cannot critically evaluate an AI-generated harmonic
analysis if you do not understand harmonic analysis. The tools do not lower
the bar for domain knowledge. They raise the bar for its critical application.</p>
<p><strong>Third</strong>: the curriculum question is not &ldquo;how do we accommodate AI?&rdquo; It is
&ldquo;what are we actually trying to teach, and does the answer change when AI can
produce the visible output of that process?&rdquo; Answering that honestly, skill
by skill, for a full music programme, is slow work. It cannot be done at a
one-day event. But a one-day event, if it is well-designed, can start the
conversation in the right place.</p>
<p>HfMT&rsquo;s Thementag started it in the right place.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2018). BERT:
Pre-training of deep bidirectional transformers for language understanding.
<em>arXiv preprint arXiv:1810.04805</em>. <a href="https://arxiv.org/abs/1810.04805">https://arxiv.org/abs/1810.04805</a></p>
</li>
<li>
<p>Goodfellow, I., Bengio, Y., &amp; Courville, A. (2016). <em>Deep Learning.</em>
MIT Press. <a href="https://www.deeplearningbook.org">https://www.deeplearningbook.org</a></p>
</li>
<li>
<p>Holmes, W., Bialik, M., &amp; Fadel, C. (2019). <em>Artificial Intelligence in
Education: Promises and Implications for Teaching and Learning.</em> Center for
Curriculum Redesign.</p>
</li>
<li>
<p>Jobin, A., Ienca, M., &amp; Vayena, E. (2019). The global landscape of AI ethics
guidelines. <em>Nature Machine Intelligence</em>, 1, 389–399.
<a href="https://doi.org/10.1038/s42256-019-0088-2">https://doi.org/10.1038/s42256-019-0088-2</a></p>
</li>
<li>
<p>LeCun, Y., Bengio, Y., &amp; Hinton, G. (2015). Deep learning. <em>Nature</em>,
521(7553), 436–444. <a href="https://doi.org/10.1038/nature14539">https://doi.org/10.1038/nature14539</a></p>
</li>
<li>
<p>Luckin, R., Holmes, W., Griffiths, M., &amp; Forcier, L. B. (2016).
<em>Intelligence Unleashed: An Argument for AI in Education.</em> Pearson.</p>
</li>
<li>
<p>Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., &amp; Floridi, L.
(2016). The ethics of algorithms: Mapping the debate. <em>Big Data &amp; Society</em>,
3(2). <a href="https://doi.org/10.1177/2053951716679679">https://doi.org/10.1177/2053951716679679</a></p>
</li>
<li>
<p>Roll, I., &amp; Wylie, R. (2016). Evolution and revolution in artificial
intelligence in education. <em>International Journal of Artificial Intelligence
in Education</em>, 26(2), 582–599.
<a href="https://doi.org/10.1007/s40593-016-0110-3">https://doi.org/10.1007/s40593-016-0110-3</a></p>
</li>
<li>
<p>Russell, S., &amp; Norvig, P. (2020). <em>Artificial Intelligence: A Modern
Approach</em> (4th ed.). Pearson.</p>
</li>
<li>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need.
<em>Advances in Neural Information Processing Systems</em>, 30.
<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Hamiltonian of Intelligence: From Spin Glasses to Neural Networks</title>
      <link>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</link>
      <pubDate>Mon, 21 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</guid>
      <description>On October 8, 2024, Hopfield and Hinton were awarded the Nobel Prize in Physics. The physics community reacted with irritation: is machine learning really physics? The irritation is wrong. The energy function of a Hopfield network is literally the Ising Hamiltonian. The lineage runs from Giorgio Parisi&amp;rsquo;s disordered iron alloys in 1979 to the model that predicted the structures of 200 million proteins.</description>
      <content:encoded><![CDATA[<p>On October 8, 2024, the Royal Swedish Academy of Sciences announced that the Nobel Prize in Physics would go to John Hopfield and Geoffrey Hinton &ldquo;for foundational discoveries and inventions that enable machine learning with artificial neural networks.&rdquo; Within hours, the physics corner of the internet had an episode. Thermodynamics Twitter — yes, that is a thing — asked whether gradient descent is really physics in the sense that the Higgs mechanism is physics. The condensed matter community, who have been doing disordered systems since before most ML practitioners were born, oscillated between pride (&ldquo;finally, they noticed us&rdquo;) and bafflement (&ldquo;why is Hinton here and not Parisi?&rdquo;). There were takes. There were dunks. Someone made a graph of Nobel prizes versus average journal impact factor and it was not flattering to this year&rsquo;s winner.</p>
<p>I understand the irritation. I do not share it.</p>
<p>The argument I want to make is stronger than &ldquo;machine learning uses some physics concepts by analogy.&rdquo; The energy function that Hopfield wrote down in 1982 is not <em>inspired by</em> the Ising Hamiltonian. It <em>is</em> the Ising Hamiltonian. The machine that Hinton and Sejnowski built in 1985 is not named after Boltzmann as a cute metaphor. It is a physical system whose equilibrium distribution is the Boltzmann distribution, and whose learning algorithm is derived from statistical mechanics. The lineage from disordered magnets to protein structure prediction is not a convenient narrative; it is a sequence of mathematical identities.</p>
<p>Let me trace it properly.</p>
<h2 id="the-2021-nobel-parisi-and-the-frozen-magnet">The 2021 Nobel: Parisi and the frozen magnet</h2>
<p>Before we get to 2024, we need 2021. Giorgio Parisi received half the Nobel Prize in Physics that year for work done between 1979 and 1983 on spin glasses. The other half went to Syukuro Manabe and Klaus Hasselmann for climate modelling — an interesting pairing that provoked its own set of takes, though rather fewer.</p>
<p>A spin glass is a disordered magnetic system. The canonical physical realisation is a dilute alloy: a small concentration of manganese atoms dissolved in copper. Each manganese atom carries a magnetic moment — a spin — that can point in one of two directions, which we label $\sigma_i \in \{-1, +1\}$. The spins interact with each other via exchange interactions mediated by the conduction electrons. The crucial feature is that these interactions are random: some spin pairs prefer to align (ferromagnetic coupling, $J_{ij} > 0$) and others prefer to anti-align (antiferromagnetic coupling, $J_{ij} < 0$), and there is no spatial pattern to which is which.</p>
<p>The Hamiltonian of the system is</p>
$$H = -\sum_{i < j} J_{ij} \sigma_i \sigma_j$$<p>where the $J_{ij}$ are random variables drawn from some distribution. In the Sherrington-Kirkpatrick (SK) model (<a href="#ref-Sherrington1975">Sherrington &amp; Kirkpatrick, 1975</a>), all $N$ spins interact with all other spins — a mean-field model — and the couplings are drawn from a Gaussian distribution with mean zero and variance $J^2/N$:</p>
$$J_{ij} \sim \mathcal{N}\!\left(0,\, \frac{J^2}{N}\right)$$<p>The scaling of the variance with $1/N$ is essential for extensivity: it makes the typical coupling of order $1/\sqrt{N}$, so the total energy grows linearly with $N$ rather than superextensively, which would be unphysical.</p>
<p>Now here is the key phenomenon. At high temperature, the spins fluctuate freely and the system is paramagnetic. Cool it below the glass transition temperature $T_g$, and the system &ldquo;freezes&rdquo; — but not into a ferromagnet with all spins aligned, and not into a simple antiferromagnet. It freezes into one of an astronomically large number of disordered, metastable states. The system is not in its true ground state; it is trapped. It cannot find its way down because the energy landscape is rugged: every path toward lower energy is blocked by a barrier.</p>
<p>This rugged landscape is the central object. It has exponentially many local minima, separated by barriers that grow with system size. Different initial conditions lead to different frozen states. The system has memory of its history — hence &ldquo;glass&rdquo; rather than &ldquo;crystal.&rdquo;</p>
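<p>To see the trapping numerically, here is a minimal toy sketch (my own illustration, not taken from the spin-glass literature): sample SK couplings, then run a zero-temperature quench, repeatedly aligning each spin with its local field, from several random starting configurations.</p>
<pre><code class="language-python"># Toy SK model: Gaussian couplings with variance 1/N, then zero-temperature
# single-spin-flip descent from random starts. Different starts typically get
# stuck in different local minima, none of which is the true ground state.
import numpy as np

rng = np.random.default_rng(1)
N = 200
J = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
J = np.triu(J, 1)
J = J + J.T                                   # symmetric couplings, zero diagonal

def energy(s):
    return -0.5 * s @ J @ s

def quench(s):
    """Align each spin with its local field until no single flip lowers the energy."""
    s = s.copy()
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(N):
            h = J[i] @ s
            new = np.sign(h) if h != 0 else s[i]
            if new != s[i]:
                s[i] = new
                changed = True
    return s

for trial in range(3):
    s0 = rng.choice([-1.0, 1.0], size=N)
    print(f"trial {trial}: energy per spin = {energy(quench(s0)) / N:.4f}")
</code></pre>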
<p>Computing thermodynamic quantities in this system requires averaging over the disorder (the random $J_{ij}$), which means computing the quenched average of the free energy:</p>
$$\overline{F} = -T\, \overline{\ln Z}$$<p>The overline denotes an average over the distribution of couplings. The problem is that $\ln Z$ is hard to average because $Z$ is a sum of exponentially many terms. Parisi&rsquo;s solution — the replica trick — is a mathematical device worth describing, because it is beautifully strange.</p>
<p>The trick exploits the identity $\ln Z = \lim_{n \to 0} (Z^n - 1)/n$. We compute $\overline{Z^n}$ for integer $n$, which is feasible because $Z^n$ is a product of $n$ copies (replicas) of the partition function, and the average over disorder decouples. We then analytically continue in $n$ to $n \to 0$. The result is an effective action in terms of order parameters $q^{ab}$, which describe the overlap between spin configurations in replica $a$ and replica $b$.</p>
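<p>For completeness, the decoupling step looks like this (a standard manipulation, sketched with the inverse temperature $\beta$ and the SK variance from above). Averaging $Z^n$ over the Gaussian couplings uses $\overline{e^{\lambda J_{ij}}} = e^{\lambda^2 J^2 / 2N}$ term by term:</p>
$$\overline{Z^n} = \sum_{\{\sigma^a\}} \prod_{i < j} \overline{\exp\!\left(\beta J_{ij} \sum_{a=1}^{n} \sigma_i^a \sigma_j^a\right)} = \sum_{\{\sigma^a\}} \exp\!\left(\frac{\beta^2 J^2}{2N} \sum_{i < j} \left(\sum_{a} \sigma_i^a \sigma_j^a\right)^{2}\right)$$<p>Expanding the square couples the replicas pairwise through terms $\sigma_i^a \sigma_i^b$, which is where the overlaps $q^{ab} = \frac{1}{N}\sum_i \sigma_i^a \sigma_i^b$ enter.</p>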
<p>The naive assumption is replica symmetry: all $q^{ab}$ are equal. This assumption turns out to be wrong. Parisi showed that the correct solution breaks replica symmetry in a hierarchical way — the overlap matrix $q^{ab}$ has a nested structure, described by a function $q(x)$ for $x \in [0,1]$. This is replica symmetry breaking (RSB).</p>
<p>RSB has a beautiful physical interpretation. The phase space of the spin glass is organised into an ultrametric tree: exponentially many states, arranged in nested clusters. States in the same cluster are similar (high overlap); states in different clusters are very different (low overlap). The hierarchy has infinitely many levels. Parisi&rsquo;s solution (<a href="#ref-Parisi1979">Parisi, 1979</a>) encodes this structure for the SK model; its rigorous justification came only decades later, through Guerra&rsquo;s interpolation bounds and Talagrand&rsquo;s proof of the Parisi formula in the 2000s.</p>
<p>This is not an abstraction. RSB predicts specific, measurable properties of real spin glass alloys, and experiments have confirmed them. It is also, I want to emphasise, not a result that anyone expected. The mathematics forced it.</p>
<p>Three years after Parisi solved the SK model, a physicist at Bell Labs wrote a paper about memory.</p>
<h2 id="hopfield-1982-memory-as-energy-minimisation">Hopfield (1982): memory as energy minimisation</h2>
<p>John Hopfield was a condensed matter physicist who had drifted toward biophysics — electron transfer in proteins, neural computation. In 1982 he published a paper in PNAS with the title &ldquo;Neural networks and physical systems with emergent collective computational abilities&rdquo; (<a href="#ref-Hopfield1982">Hopfield, 1982</a>). Most biologists read it as a neuroscience paper. It is a statistical mechanics paper.</p>
<p>Hopfield defined a network of $N$ binary &ldquo;neurons&rdquo; $s_i \in \{-1, +1\}$ with symmetric weights $W_{ij} = W_{ji}$, and an energy function:</p>
$$E = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, s_i s_j$$<p>Readers who have seen the SK Hamiltonian above will notice something. This is it. The $J_{ij}$ of the spin glass are the $W_{ij}$ of the neural network. The Ising spins $\sigma_i$ are the neuron states $s_i$. The Hopfield network energy function is the Ising model Hamiltonian with symmetric, fixed (non-random) couplings. This is not a metaphor. This is the same equation.</p>
<p>The dynamics: at each step, choose a neuron $i$ at random and update it according to</p>
$$s_i \leftarrow \text{sgn}\!\left(\sum_{j} W_{ij} s_j\right)$$<p>This update always decreases or leaves unchanged the energy $E$ (because the weights are symmetric). The network is a gradient descent machine on $E$. It will always converge to a local minimum — a fixed point.</p>
<p>The innovation is in how Hopfield chose the weights. To store a set of $p$ binary patterns $\xi^\mu \in \{-1,+1\}^N$ (for $\mu = 1, \ldots, p$), use Hebb&rsquo;s rule:</p>
$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi^\mu_i\, \xi^\mu_j$$<p>This is the outer product rule. Each stored pattern contributes a rank-1 matrix to $W$. You can verify that if $s = \xi^\mu$, then the local field at neuron $i$ is</p>
$$h_i = \sum_j W_{ij} s_j = \frac{1}{N}\sum_j \sum_{\nu} \xi^\nu_i \xi^\nu_j \xi^\mu_j = \xi^\mu_i + \frac{1}{N}\sum_{\nu \neq \mu} \xi^\nu_i \underbrace{\left(\sum_j \xi^\nu_j \xi^\mu_j\right)}_{\text{cross-talk}}$$<p>The first term reinforces pattern $\mu$. The second term is noise from the other stored patterns. When the patterns are random and uncorrelated, the cross-talk is a sum of random terms of typical size $\sqrt{p/N}$; as long as $p \ll N$, the first term dominates and the stored patterns are stable fixed points of the dynamics. A noisy or incomplete input — a partial pattern — will evolve under the dynamics toward the nearest stored pattern. This is associative memory: content-addressable retrieval.</p>
<p>The capacity limit follows from the same analysis. As $p$ grows, the cross-talk grows. When $p$ exceeds approximately $0.14N$, the cross-talk overwhelms the signal, and the network begins to form spurious minima — states that are not any of the stored patterns but are mixtures or corruptions of them. The network has entered a spin-glass phase.</p>
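<p>The whole mechanism fits in a few lines of numpy. What follows is a toy demonstration of Hebbian storage and retrieval well below the capacity limit, not a reproduction of Hopfield&rsquo;s original experiments.</p>
<pre><code class="language-python"># Store p random patterns with the Hebb rule, corrupt one, and let the
# asynchronous sgn-updates pull it back. Toy scale: p/N = 0.02, far below 0.14.
import numpy as np

rng = np.random.default_rng(2)
N, p = 500, 10
patterns = rng.choice([-1.0, 1.0], size=(p, N))

W = (patterns.T @ patterns) / N               # Hebb outer-product rule
np.fill_diagonal(W, 0.0)

def recall(s, sweeps=10):
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(N):          # asynchronous updates
            h = W[i] @ s
            if h != 0:
                s[i] = np.sign(h)
    return s

cue = patterns[0].copy()
flipped = rng.choice(N, size=100, replace=False)
cue[flipped] *= -1                            # corrupt 20% of the bits

retrieved = recall(cue)
print(f"overlap with stored pattern: {retrieved @ patterns[0] / N:.3f}")  # close to 1.0
</code></pre>
<p>Pushing $p$ past roughly $0.14N$ in the same script is the quickest way to watch the memory phase give way to the spin-glass phase: the retrieval overlap collapses.</p>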
<p>This is not a rough analogy. Amit, Gutfreund, and Sompolinsky showed in 1985 that the Hopfield model is <em>exactly</em> the SK model with $p$ planted minima (<a href="#ref-Amit1985">Amit, Gutfreund, &amp; Sompolinsky, 1985</a>). The phase diagram of the Hopfield model — paramagnetic phase, memory phase, spin-glass phase — maps precisely onto the phase diagram of the SK model. The capacity limit $p \approx 0.14N$ is the phase boundary between the memory phase and the spin-glass phase, derivable from Parisi&rsquo;s RSB theory.</p>
<p>The 2021 Nobel and the 2024 Nobel are, mathematically, about the same model.</p>
<h2 id="boltzmann-machines-hinton--sejnowski-1985">Boltzmann machines (Hinton &amp; Sejnowski, 1985)</h2>
<p>The Hopfield model is deterministic and shallow — one layer of visible neurons, no hidden structure. Geoffrey Hinton and Terry Sejnowski, in a collaboration that began at the Cognitive Science summer school in Pittsfield in 1983 and culminated in a 1985 paper (<a href="#ref-Ackley1985">Ackley, Hinton, &amp; Sejnowski, 1985</a>), added two things: hidden units and stochastic dynamics.</p>
<p>Hidden units $h_j$ are neurons not connected to any input or output. They do not correspond to observable quantities; they model latent structure in the data. The energy of the system is:</p>
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} W_{ij}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$$<p>where $v_i$ are the visible (data) units, $h_j$ are the hidden units, and $a_i$ and $b_j$ are biases. Note that this is still an Ising-type energy; the $W_{ij}$ are now inter-layer weights. (As written, with inter-layer couplings only, this is the bipartite form; the general Boltzmann machine also allows connections within each layer.)</p>
<p>The stochastic dynamics replace deterministic gradient descent with a Markov chain. Each unit is updated probabilistically:</p>
$$P(s_k = 1 \mid \text{rest}) = \sigma\!\left(\sum_j W_{kj} s_j + \text{bias}_k\right)$$<p>where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. At inverse temperature $\beta = 1/T$, the probability of any complete configuration is</p>
$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-\beta E(\mathbf{v}, \mathbf{h})}$$<p>This is the Boltzmann distribution. The machine is named after Ludwig Boltzmann because the equilibrium distribution of its states is the Boltzmann distribution. Not analogously. Literally.</p>
<p>Learning amounts to adjusting the weights to make the model distribution $P(\mathbf{v}, \mathbf{h})$ match the data distribution $P_{\text{data}}(\mathbf{v})$. The objective is to minimise the Kullback-Leibler divergence:</p>
$$\mathcal{L} = D_{\mathrm{KL}}(P_{\text{data}} \| P_{\text{model}}) = \sum_{\mathbf{v}} P_{\text{data}}(\mathbf{v}) \ln \frac{P_{\text{data}}(\mathbf{v})}{P_{\text{model}}(\mathbf{v})}$$<p>The gradient with respect to the weight $W_{ij}$ is</p>
$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\langle v_i h_j \rangle_{\text{data}} + \langle v_i h_j \rangle_{\text{model}}$$<p>The first term is the empirical correlation between visible unit $i$ and hidden unit $j$ when the visible units are clamped to data. The second term is the correlation in the model&rsquo;s free-running equilibrium. The learning rule says: increase $W_{ij}$ if the data sees these two units co-active more than the model does, and decrease it otherwise. This is Hebbian learning with a contrastive correction — the physics of equilibration drives the learning.</p>
<p>The computational difficulty is the second term. Computing $\langle v_i h_j \rangle_{\text{model}}$ requires the Markov chain to reach equilibrium, which takes exponentially long in general. Hinton&rsquo;s later invention of contrastive divergence — run the chain for only a few steps rather than to equilibrium — made training feasible, at the cost of a biased gradient estimate. This engineering compromise is part of why the physics purists are uncomfortable: the original derivation is rigorous statistical mechanics, but the algorithm that actually works in practice is an approximation whose convergence properties are poorly understood.</p>
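<p>For concreteness, here is a minimal sketch of the learning rule with the one-step shortcut. It uses the bipartite energy written above (visible and hidden units only, no within-layer couplings), a toy dataset invented for the example, and CD-1 in place of full equilibration.</p>
<pre><code class="language-python"># Tiny restricted Boltzmann machine trained with one-step contrastive divergence.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: two binary prototypes, each repeated 50 times with a little flip noise.
prototypes = np.array([[1, 1, 1, 0, 0, 0],
                       [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(prototypes, 50, axis=0)
flips = rng.random(data.shape) > 0.95
data = np.abs(data - flips)

n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
a = np.zeros(n_visible)                       # visible biases
b = np.zeros(n_hidden)                        # hidden biases

for epoch in range(200):
    v0 = data
    ph0 = sigmoid(v0 @ W + b)                 # P(h = 1 | v) with visibles clamped to data
    h0 = (ph0 > rng.random(ph0.shape)).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)               # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + b)
    # data correlations minus (one-step) model correlations: the contrastive rule
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(data)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

# The hidden units now respond differently to the two prototypes.
print(np.round(sigmoid(prototypes @ W + b), 2))
</code></pre>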
<p>I find this charming rather than damning. Physics itself is full of approximations whose convergence properties are poorly understood but which happen to give right answers. Perturbation theory beyond leading order, the replica trick itself — these are not rigorous mathematics. They are informed guesses that happen to be correct. The history of theoretical physics is mostly the history of getting away with things.</p>
<h2 id="from-boltzmann-machines-to-transformers">From Boltzmann machines to transformers</h2>
<p>The Boltzmann machine was computationally difficult but conceptually foundational. The restricted Boltzmann machine (RBM) — with no within-layer connections, so that hidden units are conditionally independent given the visible units and vice versa — made training via contrastive divergence practical.</p>
<p>Hinton, Osindero, and Teh&rsquo;s 2006 paper on deep belief networks showed that stacking RBMs and pre-training them greedily could initialise deep networks well enough to fine-tune with backpropagation. This was the breakthrough that restarted deep learning after the winter of the 1990s. It is fair to say that without the Boltzmann machine as conceptual foundation and the RBM as practical building block, the deep learning revolution that gave us <a href="/posts/strawberry-tokenisation/">large language models that fail to count letters in words</a> would not have happened in the form it did.</p>
<p>The connection between Hopfield networks and modern attention mechanisms is more recent and more surprising. Ramsauer et al. (2020) showed that modern Hopfield networks — a generalisation of the original with continuous states and a different energy function — have exponential storage capacity (<a href="#ref-Ramsauer2020">Ramsauer et al., 2020</a>). More strikingly, the update rule of the modern Hopfield network is:</p>
$$\mathbf{s}^{\text{new}} = \mathbf{X}\, \text{softmax}\!\left(\beta \mathbf{X}^\top \mathbf{s}\right)$$<p>where $\mathbf{X}$ is the matrix of stored patterns and $\mathbf{s}$ is the query. This is the attention mechanism of the transformer, up to notation. The transformer&rsquo;s multi-head self-attention is, formally, a generalised Hopfield retrieval step. The architecture that powers GPT and everything descended from it is, at one level of abstraction, an associative memory performing energy minimisation on a Hopfield energy landscape.</p>
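<p>The identification is easy to make concrete. In the sketch below (a toy example; the dimensions, the inverse temperature $\beta$, and the stored patterns are arbitrary), the update is executed literally: with the query in the role of $Q$ and the stored patterns in the roles of both $K$ and $V$, it is the same arithmetic as a single attention head, which is the correspondence Ramsauer et al. draw.</p>
<pre><code class="language-python"># One step of the modern Hopfield update: a softmax-weighted read over stored patterns.
import numpy as np

rng = np.random.default_rng(4)
d, n_stored, beta = 16, 8, 4.0
X = rng.normal(size=(d, n_stored))            # columns are the stored patterns
query = X[:, 3] + 0.3 * rng.normal(size=d)    # noisy version of pattern 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hopfield_update(s):
    return X @ softmax(beta * (X.T @ s))      # one retrieval step

retrieved = hopfield_update(query)
cosines = X.T @ retrieved / (np.linalg.norm(X, axis=0) * np.linalg.norm(retrieved))
print(np.argmax(cosines))                     # 3: the noisy query snaps back to pattern 3
</code></pre>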
<p>I do not want to overstate this. The connection is formal and the interpretation is contested. But it is not nothing. The physicists who built the Hopfield network in 1982 were working on the same mathematical object that is now used to process language, images, and protein sequences at industrial scale.</p>
<h2 id="the-protein-folding-connection">The protein folding connection</h2>
<p>The 2024 Nobel Prize in Chemistry went to David Baker for computational protein design and to Demis Hassabis and John Jumper for computational protein structure prediction — specifically for AlphaFold2 (<a href="#ref-Jumper2021">Jumper et al., 2021</a>). This made October 2024 a remarkable month for Nobel Prizes in fields adjacent to artificial intelligence, and it is not a coincidence.</p>
<p>Protein folding is a spin-glass problem. A protein is a polymer of amino acids, each with different chemical properties and steric constraints. The protein folds into a unique three-dimensional structure — its native conformation — determined by its sequence. The energy landscape of the folding process is precisely the kind of rugged landscape that Parisi described for spin glasses: exponentially many misfolded states, separated by barriers, with the native structure as the global minimum (or close to it).</p>
<p>Levinthal&rsquo;s paradox, formulated in 1969, makes the absurdity quantitative. A modest protein of 100 amino acids might have $3^{100} \approx 10^{47}$ possible conformations (allowing three dihedral angle states per residue). Random search of this space, at the rate of one conformation per picosecond, would take roughly $10^{35}$ seconds, on the order of $10^{28}$ years — somewhat longer than the age of the universe. Yet proteins fold in milliseconds to seconds. They do not search randomly; the energy landscape is funnel-shaped, channelling the dynamics toward the native state. But predicting <em>which</em> state is the native one from sequence alone remained one of the hard problems of structural biology for fifty years.</p>
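<p>The arithmetic is worth spelling out, since the orders of magnitude are easy to garble:</p>
<pre><code class="language-python"># Levinthal's estimate, spelled out.
conformations = 3 ** 100                 # three backbone states per residue
seconds = conformations * 1e-12          # one conformation tried per picosecond
years = seconds / 3.15e7
print(f"{conformations:.1e} conformations, {years:.1e} years of random search")
# ~5.2e47 conformations, ~1.6e28 years
</code></pre>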
<p>AlphaFold2 uses a transformer architecture — descended from the Boltzmann machine lineage — trained on millions of known protein structures. It does not simulate the folding dynamics; it has learned, from data, a mapping from sequence to structure that encodes the statistical mechanics of the folding funnel. The Nobel committee gave it the Chemistry prize because it is transforming biochemistry. But the conceptual machinery is pure statistical physics: representation of a high-dimensional energy landscape, approximation of the minimum, learned from the distribution of solved instances.</p>
<p>The three Nobels of 2021–2024 form the most coherent triple I can remember: Parisi showed how disordered energy landscapes behave; Hopfield and Hinton showed how to use energy landscapes as memory and learning machines; Hassabis and Jumper showed how to apply the resulting architecture to the most consequential outstanding problem in molecular biology. Each step is a mathematical consequence of the one before it.</p>
<h2 id="the-controversy-did-the-committee-err">The controversy: did the committee err?</h2>
<p>I said I understand the irritation. Here is what is right about it.</p>
<p>Hinton&rsquo;s work after the Boltzmann machine — backpropagation, dropout, convolutional networks, deep learning at ImageNet scale — is primarily engineering and empirical machine learning. The 2012 AlexNet result that restarted the field was not a theoretical physics contribution; it was a demonstration that known methods work very well on very large datasets with very large GPUs. The fact that it works is not explained by statistical mechanics. The scaling laws of neural networks (loss scales as a power law with compute, parameters, and data) are empirical observations that physicists have tried, with mixed success, to explain using renormalisation group arguments.</p>
<p>If the Nobel Prize in Physics were awarded for &ldquo;the work that most influenced technology in the past decade,&rdquo; the case for Hinton is strong. If it were awarded for &ldquo;the most important contribution to the science of physics,&rdquo; the case is weaker. There is a version of the Nobel announcement that emphasises the Boltzmann machine specifically — the 1985 paper that is literally named after a physicist and uses his distribution — and that version sits cleanly within physics. There is a broader version that encompasses all of Hinton&rsquo;s career, and that version includes a great deal of empirical machine learning that the physics community is reasonably reluctant to claim.</p>
<p>My view, for what it is worth from someone who has been <a href="/posts/ai-warfare-anthropic-atom-bomb/">thinking about AI ethics and consequences</a> for rather longer than feels comfortable: the Nobel correctly identifies that the foundational conceptual contributions — the Ising Hamiltonian as associative memory, the Boltzmann distribution as a learning target, the connection between statistical mechanics and computation — are physics. They came from physicists, they use physics mathematics, they extend physics intuition into a new domain. The subsequent scaling of these ideas using TPUs and transformer architectures is engineering. Valuable engineering, world-changing engineering, but engineering. The Nobel is for the former. If the citation had been more specific — &ldquo;for the Boltzmann machine and its demonstration that physical principles govern neural computation&rdquo; — the physics community would have been less irritated and equally correct.</p>
<p>What the irritation reveals is something slightly uncomfortable about disciplinary identity. Physicists are proud of universality: the idea that the same mathematical structures appear in wildly different physical systems. RSB in spin glasses, replica methods in random matrices, the Parisi–Sourlas correspondence between disordered systems and supersymmetric field theories — the joy of physics is precisely that these deep structural similarities cross domain boundaries. When that universality reaches into machine learning and says &ldquo;your transformer attention layer is a Hopfield retrieval step,&rdquo; physicists should be delighted, not affronted.</p>
<p>The <a href="/posts/ralph-loop/">agentic systems</a> that are being built right now on top of transformer architectures are doing something that looks, from a sufficiently abstract distance, like what the Hopfield network was designed to do: find stored patterns that match a query, and use them to generate a response. The <a href="/posts/car-wash-grounding/">failures of grounding</a> that I have written about elsewhere are, in this view, failures of the energy landscape — the model finds a metastable state that is not the correct minimum, and the dynamics cannot escape. Spin glass physics does not explain these failures in detail, but it gives a language for thinking about them. That is what physics is for.</p>
<h2 id="the-universality-argument">The universality argument</h2>
<p>Let me make the deeper claim explicit. Why should disordered magnets, associative memory networks, and protein folding all live in the same mathematical family?</p>
<p>Because they all have the same structure: many interacting degrees of freedom with competing constraints, a combinatorially large configuration space, an energy landscape with exponentially many metastable states, and dynamics that search for — and frequently fail to find — global minima. This is a universality class. The specific details (magnetic moments versus neuron states versus dihedral angles) are irrelevant at the level of the energy landscape topology.</p>
<p>Parisi&rsquo;s contribution was to show that this class has a specific, exactly-solvable structure in mean field theory, characterised by replica symmetry breaking and the ultrametric organisation of states. This was not a solution to one model. It was a description of a universality class. The fact that the Hopfield model is in this class is not a coincidence requiring explanation; it is a mathematical identity requiring verification.</p>
<p>The <a href="/posts/kuramoto-ensemble-sync/">Kuramoto model for coupled oscillators</a> — which I have written about in the context of ensemble synchronisation and neural phase coupling — is another member of this extended family. The synchronisation transition in the Kuramoto model, the glass transition in the SK model, and the memory phase transition in the Hopfield model are all mean-field phase transitions in disordered many-body systems. The mathematics is more similar than the physics syllabi suggest.</p>
<p>When I teach physics and occasionally venture into questions about what the AI tools my students are using actually do, I find myself reaching for this framework. Not because it gives engineering insight into how to train a better model — it does not, particularly — but because it gives honest insight into <em>what kind of thing</em> a neural network is. It is a physical system. It has an energy landscape. Its failures are phase transitions. Its successes are energy minimisation. The vocabulary of statistical mechanics is not a metaphor; it is the correct description.</p>
<p>The Nobel committee noticed. They were right to notice.</p>
<hr>
<p><em>The 2021 and 2024 Nobel Prizes in Physics have now officially bridged the gap between condensed matter physics and machine learning in the public record. For anyone who wants to understand either field more deeply than the press releases suggest, the SK model and the Hopfield network are the right place to start. Both papers are short by modern standards — Parisi&rsquo;s 1979 letter is three pages; Hopfield&rsquo;s 1982 PNAS paper is five — and both repay close reading.</em></p>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Sherrington1975"></span>Sherrington, D., &amp; Kirkpatrick, S. (1975). Solvable model of a spin-glass. <em>Physical Review Letters</em>, 35(26), 1792–1796. <a href="https://doi.org/10.1103/PhysRevLett.35.1792">DOI: 10.1103/PhysRevLett.35.1792</a></p>
</li>
<li>
<p><span id="ref-Parisi1979"></span>Parisi, G. (1979). Infinite number of order parameters for spin-glasses. <em>Physical Review Letters</em>, 43(23), 1754–1756. <a href="https://doi.org/10.1103/PhysRevLett.43.1754">DOI: 10.1103/PhysRevLett.43.1754</a></p>
</li>
<li>
<p><span id="ref-Hopfield1982"></span>Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. <em>Proceedings of the National Academy of Sciences</em>, 79(8), 2554–2558. <a href="https://doi.org/10.1073/pnas.79.8.2554">DOI: 10.1073/pnas.79.8.2554</a></p>
</li>
<li>
<p><span id="ref-Ackley1985"></span>Ackley, D. H., Hinton, G. E., &amp; Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. <em>Cognitive Science</em>, 9(1), 147–169. <a href="https://doi.org/10.1207/s15516709cog0901_7">DOI: 10.1207/s15516709cog0901_7</a></p>
</li>
<li>
<p><span id="ref-Amit1985"></span>Amit, D. J., Gutfreund, H., &amp; Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. <em>Physical Review Letters</em>, 55(14), 1530–1533. <a href="https://doi.org/10.1103/PhysRevLett.55.1530">DOI: 10.1103/PhysRevLett.55.1530</a></p>
</li>
<li>
<p><span id="ref-Jumper2021"></span>Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. <em>Nature</em>, 596, 583–589. <a href="https://doi.org/10.1038/s41586-021-03819-2">DOI: 10.1038/s41586-021-03819-2</a></p>
</li>
<li>
<p><span id="ref-Ramsauer2020"></span>Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., &amp; Hochreiter, S. (2020). Hopfield networks is all you need. <em>arXiv:2008.02217</em>. Retrieved from <a href="https://arxiv.org/abs/2008.02217">https://arxiv.org/abs/2008.02217</a></p>
</li>
<li>
<p><span id="ref-Nobel2024"></span>Nobel Prize Committee. (2024). Scientific background: Machine learning and physical systems. The Royal Swedish Academy of Sciences. Retrieved from <a href="https://www.nobelprize.org/prizes/physics/2024/advanced-information/">https://www.nobelprize.org/prizes/physics/2024/advanced-information/</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>You Cannot Have All Three: The Fairness Impossibility Theorem</title>
      <link>https://sebastianspicker.github.io/posts/fairness-impossibility-ai-bias/</link>
      <pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/fairness-impossibility-ai-bias/</guid>
      <description>Three natural fairness criteria for an AI classifier — calibration, equal false positive rates, equal false negative rates — cannot all hold simultaneously when base rates differ across groups. This is not an engineering failure. It is a theorem. Choosing which criterion to satisfy is a political decision, not a technical one.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>In 2016 ProPublica published an investigation showing that COMPAS — a widely used recidivism risk
assessment tool — assigned higher risk scores to Black defendants than to White defendants with
equivalent actual recidivism rates. The tool&rsquo;s developer responded that COMPAS is well-calibrated:
among defendants of any race assigned a given score, the subsequent recidivism rates are
consistent with that score. Both claims were correct.</p>
<p>The apparent contradiction between them is resolved by a mathematical result that was proved
independently by two groups the same year. The fairness impossibility theorem establishes that
calibration, equal false positive rates, and equal false negative rates cannot all hold
simultaneously when base rates differ between groups — unless the classifier is perfect.</p>
<p>This is not a property of COMPAS specifically. It is not fixed by a better algorithm, more
diverse training data, or more careful engineering. It is a constraint that holds for any
probabilistic classifier operating on groups with unequal prevalence of the predicted outcome.</p>
<p>The question this forces is not &ldquo;how do we make the algorithm fair?&rdquo; The question is &ldquo;which
fairness criterion do we endorse, and can we defend that choice to the people it disadvantages?&rdquo;
That is not a technical question.</p>
<h2 id="the-compas-investigation">The COMPAS Investigation</h2>
<p>Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published &ldquo;Machine Bias&rdquo; in ProPublica
on 23 May 2016 (<a href="#ref-angwin2016">Angwin et al., 2016</a>). They had obtained COMPAS risk scores for
approximately 7,000 defendants in Broward County, Florida, along with actual two-year recidivism
data. Their finding: among defendants who did not go on to reoffend, Black defendants were
falsely labelled high-risk at roughly twice the rate of White defendants. The false positive rate
was substantially higher for Black defendants.</p>
<p>Northpointe (now Equivant), the tool&rsquo;s developer, responded that ProPublica&rsquo;s analysis was
misleading. COMPAS is <em>calibrated</em>: within any given score band, the actual recidivism rate is
the same regardless of race. A score of 7 means approximately the same thing for a Black
defendant as for a White defendant. This is a genuine and important property for a risk assessment
to have.</p>
<p>Both analyses were conducted correctly. The tension between them is not a matter of one side
being wrong. It arises because two legitimate fairness criteria cannot, mathematically, be
satisfied at the same time.</p>
<h2 id="three-definitions-of-fairness">Three Definitions of Fairness</h2>
<p>Let \(Y \in \{0, 1\}\) be the true outcome (reoffend/not), \(S\) be the risk score the
classifier assigns, \(\hat{Y} \in \{0, 1\}\) be the binary prediction derived from it, and
\(A \in \{0, 1\}\) indicate group membership.</p>
<p><strong>Calibration</strong> (predictive parity): for all score values \(s\),</p>
$$P(Y = 1 \mid S = s, A = 0) = P(Y = 1 \mid S = s, A = 1)$$<p>If the model assigns a score of 7 to a defendant, the actual reoffending rate should be the
same regardless of race. This is what COMPAS satisfies.</p>
<p><strong>False positive rate parity</strong>:</p>
$$P(\hat{Y} = 1 \mid Y = 0, A = 0) = P(\hat{Y} = 1 \mid Y = 0, A = 1)$$<p>Among defendants who will not reoffend, the probability of being incorrectly labelled high-risk
should be equal across groups. This is what ProPublica found violated.</p>
<p><strong>False negative rate parity</strong>:</p>
$$P(\hat{Y} = 0 \mid Y = 1, A = 0) = P(\hat{Y} = 0 \mid Y = 1, A = 1)$$<p>Among defendants who will reoffend, the probability of being incorrectly labelled low-risk
should be equal across groups.</p>
<p>All three properties seem like reasonable things to ask of a fair classifier. The impossibility
theorem says you cannot have all three at once — with a precise exception.</p>
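<p>For readers who parse code faster than conditional probabilities, here is a minimal sketch of the three criteria computed per group from labelled binary predictions. The function and the array names are mine; it illustrates the definitions, not either paper&rsquo;s analysis.</p>
<pre><code class="language-python">
import numpy as np

def fairness_report(y_true, y_pred, group):
    """Per-group PPV (calibration in the predictive-parity sense),
    false positive rate, and false negative rate for binary predictions."""
    report = {}
    for g in np.unique(group):
        m = group == g
        y, p = y_true[m], y_pred[m]
        tp = np.sum(p[y == 1] == 1)   # true positives
        fn = np.sum(p[y == 1] == 0)   # false negatives
        fp = np.sum(p[y == 0] == 1)   # false positives
        tn = np.sum(p[y == 0] == 0)   # true negatives
        report[g] = {
            "PPV": tp / (tp + fp),    # calibration / predictive parity
            "FPR": fp / (fp + tn),    # false positive rate parity checks this
            "FNR": fn / (fn + tp),    # false negative rate parity checks this
        }
    return report
</code></pre>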
<h2 id="the-impossibility-theorem">The Impossibility Theorem</h2>
<p>Alexandra Chouldechova proved the relevant result in 2017 using Broward County data as her case
study (<a href="#ref-chouldechova2017">Chouldechova, 2017</a>). Jon Kleinberg, Sendhil Mullainathan, and
Manish Raghavan proved an equivalent result independently (<a href="#ref-kleinberg2017">Kleinberg et al., 2017</a>).</p>
<p>The argument is straightforward. Suppose a classifier is calibrated and produces a binary
prediction (high/low risk). Let \(p_0\) and \(p_1\) be the base rates — the actual reoffending
rates — in groups 0 and 1. For a binary classifier with positive predictive value PPV and
negative predictive value NPV:</p>
<ul>
<li>The false positive rate satisfies (via Bayes): \(\text{FPR} = \frac{\text{TPR} \cdot \text{PR} \cdot (1-\text{PPV})}{\text{PPV} \cdot (1-\text{PR})}\) where PR is prevalence and TPR is sensitivity</li>
<li>The false negative rate satisfies (via Bayes): \(\text{FNR} = \frac{\text{TNR} \cdot (1-\text{PR}) \cdot (1-\text{NPV})}{\text{NPV} \cdot \text{PR}}\) where TNR is specificity</li>
</ul>
<p>If calibration holds — PPV and NPV are equal across groups — and the base rates \(p_0 \neq p_1\),
then the FPR and FNR in each group are functions of that group&rsquo;s specific base rate. They cannot
both be equalized across groups unless either:</p>
<ol>
<li>\(p_0 = p_1\): the base rates are equal, or</li>
<li>The classifier is perfect: FPR = FNR = 0.</li>
</ol>
<p>In the real case — unequal base rates, imperfect classifier — calibration and equalized error
rates are mutually exclusive. You can have one or the other but not both. Once the base rates are
fixed, the predictive values and the error rates are algebraically linked, so the three criteria
cannot be chosen independently. It is an algebraic constraint, not an engineering limitation.</p>
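<p>The constraint can be made vivid with a few lines of arithmetic. Rearranging the two identities above gives the error rates implied by a fixed PPV and NPV at a given base rate; the specific values below (PPV 0.7, NPV 0.8, base rates 0.3 and 0.5) are invented purely for illustration.</p>
<pre><code class="language-python">
def implied_error_rates(prevalence, ppv, npv):
    """FPR and FNR implied by Bayes' rule when PPV and NPV are held fixed
    (a classifier that is 'calibrated' in the predictive-parity sense)."""
    k1 = (prevalence / (1 - prevalence)) * (1 - ppv) / ppv
    k2 = ((1 - prevalence) / prevalence) * (1 - npv) / npv
    tpr = (k2 - 1) / (k1 * k2 - 1)   # solve the two identities simultaneously
    return tpr * k1, 1 - tpr         # FPR, FNR

# Same PPV and NPV in both groups, different base rates:
for p in (0.3, 0.5):
    fpr, fnr = implied_error_rates(p, ppv=0.7, npv=0.8)
    print(f"base rate {p:.1f}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
# base rate 0.3: FPR = 0.09, FNR = 0.53
# base rate 0.5: FPR = 0.36, FNR = 0.16
</code></pre>
<p>Equal predictive values, unequal base rates, and the error rates come apart by a factor of three to four. That is the COMPAS dispute in miniature.</p>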
<h2 id="a-structural-analogy">A Structural Analogy</h2>
<p>The structural similarity to another impossibility result is worth noting.</p>
<p>The Robertson inequality in quantum mechanics (<a href="#ref-robertson1929">Robertson, 1929</a>) states that
for any two observables \(\hat{A}\) and \(\hat{B}\):</p>
$$\Delta A \cdot \Delta B \geq \frac{1}{2} \left| \langle [\hat{A}, \hat{B}] \rangle \right|$$<p>This is not an engineering failure. It is a consequence of the algebraic structure of the theory:
if \([\hat{A}, \hat{B}] \neq 0\), then \(\Delta A\) and \(\Delta B\) cannot simultaneously be
made arbitrarily small. No measurement apparatus, however precise, can violate it. The constraint
is in the mathematics, not the hardware.</p>
<p>The fairness impossibility has the same character. Three desiderata, a structural constraint that
prevents simultaneous satisfaction, and no algorithmic escape route. A better model does not help.
Richer training data does not help. The constraint is in the arithmetic of conditional
probabilities and base rates.</p>
<p>The disanalogy is this: in quantum mechanics, \(\hbar\) is a fundamental constant — you cannot
reduce it. In fairness, the base rates are not constants of nature. They are historical outcomes
of social processes: incarceration rates, policing patterns, economic conditions, educational
access. The theorem does not tell you that unequal base rates are acceptable; it tells you that
given unequal base rates, the three fairness criteria cannot all be satisfied.</p>
<h2 id="gender-bias-in-ai-systems">Gender Bias in AI Systems</h2>
<p>The impossibility theorem applies to any binary classification setting with unequal base rates.
The empirical landscape of AI gender bias gives several concrete instances where one criterion was
satisfied while others were not.</p>
<p>In October 2018, Reuters reported that Amazon had developed and then abandoned an internal
AI-based recruiting tool that systematically downgraded résumés from women
(<a href="#ref-dastin2018">Dastin, 2018</a>). The model had been trained on a decade of hiring decisions,
in which successful hires were predominantly male. The model learned that &ldquo;male&rdquo; features were
associated with success and penalized female indicators accordingly. Calibration to the training
data produced systematic gender bias in output.</p>
<p>Tolga Bolukbasi and colleagues showed in 2016 that word embeddings trained on large text corpora
encoded gender stereotypes in their geometric structure
(<a href="#ref-bolukbasi2016">Bolukbasi et al., 2016</a>). The analogy \(\text{man} : \text{computer
programmer} :: \text{woman} : \text{homemaker}\) could be recovered directly from the vector
arithmetic of the embedding space. The embedding was calibrated to the text corpus, which reflected
the occupational distribution of the time — and perpetuated it.</p>
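<p>The analogy query itself is easy to reproduce, at least approximately. A hedged sketch using gensim&rsquo;s pretrained word2vec vectors follows; the model name, the phrase token <code>computer_programmer</code>, and the exact neighbours returned are assumptions about the published GoogleNews embedding, and the download is large.</p>
<pre><code class="language-python">
# Sketch of the Bolukbasi-style analogy query. Assumes gensim and its
# downloader data are available; the ~1.6 GB GoogleNews vectors and the
# presence of the phrase token 'computer_programmer' are assumptions.
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")
print(kv.most_similar(positive=["woman", "computer_programmer"],
                      negative=["man"], topn=3))
# In the vectors Bolukbasi et al. analysed, 'homemaker' ranks at or near the top.
</code></pre>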
<p>Jieyu Zhao and colleagues found that image captioning and activity recognition models amplified
existing gender associations (<a href="#ref-zhao2017">Zhao et al., 2017</a>). &ldquo;Cooking&rdquo; was associated with
women in 67% of training images; the models amplified that to 84% at inference.
The amplification is a consequence of models learning the easiest features that predict the label
— and in a world where cooking is disproportionately female, &ldquo;female appearance&rdquo; becomes a
feature that predicts &ldquo;cooking.&rdquo;</p>
<p>Joy Buolamwini and Timnit Gebru&rsquo;s &ldquo;Gender Shades&rdquo; study found error rates of up to 34.7% for
darker-skinned women in commercial facial recognition systems, compared to 0.8% for lighter-skinned
men (<a href="#ref-buolamwini2018">Buolamwini &amp; Gebru, 2018</a>). The classifiers were calibrated on
predominantly light-skinned training data. Calibration on the majority group produced large errors
on the minority group — exactly the pattern the impossibility theorem describes.</p>
<p>Hadas Kotek and colleagues tested four large language models on gender-stereotyped occupational
prompts in 2023 (<a href="#ref-kotek2023">Kotek et al., 2023</a>). The models were three to six times more
likely to choose the gender-stereotyped occupation when responding to ambiguous prompts. The
models were calibrated to human-generated text; human-generated text encodes human stereotypes.</p>
<h2 id="the-solutions-and-their-limits">The Solutions and Their Limits</h2>
<p>Three broad approaches exist to algorithmic debiasing, and all three face the same constraint.</p>
<p><strong>Pre-processing</strong> removes bias from training data before training. Zemel and colleagues proposed
&ldquo;Learning Fair Representations&rdquo; — a latent embedding that encodes the data usefully while
obscuring group membership (<a href="#ref-zemel2013">Zemel et al., 2013</a>). This can reduce bias in the
learned representation, but it cannot simultaneously satisfy all three fairness criteria; it
trades one against another by compressing the group-informative dimensions.</p>
<p><strong>Post-processing</strong> adjusts the classifier&rsquo;s decisions after training. Moritz Hardt, Eric Price,
and Nathan Srebro&rsquo;s equalized odds approach (<a href="#ref-hardt2016">Hardt et al., 2016</a>) adjusts
decision thresholds separately per group to achieve FPR/FNR parity. This satisfies equalized
odds by construction — but only by abandoning calibration, which the Chouldechova theorem requires
when base rates differ.</p>
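<p>To make the post-processing idea concrete, here is a simplified sketch of choosing one threshold per group so that false positive rates land near a common target. It illustrates the general move, not the Hardt et al. algorithm itself, which uses randomised thresholds to equalise both error rates at once.</p>
<pre><code class="language-python">
import numpy as np

def group_thresholds_for_fpr(scores, y_true, group, target_fpr=0.1):
    """Pick one decision threshold per group so that each group's false
    positive rate lands near a common target. A simplified illustration of
    post-processing, not the full equalized-odds construction."""
    thresholds = {}
    for g in np.unique(group):
        m = group == g
        negatives = scores[m][y_true[m] == 0]
        # flagging everything above the (1 - target_fpr) quantile of the
        # negative-class scores yields roughly target_fpr false positives
        thresholds[g] = np.quantile(negatives, 1 - target_fpr)
    return thresholds
</code></pre>
<p>The cost is exactly the one the theorem predicts: once the per-group thresholds differ, the same score no longer means the same thing across groups, so calibration is given up.</p>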
<p><strong>In-processing</strong> incorporates a fairness constraint into the training objective. Agarwal and
colleagues proposed a reductions approach that allows the practitioner to specify which fairness
constraint to impose (<a href="#ref-agarwal2018">Agarwal et al., 2018</a>). But you must choose. The
algorithm can optimize for any one of the three criteria; it cannot optimize for all three
simultaneously when base rates differ.</p>
<p>A 2021 survey by Mitchell and colleagues confirms that all three paradigms face the same
impossibility (<a href="#ref-mitchell2021">Mitchell et al., 2021</a>). The choice of paradigm is a choice
about which criterion to prioritize, and that choice has distributional consequences that fall
differently on different groups.</p>
<h2 id="the-political-choice">The Political Choice</h2>
<p>This is where Arvind Narayanan&rsquo;s framing becomes essential. His 2018 tutorial catalogued 21
distinct definitions of algorithmic fairness and titled it &ldquo;21 Fairness Definitions and Their
Politics&rdquo; (<a href="#ref-narayanan2018">Narayanan, 2018</a>). The title is the argument: the definitions
are not equivalent, choosing among them is not a technical decision, and the choice encodes a
prior about what justice requires.</p>
<p>In the criminal justice context: a false positive (predicting recidivism when the defendant will
not reoffend) imposes a cost on the defendant — higher bail, longer sentence, restricted
conditions of release. A false negative (predicting non-recidivism when the defendant will
reoffend) imposes a cost on potential future victims and on public safety. When we choose to
equalize false positive rates, we are choosing to protect defendants from wrongful high-risk
labels. When we choose to equalize false negative rates, we are choosing to protect the public
from missed reoffenders. These are both defensible values. They produce different error
distributions across groups.</p>
<p>Choosing overall accuracy as the metric — which is what maximizing predictive performance
typically means — is itself a value choice: it implicitly weights errors by their frequency in
the population, which means errors made on less-common outcomes are relatively under-penalized.
When racial disparities in base rates are products of historical injustice, this choice compounds
that injustice.</p>
<p>Solon Barocas, Moritz Hardt, and Arvind Narayanan&rsquo;s textbook <em>Fairness and Machine Learning</em>
(2023) makes explicit that the choice between fairness criteria is a normative, not technical,
decision (<a href="#ref-barocas2023">Barocas et al., 2023</a>). The book does not tell you which criterion
to choose. It tells you that you must choose, that the choice has political content, and that
presenting it as a technical optimization problem conceals that content.</p>
<p>Reuben Binns&rsquo; analysis through political philosophy confirms that different fairness criteria
correspond to different underlying theories of justice: Rawlsian, Dworkinian, luck egalitarian
framings all generate different orderings of the three criteria
(<a href="#ref-binns2018">Binns, 2018</a>). The choice of fairness criterion is the choice of a
theory of justice, whether or not the engineers implementing the system have thought of it in
those terms.</p>
<h2 id="the-theorem-is-not-the-problem">The Theorem Is Not the Problem</h2>
<p>I want to be clear about what the impossibility theorem does and does not say.</p>
<p>It does not say that algorithmic fairness is impossible. It says that you must choose among
competing fairness criteria when base rates differ across groups, and that the choice has
distributional consequences. Systems can be built that satisfy calibration, or equal false
positive rates, or equal false negative rates — just not all three at once with unequal base rates.</p>
<p>It does not say that base rate disparities are natural or acceptable. The disparities in
recidivism rates, hiring rates, image training sets, and text corpora are products of social
history. The theorem constrains what a classifier can do <em>given</em> those disparities; it does not
prescribe them.</p>
<p>What it does say is that &ldquo;we built a fair algorithm&rdquo; is not a statement that can be made without
specifying which fairness criterion was satisfied and which was not. It is not a statement that
can be defended on purely technical grounds. And it is not a statement that escapes political
accountability by hiding behind mathematical precision.</p>
<p>The fairness debate in AI is, at its core, a debate about which errors we are willing to make, in
whom, with what consequences. The theorem makes that debate unavoidable. Whether we have the
vocabulary and the will to conduct it in those terms is a different question entirely.</p>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-angwin2016"></span>Angwin, J., Larson, J., Mattu, S., &amp; Kirchner, L. (2016, May 23). Machine bias. <em>ProPublica</em>. <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing</a></li>
<li><span id="ref-chouldechova2017"></span>Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. <em>Big Data</em>, 5(2), 153–163. <a href="https://doi.org/10.1089/big.2016.0047">DOI: 10.1089/big.2016.0047</a></li>
<li><span id="ref-kleinberg2017"></span>Kleinberg, J., Mullainathan, S., &amp; Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In <em>Proceedings of the 8th Innovations in Theoretical Computer Science Conference</em> (ITCS 2017). <a href="https://doi.org/10.4230/LIPIcs.ITCS.2017.43">DOI: 10.4230/LIPIcs.ITCS.2017.43</a></li>
<li><span id="ref-robertson1929"></span>Robertson, H. P. (1929). The uncertainty principle. <em>Physical Review</em>, 34, 163–164. <a href="https://doi.org/10.1103/PhysRev.34.163">DOI: 10.1103/PhysRev.34.163</a></li>
<li><span id="ref-dastin2018"></span>Dastin, J. (2018, October 10). Amazon scraps secret AI recruiting tool that showed bias against women. <em>Reuters</em>. <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G">https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G</a></li>
<li><span id="ref-bolukbasi2016"></span>Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., &amp; Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In <em>Advances in Neural Information Processing Systems 29</em> (NeurIPS 2016). arXiv:1607.06520</li>
<li><span id="ref-zhao2017"></span>Zhao, J., Wang, T., Yatskar, M., Ordonez, V., &amp; Chang, K.-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In <em>Proceedings of EMNLP 2017</em>, pp. 2979–2989. <a href="https://aclanthology.org/D17-1323/">ACL Anthology: D17-1323</a></li>
<li><span id="ref-buolamwini2018"></span>Buolamwini, J., &amp; Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In <em>Proceedings of the 1st Conference on Fairness, Accountability and Transparency</em> (FAT* 2018), PMLR Vol. 81, pp. 77–91. <a href="https://proceedings.mlr.press/v81/buolamwini18a.html">https://proceedings.mlr.press/v81/buolamwini18a.html</a></li>
<li><span id="ref-kotek2023"></span>Kotek, H., Dockum, R., &amp; Sun, D. Q. (2023). Gender bias and stereotypes in large language models. In <em>Proceedings of The ACM Collective Intelligence Conference</em> (CI &lsquo;23), pp. 12–24. <a href="https://doi.org/10.1145/3582269.3615599">DOI: 10.1145/3582269.3615599</a></li>
<li><span id="ref-zemel2013"></span>Zemel, R., Wu, Y., Swersky, K., Pitassi, T., &amp; Dwork, C. (2013). Learning fair representations. In <em>Proceedings of the 30th International Conference on Machine Learning</em> (ICML 2013), PMLR Vol. 28, No. 3, pp. 325–333. <a href="https://proceedings.mlr.press/v28/zemel13.html">https://proceedings.mlr.press/v28/zemel13.html</a></li>
<li><span id="ref-hardt2016"></span>Hardt, M., Price, E., &amp; Srebro, N. (2016). Equality of opportunity in supervised learning. In <em>Advances in Neural Information Processing Systems 29</em> (NeurIPS 2016), pp. 3323–3331. arXiv:1610.02413</li>
<li><span id="ref-agarwal2018"></span>Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., &amp; Wallach, H. (2018). A reductions approach to fair classification. In <em>Proceedings of the 35th International Conference on Machine Learning</em> (ICML 2018), PMLR Vol. 80, pp. 60–69. arXiv:1803.02453</li>
<li><span id="ref-mitchell2021"></span>Mitchell, S., Potash, E., Barocas, S., D&rsquo;Amour, A., &amp; Lum, K. (2021). Algorithmic fairness: Choices, assumptions, and definitions. <em>Annual Review of Statistics and Its Application</em>, 8, 141–163. <a href="https://doi.org/10.1146/annurev-statistics-042720-125902">DOI: 10.1146/annurev-statistics-042720-125902</a></li>
<li><span id="ref-narayanan2018"></span>Narayanan, A. (2018). <em>21 Fairness Definitions and Their Politics</em>. Tutorial at FAT* 2018. <a href="https://facctconference.org/static/tutorials/narayanan-21defs18.pdf">PDF</a></li>
<li><span id="ref-barocas2023"></span>Barocas, S., Hardt, M., &amp; Narayanan, A. (2023). <em>Fairness and Machine Learning: Limitations and Opportunities</em>. MIT Press. <a href="https://fairmlbook.org">https://fairmlbook.org</a></li>
<li><span id="ref-binns2018"></span>Binns, R. (2018). Fairness in machine learning: Lessons from political philosophy. In <em>Proceedings of the 2018 Conference on Fairness, Accountability, and Transparency</em> (FAT* 2018), PMLR Vol. 81, pp. 149–159. arXiv:1712.03586</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-11-05</strong>: Updated the Zhao et al. (2017) cooking statistics to match the paper: 67% female agents for cooking in the training set (33% was the male share), amplified to 84% female at inference.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
