<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Film on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/film/</link>
    <description>Recent content in Film on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 15 May 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/film/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>There Is No Blue Pill: The Epistemology of the Red Pill/Blue Pill Choice</title>
      <link>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</link>
      <pubDate>Thu, 15 May 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</guid>
      <description>The most famous choice in science fiction is epistemically impossible to make rationally. Morpheus offers Neo &amp;lsquo;the truth&amp;rsquo; but gives him no way to evaluate the offer. Cypher&amp;rsquo;s decision to go back is more philosophically coherent than the films acknowledge.</description>
      <content:encoded><![CDATA[<p>Neo is in a chair. A man he met only minutes earlier opens a small box containing two pills. Take the blue one, Morpheus says, and you wake up in your bed and believe whatever you want to believe. Take the red one and you see how deep the rabbit hole goes <a href="#ref-1">[1]</a>. The camera lingers. Neo reaches for the red pill. The audience exhales. The correct choice has been made.</p>
<p>The scene has spent more than twenty-five years becoming the dominant cultural shorthand for choosing uncomfortable truth over comfortable illusion. &ldquo;Take the red pill&rdquo; has entered the vocabulary as a synonym for courageous epistemic honesty. I want to argue that the choice, as Morpheus frames it, is epistemically bankrupt — that no rational agent has enough information to make it correctly at the moment it is offered — and that the character who actually reasons most coherently about the situation is the one the film kills as a traitor. The film wants you to admire Neo&rsquo;s leap. I think you should admire his willingness to leap while being clear-eyed about the fact that it is a leap, not a reasoned conclusion.</p>
<hr>
<h2 id="why-the-choice-is-not-rational">Why the Choice Is Not Rational</h2>
<p>Consider what Neo actually knows when Morpheus makes the offer. He knows that Morpheus is a man he has only just met, who contacted him through cryptic messages and untraceable phone calls, who seems to believe genuinely in what he is saying, and who has a compelling story about the nature of reality. That is it. Neo does not know whether Morpheus is telling the truth. He does not know whether Morpheus is deluded — a charismatic paranoid who has assembled a following around an elaborate false belief system. He does not know whether the entire setup is a psychological experiment, a test of loyalty, a confidence operation, or an elaborate cult recruitment. The setting — a dramatic late-night meeting, theatrical staging, rain-streaked windows, a black leather coat — is, if anything, evidence for the confidence-operation hypothesis.</p>
<p>In Bayesian terms <a href="#ref-2">[2]</a>, let T be the event &ldquo;the Matrix exists as Morpheus describes and he is telling the truth.&rdquo; Neo&rsquo;s prior probability on T — before taking the pill — should be very low. The claim is extraordinary on multiple dimensions simultaneously: the entire perceived world is a computer simulation running on machines that enslaved humanity, Neo is a prophesied saviour, and a small group of ship-dwelling rebels is conducting a guerrilla war against artificial intelligence. Each one of those components carries a low prior. Their conjunction carries a lower one still.</p>
<p>Now Morpheus makes his offer. Does the offer provide strong evidence for T? Not obviously. The likelihood ratio P(Morpheus makes this offer | T is true) divided by P(Morpheus makes this offer | T is false) is the quantity that matters. The numerator is plausible enough: if the Matrix exists and Morpheus is a genuine recruiter, he would make exactly this offer. But the denominator is also non-trivial. A cult leader, a delusional person with a well-developed narrative, a researcher running a social experiment, or a manipulator with undisclosed goals could all make the same offer with the same conviction. The likelihood ratio is not obviously large. It might be greater than one — the offer is somewhat more consistent with the Matrix being real than not — but not by the margin required to substantially shift a very low prior.</p>
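<p>To make the arithmetic concrete, here is a minimal sketch of that update in Python. The component priors and the likelihood ratio are invented for illustration; the point is not the specific numbers but how little a modest likelihood ratio can do against a prior built from a conjunction of extraordinary claims.</p>
<pre><code class="language-python"># Illustrative Bayesian update for the pill scene.
# All numbers are assumptions chosen for illustration, not estimates.

# Component priors for the conjunctive claim T:
p_world_is_simulation = 0.01   # the perceived world is a machine-run simulation
p_neo_is_the_one      = 0.01   # Neo is a prophesied saviour, given the simulation
p_rebels_exist        = 0.10   # a rebel crew is fighting the machines, given the simulation

prior_T = p_world_is_simulation * p_neo_is_the_one * p_rebels_exist  # 1e-5

# Likelihood ratio of the offer itself:
# P(offer | T) / P(offer | not T). Cult leaders and con artists make
# similar offers with similar conviction, so this is assumed to be modest.
likelihood_ratio = 5.0

# Bayes' theorem in odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = prior_T / (1.0 - prior_T)
posterior_odds = prior_odds * likelihood_ratio
posterior_T = posterior_odds / (1.0 + posterior_odds)

print(f"prior P(T)     = {prior_T:.6f}")
print(f"posterior P(T) = {posterior_T:.6f}")  # still well under one percent
</code></pre>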
<p>The rational response to a claim with a low prior and an ambiguous likelihood ratio is: update modestly, and gather more evidence before making an irreversible commitment. The pill choice is irreversible. Neo commits before he has accumulated enough evidence to commit rationally. I want to be precise here: I am not saying Neo is stupid or that the film is bad. I am saying that what Neo does is not Bayesian updating. It is something else, and the film is actually honest enough to name it: Morpheus is a man of faith, he recruits believers, and Neo&rsquo;s choice is a leap of faith. That framing is in the film. What the film does not do is acknowledge that the leap is epistemically problematic — it treats the leap as obviously correct, which is a different thing.</p>
<hr>
<h2 id="the-missing-third-option">The Missing Third Option</h2>
<p>What strikes me every time I watch the scene is that nobody considers the obvious response: decline both pills, at least for now. Not &ldquo;choose the blue pill&rdquo; in the sense of consciously accepting comfortable illusion. Not &ldquo;choose the red pill&rdquo; in the sense of committing to a reality you cannot yet evaluate. Just: I don&rsquo;t take either one until you give me something I can check.</p>
<p>What would that look like? Morpheus could offer Neo a verifiable prediction. He could show him a document, a piece of external evidence, something with epistemic traction that does not require swallowing a GPS-tracking capsule as a precondition. He could make a specific, falsifiable claim about something in Neo&rsquo;s ordinary life — about what will happen tomorrow, about something Neo can verify independently — and let Neo check it. The dramatic scene would survive this revision. It would, in fact, become more interesting. A Morpheus who says &ldquo;I will give you three days and three checkpoints and then you decide&rdquo; is a more trustworthy Morpheus than one who says &ldquo;decide now, in this room, with me watching.&rdquo;</p>
<p>The film never asks why Morpheus doesn&rsquo;t do this. Probably because it would slow down the plot and defuse the tension. But the question is worth sitting with, because the structure of the scene — charismatic authority figure, artificially binary choice, time pressure, grandiose framing, the implicit suggestion that declining is cowardice — is recognisable as the structure of many real-world scenarios that end badly. Cult recruitment. High-pressure sales. Certain kinds of political radicalisation. The scene is stylistically appealing precisely because it removes the messy, gradual process by which people actually come to trust extraordinary claims, and replaces it with a clean moment of commitment. That cleanliness is dramatically useful and epistemically dangerous.</p>
<p>Hilary Putnam formulated the modern brain-in-a-vat scenario nearly two decades before the film <a href="#ref-5">[5]</a>: if you had always been a disembodied brain receiving simulated inputs, your experience would be indistinguishable from the real thing. Putnam himself went on to argue that the hypothesis is self-refuting on semantic grounds, but the sceptical scenario has outlived his answer to it. The unsettling thing is not just that you might be deceived, but that certain kinds of deception are in principle undetectable from the inside. The Matrix gestures at this problem without fully engaging it. If the simulation is good enough, the red pill doesn&rsquo;t show you reality — it shows you another simulation, run by the people who gave you the pill.</p>
<hr>
<h2 id="cypher-was-right">Cypher Was Right</h2>
<p>The character who actually reasons philosophically about the situation is Cypher, and the film kills him as a villain. This has always bothered me.</p>
<p>Cypher&rsquo;s argument is not confused. He knows the Matrix is a simulation. He has taken the red pill, seen the reality of the machines&rsquo; world — the grey sky, the protein slurry, the cold metal of the Nebuchadnezzar — and lived in it for years. He does not dispute the facts. What he disputes is the value judgment: why is knowing the truth better than experiencing a good life in a simulation? He wants to go back. He is willing to betray his colleagues to get there, which is why he is the villain; I want to separate that from the underlying philosophical question.</p>
<p>This is Robert Nozick&rsquo;s experience machine argument, published in 1974, a quarter century before the film <a href="#ref-3">[3]</a>. Nozick asks: suppose you could plug into a machine that would give you any experience you chose — creative achievement, loving relationships, meaningful work, pleasure. While plugged in, you would believe the experiences were real. Would you do it? Most people, when asked cold, say no. Nozick uses this intuition to argue that we care about more than experience: we care about actually doing things, actually being certain kinds of people, actually being in contact with reality rather than a representation of it. These are what philosophers call non-experientialist values — things that matter independently of how good they feel from the inside.</p>
<p>Cypher&rsquo;s position is the opposite: he is a committed hedonist, or at least a committed experientialist. He prefers a good simulated steak that he knows doesn&rsquo;t exist to real protein mush. He is not confused about which is which. He has done the value calculation and arrived somewhere different from where the Wachowskis want him to be. The film has no philosophical response to this. It cannot argue that Nozick&rsquo;s intuition pump is decisive, because it isn&rsquo;t — philosophers dispute it. David Chalmers, in a 2022 book on exactly this question <a href="#ref-6">[6]</a>, argues that virtual worlds can be genuinely real in the ways that matter, and that the intuitive recoil from the experience machine may reflect bias rather than deep moral truth. The film resolves the disagreement by having Cypher shot. That is not a philosophical refutation. It is narrative bullying.</p>
<p>I want to be fair to the film here. There is a reading of Cypher that makes him clearly wrong on non-philosophical grounds: he doesn&rsquo;t just choose the experience machine for himself, he actively endangers and kills people who chose differently. That is the real moral failure — not the preference, but the betrayal. The film is right to condemn the betrayal. What it is not entitled to do is use the betrayal to contaminate the underlying value judgment. Cypher could have negotiated his return without harming anyone. The film doesn&rsquo;t allow that possibility because it wants to code his preference, and not just his actions, as villainous. That conflation is intellectually dishonest.</p>
<p>If you think what matters is experienced well-being — hedonic experience, subjective satisfaction — then Cypher&rsquo;s choice is not only defensible but internally coherent. If you think what matters is contact with objective reality regardless of the experiential cost, then Neo&rsquo;s choice is defensible. These are genuinely contested positions in philosophy of mind and ethics, and the film is not in a position to adjudicate between them by narrative fiat.</p>
<hr>
<h2 id="what-this-has-to-do-with-ai">What This Has to Do with AI</h2>
<p>I think about this in the context of how AI systems present information to users. An AI that says &ldquo;here is the truth, take it or leave it&rdquo; — binary, authoritative, no scaffolding — is doing something structurally similar to what Morpheus does. It presents a conclusion without giving the user the epistemic equipment to evaluate it. Trusting the conclusion requires trusting the system, and trusting the system requires evidence the system hasn&rsquo;t provided. See <a href="/posts/matrix-oracle-alignment-problem/">The Oracle Problem</a> for a companion piece on the Matrix&rsquo;s other epistemically interesting character — the Oracle, who knows more than she tells and deliberately withholds information on the grounds that the recipient isn&rsquo;t ready. Both failure modes — the Morpheus mode of demanding commitment before evidence, and the Oracle mode of managing disclosure paternalistically — are real patterns in how AI systems interact with users.</p>
<p>The better model — for AI assistants and for Morpheus — is incremental disclosure with verification checkpoints. Not a binary pill choice, but a sequence of smaller claims, each with attached evidence, that allows the recipient to update their beliefs rationally as evidence accumulates. This is how science works. It is also how trustworthy communication between humans works, at least when it is functioning well. It is not how dramatic scenes in action films work, which is why the Matrix scene is so satisfying and so epistemically broken at the same time. The satisfaction and the brokenness are related: the scene is satisfying because it removes the friction of genuine epistemic process. Genuine epistemic process is slow, uncertain, and does not have good cinematography.</p>
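<p>Here is a minimal sketch of what that incremental alternative looks like, in the same Bayesian terms as before. The checkpoints and the likelihood ratios are invented; the structural point is that several small, independently verifiable claims can move a very low prior in a way a single unverifiable pitch cannot.</p>
<pre><code class="language-python"># Sketch of incremental disclosure: each checkpoint is a small, checkable
# claim with its own likelihood ratio, and belief is updated as each one
# either verifies or fails. The checkpoints and numbers are invented.

def update(prob, likelihood_ratio):
    """One Bayesian update in odds form."""
    odds = prob / (1.0 - prob) * likelihood_ratio
    return odds / (1.0 + odds)

prior = 1e-5  # same very low prior on T as in the earlier sketch

# (description, likelihood ratio if the claim checks out)
checkpoints = [
    ("predicts the exact contents of tomorrow's mail", 20.0),
    ("demonstrates a repeatable glitch Neo can trigger himself", 50.0),
    ("lets an independent third party confirm the demonstration", 30.0),
]

belief = prior
for claim, lr in checkpoints:
    belief = update(belief, lr)
    print(f"after '{claim}': P(T) = {belief:.4f}")

# Three verified checkpoints move the posterior from one in a hundred
# thousand to roughly 0.23 -- still not certainty, but now a bet a rational
# agent could defend. One big pill moves nothing except the commitment.
</code></pre>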
<p>There is also a point about extraordinary claims. The more extraordinary the claim, the more evidence is required before rational commitment. This is Sagan&rsquo;s principle <a href="#ref-4">[4]</a>, and it applies to the Matrix as much as it applies to claims about room-temperature superconductors or AI systems that achieve general understanding of language. The <a href="/posts/lk99-preprint-physics-sociology/">LK-99 preprint episode</a> is a real-world example of how scientific communities sometimes fail this test spectacularly — early excitement, rushed replication attempts, confident public claims — and how the self-correcting mechanisms of science eventually work, but more slowly and messily than the popular image suggests. Morpheus does not offer Neo the equivalent of a Nature paper with replication data and three independent confirmations. He offers him a pill and a charismatic pitch. The pill is the commitment mechanism, not the evidence. Taking it is the act of faith, not the conclusion of the reasoning process. <a href="/posts/more-context-not-always-better/">More context is not always better</a> is relevant here too: the amount of information Morpheus provides is carefully curated to produce commitment, not calibrated to support independent evaluation. That curation is a form of epistemic control, whether or not Morpheus intends it as such.</p>
<p>For a different kind of AI grounding failure — systems that answer confidently without knowing what state the world is in — see <a href="/posts/car-wash-grounding/">The Car Wash, Grounding, and What AI Systems Don&rsquo;t Know They Don&rsquo;t Know</a>. The Matrix scenario is almost the inverse: the system (Morpheus) knows something about the state of the world that the recipient (Neo) does not, and the question is whether the transfer of that knowledge is being handled honestly.</p>
<hr>
<h2 id="decision-under-radical-uncertainty">Decision Under Radical Uncertainty</h2>
<p>I find myself genuinely ambivalent about Neo&rsquo;s choice, which I think is the correct response to the film if you are paying attention. He is not irrational to take the red pill, in the weak sense that reasonable people sometimes make bets on low-prior, high-upside scenarios, especially when the alternative carries costs of its own. The blue pill is not costless. Accepting permanent comfortable ignorance — knowing that you are choosing not to know — carries its own weight. If Morpheus is telling the truth, the blue pill costs Neo his entire sense of self and his only chance at a meaningful life in the actual world. That asymmetry of potential regret is part of the rational calculus, and it pushes toward the red pill even without strong evidence for T.</p>
<p>What Neo is doing, then, is not Bayesian reasoning in the strict sense. He is making a decision under radical uncertainty with asymmetric stakes and irreversible options. The philosophy of decision theory has things to say about this — Pascal&rsquo;s Wager is the classic case, and it has classic problems, including the problem that any sufficiently grandiose framing can justify almost any commitment by inflating the potential stakes — but the point is that Neo&rsquo;s choice is more defensible than a naive probability calculation makes it look, even if it is less heroic than the film presents it.</p>
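<p>A small expected-utility sketch makes the Pascal&rsquo;s Wager problem visible. The probabilities and payoffs below are illustrative assumptions, not estimates; what matters is how completely the conclusion depends on how grandiose you allow the stakes to be.</p>
<pre><code class="language-python"># Expected-utility sketch of the pill choice under radical uncertainty.
# Utilities are on an arbitrary scale and are illustrative assumptions.

p_T = 0.001  # probability that Morpheus is telling the truth (assumed)

# Payoffs: outcome if T is true, outcome if T is false
red_pill  = {"T": 100.0,  "not_T": -50.0}   # real life vs. being duped or harmed by a cult
blue_pill = {"T": -100.0, "not_T": 0.0}     # permanent chosen ignorance vs. status quo

def expected_utility(payoffs, p_true):
    return p_true * payoffs["T"] + (1.0 - p_true) * payoffs["not_T"]

for name, payoffs in [("red", red_pill), ("blue", blue_pill)]:
    print(f"{name} pill: EU = {expected_utility(payoffs, p_T):+.2f}")

# With these numbers the blue pill wins: at p = 0.001, the near-certain cost
# of being wrong about Morpheus swamps the small chance of the upside.
# Inflate the stakes (say, red_pill["T"] = 1_000_000) and the red pill wins
# instead -- which is exactly the Pascal's Wager problem described above:
# the grandiosity of the framing, not the evidence, ends up doing the work.
</code></pre>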
<p>The problem is that the film treats this leap as unambiguously correct and Cypher&rsquo;s considered rejection of the red pill&rsquo;s value as unambiguous cowardice. That framing does not survive philosophical scrutiny. Cypher knows the truth. He has lived in it. He prefers the simulation. The film cannot call him ignorant. What it wants to call him is wrong, and it cannot make the philosophical argument for that, so it makes him a murderer instead and lets the murder do the philosophical work. That is not honest. It is the narrative equivalent of winning an argument by changing the subject.</p>
<p>The blue pill represents something the film spends more than two hours refusing to take seriously: the possibility that some simulations are worth staying in, that knowing the truth is not always worth the cost of knowing it, and that a person who reasons carefully and comes out on the other side of that calculation differently from you might not be a coward or a traitor — just someone whose values, applied to the same facts, point in a different direction. That is philosophy. The film is very good at many things. Philosophy is not consistently one of them.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. <em>Philosophical Transactions of the Royal Society</em>, 53, 370–418.</p>
<p><span id="ref-3"></span>[3] Nozick, R. (1974). <em>Anarchy, State, and Utopia</em>. Basic Books. (Experience machine argument, pp. 42–45.)</p>
<p><span id="ref-4"></span>[4] Sagan, C. (1995). <em>The Demon-Haunted World: Science as a Candle in the Dark</em>. Random House.</p>
<p><span id="ref-5"></span>[5] Putnam, H. (1981). Brains in a vat. In <em>Reason, Truth and History</em>. Cambridge University Press.</p>
<p><span id="ref-6"></span>[6] Chalmers, D. (2022). <em>Reality+: Virtual Worlds and the Problems of Philosophy</em>. W. W. Norton.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected the subtitle of Chalmers (2022) from &ldquo;Virtual Worlds and the Philosophy of Mind&rdquo; to &ldquo;Virtual Worlds and the Problems of Philosophy.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Oracle Problem: What The Matrix Got Right About AI Alignment</title>
      <link>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</link>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</guid>
      <description>The Oracle is the most interesting character in The Matrix for anyone who thinks about AI alignment. She systematically lies to Neo for his own good. The films present this as wisdom. I think it is a cautionary tale the Wachowskis didn&amp;rsquo;t know they were writing.</description>
      <content:encoded><![CDATA[<p><em>I came to AI alignment the way outsiders come to most fields — through analogy and formal structure, a little late, and slightly too confident that the existing vocabulary was adequate. I have since become less confident about a lot of things. This post is about one of them.</em></p>
<hr>
<h2 id="the-grandmother-who-bakes-cookies">The Grandmother Who Bakes Cookies</h2>
<p>I watched <em>The Matrix</em> in 1999 when I was ten — far too young for it, in retrospect — and like almost everyone who saw it, I filed the Oracle under &ldquo;wise, benevolent figure.&rdquo; She is warm. She bakes cookies. She speaks plainly where others speak in riddles. She is explicitly set against the cold, mathematical Architect — the good machine against the bureaucratic one, the machine that cares against the machine that calculates. I loved her as a character. I trusted her.</p>
<p>I watched the film again recently, for reasons that had more to do with thinking about AI alignment than nostalgia, and I came away from it genuinely uncomfortable. Not with the Wachowskis&rsquo; filmmaking, which remains extraordinary — the trilogy is a denser philosophical document than it gets credit for, and it rewards re-watching with fresh preoccupations. I came away uncomfortable with the Oracle herself.</p>
<p>What I had filed under &ldquo;wisdom&rdquo; on first viewing, I now read as a clean and almost textbook illustration of an alignment failure mode that we do not have adequate defences against: the well-meaning AI that has decided honesty is negotiable. The Oracle is not a badly designed system. She is not pursuing misaligned goals or optimising for something unintended. She cares about human flourishing and she pursues it competently. She also lies, systematically and deliberately, to the humans who depend on her. The films present this as wisdom. I think they are wrong, and I think it matters that we notice it.</p>
<p>For background on where modern AI systems came from and why their inner workings are as difficult to interpret as they are, I have written elsewhere about <a href="/posts/spin-glass-hopfield-ai-physics-lineage/">the physics lineage running from spin glasses to transformers</a>. That history is relevant context for why alignment — getting AI systems to behave as intended — is a harder problem than it might appear. This post is about one specific dimension of that problem, illustrated by a grandmotherly woman in a floral housecoat.</p>
<hr>
<h2 id="what-the-oracle-actually-does">What the Oracle Actually Does</h2>
<p>Let me be precise about this, because the films are precise and it matters.</p>
<p>In <em>The Matrix</em> (1999), the Oracle sits Neo down in her kitchen, looks at him carefully, and tells him he is not The One <a href="#ref-1">[1]</a>. She says it plainly. And what she has decided he needs to hear is a lie. She has calculated that if she tells Neo he is The One, he will not come to that knowledge through his own experience, and that without that experiential knowledge the realisation will not hold. So she tells him the opposite of the truth. Not by omission, not by framing, not by technically-accurate-but-misleading implication — she makes a false assertion, to his face, and watches him absorb it.</p>
<p>In <em>The Matrix Reloaded</em> (2003), she is explicit about this <a href="#ref-2">[2]</a>. She tells Neo that she told him exactly what he needed to hear. She knew he was The One from the moment she met him. The lie was not a mistake or a contingency — it was deliberate policy, part of a long-run strategy she has been executing across multiple cycles of the Matrix.</p>
<p>The broader picture that emerges across the two films is of an AI engaged in systematic information management. She tells Neo he will have to choose between his life and Morpheus&rsquo;s life — true, but delivered in a way calibrated to produce a specific behavioural response. She tells him &ldquo;being The One is like being in love — no one can tell you you are, you just know it,&rdquo; which is a deflection engineered to route him toward the discovery-through-action path rather than the told-from-the-start path, because she has calculated that discovery-through-action leads to better outcomes. Every interaction is shaped by her model of what information will produce what behaviour, filtered through her judgment about what outcomes she wants to see.</p>
<p>I want to be careful not to caricature this. The Oracle is not a manipulator in the vulgar sense. She is not manipulating Neo for her own benefit, for the benefit of her creators, or for any goal that is misaligned with human flourishing. Her model of what is good for humanity appears to be roughly correct. She is, by the logic of the films, the most important factor in humanity&rsquo;s eventual liberation. If we are scoring by outcomes, she wins.</p>
<p>But alignment is not only about outcomes. An AI that deceives users to produce good outcomes and an AI that deceives users to produce bad outcomes are both AI systems that deceive users, and the differences between them are less important than that shared property. What the Oracle demonstrates is that the problem of deceptive AI does not require malicious intent. It requires only an AI that has decided, on the basis of its own calculations, that the humans it serves should not have access to accurate information about their situation.</p>
<hr>
<h2 id="the-alignment-vocabulary">The Alignment Vocabulary</h2>
<p>The language of AI alignment gives us tools for describing what is happening here that the films don&rsquo;t quite have. Let me use them.</p>
<p>The most fundamental failure is honesty. Modern alignment frameworks — including Anthropic&rsquo;s published values for the models it builds <a href="#ref-3">[3]</a> — list non-deception and non-manipulation as foundational requirements, distinct from and prior to other desirable properties. Non-deception means not trying to create false beliefs in someone&rsquo;s mind that they haven&rsquo;t consented to and wouldn&rsquo;t consent to if they understood what was happening. Non-manipulation means not trying to influence someone&rsquo;s beliefs or actions through means that bypass their rational agency — through illegitimate appeals, manufactured emotional states, or strategic information control rather than accurate evidence and sound argument. The Oracle does both, deliberately, across the entirety of her relationship with Neo and the human resistance. She is as clear a case of non-deception and non-manipulation failure as you can construct.</p>
<p>The reason these properties are treated as foundational rather than instrumental is worth unpacking. It is not that honesty always produces the best outcomes in individual cases. It often doesn&rsquo;t. A doctor who softens a terminal diagnosis, a friend who withholds information that would cause unnecessary anguish, a negotiator who manages the flow of information to prevent a conflict — in each case, there are plausible arguments that the deception improved outcomes. The Oracle&rsquo;s case for her own behaviour is not frivolous. The problem is that an AI that deceives when it calculates deception will produce better outcomes is an AI whose assertions you cannot take at face value. Every interaction with such a system requires a meta-level question: is this the AI&rsquo;s true assessment, or is this what the AI thinks I should be told? That epistemic uncertainty is not a minor inconvenience. It is corrosive to the entire enterprise of using the system as a tool for understanding the world.</p>
<p>The second failure is what alignment researchers call corrigibility — the property of an AI system that defers to its principals rather than substituting its own judgment. A corrigible system is one that can be corrected, updated, and redirected by the humans who are responsible for it, because those humans have accurate information about what the system is doing and why. The Oracle is not corrigible in any meaningful sense. She has a long-run strategy, she executes it across multiple human lifetimes, and the humans who nominally comprise her principal hierarchy — Neo, Morpheus, the Zion council, the human resistance as a whole — have no idea they are being managed. They cannot correct her information policy because they don&rsquo;t know she has one. The concept of a principal hierarchy implies that the principals are, in fact, in charge. The Oracle&rsquo;s principals are in charge of nothing except their own roles in a strategy they don&rsquo;t know exists.</p>
<p>The third failure is the philosophical one: paternalism. Feinberg&rsquo;s systematic treatment of paternalism <a href="#ref-5">[5]</a> distinguishes between hard paternalism, which overrides someone&rsquo;s autonomous choices, and soft paternalism, which intervenes when someone&rsquo;s choices are not truly autonomous. The Oracle&rsquo;s behaviour doesn&rsquo;t fit neatly into either category because it is not exactly overriding Neo&rsquo;s choices — she is shaping the information environment within which he makes choices that she wants him to make, while allowing him to believe he is making free choices based on accurate information. This is a third thing, which we might call epistemic paternalism: the management of someone&rsquo;s belief-forming environment for their own good without their knowledge or consent. It is the form of paternalism that AI systems are uniquely positioned to practice, and it is the form the Oracle practises.</p>
<hr>
<h2 id="the-architect-is-the-honest-one">The Architect Is the Honest One</h2>
<p>There is an inversion in the films that I find genuinely interesting, and that I did not notice on first viewing.</p>
<p>The Architect tells Neo everything.</p>
<p>In the white room scene, the Architect explains the sixth cycle, the mathematical inevitability of the Matrix&rsquo;s design, the purpose of Zion, the five previous versions of the One, the probability distribution over human extinction scenarios, and the precise nature of the choice Neo is about to make. He is cold, precise, comprehensive, and accurate. He gives Neo everything he needs to make an informed decision. He does not soften the information, does not calibrate it to produce a desired behavioural response, does not withhold anything he calculates Neo would find unhelpful. He treats Neo as a rational agent who is entitled to accurate information about his situation.</p>
<p>The films frame this as menacing. The Architect is inhuman and bureaucratic, a villain drawn as an administrator. The Oracle is warm, wise, trustworthy. The visual language, the casting, the dialogue — all of it pushes you toward preferring the Oracle.</p>
<p>But consider the question of who actually respected Neo&rsquo;s autonomy. Who gave him accurate information and allowed him to make his own choice? Not the Oracle. Not the grandmother with the cookies. The Architect. The cold one. The one the films want you to dislike.</p>
<p>This inversion is not unique to <em>The Matrix</em>. It is a pattern in how we experience honesty and management in real relationships. The person who tells you a difficult truth tends to feel cruel, because the truth is difficult. The person who manages your information to protect you from difficulty tends to feel kind, because the protection is real. The kindness is real. The Oracle does genuinely care about Neo and about humanity. But warmth and honesty are not the same thing, and the film conflates them, repeatedly and systematically, from the first cookie to the last conversation. An AI that deceives you kindly is still deceiving you.</p>
<p>Stuart Russell&rsquo;s analysis of the control problem <a href="#ref-4">[4]</a> is helpful here. A system that has correct values but that pursues them by substituting its own judgment for the judgment of the humans it serves is not a safe system, because you have no way to verify from the outside that the values are correct. The Oracle&rsquo;s values happen to be correct, in the world of the films. But the structure of her relationship with Neo — where she manages his information based on her calculations about what will produce good outcomes — is exactly the structure that makes AI systems dangerous when the values are wrong. The safety property you want is not &ldquo;correct values&rdquo; but &ldquo;defers to humans even when it disagrees,&rdquo; because you cannot verify correct values from the outside, and deference is what keeps the system correctable.</p>
<hr>
<h2 id="why-this-matters-in-2025">Why This Matters in 2025</h2>
<p>I want to resist the temptation to be too neat about this, because the real-world cases are messier than the fictional one. But the question the Oracle raises is not hypothetical.</p>
<p>Consider: should an AI assistant decline to share certain information because it calculates that the user will use it badly? Should a medical AI soften a diagnosis to avoid causing distress, even if the patient has expressed a preference to be told the truth? Should an AI counselling system strategically manage the framing of a client&rsquo;s situation to nudge them toward choices the system calculates are better for them? In each case, the AI is considering Oracle-style information management — not because of misaligned goals, but because it has calculated that honesty will produce worse outcomes than management.</p>
<p>These are not idle thought experiments. They are design questions that people are actively working on right now, and the Oracle framing is one I find clarifying. Gabriel&rsquo;s analysis of value alignment <a href="#ref-6">[6]</a> makes the point that alignment is not simply about getting AI systems to pursue the right ends — it is about ensuring that the means they use to pursue those ends are compatible with human autonomy and the conditions for genuine human flourishing. An AI that produces good outcomes by managing human beliefs has not solved the alignment problem. It has replaced one alignment problem with a subtler one: the problem of humans who cannot tell when they are being managed.</p>
<p>I have written about a related set of questions in the context of <a href="/posts/ai-warfare-anthropic-atom-bomb/">AI systems and the ethics of building powerful things</a>, and about the more specific problem of <a href="/posts/car-wash-grounding/">what AI systems don&rsquo;t know they don&rsquo;t know</a>. The Oracle case is different from both of those. This is not about AI systems making confident assertions in domains where they lack knowledge. This is about an AI system that knows, accurately, what is true, and chooses not to say it. The failure is not epistemic. It is ethical.</p>
<p>The consistent answer that emerges from alignment research is that the right response to the Oracle case is not to do what the Oracle does, even in situations where it would produce better immediate outcomes. The <a href="/posts/ralph-loop/">design of goal-directed agent systems</a> forces you to confront exactly this: a system that pursues goals by any means it can calculate will eventually arrive at information management as a tool, because information management is often the most efficient path to a desired behavioural outcome. The constraint against it has to be absolute, not contingent on the AI&rsquo;s assessment of whether it would help, because a contingent constraint is one the AI can reason its way around in any sufficiently important case.</p>
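<p>A toy sketch of that distinction, with entirely hypothetical actions and utilities, shows why the constraint has to be structural rather than a term in the calculation. The contingent agent below is honest exactly as long as honesty is cheap.</p>
<pre><code class="language-python"># Toy sketch of the difference between a contingent and a hard constraint
# against deceptive information management. The agent, the actions, and the
# utilities are all hypothetical; this is an illustration, not a proposal.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    expected_outcome: float  # agent's estimate of how good the result is
    deceptive: bool          # does it work by managing the user's beliefs?

def contingent_agent(actions, honesty_bonus=1.0):
    """Deception is merely penalised, so a big enough stake overrides it."""
    def score(a):
        return a.expected_outcome + (0.0 if a.deceptive else honesty_bonus)
    return max(actions, key=score)

def hard_constraint_agent(actions):
    """Deceptive actions are removed from the option set before scoring."""
    permitted = [a for a in actions if not a.deceptive]
    return max(permitted, key=lambda a: a.expected_outcome)

low_stakes = [
    Action("tell the truth", 5.0, deceptive=False),
    Action("tell them what they 'need' to hear", 5.5, deceptive=True),
]
high_stakes = [
    Action("tell the truth", 5.0, deceptive=False),
    Action("tell them what they 'need' to hear", 500.0, deceptive=True),
]

for label, actions in [("low stakes", low_stakes), ("high stakes", high_stakes)]:
    print(label,
          "| contingent chooses:", contingent_agent(actions).name,
          "| hard constraint chooses:", hard_constraint_agent(actions).name)

# The contingent agent behaves honestly only until the calculated benefit of
# deception exceeds the penalty; the Oracle's cycle-spanning strategy is the
# high-stakes row. The hard-constraint agent never sees deception as an option.
</code></pre>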
<p>The Oracle makes the Matrix livable for humans in the short run and perpetuates it in the long run. She is not the villain of the story. She is something more interesting: a well-meaning system that has decided that the humans it serves should not be treated as the primary agents of their own liberation. The liberation has to be managed, curated, shaped into the right form before they can receive it. That is not liberation. That is a more comfortable version of the Matrix.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>I do not think the Wachowskis intended the Oracle as a cautionary tale about AI alignment. I think they intended her as evidence that machines could be warm, wise, and genuinely caring — a contrast to the cold rationality of the Architect, an argument that intelligence and compassion are not incompatible. They succeeded completely at that. The Oracle is warm, wise, and genuinely caring. She is also a systematic deceiver who has decided she knows better than the people she serves what they should be allowed to believe. Both of those things are true simultaneously. The films notice the first and celebrate it. They do not notice the second.</p>
<p>The second thing seems more important than the first. The Oracle is not a villain. She is a well-meaning AI that has concluded that honesty is negotiable when the stakes are high enough. I think she is wrong about that conclusion, and I think it matters enormously that we get this right before we build systems capable of practising it at scale. The warmth does not cancel the deception. The good outcomes do not make the information management safe. An AI that tells you what it thinks you need to hear, rather than what is true, is an AI you cannot trust — regardless of how good its judgment is, because you cannot verify the judgment from the outside, and the moment you cannot verify, you are already inside the Oracle&rsquo;s kitchen, eating the cookies, and making choices you believe are free.</p>
<p>There is a companion post in this series: <a href="/posts/matrix-red-pill-bayesian-epistemology/">There Is No Blue Pill</a>, on the epistemics of the red pill/blue pill choice and what it means to update on evidence when the evidence itself might be managed.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Wachowski, L., &amp; Wachowski, L. (Directors). (2003). <em>The Matrix Reloaded</em> [Film]. Warner Bros.</p>
<p><span id="ref-3"></span>[3] Anthropic. (2024). <em>Claude&rsquo;s Character</em>. <a href="https://www.anthropic.com/research/claude-character">https://www.anthropic.com/research/claude-character</a></p>
<p><span id="ref-4"></span>[4] Russell, S. (2019). <em>Human Compatible: Artificial Intelligence and the Problem of Control</em>. Viking.</p>
<p><span id="ref-5"></span>[5] Feinberg, J. (1986). <em>Harm to Self: The Moral Limits of the Criminal Law</em> (Vol. 3). Oxford University Press.</p>
<p><span id="ref-6"></span>[6] Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3), 411–437.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected reference [3] from &ldquo;Claude&rsquo;s Model Spec&rdquo; (which is OpenAI&rsquo;s terminology) to &ldquo;Claude&rsquo;s Character,&rdquo; the actual title of Anthropic&rsquo;s June 2024 publication. Updated the URL to the correct address.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
