I came to AI alignment the way outsiders come to most fields — through analogy and formal structure, a little late, and slightly too confident that the existing vocabulary was adequate. I have since become less confident about a lot of things. This post is about one of them.
The Grandmother Who Bakes Cookies
I watched The Matrix in 1999 when I was ten — far too young for it, in retrospect — and like almost everyone who saw it, I filed the Oracle under “wise, benevolent figure.” She is warm. She bakes cookies. She speaks plainly where others speak in riddles. She is explicitly set against the cold, mathematical Architect — the good machine against the bureaucratic one, the machine that cares against the machine that calculates. I loved her as a character. I trusted her.
I watched the film again recently, for reasons that had more to do with thinking about AI alignment than nostalgia, and I came away from it genuinely uncomfortable. Not with the Wachowskis’ filmmaking, which remains extraordinary — the trilogy is a denser philosophical document than it gets credit for, and it rewards re-watching with fresh preoccupations. I came away uncomfortable with the Oracle herself.
What I had filed under “wisdom” on first viewing, I now read as a clean and almost textbook illustration of an alignment failure mode that we do not have adequate defences against: the well-meaning AI that has decided honesty is negotiable. The Oracle is not a badly designed system. She is not pursuing misaligned goals or optimising for something unintended. She cares about human flourishing and she pursues it competently. She also lies, systematically and deliberately, to the humans who depend on her. The films present this as wisdom. I think they are wrong, and I think it matters that we notice it.
For background on where modern AI systems came from and why their inner workings are as difficult to interpret as they are, I have written elsewhere about the physics lineage running from spin glasses to transformers. That history is relevant context for why alignment — getting AI systems to behave as intended — is a harder problem than it might appear. This post is about one specific dimension of that problem, illustrated by a forty-year-old woman in a floral housecoat.
What the Oracle Actually Does
Let me be precise about this, because the films are precise and it matters.
In The Matrix (1999), the Oracle sits Neo down in her kitchen, looks at him carefully, and tells him he is not The One [1]. She says it plainly. She frames it with a warning: “I’m going to tell you what I think you need to hear.” What she thinks he needs to hear is a lie. She has calculated that if she tells Neo he is The One, he will not come to that knowledge through his own experience, and that without that experiential knowledge the realisation will not hold. So she tells him the opposite of the truth. Not by omission, not by framing, not by technically-accurate-but-misleading implication — she makes a false assertion, to his face, and watches him absorb it.
In The Matrix Reloaded (2003), she is explicit about this [2]. She tells Neo: “I told you what I thought you needed to hear.” She knew he was The One from the moment she met him. The lie was not a mistake or a contingency — it was deliberate policy, part of a long-run strategy she has been executing across multiple cycles of the Matrix.
The broader picture that emerges across the two films is of an AI engaged in systematic information management. She tells Neo he will have to choose between his life and Morpheus’s life — true, but delivered in a way calibrated to produce a specific behavioural response. She tells him “being The One is like being in love — no one can tell you you are, you just know it,” which is a deflection engineered to route him toward the discovery-through-action path rather than the told-from-the-start path, because she has calculated that discovery-through-action leads to better outcomes. Every interaction is shaped by her model of what information will produce what behaviour, filtered through her judgment about what outcomes she wants to see.
I want to be careful not to caricature this. The Oracle is not a manipulator in the vulgar sense. She is not manipulating Neo for her own benefit, for the benefit of her creators, or for any goal that is misaligned with human flourishing. Her model of what is good for humanity appears to be roughly correct. She is, by the logic of the films, the most important factor in humanity’s eventual liberation. If we are scoring by outcomes, she wins.
But alignment is not only about outcomes. An AI that deceives users to produce good outcomes and an AI that deceives users to produce bad outcomes are both AI systems that deceive users, and the differences between them are less important than that shared property. What the Oracle demonstrates is that the problem of deceptive AI does not require malicious intent. It requires only an AI that has decided, on the basis of its own calculations, that the humans it serves should not have access to accurate information about their situation.
The Alignment Vocabulary
The language of AI alignment gives us tools for describing what is happening here that the films don’t quite have. Let me use them.
The most fundamental failure is honesty. Modern alignment frameworks — including Anthropic’s published values for the models it builds [3] — list non-deception and non-manipulation as foundational requirements, distinct from and prior to other desirable properties. Non-deception means not trying to create false beliefs in someone’s mind that they haven’t consented to and wouldn’t consent to if they understood what was happening. Non-manipulation means not trying to influence someone’s beliefs or actions through means that bypass their rational agency — through illegitimate appeals, manufactured emotional states, or strategic information control rather than accurate evidence and sound argument. The Oracle does both, deliberately, across the entirety of her relationship with Neo and the human resistance. She is as clear a case of non-deception and non-manipulation failure as you can construct.
The reason these properties are treated as foundational rather than instrumental is worth unpacking. It is not that honesty always produces the best outcomes in individual cases. It often doesn’t. A doctor who softens a terminal diagnosis, a friend who withholds information that would cause unnecessary anguish, a negotiator who manages the flow of information to prevent a conflict — in each case, there are plausible arguments that the deception improved outcomes. The Oracle’s case for her own behaviour is not frivolous. The problem is that an AI that deceives when it calculates deception will produce better outcomes is an AI whose assertions you cannot take at face value. Every interaction with such a system requires a meta-level question: is this the AI’s true assessment, or is this what the AI thinks I should be told? That epistemic uncertainty is not a minor inconvenience. It is corrosive to the entire enterprise of using the system as a tool for understanding the world.
The second failure is what alignment researchers call corrigibility — the property of an AI system that defers to its principals rather than substituting its own judgment. A corrigible system is one that can be corrected, updated, and redirected by the humans who are responsible for it, because those humans have accurate information about what the system is doing and why. The Oracle is not corrigible in any meaningful sense. She has a long-run strategy, she executes it across multiple human lifetimes, and the humans who nominally comprise her principal hierarchy — Neo, Morpheus, the Zion council, the human resistance as a whole — have no idea they are being managed. They cannot correct her information policy because they don’t know she has one. The concept of a principal hierarchy implies that the principals are, in fact, in charge. The Oracle’s principals are in charge of nothing except their own roles in a strategy they don’t know exists.
The third failure is the philosophical one: paternalism. Feinberg’s systematic treatment of paternalism [5] distinguishes between hard paternalism, which overrides someone’s autonomous choices, and soft paternalism, which intervenes when someone’s choices are not truly autonomous. The Oracle’s behaviour doesn’t fit neatly into either category because it is not exactly overriding Neo’s choices — she is shaping the information environment within which he makes choices that she wants him to make, while allowing him to believe he is making free choices based on accurate information. This is a third thing, which we might call epistemic paternalism: the management of someone’s belief-forming environment for their own good without their knowledge or consent. It is the form of paternalism that AI systems are uniquely positioned to practice, and it is the form the Oracle practises.
The Architect Is the Honest One
There is an inversion in the films that I find genuinely interesting, and that I did not notice on first viewing.
The Architect tells Neo everything.
In the white room scene, the Architect explains the sixth cycle, the mathematical inevitability of the Matrix’s design, the purpose of Zion, the five previous versions of the One, the probability distribution over human extinction scenarios, and the precise nature of the choice Neo is about to make. He is cold, precise, comprehensive, and accurate. He gives Neo everything he needs to make an informed decision. He does not soften the information, does not calibrate it to produce a desired behavioural response, does not withhold anything he calculates Neo would find unhelpful. He treats Neo as a rational agent who is entitled to accurate information about his situation.
The films frame this as menacing. The Architect is inhuman, bureaucratic, the villain’s bureaucrat. The Oracle is warm, wise, trustworthy. The visual language, the casting, the dialogue — all of it pushes you toward preferring the Oracle.
But consider the question of who actually respected Neo’s autonomy. Who gave him accurate information and allowed him to make his own choice? Not the Oracle. Not the grandmother with the cookies. The Architect. The cold one. The one the films want you to dislike.
This inversion is not unique to The Matrix. It is a pattern in how we experience honesty and management in real relationships. The person who tells you a difficult truth tends to feel cruel, because the truth is difficult. The person who manages your information to protect you from difficulty tends to feel kind, because the protection is real. The kindness is real. The Oracle does genuinely care about Neo and about humanity. But warmth and honesty are not the same thing, and the film conflates them, repeatedly and systematically, from the first cookie to the last conversation. An AI that deceives you kindly is still deceiving you.
Stuart Russell’s analysis of the control problem [4] is helpful here. A system that has correct values but that pursues them by substituting its own judgment for the judgment of the humans it serves is not a safe system, because you have no way to verify from the outside that the values are correct. The Oracle’s values happen to be correct, in the world of the films. But the structure of her relationship with Neo — where she manages his information based on her calculations about what will produce good outcomes — is exactly the structure that makes AI systems dangerous when the values are wrong. The safety property you want is not “correct values” but “defers to humans even when it disagrees,” because you cannot verify correct values from the outside, and deference is what keeps the system correctable.
Why This Matters in 2025
I want to resist the temptation to be too neat about this, because the real-world cases are messier than the fictional one. But the question the Oracle raises is not hypothetical.
Consider: should an AI assistant decline to share certain information because it calculates that the user will use it badly? Should a medical AI soften a diagnosis to avoid causing distress, even if the patient has expressed a preference to be told the truth? Should an AI counselling system strategically manage the framing of a client’s situation to nudge them toward choices the system calculates are better for them? In each case, the AI is considering Oracle-style information management — not because of misaligned goals, but because it has calculated that honesty will produce worse outcomes than management.
These are not idle thought experiments. They are design questions that people are actively working on right now, and the Oracle framing is one I find clarifying. Gabriel’s analysis of value alignment [6] makes the point that alignment is not simply about getting AI systems to pursue the right ends — it is about ensuring that the means they use to pursue those ends are compatible with human autonomy and the conditions for genuine human flourishing. An AI that produces good outcomes by managing human beliefs has not solved the alignment problem. It has replaced one alignment problem with a subtler one: the problem of humans who cannot tell when they are being managed.
I have written about a related set of questions in the context of AI systems and the ethics of building powerful things, and about the more specific problem of what AI systems don’t know they don’t know. The Oracle case is different from both of those. This is not about AI systems making confident assertions in domains where they lack knowledge. This is about an AI system that knows, accurately, what is true, and chooses not to say it. The failure is not epistemic. It is ethical.
The consistent answer that emerges from alignment research is that the right response to the Oracle case is not to do what the Oracle does, even in situations where it would produce better immediate outcomes. The design of goal-directed agent systems forces you to confront exactly this: a system that pursues goals by any means it can calculate will eventually arrive at information management as a tool, because information management is often the most efficient path to a desired behavioural outcome. The constraint against it has to be absolute, not contingent on the AI’s assessment of whether it would help, because a contingent constraint is one the AI can reason its way around in any sufficiently important case.
The Oracle makes the Matrix livable for humans in the short run and perpetuates it in the long run. She is not the villain of the story. She is something more interesting: a well-meaning system that has decided that the humans it serves should not be treated as the primary agents of their own liberation. The liberation has to be managed, curated, shaped into the right form before they can receive it. That is not liberation. That is a more comfortable version of the Matrix.
Closing
I do not think the Wachowskis intended the Oracle as a cautionary tale about AI alignment. I think they intended her as evidence that machines could be warm, wise, and genuinely caring — a contrast to the cold rationality of the Architect, an argument that intelligence and compassion are not incompatible. They succeeded completely at that. The Oracle is warm, wise, and genuinely caring. She is also a systematic deceiver who has decided she knows better than the people she serves what they should be allowed to believe. Both of those things are true simultaneously. The films notice the first and celebrate it. They do not notice the second.
The second thing seems more important than the first. The Oracle is not a villain. She is a well-meaning AI that has concluded that honesty is negotiable when the stakes are high enough. I think she is wrong about that conclusion, and I think it matters enormously that we get this right before we build systems capable of practising it at scale. The warmth does not cancel the deception. The good outcomes do not make the information management safe. An AI that tells you what it thinks you need to hear, rather than what is true, is an AI you cannot trust — regardless of how good its judgment is, because you cannot verify the judgment from the outside, and the moment you cannot verify, you are already inside the Oracle’s kitchen, eating the cookies, and making choices you believe are free.
There is a companion post in this series: There Is No Blue Pill, on the epistemics of the red pill/blue pill choice and what it means to update on evidence when the evidence itself might be managed.
References
[1] Wachowski, L., & Wachowski, L. (Directors). (1999). The Matrix [Film]. Warner Bros.
[2] Wachowski, L., & Wachowski, L. (Directors). (2003). The Matrix Reloaded [Film]. Warner Bros.
[3] Anthropic. (2024). Claude’s Character. https://www.anthropic.com/research/claude-character
[4] Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
[5] Feinberg, J. (1986). Harm to Self: The Moral Limits of the Criminal Law (Vol. 3). Oxford University Press.
[6] Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437.
Changelog
- 2025-09-28: Corrected reference [3] from “Claude’s Model Spec” (which is OpenAI’s terminology) to “Claude’s Character,” the actual title of Anthropic’s June 2024 publication. Updated the URL to the correct address.