<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Reasoning on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/reasoning/</link>
    <description>Recent content in Reasoning on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 12 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/reasoning/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>["straw", "berry"]</code> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$ by matching the
surface pattern &ldquo;should I [verb of locomotion] to [location]?&rdquo;, which
in the training data generates responses about modes of transport. It is
a natural pattern-match. It is the wrong one.</p>
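<p>The argmax above can be made concrete with a toy Bayesian listener that
scores each reading as $P(u \mid i) \cdot P(i \mid \text{context})$. Everything
below is illustrative: the two interpretations, the likelihoods, and the
priors are invented numbers, not quantities any real model exposes.</p>

```python
# Toy pragmatic listener: P(i | u, context) ∝ P(u | i) * P(i | context).
# All probabilities are invented for illustration.

def pragmatic_listener(likelihoods, prior):
    """Return normalised interpretation probabilities from P(u|i) and a prior P(i)."""
    scores = {i: likelihoods[i] * prior[i] for i in likelihoods}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}

# Two readings of "should I drive to the car wash?"
likelihoods = {"transport_choice": 0.5, "wash_timing": 0.5}

# A context-free prior — roughly what a surface pattern-match amounts to:
flat_prior = {"transport_choice": 0.6, "wash_timing": 0.4}

# A contextually informed prior (speaker standing next to a dirty car):
context_prior = {"transport_choice": 0.05, "wash_timing": 0.95}

without_context = pragmatic_listener(likelihoods, flat_prior)
with_context = pragmatic_listener(likelihoods, context_prior)

print(max(without_context, key=without_context.get))  # transport_choice
print(max(with_context, key=with_context.get))        # wash_timing
```

<p>The point of the sketch is only that the argmax flips when the prior
encodes context: the likelihood term never changes, so the entire
disambiguation burden falls on $P(i \mid \text{context})$, which is exactly
the quantity a text-only model lacks.</p>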
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
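<p>The gap can be shown in a few lines. The split below mimics the commonly
cited &ldquo;straw&rdquo; + &ldquo;berry&rdquo; segmentation; real BPE vocabularies differ
by model, and the token IDs here are invented.</p>

```python
# Sketch of the representation gap: counting "r" over the raw string works,
# but over token identifiers the characters are not directly addressable.
# The split mimics the commonly cited BPE split; real vocabularies differ.

text = "strawberry"
tokens = ["straw", "berry"]   # what the model "sees": opaque units
token_ids = [4321, 8765]      # hypothetical vocabulary indices

# Character-level count: trivially correct.
print(text.count("r"))  # 3

# Token-ID-level view: the integer 4321 carries no accessible "r".
# Any per-token character count would have to be memorised, not read off.
assert "".join(tokens) == text  # the information existed before tokenisation
```

<p>The assertion is the whole argument: the characters are recoverable from
the token sequence in principle, but the model operates on the IDs, not on
the concatenated string.</p>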
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
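<p>The two entropy terms can be computed for a toy version of the car wash
question. The 30/70 rain split and the deterministic answer rule are
invented for illustration; the structure is what matters.</p>

```python
import math

# Toy joint distribution for the car wash question (a single fixed X).
# P(C): invented — it is raining for 30% of askers.
# f(X, C): the answer is fully determined once the context is known.
p_context = {"raining": 0.3, "dry": 0.7}
answer_given_context = {"raining": "no", "dry": "yes"}

def entropy(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# P(A | X): marginalise the context out, as the ungrounded model must.
p_answer = {}
for c, pc in p_context.items():
    a = answer_given_context[c]
    p_answer[a] = p_answer.get(a, 0.0) + pc

h_a_given_x = entropy(p_answer)           # residual uncertainty without C
h_a_given_xc = 0.0                         # deterministic once C is known
mutual_info = h_a_given_x - h_a_given_xc   # I(A; C | X)

print(round(mutual_info, 3))  # 0.881
```

<p>For this toy question essentially all of the answer&rsquo;s entropy
(≈0.881 bits) is carried by the context. Replace the answer rule with a
context-independent one — &ldquo;what is the capital of France?&rdquo; — and the
same computation gives zero.</p>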
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
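<p>Calibration has a standard measurement: expected calibration error, the
bin-weighted gap between stated confidence and observed accuracy, in the
spirit of Guo et al. (2017). The predictions below are invented to model
the car wash behaviour — high confidence, coin-flip accuracy.</p>

```python
# Minimal expected-calibration-error (ECE) sketch; data is invented.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A model that is always 95% confident but right only half the time:
confs = [0.95] * 10
right = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, right), 2))  # 0.45
```

<p>An ECE of 0.45 on a 0–1 scale is about as miscalibrated as a binary
predictor gets; a model that said &ldquo;I don&rsquo;t know the weather&rdquo; and
withheld a confident answer would score near zero.</p>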
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This problem is not solved by making the model bigger or training it
on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
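<p>The routing logic the two conditions describe can be sketched in a few
lines. Everything here is a hypothetical stand-in — the keyword
classifier, the tool registry, and the stubbed weather lookup are not a
real API; condition 1 is precisely the part the keyword check fakes.</p>

```python
# Sketch of tool-gated answering: answer only when the question's answer
# does not depend on world state we cannot retrieve. All names are
# hypothetical stand-ins, not a real tool-use API.

AVAILABLE_TOOLS = {"weather": lambda: "raining"}  # stubbed weather lookup

def needed_context(question):
    """Hypothetical classifier: which context variables does the answer need?"""
    if "today" in question or "right now" in question:
        return ["weather"]
    return []

def answer(question):
    needed = needed_context(question)
    missing = [c for c in needed if c not in AVAILABLE_TOOLS]
    if missing:
        return f"I can't answer: I lack {', '.join(missing)}."
    facts = {c: AVAILABLE_TOOLS[c]() for c in needed}
    if facts.get("weather") == "raining":
        return "No - it is raining right now."
    return "Yes, conditions look fine."

print(answer("Should I drive to the car wash today?"))  # No - it is raining right now.
```

<p>Note where the sketch is weakest: <code>needed_context</code> is doing the
meta-level work of recognising the information gap, which is exactly the
unsolved part. A keyword check stands in for it here only because nothing
better fits in ten lines.</p>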
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). On
calibration of modern neural networks. <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
