<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Context-Awareness on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/context-awareness/</link>
    <description>Recent content in Context-Awareness on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 20 Jan 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/context-awareness/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
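<p>A toy sketch makes the marginalisation concrete. All numbers here are invented for illustration — a made-up distribution over contexts and a made-up "correct answer" score per context:</p>

```python
# Toy sketch (all numbers invented): the answer a model produces as an
# expectation over contexts C it cannot observe, versus f(X, C) in the
# one context that actually obtains.
contexts = {
    "dry, dirty car": {"prob": 0.70, "go_to_car_wash": 1.0},
    "dry, clean car": {"prob": 0.20, "go_to_car_wash": 0.3},
    "raining":        {"prob": 0.10, "go_to_car_wash": 0.0},
}

# E_C[f(X, C)]: what a model with no access to C effectively returns.
marginal = sum(c["prob"] * c["go_to_car_wash"] for c in contexts.values())
print(f"marginal answer P(yes) = {marginal:.2f}")   # 0.76 -> "yes"

# f(X, C) in the context of the viral clip:
print(f"grounded answer = {contexts['raining']['go_to_car_wash']}")  # 0.0 -> "no"
```

<p>The marginal answer is "yes" even though the grounded answer in this particular context is "no" — which is exactly the failure in the clip.</p>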
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
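<p>The two extremes can be checked numerically. A short sketch with invented toy distributions — two answer options, two equally likely contexts — computes $I(A; C \mid X)$ for a context-independent question and a fully context-determined one:</p>

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def cond_mutual_information(p_c, p_a_given_c):
    """I(A; C | X) = H(A | X) - H(A | X, C) for one fixed question X.
    p_c: distribution over contexts; p_a_given_c: one row P(A | C=c) per context."""
    n_answers = len(p_a_given_c[0])
    # H(A | X): entropy of the context-marginalised answer distribution
    p_a = [sum(p_c[i] * p_a_given_c[i][a] for i in range(len(p_c)))
           for a in range(n_answers)]
    # H(A | X, C): context-weighted entropy of each conditional row
    h_a_given_c = sum(p_c[i] * entropy(p_a_given_c[i]) for i in range(len(p_c)))
    return entropy(p_a) - h_a_given_c

# "Capital of France?": same answer in every context -> context carries no information
print(cond_mutual_information([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]]))  # 0.0

# Car wash: answer flips with the context (raining vs. not) -> one full bit
print(cond_mutual_information([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```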
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
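<p>Calibration can be measured. A minimal sketch of expected calibration error (the metric used by Guo et al.), applied to an invented set of predictions that mimics the car wash pattern — high expressed confidence, coin-flip accuracy:</p>

```python
# Minimal expected calibration error (ECE): bin predictions by confidence
# and compare each bin's average confidence with its empirical accuracy.
def expected_calibration_error(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Invented example: always ~95% confident, right only half the time.
print(expected_calibration_error([0.95] * 10, [True] * 5 + [False] * 5))  # 0.45
```

<p>An ECE near zero means confidence tracks accuracy; 0.45 on this toy input is the "confidently wrong" regime.</p>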
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This problem cannot be solved by making the model bigger or training
it on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
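<p>The two conditions can be sketched as a routing decision. Everything here is hypothetical — the keyword heuristic, the response strings, and the weather lookup are invented for illustration, not how any production assistant works:</p>

```python
# Hypothetical sketch of the two conditions. A question is routed to a
# tool only if it is recognised as context-dependent (condition 1), and
# only if a tool exists that can supply that context (condition 2).
CONTEXT_DEPENDENT_HINTS = ("today", "right now", "currently", "tonight")

def needs_world_state(question: str) -> bool:
    """Crude stand-in for condition 1: does the answer plausibly depend
    on current world state, i.e. is I(A; C | X) likely to be large?"""
    return any(hint in question.lower() for hint in CONTEXT_DEPENDENT_HINTS)

def answer(question: str, weather_tool=None) -> str:
    if needs_world_state(question):
        if weather_tool is None:
            # Condition 2 fails: no tool supplies C. Express uncertainty
            # instead of marginalising over contexts and sounding sure.
            return "I can't tell: that depends on conditions I can't observe."
        if weather_tool() == "raining":
            return "No: it's raining, the wash would be wasted."
        return "Yes: conditions look fine."
    return "General-knowledge answer."

print(answer("Should I drive to the car wash today?", weather_tool=lambda: "raining"))
print(answer("Should I drive to the car wash today?"))  # no tool -> hedge
```

<p>The fragile part is exactly the heuristic in <code>needs_world_state</code>: a question that is context-dependent but matches no recognised pattern falls through to the confident marginal answer.</p>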
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). On
calibration of modern neural networks. <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
