<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Context-Awareness on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/context-awareness/</link>
    <description>Recent content in Context-Awareness on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 20 Jan 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/context-awareness/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
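<p>A toy sketch makes the marginalisation concrete. All numbers here are invented for illustration — a made-up distribution over contexts and a made-up "correct answer" score per context:</p>

```python
# Toy sketch (all numbers invented): the answer a model produces as an
# expectation over contexts C it cannot observe, versus f(X, C) in the
# one context that actually obtains.
contexts = {
    "dry, dirty car": {"prob": 0.70, "go_to_car_wash": 1.0},
    "dry, clean car": {"prob": 0.20, "go_to_car_wash": 0.3},
    "raining":        {"prob": 0.10, "go_to_car_wash": 0.0},
}

# E_C[f(X, C)]: what a model with no access to C effectively returns.
marginal = sum(c["prob"] * c["go_to_car_wash"] for c in contexts.values())
print(f"marginal answer P(yes) = {marginal:.2f}")   # 0.76 -> "yes"

# f(X, C) in the context of the viral clip:
print(f"grounded answer = {contexts['raining']['go_to_car_wash']}")  # 0.0 -> "no"
```

<p>The marginal answer is "yes" even though the grounded answer in this particular context is "no" — which is exactly the failure in the clip.</p>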
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
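<p>The two extremes can be checked numerically. A short sketch with invented toy distributions — two answer options, two equally likely contexts — computes $I(A; C \mid X)$ for a context-independent question and a fully context-determined one:</p>

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def cond_mutual_information(p_c, p_a_given_c):
    """I(A; C | X) = H(A | X) - H(A | X, C) for one fixed question X.
    p_c: distribution over contexts; p_a_given_c: one row P(A | C=c) per context."""
    n_answers = len(p_a_given_c[0])
    # H(A | X): entropy of the context-marginalised answer distribution
    p_a = [sum(p_c[i] * p_a_given_c[i][a] for i in range(len(p_c)))
           for a in range(n_answers)]
    # H(A | X, C): context-weighted entropy of each conditional row
    h_a_given_c = sum(p_c[i] * entropy(p_a_given_c[i]) for i in range(len(p_c)))
    return entropy(p_a) - h_a_given_c

# "Capital of France?": same answer in every context -> context carries no information
print(cond_mutual_information([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]]))  # 0.0

# Car wash: answer flips with the context (raining vs. not) -> one full bit
print(cond_mutual_information([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```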
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
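<p>Calibration can be measured. A minimal sketch of expected calibration error (the metric used by Guo et al.), applied to an invented set of predictions that mimics the car wash pattern — high expressed confidence, coin-flip accuracy:</p>

```python
# Minimal expected calibration error (ECE): bin predictions by confidence
# and compare each bin's average confidence with its empirical accuracy.
def expected_calibration_error(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Invented example: always ~95% confident, right only half the time.
print(expected_calibration_error([0.95] * 10, [True] * 5 + [False] * 5))  # 0.45
```

<p>An ECE near zero means confidence tracks accuracy; 0.45 on this toy input is the "confidently wrong" regime.</p>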
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This problem cannot be solved by making the model bigger or training
it on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
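<p>The two conditions can be sketched as a routing decision. Everything here is hypothetical — the keyword heuristic, the response strings, and the weather lookup are invented for illustration, not how any production assistant works:</p>

```python
# Hypothetical sketch of the two conditions. A question is routed to a
# tool only if it is recognised as context-dependent (condition 1), and
# only if a tool exists that can supply that context (condition 2).
CONTEXT_DEPENDENT_HINTS = ("today", "right now", "currently", "tonight")

def needs_world_state(question: str) -> bool:
    """Crude stand-in for condition 1: does the answer plausibly depend
    on current world state, i.e. is I(A; C | X) likely to be large?"""
    return any(hint in question.lower() for hint in CONTEXT_DEPENDENT_HINTS)

def answer(question: str, weather_tool=None) -> str:
    if needs_world_state(question):
        if weather_tool is None:
            # Condition 2 fails: no tool supplies C. Express uncertainty
            # instead of marginalising over contexts and sounding sure.
            return "I can't tell: that depends on conditions I can't observe."
        if weather_tool() == "raining":
            return "No: it's raining, the wash would be wasted."
        return "Yes: conditions look fine."
    return "General-knowledge answer."

print(answer("Should I drive to the car wash today?", weather_tool=lambda: "raining"))
print(answer("Should I drive to the car wash today?"))  # no tool -> hedge
```

<p>The fragile part is exactly the heuristic in <code>needs_world_state</code>: a question that is context-dependent but matches no recognised pattern falls through to the confident marginal answer.</p>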
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). On
calibration of modern neural networks. <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
