<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Reasoning on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/reasoning/</link>
    <description>Recent content in Reasoning on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 12 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/reasoning/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>["straw", "berry"]</code> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$ by matching the
surface pattern &ldquo;should I [verb of locomotion] to [location]?&rdquo;, which
in the training data generates responses about modes of transport. It is
a natural pattern-match. It is the wrong one.</p>
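<p>The argmax above can be made concrete with a toy Bayesian listener that
scores each reading as $P(u \mid i) \cdot P(i \mid \text{context})$. Everything
below is illustrative: the two interpretations, the likelihoods, and the
priors are invented numbers, not quantities any real model exposes.</p>

```python
# Toy pragmatic listener: P(i | u, context) ∝ P(u | i) * P(i | context).
# All probabilities are invented for illustration.

def pragmatic_listener(likelihoods, prior):
    """Return normalised interpretation probabilities from P(u|i) and a prior P(i)."""
    scores = {i: likelihoods[i] * prior[i] for i in likelihoods}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}

# Two readings of "should I drive to the car wash?"
likelihoods = {"transport_choice": 0.5, "wash_timing": 0.5}

# A context-free prior — roughly what a surface pattern-match amounts to:
flat_prior = {"transport_choice": 0.6, "wash_timing": 0.4}

# A contextually informed prior (speaker standing next to a dirty car):
context_prior = {"transport_choice": 0.05, "wash_timing": 0.95}

without_context = pragmatic_listener(likelihoods, flat_prior)
with_context = pragmatic_listener(likelihoods, context_prior)

print(max(without_context, key=without_context.get))  # transport_choice
print(max(with_context, key=with_context.get))        # wash_timing
```

<p>The point of the sketch is only that the argmax flips when the prior
encodes context: the likelihood term never changes, so the entire
disambiguation burden falls on $P(i \mid \text{context})$, which is exactly
the quantity a text-only model lacks.</p>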
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
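<p>The gap can be shown in a few lines. The split below mimics the commonly
cited &ldquo;straw&rdquo; + &ldquo;berry&rdquo; segmentation; real BPE vocabularies differ
by model, and the token IDs here are invented.</p>

```python
# Sketch of the representation gap: counting "r" over the raw string works,
# but over token identifiers the characters are not directly addressable.
# The split mimics the commonly cited BPE split; real vocabularies differ.

text = "strawberry"
tokens = ["straw", "berry"]   # what the model "sees": opaque units
token_ids = [4321, 8765]      # hypothetical vocabulary indices

# Character-level count: trivially correct.
print(text.count("r"))  # 3

# Token-ID-level view: the integer 4321 carries no accessible "r".
# Any per-token character count would have to be memorised, not read off.
assert "".join(tokens) == text  # the information existed before tokenisation
```

<p>The assertion is the whole argument: the characters are recoverable from
the token sequence in principle, but the model operates on the IDs, not on
the concatenated string.</p>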
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
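<p>The two entropy terms can be computed for a toy version of the car wash
question. The 30/70 rain split and the deterministic answer rule are
invented for illustration; the structure is what matters.</p>

```python
import math

# Toy joint distribution for the car wash question (a single fixed X).
# P(C): invented — it is raining for 30% of askers.
# f(X, C): the answer is fully determined once the context is known.
p_context = {"raining": 0.3, "dry": 0.7}
answer_given_context = {"raining": "no", "dry": "yes"}

def entropy(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# P(A | X): marginalise the context out, as the ungrounded model must.
p_answer = {}
for c, pc in p_context.items():
    a = answer_given_context[c]
    p_answer[a] = p_answer.get(a, 0.0) + pc

h_a_given_x = entropy(p_answer)           # residual uncertainty without C
h_a_given_xc = 0.0                         # deterministic once C is known
mutual_info = h_a_given_x - h_a_given_xc   # I(A; C | X)

print(round(mutual_info, 3))  # 0.881
```

<p>For this toy question essentially all of the answer&rsquo;s entropy
(≈0.881 bits) is carried by the context. Replace the answer rule with a
context-independent one — &ldquo;what is the capital of France?&rdquo; — and the
same computation gives zero.</p>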
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
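<p>Calibration has a standard measurement: expected calibration error, the
bin-weighted gap between stated confidence and observed accuracy, in the
spirit of Guo et al. (2017). The predictions below are invented to model
the car wash behaviour — high confidence, coin-flip accuracy.</p>

```python
# Minimal expected-calibration-error (ECE) sketch; data is invented.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A model that is always 95% confident but right only half the time:
confs = [0.95] * 10
right = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, right), 2))  # 0.45
```

<p>An ECE of 0.45 on a 0–1 scale is about as miscalibrated as a binary
predictor gets; a model that said &ldquo;I don&rsquo;t know the weather&rdquo; and
withheld a confident answer would score near zero.</p>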
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This problem is not solved by making the model bigger or training it
on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
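<p>The routing logic the two conditions describe can be sketched in a few
lines. Everything here is a hypothetical stand-in — the keyword
classifier, the tool registry, and the stubbed weather lookup are not a
real API; condition 1 is precisely the part the keyword check fakes.</p>

```python
# Sketch of tool-gated answering: answer only when the question's answer
# does not depend on world state we cannot retrieve. All names are
# hypothetical stand-ins, not a real tool-use API.

AVAILABLE_TOOLS = {"weather": lambda: "raining"}  # stubbed weather lookup

def needed_context(question):
    """Hypothetical classifier: which context variables does the answer need?"""
    if "today" in question or "right now" in question:
        return ["weather"]
    return []

def answer(question):
    needed = needed_context(question)
    missing = [c for c in needed if c not in AVAILABLE_TOOLS]
    if missing:
        return f"I can't answer: I lack {', '.join(missing)}."
    facts = {c: AVAILABLE_TOOLS[c]() for c in needed}
    if facts.get("weather") == "raining":
        return "No - it is raining right now."
    return "Yes, conditions look fine."

print(answer("Should I drive to the car wash today?"))  # No - it is raining right now.
```

<p>Note where the sketch is weakest: <code>needed_context</code> is doing the
meta-level work of recognising the information gap, which is exactly the
unsolved part. A keyword check stands in for it here only because nothing
better fits in ten lines.</p>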
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). On
calibration of modern neural networks. <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
