<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Language on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/language/</link>
    <description>Recent content in Language on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 12 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/language/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>["straw", "berry"]</code> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$
<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$, apparently driven
by the surface pattern: &ldquo;should I [verb of locomotion] to [location]?&rdquo;
tends to generate responses about modes of transport. It is a natural
pattern-match. It is the wrong one.</p>
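<p>The argmax can be sketched in a few lines. Everything here is invented for illustration — the candidate interpretations and their probabilities are made up, not read out of any real model:</p>

```python
# Score candidate interpretations of an utterance under a context prior
# and pick the argmax. All names and numbers are invented for illustration.
def best_interpretation(candidates, context):
    return max(candidates, key=lambda i: i["p"][context])

candidates = [
    # "what is the best way to travel there?"
    {"name": "transport-choice", "p": {"by-dirty-car": 0.05, "text-only": 0.60}},
    # "is now a good time to have the car washed?"
    {"name": "car-maintenance", "p": {"by-dirty-car": 0.95, "text-only": 0.40}},
]

# Standing next to a dirty car, the cooperative reading wins;
# stripped down to bare text, the surface pattern-match does.
print(best_interpretation(candidates, "by-dirty-car")["name"])  # car-maintenance
print(best_interpretation(candidates, "text-only")["name"])     # transport-choice
```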
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
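<p>The correlation is easy to caricature. Here is a crude stand-in — the regex and the response labels are invented for this sketch, and no real model works this way — for a system that routes on surface form alone:</p>

```python
import re

# "should I <verb of locomotion> to <place>?" -> transport advice,
# regardless of what the question is actually about. A caricature of
# the surface correlation, not a claim about any real model's internals.
PATTERN = re.compile(r"should i (walk|drive|cycle|run) to (the )?[\w ]+\?", re.I)

def surface_router(question):
    if PATTERN.match(question.strip()):
        return "transport-advice"
    return "general-answer"

print(surface_router("Should I drive to the car wash?"))    # transport-advice
print(surface_router("Is now a good time to wash my car?")) # general-answer
```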
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
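<p>There is at least one concrete proposal for where the bootstrap comes from: the Rational Speech Acts model (Frank &amp; Goodman, 2012) starts from a <em>literal</em> listener and iterates, so that each level of reasoning is defined in terms of the level below rather than in terms of itself. A toy version, with invented utterances, a hand-written semantics, and uniform priors:</p>

```python
def normalise(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()} if z else dict(d)

def pragmatic_listener(literal, prior):
    # L0: literal listener -- meanings compatible with the utterance,
    # weighted by the prior. No circularity: this level ignores intent.
    L0 = {u: normalise({m: literal[u][m] * prior[m] for m in prior})
          for u in literal}
    # S1: pragmatic speaker -- prefers utterances from which a literal
    # listener would recover the intended meaning.
    S1 = {m: normalise({u: L0[u][m] for u in literal}) for m in prior}
    # L1: pragmatic listener -- reasons about that speaker.
    return {u: normalise({m: S1[m][u] * prior[m] for m in prior})
            for u in literal}

# Toy semantics, invented for illustration: the first utterance is
# ambiguous between two meanings, the second is not.
literal = {
    "should I drive to the car wash?":
        {"wash-the-car": 1.0, "travel-mode": 1.0},
    "how should I get to the car wash?":
        {"wash-the-car": 0.0, "travel-mode": 1.0},
}
prior = {"wash-the-car": 0.5, "travel-mode": 0.5}

L1 = pragmatic_listener(literal, prior)
# The ambiguous utterance now favours wash-the-car (0.75 vs 0.25): a
# speaker who meant travel-mode had an unambiguous utterance available,
# so choosing the ambiguous one is evidence for the other meaning.
print(L1["should I drive to the car wash?"])
```

The numbers are toys, but the structure shows one way out of the circle: intent is inferred not from relevance directly, but from reasoning about which utterances a cooperative speaker would have chosen for each possible intent.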
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, H. P. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
