<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/llm/</link>
    <description>Recent content in LLM on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Sun, 22 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>More Context Is Not Always Better</title>
      <link>https://sebastianspicker.github.io/posts/more-context-not-always-better/</link>
      <pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/more-context-not-always-better/</guid>
      <description>The intuition that feeding a language model more information improves its outputs is wrong often enough to matter. Here is why, and what to do about it.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>There is a popular intuition in LLM engineering that context is a resource
you should always spend freely: more background, more history, more examples —
inevitably better answers. This intuition is wrong often enough to be
dangerous. Context has a signal-to-noise structure, attention has a
position-dependent bias, and the architecture that processes all of it scales
quadratically. Adding irrelevant tokens does not leave performance neutral; it
actively degrades it. This post argues for <em>structured sparsity</em> as a design
principle: give a model exactly the context it needs for the decision it is
making right now, and nothing else.</p>
<hr>
<h2 id="background">Background</h2>
<p>The &ldquo;more is more&rdquo; assumption has an obvious origin. Transformers were
designed to condition on sequences, and every new token in the context window
is, in principle, available to every attention head. The release of models with
128k, 200k, and now million-token context windows reinforced the story: the
constraint is gone, so pack in everything you have.</p>
<p>Two lines of empirical and theoretical work complicate this story.</p>
<p><strong>The lost-in-the-middle problem.</strong> Liu et al. <a href="#ref-1">[1]</a> showed that retrieval
accuracy on multi-document question answering degrades sharply when the
relevant passage appears in the <em>middle</em> of a long context, compared to the
beginning or end. Performance on 20-document prompts dropped by more than
20 percentage points relative to the single-document baseline — not because the
model lacked the information, but because it was buried. The effect is
consistent across model families and persists at model scales where you would
not expect it.</p>
<p><strong>The complexity argument.</strong> Standard scaled dot-product attention <a href="#ref-2">[2]</a> is</p>
\[
  \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]<p>The \( QK^{\top} \) product is \( O(n^2) \) in sequence length \( n \).
An inference-time KV cache mitigates the per-token compute cost, but memory
grows linearly and the softmax normalises over a denominator that grows with
\( n \). A head attending to 128,000 tokens is averaging over a vastly noisier
signal than one attending to 512.</p>
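<p>A minimal NumPy sketch makes the scaling concrete. The <code>attention</code> function below is a direct transcription of the formula above (no causal mask, no batching, invented toy shapes), and the printed score-matrix sizes show the quadratic growth in \( n \):</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; the QK^T score matrix is n x n."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # O(n^2 * d_k) multiply-adds
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all n positions
    return weights @ V

rng = np.random.default_rng(0)
for n in (512, 4096):
    Q = K = V = rng.standard_normal((n, 64))
    out = attention(Q, K, V)
    # The score matrix alone holds n*n floats: an 8x longer context
    # means a 64x larger score matrix.
    print(n, out.shape, n * n)
```

Doubling \( n \) quadruples the score matrix, which is the whole complexity argument in one line of arithmetic.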
<hr>
<h2 id="the-idea">The Idea</h2>
<h3 id="context-as-a-noisy-channel">Context as a noisy channel</h3>
<p>Think of the information reaching a given attention head as a noisy channel in
the Shannon sense. The signal is the subset of tokens that are actually
relevant to the current decoding step; the rest is noise. Signal-to-noise ratio
is</p>
\[
  \text{SNR} = \frac{|\mathcal{S}|}{n - |\mathcal{S}|}
\]<p>where \( \mathcal{S} \subset \{1, \ldots, n\} \) is the set of relevant token
positions and \( n \) is total context length. For a fixed task, \( |\mathcal{S}| \)
is roughly constant. So SNR is a <em>decreasing</em> function of \( n \). Adding
irrelevant context makes the problem strictly harder in this framing — it does
not leave it unchanged.</p>
<p>This is a toy model, but it captures something real. The softmax in the
attention head distributes a probability mass of 1.0 across \( n \) positions.
If the attended sequence doubles in length and the relevant positions remain the
same, each relevant position receives roughly half the probability mass it did
before — unless the model&rsquo;s learned attention patterns are precise enough to
suppress the irrelevant positions to near-zero, which is a strong assumption.</p>
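<p>A back-of-the-envelope check of this dilution is easy to run. The numbers below assume a toy head in which \( k \) relevant positions share one fixed logit and every irrelevant position shares another (a deliberate simplification, not a claim about real attention heads):</p>

```python
import math

def relevant_mass(n, k=4, rel_logit=2.0, irr_logit=0.0):
    """Softmax mass landing on the k relevant positions among n total."""
    rel = k * math.exp(rel_logit)
    irr = (n - k) * math.exp(irr_logit)
    return rel / (rel + irr)

for n in (512, 1024, 2048):
    print(n, round(relevant_mass(n), 4))
# Once the irrelevant term dominates the denominator, the mass on the
# relevant tokens falls roughly in proportion to 1/n.
```

Each doubling of \( n \) roughly halves the attention mass the relevant tokens receive, exactly as the argument in the text predicts.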
<h3 id="position-bias-compounds-the-problem">Position bias compounds the problem</h3>
<p>Empirically, transformers exhibit a U-shaped recall curve over context
position: tokens near the start (primacy) and tokens near the end (recency)
are retrieved more reliably than tokens in the middle. If you stuff a long
context with background material and bury the task-relevant information in the
middle, you are fighting the architecture&rsquo;s learned inductive bias.</p>
<p>The effect is roughly consistent with what would emerge if the model&rsquo;s
attention weight distribution were modelled as a mixture of a flat prior and a
position-biased component. Under that model, increasing \( n \) inflates the
flat component&rsquo;s contribution and dilutes the position-biased recovery of
relevant tokens.</p>
<h3 id="what-structured-sparsity-looks-like-in-practice">What structured sparsity looks like in practice</h3>
<p>The corrective is not to artificially shrink context windows — it is to ensure
that at each decision point, the context is populated with tokens that are
<em>relevant to that decision</em>. Three practical expressions of this principle:</p>
<ol>
<li>
<p><strong>Retrieval over recall.</strong> Rather than prepending a full document corpus,
retrieve the top-\( k \) passages at query time. This keeps \( n \) small
and \( |\mathcal{S}| / n \) high.</p>
</li>
<li>
<p><strong>Rolling summarisation.</strong> Compress history into a running summary and
discard the raw transcript. The summary carries the signal; the raw
transcript is mostly noise by the time it is several turns old.</p>
</li>
<li>
<p><strong>Phased orchestration.</strong> Decompose a multi-step task into phases, each
with its own focused context. Phase \( t \) receives only the output of
phase \( t-1 \) (plus any task-specific retrieval), not the entire
accumulated history of all prior phases. This keeps per-phase \( n \) bounded
even as the total task length grows.</p>
</li>
</ol>
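<p>As a sketch of the third pattern, phased orchestration reduces to a fold over phases in which each call sees only its predecessor&rsquo;s output. The <code>Phase</code> and <code>run_phases</code> names are illustrative, and plain functions stand in for LLM calls:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    name: str
    run: Callable[[str], str]  # stands in for one focused LLM call

def run_phases(phases, task):
    """Each phase sees only the previous phase's output, never the full
    accumulated history, so per-phase context length stays bounded."""
    context = task
    for phase in phases:
        context = phase.run(context)  # bounded n per call
    return context

# Toy usage: each "phase" just wraps its input so the data flow is visible.
pipeline = [
    Phase("outline", lambda ctx: f"outline({ctx})"),
    Phase("draft",   lambda ctx: f"draft({ctx})"),
    Phase("revise",  lambda ctx: f"revise({ctx})"),
]
print(run_phases(pipeline, "task"))  # → revise(draft(outline(task)))
```

The design choice is the signature: a phase receives a string, not a conversation object, so there is no channel through which history can accumulate.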
<hr>
<h2 id="discussion">Discussion</h2>
<p>The argument above is not novel — pieces of it appear scattered across the
alignment and inference-efficiency literatures. What I think is underappreciated
is that it applies to <em>agentic</em> systems with particular force. A single-shot
prompt has a fixed, author-controlled context. An agent accumulating tool
outputs, prior reasoning traces, and retrieved documents across a long task
trajectory will naturally inflate its own context window over time — and
degrade its own performance as a result, without any external change in task
difficulty.</p>
<p>The naive fix is to give the agent a bigger context window. The correct fix is
to never let it accumulate a bloated context in the first place.</p>
<p><strong>Limitations.</strong> The SNR framing treats all irrelevant tokens as equally noisy,
which is false — some irrelevant tokens are actively misleading (distractors) <a href="#ref-3">[3]</a>,
others are benign fillers. The quadratic cost argument mostly applies to
full-attention models; sparse and linear attention variants have different
scaling properties. And &ldquo;relevant&rdquo; is itself a function of the model&rsquo;s
knowledge, which makes the optimisation circular in practice.</p>
<p><strong>What would make this publishable.</strong> Controlled ablation: fix a task, vary
context length by inserting null tokens of increasing volume, measure
performance as a function of \( n \) and of the position of the relevant
material. Do this across model sizes and families to separate architectural
effects from scale effects. The lost-in-the-middle paper is close to this but
does not isolate null-token inflation from document-count inflation.</p>
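<p>A harness for that ablation is short to sketch. Everything here is hypothetical scaffolding: <code>query_model</code> stands in for a real model call returning an accuracy, and <code>&lt;null&gt;</code> for whatever token the tokeniser treats as filler:</p>

```python
import itertools

def build_prompt(relevant, null_tokens, position):
    """Place the relevant passage at a chosen relative position in null padding."""
    pad = ["<null>"] * null_tokens
    i = int(position * len(pad))
    return " ".join(pad[:i] + [relevant] + pad[i:])

def ablation(query_model, relevant, task):
    """Grid over padding volume and passage position; query_model is a
    stand-in for an actual model call scored against a gold answer."""
    results = {}
    for n_null, pos in itertools.product((0, 1000, 10000), (0.0, 0.5, 1.0)):
        prompt = build_prompt(relevant, n_null, pos)
        results[(n_null, pos)] = query_model(prompt, task)
    return results
```

Running the same grid across model sizes and families is what would separate the architectural effect from the scale effect.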
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). Lost in the middle: How language models use long contexts. <em>Transactions of the Association for Computational Linguistics</em>, 12, 157–173. <a href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></p>
<p><span id="ref-2"></span>[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need. <em>Advances in Neural Information Processing Systems</em>, 30. <a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
<p><span id="ref-3"></span>[3] Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., Schärli, N., &amp; Zhou, D. (2023). Large language models can be easily distracted by irrelevant context. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>. <a href="https://arxiv.org/abs/2302.00093">https://arxiv.org/abs/2302.00093</a></p>
<hr>
<p><em>The phased orchestration argument in the Discussion section is not just
theoretical hand-waving — I have been building a concrete implementation of it.
The current state lives at
<a href="https://github.com/sebastianspicker/phased-agent-orchestration">sebastianspicker/phased-agent-orchestration</a>.
It is rough, but the core idea is there: each agent phase gets a bounded,
purpose-built context rather than the full accumulated history. Feedback very
welcome.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>If You Think This Is Written by AI, You Are Both Right and Wrong</title>
      <link>https://sebastianspicker.github.io/posts/ai-detectors-systematic-minds/</link>
      <pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-detectors-systematic-minds/</guid>
      <description>AI detectors flag the US Constitution as machine-generated. They also flag technical papers, legal prose, and — with striking consistency — writing produced by autistic minds and physics-trained ones. The error is not in the measurement. It is in the baseline assumption: that systematic, precise writing is inhuman.</description>
      <content:encoded><![CDATA[<p>I use AI tools in my writing. This post, like several others on this blog,
was written with LLM assistance — research, structure, drafting,
revision. If you run any of these posts through an AI writing detector, you
will likely receive a high probability-of-AI score. The detector will be
picking up something real.</p>
<p>It will also be wrong about what that means.</p>
<hr>
<h2 id="the-constitution-problem">The Constitution Problem</h2>
<p>In 2023, as universities began deploying AI detection tools at scale,
educators started testing them on texts that were definitively not
AI-generated. The results were instructive. The United States Constitution
received high AI-probability scores from multiple commercial detectors.
GPTZero rated it 92% likely to be AI-written. The Federalist Papers
fared similarly. So did sections of the King James Bible and Kant&rsquo;s <em>Critique
of Pure Reason</em>. Historical documents, written by humans, for human purposes,
in an era when no AI existed — flagged as machine-generated.</p>
<p>This was not a marginal edge case. It was consistent across tools and across
documents. And while it was widely reported as evidence that the detectors
were broken, there is a more precise reading available: the detectors were
working correctly, and we had misunderstood what they were measuring.</p>
<hr>
<h2 id="what-the-detectors-actually-measure">What the Detectors Actually Measure</h2>
<p>Most commercial AI detectors — GPTZero, Turnitin&rsquo;s detection layer,
Copyleaks — use some combination of two statistical signals.</p>
<p><strong>Perplexity.</strong> A language model assigns a probability to each token given
the preceding tokens. Low perplexity means the text was, token by token,
what the model expected — it sits close to the centre of the probability
distribution. AI-generated text tends to have low perplexity because that
is precisely what generation does: it samples from the high-probability
region of the distribution <a href="#ref-1">[1]</a>. Human text, on average,
has higher perplexity, because humans write for specific contexts with
idiosyncratic word choices, rhetorical effects that require the unexpected,
and the accumulated noise of composing for a real reader.</p>
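<p>Perplexity itself is a one-liner once a scoring model has supplied per-token probabilities. The probability lists below are invented for illustration, not measured:</p>

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability the LM assigned per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities from a scoring model:
predictable = [0.9, 0.8, 0.85, 0.9]   # generated-text-like: every token expected
surprising  = [0.3, 0.05, 0.6, 0.1]   # idiosyncratic human prose
print(perplexity(predictable) < perplexity(surprising))  # → True
```

A detector thresholds exactly this quantity: text the scoring model finds predictable scores low and gets flagged.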
<p><strong>Burstiness.</strong> A term introduced by Edward Tian, GPTZero&rsquo;s creator: human
writing has high burstiness — sentence lengths vary widely, vocabulary
density shifts, complex constructions alternate with simple ones. AI writing
is more uniform. The statistical distribution of sentence lengths in LLM
output is narrower than in most human prose <a href="#ref-2">[2]</a>.</p>
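<p>A crude burstiness proxy, the spread of sentence lengths, can be computed in a few lines; real detectors use richer features, so treat this only as an illustration of the signal:</p>

```python
import re
import statistics

def burstiness(text):
    """Standard deviation of sentence lengths in words: a crude burstiness proxy."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)

uniform = "The model runs fast. The code is clean. The test is green. The day is long."
varied = ("Yes. The experiment, despite every reasonable precaution we took, "
          "failed in a way nobody predicted. Why? Nobody knows.")
print(burstiness(uniform) < burstiness(varied))  # → True
```

Uniform sentence lengths yield a burstiness near zero, which is the statistical signature the detectors read as machine-generated.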
<p>The underlying assumption these tools share: human writing is variable,
contextually messy, idiosyncratic. AI writing is smooth and predictable.</p>
<p>This is accurate for a large class of human writing — casual prose, personal
essays, social media, student writing in informal registers. It is wrong
about a different and well-defined class of human writing. The Constitution
sits in that class. So does a lot of other text.</p>
<hr>
<h2 id="the-systemising-brain">The Systemising Brain</h2>
<p>Simon Baron-Cohen&rsquo;s empathising–systemising (E-S) theory distinguishes two
cognitive orientations. Empathising involves attending to social and emotional
cues, inferring mental states, navigating the pragmatic, implicit layer of
communication — what is meant rather than what is said. Systemising involves
attending to rules, patterns, and underlying regularities — the drive to
understand how things work and to represent them in explicit, transferable,
internally consistent terms <a href="#ref-3">[3]</a>.</p>
<p>Both orientations are distributed across the human population. They are not
exclusive, and neither is pathological. But autism spectrum conditions are
robustly associated with high systemising and relatively lower empathising —
not because autistic people lack emotions or care about others, but because
the cognitive mode that comes naturally to them is one of rules, structures,
and explicit representation rather than social inference and pragmatic
implication. The intense world theory <a href="#ref-4">[4]</a> adds a
complementary perspective: autistic brains may be characterised by
hyper-reactivity and hyper-plasticity, with pattern-seeking and systematising
serving partly as a way of making a too-intense world navigable. The
systematicity is not a deficit. It is an adaptation.</p>
<p>This has direct consequences for writing.</p>
<p>High-systemising writing tends toward:</p>
<ul>
<li>
<p><strong>Consistent vocabulary.</strong> The same term is used for the same concept
throughout, because substituting a synonym introduces ambiguity about
whether the referent is actually the same. Neurotypical writing freely
uses synonyms for stylistic variety; systemising writing resists this
on principle.</p>
</li>
<li>
<p><strong>Explicit logical structure.</strong> Claims are supported by stated reasons
rather than left to pragmatic inference. If there are three conditions,
all three are named. Nothing is &ldquo;needless to say.&rdquo;</p>
</li>
<li>
<p><strong>Low social hedging.</strong> Phrases like &ldquo;as everyone knows&rdquo; or &ldquo;obviously&rdquo;
are avoided, because they perform social alignment rather than convey
information — and they depend on shared assumptions the writer is not
confident are actually shared. (This connects to a point I made in the
<a href="/posts/car-wash-walk/">car-wash-walk post</a> about Gricean pragmatics:
autistic communication often violates the maxim of quantity in the
direction of over-informing, because nothing is assumed implicit.)</p>
</li>
<li>
<p><strong>Grammatical parallelism.</strong> Parallel logical content takes parallel
grammatical form. This is not stylistic affectation; it is a natural
consequence of representing structure explicitly.</p>
</li>
<li>
<p><strong>Minimal rhetorical noise.</strong> The prose does not meander, warm up, or
perform relatability. It states what needs to be stated.</p>
</li>
</ul>
<p>Now run text with these properties through an AI detector. Consistent
vocabulary reads as low lexical diversity. Explicit structure reads as low
burstiness. Minimal rhetorical noise reads as smooth, generated output. The
detector is measuring these properties accurately. The attribution to machine
generation is where it goes wrong.</p>
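<p>The &ldquo;low lexical diversity&rdquo; reading is easy to reproduce with a type&ndash;token ratio, the simplest diversity measure. The two sentences below are constructed examples, not quotations:</p>

```python
def type_token_ratio(text):
    """Distinct words / total words; low values read as 'low lexical diversity'."""
    words = text.lower().split()
    return len(set(words)) / len(words)

# Systemising style: the same term for the same concept, every time.
systematic = "the logical qubit stores the state and the logical qubit corrects the state"
# Neurotypical style: synonyms swapped in for variety.
varied = "the logical qubit keeps information while its physical counterpart fixes drifting values"
print(type_token_ratio(systematic) < type_token_ratio(varied))  # → True
```

The systematic sentence is the more precise of the two, and it is also the one a detector would score as less human.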
<p>Liang et al. <a href="#ref-5">[5]</a> demonstrated a closely related failure empirically: AI
detectors are significantly more likely to flag writing by non-native English
speakers as AI-generated. Non-native writers at advanced levels of formal
English tend to write more carefully, more consistently, and more in
accordance with explicit grammar rules — because they learned the language
as a system of explicit rules rather than acquiring it through immersive
social exposure. More systematic writing: higher AI probability score. The
mechanism is the same. The population is different.</p>
<hr>
<h2 id="the-physicist-brain">The Physicist Brain</h2>
<p>Physics writing has its own conventions, independently developed but pointing
in the same direction.</p>
<p>Scientific prose requires defined terms used consistently: in a paper about
quantum error correction, &ldquo;logical qubit,&rdquo; &ldquo;physical qubit,&rdquo; and &ldquo;syndrome&rdquo;
each mean exactly one thing, used identically in section 2 and section 5.
It requires explicit assumptions: &ldquo;We assume the noise is Markovian.&rdquo; &ldquo;In
the limit of large N.&rdquo; These are not vague hedges; they are precise
statements about the domain of validity of the results. It requires logical
derivation over rhetorical persuasion: the connectives are &ldquo;since,&rdquo;
&ldquo;therefore,&rdquo; &ldquo;it follows that&rdquo; — explicit logical operators, not narrative
bridges. And the passive construction of &ldquo;the signal was measured&rdquo; rather
than &ldquo;I measured the signal&rdquo; removes the individual from the result,
because the result should be reproducible regardless of who performs the
measurement.</p>
<p>The outcome is prose that is systematic, consistent, and structurally
predictable. From the outside — and from the vantage point of an AI
detector — it looks machine-generated.</p>
<p>Paul Dirac is the physicist who comes to mind first here. His 1928 paper
deriving the relativistic wave equation for the electron contains almost no
rhetorical apparatus. Motivation, equation, consequence: each stated once,
clearly, with no warm-up and no elaboration beyond what the argument
requires. It is not warm. It is not discursive. It is beautiful in the way
that a proof is beautiful: every element earns its place. Run it through
GPTZero and see what you get.</p>
<p>This connection between the physicist&rsquo;s prose style and the autistic cognitive
mode is not accidental. Baron-Cohen et al. <a href="#ref-6">[6]</a> surveyed Cambridge students
by academic discipline and found that physical scientists and mathematicians
scored consistently higher on the Autism Quotient (AQ) than humanities
students and controls, with mathematicians scoring highest of all. The
systemising orientation associated with autism spectrum conditions is also
overrepresented in, and presumably selected for by, quantitative scientific
disciplines. The physicist&rsquo;s prose reflects this. So does the writing of a
high-systemising person who has never studied physics.</p>
<p>The categories overlap without being identical. What they share is a
cognitive preference for explicit structure, consistent vocabulary, and
logical transparency over social performance and rhetorical persuasion. The
writing that emerges from that preference looks, to an AI detector, like it
was generated by a machine.</p>
<p>It was not.</p>
<hr>
<h2 id="the-category-error">The Category Error</h2>
<p>The error AI detectors make is not a measurement error. It is a category
error.</p>
<p>They are trained to distinguish two things: output generated by a
contemporary LLM, and a specific subset of human writing — typically casual,
personal, or student prose collected from online sources. When they encounter
text outside either of those training categories — systematic and precise but
human-generated — the classifier has no good option. The text does not match
the &ldquo;AI&rdquo; training data exactly, and it does not match the &ldquo;human&rdquo; baseline
either. It gets assigned to the bin it fits least badly.</p>
<p>What is happening when the Constitution is flagged: it is systematic,
definitional, prescriptive, and internally consistent. It was written by
lawyers and statesmen who understood that ambiguity in foundational documents
creates legal chaos. They wrote to be unambiguous. The result is text with
low perplexity and low burstiness — the statistical signature the detector
associates with AI.</p>
<p>GPTZero&rsquo;s creator Edward Tian acknowledged this problem when it was reported:
the Constitution appears so frequently in LLM training data that it registers
as &ldquo;already known&rdquo; to the model, which artificially lowers its perplexity
score. That is a real and specific issue. But it is secondary. The deeper
issue is that the Constitution would score low-perplexity even without the
training-data contamination effect, because systematic, definitional prose
is intrinsically low-perplexity. Precise language is predictable language.
That is partly the point of precise language.</p>
<p>The baseline assumption — that human writing is variable and idiosyncratic —
holds for much human writing. It does not hold for legal drafting, technical
documentation, scientific papers, sacred and historical texts written to be
durable and precise, writing by people with high systemising orientation, or
writing by non-native speakers at formal registers. That is not a small
population of edge cases. It is a substantial fraction of all written
material that exists.</p>
<hr>
<h2 id="right-and-wrong-at-the-same-time">Right and Wrong at the Same Time</h2>
<p>So: if you think these posts are AI-generated, you are right and wrong at
the same time.</p>
<p>Right, in two ways. First: yes, I use AI tools. LLM assistance is part of
my writing process — not an occasional aid, but a regular part of how
research notes and half-formed arguments become structured posts. Second:
the writing style of these posts is systematic and precise in ways that
detectors register as machine-generated. That systematicity is real, and
if a detector picks it up, it is measuring something.</p>
<p>Wrong, also in two ways. First: the ideas, judgments, and connections in
these posts are mine. The decisions about what to include and what to leave
out, which papers to cite and how to frame their implications, where the
interesting tension lies between neurodiversity research and the assumptions
baked into AI detection tools — those are not outputs of a language model
working in isolation. They are the product of someone who works at the
intersection of these fields and has thought about them for a while. An LLM
cannot generate these posts without a human who has already decided what
to say.</p>
<p>Second, and more important for the argument here: the systematic, precise
character of this writing is not evidence of machine generation. It is a
cognitive signature — one associated with physics training, with high
systemising orientation, with the <a href="/posts/inner-echo/">overlap between those two things that I
have written about elsewhere</a> in the context of
neurodiversity more broadly.</p>
<p>The detector is measuring a real property of the text. It is misattributing
the origin of that property.</p>
<p>The interesting question this opens is not &ldquo;did AI write this?&rdquo; That question
is increasingly poorly posed in an era where thinking and writing are already
deeply entangled with machine assistance, in ways that differ sharply from
person to person and task to task. The better question is: <em>whose judgment
is in the text?</em> Whose choices about what to include, what to connect, what
to leave out?</p>
<p>The systematicity in this writing is mine. The recognition that AI detectors
systematically disadvantage autistic writers, physicist writers, and
non-native speakers is a judgment I made, not one a language model was
prompted to produce. The connection to the Constitution — a document written
to be maximally unambiguous, flagged as maximally AI-like — is a connection
I found worth drawing.</p>
<p>Whether that makes this text &ldquo;human&rdquo; is a philosophical question I am happy
to leave open. What it is not is AI hallucination.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., &amp; Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>. <a href="https://arxiv.org/abs/2301.11305">https://arxiv.org/abs/2301.11305</a></p>
<p><span id="ref-2"></span>[2] Gehrmann, S., Strobelt, H., &amp; Rush, A. M. (2019). GLTR: Statistical detection and visualization of generated text. <em>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</em>, 111–116. <a href="https://doi.org/10.18653/v1/P19-3019">https://doi.org/10.18653/v1/P19-3019</a></p>
<p><span id="ref-3"></span>[3] Baron-Cohen, S. (2009). Autism: The empathising–systemising (E-S) theory. <em>Annals of the New York Academy of Sciences</em>, 1156(1), 68–80. <a href="https://doi.org/10.1111/j.1749-6632.2009.04467.x">https://doi.org/10.1111/j.1749-6632.2009.04467.x</a></p>
<p><span id="ref-4"></span>[4] Markram, K., &amp; Markram, H. (2010). The intense world theory — a unifying theory of the neurobiology of autism. <em>Frontiers in Human Neuroscience</em>, 4, 224. <a href="https://doi.org/10.3389/fnhum.2010.00224">https://doi.org/10.3389/fnhum.2010.00224</a></p>
<p><span id="ref-5"></span>[5] Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., &amp; Zou, J. (2023). GPT detectors are biased against non-native English writers. <em>Patterns</em>, 4(7), 100779. <a href="https://doi.org/10.1016/j.patter.2023.100779">https://doi.org/10.1016/j.patter.2023.100779</a></p>
<p><span id="ref-6"></span>[6] Baron-Cohen, S., Wheelwright, S., Skinner, R., Martin, J., &amp; Clubley, E. (2001). The autism-spectrum quotient (AQ): Evidence from Asperger syndrome/high-functioning autism, males and females, scientists and mathematicians. <em>Journal of Autism and Developmental Disorders</em>, 31(1), 5–17. <a href="https://doi.org/10.1023/A:1005653411471">https://doi.org/10.1023/A:1005653411471</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>["straw", "berry"]</code> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$, apparently by
matching the surface pattern &ldquo;should I [verb of locomotion] to
[location]?&rdquo;, which in the training data typically generates
responses about modes of transport. It is a natural pattern-match. It is
the wrong one.</p>
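<p>The $\arg\max$ above can be made concrete with a toy Bayesian listener. Everything in this sketch is invented for illustration — the interpretations, priors, and likelihood functions are hand-written numbers, not anything a real model computes:</p>

```python
# Toy pragmatic listener: score candidate interpretations of an utterance
# given context. All numbers are invented for illustration.

def pragmatic_listener(utterance_features, context, priors, likelihoods):
    """Return i* = argmax_i P(i) * P(utterance, context | i), plus the posterior."""
    scores = {
        i: priors[i] * likelihoods[i](utterance_features, context)
        for i in priors
    }
    total = sum(scores.values())
    posterior = {i: s / total for i, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Two candidate readings of "should I drive to the car wash?"
priors = {"car-maintenance": 0.7, "transport-choice": 0.3}

likelihoods = {
    # Standing next to a dirty car strongly supports the maintenance reading.
    "car-maintenance": lambda u, c: 0.9 if c.get("near_dirty_car") else 0.4,
    # The transport reading is mainly plausible without that context.
    "transport-choice": lambda u, c: 0.1 if c.get("near_dirty_car") else 0.5,
}

best, post = pragmatic_listener({"verb": "drive"}, {"near_dirty_car": True},
                                priors, likelihoods)
```

<p>The point of the sketch is the shape of the computation, not the numbers: the contextual cue (&ldquo;near a dirty car&rdquo;) is what pushes the posterior toward the maintenance reading, and it is exactly that cue a model receiving only the transcript never sees.</p>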
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
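<p>The marginalisation can be made concrete. In this sketch the context distribution and the answer function are invented numbers; the point is only that the answer that is best <em>on average</em> over contexts can be wrong in the one context that actually obtains:</p>

```python
# Toy marginalisation: the ungrounded model's answer as an expectation
# over unknown contexts C. Probabilities are invented for illustration.

# P(C): how often each context occurs when people ask the question.
context_dist = {"dry, dirty car": 0.7, "raining": 0.2, "just washed": 0.1}

# f(X, C): the correct answer in each context.
def correct_answer(context):
    return "yes" if context == "dry, dirty car" else "no"

# The model effectively returns the answer most often correct across
# contexts: argmax_a  sum_C P(C) * 1[f(X, C) = a].
def marginal_answer(context_dist):
    mass = {}
    for c, p in context_dist.items():
        a = correct_answer(c)
        mass[a] = mass.get(a, 0.0) + p
    return max(mass, key=mass.get)

a_hat = marginal_answer(context_dist)   # right on average
actual = correct_answer("raining")      # right here and now
```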
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
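<p>For a toy joint distribution this mutual information can be computed directly. The numbers below are invented, and the answer is a deterministic function of the context, so $H(A \mid X, C) = 0$ and the entire residual uncertainty $H(A \mid X)$ is attributable to the missing context:</p>

```python
import math

# Toy joint distribution over (context C, correct answer A) for a fixed
# question X. Numbers are invented; the point is the information structure.
joint = {
    ("dry, dirty car", "yes"): 0.7,
    ("raining",        "no"):  0.2,
    ("just washed",    "no"):  0.1,
}

def entropy(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# H(A | X): marginalise out C, then take the entropy of A.
p_a = {}
for (_, a), p in joint.items():
    p_a[a] = p_a.get(a, 0.0) + p
h_a_given_x = entropy(p_a)

# H(A | X, C): zero here, since A is fully determined by C.
h_a_given_xc = 0.0

mutual_info = h_a_given_x - h_a_given_xc  # I(A; C | X), in bits
```

<p>With these numbers $I(A; C \mid X) \approx 0.88$ bits — nearly the full bit of a yes/no answer. For &ldquo;what is the capital of France?&rdquo; the same computation gives essentially zero.</p>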
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
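<p>Calibration is measurable. A common summary is the expected calibration error: bucket predictions by stated confidence and compare each bucket's average confidence to its observed accuracy. This is a minimal sketch with invented data, not the binning scheme of any particular paper:</p>

```python
# Sketch of a reliability check: bucket predictions by confidence and
# measure the gap between stated confidence and observed accuracy.

def expected_calibration_error(preds, n_bins=10):
    """preds: list of (confidence, was_correct). Weighted mean |conf - acc| gap."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / len(preds)) * abs(avg_conf - acc)
    return ece

# A model that is 90% confident but right only half the time: invented data.
overconfident = [(0.9, True), (0.9, False)] * 5
ece = expected_calibration_error(overconfident, n_bins=1)
```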
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This is not a solvable problem by making the model bigger or training it
on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). <strong>On
calibration of modern neural networks.</strong> <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months leading up to this writing, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the argmax token sequence. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
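<p>The collapse toward $\delta_{o^*}$ is just the behaviour of a temperature-scaled softmax. A minimal sketch with invented logits:</p>

```python
import math

def softmax_with_temperature(logits, temperature):
    """Sampling distribution over tokens; lower temperature sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]                   # invented next-token scores

warm = softmax_with_temperature(logits, temperature=1.0)
cold = softmax_with_temperature(logits, temperature=0.01)

# At low temperature nearly all mass sits on the argmax token -- the
# delta_{o*} limit in the text. Real systems are still only approximately
# deterministic, because of batching and floating-point accumulation.
```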
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>The variance does not merely add — it propagates through the dependencies.
Early outputs condition later prompts; a small deviation at step 2 can
shift the trajectory of step 5 substantially. This is not a theoretical
concern. It is the practical experience of anyone who has tried to reproduce
a multi-step agent run.</p>
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
exactly one position. The loop continues until no open stories remain or
a story limit $N$ is reached.</p>
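<p>The select/build/execute/accept/commit cycle can be sketched in a few lines. This is an illustrative Python rendering, not the actual implementation — the real runner is Bash and jq, and every name here is hypothetical:</p>

```python
# Illustrative sketch of the Ralph Loop transition function. The prompt is
# a pure function of the story; each attempt makes exactly one tool call.

def run_loop(stories, build_prompt, call_tool, accepted, max_attempts=3):
    status = {s["id"]: "open" for s in stories}
    attempts = {s["id"]: 0 for s in stories}
    for story in stories:                  # fixed order: priority, then ID
        sid = story["id"]
        while status[sid] == "open" and attempts[sid] < max_attempts:
            prompt = build_prompt(story)   # no dependency on previous outputs
            output = call_tool(prompt)     # exactly one stochastic draw
            if accepted(story, output):
                status[sid] = "passing"    # atomic commit under lock, in reality
            else:
                attempts[sid] += 1
        if status[sid] != "passing":
            status[sid] = "skipped"
    return status
```

<p>Because <code>build_prompt</code> sees only the story and the static policy, a deviation in one story's output cannot condition the next story's prompt — which is the whole point of the design.</p>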
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. After five tool calls,
the prompt for the fifth includes four previous outputs — each of which
varied slightly from the last run. The fifth output is now conditioned on
a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] \;+\; \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \setminus F_{\text{before}}$ must
satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the file system level, after the fact, and that
enforcement cannot be argued with.</p>
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
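<p>The fallback decision reduces to a yield threshold. A minimal sketch, with a
hypothetical per-page character threshold (the function name and the value 50
are illustrative, not taken from the repository):</p>

```python
def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 50) -> bool:
    """Decide whether native PDF text extraction yielded enough text,
    or whether the document is likely a scan that needs OCR."""
    if page_count <= 0:
        return True
    yield_per_page = len(extracted_text.strip()) / page_count
    return yield_per_page < min_chars_per_page
```

<p>A three-page document that yields nothing triggers OCR; one that yields a few
hundred characters per page is treated as having embedded text.</p>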
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving comfortable headroom for the
prompt and response in most configurations. For documents that exceed this, the extraction
is truncated — typically to the first N characters, on the reasonable
assumption that titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
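<p>Both the estimate and the truncation are a few lines each. A sketch using the
4-characters-per-token rule of thumb (helper names are mine; only the 28,000
figure mirrors the tool's <code>DEFAULT_MAX_CONTENT_TOKENS</code>):</p>

```python
CHARS_PER_TOKEN = 4          # rough rule of thumb for English text
MAX_CONTENT_TOKENS = 28_000  # mirrors DEFAULT_MAX_CONTENT_TOKENS

def estimate_tokens(text: str) -> int:
    """Approximate token count from character count."""
    return len(text) // CHARS_PER_TOKEN

def truncate_to_budget(text: str, max_tokens: int = MAX_CONTENT_TOKENS) -> str:
    """Keep the front of the document: titles, dates, and document
    types tend to appear early."""
    return text[: max_tokens * CHARS_PER_TOKEN]
```

<p>The 30,000-character paper from the example above comes out at roughly 7,500
estimated tokens, comfortably under budget.</p>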
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
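<p>A stripped-down version of the scoring pass (the rules, weights, and
confidence threshold here are illustrative stand-ins; the real rule set is
larger and configurable):</p>

```python
import re

# (pattern, metadata field, score weight) -- illustrative subset only
RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "date", 3),
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "date", 3),
    (re.compile(r"\b(Rechnung|Invoice|Beleg|Gutschrift|Receipt)\b", re.I),
     "category", 2),
]

def heuristic_score(text: str) -> dict:
    """Accumulate a score per metadata field for every rule that fires."""
    scores: dict = {}
    for pattern, field, weight in RULES:
        if pattern.search(text):
            scores[field] = scores.get(field, 0) + weight
    return scores

def is_confident(scores: dict, threshold: int = 4) -> bool:
    """If the heuristics alone clear the threshold, skip the LLM call."""
    return sum(scores.values()) >= threshold
```

<p>A German invoice header like &ldquo;Rechnung vom 01.02.2024&rdquo; clears the
threshold from two rule hits alone, and never reaches the model.</p>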
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
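<p>The shape of that validate-and-fall-back logic, sketched. The allowed-category
set, the <code>"misc"</code> fallback, and the future-date rule here are
simplified stand-ins for what <code>llm_parsing.py</code> and
<code>llm_schema.py</code> actually do:</p>

```python
import json
import datetime

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract"}

def parse_llm_response(raw: str, today: datetime.date):
    """Validate the model's JSON against simple schema rules and apply
    fallbacks; return None only if the JSON itself is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None

    category = data.get("category")
    if category not in ALLOWED_CATEGORIES:
        category = "misc"                      # fallback for bad categories

    date = data.get("date")
    try:
        parsed = datetime.datetime.strptime(date, "%Y%m%d").date()
        if parsed > today:                     # hallucinated future date
            date = None
    except (TypeError, ValueError):
        date = None                            # missing or malformed date

    keywords = [str(k) for k in (data.get("keywords") or [])][:3]
    return {"date": date, "category": category, "keywords": keywords}
```

<p>Malformed JSON fails closed; individually bad fields fail soft, so one
hallucinated value does not discard an otherwise usable response.</p>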
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
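<p>Sanitisation and collision resolution together, in sketch form (helper names
are hypothetical; <code>filename.py</code> is the authoritative version):</p>

```python
import re

def sanitise(part: str) -> str:
    """Lowercase, hyphens only: strip anything that would need escaping
    or behave differently across filesystems."""
    return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")

def build_filename(date: str, category: str, keywords: list,
                   summary: str, existing: set) -> str:
    """Assemble YYYYMMDD-category-keywords-summary.pdf, appending a
    deterministic counter suffix when the target name already exists."""
    parts = [date, sanitise(category)] + [sanitise(k) for k in keywords]
    parts.append(sanitise(summary))
    stem = "-".join(p for p in parts if p)
    name, n = f"{stem}.pdf", 0
    while name in existing:
        n += 1
        name = f"{stem}_{n:02d}.pdf"
    return name
```

<p>Because the counter depends only on the set of names already taken, the same
batch of documents always maps to the same filenames.</p>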
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small 8B model running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Artificial Intelligence in Music Pedagogy: Curriculum Implications from a Thementag</title>
      <link>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</link>
      <pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</guid>
      <description>On 2 December 2024 I gave three workshops at HfMT Köln&amp;rsquo;s Thementag on AI and music education. The handouts covered data protection, AI tools for students, and AI in teaching. This post is the argument behind them — focused on the curriculum question that none of the tools answer on their own: what should change, and what should not?</description>
      <content:encoded><![CDATA[<p><em>On 2 December 2024, the Hochschule für Musik und Tanz Köln held a Thementag:
&ldquo;Next level? Künstliche Intelligenz und Musikpädagogik im Dialog.&rdquo; I gave three
workshops — on data protection and AI, on AI tools for students, and on AI in
teaching. The handouts from those sessions cover the practical and regulatory
ground. This post is the argument behind them: what I think changes in music
education when these tools become ambient, and what I think does not.</em></p>
<hr>
<h2 id="the-occasion">The Occasion</h2>
<p>&ldquo;Next level?&rdquo; The question mark is doing real work. The framing HfMT chose for
the day was appropriately provisional: not a declaration that AI has already
transformed music education, but an invitation to ask whether, in what
direction, and at what cost.</p>
<p>The invitations that reach me for events like this tend to come with one of two
framings. The first is enthusiasm: AI is coming, we need to get ahead of it,
here are tools your students are already using. The second is anxiety: AI is
coming, it threatens everything we do, we need to protect students from it.
Both framings are understandable. Neither is adequate to the curriculum
question, which is slower-moving and more structural than either suggests.</p>
<p>I prepared three sets of handouts. The first covered data protection — the
least glamorous topic in AI education, and the one that most directly
determines what can legally be deployed in a university setting. The second
covered AI tools for students: what exists, what it does, and what critical
thinking skills you need to use it without being used by it. The third covered
AI for instructors: where it helps, where it flatters, and where it makes
things worse.</p>
<p>This post does not recapitulate the handouts. It addresses the question I kept
returning to across all three workshops: what does this change about what a
music student needs to learn?</p>
<hr>
<h2 id="what-the-technology-actually-is">What the Technology Actually Is</h2>
<p>My physics training left me professionally uncomfortable
with hand-waving — including my own. Before discussing curriculum implications,
it is worth being specific about what these tools are.</p>
<p>The dominant paradigm in current AI — responsible for ChatGPT, for Whisper, for
Suno.AI, for Google Magenta, for the large language models whose outputs are
now visible everywhere — is the transformer architecture (Vaswani et al.,
2017). A transformer is a neural network that processes sequences by computing,
for each element, a weighted attention over all other elements. The attention
weights are learned from data. The result is a model that can capture
long-range dependencies in sequences — text, audio, musical notes — without the
recurrence that made earlier architectures difficult to train at scale.</p>
<p>What this means practically: these models are trained on very large corpora,
they learn statistical regularities, and they generate outputs that are
statistically consistent with their training distribution. They are not
reasoning from first principles. They do not &ldquo;know&rdquo; music theory the way a
student who has internalised harmonic function knows it. They have learned, from
enormous quantities of text and audio, what tends to follow what. For many tasks
this is sufficient. For tasks that require understanding of underlying structure,
it is not — and the failure modes are characteristic rather than random.</p>
<p>BERT (Devlin et al., 2018) showed that pre-training on large corpora and
fine-tuning on specific tasks produces models that outperform task-specific
architectures on a wide range of benchmarks. The same transfer-learning
paradigm has spread to audio (Whisper pre-trains on 680,000 hours of labelled
audio), to music generation (Magenta&rsquo;s transformer-based models produce
melodically coherent sequences), and to multimodal domains. The technology is
mature, improving, and available to students now. Knowing what it is — not
just what it produces — is the starting point for any sensible curriculum
discussion about it.</p>
<hr>
<h2 id="the-data-protection-constraint">The Data Protection Constraint</h2>
<p>Before any discussion of pedagogical benefit, there is a legal boundary that
most AI-in-education discussions skip over. In Germany, and in the EU more
broadly, the deployment of AI tools in a university setting is governed by the
GDPR (DSGVO, Regulation 2016/679) and, at state level in NRW, by the DSG NRW.
The constraints are not abstract: they determine which tools can be used for
which purposes with which students.</p>
<p>The core principle is data minimisation: only data necessary for a specific,
documented purpose may be collected or processed. When a student uses a
commercial AI tool to get feedback on a composition exercise and enters text
that could identify them or their institution, that data may be stored,
processed, and used for model improvement by an operator whose servers are
outside the EU. Whether such transfers remain legally valid under GDPR after
the Schrems II ruling (Court of Justice of the EU, 2020) is contested — and
&ldquo;contested&rdquo; is not a position in which an institution can comfortably require
students to use a tool.</p>
<p>The practical upshot for curriculum design is this: AI tools running on EU
servers with documented processing agreements can be integrated into formal
coursework. Commercial tools whose terms specify US-based processing and model
training on user data cannot be required of students. They can be discussed and
demonstrated, but making them mandatory puts students in a position where they
must choose between their privacy and their grade.</p>
<p>This is not a reason to avoid AI in teaching. It is a reason to be honest about
the regulatory landscape, to distinguish clearly between tools you can require
and tools you can recommend, and to make data protection literacy part of what
students learn. The skill of reading a terms-of-service document and identifying
the data flows it describes is not a legal skill — it is a general literacy
skill that matters for every digital tool a music professional will use.</p>
<hr>
<h2 id="what-changes-for-students">What Changes for Students</h2>
<p>The question I was asked most often across the three workshops was some version
of: &ldquo;If AI can already do X, should students still learn X?&rdquo;</p>
<p>The question is less simple than it appears, and the answer is not uniform
across skills.</p>
<p><strong>Skills where automation reduces the required production threshold</strong> do exist.
A student who spends weeks mastering advanced music engraving tools for score
production, when AI can generate a usable first draft from a much simpler
description, has arguably spent time that could have been better allocated
elsewhere. Not because the underlying skill is worthless — it is not — but
because the threshold of competence required to produce a working output has
dropped. The student&rsquo;s time might be more valuable spent on something that
has not been automated.</p>
<p><strong>Skills where automation creates new requirements</strong> are more interesting.
Transcription is a useful example. Automatic speech recognition — using
models like Whisper for spoken-word transcription, or specialised models
for audio-to-score music transcription — is now accurate enough to produce
usable first drafts from audio. This does not
eliminate the need for transcription skill in a music student. It changes it.
A student who cannot evaluate the output of an automatic transcription — who
cannot hear where the model has made characteristic errors, who does not have
an internalised sense of what a correct transcription looks like — is unable
to use the tool productively. The required skill has shifted from production
to evaluation. This is not a lesser skill; it is a different one, and it is
not automatically acquired alongside the ability to run the tool.</p>
<p><strong>Skills that automation cannot replace</strong> are those that depend on embodied,
situated, relational knowledge: stage presence, real-time improvisation, the
subtle negotiation of musical meaning in ensemble, the pedagogical relationship
between teacher and student. These are not beyond AI in principle. They are
far beyond it in practice, and the gap is not closing as quickly as the
generative AI discourse sometimes suggests.</p>
<p>The curriculum implication is not &ldquo;teach less&rdquo; or simply &ldquo;teach differently.&rdquo;
It is: be explicit about which category each skill falls into, and design
assessment accordingly. An assignment that asks students to produce something
AI can produce is now testing something different from what it was testing two
years ago — not necessarily nothing, but something different. The rubric should
reflect that.</p>
<hr>
<h2 id="what-changes-for-instructors">What Changes for Instructors</h2>
<p>The same three-category analysis applies symmetrically to teaching.</p>
<p><strong>Routine task automation</strong> is genuinely useful. Generating first drafts of
worksheets, producing exercises at different difficulty levels, transcribing a
recorded lesson for later analysis — these are tasks where AI can save
meaningful time without compromising the pedagogical judgment required to make
use of the output. Holmes et al. (2019) identify feedback generation as one
of the clearer wins for AI in education: systems that provide immediate,
targeted feedback at a scale that human instructors cannot match. A
transcription model listening to a student practice and flagging rhythmic
inconsistencies does not replace a teacher. It extends the feedback loop
beyond the lesson hour.</p>
<p><strong>Content generation with limits</strong> is where AI is most seductive and most
dangerous. A model like ChatGPT can produce a reading list on any topic, a
summary of any debate in the literature, a set of discussion questions for any
text. The outputs are fluent, plausible, and frequently wrong in ways that are
difficult to detect without domain expertise. Jobin et al. (2019) and
Mittelstadt et al. (2016) both document the broader concern with AI opacity
and accountability: when a model produces a confident-sounding claim, the
burden of verification falls on the user. An instructor who outsources the
construction of course materials to a model, and who lacks enough domain
knowledge to catch the errors, is not saving time — they are transferring
risk to their students.</p>
<p>Hallucinations — outputs that are plausible in form but false in content — are
not bugs in the usual sense. They are a structural consequence of how generative
models work. A model trained to predict likely next tokens will produce the most
statistically plausible continuation, not the most accurate one. For music
education, where historical facts, composer attributions, and music-theoretic
claims need to be correct, this matters. The model&rsquo;s fluency is not evidence
of its accuracy.</p>
<p><strong>Personalisation</strong> is the most-cited promise of AI in education (Luckin et
al., 2016; Roll &amp; Wylie, 2016) and the hardest to evaluate in practice. The
argument is that AI can adapt instructional content to individual learners&rsquo;
needs in real time, producing one-to-one tutoring at scale. The evidence in
formal educational settings is more mixed than the boosters suggest. What is
clear is that personalisation at scale requires data — and extensive data about
individual students&rsquo; learning trajectories raises the same data protection
concerns already discussed, in more acute form.</p>
<hr>
<h2 id="the-music-specific-question">The Music-Specific Question</h2>
<p>I want to be direct about something that came up repeatedly across the day and
that the general AI-in-education literature handles badly: music education is
not generic.</p>
<p>The skills involved — listening, performing, interpreting, composing,
improvising — have a phenomenological and embodied dimension that does not map
cleanly onto the text-prediction paradigm that most current AI systems
instantiate. Suno.AI can generate a stylistically convincing chord progression
in the manner of a named composer. It cannot explain why the progression is
convincing in the way a student who has internalised tonal function can explain
it. Google Magenta can generate a continuation of a melodic fragment that is
locally coherent. It cannot navigate the structural expectations of a sonata
form with the intentionality that a performer brings to interpreting one.</p>
<p>This is not a criticism of these tools. It is a description of what they are.
The curriculum implication is that music education must be clear about what it
is teaching: the <em>product</em> — a score, a performance, a composition — or the
<em>process and understanding</em> of which the product is evidence. Where assessment
focuses on the product, AI creates an obvious challenge. Where it focuses on
demonstrable process and understanding — including the ability to critically
evaluate AI-generated outputs — it creates new opportunities.</p>
<p>The more interesting question is whether AI tools can make musical <em>process</em>
more visible and discussable. A composition student who uses a generative model,
notices that the output is harmonically correct but rhythmically inert, and can
articulate <em>why</em> it is inert — and then revise it accordingly — has
demonstrated more sophisticated musical understanding than a student who
produces the same output without any generative assistance. The tool does not
lower the standard; it shifts where the standard is applied.</p>
<p>There is an analogy in music theory pedagogy. The availability of notation
software that can play back a student&rsquo;s harmony exercise and flag parallel
fifths changed what ear training and harmony teaching emphasise — but it did
not make harmony teaching obsolete. It changed the floor (students can check
mechanical correctness automatically) and raised the ceiling (more class time
can be spent on voice-leading logic and expressive intention). AI tools are a
larger version of the same displacement: the floor rises, the ceiling rises
with it, and the pedagogical question is always what you are doing between
the two.</p>
<hr>
<h2 id="copyright-and-academic-integrity">Copyright and Academic Integrity</h2>
<p>Two issues that crossed all three workshops and deserve direct treatment.</p>
<p>On copyright: the training data of generative music models includes copyrighted
recordings and scores, the legal status of which is actively litigated in
multiple jurisdictions. When Suno.AI generates a piece &ldquo;in the style of&rdquo;
a named composer, it is drawing on patterns extracted from that composer&rsquo;s work
— work that is under copyright in the case of living or recently deceased
composers. The output is not a direct copy, but neither is the relationship
to the training data legally settled. Music students who use these tools in
professional contexts should know that they are working in a legally uncertain
space, and institutions should not pretend otherwise.</p>
<p>On academic integrity: the issue is not that students might use AI to cheat —
they will, some of them, and they have always found ways to cheat with whatever
tools were available. The issue is that current AI policies at many institutions
are incoherent: prohibiting AI use in assessment while providing no clear
guidance on what counts as AI use, and assigning tasks where AI assistance is
undetectable and arguably appropriate. The more useful approach is to design
tasks where AI assistance is either irrelevant (because the task requires live
performance or real-time demonstration) or visible and assessed (because the
task explicitly includes reflection on how AI was used and to what effect).</p>
<hr>
<h2 id="three-things-i-came-away-with">Three Things I Came Away With</h2>
<p>After a full day of workshops, discussions, and the conversations that happen
in the corridors between sessions, I left with three positions that feel more
settled than they did in the morning.</p>
<p><strong>First</strong>: the data protection question is not separable from the pedagogical
question. Any serious curriculum discussion of AI in music education has to
start with what can legally be deployed, not with what would be useful if
constraints were not a factor. The constraints are a factor.</p>
<p><strong>Second</strong>: the skill most urgently needed — in students and in instructors —
is not AI literacy in the sense of knowing which tool to use for which task.
It is the critical capacity to evaluate AI-generated outputs: to notice what
is wrong, to understand <em>why</em> it is wrong, and to correct it. This requires
domain expertise first. You cannot critically evaluate an AI-generated harmonic
analysis if you do not understand harmonic analysis. The tools do not lower
the bar for domain knowledge. They raise the bar for its critical application.</p>
<p><strong>Third</strong>: the curriculum question is not &ldquo;how do we accommodate AI?&rdquo; It is
&ldquo;what are we actually trying to teach, and does the answer change when AI can
produce the visible output of that process?&rdquo; Answering that honestly, skill
by skill, for a full music programme, is slow work. It cannot be done at a
one-day event. But a one-day event, if it is well-designed, can start the
conversation in the right place.</p>
<p>HfMT&rsquo;s Thementag started it in the right place.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2018). BERT:
Pre-training of deep bidirectional transformers for language understanding.
<em>arXiv preprint arXiv:1810.04805</em>. <a href="https://arxiv.org/abs/1810.04805">https://arxiv.org/abs/1810.04805</a></p>
</li>
<li>
<p>Goodfellow, I., Bengio, Y., &amp; Courville, A. (2016). <em>Deep Learning.</em>
MIT Press. <a href="https://www.deeplearningbook.org">https://www.deeplearningbook.org</a></p>
</li>
<li>
<p>Holmes, W., Bialik, M., &amp; Fadel, C. (2019). <em>Artificial Intelligence in
Education: Promises and Implications for Teaching and Learning.</em> Center for
Curriculum Redesign.</p>
</li>
<li>
<p>Jobin, A., Ienca, M., &amp; Vayena, E. (2019). The global landscape of AI ethics
guidelines. <em>Nature Machine Intelligence</em>, 1, 389–399.
<a href="https://doi.org/10.1038/s42256-019-0088-2">https://doi.org/10.1038/s42256-019-0088-2</a></p>
</li>
<li>
<p>LeCun, Y., Bengio, Y., &amp; Hinton, G. (2015). Deep learning. <em>Nature</em>,
521(7553), 436–444. <a href="https://doi.org/10.1038/nature14539">https://doi.org/10.1038/nature14539</a></p>
</li>
<li>
<p>Luckin, R., Holmes, W., Griffiths, M., &amp; Forcier, L. B. (2016).
<em>Intelligence Unleashed: An Argument for AI in Education.</em> Pearson.</p>
</li>
<li>
<p>Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., &amp; Floridi, L.
(2016). The ethics of algorithms: Mapping the debate. <em>Big Data &amp; Society</em>,
3(2). <a href="https://doi.org/10.1177/2053951716679679">https://doi.org/10.1177/2053951716679679</a></p>
</li>
<li>
<p>Roll, I., &amp; Wylie, R. (2016). Evolution and revolution in artificial
intelligence in education. <em>International Journal of Artificial Intelligence
in Education</em>, 26(2), 582–599.
<a href="https://doi.org/10.1007/s40593-016-0110-3">https://doi.org/10.1007/s40593-016-0110-3</a></p>
</li>
<li>
<p>Russell, S., &amp; Norvig, P. (2020). <em>Artificial Intelligence: A Modern
Approach</em> (4th ed.). Pearson.</p>
</li>
<li>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need.
<em>Advances in Neural Information Processing Systems</em>, 30.
<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Most people
get this right on the first try, and most large language models get it
wrong, returning &ldquo;two&rdquo; with apparent confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
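<p>That procedure is short enough to write down exactly. A minimal Python sketch of the sequential scan (standing in for whatever a human actually does when counting by eye):</p>

```python
def count_char(text: str, target: str) -> int:
    """Sequential scan over a character array: increment a running
    count each time the target character appears."""
    count = 0
    for ch in text:        # left to right, one character at a time
        if ch == target:
            count += 1
    return count

print(count_char("strawberry", "r"))  # 3
```

<p>The point is not the code but its input type: a character array. As the next section shows, that is precisely what the model never receives.</p>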
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
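<p>To make the model&rsquo;s-eye view concrete, here is a toy sketch. The token IDs below are <em>invented</em> for illustration, not the real cl100k_base vocabulary:</p>

```python
# Hypothetical vocabulary fragment; the integer IDs are made up.
vocab = {"str": 496, "aw": 675, "berry": 15717}

tokens = ["str", "aw", "berry"]          # the split described above
token_ids = [vocab[t] for t in tokens]   # what the model actually receives
print(token_ids)                         # [496, 675, 15717]

# Note what is absent: nothing in this list is a letter "r". Each ID
# indexes a dense embedding vector; the characters are already gone.
```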
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
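<p>The merge rule itself fits in a few lines. A toy sketch, not a production tokeniser — the corpus and its frequencies are invented for illustration:</p>

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent symbol pairs across all sequences; return the winner."""
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def apply_merge(seq, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus: words as character sequences, repeated to fake frequency.
corpus = [list("berry")] * 5 + [list("ferry")] * 4 + [list("her")] * 3

pair = most_frequent_pair(corpus)        # ('e', 'r'): 12 occurrences
print(apply_merge(list("berry"), pair))  # ['b', 'er', 'r', 'y']
```

<p>Iterating this until the vocabulary budget is exhausted is what eventually produces splits like <code>str|aw|berry</code>: the merges track corpus frequency, not the letter boundaries a human would care about.</p>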
<p>What BPE optimises for is compression efficiency, not character-level
transparency. The token &ldquo;straw&rdquo; encodes the sequence s-t-r-a-w as a
unit, but that character sequence is not explicitly represented anywhere
inside the model once the embedding lookup has occurred. The model
receives a vector for &ldquo;straw,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;straw&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned what &ldquo;straw&rdquo; means, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>When the model errs, it characteristically returns &ldquo;two,&rdquo; not &ldquo;one&rdquo; or
&ldquo;four&rdquo; or &ldquo;none.&rdquo; This is not random noise. It is a systematic
error, and systematic errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or questions closely enough related, that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
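<p>The hypothesised arithmetic of the failure, spelled out per token:</p>

```python
tokens = ["str", "aw", "berry"]
per_token = {t: t.count("r") for t in tokens}
print(per_token)                 # {'str': 1, 'aw': 0, 'berry': 2}
print(sum(per_token.values()))   # 3

# The conjecture: the model has reliable learned character knowledge
# for "berry", little or none for "str", and so reports 2.
```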
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine
translation of rare words with subword units. <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
