<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>NLP on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/nlp/</link>
    <description>Recent content in NLP on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 12 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/nlp/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, but irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>str|aw|berry</code>; the &ldquo;r&rdquo; in the token
&ldquo;str&rdquo; is not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$, apparently because
the surface pattern &ldquo;should I [verb of locomotion] to [location]?&rdquo;
reliably generates responses about modes of transport. It is a natural
pattern-match. It is the wrong one.</p>
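<p>To make the arg-max concrete, here is a toy sketch. The interpretations,
priors, and likelihood values are invented for illustration; a real system
would have to learn them, which is exactly the hard part:</p>
<pre><code class="language-python"># Toy sketch of i* = argmax_i P(i | u, context). All numbers are
# invented; a real pragmatic reasoner would learn them from data.

CONTEXT = {"speaker_owns_car": True, "car_is_dirty": True}

# P(i): prior over readings of "should I drive to the car wash?"
PRIORS = {
    "timing":    0.9,   # "is now a good time to have my car washed?"
    "transport": 0.1,   # "how should I travel to the car wash?"
}

def likelihood(reading, context):
    # P(u | i, context): how well the utterance fits each reading here.
    if reading == "timing" and context["speaker_owns_car"]:
        return 0.9
    if reading == "transport":
        return 0.5   # the surface form also fits the transport reading
    return 0.1

def intended_reading(context):
    # P(i | u, context) is proportional to P(u | i, context) * P(i).
    scores = {i: likelihood(i, context) * p for i, p in PRIORS.items()}
    return max(scores, key=scores.get)

print(intended_reading(CONTEXT))   # -> "timing"
</code></pre>
<p>The point of the sketch is only where the numbers would have to come
from: the priors and likelihoods encode exactly the background knowledge
about speaker intent that the model lacks.</p>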
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;str&rdquo; + &ldquo;aw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
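<p>A toy version of this marginalisation, with an invented prior over
contexts, makes the mechanism visible:</p>
<pre><code class="language-python"># Sketch of A-hat = E_C[f(X, C)]. The prior P(raining) is invented.

def correct_answer(raining):          # f(X, C)
    return "no" if raining else "yes"

p_raining = 0.2    # how often it rains when people ask, in the corpus

# Marginalise out the unobserved context: answer whatever is right
# most often across plausible contexts.
model_answer = "yes" if (1 - p_raining) > p_raining else "no"

actually_raining = True                    # the context C, right now
print(model_answer)                        # -> "yes" (sensible on average)
print(correct_answer(actually_raining))    # -> "no"  (right here, right now)
</code></pre>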
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
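<p>The same toy numbers make the quantity concrete. Since the context fully
determines the answer here, $H(A \mid X, C) = 0$ and the mutual information
reduces to the residual uncertainty $H(A \mid X)$:</p>
<pre><code class="language-python">import math

# I(A; C | X) = H(A | X) - H(A | X, C) for the toy car wash example,
# with the invented prior P(raining) = 0.2 from above.

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p_raining = 0.2
h_a_given_x  = binary_entropy(p_raining)   # uncertainty without context
h_a_given_xc = 0.0                         # none once the context is known

print(f"I(A; C | X) = {h_a_given_x - h_a_given_xc:.3f} bits")  # 0.722
</code></pre>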
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
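<p>Calibration can be measured. A common summary is the expected calibration
error of Guo et al. (2017): bin predictions by stated confidence and average
the gap between confidence and accuracy. A minimal sketch, with fabricated
predictions:</p>
<pre><code class="language-python"># Expected calibration error (ECE), roughly as in Guo et al. (2017).
# The confidences and outcomes below are fabricated for illustration.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total, ece = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# High confidence, low accuracy: the car wash pattern.
confs   = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, False, False, False, True, False]
print(f"ECE = {expected_calibration_error(confs, correct):.2f}")  # 0.47
</code></pre>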
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This is not a solvable problem by making the model bigger or training it
on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
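<p>A sketch of that loop, with a hypothetical <code>get_weather</code>
wrapper and a keyword heuristic standing in for the routing decision,
shows where the fragility sits:</p>
<pre><code class="language-python"># Hedged sketch of tool-augmented answering. get_weather() is a
# hypothetical API wrapper; needs_world_state() stands in for the hard
# part: recognising that the answer depends on unobserved context.

def get_weather(location):
    return {"raining": True}          # imagine a real weather API here

def needs_world_state(question):
    # A keyword heuristic. A real model has to learn this recognition,
    # and it fails on questions that merely resemble answerable ones.
    return "today" in question or "right now" in question

def answer(question, location="home"):
    if needs_world_state(question):
        if get_weather(location)["raining"]:
            return "Not today: it is currently raining."
    return "Yes: regular washing protects the paintwork."

print(answer("Should I drive to the car wash today?"))  # grounded answer
</code></pre>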
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). <strong>On
calibration of modern neural networks.</strong> <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Artificial Intelligence in Music Pedagogy: Curriculum Implications from a Thementag</title>
      <link>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</link>
      <pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</guid>
      <description>On 2 December 2024 I gave three workshops at HfMT Köln&amp;rsquo;s Thementag on AI and music education. The handouts covered data protection, AI tools for students, and AI in teaching. This post is the argument behind them — focused on the curriculum question that none of the tools answer on their own: what should change, and what should not?</description>
      <content:encoded><![CDATA[<p><em>On 2 December 2024, the Hochschule für Musik und Tanz Köln held a Thementag:
&ldquo;Next level? Künstliche Intelligenz und Musikpädagogik im Dialog.&rdquo; I gave three
workshops — on data protection and AI, on AI tools for students, and on AI in
teaching. The handouts from those sessions cover the practical and regulatory
ground. This post is the argument behind them: what I think changes in music
education when these tools become ambient, and what I think does not.</em></p>
<hr>
<h2 id="the-occasion">The Occasion</h2>
<p>&ldquo;Next level?&rdquo; The question mark is doing real work. The framing HfMT chose for
the day was appropriately provisional: not a declaration that AI has already
transformed music education, but an invitation to ask whether, in what
direction, and at what cost.</p>
<p>The invitations that reach me for events like this tend to come with one of two
framings. The first is enthusiasm: AI is coming, we need to get ahead of it,
here are tools your students are already using. The second is anxiety: AI is
coming, it threatens everything we do, we need to protect students from it.
Both framings are understandable. Neither is adequate to the curriculum
question, which is slower-moving and more structural than either suggests.</p>
<p>I prepared three sets of handouts. The first covered data protection — the
least glamorous topic in AI education, and the one that most directly
determines what can legally be deployed in a university setting. The second
covered AI tools for students: what exists, what it does, and what critical
thinking skills you need to use it without being used by it. The third covered
AI for instructors: where it helps, where it flatters, and where it makes
things worse.</p>
<p>This post does not recapitulate the handouts. It addresses the question I kept
returning to across all three workshops: what does this change about what a
music student needs to learn?</p>
<hr>
<h2 id="what-the-technology-actually-is">What the Technology Actually Is</h2>
<p>My physics training left me professionally uncomfortable
with hand-waving — including my own. Before discussing curriculum implications,
it is worth being specific about what these tools are.</p>
<p>The dominant paradigm in current AI — responsible for ChatGPT, for Whisper, for
Suno.AI, for Google Magenta, for the large language models whose outputs are
now visible everywhere — is the transformer architecture (Vaswani et al.,
2017). A transformer is a neural network that processes sequences by computing,
for each element, a weighted attention over all other elements. The attention
weights are learned from data. The result is a model that can capture
long-range dependencies in sequences — text, audio, musical notes — without the
recurrence that made earlier architectures difficult to train at scale.</p>
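<p>A stripped-down version of that attention step fits in a few lines. This
sketch omits the learned query, key, and value projections of the full
architecture and keeps only the core operation: re-representing each element
as a weighted sum of all elements.</p>
<pre><code class="language-python">import numpy as np

def attention(X):
    """X: (sequence_length, d) array, one row per sequence element."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # pairwise similarity of elements
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                  # each row: weighted mix of all rows

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 8))      # five elements, eight dimensions
print(attention(sequence).shape)        # (5, 8)
</code></pre>
<p>The full architecture adds learned projections and multiple attention
heads, but the all-pairs comparison above is the part that replaced
recurrence.</p>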
<p>What this means practically: these models are trained on very large corpora,
they learn statistical regularities, and they generate outputs that are
statistically consistent with their training distribution. They are not
reasoning from first principles. They do not &ldquo;know&rdquo; music theory the way a
student who has internalised harmonic function knows it. They have learned, from
enormous quantities of text and audio, what tends to follow what. For many tasks
this is sufficient. For tasks that require understanding of underlying structure,
it is not — and the failure modes are characteristic rather than random.</p>
<p>BERT (Devlin et al., 2018) showed that pre-training on large corpora and
fine-tuning on specific tasks produces models that outperform task-specific
architectures on a wide range of benchmarks. The same transfer-learning
paradigm has spread to audio (Whisper pre-trains on 680,000 hours of labelled
audio), to music generation (Magenta&rsquo;s transformer-based models produce
melodically coherent sequences), and to multimodal domains. The technology is
mature, improving, and available to students now. Knowing what it is — not
just what it produces — is the starting point for any sensible curriculum
discussion about it.</p>
<hr>
<h2 id="the-data-protection-constraint">The Data Protection Constraint</h2>
<p>Before any discussion of pedagogical benefit, there is a legal boundary that
most AI-in-education discussions skip over. In Germany, and in the EU more
broadly, the deployment of AI tools in a university setting is governed by the
GDPR (DSGVO, Regulation 2016/679) and, at state level in NRW, by the DSG NRW.
The constraints are not abstract: they determine which tools can be used for
which purposes with which students.</p>
<p>The core principle is data minimisation: only data necessary for a specific,
documented purpose may be collected or processed. When a student uses a
commercial AI tool to get feedback on a composition exercise and enters text
that could identify them or their institution, that data may be stored,
processed, and used for model improvement by an operator whose servers are
outside the EU. Whether such transfers remain legally valid under GDPR after
the Schrems II ruling (Court of Justice of the EU, 2020) is contested — and
&ldquo;contested&rdquo; is not a position in which an institution can comfortably require
students to use a tool.</p>
<p>The practical upshot for curriculum design is this: AI tools running on EU
servers with documented processing agreements can be integrated into formal
coursework. Commercial tools whose terms specify US-based processing and model
training on user data cannot be required of students. They can be discussed and
demonstrated, but making them mandatory puts students in a position where they
must choose between their privacy and their grade.</p>
<p>This is not a reason to avoid AI in teaching. It is a reason to be honest about
the regulatory landscape, to distinguish clearly between tools you can require
and tools you can recommend, and to make data protection literacy part of what
students learn. The skill of reading a terms-of-service document and identifying
the data flows it describes is not a legal skill — it is a general literacy
skill that matters for every digital tool a music professional will use.</p>
<hr>
<h2 id="what-changes-for-students">What Changes for Students</h2>
<p>The question I was asked most often across the three workshops was some version
of: &ldquo;If AI can already do X, should students still learn X?&rdquo;</p>
<p>The question is less simple than it appears, and the answer is not uniform
across skills.</p>
<p><strong>Skills where automation reduces the required production threshold</strong> do exist.
A student who spends weeks mastering advanced music engraving tools for score
production, when AI can generate a usable first draft from a much simpler
description, has arguably spent time that could have been better allocated
elsewhere. Not because the underlying skill is worthless — it is not — but
because the threshold of competence required to produce a working output has
dropped. The student&rsquo;s time might be more valuable spent on something that
has not been automated.</p>
<p><strong>Skills where automation creates new requirements</strong> are more interesting.
Transcription is a useful example. Automatic speech recognition — using
models like Whisper for spoken-word transcription, or specialised models
for audio-to-score music transcription — is now accurate enough to produce
usable first drafts from audio. This does not
eliminate the need for transcription skill in a music student. It changes it.
A student who cannot evaluate the output of an automatic transcription — who
cannot hear where the model has made characteristic errors, who does not have
an internalised sense of what a correct transcription looks like — is unable
to use the tool productively. The required skill has shifted from production
to evaluation. This is not a lesser skill; it is a different one, and it is
not automatically acquired alongside the ability to run the tool.</p>
<p><strong>Skills that automation cannot replace</strong> are those that depend on embodied,
situated, relational knowledge: stage presence, real-time improvisation, the
subtle negotiation of musical meaning in ensemble, the pedagogical relationship
between teacher and student. These are not beyond AI in principle. They are
far beyond it in practice, and the gap is not closing as quickly as the
generative AI discourse sometimes suggests.</p>
<p>The curriculum implication is not &ldquo;teach less&rdquo; or simply &ldquo;teach differently.&rdquo;
It is: be explicit about which category each skill falls into, and design
assessment accordingly. An assignment that asks students to produce something
AI can produce is now testing something different from what it was testing two
years ago — not necessarily nothing, but something different. The rubric should
reflect that.</p>
<hr>
<h2 id="what-changes-for-instructors">What Changes for Instructors</h2>
<p>The same three-category analysis applies symmetrically to teaching.</p>
<p><strong>Routine task automation</strong> is genuinely useful. Generating first drafts of
worksheets, producing exercises at different difficulty levels, transcribing a
recorded lesson for later analysis — these are tasks where AI can save
meaningful time without compromising the pedagogical judgment required to make
use of the output. Holmes et al. (2019) identify feedback generation as one
of the clearer wins for AI in education: systems that provide immediate,
targeted feedback at a scale that human instructors cannot match. A
transcription model listening to a student practice and flagging rhythmic
inconsistencies does not replace a teacher. It extends the feedback loop
beyond the lesson hour.</p>
<p><strong>Content generation with limits</strong> is where AI is most seductive and most
dangerous. A model like ChatGPT can produce a reading list on any topic, a
summary of any debate in the literature, a set of discussion questions for any
text. The outputs are fluent, plausible, and frequently wrong in ways that are
difficult to detect without domain expertise. Jobin et al. (2019) and
Mittelstadt et al. (2016) both document the broader concern with AI opacity
and accountability: when a model produces a confident-sounding claim, the
burden of verification falls on the user. An instructor who outsources the
construction of course materials to a model, and who lacks enough domain
knowledge to catch the errors, is not saving time — they are transferring
risk to their students.</p>
<p>Hallucinations — outputs that are plausible in form but false in content — are
not bugs in the usual sense. They are a structural consequence of how generative
models work. A model trained to predict likely next tokens will produce the most
statistically plausible continuation, not the most accurate one. For music
education, where historical facts, composer attributions, and music-theoretic
claims need to be correct, this matters. The model&rsquo;s fluency is not evidence
of its accuracy.</p>
<p><strong>Personalisation</strong> is the most-cited promise of AI in education (Luckin et
al., 2016; Roll &amp; Wylie, 2016) and the hardest to evaluate in practice. The
argument is that AI can adapt instructional content to individual learners&rsquo;
needs in real time, producing one-to-one tutoring at scale. The evidence in
formal educational settings is more mixed than the boosters suggest. What is
clear is that personalisation at scale requires data — and extensive data about
individual students&rsquo; learning trajectories raises the same data protection
concerns already discussed, in more acute form.</p>
<hr>
<h2 id="the-music-specific-question">The Music-Specific Question</h2>
<p>I want to be direct about something that came up repeatedly across the day and
that the general AI-in-education literature handles badly: music education is
not generic.</p>
<p>The skills involved — listening, performing, interpreting, composing,
improvising — have a phenomenological and embodied dimension that does not map
cleanly onto the text-prediction paradigm that most current AI systems
instantiate. Suno.AI can generate a stylistically convincing chord progression
in the manner of a named composer. It cannot explain why the progression is
convincing in the way a student who has internalised tonal function can explain
it. Google Magenta can generate a continuation of a melodic fragment that is
locally coherent. It cannot navigate the structural expectations of a sonata
form with the intentionality that a performer brings to interpreting one.</p>
<p>This is not a criticism of these tools. It is a description of what they are.
The curriculum implication is that music education must be clear about what it
is teaching: the <em>product</em> — a score, a performance, a composition — or the
<em>process and understanding</em> of which the product is evidence. Where assessment
focuses on the product, AI creates an obvious challenge. Where it focuses on
demonstrable process and understanding — including the ability to critically
evaluate AI-generated outputs — it creates new opportunities.</p>
<p>The more interesting question is whether AI tools can make musical <em>process</em>
more visible and discussable. A composition student who uses a generative model,
notices that the output is harmonically correct but rhythmically inert, and can
articulate <em>why</em> it is inert — and then revise it accordingly — has
demonstrated more sophisticated musical understanding than a student who
produces the same output without any generative assistance. The tool does not
lower the standard; it shifts where the standard is applied.</p>
<p>There is an analogy in music theory pedagogy. The availability of notation
software that can play back a student&rsquo;s harmony exercise and flag parallel
fifths changed what ear training and harmony teaching emphasise — but it did
not make harmony teaching obsolete. It changed the floor (students can check
mechanical correctness automatically) and raised the ceiling (more class time
can be spent on voice-leading logic and expressive intention). AI tools are a
larger version of the same displacement: the floor rises, the ceiling rises
with it, and the pedagogical question is always what you are doing between
the two.</p>
<hr>
<h2 id="copyright-and-academic-integrity">Copyright and Academic Integrity</h2>
<p>Two issues that crossed all three workshops and deserve direct treatment.</p>
<p>On copyright: the training data of generative music models includes copyrighted
recordings and scores, the legal status of which is actively litigated in
multiple jurisdictions. When Suno.AI generates a piece &ldquo;in the style of&rdquo;
a named composer, it is drawing on patterns extracted from that composer&rsquo;s work
— work that is under copyright in the case of living or recently deceased
composers. The output is not a direct copy, but neither is the relationship
to the training data legally settled. Music students who use these tools in
professional contexts should know that they are working in a legally uncertain
space, and institutions should not pretend otherwise.</p>
<p>On academic integrity: the issue is not that students might use AI to cheat —
they will, some of them, and they have always found ways to cheat with whatever
tools were available. The issue is that current AI policies at many institutions
are incoherent: prohibiting AI use in assessment while providing no clear
guidance on what counts as AI use, and assigning tasks where AI assistance is
undetectable and arguably appropriate. The more useful approach is to design
tasks where AI assistance is either irrelevant (because the task requires live
performance or real-time demonstration) or visible and assessed (because the
task explicitly includes reflection on how AI was used and to what effect).</p>
<hr>
<h2 id="three-things-i-came-away-with">Three Things I Came Away With</h2>
<p>After a full day of workshops, discussions, and the conversations that happen
in the corridors between sessions, I left with three positions that feel more
settled than they did in the morning.</p>
<p><strong>First</strong>: the data protection question is not separable from the pedagogical
question. Any serious curriculum discussion of AI in music education has to
start with what can legally be deployed, not with what would be useful if
constraints were not a factor. The constraints are a factor.</p>
<p><strong>Second</strong>: the skill most urgently needed — in students and in instructors —
is not AI literacy in the sense of knowing which tool to use for which task.
It is the critical capacity to evaluate AI-generated outputs: to notice what
is wrong, to understand <em>why</em> it is wrong, and to correct it. This requires
domain expertise first. You cannot critically evaluate an AI-generated harmonic
analysis if you do not understand harmonic analysis. The tools do not lower
the bar for domain knowledge. They raise the bar for its critical application.</p>
<p><strong>Third</strong>: the curriculum question is not &ldquo;how do we accommodate AI?&rdquo; It is
&ldquo;what are we actually trying to teach, and does the answer change when AI can
produce the visible output of that process?&rdquo; Answering that honestly, skill
by skill, for a full music programme, is slow work. It cannot be done at a
one-day event. But a one-day event, if it is well-designed, can start the
conversation in the right place.</p>
<p>HfMT&rsquo;s Thementag started it in the right place.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2018). BERT:
Pre-training of deep bidirectional transformers for language understanding.
<em>arXiv preprint arXiv:1810.04805</em>. <a href="https://arxiv.org/abs/1810.04805">https://arxiv.org/abs/1810.04805</a></p>
</li>
<li>
<p>Goodfellow, I., Bengio, Y., &amp; Courville, A. (2016). <em>Deep Learning.</em>
MIT Press. <a href="https://www.deeplearningbook.org">https://www.deeplearningbook.org</a></p>
</li>
<li>
<p>Holmes, W., Bialik, M., &amp; Fadel, C. (2019). <em>Artificial Intelligence in
Education: Promises and Implications for Teaching and Learning.</em> Center for
Curriculum Redesign.</p>
</li>
<li>
<p>Jobin, A., Ienca, M., &amp; Vayena, E. (2019). The global landscape of AI ethics
guidelines. <em>Nature Machine Intelligence</em>, 1, 389–399.
<a href="https://doi.org/10.1038/s42256-019-0088-2">https://doi.org/10.1038/s42256-019-0088-2</a></p>
</li>
<li>
<p>LeCun, Y., Bengio, Y., &amp; Hinton, G. (2015). Deep learning. <em>Nature</em>,
521(7553), 436–444. <a href="https://doi.org/10.1038/nature14539">https://doi.org/10.1038/nature14539</a></p>
</li>
<li>
<p>Luckin, R., Holmes, W., Griffiths, M., &amp; Forcier, L. B. (2016).
<em>Intelligence Unleashed: An Argument for AI in Education.</em> Pearson.</p>
</li>
<li>
<p>Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., &amp; Floridi, L.
(2016). The ethics of algorithms: Mapping the debate. <em>Big Data &amp; Society</em>,
3(2). <a href="https://doi.org/10.1177/2053951716679679">https://doi.org/10.1177/2053951716679679</a></p>
</li>
<li>
<p>Roll, I., &amp; Wylie, R. (2016). Evolution and revolution in artificial
intelligence in education. <em>International Journal of Artificial Intelligence
in Education</em>, 26(2), 582–599.
<a href="https://doi.org/10.1007/s40593-016-0110-3">https://doi.org/10.1007/s40593-016-0110-3</a></p>
</li>
<li>
<p>Russell, S., &amp; Norvig, P. (2020). <em>Artificial Intelligence: A Modern
Approach</em> (4th ed.). Pearson.</p>
</li>
<li>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need.
<em>Advances in Neural Information Processing Systems</em>, 30.
<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Most people
get this right on the first try; most large language models get it wrong,
returning &ldquo;two&rdquo; with apparent confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
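<p>The procedure, written out, is a few lines of code and needs nothing but
the characters themselves:</p>
<pre><code class="language-python">def count_char(text, target):
    count = 0
    for ch in text:          # scan left to right over the characters
        if ch == target:
            count += 1       # increment on each match
    return count

print(count_char("strawberry", "r"))   # 3
</code></pre>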
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
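<p>You can inspect the split directly with OpenAI&rsquo;s <code>tiktoken</code>
library. The exact segmentation depends on the encoding; for
<code>cl100k_base</code> it is the three-token split described above:</p>
<pre><code class="language-python">import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print([enc.decode([i]) for i in ids])   # ['str', 'aw', 'berry']
</code></pre>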
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
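<p>A toy version of the training loop makes the mechanism concrete. This
sketch follows the subword formulation of Sennrich et al. (2016) on an
invented three-word corpus; real tokenisers run the same loop over bytes at
a much larger scale:</p>
<pre><code class="language-python">from collections import Counter

def most_frequent_pair(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(vocab, pair):
    # Replace every adjacent occurrence of the pair with one symbol.
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i &lt; len(symbols):
            if i + 1 &lt; len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Words as space-separated symbols, with invented corpus frequencies.
vocab = {"s t r a w": 5, "s t r o n g": 4, "b e r r y": 6}
for _ in range(4):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print(pair, "->", list(vocab))
# merges: ('s','t'), ('st','r'), ('b','e'), ('be','r')
</code></pre>
<p>After a few merges, frequent fragments like <code>str</code> exist as
single symbols, and their internal letters are no longer separate units.</p>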
<p>What BPE optimises for is compression efficiency, not character-level
transparency. The token &ldquo;str&rdquo; encodes the sequence s-t-r as a
unit, but that character sequence is not explicitly represented anywhere
inside the model once the embedding lookup has occurred. The model
receives a vector for &ldquo;str,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;str&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned how &ldquo;str&rdquo; is used, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>The model typically returns &ldquo;two r&rsquo;s,&rdquo; not &ldquo;one&rdquo; or &ldquo;four&rdquo; or
&ldquo;none.&rdquo; This is not random noise. It is a systematic error, and systematic
errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or questions closely enough related, that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). <strong>Neural machine
translation of rare words with subword units.</strong> <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
