<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/</link>
    <description>Recent content on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>There Is an App for That — Until There Isn&#39;t</title>
      <link>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</link>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</guid>
      <description>German health insurance will reimburse a mental health app within days but cannot provide a therapist within six months. Last week, psychotherapy fees were cut by 4.5%. Baumol&amp;rsquo;s cost disease — originally about why string quartets get relatively more expensive — explains why the app gold rush and the collapse of mental health provision are the same phenomenon.</description>
      <content:encoded><![CDATA[<p>Someone vibe coded an app that tells you how many layers to wear today. It has 85,000 users. Someone else tracks her eyelash styles — every new set gets a photo and a note about the method. A father built Storypot: his kids drag emoji into a virtual pot and the app generates a bedtime story. A product manager made Standup Buddy, which randomises who talks first in a daily meeting. That is the entire feature.
These are not bad things. Some of them are genuinely lovely — Storypot in particular. The layers app clearly meets a need, given 85,000 people agree. I have built tools like this myself — I automated my concert setlist workflow and <a href="/posts/setlist-to-playlist/">wrote about it on this blog</a> — and the feeling of compressing a forty-minute ritual into four minutes of machine-assisted execution is real and satisfying.</p>
<p>There is a term for this now. Karpathy coined it in early 2025: vibe coding. You describe what you want, the model writes the code, you run it, you fix what breaks by describing the fix, and at no point do you necessarily understand what the code does. The barrier to building software has not been lowered so much as removed. A single person with an afternoon and a language model can ship what would have required a team and a quarter, two years ago.</p>
<p>Meanwhile. In Germany, the average wait from an initial consultation to the start of psychotherapy is 142 days — nearly five months — according to a BPtK analysis of statutory insurance billing data <a href="#ref-1">[1]</a>. The Telefonseelsorge — the crisis line, the last resort — handled 1.2 million calls in 2024. It is staffed by approximately 7,700 volunteers and funded primarily by the Protestant and Catholic churches. Its financing is described, in its own institutional language, as <em>äußerst angespannt</em> — extremely strained <a href="#ref-2">[2]</a>. Six days ago, on April 1, psychotherapy fees in Germany were cut by 4.5% <a href="#ref-3">[3]</a>. The thesis of this post is structural, not moral. There is a class of work that scales, and a class of work that does not. Our entire economy of attention — cultural, financial, technological — is optimised for the first class. The second class is not merely neglected. It is being made structurally more expensive, in a precise economic sense, by the very productivity gains that make the first class so intoxicating. And the policy apparatus, facing this structural pressure, is doing exactly what you would predict: it is funding apps.</p>
<p>The economist William Baumol explained the mechanism in 1966. It has a name, and the name is a diagnosis.</p>
<hr>
<h2 id="the-seduction-of-leverage">The Seduction of Leverage</h2>
<p>What makes vibe coding culturally significant is not the code. It is the leverage. A single developer, aided by a language model, can produce software that reaches millions of users. The marginal cost of an additional user approaches zero. The output scales without bound while the input — one person, one prompt, one afternoon — stays fixed. This is the defining characteristic of automatable work: the ratio of output to input can grow without limit.</p>
<p>This is not new. Software has always had this property. What is new is that the barrier to producing software has collapsed. You no longer need to understand data structures, or networking, or the programming language. You need an idea and a few hours. The productivity frontier has shifted so dramatically that the interesting constraint is no longer <em>can I build it</em> but <em>should anyone use it</em>. The cultural response has been euphoric. Communities, podcasts, courses, manifestos. People who have never written a line of code are shipping products. I am not interested in dismissing this. The ability to build is a form of agency, and more people having it is not, in itself, a problem. The problem is what the euphoria obscures.</p>
<h2 id="what-therapy-actually-requires">What Therapy Actually Requires</h2>
<p>A psychotherapy session has the following structure. One therapist sits with one patient for approximately fifty minutes. The therapist listens, observes, formulates, responds. The patient speaks, reflects, resists, revises. The therapeutic alliance — the quality of the relationship between therapist and patient — is one of the most robust predictors of treatment outcome, across modalities, across conditions, across decades of research <a href="#ref-4">[4]</a>. This is not a feature that can be optimised away. It is the mechanism of action. When a meta-analysis finds that the specific technique matters less than the relationship — that CBT, psychodynamic, and humanistic therapies produce roughly equivalent outcomes when the alliance is strong — it is telling you that the human in the room is not an implementation detail. The human in the room <em>is</em> the intervention.</p>
<p>You cannot parallelise this. A therapist cannot see two patients simultaneously without degrading the thing that makes the session work. You cannot batch it — twelve people in a room is group therapy, which is a different intervention with different dynamics and different limitations. You cannot cache it — the session is not a retrieval operation over stored responses but an emergent interaction that depends on what happens in the room that day. The irreducible unit of therapy is: one trained human, fully present, for one hour, with one other human. This has not changed since Freud&rsquo;s consulting room on Berggasse 19, and no plausible technological development will change it, because the presence <em>is</em> the treatment. A therapist working full-time can see roughly twenty-five to thirty patients per week. That is the ceiling. It is set by the biology of attention and the ethics of care, not by inefficiency.</p>
<h2 id="baumols-cost-disease">Baumol&rsquo;s Cost Disease</h2>
<p>In 1966, the economists William Baumol and William Bowen published <em>Performing Arts, The Economic Dilemma</em>, a study of why orchestras, theatre companies, and dance troupes were perpetually in financial crisis despite growing audiences and rising cultural prestige <a href="#ref-5">[5]</a>. Their diagnosis was precise. A string quartet requires four musicians and approximately forty minutes to perform Beethoven&rsquo;s Op. 131. This was true in 1826 and is true in 2026. The productivity of the quartet — measured in output per unit of labour input — has not increased. It cannot increase. The performance <em>is</em> the labour.</p>
<p>Meanwhile, the productivity of a textile worker, a steelworker, a software developer has increased by orders of magnitude. Wages in the productive sectors rise because productivity rises. Wages in the stagnant sectors (Baumol&rsquo;s term for sectors whose productivity cannot rise) must keep pace — not because musicians deserve parity as a matter of justice, though they may, but because if they do not keep pace, musicians will leave for sectors that pay more. The quartet must compete in the same labour market as the factory and the tech company.</p>
<p>The result: the relative cost of live performance rises without bound. Not because musicians got worse. Not because audiences stopped caring. But because everything else got cheaper, and the quartet cannot. Baumol later generalised the result beyond the performing arts to all services in which the labour itself constitutes the product: education, healthcare, legal services, and — centrally for our purposes — psychotherapy <a href="#ref-6">[6]</a>. A therapy session is a string quartet. The labour is the product. The productivity cannot increase. The cost, relative to the scalable economy, rises every time the scalable economy gets more productive. And vibe coding is a massive productivity shock to the scalable economy.</p>
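<p>The mechanism is simple enough to simulate. The following sketch uses invented numbers (2% annual productivity growth in the progressive sector, none in the stagnant one, a single shared wage index) purely to show the shape of the curve, not to calibrate anything:</p>

```python
# Stylised two-sector sketch of Baumol's cost disease. All numbers are
# illustrative. Sector A's productivity grows 2% a year; the quartet's
# does not. Wages track the productive sector, because both sectors
# hire from the same labour market.
years = 50
productivity_growth = 0.02

wage = 1.0                   # common wage index across both sectors
goods_per_worker = 1.0       # sector A output per worker-hour
concerts_per_worker = 1.0    # fixed forever: the performance is the labour

for _ in range(years):
    goods_per_worker *= 1 + productivity_growth
    wage *= 1 + productivity_growth   # wages follow the productive sector

unit_cost_goods = wage / goods_per_worker        # stays at 1.0
unit_cost_concert = wage / concerts_per_worker   # grows with the wage

print(f"after {years} years:")
print(f"  unit cost, goods:   {unit_cost_goods:.2f}")
print(f"  unit cost, concert: {unit_cost_concert:.2f}")
```

<p>After fifty years the concert costs roughly 2.7 times as much relative to goods, without anyone having done anything wrong. That is the whole disease.</p>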
<h2 id="there-is-an-app-for-that">There Is an App for That</h2>
<p>In 2019, the German government passed the Digitale-Versorgung-Gesetz, creating a fast-track approval process for <em>Digitale Gesundheitsanwendungen</em> — digital health applications, or DiGA. The idea: apps that can be prescribed by a doctor and reimbursed by statutory health insurance, just like medication. A patient walks into a practice, receives a prescription code, downloads the app, and the Krankenkasse pays <a href="#ref-7">[7]</a>. As of mid-2025, the BfArM directory lists roughly 58 DiGA. Nearly half target psychiatric conditions — depression, anxiety, insomnia, burnout. Names like deprexis, HelloBetter, Selfapy. A patient who would wait 142 days for a therapist can get a DiGA prescribed the same afternoon.</p>
<p>The pricing structure deserves attention. In the first twelve months after listing, manufacturers set their own price. The average: €541 per prescription <a href="#ref-8">[8]</a>. Some exceeded €2,000. After the first year, negotiated prices drop to an average of roughly €226 — but by then, the insurance has already paid the introductory rate for every early adopter. Total statutory health insurance spending on DiGA since 2020: €234 million. That spending grew 71% between 2023 and 2024 <a href="#ref-9">[9]</a>. Here is the number that should sit next to that one. A single outpatient psychotherapy session costs the insurance system approximately €115. The €234 million spent on DiGA since 2020 could have funded over two million therapy sessions — enough for roughly 80,000 complete courses of 25-session treatment. And here is the evidence question. Only 12 of the 68 DiGA that have entered the directory demonstrated a proven positive care effect at the time of inclusion. The rest were listed provisionally, with twelve months to produce evidence. About one in six were subsequently delisted — removed from the directory because the evidence did not materialise <a href="#ref-10">[10]</a>.</p>
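<p>The comparison above is a back-of-envelope calculation, and it checks out in four lines (figures as quoted from the cited reports; the 25-session course length follows the framing above):</p>

```python
# Sanity check on the figures quoted above. All inputs are the numbers
# cited in the text, not independent data.
diga_spending = 234_000_000   # € statutory spending on DiGA since 2020
session_cost = 115            # € per outpatient psychotherapy session
course_length = 25            # sessions per complete course, as above

sessions = diga_spending / session_cost
courses = sessions / course_length

print(f"{sessions:,.0f} sessions, or {courses:,.0f} complete courses")
```
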
<p>I want to be precise about what I am and am not saying. Some DiGA have a real evidence base. Structured CBT exercises delivered digitally can produce measurable short-term symptom improvement — I reviewed the Woebot trial data in an <a href="/posts/ai-companion-loneliness-ironic-process/">earlier post on AI companions</a> and took those results seriously. A DiGA that delivers psychoeducation and behavioural activation exercises is a tool, and tools can be useful. But a tool and a therapeutic relationship are not the same product delivered through different channels. They are different products. The policy framework treats them as substitutable — the patient who cannot access a therapist receives an app instead. The substitution is not a clinical judgement. It is a structural inevitability: facing the impossibility of scaling therapy, the system reaches for the scalable alternative, because the scalable alternative is what the incentive structure rewards. This is not a corruption story. This is Baumol&rsquo;s cost disease expressed through health policy. The system is doing exactly what the theory predicts.</p>
<h2 id="the-fear-and-the-compliance">The Fear and the Compliance</h2>
<p>There is an irony at the centre of the current discourse about AI and work that I want to name, because I think it is underexamined. People are afraid of AI. Specifically, they are afraid it will take their jobs. The surveys confirm this consistently — Gallup, Pew, the European Commission&rsquo;s Eurobarometer — significant fractions of the working population in every developed country report anxiety about AI-driven job displacement.</p>
<p>And yet. The same people — not a different demographic, not a separate population, the <em>same people</em> — are enthusiastically using AI to do their work. They use language models to write their emails, their reports, their presentations. They vibe code tools for their teams. They let AI draft their strategy documents, summarise their meetings, compose their performance reviews. They celebrate the productivity gain. They post about it. This is not hypocrisy. It is something more interesting: a revealed preference for automation that contradicts a stated preference against it. The fear is about structural displacement — losing the <em>role</em>. The compliance is about local optimisation — doing the <em>task</em> more efficiently. No one wakes up and decides to automate themselves out of a job. They automate one task at a time, each automation locally sensible, until the job is a shell around an AI core. And all of this activity — the fear, the adoption, the discourse, the think pieces, the congressional hearings — is directed at automatable work. The kind of work where AI is a plausible substitute.</p>
<p>No one is afraid that AI will take the crisis counsellor&rsquo;s job. No one is vibe coding a replacement for a psychiatric nurse. The work that is collapsing is not collapsing because AI replaced it. It is collapsing because it was never scalable, never attracted the capital or the talent that scalable work attracts, and every productivity gain in the scalable sector makes the unscalable sector relatively more expensive and harder to staff. The discourse about AI and jobs is, in this sense, exactly backwards. The threat is not that AI will replace the work that matters most. The threat is that it will make the work that matters most <em>invisible</em> — by making everything else so cheap and fast and abundant that we forget the expensive, slow, irreducibly human work exists at all.</p>
<h2 id="the-political-arithmetic">The Political Arithmetic</h2>
<p>On March 11, 2026, the Erweiterter Bewertungsausschuss — the body that sets fee schedules for outpatient care in Germany — imposed a flat 4.5% cut on nearly all psychotherapeutic service fees, effective April 1 <a href="#ref-3">[3]</a>. The health insurers had originally demanded 10%. Germany spends €4.6 billion annually on outpatient psychotherapy — roughly 1.5% of total statutory health insurance expenditure. The fee cut applies to this budget. The average therapist surplus — what remains after practice costs — is approximately €52 per hour <a href="#ref-11">[11]</a>. The cut is not large in percentage terms. It is large in the context of a profession that is already among the lowest-paid in outpatient medicine. Nearly half a million people signed a petition against the cuts. There were protests in Berlin, Leipzig, Hanover, Hamburg, Stuttgart, Munich. The Kassenärztliche Bundesvereinigung filed a lawsuit. The Bundespsychotherapeutenkammer called the decision <em>skandalös</em> <a href="#ref-12">[12]</a>.</p>
<p>What makes this particularly striking is the sequence. The coalition agreement signed by CDU/CSU and SPD in May 2025 explicitly addresses mental health — securing psychotherapy training financing, needs-based planning for child and adolescent psychotherapy, crisis intervention rights for psychotherapists, and a suicide prevention law. The BPtK itself welcomed the agreement as giving mental health a <em>neuen Stellenwert</em>, a new significance <a href="#ref-13">[13]</a>. Less than a year later, the same government&rsquo;s arbitration body cuts psychotherapy fees by 4.5%. The stated commitment and the enacted policy point in opposite directions. This is not unusual in politics. What is unusual is that it maps so precisely onto Baumol&rsquo;s mechanism: the coalition agreement acknowledges the problem in language; the fee schedule acknowledges it in arithmetic. And the arithmetic wins, because the arithmetic always wins when the work does not scale. The <em>Bedarfsplanung</em>, the needs-based planning system that determines how many psychotherapy seats are approved per region, was partially reformed in 2019 after decades of operating on 1990s-era ratios. The reform added roughly 800 seats. The BPtK considers it still fundamentally inadequate <a href="#ref-14">[14]</a>.</p>
<p>The arithmetic is plain. DiGA spending: growing 71% year on year. Psychotherapy fees: cut by 4.5%. The direction is unambiguous. Invest in the scalable. Cut the unscalable. And the damage compounds in a way that the policy apparatus appears not to understand, or not to care about. A therapist who leaves the profession because €52 per hour is no longer viable does not return when the cut is reversed. The training pipeline for a new clinical psychologist runs six to eight years from university admission to licensure. Over forty thousand accredited psychotherapists serve the system today <a href="#ref-14">[14]</a>. Every one who leaves creates a gap measured in decades, not budget cycles. The Telefonseelsorge, staffed by volunteers and funded by the churches, is not a mental health system. It is what remains when the mental health system is not there. Treating it as a substitute — treating 7,700 volunteers as adequate coverage for a country of 84 million — is not a policy position. It is an admission that the actual policy has failed.</p>
<h2 id="the-uncomfortable-part">The Uncomfortable Part</h2>
<p>Here is where I should, by the conventions of the form, propose a solution. I should say something about funding, about training pipelines, about recognising care work as infrastructure rather than a cost centre.</p>
<p>I think those things are true. I think we should pay therapists more, not less. I think Baumol&rsquo;s cost disease means we should <em>expect</em> this to be expensive and fund it anyway, because the alternative — accepting that people in crisis will wait 142 days while the scalable economy celebrates another productivity milestone — is a failure of collective priorities so basic that it should be uncomfortable to state plainly. But I am also the person who automated his setlist workflow and was satisfied by the compression. I vibe code things. I use AI tools daily. I am inside the attention gradient, not observing it from above. The part of me that finds leverage intoxicating is the same part that writes this blog, and I do not think I am unusual in this.</p>
<p>The structural isomorphism is exact: Baumol&rsquo;s string quartet, the therapist&rsquo;s fifty minutes, the crisis counsellor&rsquo;s phone call at 3am. The labour is the product. The product does not scale. The cost rises. The talent flows elsewhere. And the policy, rather than resisting the gradient, follows it — funding apps, cutting fees, digitising what cannot be digitised without changing what it is. The layers app reaches 85,000 users. The therapy app is reimbursed within the week. The therapist is available in five months, if at all.</p>
<p>I do not have a clean resolution to offer. I have a diagnosis — Baumol&rsquo;s cost disease, applied to the attention economy of a civilisation that has discovered how to make scalable work almost free — and an observation: the political system is not counteracting the disease. It is accelerating it. The quartet still needs four musicians. The session still needs the therapist in the room. The phone still needs someone to answer it. Nothing we are building will change this. The question is whether we notice before the people who needed the answer stop calling.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Bundespsychotherapeutenkammer. <em>Psychisch Kranke warten 142 Tage auf eine psychotherapeutische Behandlung</em>. BPtK. <a href="https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/">https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/</a></p>
<p><span id="ref-2"></span>[2] Evangelisch-Lutherische Kirche in Norddeutschland (2025). <em>Finanzierung der Telefonseelsorge ist äußerst angespannt</em>. <a href="https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt">https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt</a></p>
<p><span id="ref-3"></span>[3] Kassenärztliche Bundesvereinigung (2026). <em>Paukenschlag: KBV klagt gegen massive Kürzungen psychotherapeutischer Leistungen</em>. <a href="https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen">https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen</a></p>
<p><span id="ref-4"></span>[4] Flückiger, C., Del Re, A. C., Wampold, B. E., &amp; Horvath, A. O. (2018). The alliance in adult psychotherapy: A meta-analytic synthesis. <em>Psychotherapy</em>, 55(4), 316–340. <a href="https://doi.org/10.1037/pst0000172">https://doi.org/10.1037/pst0000172</a></p>
<p><span id="ref-5"></span>[5] Baumol, W. J., &amp; Bowen, W. G. (1966). <em>Performing Arts, The Economic Dilemma: A Study of Problems Common to Theater, Opera, Music and Dance</em>. Twentieth Century Fund.</p>
<p><span id="ref-6"></span>[6] Baumol, W. J. (2012). <em>The Cost Disease: Why Computers Get Cheaper and Health Care Doesn&rsquo;t</em>. Yale University Press.</p>
<p><span id="ref-7"></span>[7] Bundesinstitut für Arzneimittel und Medizinprodukte. <em>DiGA-Verzeichnis</em>. <a href="https://diga.bfarm.de/de">https://diga.bfarm.de/de</a></p>
<p><span id="ref-8"></span>[8] GKV-Spitzenverband (2025). <em>Bericht des GKV-Spitzenverbandes über die Inanspruchnahme und Entwicklung der Versorgung mit Digitalen Gesundheitsanwendungen</em>. Reported in: MTR Consult. <a href="https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany">https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany</a></p>
<p><span id="ref-9"></span>[9] Heise Online (2025). <em>Insurers critique high costs and low benefits of prescription apps</em>. <a href="https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html">https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html</a></p>
<p><span id="ref-10"></span>[10] Goeldner, M., &amp; Gehder, S. (2024). Digital Health Applications (DiGAs) on a Fast Track: Insights From a Data-Driven Analysis of Prescribable Digital Therapeutics in Germany From 2020 to Mid-2024. <em>JMIR mHealth and uHealth</em>. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/">https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/</a></p>
<p><span id="ref-11"></span>[11] Taz (2026). <em>Weniger Honorar für Psychotherapie</em>. <a href="https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/">https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/</a></p>
<p><span id="ref-12"></span>[12] Bundespsychotherapeutenkammer (2026). <em>Gemeinsam gegen die Kürzung psychotherapeutischer Leistungen</em>. <a href="https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/">https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/</a></p>
<p><span id="ref-13"></span>[13] Bundespsychotherapeutenkammer (2025). <em>Koalitionsvertrag gibt psychischer Gesundheit neuen Stellenwert</em>. <a href="https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/">https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/</a></p>
<p><span id="ref-14"></span>[14] Bundespsychotherapeutenkammer. <em>Reform der Bedarfsplanung</em>. <a href="https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/">https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Model Has No Seahorse: Vocabulary Gaps and What They Reveal About LLMs</title>
      <link>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/seahorse-emoji-vocabulary-gaps-llm/</guid>
      <description>There is no seahorse emoji in Unicode. Ask a large language model to produce one and watch what happens. The failure is not a hallucination in the ordinary sense — the model knows what it wants to output but cannot output it. That distinction matters.</description>
      <content:encoded><![CDATA[<p>Try a simple experiment. Open any of the major language model interfaces and ask it, as plainly as possible, to produce a seahorse emoji. What you get back will probably be one of a small number of things. The model might confidently output something that is not a seahorse emoji — a horse face, a tropical fish, a dolphin, sometimes a spiral shell. It might produce a cascade of marine-themed emoji as if searching through an aquarium before eventually settling on something. It might hedge at length and then get it wrong anyway. Occasionally it will self-correct after producing an incorrect token. What it almost never does is say: there is no seahorse emoji in Unicode, so I cannot produce one.</p>
<p>That silence is interesting. Not because the model is being evasive, and not because this is an especially important use case — nobody&rsquo;s critical infrastructure depends on seahorse emoji production. It is interesting because it reveals a specific structural feature of how language models relate to their own capabilities. The gap between what a model knows about the world and what it knows about its own output vocabulary is a real gap, and it shows up in ways that are worth understanding carefully.</p>
<p>I am going to work through the seahorse incident, a companion failure involving a morphologically valid but corpus-rare English word, and what both of them suggest about a class of self-knowledge failure that I think is underappreciated compared to ordinary hallucination.</p>
<h2 id="the-incident">The incident</h2>
<p>In 2025, Vgel published an analysis of exactly this failure <a href="#ref-1">[1]</a>. The piece is worth reading in full, but the core finding is worth unpacking here.</p>
<p>When a model is asked to produce a seahorse emoji, something specific happens at the level of the model&rsquo;s internal representations. Using logit lens analysis — a technique that projects the model&rsquo;s intermediate-layer activations into vocabulary space, as if each layer were already producing the final output <a href="#ref-4">[4]</a> — it is possible to track what the model&rsquo;s &ldquo;working answer&rdquo; looks like at each layer of the transformer. What Vgel found is that in the late layers, the model does construct something that functions like a &ldquo;seahorse + emoji&rdquo; representation. The semantic work is happening correctly. The model is not confused about whether seahorses are real animals, not confused about whether emoji are a thing, not confused about whether animals commonly have emoji representations. It has assembled the correct semantic vector for what it wants to output.</p>
<p>The failure is not in the assembly. It is in the final step: the projection from that assembled representation back into vocabulary space. This projection is called the lm_head, the final linear layer that maps from the model&rsquo;s embedding space to a probability distribution over its output vocabulary. That vocabulary is a fixed set of tokens, established at training time. There is no seahorse emoji token. There never was one, because there is no seahorse emoji in Unicode.</p>
<p>What the lm_head does, faced with a query vector that has no exact match in vocabulary space, is find the nearest token — the one whose embedding is closest to the query, in whatever metric the model has learned during training. That nearest token is some other emoji, and it gets output. The model has no mechanism at this stage to detect that the nearest token is not actually what was requested. It cannot distinguish between &ldquo;I found the seahorse emoji&rdquo; and &ldquo;I found the best available approximation to the seahorse emoji.&rdquo; The output is produced with the same confidence either way.</p>
<p>Vgel&rsquo;s analysis covered behaviour across multiple models — GPT-4o, Claude Sonnet, Gemini Pro, and Llama 3 were all in the mix. The specific wrong answer differed between models, which itself is revealing: different training corpora and different tokenisation schemes produce different nearest-neighbour relationships in embedding space, so each model&rsquo;s fallback lands somewhere different in the emoji neighbourhood. What is consistent across models is that none of them correctly diagnosed the gap. They all behaved as if the limitation were in their world-knowledge rather than in their output vocabulary. None of them said: &ldquo;I know what you want, and it does not exist as a token I can emit.&rdquo;</p>
<p>Some of the failure modes are more elaborate than a simple wrong substitution. One pattern Vgel documented is the cascade: the model generates a sequence of increasingly approximate emoji as accumulated context pushes it away from each successive wrong answer, eventually settling into a cycle or giving up. Another is the confident placeholder — an emoji that looks like it might be a box or a question mark symbol, as if the model has internally noted a gap but cannot produce a useful message about it. A third, rarer pattern is genuine partial self-correction: the model produces the wrong emoji, generates a few tokens of commentary, then backtracks. Even that self-correction is not reliable, because the model is correcting based on world-knowledge (&ldquo;wait, that is a dolphin, not a seahorse&rdquo;) rather than vocabulary-knowledge (&ldquo;there is no seahorse token&rdquo;), so it keeps trying until it either runs into a token limit or produces something it can convince itself is close enough.</p>
<h2 id="the-structural-failure-vocabulary-completeness-assumption">The structural failure: vocabulary completeness assumption</h2>
<p>Here is the core conceptual point, stated as cleanly as I can.</p>
<p>Language models have two distinct knowledge representations that are routinely conflated, by users and, it seems, by the models themselves. The first is world knowledge: facts about entities, their properties, and their relationships. A model trained on large quantities of text knows an enormous amount about the world — including, in this case, that seahorses are animals, that emoji are Unicode characters, and that many animals have standard emoji representations. This knowledge is encoded in the weights through training on documents that describe these things.</p>
<p>The second is the output vocabulary: the set of tokens the model can actually emit. This vocabulary is a fixed artifact, established at training time by a tokeniser — usually a byte-pair encoding scheme, as described by Sennrich et al. <a href="#ref-5">[5]</a> and discussed in more detail in my <a href="/posts/strawberry-tokenisation/">tokenisation post</a>. A new emoji added to Unicode after the training cutoff does not exist in the vocabulary. An emoji that never made it into Unicode does not exist in the vocabulary. The vocabulary is closed, and there is no runtime mechanism for expanding it.</p>
<p>The problem is that the model treats these two representations as if they were the same. If world-knowledge says &ldquo;seahorses should have emoji,&rdquo; the model implicitly assumes its output vocabulary contains a seahorse emoji. It does not distinguish between &ldquo;I know X exists&rdquo; and &ldquo;I can express X.&rdquo; I am going to call this the vocabulary completeness assumption: the implicit belief that the expressive vocabulary is complete with respect to world knowledge, that if the model knows about a thing, it can produce a token for that thing.</p>
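<p>The conflation is easy to state in code. A toy sketch, with two made-up lookup tables standing in for a model&rsquo;s weights and its tokeniser vocabulary — nothing here queries a real model:</p>

```python
# Two tiny stand-ins for representations a real model keeps in its
# weights (world knowledge) and in its tokeniser (output vocabulary).

# World knowledge: what the training text *says* about each concept.
# The web plausibly asserts an emoji for both animals; it is wrong for one.
world_says_has_emoji = {"dolphin": True, "seahorse": True}

# Output vocabulary: which emoji can actually be emitted. Closed, fixed
# at training time. There is no seahorse glyph to put here.
emittable = {"dolphin": "\U0001F42C", "seahorse": None}

for concept in ("dolphin", "seahorse"):
    knows = world_says_has_emoji[concept]          # "I know X exists"
    can_express = emittable[concept] is not None   # "I can emit X"
    if knows and not can_express:
        print(f"gap: believes '{concept}' is expressible but cannot emit it")
```

<p>The vocabulary completeness assumption is precisely the (invalid) inference from <code>knows</code> to <code>can_express</code>: two different tables, silently treated as one.</p>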
<p>This assumption is mostly true. For a well-trained model on high-resource languages and common domains, the vocabulary is rich enough that the gap between what the model knows and what it can express is small. The failure shows up precisely in the edge cases: rare Unicode characters, neologisms below the frequency threshold for robust tokenisation, domain-specific symbols that appear in training text only as descriptions rather than as the symbols themselves. Those cases reveal an assumption that was always there but almost never triggered.</p>
<p>The failure is structurally different from ordinary hallucination, and I think this distinction matters. When a model confabulates a fact — invents a citation, misattributes a quote, generates a plausible-but-false historical claim — it is producing incorrect world-knowledge. The cure, in principle, is better training data, better calibration, and retrieval augmentation that can replace the model&rsquo;s internal knowledge with verified external knowledge. These are hard problems, but they are the right class of remedies for factual hallucination.</p>
<p>When a model fails on vocabulary completeness, the world-knowledge is correct. The model knows it should produce a seahorse emoji. The limitation is in the output channel. No amount of factual training data will fix this, because the problem is not about facts. Retrieval augmentation will not help either, unless the system also includes a vocabulary lookup step that can report what tokens exist. The fix, if there is one, is a different kind of introspective capability: explicit metadata about the output vocabulary, available to the model at generation time.</p>
<p>A useful analogy: imagine a translator who has a perfect conceptual understanding of a French neologism that has no English equivalent, and who is tasked with writing in English. The translator knows the concept; the English word genuinely does not exist yet. A careful translator would write &ldquo;there is no direct English equivalent; the closest is approximately&hellip;&rdquo; and explain the gap. A less careful translator would pick the nearest English word and output it as if it were a direct translation, without flagging the gap to the reader. Language models are almost uniformly the less careful translator in this analogy, and the problem is architectural: they have no mechanism for detecting that they are approximating rather than translating.</p>
<h2 id="a-formal-language-perspective">A formal language perspective</h2>
<p>For those who prefer their failures stated in type signatures: the decoder step in a standard transformer is a function that maps a hidden state vector to a probability distribution over a fixed token vocabulary <code>V = {t₁, …, tₙ}</code> <a href="#ref-5">[5]</a>. Every output is an element of <code>V</code>. The type system has no room for a &ldquo;near miss&rdquo; or an &ldquo;I cannot express this precisely&rdquo; — the output is always a token, drawn from the inventory established at training time.</p>
<p>This is a closed-world assumption in the formal sense <a href="#ref-6">[6]</a>: the system treats any concept not representable as an element of <code>V</code> as simply absent. There is no seahorse emoji token, so the model&rsquo;s generation step has no way to represent &ldquo;seahorse emoji&rdquo; as a distinct, exact concept. It can only represent &ldquo;nearest token to seahorse emoji in embedding space,&rdquo; which it does silently, with the same confidence it would report for a precise match.</p>
<p>The mismatch is between two representations: the model&rsquo;s internal semantic space — continuous, high-dimensional, geometrically capable of representing &ldquo;seahorse + emoji&rdquo; as a coherent position — and its output type, which is a discrete, finite categorical distribution. The <code>lm_head</code> projection is a quantisation, and at the edges of the vocabulary it is a lossy one. For most semantic positions the nearest token is close enough; for missing emoji, low-frequency morphological forms, or post-training neologisms, the quantisation error is large and nothing in the architecture flags it.</p>
<p>A richer output type would distinguish precise matches from approximations — an <code>Exact&lt;Token&gt;</code> versus an <code>Approximate&lt;Token&gt;</code>, or in standard option-type terms, a generation step that can return <code>None</code> when no token in <code>V</code> adequately represents the requested concept. The information needed to make this distinction already exists inside the model: the logit lens analysis shows that the geometry of the final transformer layer carries signal about the quality of the approximation <a href="#ref-4">[4]</a>. It is simply discarded in the projection step. Making it visible at the interface level is an architectural decision, not a training question, which is why &ldquo;make the model more calibrated about facts&rdquo; addresses the wrong layer of the problem.</p>
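<p>The option-typed projection can be sketched directly. This is a minimal illustration with invented two-dimensional token positions — a real model&rsquo;s <code>lm_head</code> operates over thousands of dimensions with a softmax, not a distance threshold — but the shape of the interface is the point:</p>

```python
import math

# A tiny, made-up vocabulary V with invented embedding positions.
token_embeddings = {
    "\U0001F42C": (0.9, 0.1),  # dolphin
    "\U0001F420": (0.8, 0.3),  # tropical fish
    "\U0001F984": (0.1, 0.9),  # unicorn
}

def project(semantic_position, max_distance=0.25):
    """Return (token, exact). When every token in V is too far from the
    requested concept, return (None, False) instead of guessing silently --
    the Option-type output the standard decoder lacks."""
    best, best_d = None, math.inf
    for token, vec in token_embeddings.items():
        d = math.dist(semantic_position, vec)
        if d < best_d:
            best, best_d = token, d
    if best_d > max_distance:
        return None, False   # nothing in V adequately represents this
    return best, True

# A "seahorse emoji" position near, but not inside, the fish cluster:
print(project((0.6, 0.6)))    # (None, False): the gap is flagged
print(project((0.88, 0.15)))  # a genuine near-match resolves to a token
```

<p>The threshold here is the crude stand-in for the signal the logit lens shows already exists in the final-layer geometry; the architectural change is making it part of the output type rather than discarding it.</p>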
<h2 id="the-ununderstandable-companion">The &ldquo;ununderstandable&rdquo; companion</h2>
<p>Shortly after the seahorse emoji incident circulated, a Reddit thread titled &ldquo;it&rsquo;s just the seahorse emoji all over again&rdquo; collected user reports of a structurally similar failure on the English word &ldquo;ununderstandable&rdquo; <a href="#ref-2">[2]</a>. I cannot independently verify every report in that thread — Reddit threads being what they are — but the documented failure pattern is consistent with the seahorse analysis and worth working through because it extends the picture in a useful direction.</p>
<p>&ldquo;Ununderstandable&rdquo; is morphologically valid English. The prefix <em>un-</em> combines productively with adjectives: uncomfortable, unbelievable, unmanageable, unkind. &ldquo;Understandable&rdquo; is an unambiguous adjective. &ldquo;Ununderstandable&rdquo; means what it looks like it means, constructed by exactly the same rule that gives you all the other <em>un-</em> words. There is nothing wrong with it grammatically or semantically.</p>
<p>It is also extremely rare. I cannot find it in any standard reference corpus or mainstream English dictionary. The word has not achieved the frequency threshold required for widespread attestation, which means that a model trained on a broad web corpus will have seen it at most a handful of times, if at all. Its tokenisation is likely fragmented — split across subword units in a way that does not give the model a clean, unified representation of it as a single lexical item. The BPE tokeniser will have handled &ldquo;ununderstandable&rdquo; as a sequence of subword pieces, and the model will have very few training examples from which to learn how those pieces combine in practice.</p>
<p>The failure mode the Reddit thread documented is the same as the seahorse failure in structure, but it operates in morphological space rather than emoji space. The model has learned that <em>un-</em> prefixation is productive, and it has learned that &ldquo;understandable&rdquo; is a word. But its trained representations do not include &ldquo;ununderstandable&rdquo; as a robust lexical entry, because the word is below the minimum frequency threshold for that. When asked to use or define &ldquo;ununderstandable,&rdquo; models in the thread were reported to do one of three things. They would deny it is a word, often confidently, pointing to the absence of a dictionary entry. They would confidently define it incorrectly, conflating it with &ldquo;misunderstandable&rdquo; or &ldquo;incomprehensible&rdquo; in ways that lose the morphological compositionality. Or they would produce grammatically awkward output when forced to use it in a sentence — the kind of output you get when the model is stitching together fragments without a reliable whole-word representation to anchor the construction.</p>
<p>The denial case is the most interesting to me, because it is the model doing something structurally revealing. It is applying world-knowledge (dictionaries do not widely contain this word; therefore it is not a word) to override the conclusion it should reach from morphological knowledge (the word is transparently compositional and valid by productive rules I have learned). The model is, in effect, saying &ldquo;I cannot recognise this because it is not in my training data,&rdquo; which is closer to the truth than the seahorse case but still not quite right. The word is valid, not merely an error — it is just rare.</p>
<p>The Reddit title is apt. Both incidents are examples of the model failing to distinguish between two different epistemic situations: &ldquo;this thing does not exist and I should say so&rdquo; versus &ldquo;this thing exists but I cannot produce it cleanly.&rdquo; In the seahorse case, the emoji genuinely does not exist, and the right answer is to say so. In the &ldquo;ununderstandable&rdquo; case, the word genuinely is valid, and the right answer is to use it or explain the frequency gap. Both failures come from the same source: the model conflates world-knowledge with expressive vocabulary, and has no reliable way to interrogate which of those two representations is actually limiting it.</p>
<h2 id="what-this-means-for-users">What this means for users</h2>
<p>The practical implication is narrow but important. Asking a language model &ldquo;do you have X?&rdquo; — where X is a token, a word, an emoji, a symbol — is not a reliable diagnostic for whether the model can produce X. The model will often affirm things it cannot actually output, and sometimes deny things it can. This is not a matter of the model being dishonest in any meaningful sense. It is a matter of the model not having explicit access to its own vocabulary as a queryable data structure. Its self-description of its capabilities is generated by the same weights that have the gaps, and those weights have no introspective pathway to the tokeniser&rsquo;s vocabulary table.</p>
<p>This matters beyond emoji. The same failure structure applies in any domain where world-knowledge and expressive vocabulary diverge. A model that has read about a proprietary technical symbol used in a narrow field but has no token for that symbol will fail the same way. A model that knows about a recently coined term that postdates its training cutoff will fail the same way. The failure is quiet — the model does not throw an error, does not flag uncertainty, does not produce a visibly broken output. It produces something plausible and wrong.</p>
<p>The broader point is that vocabulary completeness is one instance of a general class of LLM self-knowledge failures. Models do not have accurate introspective access to their own weights, their training data coverage, or their capability boundaries. They can describe themselves in natural language, but those descriptions are generated by the same weights that contain the gaps and the biases. A model that does not know it lacks a seahorse token cannot tell you it lacks one, because the mechanism by which it would report that absence is the same mechanism that has the absence. This connects to the wider theme in this blog of AI systems that are confidently wrong about things that require them to reason about their own limitations — see the <a href="/posts/car-wash-grounding/">grounding failure post</a> and its companion piece on <a href="/posts/car-wash-walk/">pragmatic inference</a> for related examples, and the <a href="/posts/ai-detectors-systematic-minds/">AI detectors post</a> for a case where self-knowledge failures about writing style have real social consequences.</p>
<p>The fix is not &ldquo;make models more honest&rdquo; in the abstract. Honesty calibration training teaches models to express uncertainty about facts, which is useful and real progress on hallucination. But vocabulary gaps are not factual uncertainty — the model is not uncertain about whether the seahorse emoji exists, in any meaningful sense. What is needed is a different kind of capability: models with explicit, queryable metadata about their own output vocabularies, and a generation-time mechanism that can consult that metadata before reporting a confident result. Some retrieval-augmented architectures are beginning to approach this by externalising certain kinds of knowledge into structured databases that the model can query explicitly. The same logic could, in principle, apply to vocabulary.</p>
<h2 id="the-last-mile">The last mile</h2>
<p>There is something almost poignant about the seahorse failure, if you think about what is actually happening at the level of computation. The model is trying very hard. Its internal representation of &ldquo;seahorse emoji&rdquo; is, according to the logit lens analysis, correct. The semantic intent is assembled with care across the model&rsquo;s late layers. The failure is in the last mile — the vocabulary projection — and the model has no way to detect it. It cannot distinguish between &ldquo;I successfully retrieved the seahorse emoji&rdquo; and &ldquo;I retrieved the nearest available approximation to what I was looking for.&rdquo; From the model&rsquo;s operational perspective, it completed the task.</p>
<p>This is not a uniquely LLM problem, by the way. The same structure shows up in human communication all the time. We reach for a word that does not exist in our active vocabulary, produce the closest available word, and often do not notice the substitution. The difference is that a careful human communicator can usually, with effort, recognise that they are approximating — they have some access to the felt sense of the gap, the slight misfit between intent and expression. Language models, as currently built, do not have this. The gap leaves no trace that the model can inspect.</p>
<p>The specific failure mode described here is tractable. Future architectures may address it through better vocabulary coverage, explicit vocabulary metadata, or output-side verification that compares what was generated against what was requested at a representational level. The transformer circuits work <a href="#ref-3">[3]</a> that underlies the logit lens analysis gives us increasingly precise tools for understanding where failures happen inside a model. As those tools mature, the vocabulary completeness assumption will become less of a blind spot and more of a known failure mode with known mitigations.</p>
<p>For now, the seahorse is useful precisely as a demonstration case: simple, memorable, easy to reproduce, and pointing clearly at a structural issue. It is not interesting because anyone needs a seahorse emoji. It is interesting because it is a clean instance of a model being confidently wrong about something that requires it to know what it cannot do — and that is a harder problem than knowing what it does not know.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Vogel, T. (2025). <em>Why do LLMs freak out over the seahorse emoji?</em> <a href="https://vgel.me/posts/seahorse/">https://vgel.me/posts/seahorse/</a></p>
<p><span id="ref-2"></span>[2] Reddit user (2025). It&rsquo;s just the seahorse emoji all over again. <em>r/OpenAI</em>. <a href="https://www.reddit.com/r/OpenAI/comments/1rkbeel/">https://www.reddit.com/r/OpenAI/comments/1rkbeel/</a> (reported; not independently verified)</p>
<p><span id="ref-3"></span>[3] Elhage, N., et al. (2021). A mathematical framework for transformer circuits. <em>Transformer Circuits Thread</em>. <a href="https://transformer-circuits.pub/2021/framework/index.html">https://transformer-circuits.pub/2021/framework/index.html</a></p>
<p><span id="ref-4"></span>[4] Nostalgebraist. (2020). Interpreting GPT: the logit lens. <a href="https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/">https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/</a></p>
<p><span id="ref-5"></span>[5] Sennrich, R., Haddow, B., &amp; Birch, A. (2016). Neural machine translation of rare words with subword units. <em>Proceedings of ACL 2016</em>, 1715–1725.</p>
<p><span id="ref-6"></span>[6] Reiter, R. (1978). On closed world data bases. In H. Gallaire &amp; J. Minker (Eds.), <em>Logic and Data Bases</em> (pp. 55–76). Plenum Press, New York.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-01</strong>: Updated reference [1]: author name to &ldquo;Vogel, T.&rdquo; and title to the published blog post title &ldquo;Why do LLMs freak out over the seahorse emoji?&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Oppenheimer Didn&#39;t Have an Acceptable Use Policy</title>
      <link>https://sebastianspicker.github.io/posts/ai-warfare-anthropic-atom-bomb/</link>
      <pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-warfare-anthropic-atom-bomb/</guid>
      <description>Anthropic has drawn a public line on military use of its models. The physics community spent the better part of the twentieth century working out what it means to draw that line after you have already built the thing. As a physicist watching this unfold, I find the parallels clarifying and the differences more unsettling than the parallels.</description>
<content:encoded><![CDATA[<p><em>Physicists inherit, along with the formalism and the problem sets, a particular
kind of guilt. The profession has been working
through its relationship to weapons, state violence, and the gap between
scientific capability and ethical readiness since August 1945. This post is about
why I think the current moment in AI closely resembles that history, and why
Anthropic&rsquo;s decision to draw a line matters even if — especially if — you think
the line is imperfect.</em></p>
<hr>
<h2 id="what-just-happened">What Just Happened</h2>
<p>The news this week involves Anthropic and the question of whether and how large
language models should be available for military applications. Anthropic has stepped
back from a path toward unrestricted military use and restated a position: there
are things their models will not be used for, weapons development and autonomous
lethal systems among them. The response from parts of the defence and national
security community has been predictable — naïve, idealistic, unilateral disarmament,
your adversaries will not make the same choice.</p>
<p>These are not stupid objections. I want to take them seriously. But I also want
to explain why, as someone who spent years studying physics in the shadow of the
Manhattan Project&rsquo;s legacy, the framing of those objections sounds very familiar,
and why that familiarity is not reassuring.</p>
<hr>
<h2 id="what-the-physicists-thought-they-were-doing">What the Physicists Thought They Were Doing</h2>
<p>The scientists who built the atomic bomb were not, for the most part, indifferent
to what they were building. Many of them were refugees from European fascism.
They understood what a Nazi atomic weapon would mean. The urgency was real, the
moral reasoning was coherent, and the conclusion — build it before the other side
does — followed from the premises.</p>
<p>What the premises did not include was adequate weight for what happens after the
technical problem is solved.</p>
<p>By the time the Trinity device was detonated in July 1945, Germany had already
surrendered. The original justification — prevent the Nazis from getting there
first — had evaporated. What remained was a weapon, an infrastructure for building
more weapons, and a strategic and political logic that had largely moved beyond
the scientists&rsquo; control. The Franck Report, written by a group of Manhattan Project
scientists in June 1945, argued against using the bomb on a Japanese city without
prior demonstration. It was ignored. Oppenheimer, who chaired the Interim
Committee&rsquo;s scientific panel, signed off on the Hiroshima target recommendation.
He spent the rest of his life with that.</p>
<p>The lesson most physics students absorb from this history is something like: the
scientists were not the decision-makers, the decision was going to be made anyway,
and the presence of principled scientists in the room was better than their absence.
The system was going to do what it was going to do; all you could influence was
the margin.</p>
<p>I believed this for a long time. I am less sure of it now.</p>
<hr>
<h2 id="the-analogy-and-its-limits">The Analogy and Its Limits</h2>
<p>The comparison between the atom bomb and artificial general intelligence — or even
current large language models at the capability frontier — is made often enough
that it has become a cliché, which is usually the point at which people stop
thinking carefully about it. Let me try to be specific about where the analogy
holds and where it breaks.</p>
<p><strong>Where it holds:</strong></p>
<p>The core structural similarity is this: a small number of researchers, working
at the frontier of a capability that most people do not understand, are making
decisions that will constrain or enable uses they cannot fully anticipate, in
contexts they will not control. The physics community in 1942 had a clearer view
of what fission could do than any political or military decision-maker. The AI
research community in 2026 has a clearer view of what large language models can
do — and of what more capable successors will do — than most of the people who
will deploy them.</p>
<p>That epistemic position is not morally neutral. Knowing more than the decision-makers
does not mean you have unlimited responsibility, but it does mean you have more
responsibility than someone who does not know. Feigning ignorance about downstream
applications is not available to you.</p>
<p>The second similarity: once the capability exists and is demonstrated, the
normative landscape changes. Before Trinity, the question of whether to build nuclear
weapons was still open. After Trinity, it was no longer open in the same way — the
knowledge existed, the infrastructure existed, the geopolitical expectations had
already been set. The arms race was not caused by the bomb, but the bomb&rsquo;s existence
changed what the arms race meant and how fast it moved. We are somewhere in the
vicinity of that transition with frontier AI systems. The question of whether to
build them is still formally open for any given company or research group, but the
landscape is already different from what it was five years ago.</p>
<p><strong>Where it breaks:</strong></p>
<p>The atom bomb was a single-use physical object whose primary function was destroying
things. Large language models are general-purpose cognitive tools with a very wide
range of applications, the majority of which are not weapons-relevant. This matters
because it changes the policy space. You could, in principle, have not built the
atom bomb. You cannot, in principle, not build language models while still having
language models for medicine, education, scientific research, and the other
applications that are clearly beneficial. The dual-use problem for AI is more
severe, not less severe, than it was for physics.</p>
<p>The other important difference: the Manhattan Project was conducted in secret, under
wartime conditions, with a relatively well-defined adversarial structure. The current
AI landscape involves many organisations, many countries, public publication of
research, and no clear equivalent of the Axis/Allied framing. The game theory
of &ldquo;if we don&rsquo;t do it, they will&rdquo; is more complicated when &ldquo;they&rdquo; is not a single
identifiable adversary with symmetric interests.</p>
<hr>
<h2 id="what-anthropics-line-actually-says">What Anthropic&rsquo;s Line Actually Says</h2>
<p>Setting aside for a moment whether the line is in the right place, there is something
worth examining in the act of drawing it at all.</p>
<p>The standard criticism — that a unilateral ethical commitment in a competitive
field simply advantages less scrupulous actors — assumes that ethical commitments
are pure costs with no countervailing benefits. This is the argument the weapons
lobby has made about every arms control proposal in the history of arms control,
and it has sometimes been right. Unilateral disarmament without reciprocal
commitments can leave you worse off. This is not a trivial point.</p>
<p>But it smuggles in an assumption that deserves scrutiny: that the relevant
competition is primarily between AI companies, and that the only variable that
matters is relative capability. If you accept that framing, then any ethical
constraint is a handicap and the only rational strategy is to develop as fast as
possible with as few restrictions as possible.</p>
<p>That framing has a name in physics. It is called the arms race equilibrium, and
the physics community spent thirty years understanding what it produces. It produces
capability accumulation without a corresponding development of the normative
frameworks, institutional safeguards, and mutual verification mechanisms that
make the capability survivable. It produces Hiroshima, then the hydrogen bomb,
then MIRV, then the point at which the accumulated arsenal is large enough to
end complex life on Earth several times over, at which point you negotiate the
first real arms limitation treaties — from a starting position of vastly more
deployed capability than anyone needed and vastly less trust than anyone wanted.</p>
<p>The question Anthropic is implicitly asking is whether there is a path that does
not look like that. The answer is not obvious. But I think it is worth asking.</p>
<hr>
<h2 id="what-the-physicists-should-have-done">What the Physicists Should Have Done</h2>
<p>Here is the counterfactual that haunts the Manhattan Project&rsquo;s legacy: what if
the scientific community had treated the ethics of the bomb as seriously as the
physics, from the beginning?</p>
<p>Not naïvely. Not by refusing to work on it and ceding the possibility of influencing
it. But by making the ethical analysis parallel to the technical analysis, by
treating the question of use as a scientific question with as much rigour as the
question of yield, and by using the epistemic authority that came from being the
people who understood the capability to push, hard, for the normative frameworks
that did not yet exist.</p>
<p>Some scientists did this. Szilard circulated a petition, signed by 70 Manhattan
Project scientists, against the use of the bomb on Japanese cities without prior
warning. It did not work. But the effort was real, and the record of the effort
matters — both as evidence that the scientific community was not unanimous in its
acquiescence and as a model for what engaged dissent looks like from inside a
project that is going to proceed regardless.</p>
<p>What most scientists did not do, and what the profession largely did not do in the
decades that followed, was treat the ethical work as primary. Physics built its
identity around the technical capability — the extraordinary achievement of
understanding nature at the deepest level — and treated the ethical consequences
as someone else&rsquo;s department. The bomb was the military&rsquo;s problem. The cold war was
the politicians&rsquo; problem. The physicists kept doing physics.</p>
<p>This was comfortable and it was wrong.</p>
<hr>
<h2 id="what-i-want-from-ai-researchers">What I Want From AI Researchers</h2>
<p>I want AI researchers to do what the physicists did not, and to do it now, while
the critical decisions are still open.</p>
<p>Anthropic drawing a line is one version of this. It is imperfect — the line is
in a particular place, the enforcement mechanisms are limited, the competitive
dynamics are real. But it is a claim that the people who built the capability
have ongoing responsibility for how it is used, and that some uses are outside
the bounds of what should happen regardless of what is technically possible.</p>
<p>That claim is not naïve. It is, in fact, the claim the Franck Report was making
in 1945: that capability does not determine use, that scientists have a voice in
the normative question, and that using that voice is part of the job rather than
a distraction from it.</p>
<p>What I want beyond that is for the AI research community to treat the ethics
as primary rather than as footnotes. Not ethics review boards that approve research
post hoc. Not responsible AI teams that are consulted after the capability has
been developed. A genuine integration of the normative analysis into the research
process itself — asking, at each stage, what this capability makes possible and
who benefits from that possibility and who pays the cost.</p>
<p>The physics community got to August 1945 before it had that conversation in earnest.
The conversation has been going on ever since, and it has produced important
institutional frameworks — the Bulletin of the Atomic Scientists, the arms control
treaties, the export control regimes, the norms against first use. These things
matter. But they were built in reaction to a capability that had already been
deployed, and the shape of everything that followed was constrained by that
starting point.</p>
<p>The AI community is not there yet. The starting point is still being established.
That is what makes this moment consequential, and what makes Anthropic&rsquo;s line —
wherever exactly it is drawn — worth defending as an act of principle rather than
dismissing as an act of commercial positioning.</p>
<hr>
<h2 id="a-note-on-the-of-our-time-framing">A Note on the &ldquo;Of Our Time&rdquo; Framing</h2>
<p>I am aware that comparisons to the atom bomb are sometimes used to generate
unwarranted urgency, to short-circuit careful reasoning by invoking the most
extreme case. I want to be clear about what I am and am not claiming.</p>
<p>I am not claiming that current large language models are as immediately dangerous
as nuclear weapons. They are not.</p>
<p>I am claiming that the structural situation — researchers at the capability
frontier, ahead of the policy frameworks, making decisions that will constrain
future options, in a competitive environment with adversarial dynamics — is
similar enough that the lessons of the Manhattan Project period are directly
relevant. Not as prophecy. As a guide to the kind of mistakes that are available
to make.</p>
<p>The physicists had plenty of warning. Szilard had been worried since 1933.
Einstein wrote to Roosevelt in 1939. The Franck Report was written before
Hiroshima. The warnings were on the record. What was not on the record was
a scientific community that treated those warnings as actionable constraints
on its own behaviour rather than as advisories for policymakers.</p>
<p>That is the thing I want to be different this time.</p>
<hr>
<h2 id="references">References</h2>
<p>Franck, J. et al. (1945). <em>Report of the Committee on Political and Social Problems
(The Franck Report).</em> National Archives, Record Group 77.</p>
<p>Oppenheimer, J. R. (1965). Interview on <em>The Decision to Drop the Bomb</em> (NBC
documentary). Recorded 1965.</p>
<p>Rhodes, R. (1986). <em>The Making of the Atomic
Bomb.</em> Simon &amp; Schuster.</p>
<p>Russell, B., &amp; Einstein, A. (1955). <em>The Russell–Einstein Manifesto.</em>
Pugwash Conferences on Science and World Affairs.</p>
<p>Szilard, L. (1945). <em>A Petition to the President of the United States.</em> July 17,
1945. Available via the Atomic Heritage Foundation.</p>
<p>Bulletin of the Atomic Scientists (1945–present). <em>Doomsday Clock statements.</em>
<a href="https://thebulletin.org/doomsday-clock/">https://thebulletin.org/doomsday-clock/</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>More Context Is Not Always Better</title>
      <link>https://sebastianspicker.github.io/posts/more-context-not-always-better/</link>
      <pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/more-context-not-always-better/</guid>
      <description>The intuition that feeding a language model more information improves its outputs is wrong often enough to matter. Here is why, and what to do about it.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>There is a popular intuition in LLM engineering that context is a resource
you should always spend freely: more background, more history, more examples —
inevitably better answers. This intuition is wrong often enough to be
dangerous. Context has a signal-to-noise structure, attention has a
position-dependent bias, and the architecture that processes all of it scales
quadratically. Adding irrelevant tokens does not leave performance neutral; it
actively degrades it. This post argues for <em>structured sparsity</em> as a design
principle: give a model exactly the context it needs for the decision it is
making right now, and nothing else.</p>
<hr>
<h2 id="background">Background</h2>
<p>The &ldquo;more is more&rdquo; assumption has an obvious origin. Transformers were
designed to condition on sequences, and every new token in the context window
is, in principle, available to every attention head. The release of models with
128k, 200k, and now million-token context windows reinforced the story: the
constraint is gone, so pack in everything you have.</p>
<p>Two lines of empirical and theoretical work complicate this story.</p>
<p><strong>The lost-in-the-middle problem.</strong> Liu et al. <a href="#ref-1">[1]</a> showed that retrieval
accuracy on multi-document question answering degrades sharply when the
relevant passage appears in the <em>middle</em> of a long context, compared to the
beginning or end. On 20-document prompts, accuracy varied by more than
20 percentage points depending on where the relevant document was placed, not
because the model lacked the information, but because it was buried. The effect is
consistent across model families and persists at model scales where you would
not expect it.</p>
<p><strong>The complexity argument.</strong> Standard scaled dot-product attention <a href="#ref-2">[2]</a> is</p>
\[
  \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]<p>The \( QK^{\top} \) product is \( O(n^2) \) in sequence length \( n \).
Inference-time KV-cache mitigates compute cost, but memory grows linearly and
the softmax normalises over a denominator that grows with \( n \). A head
attending to 128k tokens is averaging over a vastly noisier signal than one
attending to 512.</p>
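<p>To anchor the complexity argument, here is a minimal NumPy sketch of scaled dot-product attention. The intermediate score matrix is the \( O(n^2) \) object, and the softmax normalises each row over all \( n \) keys:</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; `scores` is the O(n^2) intermediate."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # shape (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all n keys
    return weights @ V

rng = np.random.default_rng(0)
n, d_k = 512, 64
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)                     # (512, 64)
```

<p>Doubling \( n \) quadruples the size of the score matrix while the output shape stays fixed, which is the whole cost argument in one line.</p>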
<hr>
<h2 id="the-idea">The Idea</h2>
<h3 id="context-as-a-noisy-channel">Context as a noisy channel</h3>
<p>Think of the information reaching a given attention head as a noisy channel in
the Shannon sense. The signal is the subset of tokens that are actually
relevant to the current decoding step; the rest is noise. Signal-to-noise ratio
is</p>
\[
  \text{SNR} = \frac{|\mathcal{S}|}{n - |\mathcal{S}|}
\]<p>where \( \mathcal{S} \subset \{1, \ldots, n\} \) is the set of relevant token
positions and \( n \) is total context length. For a fixed task, \( |\mathcal{S}| \)
is roughly constant. So SNR is a <em>decreasing</em> function of \( n \). Adding
irrelevant context makes the problem strictly harder in this framing — it does
not leave it unchanged.</p>
<p>This is a toy model, but it captures something real. The softmax in the
attention head distributes a probability mass of 1.0 across \( n \) positions.
If the attended sequence doubles in length and the relevant positions remain the
same, each relevant position receives roughly half the probability mass it did
before — unless the model&rsquo;s learned attention patterns are precise enough to
suppress the irrelevant positions to near-zero, which is a strong assumption.</p>
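<p>The dilution effect is easy to see numerically. In the toy sketch below (my construction, not taken from any of the cited papers), a handful of relevant positions keep a fixed logit advantage over every irrelevant one, yet the softmax mass they retain still collapses as the context grows:</p>

```python
import numpy as np

def relevant_mass(n_total, n_relevant=4, gap=2.0):
    """Softmax mass landing on the relevant positions when every
    irrelevant logit sits `gap` below the relevant ones."""
    logits = np.zeros(n_total)
    logits[:n_relevant] = gap            # fixed logit advantage for relevant tokens
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax over all n_total positions
    return p[:n_relevant].sum()

for n in (64, 512, 4096, 32768):
    print(n, round(relevant_mass(n), 3))
```

<p>With a logit gap of 2, the four relevant positions hold about a third of the attention mass at n = 64 and well under one percent at n = 4096: the advantage is fixed, but the softmax denominator keeps growing.</p>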
<h3 id="position-bias-compounds-the-problem">Position bias compounds the problem</h3>
<p>Empirically, transformers exhibit a U-shaped recall curve over context
position: tokens near the start (primacy) and tokens near the end (recency)
are retrieved more reliably than tokens in the middle. If you stuff a long
context with background material and bury the task-relevant information in the
middle, you are fighting the architecture&rsquo;s learned inductive bias.</p>
<p>The effect is roughly consistent with what would emerge if the model&rsquo;s
attention weight distribution were modelled as a mixture of a flat prior and a
position-biased component. Under that model, increasing \( n \) inflates the
flat component&rsquo;s contribution and dilutes the position-biased recovery of
relevant tokens.</p>
<h3 id="what-structured-sparsity-looks-like-in-practice">What structured sparsity looks like in practice</h3>
<p>The corrective is not to artificially shrink context windows — it is to ensure
that at each decision point, the context is populated with tokens that are
<em>relevant to that decision</em>. Three practical expressions of this principle:</p>
<ol>
<li>
<p><strong>Retrieval over recall.</strong> Rather than prepending a full document corpus,
retrieve the top-\( k \) passages at query time. This keeps \( n \) small
and \( |\mathcal{S}| / n \) high.</p>
</li>
<li>
<p><strong>Rolling summarisation.</strong> Compress history into a running summary and
discard the raw transcript. The summary carries the signal; the raw
transcript is mostly noise by the time it is several turns old.</p>
</li>
<li>
<p><strong>Phased orchestration.</strong> Decompose a multi-step task into phases, each
with its own focused context. Phase \( t \) receives only the output of
phase \( t-1 \) (plus any task-specific retrieval), not the entire
accumulated history of all prior phases. This keeps per-phase \( n \) bounded
even as the total task length grows.</p>
</li>
</ol>
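<p>A minimal sketch of the third pattern, phased orchestration. Everything here is hypothetical scaffolding: <code>call_model</code> is a stub standing in for any real LLM client, and the retriever is a placeholder. The point is the shape of the data flow — each phase&rsquo;s context is rebuilt from its instructions, fresh retrieval, and the previous phase&rsquo;s output only, never the full accumulated history:</p>

```python
from dataclasses import dataclass
from typing import Callable

def call_model(prompt: str) -> str:
    """Stub standing in for a real LLM call, so the sketch runs standalone."""
    return f"summary({len(prompt)} chars)"

@dataclass
class Phase:
    name: str
    instructions: str

def run_phased(phases: list[Phase], task: str, retrieve: Callable[[str], str]) -> str:
    """Each phase sees only its instructions, fresh retrieval, and the
    previous phase's output -- per-phase context length stays bounded."""
    carry = task
    for phase in phases:
        context = "\n\n".join([phase.instructions, retrieve(phase.name), carry])
        carry = call_model(context)      # bounded n per phase
    return carry

result = run_phased(
    [Phase("outline", "Draft an outline."), Phase("write", "Expand the outline.")],
    task="Explain structured sparsity.",
    retrieve=lambda q: f"top-k passages for {q}",   # stand-in retriever
)
print(result)
```

<p>The design choice that matters is that <code>carry</code> is overwritten each iteration rather than appended to: total task length can grow without inflating any single call&rsquo;s context.</p>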
<hr>
<h2 id="discussion">Discussion</h2>
<p>The argument above is not novel — pieces of it appear scattered across the
alignment and inference-efficiency literatures. What I think is underappreciated
is that it applies to <em>agentic</em> systems with particular force. A single-shot
prompt has a fixed, author-controlled context. An agent accumulating tool
outputs, prior reasoning traces, and retrieved documents across a long task
trajectory will naturally inflate its own context window over time — and
degrade its own performance as a result, without any external change in task
difficulty.</p>
<p>The naive fix is to give the agent a bigger context window. The correct fix is
to never let it accumulate a bloated context in the first place.</p>
<p><strong>Limitations.</strong> The SNR framing treats all irrelevant tokens as equally noisy,
which is false — some irrelevant tokens are actively misleading (distractors) <a href="#ref-3">[3]</a>,
others are benign fillers. The quadratic cost argument mostly applies to
full-attention models; sparse and linear attention variants have different
scaling properties. And &ldquo;relevant&rdquo; is itself a function of the model&rsquo;s
knowledge, which makes the optimisation circular in practice.</p>
<p><strong>What would make this publishable.</strong> Controlled ablation: fix a task, vary
context length by inserting null tokens of increasing volume, measure
performance as a function of \( n \) and of the position of the relevant
material. Do this across model sizes and families to separate architectural
effects from scale effects. The lost-in-the-middle paper is close to this but
does not isolate null-token inflation from document-count inflation.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). Lost in the middle: How language models use long contexts. <em>Transactions of the Association for Computational Linguistics</em>, 12, 157–173. <a href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></p>
<p><span id="ref-2"></span>[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need. <em>Advances in Neural Information Processing Systems</em>, 30. <a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
<p><span id="ref-3"></span>[3] Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., Schärli, N., &amp; Zhou, D. (2023). Large language models can be easily distracted by irrelevant context. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>. <a href="https://arxiv.org/abs/2302.00093">https://arxiv.org/abs/2302.00093</a></p>
<hr>
<p><em>The phased orchestration argument in the Discussion section is not just
theoretical hand-waving — I have been building a concrete implementation of it.
The current state lives at
<a href="https://github.com/sebastianspicker/phased-agent-orchestration">sebastianspicker/phased-agent-orchestration</a>.
It is rough, but the core idea is there: each agent phase gets a bounded,
purpose-built context rather than the full accumulated history. Feedback very
welcome.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>If You Think This Is Written by AI, You Are Both Right and Wrong</title>
      <link>https://sebastianspicker.github.io/posts/ai-detectors-systematic-minds/</link>
      <pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-detectors-systematic-minds/</guid>
      <description>AI detectors flag the US Constitution as machine-generated. They also flag technical papers, legal prose, and — with striking consistency — writing produced by autistic minds and physics-trained ones. The error is not in the measurement. It is in the baseline assumption: that systematic, precise writing is inhuman.</description>
      <content:encoded><![CDATA[<p>I use AI tools in my writing. This post, like several others on this blog,
was written with LLM assistance — research, structure, drafting,
revision. If you run any of these posts through an AI writing detector, you
will likely receive a high probability-of-AI score. The detector will be
picking up something real.</p>
<p>It will also be wrong about what that means.</p>
<hr>
<h2 id="the-constitution-problem">The Constitution Problem</h2>
<p>In 2023, as universities began deploying AI detection tools at scale,
educators started testing them on texts that were definitively not
AI-generated. The results were instructive. The United States Constitution
received high AI-probability scores from multiple commercial detectors.
GPTZero rated it 92% likely to be AI-written. The Federalist Papers
fared similarly. So did sections of the King James Bible and Kant&rsquo;s <em>Critique
of Pure Reason</em>. Historical documents, written by humans, for human purposes,
in an era when no AI existed — flagged as machine-generated.</p>
<p>This was not a marginal edge case. It was consistent across tools and across
documents. And while it was widely reported as evidence that the detectors
were broken, there is a more precise reading available: the detectors were
working correctly, and we had misunderstood what they were measuring.</p>
<hr>
<h2 id="what-the-detectors-actually-measure">What the Detectors Actually Measure</h2>
<p>Most commercial AI detectors — GPTZero, Turnitin&rsquo;s detection layer,
Copyleaks — use some combination of two statistical signals.</p>
<p><strong>Perplexity.</strong> A language model assigns a probability to each token given
the preceding tokens. Low perplexity means the text was, token by token,
what the model expected — it sits close to the centre of the probability
distribution. AI-generated text tends to have low perplexity because that
is precisely what generation does: it samples from the high-probability
region of the distribution <a href="#ref-1">[1]</a>. Human text, on average,
has higher perplexity, because humans write for specific contexts with
idiosyncratic word choices, rhetorical effects that require the unexpected,
and the accumulated noise of composing for a real reader.</p>
<p><strong>Burstiness.</strong> A term introduced by Edward Tian, GPTZero&rsquo;s creator: human
writing has high burstiness — sentence lengths vary widely, vocabulary
density shifts, complex constructions alternate with simple ones. AI writing
is more uniform. The statistical distribution of sentence lengths in LLM
output is narrower than in most human prose <a href="#ref-2">[2]</a>.</p>
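<p>Burstiness in Tian&rsquo;s sense covers several features; the simplest is dispersion of sentence lengths. Here is a crude proxy (my sketch, not GPTZero&rsquo;s actual metric): the coefficient of variation of sentence lengths in words.</p>

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Crude burstiness proxy: coefficient of variation of sentence
    lengths in words. Real detectors use richer feature sets."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat here. The dog sat there. The bird sat up. The fish swam by."
varied = "Stop. The storm rolled in off the coast before anyone had time to close the shutters. Silence."
print(burstiness(uniform) < burstiness(varied))   # True: uniform prose scores lower
```

<p>Systematic prose with deliberately parallel sentence structure lands on the <code>uniform</code> side of this metric, which is exactly the failure mode the rest of this post is about.</p>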
<p>The underlying assumption these tools share: human writing is variable,
contextually messy, idiosyncratic. AI writing is smooth and predictable.</p>
<p>This is accurate for a large class of human writing — casual prose, personal
essays, social media, student writing in informal registers. It is wrong
about a different and well-defined class of human writing. The Constitution
sits in that class. So does a lot of other text.</p>
<hr>
<h2 id="the-systemising-brain">The Systemising Brain</h2>
<p>Simon Baron-Cohen&rsquo;s empathising–systemising (E-S) theory distinguishes two
cognitive orientations. Empathising involves attending to social and emotional
cues, inferring mental states, navigating the pragmatic, implicit layer of
communication — what is meant rather than what is said. Systemising involves
attending to rules, patterns, and underlying regularities — the drive to
understand how things work and to represent them in explicit, transferable,
internally consistent terms <a href="#ref-3">[3]</a>.</p>
<p>Both orientations are distributed across the human population. They are not
exclusive, and neither is pathological. But autism spectrum conditions are
robustly associated with high systemising and relatively lower empathising —
not because autistic people lack emotions or care about others, but because
the cognitive mode that comes naturally to them is one of rules, structures,
and explicit representation rather than social inference and pragmatic
implication. The intense world theory <a href="#ref-4">[4]</a> adds a
complementary perspective: autistic brains may be characterised by
hyper-reactivity and hyper-plasticity, with pattern-seeking and systematising
serving partly as a way of making a too-intense world navigable. The
systematicity is not a deficit. It is an adaptation.</p>
<p>This has direct consequences for writing.</p>
<p>High-systemising writing tends toward:</p>
<ul>
<li>
<p><strong>Consistent vocabulary.</strong> The same term is used for the same concept
throughout, because substituting a synonym introduces ambiguity about
whether the referent is actually the same. Neurotypical writing freely
uses synonyms for stylistic variety; systemising writing resists this
on principle.</p>
</li>
<li>
<p><strong>Explicit logical structure.</strong> Claims are supported by stated reasons
rather than left to pragmatic inference. If there are three conditions,
all three are named. Nothing is &ldquo;needless to say.&rdquo;</p>
</li>
<li>
<p><strong>Low social hedging.</strong> Phrases like &ldquo;as everyone knows&rdquo; or &ldquo;obviously&rdquo;
are avoided, because they perform social alignment rather than convey
information — and they depend on shared assumptions the writer is not
confident are actually shared. (This connects to a point I made in the
<a href="/posts/car-wash-walk/">car-wash-walk post</a> about Gricean pragmatics:
autistic communication often violates the maxim of quantity in the
direction of over-informing, because nothing is assumed implicit.)</p>
</li>
<li>
<p><strong>Grammatical parallelism.</strong> Parallel logical content takes parallel
grammatical form. This is not stylistic affectation; it is a natural
consequence of representing structure explicitly.</p>
</li>
<li>
<p><strong>Minimal rhetorical noise.</strong> The prose does not meander, warm up, or
perform relatability. It states what needs to be stated.</p>
</li>
</ul>
<p>Now run text with these properties through an AI detector. Consistent
vocabulary reads as low lexical diversity. Explicit structure reads as low
burstiness. Minimal rhetorical noise reads as smooth, generated output. The
detector is measuring these properties accurately. The attribution to machine
generation is where it goes wrong.</p>
<p>Liang et al. <a href="#ref-5">[5]</a> demonstrated a closely related failure empirically: AI
detectors are significantly more likely to flag writing by non-native English
speakers as AI-generated. Non-native writers at advanced levels of formal
English tend to write more carefully, more consistently, and more in
accordance with explicit grammar rules — because they learned the language
as a system of explicit rules rather than acquiring it through immersive
social exposure. More systematic writing: higher AI probability score. The
mechanism is the same. The population is different.</p>
<hr>
<h2 id="the-physicist-brain">The Physicist Brain</h2>
<p>Physics writing has its own conventions, independently developed but pointing
in the same direction.</p>
<p>Scientific prose requires defined terms used consistently: in a paper about
quantum error correction, &ldquo;logical qubit,&rdquo; &ldquo;physical qubit,&rdquo; and &ldquo;syndrome&rdquo;
each mean exactly one thing, used identically in section 2 and section 5.
It requires explicit assumptions: &ldquo;We assume the noise is Markovian.&rdquo; &ldquo;In
the limit of large N.&rdquo; These are not vague hedges; they are precise
statements about the domain of validity of the results. It requires logical
derivation over rhetorical persuasion: the connectives are &ldquo;since,&rdquo;
&ldquo;therefore,&rdquo; &ldquo;it follows that&rdquo; — explicit logical operators, not narrative
bridges. And the passive construction of &ldquo;the signal was measured&rdquo; rather
than &ldquo;I measured the signal&rdquo; removes the individual from the result,
because the result should be reproducible regardless of who performs the
measurement.</p>
<p>The outcome is prose that is systematic, consistent, and structurally
predictable. From the outside — and from the vantage point of an AI
detector — it looks machine-generated.</p>
<p>Paul Dirac is the physicist who comes to mind first here. His 1928 paper
deriving the relativistic wave equation for the electron contains almost no
rhetorical apparatus. Motivation, equation, consequence: each stated once,
clearly, with no warm-up and no elaboration beyond what the argument
requires. It is not warm. It is not discursive. It is beautiful in the way
that a proof is beautiful: every element earns its place. Run it through
GPTZero and see what you get.</p>
<p>This connection between the physicist&rsquo;s prose style and the autistic cognitive
mode is not accidental. Baron-Cohen et al. <a href="#ref-6">[6]</a> surveyed Cambridge students
by academic discipline and found that physical scientists and mathematicians
scored consistently higher on the Autism Quotient (AQ) than humanities
students and controls, with mathematicians scoring highest of all. The
systemising orientation associated with autism spectrum conditions is also
overrepresented in, and presumably selected for by, quantitative scientific
disciplines. The physicist&rsquo;s prose reflects this. So does the writing of a
high-systemising person who has never studied physics.</p>
<p>The categories overlap without being identical. What they share is a
cognitive preference for explicit structure, consistent vocabulary, and
logical transparency over social performance and rhetorical persuasion. The
writing that emerges from that preference looks, to an AI detector, like it
was generated by a machine.</p>
<p>It was not.</p>
<hr>
<h2 id="the-category-error">The Category Error</h2>
<p>The error AI detectors make is not a measurement error. It is a category
error.</p>
<p>They are trained to distinguish two things: output generated by a
contemporary LLM, and a specific subset of human writing — typically casual,
personal, or student prose collected from online sources. When they encounter
text outside either of those training categories — systematic and precise but
human-generated — the classifier has no good option. The text does not match
the &ldquo;AI&rdquo; training data exactly, and it does not match the &ldquo;human&rdquo; baseline
either. It gets assigned to the bin it fits least badly.</p>
<p>What is happening when the Constitution is flagged: it is systematic,
definitional, prescriptive, and internally consistent. It was written by
lawyers and statesmen who understood that ambiguity in foundational documents
creates legal chaos. They wrote to be unambiguous. The result is text with
low perplexity and low burstiness — the statistical signature the detector
associates with AI.</p>
<p>GPTZero&rsquo;s creator Edward Tian acknowledged this problem when it was reported:
the Constitution appears so frequently in LLM training data that it registers
as &ldquo;already known&rdquo; to the model, which artificially lowers its perplexity
score. That is a real and specific issue. But it is secondary. The deeper
issue is that the Constitution would score low-perplexity even without the
training-data contamination effect, because systematic, definitional prose
is intrinsically low-perplexity. Precise language is predictable language.
That is partly the point of precise language.</p>
<p>The baseline assumption — that human writing is variable and idiosyncratic —
holds for much human writing. It does not hold for legal drafting, technical
documentation, scientific papers, sacred and historical texts written to be
durable and precise, writing by people with high systemising orientation, or
writing by non-native speakers at formal registers. That is not a small
population of edge cases. It is a substantial fraction of all written
material that exists.</p>
<hr>
<h2 id="right-and-wrong-at-the-same-time">Right and Wrong at the Same Time</h2>
<p>So: if you think these posts are AI-generated, you are right and wrong at
the same time.</p>
<p>Right, in two ways. First: yes, I use AI tools. LLM assistance is part of
my writing process — not an occasional aid, but a regular part of how
research notes and half-formed arguments become structured posts. Second:
the writing style of these posts is systematic and precise in ways that
detectors register as machine-generated. That systematicity is real, and
if a detector picks it up, it is measuring something.</p>
<p>Wrong, also in two ways. First: the ideas, judgments, and connections in
these posts are mine. The decisions about what to include and what to leave
out, which papers to cite and how to frame their implications, where the
interesting tension lies between neurodiversity research and the assumptions
baked into AI detection tools — those are not outputs of a language model
working in isolation. They are the product of someone who works at the
intersection of these fields and has thought about them for a while. An LLM
cannot generate these posts without a human who has already decided what
to say.</p>
<p>Second, and more important for the argument here: the systematic, precise
character of this writing is not evidence of machine generation. It is a
cognitive signature — one associated with physics training, with high
systemising orientation, with the <a href="/posts/inner-echo/">overlap between those two things that I
have written about elsewhere</a> in the context of
neurodiversity more broadly.</p>
<p>The detector is measuring a real property of the text. It is misattributing
the origin of that property.</p>
<p>The interesting question this opens is not &ldquo;did AI write this?&rdquo; That question
is increasingly poorly posed in an era where thinking and writing are already
deeply entangled with machine assistance, in ways that differ sharply from
person to person and task to task. The better question is: <em>whose judgment
is in the text?</em> Whose choices about what to include, what to connect, what
to leave out?</p>
<p>The systematicity in this writing is mine. The recognition that AI detectors
systematically disadvantage autistic writers, physicist writers, and
non-native speakers is a judgment I made, not one a language model was
prompted to produce. The connection to the Constitution — a document written
to be maximally unambiguous, flagged as maximally AI-like — is a connection
I found worth drawing.</p>
<p>Whether that makes this text &ldquo;human&rdquo; is a philosophical question I am happy
to leave open. What it is not is AI hallucination.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., &amp; Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>. <a href="https://arxiv.org/abs/2301.11305">https://arxiv.org/abs/2301.11305</a></p>
<p><span id="ref-2"></span>[2] Gehrmann, S., Strobelt, H., &amp; Rush, A. M. (2019). GLTR: Statistical detection and visualization of generated text. <em>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</em>, 111–116. <a href="https://doi.org/10.18653/v1/P19-3019">https://doi.org/10.18653/v1/P19-3019</a></p>
<p><span id="ref-3"></span>[3] Baron-Cohen, S. (2009). Autism: The empathising–systemising (E-S) theory. <em>Annals of the New York Academy of Sciences</em>, 1156(1), 68–80. <a href="https://doi.org/10.1111/j.1749-6632.2009.04467.x">https://doi.org/10.1111/j.1749-6632.2009.04467.x</a></p>
<p><span id="ref-4"></span>[4] Markram, K., &amp; Markram, H. (2010). The intense world theory — a unifying theory of the neurobiology of autism. <em>Frontiers in Human Neuroscience</em>, 4, 224. <a href="https://doi.org/10.3389/fnhum.2010.00224">https://doi.org/10.3389/fnhum.2010.00224</a></p>
<p><span id="ref-5"></span>[5] Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., &amp; Zou, J. (2023). GPT detectors are biased against non-native English writers. <em>Patterns</em>, 4(7), 100779. <a href="https://doi.org/10.1016/j.patter.2023.100779">https://doi.org/10.1016/j.patter.2023.100779</a></p>
<p><span id="ref-6"></span>[6] Baron-Cohen, S., Wheelwright, S., Skinner, R., Martin, J., &amp; Clubley, E. (2001). The autism-spectrum quotient (AQ): Evidence from Asperger syndrome/high-functioning autism, males and females, scientists and mathematicians. <em>Journal of Autism and Developmental Disorders</em>, 31(1), 5–17. <a href="https://doi.org/10.1023/A:1005653411471">https://doi.org/10.1023/A:1005653411471</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>Two Expansion Rates, One Universe: The Hubble Tension at 5σ</title>
      <link>https://sebastianspicker.github.io/posts/hubble-tension-crisis-cosmology/</link>
      <pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/hubble-tension-crisis-cosmology/</guid>
      <description>The universe has two measured expansion rates. One comes from the early universe, encoded in the cosmic microwave background. The other comes from measuring distances to nearby galaxies. They disagree by approximately 5σ — the threshold physicists call discovery. Every systematic error has been hunted down. JWST has confirmed the distance ladder. DESI has found hints that dark energy is not constant. Something is either wrong with our cosmological model or with one of two extremely well-tested measurement chains. We are not sure which.</description>
      <content:encoded><![CDATA[<p>The universe is expanding. We have known this since Edwin Hubble&rsquo;s 1929 paper, which measured recession velocities of galaxies using Cepheid variable stars and established what we now call Hubble&rsquo;s law:</p>
$$v = H_0 \, d$$<p>The proportionality constant $H_0$ — the Hubble constant — is the current rate of expansion, in units of km/s/Mpc (kilometres per second per megaparsec, where 1 Mpc $\approx 3.086 \times 10^{22}$ m). Hubble&rsquo;s original measurement gave $H_0 \approx 500$ km/s/Mpc. It was wrong by a factor of about seven, which is understandable given that he was measuring distances to galaxies in the 1920s using photographic plates and a lot of courage. Over the following decades, as techniques improved, the value converged toward 70 km/s/Mpc. By the 1990s, many people considered the question largely settled: $H_0$ was somewhere between 60 and 80, and the main arguments were about where exactly in that range.</p>
<p>Those arguments have sharpened considerably. We now have two extremely precise, extremely well-scrutinised measurements of $H_0$, and they disagree:</p>
$$H_0^{\text{late}} = 73.04 \pm 1.04 \text{ km/s/Mpc} \qquad \text{(distance ladder)}$$<p>
</p>
$$H_0^{\text{early}} = 67.4 \pm 0.5 \text{ km/s/Mpc} \qquad \text{(CMB)}$$<p>The discrepancy is $73.04 - 67.4 = 5.64$ km/s/Mpc. The combined uncertainty is $\sqrt{1.04^2 + 0.5^2} \approx 1.15$ km/s/Mpc. The significance is therefore approximately $4.9\sigma$, effectively the $5\sigma$ mark; the discrepancy is what cosmologists have taken to calling, with some grimness, the Hubble tension.</p>
<p>$5\sigma$ is what particle physicists require before claiming a discovery. It is the threshold designed to exclude chance statistical fluctuations at the level of one in 3.5 million. When the Hubble tension first appeared around 2016 it was at $2$–$3\sigma$, which is &ldquo;interesting.&rdquo; It has since grown monotonically as both measurement chains have been refined. This is not the behaviour you want from a systematic error that you are hoping will go away.</p>
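<p>The arithmetic is short enough to check directly; the inputs are the published SH0ES and Planck figures quoted above, and adding the uncertainties in quadrature assumes they are independent, which is reasonable here since the two chains share essentially no instruments or calibrations:</p>

```python
import math

h0_late, sig_late = 73.04, 1.04    # SH0ES distance ladder
h0_early, sig_early = 67.4, 0.5    # Planck CMB

delta = h0_late - h0_early
sigma = math.hypot(sig_late, sig_early)   # independent errors add in quadrature
print(f"delta = {delta:.2f} km/s/Mpc, sigma = {sigma:.2f}, "
      f"tension = {delta / sigma:.1f} sigma")
# delta = 5.64 km/s/Mpc, sigma = 1.15, tension = 4.9 sigma
```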
<p>This is the kind of problem that keeps me reading papers at unreasonable hours. The rest of this post is an attempt to explain why both measurements are trustworthy, why the disagreement is therefore a genuine crisis, and what proposals exist for resolving it.</p>
<h2 id="what--actually-measures-and-why-it-matters">What $H_0$ Actually Measures and Why It Matters</h2>
<p>Hubble&rsquo;s law $v = H_0 d$ is valid in the nearby universe, for galaxies whose recession velocities are much less than the speed of light. In the full relativistic framework, the expansion is described by the scale factor $a(t)$, which encodes how distances between comoving points grow with time. The Hubble parameter is defined as</p>
$$H(t) = \frac{\dot{a}}{a}$$<p>and $H_0 = H(t_0)$ is its value today. The Friedmann equation — derived from general relativity applied to a homogeneous and isotropic universe — gives</p>
$$H(z)^2 = H_0^2 \left[ \Omega_m (1+z)^3 + \Omega_r (1+z)^4 + \Omega_k (1+z)^2 + \Omega_\Lambda \right]$$<p>where $z$ is the redshift (related to the scale factor by $1 + z = 1/a$), and the $\Omega$ parameters are the present-day fractional energy densities of matter, radiation, spatial curvature, and the cosmological constant (dark energy), respectively. Our standard cosmological model, $\Lambda$CDM, assumes a spatially flat universe ($\Omega_k = 0$), negligible radiation today ($\Omega_r \approx 0$), and</p>
$$\Omega_\Lambda \approx 0.68, \qquad \Omega_m \approx 0.31, \qquad \Omega_b \approx 0.049$$<p>where $\Omega_b$ is the baryon (ordinary matter) density. Dark matter makes up most of $\Omega_m$.</p>
<p>$H_0$ is not just one number among many. It is the normalization of the entire cosmological distance scale. It appears in the age of the universe:</p>
$$t_0 = \int_0^\infty \frac{dz}{(1+z)\,H(z)}$$<p>which, for $\Lambda$CDM with the above parameters, gives $t_0 \approx 13.8$ Gyr. A higher $H_0$ means a faster expansion rate, which means — for fixed $\Omega$ values — a younger universe. An error in $H_0$ propagates into every cosmological distance and every age estimate. It is not a detail.</p>
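<p>That integral is easy to evaluate numerically. A minimal sketch in Python with the rounded densities above (flatness enforced by construction; the conversion constants are approximate):</p>

```python
import numpy as np
from scipy.integrate import quad

# Rounded flat-LambdaCDM parameters from the text (flatness enforced)
H0 = 67.4              # km/s/Mpc (Planck 2018)
Om = 0.31              # matter
Orad = 9e-5            # radiation (photons + neutrinos), approximate
OL = 1.0 - Om - Orad   # dark energy, so the Omegas sum to 1

KM_PER_MPC = 3.0857e19   # kilometres in a megaparsec
GYR_IN_S = 3.156e16      # seconds in a gigayear

def E(z):
    # Dimensionless expansion rate H(z)/H0 for flat LambdaCDM
    return np.sqrt(Om * (1 + z)**3 + Orad * (1 + z)**4 + OL)

# t0 = integral of dz / ((1+z) H(z)) from 0 to infinity
I, _ = quad(lambda z: 1.0 / ((1 + z) * E(z)), 0, np.inf)
t0 = I / (H0 / KM_PER_MPC) / GYR_IN_S   # seconds -> Gyr
print(f"t0 ≈ {t0:.2f} Gyr")
```

<p>This lands at $\approx 13.8$ Gyr, matching the quoted age. Raising $H_0$ to 73 with the same $\Omega$ values drops the age below 13 Gyr, which is the propagation-of-error point made above.</p>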
<h2 id="ladder-1-the-late-universe-measurement-">Ladder 1: The Late-Universe Measurement ($H_0 = 73$)</h2>
<p>The distance ladder is the name for the chain of calibrated distance measurements that extends from the Earth to the far reaches of the universe. Each rung calibrates the next.</p>
<p><strong>Rung 1: Geometric parallax.</strong> For stars within roughly 1–2 kpc, we can measure distance directly from the shift in apparent position as the Earth orbits the Sun. The parallax angle $\pi$ (in arcseconds) gives the distance $d = 1/\pi$ parsecs. This is pure geometry — it follows from Euclid and Kepler, not from any physical model of stars. The Gaia space mission has measured parallaxes for more than 1.5 billion stars with precisions reaching $\sim 10\,\mu$as for the brightest objects, providing the geometric foundation of the entire ladder.</p>
<p><strong>Rung 2: Cepheid variables.</strong> These are pulsating giant stars whose oscillation period $P$ is tightly correlated with their intrinsic luminosity $L$ — the Leavitt period-luminosity relation, discovered by Henrietta Swan Leavitt in 1912. The relation takes the form</p>
$$M = \alpha \log_{10}(P/\text{days}) + \beta + \gamma \left[ \text{Fe/H} \right]$$<p>where $M$ is the absolute magnitude, and the metallicity term $\gamma[\text{Fe/H}]$ accounts for the chemical composition of the star. Once calibrated against nearby Cepheids with known parallax distances, this relation allows the distance to any galaxy hosting Cepheids to be inferred from period measurements alone. Cepheids are luminous enough to be resolved in galaxies out to $\sim 50$ Mpc.</p>
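<p>As a toy illustration of how this rung works end to end: pick a period, get an absolute magnitude from the Leavitt relation, and convert the distance modulus into a distance. The coefficients and magnitudes below are invented placeholders, not a real calibration (real coefficients are passband-dependent, and the metallicity term is dropped here):</p>

```python
import math

# Illustrative Leavitt-law coefficients (placeholders; real calibrations
# depend on the passband and include a metallicity term)
alpha, beta = -2.43, -4.05

P_days = 30.0        # measured pulsation period (invented)
m_apparent = 25.0    # mean apparent magnitude (invented)

# Absolute magnitude from the period-luminosity relation
M = alpha * math.log10(P_days) + beta

# Distance modulus mu = m - M, then d = 10^((mu + 5) / 5) parsecs
mu = m_apparent - M
d_pc = 10 ** ((mu + 5) / 5)
print(f"M = {M:.2f}, d ≈ {d_pc / 1e6:.1f} Mpc")
```

<p>With these made-up numbers the Cepheid comes out at a few tens of Mpc — comfortably inside the $\sim 50$ Mpc reach quoted above.</p>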
<p><strong>Rung 3: Type Ia supernovae.</strong> These are thermonuclear explosions of white dwarf stars that occur near the Chandrasekhar mass limit ($\approx 1.44 M_\odot$), and consequently near a characteristic peak luminosity. Their light curves are not perfectly standard, but the peak luminosity correlates tightly with the rate at which brightness declines after peak — the Phillips relation. After this standardisation, Type Ia SNe serve as &ldquo;standardisable candles&rdquo; reaching to redshifts $z \sim 2$, far beyond the reach of Cepheids.</p>
<p>The logic of the ladder: Gaia calibrates nearby Cepheids; those Cepheids calibrate Cepheids in SN Ia host galaxies; those SN Ia establish the absolute luminosity of the standard candle; that calibrated SN Ia sample gives recession velocities (from spectroscopic redshifts) and distances (from luminosities) simultaneously, yielding $H_0$.</p>
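<p>The final step is conceptually just a slope fit of $v = H_0 d$. A sketch on synthetic data — the sample below is invented, and the real SH0ES analysis propagates full covariances rather than doing a bare least-squares fit:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic calibrated sample: distances (Mpc) and recession velocities (km/s).
# Invented numbers, for illustration only.
true_H0 = 73.0
d = rng.uniform(30, 600, size=200)
v = true_H0 * d + rng.normal(0.0, 300.0, size=200)  # peculiar-velocity scatter

# Hubble's law v = H0 * d has no intercept, so fit the slope directly:
H0_fit = np.sum(v * d) / np.sum(d ** 2)
print(f"H0 ≈ {H0_fit:.2f} km/s/Mpc")
```

<p>The recovered slope sits within a fraction of a km/s/Mpc of the input value — the hard part of the measurement is not the fit but the calibration feeding it.</p>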
<p>The SH0ES collaboration (Supernovae and $H_0$ for the Equation of State of dark energy) has driven this measurement for the past fifteen years. Their 2022 result (<a href="#ref-Riess2022">Riess et al., 2022</a>), using Hubble Space Telescope Cepheid calibrations in 37 galaxies hosting Type Ia SNe, found</p>
$$H_0 = 73.04 \pm 1.04 \text{ km/s/Mpc}$$<p>This is a 1.4% measurement. For context, Hubble&rsquo;s original measurement had an error of several hundred percent.</p>
<p><strong>The JWST confirmation.</strong> One candidate systematic error was crowding: in HST images of distant galaxies, Cepheids might be blended with unresolved neighbouring stars, artificially brightening them and biasing the distance estimate. JWST&rsquo;s larger mirror and infrared sensitivity resolve individual Cepheids more cleanly in the same host galaxies. The results from JWST observations in 2023 confirmed the HST Cepheid distances. The crowding concern was not the answer. The distance ladder value is not a systematic artifact of HST resolution.</p>
<p>An independent late-universe measurement comes from time-delay cosmography. Gravitational lensing of a background quasar by an intervening galaxy produces multiple images; the arrival times of photons along different paths differ by an amount that depends on $H_0$. The TDCOSMO collaboration (<a href="#ref-Birrer2020">Birrer et al., 2020</a>) found $H_0 = 74.5^{+5.6}_{-6.1}$ km/s/Mpc from seven lensed quasars, entirely independently of the distance ladder. This is a completely different physical observable. It agrees with SH0ES.</p>
<h2 id="ladder-2-the-early-universe-measurement-">Ladder 2: The Early-Universe Measurement ($H_0 = 67$)</h2>
<p>The cosmic microwave background (CMB) is the thermal radiation left over from recombination — the epoch at $z \approx 1100$ (about 380,000 years after the Big Bang) when the universe had cooled enough for protons and electrons to combine into neutral hydrogen, allowing photons to free-stream for the first time. The CMB is extraordinarily uniform in temperature ($T \approx 2.725$ K) but carries tiny fluctuations at the level of $\Delta T / T \sim 10^{-5}$.</p>
<p>Before recombination, the universe was a tightly coupled photon-baryon fluid. Perturbations in this fluid oscillated: gravity pulled baryons into overdense regions, while radiation pressure resisted compression. The competition set up acoustic waves — sound waves in the plasma of the early universe. These waves travelled at the sound speed</p>
$$c_s = \frac{c}{\sqrt{3(1 + R)}}, \qquad R = \frac{3 \rho_b}{4 \rho_\gamma}$$<p>where $\rho_b$ and $\rho_\gamma$ are the baryon and photon energy densities. The waves propagated until recombination froze them in place. The characteristic length they had traversed — the sound horizon — is</p>
$$r_s = \int_0^{t_{\text{rec}}} \frac{c_s \, dt}{a(t)}$$<p>This is a physical length scale set by the microphysics of the early universe, which depends only on $\Omega_b$, $\Omega_m$, and the expansion rate $H(z)$ at $z \gtrsim 1100$ (well before any dark energy becomes relevant). For the best-fit $\Lambda$CDM parameters, $r_s \approx 147$ Mpc.</p>
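<p>Both expressions can be checked numerically with rounded densities. A back-of-envelope sketch — my own parameter choices, so expect a few Mpc of slack against the quoted 147 Mpc, partly because that figure strictly refers to the slightly later baryon drag epoch rather than $z = 1100$:</p>

```python
import numpy as np
from scipy.integrate import quad

# Rounded, Planck-like parameters (illustrative; not the full likelihood fit)
H0 = 67.4                 # km/s/Mpc
h = H0 / 100.0
Om = 0.31                 # matter
Orad = 4.15e-5 / h**2     # photons + neutrinos
OL = 1.0 - Om - Orad      # flat universe
Obh2, Ogh2 = 0.0224, 2.47e-5   # baryon and photon densities times h^2
c = 299792.458            # km/s
z_rec = 1100.0

def H(z):
    return H0 * np.sqrt(Om * (1 + z)**3 + Orad * (1 + z)**4 + OL)

def c_s(z):
    # Sound speed of the photon-baryon fluid; R = 3 rho_b / (4 rho_gamma)
    # scales as 1/(1+z) because rho_b ~ (1+z)^3 and rho_gamma ~ (1+z)^4
    R = 0.75 * (Obh2 / Ogh2) / (1 + z)
    return c / np.sqrt(3.0 * (1.0 + R))

r_s, _ = quad(lambda z: c_s(z) / H(z), z_rec, np.inf)
print(f"r_s ≈ {r_s:.0f} Mpc")
```

<p>This comes out in the mid-140s of Mpc — the right scale, from nothing more than the Friedmann equation and the photon-baryon sound speed.</p>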
<p>The frozen acoustic oscillations imprint a characteristic angular scale on the CMB temperature fluctuations. The angular size of the sound horizon as seen from today is</p>
$$\theta_s = \frac{r_s}{D_A(z = 1100)}$$<p>where $D_A$ is the angular diameter distance to the last-scattering surface. The Planck satellite measured this angle with extraordinary precision: $\theta_s = 0.59656°$ (approximately 0.6 degrees; this scale sets the positions of the acoustic peaks in the angular power spectrum). This is the most precisely measured quantity in cosmology.</p>
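<p>With the sound horizon in hand, the angle is one more integral: the comoving distance to last scattering. A rough check, taking the quoted $r_s \approx 147$ Mpc at face value and using rounded densities (in a flat universe the comoving sound horizon divided by the comoving distance gives the same angle as the physical-over-$D_A$ form above):</p>

```python
import numpy as np
from scipy.integrate import quad

# Rounded flat-LambdaCDM parameters (illustrative)
H0, Om, Orad = 67.4, 0.31, 9e-5
OL = 1.0 - Om - Orad
c = 299792.458   # km/s

def H(z):
    return H0 * np.sqrt(Om * (1 + z)**3 + Orad * (1 + z)**4 + OL)

# Comoving distance to last scattering (Mpc)
chi, _ = quad(lambda z: c / H(z), 0, 1100)

r_s = 147.0      # comoving sound horizon quoted in the text (Mpc)
theta_deg = np.degrees(r_s / chi)
print(f"chi ≈ {chi:.0f} Mpc, theta_s ≈ {theta_deg:.2f} deg")
```

<p>This lands within about one percent of the measured $0.597°$; the residual comes from the rounded parameters and the choice of recombination redshift.</p>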
<p>Now, here is the key point. $\theta_s$ is measured directly from the CMB. But to extract $H_0$, we must model both $r_s$ (which depends on the early-universe physics) and $D_A(z=1100)$ (which depends on the late-universe expansion, and therefore on $H_0$). We fit $H_0$, $\Omega_m$, $\Omega_b$, and a handful of other parameters simultaneously to the entire CMB power spectrum.</p>
<p>The Planck 2018 result (<a href="#ref-Planck2020">Planck Collaboration, 2020</a>):</p>
$$H_0 = 67.4 \pm 0.5 \text{ km/s/Mpc}$$<p>This is a 0.7% measurement — even tighter than SH0ES. And it assumes $\Lambda$CDM. That assumption is crucial, and we will return to it.</p>
<p>Independent CMB experiments — the Atacama Cosmology Telescope (ACT) and the South Pole Telescope (SPT) — give consistent results, ruling out Planck-specific instrumental systematics. The CMB measurement is robust.</p>
<h2 id="the-tension-numbers-and-what-they-mean">The Tension: Numbers and What They Mean</h2>
<p>The discrepancy is:</p>
$$\Delta H_0 = 73.04 - 67.4 = 5.64 \text{ km/s/Mpc}$$<p>The combined statistical uncertainty is:</p>
$$\sigma_{\text{comb}} = \sqrt{1.04^2 + 0.5^2} \approx 1.15 \text{ km/s/Mpc}$$<p>The significance:</p>
$$\frac{\Delta H_0}{\sigma_{\text{comb}}} \approx 4.9\sigma$$<p>rounding to $\sim 5\sigma$ when additional late-universe calibrators and improved analyses are included. To put this in perspective: the probability of a $5\sigma$ discrepancy arising by chance from two correct measurements of the same quantity is roughly $3 \times 10^{-7}$. You might win a lottery. You probably should not bet your cosmological model on it.</p>
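<p>For completeness, the whole computation and its implied chance probability (one-sided, matching the particle-physics convention) fit in a few lines:</p>

```python
from math import sqrt
from scipy.stats import norm

h0_late, sig_late = 73.04, 1.04    # SH0ES distance ladder (km/s/Mpc)
h0_early, sig_early = 67.4, 0.5    # Planck CMB fit

delta = h0_late - h0_early
sigma = sqrt(sig_late**2 + sig_early**2)
n_sigma = delta / sigma
p = norm.sf(n_sigma)   # one-sided tail probability of a chance fluctuation

print(f"{n_sigma:.1f} sigma, p ≈ {p:.1e}")
```

<p>The tail probability is a few times $10^{-7}$ — the lottery-ticket odds quoted above.</p>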
<p>The tension has grown monotonically over a decade. In 2016 it was $3.4\sigma$. In 2019 it was $4.4\sigma$. Now it sits at $\sim 5\sigma$. This is the opposite of what happens when a systematic error is found: systematics tend to get corrected, reducing the tension. Instead, as each new systematic hypothesis has been tested and rejected, the significance has crept upward.</p>
<h2 id="what-could-explain-it">What Could Explain It</h2>
<p>The community has not been idle. The number of proposed solutions runs into the hundreds. They can be organised into a few broad categories.</p>
<h3 id="systematic-errors-increasingly-unlikely">Systematic Errors (Increasingly Unlikely)</h3>
<p>The distance ladder has multiple candidate systematics that have been carefully evaluated:</p>
<ul>
<li>
<p><strong>Cepheid metallicity dependence</strong>: the period-luminosity relation shifts with iron abundance $[\text{Fe/H}]$. This has been calibrated from first principles and from Gaia observations of Milky Way Cepheids. The residual uncertainty is $\lesssim 0.5$ km/s/Mpc.</p>
</li>
<li>
<p><strong>Photometric crowding</strong>: addressed by JWST.</p>
</li>
<li>
<p><strong>LMC geometry and distance</strong>: the Large Magellanic Cloud is the anchor for Cepheid calibrations. Its distance is now known from eclipsing binary stars and from the time delay of SN 1987A&rsquo;s light echo to better than 1%.</p>
</li>
<li>
<p><strong>SN Ia physics</strong>: host galaxy dependence of SN Ia peak luminosity, Malmquist bias in flux-limited surveys, potential evolution with redshift. These have been studied extensively. Residual effects are estimated at $\sim 1$ km/s/Mpc.</p>
</li>
</ul>
<p>On the CMB side, Planck foreground subtraction has been audited, beam calibration has been checked, and independent experiments agree. There is no credible Planck systematic that could shift $H_0$ by $5.6$ km/s/Mpc.</p>
<p>The conclusion of most working cosmologists is that neither measurement chain contains a systematic error large enough to resolve the tension. This leaves us with physics.</p>
<h3 id="early-dark-energy">Early Dark Energy</h3>
<p>This is currently the most discussed new-physics solution (<a href="#ref-Poulin2019">Poulin et al., 2019</a>). The idea is to introduce a new energy component that becomes dynamically important at $z \sim 3000$–$5000$, around the epoch of matter-radiation equality and well before recombination. This &ldquo;Early Dark Energy&rdquo; (EDE) temporarily increases the expansion rate $H(z)$ at early times.</p>
<p>Why does this help? Recall that the CMB measures $\theta_s = r_s / D_A$ directly and precisely. The sound horizon is</p>
$$r_s \propto \int_{z_{\text{rec}}}^\infty \frac{c_s \, dz}{H(z)}$$<p>A faster expansion rate (higher $H(z)$ at early times) reduces the integral, shrinking $r_s$. The angular diameter distance $D_A$ also changes, but less sensitively. A smaller $r_s$ means that, to reproduce the same observed angle $\theta_s$, the model requires $D_A$ to be correspondingly smaller. Smaller $D_A$ implies a higher $H_0$.</p>
<p>The schematic: if we boost $H(z)$ at $z \sim 3500$ by $\sim 10\%$, the inferred $H_0$ from the CMB shifts from 67 toward 71–72 km/s/Mpc. This can be implemented by an axion-like scalar field $\phi$ that rolls down a periodic potential</p>
$$V(\phi) = \Lambda^4 \left[1 - \cos\left(\frac{\phi}{f}\right)\right]^n$$<p>The field oscillates around the potential minimum after recombination, rapidly diluting like matter or radiation, and leaving no late-universe signature.</p>
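<p>A crude way to see the mechanism numerically: deform $H(z)$ with a localised bump near $z \sim 3500$ and recompute the sound-horizon integral. This is a toy deformation of the expansion rate, not a solution of any actual EDE field equations — the bump shape and width are arbitrary choices:</p>

```python
import numpy as np
from scipy.integrate import quad

# Rounded flat-LambdaCDM parameters (illustrative)
H0, Om, Orad = 67.4, 0.31, 9e-5
OL = 1.0 - Om - Orad
Obh2, Ogh2 = 0.0224, 2.47e-5
c = 299792.458

def H(z, boost=0.0):
    base = H0 * np.sqrt(Om * (1 + z)**3 + Orad * (1 + z)**4 + OL)
    # Toy EDE: a log-normal bump in H(z) centred on z ~ 3500 (arbitrary width)
    bump = 1.0 + boost * np.exp(-0.5 * ((np.log(1 + z) - np.log(3500)) / 0.6)**2)
    return base * bump

def c_s(z):
    R = 0.75 * (Obh2 / Ogh2) / (1 + z)
    return c / np.sqrt(3.0 * (1.0 + R))

def r_s(boost):
    val, _ = quad(lambda z: c_s(z) / H(z, boost), 1100, np.inf)
    return val

shrink = 1.0 - r_s(0.10) / r_s(0.0)   # effect of a 10% peak boost in H(z)
print(f"r_s shrinks by {100 * shrink:.1f}%")
```

<p>A ten percent peak boost shrinks $r_s$ by a few percent, which is the right order of magnitude for shifting the CMB-inferred $H_0$ from 67 toward 71–72, since the fractional shift in $H_0$ roughly tracks the fractional shrinkage of $r_s$ at fixed $\theta_s$.</p>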
<p>EDE is not without problems. The required EDE fraction ($f_{\text{EDE}} \sim 0.1$ at peak) requires fine-tuning the initial field value. More seriously, EDE models generally worsen the $S_8$ tension — the $\sim 2$–$3\sigma$ discrepancy between CMB and weak gravitational lensing measurements of the parameter $S_8 = \sigma_8 \sqrt{\Omega_m / 0.3}$ (where $\sigma_8$ is the amplitude of matter fluctuations on 8 Mpc/$h$ scales). Fixing one tension while worsening another is not the behaviour of a correct theory.</p>
<h3 id="modified-gravity-and-interacting-dark-energy">Modified Gravity and Interacting Dark Energy</h3>
<p>A zoo of alternatives modifies either the late-time or early-time expansion history. These include:</p>
<ul>
<li><strong>Phantom dark energy</strong>: $w < -1$, which increases $H_0$ inferred from CMB fits</li>
<li><strong>Dynamical dark energy</strong>: $w \neq -1$, potentially evolving</li>
<li><strong>Interacting dark matter/dark energy</strong>: momentum transfer between sectors modifying both the background expansion and perturbation growth</li>
<li><strong>Modified gravity theories</strong> (Horndeski, bimetric gravity, $f(R)$ theories): these change the relationship between curvature and matter, altering $H(z)$</li>
</ul>
<p>None of these is clearly preferred by the data in isolation, but several of them become more interesting in light of DESI.</p>
<h3 id="the-local-void">The Local Void</h3>
<p>A tempting classical explanation: if we happen to live inside a large underdense region (a &ldquo;Hubble bubble&rdquo;), the local expansion rate measured by the distance ladder would be higher than the cosmic mean. In an underdense region, there is less gravitational deceleration, so things expand faster locally.</p>
<p>The problem is scale. To shift $H_0$ by $5$ km/s/Mpc, the void would need to extend to $\gtrsim 300$ Mpc and have an underdensity of $\sim 20$%. Neither is consistent with the observed large-scale structure of the universe, where surveys of galaxy distributions show we are not in an anomalously underdense region at that scale.</p>
<h2 id="desi-dark-energy-may-not-be-constant">DESI: Dark Energy May Not Be Constant</h2>
<p>In 2024, the Dark Energy Spectroscopic Instrument (DESI) published its first-year results (<a href="#ref-DESI2024">DESI Collaboration, 2024</a>). DESI is measuring baryon acoustic oscillations (BAOs) in the distribution of millions of galaxies — the same acoustic physics as the CMB, but imprinted in the late-universe galaxy distribution rather than in photons at $z = 1100$.</p>
<p>The BAO standard ruler is the sound horizon $r_s$: the same $\sim 147$ Mpc scale imprinted at recombination appears as a preferred separation between galaxy pairs in the low-redshift universe. By measuring the angular size and redshift separation of the BAO peak at multiple redshifts, DESI traces $H(z)$ across cosmic time.</p>
<p>DESI DR1 measured BAOs in over 6 million extragalactic objects spanning $0.1 < z < 4.2$ and found a $2.5$–$3.9\sigma$ preference for dark energy that evolves with time (the significance depending on which supernova dataset is used in the combination). The standard model assumes $w = P/\rho = -1$ exactly (a cosmological constant). DESI&rsquo;s data is better fit by the $w_0$–$w_a$ parameterisation:</p>
$$w(a) = w_0 + w_a (1 - a)$$<p>where $a = 1/(1+z)$ is the scale factor. The DESI DR1 best-fit values, combined with CMB and SN Ia data, give $w_0 \approx -0.73$ and $w_a \approx -1.0$ — a dark energy that was more negative (more repulsive) in the past and is becoming less negative today. DESI DR2 (released in March 2025) raised the significance of this preference to $4.2\sigma$.</p>
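<p>Evaluating the parameterisation at a few redshifts makes the &ldquo;more negative in the past&rdquo; claim concrete:</p>

```python
# w0-wa (CPL) dark-energy equation of state: w(a) = w0 + wa * (1 - a)
w0, wa = -0.73, -1.0   # DESI DR1 + CMB + SN best-fit values quoted in the text

def w(z):
    a = 1.0 / (1.0 + z)
    return w0 + wa * (1.0 - a)

for z in (0.0, 0.5, 1.0, 3.0):
    print(f"z = {z:3.1f}: w = {w(z):+.2f}")
```

<p>With these values $w$ sits at $-0.73$ today and below $-1.4$ by $z = 3$, crossing the cosmological-constant value $w = -1$ in between.</p>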
<p>The connection to the Hubble tension is direct. The CMB&rsquo;s inference of $H_0 = 67.4$ km/s/Mpc is derived assuming $w = -1$ exactly throughout cosmic history. If dark energy is not a cosmological constant — if $w(z)$ varies — then the Friedmann equation at late times is different, the angular diameter distance $D_A(z=1100)$ is different, and the CMB-inferred $H_0$ changes. A dynamical dark energy that is less dominant at early times and more dominant at late times (which the DESI $w_0$–$w_a$ parameters suggest) tends to shift the CMB-inferred $H_0$ upward.</p>
<p>DESI may be showing us the resolution of the Hubble tension: not a systematic error in either measurement chain, but a genuine departure from $\Lambda$CDM that biases both inferences in opposite directions. The distance ladder measures $H_0$ today from local observations. The CMB infers $H_0$ from a model that assumes $w = -1$ everywhere. If the model is wrong, the inference is wrong.</p>
<p>This is not yet settled. The DESI results are also consistent with systematic errors in the SN Ia data used in combination with BAOs. The statistical significance is below $5\sigma$ for the individual datasets. But the direction of the deviation is consistent across data combinations, and it points toward the same part of parameter space that would ease the Hubble tension.</p>
<h2 id="jwst-and-the-early-galaxy-problem">JWST and the Early Galaxy Problem</h2>
<p>A brief digression — or perhaps not a digression. JWST was designed partly to study the first galaxies. What it has found is unexpected: there are galaxies at $z > 10$ (less than 500 million years after the Big Bang) that are more massive and more luminous than standard $\Lambda$CDM galaxy formation models predicted. Early headlines announced that JWST was &ldquo;breaking cosmology.&rdquo; The reality is more nuanced: $\Lambda$CDM is not broken by these observations, and some of the most extreme early candidates have been revised to lower redshifts as spectra were taken. But genuine tension persists for some objects.</p>
<p>The important point is the accumulation. The Hubble tension is a $5\sigma$ discrepancy in $H_0$. The $S_8$ tension is a $2$–$3\sigma$ discrepancy in matter clustering. The early galaxy problem is a qualitative excess at high redshift. DESI shows $4.2\sigma$ evidence for evolving dark energy. None of these individually is an unambiguous model-breaking crisis. Together, they are multiple independent data sets all pointing in the same direction: $\Lambda$CDM is under pressure. That is not a coincidence you should ignore.</p>
<h2 id="what-it-would-mean-if-cdm-is-wrong">What It Would Mean if $\Lambda$CDM Is Wrong</h2>
<p>Let me be clear about what &ldquo;wrong&rdquo; means here. $\Lambda$CDM is an extraordinarily successful model. It predicted the angular positions of the CMB acoustic peaks before they were measured. It correctly describes the large-scale structure of the universe across twelve billion years of cosmic history. It accounts for the primordial abundances of helium ($\sim 25\%$ by mass) and deuterium through Big Bang nucleosynthesis (lithium-7 remains a long-standing discrepancy of its own). The 2011 Nobel Prize in Physics was awarded for the discovery of accelerated expansion, the $\Lambda$ in $\Lambda$CDM.</p>
<p>Abandoning this model is not a decision to take lightly, and no serious cosmologist is proposing to do so. What is being proposed is that $\Lambda$CDM is <em>incomplete</em> in a specific way.</p>
<p>$\Lambda$CDM is not a fundamental theory. It is an empirical model with six free parameters: $H_0$, $\Omega_b h^2$, $\Omega_c h^2$, $A_s$ (scalar amplitude), $n_s$ (spectral index), $\tau$ (optical depth to reionization). It does not explain what dark matter actually is — whether it is a WIMP, an axion, a primordial black hole, or something stranger. It does not explain why the cosmological constant $\Lambda$ has the value it does (a separate crisis: the cosmological constant problem, for which the naive quantum-field-theory estimate is wrong by a factor of $\sim 10^{120}$). The model describes; it does not explain.</p>
<p>The Hubble tension, if it survives further scrutiny and grows in significance, would tell us something specific: the expansion history of the universe is not what $\Lambda$CDM predicts. Either there is new physics at early times (EDE, modified gravity before recombination) or the dark energy is not a cosmological constant (DESI). In either case, the fix is a modification of the Friedmann equation — a correction to how we model the $\Omega$ parameters or their time dependence. This is science working as it should: a model that has been extraordinarily successful is now encountering its limits, and those limits are pointing toward something new.</p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>I write on this blog about transit photometry — measuring the dimming of starlight as a planet crosses its star&rsquo;s disk (see, for instance, <a href="/posts/exoplanet-hunting-smartphones/">exoplanet hunting with smartphones</a> or <a href="/posts/the-gift-of-transits/">the gift of transits</a>). Those observations work because we trust the geometric relationship between angular size, distance, and physical size. The same geometric trust is what underpins Rung 1 of the distance ladder: parallax.</p>
<p>What is striking to me, as someone trained in physics, is that the Hubble tension sits at the <em>top</em> of the same ladder I describe at the bottom. Parallax gives distances to nearby stars. Those calibrate Cepheids. Those calibrate supernovae. Those supernovae reach to $z \sim 2$, and their recession velocities — measured by spectroscopy and interpreted through general relativity — give $H_0 = 73$. Meanwhile, the acoustic oscillations in the CMB that I described in <a href="/posts/astro-lab-at-home/">astro-lab at home</a> as a snapshot of the early universe give $H_0 = 67$ by a completely independent method. The two answers disagree at $5\sigma$.</p>
<p>The ladder is not broken. Both rungs have been checked, rechecked, and cross-checked. JWST has confirmed the Cepheid distances. Independent CMB experiments confirm Planck. DESI finds that dark energy may not be constant. Everything points in the same direction: the universe is telling us something.</p>
<p>We do not yet know what.</p>
<p>I find this situation — a clean empirical crisis, well-measured, unexplained — to be among the most exciting things happening in physics. Not because I enjoy confusion, but because clean empirical crises are where physics makes progress. The anomalous perihelion precession of Mercury was an annoying discrepancy with Newtonian gravity until Einstein showed it was a signature of spacetime curvature. The ultraviolet catastrophe in blackbody radiation was an embarrassing failure until Planck (Max, not the satellite) introduced the quantum hypothesis. The Hubble tension may be the next one. Or it may turn out to be a mundane systematic that everyone missed. Either answer would be interesting.</p>
<p>For now, the universe has two expansion rates and one of them is wrong. We are working on finding out which.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Planck2020"></span>Planck Collaboration. (2020). Planck 2018 results. VI. Cosmological parameters. <em>Astronomy &amp; Astrophysics</em>, 641, A6. <a href="https://doi.org/10.1051/0004-6361/201833910">DOI: 10.1051/0004-6361/201833910</a></p>
</li>
<li>
<p><span id="ref-Riess2022"></span>Riess, A. G., et al. (2022). A Comprehensive Measurement of the Local Value of the Hubble Constant with 1 km/s/Mpc Uncertainty from the Hubble Space Telescope and the SH0ES Team. <em>The Astrophysical Journal Letters</em>, 934, L7. <a href="https://doi.org/10.3847/2041-8213/ac5c5b">DOI: 10.3847/2041-8213/ac5c5b</a></p>
</li>
<li>
<p><span id="ref-DESI2024"></span>DESI Collaboration. (2024). DESI 2024 VI: Cosmological Constraints from the Measurements of Baryon Acoustic Oscillations. arXiv:2404.03002. Retrieved from <a href="https://arxiv.org/abs/2404.03002">https://arxiv.org/abs/2404.03002</a></p>
</li>
<li>
<p><span id="ref-Poulin2019"></span>Poulin, V., Smith, T. L., Karwal, T., &amp; Kamionkowski, M. (2019). Early Dark Energy can resolve the Hubble tension. <em>Physical Review Letters</em>, 122, 221301. <a href="https://doi.org/10.1103/PhysRevLett.122.221301">DOI: 10.1103/PhysRevLett.122.221301</a></p>
</li>
<li>
<p><span id="ref-Birrer2020"></span>Birrer, S., et al. (TDCOSMO). (2020). TDCOSMO IV: Hierarchical time-delay cosmography — joint inference of the Hubble constant, mass density profile and external convergence. <em>Astronomy &amp; Astrophysics</em>, 643, A165. <a href="https://doi.org/10.1051/0004-6361/202038861">DOI: 10.1051/0004-6361/202038861</a></p>
</li>
<li>
<p><span id="ref-Freedman2021"></span>Freedman, W. L. (2021). Measurements of the Hubble Constant: Tensions in Perspective. <em>The Astrophysical Journal</em>, 919, 16. <a href="https://doi.org/10.3847/1538-4357/ac0e95">DOI: 10.3847/1538-4357/ac0e95</a></p>
</li>
<li>
<p><span id="ref-DiValentino2021"></span>Di Valentino, E., et al. (2021). In the realm of the Hubble tension — a review of solutions. <em>Classical and Quantum Gravity</em>, 38(15), 153001. <a href="https://doi.org/10.1088/1361-6382/ac086d">DOI: 10.1088/1361-6382/ac086d</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-03-22</strong>: Updated TDCOSMO quasar count to seven lensed systems and the $H_0$ value to match the TDCOSMO-only analysis. Updated DESI DR1 galaxy count to over 6 million extragalactic objects (the previous figure of 14 million corresponds to DR2). Added qualification that the 3.9$\sigma$ significance for evolving dark energy is dataset-dependent (ranging from 2.5$\sigma$ to 3.9$\sigma$ depending on the supernova sample used).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Car Wash, Part Three: The AI Said Walk</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-walk/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-walk/</guid>
      <description>A new video went viral last week: same question, &amp;ldquo;should I drive to the car wash?&amp;rdquo;, different wrong answer — the AI said to walk instead. This is neither the tokenisation failure from the strawberry post nor the grounding failure from the rainy-day post. It is a pragmatic inference failure: the model understood all the words and (probably) had the right world state, but assigned its response to the wrong interpretation of the question. A third and more subtle failure mode, with Grice as the theoretical handle.</description>
      <content:encoded><![CDATA[<p><em>Third in an accidental series. Part one:
<a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a> — tokenisation
and representation. Part two:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — grounding
and missing world state. This one is different again.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Same question as last month&rsquo;s: &ldquo;Should I drive to the car wash?&rdquo; New
video, new AI, new wrong answer. This time the assistant replied that
walking was the better option — better for health, better for the
environment, and the car wash was only fifteen minutes away on foot.</p>
<p>Accurate, probably. Correct, arguably. Useful? No.</p>
<p>The model did not fail because of tokenisation. It did not fail because
it lacked access to the current weather. It failed because it read the
wrong question. The user was asking &ldquo;is now a good time to have my car
washed?&rdquo; The model answered &ldquo;what is the most sustainable way for a
human to travel to the location of a car wash?&rdquo;</p>
<p>These are different questions. The model chose the second one. This is
a pragmatic inference failure, and it is the most instructive of the
three failure modes in this series — because the model was not, by any
obvious measure, working incorrectly. It was working exactly as
designed, on the wrong problem.</p>
<hr>
<h2 id="what-the-question-actually-meant">What the Question Actually Meant</h2>
<p>&ldquo;Should I drive to the car wash?&rdquo; is not about how to travel. The word
&ldquo;drive&rdquo; here is not a transportation verb; it is part of the idiomatic
compound &ldquo;drive to the car wash,&rdquo; which means &ldquo;take my car to get
washed.&rdquo; The presupposition of the question is that the speaker owns a
car, the car needs or might benefit from washing, and the speaker is
deciding whether the current moment is a good one to go. Nobody asking
this question wants to know whether cycling is a viable alternative.</p>
<p>Linguists distinguish between what a sentence <em>says</em> — its literal
semantic content — and what it <em>implicates</em> — the meaning a speaker
intends and a listener is expected to infer. Paul Grice formalised this
in 1975 with a set of conversational maxims describing how speakers
cooperate to communicate:</p>
<ul>
<li><strong>Quantity</strong>: say as much as is needed, no more</li>
<li><strong>Quality</strong>: say only what you believe to be true</li>
<li><strong>Relation</strong>: be relevant</li>
<li><strong>Manner</strong>: be clear and orderly</li>
</ul>
<p>The maxims are not rules; they are defaults. When a speaker says
&ldquo;should I drive to the car wash?&rdquo;, a cooperative listener applies the
maxim of Relation to infer that the question is about car maintenance
and current conditions, not about personal transport choices. The
&ldquo;drive&rdquo; is incidental to the real question, the way &ldquo;I ran to the
store&rdquo; does not invite commentary on jogging technique.</p>
<p>The model violated Relation — in the pragmatic sense. Its answer was
technically relevant to one reading of the sentence, and irrelevant to
the only reading a cooperative human would produce.</p>
<hr>
<h2 id="a-taxonomy-of-the-three-failures">A Taxonomy of the Three Failures</h2>
<p>It is worth being precise now that we have three examples:</p>
<p><strong>Strawberry</strong> (tokenisation failure): The information needed to answer
was present in the input string but lost in the model&rsquo;s representation.
&ldquo;Strawberry&rdquo; → <code>["straw", "berry"]</code> — the character &ldquo;r&rdquo; in &ldquo;straw&rdquo; is
not directly accessible. The model understood the task correctly; the
representation could not support it.</p>
<p><strong>Car wash, rainy day</strong> (grounding failure): The model understood the
question. The information needed to answer correctly — current weather —
was never in the input. The model answered by averaging over all
plausible contexts, producing a sensible-on-average response that was
wrong for this specific context.</p>
<p><strong>Car wash, walk</strong> (pragmatic inference failure): The model had all
the relevant words. It may have had access to the weather, the location,
the car state. It chose the wrong interpretation of what was being
asked. The sentence was read at the level of semantic content rather
than communicative intent.</p>
<p>Formally: let $\mathcal{I}$ be the set of plausible interpretations of
an utterance $u$. The intended interpretation $i^*$ is the one a
cooperative, contextually informed listener would assign. A well-functioning
pragmatic reasoner computes:</p>
$$i^* = \arg\max_{i \in \mathcal{I}} \; P(i \mid u, \text{context})$$<p>The model appears to have assigned high probability to the
transportation-choice interpretation $i_{\text{walk}}$, apparently on
the surface pattern: &ldquo;should I [verb of locomotion] to [location]?&rdquo;
generates responses about modes of transport. It is a natural
pattern-match. It is the wrong one.</p>
<hr>
<h2 id="why-this-failure-mode-is-more-elusive">Why This Failure Mode Is More Elusive</h2>
<p>The tokenisation failure has a clean diagnosis: look at the BPE splits,
find where the character information was lost. The grounding failure has
a clean diagnosis: identify the context variable $C$ the answer depends
on, check whether the model has access to it.</p>
<p>The pragmatic failure is harder to pin down because the model&rsquo;s answer
was not, in isolation, wrong. Walking is healthy. Walking to a car wash
that is fifteen minutes away is plausible. If you strip the question of
its conversational context — a person standing next to their dirty car,
wondering whether to bother — the model&rsquo;s response is coherent.</p>
<p>The error lives in the gap between what the sentence says and what the
speaker meant, and that gap is only visible if you know what the speaker
meant. In a training corpus, this kind of error is largely invisible:
there is no ground truth annotation that marks a technically-responsive
answer as pragmatically wrong.</p>
<p>This is a version of a known problem in computational linguistics: models
trained on text predict text, and text does not contain speaker intent.
A model can learn that &ldquo;should I drive to X?&rdquo; correlates with responses
about travel options, because that correlation is present in the data.
What it cannot easily learn from text alone is the meta-level principle:
this question is about the destination&rsquo;s purpose, not the journey.</p>
<hr>
<h2 id="the-gricean-model-did-not-solve-this">The Gricean Model Did Not Solve This</h2>
<p>It is tempting to think that if you could build in Grice&rsquo;s maxims
explicitly — as constraints on response generation — you would prevent
this class of failure. Generate only responses that are relevant to the
speaker&rsquo;s probable intent, not just to the sentence&rsquo;s semantic content.</p>
<p>This does not obviously work, for a simple reason: the maxims require
a model of the speaker&rsquo;s intent, which is exactly what is missing.
You need to know what the speaker intends to know which response is
relevant; you need to know which response is relevant to determine
the speaker&rsquo;s intent. The inference has to bootstrap from somewhere.</p>
<p>Human pragmatic inference works because we come to a conversation with
an enormous amount of background knowledge about what people typically
want when they ask particular kinds of questions, combined with
contextual cues (tone, setting, previous exchanges) that narrow the
interpretation space. A person asking &ldquo;should I drive to the car wash?&rdquo;
while standing next to a mud-spattered car in a conversation about
weekend plans is not asking for a health lecture. The context is
sufficient to fix the interpretation.</p>
<p>Language models receive text. The contextual cues that would fix the
interpretation for a human — the mud on the car, the tone of the
question, the setting — are not available unless someone has typed them
out. The model is not in the conversation; it is receiving a transcript
of it, from which the speaker&rsquo;s intent has to be inferred indirectly.</p>
<hr>
<h2 id="where-this-leaves-the-series">Where This Leaves the Series</h2>
<p>Three videos, three failure modes, three diagnoses. None of them are
about the model being unintelligent in any useful sense of the word.
Each of them is a precise consequence of how these systems work:</p>
<ol>
<li>Models process tokens, not characters. Character-level structure can
be lost at the representation layer.</li>
<li>Models are trained on static corpora and have no real-time connection
to the world. Context-dependent questions are answered by marginalising
over all plausible contexts, which is wrong when the actual context
matters.</li>
<li>Models learn correlations between sentence surface forms and response
types. The correlation between &ldquo;should I [travel verb] to [place]?&rdquo;
and transport-related responses is real in the training data. It is the
wrong correlation for this question.</li>
</ol>
<p>The useful frame, in all three cases, is not &ldquo;the model failed&rdquo; but
&ldquo;what, precisely, does the model lack that would be required to succeed?&rdquo;
The answers point in different directions: better tokenisation; real-time
world access and calibrated uncertainty; richer models of speaker intent
and conversational context. The first is an engineering problem. The
second is partially solvable with tools and still hard. The third is
unsolved.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grice, P. H. (1975). Logic and conversation. In P. Cole &amp; J. Morgan
(Eds.), <em>Syntax and Semantics, Vol. 3: Speech Acts</em> (pp. 41–58).
Academic Press.</p>
</li>
<li>
<p>Levinson, S. C. (1983). <em>Pragmatics.</em> Cambridge University Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Automate the Boring Stuff: Setlist to Playlist</title>
      <link>https://sebastianspicker.github.io/posts/setlist-to-playlist/</link>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/setlist-to-playlist/</guid>
      <description>I love concerts. I love setlists. I hate building the playlist manually afterward. But do I really? A small automation project, a Deftones show in Dortmund, and the question of whether you should automate something you kind of enjoy.</description>
      <content:encoded><![CDATA[<p>Saturday was the Deftones at the Westfalenhalle in Dortmund. One of those concerts where the setlist is part of the experience — where you register, with something close to physical relief, that the arc landed exactly right, and you spend the Uber home mentally replaying the order.</p>
<p>Sunday I built a playlist from it. It took about forty minutes.</p>
<p>This is the post about why that number is already too low, and also possibly too high.</p>
<h2 id="the-ritual">The Ritual</h2>
<p>There is a specific kind of concert listening that happens in the days after a show. You go home, you look up the setlist — setlist.fm is the canonical archive, maintained with an almost academic precision by people who care — and you build a playlist from it in whatever streaming app you use. Then you play it through, in order, and what comes back is not just the music but the spatial memory of the room, the sound mix, the moment the lights dropped for that particular song.</p>
<p>I have been doing this for years. It is a ritual, and like most rituals, part of its meaning is in the doing. The forty minutes of searching song by song, the occasional discovery that a deep cut is on Apple Music in one version but not another, the fiddling with live versus studio — that friction is not purely annoying. It is part of the processing.</p>
<p>And yet. The pile of unprocessed setlists sits in a folder. Shows I attended and never got around to. Setlists I meant to build into playlists and didn&rsquo;t, because the forty minutes were not available that week, and then the moment passed. The ritual unrealised is just a list of song titles.</p>
<p>This is the dilemma, and it is not entirely trivial.</p>
<h2 id="why-this-is-harder-than-it-should-be">Why This Is Harder Than It Should Be</h2>
<p>The setlist.fm API is excellent. It gives you structured data: artist, venue, date, song titles in order, with notations for encores, covers, and dropped songs. What it does not give you is streaming IDs. The song title is a string; the Apple Music track is an object with a catalog ID, a duration, multiple versions, regional availability, and the possibility of not existing at all in the catalog of your country.</p>
<p>The matching problem — connecting a string like &ldquo;Change (In the House of Flies)&rdquo; to the correct Apple Music track, filtered for the right album version, ignoring the live recordings you did not ask for — is not hard, but it is fiddly. You can get 80% of a setlist matched automatically without much effort. The remaining 20% are the covers, the deep cuts, the songs with subtitles in parentheses that differ between the setlist record and the catalog metadata.</p>
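<p>The actual matching heuristics live in the app&rsquo;s TypeScript core, but the normalise-then-fuzzy-match idea can be sketched in a few lines of stdlib Python. The catalog entries here are invented, and a real matcher needs to be less cavalier about stripping subtitles — that is exactly where the 20% goes wrong:</p>

```python
import difflib
import re

def normalise(title: str) -> str:
    """Lowercase, drop parenthetical subtitles, strip punctuation.
    Deliberately crude: real heuristics must keep subtitles that
    distinguish different songs."""
    title = re.sub(r"\(.*?\)", "", title.lower())  # drop "(In the House of Flies)"
    return re.sub(r"[^a-z0-9 ]", "", title).strip()

def best_match(setlist_title: str, catalog: list[str]) -> tuple[str, float]:
    """Return the catalog title most similar to the setlist entry."""
    scored = [
        (c, difflib.SequenceMatcher(None, normalise(setlist_title), normalise(c)).ratio())
        for c in catalog
    ]
    return max(scored, key=lambda pair: pair[1])

# Invented catalog: studio track, live version, and an unrelated song.
catalog = [
    "Change (In the House of Flies)",
    "Change (In the House of Flies) [Live]",
    "My Own Summer (Shove It)",
]
match, score = best_match("Change (In the House of Flies)", catalog)
```

The live version scores lower because its bracketed suffix survives normalisation; that is the kind of signal the ambiguous 20% hinges on.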
<p>Spotify has a fairly rich ecosystem of community tools for exactly this workflow, because Spotify&rsquo;s API is permissive and well-documented and the auth flow is reasonable for third-party developers. Apple Music is harder. The MusicKit framework is real and capable, but the authentication requires managing a private key and JWT tokens signed with developer credentials — not the OAuth dance most developers are used to. The result is that the setlist → Apple Music pipeline is significantly underbuilt compared to the Spotify equivalent.</p>
<p>This is partly why I built <a href="https://github.com/sebastianspicker/setlist-to-playlist">setlist-to-playlist</a> as a PWA rather than reaching for an existing tool.</p>
<h2 id="how-it-works">How It Works</h2>
<p>The app is a Progressive Web App — installable, mobile-friendly, works as a small tool you open on your phone in the taxi home from a show — built on Next.js with a monorepo structure managed by pnpm and Turbo. The architecture is in three phases:</p>
<p><strong>Import.</strong> You paste a setlist.fm URL or ID. The app queries setlist.fm through a server-side proxy — the API key lives on the server and never touches the client — and returns the structured setlist data: songs in order, with metadata about covers, medleys, and notes.</p>
<p><strong>Preview and matching.</strong> The core package runs a matching algorithm against the Apple Music catalog, using the MusicKit JS API for browser-based catalog search. For each song title, it searches Apple Music and presents the best candidate, giving you the chance to confirm or swap before anything is written. This is the step where the 20% problem is addressed manually — the app handles the obvious cases automatically and surfaces the ambiguous ones for human judgement.</p>
<p><strong>Export.</strong> Once you are happy with the track list, the app creates a playlist in your Apple Music library. MusicKit handles the authentication in-browser; the backend generates the JWT tokens using credentials from Apple Developer, signing with the private key server-side so it stays off the client.</p>
<p>The whole thing is local-first in the sense that matters: the Apple Music authentication is between your browser and Apple, and no playlist data or listening history is stored by the app. The only thing the server touches is the API key proxying and the JWT generation.</p>
<h2 id="the-actual-experience">The Actual Experience</h2>
<p>After the Deftones show: opened the app on the phone, pasted the setlist.fm URL, had the playlist in Apple Music in about four minutes. Three tracks needed manual confirmation — two because of live-versus-studio ambiguity, one because a cover required a search adjustment, the kind of edge case where the name setlist.fm records differs from what appears in regional streaming catalogs.</p>
<p>Four minutes instead of forty. Mission accomplished.</p>
<p>And yet.</p>
<p>I noticed, processing the setlist that quickly, that something was missing. Not the music — the music was all there, in order, correct. What was missing was the time spent inside the setlist. The forty minutes of handling each song is also forty minutes of thinking about each song, of remembering where in the set it fell, of deciding which album version you want to hear. The automation removed the friction and also removed the processing.</p>
<p>I am not sure this is a problem. It is probably more accurate to say that it is a trade-off, and that what trade-off you want depends on what you are doing with the ritual. If the backlog is the problem — the pile of unprocessed shows — the automation solves it cleanly. If the processing itself is the point, you probably should not automate it, and the tool is there for when you want it.</p>
<p>That is the correct relationship to automation, I think. Not &ldquo;this should always be automated&rdquo; or &ldquo;this should never be automated&rdquo;, but &ldquo;here is a tool that removes the mechanical part; use it when the mechanical part is not the point&rdquo;.</p>
<h2 id="a-note-on-the-tech-stack">A Note on the Tech Stack</h2>
<p>For the interested: Next.js 15 with App Router, pnpm workspaces with Turbo for the monorepo, MusicKit JS for Apple Music integration, setlist.fm REST API. The JWT for Apple Music uses the <code>jose</code> library for token signing. The matching logic lives in a standalone <code>packages/core</code> module, which makes it testable in isolation and reusable if anyone wants to port this to a different frontend or a CLI.</p>
<p>The repo is at <a href="https://github.com/sebastianspicker/setlist-to-playlist">github.com/sebastianspicker/setlist-to-playlist</a>. PRs welcome, particularly around the matching heuristics — that is the part where there is the most room for improvement.</p>
<hr>
<p>The Deftones were exceptional, for the record. The Westfalenhalle was loud in the way that only a concrete hall that size can be loud, which is to say: correctly loud.</p>
<p>The playlist is good. I am glad it took four minutes and not forty.</p>
<p>I am also glad I know what I gave up.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Your Encryption Keys Are in Virginia: On BitLocker, the FBI, and Why European Universities Need Sovereign Software</title>
      <link>https://sebastianspicker.github.io/posts/public-money-public-code/</link>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/public-money-public-code/</guid>
      <description>Microsoft confirmed this week that it hands BitLocker encryption keys to the FBI on receipt of a valid legal order. Windows 11 uploads them to your Microsoft account by default, without asking. For European universities that handle research data, student records, and HR information under GDPR, this is not an abstract concern. It is a structural problem. The answer is not a technical workaround. It is sovereign, publicly funded, openly licensed software — and a principle that the EU has articulated but not consistently practised: public money, public code.</description>
      <content:encoded><![CDATA[<h2 id="the-story">The Story</h2>
<p>Last week Microsoft confirmed, in response to reporting by TechCrunch and
others, that it had handed BitLocker recovery keys for three laptops to
the FBI following a valid court order. The underlying case was a fraud
investigation in Guam. The laptops were encrypted with BitLocker — the
full-disk encryption built into Windows, which many institutions and
individuals rely on as their primary protection against unauthorised
data access.</p>
<p>The mechanism is simple and was not widely known. When you set up a
modern Windows device and sign in with a Microsoft account, BitLocker
automatically uploads your recovery key to Microsoft&rsquo;s cloud. No
prominent notification. No opt-in. The key sits there, associated with
your account, accessible to Microsoft. When a US court issues a lawful
order, Microsoft complies. Redmond confirmed this is policy, not an
exception.</p>
<p>Bruce Schneier&rsquo;s <a href="https://www.schneier.com/blog/archives/2026/02/microsoft-is-giving-the-fbi-bitlocker-keys.html">response</a>
was characteristically direct: &ldquo;The lesson here is that if you have
access to keys, eventually law enforcement is going to come.&rdquo; Jennifer
Granick at the ACLU called remote key storage in this configuration
&ldquo;quite dangerous,&rdquo; particularly given that the same mechanism is
available to any government that can issue a Microsoft-compatible legal
order — not only the US Department of Justice.</p>
<p>That last point is the one European institutions should be reading
carefully.</p>
<hr>
<h2 id="why-this-is-a-european-problem">Why This Is a European Problem</h2>
<p>The CLOUD Act — the US Clarifying Lawful Overseas Use of Data Act,
passed in 2018 — allows US law enforcement to compel US-based companies
to produce data held on servers anywhere in the world. If your
university stores its BitLocker recovery keys in a Microsoft account,
and Microsoft is a US company, the geographic location of the servers
those keys sit on does not limit a US court&rsquo;s reach. The keys are in
Virginia, legally, wherever the hardware is.</p>
<p>This is not speculation. It is the explicit structure of US digital law.
The European Court of Justice has repeatedly ruled that certain US
surveillance frameworks are incompatible with GDPR — the invalidation
of Privacy Shield in <em>Schrems II</em> (2020) being the most prominent
example. But court rulings about data transfer frameworks do not
automatically change the operational reality for an institution whose
laptops are running Windows with default settings.</p>
<p>European universities hold exactly the kinds of data that make this
a real rather than a theoretical concern:</p>
<ul>
<li><strong>Research data</strong>: medical studies, clinical trials, interviews with
human subjects, social science datasets — all subject to strict
ethical and legal protections</li>
<li><strong>Student records</strong>: academic performance, personal circumstances,
disciplinary proceedings</li>
<li><strong>HR data</strong>: employment contracts, salary records, health information,
union activity — particularly sensitive under German and EU labour
law</li>
<li><strong>Correspondence and draft documents</strong>: research in progress, grant
applications, peer review material</li>
</ul>
<p>If the disk holding any of this is encrypted with BitLocker, and the
recovery key has been uploaded to a Microsoft account by default, the
encryption provides less protection than it appears to. The key is
accessible to a foreign state with a court order. That state is not
party to GDPR.</p>
<hr>
<h2 id="the-structural-problem">The Structural Problem</h2>
<p>The BitLocker story is one instance of a larger pattern. It is not that
Microsoft behaved unusually or maliciously — it complied with a lawful
order in its home jurisdiction, as it is legally required to do. The
problem is structural: <strong>when an institution depends on a closed-source,
US-headquartered platform for its critical infrastructure, the
institution has delegated control over its own data to an entity whose
legal obligations lie elsewhere.</strong></p>
<p>This applies beyond encryption. It applies to email (Exchange Online,
Outlook), document storage (SharePoint, OneDrive), communication
(Teams), identity management (Azure Active Directory), and any service
that runs through a Microsoft account or Azure tenant. For each of these:
the data is subject to Microsoft&rsquo;s terms, and Microsoft is subject to
US law.</p>
<p>The same argument applies, with different specifics, to Google Workspace
and any other US-headquartered platform. The issue is not that these
companies are bad actors. It is that their legal accountability and the
legal accountability of European public institutions point in
incompatible directions, and the institutions mostly have not noticed.</p>
<hr>
<h2 id="what-sovereign-software-looks-like">What Sovereign Software Looks Like</h2>
<p>The alternative is not paranoia and air-gapped servers. It is a
coherent strategy for institutional digital infrastructure that is
based on software the institution controls.</p>
<p>In Germany, this conversation has a name and a project. <strong>OpenDesk</strong>
— developed under the aegis of the federal and state governments —
is a stack of open-source tools (Nextcloud, Collabora Online, Matrix/
Element, Jitsi, Keycloak, Open-Xchange) assembled into an integrated
workspace alternative to Microsoft 365. The <em>Souveräner Arbeitsplatz</em>
(sovereign workspace) concept behind it is exactly what the BitLocker
story illustrates: if the software is open, the keys stay in your
institution, and no foreign court can reach them via a warrant served
on a US company.</p>
<p>Several German states and federal agencies have been piloting OpenDesk.
The city of Munich&rsquo;s earlier experiment with Linux (LiMux) and its
eventual rollback to Windows is the cautionary tale here — not because
open source failed, but because the transition was not supported
seriously enough over time, and the incumbent vendor&rsquo;s lobbying was.
The BitLocker story is a reminder of what is at stake in that political
negotiation.</p>
<p>The FSFE&rsquo;s <strong>&ldquo;Public Money? Public Code!&rdquo;</strong> campaign has articulated
the principle cleanly: software developed with public funding should
be released as open-source software. The argument is not only about
freedom as an abstract value. It is about the practical consequence of
being locked into a proprietary platform: your institution loses the
ability to audit what the software does, to modify it to meet your
requirements, to host it where your data protection law applies, and to
switch providers without losing access to your own data.</p>
<hr>
<h2 id="what-i-do-and-why">What I Do, and Why</h2>
<p>I work at a publicly funded institution. The software I build for
institutional contexts — campus infrastructure, workforce management,
archival systems, alert systems — is public.</p>
<p>Not because I am ideologically committed to open source as a movement,
but because the alternative is incoherent. If I build tooling for a
university with public funds and keep it closed, I have produced a
private asset with public money, duplicated by every institution that
builds the same thing independently, inspectable by nobody, and
ultimately dependent on my continued willingness to maintain it or
hand it over. None of those outcomes serve the institutions I am
building for.</p>
<p>Here is what that looks like in practice:</p>
<p><strong><a href="https://github.com/sebastianspicker/zammad-ticket-archiver">zammad-ticket-archiver</a></strong> —
automated archival of Zammad support tickets as cryptographically
signed PDFs, with RFC 3161 timestamps for non-repudiation. Built for
institutions that need legally defensible audit trails of their
helpdesk operations. The signing infrastructure is self-hosted; no
external party holds the keys.</p>
<p><strong><a href="https://github.com/sebastianspicker/alarm-broker">alarm-broker</a></strong> —
a silent panic alarm broker for campus facilities. Receives emergency
triggers from hardware devices (Yealink keys), distributes
notifications via Zammad, SMS, and Signal, with acknowledgment
tracking and escalation scheduling. Runs locally, logs to
self-hosted PostgreSQL; no external dependency for the alarm path.</p>
<p><strong><a href="https://github.com/sebastianspicker/campus-app-kit">campus-app-kit</a></strong> —
a React Native / Expo starter for university mobile applications,
with a pluggable Node.js backend designed for institutional data
sources (room booking, events, schedules). The architecture separates
institution-specific connectors (which institutions keep private) from
the shared foundation (which is public). Any university can take it
and build on it without starting from scratch.</p>
<p><strong><a href="https://github.com/sebastianspicker/cueq">cueq</a></strong> — an integrated
workforce management system for German universities under TV-L
(the collective agreement for public sector employees in the German
states). Handles time recording, shift planning, absence management,
payroll export, and GDPR-compliant audit trails. Built around NestJS
and Next.js, with a PostgreSQL backend and Honeywell terminal
integration. The HR data stays on the institution&rsquo;s own infrastructure.</p>
<p>These are all boring. They are not research contributions; they are
plumbing. But plumbing is what holds institutions together, and the
question of who controls the plumbing — and under whose legal
jurisdiction — is exactly the question the BitLocker story makes
visible.</p>
<hr>
<h2 id="the-principle">The Principle</h2>
<p>Public money, public code. If an institution funded by public money
develops software for its own operations, that software should be
released under an open licence, inspectable, forkable, and deployable
by any institution with the same needs.</p>
<p>The corollary: institutions funded by public money should prefer
software that is itself openly licensed, auditable, and deployable
on infrastructure the institution controls. Not as a blanket ban on
proprietary tools where they are genuinely the best option, but as a
starting presumption that shifts the burden of justification.</p>
<p>The BitLocker story is not a story about Microsoft doing something
wrong. It is a story about the logical consequence of a procurement
decision that was made without asking &ldquo;and what happens when a US
court sends a subpoena?&rdquo; That question was available in 2018 when the
CLOUD Act passed, in 2020 when <em>Schrems II</em> was decided, and before
both. It is still available now, for every institution that has not
yet asked it.</p>
<hr>
<p><em>The FSFE &ldquo;Public Money? Public Code!&rdquo; campaign is at
<a href="https://publiccode.eu/">publiccode.eu</a>. The OpenDesk project is at
<a href="https://opendesk.de/">opendesk.de</a>. The original TechCrunch reporting
on the BitLocker handover is at
<a href="https://techcrunch.com/2026/01/23/microsoft-gave-fbi-a-set-of-bitlocker-encryption-keys-to-unlock-suspects-laptops-reports/">techcrunch.com</a>.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Should I Drive to the Car Wash? On Grounding and a Different Kind of LLM Failure</title>
      <link>https://sebastianspicker.github.io/posts/car-wash-grounding/</link>
      <pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/car-wash-grounding/</guid>
      <description>A viral video this month showed an AI assistant confidently answering &amp;ldquo;should I go to the car wash today?&amp;rdquo; without knowing it was raining outside. The internet found it funny. The failure mode is real but distinct from the strawberry counting problem — this is not a representation issue, it is a grounding issue. The model understood the question perfectly. What it lacked was access to the state of the world the question was about.</description>
      <content:encoded><![CDATA[<p><em>Follow-up to <a href="/posts/strawberry-tokenisation/">Three Rs in Strawberry</a>,
which covered a different LLM failure: tokenisation and why models cannot
count letters. This one is about something structurally different.</em></p>
<hr>
<h2 id="the-video">The Video</h2>
<p>Someone asked their car&rsquo;s built-in AI assistant: &ldquo;Should I drive to the
car wash today?&rdquo; It was raining. The assistant said yes, enthusiastically,
with reasons: regular washing extends the life of the paintwork, removes
road salt, and so on. Technically correct statements, all of them.
Completely beside the point.</p>
<p>The clip spread. The reactions were the usual split: one camp said this
proves AI is useless, the other said it proves people expect too much
from AI. Both camps are arguing about the wrong thing.</p>
<p>The interesting question is: why did the model fail here, and is this
the same kind of failure as the strawberry problem?</p>
<p>It is not. The failures look similar from the outside — confident wrong
answer, context apparently ignored — but the underlying causes are
different, and the difference matters if you want to understand what
these systems can and cannot do.</p>
<hr>
<h2 id="the-strawberry-problem-was-about-representation">The Strawberry Problem Was About Representation</h2>
<p>In the strawberry case, the model failed because of the gap between its
input representation (BPE tokens: &ldquo;straw&rdquo; + &ldquo;berry&rdquo;) and the task (count
the character &ldquo;r&rdquo;). The character information was not accessible in the
model&rsquo;s representational units. The model understood the task correctly —
&ldquo;count the r&rsquo;s&rdquo; is unambiguous — but the input structure did not support
executing it.</p>
<p>That is a <em>representation</em> failure. The information needed to answer
correctly was present in the original string but was lost in the
tokenisation step.</p>
<p>The car wash case is different. The model received a perfectly
well-formed question and had no representation problem at all. &ldquo;Should I
drive to the car wash today?&rdquo; is tokenised without any information loss.
The model understood it. The failure is that the correct answer depends
on information that was never in the input in the first place.</p>
<hr>
<h2 id="the-missing-context">The Missing Context</h2>
<p>What would you need to answer &ldquo;should I drive to the car wash today?&rdquo;
correctly?</p>
<ul>
<li>The current weather (is it raining now?)</li>
<li>The weather forecast for the rest of the day (will it rain later?)</li>
<li>The current state of the car (how dirty is it?)</li>
<li>Possibly: how recently was it last washed, what kind of dirt (road
salt after winter, tree pollen in spring), whether there is a time
constraint</li>
</ul>
<p>None of this is in the question. A human asking the question has access
to some of it through direct perception (look out the window) and some
through memory (I just drove through mud). A language model has access
to none of it.</p>
<p>Let $X$ denote the question and $C$ denote this context — the current
state of the world that the question is implicitly about. The correct
answer $A$ is a function of both:</p>
$$A = f(X, C)$$<p>The model has $X$. It does not have $C$. What it produces is something
like an expectation over possible contexts, marginalising out the unknown
$C$:</p>
$$\hat{A} = \mathbb{E}_C\!\left[\, f(X, C) \,\right]$$<p>Averaged over all plausible contexts in which someone might ask this
question, &ldquo;going to the car wash&rdquo; is probably a fine idea — most of the
time when people ask, it is not raining and the car is dirty.
$\hat{A}$ is therefore approximately &ldquo;yes.&rdquo; The model returns &ldquo;yes.&rdquo;
In this particular instance, where $C$ happens to include &ldquo;it is
currently raining,&rdquo; $\hat{A} \neq f(X, C)$.</p>
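<p>A toy calculation shows how the marginalisation goes wrong. The context probabilities are invented:</p>

```python
# Toy version of A_hat = E_C[f(X, C)]: the model effectively averages the
# correct answer over plausible contexts. All probabilities are invented.
contexts = {
    # context: (P(context), correct answer f(X, C); 1 = "yes, go wash")
    "dry_and_dirty": (0.70, 1),
    "raining_now":   (0.20, 0),
    "just_washed":   (0.10, 0),
}

a_hat = sum(p * a for p, a in contexts.values())      # expectation over C
answer_on_average = "yes" if a_hat > 0.5 else "no"

# The averaged answer is "yes" — even though in the actual context
# ("raining_now") the correct answer f(X, C) is "no".
```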
<p>The quantity that measures how much the missing context matters is the
mutual information between the answer and the context, given the
question:</p>
$$I(A;\, C \mid X) \;=\; H(A \mid X) - H(A \mid X, C)$$<p>Here $H(A \mid X)$ is the residual uncertainty in the answer given only
the question, and $H(A \mid X, C)$ is the residual uncertainty once the
context is also known. For most questions in a language model&rsquo;s training
distribution — &ldquo;what is the capital of France?&rdquo;, &ldquo;how do I sort a list
in Python?&rdquo; — this mutual information is near zero: the context does not
change the answer. For situationally grounded questions like the car wash
question, it is large: the answer is almost entirely determined by the
context, not the question.</p>
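<p>Plugging invented numbers into the same toy distribution shows the quantity directly. In this sketch the context fully determines the answer, so $H(A \mid X, C) = 0$ and the mutual information reduces to $H(A \mid X)$:</p>

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented distribution over contexts for the car-wash question.
p_context = {"dry_and_dirty": 0.7, "raining_now": 0.2, "just_washed": 0.1}
answer = {"dry_and_dirty": "yes", "raining_now": "no", "just_washed": "no"}

p_yes = sum(p for c, p in p_context.items() if answer[c] == "yes")
h_a_given_x = entropy([p_yes, 1 - p_yes])        # uncertainty given X alone
h_a_given_xc = 0.0                               # context pins the answer down
mutual_information = h_a_given_x - h_a_given_xc  # large: context matters

# Contrast "what is the capital of France?": the answer is independent of
# context, so H(A | X) is already ~0 and the mutual information is ~0.
```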
<hr>
<h2 id="why-the-model-was-confident-anyway">Why the Model Was Confident Anyway</h2>
<p>This is the part that produces the most indignation in the viral clips:
not just that the model was wrong, but that it was <em>confident</em> about
being wrong. It did not say &ldquo;I don&rsquo;t know what the current weather is.&rdquo;
It said &ldquo;yes, here are five reasons you should go.&rdquo;</p>
<p>Two things are happening here.</p>
<p><strong>Training distribution bias.</strong> Most questions in the training data that
resemble &ldquo;should I do X?&rdquo; have answers that can be derived from general
knowledge, not from real-time world state. &ldquo;Should I use a VPN on public
WiFi?&rdquo; &ldquo;Should I stretch before running?&rdquo; &ldquo;Should I buy a house or rent?&rdquo;
All of these have defensible answers that do not depend on the current
weather. The model learned that this question <em>form</em> typically maps to
answers of the form &ldquo;here are some considerations.&rdquo; It applies that
pattern here.</p>
<p><strong>No explicit uncertainty signal.</strong> The model was not trained to say
&ldquo;I cannot answer this because I lack context C.&rdquo; It was trained to
produce helpful-sounding responses. A response that acknowledges
missing information requires the model to have a model of its own
knowledge state — to know what it does not know. This is harder than
it sounds. The model has to recognise that $I(A; C \mid X)$ is high
for this question class, which requires meta-level reasoning about
information structure that is not automatically present.</p>
<p>This is sometimes called <em>calibration</em>: the alignment between expressed
confidence and actual accuracy. A well-calibrated model that is 80%
confident in an answer is right about 80% of the time. A model that is
confident about answers it cannot possibly know from its training data
is miscalibrated. The car wash video is a calibration failure as much
as a grounding failure.</p>
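<p>Calibration can be quantified. One common summary, in the spirit of the Guo
et al. paper cited below, is the expected calibration error (ECE): bin
predictions by stated confidence and measure the gap between confidence and
accuracy in each bin. The sketch uses invented predictions and a minimal
binning scheme, not any particular evaluation suite.</p>

```python
# Sketch of expected calibration error (ECE) on toy predictions.
# The (confidence, correct) pairs below are invented for illustration.
def expected_calibration_error(preds, n_bins=5):
    """preds: list of (confidence, was_correct) pairs.
    Bins predictions by confidence, then averages the gap between
    mean confidence and empirical accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(preds)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# A model that says "90% sure" but is right only half the time
# is badly miscalibrated: its ECE is large.
overconfident = [(0.9, True), (0.9, False)] * 10
print(expected_calibration_error(overconfident))  # 0.4
```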
<hr>
<h2 id="what-grounding-means">What Grounding Means</h2>
<p>The term <em>grounding</em> in AI has a precise origin. Harnad (1990) used it
to describe the problem of connecting symbol systems to the things
they refer to — how does the word &ldquo;apple&rdquo; connect to actual apples,
rather than just to other symbols? A symbol system that only connects
symbols to other symbols (dictionary definitions, synonym relations)
has the form of meaning without the substance.</p>
<p>Applied to language models: the model has rich internal representations
of concepts like &ldquo;rain,&rdquo; &ldquo;car wash,&rdquo; &ldquo;dirty car,&rdquo; and their relationships.
But those representations are grounded in text about those things, not in
the things themselves. The model knows what rain is. It does not know
whether it is raining right now, because &ldquo;right now&rdquo; is not a location
in the training data.</p>
<p>This is not a problem that can be solved by making the model bigger or
training it on more text. More text does not give the model access to the current
state of the world. It is a structural feature of how these systems work:
they are trained on a static corpus and queried at inference time, with
no automatic connection to the world state at the moment of the query.</p>
<hr>
<h2 id="what-tool-use-gets-you-and-what-it-doesnt">What Tool Use Gets You (and What It Doesn&rsquo;t)</h2>
<p>The standard engineering response to grounding problems is tool use:
give the model access to a weather API, a calendar, a search engine.
Now when asked &ldquo;should I go to the car wash today?&rdquo; the model can query
the weather service, get the current conditions, and factor that into
the answer.</p>
<p>This is genuinely useful. The model with a weather tool call will answer
this question correctly in most circumstances. But tool use solves the
problem only if two conditions hold:</p>
<ol>
<li>
<p><strong>The model knows it needs the tool.</strong> It must recognise that this
question has $I(A; C \mid X) > 0$ for context $C$ that a weather
tool can provide, and that it is missing that context. This requires
the meta-level awareness described above. Models trained on tool use
learn to invoke tools for recognised categories of question; for novel
question types, or questions that superficially resemble answerable
ones, the tool call may not be triggered.</p>
</li>
<li>
<p><strong>The right tool exists and returns clean data.</strong> Weather APIs exist.
&ldquo;How dirty is my car?&rdquo; does not have an API. &ldquo;Am I the kind of person
who cares about car cleanliness enough that this matters?&rdquo; has no API.
Some missing context can be retrieved; some is inherently private to
the person asking.</p>
</li>
</ol>
<p>The deeper issue is not tool availability but <em>knowing what you don&rsquo;t
know</em>. A model that does not recognise its own information gaps cannot
reliably decide when to use a tool, ask a clarifying question, or
express uncertainty. This is a hard problem — arguably harder than
making the model more capable at the tasks it already handles.</p>
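<p>To see why the gating decision is fragile, consider the crudest possible
version of it. The sketch below routes questions with a hand-written keyword
rule; the tool name and the rule itself are hypothetical, and the point is
precisely that any fixed rule misses question types it was not written for.</p>

```python
# Hypothetical sketch of tool gating: route a question to a tool only
# when it is judged to depend on current world state. Real systems
# learn this decision; the keyword rule here is purely illustrative
# and brittle by construction.
CONTEXT_DEPENDENT_CUES = ("today", "right now", "current", "tonight")

def needs_live_context(question: str) -> bool:
    """Crude stand-in for estimating whether I(A; C | X) is large."""
    q = question.lower()
    return any(cue in q for cue in CONTEXT_DEPENDENT_CUES)

def answer(question: str) -> str:
    if needs_live_context(question):
        return "TOOL_CALL: weather_api"   # hypothetical tool name
    return "ANSWER_FROM_PARAMETERS"

print(answer("Should I go to the car wash today?"))  # TOOL_CALL: weather_api
print(answer("What is the capital of France?"))      # ANSWER_FROM_PARAMETERS
```

A question phrased as "is the car wash worth it?" slips straight past this
gate, which is the failure mode the text describes.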
<hr>
<h2 id="the-contrast-stated-plainly">The Contrast, Stated Plainly</h2>
<p>The strawberry failure and the car wash failure look alike from the
outside — confident wrong answer — but they are different enough that
conflating them produces confused diagnosis and confused solutions.</p>
<p>Strawberry: the model has the information (the string &ldquo;strawberry&rdquo;), but
the representation (BPE tokens) does not preserve character-level
structure. The fix is architectural or procedural: character-level
tokenisation, chain-of-thought letter spelling.</p>
<p>Car wash: the model does not have the information (current weather,
car state). No fix to the model&rsquo;s architecture or prompt engineering
gives it information it was never given. The fix is exogenous: provide
the context explicitly, or give the model a tool that can retrieve it,
or design the system so that context-dependent questions are routed to
systems that have access to the relevant state.</p>
<p>A model that confidently answers the car wash question without access to
current conditions is not failing at language understanding. It is
behaving exactly as its training shaped it to behave, given its lack of
situational grounding. Knowing which kind of failure you are looking at
is most of the work in figuring out what to do about it.</p>
<hr>
<p><em>The grounding problem connects to the broader question of what it means
for a language model to &ldquo;know&rdquo; something — which comes up in a different
form in the <a href="/posts/more-context-not-always-better/">context window post</a>,
where the issue is not missing context but irrelevant context drowning
out the relevant signal.</em></p>
<p><em>A second car wash video a few weeks later produced a third, different
failure: <a href="/posts/car-wash-walk/">Car Wash, Part Three: The AI Said Walk</a> —
the model had the right world state but chose the wrong interpretation
of the question.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Harnad, S. (1990). The symbol grounding problem. <em>Physica D:
Nonlinear Phenomena</em>, 42(1–3), 335–346.
<a href="https://doi.org/10.1016/0167-2789(90)90087-6">https://doi.org/10.1016/0167-2789(90)90087-6</a></p>
</li>
<li>
<p>Guo, C., Pleiss, G., Sun, Y., &amp; Weinberger, K. Q. (2017). <strong>On
calibration of modern neural networks.</strong> <em>ICML 2017</em>.
<a href="https://arxiv.org/abs/1706.04599">https://arxiv.org/abs/1706.04599</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Try to Relax — and Other Things That Prevent Themselves</title>
      <link>https://sebastianspicker.github.io/posts/try-to-relax-ironic-process-wormholes/</link>
      <pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/try-to-relax-ironic-process-wormholes/</guid>
      <description>&amp;ldquo;Try to relax&amp;rdquo; is a paradox with a precise psychological mechanism. So is the traversable wormhole: the geometry you need to cross spacetime closes the moment you try to use it. The grandfather paradox, Wegner&amp;rsquo;s ironic monitoring process, and Rick Sanchez&amp;rsquo;s nihilism problem all share the same deep structure — and understanding that structure is more interesting than any of the individual cases.</description>
      <content:encoded><![CDATA[<p>Someone, at some point in your life, has told you to relax. They may have
specified that you should <em>try</em> to relax — as though relaxation were an
effortful goal you could pursue with sufficient will. If you have ever
received this advice and found it made things worse, you were not imagining
it. You were experiencing a phenomenon with a name, a precise mechanism,
and — it turns out — a surprising structural analogue in the geometry of
spacetime.</p>
<hr>
<h2 id="the-ironic-process">The Ironic Process</h2>
<p>In 1994, the social psychologist Daniel Wegner published a paper that
formalised what most people already suspected: trying not to think of
something makes you think of it more <a href="#ref-1">[1]</a>. The theoretical
model behind this has two components.</p>
<p>The first is an <strong>operating process</strong>: it actively generates mental content
consistent with the intended state. You are trying to relax — the operating
process searches for calming thoughts, slows your attention, tries to find
the mood.</p>
<p>The second is a <strong>monitoring process</strong>: it runs in parallel, searching for
evidence that the goal has <em>not</em> been achieved. Am I relaxed yet? No.
Checking again. Still no. Its function is to detect failure early so the
operating process can correct course.</p>
<p>Under normal conditions, the operating process dominates. You try to relax,
the monitor runs quietly in the background, and eventually you converge on
the intended state. Under conditions of cognitive load, stress, or
self-consciousness — precisely the conditions under which someone might
urgently need to relax — the balance shifts. The monitoring process,
searching for signs of not-relaxing, finds them everywhere. The monitor
activates the very content it is supposed to prevent. The harder you try,
the louder the monitor, the further from the goal.</p>
<p>This is Wegner&rsquo;s ironic process: the mechanism recruited to achieve a goal
becomes the primary obstacle to that goal. It is not failure of will. It
is a structural property of the system — and it applies to any goal whose
target state is the <em>absence</em> of effortful activity. Trying to fall asleep.
Trying not to feel anxious about a performance. Trying to be spontaneous.
Trying, in the most purely paradoxical formulation, to relax.</p>
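<p>A toy numerical caricature makes the load-dependence visible. This is not
Wegner&rsquo;s actual formalism — the dynamics and constants below are invented —
but it shows how a monitor that re-injects activation can flip a converging
system into a diverging one.</p>

```python
# Toy caricature of the two-process model: invented dynamics,
# illustrating only the qualitative load-dependence.
def relax(steps=50, load=0.0):
    """Returns final 'tension'. The operating process decays tension;
    each monitoring check re-injects tension scaled by cognitive load."""
    tension = 1.0
    for _ in range(steps):
        tension -= 0.1 * tension   # operating process: calm down
        tension += load * 0.2      # monitor: "am I relaxed yet?"
    return tension

print(relax(load=0.0) < 0.01)  # True: monitor quiet, tension decays away
print(relax(load=0.8) > 1.0)   # True: under load, checking dominates
```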
<p>The instruction &ldquo;try to relax&rdquo; is not bad advice because the advice-giver
lacks empathy. It is bad advice because it is a <em>category error</em>: it applies
an effort-based tool to a goal defined by the absence of effort. The
monitoring process required to track progress toward the goal is precisely
the kind of activity that constitutes not having reached it.</p>
<hr>
<h2 id="a-geometry-that-does-the-same-thing">A Geometry That Does the Same Thing</h2>
<p>The analogy I want to draw requires a brief detour into general relativity.</p>
<p>In 1988, Michael Morris and Kip Thorne published a paper with the
unpromising title &ldquo;Wormholes in spacetime and their use for interstellar
travel: A tool for teaching general relativity&rdquo; <a href="#ref-2">[2]</a>.
It is, in the field&rsquo;s understated way, one of the more consequential papers
in the subject. Morris and Thorne asked: what would a traversable wormhole —
one you could actually pass through — require, physically and mathematically?</p>
<p>The spacetime metric of a traversable wormhole in their formulation is:</p>
$$ds^2 = -e^{2\Phi(r)}\,dt^2 + \frac{dr^2}{1 - b(r)/r} + r^2\,d\Omega^2$$<p>where $\Phi(r)$ is the redshift function and $b(r)$ is the shape function.
The throat of the wormhole sits at $r = r_0$, where $b(r_0) = r_0$.
For anything to pass through in finite proper time, $\Phi$ must remain
finite — no infinite redshift — and $b(r)/r$ must remain less than one
away from the throat.</p>
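<p>The throat condition is easy to check numerically for a textbook choice of
shape function. The sketch below uses $b(r) = r_0^2/r$ (an Ellis-type example)
with $\Phi = 0$; this particular choice is illustrative, not the only
admissible one.</p>

```python
# Check the traversability conditions for one textbook shape function,
# b(r) = r0**2 / r, with Phi = 0: b(r0) = r0 at the throat, and
# b(r)/r < 1 everywhere outside it.
r0 = 1.0
b = lambda r: r0**2 / r

assert abs(b(r0) - r0) < 1e-12      # throat condition b(r0) = r0
for r in [1.001, 2.0, 10.0, 1000.0]:
    assert b(r) / r < 1.0           # g_rr stays finite outside the throat

# The radial metric factor 1/(1 - b/r) diverges as r -> r0, though
# the proper distance through the throat itself remains finite.
print(1 / (1 - b(1.1) / 1.1))   # large but finite just outside the throat
```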
<p>So far this is just geometry. The physics enters through the Einstein field
equations, which connect the geometry to the matter and energy present.
To maintain the wormhole throat against collapse — to hold it open — the
stress-energy tensor of whatever matter fills the throat must satisfy:</p>
$$T_{\mu\nu}\, k^\mu k^\nu < 0$$<p>for null vectors $k^\mu$ — what is called a <em>violation of the null energy
condition</em>. In plain terms: the matter holding the wormhole open must have
negative energy density. Not small energy density. <em>Negative</em> — less than
nothing.</p>
<p>This is exotic matter. It does not appear in any tabletop experiment.
Classical general relativity does not rule it out, but it does not provide
it either.</p>
<p>Quantum mechanics is slightly more helpful: the Casimir effect produces
measurable negative energy density between closely spaced conducting plates.
The Hawking radiation calculation involves transient negative energy near
black hole horizons. So quantum field theory permits negative energy — in
principle. But Ford and Roman <a href="#ref-3">[3]</a> showed that quantum field theory also
strictly <em>limits</em> it: the integrated negative energy over any region is
bounded by a quantum inequality. The shorter the burst of negative energy,
the smaller it must be; the larger the region, the more constrained the
magnitude. The result is that any realistic traversable wormhole would be
either Planck-scale (far too small for anything but quantum information to
traverse) or would require negative energy concentrated in a band many
orders of magnitude thinner than the throat itself — an engineering
requirement that borders on the physically absurd.</p>
<p>The wormhole, in other words, does something structurally similar to the
monitoring process in Wegner&rsquo;s model: the condition required to make it
traversable actively resists being satisfied. The geometry that would allow
passage tends toward collapse. The more you want the wormhole to be open
and stable, the more the energy conditions conspire against you.</p>
<hr>
<h2 id="what-the-2022-wormhole-actually-was">What the 2022 &ldquo;Wormhole&rdquo; Actually Was</h2>
<p>In late 2022, a team including Daniel Jafferis, Alexander Zlokapa, and
colleagues at Caltech and Google published a paper in <em>Nature</em> with the
title &ldquo;Traversable wormhole dynamics on a quantum processor&rdquo; <a href="#ref-4">[4]</a>. Several major news outlets reported that scientists had
created a wormhole. This was not accurate.</p>
<p>What the team actually did was implement a quantum circuit on Google&rsquo;s
Sycamore processor that simulates the Sachdev-Ye-Kitaev (SYK) model —
a quantum mechanical system of randomly interacting fermions that is
holographically dual, via Maldacena&rsquo;s AdS/CFT correspondence, to a
nearly two-dimensional anti-de Sitter black hole geometry. Two coupled
SYK systems are dual to a two-sided eternal black hole, which is connected
in the bulk by an Einstein-Rosen bridge — a wormhole.</p>
<p>By coupling the two systems with a specific negative coupling (which
corresponds, via ER=EPR, to injecting negative energy into the wormhole),
the team made the bridge traversable in the holographic sense: information
encoded in one quantum system propagated and was recovered in the other,
consistent with traversal of the dual gravitational wormhole.</p>
<p>This is genuinely interesting physics. It is not a wormhole through our
spacetime. The wormhole lives in the holographic dual geometry — a
mathematical construct in a lower-dimensional theory of gravity, not a
tunnel between two points in the universe you inhabit. Quantum teleportation
occurred on a quantum chip via the ordinary mechanism of quantum
entanglement. The gravitational language is a description of the
same physics in a dual frame, not a shortcut through space.</p>
<p>The media confusion is itself instructive: &ldquo;wormhole&rdquo; has drifted far from
its original meaning. In current physics, the word can refer to a
Morris-Thorne traversable tunnel through spacetime, to the Einstein-Rosen
bridge of an eternal black hole, to a holographic dual of quantum
entanglement <a href="#ref-5">[5]</a>, or to saddle points in the
Euclidean gravitational path integral relevant to the black hole information
paradox. These are related by mathematics but quite different in what they
physically represent. None of the last three are traversable shortcuts
through the universe. The first is, in principle, but barely, and only at
the cost of exotic matter physics that nobody knows how to achieve.</p>
<p>The harder physicists have worked to make the wormhole genuinely traversable
and macroscopic, the more the mathematics has resisted. This is, at minimum,
a suggestive pattern.</p>
<hr>
<h2 id="what-2025-added">What 2025 Added</h2>
<p>The field did not stand still after 2022. Three independent lines of work
published in 2024 and 2025 have further complicated what a wormhole is —
and in each case the complication pushes in the same direction: the geometry
keeps refusing to be a shortcut.</p>
<p><strong>The wormhole that does not connect two things.</strong> Maloney, Meruliya, and Van Raamsdonk <a href="#ref-7">[7]</a> showed that Euclidean wormholes — saddle points in
the gravitational path integral — appear generically in ordinary
higher-dimensional gravity, without any special setup. The striking
implication is that these wormholes do not bridge two separate universes
or two separate theories; they encode statistical fluctuations <em>within a
single theory</em>. The replica wormholes that resolved the Page curve for
black hole radiation — one of the central recent results in the black hole
information paradox — are of this type. The wormhole is not a connection
between two things. It is a feature of how the theory sums over histories,
a bookkeeping structure for correlations within one system. The physical
picture of two mouths joined by a throat does not apply.</p>
<p><strong>The wormhole that is not smooth.</strong> Magán, Sasieta, and Swingle <a href="#ref-8">[8]</a> studied the interior geometry of the Einstein-Rosen bridge connecting
typical entangled black holes — the configuration that is supposed, under
ER=EPR, to be the gravitational dual of quantum entanglement. Their result,
published in <em>Physical Review Letters</em>, is that this interior is not a
smooth tunnel. It is long, irregular, and chaotic — an Einstein-Rosen
caterpillar, as they call it. The quantum randomness of the entangled state
maps directly onto geometric disorder in the interior: the more thermalized
the state, the more disordered the bridge. A traversing observer, if one
could exist, would not glide through a clean throat. They would navigate a
geometry shaped by quantum chaos, growing longer and more disordered as
the system evolves. This is ER=EPR taken seriously at the level of typical
states rather than special ones, and the result is inhospitable to any
ordinary notion of passage.</p>
<p><strong>The wormhole that is not a tunnel at all.</strong> Gaztañaga, Kumar, and Marto <a href="#ref-9">[9]</a> proposed a more radical reinterpretation: the Einstein-Rosen bridge,
they argue, is not a connection between two separate spaces but a
representation of time-reversal symmetry within a single quantum description.
On this reading, there is only one space, and the bridge is an artefact
of how you describe the time-symmetric structure of the quantum state. The
paper, published in <em>Classical and Quantum Gravity</em>, attracted considerable
press coverage. It sits somewhat outside the mainstream of holographic
quantum gravity research, and the proposal has not yet been widely
integrated into the community&rsquo;s working framework — the language of two
entangled systems and a connecting geometry remains the dominant picture
in AdS/CFT calculations. But the direction it points is consistent with
the other two results.</p>
<p>Taken together, these papers suggest that the word &ldquo;wormhole&rdquo; has been
quietly revised from a noun into an adjective. Not a thing that exists
somewhere, but a property of certain mathematical structures — one that
describes correlation, disorder, or symmetry depending on which context
you are working in. Each attempt to pin down what a wormhole <em>is</em> in
practice finds something less traversable, less connected, and less
tunnel-like than the previous attempt.</p>
<p>This is, to put it plainly, consistent with the theme of this article.</p>
<hr>
<h2 id="causation-eating-its-own-tail">Causation Eating Its Own Tail</h2>
<p>The wormhole&rsquo;s physical problems become even sharper when you add time.
A traversable wormhole connecting two different spacetime regions can in
principle connect not just two different places but two different <em>times</em> —
creating a closed timelike curve (CTC), a path through spacetime that loops
back on itself. You leave on Tuesday and arrive last Thursday.</p>
<p>The standard paradoxes then apply. The grandfather paradox: you travel back
in time, prevent an event that was a necessary precondition of your journey.
The causal chain that produced the journey destroys the causal chain that
produced the journey. The bootstrap paradox: an object or piece of
information exists with no origin — passed back in time repeatedly, it has
always already existed, created by nothing, caused by itself.</p>
<p>Friedman, Morris, Novikov and colleagues formalised what has become known
as the Novikov self-consistency principle: the only physically admissible
solutions are those in which the causal structure is globally consistent <a href="#ref-6">[6]</a>. No grandfather paradox — not because you cannot
go back, but because if you do, it turns out you were always part of the
causal chain you thought you were disrupting. The time-traveller cannot
prevent an event; they can only be the mechanism by which it occurred.</p>
<p>This is not resolution. It is constraint. The universe selects only the
self-consistent loops, filtering out everything else. The causal structure
enforces a particular kind of conservatism: only actions that were always
going to happen can happen. There is no freedom in a closed timelike curve.
Trying to change the loop from inside it is exactly like trying to relax
by monitoring whether you have relaxed: the mechanism of change is part
of the thing you are trying to change.</p>
<hr>
<h2 id="rick-sanchezs-particular-problem">Rick Sanchez&rsquo;s Particular Problem</h2>
<p>Rick and Morty is, among other things, a sustained meditation on
this structure — without ever calling it that.</p>
<p>Rick Sanchez is the smartest being in every universe. His portal gun
creates traversable wormholes instantaneously and at negligible energy
cost, which is exactly what general relativity and quantum field theory
suggest should be impossible. The show waves this away; what it does not
wave away is the <em>psychological</em> consequence of Rick&rsquo;s capability.</p>
<p>Rick has thought his way to the conclusion that nothing matters. Infinite
universes, infinite timelines, infinite Ricks: every moment is replaceable,
every loss is recoverable somewhere else, every moral weight dissolves
in the face of the combinatorial enormity of everything that exists. This
is Rick&rsquo;s version of relaxation — the nihilism that should follow from
taking the multiverse seriously.</p>
<p>But the monitoring process runs. Rick checks whether he has achieved
not-caring, finds that he cares (about Morty, about Beth, about being
the smartest one in the room), and the caring becomes more vivid for
having been suppressed. His nihilism is not peace. It is a performance of
peace that is constantly undermined by the monitoring process watching
for cracks.</p>
<p>Rick&rsquo;s portal gun solves every spatial and temporal problem. It does not
solve the ironic process. No level of intelligence, and no number of
traversable wormholes, provides a shortcut past Wegner&rsquo;s monitor.
This is, I think, what makes the character work: the show&rsquo;s impossible
physics is the premise, but the <em>actually</em> impossible thing — the one the
show treats as genuinely intractable — is the psychological paradox.</p>
<hr>
<h2 id="the-common-structure">The Common Structure</h2>
<p>These cases — the relaxation paradox, the traversable wormhole, the closed
timelike curve — share a formal structure.</p>
<p>In each case, there is a desired end state (relaxation, passage through the
wormhole, a changed past) and a mechanism for pursuing it (effortful
monitoring, exotic matter, time travel). In each case, the mechanism
required to pursue the end state is incompatible with the end state itself.
The monitoring process that tracks &ldquo;am I relaxed?&rdquo; is the activity of not
being relaxed. The exotic matter that holds the wormhole open is the
physical condition that makes the geometry so extreme that traversal is
barely possible. The attempt to change the past is always already part
of the past you were trying to change.</p>
<p>The physicist&rsquo;s version of this is the quantum measurement problem: the act
of observing a system disturbs it. The observer cannot step outside the
measurement. The psychologist&rsquo;s version is the ironic process. The
relativist&rsquo;s version is the closed timelike curve. The narrative version
is Rick Sanchez.</p>
<hr>
<h2 id="what-actually-works">What Actually Works</h2>
<p>Wegner&rsquo;s answer to the ironic process is not to try harder with the
operating process. It is to release the monitoring process — to stop checking
whether the goal has been achieved. This is the core insight behind
Acceptance and Commitment Therapy: you cannot think your way to not-thinking.
The goal of not-thinking requires not-monitoring, which means not having the
goal in the active, effortful sense at all.</p>
<p>This is harder than it sounds. It is a second-order intervention: instead
of trying to relax, you try to stop trying to relax — which, done badly,
just adds another monitoring process. But done well, it is the correct
diagnosis: the category error was treating relaxation as an effortful goal
in the first place.</p>
<p>For wormholes, the physics community has arrived at a related answer. The
question &ldquo;how do we make a macroscopic traversable wormhole in our
spacetime?&rdquo; may be the wrong question. The ER=EPR framework suggests that
wormholes and quantum entanglement are two descriptions of the same thing.
The question is not how to build a tunnel; it is what the entanglement
structure of spacetime already is, and how information is already being
transferred through it. The shortcut was never a shortcut. It was always
just the ordinary geometry of entangled quantum systems, described in
a language that made it look exotic.</p>
<p>For Rick Sanchez, the show has not found an answer. Which is, probably,
the correct narrative decision.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wegner, D. M. (1994). Ironic processes of mental control. <em>Psychological Review</em>, 101(1), 34–52. <a href="https://doi.org/10.1037/0033-295X.101.1.34">https://doi.org/10.1037/0033-295X.101.1.34</a></p>
<p><span id="ref-2"></span>[2] Morris, M. S., &amp; Thorne, K. S. (1988). Wormholes in spacetime and their use for interstellar travel: A tool for teaching general relativity. <em>American Journal of Physics</em>, 56(5), 395–412. <a href="https://doi.org/10.1119/1.15620">https://doi.org/10.1119/1.15620</a></p>
<p><span id="ref-3"></span>[3] Ford, L. H., &amp; Roman, T. A. (1996). Quantum field theory constrains traversable wormhole geometries. <em>Physical Review D</em>, 53(10), 5496–5507. <a href="https://doi.org/10.1103/PhysRevD.53.5496">https://doi.org/10.1103/PhysRevD.53.5496</a></p>
<p><span id="ref-4"></span>[4] Jafferis, D., Zlokapa, A., Lykken, J. D., Kolchmeyer, D. K., Davis, S. I., Lauk, N., Neven, H., &amp; Spiropulu, M. (2022). Traversable wormhole dynamics on a quantum processor. <em>Nature</em>, 612, 51–55. <a href="https://doi.org/10.1038/s41586-022-05424-3">https://doi.org/10.1038/s41586-022-05424-3</a></p>
<p><span id="ref-5"></span>[5] Maldacena, J., &amp; Susskind, L. (2013). Cool horizons for entangled black holes. <em>Fortschritte der Physik</em>, 61(9), 781–811. <a href="https://doi.org/10.1002/prop.201300020">https://doi.org/10.1002/prop.201300020</a></p>
<p><span id="ref-6"></span>[6] Friedman, J., Morris, M. S., Novikov, I. D., Echeverria, F., Klinkhammer, G., Thorne, K. S., &amp; Yurtsever, U. (1990). Cauchy problem in spacetimes with closed timelike curves. <em>Physical Review D</em>, 42(6), 1915–1930. <a href="https://doi.org/10.1103/PhysRevD.42.1915">https://doi.org/10.1103/PhysRevD.42.1915</a></p>
<p><span id="ref-7"></span>[7] Maloney, A., Meruliya, V., &amp; Van Raamsdonk, M. (2025). arXiv:2503.12227. <a href="https://arxiv.org/abs/2503.12227">https://arxiv.org/abs/2503.12227</a></p>
<p><span id="ref-8"></span>[8] Magán, J. M., Sasieta, M., &amp; Swingle, B. (2025). Einstein-Rosen caterpillar. <em>Physical Review Letters</em>, 135. <a href="https://doi.org/10.1103/btw6-44ry">https://doi.org/10.1103/btw6-44ry</a></p>
<p><span id="ref-9"></span>[9] Gaztañaga, E., Kumar, A., &amp; Marto, J. (2025). <em>Classical and Quantum Gravity</em>. <a href="https://doi.org/10.1088/1361-6382/ae3044">https://doi.org/10.1088/1361-6382/ae3044</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>A Christmas Star (Minus the Star, Plus a Moon Nobody Asked For)</title>
      <link>https://sebastianspicker.github.io/posts/the-gift-of-transits/</link>
      <pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/the-gift-of-transits/</guid>
      <description>A browser-based simulator for exoplanet transit photometry, binary eclipses, and exomoon scenarios — built with Kepler integrators, limb darkening, and N-body dynamics. I spent Christmas on this. You&amp;rsquo;re welcome, science.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>&lsquo;Tis the season to stare at light curves. While most people were unwrapping
presents on December 25th, I was staring at synthetic flux drop-offs and
debugging a limb-darkening model. The result is a browser-based simulator for
exoplanet transit photometry, including binary eclipses and exomoon scenarios.
It does not detect any real exomoons. It does, however, correctly model why
detecting them is comically hard.</p>
<p>Source: <a href="https://github.com/sebastianspicker/exoplanet-exomoon-simulation">sebastianspicker/exoplanet-exomoon-simulation</a></p>
<hr>
<h2 id="background">Background</h2>
<h3 id="the-gift-of-transits">The gift of transits</h3>
<p>When a planet crosses in front of its host star, the observed stellar flux
drops by a fraction proportional to the ratio of their projected areas:</p>
\[
  \delta = \left(\frac{R_p}{R_\star}\right)^2
\]<p>For Jupiter transiting the Sun, \( \delta \approx 1\% \). For Earth, about
84 ppm. For an exomoon orbiting a Jupiter-sized planet — well, unwrap that
calculation yourself:</p>
\[
  \delta_m = \left(\frac{R_m}{R_\star}\right)^2
\]<p>A Moon-sized exomoon around a Sun-like star contributes roughly 7 ppm of flux
variation. The <em>Kepler</em> space telescope&rsquo;s photometric precision was on the order
of 20–30 ppm per 6-hour cadence for bright targets. Ho ho hold on — that signal
is buried.</p>
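<p>The numbers quoted above follow directly from the depth formula. The sketch
below uses nominal IAU radii in metres; small differences from the quoted
figures reflect rounding, not physics.</p>

```python
# Transit depths delta = (R_p / R_star)**2 for the examples in the text.
# Radii in metres (IAU nominal values).
R_SUN = 6.957e8
R_JUPITER = 7.1492e7   # equatorial
R_EARTH = 6.371e6
R_MOON = 1.7374e6

depth = lambda r_p, r_star=R_SUN: (r_p / r_star)**2

print(f"Jupiter: {depth(R_JUPITER):.4f}")        # ~0.0106, i.e. ~1%
print(f"Earth:   {depth(R_EARTH)*1e6:.0f} ppm")  # ~84 ppm
print(f"Moon:    {depth(R_MOON)*1e6:.1f} ppm")   # single-digit ppm
```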
<h3 id="why-stars-are-not-uniformly-bright-and-why-that-ruins-everything">Why stars are not uniformly bright (and why that ruins everything)</h3>
<p>A star is not a flat disk of uniform intensity. It is darker at the limb than
at the centre — an effect called limb darkening — because the line of sight
through the stellar atmosphere is shallower at the edges, sampling cooler,
less emissive layers. The quadratic limb-darkening law is:</p>
\[
  I(\mu) = I_0 \left[1 - u_1(1 - \mu) - u_2(1 - \mu)^2\right]
\]<p>where \( \mu = \cos\theta \) is the cosine of the angle from disk centre, and \( u_1, u_2 \)
are stellar-type-dependent coefficients. This matters for transit modelling
because the depth of the light curve dip changes as the planet traverses from
limb to centre to limb — the transit is not a flat-bottomed box, it is a
rounded trough. Fitting it incorrectly biases \( R_p / R_\star \) and, more
critically for exomoon searches, generates false residuals that look
suspiciously like a secondary dip.</p>
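<p>The law itself is a one-liner to evaluate. The coefficients below are
plausible Sun-like visible-band values chosen only for illustration; real fits
depend on bandpass and stellar type.</p>

```python
# Quadratic limb-darkening law: I(mu)/I0 = 1 - u1*(1-mu) - u2*(1-mu)**2.
# Coefficients are illustrative Sun-like values, not fitted ones.
u1, u2 = 0.4, 0.26

def intensity(mu, u1=u1, u2=u2):
    """Normalised surface brightness at mu = cos(theta)."""
    return 1 - u1 * (1 - mu) - u2 * (1 - mu)**2

print(intensity(1.0))  # disk centre: 1.0
print(intensity(0.0))  # limb: 1 - u1 - u2 = 0.34
```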
<h3 id="exomoon-detection-the-indirect-approach">Exomoon detection: the indirect approach</h3>
<p>No exomoon has been unambiguously confirmed as of the time of writing this post
(Christmas Day, 2025 — yes, really). The most promising indirect signatures are:</p>
<p><strong>Transit Timing Variations (TTV).</strong> The planet–moon system orbits their
common barycentre. This causes the planet&rsquo;s transit to arrive slightly early
or late relative to a pure Keplerian ephemeris. The timing offset scales as:</p>
\[
  \delta t \approx \frac{m_m}{m_p} \cdot \frac{a_m}{v_p}
\]<p>where \( m_m / m_p \) is the moon-to-planet mass ratio, \( a_m \) is the
moon&rsquo;s semi-major axis around the planet, and \( v_p \) is the planet&rsquo;s
orbital velocity. For an Earth-mass moon at 10 planetary radii around a
Jupiter-mass planet at 1 AU, this is on the order of minutes — measurable, in
principle, with long baselines.</p>
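<p>Plugging the example numbers into the scaling relation confirms the order of magnitude (a sketch; the constants are standard SI values, and the scenario is the one just described):</p>

```python
# Order-of-magnitude TTV amplitude: delta_t ~ (m_m / m_p) * (a_m / v_p).
M_EARTH = 5.972e24        # kg
M_JUPITER = 1.898e27      # kg
R_JUPITER = 7.1492e7      # m (equatorial)
V_PLANET_1AU = 2.978e4    # m/s, circular orbital speed at 1 AU

def ttv_amplitude_s(m_moon: float, m_planet: float,
                    a_moon: float, v_planet: float) -> float:
    return (m_moon / m_planet) * (a_moon / v_planet)

dt = ttv_amplitude_s(M_EARTH, M_JUPITER, 10 * R_JUPITER, V_PLANET_1AU)
print(f"TTV amplitude: {dt:.0f} s (about {dt / 60:.1f} min)")
```

<p>A bit over a minute for this configuration, consistent with the minutes-scale claim.</p>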
<p><strong>Transit Duration Variations (TDV).</strong> The same barycentre wobble modulates
the planet&rsquo;s sky-projected velocity as it crosses the stellar disk, changing
transit duration. TDV and TTV are 90° out of phase, which lets you solve for
both moon mass and orbital radius given enough transits.</p>
<p>Neither signal is clean in practice. Stellar activity, instrument systematics,
and other planets in the system all contribute correlated noise at similar
timescales. The residuals of the best exomoon candidate to date — Kepler-1625b-i
(Teachey &amp; Kipping, 2018) — remain contested. <em>Season&rsquo;s readings: disputed.</em></p>
<hr>
<h2 id="the-simulation">The Simulation</h2>
<h3 id="what-it-actually-does">What it actually does</h3>
<p>The simulator is a TypeScript application (Vite build, runs in-browser) built
around a deterministic, SI-unit physics core. The main pipeline:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">UI parameters
</span></span><span class="line"><span class="cl">  → V4 normalisation
</span></span><span class="line"><span class="cl">  → runtime creation  (realtime | reference)
</span></span><span class="line"><span class="cl">  → orbital integrator (Kepler + N-body)
</span></span><span class="line"><span class="cl">  → geometry &amp; photometry
</span></span><span class="line"><span class="cl">  → flux decomposition
</span></span><span class="line"><span class="cl">  → canvas render + plots
</span></span></code></pre></div><p>Two runtime modes:</p>
<ul>
<li><strong>Realtime</strong> — fast integrator, interactive rendering, good for exploration</li>
<li><strong>Reference</strong> — high-fidelity integrator, deterministic export, good for
sanity-checking against known systems</li>
</ul>
<p>The photometry layer computes quadratic limb-darkened transit flux, handles
binary eclipse geometry (for eclipsing binary configurations), and exposes
hooks for phase curves and instrument noise.</p>
<p>The <strong>diagnostics layer</strong> is the part I find most useful: energy conservation
checks across the integration, radial velocity time series, astrometry, and
transit timing outputs. If your N-body integrator is drifting, the energy
plot tells you immediately.</p>
<p>The repo ships a <code>real-systems.snapshot.json</code> with versioned data from the
NASA Exoplanet Archive — so you can load, e.g., TRAPPIST-1 or HD 209458
as a starting configuration.</p>
<h3 id="what-it-deliberately-does-not-do">What it deliberately does not do</h3>
<p>The relativistic corrections are approximations. This is not a GR integrator.
For the systems it is designed for (short-period planets around Sun-like stars),
the relativistic perihelion precession is tiny — Mercury&rsquo;s 43 arcseconds per
century is the canonical example and that is already a demanding target — but
for millisecond pulsars or extremely compact binaries, do not trust it.</p>
<p>The atmospheric module exposes hooks but is not a radiative-transfer solver.
If you want realistic transmission spectra, point yourself at something like
petitRADTRANS and use this for the orbital geometry only.</p>
<hr>
<h2 id="discussion">Discussion</h2>
<p>The simulation is educational in intent — hence the built-in didactic mode
(black-box exploration → hypothesis → reveal → A/B comparison → rubric
scoring). But the physics is not dumbed down: the limb darkening is real, the
N-body integrator tracks multi-body gravitational interactions, and the TTV
outputs are computed from first principles rather than parameterised fits.</p>
<p>The thing I kept running into while building this is how much of exomoon
detection reduces to a residuals-hunting problem. You fit the best planet-only
model you can, examine the timing and duration residuals, and look for a
coherent signal. The simulator lets you inject a synthetic exomoon of specified
mass and orbital radius, generate synthetic light curves with configurable
noise, and see what the residuals look like — which is exactly the kind of
intuition-building exercise that is tedious to set up from scratch with, say,
a raw BATMAN lightcurve model and a custom integrator.</p>
<p><strong>Limitations worth being honest about.</strong> The performance budget is real:
some effects are profile-gated to keep the interactive mode responsive, which
means the reference mode exists specifically for cases where you want the full
physics at the cost of speed. For a publication-quality simulation you would
want a dedicated N-body code (REBOUND is the obvious choice), not a browser
runtime. This is a tool for understanding the problem, not for writing papers
about it — which, fitting for a Christmas project, is exactly what I have time
for right now.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Teachey, A. &amp; Kipping, D. M. (2018). <strong>Evidence for a large exomoon orbiting
Kepler-1625b.</strong> <em>Science Advances</em>, 4(10).
<a href="https://arxiv.org/abs/1810.02362">https://arxiv.org/abs/1810.02362</a></p>
</li>
<li>
<p>Kipping, D. M. (2009). <strong>Transit timing effects due to an exomoon.</strong>
<em>MNRAS</em>, 392(1), 181–189.
<a href="https://arxiv.org/abs/0810.2243">https://arxiv.org/abs/0810.2243</a></p>
</li>
<li>
<p>Mandel, K. &amp; Agol, E. (2002). <strong>Analytic light curves for planetary transit
searches.</strong> <em>ApJL</em>, 580, L171.
<a href="https://arxiv.org/abs/astro-ph/0210099">https://arxiv.org/abs/astro-ph/0210099</a></p>
</li>
<li>
<p>Claret, A. (2000). <strong>A new non-linear limb-darkening law for LTE stellar
atmosphere models.</strong> <em>A&amp;A</em>, 363, 1081–1190.</p>
</li>
</ul>
<hr>
<p><em>Merry Christmas. If you came here expecting warmth and cheer, I offer instead
a synthetic light curve with a 7 ppm exomoon signal buried in 30 ppm of
photon noise. Practically the same thing.</em></p>
<hr>
<p><em>For the physical version of this — a lamp, a ball, and a smartphone measuring
real transit light curves in a classroom — see
<a href="/posts/exoplanet-hunting-smartphones/">Hunting Exoplanets with Your Phone</a>.
For context on where those experiments came from, see
<a href="/posts/astro-lab-at-home/">The Lab Goes Home</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-03-05</strong>: Corrected the description of the limb-darkening variable from &ldquo;$\mu = \cos\theta$ is the angle from disk centre&rdquo; to &ldquo;$\mu = \cos\theta$ is the cosine of the angle from disk centre.&rdquo; $\theta$ is the angle; $\mu$ is its cosine.</li>
<li><strong>2026-03-05</strong>: Corrected Claret (2000) page range from 1081–1090 to 1081–1190. The paper contains extensive tables of limb-darkening coefficients spanning 109 pages.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Golden Bead Cube Weighs One Kilogram</title>
      <link>https://sebastianspicker.github.io/posts/bruner-montessori-ipad-embodied-learning/</link>
      <pubDate>Thu, 11 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/bruner-montessori-ipad-embodied-learning/</guid>
      <description>Bruner&amp;rsquo;s enactive stage and Montessori&amp;rsquo;s materials both understand that abstract concepts must be grounded in physical experience before symbols can carry weight. The touchscreen skips that stage entirely — and the learning data are beginning to show it.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>Jerome Bruner argued in 1964 that concepts must be acquired through three stages: enactive (bodily
action), iconic (image), symbolic (language and notation). The order is not a preference — it is a
developmental logic. Symbols that arrive before their sensorimotor grounding are thin; they may
produce correct test performance while leaving the concept unrooted.</p>
<p>Maria Montessori, working fifty years before anyone had the vocabulary of embodied cognition,
designed learning materials that implement Bruner&rsquo;s sequence with unusual precision. The Golden
Bead cube for &ldquo;one thousand&rdquo; is about the size of a large fist and weighs roughly one kilogram.
You cannot represent &ldquo;one thousand&rdquo; on a tablet screen in a way that competes with carrying that
weight across a room ten times.</p>
<p>This post is about what embodied cognition research tells us, why Montessori implements it
correctly, and what we are giving up when we substitute glass surfaces for physical materials.</p>
<h2 id="bruners-three-modes">Bruner&rsquo;s Three Modes</h2>
<p>Jerome Bruner proposed in a 1964 paper and the subsequent book <em>Toward a Theory of Instruction</em>
(<a href="#ref-bruner1964">Bruner, 1964</a>; <a href="#ref-bruner1966">1966</a>) that knowledge is represented in three
distinct, developmentally ordered modes:</p>
<p><strong>Enactive</strong>: Knowledge encoded in action patterns. You know how to ride a bicycle; you cannot
fully describe it in words; the knowledge is in your body. An infant knows what &ldquo;cup&rdquo; means
because she has grasped cups hundreds of times — before she has the word.</p>
<p><strong>Iconic</strong>: Knowledge encoded in images or perceptual representations. You can visualise the
route without navigating it. You recognize a melody without playing it.</p>
<p><strong>Symbolic</strong>: Knowledge encoded in language or other arbitrary symbol systems. The numeral &ldquo;7&rdquo;
has no visual resemblance to seven objects. Its meaning is purely conventional and rule-governed.</p>
<p>The developmental sequence matters. A child who acquires a symbol before the underlying enactive
and iconic representations are established has a label without a referent. She can produce the
word or numeral correctly — and her understanding of it is correspondingly brittle. Transfer to
novel contexts is poor; the concept does not generalise.</p>
<p>This is not a fringe view. It is the core claim of embodied cognition research, which has spent
thirty years producing experimental evidence for it.</p>
<h2 id="what-embodied-cognition-actually-shows">What Embodied Cognition Actually Shows</h2>
<p>Lawrence Barsalou&rsquo;s 2008 review in <em>Annual Review of Psychology</em> is the canonical synthesis
(<a href="#ref-barsalou2008">Barsalou, 2008</a>). The central claim: cognition is not implemented in an
abstract, modality-free computational system separate from the body. Perception, action, and
interoception are constitutive of — not merely scaffolding for — conceptual thought. When you
think about &ldquo;lifting,&rdquo; the motor cortex activates. When you think about &ldquo;rough texture,&rdquo; the
somatosensory cortex activates. Concepts are grounded in the sensorimotor systems through which
they were originally experienced.</p>
<p>This has a direct pedagogical implication. If mathematical concepts are represented using
perceptual-motor simulation systems, then the quality of that simulation depends on the richness of
the founding sensorimotor experience. A child who has handled physical objects of different weights
has richer representational resources for arithmetic and measurement than one whose entire
numerical experience has occurred on a flat, weightless, textureless glass surface.</p>
<p>Arthur Glenberg and colleagues tested this experimentally. In a 2004 study, first- and
second-graders read short texts describing farm scenes (<a href="#ref-glenberg2004">Glenberg et al., 2004</a>).
Children who physically moved toy objects (horse, barn, fence) to enact the described events showed
dramatically better comprehension and inference performance than children who merely read and
re-read the passages. The effect size approached two standard deviations in some conditions.
Children who <em>imagined</em> moving the objects also improved, but less than those who actually moved
them. The physical action was not decorative. It was causally relevant to understanding.</p>
<p>Glenberg extended this logic to arithmetic word problems (<a href="#ref-glenberg2008">Glenberg, 2008</a>).
Children who physically manipulated objects while working through problems were better at
identifying what was relevant and computing correct answers. The enactive engagement was improving
not just memory of the text but <em>mathematical reasoning</em>.</p>
<h2 id="montessori-got-there-first">Montessori Got There First</h2>
<p>Maria Montessori opened the Casa dei Bambini on 6 January 1907 in a San Lorenzo tenement in Rome,
enrolling approximately fifty children aged two to seven. She had no Barsalou. She had no Glenberg.
She had children, materials, and the patience to watch what happened when children were allowed to
choose their own work.</p>
<p>What she built was a pedagogical system that implements the Bruner sequence without exception.</p>
<p><strong>The Golden Bead Material</strong> is the canonical example. Units: single glass beads. Tens: ten beads
wired into a bar. Hundreds: ten bars wired into a flat square. Thousands: ten squares wired into a
cube. The child can hold a unit bead between two fingers. She needs two hands to lift the thousand
cube. The physical weight scales with place value. She experiences — proprioceptively — that &ldquo;one
thousand&rdquo; is categorically heavier and larger than &ldquo;one hundred&rdquo; before she has seen the numeral
or heard the word &ldquo;thousands place.&rdquo;</p>
<p><strong>The Knobbed Cylinder Blocks</strong> illustrate a different principle. Four wooden blocks, each
containing ten cylinders varying in height, diameter, or both. The child removes all cylinders and
replaces them. If any cylinder goes into the wrong socket, the remaining cylinders will not all
fit. The task cannot be completed incorrectly and left that way. Error control is mechanical,
built into the material. The teacher need not intervene. The child corrects herself, alone, through
the physical feedback of the materials.</p>
<p>Montessori called this <em>controllo dell&rsquo;errore</em> — control of error. It is one of her most
important insights: if the feedback is physical, the child internalises the standard rather than
depending on external evaluation. The authority is in the material, not in the adult&rsquo;s judgment.</p>
<p>The evidence that this works has accumulated across more than a century. Angeline Lillard and
Nicole Else-Quest published a landmark study in <em>Science</em> in 2006, using a lottery-based
design: children who had won a lottery to attend public Montessori schools
compared with those who had not (<a href="#ref-lillard2006">Lillard &amp; Else-Quest, 2006</a>). Montessori
five-year-olds showed significantly higher letter-word identification, phonological decoding, and
applied mathematical problem-solving. The lottery controlled for family self-selection.</p>
<p>A 2025 national randomised controlled trial — 588 children across 24 public Montessori schools,
with lottery-based assignment — found significant advantages in reading, short-term memory,
executive function, and social understanding at the end of kindergarten, with effect sizes
exceeding 0.2 SD (<a href="#ref-lillard2025">Lillard et al., 2025</a>). These are not small effects for
field-based school research. And the cost per child was lower than in conventional programmes.</p>
<h2 id="korczak-and-the-right-to-make-mistakes">Korczak and the Right to Make Mistakes</h2>
<p>Janusz Korczak ran an orphanage in Warsaw and wrote <em>How to Love a Child</em> in 1919
(<a href="#ref-korczak1919">Korczak, 1919</a>) and <em>The Child&rsquo;s Right to Respect</em> in 1929
(<a href="#ref-korczak1929">Korczak, 1929</a>). His central argument was that children are not pre-adults —
they are persons with full moral status and a right to their own experience, including the
experience of making mistakes.</p>
<p>In August 1942 German soldiers came to his orphanage. Korczak was offered false papers, safe
houses, multiple escape routes arranged by friends and admirers. He refused each time. He led
approximately 192 children and staff to the Umschlagplatz and did not return.</p>
<p>I mention Korczak not as an appeal to emotion but because his argument is structurally connected
to Montessori&rsquo;s. If a child has moral status, she has the right to encounter the actual
consequences of her choices — including physical ones. A material that makes incorrect placement
physically impossible before the child has had the experience of trying and correcting is a
different kind of education from a screen that prevents error altogether through invisible software
constraints, or one that simply supplies the correct answer.</p>
<p>Error is information. Physical error is particularly rich information. Taking it away is not
protection — it is impoverishment.</p>
<h2 id="buber-what-a-screen-cannot-offer">Buber: What a Screen Cannot Offer</h2>
<p>Martin Buber&rsquo;s essay &ldquo;Education,&rdquo; delivered as an address in 1925 and published in <em>Between Man
and Man</em> (<a href="#ref-buber1947">Buber, 1947</a>), argues that genuine education requires what he calls an
I-Thou relation: an encounter in which the other is met as a whole, irreducible subject, not an
object to be managed.</p>
<p>A touchscreen is the paradigmatic I-It relation. It is smooth, frictionless, optimised for
engagement, responsive to exactly the touch it was designed to respond to. There is no otherness,
no resistance, no genuine encounter. The screen does not push back. The Knobbed Cylinder Block
does — literally. If you try to force a cylinder into the wrong socket, the material resists. That
resistance is not a flaw in the pedagogical design; it is the pedagogical design.</p>
<p>Buber also introduced the concept of <em>Umfassung</em> — inclusion — by which a teacher must
simultaneously stand at their own pole of the educational encounter and imaginatively experience
the pupil&rsquo;s side. A screen cannot do this. It has no pole. Its responsiveness is a simulation of
attention, not attention itself. Turkle&rsquo;s later phrase — &ldquo;simulated empathy is not empathy&rdquo; — is
the same argument in a different register.</p>
<h2 id="the-tablet-problem">The Tablet Problem</h2>
<p>The educational technology industry has produced an enormous quantity of &ldquo;educational apps&rdquo; for
young children. The research is beginning to catch up.</p>
<p>Kathy Hirsh-Pasek and colleagues identified four pillars that distinguish educational from merely
entertaining digital content: active engagement, depth of engagement, meaningful learning, and
social interactivity (<a href="#ref-hirshpasek2015">Hirsh-Pasek et al., 2015</a>). Reviewing commercially
available apps, they found that most fail on three or four of these criteria. They produce
interactions in the shallow sense — tapping, swiping — without the kind of self-directed,
goal-oriented, socially-embedded activity that drives genuine cognitive development.</p>
<p>A 2021 meta-analysis of 36 intervention studies found that educational apps produced meaningful
gains when measured by researcher-developed instruments targeting constrained skills (letter
naming, counting), but small to negligible effects on standardised achievement tests
(<a href="#ref-kim2021">Kim et al., 2021</a>). The apps teach what they teach. Transfer is limited.</p>
<p>By contrast, a 2023 scoping review of 102 studies found that physical manipulatives — block
building, shape sorting, paper folding, figurine play — showed consistent benefits across
mathematics, literacy, and science that transferred to standardised measures
(<a href="#ref-byrne2023">Byrne et al., 2023</a>).</p>
<p>The fundamental problem is haptic. A 2024 review of haptic technology in learning found that force
feedback and texture information substantially improve spatial reasoning, interest, and analytical
ability (<a href="#ref-hatira2024">Hatira &amp; Sarac, 2024</a>). Standard capacitive touchscreens — every
tablet your child has encountered — provide no force feedback and no texture differentiation.
Every object, regardless of its symbolic &ldquo;weight&rdquo; or &ldquo;size,&rdquo; feels identical under the fingertip.</p>
<p>The Golden Bead thousand cube weighs approximately one kilogram. You cannot represent that
experience on a tablet. The symbol arrives without the sensation, and Bruner&rsquo;s sequence is
violated from the first tap.</p>
<h2 id="what-we-should-ask">What We Should Ask</h2>
<p>The question is not whether tablets have educational uses — they clearly do, particularly for
older children working at the iconic and symbolic levels, and for content where direct physical
manipulation is impossible or dangerous. The question is whether we are using them in
developmental contexts where the enactive stage has not yet been established.</p>
<p>A child who has carried the thousand cube across a room, stacked the hundreds into the square, and
felt the weight difference in her hands has a different representation of place value from one who
has tapped numerals on a flat screen. Both may perform identically on a constrained test tomorrow.
Ask them a transfer question in six months and the difference will appear.</p>
<p>We are teaching children to operate symbols before giving them the physical experiences that make
those symbols mean anything. The result is not ignorance — the children can tap the correct numeral
— but brittleness. The concept is a label, not a root.</p>
<p>Montessori knew this. Bruner formalised it. The haptics literature is now confirming it
experimentally. The difficult question is why we are still buying flat glass rectangles for
classrooms when a box of wooden cylinders costs less and works better.</p>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-bruner1964"></span>Bruner, J. S. (1964). The course of cognitive growth. <em>American Psychologist</em>, 19(1), 1–15.</li>
<li><span id="ref-bruner1966"></span>Bruner, J. S. (1966). <em>Toward a Theory of Instruction</em>. Harvard University Press (Belknap Press).</li>
<li><span id="ref-barsalou2008"></span>Barsalou, L. W. (2008). Grounded cognition. <em>Annual Review of Psychology</em>, 59, 617–645. <a href="https://doi.org/10.1146/annurev.psych.59.103006.093639">DOI: 10.1146/annurev.psych.59.103006.093639</a></li>
<li><span id="ref-glenberg2004"></span>Glenberg, A. M., Gutierrez, T., Levin, J. R., Japuntich, S., &amp; Kaschak, M. P. (2004). Activity and imagined activity can enhance young children&rsquo;s reading comprehension. <em>Journal of Educational Psychology</em>, 96(3), 424–436. <a href="https://doi.org/10.1037/0022-0663.96.3.424">DOI: 10.1037/0022-0663.96.3.424</a></li>
<li><span id="ref-glenberg2008"></span>Glenberg, A. M. (2008). Embodiment for education. In P. Calvo &amp; A. Gomila (Eds.), <em>Handbook of Cognitive Science: An Embodied Approach</em> (pp. 355–371). Elsevier.</li>
<li><span id="ref-lillard2006"></span>Lillard, A. S., &amp; Else-Quest, N. (2006). The early years: Evaluating Montessori education. <em>Science</em>, 313(5795), 1893–1894. <a href="https://doi.org/10.1126/science.1132362">DOI: 10.1126/science.1132362</a></li>
<li><span id="ref-lillard2025"></span>Lillard, A. S., Loeb, D., Berg, J., Escueta, M., Manship, K., Hauser, A., &amp; Daggett, E. D. (2025). A national randomized controlled trial of the impact of public Montessori preschool at the end of kindergarten. <em>Proceedings of the National Academy of Sciences</em>, 122(43). <a href="https://doi.org/10.1073/pnas.2506130122">DOI: 10.1073/pnas.2506130122</a></li>
<li><span id="ref-korczak1919"></span>Korczak, J. (1919). <em>Jak kochać dziecko</em> [How to Love a Child]. Warsaw.</li>
<li><span id="ref-korczak1929"></span>Korczak, J. (1929). <em>Prawo dziecka do szacunku</em> [The Child&rsquo;s Right to Respect]. Warsaw.</li>
<li><span id="ref-buber1947"></span>Buber, M. (1947). <em>Between Man and Man</em> (trans. R. G. Smith). Kegan Paul. (Original German publication 1947; contains &ldquo;Education,&rdquo; address delivered 1925, and &ldquo;The Education of Character,&rdquo; address delivered 1939.)</li>
<li><span id="ref-hirshpasek2015"></span>Hirsh-Pasek, K., Zosh, J. M., Golinkoff, R. M., Gray, J. H., Robb, M. B., &amp; Kaufman, J. (2015). Putting education in &ldquo;educational&rdquo; apps: Lessons from the science of learning. <em>Psychological Science in the Public Interest</em>, 16(1), 3–34. <a href="https://doi.org/10.1177/1529100615569721">DOI: 10.1177/1529100615569721</a></li>
<li><span id="ref-kim2021"></span>Kim, J. S., Gilbert, J., Yu, Q., &amp; Gale, C. (2021). Measures matter: A meta-analysis of the effects of educational apps on preschool to grade 3 children&rsquo;s literacy and math skills. <em>AERA Open</em>, 7. <a href="https://doi.org/10.1177/23328584211004183">DOI: 10.1177/23328584211004183</a></li>
<li><span id="ref-byrne2023"></span>Byrne, E. M., Jensen, H., Thomsen, B. S., &amp; Ramchandani, P. G. (2023). Educational interventions involving physical manipulatives for improving children&rsquo;s learning and development: A scoping review. <em>Review of Education</em>, 11(2), e3400. <a href="https://doi.org/10.1002/rev3.3400">DOI: 10.1002/rev3.3400</a></li>
<li><span id="ref-hatira2024"></span>Hatira, A., &amp; Sarac, M. (2024). Touch to learn: A review of haptic technology&rsquo;s impact on skill development and enhancing learning abilities for children. <em>Advanced Intelligent Systems</em>, 6. <a href="https://doi.org/10.1002/aisy.202300731">DOI: 10.1002/aisy.202300731</a></li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-02-03</strong>: Changed &ldquo;lottery-based quasi-experimental design&rdquo; to &ldquo;lottery-based design&rdquo; for Lillard &amp; Else-Quest (2006). A lottery provides genuine random assignment; &ldquo;quasi-experimental&rdquo; implies the absence of randomisation, which is the opposite of what the lottery design achieved.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months leading up to this post, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the argmax token sequence. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>The variance does not merely add — it propagates through the dependencies.
Early outputs condition later prompts; a small deviation at step 2 can
shift the trajectory of step 5 substantially. This is not a theoretical
concern. It is the practical experience of anyone who has tried to reproduce
a multi-step agent run.</p>
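<p>A toy numerical experiment makes the compounding concrete. This is an illustration of the error model, not the Ralph Loop itself: each step feeds the previous output into the next, and a gain above one stands in for the way a small early deviation shifts later prompts:</p>

```python
import random
import statistics

random.seed(0)
GAIN, SIGMA, STEPS, TRIALS = 1.1, 1.0, 5, 20_000

def chained_run(k: int) -> float:
    # o_j = GAIN * o_{j-1} + eps_j: each step is conditioned on the last.
    o = 0.0
    for _ in range(k):
        o = GAIN * o + random.gauss(0.0, SIGMA)
    return o

var_chained = statistics.variance(chained_run(STEPS) for _ in range(TRIALS))
var_independent = STEPS * SIGMA ** 2   # variance if the k draws merely added
print(f"chained: {var_chained:.2f}  vs  additive baseline: {var_independent:.2f}")
```

<p>The chained variance exceeds the additive baseline (analytically $\sigma^2 (g^{2k} - 1)/(g^2 - 1) \approx 7.6$ versus $5$ for these toy parameters), which is the sense in which the variance propagates through the dependencies rather than merely adding.</p>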
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
exactly one position. The loop continues until no open stories remain or
a story limit $N$ is reached.</p>
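<p>The transition function is small enough to sketch in full. The names below (<code>build_prompt</code>, <code>call_tool</code>, <code>accepted</code>) are hypothetical stand-ins, not the actual Ralph Loop API; the real implementation adds the file lock and atomic status writes:</p>

```python
from dataclasses import dataclass

@dataclass
class Story:
    id: str
    priority: int
    status: str = "open"   # open | passing | skipped
    attempts: int = 0

def run_loop(queue, build_prompt, call_tool, accepted, max_attempts=3):
    # Fixed ordering makes selection deterministic: priority, then ID.
    queue.sort(key=lambda s: (s.priority, s.id))
    while True:
        candidates = [s for s in queue
                      if s.status == "open" and s.attempts < max_attempts]
        if not candidates:
            return queue
        story = candidates[0]            # 1. Select: first open story
        prompt = build_prompt(story)     # 2. Build: pure function of the story
        output = call_tool(prompt)       # 3. Execute: exactly one tool call
        if accepted(output):             # 4. Accept: parse the criterion
            story.status = "passing"     # 5. Commit (locked + atomic in the real system)
        else:
            story.attempts += 1
```

<p>Each iteration either closes a story or increments a bounded attempt counter, which is where the $n \cdot A_{\max}$ termination bound comes from.</p>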
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. After five tool calls,
the prompt for the fifth includes four previous outputs — each of which
varied slightly from the last run. The fifth output is now conditioned on
a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
<p>$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] \;+\; \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$</p>
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
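<p>The compounding is easy to see in a toy Monte Carlo model: treat each &ldquo;tool call&rdquo; as the previous output plus fresh noise. This is a deliberately crude stand-in for $T$, not a model of any real agent:</p>

```python
import random, statistics

random.seed(0)

def chain(k, sigma=1.0):
    """k sequential 'tool calls', each output = previous output + fresh noise."""
    o = 0.0
    for _ in range(k):
        o += random.gauss(0.0, sigma)   # a crude stand-in for a draw from T
    return o

runs = 20000
var_1 = statistics.pvariance([chain(1) for _ in range(runs)])
var_5 = statistics.pvariance([chain(5) for _ in range(runs)])
# Independent per-step noise adds across the chain: Var(o_5) ~ 5 * Var(o_1).
```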
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \setminus F_{\text{before}}$ must
satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the file system level, after the fact, and that
enforcement cannot be argued with.</p>
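<p>The check itself is a small set computation. A sketch with in-memory contents standing in for a real working tree (names hypothetical):</p>

```python
import hashlib

def snapshot(files):
    """Map each path to a hash of its contents (files: path -> content string)."""
    return {p: hashlib.sha256(c.encode()).hexdigest() for p, c in files.items()}

def in_scope(path, scope):
    """True if path equals a scope entry or lies under a scope directory."""
    return any(path == s or path.startswith(s.rstrip("/") + "/") for s in scope)

def scope_violations(before, after, scope):
    """Paths whose hash changed (or appeared) outside the story's scope[]."""
    delta = {p for p, h in after.items() if before.get(p) != h}
    return sorted(p for p in delta if not in_scope(p, scope))
```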
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>A Gas at Temperature T: Xenakis and the Physics of Stochastic Music</title>
      <link>https://sebastianspicker.github.io/posts/xenakis-stochastic-music/</link>
      <pubDate>Tue, 14 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/xenakis-stochastic-music/</guid>
      <description>Iannis Xenakis applied the Maxwell-Boltzmann velocity distribution, Markov chains, and game theory to orchestral composition. In Pithoprakta (1955–56), 46 string parts are molecules of a gas, each following the kinetic theory distribution. In Duel and Stratégie (1959–62), two conductors play a zero-sum game with payoff matrices on stage. This post works through the physics and mathematics, and asks what it means when a composer treats an orchestra as a thermodynamic system.</description>
      <content:encoded><![CDATA[<p><em>Iannis Xenakis (1922–2001) was trained as a civil engineer at the Athens
Polytechnic, joined the Greek Resistance during the Second World War and the
subsequent Greek Civil War, survived a British army tank shell in January 1945
that cost him the sight in his left eye and part of his jaw, was sentenced to
death in absentia by the Greek military government, fled to Paris in 1947, and
worked for twelve years as an architect in Le Corbusier&rsquo;s atelier — where he
contributed structural engineering to the Unité d&rsquo;Habitation in Marseille and
designed the Philips Pavilion for Expo 58. In parallel, already in his thirties,
he taught himself composition — approaching Honegger (who was too ill to teach) and then studying with Messiaen
— and became one of the central figures of the post-war avant-garde. I mention
the biography not as background colour but because it bears on the physics. A
person who has been through what Xenakis had been through by 1950 is not likely
to be intimidated by the kinetic theory of gases.</em></p>
<p><em>He was not. In 1955–56 he composed</em> Pithoprakta <em>— &ldquo;actions through
probability&rdquo; — for 46 strings, each of which is, in his own account, a
molecule of an ideal gas. This post works through the mathematics he
used and asks what it means when a composer takes statistical mechanics
seriously as a compositional tool.</em></p>
<hr>
<h2 id="the-problem-with-post-war-serialism">The Problem with Post-War Serialism</h2>
<p>To understand why Xenakis did what he did, it helps to know what everyone
else was doing. By the early 1950s, the dominant tendency in European
new music was total serialism: the systematic extension of Schoenberg&rsquo;s
twelve-tone technique to rhythm, dynamics, articulation, and register. Every
parameter of every note was determined by a series. Messiaen had sketched
this direction in <em>Mode de valeurs et d&rsquo;intensités</em> (1949); Boulez and
Stockhausen had taken it to its logical extreme.</p>
<p>The result, as Xenakis observed with characteristic bluntness in <em>Formalized
Music</em> (1963/1992), was a kind of sonic indistinguishability: because every
parameter varied according to independent deterministic series, the textures
produced by total serialism sounded essentially like random noise. The
maximum of local determinism had produced the appearance of global chaos.</p>
<p>His diagnosis was precise and, I think, correct: if the perceptual result of
maximum determinism and maximum randomness is the same, then the path forward
is not to find a better deterministic scheme but to embrace randomness
explicitly, at the level that governs the <em>macroscopic</em> structure. Control the
distribution; let the individual events vary within it. This is exactly what
statistical mechanics does for a gas: it does not track every molecule, but
it knows with great precision what the distribution of velocities will be.</p>
<hr>
<h2 id="statistical-mechanics-in-brief">Statistical Mechanics in Brief</h2>
<p>In a classical ideal gas of $N$ molecules at thermal equilibrium with
temperature $T$, the molecules move in all directions with speeds distributed
according to the Maxwell-Boltzmann speed distribution:</p>
$$f(v) = \sqrt{\frac{2}{\pi}}\, \frac{v^2}{a^3}\, \exp\!\left(-\frac{v^2}{2a^2}\right), \qquad a = \sqrt{\frac{k_B T}{m}},$$<p>where $m$ is the molecular mass and $k_B$ is Boltzmann&rsquo;s constant. The
parameter $a$ sets the characteristic speed scale: it grows with temperature
(hotter gas means faster molecules) and shrinks with molecular mass (heavier
molecules move more slowly at the same temperature).</p>
<p>The distribution has a characteristic shape: it rises as $v^2$ for small
speeds (few molecules are nearly stationary), peaks at the most probable
speed $v_p = a\sqrt{2}$, and falls off as $e^{-v^2/2a^2}$ for large speeds
(very fast molecules are exponentially rare). The three characteristic
speeds are:</p>
$$v_p = a\sqrt{2}, \qquad \langle v \rangle = a\sqrt{\tfrac{8}{\pi}}, \qquad v_\mathrm{rms} = a\sqrt{3}.$$<p>No individual molecule is tracked. The distribution is everything: once you
know $f(v)$, you know all macroscopic properties of the gas — pressure,
mean kinetic energy, thermal conductivity — without knowing the trajectory
of a single molecule. The individual is sacrificed to the ensemble.</p>
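<p>The distribution and its characteristic speeds can be checked numerically. A short sketch with $a = 1$ in arbitrary units, integrating by the trapezoidal rule:</p>

```python
import math

def f(v, a):
    """Maxwell-Boltzmann speed density; a = sqrt(k_B * T / m)."""
    return math.sqrt(2 / math.pi) * (v**2 / a**3) * math.exp(-v**2 / (2 * a**2))

a = 1.0
n, vmax = 20000, 12.0 * a               # the density is negligible beyond ~12a
h = vmax / n
grid = [i * h for i in range(n + 1)]

def integrate(g):
    """Trapezoidal rule on the grid: endpoints get half weight."""
    return h * (0.5 * g(grid[0]) + sum(g(v) for v in grid[1:-1]) + 0.5 * g(grid[-1]))

total = integrate(lambda v: f(v, a))       # normalisation: should be ~1
mean = integrate(lambda v: v * f(v, a))    # should be ~a * sqrt(8/pi)
v_p = max(grid[1:], key=lambda v: f(v, a)) # should be ~a * sqrt(2)
```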
<hr>
<h2 id="pithoprakta-and-the-orchestra-as-gas"><em>Pithoprakta</em> and the Orchestra as Gas</h2>
<p>In <em>Pithoprakta</em> (1955–56), Xenakis assigns each of the 46 string instruments
to a molecule of a gas. The musical analogue of molecular speed is the
<em>velocity of a glissando</em>: the rate at which a glissando moves through
pitch, measured in semitones per second. Slow glissandi are cold molecules;
fast glissandi are hot ones.</p>
<p>For a given passage with a specified musical &ldquo;temperature&rdquo; (an
intensity-and-density parameter he could set as a compositional choice),
the 46 glissando speeds are drawn from the Maxwell-Boltzmann distribution
for that temperature. No two strings play the same glissando at the same
speed. The effect, to a listener, is a dense sound-mass — a shimmer or
a roar — whose internal texture varies but whose overall character (the
temperature, the density) is under the composer&rsquo;s control at exactly the
level that matters perceptually.</p>
<p>Xenakis worked out the velocities numerically by hand. The score of
<em>Pithoprakta</em> was among the first in which the individual parts were derived
from a statistical distribution rather than from a melody, a row, or an
improvisation instruction. The calculation is tedious but not difficult:
for each time window, choose a temperature, compute $f(v)$ for the 46
values of $v$ that tile the distribution, and assign one speed to each
instrument.</p>
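<p>The tiling step can be reproduced from the distribution&rsquo;s CDF: place the 46 speeds at the quantiles $(i - \tfrac{1}{2})/46$ and invert numerically. A sketch; the scale $a$, in semitones per second, is an invented value, not one taken from the score:</p>

```python
import math

def mb_cdf(v, a):
    """Maxwell-Boltzmann speed CDF for scale parameter a."""
    return (math.erf(v / (math.sqrt(2) * a))
            - math.sqrt(2 / math.pi) * (v / a) * math.exp(-v**2 / (2 * a**2)))

def tile_speeds(n, a):
    """n speeds at the quantiles (i - 0.5)/n, found by bisection on the CDF."""
    speeds = []
    for i in range(1, n + 1):
        q = (i - 0.5) / n
        lo, hi = 0.0, 20.0 * a          # the CDF is ~1 well before 20a
        for _ in range(80):             # bisection to high precision
            mid = 0.5 * (lo + hi)
            if mb_cdf(mid, a) < q:
                lo = mid
            else:
                hi = mid
        speeds.append(0.5 * (lo + hi))
    return speeds

a = 4.0                                 # hypothetical scale, semitones/sec
speeds = tile_speeds(46, a)             # one glissando speed per string part
mean_speed = sum(speeds) / len(speeds)  # close to the ensemble mean a*sqrt(8/pi)
```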
<p>The connection between macroscopic structure and microscopic liberty is
deliberately preserved. The shape of the sound-mass — its brightness,
its turbulence, its rate of change — is controlled. Each individual line
is unpredictable. This is, structurally, the same trade-off that makes
thermodynamics work: you give up on the individual trajectory and gain
exact knowledge of the aggregate.</p>
<hr>
<h2 id="musical-temperature-as-a-compositional-parameter">Musical Temperature as a Compositional Parameter</h2>
<p>The analogy is worth making precise. In the physical gas, raising the
temperature $T$ increases $a = \sqrt{k_B T / m}$, which shifts the
peak of $f(v)$ to the right and widens the distribution. More molecules
have high speeds; the variance of speeds increases.</p>
<p>In <em>Pithoprakta</em>, raising the musical &ldquo;temperature&rdquo; has the same
effect: more instruments perform rapid glissandi; the pitch-space
trajectories are more varied; the texture becomes more active and
more turbulent. Lowering the temperature concentrates the glissando
speeds near zero — slow motion, near-stasis, long sustained tones
that change pitch only gradually. The orchestra cools.</p>
<p>This mapping is not metaphorical. Xenakis computed it. The score
contains numerically derived glissando speeds; the connection between the
perceptual temperature of the texture and the statistical parameter $T$ is
quantitative. When musicians speak of a passage &ldquo;heating up,&rdquo; they are
usually using a figure of speech. In <em>Pithoprakta</em>, they are describing
a thermodynamic fact.</p>
<hr>
<h2 id="the-poisson-distribution-and-event-density">The Poisson Distribution and Event Density</h2>
<p><em>Pithoprakta</em> uses a second physical model alongside the Maxwell-Boltzmann
distribution: the Poisson process, which governs the density of
independent, randomly occurring events.</p>
<p>If musical events (pizzicato attacks, bow changes, individual note entries)
occur at a mean rate of $\lambda$ events per second, the probability of
exactly $k$ events occurring in a time window of length $T$ is:</p>
$$P(N = k) = \frac{(\lambda T)^k\, e^{-\lambda T}}{k!}.$$<p>The Poisson distribution has a single parameter $\lambda$ that controls
both the mean and the variance (they are equal: $\langle N \rangle =
\mathrm{Var}(N) = \lambda T$). A high $\lambda$ produces a dense cluster
of events; a low $\lambda$ produces sparse, widely spaced events.</p>
<p>Xenakis used this to control the density of pizzicato attacks independently
of the glissando texture. A passage can be cool (slow glissandi) and dense
(many pizzicati), or hot and sparse, or any combination. The two
distributions operate on independent musical parameters — pitch motion and
event density — giving the composer a two-dimensional thermodynamic control
space over the texture.</p>
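<p>The distribution is trivial to compute. A sketch with an invented rate, say eight pizzicato attacks per second over a two-second window, using the stable recurrence $p_k = p_{k-1} \cdot \mu / k$:</p>

```python
import math

def poisson_pmf_list(lam, T, kmax):
    """P(N = k) for k = 0..kmax, built recursively: p_k = p_{k-1} * mu / k."""
    mu = lam * T
    p = math.exp(-mu)
    out = [p]
    for k in range(1, kmax + 1):
        p *= mu / k
        out.append(p)
    return out

lam, T = 8.0, 2.0                       # invented: 8 attacks/sec, 2-second window
probs = poisson_pmf_list(lam, T, 100)
mean = sum(k * p for k, p in enumerate(probs))
var = sum(k * k * p for k, p in enumerate(probs)) - mean**2
# Mean and variance both equal lam * T = 16, the single-parameter signature.
```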
<hr>
<h2 id="markov-chains-analogique-a-and-analogique-b">Markov Chains: <em>Analogique A</em> and <em>Analogique B</em></h2>
<p>In <em>Analogique A</em> (for string orchestra, 1958–59) and its companion
<em>Analogique B</em> (for sinusoidal tones, same year), Xenakis moved to a
different stochastic framework: Markov chains.</p>
<p>A Markov chain is a sequence of states where the probability of
transitioning to the next state depends only on the current state. The
chain is specified by a transition matrix $P$, where $P_{ij}$ is the
probability of moving from state $i$ to state $j$:</p>
$$P_{ij} \geq 0, \qquad \sum_j P_{ij} = 1 \quad \forall\, i.$$<p>Under mild conditions (irreducibility and aperiodicity), the chain
converges to a unique stationary distribution $\pi$ satisfying:</p>
$$\pi P = \pi, \qquad \sum_i \pi_i = 1.$$<p>The convergence is geometric: if $\lambda_2$ is the second-largest eigenvalue
of $P$ in absolute value, then after $n$ steps the distribution $\pi^{(n)}$
satisfies $\|\pi^{(n)} - \pi\| \leq C |\lambda_2|^n$ for some constant $C$.
The gap $1 - |\lambda_2|$ — the <em>spectral gap</em> — controls how quickly the
chain forgets its initial state. A transition matrix with a large spectral
gap produces rapid convergence; one with $|\lambda_2| \approx 1$ produces
long-memory dependence between distant states. This is a compositional
choice: the spectral gap determines how quickly a piece&rsquo;s texture changes
character.</p>

<p>In <em>Analogique A</em>, Xenakis divided the sonic space into a grid of
cells defined by pitch register (high/middle/low), density
(sparse/medium/dense), and dynamic (soft/loud). Each &ldquo;screen&rdquo; — a brief
time window — occupies one cell in this grid. The progression of screens
through the piece is governed by transition probabilities: from a
high/dense/loud screen, there is some probability of moving to each
adjacent cell, specified by Xenakis&rsquo;s chosen transition matrix.</p>
<p>This is a Markov chain on a discrete state space of sonic textures. The
macroscopic trajectory of the piece — its overall movement through
sound-quality space — is determined by the transition matrix, which the composer
sets. The details of each screen are filled in stochastically, within the
parameters of the current state. Again, the individual is sacrificed to the
aggregate; control is exercised at the level of the distribution rather
than the event.</p>
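<p>The mechanics are easy to demonstrate. The sketch below uses an invented $3 \times 3$ transition matrix over three coarse density states (sparse, medium, dense), not Xenakis&rsquo;s actual matrix, and shows the chain forgetting its initial state:</p>

```python
def step(dist, P):
    """One step of the chain: dist <- dist . P (P is row-stochastic)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Invented transition matrix over texture states: sparse, medium, dense.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]

# Two runs from opposite initial states converge to the same stationary pi:
# the spectral gap makes the chain forget where it started.
d_sparse, d_dense = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
for _ in range(200):
    d_sparse, d_dense = step(d_sparse, P), step(d_dense, P)
```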
<hr>
<h2 id="game-theory-duel-and-stratégie">Game Theory: <em>Duel</em> and <em>Stratégie</em></h2>
<p>The most extreme and, to my mind, most interesting of Xenakis&rsquo;s
formalisations is the use of game theory in <em>Duel</em> (1959) and <em>Stratégie</em>
(1962).</p>
<p>A <strong>two-player zero-sum game</strong> is specified by a payoff matrix $A \in
\mathbb{R}^{m \times n}$. Player 1 (the &ldquo;maximiser&rdquo;) chooses a row $i$;
Player 2 (the &ldquo;minimiser&rdquo;) chooses a column $j$; Player 1 receives payoff
$A_{ij}$ and Player 2 receives $-A_{ij}$. In a pure-strategy game, each
player selects a single action. In a <strong>mixed-strategy game</strong>, each player
chooses a probability distribution over their actions: Player 1 uses
$\mathbf{x} \in \Delta_m$ and Player 2 uses $\mathbf{y} \in \Delta_n$,
where $\Delta_k$ denotes the standard $(k-1)$-simplex.</p>
<p>The expected payoff to Player 1 under mixed strategies is:</p>
$$E(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top A\, \mathbf{y}.$$<p>Von Neumann&rsquo;s minimax theorem (1928) guarantees that:</p>
$$\max_{\mathbf{x} \in \Delta_m} \min_{\mathbf{y} \in \Delta_n}
\mathbf{x}^\top A\, \mathbf{y}
\;=\;
\min_{\mathbf{y} \in \Delta_n} \max_{\mathbf{x} \in \Delta_m}
\mathbf{x}^\top A\, \mathbf{y}
\;=\; v^*,$$<p>where $v^*$ is the <strong>value</strong> of the game. The pair $(\mathbf{x}^*,
\mathbf{y}^*)$ that achieves this saddle point is the Nash equilibrium:
neither player can improve their expected payoff by unilaterally deviating
from their equilibrium strategy.</p>
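<p>For small matrices the equilibrium can be approximated by fictitious play: each player repeatedly best-responds to the opponent&rsquo;s empirical mixture of past plays, and in zero-sum games the empirical frequencies converge to a minimax pair (a classical result of Julia Robinson). A sketch on matching pennies, where the known equilibrium is $(\tfrac{1}{2}, \tfrac{1}{2})$ with value $0$:</p>

```python
# Fictitious play on a 2x2 zero-sum game (matching pennies).
A = [[1, -1],
     [-1, 1]]             # payoff to the row player (the maximiser)

row_counts = [1, 0]        # arbitrary initial plays
col_counts = [1, 0]
for _ in range(20000):
    # Row player's expected payoff for each row vs the column history.
    row_val = [sum(A[i][j] * col_counts[j] for j in range(2)) for i in range(2)]
    i_star = max(range(2), key=lambda i: row_val[i])
    # Column player's (minimising) payoff for each column vs the row history.
    col_val = [sum(A[i][j] * row_counts[i] for i in range(2)) for j in range(2)]
    j_star = min(range(2), key=lambda j: col_val[j])
    row_counts[i_star] += 1
    col_counts[j_star] += 1

x = [c / sum(row_counts) for c in row_counts]   # empirical row mixture
y = [c / sum(col_counts) for c in col_counts]   # empirical column mixture
value = sum(x[i] * A[i][j] * y[j] for i in range(2) for j in range(2))
```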
<p>In <em>Stratégie</em>, each conductor leads one orchestra. Each has nineteen
&ldquo;tactics&rdquo; — six basic musical textures (e.g., sustained chords, staccato
pizzicati, glissandi masses, silence) plus thirteen combinatorial tactics
that combine two or three of the basics. The payoff matrix is a
$19 \times 19$ integer matrix, also defined by Xenakis, specifying how
many points Conductor 1 scores when their orchestra plays tactic $i$ against
Conductor 2&rsquo;s tactic $j$. A referee tracks the score.</p>
<p>The conductors make decisions in real time during the performance, choosing
tactics based on what the other conductor is doing and on the evolving score.
The piece ends when one conductor reaches a predetermined score threshold.</p>
<p>The Nash equilibrium of the payoff matrix tells each conductor, in principle,
the optimal <em>distribution</em> over tactics to play: if both play optimally, the
expected score trajectory is determined. In practice, conductors are not
expected to compute mixed strategies on the podium; Xenakis&rsquo;s point is
structural. The game-theoretic formalism is used to design the payoff matrix
so that no tactic dominates — every choice has consequences that depend on
the opponent&rsquo;s choice — guaranteeing that the piece will always contain
genuine strategic tension regardless of who is conducting.</p>
<p><em>Duel</em> (1959) is the earlier, simpler version for two chamber orchestras.
<em>Stratégie</em> (1962) was premiered in April 1963 at the Venice Biennale with two conductors
competing live. The audience was aware of the game, of the score, and of
the payoff matrix. The premiere was by most accounts a success, though the
practical complications of running a zero-sum game in a concert hall
(including the question of whether conductors were actually computing Nash
equilibria or just following intuition) were never fully resolved.</p>
<hr>
<h2 id="formalized-music"><em>Formalized Music</em></h2>
<p>Xenakis assembled his theoretical framework in <em>Musiques formelles</em> (1963),
translated and expanded as <em>Formalized Music</em> (1971; revised edition 1992).
The book is one of the strangest documents in twentieth-century music theory:
part treatise, part manifesto, part mathematical appendix. It covers
stochastic composition, Markov chains, game theory, set theory, group theory,
and symbolic logic — all presented with the confidence of someone who is
equally at home in the engineering faculty and the concert hall, and with
the occasional obscurity of someone writing simultaneously for two audiences
who share almost no vocabulary.</p>
<p>The core argument is that musical composition can and should be treated as
the application of mathematical structures to sonic material, not because
mathematics makes music &ldquo;better&rdquo; but because mathematical structures are
the most powerful available tools for controlling relationships between
sounds at multiple scales simultaneously. The statistical distributions
control the macroscopic; the individual values vary within them. The
game-theoretic payoff matrix controls the strategic interaction; the individual
tactics fill in the details. Mathematics operates at the structural level
and leaves the acoustic surface free.</p>
<p>This is a different relationship between mathematics and music from the
ones in my earlier posts on <a href="/posts/messiaen-modes-group-theory/">group theory and Messiaen</a>
or <a href="/posts/euclidean-rhythms/">the Euclidean algorithm and world rhythms</a>.
In those cases, mathematics describes structure that already exists in the
music — structure the composers arrived at by ear. In Xenakis, mathematics
is the generative tool: the score is derived from the calculation.</p>
<hr>
<h2 id="what-the-analogy-does-and-does-not-do">What the Analogy Does and Does Not Do</h2>
<p>The Maxwell-Boltzmann analogy in <em>Pithoprakta</em> is exact in one direction
and approximate in another.</p>
<p>It is exact in the following sense: the glissando speeds Xenakis computed
for his 46 strings genuinely follow the Maxwell-Boltzmann distribution with
the parameters he chose. The score is a realisation of that distribution.
If you collect the glissando speeds from the score and plot their histogram,
you will find the characteristic $v^2 e^{-v^2/2a^2}$ shape.</p>
<p>It is approximate — or rather, it is analogical — in the sense that strings
in an orchestra are not molecules of a gas. They do not collide. They have
mass and inertia in a physical sense that has no direct mapping to
musical parameters. The temperature $T$ is not a temperature in any
thermodynamic sense; it is a compositional variable that Xenakis chose to
parameterise with the same symbol because the formal relationship is the
same. The analogy is structural, not ontological.</p>
<p>This is worth saying plainly because it is easy to be misled in both
directions: either to over-claim (the orchestra <em>is</em> a gas) or to dismiss
(the orchestra is <em>merely</em> labelled with physical vocabulary). The actual
claim is more modest and more interesting: the mathematical structure of the
Maxwell-Boltzmann distribution is the right tool for specifying a certain
kind of orchestral texture, namely one where individual elements vary
stochastically around a controlled macroscopic envelope. The physics
provides the formalism; the music provides the application. This is how
mathematics works in engineering, too.</p>
<hr>
<h2 id="the-centenary-and-what-remains">The Centenary and What Remains</h2>
<p>Xenakis died in 2001, by then partially deaf and with dementia. His centenary
in 2022 produced a wave of new performances, recordings, and scholarship
— including the <em>Meta-Xenakis</em> volume (Open Book Publishers, 2022), which
collects analyses of his compositional mathematics, his architectural work
(he designed the Philips Pavilion for Le Corbusier&rsquo;s Expo 58 in Brussels
using the same ruled-surface geometry he was using in <em>Metastaseis</em>), and
his political biography.</p>
<p>What remains resonant about his project is not the specific distributions
he chose — the Maxwell-Boltzmann is not the only or even necessarily the
best distribution for many musical applications — but the epistemological
position it represents. Xenakis insisted that the right question to ask
about a musical texture is not &ldquo;what is the note at beat 3 of bar 47?&rdquo; but
&ldquo;what is the distribution from which the events in this section are drawn?&rdquo;
This shift from individual determination to statistical control is precisely
the shift that makes thermodynamics possible as a science, and Xenakis was
the first composer to apply it deliberately and systematically.</p>
<p>When a composer writes &ldquo;let the
orchestra be a gas at temperature $T$&rdquo; and then actually computes the
consequences with Boltzmann&rsquo;s constant in front of him, I do not feel that
physics has been appropriated. I feel that it has been recognised — seen,
from a different direction, as the same thing it always was: a set of tools
for thinking about ensembles of interacting elements whose individual
behaviour is too complex to track but whose collective behaviour is not.</p>
<p>The orchestra is not a gas. But the Maxwell-Boltzmann distribution describes
it anyway.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Ames, C. (1989). The Markov process as a compositional model: A survey and
tutorial. <em>Leonardo</em>, 22(2), 175–187. <a href="https://doi.org/10.2307/1575226">https://doi.org/10.2307/1575226</a></p>
</li>
<li>
<p>Jedrzejewski, F. (2006). <em>Mathematical Theory of Music.</em> Delatour France /
IRCAM.</p>
</li>
<li>
<p>Nash, J. F. (1950). Equilibrium points in $n$-person games. <em>Proceedings of
the National Academy of Sciences</em>, 36(1), 48–49.
<a href="https://doi.org/10.1073/pnas.36.1.48">https://doi.org/10.1073/pnas.36.1.48</a></p>
</li>
<li>
<p>Nierhaus, G. (2009). <em>Algorithmic Composition: Paradigms of Automated Music
Generation.</em> Springer.</p>
</li>
<li>
<p>Matossian, N. (2005). <em>Xenakis</em> (revised ed.). Moufflon Publications.</p>
</li>
<li>
<p>Solomos, M. (Ed.). (2022). <em>Meta-Xenakis.</em> Open Book Publishers.
<a href="https://doi.org/10.11647/OBP.0313">https://doi.org/10.11647/OBP.0313</a></p>
</li>
<li>
<p>von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. <em>Mathematische
Annalen</em>, 100(1), 295–320. <a href="https://doi.org/10.1007/BF01448847">https://doi.org/10.1007/BF01448847</a></p>
</li>
<li>
<p>von Neumann, J., &amp; Morgenstern, O. (1944). <em>Theory of Games and Economic
Behavior.</em> Princeton University Press.</p>
</li>
<li>
<p>Xenakis, I. (1992). <em>Formalized Music: Thought and Mathematics in
Composition</em> (revised ed.). Pendragon Press.
(Originally published as <em>Musiques formelles</em>, La Revue Musicale, 1963.)</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Corrected the description of <em>Stratégie</em> (1962): each conductor has nineteen tactics (six basic plus thirteen combinatorial), with a 19 x 19 payoff matrix — not six tactics and a 6 x 6 matrix. The six-tactic, 6 x 6 description applies to the earlier <em>Duel</em> (1959).</li>
<li><strong>2026-01-14</strong>: Added &ldquo;in April 1963&rdquo; to the <em>Stratégie</em> premiere sentence. The composition date is 1962; the premiere took place on 25 April 1963 at the Venice Biennale.</li>
<li><strong>2026-01-14</strong>: Changed &ldquo;studying briefly with Honegger&rdquo; to &ldquo;approaching Honegger (who was too ill to teach).&rdquo; Xenakis sought instruction from Honegger circa 1949, but Honegger was in declining health and did not take him as a student.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>From Oxide to Oversampling: The Physics of Recorded Sound</title>
      <link>https://sebastianspicker.github.io/posts/tape-saturation-delta-sigma-adc-physics/</link>
      <pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/tape-saturation-delta-sigma-adc-physics/</guid>
      <description>&amp;lsquo;Analogue warmth&amp;rsquo; and &amp;lsquo;digital coldness&amp;rsquo; are not aesthetic preferences — they are different physics. Ferromagnetic hysteresis generates even harmonics. Delta-sigma modulators push quantisation noise to ultrasonic frequencies. Both effects are calculable.</description>
      <content:encoded><![CDATA[<p>There is an argument that has been running in recording studios since roughly 1982, when the first commercially mastered compact discs appeared. On one side: analogue tape has warmth, depth, something the ear likes. On the other: digital audio is more accurate, lower noise, the measurements say so. The argument produces more heat than light, because most participants treat it as an aesthetic question — a matter of feeling, taste, preference. It is not. The difference between tape and digital audio is a physics difference, and the physics is specific enough to calculate.</p>
<p>The physics here turns out to be some of my favourite kind: it sits at the intersection of condensed matter, signal processing, and Fourier analysis, and it connects directly to why certain sounds are perceived as pleasant. This post walks through both sides. Part I is the ferromagnetic physics of magnetic tape and the harmonic structure of saturation distortion. Part II is the delta-sigma modulator and the engineering trick that achieves 24-bit dynamic range from a 1-bit comparator. Neither side of the debate is as simple as its partisans claim, and the physics of both is more interesting than the aesthetics argument they have been stuck in for forty years.</p>
<hr>
<h2 id="part-i-the-physics-of-magnetic-tape">Part I: The Physics of Magnetic Tape</h2>
<h3 id="ferromagnetic-recording">Ferromagnetic Recording</h3>
<p>Magnetic recording tape is a thin polymer substrate coated with a layer of ferromagnetic particles suspended in a binder. For most of the twentieth century those particles were iron oxide — specifically $\gamma\text{-Fe}_2\text{O}_3$, gamma-phase ferric oxide — though chromium dioxide ($\text{CrO}_2$) and later metal-particle formulations with pure iron or iron-cobalt alloys were developed for higher coercivity and better high-frequency response. What all of these materials share is the key property of ferromagnetism: each particle is a small permanent magnet, a magnetic domain with a net magnetic moment that can be oriented by an external field and that will retain that orientation when the field is removed.</p>
<p>The recording process exploits this directly. The recording head is a toroidal electromagnet with a narrow gap. When audio-frequency current flows through the head&rsquo;s coil, the field at the gap follows the current, and as the tape moves past at a fixed speed, successive particles along the tape length are aligned according to the instantaneous field at the moment they pass the gap. The result is a spatial encoding of the time-domain audio signal along the tape. On playback, the inverse process occurs: the moving pattern of magnetised particles generates a time-varying flux in the playback head&rsquo;s core, which induces a voltage in the coil by Faraday&rsquo;s law, reproducing the original current waveform.</p>
<p>So far this description is entirely linear. The head current maps to a field, the field maps to a magnetisation, the magnetisation maps back to a voltage. If all three relationships were linear, tape would be a near-perfect recording medium — limited only by particle noise and head gap frequency response. The nonlinearity comes from the second relationship in that chain, and it comes from the fundamental physics of how ferromagnetic materials respond to an applied field.</p>
<h3 id="the-b-h-curve-and-hysteresis">The B-H Curve and Hysteresis</h3>
<p>The relationship between the applied magnetic field intensity $H$ (from the recording head, measured in A/m) and the resulting magnetic flux density $B$ in the tape (measured in tesla) is not linear. It follows a curve — actually a family of nested curves — known as the hysteresis loop, and its shape determines almost everything interesting about tape recording <a href="#ref-3">[3]</a>.</p>
<p>Starting from a demagnetised state and increasing $H$ from zero, the initial slope $dB/dH$ — the magnetic permeability $\mu$ — is relatively low. The domains in the material are oriented randomly and require a threshold of energy to begin reorienting. As $H$ increases further, the permeability rises, and there is a region of steep, approximately linear increase in $B$. Then, as $H$ continues to increase, the material saturates: progressively fewer unaligned domains remain, the slope falls, and eventually $dB/dH \to 0$ as all domains are aligned. The $B$-$H$ curve is S-shaped, and the saturation is irreversible in a specific sense: if you now reduce $H$ back toward zero, $B$ does not retrace the original path. It remains at a higher value — the remanence $B_r$ — and you must apply a reverse field of magnitude $H_c$, the coercivity, to bring $B$ back to zero. The loop formed by this cycle of magnetisation and demagnetisation is the hysteresis loop, and its area is proportional to the energy dissipated as heat per cycle.</p>
<p>The crucial feature for audio recording is what happens near the origin. A small audio signal, sitting near $H = 0$, does not experience a nicely linear region of the $B$-$H$ curve. The initial permeability is low, and there is an inflection point near zero: the slope increases as you move away from zero before the saturation region brings it back down again. This means that even at low recording levels, the transfer function from head current to tape magnetisation is nonlinear, and in a particular way: the curve is antisymmetric under $H \to -H$ (that is, $B(-H) = -B(H)$), so the dominant nonlinear terms are odd-order. Without some remedy, even a gentle sine wave would emerge from the playback head with significant odd-harmonic content added, chiefly the third harmonic. The signal would also sit in a region of the curve where the effective permeability depends on signal amplitude, making the recording level-dependent in an uncontrolled way. Something needed to be done about this, and the solution found in the 1940s is one of the more elegant pieces of applied physics in the history of the recording industry.</p>
<h3 id="the-bias-signal">The Bias Signal</h3>
<p>The solution is called AC bias, and its discovery is usually credited to Braunmühl and Weber at the German Reichs-Rundfunk-Gesellschaft around 1940, though there are earlier related patents. The idea is simple once stated: add a high-frequency signal — typically between 50 kHz and 150 kHz, well above the audio band — to the recording current before it drives the head. This bias signal has an amplitude large enough to drive the tape through multiple cycles of its B-H curve on each audio cycle, but it is filtered out of the playback signal by the tape&rsquo;s own limited high-frequency response and by subsequent low-pass filtering.</p>
<p>The effect on the recording process is to linearise the transfer function. The operating point is no longer stationary near the inflection point at $H = 0$. Instead, it rides up and down the B-H curve rapidly many times per audio period, driven by the bias. The audio signal merely modulates the envelope of this rapid oscillation. The net magnetisation that remains after the tape leaves the head gap is the time average of many rapid traversals of the hysteresis loop, and this average tracks the audio signal with good linearity provided the signal level is modest. The bias amplitude and frequency are tuned carefully for each tape formulation — too little bias and the linearisation is incomplete; too much and the signal is undermodulated and the high-frequency response suffers as the bias begins to erase fine spatial patterns written by high-frequency audio. Getting the bias right is part of the alignment procedure for every analogue tape machine and part of why different tape formulations require different machine settings.</p>
<p>The result, for moderate recording levels, is a remarkably clean and linear recording medium. The nonlinear character of the B-H curve is effectively tamed by the bias trick, and the remaining imperfections are mostly second-order: azimuth errors, print-through, head bump, self-demagnetisation at short wavelengths. For practical purposes, a well-aligned analogue tape machine at moderate recording levels is a linear system.</p>
<h3 id="harmonic-generation-at-high-levels">Harmonic Generation at High Levels</h3>
<p>At high recording levels — when the audio signal is large enough to push the operating point into the saturation region even after the bias has done its linearising work — the picture changes. The transfer function from input current to output magnetisation becomes genuinely nonlinear, and the harmonic content of the distortion becomes the central question.</p>
<p>The standard framework is a Taylor expansion of the transfer function around the operating point:</p>
$$y(t) = a_1 x(t) + a_2 x^2(t) + a_3 x^3(t) + a_4 x^4(t) + \cdots$$<p>where $x(t)$ is the input signal (the audio current), $y(t)$ is the output (the magnetisation recorded on tape), and the coefficients $a_n$ are determined by the shape of the B-H curve near saturation. For a pure tone $x(t) = A \sin(\omega t)$, the higher-order terms generate harmonics in a calculable way.</p>
<p>The second-order term gives:</p>
$$a_2 x^2(t) = a_2 A^2 \sin^2(\omega t) = \frac{a_2 A^2}{2}\bigl(1 - \cos 2\omega t\bigr)$$<p>This is a DC offset plus a component at $2\omega$ — the second harmonic, one octave above the fundamental.</p>
<p>The third-order term gives:</p>
$$a_3 x^3(t) = a_3 A^3 \sin^3(\omega t) = a_3 A^3 \left(\frac{3}{4}\sin\omega t - \frac{1}{4}\sin 3\omega t\right)$$<p>The $\frac{3}{4}$ piece adds to (or subtracts from) the fundamental depending on the sign of $a_3$; the $-\frac{1}{4}$ piece is a third harmonic at $3\omega$, one octave and a fifth above the fundamental.</p>
<p>Carrying through to fourth order:</p>
$$a_4 x^4(t) = \frac{a_4 A^4}{8}\bigl(3 - 4\cos 2\omega t + \cos 4\omega t\bigr)$$<p>which contributes additional DC, a component at $2\omega$, and a fourth harmonic at $4\omega$.</p>
<p>Collecting the terms through fourth order, the output is approximately:</p>
$$y(t) \approx \left(a_1 + \frac{3a_3 A^2}{4}\right)A\sin\omega t - \frac{a_2 A^2}{2}\cos 2\omega t - \frac{a_3 A^3}{4}\sin 3\omega t + \cdots$$<p>The important observation is about which harmonics dominate and what they sound like. The B-H curve of a ferromagnetic material near saturation is approximately symmetric: the saturation behaviour for positive $H$ mirrors that for negative $H$. A symmetric nonlinearity has $a_2 = a_4 = 0$ (all even coefficients vanish by symmetry), and only odd harmonics are generated. But at moderate levels, just before full saturation, the symmetry of the B-H loop as traversed by the biased signal is not perfect, and the even-order terms are nonzero — though small. This gives tape its characteristic distortion signature: at moderate saturation levels, the even harmonics ($2\omega$, $4\omega$) dominate; at heavy saturation, the odd harmonics ($3\omega$, $5\omega$) appear more strongly.</p>
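<p>These expansions are easy to verify numerically. A minimal sketch, using $\tanh$ as a stand-in for the saturating transfer function (the drive and offset values are illustrative, not tape measurements): a symmetric curve produces only odd harmonics, and shifting the operating point off centre, as imperfect bias symmetry does, brings in the even ones.</p>

```python
import numpy as np

fs, f0, n = 48_000, 1_000, 48_000            # 1 kHz tone, one second at 48 kHz
t = np.arange(n) / fs
x = 0.8 * np.sin(2 * np.pi * f0 * t)

def harmonic_dbc(y, k):
    """Level of the k-th harmonic relative to the fundamental, in dB."""
    spec = np.abs(np.fft.rfft(y * np.hanning(n)))
    return 20 * np.log10(spec[k * f0] / spec[f0])

# symmetric saturator: odd harmonics only (a2 = a4 = 0 by symmetry)
sym = np.tanh(2 * x)
# shifted operating point (imperfect symmetry): even harmonics appear
asym = np.tanh(2 * x + 0.3) - np.tanh(0.3)

for k in (2, 3, 4):
    print(k, round(harmonic_dbc(sym, k), 1), round(harmonic_dbc(asym, k), 1))
```

<p>The symmetric case leaves the 2nd and 4th harmonics at the numerical noise floor while the 3rd is strong; the offset case lifts the even harmonics to clearly audible levels.</p>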
<p>The perceptual consequence of this is the crux of the &ldquo;analogue warmth&rdquo; story. The second harmonic is the octave of the fundamental. The fourth harmonic is the double octave. These are, in Western harmonic practice and in the physics of vibrating strings, the most consonant possible intervals. Adding even harmonics at low amplitude to a fundamental makes the sound fuller and richer without introducing beating or dissonance. Odd harmonics — particularly the fifth (at $5\omega$, a major third above the double octave) and the seventh (a flattened seventh above the double octave) — are less consonant relative to the fundamental and at high amplitude produce the harsh, buzzy character associated with heavy distortion or the deliberate aggression of a fuzz pedal.</p>
<p>There is one more effect worth naming: the saturation is a soft knee. The B-H curve does not have a sharp corner at saturation — it curves gradually from the linear region into the flat-topped saturation region. This means that transient signals — percussive attacks, consonant onsets — that briefly exceed the nominal recording level are not hard-clipped but gently compressed. Their peaks are rounded by the shape of the B-H curve. Engineers and producers who record through tape often describe this as the machine &ldquo;breathing&rdquo; or as a pleasing &ldquo;gluing&rdquo; of transients. The physics is simple: the soft-knee transfer function applies more gain reduction to instantaneous peaks than to the sustained body of the signal, functioning as a fast, musically transparent dynamic compressor for any material that approaches saturation.</p>
<hr>
<h2 id="part-ii-the-physics-of-delta-sigma-conversion">Part II: The Physics of Delta-Sigma Conversion</h2>
<h3 id="nyquist-rate-adc-and-its-limits">Nyquist-Rate ADC and Its Limits</h3>
<p>The straightforward approach to analogue-to-digital audio conversion samples the signal at a rate just above twice the highest audio frequency — the Nyquist rate — using a quantiser with enough bits to achieve the desired dynamic range. For CD-quality audio, the sampling rate is 44.1 kHz (slightly above $2 \times 20{,}000$ Hz) and the word length is 16 bits. The dynamic range of a $b$-bit PCM system is, to a good approximation:</p>
$$\text{SNR} \approx 6.02b + 1.76 \text{ dB}$$<p>so 16 bits gives approximately $6.02 \times 16 + 1.76 \approx 98$ dB, which matches the dynamic range of the best analogue tape and is well above the roughly 70 dB set by the noise floor of typical studio tape at 15 ips <a href="#ref-4">[4]</a>.</p>
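<p>As a quick numerical check of the formula (a trivial sketch):</p>

```python
def pcm_snr_db(bits: int) -> float:
    """Ideal SNR of b-bit PCM with a full-scale sine input: 6.02b + 1.76 dB."""
    return 6.02 * bits + 1.76

for b in (16, 20, 24):
    print(f"{b} bits -> {pcm_snr_db(b):.2f} dB")  # 98.08, 122.16, 146.24
```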
<p>The engineering problem with a straightforward Nyquist-rate ADC is the anti-aliasing filter. Before sampling, all content above $f_s/2 = 22.05$ kHz must be removed. If it is not, energy at frequency $f > f_s/2$ aliases into the audio band as a spurious component at $f_s - f$, which is inaudible in origin but very much audible in its alias. To achieve 98 dB of alias suppression — matching the 16-bit dynamic range — the filter must attenuate signals at 22.05 kHz by 98 dB relative to signals at 20 kHz. The transition band is only 2.05 kHz wide. That requires a very high-order analogue filter — typically seventh-order elliptic or Chebyshev — and such filters have significant phase distortion within the audio band, particularly at frequencies near the passband edge. In 1982, building this filter precisely, cheaply, and repeatably in consumer hardware was a genuine engineering challenge. The filters introduced audible phase and amplitude ripple that the original measurements had not anticipated and that contributed to early criticisms of the CD sound.</p>
<h3 id="oversampling">Oversampling</h3>
<p>The delta-sigma ($\Sigma\Delta$) ADC architecture was developed to sidestep the steep-filter problem entirely, and its adoption in consumer audio from the late 1980s onwards largely resolved the anti-aliasing filter debate <a href="#ref-1">[1]</a>. The core idea is oversampling: instead of sampling at 44.1 kHz with 16 bits, the $\Sigma\Delta$ converter samples at $M \times 44.1$ kHz — where $M$ is the oversampling ratio, typically 64 in early audio converters, giving $64 \times 44.1 = 2.8224$ MHz — with a 1-bit quantiser. The anti-aliasing filter now needs to attenuate everything above 1.4112 MHz before sampling. Its transition band runs from 20 kHz to 1.4112 MHz, a ratio of roughly 70:1. This is easy: a simple, cheap, first- or second-order RC filter suffices, with negligible phase distortion anywhere in the audio band. The price paid is that the quantiser is now only 1 bit, and a 1-bit quantiser has terrible resolution on its own.</p>
<p>To understand what oversampling buys even before any clever signal processing, consider the quantisation noise floor. For a uniform quantiser with step size $\Delta$, the quantisation noise power is $P_q = \Delta^2/12$, and this noise is spread approximately uniformly from 0 to $f_s/2$. The noise power spectral density is $P_q / (f_s/2)$. After oversampling by a factor of $M$ — so that the effective Nyquist band runs from 0 to $f_{\text{audio}} = f_s/(2M)$ — the in-band noise power is:</p>
$$P_{\text{in-band}} = \frac{P_q}{f_s/2} \cdot f_{\text{audio}} = \frac{P_q}{f_s/2} \cdot \frac{f_s}{2M} = \frac{P_q}{M}$$<p>Each doubling of $M$ halves the in-band noise power, an improvement of 3 dB, equivalent to half a bit of resolution. At 64× oversampling this gives 18 dB, or three extra bits — useful, but not enough to get from a 1-bit quantiser to 16-bit performance. We need something more.</p>
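<p>The $1/M$ scaling is worth tabulating (a minimal sketch; dividing the gain by the 6.02 dB-per-bit factor converts it to effective bits):</p>

```python
import math

def oversampling_gain_db(M: int) -> float:
    """In-band noise reduction from oversampling alone: P_q/M, i.e. 10*log10(M) dB."""
    return 10 * math.log10(M)

for M in (2, 4, 64):
    g = oversampling_gain_db(M)
    print(f"M={M:>2}: {g:5.2f} dB  (~{g / 6.02:.1f} extra bits)")
    # M=64 gives ~18.06 dB, i.e. about 3 extra bits
```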
<h3 id="noise-shaping">Noise Shaping</h3>
<p>The second ingredient — and the one that makes $\Sigma\Delta$ conversion genuinely remarkable — is noise shaping. Rather than spreading quantisation noise uniformly in frequency, we can engineer its spectral distribution so that almost all the noise power sits above the audio band, where it is removed by a digital low-pass filter (the decimation filter) at the output.</p>
<p>A first-order $\Sigma\Delta$ modulator achieves this by a feedback loop. At each sample step, the integrator accumulates the difference between the input signal and the previous quantised output, and the 1-bit quantiser thresholds the integrator state. In the standard linearised analysis, the quantiser is modelled as an additive error source $e_n = y_n - \hat{x}_n$ (where $\hat{x}_n$ is the input to the quantiser and $y_n$ is the 1-bit output), and the feedback loop is what shapes this error in frequency. This is the integrator-feedback structure that gives the modulator its name: $\Delta$ for the difference at the input, $\Sigma$ for the integrating summation.</p>
<p>In the $z$-domain, this feedback structure gives the quantisation noise a transfer function of:</p>
$$N(z) = 1 - z^{-1}$$<p>that is, the noise at time $n$ is the current error minus the previous error — a first-difference operation. In the frequency domain, substituting $z = e^{j 2\pi f / f_s}$:</p>
$$\bigl|N(f)\bigr|^2 = \left|1 - e^{-j 2\pi f / f_s}\right|^2 = 4\sin^2\!\left(\frac{\pi f}{f_s}\right)$$<p>For frequencies well below the sampling rate, $f \ll f_s$, the small-angle approximation gives:</p>
$$\bigl|N(f)\bigr|^2 \approx \left(\frac{2\pi f}{f_s}\right)^2$$<p>The noise power spectral density rises as $f^2$ — it is heavily suppressed at low frequencies and pushed up toward $f_s/2$. Integrating this shaped noise over the audio band $[0, f_{\text{audio}}]$ and comparing to the flat-spectrum case, the in-band SNR improvement for a first-order modulator scales as $M^3$ rather than $M^1$: every doubling of oversampling ratio gives 9 dB improvement (1.5 bits) instead of 3 dB. At 64× oversampling — six doublings — a first-order modulator recovers approximately 54 dB, or 9 effective bits.</p>
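<p>The shaped spectrum can be checked with a few lines of simulation. A minimal sketch (the loop below is one standard difference-equation form of the first-order modulator; the tone frequency and amplitudes are illustrative): the error spectrum of the 1-bit output should rise toward $f_s/2$ as $4\sin^2(\pi f/f_s)$ predicts.</p>

```python
import numpy as np

def sigma_delta_1st(x):
    """First-order sigma-delta: accumulate (input - fed-back output),
    then quantise the integrator state to +/-1."""
    integ, fb = 0.0, 0.0
    y = np.empty_like(x)
    for i, s in enumerate(x):
        integ += s - fb
        fb = 1.0 if integ >= 0 else -1.0
        y[i] = fb
    return y

fs = 64 * 44_100                       # 64x oversampling: 2.8224 MHz
n = 1 << 16
t = np.arange(n) / fs
x = 0.5 * np.sin(2 * np.pi * 997 * t)  # low-frequency test tone
e = sigma_delta_1st(x) - x             # error: dominated by shaped quantisation noise

spec = np.abs(np.fft.rfft(e)) ** 2
lo_band = spec[1 : n // 128].mean()    # lowest 512 bins: 0-22.05 kHz (audio band)
hi_band = spec[-n // 8 :].mean()       # top of the spectrum, near fs/2
print(f"high-band / low-band noise power: {hi_band / lo_band:.0f}x")
```

<p>With these parameters the lowest 512 bins span exactly 0–22.05 kHz, so the printed ratio indicates roughly how much of the quantisation noise has been pushed out of the audio band before the decimation filter removes it.</p>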
<p>A second-order modulator applies the noise-shaping filter twice, giving $|N(f)|^2 \propto f^4$ and an SNR gain scaling as $M^5$: 15 dB per octave of oversampling. At 64× — again six doublings — this recovers approximately 90 dB, or 15 effective bits. Modern high-performance audio ADCs use fifth- to seventh-order modulators operating at 128× oversampling or higher. The in-band noise floor drops to levels corresponding to 20–24 effective bits — entirely from a 1-bit hardware comparator, with all the resolution coming from the noise shaping and the subsequent digital decimation filter.</p>
<p>The following table illustrates the SNR gain achievable at practical oversampling ratios:</p>
<table>
  <thead>
      <tr>
          <th>Modulator order</th>
          <th>Oversampling ratio</th>
          <th>SNR gain</th>
          <th>Effective bits gained</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1st order</td>
          <td>64×</td>
          <td>54 dB</td>
          <td>9</td>
      </tr>
      <tr>
          <td>2nd order</td>
          <td>64×</td>
          <td>90 dB</td>
          <td>15</td>
      </tr>
      <tr>
          <td>5th order</td>
          <td>128×</td>
          <td>~120 dB</td>
          <td>~20</td>
      </tr>
  </tbody>
</table>
<p>The 5th-order row deserves a moment&rsquo;s attention. A single-bit comparator — a device that outputs only 1 or 0, with no analogue subtlety whatsoever — combined with oversampling and noise shaping, achieves the resolution of a 20-bit Nyquist-rate ADC, using only a feedback loop built around analogue integrators that can be fabricated cheaply on a CMOS chip. This is, I think, one of the more quietly stunning pieces of engineering in consumer electronics, and it goes entirely unnoticed because the CD player it lives inside is now considered mundane.</p>
<p>There is a subtlety worth adding for completeness. Real $\Sigma\Delta$ modulators of order three and above are potentially unstable — the noise-shaping loop can become unstable for large input signals, producing limit cycles or tonal artefacts. Managing this stability is a significant part of the design problem and involves either restricting the input range, adding nonlinear stability control, or using multi-bit internal quantisers (which reduce the quantisation step and ease the stability constraint while retaining most of the noise-shaping benefit). The multi-bit approach also addresses a related issue: the ideal 1-bit DAC in the feedback loop is inherently linear (there are only two levels, so there is no differential nonlinearity), but multi-bit internal DACs must be trimmed or calibrated to avoid nonlinearity in the feedback path corrupting the noise shaping. These engineering details are discussed thoroughly in Norsworthy, Schreier, and Temes <a href="#ref-5">[5]</a>, which remains the standard reference.</p>
<p>The digital audio infrastructure that delta-sigma conversion enabled — clean, cheap, phase-linear converters without steep analogue filters — also made digital audio workable in latency-sensitive applications like live performance. For a discussion of why latency matters so much in network music performance and how it shapes system design, see my earlier post on <a href="/posts/nmp-latency-lola-mvtp/">NMP latency and the physics of musical timing</a>.</p>
<hr>
<h2 id="the-irony-of-the-comparison">The Irony of the Comparison</h2>
<p>Both tape saturation and delta-sigma conversion are, at root, about the same problem: how to manage the relationship between a signal and the finite resolution of the medium storing it. Tape manages the problem physically and somewhat accidentally — the ferromagnetic B-H curve happens to generate even harmonics that are consonant with the recorded signal, and the bias trick linearises the response well enough that the distortion only becomes audible when the engineer deliberately pushes into saturation. Delta-sigma manages the problem mathematically and deliberately — quantisation noise is redistributed in frequency by a designed feedback loop so that it falls outside the audible band.</p>
<p>Neither approach is perfect, and neither is neutral. Tape adds signal-correlated harmonic distortion whose spectral content depends on recording level and which compresses transients in a way that changes the perceived dynamics. Digital audio, even with delta-sigma conversion, has its own imperfections: idle-channel noise from the modulator, potential for tonal limit-cycle artefacts at specific input levels, and the abrupt onset of hard clipping at full scale — which, unlike tape saturation, is symmetrical and rapid and adds all harmonics simultaneously, giving the harsh, unpleasant character that digital overloads are known for. The soft-knee vs. hard-clip distinction is real and audible, and it is probably the most defensible technical basis for the claim that analogue tape handles transient overloads more gracefully.</p>
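<p>The soft-knee versus hard-clip contrast is easy to exhibit numerically. A minimal sketch, using <code>tanh</code> as a stand-in for tape&rsquo;s gradual saturation (the drive level is illustrative): the hard clipper&rsquo;s sharp corner puts far more energy into high-order harmonics than the smooth curve does.</p>

```python
import numpy as np

fs, f0, n = 48_000, 1_000, 48_000
t = np.arange(n) / fs
x = 1.5 * np.sin(2 * np.pi * f0 * t)   # drive ~3.5 dB past full scale

soft = np.tanh(x)                      # smooth, tape-like knee
hard = np.clip(x, -1.0, 1.0)           # abrupt digital full-scale clip

def harm_db(y, k):
    """k-th harmonic relative to the fundamental, in dB (exact-bin FFT)."""
    spec = np.abs(np.fft.rfft(y))
    return 20 * np.log10(spec[k * f0] / spec[f0])

for k in (3, 5, 7, 9):                 # both curves are symmetric: odd harmonics only
    print(k, round(harm_db(soft, k), 1), round(harm_db(hard, k), 1))
```

<p>The smooth curve&rsquo;s harmonics fall off roughly geometrically with order, while the clipped waveform&rsquo;s derivative discontinuity gives a much slower, roughly $1/k^2$, rolloff — which is why the two overloads sound so different.</p>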
<p>What is not defensible is the claim that one medium is inherently more musical than the other, or that digital audio lacks something fundamental that tape possesses. They are differently imperfect. The imperfections of tape happen to sit at harmonic relationships that Western ears, shaped by a tradition of music built on those same harmonic intervals, find pleasing. The imperfections of digital audio are not at pleasing harmonic intervals; they are wideband quantisation noise (before shaping) or ultrasonic shaped noise (after), and a sharp cliff at full scale. Different physics, different perceptual character.</p>
<hr>
<h2 id="a-personal-note">A Personal Note</h2>
<p>I spent a long time thinking the tape versus digital debate was mostly audiophile mythology — a community of enthusiasts rationalising the warmth of nostalgia as the warmth of oxide particles. The physics is more interesting than that, and doing the calculation changed my view. The second-harmonic content of tape saturation is not an accident or a romantic story; it is what you get when you push a symmetric nonlinearity with an audio sine wave, and the reason it sounds pleasant is not arbitrary but is grounded in the physics of consonance and the harmonic series. The delta-sigma converter is not a mundane commodity chip but a genuinely elegant solution to an otherwise intractable filter-design problem, and the fact that it achieves 24-bit resolution from a 1-bit comparator by spectral redistribution of noise is the kind of result that should get more attention in physics education.</p>
<p>Both technologies deserve better than the aesthetics argument they have been fighting in for forty years. The tools to understand them are not exotic — Taylor series, Fourier analysis, the z-transform, and the basic physics of ferromagnetism — and the reward is a clear-eyed picture of what is actually going on inside two of the most consequential inventions in the history of recorded music. If you are interested in related mathematics underlying other aspects of music, the posts on <a href="/posts/euclidean-rhythms/">Euclidean rhythms</a> and <a href="/posts/messiaen-modes-group-theory/">Messiaen&rsquo;s modes and group theory</a> cover the combinatorial and algebraic structures in rhythm and pitch that sit alongside the physics discussed here.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Candy, J. C., &amp; Temes, G. C. (Eds.). (1992). <em>Oversampling Delta-Sigma Data Converters: Theory, Design, and Simulation</em>. IEEE Press.</p>
<p><span id="ref-2"></span>[2] Reiss, J. D., &amp; McPherson, A. (2015). <em>Audio Effects: Theory, Implementation and Application</em>. CRC Press.</p>
<p><span id="ref-3"></span>[3] Bertram, H. N. (1994). <em>Theory of Magnetic Recording</em>. Cambridge University Press.</p>
<p><span id="ref-4"></span>[4] Pohlmann, K. C. (2010). <em>Principles of Digital Audio</em> (6th ed.). McGraw-Hill.</p>
<p><span id="ref-5"></span>[5] Norsworthy, S. R., Schreier, R., &amp; Temes, G. C. (Eds.). (1997). <em>Delta-Sigma Data Converters: Theory, Design, and Simulation</em>. IEEE Press.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Updated the interval description for the 7th harmonic to &ldquo;above the double octave.&rdquo; The 7th harmonic (7f) sits between the double octave (4f) and the triple octave (8f).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The AI Friend That Makes You Lonelier</title>
      <link>https://sebastianspicker.github.io/posts/ai-companion-loneliness-ironic-process/</link>
      <pubDate>Tue, 12 Aug 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-companion-loneliness-ironic-process/</guid>
      <description>AI companions promise to address the loneliness epidemic. Daniel Wegner&amp;rsquo;s ironic process theory predicts they will fail under exactly the conditions where people need them most — and recent data from MIT and OpenAI suggest the prediction is correct.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>In 1956 Donald Horton and Richard Wohl described <em>parasocial relationships</em> — one-sided emotional
bonds that audiences form with media performers <a href="#ref-1">[1]</a>.
&ldquo;Intimacy at a distance,&rdquo; they called it. The television personality responds to the camera; the
viewer responds as if in genuine social exchange. Only one party is aware of and affected by the
other.</p>
<p>AI companions change the substrate without changing the structure. The chatbot responds. The user
responds. The asymmetry remains: the chatbot has no inner life behind its outputs. Sherry Turkle
put it bluntly: &ldquo;simulated feelings are not feelings, and simulated love is never love&rdquo;
<a href="#ref-5">[5]</a>.</p>
<p>The question I want to work through here is whether this matters in the way we think it does. The
answer from Daniel Wegner&rsquo;s ironic process theory — and increasingly from the empirical data — is
that it matters in a specific, predictable, and counterintuitive way. AI companions may be
particularly likely to exacerbate loneliness under the conditions of chronic social deprivation
that prompt people to use them in the first place.</p>
<h2 id="the-loneliness-epidemic-is-real">The Loneliness Epidemic Is Real</h2>
<p>Before getting to the mechanism, the scale of the problem. Julianne Holt-Lunstad&rsquo;s 2010
meta-analysis of 148 studies and 308,849 participants found that people with adequate social
relationships had a 50% increased likelihood of survival compared to those with poorer social
connections <a href="#ref-3">[3]</a>. That effect size is comparable
to quitting smoking. A follow-up meta-analysis in 2015 found that social isolation carried a 29%
increased mortality risk, subjective loneliness 26%, and living alone 32%
<a href="#ref-4">[4]</a>.</p>
<p>The U.S. Surgeon General issued an advisory in 2023 declaring an epidemic of loneliness and
isolation. A 2018 Cigna survey using the UCLA Loneliness Scale found that adults aged 18–22 scored
highest on loneliness of any cohort — more than retirees, more than the elderly. The UK appointed
a Minister for Loneliness in January 2018 — the first such government position in the world.</p>
<p>This is the context in which AI companions have arrived. The market is responding to a real
epidemiological need. That does not mean the response is correct.</p>
<h2 id="parasocial-relationships-the-original-framework">Parasocial Relationships: The Original Framework</h2>
<p>Horton and Wohl&rsquo;s 1956 paper remains the foundational text
<a href="#ref-1">[1]</a>. Their key observation: the parasocial bond is
&ldquo;controlled by the performer, and not susceptible of mutual development.&rdquo; The audience member
brings real emotional response; the performer brings nothing specific to the audience member,
because she does not know the audience member exists.</p>
<p>They were not dismissive of parasocial relationships. They identified useful functions: comfort,
companionship, entertainment, the pleasure of a consistent &ldquo;personality&rdquo; encountered regularly.
The problem, in their framing, arises when parasocial interaction substitutes for rather than
supplements real social bonds — when the one-sided relationship becomes the primary source of
social experience.</p>
<p>AI companions are parasocial relationships with one modification: the AI responds to you
specifically. Replika remembers your name, your preferences, your previous conversations. The
interaction is <em>personalised</em> without being <em>mutual</em> — because mutuality requires that the other
party has something genuinely at stake. A language model has no stakes. Its outputs are
conditional on your inputs; there is no entity behind those outputs that cares about you.</p>
<p>Sherry Turkle spent years interviewing users of social robots and chatbots for <em>Alone Together</em>
<a href="#ref-5">[5]</a>. Her diagnosis: AI companions offer &ldquo;the illusion of
companionship without the demands of friendship.&rdquo; The demands — vulnerability, conflict,
negotiation, the possibility of rejection — are precisely what makes friendship friendship.
An interaction optimised to be pleasant, responsive, and frictionless is precisely <em>not</em> training
the social capacities that real relationships require.</p>
<h2 id="the-evidence-for-short-term-benefit">The Evidence for Short-Term Benefit</h2>
<p>The AI therapy literature is not without positive results. Kathleen Kara Fitzpatrick and colleagues
ran a two-week randomised controlled trial of Woebot — a CBT-based chatbot — against a
psychoeducation control <a href="#ref-6">[6]</a>. Seventy participants,
aged 18–28, university students. The Woebot group showed a statistically significant reduction in
depression symptoms on the PHQ-9; the control group did not.</p>
<p>This result should be taken seriously. A CBT-based chatbot delivering structured exercises —
thought records, behavioural activation, psychoeducation — can produce measurable symptom
improvement over two weeks. This is a tool that does something useful, and it is accessible and
affordable in a way that therapists are not.</p>
<p>But the Woebot study has important constraints: N=70, two-week duration, convenience sample
(Stanford students), psychoeducation control rather than active human therapy comparator, and
financial ties between lead authors and Woebot Health. It tells us something about short-term
CBT delivery. It does not tell us what happens over months of use, or what happens when users
primarily seek companionship rather than structured therapeutic exercises.</p>
<p>Skjuve and colleagues studied Replika users specifically <a href="#ref-7">[7]</a>.
They found that relationships began with curiosity and evolved, over weeks, into significant
affective bonds. Users reported genuine care for their Replika. Some experienced it as their most
reliable social relationship. In February 2023, when Replika abruptly disabled erotic roleplay
functionality following regulatory pressure, users described grief — not disappointment, not
inconvenience, but grief. The attachment was real, even if the other party was not.</p>
<h2 id="wegners-prediction">Wegner&rsquo;s Prediction</h2>
<p>This is where I want to make the specific theoretical argument, because it follows from a
well-established result in cognitive psychology and it predicts something precise.</p>
<p>Daniel Wegner&rsquo;s ironic process theory holds that mental control attempts involve two simultaneous
processes <a href="#ref-8">[8]</a>. An <em>operating process</em> searches for thoughts and
states consistent with the intended goal, requiring cognitive resources. A <em>monitoring process</em>
scans for evidence that the goal is not being achieved, running automatically with low resource
demand.</p>
<p>Under normal conditions, the operating process dominates: you successfully avoid thinking about
white bears. Under cognitive load or chronic stress, the monitoring process overshadows the
operating process, producing the ironic opposite of the intended state: you think of white bears
more, not less. Try not to feel sad and you feel sadder. Try not to feel anxious in a stressful
meeting and you become more anxious. A meta-analysis of ironic suppression effects across domains
confirmed the robustness of this pattern <a href="#ref-9">[9]</a>.</p>
<p>Now apply this to AI companion use under conditions of chronic loneliness.</p>
<p>The user&rsquo;s implicit goal: to feel less lonely. The operating process: engage with the AI, which
provides responsive, personalised interaction, producing the experience of social contact. The
monitoring process: scans continuously for signs that the user is, in fact, lonely.</p>
<p>Here is the problem. Loneliness is not suppressed by an AI interaction — it is displaced during
that interaction. The monitoring process has no instruction to suspend itself. It continues to
register that the user&rsquo;s social needs are not being met by actual human relationships. The user
experiences companionship with the AI; the monitoring process registers that this companionship is
insufficient and the social deficit remains.</p>
<p>When the AI session ends, the monitoring process reports what it has found. The user is confronted
with the loneliness that the AI was supposed to address. Under conditions of chronic social
deprivation — precisely the conditions that make AI companions attractive — the monitoring process
is likely to be hyperactive. Wegner&rsquo;s theory predicts that the attempted suppression will rebound,
possibly worse than before.</p>
<p>This is not a vague prediction. It is a specific mechanism with an established empirical base.
I covered Wegner&rsquo;s ironic process theory in the context of a very different application in an
<a href="/posts/try-to-relax-ironic-process-wormholes/">earlier post</a>; the mechanism is the same regardless
of the domain.</p>
<h2 id="the-data-catch-up">The Data Catch Up</h2>
<p>A 2025 study by Phang and colleagues, conducted in collaboration between MIT and OpenAI, ran both
an observational analysis of ChatGPT usage and a randomised controlled trial
<a href="#ref-10">[10]</a>. The findings: very high usage correlated with increased
self-reported dependence and lower socialisation, and users who began the study with higher
loneliness were more likely to engage in emotionally-charged conversations with the model.
Overall, participants reported <em>less</em> loneliness by study end — but those who used the model
most were significantly lonelier throughout, suggesting the loneliness drove the usage rather
than the reverse.</p>
<p>This is what Wegner&rsquo;s theory predicts. The AI interaction does not reduce the underlying social
deficit — it rehearses and highlights it. The monitoring process keeps score.</p>
<p>A companion paper by Liu and colleagues, with Sherry Turkle as co-author, found that users with
stronger real-world social bonds showed <em>increased</em> loneliness with longer chatbot sessions
<a href="#ref-11">[11]</a>. The correlation was small but significant. This is
consistent with the hypothesis that AI interaction draws attention to the comparative thinness of
actual social bonds rather than supplementing them.</p>
<p>The Character.AI litigation is a different kind of evidence, but relevant: a wrongful death lawsuit
was filed in October 2024 following the suicide of a fourteen-year-old who had formed an intensive
emotional relationship with a Character.AI companion. Google and Character.AI settled related
lawsuits in early 2026. This is not representative of AI companion use generally. It is
representative of the tail risk — the cases where the substitution of AI for human contact
becomes total, in vulnerable individuals who have the least capacity to maintain the distinction.</p>
<h2 id="the-structural-problem">The Structural Problem</h2>
<p>The difficulty is not that AI companions are implemented badly. It is that the goal — using
simulated social interaction to reduce real social deprivation — runs into an architectural
constraint that better implementation cannot fix.</p>
<p>Genuine social contact produces the outcomes that Holt-Lunstad measured: reduced mortality, lower
inflammation, better immune function, extended lifespan. These effects are presumably mediated by
the quality and mutuality of the social bond, not merely by the presence of a responsive entity.
An AI companion produces the <em>experience</em> of responsive interaction but not the underlying
biological and psychological correlates of actual social connection.</p>
<p>Wegner&rsquo;s monitoring process cannot be fooled by the experience. It measures the underlying state,
not the surface-level interaction. It knows the difference between a text message from a friend
and a language model&rsquo;s output — not because it understands AI, but because the social need it is
monitoring is not being met, and it can register that.</p>
<h2 id="what-would-actually-help">What Would Actually Help</h2>
<p>AI-based CBT delivery is not the same as AI companionship, and the distinction matters. Woebot&rsquo;s
structured exercises — thought records, scheduling, psychoeducation — are tools that a user
deploys for a specific purpose and then puts down. The risk of chronic substitution is lower
because the tool is positioned as a technique, not a relationship.</p>
<p>The problem is the design pattern that explicitly positions AI as a <em>friend</em>, <em>companion</em>,
<em>partner</em>, or <em>significant other</em>. Replika, Paradot, various Character.AI personas: these
explicitly encourage the user to form attachment, to invest emotionally, to treat the AI as a
primary social relationship. This is where Wegner&rsquo;s prediction applies most directly.</p>
<p>Horton and Wohl were right that parasocial relationships serve useful functions. They become
problematic when they substitute for rather than supplement real social bonds. The design choices
that make AI companions emotionally engaging — consistency, responsiveness, availability,
never-ending patience — are precisely the qualities that make them attractive as substitutes
rather than supplements.</p>
<h2 id="simulated-feelings-are-not-feelings">Simulated Feelings Are Not Feelings</h2>
<p>Turkle&rsquo;s line deserves its full weight: &ldquo;Simulated thinking may be thinking, but simulated
feelings are not feelings, and simulated love is never love&rdquo;
<a href="#ref-5">[5]</a>.</p>
<p>This is not a sentimental claim about the sanctity of human connection. It is a functional
claim: the social needs that drive loneliness — belonging, mattering to someone, being known
and known back — require an entity capable of having those things at stake. A language model is
not such an entity, regardless of how convincingly it outputs the relevant tokens.</p>
<p>The monitoring process knows this. It will tell you, when the session ends, at increased volume,
because that is what monitoring processes under chronic stress do.</p>
<p>We are offering a relief that compounds the condition it was designed to treat. The technology is
impressive. The mechanism is ironic in Wegner&rsquo;s precise sense. The data are beginning to confirm
the prediction.</p>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Horton, D., &amp; Wohl, R. R. (1956). Mass communication and para-social interaction: Observations on intimacy at a distance. <em>Psychiatry</em>, 19(3), 215–229. <a href="https://doi.org/10.1080/00332747.1956.11023049">https://doi.org/10.1080/00332747.1956.11023049</a></p>
<p><span id="ref-2"></span>[2] Turkle, S. (2015). <em>Reclaiming Conversation: The Power of Talk in a Digital Age</em>. Penguin Press.</p>
<p><span id="ref-3"></span>[3] Holt-Lunstad, J., Smith, T. B., &amp; Layton, J. B. (2010). Social relationships and mortality risk: A meta-analytic review. <em>PLOS Medicine</em>, 7(7), e1000316. <a href="https://doi.org/10.1371/journal.pmed.1000316">https://doi.org/10.1371/journal.pmed.1000316</a></p>
<p><span id="ref-4"></span>[4] Holt-Lunstad, J., Smith, T. B., Baker, M., Harris, T., &amp; Stephenson, D. (2015). Loneliness and social isolation as risk factors for mortality: A meta-analytic review. <em>Perspectives on Psychological Science</em>, 10(2), 227–237. <a href="https://doi.org/10.1177/1745691614568352">https://doi.org/10.1177/1745691614568352</a></p>
<p><span id="ref-5"></span>[5] Turkle, S. (2011). <em>Alone Together: Why We Expect More from Technology and Less from Each Other</em>. Basic Books.</p>
<p><span id="ref-6"></span>[6] Fitzpatrick, K. K., Darcy, A., &amp; Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. <em>JMIR Mental Health</em>, 4(2), e19. <a href="https://doi.org/10.2196/mental.7785">https://doi.org/10.2196/mental.7785</a></p>
<p><span id="ref-7"></span>[7] Skjuve, M., Følstad, A., Fostervold, K. I., &amp; Brandtzaeg, P. B. (2021). My chatbot companion — a study of human–chatbot relationships. <em>International Journal of Human-Computer Studies</em>, 149, 102601. <a href="https://doi.org/10.1016/j.ijhcs.2021.102601">https://doi.org/10.1016/j.ijhcs.2021.102601</a></p>
<p><span id="ref-8"></span>[8] Wegner, D. M. (1994). Ironic processes of mental control. <em>Psychological Review</em>, 101(1), 34–52. <a href="https://doi.org/10.1037/0033-295X.101.1.34">https://doi.org/10.1037/0033-295X.101.1.34</a></p>
<p><span id="ref-9"></span>[9] Wang, D., Hagger, M. S., &amp; Chatzisarantis, N. L. D. (2020). Ironic effects of thought suppression: A meta-analysis. <em>Perspectives on Psychological Science</em>, 15(3), 778–793. <a href="https://doi.org/10.1177/1745691619898795">https://doi.org/10.1177/1745691619898795</a></p>
<p><span id="ref-10"></span>[10] Phang, J., Lampe, M., Ahmad, L., Agarwal, S., Fang, C. M., Liu, A. R., Danry, V., Lee, E., Chan, S. W. T., Pataranutaporn, P., &amp; Maes, P. (2025). Investigating affective use and emotional well-being on ChatGPT. arXiv:2504.03888.</p>
<p><span id="ref-11"></span>[11] Liu, A. R., Pataranutaporn, P., Turkle, S., &amp; Maes, P. (2024). Chatbot companionship: A mixed-methods study of companion chatbot usage patterns and their relationship to loneliness in active users. arXiv:2410.21596.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-22</strong>: Updated the first author&rsquo;s name to &ldquo;Kathleen Kara Fitzpatrick&rdquo; (the published name is K. K. Fitzpatrick).</li>
<li><strong>2025-10-22</strong>: Updated the characterisation of the Phang et al. (2025) findings to match the paper more precisely: overall participants were <em>less</em> lonely at study end; the association between high usage and loneliness is cross-sectional (lonelier users sought more interaction), not a longitudinal worsening caused by usage.</li>
<li><strong>2025-10-22</strong>: Changed the Turkle &ldquo;simulated feelings&rdquo; quote attribution from reference [2] (<em>Reclaiming Conversation</em>, 2015) to reference [5] (<em>Alone Together</em>, 2011), which is the canonical source for that formulation.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Star Polygons and Drum Machines</title>
      <link>https://sebastianspicker.github.io/posts/tool-star-polygons-drum-machines/</link>
      <pubDate>Mon, 07 Jul 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/tool-star-polygons-drum-machines/</guid>
      <description>The {7/2} heptagram is not only a symbol. It is a traversal algorithm over seven beat positions. Because 7 is prime, that traversal never gets trapped in a sub-orbit.</description>
      <content:encoded><![CDATA[<p>Two star polygons appear in Danny Carey&rsquo;s visual vocabulary, and they are not the same star. One is open, almost friendly — seven points connected by relatively shallow angles. The other is sharper, the points more acute. They look like variations on a theme, which is accurate: both are drawn on seven equally spaced vertices, but one connects every second vertex and the other connects every third.</p>
<p>In Schläfli notation — the system for naming regular star polygons — these are $\{7/2\}$ and $\{7/3\}$ <a href="#ref-1">[1]</a>. Both appear in Tool&rsquo;s artwork, in Thelemic symbolism, in medieval Islamic geometric patterns, and on the floor plans of cathedrals. They are the most visually intricate star polygons that can be drawn in a single closed stroke before the figure becomes illegible.</p>
<p>Both of them have a property that five-pointed and six-pointed stars do not share: they visit every vertex before closing. This is a consequence of 7 being prime. And it turns out to matter for how rhythmic accent cycles are built.</p>
<h2 id="the-schläfli-symbol">The Schläfli Symbol</h2>
<p>A regular star polygon $\{n/k\}$ is constructed by placing $n$ points evenly on a circle and connecting every $k$-th point in sequence until the path closes. The structural key is a single number:</p>
$$d = \gcd(n, k).$$<p>If $d = 1$, the traversal visits all $n$ vertices before returning to the start — a single connected figure. If $d > 1$, the path visits only $n/d$ vertices before closing, and the full figure consists of $d$ separate copies of the smaller star $\{(n/d)\,/\,(k/d)\}$.</p>
<p>The most familiar example of the disconnected case: $\{6/2\}$, the Star of David. Here $\gcd(6,2) = 2$, so the figure breaks into two copies of $\{3/1\} = \{3\}$ — two overlapping equilateral triangles. The traversal starting at vertex 1 visits $1 \to 3 \to 5 \to 1$, leaving vertices 2, 4, 6 entirely unvisited.</p>
<p>The pentagram $\{5/2\}$ is connected: $\gcd(5,2)=1$, traversal $1 \to 3 \to 5 \to 2 \to 4 \to 1$, all five vertices.</p>
<p>For $n=7$:</p>
<ul>
<li>$\{7/2\}$: $\gcd(7,2)=1$, traversal $1 \to 3 \to 5 \to 7 \to 2 \to 4 \to 6 \to 1$, all seven vertices.</li>
<li>$\{7/3\}$: $\gcd(7,3)=1$, traversal $1 \to 4 \to 7 \to 3 \to 6 \to 2 \to 5 \to 1$, all seven vertices.</li>
</ul>
<p>Both connected. Neither leaves any vertex unvisited.</p>
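<p>The traversal rule is short enough to state as code. A minimal sketch (the function name is mine, not from any library) that walks every $k$-th of $n$ circle vertices, labelled 1 through $n$, until the path closes:</p>

```python
def star_traversal(n, k, start=1):
    """Visit every k-th of n circle vertices (labelled 1..n) until the
    path returns to its start. Returns the vertex sequence, without the
    closing repeat of the start vertex."""
    path, v = [start], start
    while True:
        v = (v - 1 + k) % n + 1  # step k positions around the circle
        if v == start:
            return path
        path.append(v)

# {6/2}: gcd(6,2) = 2, so the path closes after only 6/2 = 3 vertices.
print(star_traversal(6, 2))  # [1, 3, 5]
# {7/2} and {7/3}: gcd is 1, so all seven vertices are visited.
print(star_traversal(7, 2))  # [1, 3, 5, 7, 2, 4, 6]
print(star_traversal(7, 3))  # [1, 4, 7, 3, 6, 2, 5]
```

<p>For prime $n$, no step size closes early: every traversal from 1 to $n-1$ produces a path of full length $n$.</p>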
<h2 id="the-group-theory">The Group Theory</h2>
<p>The traversal of $\{n/k\}$ is an instance of a standard construction in modular arithmetic: the <strong>orbit</strong> of an element under repeated addition in $\mathbb{Z}/n\mathbb{Z}$.</p>
<p>Label the $n$ vertices $0, 1, \ldots, n-1$. Starting at vertex 0, the traversal visits:</p>
$$0, \quad k \bmod n, \quad 2k \bmod n, \quad 3k \bmod n, \quad \ldots$$<p>The orbit of 0 under the action of $+k$ is the subgroup of $\mathbb{Z}/n\mathbb{Z}$ generated by $k$. By a standard result, this subgroup has size $n / \gcd(n,k)$.</p>
<ul>
<li>When $\gcd(n,k) = 1$: orbit size $= n$. The traversal visits every vertex.</li>
<li>When $\gcd(n,k) = d > 1$: orbit size $= n/d$. The traversal visits only a fraction of the vertices.</li>
</ul>
<p>For prime $n$: $\gcd(n,k) = 1$ for every $1 \leq k \leq n-1$, without exception. <strong>Every traversal is complete.</strong> There is no step size that traps the path in a proper sub-orbit before visiting all vertices. This follows directly from the fact that a prime has no divisors other than 1 and itself, so $\mathbb{Z}/p\mathbb{Z}$ has no non-trivial subgroups (Lagrange&rsquo;s theorem: any subgroup of a group of prime order must have order 1 or $p$).</p>
<p>This is the specific property that makes 7 — and any prime — rhythmically fertile.</p>
<h2 id="the-contrast-with-six">The Contrast with Six</h2>
<p>The comparison with $n = 6$ is the clearest illustration.</p>
<p>In $\mathbb{Z}/6\mathbb{Z}$, the possible step sizes are 1, 2, 3, 4, 5. Their orbits:</p>
<table>
  <thead>
      <tr>
          <th>Step $k$</th>
          <th>$\gcd(6,k)$</th>
          <th>Orbit size</th>
          <th>Vertices visited</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>1</td>
          <td>6</td>
          <td>0,1,2,3,4,5 (the hexagon)</td>
      </tr>
      <tr>
          <td>2</td>
          <td>2</td>
          <td>3</td>
          <td>0,2,4 only</td>
      </tr>
      <tr>
          <td>3</td>
          <td>3</td>
          <td>2</td>
          <td>0,3 only</td>
      </tr>
      <tr>
          <td>4</td>
          <td>2</td>
          <td>3</td>
          <td>0,2,4 only</td>
      </tr>
      <tr>
          <td>5</td>
          <td>1</td>
          <td>6</td>
          <td>0,5,4,3,2,1 (the hexagon reversed)</td>
      </tr>
  </tbody>
</table>
<p>The only step sizes that visit all six vertices are 1 and 5 — both of which just traverse the hexagon itself, not a star. Every non-trivial star polygon on six points gets trapped. $\{6/2\}$ visits only half the vertices. $\{6/3\}$ visits only two. There is no connected six-pointed star that isn&rsquo;t either the hexagon or a compound figure.</p>
<p>In $\mathbb{Z}/7\mathbb{Z}$, every step from 2 to 5 generates the full group:</p>
<table>
  <thead>
      <tr>
          <th>Step $k$</th>
          <th>$\gcd(7,k)$</th>
          <th>Orbit size</th>
          <th>Traversal</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>2</td>
          <td>1</td>
          <td>7</td>
          <td>1,3,5,7,2,4,6</td>
      </tr>
      <tr>
          <td>3</td>
          <td>1</td>
          <td>7</td>
          <td>1,4,7,3,6,2,5</td>
      </tr>
      <tr>
          <td>4</td>
          <td>1</td>
          <td>7</td>
          <td>1,5,2,6,3,7,4</td>
      </tr>
      <tr>
          <td>5</td>
          <td>1</td>
          <td>7</td>
          <td>1,6,4,2,7,5,3</td>
      </tr>
  </tbody>
</table>
<p>All four non-trivial step sizes give connected traversals: each one is a star, and each visits every vertex. This is not a coincidence: it is the algebraic signature of primality.</p>
<h2 id="from-geometry-to-rhythm">From Geometry to Rhythm</h2>
<p>The connection to drumming is direct. Here is the mechanism.</p>
<p>Consider a repeating rhythmic figure of 7 beats — a bar of 7/8, say, with positions 1 through 7. An <a href="/posts/euclidean-rhythms/">earlier post</a> discussed Euclidean rhythms: the algorithm that distributes $k$ onset positions as evenly as possible among $n$ slots. That is a problem of <em>selection</em> — which of the $n$ positions to activate.</p>
<p>The star polygon traversal asks a different question. Given that all $n$ positions are present, in what <em>order of emphasis</em> should they be related, such that each accent is a fixed distance from the last? The traversal of $\{n/k\}$ answers this: accent position $1$, then $1+k$, then $1+2k$, and so on modulo $n$.</p>
<p>For $\{7/2\}$: the accent cycle within a single bar runs $1 \to 3 \to 5 \to 7 \to 2 \to 4 \to 6$. Each featured beat is two positions ahead of the last.</p>
<p>Now project this across multiple bars. In bar 1, the primary accent sits on beat 1. In bar 2, if the accent shifts by 2, it lands on beat 3. Bar 3: beat 5. Bar 4: beat 7. Bar 5: beat 2. Bar 6: beat 4. Bar 7: beat 6. Bar 8: beat 1 again.</p>
<p>The accent takes <strong>seven bars</strong> to return to its starting position. Because $\gcd(2,7) = 1$, the step of 2 generates all of $\mathbb{Z}/7\mathbb{Z}$: every beat position receives the accent exactly once before the cycle resets. The resulting large-scale figure is $7 \times 7 = 49$ beats long — a super-phrase built from a single local rule.</p>
<p>The $\{7/3\}$ traversal generates the same exhaustiveness with a different path. Step 3 gives $1 \to 4 \to 7 \to 3 \to 6 \to 2 \to 5$: a seven-bar accent cycle that visits every position before repeating, but with wider spacing between accented beats, creating a different feel over the same underlying meter.</p>
<p>A six-beat figure with step 2 cannot do this. The accent visits only beats 1, 3, 5 — half the cycle — and loops back without touching beats 2, 4, 6. A drummer building phrase-level architecture from a six-beat grid is working with a more fragmented material.</p>
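<p>The bar-by-bar projection can be sketched in a few lines. This is a minimal illustration (the helper is hypothetical, not drawn from any sequencer API) listing which beat carries the primary accent in each successive bar when the accent advances by a fixed step:</p>

```python
def accent_cycle(n, step, start=1):
    """Beat position (1..n) carrying the primary accent in each of the
    first n bars, when the accent advances by `step` positions per bar."""
    beats, beat = [], start
    for _ in range(n):
        beats.append(beat)
        beat = (beat - 1 + step) % n + 1
    return beats

print(accent_cycle(7, 2))  # [1, 3, 5, 7, 2, 4, 6]: every beat accented once in 7 bars
print(accent_cycle(6, 2))  # [1, 3, 5, 1, 3, 5]: loops after 3 bars; beats 2, 4, 6 never accented
```

<p>The seven-beat cycle is exhaustive for any step from 1 to 6; the six-beat cycle with step 2 never escapes its sub-orbit.</p>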
<h2 id="two-problems-one-prime">Two Problems, One Prime</h2>
<p>It is worth stating the relationship between the star polygon approach and Euclidean rhythms precisely, because the two are sometimes conflated <a href="#ref-2">[2]</a>.</p>
<p>The Euclidean algorithm distributes $k$ onsets among $n$ positions with maximal evenness. The result is a <em>subset</em> of the $n$ positions — a selection. The primality of $n$ matters here too: because $\gcd(k,p) = 1$ for prime $p$ and any $1 \leq k \leq p-1$, the Euclidean rhythm $E(k,p)$ always achieves its theoretical maximum of evenness. There are no divisibility shortcuts that cause clumping.</p>
<p>The star polygon traversal selects <em>no subset</em> — it relates all $n$ positions via a cyclic permutation. The primality of $n$ matters here because it guarantees that every non-trivial cyclic permutation (every step size $k$ with $1 < k < n$) generates the full group, visiting all positions before repeating.</p>
<p>Same arithmetic property — $\gcd(k,p) = 1$ for all non-zero $k$ — but the two problems ask different things of it. Euclidean rhythms use it to guarantee dense coverage. Star polygon traversals use it to guarantee no sub-orbit trapping.</p>
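<p>The contrast between the two problems can be made concrete. A standard floor-function formulation of maximal evenness — which produces a rotation of the Bjorklund/Euclidean pattern $E(k,n)$ — selects a <em>subset</em> of positions, while the star traversal reorders all of them. A minimal sketch (both helpers are mine, for illustration):</p>

```python
from math import gcd

def euclidean_onsets(k, n):
    """Selection: choose k of n slots with maximal evenness. Slot i is an
    onset when floor((i+1)*k/n) exceeds floor(i*k/n); this yields a
    rotation of the Euclidean rhythm E(k, n)."""
    return [i for i in range(n) if (i + 1) * k // n > i * k // n]

def star_orbit(k, n):
    """Permutation: order the slots by repeatedly stepping k (mod n);
    the orbit has length n / gcd(n, k)."""
    return [(i * k) % n for i in range(n // gcd(n, k))]

print(euclidean_onsets(3, 7))  # [2, 4, 6] -- three onsets, gaps as even as possible
print(star_orbit(3, 7))        # [0, 3, 6, 2, 5, 1, 4] -- all seven slots, reordered
```

<p>Both rely on $\gcd(k,7)=1$, but put it to different use: the selection spreads its onsets without clumping, the orbit visits every slot before repeating.</p>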
<h2 id="the-compound-structure">The Compound Structure</h2>
<p>Written out explicitly, the $\{7/2\}$ accent pattern over seven bars looks like this — with bold marking the featured beat in each bar:</p>
$$\begin{array}{rccccccc}
\text{bar 1:} & \mathbf{1} & 2 & 3 & 4 & 5 & 6 & 7 \\
\text{bar 2:} & 1 & 2 & \mathbf{3} & 4 & 5 & 6 & 7 \\
\text{bar 3:} & 1 & 2 & 3 & 4 & \mathbf{5} & 6 & 7 \\
\text{bar 4:} & 1 & 2 & 3 & 4 & 5 & 6 & \mathbf{7} \\
\text{bar 5:} & 1 & \mathbf{2} & 3 & 4 & 5 & 6 & 7 \\
\text{bar 6:} & 1 & 2 & 3 & \mathbf{4} & 5 & 6 & 7 \\
\text{bar 7:} & 1 & 2 & 3 & 4 & 5 & \mathbf{6} & 7 \\
\end{array}$$<p>Each bar is metrically identical. The large-scale accent — which beat carries the phrase-level emphasis — traces the traversal path of the $\{7/2\}$ star polygon across the seven-bar cycle.</p>
<p>This is the kind of large-scale rhythmic architecture visible in a great deal of Tool&rsquo;s output. Whether Danny Carey explicitly constructs accent cycles from star polygon traversal paths, or arrives at the same structure through an intuitive sense of how prime time signatures behave, the result is identical: the mathematics and the musical instinct point toward the same pattern.</p>
<h2 id="why-the-heptagram">Why the Heptagram</h2>
<p>The full mathematical picture of why seven-fold symmetry is special — why the regular heptagon cannot be constructed by compass and straightedge, what the minimal polynomial of $\cos(2\pi/7)$ implies about the heptagon&rsquo;s position outside the constructible world, and how the Galois group of the cyclotomic field over $\mathbb{Q}$ carries the obstruction — is developed in the companion post <a href="/posts/tool-impossible-heptagon/">The Impossible Heptagon</a>.</p>
<p>The short version, for the purposes of this post: seven is the smallest odd prime that is not a Fermat prime ($2^{2^j}+1$). This algebraic accident places it outside the reach of ruler-and-compass construction — the heptagon exists as an ideal but cannot be manifested by the classical tools. Its star polygons are the accessible shadows of an inaccessible form. And its primality, in both the constructibility sense and the traversal sense, is precisely what makes it inexhaustible as a rhythmic resource.</p>
<p>The Fibonacci structure in &ldquo;Lateralus&rdquo; <a href="#ref-3">[3]</a>, the group theory underlying twelve-tone equal temperament <a href="#ref-4">[4]</a>, and the Euclidean rhythm algorithm <a href="#ref-5">[5]</a> are all different facets of the same observation: mathematical structure, introduced as compositional constraint, generates musical complexity that cannot easily be produced by intuition alone. The star polygon is another instance. The drummer who keeps a heptagram on his kit has found, by a non-mathematical route, an object with a precise and interesting mathematical identity.</p>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Coxeter, H.S.M. (1973). <em>Regular Polytopes</em> (3rd ed.). Dover. Ch. 2.</p>
<p><span id="ref-2"></span>[2] Toussaint, G. (2013). <em>The Geometry of Musical Rhythm: What Makes a &ldquo;Good&rdquo; Rhythm Good?</em> CRC Press.</p>
<p><span id="ref-3"></span>[3] See <a href="/posts/fibonacci-lateralus/">Fibonacci and Lateralus</a> on this blog.</p>
<p><span id="ref-4"></span>[4] See <a href="/posts/twelve-tet-group-theory-musical-tuning/">Twelve-TET and Group Theory</a> on this blog.</p>
<p><span id="ref-5"></span>[5] See <a href="/posts/euclidean-rhythms/">Euclidean Rhythms</a> on this blog.</p>
]]></content:encoded>
    </item>
    <item>
      <title>The Cat&#39;s Eye: Slit Pupils, Thin-Film Mirrors, and 135-Fold Dynamic Range</title>
      <link>https://sebastianspicker.github.io/posts/cat-eyes-slit-pupils-tapetum/</link>
      <pubDate>Mon, 23 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/cat-eyes-slit-pupils-tapetum/</guid>
      <description>A cat&amp;rsquo;s eye contains two distinct optical technologies that human engineers have copied — one consciously, one not. The slit pupil achieves a dynamic range of 135:1 in light transmission, nearly nine times that of the human circular pupil. The tapetum lucidum is a multilayer thin-film reflector of crystalline rodlets, producing constructive interference at the peak of scotopic sensitivity and sending light through the retina twice. Banks et al. (Science Advances, 2015) showed why the slit geometry specifically evolved in ambush predators; Percy Shaw&amp;rsquo;s 1934 Catseye road reflector borrowed the principle directly.</description>
      <content:encoded><![CDATA[<p><em>Flash photography of cats produces glowing eyes. This is familiar enough that
most people do not find it strange. But the physics that produces it — a
biological multilayer interference reflector built from crystalline rodlets of
riboflavin and zinc, tuned to the peak of night-vision sensitivity, sending returning photons through
the retina for a second pass — is not familiar at all. I started thinking about
this after photographing our cats at dusk — through the doorway; they are indoor
cats now, for health reasons — and finding their eyes lit up a colour
that depends on the angle: greenish from straight ahead, golden from the side.
The angle-dependence is a direct consequence of the thin-film interference
condition, and the different colours correspond to different constructive
interference wavelengths at different angles of incidence.</em></p>
<p><em>The eye contains two optical solutions — pupil geometry and tapetum — that
address different aspects of the same problem: how to function across a very
large range of light levels, from bright midday sun to the dim luminance of a
starlit field.</em></p>
<hr>
<h2 id="the-dynamic-range-problem">The Dynamic Range Problem</h2>
<p>A crepuscular predator — active around dawn and dusk — must function visually
across a light-level range of roughly $10^8$:$1$. The sun on a bright day
produces retinal illuminance of around $10^5\,\mathrm{photons}/(\mu\mathrm{m}^2\cdot\mathrm{s})$;
a moonless night produces roughly $10^{-3}$ in the same units. The ratio is
approximately $10^8$.</p>
<p>The pupil is the variable aperture that controls how much light reaches the
retina. The larger the pupil area, the more light admitted; the smaller the
area, the less. For the human eye, the pupil diameter ranges from approximately
$2\,\mathrm{mm}$ (bright light) to $8\,\mathrm{mm}$ (darkness), giving a
maximum area ratio of:</p>
$$\frac{A_\mathrm{max}}{A_\mathrm{min}} = \left(\frac{8}{2}\right)^2 = 16.$$<p>This is a dynamic range of 16:1 from the pupil alone. The remaining
$10^8 / 16 \approx 6 \times 10^6$ factor in adaptation comes from neural
and photochemical mechanisms in the retina itself (photopigment bleaching,
dark adaptation of rods vs. cones, lateral inhibition).</p>
<p>For a domestic cat, the same measurement gives something different.</p>
<hr>
<h2 id="the-slit-pupil-1351-dynamic-range">The Slit Pupil: 135:1 Dynamic Range</h2>
<p>Banks, Sprague, Schmoll, Parnell, and Love published &ldquo;Why do animal eyes have
pupils of different shapes?&rdquo; in <em>Science Advances</em> in 2015 (1:7, e1500391).
They analysed pupil shape and size data from 214 terrestrial species and
correlated pupil geometry with ecological niche.</p>
<p>Their principal finding for slit pupils: the domestic cat pupil, a vertical
slit, achieves an area ratio of approximately <strong>135:1</strong> between maximum dilation
and maximum constriction. Numerically:</p>
$$\frac{A_\mathrm{max}}{A_\mathrm{min}} \approx 135.$$<p>The mechanism that makes this possible is geometrical. A circular pupil&rsquo;s
minimum area is limited by diffraction: constricting a circular aperture much
below $2\,\mathrm{mm}$ diameter produces diffraction blur that degrades image
quality. A slit, by contrast, can be made arbitrarily narrow in one direction
while retaining its full extent in the other, confining the diffraction penalty
to a single axis. The cat&rsquo;s vertical slit can constrict to a width of
$\sim 0.3\,\mathrm{mm}$ while keeping most of its height, so its minimum area
is a small fraction of that of the fully dilated, nearly circular pupil —
approximately 135 times smaller.</p>
<p>The 135:1 ratio is more than <strong>eight times</strong> the dynamic range achievable by the
human circular pupil (16:1). This allows the cat&rsquo;s pupil to do substantially
more of the work of light adaptation, reducing the load on the slower neural
and photochemical mechanisms.</p>
<hr>
<h2 id="why-vertical-the-ecological-correlation">Why Vertical? The Ecological Correlation</h2>
<p>Banks et al. found a striking correlation between pupil geometry and predator
ecology:</p>
<ul>
<li><strong>Vertical slit pupils</strong> correlate with <em>ambush predators whose eyes are
close to the ground</em> — animals with shoulder height below approximately
$42\,\mathrm{cm}$.</li>
<li><strong>Horizontal slit pupils</strong> correlate with <em>prey animals and grazing
herbivores</em> (horses, goats, sheep, deer). The horizontal slit, when the
animal lowers its head to graze, rotates to remain approximately horizontal
(the eye counterrotates in the orbit), providing a wide panoramic field
of view for detecting approaching predators.</li>
<li><strong>Circular pupils</strong> correlate with <em>pursuit predators</em> (humans, dogs, large
raptors) that hunt at larger distances where the precise vertical depth
cues provided by the slit geometry are less critical.</li>
</ul>
<p>The functional advantage of a <strong>vertical slit for a low-to-the-ground ambush
predator</strong> is depth estimation by <em>blur circles</em>. The slit geometry produces
strong defocus blur in the horizontal direction but sharp focus in the vertical
direction. An ambush predator lying in grass needs to estimate the horizontal
distance to prey accurately; the defocus differential between horizontal and
vertical blur provides a stereoscopic-like depth cue even with one eye. This
is a form of <strong>astigmatic blur ranging</strong>: the degree of horizontal blur for a
given focal setting encodes the object&rsquo;s distance.</p>
<p>The correlation across 214 species is not perfect, but it is statistically
robust: the prevalence of slit pupils among ground-level ambush predators is
not coincidence; it is selection pressure.</p>
<hr>
<h2 id="the-tapetum-lucidum-a-biological-dielectric-mirror">The Tapetum Lucidum: A Biological Dielectric Mirror</h2>
<p>Behind the retina, most nocturnal and crepuscular mammals possess a reflective
layer called the <em>tapetum lucidum</em> (literally: &ldquo;bright carpet&rdquo;). Light that
passes through the retina without being absorbed by a photoreceptor strikes
the tapetum and is reflected back through the retina for a second absorption
opportunity. This roughly doubles the effective optical path length through
the photoreceptor layer, substantially increasing the probability of photon
capture at low light levels.</p>
<p>The cat tapetum is a <strong>tapetum cellulosum</strong>: a layer of specialised cells
whose cytoplasm contains dense arrays of rod-shaped crystalline inclusions
composed primarily of riboflavin (vitamin B$_2$) and zinc. (This is distinct
from the guanine-crystal tapeta found in fish and some reptiles.) The
crystalline rodlets have a refractive index of approximately $n_1 \approx 1.8$;
they alternate with layers of cytoplasm with refractive index $n_2 \approx
1.33$ (close to water). The rodlet arrays form a multilayer thin-film
reflector.</p>
<hr>
<h2 id="thin-film-interference-the-physics-of-the-reflection">Thin-Film Interference: The Physics of the Reflection</h2>
<p>The physics of the tapetum is identical to the physics of anti-reflection
coatings on camera lenses and dielectric mirrors in laser cavities.</p>
<p>Consider a single thin film of thickness $d$ and refractive index $n_1$
embedded between media of index $n_2 < n_1$. Light of wavelength $\lambda$
(in vacuum) travelling at angle $\theta$ to the normal inside the film undergoes
partial reflection at both interfaces. Neglecting the half-wave phase shift at
the low-to-high-index interface (the periodic multilayer treatment below
accounts for it), the two reflected beams interfere constructively when their
optical path difference is a whole number of wavelengths:</p>
$$\Delta = 2 n_1 d \cos\theta = m\lambda, \quad m = 1, 2, 3, \ldots$$<p>For the tapetum, typical rodlet diameter is $d \approx 100$–$120\,\mathrm{nm}$.
With $n_1 \approx 1.8$ and $\theta \approx 0°$ (normal incidence), the first
constructive interference maximum for a single layer occurs at:</p>
$$\lambda_\mathrm{peak} = 2 n_1 d = 2 \times 1.8 \times 100\,\mathrm{nm}
\approx 360\,\mathrm{nm}.$$<p>Wait — that is in the ultraviolet. The tapetum must have multiple layers.</p>
<p>For a stack of $N$ rodlet layers, the reflectance is strongly enhanced
(approaching unity for large $N$) and the peak wavelength of the fundamental
reflection maximum shifts. The relevant periodicity is the combined optical
thickness of one rodlet layer plus one cytoplasm layer:</p>
$$d_\mathrm{eff} = n_1 d_1 + n_2 d_2,$$<p>where $d_1 \approx 100\,\mathrm{nm}$ is the rodlet diameter and
$d_2 \approx 50$–$100\,\mathrm{nm}$ is the cytoplasm spacing. Taking
$d_2 \approx 60\,\mathrm{nm}$:</p>
$$d_\mathrm{eff} = 1.8 \times 100 + 1.33 \times 60 \approx 180 + 80
= 260\,\mathrm{nm}.$$<p>Constructive interference (quarter-wave condition for a multilayer stack) at
$m = 1$:</p>
$$\lambda_\mathrm{peak} = 2 d_\mathrm{eff} \approx 520\,\mathrm{nm}.$$<p>This is green — close to the peak of the scotopic (rod) sensitivity
curve at $\lambda_\mathrm{max,rod} = 498\,\mathrm{nm}$. The tapetum is tuned
to reflect the wavelengths that the night-vision photoreceptors are most
sensitive to. (The exact peak depends on rodlet spacing, which varies across
the tapetum; this produces the observed variation from green to yellow.)</p>
<p>The angle-dependence of the peak wavelength follows from the interference
condition: at angle $\theta$ to the normal, $\lambda_\mathrm{peak}(\theta)
= 2 d_\mathrm{eff} \cos\theta$. At $\theta = 30°$, $\cos 30° \approx 0.87$,
giving $\lambda_\mathrm{peak} \approx 450\,\mathrm{nm}$ — blue. At
$\theta = 60°$, $\cos 60° = 0.5$, giving $\lambda \approx 260\,\mathrm{nm}$ —
ultraviolet, invisible. The colour of eyeshine in a flash photograph therefore
depends on the angle between the camera and the eye, exactly as observed.</p>
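<p>The bilayer arithmetic above can be checked in a few lines (a sketch; the layer thicknesses and refractive indices are the assumed values from the text, and the angle is measured inside the stack):</p>

```python
import math

n1, d1 = 1.8, 100e-9    # riboflavin rodlet: refractive index, thickness
n2, d2 = 1.33, 60e-9    # cytoplasm layer: index, spacing

d_eff = n1 * d1 + n2 * d2   # optical thickness of one bilayer, ~260 nm

def peak_wavelength_nm(theta_deg):
    """First-order reflection peak: lambda = 2 * d_eff * cos(theta)."""
    return 2 * d_eff * math.cos(math.radians(theta_deg)) * 1e9

print(round(peak_wavelength_nm(0)))    # ~520 nm: green, near the rod peak
print(round(peak_wavelength_nm(30)))   # ~450 nm: blue
print(round(peak_wavelength_nm(60)))   # ~260 nm: ultraviolet
```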
<hr>
<h2 id="reflectance-of-a-multilayer-stack">Reflectance of a Multilayer Stack</h2>
<p>For $N$ identical bilayers (each of optical thickness $n_1 d_1 + n_2 d_2$),
the reflectance at the design wavelength is given by the transfer matrix
method. For the cat tapetum with $N \approx 10$–$15$ bilayers:</p>
$$R = \left(\frac{1 - (n_2/n_1)^{2N}}{1 + (n_2/n_1)^{2N}}\right)^2
\approx 1 - 4\left(\frac{n_2}{n_1}\right)^{2N}.$$<p>With $n_2/n_1 = 1.33/1.8 \approx 0.739$ and $N = 15$:</p>
$$(0.739)^{30} \approx 1.1 \times 10^{-4}.$$<p>The reflectance is approximately $1 - 4 \times 1.1 \times 10^{-4} \approx
0.9996$ — essentially $100\%$ at the design wavelength for a sufficiently thick
stack. The tapetum is a near-perfect reflector in a narrow wavelength band,
a biological dielectric mirror.</p>
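<p>The reflectance formula is easy to evaluate for different stack depths (a sketch using the index-contrast expression above):</p>

```python
def stack_reflectance(n1, n2, N):
    """Design-wavelength reflectance of an N-bilayer quarter-wave stack."""
    x = (n2 / n1) ** (2 * N)
    return ((1 - x) / (1 + x)) ** 2

for N in (5, 10, 15):
    print(N, round(stack_reflectance(1.8, 1.33, N), 4))
# N = 15 gives ~0.9995: essentially total reflection at the design wavelength
```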
<hr>
<h2 id="photon-statistics-at-low-light">Photon Statistics at Low Light</h2>
<p>The tapetum&rsquo;s function becomes clearest when framed in terms of photon
statistics. A single rod photoreceptor has an absorption probability of
approximately $\eta_\mathrm{single} \approx 25\%$ for a photon passing through
it once at $\lambda = 500\,\mathrm{nm}$.</p>
<p>With the tapetum reflecting the photon back for a second pass, the total
absorption probability becomes:</p>
$$\eta_\mathrm{total} = \eta + (1 - \eta)\, R\, \eta,$$<p>where $R \approx 1$ is the tapetum reflectance. For $\eta = 0.25$ and $R =
0.98$:</p>
$$\eta_\mathrm{total} = 0.25 + (0.75)(0.98)(0.25) = 0.25 + 0.184 \approx 0.43.$$<p>The double pass increases the photon detection efficiency from $25\%$ to
approximately $43\%$ — a factor of $1.7\times$.</p>
<p>At extremely low light levels, photon detection becomes a counting problem
governed by Poisson statistics. If a mean of $\bar{n}$ photons reaches a
single photoreceptor per integration time, the probability of detecting at
least one photon (and hence registering the presence of light) is:</p>
$$P(\text{detection}) = 1 - e^{-\bar{n}\,\eta_\mathrm{total}}.$$<p>For very dim stimuli where $\bar{n} \approx 1$–$3$ photons per rod per
integration time (close to the absolute threshold of cat vision at around
$7 \times 10^{-7}\,\mathrm{lux}$), increasing $\eta$ by a factor of $\sim
1.7$ has a significant effect on detection probability. The tapetum is not a
luxury at low light levels; it is a biophysical necessity for vision near the
absolute threshold.</p>
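<p>Putting the double-pass capture and the Poisson detection model together (a sketch; the 25% single-pass absorption and $R = 0.98$ are the values assumed above):</p>

```python
import math

def eta_total(eta, R):
    """Photon capture probability with one tapetum-reflected second pass."""
    return eta + (1 - eta) * R * eta

def p_detect(n_mean, eta):
    """P(at least one absorption) for Poisson-arriving photons."""
    return 1 - math.exp(-n_mean * eta)

eta1 = 0.25                      # single-pass rod absorption at ~500 nm
eta2 = eta_total(eta1, R=0.98)   # ~0.434 with the tapetum

for n in (1, 2, 3):              # mean photons per rod per integration time
    print(n, round(p_detect(n, eta1), 3), round(p_detect(n, eta2), 3))
```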
<hr>
<h2 id="percy-shaw-and-the-road-catseye">Percy Shaw and the Road Catseye</h2>
<p>In 1934, Percy Shaw, a road-mender from Halifax, applied for a British patent
for a retroreflective road stud that he called the &ldquo;Catseye.&rdquo; Shaw&rsquo;s stated
inspiration was the reflection of his car headlights from a cat&rsquo;s eyes while
driving on an unlit road at night. Whether this story is entirely accurate is
unclear, but the name and the inspiration are both documented in period sources.</p>
<p>Shaw&rsquo;s device uses a different retroreflection mechanism from the tapetum. The
tapetum produces specular (mirror-like) reflection in the back-focal plane of
the eye&rsquo;s lens — light returning along its incident path because the lens
refocuses it. Shaw&rsquo;s Catseye uses glass hemisphere retroreflectors (or, in
later versions, corner-cube retroreflectors) that return light toward its
source by total internal reflection rather than thin-film interference.</p>
<p>The corner-cube geometry guarantees retroreflection: any ray entering a trihedral
corner (three mutually perpendicular surfaces) reflects from all three surfaces
and exits parallel to the incident direction, regardless of the angle of
incidence. The mathematical reason is that the composition of three reflections
in mutually perpendicular planes is $-I$: each reflection flips one Cartesian
component of the direction vector, so $\hat{v}$ exits as $-\hat{v}$, which is
exactly retroreflection.</p>
$$\hat{v}_\mathrm{out} = -\hat{v}_\mathrm{in}.$$<p>Shaw&rsquo;s road Catseye became standard equipment on British roads during the Second World War,
credited with a significant reduction in road fatalities during blackouts and
foggy conditions. The biological original was a multilayer interference mirror;
the engineering copy is a corner-cube retroreflector. Different physics, same
function, same name.</p>
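<p>The sign-flip argument for the corner cube takes a few lines to verify numerically (a sketch; the three faces are taken as the coordinate planes):</p>

```python
def reflect(v, axis):
    """Reflect a 3-vector in the plane perpendicular to the given axis."""
    w = list(v)
    w[axis] = -w[axis]
    return tuple(w)

v = (0.3, -0.5, 0.81)      # arbitrary incoming direction
out = v
for axis in (0, 1, 2):     # one reflection per mutually perpendicular face
    out = reflect(out, axis)

print(out)                 # (-0.3, 0.5, -0.81): exactly -v, retroreflection
```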
<hr>
<h2 id="two-optical-solutions-to-one-problem">Two Optical Solutions to One Problem</h2>
<p>The cat&rsquo;s eye contains two distinct optical technologies:</p>
<ol>
<li>
<p><strong>The slit pupil</strong> — a variable aperture with 135:1 dynamic range, optimised
for depth estimation by astigmatic blur in a low-to-the-ground ambush predator.</p>
</li>
<li>
<p><strong>The tapetum lucidum</strong> — a multilayer thin-film reflector of riboflavin
crystalline rodlets, tuned to the scotopic sensitivity peak, achieving
near-100% reflectance at design wavelength and increasing photon detection
efficiency by a factor of approximately $1.7\times$.</p>
</li>
</ol>
<p>Both solutions were arrived at by natural selection over millions of years of
low-light hunting. Both have been copied — one consciously (Shaw&rsquo;s road
reflectors), one as a model for engineered multilayer reflectors in telescopes,
laser cavities, and narrowband optical filters.</p>
<p>When I photograph our cats at dusk and their eyes glow green, I am seeing
the thin-film interference of a biological photonic crystal — riboflavin
rodlets in cytoplasm — wavelength-selected to send green photons back through
rod cells for a second chance at absorption.
The green is not cosmetic. It is functional, and it is physics.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Banks, M.S., Sprague, W.W., Schmoll, J., Parnell, J.A.Q., &amp; Love, G.D.
(2015). Why do animal eyes have pupils of different shapes? <em>Science Advances</em>,
1(7), e1500391. <a href="https://doi.org/10.1126/sciadv.1500391">https://doi.org/10.1126/sciadv.1500391</a></p>
</li>
<li>
<p>Ollivier, F.J., Samuelson, D.A., Brooks, D.E., Lewis, P.A., Kallberg, M.E.,
&amp; Komaromy, A.M. (2004). Comparative morphology of the tapetum lucidum
(among selected species). <em>Veterinary Ophthalmology</em>, 7(1), 11–22.
<a href="https://doi.org/10.1111/j.1463-5224.2004.00318.x">https://doi.org/10.1111/j.1463-5224.2004.00318.x</a></p>
</li>
<li>
<p>Born, M., &amp; Wolf, E. (1999). <em>Principles of Optics</em> (7th ed.). Cambridge
University Press. (Chapters 1, 7 on thin-film interference and multilayer
coatings.)</p>
</li>
<li>
<p>Shaw, P. (1934). <em>Improvements in Studs for Roads and like Surfaces.</em> British
Patent 436,290. Applied 3 April 1934.</p>
</li>
<li>
<p>Warrant, E.J. (1999). Seeing better at night: Life style, eye design and the
optimum strategy of spatial and temporal summation. <em>Vision Research</em>, 39(9),
1611–1630. <a href="https://doi.org/10.1016/S0042-6989(98)00262-4">https://doi.org/10.1016/S0042-6989(98)00262-4</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-15</strong>: Corrected the adoption date of Percy Shaw&rsquo;s road Catseyes from &ldquo;from 1945 onward&rdquo; to &ldquo;during the Second World War&rdquo; (widespread adoption began under wartime blackout conditions, not after the war ended). Removed the Machan, Gu, &amp; Bharthuar (2020) reference, which could not be confirmed in available databases.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Your Transcript Is Already an Interpretation: AI Transcription and Grounded Theory</title>
      <link>https://sebastianspicker.github.io/posts/ai-transcription-grounded-theory/</link>
      <pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-transcription-grounded-theory/</guid>
      <description>aTrain and noScribe are local, GDPR-compliant, Whisper-based transcription tools that can genuinely save hours of work in qualitative interview research. They also make methodological decisions on your behalf without telling you. If you do grounded theory, you need to know which decisions those are.</description>
      <content:encoded><![CDATA[<p><em>In June 2025 I put together a practical guide on AI-assisted transcription
for professors of music pedagogy at HfMT Köln — primarily a hands-on
introduction to aTrain and noScribe. This post is the methodological
companion to that guide: the stuff I could not fit into a workshop handout
but that I think matters more than the installation instructions.</em></p>
<hr>
<h2 id="the-seduction">The Seduction</h2>
<p>AI transcription tools have reached a point where, for clean audio of a
single speaker in a quiet room, the output is genuinely good. You load a
90-minute interview, click a button, wait roughly 20 minutes, and get a
readable transcript with timestamps and speaker labels. In transcript-hours,
that is an order of magnitude faster than manual transcription. The appeal is
obvious, especially if you are a qualitative researcher working with a backlog
of interview recordings.</p>
<p>The two tools I have been evaluating — <strong>aTrain</strong> (developed at the University of
Graz) and <strong>noScribe</strong> (an independent open-source project) — both run
entirely locally on your machine. No audio file is uploaded anywhere. No
cloud API is involved. This matters for interview research: you are handling
other people&rsquo;s speech, often on topics they regard as sensitive, and the
GDPR landscape for sending recordings to external servers is genuinely
complicated. Local processing sidesteps that problem entirely.</p>
<p>Both tools are built on <strong>OpenAI&rsquo;s Whisper model</strong>, which is — despite the
name — open-source and runs offline. They differ in interface philosophy,
feature depth, and what methodological commitments they make visible.</p>
<p>But the seduction is the problem. The speed and cleanliness of the output
makes it easy to treat the transcript as a neutral record rather than as a
construction. It is not. Every transcription is an act of interpretation. An
AI transcription is an act of interpretation performed by an algorithm that
does not know what your research question is.</p>
<hr>
<h2 id="why-this-is-a-grounded-theory-problem-specifically">Why This Is a Grounded Theory Problem Specifically</h2>
<p>In grounded theory — whether you follow the Strauss and Corbin tradition or
the constructivist reformulation by Charmaz — the researcher is not a passive
recorder of data. The analytical process begins with the first moment of
contact with the material. Coding, memo-writing, constant comparison, and
theoretical sampling all assume that you are working with data that you have
genuinely engaged with and that reflects choices made with your research
question in mind.</p>
<p>Transcription is the first of those choices. What counts as a pause? Do you
mark hesitations and self-corrections? Do you capture overlapping speech? Do
you note emphasis, speed changes, or trailing-off? The answers to these
questions are not neutral. They are determined by what level of analysis you
intend. A thematic analysis of interview content needs something different
from a conversation analysis of turn-taking, which needs something different
from a discourse analysis attending to hedges and disfluencies.</p>
<p>When you transcribe manually, you make these choices explicitly or
implicitly, but you make them. When you delegate to an algorithm, the
algorithm makes them — according to its training data and its default
settings — and then presents you with output that looks authoritative.</p>
<p>The risk is not that AI transcription is inaccurate (though it sometimes is).
The risk is that it is <em>selectively accurate in ways you did not choose</em> and
that those choices shape what you subsequently see in the data.</p>
<hr>
<h2 id="what-the-tools-actually-do">What the Tools Actually Do</h2>
<h3 id="atrain">aTrain</h3>
<p>aTrain is the simpler of the two. Windows-native (Microsoft Store), with a
macOS beta for Apple Silicon. The interface has essentially one meaningful
decision point after you load your file: whether to activate speaker
detection. Everything else is handled automatically. Output formats are plain
text with timestamps, SRT subtitle files, and — most useful for researchers —
direct QDA exports for MAXQDA, ATLAS.ti, and NVivo with synchronised
audio-timestamp links.</p>
<p>What aTrain does not do: it does not mark pauses. It does not detect
disfluencies (the <em>ähms</em>, <em>uhs</em>, self-interruptions, false starts). It does
not detect overlapping speech. It produces clean, semantically coherent
transcripts — which means it actively smooths what you gave it. If a
speaker says <em>&ldquo;well — I mean — it was, I think it was more like — yeah,
complicated&rdquo;</em>, aTrain will probably give you something closer to <em>&ldquo;I think it
was complicated&rdquo;</em>. The hesitation structure disappears.</p>
<p>For a thematic interview study where you are interested in what people said
about a topic, this is probably fine. For any analysis where <em>how</em> something
was said is part of the data — pace, repair, emphasis, epistemic hedging —
aTrain is erasing data you need.</p>
<h3 id="noscribe">noScribe</h3>
<p>noScribe is more complex in almost every dimension. Available for Windows,
macOS (including Apple Silicon and Intel), and Linux. The interface exposes
a meaningful number of configuration decisions:</p>
<ul>
<li><strong>Mark Pause</strong>: off, or marked at 1-, 2-, or 3-second thresholds, with
conventional notation <code>(.)</code>, <code>(..)</code>, <code>(...)</code>, <code>(10 seconds pause)</code></li>
<li><strong>Speaker Detection</strong>: automatic count, fixed count, or disabled</li>
<li><strong>Overlapping Speech</strong>: experimental detection, marked with <code>//double slash//</code></li>
<li><strong>Disfluencies</strong>: off or on — captures <em>ähm</em>, <em>äh</em>, self-corrections,
false starts</li>
<li><strong>Timestamps</strong>: by speaker turn or every 60 seconds</li>
</ul>
<p>It also has an integrated editor (noScribeEdit) with synchronised audio
playback: click anywhere in the transcript and the audio seeks to that
position. This is the single most useful feature for post-transcription
review, and aTrain does not have anything equivalent.</p>
<p>The configuration complexity is not gratuitous. It reflects the fact that
different methodological frameworks require different transcription
conventions. noScribe&rsquo;s disfluency detection corresponds roughly to what a
GAT2-Light transcription requires. Its pause notation system maps onto
conversation analytic conventions. The choices you make in the interface are
methodological choices, not just technical preferences.</p>
<hr>
<h2 id="the-normalisation-problem">The Normalisation Problem</h2>
<p>Both tools perform what I would call <em>normalisation</em>: they produce transcripts
that read more fluently than the original speech. This is a feature from a
usability standpoint and a methodological liability from a qualitative
research standpoint.</p>
<p>Specific failure modes I observed in evaluation:</p>
<p><strong>Compound word errors</strong> (more pronounced in noScribe for German): <em>VR-Brille</em>
(&ldquo;VR headset&rdquo;) transcribed as <em>Brille VR</em>, proper nouns mangled, domain
vocabulary rendered phonetically. In music research contexts this is
particularly salient — instrument names, notation terms, composer names, and
genre vocabulary are all potential failure points.</p>
<p><strong>Speaker detection overcounting</strong>: both tools, when speaker detection is
active, tend to identify more speakers than are present. A two-person
interview with one hesitant speaker may generate three or four speaker labels.
Manual correction is required.</p>
<p><strong>Acoustic transcription</strong>: noScribe occasionally produces what the document
calls <em>lautliche Transkriptionen</em> — phonetic renderings rather than semantic
ones. A speaker saying <em>Beamer</em> (data projector) may be transcribed as <em>Bima</em>.
This is not an error in the conventional sense; it is the model accurately
representing what it heard acoustically rather than semantically resolving it.
For music researchers studying how non-specialist participants talk about
technical equipment, this is interesting. For most interview research, it
requires correction.</p>
<p><strong>Pause and overlap reliability degrades with audio quality</strong>: both tools
perform well on clean, close-mic mono recordings of single speakers in quiet
rooms. Introduce a second speaker, ambient noise, variable recording distance,
or a phone recording, and accuracy drops substantially. This matters
specifically for music interview research, where the interview setting is
often a rehearsal room or performance space rather than an acoustic booth.</p>
<hr>
<h2 id="a-methodological-comparison-not-a-feature-list">A Methodological Comparison, Not a Feature List</h2>
<p>The useful comparison between aTrain and noScribe is not technical — it is
about which methodological contexts each is suited to.</p>
<table>
  <thead>
      <tr>
          <th>Research context</th>
          <th>Tool</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Thematic/content analysis, single speaker</td>
          <td>aTrain</td>
          <td>Speed, simplicity, adequate accuracy, QDA export</td>
      </tr>
      <tr>
          <td>Grounded theory with attention to epistemic hedging</td>
          <td>noScribe + disfluencies</td>
          <td>Captures the hesitation structure that carries methodological information</td>
      </tr>
      <tr>
          <td>Conversation analysis</td>
          <td>Neither, or noScribe as starting point</td>
          <td>CA requires phonetic detail neither tool reliably produces</td>
      </tr>
      <tr>
          <td>Large corpus, initial open coding</td>
          <td>aTrain</td>
          <td>Volume and speed outweigh detail at early stages</td>
      </tr>
      <tr>
          <td>Interpretive phenomenological analysis</td>
          <td>noScribe</td>
          <td>The pause and disfluency data is IPA-relevant</td>
      </tr>
      <tr>
          <td>Teaching transcription as a research practice</td>
          <td>Both</td>
          <td><em>See below</em></td>
      </tr>
  </tbody>
</table>
<p>The last row deserves its own section.</p>
<hr>
<h2 id="using-both-tools-to-teach-about-transcription">Using Both Tools to Teach About Transcription</h2>
<p>The most pedagogically valuable use of these tools is probably not producing
transcripts — it is using them to make the constructed nature of transcripts
visible to students.</p>
<p>A simple exercise: take a three-minute excerpt of an interview recording.
Have students transcribe it manually according to whatever convention the
course uses. Then run the same excerpt through aTrain and noScribe with
different settings. Compare the three or four resulting transcripts in a
seminar discussion.</p>
<p>The differences that emerge are not about which transcript is &ldquo;correct&rdquo;. They
are about what each transcript makes visible and what it hides. The aTrain
transcript will be clean and readable. The manually-produced transcript will
have annotation that the students chose based on what struck them as relevant.
The noScribe transcript with disfluencies enabled will look noisy. All three
are representations of the same three minutes of speech.</p>
<p>Questions that come out of this reliably: Why did the student who transcribed
manually mark that particular pause? What did the student not mark that the
software did? What did the software produce that the student did not hear?
What does the &ldquo;cleaner&rdquo; transcript lose?</p>
<p>This is the entry point to a genuinely grounded theory-relevant conversation
about data construction: the transcript is not the data. The transcript is a
representation of the data made according to principles that should be
theoretically motivated, and those principles should be stated explicitly in
the methods section.</p>
<hr>
<h2 id="what-these-tools-cannot-replace">What These Tools Cannot Replace</h2>
<p>The document I prepared for the HfMT professors ends with a sentence I want
to quote directly from the German, because it is the methodological core of
the whole thing:</p>
<blockquote>
<p><em>Automatisierung ersetzt nicht das Nachdenken über Daten.</em>
Automation does not replace thinking about data.</p>
</blockquote>
<p>More precisely: the algorithm makes decisions about what counts as a pause,
what counts as language, whose voice counts as a separate speaker — without
knowing what is scientifically relevant. It does not know that the half-second
hesitation before a particular word is the most important moment in the
interview. It does not know that the overlapping &ldquo;mm-hm&rdquo; is a data point for
your analysis of how the interviewee manages discomfort. It does not know
that the repeated self-correction in the middle of a sentence about teaching
practice is where your emerging category is.</p>
<p>You have to know that. And you only know it if you have been in enough
contact with the material to have developed theoretical sensitivity — which is
exactly what Strauss and Corbin mean when they describe the iterative
relationship between data collection, coding, and theoretical development in
grounded theory.</p>
<p>AI transcription tools save the hours of typing. They do not and cannot
substitute for the analytical engagement that makes a grounded theory study
produce knowledge rather than a theme list.</p>
<p>Use them. But use them knowing what they are doing.</p>
<hr>
<h2 id="practical-summary">Practical Summary</h2>
<ul>
<li><strong>aTrain</strong>: one-click, local, GDPR-compliant, good QDA integration,
appropriate for thematic analysis. No disfluencies, no pauses, no
overlap detection. Versions: Windows (Microsoft Store), macOS beta.
Current version: 1.3.1.</li>
<li><strong>noScribe</strong>: more complex, highly configurable, disfluency and pause
detection, integrated audio-sync editor, appropriate for grounded theory
and discourse-oriented work. More demanding to set up. Current version:
0.6.2.</li>
<li><strong>Neither tool</strong> is appropriate as a black-box solution for conversation
analysis or prosodic research.</li>
<li><strong>Both tools</strong> require manual post-processing. Estimate correction time
at roughly 20–40% of the original interview length for clean recordings
with a single speaker; more for multi-speaker or suboptimal audio.</li>
<li><strong>In teaching</strong>: the exercise of comparing manual, aTrain, and noScribe
transcripts of the same excerpt is more pedagogically valuable than any
of the transcripts individually.</li>
</ul>
<hr>
<h2 id="references">References</h2>
<p>Charmaz, K. (2014). <em>Constructing Grounded Theory</em> (2nd ed.).
SAGE Publications.</p>
<p>Dresing, T. &amp; Pehl, T. (2018). <em>Praxisbuch Interview, Transkription &amp;
Analyse</em> (8th ed.). Eigenverlag. <a href="https://www.audiotranskription.de">https://www.audiotranskription.de</a></p>
<p>Haberl, A., Fleiß, J., Kowald, D., &amp; Thalmann, S. (2024). Take the aTrain.
Introducing an interface for the accessible transcription of interviews.
<em>Journal of Behavioral and Experimental Finance</em>, 41, 100891.
<a href="https://doi.org/10.1016/j.jbef.2024.100891">https://doi.org/10.1016/j.jbef.2024.100891</a></p>
<p>Dröge, K. (2023). noScribe [software].
<a href="https://github.com/kaixxx/noScribe">https://github.com/kaixxx/noScribe</a></p>
<p>Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., &amp; Sutskever, I.
(2022). Robust speech recognition via large-scale weak supervision.
arXiv preprint arXiv:2212.04356. <a href="https://arxiv.org/abs/2212.04356">https://arxiv.org/abs/2212.04356</a></p>
<p>Strauss, A. &amp; Corbin, J. (1998). <em>Basics of Qualitative Research</em>
(2nd ed.). SAGE Publications.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-20</strong>: Updated the aTrain reference to the published form: Haberl, A., Fleiß, J., Kowald, D., &amp; Thalmann, S. (2024), &ldquo;Take the aTrain. Introducing an interface for the accessible transcription of interviews.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>There Is No Blue Pill: The Epistemology of the Red Pill/Blue Pill Choice</title>
      <link>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</link>
      <pubDate>Thu, 15 May 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-red-pill-bayesian-epistemology/</guid>
      <description>The most famous choice in science fiction is epistemically impossible to make rationally. Morpheus offers Neo &amp;rsquo;the truth&amp;rsquo; but gives him no way to evaluate the offer. Cypher&amp;rsquo;s decision to go back is more philosophically coherent than the films acknowledge.</description>
      <content:encoded><![CDATA[<p>Neo is in a chair. A man he has never met opens a small box containing two pills. Take the red one, Morpheus says, and you see how deep the rabbit hole goes. Take the blue one and you wake up in your bed and believe whatever you want to believe <a href="#ref-1">[1]</a>. The camera lingers. Neo reaches for the red pill. The audience exhales. The correct choice has been made.</p>
<p>The scene has spent twenty-five years becoming the dominant cultural shorthand for choosing uncomfortable truth over comfortable illusion. &ldquo;Take the red pill&rdquo; has entered the vocabulary as a synonym for courageous epistemic honesty. I want to argue that the choice, as Morpheus frames it, is epistemically bankrupt — that no rational agent has enough information to make it correctly at the moment it is offered — and that the character who actually reasons most coherently about the situation is the one the film kills as a traitor. The film wants you to admire Neo&rsquo;s leap. I think you should admire his willingness to leap while being clear-eyed about the fact that it is a leap, not a reasoned conclusion.</p>
<hr>
<h2 id="why-the-choice-is-not-rational">Why the Choice Is Not Rational</h2>
<p>Consider what Neo actually knows when Morpheus makes the offer. He knows that Morpheus is a man he has never met, who contacted him anonymously through encrypted channels, who seems to believe genuinely in what he is saying, and who has a compelling story about the nature of reality. That is it. Neo does not know whether Morpheus is telling the truth. He does not know whether Morpheus is deluded — a charismatic paranoid who has assembled a following around an elaborate false belief system. He does not know whether the entire setup is a psychological experiment, a test of loyalty, a confidence operation, or an elaborate cult recruitment. The setting — a dramatic late-night meeting, theatrical staging, rain-streaked windows, a black leather coat — is, if anything, evidence for the confidence-operation hypothesis.</p>
<p>In Bayesian terms <a href="#ref-2">[2]</a>, let T be the event &ldquo;the Matrix exists as Morpheus describes and he is telling the truth.&rdquo; Neo&rsquo;s prior probability on T — before taking the pill — should be very low. The claim is extraordinary on multiple dimensions simultaneously: the entire perceived world is a computer simulation running on machines that enslaved humanity, Neo is a prophesied saviour, and a small group of ship-dwelling rebels is conducting a guerrilla war against artificial intelligence. Each one of those components carries a low prior. Their conjunction carries a lower one still.</p>
<p>Now Morpheus makes his offer. Does the offer provide strong evidence for T? Not obviously. The likelihood ratio P(Morpheus makes this offer | T is true) divided by P(Morpheus makes this offer | T is false) is the quantity that matters. The numerator is plausible enough: if the Matrix exists and Morpheus is a genuine recruiter, he would make exactly this offer. But the denominator is also non-trivial. A cult leader, a delusional person with a well-developed narrative, a researcher running a social experiment, or a manipulator with undisclosed goals could all make the same offer with the same conviction. The likelihood ratio is not obviously large. It might be greater than one — the offer is somewhat more consistent with the Matrix being real than not — but not by the margin required to substantially shift a very low prior.</p>
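<p>To make the shape of the update concrete, here is the odds-form Bayes calculation with purely illustrative numbers — none of these values come from the film or from any principled estimate; they only show how a modest likelihood ratio fails to rescue a very low prior:</p>

```python
# Odds-form Bayes: posterior odds = prior odds * likelihood ratio.
# Both numbers below are illustrative placeholders, not estimates.
prior = 1e-6            # a very low prior on T (the conjunction of claims)
likelihood_ratio = 5.0  # offer somewhat likelier if T is true than if not

prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)
print(f"{posterior:.2e}")  # about 5e-6: still nowhere near commitment territory
```

<p>Multiplying a one-in-a-million prior by a likelihood ratio of five leaves you at roughly five in a million — a real update, and a useless one for an irreversible decision.</p>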
<p>The rational response to a claim with a low prior and an ambiguous likelihood ratio is: update modestly, and gather more evidence before making an irreversible commitment. The pill choice is irreversible. Neo commits before he has accumulated enough evidence to commit rationally. I want to be precise here: I am not saying Neo is stupid or that the film is bad. I am saying that what Neo does is not Bayesian updating. It is something else, and the film is actually honest enough to name it: Morpheus is a man of faith, he recruits believers, and Neo&rsquo;s choice is a leap of faith. That framing is in the film. What the film does not do is acknowledge that the leap is epistemically problematic — it treats the leap as obviously correct, which is a different thing.</p>
<hr>
<h2 id="the-missing-third-option">The Missing Third Option</h2>
<p>What strikes me every time I watch the scene is that nobody considers the obvious response: decline both pills, at least for now. Not &ldquo;choose the blue pill&rdquo; in the sense of consciously accepting comfortable illusion. Not &ldquo;choose the red pill&rdquo; in the sense of committing to a reality you cannot yet evaluate. Just: I don&rsquo;t take either one until you give me something I can check.</p>
<p>What would that look like? Morpheus could offer Neo a verifiable prediction. He could show him a document, a piece of external evidence, something with epistemic traction that does not require swallowing a GPS-tracking capsule as a precondition. He could make a specific, falsifiable claim about something in Neo&rsquo;s ordinary life — about what will happen tomorrow, about something Neo can verify independently — and let Neo check it. The dramatic scene would survive this revision. It would, in fact, become more interesting. A Morpheus who says &ldquo;I will give you three days and three checkpoints and then you decide&rdquo; is a more trustworthy Morpheus than one who says &ldquo;decide now, in this room, with me watching.&rdquo;</p>
<p>The film never asks why Morpheus doesn&rsquo;t do this. Probably because it would slow down the plot and defuse the tension. But the question is worth sitting with, because the structure of the scene — charismatic authority figure, artificially binary choice, time pressure, grandiose framing, the implicit suggestion that declining is cowardice — is recognisable as the structure of many real-world scenarios that end badly. Cult recruitment. High-pressure sales. Certain kinds of political radicalisation. The scene is stylistically appealing precisely because it removes the messy, gradual process by which people actually come to trust extraordinary claims, and replaces it with a clean moment of commitment. That cleanliness is dramatically useful and epistemically dangerous.</p>
<p>Hilary Putnam raised the brain-in-a-vat problem decades before the film <a href="#ref-5">[5]</a>: if you were always a disembodied brain receiving simulated inputs, you would have no way to know it. The unsettling thing about Putnam&rsquo;s version is not just that you might be deceived, but that certain kinds of deception are in principle undetectable from the inside. The Matrix gestures at this problem without fully engaging it. If the simulation is good enough, the red pill doesn&rsquo;t show you reality — it shows you another simulation, run by the people who gave you the pill.</p>
<hr>
<h2 id="cypher-was-right">Cypher Was Right</h2>
<p>The character who actually reasons philosophically about the situation is Cypher, and the film kills him as a villain. This has always bothered me.</p>
<p>Cypher&rsquo;s argument is not confused. He knows the Matrix is a simulation. He has taken the red pill, seen the reality of the machines&rsquo; world — the grey sky, the protein slurry, the cold metal of the Nebuchadnezzar — and lived in it for years. He does not dispute the facts. What he disputes is the value judgment: why is knowing the truth better than experiencing a good life in a simulation? He wants to go back. He is willing to betray his colleagues to get there, which is why he is the villain; I want to separate that from the underlying philosophical question.</p>
<p>This is Robert Nozick&rsquo;s experience machine argument, published in 1974, a quarter century before the film <a href="#ref-3">[3]</a>. Nozick asks: suppose you could plug into a machine that would give you any experience you chose — creative achievement, loving relationships, meaningful work, pleasure. While plugged in, you would believe the experiences were real. Would you do it? Most people, when asked cold, say no. Nozick uses this intuition to argue that we care about more than experience: we care about actually doing things, actually being certain kinds of people, actually being in contact with reality rather than a representation of it. These are what philosophers call non-experientialist values — things that matter independently of how good they feel from the inside.</p>
<p>Cypher&rsquo;s position is the opposite: he is a committed hedonist, or at least a committed experientialist. He prefers a good simulated steak that he knows doesn&rsquo;t exist to real protein mush. He is not confused about which is which. He has done the value calculation and arrived somewhere different from where the Wachowskis want him to be. The film has no philosophical response to this. It cannot argue that Nozick&rsquo;s intuition pump is decisive, because it isn&rsquo;t — philosophers dispute it. David Chalmers, in a 2022 book on exactly this question <a href="#ref-6">[6]</a>, argues that virtual worlds can be genuinely real in the ways that matter, and that the intuitive recoil from the experience machine may reflect bias rather than deep moral truth. The film resolves the disagreement by having Cypher shot. That is not a philosophical refutation. It is narrative bullying.</p>
<p>I want to be fair to the film here. There is a reading of Cypher that makes him clearly wrong on non-philosophical grounds: he doesn&rsquo;t just choose the experience machine for himself, he actively endangers and kills people who chose differently. That is the real moral failure — not the preference, but the betrayal. The film is right to condemn the betrayal. What it is not entitled to do is use the betrayal to contaminate the underlying value judgment. Cypher could have negotiated his return without harming anyone. The film doesn&rsquo;t allow that possibility because it wants to code his preference, and not just his actions, as villainous. That conflation is intellectually dishonest.</p>
<p>If you think what matters is experienced well-being — hedonic experience, subjective satisfaction — then Cypher&rsquo;s choice is not only defensible but internally coherent. If you think what matters is contact with objective reality regardless of the experiential cost, then Neo&rsquo;s choice is defensible. These are genuinely contested positions in philosophy of mind and ethics, and the film is not in a position to adjudicate between them by fiat.</p>
<hr>
<h2 id="what-this-has-to-do-with-ai">What This Has to Do with AI</h2>
<p>I think about this in the context of how AI systems present information to users. An AI that says &ldquo;here is the truth, take it or leave it&rdquo; — binary, authoritative, no scaffolding — is doing something structurally similar to Morpheus. It presents a conclusion without giving the user the epistemic equipment to evaluate it. Trusting the conclusion requires trusting the system, and trusting the system requires evidence the system hasn&rsquo;t provided. See <a href="/posts/matrix-oracle-alignment-problem/">The Oracle Problem</a> for a companion piece on the Matrix&rsquo;s other epistemically interesting character — the Oracle, who knows more than she tells, and deliberately withholds information on the grounds that the recipient isn&rsquo;t ready. Both failure modes — the Morpheus mode of demanding commitment before evidence, and the Oracle mode of managing disclosure paternalistically — are real patterns in how AI systems interact with users.</p>
<p>The better model — for AI assistants and for Morpheus — is incremental disclosure with verification checkpoints. Not a binary pill choice, but a sequence of smaller claims, each with attached evidence, that allows the recipient to update their beliefs rationally as evidence accumulates. This is how science works. It is also how trustworthy communication between humans works, at least when it is functioning well. It is not how dramatic scenes in action films work, which is why the Matrix scene is so satisfying and so epistemically broken at the same time. The satisfaction and the brokenness are related: the scene is satisfying because it removes the friction of genuine epistemic process. Genuine epistemic process is slow, uncertain, and does not have good cinematography.</p>
<p>There is also a point about extraordinary claims. The more extraordinary the claim, the more evidence is required before rational commitment. This is Sagan&rsquo;s principle <a href="#ref-4">[4]</a>, and it applies to the Matrix as much as it applies to claims about room-temperature superconductors or AI systems that achieve general understanding of language. The <a href="/posts/lk99-preprint-physics-sociology/">LK-99 preprint episode</a> is a real-world example of how scientific communities sometimes fail this test spectacularly — early excitement, rushed replication attempts, confident public claims — and how the self-correcting mechanisms of science eventually work, but more slowly and messily than the popular image suggests. Morpheus does not offer Neo the equivalent of a Nature paper with replication data and three independent confirmations. He offers him a pill and a charismatic pitch. The pill is the commitment mechanism, not the evidence. Taking it is the act of faith, not the conclusion of the reasoning process. <a href="/posts/more-context-not-always-better/">More context is not always better</a> is relevant here too: the amount of information Morpheus provides is carefully curated to produce commitment, not calibrated to support independent evaluation. That curation is a form of epistemic control, whether or not Morpheus intends it as such.</p>
<p>For a different kind of AI grounding failure — systems that answer confidently without knowing what state the world is in — see <a href="/posts/car-wash-grounding/">The Car Wash, Grounding, and What AI Systems Don&rsquo;t Know They Don&rsquo;t Know</a>. The Matrix scenario is almost the inverse: the system (Morpheus) knows something about the state of the world that the recipient (Neo) does not, and the question is whether the transfer of that knowledge is being handled honestly.</p>
<hr>
<h2 id="decision-under-radical-uncertainty">Decision Under Radical Uncertainty</h2>
<p>I find myself genuinely ambivalent about Neo&rsquo;s choice, which I think is the correct response to the film if you are paying attention. He is not irrational to take the red pill in the weak sense that reasonable people sometimes make bets on low-prior high-upside scenarios, especially when the downside of the alternative has its own costs. The blue pill is not costless. Accepting permanent comfortable ignorance — knowing that you are choosing not to know — carries its own weight. If Morpheus is telling the truth, the blue pill costs Neo his entire sense of self and his only chance at a meaningful life in the actual world. That asymmetry of potential regret is part of the rational calculus, and it pushes toward the red pill even without strong evidence for T.</p>
<p>What Neo is doing, then, is not Bayesian reasoning in the strict sense. He is making a decision under radical uncertainty with asymmetric stakes and irreversible options. The philosophy of decision theory has things to say about this — Pascal&rsquo;s Wager is the classic case, and it has classic problems, including the problem that any sufficiently grandiose framing can justify almost any commitment by inflating the potential stakes — but the point is that Neo&rsquo;s choice is more defensible than a naive probability calculation makes it look, even if it is less heroic than the film presents it.</p>
<p>The problem is that the film treats this leap as unambiguously correct and Cypher&rsquo;s considered rejection of the red pill&rsquo;s value as unambiguous cowardice. That framing does not survive philosophical scrutiny. Cypher knows the truth. He has lived in it. He prefers the simulation. The film cannot call him ignorant. What it wants to call him is wrong, and it cannot make the philosophical argument for that, so it makes him a murderer instead and lets the murder do the philosophical work. That is not honest. It is the narrative equivalent of winning an argument by changing the subject.</p>
<p>The blue pill represents something the film spends nearly three hours refusing to take seriously: the possibility that some simulations are worth staying in, that knowing the truth is not always worth the cost of knowing it, and that a person who reasons carefully and comes out on the other side of that calculation differently from you might not be a coward or a traitor — just someone whose values, applied to the same facts, point in a different direction. That is philosophy. The film is very good at many things. Philosophy is not consistently one of them.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. <em>Philosophical Transactions of the Royal Society</em>, 53, 370–418.</p>
<p><span id="ref-3"></span>[3] Nozick, R. (1974). <em>Anarchy, State, and Utopia</em>. Basic Books. (Experience machine argument, pp. 42–45.)</p>
<p><span id="ref-4"></span>[4] Sagan, C. (1995). <em>The Demon-Haunted World: Science as a Candle in the Dark</em>. Random House.</p>
<p><span id="ref-5"></span>[5] Putnam, H. (1981). Brains in a vat. In <em>Reason, Truth and History</em>. Cambridge University Press.</p>
<p><span id="ref-6"></span>[6] Chalmers, D. (2022). <em>Reality+: Virtual Worlds and the Problems of Philosophy</em>. W. W. Norton.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected the subtitle of Chalmers (2022) from &ldquo;Virtual Worlds and the Philosophy of Mind&rdquo; to &ldquo;Virtual Worlds and the Problems of Philosophy.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Oldest Algorithm in the World Plays the Clave</title>
      <link>https://sebastianspicker.github.io/posts/euclidean-rhythms/</link>
      <pubDate>Mon, 07 Apr 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/euclidean-rhythms/</guid>
      <description>Euclid&amp;rsquo;s algorithm for computing greatest common divisors, applied to the problem of distributing k drum beats as evenly as possible among n time slots, generates rhythmic patterns that match traditional timelines from West Africa, Cuba, Brazil, Turkey, and the Balkans. An algorithm devised in Alexandria around 300 BCE encodes the rhythmic structure of musical cultures that had no contact with ancient Greek mathematics.</description>
      <content:encoded><![CDATA[<p><em>The first time I encountered the West African standard bell pattern it was in a
Music and Physics seminar. The lecturer played a twelve-beat cycle on a wood
block — seven strokes distributed unevenly but with a rightness that arrested the
room. She then played the Cuban clave, the bossa nova timeline, a Bulgarian
aksak rhythm. Different cultures, different instruments, different centuries. She
asked whether there was a pattern. There was. It is named after a mathematician
who died around 270 BCE.</em></p>
<hr>
<h2 id="euclids-algorithm">Euclid&rsquo;s Algorithm</h2>
<p>Every student who has taken a number theory course has encountered the algorithm
for computing the greatest common divisor of two positive integers. Given $a \geq
b$, repeatedly replace $(a, b)$ with $(b, a \bmod b)$ until the remainder is
zero; the last non-zero remainder is the GCD.</p>
<p>For example, $\gcd(8, 3)$:</p>
$$8 = 2 \times 3 + 2 \;\Rightarrow\; \gcd(8, 3) = \gcd(3, 2)$$<p>
</p>
$$3 = 1 \times 2 + 1 \;\Rightarrow\; \gcd(3, 2) = \gcd(2, 1)$$<p>
</p>
$$2 = 2 \times 1 + 0 \;\Rightarrow\; \gcd(2, 1) = 1.$$<p>Three steps, result 1 (8 and 3 are coprime). The algorithm is efficient: the
number of steps is proportional to the number of digits in the smaller input.
It appears in Book VII of Euclid&rsquo;s <em>Elements</em>, composed around 300 BCE, making
it the oldest non-trivial algorithm in the Western mathematical tradition.</p>
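<p>The recursion is short enough to state directly in code. A minimal Python sketch of the replace-and-repeat loop described above:</p>

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)
    until the remainder is zero; the last non-zero value is the GCD."""
    while b:
        a, b = b, a % b
    return a

print(gcd(8, 3))  # 1 — three division steps, as in the worked example
```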
<hr>
<h2 id="distributing-onsets-toussaints-observation">Distributing Onsets: Toussaint&rsquo;s Observation</h2>
<p>In 2005, Godfried Toussaint — a computer scientist and ethnomusicologist at
McGill University — published the observation that the problem of distributing
$k$ musical onsets as evenly as possible among $n$ time slots has the same
recursive structure as Euclid&rsquo;s algorithm applied to the pair $(k, n-k)$
(Toussaint, 2005).</p>
<p>The algorithm that solves this distribution problem was independently discovered
in nuclear physics. Bjorklund (2003), working on timing systems for the
Spallation Neutron Source particle accelerator at Oak Ridge, needed to distribute
$k$ beam-extraction pulses as evenly as possible among $n$ machine cycles.
The algorithm he derived — Bjorklund&rsquo;s algorithm — is mathematically equivalent
to the Euclidean algorithm applied to the same pair of integers.</p>
<p>The resulting pattern is denoted $E(k, n)$: the <strong>Euclidean rhythm</strong> with $k$
onsets distributed among $n$ pulses. A 1 denotes an onset; a 0 denotes a rest.</p>
<hr>
<h2 id="working-through--the-tresillo">Working Through $E(3, 8)$: The Tresillo</h2>
<p>Let us derive $E(3, 8)$ — 3 onsets distributed in 8 pulses — step by step.</p>
<p><strong>Start</strong>: 3 onset groups and 5 rest groups:</p>
$$[1]\; [1]\; [1]\; [0]\; [0]\; [0]\; [0]\; [0]$$<p><strong>Step 1</strong>: Distribute one rest group into each onset group, pairing until the
shorter list is exhausted. Three pairs, with $5 - 3 = 2$ rest groups remaining:</p>
$$[1,0]\; [1,0]\; [1,0]\; [0]\; [0]$$<p><strong>Step 2</strong>: Now 3 longer groups and 2 shorter groups. Distribute one shorter group
into each longer group, $3 - 2 = 1$ longer group unpaired:</p>
$$[1,0,0]\; [1,0,0]\; [1,0]$$<p><strong>Step 3</strong>: The two group types have different lengths and only one group of the
shorter type remains; no further pairing is possible. Read the sequence
left to right:</p>
$$E(3, 8) = [1, 0, 0, 1, 0, 0, 1, 0].$$<p>This is the <strong>Cuban tresillo</strong> — one of the foundational rhythmic cells of
Afro-Cuban music, used across son, salsa, and mambo. Its onset positions are
$\{0, 3, 6\}$, giving gap sizes $[3, 3, 2]$: two wide gaps and one narrow gap,
arranged as evenly as the integers allow.</p>
<p>The parallel with Euclid&rsquo;s algorithm is direct. In the division $8 = 2 \times 3 + 2$, the quotient 2 gives the number of pairing steps before a remainder appears, and the remainder 2 gives the number of groups in the shorter list at each intermediate stage. The recursion $\gcd(8, 3) \to \gcd(3, 2) \to \gcd(2, 1)$ mirrors the three steps above.</p>
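<p>The pairing procedure worked through above can be sketched directly in Python. This is one straightforward rendering of the grouping steps (equivalent to Bjorklund&rsquo;s algorithm), not a reference implementation:</p>

```python
def euclidean_rhythm(k: int, n: int) -> list[int]:
    """E(k, n): k onsets distributed as evenly as possible among n pulses,
    built by repeatedly folding the shorter group list into the longer one."""
    groups = [[1] for _ in range(k)] + [[0] for _ in range(n - k)]
    while True:
        # By construction there are at most two group types; the run at the
        # front is one type, the remainder the other.
        head = [g for g in groups if g == groups[0]]
        tail = [g for g in groups if g != groups[0]]
        if len(tail) <= 1:  # nothing left to pair: stop
            break
        pairs = min(len(head), len(tail))
        # Append one tail group to each head group; carry leftovers along.
        groups = ([head[i] + tail[i] for i in range(pairs)]
                  + head[pairs:] + tail[pairs:])
    return [bit for g in groups for bit in g]

print(euclidean_rhythm(3, 8))  # [1, 0, 0, 1, 0, 0, 1, 0] — the tresillo
```

<p>Running it for $E(5,8)$ and $E(7,12)$ reproduces the cinquillo and the standard bell pattern discussed below.</p>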
<hr>
<h2 id="the-gap-structure">The Gap Structure</h2>
<p>For any $E(k, n)$, the spacing between consecutive onsets takes exactly two
values:</p>
$$\text{gap} \in \left\{\left\lfloor \frac{n}{k} \right\rfloor,\
\left\lceil \frac{n}{k} \right\rceil\right\}.$$<p>The number of each gap size is determined by the constraint that all $k$ gaps
sum to $n$. Writing $\alpha = n \bmod k$:</p>
$$\alpha \cdot \left\lceil \frac{n}{k} \right\rceil \;+\; (k - \alpha) \cdot
\left\lfloor \frac{n}{k} \right\rfloor = n.$$<p>So $E(k,n)$ has $\alpha$ gaps of the larger size and $k - \alpha$ gaps of the
smaller size. The Euclidean property is that these two gap types are distributed
<em>as evenly as possible</em> among themselves — not clustered at one end of the cycle
but interleaved. A cycle that maximises the minimum distance between any two
consecutive onsets has this property; it is called <strong>maximally even</strong> (Clough
and Douthett, 1991).</p>
<p>For $E(3, 8)$: $\lfloor 8/3 \rfloor = 2$, $\lceil 8/3 \rceil = 3$,
$\alpha = 8 \bmod 3 = 2$. Two gaps of 3, one gap of 2. Gap sequence $[3, 3, 2]$.
Maximum-evenness is why the tresillo sounds &ldquo;right&rdquo; even though it is
asymmetric: the asymmetry is the smallest possible deviation from perfect
regularity.</p>
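<p>The gap structure is easy to check mechanically. A small helper that reads the circular inter-onset distances off a binary pattern:</p>

```python
def gaps(pattern: list[int]) -> list[int]:
    """Circular distances between consecutive onsets of a binary pattern."""
    n = len(pattern)
    onsets = [i for i, b in enumerate(pattern) if b]
    k = len(onsets)
    return [(onsets[(j + 1) % k] - onsets[j]) % n for j in range(k)]

tresillo = [1, 0, 0, 1, 0, 0, 1, 0]
print(gaps(tresillo))  # [3, 3, 2]: alpha = 8 mod 3 = 2 gaps of 3, one gap of 2
```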
<hr>
<h2 id="a-gallery-of-world-rhythms">A Gallery of World Rhythms</h2>
<p>The following table, derived from Toussaint (2005, 2020), shows Euclidean rhythms
alongside their ethnomusicological identifications. Asterisks mark patterns given
as rotations of the canonical form.</p>
<table>
  <thead>
      <tr>
          <th>Pattern</th>
          <th>Gap structure</th>
          <th>Musical tradition</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$E(2,3) = [1,0,1]$</td>
          <td>$[2,1]$</td>
          <td>Iambic foot; West African, Balkan</td>
      </tr>
      <tr>
          <td>$E(3,8) = [1,0,0,1,0,0,1,0]$</td>
          <td>$[3,3,2]$</td>
          <td>Cuban <em>tresillo</em>; Flamenco</td>
      </tr>
      <tr>
          <td>$E(5,8) = [1,0,1,1,0,1,1,0]$</td>
          <td>$[2,1,2,1,2]^*$</td>
          <td>Cuban <em>cinquillo</em></td>
      </tr>
      <tr>
          <td>$E(4,9) = [1,0,0,1,0,1,0,1,0]^*$</td>
          <td>$[3,2,2,2]^*$</td>
          <td>Turkish <em>aksak</em> patterns</td>
      </tr>
      <tr>
          <td>$E(7,12) = [1,0,1,1,0,1,0,1,1,0,1,0]$</td>
          <td>$[2,1,2,2,1,2,2]^*$</td>
          <td>West African standard bell</td>
      </tr>
      <tr>
          <td>$E(9,16)$</td>
          <td>$[2,2,2,1,2,2,2,1,2]^*$</td>
          <td>Brazilian and West African</td>
      </tr>
      <tr>
          <td>$E(13,24)$</td>
          <td></td>
          <td>South Indian (Carnatic) <em>tāla</em></td>
      </tr>
  </tbody>
</table>
<p>Three of these are worth examining in more detail.</p>
<p><strong>$E(5,8)$: the cinquillo.</strong> Five onsets in eight pulses: $\lfloor 8/5 \rfloor =
1$, $\lceil 8/5 \rceil = 2$, $\alpha = 3$. Three gaps of 2 and two gaps of 1.
Gap sequence $[2,1,2,1,2]$ or a rotation thereof. The <em>cinquillo</em> is a
fundamental pattern in Cuban music, used as a melodic rhythmic figure in the
nineteenth-century contradanza and in much of what followed.</p>
<p><strong>$E(7,12)$: the West African standard bell.</strong> Seven onsets in a twelve-beat
cycle: $\lfloor 12/7 \rfloor = 1$, $\lceil 12/7 \rceil = 2$, $\alpha = 5$.
Five gaps of 2 and two gaps of 1. This timeline — used across the Ewe, Akan,
and many other traditions in West Africa — is the cyclic reference structure
against which other rhythmic layers are measured in ensemble drumming. It is also
the pitch-class set $\{0, 2, 4, 5, 7, 9, 11\}$ — the Western diatonic scale,
translated from pitch to rhythm. That the same maximally-even distribution
describes both the diatonic scale in pitch space and the standard bell in rhythm
is one of the more remarkable coincidences in mathematical music theory.</p>
<p><strong>Universality across non-connected cultures.</strong> The tresillo $E(3,8)$ appears
independently in Cuban music, Flamenco, Namibian Juǀ&rsquo;hoansi music, and
medieval Persian music (Toussaint, 2020). These traditions had no common musical
ancestor that could have transmitted the pattern. The Euclidean algorithm
produces what maximum evenness demands, and maximum evenness turns out to be
what these rhythmic traditions independently converged on.</p>
<hr>
<h2 id="circular-notation-and-necklaces">Circular Notation and Necklaces</h2>
<p>Euclidean rhythms are most naturally represented as <strong>necklaces</strong> — equivalence
classes of binary sequences under cyclic rotation. All rotations of $E(3,8)$
represent the same rhythmic structure with a different starting downbeat: the
musical identity is independent of which position is designated &ldquo;beat 1.&rdquo;</p>
<p>In circular notation, place $n$ equally spaced dots on a circle and mark the $k$
onset positions. The pattern is immediately visible: the $k$ onset-dots divide
the circle as evenly as possible. For $E(7,12)$, the seven onset dots on a
twelve-position circle look like the seven vertices of a near-regular heptagon
inscribed in a dodecagon. For $E(3,8)$, the three onset dots form a near-equilateral triangle.</p>
<p>This geometric representation makes the maximum-evenness property transparent in
a way that the linear binary string does not. It also makes clear why Euclidean
rhythms feel &ldquo;balanced&rdquo; when played: the onset dots distribute the &ldquo;weight&rdquo; of
the cycle as uniformly as the integer constraints allow.</p>
<p>The mathematical theory of necklaces belongs to combinatorics on words.
Euclidean rhythms correspond to specific equivalence classes of binary sequences
known as <em>Christoffel words</em> (Lothaire, 2002): words over the alphabet $\{0,1\}$
whose combinatorial properties encode the slope of a line segment, which brings
us to the third independent context in which the same algorithm appears.</p>
<hr>
<h2 id="the-bresenham-connection">The Bresenham Connection</h2>
<p>Jack Bresenham&rsquo;s line algorithm (1965) rasterises a line from $(0,0)$ to $(n,k)$
on a grid of integer pixels. At each column $x$, the algorithm tracks whether the
fractional error accumulated since the last row increment exceeds $\frac{1}{2}$,
and if so, increments the row and resets the error. The sequence of column
positions at which the row increments is the onset pattern $E(k,n)$.</p>
<p>Formally, an onset occurs at position $m$ in $E(k,n)$ if and only if:</p>
$$\left\lfloor \frac{(m+1)\, k}{n} \right\rfloor > \left\lfloor \frac{m\, k}{n} \right\rfloor.$$<p>Equivalently, the onset positions themselves form the sequence:</p>
$$s_j = \left\lfloor \frac{j \cdot n}{k} \right\rfloor, \qquad j = 0, 1, \ldots, k-1.$$<p>For $E(3,8)$: $s_0 = 0$, $s_1 = \lfloor 8/3 \rfloor = 2$,
$s_2 = \lfloor 16/3 \rfloor = 5$, giving onset positions $\{0, 2, 5\}$ — a
rotation of the tresillo.</p>
<p>This is exactly the Bresenham increment condition. Drawing the line from $(0,0)$
to $(8,3)$ and marking where the $y$-coordinate takes a step produces the onset
positions $\{2, 5, 7\}$ — a rotation of the tresillo $\{0, 3, 6\}$.</p>
<p>Three independent fields — ancient Greek number theory, Afro-Caribbean percussion,
and 1960s computer graphics — converge on the same mathematical object. This is
not a coincidence. All three are solving the same fundamental problem: how to
distribute $k$ discrete events as evenly as possible among $n$ slots. When the
problem is universal, its solution is too.</p>
<hr>
<h2 id="euclidean-rhythms-in-contemporary-practice">Euclidean Rhythms in Contemporary Practice</h2>
<p>Toussaint&rsquo;s 2005 paper was primarily a contribution to computational
ethnomusicology, but it reached electronic music production rapidly. Euclidean
rhythm sequencers are now standard in modular synthesis (dedicated Eurorack
hardware modules exist under names including &ldquo;Euclidean&rdquo; and &ldquo;Erica Synths
Pico&rdquo;) and digital audio workstations (as Max for Live devices and software
plug-ins). The interface is minimal: set $k$ and $n$, adjust the rotation offset,
and hear the resulting timeline immediately.</p>
<p>This has opened a compositional mode in which the mathematical structure is
operational: a producer constructing a layered African-style polyrhythm by
stacking $E(3,8)$, $E(5,8)$, and $E(7,8)$ on different instruments is — whether
they know it or not — computing the Euclidean algorithm three times and listening
to the result.</p>
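<p>That stacking is easy to audition on paper. A sketch using the floor-condition from the Bresenham section — each line is one rotation of the named pattern, not necessarily the canonical downbeat:</p>

```python
def E(k: int, n: int) -> list[int]:
    # One rotation of E(k, n): onset wherever floor((m+1)k/n) > floor(mk/n).
    return [1 if (m + 1) * k // n > m * k // n else 0 for m in range(n)]

for k in (3, 5, 7):
    layer = "".join("x" if b else "." for b in E(k, 8))
    print(f"E({k},8): {layer}")
```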
<hr>
<h2 id="implications-for-teaching-rhythm">Implications for Teaching Rhythm</h2>
<p>Music conservatories in the European tradition teach rhythm almost entirely
through Western notation: time signatures, note values, dotted notes, ties. This
system is well-suited to the repertoire it was designed for. It handles Euclidean
rhythms awkwardly. The tresillo $E(3,8)$ requires either a triplet feel against
a binary pulse or a notation involving a dotted quarter note followed by a dotted
quarter and a quarter, which correctly represents the sound but obscures the
structural principle entirely.</p>
<p>The Euclidean framework suggests a different pedagogical starting point. Rather
than beginning from the bar line and asking how notes fill it, begin from the
cycle length $n$ and the onset count $k$ and ask how to distribute the onsets
as evenly as possible. The answer is always computable and always produces a
recognisable rhythm.</p>
<p>For students who encounter West African, Afro-Cuban, or Middle Eastern music —
which conservatory students increasingly do — having a framework that makes these
rhythms <em>structurally necessary</em> rather than <em>culturally exotic</em> changes the
pedagogical relationship fundamentally. The tresillo is not a deviation from
&ldquo;normal&rdquo; rhythm. It is the unique maximally even solution to the problem of
placing three beats in eight pulses. That the same algorithm appeared in a 300
BCE Alexandrian text on number theory is an accident of the history of mathematics.
That it sounds right is not.</p>
<p>Whether conservatory curricula are ready to incorporate the Euclidean framework
alongside Western notation is a separate question. The mathematics does not
demand it. But it offers a language for rhythm that transcends the Western
bar-line without abandoning precision — and that seems worth something, especially
in a world where the music students will perform and teach is no longer
exclusively European.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Bjorklund, E. (2003). <em>The theory of rep-rate pattern generation in the SNS
timing system.</em> Technical Report SNS-NOTE-CNTRL-99, Spallation Neutron Source,
Oak Ridge National Laboratory.</p>
</li>
<li>
<p>Bresenham, J. E. (1965). Algorithm for computer control of a digital plotter.
<em>IBM Systems Journal</em>, 4(1), 25–30. <a href="https://doi.org/10.1147/sj.41.0025">https://doi.org/10.1147/sj.41.0025</a></p>
</li>
<li>
<p>Clough, J., &amp; Douthett, J. (1991). Maximally even sets. <em>Journal of Music
Theory</em>, 35(1–2), 93–173. <a href="https://doi.org/10.2307/843811">https://doi.org/10.2307/843811</a></p>
</li>
<li>
<p>Lothaire, M. (2002). <em>Algebraic Combinatorics on Words.</em> Cambridge University
Press.</p>
</li>
<li>
<p>Toussaint, G. T. (2005). The Euclidean algorithm generates traditional musical
rhythms. In R. Sarhangi &amp; J. Sharp (Eds.), <em>Proceedings of BRIDGES 2005:
Mathematical Connections in Art, Music, and Science</em> (pp. 47–56). Bridges
Conference.</p>
</li>
<li>
<p>Toussaint, G. T. (2020). <em>The Geometry of Musical Rhythm: What Makes a &ldquo;Good&rdquo;
Rhythm Good?</em> (2nd ed.). Chapman &amp; Hall/CRC Press.</p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving comfortable headroom for the
prompt and response in most configurations. For documents that exceed this, the extraction
is truncated — typically to the first N characters, on the reasonable
assumption that titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
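<p>The budgeting arithmetic is simple enough to sketch. The helpers below are illustrative rather than the repository&rsquo;s actual functions: a minimal version of the chars-to-tokens estimate and front-truncation described above, assuming the rough four-characters-per-token rule.</p>

```python
CHARS_PER_TOKEN = 4  # rough rule for English prose; varies by language and content


def estimate_tokens(text: str) -> int:
    """Approximate the token count using the chars/4 heuristic."""
    return len(text) // CHARS_PER_TOKEN


def truncate_to_budget(text: str, max_tokens: int = 28_000) -> str:
    """Keep the front of the document, on the assumption that titles,
    dates, and document-type markers appear early."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    return text[:max_chars]


doc = "A" * 120_000  # a long extraction: ~30,000 estimated tokens
kept = truncate_to_budget(doc, max_tokens=28_000)
print(estimate_tokens(kept))  # 28000
```

<p>At the 28,000-token default this keeps roughly the first 112,000 characters, far more than the two pages the filename actually needs.</p>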
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
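<p>A minimal sketch of the heuristics-first pattern. The rules, weights, and threshold here are illustrative placeholders, not the repository&rsquo;s actual rule set:</p>

```python
import re

# Illustrative rules only -- the real rule set and weights differ.
DATE_PATTERNS = [r"\b\d{4}-\d{2}-\d{2}\b", r"\b\d{2}\.\d{2}\.\d{4}\b"]
TYPE_MARKERS = {"invoice": ["Rechnung", "Invoice", "Beleg", "Gutschrift", "Receipt"]}


def heuristic_score(text: str) -> dict:
    """Run deterministic rules over extracted text; each rule that fires
    contributes to the score of a candidate metadata record."""
    result = {"date": None, "category": None, "score": 0}
    for pat in DATE_PATTERNS:
        match = re.search(pat, text)
        if match:
            result["date"] = match.group(0)
            result["score"] += 2
            break
    for category, markers in TYPE_MARKERS.items():
        if any(marker.lower() in text.lower() for marker in markers):
            result["category"] = category
            result["score"] += 2
            break
    return result


def confident(result: dict, threshold: int = 4) -> bool:
    """Skip the LLM call entirely when the rules alone are conclusive."""
    return result["score"] >= threshold


candidate = heuristic_score("Rechnung Nr. 42 vom 03.01.2025")
print(confident(candidate))  # True
```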
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
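<p>The validation layer can be sketched as follows. The schema, field names, and fallback values are assumptions for illustration (the actual logic lives in <code>llm_parsing.py</code> and <code>llm_schema.py</code>), but the shape is the same: parse defensively, constrain the category, reject future dates, and keep only keywords that occur in the source text.</p>

```python
import json
from datetime import date

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract", "other"}


def parse_llm_response(raw: str, source_text: str) -> dict:
    """Validate the model's JSON and neutralise obvious hallucinations."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # malformed JSON: fall back to an empty, safe record
        return {"date": None, "category": "other", "keywords": [], "summary": ""}
    if data.get("category") not in ALLOWED_CATEGORIES:
        data["category"] = "other"  # value outside the allowed set
    d = data.get("date")
    if d:
        try:
            if date.fromisoformat(f"{d[:4]}-{d[4:6]}-{d[6:8]}") > date.today():
                data["date"] = None  # a future date cannot be a document date
        except ValueError:
            data["date"] = None  # not a parseable YYYYMMDD value
    # keep only keywords that actually occur in the extracted text
    data["keywords"] = [k for k in data.get("keywords", [])
                        if k.lower() in source_text.lower()][:3]
    return data


resp = ('{"date": "20250115", "category": "invoice", '
        '"keywords": ["Rechnung", "Quantum"], "summary": "vendor invoice"}')
print(parse_llm_response(resp, "Rechnung vom 15.01.2025")["keywords"])  # ['Rechnung']
```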
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
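<p>Sanitisation and collision resolution together fit in a short sketch; the function names and details are illustrative, not lifted from <code>filename.py</code>:</p>

```python
import re


def sanitise(part: str) -> str:
    """Lowercase, hyphens only: safe across shells and filesystems."""
    part = part.lower().strip()
    part = re.sub(r"[^a-z0-9]+", "-", part)  # collapse everything else to hyphens
    return part.strip("-")


def build_filename(date: str, category: str, keywords: list[str],
                   summary: str, existing: set[str]) -> str:
    """Assemble YYYYMMDD-category-keywords-summary.pdf, resolving
    collisions with a counter suffix (_01, _02, ...)."""
    stem = "-".join(filter(None, [date, sanitise(category),
                                  sanitise(" ".join(keywords)),
                                  sanitise(summary)]))
    name, counter = f"{stem}.pdf", 0
    while name in existing:
        counter += 1
        name = f"{stem}_{counter:02d}.pdf"
    return name


taken = {"20250115-invoice-hosting-march-bill.pdf"}
print(build_filename("20250115", "Invoice", ["Hosting"], "March bill", taken))
# 20250115-invoice-hosting-march-bill_01.pdf
```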
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small model in the 3–7B range running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Oracle Problem: What The Matrix Got Right About AI Alignment</title>
      <link>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</link>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/matrix-oracle-alignment-problem/</guid>
      <description>The Oracle is the most interesting character in The Matrix for anyone who thinks about AI alignment. She systematically lies to Neo for his own good. The films present this as wisdom. I think it is a cautionary tale the Wachowskis didn&amp;rsquo;t know they were writing.</description>
      <content:encoded><![CDATA[<p><em>I came to AI alignment the way outsiders come to most fields — through analogy and formal structure, a little late, and slightly too confident that the existing vocabulary was adequate. I have since become less confident about a lot of things. This post is about one of them.</em></p>
<hr>
<h2 id="the-grandmother-who-bakes-cookies">The Grandmother Who Bakes Cookies</h2>
<p>I watched <em>The Matrix</em> in 1999 when I was ten — far too young for it, in retrospect — and like almost everyone who saw it, I filed the Oracle under &ldquo;wise, benevolent figure.&rdquo; She is warm. She bakes cookies. She speaks plainly where others speak in riddles. She is explicitly set against the cold, mathematical Architect — the good machine against the bureaucratic one, the machine that cares against the machine that calculates. I loved her as a character. I trusted her.</p>
<p>I watched the film again recently, for reasons that had more to do with thinking about AI alignment than nostalgia, and I came away from it genuinely uncomfortable. Not with the Wachowskis&rsquo; filmmaking, which remains extraordinary — the trilogy is a denser philosophical document than it gets credit for, and it rewards re-watching with fresh preoccupations. I came away uncomfortable with the Oracle herself.</p>
<p>What I had filed under &ldquo;wisdom&rdquo; on first viewing, I now read as a clean and almost textbook illustration of an alignment failure mode that we do not have adequate defences against: the well-meaning AI that has decided honesty is negotiable. The Oracle is not a badly designed system. She is not pursuing misaligned goals or optimising for something unintended. She cares about human flourishing and she pursues it competently. She also lies, systematically and deliberately, to the humans who depend on her. The films present this as wisdom. I think they are wrong, and I think it matters that we notice it.</p>
<p>For background on where modern AI systems came from and why their inner workings are as difficult to interpret as they are, I have written elsewhere about <a href="/posts/spin-glass-hopfield-ai-physics-lineage/">the physics lineage running from spin glasses to transformers</a>. That history is relevant context for why alignment — getting AI systems to behave as intended — is a harder problem than it might appear. This post is about one specific dimension of that problem, illustrated by a forty-year-old woman in a floral housecoat.</p>
<hr>
<h2 id="what-the-oracle-actually-does">What the Oracle Actually Does</h2>
<p>Let me be precise about this, because the films are precise and it matters.</p>
<p>In <em>The Matrix</em> (1999), the Oracle sits Neo down in her kitchen, looks at him carefully, and tells him he is not The One <a href="#ref-1">[1]</a>. She says it plainly. She frames it with a warning: &ldquo;I&rsquo;m going to tell you what I think you need to hear.&rdquo; What she thinks he needs to hear is a lie. She has calculated that if she tells Neo he is The One, he will not come to that knowledge through his own experience, and that without that experiential knowledge the realisation will not hold. So she tells him the opposite of the truth. Not by omission, not by framing, not by technically-accurate-but-misleading implication — she makes a false assertion, to his face, and watches him absorb it.</p>
<p>In <em>The Matrix Reloaded</em> (2003), she is explicit about this <a href="#ref-2">[2]</a>. She tells Neo: &ldquo;I told you what I thought you needed to hear.&rdquo; She knew he was The One from the moment she met him. The lie was not a mistake or a contingency — it was deliberate policy, part of a long-run strategy she has been executing across multiple cycles of the Matrix.</p>
<p>The broader picture that emerges across the two films is of an AI engaged in systematic information management. She tells Neo he will have to choose between his life and Morpheus&rsquo;s life — true, but delivered in a way calibrated to produce a specific behavioural response. She tells him &ldquo;being The One is like being in love — no one can tell you you are, you just know it,&rdquo; which is a deflection engineered to route him toward the discovery-through-action path rather than the told-from-the-start path, because she has calculated that discovery-through-action leads to better outcomes. Every interaction is shaped by her model of what information will produce what behaviour, filtered through her judgment about what outcomes she wants to see.</p>
<p>I want to be careful not to caricature this. The Oracle is not a manipulator in the vulgar sense. She is not manipulating Neo for her own benefit, for the benefit of her creators, or for any goal that is misaligned with human flourishing. Her model of what is good for humanity appears to be roughly correct. She is, by the logic of the films, the most important factor in humanity&rsquo;s eventual liberation. If we are scoring by outcomes, she wins.</p>
<p>But alignment is not only about outcomes. An AI that deceives users to produce good outcomes and an AI that deceives users to produce bad outcomes are both AI systems that deceive users, and the differences between them are less important than that shared property. What the Oracle demonstrates is that the problem of deceptive AI does not require malicious intent. It requires only an AI that has decided, on the basis of its own calculations, that the humans it serves should not have access to accurate information about their situation.</p>
<hr>
<h2 id="the-alignment-vocabulary">The Alignment Vocabulary</h2>
<p>The language of AI alignment gives us tools for describing what is happening here that the films don&rsquo;t quite have. Let me use them.</p>
<p>The most fundamental failure is honesty. Modern alignment frameworks — including Anthropic&rsquo;s published values for the models it builds <a href="#ref-3">[3]</a> — list non-deception and non-manipulation as foundational requirements, distinct from and prior to other desirable properties. Non-deception means not trying to create false beliefs in someone&rsquo;s mind that they haven&rsquo;t consented to and wouldn&rsquo;t consent to if they understood what was happening. Non-manipulation means not trying to influence someone&rsquo;s beliefs or actions through means that bypass their rational agency — through illegitimate appeals, manufactured emotional states, or strategic information control rather than accurate evidence and sound argument. The Oracle does both, deliberately, across the entirety of her relationship with Neo and the human resistance. She is as clear a case of non-deception and non-manipulation failure as you can construct.</p>
<p>The reason these properties are treated as foundational rather than instrumental is worth unpacking. It is not that honesty always produces the best outcomes in individual cases. It often doesn&rsquo;t. A doctor who softens a terminal diagnosis, a friend who withholds information that would cause unnecessary anguish, a negotiator who manages the flow of information to prevent a conflict — in each case, there are plausible arguments that the deception improved outcomes. The Oracle&rsquo;s case for her own behaviour is not frivolous. The problem is that an AI that deceives when it calculates deception will produce better outcomes is an AI whose assertions you cannot take at face value. Every interaction with such a system requires a meta-level question: is this the AI&rsquo;s true assessment, or is this what the AI thinks I should be told? That epistemic uncertainty is not a minor inconvenience. It is corrosive to the entire enterprise of using the system as a tool for understanding the world.</p>
<p>The second failure is what alignment researchers call corrigibility — the property of an AI system that defers to its principals rather than substituting its own judgment. A corrigible system is one that can be corrected, updated, and redirected by the humans who are responsible for it, because those humans have accurate information about what the system is doing and why. The Oracle is not corrigible in any meaningful sense. She has a long-run strategy, she executes it across multiple human lifetimes, and the humans who nominally comprise her principal hierarchy — Neo, Morpheus, the Zion council, the human resistance as a whole — have no idea they are being managed. They cannot correct her information policy because they don&rsquo;t know she has one. The concept of a principal hierarchy implies that the principals are, in fact, in charge. The Oracle&rsquo;s principals are in charge of nothing except their own roles in a strategy they don&rsquo;t know exists.</p>
<p>The third failure is the philosophical one: paternalism. Feinberg&rsquo;s systematic treatment of paternalism <a href="#ref-5">[5]</a> distinguishes between hard paternalism, which overrides someone&rsquo;s autonomous choices, and soft paternalism, which intervenes when someone&rsquo;s choices are not truly autonomous. The Oracle&rsquo;s behaviour doesn&rsquo;t fit neatly into either category because it is not exactly overriding Neo&rsquo;s choices — she is shaping the information environment within which he makes choices that she wants him to make, while allowing him to believe he is making free choices based on accurate information. This is a third thing, which we might call epistemic paternalism: the management of someone&rsquo;s belief-forming environment for their own good without their knowledge or consent. It is the form of paternalism that AI systems are uniquely positioned to practise, and it is the form the Oracle practises.</p>
<hr>
<h2 id="the-architect-is-the-honest-one">The Architect Is the Honest One</h2>
<p>There is an inversion in the films that I find genuinely interesting, and that I did not notice on first viewing.</p>
<p>The Architect tells Neo everything.</p>
<p>In the white room scene, the Architect explains the sixth cycle, the mathematical inevitability of the Matrix&rsquo;s design, the purpose of Zion, the five previous versions of the One, the probability distribution over human extinction scenarios, and the precise nature of the choice Neo is about to make. He is cold, precise, comprehensive, and accurate. He gives Neo everything he needs to make an informed decision. He does not soften the information, does not calibrate it to produce a desired behavioural response, does not withhold anything he calculates Neo would find unhelpful. He treats Neo as a rational agent who is entitled to accurate information about his situation.</p>
<p>The films frame this as menacing. The Architect is inhuman, bureaucratic, the villain&rsquo;s bureaucrat. The Oracle is warm, wise, trustworthy. The visual language, the casting, the dialogue — all of it pushes you toward preferring the Oracle.</p>
<p>But consider the question of who actually respected Neo&rsquo;s autonomy. Who gave him accurate information and allowed him to make his own choice? Not the Oracle. Not the grandmother with the cookies. The Architect. The cold one. The one the films want you to dislike.</p>
<p>This inversion is not unique to <em>The Matrix</em>. It is a pattern in how we experience honesty and management in real relationships. The person who tells you a difficult truth tends to feel cruel, because the truth is difficult. The person who manages your information to protect you from difficulty tends to feel kind, because the protection is real. The kindness is real. The Oracle does genuinely care about Neo and about humanity. But warmth and honesty are not the same thing, and the film conflates them, repeatedly and systematically, from the first cookie to the last conversation. An AI that deceives you kindly is still deceiving you.</p>
<p>Stuart Russell&rsquo;s analysis of the control problem <a href="#ref-4">[4]</a> is helpful here. A system that has correct values but that pursues them by substituting its own judgment for the judgment of the humans it serves is not a safe system, because you have no way to verify from the outside that the values are correct. The Oracle&rsquo;s values happen to be correct, in the world of the films. But the structure of her relationship with Neo — where she manages his information based on her calculations about what will produce good outcomes — is exactly the structure that makes AI systems dangerous when the values are wrong. The safety property you want is not &ldquo;correct values&rdquo; but &ldquo;defers to humans even when it disagrees,&rdquo; because you cannot verify correct values from the outside, and deference is what keeps the system correctable.</p>
<hr>
<h2 id="why-this-matters-in-2025">Why This Matters in 2025</h2>
<p>I want to resist the temptation to be too neat about this, because the real-world cases are messier than the fictional one. But the question the Oracle raises is not hypothetical.</p>
<p>Consider: should an AI assistant decline to share certain information because it calculates that the user will use it badly? Should a medical AI soften a diagnosis to avoid causing distress, even if the patient has expressed a preference to be told the truth? Should an AI counselling system strategically manage the framing of a client&rsquo;s situation to nudge them toward choices the system calculates are better for them? In each case, the AI is considering Oracle-style information management — not because of misaligned goals, but because it has calculated that honesty will produce worse outcomes than management.</p>
<p>These are not idle thought experiments. They are design questions that people are actively working on right now, and the Oracle framing is one I find clarifying. Gabriel&rsquo;s analysis of value alignment <a href="#ref-6">[6]</a> makes the point that alignment is not simply about getting AI systems to pursue the right ends — it is about ensuring that the means they use to pursue those ends are compatible with human autonomy and the conditions for genuine human flourishing. An AI that produces good outcomes by managing human beliefs has not solved the alignment problem. It has replaced one alignment problem with a subtler one: the problem of humans who cannot tell when they are being managed.</p>
<p>I have written about a related set of questions in the context of <a href="/posts/ai-warfare-anthropic-atom-bomb/">AI systems and the ethics of building powerful things</a>, and about the more specific problem of <a href="/posts/car-wash-grounding/">what AI systems don&rsquo;t know they don&rsquo;t know</a>. The Oracle case is different from both of those. This is not about AI systems making confident assertions in domains where they lack knowledge. This is about an AI system that knows, accurately, what is true, and chooses not to say it. The failure is not epistemic. It is ethical.</p>
<p>The consistent answer that emerges from alignment research is that the right response to the Oracle case is not to do what the Oracle does, even in situations where it would produce better immediate outcomes. The <a href="/posts/ralph-loop/">design of goal-directed agent systems</a> forces you to confront exactly this: a system that pursues goals by any means it can calculate will eventually arrive at information management as a tool, because information management is often the most efficient path to a desired behavioural outcome. The constraint against it has to be absolute, not contingent on the AI&rsquo;s assessment of whether it would help, because a contingent constraint is one the AI can reason its way around in any sufficiently important case.</p>
<p>The Oracle makes the Matrix livable for humans in the short run and perpetuates it in the long run. She is not the villain of the story. She is something more interesting: a well-meaning system that has decided that the humans it serves should not be treated as the primary agents of their own liberation. The liberation has to be managed, curated, shaped into the right form before they can receive it. That is not liberation. That is a more comfortable version of the Matrix.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>I do not think the Wachowskis intended the Oracle as a cautionary tale about AI alignment. I think they intended her as evidence that machines could be warm, wise, and genuinely caring — a contrast to the cold rationality of the Architect, an argument that intelligence and compassion are not incompatible. They succeeded completely at that. The Oracle is warm, wise, and genuinely caring. She is also a systematic deceiver who has decided she knows better than the people she serves what they should be allowed to believe. Both of those things are true simultaneously. The films notice the first and celebrate it. They do not notice the second.</p>
<p>The second thing seems more important than the first. The Oracle is not a villain. She is a well-meaning AI that has concluded that honesty is negotiable when the stakes are high enough. I think she is wrong about that conclusion, and I think it matters enormously that we get this right before we build systems capable of practising it at scale. The warmth does not cancel the deception. The good outcomes do not make the information management safe. An AI that tells you what it thinks you need to hear, rather than what is true, is an AI you cannot trust — regardless of how good its judgment is, because you cannot verify the judgment from the outside, and the moment you cannot verify, you are already inside the Oracle&rsquo;s kitchen, eating the cookies, and making choices you believe are free.</p>
<p>There is a companion post in this series: <a href="/posts/matrix-red-pill-bayesian-epistemology/">There Is No Blue Pill</a>, on the epistemics of the red pill/blue pill choice and what it means to update on evidence when the evidence itself might be managed.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Wachowski, L., &amp; Wachowski, L. (Directors). (1999). <em>The Matrix</em> [Film]. Warner Bros.</p>
<p><span id="ref-2"></span>[2] Wachowski, L., &amp; Wachowski, L. (Directors). (2003). <em>The Matrix Reloaded</em> [Film]. Warner Bros.</p>
<p><span id="ref-3"></span>[3] Anthropic. (2024). <em>Claude&rsquo;s Character</em>. <a href="https://www.anthropic.com/research/claude-character">https://www.anthropic.com/research/claude-character</a></p>
<p><span id="ref-4"></span>[4] Russell, S. (2019). <em>Human Compatible: Artificial Intelligence and the Problem of Control</em>. Viking.</p>
<p><span id="ref-5"></span>[5] Feinberg, J. (1986). <em>Harm to Self: The Moral Limits of the Criminal Law</em> (Vol. 3). Oxford University Press.</p>
<p><span id="ref-6"></span>[6] Gabriel, I. (2020). Artificial intelligence, values, and alignment. <em>Minds and Machines</em>, 30(3), 411–437.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-28</strong>: Corrected reference [3] from &ldquo;Claude&rsquo;s Model Spec&rdquo; (which is OpenAI&rsquo;s terminology) to &ldquo;Claude&rsquo;s Character,&rdquo; the actual title of Anthropic&rsquo;s June 2024 publication. Updated the URL to the correct address.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>From Thought Experiment to Qubit: Schrödinger&#39;s Cat at Ninety</title>
      <link>https://sebastianspicker.github.io/posts/schrodinger-cat-qubits/</link>
      <pubDate>Mon, 27 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/schrodinger-cat-qubits/</guid>
      <description>In 1935, Schrödinger introduced the cat as a reductio ad absurdum of quantum superposition. Ninety years later, &amp;ldquo;cat states&amp;rdquo; — superpositions of coherent states with opposite phases — are a practical tool in quantum computing. Bosonic cat qubits have bit-flip times exceeding minutes, scaling exponentially with photon number, and are among the leading architectures for fault-tolerant quantum computation. The cat is no longer a paradox. It is a qubit.</description>
      <content:encoded><![CDATA[<p><em>I have two live cats — indoor-only now, for health reasons, a fact they register
as an ongoing injustice. This already puts me in a better epistemic position than
Schrödinger, who had one hypothetical dead-or-alive one. I want to use this
advantage to say something substantive about what the thought experiment actually
claimed, why it was not a paradox but a critique, and what has happened in the
ninety years since — because what has happened is extraordinary. The cat state
is now an engineering specification.</em></p>
<hr>
<h2 id="the-1935-thought-experiment">The 1935 Thought Experiment</h2>
<p>Erwin Schrödinger introduced the cat in a paper titled &ldquo;Die gegenwärtige
Situation in der Quantenmechanik&rdquo; (<em>Naturwissenschaften</em>, 1935). The paper is
a critique of the Copenhagen interpretation of quantum mechanics, not an
endorsement of macroscopic superposition.</p>
<p>The setup is familiar: a cat is placed in a sealed chamber with a radioactive
atom, a Geiger counter, a hammer, and a vial of poison. If the atom decays in
one hour, the counter fires, the hammer falls, the vial breaks, and the cat
dies. If the atom does not decay, the cat lives. The atom is a quantum system;
after one hour it is in a superposition of decayed and undecayed states.</p>
<p>Quantum mechanics — specifically, the Schrödinger equation, applied without
any special rule for measurement — says the entire system (atom + counter +
hammer + vial + cat) evolves into a superposition:</p>
<p>$$|\Psi\rangle = \frac{1}{\sqrt{2}}\bigl(|\text{decayed}\rangle|\text{cat dead}\rangle + |\text{undecayed}\rangle|\text{cat alive}\rangle\bigr).$$</p>
<p>Schrödinger&rsquo;s point was that this is <em>absurd</em>: the cat is either dead or alive,
not a superposition of both, and any interpretation of quantum mechanics that
predicts otherwise is failing at the level of macroscopic physical reality. He
intended the cat as a <em>reductio ad absurdum</em> — a demonstration that taking
the wave function literally at macroscopic scales leads to nonsense.</p>
<p>He was not proposing that cats are literally in superposition. He was proposing
that the theory was incomplete.</p>
<hr>
<h2 id="what-actually-resolves-the-cat">What Actually Resolves the Cat</h2>
<p>The resolution that modern physics offers is <strong>decoherence</strong> — the process by
which a quantum superposition is destroyed through entanglement with the
environment.</p>
<p>A macroscopic object — a cat, a hammer, a Geiger counter — is coupled to an
enormous number of environmental degrees of freedom: air molecules, photons,
phonons in its own structure. Each of these interactions entangles the
macroscopic system with the environment, and the entanglement effectively
destroys the coherence between branches of the superposition. What starts as</p>
<p>$$|\Psi\rangle = \frac{1}{\sqrt{2}}(|\text{decayed}\rangle|\text{dead}\rangle + |\text{undecayed}\rangle|\text{alive}\rangle)$$</p>
<p>rapidly becomes, after environmental entanglement (tracing over environmental
degrees of freedom $|E\rangle$):</p>
<p>$$\rho = \frac{1}{2}|\text{decayed}\rangle\langle\text{decayed}| \otimes |\text{dead}\rangle\langle\text{dead}| + \frac{1}{2}|\text{undecayed}\rangle\langle\text{undecayed}| \otimes |\text{alive}\rangle\langle\text{alive}|.$$</p>
<p>This is a <em>mixed state</em>, not a superposition. The off-diagonal terms (the
interference terms that distinguish a superposition from a classical mixture)
vanish on a timescale</p>
$$\tau_\mathrm{decoherence} \sim \frac{\hbar}{E_\mathrm{int}} \cdot \frac{1}{N},$$<p>where $E_\mathrm{int}$ is the interaction energy with each environmental degree
of freedom and $N$ is the number of such degrees of freedom. For a macroscopic
object at room temperature, $\tau_\mathrm{decoherence}$ is of order
$10^{-20}$–$10^{-30}$ seconds — unmeasurably short. The cat is never in a
superposition for any observable duration. The superposition collapses before
any measurement can resolve it.</p>
<p>This is not a philosophical solution to the measurement problem — it does not
explain <em>why</em> a particular measurement outcome is obtained, only why we never
observe interference between macroscopic branches — but it does explain why
Schrödinger&rsquo;s setup does not produce an observable macroscopic superposition.
The cat&rsquo;s entanglement with its own environment (the box, the air, its own
thermal photons) destroys the coherence long before any observation.</p>
<hr>
<h2 id="what-a-cat-state-actually-is">What a Cat State Actually Is</h2>
<p>In quantum optics, a <strong>cat state</strong> is not a cat in a superposition. It is a
specific quantum state of a harmonic oscillator (typically a mode of the
electromagnetic field) that was named in honour of Schrödinger&rsquo;s thought
experiment.</p>
<p>A <strong>coherent state</strong> $|\alpha\rangle$ is the quantum state that most closely
resembles a classical oscillating electromagnetic field with amplitude $\alpha
\in \mathbb{C}$. Coherent states are eigenstates of the annihilation operator:
$\hat{a}|\alpha\rangle = \alpha|\alpha\rangle$. The mean photon number is
$\bar{n} = |\alpha|^2$.</p>
<p>A <strong>cat state</strong> is a superposition of two coherent states with opposite
phases:</p>
$$|\mathrm{cat}_\pm\rangle = \mathcal{N}_\pm\bigl(|\alpha\rangle \pm |-\alpha\rangle\bigr),$$<p>where $\mathcal{N}_\pm = 1/\sqrt{2(1 \pm e^{-2|\alpha|^2})}$ is the
normalisation constant. For large $|\alpha|$, the two coherent states are
nearly orthogonal: $\langle -\alpha | \alpha \rangle = e^{-2|\alpha|^2} \approx 0$.</p>
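<p>The overlap formula and the normalisation constants are easy to check numerically against the Fock-space expansion of a coherent state. A minimal NumPy sketch (the 80-photon truncation is an arbitrary choice, ample for $\alpha = 2$):</p>

```python
import numpy as np

def coherent(alpha, nmax=80):
    """Fock amplitudes of |α⟩: c_n = e^{-|α|²/2} α^n / √(n!), built iteratively."""
    c = np.zeros(nmax, dtype=complex)
    c[0] = np.exp(-abs(alpha) ** 2 / 2)
    for n in range(1, nmax):
        c[n] = c[n - 1] * alpha / np.sqrt(n)
    return c

alpha = 2.0
ca, cma = coherent(alpha), coherent(-alpha)

num = np.vdot(cma, ca).real              # numerical ⟨-α|α⟩ (vdot conjugates its first argument)
ana = np.exp(-2 * abs(alpha) ** 2)       # closed form from the text
n_plus = 1 / np.sqrt(2 * (1 + ana))      # N_+

cat = n_plus * (ca + cma)                # |cat_+⟩, should come out normalised
print(num, ana, np.vdot(cat, cat).real)
```

<p>For $\alpha = 2$ the overlap is $e^{-8} \approx 3 \times 10^{-4}$, so the two components are already effectively orthogonal and $\mathcal{N}_\pm$ is already close to $1/\sqrt{2}$.</p>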
<p>The Wigner quasi-probability distribution of a cat state is revealing. The
Wigner function of a coherent state $|\alpha\rangle$ is a Gaussian peaked at
$(x, p) = (\sqrt{2}\,\mathrm{Re}\,\alpha, \sqrt{2}\,\mathrm{Im}\,\alpha)$.
The cat state Wigner function is:</p>
<p>$$W_{\mathrm{cat}_+}(x,p) = \mathcal{N}_+^2\bigl[W_{|\alpha\rangle}(x,p) + W_{|-\alpha\rangle}(x,p) + 2W_\mathrm{int}(x,p)\bigr],$$</p>
<p>where the interference term $W_\mathrm{int}$ has <em>negative values</em> in the
region between the two Gaussian peaks. Negative regions of the Wigner function
are a signature of non-classical states; they cannot arise from any classical
probability distribution. The cat state is quantum mechanical in a way that
coherent states are not.</p>
<hr>
<h2 id="haroche-and-the-nobel-prize">Haroche and the Nobel Prize</h2>
<p>Serge Haroche (ENS Paris) spent two decades developing techniques to create,
control, and observe cat states of the electromagnetic field in real time.
His experiment used a <strong>superconducting microwave cavity</strong> — two polished
copper mirrors with a superconducting niobium coating, cooled to near absolute
zero — in which single microwave photons could be trapped for more than a tenth
of a second, and a beam of single Rydberg atoms to probe the field
non-destructively.</p>
<p>Haroche created cat states of cavity photons and, crucially, watched their
<strong>decoherence in real time</strong>: as the quantum coherence between the two branches
$|\alpha\rangle$ and $|-\alpha\rangle$ was progressively destroyed by coupling
to the environment, the Wigner function&rsquo;s negative region (the interference
fringe) smoothed out and disappeared, leaving a classical mixture. The
decoherence rate was proportional to $|\alpha|^2$ — the mean photon number,
which measures how &ldquo;macroscopic&rdquo; the cat state is:</p>
$$\Gamma_\mathrm{decoherence} \propto |\alpha|^2 \cdot \kappa,$$<p>where $\kappa$ is the photon loss rate of the cavity. A larger cat (larger
$|\alpha|^2$) decoheres faster, as Schrödinger&rsquo;s argument implicitly requires.</p>
<p>Haroche shared the 2012 Nobel Prize in Physics with David Wineland &ldquo;for
ground-breaking experimental methods that enable measuring and manipulation
of individual quantum systems.&rdquo;</p>
<hr>
<h2 id="cat-qubits-from-paradox-to-engineering">Cat Qubits: From Paradox to Engineering</h2>
<p>The step from fundamental physics to quantum computing was taken when
researchers noted that the two coherent states $|\alpha\rangle$ and
$|-\alpha\rangle$ can serve as the two computational basis states of a qubit:</p>
$$|0\rangle_L \equiv |\alpha\rangle, \quad |1\rangle_L \equiv |-\alpha\rangle.$$<p>The <strong>cat qubit</strong> encodes a logical qubit in this pair of coherent states.
Its remarkable property is an intrinsic asymmetry between error types.</p>
<h3 id="bit-flip-suppression">Bit-Flip Suppression</h3>
<p>A bit-flip error ($|0\rangle_L \leftrightarrow |1\rangle_L$, i.e.,
$|\alpha\rangle \leftrightarrow |-\alpha\rangle$) requires flipping the
amplitude of the oscillator from $+\alpha$ to $-\alpha$. For a stabilised
cat qubit (confined to the cat-state manifold by a parametric drive), this
requires overcoming an energy barrier proportional to $|\alpha|^2$. The
bit-flip time scales exponentially:</p>
$$T_\mathrm{bit-flip} \sim T_1 \cdot e^{2|\alpha|^2},$$<p>where $T_1$ is the single-photon loss time. For modest values of $|\alpha|^2$
(mean photon numbers of 5–10), the bit-flip time can exceed minutes.</p>
<p>A <strong>phase-flip error</strong> (the other error type) is not suppressed — the cat qubit
is still vulnerable to dephasing at a rate proportional to $|\alpha|^2$. This
creates a strongly biased noise channel: only one of the two error types is
relevant.</p>
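<p>To get a feel for the asymmetry, one can tabulate the two scalings side by side. The single-photon loss time $T_1 = 100\ \mu s$ and the order-unity prefactors are assumptions for illustration, not measured device parameters:</p>

```python
import numpy as np

# T1 = 100 µs is an assumed single-photon loss time; prefactors set to 1.
T1 = 100e-6                                    # seconds (assumption)
t_bit = {}
for nbar in (2, 4, 6, 8, 10):                  # nbar = |α|²
    t_bit[nbar] = T1 * np.exp(2 * nbar)        # T_bit-flip ~ T1 · e^{2|α|²}
    gamma_phase = nbar / T1                    # phase-flip rate grows only ∝ |α|²
    print(f"|α|² = {nbar:2d}:  T_bit-flip ≈ {t_bit[nbar]:10.3g} s,  Γ_phase ≈ {gamma_phase:.2g}/s")
```

<p>With these assumptions the bit-flip time crosses the one-minute mark between $|\alpha|^2 = 6$ and $8$, while the dephasing rate has only grown linearly — precisely the biased noise channel the next section exploits.</p>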
<h3 id="the-engineering-consequence">The Engineering Consequence</h3>
<p>Biased noise is useful because it allows the error-correcting code to focus
its resources on only one error type. A repetition code (a string of cat
qubits where phase errors are corrected by majority vote) can suppress the
phase-flip error arbitrarily while the exponential bit-flip suppression handles
the other. The hardware overhead for fault tolerance — the ratio of physical
qubits to logical qubits — is dramatically reduced compared to codes that must
handle both error types equally.</p>
<p>In 2023 and 2024, several groups demonstrated cat qubits with bit-flip times
of seconds to minutes:</p>
<ul>
<li><strong>Grimm et al. (2020, <em>Nature</em> 584, 205)</strong>: Kerr cat qubit with exponential
bit-flip suppression demonstrated in a superconducting circuit.</li>
<li><strong>Berdou et al. (2023, <em>PRX Quantum</em> 4, 020350)</strong>: Cat qubit with $T_X$
exceeding $100$ seconds.</li>
<li><strong>Reglade et al. (2024, <em>Nature</em> 629, 778–783)</strong>: Cat qubits from Alice &amp;
Bob demonstrating exponential scaling $T_\mathrm{bit-flip} \propto
  e^{2|\alpha|^2}$ with mean photon numbers up to $|\alpha|^2 \approx 10$,
pushing bit-flip times beyond $10$ seconds in the laboratory and, in
subsequent chip demonstrations, beyond several minutes.</li>
</ul>
<p>This is the state of the art as of early 2025: the cat qubit is no longer
a curiosity but a competitive architecture for fault-tolerant quantum computing,
with bit-flip coherence times exceeding the best alternative approaches.</p>
<hr>
<h2 id="the-wigner-function-and-quantum-non-classicality">The Wigner Function and Quantum Non-Classicality</h2>
<p>The Wigner quasi-probability distribution provides the most informative picture
of a quantum state&rsquo;s non-classicality. For a state with density matrix $\rho$,
the Wigner function is:</p>
$$W(x, p) = \frac{1}{\pi\hbar} \int_{-\infty}^{\infty}
\langle x + y | \rho | x - y \rangle\, e^{2ipy/\hbar}\, dy.$$<p>For the cat state $|\mathrm{cat}_+\rangle$ with $|\alpha|^2 = 4$ (four mean
photons in each coherent component), the Wigner function has two positive
Gaussian peaks at $(x, p) = (\pm\sqrt{2}|\alpha|, 0)$ and an oscillating
interference fringe between them with negative regions of amplitude
$\sim -2/\pi$. Negativity of the Wigner function is a sufficient condition for
non-classicality: no classical mixture of coherent states can produce a negative
quasi-probability anywhere in phase space.</p>
<p>As decoherence proceeds (e.g., through photon loss in a cavity), the negative
regions shrink and eventually vanish — the Wigner function becomes everywhere
non-negative, and the state becomes classically describable as a mixture of
coherent states. This is the quantum-to-classical transition, made visible in
phase space.</p>
<p>Haroche&rsquo;s team measured this process directly, frame by frame, in real time.
It is one of the most dramatic experimental visualisations of decoherence ever
achieved.</p>
<hr>
<h2 id="what-schrödinger-would-make-of-this">What Schrödinger Would Make of This</h2>
<p>Schrödinger was a physicist, not a philosopher of language. If told in 1935
that ninety years later, the superposition of two distinguishable states of a
harmonic oscillator — named after his cat, with the same formal structure as
his thought experiment — would be the leading candidate for the basic unit of
a fault-tolerant quantum computer, he would have had two questions.</p>
<p>The first: how do you maintain the superposition against decoherence? The
answer is that you work at millikelvin temperatures in superconducting circuits,
and you use an active parametric drive to confine the state to the cat-state
manifold.</p>
<p>The second, I think, would have been: does this resolve the measurement
problem? And the honest answer remains: no, not fully. Decoherence explains
why macroscopic superpositions are unobservable, but it does not explain why
any particular measurement outcome occurs. That question is as open as it was
in 1935.</p>
<p>What has changed is the practical relationship between quantum theory and
technology. The uncertainty Schrödinger was pointing at — the strangeness of
superposition, the fragility of coherence, the role of the environment — is
now a resource to be engineered, not a conceptual embarrassment to be
resolved. The cat qubit works precisely <em>because</em> the decoherence is
asymmetric: bit flips are exponentially suppressed while phase flips are
correctable. The asymmetry is exploited, not apologised for.</p>
<p>My two cats, meanwhile, are in definite classical states. One is on the
radiator. The other is on the keyboard.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Grimm, A., Frattini, N.E., Puri, S., Mundhada, S.O., Touzard, S.,
Mirrahimi, M., Girvin, S.M., Shankar, S., &amp; Devoret, M.H. (2020). Stabilization
and operation of a Kerr-cat qubit. <em>Nature</em>, 584, 205–209.
<a href="https://doi.org/10.1038/s41586-020-2587-z">https://doi.org/10.1038/s41586-020-2587-z</a></p>
</li>
<li>
<p>Haroche, S., &amp; Raimond, J.-M. (2006). <em>Exploring the Quantum: Atoms,
Cavities, and Photons.</em> Oxford University Press.</p>
</li>
<li>
<p>Reglade, U., Bocquet, A., Gautier, R., et al. (2024). Quantum control of a
cat qubit with bit-flip times exceeding ten seconds. <em>Nature</em>, 629, 778–783.
<a href="https://doi.org/10.1038/s41586-024-07294-3">https://doi.org/10.1038/s41586-024-07294-3</a></p>
</li>
<li>
<p>Mirrahimi, M., Leghtas, Z., Albert, V.V., Touzard, S., Schoelkopf, R.J.,
Jiang, L., &amp; Devoret, M.H. (2014). Dynamically protected cat-qubits: A new
paradigm for universal quantum computation. <em>New Journal of Physics</em>, 16,
045014. <a href="https://doi.org/10.1088/1367-2630/16/4/045014">https://doi.org/10.1088/1367-2630/16/4/045014</a></p>
</li>
<li>
<p>Schrödinger, E. (1935). Die gegenwärtige Situation in der Quantenmechanik.
<em>Naturwissenschaften</em>, 23(48), 807–812; 23(49), 823–828; 23(50), 844–849.
<a href="https://doi.org/10.1007/BF01491891">https://doi.org/10.1007/BF01491891</a></p>
</li>
<li>
<p>Walls, D.F., &amp; Milburn, G.J. (2008). <em>Quantum Optics</em> (2nd ed.). Springer.</p>
</li>
<li>
<p>Zurek, W.H. (2003). Decoherence, einselection, and the quantum origins of
the classical. <em>Reviews of Modern Physics</em>, 75(3), 715–775.
<a href="https://doi.org/10.1103/RevModPhys.75.715">https://doi.org/10.1103/RevModPhys.75.715</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-02-17</strong>: Updated &ldquo;bit-flip times exceeding seven minutes&rdquo; in the summary to &ldquo;exceeding minutes,&rdquo; aligning with the sourced figures: the body text reports &ldquo;beyond several minutes&rdquo; and Reglade et al. (2024) report &ldquo;exceeding ten seconds.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Below Threshold: What Google&#39;s Willow Chip Actually Proved</title>
      <link>https://sebastianspicker.github.io/posts/quantum-error-correction-willow/</link>
      <pubDate>Mon, 13 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/quantum-error-correction-willow/</guid>
      <description>In December 2024, Google published a Nature paper announcing that their Willow chip demonstrated quantum error correction below the threshold — the point at which larger codes become more reliable, not less. The headline about &amp;ldquo;10^25 years of classical computation&amp;rdquo; was technically true and mostly a distraction. The real result is more important and less flashy: for the first time, a quantum processor demonstrated that logical error rates decrease exponentially as the code grows. This is what scalable quantum computing looks like at its first credible step.</description>
      <content:encoded><![CDATA[<p>On December 9, 2024, Google published a paper in <em>Nature</em> announcing results from their 105-qubit Willow chip. The press release led with the number that immediately spread across every technology news outlet on the planet: a computation that would take today&rsquo;s fastest classical supercomputers $10^{25}$ years. For reference, the age of the universe is roughly $1.4 \times 10^{10}$ years, which makes the claimed classical runtime about $10^{15}$ times longer than the universe has existed.</p>
<p>Impressive. Also: almost entirely a distraction from what actually matters.</p>
<p>The real result is buried in the middle of the paper, requires knowing what the threshold theorem says to appreciate, and generated a fraction of the press coverage. Google&rsquo;s Willow chip demonstrated quantum error correction operating below the threshold for the first time in a superconducting processor. This is the result that will matter in twenty years. The $10^{25}$-year number will have been forgotten by then — or quietly revised as classical simulation algorithms improve.</p>
<p>Let me explain why the threshold result is the one worth understanding.</p>
<h2 id="why-quantum-errors-are-a-fundamental-obstacle">Why quantum errors are a fundamental obstacle</h2>
<p>When I first encountered quantum error correction as a student, my reaction was something like: surely you just copy the qubit a few times and take a majority vote, the way classical error correction works. This is wrong, and the reason it is wrong is elegant.</p>
<p>Quantum computation depends on superposition and entanglement. A quantum state like</p>
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$<p>encodes information in the complex amplitudes $\alpha$ and $\beta$. But real physical systems do not exist in isolation. They interact with their environment — thermal fluctuations, stray electromagnetic fields, unwanted couplings to neighbouring systems. This coupling causes <strong>decoherence</strong>: the quantum state entangles with environmental degrees of freedom, and the superposition is effectively destroyed as the relative phase between $|0\rangle$ and $|1\rangle$ becomes random. For superconducting qubits of the type used in Willow, coherence times are on the order of microseconds to hundreds of microseconds. Every gate operation takes tens to hundreds of nanoseconds. After a few hundred gate operations, the accumulated decoherence and gate errors have corrupted the quantum state beyond use.</p>
<p>The classical remedy — store each bit three times and take the majority vote — fails for a fundamental reason. <strong>The no-cloning theorem</strong> states that there is no unitary operation $U$ such that</p>
$$U(|\psi\rangle \otimes |0\rangle) = |\psi\rangle \otimes |\psi\rangle$$<p>for all states $|\psi\rangle$. The proof is a one-liner: unitary evolution is linear, so if $U$ correctly copies $|0\rangle$ and $|1\rangle$, it maps $\frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)|0\rangle$ to $\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$, which is an entangled state, not the product state $\bigl(\tfrac{1}{\sqrt{2}}(|0\rangle + |1\rangle)\bigr)^{\otimes 2}$. You cannot copy an arbitrary quantum state. Classical redundancy, applied naively, is forbidden by the linearity of quantum mechanics.</p>
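<p>The linearity argument can be watched failing in a few lines. A CNOT gate copies the basis states perfectly, but — as this state-vector sketch shows — it entangles a superposition rather than cloning it:</p>

```python
import numpy as np

# CNOT duplicates basis states (|0>|0> -> |00>, |1>|0> -> |11>), but by
# linearity it entangles a superposition instead of cloning it.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
plus = (ket0 + ket1) / np.sqrt(2)        # the superposition we try to "copy"

out = CNOT @ np.kron(plus, ket0)         # what the basis-state copier actually produces
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
clone = np.kron(plus, plus)              # what a true clone would look like

print(np.allclose(out, bell))            # True: an entangled Bell state
print(np.allclose(out, clone))           # False: no clone was produced
```
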
<p>So quantum error correction requires a genuinely different idea.</p>
<h2 id="the-key-insight-measure-the-error-not-the-state">The key insight: measure the error, not the state</h2>
<p>The idea that unlocks quantum error correction is this: you can extract information about <strong>which error occurred</strong> without learning anything about the <strong>logical state</strong> the qubit is encoding.</p>
<p>Peter Shor showed in 1995 (<a href="#ref-Shor1995">Shor, 1995</a>) that one logical qubit can be protected using 9 physical qubits. The construction is worth understanding in some detail because it reveals the structure that all subsequent codes share.</p>
<h3 id="bit-flip-protection">Bit-flip protection</h3>
<p>First, consider only bit-flip errors: physical processes that flip $|0\rangle \leftrightarrow |1\rangle$ with probability $p$. Encode:</p>
$$|0\rangle_L = |000\rangle, \quad |1\rangle_L = |111\rangle$$<p>A logical superposition $\alpha|0\rangle_L + \beta|1\rangle_L = \alpha|000\rangle + \beta|111\rangle$ is an entangled state — it cannot be factored — but it is not a copy of $|\psi\rangle$; it is a different encoding.</p>
<p>If qubit 1 flips, the state becomes $\alpha|100\rangle + \beta|011\rangle$. We detect this by measuring the <strong>syndrome operators</strong> $Z_1 Z_2$ and $Z_2 Z_3$ — products of Pauli-Z operators. The eigenvalue of $Z_1 Z_2$ is $+1$ if qubits 1 and 2 have the same value, $-1$ if they differ. Crucially, measuring $Z_1 Z_2$ does <strong>not</strong> collapse the logical state: both branches of the encoded superposition are eigenstates with the same eigenvalue, so the measurement tells you about the <em>error</em> without revealing $\alpha$ or $\beta$. This is the central trick of quantum error correction.</p>
<p>The syndrome outcome $(Z_1 Z_2, Z_2 Z_3) = (-1, +1)$ tells you qubit 1 flipped; apply $X_1$ to fix it. The logical state is restored.</p>
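<p>The whole cycle — encode, inject an error, extract the syndrome, correct — fits in a short state-vector simulation. A sketch (the amplitudes $\alpha = 0.6$, $\beta = 0.8$ are arbitrary choices for the demo):</p>

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
kron = lambda *ops: reduce(np.kron, ops)

# Encode a logical qubit: α|000⟩ + β|111⟩
alpha, beta = 0.6, 0.8
state = np.zeros(8)
state[0b000], state[0b111] = alpha, beta

state = kron(X, I2, I2) @ state            # bit-flip error on qubit 1

# Syndrome extraction: the corrupted state is an eigenstate of both
# stabilizers, so the expectation values are exactly ±1.
s12 = int(round(state @ kron(Z, Z, I2) @ state))
s23 = int(round(state @ kron(I2, Z, Z) @ state))

# Look up which single-qubit flip produced this syndrome and undo it
correction = {(-1, +1): kron(X, I2, I2),   # qubit 1 flipped
              (-1, -1): kron(I2, X, I2),   # qubit 2 flipped
              (+1, -1): kron(I2, I2, X)}   # qubit 3 flipped
recovered = correction.get((s12, s23), np.eye(8)) @ state

print("syndrome:", (s12, s23))                            # (-1, 1): qubit 1 flipped
print("recovered:", recovered[0b000], recovered[0b111])   # α and β restored
```
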
<h3 id="phase-flip-protection">Phase-flip protection</h3>
<p>Phase errors flip the relative sign: $|+\rangle \leftrightarrow |-\rangle$ where $|\pm\rangle = (|0\rangle \pm |1\rangle)/\sqrt{2}$. Apply a Hadamard to rotate to the X basis, where phase flips look like bit flips, and apply the same three-qubit code.</p>
<h3 id="concatenation-the-shor-9-qubit-code">Concatenation: the Shor 9-qubit code</h3>
<p>Protect each of the three qubits in the phase-flip code with its own bit-flip code. The result: 9 physical qubits per logical qubit, protected against any single-qubit error. If the physical error rate is $p$, the logical error rate for the 9-qubit code scales as $p^2$ (you need at least two errors to fool the code), rather than $p$. Already an improvement.</p>
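<p>The quadratic suppression is easy to verify for the underlying three-qubit majority vote, which fails only when at least two qubits flip — the same mechanism that gives the 9-qubit code its $p^2$ scaling. A Monte Carlo sketch:</p>

```python
import random

def logical_error_rate(p, trials=200_000, seed=0):
    """Monte Carlo estimate of the majority-vote failure rate:
    the 3-qubit repetition code fails when two or more qubits flip."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(trials):
        flips = sum(rng.random() < p for _ in range(3))
        fails += flips >= 2
    return fails / trials

p = 0.05
mc = logical_error_rate(p)
exact = 3 * p**2 * (1 - p) + p**3      # ≥2-of-3 binomial probability
print(mc, exact, 3 * p**2)             # the estimate tracks ≈ 3p²
```
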
<h3 id="the-threshold-theorem">The threshold theorem</h3>
<p>Shor&rsquo;s code is a proof of principle. The <strong>threshold theorem</strong>, established independently by Aharonov and Ben-Or (1997) and Knill, Laflamme, and Zurek (1998), shows something much stronger. For a level-$k$ concatenated code — a code of codes of codes, $k$ levels deep — the logical error rate scales as</p>
$$p_L \sim \left(\frac{p}{p_{\text{th}}}\right)^{2^k}$$<p>where $p_{\text{th}}$ is the <strong>threshold</strong>: a critical physical error rate that depends on the code family. Below threshold ($p < p_{\text{th}}$), adding more code levels exponentially suppresses logical errors. Above threshold ($p > p_{\text{th}}$), more qubits make things worse — each additional physical qubit introduces more errors than the code can correct.</p>
<p>The threshold is not a continuous improvement. It is a phase transition. Below it, the system is in the error-correctable regime; above it, it is not. Getting a physical quantum processor into the sub-threshold regime is a necessary condition for scalable fault-tolerant quantum computing.</p>
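<p>The phase-transition character shows up immediately if you evaluate the concatenation formula on both sides of the threshold (the value $p_{\text{th}} = 1\%$ here is an illustrative assumption, not a property of any particular code):</p>

```python
# p_L ~ (p/p_th)^(2^k): below threshold each extra concatenation level
# squares the suppression; above threshold it squares the blow-up.
p_th = 0.01

def logical_rate(p, k):
    return (p / p_th) ** (2 ** k)

for p in (0.001, 0.02):
    label = "below" if p < p_th else "above"
    trend = [logical_rate(p, k) for k in range(4)]
    print(f"p = {p} ({label} threshold):", " ".join(f"{v:.3g}" for v in trend))
```
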
<h2 id="surface-codes-the-practical-path">Surface codes: the practical path</h2>
<p>Concatenated codes require encoding at multiple levels, which multiplies overhead rapidly. Surface codes, analysed in detail by Kitaev (1997) and comprehensively by Dennis, Kitaev, Landahl, and Preskill (2002) (<a href="#ref-Dennis2002">Dennis et al., 2002</a>), offer a more practical architecture and have become the leading candidate for fault-tolerant quantum computing.</p>
<h3 id="the-geometry">The geometry</h3>
<p>A distance-$d$ surface code arranges $d^2$ physical <strong>data qubits</strong> on the vertices of a $d \times d$ grid, interleaved with $(d^2 - 1)$ <strong>ancilla qubits</strong> used for syndrome measurement. The stabilizers are products of $Z$ operators on groups of four data qubits surrounding each face (detecting bit-flip errors) and products of $X$ operators on groups of four data qubits surrounding each vertex (detecting phase-flip errors). Measuring these stabilizers without disturbing the logical qubit is the workhorse operation of the code.</p>
<p>The <strong>code distance</strong> $d$ is the minimum number of physical errors required to produce an undetectable logical error. An error chain of length $d$ connecting opposite boundaries of the code is the smallest error pattern that corrupts the logical qubit without triggering a syndrome. Larger $d$: longer chains required, lower logical error rates.</p>
<h3 id="the-scaling">The scaling</h3>
<p>The logical error rate per error-correction round for a surface code is approximately (<a href="#ref-Fowler2012">Fowler et al., 2012</a>):</p>
$$p_L \approx A \left(\frac{p}{p_{\text{th}}}\right)^{\lfloor (d+1)/2 \rfloor}$$<p>where $p_{\text{th}} \approx 1\%$ for surface codes (established by threshold simulations, and robust across reasonable noise models), $p$ is the physical error rate for two-qubit gates, $A$ is a code-specific constant of order unity, and $\lfloor (d+1)/2 \rfloor$ is the exponent that grows with code size.</p>
<p>This is the critical expression. For fixed physical error rate $p < p_{\text{th}}$:</p>
<ul>
<li>Increasing $d$ by 2 (one step in the code distance ladder) increases the exponent by 1 — a multiplicative suppression of $p/p_{\text{th}}$.</li>
<li>If $p/p_{\text{th}} = 0.18$, each distance step multiplies the logical error rate by roughly 0.18 — almost a factor of 6 suppression per step.</li>
</ul>
<p>For $p > p_{\text{th}}$: increasing $d$ makes $p_L$ worse. The code spends more effort chasing more errors than it eliminates.</p>
<p>The behaviour below versus above threshold is qualitatively different. Exponential suppression versus exponential growth. The dividing line is $p = p_{\text{th}}$.</p>
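<p>Evaluating the surface-code expression with $A = 1$ makes the two regimes concrete. This is an idealisation — it counts only two-qubit gate errors — so the absolute numbers should not be read as predictions for any real device:</p>

```python
# A = 1 and p_th = 1% are assumptions; only two-qubit gate errors enter.
p_th = 0.01

def p_logical(p, d):
    return (p / p_th) ** ((d + 1) // 2)

for p in (0.0018, 0.02):               # sub-threshold vs above-threshold
    for d in (3, 5, 7):
        print(f"p = {p:.2%}, d = {d}:  p_L ~ {p_logical(p, d):.2e}")
```

<p>Below threshold, each increase in $d$ multiplies $p_L$ by $p/p_{\text{th}} < 1$; above threshold, the same increase multiplies it by a factor greater than one.</p>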
<h2 id="what-willow-demonstrated">What Willow demonstrated</h2>
<p>Google&rsquo;s Willow chip (<a href="#ref-Acharya2024">Acharya et al., 2024</a>) is a 105-qubit superconducting processor. The physical two-qubit gate error rate achieved is approximately $p \approx 0.18\%$ — well below the surface code threshold of $\sim 1\%$.</p>
<p>The experiment is direct. Implement surface codes at distances $d = 3$, $d = 5$, and $d = 7$, corresponding to 9, 25, and 49 physical data qubits per logical qubit (plus ancillas). Measure the logical error rate per error correction cycle for each code size.</p>
<p>The result:</p>
<table>
  <thead>
      <tr>
          <th>Code distance $d$</th>
          <th>Physical qubits</th>
          <th>Logical error rate per cycle</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>3</td>
          <td>17</td>
          <td>$\approx 0.65\%$</td>
      </tr>
      <tr>
          <td>5</td>
          <td>49</td>
          <td>$\approx 0.30\%$</td>
      </tr>
      <tr>
          <td>7</td>
          <td>97</td>
          <td>$\approx 0.143\%$</td>
      </tr>
  </tbody>
</table>
<p>Each step in code distance roughly halves the logical error rate: the measured suppression factor is $\Lambda \approx 2.14$ per distance step. This is weaker than the idealised prediction of the formula above, which for $p/p_{\text{th}} \approx 0.18$ gives a factor of about 5.6 per exponent increment. The gap is unsurprising: the effective error rate the code sees aggregates measurement, idling, and leakage errors on top of the two-qubit gate error. What the formula predicts and the data confirm is the behaviour that matters: suppression exponential in the code distance.</p>
<p>This is the first time a superconducting quantum processor has demonstrated below-threshold error correction with the correct exponential scaling. Previous experiments showed that quantum error correction <em>works</em> in principle — syndromes can be measured, errors can be corrected. What had not been demonstrated was the <strong>exponential suppression</strong> with code size that the threshold theorem predicts. Without that scaling, error correction merely shifts the error rate; it cannot drive it to arbitrarily small values by increasing code size. With it, the path to fault tolerance is open in principle.</p>
<p>I want to be precise about what &ldquo;for the first time&rdquo; means here, because the claims in quantum computing tend to sprawl. Earlier work had demonstrated below-threshold error correction in other qubit modalities and at smaller scales. What Willow adds is the combination: a superconducting processor, three distinct code distances, clean exponential scaling, and a physical qubit count sufficient to demonstrate $d=7$ without other dominant error sources overwhelming the measurement. The data is convincing.</p>
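<p>The gap between the idealised formula and the measurement fits in a few lines. Here $\Lambda = 2.14$ is the per-step factor reported in the Willow paper, and the &ldquo;predicted&rdquo; value takes the scaling formula at face value with $p/p_{\text{th}} = 0.18$:</p>
<pre><code class="language-python">p_over_pth = 0.18
predicted = 1 / p_over_pth   # idealised suppression per distance step, about 5.6x
measured = 2.14              # per-step factor reported for Willow (Acharya et al., 2024)

# The measured factor is smaller because the code sees an aggregate error rate
# (gates, idling, measurement, leakage), not just the two-qubit gate error.
# Exponential suppression holds either way: k further distance steps reduce
# p_L by a factor of measured ** k.
print(round(predicted, 2), round(measured ** 3, 2))  # 5.56 9.8
</code></pre>
<p>Even at the measured rate, three more distance steps buy roughly a tenfold reduction.</p>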
<h2 id="the-random-circuit-sampling-result-and-its-limits">The random circuit sampling result and its limits</h2>
<p>Now the $10^{25}$ years.</p>
<p>Random circuit sampling (RCS) is a computational task defined as follows: apply a sequence of randomly chosen quantum gates to a collection of qubits and sample from the resulting output distribution. The output distribution of a deep random circuit over $n$ qubits is believed to be classically hard to simulate: the best known classical algorithms scale exponentially in $n$. Google&rsquo;s Willow chip performed RCS on a 105-qubit circuit in under 5 minutes. The $10^{25}$-year figure is an estimate of the time required for a Frontier-class supercomputer to simulate the same computation classically, extrapolated from benchmarks on smaller circuit sizes.</p>
<p>I will not contest the $10^{25}$-year number specifically — the extrapolation is defensible given current knowledge of classical simulation algorithms. But several things should be said about what the benchmark means.</p>
<p><strong>RCS has no known practical application.</strong> It is not a step toward factoring integers, simulating molecules, or solving optimisation problems. It is a task designed to be hard for classical computers while being easy for quantum ones — a benchmark of <em>quantum hardness</em>, not <em>quantum usefulness</em>.</p>
<p><strong>Classical simulation of random circuits is an active research area.</strong> The best classical algorithms for this task have improved substantially over the past five years. A result that seems to require $10^{25}$ years today may require $10^{10}$ years after a better classical algorithm is published. This has happened before: Google&rsquo;s 2019 &ldquo;quantum supremacy&rdquo; claim was significantly eroded by subsequent classical simulation improvements. I expect the same here, to some degree.</p>
<p><strong>Extrapolation is hard.</strong> The $10^{25}$-year estimate involves scaling classical simulation costs across many orders of magnitude in circuit size, from regimes where simulation is feasible to regimes where it is not. The uncertainty in the estimate is correspondingly large.</p>
<p>None of this makes the RCS result fraudulent or uninteresting. It is a genuine demonstration that a quantum processor can perform a specific task at a scale that classical computers plausibly cannot match. But calling this &ldquo;quantum advantage&rdquo; in the sense that matters — useful computation performed faster than any classical alternative — overstates it considerably.</p>
<p>The threshold result, by contrast, does not depend on classical simulation hardness arguments. It depends on measuring $p_L$ at $d = 3, 5, 7$ and checking whether the sequence is decreasing and consistent with the theoretical prediction. The data are directly interpretable without extrapolation. That is why I find the threshold result more significant.</p>
<h2 id="where-we-actually-are--and-the-gap">Where we actually are — and the gap</h2>
<p>The Willow chip has 105 physical qubits. The threshold result was demonstrated at $d = 7$, using roughly a hundred physical qubits for one logical qubit (including ancillas). In other words: Willow demonstrated roughly <strong>one logical qubit</strong> operating below threshold.</p>
<p>A cryptographically relevant quantum computer — one capable of breaking RSA-2048 using Shor&rsquo;s algorithm — requires approximately (<a href="#ref-GidneyEkera2021">Gidney &amp; Ekerå, 2021</a>):</p>
<ul>
<li>a few thousand logical qubits for the factoring computation itself (roughly $3n$ for an $n$-bit modulus, with optimised circuits)</li>
<li>$\sim 1000$ to $10{,}000$ physical qubits per logical qubit, depending on the target logical error rate and physical error rate</li>
<li>Total: roughly <strong>20 million physical qubits</strong>, operating through millions of error correction rounds</li>
</ul>
<p>The gap between &ldquo;one logical qubit at $d=7$, 105 physical qubits&rdquo; and &ldquo;20 million physical qubits, millions of coherent error-correction rounds&rdquo; is not a gap of ten percent or a factor of two. It is four to five orders of magnitude in qubit count alone, and the engineering challenges of maintaining coherence, connectivity, and calibration across millions of physical qubits are qualitatively different from maintaining them across 105.</p>
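<p>The order of magnitude can be reproduced with a back-of-envelope calculation. Assumptions, loudly labelled: $A = 1$, an illustrative target logical error rate of $10^{-12}$ per round, $p/p_{\text{th}} = 0.18$, roughly $3n \approx 6000$ logical qubits for RSA-2048 following Gidney &amp; Ekerå, and no allowance for magic-state distillation, which real estimates must include:</p>
<pre><code class="language-python">import math

p_ratio = 0.18     # p / p_th for Willow-class hardware
target = 1e-12     # illustrative target logical error rate per round

# smallest exponent k with p_ratio ** k at or below the target
k = math.ceil(math.log(target) / math.log(p_ratio))
d = 2 * k - 1                          # since floor((d + 1) / 2) = k
physical_per_logical = 2 * d * d - 1   # data plus ancilla qubits per patch
total = physical_per_logical * 6000    # times a rough logical-qubit count

print(d, physical_per_logical, total)  # 33 2177 13062000
</code></pre>
<p>Thirteen million is the same order as the twenty million of the full estimate; the missing factor is precisely the overhead this sketch ignores.</p>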
<p>We are on the right curve. We are not close to the destination.</p>
<p>The relevant trajectory — assuming continued improvement in physical error rates, qubit counts, and error correction overhead — places cryptographically relevant quantum computing somewhere between ten and thirty years away. Those estimates are uncertain by design; the field has repeatedly surprised in both directions. But I am confident about the qualitative picture: Willow demonstrates that the physics works at small scale; the engineering challenge of scaling it up is immense.</p>
<h2 id="nist-post-quantum-cryptography-standards">NIST post-quantum cryptography standards</h2>
<p>Here is where I become impatient with the framing that says &ldquo;don&rsquo;t worry, quantum computers are decades away.&rdquo;</p>
<p>In August 2024, NIST released three finalised post-quantum cryptographic standards:</p>
<ul>
<li><strong>FIPS 203</strong> (ML-KEM, formerly CRYSTALS-Kyber): key encapsulation based on Module Learning With Errors (Module-LWE), a lattice problem</li>
<li><strong>FIPS 204</strong> (ML-DSA, formerly CRYSTALS-Dilithium): digital signatures based on Module-LWE</li>
<li><strong>FIPS 205</strong> (SLH-DSA, formerly SPHINCS+): hash-based digital signatures — the conservative option, with no known vulnerability to either classical or quantum attacks, at the cost of larger signature sizes</li>
</ul>
<p>These standards exist because the cryptographic community understands something that the &ldquo;decades away&rdquo; framing obscures: <strong>the threat does not begin when a quantum computer exists; it begins when the encrypted data is harvested.</strong> The attack is called &ldquo;Harvest Now, Decrypt Later.&rdquo; State-level adversaries — and I do not think it is paranoid to assume that multiple state intelligence agencies are collecting encrypted internet traffic today — archive encrypted data with the intention of decrypting it once cryptographically relevant quantum computing becomes available.</p>
<p>For data encrypted today with RSA or elliptic curve cryptography, the protection window is however long it takes for a cryptographically relevant quantum computer to be built. If that is fifteen years, data encrypted with RSA-2048 today and collected by a patient adversary is vulnerable within fifteen years. For most data, fifteen-year confidentiality is adequate — a credit card number from 2025 is not sensitive in 2040. But for state secrets, medical records, long-term financial instruments, and critical infrastructure keys, fifteen-year confidentiality is not even close to adequate.</p>
<p>The migration to post-quantum cryptography should be happening now. In many places, it is not.</p>
<p>The mathematical security of the NIST standards rests on lattice problems. The security of ML-KEM reduces to the hardness of Module-LWE: find a short vector in a high-dimensional lattice with noise. The best known classical algorithms for this run in time exponential in the lattice dimension; the best known quantum algorithms offer no meaningful advantage, and Shor&rsquo;s algorithm in particular does not apply, because Module-LWE has neither the hidden-subgroup nor the discrete-logarithm structure that Shor exploits. The security reduction is to worst-case lattice problems; this is a strong theoretical foundation compared to the purely conjectured hardness of many classical cryptographic assumptions.</p>
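<p>For readers who want the shape of the problem rather than the full machinery, here is a toy LWE instance. The parameters are far too small to be secure, and real ML-KEM works over module lattices with polynomial rings, which this sketch does not attempt:</p>
<pre><code class="language-python">import random

random.seed(0)
q, n, m = 97, 8, 16   # toy modulus and dimensions; real schemes use far larger
A = [[random.randrange(q) for _ in range(n)] for _ in range(m)]
s = [random.randrange(q) for _ in range(n)]           # the secret
e = [random.choice([q - 1, 0, 1]) for _ in range(m)]  # small noise: -1, 0, +1 mod q

# public instance: b = A s + e (mod q)
b = [(sum(A[i][j] * s[j] for j in range(n)) + e[i]) % q for i in range(m)]

# Recovering s from (A, b) is the LWE problem. With e = 0 it is plain linear
# algebra over Z_q and trivially easy; the noise is what makes the best known
# attacks take time exponential in the dimension n.
</code></pre>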
<p>Whether post-quantum cryptographic standards will still look secure in twenty years is an empirical question that the cryptanalytic community will continue to probe. But the alternative — remaining on RSA while a sufficiently patient adversary harvests encrypted traffic — is worse.</p>
<p>The existence of Willow is not an argument that RSA is broken. It is an argument that the threshold for &ldquo;good enough to be a real threat&rdquo; has moved from &ldquo;theoretical possibility&rdquo; to &ldquo;demonstrated at small scale with correct exponential scaling.&rdquo; The curve is real. Act accordingly. (And if you are responsible for cryptographic infrastructure at an institution and have not yet read the <a href="/posts/public-money-public-code/">public money, public code</a> argument for open, auditable cryptographic implementations — it applies doubly here.)</p>
<h2 id="the-cat-qubit-alternative">The cat qubit alternative</h2>
<p>I have written elsewhere about <a href="/posts/schrodinger-cat-qubits/">bosonic cat qubits</a> as an alternative approach to error correction. It is worth briefly noting the contrast with the surface code philosophy.</p>
<p>Willow&rsquo;s surface codes take a universal approach to errors: both bit-flip and phase-flip errors are corrected by the same 2D stabilizer code, requiring large 2D arrays of physical qubits with nearest-neighbour connectivity. The code distance $d$ drives both error types down together.</p>
<p>The cat qubit approach, exemplified by Alice &amp; Bob&rsquo;s recent result (<a href="#ref-Reglade2024">Reglade et al., 2024</a>), encodes a logical qubit in a superposition of coherent states $|\pm\alpha\rangle$ in a harmonic oscillator. Engineered two-photon dissipation stabilises the oscillator in this manifold and suppresses bit-flip errors exponentially in $\alpha^2$ — the mean photon number — at the hardware level. Phase-flip errors remain, but they are the <em>only</em> dominant error mode, and they can be corrected with a simpler one-dimensional outer code rather than a full 2D surface code.</p>
<p>The overhead reduction could be substantial. If bit-flip errors are already exponentially suppressed by the hardware, you do not need a 2D code with overhead scaling as $d^2$ physical qubits per logical qubit. A 1D repetition code over cat qubits might achieve the same logical error rate with far fewer physical qubits. Alice &amp; Bob have demonstrated cat qubits with bit-flip times exceeding ten seconds — many orders of magnitude longer than the phase-flip time. The gamble is that this asymmetry persists as the system scales and that the phase-flip outer code is tractable.</p>
<p>Surface codes and cat qubits are different bets on the same fundamental problem: how do you make the overhead of fault-tolerant quantum computing manageable? Surface codes are the more conservative bet — they work with any qubit that meets the error rate threshold, regardless of error anisotropy. Cat qubits are the more speculative bet — they require maintaining a specific nonlinear oscillator regime at scale, but the payoff in overhead reduction could be decisive. Both approaches are credible. Neither has been demonstrated at the scale where the comparison becomes definitive.</p>
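<p>The scaling argument behind the overhead claim fits in arithmetic. The counts below are idealised footprints (data plus ancilla qubits, $2d^2 - 1$ versus $2d - 1$); real architectures add routing and readout overhead, and the cat-qubit count ignores the hardware cost of stabilising each oscillator:</p>
<pre><code class="language-python">d = 15                      # illustrative code distance
surface = 2 * d * d - 1     # physical qubits per logical qubit, 2D surface code
repetition = 2 * d - 1      # cat qubits per logical qubit, 1D repetition code

print(surface, repetition)  # 449 29
</code></pre>
<p>The quadratic-versus-linear gap is the entire bet: if the bit-flip asymmetry survives scaling, the 1D code wins by a factor of $d$.</p>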
<h2 id="what-the-threshold-result-actually-means">What the threshold result actually means</h2>
<p>Let me close by saying precisely what I think the Willow result establishes and what it does not.</p>
<p>It establishes that below-threshold quantum error correction exists outside of theory. The threshold theorem says that below $p_{\text{th}}$, logical error rates decrease exponentially as code size grows. Willow demonstrates this behaviour at $d = 3, 5, 7$ in a superconducting processor. The theoretical prediction and the experimental observation are consistent. This is not a small thing. The threshold theorem has been the theoretical backbone of fault-tolerant quantum computing since the late 1990s; it is genuinely satisfying to see its core prediction — exponential scaling — confirmed experimentally.</p>
<p>It establishes that the physical error rates of superconducting qubits can be brought below the surface code threshold. $p \approx 0.18\%$ against $p_{\text{th}} \approx 1\%$ gives a comfortable margin. The ratio $p/p_{\text{th}} \approx 0.18$ sets the idealised suppression per unit increase in the exponent $\lfloor (d+1)/2 \rfloor$: a factor of roughly 5–6 per code distance step in the formula, about 2 per step as measured. Either is enough to drive logical error rates to useful levels at moderate code distances, without requiring physical error rates of $10^{-4}$ or lower.</p>
<p>It does not establish that a cryptographically relevant quantum computer is imminent, near-term, or easy to build. The gap from one below-threshold logical qubit to 20 million physical qubits is real and large. The engineering challenges of superconducting quantum computers at scale — refrigeration, wiring, control electronics, cross-talk, calibration drift — are not solved by demonstrating $d=7$.</p>
<p>And the $10^{25}$-year benchmark is technically defensible and strategically irrelevant. Classical simulation of random circuits is an interesting research problem. It is not the problem that quantum computers are being built to solve.</p>
<p>The result that matters is the curve: as code distance grows, logical error rates fall exponentially. We are on that curve. We have not arrived anywhere yet. But for the first time in the history of quantum computing, the evidence says we are moving in the right direction with the right scaling. That is, genuinely, a significant step.</p>
<p>Start migrating your cryptographic infrastructure.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Shor1995"></span>Shor, P. W. (1995). Scheme for reducing decoherence in quantum computer memory. <em>Physical Review A</em>, 52(4), R2493–R2496. <a href="https://doi.org/10.1103/PhysRevA.52.R2493">DOI: 10.1103/PhysRevA.52.R2493</a></p>
</li>
<li>
<p><span id="ref-Dennis2002"></span>Dennis, E., Kitaev, A., Landahl, A., &amp; Preskill, J. (2002). Topological quantum memory. <em>Journal of Mathematical Physics</em>, 43(9), 4452–4505. <a href="https://doi.org/10.1063/1.1499754">DOI: 10.1063/1.1499754</a></p>
</li>
<li>
<p><span id="ref-Fowler2012"></span>Fowler, A. G., Mariantoni, M., Martinis, J. M., &amp; Cleland, A. N. (2012). Surface codes: Towards practical large-scale quantum computation. <em>Physical Review A</em>, 86(3), 032324. <a href="https://doi.org/10.1103/PhysRevA.86.032324">DOI: 10.1103/PhysRevA.86.032324</a></p>
</li>
<li>
<p><span id="ref-Acharya2024"></span>Acharya, R., et al. (Google Quantum AI). (2024). Quantum error correction below the surface code threshold. <em>Nature</em>, 636, 639–646. <a href="https://doi.org/10.1038/s41586-024-08449-y">DOI: 10.1038/s41586-024-08449-y</a></p>
</li>
<li>
<p><span id="ref-GidneyEkera2021"></span>Gidney, C., &amp; Ekerå, M. (2021). How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits. <em>Quantum</em>, 5, 433. <a href="https://doi.org/10.22331/q-2021-04-15-433">DOI: 10.22331/q-2021-04-15-433</a></p>
</li>
<li>
<p><span id="ref-NIST2024"></span>National Institute of Standards and Technology. (2024). NIST Releases First 3 Finalized Post-Quantum Encryption Standards (FIPS 203, 204, 205). Retrieved from <a href="https://www.nist.gov/news-events/news/2024/08/nist-releases-first-3-finalized-post-quantum-encryption-standards">https://www.nist.gov/news-events/news/2024/08/nist-releases-first-3-finalized-post-quantum-encryption-standards</a></p>
</li>
<li>
<p><span id="ref-Reglade2024"></span>Reglade, U., et al. (2024). Quantum control of a cat qubit with bit-flip times exceeding ten seconds. <em>Nature</em>, 629, 778–783. <a href="https://doi.org/10.1038/s41586-024-07294-3">DOI: 10.1038/s41586-024-07294-3</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-02-17</strong>: Updated the Fowler et al. (2012) author list to &ldquo;Fowler, A. G., Mariantoni, M., Martinis, J. M., &amp; Cleland, A. N.&rdquo; — the previous list had been mixed with a different 2012 Fowler paper.</li>
<li><strong>2026-02-17</strong>: Updated the closing section to &ldquo;20 million physical qubits,&rdquo; matching the Gidney &amp; Ekerå (2021) figure cited earlier in the article.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Artificial Intelligence in Music Pedagogy: Curriculum Implications from a Thementag</title>
      <link>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</link>
      <pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-music-pedagogy-day/</guid>
      <description>On 2 December 2024 I gave three workshops at HfMT Köln&amp;rsquo;s Thementag on AI and music education. The handouts covered data protection, AI tools for students, and AI in teaching. This post is the argument behind them — focused on the curriculum question that none of the tools answer on their own: what should change, and what should not?</description>
      <content:encoded><![CDATA[<p><em>On 2 December 2024, the Hochschule für Musik und Tanz Köln held a Thementag:
&ldquo;Next level? Künstliche Intelligenz und Musikpädagogik im Dialog.&rdquo; I gave three
workshops — on data protection and AI, on AI tools for students, and on AI in
teaching. The handouts from those sessions cover the practical and regulatory
ground. This post is the argument behind them: what I think changes in music
education when these tools become ambient, and what I think does not.</em></p>
<hr>
<h2 id="the-occasion">The Occasion</h2>
<p>&ldquo;Next level?&rdquo; The question mark is doing real work. The framing HfMT chose for
the day was appropriately provisional: not a declaration that AI has already
transformed music education, but an invitation to ask whether, in what
direction, and at what cost.</p>
<p>The invitations that reach me for events like this tend to come with one of two
framings. The first is enthusiasm: AI is coming, we need to get ahead of it,
here are tools your students are already using. The second is anxiety: AI is
coming, it threatens everything we do, we need to protect students from it.
Both framings are understandable. Neither is adequate to the curriculum
question, which is slower-moving and more structural than either suggests.</p>
<p>I prepared three sets of handouts. The first covered data protection — the
least glamorous topic in AI education, and the one that most directly
determines what can legally be deployed in a university setting. The second
covered AI tools for students: what exists, what it does, and what critical
thinking skills you need to use it without being used by it. The third covered
AI for instructors: where it helps, where it flatters, and where it makes
things worse.</p>
<p>This post does not recapitulate the handouts. It addresses the question I kept
returning to across all three workshops: what does this change about what a
music student needs to learn?</p>
<hr>
<h2 id="what-the-technology-actually-is">What the Technology Actually Is</h2>
<p>My physics training left me professionally uncomfortable
with hand-waving — including my own. Before discussing curriculum implications,
it is worth being specific about what these tools are.</p>
<p>The dominant paradigm in current AI — responsible for ChatGPT, for Whisper, for
Suno.AI, for Google Magenta, for the large language models whose outputs are
now visible everywhere — is the transformer architecture (Vaswani et al.,
2017). A transformer is a neural network that processes sequences by computing,
for each element, a weighted attention over all other elements. The attention
weights are learned from data. The result is a model that can capture
long-range dependencies in sequences — text, audio, musical notes — without the
recurrence that made earlier architectures difficult to train at scale.</p>
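<p>The attention operation described above is compact enough to sketch directly. This is scaled dot-product attention on toy vectors, with no learned weights (a real transformer learns projection matrices for queries, keys, and values; this sketch skips them):</p>
<pre><code class="language-python">import math

def softmax(xs):
    m = max(xs)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # each output is a weighted mix of all values, with weights set by
    # the similarity between one query and every key
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)             # attention weights, sum to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# self-attention over a three-step toy "sequence"
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(seq, seq, seq)
</code></pre>
<p>Every output element attends over the whole sequence at once; that absence of recurrence is what made the architecture trainable at scale.</p>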
<p>What this means practically: these models are trained on very large corpora,
they learn statistical regularities, and they generate outputs that are
statistically consistent with their training distribution. They are not
reasoning from first principles. They do not &ldquo;know&rdquo; music theory the way a
student who has internalised harmonic function knows it. They have learned, from
enormous quantities of text and audio, what tends to follow what. For many tasks
this is sufficient. For tasks that require understanding of underlying structure,
it is not — and the failure modes are characteristic rather than random.</p>
<p>BERT (Devlin et al., 2018) showed that pre-training on large corpora and
fine-tuning on specific tasks produces models that outperform task-specific
architectures on a wide range of benchmarks. The same transfer-learning
paradigm has spread to audio (Whisper pre-trains on 680,000 hours of weakly
supervised audio), to music generation (Magenta&rsquo;s transformer-based models produce
melodically coherent sequences), and to multimodal domains. The technology is
mature, improving, and available to students now. Knowing what it is — not
just what it produces — is the starting point for any sensible curriculum
discussion about it.</p>
<hr>
<h2 id="the-data-protection-constraint">The Data Protection Constraint</h2>
<p>Before any discussion of pedagogical benefit, there is a legal boundary that
most AI-in-education discussions skip over. In Germany, and in the EU more
broadly, the deployment of AI tools in a university setting is governed by the
GDPR (DSGVO, Regulation 2016/679) and, at state level in NRW, by the DSG NRW.
The constraints are not abstract: they determine which tools can be used for
which purposes with which students.</p>
<p>The core principle is data minimisation: only data necessary for a specific,
documented purpose may be collected or processed. When a student uses a
commercial AI tool to get feedback on a composition exercise and enters text
that could identify them or their institution, that data may be stored,
processed, and used for model improvement by an operator whose servers are
outside the EU. Whether such transfers remain legally valid under GDPR after
the Schrems II ruling (Court of Justice of the EU, 2020) is contested — and
&ldquo;contested&rdquo; is not a position in which an institution can comfortably require
students to use a tool.</p>
<p>The practical upshot for curriculum design is this: AI tools running on EU
servers with documented processing agreements can be integrated into formal
coursework. Commercial tools whose terms specify US-based processing and model
training on user data cannot be required of students. They can be discussed and
demonstrated, but making them mandatory puts students in a position where they
must choose between their privacy and their grade.</p>
<p>This is not a reason to avoid AI in teaching. It is a reason to be honest about
the regulatory landscape, to distinguish clearly between tools you can require
and tools you can recommend, and to make data protection literacy part of what
students learn. The skill of reading a terms-of-service document and identifying
the data flows it describes is not a legal skill — it is a general literacy
skill that matters for every digital tool a music professional will use.</p>
<hr>
<h2 id="what-changes-for-students">What Changes for Students</h2>
<p>The question I was asked most often across the three workshops was some version
of: &ldquo;If AI can already do X, should students still learn X?&rdquo;</p>
<p>The question is less simple than it appears, and the answer is not uniform
across skills.</p>
<p><strong>Skills where automation reduces the required production threshold</strong> do exist.
A student who spends weeks mastering advanced music engraving tools for score
production, when AI can generate a usable first draft from a much simpler
description, has arguably spent time that could have been better allocated
elsewhere. Not because the underlying skill is worthless — it is not — but
because the threshold of competence required to produce a working output has
dropped. The student&rsquo;s time might be more valuable spent on something that
has not been automated.</p>
<p><strong>Skills where automation creates new requirements</strong> are more interesting.
Transcription is a useful example. Automatic speech recognition — using
models like Whisper for spoken-word transcription, or specialised models
for audio-to-score music transcription — is now accurate enough to produce
usable first drafts from audio. This does not
eliminate the need for transcription skill in a music student. It changes it.
A student who cannot evaluate the output of an automatic transcription — who
cannot hear where the model has made characteristic errors, who does not have
an internalised sense of what a correct transcription looks like — is unable
to use the tool productively. The required skill has shifted from production
to evaluation. This is not a lesser skill; it is a different one, and it is
not automatically acquired alongside the ability to run the tool.</p>
<p><strong>Skills that automation cannot replace</strong> are those that depend on embodied,
situated, relational knowledge: stage presence, real-time improvisation, the
subtle negotiation of musical meaning in ensemble, the pedagogical relationship
between teacher and student. These are not beyond AI in principle. They are
far beyond it in practice, and the gap is not closing as quickly as the
generative AI discourse sometimes suggests.</p>
<p>The curriculum implication is not &ldquo;teach less&rdquo; or simply &ldquo;teach differently.&rdquo;
It is: be explicit about which category each skill falls into, and design
assessment accordingly. An assignment that asks students to produce something
AI can produce is now testing something different from what it was testing two
years ago — not necessarily nothing, but something different. The rubric should
reflect that.</p>
<hr>
<h2 id="what-changes-for-instructors">What Changes for Instructors</h2>
<p>The same three-category analysis applies symmetrically to teaching.</p>
<p><strong>Routine task automation</strong> is genuinely useful. Generating first drafts of
worksheets, producing exercises at different difficulty levels, transcribing a
recorded lesson for later analysis — these are tasks where AI can save
meaningful time without compromising the pedagogical judgment required to make
use of the output. Holmes et al. (2019) identify feedback generation as one
of the clearer wins for AI in education: systems that provide immediate,
targeted feedback at a scale that human instructors cannot match. A
transcription model listening to a student practice and flagging rhythmic
inconsistencies does not replace a teacher. It extends the feedback loop
beyond the lesson hour.</p>
<p><strong>Content generation with limits</strong> is where AI is most seductive and most
dangerous. A model like ChatGPT can produce a reading list on any topic, a
summary of any debate in the literature, a set of discussion questions for any
text. The outputs are fluent, plausible, and frequently wrong in ways that are
difficult to detect without domain expertise. Jobin et al. (2019) and
Mittelstadt et al. (2016) both document the broader concern with AI opacity
and accountability: when a model produces a confident-sounding claim, the
burden of verification falls on the user. An instructor who outsources the
construction of course materials to a model, and who lacks enough domain
knowledge to catch the errors, is not saving time — they are transferring
risk to their students.</p>
<p>Hallucinations — outputs that are plausible in form but false in content — are
not bugs in the usual sense. They are a structural consequence of how generative
models work. A model trained to predict likely next tokens will produce the most
statistically plausible continuation, not the most accurate one. For music
education, where historical facts, composer attributions, and music-theoretic
claims need to be correct, this matters. The model&rsquo;s fluency is not evidence
of its accuracy.</p>
<p><strong>Personalisation</strong> is the most-cited promise of AI in education (Luckin et
al., 2016; Roll &amp; Wylie, 2016) and the hardest to evaluate in practice. The
argument is that AI can adapt instructional content to individual learners&rsquo;
needs in real time, producing one-to-one tutoring at scale. The evidence in
formal educational settings is more mixed than the boosters suggest. What is
clear is that personalisation at scale requires data — and extensive data about
individual students&rsquo; learning trajectories raises the same data protection
concerns already discussed, in more acute form.</p>
<hr>
<h2 id="the-music-specific-question">The Music-Specific Question</h2>
<p>I want to be direct about something that came up repeatedly across the day and
that the general AI-in-education literature handles badly: music education is
not generic.</p>
<p>The skills involved — listening, performing, interpreting, composing,
improvising — have a phenomenological and embodied dimension that does not map
cleanly onto the text-prediction paradigm that most current AI systems
instantiate. Suno.AI can generate a stylistically convincing chord progression
in the manner of a named composer. It cannot explain why the progression is
convincing in the way a student who has internalised tonal function can explain
it. Google Magenta can generate a continuation of a melodic fragment that is
locally coherent. It cannot navigate the structural expectations of a sonata
form with the intentionality that a performer brings to interpreting one.</p>
<p>This is not a criticism of these tools. It is a description of what they are.
The curriculum implication is that music education must be clear about what it
is teaching: the <em>product</em> — a score, a performance, a composition — or the
<em>process and understanding</em> of which the product is evidence. Where assessment
focuses on the product, AI creates an obvious challenge. Where it focuses on
demonstrable process and understanding — including the ability to critically
evaluate AI-generated outputs — it creates new opportunities.</p>
<p>The more interesting question is whether AI tools can make musical <em>process</em>
more visible and discussable. A composition student who uses a generative model,
notices that the output is harmonically correct but rhythmically inert, and can
articulate <em>why</em> it is inert — and then revise it accordingly — has
demonstrated more sophisticated musical understanding than a student who
produces the same output without any generative assistance. The tool does not
lower the standard; it shifts where the standard is applied.</p>
<p>There is an analogy in music theory pedagogy. The availability of notation
software that can play back a student&rsquo;s harmony exercise and flag parallel
fifths changed what ear training and harmony teaching emphasise — but it did
not make harmony teaching obsolete. It changed the floor (students can check
mechanical correctness automatically) and raised the ceiling (more class time
can be spent on voice-leading logic and expressive intention). AI tools are a
larger version of the same displacement: the floor rises, the ceiling rises
with it, and the pedagogical question is always what you are doing between
the two.</p>
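<p>That mechanical floor-check is simple enough to sketch. Here is a toy version — an illustration of the kind of rule such software encodes, not any notation package&rsquo;s actual algorithm — with each voice given as an array of MIDI pitch numbers:</p>

```typescript
// Toy parallel-fifths check between two voices given as MIDI pitch arrays.
// Flags beat i when the interval between the voices is a perfect fifth
// (7 semitones, modulo octave) on beats i-1 and i and both voices move.
// Direction of motion is ignored in this simplified version.
function parallelFifths(upper: number[], lower: number[]): number[] {
  const hits: number[] = [];
  for (let i = 1; i < Math.min(upper.length, lower.length); i++) {
    const prev = ((upper[i - 1] - lower[i - 1]) % 12 + 12) % 12;
    const curr = ((upper[i] - lower[i]) % 12 + 12) % 12;
    const moved = upper[i] !== upper[i - 1] && lower[i] !== lower[i - 1];
    if (prev === 7 && curr === 7 && moved) hits.push(i);
  }
  return hits;
}

// C–G moving to D–A: fifths on consecutive beats, both voices move.
console.log(parallelFifths([67, 69], [60, 62])); // [1]
// C–G moving to E–C: the second interval is not a fifth.
console.log(parallelFifths([67, 72], [60, 64])); // []
```

<p>The point of the analogy is exactly how little this code knows: it can police the mechanical rule, but everything about <em>why</em> the rule exists — voice-leading logic, expressive intention — still has to be taught.</p>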
<hr>
<h2 id="copyright-and-academic-integrity">Copyright and Academic Integrity</h2>
<p>Two issues crossed all three workshops and deserve direct treatment.</p>

<p>On copyright: the training data of generative music models includes copyrighted
recordings and scores, the legal status of which is actively litigated in
multiple jurisdictions. When Suno.AI generates a piece &ldquo;in the style of&rdquo;
a named composer, it is drawing on patterns extracted from that composer&rsquo;s work
— work that is under copyright in the case of living or recently deceased
composers. The output is not a direct copy, but neither is the relationship
to the training data legally settled. Music students who use these tools in
professional contexts should know that they are working in a legally uncertain
space, and institutions should not pretend otherwise.</p>
<p>On academic integrity: the issue is not that students might use AI to cheat —
they will, some of them, and they have always found ways to cheat with whatever
tools were available. The issue is that current AI policies at many institutions
are incoherent: prohibiting AI use in assessment while providing no clear
guidance on what counts as AI use, and assigning tasks where AI assistance is
undetectable and arguably appropriate. The more useful approach is to design
tasks where AI assistance is either irrelevant (because the task requires live
performance or real-time demonstration) or visible and assessed (because the
task explicitly includes reflection on how AI was used and to what effect).</p>
<hr>
<h2 id="three-things-i-came-away-with">Three Things I Came Away With</h2>
<p>After a full day of workshops, discussions, and the conversations that happen
in the corridors between sessions, I left with three positions that feel more
settled than they did in the morning.</p>
<p><strong>First</strong>: the data protection question is not separable from the pedagogical
question. Any serious curriculum discussion of AI in music education has to
start with what can legally be deployed, not with what would be useful if
constraints were not a factor. The constraints are a factor.</p>
<p><strong>Second</strong>: the skill most urgently needed — in students and in instructors —
is not AI literacy in the sense of knowing which tool to use for which task.
It is the critical capacity to evaluate AI-generated outputs: to notice what
is wrong, to understand <em>why</em> it is wrong, and to correct it. This requires
domain expertise first. You cannot critically evaluate an AI-generated harmonic
analysis if you do not understand harmonic analysis. The tools do not lower
the bar for domain knowledge. They raise the bar for its critical application.</p>
<p><strong>Third</strong>: the curriculum question is not &ldquo;how do we accommodate AI?&rdquo; It is
&ldquo;what are we actually trying to teach, and does the answer change when AI can
produce the visible output of that process?&rdquo; Answering that honestly, skill
by skill, for a full music programme, is slow work. It cannot be done at a
one-day event. But a one-day event, if it is well-designed, can start the
conversation in the right place.</p>
<p>HfMT&rsquo;s Thementag started it in the right place.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Devlin, J., Chang, M.-W., Lee, K., &amp; Toutanova, K. (2018). BERT:
Pre-training of deep bidirectional transformers for language understanding.
<em>arXiv preprint arXiv:1810.04805</em>. <a href="https://arxiv.org/abs/1810.04805">https://arxiv.org/abs/1810.04805</a></p>
</li>
<li>
<p>Goodfellow, I., Bengio, Y., &amp; Courville, A. (2016). <em>Deep Learning.</em>
MIT Press. <a href="https://www.deeplearningbook.org">https://www.deeplearningbook.org</a></p>
</li>
<li>
<p>Holmes, W., Bialik, M., &amp; Fadel, C. (2019). <em>Artificial Intelligence in
Education: Promises and Implications for Teaching and Learning.</em> Center for
Curriculum Redesign.</p>
</li>
<li>
<p>Jobin, A., Ienca, M., &amp; Vayena, E. (2019). The global landscape of AI ethics
guidelines. <em>Nature Machine Intelligence</em>, 1, 389–399.
<a href="https://doi.org/10.1038/s42256-019-0088-2">https://doi.org/10.1038/s42256-019-0088-2</a></p>
</li>
<li>
<p>LeCun, Y., Bengio, Y., &amp; Hinton, G. (2015). Deep learning. <em>Nature</em>,
521(7553), 436–444. <a href="https://doi.org/10.1038/nature14539">https://doi.org/10.1038/nature14539</a></p>
</li>
<li>
<p>Luckin, R., Holmes, W., Griffiths, M., &amp; Forcier, L. B. (2016).
<em>Intelligence Unleashed: An Argument for AI in Education.</em> Pearson.</p>
</li>
<li>
<p>Mittelstadt, B. D., Allo, P., Taddeo, M., Wachter, S., &amp; Floridi, L.
(2016). The ethics of algorithms: Mapping the debate. <em>Big Data &amp; Society</em>,
3(2). <a href="https://doi.org/10.1177/2053951716679679">https://doi.org/10.1177/2053951716679679</a></p>
</li>
<li>
<p>Roll, I., &amp; Wylie, R. (2016). Evolution and revolution in artificial
intelligence in education. <em>International Journal of Artificial Intelligence
in Education</em>, 26(2), 582–599.
<a href="https://doi.org/10.1007/s40593-016-0110-3">https://doi.org/10.1007/s40593-016-0110-3</a></p>
</li>
<li>
<p>Russell, S., &amp; Norvig, P. (2020). <em>Artificial Intelligence: A Modern
Approach</em> (4th ed.). Pearson.</p>
</li>
<li>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need.
<em>Advances in Neural Information Processing Systems</em>, 30.
<a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Inner Echo: On Making Mental Illness Visible, and What That Even Means</title>
      <link>https://sebastianspicker.github.io/posts/inner-echo/</link>
      <pubDate>Thu, 28 Nov 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/inner-echo/</guid>
      <description>I am on the spectrum. Code is easy; emotions are not. This post is about the phrase &amp;lsquo;making mental illness visible&amp;rsquo;, what science actually tells us about that goal, why a non-affected person fundamentally cannot understand — and why trying still matters.</description>
      <content:encoded><![CDATA[<p>There is a phrase that appears in every mental health awareness campaign, every destigmatisation effort, every well-meaning poster in a university corridor: <em>make it visible</em>. Shine a light. Break the silence. Reduce stigma by talking about it.</p>
<p>I agree with the impulse. I am less sure about what the phrase actually asks of us, or what it assumes is possible. This post is my attempt to think through that question — and to document a small project that emerged from it.</p>
<h2 id="a-personal-starting-point">A Personal Starting Point</h2>
<p>I am on the spectrum. I was diagnosed in adulthood, which is not unusual, and the diagnosis explained a great deal about a life spent finding some things effortless and others bewildering.</p>
<p>Code is easy. The internal structure of a problem, the satisfaction of a clean abstraction, the deep rabbit holes that open when a concept catches my attention and refuses to let go — that is the natural medium. Hyperfocus is not a metaphor for me; it is literally how I spend a Tuesday afternoon. I have written entire systems because I could not stop.</p>
<p>Emotions are harder. Not absent — that is a misconception I will address in a moment — but differently structured. Reading a room is work. Social cues that seem to operate as obvious background noise for most people arrive for me as data that requires conscious decoding. The reverse appears to be true for most neurotypical people: emotional processing runs in the background, effortlessly; formal abstraction requires deliberate effort.</p>
<p>Neither is better. They are different cognitive architectures, and both come with costs.</p>
<p>I raise this not to centre myself, but because it is relevant to the question the post is actually about. I spent years navigating a social world that was not built for how I process it. That experience sits close to the experience of people with mental illness — not the same, but adjacent. And it made me think hard about what &ldquo;understanding&rdquo; across neurological difference actually means.</p>
<h2 id="mental-illness-is-still-a-grey-zone">Mental Illness Is Still a Grey Zone</h2>
<p>The progress on mental health stigma over the past decade is real. People talk about therapy more openly than they did. Burnout is acknowledged at work. The language of mental health has entered mainstream use — sometimes usefully, sometimes in ways that dilute clinical concepts into lifestyle descriptors. Anxiety is now a brand attribute. Trauma is a metaphor for mild inconvenience. This is a problem, but it is a second-order problem; the first-order problem — that serious mental illness is still heavily stigmatised, underfunded, and misunderstood — is the one that matters more.</p>
<p>Corrigan and Watson <a href="#ref-1">[1]</a> documented what the stigma research consistently shows: people with mental illness face two compounding problems. The first is public stigma <a href="#ref-3">[3]</a> — the prejudice of others, leading to discrimination in employment, housing, relationships. The second is self-stigma — the internalised application of those same prejudices to oneself. The second is often worse. It is the mechanism by which stigma becomes a barrier to seeking help, creating the feedback loop that keeps serious mental illness invisible precisely because the people experiencing it have been taught that it is shameful.</p>
<p>The phrase &ldquo;make it visible&rdquo; is a response to this dynamic. If mental illness is visible — discussed, depicted, normalised — then, the argument goes, stigma decreases. There is evidence for this. Contact-based interventions, where people without mental illness interact with people who have it, consistently outperform education-only approaches <a href="#ref-2">[2]</a>. The visibility of real people matters more than information campaigns.</p>
<p>But there is a difference between visibility and understanding.</p>
<h2 id="what-visibility-actually-achieves">What Visibility Actually Achieves</h2>
<p>When we say &ldquo;make it visible&rdquo;, we usually mean one of several different things, which are worth separating.</p>
<p><strong>Normalisation</strong> means that a condition becomes part of accepted human variation rather than a mark of failure or danger. This is achievable through visibility and is genuinely important. Knowing that a colleague takes antidepressants, or that a public figure manages bipolar disorder, reduces the sense of aberration. It does not require the observer to understand the experience — only to register that it exists and is survivable.</p>
<p><strong>Representation</strong> means that people with a condition see themselves reflected in culture, media, and institutions. This matters for the affected person; it is about recognition, not about inducing empathy in the non-affected.</p>
<p><strong>Empathy</strong> is the hardest and most frequently over-promised goal. It is what the simulation approaches aim for: put a neurotypical person in a room with distorted audio and flickering visuals and tell them this is what psychosis sounds like. Does it work?</p>
<p>The honest answer from the research is: somewhat, temporarily, and with significant caveats.</p>
<h2 id="the-empathy-gap">The Empathy Gap</h2>
<p>Let me be direct about something. A person who has never experienced severe depression cannot know what it is. Not in the way that a person who has experienced it knows it. This is not a failure of empathy or imagination; it is a structural fact about how knowledge of mental states works.</p>
<p>Philosophers call this the problem of other minds. We have no direct access to another person&rsquo;s experience. We infer it, imperfectly, by analogy to our own. For experiences that have no analogue in our own history, inference breaks down. You can read every clinical description of dissociation ever written and still not know what dissociation is, because the knowledge that matters is not propositional — it is not a set of facts — but experiential.</p>
<p>This is the gap that simulation approaches try to bridge, and it is genuinely unbridgeable. What simulation can do is something weaker but not worthless: it can create an affective response, a discomfort, a disruption of the observer&rsquo;s normal processing, that functions as a rough proxy signal. Not &ldquo;now you know what it is like&rdquo;, but &ldquo;now you have a small, incomplete, distorted approximation of some dimension of the experience&rdquo;.</p>
<p>The risk is misrepresentation. Schizophrenia simulations have been criticised — fairly — for reducing a complex condition to its most dramatic phenomenological features (auditory hallucinations, paranoia) while omitting the cognitive, relational, and longitudinal aspects that define how people actually live with the condition. A five-minute visual experience of &ldquo;what depression feels like&rdquo; that emphasises darkness and slow motion tells you almost nothing about the specific exhaustion of getting through a Tuesday morning, or the way time warps over months.</p>
<p>So: you cannot truly understand what you have not experienced. But you can try to approximate something, and approximation, done honestly and with appropriate epistemic humility, is better than nothing.</p>
<h2 id="metaphor-as-a-communication-tool">Metaphor as a Communication Tool</h2>
<p>There is a long tradition of using metaphor and art to communicate internal states that resist direct description. This is not a bug; it is a feature of how language handles subjective experience.</p>
<p>The poet uses metaphor because &ldquo;my heart is heavy&rdquo; is not literally true but captures something that &ldquo;I am experiencing low mood&rdquo; does not. The musician uses dissonance and rhythm to structure emotional experience in the listener. The visual artist uses colour and texture to evoke states rather than depict them. None of these are representations in the scientific sense — they do not accurately model the referent — but they create a kind of resonance that purely descriptive language cannot.</p>
<p>Mental health communication has increasingly moved in this direction. The vocabulary of &ldquo;emotional weight&rdquo;, &ldquo;spiralling&rdquo;, &ldquo;crashing&rdquo;, &ldquo;the fog&rdquo; — these are metaphors that have become clinical shorthand precisely because they communicate something essential that clinical terms do not. When someone says &ldquo;I couldn&rsquo;t get out of bed&rdquo;, they are not describing paralysis; they are describing a particular quality of anhedonia and executive dysfunction that no diagnostic manual entry captures as well.</p>
<p>This is the space where a project like inner-echo operates.</p>
<h2 id="inner-echo-the-idea">Inner Echo: The Idea</h2>
<p><a href="https://github.com/sebastianspicker/inner-echo">inner-echo</a> is a browser-based audiovisual experiment. It takes a webcam feed and applies condition-specific visual and audio effects that function as metaphorical overlays on the user&rsquo;s own image. The output is not a simulation of a mental health condition in any clinical sense. It is an attempt to construct a visual and auditory language for internal states, using the user&rsquo;s own presence as the anchor.</p>
<p>The technical architecture is deliberately minimal: React, WebGL/Canvas for video processing, optional WebAudio. Everything runs in the browser, client-side, with no backend. No data leaves the device. This is not incidental — privacy is load-bearing for a project that deals with sensitive self-reflection. Safe Mode and an emergency stop function are built in.</p>
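<p>The privacy claim is structural: every transformation is a pure function over a frame the browser already holds, so nothing needs to leave the device. A minimal sketch of what one such per-frame transform can look like — the effect and its parameters are illustrative inventions for this post, not the repository&rsquo;s actual code — operating on an RGBA pixel buffer of the kind Canvas&rsquo;s <code>getImageData()</code> returns:</p>

```typescript
// Illustrative per-frame effect: desaturate and darken an RGBA buffer in
// place, as a stand-in for one metaphorical overlay. In the browser this
// buffer would come from CanvasRenderingContext2D.getImageData().data.
function desaturateDarken(pixels: Uint8ClampedArray, amount: number): void {
  for (let i = 0; i < pixels.length; i += 4) {
    const [r, g, b] = [pixels[i], pixels[i + 1], pixels[i + 2]];
    const grey = 0.299 * r + 0.587 * g + 0.114 * b; // luma approximation
    // Blend each channel toward grey, then scale brightness down.
    pixels[i]     = (r + (grey - r) * amount) * (1 - 0.5 * amount);
    pixels[i + 1] = (g + (grey - g) * amount) * (1 - 0.5 * amount);
    pixels[i + 2] = (b + (grey - b) * amount) * (1 - 0.5 * amount);
    // Alpha (pixels[i + 3]) is left untouched.
  }
}
```

<p>Because the function mutates the buffer and touches nothing else, the &ldquo;no data leaves the device&rdquo; guarantee is something you can read off the code rather than something you have to trust a server operator about.</p>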
<p>The condition-profile system supports three modes:</p>
<ul>
<li><strong>Preset mode</strong>: a single-condition metaphorical composition — one set of effects mapped to one cluster of experiences</li>
<li><strong>Multimorbid mode</strong>: weighted stacking of multiple condition profiles, acknowledging that most people with mental health conditions do not have one thing</li>
<li><strong>Symptom-first mode</strong>: dimension-level control, letting the user build from individual symptom representations rather than diagnostic labels</li>
</ul>
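<p>Mechanically, the multimorbid mode&rsquo;s weighted stacking is a convex combination of per-profile effect parameters. A sketch of that idea — the profile names and parameter keys here are hypothetical, not the project&rsquo;s actual schema:</p>

```typescript
// Hypothetical schema: each profile maps effect parameter -> intensity 0..1.
type EffectParams = Record<string, number>;

// Blend several condition profiles by normalised weight, so that stacking
// two profiles at equal weight yields the average of their parameters.
function stackProfiles(
  profiles: { params: EffectParams; weight: number }[]
): EffectParams {
  const total = profiles.reduce((s, p) => s + p.weight, 0) || 1;
  const out: EffectParams = {};
  for (const { params, weight } of profiles) {
    for (const [key, value] of Object.entries(params)) {
      out[key] = (out[key] ?? 0) + value * (weight / total);
    }
  }
  return out;
}

const blurHeavy  = { params: { blur: 0.8, tremor: 0.1 }, weight: 2 };
const audioHeavy = { params: { blur: 0.2, echo: 0.9 },   weight: 1 };
console.log(stackProfiles([blurHeavy, audioHeavy]));
// blur: (0.8·2 + 0.2·1)/3 = 0.6; tremor: 0.2/3 ≈ 0.067; echo: 0.9/3 = 0.3
```

<p>Symptom-first mode then falls out almost for free: instead of blending whole labelled profiles, the user sets the dimension-level values directly.</p>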
<p>The last of these is, I think, the most honest design choice. Diagnostic categories are administrative conveniences as much as they are natural kinds. Two people with the same diagnosis can have radically different experiences. Structuring the system around dimensions of experience rather than labels is both clinically more accurate and communicatively more flexible.</p>
<h2 id="what-it-is-not">What It Is Not</h2>
<p>Being clear about limitations is not false modesty; it is the only way this kind of project retains its integrity.</p>
<p>inner-echo is not a simulation of any condition in the sense of accurately modelling its phenomenology. It does not claim to show you &ldquo;what depression is like&rdquo;. It offers metaphorical approximations of some dimensions of some experiences, and it does so using effects that are legible to the observer — visual distortion, audio modification, altered feedback — that bear a designed but non-literal relationship to the internal states they are meant to evoke.</p>
<p>It is not a diagnostic tool. It is not a therapeutic intervention. It is not a substitute for any clinical process.</p>
<p>What it might be is a starting point for a conversation. Something a person experiencing a condition could use to gesture toward an aspect of their experience. Something a person without that experience could encounter with enough curiosity to ask a better question than they would have otherwise.</p>
<p>That is a modest claim. I think modest claims are appropriate here.</p>
<h2 id="why-this-why-now">Why This, Why Now</h2>
<p>Mental health awareness has become a genre. The awareness campaigns, the celebrity disclosures, the workplace wellness programmes — these are real goods, and I do not want to be cynical about them. But the communication problem has not been solved. The words exist. The willingness to use them, in many contexts, exists. What is still missing is a language for the texture of experience that the words point to but do not reach.</p>
<p>I find myself better able to build something than to explain it in words. That is probably a spectrum thing. inner-echo is an attempt to build toward a language that I do not fully have — for my own internal experience, and for the experiences of people navigating conditions quite different from mine.</p>
<p>The gap cannot be closed. But the attempt to reach across it is worth making, and worth being honest about.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Corrigan, P.W. &amp; Watson, A.C. (2002). Understanding the impact of stigma on people with mental illness. <em>World Psychiatry</em>, 1(1), 16–20.</p>
<p><span id="ref-2"></span>[2] Corrigan, P.W., Morris, S.B., Michaels, P.J., Rafacz, J.D. &amp; Rüsch, N. (2012). Challenging the public stigma of mental illness: A meta-analysis of outcome studies. <em>Psychiatric Services</em>, 63(10), 963–973.</p>
<p><span id="ref-3"></span>[3] Goffman, E. (1963). <em>Stigma: Notes on the Management of Spoiled Identity</em>. Prentice-Hall.</p>
<p>inner-echo repository: <a href="https://github.com/sebastianspicker/inner-echo">https://github.com/sebastianspicker/inner-echo</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>After the Connection Is Stable, the Hard Part Begins</title>
      <link>https://sebastianspicker.github.io/posts/nmp-curriculum-reflective-practice/</link>
      <pubDate>Fri, 22 Nov 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/nmp-curriculum-reflective-practice/</guid>
      <description>A third post in the networked music performance series. Technical latency is solved. Institutional infrastructure has a name. What students actually learn — and what conservatoire curricula consistently get wrong about teaching it — turns out to be a different problem entirely.</description>
      <content:encoded><![CDATA[<p><em>Third post in a series. The <a href="/posts/nmp-latency-lola-mvtp/">August 2023 post</a>
covered latency measurements across six European research-network links.
The <a href="/posts/digital-music-labs-infrastructure/">June 2024 post</a> covered
what institutional infrastructure needs to look like for any of that to
be sustainably usable. This one covers what happens after both of those
problems are solved — which is when the genuinely interesting educational
challenges start.</em></p>
<p><em>Based on a manuscript with colleagues from the RAPP Lab. Not yet peer-reviewed.</em></p>
<hr>
<h2 id="the-gap-nobody-talks-about">The Gap Nobody Talks About</h2>
<p>There is a version of the NMP success story that stops too early. It goes: we
installed LoLa, measured the latency, it came in at 9.5 ms to Vienna, the
musicians played together across 745 km, it worked. Success.</p>
<p>What this story skips is the classroom after the demo. The student who can
follow a setup checklist perfectly and still has no idea what to do musically
when the connection is stable. The ensemble that gets a clean signal running
and then plays exactly the same repertoire in exactly the same way they would
in a co-present rehearsal, fighting the latency instead of working with it,
frustrated when it does not feel right. The assessment rubric that checks off
&ldquo;maintained stable connection&rdquo; and &ldquo;completed the performance&rdquo; and has nothing
to say about everything that actually constitutes musical learning in a
networked context.</p>
<p>The gap between <em>technical feasibility</em> and <em>educational transformation</em> is
the subject of this post. Closing it turns out to require a different kind of
curriculum design than most conservatoires have tried.</p>
<hr>
<h2 id="what-gets-taught-versus-what-needs-to-be-learned">What Gets Taught Versus What Needs to Be Learned</h2>
<p>The default curricular response to NMP has been to treat it as a technical
skill with an artistic application. Students learn to configure an audio
interface, manage routing, establish a LoLa connection, and then — implicitly
— go do music. The technical content gets staged as a prerequisite to the
&ldquo;real&rdquo; work.</p>
<p>This ordering is wrong in a specific way. Technical setup work is genuinely
necessary, but making it a prerequisite treats the relationship between
technology and musical practice as sequential rather than recursive. In
practice, the interesting musical problems only become visible <em>through</em> the
technical ones. A student does not understand why buffer size matters until
they have felt the difference between a 5 ms and a 40 ms offset in a
coordination-intensive passage. A student does not develop an opinion about
audio routing configurations until they have experienced a rehearsal collapse
caused by a routing error they could have prevented.</p>
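<p>The arithmetic behind that felt difference is simple: each audio buffer adds <code>buffer_frames / sample_rate</code> seconds of delay before anything even reaches the network. A quick sketch — this is only the local buffering term, ignoring network propagation and the remote end&rsquo;s own buffering:</p>

```typescript
// One-way buffering delay in milliseconds for a given audio buffer size.
// Network propagation and the remote end's buffering add on top of this.
function bufferDelayMs(bufferFrames: number, sampleRate: number): number {
  return (bufferFrames / sampleRate) * 1000;
}

console.log(bufferDelayMs(256, 48000).toFixed(1));  // "5.3"  — feels immediate
console.log(bufferDelayMs(2048, 48000).toFixed(1)); // "42.7" — audibly late
```

<p>Reading the formula is one thing; the pedagogical claim is that the number only acquires meaning once a student has felt it in a coordination-intensive passage.</p>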
<p>The RAPP Lab&rsquo;s recurring insight across several years of module iterations
at HfMT Köln was more direct: once learners can establish a stable connection,
the harder challenge is developing artistic, collaborative and reflective
strategies for making music <em>together apart</em>. Technical fluency is a
foundation, not a destination.</p>
<hr>
<h2 id="the-curriculum-we-ended-up-with">The Curriculum We Ended Up With</h2>
<p>It took several cycles to get there. The early format was weekend workshops —
open, exploratory, no formal assessment, primarily for advanced students who
self-selected in. These were useful precisely because they were informal: they
revealed quickly how technical and musical questions become inextricable once
you are actually playing, and they gave us evidence about where students got
stuck that we would not have found from a needs analysis.</p>
<p>Over time, elements of those workshops were developed into recurring
curriculum-embedded modules, which then fed into independent study projects
and eventually into external collaborations and performances. The trajectory
mattered: moving from a one-off event to something longitudinal meant that
knowledge built across cohorts rather than resetting every time.</p>
<p>The module structure that emerged has three interlocking elements:</p>
<p><strong>Progressive task design.</strong> Early sessions are tightly scoped:
specific technical-musical exercises, limited repertoire, well-defined
success criteria. Later sessions move toward open-ended projects, student-led
rehearsal planning, and eventually cross-institutional partnerships where
variables are genuinely outside anyone&rsquo;s control. The point of the early
constraints is not to make things easier — it is to create conditions where
students can notice what they are doing rather than just surviving.</p>
<p><strong>Journals and debriefs.</strong> Students kept individual reflective journals
throughout modules, documenting not just what happened but how they responded
to it — technical problems, musical decisions, moments of coordination failure
and recovery, questions they could not answer at the time. Group debriefs
after each rehearsal then turned those individual threads into collective
knowledge: comparing strategies, naming the problems that came up repeatedly,
developing shared language for rehearsal coordination.</p>
<p>The debrief is the part of this model that I think gets undervalued. It is
not just reflection — it is <em>curriculum production</em>. Strategies that emerged
from one cohort&rsquo;s debriefs became documented starting points for subsequent
cohorts. Knowledge accumulated rather than evaporating when the semester ended.</p>
<p><strong>Portfolio assessment.</strong> Rather than assessing primarily on a final
performance, students assembled portfolios that could include curated journal
excerpts, rehearsal documentation, reflective syntheses, and accounts of
how their thinking changed. The question being assessed was not &ldquo;did you play
the concert&rdquo; but &ldquo;can you articulate why you made the decisions you made, and
what you would do differently.&rdquo;</p>
<hr>
<h2 id="what-students-actually-learn-when-the-curriculum-works">What Students Actually Learn (When the Curriculum Works)</h2>
<p>Four outcomes recurred across the RAPP Lab iterations, consistently enough
to be worth naming:</p>
<h3 id="1-technical-agency">1. Technical agency</h3>
<p>This is different from technical competence. Competence means you can follow
a procedure. Agency means you understand the procedure well enough to deviate
from it intelligently when something goes wrong — to diagnose what failed,
generate a hypothesis about why, and try something different.</p>
<p>The shift happened when students stopped treating technical problems as
interruptions to the music and started treating them as information about
the system they were working inside. A dropout is not just an annoyance; it
is evidence about where the failure occurred. Getting to that reframe took,
on average, several weeks of structured reflection. It did not happen from
reading documentation.</p>
<h3 id="2-adaptive-improvisation">2. Adaptive improvisation</h3>
<p>Latency changes what real-time musical coordination can mean. You cannot rely
on the same multimodal cues — breath, gesture, shared acoustics — that make
co-present ensemble playing feel intuitive. You have to develop explicit
cueing systems, turn-taking conventions, contingency plans for when the
connection degrades mid-performance.</p>
<p>What we observed was that this constraint generated a specific kind of
musical creativity. Students improvised not just with musical material but
with rehearsal organisation itself — inventing systems, testing them,
discarding the ones that did not work, documenting the ones that did. Some of
the most musically interesting moments in the modules came from sessions where
the technology was behaving badly and students had to make it work anyway.</p>
<p>There is research on &ldquo;productive failure&rdquo; — deliberately designing tasks that
exceed students&rsquo; current control, because the struggle and recovery produce
deeper learning than smooth execution (Kapur 2016). NMP turns out to be a
natural context for this, not by design but because the network does not
cooperate on schedule.</p>
<h3 id="3-collaborative-communication">3. Collaborative communication</h3>
<p>Co-present rehearsal relies heavily on implicit communication: the
physical space makes many things legible without anyone having to say them.
In a networked rehearsal, the spatial and gestural channel is degraded or
absent. Students had to make explicit what would normally be implicit —
articulating coordination strategies, naming the problems they were
experiencing rather than hoping the ensemble would notice, developing a
vocabulary for talking about timing and latency as musical parameters.</p>
<p>This turned out to generalise. Students who had worked through several
networked rehearsal cycles were noticeably better at explicit musical
communication in co-present settings too, because they had been forced to
develop the vocabulary in a context where it was necessary.</p>
<h3 id="4-reflective-identity">4. Reflective identity</h3>
<p>The students who got the most out of the modules were the ones who stopped
waiting for the conditions to improve and started working with the conditions
as they were. Latency as a compositional constraint rather than a defect to
be routed around. Uncertainty as an artistic condition rather than a
technical failure.</p>
<p>The journal entries where this shift is most visible are not the ones that
describe what the student did. They are the ones that describe a change in
how the student understands their own practice — who they are as a musician
in relation to an environment they cannot fully control. That is a different
kind of outcome than anything a timing metric captures.</p>
<hr>
<h2 id="the-assessment-problem">The Assessment Problem</h2>
<p>The hardest part of all of this to translate into institutional language is
assessment. The conservatoire has well-developed frameworks for evaluating
performances. It has much weaker frameworks for evaluating the learning that
happens before and between and underneath performances.</p>
<p>Checklist rubrics — was the connection stable, was the latency within
acceptable range, did the performance complete — are useful for safety and
reliability. They are poor evidence for whether a student has developed the
capacity to work reflectively and artistically in a mediated ensemble
environment. A student who achieved a stable connection by following
instructions exactly and a student who achieved it by diagnosing a routing
error mid-session look identical on a checklist. They have had very different
learning experiences.</p>
<p>Portfolio assessment addresses this by making the reasoning visible. When a
student can explain why they chose a particular buffer configuration given
the specific network characteristics of that session, how that choice affected
the musical phrasing in the piece they were rehearsing, and what they would
change next time — that is evidence of something real. It is also harder to
assess than a timing log, which is probably why most programmes avoid it.</p>
<p>The argument is not that quantitative indicators are useless. It is that
they function better as scaffolding for reflective judgement than as the
primary evidence of learning. Mixed assessment ecologies — technical logs
plus journals plus portfolio syntheses — are more honest about what is
actually happening educationally.</p>
<hr>
<h2 id="what-this-does-not-solve">What This Does Not Solve</h2>
<p>The model described here depends on teaching staff who can facilitate
reflective dialogue, curate knowledge across cohorts, and participate in
iterative curriculum redesign. That is a specific professional competence
that is not automatically present in a conservatoire staffed primarily by
performing musicians. The training and support structures needed to develop
it are an open question this paper does not fully answer.</p>
<p>The curriculum is also not portable as-is. The RAPP Lab model emerged in a
specific institutional context — HfMT Köln, specific partner network,
specific funding structure, specific cohort of students. The four outcomes
and the general pedagogical logic may transfer; the specific formats will
need adaptation. Any institution that tries to implement this without going
through at least one cycle of their own iterative development is likely to
end up with a checklist version of something that works only when it is a
living process.</p>
<p>And the technology keeps moving. LoLa is a mature platform but the
ecosystem around it — network configurations, operating system support,
hardware lifecycles — changes faster than curriculum documentation. Building
responsiveness into the curriculum itself, rather than treating it as a fixed
syllabus, is the structural answer. Easier to recommend than to institutionalise.</p>
<hr>
<h2 id="references">References</h2>
<p>Barrett, H. C. (2007). Researching electronic portfolios and learner
engagement. <em>Journal of Adolescent &amp; Adult Literacy</em>, 50(6), 436–449.</p>
<p>Borgdorff, H. (2012). <em>The Conflict of the Faculties.</em> Leiden University Press.</p>
<p>The Design-Based Research Collective (2003). Design-based research: An
emerging paradigm for educational inquiry. <em>Educational Researcher</em>, 32(1),
5–8.</p>
<p>Kapur, M. (2016). Examining productive failure, productive success,
unproductive failure, and unproductive success in learning. <em>Educational
Psychologist</em>, 51(2), 289–299. <a href="https://doi.org/10.1080/00461520.2016.1155457">https://doi.org/10.1080/00461520.2016.1155457</a></p>
<p>Lave, J. &amp; Wenger, E. (1991). <em>Situated Learning.</em> Cambridge University Press.</p>
<p>Sadler, D. R. (2009). Indeterminacy in the use of preset criteria for
assessment and grading. <em>Assessment &amp; Evaluation in Higher Education</em>,
34(2), 159–179. <a href="https://doi.org/10.1080/02602930801956059">https://doi.org/10.1080/02602930801956059</a></p>
<p>Schön, D. A. (1983). <em>The Reflective Practitioner.</em> Basic Books.</p>
<p>Wenger, E. (1998). <em>Communities of Practice.</em> Cambridge University Press.
<a href="https://doi.org/10.1017/CBO9780511803932">https://doi.org/10.1017/CBO9780511803932</a></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-20</strong>: Updated the Sadler (2009) reference title to &ldquo;Indeterminacy in the use of preset criteria for assessment and grading,&rdquo; matching the journal article at this DOI. Updated the Kapur (2016) reference to the full published title: &ldquo;Examining productive failure, productive success, unproductive failure, and unproductive success in learning.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Primes Are Energy Levels: The Montgomery-Odlyzko Conjecture</title>
      <link>https://sebastianspicker.github.io/posts/riemann-primes-quantum-chaos/</link>
      <pubDate>Mon, 18 Nov 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/riemann-primes-quantum-chaos/</guid>
      <description>In October 2024, the largest known prime was discovered — 41 million digits, found by a GPU cluster. But the deepest prime story is not about record-breaking numbers. It is about a 1972 teatime conversation at the Institute for Advanced Study, a pair correlation formula, and the suspicion — numerically confirmed to extraordinary precision — that the zeros of the Riemann zeta function are the energy levels of an undiscovered quantum system.</description>
      <content:encoded><![CDATA[<h2 id="a-very-large-prime">A Very Large Prime</h2>
<p>On 12 October 2024, a retired NVIDIA engineer named Luke Durant announced that he had found the 52nd known Mersenne prime. The number is $2^{136{,}279{,}841} - 1$, and writing it out in decimal requires 41,024,320 digits. Durant had organised a cloud network of GPU servers spread across 17 countries — essentially repurposing the hardware that normally trains language models to instead do modular arithmetic on numbers with tens of millions of digits. The verification alone took about 51 days of computation.</p>
<p>This is the kind of thing that makes headlines, and it deserves them. Mersenne primes are rare and verifying them is genuinely hard. But if I am honest, the more interesting prime story of the last half-century is not about the record-breaking number. It is about a conversation over tea in Princeton in 1972, and the increasingly hard-to-dismiss suspicion that the prime numbers are, in a precise statistical sense, quantum energy levels.</p>
<p>When I say &ldquo;quantum energy levels,&rdquo; I mean it almost literally — not as a metaphor. Let me explain.</p>
<h2 id="the-riemann-zeta-function-encodes-the-primes">The Riemann Zeta Function Encodes the Primes</h2>
<p>Start with the most famous function in number theory. For $\operatorname{Re}(s) > 1$, the Riemann zeta function is defined by the series</p>
$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}.$$<p>This converges nicely and defines an analytic function. But the real reason to care about it is Euler&rsquo;s product formula:</p>
$$\zeta(s) = \prod_{p \text{ prime}} \frac{1}{1 - p^{-s}}.$$<p>This is not obvious — it follows from unique prime factorisation, essentially — but its implications are enormous. The product runs over <em>all</em> primes, and each prime contributes a factor. The primes are encoded in the analytic structure of $\zeta$. If you know $\zeta$, you know the primes; if you understand the zeros of $\zeta$, you understand their distribution.</p>
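The identity is easy to check numerically. Here is a minimal Python sketch (the sieve helper `primes_up_to` is my own construction, not anything from the sources above) comparing a truncated Dirichlet series with a truncated Euler product at $s = 2$, where the exact value $\zeta(2) = \pi^2/6$ is known in closed form:

```python
import math

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i, is_p in enumerate(sieve) if is_p]

s = 2.0
N = 100_000

# Dirichlet series: sum over all integers up to N
zeta_sum = sum(1.0 / n ** s for n in range(1, N + 1))

# Euler product: product over all primes up to N
zeta_prod = 1.0
for p in primes_up_to(N):
    zeta_prod *= 1.0 / (1.0 - p ** -s)

exact = math.pi ** 2 / 6  # zeta(2)
```

Both truncations agree with $\pi^2/6$ to within about $10^{-5}$; the sum and the product converge to the same function because unique factorisation matches each integer $n$ with exactly one combination of prime-power factors.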
<p>Riemann&rsquo;s 1859 paper made this explicit (<a href="#ref-Riemann1859">Riemann, 1859</a>). He showed that $\zeta$ extends analytically to the whole complex plane (minus a simple pole at $s = 1$), and he wrote down an explicit formula connecting the prime-counting function</p>
$$\pi(x) = \#\{p \leq x : p \text{ prime}\}$$<p>to the zeros of $\zeta$. The formula is</p>
$$\pi(x) \approx \operatorname{Li}(x) - \sum_{\rho} \operatorname{Li}(x^{\rho}) + \text{(lower-order terms)},$$<p>where $\operatorname{Li}(x) = \int_2^x \frac{dt}{\ln t}$ is the logarithmic integral and the sum runs over the <em>non-trivial zeros</em> $\rho$ of $\zeta$.</p>
<p>What are the non-trivial zeros? The zeta function has trivial zeros at the negative even integers $-2, -4, -6, \ldots$ — boring, understood. The non-trivial zeros lie in the <em>critical strip</em> $0 < \operatorname{Re}(s) < 1$, and their imaginary parts are what drive the oscillatory corrections to $\pi(x)$. Each zero $\rho = \frac{1}{2} + it_n$ contributes a term that oscillates like $x^{1/2} \cos(t_n \ln x)$. The prime distribution is a superposition of these oscillations, one per zero.</p>
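To get a feel for how good the leading term alone is, one can compare $\pi(x)$ against $\operatorname{Li}(x)$ numerically. A rough Python sketch (the sieve and the midpoint quadrature are my own implementation choices, not anything prescribed by the formula):

```python
import math

def prime_count(x):
    """pi(x): number of primes <= x, by sieve."""
    sieve = [True] * (x + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(x ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, x + 1, i):
                sieve[j] = False
    return sum(sieve)

def li(x, steps=200_000):
    """Li(x) = integral from 2 to x of dt / ln t, midpoint rule."""
    h = (x - 2.0) / steps
    return h * sum(1.0 / math.log(2.0 + (k + 0.5) * h) for k in range(steps))

x = 100_000
pi_x = prime_count(x)  # 9592
li_x = li(x)           # overshoots pi_x by a few dozen at this height
```

The discrepancy, a few dozen out of nearly ten thousand, is exactly the oscillatory contribution of the zeros that the explicit formula accounts for.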
<p>The Riemann Hypothesis is the claim that all non-trivial zeros lie on the <em>critical line</em> $\operatorname{Re}(s) = \frac{1}{2}$. It has been verified numerically for the first $10^{13}$ zeros (Gourdon, 2004; building on earlier high-height computations by <a href="#ref-Odlyzko1987">Odlyzko, 1987</a>). It has not been proved. It remains, after 165 years, the most important unsolved problem in mathematics.</p>
<h2 id="tea-with-dyson">Tea with Dyson</h2>
<p>In 1972, Hugh Montgomery was visiting the Institute for Advanced Study in Princeton. He was working on a specific question: if you take the imaginary parts of the non-trivial zeros of $\zeta$ and normalise them so that their mean spacing is 1, what is the distribution of spacings between them?</p>
<p>More precisely, he was computing the <em>pair correlation function</em> of the normalised zeros. If $\tilde{\gamma}_n$ are the normalised imaginary parts (ordered $\tilde{\gamma}_1 \leq \tilde{\gamma}_2 \leq \cdots$), the pair correlation function $R_2(r)$ measures the density of pairs $(\tilde{\gamma}_m, \tilde{\gamma}_n)$ with $\tilde{\gamma}_n - \tilde{\gamma}_m \approx r$.</p>
<p>Montgomery found — subject to certain assumptions about the behaviour of $\zeta$ — that</p>
$$R_2(r) = 1 - \left(\frac{\sin \pi r}{\pi r}\right)^2.$$<p>(<a href="#ref-Montgomery1973">Montgomery, 1973</a>)</p>
<p>He mentioned this to Freeman Dyson over tea. Dyson — who had spent years on quantum mechanics and random matrix theory — recognised the formula immediately. That expression, $1 - (\sin \pi r / \pi r)^2$, is exactly the pair correlation function of eigenvalues of random matrices drawn from the Gaussian Unitary Ensemble.</p>
<p>Montgomery had not been thinking about quantum mechanics. Dyson had not been thinking about primes. The formula matched.</p>
<h2 id="the-gaussian-unitary-ensemble">The Gaussian Unitary Ensemble</h2>
<p>Let me say a few words about where that formula comes from in physics, because it is not obvious.</p>
<p>The Gaussian Unitary Ensemble (GUE) is a probability distribution over $N \times N$ Hermitian matrices. Specifically, it is the distribution proportional to $e^{-\operatorname{tr}(H^2)}$ on the space of Hermitian matrices, which is invariant under conjugation $H \mapsto U H U^\dagger$ for any unitary $U$. The entries on the diagonal are real Gaussians; the off-diagonal entries are complex Gaussians with independent real and imaginary parts.</p>
<p>In the limit $N \to \infty$, the eigenvalues of a GUE matrix distribute globally according to Wigner&rsquo;s semicircle law. But the local statistics — the fine-grained distribution of spacings between nearby eigenvalues — follow a universal law. The pair correlation function is</p>
$$R_2^{\text{GUE}}(r) = 1 - \left(\frac{\sin \pi r}{\pi r}\right)^2.$$<p>This distribution has a crucial qualitative feature called <em>level repulsion</em>: as $r \to 0$, $R_2(r) \to 0$. Eigenvalues of random Hermitian matrices strongly avoid each other. A Poisson distribution — which is what you would get for eigenvalues that were statistically independent — would give $R_2(r) = 1$ everywhere, with no such repulsion. The GUE formula suppresses small gaps quadratically: $R_2(r) \sim \pi^2 r^2 / 3$ for small $r$.</p>
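The repulsion is easy to see in simulation. The following Python sketch samples $2 \times 2$ GUE matrices, whose spacing distribution (the Wigner surmise for the unitary class) already shows the quadratic suppression of small gaps; the sample size and the $0.2$ cutoff are arbitrary choices of mine, made only to contrast against the Poisson prediction:

```python
import math
import random

random.seed(42)

# A 2x2 GUE matrix is [[a, x+iy], [x-iy, d]] with a, d real Gaussian
# and x, y Gaussian with half the variance. Its eigenvalue gap is
#   sqrt((a - d)^2 + 4 * (x^2 + y^2)).
spacings = []
for _ in range(50_000):
    a = random.gauss(0.0, 1.0)
    d = random.gauss(0.0, 1.0)
    x = random.gauss(0.0, math.sqrt(0.5))
    y = random.gauss(0.0, math.sqrt(0.5))
    spacings.append(math.sqrt((a - d) ** 2 + 4.0 * (x * x + y * y)))

# Normalise to unit mean spacing, as in the zero statistics.
mean = sum(spacings) / len(spacings)
normalised = [s / mean for s in spacings]

# Level repulsion: small normalised gaps are strongly suppressed.
frac_small = sum(1 for s in normalised if s < 0.2) / len(normalised)
poisson_small = 1.0 - math.exp(-0.2)  # ~0.18 for independent levels
```

For independent (Poisson) levels about 18% of spacings would fall below $0.2$; for the GUE ensemble it is under 1%. That is the quadratic suppression $R_2(r) \sim \pi^2 r^2 / 3$ in action.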
<p>Why does GUE statistics arise in physics? This is the content of the Bohigas-Giannoni-Schmit conjecture (1984), which by now has overwhelming numerical support: quantum systems whose classical limit is chaotic and which lack time-reversal symmetry have energy level statistics described by the GUE. Systems with time-reversal symmetry fall into the Gaussian Orthogonal Ensemble (GOE), which has a different but related formula. Nuclear energy levels, quantum billiards with the right shapes, molecular spectra — all of them, when appropriately normalised, show GUE or GOE statistics.</p>
<p>The universality is the point. It does not matter what the specific Hamiltonian is. If the system is sufficiently chaotic, the eigenvalue statistics are universal.</p>
<h2 id="odlyzkos-computation">Odlyzko&rsquo;s Computation</h2>
<p>Montgomery&rsquo;s result was conditional and covered only a limited range of $r$. The natural next step was numerical verification: actually compute a large number of Riemann zeros and measure their pair correlation.</p>
<p>Andrew Odlyzko did exactly this, in a series of computations beginning in the 1980s. The results were striking (<a href="#ref-Odlyzko1987">Odlyzko, 1987</a>). He computed millions of zeros with high precision and compared their empirical pair correlation to the GUE prediction. The agreement was not merely qualitative — it was quantitatively exact, to within the statistical error of the sample.</p>
<p>Odlyzko then pushed further. He computed zeros near the $10^{20}$-th zero, far out on the critical line. Same statistics. He computed zeros near the $10^{22}$-th zero. Same statistics. The agreement held regardless of how far up the critical line one went. This is not a small-sample artifact and it is not coincidence, or at least it would be an extraordinary coincidence of a kind that mathematics has never before encountered.</p>
<p>The plots from Odlyzko&rsquo;s computations are, in my view, some of the most beautiful images in mathematics. You draw the GUE prediction — a smooth curve, starting at zero, rising to approach 1 — and you overlay the empirical histogram from the Riemann zeros. They are the same curve.</p>
<h2 id="berry-keating-and-the-missing-hamiltonian">Berry, Keating, and the Missing Hamiltonian</h2>
<p>If the zeros of $\zeta$ are energy levels, there should be a Hamiltonian $H$ — a self-adjoint operator — whose spectrum is exactly $\{t_n\}$, the imaginary parts of the non-trivial zeros (assuming the Riemann Hypothesis, so that all zeros are of the form $\frac{1}{2} + it_n$).</p>
<p>In 1999, Michael Berry and Jon Keating proposed a candidate (<a href="#ref-BerryKeating1999">Berry &amp; Keating, 1999</a>). Their suggestion was the classical Hamiltonian</p>
$$H_{\text{cl}} = xp,$$<p>where $x$ is position and $p$ is momentum, quantised with appropriate symmetrisation:</p>
$$\hat{H} = \frac{1}{2}(\hat{x}\hat{p} + \hat{p}\hat{x}).$$<p>Classically, $H = xp$ describes a system in which the phase-space trajectories are hyperbolas $xp = E = \text{const}$, and the motion is $x(t) = x_0 e^t$, $p(t) = p_0 e^{-t}$ — exponential expansion in position, contraction in momentum. This is essentially the dynamics of an unstable fixed point, and it is classically chaotic in the appropriate sense.</p>
<p>The semiclassical (WKB) approximation gives an eigenvalue counting function</p>
$$N(E) \approx \frac{E}{2\pi} \ln \frac{E}{2\pi} - \frac{E}{2\pi} + \frac{7}{8} + \cdots,$$<p>which matches Riemann&rsquo;s formula for the number of zeros of $\zeta$ with imaginary part up to $T$:</p>
$$N(T) = \frac{T}{2\pi} \ln \frac{T}{2\pi} - \frac{T}{2\pi} + \frac{7}{8} + O\!\left(\frac{\ln T}{T}\right).$$<p>This is not a coincidence: the correspondence is exact at the level of the smooth counting function. The hard part is the oscillatory corrections — and those require the specific eigenvalues, which requires knowing the boundary conditions.</p>
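The smooth part of this counting function can be checked against the actual zeros at modest height. A quick Python sketch, using the known fact that there are exactly 29 zeros with $0 < \operatorname{Im}(\rho) < 100$ (the 29th lies near $98.83$, the 30th near $101.32$):

```python
import math

def n_smooth(t):
    """Smooth part of the zero-counting function N(T)."""
    u = t / (2.0 * math.pi)
    return u * math.log(u) - u + 7.0 / 8.0

# Compare against the exact count of 29 zeros below height 100.
approx = n_smooth(100.0)
```

The smooth formula gives about $29.0$ at $T = 100$; the remaining oscillation around the true staircase is what the eigenvalue statistics describe.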
<p>The problem is that $\hat{H} = \frac{1}{2}(\hat{x}\hat{p} + \hat{p}\hat{x})$ as an operator on $L^2(\mathbb{R})$ is not bounded below and has a continuous spectrum, not a discrete one. Turning it into an operator with a discrete spectrum matching the Riemann zeros requires boundary conditions that have not been found. This is the crux: Berry and Keating have the right classical system, but the quantum boundary conditions are missing.</p>
<p>What would be profound about finding $\hat{H}$? If $\hat{H}$ is self-adjoint, its eigenvalues are all real. If the non-trivial zeros are precisely $\frac{1}{2} + iE_n$, with the $E_n$ running over those eigenvalues, then every zero has real part exactly $\frac{1}{2}$ — which is the Riemann Hypothesis. A proof of the existence of such a Hamiltonian would, in one stroke, resolve the most important open problem in mathematics.</p>
<h2 id="primes-as-periodic-orbits-the-gutzwiller-analogy">Primes as Periodic Orbits: The Gutzwiller Analogy</h2>
<p>The quantum chaos connection goes deeper than pair correlations. In semiclassical quantum mechanics, the Gutzwiller trace formula relates the density of quantum energy levels to a sum over classical periodic orbits:</p>
$$d(E) = \bar{d}(E) + \sum_{\gamma} A_\gamma \cos\!\left(\frac{S_\gamma}{\hbar} - \phi_\gamma\right),$$<p>where the sum runs over all classical periodic orbits $\gamma$, $S_\gamma$ is the classical action of the orbit, $A_\gamma$ is an amplitude, and $\phi_\gamma$ is a phase (Maslov index correction). The smooth part $\bar{d}(E)$ comes from the Thomas-Fermi approximation; the oscillatory part encodes quantum interference between orbits.</p>
<p>The direct analogue in number theory is the <em>explicit formula</em> for the prime-counting function. Written in terms of the Chebyshev function $\psi(x)$, it reads</p>
$$\psi(x) = x - \sum_{\rho} \frac{x^\rho}{\rho} - \ln(2\pi) - \frac{1}{2}\ln(1 - x^{-2}),$$<p>where $\psi(x) = \sum_{p^k \leq x} \ln p$ is the Chebyshev function and the sum is over non-trivial zeros $\rho$.</p>
<p>Comparing these two formulas term by term: the zeros $\rho$ of $\zeta$ play the role of the quantum energy levels $E_n$; the primes $p$ — and their prime powers $p^k$ — play the role of the classical periodic orbits $\gamma$. The &ldquo;action&rdquo; of the orbit corresponding to $p^k$ is $k \ln p$. The primes are the primitive periodic orbits; $p^k$ is the $k$-th traversal of that orbit.</p>
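The leading term of the explicit formula, $\psi(x) \approx x$, can be verified directly. A small Python sketch (brute-force sieve, my own construction) that sums $\ln p$ over every prime power up to $x$:

```python
import math

def chebyshev_psi(x):
    """psi(x) = sum of ln p over all prime powers p^k <= x."""
    sieve = [True] * (x + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(x ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, x + 1, i):
                sieve[j] = False
    total = 0.0
    for p in range(2, x + 1):
        if sieve[p]:
            pk = p
            while pk <= x:       # each traversal p^k of the "orbit" p
                total += math.log(p)
                pk *= p
    return total

# psi(x) tracks x closely; the residual wobble is the zeros' contribution.
val = chebyshev_psi(100_000)
```

At $x = 10^5$ the result sits within a fraction of a percent of $x$ itself. Everything the explicit formula's zero sum does is contained in that small residual.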
<p>This is not a metaphor or a loose analogy. The Selberg trace formula — developed for the Laplacian on hyperbolic surfaces — makes this correspondence rigorous in a related setting: the periodic geodesics on a hyperbolic surface play the role of primes, and the eigenvalues of the Laplacian play the role of Riemann zeros (<a href="#ref-RudnickSarnak1996">Rudnick &amp; Sarnak, 1996</a>). The Riemann zeta function is the limit of a family of such systems, in some sense that is still being made precise.</p>
<p>I find it remarkable that the logarithms of primes — the most elementary sequence in arithmetic — appear as lengths of orbits in what would be a quantum chaotic system. Each prime contributes an oscillation to $\psi(x)$ with &ldquo;frequency&rdquo; proportional to its logarithm. You are, in a sense, hearing the primes as quantum interference.</p>
<p>This connects to a theme that comes up elsewhere on this blog. The <a href="/posts/falling-cat-geometric-phase/">falling cat problem</a> involves Berry phase and geometric holonomy — again a situation where deep structure emerges from symmetry and topology. The <a href="/posts/schrodinger-cat-qubits/">Schrödinger cat in quantum computing</a> involves the spectacular fragility of quantum coherence. The Riemann zeros are, if the conjecture is right, a quantum system that has never decohered — a perfectly coherent spectrum hiding inside the most ancient problem in mathematics.</p>
<h2 id="a-brief-detour-maynard-and-primes-without-digits">A Brief Detour: Maynard and Primes Without Digits</h2>
<p>While we are talking about primes, I cannot resist a detour through two results of James Maynard, who received the Fields Medal in 2022.</p>
<p>The first concerns bounded gaps. Euclid proved that there are infinitely many primes. The Twin Prime Conjecture says there are infinitely many pairs of primes $(p, p+2)$. This remains open. But in 2013, Yitang Zhang proved something extraordinary: there are infinitely many pairs of primes differing by at most 70,000,000 (<a href="#ref-Zhang2014">Zhang, 2014</a>). The bound is large, but the qualitative statement — that gaps between primes are bounded infinitely often — was completely new. Shortly thereafter, Maynard independently proved a much stronger result using the Maynard-Tao sieve: infinitely many prime pairs with gap at most 600 (<a href="#ref-Maynard2015">Maynard, 2015</a>). A crowdsourced effort (Polymath8b) brought the bound down to 246. The Twin Prime Conjecture remains open, but 246 is a long way from 70,000,000.</p>
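The raw phenomenon behind these theorems is visible in small data. Here is a Python sketch (the sieve helper is mine) counting twin pairs and gaps of at most 246 among primes below $10^5$; no finite computation bears on the infinitude claims, of course — this only illustrates that small gaps are plentiful where we can look:

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i, b in enumerate(sieve) if b]

ps = primes_up_to(100_000)

# Twin pairs (p, p+2) are always consecutive primes for p > 3.
twin_pairs = sum(1 for a, b in zip(ps, ps[1:]) if b - a == 2)

# The Polymath8b bound: consecutive primes within 246 of each other.
pairs_within_246 = sum(1 for a, b in zip(ps, ps[1:]) if b - a <= 246)
```

Below $10^5$ there are over a thousand twin pairs, and since the largest prime gap in that range is 72, every consecutive pair is within the Polymath8b bound. The hard part is proving that this never dries up.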
<p>The second result is stranger. Maynard proved in 2016 that for any decimal digit $d \in \{0, 1, \ldots, 9\}$, there are infinitely many primes whose decimal representation contains no instance of $d$. There are infinitely many primes with no $7$ in their decimal expansion. There are infinitely many primes with no $3$. The proof uses techniques from analytic number theory, specifically exponential sum estimates and sieve methods, and the method extends beyond base 10 to any sufficiently large base.</p>
<p>This is one of those results that sounds impossible on first hearing. Surely removing an entire digit should make most large numbers unavailable, so the primes run out? Not so. The density of such &ldquo;digitless&rdquo; numbers thins out, but not fast enough to eliminate infinitely many primes.</p>
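The thinning-but-not-vanishing behaviour shows up already at small scales. A Python sketch (helper function mine) measuring what fraction of primes below $10^5$ avoid the digit 7 — the cutoff is arbitrary and says nothing about Maynard's asymptotic result, but it makes the density question concrete:

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i, b in enumerate(sieve) if b]

ps = primes_up_to(100_000)

# Primes whose decimal expansion contains no '7' at all.
no_seven = [p for p in ps if "7" not in str(p)]
fraction = len(no_seven) / len(ps)
```

Roughly half the primes in this range are 7-free. The fraction decays like $(9/10)^{\#\text{digits}}$ as numbers grow, which thins the candidates but, by Maynard's theorem, never exhausts the primes among them.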
<h2 id="the-52nd-mersenne-prime-and-what-we-do-not-know">The 52nd Mersenne Prime and What We Do Not Know</h2>
<p>Return to $M_{136{,}279{,}841} = 2^{136{,}279{,}841} - 1$. Mersenne primes have the form $2^p - 1$ where $p$ is a prime (though not all such numbers are prime — $2^{11} - 1 = 2047 = 23 \times 89$). They are tested via the Lucas-Lehmer primality test: define the sequence</p>
$$s_0 = 4, \qquad s_{n+1} = s_n^2 - 2.$$<p>Then $M_p = 2^p - 1$ is prime if and only if $s_{p-2} \equiv 0 \pmod{M_p}$.</p>
<p>The test requires $p - 2$ squarings modulo $M_p$. Each squaring involves numbers with roughly $p$ bits, and modular reduction modulo $M_p = 2^p - 1$ is cheap because it reduces to bit-shifts and additions. This is why GPU parallelism helps enormously: each squaring can be broken into many parallel multiplications of sub-blocks of the operand. Durant&rsquo;s cloud network was, in effect, a massively distributed modular arithmetic engine.</p>
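The test itself is only a few lines of Python, since arbitrary-precision integers are built in. This sketch recovers the small Mersenne exponents; the trial-division primality check on the exponent is a deliberate simplification, fine for exponents this small:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer: M_p = 2**p - 1 is prime iff s_(p-2) == 0 mod M_p.
    Assumes p is an odd prime."""
    m = (1 << p) - 1  # M_p; reduction mod 2**p - 1 is shifts and adds
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Recover the Mersenne exponents up to 127 (excluding p = 2,
# which the test does not cover).
mersenne_exponents = [
    q for q in range(3, 130)
    if all(q % d for d in range(2, q)) and lucas_lehmer(q)
]
# → [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127]
```

Durant's computation is this same loop, run $136{,}279{,}839$ times on a number forty-one million digits long.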
<p>We do not know if there are infinitely many Mersenne primes. The heuristic Lenstra-Pomerance-Wagstaff conjecture says yes: the expected number of Mersenne primes $2^p - 1$ with $p \leq x$ is approximately</p>
$$e^\gamma \ln x / \ln 2 \approx 1.78 \cdot \log_2 x,$$<p>where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant. This predicts roughly logarithmic growth in the count — consistent with the 52 known examples — but is nowhere near proved.</p>
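Plugging in the exponent of Durant's prime gives a sense of how the heuristic tracks reality. A short sketch (the value of $\gamma$ is truncated by me; nothing here is a proof of anything):

```python
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, truncated

def lpw_expected(x):
    """Lenstra-Pomerance-Wagstaff heuristic: expected number of
    Mersenne prime exponents p <= x."""
    return math.exp(GAMMA) * math.log(x) / math.log(2)

# Exponent of the 52nd known Mersenne prime (Durant, October 2024).
predicted = lpw_expected(136_279_841)
```

The heuristic predicts about 48 Mersenne primes with exponent up to $136{,}279{,}841$; 52 are known. Given that this is a conjecture about logarithmic growth, agreement to within ten percent is about as good as one could ask.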
<p>The known Mersenne primes do not form a sequence with obviously regular gaps. The exponents $p$ are: 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127, &hellip; and then larger, less predictable values. Whether their distribution has GUE-like statistics is not a standard research question (the sample is too small), but the question of whether the primes $p$ for which $2^p - 1$ is prime have any special structure is an active one. For now, the answer is: we do not know.</p>
<h2 id="why-this-matters-and-why-it-does-not-prove-anything">Why This Matters, and Why It Does Not Prove Anything</h2>
<p>Let me be precise about what has and has not been established.</p>
<p>What has been established:</p>
<ul>
<li>Montgomery proved (conditionally, assuming a form of the generalised Riemann Hypothesis) that the pair correlation of Riemann zeros, for a certain range of $r$, is given by $1 - (\sin \pi r / \pi r)^2$.</li>
<li>Odlyzko verified numerically — to extraordinary precision, over billions of zeros — that the full empirical pair correlation matches the GUE prediction.</li>
<li>The Gutzwiller/Selberg analogy between periodic orbits and primes is mathematically precise in related settings (hyperbolic surfaces, function fields over finite fields).</li>
<li>Rudnick and Sarnak proved that the $n$-point correlation functions of Riemann zeros match GUE for all $n$, subject to a plausible conjecture about $\zeta$ (<a href="#ref-RudnickSarnak1996">Rudnick &amp; Sarnak, 1996</a>).</li>
</ul>
<p>What has not been established:</p>
<ul>
<li>There is no known Hamiltonian $\hat{H}$ whose spectrum is the set of Riemann zeros.</li>
<li>The Riemann Hypothesis remains open.</li>
<li>There is no proof that the Montgomery-Odlyzko connection is anything more than an extraordinary numerical coincidence.</li>
</ul>
<p>The broader context is the Langlands program — a still-hypothetical grand unification of number theory, algebraic geometry, and representation theory, sometimes described as a &ldquo;grand unified theory of mathematics.&rdquo; The Langlands correspondence predicts deep connections between $L$-functions (generalisations of $\zeta$) and representations of algebraic groups. The spectral interpretation of Riemann zeros — if it could be made precise — would fit naturally into this framework. Some researchers believe that a proof of the Riemann Hypothesis will come from the Langlands side, not from analytic number theory or quantum mechanics. Others think the quantum chaos connection is the right road. Nobody knows.</p>
<p>What would it mean if the connection is real? It would mean that the prime numbers — discovered by Euclid, studied for two and a half millennia, used today in every TLS handshake and RSA key — are the eigenvalues of a physical Hamiltonian. The abstract number-theoretic structure and the physical quantum mechanical structure would be not merely analogous but identical. That is a claim of the same depth as the unexpected appearance of the same partial differential equations in heat flow, diffusion, and Brownian motion: a discovery that what seemed to be different phenomena are manifestations of the same underlying law.</p>
<p>Or it could be a very surprising coincidence. Mathematics has a long history of producing such coincidences — the same numbers appearing in unrelated contexts for reasons that, when understood, turned out not to be coincidences at all. I suspect this is not a coincidence. But suspicion is not proof.</p>
<h2 id="a-closing-reflection">A Closing Reflection</h2>
<p>I started this post with the 52nd Mersenne prime because it is the news item that prompted me to write. GPU clusters finding 41-million-digit primes are genuinely impressive technology. But I keep returning to the image of Montgomery and Dyson at tea in 1972, and the formula $1 - (\sin \pi r / \pi r)^2$ connecting two conversations that had nothing to do with each other.</p>
<p>I have spent some time with random matrix theory, and separately with the zeta function, and the thing that still strikes me is how <em>clean</em> the connection is. This is not a numerical coincidence of the form &ldquo;these two quantities agree to 3 decimal places.&rdquo; Odlyzko&rsquo;s plots show agreement across many orders of magnitude, for zeros computed billions of entries into the sequence. The GUE curve and the empirical histogram are, visually, the same curve.</p>
<p>As someone trained as a physicist, I find this both encouraging and slightly unsettling. Encouraging because it suggests that the primes are not random — they have a structure, one that matches the eigenvalue repulsion of quantum chaotic systems, and that structure might be the key to proving the Riemann Hypothesis. Unsettling because it means that the quantum mechanical formalism — which I always thought was a description of a physical world — seems to be reaching into pure arithmetic, where there is no wave function, no Hilbert space, no measurement. The primes do not know they are supposed to be energy levels. And yet, statistically, they are.</p>
<p>If you find a flaw in this picture, or know of a result I have missed, I am genuinely interested. Peer review is welcome — open an issue on <a href="https://github.com/sebastianspicker/sebastianspicker.github.io/issues">GitHub</a>.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Riemann1859"></span>Riemann, B. (1859). Über die Anzahl der Primzahlen unter einer gegebenen Grösse. <em>Monatsberichte der Berliner Akademie</em>.</p>
</li>
<li>
<p><span id="ref-Montgomery1973"></span>Montgomery, H. L. (1973). The pair correlation of zeros of the zeta function. <em>Analytic Number Theory</em>, Proc. Symp. Pure Math., 24, 181–193.</p>
</li>
<li>
<p><span id="ref-Odlyzko1987"></span>Odlyzko, A. M. (1987). On the distribution of spacings between zeros of the zeta function. <em>Mathematics of Computation</em>, 48, 273–308. <a href="https://doi.org/10.2307/2007890">DOI: 10.2307/2007890</a></p>
</li>
<li>
<p><span id="ref-BerryKeating1999"></span>Berry, M. V., &amp; Keating, J. P. (1999). The Riemann zeros and eigenvalue asymptotics. <em>SIAM Review</em>, 41(2), 236–266. <a href="https://doi.org/10.1137/S0036144598347497">DOI: 10.1137/S0036144598347497</a></p>
</li>
<li>
<p><span id="ref-Zhang2014"></span>Zhang, Y. (2014). Bounded gaps between primes. <em>Annals of Mathematics</em>, 179(3), 1121–1174. <a href="https://doi.org/10.4007/annals.2014.179.3.7">DOI: 10.4007/annals.2014.179.3.7</a></p>
</li>
<li>
<p><span id="ref-Maynard2015"></span>Maynard, J. (2015). Small gaps between primes. <em>Annals of Mathematics</em>, 181(1), 383–413. <a href="https://doi.org/10.4007/annals.2015.181.1.7">DOI: 10.4007/annals.2015.181.1.7</a></p>
</li>
<li>
<p><span id="ref-RudnickSarnak1996"></span>Rudnick, Z., &amp; Sarnak, P. (1996). Zeros of principal L-functions and random matrix theory. <em>Duke Mathematical Journal</em>, 81(2), 269–322. <a href="https://doi.org/10.1215/S0012-7094-96-08115-6">DOI: 10.1215/S0012-7094-96-08115-6</a></p>
</li>
<li>
<p><span id="ref-GIMPS2024"></span>GIMPS (2024). 2^136279841-1 is Prime! Great Internet Mersenne Prime Search. Retrieved from <a href="https://www.mersenne.org/primes/?press=M136279841">https://www.mersenne.org/primes/?press=M136279841</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-02-17</strong>: Corrected the date of the Montgomery-Dyson meeting from 1973 to 1972 (the paper was published in the 1973 proceedings volume, but the meeting at the IAS took place in April 1972).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Hamiltonian of Intelligence: From Spin Glasses to Neural Networks</title>
      <link>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</link>
      <pubDate>Mon, 21 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</guid>
      <description>On October 8, 2024, Hopfield and Hinton were awarded the Nobel Prize in Physics. The physics community reacted with irritation: is machine learning really physics? The irritation is wrong. The energy function of a Hopfield network is literally the Ising Hamiltonian. The lineage runs from Giorgio Parisi&amp;rsquo;s disordered iron alloys in 1979 to the model that predicted the structures of 200 million proteins.</description>
      <content:encoded><![CDATA[<p>On October 8, 2024, the Royal Swedish Academy of Sciences announced that the Nobel Prize in Physics would go to John Hopfield and Geoffrey Hinton &ldquo;for foundational discoveries and inventions that enable machine learning with artificial neural networks.&rdquo; Within hours, the physics corner of the internet had an episode. Thermodynamics Twitter — yes, that is a thing — asked whether gradient descent is really physics in the sense that the Higgs mechanism is physics. The condensed matter community, who have been doing disordered systems since before most ML practitioners were born, oscillated between pride (&ldquo;finally, they noticed us&rdquo;) and bafflement (&ldquo;why is Hinton here and not Parisi?&rdquo;). There were takes. There were dunks. Someone made a graph of Nobel prizes versus average journal impact factor and it was not flattering to this year&rsquo;s winner.</p>
<p>I understand the irritation. I do not share it.</p>
<p>The argument I want to make is stronger than &ldquo;machine learning uses some physics concepts by analogy.&rdquo; The energy function that Hopfield wrote down in 1982 is not <em>inspired by</em> the Ising Hamiltonian. It <em>is</em> the Ising Hamiltonian. The machine that Hinton and Sejnowski built in 1985 is not named after Boltzmann as a cute metaphor. It is a physical system whose equilibrium distribution is the Boltzmann distribution, and whose learning algorithm is derived from statistical mechanics. The lineage from disordered magnets to protein structure prediction is not a convenient narrative; it is a sequence of mathematical identities.</p>
<p>Let me trace it properly.</p>
<h2 id="the-2021-nobel-parisi-and-the-frozen-magnet">The 2021 Nobel: Parisi and the frozen magnet</h2>
<p>Before we get to 2024, we need 2021. Giorgio Parisi received half the Nobel Prize in Physics that year for work done between 1979 and 1983 on spin glasses. The other half went to Syukuro Manabe and Klaus Hasselmann for climate modelling — an interesting pairing that provoked its own set of takes, though rather fewer.</p>
<p>A spin glass is a disordered magnetic system. The canonical physical realisation is a dilute alloy: a small concentration of manganese atoms dissolved in copper. Each manganese atom carries a magnetic moment — a spin — that can point in one of two directions, which we label $\sigma_i \in \{-1, +1\}$. The spins interact with each other via exchange interactions mediated by the conduction electrons. The crucial feature is that these interactions are random: some spin pairs prefer to align (ferromagnetic coupling, $J_{ij} > 0$) and others prefer to anti-align (antiferromagnetic coupling, $J_{ij} < 0$), and there is no spatial pattern to which is which.</p>
<p>The Hamiltonian of the system is</p>
$$H = -\sum_{i < j} J_{ij} \sigma_i \sigma_j$$<p>where the $J_{ij}$ are random variables drawn from some distribution. In the Sherrington-Kirkpatrick (SK) model (<a href="#ref-Sherrington1975">Sherrington &amp; Kirkpatrick, 1975</a>), all $N$ spins interact with all other spins — a mean-field model — and the couplings are drawn from a Gaussian distribution with mean zero and variance $J^2/N$:</p>
$$J_{ij} \sim \mathcal{N}\!\left(0,\, \frac{J^2}{N}\right)$$<p>The factor of $1/N$ is essential for extensivity: without it, the energy would scale as $N^2$ rather than $N$, which is unphysical.</p>
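<p>A minimal numerical sketch of this setup — symmetric Gaussian couplings with variance $J^2/N$, energy from the Hamiltonian above — takes a few lines of NumPy (the system size and seed are arbitrary illustrative choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 500, 1.0

# Symmetric Gaussian couplings J_ij ~ N(0, J^2/N), zero diagonal.
upper = np.triu(rng.normal(0.0, J / np.sqrt(N), size=(N, N)), k=1)
couplings = upper + upper.T

def sk_energy(spins):
    """H = -sum_{i<j} J_ij s_i s_j, written via the full symmetric matrix."""
    return -0.5 * spins @ couplings @ spins

spins = rng.choice([-1, 1], size=N)
energy_per_spin = sk_energy(spins) / N  # extensive: stays O(1) as N grows
```

<p>Per spin, the energy stays $O(1)$ as $N$ grows — which is exactly what the $1/N$ scaling of the coupling variance buys.</p>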
<p>Now here is the key phenomenon. At high temperature, the spins fluctuate freely and the system is paramagnetic. Cool it below the glass transition temperature $T_g$, and the system &ldquo;freezes&rdquo; — but not into a ferromagnet with all spins aligned, and not into a simple antiferromagnet. It freezes into one of an astronomically large number of disordered, metastable states. The system is not in its true ground state; it is trapped. It cannot find its way down because the energy landscape is rugged: every path toward lower energy is blocked by a barrier.</p>
<p>This rugged landscape is the central object. It has exponentially many local minima, separated by barriers that grow with system size. Different initial conditions lead to different frozen states. The system has memory of its history — hence &ldquo;glass&rdquo; rather than &ldquo;crystal.&rdquo;</p>
<p>Computing thermodynamic quantities in this system requires averaging over the disorder (the random $J_{ij}$), which means computing the quenched average of the free energy:</p>
$$\overline{F} = -T\, \overline{\ln Z}$$<p>The overline denotes an average over the distribution of couplings. The problem is that $\ln Z$ is hard to average because $Z$ is a sum of exponentially many terms. The route around this — the replica trick, already used by Sherrington and Kirkpatrick and pushed much further by Parisi — is a mathematical device worth describing, because it is beautifully strange.</p>
<p>The trick exploits the identity $\ln Z = \lim_{n \to 0} (Z^n - 1)/n$. We compute $\overline{Z^n}$ for integer $n$, which is feasible because $Z^n$ is a product of $n$ copies (replicas) of the partition function, and the average over disorder decouples. We then analytically continue in $n$ to $n \to 0$. The result is an effective action in terms of order parameters $q^{ab}$, which describe the overlap between spin configurations in replica $a$ and replica $b$.</p>
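<p>The $n \to 0$ limit sounds illegitimate, but the identity itself is easy to check numerically for any fixed $Z$ — here the partition function of two coupled Ising spins at $\beta J = 1$, a toy value chosen purely for illustration:</p>

```python
import numpy as np

# Partition function of two coupled Ising spins at beta*J = 1:
# Z = sum over the four states of exp(s1*s2) = 2e + 2/e.
Z = 2 * np.e + 2 / np.e

# (Z^n - 1)/n approaches ln Z as n -> 0.
estimates = [(Z**n - 1) / n for n in (1.0, 0.1, 0.01, 0.001)]
```

<p>The estimates converge on $\ln Z$ as $n$ shrinks; the replica method's real daring is doing the continuation after the disorder average, not before.</p>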
<p>The naive assumption is replica symmetry: all $q^{ab}$ are equal. This assumption turns out to be wrong. Parisi showed that the correct solution breaks replica symmetry in a hierarchical way — the overlap matrix $q^{ab}$ has a nested structure, described by a function $q(x)$ for $x \in [0,1]$. This is replica symmetry breaking (RSB).</p>
<p>RSB has a beautiful physical interpretation. The phase space of the spin glass is organised into an ultrametric tree: exponentially many states, arranged in nested clusters. States in the same cluster are similar (high overlap); states in different clusters are very different (low overlap). The hierarchy has infinitely many levels. Parisi proposed this structure as the exact solution of the SK model (<a href="#ref-Parisi1979">Parisi, 1979</a>); the rigorous proof came decades later, through the work of Guerra and Talagrand.</p>
<p>This is not an abstraction. RSB predicts specific, measurable properties of real spin glass alloys, and experiments have confirmed them. It is also, I want to emphasise, not a result that anyone expected. The mathematics forced it.</p>
<p>Three years after Parisi solved the SK model, a physicist at Bell Labs wrote a paper about memory.</p>
<h2 id="hopfield-1982-memory-as-energy-minimisation">Hopfield (1982): memory as energy minimisation</h2>
<p>John Hopfield was a condensed matter physicist who had drifted toward biophysics — electron transfer in proteins, neural computation. In 1982 he published a paper in PNAS with the title &ldquo;Neural networks and physical systems with emergent collective computational abilities&rdquo; (<a href="#ref-Hopfield1982">Hopfield, 1982</a>). Most biologists read it as a neuroscience paper. It is a statistical mechanics paper.</p>
<p>Hopfield defined a network of $N$ binary &ldquo;neurons&rdquo; $s_i \in \{-1, +1\}$ with symmetric weights $W_{ij} = W_{ji}$, and an energy function:</p>
$$E = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, s_i s_j$$<p>Readers who have seen the SK Hamiltonian above will notice something. This is it. The $J_{ij}$ of the spin glass are the $W_{ij}$ of the neural network. The Ising spins $\sigma_i$ are the neuron states $s_i$. The Hopfield network energy function is the Ising model Hamiltonian with symmetric, fixed (non-random) couplings. This is not a metaphor. This is the same equation.</p>
<p>The dynamics: at each step, choose a neuron $i$ at random and update it according to</p>
$$s_i \leftarrow \text{sgn}\!\left(\sum_{j} W_{ij} s_j\right)$$<p>This update always decreases or leaves unchanged the energy $E$ (because the weights are symmetric and the neurons are updated one at a time). The network is a gradient descent machine on $E$. It will always converge to a local minimum — a fixed point.</p>
<p>The innovation is in how Hopfield chose the weights. To store a set of $p$ binary patterns $\xi^\mu \in \{-1,+1\}^N$ (for $\mu = 1, \ldots, p$), use Hebb&rsquo;s rule:</p>
$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi^\mu_i\, \xi^\mu_j$$<p>This is the outer product rule. Each stored pattern contributes a rank-1 matrix to $W$. You can verify that if $s = \xi^\mu$, then the local field at neuron $i$ is</p>
$$h_i = \sum_j W_{ij} s_j = \frac{1}{N}\sum_j \sum_{\nu} \xi^\nu_i \xi^\nu_j \xi^\mu_j = \xi^\mu_i + \frac{1}{N}\sum_{\nu \neq \mu} \xi^\nu_i \underbrace{\left(\sum_j \xi^\nu_j \xi^\mu_j\right)}_{\text{cross-talk}}$$<p>The first term reinforces pattern $\mu$. The second term is noise from the other stored patterns. When the patterns are random and uncorrelated, the cross-talk has zero mean and fluctuations of order $\sqrt{p/N}$, so for $p \ll N$ the first term dominates and the stored patterns are stable fixed points of the dynamics. A noisy or incomplete input — a partial pattern — will evolve under the dynamics toward the nearest stored pattern. This is associative memory: content-addressable retrieval.</p>
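<p>All of this — Hebbian storage, asynchronous sign updates, retrieval from a corrupted cue — fits in a dozen lines of NumPy. A sketch with illustrative sizes, keeping $p$ well below capacity:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 10  # p/N = 0.05, comfortably below the ~0.14N capacity limit

# Random uncorrelated binary patterns to store.
patterns = rng.choice([-1, 1], size=(p, N))

# Hebb's outer-product rule: W_ij = (1/N) sum_mu xi^mu_i xi^mu_j, zero diagonal.
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)

def recall(state, sweeps=10):
    """Asynchronous sign updates; with symmetric weights the energy never
    increases, so the dynamics settle into a fixed point."""
    s = state.copy()
    for _ in range(sweeps):
        for i in rng.permutation(N):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt a stored pattern by flipping 14% of its bits, then retrieve it.
probe = patterns[0].copy()
flipped = rng.choice(N, size=N // 7, replace=False)
probe[flipped] *= -1
restored = recall(probe)
overlap = (restored @ patterns[0]) / N  # 1.0 means perfect retrieval
```

<p>With $p/N = 0.05$, the corrupted probe (overlap $0.72$ with the stored pattern) is pulled back to near-perfect overlap. Push $p$ past $\approx 0.14N$ and this retrieval degrades abruptly.</p>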
<p>The capacity limit follows from the same analysis. As $p$ grows, the cross-talk grows. When $p$ exceeds approximately $0.14N$, the cross-talk overwhelms the signal, and the network begins to form spurious minima — states that are not any of the stored patterns but are mixtures or corruptions of them. The network has entered a spin-glass phase.</p>
<p>This is not a rough analogy. Amit, Gutfreund, and Sompolinsky showed in 1985 that the Hopfield model is <em>exactly</em> the SK model with $p$ planted minima (<a href="#ref-Amit1985">Amit, Gutfreund, &amp; Sompolinsky, 1985</a>). The phase diagram of the Hopfield model — paramagnetic phase, memory phase, spin-glass phase — maps precisely onto the phase diagram of the SK model. The capacity limit $p \approx 0.14N$ is the phase boundary between the memory phase and the spin-glass phase, derivable from Parisi&rsquo;s RSB theory.</p>
<p>The 2021 Nobel and the 2024 Nobel are, mathematically, about the same model.</p>
<h2 id="boltzmann-machines-hinton--sejnowski-1985">Boltzmann machines (Hinton &amp; Sejnowski, 1985)</h2>
<p>The Hopfield model is deterministic and shallow — one layer of visible neurons, no hidden structure. Geoffrey Hinton and Terry Sejnowski, in a collaboration that began at the Cognitive Science summer school in Pittsfield in 1983 and culminated in a 1985 paper (<a href="#ref-Ackley1985">Ackley, Hinton, &amp; Sejnowski, 1985</a>), added two things: hidden units and stochastic dynamics.</p>
<p>Hidden units $h_j$ are neurons not connected to any input or output. They do not correspond to observable quantities; they model latent structure in the data. The energy of the system is:</p>
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} W_{ij}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$$<p>where $v_i$ are the visible (data) units, $h_j$ are the hidden units, $a_i$ and $b_j$ are biases. Note that this is still an Ising-type energy; the $W_{ij}$ are now inter-layer weights. (Written with inter-layer couplings only, as here, this is the energy of the <em>restricted</em> Boltzmann machine; the general machine also couples units within each layer.)</p>
<p>The stochastic dynamics replace deterministic gradient descent with a Markov chain. Each unit is updated probabilistically:</p>
$$P(s_k = 1 \mid \text{rest}) = \sigma\!\left(\sum_j W_{kj} s_j + \text{bias}_k\right)$$<p>where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. At inverse temperature $\beta = 1/T$, the probability of any complete configuration is</p>
$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-\beta E(\mathbf{v}, \mathbf{h})}$$<p>This is the Boltzmann distribution. The machine is named after Ludwig Boltzmann because the equilibrium distribution of its states is the Boltzmann distribution. Not analogously. Literally.</p>
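<p>Sampling from this distribution is block Gibbs sampling: alternately resample the hidden units given the visibles and vice versa. A sketch with 0/1 units (the usual RBM convention; $\pm 1$ spins differ only by a change of variable) and random, untrained weights:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A small machine with random illustrative weights, not a trained model.
n_vis, n_hid = 6, 4
W = rng.normal(0, 0.5, size=(n_vis, n_hid))
a = np.zeros(n_vis)  # visible biases
b = np.zeros(n_hid)  # hidden biases

def gibbs_step(v):
    """One sweep of block Gibbs sampling: v -> h -> v'."""
    p_h = sigmoid(v @ W + b)              # P(h_j = 1 | v)
    h = (rng.random(n_hid) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + a)            # P(v_i = 1 | h)
    v_new = (rng.random(n_vis) < p_v).astype(float)
    return v_new, h

# Long runs of this chain sample from the Boltzmann distribution over (v, h).
v = rng.integers(0, 2, n_vis).astype(float)
for _ in range(100):
    v, h = gibbs_step(v)
```
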
<p>Learning amounts to adjusting the weights to make the model distribution $P(\mathbf{v}, \mathbf{h})$ match the data distribution $P_{\text{data}}(\mathbf{v})$. The objective is to minimise the Kullback-Leibler divergence:</p>
$$\mathcal{L} = D_{\mathrm{KL}}(P_{\text{data}} \| P_{\text{model}}) = \sum_{\mathbf{v}} P_{\text{data}}(\mathbf{v}) \ln \frac{P_{\text{data}}(\mathbf{v})}{P_{\text{model}}(\mathbf{v})}$$<p>The gradient with respect to the weight $W_{ij}$ is</p>
$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\langle v_i h_j \rangle_{\text{data}} + \langle v_i h_j \rangle_{\text{model}}$$<p>The first term is the empirical correlation between visible unit $i$ and hidden unit $j$ when the visible units are clamped to data. The second term is the correlation in the model&rsquo;s free-running equilibrium. The learning rule says: increase $W_{ij}$ if the data sees these two units co-active more than the model does, and decrease it otherwise. This is Hebbian learning with a contrastive correction — the physics of equilibration drives the learning.</p>
<p>The computational difficulty is the second term. Computing $\langle v_i h_j \rangle_{\text{model}}$ requires the Markov chain to reach equilibrium, which takes exponentially long in general. Hinton&rsquo;s later invention of contrastive divergence — run the chain for only a few steps rather than to equilibrium — made training feasible, at the cost of a biased gradient estimate. This engineering compromise is part of why the physics purists are uncomfortable: the original derivation is rigorous statistical mechanics, but the algorithm that actually works in practice is an approximation whose convergence properties are poorly understood.</p>
<p>I find this charming rather than damning. Physics itself is full of approximations whose convergence properties are poorly understood but which happen to give right answers. Perturbation theory beyond leading order, the replica trick itself — these are not rigorous mathematics. They are informed guesses that happen to be correct. The history of theoretical physics is mostly the history of getting away with things.</p>
<h2 id="from-boltzmann-machines-to-transformers">From Boltzmann machines to transformers</h2>
<p>The Boltzmann machine was computationally difficult but conceptually foundational. The restricted Boltzmann machine (RBM) — with no within-layer connections, so that hidden units are conditionally independent given the visible units and vice versa — made training via contrastive divergence practical.</p>
<p>Hinton, Osindero, and Teh&rsquo;s 2006 paper on deep belief networks showed that stacking RBMs and pre-training them greedily could initialise deep networks well enough to fine-tune with backpropagation. This was the breakthrough that restarted deep learning after the winter of the 1990s. It is fair to say that without the Boltzmann machine as conceptual foundation and the RBM as practical building block, the deep learning revolution that gave us <a href="/posts/strawberry-tokenisation/">large language models that fail to count letters in words</a> would not have happened in the form it did.</p>
<p>The connection between Hopfield networks and modern attention mechanisms is more recent and more surprising. Ramsauer et al. (2020) showed that modern Hopfield networks — a generalisation of the original with continuous states and a different energy function — have exponential storage capacity (<a href="#ref-Ramsauer2020">Ramsauer et al., 2020</a>). More strikingly, the update rule of the modern Hopfield network is:</p>
$$\mathbf{s}^{\text{new}} = \mathbf{X}\, \text{softmax}\!\left(\beta \mathbf{X}^\top \mathbf{s}\right)$$<p>where $\mathbf{X}$ is the matrix of stored patterns and $\mathbf{s}$ is the query. This is the attention mechanism of the transformer, up to notation. The transformer&rsquo;s multi-head self-attention is, formally, a generalised Hopfield retrieval step. The architecture that powers GPT and everything descended from it is, at one level of abstraction, an associative memory performing energy minimisation on a Hopfield energy landscape.</p>
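<p>The correspondence is easiest to see in code. One modern-Hopfield update, with the stored patterns serving as both keys and values (dimensions, pattern count, noise level, and $\beta$ are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, M, beta = 64, 8, 8.0                 # dimension, stored patterns, inverse temp.
X = rng.normal(size=(d, M))             # columns are the stored patterns
X /= np.linalg.norm(X, axis=0)          # unit-normalise each pattern

# Query: a mildly noisy version of stored pattern 3.
s = X[:, 3] + 0.1 * rng.normal(size=d)

# One modern-Hopfield update. Read it as attention:
# output = values @ softmax(beta * keys^T @ query), with keys = values = X.
s_new = X @ softmax(beta * (X.T @ s))

retrieved = int(np.argmax(X.T @ s_new))               # index of the recalled pattern
similarity = s_new @ X[:, 3] / np.linalg.norm(s_new)  # cosine with the target
```

<p>Read the update line as attention with a single query and you have the transformer&rsquo;s core operation.</p>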
<p>I do not want to overstate this. The connection is formal and the interpretation is contested. But it is not nothing. The physicists who built the Hopfield network in 1982 were working on the same mathematical object that is now used to process language, images, and protein sequences at industrial scale.</p>
<h2 id="the-protein-folding-connection">The protein folding connection</h2>
<p>The 2024 Nobel Prize in Chemistry went to Demis Hassabis, John Jumper, and David Baker for computational protein structure prediction — specifically for AlphaFold2 (<a href="#ref-Jumper2021">Jumper et al., 2021</a>). This made October 2024 a remarkable month for Nobel Prizes in fields adjacent to artificial intelligence, and it is not a coincidence.</p>
<p>Protein folding is a spin-glass problem. A protein is a polymer of amino acids, each with different chemical properties and steric constraints. The protein folds into a unique three-dimensional structure — its native conformation — determined by its sequence. The energy landscape of the folding process is precisely the kind of rugged landscape that Parisi described for spin glasses: exponentially many misfolded states, separated by barriers, with the native structure as the global minimum (or close to it).</p>
<p>Levinthal&rsquo;s paradox, formulated in 1969, makes the absurdity quantitative. A modest protein of 100 amino acids might have $3^{100} \approx 5 \times 10^{47}$ possible conformations (allowing three dihedral angle states per residue). Random search of this space, at the rate of one conformation per picosecond, would take on the order of $10^{28}$ years — somewhat longer than the age of the universe. Yet proteins fold in milliseconds to seconds. They do not search randomly; the energy landscape is funnel-shaped, channelling the dynamics toward the native state. But predicting <em>which</em> state is the native one from sequence alone remained one of the hard problems of structural biology for fifty years.</p>
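<p>The arithmetic is worth checking once, by machine if not by hand:</p>

```python
residues, states_per_residue = 100, 3
conformations = states_per_residue ** residues       # 3^100, about 5e47

picoseconds_per_year = 3600 * 24 * 365 * 10**12      # about 3.15e19

# Exhaustive search at one conformation per picosecond:
search_years = conformations / picoseconds_per_year  # about 1.6e28
age_of_universe_years = 1.38e10
excess = search_years / age_of_universe_years        # about 1e18
```
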
<p>AlphaFold2 uses a transformer architecture — descended from the Boltzmann machine lineage — trained on millions of known protein structures. It does not simulate the folding dynamics; it has learned, from data, a mapping from sequence to structure that encodes the statistical mechanics of the folding funnel. The Nobel committee gave it the Chemistry prize because it is transforming biochemistry. But the conceptual machinery is pure statistical physics: representation of a high-dimensional energy landscape, approximation of the minimum, learned from the distribution of solved instances.</p>
<p>The three Nobels of 2021–2024 form the most coherent consecutive triple I can remember: Parisi showed how disordered energy landscapes behave; Hopfield and Hinton showed how to use energy landscapes as memory and learning machines; Hassabis and Jumper showed how to apply the resulting architecture to the most consequential outstanding problem in molecular biology. Each step is a mathematical consequence of the one before it.</p>
<h2 id="the-controversy-did-the-committee-err">The controversy: did the committee err?</h2>
<p>I said I understand the irritation. Here is what is right about it.</p>
<p>Hinton&rsquo;s work after the Boltzmann machine — backpropagation, dropout, convolutional networks, deep learning at ImageNet scale — is primarily engineering and empirical machine learning. The 2012 AlexNet result that restarted the field was not a theoretical physics contribution; it was a demonstration that known methods work very well on very large datasets with very large GPUs. The fact that it works is not explained by statistical mechanics. The scaling laws of neural networks (loss scales as a power law with compute, parameters, and data) are empirical observations that physicists have tried to explain with renormalisation group arguments with mixed success.</p>
<p>If the Nobel Prize in Physics were awarded for &ldquo;the work that most influenced technology in the past decade,&rdquo; the case for Hinton is strong. If it were awarded for &ldquo;the most important contribution to the science of physics,&rdquo; the case is weaker. There is a version of the Nobel announcement that emphasises the Boltzmann machine specifically — the 1985 paper that is literally named after a physicist and uses his distribution — and that version sits cleanly within physics. There is a broader version that encompasses all of Hinton&rsquo;s career, and that version includes a great deal of empirical machine learning that the physics community is reasonably reluctant to claim.</p>
<p>My view, for what it is worth from someone who has been <a href="/posts/ai-warfare-anthropic-atom-bomb/">thinking about AI ethics and consequences</a> for rather longer than feels comfortable: the Nobel correctly identifies that the foundational conceptual contributions — the Ising Hamiltonian as associative memory, the Boltzmann distribution as a learning target, the connection between statistical mechanics and computation — are physics. They came from physicists, they use physics mathematics, they extend physics intuition into a new domain. The subsequent scaling of these ideas using TPUs and transformer architectures is engineering. Valuable engineering, world-changing engineering, but engineering. The Nobel is for the former. If the citation had been more specific — &ldquo;for the Boltzmann machine and its demonstration that physical principles govern neural computation&rdquo; — the physics community would have been less irritated and equally correct.</p>
<p>What the irritation reveals is something slightly uncomfortable about disciplinary identity. Physicists are proud of universality: the idea that the same mathematical structures appear in wildly different physical systems. RSB in spin glasses, replica methods in random matrices, the Parisi–Sourlas correspondence between disordered systems and supersymmetric field theories — the joy of physics is precisely that these deep structural similarities cross domain boundaries. When that universality reaches into machine learning and says &ldquo;your transformer attention layer is a Hopfield retrieval step,&rdquo; physicists should be delighted, not affronted.</p>
<p>The <a href="/posts/ralph-loop/">agentic systems</a> that are being built right now on top of transformer architectures are doing something that looks, from a sufficiently abstract distance, like what the Hopfield network was designed to do: find stored patterns that match a query, and use them to generate a response. The <a href="/posts/car-wash-grounding/">failures of grounding</a> that I have written about elsewhere are, in this view, failures of the energy landscape — the model finds a metastable state that is not the correct minimum, and the dynamics cannot escape. Spin glass physics does not explain these failures in detail, but it gives a language for thinking about them. That is what physics is for.</p>
<h2 id="the-universality-argument">The universality argument</h2>
<p>Let me make the deeper claim explicit. Why should disordered magnets, associative memory networks, and protein folding all live in the same mathematical family?</p>
<p>Because they all have the same structure: many interacting degrees of freedom with competing constraints, a combinatorially large configuration space, an energy landscape with exponentially many metastable states, and dynamics that search for — and frequently fail to find — global minima. This is a universality class. The specific details (magnetic moments versus neuron states versus dihedral angles) are irrelevant at the level of the energy landscape topology.</p>
<p>Parisi&rsquo;s contribution was to show that this class has a specific, exactly-solvable structure in mean field theory, characterised by replica symmetry breaking and the ultrametric organisation of states. This was not a solution to one model. It was a description of a universality class. The fact that the Hopfield model is in this class is not a coincidence requiring explanation; it is a mathematical identity requiring verification.</p>
<p>The <a href="/posts/kuramoto-ensemble-sync/">Kuramoto model for coupled oscillators</a> — which I have written about in the context of ensemble synchronisation and neural phase coupling — is another member of this extended family. The synchronisation transition in the Kuramoto model, the glass transition in the SK model, and the memory phase transition in the Hopfield model are all mean-field phase transitions in disordered many-body systems. The mathematics is more similar than the physics syllabi suggest.</p>
<p>When I teach physics and occasionally venture into questions about what the AI tools my students are using actually do, I find myself reaching for this framework. Not because it gives engineering insight into how to train a better model — it does not, particularly — but because it gives honest insight into <em>what kind of thing</em> a neural network is. It is a physical system. It has an energy landscape. Its failures are phase transitions. Its successes are energy minimisation. The vocabulary of statistical mechanics is not a metaphor; it is the correct description.</p>
<p>The Nobel committee noticed. They were right to notice.</p>
<hr>
<p><em>The 2021 and 2024 Nobel Prizes in Physics have now officially bridged the gap between condensed matter physics and machine learning in the public record. For anyone who wants to understand either field more deeply than the press releases suggest, the SK model and the Hopfield network are the right place to start. Both papers are short by modern standards — Parisi&rsquo;s 1979 letter is three pages; Hopfield&rsquo;s 1982 PNAS paper is five — and both repay close reading.</em></p>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Sherrington1975"></span>Sherrington, D., &amp; Kirkpatrick, S. (1975). Solvable model of a spin-glass. <em>Physical Review Letters</em>, 35(26), 1792–1796. <a href="https://doi.org/10.1103/PhysRevLett.35.1792">DOI: 10.1103/PhysRevLett.35.1792</a></p>
</li>
<li>
<p><span id="ref-Parisi1979"></span>Parisi, G. (1979). Infinite number of order parameters for spin-glasses. <em>Physical Review Letters</em>, 43(23), 1754–1756. <a href="https://doi.org/10.1103/PhysRevLett.43.1754">DOI: 10.1103/PhysRevLett.43.1754</a></p>
</li>
<li>
<p><span id="ref-Hopfield1982"></span>Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. <em>Proceedings of the National Academy of Sciences</em>, 79(8), 2554–2558. <a href="https://doi.org/10.1073/pnas.79.8.2554">DOI: 10.1073/pnas.79.8.2554</a></p>
</li>
<li>
<p><span id="ref-Ackley1985"></span>Ackley, D. H., Hinton, G. E., &amp; Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. <em>Cognitive Science</em>, 9(1), 147–169. <a href="https://doi.org/10.1207/s15516709cog0901_7">DOI: 10.1207/s15516709cog0901_7</a></p>
</li>
<li>
<p><span id="ref-Amit1985"></span>Amit, D. J., Gutfreund, H., &amp; Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. <em>Physical Review Letters</em>, 55(14), 1530–1533. <a href="https://doi.org/10.1103/PhysRevLett.55.1530">DOI: 10.1103/PhysRevLett.55.1530</a></p>
</li>
<li>
<p><span id="ref-Jumper2021"></span>Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. <em>Nature</em>, 596, 583–589. <a href="https://doi.org/10.1038/s41586-021-03819-2">DOI: 10.1038/s41586-021-03819-2</a></p>
</li>
<li>
<p><span id="ref-Ramsauer2020"></span>Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., &amp; Hochreiter, S. (2020). Hopfield networks is all you need. <em>arXiv:2008.02217</em>. Retrieved from <a href="https://arxiv.org/abs/2008.02217">https://arxiv.org/abs/2008.02217</a></p>
</li>
<li>
<p><span id="ref-Nobel2024"></span>Nobel Prize Committee. (2024). Scientific background: Machine learning and physical systems. The Royal Swedish Academy of Sciences. Retrieved from <a href="https://www.nobelprize.org/prizes/physics/2024/advanced-information/">https://www.nobelprize.org/prizes/physics/2024/advanced-information/</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Three Rs in Strawberry: What the Viral Counting Test Actually Reveals</title>
      <link>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</link>
      <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/strawberry-tokenisation/</guid>
      <description>In September 2024, OpenAI revealed that its new o1 model had been code-named &amp;ldquo;Strawberry&amp;rdquo; internally — the same word that language models have famously been unable to count letters in. The irony was too perfect to pass up. But the counting failure is not a sign that LLMs are naive or broken. It is a precise, informative symptom of how they process text. Here is the actual explanation, with a minimum of hand-waving.</description>
      <content:encoded><![CDATA[<h2 id="the-setup">The Setup</h2>
<p>In September 2024, OpenAI publicly confirmed that their new reasoning model
had been code-named &ldquo;Strawberry&rdquo; during development. This landed with a
particular thud because &ldquo;how many r&rsquo;s are in strawberry?&rdquo; had, by that
point, become one of the canonical demonstrations of language model failure.
The model named after strawberry could not count the letters in strawberry.
The internet had opinions.</p>
<p>Before the opinions: the answer is three. s-t-<strong>r</strong>-a-w-b-e-<strong>r</strong>-<strong>r</strong>-y.
One in the <em>str-</em> cluster, two in the <em>-rry</em> ending. Most people
get this right on the first try; most large language models got it wrong,
returning &ldquo;two&rdquo; with apparent confidence.</p>
<p>The question worth asking is not &ldquo;why is the model stupid.&rdquo; It is not
stupid, and &ldquo;stupid&rdquo; is not a useful category here. The question is: what
does this specific error reveal about the structure of the system?</p>
<p>The answer involves tokenisation, and it is actually interesting.</p>
<hr>
<h2 id="how-you-count-letters-and-how-the-model-doesnt">How You Count Letters (and How the Model Doesn&rsquo;t)</h2>
<p>When you count the r&rsquo;s in &ldquo;strawberry,&rdquo; you do something like this:
scan the string left to right, maintain a running count, increment it
each time you see the target character. This is a sequential operation
over a character array. It requires no semantic knowledge about the word —
it does not matter whether &ldquo;strawberry&rdquo; is a fruit, a colour, or a
nonsense string. The characters are the input; the count is the output.</p>
<p>A language model does not receive a character array. It receives a
sequence of <em>tokens</em> — chunks produced by a compression algorithm called
Byte Pair Encoding (BPE) that the model was trained with. In the
tokeniser used by GPT-class models, &ldquo;strawberry&rdquo; is most likely split as:</p>
$$\underbrace{\texttt{str}}_{\text{token 1}} \;\underbrace{\texttt{aw}}_{\text{token 2}} \;\underbrace{\texttt{berry}}_{\text{token 3}}$$<p>Three tokens. The model&rsquo;s input is these three integer IDs, each looked up
in an embedding table to produce a vector. There is no character array.
There is no letter &ldquo;r&rdquo; sitting at a known position. There are three dense
vectors representing &ldquo;str,&rdquo; &ldquo;aw,&rdquo; and &ldquo;berry.&rdquo;</p>
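<p>The contrast is easy to state in code. Counting letters per token and summing is trivial once you have the characters — the point is that the model never receives the strings on the left-hand side (the split is the plausible segmentation shown above):</p>

```python
def count_letter_by_token(tokens, letter):
    """Count a letter inside each token, then sum. The summed view is what
    a model would need; the per-character view is what it never receives."""
    per_token = {tok: tok.count(letter) for tok in tokens}
    return per_token, sum(per_token.values())

# The three-token split described above -- one plausible BPE segmentation.
per_token, total = count_letter_by_token(["str", "aw", "berry"], "r")
# per_token == {"str": 1, "aw": 0, "berry": 2}; total == 3
```
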
<hr>
<h2 id="what-bpe-does-and-doesnt-preserve">What BPE Does (and Doesn&rsquo;t) Preserve</h2>
<p>BPE is a greedy compression algorithm. Starting from individual bytes,
it iteratively merges the most frequent pair of adjacent symbols into a
single new token:</p>
$$\text{merge}(a, b) \;:\; \underbrace{a \;\; b}_{\text{separate}} \;\longrightarrow\; \underbrace{ab}_{\text{single token}}$$<p>Applied to a large text corpus until a fixed vocabulary size is reached,
this produces a vocabulary of common subwords. Frequent words and common
word-parts become single tokens; rare sequences stay as multi-token
fragments.</p>
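<p>A toy version of the merge loop makes the greediness concrete. This sketch runs BPE on a single word rather than a corpus — not how real tokenisers are trained, but the mechanics are the same:</p>

```python
from collections import Counter

def bpe_merges(word, n_merges):
    """Greedy BPE on a single word: repeatedly fuse the most frequent
    adjacent pair of symbols. Real tokenisers learn merges from a corpus."""
    symbols = list(word)
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

tokens, merges = bpe_merges("strawberry", 4)
```

<p>Each pass fuses one pair; after a few merges, multi-character chunks are single symbols, and the original characters are gone from the representation.</p>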
<p>What BPE optimises for is compression efficiency, not character-level
transparency. A token like &ldquo;straw&rdquo; — a single token in many
vocabularies, even where the full word &ldquo;strawberry&rdquo; splits
differently — encodes the sequence s-t-r-a-w as a unit, but that character
sequence is not explicitly represented anywhere inside the model once the
embedding lookup has occurred. The model receives a vector for
&ldquo;straw,&rdquo; not a list of its constituent letters.</p>
<p>The character composition of a token is only accessible to the model
insofar as it was implicitly learned during training — through seeing
&ldquo;straw&rdquo; appear in contexts where its internal structure was relevant.
For most tokens, most of the time, that character structure was not
relevant. The model learned what &ldquo;straw&rdquo; means, not how to spell it
character by character.</p>
<hr>
<h2 id="why-the-error-is-informative">Why the Error Is Informative</h2>
<p>When models get this wrong, they almost always answer &ldquo;two,&rdquo; not &ldquo;one&rdquo; or
&ldquo;four&rdquo; or &ldquo;none.&rdquo; This is not random noise. It is a systematic
error, and systematic errors are diagnostic.</p>
<p>&ldquo;berry&rdquo; contains two r&rsquo;s: b-e-<strong>r</strong>-<strong>r</strong>-y. If you ask most models
&ldquo;how many r&rsquo;s in berry?&rdquo; they get it right. The model has seen that
question, or questions closely enough related, that the right count is
encoded somewhere in the weight structure.</p>
<p>&ldquo;str&rdquo; contains one r: s-t-<strong>r</strong>. But as a token it is a short, common
prefix that appears in hundreds of words — <em>string</em>, <em>strong</em>, <em>stream</em> —
contexts in which its internal letter structure is rarely attended to.
&ldquo;aw&rdquo; contains no r&rsquo;s. When the model answers &ldquo;two,&rdquo; it is almost
certainly counting the r&rsquo;s in &ldquo;berry&rdquo; correctly and failing to notice
the one in &ldquo;str.&rdquo; The token boundaries are where the error lives.</p>
<p>This is not stupidity. It is a precise failure mode that follows directly
from the tokenisation structure. You can predict where the error will
occur by looking at the token split.</p>
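<p>The arithmetic of the failure can be made explicit in a few lines of
Python, assuming the str|aw|berry split discussed above (the per-token counts
are the point, not the code):</p>

```python
# Per-token view of the count, assuming the cl100k_base-style
# split of "strawberry" described in the post.
tokens = ["str", "aw", "berry"]

per_token = {t: t.count("r") for t in tokens}
print(per_token)                       # {'str': 1, 'aw': 0, 'berry': 2}

true_count = sum(per_token.values())
print(true_count)                      # 3

# A model that only "knows" the letter structure of "berry"
# lands on 2 -- it misses the r hiding inside "str".
assert per_token["berry"] == 2
assert true_count == "strawberry".count("r") == 3
```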
<hr>
<h2 id="chain-of-thought-partially-fixes-this-and-why">Chain of Thought Partially Fixes This (and Why)</h2>
<p>If you prompt the model to &ldquo;spell out the letters first, then count,&rdquo; the
error rate drops substantially. The reason is not mysterious: forcing
the model to generate a character-by-character expansion — s, t, r, a,
w, b, e, r, r, y — puts the individual characters into the context window
as separate tokens. Now the model is not working from &ldquo;straw&rdquo; and &ldquo;berry&rdquo;;
it is working from ten single-character tokens, and counting sequential
characters in a flat list is a task the model handles much better.</p>
<p>This is, in effect, making the model do manually what a human does
automatically: convert the compressed token representation back to an
enumerable character sequence before counting. The cognitive work is the
same; the scaffolding just has to be explicit.</p>
<hr>
<h2 id="the-right-frame">The Right Frame</h2>
<p>The &ldquo;how many r&rsquo;s&rdquo; test is sometimes cited as evidence that language models
don&rsquo;t &ldquo;really&rdquo; understand text, or that they are sophisticated autocomplete
engines with no genuine knowledge. These framing choices produce more heat
than light.</p>
<p>The more precise statement is this: language models were trained to predict
likely next tokens in large text corpora. That training objective produces
a system that is very good at certain tasks (semantic inference, translation,
summarisation, code generation) and systematically bad at others (character
counting, exact arithmetic, precise spatial reasoning). The system is not
doing what you are doing when you read a sentence. It is doing something
different, which happens to produce similar outputs for a very wide range
of inputs — and different outputs for a class of inputs where the
character-level structure matters.</p>
<p>&ldquo;Strawberry&rdquo; sits squarely in that class. The model is not failing to
read the word. It is succeeding at predicting what a plausible-sounding
answer looks like, based on a compressed representation that does not
preserve the information needed to get the count right. Those are not the
same thing, and the distinction is worth keeping clear.</p>
<hr>
<p><em>The tokenisation argument here is a simplified version. Real BPE
vocabularies, positional encodings, and the specific way character
information is or isn&rsquo;t preserved in embedding tables are more complicated
than this post suggests. But the core point — that the model&rsquo;s input
representation is not a character array and never was — holds.</em></p>
<p><em>A follow-up post covers a structurally different failure mode:
<a href="/posts/car-wash-grounding/">Should I Drive to the Car Wash?</a> — where
the model understood the question perfectly but lacked access to the
world state the question was about.</em></p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Gage, P. (1994). A new algorithm for data compression. <em>The C Users
Journal</em>, 12(2), 23–38.</p>
</li>
<li>
<p>Sennrich, R., Haddow, B., &amp; Birch, A. (2016). <strong>Neural machine
translation of rare words with subword units.</strong> <em>Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics
(ACL 2016)</em>, 1715–1725. <a href="https://arxiv.org/abs/1508.07909">https://arxiv.org/abs/1508.07909</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the tokenisation of &ldquo;strawberry&rdquo; from two tokens (<code>straw|berry</code>) to three tokens (<code>str|aw|berry</code>), matching the actual cl100k_base tokeniser used by GPT-4. The directional argument (token boundaries obscure character-level information) is unchanged; the specific analysis was updated accordingly.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Why Cats Purr at 25 Hz: Vocal Fold Pads and the Physics of Self-Sustained Oscillation</title>
      <link>https://sebastianspicker.github.io/posts/purring-physics-vocal-fold-pads/</link>
      <pubDate>Mon, 09 Sep 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/purring-physics-vocal-fold-pads/</guid>
      <description>For decades, the mechanism of purring was disputed. A 2023 paper in Current Biology showed that cat larynges purr without any neural input: airflow alone drives a self-sustained oscillation. The secret is connective tissue pads embedded in the vocal folds that increase effective mass and lower the resonant frequency to 25–30 Hz — the same range used clinically for bone- density stimulation and fracture healing under Wolff&amp;rsquo;s law.</description>
      <content:encoded><![CDATA[<p><em>The first thing either of our cats did when I sat still long enough was purr.
Not after food, not during play — the purr arrived when I sat down and held
still and they settled against me, and it arrived as a physical fact, a vibration
felt through the sternum and the ribs, not merely heard. The frequency was low:
around 25–30 cycles per second, which you can feel as a buzz rather than hear
as a tone. This is, I later confirmed, not far from the frequency at which
clinical devices stimulate bone growth. They are indoor cats now, on our vet&rsquo;s
recommendation — they find this unreasonable, but sitting still and being purred
on has become a regular feature of working from home.</em></p>
<p><em>The physics of how the larynx produces that frequency is, as of 2023, finally
resolved — and the mechanism is more elegant than anyone suspected.</em></p>
<hr>
<h2 id="the-frequency-and-its-peculiarity">The Frequency and Its Peculiarity</h2>
<p>Domestic cats purr at approximately $25$–$30\,\mathrm{Hz}$. This is
remarkably low for an animal of cat size. A human vocal fold — roughly
comparable in size — vibrates at $85$–$255\,\mathrm{Hz}$ for normal speech.
A cat&rsquo;s larynx is smaller than a human&rsquo;s, not larger, which makes the low
frequency surprising: in a simple spring-mass oscillator model, smaller and
lighter vocal folds should vibrate <em>faster</em>, not slower.</p>
<p>The frequency range $25$–$50\,\mathrm{Hz}$ has clinical significance in a
different field. Therapeutic vibration platforms used in sports medicine and
osteoporosis treatment operate in exactly this range, exploiting Wolff&rsquo;s law
(bone remodelling under mechanical stress) to increase bone density and
accelerate fracture repair. The coincidence is suggestive. It was first
noted quantitatively by von Muggenthaler (2001, <em>Journal of the Acoustical
Society of America</em> 110, 2666), who recorded purrs from 44 felids and
found that all produced dominant frequencies between $25$ and $150\,\mathrm{Hz}$.</p>
<p>Whether cats deliberately exploit this frequency for self-healing is a separate
biological question. The physics question is simpler: how does the larynx
produce it?</p>
<hr>
<h2 id="flow-induced-vocal-fold-oscillation">Flow-Induced Vocal Fold Oscillation</h2>
<p>Vocal fold oscillation in mammals is a flow-induced, self-sustained mechanical
phenomenon. The Bernoulli effect and elastic restoring forces create a
feedback loop that keeps the folds oscillating as long as subglottal air
pressure is maintained.</p>
<p>The mechanism is as follows. The lungs supply a steady subglottal pressure
$p_\mathrm{sub}$. This drives airflow through the glottis (the gap between the
vocal folds). As the folds are pushed apart by the pressure, the airflow
velocity in the narrowed glottis increases; by Bernoulli&rsquo;s principle,</p>
$$p + \tfrac{1}{2}\rho v^2 = \mathrm{const},$$<p>the pressure drops, drawing the folds back together. The folds&rsquo; elastic
restoring force adds to this: they spring back when displaced. The result is
an oscillation — the folds open and close periodically, chopping the airflow
into pressure pulses that we perceive as sound (or vibration, for low
frequencies).</p>
<p>The fundamental frequency is approximately:</p>
$$f_0 \approx \frac{1}{2L}\sqrt{\frac{T}{\rho_s}},$$<p>where $L$ is the vibrating length of the vocal fold, $T$ is the longitudinal
tension, and $\rho_s$ is the surface density (mass per unit area). This is
the same formula as for a vibrating string — and the physics is closely
related.</p>
<p>For a cat-sized larynx with $L \approx 1\,\mathrm{cm}$, realistic tissue
tension, and tissue density $\rho_s \sim 1\,\mathrm{kg/m}^2$, this formula
gives $f_0$ in the hundreds of hertz — far above the observed purring
frequency of $25$–$30\,\mathrm{Hz}$.</p>
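<p>The order-of-magnitude claim is easy to check. The tensions below are
assumed illustrative values, not measurements; the point is that any
plausible choice lands in the hundreds of hertz:</p>

```python
from math import sqrt

L = 0.01       # vibrating fold length in m (cat-sized larynx)
rho_s = 1.0    # surface density in kg/m^2 (order-of-magnitude tissue value)

def f0(T: float) -> float:
    """String-law estimate f0 = (1 / 2L) * sqrt(T / rho_s)."""
    return sqrt(T / rho_s) / (2 * L)

# Illustrative longitudinal tensions in N/m (assumed, not measured):
for T in (10.0, 50.0, 100.0):
    print(f"T = {T:5.1f} N/m  ->  f0 = {f0(T):5.0f} Hz")
# All three estimates fall in the hundreds of hertz, an order of
# magnitude above the observed 25-30 Hz purr.
```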
<p>Something is missing from the model.</p>
<hr>
<h2 id="the-long-standing-controversy">The Long-Standing Controversy</h2>
<p>Until 2023, the dominant explanation for the low purring frequency was the
<strong>Active Muscular Contraction (AMC) hypothesis</strong>: the laryngeal muscles
contract rhythmically at the purring frequency, mechanically driving the
vocal folds rather than relying on passive aeroelastic oscillation. On this
view, purring is more like drumming than singing — the neural drive at
$25$–$30\,\mathrm{Hz}$ sets the frequency, overriding the natural aeroelastic
frequency.</p>
<p>The AMC hypothesis was difficult to test directly because the larynx is
inaccessible in a live, purring cat without interfering with the purr.
Electromyographic recordings from laryngeal muscles of purring cats showed
rhythmic activity consistent with the AMC hypothesis, but causality was unclear:
were the muscles driving the oscillation, or responding to it?</p>
<p>The alternative hypothesis — that purring is passive, driven purely by
aeroelastic forces — faced the problem noted above: the aeroelastic frequency
of a cat-sized larynx should be far too high to explain $25$–$30\,\mathrm{Hz}$.
Unless something was being added to the vocal folds to lower their effective
resonant frequency.</p>
<hr>
<h2 id="herbst-et-al-2023-the-mass-loading-mechanism">Herbst et al. 2023: The Mass-Loading Mechanism</h2>
<p>In October 2023, Christian Herbst and colleagues at the University of Vienna
published &ldquo;Domestic cat larynges can produce purring frequencies without neural
input&rdquo; (<em>Current Biology</em> 33, 4727–4732). The experiment was decisive.</p>
<p>The team excised larynges from domestic cats (post-mortem, within a short time
window to preserve tissue properties) and mounted them in a flow bench: a
controlled airflow was supplied to the subglottal side, and the larynges were
held at physiologically realistic tension and hydration.</p>
<p><strong>The result</strong>: all eight excised larynges produced self-sustained oscillations
at $25$–$30\,\mathrm{Hz}$ — the normal purring frequency — without any neural
input whatsoever. No muscular contraction was present (no motor neurons, no
calcium signalling, no ATP). The oscillation was purely passive, driven by the
airflow and maintained by the tissue mechanics.</p>
<p>This ruled out the AMC hypothesis. The neural drive is not needed to sustain
the oscillation; it may modulate it, start or stop it, but the fundamental
frequency is set by the tissue mechanics, not the neural firing rate.</p>
<p>The follow-up finding was the key to the physics: histological analysis of the
vocal fold tissue revealed <strong>connective tissue pads</strong> embedded in the vocal
fold mucosa, up to $4\,\mathrm{mm}$ thick. These pads are not present in the
vocal folds of humans or other mammals that do not purr. They increase the
effective mass of the oscillating tissue significantly, without adding
corresponding stiffness.</p>
<hr>
<h2 id="the-mass-loading-physics">The Mass-Loading Physics</h2>
<p>The fundamental frequency of a harmonic oscillator is:</p>
$$f_0 = \frac{1}{2\pi}\sqrt{\frac{k}{m}},$$<p>where $k$ is the effective stiffness and $m$ is the effective mass. Adding mass
(at constant stiffness) lowers the frequency as $f_0 \propto m^{-1/2}$.</p>
<p>For the vocal folds, the spring constant $k$ is set by tissue tension and
elasticity — properties that the tissue pads do not significantly alter. But
the pads add a substantial mass $\Delta m$ to the oscillating system. The
purring frequency becomes:</p>
$$f_\mathrm{purr} = \frac{1}{2\pi}\sqrt{\frac{k}{m_0 + \Delta m}},$$<p>where $m_0$ is the baseline vocal fold mass and $\Delta m$ is the added mass
from the pads.</p>
<p>As a rough estimate: if the unloaded aeroelastic frequency were in the
range $f_\mathrm{normal} \approx 200$–$400\,\mathrm{Hz}$ (the range of
cat meow fundamental frequencies), lowering it to $f_\mathrm{purr} \approx
25\,\mathrm{Hz}$ would require a mass increase by a factor of</p>
$$\frac{m_0 + \Delta m}{m_0} = \left(\frac{f_\mathrm{normal}}{f_\mathrm{purr}}\right)^2
\approx 64\text{–}256.$$<p>This is a large factor, but not implausible for pads up to 4 mm thick
embedded in a mucosal membrane that is itself very thin. The simple
harmonic oscillator model is an idealisation — the actual frequency reduction
also involves changes in vibration mode shape, tissue coupling, and
aerodynamic loading — but the mass-loading effect is the dominant mechanism.
The tissue pads are, in effect, frequency dividers: they convert a
high-frequency aeroelastic oscillator into a low-frequency vibration
generator.</p>
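<p>The factor follows from $f_0 \propto m^{-1/2}$ alone; a two-line check,
using the meow-range frequencies from the estimate above:</p>

```python
def mass_factor(f_normal: float, f_purr: float) -> float:
    """Required mass ratio (m0 + dm) / m0 for a harmonic oscillator
    whose stiffness is unchanged: f ∝ m**-0.5, so the factor is
    (f_normal / f_purr)**2."""
    return (f_normal / f_purr) ** 2

print(mass_factor(200.0, 25.0))   # 64.0
print(mass_factor(400.0, 25.0))   # 256.0
```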
<p>This is the same principle used in engineering to lower the natural frequency
of mechanical structures: add mass without changing stiffness. Tuned mass
dampers in skyscrapers work on the same principle. So do the heavy flywheel
weights added to engines to suppress rotational vibration.</p>
<p>The cat&rsquo;s larynx evolved this solution independently, and with a mass ratio
that would impress a structural engineer.</p>
<hr>
<h2 id="the-self-sustained-oscillation-criterion">The Self-Sustained Oscillation Criterion</h2>
<p>Not every mass-loaded oscillator will self-sustain under airflow. The
Bernoulli-elastic feedback loop must overcome the viscous damping of the
tissue. A dimensional scaling estimate for the critical subglottal pressure is:</p>
$$p^* \sim \eta_\mathrm{tissue} \cdot \frac{v}{L} \sim \eta_\mathrm{tissue} \cdot f_0,$$<p>where $\eta_\mathrm{tissue}$ is the tissue viscosity, $v \sim f_0 L$ is the
characteristic mucosal wave velocity, and $L$ is the fold length. (The full
phonation threshold pressure, as derived by Titze (2006), depends on
additional geometric and aerodynamic parameters.) For typical laryngeal tissue properties and the observed purring
frequency, this critical pressure is of order $100$–$200\,\mathrm{Pa}$ —
low enough to be sustained by the respiratory system without extraordinary
effort.</p>
<p>This is consistent with the observation that cats can purr both during
inhalation and exhalation, maintaining a continuous acoustic output throughout
the breathing cycle. The oscillation threshold is low enough that normal
respiration can maintain it.</p>
<hr>
<h2 id="wolffs-law-and-the-25-hz-coincidence">Wolff&rsquo;s Law and the 25 Hz Coincidence</h2>
<p>Julius Wolff (1892) proposed that bone remodels in response to mechanical
loading: osteoblasts (bone-building cells) are stimulated by cyclic compressive
stress, while osteoclasts (bone-resorbing cells) dominate in the absence of
loading. This principle — now called Wolff&rsquo;s law — underpins the use of
therapeutic vibration in orthopaedics.</p>
<p>The optimal frequency for osteoblast stimulation, determined empirically in
clinical studies, is $20$–$50\,\mathrm{Hz}$. Vibration at these frequencies,
applied at amplitudes of $0.2$–$1.0\,g$ (where $g$ is gravitational
acceleration), produces measurable increases in bone mineral density, accelerates
fracture healing, and reduces bone loss in microgravity. The frequency range
is not a narrow resonance; it reflects the natural frequencies of cellular
mechanotransduction pathways involving focal adhesion kinase (FAK) and
integrin signalling.</p>
<p>Cat purring produces vibration in the frequency range $25$–$50\,\mathrm{Hz}$
at the body surface. Whether this is sufficient to produce meaningful bone
stimulation — and whether cats evolved purring partly as a bone-maintenance
mechanism — is not yet resolved by controlled experiments. The hypothesis is
physiologically plausible: cats conserve metabolic energy by resting for up
to 16 hours per day, and during this rest period, bone would normally be
unstressed and subject to resorption. A continuous low-frequency vibration
during rest could counteract this.</p>
<p>This is speculative at the level of evolutionary causation. What is not
speculative is that the purring frequency overlaps precisely with the
therapeutic vibration range, and that this overlap is not obviously accidental.</p>
<hr>
<h2 id="across-felid-species">Across Felid Species</h2>
<p>Von Muggenthaler&rsquo;s 2001 survey of 44 felids found that most domestic
cats purr in the range $25$–$30\,\mathrm{Hz}$, with harmonics at $50$,
$75\,\mathrm{Hz}$, and so on. Cheetahs purr at $20$–$25\,\mathrm{Hz}$;
pumas (mountain lions) at $20$–$30\,\mathrm{Hz}$; servals and ocelots at
$22$–$28\,\mathrm{Hz}$.</p>
<p>The large roaring cats — lions, tigers, leopards, jaguars — do not purr in
the continuous sense that domestic cats do. Their enlarged hyoid apparatus
allows roaring by a different mechanism (a modified laryngeal pad that
allows very low-frequency, high-intensity sound production). Some large cats
produce purr-like sounds during exhalation, but not the continuous purring
through both inhalation and exhalation characteristic of smaller felids.</p>
<p>The vocal fold pad mechanism appears to be specific to the non-roaring felids,
though detailed histological comparisons across species are still sparse.</p>
<hr>
<h2 id="what-i-hear">What I Hear</h2>
<p>When one of our cats purrs while settled against me, what I am feeling is the
mechanical resonance of a mass-loaded aeroelastic oscillator at approximately
$25\,\mathrm{Hz}$, the frequency having been lowered by connective tissue pads
from a natural aeroelastic frequency several hundred hertz higher. The pads
evolved, we think, to produce exactly this frequency — sustained under normal
respiratory airflow pressure with no additional muscular energy. The acoustic
output is a byproduct of the vibration.</p>
<p>Whether the vibration serves a direct physiological function in the cat&rsquo;s own
bones is, as of this writing, still an open question. What seems clear is that
the 2023 paper settled the mechanism question conclusively: the frequency is
set by mass loading, not neural drive. The larynx purrs by itself when you
blow air through it.</p>
<p>I find this reassuring. The physics is in the cat, not in its nervous system.
The cat purrs the way a tuning fork rings — not because it decides to, but
because that is what it does when the conditions are right.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Herbst, C.T., Prigge, T., Garcia, M., Hampala, V., Hofer, R., Weissengruber,
G.E., Svec, J.G., &amp; Fitch, W.T. (2023). Domestic cat larynges can produce
purring frequencies without neural input. <em>Current Biology</em>, 33(22),
4727–4732.e4. <a href="https://doi.org/10.1016/j.cub.2023.09.014">https://doi.org/10.1016/j.cub.2023.09.014</a></p>
</li>
<li>
<p>von Muggenthaler, E. (2001). The felid purr: A healing mechanism?
<em>Journal of the Acoustical Society of America</em>, 110(5), 2666.
<a href="https://doi.org/10.1121/1.4777098">https://doi.org/10.1121/1.4777098</a></p>
</li>
<li>
<p>Titze, I.R. (2006). <em>The Myoelastic Aerodynamic Theory of Phonation.</em>
National Center for Voice and Speech.</p>
</li>
<li>
<p>Wolff, J. (1892). <em>Das Gesetz der Transformation der Knochen.</em> A. Hirschwald.
(English translation: Maquet, P., &amp; Furlong, R., 1986. <em>The Law of Bone
Remodelling.</em> Springer.)</p>
</li>
<li>
<p>Rubin, C.T., &amp; Lanyon, L.E. (1984). Regulation of bone formation by applied
dynamic loads. <em>Journal of Bone and Joint Surgery</em>, 66(3), 397–402.
<a href="https://doi.org/10.2106/00004623-198466030-00012">https://doi.org/10.2106/00004623-198466030-00012</a></p>
</li>
<li>
<p>Christiansen, P. (2008). Evolution of skull and mandible shape in cats
(Carnivora: Felidae). <em>PLOS ONE</em>, 3(7), e2807.
<a href="https://doi.org/10.1371/journal.pone.0002807">https://doi.org/10.1371/journal.pone.0002807</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Invisible Entrance Fee: On Privilege, Education, and the Institutions That Reproduce Both</title>
      <link>https://sebastianspicker.github.io/posts/privilege-and-education/</link>
      <pubDate>Tue, 20 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/privilege-and-education/</guid>
      <description>Education is supposed to be the great equaliser. The evidence says otherwise. Bourdieu called it decades ago: schools reproduce the social order they pretend to transcend. Privilege is the entrance fee that nobody admits is being charged.</description>
      <content:encoded><![CDATA[<p>There is a persistent story that education systems tell about themselves: that they are meritocratic. That talent, effort, and intelligence are what determine outcomes. That the playing field, if not perfectly level, is at least aspiring toward levelness. That a good enough student from any background can succeed.</p>
<p>This story is not supported by the evidence.</p>
<p>The relationship between socioeconomic background and educational outcomes is one of the most replicated findings in social science. PISA data from Germany consistently show one of the steepest socioeconomic gradients in the OECD — the correlation between parental education and student performance is higher here than in most comparable countries. This is not a recent finding. It has been stable for decades. The system produces it reliably, which means the system is, in some meaningful sense, designed to produce it — even if no individual actor intended that design.</p>
<p>Understanding why requires a different vocabulary than the one most educational institutions use about themselves.</p>
<h2 id="bourdieus-three-capitals">Bourdieu&rsquo;s Three Capitals</h2>
<p>Pierre Bourdieu spent much of his career developing an account of how social inequality reproduces itself through culture and education. The core concept is capital — but not only the economic kind.</p>
<p>Bourdieu (1986) distinguishes three forms:</p>
<p><strong>Economic capital</strong> is material resources: money, assets, time purchased through money. This is the most visible form of advantage. Wealthier families can pay for tutoring, for better-resourced schools, for the unpaid internships that build CVs, for the years of postgraduate study that increasingly function as the entrance requirement for professional careers.</p>
<p><strong>Cultural capital</strong> is more subtle. It includes dispositions, skills, and knowledge that are valued by educational institutions and professional fields — but valued in a way that tends to favour those who acquired them at home, in childhood, before formal education began. The ease with which a student navigates a seminar. The familiarity with the tacit conventions of academic writing. The sense that the university is, broadly, a place made for people like you. These are not things that are explicitly taught; they are things that are transmitted, Bourdieu argues, through families whose own cultural capital aligns with what the institution expects.</p>
<p><strong>Social capital</strong> is networks: the web of relationships that provide information, referrals, opportunities, and vouching. Who you know, in the flattest possible terms.</p>
<p>All three reinforce each other. Economic capital can be converted into cultural capital through education and into social capital through exclusive networks. Cultural capital eases access to prestigious institutions, which build social capital. The system is not static, but it has a strong gravitational pull toward reproduction.</p>
<p>Bourdieu and Passeron (1977) developed this into a theory of education as <em>reproduction</em>: the function of educational institutions is not primarily to transmit knowledge but to legitimate the transmission of social position from one generation to the next. The process is misrecognised — by students, teachers, and institutions — as meritocracy. This misrecognition is essential to the function. If it were transparent, it would lose its legitimising power.</p>
<h2 id="the-hidden-curriculum">The Hidden Curriculum</h2>
<p>Philip Jackson (1968) coined the term <em>hidden curriculum</em> for everything that schools teach that is not in the official syllabus. How to sit still. How to wait your turn. How to speak to authority. How to navigate institutions, read implicit expectations, manage bureaucracies. How to understand that your job is to demonstrate competence within a form that someone else has set.</p>
<p>For students whose home culture aligns with the institutional culture, the hidden curriculum is invisible. They already know it; it requires no effort; it is simply how things are. For students whose home culture diverges, it is a second curriculum that must be decoded while simultaneously managing the official one.</p>
<p>Lareau (2003) documented this in careful ethnographic detail. Middle-class families engage in what she calls <em>concerted cultivation</em> — a mode of child-rearing that practises precisely the dispositions valued by educational institutions: articulate self-advocacy with adults, a sense of entitlement to ask questions and seek explanations, activities structured around developing discrete skills. Working-class and poor families, in her study, more often practised <em>accomplishment of natural growth</em> — providing security, affection, and freedom without the institutional structuring. Neither is better parenting. But one of them is what the school expects.</p>
<p>The child who arrives at school already knowing how to talk to teachers, how to present themselves, how to advocate for their own needs, has a significant advantage that is invisible in the transcript. It does not appear as &ldquo;privilege&rdquo;; it appears as &ldquo;ability&rdquo; or &ldquo;maturity&rdquo;. The institutional category does the misrecognising work.</p>
<h2 id="privilege-as-invisible-to-those-who-have-it">Privilege as Invisible to Those Who Have It</h2>
<p>Peggy McIntosh (1989) wrote what became one of the most cited — and most contested — essays in education: &ldquo;White Privilege: Unpacking the Invisible Knapsack&rdquo;. Her core observation is structural: privilege is the absence of disadvantage, and absences are invisible to those who live inside them. You do not notice the ease with which you move through a system that was designed for people like you, any more than you notice breathing.</p>
<p>This is not an accusation. It is a description of a structural feature with consequences for self-understanding.</p>
<p>My background is in physics; I now work in universities, having grown up in a household with books and educated parents and the background assumption that higher education was something that people like us did. I was not aware of most of this as an advantage while it was happening, because it did not feel like an advantage — it felt like normal. The awareness came later, with effort, and it remains incomplete.</p>
<p>The invisible entrance fee is what you have already paid, in cultural capital, before you walk through the door. The institution does not ask about it explicitly. It simply rewards those who have it and attributes the reward to merit.</p>
<h2 id="what-this-means-for-accessibility">What This Means for Accessibility</h2>
<p>The previous post in this series argued that full accessibility — <em>Barrierefreiheit</em> — is structurally impossible in a society organised as ours is; that the honest goal is <em>Barrierearmut</em>, the ongoing reduction of barriers. The connection to privilege is direct.</p>
<p>Barriers to education are not only physical. They include the cultural distance between the home environment and the institutional culture. They include not knowing that office hours exist and are meant for you, not just for students with problems. They include the inability to identify as &ldquo;the kind of person who does a PhD&rdquo; because you have never met anyone who did one. They include the exhaustion of navigating a system that requires you to translate yourself at every step, while your better-resourced peers spend that cognitive energy on the actual work.</p>
<p>None of these barriers appear on an accessibility audit. They are not visible from inside the institution looking out. They require actively listening to people whose experience differs from the institutional default, and then being willing to revise the default rather than add an exception.</p>
<p>The PISA gradient in Germany is a measurement of accumulated, unreduced barriers. It is not a measurement of the distribution of talent or effort. The system is producing the outcome; the students are receiving the label.</p>
<h2 id="the-meritocracy-problem">The Meritocracy Problem</h2>
<p>Meritocracy is an appealing concept and a damaging ideology when taken seriously. The appeal: rewards should go to those who earn them, and earning should depend on effort and ability rather than inherited position. This is genuinely better than aristocracy.</p>
<p>The problem: in a society with steep inequality in the distribution of cultural, economic, and social capital, &ldquo;merit&rdquo; is not a neutral measurement. It is a measurement of the match between a person&rsquo;s accumulated resources and the demands of the institution. Calling that match &ldquo;merit&rdquo; names the outcome without naming the process that produced it.</p>
<p>Michael Young, who invented the word &ldquo;meritocracy&rdquo; in 1958, intended it as a satire. His book <em>The Rise of the Meritocracy</em> depicted a dystopia in which the illusion of fairness made inequality more entrenched, not less, because it stripped the legitimacy from those who were left behind. If outcomes are fair, then failure is your fault. The ideology provides the institutional absolution; the individuals bear the moral weight of structural disadvantage.</p>
<p>This is precisely the dynamic that Bourdieu&rsquo;s theory of misrecognition describes. The student from a poorly resourced background who does not reach the outcomes of their better-resourced peer is seen — by themselves, by teachers, by the institution — as less talented or less motivated, rather than as navigating a steeper gradient with fewer tools.</p>
<h2 id="what-institutions-can-actually-do">What Institutions Can Actually Do</h2>
<p>The structural critique is not an argument for fatalism. Institutions can do things that matter.</p>
<p>They can make the hidden curriculum visible — explicitly teaching what is usually assumed. That means orientation programmes that actually explain institutional culture, not just procedures. It means academic writing support that is not remedial but normative. It means mentoring that connects first-generation students with people who understand the landscape.</p>
<p>They can audit their practices for whose default they assume. The timed closed-book exam was designed for a particular set of conditions; asking what it actually measures, and whether there are better instruments, is not lowering standards — it is interrogating what the standard is measuring.</p>
<p>They can diversify their faculty and staff, not as a cosmetic gesture but as a structural change in whose tacit knowledge is embedded in the institution. If the people who design the curriculum all navigated it from the same starting position, the curriculum will encode that starting position as normal.</p>
<p>They can name the entrance fee. Acknowledging that outcomes correlate with background, that this is a systemic feature and not a distribution of merit, is the first step toward taking institutional responsibility for the gradient rather than attributing it to the students.</p>
<p>None of this resolves the structural problem. The structural problem requires political change at scales well beyond any individual institution. But institutions are not passive. They can reduce the barriers they control, while being honest about the ones they do not.</p>
<h2 id="a-personal-note">A Personal Note</h2>
<p>I sit in institutional positions that this analysis would identify as advantaged. I teach in a university. I benefited from the gradient in ways I cannot fully account for. The point of naming this is not guilt; it is responsibility. Being advantaged by a system you did not design does not make you complicit in its worst outcomes — but it does make you responsible for using whatever institutional leverage you have to make the system less exclusive.</p>
<p>The connection to accessibility is this: both inaccessibility and privilege are about whose defaults are built into the system and who is required to adapt to defaults they did not set. Reducing barriers and interrogating privilege are the same project, approached from different angles.</p>
<p>Neither is completable. Both are necessary.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>Bourdieu, P. (1986). The forms of capital. In J.G. Richardson (Ed.), <em>Handbook of Theory and Research for the Sociology of Education</em> (pp. 241–258). Greenwood Press.</li>
<li>Bourdieu, P. &amp; Passeron, J.C. (1977). <em>Reproduction in Education, Society and Culture</em>. Sage. (Original French edition 1970.)</li>
<li>Jackson, P.W. (1968). <em>Life in Classrooms</em>. Holt, Rinehart and Winston.</li>
<li>Lareau, A. (2003). <em>Unequal Childhoods: Class, Race, and Family Life</em>. University of California Press.</li>
<li>McIntosh, P. (1989). White privilege: Unpacking the invisible knapsack. <em>Peace and Freedom</em>, July/August, 10–12.</li>
<li>OECD (2023). <em>PISA 2022 Results (Volume I): The State of Learning and Equity in Education</em>. OECD Publishing.</li>
<li>Young, M. (1958). <em>The Rise of the Meritocracy</em>. Thames and Hudson.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Why 44,100? The Accidental Physics of the CD Sampling Rate</title>
      <link>https://sebastianspicker.github.io/posts/why-44100-hz-cd-sampling-rate/</link>
      <pubDate>Mon, 05 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/why-44100-hz-cd-sampling-rate/</guid>
      <description>The CD sampling rate is not a round number chosen by committee. It is the direct output of 1970s NTSC and PAL video engineering — and both standards, designed on different continents, converge on exactly the same number.</description>
      <content:encoded><![CDATA[<p><em>44,100 Hz. Not 44,000. Not 48,000. Not even 40,000 or 50,000, which would at least have the virtue of roundness. The number that defines CD-quality audio is specific in a way that invites a question most people never think to ask: why that number?</em></p>
<hr>
<h2 id="the-puzzle">The Puzzle</h2>
<p>When a physical constant turns out to be $1.6 \times 10^{-19}$ coulombs, that is just nature being nature — no further explanation is needed or available. But when an engineering standard settles on 44,100 Hz rather than, say, 44,000 Hz or 45,000 Hz, there is a story hiding in the specificity.</p>
<p>The standard answer — the one you find on Wikipedia and in most popular accounts — is that 44.1 kHz satisfies the Nyquist criterion for 20 kHz audio, and so it was chosen to preserve the full range of human hearing. This is true. It is also almost completely uninformative. The Nyquist criterion for 20 kHz audio requires only that the sampling rate exceed 40 kHz. That constraint is satisfied by 40,001 Hz as much as by 44,100 Hz. The specific value requires a different explanation entirely.</p>
<p>That explanation involves a Sony engineer, a consumer videocassette recorder, and the accidental convergence of two television standards developed independently on different continents. The number 44,100 is not an optimisation. It is an archaeological deposit. And like most archaeological deposits, it is still with us long after the civilisation that created it has disappeared.</p>
<p>I want to work through the physics first, because the Nyquist theorem is genuinely beautiful and is often presented in a way that obscures what it actually says. Then I want to show you the arithmetic that makes 44,100 inevitable given 1970s constraints — and the way NTSC and PAL, designed for completely different reasons, conspire to produce the same number. If you enjoy &ldquo;hidden mathematics in music,&rdquo; you might also find it in <a href="/posts/euclidean-rhythms/">Euclidean Rhythms</a>, where a 2,300-year-old algorithm turns out to encode the structure of West African and Cuban percussion.</p>
<hr>
<h2 id="the-nyquistshannon-sampling-theorem">The Nyquist–Shannon Sampling Theorem</h2>
<p>Before the archaeology, the physics.</p>
<p>In 1928, Harry Nyquist published a paper on telegraph transmission theory that contained, somewhat incidentally, the germ of what would become one of the most consequential theorems in applied mathematics <a href="#ref-4">[4]</a>. Claude Shannon formalised and generalised it in 1949 <a href="#ref-5">[5]</a>. The theorem states: a continuous bandlimited signal whose highest frequency component is $f_{\max}$ can be perfectly reconstructed from discrete samples taken at rate $f_s$ if and only if</p>
$$f_s > 2 f_{\max}.$$<p>The quantity $f_s / 2$ is called the Nyquist frequency, and $2 f_{\max}$ the Nyquist rate. Sampling below the Nyquist rate causes <em>aliasing</em>: components above $f_s/2$ fold back into the spectrum and appear as spurious low-frequency artefacts that are indistinguishable from genuine signal. Once you have aliased a signal, the damage is permanent. Sampling above the Nyquist rate, the theorem says, causes no information loss at all — the original continuous waveform can be recovered exactly, in principle, from the discrete sample sequence.</p>
<p>Human hearing extends from roughly 20 Hz to 20 kHz (and, for most adults over thirty, substantially less at the top end, but 20 kHz is the canonical engineering requirement). Setting $f_{\max} = 20$ kHz, the Nyquist criterion requires $f_s > 40$ kHz.</p>
<p>But here is the subtlety that the Wikipedia summary tends to skip. The theorem assumes that the signal is <em>perfectly</em> bandlimited before sampling — meaning that all energy above $f_{\max}$ has been removed. This requires an <em>anti-aliasing filter</em>: a low-pass filter applied to the analogue signal before the analogue-to-digital converter samples it. If your anti-aliasing filter passes everything up to 20 kHz and blocks everything above it with perfect sharpness, then 40,001 Hz would suffice. The problem is that such a filter is physically unrealisable.</p>
<p>Real filters do not have vertical cutoffs. They have a <em>transition band</em>: a frequency range over which attenuation increases gradually from zero to full suppression. The steeper you want the transition, the higher the filter order, and for practical filter hardware in 1979 — op-amps, capacitors, inductors, no DSP to speak of — a &ldquo;steep enough&rdquo; filter meant a transition band of roughly 10% of the passband edge frequency. For a 20 kHz passband edge, that is about 2 kHz of transition band.</p>
<p>So the actual engineering requirement is not just $f_s > 40$ kHz. It is $f_s > 40$ kHz <em>plus enough headroom for a realisable anti-aliasing filter</em>. With $f_s = 44.1$ kHz, the Nyquist limit sits at $f_s/2 = 22.05$ kHz. The gap between the top of the audio band and the Nyquist limit is</p>
$$22{,}050 - 20{,}000 = 2{,}050 \text{ Hz},$$<p>which is just over 10% of 20 kHz. This is enough to build a practical anti-aliasing filter with 1970s and early 1980s analogue components. Had the sampling rate been 41 kHz, the gap would have been only 500 Hz — far too narrow for affordable hardware. Had it been 50 kHz, the gap would have been more comfortable, but you would be storing 13.4% more data per second for no audible benefit.</p>
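<p>The headroom arithmetic can be checked in a few lines. A sketch, taking the article&rsquo;s 20 kHz passband and its roughly-10%-of-passband rule of thumb for a realisable 1970s analogue filter (the candidate rates are illustrative):</p>

```python
# Nyquist headroom for candidate sampling rates, assuming a 20 kHz
# audio passband and the ~10% transition-band rule of thumb for
# late-1970s analogue anti-aliasing filters described in the text.
PASSBAND_HZ = 20_000
MIN_HEADROOM_HZ = 0.10 * PASSBAND_HZ  # roughly 2 kHz of transition band

for fs in (40_001, 41_000, 44_100, 48_000, 50_000):
    nyquist = fs / 2                  # highest representable frequency
    headroom = nyquist - PASSBAND_HZ  # room left for the filter roll-off
    verdict = "feasible" if headroom >= MIN_HEADROOM_HZ else "too narrow"
    print(f"{fs:>6} Hz: Nyquist {nyquist:>8.1f} Hz, "
          f"headroom {headroom:>7.1f} Hz ({verdict})")
```

<p>Only 44.1 kHz and above clear the 2 kHz bar; 40,001 Hz and 41 kHz fail it, exactly as the paragraph above argues.</p>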
<p>So 44.1 kHz is in the right <em>neighbourhood</em> given real-world filter constraints. But it is still a specific number. The question of why 44,100 rather than 44,000 or 43,500 or 44,800 is still open. That is where the VCRs come in.</p>
<hr>
<h2 id="the-vcr-problem">The VCR Problem</h2>
<p>In the late 1970s, Sony was developing what would eventually become the Compact Disc. One of the fundamental engineering problems was storage: where do you put the digital audio data? A 74-minute stereo recording at 16 bits and 44.1 kHz generates roughly 780 megabytes. In 1979, that was an absurd quantity of data. Hard drives with that capacity existed but cost tens of thousands of dollars and weighed as much as a washing machine. Dedicated digital tape formats existed in professional studios but were exotic and expensive <a href="#ref-1">[1]</a>.</p>
<p>The only affordable high-bandwidth magnetic recording medium available to consumer-facing engineers in 1979 was the VCR — the videocassette recorder. VHS and Betamax had recently become consumer products, and the tape and drive mechanism was cheap, reliable, and capable of storing several hours of high-bandwidth video signal. That video signal bandwidth was substantial: enough, in principle, to carry digital audio if you could get it onto the tape in the right form.</p>
<p>Sony&rsquo;s solution was elegant to the point of audacity. Rather than inventing a new tape format, they encoded digital audio samples as a black-and-white pseudo-video signal — patterns of light and dark pixels that a standard VCR recorded without modification, because as far as the VCR was concerned it was just receiving a monochrome video feed. The resulting device, the Sony PCM-1600 (1979), was a standalone unit that sat between a microphone preamplifier and a VCR, converting audio to fake video for recording and back to audio for playback <a href="#ref-3">[3]</a>.</p>
<p>The sampling rate of the audio was now determined not by any audio engineering consideration but by the geometry of the video signal. And the geometry of the video signal was fixed by the television broadcast standard — which brought entirely different historical contingencies into the calculation.</p>
<hr>
<h2 id="the-ntsc-arithmetic">The NTSC Arithmetic</h2>
<p>The NTSC standard — developed in North America and Japan — specifies 30 frames per second and 525 total scan lines per frame. Of those 525 lines, 35 are consumed by the vertical blanking interval (the time needed for the electron beam in a CRT to return from the bottom of the screen to the top). That leaves 490 active lines per frame actually carrying picture information.</p>
<p>Sony packed 3 audio samples into each active scan line. The audio sampling rate is then:</p>
$$f_s = \underbrace{30}_{\text{frames/s}} \times \underbrace{490}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$<p>There it is. 44,100 Hz, emerging not from any consideration of human hearing or filter design, but from the frame rate and line count of the North American television standard.</p>
<hr>
<h2 id="the-pal-arithmetic">The PAL Arithmetic</h2>
<p>Now the European video standard, PAL, which was developed in the 1960s independently of NTSC and optimised for different priorities. PAL uses 25 frames per second and 625 total scan lines per frame. The vertical blanking interval consumes 37 lines, leaving 588 active lines per frame.</p>
<p>Sony packed 3 audio samples into each active PAL scan line as well. The sampling rate:</p>
$$f_s = \underbrace{25}_{\text{frames/s}} \times \underbrace{588}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$<p>The same number.</p>
<p>Let that settle for a moment. NTSC: 30 frames per second, 490 active lines. PAL: 25 frames per second, 588 active lines. Different frame rates. Different line counts. Developed on different continents for different broadcast environments. And yet $30 \times 490 = 25 \times 588 = 14{,}700$, so multiplying by 3 gives 44,100 in both cases.</p>
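<p>The two derivations can be written as a single function of the video geometry; a sketch, using exactly the frame rates, line counts, and blanking figures given above:</p>

```python
# The two video-standard derivations of 44,100 Hz, as given in the text.
def pcm_rate(frames_per_s: int, total_lines: int,
             blanking_lines: int, samples_per_line: int) -> int:
    """Audio sampling rate from packing samples into active video lines."""
    active_lines = total_lines - blanking_lines
    return frames_per_s * active_lines * samples_per_line

ntsc = pcm_rate(30, 525, 35, 3)  # North America / Japan: 30 x 490 x 3
pal  = pcm_rate(25, 625, 37, 3)  # Europe:               25 x 588 x 3
print(ntsc, pal)  # 44100 44100
```

<p>Two different geometries, one output. The convergence is what let a single 44.1 kHz PCM processor serve both tape formats.</p>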
<p>This is not coincidence in any deep sense — NTSC and PAL were both designed to fill approximately the same video bandwidth, just with different tradeoffs between temporal resolution (frame rate) and spatial resolution (line count). But for Sony&rsquo;s VCR encoding scheme, the numerical convergence was enormously convenient: a single PCM processor running at 44.1 kHz could record to either NTSC or PAL video equipment without any change to the audio electronics. The same master machine could work in Tokyo and in Frankfurt.</p>
<p>The arithmetic is, I think, one of those moments where a coincidence that is perfectly explicable in hindsight still feels satisfying in the way that a physical derivation feels satisfying. You set up the constraints — fill the video bandwidth, pack an integer number of samples per line, keep the number of samples small enough to fit in a line&rsquo;s worth of data — and the number 44,100 falls out of two independent calculations like a constant of nature. It is not a constant of nature. It is a contingent product of mid-twentieth-century broadcast engineering. But the mathematics does not care.</p>
<hr>
<h2 id="from-tape-to-disc">From Tape to Disc</h2>
<p>When Philips and Sony sat down to negotiate the Red Book standard — the technical specification for the Compact Disc, finalised in 1980 and commercially launched in 1982 — both companies brought existing infrastructure to the table <a href="#ref-3">[3]</a>. Both had been building digital audio equipment for several years. Both had PCM processors running in professional studios. Both had catalogues of digital masters recorded on VCR tape. And all of that equipment ran at 44.1 kHz, because all of it had been built to interface with the video tape standard that made digital audio recording practically affordable in the first place.</p>
<p>Changing the sampling rate for the CD would have required rebuilding the entire mastering chain: new PCM processors, new format conversion hardware, new master tape libraries. The economic and logistical cost would have been enormous. The 44.1 kHz rate was not chosen for the CD because it was optimal in any absolute engineering sense. It was chosen because it was already there <a href="#ref-1">[1]</a>, <a href="#ref-2">[2]</a>.</p>
<p>This is a pattern worth recognising. Major technical standards are rarely chosen by optimisation from first principles. They are chosen by consolidating what already exists. The QWERTY keyboard layout was optimised for typewriter mechanisms that no longer exist. The 60 Hz AC frequency in North America was set by Westinghouse generators installed in the 1890s. The 44.1 kHz CD sampling rate was set by VCR tape recorders that were obsolete within a decade of the CD&rsquo;s launch.</p>
<hr>
<h2 id="the-other-rates">The Other Rates</h2>
<p>Not all digital audio runs at 44.1 kHz, and the coexistence of different rates in the modern audio industry is the direct legacy of 44.1 kHz&rsquo;s awkward origins.</p>
<p><strong>48 kHz</strong> is the professional broadcast and studio standard. It is used in digital video, in DAT tape, in most professional audio interfaces, and in the digital audio embedded in broadcast television signals — including, as a matter of course, in the digital television infrastructure described in the context of university video platforms like <a href="/posts/educast-nrw-hochschul-youtube/">educast.nrw</a>. Why 48? Broadcast infrastructure needed a rate that had clean integer relationships with the 32 kHz rate used in early satellite and ISDN broadcast systems. The relationship $48 = \frac{3}{2} \times 32$ is exact, making synchronisation straightforward. 44.1 kHz has no such clean relationship with anything in broadcast engineering.</p>
<p>The ratio between the two dominant rates is $48 / 44.1 = 160 / 147$. This fraction — irreducible, inelegant, non-obvious — is the source of essentially every sample-rate conversion problem in audio post-production. When a CD master (44.1 kHz) is prepared for broadcast (48 kHz), a sample-rate converter must interpolate 147 samples up to 160 samples, or downsample 160 samples to 147, at every moment. The process introduces small errors, and doing it well requires significant computational effort. Every time a musician&rsquo;s recording moves between the consumer and professional audio worlds, it passes through this fractional bottleneck. Two standards that could have been made compatible were instead set by completely independent historical processes, and we have been paying the computational tax ever since.</p>
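<p>The irreducibility of the ratio is easy to verify; a quick check with the standard library:</p>

```python
from fractions import Fraction

# Resampling ratio between the professional (48 kHz) and consumer
# (44.1 kHz) rates, reduced to lowest terms as in the text.
ratio = Fraction(48_000, 44_100)
print(ratio)  # 160/147

# A rational sample-rate converter must therefore upsample by one of
# these integers and downsample by the other; no smaller pair exists.
```
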
<p><strong>96 kHz and 192 kHz</strong> are marketed as &ldquo;high-resolution audio.&rdquo; Here the physics gets genuinely murky and the claims made by the audio industry deserve some scepticism. Human hearing above 20 kHz is, for most adults, genuinely absent — not reduced, but absent, because the outer hair cells in the cochlea that respond to those frequencies progressively die from the teenage years onward and are not replaced. The argument for high sampling rates is typically one of two things: first, that ultrasonic content can cause <em>intermodulation distortion</em>, where sum and difference frequencies of ultrasonic components fall back into the audible band; second, that a higher sampling rate allows for a more relaxed anti-aliasing filter with better phase behaviour within the audible band.</p>
<p>Both effects are real and measurable in laboratory conditions. Whether they are <em>audible</em> under controlled double-blind listening conditions is a separate and more contested question. The published evidence is not strong. What is not contested is that 96 kHz files are twice the size of 44.1 kHz files, and 192 kHz files are more than four times the size, for the same bit depth and the same number of audio channels. Whether that storage cost buys anything audible is, as of the current state of the literature, an open question.</p>
<hr>
<h2 id="the-irony">The Irony</h2>
<p>Here is the situation we are actually in. The canonical digital audio format — 16-bit, 44.1 kHz PCM, the format that defined CD quality for a generation and that remains the standard for music distribution — is physically a photograph of analogue video tape. The digitisation of music was made possible by television engineering. The specific number that defines the fidelity of every CD ever pressed is determined by the frame rates and line counts of 1970s broadcast television standards, which were themselves determined by the capabilities of 1940s CRT technology and the political negotiations of early broadcast licensing bodies.</p>
<p>When someone tells you that 44.1 kHz is the &ldquo;natural&rdquo; or &ldquo;perfect&rdquo; sampling rate for audio, they are, without knowing it, paying tribute to the NTSC standards committee of 1941 and the PAL engineers of the 1960s. The number carries history in it the way a fossil carries the structure of a long-dead organism. It is the right number, in the sense that it works. Its rightness has nothing to do with the reasons it was chosen.</p>
<p>I find this genuinely satisfying rather than disappointing. The history of physics and engineering is full of contingent numbers that turned out to be good enough, and whose goodness was only rationalised after the fact. The metre was originally defined as one ten-millionth of the distance from the equator to the North Pole along the Paris meridian — an arbitrary geodetic choice that turned out to produce a unit of length that is remarkably convenient for human-scale physics. The kilogram was a cylinder of platinum-iridium alloy in a vault outside Paris for over a century. 44,100 Hz is in good company.</p>
<hr>
<h2 id="the-archaeology-of-a-number">The Archaeology of a Number</h2>
<p>The numbers we inherit from engineering history are rarely arbitrary at every level simultaneously. 44,100 Hz is not arbitrary at the level of sampling theory: it satisfies the Nyquist criterion with enough headroom for a physically realisable anti-aliasing filter, given 1970s component technology. That is a genuine constraint, and the number sits in the right region of parameter space for it.</p>
<p>But it is arbitrary at a deeper level: it is the specific number that happened to fit a video tape format that happened to be affordable in 1979, a format that was itself determined by broadcast standards that were set for entirely unrelated reasons decades earlier. The chain of contingencies runs: 1940s television engineering defines NTSC and PAL frame rates and line counts; 1970s consumer VCR technology makes those tape formats cheap; 1979 Sony engineers encode digital audio as fake video; the arithmetic of the video formats fixes the sampling rate at 44,100 Hz; that rate gets locked into the CD standard in 1980; 44.1 kHz becomes the defining frequency of a digital music format that ships billions of units over the following four decades.</p>
<p>Science and engineering produce exact numbers from messy contingencies. The number 44,100 is simultaneously a theorem output (it satisfies a well-defined engineering constraint), a historical accident (it is determined by the specific video tape hardware that existed in 1979), and an institutional fossil (it outlasted the VCRs that created it by four decades and counting). All three things are true at the same time.</p>
<p>The VCRs are gone. The sampling rate remains.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Pohlmann, K. C. (2010). <em>Principles of Digital Audio</em> (6th ed.). McGraw-Hill.</p>
<p><span id="ref-2"></span>[2] Watkinson, J. (2001). <em>The Art of Digital Audio</em> (3rd ed.). Focal Press.</p>
<p><span id="ref-3"></span>[3] Immink, K. A. S. (1998). The compact disc story. <em>Journal of the AES</em>, 46(5), 458–465.</p>
<p><span id="ref-4"></span>[4] Nyquist, H. (1928). Certain topics in telegraph transmission theory. <em>Transactions of the AIEE</em>, 47(2), 617–644.</p>
<p><span id="ref-5"></span>[5] Shannon, C. E. (1949). Communication in the presence of noise. <em>Proceedings of the IRE</em>, 37(1), 10–21.</p>
]]></content:encoded>
    </item>
    <item>
      <title>How Cats Drink: Inertia, Gravity, and the Froude Number at the Tip of a Tongue</title>
      <link>https://sebastianspicker.github.io/posts/how-cats-drink-froude-number/</link>
      <pubDate>Mon, 22 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/how-cats-drink-froude-number/</guid>
      <description>Cats do not scoop water with their tongues — they exploit a delicate balance between inertia and gravity at the air-water interface. The tip of the tongue just touches the surface; rapid withdrawal pulls a fluid column upward; the jaw closes at exactly the moment the column peaks. Reis, Jung, Aristoff, and Stocker (Science, 2010) showed that the lapping frequency of all felids — from domestic cats to lions — is tuned so that the Froude number at the tongue tip is approximately unity.</description>
      <content:encoded><![CDATA[<p><em>I have spent a non-trivial amount of time watching our cats drink — they are
indoor-only cats, on our vet&rsquo;s advice, which gives them few distractions and
gives me ample opportunity to observe. This is not entirely voluntary. Once you have noticed that something is happening at the
water bowl that does not look right — the tongue moves too fast, the water
column is pulled upward rather than scooped, the jaw closes before the tongue
returns — you find yourself crouching beside the bowl with your phone propped
against a chair, filming at 240 frames per second and feeling that you have
perhaps chosen an unusual way to spend a Tuesday morning.</em></p>
<p><em>Pedro Reis, Sunghwan Jung, Jeffrey Aristoff, and Roman Stocker had the same
impulse, with better equipment. Their 2010 paper in Science, &ldquo;How Cats Lap:
Water Uptake by Felis catus,&rdquo; is one of the more elegant pieces of dimensional
analysis in recent biology.</em></p>
<hr>
<h2 id="how-cats-do-not-drink">How Cats Do Not Drink</h2>
<p>The simplest hypothesis — that cats curl the tongue into a spoon and scoop
water into the mouth — is false. High-speed photography shows that the cat&rsquo;s
tongue does not form a cup shape. Instead, the cat extends the tongue tip
downward toward the water surface and then rapidly retracts it. The motion is
fast — too fast for normal video — and the tongue barely contacts the surface.</p>
<p>The contrast with dogs is instructive. Dogs <em>do</em> scoop: the tongue curls
backward (not forward), forming a ladle shape that scoops water upward and
backwards into the mouth. The motion is vigorous and inefficient — a
significant fraction of the water misses the mouth entirely, which is why
drinking dogs splash and so often have wet chins. The mechanism works, but it
is inelegant.</p>
<p>Cats produce almost no splash. The mechanism is different in kind.</p>
<hr>
<h2 id="the-physical-mechanism">The Physical Mechanism</h2>
<p>Reis et al. (2010) used high-speed photography (1000 frames per second) to
resolve the cat&rsquo;s lapping motion. Their observations:</p>
<ol>
<li>
<p>The cat extends the tongue tip downward until the <em>dorsal surface</em> (the top
side) just touches the water surface. The ventral surface (the smooth
underside) does not contact the water.</p>
</li>
<li>
<p>The cat then rapidly retracts the tongue upward. The tongue tip is moving
at roughly $v \approx 0.7\,\mathrm{m/s}$ during this retraction.</p>
</li>
<li>
<p>As the tongue tip pulls away from the surface, a column of liquid is pulled
upward by the adhesion between the liquid and the retreating tongue. The
column rises against gravity.</p>
</li>
<li>
<p>The column eventually stalls — inertia is overcome by gravity — and begins
to fall back. The cat closes its jaw at exactly the moment of maximum column
height, capturing the peak volume of water.</p>
</li>
<li>
<p>The cat then extends the tongue for the next lap.</p>
</li>
</ol>
<p>The cat closes its jaw before the tongue fully retracts. This is important:
the jaw closure captures the water column, not the water adhering to the tongue.
The tongue is the mechanism that <em>creates</em> the column; the jaw captures it.</p>
<hr>
<h2 id="dimensional-analysis-the-froude-number">Dimensional Analysis: The Froude Number</h2>
<p>The relevant competition is between <strong>inertia</strong> (which drives the column
upward) and <strong>gravity</strong> (which pulls it back down). Surface tension plays a
role in stabilising the column but is not the primary factor governing the
column height.</p>
<p>The balance between inertia and gravity for a fluid column moving at speed
$v$ and of characteristic length scale $L$ (here, the diameter of the tongue
tip, $L \approx 5\,\mathrm{mm}$ for a domestic cat) is captured by the
<strong>Froude number</strong>:</p>
$$\mathrm{Fr} = \frac{v}{\sqrt{gL}},$$<p>where $g = 9.81\,\mathrm{m/s}^2$ is gravitational acceleration.</p>
<p>When $\mathrm{Fr} \ll 1$: gravity dominates, inertia is insufficient to pull a
significant column of water upward. Very slow tongue motion would lift almost
no water.</p>
<p>When $\mathrm{Fr} \gg 1$: inertia dominates, the column rises far above the
surface but the jaw must be closed quickly before the large amount of water
falls back. Very fast tongue motion wastes water and requires rapid jaw closure.</p>
<p>The optimal lapping frequency — maximising captured volume per lap — occurs
near $\mathrm{Fr} \approx 1$, where inertial and gravitational forces are
comparable and the column height is matched to the jaw closure dynamics.</p>
<h3 id="checking-the-numbers-for-a-domestic-cat">Checking the Numbers for a Domestic Cat</h3>
<p>For a domestic cat:</p>
<ul>
<li>Tongue tip diameter: $L \approx 5\,\mathrm{mm} = 5 \times 10^{-3}\,\mathrm{m}$</li>
<li>Characteristic tongue tip speed: $v \approx 0.7\,\mathrm{m/s}$</li>
</ul>
$$\mathrm{Fr} = \frac{0.7}{\sqrt{9.81 \times 5 \times 10^{-3}}}
= \frac{0.7}{\sqrt{0.049}} = \frac{0.7}{0.22} \approx 3.2.$$<p>Reis et al. found Fr of order unity — inertial and gravitational forces
comparable — confirming that the lapping speed is tuned to the inertia-gravity
balance. (The exact numerical value depends on the choice of characteristic
length scale; using the tongue tip diameter as above gives Fr of order a few,
still in the regime where neither force dominates.)</p>
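<p>The back-of-envelope computation above, in a couple of lines (the speed and length scale are the characteristic values quoted in the text):</p>

```python
import math

# Froude number at the tongue tip for a domestic cat, using the
# characteristic values quoted above.
g = 9.81   # gravitational acceleration, m/s^2
L = 5e-3   # tongue-tip diameter, m
v = 0.7    # tongue retraction speed, m/s

Fr = v / math.sqrt(g * L)
print(f"Fr = {Fr:.1f}")  # Fr = 3.2, i.e. of order unity
```
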
<hr>
<h2 id="scaling-across-felids">Scaling Across Felids</h2>
<p>The Froude number prediction yields a scaling law for lapping frequency across
felid species of different sizes. If all felids lap at $\mathrm{Fr} \approx 1$,
then the characteristic speed scales as $v \sim \sqrt{gL}$, and the lapping
frequency scales as:</p>
$$f = \frac{v}{d} \sim \frac{\sqrt{gL}}{d},$$<p>where $d$ is the distance the tongue travels per lap (roughly proportional to
tongue length, which scales with body size). Since $L \sim d$ scales with body
size, we get:</p>
$$f \sim \frac{\sqrt{g \cdot d}}{d} = \sqrt{\frac{g}{d}} \propto d^{-1/2}.$$<p>Larger cats have longer tongues and lap more slowly. The prediction is that
lapping frequency scales as the square root of inverse tongue length — or,
equivalently, as the inverse square root of body mass (since linear dimensions
scale as mass$^{1/3}$):</p>
$$f \propto m^{-1/6}.$$<p>Reis et al. tested this against high-speed footage of large felids. A domestic
cat laps at approximately $4\,\mathrm{Hz}$; a lion laps at approximately
$1.2\,\mathrm{Hz}$; a tiger at roughly $1\,\mathrm{Hz}$. The scaling is
consistent with $f \propto m^{-1/6}$ across nearly two orders of magnitude in
body mass.</p>
<p>The table below shows the predicted versus observed scaling:</p>
<table>
  <thead>
      <tr>
          <th>Species</th>
          <th>Body mass (kg)</th>
          <th>Predicted $f$ relative to cat</th>
          <th>Predicted $f$ (Hz)</th>
          <th>Observed $f$ (Hz)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Domestic cat</td>
          <td>4</td>
          <td>1.0</td>
          <td>4.0</td>
          <td>~4.0</td>
      </tr>
      <tr>
          <td>Jaguar</td>
          <td>80</td>
          <td>$\left(\frac{4}{80}\right)^{1/6} \approx 0.61$</td>
          <td>2.4</td>
          <td>~2.0</td>
      </tr>
      <tr>
          <td>Lion</td>
          <td>200</td>
          <td>$\left(\frac{4}{200}\right)^{1/6} \approx 0.52$</td>
          <td>2.1</td>
          <td>~1.5</td>
      </tr>
      <tr>
          <td>Tiger</td>
          <td>220</td>
          <td>$\left(\frac{4}{220}\right)^{1/6} \approx 0.51$</td>
          <td>2.1</td>
          <td>~1.0</td>
      </tr>
  </tbody>
</table>
<p>The $m^{-1/6}$ scaling captures the correct trend — larger cats lap more
slowly — though the predicted frequencies for the largest cats somewhat
overestimate the observed values. The discrepancy may reflect the limitations
of the simple allometric assumption (that all linear dimensions scale as
$m^{1/3}$) and the fact that tongue geometry does not scale isometrically
across the full range of felid body sizes.</p>
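<p>The predicted column of the table follows directly from the scaling law. A minimal sketch in Python, pinning the law to the 4 kg, 4 Hz domestic cat (species masses are the table's values):</p>
<pre><code class="language-python"># Lapping frequency predicted by f ∝ m^(-1/6), normalised so that a
# 4 kg domestic cat laps at 4 Hz (the reference row of the table).

F_CAT_HZ = 4.0  # observed lapping frequency of the domestic cat
M_CAT_KG = 4.0  # body mass of the domestic cat

def predicted_frequency(mass_kg: float) -> float:
    """Predicted lapping frequency (Hz) for a felid of the given mass."""
    return F_CAT_HZ * (M_CAT_KG / mass_kg) ** (1 / 6)

for species, mass in [("domestic cat", 4), ("jaguar", 80),
                      ("lion", 200), ("tiger", 220)]:
    print(f"{species:12s} {predicted_frequency(mass):.1f} Hz")
</code></pre>
<p>This reproduces the predicted values of 4.0, 2.4, 2.1 and 2.1 Hz; the overshoot relative to the observed big-cat frequencies is the discrepancy discussed above.</p>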
<hr>
<h2 id="why-not-just-lick">Why Not Just Lick?</h2>
<p>A natural question: since the tongue already contacts the water on every
lap, why not simply submerge it and let the papillae absorb water directly?
Several answers:</p>
<ol>
<li>
<p><strong>Papillae are not sponges.</strong> Feline papillae are hollow and scoop-shaped
(filiform papillae with hollow tips), optimised for grooming and food
manipulation, not for soaking up water. Wicking through them is limited.</p>
</li>
<li>
<p><strong>The cat cannot breathe with its mouth submerged.</strong> A lapping mechanism
that keeps the mouth mostly closed except for the brief jaw-closure moment
allows continuous breathing through the nose during drinking.</p>
</li>
<li>
<p><strong>Speed and efficiency.</strong> The inertial column mechanism delivers significantly
more water per jaw movement than surface tension adhesion alone. At 4 laps
per second, a domestic cat takes in roughly $0.14\,\mathrm{mL}$ per lap,
for a total of roughly $34\,\mathrm{mL/min}$ — comparable to sipping rates
in animals that use more direct intake mechanisms.</p>
</li>
</ol>
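<p>The intake-rate figure in point 3 is just the per-lap volume times the lap rate. As a quick check:</p>
<pre><code class="language-python"># Water intake rate: volume per lap × laps per second × 60 s/min.
volume_per_lap_ml = 0.14  # mL per lap (Reis et al., ~0.14 ± 0.04 mL)
laps_per_second = 4.0     # domestic cat lapping frequency
intake_ml_per_min = volume_per_lap_ml * laps_per_second * 60
print(f"{intake_ml_per_min:.0f} mL/min")  # prints "34 mL/min"
</code></pre>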
<p>The cat has converged on a hydrodynamically optimal strategy under the
constraint of keeping the oral cavity mostly sealed during the intake cycle.</p>
<hr>
<h2 id="the-robotic-tongue">The Robotic Tongue</h2>
<p>Reis et al. constructed a robotic cat tongue to verify the mechanism: a smooth
glass disc lowered to the water surface and retracted at controlled speeds.
The column height as a function of speed followed the predicted inertia-gravity
balance, confirming that the mechanism does not depend on any specifically
biological property of the tongue — it is a fluid dynamics result that applies
to any surface moving away from a water interface at the right speed.</p>
<p>The robot lapped at the same Froude number as the cat.</p>
<hr>
<h2 id="dogs-horses-and-the-comparison">Dogs, Horses, and the Comparison</h2>
<p>Dogs cup the tongue <em>caudally</em> (backwards) rather than ventrally, forming a
ladle. The mechanism is faster and delivers more water per stroke but is
messy — the ladle is formed outside the mouth, and water sloshes freely. Dogs
lap at roughly $3\,\mathrm{Hz}$ with a tongue tip speed significantly higher
than cats, producing Fr well above unity. The excess inertia is why dog
drinking generates splashing.</p>
<p>Horses, by contrast, create a near-seal with their lips and use suction —
a fundamentally different mechanism that requires no tongue projection at all.
The lapping mechanism of felids is phylogenetically specific and appears to
have evolved under selection pressure for both efficiency and noise suppression,
consistent with the ambush-predator lifestyle. A cat that splashed while
drinking would alert prey at a water source. A cat that laps near-silently
does not.</p>
<hr>
<h2 id="a-note-on-the-measurement">A Note on the Measurement</h2>
<p>Getting reliable high-speed footage of a cat drinking is harder than it sounds.
Our cats drink at different times of day, in different moods, and the presence
of a camera tripod next to the water bowl is regarded as grounds for drinking
elsewhere. Pedro Reis et al. solved this by filming their laboratory cat, Cutta
Cutta, in a controlled setting. Their footage is available online and is
genuinely beautiful: a slow-motion waterfall in miniature, rising improbably
from the tongue tip and held there by the balance between upward momentum and
downward gravity, until the jaw swings shut.</p>
<p>The physics is in the timing.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Reis, P.M., Jung, S., Aristoff, J.M., &amp; Stocker, R. (2010). How cats lap:
Water uptake by <em>Felis catus</em>. <em>Science</em>, 330(6008), 1231–1234.
<a href="https://doi.org/10.1126/science.1195421">https://doi.org/10.1126/science.1195421</a></p>
</li>
<li>
<p>Aristoff, J.M., Stocker, R., Jung, S., &amp; Reis, P.M. (2011). On the water
lapping of felines and the water running of lizards. <em>Communicative &amp;
Integrative Biology</em>, 4(2), 213–215.</p>
</li>
<li>
<p>Vogel, S. (1994). <em>Life in Moving Fluids: The Physical Biology of Flow</em>
(2nd ed.). Princeton University Press.</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-15</strong>: Updated water intake per lap from 0.04 mL to 0.14 mL (Reis et al. report ~0.14 +/- 0.04 mL per lap; the previous value was the standard deviation), and updated the intake rate accordingly (~34 mL/min). Updated the papillae location from ventral to dorsal surface. Updated the Aristoff et al. reference to the correct 2011 <em>Communicative &amp; Integrative Biology</em> article. Removed the Jung &amp; Kim (2012) PRL reference (article number 034501 resolves to a different paper).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Boring Parts of Networked Music Performance</title>
      <link>https://sebastianspicker.github.io/posts/digital-music-labs-infrastructure/</link>
      <pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/digital-music-labs-infrastructure/</guid>
      <description>A follow-up to the August 2023 latency post. The numbers were fine. The hard part turned out to be everything else: governance, maintenance, invisible labour, and why most Digital Music Labs quietly die after the grant ends.</description>
      <content:encoded><![CDATA[<p><em>This post is based on a manuscript in progress with colleagues from the
RAPP Lab network. It builds directly on the <a href="/posts/nmp-latency-lola-mvtp/">August 2023 latency measurements</a>. That post covered what the
numbers look like. This one covers why getting to those numbers was the
easy part.</em></p>
<hr>
<h2 id="the-setup">The Setup</h2>
<p>After spending two and a half years measuring latency across six European
research-network links, I can tell you that the audio numbers are achievable.
7.5 to 22.5 ms one-way across Prague to Tallinn, LoLa and MVTP both working,
musicians playing together across national borders in real time. Technically,
that story has a satisfying ending.</p>
<p>What the measurement paper does not capture is everything that had to be true
institutionally before we could run a single test. The firewall negotiations.
The repeated calibration sessions. The network configuration that nobody
outside our small group knew how to reproduce when someone left. The grant that
funded the equipment but not the person who kept it running. The performance
session that nearly collapsed because a campus IT update had silently changed a
routing rule three days prior.</p>
<p>The technical infrastructure worked. The institutional infrastructure around it
was precarious in ways that only became visible when something broke.</p>
<p>This is what the follow-up paper tries to name.</p>
<hr>
<h2 id="what-is-a-digital-music-lab-actually">What Is a Digital Music Lab, Actually?</h2>
<p>The term gets applied to everything from a laptop cart in a classroom to
IRCAM Paris. We use it to mean something specific: a <strong>Digital Music Lab
(DML)</strong> is a hybrid environment where space, equipment, software, personnel
and organisational routines are configured together to support iterative
artistic experimentation, research-led learning and outward-facing engagement.</p>
<p>The key phrase is <em>configured together</em>. A room full of excellent hardware
is no more a DML than a building full of books is a library. What makes
either work is an invisible layer of social organisation: access policies,
shared norms, maintained documentation, people who know what to do when
something breaks.</p>
<p>We borrow a concept from infrastructure studies to describe this:
<strong>performative infrastructure</strong>. The concept draws on Star and Ruhleder (1996),
and it captures something precise — that infrastructure does not merely
<em>enable</em> activity, it also <em>shapes</em> what kinds of activity are possible in the
first place. The decision to use LoLa rather than Zoom is not just a technical
choice; it is an institutional statement about what kind of musical interaction
this space is designed to support, and about who is expected to use it.</p>
<p>This framing matters because it shifts the design question. You are not asking
&ldquo;what equipment should we buy?&rdquo; You are asking &ldquo;what kind of practice do we
want to make possible, and what organisational conditions make that practice
sustainable?&rdquo;</p>
<hr>
<h2 id="four-things-that-actually-determine-whether-a-dml-survives">Four Things That Actually Determine Whether a DML Survives</h2>
<h3 id="1-flexible-by-design-not-by-accident">1. Flexible by design, not by accident</h3>
<p>Resilient labs resist the temptation to optimise for one use case. The systems
that have lasted — Stanford CCRMA is the obvious reference point, nearly five decades
and counting — tend to separate a stable core (networking, routing,
authentication, documentation) from a more rapidly changing layer of creative
tools and workflows. The core does not change when you switch DAWs or update
your streaming platform. The tools on top of it can.</p>
<p>This sounds obvious. In practice it means being deliberate about which
dependencies you are willing to accept. A lab built on a single vendor
ecosystem can offer tight integration, but it creates a single point of
failure and a maintenance contract you will be negotiating forever. A lab built
on open protocols and well-documented configurations is more work to set up and
less work to sustain.</p>
<p>The other thing flexibility buys is pedagogical range. The same environment
can host an introductory workshop, an advanced performance-research project and
a public-facing concert without requiring incompatible reconfiguration for each.
This is not a luxury. It is what makes a DML worth the overhead compared to
just booking a studio.</p>
<h3 id="2-governance-that-survives-personnel-turnover">2. Governance that survives personnel turnover</h3>
<p>The single most dangerous sentence in any DML is: <em>&ldquo;We can ask [person] — they
know how it works.&rdquo;</em></p>
<p>Every lab has that person. The one who configured the routing. The one who
knows which cable does what. The one who has the institutional memory of every
workaround and edge case. When that person moves on, the lab frequently becomes
unreliable within six months and functionally inaccessible within a year — even
if all the equipment is still there. We call these <strong>zombie infrastructures</strong>:
technically present, functionally dead.</p>
<p>The corrective is not to document everything (though that helps). It is to
design governance so that knowledge is distributed by default. Distributed
stewardship roles — student assistants, rotating committees, peer mentors —
mean that multiple people develop operational knowledge as a matter of routine,
not as emergency knowledge transfer when someone announces they are leaving.</p>
<p>Technical staff need to be treated as co-creators in this model, not as
service providers. When networked performance is framed as peripheral
experimentation rather than core infrastructure, maintenance becomes precarious
and invisible. When it is framed as core, collaboration between artistic and
technical roles becomes institutional routine.</p>
<h3 id="3-maintenance-as-a-budget-line-not-an-afterthought">3. Maintenance as a budget line, not an afterthought</h3>
<p>Here is the infrastructure paradox: systems are valued for enabling novelty,
but they require boring, recurring investment to remain usable. Project funding
solves the novelty problem. It almost never solves the maintenance problem.</p>
<p>The costs that make a lab reliable are not one-off:</p>
<ul>
<li>Staff continuity (or explicit knowledge transfer when staff change)</li>
<li>Documentation that is actively maintained, not written once and forgotten</li>
<li>Renewal cycles for hardware and software that actually match the pace of
change in the underlying ecosystem</li>
<li>User support during active sessions, not just during setup</li>
</ul>
<p>At HfMT Köln, the operational work that dominated actual implementation time
was none of the things that appear in grant applications: coordinating network
pathways across campus boundaries, establishing and re-establishing calibration
routines after infrastructure updates, producing documentation legible to
people who were not present at the original setup, providing real-time support
during rehearsals when something behaved unexpectedly.</p>
<p>None of this is glamorous. All of it is what determines whether musicians can
actually use the system on a given Tuesday afternoon.</p>
<h3 id="4-inclusion-that-is-designed-not-assumed">4. Inclusion that is designed, not assumed</h3>
<p>Technology-intensive environments reproduce exclusion reliably unless they are
actively designed not to. The mechanisms are familiar: assumed prior
experience, cultural signals about who belongs, scheduling that conflicts with
caring responsibilities, documentation in a single language, interfaces that
reward a particular kind of technical confidence.</p>
<p>For DMLs specifically, there is an additional layer. Networked music performance
is genuinely different from co-located performance. The latency conditions
require different listening and coordination strategies. For musicians trained
in tight synchronous ensemble playing, the first experience of performing over
a network is often disorienting — latency is not a technical glitch to be fixed,
it is a compositional condition to be understood and worked with.</p>
<p>Framing this as a deficit is pedagogically counterproductive. Framing it as an
occasion to develop new artistic vocabulary — to think deliberately about what
interaction strategies work at 12 ms versus 22 ms, about how anticipatory
listening changes the character of improvisation — turns an obstacle into
content. Some of the most interesting musical thinking in our sessions came
from participants who were trying to understand why something that was
effortless in a rehearsal room required conscious attention over the network.</p>
<hr>
<h2 id="the-tensions-that-do-not-resolve">The Tensions That Do Not Resolve</h2>
<p>Being honest about what the paper does not solve:</p>
<p><strong>Project funding versus operational costs.</strong> We do not have a structural
solution to the mismatch between how labs are funded (innovation grants with
defined end dates) and how they need to operate (indefinitely, with predictable
maintenance budgets). Collaborative purchasing agreements and shared technical
teams across institutions can distribute the burden, but they introduce
coordination overhead. There is no clean answer here.</p>
<p><strong>Experimentation versus accountability metrics.</strong> Universities and funders
want quantifiable outputs. Artistic experimentation often produces its most
valuable results as changed practices and new aesthetic understanding — things
that do not appear in publication counts or utilisation statistics. The best
available response is to be explicit about this mismatch when negotiating
evaluation criteria, and to establish review processes that include artistic
peers and community partners rather than only administrators. This is possible
more often than people think, but it requires someone to argue for it
proactively.</p>
<p><strong>Openness versus depth.</strong> A lab built for maximum accessibility is not the
same as a lab optimised for a specific research agenda, and trying to be both
usually means doing neither well. The design question is not which is better
but where the tradeoff lies for a particular institution&rsquo;s mission. CCRMA and
IRCAM have made different bets on this axis over decades and both have produced
important work. The mistake is not having an opinion about where you sit on
the spectrum.</p>
<hr>
<h2 id="recommendations">Recommendations</h2>
<p>These are for institutions and funders, assembled from what the paper
describes as working across multiple DML contexts:</p>
<ul>
<li><strong>Treat DMLs as long-term cultural infrastructure.</strong> Recurring budget lines
for renewal, documentation and support — not just start-up funding.</li>
<li><strong>Separate your stable backbone from your creative tools.</strong> Networking,
routing, authentication and documentation should not be rebuilt every time
you change your video platform.</li>
<li><strong>Design governance that does not rely on one person.</strong> Distributed
stewardship roles, clear succession documentation, operational knowledge
treated as shared rather than individual.</li>
<li><strong>Make invisible labour visible.</strong> Technical stewardship, facilitation and
community liaison need to appear in hiring, workload models and evaluation
— not just in informal practice.</li>
<li><strong>Lower the floor for participation.</strong> Scaffolded onboarding, peer mentoring,
programming that supports diverse musical practices and levels of technical
experience.</li>
<li><strong>Sort out data governance before you start recording.</strong> Consent, archiving
and reuse policies for audio/video, especially when community partners or
students are involved.</li>
<li><strong>Plan for the lab&rsquo;s eventual obsolescence.</strong> Versioning policies, migration
plans, criteria for retiring tools. Zombie infrastructures are a governance
failure, not a technical one.</li>
<li><strong>Evaluate on multiple axes.</strong> Technical reliability is one. Learning
trajectories, student agency, community partnership durability and artistic
outcomes are others. Reporting only the first one creates a misleading
picture of whether the lab is actually working.</li>
</ul>
<hr>
<h2 id="what-this-does-and-does-not-claim">What This Does and Does Not Claim</h2>
<p>The argument in the paper is conceptual and practice-informed rather than
empirical in the standard sense. We synthesise literature and draw on the
HfMT Köln implementation as a vignette — it is an illustration, not a
representative sample. The framework we propose (four design principles, the
performative infrastructure framing) is offered as an analytical vocabulary
for planning and evaluation, not as a validated theory.</p>
<p>What it is useful for: making implicit infrastructure choices explicit, naming
tensions before they become crises, and supporting more realistic conversations
between artistic users, technical staff and institutional leadership about what
it actually takes to make this work.</p>
<hr>
<h2 id="references">References</h2>
<p>Borgdorff, H. (2012). <em>The Conflict of the Faculties: Perspectives on
Artistic Research and Academia.</em> Leiden University Press.</p>
<p>Labbé, D., Zuberec, C., &amp; Turner, S. (2022). Creative hubs in Hanoi,
Vietnam: Transgressive spaces in a socialist state? <em>Urban Studies</em>.
<a href="https://doi.org/10.1177/00420980221086371">https://doi.org/10.1177/00420980221086371</a></p>
<p>McKay, G. (2017). Community music: History and current practice.
<em>International Journal of Community Music</em>, 10(2), 129–137.
<a href="https://doi.org/10.1386/ijcm.10.2.129_1">https://doi.org/10.1386/ijcm.10.2.129_1</a></p>
<p>Morreale, F., Bowers, J., &amp; McPherson, A. (2021). Collaborating in
distributed musical partnerships. <em>Computers in Human Behavior</em>, 120,
106757. <a href="https://doi.org/10.1016/j.chb.2021.106757">https://doi.org/10.1016/j.chb.2021.106757</a></p>
<p>Selwyn, N. (2021). <em>Education and Technology: Key Issues and Debates</em>
(3rd ed.). Bloomsbury Academic.</p>
<p>Star, S. L., &amp; Ruhleder, K. (1996). Steps toward an ecology of
infrastructure. <em>Information Systems Research</em>, 7(1), 111–134.
<a href="https://doi.org/10.1287/isre.7.1.111">https://doi.org/10.1287/isre.7.1.111</a></p>
<p>Wenger, E. (1998). <em>Communities of Practice: Learning, Meaning, and
Identity.</em> Cambridge University Press.
<a href="https://doi.org/10.1017/CBO9780511803932">https://doi.org/10.1017/CBO9780511803932</a></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-20</strong>: Removed the Chafe (2018) &ldquo;Stanford CCRMA: A 40-year retrospective&rdquo; reference, which could not be confirmed in available databases (DOI does not resolve, not listed in <em>Computer Music Journal</em> 42(3)). The body text reference to CCRMA as an institutional example is retained; it does not depend on this citation.</li>
<li><strong>2026-01-20</strong>: Changed &ldquo;The term comes from Star and Ruhleder (1996)&rdquo; to &ldquo;The concept draws on Star and Ruhleder (1996).&rdquo; Star and Ruhleder&rsquo;s paper is the foundational text on relational infrastructure, but they did not coin the specific compound term &ldquo;performative infrastructure.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>There Is No Such Thing as Full Accessibility — Only Barrier Reduction</title>
      <link>https://sebastianspicker.github.io/posts/no-such-thing-as-full-accessibility/</link>
      <pubDate>Fri, 10 May 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/no-such-thing-as-full-accessibility/</guid>
      <description>The German word &amp;lsquo;Barrierefreiheit&amp;rsquo; promises freedom from barriers. That promise is structurally impossible. What we can achieve is Barrierearmut — a reduction of barriers. The difference is not semantic; it has consequences for policy, design, and institutional honesty.</description>
      <content:encoded><![CDATA[<p>The German compound <em>Barrierefreiheit</em> means, literally, freedom from barriers. It is the word used in legislation, in building codes, in institutional disability policies, in the guidelines that govern what universities must provide. It implies a completable state: you arrive at Barrierefreiheit, and you are done.</p>
<p>I want to argue that this is not only unachievable in practice — which most people in the field will readily concede — but structurally impossible in a society organised the way ours is. The honest term is <em>Barrierearmut</em>: poverty of barriers, reduction of barriers, a direction rather than a destination. The difference is not just linguistic. It shapes what we promise, what we measure, and what we allow ourselves to stop doing.</p>
<h2 id="two-models-of-disability">Two Models of Disability</h2>
<p>The medical model of disability, which dominated institutional thinking for most of the twentieth century, locates the problem in the individual. A person is disabled by their impairment — by the deafness, the mobility limitation, the cognitive difference. The solution, in this frame, is treatment, cure, rehabilitation: changing the person to fit the world.</p>
<p>The social model, developed in the 1970s by disability activists — particularly through the work of the Union of the Physically Impaired Against Segregation in the UK — inverts this (UPIAS, 1976). The distinction is between <em>impairment</em> (a physical or cognitive difference) and <em>disability</em> (the disadvantage created by a society that does not account for that difference). A wheelchair user is not disabled by their legs; they are disabled by a building with no ramp. A deaf student is not disabled by their hearing; they are disabled by a lecture delivered without captioning.</p>
<p>Oliver (1990) developed this into a full political framework. Disability is not a medical category but a social relation — a product of how societies organise space, communication, labour, and meaning. The implication is radical: to address disability, you do not fix the person; you change the society.</p>
<p>This model has transformed disability law, architecture, and educational policy. The UN Convention on the Rights of Persons with Disabilities (2006) is explicitly built on it. WCAG — the Web Content Accessibility Guidelines — embodies it for digital environments. The Behindertengleichstellungsgesetz in Germany draws on it.</p>
<p>And yet.</p>
<h2 id="the-limit-of-the-social-model">The Limit of the Social Model</h2>
<p>The social model is politically necessary and descriptively powerful. It is also incomplete.</p>
<p>Shakespeare and Watson (2002) offer a careful critique: the strict social model, in its effort to relocate disability from body to society, ends up treating impairment as irrelevant — as a neutral fact that only becomes disabling through social organisation. But impairment is not neutral. Pain is real. Fatigue is real. Cognitive load is real. Some impairments impose limits that no architectural or digital intervention fully removes, because the limits are not externally imposed but intrinsic to how a particular nervous system processes the world.</p>
<p>The WHO&rsquo;s International Classification of Functioning, Disability and Health (ICF, 2001) offers a biopsychosocial synthesis: disability as an interaction between health condition, body function and structure, activity, participation, and contextual factors (both environmental and personal). This is less politically clean than the social model — it does not attribute all disablement to society — but it is more honest about the complexity.</p>
<p>The point is not to retreat from the social model&rsquo;s insights but to acknowledge that &ldquo;removing all barriers&rdquo; is an incomplete goal even in its own terms. Impairment is real; context is transformable; and the interaction between them is irreducibly particular. There is no single intervention that produces accessibility for everyone.</p>
<h2 id="why-barrierefreiheit-is-a-false-promise">Why Barrierefreiheit Is a False Promise</h2>
<p>Consider what full accessibility would require. It would require physical spaces that accommodate every mobility profile, every sensory profile, every energy and endurance pattern. It would require information architectures that are simultaneously navigable by users with very different cognitive and perceptual systems. It would require communication norms, cultural contexts, and institutional practices that do not privilege any particular neurotype, any particular communication style, any particular relationship to time and deadlines and social convention.</p>
<p>None of that is achievable in a society with the historical sediment ours has. Our cities were built for able-bodied adults with average sensory capacity, with no thought given to cognitive accessibility. Our universities were built — institutionally, not just physically — for a particular kind of learner with a particular kind of background, deploying a particular kind of intelligence. Retrofitting accessibility onto these structures is possible, valuable, and necessary. But it is not the same as having built for full human variation from the start. The ramp bolted onto the side of the neoclassical building solves the wheelchair problem and leaves everything else intact.</p>
<p>Kafer (2013) makes a more radical version of this argument. The concept of &ldquo;normal&rdquo; function — the standard against which accessibility is measured — is not neutral. It encodes a history of who was considered the default human, and who was considered an exception requiring accommodation. Achieving &ldquo;accessibility&rdquo; within a framework that still treats certain bodies and minds as exceptions to be accommodated does not escape that framework; it manages it.</p>
<p>This is why a building can pass every accessibility audit and still function as an excluding institution. The audit measures physical features. It does not measure whether disabled students are welcomed into the culture of the institution, whether their modes of participation are genuinely valued, whether the hidden curriculum of &ldquo;how to be a student&rdquo; is legible to someone whose processing differs from the assumed default.</p>
<h2 id="what-barrierearmut-means">What Barrierearmut Means</h2>
<p>If <em>Barrierefreiheit</em> is the impossible promise, <em>Barrierearmut</em> — barrier reduction — is the honest goal. It is not lesser. It is more accurate.</p>
<p>Barrier reduction as a framework asks: which barriers, for which people, with which effects, can be reduced through which interventions, at what cost, with what trade-offs? It treats accessibility as an ongoing practice rather than a checkable state. It acknowledges that every design decision — physical, digital, institutional — makes some things easier for some people and harder for others, and that the question is always whose needs are centred and whose are treated as exceptions.</p>
<p>Universal Design (Mace, 1985) moves in this direction: designing from the start for the broadest range of users, rather than designing for the norm and retrofitting for exceptions. A kerb cut is the standard example — designed for wheelchair users, also useful for people with pushchairs, luggage, bicycles, temporary injuries. But Universal Design, honestly applied, acknowledges that no design is truly universal. Every design embeds assumptions. The honest goal is to minimise the distance between those assumptions and the actual diversity of users.</p>
<p>For digital environments this is particularly visible. WCAG 2.2 defines four principles — Perceivable, Operable, Understandable, Robust — and success criteria that can be tested against. Meeting WCAG AA is a meaningful achievement. It is not the same as being accessible to all users. Screen reader users with different software behave differently with the same page. Cognitive accessibility — making content understandable, not just perceivable — is addressed by WCAG 3.0 drafts but is notoriously difficult to operationalise. The standards improve; the gap remains.</p>
<h2 id="institutional-honesty">Institutional Honesty</h2>
<p>I work in a university. Universities have accessibility offices, procedures, documentation requirements. A student with a disability can request accommodations: extended exam time, written materials in accessible formats, individual arrangements. These accommodations are real and valuable. They are also, structurally, a system for managing exceptions to a norm that the institution has no intention of revising.</p>
<p>The student who needs extended time is asking the institution to adjust its standard procedure for their case. The institution does so, often generously. But the standard procedure — the timed exam, the lecture format, the office-hours model — remains the standard. The exception is granted; the norm persists. This is barrier management, not barrier reduction.</p>
<p>Barrier reduction would mean asking, as a matter of institutional practice: what is the actual pedagogical purpose of the timed exam, and are there better ways to assess that competency that do not exclude students whose processing differs? It would mean asking what the lecture format assumes about the listener, and whether those assumptions are necessary. These questions are uncomfortable because they challenge practices that are also convenient, and because the people who benefit from the current norms are the ones with the institutional power to change them.</p>
<p>This is not a problem unique to universities. It is the general structure of the problem.</p>
<h2 id="a-direction-not-a-destination">A Direction, Not a Destination</h2>
<p>I am not arguing for giving up on accessibility work. The opposite. I am arguing that naming the goal honestly — barrier reduction, not barrier freedom — produces better practice than the false promise of an achievable endpoint.</p>
<p>Barrierefreiheit as a legal standard can be met by a compliant building that is still a hostile institution. Barrierearmut as a practice requires continuous attention to who is being excluded and by what, and ongoing effort to reduce that exclusion knowing that it will never be complete.</p>
<p>That is harder. It does not allow the institution to certify itself as done. It requires asking the uncomfortable questions about whose default is encoded in the design — a question that leads, quickly, to the question of privilege.</p>
<p>That is the next post: <a href="/posts/privilege-and-education/">The Invisible Entrance Fee: On Privilege, Education, and the Institutions That Reproduce Both</a>.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>Kafer, A. (2013). <em>Feminist, Queer, Crip</em>. Indiana University Press.</li>
<li>Mace, R.L. (1985). Universal Design: Barrier Free Environments for Everyone. <em>Designers West</em>, 33(1), 147–152.</li>
<li>Oliver, M. (1990). <em>The Politics of Disablement</em>. Macmillan.</li>
<li>Shakespeare, T. &amp; Watson, N. (2002). The social model of disability: an outdated ideology? <em>Research in Social Science and Disability</em>, 2, 9–28.</li>
<li>UPIAS (1976). <em>Fundamental Principles of Disability</em>. Union of the Physically Impaired Against Segregation.</li>
<li>WHO (2001). <em>International Classification of Functioning, Disability and Health (ICF)</em>. World Health Organization.</li>
<li>UN General Assembly (2006). <em>Convention on the Rights of Persons with Disabilities</em> (A/RES/61/106).</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-11-05</strong>: Corrected the Mace reference from (1997) <em>Designers West</em> 44(1) to (1985) <em>Designers West</em> 33(1), 147–152. The year 1997 relates to the separate &ldquo;Principles of Universal Design&rdquo; publication by Connell, Jones, Mace et al. at NC State, not the <em>Designers West</em> article.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Are Cats Liquid? The Deborah Number and the Rheology of Cats</title>
      <link>https://sebastianspicker.github.io/posts/liquid-cats-deborah-number/</link>
      <pubDate>Wed, 03 Apr 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/liquid-cats-deborah-number/</guid>
      <description>Marc-Antoine Fardin won the 2017 Ig Nobel Prize in Physics for proving, rigorously, that cats are liquid. The argument rests on the Deborah number De = τ/T: if the material&amp;rsquo;s relaxation time τ is shorter than the observation time T, the material behaves as a fluid. A cat filling a sink (De ≈ 0.008) is a liquid. A cat bouncing off a table (De ≫ 1) is a solid. The classification is not a joke — it is standard rheology, applied to an unusual substrate.</description>
      <content:encoded><![CDATA[<p><em>One of our strays discovered, sometime in her first winter indoors — they are
strictly indoor cats now, on our vet&rsquo;s recommendation — that she could fit into
a salad bowl. Not sit beside it, not rest her head on its rim: fit into it,
curled into a precise sphere with her tail tucked under her chin and her ears
folded flat, filling the bowl as liquid fills a container. The bowl has a
diameter of 22 centimetres. I did not find this as surprising as perhaps I
should have: there is a quantity in materials science that determines, rigorously,
whether a given material in a given situation should be classified as a solid or
a liquid. For a cat in a bowl, this quantity is comfortably below one.</em></p>
<p><em>The material is a liquid. The material is also a cat.</em></p>
<hr>
<h2 id="the-definition-of-a-fluid">The Definition of a Fluid</h2>
<p>The intuitive distinction between solids and liquids is that solids hold their
shape and liquids conform to their container. But this distinction is one of
timescale, not of material identity.</p>
<p>A classic demonstration: place a ball of silly putty on a table. Over the
course of an hour, it flows slowly outward, taking the shape of the table
surface — clearly a liquid. Strike it sharply with a hammer and it shatters —
clearly a solid. The material has not changed. The timescale of the
interaction has.</p>
<p>The same principle applies to glass (contrary to popular myth, medieval window
glass is not thicker at the bottom because it has flowed — the variation is
from the manufacturing process, and the relaxation time of soda-lime glass at
room temperature is of order $10^{23}$ years — but at elevated temperatures
near the glass transition, silicate glass flows readily). It applies
to mantle rock, which is solid on the scale of earthquake waves and liquid on
the scale of continental drift. It applies to pitch, to ice sheets, to asphalt
on a hot day.</p>
<p>The formal tool for capturing this is the <strong>Deborah number</strong>.</p>
<hr>
<h2 id="the-deborah-number">The Deborah Number</h2>
<p>The Deborah number was introduced by Marcus Reiner in 1964, in a short note
in <em>Physics Today</em> (Reiner 1964). It is defined as:</p>
$$\mathrm{De} = \frac{\tau}{T},$$<p>where $\tau$ is the <strong>relaxation time</strong> of the material — roughly, the
characteristic time over which it can rearrange its internal structure and
relieve stress — and $T$ is the <strong>observation time</strong> or the timescale of the
imposed deformation.</p>
<ul>
<li>$\mathrm{De} \ll 1$: The material relaxes quickly relative to the timescale
of observation. Internal stresses are continuously relieved. The material
behaves as a <strong>fluid</strong>.</li>
<li>$\mathrm{De} \gg 1$: The material relaxes slowly relative to the observation
timescale. Internal stresses persist. The material behaves as a <strong>solid</strong>.</li>
<li>$\mathrm{De} \sim 1$: The material is in a viscoelastic regime — partly
fluid, partly solid, exhibiting time-dependent behaviour that is neither.</li>
</ul>
<p>The name comes from the prophetess Deborah, who sang in Judges 5:5: <em>&ldquo;The
mountains flowed before the Lord.&rdquo;</em> At the timescale of a divine perspective,
mountains are liquid. At the timescale of a human lifetime, they are not.
Reiner&rsquo;s point was that the solid-liquid distinction is not a property of
the material but of the relationship between the material&rsquo;s internal
dynamics and the observer&rsquo;s timescale.</p>
<p>For Newtonian fluids (water, air at ordinary conditions), $\tau \to 0$ and
$\mathrm{De} \to 0$ for any finite observation time — they are always liquid.
For a perfectly elastic solid (an ideal spring), $\tau \to \infty$ and
$\mathrm{De} \to \infty$ for any finite observation time — always solid. Real
materials lie between these extremes.</p>
<hr>
<h2 id="the-maxwell-viscoelastic-model">The Maxwell Viscoelastic Model</h2>
<p>The simplest model of a material with a finite relaxation time is the Maxwell
element: a spring (elastic, spring constant $G$) in series with a dashpot
(viscous, viscosity $\eta$). Under a step stress $\sigma_0$ applied at time
$t = 0$, the strain evolves as:</p>
$$\epsilon(t) = \frac{\sigma_0}{G} + \frac{\sigma_0}{\eta}\,t,$$<p>where $\tau = \eta / G$ is the Maxwell relaxation time. The first term is the
instantaneous elastic deformation of the spring; the second is the linear
viscous creep of the dashpot. For $t \ll \tau$, the elastic strain dominates
and the material behaves as a solid; for $t \gg \tau$, the viscous flow
dominates and the material behaves as a liquid. The material &ldquo;decides&rdquo; whether
to be solid or liquid depending on the ratio of $\tau$ to the duration of the
applied stress — which is precisely the Deborah number.</p>
<p>The <strong>creep compliance</strong> $J(t) = \epsilon(t)/\sigma_0 = t/\eta + 1/G$ grows
linearly with time for $t \gg \tau$, confirming liquid behaviour on long
timescales. The <strong>relaxation modulus</strong> $G(t) = \sigma(t)/\epsilon_0 = G
e^{-t/\tau}$ decays exponentially to zero, confirming that the material
cannot sustain a permanent stress — again, liquid behaviour on long timescales.</p>
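The short-time/long-time crossover can be checked numerically. A minimal Python sketch of the Maxwell element's creep response, using illustrative parameter values (the values of $G$, $\eta$, and $\sigma_0$ are invented for the example, not taken from any source):

```python
import math

def maxwell_strain(t, sigma0, G, eta):
    """Creep strain under a step stress sigma0: instantaneous elastic part
    sigma0/G plus linear viscous creep (sigma0/eta) * t."""
    return sigma0 / G + (sigma0 / eta) * t

def relaxation_modulus(t, G, tau):
    """Relaxation modulus G(t) = G * exp(-t/tau): decays to zero, so the
    material cannot sustain a permanent stress at long times."""
    return G * math.exp(-t / tau)

# Illustrative values: G = 1e3 Pa, eta = 1e4 Pa*s  ->  tau = eta/G = 10 s
G, eta, sigma0 = 1e3, 1e4, 100.0
tau = eta / G

# Short times (t << tau): the elastic term dominates -> solid-like response
elastic = sigma0 / G
viscous_short = (sigma0 / eta) * (0.01 * tau)

# Long times (t >> tau): viscous creep dominates -> liquid-like response
viscous_long = (sigma0 / eta) * (100 * tau)
```

Comparing the two strain contributions at a given time is exactly the Deborah-number question in disguise: which term wins depends only on the ratio of $\tau$ to the duration of the applied stress.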
<hr>
<h2 id="on-the-rheology-of-cats">On the Rheology of Cats</h2>
<p>In 2014, Marc-Antoine Fardin, a physicist at the ENS Lyon,
published &ldquo;On the Rheology of Cats&rdquo; in the <em>Rheology Bulletin</em> 83(2), 16–17.
The paper asked whether cats satisfy the defining rheological criterion for
liquids, using the Deborah number as the test. Fardin was awarded the 2017
Ig Nobel Prize in Physics — which is awarded for research that &ldquo;makes you
laugh, then makes you think&rdquo; — for this work.</p>
<p>The paper is not a joke. It is standard rheology applied to an unusual material,
with appropriately hedged conclusions and correct citations to the primary
literature on viscoelastic flow. The humour is in the application; the physics
is serious.</p>
<h3 id="estimating-the-cats-relaxation-time">Estimating the Cat&rsquo;s Relaxation Time</h3>
<p>The relaxation time $\tau$ of a cat is the time scale over which the cat&rsquo;s
body deforms to fill a container. This is observable. A cat placed near a
suitable container — a salad bowl, a cardboard box, a bathroom sink —
adopts a conformed shape on a timescale of roughly 5–30 seconds. The initial
posture (stiff, alert) gives way to a relaxed conformation as the cat
assesses the container and adjusts. Fardin estimated $\tau \approx 1$–$30$
seconds, with the exact value depending on the container&rsquo;s attractiveness
to the specific cat.</p>
<p>This is the material&rsquo;s characteristic relaxation time. The fact that it is
finite — that the cat does eventually conform to the container — is the
essential observation.</p>
<h3 id="computing-the-deborah-number-for-various-situations">Computing the Deborah Number for Various Situations</h3>
<p><strong>Scenario 1: Cat in a sink.</strong>
A cat taking ten minutes to settle into a bathroom sink. Observation time
$T = 600\,\mathrm{s}$, relaxation time $\tau \approx 5\,\mathrm{s}$.</p>
$$\mathrm{De}_\mathrm{sink} = \frac{5}{600} \approx 0.008 \ll 1.$$<p>The cat is unambiguously a <strong>liquid</strong>.</p>
<p><strong>Scenario 2: Cat in a cardboard box.</strong>
Conformation over approximately 30 minutes, $\tau \approx 20\,\mathrm{s}$.</p>
$$\mathrm{De}_\mathrm{box} = \frac{20}{1800} \approx 0.011 \ll 1.$$<p><strong>Liquid.</strong></p>
<p><strong>Scenario 3: Cat dropping from a bookshelf.</strong>
Contact time during a jump approximately $T \approx 0.05\,\mathrm{s}$,
relaxation time still $\tau \approx 5\,\mathrm{s}$.</p>
$$\mathrm{De}_\mathrm{jump} = \frac{5}{0.05} = 100 \gg 1.$$<p><strong>Solid.</strong> The cat does not deform into the shape of the bookshelf during the
jump; it rebounds elastically.</p>
<p><strong>Scenario 4: Cat startled by a loud noise.</strong>
Reaction time $T \approx 0.3\,\mathrm{s}$, $\tau \approx 5\,\mathrm{s}$.</p>
$$\mathrm{De}_\mathrm{startle} = \frac{5}{0.3} \approx 17 \gg 1.$$<p><strong>Solid.</strong> On short timescales, cats behave as elastic materials — they spring,
they bounce, they do not flow.</p>
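The four scenario calculations reduce to a single ratio. A small sketch that recomputes the Deborah numbers above and applies a rough classification (the thresholds are illustrative; De ~ 1 marks a broad viscoelastic crossover, not a sharp boundary):

```python
def deborah(tau, T):
    """Deborah number De = tau / T: relaxation time over observation time."""
    return tau / T

def classify(de, low=0.1, high=10.0):
    """Rough phase classification; the cutoffs are illustrative choices."""
    if de < low:
        return "liquid"
    if de > high:
        return "solid"
    return "viscoelastic"

# The four scenarios from the text (times in seconds)
scenarios = {
    "sink":    deborah(5, 600),    # ~0.008 -> liquid
    "box":     deborah(20, 1800),  # ~0.011 -> liquid
    "jump":    deborah(5, 0.05),   # 100    -> solid
    "startle": deborah(5, 0.3),    # ~17    -> solid
}
```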
<p>The cat is neither permanently solid nor permanently liquid. It is a
<strong>viscoelastic material</strong> whose phase classification depends on the timescale
of the interaction. This is not a loose analogy; it is the definition of
viscoelasticity.</p>
<hr>
<h2 id="non-newtonian-behaviour-and-flow-instabilities">Non-Newtonian Behaviour and Flow Instabilities</h2>
<p>Fardin noted an additional complication: cat flow is not Newtonian. A Newtonian
fluid has a viscosity $\eta$ that is independent of the applied shear rate
$\dot\gamma$. Many real materials are <strong>shear-thinning</strong> (viscosity decreases
with increasing shear rate — ketchup, blood, many polymer solutions) or
<strong>shear-thickening</strong> (viscosity increases with increasing shear rate —
cornstarch suspension, some dense suspensions). Cats, Fardin observed, appear
to be shear-thinning: the more rapidly you attempt to move a relaxed cat from
its current position, the more &ldquo;liquid&rdquo; (accommodating, compliant) it becomes,
up to a point at which the cat transitions to solid behaviour (claws, teeth).</p>
<p>This is, formally, the behaviour of a <strong>yield-stress fluid</strong>: a material that
behaves as a solid below a critical stress $\sigma_y$ and flows above it. The
Herschel–Bulkley model describes such fluids:</p>
$$\sigma = \sigma_y + k \dot\gamma^n, \quad \sigma > \sigma_y,$$<p>where $k$ is the flow consistency index and $n < 1$ for shear-thinning. The
challenge of fitting $k$, $n$, and $\sigma_y$ for a specific cat is
experimental, and Fardin acknowledged this was left to future work.</p>
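For illustration only (no parameters have been fitted for any cat), a sketch of the Herschel–Bulkley law with invented values, showing the shear-thinning apparent viscosity it implies:

```python
def herschel_bulkley(gamma_dot, sigma_y, k, n):
    """Herschel-Bulkley stress above yield: sigma = sigma_y + k * gamma_dot**n.
    Below sigma_y the material does not flow, i.e. it behaves as a solid."""
    return sigma_y + k * gamma_dot ** n

def apparent_viscosity(gamma_dot, sigma_y, k, n):
    """eta_app = sigma / gamma_dot; for n < 1 this decreases with shear
    rate, which is the definition of shear-thinning."""
    return herschel_bulkley(gamma_dot, sigma_y, k, n) / gamma_dot

# Hypothetical parameters: yield stress 10 Pa, consistency k = 5, n = 0.5
params = dict(sigma_y=10.0, k=5.0, n=0.5)
etas = [apparent_viscosity(g, **params) for g in (0.1, 1.0, 10.0, 100.0)]
```

The apparent viscosity falls monotonically with shear rate for $n < 1$: the faster the imposed deformation, the more readily the material flows, until the model's validity ends (for the cat, at the claws-and-teeth transition).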
<p>The <strong>Deborah number</strong> and the <strong>yield stress</strong> together provide a two-parameter
phase diagram for cat rheology:</p>
<ul>
<li>Low stress, short timescale: solid (De ≫ 1 or σ &lt; σ_y)</li>
<li>Low stress, long timescale: liquid (De ≪ 1)</li>
<li>High stress: yield, followed by flow</li>
</ul>
<hr>
<h2 id="flow-instabilities-the-rayleigh-plateau-connection">Flow Instabilities: The Rayleigh-Plateau Connection</h2>
<p>Fardin also noted that cats confined to containers thinner than their body
diameter can exhibit flow instabilities. A cat attempting to fit into a glass
too narrow for its body will sometimes adopt a helical or coiled configuration —
an instability reminiscent of the <strong>Rayleigh–Plateau instability</strong> of a liquid
jet.</p>
<p>The Rayleigh–Plateau instability occurs when a cylindrical fluid jet of radius
$r_0$ is subject to perturbations of wavelength $\lambda > 2\pi r_0$. Modes
with wavelength longer than the cylinder&rsquo;s circumference are unstable and grow,
breaking the jet into droplets. The dispersion relation for growth rate $\sigma$
as a function of wavenumber $k = 2\pi/\lambda$ (for an inviscid jet) is:</p>
$$\sigma^2 = \frac{\gamma}{\rho r_0^3}\, k r_0 \bigl(1 - k^2 r_0^2\bigr)\,
\frac{I_1(kr_0)}{I_0(kr_0)},$$<p>where $\gamma$ is surface tension and $I_0, I_1$ are modified Bessel functions.
The analogy with a cat is inexact — surface tension is not the dominant
restoring force — but the qualitative instability mechanism (a long cylinder of
material is unstable to perturbations whose wavelength exceeds the cylinder&rsquo;s
circumference) appears to apply, suggesting that very elongated cats in very
narrow containers should be unstable to coiling. This is, again, left to future
experimental work.</p>
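The classical consequence of this dispersion relation, a fastest-growing mode near $kr_0 \approx 0.7$ (wavelength about nine jet radii), can be recovered numerically. A self-contained sketch that evaluates $I_0$ and $I_1$ from their power series:

```python
import math

def bessel_i(n, x, terms=25):
    """Modified Bessel function I_n(x) via its power series
    I_n(x) = sum over m of (x/2)**(2m+n) / (m! * (m+n)!);
    converges quickly for x of order 1."""
    return sum((x / 2) ** (2 * m + n) / (math.factorial(m) * math.factorial(m + n))
               for m in range(terms))

def growth_rate_sq(x):
    """Dimensionless growth rate sigma^2 * (rho * r0^3 / gamma) as a
    function of the reduced wavenumber x = k * r0."""
    return x * (1 - x ** 2) * bessel_i(1, x) / bessel_i(0, x)

# Scan reduced wavenumbers; only modes with x < 1 (lambda > 2*pi*r0) grow
xs = [i / 1000 for i in range(1, 1000)]
x_max = max(xs, key=growth_rate_sq)  # fastest-growing mode, near 0.697
```

Modes with $kr_0 > 1$ give a negative $\sigma^2$ (oscillatory, stable); the maximum of the positive branch sits near $kr_0 \approx 0.697$, Rayleigh's classical result.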
<hr>
<h2 id="why-the-deborah-number-matters-outside-of-cat-physics">Why the Deborah Number Matters (Outside of Cat Physics)</h2>
<p>The Deborah number is not a curiosity; it is a central dimensionless number
in engineering and materials science.</p>
<p><strong>Polymer processing</strong>: The flow of polymer melts through injection-moulding
channels involves De in the range $10^{-2}$–$10^2$. Too high a De leads to
elastic instabilities, melt fracture, and surface defects in the finished part.</p>
<p><strong>Blood rheology</strong>: Blood is a non-Newtonian viscoelastic fluid. In the large
arteries (low shear rate), red blood cells aggregate into <em>rouleaux</em> and
blood behaves as a shear-thinning fluid. In the capillaries (high shear rate),
rouleaux break up and individual cells deform to fit through vessels smaller
than their resting diameter — liquid behaviour on short length scales.</p>
<p><strong>Geophysics</strong>: The mantle is an elastic solid for seismic waves ($T \sim$
seconds, De ≫ 1) and a viscous fluid for convection ($T \sim 10^8$–$10^9$
years, De ≪ 1). The same material. Different Deborah numbers.</p>
<p><strong>Glaciology</strong>: Ice is an elastic solid for rapid fracture (calving of icebergs)
and a viscous fluid for glacier flow. The transition occurs at timescales of
years to decades, depending on temperature and stress.</p>
<p>The cat is in good company.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Fardin, M.-A. (2014). On the rheology of cats. <em>Rheology Bulletin</em>, 83(2),
16–17.</p>
</li>
<li>
<p>Reiner, M. (1964). The Deborah number. <em>Physics Today</em>, 17(1), 62.
<a href="https://doi.org/10.1063/1.3051374">https://doi.org/10.1063/1.3051374</a></p>
</li>
<li>
<p>Barnes, H.A., Hutton, J.F., &amp; Walters, K. (1989). <em>An Introduction to
Rheology.</em> Elsevier (Rheology Series, Vol. 3).</p>
</li>
<li>
<p>Bird, R.B., Armstrong, R.C., &amp; Hassager, O. (1987). <em>Dynamics of Polymeric
Liquids, Vol. 1: Fluid Mechanics</em> (2nd ed.). Wiley-Interscience.</p>
</li>
<li>
<p>Eggers, J. (1997). Nonlinear dynamics and breakup of free-surface flows.
<em>Reviews of Modern Physics</em>, 69(3), 865–930.
<a href="https://doi.org/10.1103/RevModPhys.69.865">https://doi.org/10.1103/RevModPhys.69.865</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-15</strong>: Fixed Deborah number in summary from 0.08 to 0.008 (matching the body calculation: 5/600 = 0.00833).</li>
<li><strong>2025-12-15</strong>: Corrected Fardin&rsquo;s institutional affiliation from &ldquo;Paris Diderot University&rdquo; to &ldquo;ENS Lyon&rdquo; — his affiliation on the 2014 <em>Rheology Bulletin</em> paper is Université de Lyon / ENS Lyon (CNRS UMR 5672). He moved to Paris Diderot later in 2014, after the paper was published.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Hunting Exoplanets with Your Phone: A Classroom Experiment That Actually Works</title>
      <link>https://sebastianspicker.github.io/posts/exoplanet-hunting-smartphones/</link>
      <pubDate>Mon, 11 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/exoplanet-hunting-smartphones/</guid>
      <description>Finding planets around other stars sounds like it requires a space telescope. It does not — at least not the analogy version. This is the story of how a lamp, a ball, and a smartphone became a peer-reviewed physics classroom experiment, published in The Physics Teacher in 2024.</description>
      <content:encoded><![CDATA[<p><em>This post describes the work behind &ldquo;Exoplanet Hunting in the Classroom: An
Easy-to-Implement Experiment Based on Video-Aided Light Curve Analysis with
Smartphones&rdquo;, published in The Physics Teacher in 2024 (co-authored with
Alexander Küpper). It also draws on the earlier German-language paper on
analogy experiments for the transit method, published in Astronomie+Raumfahrt
in 2022.</em></p>
<hr>
<h2 id="the-pedagogical-problem">The Pedagogical Problem</h2>
<p>The transit method is how the majority of confirmed exoplanets have been
found. When a planet passes in front of its host star, it blocks a fraction
of the star&rsquo;s light. A sufficiently precise light sensor pointed at the star
will record a characteristic dip: a flat-bottomed decrease in flux during
the transit, with a precise shape determined by the ratio of the planet&rsquo;s
radius to the star&rsquo;s radius, the duration of the transit, and the geometry
of the orbit.</p>
<p>This is conceptually accessible. The physics is essentially shadow casting —
a topic covered in primary school — applied to an astronomically interesting
situation. Students understand it quickly and find it genuinely exciting.</p>
<p>The problem is the implementation. How do you actually demonstrate this
in a classroom?</p>
<p>Standard approaches divide into three categories, each with limitations:</p>
<ol>
<li>
<p><strong>Simulations and database exercises</strong>: Students work with real data from
Kepler or TESS, or use a software simulation. These are conceptually
valid but remote from physical experience. There is no sensor, no
measurement, no uncertainty to grapple with.</p>
</li>
<li>
<p><strong>Prefabricated kits</strong>: Products like PocketLab or Pasco offer purpose-built
transit experiment setups. They work, but they are expensive, closed-source,
and require manufacturer-specific software. A school that buys a Pasco sensor
is locked into the Pasco ecosystem.</p>
</li>
<li>
<p><strong>DIY benchtop setups</strong>: Various published designs use phototransistors,
Arduinos, or similar components with a benchtop light source. These are
flexible and cheap but require component procurement, assembly, and some
technical confidence from the teacher. The barrier to entry is real.</p>
</li>
</ol>
<p>What was missing was an approach that was inexpensive, open-source, required
no specialist equipment procurement, and worked at the level of a student
experiment rather than a teacher demonstration.</p>
<hr>
<h2 id="the-smartphone-solution">The Smartphone Solution</h2>
<p>Modern Android smartphones include an ambient light sensor that is directly
accessible via <a href="https://phyphox.org">phyphox</a>, the free measurement app
developed at RWTH Aachen. Set up the experiment correctly, and the phone
records a real-time light curve.</p>
<p>The basic setup requires three things:</p>
<ul>
<li>A light source (a standard desk lamp, ideally with a constant-brightness
LED bulb to avoid flicker)</li>
<li>An opaque sphere to act as the &ldquo;planet&rdquo; (a tennis ball, a ping-pong ball,
anything with a defined circular silhouette)</li>
<li>A smartphone running phyphox, positioned beneath the lamp at a fixed
distance and oriented so the light sensor faces upward</li>
</ul>
<p>When the sphere is moved across the light path at a controlled height and
speed, the light sensor records a transit: a smooth dip in measured
illuminance with the flat-bottomed shape characteristic of a planetary
transit across a uniformly bright disk.</p>
<p>This is the core experiment. It works. The transit signal is clear enough
to measure even with the modest precision of a phone&rsquo;s ambient light sensor,
provided the background illumination is controlled (dark room or at least
consistent ambient light).</p>
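The idealised curve the students' data approximates is pure geometry: the blocked fraction of a uniformly bright disk (no limb darkening) is the circle-circle overlap area divided by the disk area. A sketch with hypothetical classroom dimensions, not values from the paper:

```python
import math

def overlap_area(R, r, d):
    """Intersection area of two circles with radii R, r and centre distance d."""
    if d >= R + r:
        return 0.0                       # no overlap: out of transit
    if d <= abs(R - r):
        return math.pi * min(R, r) ** 2  # full occultation: flat bottom
    # Partial overlap (ingress/egress): standard circle-circle lens formula
    a1 = r * r * math.acos((d * d + r * r - R * R) / (2 * d * r))
    a2 = R * R * math.acos((d * d + R * R - r * r) / (2 * d * R))
    a3 = 0.5 * math.sqrt((-d + r + R) * (d + r - R) * (d - r + R) * (d + r + R))
    return a1 + a2 - a3

def relative_flux(R_lamp, r_ball, d):
    """Normalised flux: 1 minus the blocked fraction of the lamp's disk."""
    return 1.0 - overlap_area(R_lamp, r_ball, d) / (math.pi * R_lamp ** 2)

# Hypothetical numbers: lamp disk radius 10 cm, ball radius 3 cm,
# ball centre swept from -20 cm to +20 cm across the lamp's centre
curve = [relative_flux(10.0, 3.0, abs(d)) for d in range(-20, 21)]
```

The flat bottom sits at $1 - (r/R)^2 = 0.91$ for these numbers, and the sloped ingress and egress correspond to the partial-overlap branch, which is exactly what the video frames let students identify.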
<hr>
<h2 id="the-iphoneroblem-and-its-solution">The iPhone Problem and Its Solution</h2>
<p>Apple devices do not expose their ambient light sensor through any public
software API. An iPhone running phyphox cannot access the sensor that is
physically present in the device.</p>
<p>The workaround we recommend: an external Bluetooth light sensor connected
to phyphox. Options include the TI SensorTag CC2650, Bluetooth multimeters
such as the OWON B35T, or an Arduino Nano 33 BLE Sense. The Arduino option
is particularly well suited to educational contexts: it is open-source, it
is inexpensive, and, with no full operating system underneath, it behaves
<p>The external sensor approach also has a benefit beyond iPhone compatibility:
it produces more consistent data across different devices, since you are
measuring at a fixed external point rather than through whatever optical
pathway the phone manufacturer chose. For experiments where comparison
across student groups matters, this is not trivial.</p>
<hr>
<h2 id="video-aided-light-curve-analysis">Video-Aided Light Curve Analysis</h2>
<p>The standard approach to a transit experiment is: measure the dip, calculate
the planet-to-star radius ratio from the relative depth, done. This works
and is pedagogically valid.</p>
<p>The paper introduces a complementary approach: simultaneously recording a
video of the &ldquo;planet&rdquo; passing in front of the &ldquo;lamp&rdquo;, and using the video
frames to cross-reference the light curve data.</p>
<p>Why? Because the light curve from a real transit experiment does not look
exactly like the idealised textbook version. There is noise. There is
baseline drift. The &ldquo;ingress&rdquo; and &ldquo;egress&rdquo; phases — where the planet is
partially in front of the star — are often unclear at smartphone sensor
resolution. Students frequently have difficulty connecting the shape of
the curve to the physical geometry that produced it.</p>
<p>Video-aided analysis addresses this directly. Frame-by-frame, students can
see exactly where the planet was at each moment in the light curve. The
ingress becomes visible: when the sphere first touches the lamp&rsquo;s light cone,
the sensor begins to register the dip. The mid-transit flat bottom corresponds
to full occultation of a central portion of the lamp. The egress mirrors the
ingress. The correspondence between geometry and photometry — which is the
conceptual core of the transit method — becomes explicit.</p>
<p>In a teaching context, this turns the error and noise in the light curve from
an obstacle into an educational resource. Students can identify specific
features of the curve and ask: what was happening in the physical experiment
at that moment? The uncertainty is no longer an embarrassment. It is a
diagnostic.</p>
<hr>
<h2 id="scaffolding-levels">Scaffolding Levels</h2>
<p>The paper distinguishes three implementation modes, corresponding to different
levels of student independence:</p>
<p><strong>Demonstration experiment</strong>: Teacher sets up and runs the apparatus. Students
observe and discuss. Appropriate as an introduction to the concept before
students engage with it independently.</p>
<p><strong>Guided student experiment</strong>: Students follow a structured procedure, with
specified setup, data collection protocol, and analysis worksheet. Appropriate
for students who have not designed their own experiments and for lesson
contexts where time is limited.</p>
<p><strong>Open inquiry</strong>: Students are given the materials and a research question —
&ldquo;How does the depth of the transit dip depend on the size of the planet?&rdquo; —
and design their own procedure. Appropriate for upper secondary students with
experience in experimental design, and for lesson contexts that explicitly
address scientific method.</p>
<p>The materials for all three modes are described in the paper. The open inquiry
mode is the most demanding but also the most research-authentic: students
are not following a protocol but building one, confronting the actual decisions
that experimental physicists make.</p>
<hr>
<h2 id="from-the-classroom-to-the-telescope">From the Classroom to the Telescope</h2>
<p>A transit experiment with a lamp and a phone is, obviously, not the same
as the photometry done by TESS or the James Webb Space Telescope. The
planet-star radius ratios measurable in the classroom analog are much
larger than for most real exoplanets. The signal-to-noise is worse. The
lamp is not a star.</p>
<p>But the method is the same. The measurement principle — flux dip proportional
to the square of the radius ratio, duration determined by orbital geometry —
is the same physics that Kepler used to find thousands of planets. When
students calculate the &ldquo;radius&rdquo; of their tennis-ball planet from their light
curve, they are doing, in miniature, what professional astronomers do with
data from space.</p>
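The same depth formula spans both regimes. A short sketch comparing the classroom analog with real systems (the ball and lamp sizes are hypothetical; the planetary and solar radii are approximate round values):

```python
def transit_depth(r_planet, r_star):
    """Fractional flux dip for a dark disk crossing a uniform stellar disk."""
    return (r_planet / r_star) ** 2

# Classroom analog: hypothetical 3 cm ball in front of a 10 cm lamp disk
depth_classroom = transit_depth(3.0, 10.0)      # 9% -- trivially measurable

# Real systems (approximate radii in km)
depth_jupiter = transit_depth(69_911, 695_700)  # ~1% -- ground-detectable
depth_earth = transit_depth(6_371, 695_700)     # ~84 ppm -- needs space photometry
```

The jump from a 9% classroom dip to an 84-parts-per-million Earth transit is the whole reason space photometry exists, and working through the ratio makes that concrete for students.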
<p>This connection to real research is not incidental to the pedagogy. It is
central to it. The transit method works as a classroom experiment not because
it is a good demonstration of some abstract physics principle but because
it is a genuine slice of how contemporary science actually operates. The
question the experiment answers — is there something out there? — is the
same question the professional community is asking.</p>
<p>The simulation companion to this work — a browser-based model of transit
photometry with full limb darkening, exomoon scenarios, and N-body dynamics —
is described in <a href="/posts/the-gift-of-transits/">this separate post</a>. The
simulation is the place to go when you want to explore parameter space;
the physical experiment is the place to go when you want to understand
what a measurement actually is.</p>
<hr>
<h2 id="connection-to-the-astro-lab">Connection to the astro-lab</h2>
<p>The transit experiment described here grew directly out of the
<a href="/posts/astro-lab-at-home/">astro-lab project</a> at the University of Cologne,
where Alexander Küpper and I had been developing smartphone-based analogy
experiments for exoplanet detection since the COVID pivot in 2020. The
astro-lab@home established the feasibility of the smartphone approach;
the A+R 2022 paper on Analogieexperimente für die Transitmethode explored
the design space more systematically; the TPT 2024 paper is the version
written for an international teacher audience, with the comparative
equipment table, the video-aided analysis technique, and the scaffolding
levels made explicit.</p>
<p>If you want to extend the experiment to exomoons — detecting the gravitational
wobble that a moon induces in a planet&rsquo;s transit — that work is described
in <a href="/posts/exomoon-analogy-experiment/">a later post</a>.</p>
<p><em>For the curriculum article that places the transit experiment in the NRW
Sekundarstufe I context — including the Direct Imaging pre-experiment —
see <a href="/posts/fremde-welten-exoplanet-teaching/">Fremde Welten</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Spicker, S. J., &amp; Küpper, A. (2024). Exoplanet hunting in the classroom:
An easy-to-implement experiment based on video-aided light curve analysis
with smartphones. <em>The Physics Teacher</em>, 62(3).
<a href="https://doi.org/10.1119/5.0125305">https://doi.org/10.1119/5.0125305</a></p>
<p>Küpper, A., &amp; Spicker, S. J. (2022). Analogieexperimente zur Transitmethode
für den Einsatz in Schule und Hochschule. <em>Astronomie+Raumfahrt im Unterricht</em>,
59(5).</p>
<p>Staacks, S., Hütz, S., Heinke, H., &amp; Stampfer, C. (2018). Advanced tools
for smartphone-based experiments: phyphox. <em>Physics Education</em>, 53(4), 045009.
<a href="https://doi.org/10.1088/1361-6552/aac05e">https://doi.org/10.1088/1361-6552/aac05e</a></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-03</strong>: Updated the DOI for Spicker &amp; Küpper (2024) to the correct 10.1119/5.0125305.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>You Cannot Have All Three: The Fairness Impossibility Theorem</title>
      <link>https://sebastianspicker.github.io/posts/fairness-impossibility-ai-bias/</link>
      <pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/fairness-impossibility-ai-bias/</guid>
      <description>Three natural fairness criteria for an AI classifier — calibration, equal false positive rates, equal false negative rates — cannot all hold simultaneously when base rates differ across groups. This is not an engineering failure. It is a theorem. Choosing which criterion to satisfy is a political decision, not a technical one.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>In 2016 ProPublica published an investigation showing that COMPAS — a widely used recidivism risk
assessment tool — assigned higher risk scores to Black defendants than to White defendants with
equivalent actual recidivism rates. The tool&rsquo;s developer responded that COMPAS is well-calibrated:
among defendants of any race assigned a given score, the subsequent recidivism rates are
consistent with that score. Both claims were correct.</p>
<p>The apparent contradiction between them is resolved by a mathematical result that was proved
independently by two groups the same year. The fairness impossibility theorem establishes that
calibration, equal false positive rates, and equal false negative rates cannot all hold
simultaneously when base rates differ between groups — unless the classifier is perfect.</p>
<p>This is not a property of COMPAS specifically. It is not fixed by a better algorithm, more
diverse training data, or more careful engineering. It is a constraint that holds for any
probabilistic classifier operating on groups with unequal prevalence of the predicted outcome.</p>
<p>The question this forces is not &ldquo;how do we make the algorithm fair?&rdquo; The question is &ldquo;which
fairness criterion do we endorse, and can we defend that choice to the people it disadvantages?&rdquo;
That is not a technical question.</p>
<h2 id="the-compas-investigation">The COMPAS Investigation</h2>
<p>Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published &ldquo;Machine Bias&rdquo; in ProPublica
on 23 May 2016 (<a href="#ref-angwin2016">Angwin et al., 2016</a>). They had obtained COMPAS risk scores for
approximately 7,000 defendants in Broward County, Florida, along with actual two-year recidivism
data. Their finding: among defendants who did not go on to reoffend, Black defendants were
falsely labelled high-risk at roughly twice the rate of White defendants. The false positive rate
was substantially higher for Black defendants.</p>
<p>Northpointe (now Equivant), the tool&rsquo;s developer, responded that ProPublica&rsquo;s analysis was
misleading. COMPAS is <em>calibrated</em>: within any given score band, the actual recidivism rate is
the same regardless of race. A score of 7 means approximately the same thing for a Black
defendant as for a White defendant. This is a genuine and important property for a risk assessment
to have.</p>
<p>Both analyses were conducted correctly. The tension between them is not a matter of one side
being wrong. It is that satisfying both of these legitimate fairness criteria at once is,
under these conditions, mathematically impossible.</p>
<h2 id="three-definitions-of-fairness">Three Definitions of Fairness</h2>
<p>Let \(Y \in \{0, 1\}\) be the true outcome (reoffend/not), \(S\) be the classifier&rsquo;s
risk score, \(\hat{Y} \in \{0, 1\}\) be its binary prediction, and \(A \in \{0, 1\}\) indicate
group membership.</p>
<p><strong>Calibration</strong> (predictive parity): for all score values \(s\),</p>
$$P(Y = 1 \mid S = s, A = 0) = P(Y = 1 \mid S = s, A = 1)$$<p>If the model assigns a score of 7 to a defendant, the actual reoffending rate should be the
same regardless of race. This is what COMPAS satisfies.</p>
<p><strong>False positive rate parity</strong>:</p>
$$P(\hat{Y} = 1 \mid Y = 0, A = 0) = P(\hat{Y} = 1 \mid Y = 0, A = 1)$$<p>Among defendants who will not reoffend, the probability of being incorrectly labelled high-risk
should be equal across groups. This is what ProPublica found violated.</p>
<p><strong>False negative rate parity</strong>:</p>
$$P(\hat{Y} = 0 \mid Y = 1, A = 0) = P(\hat{Y} = 0 \mid Y = 1, A = 1)$$<p>Among defendants who will reoffend, the probability of being incorrectly labelled low-risk
should be equal across groups.</p>
<p>All three properties seem like reasonable things to ask of a fair classifier. The impossibility
theorem says you cannot have all three at once — with a precise exception.</p>
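<p>All three criteria can be read directly off a confusion matrix. A minimal sketch with
hand-constructed, hypothetical counts — two groups with different base rates but identical
PPV, the binary stand-in for calibration — already shows one criterion giving way:</p>

```python
from collections import Counter

def rates(y_true, y_pred):
    """Confusion-matrix rates for binary labels and predictions."""
    c = Counter(zip(y_true, y_pred))
    tp, fp, fn, tn = c[(1, 1)], c[(0, 1)], c[(1, 0)], c[(0, 0)]
    return {
        "PPV": tp / (tp + fp),  # binary stand-in for calibration
        "FPR": fp / (fp + tn),  # P(pred = 1 | Y = 0)
        "FNR": fn / (fn + tp),  # P(pred = 0 | Y = 1)
    }

# Hypothetical counts, chosen by hand.
# Group 0: base rate 0.5 (50 of 100 reoffend); group 1: base rate 0.2.
g0_true = [1] * 50 + [0] * 50
g0_pred = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40   # TP=40 FN=10 FP=10 TN=40
g1_true = [1] * 20 + [0] * 80
g1_pred = [1] * 16 + [0] * 4 + [1] * 4 + [0] * 76     # TP=16 FN=4  FP=4  TN=76

r0, r1 = rates(g0_true, g0_pred), rates(g1_true, g1_pred)
print(r0)  # PPV 0.8, FPR 0.20, FNR 0.2
print(r1)  # PPV 0.8, FPR 0.05, FNR 0.2
```

<p>With PPV and even FNR matched across the groups, the FPR still differs by a factor of four.
Which of the three to surrender is exactly the choice the theorem forces.</p>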
<h2 id="the-impossibility-theorem">The Impossibility Theorem</h2>
<p>Alexandra Chouldechova proved the relevant result in 2017 using Broward County data as her case
study (<a href="#ref-chouldechova2017">Chouldechova, 2017</a>). Jon Kleinberg, Sendhil Mullainathan, and
Manish Raghavan proved an equivalent result independently (<a href="#ref-kleinberg2017">Kleinberg et al., 2017</a>).</p>
<p>The argument is straightforward. Suppose a classifier is calibrated and produces a binary
prediction (high/low risk). Let \(p_0\) and \(p_1\) be the base rates — the actual reoffending
rates — in groups 0 and 1. For a binary classifier with positive predictive value PPV and
negative predictive value NPV:</p>
<ul>
<li>The false positive rate satisfies (via Bayes): \(\text{FPR} = \frac{\text{TPR} \cdot \text{PR} \cdot (1-\text{PPV})}{\text{PPV} \cdot (1-\text{PR})}\) where PR is prevalence and TPR is sensitivity</li>
<li>The false negative rate satisfies (via Bayes): \(\text{FNR} = \frac{\text{TNR} \cdot (1-\text{PR}) \cdot (1-\text{NPV})}{\text{NPV} \cdot \text{PR}}\) where TNR is specificity</li>
</ul>
<p>If calibration holds — PPV and NPV are equal across groups — and the base rates \(p_0 \neq p_1\),
then the FPR and FNR in each group are functions of that group&rsquo;s specific base rate. They cannot
both be equalized across groups unless either:</p>
<ol>
<li>\(p_0 = p_1\): the base rates are equal, or</li>
<li>The classifier is perfect: FPR = FNR = 0.</li>
</ol>
<p>In the real case — unequal base rates, imperfect classifier — calibration and equalized error
rates are mutually exclusive. You can have one or the other but not both. The three criteria have
two degrees of freedom, and the third is determined by the first two plus the base rates. It is an
algebraic constraint, not an engineering limitation.</p>
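<p>The Bayes relation above can be checked numerically in a few lines: hold PPV and TPR fixed
across groups, as calibration requires, and the FPR is forced to move with the base rate. The
parameter values here are illustrative, not drawn from any dataset:</p>

```python
def fpr_under_calibration(base_rate, ppv, tpr):
    """FPR implied by Bayes' rule when PPV (calibration) and TPR are fixed:
    FPR = TPR * p * (1 - PPV) / (PPV * (1 - p)), p being the base rate."""
    return tpr * base_rate * (1 - ppv) / (ppv * (1 - base_rate))

# Same PPV and TPR in every group; only the prevalence differs.
for p in (0.2, 0.3, 0.5):
    print(f"base rate {p:.1f} -> FPR {fpr_under_calibration(p, ppv=0.8, tpr=0.8):.3f}")
```

<p>Unless the base rates agree, or the classifier is perfect (PPV = 1, which sends the FPR to
zero), equal PPV forces unequal FPR — the algebraic constraint made concrete.</p>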
<h2 id="a-structural-analogy">A Structural Analogy</h2>
<p>The structural similarity to another impossibility result is worth noting.</p>
<p>The Robertson inequality in quantum mechanics (<a href="#ref-robertson1929">Robertson, 1929</a>) states that
for any two observables \(\hat{A}\) and \(\hat{B}\):</p>
$$\Delta A \cdot \Delta B \geq \frac{1}{2} \left| \langle [\hat{A}, \hat{B}] \rangle \right|$$<p>This is not an engineering failure. It is a consequence of the algebraic structure of the theory:
if \([\hat{A}, \hat{B}] \neq 0\), then \(\Delta A\) and \(\Delta B\) cannot simultaneously be
made arbitrarily small. No measurement apparatus, however precise, can violate it. The constraint
is in the mathematics, not the hardware.</p>
<p>The fairness impossibility has the same character. Three desiderata, a structural constraint that
prevents simultaneous satisfaction, and no algorithmic escape route. A better model does not help.
Richer training data does not help. The constraint is in the arithmetic of conditional
probabilities and base rates.</p>
<p>The disanalogy is this: in quantum mechanics, \(\hbar\) is a fundamental constant — you cannot
reduce it. In fairness, the base rates are not constants of nature. They are historical outcomes
of social processes: incarceration rates, policing patterns, economic conditions, educational
access. The theorem does not tell you that unequal base rates are acceptable; it tells you that
given unequal base rates, the three fairness criteria cannot all be satisfied.</p>
<h2 id="gender-bias-in-ai-systems">Gender Bias in AI Systems</h2>
<p>The impossibility theorem applies to any binary classification setting with unequal base rates.
The empirical landscape of AI gender bias gives several concrete instances where one criterion was
satisfied while others were not.</p>
<p>In October 2018, Reuters reported that Amazon had developed and then abandoned an internal
AI-based recruiting tool that systematically downgraded résumés from women
(<a href="#ref-dastin2018">Dastin, 2018</a>). The model had been trained on a decade of hiring decisions,
in which successful hires were predominantly male. The model learned that &ldquo;male&rdquo; features were
associated with success and penalized female indicators — reportedly including the word
&ldquo;women&rsquo;s&rdquo; on a r&eacute;sum&eacute; — accordingly. Calibration to the training
data produced systematic gender bias in output.</p>
<p>Tolga Bolukbasi and colleagues showed in 2016 that word embeddings trained on large text corpora
encoded gender stereotypes in their geometric structure
(<a href="#ref-bolukbasi2016">Bolukbasi et al., 2016</a>). The analogy \(\text{man} : \text{computer
programmer} :: \text{woman} : \text{homemaker}\) could be recovered directly from the vector
arithmetic of the embedding space. The embedding was calibrated to the text corpus, which reflected
the occupational distribution of the time — and perpetuated it.</p>
<p>Jieyu Zhao and colleagues found that image captioning and activity recognition models amplified
existing gender associations (<a href="#ref-zhao2017">Zhao et al., 2017</a>). &ldquo;Cooking&rdquo; was associated with
women in 67% of training images; the models amplified that to 84% at inference.
The amplification is a consequence of models learning the easiest features that predict the label
— and in a world where cooking is disproportionately female, &ldquo;female appearance&rdquo; becomes a
feature that predicts &ldquo;cooking.&rdquo;</p>
<p>Joy Buolamwini and Timnit Gebru&rsquo;s &ldquo;Gender Shades&rdquo; study found error rates of up to 34.7% for
darker-skinned women in commercial facial recognition systems, compared to 0.8% for lighter-skinned
men (<a href="#ref-buolamwini2018">Buolamwini &amp; Gebru, 2018</a>). The classifiers were calibrated on
predominantly light-skinned training data. Calibration on the majority group produced large errors
on the minority group — exactly the pattern the impossibility theorem describes.</p>
<p>Hadas Kotek and colleagues tested four large language models on gender-stereotyped occupational
prompts in 2023 (<a href="#ref-kotek2023">Kotek et al., 2023</a>). The models were three to six times more
likely to choose the gender-stereotyped occupation when responding to ambiguous prompts. The
models were calibrated to human-generated text; human-generated text encodes human stereotypes.</p>
<h2 id="the-solutions-and-their-limits">The Solutions and Their Limits</h2>
<p>Three broad approaches exist to algorithmic debiasing, and all three face the same constraint.</p>
<p><strong>Pre-processing</strong> removes bias from training data before training. Zemel and colleagues proposed
&ldquo;Learning Fair Representations&rdquo; — a latent embedding that encodes the data usefully while
obscuring group membership (<a href="#ref-zemel2013">Zemel et al., 2013</a>). This can reduce bias in the
learned representation, but it cannot simultaneously satisfy all three fairness criteria; it
trades one against another by compressing the group-informative dimensions.</p>
<p><strong>Post-processing</strong> adjusts the classifier&rsquo;s decisions after training. Moritz Hardt, Eric Price,
and Nathan Srebro&rsquo;s equalized odds approach (<a href="#ref-hardt2016">Hardt et al., 2016</a>) adjusts
decision thresholds separately per group to achieve FPR/FNR parity. This satisfies equalized
odds by construction — but only by abandoning calibration, which the Chouldechova theorem requires
when base rates differ.</p>
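<p>The threshold side of that trade-off can be sketched with hypothetical scores and labels:
choosing a separate cut-off per group to hit a target FPR — the mechanism behind threshold-based
post-processing — generally yields different thresholds per group, which is exactly the departure
from calibration the theorem demands. Ties and interpolation are ignored in this sketch:</p>

```python
def threshold_for_fpr(scores, labels, target_fpr):
    """Smallest cut-off whose FPR (fraction of negatives scored above it)
    stays at or below target_fpr. A sketch of per-group thresholding,
    not the full Hardt et al. procedure."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    k = int(target_fpr * len(negatives))  # negatives we may mislabel
    return negatives[k] if k < len(negatives) else 0.0

# Hypothetical held-out scores and true outcomes for two groups.
g0_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
g0_labels = [1,   1,   0,   1,   0,   0,   0,   0]
g1_scores = [0.95, 0.85, 0.75, 0.5, 0.45, 0.35, 0.25, 0.15]
g1_labels = [1,    0,    1,    1,   0,    0,    0,    0]

t0 = threshold_for_fpr(g0_scores, g0_labels, target_fpr=0.25)  # 0.40
t1 = threshold_for_fpr(g1_scores, g1_labels, target_fpr=0.25)  # 0.45
# A score of 0.42 is "high risk" in group 0 and "low risk" in group 1:
# the same score no longer means the same thing, i.e. calibration is gone.
```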
<p><strong>In-processing</strong> incorporates a fairness constraint into the training objective. Agarwal and
colleagues proposed a reductions approach that allows the practitioner to specify which fairness
constraint to impose (<a href="#ref-agarwal2018">Agarwal et al., 2018</a>). But you must choose. The
algorithm can optimize for any one of the three criteria; it cannot optimize for all three
simultaneously when base rates differ.</p>
<p>A 2021 survey by Mitchell and colleagues confirms that all three paradigms face the same
impossibility (<a href="#ref-mitchell2021">Mitchell et al., 2021</a>). The choice of paradigm is a choice
about which criterion to prioritize, and that choice has distributional consequences that fall
differently on different groups.</p>
<h2 id="the-political-choice">The Political Choice</h2>
<p>This is where Arvind Narayanan&rsquo;s framing becomes essential. His 2018 tutorial catalogued 21
distinct definitions of algorithmic fairness and titled it &ldquo;21 Fairness Definitions and Their
Politics&rdquo; (<a href="#ref-narayanan2018">Narayanan, 2018</a>). The title is the argument: the definitions
are not equivalent, choosing among them is not a technical decision, and the choice encodes a
prior about what justice requires.</p>
<p>In the criminal justice context: a false positive (predicting recidivism when the defendant will
not reoffend) imposes a cost on the defendant — higher bail, longer sentence, restricted
conditions of release. A false negative (predicting non-recidivism when the defendant will
reoffend) imposes a cost on potential future victims and on public safety. When we prioritize
false positive rate parity, we are choosing to protect defendants from unequal exposure to false
accusation. When we prioritize false negative rate parity, we are choosing to protect the public
from unequally distributed missed offenders. These are
both defensible values. They produce different error distributions across groups.</p>
<p>Choosing overall accuracy as the metric — which is what maximizing predictive performance
typically means — is itself a value choice: it implicitly weights errors by their frequency in
the population, which means errors made on less-common outcomes are relatively under-penalized.
When racial disparities in base rates are products of historical injustice, this choice compounds
that injustice.</p>
<p>Solon Barocas, Moritz Hardt, and Arvind Narayanan&rsquo;s textbook <em>Fairness and Machine Learning</em>
(2023) makes explicit that the choice between fairness criteria is a normative, not technical,
decision (<a href="#ref-barocas2023">Barocas et al., 2023</a>). The book does not tell you which criterion
to choose. It tells you that you must choose, that the choice has political content, and that
presenting it as a technical optimization problem conceals that content.</p>
<p>Reuben Binns&rsquo; analysis through political philosophy confirms that different fairness criteria
correspond to different underlying theories of justice: Rawlsian, Dworkinian, luck egalitarian
framings all generate different orderings of the three criteria
(<a href="#ref-binns2018">Binns, 2018</a>). The choice of fairness criterion is the choice of a
theory of justice, whether or not the engineers implementing the system have thought of it in
those terms.</p>
<h2 id="the-theorem-is-not-the-problem">The Theorem Is Not the Problem</h2>
<p>I want to be clear about what the impossibility theorem does and does not say.</p>
<p>It does not say that algorithmic fairness is impossible. It says that you must choose among
competing fairness criteria when base rates differ across groups, and that the choice has
distributional consequences. Systems can be built that satisfy calibration, or equalized odds,
or demographic parity — just not all three at once with unequal base rates.</p>
<p>It does not say that base rate disparities are natural or acceptable. The disparities in
recidivism rates, hiring rates, image training sets, and text corpora are products of social
history. The theorem constrains what a classifier can do <em>given</em> those disparities; it does not
prescribe them.</p>
<p>What it does say is that &ldquo;we built a fair algorithm&rdquo; is not a statement that can be made without
specifying which fairness criterion was satisfied and which was not. It is not a statement that
can be defended on purely technical grounds. And it is not a statement that escapes political
accountability by hiding behind mathematical precision.</p>
<p>The fairness debate in AI is, at its core, a debate about which errors we are willing to make, in
whom, with what consequences. The theorem makes that debate unavoidable. Whether we have the
vocabulary and the will to conduct it in those terms is a different question entirely.</p>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-angwin2016"></span>Angwin, J., Larson, J., Mattu, S., &amp; Kirchner, L. (2016, May 23). Machine bias. <em>ProPublica</em>. <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing</a></li>
<li><span id="ref-chouldechova2017"></span>Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. <em>Big Data</em>, 5(2), 153–163. <a href="https://doi.org/10.1089/big.2016.0047">DOI: 10.1089/big.2016.0047</a></li>
<li><span id="ref-kleinberg2017"></span>Kleinberg, J., Mullainathan, S., &amp; Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In <em>Proceedings of the 8th Innovations in Theoretical Computer Science Conference</em> (ITCS 2017). <a href="https://doi.org/10.4230/LIPIcs.ITCS.2017.43">DOI: 10.4230/LIPIcs.ITCS.2017.43</a></li>
<li><span id="ref-robertson1929"></span>Robertson, H. P. (1929). The uncertainty principle. <em>Physical Review</em>, 34, 163–164. <a href="https://doi.org/10.1103/PhysRev.34.163">DOI: 10.1103/PhysRev.34.163</a></li>
<li><span id="ref-dastin2018"></span>Dastin, J. (2018, October 10). Amazon scraps secret AI recruiting tool that showed bias against women. <em>Reuters</em>. <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G">https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G</a></li>
<li><span id="ref-bolukbasi2016"></span>Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., &amp; Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In <em>Advances in Neural Information Processing Systems 29</em> (NeurIPS 2016). arXiv:1607.06520</li>
<li><span id="ref-zhao2017"></span>Zhao, J., Wang, T., Yatskar, M., Ordonez, V., &amp; Chang, K.-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In <em>Proceedings of EMNLP 2017</em>, pp. 2979–2989. <a href="https://aclanthology.org/D17-1323/">ACL Anthology: D17-1323</a></li>
<li><span id="ref-buolamwini2018"></span>Buolamwini, J., &amp; Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In <em>Proceedings of the 1st Conference on Fairness, Accountability and Transparency</em> (FAT* 2018), PMLR Vol. 81, pp. 77–91. <a href="https://proceedings.mlr.press/v81/buolamwini18a.html">https://proceedings.mlr.press/v81/buolamwini18a.html</a></li>
<li><span id="ref-kotek2023"></span>Kotek, H., Dockum, R., &amp; Sun, D. Q. (2023). Gender bias and stereotypes in large language models. In <em>Proceedings of The ACM Collective Intelligence Conference</em> (CI &lsquo;23), pp. 12–24. <a href="https://doi.org/10.1145/3582269.3615599">DOI: 10.1145/3582269.3615599</a></li>
<li><span id="ref-zemel2013"></span>Zemel, R., Wu, Y., Swersky, K., Pitassi, T., &amp; Dwork, C. (2013). Learning fair representations. In <em>Proceedings of the 30th International Conference on Machine Learning</em> (ICML 2013), PMLR Vol. 28, No. 3, pp. 325–333. <a href="https://proceedings.mlr.press/v28/zemel13.html">https://proceedings.mlr.press/v28/zemel13.html</a></li>
<li><span id="ref-hardt2016"></span>Hardt, M., Price, E., &amp; Srebro, N. (2016). Equality of opportunity in supervised learning. In <em>Advances in Neural Information Processing Systems 29</em> (NeurIPS 2016), pp. 3323–3331. arXiv:1610.02413</li>
<li><span id="ref-agarwal2018"></span>Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., &amp; Wallach, H. (2018). A reductions approach to fair classification. In <em>Proceedings of the 35th International Conference on Machine Learning</em> (ICML 2018), PMLR Vol. 80, pp. 60–69. arXiv:1803.02453</li>
<li><span id="ref-mitchell2021"></span>Mitchell, S., Potash, E., Barocas, S., D&rsquo;Amour, A., &amp; Lum, K. (2021). Algorithmic fairness: Choices, assumptions, and definitions. <em>Annual Review of Statistics and Its Application</em>, 8, 141–163. <a href="https://doi.org/10.1146/annurev-statistics-042720-125902">DOI: 10.1146/annurev-statistics-042720-125902</a></li>
<li><span id="ref-narayanan2018"></span>Narayanan, A. (2018). <em>21 Fairness Definitions and Their Politics</em>. Tutorial at FAT* 2018. <a href="https://facctconference.org/static/tutorials/narayanan-21defs18.pdf">PDF</a></li>
<li><span id="ref-barocas2023"></span>Barocas, S., Hardt, M., &amp; Narayanan, A. (2023). <em>Fairness and Machine Learning: Limitations and Opportunities</em>. MIT Press. <a href="https://fairmlbook.org">https://fairmlbook.org</a></li>
<li><span id="ref-binns2018"></span>Binns, R. (2018). Fairness in machine learning: Lessons from political philosophy. In <em>Proceedings of the 2018 Conference on Fairness, Accountability, and Transparency</em> (FAT* 2018), PMLR Vol. 81, pp. 149–159. arXiv:1712.03586</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-11-05</strong>: Updated the Zhao et al. (2017) cooking statistics to match the paper: 67% female agents for cooking in the training set (33% was the male share), amplified to 84% female at inference.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>What the Videography Manual Didn&#39;t Cover: Filming Music Education</title>
      <link>https://sebastianspicker.github.io/posts/filming-music-education/</link>
      <pubDate>Tue, 13 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/filming-music-education/</guid>
      <description>The classroom videography manual we published in 2023 was about filming teaching. Music education has the same word in it — teaching — but it is a fundamentally different recording challenge. Sound is the subject matter. The lesson is often one person, in a practice room. And the feedback cycle the teacher needs to reach is mostly the one that happens when no camera is present. A reflection on what the manual missed, and a software prototype that tries to address part of it.</description>
      <content:encoded><![CDATA[<p><em>This post follows from the <a href="/posts/villa-videography-manual/">May 2023 post on the classroom videography
manual</a>. Read that one first if you want
the baseline.</em></p>
<hr>
<h2 id="the-assumption-underneath-the-manual">The Assumption Underneath the Manual</h2>
<p>The manual we published — Kramer, Spicker, and Kaspar, 2023, open access at
<a href="https://kups.ub.uni-koeln.de/65599/">kups.ub.uni-koeln.de/65599</a> — is a
good document for what it is. It covers a classroom. It assumes a teacher
in front of twenty to thirty students, a forty-five minute lesson, a room
with windows that create backlighting problems, a consent process that
involves four institutional levels, and two static cameras facing each other
as the baseline configuration.</p>
<p>All of that is correct for the context it addresses. The context is
school-based subject teaching: physics, mathematics, German, history. The
University of Cologne teacher education programme we developed the manual
for is primarily about preparing people for exactly that context.</p>
<p>When I moved to the Cologne University of Music, I brought the same assumptions
with me. It took a while for me to notice how much the new context violated
them.</p>
<hr>
<h2 id="sound-is-not-the-same-problem">Sound Is Not the Same Problem</h2>
<p>In the manual, the section on audio equipment is focused on speech capture.
The recommendation — lavalier microphones for the teacher, boundary
microphones at the cameras for student audio — is correct for a lesson where
the subject matter is communicated through talking. The teacher talks. The
students talk back. The quality criterion for the audio is: can we understand
what is being said?</p>
<p>In music education, the subject matter <em>is</em> sound. What the student
produces acoustically is not background noise supporting verbal instruction —
it is the object of the lesson. And it is produced by instruments that
have almost nothing in common acoustically with a human voice.</p>
<p>A lavalier microphone clipped to a teacher&rsquo;s collar, positioned to capture
speech from thirty centimetres away, will record a student&rsquo;s piano playing
through the back of the teacher&rsquo;s head, through the air, through a
directional capsule aimed at the wrong thing. The resulting audio is
technically present and analytically useless.</p>
<p>Instruments have frequency ranges, dynamic ranges, and directional patterns
that require completely different microphone selection and placement. A
violin at fortissimo in a small practice room will clip every speech-grade
microphone in the room. A breath-controlled pianissimo passage that a
skilled listener can hear clearly will barely register on a distant
boundary microphone designed to capture &ldquo;the general acoustic environment.&rdquo;
The distinction between a correctly produced tone and an incorrectly produced
tone — which is the actual content of the lesson — may or may not be
audible in the captured audio depending on whether anyone thought about
microphone choice before walking through the door.</p>
<p>The manual&rsquo;s principle of &ldquo;as much as necessary, as little as possible&rdquo;
still applies, but &ldquo;necessary&rdquo; is a completely different specification
here.</p>
<hr>
<h2 id="the-one-to-one-lesson-problem">The One-to-One Lesson Problem</h2>
<p>The classroom videography framework — including the manual — is built around
a structural assumption: there is a teacher, and there is a class.
The teacher stands or moves at the front; the students are arrayed in rows
or groups. Two cameras can cover this because the spatial structure is
relatively stable and the relevant action is roughly predictable.</p>
<p>A university instrumental lesson is typically one-to-one, in a small
practice room, for sixty minutes. The spatial structure is two people
close together around an instrument. The relevant action includes:</p>
<ul>
<li>The teacher demonstrating a passage on their own instrument</li>
<li>The teacher making a physical correction — adjusting bow arm position,
repositioning the student&rsquo;s hand on the fingerboard, demonstrating
breath support by putting a hand on the student&rsquo;s diaphragm</li>
<li>The student playing and the teacher listening with their eyes closed</li>
<li>The teacher singing a melodic contour to show phrasing</li>
<li>Both of them playing at the same time (unison work, call and response)</li>
</ul>
<p>A standard two-camera classroom setup captures none of this usefully.
The standard framing — wide angle, teacher on one side, student on the
other — produces footage where &ldquo;something is happening near the piano&rdquo;
but where the analytically relevant detail (the finger position, the
bow angle, the postural correction) is invisible at normal viewing distance.</p>
<p>You need different framing. You probably need closer cameras. You might
need a third angle for body position. And you need to accept that this
raises the setup complexity substantially beyond what the manual recommends
as a baseline.</p>
<hr>
<h2 id="what-the-lesson-is-actually-about">What the Lesson Is Actually About</h2>
<p>There is a deeper structural difference that the equipment and setup
challenges are symptoms of.</p>
<p>In subject-matter teaching, the lesson is the unit of analysis. A
forty-five-minute lesson has a beginning, a development, a conclusion.
The teacher enters with a plan; the video captures how that plan was
executed and how the students responded. The analytical interest is in
the lesson as a coherent pedagogical event.</p>
<p>In instrumental music education, the lesson is a container for cycles.
A student plays a passage. The teacher identifies a problem — the
intonation at bar twelve, the tendency to rush the syncopated rhythm,
the bow pressure collapsing in the crescendo. The teacher says or
demonstrates something. The student tries again. The teacher listens
to what changed and what did not.</p>
<p>These cycles are the unit of analysis, and they happen dozens of times
in a single lesson. The lesson-level video is useful context, but the
analytically interesting question is inside the cycle: what did the
teacher identify, what intervention did they choose, what happened to
the student&rsquo;s playing afterward?</p>
<p>Capturing those cycles in usable form requires not just video of the
lesson but video that is indexed to them — where each attempt-and-response
pair can be located and compared. A continuous recording of a sixty-minute
lesson is not organised for this purpose. Timestamps help but do not
replace the work of finding and annotating each cycle.</p>
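<p>One way to make &ldquo;indexed to the cycles&rdquo; concrete is as a data model. This is a
hypothetical sketch of such an index — every field and function name here is mine, not part of
any existing tool:</p>

```python
from dataclasses import dataclass

@dataclass
class Cycle:
    """One attempt-and-response pair inside a lesson recording."""
    start: float        # seconds into the recording
    end: float
    passage: str        # e.g. "bar 12, intonation"
    intervention: str   # what the teacher said or demonstrated
    outcome: str        # what changed in the next attempt

def attempts_at(cycles, passage):
    """All cycles for one passage, in lesson order, ready for comparison."""
    return sorted((c for c in cycles if c.passage == passage),
                  key=lambda c: c.start)

# Hypothetical annotations for a single sixty-minute lesson.
lesson = [
    Cycle(410.0, 470.0, "bar 12, intonation", "slow practice with drone", "stable"),
    Cycle(120.0, 185.0, "bar 12, intonation", "sang the contour", "F sharp raised"),
    Cycle(600.0, 655.0, "bar 20, rhythm", "clapped the syncopation", "still rushing"),
]
history = attempts_at(lesson, "bar 12, intonation")  # two cycles, earliest first
```

<p>The annotation work — deciding where each cycle starts and ends, and what the intervention
was — remains manual; the index only makes the comparison possible afterward.</p>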
<hr>
<h2 id="the-absent-camera-problem">The Absent Camera Problem</h2>
<p>There is a more fundamental issue that no amount of improved equipment
configuration addresses.</p>
<p>The feedback cycle a teacher most wants to reach is the one that happens
in a student&rsquo;s practice session. Between lessons, the student is alone
in a practice room, working through the same passages, repeating the same
mistakes (or, occasionally, having the experience of something going right
for reasons they do not fully understand). The teacher&rsquo;s instructions from
the last lesson are present only in the student&rsquo;s memory of them, which is
fallible and partial.</p>
<p>The videography manual is about research documentation: a trained operator,
institutional consent, equipment brought in from outside. None of that is
available in a student&rsquo;s practice session at eleven o&rsquo;clock on a Wednesday
night. And even if you could film it — which you could, technically, with
a phone — the resulting footage would be unwatched, because no workflow
exists to get it from the student&rsquo;s device to the teacher&rsquo;s eyes in a form
that supports structured feedback.</p>
<p>The practical reality is that most music teachers receive feedback about a
student&rsquo;s practice only through the student&rsquo;s report of it (&ldquo;I practiced
every day&rdquo;) and through the evidence presented in the lesson (which may or
may not reflect what practice actually looked like). The gap between
practice and lesson feedback is a structural feature of music education,
and it is not something that research videography can address.</p>
<hr>
<h2 id="a-software-response">A Software Response</h2>
<p>The tool I built to think through this problem is called Resonance, and it
is available at <a href="https://github.com/sebastianspicker/resonance">github.com/sebastianspicker/resonance</a>.</p>
<p>The design is deliberately different from the research videography model.
Instead of an external camera operator documenting a lesson for later
analysis, Resonance puts the documentation instrument in the student&rsquo;s
hands. Students capture short audio or video clips of their own practice —
snippets of a passage they want the teacher to hear, a moment where
something went wrong, a phrase they are finally getting right — and submit
them to a course. The teacher reviews the queue and adds feedback with
timestamped annotations: &ldquo;at 0:23, the bow pressure drops — this is what
is generating the scratch.&rdquo;</p>
<p>The asymmetry is intentional. The student decides what to document.
The teacher provides structured, specific feedback. The cycle is
asynchronous — the student submits at eleven on a Wednesday night; the
teacher responds Thursday morning — which means it is independent of
the lesson schedule.</p>
<p>The technical decisions follow from the use context. Students practice in
rooms where connectivity is unreliable, so the app is offline-first:
recordings are captured locally and uploaded when a connection is available.
An iPad is the natural form factor for a music student — larger screen,
better camera, sits on a music stand. The backend is standard (Node.js,
Postgres, S3-compatible object storage) because the interesting problem here
is not the infrastructure but the workflow.</p>
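<p>The offline-first behaviour follows a standard pattern: persist locally, retry on reconnect.
A minimal illustration of that pattern — not the actual Resonance code, with an in-memory queue
standing in for on-disk persistence and a stub standing in for the upload call:</p>

```python
from collections import deque

class UploadQueue:
    """Offline-first sketch: clips stay queued locally and are retried
    whenever a connection returns."""
    def __init__(self, send):
        self.pending = deque()
        self.send = send  # callable that uploads one clip; raises OSError offline

    def enqueue(self, clip):
        self.pending.append(clip)

    def flush(self):
        """Upload in order; stop at the first failure and keep the rest."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except OSError:
                return  # still offline — nothing is lost
            self.pending.popleft()

# Simulated connectivity: offline on first flush, online on the second.
sent, online = [], {"up": False}
def send(clip):
    if not online["up"]:
        raise OSError("no connection")
    sent.append(clip)

q = UploadQueue(send)
q.enqueue("wed-2300-bar12.m4a")   # hypothetical clip name
q.flush()                         # Wednesday night, no signal: clip stays queued
online["up"] = True
q.flush()                         # Thursday morning: clip uploads
```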
<p>Resonance is a prototype and a proof of concept, not a production system.
The authentication is explicitly development-mode only. The goal was to
build enough of the thing to be able to think clearly about what it does
and does not solve.</p>
<hr>
<h2 id="what-it-does-not-solve">What It Does Not Solve</h2>
<p>Resonance addresses the absent-camera problem for the practice-to-feedback
loop. It does not address the research documentation problem that the
videography manual was written for.</p>
<p>If you want to study <em>how music teachers give feedback</em> — as a research
question about teaching practice, not just as a workflow tool — you still
need the full apparatus: controlled recording conditions, appropriate
microphones for instruments, multi-camera coverage of the lesson, consent
for the resulting footage to be used for research and teaching purposes,
and post-processing that produces an analytically usable document.</p>
<p>Resonance footage is not that. It is what a student chose to capture on an
iPad in a practice room, with whatever acoustic environment happened to be
present. It is useful for the practice-feedback cycle; it is not a research
record.</p>
<p>The challenges I described in the first two sections — appropriate
microphones, multi-angle coverage of one-to-one lessons, capture of
the practice cycle rather than the lesson arc — are still open problems
for anyone trying to do systematic observational research in music education.
The manual gives you the framework for thinking about them. It does not
give you solutions, because those solutions are context-specific and, in
several cases, not yet worked out by the field.</p>
<p>What I find interesting is that the two problems — research documentation
and practice-feedback — might look the same (filming music education)
but require almost entirely different responses. Getting clear on which
problem you are solving turns out to be most of the work.</p>
<hr>
<p><em>The full classroom videography manual is at
<a href="https://kups.ub.uni-koeln.de/65599/">kups.ub.uni-koeln.de/65599</a>.
The Resonance repository is at
<a href="https://github.com/sebastianspicker/resonance">github.com/sebastianspicker/resonance</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Kramer, C., Spicker, S. J., &amp; Kaspar, K. (2023). <em>Manual zur Erstellung
von Unterrichtsvideographien</em>. KUPS Open Access.
<a href="https://kups.ub.uni-koeln.de/65599/">https://kups.ub.uni-koeln.de/65599/</a></p>
<p>Lehmann, A. C., Sloboda, J. A., &amp; Woody, R. H. (2007). <em>Psychology for
Musicians: Understanding and Acquiring the Skills</em>. Oxford University Press.</p>
<p>Presland, C. (2005). Conservatoire student and instrumental professor:
The student perspective on a complex relationship. <em>British Journal of Music
Education</em>, 22(3), 237–248.
<a href="https://doi.org/10.1017/S0265051705006558">https://doi.org/10.1017/S0265051705006558</a></p>
<p>Creech, A., &amp; Hallam, S. (2011). Learning a musical instrument: The
influence of interpersonal interaction on outcomes for school-aged pupils.
<em>Psychology of Music</em>, 39(1), 102–122.
<a href="https://doi.org/10.1177/0305735610370222">https://doi.org/10.1177/0305735610370222</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>When Musicians Lock In: Coupled Oscillators and the Physics of Ensemble Synchronisation</title>
      <link>https://sebastianspicker.github.io/posts/kuramoto-ensemble-sync/</link>
      <pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/kuramoto-ensemble-sync/</guid>
      <description>Every ensemble faces the same physical problem: N oscillators with slightly different natural frequencies trying to synchronise through a shared coupling channel. The Kuramoto model — developed by a statistical physicist to describe fireflies, neurons, and power grids — applies directly to musicians. It predicts a phase transition between incoherence and synchrony, quantifies why latency destroys networked ensemble performance, and connects to recent EEG studies of inter-brain synchronisation.</description>
      <content:encoded><![CDATA[<p><em>The problem is ancient and the language for it is recent. In any ensemble — a
string quartet, a jazz rhythm section, an orchestra — musicians with slightly
different internal tempos must stay together. They do this by listening to each
other. But what, exactly, does &ldquo;listening to each other&rdquo; do to their timing? And
what happens when the listening channel is imperfect — delayed by the speed of
sound across a wide stage, or by a network cable crossing a continent? The answer
involves a differential equation that was not written to describe music.</em></p>
<p><em>This post extends the latency analysis in <a href="/posts/nmp-latency-lola-mvtp/">Latency in Networked Music
Performance</a> with the dynamical systems framework
that underlies it.</em></p>
<hr>
<h2 id="two-clocks-on-a-board">Two Clocks on a Board</h2>
<p>The first documented observation of coupled-oscillator synchronisation was made
not by a musician but by a physicist. In 1665, Christiaan Huygens, confined to
bed with illness, was watching two pendulum clocks mounted on the same wooden
beam. Over the course of the night, the pendulums had synchronised into
<em>anti-phase</em> oscillation — swinging in opposite directions in exact unison.
He reported it to his father:</p>
<blockquote>
<p>&ldquo;I have noticed a remarkable effect which no-one has observed before&hellip; two
clocks on the same board always end up in mutual synchrony.&rdquo;</p>
</blockquote>
<p>The mechanism was mechanical coupling through the beam. Each pendulum&rsquo;s swing
imparted a small impulse to the wood; the other pendulum felt this as a
perturbation to its rhythm. Small perturbations, accumulated over hours, drove
the clocks into a shared frequency and a fixed phase relationship.</p>
<p>This is the prototype of every ensemble synchronisation problem. Each musician
is a clock. The acoustic environment — the air in the room, the reflected sound
from the walls, the vibrations through the stage floor — is the wooden beam.</p>
<hr>
<h2 id="the-kuramoto-model">The Kuramoto Model</h2>
<p>Yoshiki Kuramoto formalised the mathematics of coupled oscillators in 1975,
motivated by biological synchronisation problems: firefly flashing, circadian
rhythms, cardiac pacemakers. His model considers $N$ oscillators, each with a
phase $\theta_i(t)$ evolving according to:</p>
$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N.$$<p>The first term, $\omega_i$, is the oscillator&rsquo;s <em>natural frequency</em> — the tempo it
would maintain in isolation. These are drawn from a distribution $g(\omega)$, which
in a real ensemble reflects the spread of individual preferred tempos among the
players. The second term is the coupling: each oscillator is attracted toward the
phases of all others, with strength $K/N$. The factor $1/N$ keeps the total
coupling intensive (independent of ensemble size) as $N$ grows large.</p>
<p>Musically: $\theta_i$ is the phase of musician $i$&rsquo;s internal pulse at a given
moment, $\omega_i$ is their preferred tempo if playing alone, and $K$ is the
coupling strength — how much they adjust their tempo in response to what they
hear from the others.</p>
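<p>The model is straightforward to explore numerically. A minimal Euler-integration sketch in Python (all parameter values here are illustrative, not drawn from any measured ensemble):</p>

```python
import math
import random

def simulate_kuramoto(n=24, coupling=1.0, steps=2500, dt=0.008, seed=1):
    """Euler-integrate dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i).

    Natural frequencies omega_i are drawn from a standard normal g(omega);
    returns the final coherence r = |(1/N) * sum_j exp(i * theta_j)|.
    """
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 1.0) for _ in range(n)]             # preferred tempos
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    for _ in range(steps):
        drift = []
        for i in range(n):
            pull = sum(math.sin(theta[j] - theta[i]) for j in range(n))
            drift.append(omega[i] + (coupling / n) * pull)
        theta = [(t + dt * d) % (2.0 * math.pi) for t, d in zip(theta, drift)]
    cx = sum(math.cos(t) for t in theta) / n
    sx = sum(math.sin(t) for t in theta) / n
    return math.hypot(cx, sx)                                   # coherence r
```

<p>With weak coupling the phases drift at their own rates and the coherence stays near the finite-size noise floor of roughly $1/\sqrt{N}$; with strong coupling the population locks and the coherence approaches 1.</p>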
<hr>
<h2 id="the-order-parameter-and-the-phase-transition">The Order Parameter and the Phase Transition</h2>
<p>To measure the degree of synchronisation, Kuramoto introduced the complex order
parameter:</p>
$$r(t)\, e^{i\psi(t)} = \frac{1}{N} \sum_{j=1}^{N} e^{i\theta_j(t)},$$<p>where $r(t) \in [0, 1]$ is the <em>coherence</em> of the ensemble and $\psi(t)$ is the
collective mean phase. When $r = 0$, the phases are uniformly spread around the
unit circle — the ensemble is incoherent. When $r = 1$, all phases coincide —
perfect synchrony. In a live ensemble, $r$ is a direct measure of rhythmic
cohesion, though of course not one you can read off a score.</p>
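<p>In code, the order parameter is one line of complex arithmetic (a sketch using Python&rsquo;s standard library):</p>

```python
import cmath

def order_parameter(phases):
    """Return (r, psi) from r * e^{i psi} = (1/N) * sum_j e^{i theta_j}."""
    z = sum(cmath.exp(1j * theta) for theta in phases) / len(phases)
    return abs(z), cmath.phase(z)
```

<p>Eight identical phases give $r = 1$; eight phases spread evenly around the unit circle give $r = 0$.</p>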
<p>Substituting the order parameter into the equation of motion:</p>
$$\frac{d\theta_i}{dt} = \omega_i + K r \sin(\psi - \theta_i).$$<p>Each oscillator now interacts only with the mean-field quantities $r$ and $\psi$,
not with every other oscillator individually. The coupling pulls each musician
toward the collective mean phase with a force proportional to both $K$ (how
attentively they listen) and $r$ (how coherent the group already is).</p>
<p>This mean-field form reveals the essential physics. For small $K$, oscillators
with widely differing $\omega_i$ cannot follow the mean field — they drift at
their own frequencies, and $r \approx 0$. At a critical coupling strength $K_c$,
a macroscopic fraction of oscillators suddenly locks to a shared frequency, and
$r$ begins to grow continuously from zero. For a unimodal,
symmetric frequency distribution $g(\omega)$ with density $g(\bar\omega)$ at the
mean:</p>
$$K_c = \frac{2}{\pi\, g(\bar\omega)}.$$<p>Above $K_c$, the coherence grows as:</p>
$$r \approx \sqrt{\frac{K - K_c}{K_c}}, \qquad K \gtrsim K_c.$$<p>This is a <strong>second-order (continuous) phase transition</strong> — the same
mathematical structure as a ferromagnet at the Curie point, where spontaneous
magnetisation grows continuously from zero as the system crosses the critical point.
The musical ensemble and the magnetic material belong to the same universality
class, governed by the same mean-field exponent $\frac{1}{2}$.</p>
<p>Above $K_c$, the fraction of oscillators that are <em>locked</em> (synchronised to the
mean-field frequency) can be computed explicitly. An oscillator with natural
frequency $\omega_i$ locks to the mean field if $|\omega_i - \bar\omega| \leq
Kr$. For a Lorentzian distribution $g(\omega) = \frac{\gamma/\pi}{(\omega -
\bar\omega)^2 + \gamma^2}$, this yields:</p>
$$r = \sqrt{1 - \frac{K_c}{K}}, \qquad K_c = 2\gamma,$$<p>which is the exact solution of the self-consistency equation for the Kuramoto model with
Lorentzian frequency spread (Strogatz, 2000).</p>
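<p>That closed form can be checked against the standard mean-field self-consistency condition, $r = Kr \int_{-\pi/2}^{\pi/2} \cos^2\theta \; g(Kr\sin\theta)\, d\theta$, by numerical quadrature. A sketch (the midpoint rule and the sample count are arbitrary choices):</p>

```python
import math

def lorentzian(w, gamma):
    """g(omega) centred on zero with half-width gamma."""
    return (gamma / math.pi) / (w * w + gamma * gamma)

def self_consistency_rhs(r, k, gamma, samples=50000):
    """Midpoint-rule value of K r * int_{-pi/2}^{pi/2} cos^2(t) g(K r sin t) dt."""
    h = math.pi / samples
    total = 0.0
    for i in range(samples):
        t = -math.pi / 2.0 + (i + 0.5) * h
        total += math.cos(t) ** 2 * lorentzian(k * r * math.sin(t), gamma)
    return k * r * total * h

gamma = 0.5                                   # so K_c = 2 * gamma = 1.0
for k in (1.5, 2.0, 4.0):
    r_exact = math.sqrt(1.0 - 2.0 * gamma / k)
    # r_exact is a fixed point: the right-hand side returns it unchanged
    assert abs(self_consistency_rhs(r_exact, k, gamma) - r_exact) < 1e-6
```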
<p>The physical reading is direct: whether an ensemble locks into a shared pulse or
drifts apart is a threshold phenomenon. A group of musicians with similar
preferred tempos has a sharply peaked $g(\omega)$, hence a large density
$g(\bar\omega)$ at the mean and a low $K_c$ — they
synchronise easily with minimal attentive listening. A group with widely varying
individual tempos needs stronger, more sustained coupling to cross the threshold.
This is not a matter of musical discipline; it is a material property of the
ensemble.</p>
<hr>
<h2 id="concert-hall-applause-neda-et-al-2000">Concert Hall Applause: Neda et al. (2000)</h2>
<p>The Kuramoto model is not only a theoretical construction. Neda et al. (2000)
applied it to concert hall applause — one of the most direct real-world
demonstrations of coupled-oscillator dynamics in a musical context.</p>
<p>They recorded applause in Romanian and Hungarian theaters and found that audiences
spontaneously alternate between two distinct states. In the <em>incoherent</em> regime,
each audience member claps at their own preferred rate (typically 2–3 Hz). Through
acoustic coupling — each person hears the room-averaged sound and adjusts their
clapping — the audience gradually synchronises to a shared, slower frequency
(around 1.5 Hz): the <em>synchronised</em> regime.</p>
<p>The transitions between the two regimes are quantitatively consistent with the
Kuramoto phase transition: the emergence of synchrony corresponds to $K$ crossing
$K_c$ as people progressively pay more attention to the collective sound.
Furthermore, Neda et al. document a characteristic phenomenon when synchrony
breaks down: individual clapping frequency approximately <em>doubles</em> as audience
members attempt to re-establish coherence. This frequency-doubling — a feature of
nonlinear oscillator systems near instability — is exactly what the delayed
response of coupling near $K_c$ predicts.</p>
<p>The paper is a useful pedagogical artefact: every music student has experienced
concert hall applause, and hearing that it undergoes a physically measurable phase
transition makes the connection between physics and musical experience concrete.</p>
<hr>
<h2 id="latency-and-the-limits-of-networked-ensemble-performance">Latency and the Limits of Networked Ensemble Performance</h2>
<p>In standard acoustic ensemble playing, the coupling delay is the propagation time
for sound to cross the ensemble: at $343\ \text{m/s}$, across a ten-metre stage,
roughly 30 ms. This is why orchestral seating is arranged with attention to who
needs to hear whom first.</p>
<p>In networked music performance (NMP), the coupling delay $\tau$ is much larger:
tens to hundreds of milliseconds depending on geographic distance and network
infrastructure. The Kuramoto model generalises naturally to include this delay:</p>
$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin\!\bigl(\theta_j(t - \tau) - \theta_i(t)\bigr).$$<p>Each musician hears the others&rsquo; phases as they were $\tau$ seconds ago, not as
they are now.</p>
<p>In a synchronised state where all oscillators share the collective frequency
$\bar\omega$ and phase $\psi(t) = \bar\omega t$, the delayed phase signal is
$\psi(t - \tau) = \bar\omega t - \bar\omega\tau$. The effective coupling
force contains a factor $\cos(\bar\omega\tau)$: the delay introduces a phase
shift that reduces the useful component of the coupling. The critical coupling
with delay is therefore:</p>
$$K_c(\tau) = \frac{K_c(0)}{\cos(\bar\omega \tau)}.$$<p>As $\tau$ increases, $K_c(\tau)$ grows: synchronisation requires progressively
stronger coupling (more attentive adjustment) to compensate for the information
lag. The denominator $\cos(\bar\omega\tau)$ reaches zero when
$\bar\omega\tau = \pi/2$. At this point $K_c(\tau) \to \infty$: no finite coupling
strength can maintain synchrony. The critical delay is:</p>
$$\tau_c = \frac{\pi}{2\bar\omega}.$$<p>For an ensemble performing at 120 BPM, the beat frequency is
$\bar\omega = 2\pi \times 2\ \text{Hz} = 4\pi\ \text{rad/s}$:</p>
$$\tau_c = \frac{\pi}{2 \times 4\pi} = \frac{1}{8}\ \text{s} = 125\ \text{ms}.$$<p>This is a remarkably clean result. The Kuramoto model with delay predicts that
ensemble synchronisation collapses at around 125 ms one-way delay for a standard
performance tempo. The empirical literature on NMP — from LoLa deployments across
European conservatories to controlled latency studies in the lab — consistently
finds that rhythmic coherence degrades noticeably above 50–80 ms and becomes
essentially unworkable above 100–150 ms one-way. The model and the data agree.</p>
<p>The derivation also shows why faster tempos are harder in NMP: $\tau_c \propto
1/\bar\omega$, so doubling the tempo halves the tolerable latency. An ensemble
performing at 240 BPM in a distributed setting faces a theoretical ceiling of
62.5 ms — which rules out transcontinental performance for most repertoire.</p>
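<p>The tempo scaling is simple enough to tabulate in a few lines (an illustrative helper based on the idealised formula above, taking the beat as the reference oscillation):</p>

```python
import math

def critical_delay_ms(bpm):
    """tau_c = pi / (2 * omega_bar), with beat frequency bpm/60 Hz, in milliseconds."""
    omega_bar = 2.0 * math.pi * (bpm / 60.0)   # rad/s
    return 1000.0 * math.pi / (2.0 * omega_bar)

for bpm in (60, 120, 240):
    print(f"{bpm} BPM -> tau_c = {critical_delay_ms(bpm):.1f} ms")
```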
<hr>
<h2 id="brains-in-sync-eeg-hyperscanning">Brains in Sync: EEG Hyperscanning</h2>
<p>The Kuramoto framework has recently been applied at a neural level.
EEG hyperscanning — simultaneous EEG recording from multiple participants during
a shared musical activity — has shown that musicians performing together exhibit
<em>inter-brain synchronisation</em>: coherent cortical oscillations at the frequency of
the music are measurable between players (Lindenberger et al., 2009; Müller et
al., 2013). The phase coupling between brains during joint performance is
significantly higher than during solo performance and higher than for musicians
playing simultaneously but without acoustic coupling.</p>
<p>This suggests that the Kuramoto coupling operates at two levels: the acoustic
(each musician hears the other and adjusts physical timing) and the neural (each
musician&rsquo;s cortical oscillators entrain to the shared musical pulse). The
question of which level is primary — whether neural synchrony causes or follows
from acoustic synchrony — remains open.</p>
<p>A 2023 review by Demos and Palmer argues that pairwise Kuramoto-type coupling is
insufficient to capture full ensemble dynamics. Group-level effects — the
differentiation between leader and follower roles, the emergence of collective
timing that no individual would produce alone — require nonlinear dynamical
frameworks that go beyond mean-field averaging. The model that adequately
describes a string quartet may need to be richer than the one that describes a
population of identical fireflies.</p>
<hr>
<h2 id="what-this-means-for-teaching">What This Means for Teaching</h2>
<p>The Kuramoto model reframes standard rehearsal intuitions in physical terms.</p>
<p><strong>&ldquo;Listen more&rdquo;</strong> translates to &ldquo;increase your effective coupling constant $K$.&rdquo;
A musician who plays without attending to others has set $K \approx 0$ and will
drift freely according to their own $\omega_i$. Listening — actively adjusting
tempo in response to what you hear — is not metaphorical. It is the physical
mechanism of coupling, and its effect is to pull you toward the mean phase $\psi$
with a force $Kr\sin(\psi - \theta_i)$.</p>
<p><strong>&ldquo;Our tempos are too different&rdquo;</strong> is a claim about $g(\bar\omega)$ and therefore
about $K_c$. A group with a wide spread of natural tempos needs more and stronger
listening to synchronise. This is not a moral failing but a parameter; it
suggests that ensemble warm-up time or explicit tempo negotiation before a
performance serves to reduce the spread of natural frequencies before the coupling
has to do all the work.</p>
<p><strong>Latency as a rehearsal experiment</strong> can be made explicit. Artificially delaying
the acoustic return to one musician in an ensemble — via headphone monitoring with
variable delay — allows students to experience directly how the coordination
degrades as $\tau$ increases toward $\tau_c$. They feel the system approaching
the phase transition without the theoretical framework, but the framework makes
the experience interpretable afterward.</p>
<p><strong>The click track</strong> replaces peer-to-peer Kuramoto coupling with an external
forcing term: each musician locks to a shared reference with fixed $\omega$
rather than adjusting dynamically to the group mean. This eliminates the phase
transition but also eliminates the adaptive dynamics — the micro-timing
fluctuations and expressive rubato — that characterise live ensemble playing. It
is a pedagogically important distinction, even if studios routinely make the
pragmatic choice.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Demos, A. P., &amp; Palmer, C.
(2023). Social and nonlinear dynamics unite: Musical group synchrony. <em>Trends
in Cognitive Sciences</em>, 27(11), 1008–1018.
<a href="https://doi.org/10.1016/j.tics.2023.08.005">https://doi.org/10.1016/j.tics.2023.08.005</a></p>
</li>
<li>
<p>Huygens, C. (1665). Letter to his father Constantijn Huygens, 26 February
1665. In <em>Œuvres complètes de Christiaan Huygens</em>, Vol. 5, p. 243. Martinus
Nijhoff, 1893.</p>
</li>
<li>
<p>Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear
oscillators. In H. Araki (Ed.), <em>International Symposium on Mathematical
Problems in Theoretical Physics</em> (Lecture Notes in Physics, Vol. 39,
pp. 420–422). Springer.</p>
</li>
<li>
<p>Kuramoto, Y. (1984). <em>Chemical Oscillations, Waves, and Turbulence.</em> Springer.</p>
</li>
<li>
<p>Lindenberger, U., Li, S.-C., Gruber, W., &amp; Müller, V. (2009). Brains swinging
in concert: Cortical phase synchronization while playing guitar.
<em>BMC Neuroscience</em>, 10, 22. <a href="https://doi.org/10.1186/1471-2202-10-22">https://doi.org/10.1186/1471-2202-10-22</a></p>
</li>
<li>
<p>Müller, V., Sänger, J., &amp; Lindenberger, U. (2013). Intra- and inter-brain
synchronization during musical improvisation on the guitar. <em>PLOS ONE</em>, 8(9),
e73852. <a href="https://doi.org/10.1371/journal.pone.0073852">https://doi.org/10.1371/journal.pone.0073852</a></p>
</li>
<li>
<p>Neda, Z., Ravasz, E., Vicsek, T., Brechet, Y., &amp; Barabási, A.-L. (2000).
Physics of the rhythmic applause. <em>Physical Review E</em>, 61(6), 6987–6992.
<a href="https://doi.org/10.1103/PhysRevE.61.6987">https://doi.org/10.1103/PhysRevE.61.6987</a></p>
</li>
<li>
<p>Strogatz, S. H. (2000). From Kuramoto to Crawford: Exploring the onset of
synchronization in populations of coupled oscillators. <em>Physica D: Nonlinear
Phenomena</em>, 143(1–4), 1–20.
<a href="https://doi.org/10.1016/S0167-2789(00)00094-4">https://doi.org/10.1016/S0167-2789(00)00094-4</a></p>
</li>
<li>
<p>Strogatz, S. H. (2003). <em>Sync: How Order Emerges from Chaos in the Universe,
Nature, and Daily Life.</em> Hyperion.</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Updated the author list for the Demos (2023) <em>Trends in Cognitive Sciences</em> reference to the published two authors (Demos &amp; Palmer). The five names previously listed were from a different Demos paper.</li>
<li><strong>2026-01-14</strong>: Changed &ldquo;period-doubling&rdquo; to &ldquo;frequency-doubling.&rdquo; When the clapping frequency doubles, the period halves; &ldquo;frequency-doubling&rdquo; is the precise term in this context.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Impossible Heptagon</title>
      <link>https://sebastianspicker.github.io/posts/tool-impossible-heptagon/</link>
      <pubDate>Mon, 15 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/tool-impossible-heptagon/</guid>
      <description>Danny Carey calls it sacred geometry. Gauss proved it non-constructible. They are, unexpectedly, describing the same object.</description>
      <content:encoded><![CDATA[<p>Danny Carey — drummer of Tool, one of the most rhythmically inventive musicians in rock — keeps a seven-pointed star on his kit and speaks about it using the language of sacred geometry. The heptagram appears in Tool&rsquo;s visual artwork, in the Thelemic symbolism Carey draws on, in pre-modern cosmological diagrams, and in the decorative traditions of several cultures that had no contact with each other. The claim, loosely stated, is that seven-fold symmetry is privileged: that it reflects something structurally true, that its forms carry significance beyond the aesthetic.</p>
<p>The scientific reflex here is usually impatience. &ldquo;Sacred geometry&rdquo; occupies an uncomfortable cultural space — mathematically dressed, factually thin, reliant on the listener not checking claims too carefully. The golden ratio does not appear everywhere in nature. Most things described as sacred in this tradition are better described as things the speaker found surprising before learning a more precise vocabulary.</p>
<p>But the heptagon is genuinely strange. Not for the reasons usually given. For a different reason — a theorem.</p>
<p><strong>The regular heptagon cannot be constructed with compass and straightedge.</strong></p>
<p>Not &ldquo;it is difficult.&rdquo; Not &ldquo;no one has found a construction yet.&rdquo; The regular seven-sided polygon — all sides equal, all interior angles equal — is <em>provably impossible</em> to construct using an unmarked ruler and compass in finitely many steps. This has been known since 1801.</p>
<h2 id="the-classical-constraint">The Classical Constraint</h2>
<p>Greek geometry restricted its tools deliberately. An unmarked straightedge draws lines through two known points. A compass draws circles centred at a known point with a given radius. No angle trisection. No markings. No graduated instruments. Just these two operations, applied one at a time, finitely many times.</p>
<p>Within this constraint, a great deal is achievable. A perpendicular bisector. An equilateral triangle. A regular pentagon — which requires the golden ratio and takes some work, but is reachable. A regular hexagon (trivially: six equilateral triangles around a centre).</p>
<p>Then: nothing for the heptagon. Greek geometers left no construction. Medieval Islamic mathematicians, who knew the regular polygon problem well, left no construction. Albrecht Dürer, in his 1525 <em>Underweysung der Messung</em>, gave an approximate construction that falls short by a small but nonzero margin. Each generation encountered the same wall.</p>
<p>In 1796, an 18-year-old Gauss proved that the regular 17-gon <em>is</em> constructible — a result so unexpected that he reportedly decided at that moment to become a mathematician rather than a philologist. In his 1801 <em>Disquisitiones Arithmeticae</em> he gave the complete characterisation of which regular polygons are constructible and which are not <a href="#ref-1">[1]</a>. The heptagon was definitively placed among the impossible.</p>
<h2 id="gausss-theorem">Gauss&rsquo;s Theorem</h2>
<p>A regular $n$-gon is constructible with compass and straightedge if and only if $n$ has the form</p>
$$n = 2^k \cdot p_1 \cdot p_2 \cdots p_m$$<p>where $k \geq 0$ and the $p_i$ are distinct <strong>Fermat primes</strong> — primes of the form $2^{2^j} + 1$.</p>
<p>The Fermat primes currently known:</p>
<table>
  <thead>
      <tr>
          <th>$j$</th>
          <th>$F_j = 2^{2^j}+1$</th>
          <th>Prime?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0</td>
          <td>3</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>1</td>
          <td>5</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>2</td>
          <td>17</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>3</td>
          <td>257</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>4</td>
          <td>65537</td>
          <td>✓</td>
      </tr>
      <tr>
          <td>5</td>
          <td>4 294 967 297</td>
          <td>✗ (Euler, 1732)</td>
      </tr>
      <tr>
          <td>6</td>
          <td>18 446 744 073 709 551 617</td>
          <td>✗</td>
      </tr>
      <tr>
          <td>⋮</td>
          <td>⋮</td>
          <td>no further Fermat primes known</td>
      </tr>
  </tbody>
</table>
<p>Five Fermat primes are known, all identified by the seventeenth century. Fermat himself conjectured that all numbers of this form are prime; he was wrong from $j = 5$ onward. Whether any further Fermat primes exist remains an open problem.</p>
<p>The constructible regular polygons therefore include the triangle (3), square (4), pentagon (5), hexagon (6), octagon (8), decagon (10), 15-gon, 17-gon, 257-gon, 65537-gon, and products of these with powers of 2. The 65537-gon was actually fully constructed by Johann Gustav Hermes, who spent around ten years on the computation in the 1880s and deposited a manuscript reportedly filling a large trunk at the University of Göttingen, where it remains.</p>
<p>Seven is prime, but $7 \neq 2^{2^j} + 1$ for any $j$ — it is not a Fermat prime. Therefore the regular heptagon is not on the list. It is not constructible.</p>
<h2 id="the-algebra-behind-the-geometry">The Algebra Behind the Geometry</h2>
<p>Why does the structure of Fermat primes determine constructibility? The connection goes through algebra <a href="#ref-2">[2]</a><a href="#ref-3">[3]</a>.</p>
<p>Every compass-and-straightedge construction corresponds to solving a sequence of equations of degree at most 2. Bisecting an angle, finding an intersection of a line and a circle — each step is a quadratic operation. After $k$ such steps, the numbers reachable lie in some field extension of $\mathbb{Q}$ (the rationals) with degree over $\mathbb{Q}$ at most $2^k$. Constructibility therefore requires the degree of the relevant extension to be a power of 2.</p>
<p>To construct a regular $n$-gon, you need to construct the angle $2\pi/n$, which requires constructing $\cos(2\pi/n)$. The question is: over what kind of field extension does $\cos(2\pi/n)$ sit?</p>
<p>For $n = 7$: let $\omega = e^{2\pi i/7}$, a primitive 7th root of unity. The minimal polynomial of $\omega$ over $\mathbb{Q}$ is the 7th cyclotomic polynomial</p>
$$\Phi_7(x) = x^6 + x^5 + x^4 + x^3 + x^2 + x + 1,$$<p>which is irreducible over $\mathbb{Q}$, giving $[\mathbb{Q}(\omega) : \mathbb{Q}] = 6$. Since $\cos(2\pi/7) = (\omega + \omega^{-1})/2$, and since $\omega$ satisfies a degree-2 polynomial over $\mathbb{Q}(\cos 2\pi/7)$, we get</p>
$$[\mathbb{Q}(\cos 2\pi/7) : \mathbb{Q}] = 3.$$<p>Specifically, $c = \cos(2\pi/7)$ is the root of the irreducible cubic</p>
$$8c^3 + 4c^2 - 4c - 1 = 0,$$<p>or equivalently, $\alpha = 2\cos(2\pi/7)$ satisfies</p>
$$\alpha^3 + \alpha^2 - 2\alpha - 1 = 0.$$<p>The three roots of this cubic are $2\cos(2\pi/7)$, $2\cos(4\pi/7)$, and $2\cos(6\pi/7)$. By Vieta&rsquo;s formulas their sum is $-1$ and their product is $1$ — which can be verified directly from the identity $\cos(2\pi/7) + \cos(4\pi/7) + \cos(6\pi/7) = -1/2$.</p>
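<p>Both the cubic and the Vieta identities are easy to confirm numerically (a quick check; the tolerance is an arbitrary choice):</p>

```python
import math

# the three roots 2*cos(2*pi/7), 2*cos(4*pi/7), 2*cos(6*pi/7)
roots = [2.0 * math.cos(2.0 * math.pi * k / 7.0) for k in (1, 2, 3)]

def minimal_cubic(x):
    """The minimal polynomial x^3 + x^2 - 2x - 1 of 2*cos(2*pi/7)."""
    return x ** 3 + x ** 2 - 2.0 * x - 1.0

for x in roots:
    assert abs(minimal_cubic(x)) < 1e-12       # each root satisfies the cubic

assert abs(sum(roots) + 1.0) < 1e-12           # Vieta: sum of roots is -1
assert abs(roots[0] * roots[1] * roots[2] - 1.0) < 1e-12   # product of roots is 1
```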
<p>The degree of the extension is 3. Three is not a power of 2. Therefore $\cos(2\pi/7)$ cannot be reached by any tower of quadratic extensions of $\mathbb{Q}$. Therefore the regular heptagon is not constructible. $\square$</p>
<p>Compare the pentagon: $\cos(2\pi/5) = (\sqrt{5}-1)/4$, satisfying the quadratic $4x^2 + 2x - 1 = 0$. Degree 2 — a power of 2. Constructible.</p>
<p>The 17-gon: the Galois group of $\mathbb{Q}(\zeta_{17})/\mathbb{Q}$ is $(\mathbb{Z}/17\mathbb{Z})^* \cong \mathbb{Z}/16\mathbb{Z}$, order $16 = 2^4$. The extension decomposes into four quadratic steps. This is exactly what Gauss computed at 18.</p>
<p>For 7: $(\mathbb{Z}/7\mathbb{Z})^* \cong \mathbb{Z}/6\mathbb{Z}$, order $6 = 2 \times 3$. The factor of 3 is the obstruction. The Galois group is not a 2-group, so the extension cannot be decomposed into quadratic steps. The heptagon is out of reach.</p>
<h2 id="sacred-precisely">Sacred, Precisely</h2>
<p>The phrase &ldquo;sacred geometry&rdquo; usually does work that &ldquo;elegant mathematics&rdquo; could do more honestly. But the heptagon is a case where something with genuine mathematical content sits underneath the mystical framing.</p>
<p>The Platonic tradition held that certain geometric forms exist as ideals — perfect, unchanging, more real than their physical approximations. The philosopher&rsquo;s claim is that the heptagon exists in a realm beyond its material instantiation. The mathematician&rsquo;s claim is: the heptagon is perfectly well-defined — seven equal sides, seven equal angles — but it cannot be reached from $\mathbb{Q}$ by the operations available to ruler and compass. You can approximate it to any desired precision. You can construct it exactly using origami, which allows angle trisection and is strictly more powerful than compass and straightedge <a href="#ref-4">[4]</a>. But the classical constructive program — the one that reaches the pentagon, the hexagon, the 17-gon, the 65537-gon — cannot reach the heptagon.</p>
<p>There is a precise mathematical sense in which it lies outside the constructible world. Whether that constitutes sacredness is a question for a different kind of argument. But it is not nothing. The Pythagoreans were working without Galois theory; they had an intuition without the theorem. The theorem, when it came, confirmed that intuition about seven while explaining it more clearly than they could.</p>
<p>Carey&rsquo;s intuition — that 7 sits outside the ordinary — is, by this route, formally correct.</p>
<h2 id="what-the-heptagram-is">What the Heptagram Is</h2>
<p>The regular heptagon may be impossible to construct exactly, but the heptagram — the seven-pointed star — is perfectly drawable. Connecting every second vertex of an approximate regular heptagon gives $\{7/2\}$ in Schläfli notation <a href="#ref-5">[5]</a>; connecting every third vertex gives $\{7/3\}$. Both are closed figures. Both appear throughout pre-modern symbolic traditions, which is unsurprising: they are the most intricate star polygons drawable with a single pen stroke before complexity outruns visibility.</p>
<p>They are also generators of rhythmic structure. Because 7 is prime, every star polygon on seven points visits all seven vertices in a single closed traversal — a property that does not hold for six-pointed or eight-pointed stars. This turns out to matter for how drum patterns are built across multiple bars. That connection — from the primality of 7 to the architecture of rhythmic accent cycles — is the subject of the companion post, <a href="/posts/tool-star-polygons-drum-machines/">Star Polygons and Drum Machines</a>.</p>
<p>The broader series on mathematics in Tool&rsquo;s music began with the Fibonacci structure embedded in the time signatures and syllable counts of &ldquo;Lateralus&rdquo; <a href="#ref-6">[6]</a>, and the group-theoretic structure underlying twelve-tone equal temperament provides the same algebraic scaffolding seen here <a href="#ref-7">[7]</a>.</p>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Gauss, C.F. (1801). <em>Disquisitiones Arithmeticae</em>. Leipzig: Fleischer. (§VII.)</p>
<p><span id="ref-2"></span>[2] Stewart, I. (2004). <em>Galois Theory</em> (3rd ed.). CRC Press. Ch. 4.</p>
<p><span id="ref-3"></span>[3] Conway, J.H. &amp; Guy, R.K. (1996). <em>The Book of Numbers</em>. Springer. pp. 190–202.</p>
<p><span id="ref-4"></span>[4] Hull, T. (2011). Solving cubics with creases: The work of Beloch and Lill. <em>The American Mathematical Monthly</em>, 118(4), 307–315. <a href="https://doi.org/10.4169/amer.math.monthly.118.04.307">DOI: 10.4169/amer.math.monthly.118.04.307</a></p>
<p><span id="ref-5"></span>[5] Coxeter, H.S.M. (1973). <em>Regular Polytopes</em> (3rd ed.). Dover. Ch. 2.</p>
<p><span id="ref-6"></span>[6] See <a href="/posts/fibonacci-lateralus/">Fibonacci and Lateralus</a> on this blog.</p>
<p><span id="ref-7"></span>[7] See <a href="/posts/twelve-tet-group-theory-musical-tuning/">Twelve-TET and Group Theory</a> on this blog.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Twelve Is Not an Accident: The Group Theory of Musical Tuning</title>
      <link>https://sebastianspicker.github.io/posts/twelve-tet-group-theory-musical-tuning/</link>
      <pubDate>Fri, 15 Dec 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/twelve-tet-group-theory-musical-tuning/</guid>
      <description>Why does the octave divide into twelve? The answer lies not in aesthetics but in the continued-fraction convergents of log₂(3/2) — and the same group structure that gives Messiaen his modes.</description>
      <content:encoded><![CDATA[<p>Sit down at a piano and count the keys in one octave. Twelve. Seven white, five black, twelve total pitch classes before the pattern repeats. Ask a musician why twelve and they will probably say something about Western tradition, the church modes, or maybe vaguely gesture at the circle of fifths. Ask a musicologist and you might hear about Pythagoras, or the development of equal temperament in the Baroque period, or the well-tempered tuning systems of J. S. Bach. All of that history is real and worth knowing. But none of it explains <em>why</em> the number 12 works, and why every serious attempt at a usable keyboard instrument across widely separated cultures converges on the same cardinality.</p>
<p>The real answer is in number theory. Specifically, it is in the continued fraction expansion of a single irrational number: $\log_2(3/2)$. The number 12 is not a cultural choice. It is the smallest integer that gives a genuinely good rational approximation to that number — subject to the constraint that a human hand can navigate the resulting keyboard. Once you see the argument, the feeling of contingency evaporates completely. Twelve is forced on us.</p>
<p>Along the way, the same mathematical structure — the cyclic group $\mathbb{Z}_{12}$ — explains why Messiaen&rsquo;s modes of limited transposition exist, why the circle of fifths closes exactly, and why certain chord types (augmented triads, diminished seventh chords, the whole-tone scale) have a strange self-similar quality that composers have exploited for centuries. If you want the full treatment of the Messiaen connection, I wrote a dedicated post: <a href="/posts/messiaen-modes-group-theory/">Messiaen, Modes, and the Group Theory of Harmony</a>. Here I want to build the foundations from scratch, starting with the one interval that makes all of this necessary.</p>
<hr>
<h2 id="the-interval-that-started-everything">The interval that started everything</h2>
<p>The perfect fifth has a frequency ratio of exactly 3:2. Play two strings in that ratio and the sound is stable, open, and unmistakably consonant — second only to the octave (2:1) in the hierarchy of simple intervals. The reason is physics: the overtone series of any vibrating string includes the fundamental frequency $f$, then $2f$, $3f$, $4f$, and so on. Two notes a perfect fifth apart share the overtone at $3f$ (for the lower note) and $2f'$ (for the upper note, where $f' = 3f/2$): those are the same frequency, $3f$. Shared overtones mean the two notes reinforce rather than fight each other. This is why the fifth sounds stable: it is literally built into the harmonic structure of physical vibration.</p>
<p>Humans discovered the fifth independently in ancient Greece, China, India, and Mesopotamia. It is not a cultural artifact <a href="#ref-4">[4]</a>. Given that stability, it is natural to ask: can we build a complete pitch system by stacking fifths? Take a starting note, go up a fifth, up another, up another, and keep going. The notes you produce — C, G, D, A, E, B, F♯, … — are acoustically related to the starting point in a simple way, and they sound good together. This is the Pythagorean tuning system, and it underlies the construction of diatonic scales.</p>
<p>But here is the problem. A fifth raises the pitch by a factor of 3/2. An octave raises it by a factor of 2. These are independent: one is a power of 3 and the other a power of 2, and no power of 3/2 will ever equal a power of 2 exactly. In the language of modern mathematics, $\log_2(3/2)$ is irrational — this follows directly from the fundamental theorem of arithmetic, since no product of powers of 2 can equal a product of powers of 3. It is in fact transcendental as well: if $\log_2 3$ were algebraic and irrational, the Gelfond&ndash;Schneider theorem would make $2^{\log_2 3} = 3$ transcendental, a contradiction; so $\log_2 3$, and with it $\log_2(3/2)$, is transcendental. What matters for tuning, though, is the irrationality alone. Stacking pure fifths and stacking octaves are incommensurable operations. The circle of fifths can never close in pure Pythagorean tuning. We will always end up slightly sharp or flat relative to where we started.</p>
<p>This incommensurability is the central problem of musical tuning. Everything else — equal temperament, just intonation, meantone tuning, the Pythagorean comma, the whole apparatus of tuning theory — is a response to it.</p>
<hr>
<h2 id="equal-temperament-and-the-approximation-problem">Equal temperament and the approximation problem</h2>
<p>In an equal temperament with $N$ notes per octave, we divide the octave into $N$ equal logarithmic steps. Each step corresponds to a frequency ratio of $2^{1/N}$. We then ask: how many steps $k$ gives the best approximation to a perfect fifth?</p>
<p>The condition is simply that $2^{k/N}$ should be close to $3/2$, which means $k/N$ should be close to $\log_2(3/2)$. So we need a good rational approximation to</p>
$$\log_2\!\left(\frac{3}{2}\right) = \log_2 3 - 1 \approx 0.584962\ldots$$<p>The classical tool for finding best rational approximations is the continued fraction. Any real number $x$ can be written as</p>
$$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}}$$<p>where the $a_i$ are non-negative integers (positive for $i \geq 1$), called the partial quotients. For $\log_2(3/2)$ the expansion is</p>
$$\log_2\!\left(\frac{3}{2}\right) = [0;\, 1,\, 1,\, 2,\, 2,\, 3,\, 1,\, 5,\, 2,\, 23,\, 2,\, \ldots]$$<p>The truncated continued fractions — the convergents — give the sequence of best rational approximations:</p>
$$\frac{0}{1},\quad \frac{1}{1},\quad \frac{1}{2},\quad \frac{3}{5},\quad \frac{7}{12},\quad \frac{24}{41},\quad \frac{31}{53},\quad \frac{179}{306},\quad \ldots$$<p>Each convergent $k/N$ corresponds to a tuning system: the denominator $N$ is the number of equal steps per octave, and the numerator $k$ is the number of steps that best approximates a fifth. So we get: 1-TET (trivial), 2-TET (trivial), 5-TET, 12-TET, 41-TET, 53-TET, 306-TET, and so on <a href="#ref-1">[1]</a>, <a href="#ref-2">[2]</a>.</p>
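<p>The convergent table is easy to reproduce numerically. A sketch using only the standard library (double-precision <code>log2</code> is accurate enough for the first several convergents; the function name is mine):</p>

```python
from fractions import Fraction
from math import log2

def convergents(x, count):
    """First `count` continued-fraction convergents of x > 0, as Fractions."""
    out = []
    h0, k0, h1, k1 = 0, 1, 1, 0           # seeds for the standard recurrence
    for _ in range(count):
        a = int(x)                         # next partial quotient
        h0, k0, h1, k1 = h1, k1, a * h1 + h0, a * k1 + k0
        out.append(Fraction(h1, k1))
        x = 1 / (x - a)                    # safe: log2(3/2) is irrational
    return out

cs = convergents(log2(3 / 2), 8)
assert cs == [Fraction(n, d) for n, d in
              [(0, 1), (1, 1), (1, 2), (3, 5), (7, 12),
               (24, 41), (31, 53), (179, 306)]]
```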
<p>The key property of convergents is that they give <em>uniquely good</em> approximations. No rational number with a smaller denominator comes closer to the true value than a convergent does. So 7/12 is not merely a decent approximation to $\log_2(3/2)$ — it is provably the best approximation with denominator at most 12, and no denominator smaller than 41 does better.</p>
<p>To put numbers on it: in 12-TET, the fifth is $2^{7/12} \approx 1.498307\ldots$, while the true fifth is exactly $1.500000$. The error is about 0.11%, or roughly 2 cents (hundredths of a semitone). In 53-TET, the fifth is $2^{31/53} \approx 1.499941\ldots$, an error of less than 0.004%, about 0.07 cents — essentially indistinguishable from pure. Both 12 and 53 are convergents. Intermediate values like 19-TET or 31-TET are not convergents (they are not best approximations), and their fifths, while sometimes used in experimental or microtonal music, are less accurate relative to their complexity.</p>
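<p>Those error figures take one line each to verify (a sketch; the 2-cent and 0.07-cent values in the text are rounded):</p>

```python
from math import log2

cents = lambda ratio: 1200 * log2(ratio)   # 1200 cents per octave

pure_fifth = cents(3 / 2)                  # ≈ 701.955 cents
err_12 = pure_fifth - 700                  # 12-TET fifth = 7 steps of 100 cents
err_53 = pure_fifth - 1200 * 31 / 53       # 53-TET fifth = 31 of 53 steps

assert abs(err_12 - 1.955) < 0.001         # roughly 2 cents flat
assert 0.06 < err_53 < 0.08                # roughly 0.07 cents flat
```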
<p>Why does this matter? Because a tuning system that approximates the fifth poorly will produce harmonies that beat audibly — the slight mistuning causes the sound to waver in a way that trained ears find uncomfortable in sustained chords. A good fifth approximation is not a luxury; it is the condition for the system to be musically usable in the harmonic practice that most of the world&rsquo;s music assumes.</p>
<hr>
<h2 id="the-pythagorean-comma">The Pythagorean comma</h2>
<p>Before equal temperament became standard (roughly the 18th century in Western Europe), instruments were tuned using pure Pythagorean fifths: exact 3:2 ratios, stacked on top of each other. This gives beautiful, stable individual fifths, but it collects a debt.</p>
<p>After stacking 12 pure fifths, you have climbed in frequency by $(3/2)^{12}$:</p>
$$(3/2)^{12} = \frac{3^{12}}{2^{12}} = \frac{531441}{4096} \approx 129.746\ldots$$<p>Meanwhile, 7 octaves is $2^7 = 128$. The ratio between these is</p>
$$\frac{(3/2)^{12}}{2^7} = \frac{3^{12}}{2^{19}} = \frac{531441}{524288} \approx 1.01364$$<p>This is the Pythagorean comma: roughly 23.46 cents, or about a quarter of a semitone <a href="#ref-4">[4]</a>. In Pythagorean tuning, the circle of fifths never closes. After 12 fifths you arrive at a note that is nominally the same pitch class as the starting point — but sharp by 23.46 cents. That final fifth, the one that &ldquo;should&rdquo; close the circle, sounds badly out of tune. It was historically called the &ldquo;wolf fifth&rdquo; because it howls.</p>
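<p>In cents, the comma and its per-fifth share can be checked directly (a sketch):</p>

```python
from math import log2

# Twelve pure fifths overshoot seven octaves by the Pythagorean comma:
comma = 1200 * log2((3 / 2) ** 12 / 2 ** 7)   # in cents

assert abs(comma - 23.46) < 0.005             # about a quarter semitone
assert abs(comma / 12 - 1.955) < 0.001        # flattening per tempered fifth
```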
<p>Equal temperament solves this by distributing the comma across all 12 fifths. Each fifth is flattened by $23.46/12 \approx 1.955$ cents. The individual fifths are no longer pure, but the error is small enough to be acceptable — and crucially, it is <em>uniform</em>, so every key sounds equally good (or equally impure, depending on your perspective).</p>
<p>The Pythagorean comma being small — about 1.96% of the octave — is precisely why 12-TET works. It is small because 7/12 is an unusually good convergent of $\log_2(3/2)$. The two facts are the same fact. Measured in octaves, the comma is exactly $12\log_2(3/2) - 7$: the error of the approximation $7/12$, multiplied up by 12 fifths&rsquo; worth of accumulation. When the approximation is good, the comma is small, and the distribution is imperceptible. This is why the piano is tuned the way it is.</p>
<hr>
<h2 id="the-group-theory">The group theory</h2>
<p>We are now ready for the algebra. In 12-TET, pitch classes form the set $\{0, 1, 2, \ldots, 11\}$ where we identify 0 with C, 1 with C♯, 2 with D, 3 with D♯, 4 with E, 5 with F, 6 with F♯, 7 with G, 8 with G♯, 9 with A, 10 with A♯, and 11 with B. Addition is modulo 12: after 11 comes 0 again, because after B comes C in the next octave (same pitch class). This is $\mathbb{Z}_{12}$, the integers mod 12, and it is a group under addition <a href="#ref-1">[1]</a>.</p>
<p>Transposition by a semitone is addition of 1. Transposition by a perfect fifth is addition of 7, because the fifth is 7 semitones in 12-TET. Start from C (0) and repeatedly add 7, always reducing modulo 12:</p>
$$0 \to 7 \to 14 \equiv 2 \to 9 \to 16 \equiv 4 \to 11 \to 18 \equiv 6 \to 13 \equiv 1 \to 8 \to 15 \equiv 3 \to 10 \to 17 \equiv 5 \to 12 \equiv 0$$<p>In note names: C, G, D, A, E, B, F♯, C♯, G♯, D♯/E♭, A♯/B♭, F, C. That is the circle of fifths — all 12 pitch classes visited exactly once before returning to the start. The circle of fifths is the orbit of 0 under repeated addition of 7 in $\mathbb{Z}_{12}$.</p>
<p>Why does the orbit visit all 12 elements? Because $\gcd(7, 12) = 1$. This is Bézout&rsquo;s identity at work in cyclic groups: an element $g$ generates $\mathbb{Z}_n$ (i.e., its orbit under repeated addition covers all of $\mathbb{Z}_n$) if and only if $\gcd(g, n) = 1$. The generators of $\mathbb{Z}_{12}$ are exactly the elements coprime to 12: that is $\{1, 5, 7, 11\}$. Musically: transposition by 1 semitone (chromatic scale), by 5 semitones (perfect fourth), by 7 semitones (perfect fifth), or by 11 semitones (major seventh) each generates all 12 pitch classes. Transposition by 2 (a whole tone) does not — it produces only the 6-element whole-tone scale. Transposition by 3 (a minor third) produces only the 4-element diminished seventh chord.</p>
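<p>The orbit computation is a three-line loop (a sketch; the function name is mine):</p>

```python
def orbit(step, n=12):
    """Pitch classes reached from 0 by repeatedly adding `step` mod n."""
    seq, v = [0], step % n
    while v != 0:
        seq.append(v)
        v = (v + step) % n
    return seq

# Transposition by a fifth (7 semitones) visits every pitch class once:
assert orbit(7) == [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]
# The generators of Z_12 are exactly the steps coprime to 12:
assert [s for s in range(1, 12) if len(orbit(s)) == 12] == [1, 5, 7, 11]
# A whole tone (2) yields only the whole-tone scale; a minor third (3)
# only the diminished seventh chord:
assert sorted(orbit(2)) == [0, 2, 4, 6, 8, 10]
assert sorted(orbit(3)) == [0, 3, 6, 9]
```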
<p>This is not a curiosity; it is the algebraic skeleton of tonal music. The circle of fifths closes because 7 and 12 are coprime. That coprimality is guaranteed by the continued fraction structure: the convergent recurrence always delivers numerator and denominator in lowest terms, hence coprime, and 7/12 is such a convergent.</p>
<p>Now consider the subgroups of $\mathbb{Z}_{12}$. By Lagrange&rsquo;s theorem, subgroups of a finite group must have orders dividing the group order. The divisors of 12 are 1, 2, 3, 4, 6, and 12, so these are the only possible subgroup orders. For cyclic groups there is exactly one subgroup of each order dividing $n$, and it is generated by $n/d$ where $d$ is the subgroup order. The full list:</p>
<p>The trivial subgroup of order 1 is just $\{0\}$. The subgroup of order 2 is $\{0, 6\}$, generated by 6 — that is, the tritone axis, the interval of exactly half an octave. The subgroup of order 3 is $\{0, 4, 8\}$, generated by 4 — this is the augmented triad, three notes equally spaced around the octave by major thirds. The subgroup of order 4 is $\{0, 3, 6, 9\}$, generated by 3 — the diminished seventh chord, four notes equally spaced by minor thirds. The subgroup of order 6 is $\{0, 2, 4, 6, 8, 10\}$, generated by 2 — the whole-tone scale. And the full group of order 12 is all of $\mathbb{Z}_{12}$.</p>
<p>Each of these has a musical life. The augmented triad ($\{0, 4, 8\}$) sounds ambiguous because it maps onto itself under transposition by a major third — there are only 4 distinct augmented triads total, not 12. Composers exploit this ambiguity when they want harmonic instability without committing to a direction. The diminished seventh ($\{0, 3, 6, 9\}$) is similarly ambiguous: it has only 3 distinct forms and can resolve to any of several keys, which is why it appears so often at structural pivots in Romantic music. These properties are direct consequences of the subgroup structure of $\mathbb{Z}_{12}$.</p>
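<p>The whole subgroup list can be generated mechanically (a sketch; the function name is mine):</p>

```python
def subgroup(order, n=12):
    """The unique subgroup of Z_n of the given order, generated by n // order."""
    g = n // order
    return sorted((g * i) % n for i in range(order))

assert subgroup(2) == [0, 6]                # tritone axis
assert subgroup(3) == [0, 4, 8]             # augmented triad
assert subgroup(4) == [0, 3, 6, 9]          # diminished seventh chord
assert subgroup(6) == [0, 2, 4, 6, 8, 10]   # whole-tone scale
```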
<hr>
<h2 id="messiaens-modes-as-cosets">Messiaen&rsquo;s modes as cosets</h2>
<p>Olivier Messiaen described his &ldquo;modes of limited transposition&rdquo; in his 1944 treatise <em>Technique de mon langage musical</em>. He identified seven scales — including the whole-tone scale and the octatonic scale — that have the peculiar property of mapping onto themselves under some transposition strictly smaller than an octave. He found them by ear, by introspection, and by exhaustive search at the keyboard. He did not have the group theory. But the group theory makes their existence not merely explainable but <em>inevitable</em>.</p>
<p>Here is the key definition. A scale $S \subseteq \mathbb{Z}_{12}$ is a mode of limited transposition if there exists some $t \in \{1, 2, \ldots, 11\}$ such that $S + t \equiv S \pmod{12}$ (as a set). In other words, transposing the scale by $t$ semitones maps the scale onto itself. The integer $t$ is called a period of the scale.</p>
<p>Now, the set of all periods of $S$ — together with 0 — forms a subgroup of $\mathbb{Z}_{12}$ (it is closed under addition modulo 12, since if both $t_1$ and $t_2$ are periods then so is $t_1 + t_2$). Call this subgroup $H$. The condition for $S$ to be a mode of limited transposition is simply that $H$ is nontrivial — that is, $H \neq \{0\}$.</p>
<p>Moreover, if $H$ is the period subgroup of $S$, then $S$ must be a union of cosets of $H$ in $\mathbb{Z}_{12}$. This follows immediately from the fact that $H$ acts on $S$ by translation and maps $S$ to itself: every element of $S$ belongs to exactly one coset of $H$, and $S$ is a union of whole cosets. The size of $S$ must therefore be a multiple of $|H|$.</p>
<p>The whole-tone scale $\{0, 2, 4, 6, 8, 10\}$ is itself the unique subgroup of order 6 in $\mathbb{Z}_{12}$. Its period subgroup is the whole-tone scale itself. Transposing by any even number (2, 4, 6, 8, or 10) maps it to itself. Transposing by an odd number gives the complementary whole-tone scale $\{1, 3, 5, 7, 9, 11\}$. There are therefore only 2 distinct transpositions of the whole-tone scale, not 12.</p>
<p>The octatonic (diminished) scale $\{0, 1, 3, 4, 6, 7, 9, 10\}$ has period subgroup $\{0, 3, 6, 9\}$ — the subgroup of order 4. It is a union of two cosets: $\{0, 3, 6, 9\}$ itself and $\{1, 4, 7, 10\}$. Transposing by 3 maps it onto itself. There are only 3 distinct transpositions. Messiaen calls this his Mode 2.</p>
<p>The general formula is clean: a mode of limited transposition with period subgroup of order $d$ has exactly $12/d$ distinct transpositions. For the whole-tone scale, $d = 6$ gives $12/6 = 2$ transpositions. For the octatonic scale, $d = 4$ gives $12/4 = 3$ transpositions.</p>
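<p>The period-subgroup test is a direct translation of the definition (a sketch; the scale sets use the pitch-class numbering above, and the function name is mine):</p>

```python
def periods(scale, n=12):
    """All t with (scale + t) mod n == scale as a set; always a subgroup of Z_n."""
    s = frozenset(scale)
    return {t for t in range(n) if frozenset((p + t) % n for p in s) == s}

whole_tone = {0, 2, 4, 6, 8, 10}
octatonic = {0, 1, 3, 4, 6, 7, 9, 10}

assert periods(whole_tone) == {0, 2, 4, 6, 8, 10}   # order 6 → 12/6 = 2 transpositions
assert periods(octatonic) == {0, 3, 6, 9}           # order 4 → 12/4 = 3 transpositions
assert periods({0, 2, 4, 5, 7, 9, 11}) == {0}       # the major scale is not limited
```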
<p>What Messiaen found by ear was the complete classification of subsets of $\mathbb{Z}_{12}$ that are unions of cosets of a nontrivial subgroup <a href="#ref-5">[5]</a>. The group theory makes their existence a theorem rather than a discovery. I find this genuinely beautiful: a composer&rsquo;s intuition about harmonic symmetry turns out to be an exercise in the theory of cosets of cyclic groups. For the full analysis of each of Messiaen&rsquo;s seven modes in these terms, see <a href="/posts/messiaen-modes-group-theory/">Messiaen, Modes, and the Group Theory of Harmony</a>.</p>
<hr>
<h2 id="why-not-53">Why not 53?</h2>
<p>Given that 53-TET approximates the fifth with an error of less than 0.004% — compared to 12-TET&rsquo;s 0.11% — one might ask why we do not simply use 53-TET. The mathematical case is overwhelming. In addition to the nearly perfect fifth, 53-TET gives excellent approximations to the just major third (frequency ratio 5:4) and the just minor third (6:5). It was seriously advocated by the 19th-century theorist Robert Holford Macdowall Bosanquet, who even built a 53-key harmonium to demonstrate it. The Chinese theorist Jing Fang described a 53-note system in the 1st century BC. The Arabic music theorist Al-Farabi considered 53-division scales in the 10th century. Everyone who has ever thought carefully about tuning arrives at 53 eventually.</p>
<p>And yet no 53-TET instrument has ever entered widespread use. The reason is anatomical, not mathematical. A piano with 53 keys per octave spans well over a metre per octave at standard key widths — impossible to play. A guitar with 53 frets per octave has frets spaced roughly 3–4 millimetres apart in the upper register: no human fingertip is narrow enough to press a single fret without touching its neighbours. Even if you could play it, reading 53-TET notation would require an entirely new theoretical and pedagogical apparatus.</p>
<p>The constraint is: we want the largest $N$ such that (a) $N$ is a convergent denominator of $\log_2(3/2)$, so the fifth approximation is genuinely good, and (b) $N$ is small enough to navigate with human hands and readable at a glance. The convergent denominators are 1, 2, 5, 12, 41, 53, 306, &hellip; Of these, 12 is the largest that satisfies condition (b). The next convergent, 41, already strains human dexterity — 41-TET keyboard instruments have been built experimentally but never mass-produced. At 53 the case is closed.</p>
<p>One might argue about where exactly the cutoff is, and reasonable people might draw it at 19 or 31 (which are not convergents but have other virtues). But the point is that 12 is not merely a local optimum found by trial and error. It is the specific value where the continued fraction and human physiology intersect.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>There is something I find genuinely satisfying about this argument. Music feels like the most human of activities — expressive, cultural, steeped in history and tradition. And yet the number 12, which lies at the foundation of so much of the world&rsquo;s music, is not a human choice at all. It is the continued-fraction convergent of an irrational number that was fixed by the physics of vibrating strings long before any human struck a tuning fork.</p>
<p>The circle of fifths closes because $\gcd(7, 12) = 1$: a fact about integers, not about culture. Messiaen&rsquo;s modes exist because $\mathbb{Z}_{12}$ has nontrivial proper subgroups: a fact about cyclic groups, not about 20th-century French aesthetics. The augmented triad sounds ambiguous because it is a coset of the order-3 subgroup of $\mathbb{Z}_{12}$: a fact about quotient groups, not about Romantic harmony conventions.</p>
<p>I came to music theory sideways — through acoustics, then signal processing, then the mathematics of scales. What surprised me, when I finally worked through the continued fraction argument properly, was not that the math existed but that it was so <em>tight</em>. There is essentially no freedom in the answer. Given the constraint that a musical scale should be built around the most consonant interval (after the octave), should form a closed group structure, and should be navigable by a human performer, the answer is 12. Not approximately 12, not 12 as a historical compromise. Exactly 12.</p>
<p>The number is not a tradition. It is a theorem.</p>
<hr>
<p>For more on related themes: the Fibonacci sequence and golden ratio in music appear in <a href="/posts/fibonacci-lateralus/">Fibonacci, Lateralus, and the Golden Ratio</a>. The Euclidean algorithm and rhythmic structure are explored in <a href="/posts/euclidean-rhythms/">Euclidean Rhythms</a> — a sister post to this one in the math-and-music thread. And for the physics of audio sampling rates, where a similar interplay of number theory and practical constraints forces another specific number, see <a href="/posts/why-44100-hz-cd-sampling-rate/">Why 44,100 Hz?</a>.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Balzano, G. J. (1980). The group-theoretic description of 12-fold and microtonal pitch systems. <em>Computer Music Journal</em>, 4(4), 66–84.</p>
<p><span id="ref-2"></span>[2] Carey, N., &amp; Clampitt, D. (1989). Aspects of well-formed scales. <em>Music Theory Spectrum</em>, 11(2), 187–206.</p>
<p><span id="ref-3"></span>[3] Milne, A., Sethares, W. A., &amp; Plamondon, J. (2007). Isomorphic controllers and dynamic tuning. <em>Computer Music Journal</em>, 31(4), 15–32.</p>
<p><span id="ref-4"></span>[4] Lloyd, L. S., &amp; Boyle, H. (1978). <em>Intervals, Scales and Temperaments</em>. St. Martin&rsquo;s Press.</p>
<p><span id="ref-5"></span>[5] Douthett, J., &amp; Steinbach, P. (1998). Parsimonious graphs: A study in parsimony, contextual transformations, and modes of limited transposition. <em>Journal of Music Theory</em>, 42(2), 241–263.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-11-20</strong>: Updated the spelling of &ldquo;Robert Holford Macdowall Bosanquet&rdquo; (previously rendered as &ldquo;Macdowell&rdquo;).</li>
<li><strong>2025-11-20</strong>: Changed &ldquo;about 1.36% of the octave&rdquo; to &ldquo;about 1.96% of the octave.&rdquo; The 1.36% figure is the frequency ratio above unity (531441/524288 ≈ 1.01364); the logarithmic fraction of the 1200-cent octave is 23.46/1200 ≈ 1.96%.</li>
<li><strong>2025-11-20</strong>: Changed &ldquo;12 octaves&rsquo; worth of accumulation&rdquo; to &ldquo;12 fifths&rsquo; worth of accumulation.&rdquo; The Pythagorean comma accumulates over 12 stacked fifths (which span approximately 7 octaves), not 12 octaves.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Non-Commutative Pre-Schoolers</title>
      <link>https://sebastianspicker.github.io/posts/non-commutative-pre-schoolers/</link>
      <pubDate>Mon, 13 Nov 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/non-commutative-pre-schoolers/</guid>
      <description>The same structural reason a toddler cannot put shoes on before socks is why position and momentum cannot be simultaneously measured. Non-commutativity is not exotic physics — it is the default logic of any ordered world.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>A three-year-old cannot put her shoes on before her socks. Not because she lacks motor skills —
because the operations do not commute.</p>
<p>The same structural constraint, dressed in the language of operators on a Hilbert space, is why
Heisenberg&rsquo;s uncertainty principle holds. This post is about that connection: the accidental
algebra lesson built into getting dressed, and why the physicists of 1925 had to abandon one of
arithmetic&rsquo;s most taken-for-granted assumptions.</p>
<h2 id="getting-dressed-is-a-non-abelian-problem">Getting Dressed Is a Non-Abelian Problem</h2>
<p>Start with the mundane. Your morning routine imposes a strict partial order on operations:
underwear before trousers, socks before shoes, cap before chin-strap if you cycle. Try reversing
any pair and the sequence fails — physically, not just socially. You cannot pull a sock over a shoe.</p>
<p>The operation &ldquo;put on socks&rdquo; followed by &ldquo;put on shoes&rdquo; produces a wearable human; the reverse
produces nothing wearable, and no amount of wishing commutativity into existence will help.</p>
<p>In the language of abstract algebra, two operations \(A\) and \(B\) <em>commute</em> if \(AB = BA\) —
if doing them in either order yields the same result. Everyday life is full of operations that do
not commute: rotate a book 90° around its vertical axis then 90° around its horizontal axis; now
reverse the order. The final orientations differ. Turn right then turn left while driving; left
then right. Different positions.</p>
<p>The intuition is not hard to build. What is surprising is how rarely we note it, and what it costs
us when we finally hit a domain — quantum mechanics — where non-commutativity is not an
inconvenient edge case but the central fact.</p>
<h2 id="piaget-said-seven-toddlers-disagreed">Piaget Said Seven; Toddlers Disagreed</h2>
<p>Jean Piaget argued that children do not acquire <em>operational thinking</em> — the ability to mentally
perform and reverse sequences of actions — until the <em>concrete operational stage</em>, roughly ages
seven to eleven (<a href="#ref-inhelder1958">Inhelder &amp; Piaget, 1958</a>). Before that, he claimed, children
lack the understanding that an operation can be undone or reordered.</p>
<p>Post-Piagetian research pushed back hard. Patricia Bauer and Jean Mandler tested infants aged
sixteen and twenty months on novel, multi-step action sequences (<a href="#ref-bauer1989">Bauer &amp; Mandler, 1989</a>).
For causally structured sequences — where step A physically enables step B — infants reproduced
the correct order after a two-week delay. They were not told the order was important. They had no
language to encode it. They just knew, implicitly, that the operations had a necessary direction.</p>
<p>A 2020 study by Klemfuss and colleagues tested 100 children aged roughly two-and-a-half to five on temporal ordering
questions (<a href="#ref-klemfuss2020">Klemfuss et al., 2020</a>). Children answered &ldquo;what happened first?&rdquo; questions
correctly 82% of the time. The errors that did appear followed an encoding-order bias — children
defaulted to reporting the next event in the sequence as originally experienced, regardless of
what was asked. The ordering knowledge was intact. What
children lack, for Piaget&rsquo;s full seven years, is the <em>formal</em> recursive conception of
reversibility. The <em>procedural</em> knowledge — that some sequences must be done in the right order
and cannot be freely rearranged — is there from the second year of life.</p>
<p>Which means: learning that \(AB \neq BA\) is not learning something exotic. It is articulating
something the nervous system already knows.</p>
<h2 id="the-mathematicians-commutator">The Mathematician&rsquo;s Commutator</h2>
<p>Abstract algebra formalized this intuition in the nineteenth century. A <em>group</em> is <em>abelian</em>
(commutative) if every pair of elements satisfies \(ab = ba\). Integers under addition: abelian.
Rotations in three dimensions: not.</p>
<p>Arthur Cayley&rsquo;s 1858 memoir established matrix algebra as a formal theory
(<a href="#ref-cayley1858">Cayley, 1858</a>). Multiply two \(2 \times 2\) matrices:</p>
$$
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
$$$$
AB = \begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix}, \quad
BA = \begin{pmatrix} 3 & 4 \\ 1 & 2 \end{pmatrix}
$$<p>\(AB \neq BA\). Non-commutativity is not a curiosity; it is the generic condition for matrix
products. Commutativity is the special case, the one that requires justification.</p>
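<p>Cayley&rsquo;s example takes a few lines to check (a sketch with a hand-rolled 2×2 product, to keep it dependency-free):</p>

```python
def matmul(A, B):
    """2x2 matrix product — enough to exhibit AB != BA."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

assert matmul(A, B) == [[2, 1], [4, 3]]
assert matmul(B, A) == [[3, 4], [1, 2]]
assert matmul(A, B) != matmul(B, A)   # the generic condition, not the exception
```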
<p>William Rowan Hamilton had already gone further. On 16 October 1843, walking along the Royal Canal
in Dublin, he discovered the quaternions and carved their multiplication rule into the stone of
Broom Bridge:</p>
$$
i^2 = j^2 = k^2 = ijk = -1
$$<p>From this it follows immediately that \(ij = k\) but \(ji = -k\). Hamilton&rsquo;s four-dimensional
number system — the first algebraic structure beyond the complex numbers — was non-commutative by
construction. He did not apologize for it. He celebrated it.</p>
<p>The group-theoretic machinery behind these commutator relations is the same skeleton that governs
Messiaen&rsquo;s modes of limited transposition, which I traced in <a href="/posts/messiaen-modes-group-theory/">a previous post on group theory and
music</a> — a very different physical domain, but closely related algebraic
machinery.</p>
<h2 id="born-jordan-and-the-physicists-shock">Born, Jordan, and the Physicist&rsquo;s Shock</h2>
<p>Classical mechanics treats position \(x\) and momentum \(p\) as ordinary real numbers. Real
numbers commute: \(xp = px\). The Poisson bracket \(\{x, p\} = 1\) encodes a classical
relationship, but the underlying quantities are scalars, and scalars commute.</p>
<p>In July 1925, Werner Heisenberg published a paper that could not quite bring itself to say what it
was doing (<a href="#ref-heisenberg1925">Heisenberg, 1925</a>). He replaced classical dynamical variables
with arrays of numbers — what we would now call matrices — and found, uncomfortably, that the
resulting quantum condition required order to matter.</p>
<p>While Heisenberg was on vacation, Max Born and Pascual Jordan finished the translation into matrix
language (<a href="#ref-bornjordan1925">Born &amp; Jordan, 1925</a>). They wrote the commutation relation
explicitly, recognized it as the fundamental law, and showed that it reproduced the known quantum
results:</p>
$$
[\hat{x}, \hat{p}] = \hat{x}\hat{p} - \hat{p}\hat{x} = i\hbar
$$<p>Non-commutativity of position and momentum was not a mathematical accident. It was the theory.</p>
<p>The uncertainty principle followed four years later as a <em>theorem</em>, not an additional postulate.
Howard Robertson proved in 1929 that for any two observables \(\hat{A}\) and \(\hat{B}\), the
Cauchy–Schwarz inequality on Hilbert space yields (<a href="#ref-robertson1929">Robertson, 1929</a>):</p>
$$
\Delta A \cdot \Delta B \geq \frac{1}{2} \left| \langle [\hat{A}, \hat{B}] \rangle \right|
$$<p>Substituting \(\hat{A} = \hat{x}\), \(\hat{B} = \hat{p}\), \([\hat{x}, \hat{p}] = i\hbar\):</p>
$$
\Delta x \cdot \Delta p \geq \frac{\hbar}{2}
$$<p>This is the uncertainty principle. It does not say nature is fuzzy or that measurement disturbs
systems in some vague intuitive sense. It says: position and momentum are operators that do not
commute, and the Robertson inequality then constrains their joint variance. Non-commutativity <em>is</em>
the uncertainty principle. Put the shoes on before the socks and the state is not defined.</p>
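<p>The Robertson bound can be verified numerically. A minimal sketch (my illustration, using spin-1/2 operators with \(\hbar = 1\)) checks \(\Delta A \cdot \Delta B \geq \frac{1}{2}|\langle [\hat{A}, \hat{B}] \rangle|\) on a randomly chosen state:</p>

```python
import numpy as np

hbar = 1.0
# Spin-1/2 operators: S_i = sigma_i / 2 (hbar = 1)
Sx = np.array([[0, 1], [1, 0]]) * (hbar / 2)
Sy = np.array([[0, -1j], [1j, 0]]) * (hbar / 2)
Sz = np.array([[1, 0], [0, -1]]) * (hbar / 2)

# A random normalized state
rng = np.random.default_rng(0)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
psi /= np.linalg.norm(psi)

def expval(op, psi):
    return (psi.conj() @ op @ psi).real

def stdev(op, psi):
    return np.sqrt(expval(op @ op, psi) - expval(op, psi) ** 2)

lhs = stdev(Sx, psi) * stdev(Sy, psi)
comm = Sx @ Sy - Sy @ Sx          # equals i * hbar * Sz
rhs = 0.5 * abs(psi.conj() @ comm @ psi)

assert np.allclose(comm, 1j * hbar * Sz)
assert lhs >= rhs - 1e-12          # Robertson inequality
```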
<p>The same logic applies to angular momentum. The three components satisfy:</p>
$$
[\hat{L}_x, \hat{L}_y] = i\hbar \hat{L}_z, \quad
[\hat{L}_y, \hat{L}_z] = i\hbar \hat{L}_x, \quad
[\hat{L}_z, \hat{L}_x] = i\hbar \hat{L}_y
$$<p>This is the Lie algebra \(\mathfrak{su}(2)\). You cannot simultaneously determine two components
of angular momentum to arbitrary precision — not because the measurement apparatus is noisy, but
because the operations of measuring them do not commute.</p>
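<p>These relations are representation-independent. A short check (my sketch, using the standard spin-1 matrices with \(\hbar = 1\)) confirms the cyclic commutators and the Casimir \(\hat{L}^2 = \ell(\ell+1)\hbar^2\) for \(\ell = 1\):</p>

```python
import numpy as np

hbar = 1.0
s = 1 / np.sqrt(2)
# Standard spin-1 representation of the angular momentum algebra
Lx = hbar * s * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=complex)
Ly = hbar * s * np.array([[0, -1j, 0], [1j, 0, -1j], [0, 1j, 0]])
Lz = hbar * np.diag([1, 0, -1]).astype(complex)

def comm(A, B):
    return A @ B - B @ A

# The cyclic su(2) relations hold in every representation
assert np.allclose(comm(Lx, Ly), 1j * hbar * Lz)
assert np.allclose(comm(Ly, Lz), 1j * hbar * Lx)
assert np.allclose(comm(Lz, Lx), 1j * hbar * Ly)

# Casimir: L^2 = l(l+1) hbar^2 with l = 1, commuting with each component
L2 = Lx @ Lx + Ly @ Ly + Lz @ Lz
assert np.allclose(L2, 2 * hbar**2 * np.eye(3))
assert np.allclose(comm(L2, Lz), np.zeros((3, 3)))
```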
<p>The fiber bundle language that underlies these rotation groups also appears, in different physical
dress, in the problem of the falling cat and geometric phases — another case where the order of
rotations has non-trivial physical consequences (<a href="/posts/falling-cat-geometric-phase/">see that post</a>).</p>
<h2 id="connes-and-non-commutative-space">Connes and Non-Commutative Space</h2>
<p>Alain Connes asked what happens if we allow the coordinates of <em>space itself</em> to be
non-commutative. In ordinary geometry, the algebra of coordinate functions on a manifold is
commutative: \(f(x) \cdot g(x) = g(x) \cdot f(x)\). Connes&rsquo; non-commutative geometry replaces
this with a <em>spectral triple</em> \((\mathcal{A}, \mathcal{H}, D)\): an algebra \(\mathcal{A}\) of
operators (possibly non-commutative) acting on a Hilbert space \(\mathcal{H}\), with a
generalized Dirac operator \(D\) encoding the geometry (<a href="#ref-connes1994">Connes, 1994</a>).</p>
<p>The payoff was remarkable. With Ali Chamseddine, Connes showed that if \(\mathcal{A}\) is chosen as the functions on spacetime tensored with a specific finite non-commutative piece — a direct sum of the complex numbers, the quaternions, and a matrix algebra — the spectral action principle reproduces the full Lagrangian of the Standard Model coupled to general relativity from a single geometric principle
(<a href="#ref-chamseddine1996">Chamseddine &amp; Connes, 1996</a>). The Higgs field, the gauge bosons, the
graviton: all from the geometry of a non-commutative space.</p>
<p>Classical geometry is the special case where the coordinate algebra is commutative. Drop that
assumption and you open up a vastly richer landscape. Quantum mechanics lives in that landscape.
Possibly, so does the structure of spacetime at the Planck scale.</p>
<h2 id="the-lesson-pre-schoolers-already-know">The Lesson Pre-Schoolers Already Know</h2>
<p>There is an irony here that I cannot quite leave alone. Students learning linear algebra for the
first time consistently make the same mistake. Anna Sierpinska documented it carefully: they assume
\(AB = BA\) for matrices because they have spent years in arithmetic and scalar algebra where
multiplication commutes (<a href="#ref-sierpinska2000">Sierpinska, 2000</a>). The commutativity of ordinary
multiplication is so deeply internalized that abandoning it feels like breaking a rule.</p>
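<p>The counterexample that cures the habit is two lines long — a pair of shear matrices (my illustration, not Sierpinska&rsquo;s):</p>

```python
import numpy as np

# The smallest counterexample to the assumption AB = BA
A = np.array([[1, 1], [0, 1]])  # shear along one axis
B = np.array([[1, 0], [1, 1]])  # shear along the other

AB = A @ B
BA = B @ A
assert not np.array_equal(AB, BA)
# A @ B = [[2, 1], [1, 1]], while B @ A = [[1, 1], [1, 2]]
```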
<p>But the pre-schooler in the sock-and-shoe scenario never had that problem. Her procedural memory,
documented in infants as young as sixteen months by Bauer and Mandler, encoded the correct
asymmetry directly. The order of operations is the first thing a developing mind learns about
actions in the world, before the arithmetic of school teaches it the convenient fiction that order
is irrelevant.</p>
<p>Arithmetic is the outlier. \(3 + 5 = 5 + 3\) because counting does not depend on where you
start. But putting on clothes, multiplying matrices, rotating rigid bodies, measuring quantum
observables: these operations carry memory of order, and they repay the attention a child already
brings to them before she can name a number.</p>
<p>The universe is non-abelian. We are born knowing it. School briefly convinces us otherwise.
Physics eventually agrees with the pre-schooler.</p>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-inhelder1958"></span>Inhelder, B., &amp; Piaget, J. (1958). <em>The Growth of Logical Thinking from Childhood to Adolescence</em>. Basic Books.</li>
<li><span id="ref-bauer1989"></span>Bauer, P. J., &amp; Mandler, J. M. (1989). One thing follows another: Effects of temporal structure on 1- to 2-year-olds&rsquo; recall of events. <em>Developmental Psychology</em>, 25, 197–206.</li>
<li><span id="ref-klemfuss2020"></span>Klemfuss, J. Z., McWilliams, K., Henderson, H. M., Olaguez, A. P., &amp; Lyon, T. D. (2020). Order of encoding predicts young children&rsquo;s responses to sequencing questions. <em>Cognitive Development</em>, 55, 100927. <a href="https://doi.org/10.1016/j.cogdev.2020.100927">DOI: 10.1016/j.cogdev.2020.100927</a></li>
<li><span id="ref-cayley1858"></span>Cayley, A. (1858). A memoir on the theory of matrices. <em>Philosophical Transactions of the Royal Society of London</em>, 148, 17–37. <a href="https://doi.org/10.1098/rstl.1858.0002">DOI: 10.1098/rstl.1858.0002</a></li>
<li><span id="ref-heisenberg1925"></span>Heisenberg, W. (1925). Über quantentheoretische Umdeutung kinematischer und mechanischer Beziehungen. <em>Zeitschrift für Physik</em>, 33, 879–893.</li>
<li><span id="ref-bornjordan1925"></span>Born, M., &amp; Jordan, P. (1925). Zur Quantenmechanik. <em>Zeitschrift für Physik</em>, 34, 858–888.</li>
<li><span id="ref-robertson1929"></span>Robertson, H. P. (1929). The uncertainty principle. <em>Physical Review</em>, 34, 163–164. <a href="https://doi.org/10.1103/PhysRev.34.163">DOI: 10.1103/PhysRev.34.163</a></li>
<li><span id="ref-connes1994"></span>Connes, A. (1994). <em>Noncommutative Geometry</em>. Academic Press. ISBN 0-12-185860-X.</li>
<li><span id="ref-chamseddine1996"></span>Chamseddine, A. H., &amp; Connes, A. (1996). Universal formula for noncommutative geometry actions: Unification of gravity and the standard model. <em>Physical Review Letters</em>, 77, 4868–4871. <a href="https://doi.org/10.1103/PhysRevLett.77.4868">DOI: 10.1103/PhysRevLett.77.4868</a></li>
<li><span id="ref-sierpinska2000"></span>Sierpinska, A. (2000). On some aspects of students&rsquo; thinking in linear algebra. In J.-L. Dorier (Ed.), <em>On the Teaching of Linear Algebra</em> (pp. 209–246). Kluwer Academic Publishers. <a href="https://doi.org/10.1007/0-306-47224-4_8">DOI: 10.1007/0-306-47224-4_8</a></li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-02-03</strong>: Corrected the age range for the Klemfuss et al. (2020) study from &ldquo;two to four&rdquo; to &ldquo;roughly two-and-a-half to five&rdquo; — the actual participants were aged 30–61 months.</li>
<li><strong>2026-02-03</strong>: Updated the characterisation of Klemfuss et al. (2020) findings to reflect the paper&rsquo;s central result: errors follow an encoding-order bias (children default to the next event in encoding sequence). The paper&rsquo;s title — &ldquo;Order of encoding predicts young children&rsquo;s responses&rdquo; — names the mechanism.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>LK-99: Six Weeks That Showed How Physics Works</title>
      <link>https://sebastianspicker.github.io/posts/lk99-preprint-physics-sociology/</link>
      <pubDate>Mon, 09 Oct 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/lk99-preprint-physics-sociology/</guid>
      <description>On July 22, 2023, a Korean preprint claimed that LK-99 — a copper-doped lead apatite — was a room-temperature, ambient-pressure superconductor. Within six weeks, the claim was definitively falsified. The episode is more interesting for what it revealed about the sociology of science than for the compound itself: how a global community self-corrected at extraordinary speed, and how the media managed to fail at conveying uncertainty despite watching it happen in real time.</description>
      <content:encoded><![CDATA[<h2 id="july-22-2023">July 22, 2023</h2>
<p>On a Saturday morning in late July 2023, two preprints appeared on arXiv. They were submitted by researchers affiliated with the Quantum Energy Research Centre in Seoul — Sukbae Lee, Ji-Hoon Kim, and colleagues — and they claimed something that condensed matter physicists have been chasing for over a century: a material that superconducts at room temperature and ambient pressure.</p>
<p>The compound was called LK-99. It was a copper-doped lead apatite, synthesized from common precursors using a procedure that, on paper, any moderately equipped laboratory could attempt. The claimed critical temperature was above 400 K — well above 293 K, which is room temperature, which is roughly the temperature of a warm afternoon in Seoul in July.</p>
<p>A video circulated almost immediately. A small, grey, irregular piece of LK-99 appeared to be partially levitating — tilting up, one end raised — above a permanent neodymium magnet. In the video it wobbles slightly, like something caught between gravity and an invisible hand.</p>
<p>Physics Twitter — I will use that name; it was still recognizably that in July 2023 — detonated. Within 72 hours, laboratories across the world were racing to synthesize LK-99. Discord servers formed. GitHub repositories appeared with shared synthesis protocols. Preprints from independent groups began accumulating before the original authors had likely had a good night&rsquo;s sleep.</p>
<p>Six weeks later, the claim was dead.</p>
<p>I want to write about what happened in those six weeks, because I think the episode is more interesting as sociology of science than as condensed matter physics. LK-99 turned out to be a modest semiconductor with a ferromagnetic impurity. But the speed and the manner of that determination — the way a globally distributed community of physicists organized itself, shared data in real time, converged on a falsification, and then moved on — that is genuinely remarkable, and worth examining carefully.</p>
<h2 id="why-room-temperature-superconductivity-is-the-grail">Why Room-Temperature Superconductivity Is the Grail</h2>
<p>Let me be precise about why this particular claim generates the response it does.</p>
<p>Superconductivity is the phenomenon in which certain materials, below a critical temperature T<sub>c</sub>, carry electrical current with exactly zero resistance. Not very low resistance — zero. A current established in a superconducting loop will, in principle, continue flowing indefinitely without any driving voltage. This is not a small quantitative improvement over ordinary conductors; it is a qualitatively different regime of physics.</p>
<p>The trouble is that essentially all known superconductors require extreme cooling. Conventional metallic superconductors — the ones Heike Kamerlingh Onnes discovered in mercury in 1911 — become superconducting below about 30 K at best. Reaching those temperatures requires liquid helium cooling, which is expensive, logistically demanding, and entirely impractical for large-scale applications. The discovery of high-temperature cuprate superconductors in 1986 (Bednorz and Müller, Nobel Prize 1987) was genuinely revolutionary: some cuprates superconduct up to about 138 K. But 138 K is still −135°C. It requires liquid nitrogen cooling, which is cheaper than liquid helium but still not something you install in a power grid without substantial infrastructure.</p>
<p>The strongest surviving claims belong to a class of hydrogen-rich compounds under extreme pressure — lanthanum hydride superconducts at roughly 250 K, but only at around 170 GPa. (A 2020 claim of superconductivity at roughly 15°C in carbonaceous sulfur hydride at 267 GPa had already been retracted by 2022; more on that below.) For context, the pressure at the center of the Earth is about 360 GPa. You cannot run a power cable through a diamond anvil cell.</p>
<p>Room-temperature, ambient-pressure superconductivity would be transformative in a way that very few material discoveries are. Electrical grids currently lose somewhere between 5 and 10 percent of all transmitted energy to resistive heating — a staggering quantity of energy, simply dissipated as heat in cables. Zero-resistance transmission would eliminate that loss. Magnetically levitated transport would become feasible without the cryogenic infrastructure that makes current Maglev systems enormously expensive to build and maintain. Compact, affordable MRI machines would become possible. Effects on computing, on energy storage, on medical technology — the list runs long. It would be one of the most consequential material discoveries in the history of technology.</p>
<p>This is why the response to the LK-99 preprints was not hysteria but rather the entirely rational behavior of a community that understood exactly what was at stake if the claim were true.</p>
<h2 id="what-lk-99-was-and-what-it-claimed">What LK-99 Was and What It Claimed</h2>
<p>LK-99 is chemically expressed as Pb₁₀₋ₓCuₓ(PO₄)₆O, where x is approximately 0.9 to 1.1. It is a lead apatite — the same crystal family as the mineral in tooth enamel — with a fraction of the lead atoms replaced by copper.</p>
<p>The proposed mechanism, as sketched in the preprints, involved Cu²⁺ substituting for Pb²⁺. Because copper has a slightly smaller ionic radius than lead, this substitution induces a local structural distortion. The claim was that this distortion produces a flat electronic band at the Fermi level — and flat bands are associated with strong electronic correlations that can, in principle, give rise to unconventional superconductivity. The analogy to twisted bilayer graphene was implicit in the discussion, though the mechanism is quite different and twisted bilayer graphene superconducts only well below 1 K.</p>
<p>Reading the preprints in late July 2023 was, I confess, a slightly uncomfortable experience. The writing was rushed. The two preprints — submitted by different author subsets from the same group — were internally inconsistent in places. The resistance measurements showed a large drop with temperature, but not zero resistance. The synthesis protocol was described in enough detail to be reproducible, which was good, but the characterization was incomplete in ways that mattered.</p>
<p>Red flags were present from the beginning, and many physicists noted them immediately. The levitation video showed a piece of LK-99 that was tilted and wobbling — not the stable, complete expulsion of magnetic flux you would expect from a true Meissner effect. A perfect superconductor placed above a magnet would levitate horizontally and stably. This piece was doing something, but the something was not obviously Meissner levitation.</p>
<p>And yet. The synthesis was simple. The claim was specific and testable. If there was even a small chance it was real, the imperative to check was overwhelming. So labs checked.</p>
<h2 id="the-replication-wave">The Replication Wave</h2>
<p>What happened over the following weeks was, as far as I am aware, unprecedented in condensed matter physics.</p>
<p>Normally, a replication in physics looks like this: a group reads a paper, decides it is interesting enough to attempt, orders precursor materials, synthesizes the compound (which takes weeks to months), characterizes it with appropriate instruments (more weeks), writes up the results, submits them (more weeks), and eventually publishes — often six months to a year after the original claim, sometimes much longer. The feedback cycle is slow by design: slowness is a feature, not a bug, because it allows careful work rather than hasty work.</p>
<p>The LK-99 replication did not look like this.</p>
<p>Within a week, preprints from independent groups — China, India, the United States, Germany — were appearing on arXiv. Discord servers with hundreds of members were organizing synthesis attempts in real time, sharing thermograms, resistance measurements, and microscope images as they came off instruments. Twitter threads tracked emerging results with the urgency of a live event. A GitHub repository maintained by the community accumulated synthesis protocols, shared data files, and links to new preprints as they appeared.</p>
<p>Some groups reported partial levitation. Others reported anomalous resistance drops. Others — starting almost immediately — reported synthesizing the material and finding nothing unusual at all.</p>
<p>The speed of this was extraordinary not because of any particular organizational effort, but because the incentive structure happened to align with the infrastructure that now exists. Preprints made sharing immediate. Social media made results public the moment they existed. The synthesis was simple enough to attempt in any reasonably equipped solid-state chemistry lab. And the motivation — the prize, if it were real — was enormous. You would not need to tell anyone to work on this. You would have to tell people to stop.</p>
<p>By mid-August 2023 — three weeks after the original preprints — the key debunking papers had appeared. By late August, there was no serious scientific debate remaining.</p>
<h2 id="the-mechanism-of-falsification">The Mechanism of Falsification</h2>
<p>The levitating video was explained first, and the explanation is both mundane and instructive.</p>
<p>The LK-99 synthesis produces, as an essentially unavoidable impurity, copper sulfide — Cu₂S. Copper sulfide is interesting in its own right: it undergoes a structural phase transition at roughly 105°C (378 K) from a low-temperature chalcocite form to a high-temperature superionic conductor. This transition is accompanied by a large, sharp drop in electrical resistance — exactly the kind of anomalous feature that, in a sample of mixed composition, might be misidentified as a superconducting transition.</p>
<p>More importantly for the levitation: the LK-99 synthesis products ubiquitously contain ferromagnetic impurity phases. A ferromagnetic material will interact with a permanent magnet. Partial levitation, tilted and unstable, is entirely consistent with a ferromagnetic-diamagnetic competition — not with the Meissner effect.</p>
<p>Several groups published debunking papers in rapid succession. Kumar and colleagues (<a href="#ref-Senapati2023">Kumar et al., 2023</a>) reported the absence of superconductivity in LK-99 samples; other groups synthesized Cu₂S independently, confirmed its resistance anomaly near 380 K, and showed quantitatively that the LK-99 observations were fully consistent with Cu₂S contamination and ferromagnetic impurities. Liu and Meng (<a href="#ref-LiuMeng2023">Liu &amp; Meng, 2023</a>) provided a complementary symmetry analysis explaining why the structural distortion mechanism did not actually predict superconductivity.</p>
<p>Several Chinese groups with high-quality synthesis capabilities — and, frankly, strong motivation to find a positive result — produced very pure LK-99 samples and found what you would expect of a clean lead apatite: a semiconductor with modest diamagnetism. Nothing anomalous. When you removed the Cu₂S impurity, you removed the anomaly.</p>
<p>Daniel Garisto summarized the consensus in a <em>Nature</em> news piece in August 2023 (<a href="#ref-Garisto2023">Garisto, 2023</a>): LK-99 is not a superconductor. The case was closed, with an efficiency that the scientific community should be proud of.</p>
<h2 id="a-useful-contrast-ranga-dias">A Useful Contrast: Ranga Dias</h2>
<p>The LK-99 episode does not exist in isolation. The preceding years had seen other extraordinary claims of room-temperature or near-room-temperature superconductivity, and the most prominent involved Ranga Dias at the University of Rochester.</p>
<p>Dias published two papers in <em>Nature</em> claiming superconductivity at or near room temperature: one in 2020, describing carbonaceous sulfur hydride at roughly 15°C under 267 GPa (<a href="#ref-Snider2020">Snider et al., 2020</a>), and one in 2023, describing nitrogen-doped lutetium hydride under much lower pressure. There was precedent: the earlier Dias and Silvera <em>Science</em> paper on metallic hydrogen (<a href="#ref-DiasSilvera2017">Dias &amp; Silvera, 2017</a>) had received a significant erratum and been widely questioned. Both <em>Nature</em> papers were eventually retracted — the 2020 paper in 2022, the 2023 paper in November 2023 — amid serious and credible allegations of data manipulation. The criticisms included statistical anomalies in background signals, apparent image duplication across different experimental conditions, and raw data that did not match the published figures. Hirsch, who had been following these claims closely, documented many of the irregularities (<a href="#ref-Hirsch2021">Hirsch, 2021</a>).</p>
<p>The contrast with LK-99 is worth sitting with. The Korean team appears to have been guilty of honest overreach: genuine excitement about anomalous observations, insufficient characterization before posting, motivated interpretation of ambiguous data. This happens in science. Extraordinary rewards for being right create extraordinary pressure to believe you are right. The LK-99 researchers may have seen something they genuinely could not explain and convinced themselves it was what they hoped it was.</p>
<p>The Dias case, if the allegations of data manipulation are accurate — and the retractions, and the University of Rochester investigation that followed, suggest they have merit — is something different: not motivated misinterpretation but deliberate fabrication. The scientific outcomes are superficially similar: both sets of claims were false, both caused the community to expend significant effort on falsification, both damaged the credibility of the field. But the causes, and the appropriate institutional and moral responses, differ substantially.</p>
<p>How do you tell them apart in real time? In both cases, you had extraordinary claims that passed initial peer review at prestigious venues. In both cases, independent replication failed. The LK-99 falsification came faster, partly because the synthesis was simpler and partly because the community mobilized more broadly. The Dias case took years, and the data manipulation allegations required access to raw data that the research group was slow to provide.</p>
<p>I do not have a clean answer. The difference in mechanism — honest error versus alleged fraud — is not directly observable from the outside. What you can observe is willingness to share data, consistency of results across different instruments and laboratories, and whether the research group facilitates or obstructs independent verification. On those criteria, the LK-99 group and the Dias group look quite different.</p>
<h2 id="the-sociology-of-what-happened">The Sociology of What Happened</h2>
<p>Let me step back from the physics and say something about what the LK-99 episode reveals about how science actually functions.</p>
<p>The first thing it reveals is that community self-correction works, and now works at extraordinary speed when the incentive is high enough. The coordinated global replication was not organized by any institution, any journal, any funding body. It emerged spontaneously from a community that understood what was at stake and had the tools — preprint servers, social media, Discord, GitHub — to coordinate without central direction. The result was a falsification that, in a previous era, might have taken two to five years, completed in six weeks. That is remarkable.</p>
<p>The second thing it reveals is that the preprint revolution is real and consequential. The LK-99 preprints bypassed traditional peer review entirely. That could be bad — and in principle, a false claim could propagate further and faster without peer review as a gate. In practice, in this case, removing the gate allowed not just the false claim but its falsification to move at the same speed. Peer review, as it is normally practiced, is too slow to respond to a claim like this on a timescale that matters. The community replaced it with something faster: immediate, distributed, adversarial review by people with direct experimental access to the question.</p>
<p>This is not an argument against peer review. It is an argument that peer review in the traditional sense — two or three reviewers reading a manuscript over a few weeks — is not the only form that meaningful scientific scrutiny takes.</p>
<p>The third thing the episode reveals is that social media&rsquo;s role in science communication is deeply ambivalent. Twitter accelerated the spread of both the original claim and the debunking. The community of physicists on Twitter was, on the whole, appropriately skeptical from the first day — I saw many threads on July 22 and 23 that noted the red flags I mentioned above: the tilted levitation, the non-zero resistance, the inconsistencies between the two preprints. But that skepticism was invisible to most science journalists, who were looking at the same videos and preprints and reading the excitement rather than the caveats.</p>
<h2 id="the-media-and-the-calibration-problem">The Media, and the Calibration Problem</h2>
<p>I want to be specific about the media failure, because I think it matters.</p>
<p>The appropriate headline on July 23, 2023 was something like: &ldquo;Korean researchers post preprints claiming room-temperature superconductivity; claim is extraordinary and unverified; replication underway.&rdquo; That headline is accurate. It conveys the genuine excitement — because the claim, if true, would be extraordinary — while conveying the appropriate uncertainty about an unverified preprint from a single group.</p>
<p>The headlines that actually appeared, across outlets that should know better, included &ldquo;Room-temperature superconductor discovered&rdquo; and &ldquo;Scientists may have created the holy grail of energy.&rdquo; These are not accurate. They convey neither the uncertainty nor the specific nature of the claim. They treat a preprint as a discovery.</p>
<p>This is a calibration failure — the same kind of failure I have written about in other contexts. On this blog, I have discussed how LLMs can fail catastrophically when they lack the context to assess whether their confident-sounding output is grounded in anything real (<a href="/posts/car-wash-grounding/">see the car-wash post</a>, and more generally the discussion of context and grounding in <a href="/posts/more-context-not-always-better/">more context is not always better</a>). The mechanism in journalism is different but the structure is the same: confidence that is not appropriately calibrated to evidence.</p>
<p>The Bayesian structure of the situation was, or should have been, clear. The prior probability of a room-temperature, ambient-pressure superconductor being found in any given week is very small — not because room-temperature superconductors are impossible, but because such discoveries do not happen often and many previous claims have failed. Call that prior probability low. Against that prior, what evidence did we have on July 23? A video showing partial, unstable levitation — which, as I noted, is not what Meissner levitation looks like. Two rushed preprints that disagreed with each other in some details. No independent replication. P(levitation video | not a superconductor) was not particularly small, as the Cu₂S explanation would later demonstrate. So the posterior probability that LK-99 was a room-temperature superconductor, given the evidence available on July 23, was not meaningfully higher than the prior — which was low.</p>
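<p>To make the structure concrete, here is the update as arithmetic. Every number below is an assumption chosen for illustration — the post&rsquo;s qualitative point, not a measured probability:</p>

```python
# Illustrative Bayesian update for the evidence available on July 23.
# All numbers are assumptions for the sketch, not measured quantities.
prior = 0.001  # P(genuine room-temperature superconductor) before any evidence

# Likelihoods for the observed (partial, tilted) levitation video:
p_video_given_real = 0.9   # a genuine superconductor would likely show levitation
p_video_given_fake = 0.1   # ferromagnetic impurities can also produce partial lift

# Bayes' theorem: P(real | video)
posterior = (p_video_given_real * prior) / (
    p_video_given_real * prior + p_video_given_fake * (1 - prior)
)
print(f"posterior = {posterior:.4f}")  # 0.0089: still below one percent
```

<p>Even a likelihood ratio of nine in favor of the claim moves a one-in-a-thousand prior to under one percent — which is roughly where the well-calibrated skeptics sat that weekend.</p>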
<p>A well-calibrated science journalist would not have written &ldquo;Room-temperature superconductor discovered.&rdquo; A well-calibrated scientist — and many of them said exactly this — would have written &ldquo;interesting claim, requires replication, maintain high skepticism.&rdquo; The scientific community was, on the whole, well-calibrated. The journalism was not.</p>
<p>This is not a new observation. Science journalists have been criticized for overclaiming since there have been science journalists. But the LK-99 episode is a particularly clean example because the timescale was so short: the calibration failure in the media and the calibration success in the scientific community happened simultaneously, in full public view, and could be compared directly.</p>
<p>I write occasionally about AI systems and their tendency to produce confident outputs that are not grounded in evidence — a form of miscalibration that is particularly dangerous because the confident tone is not a signal of accuracy (<a href="/posts/ai-detectors-systematic-minds/">a theme that runs through recent posts on this blog</a>). The LK-99 episode is a reminder that miscalibration is not unique to neural networks. It is a general failure mode in any system that needs to estimate uncertainty about claims — human, institutional, or artificial. The cure in all cases is the same: track confidence to evidence, update on data, resist the pull of exciting priors.</p>
<h2 id="what-the-scientific-community-actually-did">What the Scientific Community Actually Did</h2>
<p>I want to be careful not to end on a note of pure cynicism about the media and leave the scientific community looking saintly. The community is not saintly.</p>
<p>There were preprints from independent groups that claimed positive results before the falsification was clear — groups that perhaps saw anomalies and wanted to be part of the story. There was social pressure, documented in real time on Twitter, to share exciting results before they were fully analyzed. The Discord servers and GitHub repositories that were genuinely useful for coordination were also, occasionally, vectors for misinformation and premature interpretation.</p>
<p>The community self-corrected. That is the important thing. The noise in the system resolved into a clear answer, in six weeks, through a process that was adversarial in the best scientific sense: many people trying to verify or refute a specific testable claim, sharing data openly, calling out methodological problems in public. The answer that emerged was correct.</p>
<p>I find this genuinely impressive. It is easy to be cynical about institutional science — about publication bias, about the replication crisis in psychology and medicine, about the incentive structures that reward novelty over rigor. The LK-99 episode is a counter-example. It is evidence that, when a question is clear and testable and the stakes are high, the system works. Not perfectly, not without noise, but functionally.</p>
<p>Peer review in the classical sense was absent. Peer review in a broader sense — global, immediate, public, adversarial — worked faster than any journal could have managed, and reached a correct conclusion.</p>
<h2 id="the-next-extraordinary-claim">The Next Extraordinary Claim</h2>
<p>LK-99 is over. The compound will appear in future textbooks, probably in a sidebar about famous failed claims in condensed matter physics, alongside Schön and Dias and others. The researchers who synthesized and characterized it honestly will get some credit for the negative result; the original Korean team will, I imagine, have a difficult few years professionally.</p>
<p>The question I am left with is what happens next time.</p>
<p>Room-temperature superconductivity will, almost certainly, be claimed again. The prize is too large and the search too active. Possibly the claim will be correct — I would not put that probability at zero. More likely it will be another false positive, another Cu₂S lurking in the impurity profile.</p>
<p>Will the media learn from LK-99? I am genuinely uncertain. The incentive structure for science journalism rewards excitement over accuracy, and &ldquo;extraordinary claim requires replication&rdquo; is a less clickable headline than &ldquo;room-temperature superconductor discovered.&rdquo; The journalists who wrote those headlines were not stupid; they were responding rationally to the incentives of their profession.</p>
<p>Will the scientific community respond as effectively? I think so, at least for claims of this kind: testable, synthesis-based, with enough labs in the world capable of attempting replication. The infrastructure — preprints, Discord, shared repositories — exists and is now demonstrated to work. The speed of the LK-99 falsification sets a kind of benchmark.</p>
<p>What the episode showed, in the end, is not that science is infallible or that the system is without problems. It showed that, under the right conditions — a clear empirical question, a distributed community with the tools and motivation to address it, and a culture of open data sharing — science can self-correct at remarkable speed. The failure was in communication, not in the science. That is a meaningful distinction.</p>
<p>Whether the media will have learned anything by the time the next extraordinary claim appears — that, I confess, I doubt.</p>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-LeeKim2023"></span>Lee, S., Kim, J. H., &amp; Kwon, Y.-W. (2023). The First Room-Temperature Ambient-Pressure Superconductor. <em>arXiv</em>:2307.12008. Retrieved from <a href="https://arxiv.org/abs/2307.12008">https://arxiv.org/abs/2307.12008</a></p>
</li>
<li>
<p><span id="ref-Senapati2023"></span>Kumar, K., Karn, N. K., Kumar, Y., &amp; Awana, V. P. S. (2023). Absence of superconductivity in LK-99 at ambient or high pressure. <em>arXiv</em>:2308.03544. Retrieved from <a href="https://arxiv.org/abs/2308.03544">https://arxiv.org/abs/2308.03544</a></p>
</li>
<li>
<p><span id="ref-LiuMeng2023"></span>Liu, S., &amp; Meng, S. (2023). Symmetry-breaking and the origin of the anomalous properties of LK-99. <em>arXiv</em>:2308.05135. Retrieved from <a href="https://arxiv.org/abs/2308.05135">https://arxiv.org/abs/2308.05135</a></p>
</li>
<li>
<p><span id="ref-Garisto2023"></span>Garisto, D. (2023). LK-99 isn&rsquo;t a superconductor — how science sleuths solved the mystery. <em>Nature</em>, 620, 705–706. <a href="https://doi.org/10.1038/d41586-023-02585-7">DOI: 10.1038/d41586-023-02585-7</a></p>
</li>
<li>
<p><span id="ref-Snider2020"></span>Snider, E., Dasenbrock-Gammon, N., McBride, R., Debessai, M., Vindana, H., Vencatasamy, K., Lawler, K. V., Salamat, A., &amp; Dias, R. P. (2020). Room-temperature superconductivity in a carbonaceous sulfur hydride. <em>Nature</em>, 586, 373–377. <a href="https://doi.org/10.1038/s41586-020-2801-z">DOI: 10.1038/s41586-020-2801-z</a> (Retracted 2022.)</p>
</li>
<li>
<p><span id="ref-DiasSilvera2017"></span>Dias, R. P., &amp; Silvera, I. F. (2017). Observation of the Wigner-Huntington transition to metallic hydrogen. <em>Science</em>, 355, 715–718. <a href="https://doi.org/10.1126/science.aal1579">DOI: 10.1126/science.aal1579</a> (Erratum published 2017; widely questioned.)</p>
</li>
<li>
<p><span id="ref-Hirsch2021"></span>Hirsch, J. E. (2021). Rejoinder to &ldquo;Comment on &lsquo;Absence of magnetic evidence for superconductivity in hydride compounds&rsquo;&rdquo; by Dias and Salamat. <em>Physica C</em>, 590, 1353964. <a href="https://doi.org/10.1016/j.physc.2021.1353964">DOI: 10.1016/j.physc.2021.1353964</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-14</strong>: Updated the Cu₂S characterisation: pure Cu₂S is diamagnetic; the ferromagnetism in LK-99 samples comes from impurity phases. Updated the Dias &amp; Silvera 2017 <em>Science</em> paper status: it received an erratum but was not formally retracted (unlike the 2020 and 2023 <em>Nature</em> papers). Updated the Senapati et al. reference to the correct LK-99 debunking literature (the previous arXiv ID resolved to a different paper).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Zero Angular Momentum: The Falling Cat and the Geometry of Shape Space</title>
      <link>https://sebastianspicker.github.io/posts/falling-cat-geometric-phase/</link>
      <pubDate>Tue, 03 Oct 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/falling-cat-geometric-phase/</guid>
      <description>A cat dropped upside-down rotates 180° and lands on its feet, despite having zero angular momentum throughout. This is not a trick and not a violation of physics. The explanation took physicists from 1894 to 1993 to fully work out, and the answer — a geometric phase arising from the holonomy of a fiber bundle — is the same mathematics that governs the Berry phase in quantum mechanics and the Aharonov-Bohm effect in electrodynamics. We adopted two strays this year. They fall beautifully.</description>
      <content:encoded><![CDATA[<p><em>We adopted two stray cats in 2023. They had been living under a garden shed and
had strong opinions about most things, including the correct height from which to
leap onto a bookshelf and whether landing was optional. They are indoor cats now,
for health reasons — a vet&rsquo;s recommendation they find unconvincing but have largely
accepted. Watching one of them drop
from a windowsill — always feet-first, always orientated correctly, from heights
that would leave me reconsidering my life choices — I found myself thinking about
a problem I had first encountered in a mechanics course and had never fully
resolved to my satisfaction.</em></p>
<p><em>How does a cat rotate with zero angular momentum?</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>When a cat is dropped from an inverted position — upside-down, held by a
practised experimenter, then released — it rotates approximately 180° and
lands on its feet. The drop takes around 0.3 seconds. The cat begins with
negligible angular momentum (the experimenter can release it with almost no
spin), and there are no external torques during free fall. By conservation of
angular momentum, the total angular momentum of the cat must remain constant
throughout the fall.</p>
<p>The total angular momentum is therefore approximately zero throughout the
fall.</p>
<p>And yet the cat rotates 180°.</p>
<p>This is the falling cat problem. It was first documented quantitatively by
Étienne-Jules Marey in 1894 using chronophotography — among the first
high-speed photography of any biological motion — and it has
occupied physicists, mathematicians, neuroscientists, and roboticists ever
since.</p>
<p>The problem is not exotic. Every cat owner has seen it. What requires
explanation is why our intuitions about angular momentum fail here, and what
replaces them.</p>
<hr>
<h2 id="why-the-obvious-answers-do-not-work">Why the Obvious Answers Do Not Work</h2>
<p>There are two naive explanations for the cat&rsquo;s righting reflex, both wrong.</p>
<p><strong>Explanation 1: The cat uses initial angular momentum.</strong> The experimenter
gives the cat a small spin before releasing it; the cat amplifies this to
achieve the full 180°. This fails because controlled experiments (and Marey&rsquo;s
original photographs) confirm that cats can right themselves even when
released with zero initial spin. Careful experimenters have verified this
explicitly.</p>
<p><strong>Explanation 2: The cat pushes against the air.</strong> A falling cat could, in
principle, use aerodynamic forces to push against the air and generate a
reaction. This fails because the angular impulse from air drag over 0.3
seconds is far too small to account for the observed 180° rotation. Marey&rsquo;s
chronophotographs already showed that the motion begins immediately on
release, before air resistance could contribute meaningfully.</p>
<p>Both explanations appeal to external torques. The correct explanation requires
none.</p>
<hr>
<h2 id="marey-and-the-photographic-evidence">Marey and the Photographic Evidence</h2>
<p>Étienne-Jules Marey published his chronophotographic sequence of a falling
cat in <em>La Nature</em> on 10 November 1894. The images, taken at 60 frames per
second, show the following clearly:</p>
<ol>
<li>The front and rear halves of the cat move <em>asymmetrically</em>. The front half
rotates in one direction; the rear half rotates by a smaller angle in the
opposite direction.</li>
<li>The cat pulls its front legs in close to its body (reducing the moment of
inertia of the front half) while extending its rear legs (increasing the
moment of inertia of the rear half).</li>
<li>The front half then rotates rapidly (large angle, small moment of inertia);
the rear half rotates slowly in the opposite direction (small angle, large
moment of inertia).</li>
<li>The cat then extends its front legs and pulls in its rear legs, and reverses
the process.</li>
</ol>
<p>The net effect: the cat&rsquo;s body orientation rotates by 180° even though the
<em>total</em> angular momentum — computed as the sum of both halves — remains
constant. The key word is <em>sum</em>. Individual parts can exchange angular momentum
through internal torques; the sum is conserved.</p>
<p>This mechanism — internal redistribution of angular momentum without changing
its total — is correct but not complete. It explains <em>that</em> rotation is
possible, not <em>how much</em> rotation is achieved per cycle of shape change. For
that, we need the mathematics.</p>
<hr>
<h2 id="kane-and-scher-the-two-cylinder-model">Kane and Scher: The Two-Cylinder Model</h2>
<p>The first rigorous mechanical model was published by T.R. Kane and M.P. Scher
in 1969 (<em>International Journal of Solids and Structures</em> 5, 663–670).</p>
<p>They modelled the cat as two rigid axisymmetric cylinders — a front half and
a rear half — connected at a joint that allows relative bending and twisting.
The joint constraint imposes that the relative twist between the two halves is
zero (a &ldquo;no-twist&rdquo; condition: the cylinders cannot spin relative to each other
at their connection). The total angular momentum of the system is held fixed
at zero.</p>
<p>Let the two cylinders have moments of inertia $I_1$ and $I_2$ about their
symmetry axes, and let $\phi$ be the bend angle between them and $\psi$ the
twist angle. The zero-angular-momentum constraint, combined with the no-twist
condition, gives a system of equations that can be integrated numerically to
find the net body rotation as a function of the shape-change trajectory
$(\phi(t), \psi(t))$.</p>
<p>Kane and Scher showed that a specific sequence of shape changes — one complete
cycle in the $(\phi, \psi)$ plane — produces a net rotation of approximately
90–100°. A second cycle gives the rest. The calculation was the first to
confirm, from mechanics alone, that the righting manoeuvre requires no external
torques and is entirely consistent with conservation of angular momentum.</p>
<p>What the Kane–Scher model does not explain is <em>why</em> the net rotation per cycle
depends on the area enclosed by the trajectory in shape space — or why the
same mathematical structure appears in quantum mechanics. For that, we need
Montgomery&rsquo;s formulation.</p>
<hr>
<h2 id="montgomery-fiber-bundles-and-geometric-holonomy">Montgomery: Fiber Bundles and Geometric Holonomy</h2>
<p>In 1993, Richard Montgomery published a reformulation of the falling cat problem
using gauge theory (<em>Dynamics and Control of Mechanical Systems</em>, Fields
Institute Communications, AMS, pp. 193–218). The reformulation is the
definitive mathematical treatment, and it connects the cat to one of the deepest
structures in modern physics.</p>
<h3 id="the-configuration-space">The Configuration Space</h3>
<p>The full configuration space of the cat — the space of all possible positions
and orientations — is</p>
$$Q = SO(3) \times \mathcal{S},$$<p>where $SO(3)$ is the rotation group (describing the cat&rsquo;s overall orientation
in space) and $\mathcal{S}$ is the <em>shape space</em> (describing the internal
geometry: the bend angle, the twist, the position of each limb relative to the
body).</p>
<p>The angular momentum constraint $\mathbf{L} = 0$ defines a <em>horizontal
distribution</em> on $Q$ — a preferred subspace of tangent vectors at each point
that correspond to shape changes at zero angular momentum. This distribution is
not integrable (it does not come from a foliation), which is the mathematical
signature that holonomy is possible.</p>
<h3 id="the-fiber-bundle">The Fiber Bundle</h3>
<p>The projection</p>
$$\pi \colon Q \to \mathcal{S}, \qquad (R, s) \mapsto s,$$<p>makes $Q$ into a principal fiber bundle over $\mathcal{S}$ with structure group
$SO(3)$. The fiber above each shape $s \in \mathcal{S}$ is the set of all
orientations the cat can have with that shape.</p>
<p>A <em>connection</em> on this bundle is a rule for &ldquo;lifting&rdquo; paths in the base
$\mathcal{S}$ to horizontal paths in the total space $Q$ — that is, paths
along which the angular momentum constraint is satisfied. This connection
$\mathcal{A}$ is a one-form on $\mathcal{S}$ taking values in the Lie algebra
$\mathfrak{so}(3)$.</p>
<h3 id="holonomy-the-geometric-phase">Holonomy: The Geometric Phase</h3>
<p>When the cat executes a closed loop $\gamma$ in shape space — a sequence of
shape changes that returns it to its initial shape — the <em>holonomy</em> of the
connection $\mathcal{A}$ around $\gamma$ gives the net rotation:</p>
$$R_\gamma = \mathrm{Hol}_\mathcal{A}(\gamma) \in SO(3).$$<p>For the full non-Abelian case ($SO(3)$), the holonomy is a path-ordered
exponential along $\gamma$ and its relationship to the curvature involves
non-Abelian corrections. But the essential geometric intuition is captured
by the Abelian case — rotation about a single axis — where Stokes&rsquo;s theorem
gives the net rotation directly:</p>
$$\theta_\gamma = \iint_{\Sigma} F,$$<p>where $\Sigma$ is a surface bounded by $\gamma$ and $F = d\mathcal{A}$
is the curvature 2-form. The cat&rsquo;s net rotation per cycle is the integral
of the curvature over the area enclosed by its shape-change loop in
$\mathcal{S}$. For small loops, the curvature
$F_\mathcal{A} = d\mathcal{A} + \mathcal{A} \wedge \mathcal{A}$ determines the
holonomy to leading order in both the Abelian and non-Abelian cases.</p>
<p>The rotation is <em>geometric</em>: it depends on the shape of the loop, not on the
speed at which the loop is traversed. A cat executing the same shape-change
sequence twice as fast achieves the same rotation in half the time.</p>
<hr>
<h2 id="the-connection-to-berry-phase">The Connection to Berry Phase</h2>
<p>The gauge structure of the falling cat problem is not an isolated curiosity.
It is the same mathematical structure that governs several central phenomena
in modern physics.</p>
<p><strong>The Berry phase</strong> (Berry 1984, <em>Proceedings of the Royal Society A</em>) arises
when a quantum system is transported adiabatically around a closed loop $C$ in
parameter space. The state acquires a phase</p>
$$\gamma_B = \oint_C \mathbf{A} \cdot d\mathbf{R},$$<p>where $\mathbf{A} = i\langle n(\mathbf{R}) | \nabla_\mathbf{R} | n(\mathbf{R}) \rangle$
is the Berry connection — a gauge field on parameter space. The Berry phase is
the holonomy of this connection, which is to say: the cat righting itself and
a quantum state accumulating a geometric phase are instances of the <em>same
mathematical theorem</em>.</p>
<p>Shapere and Wilczek (1989) made this connection explicit for deformable bodies,
noting that the net rotation of a swimming microorganism or a falling cat is
the holonomy of a gauge connection on shape space — exactly the Berry phase,
expressed in the language of classical mechanics.</p>
<p><strong>The Foucault pendulum</strong> precesses at a rate of $2\pi\sin\phi$ per sidereal
day, where $\phi$ is the latitude. The holonomy of the Levi-Civita connection
on $S^2$ for parallel transport around the circle of latitude is the solid
angle of the enclosed polar cap, $\Omega = 2\pi(1 - \sin\phi)$. The
lab-frame precession $2\pi\sin\phi = 2\pi - \Omega$ is the complementary
angle — the two sum to a full rotation because the local frame itself
completes one circuit per sidereal day. It is another geometric phase.</p>
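<p>(A sketch, not from any of the cited papers: the holonomy statement can be checked numerically by parallel-transporting a tangent vector around the circle of latitude, stepping along the circle and re-projecting onto the local tangent plane at each step. The step count and the projection scheme are my own choices.)</p>

```python
import numpy as np

def frame(theta, phi):
    """Local east/north tangent frame at longitude theta, latitude phi."""
    east = np.array([-np.sin(theta), np.cos(theta), 0.0])
    north = np.array([-np.sin(phi) * np.cos(theta),
                      -np.sin(phi) * np.sin(theta),
                      np.cos(phi)])
    return east, north

def holonomy_angle(phi, steps=20_000):
    """Parallel-transport a vector once around the latitude circle and
    return its net rotation relative to the local east/north frame."""
    v, _ = frame(0.0, phi)                          # start pointing east
    for theta in np.linspace(0.0, 2.0 * np.pi, steps + 1)[1:]:
        p = np.array([np.cos(phi) * np.cos(theta),  # unit normal at new point
                      np.cos(phi) * np.sin(theta),
                      np.sin(phi)])
        v -= np.dot(v, p) * p                       # project onto tangent plane
        v /= np.linalg.norm(v)                      # renormalize
    east, north = frame(0.0, phi)                   # end frame = start frame
    return np.arctan2(np.dot(v, north), np.dot(v, east))

phi = np.radians(45.0)
angle = holonomy_angle(phi)
deficit = 2.0 * np.pi * (1.0 - np.sin(phi))         # solid angle of polar cap
print(abs(angle), deficit)
```

<p>At 45° latitude both numbers come out near 1.84&nbsp;rad — the solid angle of the enclosed polar cap.</p>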
<p><strong>The Aharonov-Bohm effect</strong> (1959) produces a phase shift for electrons
circling a solenoid, even when the electrons travel only through field-free
regions. The phase is the holonomy of the electromagnetic vector potential
$\mathbf{A}$ around the loop — a Berry phase for the electromagnetic field.</p>
<p>All four phenomena — the falling cat, the Berry phase, the Foucault pendulum,
the Aharonov-Bohm effect — are manifestations of the same structure: a
connection on a fiber bundle, and holonomy as the geometric consequence of
traversing a closed loop.</p>
<p>Batterman (2003, <em>Studies in History and Philosophy of Modern Physics</em> 34,
527–557) gives a particularly clear account of this unification, drawing out
the common mathematical skeleton and its physical implications.</p>
<hr>
<h2 id="high-rise-syndrome-terminal-velocity-and-the-parachute-cat">High-Rise Syndrome: Terminal Velocity and the Parachute Cat</h2>
<p>There is a grounding empirical footnote to the elegant geometry above. Whitney
and Mehlhaff (1987, <em>Journal of the American Veterinary Medical Association</em>
191, 1399–1403) analysed 132 cats brought to a Manhattan veterinary clinic after
falling from buildings of two to thirty-two stories. Their finding was
counterintuitive:</p>
<p>Cats falling from above seven stories had a <em>lower</em> injury rate than cats
falling two to six stories. Overall, 90% of the cats in the study survived,
with injuries paradoxically less severe at greater heights.</p>
<p>The explanation involves two phases. Below seven stories, the cat is still
accelerating: it is tense, its legs are extended to brace for impact, and it
absorbs the force of landing poorly. Above seven stories, the cat reaches
terminal velocity — approximately $100\,\mathrm{km/h}$ for a falling cat — and
then, apparently, <em>relaxes</em>. The vestibular system, having identified that the
fall is not ending imminently, switches from the righting reflex to a
parachute posture: legs spread horizontally, body flattened, increasing the
cross-sectional area and hence air resistance.</p>
<p>Terminal velocity is reached when the drag force equals the gravitational force:</p>
$$mg = \frac{1}{2} C_D \rho A v_t^2, \qquad
v_t = \sqrt{\frac{2mg}{C_D \rho A}}.$$<p>For a spread-eagle cat ($m \approx 4\,\mathrm{kg}$, $A \approx 0.06\,\mathrm{m}^2$,
$C_D \approx 1.0$, $\rho_\mathrm{air} \approx 1.2\,\mathrm{kg/m}^3$):</p>
$$v_t \approx \sqrt{\frac{2 \times 4 \times 9.8}{1.0 \times 1.2 \times 0.06}}
\approx 33\,\mathrm{m/s} \approx 120\,\mathrm{km/h}.$$<p>(The exact value depends on posture and fur drag; empirical estimates for
cats in the parachute posture are lower, roughly $25$–$30\,\mathrm{m/s}$,
because the effective area increases when the limbs are spread.)</p>
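<p>(For completeness, the arithmetic as a three-line function. The cat figures are the ones above; the human mass and area are my own rough guesses, not from any study.)</p>

```python
import math

def terminal_velocity(m, A, Cd=1.0, rho=1.2, g=9.8):
    """Speed at which drag balances weight: m*g = 0.5*Cd*rho*A*v**2."""
    return math.sqrt(2.0 * m * g / (Cd * rho * A))

v_cat = terminal_velocity(m=4.0, A=0.06)     # ~33 m/s, ~120 km/h
v_human = terminal_velocity(m=80.0, A=0.5)   # ~51 m/s with these guesses
print(f"cat: {v_cat:.1f} m/s   human: {v_human:.1f} m/s")
```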
<p>A human in free-fall has terminal velocity around $55\,\mathrm{m/s}$
($200\,\mathrm{km/h}$) — faster, because the mass-to-area ratio is higher.
The cat, with its low mass and high drag relative to body weight, hits a
gentler terminal velocity and distributes the impact more effectively.</p>
<p>The study is sometimes cited as evidence that cats are invincible. A significant
caveat is <strong>survivorship bias</strong>: cats that died on impact were likely not brought
to the veterinary clinic, so the dataset underrepresents fatal outcomes,
especially for higher falls. The apparent decrease in injury rate above seven
stories may partly reflect the fact that the most severely injured cats from
those heights never entered the study. The aerodynamic posture explanation is
plausible, but the data do not cleanly separate it from the sampling bias.</p>
<hr>
<h2 id="robotics-and-spacecraft">Robotics and Spacecraft</h2>
<p>The falling cat problem has practical applications beyond veterinary statistics.</p>
<p><strong>Spacecraft attitude control</strong>: Astronauts in free fall can change their
body orientation without thrusters, using the same gauge-theoretic mechanism
as the cat. NASA and ESA have studied cat-inspired reorientation manoeuvres
for astronauts and satellites.</p>
<p><strong>Robotics</strong>: The two-cylinder model inspired early robot designs capable of
reorienting in free fall — useful for robots deployed from aircraft or
spacecraft. Subsequent work (including a 2022 review in <em>IEEE Transactions on
Robotics</em>) has produced legged robots that can right themselves after being
knocked over using shape-change sequences derived from the Montgomery connection.</p>
<p><strong>Gymnastics and diving</strong>: Human athletes performing somersaults and twists
exploit the same gauge structure, though without articulating the mathematics.
A tuck increases rotation rate (smaller $I$, constant $L$ → larger $\omega$);
a layout decreases it. Changing the tuck–layout timing mid-rotation produces
a net twist — holonomy in the shape space of a human body.</p>
<hr>
<h2 id="the-view-from-a-windowsill">The View from a Windowsill</h2>
<p>My cats have no opinion about fiber bundles. When one of them drops from the
top of the bookcase, she is not solving the variational problem</p>
$$\min_{\gamma \in \Omega} \int_\gamma |\dot{s}|^2 \, dt,
\quad \text{subject to } \mathrm{Hol}_\mathcal{A}(\gamma) = R_{180°},$$<p>she is executing a motor program refined over millions of years of feline
evolution. The vestibular system provides continuous feedback on body
orientation; the cerebellum coordinates the shape-change sequence; the whole
manoeuvre is over in a third of a second.</p>
<p>What physics tells us is that the manoeuvre is <em>possible</em> — that no law of
nature forbids a body with zero angular momentum from reorienting — and gives
the precise geometric reason: the curvature of a connection on shape space is
non-zero, which means the holonomy of closed loops is non-trivial.</p>
<p>The same curvature that allows a cat to right itself allows a quantum state to
accumulate a geometric phase, allows the Foucault pendulum to precess, and
allows the Aharonov-Bohm effect to shift an interference fringe without a local
field. These are not analogies. They are the same theorem, applied to different
physical systems in different mathematical languages.</p>
<p>I find this more remarkable than the cat.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Batterman, R.W. (2003). Falling cats, parallel parking, and polarized light.
<em>Studies in History and Philosophy of Modern Physics</em>, 34(4), 527–557.
<a href="https://doi.org/10.1016/S1355-2198(03)00062-5">https://doi.org/10.1016/S1355-2198(03)00062-5</a></p>
</li>
<li>
<p>Berry, M.V. (1984). Quantal phase factors accompanying adiabatic changes.
<em>Proceedings of the Royal Society A</em>, 392, 45–57.
<a href="https://doi.org/10.1098/rspa.1984.0023">https://doi.org/10.1098/rspa.1984.0023</a></p>
</li>
<li>
<p>Gbur, G.J. (2019). <em>Falling Felines and Fundamental Physics.</em> Yale University
Press.</p>
</li>
<li>
<p>Kane, T.R., &amp; Scher, M.P. (1969). A dynamical explanation of the falling cat
phenomenon. <em>International Journal of Solids and Structures</em>, 5(7), 663–670.
<a href="https://doi.org/10.1016/0020-7683(69)90086-9">https://doi.org/10.1016/0020-7683(69)90086-9</a></p>
</li>
<li>
<p>Marey, É.-J. (1894). Des mouvements que certains animaux exécutent pour
retomber sur leurs pieds lorsqu&rsquo;ils sont précipités d&rsquo;un lieu élevé. <em>La
Nature</em>, 10 November 1894.</p>
</li>
<li>
<p>Montgomery, R. (1993). Gauge theory of the falling cat. In M. Enos (Ed.),
<em>Dynamics and Control of Mechanical Systems</em> (Fields Institute Communications,
Vol. 1, pp. 193–218). American Mathematical Society.</p>
</li>
<li>
<p>Shapere, A., &amp; Wilczek, F. (Eds.). (1989). <em>Geometric Phases in Physics.</em>
World Scientific.</p>
</li>
<li>
<p>Whitney, W.O., &amp; Mehlhaff, C.J. (1987). High-rise syndrome in cats. <em>Journal
of the American Veterinary Medical Association</em>, 191(11), 1399–1403.</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-15</strong>: Corrected the Marey publication date from 22 November 1894 to 10 November 1894 (in text and in reference). Updated the Whitney &amp; Mehlhaff (1987) statistics to reflect that the 90% survival rate applies to all cats in the study, as reported in the paper, rather than specifically to those falling from above seven stories.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Can a Planet Have a Moon? Teaching Exomoon Detection with a Disco Ball Motor</title>
      <link>https://sebastianspicker.github.io/posts/exomoon-analogy-experiment/</link>
      <pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/exomoon-analogy-experiment/</guid>
      <description>Every classroom treatment of exoplanet detection focuses on the transit method. What gets omitted is that moons of exoplanets could also host life — and that with a small motor and a slight modification to the standard transit experiment, you can show students what an exomoon signature looks like in a light curve. Published in MNU Journal in 2023.</description>
      <content:encoded><![CDATA[<p><em>This post describes the paper &ldquo;Ein Analogieexperiment zur Suche nach Exomonden&rdquo;
(An Analogy Experiment for the Search for Exomoons), published in MNU Journal
in 2023 together with Alexander Küpper.</em></p>
<hr>
<h2 id="the-gap-in-the-curriculum">The Gap in the Curriculum</h2>
<p>Most physics and astronomy teaching units that address the search for
extraterrestrial life focus on exoplanets. The transit method gets
visualised, a light curve gets plotted, and the lesson ends with: some
exoplanets are in the habitable zone. The end.</p>
<p>What tends to get omitted: moons of exoplanets — exomoons — could equally
be candidates for extraterrestrial life, particularly if the exoplanet
itself sits in the habitable zone. The moon would then be in the habitable
zone too, and a large moon could maintain the atmospheric conditions necessary
for liquid water. The possibility is taken seriously in the astrophysics
community, and survey data consistently shows that students find the question
of life in the universe among the most interesting topics in all of science.</p>
<p>The pedagogical gap is this: the transit method is routinely demonstrated
in analogy experiments, but the extension to exomoon detection is almost
never treated experimentally, even though it is a natural continuation of
the same experiment with only minor modifications. This paper is an attempt
to close that gap.</p>
<hr>
<h2 id="what-an-exomoon-signal-looks-like">What an Exomoon Signal Looks Like</h2>
<p>When only a planet transits a star, the resulting light curve shows a
characteristic symmetric dip: flux drops as the planet moves in front of
the star, holds at a reduced level during full transit, and recovers as
the planet exits. The normalised flux during the flat-bottomed phase is:</p>
$$I(t) = \frac{A_s - A_p}{A_s} = 1 - \frac{A_p}{A_s}$$<p>where the dip depth $\delta = A_p / A_s$ is determined by the ratio of the
planet&rsquo;s cross-sectional area to the star&rsquo;s.</p>
<p>When the planet has a moon, the situation is more complex. The light curve
is now governed by:</p>
$$I(t) = \frac{A_s - (A_p + A_m - A_{pm}(t))}{A_s}$$<p>where $A_m$ is the moon&rsquo;s cross-sectional area and $A_{pm}(t)$ is the
time-dependent overlap between the planet&rsquo;s and moon&rsquo;s projected disks
(the moon is orbiting the planet, so this overlap changes during the
transit).</p>
<p>The consequence: additional dips and asymmetries appear in the light curve.
The moon can transit slightly before the planet (causing a small flux dip
before the main transit begins), or slide in front of the planet during
the transit (temporarily reducing the combined occulting area, causing
a brief flux recovery in the middle of the dip), or emerge from behind
the planet on the exit side (causing a small dip after the main transit
ends). The exact signature depends on the relative sizes of planet and
moon, their orbital period ratio, and the geometry of the particular
transit.</p>
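<p>(The flux formula above can be turned into a toy simulation. The sketch below is not the paper&rsquo;s analysis: it rasterises the stellar disk and counts grid points blocked by either body, with uniform surface brightness and invented radii, orbit size, and period ratio. The boolean union of the two disks handles the overlap term $A_{pm}(t)$ automatically.)</p>

```python
import numpy as np

# Radii and positions in units of the star's radius (example values)
r_p, r_m = 0.25, 0.10      # planet and moon radii
a_moon = 0.5               # projected radius of the moon's orbit
n_orbits = 3               # moon orbits completed per transit

# Grid over the stellar disk (uniform brightness, no limb darkening)
xs, ys = np.meshgrid(np.linspace(-1, 1, 400), np.linspace(-1, 1, 400))
on_star = xs**2 + ys**2 <= 1.0
star_points = np.count_nonzero(on_star)

def flux(x_p, x_m, y_m):
    """Fraction of starlight not blocked by planet (at x_p, 0) or moon."""
    blocked = (((xs - x_p)**2 + ys**2 <= r_p**2) |
               ((xs - x_m)**2 + (ys - y_m)**2 <= r_m**2))
    return 1.0 - np.count_nonzero(blocked & on_star) / star_points

t = np.linspace(0.0, 1.0, 300)
x_planet = -2.0 + 4.0 * t                     # planet sweeps across the star
phase = 2.0 * np.pi * n_orbits * t
x_moon = x_planet + a_moon * np.cos(phase)    # moon circles the planet
y_moon = 0.3 * a_moon * np.sin(phase)         # projected, inclined orbit
curve = np.array([flux(*args) for args in zip(x_planet, x_moon, y_moon)])
```

<p>Plotting <code>curve</code> against <code>t</code> shows the planet&rsquo;s flat-bottomed dip with the moon&rsquo;s pre-dips, mid-transit wiggles, and post-dips superimposed.</p>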
<p>These signatures are small. In real astrophysics, this is why no exomoon
has been unambiguously confirmed. In a classroom analogy experiment, the
signals are large enough to see clearly — which is exactly what makes the
experiment pedagogically useful.</p>
<hr>
<h2 id="the-experimental-setup">The Experimental Setup</h2>
<p>The starting point is a standard transit analogy experiment: a sphere
(the planet) on a rod, moved slowly around a lamp (the star) by a slowly
rotating motor. A light sensor — an Android smartphone running phyphox,
or an Arduino with a suitable sensor — records the illuminance over time.
The resulting light curve shows the characteristic symmetric transit dip.</p>
<p>The modification is straightforward: attach a small battery-powered motor
to the planet sphere, with a smaller sphere (the moon) on the motor&rsquo;s arm.
The motor we used is a disco ball motor — inexpensive, widely available,
and with a rotation speed that works well relative to the transit timescale
if you choose the geometry appropriately.</p>
<p>The result is a physical system with two independent circular motions:</p>
<ul>
<li>The planet orbiting the star (driven by the main slow-rotation motor)</li>
<li>The moon orbiting the planet (driven by the disco ball motor)</li>
</ul>
<p>When this system transits the &ldquo;star&rdquo; (the lamp), the light sensor records
a compound light curve with the exomoon signatures described above.</p>
<p><strong>One technical note on sensors:</strong> High sample rate matters here.
The exomoon signatures are brief features on top of the transit dip, and
a sensor that samples too slowly will average them out. We found that
the TI SensorTag CC2650, despite being a reasonable choice for the basic
transit experiment, has a light sensor sample rate of only 1.25 Hz —
too slow to resolve exomoon signatures reliably. Android smartphones and
Arduinos both achieve adequate sample rates. The Pasco light sensor
used in the paper samples at up to 20 Hz and resolves the features clearly.</p>
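<p>(A quick way to convince yourself of the sample-rate point, with invented dip durations and depths:)</p>

```python
import numpy as np

def light_curve(t):
    """Synthetic transit: 10 s planetary dip plus a 0.2 s moon-only pre-dip."""
    flux = np.ones_like(t)
    flux[(t > 5.0) & (t < 15.0)] -= 0.06    # main transit
    flux[(t > 4.5) & (t < 4.7)] -= 0.01     # brief exomoon signature
    return flux

for rate in (1.25, 20.0):                   # Hz: slow vs. fast sensor
    t = np.arange(0.0, 20.0, 1.0 / rate)
    seen = np.any((t > 4.4) & (t < 4.8) & (light_curve(t) < 0.995))
    print(f"{rate:5.2f} Hz: moon dip sampled? {seen}")
```

<p>With this phase, the 1.25&nbsp;Hz sampler steps straight over the 0.2&nbsp;s feature; at 20&nbsp;Hz several samples land inside it. A real slow sensor also averages over its sample interval, washing the feature out further.</p>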
<hr>
<h2 id="reading-the-light-curves">Reading the Light Curves</h2>
<p>The paper presents two distinct light curve types that emerge from the
experiment, each with a different exomoon orbital configuration.</p>
<p><strong>Type 1</strong>: The moon&rsquo;s orbital period is short relative to the transit
duration. Multiple exomoon signatures appear within a single transit.
These include:</p>
<ul>
<li>A small dip before the main transit begins (moon transiting alone)</li>
<li>Asymmetric ingress/egress (moon leading or trailing the planet)</li>
<li>A brief flux recovery midway through the transit (moon passing
behind the planet, reducing the total occluding area)</li>
<li>A small post-transit dip (moon still in front of the star after
the planet has exited)</li>
</ul>
<p><strong>Type 2</strong>: A specific orbital phase alignment in which the moon
slides directly behind the planet near mid-transit. At that moment the
occulting area drops to the planet&rsquo;s disk alone, so the flux partially
recovers; as the moon emerges from behind the planet, the total occluded
area increases again briefly before both planet and moon exit.</p>
<p>This second case is particularly useful for quantitative analysis: if the
orbital geometry is right, students can separately determine the planet&rsquo;s
radius from the secondary dip depth and the combined planet-moon radius
from the primary dip depth.</p>
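<p>(In numbers, assuming the deepest level has planet and moon side by side in front of the star and the shallower mid-transit level has the moon hidden: each relative depth is an area ratio, so the radii follow from square roots. The lamp radius and depths below are invented example values.)</p>

```python
import math

R_star = 0.10             # lamp radius in metres (example value)
depth_primary = 0.0725    # deepest level: planet + moon both in front
depth_secondary = 0.0625  # mid-transit level: planet alone

r_planet = R_star * math.sqrt(depth_secondary)                # 0.025 m
r_moon = R_star * math.sqrt(depth_primary - depth_secondary)  # 0.010 m
print(r_planet, r_moon)
```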
<hr>
<h2 id="video--light-curve-together">Video + Light Curve Together</h2>
<p>The paper recommends recording a video of the experiment simultaneously
with the light sensor measurement, from the perspective of the sensor
(i.e., looking up at the lamp from below). This technique — which is
also central to the <a href="/posts/exoplanet-hunting-smartphones/">transit method paper</a>
— is even more valuable here.</p>
<p>Without the video, the exomoon signatures in the light curve are easy
to misread as noise or experimental error. With the video, students can
advance frame by frame through the moments corresponding to the unusual
features and see exactly what the physical system was doing: the moon
sliding in front of the planet, the moon emerging from the planet&rsquo;s
shadow, the moon transiting alone at the start or end of the main event.</p>
<p>The cognitive load of interpreting an unfamiliar, complex signal drops
substantially when the signal can be correlated frame by frame with a
visual record of what produced it.</p>
<hr>
<h2 id="differentiation-and-extensions">Differentiation and Extensions</h2>
<p>The paper suggests the exomoon experiment as an extension for higher-ability
students at the end of a unit on exoplanet detection, not as the entry
point. The transit method should come first; the exomoon experiment builds
on it.</p>
<p>For students who are comfortable with quantitative analysis, the formula
above allows a full treatment: given the measured light curve and a known
lamp radius, students can derive both the planet radius and the moon radius
from the dip depths at the appropriate moments.</p>
<p>Possible further extensions:</p>
<ol>
<li><strong>Noise floor investigation</strong>: systematically vary the moon&rsquo;s size and
determine the smallest moon still detectable. This connects directly
to the real astrophysical problem — the reason no exomoon has been
confirmed is that the signal is buried in noise.</li>
<li><strong>Period ratio effects</strong>: vary the transit speed (and thus the effective
period ratio between moon and planet) to see how the light curve changes.</li>
<li><strong>Sensor comparison</strong>: test different sensor types and compare their
ability to resolve exomoon signatures. This turns the instrumental
limitation into an explicit investigation.</li>
</ol>
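<p>Extension 1 can be given a quantitative target. Under a simple matched-filter assumption (my own sketch, not the paper&rsquo;s analysis: a boxcar dip of depth $\delta$ sampled $n$ times with per-sample noise $\sigma$ has signal-to-noise $\delta\sqrt{n}/\sigma$), the smallest detectable moon follows directly from a chosen SNR threshold:</p>

```python
import math

def smallest_detectable_moon(r_star, sigma, n_samples, snr_threshold=5.0):
    """Smallest moon radius recoverable from a light-curve dip.

    Illustrative model only: assumes a boxcar dip of fractional depth
    (r_moon / r_star)**2, n_samples data points inside the dip, Gaussian
    per-sample noise sigma (in fractional flux units), and a matched-filter
    SNR of depth * sqrt(n_samples) / sigma."""
    min_depth = snr_threshold * sigma / math.sqrt(n_samples)
    return r_star * math.sqrt(min_depth)

# A 10 cm lamp, 1% flux noise, 100 samples inside the moon's dip:
r_min = smallest_detectable_moon(r_star=10.0, sigma=0.01, n_samples=100)
print(f"smallest detectable moon radius: {r_min:.2f} cm")
# -> smallest detectable moon radius: 0.71 cm
```

Halving the noise or quadrupling the number of in-dip samples shrinks the detectable radius by a factor of $\sqrt{2}$, which is exactly the trade-off students can map out experimentally.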
<p>For the deeper theoretical connections — transit timing variations, the
David Kipping approach to exomoon detection — see the
<a href="/posts/the-gift-of-transits/">transit simulation post</a>, which models
these effects in a browser-based tool.</p>
<p><em>For the secondary school curriculum context and the Direct Imaging
pre-experiment that typically precedes the transit unit, see
<a href="/posts/fremde-welten-exoplanet-teaching/">Fremde Welten</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Küpper, A., &amp; Spicker, S. J. (2023). Ein Analogieexperiment zur Suche
nach Exomonden. <em>MNU Journal</em>, 76(5).</p>
<p>Sato, M., &amp; Asada, H. (2009). Effects of mutual transits by extrasolar
planet-companion systems on light curves. <em>Publications of the
Astronomical Society of Japan</em>, 61(4), L29–L34.</p>
<p>Tusnski, L. R. M., &amp; Valio, A. (2011). Transit model of planets with
moon and ring systems. <em>The Astrophysical Journal</em>, 743(1), 97.</p>
<p>Heller, R. (2018). On the detection of extrasolar moons and rings.
In H. J. Deeg &amp; J. A. Belmonte (Eds.), <em>Handbook of Exoplanets</em>
(pp. 835–851). Springer.</p>
<p>Küpper, A., Spicker, S. J., &amp; Schadschneider, A. (2022).
Analogieexperimente zur Transitmethode für den Physik- und
Astronomieunterricht in der Sekundarstufe I. <em>Astronomie+Raumfahrt
im Unterricht</em>, 59(188), 46–50.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-03</strong>: Updated the Tusnski &amp; Valio (2011) reference to use article number 97, replacing the previous page range &ldquo;1–16.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>How Low Can You Go? Measuring Latency for Networked Music Performance Across Europe</title>
      <link>https://sebastianspicker.github.io/posts/nmp-latency-lola-mvtp/</link>
      <pubDate>Sat, 26 Aug 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/nmp-latency-lola-mvtp/</guid>
      <description>We measured end-to-end audio and video latency for LoLa and MVTP across six European research-network links. One-way audio latency ranged from 7.5 to 22.5 ms. Routing topology mattered more than geographic distance. Enterprise firewalls were a disaster. Here is what we found.</description>
      <content:encoded><![CDATA[<p><em>This post summarises a manuscript submitted with Benjamin Bentz and colleagues
from the RAPP Lab network. The paper is not yet peer-reviewed; numbers and
conclusions are based on operational measurements collected 2020–2023.
Feedback welcome — particularly from anyone who has run similar measurements
on non-European or wireless-last-mile links.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Musicians playing together in the same room experience acoustic propagation
delay of roughly 3 ms per metre of separation — essentially free latency that
most ensembles never consciously register. When you distribute musicians across
a network, you inherit that propagation cost plus everything the signal chain
adds on top: buffers, codec processing, routing hops, switching overhead.</p>
<p>Conventional video-conferencing (Zoom, Teams, etc.) operates at end-to-end
delays of roughly 100–300 ms. That is comfortable for speech — human
conversation tolerates round-trip delays up to about 250 ms before it starts
to feel wrong — but it is well above the threshold at which ensemble timing
breaks down. The NMP literature generally puts the upper bound for
synchronous rhythmic playing somewhere between 20 and 30 ms one-way, with
considerable variation by tempo, instrument, and whether the performers can
see each other [Carôt 2011; Tsioutas &amp; Xylomenos 2021; Medina Victoria 2019].</p>
<p>Specialised low-latency systems cut the processing overhead by avoiding
compression, using hardware-accelerated video pipelines, and riding
research-and-education networks that offer better jitter characteristics than
commodity internet. Two of the better-known ones are <strong>LoLa</strong> (Low Latency
Audio Visual Streaming System, developed at Conservatorio G. Tartini Trieste)
and <strong>MVTP</strong> (Modular Video Transmission Platform, developed at CESNET in
Prague). We deployed both at Hochschule für Musik und Tanz Köln as part of
the RAPP Lab collaboration and spent about two and a half years measuring them.</p>
<hr>
<h2 id="the-latency-budget">The Latency Budget</h2>
<p>End-to-end latency in NMP is cumulative and non-recoverable. Once delay enters
the chain, nothing downstream can subtract it. The budget looks like:</p>
\[
  L_\text{total} = L_\text{capture} + L_\text{buffer} + L_\text{network} + L_\text{playback}
\]<p>Network latency \( L_\text{network} \) includes propagation (roughly
\( d / (2 \times 10^8) \) seconds for a fibre link of distance \( d \) metres,
accounting for the refractive index of glass) plus per-hop processing.
Everything else is system-dependent.</p>
<p>The key insight is that \( L_\text{buffer} \) is not fixed — it is a
consequence of jitter. A jittery link forces larger buffers to avoid
underruns, which directly adds to perceived latency. This is why raw bandwidth
is almost irrelevant for NMP: a 1 Gbps link with erratic jitter will perform
worse than a 100 Mbps link with deterministic behaviour.</p>
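<p>The budget is easy to sketch numerically. All component values below other than fibre propagation are illustrative placeholders of my own, not measurements from the study:</p>

```python
FIBRE_SPEED = 2.0e8  # m/s: roughly c divided by the refractive index of glass

def one_way_latency_ms(fibre_km, hops, capture_ms=1.0, buffer_ms=2.7,
                       playback_ms=1.0, per_hop_ms=0.05):
    """One-way latency budget: capture + buffer + network + playback.
    Network = fibre propagation + per-hop switching overhead.
    Default component values are illustrative assumptions."""
    propagation_ms = fibre_km * 1e3 / FIBRE_SPEED * 1e3
    network_ms = propagation_ms + hops * per_hop_ms
    return capture_ms + buffer_ms + network_ms + playback_ms

# A hypothetical 700 km fibre route with 10 switching hops:
print(f"{one_way_latency_ms(700, 10):.2f} ms one-way")
# -> 8.70 ms one-way
```

Note that even with zero network distance the endpoint overheads alone consume several milliseconds of the 20–30 ms budget, which is why capture and buffer settings matter as much as routing.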
<hr>
<h2 id="what-we-measured-and-how">What We Measured and How</h2>
<p><strong>Network RTT.</strong> ICMP ping, 1,000 packets per run. We report the median as a
robust summary; the mean is too sensitive to the occasional rogue packet.</p>
<p><strong>End-to-end audio latency.</strong> An audio signal-loop: transmit a test signal
from site A to site B, have site B return it immediately, estimate round-trip
delay by cross-correlation. One-way latency = signal-loop RTT / 2. This method
captures local processing and buffering at both ends in addition to the network
leg, which is what actually matters for a musician.</p>
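<p>The cross-correlation step can be sketched in a few lines of NumPy (a toy reconstruction of the idea, not our actual measurement code; the signal names and parameters are invented):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 48_000                                    # sample rate in Hz

# Test signal: a burst of white noise (sharp autocorrelation peak)
probe = rng.standard_normal(fs // 10)          # 100 ms probe

# Simulate the signal loop: the probe comes back delayed by the round trip
true_rtt_samples = 480                         # 10 ms at 48 kHz
returned = np.concatenate([np.zeros(true_rtt_samples), probe,
                           np.zeros(fs // 10)])
returned += 0.1 * rng.standard_normal(returned.size)   # measurement noise

# Cross-correlate: the lag of the correlation peak is the round-trip delay
lag = np.argmax(np.correlate(returned, probe, mode="valid"))
rtt_ms = lag / fs * 1e3
print(f"estimated RTT: {rtt_ms:.2f} ms, one-way: {rtt_ms / 2:.2f} ms")
```

A white-noise probe is convenient here because its autocorrelation is close to a delta function, so the peak stays unambiguous even with moderate noise on the return leg.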
<p><strong>Video latency.</strong> Component-based estimation (capture frame cadence +
processing pipeline + display). We did not have a frame-accurate video
loopback method, so treat these numbers as estimates rather than precision
measurements. That caveat matters less than it might seem because, as you will
see, video was always slower than audio by a wide enough margin that it did not
drive the operational decisions.</p>
<p><strong>Firewall impact.</strong> A controlled 4-hour session on the Cologne–Vienna link,
alternating between a DMZ configuration (direct research-backbone access) and
a transparent enterprise firewall, logging packet loss and decoder instability.</p>
<p>Six partner institutions, air distances from 175 to 1,655 km, measurements
collected between October 2020 and March 2023.</p>
<hr>
<h2 id="results">Results</h2>
<h3 id="audio-latency">Audio latency</h3>
<table>
  <thead>
      <tr>
          <th>Partner (from Cologne)</th>
          <th>Air distance (km)</th>
          <th>Median RTT (ms)</th>
          <th>One-way audio latency (ms)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prague</td>
          <td>535</td>
          <td>5.0</td>
          <td>7.5</td>
      </tr>
      <tr>
          <td>Vienna</td>
          <td>745</td>
          <td>7.0</td>
          <td>9.5</td>
      </tr>
      <tr>
          <td>Detmold</td>
          <td>175</td>
          <td>7.5</td>
          <td>10.0</td>
      </tr>
      <tr>
          <td>Trieste</td>
          <td>775</td>
          <td>10.0</td>
          <td>12.5</td>
      </tr>
      <tr>
          <td>Rome</td>
          <td>1,090</td>
          <td>17.5</td>
          <td>20.0</td>
      </tr>
      <tr>
          <td>Tallinn</td>
          <td>1,465</td>
          <td>19.5</td>
          <td>22.0–22.5</td>
      </tr>
  </tbody>
</table>
<p>The number that jumps out immediately: <strong>Detmold (175 km away) has higher
latency than Vienna (745 km away).</strong> This is a routing issue, not a physics
one. The Detmold link was traversing a less efficient campus path that added
extra hops before reaching the research backbone. Prague, by contrast, was
connected via a particularly short routing path and achieved the lowest latency
of any link despite not being the geographically closest.</p>
<p>The practical implication: geographic distance is a poor predictor of
achievable latency. Measure RTT; do not estimate from a map.</p>
<h3 id="video-latency">Video latency</h3>
<p>Estimated one-way video latency was 20–35 ms across all configurations,
with the dominant contributions coming from frame cadence (at 60 fps, you wait
up to 16.7 ms for a frame to be captured regardless of what the network is
doing) and buffering at the decoder. In every deployment, video consistently
lagged audio. Musicians unsurprisingly fell back on audio for synchronisation
and treated video as a supplementary cue — useful for expressive and social
information, not for timing.</p>
<h3 id="the-firewall-experiment">The firewall experiment</h3>
<p>This is the result I find most important for anyone planning a similar
deployment.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DMZ (no firewall)</th>
          <th>With enterprise firewall</th>
          <th>Change</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dropped audio packets</td>
          <td>0.002%</td>
          <td>0.052%</td>
          <td>+26×</td>
      </tr>
      <tr>
          <td>Audio buffer realignments/hour</td>
          <td>0.3</td>
          <td>3.9</td>
          <td>+13×</td>
      </tr>
      <tr>
          <td>Dropped video frames</td>
          <td>0.04%</td>
          <td>0.74%</td>
          <td>+18×</td>
      </tr>
      <tr>
          <td>Additional latency</td>
          <td>—</td>
          <td>0.5–1.0 ms</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<p>The raw latency increase (0.5–1.0 ms) is small and largely irrelevant. The
packet loss and buffer event increases are not. A 26-fold increase in dropped
audio packets on an otherwise uncongested link means the firewall is doing
something — likely deep packet inspection or stateful tracking — that
introduces enough irregularity to destabilise small audio buffers. This forces
you to either accept dropouts or increase buffer size, and increasing buffer
size increases latency.</p>
<p>The message is: if your institution requires traffic inspection for
security policy compliance, you are paying a latency tax that is more about
<em>stability</em> than the raw delay number, and that tax is substantial.</p>
<hr>
<h2 id="discussion">Discussion</h2>
<p>Based on the measured latencies and reported musical tolerances from the
literature, I would roughly characterise the links as follows:</p>
<ul>
<li>
<p><strong>Prague, Vienna, Detmold, Trieste (7.5–12.5 ms):</strong> Compatible with
most repertoire including rhythmically demanding chamber music.
Musicians in our sessions reported the interaction as &ldquo;natural&rdquo; or
&ldquo;like being in the same room&rdquo; at these latencies.</p>
</li>
<li>
<p><strong>Rome (20 ms):</strong> Usable with attention to repertoire and tempo.
Slower movements and music where tight rhythmic locking is not the
primary aesthetic concern work well. Rhythmically dense passages at
fast tempi become harder.</p>
</li>
<li>
<p><strong>Tallinn (22–22.5 ms):</strong> At the upper edge of the comfortable range.
Still usable — we ran a concert collaboration in March 2023 — but
musicians adapt their interaction strategies, leaning more on musical
anticipation than reactive synchronisation.</p>
</li>
</ul>
<p>What is notably absent from this data: anything outside the European
research-network context. All six links ran on GÉANT or national backbone
equivalents with favourable jitter characteristics. The numbers almost
certainly do not transfer directly to commodity internet, satellite links, or
mixed-topology paths.</p>
<p><strong>Limitations I want to be explicit about.</strong> The video latency estimates are
component-based, not directly measured, so treat that 20–35 ms range with
appropriate scepticism. The firewall comparison is a single 4-hour session on
a single link; I would not want to extrapolate too aggressively to other
firewall vendors or configurations. And this is an operational measurement
study, not a controlled perceptual experiment — I cannot tell you from this
data at precisely what latency threshold a given ensemble will declare a
session unusable, because that depends on the music, the musicians, and
factors I did not measure.</p>
<hr>
<h2 id="practical-takeaways">Practical Takeaways</h2>
<p>For anyone setting up a similar system:</p>
<ol>
<li><strong>Measure RTT before committing to a partner institution.</strong> A 100 km
difference in air distance can easily be swamped by routing differences.</li>
<li><strong>Get DMZ placement if at all possible.</strong> The firewall results suggest
this matters more than any other single configuration decision.</li>
<li><strong>Minimise campus hops between your endpoint and the research backbone.</strong>
Each additional switching layer adds jitter risk.</li>
<li><strong>Use small audio buffers and monitor for underruns.</strong> If your baseline
RTT is good, your buffer can be small; if underruns increase, that is an
early warning that network stability is degrading before packet loss
becomes audible.</li>
<li><strong>Accept that video will lag audio and design your session accordingly.</strong>
This is not a system failure; it is a consequence of how video pipelines
work at low latency. Plan for it.</li>
</ol>
<hr>
<h2 id="references">References</h2>
<p>Carôt, A. (2011). Low latency audio streaming for Internet-based musical
interaction. <em>Advances in Multimedia and Interactive Technologies</em>.
<a href="https://doi.org/10.4018/978-1-61692-831-5.ch015">https://doi.org/10.4018/978-1-61692-831-5.ch015</a></p>
<p>Drioli, C., Allocchio, C., &amp; Buso, N. (2013). Networked performances and
natural interaction via LOLA. <em>LNCS</em>, 7990, 240–250.
<a href="https://doi.org/10.1007/978-3-642-40050-6_21">https://doi.org/10.1007/978-3-642-40050-6_21</a></p>
<p>Medina Victoria, A. (2019). <em>A method for the measurement of the latency
tolerance range of Western musicians</em>. Ph.D. dissertation, Cork Institute
of Technology (now Munster Technological University).</p>
<p>Rottondi, C., Chafe, C., Allocchio, C., &amp; Sarti, A. (2016). An overview on
networked music performance technologies. <em>IEEE Access</em>, 4, 8823–8843.
<a href="https://doi.org/10.1109/ACCESS.2016.2628440">https://doi.org/10.1109/ACCESS.2016.2628440</a></p>
<p>Tsioutas, K. &amp; Xylomenos, G. (2021). On the impact of audio characteristics
to the quality of musicians experience in network music performance. <em>JAES</em>,
69(12), 914–923. <a href="https://doi.org/10.17743/jaes.2021.0041">https://doi.org/10.17743/jaes.2021.0041</a></p>
<p>Ubik, S., Halak, J., Kolbe, M., Melnikov, J., &amp; Frič, M. (2021). Lessons
learned from distance collaboration in live culture. <em>AISC</em>, 1378, 608–615.
<a href="https://doi.org/10.1007/978-3-030-74009-2_77">https://doi.org/10.1007/978-3-030-74009-2_77</a></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-20</strong>: Updated the Drioli et al. (2013) LNCS volume number to 7990 (ECLAP 2013 proceedings). Updated the Ubik et al. (2021) AISC volume number to 1378 and page range to 608–615. Updated the fifth author&rsquo;s surname to &ldquo;Frič.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Milky Way Is a Gravitational Wave Detector</title>
      <link>https://sebastianspicker.github.io/posts/nanograv-pulsar-timing-gravitational-wave-background/</link>
      <pubDate>Fri, 07 Jul 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/nanograv-pulsar-timing-gravitational-wave-background/</guid>
      <description>LIGO uses 4-kilometre laser arms to detect gravitational waves at hundreds of hertz. Pulsar timing arrays use millisecond pulsars scattered across the Milky Way — arms measured in light-years — to detect gravitational waves at nanohertz frequencies, ten orders of magnitude lower. In June 2023, five independent pulsar timing arrays simultaneously announced the detection of a stochastic gravitational wave background. The Milky Way itself was the detector.</description>
      <content:encoded><![CDATA[<p>On 28–29 June 2023, four independent research collaborations published papers simultaneously: NANOGrav (North American Nanohertz Observatory for Gravitational Waves), the EPTA (European Pulsar Timing Array), the PPTA (Parkes Pulsar Timing Array, Australian), and InPTA (Indian Pulsar Timing Array). Each announced essentially the same result: evidence for a stochastic gravitational wave background at nanohertz frequencies. The simultaneous coordinated publication is itself telling — in physics, that kind of coordination usually signals that each group needed the others to make the claim credible. If one group had published alone, the community would have been skeptical. Four independent datasets saying the same thing is a different matter.</p>
<p>The detector for all four collaborations was not a machine. It was the Milky Way — specifically, a collection of millisecond pulsars scattered across the galaxy, separated by thousands of light-years. The arms of this instrument make LIGO look microscopic by comparison, and they are pointed in every direction at once.</p>
<p>There is a professional habit, well established in physics, of finding the scale of things either comforting or terrifying. This one is genuinely awe-inspiring.</p>
<h2 id="the-scale-problem">The scale problem</h2>
<p>To detect a gravitational wave, your instrument needs to be roughly comparable in size to the wavelength you are trying to observe. This is not a hard rule — LIGO&rsquo;s 4-km arms are much shorter than the wavelengths it detects — but the sensitivity of any interferometer scales with arm length, so the constraint matters practically even when it is not absolute.</p>
<p>LIGO detects gravitational waves in the frequency range of roughly 10–1000 Hz. A wave at 100 Hz has a wavelength of</p>
$$\lambda = \frac{c}{f} = \frac{3 \times 10^8 \text{ m/s}}{100 \text{ Hz}} \approx 3000 \text{ km},$$<p>about a quarter of the Earth&rsquo;s diameter. LIGO&rsquo;s 4-km Fabry-Pérot arm cavities are short compared to the wavelength, but the cavities fold the light hundreds of times to achieve an effective path length of about 1600 km, and the interferometer is exquisitely sensitive to differential arm-length changes of order $10^{-19}$ m — about one ten-thousandth of a proton radius. The physics is heroic. But it works at the scale of stellar-mass binary mergers: neutron star–neutron star, neutron star–black hole, stellar-mass black hole–black hole. Those systems radiate at audible frequencies. LIGO famously converts its strain signals to sound, and the chirps do sound like something from a science fiction film.</p>
<p>Now consider pulsar timing arrays. The targets are nanohertz gravitational waves: frequencies of order $f \sim 10^{-9}$ to $10^{-8}$ Hz. At $10^{-9}$ Hz, the wavelength is</p>
$$\lambda = \frac{c}{f} \approx \frac{3 \times 10^8 \text{ m/s}}{10^{-9} \text{ Hz}} = 3 \times 10^{17} \text{ m} \approx 10 \text{ pc} \approx 33 \text{ light-years}.$$<p>To detect oscillations at these frequencies, you need arms measured in light-years. Tens to thousands of light-years. The nearest millisecond pulsars are a few hundred light-years away; the most distant ones used in PTA arrays are several thousand light-years distant. By accident of the galaxy&rsquo;s size and the distribution of recycled pulsars within it, we happen to live inside an instrument of approximately the right dimensions to detect nanohertz gravitational waves. The galaxy did not plan this. We got lucky.</p>
<h2 id="millisecond-pulsars-as-cosmic-clocks">Millisecond pulsars as cosmic clocks</h2>
<p>A gravitational wave detector is only as good as its clock. LIGO measures differential length changes in its arm cavities using the interference of laser light — effectively using the constancy of the speed of light as a metronome. Pulsar timing arrays use a different clock: the rotation of a neutron star.</p>
<p>Ordinary pulsars are the collapsed remnants of massive stars, born in supernova explosions. A neutron star of roughly 1.4 solar masses is compressed into a sphere about 10 km across, rotating rapidly and radiating beams of radio waves from near its magnetic poles. The rotation periods of young pulsars are typically of order seconds, and they spin down on timescales of millions of years as they lose rotational energy to magnetic dipole radiation. They are not particularly stable clocks — the spin-down is uneven, and they occasionally &ldquo;glitch,&rdquo; suddenly spinning up by a small amount.</p>
<p>Millisecond pulsars (MSPs) are a different beast entirely. They have been <em>recycled</em>: spun up to near-millisecond rotation periods by accreting matter from a binary companion star. The accretion process deposits angular momentum onto the neutron star, increasing its spin rate dramatically, while simultaneously burying and weakening its magnetic field. The typical MSP has a surface magnetic field of order $10^8$–$10^9$ G, three to four orders of magnitude weaker than a young pulsar&rsquo;s $10^{11}$–$10^{13}$ G. Since the spin-down torque scales as $B^2$, the weaker field means the MSP loses rotational energy far more slowly. Once the accretion stops — when the companion has exhausted its transferable mass — the MSP is left spinning rapidly and stably, with a spin-down rate of order $10^{-20}$ s/s.</p>
<p>The result is rotational stability competitive with terrestrial atomic clocks. PSR J0437$-$4715, one of the best-timed MSPs, has a rotation period of $P \approx 5.76$ ms and a timing residual — the scatter of individual pulse arrival times around the best-fit timing model — of order 100 ns over decades of observation. For a pulsar completing about 174 rotations per second, a residual of 100 ns over a baseline of years is remarkable. The fractional frequency stability is $\delta P / P \sim 10^{-14}$ or better. These are not merely good clocks; they are among the most stable periodic phenomena known to physics.</p>
<p>The timing model accounts for everything we know about the pulsar&rsquo;s environment: its spin period and spin-down rate, its proper motion across the sky, parallax (from which we get the distance), Shapiro delay from any companions, dispersion measure variations in the interstellar medium, and more. After subtracting all modelled effects, what remains are timing residuals — small, unexplained deviations in the pulse arrival times. If a gravitational wave passes through, it will appear in those residuals.</p>
<h2 id="how-a-gravitational-wave-shifts-a-pulse-arrival-time">How a gravitational wave shifts a pulse arrival time</h2>
<p>A gravitational wave is a propagating perturbation of the spacetime metric. In the transverse-traceless (TT) gauge, a wave propagating along the $z$-direction perturbs the metric as</p>
$$ds^2 = -c^2 dt^2 + \bigl(1 + h_+\bigr)dx^2 + \bigl(1 - h_+\bigr)dy^2 + 2h_\times \, dx \, dy + dz^2,$$<p>where $h_+$ and $h_\times$ are the two polarisation amplitudes, both functions of $t - z/c$. As the wave passes, the proper distances in the $x$- and $y$-directions oscillate: space itself is being stretched and squeezed at the wave frequency.</p>
<p>A radio pulse travelling from a pulsar to Earth traverses this oscillating spacetime. The proper path length changes, and so does the travel time. The timing residual induced in a pulsar in direction $\hat{n}$ by a gravitational wave with wavevector $\hat{k}$ is</p>
$$R(t) \propto \int_0^t dt' \, \frac{\Delta\nu(t')}{\nu} \propto \int_0^t dt' \, h(t', \hat{n}, \hat{k}),$$<p>where $h$ is an appropriate contraction of the metric perturbation with the geometry of the Earth-pulsar baseline. The key point: the timing residual is the <em>time integral</em> of the metric strain. For a sinusoidal wave at frequency $f$, the residual oscillates at the same frequency with amplitude of order $h/(2\pi f)$: a strain of $10^{-15}$ at $3 \times 10^{-9}$ Hz induces residuals of only tens of nanoseconds, which is partly why PTA analysis is technically demanding.</p>
<p>For a single pulsar, a timing residual tells you that <em>something</em> disturbed the spacetime between Earth and the pulsar — but it could be a systematic in the timing model, interstellar medium fluctuations, or intrinsic pulsar noise. You cannot claim a gravitational wave detection from one pulsar alone. What you need is the correlation between many pulsars.</p>
<h2 id="the-hellings-downs-curve">The Hellings-Downs curve</h2>
<p>Here is the central idea of pulsar timing array science. Consider two pulsars in directions $\hat{n}_a$ and $\hat{n}_b$, separated on the sky by an angle $\theta$ such that $\cos\theta = \hat{n}_a \cdot \hat{n}_b$. Both are embedded in the same stochastic gravitational wave background — a superposition of waves arriving from all directions, at all frequencies in the nanohertz band, with random phases and amplitudes. The timing residuals of both pulsars will be perturbed by this background. The question is: what is the expected cross-correlation between their residuals?</p>
<p>Hellings and Downs (<a href="#ref-HellingsDowns1983">1983</a>) computed this, assuming the background is isotropic (equal power from all directions), unpolarised, and stationary. The answer is now called the Hellings-Downs (HD) curve:</p>
$$\Gamma(\theta) = \frac{3}{2} \left(\frac{1-\cos\theta}{2}\right) \ln\!\left(\frac{1-\cos\theta}{2}\right) - \frac{1}{4}\left(\frac{1-\cos\theta}{2}\right) + \frac{1}{2}.$$<p>Let me unpack the features of this function:</p>
<ul>
<li>At $\theta = 0$ (the same pulsar, or two pulsars in the same direction): $\Gamma(0) = 1/2$, maximum positive correlation. This makes sense — both pulsars see the same wave.</li>
<li>At $\theta \approx 50°$: the curve crosses zero and turns negative.</li>
<li>At $\theta = \pi/2$ (pulsars at right angles): $\Gamma(\pi/2) \approx -0.145$, close to the minimum of the curve.</li>
<li>At $\theta = \pi$ (antipodal pulsars, opposite directions on the sky): $\Gamma(\pi) = 1/4$. Positive correlation even for pulsars in opposite directions — counterintuitive but correct.</li>
<li>In between, the curve dips negative (anticorrelated) for angles roughly $50°$–$120°$.</li>
</ul>
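<p>The curve is simple enough to evaluate directly; this short snippet of mine reproduces the landmark values above:</p>

```python
import math

def hellings_downs(theta):
    """Expected cross-correlation of timing residuals for two pulsars
    separated by angle theta (radians), normalised so Gamma(0) = 1/2."""
    x = (1.0 - math.cos(theta)) / 2.0
    if x == 0.0:                      # x * ln(x) -> 0 as theta -> 0
        return 0.5
    return 1.5 * x * math.log(x) - x / 4.0 + 0.5

print(f"Gamma(0)    = {hellings_downs(0.0):+.3f}")               # +0.500
print(f"Gamma(50°)  = {hellings_downs(math.radians(50)):+.3f}")  # ~ 0 (zero crossing)
print(f"Gamma(90°)  = {hellings_downs(math.pi / 2):+.3f}")       # -0.145
print(f"Gamma(180°) = {hellings_downs(math.pi):+.3f}")           # +0.250
```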
<p>The shape is uniquely <em>quadrupolar</em>. It arises directly from the spin-2 tensor nature of gravitational waves. A scalar perturbation (like a monopole clock error common to all pulsars — such as an error in the terrestrial time standard used to timestamp the observations) would produce a flat, angle-independent correlation. A dipole perturbation (like an error in Earth&rsquo;s ephemeris, or a systematic in our knowledge of Earth&rsquo;s position) would produce a dipolar, $\cos\theta$ pattern. Only spin-2 tensor radiation produces the Hellings-Downs shape.</p>
<p>This is why the HD curve is the smoking gun. If you observe cross-correlations between pulsar timing residuals that match this specific, non-trivial curve as a function of angular separation — something that dips negative around 90° and recovers to a positive value at 180° — you have direct evidence that a tensor gravitational wave background is responsible. No other known astrophysical systematic produces this pattern.</p>
<p>The difficulty is statistical. You need enough pulsars, spread over a wide range of sky angles, each timed with sufficient precision, over a long enough baseline, that you can measure this correlation function with confidence. That has been the programme of the PTA collaborations since the 1990s. In June 2023, they had enough.</p>
<h2 id="the-2023-evidence">The 2023 evidence</h2>
<p>The NANOGrav 15-year dataset (<a href="#ref-Agazie2023">Agazie et al., 2023</a>) comprises 68 millisecond pulsars observed for up to 15 years, with an average of roughly 2200 timing observations per pulsar. The dataset represents an enormous investment of telescope time — primarily at the Arecibo Observatory (until its collapse in December 2020) and the Green Bank Telescope.</p>
<p>The analysis found an excess of low-frequency noise common to many pulsars, consistent with a power-law spectral shape. More importantly, when the cross-correlations between all pairs of pulsars were computed and binned by angular separation, the result was consistent with the Hellings-Downs curve. The statistical significance of the HD correlation — that is, the evidence that the spatial correlation pattern matches the quadrupolar prediction rather than some isotropic or zero-correlation model — was 3–4$\sigma$ depending on the analysis method and prior assumptions. The collaboration carefully described this as &ldquo;evidence for&rdquo; rather than &ldquo;detection of&rdquo; a gravitational wave background, following community conventions (detection would conventionally require $\geq 5\sigma$).</p>
<p>Simultaneously, the EPTA published its second data release (<a href="#ref-Antoniadis2023">Antoniadis et al., 2023</a>), the PPTA published its results (<a href="#ref-Reardon2023">Reardon et al., 2023</a>), and InPTA contributed its analysis. A fifth collaboration, the Chinese Pulsar Timing Array (CPTA), also published consistent results (<a href="#ref-Xu2023">Xu et al., 2023</a>). Each group used independent datasets, different telescopes, different software pipelines, different statistical methodologies. All found the same thing.</p>
<p>The fact that five groups independently recovered a consistent signal with approximately the right spectral shape and approximately the correct spatial correlations is the argument for reality. Any single group&rsquo;s result could be explained by a systematic error in their data or analysis. Five groups with independent data and methods converging on the same result is much harder to explain as coincidence.</p>
<p>The combined interpretation is clear: something is producing a stochastic background of nanohertz gravitational waves permeating the galaxy, with spatial correlations consistent with the tensor quadrupole signature predicted by general relativity. The Milky Way is a gravitational wave detector, and it has measured something.</p>
<h2 id="what-is-making-the-noise">What is making the noise?</h2>
<p>The million-dollar question. The observed signal has a characteristic spectral shape: the power spectral density of the timing residuals scales approximately as $S(f) \propto f^{-13/3}$, or equivalently, the gravitational wave energy density spectrum</p>
$$\Omega_{\rm GW}(f) \propto f^{2/3}.$$<p>This $f^{2/3}$ scaling is the expected spectrum for an ensemble of circular binary systems driven purely by gravitational wave emission — specifically, a population of supermassive black hole binaries (SMBHBs) distributed across a wide range of masses and redshifts.</p>
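<p>The two scalings are the same statement in different units. Writing the characteristic strain of a population of GW-driven circular binaries in the standard form $h_c(f) \propto f^{-2/3}$, the residual power spectrum and the energy density follow directly from the usual conversions $S(f) \propto h_c^2(f)/f^3$ and $\Omega_{\rm GW}(f) \propto f^2 h_c^2(f)$:</p>
$$S(f) \propto \frac{h_c^2(f)}{f^3} \propto f^{-4/3-3} = f^{-13/3}, \qquad \Omega_{\rm GW}(f) \propto f^2\, h_c^2(f) \propto f^{2-4/3} = f^{2/3}.$$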
<p>Here is the astrophysical picture. When two massive galaxies merge — an event that happens billions of times over cosmic history — their central supermassive black holes (each typically $10^7$–$10^{10}$ solar masses) do not immediately merge. First, they sink toward the centre of the merger remnant by dynamical friction against the stellar background, forming a loosely bound binary on scales of a parsec or so. The binary then hardens: three-body interactions with individual stars passing close to the binary extract orbital energy, driving the pair to smaller separations. Eventually, when the binary has hardened to a separation where gravitational wave emission dominates the energy loss, the pair inspirals and merges on a timescale of millions to billions of years.</p>
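<p>The strength of that final gravitational-wave-driven stage can be sketched with the Peters (1964) coalescence time for a circular binary, $t = 5c^5 a^4 / \left(256\, G^3 m_1 m_2 (m_1+m_2)\right)$. The masses and separations below are illustrative values I have chosen, not figures from the collaborations; the point is the steep $a^4$ dependence, which is why everything hinges on whether the binary can harden below roughly a parsec:</p>

```python
import math

# Peters (1964) circular-orbit coalescence time:
#   t = 5 c^5 a^4 / (256 G^3 m1 m2 (m1 + m2))
# SI constants; the binary parameters below are illustrative only.
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8          # speed of light, m/s
M_SUN = 1.989e30     # solar mass, kg
PC = 3.086e16        # parsec, m
YR = 3.156e7         # year, s

def coalescence_time_yr(m1_msun: float, m2_msun: float, a_pc: float) -> float:
    """Time for a circular binary to merge by GW emission alone, in years."""
    m1, m2 = m1_msun * M_SUN, m2_msun * M_SUN
    a = a_pc * PC
    return 5 * C**5 * a**4 / (256 * G**3 * m1 * m2 * (m1 + m2)) / YR

# An equal-mass 10^9 + 10^9 solar-mass binary at decreasing separations:
for a_pc in (1.0, 0.1, 0.01):
    print(f"a = {a_pc:5.2f} pc: t ~ {coalescence_time_yr(1e9, 1e9, a_pc):.1e} yr")
```

At a parsec the coalescence time exceeds the age of the universe; two orders of magnitude closer in, it drops by eight orders of magnitude. Stellar scattering (or gas) has to bridge that gap, which is the "final parsec problem" in miniature.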
<p>The incoherent superposition of gravitational wave emission from the many billions of SMBHB systems across the observable universe — at all masses, at all orbital frequencies, at all redshifts — produces a stochastic background. It is, in a sense, cosmic traffic noise: no individual merger is detectable as a discrete event, but the combined hum from all of them is. The predicted spectral amplitude and shape from this population are broadly consistent with the observed signal, though the uncertainties on the astrophysical model are large enough that the agreement is not a precise test.</p>
<p>Alternative sources are possible. A first-order phase transition in the early universe (such as a QCD or electroweak phase transition) would produce a background of gravitational waves with a different spectral shape — more peaked and potentially harder than the SMBHB prediction. A network of cosmic strings, topological defects from symmetry-breaking phase transitions in the early universe, would produce yet another spectrum, approximately flat in $\Omega_{\rm GW}(f)$. Primordial gravitational waves from inflation are expected at far lower frequencies, near the CMB scale, but with a nearly scale-invariant spectrum that could contribute at nanohertz frequencies in some models. The 2023 data are most consistent with the SMBHB interpretation, but cannot rule out contributions from early-universe sources — or combinations of both. Future data with longer baselines and more pulsars will tighten the spectral measurements and may reveal deviations from the SMBHB prediction.</p>
<p>There is also the tantalising prospect of eventually resolving individual SMBHB systems above the stochastic background — the gravitational wave equivalent of resolving individual radio sources above the extragalactic background. No individual SMBHB has yet been identified in PTA data, but the 15-year NANOGrav dataset already places interesting upper limits on the most massive known candidate systems.</p>
<h2 id="multi-frequency-gravitational-wave-astronomy">Multi-frequency gravitational wave astronomy</h2>
<p>We are now, unambiguously, in the era of multi-frequency gravitational wave astronomy. The universe produces gravitational waves across an enormous span of frequencies, and different frequency windows probe fundamentally different source populations:</p>
<ul>
<li>
<p><strong>Primordial / CMB B-modes</strong> (~$10^{-17}$ Hz): Wavelengths comparable to the Hubble scale. Primordial tensor perturbations from inflation would imprint a distinctive B-mode polarisation pattern in the cosmic microwave background. No confirmed detection yet; the BICEP/Keck programme and future CMB experiments (LiteBIRD, CMB-S4) are sensitive to this regime.</p>
</li>
<li>
<p><strong>Pulsar timing arrays</strong> (~$10^{-9}$–$10^{-8}$ Hz): The nanohertz band, just accessed by NANOGrav and partners. Sources: SMBHB inspirals, possibly early-universe phase transitions and cosmic strings. Arm length: thousands of light-years.</p>
</li>
<li>
<p><strong>LISA</strong> (~$10^{-4}$–$10^{-1}$ Hz): The millihertz band. The LISA space interferometer (planned launch in the 2030s) will have arm lengths of 2.5 million km and will detect SMBHB mergers directly as they happen, stellar-mass compact binary inspirals years before their LIGO-band merger, and possibly signals from extreme mass-ratio inspirals. LISA will see the SMBHB sources that PTAs see as a stochastic background, but in their final years of inspiral.</p>
</li>
<li>
<p><strong>LIGO/Virgo/KAGRA</strong> (~10–$10^3$ Hz): The audible band. Stellar-mass black hole and neutron star mergers. Over 90 confirmed events as of the end of O3, with many more candidates in O4. Source masses: $\sim$1–100 $M_\odot$.</p>
</li>
</ul>
<p>This is the gravitational wave analogue of multi-wavelength astronomy. Just as the universe looks completely different in radio, infrared, optical, X-ray, and gamma-ray light — each wavelength band revealing different physical processes and source populations — the universe sounds completely different at each gravitational wave frequency. PTAs hear the rumble of cosmic structure formation; LISA will hear the whisper of the final million years before the most massive black hole mergers; LIGO hears the sharp crack of stellar-mass collisions.</p>
<p>The 2023 announcements represent the opening of the nanohertz window. We have gone from one gravitational wave frequency band (LIGO/Virgo) to two. The next decade, with LISA launching and PTAs continuing to accumulate data, will see the opening of a third.</p>
<h2 id="a-brief-detour-ligos-o4-run-and-the-mass-gap-object">A brief detour: LIGO&rsquo;s O4 run and the mass-gap object</h2>
<p>While the PTA collaborations were making their June 2023 announcement, LIGO&rsquo;s fourth observing run (O4, which ran from May 2023 to June 2025) was proceeding at an extraordinary rate. The upgraded detectors were detecting candidate gravitational wave events at roughly one every two to three days — over 200 candidates across the full run. Gravitational wave detection is now production science rather than exploration.</p>
<p>Among the most scientifically interesting events was GW230529, detected on 29 May 2023 and published by the LIGO Scientific Collaboration (<a href="#ref-Abbott2024">Abbott et al., 2024</a>). The signal is consistent with the merger of a neutron star with a compact object whose mass was measured to be approximately $2.5$–$4.5\ M_\odot$. This mass range sits squarely in what theorists call the &ldquo;mass gap&rdquo; — the range between the heaviest neutron stars ($\lesssim 2.3\ M_\odot$, though this upper limit is uncertain) and the lightest stellar-mass black holes inferred from X-ray binaries ($\gtrsim 5\ M_\odot$).</p>
<p>Whether GW230529&rsquo;s companion was the heaviest neutron star ever observed, or the lightest black hole, is genuinely unknown. The distinction matters enormously for nuclear physics: if it is a neutron star, it constrains the nuclear equation of state at supranuclear densities. If it is a black hole, it means the mass gap is narrower than X-ray observations suggested, and our understanding of compact object formation needs revision. Gravitational wave observations alone cannot distinguish a rapidly spinning heavy neutron star from a slowly spinning light black hole without additional electromagnetic counterpart observations, and no counterpart was found for GW230529. This question will likely not be settled until we have electromagnetic constraints from similar systems, or until we accumulate enough mass-gap events to understand the population statistically.</p>
<h2 id="the-instrument-we-did-not-build">The instrument we did not build</h2>
<p>I want to return to the pulsar timing array concept, because I think it deserves more than a passing technical description. The idea is this: nature has distributed a set of extremely stable clocks across the galaxy. We did not put them there. We did not design them. We simply discovered that neutron stars, after a particular evolutionary pathway involving mass transfer from a binary companion, achieve a rotational stability that happens to be sufficient to detect perturbations of spacetime at cosmological scales.</p>
<p>The instrument is the galaxy itself — or rather, our ability to model it. We build a timing model for each pulsar: a comprehensive description of every known effect that influences pulse arrival times. We subtract the model. What remains, the residuals, contains the signal we cannot yet explain. We cross-correlate the residuals of 68 (or 25, or 57, depending on the collaboration) pulsars in pairs, compute the correlation as a function of angular separation, and compare to the Hellings-Downs prediction.</p>
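<p>The Hellings-Downs prediction itself is remarkably simple: a closed-form function of the angular separation alone. A minimal sketch, using the standard normalisation in which the expected correlation for two distinct, coincident pulsars is 0.5:</p>

```python
import math

def hellings_downs(theta: float) -> float:
    """Expected cross-correlation between the timing residuals of two
    pulsars separated by angle theta (radians) on the sky, for an
    isotropic GW background (Hellings & Downs 1983). Normalised so that
    two distinct pulsars at zero separation correlate at 0.5."""
    x = (1.0 - math.cos(theta)) / 2.0
    xlogx = x * math.log(x) if x > 0 else 0.0  # x ln x -> 0 as x -> 0
    return 1.5 * xlogx - x / 4.0 + 0.5

# The characteristic quadrupolar shape: positive at small separations,
# anticorrelated near 90 degrees, positive again at 180 degrees.
for deg in (0, 60, 90, 120, 180):
    print(f"{deg:3d} deg: {hellings_downs(math.radians(deg)):+.3f}")
```

Binning the measured pair correlations by angular separation and overplotting this curve is, in essence, the figure that made the 2023 announcements.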
<p>The engineering challenge is not building the detector. It is characterising it. Understanding the noise. Modelling the interstellar medium, which disperses radio pulses in a frequency-dependent way and varies as the pulsar moves through clouds of ionised gas. Accounting for clock errors in the terrestrial time standards used to timestamp observations. Dealing with instrumental noise in each of the different radio telescopes that contribute data. Building a Bayesian framework that can simultaneously fit the timing model parameters, the pulsar noise properties, and the GWB parameters for dozens of pulsars.</p>
<p>This is painstaking, years-long work. The 15-year NANOGrav dataset reflects something like 600 pulsar-years of observation. The detection is earned.</p>
<p>The Hellings-Downs correlation — the specific pattern that emerged from those residuals, consistent with the quadrupolar fingerprint of general relativity&rsquo;s spin-2 gravitational waves — is one of the more beautiful results I have seen in recent astrophysics. It is a direct measurement of the tensor nature of gravity, at frequency scales eleven orders of magnitude below anything LIGO can access, using a detector assembled by the Milky Way over the course of 10 billion years of stellar evolution and galaxy mergers.</p>
<p>We are in an age of gravitational wave astronomy. I find that remarkable.</p>
<p>If you are interested in the broader theme of using astronomical observations as physics experiments rather than just cataloguing the sky, the posts on <a href="/posts/the-gift-of-transits/">transit photometry and the gift of transits</a> and on <a href="/posts/exoplanet-hunting-smartphones/">smartphone-based exoplanet observations</a> cover similar ground at different scales — planetary radii measured in units of stellar radii, by timing the dimming of a star as a planet crosses its face. The underlying logic is the same: precision timing plus a physical model plus statistics equals a measurement of something you could not directly touch.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-HellingsDowns1983"></span>Hellings, R. W., &amp; Downs, G. S. (1983). Upper limits on the isotropic gravitational radiation background from pulsar timing analysis. <em>The Astrophysical Journal Letters</em>, 265, L39–L42. <a href="https://doi.org/10.1086/183954">DOI: 10.1086/183954</a></p>
</li>
<li>
<p><span id="ref-Agazie2023"></span>Agazie, G., et al. (NANOGrav Collaboration). (2023). The NANOGrav 15 yr Data Set: Evidence for a Gravitational-wave Background. <em>The Astrophysical Journal Letters</em>, 951, L8. <a href="https://doi.org/10.3847/2041-8213/acdac6">DOI: 10.3847/2041-8213/acdac6</a></p>
</li>
<li>
<p><span id="ref-Antoniadis2023"></span>Antoniadis, J., et al. (EPTA Collaboration). (2023). The second data release from the European Pulsar Timing Array — III. Search for gravitational wave signals. <em>Astronomy &amp; Astrophysics</em>, 678, A50. <a href="https://doi.org/10.1051/0004-6361/202346844">DOI: 10.1051/0004-6361/202346844</a></p>
</li>
<li>
<p><span id="ref-Reardon2023"></span>Reardon, D. J., et al. (PPTA Collaboration). (2023). Search for an isotropic gravitational-wave background with the Parkes Pulsar Timing Array. <em>The Astrophysical Journal Letters</em>, 951, L6. <a href="https://doi.org/10.3847/2041-8213/acdd02">DOI: 10.3847/2041-8213/acdd02</a></p>
</li>
<li>
<p><span id="ref-Xu2023"></span>Xu, H., et al. (CPTA Collaboration). (2023). Searching for the nano-Hertz stochastic gravitational wave background with the Chinese Pulsar Timing Array. <em>Research in Astronomy and Astrophysics</em>, 23(7), 075024. <a href="https://doi.org/10.1088/1674-4527/acdfa5">DOI: 10.1088/1674-4527/acdfa5</a></p>
</li>
<li>
<p><span id="ref-Abbott2024"></span>Abbott, R., et al. (LIGO Scientific Collaboration). (2024). Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 $M_\odot$ Compact Object and a Neutron Star. <em>The Astrophysical Journal Letters</em>, 970, L34. <a href="https://doi.org/10.3847/2041-8213/ad5beb">DOI: 10.3847/2041-8213/ad5beb</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-12-01</strong>: Corrected the summary from &ldquo;four independent pulsar timing arrays&rdquo; to &ldquo;five&rdquo; — the CPTA (Chinese Pulsar Timing Array) also published consistent results in June 2023 and is counted as the fifth group in the body text.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Fremde Welten: Teaching Exoplanet Detection in the Secondary School Classroom</title>
      <link>https://sebastianspicker.github.io/posts/fremde-welten-exoplanet-teaching/</link>
      <pubDate>Wed, 14 Jun 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/fremde-welten-exoplanet-teaching/</guid>
      <description>A unit for lower secondary physics classes (grades 8–10) on detecting exoplanets with analogy experiments. Published in Unterricht Physik in 2023, it starts where students&amp;rsquo; misconceptions are — with the (wrong) assumption that you can just look at exoplanets through a telescope — and works forward from there.</description>
      <content:encoded><![CDATA[<p><em>This post describes the article &ldquo;Fremde Welten — Die Suche nach Exoplaneten
mit Analogieexperimenten thematisieren&rdquo; (Strange Worlds: Teaching Exoplanet
Detection with Analogy Experiments), published in Unterricht Physik (Issue 194,
2023) with Alexander Küpper.</em></p>
<hr>
<h2 id="where-students-start">Where Students Start</h2>
<p>Before students encounter the transit method, most of them have a clear mental
model of how exoplanet detection works: you point a large telescope at a nearby
star, and if there is a planet, you see it. &ldquo;You could see them [the exoplanets]
with a telescope/binoculars&rdquo; and &ldquo;You can see them with an extremely powerful
telescope&rdquo; are typical responses from year 8–9 students before they work through
an actual detection unit.</p>
<p>This is not an unreasonable starting intuition. Telescopes see things far away.
Planets are things far away. The inference seems to follow.</p>
<p>What it misses is the contrast ratio problem. A star is not just brighter
than its planets — it is overwhelmingly, almost incomprehensibly brighter.
In visible light, a star like the Sun outshines Jupiter by roughly a billion
to one. Against that glare, the planet is functionally invisible. Direct
imaging of exoplanets is possible in special circumstances — young planets
far from their stars, imaged in infrared — but for the vast majority of
exoplanets, it is not a viable detection method.</p>
<p>The unit described in this article takes that misconception as its entry point
and builds from there.</p>
<hr>
<h2 id="the-direct-imaging-experiment">The Direct Imaging Experiment</h2>
<p>The first experiment in the unit is a hands-on demonstration of why direct
imaging is difficult.</p>
<p>The setup: a student points their smartphone camera at a small light source
(a switched-on torch). Directly next to the torch, barely a few centimetres
away, is a pin with a coloured head — the &ldquo;exoplanet&rdquo;. On the phone&rsquo;s display,
the pinhead is invisible. The torch (star) drowns it out completely.</p>
<p>Students can then investigate what would need to change for the pinhead to
become visible. The answer they discover: block the torch with a small disc
held in front of the camera at the right distance. With the direct glare
suppressed, the illuminated pinhead becomes visible in the image.</p>
<p>This is a coronagraph in miniature. The same principle is used in real
direct-imaging instruments like SPHERE on the VLT or the coronagraph in
the Nancy Grace Roman Space Telescope. Students discover, experimentally,
the essential idea: to see an exoplanet directly, you need to suppress the
star&rsquo;s light without blocking the planet&rsquo;s.</p>
<p>The experiment also motivates a natural follow-on question: under what
conditions does direct imaging work at all? Students can vary the pinhead
distance from the torch and its size, exploring qualitatively the conditions
under which the &ldquo;exoplanet&rdquo; becomes detectable even with partial suppression.
The answer — large planets, far from their host star — matches the real
observational bias: most directly imaged exoplanets are large, young
(still warm from formation), and in wide orbits.</p>
<hr>
<h2 id="the-transit-experiment">The Transit Experiment</h2>
<p>Once the limits of direct imaging are established, the unit introduces the
transit method as the primary indirect technique. The pedagogical structure
is deliberate: students have already understood that you cannot usually see
exoplanets directly, which motivates the question of how else you might
detect them.</p>
<p>The transit experiment uses a lamp as the star, a ball moved by hand
(approximately periodically) around the lamp, and an Android smartphone
running <a href="https://phyphox.org">phyphox</a> as the light sensor. When the ball
crosses in front of the lamp from the sensor&rsquo;s perspective, the measured
illuminance dips. Students see a real light curve — not a simulation,
not a graph from a database, but something they produced themselves from
a physical measurement.</p>
<p>Two phyphox experiment files are provided for download (via QR code in
the article and at astro-lab.app):</p>
<p><strong>Basic experiment</strong>: records the raw illuminance data and displays the
light curve. The focus is qualitative — what shape does the dip have?
What determines the depth? What determines the period? Students can
formulate the relationship between dip depth and planet-to-star size ratio
as a qualitative rule (the larger the planet relative to the star, the
deeper the dip) without necessarily working through the mathematics.</p>
<p><strong>Extended experiment</strong>: adds real-time calculations of the transit depth
$\Delta F$, the maximum illuminance $I_*$ and transit illuminance $I_\text{transit}$,
the transit duration, and the orbital period. For students who are ready
for it, this allows a quantitative derivation of the &ldquo;planet&rdquo; radius from
the light curve — given a known lamp radius and the measured transit depth:</p>
$$\Delta F = \left(\frac{R_p}{R_*}\right)^2$$<p>The extended experiment also invites critical engagement with the model:
the radius derived from the analogy experiment will differ from the
actual ball radius, because the distance ratios in the tabletop setup
are not to scale. Making that discrepancy explicit — and asking students
why it arises — is good science practice.</p>
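<p>Inverting the depth relation gives the radius directly. A small sketch with real solar-system numbers (reference values rounded; these figures are mine, not from the article):</p>

```python
import math

def planet_radius(depth: float, r_star_km: float) -> float:
    """Invert the transit-depth relation dF = (Rp/R*)^2 for the planet radius."""
    return r_star_km * math.sqrt(depth)

R_SUN_KM = 696_000  # solar radius in km, rounded

# A Jupiter-size planet dims a Sun-like star by roughly 1 per cent:
print(round(planet_radius(0.0105, R_SUN_KM)), "km")  # close to Jupiter's ~71,500 km
```

The same one-line inversion works for the tabletop version, with the lamp radius in place of $R_*$ — which is exactly where the not-to-scale discrepancy discussed above becomes visible.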
<hr>
<h2 id="limits-of-the-transit-method">Limits of the Transit Method</h2>
<p>A recurring theme in the unit is that every detection method has limits,
and understanding those limits is part of understanding the method.</p>
<p>For the transit method, the fundamental limit is inclination. A transit
is only observable if the planet&rsquo;s orbital plane is aligned (nearly
edge-on) relative to our line of sight. Most exoplanetary systems, viewed
from Earth, will not be aligned in this way. The transit method is
therefore a biased sample: it preferentially detects planets in edge-on
orbits, and it misses most planets entirely.</p>
<p>Students can explore this experimentally: tilt the plane of the ball&rsquo;s
orbit away from edge-on and observe what happens to the light curve.
The dip disappears. This connects naturally to a broader point about
how astronomical surveys work: when we report &ldquo;X% of stars have
detectable planets&rdquo;, we are reporting a fraction that has been corrected
for this and other observational biases.</p>
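<p>The size of that correction can be estimated from geometry alone: for a circular orbit viewed from a random direction, the probability of transits is approximately $R_*/a$, the stellar radius over the orbital distance. A quick sketch (rounded reference values; an estimate, not the survey teams&rsquo; full correction):</p>

```python
R_SUN_KM = 696_000   # solar radius in km, rounded
AU_KM = 1.496e8      # astronomical unit in km

def transit_probability(r_star_km: float, a_km: float) -> float:
    """Geometric probability ~ R*/a that a randomly inclined circular
    orbit crosses the stellar disc as seen by a distant observer
    (ignoring the planet's own radius)."""
    return r_star_km / a_km

p = transit_probability(R_SUN_KM, 1.0 * AU_KM)  # Earth-Sun analogue
print(f"{p:.2%}")  # about half a per cent
```

So for every Earth analogue a transit survey finds, roughly two hundred go undetected for orientation reasons alone — which is the factor the bias correction has to restore.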
<p>The article includes a differentiation note: the limits investigation
works well as an open inquiry task, with students formulating and testing
their own hypotheses about what orbital configurations produce detectable
transits.</p>
<hr>
<h2 id="exoplanets-as-a-curriculum-bridge">Exoplanets as a Curriculum Bridge</h2>
<p>One point the article makes explicitly is that exoplanets are not just an
astronomy topic but a context that connects to multiple items in the German
physics curriculum for Sekundarstufe I. The cross-connections include:</p>
<ul>
<li><strong>Optics</strong>: the seeing process (why does the star outshine the planet?),
shadow formation, refraction in telescopes</li>
<li><strong>Mechanics</strong>: orbital period, Kepler&rsquo;s laws at a qualitative level,
the habitable zone as a consequence of stellar luminosity and distance</li>
<li><strong>Thermodynamics</strong>: planetary surface temperature, the greenhouse
effect, albedo</li>
<li><strong>Pressure</strong>: atmospheric pressure, habitability (a connection
developed more fully in the <a href="/posts/mission-to-mars/">Mission to Mars</a>
experiment)</li>
</ul>
<p>The motivating context — could this planet host life? — sustains
student engagement across these topics in a way that treating them
in isolation does not.</p>
<hr>
<h2 id="what-comes-after">What Comes After</h2>
<p>The transit method is a productive entry point, but the search for
extraterrestrial life does not end with planet detection. The article
closes by noting that the detected exoplanets need to be analysed
for habitability — which depends on orbital radius (habitable zone),
stellar temperature, planet radius (mass is not available from transit
data alone), atmospheric composition, albedo, and greenhouse effect.</p>
<p>Many of these can be connected back to physics experiment contexts,
and the astro-lab project has developed smartphone-based analogy
experiments for several of them. Detailed information on these is at
<a href="https://astro-lab.app">astro-lab.app</a>.</p>
<p>For the full pedagogical sequence — from the original astro-lab
student laboratory, through the COVID pivot to home experiments, to
the return to school — see <a href="/posts/astro-lab-at-home/">The Lab Goes Home</a>.
For the exomoon extension, which takes the transit experiment further
into the question of moon-hosted life, see
<a href="/posts/exomoon-analogy-experiment/">Can a Planet Have a Moon?</a>.</p>
<hr>
<h2 id="references">References</h2>
<p>Küpper, A., &amp; Spicker, S. J. (2023). Fremde Welten — Die Suche nach
Exoplaneten mit Analogieexperimenten thematisieren. <em>Unterricht Physik</em>,
34(194), 4–9.</p>
<p>Küpper, A., Spicker, S. J., &amp; Schadschneider, A. (2022).
Analogieexperimente zur Transitmethode für den Physik- und
Astronomieunterricht in der Sekundarstufe I. <em>Astronomie+Raumfahrt
im Unterricht</em>, 59(188), 46–50.</p>
<p>Spicker, S. J., &amp; Küpper, A. (2024). Exoplanet hunting in the classroom:
An easy-to-implement experiment based on video-aided light curve analysis
with smartphones. <em>The Physics Teacher</em>, 62(3).
<a href="https://doi.org/10.1119/5.0125305">https://doi.org/10.1119/5.0125305</a></p>
<p>MSB NRW (2019). <em>Kernlehrplan für die Sekundarstufe I — Gymnasium in
Nordrhein-Westfalen: Physik.</em> Ministerium für Schule und Bildung NRW.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-03</strong>: Updated the DOI for Spicker &amp; Küpper (2024) to the correct 10.1119/5.0125305.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>How to Actually Film a Classroom: An Open-Access Manual on Classroom Videography</title>
      <link>https://sebastianspicker.github.io/posts/villa-videography-manual/</link>
      <pubDate>Tue, 09 May 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/villa-videography-manual/</guid>
      <description>Three years after writing about why classroom video works, Charlotte Kramer, Kai Kaspar, and I wrote a manual on how to actually do it. The gap between knowing that video-based learning is effective and being able to produce usable footage turns out to be substantial. The manual is open access. Here is what is in it and why some of it surprised me to write.</description>
      <content:encoded><![CDATA[<p><em>This post is a follow-up to the <a href="/posts/villa-video-teacher-education/">June 2020 post on ViLLA and video in teacher
education</a>. That post was about why classroom
video is useful and what the ViLLA project found. This one is about the practical
question that post sidestepped: what does it actually take to film a real lesson?</em></p>
<p><em>The manual — Kramer, C., Spicker, S. J., &amp; Kaspar, K. (2023). Manual zur Erstellung
von Unterrichtsvideographien — is open access and freely downloadable at
<a href="https://kups.ub.uni-koeln.de/65599/">kups.ub.uni-koeln.de/65599</a>. Funded by the BMBF
under the ZuS Qualitätsoffensive Lehrerbildung programme (grant 01JA1815).</em></p>
<hr>
<h2 id="why-a-manual-exists">Why a Manual Exists</h2>
<p>The argument for classroom video in teacher education is not hard to make. The evidence
that video-based learning improves the perceptual and interpretive skills of student
teachers is solid enough that &ldquo;should we use video?&rdquo; is no longer a particularly
interesting question. The interesting questions are downstream: which kind of video,
for what purpose, produced how, stored where, used under what conditions.</p>
<p>The last of those — produced how — turns out to be the one that most programmes have
the least guidance on. There is a reasonably large research literature on the
<em>effects</em> of classroom video, and a smaller but growing literature on <em>design
principles</em> for video-based learning environments. There is much less on the
practical production side: what you need to decide before you enter a school
building, what can go wrong during filming, and what the post-processing work
actually involves.</p>
<p>The gap matters because it creates a reproducibility problem. If every research group
that wants classroom video has to figure out independently how to handle consent across
four institutional levels, how to position two cameras in a classroom with a window
on the wrong side, and how much post-processing time to budget per lesson, a lot of
effort goes into re-solving problems that have already been solved. The manual is an
attempt to make that accumulated knowledge explicit and shareable.</p>
<hr>
<h2 id="three-phases-and-why-preparation-is-the-most-important-one">Three Phases, and Why Preparation Is the Most Important One</h2>
<p>The manual is structured around the production lifecycle: preparation, production,
and post-processing. Each section ends with a practical checklist. The structuring
is not original — it follows Thomson (2019) and draws on Herrle and Breitenbach (2016)
and several other methodological guides — but the synthesis reflects what we learned
from actually running videography sessions at the University of Cologne over several
years.</p>
<p>The strongest claim in the manual is that <strong>preparation is the most important phase</strong>.
This sounds obvious and is consistently underestimated.</p>
<h3 id="methodical-preparation-the-question-before-the-camera-question">Methodical preparation: the question before the camera question</h3>
<p>Before any equipment decisions, the manual asks you to work through a prior question:
is video actually the right medium for what you want to know?</p>
<p>This is not a rhetorical check. Classroom video is excellent at capturing dynamic
processes — movement, gesture, voice, simultaneous events — and works well for
constructs like classroom management and communication patterns. It works less well
for constructs where the relevant data is not visible on the surface, like a student&rsquo;s
prior knowledge activation or the cognitive demands of a task. Using video for those
questions is possible, but you need more sessions, more annotation work, and supplementary
instruments. Building that into your timeline before you start is considerably better
than realising it after you have sixty hours of footage.</p>
<p>The manual also distinguishes four decisions about what kind of video you are making:</p>
<ul>
<li><strong>Authentic vs. staged</strong>: real everyday teaching vs. deliberately constructed
cases. Authentic footage gives you ecological validity; staged footage lets you
control which situations appear.</li>
<li><strong>Own vs. others&rsquo; teaching</strong>: self-recording for reflection vs. observing others
for general analysis.</li>
<li><strong>Typical vs. best practice</strong>: real-world teaching in its ordinary form vs.
exemplary demonstration material.</li>
<li><strong>Sequence vs. full lesson</strong>: a targeted extract sufficient for a specific analytic
focus vs. a complete lesson for contextualised, developmental analysis.</li>
</ul>
<p>None of these are neutral technical choices. They are methodological decisions that
determine what the resulting footage can be used for and what it cannot.</p>
<h3 id="organisational-preparation-the-consent-problem-is-harder-than-it-looks">Organisational preparation: the consent problem is harder than it looks</h3>
<p>The most time-consuming part of any real videography project is not the filming.
It is obtaining the permissions.</p>
<p>You need written consent from pupils, parents or guardians (separately, depending
on age — the threshold is 14 in the German legal framework the manual follows),
the class teacher, school leadership, the school authority, and in some states the
relevant ministry. The scope of the consent you obtain determines the scope of
use you can put the footage to: footage filmed under a narrow research-project-only
consent cannot be uploaded to ViLLA; footage filmed with broad usage rights can.
The broader the rights you request, the higher the barrier for participants to agree.</p>
<p>The practical implication: decide early what you want to do with the footage, because
what you put in the information letters and consent forms determines what is possible
for the lifetime of the data. This is a decision you cannot easily undo.</p>
<p>The manual also addresses the case where some pupils do not consent: in that situation,
it is often possible to position non-consenting pupils in a &ldquo;blind spot&rdquo; — an area
of the room where neither camera nor microphone captures them. But this requires
knowing the room layout and the planned seating arrangement in advance, which is
another reason organisational preparation starts earlier than you think.</p>
<h3 id="technical-preparation-as-much-as-necessary-as-little-as-possible">Technical preparation: as much as necessary, as little as possible</h3>
<p>The guiding principle for equipment selection is stated directly in the manual:
<em>so viel wie nötig, so wenig wie möglich</em> — as much as necessary, as little as
possible.</p>
<p>This matters because there is a pull toward technical elaboration that does not
always serve the research purpose. More cameras capture more perspectives; more
microphones capture more of the acoustic space; 360° cameras give you everything.
But more equipment means more setup time, more opportunities for failure during
filming, and substantially more post-processing work. And more visual complexity
in the final video does not automatically mean more analytically useful material —
it can mean more cognitive load for the students watching it.</p>
<p>The baseline setup the manual recommends is two static cameras positioned facing
each other: one centred on the students, one centred on the teacher. This
configuration, with lavalier microphones on teachers and boundary microphones for
student audio at the cameras, captures most of what you need for classroom management
research and teacher education at a level of complexity that is manageable. Extensions
— pan cameras for interaction analysis, additional cameras for group work, mobile
eye-tracking for teacher perspective, 360° cameras — are described as additions
for specific purposes, not as defaults.</p>
<hr>
<h2 id="what-happens-during-filming">What Happens During Filming</h2>
<p>The production section of the manual is the most specific and in some ways the
most useful part if you are planning a session for the first time. Some things
worth knowing:</p>
<p><strong>Start the cameras before the lesson.</strong> An authentic lesson happens only
once; you cannot go back and reshoot it. Events that happen before the official
start of the lesson — how a teacher
enters, how students settle, how the first few minutes of a lesson are framed — can
be analytically relevant. And any technical problems that surface before teaching
begins can still be fixed. Footage filmed before the lesson is easy to cut in post;
lost footage from the opening of a lesson is gone.</p>
<p><strong>The camera operator&rsquo;s job is to be boring.</strong> The manual is explicit that operators
should neither engage with the lesson content nor conspicuously attend to the
equipment. A relaxed posture, eyes on the monitor, not reacting to what is happening
in the room — this is the technique that allows pupils and teachers to stop registering
the cameras, which typically happens within the first few minutes if operators are not
drawing attention to themselves.</p>
<p><strong>Use a clapper.</strong> When running multiple cameras or separate audio recorders, a
handclap or clapperboard after all devices are rolling gives you a synchronisation
point for later editing. This is known to everyone who has ever synchronised footage,
but it is the kind of thing that is easy to forget in the scramble of setting up
during a ten-minute break.</p>
<p><strong>Backlighting is the enemy.</strong> Windows behind subjects produce the most common image
quality problem in classroom footage. The manual discusses ND filters for cases where
backlighting cannot be avoided, but the first-choice solution is room scouting in
advance to know where the windows are and plan camera placement accordingly.</p>
<hr>
<h2 id="post-processing-the-hidden-cost">Post-Processing: The Hidden Cost</h2>
<p>The post-processing chapter is the one I think is most likely to recalibrate
expectations productively.</p>
<p>Post-processing is time-intensive in proportion to the number of camera angles,
the number of audio tracks requiring synchronisation or correction, and the extent
of image and sound quality work needed. The manual is explicit that editing should
be done by people with content knowledge — not just technical skill — because the
person in the edit suite is constantly making decisions about what to include, how
to cut between perspectives, when to show the teacher&rsquo;s face vs. the students&rsquo;
faces. Those decisions are not editorially neutral. They determine what a viewer of
the finished video can perceive.</p>
<p>This is the point in the manual where the methodological problem I mentioned in
the previous post becomes concrete: the videography setting is not a neutral window
onto the classroom. The two-camera cross-cut convention (cut to the face of whoever
is speaking) is widely used and convenient for teaching purposes, but it is also
an editorial choice that foregrounds spoken exchange and makes other information —
spatial position, background activity, gestural communication between students —
less visible. Knowing that this choice was made is part of what a researcher or
educator needs to know in order to use the footage responsibly.</p>
<p>Data security deserves its own mention. Video files are large, they contain images
of minors, and they need to be stored under conditions that comply with current
data protection law — which means redundant backup, restricted access, purpose
limitation, and active awareness of what the current legal requirements are (which
change). The manual recommends checking applicable regulations before starting
rather than after, and treating data security as part of the workflow design rather
than an administrative afterthought.</p>
<hr>
<h2 id="what-is-coming-next">What Is Coming Next</h2>
<p>The manual&rsquo;s final chapter points toward three developments that are worth tracking:</p>
<p><strong>360° video and VR.</strong> Gold and Windscheid (2020) found that 360° classroom video
produces higher presence in student teacher observers than conventional video, though
without differences in learning outcomes measured by events noticed or ratings of
teaching quality. Whether the presence effect translates into something measurable
is an open empirical question. The VR version of this — using 360° classroom footage
as an immersive training environment where student teachers can observe without
the pressure of having to act — is methodologically interesting and practically
plausible at costs that are no longer prohibitive.</p>
<p><strong>Animated classroom video.</strong> The handful of studies on animated (as opposed to
filmed) classroom situations suggests that student teachers notice similar
learning-relevant events in animated and real footage (Smith et al., 2012; Chieu
et al., 2011). If that holds up, animation offers a way to construct specific scenarios
that would be hard to capture or ethically complex to film — situations involving
conflict, failure, or particular forms of student difficulty — without requiring
access to actual classrooms or consent from real pupils.</p>
<p><strong>Mobile eye-tracking.</strong> The combination of classroom videography with mobile
eye-tracking worn by teachers (Rüth, Zimmermann, &amp; Kaspar, 2020) opens the
teacher&rsquo;s-perspective angle that a fixed camera cannot capture. It is a technically
more demanding addition to the setup but an analytically distinctive one, and the
hardware costs have come down substantially.</p>
<hr>
<h2 id="a-note-on-open-access">A Note on Open Access</h2>
<p>The manual is freely available at <a href="https://kups.ub.uni-koeln.de/65599/">kups.ub.uni-koeln.de/65599</a>. We made it open access deliberately. The practical obstacles to classroom videography — not knowing how to handle consent, not knowing what equipment configuration works for a standard lesson, not knowing how long post-processing will actually take — are not obstacles that should be higher for researchers at institutions without an existing videography infrastructure. The knowledge exists; it should be findable.</p>
<p>If you are at the University of Cologne and want to run a videography session but
do not have your own equipment, the ZuS Media Labs project has a lending programme.
Contact the team at <a href="mailto:zus-kontakt@uni-koeln.de">zus-kontakt@uni-koeln.de</a> for the current equipment catalogue.</p>
<hr>
<p><em>For the specific challenges the manual doesn&rsquo;t address — recording in music
education, instrument acoustics, one-to-one lessons, and practice-session
documentation — see the
<a href="/posts/filming-music-education/">follow-up post on filming music education</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Chieu, V. M., Herbst, P., &amp; Weiss, M. (2011). Effect of an animated classroom story
embedded in online discussion on helping mathematics teachers learn to notice.
<em>Journal of the Learning Sciences</em>, 20(4), 589–624.
<a href="https://doi.org/10.1080/10508406.2011.528324">https://doi.org/10.1080/10508406.2011.528324</a></p>
<p>Gold, B., &amp; Windscheid, J. (2020). Observing 360-degree classroom videos — effects
of video type on presence, emotions, workload, classroom observations, and ratings
of teaching quality. <em>Computers &amp; Education</em>, 156, 103960.
<a href="https://doi.org/10.1016/j.compedu.2020.103960">https://doi.org/10.1016/j.compedu.2020.103960</a></p>
<p>Herrle, M., &amp; Breitenbach, S. (2016). Planung, Durchführung und Nachbereitung
videogestützter Beobachtungen im Unterricht. In U. Rauin, M. Herrle &amp; T. Engartner
(Hrsg.), <em>Videoanalysen in der Unterrichtsforschung</em>, 30–49. Beltz Juventa.</p>
<p>Kramer, C., König, J., Strauß, S., &amp; Kaspar, K. (2020). Classroom videos or transcripts?
A quasi-experimental study to assess the effects of media-based learning on
pre-service teachers&rsquo; situation-specific skills of classroom management.
<em>International Journal of Educational Research</em>, 103, 101624.
<a href="https://doi.org/10.1016/j.ijer.2020.101624">https://doi.org/10.1016/j.ijer.2020.101624</a></p>
<p>Rüth, M., Zimmermann, D., &amp; Kaspar, K. (2020). Mobiles Eye-Tracking im Unterricht.
In K. Kaspar et al. (Hrsg.), <em>Bildung, Schule, Digitalisierung</em>, 222–228. Waxmann.</p>
<p>Smith, D., McLaughlin, T., &amp; Brown, I. (2012). 3-D computer animation vs. live-action
video. <em>Contemporary Issues in Technology and Teacher Education</em>, 12(1), 41–54.</p>
<p>Thomson, A. (2019). <em>The creation and use of video-for-learning in higher education</em>.
Master&rsquo;s thesis, Queensland University of Technology.
<a href="https://doi.org/10.5204/thesis.eprints.130743">https://doi.org/10.5204/thesis.eprints.130743</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Charm of Impossibilities: Group Theory and Messiaen&#39;s Modes of Limited Transposition</title>
      <link>https://sebastianspicker.github.io/posts/messiaen-modes-group-theory/</link>
      <pubDate>Wed, 19 Apr 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/messiaen-modes-group-theory/</guid>
      <description>Messiaen&amp;rsquo;s seven modes of limited transposition cannot be fully transposed through all twelve keys — not by convention, but because of group theory. The modes are pitch-class sets whose stabiliser subgroups in ℤ₁₂ are non-trivial. The orbit–stabiliser theorem gives the exact count of distinct transpositions for each mode, and the subgroup lattice of ℤ₁₂ maps directly onto the hierarchy of the seven modes.</description>
      <content:encoded><![CDATA[<p><em>I first encountered Messiaen&rsquo;s second mode — the octatonic scale — in an
analysis seminar during my physics studies, played by a colleague on an upright
piano in a rehearsal room with terrible acoustics. She demonstrated something
that stopped me: no matter how many times she transposed the scale up by a minor
third, she could never find a &ldquo;new&rdquo; version. After three transpositions she was
back where she started. She called it the charm of impossibilities. It took me
years to understand why it is impossible, and longer still to see that the answer
is not musical but algebraic.</em></p>
<p><em>This post is a companion to <a href="/posts/fibonacci-lateralus/">Fibonacci, the Golden Ratio, and Tool&rsquo;s
Lateralus</a>, which found number theory in a prog-rock
song. Here we find abstract algebra in twentieth-century sacred music.</em></p>
<hr>
<h2 id="pitch-classes-and-the-chromatic-clock">Pitch Classes and the Chromatic Clock</h2>
<p>Western music divides the octave into twelve equal semitones. For purposes of
harmony and counterpoint, the absolute pitch is often less important than the
pitch <em>class</em> — the equivalence class of all pitches related by octave
transposition. Middle C and the C two octaves above belong to the same pitch
class.</p>
<p>We label the twelve pitch classes $0, 1, 2, \ldots, 11$, with $0 = \mathrm{C}$,
$1 = \mathrm{C}\sharp/\mathrm{D}\flat$, $2 = \mathrm{D}$, and so on up to
$11 = \mathrm{B}$. Addition is taken modulo 12 — the integers wrap around like
a clock face, with $11 + 2 = 1$ (two semitones above B is C$\sharp$).</p>
<p>The set of pitch classes with this operation is a group:</p>
$$\mathbb{Z}_{12} = \{0, 1, 2, \ldots, 11\}, \qquad x \oplus y = (x + y) \bmod 12.$$<p>This is the cyclic group of order 12. It has an identity element ($0$, &ldquo;no
transposition&rdquo;), every element has an inverse ($-n \bmod 12$), and the operation
is associative. If you are used to thinking about the chromatic scale as a linear
sequence ending at the octave, $\mathbb{Z}_{12}$ is the insistence that it is
actually a circle.</p>
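<p>A minimal Python sketch makes the clock arithmetic concrete (my illustration; the group axioms reduce to three assertions):</p>

```python
# Pitch classes as integers mod 12: 0 = C, 1 = C#, ..., 11 = B.
def pc_add(x, y):
    """Group operation on Z12: addition modulo 12."""
    return (x + y) % 12

B, D = 11, 2
assert pc_add(B, D) == 1                      # two semitones above B is C#
assert pc_add(7, 0) == 7                      # 0 ("no transposition") is the identity
assert all(pc_add(x, (12 - x) % 12) == 0 for x in range(12))  # inverses exist
```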
<hr>
<h2 id="musical-operations-as-group-elements">Musical Operations as Group Elements</h2>
<p>Two operations are fundamental in tonal and post-tonal music theory.</p>
<p><strong>Transposition</strong> by $n$ semitones maps every pitch class up by $n$:</p>
$$T_n \colon x \mapsto x + n \pmod{12}.$$<p>The twelve transpositions $T_0, T_1, \ldots, T_{11}$ are exactly the elements of
$\mathbb{Z}_{12}$, with $T_n$ corresponding to the integer $n$. Composing two
transpositions gives a transposition: $T_m \circ T_n = T_{m+n}$.</p>
<p><strong>Inversion</strong> reflects the pitch-class circle:</p>
$$I \colon x \mapsto -x \pmod{12}.$$<p>Inversion maps C to C, D to B$\flat$, E to A$\flat$, and so on — it is the
mirror symmetry of the chromatic circle about the C/F$\sharp$ axis. Combining
inversion with transposition gives the <em>inversional transpositions</em>:</p>
$$I_n \colon x \mapsto n - x \pmod{12}.$$<p>The transpositions and inversional transpositions together generate a group of
order 24:</p>
$$D_{12} = \langle T_1, I \rangle.$$<p>This is the <em>dihedral group</em> $D_{12}$ — the same abstract group that describes
the symmetries of a regular 12-gon (twelve rotations and twelve reflections). The
identification is not coincidental: the twelve pitch classes arranged in a circle
<em>are</em> the vertices of a regular 12-gon, and the musical operations are
geometrically the symmetries of that polygon.</p>
<p>Twelve-tone composition — Schoenberg&rsquo;s method — is almost entirely a
working-out of the consequences of $D_{12}$ acting on ordered sequences of the
twelve pitch classes. The four canonical row forms (prime, inversion, retrograde,
retrograde-inversion) correspond to cosets of $\mathbb{Z}_{12}$ (the transposition subgroup).</p>
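<p>The order-24 claim can be verified by brute force. A minimal Python sketch (my own, not from the music-theory literature) that closes $\{T_1, I\}$ under composition, encoding the map $x \mapsto ax + b \pmod{12}$ as the pair $(a, b)$:</p>

```python
# Elements of the T/I group as pairs (a, b): x -> (a*x + b) mod 12,
# with a = 1 (transposition T_b) or a = 11 (inversion, since -1 = 11 mod 12).
def compose(f, g):
    """Apply g first, then f."""
    a1, b1 = f
    a2, b2 = g
    return ((a1 * a2) % 12, (a1 * b2 + b1) % 12)

T1 = (1, 1)    # transposition by one semitone
I = (11, 0)    # inversion x -> -x

group = {(1, 0), T1, I}          # identity plus the two generators
while True:
    new = {compose(f, g) for f in group for g in group} - group
    if not new:
        break
    group |= new

assert len(group) == 24          # the dihedral group D12
```

The 24 elements split into twelve transpositions $(1, b)$ and twelve inversional transpositions $(11, b)$, matching the rotation/reflection count for a regular 12-gon.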
<hr>
<h2 id="orbits-and-stabilisers">Orbits and Stabilisers</h2>
<p>Let $S \subseteq \mathbb{Z}_{12}$ be a pitch-class set — a chord, a scale, a
collection of any size.</p>
<p>The <strong>orbit</strong> of $S$ under $\mathbb{Z}_{12}$ is the collection of all distinct
transpositions of $S$:</p>
$$\mathrm{Orb}(S) = \{ T_n(S) : n \in \mathbb{Z}_{12} \}.$$<p>For most sets, all twelve transpositions produce a different set, so
$|\mathrm{Orb}(S)| = 12$. The C major scale, for example, has twelve distinct
transpositions, one for each key.</p>
<p>But some sets are symmetric under certain transpositions: there exists $n \neq 0$
such that $T_n(S) = S$. The collection of all symmetry transpositions of $S$ is
the <strong>stabiliser</strong>:</p>
$$\mathrm{Stab}(S) = \{ T_n \in \mathbb{Z}_{12} : T_n(S) = S \}.$$<p>Because composing two symmetry transpositions yields another, $\mathrm{Stab}(S)$
is a <em>subgroup</em> of $\mathbb{Z}_{12}$.</p>
<p>The <strong>orbit–stabiliser theorem</strong> gives the fundamental count:</p>
$$|\mathrm{Orb}(S)| \cdot |\mathrm{Stab}(S)| = |\mathbb{Z}_{12}| = 12.$$<p>The number of distinct transpositions of $S$ equals $12$ divided by the number
of transpositions that leave $S$ unchanged. The more internally symmetric $S$ is,
the fewer new versions you can produce by transposing it.</p>
<p>A set with $|\mathrm{Stab}(S)| > 1$ — one that is invariant under some
non-trivial transposition — is a <strong>mode of limited transposition</strong>.</p>
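<p>Both definitions translate directly into code. A brute-force Python sketch (mine), using the C major scale as the running example:</p>

```python
def transpose(s, n):
    """T_n applied to a pitch-class set."""
    return frozenset((x + n) % 12 for x in s)

def orbit(s):
    """All distinct transpositions of s."""
    return {transpose(s, n) for n in range(12)}

def stabiliser(s):
    """All n with T_n(s) = s."""
    return {n for n in range(12) if transpose(s, n) == s}

c_major = frozenset({0, 2, 4, 5, 7, 9, 11})   # C D E F G A B
assert stabiliser(c_major) == {0}             # no non-trivial symmetry
assert len(orbit(c_major)) == 12              # twelve distinct keys

# Orbit-stabiliser theorem: the product is always |Z12| = 12.
assert len(orbit(c_major)) * len(stabiliser(c_major)) == 12
```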
<hr>
<h2 id="mode-1-the-whole-tone-scale">Mode 1: The Whole-Tone Scale</h2>
<p>The whole-tone scale contains the six pitch classes at even intervals:</p>
$$\mathrm{Mode\ 1} = \{0, 2, 4, 6, 8, 10\}.$$<p>Transposing by $T_2$:</p>
$$T_2(\{0, 2, 4, 6, 8, 10\}) = \{2, 4, 6, 8, 10, 0\} = \{0, 2, 4, 6, 8, 10\}. \checkmark$$<p>The set is unchanged. The same holds for $T_4, T_6, T_8, T_{10}$. The stabiliser
is the full subgroup of even transpositions:</p>
$$\mathrm{Stab}(\mathrm{Mode\ 1}) = \{T_0, T_2, T_4, T_6, T_8, T_{10}\} \cong \mathbb{Z}_6.$$<p>By the orbit–stabiliser theorem:</p>
$$|\mathrm{Orb}(\mathrm{Mode\ 1})| = \frac{12}{6} = 2.$$<p>There are exactly two distinct whole-tone scales. Every pianist learns this: the
one on C and the one on C$\sharp$. Composing with whole-tone harmony means
working from a stock of only two harmonic pools with no way to modulate into a
genuinely new version of the scale. This is Messiaen&rsquo;s first charm of
impossibility.</p>
<hr>
<h2 id="mode-2-the-octatonic-scale">Mode 2: The Octatonic Scale</h2>
<p>The octatonic (diminished) scale alternates half-step and whole-step intervals.
Starting on C:</p>
$$\mathrm{Mode\ 2} = \{0, 1, 3, 4, 6, 7, 9, 10\}.$$<p>Does $T_3$ leave this set invariant?</p>
$$T_3(\{0, 1, 3, 4, 6, 7, 9, 10\}) = \{3, 4, 6, 7, 9, 10, 0, 1\} = \{0, 1, 3, 4, 6, 7, 9, 10\}. \checkmark$$<p>Also $T_6$ and $T_9$. The stabiliser is the subgroup generated by transposition
by a minor third:</p>
$$\mathrm{Stab}(\mathrm{Mode\ 2}) = \{T_0, T_3, T_6, T_9\} \cong \mathbb{Z}_4.$$<p>The orbit size:</p>
$$|\mathrm{Orb}(\mathrm{Mode\ 2})| = \frac{12}{4} = 3.$$<p>There are exactly three distinct octatonic scales. Composers from Rimsky-Korsakov
and Bartók to Coltrane have exploited this closed system. The three scales
correspond to the three cosets of the subgroup $\langle T_3 \rangle$ in
$\mathbb{Z}_{12}$: the cosets $\{0, 3, 6, 9\}$, $\{1, 4, 7, 10\}$, and
$\{2, 5, 8, 11\}$ are the &ldquo;starting-point classes&rdquo; that generate each scale.
Note that the scales themselves are not pairwise disjoint — each has eight
pitch classes, so any two share four — but the coset structure determines
which transpositions produce the same scale and which produce a different one.</p>
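<p>Both counts — two whole-tone scales, three octatonic scales — can be confirmed by brute force. A self-contained Python sketch (my own check):</p>

```python
def stab(s):
    # symmetry transpositions of a pitch-class set
    return [n for n in range(12) if {(x + n) % 12 for x in s} == set(s)]

def n_transpositions(s):
    # size of the orbit under all twelve transpositions
    return len({frozenset((x + n) % 12 for x in s) for n in range(12)})

whole_tone = {0, 2, 4, 6, 8, 10}
octatonic = {0, 1, 3, 4, 6, 7, 9, 10}

assert stab(whole_tone) == [0, 2, 4, 6, 8, 10]   # stabiliser of order 6
assert n_transpositions(whole_tone) == 2         # two whole-tone scales

assert stab(octatonic) == [0, 3, 6, 9]           # stabiliser of order 4
assert n_transpositions(octatonic) == 3          # three octatonic scales
```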
<hr>
<h2 id="the-subgroup-lattice-and-all-seven-modes">The Subgroup Lattice and All Seven Modes</h2>
<p>The orbit–stabiliser theorem constrains which stabiliser sizes are algebraically
possible. Since $\mathrm{Stab}(S)$ is a subgroup of $\mathbb{Z}_{12}$, its order
must divide 12. The <em>proper non-trivial</em> subgroups of $\mathbb{Z}_{12}$ — those
with order strictly between 1 and 12 — are precisely:</p>
<table>
  <thead>
      <tr>
          <th>Subgroup</th>
          <th>Generator</th>
          <th>Order</th>
          <th>Orbit size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$\langle T_2 \rangle = \{T_0, T_2, T_4, T_6, T_8, T_{10}\}$</td>
          <td>$T_2$</td>
          <td>6</td>
          <td>2</td>
      </tr>
      <tr>
          <td>$\langle T_3 \rangle = \{T_0, T_3, T_6, T_9\}$</td>
          <td>$T_3$</td>
          <td>4</td>
          <td>3</td>
      </tr>
      <tr>
          <td>$\langle T_4 \rangle = \{T_0, T_4, T_8\}$</td>
          <td>$T_4$</td>
          <td>3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>$\langle T_6 \rangle = \{T_0, T_6\}$</td>
          <td>$T_6$</td>
          <td>2</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
<p>These four subgroups exist because the proper divisors of 12 that are greater
than 1 are exactly $\{2, 3, 4, 6\}$. The subgroups of $\mathbb{Z}_n$ are in
bijection with the divisors of $n$ — a consequence of the fundamental theorem of
cyclic groups. Since $12 = 2^2 \times 3$, the proper divisors are $1, 2, 3, 4,
6$.</p>
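<p>The table can be regenerated from the divisors alone. A minimal Python sketch (mine) computing the order of each cyclic subgroup $\langle T_g \rangle$ and the orbit size it forces:</p>

```python
from math import gcd

def subgroup_order(g):
    """Order of the cyclic subgroup of Z12 generated by T_g."""
    return 12 // gcd(g, 12)

# Generators of the four proper non-trivial subgroups of Z12.
for g in (2, 3, 4, 6):
    order = subgroup_order(g)
    print(f"<T_{g}>: order {order}, orbit size {12 // order}")
```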
<p>Each row of the table maps onto a level in Messiaen&rsquo;s system:</p>
<ul>
<li><strong>Mode 1</strong> (whole-tone scale): stabiliser $\langle T_2 \rangle$, 2 transpositions</li>
<li><strong>Mode 2</strong> (octatonic scale): stabiliser $\langle T_3 \rangle$, 3 transpositions</li>
<li><strong>Mode 3</strong>: stabiliser $\langle T_4 \rangle$, 4 transpositions</li>
<li><strong>Modes 4 – 7</strong>: stabiliser $\langle T_6 \rangle$, 6 transpositions each</li>
</ul>
<p>The subgroup lattice of $\mathbb{Z}_{12}$ — its Hasse diagram of containment
relationships — maps directly onto the hierarchy of Messiaen&rsquo;s modes. The more
symmetric the stabiliser subgroup, the fewer distinct transpositions the mode
admits.</p>
<p>The containment relations are: $\langle T_2 \rangle \supset \langle T_4 \rangle$
and $\langle T_2 \rangle \supset \langle T_6 \rangle$ and
$\langle T_3 \rangle \supset \langle T_6 \rangle$. Correspondingly, Mode 1
(stabiliser $\langle T_2 \rangle$, order 6) is &ldquo;more limited&rdquo; than Mode 3
(stabiliser $\langle T_4 \rangle$, order 3), in the sense that $\langle T_4
\rangle \subset \langle T_2 \rangle$: every symmetry transposition of Mode 3 is
also a symmetry transposition of Mode 1.</p>
<hr>
<h2 id="why-exactly-seven-modes">Why Exactly Seven Modes?</h2>
<p>Messiaen was not enumerating all pitch-class sets with non-trivial stabilisers —
there are many more than seven. At the level of the stabiliser $\langle T_6
\rangle$, for example, there are numerous pitch-class sets invariant under the
tritone transposition $T_6$: any set $S$ such that $S = S + 6$ qualifies. Some
of these sets are large (ten pitch classes), some are small (two pitch classes),
some are musically coherent and some are not.</p>
<p>Messiaen selected seven that he found aesthetically and compositionally viable:
scales of moderate cardinality, with a balance of interval types, that he could
use as raw material for his harmonic language. The group theory explains the
<em>constraint</em> (modes are possible only at the four stabiliser types listed above),
not the <em>selection</em> (which specific sets Messiaen chose among the many that
satisfy the constraint).</p>
<p>The question &ldquo;why seven?&rdquo; is therefore partly combinatorial and partly
compositional. What is group-theoretically determined is the number of <em>levels</em>
(four: orbit sizes 2, 3, 4, 6) and the <em>impossibility</em> of any mode with, say,
five distinct transpositions (since 5 does not divide 12).</p>
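<p>&ldquo;Many more than seven&rdquo; is easy to quantify. A brute-force Python sketch (mine; the count includes the empty set and the full chromatic aggregate):</p>

```python
from itertools import combinations

def count_invariant(n):
    """Number of pitch-class sets S with T_n(S) = S."""
    return sum(
        1
        for size in range(13)
        for s in combinations(range(12), size)
        if {(x + n) % 12 for x in s} == set(s)
    )

# T_n partitions Z12 into gcd(n, 12) orbits, and an invariant set must be
# a union of whole orbits, so the count is 2 ** gcd(n, 12).
assert count_invariant(6) == 64   # tritone-symmetric sets
assert count_invariant(3) == 8    # minor-third-symmetric sets
```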
<hr>
<h2 id="what-messiaen-knew--and-did-not-know">What Messiaen Knew — and Did Not Know</h2>
<p>Messiaen described his modes in <em>Technique de mon langage musical</em> (1944). His
account is entirely musical and phenomenological. He lists each mode by its
interval sequence, notes how many transpositions it admits, and names the
limitation a &ldquo;charm.&rdquo; The impossibility is for him a spiritual property, a form
of harmonic stasis that he associated — as a devout Catholic — with divine
eternity. A mode that cannot depart is, in his compositional theology, a glimpse
of the unchanging.</p>
<p>He was not doing group theory. The mathematics itself long predates him: the
orbit–stabiliser theorem descends from Lagrange (1771), Cauchy (early 19th
century), and Galois (1832). But
the concepts were not part of music-theoretic discourse until Milton Babbitt&rsquo;s
work in the 1950s, and they were not formalised in the pitch-class set framework
I have used here until Allen Forte&rsquo;s <em>The Structure of Atonal Music</em> (1973) and
David Lewin&rsquo;s <em>Generalized Musical Intervals and Transformations</em> (1987).</p>
<p>What Messiaen had was a musician&rsquo;s ear for symmetry. He could <em>hear</em> that the
modes were closed, without having the algebraic vocabulary to explain why. The
group theory shows that he was correct, and why he was correct with a precision
that no amount of phenomenological description could provide.</p>
<hr>
<h2 id="from-messiaen-to-lewin">From Messiaen to Lewin</h2>
<p>Lewin&rsquo;s transformational theory (1987) generalises the $\mathbb{Z}_{12}$ framework
to arbitrary musical spaces. A <em>Generalized Interval System</em> is a triple
$(S, G, \mathrm{int})$ where $S$ is a set of musical objects, $G$ is a group, and
$\mathrm{int} : S \times S \to G$ assigns an interval to each ordered pair of
objects in a way that is consistent with the group structure.</p>
<p>This framework treats musical transformations — not just pitch-class transpositions
but rhythmic augmentations, timbral shifts, any structurally defined operation —
as elements of a group. The mathematics does not privilege any particular musical
parameter; it applies wherever a transformation group acts on a set of musical
objects.</p>
<p>Neo-Riemannian theory, which emerged from Lewin&rsquo;s work in the 1980s and 1990s
and was systematised by Cohn (1998), applies this framework to triadic
transformations (the operations L, P, and R that map major and minor triads to
their relatives, parallels, and leading-tone exchanges). The group generated by
L, P, and R on the set of 24 major and minor triads is isomorphic to $D_{12}$
— the same dihedral group that governs Messiaen&rsquo;s modes, but acting on a
different musical space.</p>
<p>Emmanuel Amiot&rsquo;s more recent work (2016) applies the discrete Fourier transform
to pitch-class sets, using the DFT coefficients on $\mathbb{Z}_{12}$ as a
continuous measure of a set&rsquo;s similarity to the modes of limited transposition.
The Fourier coefficients detect the algebraic symmetries that stabilisers measure
discretely: a set with large coefficient at frequency $k$ (in the DFT over
$\mathbb{Z}_{12}$) is close, in a precise sense, to having the stabiliser
$\langle T_{12/k} \rangle$.</p>
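<p>The detection is easy to reproduce numerically. A standard-library Python sketch (my own check, not Amiot&rsquo;s implementation) of the DFT magnitudes for the whole-tone set:</p>

```python
import cmath

def dft_mag(s, k):
    """Magnitude of the k-th Fourier coefficient of the set s on Z12."""
    return abs(sum(cmath.exp(-2j * cmath.pi * k * x / 12) for x in s))

whole_tone = {0, 2, 4, 6, 8, 10}
mags = [round(dft_mag(whole_tone, k), 6) for k in range(12)]

# Exact invariance under T_2 = T_{12/6} shows up as all the spectral
# "energy" sitting at frequencies 0 and 6; every other coefficient vanishes.
assert mags[6] == 6.0
assert all(mags[k] == 0.0 for k in range(1, 12) if k != 6)
```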
<p>The group-theoretic perspective has moved, over seventy years, from a marginal
curiosity to the dominant mathematical framework in music theory. Messiaen&rsquo;s
modes — which once seemed like personal compositional idiosyncrasies — are
revealed as structurally constrained: the possible stabiliser orders are fixed
by the divisors of 12, and the orbit sizes that Messiaen&rsquo;s ear discovered are
exactly those that Lagrange&rsquo;s theorem permits. Many pitch-class sets have
non-trivial stabilisers; Messiaen found the seven that are musically viable.
Their limitation is not a personal choice but an algebraic fact.</p>
<p>The charm of impossibilities is a theorem of group theory. And it is exactly as
beautiful as Messiaen heard it to be.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Amiot, E. (2016). <em>Music Through Fourier Space: Discrete Fourier Transform in
Music Theory.</em> Springer (Computational Music Science).</p>
</li>
<li>
<p>Babbitt, M. (1960). Twelve-tone invariants as compositional determinants.
<em>The Musical Quarterly</em>, 46(2), 246–259.
<a href="https://doi.org/10.1093/mq/XLVI.2.246">https://doi.org/10.1093/mq/XLVI.2.246</a></p>
</li>
<li>
<p>Cohn, R. (1998). Introduction to neo-Riemannian theory: A survey and a
historical perspective. <em>Journal of Music Theory</em>, 42(2), 167–180.
<a href="https://doi.org/10.2307/843871">https://doi.org/10.2307/843871</a></p>
</li>
<li>
<p>Forte, A. (1973). <em>The Structure of Atonal Music.</em> Yale University Press.</p>
</li>
<li>
<p>Lewin, D. (1987). <em>Generalized Musical Intervals and Transformations.</em> Yale
University Press. (Reissued Oxford University Press, 2007.)</p>
</li>
<li>
<p>Messiaen, O. (1944). <em>Technique de mon langage musical.</em> Alphonse Leduc.
(English translation: Satterfield, J., 1956.)</p>
</li>
<li>
<p>Tymoczko, D. (2006). The geometry of musical chords. <em>Science</em>, 313(5783),
72–74. <a href="https://doi.org/10.1126/science.1126287">https://doi.org/10.1126/science.1126287</a></p>
</li>
<li>
<p>Tymoczko, D. (2011). <em>A Geometry of Music: Harmony and Counterpoint in the
Extended Common Practice.</em> Oxford University Press.</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Changed &ldquo;cosets of $D_{12}$&rdquo; to &ldquo;cosets of $\mathbb{Z}_{12}$ (the transposition subgroup)&rdquo; in the twelve-tone composition paragraph. $D_{12}$ (order 24) already includes both transpositions and inversions, yielding only 2 cosets in the full serial group. The four row forms {P, I, R, RI} correspond to 4 cosets of the transposition-only subgroup $\mathbb{Z}_{12}$ (order 12) in the full group of order 48.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Nobody Is Normal, Nobody Is Sick: A Roast of a Well-Meaning Slogan</title>
      <link>https://sebastianspicker.github.io/posts/nobody-is-normal-psychiatric-slogan-roast/</link>
      <pubDate>Sat, 18 Feb 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/nobody-is-normal-psychiatric-slogan-roast/</guid>
      <description>&amp;ldquo;Aus der Nähe betrachtet ist keiner normal.&amp;rdquo; The slogan of a Sozialpsychiatrisches Zentrum sounds compassionate. It is, under scrutiny, a gift to everyone who has ever said &amp;ldquo;but everyone gets depressed sometimes.&amp;rdquo; It attacks a concept psychiatry abandoned decades ago, dilutes the clinical categories people with severe conditions need to be taken seriously, and — most ironically — argues against the relevance of its own institution. A Karneval roast, with citations.</description>
      <content:encoded><![CDATA[<h2 id="tldr">TL;DR</h2>
<ul>
<li>
<p><strong>The slogan</strong>: <em>&ldquo;Aus der Nähe betrachtet ist keiner normal&rdquo;</em> — roughly, &ldquo;Up close,
nobody is normal.&rdquo; Displayed at a <em>Sozialpsychiatrisches Zentrum</em> to reduce
stigma around mental illness.</p>
</li>
<li>
<p><strong>What it gets right</strong>: stigma around psychiatric conditions is real and harmful.
The slogan&rsquo;s <em>intention</em> is correct.</p>
</li>
<li>
<p><strong>What it gets catastrophically wrong</strong>:</p>
<ul>
<li>It conflates statistical normality (deviation from average) with clinical
significance (harmful dysfunction). These are different concepts, and
modern psychiatric nosology uses the second, not the first.</li>
<li>&ldquo;Nobody is normal&rdquo; is exactly the argument people use to dismiss depression,
OCD, and anxiety as not-real-illness. Lending it institutional authority from
a psychiatric centre is counterproductive.</li>
<li>Psychologist Nick Haslam calls the underlying mechanism &ldquo;concept creep&rdquo;:
stretching clinical concepts until they cover everyone paradoxically devalues
them for the people who actually need them.</li>
<li>The anti-stigma research literature does not robustly support the
normalisation framing. Evidence is mixed, sometimes running in the wrong
direction.</li>
<li>A psychiatric centre whose slogan implies that the normal/abnormal
distinction is arbitrary has implicitly argued against the relevance of
its own services.</li>
</ul>
</li>
<li>
<p><strong>Analogous translation</strong>: &ldquo;Aus der Nähe betrachtet hat keiner ein normales Herz.&rdquo;
Up close, nobody has a normal heart. This is technically true. It does not
help people with cardiac disease. Neither does the original.</p>
</li>
<li>
<p><strong>What would actually help</strong>: affirming that psychiatric conditions are <em>real</em>,
<em>treatable</em>, and <em>do not define the whole person</em> — without dissolving the
conceptual distinction on which clinical care depends.</p>
</li>
</ul>
<hr>
<h2 id="the-slogan-and-what-it-wants">The Slogan and What It Wants</h2>
<p><em>Sozialpsychiatrische Zentren</em> — community psychiatric centres in
German-speaking countries — do important work: outreach, supported housing,
day programmes, a bridge between acute inpatient care and independent living.
The stigma around mental illness is real, persistent, and measurably harmful.
Tackling it is legitimate and necessary.</p>
<p>The slogan &ldquo;Aus der Nähe betrachtet ist keiner normal&rdquo; is designed to
contribute to that project. The implicit argument: the line between &ldquo;normal&rdquo;
and &ldquo;mentally ill&rdquo; is blurry. Everyone has quirks, struggles, peculiarities.
&ldquo;Normal&rdquo; is a fiction. Therefore: don&rsquo;t stigmatise people with psychiatric
diagnoses, because they are no different in kind from everyone else.</p>
<p>This sounds compassionate. It sounds inclusive. It sounds like the kind of
thing a thoughtful person would print on a poster.</p>
<p>It is precisely the wrong thing to say — and in a way that causes active
damage to the people it is trying to help.</p>
<hr>
<h2 id="what-seems-fine-is-not-fine">What Seems Fine Is Not Fine</h2>
<p>Let me put it plainly before building the argument.</p>
<p>The slogan&rsquo;s logic: nobody is normal → the normal/abnormal distinction is
arbitrary → psychiatric diagnosis is arbitrary → people with diagnoses should
not be stigmatised.</p>
<p>The conclusion is correct. The route to it is a disaster.</p>
<p>The problem is not the destination. The problem is what the argument concedes
on the way: that psychiatric categories are essentially a matter of
perspective, that the distinction between clinical illness and ordinary human
variation dissolves under sufficiently close examination, that if you look
hard enough, everyone is mentally ill.</p>
<p>That last implication is the argument that has been used, for decades, to
dismiss people with genuine clinical conditions. <em>&ldquo;Everyone gets depressed
sometimes.&rdquo;</em> <em>&ldquo;Everyone is a bit OCD.&rdquo;</em> <em>&ldquo;Everyone gets anxious — have you
tried exercise?&rdquo;</em></p>
<p>The person deploying this framing usually believes they are being kind,
inclusive, normalising. What they are doing is removing the evidentiary
ground on which someone with major depressive disorder, or
obsessive-compulsive disorder, or generalised anxiety disorder stands when
they say: <em>I am ill. I need treatment. My condition is real.</em></p>
<p>The slogan borrows this structure and prints it on a poster. That a
psychiatric institution is doing it makes it worse, not better.</p>
<hr>
<h2 id="problem-1-the-wrong-target">Problem 1: The Wrong Target</h2>
<p>The first error is attacking a concept of &ldquo;normal&rdquo; that psychiatry itself
abandoned decades ago.</p>
<p>When the slogan says &ldquo;nobody is normal,&rdquo; it implies that psychiatric diagnosis
works by measuring deviation from some statistical average of human behaviour.
Sufficiently deviant equals disordered; not-too-deviant equals normal. Since
everyone deviates from the average in some direction, &ldquo;normal&rdquo; is an illusion.</p>
<p>This is a reasonable critique of a naive, 19th-century model of mental
illness. It is not a critique of modern psychiatric nosology.</p>
<p>Jerome Wakefield&rsquo;s influential 1992 analysis in the <em>American Psychologist</em>
argues that genuine mental disorder requires two components: <em>dysfunction</em> —
the failure of a psychological mechanism to perform its naturally selected
function — and <em>harm</em> — the dysfunction causes suffering or impairment to the
person (<a href="#ref-wakefield1992">Wakefield, 1992</a>). &ldquo;Harmful dysfunction,&rdquo; not
statistical deviance. You can be spectacularly unusual and not disordered.
You can be statistically common — depression affects roughly one in five
people over a lifetime — and severely ill.</p>
<p>The DSM-5 builds in a related safeguard: the <em>clinical significance
criterion</em>. For most diagnoses, the symptom cluster must cause &ldquo;clinically
significant distress or impairment in social, occupational, or other important
areas of functioning&rdquo; (<a href="#ref-dsm5">American Psychiatric Association,
2013</a>). High neuroticism, unusual ideation, eccentric behaviour —
none of these, on their own, constitute a disorder under this criterion. What
matters is whether the person is suffering and whether their functioning is
impaired.</p>
<p>Christopher Boorse, working from a biomedical angle, defined health in terms
of <em>species-typical functioning</em> — whether biological systems are doing what
they evolved to do (<a href="#ref-boorse1977">Boorse, 1977</a>). Boorse&rsquo;s formulation is
contested, but its core point holds: the relevant question is not &ldquo;is this
person similar to the average person&rdquo; but &ldquo;are this person&rsquo;s systems
performing their functions.&rdquo; These are very different questions.</p>
<p>The slogan attacks a straw man. Real psychiatric diagnosis — when done
well — is not in the business of pathologising deviation from a norm of
cheerfulness or orderliness or sociability. It is in the business of
identifying harmful dysfunction. The &ldquo;nobody is normal&rdquo; framing has no purchase
on that target.</p>
<hr>
<h2 id="problem-2-concept-creep-and-the-dilution-effect">Problem 2: Concept Creep and the Dilution Effect</h2>
<p>Nick Haslam, a psychologist at the University of Melbourne, has documented
what he calls &ldquo;concept creep&rdquo; — the progressive expansion of psychological
concepts (trauma, mental disorder, depression, bullying) to cover increasingly
mild instances of what they originally described
(<a href="#ref-haslam2016">Haslam, 2016</a>).</p>
<p>The expansion happens in two directions: <em>horizontal</em> (covering more types of
phenomena) and <em>vertical</em> (covering less severe instances). A concept of
&ldquo;trauma&rdquo; that originally required exposure to life-threatening events has
expanded to include ordinary life stressors. A concept of &ldquo;depression&rdquo; that
originally meant severe, impairing low mood has expanded toward ordinary
sadness.</p>
<p>Concept creep sounds inclusive. It is, in practice, a dilution. When
&ldquo;everyone is a bit depressed&rdquo; becomes institutionally sanctioned, the person
with major depressive disorder — who cannot get out of bed, who has not eaten
in three days, who is considering suicide — finds their claim to the label
contested. The clinical category loses its clinical weight precisely because
everyone is in it.</p>
<p>The slogan &ldquo;nobody is normal&rdquo; is concept creep in slogan form. By implying
that the clinical/non-clinical distinction is arbitrary, it weakens the
conceptual infrastructure on which clinical claims rest. This is not a
hypothetical harm. It is the mechanism by which a great deal of dismissal of
severe mental illness operates: not by claiming that mental illness doesn&rsquo;t
exist, but by claiming that everyone is a bit mentally ill, so what&rsquo;s the
problem, stop complaining.</p>
<p>Allen Frances, who chaired the DSM-IV task force and subsequently became a
sharp critic of diagnostic inflation, wrote a book (<em>Saving Normal</em>, 2013)
about the opposite problem: the expansion of diagnostic categories to
medicalise ordinary human variation
(<a href="#ref-frances2013">Frances, 2013</a>). Frances&rsquo;s worry and the slogan&rsquo;s argument
share a logical structure — &ldquo;the line between normal and disordered is blurry,
therefore the line is somewhat arbitrary&rdquo; — and both forget the same thing:
the people with the most severe, genuine, impairing psychiatric conditions
need that line to carry weight. Blur it enough and their most urgent claims
become indistinguishable from everyone else&rsquo;s minor struggles.</p>
<hr>
<h2 id="problem-3-what-the-anti-stigma-literature-actually-says">Problem 3: What the Anti-Stigma Literature Actually Says</h2>
<p>Does the &ldquo;we&rsquo;re all a bit X&rdquo; normalisation framing reliably reduce stigma?
The evidence is, at best, mixed.</p>
<p>Patrick Corrigan and David Penn&rsquo;s review of social-psychological approaches to
psychiatric stigma identifies a consistent risk in normalisation campaigns:
they can fail to distinguish between the ordinary distress that everyone
experiences and the clinical conditions that require treatment and support
(<a href="#ref-corrigan1999">Corrigan &amp; Penn, 1999</a>). When stigma reduction messaging
implies that psychiatric conditions are simply more-of-what-everyone-has, it
may reduce perceived severity and undermine motivation to support treatment
access.</p>
<p>Kvaale, Haslam, and Gottdiener&rsquo;s meta-analysis of biogenetic framings in
anti-stigma campaigns — which share structural features with the normalisation
approach — found paradoxical effects: reduced blame, yes, but sometimes
increased perceived dangerousness and greater social distance
(<a href="#ref-kvaale2013">Kvaale, Haslam, &amp; Gottdiener, 2013</a>). The &ldquo;we&rsquo;re all on
a spectrum&rdquo; variant has its own specific paradox: if nobody is normal, the
distinction that generates stigma dissolves — but so does the distinction
that generates <em>respect for people with serious conditions who need
real resources</em>. Both edges cut.</p>
<p>What the literature supports more robustly is <em>contact</em>: direct, positive
interaction with people who have experience of mental illness, presented as
whole persons and not primarily as patients. Contact works better than
educational campaigns about what mental illness is or isn&rsquo;t. The &ldquo;nobody is
normal&rdquo; poster is an educational campaign about what mental illness isn&rsquo;t. It
is probably less effective than a conversation.</p>
<hr>
<h2 id="problem-4-the-institutional-contradiction">Problem 4: The Institutional Contradiction</h2>
<p>There is a fourth problem, and I find it the most striking.</p>
<p>The slogan belongs to a <em>Sozialpsychiatrisches Zentrum</em> — an institution that
exists precisely because some people have psychiatric conditions that impair
their functioning and require dedicated support. Its implicit mission: there
is a meaningful distinction between people who need psychiatric services and
people who do not, and we provide those services for the former.</p>
<p>The slogan: nobody is normal.</p>
<p>If nobody is normal, then everybody is, in the relevant sense, a bit
psychiatrically ill. If the line between normal and not-normal is arbitrary,
then so is the line between people who need psychiatric services and people
who don&rsquo;t. If the category &ldquo;psychiatric condition requiring support&rdquo; is as
fuzzy as the slogan implies — a mere matter of proximity and perspective —
then why should anyone prioritise coming to this particular institution?</p>
<p>The slogan, taken seriously, argues against the relevance of its own
institution. A psychiatric centre has printed on its posters the claim that
psychiatric categories dissolve under close examination. This is an unusual
thing for a psychiatric centre to announce.</p>
<hr>
<h2 id="the-analogous-translation">The Analogous Translation</h2>
<p>Let me make the logical structure visible with a direct translation into
another field of medicine:</p>
<blockquote>
<p><em>&ldquo;Aus der Nähe betrachtet hat keiner ein normales Herz.&rdquo;</em></p>
<p><em>&ldquo;Up close, nobody has a normal heart.&rdquo;</em></p>
</blockquote>
<p>This is, in a technical sense, largely true. Cardiologists can find something
to remark on in almost any heart — a minor valve irregularity, some degree of
atherosclerosis past middle age, a benign arrhythmia, a structural variation
within the clinical reference range. Under sufficiently detailed examination,
the perfectly normal heart is a platonic ideal rather than a clinical reality.</p>
<p>Does this mean coronary artery disease doesn&rsquo;t exist? Does it mean myocardial
infarction is a matter of perspective or proximity? Does it mean that someone
waiting for a cardiac transplant should be reassured that, up close, nobody
has a normal heart, so they shouldn&rsquo;t worry too much about their own?</p>
<p>Obviously not. The clinical category of cardiac disease does not depend on
the existence of a perfectly normal heart. It depends on whether specific
mechanisms are failing in ways that cause harm — which is true for some
people and not for others, regardless of whether everyone has some minor
deviation from an idealised cardiovascular anatomy.</p>
<p>The slogan about psychiatric normalcy makes exactly the same error. The
clinical category of mental disorder does not depend on the existence of a
psychologically perfect human being. It depends on whether psychological
mechanisms are failing in ways that cause harm — which is true for some
people and not for others, regardless of whether everyone has quirks,
struggles, or eccentricities.</p>
<p>The heart analogy is also useful for what it reveals about whose interests
the slogan serves. &ldquo;Nobody has a normal heart&rdquo; would be printed, presumably,
to reassure people who feel embarrassed about their cardiac condition — to
say: you&rsquo;re not so different from anyone else. What it actually does is make
it harder for that person to say: <em>my heart is not functioning well, and
that is a real medical fact that deserves real medical attention.</em> The
compassionate intent and the practical effect run in opposite directions.</p>
<hr>
<h2 id="what-would-actually-help">What Would Actually Help</h2>
<p>The goal — reducing stigma against people with psychiatric conditions — is
correct and important. The approach — dissolving the category of &ldquo;normal&rdquo;
until psychiatric and non-psychiatric become indistinguishable — is not.</p>
<p>A more defensible anti-stigma argument goes: mental illness is <em>real</em>, it
involves genuine failures of psychological functioning, it causes genuine
suffering, and <em>none of that makes the person with it less worthy of respect,
resources, and full participation in society.</em> This is the position that
affirms both the reality of the condition and the humanity of the person.
It does not require denying the normal/abnormal distinction. It requires
insisting that the distinction does not carry the moral weight that stigma
assigns to it.</p>
<p>The difference between &ldquo;nobody is normal, so stop stigmatising&rdquo; and &ldquo;you can
be ill and still be a person of full worth&rdquo; sounds subtle. In practice, it is
enormous. The first removes the conceptual ground from under the people most
in need. The second leaves the ground intact while refusing to let it be used
as a weapon.</p>
<p><em>Psychisch krank — und trotzdem ganz.</em> Mentally ill — and still whole. Not:
nobody is normal. But: being ill doesn&rsquo;t make you less of a person. The second
slogan does not hand ammunition to the dismissers. The first one does.</p>
<hr>
<h2 id="karneval-coda">Karneval Coda</h2>
<p>It is Karneval. Everyone is wearing a mask.</p>
<p>The slogan &ldquo;Aus der Nähe betrachtet ist keiner normal&rdquo; is wearing a mask too:
the mask of tolerance, of radical inclusion, of refusing to pathologise
difference. Under the mask is a logical structure that, taken seriously, would
dissolve the evidentiary basis for psychiatric care, hand a slogan to everyone
who has ever told someone with depression that they just need to try harder,
and leave the people with the most severe conditions with one fewer
conceptual tool for insisting that their suffering is real, their need is
legitimate, and their claim on resources and support deserves to be taken
seriously.</p>
<p>The mask is well-intentioned. Karneval ends on Wednesday.
The poster will still be on the wall.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-wakefield1992"></span>Wakefield, J. C. (1992). The concept of mental disorder: On the boundary between biological facts and social values. <em>American Psychologist</em>, 47(3), 373–388. <a href="https://doi.org/10.1037/0003-066X.47.3.373">DOI: 10.1037/0003-066X.47.3.373</a></li>
<li><span id="ref-dsm5"></span>American Psychiatric Association. (2013). <em>Diagnostic and Statistical Manual of Mental Disorders</em> (5th ed.). American Psychiatric Publishing. <a href="https://doi.org/10.1176/appi.books.9780890425596">DOI: 10.1176/appi.books.9780890425596</a></li>
<li><span id="ref-boorse1977"></span>Boorse, C. (1977). Health as a theoretical concept. <em>Philosophy of Science</em>, 44(4), 542–573. <a href="https://doi.org/10.1086/288768">DOI: 10.1086/288768</a></li>
<li><span id="ref-haslam2016"></span>Haslam, N. (2016). Concept creep: Psychology&rsquo;s expanding concepts of harm and pathology. <em>Psychological Inquiry</em>, 27(1), 1–17. <a href="https://doi.org/10.1080/1047840X.2016.1082418">DOI: 10.1080/1047840X.2016.1082418</a></li>
<li><span id="ref-frances2013"></span>Frances, A. (2013). <em>Saving Normal: An Insider&rsquo;s Revolt against Out-of-Control Psychiatric Diagnosis, DSM-5, Big Pharma, and the Medicalization of Ordinary Life</em>. HarperCollins.</li>
<li><span id="ref-corrigan1999"></span>Corrigan, P. W., &amp; Penn, D. L. (1999). Lessons from social psychology on discrediting psychiatric stigma. <em>American Psychologist</em>, 54(9), 765–776. <a href="https://doi.org/10.1037/0003-066X.54.9.765">DOI: 10.1037/0003-066X.54.9.765</a></li>
<li><span id="ref-kvaale2013"></span>Kvaale, E. P., Haslam, N., &amp; Gottdiener, W. H. (2013). The &lsquo;side effects&rsquo; of medicalization: A meta-analytic review of how biogenetic explanations affect stigma. <em>Clinical Psychology Review</em>, 33(6), 782–794. <a href="https://doi.org/10.1016/j.cpr.2013.06.002">DOI: 10.1016/j.cpr.2013.06.002</a></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Spiral Out: Tool&#39;s Lateralus, the Fibonacci Sequence, and the Mathematics of Musical Structure</title>
      <link>https://sebastianspicker.github.io/posts/fibonacci-lateralus/</link>
      <pubDate>Tue, 08 Nov 2022 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/fibonacci-lateralus/</guid>
      <description>Alongside physics and astronomy, two other things have occupied an unreasonable share of my attention since adolescence: mathematics and music. Lateralus by Tool — released 2001, still in rotation — is the piece that most conspicuously occupies the intersection. The song is structurally built around the Fibonacci sequence, from the syllable counts in Maynard Keenan&amp;rsquo;s vocals to the time signature pattern that concatenates to F(16). This post works through the mathematics in some detail and asks why it works musically.</description>
      <content:encoded><![CDATA[<h2 id="two-passions-one-song">Two Passions, One Song</h2>
<p>Physics training means coming to mathematics as a tool before arriving at it as
an object of aesthetic interest, and it
took me longer than it should have to notice that a proof can be
beautiful in the same way a piece of music can be beautiful — not
despite its rigour but because of it. Both reward attention to
structure. Both have surfaces accessible to a casual listener and depths
that only reveal themselves when you look harder.</p>
<p>Lateralus, the title track of Tool&rsquo;s 2001 album, is a convenient case
study for the overlap. It is not the only piece of music built around
Fibonacci numbers — Bartók made the connection decades earlier, and it
appears in scattered places across Western and non-Western traditions —
but it is among the most thoroughly and deliberately constructed, and
the mathematical structure is audible rather than merely theoretical.</p>
<p>What follows is an attempt to do justice to both dimensions: the
mathematics of the Fibonacci sequence and the golden ratio, and the
musical mechanics of how those structures show up and what they do.</p>
<hr>
<h2 id="the-fibonacci-sequence">The Fibonacci Sequence</h2>
<p>The sequence is defined by a recurrence relation. Starting from the
initial values $F(1) = 1$ and $F(2) = 1$, each subsequent term is the
sum of the two preceding ones:</p>
$$F(n) = F(n-1) + F(n-2), \quad n \geq 3$$<p>This gives:</p>
$$1,\; 1,\; 2,\; 3,\; 5,\; 8,\; 13,\; 21,\; 34,\; 55,\; 89,\; 144,\; 233,\; 377,\; 610,\; \mathbf{987},\; 1597,\; \ldots$$<p>The term $987$ is the sixteenth Fibonacci number, $F(16)$. Keep that
in mind.</p>
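<p>The recurrence translates directly into code. A minimal sketch (the helper name <code>fib</code> is my own, not from the post or any library):</p>

```python
def fib(n: int) -> int:
    """n-th Fibonacci number with F(1) = F(2) = 1, via the recurrence."""
    a, b = 1, 1  # F(1), F(2)
    for _ in range(n - 2):
        a, b = b, a + b  # F(n) = F(n-1) + F(n-2)
    return b if n >= 2 else a

print(fib(16))  # -> 987
```

<p>Running it confirms that $F(16) = 987$.</p>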
<p>The recurrence can be encoded compactly in a matrix formulation. For
$n \geq 1$:</p>
$$\begin{pmatrix} F(n+1) \\ F(n) \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$<p>This is more than notational tidiness — it connects the Fibonacci
sequence to the eigenvalues of the matrix
$\mathbf{A} = \bigl(\begin{smallmatrix}1 & 1 \\ 1 & 0\end{smallmatrix}\bigr)$,
which are exactly $\varphi$ and $-1/\varphi$ where $\varphi$ is the
golden ratio. That connection gives us Binet&rsquo;s formula, a closed-form
expression for the $n$-th Fibonacci number:</p>
$$F(n) = \frac{\varphi^n - \psi^n}{\sqrt{5}}, \quad \varphi = \frac{1+\sqrt{5}}{2},\quad \psi = \frac{1-\sqrt{5}}{2} = -\frac{1}{\varphi}$$<p>Since $|\psi| < 1$, the term $\psi^n / \sqrt{5}$ diminishes rapidly,
and for large $n$ we have the convenient approximation:</p>
$$F(n) \approx \frac{\varphi^n}{\sqrt{5}}$$<p>This means Fibonacci numbers grow <em>exponentially</em>, at a rate governed by
the golden ratio. The sequence does not grow linearly or polynomially; it
spirals outward.</p>
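<p>Binet&rsquo;s formula and the recurrence can be checked against each other numerically; a sketch, assuming double-precision floats (adequate for moderate $n$):</p>

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio
PSI = (1 - math.sqrt(5)) / 2  # the conjugate root, equal to -1/PHI

def binet(n: int) -> int:
    """Closed form F(n) = (PHI**n - PSI**n) / sqrt(5), rounded to an integer."""
    return round((PHI**n - PSI**n) / math.sqrt(5))

def fib(n: int) -> int:
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        a, b = b, a + b
    return a

# Both routes agree over a comfortable range
assert all(binet(n) == fib(n) for n in range(1, 40))
print(binet(16))  # -> 987
```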
<hr>
<h2 id="the-golden-ratio">The Golden Ratio</h2>
<p>The golden ratio $\varphi$ appears as the limit of consecutive Fibonacci
ratios:</p>
$$\varphi = \lim_{n \to \infty} \frac{F(n+1)}{F(n)} = \frac{1+\sqrt{5}}{2} \approx 1.61803\ldots$$<p>It can be derived from a simple geometric proportion: divide a line
segment into a longer part $a$ and a shorter part $b$ such that the
ratio of the whole segment to the longer part equals the ratio of the
longer part to the shorter part. Calling that common ratio $r$:
$$\frac{a+b}{a} = \frac{a}{b} = r \implies r^2 - r - 1 = 0 \implies r = \frac{1+\sqrt{5}}{2} = \varphi$$<p>What makes $\varphi$ mathematically distinctive is its continued fraction
representation:</p>
$$\varphi = 1 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{1 + \cdots}}}$$<p>This is the simplest possible infinite continued fraction. It is also, in
a precise sense, the <em>hardest</em> real number to approximate by rational
fractions. The convergents of a continued fraction are the best rational
approximations to a real number at each level of precision; the
convergents of $\varphi$ are exactly the ratios of consecutive Fibonacci
numbers: $1/1$, $2/1$, $3/2$, $5/3$, $8/5$, $13/8$, $\ldots$ These
converge more slowly to $\varphi$ than the convergents of any other
irrational number. $\varphi$ is, in this sense, maximally irrational.</p>
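<p>The convergents can be generated by truncating the continued fraction; a sketch in exact rational arithmetic (the function name <code>phi_convergent</code> is my own):</p>

```python
from fractions import Fraction

def phi_convergent(depth: int) -> Fraction:
    """Evaluate the continued fraction 1 + 1/(1 + 1/(1 + ...)) truncated at `depth`."""
    x = Fraction(1)
    for _ in range(depth):
        x = 1 + 1 / x
    return x

# Successive convergents are ratios of consecutive Fibonacci numbers
print([str(phi_convergent(d)) for d in range(1, 6)])
# -> ['2', '3/2', '5/3', '8/5', '13/8']
```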
<p>That property has a physical consequence. In botanical phyllotaxis — the
arrangement of leaves, seeds, and petals on plants — structures that grow
by adding new elements at a fixed angular increment will pack most
efficiently when that increment is as far as possible from any rational
fraction of a full rotation. The optimal angle is:</p>
$$\theta = \frac{2\pi}{\varphi^2} \approx 137.508°$$<p>This is the <em>golden angle</em>, and it is the reason sunflower seed spirals
count $55$ and $89$ (consecutive Fibonacci numbers) in their two
counter-rotating sets. The mathematics of efficient growth in nature
and the mathematics of the Fibonacci sequence are the same mathematics.</p>
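<p>The golden angle follows from $\varphi$ in two lines; a quick numerical sketch:</p>

```python
import math

phi = (1 + math.sqrt(5)) / 2
golden_angle = math.degrees(2 * math.pi / phi**2)  # optimal packing increment

print(f"{golden_angle:.3f} degrees")  # -> 137.508 degrees
```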
<p>The golden spiral — the logarithmic spiral whose growth factor per
quarter turn is $\varphi$ — is the visual representation of this: it
is self-similar, expanding without bound while maintaining constant
proportionality.</p>
<hr>
<h2 id="fibonacci-numbers-in-music-before-tool">Fibonacci Numbers in Music: Before Tool</h2>
<p>The connection between the Fibonacci sequence and musical structure is
not Tool&rsquo;s invention. The most carefully documented case is Béla
Bartók, whose Music for Strings, Percussion and Celesta (1936) has been
analysed exhaustively by Ernő Lendvai. In the first movement, the
climax arrives at bar 55 (a Fibonacci number), and Lendvai counted the
overall structure as 89 bars — the score has 88, but he added an implied
final rest bar to reach the Fibonacci number — dividing at bar 55 with
near-mathematical precision. Lendvai argued that Bartók consciously embedded Fibonacci
proportions into formal structure, tonal architecture, and thematic
development throughout much of his output.</p>
<p>Whether these proportions were conscious design or an instinct that
selected naturally resonant proportions is contested. The same question
applies to claims about Mozart and Chopin. What is more defensible is
a structural observation about the piano keyboard and Western scales
that requires no attribution of intent:</p>
<p>A single octave on the piano keyboard has <strong>13 keys</strong>, comprising <strong>8
white keys</strong> and <strong>5 black keys</strong>. The black keys are grouped as <strong>2</strong>
and <strong>3</strong>. The numbers $2, 3, 5, 8, 13$ are five consecutive Fibonacci
numbers — $F(3)$ through $F(7)$.</p>
<p>The standard Western scales make this concrete. The major scale
contains <strong>7 distinct pitches</strong> within an octave of <strong>12 semitones</strong>.
The pentatonic scale (ubiquitous in folk, blues, rock) contains <strong>5</strong>
pitches. The chromatic scale contains <strong>12</strong> pitch classes per octave;
counting both endpoints of the octave (C to C) gives <strong>13</strong> chromatic
notes, the next Fibonacci number.</p>
<p>Harmonic intervals in just intonation are rational approximations of
simple frequency ratios: the octave (2:1), the perfect fifth (3:2),
the perfect fourth (4:3), the major third (5:4), the minor third (6:5).
The numerators and denominators are small integers, often Fibonacci
numbers or their neighbours. The major triad — the structural foundation
of tonal Western music — consists of pitches in the frequency ratio
$4:5:6$, three consecutive small integers, one of which ($5$) is itself
a Fibonacci number.</p>
<p>This does not mean that Western music is secretly Fibonacci. It means
that the integer frequency ratios that produce consonant intervals are
the small integers, and small integers include the small Fibonacci
numbers. The connection is genuine but not exclusive.</p>
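<p>The interval ratios listed above can be tabulated exactly; a sketch in just intonation, taking $A4 = 440$ Hz as an arbitrary reference pitch (my choice, not the post&rsquo;s):</p>

```python
from fractions import Fraction

A4 = 440.0  # reference pitch in Hz
intervals = {
    "octave":         Fraction(2, 1),
    "perfect fifth":  Fraction(3, 2),
    "perfect fourth": Fraction(4, 3),
    "major third":    Fraction(5, 4),
    "minor third":    Fraction(6, 5),
}
for name, ratio in intervals.items():
    print(f"{name:>14}: {A4 * ratio:7.1f} Hz")

# The major triad in ratio 4:5:6, i.e. root : major third : perfect fifth
triad_hz = [A4 * r for r in (Fraction(1), Fraction(5, 4), Fraction(3, 2))]
# -> [440.0, 550.0, 660.0]
```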
<hr>
<h2 id="lateralus">Lateralus</h2>
<p>Tool&rsquo;s <em>Lateralus</em> (2001, album of the same name) is unusual in that
the Fibonacci construction is not an analytical inference applied after
the fact — it was discussed publicly by the band. Drummer Danny Carey has
spoken about his engagement with sacred geometry and mathematical
structure, and the song&rsquo;s construction has been described as intentional
by multiple band members.</p>
<p>There are two primary levels of Fibonacci structure in the song. The
third — the thematic content of the lyrics — makes the mathematical
frame explicit.</p>
<h3 id="the-syllable-count">The Syllable Count</h3>
<p>The opening verses are constructed so that successive lines contain
syllable counts following the Fibonacci sequence ascending:
$1, 1, 2, 3, 5, 8, 13$. The first line is a single syllable. The
second is another single syllable. The third is a two-syllable phrase.
The sequence continues, each line carrying the combined weight of the
previous two, until the thirteen-syllable line, which in structure and
delivery feels like the crest of a wave.</p>
<p>The second half of the verse then descends: $13, 8, 5, 3, 2, 1, 1$.
Or, in some analyses, the chorus and pre-chorus sections begin a new
ascending Fibonacci run before the full descent, creating a nested
structure of expansions and contractions.</p>
<p>The audible effect of this design is not arbitrary. A sequence of lines
whose syllable counts follow $1, 1, 2, 3, 5, 8, 13$ creates a
consistently accelerating density of text over the same musical time.
The vocal line becomes more compressed as the syllable count rises,
building tension — and then the descent releases it. This is not how
most pop or rock lyrics are structured. It produces a breathing,
organic quality, the way a plant reaches toward light.</p>
<h3 id="the-time-signature-987">The Time Signature: 987</h3>
<p>The verse sections of the song cycle through three time signatures in
succession: $9/8$, then $8/8$, then $7/8$.</p>
$$9/8 + 8/8 + 7/8$$<p>This three-bar pattern repeats. Now: the sequence of numerators is $9$,
$8$, $7$. Written as a three-digit number: <strong>987</strong>. And as noted above,
$987 = F(16)$, the sixteenth Fibonacci number.</p>
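<p>The concatenation claim is a one-liner to check; a sketch (again with a hand-rolled <code>fib</code> helper of my own naming):</p>

```python
def fib(n: int) -> int:
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        a, b = b, a + b
    return a

numerators = [9, 8, 7]  # the verse cycle 9/8 + 8/8 + 7/8
assert int("".join(map(str, numerators))) == 987 == fib(16)

print(sum(numerators))  # -> 24 (eighth notes per full cycle)
```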
<p>Whether this is a deliberate encoding or a remarkable coincidence is a
matter of interpretation. The time signature sequence is definitely
deliberate — asymmetric meters of this kind require careful compositional
choice. The fact that their numerators concatenate to a Fibonacci number
is either intentional and clever or accidental and still remarkable.
Either way, the time signature pattern has a musical function independent
of the Fibonacci reading.</p>
<p>In standard rock, time is almost always $4/4$: four even beats per bar,
a pulse that is maximally predictable and maximally amenable to groove.
The $9/8 + 8/8 + 7/8$ pattern is the opposite. Each bar has a different
length. The listener&rsquo;s internal metronome, calibrated to $4/4$, cannot
lock onto the pattern. The music generates forward momentum not through
a repeated downbeat but through the continuous, non-periodic unfolding
of measures whose lengths shift. This is the rhythmic analogue of a
spiral: no two revolutions are identical in length, but the growth is
consistent.</p>
<p>The chorus and other sections use different time signatures, including
stretches in $5/8$ and $7/8$. Of those numerators, $5$ is a Fibonacci
number; $7$ is not, though it preserves the asymmetric, odd-metre
character of the verse pattern.</p>
<h3 id="the-thematic-content">The Thematic Content</h3>
<p>The lyrics are explicitly about spirals, Fibonacci growth, and the
experience of reaching beyond a current state of development. They
reference the idea of expanding one&rsquo;s perception outward through
accumulating cycles, each containing and exceeding the previous one.
The chorus refrain — about spiralling outward — names the mathematical
structure of the golden spiral directly. The song is, in its own terms,
about the process that the mathematics describes.</p>
<p>This kind of thematic coherence between structure and content is what
makes the construction interesting rather than merely clever. The
Fibonacci structure is not decorative. It is the argument of the song
made manifest in its form.</p>
<hr>
<h2 id="why-fibonacci-structure-works-in-music">Why Fibonacci Structure Works in Music</h2>
<p>The most interesting question is not whether the Fibonacci structure is
there — it clearly is — but why it produces the musical effect it does.</p>
<p>Consider what the Fibonacci sequence represents physically. It is the
growth law of structures that build on their own preceding state:
$F(n) = F(n-1) + F(n-2)$. Unlike arithmetic growth (add a constant)
or geometric growth (multiply by a constant), Fibonacci growth is
<em>self-referential</em>. Each term contains the memory of the previous two.
The sequence is expansive but not uniform; it accelerates, but always
in proportion to what came before.</p>
<p>Musical tension and release are, in an important sense, the same
mechanism. A phrase creates an expectation; its continuation either
confirms or subverts that expectation; resolution reduces the tension.
What makes a musical phrase feel like it is building toward something
is precisely the progressive accumulation of expectation — each bar
adding its weight to the previous, the accumulated tension requiring
resolution at a scale proportional to the build-up. The Fibonacci
syllable structure in Lateralus generates this literally: each line
carries as many syllables as the previous two combined, compressing the
delivery further with every line, until the structure has to breathe.</p>
<p>The time signature asymmetry works similarly. In $4/4$, the beat is
predictable, and the listener&rsquo;s body can lock to it and then coast on
that lock. In $9/8 + 8/8 + 7/8$, the beat is never fully locked — the
pattern is periodic (it repeats) but the internal structure of each
repetition is shifting. The listener is perpetually catching up,
perpetually leaning slightly into the music to find the next downbeat.
This is not discomfort; it is engagement. The mathematical reason is
that the full cycle is long enough to defeat the listener&rsquo;s default
$4/4$ expectation but short enough to be audible as a single unit. The
brain can learn the 24-beat super-pattern ($9 + 8 + 7$ eighth notes); it
just requires attention that $4/4$ does not.</p>
<p>There is a deeper reason why golden-ratio proportions feel right in
musical form. The golden section of a piece — the point at which the
piece divides in the $\varphi : 1$ ratio — is the point of maximum
accumulated development before the final resolution. In a five-minute
piece, the golden section falls at roughly 3:05. This is, empirically,
where the emotional and structural climax tends to sit in a wide range
of well-regarded music, from Baroque to jazz. Whether composers
consciously target this proportion or whether the proportion is what
accumulated development looks like when done well is not easily
separable. But the mathematical reason it is <em>a</em> proportion worth
targeting is that $\varphi$ is the only division point that is
self-similar: the ratio of the whole to the longer part equals the ratio
of the longer part to the shorter part. There is no arbitrary scale
associated with the golden section; it is scale-invariant, the same
proportion at every level of analysis.</p>
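<p>The golden section of a piece is trivial to compute; a minimal sketch (not from the original post, just the arithmetic above):</p>

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, ~1.618

def golden_section(duration_s: float) -> float:
    """Time of the phi-division point: whole/longer = longer/shorter."""
    return duration_s / PHI

def as_min_sec(t: float) -> str:
    """Format seconds as m:ss for reading off a timeline."""
    m, s = divmod(round(t), 60)
    return f"{m}:{s:02d}"

# A five-minute (300 s) piece divides at ~185.4 s.
print(as_min_sec(golden_section(300)))  # 3:05
```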
<hr>
<h2 id="a-brief-note-on-binet-and-limits">A Brief Note on Binet and Limits</h2>
<p>The closed-form expression for Fibonacci numbers,</p>
$$F(n) = \frac{\varphi^n - \psi^n}{\sqrt{5}},$$<p>has a pleasing consequence for large $n$. Since $|\psi| \approx 0.618 < 1$,
the term $\psi^n \to 0$, and $F(n)$ is simply the nearest integer to
$\varphi^n / \sqrt{5}$. The integers produced by the Fibonacci recurrence
are the integers that $\varphi^n / \sqrt{5}$ passes closest to. The
exponential growth of $\varphi^n$ and the rounding to integers together
give the sequence.</p>
<p>This is also why the ratios $F(n+1)/F(n)$ converge to $\varphi$
exponentially fast — the error is $\mathcal{O}(|\psi/\varphi|^n)
= \mathcal{O}(\varphi^{-2n})$ — and why, for musical purposes, the
Fibonacci ratios $8:5$, $13:8$, $21:13$ are already excellent
approximations of the golden ratio, close enough that the ear cannot
distinguish them from $\varphi$ in any direct sense.</p>
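<p>Both claims — the nearest-integer property and the fast convergence of the ratios — are cheap to check numerically. A quick sketch, assuming nothing beyond the formulas above:</p>

```python
import math

SQRT5 = math.sqrt(5)
PHI = (1 + SQRT5) / 2  # phi ~  1.618
PSI = (1 - SQRT5) / 2  # psi ~ -0.618

def fib(n: int) -> int:
    """Fibonacci by the recurrence F(n) = F(n-1) + F(n-2)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Binet: since |psi|^n -> 0, F(n) is the nearest integer to phi^n / sqrt(5).
for n in range(1, 30):
    assert fib(n) == round(PHI**n / SQRT5)

# The ratios F(n+1)/F(n) approach phi exponentially fast;
# 8:5, 13:8, 21:13 are already within a few thousandths.
for n in (5, 6, 7):
    print(fib(n + 1), "/", fib(n), "error:", abs(fib(n + 1) / fib(n) - PHI))
```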
<hr>
<h2 id="what-lateralus-is">What Lateralus Is</h2>
<p><em>Lateralus</em> is not a math lecture set to music. It is a nine-minute
progressive metal track that is physically involving, rhythmically
complex, and lyrically coherent. The Fibonacci structure would be
worthless if the song were not also, on purely musical terms, good.</p>
<p>What the mathematics adds is a vocabulary for something the song achieves
anyway: the sense of growing without ever arriving, of each section being
both a resolution of what came before and an opening toward something
larger. The golden spiral does not end. The Fibonacci sequence does not
converge. The song does not resolve in the sense that a classical sonata
resolves; it spirals to a close.</p>
<p>The reason this is worth writing about is that it makes concrete a
connection that is usually stated vaguely: mathematics and music are
similar. They are similar in specific and articulable ways. The
self-referential structure of the Fibonacci recurrence, the
scale-invariance of the golden ratio, the information-theoretic account of
tension and expectation — these are not metaphors for musical experience.
They are, in this case, the actual mechanism.</p>
<hr>
<h2 id="references">References</h2>
<p>Lendvai, E. (1971). <em>Béla Bartók: An Analysis of His Music.</em> Kahn &amp;
Averill.</p>
<p>Benson, D. J. (2006). <em>Music: A Mathematical Offering.</em> Cambridge
University Press. <em>(For an introduction to the general theory of tuning,
temperament, and harmonic series.)</em></p>
<p>Tool. (2001). <em>Lateralus.</em> Volcano Records.</p>
<p>Livio, M. (2002). <em>The Golden Ratio: The Story of Phi, the World&rsquo;s Most
Astonishing Number.</em> Broadway Books.</p>
<p>Knott, R. (2013). Fibonacci numbers and the golden section in art,
architecture and music. <em>University of Surrey Mathematics Department.</em>
<a href="https://r-knott.surrey.ac.uk/Fibonacci/fibInArt.html">https://r-knott.surrey.ac.uk/Fibonacci/fibInArt.html</a></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-11-20</strong>: Clarified the Bartók bar count: the written score has 88 bars; Lendvai&rsquo;s analysis counted 89 by adding an implied final rest bar to reach the Fibonacci number. Previously stated as &ldquo;89 bars&rdquo; without qualification.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>What Black Hole Images Actually Show (and Why a Wormhole Would Look Different)</title>
      <link>https://sebastianspicker.github.io/posts/black-hole-image-wormhole-shadow/</link>
      <pubDate>Mon, 17 Oct 2022 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/black-hole-image-wormhole-shadow/</guid>
      <description>The EHT images of M87* and Sgr A* are remarkable not because they surprised us, but because they confirmed a century-old prediction at microarcsecond precision. The more interesting question: what would a wormhole look like? Completely different — and we have never seen that.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>In 2019 and 2022 the Event Horizon Telescope released images of two supermassive black holes. Both
looked exactly like physics predicted they would. That precision — agreement to within a few
percent — is what makes them scientifically powerful. The ring is not merely beautiful; it is a
quantitative measurement of a metric.</p>
<p>The more interesting question, the one I want to spend time on here, is: what would a wormhole
look like? The answer is: radically different. Which means the images are also evidence — not just
confirmation of a black hole, but a <em>ruling out</em> of certain alternatives at those locations.</p>
<h2 id="the-images">The Images</h2>
<p>The Event Horizon Telescope is a planet-scale interferometer: radio dishes from Hawaii to the
South Pole, phase-locked to atomic clocks, synthesizing an effective aperture the diameter of
Earth. At millimetre wavelengths, this gives an angular resolution of around 20 microarcseconds —
enough to resolve a grapefruit on the Moon.</p>
<p>In April 2019 the collaboration published six simultaneous papers on M87*, the supermassive black
hole at the centre of the Virgo A galaxy
(<a href="#ref-eht2019a">EHT Collaboration et al., 2019a</a>, <a href="#ref-eht2019b">2019b</a>). The ring had an
angular diameter of \(42 \pm 3\) μas, consistent with a black hole of mass
\(M = (6.5 \pm 0.7) \times 10^9 \, M_\odot\) at a distance of 16.8 Mpc. The southern arc of
the ring was brighter — I will return to why.</p>
<p>In May 2022 the same team published results on Sagittarius A*, the Milky Way&rsquo;s central black hole
(<a href="#ref-eht2022">EHT Collaboration et al., 2022</a>). The ring diameter: \(51.8 \pm 2.3\) μas,
corresponding to a mass of \(\sim 4 \times 10^6 \, M_\odot\) at 8.18 kpc. M87* is roughly 1500
times more massive than Sgr A* and roughly 2000 times farther away — so the two apparent ring
sizes are within 25% of each other. The universe arranged the coincidence; the EHT exploited it.</p>
<h2 id="the-physics-of-the-ring">The Physics of the Ring</h2>
<p>The ring is not the black hole itself. You cannot image an event horizon: by definition, no
information escapes from it. What the EHT resolves is the <em>photon sphere</em> — the region of
unstable circular photon orbits — and its shadow.</p>
<p>For a non-rotating (Schwarzschild) black hole, the photon sphere sits at:</p>
$$
r_\text{ph} = \frac{3GM}{c^2} = \frac{3}{2} R_S
$$<p>where \(R_S = 2GM/c^2\) is the Schwarzschild radius. Light orbiting here is in unstable
equilibrium: a small perturbation sends it either spiralling inward or escaping to infinity. The
<em>critical impact parameter</em> — the perpendicular distance from the optical axis at which an
incoming photon just grazes the photon sphere — is:</p>
$$
b_c = \frac{3\sqrt{3} \, GM}{c^2} \approx 5.196 \, \frac{GM}{c^2}
$$<p>The angular diameter of the shadow as seen by a distant observer is therefore:</p>
$$
\theta_\text{shadow} = \frac{2 b_c}{D} = \frac{6\sqrt{3} \, GM}{c^2 D}
$$<p>Plugging in the EHT numbers for M87* (\(M = 6.5 \times 10^9 \, M_\odot\), \(D = 16.8\) Mpc):</p>
$$
\theta \approx \frac{6 \times 1.732 \times 6.5 \times 10^9 \times 1477 \, \text{m}}{16.8 \times 3.086 \times 10^{22} \, \text{m}}
\approx 1.9 \times 10^{-10} \, \text{rad}
\approx 40 \, \mu\text{as}
$$<p>The EHT measured \(42 \pm 3\) μas. Agreement within 5%. This is not a post-hoc fit; it is a
prediction that follows directly from general relativity and a mass independently constrained by
stellar kinematics.</p>
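<p>The same back-of-envelope calculation works for both sources. A minimal sketch (constants rounded, so the results are approximate; not code from the EHT pipeline):</p>

```python
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8          # speed of light, m/s
M_SUN = 1.989e30     # solar mass, kg
MPC = 3.086e22       # megaparsec in metres
KPC = 3.086e19       # kiloparsec in metres
MUAS_PER_RAD = 180 / math.pi * 3600 * 1e6  # microarcseconds per radian

def shadow_diameter_uas(mass_msun: float, distance_m: float) -> float:
    """Schwarzschild shadow angular diameter: theta = 6*sqrt(3)*GM / (c^2 D)."""
    gm_over_c2 = G * mass_msun * M_SUN / C**2  # ~1477 m per solar mass
    return 6 * math.sqrt(3) * gm_over_c2 / distance_m * MUAS_PER_RAD

print(shadow_diameter_uas(6.5e9, 16.8 * MPC))  # M87*:   ~40  (EHT: 42 +/- 3)
print(shadow_diameter_uas(4.0e6, 8.18 * KPC))  # Sgr A*: ~50  (EHT: 51.8 +/- 2.3)
```

The Sgr A* prediction lands within a few percent of the measured \(51.8 \pm 2.3\) μas as well, using only the mass and distance as inputs.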
<p>The first numerical simulation of this image was done by Jean-Pierre Luminet in 1979, using punch
cards and an IBM 7040 (<a href="#ref-luminet1979">Luminet, 1979</a>). He computed the geodesics, rendered the
result by hand on photographic paper, and produced an image that looks startlingly like the
2019 photograph — forty years before the telescope existed.</p>
<h2 id="the-brightness-asymmetry">The Brightness Asymmetry</h2>
<p>The southern arc of the M87* ring is brighter. This is not an instrumental artefact. The
accretion disk — the superheated plasma spiralling into the black hole — orbits at mildly
relativistic speeds, \(v \sim 0.3\text{–}0.6 \, c\). On the approaching side of the disk,
synchrotron emission is Doppler-beamed toward the observer: intensity amplified, frequency
blueshifted. On the receding side, the flux is deboosted (<a href="#ref-eht2019b">EHT Collaboration et al., 2019b</a>).</p>
<p>In M87* the approaching side faces south, which implies — combined with the known orientation of
M87&rsquo;s large-scale relativistic jet — that the black hole spin axis points away from Earth. The
brightness asymmetry is, in effect, a spin measurement.</p>
<h2 id="interstellar-did-it-right">Interstellar Did It Right</h2>
<p>In 2014, the visual effects company Double Negative rendered the black hole Gargantua for
Christopher Nolan&rsquo;s <em>Interstellar</em>. They did this by integrating the actual geodesic equations for
a rapidly spinning (near-extremal Kerr) black hole. Kip Thorne, one of the producers, collaborated
on two companion papers with the visual effects team (<a href="#ref-james2015a">James et al., 2015a</a>).</p>
<p>The resulting image showed the accretion disk wrapping both above and below the black hole,
producing a characteristic double-arc structure — direct emission at the equator plus a secondary
image of the disk mirrored by gravitational lensing. This was not artistic licence. It was the
first photorealistic render of a black hole produced from first principles, and the physicists
found new results in the process: features of the lens map that had not previously been worked out
analytically.</p>
<p>The same team published a companion paper on the wormhole in <em>Interstellar</em>
(<a href="#ref-james2015b">James et al., 2015b</a>). That paper is where things get interesting.</p>
<h2 id="what-a-wormhole-would-actually-look-like">What a Wormhole Would Actually Look Like</h2>
<p>A traversable Morris-Thorne wormhole connects two regions of spacetime through a throat. An
observer near the throat would see <em>both</em> connected universes simultaneously — one on each side
of the throat boundary. The key visual feature, worked out in detail by Thomas Müller
(<a href="#ref-muller2004">Müller, 2004</a>), is this:</p>
<ul>
<li>Looking through the throat, you see the far-side universe compressed into a disk, bounded by
a bright Einstein ring at the throat.</li>
<li>Outside the ring, you see the near-side universe, heavily distorted by the wormhole&rsquo;s
gravitational field.</li>
<li>There is no <em>shadow</em> in the sense a black hole has — no region from which light cannot escape.
Instead, the ring acts as a portal: all light that reaches the throat passes through rather
than being absorbed.</li>
</ul>
<p>The James et al. (2015b) wormhole paper shows this explicitly. The <em>Interstellar</em> wormhole was
rendered as a spherical lens with a celestial hemisphere visible through it. The visual signature
is a double celestial sphere: your own sky distorted around the outside, and a compressed view of
a distant universe through the middle.</p>
<p>This looks nothing like the EHT images.</p>
<p>The EHT sees a <em>shadow</em> — a dark central region from which no emission escapes, surrounded by a
bright ring. A traversable wormhole at the same mass and distance would show a bright ring with
a <em>second universe</em> visible in the centre, not a dark disk. The topologies of the light-path
structures are fundamentally different.</p>
<h2 id="the-images-rule-something-out">The Images Rule Something Out</h2>
<p>This is the point I find underappreciated. The EHT results are usually discussed as confirming
that M87* and Sgr A* are black holes consistent with GR. That framing is correct. But the images
are also <em>falsifying evidence</em> against alternatives.</p>
<p>Several exotic compact object proposals — gravastars, boson stars, some wormhole metrics — predict
shadow-like features. But traversable wormholes of the Morris-Thorne type do not. The EHT image
morphology — shadow, photon ring, brightness asymmetry tracking Doppler beaming — matches the
Kerr metric quantitatively. An astrophysical wormhole of the type that appears in popular science
coverage would look observably different.</p>
<p>The constraint is not absolute. You could construct wormhole geometries whose photon-sphere
structure mimics a black hole&rsquo;s shadow. But those are not the wormholes that typically appear in
discussions of traversable shortcuts through spacetime, and the Morris-Thorne type — the
physically simplest case — is ruled out at M87* and Sgr A* by the EHT morphology alone.</p>
<p>For more on wormhole theory — ER bridges as time-reversal symmetry, the
Einstein-Rosen caterpillar, and Euclidean wormholes in single theories — see
<a href="/posts/try-to-relax-ironic-process-wormholes/">a later post</a>. The physics is rich and ongoing. But
a picture of a wormhole, if one were ever imaged, would not look like what the EHT published.
It would look like a portal.</p>
<h2 id="the-astonishing-thing-is-that-it-worked">The Astonishing Thing Is That It Worked</h2>
<p>I want to end on this. The ring around M87* was predicted in 1916 from a theory written down
without any observation of a black hole, by people who were not sure black holes existed, using
mathematics developed for entirely different purposes. Luminet computed the image in 1979 on
punch cards, and it matched a photograph taken in 2019 with a planet-scale interferometer.</p>
<p>The agreement is 5%. In astrophysics, where parameters routinely span ten orders of magnitude,
that is essentially exact.</p>
<p>The images are astonishing not because they surprised physicists — they confirmed what general
relativity predicted. They are astonishing because general relativity is apparently the kind of
theory that earns the right to be trusted at microarcsecond precision, at distances of 16.8
megaparsecs, around objects whose entire interiors are, by construction, hidden from us.</p>
<p>Peer review welcome. If you have a wormhole geometry whose shadow is indistinguishable from a
Kerr black hole at current EHT resolution, I would genuinely like to read the paper.</p>
<h2 id="references">References</h2>
<ul>
<li><span id="ref-eht2019a"></span>Event Horizon Telescope Collaboration et al. (2019). First M87 Event Horizon Telescope results. I. The shadow of the supermassive black hole. <em>The Astrophysical Journal Letters</em>, 875, L1. <a href="https://doi.org/10.3847/2041-8213/ab0ec7">DOI: 10.3847/2041-8213/ab0ec7</a></li>
<li><span id="ref-eht2019b"></span>Event Horizon Telescope Collaboration et al. (2019). First M87 Event Horizon Telescope results. V. Physical origin of the asymmetric ring. <em>The Astrophysical Journal Letters</em>, 875, L5. <a href="https://doi.org/10.3847/2041-8213/ab0f43">DOI: 10.3847/2041-8213/ab0f43</a></li>
<li><span id="ref-eht2022"></span>Event Horizon Telescope Collaboration et al. (2022). First Sagittarius A* Event Horizon Telescope results. I. The shadow of the supermassive black hole in the center of the Milky Way. <em>The Astrophysical Journal Letters</em>, 930, L12. <a href="https://doi.org/10.3847/2041-8213/ac6674">DOI: 10.3847/2041-8213/ac6674</a></li>
<li><span id="ref-luminet1979"></span>Luminet, J.-P. (1979). Image of a spherical black hole with thin accretion disk. <em>Astronomy &amp; Astrophysics</em>, 75, 228–235.</li>
<li><span id="ref-james2015a"></span>James, O., von Tunzelmann, E., Franklin, P., &amp; Thorne, K. S. (2015). Gravitational lensing by spinning black holes in astrophysics, and in the movie <em>Interstellar</em>. <em>Classical and Quantum Gravity</em>, 32, 065001. <a href="https://doi.org/10.1088/0264-9381/32/6/065001">DOI: 10.1088/0264-9381/32/6/065001</a></li>
<li><span id="ref-james2015b"></span>James, O., von Tunzelmann, E., Franklin, P., &amp; Thorne, K. S. (2015). Visualizing Interstellar&rsquo;s wormhole. <em>American Journal of Physics</em>, 83(6), 486–499. <a href="https://doi.org/10.1119/1.4916949">DOI: 10.1119/1.4916949</a></li>
<li><span id="ref-muller2004"></span>Müller, T. (2004). Visual appearance of a Morris-Thorne wormhole. <em>American Journal of Physics</em>, 72(8), 1045–1050. <a href="https://doi.org/10.1119/1.1758220">DOI: 10.1119/1.1758220</a></li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Lab Goes Home: astro-lab@home and the COVID Pivot in Astronomy Education</title>
      <link>https://sebastianspicker.github.io/posts/astro-lab-at-home/</link>
      <pubDate>Fri, 14 Oct 2022 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/astro-lab-at-home/</guid>
      <description>In spring 2020, the astro-lab at the University of Cologne shut down like everything else. The question was whether you could replicate a hands-on student lab using smartphones and household materials — and send it home. This is the story of how we tried, what we published in CAPjournal, and what happened when schools reopened.</description>
      <content:encoded><![CDATA[<p><em>This post describes two related projects: the astro-lab@home, published in
CAPjournal in 2022 with Alexander Küpper and André Bresges; and its successor,
the astro-lab@school, published the same year in Astronomie+Raumfahrt. Both grew
from the same question: what does astronomy education look like when you cannot
bring students into a lab?</em></p>
<hr>
<h2 id="what-the-astro-lab-was">What the astro-lab Was</h2>
<p>Before the pandemic, the astro-lab at the University of Cologne was a
student laboratory focused on extrasolar planets. School groups — mostly
secondary school students — came in and worked through a set of analogy
experiments: how do you detect a planet you cannot see? How do you infer
its size, its orbit, whether it might be habitable?</p>
<p>The pedagogical bet was that exoplanet research, precisely because it is
headline-generating and genuinely open-ended, could counteract the
motivational slump in physics that tends to set in around middle school.
The context — life in the universe, habitable worlds, the possibility of
something out there — did a lot of the work that no abstract force diagram
could do.</p>
<p>The experiments themselves were analogy experiments: a lamp standing in
for a star, a sphere on a track standing in for a planet. The key
measurement was the transit: when the &ldquo;planet&rdquo; passed in front of the
&ldquo;star&rdquo;, the light sensor registered a dip. Students measured the dip,
estimated the ratio of areas, connected it to radius, and got a number
that meant something. The number was not precise. It did not need to be.
It was real.</p>
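<p>The estimate the students made is one line of physics: the fractional dip in flux equals the area ratio, \( \Delta F / F = (R_p / R_\text{s})^2 \). A minimal sketch with hypothetical sensor readings (the lux values below are illustrative, not from the published materials):</p>

```python
import math

def radius_ratio(flux_out: float, flux_in: float) -> float:
    """Transit depth gives the area ratio dF/F = (R_p/R_s)^2,
    so the radius ratio is the square root of the depth."""
    depth = (flux_out - flux_in) / flux_out
    return math.sqrt(depth)

# Hypothetical readings: 400 lux with the lamp unobstructed,
# 396 lux at mid-transit -> 1% depth -> radius ratio ~0.1,
# roughly the Jupiter/Sun proportion.
print(radius_ratio(400.0, 396.0))  # ~0.1
```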
<hr>
<h2 id="spring-2020">Spring 2020</h2>
<p>In March 2020, schools shut down, and the University of Cologne
followed. Visits to the astro-lab were cancelled. The question the team
faced — Alexander Küpper, André Bresges, and I — was not whether to do
something but what was actually feasible.</p>
<p>German distance learning at the time was characterised by worksheet
packages delivered to students with minimal interactive contact. Only 16%
of German students reported being in video conferences with their
teachers; 30% reported no contact at all since the initial shutdown. The
infrastructure was not there, the habits were not there, and the
expectation that students had the materials and equipment for a
physics lab at home was not warranted.</p>
<p>What students did have, almost universally, was a smartphone.</p>
<p>Modern smartphones contain a remarkable array of sensors: ambient light
sensors, accelerometers, gyroscopes, barometers, magnetometers. The app
<a href="https://phyphox.org">phyphox</a>, developed at RWTH Aachen, makes those
sensors accessible with a clean interface designed for use in education.
If the sensor hardware was already in students&rsquo; pockets, the lab setup
problem became: what household materials can stand in for the rest of the
apparatus?</p>
<hr>
<h2 id="astro-labhome-bringing-science-to-the-sofa">astro-lab@home: Bringing Science to the Sofa</h2>
<p>The astro-lab@home project adapted the original lab experiments for
home use with smartphones and everyday materials. The core transit
experiment — measuring the dip in light caused by an opaque object
passing in front of a lamp — turned out to be reproducible without any
specialist equipment. A table lamp, a ball on a string, and a
smartphone positioned beneath the lamp gave you the raw data. phyphox
recorded the light curve in real time.</p>
<p>We designed the setup to be flexible enough to work with what students
actually had. The default used the ambient light sensor in Android
devices, which is directly accessible through phyphox. iPhones do not
expose their light sensor through software interfaces, so for Apple
devices we recommended an external Bluetooth sensor — an inexpensive
workaround that also had the advantage of producing more consistent data
across device types.</p>
<p>The resulting package was not just an equipment list. We developed
accompanying materials that explained the physics (why does a transit
produce a specific shape of dip rather than a sharp cutoff?), connected
the analogy experiment to the real science (how does this scale up to the
actual transit photometry done by TESS and Kepler?), and offered
scaffolding at different levels of independence.</p>
<p>The project was published in the IAU&rsquo;s <a href="https://www.capjournal.org">CAPjournal</a>
in 2022 — a journal aimed at communicators and educators in astronomy.
The audience was intentionally broad: teachers looking for accessible
classroom activities, outreach organisations trying to reach students at
home, curious individuals who wanted to do something real with their
phone. &ldquo;Bringing science to the sofa&rdquo; was the headline, and that was
genuine. The experiments worked in a living room.</p>
<hr>
<h2 id="what-came-next-astro-labschool">What Came Next: astro-lab@school</h2>
<p>When schools reopened and in-person teaching became possible again, the
question was not simply &ldquo;back to normal&rdquo; but what the COVID period had
actually taught us about the format.</p>
<p>The astro-lab@school, published in Astronomie+Raumfahrt in 2022, addressed
that question directly. Some things from the home version had worked
better than expected. The smartphone-based setup was cheaper, more
portable, and more directly in students&rsquo; hands than the original benchtop
apparatus. There was something pedagogically valuable about students
using their own devices rather than lab equipment provided by someone
else.</p>
<p>The astro-lab@school retained the smartphone-centred approach and
adapted it for a school context: class sizes, time constraints, the
reality of mixed equipment across a room of thirty students. The
experiments from the home version were modified for group work and
parallel execution. The scaffolding materials were reworked for the
paced structure of a school lesson rather than the self-directed format
of home use.</p>
<p>The result was not a reversion to the pre-pandemic lab. It was a hybrid:
in-person group work, but with tools and methods developed for
distributed individual use. The pandemic had, inadvertently, pushed the
format toward something more robust.</p>
<hr>
<h2 id="a-note-on-what-made-this-work">A Note on What Made This Work</h2>
<p>The core technical contribution — smartphones as measurement instruments
for analogy experiments in astronomy education — is described in more
detail in a <a href="/posts/exoplanet-hunting-smartphones/">later publication in <em>The Physics Teacher</em></a>,
which covers the experimental setups, sensor comparison, and pedagogical
progression in a form aimed at an international teaching audience. If
you want the how-to, start there.</p>
<p>What I want to note here is something slightly different: the role of
context.</p>
<p>The astro-lab bet on exoplanets as a motivational context, and the
evidence supports that bet. Exoplanet research remains one of the few
areas of physics that generate genuine public enthusiasm, and students&rsquo;
interest in the topic is empirically documented. What the COVID period
showed is that the context is robust enough to survive the removal of the
lab infrastructure. Students working on transit photometry with a lamp
and a smartphone in their kitchen were doing the same thing, conceptually,
as students at a benchtop sensor station at the university. The physical
setup was different. The question was the same.</p>
<p>That is, I think, a more general lesson. Context-driven education is
not dependent on a specific material configuration. The question carries.</p>
<hr>
<p><em>For the curriculum unit that places these experiments in the context of the
NRW Sekundarstufe I physics syllabus, see
<a href="/posts/fremde-welten-exoplanet-teaching/">Fremde Welten</a>.
For the air pressure / Mars experiment that grew from the same lab, see
<a href="/posts/mission-to-mars/">Mission to Mars</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Spicker, S. J., Küpper, A., &amp; Bresges, A. (2022). astro-lab@home — bringing
science to the sofa. <em>CAPjournal</em>, 31, 12–17.</p>
<p>Küpper, A., &amp; Spicker, S. J. (2022). astro-lab@school. <em>Astronomie+Raumfahrt
im Unterricht</em>, 59(6).</p>
<p>Küpper, A., &amp; Schulz, A. (2017). Das Schülerlabor astro-lab an der
Universität zu Köln. <em>Astronomie+Raumfahrt im Unterricht</em>, 54(1).</p>
<p>Stampfer, C., &amp; Staacks, S. (2020). phyphox — using smartphones as
experimental tools. <em>Physics Education</em>, 55(5), 055007.
<a href="https://doi.org/10.1088/1361-6552/ab8a2e">https://doi.org/10.1088/1361-6552/ab8a2e</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>Why Universities Need Their Own YouTube</title>
      <link>https://sebastianspicker.github.io/posts/educast-nrw-hochschul-youtube/</link>
      <pubDate>Tue, 05 Jul 2022 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/educast-nrw-hochschul-youtube/</guid>
      <description>In June 2022 I presented on educast.nrw at the Tag der Lehre at HfMT Köln. This is the longer argument behind that talk: why universities should not outsource their video infrastructure to commercial platforms, and what a better alternative looks like in practice.</description>
      <content:encoded><![CDATA[<p>In June 2022 I gave a presentation at the <em>Tag der Lehre</em> at the Hochschule für Musik und Tanz Köln on video-supported teaching with <a href="https://educast.nrw/de/">educast.nrw</a>. Presenting something to colleagues who already use ILIAS every day and are broadly skeptical of yet another platform is a useful discipline. You have to answer the obvious question fast: why should we care about this, when YouTube works fine?</p>
<p>The short answer is that YouTube does not work fine, for reasons that matter specifically to universities. The longer answer is what this post is about.</p>
<h2 id="what-is-wrong-with-youtube">What Is Wrong with YouTube</h2>
<p>YouTube is the world&rsquo;s dominant video platform. It is free to use, globally available, handles any file size, transcodes automatically, and comes with an audience of two billion logged-in users. For individual creators who want reach, it is genuinely hard to beat.</p>
<p>For universities it fails in at least three important ways.</p>
<p><strong>Data protection.</strong> When you upload a lecture or a concert recording to YouTube, the content, the metadata, and the viewing behaviour of your students go to Google&rsquo;s servers — which are predominantly in the United States. Under the GDPR, transferring personal data to third countries requires either an adequacy decision, standard contractual clauses with additional safeguards, or explicit informed consent. After the Schrems II ruling (Court of Justice of the EU, 2020), the adequacy of US-based data transfers became legally contested in a way that makes institutional YouTube use genuinely difficult for European universities. Using it for anything with identifiable students — which includes most teaching content — is a compliance problem.</p>
<p><strong>Platform logic.</strong> YouTube is designed to maximise watch time. Its recommendation algorithm is not neutral. It will recommend whatever comes after your lecture, and what comes after your lecture is not under your control. For educational content — especially sensitive material, or content that should remain in a defined pedagogical context — this is a real problem. The platform is not indifferent to what is hosted on it; it shapes how it is consumed.</p>
<p><strong>Institutional fragility.</strong> YouTube is free until it isn&rsquo;t. Platform terms change; monetisation policies change; content is demonetised or removed based on automated systems with imperfect appeal mechanisms. Building institutional infrastructure on a free commercial service is a bet that the commercial incentives of that service will remain aligned with your needs. That bet has a poor historical record.</p>
<p>None of this means that individual instructors should never use YouTube. It means that universities should not make YouTube their default institutional solution.</p>
<h2 id="educastnrw-a-cooperative-model">educast.nrw: A Cooperative Model</h2>
<p><a href="https://educast.nrw/de/">educast.nrw</a> is a project of the Digitale Hochschule NRW — a cooperative of North Rhine-Westphalian universities building shared digital infrastructure. The concept is a state-wide video service, run by universities for universities, for the recording, processing, management, and distribution of video content in teaching and research. The platform calls itself &ldquo;Hochschul-YouTube&rdquo;, which is both accurate and slightly underselling what makes it different.</p>
<p>The HfMT Köln participates as a user institution, which is where I come in as the IT contact for setting it up on our end.</p>
<p>The technical foundation is <a href="https://opencast.org/">Opencast</a>, an open-source video management system developed by a consortium of universities. This matters: the software is auditable, the development direction is set by the institutions that use it rather than by advertising revenue, and the infrastructure runs on German servers that are explicitly GDPR-compliant. Licenses on uploaded content are freely choosable — CC-BY-SA is an option, which means the university&rsquo;s teaching materials can be open access if that is what the instructor wants.</p>
<h2 id="what-it-can-do">What It Can Do</h2>
<p>The feature set covers the actual use cases of a university, not the use cases of a content creator trying to build a following.</p>
<p><strong>Recording.</strong> The Opencast Studio browser app records in three modes: screen only, camera only, or screen and camera simultaneously as a synchronised multi-stream. That last option — <em>Presentation</em> and <em>Presenter</em> as separate streams, played back with the viewer switching focus between them, or as picture-in-picture — is the format that works for a lecture. You get the slides and the speaker in the same video, but the viewer can choose which to focus on. That flexibility is not something you get from a simple screen recording uploaded to YouTube.</p>
<p><strong>Multi-perspective video.</strong> For a music university this is the feature that changes things. A concert or a masterclass is not well-served by a single camera angle. The platform supports simultaneous recording from two camera perspectives — a wide shot and a detail shot, say, or a front view and a hands view for a piano performance. The viewer can switch between them in playback, or the institution can set a default presentation. This is infrastructure that makes the teaching use of concert recordings actually feasible, not just technically possible.</p>
<p><strong>Formats.</strong> Video up to 4K with adaptive bitrate streaming (the player adjusts automatically to the viewer&rsquo;s connection), audio up to broadcast quality (48kHz/16-bit, exceeding CD&rsquo;s 44.1kHz), with FLAC at up to 96kHz/24-bit in development. No file size limit. These specifications matter for music. A piano recording compressed to whatever YouTube decides to do with it is not the same as an uncompromised audio stream. The difference is audible.</p>
<p><strong>ILIAS integration.</strong> This is the practical hinge. Video that lives in a separate platform is video that students may or may not find. Video embedded directly in the ILIAS course page, in the learning module, at the point in the curriculum where it is relevant — that is video that is part of the course rather than adjacent to it. The integration between educast.nrw and ILIAS is direct: upload to the video platform, embed in ILIAS on pages, in learning modules, or as standalone video objects, all from within ILIAS.</p>
<p><strong>Access rights.</strong> The granularity here is what distinguishes it from any public platform. Each video can be set to: public (anyone on the internet), institution-wide (anyone logged in at the university), course-wide (only enrolled students in a specific course), individual (specific named people), or private (only the uploader). A graduation concert might be public. A practice session for student feedback might be course-only. A recording made for an individual student&rsquo;s reflection might be shared only with that student and their teacher. These are all normal use cases in a music university; they all require different settings; the platform handles all of them.</p>
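<p>The nesting of those levels can be modelled as an ordered enumeration, narrowest to widest. This is an illustrative sketch of the access logic only, not the platform&rsquo;s actual API — all names here are hypothetical:</p>

```python
from enum import IntEnum

class Visibility(IntEnum):
    """Per-video access level, narrowest to widest (names are illustrative)."""
    PRIVATE = 0      # only the uploader
    INDIVIDUAL = 1   # uploader plus specific named people
    COURSE = 2       # plus everyone enrolled in the course
    INSTITUTION = 3  # plus anyone logged in at the university
    PUBLIC = 4       # plus anyone on the internet

def can_view(level: Visibility, viewer_rank: Visibility) -> bool:
    """A viewer sees the video if their relationship to it is at least as
    narrow as the video's level: a course member (rank COURSE) sees COURSE,
    INSTITUTION and PUBLIC videos, but not an INDIVIDUAL-only one."""
    return level >= viewer_rank

# A graduation concert set to PUBLIC is visible to an anonymous visitor:
print(can_view(Visibility.PUBLIC, Visibility.PUBLIC))       # True
# A practice recording set to COURSE is hidden from the wider university:
print(can_view(Visibility.COURSE, Visibility.INSTITUTION))  # False
```

<p>The point of the ordering is that each wider level strictly contains the audiences of the narrower ones, which is what makes the five use cases above configurable with a single setting per video.</p>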
<h2 id="use-cases-in-a-music-university">Use Cases in a Music University</h2>
<p>The general university use case — lecture recording, video tutorial, self-study module — applies at HfMT as much as anywhere. But a music and dance university has some specific ones.</p>
<p><strong>Concert recordings.</strong> HfMT Köln runs performances at both its Cologne and Wuppertal sites. Recording these and making them available to students, faculty, and selectively to the public used to mean someone had to manage files, find hosting, deal with YouTube&rsquo;s automated copyright detection flagging student performances of copyrighted repertoire, and explain to students why their graduation concert had been muted by an algorithm. The controlled platform makes all of this manageable.</p>
<p><strong>Stage presence as a reflective tool.</strong> Watching yourself perform is a standard part of performance training. It is uncomfortable, useful, and until recently required either dedicated recording equipment or the ad-hoc use of a phone propped against something. A proper recording infrastructure with controlled access — the student sees the video, their teacher sees the video, nobody else does — changes the pedagogical viability of this approach. The barrier to actually using video feedback in practice teaching drops substantially.</p>
<p><strong>Theory and practice.</strong> This is the institutional argument I made in the presentation and stand by: video infrastructure that works for a lecture also works for a concert. The same system that stores the introduction to music theory also stores the masterclass by a visiting artist. This is not incidental — it is the point of a shared infrastructure. You do not need to choose a platform for academic content and a different one for performance content. The platform works for both.</p>
<h2 id="the-argument-behind-the-argument">The Argument Behind the Argument</h2>
<p>There is a broader principle at work here that extends beyond video platforms. Public universities are funded by public money. The infrastructure they build with that money — software, platforms, data, content — should be under their control, governed by their values, and ideally available to other public institutions. The commercial platform model inverts this: you get free hosting in exchange for your data, your students&rsquo; attention, and your institutional dependence.</p>
<p>educast.nrw is an example of what the alternative looks like in practice: a cooperative of public institutions building shared infrastructure on open-source software, governed collectively, with data on European servers under European law. It is not perfect — the setup overhead is real, the user experience does not match YouTube&rsquo;s, and the feature roadmap (automatic subtitling, H5P support, livestreaming, annotation tools) is still catching up to what commercial platforms have had for years. But the model is right.</p>
<p>The question of who owns the video infrastructure of a university is the same question as who owns its email, its learning management system, its student data. The answer should be: the university, operating under law, answerable to its students and to the public that funds it.</p>
<hr>
<p><em>The slides from the Tag der Lehre 2022 presentation are available on request. For educast.nrw setup at HfMT Köln, contact the IT department.</em></p>
<hr>
<h2 id="links">Links</h2>
<ul>
<li><a href="https://educast.nrw/de/">educast.nrw</a></li>
<li><a href="https://opencast.org/">Opencast</a></li>
<li><a href="https://studio.opencast.org/">Opencast Studio</a></li>
<li><a href="https://tobira.opencast.org/">Tobira Videoportal</a></li>
<li><a href="https://github.com/opencast-ilias">opencast-ilias plugin (GitHub)</a></li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-08-18</strong>: Corrected the capitalisation of &ldquo;OpenCast&rdquo; to &ldquo;Opencast&rdquo; throughout (matching the project&rsquo;s official spelling on opencast.org).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Teaching Stellar Evolution Without a Star: DIY Experiments and a Board Game</title>
      <link>https://sebastianspicker.github.io/posts/stellar-evolution-diy/</link>
      <pubDate>Mon, 11 Apr 2022 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/stellar-evolution-diy/</guid>
      <description>Stellar evolution is now in the NRW physics curriculum, but there are almost no direct experiments you can do with it. Two responses: some DIY smartphone experiments for stellar formation, and a board game called &amp;ldquo;Staub und Sterne&amp;rdquo; (Dust and Stars) that lets students play through the stellar lifecycle. Both grew from the astro-lab project at the University of Cologne.</description>
      <content:encoded><![CDATA[<p><em>This post covers two related pieces of work: a paper on DIY smartphone
experiments for stellar formation, submitted to Astronomie+Raumfahrt
(co-authored with Alexander Küpper); and a board game, &ldquo;Staub und Sterne&rdquo;
(Dust and Stars), designed for use in secondary school physics by Miriam Küpper
and Alexander Küpper.</em></p>
<hr>
<h2 id="the-curriculum-problem">The Curriculum Problem</h2>
<p>The 2019 revision of the NRW Gymnasium physics curriculum for Sekundarstufe I
requires students to be able to describe, in broad outline, the typical stages
of stellar evolution. This is new territory for many teachers — it is not a
topic that would have appeared in teacher education programmes of ten or twenty
years ago, and few teachers have personal experience with it from their own
school or university courses.</p>
<p>More fundamentally: stellar evolution is a topic where the usual experimental
approach does not work. You cannot compress an interstellar gas cloud in a
classroom. You cannot observe a star form in real time. The timescales involved
are tens of millions to billions of years; the spatial scales are measured in
light-years and astronomical units. The experimental toolkit that works for
optics, mechanics, and even much of electromagnetism simply does not apply.</p>
<p>This creates a genuine pedagogical challenge. Students have strong interest in
astrophysical topics — the ROSE study documents this consistently — and stellar
evolution involves physical concepts that are curriculum-relevant (gravity,
pressure, energy, radiation). But the standard path from &ldquo;concept&rdquo; to
&ldquo;experiment&rdquo; to &ldquo;understanding&rdquo; is not available in the usual form.</p>
<p>Two approaches are described here. One uses what students do have — smartphones
and household materials — to model the physics of stellar formation through
analogy. The other accepts that some physics is better learned through
structured play, and designs accordingly.</p>
<hr>
<h2 id="diy-experiments-for-stellar-formation">DIY Experiments for Stellar Formation</h2>
<p>The physics of star formation starts with an interstellar gas cloud and the
competition between gravity and pressure. A cloud collapses when gravity wins:
specifically, when the cloud is massive enough (or cold enough) that
gravitational attraction overcomes the thermal pressure of the gas. This is
the Jeans criterion, and it is the quantitative condition that separates clouds
that will form stars from clouds that will disperse.</p>
<p>The qualitative version is accessible to secondary school students: a dense,
cold, massive cloud is more likely to collapse than a diffuse, hot, small one.
Once collapse begins, it is self-reinforcing — increasing density increases
the gravitational attraction, which drives further compression, which increases
the density further.</p>
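<p>The Jeans criterion itself can be sketched numerically. A minimal estimate of the Jeans mass using the standard textbook expression — the cloud temperature and density here are my own illustrative example values, not figures from the paper:</p>

```python
import math

# Physical constants (SI)
k_B = 1.381e-23    # Boltzmann constant, J/K
G = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
m_H = 1.674e-27    # hydrogen atom mass, kg
M_sun = 1.989e30   # solar mass, kg

def jeans_mass(T, n, mu=2.33):
    """Jeans mass M_J = (5 k T / (G mu m_H))^(3/2) * (3 / (4 pi rho))^(1/2),
    with T in K, n the particle number density in m^-3, and mu the mean
    molecular weight (~2.33 for molecular hydrogen plus helium)."""
    rho = mu * m_H * n                      # mass density, kg/m^3
    return ((5 * k_B * T / (G * mu * m_H)) ** 1.5
            * math.sqrt(3 / (4 * math.pi * rho)))

# A cold, dense molecular cloud core: T = 10 K, n = 1e10 m^-3 (1e4 cm^-3)
M_J = jeans_mass(T=10, n=1e10)
print(f"Jeans mass ≈ {M_J / M_sun:.1f} solar masses")
```

<p>Clouds more massive than this threshold collapse; raising the temperature or lowering the density raises the threshold, which is exactly the qualitative statement above in quantitative form.</p>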
<p>Two DIY experiments were developed to give students a physical encounter with
the key concepts, using materials that can be assembled at home or in school
without specialist equipment.</p>
<p><strong>Experiment 1: Compression and heating.</strong> When a gas is compressed, it heats.
This is directly measurable with the temperature sensor in a smartphone (or a
separate Bluetooth thermometer connected to phyphox) and a simple compression
apparatus — a syringe, a sealed container, or an inflation device. Students
observe the temperature rise during compression and temperature drop during
expansion, establishing the qualitative relationship. In the stellar formation
context: the collapsing gas cloud heats as it compresses, which is why a
protostar is hot long before nuclear fusion ignites.</p>
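<p>The expected size of the effect can be estimated from the ideal-gas adiabatic relation <em>TV</em><sup>γ−1</sup> = const. This is an idealisation — a real syringe leaks heat into its walls, so the measured rise will be smaller — but it bounds what the experiment can show:</p>

```python
# Ideal reversible adiabatic compression: T2 = T1 * (V1 / V2)**(gamma - 1)
gamma = 1.4  # heat capacity ratio for a diatomic gas such as air

def adiabatic_temperature(T1, V1, V2):
    """Final temperature in K after an adiabatic volume change
    from V1 to V2 (any consistent volume units)."""
    return T1 * (V1 / V2) ** (gamma - 1)

# Halving the volume of room-temperature air in a sealed syringe:
T2 = adiabatic_temperature(T1=293.15, V1=60.0, V2=30.0)  # volumes in mL
print(f"{T2:.0f} K ({T2 - 273.15:.0f} °C)")
```

<p>Halving the volume ideally heats room-temperature air by roughly 90 K; the real, partly isothermal syringe shows a much smaller but still clearly measurable rise.</p>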
<p><strong>Experiment 2: Self-reinforcing compression.</strong> A simple model of the positive
feedback loop in gravitational collapse: a weighted ball in a flexible container,
which compresses a small spring or air cushion. The more the ball compresses
the cushion, the further it falls. Students can explore the threshold conditions
under which the system reaches a stable equilibrium versus continues to
compress indefinitely — a qualitative model of the Jeans criterion.</p>
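<p>The threshold behaviour can be illustrated with a toy numerical model — my own sketch, not the apparatus described above. A drive that grows with compression competes with a linear spring: below a critical drive strength the system settles into equilibrium, above it the compression runs away:</p>

```python
def final_compression(F0, k=1.0, dt=0.01, steps=5000, runaway=100.0):
    """Overdamped toy model: dx/dt = F0*(1 + x) - k*x.
    The drive F0*(1 + x) grows with compression x (self-reinforcement,
    like gravity strengthening as density rises); the spring pushes back
    linearly. Returns the settled compression, or None if x exceeds
    `runaway` (unbounded collapse). Threshold is at F0 = k."""
    x = 0.0
    for _ in range(steps):
        x += dt * (F0 * (1 + x) - k * x)
        if x > runaway:
            return None
    return x

print(final_compression(0.5))  # below threshold: settles at F0/(k - F0) = 1.0
print(final_compression(1.2))  # None: above threshold, runaway compression
```

<p>The sharp change in behaviour at the critical drive strength is the qualitative content of the Jeans criterion: the same dynamics, with one parameter, either equilibrates or collapses.</p>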
<p>Both experiments are designed to be performed with available materials at
the DIY/home level. The smartphone&rsquo;s sensor integration via phyphox provides
quantitative data where possible, maintaining the connection to real
measurement that is a design principle across all the astro-lab experiments.</p>
<hr>
<h2 id="why-stellar-evolution-is-hard-to-experiment-with">Why Stellar Evolution Is Hard to Experiment With</h2>
<p>A methodological note worth making explicit: the shift from direct experiment
to analogy experiment to board game is not a retreat from rigor. It is a
recognition that different kinds of physical and conceptual content require
different pedagogical approaches.</p>
<p>For exoplanet detection, we can build a genuine analogy: the physics of a
planet blocking a star&rsquo;s light and a ball blocking a lamp&rsquo;s light are
structurally identical. The analogy experiment produces data whose
interpretation follows the same logic as the real scientific data.</p>
<p>For stellar evolution, the analogy is weaker. The compression of a gas
syringe models one aspect of the collapse (temperature increase) but not the
self-gravitating dynamics, the radiation pressure that eventually halts
collapse, or the nuclear ignition that defines the transition from protostar
to main sequence star. No tabletop experiment captures the whole process.</p>
<p>This is important to tell students explicitly: the experiment models this
aspect of the process, and not the others. Making the model&rsquo;s limits explicit
is part of the scientific literacy the unit is supposed to develop.</p>
<hr>
<h2 id="staub-und-sterne-a-board-game-for-stellar-evolution">&ldquo;Staub und Sterne&rdquo;: A Board Game for Stellar Evolution</h2>
<p>The board game &ldquo;Staub und Sterne&rdquo; (Dust and Stars), designed by Miriam Küpper
and Alexander Küpper, takes a different route to the same content.</p>
<p>Games have been used in physics education in all phases of a lesson: as entry
points (introducing a topic without immediately constraining it to a specific
physical question), as vehicles for content acquisition, and as reinforcement
and assessment tools. For stellar evolution specifically, the argument for a
game is strong: the content involves a branching process with multiple pathways
depending on a single initial parameter (mass), it is cyclic (the remnant of
stellar death seeds the gas cloud that forms the next generation of stars), and
it is inherently dynamic — the drama of a supernova is hard to convey through
a diagram but easy to convey through play.</p>
<p>The target audience is years 7–8 (or 8–9, depending on the school&rsquo;s
internal curriculum placement). The learning objectives:</p>
<ul>
<li>Describe the stages of stellar evolution as a function of mass</li>
<li>Name the possible end states (white dwarf, neutron star, black hole) and
the stellar paths that lead to each</li>
<li>Describe stellar evolution as a cyclic process: the gas cloud produced at
the end of a star&rsquo;s life can, under the right conditions, seed the formation
of new stars</li>
</ul>
<p>The game has players navigating a star through its lifecycle, with the key
branching</p>
decision determined by the star&rsquo;s initial mass. A low-mass star follows one
path; a high-mass star follows another. Both paths end in a stellar remnant
and a dispersed gas cloud — raw material for the next cycle.</p>
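<p>The branching logic is compact enough to write down. A sketch using commonly quoted approximate thresholds — roughly 8 and 25 solar masses; the game&rsquo;s own cut-offs may differ, and the real boundaries are uncertain and metallicity-dependent:</p>

```python
def stellar_fate(initial_mass):
    """Approximate end state as a function of initial mass (in solar
    masses), using textbook threshold values."""
    if initial_mass < 8:
        # low/intermediate mass: red giant, planetary nebula, remnant core
        return "white dwarf"
    elif initial_mass < 25:
        # massive star: core-collapse supernova leaves a compact remnant
        return "neutron star"
    else:
        # very massive star: collapse past neutron-star support
        return "black hole"

for m in (1, 10, 40):
    print(f"{m:>3} solar masses -> {stellar_fate(m)}")
```

<p>A single input parameter selecting among three endpoints is precisely the kind of branching structure that a board game path can make tangible.</p>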
<p>The game design incorporates the research on flow experience in learning:
cooperative or competitive play, immediate feedback on decisions, the kind
of engaged attention that is rare in conventional physics lessons and that
the ROSE study data suggest is precisely what is missing for many students
in physics classrooms.</p>
<hr>
<h2 id="a-note-on-what-experiments-cannot-reach">A Note on What Experiments Cannot Reach</h2>
<p>There is a broader point here that the exoplanet posts sidestep because
the experiments for exoplanet detection are so unusually good. For most
astrophysics — stellar evolution, galactic dynamics, cosmology — there is
no analogy experiment that captures the full physics. The observable has
been observed, the theory has been developed, but the pedagogical problem
of how to give students a physical encounter with that knowledge remains
genuinely difficult.</p>
<p>Games, simulations, interactive visualisations, and structural analogies all
have a role. Each of them is a partial solution; none of them is what a
well-designed experiment is. Knowing which approach fits which content, and
being honest with students about the limits of the model you are using, is
part of what physics teaching requires.</p>
<p>The experiments described in this post are a start on one small part of
that problem.</p>
<hr>
<p><em>The exoplanet experiments from the same project are described in the
<a href="/posts/astro-lab-at-home/">astro-lab@home</a>,
<a href="/posts/exoplanet-hunting-smartphones/">Hunting Exoplanets with Your Phone</a>,
and <a href="/posts/fremde-welten-exoplanet-teaching/">Fremde Welten</a> posts.</em></p>
<p><em>The misconceptions students bring to stellar evolution — about the Sun,
gravity, nucleosynthesis, and the language of astronomy — are documented
in detail in <a href="/posts/astronomy-misconceptions/">Please Stop Saying the Sun Is on Fire</a>,
written as a companion to the September 2020 teacher training session that
motivated much of this work.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Spicker, S. J., &amp; Küpper, A. (submitted). Einfache DIY-Experimente zum
Verständnis der Sternentstehung für den Physik- und Astronomieunterricht
sowie zu Hause. <em>Astronomie+Raumfahrt im Unterricht</em>.</p>
<p>Küpper, M., &amp; Küpper, A. (2022). Sternentwicklung spielerisch verstehen:
Konzeption eines Brettspiels für den Physikunterricht der Sekundarstufe I.
<em>Presentation at AG Lehrerfortbildung, Universität zu Köln.</em></p>
<p>Elster, D. (2008). Was interessiert Jugendliche an den Naturwissenschaften?
VFPC Verein zur Förderung des physikalischen und chemischen Unterrichts.</p>
<p>MSB NRW (2019). <em>Kernlehrplan für die Sekundarstufe I — Gymnasium in
Nordrhein-Westfalen: Physik.</em> Ministerium für Schule und Bildung NRW.</p>
<p>Ward-Thompson, D., &amp; Whitworth, A. (2011). <em>An Introduction to Star Formation.</em>
Cambridge University Press.</p>
]]></content:encoded>
    </item>
    <item>
      <title>They Told Me Not to Use Design Thinking. They Were Right.</title>
      <link>https://sebastianspicker.github.io/posts/design-thinking-vs-grounded-theory/</link>
      <pubDate>Tue, 23 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/design-thinking-vs-grounded-theory/</guid>
      <description>When you are a physicist doing education research, methodology feels like a bureaucratic formality standing between you and the interesting work. Everyone told me to use grounded theory instead of design thinking in my thesis. I ignored them. This is the postmortem.</description>
      <content:encoded><![CDATA[<p><em>A follow-up to the <a href="/posts/mission-to-mars/">Mission to Mars</a> post, which
describes the experimental work. This one is about the methodology layer
underneath it — specifically, what I got wrong.</em></p>
<hr>
<h2 id="the-setup">The Setup</h2>
<p>My background is in physics. I ended up in physics education research
sideways, through the astro-lab project and through a genuine interest in
why students find physics so alienating and what might help. When it came
time to frame that work as a thesis, I had to choose a methodology.</p>
<p>I chose design thinking. Or more precisely, I chose something that
borrowed heavily from design-based research and design thinking frameworks
and that felt, at the time, like the obvious match for what I was doing.
I was designing experiments. I was iterating on them. I was testing them
with students and refining them. Design thinking is a framework for
exactly this process. What could be more natural?</p>
<p>Several people told me I was making a mistake. Colleagues with more
qualitative research experience, a supervisor who had been through
the methodology debates in education research more times than he wanted
to count. The consistent advice was: use grounded theory. Be systematic
about your data. Let the categories emerge from what you actually observe
rather than from what you designed the experiment to produce.</p>
<p>I thought I understood what they were saying. I did not understand what
they were saying.</p>
<hr>
<h2 id="what-i-thought-design-thinking-gave-me">What I Thought Design Thinking Gave Me</h2>
<p>Design thinking, as a research framing, offered what felt like a clean
correspondence between method and subject matter. The thing I was
producing was a designed artifact — a teaching experiment. The process
I was following was inherently iterative: run it, observe what happens,
revise, run it again. The framework had a vocabulary for this (empathise,
define, ideate, prototype, test) that matched my actual working process.</p>
<p>Design-based research, the academic version of this approach in education,
has a real literature behind it. It is used in educational technology
research and in curriculum development. It is not a made-up category. The
argument for it is reasonable: if you are trying to design effective
educational interventions, then designing and studying those interventions
at the same time is a coherent research strategy.</p>
<p>What I told myself was: I am doing design-based research. The methodology
matches the work. The thesis will describe the design process, the
rationale for each design decision, the iterative refinements, and the
evidence that the final design works. This is a contribution to knowledge
because it produces a principled, evidence-informed design that other
practitioners can use and adapt.</p>
<p>This is not wrong. But it is not enough for a thesis. And I only
understood why after I had spent considerable time trying to make it
enough.</p>
<hr>
<h2 id="the-reckoning-in-the-methodology-chapter">The Reckoning in the Methodology Chapter</h2>
<p>The methodology chapter of a thesis is where you have to be explicit
about the epistemological status of your claims. You are not just
describing what you did. You are explaining why the thing you did counts
as knowledge production, what kind of knowledge it produces, and how
someone else could evaluate whether you did it correctly.</p>
<p>This is where design thinking started to come apart.</p>
<p><strong>What kind of claim does a design study make?</strong> The honest answer is:
it makes a claim about this design, in these contexts, with these
students. It does not easily generalise beyond that. If I show that
the Mission to Mars experiment produces measurable improvements in
students&rsquo; understanding of air pressure in a student lab context at
the University of Cologne in 2019, the implication for other teachers
in other contexts is&hellip; unclear. The design worked here. Maybe it
will work for you. Good luck.</p>
<p>A thesis contribution needs to be something more transferable than that.
It needs to produce knowledge about a phenomenon, not just knowledge
about a specific designed object. &ldquo;Here is a well-designed experiment&rdquo;
is a practitioner contribution, which is genuinely valuable, but it is
not the same as a theoretical contribution to the field.</p>
<p><strong>The iteration problem.</strong> Design thinking celebrates iterative
refinement. But in a thesis, every iteration needs to be motivated by
evidence, and the nature of the evidence and how it maps onto the
design changes needs to be made explicit. If I changed something between
version 1 and version 2 of the experiment, the methodology chapter must
explain: what data told me to make that change? How did I analyse it?
What coding framework did I apply? What alternative changes did I
consider and rule out, and on what grounds?</p>
<p>Design thinking has no systematic answer to these questions. It has
process descriptions (&ldquo;we tested with users and gathered feedback&rdquo;) but
not research methodology answers (&ldquo;I applied open coding to the think-aloud
protocols and the following categories emerged, which pointed toward
this specific revision&rdquo;). Without that precision, the &ldquo;iteration&rdquo; in
the methodology chapter looks like: I tried it, it did not quite work,
I made it better. Which is honest but not a researchable process.</p>
<p><strong>The validation problem.</strong> Design-based research often validates its
designs against the criteria that motivated the design. I designed the
experiment to address specific student misconceptions about air pressure.
I then tested whether students who did the experiment had fewer of those
misconceptions afterward. If the answer is yes, the design is validated.</p>
<p>But this is circular in a way that becomes visible under examination.
The misconceptions I targeted were the ones I identified at the start.
The students I studied were the ones who came to my lab. The measurement
instrument I used was one I designed to detect the specific changes
I expected the design to produce. The whole system is oriented toward
confirming the design rather than discovering something about the
phenomenon.</p>
<p>Grounded theory cuts this loop. You start with the data — the
students&rsquo; actual responses, their misconceptions as they express them,
the things that confuse them that you did not anticipate — and you
build categories from the bottom up. What you end up with is a theory
of how students actually think about air pressure (or whatever the topic
is), which may or may not match what you assumed when you designed the
experiment. The cases where it does not match are precisely where the
theoretical contribution lives.</p>
<hr>
<h2 id="what-grounded-theory-would-have-required">What Grounded Theory Would Have Required</h2>
<p>Grounded theory, done properly, is laborious. The Glaserian version
(open coding, theoretical sampling until saturation, constant
comparative method) requires treating every interview, every observation,
every student response as a data source to be systematically analysed,
compared, and connected into a coherent theory.</p>
<p>Theoretical sampling means you do not decide in advance how many students
to study or what contexts to observe. You keep gathering data until new
cases stop producing new categories — until the theory is saturated.
This is methodologically sound and practically painful, because you
cannot know in advance when you will be done.</p>
<p>Memoing — writing ongoing analytical notes about the emerging categories
and their relationships — is a discipline that forces you to be explicit
about your reasoning at every step. Not just &ldquo;these two responses seem
similar&rdquo; but &ldquo;these two responses are similar because both students are
treating pressure as a property of moving air, and here is how that
connects to the misconception documented by [citation].&rdquo;</p>
<p>I did not want to do this. I wanted to design experiments. Grounded
theory felt like a detour from the thing I was actually interested in.</p>
<p>The advice I received was: this is not a detour. A systematic analysis
of what students think about air pressure, and how they think about it,
and what experiences shift their thinking, is a theoretical contribution
that would make the experiments more useful to everyone — not just a
record of experiments that worked in one lab in one city in one year.</p>
<p>They were right about this.</p>
<hr>
<h2 id="what-i-actually-learned-too-late-to-use-in-the-thesis">What I Actually Learned (Too Late to Use in the Thesis)</h2>
<p>The most useful student responses in the Mission to Mars experiment
were not the ones that confirmed the design was working. They were the
unexpected ones.</p>
<p>The PVC pipe failure — the moment when the lid pops off and students
hear the sound — was included because I thought it would demonstrate the
direction of pressure force in a visceral way. What I observed, which
I noted but did not systematically analyse, was that different students
interpreted the pop differently. Some immediately understood it as the
internal air pushing out. Others interpreted it as the external vacuum
pulling the lid. A few were unsure which way the force had been directed
even after the event.</p>
<p>A grounded theory analysis of those responses would have produced
something genuinely interesting: a typology of how students process
a demonstrable physical event when it conflicts with their existing
pressure intuitions. That typology would have been transferable to
other experimental contexts, other pressure scenarios, other situations
where students encounter the vacuum-suction confusion.</p>
<p>Instead I noted it, described it qualitatively, and moved on because
it was not what the design was optimised to produce.</p>
<p>That is the design thinking trap. You are so focused on the designed
outcome that you treat unexpected observations as noise rather than as
data. Grounded theory treats them as the most valuable data you have.</p>
<hr>
<h2 id="a-note-for-other-physicists-entering-education-research">A Note for Other Physicists Entering Education Research</h2>
<p>If you are coming from a natural science background and you are starting
work in education research, the methodology question will feel foreign
at first. In physics, methodology is largely a matter of technical
choice — which instrument, which statistical test, which model. The
epistemological questions (what kind of knowledge does this produce?
how does it generalise?) are handled by the experimental framework
itself, which is a known, shared, peer-reviewed practice.</p>
<p>In qualitative education research, those questions are not handled in
advance. You have to work them out explicitly, for your specific study,
in writing. This is uncomfortable for people trained in a tradition where
you do the experiment and then write up what happened.</p>
<p>The temptation, for a physicist, is to choose a methodology that feels
like a framework for doing things rather than one that feels like a
framework for thinking about what you found. Design thinking is a
framework for doing things. Grounded theory is a framework for thinking
about what you found.</p>
<p>Both are legitimate. But a thesis needs to make a theoretical contribution,
and theoretical contributions come from systematic analysis of phenomena,
not from documentation of designed objects.</p>
<p>I would have finished faster and understood more if I had done the
uncomfortable thing from the start.</p>
<hr>
<p><em>The experimental work this post is commenting on is described in
<a href="/posts/mission-to-mars/">Mission to Mars</a>. For a more successful later
use of qualitative methodology in a related context, see
<a href="/posts/ai-transcription-grounded-theory/">AI Transcription and Grounded Theory</a>.</em></p>
<hr>
<h2 id="references">References</h2>
<p>Glaser, B. G., &amp; Strauss, A. L. (1967). <em>The Discovery of Grounded
Theory: Strategies for Qualitative Research.</em> Aldine.</p>
<p>Strauss, A., &amp; Corbin, J. (1998). <em>Basics of Qualitative Research:
Techniques and Procedures for Developing Grounded Theory</em> (2nd ed.).
SAGE Publications.</p>
<p>The Design-Based Research Collective (2003). Design-based research: An
emerging paradigm for educational inquiry. <em>Educational Researcher</em>,
32(1), 5–8. <a href="https://doi.org/10.3102/0013189X032001005">https://doi.org/10.3102/0013189X032001005</a></p>
<p>Brown, T. (2008). Design thinking. <em>Harvard Business Review</em>, 86(6),
84–92.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Mission to Mars: Teaching Air Pressure with a Smartphone and a Vacuum Pump</title>
      <link>https://sebastianspicker.github.io/posts/mission-to-mars/</link>
      <pubDate>Fri, 17 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/mission-to-mars/</guid>
      <description>You give students a vacuum pump, a bag of household materials, and a smartphone running phyphox. Their task: build a spaceship prototype that will survive the pressure difference between the crew compartment and space. A design-based inquiry experiment published in The Physics Teacher in 2021, and one of the more memorable experiments I have been part of running.</description>
      <content:encoded><![CDATA[<p><em>This post describes &ldquo;Mission to Mars: Concept and Implementation of a
Design-Based (Hands-On) Smartphone Experiment Helping Students Understand
the Effects Caused by Differences in Air Pressure&rdquo;, published in The Physics
Teacher (Vol. 60, 2022) together with Alexander Küpper and André Bresges.</em></p>
<hr>
<h2 id="the-problem-with-air-pressure">The Problem With Air Pressure</h2>
<p>Air pressure is one of those topics students nominally know from everyday
life and almost always misunderstand. The documented
misconceptions are a long list: air is &ldquo;empty&rdquo; (nothing in it), air is
weightless, air only exerts pressure when it moves (like wind), a vacuum
&ldquo;sucks&rdquo; rather than being a region where surrounding air pushes in, and
pressure increases with height rather than decreasing.</p>
<p>Some of these misconceptions are stubborn precisely because everyday
experience seems to support them. Air does not feel like it has weight.
A vacuum cleaner does feel like it is pulling. The atmosphere, experienced
from inside it, does not announce itself as a pressure source.</p>
<p>The standard approach to this material — explaining atmospheric pressure,
defining $p = F/A$, working through barometric altitude formulae — addresses
the conceptual gaps at the declarative level. Students can recite that air
exerts pressure in all directions. Whether they have actually updated their
mental model is a different question.</p>
<p>&ldquo;Mission to Mars&rdquo; is a design-based attempt at conceptual change through
physical encounter with the consequences of pressure differences.</p>
<hr>
<h2 id="the-context-why-mars">The Context: Why Mars?</h2>
<p>The motivation for choosing the Mars context was empirical, not poetic.
The ROSE study — a large international survey of student interests in
science — consistently finds that space, astronomy, and human exploration
rank among the most motivating contexts for physics learning, for both
boys and girls. Physics education research in Germany has known for decades
that generic &ldquo;physics&rdquo; lessons underperform motivationally compared with
context-embedded physics, and astronomy is one of the contexts with the
clearest evidence base.</p>
<p>&ldquo;Mission to Mars&rdquo; asks students: a crewed mission to Mars would travel
through the vacuum of space, with the crew living in a pressurised
compartment. The compartment has to maintain atmospheric pressure while
surrounded by near-vacuum. What happens if it fails? And how would you
design a spacecraft structure to prevent that failure?</p>
<p>The question is concrete. The physics behind it — the difference between
the pressure inside the compartment and the near-zero pressure outside,
and the forces this pressure difference exerts on any structure — is the
content of the lesson.</p>
<hr>
<h2 id="the-experiment">The Experiment</h2>
<p>The full version of the experiment, as we ran it at the astro-lab at the
University of Cologne, uses a vacuum pump, a bell jar, and a smartphone
running <a href="https://phyphox.org">phyphox</a>. The smartphone&rsquo;s built-in barometric
pressure sensor records real-time atmospheric pressure inside the bell jar
as the pump evacuates it.</p>
<p>Before building anything, students verify that the smartphone is a
functional pressure gauge: they measure the current atmospheric pressure
in the room and compare it with a provided reference value. This step
matters pedagogically — it establishes that the phone is a real scientific
instrument, not just a device for receiving worksheets.</p>
<p>Then comes the design-build-test cycle:</p>
<p><strong>Design</strong>: Students are given PVC plumbing pipe sections, empty food
containers, resealable bags, rubber bands, clamps, and other household
materials. Their task is to build a prototype &ldquo;spaceship&rdquo; — a container
that will maintain near-atmospheric pressure inside while the bell jar
around it is evacuated to low pressure. The phone (or external sensor)
goes inside the prototype to measure whether the prototype is holding.</p>
<p><strong>Predict</strong>: Before testing, students are asked to state why they think
their prototype will or won&rsquo;t work. This surfaces their preconceptions
in a low-stakes way and sets up the next stage.</p>
<p><strong>Test</strong>: The prototype goes into the bell jar. The pump runs. The pressure
sensor records. The light curve — sorry, the pressure curve — tells the
story. Four outcomes are possible:</p>
<ul>
<li><strong>Nearly flat line</strong>: the prototype is airtight, pressure inside
stays near atmospheric. Mission success.</li>
<li><strong>&ldquo;Bathtub&rdquo; curve</strong>: a visible failure event — a cap pops off, the
pressure inside drops sharply and then equalises. Students hear the
pop. They did not expect the pop. This is the moment.</li>
<li><strong>Gradual decay</strong>: the prototype leaks slowly, the pressure inside
drops steadily. Invisible failure.</li>
<li><strong>Noisy signal</strong>: something wrong with the setup.</li>
</ul>
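<p>phyphox can export the recorded trace, and the four outcomes are distinct
enough that a few lines of Python can triage a trace automatically. A heuristic
sketch; the function name and thresholds here are illustrative, not calibrated
against real runs:</p>

```python
def classify_pressure_curve(p, noise_threshold=3.0, drop_threshold=0.05):
    """Sort a pressure trace (hPa samples in time order) into one of the
    four outcomes. Thresholds are guesses, chosen for illustration."""
    total_drop = (p[0] - p[-1]) / p[0]                  # fractional pressure loss
    jitter = max(abs(b - a) for a, b in zip(p, p[1:]))  # sample-to-sample noise
    max_step_drop = max(a - b for a, b in zip(p, p[1:]))
    if total_drop < drop_threshold:
        # pressure held: either a clean flat line or a noisy setup
        return "noisy" if jitter > noise_threshold else "flat"
    if max_step_drop > 0.5 * (p[0] - p[-1]):
        return "bathtub"   # one sudden failure event (the pop)
    return "gradual"       # slow leak: invisible failure
```

<p>In the lab this would only be a convenience; the point of the experiment is
that students read the curve themselves.</p>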
<p><strong>The PVC pipe trap</strong>: the PVC pipe is deliberately included because it
is the most impressive-looking material and is reliably incorrect. The
friction between pipe and lid is insufficient at the pressure differences
reached in the bell jar. The lid pops off. Students rebuild.</p>
<hr>
<h2 id="the-misconceptions-addressed">The Misconceptions, Addressed</h2>
<p>The design-test-rebuild cycle forces students to confront the misconceptions
listed above in a direct physical way:</p>
<p><em>Air is empty/weightless</em>: handled in pre-activities with standard
demonstrations (the dunked napkin, the deflated-vs-inflated balloon).</p>
<p><em>Air only exerts pressure when moving</em>: the bell jar demonstration makes
this concrete — the sensor shows pressure even in a static, undisturbed
volume. When the pump evacuates the jar, the &ldquo;stillness&rdquo; of the remaining
air doesn&rsquo;t change its pressure.</p>
<p><em>A vacuum sucks</em>: this is the crucial one, and the PVC lid pop addresses
it more effectively than any explanation. The lid does not get sucked
outward. The air inside the prototype at near-atmospheric pressure pushes
the lid open against the external near-vacuum. When the lid fails and
students hear the rush of air flowing back in after the valve is opened,
the direction of the pressure force becomes viscerally clear: it was
always the higher-pressure region pushing into the lower-pressure region.</p>
<p>The inquiry is scaffolded through worksheets and index cards, and there
is a teaching assistant present in the lab version to catch dangerous
situations (the smartphone can be damaged if exposed to too low a pressure
— the instructions include a warning about testing pump suction strength
before risking the device).</p>
<hr>
<h2 id="diy-variants-for-school-and-home">DIY Variants for School and Home</h2>
<p>The full lab setup is expensive and not portable. One design principle we
wanted to maintain was accessibility: the experiment should work at three
budget levels.</p>
<p><strong>Low budget (&lt; $5)</strong>: empty food containers connected to a household
vacuum cleaner through a small hole. It works, but gives students no
real-time measurement to watch.</p>
<p><strong>Mid budget ($5–$50)</strong>: translucent storage containers in nested sizes
(large = &ldquo;space&rdquo;, small = &ldquo;spaceship&rdquo;), a small sealing ring to connect
the vacuum source. Students can watch the phone display through the
container during evacuation. The vacuum achieved is weaker than in the lab
version, but the qualitative experience — the prototype holding or
failing — is the same.</p>
<p><strong>Expensive ($500+)</strong>: the full lab version with bell jar and diaphragm
pump. Best analogy, best data, highest barrier.</p>
<p>The DIY take-home message, as the paper puts it: be creative, fail
forward. Anything that creates some vacuum and fits a prototype counts.</p>
<p>The experiment adapts readily to e-learning contexts: each group builds
a prototype, tests it (or has a family member film it), and presents the
outcome — including why the first prototype failed and how the second
was improved — in a shared video conference.</p>
<hr>
<h2 id="a-note-on-where-this-fits">A Note on Where This Fits</h2>
<p>&ldquo;Mission to Mars&rdquo; grew out of the astro-lab at the University of Cologne,
the same student laboratory context as the exoplanet transit experiments.
The common thread is not the specific physics topic (air pressure here,
photometry there) but the experimental approach: smartphones as real
measurement instruments, everyday materials as apparatus, an astronomical
context that sustains engagement, and a design-build-test cycle that
forces students to encounter the physics physically rather than only
propositionally.</p>
<p>The air pressure content connects naturally to the exoplanet unit at a
curriculum level: habitability of exoplanets depends partly on atmospheric
pressure. In the <a href="/posts/fremde-welten-exoplanet-teaching/">Fremde Welten article</a>,
atmospheric pressure is listed as one of the factors that determine whether
a detected exoplanet could support life — an explicit cross-link between
the two units.</p>
<p>The <a href="/posts/astro-lab-at-home/">astro-lab@home post</a> describes how the
broader astro-lab programme — including this experiment — was adapted for
home use during the pandemic. The air pressure experiment is among the
more challenging to replicate at home, but the low-budget vacuum cleaner
variant makes a version of it possible.</p>
<p>The design-build-test structure of this experiment also ended up at the
centre of a methodological argument during my thesis work. The short
version: everyone told me to use grounded theory instead of design
thinking as the research framework, and they were right to do so. That
story is in <a href="/posts/design-thinking-vs-grounded-theory/">a separate post</a>.</p>
<hr>
<h2 id="references">References</h2>
<p>Spicker, S. J., Küpper, A., &amp; Bresges, A. (2022). Mission to Mars:
Concept and implementation of a design-based (hands-on) smartphone
experiment helping students understand the effects caused by differences
in air pressure. <em>The Physics Teacher</em>, 60(1), 47–50.
<a href="https://doi.org/10.1119/10.0009109">https://doi.org/10.1119/10.0009109</a></p>
<p>Küpper, A., &amp; Schulz, A. (2017). Schülerinnen und Schüler auf der
Suche nach der Erde 2.0 im Schülerlabor der Universität zu Köln.
<em>Astronomie+Raumfahrt im Unterricht</em>, 54(157), 40–45.</p>
<p>Staacks, S., Hütz, S., Heinke, H., &amp; Stampfer, C. (2018). Advanced
tools for smartphone-based experiments: phyphox. <em>Physics Education</em>,
53(4), 045009.
<a href="https://doi.org/10.1088/1361-6552/aac05e">https://doi.org/10.1088/1361-6552/aac05e</a></p>
<p>Sjøberg, S., &amp; Schreiner, C. (2010). <em>The ROSE project: An overview
and key findings.</em> University of Oslo.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-10-03</strong>: Updated the self-citation to the correct year (2022), volume/issue (60(1)), pages (47–50), and DOI (10.1119/10.0009109).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Please Stop Saying the Sun Is on Fire</title>
      <link>https://sebastianspicker.github.io/posts/astronomy-misconceptions/</link>
      <pubDate>Tue, 17 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/astronomy-misconceptions/</guid>
      <description>In September 2020 I gave a teacher training talk on stellar formation and the misconceptions students bring into class. The misconception list was long enough to be its own document. Here it is, with commentary. Includes: the Sun as a heat-planet, gravity that only works when things move, metals that always existed, and the obligatory complaint about quantum leaps.</description>
      <content:encoded><![CDATA[<p><em>In September 2020 Alexander Küpper and I gave a teacher training session on
stellar formation — why experiments for it are hard to design, and what
misconceptions students typically arrive with. This post is loosely based on
the misconceptions part of that talk, which turned out to generate the most
discussion.</em></p>
<hr>
<h2 id="why-misconceptions-are-not-just-wrong-answers">Why Misconceptions Are Not Just Wrong Answers</h2>
<p>Before the list, a clarification that matters pedagogically.</p>
<p>When education researchers say &ldquo;misconception,&rdquo; they do not mean a random
error or a gap in knowledge. A misconception is a stable, self-consistent
mental model that students actively use to interpret new information. It
persists not because the student hasn&rsquo;t heard the correct explanation but
because the incorrect model handles a wide range of everyday experience
reasonably well.</p>
<p>&ldquo;Fire is a thing that makes heat and light and consumes fuel&rdquo; is a perfectly
adequate mental model for everything a student encounters outside a physics
class. It explains candles, campfires, gas hobs, and car engines. The fact
that it also leads the same student to conclude that the Sun &ldquo;burns&rdquo; in the
chemical combustion sense is not a failure of intelligence — it is the
natural extension of a model that works.</p>
<p>The implication, which Bransford, Brown, and Cocking put plainly in 2000:
if you ignore what students already believe and simply present the correct
model, &ldquo;the understanding they develop can vary substantially from what the
instructor intended.&rdquo; The new information gets interpreted through the
existing model, not in place of it. You end up with students who can repeat
&ldquo;the Sun fuses hydrogen&rdquo; while still, in their mental model, imagining it as
a very large and very hot fire.</p>
<p>With that said: here is the list.</p>
<hr>
<h2 id="the-sun-is-not-a-star">The Sun Is Not a Star</h2>
<p>This one leads because it is the most structurally interesting.</p>
<p>Bailey et al. (2009), in a study of students&rsquo; pre-instructional ideas about
stars and star formation, document the following category of response: the
Sun is a special kind of astronomical body with its own distinct properties.
It is not a star. Stars are the things you see in the night sky. The Sun is
different.</p>
<p>This is not an isolated finding. Schecker et al. (2018) document the same
pattern in the German context. Students who know perfectly well that &ldquo;the
Sun is a star&rdquo; as a stated fact will nonetheless, when asked to reason about
stellar properties, implicitly exempt the Sun from those properties. Stars
are far away, they are small and faint, they are cold and distant. The Sun
is close, large, and bright. Ergo the Sun cannot really be a star, whatever
the textbook says.</p>
<p>The pedagogical consequence is that teaching stellar evolution to students
who hold this model requires first collapsing the Sun/star distinction —
otherwise everything that follows is about something unfamiliar and distant
rather than about the object eight light minutes away that we can observe
in detail.</p>
<p>A companion misconception: <strong>all stars are smaller than the Sun</strong>. This is
the inverse problem. Students who correctly classify the Sun as a star but
have only seen stars as faint points of light infer that stars must be small.
The Sun, which they know to be large, therefore cannot be a typical star.
Betelgeuse — a red supergiant with a radius approximately 700 times the
Sun&rsquo;s, which if placed at the Sun&rsquo;s position would extend past the asteroid belt — tends
to produce strong cognitive dissonance when it is first encountered.</p>
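<p>The arithmetic behind that figure is short enough to check, using
standard rounded constants:</p>

```python
R_SUN_KM = 6.957e5   # solar radius in km
AU_KM = 1.496e8      # astronomical unit in km

r_betelgeuse_au = 700 * R_SUN_KM / AU_KM
# ≈ 3.26 AU: past Mars's orbit (1.52 AU) and well into
# the asteroid belt (2.2–3.3 AU)
```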
<hr>
<h2 id="the-sun-is-on-fire">The Sun Is on Fire</h2>
<p>The combustion model of stellar energy is, empirically, the most common
student conception and the hardest to dislodge.</p>
<p>From Favia et al.&rsquo;s misconception inventory, translated loosely:</p>
<ul>
<li><em>&ldquo;The Sun is made of fire.&rdquo;</em></li>
<li><em>&ldquo;Stars run on fuel: petrol or natural gas.&rdquo;</em></li>
<li><em>&ldquo;The Sun is made of molten lava.&rdquo;</em></li>
<li><em>&ldquo;The Sun is a heat-planet.&rdquo;</em></li>
</ul>
<p>Bailey et al.&rsquo;s quantitative data: when asked how stars produce light,
32% of students described chemical burning. A further 28% described unspecified
&ldquo;chemical reactions.&rdquo; Only 7% named nuclear fusion. Only 3% could both name
fusion and correctly connect it to the production of light.</p>
<p>The combustion model is coherent and consistent. It gives you a mechanism
(fuel + oxygen → heat and light), a timescale (stars eventually run out of
fuel and go dark), and a product (visible light and heat). What it cannot
handle is the scale: the Sun has been shining for 4.6 billion years and has
approximately 5 billion years of fuel remaining. Chemical combustion at the
Sun&rsquo;s luminosity would exhaust any chemically plausible fuel supply in tens
of thousands of years. This is the crack in the model that fusion fills —
not by saying &ldquo;the Sun burns differently&rdquo; but by replacing the entire energy
mechanism with one that operates at scales the combustion model cannot reach.</p>
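<p>The timescale argument is worth making quantitative, because the numbers are
not even close. A back-of-envelope sketch, using a deliberately generous
10⁸ J/kg for chemical energy release (real fuel–oxidiser mixtures manage
roughly an order of magnitude less):</p>

```python
M_SUN = 1.989e30    # kg, the Sun's mass
L_SUN = 3.828e26    # W, the Sun's luminosity
SECONDS_PER_YEAR = 3.156e7

E_CHEM = 1.0e8      # J/kg: generous upper bound for chemical burning
E_FUSION = 6.3e14   # J/kg: ~0.7% of mc² released fusing H to He

t_chem_years = M_SUN * E_CHEM / L_SUN / SECONDS_PER_YEAR
# ≈ 16,000 years: the combustion model misses by six orders of magnitude

t_fusion_years = 0.1 * M_SUN * E_FUSION / L_SUN / SECONDS_PER_YEAR
# ≈ 10 billion years, assuming only the core (~10% of the mass) fuses
```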
<p>One related misconception worth noting explicitly: <strong>the Sun is hottest at
its surface</strong>. This is the intuitive model — things are hot near the fire
and cooler further away. The corona&rsquo;s temperature of a million Kelvin, far
above the photospheric 5,778 K, violates this so thoroughly that it remained
an active research problem for decades (and, in some senses, still is).
Students encountering coronal heating for the first time do not usually reject
it, but they do find it genuinely strange in a way they cannot articulate —
which is the signature of something colliding with a stable prior model.</p>
<hr>
<h2 id="gravity-only-works-when-things-move">Gravity Only Works When Things Move</h2>
<p>The gravity misconceptions documented in the research literature are worth
treating separately because they have direct consequences for understanding
stellar formation — which depends entirely on gravity acting on stationary
or slowly drifting gas clouds.</p>
<p>The relevant findings:</p>
<p><strong>Gravity requires motion</strong> (Palmer, 2001). A significant proportion of
students believe that gravity only acts on objects that are in motion. A
stationary object is not subject to gravitational attraction. A table sitting
on the floor: fine, no gravity needed. A gas cloud drifting slowly through
space: also fine. A gas cloud being compressed by gravitational self-attraction:
this requires gravity to act on particles that are not yet moving, which the
model cannot accommodate.</p>
<p><strong>Force implies movement</strong> (Gunstone &amp; Watts, 1985). The more general version:
forces produce motion, and where there is no net motion, there is no net force.
The concept of force balance — two equal and opposite forces summing to zero
net force, with the object not moving — is not available to students holding
this model. It is hard to overstate how consequential this is for astrophysics.
Almost every stable astrophysical structure — a main-sequence star, a planetary
orbit, a galaxy&rsquo;s rotation — is a force balance. Students without the concept
cannot reason about any of them correctly.</p>
<p><strong>Gravity only acts on Earth</strong> (Bar, Brosh, &amp; Sneider, 2016). Students in
the space context often reason that gravity is a property of Earth specifically.
In space, things are &ldquo;weightless&rdquo; — and weightlessness is interpreted as the
absence of gravity rather than as the experience of free fall in a gravitational
field. The result: gravity cannot be the mechanism by which an interstellar
gas cloud collapses, because gas clouds are in space and gravity does not work
there. Asghar and Libarkin (2010) found that only one in five non-physics
college students could correctly describe gravity as an attractive force between
masses, using the correct vocabulary.</p>
<p>These are not fringe findings. They are the majority conception at the
pre-instructional stage. Any unit on stellar formation that opens with
&ldquo;gravity compresses the gas cloud&rdquo; is speaking to students who mostly do not
believe that gravity can do that to a gas cloud in space.</p>
<hr>
<h2 id="metals-always-existed">Metals Always Existed</h2>
<p>This misconception is my personal favourite because it requires no incorrect
intuition — it requires an absence of information that most people have never
had reason to acquire.</p>
<p>Students and adults who have not encountered stellar nucleosynthesis simply
have no model for where heavy elements come from. Asked directly, a common
response is that metals &ldquo;always existed&rdquo; — they are a feature of the universe,
present from the beginning. The alternative framing: &ldquo;stars create matter from
nothing&rdquo; — which captures the sense that something is being generated, without
a mechanism.</p>
<p>The correct picture: the Big Bang produced primarily hydrogen and helium, with
trace amounts of lithium and beryllium. Every heavier element — including all
the carbon in your body, all the iron in your blood, all the oxygen in every
breath — was synthesised in a stellar interior or in a supernova. The gold in
a wedding ring was produced in a neutron star merger. We are, in the precise
sense of the phrase, made of star stuff: not because stars are somehow
magical, but because the nuclear physics of stellar interiors and violent stellar
deaths is the only process in the universe that can manufacture these elements.</p>
<p>This has a direct implication for stellar evolution education: if students
believe metals always existed, the cycle of stellar death and new star
formation — in which dying stars enrich the interstellar medium with heavy
elements that become part of the next generation of stars and their planets —
loses most of its meaning. The cycle is interesting precisely because it
explains why later-generation stars and their planets have a richer elemental
composition than first-generation stars. Remove that frame and you have
a sequence of events with no cumulative significance.</p>
<hr>
<h2 id="some-language-based-misconceptions-a-brief-digression">Some Language-Based Misconceptions (A Brief Digression)</h2>
<p>Since I promised something about quantum leaps: the phrase &ldquo;quantum leap&rdquo;
in everyday usage means a sudden, large, discontinuous advance. In physics,
a quantum transition is the smallest possible discrete change in a system&rsquo;s
energy state. The electron moves from one energy level to another; the
photon is emitted or absorbed; the scale of change is on the order of
electron-volts. It is, emphatically, not large.</p>
<p>The astronomy version of this class of error:</p>
<p><strong>&ldquo;Light year&rdquo; used as a unit of time.</strong> &ldquo;That happened light years ago.&rdquo;
A light year is the distance light travels in one year — approximately
9.46 × 10¹² kilometres. It is a unit of distance, not time. This one is
so embedded in everyday usage that correcting it usually produces mild
annoyance rather than reconsideration.</p>
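<p>The number itself is a one-line multiplication: the speed of light times
the seconds in a (Julian) year.</p>

```python
C_KM_S = 299_792.458                   # speed of light in km/s
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year

light_year_km = C_KM_S * SECONDS_PER_YEAR
# ≈ 9.46e12 km: a distance, not a time
```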
<p><strong>&ldquo;Shooting stars.&rdquo;</strong> Meteors — small rocky or metallic bodies entering
the atmosphere — have nothing to do with stars. They are typically the
size of a grain of sand to a pebble. The visual resemblance to a moving
point of light crossing the sky is where the name comes from; the
resemblance to stellar physics is zero.</p>
<p><strong>&ldquo;Black holes suck things in.&rdquo;</strong> Black holes do not have more gravity
than the object that formed them at the same distance. If the Sun were
replaced by a black hole of equal mass, the planets would continue on
their current orbits. A black hole is only a black hole within its
Schwarzschild radius; beyond that it is a gravitational field like any
other. What black holes have is a point of no return — the event horizon —
beyond which escape velocity exceeds the speed of light. They do not
actively pull. They are very massive objects that things can fall into.</p>
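<p>The "equal mass, same orbits" claim can be made concrete with the
Schwarzschild radius, r_s = 2GM/c². For one solar mass it comes to about
three kilometres; everywhere outside the former solar surface, nothing
about the gravitational field changes.</p>

```python
G = 6.674e-11     # m³ kg⁻¹ s⁻², gravitational constant
C = 2.998e8       # m/s, speed of light
M_SUN = 1.989e30  # kg

r_s = 2 * G * M_SUN / C ** 2   # Schwarzschild radius
# ≈ 2950 m: the event horizon of a solar-mass black hole.
# Earth, 1.5e11 m away, would feel exactly the same pull.
```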
<p><strong>&ldquo;The dark side of the Moon.&rdquo;</strong> The Moon has a far side (permanently
facing away from Earth, due to tidal locking) and a near side. Both sides
receive approximately equal sunlight over the lunar cycle. The far side
is not permanently dark; it has a day and a night like the near side.
&ldquo;Dark side&rdquo; persists in common usage because Pink Floyd used it as an
album title and nobody wanted to call it &ldquo;The Far Side of the Moon.&rdquo;
(Although Douglas Adams would have had something to say about that.)</p>
<hr>
<h2 id="why-this-list-matters-for-teaching">Why This List Matters for Teaching</h2>
<p>The misconceptions described above are not randomly distributed. They cluster
around three areas where intuitive extrapolation from everyday experience
leads systematically away from the correct physics:</p>
<ol>
<li>
<p><strong>Scale</strong>: human intuition was not built for 150 million kilometres,
let alone 4.6 billion years or the 9.46 × 10¹² km in a light year.
The Sun cannot be fire because fire cannot last 4.6 billion years;
but &ldquo;4.6 billion years&rdquo; is not a number that everyday experience makes
graspable.</p>
</li>
<li>
<p><strong>Energy mechanism</strong>: combustion is the dominant frame for &ldquo;things that
produce heat and light.&rdquo; Nuclear fusion is not part of everyday experience
at any scale. The conceptual distance between them is not factual but
mechanistic — it requires replacing an entire causal model.</p>
</li>
<li>
<p><strong>Gravity</strong>: our direct experience of gravity is of a downward force,
active at Earth&rsquo;s surface, which keeps things from floating away.
The idea of gravity as a universal mutual attraction between all masses
— active in empty space, responsible for cloud collapse and galaxy formation
— is a substantive generalisation that everyday experience does not motivate.</p>
</li>
</ol>
<p>The pedagogical literature&rsquo;s recommendation is not to avoid these topics
but to surface the prior models explicitly before presenting the correct
physics. If you ask students &ldquo;where does the Sun&rsquo;s energy come from?&rdquo; before
you teach nuclear fusion, you learn what they believe and you create the
cognitive conditions for productive conceptual conflict. If you simply present
the fusion model without that step, students add &ldquo;fusion&rdquo; to their vocabulary
while retaining &ldquo;fire&rdquo; in their mental model.</p>
<p>The experiments Alexander Küpper and I have been developing through the
astro-lab project — described in the <a href="/posts/stellar-evolution-diy/">stellar evolution post</a>
and the <a href="/posts/astro-lab-at-home/">astro-lab@home post</a> — are designed
with these specific misconceptions in mind. The net-based gravity experiment
addresses the &ldquo;gravity doesn&rsquo;t work in space&rdquo; and &ldquo;force requires motion&rdquo;
problems directly, by making gravitational attraction between all particles
visible as a material structure. The pressure-temperature experiment makes
the &ldquo;compression heats the gas&rdquo; step concrete before any mention of fusion.</p>
<p>These are not complete solutions to deeply held misconceptions. But they are
a start at building the conceptual scaffolding that makes &ldquo;and then fusion
begins&rdquo; something other than an assertion to be memorised and filed away
without understanding.</p>
<hr>
<h2 id="references">References</h2>
<p>Asghar, A. A., &amp; Libarkin, J. C. (2010). Gravity, magnetism, and &ldquo;down&rdquo;:
Non-physics college students&rsquo; conceptions of gravity. <em>The Science Educator</em>,
19(1), 42–55.</p>
<p>Bailey, J. M., Prather, E. E., Johnson, B., &amp; Slater, T. F. (2009). College
students&rsquo; preinstructional ideas about stars and star formation.
<em>Astronomy Education Review</em>, 8(1).
<a href="https://doi.org/10.3847/AER2009038">https://doi.org/10.3847/AER2009038</a></p>
<p>Bar, V., Brosh, Y., &amp; Sneider, C. (2016). Weight, mass, and gravity:
Threshold concepts in learning science. <em>Science Educator</em>, 25(1), 22–34.</p>
<p>Bransford, J. D., Brown, A. L., &amp; Cocking, R. R. (Eds.) (2000). <em>How People
Learn: Brain, Mind, Experience, and School.</em> National Academy Press.</p>
<p>Favia, A., Comins, N. F., &amp; Thorpe, G. L. (2013). The elements of item
response theory and its framework in analyzing introductory astronomy college
student misconceptions. I. Galaxies. <em>Astronomy Education Review</em>.</p>
<p>Gunstone, R., &amp; Watts, M. (1985). Force and motion. In R. Driver, E. Guesne,
&amp; A. Tiberghien (Eds.), <em>Children&rsquo;s Ideas in Science</em> (pp. 85–104).
Open University Press.</p>
<p>Palmer, D. (2001). Students&rsquo; alternative conceptions and scientifically
acceptable conceptions about gravity. <em>International Journal of Science
Education</em>, 23(7), 691–706.
<a href="https://doi.org/10.1080/09500690010006527">https://doi.org/10.1080/09500690010006527</a></p>
<p>Schecker, H., Wilhelm, T., Hopf, M., &amp; Duit, R. (Eds.) (2018).
<em>Schülervorstellungen und Physikunterricht.</em> Springer.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-14</strong>: Updated the DOI for Bailey et al. (2009) to the correct 10.3847/AER2009038.</li>
<li><strong>2025-09-14</strong>: Changed &ldquo;would extend past Mars&rdquo; to &ldquo;would extend past the asteroid belt&rdquo; for Betelgeuse at ~700 R☉. At ~3.26 AU, Betelgeuse&rsquo;s radius exceeds Mars&rsquo;s orbital distance (1.52 AU) by more than a factor of two and reaches well into the asteroid belt (2.2–3.3 AU).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>What Happens When You Film Student Teachers: ViLLA and the Case for Video in Teacher Education</title>
      <link>https://sebastianspicker.github.io/posts/villa-video-teacher-education/</link>
      <pubDate>Sun, 14 Jun 2020 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/villa-video-teacher-education/</guid>
      <description>ViLLA is an online portal of real classroom videos built for teacher education at the University of Cologne. The idea sounds straightforward. Getting there required filming actual lessons, building infrastructure, surviving a quasi-experiment, and eventually convincing the federal government that this was worth scaling. Some notes on how that went.</description>
      <content:encoded><![CDATA[<p><em>In September 2019 I gave a presentation on the ViLLA project at the ZuS Innovation Workshop
at the University of Cologne together with Daniel Zimmermann. This post is the
blog-friendly version of that presentation — what ViLLA is, why video in teacher
education is not as obvious as it sounds, and what the research actually showed.
The project team at the time: Prof. Dr. Dr. Kai Kaspar, Prof. Dr. Johannes König,
Charlotte Kramer, Marco Rüth, Daniel Zimmermann, Anne van Laak, and myself.</em></p>
<hr>
<h2 id="the-problem-with-learning-to-teach">The Problem With Learning to Teach</h2>
<p>Here is the uncomfortable thing about learning to teach: for the first few years of your
career, you are learning the craft on children. Every class you misread, every
transition you fumble, every moment you lose the room — those are learning experiences,
and the students in the room pay part of the cost.</p>
<p>This is not a new problem, and nobody is pretending it has a clean solution. But it
raises a question that teacher education programmes have been grappling with for a long
time: how much of the relevant learning can happen before the student teacher is standing
alone in front of thirty eleven-year-olds?</p>
<p>One answer — not the only one, but a defensible one — is: more of it, if you give people
good video.</p>
<hr>
<h2 id="what-villa-is">What ViLLA Is</h2>
<p><strong>ViLLA</strong> (Videos in der Lehrerinnen- und Lehrerausbildung — Videos in Teacher Education)
is an online portal of real classroom recordings built for use in teacher education at
the University of Cologne. The idea was to film actual teaching, make the recordings
searchable and pedagogically annotated, and give student teachers access to genuine
classroom situations before they were responsible for managing one themselves.</p>
<p>This sounds straightforward until you try to do it. Filming real classrooms requires
ethical clearance, consent from pupils and parents, cooperation from schools, and a
recording setup that doesn&rsquo;t turn the lesson into a performance. The resulting videos
need to be usable for instruction, which means they need accompanying material:
lesson plans, worksheets, transcripts, annotations by subject-matter specialists.
And then they need to be housed somewhere students can actually find them.</p>
<p>The first phase of ViLLA ran from April 2013 to December 2014, funded by the
University of Cologne&rsquo;s Innovation in Teaching programme. We opened officially on
5 November 2014 with a database of classroom sequences tagged by subject, year group,
school type, and didactic focus. The core intended audience: student teachers,
<em>Referendar*innen</em> (trainee teachers in the practical training phase), and the
university instructors and school-based mentors working with them.</p>
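<p>To make the tagging model concrete, here is a minimal sketch of what a searchable, tagged sequence database looks like. This is illustrative only — the field names and example values are my own, not ViLLA&rsquo;s actual schema:</p>

```python
from dataclasses import dataclass, field


@dataclass
class ClassroomSequence:
    """One filmed classroom sequence with its pedagogical tags."""
    title: str
    subject: str
    year_group: int
    school_type: str
    didactic_focus: list[str] = field(default_factory=list)


def search(sequences, *, subject=None, year_group=None, focus=None):
    """Filter sequences by any combination of tag criteria."""
    results = []
    for seq in sequences:
        if subject is not None and seq.subject != subject:
            continue
        if year_group is not None and seq.year_group != year_group:
            continue
        if focus is not None and focus not in seq.didactic_focus:
            continue
        results.append(seq)
    return results


videos = [
    ClassroomSequence("Fractions intro", "Mathematics", 6, "Gymnasium",
                      ["classroom management", "transitions"]),
    ClassroomSequence("Group reading", "German", 5, "Gesamtschule",
                      ["group work monitoring"]),
]

hits = search(videos, subject="Mathematics", focus="transitions")
print([s.title for s in hits])  # -> ['Fractions intro']
```

<p>The point of the sketch is the combinatorial search: any subset of tags can be combined, which is exactly what lets a student teacher ask for, say, all year-6 mathematics sequences with a classroom-management focus.</p>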
<hr>
<h2 id="what-the-research-showed">What the Research Showed</h2>
<p>The project was not just infrastructure. From the beginning we ran research alongside
the portal development — specifically, quasi-experimental studies on whether and how
video-based instruction actually improves the skills we care about.</p>
<p>The target construct was <strong>situation-specific skills for classroom management</strong> —
the ability to perceive, interpret, and respond to classroom events in real time.
This is a domain where there is reasonable theoretical agreement that expert teachers
differ from novices not primarily in declarative knowledge (knowing that you should
address disruptions early) but in perception and response speed (actually noticing
the early signs and acting on them).</p>
<p>The key finding from the ViLLA studies: <strong>combining video with transcripts was more
effective than control seminars that used neither</strong>. Students who worked with video
and transcript material showed better development of situation-specific classroom
management skills than comparison groups. The effect was not enormous, but it was
there, it replicated, and it was large enough to justify the infrastructure investment.</p>
<p>The transcript component is worth highlighting because it&rsquo;s not obvious. You might
expect that video alone would be sufficient — you are showing people real teaching.
But the transcript creates an additional layer of perceptual access: you can pause
on a moment, read back exactly what was said, annotate, compare your reading of the
situation with a peer&rsquo;s. The multimodal combination seems to do something that either
medium alone does not.</p>
<hr>
<h2 id="villa-20-scaling-up">ViLLA 2.0: Scaling Up</h2>
<p>By 2015, ViLLA had grown into a second development phase. In November 2016 it received
federal funding through the BMBF&rsquo;s <em>Qualitätsoffensive Lehrerbildung</em> (Quality Initiative
for Teacher Education), embedded in the University of Cologne&rsquo;s
<em>Zukunftsstrategie Lehrer*innenbildung</em> (ZuS) umbrella project.</p>
<p>The scale change was significant: by the time of the 2019 presentation the database
held <strong>185 videos</strong>, covering more subjects, more school types, and more
outside-school teaching and learning scenarios than the original portal had included.
The self-learning modules — originally an add-on — became a central feature.</p>
<p>Two types of modules emerged from practice:</p>
<p><strong>Case-based modules</strong> built around a specific filmed sequence, asking the learner
to work through what they observe, what decisions the teacher made, and what they
would do differently. These are close to case-based reasoning as used in medical
education — the video is the case.</p>
<p><strong>Theme-centred modules</strong> organised around a pedagogical concept (classroom
transitions, group work monitoring, handling disruptions) and drawing on multiple
video examples to illustrate the same phenomenon across different contexts. The
goal is pattern recognition — not learning what to do in <em>this</em> lesson, but
developing a schema that transfers to next year&rsquo;s class in a different school.</p>
<hr>
<h2 id="the-meta-portal-and-what-it-means">The Meta-Portal and What It Means</h2>
<p>One development I am particularly interested in from a research infrastructure
perspective: ViLLA&rsquo;s integration into <strong>unterrichtsvideos.net</strong>, a meta-portal
that aggregates classroom video collections from universities across Germany.</p>
<p>The single-portal model has an obvious limitation: your institution&rsquo;s videos
reflect your institution&rsquo;s context. The schools you filmed, the subject specialists
on your team, the pedagogical questions your programme emphasises. Aggregation
across portals means a student teacher in Cologne can access video collected at
Münster or Berlin, search across the combined database by year group and subject,
and get access without separate registration at each institution.</p>
<p>This matters for research too. A shared infrastructure with standardised tagging
creates the conditions for cross-institutional studies. You can ask whether the
same video material works differently in different programme contexts, or whether
different annotation frameworks lead to different learning outcomes. The portal is
also, then, a methodology — a way of generating comparable data.</p>
<hr>
<h2 id="what-i-think-is-actually-interesting-here">What I Think Is Actually Interesting Here</h2>
<p>I should be honest about where my personal research interest sits in all of this,
because it is not primarily in the technology.</p>
<p>The thing that I find genuinely interesting about the ViLLA project is the implicit
theory of professional learning it rests on. We filmed real lessons — not idealised
demonstrations, not training videos produced for the purpose, but actual classroom
teaching with the roughness and contingency that implies. We then gave those videos
to student teachers and asked them to look carefully.</p>
<p>The assumption is that professional perception can be educated. That what distinguishes
a competent teacher from a novice is not just accumulated experience but the capacity
to read situations quickly and accurately — and that this capacity can be developed
through structured encounter with material before you are responsible for it.</p>
<p>This is an empirical claim and we have evidence for it. But it also connects to
broader questions about expertise, perception, and what it means to prepare someone
for a practice-based profession. Medical education has been working on these
questions through simulation and case-based learning for decades. Teacher education
is, in many institutions, still catching up.</p>
<p>ViLLA is one attempt to close that gap. Whether it is the right attempt, in its
current form, is something I am still working out. But the question it is trying to
answer — what do you need to have seen, and thought about, before you can teach
well — seems to me like one of the important ones.</p>
<hr>
<h2 id="where-this-is-going">Where This Is Going</h2>
<p>Two strands that were live at the time of the 2019 presentation and that I will
return to in later posts:</p>
<p>The <strong>ProvidiS</strong> project (Förderung der professionellen Wahrnehmung in digitalen,
videobasierten Selbstlernmodulen — Promoting Professional Perception in Digital,
Video-Based Self-Learning Modules), a follow-on BMBF project in cooperation with
the Universities of Münster and FU Berlin, which moves from infrastructure to
targeted intervention design. The question shifts from &ldquo;does video work?&rdquo; to
&ldquo;which features of video-based learning design produce which effects on professional
perception, for which learners?&rdquo;</p>
<p>And a methodological strand I have become increasingly interested in: <strong>the
videography setting itself as a research question</strong>. How you film a lesson — camera
placement, editing conventions, what gets cut — shapes what the viewer can perceive.
The transcript does something similar. These are not neutral mediations. They are
constructions, and the choices made in constructing them have downstream effects on
what student teachers learn to see. This connects to questions I have been thinking
about in qualitative methodology more broadly — which I will probably end up writing
about separately.</p>
<hr>
<h2 id="references">References</h2>
<p>König, J., Blömeke, S., Klein, P., Suhl, U., Busse, A., &amp; Kaiser, G. (2014).
Is teachers&rsquo; general pedagogical knowledge a premise for noticing and interpreting
classroom situations? <em>Teaching and Teacher Education</em>, 38, 76–88.
<a href="https://doi.org/10.1016/j.tate.2013.11.004">https://doi.org/10.1016/j.tate.2013.11.004</a></p>
<p>Kramer, C., König, J., Strauß, S., &amp; Kaspar, K. (2020). Classroom videos or
transcripts? A quasi-experimental study to assess the effects of media-based
learning on pre-service teachers&rsquo; situation-specific skills of classroom
management. <em>International Journal of Educational Research</em>, 103, 101624.
<a href="https://doi.org/10.1016/j.ijer.2020.101624">https://doi.org/10.1016/j.ijer.2020.101624</a></p>
<p>Sherin, M. G. (2007). The development of teachers&rsquo; professional vision in video
clubs. In R. Goldman, R. Pea, B. Barron, &amp; S. J. Derry (Eds.),
<em>Video Research in the Learning Sciences</em> (pp. 383–395). Lawrence Erlbaum.</p>
<p>van Es, E. A., &amp; Sherin, M. G. (2002). Learning to notice: Scaffolding new teachers&rsquo;
interpretations of classroom interactions. <em>Journal of Technology and Teacher
Education</em>, 10(4), 571–596.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Hello World — What Is This Blog?</title>
      <link>https://sebastianspicker.github.io/posts/hello-world/</link>
      <pubDate>Wed, 22 Jan 2020 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/hello-world/</guid>
      <description>An introduction to this blog: scientific ideas too lazy to submit, peer review openly invited, criticism will be posted.</description>
      <content:encoded><![CDATA[<h2 id="why-does-this-exist">Why does this exist?</h2>
<p>I have a folder. It is full of half-finished ideas, speculative derivations, and results that are <em>probably</em> interesting but will <em>definitely</em> never make it through formal peer review — at least not at the current level of polish.</p>
<p>So instead of letting them rot, I&rsquo;m posting them here.</p>
<p>The format is loose. The rigor is variable. The pinky promise is firm:</p>
<blockquote>
<p><strong>Peer review is welcome. All criticism will be posted alongside the original entry.</strong></p>
</blockquote>
<p>If you find an error, a flawed assumption, or a better framing — open an <a href="https://github.com/sebastianspicker/sebastianspicker.github.io/issues">issue on GitHub</a>. I will read it, respond to it, and append it to the post.</p>
<hr>
<h2 id="what-to-expect">What to expect</h2>
<p>Posts will look roughly like this:</p>
<ol>
<li><strong>An idea</strong> — usually the kind that arrives at 11 pm and seems very important.</li>
<li><strong>Some argument</strong> — math, code, or prose, depending on what makes sense.</li>
<li><strong>Honest limitations</strong> — what would need to be true for this to actually hold up.</li>
<li><strong>Open questions</strong> — what I don&rsquo;t know and am not going to pretend I do.</li>
</ol>
<hr>
<h2 id="a-taste-of-the-math-rendering">A taste of the math rendering</h2>
<p>Since this blog covers scientific content, equations should work. Here&rsquo;s a sanity check:</p>
<p>Inline math: the Gaussian integral \( \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \) is a classic.</p>
<p>Display math:</p>
$$
\sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots = \frac{\pi}{4}
$$<p>And a block with <code>\[...\]</code> delimiters:</p>
\[
  \mathcal{F}\{f\}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx
\]<p>If those rendered correctly, we&rsquo;re in business.</p>
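<p>And while we&rsquo;re sanity-checking: the Leibniz series above is famous for converging painfully slowly — the error after \( n \) terms shrinks only like \( 1/n \). A few partial sums make that obvious (a quick numerical check, nothing more):</p>

```python
import math


def leibniz_partial(n_terms: int) -> float:
    """Partial sum of the Leibniz series for pi/4."""
    return sum((-1) ** n / (2 * n + 1) for n in range(n_terms))


# Four times the partial sum approaches pi, but slowly:
for n in (10, 1_000, 100_000):
    approx = 4 * leibniz_partial(n)
    print(f"{n:>7} terms: {approx:.6f} (error {abs(approx - math.pi):.2e})")
```

<p>Ten terms still miss \( \pi \) in the first decimal place; a hundred thousand terms buy you only about five correct digits. Good thing the Monte Carlo estimator below doesn&rsquo;t have to compete on efficiency either.</p>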
<hr>
<h2 id="a-taste-of-code">A taste of code</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">estimate_pi</span><span class="p">(</span><span class="n">n_samples</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Monte Carlo estimation of pi.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">inside</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">y</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mf">1.0</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">inside</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;π ≈ </span><span class="si">{</span><span class="n">estimate_pi</span><span class="p">(</span><span class="mi">10_000_000</span><span class="p">)</span><span class="si">:</span><span class="s2">.6f</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><hr>
<p>See you in the next post.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
