<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Reproducibility on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/reproducibility/</link>
    <description>Recent content in Reproducibility on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 04 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/reproducibility/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months before this writing, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the argmax token sequence. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>The variance does not merely add — it propagates through the dependencies.
Early outputs condition later prompts; a small deviation at step 2 can
shift the trajectory of step 5 substantially. This is not a theoretical
concern. It is the practical experience of anyone who has tried to reproduce
a multi-step agent run.</p>
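<p>A toy simulation makes the compounding concrete. Everything below is an illustrative stand-in, not the framework itself: a scalar function plays the role of $T$, Gaussian noise plays the role of sampling variance, and a gain factor above 1 models the way a deviated early output shifts the prompts that follow it.</p>

```python
import random
import statistics

def tool_call(prompt: float, noise: float = 0.05) -> float:
    """Scalar stand-in for one stochastic tool call T: output = prompt + noise."""
    return prompt + random.gauss(0, noise)

def chained_run(k: int, gain: float = 1.5) -> float:
    """k sequential calls; each prompt is built from the previous output,
    so an early deviation is amplified downstream (gain > 1)."""
    o = 0.0
    for _ in range(k):
        o = tool_call(gain * o)
    return o

random.seed(0)
single = [tool_call(0.0) for _ in range(10_000)]
chained = [chained_run(5) for _ in range(10_000)]

var_single = statistics.variance(single)    # about noise**2 = 0.0025
var_additive = 5 * var_single               # what "variance merely adds" would predict
var_chained = statistics.variance(chained)  # far larger: deviations compound
```

<p>With a gain above 1, the five-call chain ends up with roughly an order of magnitude more variance than the additive prediction; that gap is the cost of conditioning later calls on earlier outputs.</p>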
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
at most one position (a failed attempt changes only the attempt counter,
leaving the status vector untouched). The loop continues until no open
stories remain or a story limit $N$ is reached.</p>
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
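<p>The transition function is small enough to sketch in full. This is an illustrative Python reimplementation, not the Bash runner: <code>execute</code> and <code>accept</code> are injected so the harness itself stays deterministic, and a temp-file-plus-rename write stands in for the lock-protected atomic commit.</p>

```python
import json
import os
import tempfile

A_MAX = 3  # attempt bound per story: guarantees termination in at most n * A_MAX steps

def select_story(stories):
    """Select: the lowest-index open story. Deterministic, because the PRD
    ordering (priority, then ID) is fixed before the run starts."""
    return next((i for i, s in enumerate(stories) if s["status"] == "open"), None)

def commit(stories, state_path):
    """Commit: write the status vector atomically (temp file + rename), a
    stand-in for the lock-protected write in the Bash runner."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(state_path))
    with os.fdopen(fd, "w") as f:
        json.dump(stories, f)
    os.replace(tmp, state_path)  # atomic on POSIX filesystems

def run_loop(stories, execute, accept, state_path):
    steps = 0
    while (i := select_story(stories)) is not None:
        story = stories[i]
        output = execute(story)            # Execute: exactly one tool call
        if accept(output, story):          # Accept: deterministic criterion
            story["status"] = "passing"
        else:
            story["attempts"] = story.get("attempts", 0) + 1
            if story["attempts"] >= A_MAX:
                story["status"] = "skipped"  # bounded retries, so the loop ends
        commit(stories, state_path)
        steps += 1
    return steps

# Demo: two trivially passing stories close in exactly n = 2 steps
demo = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
n_steps = run_loop(demo, execute=lambda s: "ok", accept=lambda o, s: True,
                   state_path=os.path.join(tempfile.mkdtemp(), "state.json"))
```

<p>A story that keeps failing is closed as <code>skipped</code> after <code>A_MAX</code> attempts, which is exactly where the $n \cdot A_{\max}$ termination bound comes from.</p>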
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
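<p>Replay itself can be sketched as a pure fold. The sketch simplifies the real loop (a failed attempt is marked skipped immediately rather than retried, to keep it short), but it shows the property that matters: given the same queue and the same recorded outputs, the trajectory is identical on every replay.</p>

```python
def replay(queue, logged_outputs, accept):
    """Fold recorded tool outputs (as captured in .runtime/events.log) over
    the story queue. A pure function of Q and the outputs: no hidden state."""
    statuses = ["open"] * len(queue)
    trajectory = []
    for output in logged_outputs:
        i = next((j for j, st in enumerate(statuses) if st == "open"), None)
        if i is None:
            break  # no open stories left
        statuses[i] = "passing" if accept(output, queue[i]) else "skipped"
        trajectory.append(tuple(statuses))
    return trajectory

accepts = lambda o, s: o.startswith("Created ")
log = ["Created reports/S1.md", "tool error"]
t1 = replay(["S1", "S2"], log, accepts)
t2 = replay(["S1", "S2"], log, accepts)  # identical trajectory, by construction
```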
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. After five tool calls,
the prompt for the fifth includes four previous outputs — each of which
varied slightly from the last run. The fifth output is now conditioned on
a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
<p>$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] \;+\; \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$</p>
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
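<p>The prompt builder $\pi$ is the crux of the collapse, and it is worth seeing how little it does. A sketch, with hypothetical story fields (<code>id</code>, <code>title</code>, <code>task</code>, and <code>report_path</code> are illustrative names, not the actual PRD schema):</p>

```python
def build_prompt(story: dict, policy_text: str) -> str:
    """pi(s, CODEX.md): a pure function of the story and the static policy.
    No previous tool output appears anywhere, so the single call for this
    story is conditioned identically on every run."""
    return (
        policy_text.strip()
        + f"\n\n## Story {story['id']}: {story['title']}\n"
        + story["task"].strip()
        + f"\nAcceptance: Created {story['report_path']}\n"
    )

story = {"id": "S1", "title": "Audit the auth module",
         "task": "Review auth.sh for unquoted variable expansions.",
         "report_path": "reports/S1.md"}
p1 = build_prompt(story, "Policy: exactly one tool call. Stay in scope.")
p2 = build_prompt(story, "Policy: exactly one tool call. Stay in scope.")
# p1 == p2: with no dependence on prior outputs, the prompt cannot drift between runs
```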
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \,\triangle\, F_{\text{before}}$ (the
symmetric difference, so that deletions are caught as well as creations and
modifications) must satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the filesystem level, after the fact, and that
enforcement cannot be argued with.</p>
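<p>The snapshot-and-diff check is a few lines in any language. A Python sketch of the idea (the actual runner does this in Bash; the hashing and path conventions here are illustrative):</p>

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(repo: Path) -> dict:
    """F: map every file under the repo to a hash of its contents."""
    return {
        str(p.relative_to(repo)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo.rglob("*")) if p.is_file()
    }

def scope_violations(before: dict, after: dict, scope: list) -> set:
    """Files created, deleted, or modified whose path escapes the story's scope[]."""
    changed = {f for f in before.keys() | after.keys() if before.get(f) != after.get(f)}
    def in_scope(f: str) -> bool:
        return any(f == s or f.startswith(s.rstrip("/") + "/") for s in scope)
    return {f for f in changed if not in_scope(f)}

# Demo: the "helpful tidy-up" outside scope is caught after the fact
repo = Path(tempfile.mkdtemp())
(repo / "src").mkdir()
(repo / "src" / "a.sh").write_text("echo a\n")
(repo / "README.md").write_text("readme\n")
before = snapshot(repo)
(repo / "src" / "a.sh").write_text("echo fixed\n")  # in scope
(repo / "README.md").write_text("tidied\n")         # out of scope
violations = scope_violations(before, snapshot(repo), scope=["src"])
```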
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
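<p>The whole acceptance layer fits in a few deterministic lines. A sketch (the <code>Created &lt;path&gt;</code> line format comes from the stories; the parsing details here are illustrative):</p>

```python
import re
import tempfile
from pathlib import Path

CREATED = re.compile(r"^Created (\S+)\s*$", re.MULTILINE)

def accept(tool_output: str, repo: Path) -> bool:
    """alpha(o): the output must declare `Created <path>` and that file must
    actually exist. A filesystem fact, checkable without a second model call."""
    m = CREATED.search(tool_output)
    return m is not None and (repo / m.group(1)).is_file()

repo = Path(tempfile.mkdtemp())
(repo / "report.md").write_text("# Findings\n")
ok = accept("Audit complete.\nCreated report.md\n", repo)  # file exists
phantom = accept("Created missing.md", repo)               # claimed but absent
```

<p>The second call is the interesting one: an agent that merely claims success without producing the file fails the check, because the criterion is grounded in the filesystem rather than in the model's own report.</p>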
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>LK-99: Six Weeks That Showed How Physics Works</title>
      <link>https://sebastianspicker.github.io/posts/lk99-preprint-physics-sociology/</link>
      <pubDate>Mon, 09 Oct 2023 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/lk99-preprint-physics-sociology/</guid>
      <description>On July 22, 2023, a Korean preprint claimed that LK-99 — a copper-doped lead apatite — was a room-temperature, ambient-pressure superconductor. Within six weeks, the claim was definitively falsified. The episode is more interesting for what it revealed about the sociology of science than for the compound itself: how a global community self-corrected at extraordinary speed, and how the media managed to fail at conveying uncertainty despite watching it happen in real time.</description>
      <content:encoded><![CDATA[<h2 id="july-22-2023">July 22, 2023</h2>
<p>On a Saturday morning in late July 2023, two preprints appeared on arXiv. They were submitted by researchers affiliated with the Quantum Energy Research Centre in Seoul — Sukbae Lee, Ji-Hoon Kim, and colleagues — and they claimed something that condensed matter physicists have been chasing for over a century: a material that superconducts at room temperature and ambient pressure.</p>
<p>The compound was called LK-99. It was a copper-doped lead apatite, synthesized from common precursors using a procedure that, on paper, any moderately equipped laboratory could attempt. The claimed critical temperature was above 400 K — well above 293 K, which is room temperature, which is roughly the temperature of a warm afternoon in Seoul in July.</p>
<p>A video circulated almost immediately. A small, grey, irregular piece of LK-99 appeared to be partially levitating — tilting up, one end raised — above a permanent neodymium magnet. In the video it wobbles slightly, like something caught between gravity and an invisible hand.</p>
<p>Physics Twitter — I will use that name; it was still recognizably that in July 2023 — detonated. Within 72 hours, laboratories across the world were racing to synthesize LK-99. Discord servers formed. GitHub repositories appeared with shared synthesis protocols. Preprints from independent groups began accumulating before the original authors had likely had a good night&rsquo;s sleep.</p>
<p>Six weeks later, the claim was dead.</p>
<p>I want to write about what happened in those six weeks, because I think the episode is more interesting as sociology of science than as condensed matter physics. LK-99 turned out to be a modest semiconductor with a ferromagnetic impurity. But the speed and the manner of that determination — the way a globally distributed community of physicists organized itself, shared data in real time, converged on a falsification, and then moved on — that is genuinely remarkable, and worth examining carefully.</p>
<h2 id="why-room-temperature-superconductivity-is-the-grail">Why Room-Temperature Superconductivity Is the Grail</h2>
<p>Let me be precise about why this particular claim generates the response it does.</p>
<p>Superconductivity is the phenomenon in which certain materials, below a critical temperature T<sub>c</sub>, carry electrical current with exactly zero resistance. Not very low resistance — zero. A current established in a superconducting loop will, in principle, continue flowing indefinitely without any driving voltage. This is not a small quantitative improvement over ordinary conductors; it is a qualitatively different regime of physics.</p>
<p>The trouble is that essentially all known superconductors require extreme cooling. Conventional metallic superconductors — the ones Heike Kamerlingh Onnes discovered in mercury in 1911 — superconduct only below roughly 20 to 30 K, and operating them in practice requires liquid helium cooling, which is expensive, logistically demanding, and entirely impractical for large-scale applications. The discovery of high-temperature cuprate superconductors in 1986 (Bednorz and Müller, Nobel Prize 1987) was genuinely revolutionary: some cuprates superconduct up to about 138 K. But 138 K is still −135°C. It requires liquid nitrogen cooling, which is cheaper than liquid helium but still not something you install in a power grid without substantial infrastructure.</p>
<p>The best-supported results beyond the cuprates come from hydrogen-rich compounds under extreme pressure: lanthanum hydride superconducts near 250 K, but requires on the order of 170 GPa. For context, the pressure at the center of the Earth is about 360 GPa. You cannot run a power cable through a diamond anvil cell.</p>
<p>Room-temperature, ambient-pressure superconductivity would be transformative in a way that very few material discoveries are. Electrical grids currently lose somewhere between 5 and 10 percent of all transmitted energy to resistive heating — a staggering quantity of energy, simply dissipated as heat in cables. Zero-resistance transmission would eliminate that loss. Magnetically levitated transport would become feasible without the cryogenic infrastructure that makes current Maglev systems enormously expensive to build and maintain. Compact, affordable MRI machines would become possible. Effects on computing, on energy storage, on medical technology — the list runs long. It would be one of the most consequential material discoveries in the history of technology.</p>
<p>This is why the response to the LK-99 preprints was not hysteria but rather the entirely rational behavior of a community that understood exactly what was at stake if the claim were true.</p>
<h2 id="what-lk-99-was-and-what-it-claimed">What LK-99 Was and What It Claimed</h2>
<p>LK-99 is chemically expressed as Pb₁₀₋ₓCuₓ(PO₄)₆O, where x is approximately 0.9 to 1.1. It is a lead apatite — the same crystal family as the mineral in tooth enamel — with a fraction of the lead atoms replaced by copper.</p>
<p>The proposed mechanism, as sketched in the preprints, involved Cu²⁺ substituting for Pb²⁺. Because copper has a slightly smaller ionic radius than lead, this substitution induces a local structural distortion. The claim was that this distortion produces a flat electronic band at the Fermi level — and flat bands are associated with strong electronic correlations that can, in principle, give rise to unconventional superconductivity. The analogy to twisted bilayer graphene was implicit in the discussion, though the mechanism is quite different and twisted bilayer graphene superconducts only well below 1 K.</p>
<p>Reading the preprints in late July 2023 was, I confess, a slightly uncomfortable experience. The writing was rushed. The two preprints — submitted by different author subsets from the same group — were internally inconsistent in places. The resistance measurements showed a large drop with temperature, but not zero resistance. The synthesis protocol was described in enough detail to be reproducible, which was good, but the characterization was incomplete in ways that mattered.</p>
<p>Red flags were present from the beginning, and many physicists noted them immediately. The levitation video showed a piece of LK-99 that was tilted and wobbling — not the stable, complete expulsion of magnetic flux you would expect from a true Meissner effect. A perfect superconductor placed above a magnet would levitate horizontally and stably. This piece was doing something, but the something was not obviously Meissner levitation.</p>
<p>And yet. The synthesis was simple. The claim was specific and testable. If there was even a small chance it was real, the imperative to check was overwhelming. So labs checked.</p>
<h2 id="the-replication-wave">The Replication Wave</h2>
<p>What happened over the following weeks was, as far as I am aware, unprecedented in condensed matter physics.</p>
<p>Normally, a replication in physics looks like this: a group reads a paper, decides it is interesting enough to attempt, orders precursor materials, synthesizes the compound (which takes weeks to months), characterizes it with appropriate instruments (more weeks), writes up the results, submits them (more weeks), and eventually publishes — often six months to a year after the original claim, sometimes much longer. The feedback cycle is slow by design: slowness is a feature, not a bug, because it allows careful work rather than hasty work.</p>
<p>The LK-99 replication did not look like this.</p>
<p>Within a week, preprints from independent groups — China, India, the United States, Germany — were appearing on arXiv. Discord servers with hundreds of members were organizing synthesis attempts in real time, sharing thermograms, resistance measurements, and microscope images as they came off instruments. Twitter threads tracked emerging results with the urgency of a live event. A GitHub repository maintained by the community accumulated synthesis protocols, shared data files, and links to new preprints as they appeared.</p>
<p>Some groups reported partial levitation. Others reported anomalous resistance drops. Others — starting almost immediately — reported synthesizing the material and finding nothing unusual at all.</p>
<p>The speed of this was extraordinary not because of any particular organizational effort, but because the incentive structure happened to align with the infrastructure that now exists. Preprints made sharing immediate. Social media made results public the moment they existed. The synthesis was simple enough to attempt in any reasonably equipped solid-state chemistry lab. And the motivation — the prize, if it were real — was enormous. You would not need to tell anyone to work on this. You would have to tell people to stop.</p>
<p>By mid-August 2023 — three weeks after the original preprints — the key debunking papers had appeared. By late August, there was no serious scientific debate remaining.</p>
<h2 id="the-mechanism-of-falsification">The Mechanism of Falsification</h2>
<p>The levitating video was explained first, and the explanation is both mundane and instructive.</p>
<p>The LK-99 synthesis produces, as an essentially unavoidable impurity, copper sulfide — Cu₂S. Copper sulfide is interesting in its own right: it undergoes a structural phase transition at roughly 105°C (378 K) from a low-temperature chalcocite form to a high-temperature superionic conductor. This transition is accompanied by a large, sharp drop in electrical resistance — exactly the kind of anomalous feature that, in a sample of mixed composition, might be misidentified as a superconducting transition.</p>
<p>More importantly for the levitation: the LK-99 synthesis products ubiquitously contain ferromagnetic impurity phases. A ferromagnetic material will interact with a permanent magnet. Partial levitation, tilted and unstable, is entirely consistent with a ferromagnetic-diamagnetic competition — not with the Meissner effect.</p>
<p>Several groups published debunking papers in rapid succession. Kumar and colleagues (<a href="#ref-Senapati2023">Kumar et al., 2023</a>) reported the absence of superconductivity in LK-99 samples; other groups synthesized Cu₂S independently, confirmed its resistance anomaly near 380 K, and showed quantitatively that the LK-99 observations were fully consistent with Cu₂S contamination and ferromagnetic impurities. Liu and Meng (<a href="#ref-LiuMeng2023">Liu &amp; Meng, 2023</a>) provided a complementary symmetry analysis explaining why the structural distortion mechanism did not actually predict superconductivity.</p>
<p>Several Chinese groups with high-quality synthesis capabilities — and, frankly, strong motivation to find a positive result — produced very pure LK-99 samples and found what you would expect of a clean lead apatite: a semiconductor with modest diamagnetism. Nothing anomalous. When you removed the Cu₂S impurity, you removed the anomaly.</p>
<p>Daniel Garisto summarized the consensus in a <em>Nature</em> news piece in August 2023 (<a href="#ref-Garisto2023">Garisto, 2023</a>): LK-99 is not a superconductor. The case was closed, with an efficiency that the scientific community should be proud of.</p>
<h2 id="a-useful-contrast-ranga-dias">A Useful Contrast: Ranga Dias</h2>
<p>The LK-99 episode does not exist in isolation. The preceding years had seen other extraordinary claims of room-temperature or near-room-temperature superconductivity, and the most prominent involved Ranga Dias at the University of Rochester.</p>
<p>Dias published two papers in <em>Nature</em> claiming superconductivity at or near room temperature: one in 2020, describing carbonaceous sulfur hydride at roughly 15°C under 267 GPa (<a href="#ref-Snider2020">Snider et al., 2020</a>), and one in 2023, describing nitrogen-doped lutetium hydride under much lower pressure. An earlier Dias and Silvera <em>Science</em> paper on metallic hydrogen (<a href="#ref-DiasSilvera2017">Dias &amp; Silvera, 2017</a>) had already received a significant erratum and been widely questioned; a pattern was forming. Both <em>Nature</em> papers were eventually retracted — the 2020 paper in 2022, the 2023 paper in November 2023 — amid serious and credible allegations of data manipulation. The criticisms included statistical anomalies in background signals, apparent image duplication across different experimental conditions, and raw data that did not match the published figures. Hirsch, who had been following these claims closely, documented many of the irregularities (<a href="#ref-Hirsch2021">Hirsch, 2021</a>).</p>
<p>The contrast with LK-99 is worth sitting with. The Korean team appears to have been guilty of honest overreach: genuine excitement about anomalous observations, insufficient characterization before posting, motivated interpretation of ambiguous data. This happens in science. Extraordinary rewards for being right create extraordinary pressure to believe you are right. The LK-99 researchers may have seen something they genuinely could not explain and convinced themselves it was what they hoped it was.</p>
<p>The Dias case, if the allegations of data manipulation are accurate — and the retractions, and the University of Rochester investigation that followed, suggest they have merit — is something different: not motivated misinterpretation but deliberate fabrication. The scientific outcomes are superficially similar: both sets of claims were false, both caused the community to expend significant effort on falsification, both damaged the credibility of the field. But the causes, and the appropriate institutional and moral responses, differ substantially.</p>
<p>How do you tell them apart in real time? In both cases, you had extraordinary claims that passed initial peer review at prestigious venues. In both cases, independent replication failed. The LK-99 falsification came faster, partly because the synthesis was simpler and partly because the community mobilized more broadly. The Dias case took years, and the data manipulation allegations required access to raw data that the research group was slow to provide.</p>
<p>I do not have a clean answer. The difference in mechanism — honest error versus alleged fraud — is not directly observable from the outside. What you can observe is willingness to share data, consistency of results across different instruments and laboratories, and whether the research group facilitates or obstructs independent verification. On those criteria, the LK-99 group and the Dias group look quite different.</p>
<h2 id="the-sociology-of-what-happened">The Sociology of What Happened</h2>
<p>Let me step back from the physics and say something about what the LK-99 episode reveals about how science actually functions.</p>
<p>The first thing it reveals is that community self-correction works, and now works at extraordinary speed when the incentive is high enough. The coordinated global replication was not organized by any institution, any journal, any funding body. It emerged spontaneously from a community that understood what was at stake and had the tools — preprint servers, social media, Discord, GitHub — to coordinate without central direction. The result was a falsification that, in a previous era, might have taken two to five years, completed in six weeks. That is remarkable.</p>
<p>The second thing it reveals is that the preprint revolution is real and consequential. The LK-99 preprints bypassed traditional peer review entirely. That could be bad — and in principle, a false claim could propagate further and faster without peer review as a gate. In practice, in this case, removing the gate allowed not just the false claim but its falsification to move at the same speed. Peer review, as it is normally practiced, is too slow to respond to a claim like this on a timescale that matters. The community replaced it with something faster: immediate, distributed, adversarial review by people with direct experimental access to the question.</p>
<p>This is not an argument against peer review. It is an argument that peer review in the traditional sense — two or three reviewers reading a manuscript over a few weeks — is not the only form that meaningful scientific scrutiny takes.</p>
<p>The third thing the episode reveals is that social media&rsquo;s role in science communication is deeply ambivalent. Twitter accelerated the spread of both the original claim and the debunking. The community of physicists on Twitter was, on the whole, appropriately skeptical from the first day — I saw many threads on July 22 and 23 that noted the red flags I mentioned above: the tilted levitation, the non-zero resistance, the inconsistencies between the two preprints. But that skepticism was invisible to most science journalists, who were looking at the same videos and preprints and reading the excitement rather than the caveats.</p>
<h2 id="the-media-and-the-calibration-problem">The Media, and the Calibration Problem</h2>
<p>I want to be specific about the media failure, because I think it matters.</p>
<p>The appropriate headline on July 23, 2023 was something like: &ldquo;Korean researchers post preprints claiming room-temperature superconductivity; claim is extraordinary and unverified; replication underway.&rdquo; That headline is accurate. It conveys the genuine excitement — because the claim, if true, would be extraordinary — while conveying the appropriate uncertainty about an unverified preprint from a single group.</p>
<p>The headlines that actually appeared, across outlets that should know better, included &ldquo;Room-temperature superconductor discovered&rdquo; and &ldquo;Scientists may have created the holy grail of energy.&rdquo; These are not accurate. They convey neither the uncertainty nor the specific nature of the claim. They treat a preprint as a discovery.</p>
<p>This is a calibration failure — the same kind of failure I have written about in other contexts. On this blog, I have discussed how LLMs can fail catastrophically when they lack the context to assess whether their confident-sounding output is grounded in anything real (<a href="/posts/car-wash-grounding/">see the car-wash post</a>, and more generally the discussion of context and grounding in <a href="/posts/more-context-not-always-better/">more context is not always better</a>). The mechanism in journalism is different but the structure is the same: confidence that is not appropriately calibrated to evidence.</p>
<p>The Bayesian structure of the situation was, or should have been, clear. The prior probability of a room-temperature, ambient-pressure superconductor being found in any given week is very small — not because room-temperature superconductors are impossible, but because such discoveries do not happen often and many previous claims have failed. Call that prior probability low. Against that prior, what evidence did we have on July 23? A video showing partial, unstable levitation — which, as I noted, is not what Meissner levitation looks like. Two rushed preprints that disagreed with each other in some details. No independent replication. P(levitation video | not a superconductor) was not particularly small, as the Cu₂S explanation would later demonstrate. So the posterior probability that LK-99 was a room-temperature superconductor, given the evidence available on July 23, was not meaningfully higher than the prior — which was low.</p>
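<p>The update described above can be made concrete with a minimal sketch. Every number below is an assumption chosen purely for illustration, not a measurement; the point is only that a weakly diagnostic piece of evidence barely moves a small prior.</p>
<pre><code class="language-python"># Illustrative Bayesian update for the LK-99 evidence as of July 23, 2023.
# All probabilities here are assumed values for the sake of the sketch.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: P(H|E) = P(E|H)P(H) / [P(E|H)P(H) + P(E|~H)P(~H)]."""
    numerator = p_e_given_h * prior
    denominator = numerator + p_e_given_not_h * (1 - prior)
    return numerator / denominator

# Assumed prior: a genuine room-temperature, ambient-pressure superconductor
# being announced in any given week is a rare event.
prior = 1e-4

# A partial-levitation video is very likely if the claim is true...
p_video_given_sc = 0.9
# ...but also not unlikely under mundane explanations (diamagnetism,
# ferromagnetic impurity phases), so the evidence is weakly diagnostic.
p_video_given_not_sc = 0.3

post = posterior(prior, p_video_given_sc, p_video_given_not_sc)
print(f"posterior = {post:.6f}")  # roughly 3x the prior, still tiny
</code></pre>
<p>With these assumed numbers the posterior is about 0.0003: three times the prior, but nowhere near the near-certainty the headlines implied.</p>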
<p>A well-calibrated science journalist would not have written &ldquo;Room-temperature superconductor discovered.&rdquo; A well-calibrated scientist — and many of them said exactly this — would have written &ldquo;interesting claim, requires replication, maintain high skepticism.&rdquo; The scientific community was, on the whole, well-calibrated. The journalism was not.</p>
<p>This is not a new observation. Science journalists have been criticized for overclaiming since there have been science journalists. But the LK-99 episode is a particularly clean example because the timescale was so short: the calibration failure in the media and the calibration success in the scientific community happened simultaneously, in full public view, and could be compared directly.</p>
<p>I write occasionally about AI systems and their tendency to produce confident outputs that are not grounded in evidence — a form of miscalibration that is particularly dangerous because the confident tone is not a signal of accuracy (<a href="/posts/ai-detectors-systematic-minds/">a theme that runs through recent posts on this blog</a>). The LK-99 episode is a reminder that miscalibration is not unique to neural networks. It is a general failure mode in any system that needs to estimate uncertainty about claims — human, institutional, or artificial. The cure in all cases is the same: track confidence to evidence, update on data, resist the pull of exciting priors.</p>
<h2 id="what-the-scientific-community-actually-did">What the Scientific Community Actually Did</h2>
<p>I want to be careful not to end on a note of pure cynicism about the media and leave the scientific community looking saintly. The community is not saintly.</p>
<p>There were preprints from independent groups that claimed positive results before the falsification was clear — groups that perhaps saw anomalies and wanted to be part of the story. There was social pressure, documented in real time on Twitter, to share exciting results before they were fully analyzed. The Discord servers and GitHub repositories that were genuinely useful for coordination were also, occasionally, vectors for misinformation and premature interpretation.</p>
<p>The community self-corrected. That is the important thing. The noise in the system resolved into a clear answer, in six weeks, through a process that was adversarial in the best scientific sense: many people trying to verify or refute a specific testable claim, sharing data openly, calling out methodological problems in public. The answer that emerged was correct.</p>
<p>I find this genuinely impressive. It is easy to be cynical about institutional science — about publication bias, about the replication crisis in psychology and medicine, about the incentive structures that reward novelty over rigor. The LK-99 episode is a counter-example. It is evidence that, when a question is clear and testable and the stakes are high, the system works. Not perfectly, not without noise, but functionally.</p>
<p>Peer review in the classical sense was absent. Peer review in a broader sense — global, immediate, public, adversarial — worked faster than any journal could have managed, and reached a correct conclusion.</p>
<h2 id="the-next-extraordinary-claim">The Next Extraordinary Claim</h2>
<p>LK-99 is over. The compound will appear in future textbooks, probably in a sidebar about famous failed claims in condensed matter physics, alongside Schön and Dias and others. The researchers who synthesized and characterized it honestly will get some credit for the negative result; the original Korean team will, I imagine, have a difficult few years professionally.</p>
<p>The question I am left with is what happens next time.</p>
<p>Room-temperature superconductivity will, almost certainly, be claimed again. The prize is too large and the search too active. Possibly the claim will be correct — I would not put that probability at zero. More likely it will be another false positive, another Cu₂S lurking in the impurity profile.</p>
<p>Will the media learn from LK-99? I am genuinely uncertain. The incentive structure for science journalism rewards excitement over accuracy, and &ldquo;extraordinary claim requires replication&rdquo; is a less clickable headline than &ldquo;room-temperature superconductor discovered.&rdquo; The journalists who wrote those headlines were not stupid; they were responding rationally to the incentives of their profession.</p>
<p>Will the scientific community respond as effectively? I think so, at least for claims of this kind: testable, synthesis-based, with enough labs in the world capable of attempting replication. The infrastructure — preprints, Discord, shared repositories — exists and is now demonstrated to work. The speed of the LK-99 falsification sets a kind of benchmark.</p>
<p>What the episode showed, in the end, is not that science is infallible or that the system is without problems. It showed that, under the right conditions — a clear empirical question, a distributed community with the tools and motivation to address it, and a culture of open data sharing — science can self-correct at remarkable speed. The failure was in communication, not in the science. That is a meaningful distinction.</p>
<p>Whether the media will have learned anything by the time the next extraordinary claim appears — that, I confess, I doubt.</p>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-LeeKim2023"></span>Lee, S., Kim, J. H., &amp; Kwon, Y.-W. (2023). The First Room-Temperature Ambient-Pressure Superconductor. <em>arXiv</em>:2307.12008. Retrieved from <a href="https://arxiv.org/abs/2307.12008">https://arxiv.org/abs/2307.12008</a></p>
</li>
<li>
<p><span id="ref-Senapati2023"></span>Kumar, K., Karn, N. K., Kumar, Y., &amp; Awana, V. P. S. (2023). Absence of superconductivity in LK-99 at ambient conditions. <em>arXiv</em>:2308.03544. Retrieved from <a href="https://arxiv.org/abs/2308.03544">https://arxiv.org/abs/2308.03544</a></p>
</li>
<li>
<p><span id="ref-LiuMeng2023"></span>Liu, S., &amp; Meng, S. (2023). Symmetry-breaking and the origin of the anomalous properties of LK-99. <em>arXiv</em>:2308.05135. Retrieved from <a href="https://arxiv.org/abs/2308.05135">https://arxiv.org/abs/2308.05135</a></p>
</li>
<li>
<p><span id="ref-Garisto2023"></span>Garisto, D. (2023). LK-99 isn&rsquo;t a superconductor — how science sleuths solved the mystery. <em>Nature</em>, 620, 705–706. <a href="https://doi.org/10.1038/d41586-023-02585-7">DOI: 10.1038/d41586-023-02585-7</a></p>
</li>
<li>
<p><span id="ref-Snider2020"></span>Snider, E., Dasenbrock-Gammon, N., McBride, R., Debessai, M., Vindana, H., Vencatasamy, K., Lawler, K. V., Salamat, A., &amp; Dias, R. P. (2020). Room-temperature superconductivity in a carbonaceous sulfur hydride. <em>Nature</em>, 586, 373–377. <a href="https://doi.org/10.1038/s41586-020-2801-z">DOI: 10.1038/s41586-020-2801-z</a> (Retracted 2022.)</p>
</li>
<li>
<p><span id="ref-DiasSilvera2017"></span>Dias, R. P., &amp; Silvera, I. F. (2017). Observation of the Wigner-Huntington transition to metallic hydrogen. <em>Science</em>, 355, 715–718. <a href="https://doi.org/10.1126/science.aal1579">DOI: 10.1126/science.aal1579</a> (Erratum published 2017; widely questioned.)</p>
</li>
<li>
<p><span id="ref-Hirsch2021"></span>Hirsch, J. E. (2021). Rejoinder to &ldquo;Comment on &lsquo;Absence of magnetic evidence for superconductivity in hydride compounds&rsquo;&rdquo; by Dias and Salamat. <em>Physica C</em>, 590, 1353964. <a href="https://doi.org/10.1016/j.physc.2021.1353964">DOI: 10.1016/j.physc.2021.1353964</a></p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2025-09-14</strong>: Updated the Cu₂S characterization: pure Cu₂S is diamagnetic; the ferromagnetism in LK-99 samples comes from impurity phases. Updated the Dias &amp; Silvera 2017 <em>Science</em> paper status: it received an erratum but was not formally retracted (unlike the 2020 and 2023 <em>Nature</em> papers). Updated the Senapati et al. reference to the correct LK-99 debunking literature (the previous arXiv ID resolved to a different paper).</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
