<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Statistical-Mechanics on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/statistical-mechanics/</link>
    <description>Recent content in Statistical-Mechanics on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 14 Oct 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/statistical-mechanics/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A Gas at Temperature T: Xenakis and the Physics of Stochastic Music</title>
      <link>https://sebastianspicker.github.io/posts/xenakis-stochastic-music/</link>
      <pubDate>Tue, 14 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/xenakis-stochastic-music/</guid>
      <description>Iannis Xenakis applied the Maxwell-Boltzmann velocity distribution, Markov chains, and game theory to orchestral composition. In Pithoprakta (1955–56), 46 string parts are molecules of a gas, each following the kinetic theory distribution. In Duel and Stratégie (1959–62), two conductors play a zero-sum game with payoff matrices on stage. This post works through the physics and mathematics, and asks what it means when a composer treats an orchestra as a thermodynamic system.</description>
      <content:encoded><![CDATA[<p><em>Iannis Xenakis (1922–2001) was trained as a civil engineer at the Athens
Polytechnic, joined the Greek Resistance during the Second World War and the
subsequent Greek Civil War, survived a British army tank shell in January 1945
that cost him the sight in his left eye and part of his jaw, was sentenced to
death in absentia by the Greek military government, fled to Paris in 1947, and
worked for twelve years as an architect in Le Corbusier&rsquo;s atelier — where he
contributed structural engineering to the Unité d&rsquo;Habitation in Marseille and
designed the Philips Pavilion for Expo 58. In parallel, already in his thirties,
he taught himself composition — approaching Honegger (who was too ill to teach) and then studying with Messiaen
— and became one of the central figures of the post-war avant-garde. I mention
the biography not as background colour but because it bears on the physics. A
person who has been through what Xenakis had been through by 1950 is not likely
to be intimidated by the kinetic theory of gases.</em></p>
<p><em>He was not. In 1955–56 he composed</em> Pithoprakta <em>— &ldquo;actions through
probability&rdquo; — for 46 strings, each of which is, in his own account, a
molecule of an ideal gas. This post works through the mathematics he
used and asks what it means when a composer takes statistical mechanics
seriously as a compositional tool.</em></p>
<hr>
<h2 id="the-problem-with-post-war-serialism">The Problem with Post-War Serialism</h2>
<p>To understand why Xenakis did what he did, it helps to know what everyone
else was doing. By the early 1950s, the dominant tendency in European
new music was total serialism: the systematic extension of Schoenberg&rsquo;s
twelve-tone technique to rhythm, dynamics, articulation, and register. Every
parameter of every note was determined by a series. Messiaen had sketched
this direction in <em>Mode de valeurs et d&rsquo;intensités</em> (1949); Boulez and
Stockhausen had taken it to its logical extreme.</p>
<p>The result, as Xenakis observed with characteristic bluntness in <em>Formalized
Music</em> (1963/1992), was a kind of sonic indistinguishability: because every
parameter varied according to independent deterministic series, the textures
produced by total serialism sounded essentially like random noise. The
maximum of local determinism had produced the appearance of global chaos.</p>
<p>His diagnosis was precise and, I think, correct: if the perceptual result of
maximum determinism and maximum randomness is the same, then the path forward
is not to find a better deterministic scheme but to embrace randomness
explicitly, at the level that governs the <em>macroscopic</em> structure. Control the
distribution; let the individual events vary within it. This is exactly what
statistical mechanics does for a gas: it does not track every molecule, but
it knows with great precision what the distribution of velocities will be.</p>
<hr>
<h2 id="statistical-mechanics-in-brief">Statistical Mechanics in Brief</h2>
<p>In a classical ideal gas of $N$ molecules at thermal equilibrium with
temperature $T$, the molecules move in all directions with speeds distributed
according to the Maxwell-Boltzmann speed distribution:</p>
$$f(v) = \sqrt{\frac{2}{\pi}}\, \frac{v^2}{a^3}\, \exp\!\left(-\frac{v^2}{2a^2}\right), \qquad a = \sqrt{\frac{k_B T}{m}},$$<p>where $m$ is the molecular mass and $k_B$ is Boltzmann&rsquo;s constant. The
parameter $a$ sets the characteristic speed scale: it grows with temperature
(hotter gas means faster molecules) and shrinks with molecular mass (heavier
molecules move more slowly at the same temperature).</p>
<p>The distribution has a characteristic shape: it rises as $v^2$ for small
speeds (few molecules are nearly stationary), peaks at the most probable
speed $v_p = a\sqrt{2}$, and falls off as $e^{-v^2/2a^2}$ for large speeds
(very fast molecules are exponentially rare). The three characteristic
speeds are:</p>
$$v_p = a\sqrt{2}, \qquad \langle v \rangle = a\sqrt{\tfrac{8}{\pi}}, \qquad v_\mathrm{rms} = a\sqrt{3}.$$<p>No individual molecule is tracked. The distribution is everything: once you
know $f(v)$, you know all macroscopic properties of the gas — pressure,
mean kinetic energy, thermal conductivity — without knowing the trajectory
of a single molecule. The individual is sacrificed to the ensemble.</p>
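<p>The distribution is easy to experiment with numerically. A Maxwell-Boltzmann speed is simply the magnitude of an isotropic three-dimensional Gaussian velocity vector with standard deviation $a$ per component, so sampling takes one line. A quick Python sketch in reduced units ($a = 1$), checking the characteristic speeds against the formulas above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.0  # characteristic speed a = sqrt(k_B T / m), set to 1 in reduced units

# A Maxwell-Boltzmann speed is the norm of a 3D Gaussian velocity vector
v = np.linalg.norm(rng.normal(scale=a, size=(200_000, 3)), axis=1)

v_p_theory    = a * np.sqrt(2)           # most probable speed
v_mean_theory = a * np.sqrt(8 / np.pi)   # mean speed
v_rms_theory  = a * np.sqrt(3)           # root-mean-square speed

print(f"mean speed: {v.mean():.3f} (theory {v_mean_theory:.3f})")
print(f"rms speed:  {np.sqrt((v**2).mean()):.3f} (theory {v_rms_theory:.3f})")
```

<p>A histogram of <code>v</code> reproduces the $v^2 e^{-v^2/2a^2}$ shape described above: a rise from zero, a peak at $a\sqrt{2}$, an exponential tail.</p>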
<hr>
<h2 id="pithoprakta-and-the-orchestra-as-gas"><em>Pithoprakta</em> and the Orchestra as Gas</h2>
<p>In <em>Pithoprakta</em> (1955–56), Xenakis assigns each of the 46 string instruments
to a molecule of a gas. The musical analogue of molecular speed is the
<em>velocity of a glissando</em>: the rate at which a glissando moves through
pitch, measured in semitones per second. Slow glissandi are cold molecules;
fast glissandi are hot ones.</p>
<p>For a given passage with a specified musical &ldquo;temperature&rdquo; (an
intensity-and-density parameter he could set as a compositional choice),
the 46 glissando speeds are drawn from the Maxwell-Boltzmann distribution
for that temperature. No two strings play the same glissando at the same
speed. The effect, to a listener, is a dense sound-mass — a shimmer or
a roar — whose internal texture varies but whose overall character (the
temperature, the density) is under the composer&rsquo;s control at exactly the
level that matters perceptually.</p>
<p>Xenakis worked out the velocities numerically by hand. The score of
<em>Pithoprakta</em> was among the first in which the individual parts were derived
from a statistical distribution rather than from a melody, a row, or an
improvisation instruction. The calculation is tedious but not difficult:
for each time window, choose a temperature, compute $f(v)$ for the 46
values of $v$ that tile the distribution, and assign one speed to each
instrument.</p>
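<p>The tiling step can be sketched in a few lines. Xenakis did this arithmetic by hand; the helper below is my illustration of the idea, not his procedure. It divides the distribution into 46 equal-probability slices and takes the midpoint quantile of each, using SciPy&rsquo;s <code>maxwell</code> distribution, whose <code>scale</code> parameter is exactly the $a = \sqrt{k_B T / m}$ above:</p>

```python
import numpy as np
from scipy.stats import maxwell

def tile_speeds(n, a):
    """n glissando speeds that tile the Maxwell-Boltzmann distribution:
    one speed per equal-probability slice (the midpoint quantile of each)."""
    q = (np.arange(n) + 0.5) / n     # midpoints of n equal-probability bins
    return maxwell.ppf(q, scale=a)   # inverse CDF

# 46 strings at a musical "temperature" of a = 3 semitones per second
speeds = tile_speeds(46, a=3.0)
print(speeds.round(2))  # slowest to fastest glissando, one per instrument
```

<p>Raising <code>a</code> shifts every speed upward and spreads them out: the orchestra heats.</p>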
<p>The connection between macroscopic structure and microscopic liberty is
deliberately preserved. The shape of the sound-mass — its brightness,
its turbulence, its rate of change — is controlled. Each individual line
is unpredictable. This is, structurally, the same trade-off that makes
thermodynamics work: you give up on the individual trajectory and gain
exact knowledge of the aggregate.</p>
<hr>
<h2 id="musical-temperature-as-a-compositional-parameter">Musical Temperature as a Compositional Parameter</h2>
<p>The analogy is worth making precise. In the physical gas, raising the
temperature $T$ increases $a = \sqrt{k_B T / m}$, which shifts the
peak of $f(v)$ to the right and widens the distribution. More molecules
have high speeds; the variance of speeds increases.</p>
<p>In <em>Pithoprakta</em>, raising the musical &ldquo;temperature&rdquo; has the same
effect: more instruments perform rapid glissandi; the pitch-space
trajectories are more varied; the texture becomes more active and
more turbulent. Lowering the temperature concentrates the glissando
speeds near zero — slow motion, near-stasis, long sustained tones
that change pitch only gradually. The orchestra cools.</p>
<p>This mapping is not metaphorical. Xenakis computed it. The score
contains numerically derived glissando speeds; the connection between the
perceptual temperature of the texture and the statistical parameter $T$ is
quantitative. When musicians speak of a passage &ldquo;heating up,&rdquo; they are
usually using a figure of speech. In <em>Pithoprakta</em>, they are describing
a thermodynamic fact.</p>
<hr>
<h2 id="the-poisson-distribution-and-event-density">The Poisson Distribution and Event Density</h2>
<p><em>Pithoprakta</em> uses a second physical model alongside the Maxwell-Boltzmann
distribution: the Poisson process, which governs the density of
independent, randomly occurring events.</p>
<p>If musical events (pizzicato attacks, bow changes, individual note entries)
occur at a mean rate of $\lambda$ events per second, the probability of
exactly $k$ events occurring in a time window of length $\tau$ (written
$\tau$ here because $T$ is already in use as the temperature) is:</p>
$$P(N = k) = \frac{(\lambda \tau)^k\, e^{-\lambda \tau}}{k!}.$$<p>The Poisson distribution has a single parameter $\lambda$ that controls
both the mean and the variance of the count (they are equal: $\langle N \rangle =
\mathrm{Var}(N) = \lambda \tau$). A high $\lambda$ produces a dense cluster
of events; a low $\lambda$ produces sparse, widely spaced events.</p>
<p>Xenakis used this to control the density of pizzicato attacks independently
of the glissando texture. A passage can be cool (slow glissandi) and dense
(many pizzicati), or hot and sparse, or any combination. The two
distributions operate on independent musical parameters — pitch motion and
event density — giving the composer a two-dimensional thermodynamic control
space over the texture.</p>
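<p>Sampling a Poisson process is equally direct: draw the number of events in the window, then place them uniformly at random within it. A sketch with illustrative parameters (not values from the score):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def attack_times(rate, window, rng):
    """Event times of a homogeneous Poisson process: the count in the
    window is Poisson(rate * window); given the count, the times are
    independent and uniform over the window."""
    k = rng.poisson(rate * window)
    return np.sort(rng.uniform(0.0, window, size=k))

dense  = attack_times(rate=8.0, window=10.0, rng=rng)  # roughly 80 pizzicati
sparse = attack_times(rate=0.5, window=10.0, rng=rng)  # roughly 5 pizzicati
print(len(dense), len(sparse))
```

<p>The two rates can be varied independently of the glissando temperature, which is the point: two orthogonal knobs on the texture.</p>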
<hr>
<h2 id="markov-chains-analogique-a-and-analogique-b">Markov Chains: <em>Analogique A</em> and <em>Analogique B</em></h2>
<p>In <em>Analogique A</em> (for string orchestra, 1958–59) and its companion
<em>Analogique B</em> (for sinusoidal tones, same year), Xenakis moved to a
different stochastic framework: Markov chains.</p>
<p>A Markov chain is a sequence of states where the probability of
transitioning to the next state depends only on the current state. The
chain is specified by a transition matrix $P$, where $P_{ij}$ is the
probability of moving from state $i$ to state $j$:</p>
$$P_{ij} \geq 0, \qquad \sum_j P_{ij} = 1 \quad \forall\, i.$$<p>Under mild conditions (irreducibility and aperiodicity), the chain
converges to a unique stationary distribution $\pi$ satisfying:</p>
$$\pi P = \pi, \qquad \sum_i \pi_i = 1.$$<p>The convergence is geometric: if $\lambda_2$ is the second-largest eigenvalue
of $P$ in absolute value, then after $n$ steps the distribution $\pi^{(n)}$
satisfies $\|\pi^{(n)} - \pi\| \leq C |\lambda_2|^n$ for some constant $C$.
The gap $1 - |\lambda_2|$ — the <em>spectral gap</em> — controls how quickly the
chain forgets its initial state. A transition matrix with a large spectral
gap produces rapid convergence; one with $|\lambda_2| \approx 1$ produces
long-memory dependence between distant states. This is a compositional
choice: the spectral gap determines how quickly a piece&rsquo;s texture changes
character.</p>
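<p>Both the stationary distribution and the spectral gap fall out of a single eigendecomposition. A small Python sketch with a made-up three-state transition matrix (illustrative, not Xenakis&rsquo;s actual values):</p>

```python
import numpy as np

# Toy chain over three texture states: sparse / medium / dense.
# Rows sum to 1: P[i, j] is the probability of moving from state i to j.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Spectral gap: 1 - |lambda_2| sets the rate at which the chain
# forgets its initial state
lam = np.sort(np.abs(w))[::-1]
gap = 1.0 - lam[1]
print(pi.round(3), round(gap, 3))
```

<p>Shrinking the off-diagonal entries pushes $|\lambda_2|$ toward 1 and the gap toward 0: the texture then changes character slowly, carrying long memory of where it started.</p>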
<p>In <em>Analogique A</em>, Xenakis divided the sonic space into a grid of
cells defined by pitch register (high/middle/low), density
(sparse/medium/dense), and dynamic (soft/loud). Each &ldquo;screen&rdquo; — a brief
time window — occupies one cell in this grid. The progression of screens
through the piece is governed by transition probabilities: from a
high/dense/loud screen, there is some probability of moving to each
adjacent cell, specified by Xenakis&rsquo;s chosen transition matrix.</p>
<p>This is a Markov chain on a discrete state space of sonic textures. The
macroscopic trajectory of the piece — its overall movement through
sound-quality space — is determined by the transition matrix, which the composer
sets. The details of each screen are filled in stochastically, within the
parameters of the current state. Again, the individual is sacrificed to the
aggregate; control is exercised at the level of the distribution rather
than the event.</p>
<hr>
<h2 id="game-theory-duel-and-stratégie">Game Theory: <em>Duel</em> and <em>Stratégie</em></h2>
<p>The most extreme and, to my mind, most interesting of Xenakis&rsquo;s
formalisations is the use of game theory in <em>Duel</em> (1959) and <em>Stratégie</em>
(1962).</p>
<p>A <strong>two-player zero-sum game</strong> is specified by a payoff matrix $A \in
\mathbb{R}^{m \times n}$. Player 1 (the &ldquo;maximiser&rdquo;) chooses a row $i$;
Player 2 (the &ldquo;minimiser&rdquo;) chooses a column $j$; Player 1 receives payoff
$A_{ij}$ and Player 2 receives $-A_{ij}$. In a pure-strategy game, each
player selects a single action. In a <strong>mixed-strategy game</strong>, each player
chooses a probability distribution over their actions: Player 1 uses
$\mathbf{x} \in \Delta_m$ and Player 2 uses $\mathbf{y} \in \Delta_n$,
where $\Delta_k$ denotes the standard $(k-1)$-simplex.</p>
<p>The expected payoff to Player 1 under mixed strategies is:</p>
$$E(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top A\, \mathbf{y}.$$<p>Von Neumann&rsquo;s minimax theorem (1928) guarantees that:</p>
$$\max_{\mathbf{x} \in \Delta_m} \min_{\mathbf{y} \in \Delta_n}
\mathbf{x}^\top A\, \mathbf{y}
\;=\;
\min_{\mathbf{y} \in \Delta_n} \max_{\mathbf{x} \in \Delta_m}
\mathbf{x}^\top A\, \mathbf{y}
\;=\; v^*,$$<p>where $v^*$ is the <strong>value</strong> of the game. The pair $(\mathbf{x}^*,
\mathbf{y}^*)$ that achieves this saddle point is the Nash equilibrium:
neither player can improve their expected payoff by unilaterally deviating
from their equilibrium strategy.</p>
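<p>The value and an optimal mixed strategy can be computed by linear programming, via the standard reduction. A sketch of a generic solver (my illustration; the function name is mine, and this is not a tool Xenakis had):</p>

```python
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Value v* and an optimal mixed strategy x* for the row player
    (maximiser) of the zero-sum game with payoff matrix A."""
    A = np.asarray(A, dtype=float)
    shift = A.min() - 1.0
    B = A - shift                 # shift so all payoffs are positive
    m, n = B.shape
    # minimise sum(u) subject to B^T u >= 1, u >= 0;
    # then v(B) = 1 / sum(u) and x* = u * v(B)
    res = linprog(np.ones(m), A_ub=-B.T, b_ub=-np.ones(n),
                  bounds=[(0, None)] * m, method="highs")
    value = 1.0 / res.x.sum()
    return value + shift, res.x * value

# Matching pennies: value 0, optimal strategy (1/2, 1/2)
v, x = game_value([[1, -1], [-1, 1]])
print(v, x)
```

<p>Feeding in a $19 \times 19$ integer matrix works the same way; the interesting design question is choosing the entries so that the optimal strategy is genuinely mixed, i.e. no single tactic dominates.</p>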
<p>In <em>Stratégie</em>, each conductor leads one orchestra. Each has nineteen
&ldquo;tactics&rdquo; — six basic musical textures (e.g., sustained chords, staccato
pizzicati, glissandi masses, silence) plus thirteen combinatorial tactics
that combine two or three of the basics. The payoff matrix is a
$19 \times 19$ integer matrix, also defined by Xenakis, specifying how
many points Conductor 1 scores when their orchestra plays tactic $i$ against
Conductor 2&rsquo;s tactic $j$. A referee tracks the score.</p>
<p>The conductors make decisions in real time during the performance, choosing
tactics based on what the other conductor is doing and on the evolving score.
The piece ends when one conductor reaches a predetermined score threshold.</p>
<p>The Nash equilibrium of the payoff matrix tells each conductor, in principle,
the optimal <em>distribution</em> over tactics to play: if both play optimally, the
expected score trajectory is determined. In practice, conductors are not
expected to compute mixed strategies on the podium; Xenakis&rsquo;s point is
structural. The game-theoretic formalism is used to design the payoff matrix
so that no tactic dominates — every choice has consequences that depend on
the opponent&rsquo;s choice — guaranteeing that the piece will always contain
genuine strategic tension regardless of who is conducting.</p>
<p><em>Duel</em> (1959) is the earlier, simpler version for two chamber orchestras.
<em>Stratégie</em> (1962) was premiered in April 1963 at the Venice Biennale with two conductors
competing live. The audience was aware of the game, of the score, and of
the payoff matrix. The premiere was by most accounts a success, though the
practical complications of running a zero-sum game in a concert hall
(including the question of whether conductors were actually computing Nash
equilibria or just following intuition) were never fully resolved.</p>
<hr>
<h2 id="formalized-music"><em>Formalized Music</em></h2>
<p>Xenakis assembled his theoretical framework in <em>Musiques formelles</em> (1963),
translated and expanded as <em>Formalized Music</em> (1971; revised edition 1992).
The book is one of the strangest documents in twentieth-century music theory:
part treatise, part manifesto, part mathematical appendix. It covers
stochastic composition, Markov chains, game theory, set theory, group theory,
and symbolic logic — all presented with the confidence of someone who is
equally at home in the engineering faculty and the concert hall, and with
the occasional obscurity of someone writing simultaneously for two audiences
who share almost no vocabulary.</p>
<p>The core argument is that musical composition can and should be treated as
the application of mathematical structures to sonic material, not because
mathematics makes music &ldquo;better&rdquo; but because mathematical structures are
the most powerful available tools for controlling relationships between
sounds at multiple scales simultaneously. The statistical distributions
control the macroscopic; the individual values vary within them. The
game-theoretic payoff matrix controls the strategic interaction; the individual
tactics fill in the details. Mathematics operates at the structural level
and leaves the acoustic surface free.</p>
<p>This is a different relationship between mathematics and music from the
ones in my earlier posts on <a href="/posts/messiaen-modes-group-theory/">group theory and Messiaen</a>
or <a href="/posts/euclidean-rhythms/">the Euclidean algorithm and world rhythms</a>.
In those cases, mathematics describes structure that already exists in the
music — structure the composers arrived at by ear. In Xenakis, mathematics
is the generative tool: the score is derived from the calculation.</p>
<hr>
<h2 id="what-the-analogy-does-and-does-not-do">What the Analogy Does and Does Not Do</h2>
<p>The Maxwell-Boltzmann analogy in <em>Pithoprakta</em> is exact in one direction
and approximate in another.</p>
<p>It is exact in the following sense: the glissando speeds Xenakis computed
for his 46 strings genuinely follow the Maxwell-Boltzmann distribution with
the parameters he chose. The score is a realisation of that distribution.
If you collect the glissando speeds from the score and plot their histogram,
you will find the characteristic $v^2 e^{-v^2/2a^2}$ shape.</p>
<p>It is approximate — or rather, it is analogical — in the sense that strings
in an orchestra are not molecules of a gas. They do not collide. They have
mass and inertia in a physical sense that has no direct mapping to
musical parameters. The temperature $T$ is not a temperature in any
thermodynamic sense; it is a compositional variable that Xenakis chose to
parameterise with the same symbol because the formal relationship is the
same. The analogy is structural, not ontological.</p>
<p>This is worth saying plainly because it is easy to be misled in both
directions: either to over-claim (the orchestra <em>is</em> a gas) or to dismiss
(the orchestra is <em>merely</em> labelled with physical vocabulary). The actual
claim is more modest and more interesting: the mathematical structure of the
Maxwell-Boltzmann distribution is the right tool for specifying a certain
kind of orchestral texture, namely one where individual elements vary
stochastically around a controlled macroscopic envelope. The physics
provides the formalism; the music provides the application. This is how
mathematics works in engineering, too.</p>
<hr>
<h2 id="the-centenary-and-what-remains">The Centenary and What Remains</h2>
<p>Xenakis died in 2001, by then partially deaf and with dementia. His centenary
in 2022 produced a wave of new performances, recordings, and scholarship
— including the <em>Meta-Xenakis</em> volume (Open Book Publishers, 2022), which
collects analyses of his compositional mathematics, his architectural work
(he designed the Philips Pavilion, credited to Le Corbusier&rsquo;s studio, for
Expo 58 in Brussels using the same ruled-surface geometry as in <em>Metastaseis</em>), and
his political biography.</p>
<p>What remains resonant about his project is not the specific distributions
he chose — the Maxwell-Boltzmann is not the only or even necessarily the
best distribution for many musical applications — but the epistemological
position it represents. Xenakis insisted that the right question to ask
about a musical texture is not &ldquo;what is the note at beat 3 of bar 47?&rdquo; but
&ldquo;what is the distribution from which the events in this section are drawn?&rdquo;
This shift from individual determination to statistical control is precisely
the shift that makes thermodynamics possible as a science, and Xenakis was
the first composer to apply it deliberately and systematically.</p>
<p>When a composer writes &ldquo;let the
orchestra be a gas at temperature $T$&rdquo; and then actually computes the
consequences with Boltzmann&rsquo;s constant in front of him, I do not feel that
physics has been appropriated. I feel that it has been recognised — seen,
from a different direction, as the same thing it always was: a set of tools
for thinking about ensembles of interacting elements whose individual
behaviour is too complex to track but whose collective behaviour is not.</p>
<p>The orchestra is not a gas. But the Maxwell-Boltzmann distribution describes
it anyway.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>
<p>Ames, C. (1989). The Markov process as a compositional model: A survey and
tutorial. <em>Leonardo</em>, 22(2), 175–187. <a href="https://doi.org/10.2307/1575226">https://doi.org/10.2307/1575226</a></p>
</li>
<li>
<p>Jedrzejewski, F. (2006). <em>Mathematical Theory of Music.</em> Delatour France /
IRCAM.</p>
</li>
<li>
<p>Nash, J. F. (1950). Equilibrium points in $n$-person games. <em>Proceedings of
the National Academy of Sciences</em>, 36(1), 48–49.
<a href="https://doi.org/10.1073/pnas.36.1.48">https://doi.org/10.1073/pnas.36.1.48</a></p>
</li>
<li>
<p>Nierhaus, G. (2009). <em>Algorithmic Composition: Paradigms of Automated Music
Generation.</em> Springer.</p>
</li>
<li>
<p>Matossian, N. (2005). <em>Xenakis</em> (revised ed.). Moufflon Publications.</p>
</li>
<li>
<p>Solomos, M. (Ed.). (2022). <em>Meta-Xenakis.</em> Open Book Publishers.
<a href="https://doi.org/10.11647/OBP.0313">https://doi.org/10.11647/OBP.0313</a></p>
</li>
<li>
<p>von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. <em>Mathematische
Annalen</em>, 100(1), 295–320. <a href="https://doi.org/10.1007/BF01448847">https://doi.org/10.1007/BF01448847</a></p>
</li>
<li>
<p>von Neumann, J., &amp; Morgenstern, O. (1944). <em>Theory of Games and Economic
Behavior.</em> Princeton University Press.</p>
</li>
<li>
<p>Xenakis, I. (1992). <em>Formalized Music: Thought and Mathematics in
Composition</em> (revised ed.). Pendragon Press.
(Originally published as <em>Musiques formelles</em>, La Revue Musicale, 1963.)</p>
</li>
</ul>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Corrected the description of <em>Stratégie</em> (1962): each conductor has nineteen tactics (six basic plus thirteen combinatorial), with a 19 x 19 payoff matrix — not six tactics and a 6 x 6 matrix. The six-tactic, 6 x 6 description applies to the earlier <em>Duel</em> (1959).</li>
<li><strong>2026-01-14</strong>: Added &ldquo;in April 1963&rdquo; to the <em>Stratégie</em> premiere sentence. The composition date is 1962; the premiere took place on 25 April 1963 at the Venice Biennale.</li>
<li><strong>2026-01-14</strong>: Changed &ldquo;studying briefly with Honegger&rdquo; to &ldquo;approaching Honegger (who was too ill to teach).&rdquo; Xenakis sought instruction from Honegger circa 1949, but Honegger was in declining health and did not take him as a student.</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>The Hamiltonian of Intelligence: From Spin Glasses to Neural Networks</title>
      <link>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</link>
      <pubDate>Mon, 21 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/spin-glass-hopfield-ai-physics-lineage/</guid>
      <description>On October 8, 2024, Hopfield and Hinton were awarded the Nobel Prize in Physics. The physics community reacted with irritation: is machine learning really physics? The irritation is wrong. The energy function of a Hopfield network is literally the Ising Hamiltonian. The lineage runs from Giorgio Parisi&amp;rsquo;s disordered iron alloys in 1979 to the model that predicted the structures of 200 million proteins.</description>
      <content:encoded><![CDATA[<p>On October 8, 2024, the Royal Swedish Academy of Sciences announced that the Nobel Prize in Physics would go to John Hopfield and Geoffrey Hinton &ldquo;for foundational discoveries and inventions that enable machine learning with artificial neural networks.&rdquo; Within hours, the physics corner of the internet had an episode. Thermodynamics Twitter — yes, that is a thing — asked whether gradient descent is really physics in the sense that the Higgs mechanism is physics. The condensed matter community, who have been doing disordered systems since before most ML practitioners were born, oscillated between pride (&ldquo;finally, they noticed us&rdquo;) and bafflement (&ldquo;why is Hinton here and not Parisi?&rdquo;). There were takes. There were dunks. Someone made a graph of Nobel prizes versus average journal impact factor and it was not flattering to this year&rsquo;s winner.</p>
<p>I understand the irritation. I do not share it.</p>
<p>The argument I want to make is stronger than &ldquo;machine learning uses some physics concepts by analogy.&rdquo; The energy function that Hopfield wrote down in 1982 is not <em>inspired by</em> the Ising Hamiltonian. It <em>is</em> the Ising Hamiltonian. The machine that Hinton and Sejnowski built in 1985 is not named after Boltzmann as a cute metaphor. It is a physical system whose equilibrium distribution is the Boltzmann distribution, and whose learning algorithm is derived from statistical mechanics. The lineage from disordered magnets to protein structure prediction is not a convenient narrative; it is a sequence of mathematical identities.</p>
<p>Let me trace it properly.</p>
<h2 id="the-2021-nobel-parisi-and-the-frozen-magnet">The 2021 Nobel: Parisi and the frozen magnet</h2>
<p>Before we get to 2024, we need 2021. Giorgio Parisi received half the Nobel Prize in Physics that year for work done between 1979 and 1983 on spin glasses. The other half went to Syukuro Manabe and Klaus Hasselmann for climate modelling — an interesting pairing that provoked its own set of takes, though rather fewer.</p>
<p>A spin glass is a disordered magnetic system. The canonical physical realisation is a dilute alloy: a small concentration of manganese atoms dissolved in copper. Each manganese atom carries a magnetic moment — a spin — that can point in one of two directions, which we label $\sigma_i \in \{-1, +1\}$. The spins interact with each other via exchange interactions mediated by the conduction electrons. The crucial feature is that these interactions are random: some spin pairs prefer to align (ferromagnetic coupling, $J_{ij} > 0$) and others prefer to anti-align (antiferromagnetic coupling, $J_{ij} < 0$), and there is no spatial pattern to which is which.</p>
<p>The Hamiltonian of the system is</p>
$$H = -\sum_{i < j} J_{ij} \sigma_i \sigma_j$$<p>where the $J_{ij}$ are random variables drawn from some distribution. In the Sherrington-Kirkpatrick (SK) model (<a href="#ref-Sherrington1975">Sherrington &amp; Kirkpatrick, 1975</a>), all $N$ spins interact with all other spins — a mean-field model — and the couplings are drawn from a Gaussian distribution with mean zero and variance $J^2/N$:</p>
$$J_{ij} \sim \mathcal{N}\!\left(0,\, \frac{J^2}{N}\right)$$<p>The factor of $1/N$ is essential for extensivity: without it, the energy would scale as $N^2$ rather than $N$, which is unphysical.</p>
<p>Now here is the key phenomenon. At high temperature, the spins fluctuate freely and the system is paramagnetic. Cool it below the glass transition temperature $T_g$, and the system &ldquo;freezes&rdquo; — but not into a ferromagnet with all spins aligned, and not into a simple antiferromagnet. It freezes into one of an astronomically large number of disordered, metastable states. The system is not in its true ground state; it is trapped. It cannot find its way down because the energy landscape is rugged: every path toward lower energy is blocked by a barrier.</p>
<p>This rugged landscape is the central object. It has exponentially many local minima, separated by barriers that grow with system size. Different initial conditions lead to different frozen states. The system has memory of its history — hence &ldquo;glass&rdquo; rather than &ldquo;crystal.&rdquo;</p>
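<p>The ruggedness is easy to see in a small instance. The sketch below draws an SK coupling matrix for $N = 12$ spins and counts, by brute force, the configurations that are stable against every single spin flip (flipping $\sigma_i$ changes the energy by $\Delta_i = 2\sigma_i \sum_j J_{ij}\sigma_j$, so a local minimum has all $\Delta_i \geq 0$):</p>

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N = 12

# SK couplings: Gaussian, mean 0, variance 1/N (J = 1), symmetric, zero diagonal
J = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
J = np.triu(J, 1)
J = J + J.T

# Count one-spin-flip-stable states over all 2^N configurations
minima = 0
for bits in product([-1, 1], repeat=N):
    s = np.array(bits)
    if np.all(2 * s * (J @ s) >= 0):  # no single flip lowers H
        minima += 1
print(minima)  # minima come in +/- pairs; a pure ferromagnet would have exactly 2
```

<p>For typical draws the count is well above two, and the expected number of such metastable states grows exponentially with $N$. That exponential proliferation is the rugged landscape in miniature.</p>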
<p>Computing thermodynamic quantities in this system requires averaging over the disorder (the random $J_{ij}$), which means computing the quenched average of the free energy:</p>
$$\overline{F} = -T\, \overline{\ln Z}$$<p>The overline denotes an average over the distribution of couplings. The problem is that $\ln Z$ is hard to average because $Z$ is a sum of exponentially many terms. Parisi&rsquo;s solution — the replica trick — is a mathematical device worth describing, because it is beautifully strange.</p>
<p>The trick exploits the identity $\ln Z = \lim_{n \to 0} (Z^n - 1)/n$. We compute $\overline{Z^n}$ for integer $n$, which is feasible because $Z^n$ is a product of $n$ copies (replicas) of the partition function, and the average over disorder decouples. We then analytically continue in $n$ to $n \to 0$. The result is an effective action in terms of order parameters $q^{ab}$, which describe the overlap between spin configurations in replica $a$ and replica $b$.</p>
<p>The naive assumption is replica symmetry: all $q^{ab}$ are equal. This assumption turns out to be wrong. Parisi showed that the correct solution breaks replica symmetry in a hierarchical way — the overlap matrix $q^{ab}$ has a nested structure, described by a function $q(x)$ for $x \in [0,1]$. This is replica symmetry breaking (RSB).</p>
<p>RSB has a beautiful physical interpretation. The phase space of the spin glass is organised into an ultrametric tree: exponentially many states, arranged in nested clusters. States in the same cluster are similar (high overlap); states in different clusters are very different (low overlap). The hierarchy has infinitely many levels. Parisi showed that this structure is exact in the SK model (<a href="#ref-Parisi1979">Parisi, 1979</a>) and spent the following years working out its physical meaning; fully rigorous proofs of the Parisi solution came later, from Guerra and Talagrand in the 2000s.</p>
<p>This is not an abstraction. RSB predicts specific, measurable properties of real spin glass alloys, and experiments have confirmed them. It is also, I want to emphasise, not a result that anyone expected. The mathematics forced it.</p>
<p>Three years after Parisi solved the SK model, a physicist at Bell Labs wrote a paper about memory.</p>
<h2 id="hopfield-1982-memory-as-energy-minimisation">Hopfield (1982): memory as energy minimisation</h2>
<p>John Hopfield was a condensed matter physicist who had drifted toward biophysics — electron transfer in proteins, neural computation. In 1982 he published a paper in PNAS with the title &ldquo;Neural networks and physical systems with emergent collective computational abilities&rdquo; (<a href="#ref-Hopfield1982">Hopfield, 1982</a>). Most biologists read it as a neuroscience paper. It is a statistical mechanics paper.</p>
<p>Hopfield defined a network of $N$ binary &ldquo;neurons&rdquo; $s_i \in \{-1, +1\}$ with symmetric weights $W_{ij} = W_{ji}$, and an energy function:</p>
$$E = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, s_i s_j$$<p>Readers who have seen the SK Hamiltonian above will notice something. This is it. The $J_{ij}$ of the spin glass are the $W_{ij}$ of the neural network. The Ising spins $\sigma_i$ are the neuron states $s_i$. The Hopfield network energy function is the Ising model Hamiltonian with symmetric, fixed (non-random) couplings. This is not a metaphor. This is the same equation.</p>
<p>The dynamics: at each step, choose a neuron $i$ at random and update it according to</p>
$$s_i \leftarrow \text{sgn}\!\left(\sum_{j} W_{ij} s_j\right)$$<p>Because the weights are symmetric, this update never increases the energy $E$. The network is a gradient descent machine on $E$: it always converges to a fixed point, a local minimum of the energy.</p>
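<p>The descent property is easy to check numerically. A minimal sketch (Python with NumPy; the size, seed, and coupling distribution are arbitrary choices for illustration) draws a random symmetric coupling matrix and verifies that every asynchronous sign update leaves the energy unchanged or lower:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

# Random symmetric weights with zero diagonal (no self-coupling).
W = rng.normal(size=(N, N))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

def energy(W, s):
    # E = -1/2 sum_{i != j} W_ij s_i s_j
    return -0.5 * s @ W @ s

s = rng.choice([-1, 1], size=N).astype(float)
E = energy(W, s)

# Asynchronous updates: each flip can only lower (or keep) the energy.
for _ in range(2000):
    i = rng.integers(N)
    s[i] = np.sign(W[i] @ s) or 1.0   # sgn of the local field, ties to +1
    E_new = energy(W, s)
    assert E_new <= E + 1e-9          # monotone descent
    E = E_new
```

<p>The assertion never fires: with symmetric weights and zero self-coupling, a flip changes the energy by $-(s_i^{\text{new}} - s_i^{\text{old}})\, h_i \le 0$, where $h_i$ is the local field.</p>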
<p>The innovation is in how Hopfield chose the weights. To store a set of $p$ binary patterns $\xi^\mu \in \{-1,+1\}^N$ (for $\mu = 1, \ldots, p$), use Hebb&rsquo;s rule:</p>
$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi^\mu_i\, \xi^\mu_j$$<p>This is the outer product rule. Each stored pattern contributes a rank-1 matrix to $W$. You can verify that if $s = \xi^\mu$, then the local field at neuron $i$ is</p>
$$h_i = \sum_j W_{ij} s_j = \frac{1}{N}\sum_j \sum_{\nu} \xi^\nu_i \xi^\nu_j \xi^\mu_j = \xi^\mu_i + \frac{1}{N}\sum_{\nu \neq \mu} \xi^\nu_i \underbrace{\left(\sum_j \xi^\nu_j \xi^\mu_j\right)}_{\text{cross-talk}}$$<p>The first term reinforces pattern $\mu$. The second term is noise from the other stored patterns. For random, uncorrelated patterns the cross-talk has zero mean and fluctuations of order $\sqrt{p/N}$, so as long as $p \ll N$ the first term dominates and the stored patterns are stable fixed points of the dynamics. A noisy or incomplete input — a partial pattern — will evolve under the dynamics toward the nearest stored pattern. This is associative memory: content-addressable retrieval.</p>
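<p>Storage and retrieval fit in a few lines. A sketch (Python with NumPy; sizes, corruption level, and seed are illustrative, chosen to sit well below the capacity limit):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 10   # 10 patterns in 200 neurons: p/N = 0.05

# Random patterns and the Hebbian outer-product weights.
xi = rng.choice([-1, 1], size=(p, N)).astype(float)
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

# Corrupt pattern 0 by flipping 15% of its bits.
s = xi[0].copy()
flip = rng.choice(N, size=30, replace=False)
s[flip] *= -1

# Asynchronous dynamics pull the state back toward the stored pattern.
for _ in range(20 * N):
    i = rng.integers(N)
    h = W[i] @ s
    s[i] = 1.0 if h >= 0 else -1.0

overlap = (s @ xi[0]) / N
print(overlap)   # close to 1: the memory is retrieved
```

<p>At $p/N = 0.05$ the cross-talk is small and the dynamics reliably restore the corrupted pattern; pushing $p$ past roughly $0.14N$ typically makes the same experiment fail.</p>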
<p>The capacity limit follows from the same analysis. As $p$ grows, the cross-talk grows. When $p$ exceeds approximately $0.14N$, the cross-talk overwhelms the signal, and the network begins to form spurious minima — states that are not any of the stored patterns but are mixtures or corruptions of them. The network has entered a spin-glass phase.</p>
<p>This is not a rough analogy. Amit, Gutfreund, and Sompolinsky showed in 1985 that the Hopfield model is <em>exactly</em> the SK model with $p$ planted minima (<a href="#ref-Amit1985">Amit, Gutfreund, &amp; Sompolinsky, 1985</a>). The phase diagram of the Hopfield model — paramagnetic phase, memory phase, spin-glass phase — maps precisely onto the phase diagram of the SK model. The capacity limit $p \approx 0.14N$ is the phase boundary between the memory phase and the spin-glass phase, derivable from Parisi&rsquo;s RSB theory.</p>
<p>The 2021 Nobel and the 2024 Nobel are, mathematically, about the same model.</p>
<h2 id="boltzmann-machines-hinton--sejnowski-1985">Boltzmann machines (Hinton &amp; Sejnowski, 1985)</h2>
<p>The Hopfield model is deterministic and shallow — one layer of visible neurons, no hidden structure. Geoffrey Hinton and Terry Sejnowski, in a collaboration that began at the Cognitive Science summer school in Pittsfield in 1983 and culminated in a 1985 paper (<a href="#ref-Ackley1985">Ackley, Hinton, &amp; Sejnowski, 1985</a>), added two things: hidden units and stochastic dynamics.</p>
<p>Hidden units $h_j$ are neurons not connected to any input or output. They do not correspond to observable quantities; they model latent structure in the data. The energy of the system is:</p>
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} W_{ij}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$$<p>where $v_i$ are the visible (data) units, $h_j$ are the hidden units, $a_i$ and $b_j$ are biases. Note that this is still an Ising-type energy; the $W_{ij}$ are now inter-layer weights.</p>
<p>The stochastic dynamics replace deterministic gradient descent with a Markov chain. Each unit is updated probabilistically:</p>
$$P(s_k = 1 \mid \text{rest}) = \sigma\!\left(\sum_j W_{kj} s_j + \text{bias}_k\right)$$<p>where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. At inverse temperature $\beta = 1/T$, the probability of any complete configuration is</p>
$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-\beta E(\mathbf{v}, \mathbf{h})}$$<p>This is the Boltzmann distribution. The machine is named after Ludwig Boltzmann because the equilibrium distribution of its states is the Boltzmann distribution. Not analogously. Literally.</p>
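<p>For a machine small enough to enumerate, the claim can be checked directly. The sketch below (Python with NumPy; the weights, biases, and sample count are arbitrary) runs blockwise Gibbs sampling on a two-visible, two-hidden machine and compares the empirical state frequencies with the exact Boltzmann distribution at $\beta = 1$:</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Tiny bipartite Boltzmann machine: 2 visible, 2 hidden units in {0, 1}.
W = np.array([[0.5, -1.0], [1.0, 0.3]])   # visible-hidden couplings
a = np.array([0.2, -0.1])                  # visible biases
b = np.array([-0.3, 0.4])                  # hidden biases

def energy(v, h):
    return -(v @ W @ h + a @ v + b @ h)

# Exact Boltzmann distribution by enumerating all 16 states (beta = 1).
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=4)]
weights = np.array([np.exp(-energy(s[:2], s[2:])) for s in states])
P_exact = weights / weights.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gibbs sampling: each unit drawn from P(s_k = 1 | rest) = sigmoid(field).
s = np.zeros(4)
counts = np.zeros(16)
for t in range(60_000):
    v, h = s[:2], s[2:]                    # views into s
    for i in range(2):
        v[i] = float(rng.random() < sigmoid(W[i] @ h + a[i]))
    for j in range(2):
        h[j] = float(rng.random() < sigmoid(v @ W[:, j] + b[j]))
    if t > 1000:                           # discard burn-in
        idx = sum(int(s[k]) << (3 - k) for k in range(4))
        counts[idx] += 1

P_emp = counts / counts.sum()
print(np.max(np.abs(P_emp - P_exact)))     # small: the chain equilibrates
```

<p>The maximal deviation shrinks with the sample count, as it should for an ergodic chain whose stationary distribution is $e^{-\beta E}/Z$.</p>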
<p>Learning amounts to adjusting the weights to make the model distribution $P(\mathbf{v}, \mathbf{h})$ match the data distribution $P_{\text{data}}(\mathbf{v})$. The objective is to minimise the Kullback-Leibler divergence:</p>
$$\mathcal{L} = D_{\mathrm{KL}}(P_{\text{data}} \| P_{\text{model}}) = \sum_{\mathbf{v}} P_{\text{data}}(\mathbf{v}) \ln \frac{P_{\text{data}}(\mathbf{v})}{P_{\text{model}}(\mathbf{v})}$$<p>The gradient with respect to the weight $W_{ij}$ is</p>
$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\langle v_i h_j \rangle_{\text{data}} + \langle v_i h_j \rangle_{\text{model}}$$<p>The first term is the empirical correlation between visible unit $i$ and hidden unit $j$ when the visible units are clamped to data. The second term is the correlation in the model&rsquo;s free-running equilibrium. The learning rule says: increase $W_{ij}$ if the data sees these two units co-active more than the model does, and decrease it otherwise. This is Hebbian learning with a contrastive correction — the physics of equilibration drives the learning.</p>
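<p>On a model tiny enough to compute $\langle v_i h_j \rangle_{\text{model}}$ exactly by enumeration, the two-term rule demonstrably reduces the KL divergence. A sketch (Python with NumPy; the target distribution, learning rate, and initialisation are arbitrary) for a restricted machine with two visible units and one hidden unit:</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# Toy RBM: 2 visible units, 1 hidden unit, binary {0, 1}, beta = 1.
W = rng.normal(scale=0.1, size=(2, 1))   # small random init breaks symmetry
a = np.zeros(2)                          # visible biases
b = np.zeros(1)                          # hidden bias

V = [np.array(v, float) for v in itertools.product([0, 1], repeat=2)]
P_data = np.array([0.1, 0.2, 0.3, 0.4])  # target distribution over v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_marginal():
    """Exact P_model(v): sum the hidden unit out by enumeration."""
    z = np.array([sum(np.exp(v @ W @ h + a @ v + b @ h)
                      for h in (np.zeros(1), np.ones(1))) for v in V])
    return z / z.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

kl0 = kl(P_data, model_marginal())
eta = 0.2
for _ in range(300):
    Pm = model_marginal()
    sig = [sigmoid(v @ W + b) for v in V]            # P(h=1 | v)
    pos = sum(pd * np.outer(v, s) for pd, v, s in zip(P_data, V, sig))
    neg = sum(pm * np.outer(v, s) for pm, v, s in zip(Pm, V, sig))
    W += eta * (pos - neg)                           # <vh>_data - <vh>_model
    a += eta * (P_data - Pm) @ np.array(V)           # same rule for biases
    b += eta * sum((pd - pm) * s for pd, pm, s in zip(P_data, Pm, sig))
kl1 = kl(P_data, model_marginal())
print(kl0 > kl1)   # True: the model distribution moves toward the data
```

<p>The exact enumeration used for the negative term is precisely what becomes intractable at scale.</p>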
<p>The computational difficulty is the second term. Computing $\langle v_i h_j \rangle_{\text{model}}$ requires the Markov chain to reach equilibrium, which takes exponentially long in general. Hinton&rsquo;s later invention of contrastive divergence — run the chain for only a few steps rather than to equilibrium — made training feasible, at the cost of a biased gradient estimate. This engineering compromise is part of why the physics purists are uncomfortable: the original derivation is rigorous statistical mechanics, but the algorithm that actually works in practice is an approximation whose convergence properties are poorly understood.</p>
<p>I find this charming rather than damning. Physics itself is full of approximations whose convergence properties are poorly understood but which happen to give the right answers. Perturbation theory beyond leading order, the replica trick itself — these are not rigorous mathematics. They are informed guesses that happen to be correct. The history of theoretical physics is mostly the history of getting away with things.</p>
<h2 id="from-boltzmann-machines-to-transformers">From Boltzmann machines to transformers</h2>
<p>The Boltzmann machine was computationally difficult but conceptually foundational. The restricted Boltzmann machine (RBM) — with no within-layer connections, so that hidden units are conditionally independent given the visible units and vice versa — made training via contrastive divergence practical.</p>
<p>Hinton, Osindero, and Teh&rsquo;s 2006 paper on deep belief networks showed that stacking RBMs and pre-training them greedily could initialise deep networks well enough to fine-tune with backpropagation. This was the breakthrough that restarted deep learning after the winter of the 1990s. It is fair to say that without the Boltzmann machine as conceptual foundation and the RBM as practical building block, the deep learning revolution that gave us <a href="/posts/strawberry-tokenisation/">large language models that fail to count letters in words</a> would not have happened in the form it did.</p>
<p>The connection between Hopfield networks and modern attention mechanisms is more recent and more surprising. Ramsauer et al. (2020) showed that modern Hopfield networks — a generalisation of the original with continuous states and a different energy function — have exponential storage capacity (<a href="#ref-Ramsauer2020">Ramsauer et al., 2020</a>). More strikingly, the update rule of the modern Hopfield network is:</p>
$$\mathbf{s}^{\text{new}} = \mathbf{X}\, \text{softmax}\!\left(\beta \mathbf{X}^\top \mathbf{s}\right)$$<p>where $\mathbf{X}$ is the matrix of stored patterns and $\mathbf{s}$ is the query. This is the attention mechanism of the transformer, up to notation. The transformer&rsquo;s multi-head self-attention is, formally, a generalised Hopfield retrieval step. The architecture that powers GPT and everything descended from it is, at one level of abstraction, an associative memory performing energy minimisation on a Hopfield energy landscape.</p>
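<p>A sketch of the retrieval step (Python with NumPy; the dimensions, $\beta$, noise level, and seed are arbitrary) makes the correspondence concrete: the stored patterns play the role of keys and values, and the query is the state:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
d, p = 64, 8         # pattern dimension, number of stored patterns
beta = 8.0           # inverse temperature; large beta -> sharp retrieval

X = rng.normal(size=(d, p))            # stored patterns as columns
X /= np.linalg.norm(X, axis=0)         # normalise for comparable scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hopfield_update(s):
    # One step of the modern Hopfield rule: an attention read-out.
    return X @ softmax(beta * (X.T @ s))

query = X[:, 2] + 0.3 * rng.normal(size=d)   # noisy version of pattern 2
s = hopfield_update(query)

sims = X.T @ (s / np.linalg.norm(s))
print(int(np.argmax(sims)))   # index of the retrieved pattern: 2
```

<p>With $\beta$ large the softmax is nearly one-hot and a single update lands on the stored pattern; replacing $\mathbf{X}$ by separate learned key and value projections recovers, up to those projections, the transformer attention layer.</p>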
<p>I do not want to overstate this. The connection is formal and the interpretation is contested. But it is not nothing. The physicists who built the Hopfield network in 1982 were working on the same mathematical object that is now used to process language, images, and protein sequences at industrial scale.</p>
<h2 id="the-protein-folding-connection">The protein folding connection</h2>
<p>The 2024 Nobel Prize in Chemistry went to Demis Hassabis, John Jumper, and David Baker for computational protein structure prediction — specifically for AlphaFold2 (<a href="#ref-Jumper2021">Jumper et al., 2021</a>). This made October 2024 a remarkable month for Nobel Prizes in fields adjacent to artificial intelligence, and it is not a coincidence.</p>
<p>Protein folding is a spin-glass problem. A protein is a polymer of amino acids, each with different chemical properties and steric constraints. The protein folds into a unique three-dimensional structure — its native conformation — determined by its sequence. The energy landscape of the folding process is precisely the kind of rugged landscape that Parisi described for spin glasses: exponentially many misfolded states, separated by barriers, with the native structure as the global minimum (or close to it).</p>
<p>Levinthal&rsquo;s paradox, formulated in 1969, makes the absurdity quantitative. A modest protein of 100 amino acids might have $3^{100} \approx 5 \times 10^{47}$ possible conformations (allowing three dihedral angle states per residue). Random search of this space, at the rate of one conformation per picosecond, would take about $10^{35}$ seconds — roughly $10^{28}$ years, somewhat longer than the age of the universe. Yet proteins fold in milliseconds to seconds. They do not search randomly; the energy landscape is funnel-shaped, channelling the dynamics toward the native state. But predicting <em>which</em> state is the native one from sequence alone remained one of the hard problems of structural biology for fifty years.</p>
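<p>The arithmetic is worth checking once (Python, standard library only):</p>

```python
from math import log10

conformations = 3 ** 100                  # three dihedral states per residue
print(round(log10(conformations), 1))     # 47.7, i.e. about 5 x 10^47

seconds = conformations * 1e-12           # one conformation per picosecond
years = seconds / 3.156e7                 # seconds per year
print(round(log10(years), 1))             # 28.2: about 10^28 years
```

<p>Against a cosmic age of $\sim 1.4 \times 10^{10}$ years, the random search loses by eighteen orders of magnitude.</p>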
<p>AlphaFold2 uses a transformer architecture — descended from the Boltzmann machine lineage — trained on millions of known protein structures. It does not simulate the folding dynamics; it has learned, from data, a mapping from sequence to structure that encodes the statistical mechanics of the folding funnel. The Nobel committee gave it the Chemistry prize because it is transforming biochemistry. But the conceptual machinery is pure statistical physics: representation of a high-dimensional energy landscape, approximation of the minimum, learned from the distribution of solved instances.</p>
<p>The three Nobels of 2021–2024 form the most coherent consecutive triple I can remember: Parisi showed how disordered energy landscapes behave; Hopfield and Hinton showed how to use energy landscapes as memory and learning machines; Hassabis and Jumper showed how to apply the resulting architecture to the most consequential outstanding problem in molecular biology. Each step is a mathematical consequence of the one before it.</p>
<h2 id="the-controversy-did-the-committee-err">The controversy: did the committee err?</h2>
<p>I said I understand the irritation. Here is what is right about it.</p>
<p>Hinton&rsquo;s work after the Boltzmann machine — backpropagation, dropout, convolutional networks, deep learning at ImageNet scale — is primarily engineering and empirical machine learning. The 2012 AlexNet result that restarted the field was not a theoretical physics contribution; it was a demonstration that known methods work very well on very large datasets with very large GPUs. The fact that it works is not explained by statistical mechanics. The scaling laws of neural networks (loss scales as a power law with compute, parameters, and data) are empirical observations that physicists have tried to explain with renormalisation group arguments with mixed success.</p>
<p>If the Nobel Prize in Physics were awarded for &ldquo;the work that most influenced technology in the past decade,&rdquo; the case for Hinton is strong. If it were awarded for &ldquo;the most important contribution to the science of physics,&rdquo; the case is weaker. There is a version of the Nobel announcement that emphasises the Boltzmann machine specifically — the 1985 paper that is literally named after a physicist and uses his distribution — and that version sits cleanly within physics. There is a broader version that encompasses all of Hinton&rsquo;s career, and that version includes a great deal of empirical machine learning that the physics community is reasonably reluctant to claim.</p>
<p>My view, for what it is worth from someone who has been <a href="/posts/ai-warfare-anthropic-atom-bomb/">thinking about AI ethics and consequences</a> for rather longer than feels comfortable: the Nobel correctly identifies that the foundational conceptual contributions — the Ising Hamiltonian as associative memory, the Boltzmann distribution as a learning target, the connection between statistical mechanics and computation — are physics. They came from physicists, they use physics mathematics, they extend physics intuition into a new domain. The subsequent scaling of these ideas using TPUs and transformer architectures is engineering. Valuable engineering, world-changing engineering, but engineering. The Nobel is for the former. If the citation had been more specific — &ldquo;for the Boltzmann machine and its demonstration that physical principles govern neural computation&rdquo; — the physics community would have been less irritated and equally correct.</p>
<p>What the irritation reveals is something slightly uncomfortable about disciplinary identity. Physicists are proud of universality: the idea that the same mathematical structures appear in wildly different physical systems. RSB in spin glasses, replica methods in random matrices, the Parisi–Sourlas correspondence between disordered systems and supersymmetric field theories — the joy of physics is precisely that these deep structural similarities cross domain boundaries. When that universality reaches into machine learning and says &ldquo;your transformer attention layer is a Hopfield retrieval step,&rdquo; physicists should be delighted, not affronted.</p>
<p>The <a href="/posts/ralph-loop/">agentic systems</a> that are being built right now on top of transformer architectures are doing something that looks, from a sufficiently abstract distance, like what the Hopfield network was designed to do: find stored patterns that match a query, and use them to generate a response. The <a href="/posts/car-wash-grounding/">failures of grounding</a> that I have written about elsewhere are, in this view, failures of the energy landscape — the model finds a metastable state that is not the correct minimum, and the dynamics cannot escape. Spin glass physics does not explain these failures in detail, but it gives a language for thinking about them. That is what physics is for.</p>
<h2 id="the-universality-argument">The universality argument</h2>
<p>Let me make the deeper claim explicit. Why should disordered magnets, associative memory networks, and protein folding all live in the same mathematical family?</p>
<p>Because they all have the same structure: many interacting degrees of freedom with competing constraints, a combinatorially large configuration space, an energy landscape with exponentially many metastable states, and dynamics that search for — and frequently fail to find — global minima. This is a universality class. The specific details (magnetic moments versus neuron states versus dihedral angles) are irrelevant at the level of the energy landscape topology.</p>
<p>Parisi&rsquo;s contribution was to show that this class has a specific, exactly-solvable structure in mean field theory, characterised by replica symmetry breaking and the ultrametric organisation of states. This was not a solution to one model. It was a description of a universality class. The fact that the Hopfield model is in this class is not a coincidence requiring explanation; it is a mathematical identity requiring verification.</p>
<p>The <a href="/posts/kuramoto-ensemble-sync/">Kuramoto model for coupled oscillators</a> — which I have written about in the context of ensemble synchronisation and neural phase coupling — is another member of this extended family. The synchronisation transition in the Kuramoto model, the glass transition in the SK model, and the memory phase transition in the Hopfield model are all mean-field phase transitions in disordered many-body systems. The mathematics is more similar than the physics syllabi suggest.</p>
<p>When I teach physics and occasionally venture into questions about what the AI tools my students are using actually do, I find myself reaching for this framework. Not because it gives engineering insight into how to train a better model — it does not, particularly — but because it gives honest insight into <em>what kind of thing</em> a neural network is. It is a physical system. It has an energy landscape. Its failures are phase transitions. Its successes are energy minimisation. The vocabulary of statistical mechanics is not a metaphor; it is the correct description.</p>
<p>The Nobel committee noticed. They were right to notice.</p>
<hr>
<p><em>The 2021 and 2024 Nobel Prizes in Physics have now officially bridged the gap between condensed matter physics and machine learning in the public record. For anyone who wants to understand either field more deeply than the press releases suggest, the SK model and the Hopfield network are the right place to start. Both papers are short by modern standards — Parisi&rsquo;s 1979 letter is three pages; Hopfield&rsquo;s 1982 PNAS paper is five — and both repay close reading.</em></p>
<h2 id="references">References</h2>
<ul>
<li>
<p><span id="ref-Sherrington1975"></span>Sherrington, D., &amp; Kirkpatrick, S. (1975). Solvable model of a spin-glass. <em>Physical Review Letters</em>, 35(26), 1792–1796. <a href="https://doi.org/10.1103/PhysRevLett.35.1792">DOI: 10.1103/PhysRevLett.35.1792</a></p>
</li>
<li>
<p><span id="ref-Parisi1979"></span>Parisi, G. (1979). Infinite number of order parameters for spin-glasses. <em>Physical Review Letters</em>, 43(23), 1754–1756. <a href="https://doi.org/10.1103/PhysRevLett.43.1754">DOI: 10.1103/PhysRevLett.43.1754</a></p>
</li>
<li>
<p><span id="ref-Hopfield1982"></span>Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. <em>Proceedings of the National Academy of Sciences</em>, 79(8), 2554–2558. <a href="https://doi.org/10.1073/pnas.79.8.2554">DOI: 10.1073/pnas.79.8.2554</a></p>
</li>
<li>
<p><span id="ref-Ackley1985"></span>Ackley, D. H., Hinton, G. E., &amp; Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. <em>Cognitive Science</em>, 9(1), 147–169. <a href="https://doi.org/10.1207/s15516709cog0901_7">DOI: 10.1207/s15516709cog0901_7</a></p>
</li>
<li>
<p><span id="ref-Amit1985"></span>Amit, D. J., Gutfreund, H., &amp; Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. <em>Physical Review Letters</em>, 55(14), 1530–1533. <a href="https://doi.org/10.1103/PhysRevLett.55.1530">DOI: 10.1103/PhysRevLett.55.1530</a></p>
</li>
<li>
<p><span id="ref-Jumper2021"></span>Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. <em>Nature</em>, 596, 583–589. <a href="https://doi.org/10.1038/s41586-021-03819-2">DOI: 10.1038/s41586-021-03819-2</a></p>
</li>
<li>
<p><span id="ref-Ramsauer2020"></span>Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., &amp; Hochreiter, S. (2020). Hopfield networks is all you need. <em>arXiv:2008.02217</em>. Retrieved from <a href="https://arxiv.org/abs/2008.02217">https://arxiv.org/abs/2008.02217</a></p>
</li>
<li>
<p><span id="ref-Nobel2024"></span>Nobel Prize Committee. (2024). Scientific background: Machine learning and physical systems. The Royal Swedish Academy of Sciences. Retrieved from <a href="https://www.nobelprize.org/prizes/physics/2024/advanced-information/">https://www.nobelprize.org/prizes/physics/2024/advanced-information/</a></p>
</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
