On October 8, 2024, the Royal Swedish Academy of Sciences announced that the Nobel Prize in Physics would go to John Hopfield and Geoffrey Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks.” Within hours, the physics corner of the internet had an episode. Thermodynamics Twitter — yes, that is a thing — asked whether gradient descent is really physics in the sense that the Higgs mechanism is physics. The condensed matter community, who have been doing disordered systems since before most ML practitioners were born, oscillated between pride (“finally, they noticed us”) and bafflement (“why is Hinton here and not Parisi?”). There were takes. There were dunks. Someone made a graph of Nobel prizes versus average journal impact factor and it was not flattering to this year’s winner.

I understand the irritation. I do not share it.

The argument I want to make is stronger than “machine learning uses some physics concepts by analogy.” The energy function that Hopfield wrote down in 1982 is not inspired by the Ising Hamiltonian. It is the Ising Hamiltonian. The machine that Hinton and Sejnowski built in 1985 is not named after Boltzmann as a cute metaphor. It is a physical system whose equilibrium distribution is the Boltzmann distribution, and whose learning algorithm is derived from statistical mechanics. The lineage from disordered magnets to protein structure prediction is not a convenient narrative; it is a sequence of mathematical identities.

Let me trace it properly.

The 2021 Nobel: Parisi and the frozen magnet

Before we get to 2024, we need 2021. Giorgio Parisi received half the Nobel Prize in Physics that year for work done between 1979 and 1983 on spin glasses. The other half went to Syukuro Manabe and Klaus Hasselmann for climate modelling — an interesting pairing that provoked its own set of takes, though rather fewer.

A spin glass is a disordered magnetic system. The canonical physical realisation is a dilute alloy: a small concentration of manganese atoms dissolved in copper. Each manganese atom carries a magnetic moment — a spin — that can point in one of two directions, which we label $\sigma_i \in \{-1, +1\}$. The spins interact with each other via exchange interactions mediated by the conduction electrons. The crucial feature is that these interactions are random: some spin pairs prefer to align (ferromagnetic coupling, $J_{ij} > 0$) and others prefer to anti-align (antiferromagnetic coupling, $J_{ij} < 0$), and there is no spatial pattern to which is which.

The Hamiltonian of the system is

$$H = -\sum_{i < j} J_{ij} \sigma_i \sigma_j$$

where the $J_{ij}$ are random variables drawn from some distribution. In the Sherrington-Kirkpatrick (SK) model (Sherrington & Kirkpatrick, 1975), all $N$ spins interact with all other spins — a mean-field model — and the couplings are drawn from a Gaussian distribution with mean zero and variance $J^2/N$:

$$J_{ij} \sim \mathcal{N}\!\left(0,\, \frac{J^2}{N}\right)$$

The factor of $1/N$ in the variance is essential for extensivity: it is what makes the free energy scale linearly with $N$, so that the model has a sensible thermodynamic limit.
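For readers who want to see the scales directly, here is a minimal numerical sketch (NumPy; the function name and seed are my own, not from any cited paper). It draws SK couplings with variance $J^2/N$ and shows that the energy of a typical random spin configuration fluctuates on a scale of order $\sqrt{N}$; the optimised and averaged quantities of the theory are what come out proportional to $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sk_energy(N, J=1.0):
    """Energy of one random spin configuration with fresh SK couplings."""
    # Couplings J_ij ~ N(0, J^2/N); keep only i < j to avoid double counting.
    Jmat = np.triu(rng.normal(0.0, J / np.sqrt(N), size=(N, N)), k=1)
    sigma = rng.choice([-1, 1], size=N)
    return -np.sum(Jmat * np.outer(sigma, sigma))

# The energy of a random configuration fluctuates on a scale ~ sqrt(N J^2 / 2).
for N in (100, 400, 1600):
    Es = [sk_energy(N) for _ in range(50)]
    print(N, round(float(np.std(Es)), 1))
```

Doubling $N$ four-fold should roughly double the printed fluctuation scale, which is what the variance choice is engineered to produce.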

Now here is the key phenomenon. At high temperature, the spins fluctuate freely and the system is paramagnetic. Cool it below the glass transition temperature $T_g$, and the system “freezes” — but not into a ferromagnet with all spins aligned, and not into a simple antiferromagnet. It freezes into one of an astronomically large number of disordered, metastable states. The system is not in its true ground state; it is trapped. It cannot find its way down because the energy landscape is rugged: every path toward lower energy is blocked by a barrier.

This rugged landscape is the central object. It has exponentially many local minima, separated by barriers that grow with system size. Different initial conditions lead to different frozen states. The system has memory of its history — hence “glass” rather than “crystal.”

Computing thermodynamic quantities in this system requires averaging over the disorder (the random $J_{ij}$), which means computing the quenched average of the free energy:

$$\overline{F} = -T\, \overline{\ln Z}$$

The overline denotes an average over the distribution of couplings. The problem is that $\ln Z$ is hard to average because $Z$ is a sum of exponentially many terms. The standard route around this — the replica trick, introduced to spin glasses by Edwards and Anderson and used by Sherrington and Kirkpatrick — is a mathematical device worth describing, because it is beautifully strange. Parisi’s contribution was to find the correct solution within it.

The trick exploits the identity $\ln Z = \lim_{n \to 0} (Z^n - 1)/n$. We compute $\overline{Z^n}$ for integer $n$, which is feasible because $Z^n$ is a product of $n$ copies (replicas) of the partition function, and the average over disorder decouples. We then analytically continue in $n$ to $n \to 0$. The result is an effective action in terms of order parameters $q^{ab}$, which describe the overlap between spin configurations in replica $a$ and replica $b$.
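The analytic continuation sounds outrageous, but the identity itself is elementary and can be checked in a few lines (a scalar stand-in for $Z$; the real power of the trick is that $\overline{Z^n}$ is tractable for integer $n$):

```python
import math

Z = 37.5  # stand-in for a partition function (any positive number)
for n in (1.0, 0.1, 0.01, 0.001):
    print(n, (Z**n - 1) / n)  # approaches ln Z as n -> 0
print(math.log(Z))
```

The printed values converge linearly in $n$ toward $\ln Z$, which is all the identity promises; the daring part is trusting the continuation after the disorder average.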

The naive assumption is replica symmetry: all $q^{ab}$ are equal. This assumption turns out to be wrong. Parisi showed that the correct solution breaks replica symmetry in a hierarchical way — the overlap matrix $q^{ab}$ has a nested structure, described by a function $q(x)$ for $x \in [0,1]$. This is replica symmetry breaking (RSB).

RSB has a beautiful physical interpretation. The phase space of the spin glass is organised into an ultrametric tree: exponentially many states, arranged in nested clusters. States in the same cluster are similar (high overlap); states in different clusters are very different (low overlap). The hierarchy has infinitely many levels. Parisi proposed this structure for the SK model (Parisi, 1979) and, with collaborators, worked out its physical meaning over the following years; a fully rigorous proof of the Parisi solution came decades later, from Guerra and Talagrand.

This is not an abstraction. RSB predicts specific, measurable properties of real spin glass alloys, and experiments have confirmed them. It is also, I want to emphasise, not a result that anyone expected. The mathematics forced it.

Three years after Parisi solved the SK model, a physicist at Bell Labs wrote a paper about memory.

Hopfield (1982): memory as energy minimisation

John Hopfield was a condensed matter physicist who had drifted toward biophysics — electron transfer in proteins, neural computation. In 1982 he published a paper in PNAS with the title “Neural networks and physical systems with emergent collective computational abilities” (Hopfield, 1982). Most biologists read it as a neuroscience paper. It is a statistical mechanics paper.

Hopfield defined a network of $N$ binary “neurons” $s_i \in \{-1, +1\}$ with symmetric weights $W_{ij} = W_{ji}$, and an energy function:

$$E = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, s_i s_j$$

Readers who have seen the SK Hamiltonian above will notice something. This is it. The $J_{ij}$ of the spin glass are the $W_{ij}$ of the neural network. The Ising spins $\sigma_i$ are the neuron states $s_i$. The Hopfield network energy function is the Ising model Hamiltonian, with symmetric couplings that are fixed by the stored patterns rather than drawn independently at random. This is not a metaphor. This is the same equation.

The dynamics: at each step, choose a neuron $i$ at random and update it according to

$$s_i \leftarrow \text{sgn}\!\left(\sum_{j} W_{ij} s_j\right)$$

Provided the weights are symmetric and the self-couplings are zero, each asynchronous update decreases the energy $E$ or leaves it unchanged. The network performs a greedy coordinate descent on $E$, and it always converges to a local minimum — a fixed point.

The innovation is in how Hopfield chose the weights. To store a set of $p$ binary patterns $\xi^\mu \in \{-1,+1\}^N$ (for $\mu = 1, \ldots, p$), use Hebb’s rule:

$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi^\mu_i\, \xi^\mu_j$$

This is the outer product rule. Each stored pattern contributes a rank-1 matrix to $W$. You can verify that if $s = \xi^\mu$, then the local field at neuron $i$ is

$$h_i = \sum_j W_{ij} s_j = \frac{1}{N}\sum_j \sum_{\nu} \xi^\nu_i \xi^\nu_j \xi^\mu_j = \xi^\mu_i + \frac{1}{N}\sum_{\nu \neq \mu} \xi^\nu_i \underbrace{\left(\sum_j \xi^\nu_j \xi^\mu_j\right)}_{\text{cross-talk}}$$

The first term reinforces pattern $\mu$. The second term is noise from the other stored patterns. When the patterns are random and uncorrelated, each cross-talk term has zero mean and standard deviation of order $\sqrt{p/N}$, so for $p \ll N$ the signal dominates and the stored patterns are stable fixed points of the dynamics. A noisy or incomplete input — a partial pattern — will evolve under the dynamics toward the nearest stored pattern. This is associative memory: content-addressable retrieval.
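The whole retrieval story fits in a short script. This is my own minimal sketch of Hebbian storage and asynchronous retrieval (the sizes, seed, and 20% corruption level are arbitrary choices, not anything from Hopfield’s paper):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 5

# Hebb's outer-product rule over p random patterns; zero the diagonal.
xi = rng.choice([-1, 1], size=(p, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

def energy(s):
    return -0.5 * s @ W @ s

def retrieve(s, sweeps=5):
    """Asynchronous sign updates; each flip can only lower the energy."""
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(N):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt 20% of the first pattern, then let the dynamics relax.
probe = xi[0].copy()
probe[rng.choice(N, size=N // 5, replace=False)] *= -1

out = retrieve(probe)
print("overlap before:", probe @ xi[0] / N)  # 0.6 exactly: 40 spins flipped
print("overlap after: ", out @ xi[0] / N)    # close to 1: pattern recovered
print("energy went down:", energy(out) <= energy(probe))
```

With $p/N = 0.025$, well below the capacity limit, the corrupted probe falls back into the correct basin; push $p$ toward $0.14N$ and retrieval starts failing in exactly the way the phase diagram predicts.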

The capacity limit follows from the same analysis. As $p$ grows, the cross-talk grows. When $p$ exceeds approximately $0.14N$, the cross-talk overwhelms the signal, and the network begins to form spurious minima — states that are not any of the stored patterns but are mixtures or corruptions of them. The network has entered a spin-glass phase.

This is not a rough analogy. Amit, Gutfreund, and Sompolinsky showed in 1985 that the Hopfield model can be analysed with exactly the replica machinery developed for the SK model, as a spin glass with $p$ planted minima (Amit, Gutfreund, & Sompolinsky, 1985). The phase diagram of the Hopfield model — paramagnetic phase, memory phase, spin-glass phase — maps precisely onto that of a spin glass. The capacity limit $p \approx 0.14N$ is the phase boundary between the memory phase and the spin-glass phase, derivable with the replica method.

The 2021 Nobel and the 2024 Nobel are, mathematically, about the same model.

Boltzmann machines (Ackley, Hinton & Sejnowski, 1985)

The Hopfield model is deterministic and shallow — one layer of visible neurons, no hidden structure. Geoffrey Hinton and Terry Sejnowski, in a collaboration that culminated in the 1985 paper with David Ackley (Ackley, Hinton, & Sejnowski, 1985), added two things: hidden units and stochastic dynamics.

Hidden units $h_j$ are neurons not connected to any input or output. They do not correspond to observable quantities; they model latent structure in the data. The energy of the system is:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} W_{ij}\, v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$$

where $v_i$ are the visible (data) units, $h_j$ are the hidden units, and $a_i$ and $b_j$ are biases. (I have written the restricted, bipartite form, with no connections within a layer; the general Boltzmann machine allows those as well.) Note that this is still an Ising-type energy; the $W_{ij}$ are now inter-layer weights.

The stochastic dynamics replace deterministic descent with a Markov chain. Each unit, now taking values in $\{0, 1\}$, is updated probabilistically:

$$P(s_k = 1 \mid \text{rest}) = \sigma\!\left(\sum_j W_{kj} s_j + \text{bias}_k\right)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid. At inverse temperature $\beta = 1/T$, the probability of any complete configuration is

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-\beta E(\mathbf{v}, \mathbf{h})}$$

This is the Boltzmann distribution. The machine is named after Ludwig Boltzmann because the equilibrium distribution of its states is the Boltzmann distribution. Not analogously. Literally.
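This claim is checkable by brute force on a network small enough to enumerate. The sketch below is my own construction (four units, random weights, $\beta = 1$, $\{0,1\}$ states): it compares the empirical distribution of a Gibbs sampler against the exact Boltzmann distribution computed over all sixteen joint states.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nv, nh = 2, 2
W = rng.normal(0, 1, (nv, nh))
a = rng.normal(0, 0.5, nv)
b = rng.normal(0, 0.5, nh)

def energy(v, h):
    return -(v @ W @ h + a @ v + b @ h)

# Exact Boltzmann distribution by enumerating all 16 joint states.
states = list(itertools.product([0.0, 1.0], repeat=nv + nh))
boltz = np.array([np.exp(-energy(np.array(s[:nv]), np.array(s[nv:]))) for s in states])
P_exact = boltz / boltz.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gibbs sampling: alternately resample h given v, then v given h.
v, h = np.zeros(nv), np.zeros(nh)
counts = dict.fromkeys(states, 0)
for t in range(100_000):
    h = (rng.random(nh) < sigmoid(v @ W + b)).astype(float)
    v = (rng.random(nv) < sigmoid(W @ h + a)).astype(float)
    if t >= 1000:  # discard burn-in
        counts[tuple(v) + tuple(h)] += 1

P_emp = np.array([counts[s] for s in states]) / sum(counts.values())
print(np.max(np.abs(P_emp - P_exact)))  # small: the chain samples Boltzmann
```

The bipartite connectivity is what makes the block updates clean: given the visibles, the hiddens are conditionally independent, and vice versa.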

Learning amounts to adjusting the weights to make the model distribution $P(\mathbf{v}, \mathbf{h})$ match the data distribution $P_{\text{data}}(\mathbf{v})$. The objective is to minimise the Kullback-Leibler divergence:

$$\mathcal{L} = D_{\mathrm{KL}}(P_{\text{data}} \| P_{\text{model}}) = \sum_{\mathbf{v}} P_{\text{data}}(\mathbf{v}) \ln \frac{P_{\text{data}}(\mathbf{v})}{P_{\text{model}}(\mathbf{v})}$$

The gradient with respect to the weight $W_{ij}$ is

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = -\langle v_i h_j \rangle_{\text{data}} + \langle v_i h_j \rangle_{\text{model}}$$

The first term is the empirical correlation between visible unit $i$ and hidden unit $j$ when the visible units are clamped to data. The second term is the correlation in the model’s free-running equilibrium. The learning rule says: increase $W_{ij}$ if the two units are co-active more often under the data than under the model, and decrease it otherwise. This is Hebbian learning with a contrastive correction — the physics of equilibration drives the learning.

The computational difficulty is the second term. Computing $\langle v_i h_j \rangle_{\text{model}}$ requires the Markov chain to reach equilibrium, which takes exponentially long in general. Hinton’s later invention of contrastive divergence — run the chain for only a few steps rather than to equilibrium — made training feasible, at the cost of a biased gradient estimate. This engineering compromise is part of why the physics purists are uncomfortable: the original derivation is rigorous statistical mechanics, but the algorithm that actually works in practice is an approximation whose convergence properties are poorly understood.
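A CD-1 step is only a few lines. The sketch below is my own toy implementation for a $\{0,1\}$ restricted Boltzmann machine (the names and the tiny two-pattern dataset are invented for illustration); note that the negative phase runs the chain for a single step rather than to equilibrium, which is exactly the biased shortcut described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v_data, lr=0.1):
    """One contrastive-divergence (CD-1) step for a {0,1} RBM."""
    # Positive phase: hidden statistics with visibles clamped to data.
    ph_data = sigmoid(v_data @ W + b)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: a single reconstruction step, not full equilibrium.
    v_model = (rng.random(v_data.shape) < sigmoid(h_data @ W.T + a)).astype(float)
    ph_model = sigmoid(v_model @ W + b)
    # <v h>_data - <v h>_model, averaged over the batch.
    W += lr * (v_data.T @ ph_data - v_model.T @ ph_model) / len(v_data)
    a += lr * (v_data - v_model).mean(axis=0)
    b += lr * (ph_data - ph_model).mean(axis=0)
    return W, a, b

# Toy data: two complementary 6-bit patterns, two hidden units.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
W = 0.01 * rng.normal(size=(6, 2))
a, b = np.zeros(6), np.zeros(2)
for _ in range(2000):
    W, a, b = cd1_update(W, a, b, data)

recon = sigmoid(sigmoid(data @ W + b) @ W.T + a)
print(np.abs(recon - data).mean())  # small: the RBM has learned both patterns
```

Even on this trivial problem the gradient is biased; it works anyway, which is rather the point of the paragraph above.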

I find this charming rather than damning. Physics itself is full of approximations whose convergence properties are poorly understood but which happen to give right answers. Perturbation theory beyond leading order, the replica trick itself — these are not rigorous mathematics. They are informed guesses that happen to be correct. The history of theoretical physics is mostly the history of getting away with things.

From Boltzmann machines to transformers

The Boltzmann machine was computationally difficult but conceptually foundational. The restricted Boltzmann machine (RBM) — with no within-layer connections, so that hidden units are conditionally independent given the visible units and vice versa — made training via contrastive divergence practical.

Hinton, Osindero, and Teh’s 2006 paper on deep belief networks showed that stacking RBMs and pre-training them greedily could initialise deep networks well enough to fine-tune with backpropagation. This was the breakthrough that restarted deep learning after the winter of the 1990s. It is fair to say that without the Boltzmann machine as conceptual foundation and the RBM as practical building block, the deep learning revolution that gave us large language models that fail to count letters in words would not have happened in the form it did.

The connection between Hopfield networks and modern attention mechanisms is more recent and more surprising. Ramsauer et al. (2020) showed that modern Hopfield networks — a generalisation of the original with continuous states and a different energy function — have exponential storage capacity. More strikingly, the update rule of the modern Hopfield network is:

$$\mathbf{s}^{\text{new}} = \mathbf{X}\, \text{softmax}\!\left(\beta \mathbf{X}^\top \mathbf{s}\right)$$

where $\mathbf{X}$ is the matrix of stored patterns and $\mathbf{s}$ is the query. This is the attention mechanism of the transformer, up to notation. The transformer’s multi-head self-attention is, formally, a generalised Hopfield retrieval step. The architecture that powers GPT and everything descended from it is, at one level of abstraction, an associative memory performing energy minimisation on a Hopfield energy landscape.
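The identification is easy to verify numerically. Below is my own minimal sketch of the update (the pattern matrix, $\beta$, and dimensions are arbitrary choices): a noisy query, one softmax-weighted read-out over the stored patterns, and the result lands on the right memory. A single attention step acting as retrieval.

```python
import numpy as np

rng = np.random.default_rng(7)
d, M = 64, 10                        # pattern dimension, number of memories
X = rng.normal(size=(d, M))
X /= np.linalg.norm(X, axis=0)       # unit-norm stored patterns (columns)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hopfield_update(s, beta=8.0):
    """Modern Hopfield step: s_new = X softmax(beta X^T s).
    With keys = values = X this is single-query attention."""
    return X @ softmax(beta * (X.T @ s))

# Query with a noisy version of stored pattern 0.
s = X[:, 0] + 0.3 * rng.normal(size=d)
out = hopfield_update(s)
print(int(np.argmax(X.T @ out)))     # index of the retrieved pattern
```

The inverse temperature $\beta$ plays the role of the attention scaling; large $\beta$ makes the softmax nearly one-hot and the retrieval nearly exact.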

I do not want to overstate this. The connection is formal and the interpretation is contested. But it is not nothing. The physicists who built the Hopfield network in 1982 were working on the same mathematical object that is now used to process language, images, and protein sequences at industrial scale.

The protein folding connection

The 2024 Nobel Prize in Chemistry went to Demis Hassabis, John Jumper, and David Baker for computational protein structure prediction — specifically for AlphaFold2 (Jumper et al., 2021). This made October 2024 a remarkable month for Nobel Prizes in fields adjacent to artificial intelligence, and it is not a coincidence.

Protein folding is a spin-glass problem. A protein is a polymer of amino acids, each with different chemical properties and steric constraints. The protein folds into a unique three-dimensional structure — its native conformation — determined by its sequence. The energy landscape of the folding process is precisely the kind of rugged landscape that Parisi described for spin glasses: exponentially many misfolded states, separated by barriers, with the native structure as the global minimum (or close to it).

Levinthal’s paradox, formulated in 1969, makes the absurdity quantitative. A modest protein of 100 amino acids might have $3^{100} \approx 5 \times 10^{47}$ possible conformations (allowing three dihedral angle states per residue). Random search of this space, at the rate of one conformation per picosecond, would take about $10^{35}$ seconds, on the order of $10^{28}$ years — vastly longer than the age of the universe. Yet proteins fold in milliseconds to seconds. They do not search randomly; the energy landscape is funnel-shaped, channelling the dynamics toward the native state. But predicting which state is the native one from sequence alone remained one of the hard problems of structural biology for fifty years.
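The arithmetic is worth checking directly (one conformation per picosecond, a Julian year of $3.156 \times 10^7$ seconds):

```python
import math

conformations = 3 ** 100            # three dihedral states per residue
seconds = conformations / 1e12      # one conformation per picosecond
years = seconds / 3.156e7
print(f"{conformations:.2e} conformations")  # 5.15e+47 conformations
print(f"{years:.1e} years")                  # 1.6e+28 years
```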

AlphaFold2 uses a transformer architecture — descended from the Boltzmann machine lineage — trained on millions of known protein structures. It does not simulate the folding dynamics; it has learned, from data, a mapping from sequence to structure that encodes the statistical mechanics of the folding funnel. The Nobel committee gave it the Chemistry prize because it is transforming biochemistry. But the conceptual machinery is pure statistical physics: representation of a high-dimensional energy landscape, approximation of the minimum, learned from the distribution of solved instances.

The three Nobels of 2021–2024 form the most coherent consecutive triple I can remember: Parisi showed how disordered energy landscapes behave; Hopfield and Hinton showed how to use energy landscapes as memory and learning machines; Hassabis and Jumper showed how to apply the resulting architecture to the most consequential outstanding problem in molecular biology. Each step is a mathematical consequence of the one before it.

The controversy: did the committee err?

I said I understand the irritation. Here is what is right about it.

Hinton’s work after the Boltzmann machine — backpropagation, dropout, convolutional networks, deep learning at ImageNet scale — is primarily engineering and empirical machine learning. The 2012 AlexNet result that restarted the field was not a theoretical physics contribution; it was a demonstration that known methods work very well on very large datasets with very large GPUs. The fact that it works is not explained by statistical mechanics. The scaling laws of neural networks (loss falls as a power law in compute, parameters, and data) are empirical observations that physicists have tried, with mixed success, to explain using renormalisation-group arguments.

If the Nobel Prize in Physics were awarded for “the work that most influenced technology in the past decade,” the case for Hinton is strong. If it were awarded for “the most important contribution to the science of physics,” the case is weaker. There is a version of the Nobel announcement that emphasises the Boltzmann machine specifically — the 1985 paper that is literally named after a physicist and uses his distribution — and that version sits cleanly within physics. There is a broader version that encompasses all of Hinton’s career, and that version includes a great deal of empirical machine learning that the physics community is reasonably reluctant to claim.

My view, for what it is worth from someone who has been thinking about AI ethics and consequences for rather longer than feels comfortable: the Nobel correctly identifies that the foundational conceptual contributions — the Ising Hamiltonian as associative memory, the Boltzmann distribution as a learning target, the connection between statistical mechanics and computation — are physics. They came from physicists, they use physics mathematics, they extend physics intuition into a new domain. The subsequent scaling of these ideas using TPUs and transformer architectures is engineering. Valuable engineering, world-changing engineering, but engineering. The Nobel is for the former. If the citation had been more specific — “for the Boltzmann machine and its demonstration that physical principles govern neural computation” — the physics community would have been less irritated and equally correct.

What the irritation reveals is something slightly uncomfortable about disciplinary identity. Physicists are proud of universality: the idea that the same mathematical structures appear in wildly different physical systems. RSB in spin glasses, replica methods in random matrices, the Parisi–Sourlas correspondence between disordered systems and supersymmetric field theories — the joy of physics is precisely that these deep structural similarities cross domain boundaries. When that universality reaches into machine learning and says “your transformer attention layer is a Hopfield retrieval step,” physicists should be delighted, not affronted.

The agentic systems that are being built right now on top of transformer architectures are doing something that looks, from a sufficiently abstract distance, like what the Hopfield network was designed to do: find stored patterns that match a query, and use them to generate a response. The failures of grounding that I have written about elsewhere are, in this view, failures of the energy landscape — the model finds a metastable state that is not the correct minimum, and the dynamics cannot escape. Spin glass physics does not explain these failures in detail, but it gives a language for thinking about them. That is what physics is for.

The universality argument

Let me make the deeper claim explicit. Why should disordered magnets, associative memory networks, and protein folding all live in the same mathematical family?

Because they all have the same structure: many interacting degrees of freedom with competing constraints, a combinatorially large configuration space, an energy landscape with exponentially many metastable states, and dynamics that search for — and frequently fail to find — global minima. This is a universality class. The specific details (magnetic moments versus neuron states versus dihedral angles) are irrelevant at the level of the energy landscape topology.

Parisi’s contribution was to show that this class has a specific, exactly-solvable structure in mean field theory, characterised by replica symmetry breaking and the ultrametric organisation of states. This was not a solution to one model. It was a description of a universality class. The fact that the Hopfield model is in this class is not a coincidence requiring explanation; it is a mathematical identity requiring verification.

The Kuramoto model for coupled oscillators — which I have written about in the context of ensemble synchronisation and neural phase coupling — is another member of this extended family. The synchronisation transition in the Kuramoto model, the glass transition in the SK model, and the memory phase transition in the Hopfield model are all mean-field phase transitions in disordered many-body systems. The mathematics is more similar than the physics syllabi suggest.
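To make the family resemblance concrete, here is a minimal mean-field Kuramoto simulation (my own sketch; Lorentzian frequencies with unit width give the classic critical coupling $K_c = 2$). Below $K_c$ the oscillators stay incoherent; above it a finite fraction phase-locks and the order parameter $r$ becomes macroscopic.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dt, steps = 2000, 0.05, 1000
omega = rng.standard_cauchy(N)   # Lorentzian g(w), width 1 => K_c = 2

def order_parameter_after(K):
    """Integrate the mean-field Kuramoto dynamics, return final r."""
    theta = rng.uniform(0, 2 * np.pi, N)
    for _ in range(steps):
        z = np.mean(np.exp(1j * theta))   # z = r e^{i psi}
        theta += dt * (omega + K * np.abs(z) * np.sin(np.angle(z) - theta))
    return float(np.abs(np.mean(np.exp(1j * theta))))

print(order_parameter_after(1.0))  # below K_c: r stays near 0 (incoherent)
print(order_parameter_after(4.0))  # above K_c: r is macroscopic (~0.7 in theory)
```

The jump in $r$ across $K_c$ is the mean-field transition; it is the same structural story as the freezing transition in the SK model and the memory transition in the Hopfield model, with phases in place of spins.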

When I teach physics and occasionally venture into questions about what the AI tools my students are using actually do, I find myself reaching for this framework. Not because it gives engineering insight into how to train a better model — it does not, particularly — but because it gives honest insight into what kind of thing a neural network is. It is a physical system. It has an energy landscape. Its failures are phase transitions. Its successes are energy minimisation. The vocabulary of statistical mechanics is not a metaphor; it is the correct description.

The Nobel committee noticed. They were right to notice.


The 2021 and 2024 Nobel Prizes in Physics have now officially bridged the gap between condensed matter physics and machine learning in the public record. For anyone who wants to understand either field more deeply than the press releases suggest, the SK model and the Hopfield network are the right place to start. Both papers are short by modern standards — Parisi’s 1979 letter is three pages; Hopfield’s 1982 PNAS paper is five — and both repay close reading.

References

  • Sherrington, D., & Kirkpatrick, S. (1975). Solvable model of a spin-glass. Physical Review Letters, 35(26), 1792–1796. DOI: 10.1103/PhysRevLett.35.1792

  • Parisi, G. (1979). Infinite number of order parameters for spin-glasses. Physical Review Letters, 43(23), 1754–1756. DOI: 10.1103/PhysRevLett.43.1754

  • Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554–2558. DOI: 10.1073/pnas.79.8.2554

  • Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147–169. DOI: 10.1207/s15516709cog0901_7

  • Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14), 1530–1533. DOI: 10.1103/PhysRevLett.55.1530

  • Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. DOI: 10.1038/s41586-021-03819-2

  • Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2020). Hopfield networks is all you need. arXiv:2008.02217. Retrieved from https://arxiv.org/abs/2008.02217

  • Nobel Prize Committee. (2024). Scientific background: Machine learning and physical systems. The Royal Swedish Academy of Sciences. Retrieved from https://www.nobelprize.org/prizes/physics/2024/advanced-information/