<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Audio on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/audio/</link>
    <description>Recent content in Audio on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Fri, 15 Aug 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/audio/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>From Oxide to Oversampling: The Physics of Recorded Sound</title>
      <link>https://sebastianspicker.github.io/posts/tape-saturation-delta-sigma-adc-physics/</link>
      <pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/tape-saturation-delta-sigma-adc-physics/</guid>
      <description>&amp;lsquo;Analogue warmth&amp;rsquo; and &amp;lsquo;digital coldness&amp;rsquo; are not aesthetic preferences — they are different physics. Ferromagnetic hysteresis generates even harmonics. Delta-sigma modulators push quantisation noise to ultrasonic frequencies. Both effects are calculable.</description>
      <content:encoded><![CDATA[<p>There is an argument that has been running in recording studios since roughly 1982, when the first commercially mastered compact discs appeared. On one side: analogue tape has warmth, depth, something the ear likes. On the other: digital audio is more accurate, lower noise, the measurements say so. The argument produces more heat than light, because most participants treat it as an aesthetic question — a matter of feeling, taste, preference. It is not. The difference between tape and digital audio is a physics difference, and the physics is specific enough to calculate.</p>
<p>The physics here turns out to be some of my favourite kind: it sits at the intersection of condensed matter, signal processing, and Fourier analysis, and it connects directly to why certain sounds are perceived as pleasant. This post walks through both sides. Part I is the ferromagnetic physics of magnetic tape and the harmonic structure of saturation distortion. Part II is the delta-sigma modulator and the engineering trick that achieves 24-bit dynamic range from a 1-bit comparator. Neither side of the debate is as simple as its partisans claim, and the physics of both is more interesting than the aesthetics argument they have been stuck in for forty years.</p>
<hr>
<h2 id="part-i-the-physics-of-magnetic-tape">Part I: The Physics of Magnetic Tape</h2>
<h3 id="ferromagnetic-recording">Ferromagnetic Recording</h3>
<p>Magnetic recording tape is a thin polymer substrate coated with a layer of ferromagnetic particles suspended in a binder. For most of the twentieth century those particles were iron oxide — specifically $\gamma\text{-Fe}_2\text{O}_3$, gamma-phase ferric oxide — though chromium dioxide ($\text{CrO}_2$) and later metal-particle formulations with pure iron or iron-cobalt alloys were developed for higher coercivity and better high-frequency response. What all of these materials share is the key property of ferromagnetism: each particle is a small permanent magnet, a magnetic domain with a net magnetic moment that can be oriented by an external field and that will retain that orientation when the field is removed.</p>
<p>The recording process exploits this directly. The recording head is a toroidal electromagnet with a narrow gap. When audio-frequency current flows through the head&rsquo;s coil, the field at the gap follows the current, and as the tape moves past at a fixed speed, successive particles along the tape length are aligned according to the instantaneous field at the moment they pass the gap. The result is a spatial encoding of the time-domain audio signal along the tape. On playback, the inverse process occurs: the moving pattern of magnetised particles generates a time-varying flux in the playback head&rsquo;s core, which induces a voltage in the coil by Faraday&rsquo;s law, reproducing the original current waveform.</p>
<p>So far this description is entirely linear. The head current maps to a field, the field maps to a magnetisation, the magnetisation maps back to a voltage. If all three relationships were linear, tape would be a near-perfect recording medium — limited only by particle noise and head gap frequency response. The nonlinearity comes from the second relationship in that chain, and it comes from the fundamental physics of how ferromagnetic materials respond to an applied field.</p>
<h3 id="the-b-h-curve-and-hysteresis">The B-H Curve and Hysteresis</h3>
<p>The relationship between the applied magnetic field intensity $H$ (from the recording head, measured in A/m) and the resulting magnetic flux density $B$ in the tape (measured in tesla) is not linear. It follows a curve — actually a family of nested curves — known as the hysteresis loop, and its shape determines almost everything interesting about tape recording <a href="#ref-3">[3]</a>.</p>
<p>Starting from a demagnetised state and increasing $H$ from zero, the initial slope $dB/dH$ — the magnetic permeability $\mu$ — is relatively low. The domains in the material are oriented randomly and require a threshold of energy to begin reorienting. As $H$ increases further, the permeability rises, and there is a region of steep, approximately linear increase in $B$. Then, as $H$ continues to increase, the material saturates: progressively fewer unaligned domains remain, the slope falls, and eventually $dB/dH \to 0$ as all domains are aligned. The $B$-$H$ curve is S-shaped, and the saturation is irreversible in a specific sense: if you now reduce $H$ back toward zero, $B$ does not retrace the original path. It remains at a higher value — the remanence $B_r$ — and you must apply a reverse field of magnitude $H_c$, the coercivity, to bring $B$ back to zero. The loop formed by this cycle of magnetisation and demagnetisation is the hysteresis loop, and its area is proportional to the energy dissipated as heat per cycle.</p>
<p>The crucial feature for audio recording is what happens near the origin. A small audio signal, sitting near $H = 0$, does not experience a nicely linear region of the $B$-$H$ curve. The initial permeability is low, and there is an inflection point near zero: the slope increases as you move away from zero before the saturation region brings it back down again. This means that even at low recording levels, the transfer function from head current to tape magnetisation is nonlinear, and in a particular way — the curve is antisymmetric under $H \to -H$ (the behaviour for negative fields mirrors that for positive fields), so the distortion it generates on its own is dominated by odd-order terms. Without some remedy, even a gentle sine wave would emerge from the playback head with significant odd-harmonic, crossover-like distortion added. The signal would also sit in a region of the curve where the effective permeability depends on signal amplitude, making the recording level-dependent in an uncontrolled way. Something needed to be done about this, and the solution found in the 1940s is one of the more elegant pieces of applied physics in the history of the recording industry.</p>
<h3 id="the-bias-signal">The Bias Signal</h3>
<p>The solution is called AC bias, and its discovery is usually credited to Braunmühl and Weber at the German Reichs-Rundfunk-Gesellschaft around 1940, though there are earlier related patents. The idea is simple once stated: add a high-frequency signal — typically between 50 kHz and 150 kHz, well above the audio band — to the recording current before it drives the head. This bias signal has an amplitude large enough to drive the tape through multiple cycles of its B-H curve on each audio cycle, but it is filtered out of the playback signal by the tape&rsquo;s own limited high-frequency response and by subsequent low-pass filtering.</p>
<p>The effect on the recording process is to linearise the transfer function. The operating point is no longer stationary near the inflection point at $H = 0$. Instead, it rides up and down the B-H curve rapidly many times per audio period, driven by the bias. The audio signal merely modulates the envelope of this rapid oscillation. The net magnetisation that remains after the tape leaves the head gap is the time average of many rapid traversals of the hysteresis loop, and this average tracks the audio signal with good linearity provided the signal level is modest. The bias amplitude and frequency are tuned carefully for each tape formulation — too little bias and the linearisation is incomplete; too much and the recorded level drops and the high-frequency response suffers as the bias begins to erase the fine spatial patterns written by high-frequency audio. Getting the bias right is part of the alignment procedure for every analogue tape machine and part of why different tape formulations require different machine settings.</p>
<p>The result, for moderate recording levels, is a remarkably clean and linear recording medium. The nonlinear character of the B-H curve is effectively tamed by the bias trick, and the remaining imperfections are mostly second-order: azimuth errors, print-through, head bump, self-demagnetisation at short wavelengths. For practical purposes, a well-aligned analogue tape machine at moderate recording levels is a linear system.</p>
<h3 id="harmonic-generation-at-high-levels">Harmonic Generation at High Levels</h3>
<p>At high recording levels — when the audio signal is large enough to push the operating point into the saturation region even after the bias has done its linearising work — the picture changes. The transfer function from input current to output magnetisation becomes genuinely nonlinear, and the harmonic content of the distortion becomes the central question.</p>
<p>The standard framework is a Taylor expansion of the transfer function around the operating point:</p>
$$y(t) = a_1 x(t) + a_2 x^2(t) + a_3 x^3(t) + a_4 x^4(t) + \cdots$$<p>where $x(t)$ is the input signal (the audio current), $y(t)$ is the output (the magnetisation recorded on tape), and the coefficients $a_n$ are determined by the shape of the B-H curve near saturation. For a pure tone $x(t) = A \sin(\omega t)$, the higher-order terms generate harmonics in a calculable way.</p>
<p>The second-order term gives:</p>
$$a_2 x^2(t) = a_2 A^2 \sin^2(\omega t) = \frac{a_2 A^2}{2}\bigl(1 - \cos 2\omega t\bigr)$$<p>This is a DC offset plus a component at $2\omega$ — the second harmonic, one octave above the fundamental.</p>
<p>The third-order term gives:</p>
$$a_3 x^3(t) = a_3 A^3 \sin^3(\omega t) = a_3 A^3 \left(\frac{3}{4}\sin\omega t - \frac{1}{4}\sin 3\omega t\right)$$<p>The $\frac{3}{4}$ piece adds to (or subtracts from) the fundamental depending on the sign of $a_3$; the $-\frac{1}{4}$ piece is a third harmonic at $3\omega$, one octave and a fifth above the fundamental.</p>
<p>Carrying through to fourth order:</p>
$$a_4 x^4(t) = \frac{a_4 A^4}{8}\bigl(3 - 4\cos 2\omega t + \cos 4\omega t\bigr)$$<p>which contributes additional DC, a component at $2\omega$, and a fourth harmonic at $4\omega$.</p>
<p>Collecting the terms through fourth order, the output is approximately:</p>
$$y(t) \approx \left(a_1 + \frac{3a_3 A^2}{4}\right)A\sin\omega t - \frac{a_2 A^2}{2}\cos 2\omega t - \frac{a_3 A^3}{4}\sin 3\omega t + \cdots$$<p>The important observation is about which harmonics dominate and what they sound like. The B-H curve of a ferromagnetic material near saturation is approximately symmetric: the saturation behaviour for positive $H$ mirrors that for negative $H$. A symmetric nonlinearity has $a_2 = a_4 = 0$ (all even coefficients vanish by symmetry), and only odd harmonics are generated. But at moderate levels, just before full saturation, the symmetry of the B-H loop as traversed by the biased signal is not perfect, and the even-order terms are nonzero — though small. This gives tape its characteristic distortion signature: at moderate saturation levels, the even harmonics ($2\omega$, $4\omega$) dominate; at heavy saturation, the odd harmonics ($3\omega$, $5\omega$) appear more strongly.</p>
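<p>This is easy to verify numerically. The sketch below is a toy Python experiment, using a tanh curve as a stand-in for the transfer function near saturation (an assumption for illustration, not a model of any particular tape): it drives a sine through a purely odd-symmetric saturator and through the same curve with a small even-order term added, then reads the harmonic levels off an FFT. The symmetric curve produces only odd harmonics; the asymmetric one adds the second and fourth.</p>
<pre><code class="language-python">import numpy as np

fs = 48_000                       # sample rate (Hz)
f0 = 1_000                        # fundamental of the test tone (Hz)
t = np.arange(fs) / fs            # one second of signal
x = 0.8 * np.sin(2 * np.pi * f0 * t)

# Two stand-in transfer curves (assumptions, not tape models): a purely
# odd-symmetric soft saturator, and the same curve with a small even-order
# term added to mimic an imperfectly symmetric hysteresis loop.
odd_symmetric = np.tanh(x)
asymmetric = np.tanh(x + 0.1 * x**2)

def harmonic_levels(y, n_harm=5):
    """Level of harmonics 1..n_harm in dB relative to the fundamental."""
    spectrum = np.abs(np.fft.rfft(y * np.hanning(len(y))))
    freqs = np.fft.rfftfreq(len(y), 1 / fs)
    mags = [spectrum[np.argmin(np.abs(freqs - k * f0))] for k in range(1, n_harm + 1)]
    return 20 * np.log10(np.array(mags) / mags[0])

# Even harmonics of the odd-symmetric curve sit at the numerical noise floor;
# the asymmetric curve raises the 2nd and 4th harmonics well above it.
print("odd-symmetric curve:", np.round(harmonic_levels(odd_symmetric), 1))
print("slightly asymmetric:", np.round(harmonic_levels(asymmetric), 1))
</code></pre>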
<p>The perceptual consequence of this is the crux of the &ldquo;analogue warmth&rdquo; story. The second harmonic is the octave of the fundamental. The fourth harmonic is the double octave. These are, in Western harmonic practice and in the physics of vibrating strings, the most consonant possible intervals. Adding even harmonics at low amplitude to a fundamental makes the sound fuller and richer without introducing beating or dissonance. Odd harmonics — particularly the fifth (at $5\omega$, a major third above the double octave) and the seventh (a flattened seventh above the double octave) — are less consonant relative to the fundamental and at high amplitude produce the harsh, buzzy character associated with heavy distortion or the deliberate aggression of a fuzz pedal.</p>
<p>There is one more effect worth naming: the saturation is a soft knee. The B-H curve does not have a sharp corner at saturation — it curves gradually from the linear region into the flat-topped saturation region. This means that transient signals — percussive attacks, consonant onsets — that briefly exceed the nominal recording level are not hard-clipped but gently compressed. Their peaks are rounded by the shape of the B-H curve. Engineers and producers who record through tape often describe this as the machine &ldquo;breathing&rdquo; or as a pleasing &ldquo;gluing&rdquo; of transients. The physics is simple: the soft-knee transfer function applies more gain reduction to instantaneous peaks than to the sustained body of the signal, functioning as a fast, musically transparent dynamic compressor for any material that approaches saturation.</p>
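<p>A similarly small sketch shows the compression effect on a transient. Again tanh stands in for the soft knee (an assumption, not a tape measurement): the soft knee pulls the peak down far more than it reduces the average level, whereas a hard clipper simply truncates whatever exceeds full scale.</p>
<pre><code class="language-python">import numpy as np

fs = 48_000
t = np.arange(int(0.05 * fs)) / fs
# A hypothetical percussive burst: fast attack, exponential decay,
# with peaks that exceed the nominal full-scale level of 1.0.
transient = 1.5 * np.exp(-t / 0.005) * np.sin(2 * np.pi * 200 * t)

soft = np.tanh(transient)              # soft-knee saturation
hard = np.clip(transient, -1.0, 1.0)   # abrupt digital-style clipping

for name, y in [("input", transient), ("soft knee", soft), ("hard clip", hard)]:
    peak = np.max(np.abs(y))
    rms = np.sqrt(np.mean(y**2))
    print(f"{name:10s} peak = {peak:.2f}   rms = {rms:.3f}")
</code></pre>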
<hr>
<h2 id="part-ii-the-physics-of-delta-sigma-conversion">Part II: The Physics of Delta-Sigma Conversion</h2>
<h3 id="nyquist-rate-adc-and-its-limits">Nyquist-Rate ADC and Its Limits</h3>
<p>The straightforward approach to analogue-to-digital audio conversion samples the signal at a rate just above twice the highest audio frequency — the Nyquist rate — using a quantiser with enough bits to achieve the desired dynamic range. For CD-quality audio, the sampling rate is 44.1 kHz (slightly above $2 \times 20{,}000$ Hz) and the word length is 16 bits. The dynamic range of a $b$-bit PCM system is, to a good approximation:</p>
$$\text{SNR} \approx 6.02b + 1.76 \text{ dB}$$<p>so 16 bits gives approximately $6.02 \times 16 + 1.76 \approx 98$ dB, which matches the dynamic range of the best analogue tape systems and is well above the roughly 70 dB set by the noise floor of typical studio tape at 15 ips <a href="#ref-4">[4]</a>.</p>
<p>The engineering problem with a straightforward Nyquist-rate ADC is the anti-aliasing filter. Before sampling, all content above $f_s/2 = 22.05$ kHz must be removed. If it is not, energy at frequency $f > f_s/2$ aliases into the audio band as a spurious component at $f_s - f$, which is inaudible in origin but very much audible in its alias. To achieve 98 dB of alias suppression — matching the 16-bit dynamic range — the filter must attenuate signals at 22.05 kHz by 98 dB relative to signals at 20 kHz. The transition band is only 2.05 kHz wide. That requires a very high-order analogue filter — typically seventh-order elliptic or Chebyshev — and such filters have significant phase distortion within the audio band, particularly at frequencies near the passband edge. In 1982, building this filter precisely, cheaply, and repeatably in consumer hardware was a genuine engineering challenge. The filters introduced audible phase and amplitude ripple that the original measurements had not anticipated and that contributed to early criticisms of the CD sound.</p>
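<p>Both numbers in this section are easy to check. The sketch below computes the ideal PCM dynamic range for a few word lengths and then samples an ultrasonic tone at 44.1 kHz with no anti-aliasing filter at all, showing where its alias lands; the 25 kHz test frequency is an arbitrary choice for illustration.</p>
<pre><code class="language-python">import numpy as np

def pcm_snr_db(bits):
    # Ideal full-scale sine-to-quantisation-noise ratio for b-bit PCM.
    return 6.02 * bits + 1.76

for b in (16, 20, 24):
    print(f"{b}-bit PCM: {pcm_snr_db(b):.1f} dB")

fs = 44_100
f_tone = 25_000                              # above fs/2 = 22,050 Hz
n = np.arange(fs)                            # one second of samples
sampled = np.sin(2 * np.pi * f_tone * n / fs)

spectrum = np.abs(np.fft.rfft(sampled))
freqs = np.fft.rfftfreq(fs, 1 / fs)
print("alias lands at", freqs[np.argmax(spectrum)], "Hz")   # expect 44,100 - 25,000 = 19,100 Hz
</code></pre>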
<h3 id="oversampling">Oversampling</h3>
<p>The delta-sigma ($\Sigma\Delta$) ADC architecture was developed to sidestep the steep-filter problem entirely, and its adoption in consumer audio from the late 1980s onwards largely resolved the anti-aliasing filter debate <a href="#ref-1">[1]</a>. The core idea is oversampling: instead of sampling at 44.1 kHz with 16 bits, the $\Sigma\Delta$ converter samples at $M \times 44.1$ kHz — where $M$ is the oversampling ratio, typically 64 in early audio converters, giving $64 \times 44.1 = 2.8224$ MHz — with a 1-bit quantiser. The anti-aliasing filter now needs to attenuate everything above 1.4112 MHz before sampling. Its transition band runs from 20 kHz to 1.4112 MHz, a ratio of roughly 70:1. This is easy: a simple, cheap, first- or second-order RC filter suffices, with negligible phase distortion anywhere in the audio band. The price paid is that the quantiser is now only 1 bit, and a 1-bit quantiser has terrible resolution on its own.</p>
<p>To understand what oversampling buys even before any clever signal processing, consider the quantisation noise floor. For a uniform quantiser with step size $\Delta$, the quantisation noise power is $P_q = \Delta^2/12$, and this noise is spread approximately uniformly from 0 to $f_s/2$. The noise power spectral density is $P_q / (f_s/2)$. After oversampling by a factor of $M$ — so that the audio band of interest occupies only 0 to $f_{\text{audio}} = f_s/(2M)$, the Nyquist band of the final decimated rate — the in-band noise power is:</p>
$$P_{\text{in-band}} = \frac{P_q}{f_s/2} \cdot f_{\text{audio}} = \frac{P_q}{f_s/2} \cdot \frac{f_s}{2M} = \frac{P_q}{M}$$<p>Each doubling of $M$ halves the in-band noise power, an improvement of 3 dB, equivalent to half a bit of resolution. At 64× oversampling this gives 18 dB, or three extra bits — useful, but not enough to get from a 1-bit quantiser to 16-bit performance. We need something more.</p>
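<p>The $P_q/M$ result translates directly into decibels and bits; a few lines of Python tabulate it (nothing here is specific to any converter, it is just the formula above).</p>
<pre><code class="language-python">import numpy as np

# In-band noise reduction from oversampling alone (no noise shaping):
# P_in_band = P_q / M, i.e. 10*log10(M) dB, or about half a bit per doubling.
for M in (2, 4, 8, 16, 32, 64):
    gain_db = 10 * np.log10(M)
    print(f"M = {M:3d}: {gain_db:5.1f} dB  (~{gain_db / 6.02:.1f} extra bits)")
</code></pre>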
<h3 id="noise-shaping">Noise Shaping</h3>
<p>The second ingredient — and the one that makes $\Sigma\Delta$ conversion genuinely remarkable — is noise shaping. Rather than spreading quantisation noise uniformly in frequency, we can engineer its spectral distribution so that almost all the noise power sits above the audio band, where it is removed by a digital low-pass filter (the decimation filter) at the output.</p>
<p>A first-order $\Sigma\Delta$ modulator achieves this with a feedback loop. At each sample step, an integrator accumulates the difference between the input signal and the previous 1-bit output, and the quantiser converts the integrator state into the next output bit. Modelling the quantiser as an additive error source (writing the output as $y_n = \hat{x}_n + e_n$, where $\hat{x}_n$ is the quantiser input and $e_n$ the quantisation error), the loop passes the signal through essentially unchanged but applies a first difference to the error. This is the integrator-plus-difference structure that gives the modulator its name: $\Sigma$ for the integrating summation, $\Delta$ for the difference.</p>
<p>In the $z$-domain, this feedback structure gives the quantisation noise a transfer function of:</p>
$$N(z) = 1 - z^{-1}$$<p>that is, the noise at time $n$ is the current error minus the previous error — a first-difference operation. In the frequency domain, substituting $z = e^{j 2\pi f / f_s}$:</p>
$$\bigl|N(f)\bigr|^2 = \left|1 - e^{-j 2\pi f / f_s}\right|^2 = 4\sin^2\!\left(\frac{\pi f}{f_s}\right)$$<p>For frequencies well below the sampling rate, $f \ll f_s$, the small-angle approximation gives:</p>
$$\bigl|N(f)\bigr|^2 \approx \left(\frac{2\pi f}{f_s}\right)^2$$<p>The noise power spectral density rises as $f^2$ — it is heavily suppressed at low frequencies and pushed up toward $f_s/2$. Integrating this shaped noise over the audio band $[0, f_{\text{audio}}]$ and comparing to the flat-spectrum case, the in-band SNR improvement for a first-order modulator scales as $M^3$ rather than $M^1$: every doubling of oversampling ratio gives 9 dB improvement (1.5 bits) instead of 3 dB. At 64× oversampling — six doublings — a first-order modulator recovers approximately 54 dB, or 9 effective bits.</p>
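<p>A minimal discrete-time simulation of the first-order loop makes the shaping visible. This is an idealised behavioural model with arbitrary tone, level, and run length, not a circuit: it compares how much of the 1-bit stream's error energy falls inside the audio band with and without the feedback loop.</p>
<pre><code class="language-python">import numpy as np

fs_audio = 44_100
M = 64                                     # oversampling ratio
fs = M * fs_audio                          # modulator clock: 2.8224 MHz
n = 2**18                                  # number of modulator samples
t = np.arange(n) / fs
x = 0.5 * np.sin(2 * np.pi * 1_000 * t)    # 1 kHz tone at half of full scale

# First-order delta-sigma: integrate the difference between the input and
# the previous 1-bit output, then quantise the integrator state to +/-1.
integrator = 0.0
prev = 0.0
y = np.empty(n)
for i in range(n):
    integrator += x[i] - prev
    prev = 1.0 if integrator >= 0 else -1.0
    y[i] = prev

plain = np.where(x >= 0, 1.0, -1.0)        # 1-bit quantisation with no loop

def inband_error_fraction_db(stream):
    """Fraction of the error energy that falls below fs_audio/2, in dB."""
    err = (stream - x) * np.hanning(n)
    power = np.abs(np.fft.rfft(err))**2
    cutoff = np.fft.rfftfreq(n, 1 / fs).searchsorted(fs_audio / 2)
    return 10 * np.log10(power[:cutoff].sum() / power.sum())

print("no shaping :", round(inband_error_fraction_db(plain), 1), "dB")
print("1st order  :", round(inband_error_fraction_db(y), 1), "dB")
</code></pre>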
<p>A second-order modulator applies the noise-shaping filter twice, giving $|N(f)|^2 \propto f^4$ and an SNR gain scaling as $M^5$: 15 dB per octave of oversampling. At 64× — again six doublings — this recovers approximately 90 dB, or 15 effective bits. Modern high-performance audio ADCs use fifth- to seventh-order modulators operating at 128× oversampling or higher. The in-band noise floor drops to levels corresponding to 20–24 effective bits — entirely from a 1-bit hardware comparator, with all the resolution coming from the noise shaping and the subsequent digital decimation filter.</p>
<p>The following table illustrates the SNR gain achievable at practical oversampling ratios:</p>
<table>
  <thead>
      <tr>
          <th>Modulator order</th>
          <th>Oversampling ratio</th>
          <th>SNR gain</th>
          <th>Effective bits gained</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1st order</td>
          <td>64×</td>
          <td>54 dB</td>
          <td>9</td>
      </tr>
      <tr>
          <td>2nd order</td>
          <td>64×</td>
          <td>90 dB</td>
          <td>15</td>
      </tr>
      <tr>
          <td>5th order</td>
          <td>128×</td>
          <td>~120 dB</td>
          <td>~20</td>
      </tr>
  </tbody>
</table>
<p>The 5th-order row deserves a moment&rsquo;s attention. A single-bit comparator — a device that outputs only 1 or 0, with no analogue subtlety whatsoever — combined with oversampling and noise shaping achieves the resolution of a 20-bit Nyquist-rate ADC, and it does so with a simple feedback loop and an analogue integrator that can be fabricated cheaply on a CMOS chip. This is, I think, one of the more quietly stunning pieces of engineering in consumer electronics, and it goes entirely unnoticed because the CD player it lives inside is now considered mundane.</p>
<p>There is a subtlety worth adding for completeness. Real $\Sigma\Delta$ modulators of order three and above are only conditionally stable — the noise-shaping loop can overload for large input signals, producing limit cycles or tonal artefacts. Managing this stability is a significant part of the design problem and involves either restricting the input range, adding nonlinear stability control, or using multi-bit internal quantisers (which reduce the quantisation step and ease the stability constraint while retaining most of the noise-shaping benefit). The multi-bit approach also addresses a related issue: the ideal 1-bit DAC in the feedback loop is inherently linear (there are only two levels, so there is no differential nonlinearity), but multi-bit internal DACs must be trimmed or calibrated to avoid nonlinearity in the feedback path corrupting the noise shaping. These engineering details are discussed thoroughly in Norsworthy, Schreier, and Temes <a href="#ref-5">[5]</a>, which remains the standard reference.</p>
<p>The digital audio infrastructure that delta-sigma conversion enabled — clean, cheap, phase-linear converters without steep analogue filters — also made digital audio workable in latency-sensitive applications like live performance. For a discussion of why latency matters so much in network music performance and how it shapes system design, see my earlier post on <a href="/posts/nmp-latency-lola-mvtp/">NMP latency and the physics of musical timing</a>.</p>
<hr>
<h2 id="the-irony-of-the-comparison">The Irony of the Comparison</h2>
<p>Both tape saturation and delta-sigma conversion are, at root, about the same problem: how to manage the relationship between a signal and the finite resolution of the medium storing it. Tape manages the problem physically and somewhat accidentally — the ferromagnetic B-H curve happens to generate even harmonics that are consonant with the recorded signal, and the bias trick linearises the response well enough that the distortion only becomes audible when the engineer deliberately pushes into saturation. Delta-sigma manages the problem mathematically and deliberately — quantisation noise is redistributed in frequency by a designed feedback loop so that it falls outside the audible band.</p>
<p>Neither approach is perfect, and neither is neutral. Tape adds signal-correlated harmonic distortion whose spectral content depends on recording level and which compresses transients in a way that changes the perceived dynamics. Digital audio, even with delta-sigma conversion, has its own imperfections: idle-channel noise from the modulator, potential for tonal limit-cycle artefacts at specific input levels, and the abrupt onset of hard clipping at full scale — which, unlike tape saturation, has no gradual knee and adds a dense spray of high-order harmonics the instant full scale is exceeded, giving the harsh, unpleasant character that digital overloads are known for. The soft-knee vs. hard-clip distinction is real and audible, and it is probably the most defensible technical basis for the claim that analogue tape handles transient overloads more gracefully.</p>
<p>What is not defensible is the claim that one medium is inherently more musical than the other, or that digital audio lacks something fundamental that tape possesses. They are differently imperfect. The imperfections of tape happen to sit at harmonic relationships that Western ears, shaped by a tradition of music built on those same harmonic intervals, find pleasing. The imperfections of digital audio are not at pleasing harmonic intervals; they are wideband quantisation noise (before shaping) or ultrasonic shaped noise (after), and a sharp cliff at full scale. Different physics, different perceptual character.</p>
<hr>
<h2 id="a-personal-note">A Personal Note</h2>
<p>I spent a long time thinking the tape versus digital debate was mostly audiophile mythology — a community of enthusiasts rationalising the warmth of nostalgia as the warmth of oxide particles. The physics is more interesting than that, and doing the calculation changed my view. The second-harmonic content of tape saturation is not an accident or a romantic story; it is what you get when you push a symmetric nonlinearity with an audio sine wave, and the reason it sounds pleasant is not arbitrary but is grounded in the physics of consonance and the harmonic series. The delta-sigma converter is not a mundane commodity chip but a genuinely elegant solution to an otherwise intractable filter-design problem, and the fact that it achieves 24-bit resolution from a 1-bit comparator by spectral redistribution of noise is the kind of result that should get more attention in physics education.</p>
<p>Both technologies deserve better than the aesthetics argument they have been fighting in for forty years. The tools to understand them are not exotic — Taylor series, Fourier analysis, the z-transform, and the basic physics of ferromagnetism — and the reward is a clear-eyed picture of what is actually going on inside two of the most consequential inventions in the history of recorded music. If you are interested in related mathematics underlying other aspects of music, the posts on <a href="/posts/euclidean-rhythms/">Euclidean rhythms</a> and <a href="/posts/messiaen-modes-group-theory/">Messiaen&rsquo;s modes and group theory</a> cover the combinatorial and algebraic structures in rhythm and pitch that sit alongside the physics discussed here.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Candy, J. C., &amp; Temes, G. C. (Eds.). (1992). <em>Oversampling Delta-Sigma Data Converters: Theory, Design, and Simulation</em>. IEEE Press.</p>
<p><span id="ref-2"></span>[2] Reiss, J. D., &amp; McPherson, A. (2015). <em>Audio Effects: Theory, Implementation and Application</em>. CRC Press.</p>
<p><span id="ref-3"></span>[3] Bertram, H. N. (1994). <em>Theory of Magnetic Recording</em>. Cambridge University Press.</p>
<p><span id="ref-4"></span>[4] Pohlmann, K. C. (2010). <em>Principles of Digital Audio</em> (6th ed.). McGraw-Hill.</p>
<p><span id="ref-5"></span>[5] Norsworthy, S. R., Schreier, R., &amp; Temes, G. C. (Eds.). (1997). <em>Delta-Sigma Data Converters: Theory, Design, and Simulation</em>. IEEE Press.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-14</strong>: Updated the interval description for the 7th harmonic to &ldquo;above the double octave.&rdquo; The 7th harmonic (7f) sits between the double octave (4f) and the triple octave (8f).</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>Why 44,100? The Accidental Physics of the CD Sampling Rate</title>
      <link>https://sebastianspicker.github.io/posts/why-44100-hz-cd-sampling-rate/</link>
      <pubDate>Mon, 05 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/why-44100-hz-cd-sampling-rate/</guid>
      <description>The CD sampling rate is not a round number chosen by committee. It is the direct output of 1970s NTSC and PAL video engineering — and both standards, designed on different continents, converge on exactly the same number.</description>
      <content:encoded><![CDATA[<p><em>44,100 Hz. Not 44,000. Not 48,000. Not even 40,000 or 50,000, which would at least have the virtue of roundness. The number that defines CD-quality audio is specific in a way that invites a question most people never think to ask: why that number?</em></p>
<hr>
<h2 id="the-puzzle">The Puzzle</h2>
<p>When a physical constant turns out to be $1.6 \times 10^{-19}$ coulombs, that is just nature being nature — no further explanation is needed or available. But when an engineering standard settles on 44,100 Hz rather than, say, 44,000 Hz or 45,000 Hz, there is a story hiding in the specificity.</p>
<p>The standard answer — the one you find on Wikipedia and in most popular accounts — is that 44.1 kHz satisfies the Nyquist criterion for 20 kHz audio, and so it was chosen to preserve the full range of human hearing. This is true. It is also almost completely uninformative. The Nyquist criterion for 20 kHz audio requires only that the sampling rate exceed 40 kHz. That constraint is satisfied by 40,001 Hz as much as by 44,100 Hz. The specific value requires a different explanation entirely.</p>
<p>That explanation involves a Sony engineer, a consumer videocassette recorder, and the accidental convergence of two television standards developed independently on different continents. The number 44,100 is not an optimisation. It is an archaeological deposit. And like most archaeological deposits, it is still with us long after the civilisation that created it has disappeared.</p>
<p>I want to work through the physics first, because the Nyquist theorem is genuinely beautiful and is often presented in a way that obscures what it actually says. Then I want to show you the arithmetic that makes 44,100 inevitable given 1970s constraints — and the way NTSC and PAL, designed for completely different reasons, conspire to produce the same number. If you enjoy &ldquo;hidden mathematics in music,&rdquo; you might also find it in <a href="/posts/euclidean-rhythms/">Euclidean Rhythms</a>, where a 2,300-year-old algorithm turns out to encode the structure of West African and Cuban percussion.</p>
<hr>
<h2 id="the-nyquistshannon-sampling-theorem">The Nyquist–Shannon Sampling Theorem</h2>
<p>Before the archaeology, the physics.</p>
<p>In 1928, Harry Nyquist published a paper on telegraph transmission theory that contained, somewhat incidentally, the germ of what would become one of the most consequential theorems in applied mathematics <a href="#ref-4">[4]</a>. Claude Shannon formalised and generalised it in 1949 <a href="#ref-5">[5]</a>. The theorem states: a continuous bandlimited signal whose highest frequency component is $f_{\max}$ can be perfectly reconstructed from discrete samples taken at rate $f_s$ if and only if</p>
$$f_s > 2 f_{\max}.$$<p>The quantity $f_s / 2$ is called the Nyquist frequency. Sampling below it causes <em>aliasing</em>: high-frequency components fold back into the spectrum and appear as spurious low-frequency artefacts that are indistinguishable from genuine signal. Once you have aliased a signal, the damage is permanent. Sampling at or above the Nyquist rate, the theorem says, causes no information loss at all — the original continuous waveform can be recovered exactly, in principle, from the discrete sample sequence.</p>
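<p>To make &ldquo;perfectly reconstructed&rdquo; concrete, the sketch below (a toy example with an arbitrary rate and tone well below the Nyquist limit, nothing CD-specific) rebuilds the waveform between the sample instants with the Whittaker–Shannon sinc sum; the small residual error comes only from truncating the infinite sum to 200 samples.</p>
<pre><code class="language-python">import numpy as np

fs = 8_000                                   # illustrative sampling rate (Hz)
f_sig = 1_234.5                              # bandlimited test tone, well below fs/2
n = np.arange(200)
samples = np.sin(2 * np.pi * f_sig * n / fs)

# Whittaker-Shannon interpolation: x(t) = sum_n x[n] * sinc(fs*t - n).
t_fine = np.linspace(0.010, 0.015, 400)      # instants between samples, mid-record
recon = np.array([np.sum(samples * np.sinc(fs * t - n)) for t in t_fine])
truth = np.sin(2 * np.pi * f_sig * t_fine)

print("max reconstruction error:", np.max(np.abs(recon - truth)))
</code></pre>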
<p>Human hearing extends from roughly 20 Hz to 20 kHz (and, for most adults over thirty, substantially less at the top end, but 20 kHz is the canonical engineering requirement). Setting $f_{\max} = 20$ kHz, the Nyquist criterion requires $f_s > 40$ kHz.</p>
<p>But here is the subtlety that the Wikipedia summary tends to skip. The theorem assumes that the signal is <em>perfectly</em> bandlimited before sampling — meaning that all energy above $f_{\max}$ has been removed. This requires an <em>anti-aliasing filter</em>: a low-pass filter applied to the analogue signal before the analogue-to-digital converter samples it. If your anti-aliasing filter passes everything up to 20 kHz and blocks everything above it with perfect sharpness, then 40,001 Hz would suffice. The problem is that such a filter is physically unrealisable.</p>
<p>Real filters do not have vertical cutoffs. They have a <em>transition band</em>: a frequency range over which attenuation increases gradually from zero to full suppression. The steeper you want the transition, the higher the filter order, and for practical filter hardware in 1979 — op-amps, capacitors, inductors, no DSP to speak of — a &ldquo;steep enough&rdquo; filter meant a transition band of roughly 10% of the passband edge frequency. For a 20 kHz passband edge, that is about 2 kHz of transition band.</p>
<p>So the actual engineering requirement is not just $f_s > 40$ kHz. It is $f_s > 40$ kHz <em>plus enough headroom for a realisable anti-aliasing filter</em>. With $f_s = 44.1$ kHz, the Nyquist limit sits at $f_s/2 = 22.05$ kHz. The gap between the top of the audio band and the Nyquist limit is</p>
$$22{,}050 - 20{,}000 = 2{,}050 \text{ Hz},$$<p>which is just over 10% of 20 kHz. This is enough to build a practical anti-aliasing filter with 1970s and early 1980s analogue components. Had the sampling rate been 41 kHz, the gap would have been only 500 Hz — far too narrow for affordable hardware. Had it been 50 kHz, the gap would have been more comfortable, but you would be storing 13.4% more data per second for no audible benefit.</p>
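<p>The guard-band arithmetic for a few candidate rates, as a quick check:</p>
<pre><code class="language-python"># Guard band between the 20 kHz audio band edge and the Nyquist limit,
# expressed as a fraction of the passband edge, for several candidate rates.
for fs in (40_500, 41_000, 44_100, 48_000, 50_000):
    guard_hz = fs / 2 - 20_000
    print(f"{fs:6d} Hz: guard band {guard_hz:6.0f} Hz ({100 * guard_hz / 20_000:4.1f}% of 20 kHz)")
</code></pre>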
<p>So 44.1 kHz is in the right <em>neighbourhood</em> given real-world filter constraints. But it is still a specific number. The question of why 44,100 rather than 44,000 or 43,500 or 44,800 is still open. That is where the VCRs come in.</p>
<hr>
<h2 id="the-vcr-problem">The VCR Problem</h2>
<p>In the late 1970s, Sony was developing what would eventually become the Compact Disc. One of the fundamental engineering problems was storage: where do you put the digital audio data? A 74-minute stereo recording at 16 bits and 44.1 kHz generates roughly 780 megabytes. In 1979, that was an absurd quantity of data. Hard drives with that capacity existed but cost tens of thousands of dollars and weighed as much as a washing machine. Dedicated digital tape formats existed in professional studios but were exotic and expensive <a href="#ref-1">[1]</a>.</p>
<p>The only affordable high-bandwidth magnetic recording medium available to consumer-facing engineers in 1979 was the VCR — the videocassette recorder. VHS and Betamax had recently become consumer products, and the tape and drive mechanism was cheap, reliable, and capable of storing several hours of high-bandwidth video signal. That video signal bandwidth was substantial: enough, in principle, to carry digital audio if you could get it onto the tape in the right form.</p>
<p>Sony&rsquo;s solution was elegant to the point of audacity. Rather than inventing a new tape format, they encoded digital audio samples as a black-and-white pseudo-video signal — patterns of light and dark pixels that a standard VCR recorded without modification, because as far as the VCR was concerned it was just receiving a monochrome video feed. The resulting device, the Sony PCM-1600 (1979), was a standalone unit that sat between a microphone preamplifier and a VCR, converting audio to fake video for recording and back to audio for playback <a href="#ref-3">[3]</a>.</p>
<p>The sampling rate of the audio was now determined not by any audio engineering consideration but by the geometry of the video signal. And the geometry of the video signal was fixed by the television broadcast standard — which brought entirely different historical contingencies into the calculation.</p>
<hr>
<h2 id="the-ntsc-arithmetic">The NTSC Arithmetic</h2>
<p>The NTSC standard — developed in North America and Japan — specifies 30 frames per second and 525 total scan lines per frame. Of those 525 lines, 35 are consumed by the vertical blanking interval (the time needed for the electron beam in a CRT to return from the bottom of the screen to the top). That leaves 490 active lines per frame actually carrying picture information.</p>
<p>Sony packed 3 audio samples into each active scan line. The audio sampling rate is then:</p>
$$f_s = \underbrace{30}_{\text{frames/s}} \times \underbrace{490}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$<p>There it is. 44,100 Hz, emerging not from any consideration of human hearing or filter design, but from the frame rate and line count of the North American television standard.</p>
<hr>
<h2 id="the-pal-arithmetic">The PAL Arithmetic</h2>
<p>Now the European video standard, PAL, which was developed in the 1960s independently of NTSC and optimised for different priorities. PAL uses 25 frames per second and 625 total scan lines per frame. The vertical blanking interval consumes 37 lines, leaving 588 active lines per frame.</p>
<p>Sony packed 3 audio samples into each active PAL scan line as well. The sampling rate:</p>
$$f_s = \underbrace{25}_{\text{frames/s}} \times \underbrace{588}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$<p>The same number.</p>
<p>Let that settle for a moment. NTSC: 30 frames per second, 490 active lines. PAL: 25 frames per second, 588 active lines. Different frame rates. Different line counts. Developed on different continents for different broadcast environments. And yet $30 \times 490 = 25 \times 588 = 14{,}700$, so multiplying by 3 gives 44,100 in both cases.</p>
<p>This is not coincidence in any deep sense — NTSC and PAL were both designed to fill approximately the same video bandwidth, just with different tradeoffs between temporal resolution (frame rate) and spatial resolution (line count). But for Sony&rsquo;s VCR encoding scheme, the numerical convergence was enormously convenient: a single PCM processor running at 44.1 kHz could record to either NTSC or PAL video equipment without any change to the audio electronics. The same master machine could work in Tokyo and in Frankfurt.</p>
<p>The arithmetic is, I think, one of those moments where a coincidence that is perfectly explicable in hindsight still feels satisfying in the way that a physical derivation feels satisfying. You set up the constraints — fill the video bandwidth, pack an integer number of samples per line, keep the number of samples small enough to fit in a line&rsquo;s worth of data — and the number 44,100 falls out of two independent calculations like a constant of nature. It is not a constant of nature. It is a contingent product of mid-twentieth-century broadcast engineering. But the mathematics does not care.</p>
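<p>The whole derivation fits in a few lines of arithmetic:</p>
<pre><code class="language-python"># Both broadcast standards, three audio samples per active line:
ntsc = 30 * (525 - 35) * 3    # frames/s x active lines/frame x samples/line
pal  = 25 * (625 - 37) * 3
print(ntsc, pal)              # 44100 44100
</code></pre>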
<hr>
<h2 id="from-tape-to-disc">From Tape to Disc</h2>
<p>When Philips and Sony sat down to negotiate the Red Book standard — the technical specification for the Compact Disc, finalised in 1980 and commercially launched in 1982 — both companies brought existing infrastructure to the table <a href="#ref-3">[3]</a>. Both had been building digital audio equipment for several years. Both had PCM processors running in professional studios. Both had catalogues of digital masters recorded on VCR tape. And all of that equipment ran at 44.1 kHz, because all of it had been built to interface with the video tape standard that made digital audio recording practically affordable in the first place.</p>
<p>Changing the sampling rate for the CD would have required rebuilding the entire mastering chain: new PCM processors, new format conversion hardware, new master tape libraries. The economic and logistical cost would have been enormous. The 44.1 kHz rate was not chosen for the CD because it was optimal in any absolute engineering sense. It was chosen because it was already there <a href="#ref-1">[1]</a>, <a href="#ref-2">[2]</a>.</p>
<p>This is a pattern worth recognising. Major technical standards are rarely chosen by optimisation from first principles. They are chosen by consolidating what already exists. The QWERTY keyboard layout was optimised for typewriter mechanisms that no longer exist. The 60 Hz AC frequency in North America was set by Westinghouse generators installed in the 1890s. The 44.1 kHz CD sampling rate was set by VCR tape recorders that were obsolete within a decade of the CD&rsquo;s launch.</p>
<hr>
<h2 id="the-other-rates">The Other Rates</h2>
<p>Not all digital audio runs at 44.1 kHz, and the coexistence of different rates in the modern audio industry is the direct legacy of 44.1 kHz&rsquo;s awkward origins.</p>
<p><strong>48 kHz</strong> is the professional broadcast and studio standard. It is used in digital video, in DAT tape, in most professional audio interfaces, and in the digital audio embedded in broadcast television signals — including, as a matter of course, in the digital television infrastructure described in the context of university video platforms like <a href="/posts/educast-nrw-hochschul-youtube/">educast.nrw</a>. Why 48? Broadcast infrastructure needed a rate that had clean integer relationships with the 32 kHz rate used in early satellite and ISDN broadcast systems. The relationship $48 = \frac{3}{2} \times 32$ is exact, making synchronisation straightforward. 44.1 kHz has no such clean relationship with anything in broadcast engineering.</p>
<p>The ratio between the two dominant rates is $48 / 44.1 = 160 / 147$. This fraction — irreducible, inelegant, non-obvious — is the source of essentially every sample-rate conversion problem in audio post-production. When a CD master (44.1 kHz) is prepared for broadcast (48 kHz), a sample-rate converter must interpolate 147 samples up to 160 samples, or downsample 160 samples to 147, at every moment. The process introduces small errors, and doing it well requires significant computational effort. Every time a musician&rsquo;s recording moves between the consumer and professional audio worlds, it passes through this fractional bottleneck. Two standards that could have been made compatible were instead set by completely independent historical processes, and we have been paying the computational tax ever since.</p>
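<p>For illustration: Python&rsquo;s fractions module reduces the ratio, and a polyphase resampler (scipy.signal.resample_poly here, one of several reasonable tools) performs the 147-to-160 interpolation on a one-second signal.</p>
<pre><code class="language-python">from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

print(Fraction(48_000, 44_100))               # 160/147

# Convert a one-second 44.1 kHz tone to 48 kHz with a polyphase resampler.
x_441 = np.sin(2 * np.pi * 1_000 * np.arange(44_100) / 44_100)
x_480 = resample_poly(x_441, up=160, down=147)
print(len(x_441), "samples ->", len(x_480))   # expect 44100 -> 48000
</code></pre>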
<p><strong>96 kHz and 192 kHz</strong> are marketed as &ldquo;high-resolution audio.&rdquo; Here the physics gets genuinely murky and the claims made by the audio industry deserve some scepticism. Human hearing above 20 kHz is, for most adults, genuinely absent — not reduced, but absent, because the outer hair cells in the cochlea that respond to those frequencies progressively die from the teenage years onward and are not replaced. The argument for high sampling rates is typically one of two things: first, that ultrasonic content can cause <em>intermodulation distortion</em>, where sum and difference frequencies of ultrasonic components fall back into the audible band; second, that a higher sampling rate allows for a more relaxed anti-aliasing filter with better phase behaviour within the audible band.</p>
<p>Both effects are real and measurable in laboratory conditions. Whether they are <em>audible</em> under controlled double-blind listening conditions is a separate and more contested question. The published evidence is not strong. What is not contested is that 96 kHz files are twice the size of 44.1 kHz files, and 192 kHz files are more than four times the size, for the same bit depth and the same number of audio channels. Whether that storage cost buys anything audible is, as of the current state of the literature, an open question.</p>
<hr>
<h2 id="the-irony">The Irony</h2>
<p>Here is the situation we are actually in. The canonical digital audio format — 16-bit, 44.1 kHz PCM, the format that defined CD quality for a generation and that remains the standard for music distribution — is physically a photograph of analogue video tape. The digitisation of music was made possible by television engineering. The specific number that defines the fidelity of every CD ever pressed is determined by the frame rates and line counts of 1970s broadcast television standards, which were themselves determined by the capabilities of 1940s CRT technology and the political negotiations of early broadcast licensing bodies.</p>
<p>When someone tells you that 44.1 kHz is the &ldquo;natural&rdquo; or &ldquo;perfect&rdquo; sampling rate for audio, they are, without knowing it, paying tribute to the NTSC standards committee of 1941 and the PAL engineers of the 1960s. The number carries history in it the way a fossil carries the structure of a long-dead organism. It is the right number, in the sense that it works. Its rightness has nothing to do with the reasons it was chosen.</p>
<p>I find this genuinely satisfying rather than disappointing. The history of physics and engineering is full of contingent numbers that turned out to be good enough, and whose goodness was only rationalised after the fact. The metre was originally defined as one ten-millionth of the distance from the equator to the North Pole along the Paris meridian — an arbitrary geodetic choice that turned out to produce a unit of length that is remarkably convenient for human-scale physics. The kilogram was a cylinder of platinum-iridium alloy in a vault outside Paris for over a century. 44,100 Hz is in good company.</p>
<hr>
<h2 id="the-archaeology-of-a-number">The Archaeology of a Number</h2>
<p>The numbers we inherit from engineering history are rarely arbitrary at every level simultaneously. 44,100 Hz is not arbitrary at the level of sampling theory: it satisfies the Nyquist criterion with enough headroom for a physically realisable anti-aliasing filter, given 1970s component technology. That is a genuine constraint, and the number sits in the right region of parameter space for it.</p>
<p>But it is arbitrary at a deeper level: it is the specific number that happened to fit a video tape format that happened to be affordable in 1979, a format that was itself determined by broadcast standards that were set for entirely unrelated reasons decades earlier. The chain of contingencies runs: 1940s television engineering defines NTSC and PAL frame rates and line counts; 1970s consumer VCR technology makes those tape formats cheap; 1979 Sony engineers encode digital audio as fake video; the arithmetic of the video formats fixes the sampling rate at 44,100 Hz; that rate gets locked into the CD standard in 1980; 44.1 kHz becomes the defining frequency of a digital music format that ships billions of units over the following four decades.</p>
<p>Science and engineering produce exact numbers from messy contingencies. The number 44,100 is simultaneously a theorem output (it satisfies a well-defined engineering constraint), a historical accident (it is determined by the specific video tape hardware that existed in 1979), and an institutional fossil (it outlasted the VCRs that created it by four decades and counting). All three things are true at the same time.</p>
<p>The VCRs are gone. The sampling rate remains.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Pohlmann, K. C. (2010). <em>Principles of Digital Audio</em> (6th ed.). McGraw-Hill.</p>
<p><span id="ref-2"></span>[2] Watkinson, J. (2001). <em>The Art of Digital Audio</em> (3rd ed.). Focal Press.</p>
<p><span id="ref-3"></span>[3] Immink, K. A. S. (1998). The compact disc story. <em>Journal of the AES</em>, 46(5), 458–465.</p>
<p><span id="ref-4"></span>[4] Nyquist, H. (1928). Certain topics in telegraph transmission theory. <em>Transactions of the AIEE</em>, 47(2), 617–644.</p>
<p><span id="ref-5"></span>[5] Shannon, C. E. (1949). Communication in the presence of noise. <em>Proceedings of the IRE</em>, 37(1), 10–21.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
