44,100 Hz. Not 44,000. Not 48,000. Not even 40,000 or 50,000, which would at least have the virtue of roundness. The number that defines CD-quality audio is specific in a way that invites a question most people never think to ask: why that number?
The Puzzle
When a physical constant turns out to be $1.6 \times 10^{-19}$ coulombs, that is just nature being nature — no further explanation is needed or available. But when an engineering standard settles on 44,100 Hz rather than, say, 44,000 Hz or 45,000 Hz, there is a story hiding in the specificity.
The standard answer — the one you find on Wikipedia and in most popular accounts — is that 44.1 kHz satisfies the Nyquist criterion for 20 kHz audio, and so it was chosen to preserve the full range of human hearing. This is true. It is also almost completely uninformative. The Nyquist criterion for 20 kHz audio requires only that the sampling rate exceed 40 kHz. That constraint is satisfied by 40,001 Hz as much as by 44,100 Hz. The specific value requires a different explanation entirely.
That explanation involves a Sony engineer, a consumer videocassette recorder, and the accidental convergence of two television standards developed independently on different continents. The number 44,100 is not an optimisation. It is an archaeological deposit. And like most archaeological deposits, it is still with us long after the civilisation that created it has disappeared.
I want to work through the physics first, because the Nyquist theorem is genuinely beautiful and is often presented in a way that obscures what it actually says. Then I want to show you the arithmetic that makes 44,100 inevitable given 1970s constraints — and the way NTSC and PAL, designed for completely different reasons, conspire to produce the same number. If you enjoy “hidden mathematics in music,” you might also find it in Euclidean Rhythms, where a 2,300-year-old algorithm turns out to encode the structure of West African and Cuban percussion.
The Nyquist–Shannon Sampling Theorem
Before the archaeology, the physics.
In 1928, Harry Nyquist published a paper on telegraph transmission theory that contained, somewhat incidentally, the germ of what would become one of the most consequential theorems in applied mathematics [4]. Claude Shannon formalised and generalised it in 1949 [5]. The theorem states: a continuous bandlimited signal whose highest frequency component is $f_{\max}$ can be perfectly reconstructed from discrete samples taken at rate $f_s$ if and only if
$$f_s > 2 f_{\max}.$$

The quantity $f_s / 2$ is called the Nyquist frequency. Sampling below the Nyquist rate $2 f_{\max}$ causes aliasing: high-frequency components fold back into the spectrum and appear as spurious low-frequency artefacts that are indistinguishable from genuine signal. Once you have aliased a signal, the damage is permanent. Sampling above the Nyquist rate, the theorem says, causes no information loss at all — the original continuous waveform can be recovered exactly, in principle, from the discrete sample sequence.
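The folding behaviour is easy to make concrete. Here is a minimal sketch in Python; `alias_frequency` is a hypothetical helper for illustration, not a library function:

```python
def alias_frequency(f_signal, f_s):
    """Apparent frequency of a pure tone at f_signal after sampling at f_s.

    The sampled spectrum repeats every f_s, so reduce modulo f_s, then
    fold the upper half of the band back into [0, f_s/2].
    """
    f = f_signal % f_s
    return min(f, f_s - f)

# A 25 kHz ultrasonic tone, sampled at 44.1 kHz, reappears at 19.1 kHz,
# squarely inside the audible band and indistinguishable from real signal.
print(alias_frequency(25_000, 44_100))   # 19100
```

Note that a tone below the Nyquist frequency passes through unchanged: `alias_frequency(19_000, 44_100)` is simply 19,000.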
Human hearing extends from roughly 20 Hz to 20 kHz (and, for most adults over thirty, substantially less at the top end, but 20 kHz is the canonical engineering requirement). Setting $f_{\max} = 20$ kHz, the Nyquist criterion requires $f_s > 40$ kHz.
But here is the subtlety that the Wikipedia summary tends to skip. The theorem assumes that the signal is perfectly bandlimited before sampling — meaning that all energy above $f_{\max}$ has been removed. This requires an anti-aliasing filter: a low-pass filter applied to the analogue signal before the analogue-to-digital converter samples it. If your anti-aliasing filter passes everything up to 20 kHz and blocks everything above it with perfect sharpness, then 40,001 Hz would suffice. The problem is that such a filter is physically unrealisable.
Real filters do not have vertical cutoffs. They have a transition band: a frequency range over which attenuation increases gradually from zero to full suppression. The steeper you want the transition, the higher the filter order, and for practical filter hardware in 1979 — op-amps, capacitors, inductors, no DSP to speak of — a “steep enough” filter meant a transition band of roughly 10% of the passband edge frequency. For a 20 kHz passband edge, that is about 2 kHz of transition band.
So the actual engineering requirement is not just $f_s > 40$ kHz. It is $f_s > 40$ kHz plus enough headroom for a realisable anti-aliasing filter. With $f_s = 44.1$ kHz, the Nyquist limit sits at $f_s/2 = 22.05$ kHz. The gap between the top of the audio band and the Nyquist limit is
$$22{,}050 - 20{,}000 = 2{,}050 \text{ Hz},$$

which is just over 10% of 20 kHz. This is enough to build a practical anti-aliasing filter with 1970s and early 1980s analogue components. Had the sampling rate been 41 kHz, the gap would have been only 500 Hz — far too narrow for affordable hardware. Had it been 50 kHz, the gap would have been more comfortable, but you would be storing about 13.4% more data per second for no audible benefit.
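The headroom arithmetic for a few candidate rates can be checked directly. A sketch, using the 10% rule of thumb quoted above; `filter_headroom` is an illustrative helper, not a standard function:

```python
def filter_headroom(f_s, f_pass=20_000):
    """Gap between the audio passband edge and the Nyquist limit f_s/2,
    returned in Hz and as a fraction of the passband edge."""
    gap = f_s / 2 - f_pass
    return gap, gap / f_pass

for rate in (41_000, 44_100, 48_000, 50_000):
    gap_hz, frac = filter_headroom(rate)
    print(f"{rate} Hz: {gap_hz:.0f} Hz headroom ({frac:.1%} of passband)")
```

At 41 kHz the headroom is 2.5% of the passband, hopeless for 1979 analogue filters; at 44.1 kHz it is just over 10%, which is where the rule of thumb says it needs to be.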
So 44.1 kHz is in the right neighbourhood given real-world filter constraints. But it is still a specific number. The question of why 44,100 rather than 44,000 or 43,500 or 44,800 is still open. That is where the VCRs come in.
The VCR Problem
In the late 1970s, Sony was developing what would eventually become the Compact Disc. One of the fundamental engineering problems was storage: where do you put the digital audio data? A 74-minute stereo recording at 16 bits and 44.1 kHz generates roughly 780 megabytes. In 1979, that was an absurd quantity of data. Hard drives with that capacity existed but cost tens of thousands of dollars and weighed as much as a washing machine. Dedicated digital tape formats existed in professional studios but were exotic and expensive [1].
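The storage figure quoted above is easy to verify; this is a sketch of the arithmetic, nothing more:

```python
# 74 minutes of stereo 16-bit PCM at 44.1 kHz
seconds          = 74 * 60
rate             = 44_100
bytes_per_sample = 16 // 8    # 16-bit samples
channels         = 2

total_bytes = seconds * rate * bytes_per_sample * channels
print(total_bytes)            # 783216000, i.e. roughly 780 megabytes
```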
The only affordable high-bandwidth magnetic recording medium available to consumer-facing engineers in 1979 was the VCR — the videocassette recorder. VHS and Betamax had recently become consumer products, and the tape and drive mechanism was cheap, reliable, and capable of storing several hours of high-bandwidth video signal. That video signal bandwidth was substantial: enough, in principle, to carry digital audio if you could get it onto the tape in the right form.
Sony’s solution was elegant to the point of audacity. Rather than inventing a new tape format, they encoded digital audio samples as a black-and-white pseudo-video signal — patterns of light and dark pixels that a standard VCR recorded without modification, because as far as the VCR was concerned it was just receiving a monochrome video feed. The resulting device, the Sony PCM-1600 (1979), was a standalone unit that sat between a microphone preamplifier and a VCR, converting audio to fake video for recording and back to audio for playback [3].
The sampling rate of the audio was now determined not by any audio engineering consideration but by the geometry of the video signal. And the geometry of the video signal was fixed by the television broadcast standard — which brought entirely different historical contingencies into the calculation.
The NTSC Arithmetic
The NTSC standard — developed in North America and Japan — specifies 30 frames per second and 525 total scan lines per frame. Of those 525 lines, 35 are consumed by the vertical blanking interval (the time needed for the electron beam in a CRT to return from the bottom of the screen to the top). That leaves 490 active lines per frame actually carrying picture information.
Sony packed 3 audio samples into each active scan line. The audio sampling rate is then:
$$f_s = \underbrace{30}_{\text{frames/s}} \times \underbrace{490}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$

There it is. 44,100 Hz, emerging not from any consideration of human hearing or filter design, but from the frame rate and line count of the North American television standard.
The PAL Arithmetic
Now the European video standard, PAL, which was developed in the 1960s independently of NTSC and optimised for different priorities. PAL uses 25 frames per second and 625 total scan lines per frame. The vertical blanking interval consumes 37 lines, leaving 588 active lines per frame.
Sony packed 3 audio samples into each active PAL scan line as well. The sampling rate:
$$f_s = \underbrace{25}_{\text{frames/s}} \times \underbrace{588}_{\text{active lines/frame}} \times \underbrace{3}_{\text{samples/line}} = 44{,}100 \text{ Hz}.$$

The same number.
Let that settle for a moment. NTSC: 30 frames per second, 490 active lines. PAL: 25 frames per second, 588 active lines. Different frame rates. Different line counts. Developed on different continents for different broadcast environments. And yet $30 \times 490 = 25 \times 588 = 14{,}700$, so multiplying by 3 gives 44,100 in both cases.
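Both derivations fit in a few lines; a sketch, with the frame and line counts quoted above:

```python
# NTSC: 30 frames/s, 525 total lines, 35 lost to vertical blanking
# PAL:  25 frames/s, 625 total lines, 37 lost to vertical blanking
NTSC_FRAMES, NTSC_LINES, NTSC_BLANKED = 30, 525, 35
PAL_FRAMES,  PAL_LINES,  PAL_BLANKED  = 25, 625, 37
SAMPLES_PER_LINE = 3

ntsc_rate = NTSC_FRAMES * (NTSC_LINES - NTSC_BLANKED) * SAMPLES_PER_LINE
pal_rate  = PAL_FRAMES  * (PAL_LINES  - PAL_BLANKED)  * SAMPLES_PER_LINE

print(ntsc_rate, pal_rate)   # 44100 44100: two standards, one sampling rate
```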
This is not coincidence in any deep sense — NTSC and PAL were both designed to fill approximately the same video bandwidth, just with different tradeoffs between temporal resolution (frame rate) and spatial resolution (line count). But for Sony’s VCR encoding scheme, the numerical convergence was enormously convenient: a single PCM processor running at 44.1 kHz could record to either NTSC or PAL video equipment without any change to the audio electronics. The same master machine could work in Tokyo and in Frankfurt.
The arithmetic is, I think, one of those moments where a coincidence that is perfectly explicable in hindsight still feels satisfying in the way that a physical derivation feels satisfying. You set up the constraints — fill the video bandwidth, pack an integer number of samples per line, keep the number of samples small enough to fit in a line’s worth of data — and the number 44,100 falls out of two independent calculations like a constant of nature. It is not a constant of nature. It is a contingent product of mid-twentieth-century broadcast engineering. But the mathematics does not care.
From Tape to Disc
When Philips and Sony sat down to negotiate the Red Book standard — the technical specification for the Compact Disc, finalised in 1980 and commercially launched in 1982 — both companies brought existing infrastructure to the table [3]. Both had been building digital audio equipment for several years. Both had PCM processors running in professional studios. Both had catalogues of digital masters recorded on VCR tape. And all of that equipment ran at 44.1 kHz, because all of it had been built to interface with the video tape standard that made digital audio recording practically affordable in the first place.
Changing the sampling rate for the CD would have required rebuilding the entire mastering chain: new PCM processors, new format conversion hardware, new master tape libraries. The economic and logistical cost would have been enormous. The 44.1 kHz rate was not chosen for the CD because it was optimal in any absolute engineering sense. It was chosen because it was already there [1], [2].
This is a pattern worth recognising. Major technical standards are rarely chosen by optimisation from first principles. They are chosen by consolidating what already exists. The QWERTY keyboard layout was optimised for typewriter mechanisms that no longer exist. The 60 Hz AC frequency in North America was set by Westinghouse generators installed in the 1890s. The 44.1 kHz CD sampling rate was set by VCR tape recorders that were obsolete within a decade of the CD’s launch.
The Other Rates
Not all digital audio runs at 44.1 kHz, and the coexistence of different rates in the modern audio industry is the direct legacy of 44.1 kHz’s awkward origins.
48 kHz is the professional broadcast and studio standard. It is used in digital video, in DAT tape, in most professional audio interfaces, and in the digital audio embedded in broadcast television signals. Why 48? Broadcast infrastructure needed a rate with clean integer relationships to the 32 kHz rate used in early satellite and ISDN broadcast systems. The relationship $48 = \frac{3}{2} \times 32$ is exact, which makes synchronisation straightforward. 44.1 kHz has no such clean relationship with anything in broadcast engineering.
The ratio between the two dominant rates is $48 / 44.1 = 160 / 147$. This fraction — irreducible, inelegant, non-obvious — is the source of essentially every sample-rate conversion problem in audio post-production. When a CD master (44.1 kHz) is prepared for broadcast (48 kHz), a sample-rate converter must interpolate 147 samples up to 160 samples, or downsample 160 samples to 147, at every moment. The process introduces small errors, and doing it well requires significant computational effort. Every time a musician’s recording moves between the consumer and professional audio worlds, it passes through this fractional bottleneck. Two standards that could have been made compatible were instead set by completely independent historical processes, and we have been paying the computational tax ever since.
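Both ratios can be checked with the standard library's exact rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

# The consumer/professional ratio reduces to the awkward 160/147 ...
print(Fraction(48_000, 44_100))   # 160/147

# ... while 48 kHz and 32 kHz are related by a clean 3/2.
print(Fraction(48_000, 32_000))   # 3/2
```

The denominators are what matter for a sample-rate converter: converting 44.1 kHz material to 48 kHz means producing 160 output samples for every 147 input samples, with no smaller integer ratio available.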
96 kHz and 192 kHz are marketed as “high-resolution audio.” Here the physics gets genuinely murky and the claims made by the audio industry deserve some scepticism. Human hearing above 20 kHz is, for most adults, genuinely absent — not reduced, but absent, because the outer hair cells in the cochlea that respond to those frequencies progressively die from the teenage years onward and are not replaced. The argument for high sampling rates is typically one of two things: first, that ultrasonic content can cause intermodulation distortion, where sum and difference frequencies of ultrasonic components fall back into the audible band; second, that a higher sampling rate allows for a more relaxed anti-aliasing filter with better phase behaviour within the audible band.
Both effects are real and measurable in laboratory conditions. Whether they are audible under controlled double-blind listening conditions is a separate and more contested question. The published evidence is not strong. What is not contested is that 96 kHz files are twice the size of 44.1 kHz files, and 192 kHz files are more than four times the size, for the same bit depth and the same number of audio channels. Whether that storage cost buys anything audible remains, on the current state of the literature, an open question.
The Irony
Here is the situation we are actually in. The canonical digital audio format — 16-bit, 44.1 kHz PCM, the format that defined CD quality for a generation and that remains the standard for music distribution — is physically a photograph of analogue video tape. The digitisation of music was made possible by television engineering. The specific number that defines the fidelity of every CD ever pressed is determined by the frame rates and line counts of 1970s broadcast television standards, which were themselves determined by the capabilities of 1940s CRT technology and the political negotiations of early broadcast licensing bodies.
When someone tells you that 44.1 kHz is the “natural” or “perfect” sampling rate for audio, they are, without knowing it, paying tribute to the NTSC standards committee of 1941 and the PAL engineers of the 1960s. The number carries history in it the way a fossil carries the structure of a long-dead organism. It is the right number, in the sense that it works. Its rightness has nothing to do with the reasons it was chosen.
I find this genuinely satisfying rather than disappointing. The history of physics and engineering is full of contingent numbers that turned out to be good enough, and whose goodness was only rationalised after the fact. The metre was originally defined as one ten-millionth of the distance from the equator to the North Pole along the Paris meridian — an arbitrary geodetic choice that turned out to produce a unit of length that is remarkably convenient for human-scale physics. The kilogram was a cylinder of platinum-iridium alloy in a vault outside Paris for over a century. 44,100 Hz is in good company.
The Archaeology of a Number
The numbers we inherit from engineering history are rarely arbitrary at every level simultaneously. 44,100 Hz is not arbitrary at the level of sampling theory: it satisfies the Nyquist criterion with enough headroom for a physically realisable anti-aliasing filter, given 1970s component technology. That is a genuine constraint, and the number sits in the right region of parameter space for it.
But it is arbitrary at a deeper level: it is the specific number that happened to fit a video tape format that happened to be affordable in 1979, a format that was itself determined by broadcast standards that were set for entirely unrelated reasons decades earlier. The chain of contingencies runs: 1940s television engineering defines NTSC and PAL frame rates and line counts; 1970s consumer VCR technology makes those tape formats cheap; 1979 Sony engineers encode digital audio as fake video; the arithmetic of the video formats fixes the sampling rate at 44,100 Hz; that rate gets locked into the CD standard in 1980; 44.1 kHz becomes the defining frequency of a digital music format that ships billions of units over the following four decades.
Science and engineering produce exact numbers from messy contingencies. The number 44,100 is simultaneously a theorem output (it satisfies a well-defined engineering constraint), a historical accident (it is determined by the specific video tape hardware that existed in 1979), and an institutional fossil (it outlasted the VCRs that created it by four decades and counting). All three things are true at the same time.
The VCRs are gone. The sampling rate remains.
References
[1] Pohlmann, K. C. (2010). Principles of Digital Audio (6th ed.). McGraw-Hill.
[2] Watkinson, J. (2001). The Art of Digital Audio (3rd ed.). Focal Press.
[3] Immink, K. A. S. (1998). The compact disc story. Journal of the AES, 46(5), 458–465.
[4] Nyquist, H. (1928). Certain topics in telegraph transmission theory. Transactions of the AIEE, 47(2), 617–644.
[5] Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the IRE, 37(1), 10–21.