Latency on Sebastian Spicker

When Musicians Lock In: Coupled Oscillators and the Physics of Ensemble Synchronisation

Thu, 08 Feb 2024 00:00:00 +0000

The problem is ancient and the language for it is recent. In any ensemble — a string quartet, a jazz rhythm section, an orchestra — musicians with slightly different internal tempos must stay together. They do this by listening to each other. But what, exactly, does “listening to each other” do to their timing? And what happens when the listening channel is imperfect — delayed by the speed of sound across a wide stage, or by a network cable crossing a continent? The answer involves a differential equation that was not written to describe music.

This post extends the latency analysis in “Latency in Networked Music Performance” with the dynamical systems framework that underlies it.

Two Clocks on a Board

The first documented observation of coupled-oscillator synchronisation was made not by a musician but by a physicist. In 1665, Christiaan Huygens, confined to bed with illness, was watching two pendulum clocks mounted on the same wooden beam. Over the course of the night, the pendulums settled into anti-phase oscillation, swinging in opposite directions at exactly the same rate. He reported it to his father:

“I have noticed a remarkable effect which no-one has observed before… two clocks on the same board always end up in mutual synchrony.”

The mechanism was mechanical coupling through the beam. Each pendulum’s swing imparted a small impulse to the wood; the other pendulum felt this as a perturbation to its rhythm. Small perturbations, accumulated over hours, drove the clocks into a shared frequency and a fixed phase relationship.

This is the prototype of every ensemble synchronisation problem. Each musician is a clock. The acoustic environment — the air in the room, the reflected sound from the walls, the vibrations through the stage floor — is the wooden beam.

The Kuramoto Model

Yoshiki Kuramoto formalised the mathematics of coupled oscillators in 1975, motivated by biological synchronisation problems: firefly flashing, circadian rhythms, cardiac pacemakers. His model considers $N$ oscillators, each with a phase $\theta_i(t)$ evolving according to:

$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N.$$

The first term, $\omega_i$, is the oscillator’s natural frequency: the tempo it would maintain in isolation. The natural frequencies are drawn from a distribution $g(\omega)$, which in a real ensemble reflects the spread of individual preferred tempos among the players. The second term is the coupling: each oscillator is attracted toward the phases of all others, with strength $K/N$. The factor $1/N$ keeps the total coupling intensive (independent of ensemble size) as $N$ grows large.

Musically: $\theta_i$ is the phase of musician $i$’s internal pulse at a given moment, $\omega_i$ is their preferred tempo if playing alone, and $K$ is the coupling strength — how much they adjust their tempo in response to what they hear from the others.
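A minimal numerical sketch of this equation, assuming plain forward-Euler integration with NumPy. The function names, the step size `dt`, and the two-player example values (120 and 123 BPM, $K = 1$) are illustrative choices, not taken from any study.

```python
import numpy as np

def kuramoto_rhs(theta, omega, K):
    """Right-hand side of the Kuramoto equation for all oscillators at once."""
    # diffs[i, j] = theta_j - theta_i
    diffs = theta[None, :] - theta[:, None]
    # Averaging over j implements (1/N) * sum_j sin(theta_j - theta_i).
    return omega + K * np.sin(diffs).mean(axis=1)

def integrate(theta0, omega, K, dt=1e-3, steps=20_000):
    """Forward-Euler integration; returns the phases after steps * dt seconds."""
    theta = np.array(theta0, dtype=float)
    omega = np.asarray(omega, dtype=float)
    for _ in range(steps):
        theta = theta + dt * kuramoto_rhs(theta, omega, K)
    return theta

# Two players who would drift apart alone: 120 BPM and 123 BPM.
omega = 2 * np.pi * np.array([120.0, 123.0]) / 60.0   # rad/s
theta = integrate([0.0, 0.0], omega, K=1.0)
phase_gap = theta[1] - theta[0]   # settles to a constant when the pair locks
```

For $N = 2$ the phase gap $\phi = \theta_2 - \theta_1$ obeys $\dot\phi = \Delta\omega - K \sin\phi$, so the pair locks with a constant gap $\arcsin(\Delta\omega / K)$ whenever $|\Delta\omega| < K$. In this example the faster player ends up leading by a fixed fraction of a beat instead of running away.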

The Order Parameter and the Phase Transition

To measure the degree of synchronisation, Kuramoto introduced the complex order parameter:

$$r(t)\, e^{i\psi(t)} = \frac{1}{N} \sum_{j=1}^{N} e^{i\theta_j(t)},$$

where $r(t) \in [0, 1]$ is the coherence of the ensemble and $\psi(t)$ is the collective mean phase. When $r = 0$, the phases are uniformly spread around the unit circle — the ensemble is incoherent. When $r = 1$, all phases coincide — perfect synchrony. In a live ensemble, $r$ is a direct measure of rhythmic cohesion, though of course not one you can read off a score.
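As a sketch, the order parameter is a one-line computation once the phases are available as an array. The function name `order_parameter` and the two test configurations are illustrative.

```python
import numpy as np

def order_parameter(theta):
    """Return (r, psi) for an array of phases given in radians."""
    z = np.exp(1j * np.asarray(theta, dtype=float)).mean()
    return float(np.abs(z)), float(np.angle(z))

# All players at the same phase: r is 1 and psi is that phase.
r_sync, psi_sync = order_parameter([0.3, 0.3, 0.3, 0.3])

# Eight phases spaced evenly around the circle: the vectors cancel, r is ~0.
r_spread, _ = order_parameter(np.linspace(0.0, 2 * np.pi, 8, endpoint=False))
```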

Substituting the order parameter into the equation of motion:

$$\frac{d\theta_i}{dt} = \omega_i + K r \sin(\psi - \theta_i).$$

Each oscillator now interacts only with the mean-field quantities $r$ and $\psi$, not with every other oscillator individually. The coupling pulls each musician toward the collective mean phase with a force proportional to both $K$ (how attentively they listen) and $r$ (how coherent the group already is).
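The substitution can be spelled out in one step. Multiplying the definition of the order parameter by $e^{-i\theta_i}$ gives

$$r\, e^{i(\psi - \theta_i)} = \frac{1}{N} \sum_{j=1}^{N} e^{i(\theta_j - \theta_i)},$$

and taking the imaginary part of both sides yields

$$r \sin(\psi - \theta_i) = \frac{1}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i).$$

The right-hand side is the coupling sum in the original equation of motion up to the factor $K$, which is why the sum over all players collapses to the single term $K r \sin(\psi - \theta_i)$.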

This mean-field form reveals the essential physics. For small $K$, oscillators with widely differing $\omega_i$ cannot follow the mean field: they drift at their own frequencies, and $r \approx 0$. At a critical coupling strength $K_c$, a macroscopic fraction of oscillators first locks to a shared frequency, and $r$ begins to grow continuously from zero. For a unimodal, symmetric frequency distribution $g(\omega)$ whose density at the mean frequency $\bar\omega$ is $g(\bar\omega)$:

$$K_c = \frac{2}{\pi\, g(\bar\omega)}.$$

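Just above $K_c$, the coherence grows with a square-root law. The prefactor depends on the curvature of $g$ at its peak; for the Lorentzian distribution introduced below, the leading-order form is: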

$$r \approx \sqrt{\frac{K - K_c}{K_c}}, \qquad K \gtrsim K_c.$$

This is a second-order (continuous) phase transition — the same mathematical structure as a ferromagnet approaching the Curie temperature, where spontaneous magnetisation appears continuously above a critical coupling. The musical ensemble and the magnetic material belong to the same universality class, governed by the same mean-field exponent $\frac{1}{2}$.

Above $K_c$, the fraction of oscillators that are locked (synchronised to the mean-field frequency) can be computed explicitly. An oscillator with natural frequency $\omega_i$ locks to the mean field if $|\omega_i - \bar\omega| \leq Kr$. For a Lorentzian distribution $g(\omega) = \frac{\gamma/\pi}{(\omega - \bar\omega)^2 + \gamma^2}$, this yields:

$$r = \sqrt{1 - \frac{K_c}{K}}, \qquad K_c = 2\gamma,$$

which is the exact solution of the self-consistency equation for the Kuramoto model with Lorentzian frequency spread, valid for $K \geq K_c$; below $K_c$ the only solution is $r = 0$ (Strogatz, 2000).
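To see how the Lorentzian result relates to the square-root law near threshold, the two expressions can be evaluated side by side. This is a small sketch; the function names and the half-width `gamma = 1.0` are illustrative.

```python
import math

def coherence_exact(K, gamma):
    """Steady-state r for a Lorentzian g(omega) of half-width gamma."""
    Kc = 2.0 * gamma
    return math.sqrt(1.0 - Kc / K) if K > Kc else 0.0

def coherence_near_threshold(K, gamma):
    """Leading-order square-root law, meaningful only for K slightly above Kc."""
    Kc = 2.0 * gamma
    return math.sqrt((K - Kc) / Kc) if K > Kc else 0.0

gamma = 1.0                                   # illustrative half-width in rad/s
for K in (1.5, 2.02, 2.5, 4.0):
    print(K, coherence_exact(K, gamma), coherence_near_threshold(K, gamma))
```

The two agree closely just above $K_c = 2\gamma$ and separate as $K$ grows: the exact expression saturates toward $r = 1$, while the near-threshold approximation keeps increasing and stops being meaningful.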

The physical reading is direct: whether an ensemble locks into a shared pulse or drifts apart is a threshold phenomenon. A group of musicians with similar preferred tempos has a peaked $g(\bar\omega)$, giving a low $K_c$ — they synchronise easily with minimal attentive listening. A group with widely varying individual tempos needs stronger, more sustained coupling to cross the threshold. This is not a matter of musical discipline; it is a material property of the ensemble.
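A worked example of the threshold formula, under the assumption (not made in the text) that preferred tempos are Gaussian-distributed with a standard deviation given in BPM. The function names and the spreads of 2 and 8 BPM are hypothetical.

```python
import math

def kc_from_peak_density(g_peak):
    """K_c = 2 / (pi * g(omega_bar)) for a unimodal, symmetric g."""
    return 2.0 / (math.pi * g_peak)

def kc_gaussian_bpm(sigma_bpm):
    """K_c in rad/s if preferred tempos are Gaussian with std sigma_bpm."""
    sigma_omega = 2.0 * math.pi * sigma_bpm / 60.0           # BPM -> rad/s
    g_peak = 1.0 / (sigma_omega * math.sqrt(2.0 * math.pi))  # Gaussian peak height
    return kc_from_peak_density(g_peak)

tight = kc_gaussian_bpm(2.0)    # players within a couple of BPM of each other
loose = kc_gaussian_bpm(8.0)    # a much wider spread of preferred tempos
```

Under this assumption $K_c$ scales linearly with the spread, so the ensemble with four times the tempo spread needs four times the coupling to reach threshold.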

Concert Hall Applause: Neda et al. (2000)

The Kuramoto model is not only a theoretical construction. Neda et al. (2000) applied it to concert hall applause — one of the most direct real-world demonstrations of coupled-oscillator dynamics in a musical context.

They recorded applause in Romanian and Hungarian theatres and found that audiences spontaneously alternate between two distinct states. In the incoherent regime, each audience member claps at their own preferred rate (typically 2–3 Hz). Through acoustic coupling, in which each person hears the room-averaged sound and adjusts their clapping, the audience gradually synchronises to a shared, slower frequency (around 1.5 Hz): the synchronised regime.

The transitions between the two regimes are quantitatively consistent with the Kuramoto phase transition: the emergence of synchrony corresponds to $K$ crossing $K_c$ as people progressively pay more attention to the collective sound. Furthermore, Neda et al. document a characteristic phenomenon when synchrony breaks down: individual clapping frequency approximately doubles as audience members attempt to re-establish coherence. This frequency-doubling — a feature of nonlinear oscillator systems near instability — is exactly what the delayed response of coupling near $K_c$ predicts.

The paper is a useful pedagogical artefact: every music student has experienced concert hall applause, and hearing that it undergoes a physically measurable phase transition makes the connection between physics and musical experience concrete.

Latency and the Limits of Networked Ensemble Performance

In standard acoustic ensemble playing, the coupling delay is the propagation time for sound to cross the ensemble: at $343\ \text{m/s}$, across a ten-metre stage, roughly 30 ms. This is why orchestral seating is arranged with attention to who needs to hear whom first.

In networked music performance (NMP), the coupling delay $\tau$ is much larger: tens to hundreds of milliseconds depending on geographic distance and network infrastructure. The Kuramoto model generalises naturally to include this delay:

$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin\!\bigl(\theta_j(t - \tau) - \theta_i(t)\bigr).$$

Each musician hears the others’ phases as they were $\tau$ seconds ago, not as they are now.
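The delayed equation can be explored numerically with a history buffer. The sketch below assumes forward-Euler integration and, as a simplification of my own, a constant pre-history: before $t = 0$ every player hears the others' initial phases. Function names and parameter values are illustrative.

```python
import numpy as np

def simulate_delayed(omega, K, tau, dt=1e-3, t_end=20.0, seed=0):
    """Forward-Euler integration of the delayed Kuramoto model.

    Returns an array of shape (steps + 1, N) holding the phase history.
    """
    rng = np.random.default_rng(seed)
    omega = np.asarray(omega, dtype=float)
    n = omega.size
    steps = int(round(t_end / dt))
    delay_steps = int(round(tau / dt))

    history = np.empty((steps + 1, n))
    history[0] = rng.uniform(0.0, 2.0 * np.pi, n)
    for k in range(steps):
        now = history[k]
        # Assumption: before t = 0 everyone "hears" the initial phases.
        heard = history[max(k - delay_steps, 0)]
        # coupling[i] = (1/N) * sum_j sin(theta_j(t - tau) - theta_i(t))
        coupling = np.sin(heard[None, :] - now[:, None]).mean(axis=1)
        history[k + 1] = now + dt * (omega + K * coupling)
    return history

def coherence(theta):
    """Order-parameter magnitude r for one snapshot of phases."""
    return float(np.abs(np.exp(1j * theta).mean()))

omega = np.full(20, 4.0 * np.pi)                 # 20 players, all at 120 BPM
r_no_delay = coherence(simulate_delayed(omega, K=4.0, tau=0.0)[-1])
```

Re-running with larger `tau` and comparing `coherence(...)` on the final row is one way to watch coordination change with delay. The outcome depends on the initial phases and on the pre-history assumption, so it illustrates the mechanism rather than predicting any particular ensemble.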

In a synchronised state where all oscillators share the collective frequency $\bar\omega$ and phase $\psi(t) = \bar\omega t$, the delayed phase signal is $\psi(t - \tau) = \bar\omega t - \bar\omega\tau$. The effective coupling force contains a factor $\cos(\bar\omega\tau)$: the delay introduces a phase shift that reduces the useful component of the coupling. The critical coupling with delay is therefore:

$$K_c(\tau) = \frac{K_c(0)}{\cos(\bar\omega \tau)}.$$
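The factor $\cos(\bar\omega\tau)$ follows from the angle-subtraction identity. A heuristic sketch, assuming the mean field is already rotating at $\bar\omega$ so that $\psi(t-\tau) = \psi(t) - \bar\omega\tau$:

$$\sin\!\bigl(\psi(t-\tau) - \theta_i(t)\bigr) = \cos(\bar\omega\tau)\,\sin\!\bigl(\psi(t) - \theta_i(t)\bigr) - \sin(\bar\omega\tau)\,\cos\!\bigl(\psi(t) - \theta_i(t)\bigr).$$

The first term has the same form as the undelayed mean-field coupling, but with $K$ replaced by $K\cos(\bar\omega\tau)$; requiring $K\cos(\bar\omega\tau) \geq K_c(0)$ gives the expression above. The second term does not restore phase; near lock, where $\psi \approx \theta_i$, it acts mainly as a shift of the common frequency. This argument ignores that shift, so it should be read as a first approximation, not a full stability analysis.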

As $\tau$ increases, $K_c(\tau)$ grows: synchronisation requires progressively stronger coupling (more attentive adjustment) to compensate for the information lag. The denominator $\cos(\bar\omega\tau)$ reaches zero when $\bar\omega\tau = \pi/2$. At this point $K_c(\tau) \to \infty$: no finite coupling strength can maintain synchrony. The critical delay is:

$$\tau_c = \frac{\pi}{2\bar\omega}.$$

For an ensemble performing at 120 BPM, the beat frequency is $\bar\omega = 2\pi \times 2\ \text{Hz} = 4\pi\ \text{rad/s}$:

$$\tau_c = \frac{\pi}{2 \times 4\pi} = \frac{1}{8}\ \text{s} = 125\ \text{ms}.$$

This is a remarkably clean result. The Kuramoto model with delay predicts that ensemble synchronisation collapses at around 125 ms one-way delay for a standard performance tempo. The empirical literature on NMP — from LoLa deployments across European conservatories to controlled latency studies in the lab — consistently finds that rhythmic coherence degrades noticeably above 50–80 ms and becomes essentially unworkable above 100–150 ms one-way. The model and the data agree.

The derivation also shows why faster tempos are harder in NMP: $\tau_c \propto 1/\bar\omega$, so doubling the tempo halves the tolerable latency. An ensemble performing at 240 BPM in a distributed setting faces a theoretical ceiling of 62.5 ms, which rules out transcontinental performance for most repertoire.
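The tempo dependence is easy to tabulate. A small sketch of the two formulas above; the function names are illustrative, and `kc0` stands for $K_c(0)$ in whatever units $K$ is measured.

```python
import math

def beat_omega(bpm):
    """Angular beat frequency in rad/s for a tempo in beats per minute."""
    return 2.0 * math.pi * bpm / 60.0

def critical_delay_ms(bpm):
    """tau_c = pi / (2 * omega_bar), converted to milliseconds."""
    return 1000.0 * math.pi / (2.0 * beat_omega(bpm))

def kc_with_delay(kc0, bpm, tau_ms):
    """K_c(tau) = K_c(0) / cos(omega_bar * tau); infinite at or beyond tau_c."""
    phase_lag = beat_omega(bpm) * tau_ms / 1000.0
    if phase_lag >= math.pi / 2.0:
        return math.inf
    return kc0 / math.cos(phase_lag)

for bpm in (60, 120, 240):
    print(bpm, "BPM ->", critical_delay_ms(bpm), "ms")
```

At 120 BPM, a delay of half the critical value (62.5 ms) already raises the required coupling by a factor of $\sqrt{2}$, since $\cos(\pi/4) = 1/\sqrt{2}$.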

Brains in Sync: EEG Hyperscanning

The Kuramoto framework has recently been applied at a neural level. EEG hyperscanning — simultaneous EEG recording from multiple participants during a shared musical activity — has shown that musicians performing together exhibit inter-brain synchronisation: coherent cortical oscillations at the frequency of the music are measurable between players (Lindenberger et al., 2009; Müller et al., 2013). The phase coupling between brains during joint performance is significantly higher than during solo performance and higher than for musicians playing simultaneously but without acoustic coupling.

This suggests that the Kuramoto coupling operates at two levels: the acoustic (each musician hears the other and adjusts physical timing) and the neural (each musician’s cortical oscillators entrain to the shared musical pulse). The question of which level is primary — whether neural synchrony causes or follows from acoustic synchrony — remains open.

A 2023 review by Demos and Palmer argues that pairwise Kuramoto-type coupling is insufficient to capture full ensemble dynamics. Group-level effects — the differentiation between leader and follower roles, the emergence of collective timing that no individual would produce alone — require nonlinear dynamical frameworks that go beyond mean-field averaging. The model that adequately describes a string quartet may need to be richer than the one that describes a population of identical fireflies.

What This Means for Teaching

The Kuramoto model reframes standard rehearsal intuitions in physical terms.

“Listen more” translates to “increase your effective coupling constant $K$.” A musician who plays without attending to others has set $K \approx 0$ and will drift freely according to their own $\omega_i$. Listening — actively adjusting tempo in response to what you hear — is not metaphorical. It is the physical mechanism of coupling, and its effect is to pull you toward the mean phase $\psi$ with a force $Kr\sin(\psi - \theta_i)$.

“Our tempos are too different” is a claim about $g(\bar\omega)$ and therefore about $K_c$. A group with a wide spread of natural tempos needs more and stronger listening to synchronise. This is not a moral failing but a parameter; it suggests that ensemble warm-up time or explicit tempo negotiation before a performance serves to reduce the spread of natural frequencies before the coupling has to do all the work.

Latency as a rehearsal experiment can be made explicit. Artificially delaying the acoustic return to one musician in an ensemble — via headphone monitoring with variable delay — allows students to experience directly how the coordination degrades as $\tau$ increases toward $\tau_c$. They feel the system approaching the phase transition without the theoretical framework, but the framework makes the experience interpretable afterward.

The click track replaces peer-to-peer Kuramoto coupling with an external forcing term: each musician locks to a shared reference with fixed $\omega$ rather than adjusting dynamically to the group mean. This eliminates the phase transition but also eliminates the adaptive dynamics — the micro-timing fluctuations and expressive rubato — that characterise live ensemble playing. It is a pedagogically important distinction, even if studios routinely make the pragmatic choice.
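One common way to write this, with a forcing strength $F$ and click frequency $\Omega$ introduced here purely for illustration, is

$$\frac{d\theta_i}{dt} = \omega_i + F \sin(\Omega t - \theta_i).$$

In the frame rotating with the click, $\phi_i = \theta_i - \Omega t$ obeys $\dot\phi_i = (\omega_i - \Omega) - F \sin\phi_i$, which has a stable fixed point whenever $|\omega_i - \Omega| < F$. Each player’s locking condition involves only their own detuning from the click, not $r$ or the other players, which is why no collective threshold $K_c$ appears.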

References

Demos, A. P., & Palmer, C. (2023). Social and nonlinear dynamics unite: Musical group synchrony. Trends in Cognitive Sciences, 27(11), 1008–1018. https://doi.org/10.1016/j.tics.2023.08.005
Huygens, C. (1665). Letter to his father Constantijn Huygens, 26 February 1665. In Œuvres complètes de Christiaan Huygens, Vol. 5, p. 243. Martinus Nijhoff, 1893.
Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In H. Araki (Ed.), International Symposium on Mathematical Problems in Theoretical Physics (Lecture Notes in Physics, Vol. 39, pp. 420–422). Springer.
Kuramoto, Y. (1984). Chemical Oscillations, Waves, and Turbulence. Springer.
Lindenberger, U., Li, S.-C., Gruber, W., & Müller, V. (2009). Brains swinging in concert: Cortical phase synchronization while playing guitar. BMC Neuroscience, 10, 22. https://doi.org/10.1186/1471-2202-10-22
Müller, V., Sänger, J., & Lindenberger, U. (2013). Intra- and inter-brain synchronization during musical improvisation on the guitar. PLOS ONE, 8(9), e73852. https://doi.org/10.1371/journal.pone.0073852
Neda, Z., Ravasz, E., Vicsek, T., Brechet, Y., & Barabási, A.-L. (2000). Physics of the rhythmic applause. Physical Review E, 61(6), 6987–6992. https://doi.org/10.1103/PhysRevE.61.6987
Strogatz, S. H. (2000). From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D: Nonlinear Phenomena, 143(1–4), 1–20. https://doi.org/10.1016/S0167-2789(00)00094-4
Strogatz, S. H. (2003). Sync: How Order Emerges from Chaos in the Universe, Nature, and Daily Life. Hyperion.

Changelog

2026-01-14: Updated the author list for the Demos (2023) Trends in Cognitive Sciences reference to the published two authors (Demos & Palmer). The five names previously listed were from a different Demos paper.
2026-01-14: Changed “period-doubling” to “frequency-doubling.” When the clapping frequency doubles, the period halves; “frequency-doubling” is the precise term in this context.

How Low Can You Go? Measuring Latency for Networked Music Performance Across Europe

Sat, 26 Aug 2023 00:00:00 +0000

This post summarises a manuscript submitted with Benjamin Bentz and colleagues from the RAPP Lab network. The paper is not yet peer-reviewed; numbers and conclusions are based on operational measurements collected 2020–2023. Feedback welcome — particularly from anyone who has run similar measurements on non-European or wireless-last-mile links.

The Problem

Musicians playing together in the same room experience acoustic propagation delay of roughly 3 ms per metre of separation — essentially free latency that most ensembles never consciously register. When you distribute musicians across a network, you inherit that propagation cost plus everything the signal chain adds on top: buffers, codec processing, routing hops, switching overhead.

Conventional video-conferencing (Zoom, Teams, etc.) operates at end-to-end delays of roughly 100–300 ms. That is comfortable for speech — human conversation tolerates round-trip delays up to about 250 ms before it starts to feel wrong — but it is well above the threshold at which ensemble timing breaks down. The NMP literature generally puts the upper bound for synchronous rhythmic playing somewhere between 20 and 30 ms one-way, with considerable variation by tempo, instrument, and whether the performers can see each other [Carôt 2011; Tsioutas & Xylomenos 2021; Medina Victoria 2019].

Specialised low-latency systems cut the processing overhead by avoiding compression, using hardware-accelerated video pipelines, and riding research-and-education networks that offer better jitter characteristics than commodity internet. Two of the better-known ones are LoLa (Low Latency Audio Visual Streaming System, developed at Conservatorio G. Tartini Trieste) and MVTP (Modular Video Transmission Platform, developed at CESNET in Prague). We deployed both at Hochschule für Musik und Tanz Köln as part of the RAPP Lab collaboration and spent about two and a half years measuring them.

The Latency Budget

End-to-end latency in NMP is cumulative and non-recoverable. Once delay enters the chain, nothing downstream can subtract it. The budget looks like:

\[ L_\text{total} = L_\text{capture} + L_\text{buffer} + L_\text{network} + L_\text{playback} \]

Network latency $ L_\text{network} $ includes propagation (roughly $ d / (2 \times 10^8) $ seconds for a fibre link of distance $ d $ metres, accounting for the refractive index of glass) plus per-hop processing. Everything else is system-dependent.
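As a rough worked example of the propagation term, assuming the $2 \times 10^8$ m/s figure above. The function name is illustrative, and the distances are round numbers, not measured routes.

```python
FIBRE_SPEED_M_PER_S = 2.0e8   # approx. c / 1.5, as quoted in the text

def propagation_ms(distance_km):
    """One-way propagation delay in ms over distance_km of fibre."""
    return distance_km * 1000.0 / FIBRE_SPEED_M_PER_S * 1000.0

for km in (100, 500, 1000):
    print(km, "km ->", propagation_ms(km), "ms one-way")
```

This works out to about 5 ms per 1,000 km of fibre. Real fibre routes are longer than the straight-line distance between two cities, so applying the formula to an air distance gives an optimistic estimate.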

The key insight is that $ L_\text{buffer} $ is not fixed — it is a consequence of jitter. A jittery link forces larger buffers to avoid underruns, which directly adds to perceived latency. This is why raw bandwidth is almost irrelevant for NMP: a 1 Gbps link with erratic jitter will perform worse than a 100 Mbps link with deterministic behaviour.
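To make the jitter-to-latency link concrete, here is a simplified sketch. It assumes a 48 kHz sample rate, buffers sized in multiples of a 32-frame block, and a buffer that must span the full arrival-time variation; all three are illustrative assumptions, not settings from our deployment.

```python
import math

def buffer_latency_ms(frames, sample_rate=48_000):
    """Delay added by holding `frames` audio frames before playback."""
    return 1000.0 * frames / sample_rate

def frames_to_cover_jitter(jitter_ms, sample_rate=48_000, block=32):
    """Smallest multiple of `block` frames spanning `jitter_ms` of variation."""
    needed = math.ceil(jitter_ms * sample_rate / 1000.0)
    return max(block, block * math.ceil(needed / block))

for jitter_ms in (0.5, 3.0):
    frames = frames_to_cover_jitter(jitter_ms)
    print(jitter_ms, "ms jitter ->", frames, "frames,", buffer_latency_ms(frames), "ms")
```

Under these assumptions, going from 0.5 ms to 3 ms of jitter moves the buffer from 32 to 160 frames and its latency contribution from under 1 ms to over 3 ms, independent of how much bandwidth the link offers.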

What We Measured and How

Network RTT. ICMP ping, 1,000 packets per run. We report the median as a robust summary; the mean is too sensitive to the occasional rogue packet.
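A toy illustration of why the median is the safer summary. The RTT values below are invented for the example and are not from our measurements.

```python
import statistics

# Hypothetical run: nine well-behaved packets and one rogue packet.
rtt_ms = [7.0] * 9 + [70.0]

median_rtt = statistics.median(rtt_ms)   # unaffected by the single outlier
mean_rtt = statistics.mean(rtt_ms)       # pulled upward by the single outlier
```

Here the median stays at 7.0 ms while the mean rises to 13.3 ms, nearly double the typical value, because of one packet.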

End-to-end audio latency. An audio signal-loop: transmit a test signal from site A to site B, have site B return it immediately, estimate round-trip delay by cross-correlation. One-way latency = signal-loop RTT / 2. This method captures local processing and buffering at both ends in addition to the network leg, which is what actually matters for a musician.
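The cross-correlation step can be sketched with NumPy as follows. This is an illustration of the technique on synthetic data, not the measurement script we used; the 48 kHz sample rate, the noise-burst test signal, and the 480-sample loop delay are all assumptions for the example.

```python
import numpy as np

def estimate_delay_samples(sent, received):
    """Lag (in samples) at which `received` best matches `sent`."""
    corr = np.correlate(received, sent, mode="full")
    # In "full" mode, index len(sent) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(sent) - 1)

sample_rate = 48_000                      # assumed audio sample rate
rng = np.random.default_rng(0)
sent = rng.standard_normal(1_000)         # noise burst as test signal
received = np.zeros(3_000)
received[480:1_480] = sent                # simulated loop: 480-sample round trip

rtt_ms = 1000.0 * estimate_delay_samples(sent, received) / sample_rate
one_way_ms = rtt_ms / 2.0
```

A broadband test signal such as noise gives a single sharp correlation peak; a pure tone would produce many near-equal peaks one period apart and make the lag ambiguous.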

Video latency. Component-based estimation (capture frame cadence + processing pipeline + display). We did not have a frame-accurate video loopback method, so treat these numbers as estimates rather than precision measurements. That caveat matters less than it might seem because, as you will see, video was always slower than audio by a wide enough margin that it did not drive the operational decisions.

Firewall impact. A controlled 4-hour session on the Cologne–Vienna link, alternating between a DMZ configuration (direct research-backbone access) and a transparent enterprise firewall, logging packet loss and decoder instability.

Six partner institutions, air distances from 175 to 1,465 km, measurements collected between October 2020 and March 2023.

Results

Audio latency

Partner (from Cologne)	Air distance (km)	Median RTT (ms)	One-way audio latency (ms)
Prague	535	5.0	7.5
Vienna	745	7.0	9.5
Detmold	175	7.5	10.0
Trieste	775	10.0	12.5
Rome	1,090	17.5	20.0
Tallinn	1,465	19.5	22.0–22.5

The number that jumps out immediately: Detmold (175 km away) has higher latency than Vienna (745 km away). This is a routing issue, not a physics one. The Detmold link was traversing a less efficient campus path that added extra hops before reaching the research backbone. Prague, by contrast, was connected via a particularly short routing path and achieved the lowest latency of any link despite not being the geographically closest.

The practical implication: geographic distance is a poor predictor of achievable latency. Measure RTT; do not estimate from a map.

Video latency

Estimated one-way video latency was 20–35 ms across all configurations, with the dominant contributions coming from frame cadence (at 60 fps, you wait up to 16.7 ms for a frame to be captured regardless of what the network is doing) and buffering at the decoder. In every deployment, video consistently lagged audio. Musicians unsurprisingly fell back on audio for synchronization and treated video as a supplementary cue — useful for expressive and social information, not for timing.

The firewall experiment

This is the result I find most important for anyone planning a similar deployment.

Metric	DMZ (no firewall)	With enterprise firewall	Change
Dropped audio packets	0.002%	0.052%	+26×
Audio buffer realignments/hour	0.3	3.9	+13×
Dropped video frames	0.04%	0.74%	+18×
Additional latency	—	0.5–1.0 ms	—

The raw latency increase (0.5–1.0 ms) is small and largely irrelevant. The packet loss and buffer event increases are not. A 26-fold increase in dropped audio packets on an otherwise uncongested link means the firewall is doing something — likely deep packet inspection or stateful tracking — that introduces enough irregularity to destabilise small audio buffers. This forces you to either accept dropouts or increase buffer size, and increasing buffer size increases latency.
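The fold changes follow directly from the two measured columns of the table. A short check, with dictionary keys chosen here for readability:

```python
# Values copied from the firewall table above.
dmz = {"dropped_audio_pct": 0.002, "realignments_per_hour": 0.3, "dropped_video_pct": 0.04}
firewall = {"dropped_audio_pct": 0.052, "realignments_per_hour": 3.9, "dropped_video_pct": 0.74}

# Ratio of the firewall value to the DMZ value for each metric.
fold_change = {name: firewall[name] / dmz[name] for name in dmz}
```

The ratios come out at roughly 26 for dropped audio packets, 13 for buffer realignments, and 18.5 for dropped video frames; the table lists the last of these as 18×.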

The message: if your institution requires traffic inspection for security policy compliance, you are paying a latency tax, and it is substantial. The tax shows up less in the raw delay number than in instability that forces larger buffers.

Discussion

Based on the measured latencies and reported musical tolerances from the literature, I would roughly characterise the links as follows:

Prague, Vienna, Detmold, Trieste (7.5–12.5 ms): Compatible with most repertoire including rhythmically demanding chamber music. Musicians in our sessions reported the interaction as “natural” or “like being in the same room” at these latencies.
Rome (20 ms): Usable with attention to repertoire and tempo. Slower movements and music where tight rhythmic locking is not the primary aesthetic concern work well. Rhythmically dense passages at fast tempi become harder.
Tallinn (22–22.5 ms): At the upper edge of the comfortable range. Still usable — we ran a concert collaboration in March 2023 — but musicians adapt their interaction strategies, leaning more on musical anticipation than reactive synchronization.

What is notably absent from this data: anything outside the European research-network context. All six links ran on GÉANT or national backbone equivalents with favourable jitter characteristics. The numbers almost certainly do not transfer directly to commodity internet, satellite links, or mixed-topology paths.

Limitations I want to be explicit about. The video latency estimates are component-based, not directly measured, so treat that 20–35 ms range with appropriate skepticism. The firewall comparison is a single 4-hour session on a single link; I would not want to extrapolate too aggressively to other firewall vendors or configurations. And this is an operational measurement study, not a controlled perceptual experiment — I cannot tell you from this data at precisely what latency threshold a given ensemble will declare a session unusable, because that depends on the music, the musicians, and factors I did not measure.

Practical Takeaways

For anyone setting up a similar system:

Measure RTT before committing to a partner institution. A 100 km difference in air distance can easily be swamped by routing differences.
Get DMZ placement if at all possible. The firewall results suggest this matters more than any other single configuration decision.
Minimise campus hops between your endpoint and the research backbone. Each additional switching layer adds jitter risk.
Use small audio buffers and monitor for underruns. If your baseline RTT is good, your buffer can be small; if underruns increase, that is an early warning that network stability is degrading before packet loss becomes audible.
Accept that video will lag audio and design your session accordingly. This is not a system failure; it is a consequence of how video pipelines work at low latency. Plan for it.

References

Carôt, A. (2011). Low latency audio streaming for Internet-based musical interaction. Advances in Multimedia and Interactive Technologies. https://doi.org/10.4018/978-1-61692-831-5.ch015

Drioli, C., Allocchio, C., & Buso, N. (2013). Networked performances and natural interaction via LOLA. LNCS, 7990, 240–250. https://doi.org/10.1007/978-3-642-40050-6_21

Medina Victoria, A. (2019). A method for the measurement of the latency tolerance range of Western musicians. Ph.D. dissertation, Cork Institute of Technology (now Munster Technological University).

Rottondi, C., Chafe, C., Allocchio, C., & Sarti, A. (2016). An overview on networked music performance technologies. IEEE Access, 4, 8823–8843. https://doi.org/10.1109/ACCESS.2016.2628440

Tsioutas, K. & Xylomenos, G. (2021). On the impact of audio characteristics to the quality of musicians experience in network music performance. JAES, 69(12), 914–923. https://doi.org/10.17743/jaes.2021.0041

Ubik, S., Halak, J., Kolbe, M., Melnikov, J., & Frič, M. (2021). Lessons learned from distance collaboration in live culture. AISC, 1378, 608–615. https://doi.org/10.1007/978-3-030-74009-2_77

Changelog

2026-01-20: Updated the Drioli et al. (2013) LNCS volume number to 7990 (ECLAP 2013 proceedings). Updated the Ubik et al. (2021) AISC volume number to 1378 and page range to 608–615. Updated the fifth author’s surname to “Frič.”