This post summarises a manuscript submitted with Benjamin Bentz and colleagues from the RAPP Lab network. The paper is not yet peer-reviewed; numbers and conclusions are based on operational measurements collected 2020–2023. Feedback welcome — particularly from anyone who has run similar measurements on non-European or wireless-last-mile links.


The Problem

Musicians playing together in the same room experience acoustic propagation delay of roughly 3 ms per metre of separation — essentially free latency that most ensembles never consciously register. When you distribute musicians across a network, you inherit that propagation cost plus everything the signal chain adds on top: buffers, codec processing, routing hops, switching overhead.

Conventional video-conferencing (Zoom, Teams, etc.) operates at end-to-end delays of roughly 100–300 ms. That is comfortable for speech — human conversation tolerates round-trip delays up to about 250 ms before it starts to feel wrong — but it is well above the threshold at which ensemble timing breaks down. The NMP literature generally puts the upper bound for synchronous rhythmic playing somewhere between 20 and 30 ms one-way, with considerable variation by tempo, instrument, and whether the performers can see each other [Carôt 2011; Tsioutas & Xylomenos 2021; Medina Victoria 2019].

Specialised low-latency systems cut the processing overhead by avoiding compression, using hardware-accelerated video pipelines, and riding research-and-education networks that offer better jitter characteristics than commodity internet. Two of the better-known ones are LoLa (Low Latency Audio Visual Streaming System, developed at Conservatorio G. Tartini Trieste) and MVTP (Modular Video Transmission Platform, developed at CESNET in Prague). We deployed both at Hochschule für Musik und Tanz Köln as part of the RAPP Lab collaboration and spent about two and a half years measuring them.


The Latency Budget

End-to-end latency in NMP is cumulative and non-recoverable. Once delay enters the chain, nothing downstream can subtract it. The budget looks like:

\[ L_\text{total} = L_\text{capture} + L_\text{buffer} + L_\text{network} + L_\text{playback} \]

Network latency \( L_\text{network} \) includes propagation (roughly \( d / (2 \times 10^8) \) seconds for a fibre link of distance \( d \) metres, accounting for the refractive index of glass) plus per-hop processing. Everything else is system-dependent.
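As a rough worked example of the propagation term, the sketch below evaluates \( d / (2 \times 10^8) \) for a hypothetical 1,000 km fibre path. The function name and the example distance are illustrative assumptions, not values from the measurements.

```python
FIBRE_SPEED_M_PER_S = 2e8  # approximate signal speed in glass, as used in the text

def propagation_delay_ms(path_length_km: float) -> float:
    """One-way propagation delay in milliseconds for a fibre path of the given length."""
    path_length_m = path_length_km * 1_000
    return path_length_m / FIBRE_SPEED_M_PER_S * 1_000

one_way = propagation_delay_ms(1_000)  # hypothetical 1,000 km path
print(f"one-way: {one_way:.1f} ms, round trip: {2 * one_way:.1f} ms")
# prints: one-way: 5.0 ms, round trip: 10.0 ms
```

This covers only the propagation term. Per-hop processing comes on top, and the physical fibre route is usually longer than the straight-line distance between two sites.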

The key insight is that \( L_\text{buffer} \) is not fixed — it is a consequence of jitter. A jittery link forces larger buffers to avoid underruns, which directly adds to perceived latency. This is why raw bandwidth is almost irrelevant for NMP: a 1 Gbps link with erratic jitter will perform worse than a 100 Mbps link with deterministic behaviour.
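To make the jitter-to-buffer link concrete, here is a minimal sketch of how buffer size translates into delay. The 48 kHz sample rate, the 32-frame block size, the jitter values, and the sizing rule itself (round the jitter up to a whole number of blocks) are all assumptions chosen for illustration; neither LoLa nor MVTP is claimed to size buffers this way.

```python
import math

SAMPLE_RATE_HZ = 48_000  # assumed sample rate
BLOCK_FRAMES = 32        # assumed audio block size in frames

def buffer_latency_ms(frames: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Delay contributed by a buffer holding `frames` audio frames."""
    return frames / sample_rate_hz * 1_000

def frames_to_absorb_jitter(jitter_ms: float) -> int:
    """Smallest whole number of blocks whose duration covers the given jitter."""
    frames_needed = jitter_ms * SAMPLE_RATE_HZ / 1_000
    blocks = max(1, math.ceil(frames_needed / BLOCK_FRAMES))
    return blocks * BLOCK_FRAMES

for jitter_ms in (0.5, 2.0, 5.0):  # hypothetical peak-to-peak jitter values
    frames = frames_to_absorb_jitter(jitter_ms)
    print(f"jitter {jitter_ms} ms -> {frames} frames -> "
          f"{buffer_latency_ms(frames):.2f} ms of buffer delay")
```

Under these assumptions, every extra millisecond of jitter the buffer has to absorb reappears as roughly a millisecond of added latency, independent of link bandwidth.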


What We Measured and How

Network RTT. ICMP ping, 1,000 packets per run. We report the median as a robust summary; the mean is too sensitive to the occasional rogue packet.
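The sketch below illustrates why the median is the more robust summary. It assumes ping output in the common Linux format (lines containing `time=<value> ms`); the sample values and the address are made up for illustration and are not taken from our runs.

```python
import re
import statistics

def rtt_samples_ms(ping_output: str) -> list[float]:
    """Extract per-packet RTTs from ping output lines containing 'time=<value> ms'."""
    return [float(value) for value in re.findall(r"time=([\d.]+) ms", ping_output)]

# Hypothetical samples: nine well-behaved packets and one rogue packet.
hypothetical_rtts = [7.0, 7.1, 6.9, 7.0, 7.2, 6.8, 7.0, 7.1, 6.9, 48.0]
sample_output = "\n".join(
    f"64 bytes from 192.0.2.1: icmp_seq={seq} ttl=55 time={rtt} ms"
    for seq, rtt in enumerate(hypothetical_rtts, start=1)
)

samples = rtt_samples_ms(sample_output)
print(f"median: {statistics.median(samples):.2f} ms")
print(f"mean:   {statistics.mean(samples):.2f} ms")
```

With these made-up samples, the single 48 ms packet pulls the mean to about 11.1 ms while the median stays at 7.0 ms.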

End-to-end audio latency. An audio signal-loop: transmit a test signal from site A to site B, have site B return it immediately, estimate round-trip delay by cross-correlation. One-way latency = signal-loop RTT / 2. This method captures local processing and buffering at both ends in addition to the network leg, which is what actually matters for a musician.
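A minimal sketch of the cross-correlation step, in plain Python so it runs without extra libraries. The noise-burst test signal, the 48 kHz sample rate, and the simulated 720-sample loop delay are assumptions for illustration; this is not the measurement code used in the study.

```python
import random

def estimate_delay_samples(sent: list[float], received: list[float], max_lag: int) -> int:
    """Return the lag (in samples) that maximises the cross-correlation of sent and received."""
    def correlation_at(lag: int) -> float:
        return sum(s * received[i + lag]
                   for i, s in enumerate(sent) if i + lag < len(received))
    return max(range(max_lag + 1), key=correlation_at)

SAMPLE_RATE_HZ = 48_000  # assumed sample rate
TRUE_DELAY = 720         # hypothetical loop delay: 720 samples = 15 ms at 48 kHz

rng = random.Random(0)
probe = [rng.uniform(-1.0, 1.0) for _ in range(512)]  # noise burst as test signal
# Simulated return path: delayed, attenuated copy of the probe plus a little noise.
returned = [0.0] * TRUE_DELAY + [0.8 * x for x in probe] + [0.0] * 256
returned = [x + rng.gauss(0.0, 0.05) for x in returned]

lag = estimate_delay_samples(probe, returned, max_lag=1_000)
loop_rtt_ms = lag / SAMPLE_RATE_HZ * 1_000
print(f"signal-loop RTT: {loop_rtt_ms:.1f} ms, one-way estimate: {loop_rtt_ms / 2:.1f} ms")
```

The brute-force search is slow for long recordings; an FFT-based correlation would be the usual choice in practice. Halving the loop RTT assumes the two directions are symmetric.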

Video latency. Component-based estimation (capture frame cadence + processing pipeline + display). We did not have a frame-accurate video loopback method, so treat these numbers as estimates rather than precision measurements. That caveat matters less than it might seem because, as you will see, video was always slower than audio by a wide enough margin that it did not drive the operational decisions.

Firewall impact. A controlled 4-hour session on the Cologne–Vienna link, alternating between a DMZ configuration (direct research-backbone access) and a transparent enterprise firewall, logging packet loss and decoder instability.

The links connected Cologne to six partner institutions at air distances from 175 to 1,465 km; measurements were collected between October 2020 and March 2023.


Results

Audio latency

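| Partner (from Cologne) | Air distance (km) | Median RTT (ms) | One-way audio latency (ms) |
| --- | --- | --- | --- |
| Prague | 535 | 5.0 | 7.5 |
| Vienna | 745 | 7.0 | 9.5 |
| Detmold | 175 | 7.5 | 10.0 |
| Trieste | 775 | 10.0 | 12.5 |
| Rome | 1,090 | 17.5 | 20.0 |
| Tallinn | 1,465 | 19.5 | 22.0–22.5 |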

The number that jumps out immediately: Detmold (175 km away) has higher latency than Vienna (745 km away). This is a routing issue, not a physics one. The Detmold link was traversing a less efficient campus path that added extra hops before reaching the research backbone. Prague, by contrast, was connected via a particularly short routing path and achieved the lowest latency of any link despite not being the geographically closest.

The practical implication: geographic distance is a poor predictor of achievable latency. Measure RTT; do not estimate from a map.
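One quick way to see the mismatch is to rank the links by air distance and by median RTT and compare the two orderings. The figures below are copied from the audio latency table; the variable names are illustrative.

```python
# (partner, air distance in km, median RTT in ms), copied from the audio latency table
links = [
    ("Prague", 535, 5.0), ("Vienna", 745, 7.0), ("Detmold", 175, 7.5),
    ("Trieste", 775, 10.0), ("Rome", 1090, 17.5), ("Tallinn", 1465, 19.5),
]

by_distance = [name for name, _, _ in sorted(links, key=lambda link: link[1])]
by_rtt = [name for name, _, _ in sorted(links, key=lambda link: link[2])]

for name in by_distance:
    distance_rank = by_distance.index(name) + 1
    rtt_rank = by_rtt.index(name) + 1
    print(f"{name:8s} distance rank {distance_rank}, RTT rank {rtt_rank} "
          f"(shift {rtt_rank - distance_rank:+d})")
```

Detmold is the closest partner but only third by RTT, while Prague and Vienna each move up one place.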

Video latency

Estimated one-way video latency was 20–35 ms across all configurations, with the dominant contributions coming from frame cadence (at 60 fps, you wait up to 16.7 ms for a frame to be captured regardless of what the network is doing) and buffering at the decoder. In every deployment, video consistently lagged audio. Musicians unsurprisingly fell back on audio for synchronization and treated video as a supplementary cue — useful for expressive and social information, not for timing.
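The frame-cadence contribution is simple arithmetic, sketched below. Only the 60 fps figure comes from the text; the other frame rates are included for comparison, and the average wait assumes the visual event falls at a uniformly random point within the frame period.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between successive frames at the given frame rate."""
    return 1_000 / fps

for fps in (30, 50, 60):  # 60 fps is the rate quoted in the text
    interval = frame_interval_ms(fps)
    print(f"{fps} fps: worst-case capture wait {interval:.1f} ms, "
          f"average {interval / 2:.1f} ms")
```

Processing and display delays come on top of this capture wait, which is why the video estimate sits above the audio figures even on the shortest links.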

The firewall experiment

This is the result I find most important for anyone planning a similar deployment.

| Metric | DMZ (no firewall) | With enterprise firewall | Change |
| --- | --- | --- | --- |
| Dropped audio packets | 0.002% | 0.052% | +26× |
| Audio buffer realignments/hour | 0.3 | 3.9 | +13× |
| Dropped video frames | 0.04% | 0.74% | +18× |
| Additional latency | | | 0.5–1.0 ms |

The raw latency increase (0.5–1.0 ms) is small and largely irrelevant. The packet loss and buffer event increases are not. A 26-fold increase in dropped audio packets on an otherwise uncongested link means the firewall is doing something — likely deep packet inspection or stateful tracking — that introduces enough irregularity to destabilise small audio buffers. This forces you to either accept dropouts or increase buffer size, and increasing buffer size increases latency.
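The fold changes can be rechecked directly from the firewall table. The values below are copied from it; the variable names are illustrative.

```python
# (metric, DMZ value, firewall value), copied from the firewall table
rows = [
    ("Dropped audio packets (%)", 0.002, 0.052),
    ("Audio buffer realignments per hour", 0.3, 3.9),
    ("Dropped video frames (%)", 0.04, 0.74),
]

fold_changes = {metric: firewall / dmz for metric, dmz, firewall in rows}
for metric, fold in fold_changes.items():
    print(f"{metric}: {fold:.1f}x")
```

The video figure works out to 18.5, which the table reports as 18.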

The message: if your institution requires traffic inspection for security policy compliance, you pay a latency tax that shows up as instability rather than as raw delay, and in this session that tax was substantial.


Discussion

Based on the measured latencies and reported musical tolerances from the literature, I would roughly characterise the links as follows:

  • Prague, Vienna, Detmold, Trieste (7.5–12.5 ms): Compatible with most repertoire including rhythmically demanding chamber music. Musicians in our sessions reported the interaction as “natural” or “like being in the same room” at these latencies.

  • Rome (20 ms): Usable with attention to repertoire and tempo. Slower movements and music where tight rhythmic locking is not the primary aesthetic concern work well. Rhythmically dense passages at fast tempi become harder.

  • Tallinn (22–22.5 ms): At the upper edge of the comfortable range. Still usable — we ran a concert collaboration in March 2023 — but musicians adapt their interaction strategies, leaning more on musical anticipation than reactive synchronization.

What is notably absent from this data: anything outside the European research-network context. All six links ran on GÉANT or national backbone equivalents with favourable jitter characteristics. The numbers almost certainly do not transfer directly to commodity internet, satellite links, or mixed-topology paths.

Limitations I want to be explicit about. The video latency estimates are component-based, not directly measured, so treat that 20–35 ms range with appropriate skepticism. The firewall comparison is a single 4-hour session on a single link; I would not want to extrapolate too aggressively to other firewall vendors or configurations. And this is an operational measurement study, not a controlled perceptual experiment — I cannot tell you from this data at precisely what latency threshold a given ensemble will declare a session unusable, because that depends on the music, the musicians, and factors I did not measure.


Practical Takeaways

For anyone setting up a similar system:

  1. Measure RTT before committing to a partner institution. A 100 km difference in air distance can easily be swamped by routing differences.
  2. Get DMZ placement if at all possible. In our firewall comparison, the cost of inspection showed up in stability rather than delay: dropped audio packets rose 26-fold behind the firewall.
  3. Minimise campus hops between your endpoint and the research backbone. Each additional switching layer adds jitter risk.
  4. Use small audio buffers and monitor for underruns. If your baseline jitter is low, your buffer can be small; a rising underrun count is an early warning that network stability is degrading, before packet loss becomes audible.
  5. Accept that video will lag audio and design your session accordingly. This is not a system failure; it is a consequence of how video pipelines work at low latency. Plan for it.
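
For the underrun monitoring in point 4, a minimal sketch of a sliding-window counter follows. The class name, the one-hour window, and the warning threshold of three underruns per hour are hypothetical choices, loosely placed between the 0.3 and 3.9 realignments per hour seen in the firewall comparison; how underruns are detected depends on your audio system and is not shown.

```python
from collections import deque

class UnderrunMonitor:
    """Counts buffer underruns in a sliding time window and flags a rising rate."""

    def __init__(self, window_s: float = 3600.0, warn_per_window: int = 3):
        self.window_s = window_s                 # hypothetical: one-hour window
        self.warn_per_window = warn_per_window   # hypothetical warning threshold
        self._events = deque()

    def record(self, timestamp_s: float) -> bool:
        """Register one underrun; return True if the window now holds too many."""
        self._events.append(timestamp_s)
        # Drop events that have fallen out of the window.
        while self._events[0] <= timestamp_s - self.window_s:
            self._events.popleft()
        return len(self._events) >= self.warn_per_window

monitor = UnderrunMonitor()
# Hypothetical underrun timestamps in seconds since session start.
for t in (120.0, 4_000.0, 4_300.0, 4_450.0):
    if monitor.record(t):
        print(f"t={t:.0f}s: underrun rate above threshold, check the network path")
```

In this made-up trace, the isolated early underrun ages out of the window and only the cluster around 4,000 s triggers the warning.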

References

Carôt, A. (2011). Low latency audio streaming for Internet-based musical interaction. Advances in Multimedia and Interactive Technologies. https://doi.org/10.4018/978-1-61692-831-5.ch015

Drioli, C., Allocchio, C., & Buso, N. (2013). Networked performances and natural interaction via LOLA. LNCS, 7990, 240–250. https://doi.org/10.1007/978-3-642-40050-6_21

Medina Victoria, A. (2019). A method for the measurement of the latency tolerance range of Western musicians. Ph.D. dissertation, Cork Institute of Technology (now Munster Technological University).

Rottondi, C., Chafe, C., Allocchio, C., & Sarti, A. (2016). An overview on networked music performance technologies. IEEE Access, 4, 8823–8843. https://doi.org/10.1109/ACCESS.2016.2628440

Tsioutas, K. & Xylomenos, G. (2021). On the impact of audio characteristics to the quality of musicians experience in network music performance. JAES, 69(12), 914–923. https://doi.org/10.17743/jaes.2021.0041

Ubik, S., Halak, J., Kolbe, M., Melnikov, J., & Frič, M. (2021). Lessons learned from distance collaboration in live culture. AISC, 1378, 608–615. https://doi.org/10.1007/978-3-030-74009-2_77


Changelog

  • 2026-01-20: Updated the Drioli et al. (2013) LNCS volume number to 7990 (ECLAP 2013 proceedings). Updated the Ubik et al. (2021) AISC volume number to 1378 and page range to 608–615. Updated the fifth author’s surname to “Frič.”