Summary
There is a popular intuition in LLM engineering that context is a resource you should always spend freely: more background, more history, more examples — inevitably better answers. This intuition is wrong often enough to be dangerous. Context has a signal-to-noise structure, attention has a position-dependent bias, and the architecture that processes all of it scales quadratically. Adding irrelevant tokens does not leave performance neutral; it actively degrades it. This post argues for structured sparsity as a design principle: give a model exactly the context it needs for the decision it is making right now, and nothing else.
Background
The “more is more” assumption has an obvious origin. Transformers were designed to condition on sequences, and every new token in the context window is, in principle, available to every attention head. The release of models with 128k, 200k, and now million-token context windows reinforced the story: the constraint is gone, so pack in everything you have.
Two lines of empirical and theoretical work complicate this story.
The lost-in-the-middle problem. Liu et al. [1] showed that retrieval accuracy on multi-document question answering degrades sharply when the relevant passage appears in the middle of a long context, compared to the beginning or end. Performance on 20-document prompts dropped by more than 20 percentage points relative to the single-document baseline — not because the model lacked the information, but because it was buried. The effect is consistent across model families and persists at model scales where you would not expect it.
The complexity argument. Standard scaled dot-product attention [2] is
\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]
The \( QK^{\top} \) product is \( O(n^2) \) in sequence length \( n \). An inference-time KV cache avoids recomputing keys and values, but its memory grows linearly with \( n \), each newly decoded token still attends over all \( n \) cached positions, and the softmax normalises over a denominator that grows with \( n \). A head attending to 128,000 tokens is averaging over a vastly noisier signal than one attending to 512.
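For concreteness, here is a minimal NumPy sketch of the scaled dot-product attention above. The shapes and random inputs are illustrative only, not drawn from any particular model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n): quadratic in sequence length
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_k = 512, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

The `(n, n)` score matrix is where the quadratic cost lives; the softmax row normalisation is where the growing-denominator effect lives.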
The Idea
Context as a noisy channel
Think of the information reaching a given attention head as a noisy channel in the Shannon sense. The signal is the subset of tokens that are actually relevant to the current decoding step; the rest is noise. Signal-to-noise ratio is
\[ \text{SNR} = \frac{|\mathcal{S}|}{n - |\mathcal{S}|} \]
where \( \mathcal{S} \subset \{1, \ldots, n\} \) is the set of relevant token positions and \( n \) is the total context length. For a fixed task, \( |\mathcal{S}| \) is roughly constant, so SNR is a decreasing function of \( n \). Adding irrelevant context makes the problem strictly harder in this framing — it does not leave it unchanged.
This is a toy model, but it captures something real. The softmax in the attention head distributes a probability mass of 1.0 across \( n \) positions. If the attended sequence doubles in length and the relevant positions remain the same, each relevant position receives roughly half the probability mass it did before — unless the model’s learned attention patterns are precise enough to suppress the irrelevant positions to near-zero, which is a strong assumption.
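The dilution effect can be demonstrated with a toy softmax experiment. The logit values below are invented for illustration (not measured from any model): a few relevant positions receive a fixed, modest score advantage, and we track how much attention mass they retain as the context grows:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relevant_mass(n_total, n_relevant=4, signal_logit=2.0, noise_logit=0.0):
    """Attention mass landing on the relevant positions when their logits
    hold only a finite advantage over the irrelevant ones."""
    logits = np.full(n_total, noise_logit)
    logits[:n_relevant] = signal_logit  # relevant positions score higher, but not infinitely so
    return softmax(logits)[:n_relevant].sum()

for n in (512, 1024, 2048, 4096):
    print(n, round(relevant_mass(n), 3))
```

With these numbers, the mass on the relevant positions roughly halves each time \( n \) doubles, which is exactly the dilution the paragraph describes. Only a model whose learned attention pushes irrelevant logits far below the relevant ones escapes this.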
Position bias compounds the problem
Empirically, transformers exhibit a U-shaped recall curve over context position: tokens near the start (primacy) and tokens near the end (recency) are retrieved more reliably than tokens in the middle. If you stuff a long context with background material and bury the task-relevant information in the middle, you are fighting the architecture’s learned inductive bias.
The effect is roughly consistent with what would emerge if the model’s attention weight distribution were modelled as a mixture of a flat prior and a position-biased component. Under that model, increasing \( n \) inflates the flat component’s contribution and dilutes the position-biased recovery of relevant tokens.
What structured sparsity looks like in practice
The corrective is not to artificially shrink context windows — it is to ensure that at each decision point, the context is populated with tokens that are relevant to that decision. Three practical expressions of this principle:
Retrieval over recall. Rather than prepending a full document corpus, retrieve the top-\( k \) passages at query time. This keeps \( n \) small and \( |\mathcal{S}| / n \) high.
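A minimal sketch of the retrieval step, assuming you already have embedding vectors for the query and passages from some encoder (the function name and toy vectors here are hypothetical):

```python
import numpy as np

def top_k_passages(query_vec, passage_vecs, passages, k=3):
    """Return the k passages most cosine-similar to the query, rather than
    prepending the whole corpus to the prompt."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = P @ q                     # cosine similarity of each passage to the query
    idx = np.argsort(-sims)[:k]      # indices of the k best matches
    return [passages[i] for i in idx]

passages = ["solar panel specs", "company history", "inverter manual"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])  # illustrative 2-d embeddings
query = np.array([1.0, 0.0])
print(top_k_passages(query, vecs, passages, k=2))
```

Only the \( k \) retrieved passages enter the context, so \( n \) stays bounded by \( k \) regardless of corpus size.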
Rolling summarisation. Compress history into a running summary and discard the raw transcript. The summary carries the signal; the raw transcript is mostly noise by the time it is several turns old.
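The bookkeeping is simple; a sketch, with `summarise` standing in for an LLM summarisation call (a hypothetical hook, not a real API):

```python
def update_summary(summary, turns, summarise, max_turns=4):
    """Keep only the last `max_turns` raw turns; fold anything older into
    the running summary via `summarise`, then discard the raw text."""
    overflow = turns[:-max_turns]
    recent = turns[-max_turns:]
    if overflow:
        summary = summarise(summary, overflow)
    return summary, recent

def summarise(summary, turns):
    # Placeholder: a real implementation would call an LLM here.
    return summary + " | " + "; ".join(turns)

summary, recent = update_summary("(empty)", [f"turn {i}" for i in range(1, 7)], summarise)
print(summary)  # compressed history
print(recent)   # raw recent turns only
```

The context at each turn is then `summary + recent` rather than the full transcript, so its length is bounded no matter how long the conversation runs.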
Phased orchestration. Decompose a multi-step task into phases, each with its own focused context. Phase \( t \) receives only the output of phase \( t-1 \) (plus any task-specific retrieval), not the entire accumulated history of all prior phases. This keeps per-phase \( n \) bounded even as the total task length grows.
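A minimal sketch of the orchestration loop, with hypothetical `run` and `retrieve` hooks standing in for agent/LLM calls:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Phase:
    name: str
    run: Callable[[str], str]                      # stand-in for an agent/LLM call
    retrieve: Optional[Callable[[str], str]] = None  # optional per-phase retrieval hook

def run_phased(phases, task):
    """Each phase sees only the previous phase's output (plus its own
    retrieval), never the full accumulated history."""
    carry = task
    for phase in phases:
        context = carry
        if phase.retrieve is not None:
            context = context + "\n" + phase.retrieve(carry)
        carry = phase.run(context)  # per-phase context length stays bounded
    return carry

plan = Phase("plan", lambda c: c + "\n[plan]")
execute = Phase("execute", lambda c: c + "\n[result]")
print(run_phased([plan, execute], "task: audit the logs"))
```

Because each phase's input is the previous phase's output rather than the concatenation of all prior phases, per-phase \( n \) is bounded even as the number of phases grows.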
Discussion
The argument above is not novel — pieces of it appear scattered across the alignment and inference-efficiency literatures. What I think is underappreciated is that it applies to agentic systems with particular force. A single-shot prompt has a fixed, author-controlled context. An agent accumulating tool outputs, prior reasoning traces, and retrieved documents across a long task trajectory will naturally inflate its own context window over time — and degrade its own performance as a result, without any external change in task difficulty.
The naive fix is to give the agent a bigger context window. The correct fix is to never let it accumulate a bloated context in the first place.
Limitations. The SNR framing treats all irrelevant tokens as equally noisy, which is false — some irrelevant tokens are actively misleading (distractors) [3], others are benign fillers. The quadratic cost argument mostly applies to full-attention models; sparse and linear attention variants have different scaling properties. And “relevant” is itself a function of the model’s knowledge, which makes the optimisation circular in practice.
What would make this publishable. Controlled ablation: fix a task, vary context length by inserting null tokens of increasing volume, measure performance as a function of \( n \) and of the position of the relevant material. Do this across model sizes and families to separate architectural effects from scale effects. The lost-in-the-middle paper is close to this but does not isolate null-token inflation from document-count inflation.
References
[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://arxiv.org/abs/2307.03172
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
[3] Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., Schärli, N., & Zhou, D. (2023). Large language models can be easily distracted by irrelevant context. Proceedings of the 40th International Conference on Machine Learning (ICML 2023). https://arxiv.org/abs/2302.00093
The phased orchestration argument in the Discussion section is not just theoretical hand-waving — I have been building a concrete implementation of it. The current state lives at sebastianspicker/phased-agent-orchestration. It is rough, but the core idea is there: each agent phase gets a bounded, purpose-built context rather than the full accumulated history. Feedback very welcome.