<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Information-Theory on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/information-theory/</link>
    <description>Recent content in Information-Theory on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Sun, 22 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/information-theory/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>More Context Is Not Always Better</title>
      <link>https://sebastianspicker.github.io/posts/more-context-not-always-better/</link>
      <pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/more-context-not-always-better/</guid>
      <description>The intuition that feeding a language model more information improves its outputs is wrong often enough to matter. Here is why, and what to do about it.</description>
      <content:encoded><![CDATA[<h2 id="summary">Summary</h2>
<p>There is a popular intuition in LLM engineering that context is a resource
you should always spend freely: more background, more history, more examples —
inevitably better answers. This intuition is wrong often enough to be
dangerous. Context has a signal-to-noise structure, attention has a
position-dependent bias, and the architecture that processes all of it scales
quadratically. Adding irrelevant tokens does not leave performance neutral; it
actively degrades it. This post argues for <em>structured sparsity</em> as a design
principle: give a model exactly the context it needs for the decision it is
making right now, and nothing else.</p>
<hr>
<h2 id="background">Background</h2>
<p>The &ldquo;more is more&rdquo; assumption has an obvious origin. Transformers were
designed to condition on sequences, and every new token in the context window
is, in principle, available to every attention head. The release of models with
128k, 200k, and now million-token context windows reinforced the story: the
constraint is gone, so pack in everything you have.</p>
<p>Two lines of empirical and theoretical work complicate this story.</p>
<p><strong>The lost-in-the-middle problem.</strong> Liu et al. <a href="#ref-1">[1]</a> showed that retrieval
accuracy on multi-document question answering degrades sharply when the
relevant passage appears in the <em>middle</em> of a long context, compared to the
beginning or end. On 20-document prompts, accuracy dropped by more than
20 percentage points between the best and worst positions of the relevant
document — not because the model lacked the information, but because it was
buried. The effect is consistent across model families and persists at model
scales where you would not expect it.</p>
<p><strong>The complexity argument.</strong> Standard scaled dot-product attention <a href="#ref-2">[2]</a> is</p>
\[
  \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]<p>The \( QK^{\top} \) product is \( O(n^2) \) in sequence length \( n \).
The inference-time KV cache avoids recomputing keys and values, but its memory
still grows linearly with \( n \), and the softmax normalises over a
denominator that grows with \( n \). A head attending to 128k tokens is
averaging over a vastly noisier signal than one attending to 512.</p>
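<p>A back-of-envelope sketch makes the scaling concrete. The model shape below
(32 layers, 32 heads, head dimension 128, fp16) is illustrative, not any
particular model:</p>
<pre><code class="language-python"># Back-of-envelope scaling of attention with context length n.
# Assumed (illustrative) model shape: 32 layers, 32 heads, head dim 128, fp16.
layers, heads, d_head, bytes_per = 32, 32, 128, 2

def score_matrix_entries(n):
    # QK^T is an n-by-n matrix per head per layer: quadratic in n.
    return layers * heads * n * n

def kv_cache_bytes(n):
    # K and V caches each hold n vectors of size d_head per head per layer: linear in n.
    return 2 * layers * heads * n * d_head * bytes_per

for n in (512, 128_000):
    print(f"n={n:>7}: score entries={score_matrix_entries(n):.3e}, "
          f"KV cache={kv_cache_bytes(n) / 2**30:.2f} GiB")
</code></pre>
<p>Between \( n = 512 \) and \( n = 128\text{k} \) the score matrix grows by a
factor of \( 250^2 = 62{,}500 \) while the KV cache grows by a factor of 250;
both numbers fall straight out of the shapes above.</p>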
<hr>
<h2 id="the-idea">The Idea</h2>
<h3 id="context-as-a-noisy-channel">Context as a noisy channel</h3>
<p>Think of the information reaching a given attention head as a noisy channel in
the Shannon sense. The signal is the subset of tokens that are actually
relevant to the current decoding step; the rest is noise. Signal-to-noise ratio
is</p>
\[
  \text{SNR} = \frac{|\mathcal{S}|}{n - |\mathcal{S}|}
\]<p>where \( \mathcal{S} \subset \{1, \ldots, n\} \) is the set of relevant token
positions and \( n \) is total context length. For a fixed task, \( |\mathcal{S}| \)
is roughly constant. So SNR is a <em>decreasing</em> function of \( n \). Adding
irrelevant context makes the problem strictly harder in this framing — it does
not leave it unchanged.</p>
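<p>A two-line sketch of the monotonicity, with an illustrative fixed
\( |\mathcal{S}| = 8 \):</p>
<pre><code class="language-python"># SNR = |S| / (n - |S|) with a fixed relevant-set size |S| (here 8, illustrative).
def snr(n, s=8):
    return s / (n - s)

for n in (512, 1024, 4096, 32768):
    print(f"n={n:>6}: SNR={snr(n):.5f}")
</code></pre>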
<p>This is a toy model, but it captures something real. The softmax in the
attention head distributes a probability mass of 1.0 across \( n \) positions.
If the attended sequence doubles in length and the relevant positions remain the
same, each relevant position receives roughly half the probability mass it did
before — unless the model&rsquo;s learned attention patterns are precise enough to
suppress the irrelevant positions to near-zero, which is a strong assumption.</p>
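<p>The dilution is easy to verify numerically. In the toy setup below, the
number of relevant positions and the logit boost they receive are illustrative
choices, not fitted to any model:</p>
<pre><code class="language-python">import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def relevant_mass(n, n_relevant=8, boost=2.0):
    # n_relevant positions get a fixed logit boost; the rest sit at logit 0.
    logits = [boost] * n_relevant + [0.0] * (n - n_relevant)
    weights = softmax(logits)
    return sum(weights[:n_relevant])

for n in (1024, 2048, 4096):
    print(f"n={n}: mass on relevant positions = {relevant_mass(n):.4f}")
</code></pre>
<p>Doubling \( n \) roughly halves the probability mass on the relevant
positions, as the softmax arithmetic predicts, so long as the boost is not
large enough to dominate the denominator.</p>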
<h3 id="position-bias-compounds-the-problem">Position bias compounds the problem</h3>
<p>Empirically, transformers exhibit a U-shaped recall curve over context
position: tokens near the start (primacy) and tokens near the end (recency)
are retrieved more reliably than tokens in the middle. If you stuff a long
context with background material and bury the task-relevant information in the
middle, you are fighting the architecture&rsquo;s learned inductive bias.</p>
<p>The effect is roughly consistent with what would emerge if the model&rsquo;s
attention weight distribution were modelled as a mixture of a flat prior and a
position-biased component. Under that model, increasing \( n \) inflates the
flat component&rsquo;s contribution and dilutes the position-biased recovery of
relevant tokens.</p>
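<p>That mixture model can be simulated directly. Everything below (the mixture
weight, the edge-decay scale) is an illustrative assumption, not a fitted
parameter:</p>
<pre><code class="language-python">import math

def recall_weight(pos, n, lam=0.7):
    # Toy mixture: a flat prior (1/n) mixed with a U-shaped positional
    # component that favours the first and last few positions. lam is the
    # mixture weight on the positional component; the decay scale 8 is
    # likewise illustrative.
    edge = min(pos, n - 1 - pos)            # distance to the nearest context edge
    positional = math.exp(-edge / 8.0)      # decays away from both edges
    z = sum(math.exp(-min(p, n - 1 - p) / 8.0) for p in range(n))
    return lam * positional / z + (1 - lam) / n

for n in (256, 1024):
    print(f"n={n}: weight on a middle position = {recall_weight(n // 2, n):.2e}, "
          f"on the first position = {recall_weight(0, n):.2e}")
</code></pre>
<p>Edge positions keep roughly constant weight as \( n \) grows, while a middle
position's weight collapses toward \( (1-\lambda)/n \): the flat component is
the only route to it, and that route dilutes with length.</p>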
<h3 id="what-structured-sparsity-looks-like-in-practice">What structured sparsity looks like in practice</h3>
<p>The corrective is not to artificially shrink context windows — it is to ensure
that at each decision point, the context is populated with tokens that are
<em>relevant to that decision</em>. Three practical expressions of this principle:</p>
<ol>
<li>
<p><strong>Retrieval over recall.</strong> Rather than prepending a full document corpus,
retrieve the top-\( k \) passages at query time. This keeps \( n \) small
and \( |\mathcal{S}| / n \) high.</p>
</li>
<li>
<p><strong>Rolling summarisation.</strong> Compress history into a running summary and
discard the raw transcript. The summary carries the signal; the raw
transcript is mostly noise by the time it is several turns old.</p>
</li>
<li>
<p><strong>Phased orchestration.</strong> Decompose a multi-step task into phases, each
with its own focused context. Phase \( t \) receives only the output of
phase \( t-1 \) (plus any task-specific retrieval), not the entire
accumulated history of all prior phases. This keeps per-phase \( n \) bounded
even as the total task length grows.</p>
</li>
</ol>
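<p>As a sketch of the third pattern, the pipeline below keeps per-phase context
bounded no matter how many phases the task has. <code>run_phase</code> and
<code>retrieve</code> are hypothetical stand-ins for a model call and a
retriever:</p>
<pre><code class="language-python"># Minimal sketch of phased orchestration: each phase sees only the previous
# phase's output plus phase-specific retrieval, never the full history.

def retrieve(query, k=3):
    # Hypothetical retriever: return the top-k passages for this phase's query.
    return [f"passage-{i} for {query!r}" for i in range(k)]

def run_phase(name, context):
    # Hypothetical model call; here it just reports what it was given.
    return f"{name} output (context was {len(context)} items)"

def run_pipeline(phases, task):
    carry = task                               # only the previous output carries over
    for name, query in phases:
        context = [carry] + retrieve(query)    # bounded, purpose-built context
        carry = run_phase(name, context)
    return carry

result = run_pipeline(
    [("plan", "how to structure the task"),
     ("draft", "relevant source material"),
     ("revise", "style guidelines")],
    task="write a short report",
)
print(result)
</code></pre>
<p>Per-phase context is always \( 1 + k \) items, independent of the number of
phases already completed; the accumulated history never enters the window.</p>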
<hr>
<h2 id="discussion">Discussion</h2>
<p>The argument above is not novel — pieces of it appear scattered across the
alignment and inference-efficiency literatures. What I think is underappreciated
is that it applies to <em>agentic</em> systems with particular force. A single-shot
prompt has a fixed, author-controlled context. An agent accumulating tool
outputs, prior reasoning traces, and retrieved documents across a long task
trajectory will naturally inflate its own context window over time — and
degrade its own performance as a result, without any external change in task
difficulty.</p>
<p>The naive fix is to give the agent a bigger context window. The correct fix is
to never let it accumulate a bloated context in the first place.</p>
<p><strong>Limitations.</strong> The SNR framing treats all irrelevant tokens as equally noisy,
which is false — some irrelevant tokens are actively misleading (distractors) <a href="#ref-3">[3]</a>,
others are benign fillers. The quadratic cost argument mostly applies to
full-attention models; sparse and linear attention variants have different
scaling properties. And &ldquo;relevant&rdquo; is itself a function of the model&rsquo;s
knowledge, which makes the optimisation circular in practice.</p>
<p><strong>What would make this publishable.</strong> Controlled ablation: fix a task, vary
context length by inserting null tokens of increasing volume, measure
performance as a function of \( n \) and of the position of the relevant
material. Do this across model sizes and families to separate architectural
effects from scale effects. The lost-in-the-middle paper is close to this but
does not isolate null-token inflation from document-count inflation.</p>
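<p>The prompt-construction half of that ablation is straightforward to sketch.
The filler token and the scoring hook are placeholders; the point is that
context length and passage position vary independently:</p>
<pre><code class="language-python"># Sketch of the proposed ablation: fix the relevant passage, inflate the
# context with null tokens, and vary where the passage lands. Prompt
# construction only; the model call is left as a hypothetical score(prompt).

NULL = "the "   # filler token; any low-information filler would do

def build_prompt(passage, question, n_fill, position):
    # position in [0, 1]: 0 puts the passage at the start, 1 at the end.
    before = int(n_fill * position)
    after = n_fill - before
    return NULL * before + passage + " " + NULL * after + question

conditions = [(n_fill, pos)
              for n_fill in (0, 1000, 4000, 16000)
              for pos in (0.0, 0.5, 1.0)]

for n_fill, pos in conditions:
    prompt = build_prompt("The key is 7421.", "What is the key?", n_fill, pos)
    print(f"fill={n_fill:>5}, position={pos}: prompt length={len(prompt)}")
    # score(prompt) would go here, repeated across model sizes and families
</code></pre>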
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., &amp; Liang, P. (2024). Lost in the middle: How language models use long contexts. <em>Transactions of the Association for Computational Linguistics</em>, 12, 157–173. <a href="https://arxiv.org/abs/2307.03172">https://arxiv.org/abs/2307.03172</a></p>
<p><span id="ref-2"></span>[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). Attention is all you need. <em>Advances in Neural Information Processing Systems</em>, 30. <a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></p>
<p><span id="ref-3"></span>[3] Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., Schärli, N., &amp; Zhou, D. (2023). Large language models can be easily distracted by irrelevant context. <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>. <a href="https://arxiv.org/abs/2302.00093">https://arxiv.org/abs/2302.00093</a></p>
<hr>
<p><em>The phased orchestration argument in the Discussion section is not just
theoretical hand-waving — I have been building a concrete implementation of it.
The current state lives at
<a href="https://github.com/sebastianspicker/phased-agent-orchestration">sebastianspicker/phased-agent-orchestration</a>.
It is rough, but the core idea is there: each agent phase gets a bounded,
purpose-built context rather than the full accumulated history. Feedback very
welcome.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
