Summary
This is a follow-up to “More Context Is Not Always Better”. The earlier point was theoretical: long context is not automatically useful context. The practical consequence is sharper. In agentic systems, token saving should not be understood primarily as thrift. It is a form of noise reduction.
Every additional tool output, log fragment, stale plan, old reasoning trace, retrieved document, or configuration dump changes the statistical environment in which the model must make its next decision. Some of those tokens are evidence. Many are not. The problem is not that large context windows are useless; the problem is that large context windows make it easy to stop curating context.
The workflow I now find most robust treats the agent harness as a sequence of filters. Headroom sits at the model/API boundary. RTK reduces shell output. lean-ctx reduces file-reading and search noise. Serena and CodeGraph provide project-local structure instead of undifferentiated text. AGENTS.md, hooks, MCP configuration, and subagent configuration encode routing policy. Codacy, OpenSpec, impeccable, and Ponytail reduce different kinds of decision surface.
That sounds like a tooling inventory. It is not. The unifying idea is that each layer should improve the signal-to-noise ratio of the context before the model is asked to reason over it.
The Mathematical Point
Let the available context be a sequence of chunks or tokens
$$ C = \{t_1, t_2, \ldots, t_n\}, $$
and let \(q\) denote the current task. For each \(t_i\), define a task-dependent relevance score
$$ r_i(q) \in [0,1]. $$
This is not meant as a claim that relevance is easy to measure exactly. It is a useful abstraction. A token that contains a failing assertion, an API contract, or the line currently being edited has high relevance. A token from an unrelated passing test, an old plan, or a verbose progress log has low relevance.
A crude signal-to-noise ratio for the context is then
$$ \operatorname{SNR}(C,q)= \frac{\sum_{i=1}^{n} r_i(q)} {\epsilon + \sum_{i=1}^{n}(1-r_i(q))}, $$
where \(\epsilon\) avoids division by zero. The important property is simple: adding irrelevant tokens increases the denominator without increasing the numerator. More context can therefore make the decision problem worse even when the relevant evidence is technically present.
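A minimal numerical sketch of that property, assuming relevance scores are already available as plain floats (the scores below are made up for illustration, not measured):

```python
from typing import Sequence


def snr(relevance: Sequence[float], eps: float = 1e-9) -> float:
    """Crude signal-to-noise ratio: sum(r_i) / (eps + sum(1 - r_i))."""
    signal = sum(relevance)
    noise = sum(1.0 - r for r in relevance)
    return signal / (eps + noise)


# Hypothetical scores: two useful chunks and one marginal chunk.
base = [0.9, 0.8, 0.2]
# The same context after appending five irrelevant chunks (r = 0).
padded = base + [0.0] * 5

print(snr(base))    # about 1.73
print(snr(padded))  # about 0.31
```

The relevant evidence is identical in both cases. Only the denominator has grown.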
The practical objective is not to maximise context length. It is to select a bounded working context \(C'\) that preserves decision-relevant information:
$$ C_B^* = \arg\max_{C' \subseteq C,\ |C'|\le B} \left( \sum_{t_i \in C'} r_i(q) - \lambda |C'| \right), $$
where \(B\) is a token budget and \(\lambda\) is the penalty for carrying additional context. If \(\lambda\) is too small, the system keeps material because it is available. If \(\lambda\) is too large, the system deletes evidence. The engineering task is to make the filtering good enough that the context becomes smaller and more informative at the same time.
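Under one simplifying assumption, that every chunk costs exactly one unit of budget, this objective can be solved directly: it is a sum of independent gains \(r_i(q) - \lambda\), so keeping the largest positive gains up to \(B\) is optimal. The sketch below illustrates that special case. The function name and the scores are hypothetical, and chunks of varying length would turn the problem into a knapsack-style selection instead.

```python
from typing import List, Sequence


def select_context(relevance: Sequence[float], budget: int, lam: float) -> List[int]:
    """Return indices maximising sum(r_i) - lam * |C'| subject to |C'| <= budget.

    Assumes unit-cost chunks, so each chunk contributes an independent gain
    (r_i - lam) and only chunks with a positive gain are worth keeping.
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    chosen = [i for i in ranked if relevance[i] > lam][:budget]
    return sorted(chosen)  # report in original document order


scores = [0.9, 0.1, 0.8, 0.05, 0.6, 0.3]  # made-up relevance scores

print(select_context(scores, budget=4, lam=0.25))  # [0, 2, 4, 5]
print(select_context(scores, budget=2, lam=0.25))  # [0, 2]
print(select_context(scores, budget=4, lam=0.95))  # []
```

The last call shows the over-penalised case: with \(\lambda\) set too high, even the most relevant chunk is dropped.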
This is why “token saving” is a slightly misleading name. The goal is not merely fewer tokens. The goal is a better working set.
Why This Is Not Just Personal Preference
Current long-context research increasingly treats context as something to be selected, compressed, or compiled, rather than simply enlarged.
LongAttnComp, for example, frames long-context compression as a token-level selection problem. It introduces token-level chunking, token-budget selection, positional reordering, and a learned cross-attention scoring layer. The notable result is not only that inference becomes cheaper. On InfiniteBench Code-Debug, the compressed context can match or exceed full-context accuracy, which is exactly the behaviour one would expect if full context contains substantial noise as well as signal [[1]].
CompressKV makes the same point inside the KV cache. Instead of treating all cached key-value entries as equally worth retaining, it identifies semantic retrieval heads and uses them to preserve tokens likely to matter later in generation. The reported results are strong: more than 97 percent of full-cache performance with 3 percent of the KV cache on LongBench question-answering tasks, and 90 percent Needle-in-a-Haystack accuracy with 0.7 percent KV storage [[2]]. The precise numbers may move with model family and benchmark, but the direction is hard to ignore. Contextual information has very uneven value.
Latent Context Compilation approaches the issue differently. It compiles long context into compact portable memory tokens, reporting preserved detail and reasoning at 16x compression [[3]]. That is conceptually close to what good agent memory should do: not replay the entire transcript, but carry forward a dense representation of what remains useful.
Work on adaptive context compression for long-running interactions makes the agent connection explicit. As conversations and tasks grow, context length, memory saturation, and computational cost degrade performance. The proposed remedy is importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation [[4]]. This is close to the engineering problem in day-to-day agent use: the system must decide what to keep before accumulated history becomes self-inflicted noise.
These papers differ technically. One compresses long prompts, one compresses the KV cache, one compiles memory, and one adapts context across long-running interactions. But they share the same structural assumption: not all context deserves equal access to the model’s attention.
The Harness as Applied Context Selection
This is the frame in which my current setup makes sense.
Headroom is the system-level boundary for model and agent traffic. It is not just a proxy. It is the place where request and response handling, compression, health checks, and traffic visibility become part of the reasoning environment. If it is healthy, memory-aware, and compressing effectively, it reduces waste before individual tool choices even matter. If behaviour looks wrong, that has to be verified with concrete endpoints such as /health, /readyz, and /stats, not inferred from the absence of visible failure.
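As a rough sketch of what "verified with concrete endpoints" can look like, the following probes the three paths named above and records only the HTTP status of each. The base URL, the port, and the helper names are assumptions for illustration. Nothing here relies on Headroom's response format, which would need to be checked against its own documentation.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = ("/health", "/readyz", "/stats")  # paths named in the text


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, and similar


def probe(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Map each endpoint path to the status code it returned."""
    return {path: fetch(base_url.rstrip("/") + path) for path in ENDPOINTS}


# Offline demonstration with a stand-in fetcher (hypothetical values and port).
fake = {"/health": 200, "/readyz": 503, "/stats": 200}
report = probe("http://localhost:8787", fetch=lambda url: fake["/" + url.rsplit("/", 1)[1]])
print(report)  # {'/health': 200, '/readyz': 503, '/stats': 200}
```

Dropping the `fetch` argument makes `probe` issue real requests. The point of the explicit report is that a non-200 or unreachable endpoint is recorded as evidence rather than inferred from silence.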
RTK is the default shell-output filter. Git, tests, logs, builds, package managers, Docker, and similar commands produce a lot of text whose main value is usually concentrated in a few lines: the failing assertion, the changed file, the non-zero exit, the warning that actually matters. Using RTK by default keeps the shell available while preventing routine commands from flooding the transcript.
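To make the idea concrete, here is a deliberately small filter in the same spirit. It is not RTK's implementation and does not use its interface: the pattern list, the function name, and the sample output are all hypothetical. It keeps lines that look like evidence plus their immediate neighbours, and always reports the exit code and how much was omitted.

```python
import re

# Hypothetical patterns for lines that usually carry the evidence.
SIGNAL = re.compile(r"(error|fail|warning|traceback|assert)", re.IGNORECASE)


def filter_output(text: str, exit_code: int, context: int = 1) -> str:
    """Keep signal lines plus `context` neighbouring lines, with an explicit header."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if SIGNAL.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    kept = [lines[i] for i in sorted(keep)]
    omitted = len(lines) - len(kept)
    header = f"[exit {exit_code}] kept {len(kept)} of {len(lines)} lines, omitted {omitted}"
    return "\n".join([header] + kept)


# Made-up test-runner output.
raw = "\n".join([
    "collected 40 items",
    "tests/test_a.py ........",
    "tests/test_b.py ...F....",
    "E   AssertionError: expected 3, got 4",
    "tests/test_c.py ........",
    "1 failed, 39 passed",
])

print(filter_output(raw, exit_code=1))
```

The header matters as much as the filtering: stating how many lines were dropped tells the model that the raw output exists and can be requested if the kept lines are not enough.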
lean-ctx belongs primarily to file reading, search, and compact tree views. When it works well, it solves a very common failure mode: reading whole files before knowing which parts matter. A compact ctx_read or ctx_search result is often a better context object than a raw dump of every line. At the same time, this is not ideology. If shell execution through ctx_shell is blocked, then it should not be treated as the primary shell path until that integration is fixed. Compression that hides evidence or blocks execution is not context engineering. It is just failure with fewer tokens.
Serena and CodeGraph are different again. They are useful inside actual code projects because they retrieve structure: symbols, references, call paths, project-local knowledge. They are not replacements for shell routing or log filtering, and they are not especially meaningful for home-directory configuration work unless a project has been activated. Their value is that they answer structural questions without forcing the model to infer project topology from raw text.
AGENTS.md, hooks, MCP configuration, subagent configuration, and token-saving configuration form the policy layer. They encode defaults that otherwise have to be renegotiated in every task: when to use RTK, when to use lean-ctx, when raw shell is justified, when to avoid subagents, when to preserve exact deterministic evidence. Good policy reduces prompt mass because the same routing decisions no longer have to be restated manually.
Codacy, OpenSpec, impeccable, and Ponytail reduce context in a broader sense. Codacy turns some code-quality questions into deterministic findings instead of free-form model judgement. OpenSpec keeps implementation attached to a declared behavioural slice. impeccable narrows UI critique toward visible, user-facing quality rather than vague aesthetic preference. Ponytail asks whether a feature, abstraction, dependency, or explanation needs to exist at all. The best token saving is sometimes deletion before the model ever sees the next problem.
Failure Modes
The obvious failure mode is over-compression. If a filtered command removes the exact compiler error, the failing line, the relevant HTTP status code, or the one stack frame that matters, then the signal-to-noise ratio has not improved. It has worsened.
This gives a practical rule:
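$$ \text{compress only if } \Delta \operatorname{SNR} > 0, $$
where \(\Delta \operatorname{SNR}\) is the change in \(\operatorname{SNR}(C,q)\) produced by the compression. In other words, compression is justified when the material it removes is, on balance, noisier than the context as a whole, so that the ratio of relevant to irrelevant content rises. It is not justified merely because the output is shorter.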
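The rule can be sketched as a check on a proposed compression, again assuming relevance scores are given. One caveat shows up immediately: the crude ratio can rise even when a single indispensable chunk is dropped, so the sketch adds an explicit `required` set as a guard. That guard is my own illustrative addition, not part of the formula, and all names and scores are hypothetical.

```python
from typing import AbstractSet, Sequence


def snr(relevance: Sequence[float], eps: float = 1e-9) -> float:
    """Crude signal-to-noise ratio: sum(r_i) / (eps + sum(1 - r_i))."""
    signal = sum(relevance)
    noise = sum(1.0 - r for r in relevance)
    return signal / (eps + noise)


def delta_snr(relevance: Sequence[float], kept: AbstractSet[int]) -> float:
    """SNR of the kept chunks minus SNR of the full context."""
    after = [r for i, r in enumerate(relevance) if i in kept]
    return snr(after) - snr(relevance)


def should_compress(
    relevance: Sequence[float],
    kept: AbstractSet[int],
    required: AbstractSet[int] = frozenset(),
) -> bool:
    """Accept a compression only if it keeps every required chunk and raises the SNR."""
    return required <= kept and delta_snr(relevance, kept) > 0


# Made-up scores; index 0 stands for the exact compiler error.
scores = [0.95, 0.1, 0.05, 0.7, 0.1]

print(should_compress(scores, kept={0, 3}, required={0}))     # True
print(should_compress(scores, kept={3}, required={0}))        # False
print(should_compress(scores, kept={1, 2, 4}, required={0}))  # False
```

The second call is the instructive one. Keeping only chunk 3 raises the ratio, yet the compiler error is gone, which is why a ratio alone is not a sufficient acceptance test.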
Subagents have the same problem. A subagent is useful when it returns a smaller, sharper evidence object than the work it replaces. It is not useful when it produces another long narrative that must itself be audited. The same applies to summaries. A summary that preserves exact file names, commands, failures, and decisions can be valuable. A summary that says “various changes were made” is not compression. It is information loss.
There is also a privacy boundary. A workflow can describe AGENTS.md, hooks, MCP configuration, subagent configuration, and token-saving setup without publishing raw local configuration. The article-level claim is about routing and filtering, not about exposing secrets, tokens, private endpoints, environment variables, or credentials.
Conclusion
The strongest argument for token saving is not economic. It is epistemic. A model that receives fewer irrelevant tokens is not merely cheaper to run; it is often in a better position to answer the actual question.
This is the practical continuation of the “more context is not always better” argument. Long context gives capacity. It does not solve selection. In real agent work, selection happens through the harness: proxies, shell filters, file readers, symbol tools, hooks, specs, quality gates, and deliberately minimal scope.
The mature version of context engineering is therefore not a race toward larger windows. It is the design of systems that know what not to include.
References
[1] Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, & Chen Wu. (2026). LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning. https://arxiv.org/abs/2606.01336
[2] Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, & Grace Li Zhang. (2026). CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference. https://arxiv.org/abs/2606.24467
[3] Zeju Li, Yizhou Zhou, & Qiang Xu. (2026). Latent Context Compilation: Distilling Long Context into Compact Portable Memory. https://arxiv.org/abs/2602.21221
[4] Payal Fofadiya & Sunil Tiwari. (2026). Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions. https://arxiv.org/abs/2603.29193