<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Formal-Methods on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/formal-methods/</link>
    <description>Recent content in Formal-Methods on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 04 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/formal-methods/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months leading up to this post, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the greedy-decoded token sequence, in which each token is
the argmax of its conditional distribution. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
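<p>For reference, the standard temperature-scaled softmax makes the limit explicit. This is a general sketch rather than a statement about any particular model; $z_t$ (the logit of token $t$ given context $c$) and $\tau$ (the temperature) are notation introduced here:</p>
$$P_\tau(t \mid c) = \frac{\exp(z_t / \tau)}{\sum_{t'} \exp(z_{t'} / \tau)} \;\xrightarrow{\;\tau \to 0^+\;}\; \mathbb{1}\bigl[\, t = \arg\max_{t'} z_{t'} \,\bigr]$$<p>assuming the maximum logit is unique. Near-ties between logits are where a change in floating-point accumulation order can flip the selected token.</p>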
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is, loosely, a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>Here each application of $T$ stands for &ldquo;sample an output and fold it
into the next prompt.&rdquo; The variance does not merely add; it propagates
through the dependencies. Early outputs condition later prompts, so a small
deviation at step 2 can shift the trajectory of step 5 substantially. This
is not a theoretical concern. It is the practical experience of anyone who
has tried to reproduce a multi-step agent run.</p>
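<p>A back-of-the-envelope illustration, under a deliberately simple assumption that is not part of the Ralph Loop itself: suppose each call reproduces its reference output with probability $q$ whenever its prompt matches the reference prompt, and a mismatch is never repaired downstream. Then, by the chain rule,</p>
$$\Pr[\text{all } k \text{ outputs match}] = q^k, \qquad \text{e.g. } q = 0.95,\; k = 5 \;\Rightarrow\; 0.95^5 \approx 0.774$$<p>so a chain of five calls reproduces end to end noticeably less often than any single call does.</p>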
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
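<p>The five steps above can be sketched in a few lines. This is an illustrative Python sketch, not the repository&rsquo;s Bash implementation: the <code>Story</code> fields, the function names, and the rule that a story becomes <code>skipped</code> once its attempts reach <code>max_attempts</code> are all assumptions made for the example, and file locking is left out.</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Story:
    story_id: str
    priority: int
    body: str
    status: str = "open"   # one of: open, passing, skipped
    attempts: int = 0


def select(queue: List[Story]) -> Optional[Story]:
    """Select: the first open story in the fixed queue order."""
    for story in queue:
        if story.status == "open":
            return story
    return None


def build_prompt(story: Story, policy: str) -> str:
    """Build: a pure function of the story and the static policy text."""
    return policy + "\n\n" + story.body


def run_loop(
    stories: List[Story],
    policy: str,
    tool: Callable[[str], str],
    accept: Callable[[Story, str], bool],
    max_attempts: int = 3,
) -> int:
    """Run until no open stories remain; return the number of tool calls."""
    # Fixed ordering: by priority, then by ID.
    queue = sorted(stories, key=lambda s: (s.priority, s.story_id))
    calls = 0
    while True:
        story = select(queue)
        if story is None:
            return calls
        output = tool(build_prompt(story, policy))  # Execute: exactly one call
        calls += 1
        if accept(story, output):                   # Accept
            story.status = "passing"                # Commit
        else:
            story.attempts += 1
            if story.attempts >= max_attempts:
                story.status = "skipped"            # assumption, see lead-in
```

<p>Note that <code>build_prompt</code> never sees a previous <code>output</code>; that is the property the rest of the design leans on.</p>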
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
at most one position: a rejected attempt leaves $S$ unchanged and only
increments the attempt counter. The loop continues until no open stories
remain or a story limit $N$ is reached.</p>
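<p>The Commit step writes &ldquo;atomically under lock.&rdquo; One common way to get that behaviour on a POSIX system is an exclusive <code>flock</code> plus a write-to-temp-then-rename. The sketch below is in Python for illustration only; the state file layout (a JSON object mapping story ID to status), the <code>.lock</code> suffix, and the function name are assumptions, not the repository&rsquo;s actual format.</p>

```python
import fcntl
import json
import os
import tempfile


def commit_status(state_path: str, story_id: str, new_status: str) -> None:
    """Update one story's status under an exclusive lock, via atomic rename."""
    directory = os.path.dirname(os.path.abspath(state_path))
    with open(state_path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)        # L: free -> held
        try:
            with open(state_path) as handle:
                state = json.load(handle)
            state[story_id] = new_status             # change exactly one entry
            # Write the full new state to a temp file in the same directory,
            # then rename it over the old file, so a reader sees either the
            # old state or the new one and never a partial write.
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            with os.fdopen(fd, "w") as out:
                json.dump(state, out, indent=2)
                out.flush()
                os.fsync(out.fileno())
            os.replace(tmp_path, state_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)    # L: held -> free
```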
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. By the fifth tool call,
the prompt includes four previous outputs, each of which may differ
slightly from its counterpart in the last run. The fifth output is
therefore conditioned on a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] + \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
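<p>To make the decomposition concrete, here is a small exact check of the law of total variance on a two-step chain. Everything in it is made up for illustration: outputs are treated as real-valued scores rather than token sequences, and the distributions <code>P_O1</code> and <code>P_OK_GIVEN_O1</code> are hypothetical.</p>

```python
from fractions import Fraction as F

# Hypothetical two-step chain: o1 is the first output, ok the final one.
P_O1 = {0: F(1, 2), 1: F(1, 2)}
P_OK_GIVEN_O1 = {
    0: {0: F(9, 10), 10: F(1, 10)},
    1: {5: F(1, 2), 15: F(1, 2)},
}


def mean(dist):
    """Expected value of a {value: probability} distribution."""
    return sum(p * x for x, p in dist.items())


def variance(dist):
    """Variance of a {value: probability} distribution."""
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())


def marginal(p_o1, p_ok_given_o1):
    """Distribution of ok after summing out o1."""
    out = {}
    for o1, p1 in p_o1.items():
        for ok, p in p_ok_given_o1[o1].items():
            out[ok] = out.get(ok, 0) + p1 * p
    return out


def decompose(p_o1, p_ok_given_o1):
    """Return (Var of the conditional mean, mean of the conditional Var)."""
    overall = mean(marginal(p_o1, p_ok_given_o1))
    explained = sum(
        p1 * (mean(p_ok_given_o1[o1]) - overall) ** 2 for o1, p1 in p_o1.items()
    )
    residual = sum(p1 * variance(p_ok_given_o1[o1]) for o1, p1 in p_o1.items())
    return explained, residual


explained, residual = decompose(P_O1, P_OK_GIVEN_O1)
total = variance(marginal(P_O1, P_OK_GIVEN_O1))
assert total == explained + residual
```

<p>For these made-up numbers the explained term is $81/4$, the residual term is $17$, and the total is $149/4$. Fixing $o_1$ removes the first term only; the residual term remains.</p>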
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \setminus F_{\text{before}}$ must
satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
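<p>A minimal sketch of the snapshot-and-compare check, in Python for illustration rather than the runner&rsquo;s Bash. The function names, the choice of SHA-256 for $h$, and the prefix-matching rule for directory entries in <code>scope[]</code> are assumptions. The sketch also counts deleted files as changes, which goes slightly beyond the set difference written above.</p>

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Set


def snapshot(repo: Path) -> Dict[str, str]:
    """Map each file's repo-relative POSIX path to a SHA-256 of its contents."""
    return {
        p.relative_to(repo).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo.rglob("*"))
        if p.is_file()
    }


def changed_paths(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Paths added, modified, or deleted between two snapshots."""
    return {
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }


def in_scope(path: str, scope: List[str]) -> bool:
    """True if path equals a scope entry or lies under a scope directory."""
    for entry in scope:
        base = entry.rstrip("/")
        if path == base or path.startswith(base + "/"):
            return True
    return False


def scope_violations(
    before: Dict[str, str], after: Dict[str, str], scope: List[str]
) -> List[str]:
    """Changed paths that fall outside the story's declared scope."""
    return sorted(p for p in changed_paths(before, after) if not in_scope(p, scope))
```

<p>A story would then fail whenever <code>scope_violations</code> returns a non-empty list, whatever the content of the out-of-scope change.</p>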
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the filesystem level, after the fact, and that
enforcement cannot be argued with.</p>
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
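<p>The check itself is small enough to sketch in full. This Python version is illustrative only; the function names are hypothetical, and it assumes the path in <code>Created &lt;path&gt;</code> is relative to the repository root, which the post does not state.</p>

```python
from pathlib import Path
from typing import Optional

PREFIX = "Created "


def expected_path(criterion: str) -> Optional[str]:
    """Extract the path from a 'Created ...' line, or None if it is malformed."""
    line = criterion.strip()
    if not line.startswith(PREFIX):
        return None
    path = line[len(PREFIX):].strip()
    return path or None


def accepted(criterion: str, repo_root: Path) -> bool:
    """Deterministic acceptance: does the expected file exist after the run?"""
    relative = expected_path(criterion)
    return relative is not None and (repo_root / relative).is_file()
```

<p>No model call is involved, so the same filesystem state always yields the same verdict.</p>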
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
