<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Software-Engineering on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/software-engineering/</link>
    <description>Recent content in Software-Engineering on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Thu, 04 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/software-engineering/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months before this writing, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the argmax token sequence. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>The variance does not merely add — it propagates through the dependencies.
Early outputs condition later prompts; a small deviation at step 2 can
shift the trajectory of step 5 substantially. This is not a theoretical
concern. It is the practical experience of anyone who has tried to reproduce
a multi-step agent run.</p>
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
exactly one position. The loop continues until no open stories remain or
a story limit $N$ is reached.</p>
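<p>The shape of the loop, sketched in Python for readability. The actual
runner is Bash and jq; the names here are illustrative, not the
implementation&rsquo;s own:</p>
<pre><code class="language-python"># Illustrative sketch of the transition function; the real runner is
# Bash with jq, and these names are invented for exposition.
from dataclasses import dataclass

@dataclass
class Story:
    id: str
    status: str = "open"      # open | passing | skipped
    attempts: int = 0

def run_loop(queue, build_prompt, execute_tool, accepted, max_attempts=3):
    """One pass over the PRD: select, build, execute, accept, commit."""
    while True:
        # Select: the first open story in the fixed priority order.
        open_stories = [s for s in queue if s.status == "open"]
        if not open_stories:
            break
        story = open_stories[0]
        # Build: the prompt is a pure function of the story and the
        # static policy document, never of previous tool outputs.
        prompt = build_prompt(story)
        # Execute: exactly one tool call, output captured.
        output = execute_tool(prompt)
        # Accept + Commit: exactly one state change per step.
        if accepted(output):
            story.status = "passing"
        else:
            story.attempts += 1
            if story.attempts &gt;= max_attempts:
                story.status = "skipped"
</code></pre>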
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. After five tool calls,
the prompt for the fifth includes four previous outputs — each of which
varied slightly from the last run. The fifth output is now conditioned on
a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] \;+\; \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
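<p>A toy simulation makes the compounding visible. This is not the Ralph
Loop; it is a hypothetical chain in which each step&rsquo;s output is a
noisy function of the previous one:</p>
<pre><code class="language-python"># Toy illustration (not the Ralph Loop): each "call" adds noise
# conditioned on the previous output, so deviations compound.
import random
import statistics

def single_run(noise=0.05):
    """One call: the output depends only on the fixed prompt."""
    return 1.0 + random.gauss(0.0, noise)

def chained_run(k, noise=0.05):
    """k sequential calls; each output conditions the next prompt."""
    o = 1.0
    for _ in range(k):
        o = o * (1.0 + random.gauss(0.0, noise))  # draw conditioned on o
    return o

runs = 10_000
print(statistics.stdev(single_run() for _ in range(runs)))    # roughly 0.05
print(statistics.stdev(chained_run(5) for _ in range(runs)))  # roughly 0.11
</code></pre>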
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \setminus F_{\text{before}}$ must
satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
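<p>A minimal sketch of the check, again in Python for readability (the
runner does this in Bash; the helper names are assumptions):</p>
<pre><code class="language-python"># Sketch of scope enforcement: hash every file before and after the
# tool call, then reject any created or modified path outside scope[].
import hashlib
from pathlib import Path

def snapshot(repo: Path) -&gt; dict:
    """Map every file in the repository to a content hash."""
    return {
        p: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in repo.rglob("*") if p.is_file()
    }

def in_scope(path: Path, scope: list) -&gt; bool:
    return any(path == s or s in path.parents for s in scope)

def scope_violations(before: dict, after: dict, scope: list) -&gt; list:
    """Created or modified paths that escape the story's scope."""
    changed = [p for p, h in after.items() if before.get(p) != h]
    return [p for p in changed if not in_scope(p, scope)]
</code></pre>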
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the file system level, after the fact, and that
enforcement cannot be argued with.</p>
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
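<p>The evaluation itself is a few deterministic lines; a sketch (the
<code>Created &lt;path&gt;</code> format is from the PRD, the parsing is
illustrative):</p>
<pre><code class="language-python"># Sketch: parse the criterion "Created &lt;path&gt;" and ask the
# filesystem; no model call is involved in the evaluation.
import os

def accepted(criterion: str) -&gt; bool:
    prefix = "Created "
    if not criterion.startswith(prefix):
        return False
    path = criterion[len(prefix):].strip()
    return os.path.isfile(path)
</code></pre>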
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
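<p>A sketch of the two-stage extraction, assuming <code>pypdf</code> for
native text and <code>pdf2image</code> plus <code>pytesseract</code> for the
OCR fallback (the repository&rsquo;s actual dependencies may differ):</p>
<pre><code class="language-python"># Sketch of native-first extraction with an OCR fallback; the library
# choices here are assumptions, not necessarily the repo's own.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

MIN_NATIVE_CHARS = 200  # below this, treat the PDF as a scan

def extract_text(path: str) -&gt; str:
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if len(text.strip()) &gt;= MIN_NATIVE_CHARS:
        return text
    # Fallback: render each page to an image and OCR it.
    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
</code></pre>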
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving comfortable headroom for the
prompt and response in most configurations. For documents that exceed
this, only the leading text is kept, on the reasonable assumption that
titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
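<p>The budgeting itself reduces to a few lines; a sketch using the
chars/4 rule (<code>DEFAULT_MAX_CONTENT_TOKENS</code> is the codebase&rsquo;s
setting, the helper is illustrative):</p>
<pre><code class="language-python"># Sketch of the token budget: estimate with the chars/4 rule and keep
# only the leading text. The constant is real; the helper is invented.
DEFAULT_MAX_CONTENT_TOKENS = 28_000
CHARS_PER_TOKEN = 4  # rough rule of thumb for English prose

def truncate_to_budget(text, max_tokens=DEFAULT_MAX_CONTENT_TOKENS):
    max_chars = max_tokens * CHARS_PER_TOKEN
    return text[:max_chars]
</code></pre>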
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
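<p>The shape of the scoring pass, sketched below; the patterns are a
small subset, and the weights and threshold are illustrative:</p>
<pre><code class="language-python"># Sketch of the regex scoring pass; patterns, weights, and the
# confidence threshold are illustrative, not the repo's exact values.
import re

RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "date", 2),
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "date", 2),
    (re.compile(r"\b(Rechnung|Invoice|Beleg|Gutschrift|Receipt)\b", re.I),
     "category", 3),
]
CONFIDENT = 4  # at or above this score, skip the LLM call

def heuristic_pass(text):
    fields, score = {}, 0
    for pattern, field, weight in RULES:
        match = pattern.search(text)
        if match:
            fields.setdefault(field, match.group(0))
            score += weight
    return fields, score &gt;= CONFIDENT
</code></pre>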
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than any ad-hoc output format you
might invent. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
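<p>A sketch of those checks (the allowed categories follow the schema
above; the helper and its fallback behaviour are illustrative):</p>
<pre><code class="language-python"># Sketch of post-parse validation: reject the obvious hallucinations
# (future dates, unknown categories, keywords absent from the source).
import datetime
import json

ALLOWED = {"invoice", "paper", "letter", "contract"}

def validate(raw, source_text):
    data = json.loads(raw)  # malformed JSON raises; caught upstream
    date = data.get("date")
    if date:
        parsed = datetime.datetime.strptime(date, "%Y%m%d").date()
        if parsed &gt; datetime.date.today():
            data["date"] = None       # a future date is hallucinated
    if data.get("category") not in ALLOWED:
        data["category"] = None       # not in the allowed set
    data["keywords"] = [
        k for k in data.get("keywords", [])
        if k.lower() in source_text.lower()  # must appear in the source
    ]
    return data
</code></pre>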
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
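<p>The sanitisation and collision logic, sketched (the format matches the
description above; the function names are illustrative, not
<code>filename.py</code>&rsquo;s own):</p>
<pre><code class="language-python"># Sketch of filename assembly: lowercase, hyphens only, counter
# suffix on collision.
import re
from pathlib import Path

def sanitise(part):
    part = part.lower()
    part = re.sub(r"[^a-z0-9]+", "-", part)  # anything else becomes -
    return part.strip("-")

def build_name(date, category, keywords, summary, directory):
    stem = "-".join(filter(None, [
        date,
        sanitise(category),
        sanitise("-".join(keywords)),
        sanitise(summary),
    ]))
    candidate = directory / f"{stem}.pdf"
    counter = 1
    while candidate.exists():  # deterministic counter suffix
        candidate = directory / f"{stem}_{counter:02d}.pdf"
        counter += 1
    return candidate
</code></pre>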
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
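<p>A minimal call against that default, assuming Ollama&rsquo;s
OpenAI-compatible completions API (the field names follow the OpenAI
completions format):</p>
<pre><code class="language-python"># Minimal sketch of a call to the local default endpoint; assumes
# Ollama is serving its OpenAI-compatible API with qwen2.5:3b pulled.
import requests

ENDPOINT = "http://127.0.0.1:11434/v1/completions"

def complete(prompt, model="qwen2.5:3b"):
    resp = requests.post(ENDPOINT, json={
        "model": model,
        "prompt": prompt,
        "temperature": 0,  # favour reproducible output
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
</code></pre>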
<p>A small 3B–8B model running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
