<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Automation on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/automation/</link>
    <description>Recent content in Automation on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/automation/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>There Is an App for That — Until There Isn&#39;t</title>
      <link>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</link>
      <pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/automatable-unautomatable-baumol-mental-health/</guid>
      <description>German health insurance will reimburse a mental health app within days but cannot provide a therapist within six months. Last week, psychotherapy fees were cut by 4.5%. Baumol&amp;rsquo;s cost disease — originally about why string quartets get relatively more expensive — explains why the app gold rush and the collapse of mental health provision are the same phenomenon.</description>
<content:encoded><![CDATA[<p>Someone vibe coded an app that tells you how many layers to wear today. It has 85,000 users. Someone else tracks her eyelash styles — every new set gets a photo and a note about the method. A father built Storypot: his kids drag emoji into a virtual pot and the app generates a bedtime story. A product manager made Standup Buddy, which randomises who talks first in a daily meeting. That is the entire feature.</p>
<p>These are not bad things. Some of them are genuinely lovely — Storypot in particular. The layers app clearly meets a need, given 85,000 people agree. I have built tools like this myself — I automated my concert setlist workflow and <a href="/posts/setlist-to-playlist/">wrote about it on this blog</a> — and the feeling of compressing a forty-minute ritual into four minutes of machine-assisted execution is real and satisfying.</p>
<p>There is a term for this now. Karpathy coined it in early 2025: vibe coding. You describe what you want, the model writes the code, you run it, you fix what breaks by describing the fix, and at no point do you necessarily understand what the code does. The barrier to building software has not been lowered so much as removed. A single person with an afternoon and a language model can ship what would have required a team and a quarter, two years ago.</p>
<p>Meanwhile. In Germany, the average wait from an initial consultation to the start of psychotherapy is 142 days — nearly five months — according to a BPtK analysis of statutory insurance billing data <a href="#ref-1">[1]</a>. The Telefonseelsorge — the crisis line, the last resort — handled 1.2 million calls in 2024. It is staffed by approximately 7,700 volunteers and funded primarily by the Protestant and Catholic churches. Its financing is described, in its own institutional language, as <em>äußerst angespannt</em> — extremely strained <a href="#ref-2">[2]</a>. Six days ago, on April 1, psychotherapy fees in Germany were cut by 4.5% <a href="#ref-3">[3]</a>. The thesis of this post is structural, not moral. There is a class of work that scales, and a class of work that does not. Our entire economy of attention — cultural, financial, technological — is optimised for the first class. The second class is not merely neglected. It is being made structurally more expensive, in a precise economic sense, by the very productivity gains that make the first class so intoxicating. And the policy apparatus, facing this structural pressure, is doing exactly what you would predict: it is funding apps.</p>
<p>The economist William Baumol explained the mechanism in 1966. It has a name, and the name is a diagnosis.</p>
<hr>
<h2 id="the-seduction-of-leverage">The Seduction of Leverage</h2>
<p>What makes vibe coding culturally significant is not the code. It is the leverage. A single developer, aided by a language model, can produce software that reaches millions of users. The marginal cost of an additional user approaches zero. The output scales without bound while the input — one person, one prompt, one afternoon — stays fixed. This is the defining characteristic of automatable work: the ratio of output to input can grow without limit.</p>
<p>This is not new. Software has always had this property. What is new is that the barrier to producing software has collapsed. You no longer need to understand data structures, or networking, or the programming language. You need an idea and a few hours. The productivity frontier has shifted so dramatically that the interesting constraint is no longer <em>can I build it</em> but <em>should anyone use it</em>. The cultural response has been euphoric. Communities, podcasts, courses, manifestos. People who have never written a line of code are shipping products. I am not interested in dismissing this. The ability to build is a form of agency, and more people having it is not, in itself, a problem. The problem is what the euphoria obscures.</p>
<h2 id="what-therapy-actually-requires">What Therapy Actually Requires</h2>
<p>A psychotherapy session has the following structure. One therapist sits with one patient for approximately fifty minutes. The therapist listens, observes, formulates, responds. The patient speaks, reflects, resists, revises. The therapeutic alliance — the quality of the relationship between therapist and patient — is one of the most robust predictors of treatment outcome, across modalities, across conditions, across decades of research <a href="#ref-4">[4]</a>. This is not a feature that can be optimised away. It is the mechanism of action. When a meta-analysis finds that the specific technique matters less than the relationship — that CBT, psychodynamic, and humanistic therapies produce roughly equivalent outcomes when the alliance is strong — it is telling you that the human in the room is not an implementation detail. The human in the room <em>is</em> the intervention.</p>
<p>You cannot parallelise this. A therapist cannot see two patients simultaneously without degrading the thing that makes the session work. You cannot batch it — twelve people in a room is group therapy, which is a different intervention with different dynamics and different limitations. You cannot cache it — the session is not a retrieval operation over stored responses but an emergent interaction that depends on what happens in the room that day. The irreducible unit of therapy is: one trained human, fully present, for one hour, with one other human. This has not changed since Freud&rsquo;s consulting room on Berggasse 19, and no plausible technological development will change it, because the presence <em>is</em> the treatment. A therapist working full-time can see roughly twenty-five to thirty patients per week. That is the ceiling. It is set by the biology of attention and the ethics of care, not by inefficiency.</p>
<h2 id="baumols-cost-disease">Baumol&rsquo;s Cost Disease</h2>
<p>In 1966, the economists William Baumol and William Bowen published <em>Performing Arts, The Economic Dilemma</em>, a study of why orchestras, theatre companies, and dance troupes were perpetually in financial crisis despite growing audiences and rising cultural prestige <a href="#ref-5">[5]</a>. Their diagnosis was precise. A string quartet requires four musicians and approximately forty minutes to perform Beethoven&rsquo;s Op. 131. This was true in 1826 and is true in 2026. The productivity of the quartet — measured in output per unit of labour input — has not increased. It cannot increase. The performance <em>is</em> the labour.</p>
<p>Meanwhile, the productivity of a textile worker, a steelworker, a software developer has increased by orders of magnitude. Wages in the productive sectors rise because productivity rises. Wages in the nonproductive sectors must keep pace — not because musicians deserve parity as a matter of justice, though they may, but because if they do not keep pace, musicians will leave for sectors that pay more. The quartet must compete in the same labour market as the factory and the tech company.</p>
<p>The result: the relative cost of live performance rises without bound. Not because musicians got worse. Not because audiences stopped caring. But because everything else got cheaper, and the quartet cannot. Baumol later generalised the result beyond the performing arts to all services in which the labour itself constitutes the product: education, healthcare, legal services, and — centrally for our purposes — psychotherapy <a href="#ref-6">[6]</a>. A therapy session is a string quartet. The labour is the product. The productivity cannot increase. The cost, relative to the scalable economy, rises every time the scalable economy gets more productive. And vibe coding is a massive productivity shock to the scalable economy.</p>
<h2 id="there-is-an-app-for-that">There Is an App for That</h2>
<p>In 2019, the German government passed the Digitale-Versorgung-Gesetz, creating a fast-track approval process for <em>Digitale Gesundheitsanwendungen</em> — digital health applications, or DiGA. The idea: apps that can be prescribed by a doctor and reimbursed by statutory health insurance, just like medication. A patient walks into a practice, receives a prescription code, downloads the app, and the Krankenkasse pays <a href="#ref-7">[7]</a>. As of mid-2025, the BfArM directory lists roughly 58 DiGA. Nearly half target psychiatric conditions — depression, anxiety, insomnia, burnout. Names like deprexis, HelloBetter, Selfapy. A patient who would wait 142 days for a therapist can get a DiGA prescribed the same afternoon.</p>
<p>The pricing structure deserves attention. In the first twelve months after listing, manufacturers set their own price. The average: €541 per prescription <a href="#ref-8">[8]</a>. Some exceeded €2,000. After the first year, negotiated prices drop to an average of roughly €226 — but by then, the insurance has already paid the introductory rate for every early adopter. Total statutory health insurance spending on DiGA since 2020: €234 million. That spending grew 71% between 2023 and 2024 <a href="#ref-9">[9]</a>. Here is the number that should sit next to that one. A single outpatient psychotherapy session costs the insurance system approximately €115. The €234 million spent on DiGA since 2020 could have funded over two million therapy sessions — enough for roughly 80,000 complete courses of 25-session treatment. And here is the evidence question. Only 12 of the 68 DiGA that have entered the directory demonstrated a proven positive care effect at the time of inclusion. The rest were listed provisionally, with twelve months to produce evidence. About one in six were subsequently delisted — removed from the directory because the evidence did not materialise <a href="#ref-10">[10]</a>.</p>
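<p>For the record, the two figures follow from one division and one more, using the numbers already cited:</p>
$$\frac{234{,}000{,}000\ \text{EUR}}{115\ \text{EUR/session}} \approx 2{,}034{,}000\ \text{sessions}\,, \qquad \frac{2{,}034{,}000\ \text{sessions}}{25\ \text{sessions/course}} \approx 81{,}000\ \text{courses}$$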
<p>I want to be precise about what I am and am not saying. Some DiGA have a real evidence base. Structured CBT exercises delivered digitally can produce measurable short-term symptom improvement — I reviewed the Woebot trial data in an <a href="/posts/ai-companion-loneliness-ironic-process/">earlier post on AI companions</a> and took those results seriously. A DiGA that delivers psychoeducation and behavioural activation exercises is a tool, and tools can be useful. But a tool and a therapeutic relationship are not the same product delivered through different channels. They are different products. The policy framework treats them as substitutable — the patient who cannot access a therapist receives an app instead. The substitution is not a clinical judgement. It is a structural inevitability: facing the impossibility of scaling therapy, the system reaches for the scalable alternative, because the scalable alternative is what the incentive structure rewards. This is not a corruption story. This is Baumol&rsquo;s cost disease expressed through health policy. The system is doing exactly what the theory predicts.</p>
<h2 id="the-fear-and-the-compliance">The Fear and the Compliance</h2>
<p>There is an irony at the centre of the current discourse about AI and work that I want to name, because I think it is underexamined. People are afraid of AI. Specifically, they are afraid it will take their jobs. The surveys confirm this consistently — Gallup, Pew, the European Commission&rsquo;s Eurobarometer — significant fractions of the working population in every developed country report anxiety about AI-driven job displacement.</p>
<p>And yet. The same people — not a different demographic, not a separate population, the <em>same people</em> — are enthusiastically using AI to do their work. They use language models to write their emails, their reports, their presentations. They vibe code tools for their teams. They let AI draft their strategy documents, summarise their meetings, compose their performance reviews. They celebrate the productivity gain. They post about it. This is not hypocrisy. It is something more interesting: a revealed preference for automation that contradicts a stated preference against it. The fear is about structural displacement — losing the <em>role</em>. The compliance is about local optimisation — doing the <em>task</em> more efficiently. No one wakes up and decides to automate themselves out of a job. They automate one task at a time, each automation locally sensible, until the job is a shell around an AI core. And all of this activity — the fear, the adoption, the discourse, the think pieces, the congressional hearings — is directed at automatable work. The kind of work where AI is a plausible substitute.</p>
<p>No one is afraid that AI will take the crisis counsellor&rsquo;s job. No one is vibe coding a replacement for a psychiatric nurse. The work that is collapsing is not collapsing because AI replaced it. It is collapsing because it was never scalable, never attracted the capital or the talent that scalable work attracts, and every productivity gain in the scalable sector makes the unscalable sector relatively more expensive and harder to staff. The discourse about AI and jobs is, in this sense, exactly backwards. The threat is not that AI will replace the work that matters most. The threat is that it will make the work that matters most <em>invisible</em> — by making everything else so cheap and fast and abundant that we forget the expensive, slow, irreducibly human work exists at all.</p>
<h2 id="the-political-arithmetic">The Political Arithmetic</h2>
<p>On March 11, 2026, the Erweiterter Bewertungsausschuss — the body that sets fee schedules for outpatient care in Germany — decided on a 4.5% flat cut to nearly all psychotherapeutic service fees, effective April 1 <a href="#ref-3">[3]</a>. The health insurers had originally demanded 10%. Germany spends €4.6 billion annually on outpatient psychotherapy — roughly 1.5% of total statutory health insurance expenditure. The fee cut applies to this budget. The average therapist surplus — what remains after practice costs — is approximately €52 per hour <a href="#ref-11">[11]</a>. The cut is not large in percentage terms. It is large in the context of a profession that is already among the lowest-paid in outpatient medicine. Nearly half a million people signed a petition against the cuts. There were protests in Berlin, Leipzig, Hanover, Hamburg, Stuttgart, Munich. The Kassenärztliche Bundesvereinigung filed a lawsuit. The Bundespsychotherapeutenkammer called the decision <em>skandalös</em> <a href="#ref-12">[12]</a>.</p>
<p>What makes this particularly striking is the sequence. The coalition agreement signed by CDU/CSU and SPD in May 2025 explicitly addresses mental health — securing psychotherapy training financing, needs-based planning for child and adolescent psychotherapy, crisis intervention rights for psychotherapists, and a suicide prevention law. The BPtK itself welcomed the agreement as giving mental health a <em>neuen Stellenwert</em>, a new significance <a href="#ref-13">[13]</a>. Less than a year later, the same government&rsquo;s arbitration body cuts psychotherapy fees by 4.5%. The stated commitment and the enacted policy point in opposite directions. This is not unusual in politics. What is unusual is that it maps so precisely onto Baumol&rsquo;s mechanism: the coalition agreement acknowledges the problem in language; the fee schedule acknowledges it in arithmetic. And the arithmetic wins, because the arithmetic always wins when the work does not scale. The <em>Bedarfsplanung</em>, the needs-based planning system that determines how many psychotherapy seats are approved per region, was partially reformed in 2019 after decades of operating on 1990s-era ratios. The reform added roughly 800 seats. The BPtK considers it still fundamentally inadequate <a href="#ref-14">[14]</a>.</p>
<p>The arithmetic is plain. DiGA spending: growing 71% year on year. Psychotherapy fees: cut by 4.5%. The direction is unambiguous. Invest in the scalable. Cut the unscalable. And the damage compounds in a way that the policy apparatus appears not to understand, or not to care about. A therapist who leaves the profession because €52 per hour is no longer viable does not return when the cut is reversed. The training pipeline for a new clinical psychologist runs six to eight years from university admission to licensure. Over forty thousand accredited psychotherapists serve the system today <a href="#ref-14">[14]</a>. Every one who leaves creates a gap measured in decades, not budget cycles. The Telefonseelsorge, staffed by volunteers and funded by the churches, is not a mental health system. It is what remains when the mental health system is not there. Treating it as a substitute — treating 7,700 volunteers as adequate coverage for a country of 84 million — is not a policy position. It is an admission that the actual policy has failed.</p>
<h2 id="the-uncomfortable-part">The Uncomfortable Part</h2>
<p>Here is where I should, by the conventions of the form, propose a solution. I should say something about funding, about training pipelines, about recognising care work as infrastructure rather than a cost centre.</p>
<p>I think those things are true. I think we should pay therapists more, not less. I think Baumol&rsquo;s cost disease means we should <em>expect</em> this to be expensive and fund it anyway, because the alternative — accepting that people in crisis will wait 142 days while the scalable economy celebrates another productivity milestone — is a failure of collective priorities so basic that it should be uncomfortable to state plainly. But I am also the person who automated his setlist workflow and was satisfied by the compression. I vibe code things. I use AI tools daily. I am inside the attention gradient, not observing it from above. The part of me that finds leverage intoxicating is the same part that writes this blog, and I do not think I am unusual in this.</p>
<p>The structural isomorphism is exact: Baumol&rsquo;s string quartet, the therapist&rsquo;s fifty minutes, the crisis counsellor&rsquo;s phone call at 3am. The labour is the product. The product does not scale. The cost rises. The talent flows elsewhere. And the policy, rather than resisting the gradient, follows it — funding apps, cutting fees, digitising what cannot be digitised without changing what it is. The layers app reaches 85,000 users. The therapy app is reimbursed within the week. The therapist is available in five months, if at all.</p>
<p>I do not have a clean resolution to offer. I have a diagnosis — Baumol&rsquo;s cost disease, applied to the attention economy of a civilisation that has discovered how to make scalable work almost free — and an observation: the political system is not counteracting the disease. It is accelerating it. The quartet still needs four musicians. The session still needs the therapist in the room. The phone still needs someone to answer it. Nothing we are building will change this. The question is whether we notice before the people who needed the answer stop calling.</p>
<hr>
<h2 id="references">References</h2>
<p><span id="ref-1"></span>[1] Bundespsychotherapeutenkammer. <em>Psychisch Kranke warten 142 Tage auf eine psychotherapeutische Behandlung</em>. BPtK. <a href="https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/">https://www.bptk.de/pressemitteilungen/psychisch-kranke-warten-142-tage-auf-eine-psychotherapeutische-behandlung/</a></p>
<p><span id="ref-2"></span>[2] Evangelisch-Lutherische Kirche in Norddeutschland (2025). <em>Finanzierung der Telefonseelsorge ist äußerst angespannt</em>. <a href="https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt">https://www.kirche-mv.de/nachrichten/2025/februar/finanzierung-der-telefonseelsorge-ist-aeusserst-angespannt</a></p>
<p><span id="ref-3"></span>[3] Kassenärztliche Bundesvereinigung (2026). <em>Paukenschlag: KBV klagt gegen massive Kürzungen psychotherapeutischer Leistungen</em>. <a href="https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen">https://www.kbv.de/presse/pressemitteilungen/2026/paukenschlag-kbv-klagt-gegen-massive-kuerzungen-psychotherapeutischer-leistungen</a></p>
<p><span id="ref-4"></span>[4] Flückiger, C., Del Re, A. C., Wampold, B. E., &amp; Horvath, A. O. (2018). The alliance in adult psychotherapy: A meta-analytic synthesis. <em>Psychotherapy</em>, 55(4), 316–340. <a href="https://doi.org/10.1037/pst0000172">https://doi.org/10.1037/pst0000172</a></p>
<p><span id="ref-5"></span>[5] Baumol, W. J., &amp; Bowen, W. G. (1966). <em>Performing Arts, The Economic Dilemma: A Study of Problems Common to Theater, Opera, Music and Dance</em>. Twentieth Century Fund.</p>
<p><span id="ref-6"></span>[6] Baumol, W. J. (2012). <em>The Cost Disease: Why Computers Get Cheaper and Health Care Doesn&rsquo;t</em>. Yale University Press.</p>
<p><span id="ref-7"></span>[7] Bundesinstitut für Arzneimittel und Medizinprodukte. <em>DiGA-Verzeichnis</em>. <a href="https://diga.bfarm.de/de">https://diga.bfarm.de/de</a></p>
<p><span id="ref-8"></span>[8] GKV-Spitzenverband (2025). <em>Bericht des GKV-Spitzenverbandes über die Inanspruchnahme und Entwicklung der Versorgung mit Digitalen Gesundheitsanwendungen</em>. Reported in: MTR Consult. <a href="https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany">https://mtrconsult.com/news/gkv-report-utilization-and-development-digital-health-application-diga-care-germany</a></p>
<p><span id="ref-9"></span>[9] Heise Online (2025). <em>Insurers critique high costs and low benefits of prescription apps</em>. <a href="https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html">https://www.heise.de/en/news/Insurers-critique-high-costs-and-low-benefits-of-prescription-apps-10375339.html</a></p>
<p><span id="ref-10"></span>[10] Goeldner, M., &amp; Gehder, S. (2024). Digital Health Applications (DiGAs) on a Fast Track: Insights From a Data-Driven Analysis of Prescribable Digital Therapeutics in Germany From 2020 to Mid-2024. <em>JMIR mHealth and uHealth</em>. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/">https://pmc.ncbi.nlm.nih.gov/articles/PMC11393499/</a></p>
<p><span id="ref-11"></span>[11] Taz (2026). <em>Weniger Honorar für Psychotherapie</em>. <a href="https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/">https://taz.de/Weniger-Honorar-fuer-Psychotherapie/!6162806/</a></p>
<p><span id="ref-12"></span>[12] Bundespsychotherapeutenkammer (2026). <em>Gemeinsam gegen die Kürzung psychotherapeutischer Leistungen</em>. <a href="https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/">https://www.bptk.de/pressemitteilungen/gemeinsam-gegen-die-kuerzung-psychotherapeutischer-leistungen/</a></p>
<p><span id="ref-13"></span>[13] Bundespsychotherapeutenkammer (2025). <em>Koalitionsvertrag gibt psychischer Gesundheit neuen Stellenwert</em>. <a href="https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/">https://www.bptk.de/pressemitteilungen/koalitionsvertrag-gibt-psychischer-gesundheit-neuen-stellenwert/</a></p>
<p><span id="ref-14"></span>[14] Bundespsychotherapeutenkammer. <em>Reform der Bedarfsplanung</em>. <a href="https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/">https://www.bptk.de/ratgeber/reform-der-bedarfsplanung/</a></p>
]]></content:encoded>
    </item>
    <item>
      <title>Automate the Boring Stuff: Setlist to Playlist</title>
      <link>https://sebastianspicker.github.io/posts/setlist-to-playlist/</link>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/setlist-to-playlist/</guid>
      <description>I love concerts. I love setlists. I hate building the playlist manually afterward. But do I really? A small automation project, a Deftones show in Dortmund, and the question of whether you should automate something you kind of enjoy.</description>
      <content:encoded><![CDATA[<p>Saturday was the Deftones at the Westfalenhalle in Dortmund. One of those concerts where the setlist is part of the experience — where you register, with something close to physical relief, that the arc landed exactly right, and you spend the Uber home mentally replaying the order.</p>
<p>Sunday I built a playlist from it. It took about four minutes.</p>
<p>This is the post about why that number used to be forty, and why shrinking it is not purely a win.</p>
<h2 id="the-ritual">The Ritual</h2>
<p>There is a specific kind of concert listening that happens in the days after a show. You go home, you look up the setlist — setlist.fm is the canonical archive, maintained with an almost academic precision by people who care — and you build a playlist from it in whatever streaming app you use. Then you play it through, in order, and what comes back is not just the music but the spatial memory of the room, the sound mix, the moment the lights dropped for that particular song.</p>
<p>I have been doing this for years. It is a ritual, and like most rituals, part of its meaning is in the doing. The forty minutes of searching song by song, the occasional discovery that a deep cut is on Apple Music in one version but not another, the fiddling with live versus studio — that friction is not purely annoying. It is part of the processing.</p>
<p>And yet. The pile of unprocessed setlists sits in a folder. Shows I attended and never got around to. Setlists I meant to build into playlists and didn&rsquo;t, because the forty minutes were not available that week, and then the moment passed. The ritual unrealised is just a list of song titles.</p>
<p>This is the dilemma, and it is not entirely trivial.</p>
<h2 id="why-this-is-harder-than-it-should-be">Why This Is Harder Than It Should Be</h2>
<p>The setlist.fm API is excellent. It gives you structured data: artist, venue, date, song titles in order, with notations for encores, covers, and dropped songs. What it does not give you is streaming IDs. The song title is a string; the Apple Music track is an object with a catalog ID, a duration, multiple versions, regional availability, and the possibility of not existing at all in the catalog of your country.</p>
<p>The matching problem — connecting a string like &ldquo;Change (In the House of Flies)&rdquo; to the correct Apple Music track, filtered for the right album version, ignoring the live recordings you did not ask for — is not hard, but it is fiddly. You can get 80% of a setlist matched automatically without much effort. The remaining 20% are the covers, the deep cuts, the songs with subtitles in parentheses that differ between the setlist record and the catalog metadata.</p>
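<p>To give a flavour of &ldquo;fiddly&rdquo;, here is a minimal sketch of the normalisation step, written in Python for brevity rather than taken from the app itself:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re
import unicodedata

def normalise_title(title):
    """Casefold, strip accents, drop parenthetical subtitles and
    live/remaster markers, so both sides compare on the same form."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = t.casefold()
    t = re.sub(r"[\(\[].*?[\)\]]", " ", t)   # drop "(...)" and "[...]" subtitles
    t = re.sub(r"\b(live|remaster(ed)?|single version|radio edit)\b", " ", t)
    return re.sub(r"\s+", " ", t).strip()

def score_candidate(setlist_title, catalog_title):
    """1.0 for an exact normalised match, else token overlap.
    A real matcher would also weigh duration, album and artist."""
    a, b = normalise_title(setlist_title), normalise_title(catalog_title)
    if a == b:
        return 1.0
    ta, tb = set(a.split()), set(b.split())
    return len(ta &amp; tb) / max(len(ta | tb), 1)
</code></pre></div>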
<p>Spotify has a fairly rich ecosystem of community tools for exactly this workflow, because Spotify&rsquo;s API is permissive and well-documented and the auth flow is reasonable for third-party developers. Apple Music is harder. The MusicKit framework is real and capable, but the authentication requires managing a private key and JWT tokens signed with developer credentials — not the OAuth dance most developers are used to. The result is that the setlist → Apple Music pipeline is significantly underbuilt compared to the Spotify equivalent.</p>
<p>This is partly why I built <a href="https://github.com/sebastianspicker/setlist-to-playlist">setlist-to-playlist</a> as a PWA rather than reaching for an existing tool.</p>
<h2 id="how-it-works">How It Works</h2>
<p>The app is a Progressive Web App — installable, mobile-friendly, works as a small tool you open on your phone in the taxi home from a show — built on Next.js with a monorepo structure managed by pnpm and Turbo. The workflow runs in three phases:</p>
<p><strong>Import.</strong> You paste a setlist.fm URL or ID. The app queries setlist.fm through a server-side proxy — the API key lives on the server and never touches the client — and returns the structured setlist data: songs in order, with metadata about covers, medleys, and notes.</p>
<p><strong>Preview and matching.</strong> The core package runs a matching algorithm against the Apple Music catalog, using the MusicKit JS API for browser-based catalog search. For each song title, it searches Apple Music and presents the best candidate, giving you the chance to confirm or swap before anything is written. This is the step where the 20% problem is addressed manually — the app handles the obvious cases automatically and surfaces the ambiguous ones for human judgement.</p>
<p><strong>Export.</strong> Once you are happy with the track list, the app creates a playlist in your Apple Music library. MusicKit handles the authentication in-browser; the backend generates the JWT tokens using credentials from Apple Developer, signing with the private key server-side so it stays off the client.</p>
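<p>The developer token is the only genuinely unusual part. A sketch of its shape, in Python with PyJWT rather than the app&rsquo;s actual TypeScript implementation (team ID, key ID, and key path are placeholders):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import time
import jwt  # PyJWT, with the "cryptography" extra installed for ES256

TEAM_ID = "ABCDE12345"                       # placeholder Apple Developer team ID
KEY_ID = "XYZ9876543"                        # placeholder MusicKit key ID
PRIVATE_KEY_PATH = "AuthKey_XYZ9876543.p8"   # placeholder .p8 key file

def make_developer_token(ttl_seconds=3600):
    """Sign a MusicKit developer token: ES256, issuer is the team ID,
    key ID in the header. The private key stays on the server."""
    with open(PRIVATE_KEY_PATH) as f:
        private_key = f.read()
    now = int(time.time())
    return jwt.encode(
        {"iss": TEAM_ID, "iat": now, "exp": now + ttl_seconds},
        private_key,
        algorithm="ES256",
        headers={"kid": KEY_ID},
    )
</code></pre></div>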
<p>The whole thing is local-first in the sense that matters: the Apple Music authentication is between your browser and Apple, and no playlist data or listening history is stored by the app. The only thing the server touches is the API key proxying and the JWT generation.</p>
<h2 id="the-actual-experience">The Actual Experience</h2>
<p>After the Deftones show: opened the app on the phone, pasted the setlist.fm URL, had the playlist in Apple Music in about four minutes. Three tracks needed manual confirmation — two because of live-versus-studio ambiguity, one because a cover required a search adjustment, the kind of edge case where the name setlist.fm records differs from what appears in regional streaming catalogs.</p>
<p>Four minutes instead of forty. Mission accomplished.</p>
<p>And yet.</p>
<p>I noticed, processing the setlist that quickly, that something was missing. Not the music — the music was all there, in order, correct. What was missing was the time spent inside the setlist. The forty minutes of handling each song is also forty minutes of thinking about each song, of remembering where in the set it fell, of deciding which album version you want to hear. The automation removed the friction and also removed the processing.</p>
<p>I am not sure this is a problem. It is probably more accurate to say that it is a trade-off, and that what trade-off you want depends on what you are doing with the ritual. If the backlog is the problem — the pile of unprocessed shows — the automation solves it cleanly. If the processing itself is the point, you probably should not automate it, and the tool is there for when you want it.</p>
<p>That is the correct relationship to automation, I think. Not &ldquo;this should always be automated&rdquo; or &ldquo;this should never be automated&rdquo;, but &ldquo;here is a tool that removes the mechanical part; use it when the mechanical part is not the point&rdquo;.</p>
<h2 id="a-note-on-the-tech-stack">A Note on the Tech Stack</h2>
<p>For the interested: Next.js 15 with App Router, pnpm workspaces with Turbo for the monorepo, MusicKit JS for Apple Music integration, setlist.fm REST API. The JWT for Apple Music uses the <code>jose</code> library for token signing. The matching logic lives in a standalone <code>packages/core</code> module, which makes it testable in isolation and reusable if anyone wants to port this to a different frontend or a CLI.</p>
<p>The repo is at <a href="https://github.com/sebastianspicker/setlist-to-playlist">github.com/sebastianspicker/setlist-to-playlist</a>. PRs welcome, particularly around the matching heuristics — that is the part where there is the most room for improvement.</p>
<hr>
<p>The Deftones were exceptional, for the record. The Westfalenhalle was loud in the way that only a concrete hall that size can be loud, which is to say: correctly loud.</p>
<p>The playlist is good. I am glad it took four minutes and not forty.</p>
<p>I am also glad I know what I gave up.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Constraining the Coding Agent: The Ralph Loop and Why Determinism Matters</title>
      <link>https://sebastianspicker.github.io/posts/ralph-loop/</link>
      <pubDate>Thu, 04 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ralph-loop/</guid>
      <description>In late 2025, agentic coding tools went from impressive demos to daily infrastructure. The problem nobody talked about enough: when an LLM agent has write access to a codebase and no formal constraints, reproducibility breaks down. The Ralph Loop is a deterministic, story-driven execution framework that addresses this — one tool call per story, scoped writes, atomic state. A design rationale with a formal sketch of why the constraints matter.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
This post is the design rationale.</em></p>
<hr>
<h2 id="december-2025">December 2025</h2>
<p>It happened fast. In the twelve months before this writing, agentic
coding went from a niche research topic to the default mode for several
categories of software engineering task. Codex runs code in a sandboxed
container and submits pull requests. Claude Code works through a task list
in your terminal while you make coffee. Cursor&rsquo;s agent mode rewrites a
file, runs the tests, reads the failures, and tries again — automatically,
without waiting for you to press a button.</p>
<p>The demos are impressive. The production reality is messier.</p>
<p>The problem is not that these systems do not work. They work well enough,
often enough, to be genuinely useful. The problem is that &ldquo;works&rdquo; means
something different when an agent is executing than when a human is.
A human who makes a mistake can tell you what they were thinking.
An agent that produces a subtly wrong result leaves you with a diff and
no explanation. And an agent run that worked last Tuesday might not work
today, because the model changed, or the context window filled differently,
or the prompt-to-output mapping is, at bottom, a stochastic function.</p>
<p>This is the problem the Ralph Loop is designed to address: not &ldquo;make
agents more capable&rdquo; but &ldquo;make agent runs reproducible.&rdquo;</p>
<hr>
<h2 id="the-reproducibility-problem-formally">The Reproducibility Problem, Formally</h2>
<p>An LLM tool call is a stochastic function. Given a prompt $p$, the
model samples from a distribution over possible outputs:</p>
$$T : \mathcal{P} \to \Delta(\mathcal{O})$$<p>where $\mathcal{P}$ is the space of prompts, $\mathcal{O}$ is the space
of outputs, and $\Delta(\mathcal{O})$ denotes the probability simplex over
$\mathcal{O}$.</p>
<p>At temperature zero — the most deterministic setting most systems support —
this collapses toward a point mass:</p>
$$T_0(p) \approx \delta_{o^*}$$<p>where $o^*$ is the argmax token sequence. &ldquo;Approximately&rdquo; because hardware
non-determinism, batching effects, and floating-point accumulation mean
that even $T_0$ is not strictly reproducible across runs, environments, or
model versions.</p>
<p>A naive agentic loop composes these calls. If an agent takes $k$ sequential
tool calls to complete a task, the result is a $k$-fold composition:</p>
$$o_k = T(T(\cdots T(p_0) \cdots))$$<p>The variance does not merely add — it propagates through the dependencies.
Early outputs condition later prompts; a small deviation at step 2 can
shift the trajectory of step 5 substantially. This is not a theoretical
concern. It is the practical experience of anyone who has tried to reproduce
a multi-step agent run.</p>
<p>The Ralph Loop does not solve the stochasticity of $T$. What it does is
prevent the composition.</p>
<hr>
<h2 id="the-ralph-loop-as-a-state-machine">The Ralph Loop as a State Machine</h2>
<p>The system&rsquo;s state at any point in a run is a triple:</p>
$$\sigma = (Q,\; S,\; L)$$<p>where:</p>
<ul>
<li>$Q = (s_1, s_2, \ldots, s_n)$ is the ordered story queue — the PRD
(product requirements document) — with stories sorted by priority, then
by ID</li>
<li>$S \in \lbrace \texttt{open}, \texttt{passing}, \texttt{skipped} \rbrace^n$
is the status vector, one entry per story</li>
<li>$L \in \lbrace \texttt{free}, \texttt{held} \rbrace$ is the file-lock
state protecting $S$ from concurrent writes</li>
</ul>
<p>The transition function $\delta$ at each step is:</p>
<ol>
<li><strong>Select</strong>: $i^* = \min\lbrace i : S[i] = \texttt{open} \rbrace$ —
deterministic by construction, since $Q$ has a fixed ordering</li>
<li><strong>Build</strong>: $p = \pi(s_{i^*},\; \text{CODEX.md})$ — a pure function of
the story definition and the static policy document; no dependency on
previous tool outputs</li>
<li><strong>Execute</strong>: $o \sim T(p)$ — exactly one tool call, output captured</li>
<li><strong>Accept</strong>: $\alpha(o) \in \lbrace \top, \bot \rbrace$ — parse the
acceptance criterion (was the expected report file created at the
expected path?)</li>
<li><strong>Commit</strong>: if $\alpha(o) = \top$, set $S[i^*] \leftarrow \texttt{passing}$;
otherwise increment the attempt counter; write atomically under lock $L$</li>
</ol>
<p>The next state is $\sigma' = (Q, S', L)$ where $S'$ differs from $S$ in
exactly one position. The loop continues until no open stories remain or
a story limit $N$ is reached.</p>
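<p>A compressed sketch of the transition function in Python (the real implementation is Bash and jq; <code>run_tool</code> and <code>accept</code> stand in for the single agent invocation and the filesystem acceptance check):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from dataclasses import dataclass

@dataclass
class Story:
    id: str
    spec: str
    status: str = "open"    # open | passing | skipped
    attempts: int = 0

def build_prompt(story, policy):
    # Pure function of the story and the static policy document:
    # no previous tool output is ever folded in.
    return f"{policy}\n\n## Story {story.id}\n{story.spec}"

def run_loop(queue, policy, run_tool, accept, max_attempts=3):
    """One tool call per story attempt; each step changes one status entry."""
    while True:
        open_stories = [s for s in queue if s.status == "open"]
        if not open_stories:
            return queue
        story = open_stories[0]             # deterministic: queue has a fixed order
        output = run_tool(build_prompt(story, policy))   # exactly one call
        if accept(output):                  # e.g. the expected report file exists
            story.status = "passing"
        else:
            story.attempts += 1
            if story.attempts == max_attempts:
                story.status = "skipped"    # bounded attempts guarantee termination
</code></pre></div>
<p>The file lock and the atomic status write are elided here; in the Bash implementation they are what makes the commit step safe.</p>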
<p><strong>Termination.</strong> Since $|Q| = n$ is finite, $S$ has at most $n$ open
entries, and each step either closes one entry or increments an attempt
counter bounded by $A_{\max}$, the loop terminates in at most
$n \cdot A_{\max}$ steps. Under the assumption that $T$ eventually
satisfies any reachable acceptance criterion — which is what CODEX.md&rsquo;s
constraints are designed to encourage — the loop converges in exactly $n$
successful transitions.</p>
<p><strong>Replay.</strong> The entire trajectory $\sigma_0 \to \sigma_1 \to \cdots \to
\sigma_k$ is determined by $Q$ and the sequence of tool outputs
$o_1, o_2, \ldots, o_k$. The <code>.runtime/events.log</code> records these
outputs. If tool outputs are deterministic, the run is fully deterministic.
If they are not — as in practice they will not be — the stochasticity is
at least isolated to individual steps rather than allowed to compound
across the chain.</p>
<hr>
<h2 id="the-one-tool-call-invariant">The One-Tool-Call Invariant</h2>
<p>The most important constraint in the Ralph Loop is also the simplest:
exactly one tool call per story attempt.</p>
<p>This is not the natural design. A natural agentic loop would let the model
plan, execute, observe, reflect, and re-execute within a single story.
Some frameworks call this &ldquo;inner monologue&rdquo; or &ldquo;chain-of-thought with tool
use.&rdquo; The model emits reasoning tokens, calls a tool, reads the result,
emits more reasoning, calls another tool, and eventually produces the
final output.</p>
<p>This is more capable for complex tasks. It is also what makes
reproducibility hard. Each additional tool call in the chain is a fresh
draw from $T$, conditioned on the previous outputs. After five tool calls,
the prompt for the fifth includes four previous outputs — each of which
varied slightly from the last run. The fifth output is now conditioned on
a different input.</p>
<p>Formally: let the multi-call policy use $k$ sequential calls per story.
Each call $c_j$ produces output $o_j \sim T(p_j)$, where
$p_j = f(o_1, \ldots, o_{j-1}, s_{i^*})$ for some conditioning function
$f$. The variance of the final output $o_k$ depends on the accumulated
conditioning:</p>
$$\text{Var}(o_k) \;=\; \text{Var}_{o_1}\!\left[\, \mathbb{E}[o_k \mid o_1] \,\right] \;+\; \mathbb{E}_{o_1}\!\left[\, \text{Var}(o_k \mid o_1) \,\right]$$
<p>By the law of total variance, applied recursively, the total variance
decomposes into explained and residual components — conditioning
redistributes variance but does not eliminate the residual term. In a
well-designed, low-variance chain the residual may stay small; in
practice, LLM outputs have non-trivial variance at each step, and that
variance propagates through the conditioning chain.</p>
<p>The one-call constraint collapses $k$ to 1:</p>
$$o_i \sim T\!\bigl(\pi(s_i, \text{CODEX.md})\bigr)$$<p>The output depends only on the story definition and the static policy
document. Not on previous tool outputs. The stories are designed to be
atomic enough that one call is sufficient. If a story requires more, it
should be split into two stories in the PRD. This is a forcing function
toward better task decomposition, which I consider a feature rather than
a limitation.</p>
<hr>
<h2 id="scope-as-a-topological-constraint">Scope as a Topological Constraint</h2>
<p>In fixing mode, each story carries a <code>scope[]</code> field listing the files
or directories the agent is permitted to modify. The runner captures a
snapshot of the repository state before execution:</p>
$$F_{\text{before}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>where $h(f)$ is a hash of the file contents. After the tool call:</p>
$$F_{\text{after}} = \lbrace (f,\; h(f)) : f \in \text{repo} \rbrace$$<p>The diff $\Delta = F_{\text{after}} \setminus F_{\text{before}}$ must
satisfy:</p>
$$\forall\, (f, \_) \in \Delta \;:\; f \in \text{scope}(s_{i^*})$$<p>This is a locality constraint on the filesystem graph: the agent&rsquo;s writes
are confined to the neighbourhood $\mathcal{N}(s_{i^*})$ defined by the
story&rsquo;s scope declaration. Writes that escape this neighbourhood are a
story failure, regardless of whether they look correct.</p>
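<p>The check itself is small. A sketch in Python (the runner is Bash, but the snapshot-and-diff idea is the same):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import hashlib
from pathlib import Path

def snapshot(repo):
    """Map every file path in the repository to a content hash."""
    repo = Path(repo)
    return {
        str(p.relative_to(repo)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in repo.rglob("*") if p.is_file()
    }

def out_of_scope_writes(before, after, scope):
    """Files that are new or changed (the diff) but outside the story's
    declared scope. Deletions could be caught with the symmetric check."""
    changed = [f for f, h in after.items() if before.get(f) != h]
    return [f for f in changed if not any(f.startswith(s) for s in scope)]
</code></pre></div>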
<p>The motivation is containment. When a fixing agent makes a &ldquo;small repair&rdquo;
to one file but also helpfully tidies up three adjacent files it noticed
while reading, you have three undocumented changes outside the story&rsquo;s
intent. In a system with many stories running sequentially, out-of-scope
changes accumulate silently. The scope constraint prevents this.
Crucially, prompt instructions alone are not sufficient — an agent told
&ldquo;only modify files in scope&rdquo; can still modify out-of-scope files if the
instructions are interpreted loosely or the context is long. The runner
enforces scope at the file system level, after the fact, and that
enforcement cannot be argued with.</p>
<hr>
<h2 id="acceptance-criteria-grounding-evaluation-in-filesystem-events">Acceptance Criteria: Grounding Evaluation in Filesystem Events</h2>
<p>Each story&rsquo;s acceptance criterion is a single line of the form
<code>Created &lt;path&gt;</code> — the path where the report or output file should appear.</p>
<p>This is intentionally minimal. The alternative — semantic acceptance
criteria (&ldquo;did the agent identify all relevant security issues?&rdquo;) — would
require another model call to evaluate, reintroducing stochasticity at
the evaluation layer and creating the infinite regress of &ldquo;who checks the
checker.&rdquo; A created file at the right path is a necessary condition for
a valid run. It is not a sufficient condition for correctness, but
necessary conditions that can be checked deterministically are already
more than most agentic pipelines provide.</p>
<p>The quality of the outputs — whether the audit findings are accurate,
whether the fix is correct — depends on the model and the prompt quality.
The Ralph Loop gives you a framework for running agents safely and
repeatably. Verifying that the agent was right is a different problem and,
arguably, a harder one.</p>
<hr>
<h2 id="why-bash">Why Bash</h2>
<p>A question I have fielded: why Bash and jq, not Python or Node.js?</p>
<p>The practical reason: the target environment is an agent sandbox that has
reliable POSIX tooling but variable package availability. Python dependency
management inside a constrained container is itself a source of variance.
Bash with jq has no dependencies beyond what any standard Unix environment
provides.</p>
<p>The philosophical reason: the framework&rsquo;s job is orchestration, not
computation. It selects stories, builds prompts from templates, calls one
external tool, parses one file path, and updates one JSON field. None of
this requires a type system or a rich standard library. Bash is the right
tool for glue that does not need to be impressive.</p>
<p>The one place Bash becomes awkward is the schema validation layer, which
is implemented with a separate <code>jq</code> script against a JSON Schema. This
works but is not elegant. If the PRD schema grows substantially, that
component would be worth replacing with something that has native schema
validation support.</p>
<hr>
<h2 id="what-this-is-not">What This Is Not</h2>
<p>The Ralph Loop is not an agent. It is a harness for agents. It does not
decide what tasks to run, does not reason about a codebase, and does not
write code. It sequences discrete, pre-specified stories, enforces the
constraints on each execution, and records the outcomes. The intelligence
is in the model and in the story design; the framework contributes only
discipline.</p>
<p>This distinction matters because the current wave of agentic tools
conflates two things that are worth keeping separate: the capability to
reason and act (what the model provides) and the infrastructure for doing
so safely and repeatably (what the harness provides). Improving the model
does not automatically improve the harness — and a better model in a
poorly constrained harness just fails more impressively.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/ralph-loop">github.com/sebastianspicker/ralph-loop</a>.
The Bash implementation, the PRD schema, the CODEX.md policy document,
and the test suite are all there.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. What could be involved?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
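<p>A minimal sketch of that two-stage extraction, assuming <code>pypdf</code> for native text and <code>pdf2image</code> plus <code>pytesseract</code> for the OCR fallback (the repository&rsquo;s actual dependency choices may differ):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pypdf import PdfReader

MIN_USEFUL_CHARS = 200   # below this, treat the document as a scan

def extract_text(path, max_pages=2):
    """Native extraction first; OCR only if the text yield is too low."""
    reader = PdfReader(path)
    text = "\n".join(
        page.extract_text() or "" for page in reader.pages[:max_pages]
    )
    if len(text.strip()) &gt;= MIN_USEFUL_CHARS:
        return text
    # Fallback: render the pages to images and OCR them
    from pdf2image import convert_from_path
    import pytesseract
    images = convert_from_path(path, first_page=1, last_page=max_pages)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
</code></pre></div>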
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), leaving headroom for the prompt and
response when the window sits at the larger end of that range. For documents that exceed this, the extraction
is truncated — typically to the first N characters, on the reasonable
assumption that titles, dates, and document types appear early.</p>
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
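<p>The budget rule above, as code, using the chars-per-token approximation rather than a real tokeniser:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">DEFAULT_MAX_CONTENT_TOKENS = 28_000
CHARS_PER_TOKEN = 4   # rough rule of thumb for English text

def estimate_tokens(text):
    return len(text) // CHARS_PER_TOKEN

def truncate_to_budget(text, max_tokens=DEFAULT_MAX_CONTENT_TOKENS):
    """Keep the front of the document; titles, dates and types appear early."""
    return text[:max_tokens * CHARS_PER_TOKEN]
</code></pre></div>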
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
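<p>In outline, the scoring pass looks something like this (the patterns and the threshold are illustrative, not the repository&rsquo;s actual rule set):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

DATE_PATTERNS = [r"\b\d{4}-\d{2}-\d{2}\b", r"\b\d{2}\.\d{2}\.\d{4}\b"]
CATEGORY_MARKERS = {
    "invoice": ["rechnung", "invoice", "beleg", "gutschrift", "receipt"],
    "letter": ["sehr geehrte", "dear sir or madam"],
}
CONFIDENCE_THRESHOLD = 2   # illustrative: date plus category is enough

def heuristic_pass(text):
    """Return a metadata record if the rules are confident, else None
    (meaning: fall through to the LLM)."""
    head = text[:2000].lower()   # headers carry most of the signal
    score, record = 0, {}
    for pattern in DATE_PATTERNS:
        match = re.search(pattern, head)
        if match:
            record["date"] = match.group(0)
            score += 1
            break
    for category, markers in CATEGORY_MARKERS.items():
        if any(marker in head for marker in markers):
            record["category"] = category
            score += 1
            break
    return record if score &gt;= CONFIDENCE_THRESHOLD else None
</code></pre></div>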
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
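<p>A sketch of that validate-and-fallback step, with a schema simplified from what the real modules enforce:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
import re

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract", "other"}

def is_valid_date(value):
    """YYYYMMDD, and nothing else."""
    return isinstance(value, str) and re.fullmatch(r"\d{8}", value) is not None

def parse_llm_response(raw):
    """Parse the model's JSON and apply fallbacks for malformed fields."""
    try:
        # Models occasionally wrap the JSON in prose; take the first object
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        data = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        data = {}
    return {
        "date": data.get("date") if is_valid_date(data.get("date")) else None,
        "category": data.get("category")
            if data.get("category") in ALLOWED_CATEGORIES else "other",
        "keywords": list(data.get("keywords") or [])[:3],
        "summary": " ".join(str(data.get("summary", "")).split()[:5]),
    }
</code></pre></div>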
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
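<p>Put together, with the deterministic counter (a sketch of the assembly logic, not the exact code in <code>filename.py</code>):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re
from pathlib import Path

def sanitise(part):
    """Lowercase, hyphens only: no spaces, no special characters."""
    part = re.sub(r"[^a-z0-9]+", "-", part.lower())
    return part.strip("-")

def build_filename(directory, date, category, keywords, summary):
    directory = Path(directory)
    parts = [date or "undated", category, *keywords, summary]
    stem = "-".join(sanitise(p) for p in parts if p)
    candidate = directory / f"{stem}.pdf"
    counter = 0
    # Collision resolution: append _01, _02, ... until the name is free
    while candidate.exists():
        counter += 1
        candidate = directory / f"{stem}_{counter:02d}.pdf"
    return candidate
</code></pre></div>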
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small local model (the 3B default, or the 7B GPU preset) is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
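<p>Calling the endpoint is an ordinary OpenAI-style completion request. A sketch with <code>requests</code>, using the documented defaults:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import requests

ENDPOINT = "http://127.0.0.1:11434/v1/completions"   # local Ollama, no external traffic

def complete(prompt, model="qwen2.5:3b"):
    """One completion request against the local server."""
    response = requests.post(
        ENDPOINT,
        json={"model": model, "prompt": prompt,
              "max_tokens": 512, "temperature": 0},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
</code></pre></div>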
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
