<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Python on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/python/</link>
    <description>Recent content in Python on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Sat, 22 Mar 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/python/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Papertrail: AI PDF Renaming and the Tokens That Make It Interesting</title>
      <link>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</link>
      <pubDate>Sat, 22 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-pdf-renamer/</guid>
      <description>Everyone has a Downloads folder full of &amp;ldquo;scan0023.pdf&amp;rdquo; and &amp;ldquo;document(3)-final-FINAL.pdf&amp;rdquo;. Renaming them by content sounds trivial — read the file, understand what it is, give it a name. The implementation reveals something useful about how LLMs actually handle text: what a token is, why context windows matter in practice, why you want structured output instead of prose, and why heuristics should go first. The repository is at github.com/sebastianspicker/AI-PDF-Renamer.</description>
      <content:encoded><![CDATA[<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.</em></p>
<hr>
<h2 id="the-problem">The Problem</h2>
<p>Every PDF acquisition pipeline eventually produces the same chaos.
Journal articles downloaded from publisher sites arrive as
<code>513194-008.pdf</code> or <code>1-s2.0-S0360131520302700-main.pdf</code>. Scanned
letters from the tax authority arrive as <code>scan0023.pdf</code>. Invoices arrive
as <code>Rechnung.pdf</code> — every invoice from every vendor, overwriting each
other if you are not paying attention. The actual content is
in the file. The filename tells you nothing.</p>
<p>The human solution is trivial: open the PDF, glance at the title or
date or sender, type a descriptive name. Thirty seconds per file,
multiplied by several hundred files accumulated over a year, becomes
a task that perpetually does not get done.</p>
<p>The automated solution sounds equally trivial: read the text, decide what
the document is, generate a filename. How hard could it be?</p>
<p>Quite a bit, it turns out. Working through the implementation is a useful
way to make concrete some things about LLMs and text processing that are
easy to understand in the abstract but clearer with a specific task in
front of you.</p>
<hr>
<h2 id="step-one-getting-text-out-of-a-pdf">Step One: Getting Text Out of a PDF</h2>
<p>A PDF is not a text file. It is a binary format designed for page layout
and print fidelity — it encodes character positions, fonts, and rendering
instructions, not a linear stream of prose. The text in a PDF has to be
extracted by a parser that reassembles it from the position data.</p>
<p>For PDFs with embedded text (most modern documents), this works well
enough. For scanned PDFs — images of pages, with no embedded text at all —
you need OCR as a fallback. The pipeline handles both: native extraction
first, OCR if the text yield is below a useful threshold.</p>
<p>The result is a string. Already there are failure modes: two-column
layouts produce interleaved text if the parser reads left-to-right across
both columns simultaneously; footnotes appear in the middle of
sentences; tables produce gibberish unless the parser handles them
specifically. These are not catastrophic — for renaming purposes,
the first paragraph and the document header are usually enough, and those
are less likely to be badly formatted than the body. But they are real,
and they mean that the text passed to the next stage is not always clean.</p>
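<p>The native-versus-OCR decision reduces to a yield check. A minimal sketch, with a hypothetical helper name and threshold rather than the repository&rsquo;s actual values:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Hypothetical sketch: route a PDF to OCR when the native text layer
# yields too little text to be useful.
def needs_ocr(extracted_text: str, pages: int = 1,
              min_chars_per_page: int = 200) -> bool:
    """True when native extraction produced too little text per page."""
    usable = len(extracted_text.strip())
    return min_chars_per_page > usable // max(pages, 1)
</code></pre></div>
<p>A scanned PDF with no text layer yields an empty string and falls straight through to OCR; a born-digital paper passes the check on its first page.</p>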
<hr>
<h2 id="step-two-the-token-budget">Step Two: The Token Budget</h2>
<p>Once you have a string representing the document&rsquo;s text, you cannot simply
pass all of it to a language model. Two reasons: context windows have hard
limits, and — even when they are large enough — filling them with the full
text of a thirty-page document is wasteful for a task that only needs the
title, date, and category.</p>
<p>Language models do not process characters. They process <em>tokens</em> — subword
units produced by the same BPE compression scheme I described
<a href="/posts/strawberry-tokenisation/">in the strawberry post</a>. A rough
practical rule for English text is:</p>
$$N_{\text{tokens}} \;\approx\; \frac{N_{\text{chars}}}{4}$$<p>This is an approximation — technical text, non-English content, and
code tokenise differently — but it is useful for budgeting. A ten-page
academic paper might contain around 30,000 characters, which is
approximately 7,500 tokens. The context window of a small local model
(the default here is <code>qwen2.5:3b</code> via Ollama) is typically in the range
of 8,000–32,000 tokens, depending on the version and configuration.
You have room — but not unlimited room, and the LLM also needs space
for the prompt itself and the response.</p>
<p>The tool defaults to 28,000 tokens of extracted text
(<code>DEFAULT_MAX_CONTENT_TOKENS</code>), a budget that assumes a context
window at the larger end of that range; with a smaller window, the cap has
to come down to leave room for the prompt and the response. For documents
that exceed the budget, the extraction is truncated — typically to the
first N characters, on the reasonable assumption that titles, dates, and
document types appear early.</p>
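<p>The budgeting rule and the front-truncation can be sketched in a few lines. The helper names are made up; the constant matches the documented default:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Hypothetical sketch of the chars-per-token rule of thumb and
# front-truncation to a token budget.
CHARS_PER_TOKEN = 4  # rough average for English text

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def truncate_to_budget(text: str, max_tokens: int = 28_000) -> str:
    """Keep the front of the document: titles, dates, and document
    types tend to appear early."""
    return text[: max_tokens * CHARS_PER_TOKEN]
</code></pre></div>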
<p>This truncation is a design decision, not a limitation to be apologised
for. For the renaming task, the first two pages of a document contain
everything the filename needs. A strategy that extracts the first page
plus the last page (which often has a date, a signature, or a reference
number) would work for some document types. The current implementation
keeps it simple: take the front, stay within budget.</p>
<hr>
<h2 id="step-three-heuristics-first">Step Three: Heuristics First</h2>
<p>Here is something that improves almost any LLM pipeline for structured
extraction tasks: do as much work as possible with deterministic rules
before touching the model.</p>
<p>The AI PDF Renamer applies a scoring pass over the extracted text before
deciding whether to call the LLM at all. The heuristics are regex-based
rules that look for patterns likely to appear in specific document types:</p>
<ul>
<li>Date patterns: <code>\d{4}-\d{2}-\d{2}</code>, <code>\d{2}\.\d{2}\.\d{4}</code>, and a
dozen variants</li>
<li>Document type markers: &ldquo;Rechnung&rdquo;, &ldquo;Invoice&rdquo;, &ldquo;Beleg&rdquo;, &ldquo;Gutschrift&rdquo;,
&ldquo;Receipt&rdquo;</li>
<li>Author/institution lines near the document header</li>
<li>Keywords from a configurable list associated with specific categories</li>
</ul>
<p>Each rule that fires contributes a score to a candidate metadata record.
If the heuristic pass produces a confident result — date found, category
identified, a couple of distinguishing keywords present — the LLM call
is skipped entirely. The file gets renamed from the heuristic output.</p>
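<p>A minimal sketch of such a scoring pass, with made-up patterns, weights, and confidence threshold rather than the repository&rsquo;s actual rules:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

# Hypothetical rules: each hit contributes to a candidate metadata record.
DATE_PATTERNS = [r"\d{4}-\d{2}-\d{2}", r"\d{2}\.\d{2}\.\d{4}"]
TYPE_MARKERS = {"invoice": ["rechnung", "invoice", "beleg", "gutschrift", "receipt"]}

def score_document(text: str) -> dict:
    result = {"date": None, "category": None, "score": 0}
    lowered = text.lower()
    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text)
        if match:
            result["date"] = match.group(0)
            result["score"] += 2
            break
    for category, markers in TYPE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            result["category"] = category
            result["score"] += 2
            break
    return result

def heuristics_confident(result: dict, threshold: int = 4) -> bool:
    """Above the threshold, the LLM call is skipped entirely."""
    return result["score"] >= threshold
</code></pre></div>
<p>For the two-hundred-invoices case, every document hits the same date pattern and the same type marker, and the batch never touches the model.</p>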
<p>This matters for a few reasons. Heuristics are fast (microseconds vs.
seconds for an LLM call), deterministic (the same input always produces
the same output), and do not require a running model. For a batch of
two hundred invoices from the same vendor, the heuristic pass will handle
most of them without any LLM involvement.</p>
<p>The LLM is enrichment for the hard cases: documents with unusual formats,
mixed-language content, documents where the type is not obvious from
surface features. In practice this is probably 20–40% of a typical
mixed-document folder.</p>
<hr>
<h2 id="step-four-what-to-ask-the-llm-and-how">Step Four: What to Ask the LLM, and How</h2>
<p>When a heuristic pass does not produce a confident result, the pipeline
builds a prompt from the extracted text and sends it to the local
endpoint. What the prompt asks for matters enormously.</p>
<p>The naive approach: &ldquo;Please rename this PDF. Here is the content: [text].&rdquo;
The response will be a sentence. Maybe several sentences. It will not be
parseable as a filename without further processing, and that further
processing is itself an LLM call or a fragile regex.</p>
<p>The better approach: ask for structured output. The prompt in
<code>llm_prompts.py</code> requests a JSON object conforming to a schema — something
like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;date&#34;</span><span class="p">:</span> <span class="s2">&#34;YYYYMMDD or null&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;category&#34;</span><span class="p">:</span> <span class="s2">&#34;one of: invoice, paper, letter, contract, ...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;keywords&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;max 3 short keywords&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;summary&#34;</span><span class="p">:</span> <span class="s2">&#34;max 5 words&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The model returns JSON. The response parser in <code>llm_parsing.py</code> validates
it against the schema, catches malformed responses, applies fallbacks for
null fields, and sanitises the individual fields before they are assembled
into a filename.</p>
<p>This works because JSON is well-represented in LLM training data —
models have seen vastly more JSON than they have seen arbitrary prose
instructions to parse. A model told to return a specific JSON structure
will do so reliably for most inputs. The failure rate (malformed JSON,
missing fields, hallucinated values) is low enough to be handled by
the fallback logic.</p>
<p>What counts as a hallucinated value in this context? Dates in the future.
Categories not in the allowed set. Keywords that are not present in the
source text. The <code>llm_schema.py</code> validation layer catches the obvious
cases; for subtler errors (a plausible-sounding date that does not appear
in the document), the tool relies on the heuristic pass having already
identified any date that can be reliably extracted.</p>
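<p>A sketch of what the parse-and-validate step might look like. The category set and helper name are assumptions for illustration; the real logic lives in <code>llm_parsing.py</code> and <code>llm_schema.py</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
from datetime import date

ALLOWED_CATEGORIES = {"invoice", "paper", "letter", "contract"}  # assumed set

def parse_llm_response(raw: str, source_text: str) -> dict:
    """Hypothetical sketch: accept only values the schema and the
    source text can vouch for; fall back to an empty record otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    result = {}
    d = data.get("date")
    if isinstance(d, str) and len(d) == 8 and d.isdigit():
        if not d > date.today().strftime("%Y%m%d"):  # reject future dates
            result["date"] = d
    if data.get("category") in ALLOWED_CATEGORIES:
        result["category"] = data["category"]
    keywords = data.get("keywords") or []
    result["keywords"] = [k for k in keywords
                          if isinstance(k, str) and k.lower() in source_text.lower()][:3]
    return result
</code></pre></div>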
<hr>
<h2 id="step-five-the-filename">Step Five: The Filename</h2>
<p>The output format is <code>YYYYMMDD-category-keywords-summary.pdf</code>. A few
design decisions embedded in this:</p>
<p><strong>Date first.</strong> Lexicographic sorting of filenames then gives you
chronological sorting for free. This is the most useful sort order for
most document types — you want to find the most recent invoice, not
the alphabetically first one.</p>
<p><strong>Lowercase, hyphens only.</strong> No spaces (which require escaping in many
contexts), no special characters (which are illegal in some filesystems
or require quoting), no uppercase (which creates case-sensitivity issues
across platforms). The sanitisation step in <code>filename.py</code> strips or
replaces anything that does not conform.</p>
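<p>A minimal sketch of that sanitisation, assuming lowercase ASCII and hyphens are the only characters allowed through:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

def sanitise(part: str) -> str:
    """Hypothetical sketch: lowercase, hyphen-separated, nothing else."""
    part = part.lower()
    part = re.sub(r"[^a-z0-9]+", "-", part)  # collapse runs of anything else
    return part.strip("-")
</code></pre></div>
<p>Umlauts and other non-ASCII characters simply drop out in this sketch; a production version would transliterate them first.</p>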
<p><strong>Collision resolution.</strong> Two documents with the same date, category,
keywords, and summary would produce the same filename. The resolver
appends a counter suffix (<code>_01</code>, <code>_02</code>, &hellip;) when a target name already
exists. This is deterministic — the same set of documents always produces
the same filenames, regardless of processing order — which matters for
the undo log.</p>
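<p>The counter-suffix idea can be sketched as follows (the signature is hypothetical):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def resolve_collision(stem: str, ext: str, existing: set) -> str:
    """Hypothetical sketch: append _01, _02, ... until the name is free."""
    candidate = stem + ext
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = f"{stem}_{counter:02d}{ext}"
    return candidate
</code></pre></div>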
<hr>
<h2 id="local-first">Local-First</h2>
<p>The LLM endpoint defaults to <code>http://127.0.0.1:11434/v1/completions</code> —
Ollama running locally, no external traffic. This is a deliberate choice
for a document management tool. The documents being renamed are likely
to include medical records, financial statements, legal correspondence —
content that should not be routed through an external API by default.</p>
<p>A small model in the 3–8B parameter range running locally is sufficient for this task. The
extraction problem does not require deep reasoning; it requires pattern
recognition over a short text and the ability to return a specific JSON
structure. Models at this scale handle it well. The latency is measurable
(a few seconds per document on a modern laptop with a reasonably fast
inference backend) but acceptable for a batch job running in the
background.</p>
<p>For users who want to use a remote API, the endpoint is configurable —
the local default is a sensible starting point, not a hard constraint.</p>
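<p>For illustration, constructing such a request against the default endpoint with the Python standard library might look like this (request construction only; nothing is sent):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
import urllib.request

def build_completion_request(prompt: str,
                             model: str = "qwen2.5:3b",
                             endpoint: str = "http://127.0.0.1:11434/v1/completions"):
    """Hypothetical sketch of the HTTP call to a local
    OpenAI-compatible completions endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "temperature": 0}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
</code></pre></div>
<p>Swapping in a remote API is then a matter of changing <code>endpoint</code> and adding an auth header.</p>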
<hr>
<h2 id="what-it-cannot-do">What It Cannot Do</h2>
<p>Renaming is a classification problem disguised as a text generation
problem. The tool works well when documents have standard structure —
title on page one, date near the header or footer, document type
identifiable from a few keywords. It works less well for documents that
are structurally atypical: a hand-written letter scanned at poor
resolution, a PDF that is essentially a single large image, a document
in a language the model handles badly.</p>
<p>The heuristic fallback means that even when the LLM produces a bad
result, the file gets a usable if imperfect name rather than a broken
one. And the undo log means that a bad batch run can be reversed. These
are not complete solutions to the hard cases, but they are the right
design response to a tool that handles real-world document noise.</p>
<p>The harder limit is semantic: the tool can tell you that a document is
an invoice and extract its date and vendor name. It cannot tell you
whether the invoice has been paid, whether it matches a purchase order,
or whether the amount is correct. For those questions, renaming is just
the first step in a longer pipeline.</p>
<hr>
<p><em>The repository is at
<a href="https://github.com/sebastianspicker/AI-PDF-Renamer">github.com/sebastianspicker/AI-PDF-Renamer</a>.
The tokenisation background in the extraction and budgeting sections
connects to the <a href="/posts/strawberry-tokenisation/">strawberry tokenisation post</a>
and the <a href="/posts/more-context-not-always-better/">context window post</a>.</em></p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-04-02</strong>: Corrected the default model name from <code>qwen3:8b</code> to <code>qwen2.5:3b</code>. The codebase default is <code>qwen2.5:3b</code> (apple-silicon preset) or <code>qwen2.5:7b-instruct</code> (gpu preset).</li>
<li><strong>2026-04-02</strong>: Corrected <code>DEFAULT_MAX_CONTENT_TOKENS</code> description from &ldquo;28,000 characters &hellip; roughly 7,000 tokens&rdquo; to &ldquo;28,000 tokens.&rdquo; The variable is a token limit, not a character limit.</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
