<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Transcription on Sebastian Spicker</title>
    <link>https://sebastianspicker.github.io/tags/transcription/</link>
    <description>Recent content in Transcription on Sebastian Spicker</description>
    <image>
      <title>Sebastian Spicker</title>
      <url>https://sebastianspicker.github.io/og-image.png</url>
      <link>https://sebastianspicker.github.io/og-image.png</link>
    </image>
    <generator>Hugo -- 0.160.0</generator>
    <language>en</language>
    <lastBuildDate>Tue, 10 Jun 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://sebastianspicker.github.io/tags/transcription/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Your Transcript Is Already an Interpretation: AI Transcription and Grounded Theory</title>
      <link>https://sebastianspicker.github.io/posts/ai-transcription-grounded-theory/</link>
      <pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://sebastianspicker.github.io/posts/ai-transcription-grounded-theory/</guid>
      <description>aTrain and noScribe are local, GDPR-compliant, Whisper-based transcription tools that can genuinely save hours of work in qualitative interview research. They also make methodological decisions on your behalf without telling you. If you do grounded theory, you need to know which decisions those are.</description>
      <content:encoded><![CDATA[<p><em>In June 2025 I put together a practical guide on AI-assisted transcription
for professors of music pedagogy at HfMT Köln — primarily a hands-on
introduction to aTrain and noScribe. This post is the methodological
companion to that guide: the stuff I could not fit into a workshop handout
but that I think matters more than the installation instructions.</em></p>
<hr>
<h2 id="the-seduction">The Seduction</h2>
<p>AI transcription tools have reached a point where, for clean audio of a
single speaker in a quiet room, the output is genuinely good. You load a
90-minute interview, click a button, wait roughly 20 minutes, and get a
readable transcript with timestamps and speaker labels. Measured in working
hours, that is an order of magnitude faster than manual transcription. The appeal is
obvious, especially if you are a qualitative researcher working with a backlog
of interview recordings.</p>
<p>The two tools I have been evaluating — <strong>aTrain</strong> (developed at the
University of Graz) and <strong>noScribe</strong> (an independent open-source project) — both run
entirely locally on your machine. No audio file is uploaded anywhere. No
cloud API is involved. This matters for interview research: you are handling
other people&rsquo;s speech, often on topics they regard as sensitive, and the
GDPR landscape for sending recordings to external servers is genuinely
complicated. Local processing sidesteps that problem entirely.</p>
<p>Both tools are built on <strong>OpenAI&rsquo;s Whisper model</strong>, which is — despite the
name — open-source and runs offline. They differ in interface philosophy,
feature depth, and what methodological commitments they make visible.</p>
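<p>For orientation, this is roughly what the shared engine looks like when
called directly. A minimal sketch using the open-source <code>whisper</code>
Python package, not aTrain&rsquo;s or noScribe&rsquo;s actual pipeline (both tools
layer their own processing on top of the model), with <code>interview.wav</code>
as a placeholder file name:</p>
<pre><code class="language-python"># Minimal sketch: local, offline transcription with the reference
# "whisper" package. Illustrative only; not either tool's pipeline.
import whisper

# The model is downloaded once; after that, everything runs offline.
model = whisper.load_model("medium")
result = model.transcribe("interview.wav", language="de")

# Whisper returns segments with start/end times in seconds.
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s] {seg['text'].strip()}")
</code></pre>
<p>Everything discussed below, speaker labels, pause marks, QDA export, is
layered on top of segment output like this.</p>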
<p>But the seduction is the problem. The speed and cleanliness of the output
make it easy to treat the transcript as a neutral record rather than as a
construction. It is not. Every transcription is an act of interpretation. An
AI transcription is an act of interpretation performed by an algorithm that
does not know what your research question is.</p>
<hr>
<h2 id="why-this-is-a-grounded-theory-problem-specifically">Why This Is a Grounded Theory Problem Specifically</h2>
<p>In grounded theory — whether you follow the Strauss and Corbin tradition or
the constructivist reformulation by Charmaz — the researcher is not a passive
recorder of data. The analytical process begins with the first moment of
contact with the material. Coding, memo-writing, constant comparison, and
theoretical sampling all assume that you are working with data that you have
genuinely engaged with and that reflects choices made with your research
question in mind.</p>
<p>Transcription is the first of those choices. What counts as a pause? Do you
mark hesitations and self-corrections? Do you capture overlapping speech? Do
you note emphasis, speed changes, or trailing-off? The answers to these
questions are not neutral. They are determined by what level of analysis you
intend. A thematic analysis of interview content needs something different
from a conversation analysis of turn-taking, which needs something different
from a discourse analysis attending to hedges and disfluencies.</p>
<p>When you transcribe manually, you make these choices explicitly or
implicitly, but you make them. When you delegate to an algorithm, the
algorithm makes them — according to its training data and its default
settings — and then presents you with output that looks authoritative.</p>
<p>The risk is not that AI transcription is inaccurate (though it sometimes is).
The risk is that it is <em>selectively accurate in ways you did not choose</em> and
that those choices shape what you subsequently see in the data.</p>
<hr>
<h2 id="what-the-tools-actually-do">What the Tools Actually Do</h2>
<h3 id="atrain">aTrain</h3>
<p>aTrain is the simpler of the two. Windows-native (Microsoft Store), with a
macOS beta for Apple Silicon. The interface has essentially one meaningful
decision point after you load your file: whether to activate speaker
detection. Everything else is handled automatically. Output formats are plain
text with timestamps, SRT subtitle files, and — most useful for researchers —
direct QDA exports for MAXQDA, ATLAS.ti, and NVivo with synchronised
audio-timestamp links.</p>
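<p>For readers who have not handled subtitle files: SRT is a plain-text format
of numbered cues with millisecond time ranges. The cue text below is invented
for illustration:</p>
<pre><code>1
00:00:00,000 --&gt; 00:00:04,200
Well, I think it was more like, yeah, complicated.

2
00:00:04,200 --&gt; 00:00:07,900
Can you say a bit more about that?
</code></pre>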
<p>What aTrain does not do: it does not mark pauses. It does not detect
disfluencies (the <em>ähms</em>, <em>uhs</em>, self-interruptions, false starts). It does
not detect overlapping speech. It produces clean, semantically coherent
transcripts — which means it actively smooths what you gave it. If a
speaker says <em>&ldquo;well — I mean — it was, I think it was more like — yeah,
complicated&rdquo;</em>, aTrain will probably give you something closer to <em>&ldquo;I think it
was complicated&rdquo;</em>. The hesitation structure disappears.</p>
<p>For a thematic interview study where you are interested in what people said
about a topic, this is probably fine. For any analysis where <em>how</em> something
was said is part of the data — pace, repair, emphasis, epistemic hedging —
aTrain is erasing data you need.</p>
<h3 id="noscribe">noScribe</h3>
<p>noScribe is more complex in almost every dimension. Available for Windows,
macOS (including Apple Silicon and Intel), and Linux. The interface exposes
a meaningful number of configuration decisions:</p>
<ul>
<li><strong>Mark Pause</strong>: off, or marked at 1-, 2-, or 3-second thresholds, with
conventional notation <code>(.)</code>, <code>(..)</code>, <code>(...)</code>, <code>(10 seconds pause)</code></li>
<li><strong>Speaker Detection</strong>: automatic count, fixed count, or disabled</li>
<li><strong>Overlapping Speech</strong>: experimental detection, marked with <code>//double slash//</code></li>
<li><strong>Disfluencies</strong>: off or on — captures <em>ähm</em>, <em>äh</em>, self-corrections,
false starts</li>
<li><strong>Timestamps</strong>: by speaker turn or every 60 seconds</li>
</ul>
<p>It also has an integrated editor (noScribeEdit) with synchronised audio
playback: click anywhere in the transcript and the audio seeks to that
position. This is the single most useful feature for post-transcription
review, and aTrain does not have anything equivalent.</p>
<p>The configuration complexity is not gratuitous. It reflects the fact that
different methodological frameworks require different transcription
conventions. noScribe&rsquo;s disfluency detection corresponds roughly to what a
GAT2-Light transcription requires. Its pause notation system maps onto
conversation analytic conventions. The choices you make in the interface are
methodological choices, not just technical preferences.</p>
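<p>The pause thresholds are mechanically simple, which is worth seeing once.
Here is a minimal sketch of threshold-based pause marking derived from
Whisper-style segment timestamps; it illustrates the principle, it is not
noScribe&rsquo;s actual implementation, and the 4-second cutoff for spelled-out
pauses is an assumption:</p>
<pre><code class="language-python"># Illustration of threshold-based pause notation from segment gaps.
# Not noScribe's implementation; markers follow the conventions above,
# and the 4-second cutoff for spelled-out pauses is an assumption.

def pause_marker(gap: float) -&gt; str:
    """Map a silence gap in seconds to a conventional pause marker."""
    if gap &gt;= 4.0:
        return f"({round(gap)} seconds pause)"
    if gap &gt;= 3.0:
        return "(...)"
    if gap &gt;= 2.0:
        return "(..)"
    if gap &gt;= 1.0:
        return "(.)"
    return ""

def annotate(segments: list[dict]) -&gt; str:
    """Interleave pause markers between Whisper-style segments."""
    parts: list[str] = []
    prev_end = None
    for seg in segments:
        if prev_end is not None:
            marker = pause_marker(seg["start"] - prev_end)
            if marker:
                parts.append(marker)
        parts.append(seg["text"].strip())
        prev_end = seg["end"]
    return " ".join(parts)
</code></pre>
<p>The methodological point survives any implementation detail: a threshold is
a decision. A 1-second threshold marks pauses that a 2-second threshold
silently drops, and whoever sets the default has made that decision for you.</p>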
<hr>
<h2 id="the-normalisation-problem">The Normalisation Problem</h2>
<p>Both tools perform what I would call <em>normalisation</em>: they produce transcripts
that read more fluently than the original speech. This is a feature from a
usability standpoint and a methodological liability from a qualitative
research standpoint.</p>
<p>Specific failure modes I observed in evaluation:</p>
<p><strong>Compound word errors</strong> (more pronounced in noScribe for German): <em>VR-Brille</em>
(&ldquo;VR headset&rdquo;) transcribed as <em>Brille VR</em>, proper nouns mangled, domain
vocabulary rendered phonetically. In music research contexts this is
particularly salient — instrument names, notation terms, composer names, and
genre vocabulary are all potential failure points.</p>
<p><strong>Speaker detection overcounting</strong>: both tools, when speaker detection is
active, tend to identify more speakers than are present. A two-person
interview with one hesitant speaker may generate three or four speaker labels.
Manual correction is required.</p>
<p><strong>Acoustic transcription</strong>: noScribe occasionally produces what the document
calls <em>lautliche Transkriptionen</em> — phonetic renderings rather than semantic
ones. A speaker saying <em>Beamer</em> (data projector) may be transcribed as <em>Bima</em>.
This is not an error in the conventional sense; it is the model accurately
representing what it heard acoustically rather than semantically resolving it.
For music researchers studying how non-specialist participants talk about
technical equipment, this is interesting. For most interview research, it
requires correction.</p>
<p><strong>Pause and overlap reliability degrades with audio quality</strong>: both tools
perform well on clean, close-mic mono recordings of single speakers in quiet
rooms. Introduce a second speaker, ambient noise, variable recording distance,
or a phone recording, and accuracy drops substantially. This matters
specifically for music interview research, where the interview setting is
often a rehearsal room or performance space rather than an acoustic booth.</p>
<hr>
<h2 id="a-methodological-comparison-not-a-feature-list">A Methodological Comparison, Not a Feature List</h2>
<p>The useful comparison between aTrain and noScribe is not technical — it is
about which methodological contexts each is suited to.</p>
<table>
  <thead>
      <tr>
          <th>Research context</th>
          <th>Tool</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Thematic/content analysis, single speaker</td>
          <td>aTrain</td>
          <td>Speed, simplicity, adequate accuracy, QDA export</td>
      </tr>
      <tr>
          <td>Grounded theory with attention to epistemic hedging</td>
          <td>noScribe + disfluencies</td>
          <td>Captures the hesitation structure that carries methodological information</td>
      </tr>
      <tr>
          <td>Conversation analysis</td>
          <td>Neither, or noScribe as starting point</td>
          <td>CA requires phonetic detail neither tool reliably produces</td>
      </tr>
      <tr>
          <td>Large corpus, initial open coding</td>
          <td>aTrain</td>
          <td>Volume and speed outweigh detail at early stages</td>
      </tr>
      <tr>
          <td>Interpretive phenomenological analysis</td>
          <td>noScribe</td>
          <td>The pause and disfluency data is IPA-relevant</td>
      </tr>
      <tr>
          <td>Teaching transcription as a research practice</td>
          <td>Both</td>
          <td><em>See below</em></td>
      </tr>
  </tbody>
</table>
<p>The last row deserves its own section.</p>
<hr>
<h2 id="using-both-tools-to-teach-about-transcription">Using Both Tools to Teach About Transcription</h2>
<p>The most pedagogically valuable use of these tools is probably not producing
transcripts — it is using them to make the constructed nature of transcripts
visible to students.</p>
<p>A simple exercise: take a three-minute excerpt of an interview recording.
Have students transcribe it manually according to whatever convention the
course uses. Then run the same excerpt through aTrain and noScribe with
different settings. Compare the three or four resulting transcripts in a
seminar discussion.</p>
<p>The differences that emerge are not about which transcript is &ldquo;correct&rdquo;. They
are about what each transcript makes visible and what it hides. The aTrain
transcript will be clean and readable. The manually produced transcript will
have the annotations that the students chose based on what struck them as relevant.
The noScribe transcript with disfluencies enabled will look noisy. All three
are representations of the same three minutes of speech.</p>
<p>Questions that reliably emerge from this exercise: Why did the student who transcribed
manually mark that particular pause? What did the student not mark that the
software did? What did the software produce that the student did not hear?
What does the &ldquo;cleaner&rdquo; transcript lose?</p>
<p>This is the entry point to a genuinely grounded theory-relevant conversation
about data construction: the transcript is not the data. The transcript is a
representation of the data made according to principles that should be
theoretically motivated, and those principles should be stated explicitly in
the methods section.</p>
<hr>
<h2 id="what-these-tools-cannot-replace">What These Tools Cannot Replace</h2>
<p>The document I prepared for the HfMT professors ends with a sentence I want
to quote directly from the German, because it is the methodological core of
the whole thing:</p>
<blockquote>
<p><em>Automatisierung ersetzt nicht das Nachdenken über Daten.</em>
Automation does not replace thinking about data.</p>
</blockquote>
<p>More precisely: the algorithm makes decisions about what counts as a pause,
what counts as language, whose voice counts as a separate speaker — without
knowing what is scientifically relevant. It does not know that the half-second
hesitation before a particular word is the most important moment in the
interview. It does not know that the overlapping &ldquo;mm-hm&rdquo; is a data point for
your analysis of how the interviewee manages discomfort. It does not know
that the repeated self-correction in the middle of a sentence about teaching
practice is where your emerging category is.</p>
<p>You have to know that. And you only know it if you have been in enough
contact with the material to have developed theoretical sensitivity — which is
exactly what Strauss and Corbin mean when they describe the iterative
relationship between data collection, coding, and theoretical development in
grounded theory.</p>
<p>AI transcription tools save the hours of typing. They do not and cannot
substitute for the analytical engagement that makes a grounded theory study
produce knowledge rather than a theme list.</p>
<p>Use them. But use them knowing what they are doing.</p>
<hr>
<h2 id="practical-summary">Practical Summary</h2>
<ul>
<li><strong>aTrain</strong>: one-click, local, GDPR-compliant, good QDA integration,
appropriate for thematic analysis. No disfluencies, no pauses, no
overlap detection. Versions: Windows (Microsoft Store), macOS beta.
Current version: 1.3.1.</li>
<li><strong>noScribe</strong>: more complex, highly configurable, disfluency and pause
detection, integrated audio-sync editor, appropriate for grounded theory
and discourse-oriented work. More demanding to set up. Current version:
0.6.2.</li>
<li><strong>Neither tool</strong> is appropriate as a black-box solution for conversation
analysis or prosodic research.</li>
<li><strong>Both tools</strong> require manual post-processing. Estimate correction time
at roughly 20–40% of the original interview length for clean recordings
with a single speaker; more for multi-speaker or suboptimal audio.</li>
<li><strong>In teaching</strong>: the exercise of comparing manual, aTrain, and noScribe
transcripts of the same excerpt is more pedagogically valuable than any
of the transcripts individually.</li>
</ul>
<hr>
<h2 id="references">References</h2>
<p>Charmaz, K. (2014). <em>Constructing Grounded Theory</em> (2nd ed.).
SAGE Publications.</p>
<p>Dresing, T. &amp; Pehl, T. (2018). <em>Praxisbuch Interview, Transkription &amp;
Analyse</em> (8th ed.). Eigenverlag. <a href="https://www.audiotranskription.de">https://www.audiotranskription.de</a></p>
<p>Haberl, A., Fleiß, J., Kowald, D., &amp; Thalmann, S. (2024). Take the aTrain.
Introducing an interface for the accessible transcription of interviews.
<em>Journal of Behavioral and Experimental Finance</em>, 41, 100891.
<a href="https://doi.org/10.1016/j.jbef.2024.100891">https://doi.org/10.1016/j.jbef.2024.100891</a></p>
<p>Dröge, K. (2023). noScribe [Computer software].
<a href="https://github.com/kaixxx/noScribe">https://github.com/kaixxx/noScribe</a></p>
<p>Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., &amp; Sutskever, I.
(2022). Robust speech recognition via large-scale weak supervision.
arXiv preprint arXiv:2212.04356. <a href="https://arxiv.org/abs/2212.04356">https://arxiv.org/abs/2212.04356</a></p>
<p>Strauss, A. &amp; Corbin, J. (1998). <em>Basics of Qualitative Research</em>
(2nd ed.). SAGE Publications.</p>
<hr>
<h2 id="changelog">Changelog</h2>
<ul>
<li><strong>2026-01-20</strong>: Updated the aTrain reference to the published form: Haberl, A., Fleiß, J., Kowald, D., &amp; Thalmann, S. (2024), &ldquo;Take the aTrain. Introducing an interface for the accessible transcription of interviews.&rdquo;</li>
</ul>
]]></content:encoded>
    </item>
  </channel>
</rss>
