In June 2025 I put together a practical guide on AI-assisted transcription for professors of music pedagogy at HfMT Köln — primarily a hands-on introduction to aTrain and noScribe. This post is the methodological companion to that guide: the stuff I could not fit into a workshop handout but that I think matters more than the installation instructions.


The Seduction

AI transcription tools have reached a point where, for clean audio of a single speaker in a quiet room, the output is genuinely good. You load a 90-minute interview, click a button, wait roughly 20 minutes, and get a readable transcript with timestamps and speaker labels. In transcript-hours, that is an order of magnitude faster than manual transcription. The appeal is obvious, especially if you are a qualitative researcher working with a backlog of interview recordings.

The two tools I have been evaluating — aTrain (developed at University of Graz) and noScribe (an independent open-source project) — both run entirely locally on your machine. No audio file is uploaded anywhere. No cloud API is involved. This matters for interview research: you are handling other people’s speech, often on topics they regard as sensitive, and the GDPR landscape for sending recordings to external servers is genuinely complicated. Local processing sidesteps that problem entirely.

Both tools are built on OpenAI’s Whisper model, which is — despite the name — open-source and runs offline. They differ in interface philosophy, feature depth, and what methodological commitments they make visible.

But the seduction is the problem. The speed and cleanliness of the output make it easy to treat the transcript as a neutral record rather than as a construction. It is not. Every transcription is an act of interpretation. An AI transcription is an act of interpretation performed by an algorithm that does not know what your research question is.


Why This Is a Grounded Theory Problem Specifically

In grounded theory — whether you follow the Strauss and Corbin tradition or the constructivist reformulation by Charmaz — the researcher is not a passive recorder of data. The analytical process begins with the first moment of contact with the material. Coding, memo-writing, constant comparison, and theoretical sampling all assume that you are working with data that you have genuinely engaged with and that reflects choices made with your research question in mind.

Transcription is the first of those choices. What counts as a pause? Do you mark hesitations and self-corrections? Do you capture overlapping speech? Do you note emphasis, speed changes, or trailing-off? The answers to these questions are not neutral. They are determined by what level of analysis you intend. A thematic analysis of interview content needs something different from a conversation analysis of turn-taking, which needs something different from a discourse analysis attending to hedges and disfluencies.

When you transcribe manually, you make these choices explicitly or implicitly, but you make them. When you delegate to an algorithm, the algorithm makes them — according to its training data and its default settings — and then presents you with output that looks authoritative.

The risk is not that AI transcription is inaccurate (though it sometimes is). The risk is that it is selectively accurate in ways you did not choose and that those choices shape what you subsequently see in the data.


What the Tools Actually Do

aTrain

aTrain is the simpler of the two. Windows-native (Microsoft Store), with a macOS beta for Apple Silicon. The interface has essentially one meaningful decision point after you load your file: whether to activate speaker detection. Everything else is handled automatically. Output formats are plain text with timestamps, SRT subtitle files, and — most useful for researchers — direct QDA exports for MAXQDA, ATLAS.ti, and NVivo with synchronised audio-timestamp links.
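Because the SRT output is plain text, it is easy to post-process before or after coding. A minimal parsing sketch in Python (SRT is a standard subtitle format, not specific to aTrain; the field names here are my own):

```python
import re

# Each SRT block: an index line, a "start --> end" timing line, then the text.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text: str) -> list[dict]:
    """Parse SRT text into a list of segment dicts."""
    return [
        {"index": int(i), "start": start, "end": end, "text": body.strip()}
        for i, start, end, body in SRT_BLOCK.findall(text)
    ]
```

From here the segments can be searched, filtered, or joined into a plain transcript for coding.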

What aTrain does not do: it does not mark pauses. It does not detect disfluencies (the ähms, uhs, self-interruptions, false starts). It does not detect overlapping speech. It produces clean, semantically coherent transcripts — which means it actively smooths what you gave it. If a speaker says “well — I mean — it was, I think it was more like — yeah, complicated”, aTrain will probably give you something closer to “I think it was complicated”. The hesitation structure disappears.

For a thematic interview study where you are interested in what people said about a topic, this is probably fine. For any analysis where how something was said is part of the data — pace, repair, emphasis, epistemic hedging — aTrain is erasing data you need.

noScribe

noScribe is more complex in almost every dimension. Available for Windows, macOS (including Apple Silicon and Intel), and Linux. The interface exposes a meaningful number of configuration decisions:

  • Mark Pause: off, or marked at 1-, 2-, or 3-second thresholds, with conventional notation (.), (..), (...), (10 seconds pause)
  • Speaker Detection: automatic count, fixed count, or disabled
  • Overlapping Speech: experimental detection, marked with //double slash//
  • Disfluencies: off or on — captures ähm, äh, self-corrections, false starts
  • Timestamps: by speaker turn or every 60 seconds
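To make the pause-notation convention concrete, here is a small Python sketch of the mapping described above. This is illustrative only: noScribe implements this internally, and the function and its exact cut-offs are my own reconstruction of the convention, not noScribe code.

```python
def pause_marker(seconds: float, threshold: float = 1.0) -> str:
    """Map a silence duration to the conventional pause notation.

    Hypothetical reconstruction of the convention, not noScribe's code.
    Pauses below the configured threshold are not marked at all.
    """
    if seconds < threshold:
        return ""
    if seconds < 2:
        return "(.)"
    if seconds < 3:
        return "(..)"
    if seconds < 10:
        return "(...)"
    return f"({round(seconds)} seconds pause)"
```

Raising the threshold from 1 to 2 or 3 seconds, as the interface allows, simply suppresses the shorter markers.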

It also has an integrated editor (noScribeEdit) with synchronised audio playback: click anywhere in the transcript and the audio seeks to that position. This is the single most useful feature for post-transcription review, and aTrain does not have anything equivalent.

The configuration complexity is not gratuitous. It reflects the fact that different methodological frameworks require different transcription conventions. noScribe’s disfluency detection corresponds roughly to what a GAT2-Light transcription requires. Its pause notation system maps onto conversation analytic conventions. The choices you make in the interface are methodological choices, not just technical preferences.


The Normalisation Problem

Both tools perform what I would call normalisation: they produce transcripts that read more fluently than the original speech. This is a feature from a usability standpoint and a methodological liability from a qualitative research standpoint.

Specific failure modes I observed in evaluation:

Compound word errors (more pronounced in noScribe for German): VR-Brille (“VR headset”) transcribed as Brille VR, proper nouns mangled, domain vocabulary rendered phonetically. In music research contexts this is particularly salient — instrument names, notation terms, composer names, and genre vocabulary are all potential failure points.

Speaker detection overcounting: both tools, when speaker detection is active, tend to identify more speakers than are present. A two-person interview with one hesitant speaker may generate three or four speaker labels. Manual correction is required.
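Once you have decided which labels belong together, applying the correction is mechanical. A hedged sketch (the turn structure and the SPEAKER_NN labels are assumptions for illustration, not either tool's actual output format):

```python
def merge_speakers(turns, mapping):
    """Relabel spurious speakers and fold adjacent same-speaker turns.

    turns:   list of (speaker_label, text) pairs in transcript order
    mapping: manually decided corrections, e.g. {"SPEAKER_03": "SPEAKER_01"}
    """
    merged = []
    for speaker, text in turns:
        speaker = mapping.get(speaker, speaker)
        # fold consecutive turns by the same (corrected) speaker together
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged
```

The decision about which labels to merge stays with the researcher, made while listening; the code only applies it consistently.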

Acoustic transcription: noScribe occasionally produces what my original guide calls lautliche Transkriptionen — phonetic renderings rather than semantic ones. A speaker saying Beamer (data projector) may be transcribed as Bima. This is not an error in the conventional sense; it is the model accurately representing what it heard acoustically rather than resolving it semantically. For music researchers studying how non-specialist participants talk about technical equipment, this is interesting. For most interview research, it requires correction.

Pause and overlap reliability degrades with audio quality: both tools perform well on clean, close-mic mono recordings of single speakers in quiet rooms. Introduce a second speaker, ambient noise, variable recording distance, or a phone recording, and accuracy drops substantially. This matters specifically for music interview research, where the interview setting is often a rehearsal room or performance space rather than an acoustic booth.


A Methodological Comparison, Not a Feature List

The useful comparison between aTrain and noScribe is not technical — it is about which methodological contexts each is suited to.

Research context | Tool | Why
Thematic/content analysis, single speaker | aTrain | Speed, simplicity, adequate accuracy, QDA export
Grounded theory with attention to epistemic hedging | noScribe + disfluencies | Captures the hesitation structure that carries methodological information
Conversation analysis | Neither, or noScribe as a starting point | CA requires phonetic detail neither tool reliably produces
Large corpus, initial open coding | aTrain | Volume and speed outweigh detail at early stages
Interpretive phenomenological analysis | noScribe | The pause and disfluency data is IPA-relevant
Teaching transcription as a research practice | Both | See below

The last row deserves its own section.


Using Both Tools to Teach About Transcription

The most pedagogically valuable use of these tools is probably not producing transcripts — it is using them to make the constructed nature of transcripts visible to students.

A simple exercise: take a three-minute excerpt of an interview recording. Have students transcribe it manually according to whatever convention the course uses. Then run the same excerpt through aTrain and noScribe with different settings. Compare the three or four resulting transcripts in a seminar discussion.

The differences that emerge are not about which transcript is “correct”. They are about what each transcript makes visible and what it hides. The aTrain transcript will be clean and readable. The manually produced transcript will have annotation that the students chose based on what struck them as relevant. The noScribe transcript with disfluencies enabled will look noisy. All three are representations of the same three minutes of speech.

Questions that come out of this reliably: Why did the student who transcribed manually mark that particular pause? What did the student not mark that the software did? What did the software produce that the student did not hear? What does the “cleaner” transcript lose?

This is the entry point to a genuinely grounded theory-relevant conversation about data construction: the transcript is not the data. The transcript is a representation of the data made according to principles that should be theoretically motivated, and those principles should be stated explicitly in the methods section.


What These Tools Cannot Replace

The document I prepared for the HfMT professors ends with a sentence I want to quote directly from the German, because it is the methodological core of the whole thing:

“Automatisierung ersetzt nicht das Nachdenken über Daten.” Automation does not replace thinking about data.

More precisely: the algorithm makes decisions about what counts as a pause, what counts as language, whose voice counts as a separate speaker — without knowing what is scientifically relevant. It does not know that the half-second hesitation before a particular word is the most important moment in the interview. It does not know that the overlapping “mm-hm” is a data point for your analysis of how the interviewee manages discomfort. It does not know that the repeated self-correction in the middle of a sentence about teaching practice is where your emerging category is.

You have to know that. And you only know it if you have been in enough contact with the material to have developed theoretical sensitivity — which is exactly what Strauss and Corbin mean when they describe the iterative relationship between data collection, coding, and theoretical development in grounded theory.

AI transcription tools save the hours of typing. They do not and cannot substitute for the analytical engagement that makes a grounded theory study produce knowledge rather than a theme list.

Use them. But use them knowing what they are doing.


Practical Summary

  • aTrain: one-click, local, GDPR-compliant, good QDA integration, appropriate for thematic analysis. No disfluencies, no pauses, no overlap detection. Versions: Windows (Microsoft Store), macOS beta. Current version: 1.3.1.
  • noScribe: more complex, highly configurable, disfluency and pause detection, integrated audio-sync editor, appropriate for grounded theory and discourse-oriented work. More demanding to set up. Current version: 0.6.2.
  • Neither tool is appropriate as a black-box solution for conversation analysis or prosodic research.
  • Both tools require manual post-processing. Estimate correction time at roughly 20–40% of the original interview length for clean recordings with a single speaker; more for multi-speaker or suboptimal audio.
  • In teaching: the exercise of comparing manual, aTrain, and noScribe transcripts of the same excerpt is more pedagogically valuable than any of the transcripts individually.
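The 20–40% rule of thumb is easy to turn into a planning estimate. A trivial sketch (the percentages are the rough figures from this post, not a benchmark):

```python
def correction_estimate(audio_minutes: float,
                        low: float = 0.20, high: float = 0.40) -> tuple:
    """Return a (low, high) range of expected correction time in minutes."""
    return (audio_minutes * low, audio_minutes * high)

# A clean, single-speaker 90-minute interview:
lo, hi = correction_estimate(90)
print(f"Expect roughly {lo:.0f}-{hi:.0f} minutes of manual correction.")
```

For multi-speaker or noisy recordings the percentages climb, so treat the upper bound as optimistic rather than conservative.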

References

Charmaz, K. (2014). Constructing Grounded Theory (2nd ed.). SAGE Publications.

Dresing, T. & Pehl, T. (2018). Praxisbuch Interview, Transkription & Analyse (8th ed.). Self-published. https://www.audiotranskription.de

Haberl, A., Fleiß, J., Kowald, D., & Thalmann, S. (2024). Take the aTrain. Introducing an interface for the accessible transcription of interviews. Journal of Behavioral and Experimental Finance, 41, 100891. https://doi.org/10.1016/j.jbef.2024.100891

Dröge, K. (2023). noScribe [Computer software]. https://github.com/kaixxx/noScribe

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356. https://arxiv.org/abs/2212.04356

Strauss, A. & Corbin, J. (1998). Basics of Qualitative Research (2nd ed.). SAGE Publications.


Changelog

  • 2026-01-20: Updated the aTrain reference to the published form: Haberl, A., Fleiß, J., Kowald, D., & Thalmann, S. (2024), “Take the aTrain. Introducing an interface for the accessible transcription of interviews.”