Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Child speech is one of those persistent thorns in ASR research that nobody has fully solved, and for good reason. Children's vocal tracts are physically different (shorter and still developing), which means their formant frequencies, speaking rates, and prosodic patterns diverge substantially from the adult speech that dominates training corpora. Young speakers are also inherently less consistent: they hesitate, restart mid-word, produce non-standard pronunciations, and generally behave like people still learning to coordinate their articulators. Now place all of this in Dutch, a language the paper treats as low-resource and where even adult ASR models have less data to draw from than their English counterparts, and the problem compounds. This paper takes a pragmatic and frankly refreshing approach by not trying to build a better model from scratch. Instead, it asks a more operational question: given what we have, how well does it actually work, and can we trust the output enough to skip human review on some portion of it?

The first finding, that fine-tuned Whisper-medium dominates, is unsurprising to anyone who has watched Whisper's trajectory. OpenAI released a model trained on 680,000 hours of multilingual web-scraped audio, and despite its well-documented hallucination problems and tendency toward confident nonsense, it has a breadth of acoustic exposure that models trained on narrower data struggle to match. What the fine-tuning result tells us is something more nuanced: Whisper's pre-training gives it enough generalization capacity that even a relatively modest amount of domain-specific data can shift it meaningfully toward child speech. The Parakeet and Wav2Vec2 families, while competent, lack that broad foundational exposure. This reinforces a trend visible across the field: massive pre-training followed by targeted adaptation is winning over purpose-built smaller models, even in specialized domains.

But the real intellectual contribution here lies in the second research question. The utterance-level selection method is elegantly simple in concept: compare what the ASR system heard against what the child was supposed to say (the read prompt), and select the utterances where the two match. If a child is reading a sentence aloud and the ASR output aligns closely with the original text, you have circumstantial evidence that the sentence was read as written and pronounced in a sufficiently standard way. The precision figures, 98.3% and higher, are striking. That means in at least 98 out of 100 cases where the system says "this is a clean, correctly pronounced utterance," it is right. For corpus linguists and developmental researchers, this is a meaningful efficiency gain, because those utterances are candidates for skipping manual review.
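To make the selection idea concrete, here is a minimal sketch in Python. It assumes an exact match after simple text normalization; the paper's actual matching criterion and normalization rules are not given here, so the function names, the normalization steps, and the example sentence are all illustrative assumptions rather than the authors' implementation.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words.

    The normalization rules are an assumption made for this sketch.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def matches_prompt(prompt: str, hypothesis: str) -> bool:
    """Select an utterance only if the ASR output equals the read prompt.

    Exact equality is assumed here; a real pipeline might tolerate
    small differences.
    """
    return normalize(prompt) == normalize(hypothesis)


def precision(selected: list[bool], truly_correct: list[bool]) -> float:
    """Share of selected utterances that a human also judged correct."""
    kept = [ok for sel, ok in zip(selected, truly_correct) if sel]
    if not kept:
        raise ValueError("no utterances were selected")
    return sum(kept) / len(kept)


# Hypothetical usage: the prompt and ASR outputs are made up.
prompt = "De kat zit op de mat."
print(matches_prompt(prompt, "de kat zit op de mat"))  # selected
print(matches_prompt(prompt, "de kat zit op mat"))     # rejected
```

Precision here is computed only over the selected utterances, which is why a filter can score very high on precision while still discarding most of the data.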

What concerns me, though, is what gets left behind. The selection method filters conservatively. On the JASMIN dataset, 58% of utterances fail the filter, so only 42% are retained. On DART, 81.9% are rejected, leaving 18.1%. That is a lot of data on the cutting room floor, and there is an inherent bias in what survives: utterances where children pronounce things in standard, predictable ways. Children who speak with regional accents, who have speech-language disorders, or who simply talk differently from the majority are exactly the populations that speech research often struggles to include. A filtering method that preferentially retains "normal" pronunciations risks creating a selection bias that quietly narrows the linguistic diversity of the resulting corpus. The paper does not adequately address this, and it should be front and center in any discussion of deploying this method at scale.
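One way to check for this kind of bias would be to compare retention rates across speaker groups after filtering. The sketch below is only an illustration of that audit: the group labels and records are invented, and nothing here comes from the paper's data or analysis.

```python
from collections import defaultdict


def retention_by_group(records):
    """Return the share of utterances kept by the filter for each group.

    `records` is an iterable of (group_label, was_selected) pairs.
    The group labels are hypothetical metadata, not fields from
    JASMIN or DART.
    """
    totals = defaultdict(int)
    kept = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        kept[group] += int(selected)
    return {group: kept[group] / totals[group] for group in totals}


# Made-up records for illustration only.
records = [
    ("standard", True), ("standard", True),
    ("standard", False), ("standard", True),
    ("regional", True), ("regional", False),
    ("regional", False), ("regional", False),
]
for group, rate in retention_by_group(records).items():
    print(f"{group}: {rate:.0%} retained")
```

A large gap between groups would suggest that the filtered corpus under-represents some speakers, even if overall precision looks excellent.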

There is also a quieter tension worth examining. The DART recordings were apparently made in substantially noisier conditions, which is realistic: field recordings of children are messy. Yet a WER of 70.37% on DART, roughly seven word errors for every ten reference words, tells us the models are essentially failing on that data in its raw, unfiltered state. The selection method rescues some usable material, but the underlying recognition problem is not solved; it is sidestepped. If your goal is to build a clean corpus for phonological analysis, this might be acceptable. If your goal is to actually understand what children said in challenging acoustic environments, you still need better models or better data collection methods.
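For readers who want to see what a WER figure measures, the following Python sketch computes it as word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is the standard definition; the function name and example sentences are illustrative, and the paper's own scoring tool and text normalization may differ.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Made-up example: one of six reference words is dropped.
ref = "de kat zit op de mat".split()
hyp = "de kat zit op mat".split()
print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Because insertions count as errors, WER is not capped at 100%, so a very high value usually signals output that bears little relation to what was said.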

What this paper quietly demonstrates is a broader shift in how the research community thinks about ASR in specialized domains. The question is no longer simply "can we build a model that transcribes X accurately?" It is becoming "can we build a pipeline that is honest about what it gets right and what it does not, and can we use that honesty to make practical decisions about data quality?" The prompt-matching selection approach is a step toward that kind of epistemically transparent workflow. It acknowledges imperfection rather than pretending it away.

The limitation that nags at me most is generalizability. Dutch is a well-resourced language by European standards, even if it pales next to English. Whisper has Dutch in its training distribution. Applying this same pipeline to truly low-resource languages — say, a Bantu language with no dedicated ASR training data — would likely yield far worse results at every stage. The paper's framing as relevant to "low-resource languages" is slightly generous given that Dutch is spoken by roughly 25 million people and represented
