
A New Tool to Measure Machine-Generated Transcript Accuracy: The Transcript Accuracy Auditor

Context

Nearly every year for the past decade, I have run experiments examining the accuracy of automatic speech recognition (ASR): machine-generated, speech-to-text tools for transcribing oral history. Lately, AI has catapulted the quality and effectiveness of this technology, and the proliferation of available services has been swift. Although plenty of cleanup is still necessary, far less labor is now involved in authenticating machine-generated transcripts. There is a great deal of anecdotal discussion about the different services and the quality of their transcripts, but how good are the transcripts coming from services like Trint.com, Whisper, MacWhisper, Temi.com, Descript, and Otter.ai, to name just a few?

As I mentioned, I used to evaluate the quality of these transcripts annually. That evaluation was an extremely cumbersome, manual process: taking a machine-generated sample (usually about five pages), comparing it to a fully authenticated version (corrected by a human), and doing a word-by-word analysis looking for omissions and outright incorrect words. I would literally highlight mistaken words in one color, highlight omissions in a different color, and then create a metric/percentage based on the total word count. In recent years, I wrote Python scripts to calculate the Word Error Rate (WER) of transcripts based on such a comparison, but the metrics were limited and the assessment was flawed.

The Tool

So, how well (or badly) are these services performing? Between all of the different options for machine-generated transcripts, changing models, and shifting style guides, getting a trustworthy answer has been harder than it should be. So I created a tool to assess the quality and accuracy of machine-generated transcripts more effectively and easily. The free tool, called the Transcript Accuracy Auditor, produces consistent, repeatable results using measures that are transparent to archivists and oral historians, not just engineers. The tool is browser-based, but all processing occurs locally in your web browser (no files are uploaded or stored) and the output stays on your device. Your data remains your data.

Transcript Accuracy Auditor Report for Trint.com

What does the tool do?


The Transcript Accuracy Auditor compares a machine‑generated transcript to a human‑corrected reference of the same interview. It ignores speaker tags, timecodes, and common discourse markers before scoring, aligns words carefully, then reports two scores side‑by‑side: WIP (Word Information Preserved) and WER (Word Error Rate). It also counts substitutions, deletions, and insertions and generates plain‑language explanations and examples. Everything runs locally in your web browser; no files are uploaded or stored.

Two scores that work together

  • WIP — Word Information Preserved (higher is better)
    The WIP asks: “How much of the reference text’s information is preserved in the machine transcript without extra material?” It rewards matches in both directions, balancing omissions and insertions. In plain language: if the machine transcript captures most of the reference wording and doesn’t add much, WIP will be high.
  • WER — Word Error Rate (lower is better)
    WER is the long‑standing standard in speech recognition: WER = (S + D + I) ÷ N, where S=substitutions, D=deletions (missing words), I=insertions (extra words), and N=reference word count. WER is great for benchmarking models, but it penalizes insertions against reference length, even when those extras are harmless. That’s why I present WIP alongside WER.
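For readers who want to sanity-check the arithmetic, here is a minimal Python sketch of how the two scores can be computed from the same alignment counts. The function name and the example numbers are mine, for illustration only; the WIP formula shown is the commonly used definition and may differ in small details from the Auditor's internal implementation.

```python
def wer_and_wip(hits, subs, dels, ins):
    """Compute Word Error Rate and Word Information Preserved
    from alignment counts.

    hits -- words that match in reference and machine transcript
    subs -- substitutions (S)
    dels -- deletions / missing words (D)
    ins  -- insertions / extra words (I)
    """
    n_ref = hits + subs + dels   # N: words in the reference
    n_hyp = hits + subs + ins    # words in the machine transcript

    # WER charges every edit against the reference length (lower is better).
    wer = (subs + dels + ins) / n_ref

    # WIP rewards matches in both directions (higher is better): the share of
    # reference words recovered, times the share of machine-transcript words
    # that are genuine matches.
    wip = (hits / n_ref) * (hits / n_hyp)
    return wer, wip


# Example: 950 matches, 30 substitutions, 20 deletions, 15 insertions
wer, wip = wer_and_wip(hits=950, subs=30, dels=20, ins=15)
print(f"WER: {wer:.1%}  WIP: {wip:.1%}")   # WER: 6.5%  WIP: 90.7%
```

Notice how a transcript with many harmless insertions can show a worse WER while WIP stays relatively high, which is exactly why the two scores are reported together.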

What the Auditor ignores by default

To make scores meaningful for oral history practice, the tool normalizes the text before alignment:
  • Speaker labels and timecodes (e.g., “SPEAKER 2 [00:02:06]”) are removed.
  • Common discourse markers (“um,” “uh,” “mm‑hmm,” etc.) can be ignored.
  • Case, punctuation, and common typographic variants (e.g., curly quotes vs. straight) are normalized.
  • Parenthetical non‑verbal notes in the reference (e.g., “(laughs)”) can be removed so they don’t count as errors.

All defaults are visible under Configure. You can turn options off if your house style needs them kept.
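For those who like to see the idea in code, here is a rough Python sketch of this kind of normalization. The filler list and the speaker-label pattern are simplified stand-ins of my own; this illustrates the approach, not the Auditor's exact rules.

```python
import re

# Illustrative filler list; the tool's actual list may differ.
FILLERS = {"um", "uh", "uhm", "mm", "hmm", "mm-hmm"}

def normalize(text, drop_fillers=True, drop_parentheticals=True):
    # Strip speaker labels and timecodes, e.g. "SPEAKER 2 [00:02:06]"
    text = re.sub(r"SPEAKER\s*\d+\s*\[\d{2}:\d{2}:\d{2}\]", " ", text, flags=re.I)
    # Remove parenthetical non-verbal notes such as "(laughs)"
    if drop_parentheticals:
        text = re.sub(r"\([^)]*\)", " ", text)
    # Normalize typographic variants: curly quotes/apostrophes to straight
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Lowercase and strip punctuation (keep in-word apostrophes and hyphens)
    text = text.lower()
    text = re.sub(r"[^\w\s'-]", " ", text)
    words = text.split()
    if drop_fillers:
        words = [w for w in words if w not in FILLERS]
    return words

print(normalize("SPEAKER 2 [00:02:06] Um, we moved to Lexington (laughs) in 1972."))
# ['we', 'moved', 'to', 'lexington', 'in', '1972']
```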

A quick tour of the tool


1) Load transcripts (UTF‑8 .txt)
Load the human‑corrected reference transcript and the machine‑generated transcript from the service you’re evaluating. The preview panes show the raw files with original line breaks for quick visual checks.

2) Add the service/source name (Optional)
Label the machine transcript you’re evaluating (e.g., “Otter.ai,” “Descript,” “Trint,” “MacWhisper”) so the resulting report can name the service being evaluated.

3) Configure (defaults are sane)
The default settings mirror oral‑history norms: speaker/timecodes stripped, punctuation and case normalized, common fillers ignored, and parenthetical non‑verbals in the reference removed. You can also enable the “soften very common little words” option (non‑standard) that de‑emphasizes tiny shifts like “and/the/of” when the same word appears nearby on the other side. This can reduce inflated extra/missing counts caused by word‑order differences. Use it only if that fits your reporting policy.

4) Analyze
Click Analyze. You’ll get:
• WIP and WER side‑by‑side, with “Higher is better / Lower is better” reminders.
• Counts of substitutions (S), deletions (D), and insertions (I).
• “Select Examples” with highlighted differences in three tabs: Missing Words, Extra Words, and Word Swaps.
• A report you can view, print to PDF, or export as TXT (the TXT includes examples as plain text and does not have the highlights).

5) Save the report
The report includes a concise Analysis Summary, the counts, scores, and examples, plus a short Method Summary at the end so colleagues understand what was measured.

Transcript Accuracy Assessment: The Nunn Center’s SpeakEZ

Understanding the scores

  • WIP high, WER low: excellent machine transcript. Most reference content is preserved; very few missing/extras.
  • WIP high, WER higher than you expect: likely many harmless insertions (e.g., function words, small word‑order shifts) that WER counts strictly. The examples will show whether these extras matter for your use case.
  • WIP lower than expected but WER moderate: the machine may be missing key phrases even if it doesn’t hallucinate. Look at the Missing tab first.
  • Lots of “Word Swaps”: check names, numbers, and acronyms. Consider running a targeted manual pass on proper nouns and dates.


The tool tags every change as one of three types:

  • Deletion (D): a word in the reference is missing in the machine transcript.
  • Insertion (I): a word exists in the machine transcript that’s not in the reference.
  • Substitution (S): a word in the machine differs from the word in the reference at the same aligned spot.

Because the tool normalizes case, punctuation, and common fillers before alignment, scores focus on real transcript content, not formatting artifacts. (If your institution’s style guide requires punctuation to count, you can disable normalization in Configure.)
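To illustrate how these three categories fall out of a word alignment, here is a rough Python sketch using the standard library’s difflib. The Auditor performs its own word alignment, which may differ from difflib’s matching, so treat this as a conceptual illustration rather than the tool’s method.

```python
from difflib import SequenceMatcher

def count_edits(ref_words, hyp_words):
    """Count hits, substitutions, deletions, and insertions between a
    reference word list and a machine-transcript word list."""
    hits = subs = dels = ins = 0
    sm = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            hits += i2 - i1
        elif tag == "replace":
            # Overlapping span counts as substitutions; any surplus on one side
            # counts as deletions (reference longer) or insertions (machine longer).
            subs += min(i2 - i1, j2 - j1)
            dels += max(0, (i2 - i1) - (j2 - j1))
            ins  += max(0, (j2 - j1) - (i2 - i1))
        elif tag == "delete":
            dels += i2 - i1    # in the reference, missing from the machine transcript
        elif tag == "insert":
            ins += j2 - j1     # in the machine transcript, not in the reference
    return hits, subs, dels, ins

ref = "we are going to the archive on tuesday".split()
hyp = "we are gonna the archives tuesday".split()
print(count_edits(ref, hyp))   # (4, 2, 2, 0): 4 hits, 2 substitutions, 2 deletions
```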

Best practices for comparisons

  • Always evaluate against a human‑corrected reference of the same interview.
  • Export machine transcripts as UTF‑8 .txt. (If the service gives you DOCX, “Save As” plain text first.)
  • Keep your normalization choices consistent (in the configuration panel) across services so results are apples‑to‑apples.
  • Report both WIP and WER. I like using both because I feel that, combined, they tell a fuller story.
  • Examine the example errors closely. Some of them will be outliers that stem from style guide decisions made when creating the authenticated reference version of the transcript.

Known limitations (and why they matter)

  • Style guides vary, and that will impact results. There are a dizzying number of variables in transcripts. Some services completely ignore the ums and uhs; some even omit curse words. As you may have noticed in a graphic above, when I, as an interviewer, say “gonna,” some services transcribe it literally. However, the Nunn Center style guide authenticated this as “going to.” That divergence registers as a substitution and a deletion and would be flagged by the tool. These kinds of divergences will impact the scores, so be aware.
  • Scores are an approximation. Alignment is careful but imperfect, especially with heavy paraphrase or aggressive speaker diarization in the machine file.

How to use the tool

1) Pick one interview with a solid reference transcript.
2) Run it through two or three services (e.g., Otter.ai, Descript, Trint, MacWhisper).
3) Load the reference and each machine transcript in the Auditor.
4) Keep defaults on, analyze, export reports, and compare.
5) Decide what “good enough” looks like for your project and budget. Use WIP for a humanities‑friendly north star, and WER to align with industry expectations.
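If you want a second, scriptable opinion outside the browser, a comparison along these lines can be approximated in Python with the open-source jiwer package. This is only a quick sketch: it assumes jiwer’s wer and wip functions, the file names are hypothetical, and it does not apply the Auditor’s oral-history normalization, so the numbers will differ from its report.

```python
# Quick sketch, assuming the open-source jiwer package (pip install jiwer).
# File names are hypothetical placeholders for your own transcripts.
from pathlib import Path
import jiwer

reference = Path("reference_human.txt").read_text(encoding="utf-8")

services = {
    "Otter.ai": "otter.txt",
    "Descript": "descript.txt",
    "Trint": "trint.txt",
}

for name, path in services.items():
    hypothesis = Path(path).read_text(encoding="utf-8")
    wer = jiwer.wer(reference, hypothesis)   # lower is better
    wip = jiwer.wip(reference, hypothesis)   # higher is better
    print(f"{name:10s}  WER {wer:.1%}   WIP {wip:.1%}")
```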

Closing thought

This automatic assessment tool is meant to help us talk about accuracy in ways that are rigorous and easy to understand. It makes the evidence visible: not just a number, but where and how things go wrong. My hope is that it helps you choose tools confidently—and advocate for quality when it matters most. I am sure there are some limitations, so I am committed to tweaking the tool as needed.

I hope you find the Transcript Accuracy Auditor useful. Best of luck in assessing the accuracy of transcription services; I really hope this helps.
