OnlyText v2.0 | Transcript Cleaner

1. Choose Transcript Files

Ready.

Expected: transcript .docx, .txt, or text-based .pdf files. DOCX conversion uses Mammoth.js; PDF text extraction uses PDF.js. Scanned/image-only PDFs are not supported.

Batch Results

No files loaded.

Preview Processed File

Choose one processed file to preview both the original input and the cleaned TXT output.

OnlyText was created by Douglas A. Boyd. This tool is provided for non-commercial and archival purposes and is used at your own risk. No warranty is expressed or implied; results should be reviewed for accuracy before publication or deposit.

Version v2.0 — Released May 4, 2026

Help / About

How to Use OnlyText

1) Load one or more transcript files (.docx, .txt, or text-based .pdf).
2) Click “Create OnlyText TXT.”
3) Use the preview menu to compare the original input with the cleaned TXT output.
4) Choose the export mode and click Export. ZIP is the default; individual downloads are available as an option.
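
The default ZIP export in step 4 can be pictured as bundling every cleaned transcript into one archive, each entry encoded as UTF-8. The following is a minimal Python sketch under stated assumptions, not OnlyText's own browser code: the function name build_zip and the filename-to-text mapping are hypothetical.

```python
import io
import zipfile


def build_zip(outputs: dict[str, str]) -> bytes:
    """Pack cleaned transcripts into an in-memory ZIP archive.

    `outputs` maps an output filename to its cleaned text. Both the
    mapping and this function name are assumptions for illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, text in outputs.items():
            # Every entry is written as UTF-8 plain text.
            archive.writestr(filename, text.encode("utf-8"))
    return buffer.getvalue()
```

Individual downloads would skip the archive step and save each encoded text under its own filename.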

Filename rules
• All outputs are .txt files.
• Existing .txt input filenames are preserved exactly.
• Non-TXT inputs, such as .docx or .pdf files, are exported with the same base filename and a .txt extension.
• Output filenames never include the original .docx or .pdf extension.
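
The filename rules above can be expressed in a few lines. This is a hedged Python sketch rather than the tool's actual code: output_filename is a hypothetical name, and treating the .txt check as case-insensitive is an assumption.

```python
import os


def output_filename(input_name: str) -> str:
    """Return the export filename for a given input filename."""
    base, ext = os.path.splitext(input_name)
    if ext.lower() == ".txt":
        # Existing .txt filenames are preserved exactly.
        return input_name
    # Non-TXT inputs keep the base name and swap in a .txt extension,
    # so the original .docx or .pdf extension never appears.
    return base + ".txt"
```

For example, an input named interview.docx would be exported as interview.txt, while notes.txt would keep its name unchanged.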

Transcript-start rule
• OnlyText looks for the first true transcript speaker label followed by a colon, such as FLEMING:, Jewell:, Malcolm Jewell:, Interviewer:, Respondent:, or Speaker 1:.
• It scores likely speaker starts by looking at the label shape, repeated labels, nearby speaker turns, and whether the label is followed by transcript text.
• Everything before that first speaker label is removed.
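
One way to picture the transcript-start rule is a small scoring pass over the lines. The Python sketch below is only an illustration under stated assumptions: the label pattern, the weights, the look-ahead window, and the threshold are all hypothetical, and OnlyText's real scoring is not documented here.

```python
import re
from collections import Counter

# Hypothetical label shape: one to four capitalized or numeric words, then a colon.
LABEL_RE = re.compile(r"^\s*([A-Z][\w.'-]*(?: [A-Z0-9][\w.'-]*){0,3}):\s*(\S.*)?$")
WINDOW = 6      # assumed number of following lines checked for speaker turns
THRESHOLD = 4   # assumed minimum score for a true transcript start


def label_of(line):
    """Return the speaker-like label on a line, or None."""
    match = LABEL_RE.match(line)
    return match.group(1) if match else None


def score_line(lines, i, counts):
    """Score line i using the four signals named in the rule."""
    match = LABEL_RE.match(lines[i])
    if not match:
        return 0
    score = 1                                    # label shape
    if counts[match.group(1)] > 1:
        score += 2                               # label repeats elsewhere
    if any(label_of(l) for l in lines[i + 1:i + 1 + WINDOW]):
        score += 1                               # nearby speaker turns
    following = next((l for l in lines[i + 1:] if l.strip()), "")
    if match.group(2) or (following and not label_of(following)):
        score += 1                               # followed by transcript text
    return score


def strip_front_matter(lines):
    """Drop everything before the first line that scores as a speaker start."""
    counts = Counter(label for label in map(label_of, lines) if label)
    for i in range(len(lines)):
        if score_line(lines, i, counts) >= THRESHOLD:
            return lines[i:]
    return lines  # assumption: leave the text untouched if no start is found
```

With these assumed weights, a front-matter line such as "Date: May 4, 1990" has the right shape but never repeats, so it scores below the threshold, while a recurring label such as "FLEMING:" passes. A speaker who appears only once would be missed by this simplified version.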

Cleanup rules
• Removes title page and front matter before the first detected speaker label.
• Removes standalone page numbers, roman numerals, page break markers, and common repeated header/footer lines.
• Removes short lines that recur throughout the file, which helps catch headers and footers left behind by export.
• Preserves transcript speaker labels and paragraph breaks.
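
The cleanup rules can be sketched as a single filtering pass over the lines that follow the transcript start. This Python example is illustrative only: the patterns, the page break marker text, and the two thresholds are assumptions, not OnlyText's actual values.

```python
import re
from collections import Counter

PAGE_NUMBER_RE = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# Strict roman numeral shape, so ordinary words are unlikely to match.
ROMAN_RE = re.compile(
    r"^\s*(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})\s*$",
    re.IGNORECASE,
)
# Hypothetical page break markers: a form feed or a "[page break]" style line.
PAGE_BREAK_RE = re.compile(r"^\s*(?:\[page break\]|-+\s*page break\s*-+|\x0c)\s*$", re.IGNORECASE)
SPEAKER_RE = re.compile(r"^\s*[A-Z][\w.' -]{0,40}:")
MAX_SHORT_LEN = 60   # assumed cutoff for a "short" line
MIN_REPEATS = 3      # assumed repeat count that marks a header or footer


def clean_lines(lines):
    """Remove page furniture while keeping speaker turns and blank lines."""
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        stripped = line.strip()
        if PAGE_BREAK_RE.match(line):
            continue                      # page break marker
        if not stripped or SPEAKER_RE.match(line):
            kept.append(line)             # paragraph break or speaker turn
            continue
        if PAGE_NUMBER_RE.match(line) or ROMAN_RE.match(line):
            continue                      # standalone page number
        if len(stripped) <= MAX_SHORT_LEN and counts[stripped] >= MIN_REPEATS:
            continue                      # recurring header or footer line
        kept.append(line)
    return kept
```

Checking for speaker labels before the repeat filter is a deliberate choice in this sketch: a short reply such as "FLEMING: Yes." may recur many times and should survive.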

Notes
• All processing occurs locally in the browser.
• DOCX conversion uses Mammoth.js to extract text from Word files.
• PDF conversion uses PDF.js to extract embedded text from text-based PDFs. Scanned/image-only PDFs are not supported.
• Exports are always UTF-8 plain text. No encoding choice is required.
• Results should be reviewed before deposit or publication.
