Cleaning a Whisper Transcript
Whisper AI generates transcripts with a variety of flaws. To save time, prioritize fixing some issues over others.
Goals for Cleanup
Whisper transcripts should be as accurate to the audio as possible, with no additions or omissions.
Whisper transcripts should be readable. Grammar should be accurate enough that the transcript can be read and understood without the audio.
Whisper transcripts should be searchable. It is expected that users will search for specific keywords such as the name of a person or the title of a work. Therefore, these names should be spelled correctly.
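One way to support the searchability goal is to flag words that look like near-misses of names you already know appear in the recording. The following is a minimal sketch, not part of any Whisper tooling: the function name `flag_name_variants`, the `known_names` list, and the 0.8 similarity cutoff are all assumptions chosen for illustration. It relies on Python's standard `difflib` module and only suggests candidates; a person still decides whether each one is really a misspelling.

```python
import difflib
import re


def flag_name_variants(transcript, known_names, cutoff=0.8):
    """Return (found_word, known_name) pairs for transcript words that
    closely resemble, but do not exactly match, a known name.

    `known_names` is a hypothetical list of correctly spelled names
    supplied by the person cleaning the transcript.
    """
    # Map lowercase forms back to the correctly cased spelling.
    lowered = {name.lower(): name for name in known_names}
    flagged = []
    seen = set()
    for word in re.findall(r"[A-Za-z']+", transcript):
        key = word.lower()
        # Skip exact matches and words that were already checked.
        if key in lowered or key in seen:
            continue
        seen.add(key)
        match = difflib.get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
        if match:
            flagged.append((word, lowered[match[0]]))
    return flagged
```

For example, given the known names "Tolkien", "Gandalf", and "Shire", the word "tolkein" in a transcript would be flagged as a likely variant of "Tolkien". This sketch compares single words only, so multi-word names and titles would need to be checked separately.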
How to clean a transcript
Remove all hallucinations. Hallucinations can be identified by blocks of red text.
Fill in omissions. This can be done either by listening to the audio and transcribing by hand or by splicing together multiple AI-generated transcripts. If you do the latter, document where you added text and what file you pulled your addition from.
Capitalize proper names and beginnings of sentences.
Add punctuation as needed, especially periods and punctuation used within the title of a work.
Make sure the following are spelled correctly: names of individuals, places, organizations, events, groups, and works. This includes fictional varieties of each (e.g., characters, fictional organizations, imaginary places).
Address whatever else may be relevant, depending on the collection, what it emphasizes, and how users are expected to interact with it.
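When filling omissions by splicing AI-generated transcripts, it can help to list the word runs that a second transcript contains and the primary one lacks, along with where they would go. The sketch below is one possible approach, assuming both transcripts are plain strings; the function name `find_candidate_omissions` and the dictionary keys are hypothetical. It uses the standard `difflib.SequenceMatcher` and reports pure insertions only, so passages where the two transcripts merely word things differently are not listed.

```python
import difflib


def find_candidate_omissions(primary, secondary):
    """List word runs present in `secondary` but missing from `primary`.

    Each result records how many words of `primary` come before the gap
    and the text that `secondary` has at that point.
    """
    a = primary.split()
    b = secondary.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    candidates = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert" means secondary has words where primary has none.
        if tag == "insert":
            candidates.append(
                {"primary_word_offset": i1, "text": " ".join(b[j1:j2])}
            )
    return candidates
```

Each candidate should still be checked against the audio before it is added, since the second transcript may itself contain hallucinations. The returned offsets and text, together with the name of the second file, could serve as the record of where text was added and which file it came from.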
Things not to prioritize due to time constraints
Standardizing spelling of words with regional varieties.
Adding commas, semicolons, and other punctuation that is not required for the transcript to be understood.
Ensuring common connecting words are accurate (e.g., “and” versus “then” when the transcript makes sense with either).
Things to NOT do
Removing filler words like “uh” and “um” unless they are used so often that the transcript becomes difficult to read.
Correcting the grammar of the speaker or otherwise changing what was said.
Correcting misquoted or misremembered titles, names of individuals, concepts, or quotations from elsewhere (just make a note for the metadata).
Censoring words (just make a note of harmful language for the metadata).
Removing parts of the transcript deemed “unnecessary” such as commercials or off-script chatter.
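Whether filler words are used "so often" that readability suffers is a judgment call, but a rough measure can inform it. The following sketch, with a hypothetical function name `filler_ratio` and an assumed filler list of "uh" and "um", computes the share of words in a transcript that are fillers. It does not remove anything and sets no threshold; deciding what ratio is too high is left to the person cleaning the transcript.

```python
import re

# Assumed filler list for illustration; extend it to suit the collection.
FILLERS = frozenset({"uh", "um"})


def filler_ratio(transcript, fillers=FILLERS):
    """Return the fraction of words in `transcript` that are filler words."""
    words = re.findall(r"[A-Za-z']+", transcript.lower())
    if not words:
        return 0.0
    filler_count = sum(1 for word in words if word in fillers)
    return filler_count / len(words)
```

For instance, "Um, so I, uh, went home" contains two fillers among six words, giving a ratio of one third.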