=============================
Cleaning a Whisper Transcript
=============================

Whisper AI generates transcripts with a variety of flaws. To save time,
prioritize fixing some issues over others.

---------------------
Goals for Cleanup
---------------------

* Whisper transcripts should be as accurate to the audio as possible. No
  additions or omissions.
* Whisper transcripts should be readable. Grammar should be accurate enough
  that the transcript can be understood without the audio.
* Whisper transcripts should be searchable. Users are expected to search for
  specific keywords such as the name of a person or the title of a work, so
  these names should be spelled correctly.

-------------------------
How to clean a transcript
-------------------------

* Remove all hallucinations. Hallucinations can be identified by blocks of red
  text.
* Fill in omissions. This can be done either by listening to the audio and
  transcribing by hand or by splicing together multiple AI-generated
  transcripts. If you do the latter, document where you added text and which
  file you pulled the addition from.
* Capitalize proper names and the beginnings of sentences.
* Add punctuation as needed, especially periods and punctuation used within
  the title of a work.
* Make sure the following are spelled correctly: names of individuals, places,
  organizations, events, groups, and works. This includes fictional varieties
  of each (i.e., characters, fictional organizations, imaginary places).
* Address anything else that may be relevant depending on the collection, what
  it emphasizes, and how users are expected to interact with it.

------------------------------------------------
Things not to prioritize due to time constraints
------------------------------------------------

* Standardizing the spelling of words with regional varieties.
* Adding commas, semicolons, and other punctuation that is not required for
  the transcript to be understood.
* Ensuring common connecting words are accurate (e.g., "and" versus "then"
  when the transcript makes sense with either).

--------------------
Things to **NOT** do
--------------------

* Removing filler words like "uh" and "um", unless they are used so often that
  the transcript becomes difficult to read.
* Correcting the grammar of the speaker or otherwise altering what was said.
* Correcting misquoted or misremembered titles, names of individuals,
  concepts, or quotations from elsewhere (just make a note for the metadata).
* Censoring words (just make a note of harmful language for the metadata).
* Removing parts of the transcript deemed "unnecessary", such as commercials
  or off-script chatter.