Policy and Guiding Principles for Captioning and Transcription of A/V Works¶

Purpose¶

This policy establishes procedures for providing accessibility and discoverability features for audiovisual (A/V) works, including video and audio-only materials. It ensures consistent application of closed captioning, diarization, downloadable PDF transcripts, and navigable time-stamped transcripts. The goal is to improve access, mitigate risk, and provide a baseline of accessibility, recognizing that all files will initially be generated by AI technologies and that we may not have the resources or ability to clean up everything.

Guiding Principles¶

Accessibility First: We should provide the most inclusive experience possible within available resources.
Pragmatism: AI-generated captions and transcripts will contain errors, but they represent an important first step toward accessibility and compliance.
Transparency: Users should be aware that captions and transcripts are auto-generated and may contain inaccuracies.
Risk Mitigation: Even imperfect captions and transcripts reduce institutional risk by demonstrating a proactive commitment to accessibility.
Collection Context: Different types of collections may receive different levels of captioning and transcript cleanup, depending on their significance, use, and other political factors.
Perfection Can Be the Enemy of Good: The priority is to provide timely, usable captions and transcripts, even if imperfect. Delaying release in pursuit of complete accuracy undermines accessibility and runs counter to the goal of reducing barriers. Errors can be refined later as resources allow.

Accessibility First¶

The primary goal of our work in making A/V materials accessible is to increase access for our community of users while supporting compliance with ADA Title II requirements and WCAG 2.1 AA standards. Additional steps may be taken depending on the nature of the collection and the resources available, but the central driver of this work will always be improving accessibility.

Pragmatism¶

Our approach to closed captions, transcripts, and other supplemental files will be pragmatic and guided by the resources available. We will prioritize solutions that balance accessibility with feasibility, seeking innovative tools and workflows that reliably generate usable initial files.

Minimal requirements for works and collections may differ based on several factors, including:

Work type / content model (video vs. audio, etc.)
Significance of the collection (high-use, high-visibility, or unique value)
Access restrictions (internal, limited, or private vs. public)
Language of the work
Intended audience and anticipated use

These requirements will evolve over time as technology improves, community expectations shift, and institutional resources change.

Transparency¶

Users must be able to understand when a closed caption file or transcript has been machine-generated and may contain errors. Transparency builds trust and ensures that access is not mistaken for perfection.

To support this principle:

Labeling: All captions and transcripts that have not been reviewed or corrected should be clearly marked as machine-generated.
Disclaimers: Supplemental files (e.g., downloadable PDF transcripts) should include disclaimers noting that errors may be present.
Contextual Limits: Transparency is especially important for works in languages where we lack expertise or for content that includes proper nouns, technical terms, or specialized subject matter that may be transcribed inaccurately.
Metadata Tracking: Where possible, administrative metadata should document the status of a file (machine-generated, partially remediated, fully reviewed) to track its evolution over time.

By being transparent, we provide meaningful access while setting realistic expectations about the limitations of machine-generated assets.
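The status tracking described under Metadata Tracking could be modeled roughly as follows. This is a minimal sketch, not an existing schema: the class names, field names, and `needs_label` helper are hypothetical, and only the three status values come from this policy.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class RemediationStatus(Enum):
    # The three states named in this policy.
    MACHINE_GENERATED = "machine-generated"
    PARTIALLY_REMEDIATED = "partially remediated"
    FULLY_REVIEWED = "fully reviewed"


@dataclass
class TranscriptRecord:
    """Hypothetical administrative metadata for one caption or transcript file."""
    file_name: str
    status: RemediationStatus = RemediationStatus.MACHINE_GENERATED
    history: list = field(default_factory=list)

    def update_status(self, new_status, on):
        # Record (date, old status, new status) so the file's evolution can be traced.
        self.history.append((on, self.status, new_status))
        self.status = new_status

    def needs_label(self):
        # Files that have not been reviewed or corrected must be marked as machine-generated.
        return self.status is RemediationStatus.MACHINE_GENERATED
```

A record starts as machine-generated by default, so a file that nobody has touched is labeled without any extra step.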

Risk Mitigation¶

Our work with captions and transcripts also serves to mitigate legal, ethical, and reputational risks associated with inaccessible A/V content. While AI-generated files are imperfect, providing them demonstrates a good-faith effort to reduce barriers to access and comply with accessibility standards.

To support this principle:

Baseline Coverage: All A/V works should have at least a machine-generated caption or transcript, even if errors are present.
Prioritization: High-use, high-visibility, or significant collections should be prioritized for remediation and improvement to reduce exposure to risk.
Documentation: Exceptions and limitations should be documented to show deliberate, reasoned decision-making rather than omission.
Continuous Improvement: As resources allow, captions and transcripts should be incrementally improved, recognizing that partial progress reduces risk more than inaction.
Demonstrable Effort: Maintaining records of workflows, disclaimers, and metadata demonstrates institutional commitment to accessibility, even when perfection is not achievable.

By applying these practices, we acknowledge the limits of our resources while taking proactive steps to meet accessibility expectations and reduce potential risks.

Collection Context¶

Different types of collections may receive different levels of captioning and transcript cleanup, depending on their importance and use:

Legacy Collections: Older collections will be prioritized for bulk AI processing with minimal manual correction, unless identified as significant.
New Collections: New works should follow the full procedure outlined in this policy at the time of ingest, ensuring accessibility from the outset.
Significant Collections: High-use, high-visibility, or strategically important collections may receive additional human review and editing to improve accuracy.

The significance of a collection and the need for enhanced remediation should be determined before or during the kickoff meeting for new collections. This evaluation helps determine whether the collection can be published within available resources and whether additional support—such as grant funding or donor contributions—might be pursued to cover labor-intensive tasks required to improve accuracy.

By tailoring our approach to the context of each collection, we balance pragmatism with risk mitigation, ensuring that resources are applied where they have the greatest impact on accessibility and institutional compliance.

Perfection Can Be the Enemy of Good¶

While accuracy and completeness are important, striving for perfection in captions, transcripts, and supplemental files can unintentionally hinder broader accessibility goals. Spending excessive time correcting minor errors or chasing complete remediation may:

Delay Access: Over-investing in perfection can slow the publication of new collections, delaying access to materials for our users.
Limit Scope: Resources devoted to perfecting a few works may prevent us from processing larger volumes, leaving many collections without any accessibility support.
Create Risk of Inaction: Excessive focus on perfection can result in collections never being released or, in extreme cases, needing to be taken down due to incomplete accessibility work.

Our goal is to provide meaningful access as quickly as possible. AI-generated captions and transcripts, even with errors, represent a significant step toward accessibility. By accepting a level of imperfection, we can balance quality with productivity, ensuring that more users benefit from a broader range of collections.

Video Works¶

A video work is any resource that includes a video file that can be viewed in a media player.

Lowest / MVP File¶

A machine-generated closed caption file that is clearly marked as machine-generated.
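One way to mark the caption file itself is a WebVTT `NOTE` comment placed right after the header. The sketch below assumes the captions are WebVTT, the format adopted in the Terminology section; the function name and the notice wording are hypothetical. Players do not display `NOTE` comments, so a visible label in the media player would still need to be set separately.

```python
def label_machine_generated(
    vtt_text,
    notice="This caption file was machine-generated and may contain errors.",
):
    """Insert a NOTE comment after the WebVTT header block."""
    if "-->" in notice:
        # WebVTT comments may not contain the cue timing arrow.
        raise ValueError("notice must not contain '-->'")
    lines = vtt_text.splitlines()
    if not lines or not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        raise ValueError("input does not look like a WebVTT file")
    # The header block ends at the first blank line (or at end of file).
    try:
        header_end = lines.index("")
    except ValueError:
        header_end = len(lines)
    note_block = ["", "NOTE " + notice]
    return "\n".join(lines[:header_end] + note_block + lines[header_end:]) + "\n"
```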

Minimal Enhancement¶

All elements of the MVP file.
Captions and time-stamped transcript spot-checked for obvious errors, with minimal corrections applied to improve readability and accuracy and to eliminate hallucinations.
A downloadable transcript (PDF) generated from the captions, with a disclaimer about possible errors.
A navigable time-stamped transcript based on the closed caption file.
Metadata updated to reflect remediation status (e.g., Partially Reviewed or Minimally Reviewed).
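The text of a downloadable transcript can be derived from the caption file before it is laid out as a PDF. The following is a rough sketch under stated assumptions: the captions are WebVTT, the disclaimer wording and function name are hypothetical, and the PDF rendering step is left out because this policy does not name a tool for it.

```python
import re

# Hypothetical wording; the policy only requires that a disclaimer be present.
DISCLAIMER = "This transcript was machine-generated and may contain errors."


def captions_to_transcript(vtt_text):
    """Extract cue text from WebVTT captions and prepend a disclaimer."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    paragraphs = []
    for block in blocks:
        lines = block.split("\n")
        # Only cue blocks contain a timing line with "-->".
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # header, NOTE, STYLE, or REGION block
        text = " ".join(line.strip() for line in lines[timing + 1:])
        # Drop inline cue tags such as <v Speaker> or <i>.
        text = re.sub(r"<[^>]+>", "", text).strip()
        if text:
            paragraphs.append(text)
    return DISCLAIMER + "\n\n" + "\n".join(paragraphs) + "\n"
```

Generating the transcript from the captions keeps the two files in step: a correction made in the caption file carries over the next time the transcript is regenerated.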

Intermediate Enhancement¶

All elements of Minimal Enhancement.
Initial, potentially error-prone diarization and speaker identification.
Minimal non-speech markers initialized for things like music and songs.
Metadata updated to reflect remediation status (e.g., Partially Reviewed with Machine Generated Speaker Identification).

Advanced Enhancement¶

All elements of Intermediate Enhancement.
Diarization reviewed and corrected for major speaker errors.
Metadata updated to reflect remediation status (e.g., Partially Reviewed with Machine Generated Speaker Identification and Major Speaker Errors Corrected).

Highest / Fully Remediated¶

All elements of the Advanced Enhancement.
Full human review and correction of captions and time-stamped transcript.
Diarization fully corrected.
Proper names, technical terms, and specialized content verified.
Interactive time-stamped transcript fully functional and synced.
Metadata reflects Fully Reviewed status, with version history documented.
Subtitles in English (if the video was originally in another language).
Subtitles in another language (if the video was originally in English).
Optional: additional accessibility enhancements such as audio descriptions or visual summaries if feasible.
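Because each video level includes everything from the level below it, the full requirement set for a level can be computed by accumulation. The sketch below is illustrative only: the level keys and the abbreviated element wording are assumptions, the per-level metadata updates are collapsed into a single entry, and optional elements are left out.

```python
# Hypothetical summary of the video tiers, abbreviated from the policy text.
VIDEO_LEVELS = [
    ("MVP", ["machine-generated closed caption file, labeled as machine-generated"]),
    ("Minimal", [
        "captions and time-stamped transcript spot-checked",
        "downloadable PDF transcript with disclaimer",
        "navigable time-stamped transcript",
        "metadata updated with remediation status",
    ]),
    ("Intermediate", [
        "initial diarization and speaker identification",
        "minimal non-speech markers",
    ]),
    ("Advanced", ["diarization corrected for major speaker errors"]),
    ("Fully Remediated", [
        "full human review of captions and transcript",
        "diarization fully corrected",
        "names and specialized terms verified",
        "subtitles in a second language",
    ]),
]


def requirements_for(level):
    """Return every element required at `level`, including all lower tiers."""
    names = [name for name, _ in VIDEO_LEVELS]
    if level not in names:
        raise ValueError(f"unknown level: {level!r}")
    required = []
    for name, items in VIDEO_LEVELS:
        required.extend(items)
        if name == level:
            break
    return required
```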

À La Carte Enhancement¶

It will be extremely rare for items to receive the highest level of treatment. This category exists because some collections may need selected elements from higher levels of remediation without the full treatment.

All elements of Minimal Enhancement.
If possible: all elements of Intermediate Enhancement.
Any elements from other levels, as justified by the curator with available resources taken into account.
Metadata reflecting every additional task beyond Minimal or Intermediate Enhancement.
Documentation explaining what was done and why.

Non-Musical Audio Works¶

An audio work is any resource that includes an audio file that can be listened to in a media player.

Lowest / MVP File¶

A machine-generated time-stamped transcript delivered in plain text and clearly marked as machine-generated.

Minimal Enhancement¶

All elements of the MVP file.
Time-stamped transcript spot-checked for obvious errors, with minimal corrections applied to improve readability and accuracy and to eliminate hallucinations.
A downloadable transcript (PDF) generated from the time-stamped transcript, with a disclaimer about possible errors. The PDF should be accessible (PDF/UA).
Metadata updated to reflect remediation status (e.g., Minimally Reviewed).

Intermediate Enhancement¶

All elements of Minimal Enhancement.
Minimal non-speech markers initialized for things like music and songs.
Metadata updated to reflect remediation status (e.g., Partially Reviewed).
Interactive, synced time-stamped transcript.

Highest / Fully Remediated¶

All elements of the Intermediate Enhancement.
Full human review and correction of time-stamped transcript.
Proper names, technical terms, and specialized content verified.
Interactive time-stamped transcript fully functional and synced.
Metadata reflects Fully Reviewed status, with version history documented.
Translation to English (if the audio was originally in another language).
Translation to another language (if the audio was originally in English).

À La Carte Enhancement¶

It will be extremely rare for items to receive the highest level of treatment. This category exists because some collections may need selected elements from higher levels of remediation without the full treatment.

All elements of Minimal Enhancement.
If possible: all elements of Intermediate Enhancement.
Any elements from other levels, as justified by the curator with available resources taken into account.
Metadata reflecting every additional task beyond Minimal or Intermediate Enhancement.
Documentation explaining what was done and why.

Music Audio¶

This section of the policy covers audio files of songs. Videos with music are not covered here.

Defining MUSIC:

Audio files with music.
Audio files with music and SOME speaking (such as an introduction of a musician).

Not counted as MUSIC for the purposes of this document:

Audio files with music and LOTS of speaking (such as a talk show with a musical performance or a song sample): treat as non-musical audio.
Music videos: treat as videos.

Music characteristics¶

Certain factors may help determine the level of enhancement music files undergo. For new collections, this should be determined at a DPMT meeting. Some factors include:

Length of song
Intelligibility of lyrics
Rarity of information contained in the song (e.g., an unpublished song likely not found elsewhere)
Time available for library staff to dedicate to this work

Lowest / MVP File¶

A machine-generated transcript delivered in plain text and clearly marked as machine-generated.

Minimal Enhancement¶

All elements of the MVP file.
A downloadable PDF transcript generated from the transcript, with a disclaimer about possible errors. The PDF should be accessible (PDF/UA), with lyrics arranged in verses/stanzas.
For instrumental songs, transcript replaced with the text “🎵 Instrumental Song 🎵”.
For songs in languages not known by library staff, transcript replaced with the text “🎵 Song with lyrics in [Language] 🎵”.
Metadata updated to reflect remediation status (e.g., Minimally Reviewed).
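The placeholder rules above can be expressed as a small helper. This is a sketch only: the function and parameter names are hypothetical, and whether a song is instrumental or in a language not known by staff remains a human judgment made before the function is called.

```python
def music_transcript_text(lyrics, *, instrumental=False, unknown_language=None):
    """Return the transcript text for a music file at Minimal Enhancement."""
    if instrumental:
        return "🎵 Instrumental Song 🎵"
    if unknown_language:
        # unknown_language is the language name, e.g. "Basque".
        return f"🎵 Song with lyrics in {unknown_language} 🎵"
    return lyrics
```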

Intermediate Enhancement¶

All elements of Minimal Enhancement.
Time-stamped transcript. This should be a WebVTT (.vtt) file.
Metadata updated to reflect remediation status (e.g., Partially Reviewed).

Highest / Fully Remediated¶

All elements of the Intermediate Enhancement.
Full human review and correction of time-stamped transcript and PDF transcript.
Proper names, technical terms, and specialized content verified.
Interactive time-stamped transcript fully functional and synced.
Metadata reflects Fully Reviewed status, with version history documented.

À La Carte Enhancement¶

All elements of Minimal Enhancement.
If possible: all elements of Intermediate Enhancement.
Any elements from other levels, as justified by the curator with available resources taken into account.
Metadata reflecting every additional task beyond Minimal or Intermediate Enhancement.
Documentation explaining what was done and why.

Terminology¶

Audio Description: A service that provides an additional audio track of narration, describing the key visual elements of a program to make it accessible for people who are blind or visually impaired. The narration is inserted into the natural pauses in the program’s dialogue, conveying information like character movements, settings, and expressions that would otherwise be missed. This process helps to create a more complete and equitable viewing experience for everyone. Currently, we do not support audio description.
Closed-Caption File: A closed caption file is a synchronized text transcript of a video’s audio. Its contents appear in the media player and can be turned on or off by clicking its corresponding label or language code. It should include non-speech information and speaker identification. There are many formats of closed caption files but we have adopted WebVTT.
PDF Transcript: A PDF transcript is a supplemental file that may be associated with a work or file. It is meant to act as a different rendering of the resource appearing in the player.
Subtitle File: A subtitle file is a synchronized text transcript of a video’s audio in a language different from what it was originally recorded in. Like closed caption files, subtitle files are shown in the media player. It is not required by WCAG 2.1 AA. It may include speaker identification and other non-speech information. There are many formats of subtitle files but we have adopted WebVTT.
Time-Stamped Transcript: A time-stamped transcript is a synchronized transcript of an audio or video file that is used primarily for search and navigation. As a result, it may not include non-speech information or speaker identification. There are many formats of transcript files but we have adopted WebVTT.
Transcript: A transcript is a text file containing the transcribed audio. It does not need to be synchronized or time-stamped, and it does not need to include speaker identification or non-speech information.