
2024 State of ASR

What is the current state of speech technology? Is automatic speech recognition (ASR) sufficient for closed captioning or live captioning? Do we still need humans to achieve accuracy standards for accessibility?

Speech recognition is widely used to streamline the process of creating closed captions, audio descriptions, and other media accessibility accommodations. This session will discuss the findings from a 2024 research study of leading ASR engines to understand how speech AI measures up to the task of captioning and transcription without the intervention of a human editor.

This is the only report of its kind focused on the application of speech recognition technology to captioning, as opposed to other use cases.

3Play Media

May 21, 2024

Transcript

  1. Hello! We’re excited to chat ASR today. Elisa Lewis

    (She/Her), Senior Brand Marketing Manager @ 3Play Media, [email protected]; Tessa Kettelberger (She/Her), Senior Data Scientist @ 3Play Media, [email protected]
  2. An Overview of ASR Tech. What Is ASR? ASR

    stands for Automatic Speech Recognition and refers to the use of Machine Learning (ML), Natural Language Processing (NLP), and Artificial Intelligence (AI) technology to convert speech into text. How Is It Used? ASR is used in many aspects of daily life - from transcription to phone support to automated assistants like Siri or Alexa. Improving ASR: ASR gets better by modeling “truth” data so the AI learns from its mistakes. For example, ASR might read “I need to call an über” until the company name “Uber” is added to its vocabulary. ASR for Transcription: This session will specifically cover the use case of ASR for transcription and captioning.
  3. Automated Assistants vs. Captions. Automated Assistants: • Single

    speaker • High-quality audio, close speaker • Learns your voice • Constrained tasks • Clarification (“Did you catch my drift?”) Automatic Captions: • Usually multiple speakers • Open-ended tasks • Background noise, poor audio • Lost frequencies • Most of us don’t speak perfectly • Changing audio conditions
  4. The Report: An annual review of the top speech

    recognition engines, testing how they perform on the task of captioning and transcription. We test both Word Error Rate (WER) and Formatted Error Rate (FER). Our Goal: Because we use speech recognition as the first step in our human-corrected captioning process, we care about using the best ASR out there. This annual test ensures we’re using the best tech available.
  5. The Accessibility Picture: Captioning Presents a Unique Challenge.

    Variety: Long-form transcription and captioning can present a variety of environments and subjects. Length: Captioning relies on long-form audio, not short commands and feedback. Readability: Captions are consumed by humans and need to be understandable, using proper sentence case and grammar.
  6. The Accessibility Picture: How Does 3Play Use ASR?

    3-Step Process: ASR is the first step of our captioning process, followed by 2 rounds of human editing and review. The better the ASR, the easier the job of the humans. Post-Processing: We do our own post-processing on the ASR engines we use to further improve the ASR output. We have millions of accurately transcribed words that we model on top of ASR to further tune the results. Speechmatics is our current primary ASR engine, so we model on Speechmatics. We would expect to see a similar 10% relative improvement if we applied our proprietary post-processing to any engine in this report.
  7. We Tested … 10 ASR engines on 158 hours

    & 1,336,810 words, across 700 videos, from 10 industries.
  8. Specifically … ASR Engines: • Speechmatics (SMX) •

    AssemblyAI Universal-1 • Microsoft • Rev.ai • DeepGram Nova-2 • IBM • Google Latest-long • Google Enhanced Video • Whisper Large-v2 • Whisper Large-v3. Distribution by Industry: • Higher Ed • Tech • Goods & Services • Cinematic • Associations • Sports • Government • Media Publishing • eLearning • News & Networks. Note: The duration, number of speakers, audio quality, and speaking style (e.g., scripted vs. spontaneous) vary greatly across this data.
  9. Our R&D Team Tested Two Metrics: WER & FER.

    Word Error Rate (WER): Word Error Rate is the metric you typically see when discussing transcription accuracy. For example, “99% accurate captions” would have a WER of 1%. That means 1 in every 100 words is incorrect - the standard for recorded captioning. In addition to pure WER, we dig deeper to measure insertions, substitutions, deletions, and corrections, which provides nuance on how different engines arrive at the measured WER. Formatted Error Rate (FER): While WER is the most common measure of caption accuracy, we think FER is most critical to the human experience of caption accuracy. FER takes into account formatting errors like punctuation, capitalization, and number formatting. We use FER to measure the “read” experience of captioning, so we include crucial captioning elements such as audio tags.
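The WER definition above amounts to a word-level edit-distance computation. Here is a minimal illustrative sketch in Python (a hypothetical implementation, not 3Play's scoring code; real scorers also normalize case and punctuation before comparing):

```python
def wer(truth: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / truth words."""
    t, h = truth.split(), hyp.split()
    # dp[i][j] = word-level edit distance between t[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i                      # delete all i truth words
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if t[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return dp[len(t)][len(h)] / len(t)

# "99% accurate" captions correspond to WER 0.01: 1 wrong word per 100.
print(wer("i need to call an uber", "i need to call an über"))  # one substitution in six words
```

FER-style scoring would differ mainly in what counts as a token mismatch: punctuation, capitalization, and number formatting would be compared rather than stripped away.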
  10. Word Error Rates. Key Takeaways: 1. AssemblyAI has advanced

    from their tied-for-first position in 2023 to a solid first-place position this year. 2. Speechmatics still stands out in second place with a significant lead over the engines below it. 3. Newer models are performing worse than the old models they are meant to replace: a. Whisper’s new V3 model vs. their V2 model; b. AssemblyAI’s new Universal-1 vs. Conformer-2; c. Google’s Latest-long vs. their old Video model. Word Error Rate (%): AssemblyAI Conformer-2* 7.13 | AssemblyAI Universal-1 7.47 | Speechmatics 8.15 | Whisper Large-V2 9.4 | Microsoft 9.46 | Rev.ai 11.0 | DeepGram Nova-2 11.5 | Google Video* 14.6 | Google Latest-long 15.2 | Whisper Large-V3 19.3 | IBM 23.6. (* These engines have been or will soon be deprecated.)
  11. Formatted Error Rates. Key Takeaways: 1. AssemblyAI is

    still a leader in the formatting space. 2. Whisper’s models are more competitive in formatting than in content; they rank higher when formatting counts. 3. Even the best results present significant difficulties for ASR as captioning: a. Formatting is crucial for readability and meaning; b. A 17% error rate means one error every 6 words. Formatted Error Rate (%): AssemblyAI Conformer-2* 17.0 | AssemblyAI Universal-1 17.5 | Whisper Large-V2 17.6 | Speechmatics 19.2 | Microsoft 20.1 | DeepGram Nova-2 20.1 | Rev.ai 21.6 | Whisper Large-V3 27.6 | Google Latest-long 29.8 | Google Video* 30.0 | IBM 43.4. (* These engines have been or will soon be deprecated.)
  12. Different Types of Errors. Substitutions: When ASR mishears

    a word. For example, if it says “encyclopedia” instead of “3Play Media,” that’s a substitution error. Insertions: When ASR adds an extra word. For example, when ASR misinterprets background noise as speech, it may add extra nonsense words. Deletions: When ASR misses or removes words - it transcribes nothing at all when it should have recognized a spoken word.
  13. Error Type Breakdown (%SUB / %INS / %DEL):

    AssemblyAI Universal-1 2.56 / 2.53 / 2.38 | Speechmatics 2.47 / 3.98 / 1.70 | Whisper Large-V2 2.88 / 4.07 / 2.44 | Microsoft 3.07 / 4.09 / 2.29 | Rev.ai 3.75 / 4.77 / 2.48 | DeepGram Nova-2 3.27 / 4.09 / 4.11 | Google Latest-long 5.14 / 3.49 / 6.59 | Whisper Large-V3 4.49 / 10.8 / 4.01 | IBM 9.98 / 5.17 / 4.01. Key Takeaways: • When we break down the error rate into constituent error types, we can understand the different behaviors of each engine. • Speechmatics has the fewest substitutions and deletions, but more insertions. • AssemblyAI makes the fewest insertions but has more deletions than other high-performing engines. • Whisper V3’s poor performance in overall error rate appears to be disproportionately insertion errors.
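Per-type rates like %SUB/%INS/%DEL come from classifying each edit in a word-level alignment between the ASR output and the truth transcript. A minimal sketch of that classification (hypothetical code, not the R&D team's actual tooling, and without the text normalization a real scorer would apply):

```python
def error_counts(truth: str, hyp: str) -> tuple:
    """Count (substitutions, insertions, deletions) via a minimal-cost
    word alignment; reported %SUB/%INS/%DEL divide these counts by the
    number of words in the truth transcript."""
    t, h = truth.split(), hyp.split()
    dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if t[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to classify each edit.
    subs = ins = dels = 0
    i, j = len(t), len(h)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (t[i - 1] != h[j - 1]):
            if t[i - 1] != h[j - 1]:
                subs += 1                 # word replaced by a different word
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1                     # truth word with no hypothesis word
            i -= 1
        else:
            ins += 1                      # hypothesis word with no truth word
            j -= 1
    return subs, ins, dels

# One misheard name plus one extra word: a substitution and an insertion.
print(error_counts("we asked 3play media today",
                   "we asked encyclopedia media here today"))
```

Two engines with the same overall WER can have very different mixes of these three counts, which is why the breakdown above is more informative than the headline number.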
  14. Poll Time! Which Industry Do You Think ASR Performed

    Best On? • Higher Education • Goods and Services • Media Publishing • Cinematic • eLearning • Sports
  15. WER & FER by Industry (AVG. WER /

    AVG. FER): eLearning 3.97 / 11.8 | Goods/Services 5.05 / 13.4 | News/Networks 5.25 / 14.9 | Publishing 6.12 / 14.9 | Government 6.92 / 16.4 | Higher Ed 7.16 / 15.2 | Other 7.77 / 16.2 | Associations 8.0 / 16.8 | Tech 9.69 / 19.8 | Sports 10.2 / 20.2 | Cinematic 10.2 / 21.2. Key Takeaways: • Performance is highly impacted by domain and the style of the content. • The best-performing industries tend to have a single speaker, little or no cross-talk, minimal background audio, professional recording environments, and scripted (rather than spontaneous) speech. Scores shown here are the average of the top 4 engines on that industry. While I’m not showing individual scores, I want to mention that Whisper does particularly poorly on Cinematic content; open source and trained on public data, it seems limited by copyright in some domains.
  16. ASR Hallucinations 👀👀👀 “Whisper’s greatest flaw seems to

    be its tendency to sometimes ‘hallucinate’ additional speech that doesn’t appear in the original audio sample. The hallucinations sometimes look very credible if you aren’t listening to the audio. They are usually sensible and on-topic, grammatically correct sentences. This would make viewing the captions as a Deaf/HoH user really confusing. If auto-captions are nonsensical, it’s clear they are making a mistake, but with these, you could easily assume the mistakes are what is actually being said. Whisper’s scores don’t adequately penalize hallucinations, in my opinion. Hallucinations will show up as errors, but an area where the text was completely invented may still get as low as a 50% error rate (rather than 100%) because of common pronouns, function words, and punctuation lining up with the real text.”
  17. Common Hallucination Types: • Repeating a word tens to

    hundreds of times (“thank you thank you thank you…”) • Reproducing a large section of the transcript from elsewhere in the audio • Plausible but made-up speech (“Like and subscribe”) • Grammatically reasonable but meaningless (“We ate from Amsterdam to Amsterdam”). Example - Truth: “Thanks for watching. Please subscribe to my channel.” Hallucination: “Thanks Thanks for for watching watching.”
  18. Truth vs. Whisper example. Truth: “…the mysteries of

    the universe in a…” Whisper: “…the southeastern part of the state it’s a…”
  19. Frequency of Hallucinations. Hallucinations are ASR errors which

    have no basis in the original audio. Detecting Hallucinations: • Sequences of insertion or substitution errors (at least 4 words long) • Truth transcript does not indicate lyrics or other-language speech in the area (e.g., [MUSIC PLAYING]) • Shares few letters (or sounds) with the truth transcript in the area • Another engine without hallucinations did not also have a similar number of errors in the area • Estimate “false positives” of this heuristic by running it on a non-hallucinating engine and subtracting the rate of hallucination detected from other engines’ results. Engines with Hallucinations: • While AssemblyAI’s announcement of Universal-1 implied they had a small number of hallucinations, we didn’t find evidence that they hallucinated on our data • Whisper V2 hallucinates on about 20% of files • Whisper V3 hallucinates on about 57% of files
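The first filter in the heuristic above (runs of at least 4 consecutive insertion or substitution errors) can be sketched as follows. This is hypothetical illustration code, not 3Play's detector; the real method layers on the lyric-tag, letter-overlap, and cross-engine checks listed above:

```python
def hallucination_spans(ops, min_len=4):
    """Flag candidate hallucinations: runs of at least `min_len` consecutive
    insertion/substitution errors in an alignment of ASR output to truth.

    `ops` is a per-word list of alignment labels ("ok", "sub", "ins", "del"),
    e.g. produced by backtracing a word-level edit-distance alignment.
    Returns (start, end) index pairs, end exclusive.
    """
    spans, start = [], None
    for i, op in enumerate(ops + ["ok"]):      # sentinel closes a trailing run
        if op in ("ins", "sub"):
            if start is None:
                start = i                      # a new error run begins here
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))       # run is long enough to flag
            start = None
    return spans

# A five-insertion run qualifies; the isolated substitution does not.
ops = ["ok", "ok", "ins", "ins", "ins", "ins", "ins", "ok", "sub", "ok"]
print(hallucination_spans(ops))
```

Running a detector like this on an engine known not to hallucinate gives the heuristic's false-positive rate, which is then subtracted from the other engines' measured rates, as described above.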
  20. Key Findings (TL;DR). Source Material Matters: It’s clear

    that results are still heavily dependent on audio quality and content difficulty. Most improvements are driven by training techniques, not changes to the underlying technology. Unprecedented Levels of Data Don’t Guarantee Success: The latest Whisper model (V3) is much worse than the previous one (V2); they retrained a model that performs much worse in the real world. Hallucination? What is it about Whisper’s model that hallucinates completely made-up content? Does this have to do with their scaled supervised learning approach? Use Case Matters: These engines are ultimately trained for different use cases. Understanding your use case and which engine best suits it is critical to producing the highest quality. Still Not Good Enough: It’s clear that ASR is still far from good enough for compliance, where 99%+ accuracy is required to provide an equal experience.
  21. What This Means For You. While technology continues to

    improve, there is still a significant leap to real accuracy from even the best speech recognition engines, making humans a crucial part of creating accurate captions.
  22. Common Causes of ASR Errors. Word Errors: • Multiple

    speakers or overlapping speech • Background noise • Poor audio quality • False starts • Acoustic errors • “Function” words. Formatting Errors: • Speaker labels • Punctuation • Grammar • Numbers • Non-speech elements • [INAUDIBLE] tags
  23. Function Words. This example illustrates a very common ASR

    error. Although seemingly small, the meaning is completely reversed: “I can’t attend the meeting.” vs. “I can attend the meeting.”
  24. Complex Vocabulary. These examples of names and complex vocabulary

    require human expertise and knowledge. In each case, the truth is on the left, and the ASR output is on the right.
  25. Speechmatics is no longer the clear leader. Whisper and

    AssemblyAI entered last year and were right there with SMX. AssemblyAI has become even more competitive than last year.
  26. The best engines can achieve up to 93% accuracy …

    for non-specialized content with great audio quality.
  27. Be First to See the 2024 Report. Here’s how:

    • Scan the QR code or visit go.3playmedia.com/rs-2024-asr • Fill out the form • You’ll receive an email linking directly to the report when it’s published!
  28. Thank You! What Questions Do You Have? STATE OF

    ASR: go.3playmedia.com/rs-2024-asr | 3PLAY MEDIA: www.3playmedia.com | @3playmedia. Elisa Lewis (She/Her), [email protected] | Tessa Kettelberger (She/Her), [email protected]