
2024 State of ASR

What is the current state of speech technology? Is automatic speech recognition (ASR) sufficient for closed captioning or live captioning? Do we still need humans to achieve accuracy standards for accessibility?

Speech recognition is widely used to streamline the process of creating closed captions, audio descriptions, and other media accessibility accommodations. This session will discuss the findings from a 2024 research study of leading ASR engines to understand how speech AI measures up to the task of captioning and transcription without the intervention of a human editor.

This is the only report of its kind focused on the application of speech recognition technology to captioning, as opposed to other use cases.

3Play Media

May 21, 2024

Transcript

  1. Hello! We’re excited to chat ASR today. Elisa Lewis

    (She/Her), Senior Brand Marketing Manager @ 3Play Media, [email protected]; Tessa Kettelberger (She/Her), Senior Data Scientist @ 3Play Media, [email protected]
  2. An Overview of ASR Tech. What Is ASR? ASR

    stands for Automatic Speech Recognition and refers to the use of Machine Learning (ML), Natural Language Processing (NLP), and Artificial Intelligence (AI) technology to convert speech into text. How Is It Used? ASR is used in many aspects of daily life - from transcription to phone support to automated assistants like Siri or Alexa. Improving ASR: ASR gets better by modeling “truth” data so the AI learns from its mistakes. For example, ASR might read “I need to call an über” until the company name “Uber” is added to its vocabulary. ASR for Transcription: This session will specifically cover the use case of ASR for transcription and captioning.
  3. Automated Assistants vs. Captions. Automated Assistants: • Single

    speaker • High-quality audio, close speaker • Learns your voice • Constrained tasks • Clarification (“Did you catch my drift?”) Automatic Captions: • Usually multiple speakers • Open-ended tasks • Background noise, poor audio • Lost frequencies • Most of us don’t speak perfectly • Changing audio conditions
  4. The Report: An annual review of the top speech

    recognition engines, testing how they perform on the task of captioning and transcription. We test both Word Error Rate (WER) and Formatted Error Rate (FER). Our Goal: Because we use speech recognition as the first step in our human-corrected captioning process, we care about using the best ASR out there. This annual test ensures we’re using the best tech available.
  5. The Accessibility Picture: Captioning Presents a Unique Challenge.

    Variety: Long-form transcription and captioning can present a variety of environments and subjects. Length: Captioning relies on long-form audio, not short commands and feedback. Readability: Captions are consumed by humans and need to be understandable, using proper sentence case and grammar.
  6. The Accessibility Picture: How Does 3Play Use ASR?

    3-Step Process: ASR is the first step of our captioning process, followed by 2 rounds of human editing and review. The better the ASR, the easier the job of the humans. Post-Processing: We do our own post-processing on the ASR engines we use to further improve the ASR output. We have millions of accurately transcribed words that we model on top of ASR to further tune the results. Speechmatics is our current primary ASR engine, so we model on Speechmatics. We would expect to see a similar 10% relative improvement if we applied our proprietary post-processing to any engine in this report.
  7. We Tested … 10 ASR engines on 158 hours

    & 1,336,810 words, across 700 videos, from 10 industries.
  8. Specifically … ASR Engines: • Speechmatics (SMX) •

    AssemblyAI Universal-1 • Microsoft • Rev.ai • DeepGram Nova-2 • IBM • Google Latest-long • Google Enhanced Video • Whisper Large-v2 • Whisper Large-v3. Distribution by Industry: • Higher Ed • Tech • Goods & Services • Cinematic • Associations • Sports • Government • Media Publishing • eLearning • News & Networks. Note: The duration, number of speakers, audio quality, and speaking style (e.g., scripted vs. spontaneous) vary greatly across this data.
  9. Our R&D Team Tested Two Metrics: WER & FER.

    Word Error Rate (WER): Word Error Rate is the metric you typically see when discussing transcription accuracy. For example, “99% accurate captions” would have a WER of 1%. That means 1 in every 100 words is incorrect - the standard for recorded captioning. In addition to pure WER, we dig deeper to measure insertions, substitutions, deletions, and corrections, which provides nuance on how different engines arrive at the measured WER. Formatted Error Rate (FER): While WER is the most common measure of caption accuracy, we think FER is most critical to the human experience of caption accuracy. FER takes into account formatting errors like punctuation, capitalization, and number formatting. We use FER to measure the “read” experience of captioning, so we include crucial captioning elements such as audio tags.
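The WER definition above amounts to a word-level edit-distance computation. Here is a minimal illustrative sketch in Python (a hypothetical implementation, not 3Play's scoring code; real scorers also normalize case and punctuation before comparing):

```python
def wer(truth: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / truth words."""
    t, h = truth.split(), hyp.split()
    # dp[i][j] = word-level edit distance between t[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i                      # delete all i truth words
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if t[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return dp[len(t)][len(h)] / len(t)

# "99% accurate" captions correspond to WER 0.01: 1 wrong word per 100.
print(wer("i need to call an uber", "i need to call an über"))  # one substitution in six words
```

FER-style scoring would differ mainly in what counts as a token mismatch: punctuation, capitalization, and number formatting would be compared rather than stripped away.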
  10. Word Error Rates. Key Takeaways: 1. AssemblyAI has advanced

    from their tied-for-first position in 2023 to a solid first-place position this year. 2. Speechmatics still stands out in second place with a significant lead over the engines below it. 3. Newer models are performing worse than the old models they are meant to replace: a. Whisper’s new V3 model vs. their V2 model; b. AssemblyAI’s new Universal-1 vs. Conformer-2; c. Google’s Latest-long vs. their old Video model. Word Error Rate (%): AssemblyAI Conformer-2* 7.13 | AssemblyAI Universal-1 7.47 | Speechmatics 8.15 | Whisper Large-V2 9.4 | Microsoft 9.46 | Rev.ai 11.0 | DeepGram Nova-2 11.5 | Google Video* 14.6 | Google Latest-long 15.2 | Whisper Large-V3 19.3 | IBM 23.6. (* These engines have been or will soon be deprecated.)
  11. Formatted Error Rates. Key Takeaways: 1. AssemblyAI is

    still a leader in the formatting space. 2. Whisper’s models are more competitive in formatting than in content; they rank higher when formatting counts. 3. Even the best results present significant difficulties for ASR as captioning: a. Formatting is crucial for readability and meaning; b. A 17% error rate means one error every 6 words. Formatted Error Rate (%): AssemblyAI Conformer-2* 17.0 | AssemblyAI Universal-1 17.5 | Whisper Large-V2 17.6 | Speechmatics 19.2 | Microsoft 20.1 | DeepGram Nova-2 20.1 | Rev.ai 21.6 | Whisper Large-V3 27.6 | Google Latest-long 29.8 | Google Video* 30.0 | IBM 43.4. (* These engines have been or will soon be deprecated.)
  12. Different Types of Errors. Substitutions: When ASR mishears

    a word. For example, if it says “encyclopedia” instead of “3Play Media,” that’s a substitution error. Insertions: When ASR adds an extra word. For example, when ASR misinterprets background noise as speech, it may add extra nonsense words. Deletions: When ASR misses or removes words - it transcribes nothing at all when it should have recognized a spoken word.
  13. Error Type Breakdown (%SUB / %INS / %DEL):

    AssemblyAI Universal-1 2.56 / 2.53 / 2.38 | Speechmatics 2.47 / 3.98 / 1.70 | Whisper Large-V2 2.88 / 4.07 / 2.44 | Microsoft 3.07 / 4.09 / 2.29 | Rev.ai 3.75 / 4.77 / 2.48 | DeepGram Nova-2 3.27 / 4.09 / 4.11 | Google Latest-long 5.14 / 3.49 / 6.59 | Whisper Large-V3 4.49 / 10.8 / 4.01 | IBM 9.98 / 5.17 / 4.01. Key Takeaways: • When we break down the error rate into constituent error types, we can understand the different behaviors of each engine. • Speechmatics has the fewest substitutions and deletions, but more insertions. • AssemblyAI makes the fewest insertions but has more deletions than other high-performing engines. • Whisper V3’s poor performance in overall error rate appears to be disproportionately insertion errors.
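Per-type rates like %SUB/%INS/%DEL come from classifying each edit in a word-level alignment between the ASR output and the truth transcript. A minimal sketch of that classification (hypothetical code, not the R&D team's actual tooling, and without the text normalization a real scorer would apply):

```python
def error_counts(truth: str, hyp: str) -> tuple:
    """Count (substitutions, insertions, deletions) via a minimal-cost
    word alignment; reported %SUB/%INS/%DEL divide these counts by the
    number of words in the truth transcript."""
    t, h = truth.split(), hyp.split()
    dp = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if t[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to classify each edit.
    subs = ins = dels = 0
    i, j = len(t), len(h)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (t[i - 1] != h[j - 1]):
            if t[i - 1] != h[j - 1]:
                subs += 1                 # word replaced by a different word
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1                     # truth word with no hypothesis word
            i -= 1
        else:
            ins += 1                      # hypothesis word with no truth word
            j -= 1
    return subs, ins, dels

# One misheard name plus one extra word: a substitution and an insertion.
print(error_counts("we asked 3play media today",
                   "we asked encyclopedia media here today"))
```

Two engines with the same overall WER can have very different mixes of these three counts, which is why the breakdown above is more informative than the headline number.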
  14. Poll Time! Which Industry Do You Think ASR Performed

    Best On? • Higher Education • Goods and Services • Media Publishing • Cinematic • eLearning • Sports
  15. WER & FER by Industry (AVG. WER /

    AVG. FER): eLearning 3.97 / 11.8 | Goods/Services 5.05 / 13.4 | News/Networks 5.25 / 14.9 | Publishing 6.12 / 14.9 | Government 6.92 / 16.4 | Higher Ed 7.16 / 15.2 | Other 7.77 / 16.2 | Associations 8.0 / 16.8 | Tech 9.69 / 19.8 | Sports 10.2 / 20.2 | Cinematic 10.2 / 21.2. Key Takeaways: • Performance is highly impacted by domain and the style of the content. • The best-performing industries tend to have a single speaker, little or no cross-talk, minimal background audio, professional recording environments, and scripted (rather than spontaneous) speech. Scores shown here are the average of the top 4 engines on that industry. While I’m not showing individual scores, I want to mention that Whisper does particularly poorly on Cinematic content; open source and trained on public data, it seems limited by copyright in some domains.
  16. ASR Hallucinations 👀👀👀 “Whisper’s greatest flaw seems to

    be its tendency to sometimes ‘hallucinate’ additional speech that doesn’t appear in the original audio sample. The hallucinations sometimes look very credible if you aren’t listening to the audio. They are usually sensible and on-topic, grammatically correct sentences. This would make viewing the captions as a Deaf/HoH user really confusing. If auto-captions are nonsensical, it’s clear they are making a mistake, but with these, you could easily assume the mistakes are what is actually being said. Whisper’s scores don’t adequately penalize hallucinations, in my opinion. Hallucinations will show up as errors, but an area where the text was completely invented may still get as low as a 50% error rate (rather than 100%) because of common pronouns, function words, and punctuation lining up with the real text.”
  17. Common Hallucination Types: • Repeating a word tens to

    hundreds of times (“thank you thank you thank you…”) • Reproducing a large section of the transcript from elsewhere in the audio • Plausible but made-up speech (“Like and subscribe”) • Grammatically reasonable but meaningless (“We ate from Amsterdam to Amsterdam”). Example - Truth: “Thanks for watching. Please subscribe to my channel.” Hallucination: “Thanks Thanks for for watching watching.”
  18. Truth vs. Whisper example. Truth: “…the mysteries of

    the universe in a…” Whisper: “…the southeastern part of the state it’s a…”
  19. Frequency of Hallucinations. Hallucinations are ASR errors which

    have no basis in the original audio. Detecting Hallucinations: • Sequences of insertion or substitution errors (at least 4 words long) • Truth transcript does not indicate lyrics or other-language speech in the area (e.g., [MUSIC PLAYING]) • Shares few letters (or sounds) with the truth transcript in the area • Another engine without hallucinations did not also have a similar number of errors in the area • Estimate “false positives” of this heuristic by running it on a non-hallucinating engine and subtracting the rate of hallucination detected from other engines’ results. Engines with Hallucinations: • While AssemblyAI’s announcement of Universal-1 implied they had a small number of hallucinations, we didn’t find evidence that they hallucinated on our data • Whisper V2 hallucinates on about 20% of files • Whisper V3 hallucinates on about 57% of files
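The first filter in the heuristic above (runs of at least 4 consecutive insertion or substitution errors) can be sketched as follows. This is hypothetical illustration code, not 3Play's detector; the real method layers on the lyric-tag, letter-overlap, and cross-engine checks listed above:

```python
def hallucination_spans(ops, min_len=4):
    """Flag candidate hallucinations: runs of at least `min_len` consecutive
    insertion/substitution errors in an alignment of ASR output to truth.

    `ops` is a per-word list of alignment labels ("ok", "sub", "ins", "del"),
    e.g. produced by backtracing a word-level edit-distance alignment.
    Returns (start, end) index pairs, end exclusive.
    """
    spans, start = [], None
    for i, op in enumerate(ops + ["ok"]):      # sentinel closes a trailing run
        if op in ("ins", "sub"):
            if start is None:
                start = i                      # a new error run begins here
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))       # run is long enough to flag
            start = None
    return spans

# A five-insertion run qualifies; the isolated substitution does not.
ops = ["ok", "ok", "ins", "ins", "ins", "ins", "ins", "ok", "sub", "ok"]
print(hallucination_spans(ops))
```

Running a detector like this on an engine known not to hallucinate gives the heuristic's false-positive rate, which is then subtracted from the other engines' measured rates, as described above.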
  20. Key Findings (TL;DR). Source Material Matters: It’s clear

    that results are still heavily dependent on audio quality and content difficulty. Most improvements are driven by training techniques, not changes to the underlying technology. Unprecedented Levels of Data Don’t Guarantee Success: The latest Whisper model (V3) is much worse than the previous one (V2); they retrained a model that performs much worse in the real world. Hallucination? What is it about Whisper’s model that hallucinates completely made-up content? Does this have to do with their scaled supervised learning approach? Use Case Matters: These engines are ultimately trained for different use cases. Understanding your use case and which engine best suits it is critical to producing the highest quality. Still Not Good Enough: It’s clear that ASR is still far from good enough for compliance, where 99%+ accuracy is required to provide an equal experience.
  21. What This Means For You. While technology continues to

    improve, there is still a significant leap to real accuracy from even the best speech recognition engines, making humans a crucial part of creating accurate captions.
  22. Common Causes of ASR Errors. Word Errors: • Multiple

    speakers or overlapping speech • Background noise • Poor audio quality • False starts • Acoustic errors • “Function” words. Formatting Errors: • Speaker labels • Punctuation • Grammar • Numbers • Non-speech elements • [INAUDIBLE] tags
  23. Function Words. This example illustrates a very common ASR

    error. Although seemingly small, the meaning is completely reversed: “I can’t attend the meeting.” vs. “I can attend the meeting.”
  24. Complex Vocabulary. These examples of names and complex vocabulary

    require human expertise and knowledge. In each case, the truth is on the left, and the ASR output is on the right.
  25. Speechmatics is no longer the clear leader. Whisper and

    AssemblyAI entered last year and were right there with SMX. AssemblyAI has become even more competitive than last year.
  26. The best engines can achieve up to 93% accuracy …

    for non-specialized content with great audio quality.
  27. Be First to See the 2024 Report. Here’s how:

    • Scan the QR code or visit go.3playmedia.com/rs-2024-asr • Fill out the form • You’ll receive an email linking directly to the report when it’s published!
  28. Thank You! What Questions Do You Have? STATE OF

    ASR: go.3playmedia.com/rs-2024-asr | 3PLAY MEDIA: www.3playmedia.com | @3playmedia. Elisa Lewis (She/Her), [email protected] | Tessa Kettelberger (She/Her), [email protected]