Slide 1

Slide 1 text

The State Of ASR 2024.

Slide 2

Slide 2 text

Hello! We’re excited to chat about ASR today.
Elisa Lewis (She/Her), Senior Brand Marketing Manager @ 3Play Media, [email protected]
Tessa Kettelberger (She/Her), Senior Data Scientist @ 3Play Media, [email protected]

Slide 3

Slide 3 text

Agenda
● ASR overview
● Annual State of ASR report
● Research results & trends
● Key takeaways & conclusions

Slide 4

Slide 4 text

An Overview of ASR Tech
What Is ASR? ASR stands for Automatic Speech Recognition and refers to the use of Machine Learning (ML), Natural Language Processing (NLP), and Artificial Intelligence (AI) technology to convert speech into text.
How Is It Used? ASR is used in many aspects of daily life, from transcription to phone support to automated assistants like Siri or Alexa.
Improving ASR: ASR gets better by modelling “truth” data so the AI learns from its mistakes. For example, ASR might transcribe “I need to call an über” until the company name “Uber” is added to its vocabulary.
ASR For Transcription: This session will specifically cover the use case of ASR for transcription and captioning.
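To make the “über” vs. “Uber” example concrete, here is a minimal, hypothetical post-correction pass in Python. The CUSTOM_VOCABULARY map and function name are illustrative only and are not part of any real engine’s API; real custom-vocabulary support works inside the engine, not as a regex pass.

```python
import re

# Hypothetical custom-vocabulary map: spoken/misrecognized forms the ASR
# tends to produce, mapped to the written form we want in the transcript.
CUSTOM_VOCABULARY = {
    r"\büber\b": "Uber",
    r"\b3 play media\b": "3Play Media",
}

def apply_custom_vocabulary(transcript: str) -> str:
    """Replace known misrecognitions with the preferred written form."""
    for pattern, replacement in CUSTOM_VOCABULARY.items():
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript

print(apply_custom_vocabulary("I need to call an über"))
# -> "I need to call an Uber"
```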

Slide 5

Slide 5 text

Auto Assistants Vs Captions
Automated Assistants:
● Single speaker
● High quality audio, close speaker
● Learns your voice
● Constrained tasks
● Clarification
● Did you catch my drift?
Automatic Captions:
● Usually multiple speakers
● Tasks are open-ended
● Background noise, poor audio
● Lost frequencies
● Most of us don’t speak perfectly
● Changing audio conditions

Slide 6

Slide 6 text

Let’s Talk State Of ASR.

Slide 7

Slide 7 text

The Report: An annual review of the top speech recognition engines, testing how they perform for the task of captioning and transcription. We test for both Word Error Rate (WER) and Formatted Error Rate (FER).
Our Goal: Because we use speech recognition as the first step in our human-corrected captioning process, we care about using the best ASR out there. This annual test ensures we’re using the absolute best tech available.

Slide 8

Slide 8 text

The Accessibility Picture: Captioning Presents a Unique Challenge
Variety: Long-form transcription and captioning can present a variety of environments and subjects.
Length: Captioning relies on long-form audio, not short commands & feedback.
Readability: Captions are consumed by humans and need to be understandable, using proper sentence case and grammar.

Slide 9

Slide 9 text

The Accessibility Picture: How Does 3Play Use ASR?
3-Step Process: ASR is the first step of our captioning process, followed by 2 rounds of human editing and review. The better the ASR, the easier the job of the humans.
Post-Processing: We do our own post-processing on the ASR engines we use to further improve the ASR output. We have millions of accurately transcribed words that we model on top of ASR to further tune the results. Speechmatics is our current primary ASR engine, so we model on Speechmatics. We would expect to see a similar 10% relative improvement if we applied our proprietary post-processing to any engine in this report.

Slide 10

Slide 10 text

Let’s See The Data.

Slide 11

Slide 11 text

We Tested … 10 ASR Engines on 158 Hours & 1,336,810 Words Across 700 Videos From 10 Industries.

Slide 12

Slide 12 text

Specifically …
ASR Engines
● Speechmatics (SMX)
● AssemblyAI Universal-1
● Microsoft
● Rev.ai
● DeepGram Nova-2
● IBM
● Google Latest-long
● Google Enhanced Video
● Whisper Large-v2
● Whisper Large-v3
Distribution By Industry
● Higher Ed
● Tech
● Goods & Services
● Cinematic
● Associations
● Sports
● Government
● Media Publishing
● eLearning
● News & Networks
Note: The duration, number of speakers, audio quality, and speaking style (e.g. scripted vs. spontaneous) vary greatly across this data.

Slide 13

Slide 13 text

Our R&D Team Tested Two Metrics: WER & FER
Word Error Rate (WER): Word Error Rate is the metric you typically see when discussing transcription accuracy. For example, “99% accurate captions” would have a WER of 1%. That means 1 in every 100 words is incorrect, the standard for recorded captioning. In addition to pure WER, we dig deeper to measure insertions, substitutions, deletions, and corrections, which provides nuance on how different engines get to the measured WER.
Formatted Error Rate (FER): While WER is the most common measure of caption accuracy, we think FER is most critical to the human experience of caption accuracy. FER takes into account formatting errors like punctuation, capitalization, and number formatting. We use FER to measure the “read” experience of captioning, so we include crucial captioning elements such as audio tags.
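As a rough illustration of how a WER-style metric is computed (a minimal sketch, not 3Play’s actual evaluation code): align the hypothesis against the reference with word-level edit distance, count substitutions, insertions, and deletions, and divide by the number of reference words, i.e. WER = (S + I + D) / N. FER applies the same idea to text that keeps punctuation, capitalization, and other formatting rather than normalizing it away.

```python
def word_error_rate(reference: str, hypothesis: str):
    """Compute WER = (S + I + D) / N via word-level edit distance.

    Returns (wer, substitutions, insertions, deletions).
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i                      # delete all reference words
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i][j - 1],       # insertion
                                   dp[i - 1][j])       # deletion
    # Backtrace one optimal alignment to count each error type.
    subs = ins = dels = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins += 1; j -= 1
        else:
            dels += 1; i -= 1
    n = max(len(ref), 1)                  # avoid division by zero on empty refs
    return (subs + ins + dels) / n, subs, ins, dels

# 1 substitution, 1 insertion, 1 deletion over 6 reference words -> WER 0.5
print(word_error_rate("i need to call an uber",
                      "i need a call uber please"))
```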

Slide 14

Slide 14 text

Word Error Rates
Key Takeaways
1. AssemblyAI has advanced from their tied-for-first position in 2023 to a solid first-place position this year
2. Speechmatics still stands out in second place with a significant lead over the engines below it
3. Newer models are performing worse than the old models they are meant to replace
   a. Whisper’s new V3 model vs their V2 model
   b. AssemblyAI’s new Universal-1 vs Conformer-2
   c. Google’s Latest-long vs their old Video model
Error Rate (%) by engine:
● *AssemblyAI Conformer-2: 7.13
● AssemblyAI Universal-1: 7.47
● Speechmatics: 8.15
● Whisper Large-V2: 9.4
● Microsoft: 9.46
● Rev.ai: 11
● DeepGram Nova-2: 11.5
● *Google Video: 14.6
● Google Latest-long: 15.2
● Whisper Large-V3: 19.3
● IBM: 23.6
* These engines have been or will soon be deprecated

Slide 15

Slide 15 text

Formatted Error Rates
Key Takeaways
1. Assembly is still a leader in the formatting space
2. Whisper’s models are more competitive in formatting than in content. They rank higher when formatting counts.
3. Even the best results show that ASR alone still presents significant difficulties for captioning
   a. Formatting is crucial for readability and meaning
   b. A 17% error rate means one error every 6 words
Error Rate (%) by engine:
● *AssemblyAI Conformer-2: 17
● AssemblyAI Universal-1: 17.5
● Whisper Large-V2: 17.6
● Speechmatics: 19.2
● Microsoft: 20.1
● DeepGram Nova-2: 20.1
● Rev.ai: 21.6
● Whisper Large-V3: 27.6
● Google Latest-long: 29.8
● *Google Video: 30
● IBM: 43.4
* These engines have been or will soon be deprecated

Slide 16

Slide 16 text

Not all errors are made equal.

Slide 17

Slide 17 text

The Accessibility Picture: Different Types Of Errors
Substitutions: When ASR mishears a word. For example, if it says encyclopedia instead of 3Play Media, that’s a substitution error.
Insertions: When ASR adds an extra word. For example, when ASR misinterprets background noise as speech, it may add extra nonsense words.
Deletions: When ASR misses or removes words. ASR has transcribed nothing at all when it should have recognized a spoken word.

Slide 18

Slide 18 text

Error Type Breakdown
Engine: %SUB / %INS / %DEL
● AssemblyAI Universal-1: 2.56 / 2.53 / 2.38
● Speechmatics: 2.47 / 3.98 / 1.7
● Whisper Large-V2: 2.88 / 4.07 / 2.44
● Microsoft: 3.07 / 4.09 / 2.29
● Rev.ai: 3.75 / 4.77 / 2.48
● DeepGram Nova-2: 3.27 / 4.09 / 4.11
● Google Latest-long: 5.14 / 3.49 / 6.59
● Whisper Large-V3: 4.49 / 10.8 / 4.01
● IBM: 9.98 / 5.17 / 4.01
KEY TAKEAWAYS
● When we break down the error rate into constituent error types, we can understand the different behaviors of each engine
● Speechmatics has the fewest substitutions and deletions, but more insertions
● AssemblyAI makes the fewest insertions but has more deletions than other high performing engines
● Whisper V3’s poor performance in overall error rate appears to be disproportionately insertion errors
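Because WER = (S + I + D) / N, the three components above add back up to the overall WER reported on slide 14 (up to rounding) for the top engines. For example, AssemblyAI Universal-1: 2.56 + 2.53 + 2.38 = 7.47, and Speechmatics: 2.47 + 3.98 + 1.7 = 8.15, matching their slide 14 error rates. A quick check:

```python
# Sanity check: %SUB + %INS + %DEL reconstructs the slide 14 WER.
assert round(2.56 + 2.53 + 2.38, 2) == 7.47   # AssemblyAI Universal-1
assert round(2.47 + 3.98 + 1.70, 2) == 8.15   # Speechmatics
```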

Slide 19

Slide 19 text

Poll Time! Which Industry Do You Think ASR Performed the Best On?
● Higher Education
● Goods and Services
● Media Publishing
● Cinematic
● eLearning
● Sports

Slide 20

Slide 20 text

WER & FER By Industry
Industry: Avg. WER (%) / Avg. FER (%)
● eLearning: 3.97 / 11.8
● Goods/Services: 5.05 / 13.4
● News/Networks: 5.25 / 14.9
● Publishing: 6.12 / 14.9
● Government: 6.92 / 16.4
● Higher Ed: 7.16 / 15.2
● Other: 7.77 / 16.2
● Associations: 8.0 / 16.8
● Tech: 9.69 / 19.8
● Sports: 10.2 / 20.2
● Cinematic: 10.2 / 21.2
Key Takeaways
● Performance is highly impacted by domain and the style of the content
● The best-performing industries tend to have
  ○ A single speaker
  ○ Little or no cross-talk
  ○ Minimal background audio
  ○ Professional recording environments
  ○ Scripted (rather than spontaneous) speech
Scores shown here are the average of the top 4 engines on that industry. While I’m not showing individual scores, I want to mention that Whisper does particularly poorly on Cinematic content. Because it is open source and trained on publicly available data, it seems limited by copyright in some domains.
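A minimal sketch of how a “top 4 engines per industry” average could be computed, assuming a hypothetical per-file results table; the column names and pandas workflow below are illustrative, not 3Play’s actual pipeline.

```python
import pandas as pd

# Hypothetical per-file results; each row is one video scored by one engine.
results = pd.DataFrame({
    "industry": ["eLearning", "eLearning", "Sports", "Sports"],
    "engine":   ["EngineA", "EngineB", "EngineA", "EngineB"],
    "wer":      [3.5, 4.4, 9.8, 10.6],
})

# Average WER per engine within each industry, then keep the 4 best
# (lowest-WER) engines per industry and average them.
per_engine = results.groupby(["industry", "engine"])["wer"].mean()
top4_avg = per_engine.groupby(level="industry").apply(lambda s: s.nsmallest(4).mean())
print(top4_avg)
```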

Slide 21

Slide 21 text

ASR Hallucinations 👀👀👀
“Whisper’s greatest flaw seems to be its tendency to sometimes ‘hallucinate’ additional speech that doesn’t appear in the original audio sample. The hallucinations sometimes look very credible if you aren’t listening to the audio. They are usually sensible and on-topic, grammatically correct sentences. This would make viewing the captions as a Deaf/HoH user really confusing. If auto-captions are nonsensical, it’s clear they are making a mistake, but with these, you could easily assume the mistakes are what is actually being said.
Whisper’s scores don’t adequately penalize hallucinations, in my opinion. Hallucinations will show up as errors, but an area where the text was completely invented may still get as low as a 50% error rate (rather than 100%) because of common pronouns, function words, and punctuation lining up with the real text.”

Slide 22

Slide 22 text

Common Hallucination Types
● Repeating a word tens to hundreds of times
  ○ “thank you thank you thank you…”
● Reproducing a large section of the transcript from elsewhere in the audio
● Plausible but made-up speech
  ○ “Like and subscribe”
● Grammatically reasonable but meaningless
  ○ “We ate from Amsterdam to Amsterdam”
Truth: “Thanks for watching. Please subscribe to my channel.”
Hallucination: “Thanks Thanks for for watching watching.”

Slide 23

Slide 23 text

TRUTH: the mysteries of the universe in a
WHISPER: the southeastern part of the state it’s a

Slide 24

Slide 24 text

Frequency of Hallucinations
Hallucinations are ASR errors which have no basis in the original audio.
Detecting Hallucinations (a rough sketch of this heuristic follows below)
● Sequences of insertion or substitution errors (at least 4 words long)
● Truth transcript does not indicate lyrics or other-language speech in the area (i.e. [MUSIC PLAYING])
● Shares few letters (or sounds) with the truth transcript in the area
● Another engine without hallucinations did not also have a similar number of errors in the area
● Estimate “false positives” of this heuristic by running it on a non-hallucinating engine and subtracting the rate of hallucination it detects from the other engines’ results
Engines with Hallucinations
● While AssemblyAI’s announcement of Universal-1 implied they had a small number of hallucinations, we didn’t find evidence that they hallucinated on our data
● Whisper V2 hallucinates on about 20% of files
● Whisper V3 hallucinates on about 57% of files
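Here is a minimal sketch of the run-detection part of the heuristic described above, not 3Play’s actual implementation. The alignment format, thresholds, and function name are assumptions, and the [MUSIC PLAYING] filter and cross-engine comparison steps are omitted.

```python
from difflib import SequenceMatcher

def detect_hallucination_runs(alignment, min_run=4, max_overlap=0.3):
    """Flag runs of at least `min_run` consecutive insertion/substitution
    errors whose hypothesis text shares few characters with the reference
    text at the same position.

    `alignment` is a list of (op, ref_word, hyp_word) tuples with op in
    {"ok", "sub", "ins", "del"}, e.g. from a word-level edit-distance
    alignment. Returns a list of (start_index, end_index) spans.
    """
    flagged, run = [], []
    # Append a sentinel "ok" op so the final run is flushed.
    for idx, (op, ref_word, hyp_word) in enumerate(alignment + [("ok", "", "")]):
        if op in ("sub", "ins"):
            run.append(idx)
            continue
        if len(run) >= min_run:
            ref_text = " ".join(alignment[i][1] or "" for i in run)
            hyp_text = " ".join(alignment[i][2] or "" for i in run)
            # Few shared letters with the truth transcript in this area.
            if SequenceMatcher(None, ref_text, hyp_text).ratio() <= max_overlap:
                flagged.append((run[0], run[-1]))
        run = []
    return flagged

# Example: four inserted words with no basis in the reference are flagged.
alignment = [
    ("ok", "thanks", "thanks"),
    ("ins", "", "thanks"),
    ("ins", "", "for"),
    ("ins", "", "for"),
    ("ins", "", "watching"),
    ("ok", "for", "for"),
    ("ok", "watching", "watching"),
]
print(detect_hallucination_runs(alignment))  # -> [(1, 4)]
```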

Slide 25

Slide 25 text

Key Findings (TL;DR)
Source Material Matters: It’s clear that results are still heavily dependent on audio quality and content difficulty. Most improvements are driven by training techniques, not changes to technology.
Unprecedented Levels of Data Don’t Guarantee Success: The latest Whisper model (V3) is much worse than the previous one (V2). They retrained the model, and in the real world it performs much worse.
Hallucination? What is it about Whisper’s model that causes it to hallucinate completely made-up content? Does this have to do with their scaled supervised learning approach?
Use Case Matters: These engines are ultimately trained for different use cases. Understanding your use case and which engine best suits it is critical to producing the highest quality.
Still Not Good Enough: It’s clear that ASR is still far from good enough for compliance, where 99%+ accuracy is required to provide an equal experience.

Slide 26

Slide 26 text

What This Means For You
While technology continues to improve, there is still a significant leap to real accuracy from even the best speech recognition engines, making humans a crucial part of creating accurate captions.

Slide 27

Slide 27 text

Common Causes of ASR Errors:
Word Errors
● Multiple speakers or overlapping speech
● Background noise
● Poor audio quality
● False starts
● Acoustic errors
● “Function” words
Formatting Errors
● Speaker labels
● Punctuation
● Grammar
● Numbers
● Non-speech elements
● [INAUDIBLE] tags

Slide 28

Slide 28 text

Formatting Errors: Incorrect punctuation can change the meaning of language tremendously.

Slide 29

Slide 29 text

Function Words: This example illustrates a very common ASR error. Although seemingly small, the change completely reverses the meaning: “I can’t attend the meeting.” vs. “I can attend the meeting.”

Slide 30

Slide 30 text

Complex Vocabulary: These examples of names and complex vocabulary require human expertise & knowledge. In each case, the truth is on the left, and the ASR output is on the right.

Slide 31

Slide 31 text

Remember - Errors Add Up Quickly … At 85% Accuracy, 1 In 7 Words Is Incorrect.

Slide 32

Slide 32 text

Quality Matters.

Slide 33

Slide 33 text

So, To Recap:

Slide 34

Slide 34 text

Speechmatics is no longer the clear leader. Whisper and Assembly AI entered last year and were right there with SMX. Assembly AI has become even more competitive than last year.

Slide 35

Slide 35 text

Whisper & Assembly AI are differentiating themselves further this year by what already made them unique.

Slide 36

Slide 36 text

The best engines can achieve up to 93% accuracy … for non-specialized content with great audio quality.

Slide 37

Slide 37 text

There’s still a long way to go to replace humans.

Slide 38

Slide 38 text

Be First to See the 2024 Report
Here’s how:
● Scan the QR code or visit go.3playmedia.com/rs-2024-asr
● Fill out the form
● You’ll receive an email with a direct link to the report when it’s published!

Slide 39

Slide 39 text

Thank You! What Questions Do You Have?
STATE OF ASR: go.3playmedia.com/rs-2024-asr
3PLAY MEDIA: www.3playmedia.com | @3playmedia
Elisa Lewis (She/Her) [email protected]
Tessa Kettelberger (She/Her) [email protected]