
End-to-End Automatic Speech Recognition Running on Edge Devices

Motoi Omachi (Yahoo! JAPAN / Science Division 1, Science Group, Technology Group / Engineer)

https://tech-verse.me/ja/sessions/163
https://tech-verse.me/en/sessions/163
https://tech-verse.me/ko/sessions/163

Tech-Verse2022

November 17, 2022


Transcript

1. Self introduction Motoi OMACHI - Joined Yahoo Japan Corporation in 2016 - Software engineer working on automatic speech recognition (ASR) - Research: end-to-end ASR; presented at top conferences on speech processing and natural language processing (ICASSP 2022, NAACL 2021) - Development: server-side ASR and the on-device SDK; leads on-device ASR SDK development
2. Overview • We released an end-to-end automatic speech recognition (E2E ASR) engine that runs entirely on edge devices • This presentation introduces • The current ASR system (YJVOICE) • Popular E2E ASR techniques • How we addressed the challenges of running ASR on edge devices
3. YJVOICE • Has been developed since 2011 • Used in many Yahoo Japan applications • Supports a vocabulary of more than 1.5 million words
4. Issues of current YJVOICE Stability: ASR stops working when the network connection is lost. User privacy: users who do not want to upload their speech are unwilling to use our ASR system. Latency: getting results takes a long time in environments with slow network connections.
8. DNN-HMM hybrid ASR [Figure: Speech → Feature extraction → Voice activity detection → Decoder → Transcription. The decoder combines an acoustic model (a DNN scoring phones such as a, sh, i, ta), a pronunciation dictionary (e.g., 足 → a shi, 明日 → a shi ta, 下 → shi ta), and a language model over word sequences (明日, は, 晴れ, 荒れ, れ, ます, …).]
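As a rough illustration of how the decoder combines these three knowledge sources, here is a minimal toy sketch in Python. The dictionary entries come from the slide; the acoustic and language model scores are invented toy values, and scoring one frame per phone is a simplification, not how the YJVOICE decoder actually works.

```python
import math

# Toy pronunciation dictionary (from the slide): word -> phone sequence.
lexicon = {
    "足": ["a", "shi"],          # ashi (foot)
    "明日": ["a", "shi", "ta"],  # ashita (tomorrow)
    "下": ["shi", "ta"],         # shita (under)
}

# Per-frame acoustic log-probabilities for each phone, as the acoustic
# model (a DNN) would emit them. Toy values for three frames.
am_logp = [
    {"a": -0.1, "shi": -2.5, "ta": -3.0},
    {"a": -2.0, "shi": -0.2, "ta": -2.8},
    {"a": -3.1, "shi": -1.9, "ta": -0.3},
]

# Toy unigram language model log-probabilities.
lm_logp = {"足": math.log(0.2), "明日": math.log(0.5), "下": math.log(0.3)}

def score(word):
    """Combine acoustic and language model scores for one word,
    assuming one frame per phone for simplicity."""
    phones = lexicon[word]
    am = sum(am_logp[i][p] for i, p in enumerate(phones))
    return am + lm_logp[word]

# The decoder picks the word whose phone sequence best explains the audio.
best = max(lexicon, key=score)
print(best)  # -> 明日 (the "a shi ta" frames plus the LM favor this word)
```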
17. End-to-end (E2E) ASR [Figure: Speech → Feature extraction → Voice activity detection → End-to-end model (a single neural network) → Transcription (明 日 …).]
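To make the "single neural network" idea concrete, here is a minimal sketch of greedy CTC-style decoding, one common way an encoder-based E2E model turns per-frame outputs directly into text. The vocabulary and logits are toy values, not the released SDK.

```python
import numpy as np

BLANK = 0
vocab = {0: "<blank>", 1: "明", 2: "日"}

# Per-frame logits from the network (T frames x V tokens). Toy values.
logits = np.array([
    [0.1, 2.0, 0.3],   # frame 0 -> 明
    [0.2, 1.8, 0.1],   # frame 1 -> 明 (repeat, collapsed)
    [2.5, 0.1, 0.2],   # frame 2 -> blank
    [0.1, 0.2, 2.2],   # frame 3 -> 日
])

def greedy_ctc_decode(logits):
    """Pick the best token per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)
    out, prev = [], BLANK
    for t in best:
        if t != BLANK and t != prev:
            out.append(vocab[t])
        prev = t
    return "".join(out)

print(greedy_ctc_decode(logits))  # -> 明日
```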
18. Pros of E2E ASR over DNN-HMM hybrid ASR Simplicity • Less expert knowledge is required for implementation Smaller model size • The model can be compressed to roughly 1/100 the size of a DNN-HMM hybrid ASR model
21. Popular models for E2E ASR [Figure: three architectures. Encoder-based: Encoder → Softmax → 明 日 の 天 気. RNN-Transducer: Encoder plus a Prediction Network fed with the previous outputs (明 日 の 天), combined by a Joint Network → Softmax → 明 日 の 天 気. Attention-based Encoder Decoder: Encoder → Attention Decoder → 明 日 の 天 気.]
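A structural sketch of the RNN-Transducer from the figure, written in PyTorch with illustrative layer types and sizes; it shows how the three components fit together, not the shipped production model.

```python
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=3000, hidden=256):
        super().__init__()
        # Encoder: consumes acoustic features frame by frame (streamable).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Prediction network: consumes previously emitted tokens,
        # acting like a language model inside the E2E model.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: combines both and outputs token logits
        # (including the blank symbol) for every (frame, token) pair.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size + 1),
        )

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                   # (B, T, H)
        pred, _ = self.predictor(self.embed(tokens))   # (B, U, H)
        # Broadcast over the time (T) and label (U) axes.
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))   # (B, T, U, V+1)
```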
25. Model comparison

Model | Streaming | Accuracy
Encoder-based (CTC) | ○ | △
RNN-Transducer (RNN-T) | ◎ | ○
Attention-based Encoder Decoder (AED) | × | ○

Encoder-based -> Released in April 2022
RNN-Transducer -> Released in October 2022
26. SLA Supported OS • iOS 12.0+ • Android 5.0+ Real time factor (RTF) • Less than 1.0 (confirmed on iPhone 11 Pro) Accuracy • Close to the server-side ASR → next slide
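The real time factor is processing time divided by audio duration, so an RTF below 1.0 means recognition keeps up with the incoming audio. A minimal measurement sketch, where `decode_fn` is a hypothetical stand-in for the recognizer call:

```python
import time

def real_time_factor(decode_fn, audio, audio_seconds):
    """RTF = processing time / audio duration.
    RTF < 1.0 means the recognizer runs faster than real time."""
    start = time.perf_counter()
    decode_fn(audio)
    return (time.perf_counter() - start) / audio_seconds
```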
30. Accuracy [Chart: character error rate (%) of Encoder-based (Apr. 2022), RNN-T (Sept. 2022), and the DNN-HMM hybrid (server-side ASR). The encoder-based model trails the server-side ASR by 4.7 points; the RNN-T is almost identical to it (within 0.1 points).]
33. Research topics Out-of-vocabulary (OOV) problem • It is hard to predict words that do not appear in the training data (e.g., buzzwords, addresses, and proper nouns) Joint prediction of graphemes and pronunciations • Popular E2E ASR outputs graphemes only • Jointly predicting graphemes and pronunciations is useful (e.g., for distinguishing heteronyms in queries)
37. Data augmentation [1] using untranscribed speech [Figure: the DNN-HMM hybrid ASR produces automatic transcriptions of untranscribed speech; these transcriptions are then used as additional training data for the E2E ASR model.] [1] Y. He et al., “Streaming End-to-end Speech Recognition for Mobile Devices,” Proc. ICASSP 2019.
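A minimal sketch of the augmentation flow in [1]: the existing DNN-HMM hybrid ASR transcribes untranscribed speech, and the automatic transcriptions join the E2E training set. All names below are hypothetical stand-ins, not the actual YJVOICE components.

```python
def hybrid_asr_transcribe(speech):
    """Stand-in for the server-side DNN-HMM hybrid recognizer."""
    return "明日は晴れ"  # toy automatic transcription

def augment(human_labeled, untranscribed):
    """Return human-labeled pairs plus (speech, automatic transcription)
    pairs produced by the hybrid recognizer."""
    pairs = list(human_labeled)
    for speech in untranscribed:
        pairs.append((speech, hybrid_asr_transcribe(speech)))
    return pairs

training_data = augment(
    human_labeled=[("speech_001.wav", "明日の天気")],
    untranscribed=["speech_002.wav", "speech_003.wav"],
)
# The E2E model is then trained on `training_data` as usual.
```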
38. ASR result on OOV condition [Chart: character error rate (%) by training data. Adding automatic transcriptions to the human transcriptions improves CER by 19.9 points over using human transcriptions alone.]
39. Research topics Out-of-vocabulary (OOV) problem • It is hard to predict words that do not appear in the training data (e.g., buzzwords, addresses, and proper nouns) Joint prediction of graphemes and pronunciations • Popular E2E ASR outputs graphemes only • Jointly predicting graphemes and pronunciations is useful (e.g., for distinguishing heteronyms in queries) M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
40. Pipeline system (E2E ASR w/ post-processing) [Figure: Speech → E2E ASR → transcription 日本橋までの行き方 → NLP-based phoneme prediction → grapheme:phoneme pairs 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ.] • NLP-based post-processing is affected by ASR errors • NLP-based post-processing requires additional memory and computation A sketch of this flow follows.
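A runnable sketch of the pipeline approach, with hypothetical stand-ins for the E2E recognizer and the NLP-based phoneme predictor (e.g., a morphological analyzer that returns surface/reading pairs):

```python
def e2e_asr(speech):
    """Stand-in E2E recognizer: returns graphemes only."""
    return "日本橋までの行き方"

def nlp_phoneme_prediction(text):
    """Stand-in post-processor: assigns a reading to each word.
    A real analyzer would segment `text` itself; this is a toy table."""
    return [("日本橋", "ニホンバシ"), ("まで", "マデ"), ("の", "ノ"),
            ("行き", "イキ"), ("方", "カタ")]

transcription = e2e_asr("speech.wav")
pairs = nlp_phoneme_prediction(transcription)
print(" ".join(f"{g}:{p}" for g, p in pairs))
# -> 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ
# Any error in `transcription` propagates into the phoneme prediction,
# and the post-processor adds its own memory and compute cost.
```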
41. Joint prediction of graphemes and phonemes [2] [Figure: instead of running NLP-based phoneme prediction after E2E ASR, a single E2E model directly emits one sequence interleaving each word's graphemes and phonemes: 日 本 橋 ニ ホ ン バ シ ま で マ デ の ノ 行 き イ キ 方 カ タ → 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ.] [2] M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
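To illustrate how such a joint output can be consumed, here is a toy parser that recovers grapheme/phoneme pairs from the interleaved sequence in the figure. It assumes phonemes are katakana runs and graphemes are not, which holds for the slide's example; the actual model's output format (e.g., explicit separator tokens) may differ.

```python
import re

KATAKANA = r"[ァ-ヶー]+"  # katakana plus the prolonged sound mark

def split_joint_output(tokens):
    """Split an interleaved grapheme/phoneme token sequence into
    grapheme/phoneme pairs by alternating non-katakana/katakana runs."""
    text = "".join(tokens)
    pairs = re.findall(rf"([^ァ-ヶー]+)({KATAKANA})", text)
    return [f"{g}/{p}" for g, p in pairs]

out = list("日本橋ニホンバシまでマデのノ行きイキ方カタ")
print(split_joint_output(out))
# -> ['日本橋/ニホンバシ', 'まで/マデ', 'の/ノ', '行き/イキ', '方/カタ']
```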
42. Examples
REF: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
Pipeline: ピッチ/ピッチ と/ト スペクトラスペクトル/スペクトラスペクトル 包絡/ホウラク
Proposed: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
REF: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
Pipeline: その/ソノ 後/アト 音楽/オンガク が/ガ 全盛/ゼンセー
Proposed: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
We will apply the proposed strategy to the RNN-T model
45. Other topics Self-supervised learning (SSL) • Performance can be improved by using large amounts of untranscribed speech Lower latency • Users want to get results as soon as possible Papers from the Yahoo! JAPAN Speech Team on these topics have been accepted at top conferences (three papers at ICASSP 2022 / INTERSPEECH 2022)
49. Recent publications from Yahoo! JAPAN Speech Team
2020
• Y. Fujita et al., “Attention-based ASR with Lightweight and Dynamic Convolutions,” Proc. ICASSP 2020.
• Y. Fujita et al., “Insertion-Based Modeling for End-to-End Automatic Speech Recognition,” Proc. INTERSPEECH 2020.
• X. Chang et al., “End-to-End ASR with Adaptive Span Self-Attention,” Proc. INTERSPEECH 2020. (co-authors Y. Fujita and M. Omachi)
2021
• M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
• Y. Fujita et al., “Toward Streaming ASR with Non-Autoregressive Insertion-based Model,” Proc. INTERSPEECH 2021.
• T. Maekaku et al., “Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021,” Proc. INTERSPEECH 2021.
• T. Wang et al., “Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models,” Proc. INTERSPEECH 2021. (co-author Y. Fujita)
• Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” Proc. ASRU 2021. (co-author Y. Fujita)
• X. Chang et al., “An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition,” Proc. ASRU 2021. (co-author T. Maekaku)
• F. Boyer et al., “A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies,” Proc. ASRU 2021. (co-authors Y. Shinohara and T. Ishii)
2022
• M. Omachi et al., “Non-autoregressive End-to-end Automatic Speech Recognition Incorporating Downstream Natural Language Processing,” Proc. ICASSP 2022.
• T. Maekaku et al., “An Exploration of HuBERT with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion,” Proc. ICASSP 2022.
• T. Maekaku et al., “Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR,” Proc. INTERSPEECH 2022.
• Y. Shinohara et al., “Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition,” Proc. INTERSPEECH 2022.
• X. Chang et al., “End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation,” Proc. INTERSPEECH 2022. (co-authors Y. Fujita and T. Maekaku)
50. Development of the E2E ASR SDK • We developed an ASR engine that runs entirely on edge devices, solving the issues of client-server-based ASR (e.g., user privacy and the long latency caused by communication) • Its accuracy is approaching that of the client-server-based ASR • We introduced research topics on the OOV problem and on the joint prediction of graphemes and pronunciations