
End-to-End Automatic Speech Recognition Running on Edge Devices

Motoi Omachi (Yahoo! JAPAN / Science Division 1, Science Group, Technology Group / Engineer)

https://tech-verse.me/ja/sessions/163
https://tech-verse.me/en/sessions/163
https://tech-verse.me/ko/sessions/163

Tech-Verse2022

November 17, 2022


Transcript

1. Self introduction Motoi OMACHI - Joined Yahoo Japan Corporation in 2016 - Software engineer working on automatic speech recognition (ASR) - Research: end-to-end ASR; presented at top conferences on speech processing and natural language processing (ICASSP 2022, NAACL 2021) - Development: server-side ASR and the on-device SDK; leads on-device ASR SDK development
2. Overview • We released an end-to-end automatic speech recognition (E2E ASR) engine that runs entirely on edge devices • This presentation introduces • The current ASR system (YJVOICE) • Popular E2E ASR techniques • How we addressed the challenges of running ASR on edge devices
3. YJVOICE • Has been developed since 2011 • Used in many Yahoo Japan applications • Supports a vocabulary of more than 1.5 million words
4. Issues of current YJVOICE Stability: ASR stops working when the network connection is lost. User privacy: users who do not want to upload their speech are unwilling to use our ASR system. Latency: getting results takes a long time in environments with slow network connections.
8. DNN-HMM hybrid ASR [Figure: Speech → Feature extraction → Voice activity detection → Decoder → Transcription. The decoder combines an acoustic model (a DNN scoring phones such as a, sh, i, ta), a pronunciation dictionary (e.g., 足 → a shi, 明日 → a shi ta, 下 → shi ta), and a language model over word sequences (明日, は, 晴れ, 荒れ, れ, ます, …).]
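As a rough illustration of how the decoder combines these three knowledge sources, here is a minimal toy sketch in Python. The dictionary entries come from the slide; the acoustic and language model scores are invented toy values, and scoring one frame per phone is a simplification, not how the YJVOICE decoder actually works.

```python
import math

# Toy pronunciation dictionary (from the slide): word -> phone sequence.
lexicon = {
    "足": ["a", "shi"],          # ashi (foot)
    "明日": ["a", "shi", "ta"],  # ashita (tomorrow)
    "下": ["shi", "ta"],         # shita (under)
}

# Per-frame acoustic log-probabilities for each phone, as the acoustic
# model (a DNN) would emit them. Toy values for three frames.
am_logp = [
    {"a": -0.1, "shi": -2.5, "ta": -3.0},
    {"a": -2.0, "shi": -0.2, "ta": -2.8},
    {"a": -3.1, "shi": -1.9, "ta": -0.3},
]

# Toy unigram language model log-probabilities.
lm_logp = {"足": math.log(0.2), "明日": math.log(0.5), "下": math.log(0.3)}

def score(word):
    """Combine acoustic and language model scores for one word,
    assuming one frame per phone for simplicity."""
    phones = lexicon[word]
    am = sum(am_logp[i][p] for i, p in enumerate(phones))
    return am + lm_logp[word]

# The decoder picks the word whose phone sequence best explains the audio.
best = max(lexicon, key=score)
print(best)  # -> 明日 (the "a shi ta" frames plus the LM favor this word)
```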
17. End-to-end (E2E) ASR [Figure: Speech → Feature extraction → Voice activity detection → End-to-end model (a single neural network) → Transcription (明 日 …).]
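To make the "single neural network" idea concrete, here is a minimal sketch of greedy CTC-style decoding, one common way an encoder-based E2E model turns per-frame outputs directly into text. The vocabulary and logits are toy values, not the released SDK.

```python
import numpy as np

BLANK = 0
vocab = {0: "<blank>", 1: "明", 2: "日"}

# Per-frame logits from the network (T frames x V tokens). Toy values.
logits = np.array([
    [0.1, 2.0, 0.3],   # frame 0 -> 明
    [0.2, 1.8, 0.1],   # frame 1 -> 明 (repeat, collapsed)
    [2.5, 0.1, 0.2],   # frame 2 -> blank
    [0.1, 0.2, 2.2],   # frame 3 -> 日
])

def greedy_ctc_decode(logits):
    """Pick the best token per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)
    out, prev = [], BLANK
    for t in best:
        if t != BLANK and t != prev:
            out.append(vocab[t])
        prev = t
    return "".join(out)

print(greedy_ctc_decode(logits))  # -> 明日
```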
18. Pros of E2E ASR over DNN-HMM hybrid ASR Simplicity • Less expert knowledge is required for implementation Smaller model size • The model can be compressed to roughly 1/100 the size of a DNN-HMM hybrid ASR model
21. Popular models for E2E ASR [Figure: three architectures. Encoder-based: Encoder → Softmax → 明 日 の 天 気. RNN-Transducer: Encoder plus a Prediction Network fed with the previous outputs (明 日 の 天), combined by a Joint Network → Softmax → 明 日 の 天 気. Attention-based Encoder Decoder: Encoder → Attention Decoder → 明 日 の 天 気.]
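A structural sketch of the RNN-Transducer from the figure, written in PyTorch with illustrative layer types and sizes; it shows how the three components fit together, not the shipped production model.

```python
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=3000, hidden=256):
        super().__init__()
        # Encoder: consumes acoustic features frame by frame (streamable).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Prediction network: consumes previously emitted tokens,
        # acting like a language model inside the E2E model.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: combines both and outputs token logits
        # (including the blank symbol) for every (frame, token) pair.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size + 1),
        )

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                   # (B, T, H)
        pred, _ = self.predictor(self.embed(tokens))   # (B, U, H)
        # Broadcast over the time (T) and label (U) axes.
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))   # (B, T, U, V+1)
```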
25. Model comparison

Model | Streaming | Accuracy
Encoder-based (CTC) | ○ | △
RNN-Transducer (RNN-T) | ◎ | ○
Attention-based Encoder Decoder (AED) | × | ○

Encoder-based -> Released in April 2022
RNN-Transducer -> Released in October 2022
26. SLA Supported OS • iOS 12.0+ • Android 5.0+ Real time factor (RTF) • Less than 1.0 (confirmed on iPhone 11 Pro) Accuracy • Close to the server-side ASR → next slide
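The real time factor is processing time divided by audio duration, so an RTF below 1.0 means recognition keeps up with the incoming audio. A minimal measurement sketch, where `decode_fn` is a hypothetical stand-in for the recognizer call:

```python
import time

def real_time_factor(decode_fn, audio, audio_seconds):
    """RTF = processing time / audio duration.
    RTF < 1.0 means the recognizer runs faster than real time."""
    start = time.perf_counter()
    decode_fn(audio)
    return (time.perf_counter() - start) / audio_seconds
```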
30. Accuracy [Chart: character error rate (%) of Encoder-based (Apr. 2022), RNN-T (Sept. 2022), and the DNN-HMM hybrid (server-side ASR). The encoder-based model trails the server-side ASR by 4.7 points; the RNN-T is almost identical to it (within 0.1 points).]
33. Research topics Out-of-vocabulary (OOV) problem • It is hard to predict words that do not appear in the training data (e.g., buzzwords, addresses, and proper nouns) Joint prediction of graphemes and pronunciations • Popular E2E ASR outputs graphemes only • Jointly predicting graphemes and pronunciations is useful (e.g., for distinguishing heteronyms in queries)
37. Data augmentation [1] using untranscribed speech [Figure: the DNN-HMM hybrid ASR produces automatic transcriptions of untranscribed speech; these transcriptions are then used as additional training data for the E2E ASR model.] [1] Y. He et al., “Streaming End-to-end Speech Recognition for Mobile Devices,” Proc. ICASSP 2019.
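A minimal sketch of the augmentation flow in [1]: the existing DNN-HMM hybrid ASR transcribes untranscribed speech, and the automatic transcriptions join the E2E training set. All names below are hypothetical stand-ins, not the actual YJVOICE components.

```python
def hybrid_asr_transcribe(speech):
    """Stand-in for the server-side DNN-HMM hybrid recognizer."""
    return "明日は晴れ"  # toy automatic transcription

def augment(human_labeled, untranscribed):
    """Return human-labeled pairs plus (speech, automatic transcription)
    pairs produced by the hybrid recognizer."""
    pairs = list(human_labeled)
    for speech in untranscribed:
        pairs.append((speech, hybrid_asr_transcribe(speech)))
    return pairs

training_data = augment(
    human_labeled=[("speech_001.wav", "明日の天気")],
    untranscribed=["speech_002.wav", "speech_003.wav"],
)
# The E2E model is then trained on `training_data` as usual.
```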
38. ASR result on OOV condition [Chart: character error rate (%) by training data. Adding automatic transcriptions to the human transcriptions improves CER by 19.9 points over using human transcriptions alone.]
39. Research topics Out-of-vocabulary (OOV) problem • It is hard to predict words that do not appear in the training data (e.g., buzzwords, addresses, and proper nouns) Joint prediction of graphemes and pronunciations • Popular E2E ASR outputs graphemes only • Jointly predicting graphemes and pronunciations is useful (e.g., for distinguishing heteronyms in queries) M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
40. Pipeline system (E2E ASR w/ post-processing) [Figure: Speech → E2E ASR → transcription 日本橋までの行き方 → NLP-based phoneme prediction → grapheme:phoneme pairs 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ.] • NLP-based post-processing is affected by ASR errors • NLP-based post-processing requires additional memory and computation A sketch of this flow follows.
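A runnable sketch of the pipeline approach, with hypothetical stand-ins for the E2E recognizer and the NLP-based phoneme predictor (e.g., a morphological analyzer that returns surface/reading pairs):

```python
def e2e_asr(speech):
    """Stand-in E2E recognizer: returns graphemes only."""
    return "日本橋までの行き方"

def nlp_phoneme_prediction(text):
    """Stand-in post-processor: assigns a reading to each word.
    A real analyzer would segment `text` itself; this is a toy table."""
    return [("日本橋", "ニホンバシ"), ("まで", "マデ"), ("の", "ノ"),
            ("行き", "イキ"), ("方", "カタ")]

transcription = e2e_asr("speech.wav")
pairs = nlp_phoneme_prediction(transcription)
print(" ".join(f"{g}:{p}" for g, p in pairs))
# -> 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ
# Any error in `transcription` propagates into the phoneme prediction,
# and the post-processor adds its own memory and compute cost.
```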
41. Joint prediction of graphemes and phonemes [2] [Figure: instead of running NLP-based phoneme prediction after E2E ASR, a single E2E model directly emits one sequence interleaving each word's graphemes and phonemes: 日 本 橋 ニ ホ ン バ シ ま で マ デ の ノ 行 き イ キ 方 カ タ → 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ.] [2] M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
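To illustrate how such a joint output can be consumed, here is a toy parser that recovers grapheme/phoneme pairs from the interleaved sequence in the figure. It assumes phonemes are katakana runs and graphemes are not, which holds for the slide's example; the actual model's output format (e.g., explicit separator tokens) may differ.

```python
import re

KATAKANA = r"[ァ-ヶー]+"  # katakana plus the prolonged sound mark

def split_joint_output(tokens):
    """Split an interleaved grapheme/phoneme token sequence into
    grapheme/phoneme pairs by alternating non-katakana/katakana runs."""
    text = "".join(tokens)
    pairs = re.findall(rf"([^ァ-ヶー]+)({KATAKANA})", text)
    return [f"{g}/{p}" for g, p in pairs]

out = list("日本橋ニホンバシまでマデのノ行きイキ方カタ")
print(split_joint_output(out))
# -> ['日本橋/ニホンバシ', 'まで/マデ', 'の/ノ', '行き/イキ', '方/カタ']
```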
42. Examples
REF: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
Pipeline: ピッチ/ピッチ と/ト スペクトラスペクトル/スペクトラスペクトル 包絡/ホウラク
Proposed: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
REF: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
Pipeline: その/ソノ 後/アト 音楽/オンガク が/ガ 全盛/ゼンセー
Proposed: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
We will apply the proposed strategy to the RNN-T model
45. Other topics Self-supervised learning (SSL) • Performance can be improved by using large amounts of untranscribed speech Lower latency • Users want to get results as soon as possible Papers from the Yahoo! JAPAN Speech Team on these topics have been accepted at top conferences (three papers at ICASSP 2022 / INTERSPEECH 2022)
49. Recent publications from Yahoo! JAPAN Speech Team
2020
• Y. Fujita et al., “Attention-based ASR with Lightweight and Dynamic Convolutions,” Proc. ICASSP 2020.
• Y. Fujita et al., “Insertion-Based Modeling for End-to-End Automatic Speech Recognition,” Proc. INTERSPEECH 2020.
• X. Chang et al., “End-to-End ASR with Adaptive Span Self-Attention,” Proc. INTERSPEECH 2020. (co-authors Y. Fujita and M. Omachi)
2021
• M. Omachi et al., “End-to-end ASR to jointly predict transcriptions and linguistic annotations,” Proc. NAACL 2021.
• Y. Fujita et al., “Toward Streaming ASR with Non-Autoregressive Insertion-based Model,” Proc. INTERSPEECH 2021.
• T. Maekaku et al., “Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021,” Proc. INTERSPEECH 2021.
• T. Wang et al., “Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models,” Proc. INTERSPEECH 2021. (co-author Y. Fujita)
• Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” Proc. ASRU 2021. (co-author Y. Fujita)
• X. Chang et al., “An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition,” Proc. ASRU 2021. (co-author T. Maekaku)
• F. Boyer et al., “A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies,” Proc. ASRU 2021. (co-authors Y. Shinohara and T. Ishii)
2022
• M. Omachi et al., “Non-autoregressive End-to-end Automatic Speech Recognition Incorporating Downstream Natural Language Processing,” Proc. ICASSP 2022.
• T. Maekaku et al., “An Exploration of HuBERT with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion,” Proc. ICASSP 2022.
• T. Maekaku et al., “Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR,” Proc. INTERSPEECH 2022.
• Y. Shinohara et al., “Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition,” Proc. INTERSPEECH 2022.
• X. Chang et al., “End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation,” Proc. INTERSPEECH 2022. (co-authors Y. Fujita and T. Maekaku)
50. Development of the E2E ASR SDK • We developed an ASR engine that runs entirely on edge devices, solving the issues of client-server-based ASR (e.g., user privacy and the long latency caused by communication) • Its accuracy is approaching that of the client-server-based ASR • We introduced research topics on the OOV problem and on the joint prediction of graphemes and pronunciations