
End-to-End Automatic Speech Recognition Running on Edge Devices


Motoi Omachi (Yahoo! JAPAN / Science Division 1, Science Group, Technology Group / Engineer)

https://tech-verse.me/ja/sessions/163
https://tech-verse.me/en/sessions/163
https://tech-verse.me/ko/sessions/163

Tech-Verse2022

November 17, 2022



Transcript

  1. End-to-End Automatic Speech Recognition Running on Edge Devices (Motoi Omachi / Yahoo! JAPAN)
  2. Self-introduction: Motoi OMACHI
     - Joined Yahoo Japan Corporation in 2016
     - Software engineer working on automatic speech recognition (ASR)
     - Research: end-to-end ASR; presented at top conferences on speech processing and natural language processing (ICASSP 2022, NAACL 2021)
     - Development: server-side ASR and the on-device SDK; leads on-device ASR SDK development
  3. Overview
     - We released an end-to-end automatic speech recognition (E2E ASR) engine that runs entirely on edge devices
     - This presentation introduces:
       - the current ASR system (YJVOICE)
       - popular E2E ASR techniques
       - how we addressed some of the problems of running ASR on edge devices
  4. YJVOICE: Yahoo! JAPAN Speech Recognition

  5. Automatic speech recognition (ASR) [diagram: speech is fed into the ASR system, which outputs the transcription "Hello"]

  6. YJVOICE
     - Has been developed since 2011
     - Used in many Yahoo Japan applications
     - Supports a vocabulary of more than 1.5 million words
  7. Client-server-based ASR system [diagram: the user speaks into a smartphone, the audio is sent to the ASR server, and the transcription "Hello" is returned]

  8. Issues with the current YJVOICE
     - Stability: ASR does not work when the network connection is lost
     - User privacy: users who do not want to upload their speech will not use our ASR system
     - Latency: results arrive slowly in environments with poor network speed
  9. Recently released: on-device ASR! [diagram: recognition runs on the smartphone itself, with no server involved, and returns "Hello" to the user]
  10. Released in our smartphone application [screenshots comparing the client-server ASR and the on-device ASR]

  11. End-to-end (E2E) ASR

  12. DNN-HMM hybrid ASR [diagram: speech passes through voice activity detection and feature extraction; a decoder then combines three separately built components: an acoustic model that maps audio frames to phonemes (e.g., "a sh i ta"), a pronunciation dictionary that maps phoneme sequences to words (e.g., 足 "a shi", 明日 "a shi ta", 下 "shi ta"), and a language model over word sequences (candidates such as 明日, は, 晴れ, 荒れ, ます) to produce the transcription]
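The feature-extraction front end in this diagram is shared in spirit by the E2E system on the next slide: it typically converts the waveform into log-mel features. A minimal sketch with torchaudio, as an illustration of the general technique only; the deck does not specify YJVOICE's actual front end:

```python
# Log-mel feature extraction, the usual front end for both hybrid and E2E ASR.
# Parameter values are common defaults, not YJVOICE's actual settings.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # 16 kHz speech
    n_fft=400,          # 25 ms analysis window
    hop_length=160,     # 10 ms frame shift
    n_mels=80,          # 80 mel filterbank channels
)

waveform = torch.randn(1, 16000)         # stand-in for 1 s of audio
feats = torch.log(mel(waveform) + 1e-6)  # (1, 80, frames) log-mel features
```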
  13. End-to-end (E2E) ASR [diagram: speech passes through voice activity detection and feature extraction, and a single end-to-end neural network directly outputs the transcription, e.g., 明日… ("tomorrow…")]
  14. Pros of E2E ASR over DNN-HMM hybrid ASR
     - Simplicity: less expert knowledge is required for implementation
     - Smaller model size: the model can be compressed to roughly 1/100 the size of a DNN-HMM hybrid ASR model
  15. Popular models for E2E ASR [diagrams of three architectures: an encoder-based model (encoder followed by a softmax over tokens such as 明 日 の 天 気, "tomorrow's weather"); an RNN-Transducer (an encoder plus a prediction network over previously emitted tokens, combined by a joint network); and an attention-based encoder-decoder (an encoder attended to by an autoregressive decoder)]
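Of these three, the RNN-Transducer is the model behind the October 2022 release noted in the comparison that follows. A minimal skeleton of its three components, as a sketch for orientation only; layer types and sizes are invented here, not Yahoo! JAPAN's actual model:

```python
# A minimal RNN-Transducer skeleton in PyTorch. The (B, T, U, vocab+1) output
# would be trained with a transducer loss (e.g., torchaudio.functional.rnnt_loss).
import torch
import torch.nn as nn

class RNNT(nn.Module):
    def __init__(self, n_feats=80, vocab=3000, hidden=512):
        super().__init__()
        # Encoder: acoustic features -> high-level representations
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=4, batch_first=True)
        # Prediction network: previously emitted tokens -> LM-like state
        self.embed = nn.Embedding(vocab, hidden)
        self.pred_net = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: combines the two streams into token logits
        # (vocab + 1 because the transducer adds a blank symbol)
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab + 1),
        )

    def forward(self, feats, tokens):
        # feats: (B, T, n_feats); tokens: (B, U) previously emitted labels
        enc, _ = self.encoder(feats)                  # (B, T, H)
        pred, _ = self.pred_net(self.embed(tokens))   # (B, U, H)
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)  # (B, T, U, H)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)  # (B, T, U, H)
        return self.joint(torch.cat([t, u], dim=-1))  # (B, T, U, vocab + 1)
```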
  16. Model comparison
     - Encoder-based (CTC): streaming-friendly, but lower accuracy
     - RNN-Transducer (RNN-T): streaming-friendly, good accuracy
     - Attention-based encoder-decoder (AED): good accuracy, but not suited to streaming
     Encoder-based -> released in April 2022; RNN-Transducer -> released in October 2022
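Part of why the encoder-based (CTC) model streams so easily is that decoding can be as simple as a per-frame argmax followed by collapsing repeats and removing blanks. A minimal sketch, not the SDK's actual decoder:

```python
# Greedy CTC decoding: take the best token per frame, collapse repeats,
# then drop the blank symbol.
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """log_probs: (T, vocab) frame-level log-probabilities from the encoder."""
    out, prev = [], None
    for tok in log_probs.argmax(dim=-1).tolist():
        if tok != prev and tok != blank:  # collapse repeats, skip blanks
            out.append(tok)
        prev = tok
    return out
```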
  17. Released E2E ASR SDK and research topics

  18. SLA
     - Supported OS: iOS 12.0+, Android 5.0+
     - Real-time factor (RTF): less than 1.0 (confirmed on iPhone 11 Pro)
     - Accuracy: close to the server-side ASR (see the next slide)
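RTF is processing time divided by audio duration, so a value below 1.0 means recognition keeps pace with the speech as it arrives. A sketch of how it could be measured, with `recognize` as a hypothetical stand-in for the SDK call:

```python
# Real-time factor (RTF) = processing time / audio duration.
# `recognize` is a hypothetical on-device recognizer, not the SDK's real API.
import time

def measure_rtf(recognize, audio, audio_seconds: float) -> float:
    start = time.perf_counter()
    recognize(audio)
    return (time.perf_counter() - start) / audio_seconds

# e.g., 0.8 s of compute for a 1.0 s utterance gives RTF = 0.8
```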
  19. Accuracy [bar chart: character error rate (%) for the encoder-based model (Apr. 2022), RNN-T (Sept. 2022), and the DNN-HMM hybrid (server-side ASR); annotations mark a 4.7-point gap for the encoder-based model and "almost similar (within 0.1 points)" for RNN-T relative to the server-side system]
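For reference, character error rate is the edit distance between the hypothesis and the reference divided by the reference length. A self-contained sketch of the standard computation:

```python
# Character error rate (CER): Levenshtein distance between hypothesis and
# reference, divided by the reference length (assumed non-empty).
def cer(ref: str, hyp: str) -> float:
    d = list(range(len(hyp) + 1))  # edit distances against the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # minimum of deletion, insertion, and (mis)match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)

print(cer("明日の天気", "明日の転記"))  # 2 errors over 5 characters -> 0.4
```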
  20. Research topics
     - Out-of-vocabulary (OOV) problem: it is hard to predict words that do not appear in the training data (e.g., buzzwords, addresses, and proper nouns)
     - Joint prediction of graphemes and pronunciations: popular E2E ASR models output graphemes only, so jointly predicting graphemes and pronunciations is useful (e.g., for distinguishing queries that contain heteronyms)
  21. Model training [diagram: the E2E ASR model is trained on speech paired with reference transcriptions, minimizing the error between the model output and the transcription]
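A single supervised training step of the kind this diagram describes might look like the following sketch; CTC loss is used here as one concrete choice, since the deck does not state which loss the released models were trained with:

```python
# One supervised training step. `model` is a hypothetical encoder that
# returns (B, T, vocab) logits; CTC loss is one possible training objective.
import torch.nn.functional as F

def train_step(model, optimizer, feats, feat_lens, targets, target_lens):
    log_probs = model(feats).log_softmax(dim=-1)   # (B, T, vocab)
    loss = F.ctc_loss(log_probs.transpose(0, 1),   # CTC expects (T, B, vocab)
                      targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```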
  22. Data augmentation [1] using untranscribed speech [diagram: the existing DNN-HMM hybrid ASR automatically transcribes untranscribed speech, and these automatic transcriptions are added to the human transcriptions when training the E2E ASR model]
     [1] Y. He et al., "Streaming End-to-end Speech Recognition for Mobile Devices," Proc. ICASSP 2019.
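The augmentation scheme amounts to a pseudo-labeling loop: the hybrid system acts as a teacher for the E2E student. A sketch under that assumption, with a hypothetical `hybrid_asr.transcribe` call:

```python
# Pseudo-labeling as in the diagram: the existing hybrid ASR labels
# untranscribed speech. `hybrid_asr.transcribe` is a hypothetical stand-in
# for the server-side recognizer.
def build_training_set(labeled_pairs, untranscribed_audio, hybrid_asr):
    augmented = list(labeled_pairs)  # (audio, human transcription) pairs
    for audio in untranscribed_audio:
        text = hybrid_asr.transcribe(audio)  # automatic transcription
        augmented.append((audio, text))      # used as if it were a label
    return augmented
```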
  23. ASR results under the OOV condition [bar chart: character error rate (%) by training data, transcriptions only vs. transcriptions plus automatic transcriptions; adding automatic transcriptions yields a 19.9-point improvement]
  24. Research topic: joint prediction of graphemes and pronunciations
     - Popular E2E ASR models output graphemes only
     - Joint prediction of graphemes and pronunciations is useful (e.g., for distinguishing queries that contain heteronyms)
     M. Omachi et al., "End-to-end ASR to jointly predict transcriptions and linguistic annotations," Proc. NAACL 2021
  25. [diagram: a standard E2E ASR model maps speech to the transcription 日本橋までの行き方 ("how to get to Nihonbashi")]

  26. Pipeline system (E2E ASR with post-processing) [diagram: the E2E ASR outputs the transcription 日本橋までの行き方, and an NLP-based post-processor then predicts the pronunciation of each word: 日本橋:ニホンバシ まで:マデ の:ノ 行き:イキ 方:カタ (grapheme:phoneme)]
     - NLP-based post-processing is affected by ASR errors
     - NLP-based post-processing requires additional memory and computation
  27. Joint prediction of graphemes and phonemes [2] [diagram: instead of a separate post-processor, a single E2E ASR model outputs one sequence that interleaves each word's graphemes with its phonemes: 日 本 橋 ニ ホ ン バ シ ま で マ デ の ノ 行 き イ キ 方 カ タ]
     [2] M. Omachi et al., "End-to-end ASR to jointly predict transcriptions and linguistic annotations," Proc. NAACL 2021.
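The interleaved output format can be produced from word-level (grapheme, phoneme) pairs as in the following sketch; the exact tokenization in [2] may differ:

```python
# Building an interleaved grapheme/phoneme target sequence: each word's
# graphemes are followed by its phonemes.
def interleave(words):
    """words: list of (graphemes, phonemes) pairs for one utterance."""
    seq = []
    for graphemes, phonemes in words:
        seq.extend(graphemes)  # "日本橋" -> 日, 本, 橋
        seq.extend(phonemes)   # "ニホンバシ" -> ニ, ホ, ン, バ, シ
    return seq

print(interleave([("日本橋", "ニホンバシ"), ("まで", "マデ")]))
# ['日', '本', '橋', 'ニ', 'ホ', 'ン', 'バ', 'シ', 'ま', 'で', 'マ', 'デ']
```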
  28. Examples (grapheme/phoneme pairs; the second example hinges on the heteronym 後, which can be read ゴ "go" or アト "ato")
     REF: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
     Pipeline: ピッチ/ピッチ と/ト スペクトラスペクトル/スペクトラスペクトル 包絡/ホウラク
     Proposed: ピッチ/ピッチ と/ト スペクトラ/スペクトラ スペクトル/スペクトル 包絡/ホウラク
     REF: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
     Pipeline: その/ソノ 後/アト 音楽/オンガク が/ガ 全盛/ゼンセー
     Proposed: その/ソノ 後/ゴ 音楽/オンガク が/ガ 全盛/ゼンセー
     We will apply the proposed strategy to the RNN-T model
  29. Other topics
     - Self-supervised learning (SSL): performance can be improved using large amounts of untranscribed speech
     - Lower latency: users want results as soon as possible
     Papers from the Yahoo! JAPAN speech team on these topics have been accepted at top conferences (three papers at ICASSP 2022 / INTERSPEECH 2022)
  30. Recent publications from the Yahoo! JAPAN Speech Team
     2020
     - Y. Fujita et al., "Attention-based ASR with Lightweight and Dynamic Convolutions," Proc. ICASSP 2020.
     - Y. Fujita et al., "Insertion-Based Modeling for End-to-End Automatic Speech Recognition," Proc. INTERSPEECH 2020.
     - X. Chang et al., "End-to-End ASR with Adaptive Span Self-Attention," Proc. INTERSPEECH 2020. (co-authors Y. Fujita and M. Omachi)
     2021
     - M. Omachi et al., "End-to-end ASR to jointly predict transcriptions and linguistic annotations," Proc. NAACL 2021.
     - Y. Fujita et al., "Toward Streaming ASR with Non-Autoregressive Insertion-based Model," Proc. INTERSPEECH 2021.
     - T. Maekaku et al., "Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021," Proc. INTERSPEECH 2021.
     - T. Wang et al., "Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models," Proc. INTERSPEECH 2021. (co-author Y. Fujita)
     - Y. Higuchi et al., "A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation," Proc. ASRU 2021. (co-author Y. Fujita)
     - X. Chang et al., "An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition," Proc. ASRU 2021. (co-author T. Maekaku)
     - F. Boyer et al., "A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies," Proc. ASRU 2021. (co-authors Y. Shinohara and T. Ishii)
     2022
     - M. Omachi et al., "Non-autoregressive End-to-end automatic speech recognition incorporating downstream natural language processing," Proc. ICASSP 2022.
     - T. Maekaku et al., "An exploration of HuBERT with large number of cluster units and model assessment using Bayesian Information Criterion," Proc. ICASSP 2022.
     - T. Maekaku et al., "Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR," Proc. INTERSPEECH 2022.
     - Y. Shinohara et al., "Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition," Proc. INTERSPEECH 2022.
     - X. Chang et al., "End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation," Proc. INTERSPEECH 2022. (co-authors Y. Fujita and T. Maekaku)
  31. Summary

  32. Development of the E2E ASR SDK
     - We developed an ASR engine that runs entirely on edge devices, solving issues of the client-server-based ASR (e.g., user privacy and the long latency caused by communication)
     - Its accuracy is approaching that of the client-server-based ASR
     - We introduced research topics on the OOV problem and on the joint prediction of graphemes and pronunciations
  33. Thank you