Practical Application of End-to-End Speech Recognition Technology and AI Voice Recording Service "CLOVA Note"

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  3. Self Introduction: Presenter
    Hyuksu Ryu
    › Leader of Speech Global Team
    › Global Speech Recognition including JP
    History
    › 2017.09 ~ : CLOVA Speech in NAVER
    Interest
    › Foreign Languages
    › Travel… T^T … → Domestic Camping
  4. CLOVA Speech Team
    Various Languages + Various Domains
    › Korean / Japanese / English …
    › Speaker (Wave, Friends, Friends Mini, Desk…), AiCall, CareCall, …
    Speaker Recognition
    › Who is speaking
    Speech Recognition
    › Classical DNN-HMM ASR
    › End-to-End ASR
  5. Agenda - Introduction - What is E2E-ASR? - AI Voice Recording Service “CLOVA Note” - Make E2E-ASR Faster - Conclusion
  6. What is E2E ASR? Comparison to the existing ASR system
    The existing ASR system
    › Acoustic / Language / Pronunciation Models for ASR
    › We need all 3 models for an update
    Pros
    › Robust for patterned speech
    › Easy to customize
    [Figure: speech waveform → End Point Detection → Feature Extraction → Recognition (Acoustic Model, Language Model, Pronunciation Model) → TEXT]
  7. What is E2E ASR? End-to-End (NEST)
    › NEST: Neural End-to-end Speech Transcriber
    › Simple recognition with a single model from speech features
    › Speech Feature → TEXT: Direct Connection
    [Figure: speech waveform → End Point Detection → Feature Extraction → Recognition (NEST) → TEXT]
  8. What is E2E ASR? End-to-End (NEST)
    › NEST: Neural End-to-end Speech Transcriber
    › Simple recognition with a single model from speech features
    Pros
    › Robust for free speech with disfluency (conversational style)
    › Robust in noisy environments
    › Applicable to service areas with free speech: subtitles / customer centre
    Cons
    › Needs a very large amount of data
    › High computational cost
    › Difficult to customize
  9. Agenda - Introduction - What is E2E-ASR? - AI Voice Recording Service “CLOVA Note” - Make E2E-ASR Faster - Conclusion
  10. CLOVA Note
    › You just record
    › We recognize, analyze, and save
    › Now in beta service in KR
    › For Korean, Japanese, English
  11. Functions: Select & Listen
    › You can select and listen to the desired part based on the recognition result
  12. Functions: CLOVA Note with Zoom
    › You can make CLOVA Notes during a Zoom meeting
    › From Zoom Marketplace (for Zoom Pro account holders)
  13. Release
    AI Voice Recording Service CLOVA Note
    › Make your speech into meaningful writing
    › Let CLOVA Note write, so you can concentrate on the meeting itself.
    › 2020. Nov. 19: CLOVA Note Release
  14. How many?
    › App Download: 1000K
    › Total Users: 900K
    › 6.2 times more than Jan.
    › Sept. Avg. WAU: 83K
    Market possibility is confirmed as an exclusive AI tech service in 2021
  15. CLOVA Note: Which technology is necessary?
    Speaker Diarization (Clustering)
    › “WHO” speaks when?
    Speech Recognition
    › Says “WHAT”?
    › Free conversation / no patterns
    › Many proper nouns, technical terms
  16. Self-supervised Learning
    Motivation
    › An E2E system needs a massive amount of data (incl. LABELING !!!!!!)
    › LABELING is the BOTTLENECK for data collection
    › Expensive & time consuming
    › What if we could use UNLABELED data for training? → WOW!!!
    Basic Concept
    › Two-Step Training: Pre-training & Fine-tuning
    › Pre-training: using unlabeled data, clustering similar data
    › Fine-tuning: using labeled data, tuning in detail
  17. Self-supervised Learning: Performance
    Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
  18. Keyword Boosting
    Motivation
    › So many proper nouns & technical terms
    › They are difficult to distinguish from words with similar pronunciation
    › How can we handle them?
    Premise
    › User memo in Notes
    › The memo is likely to be highly related to the contents of the recording
  19. Keyword Boosting Example
    User memo
    - アップロード フェイル (upload fail)
    - TTSが上がるのを確認必要 (need TTS check)
    - カウントする仕組みが必要 (necessary structure for counting)
    Parsed words from memo
    - アップロード, フェイル, TTS, 確認, カウント, 必要, 仕組み
    Extracted KEYWORDS by tf-idf
    - アップロード, フェイル, TTS, カウント
    Example 1 (セール “sale” → フェイル “fail”)
    - (before) できなくてセールがなった時に
    - (after) できなくてフェイルがなった時に
    Example 2 (フィットネス “fitness” → TTS)
    - (before) 次はフィットネス障害の数ですが
    - (after) 次はTTS障害の数ですが
    Example 3 (チケットスカウト “ticket scout” → TTS カウント “TTS count”)
    - (before) チケットスカウトも大事ですが
    - (after) TTS カウントも大事ですが
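The keyword-extraction step on this slide can be sketched roughly as follows. This is a minimal illustration, assuming a smoothed idf formula, a made-up background corpus, and an arbitrary threshold; none of these come from the talk, and how the extracted keywords are then used to bias recognition is not described on the slide.

```python
import math

def tfidf_keywords(memo_words, background_docs, threshold):
    """Return the memo words whose tf-idf score is above `threshold`.

    memo_words: words parsed from one user memo (order is preserved).
    background_docs: list of word sets from other texts (hypothetical corpus).
    """
    n_docs = len(background_docs)
    keywords = []
    for word in dict.fromkeys(memo_words):  # unique words, original order
        tf = memo_words.count(word) / len(memo_words)
        # Document frequency: how many background texts contain the word.
        df = sum(1 for doc in background_docs if word in doc)
        # Smoothed idf (an assumption): words unseen elsewhere score highest.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        if tf * idf > threshold:
            keywords.append(word)
    return keywords

# Parsed words from the slide's memo.
memo = ["アップロード", "フェイル", "TTS", "確認", "カウント", "必要", "仕組み"]
# Hypothetical background texts in which generic words are frequent.
background = [
    {"確認", "必要", "会議"},
    {"必要", "仕組み", "確認"},
    {"仕組み", "必要", "確認", "資料"},
]
keywords = tfidf_keywords(memo, background, threshold=0.25)
print(keywords)
```

With this invented corpus, generic words such as 確認 and 必要 fall below the threshold, and the four terms listed on the slide remain.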
  20. Speaker Diarization: What is Speaker Diarization?
    › Partitioning an input audio stream into homogeneous segments according to the speaker identity
    › Part of speaker recognition
    › A task to identify “who spoke when”
    [Figure: Speech Signal → Recognition: “Hello. Nice to meet you.” / “Good to see you, too.” / “Thank you in advance.” → Diarization: Speaker 1 / Speaker 2 / Speaker 1]
  21. Speaker Diarization: Speech Signal → Feature Extraction → Clustering
    [Figure: Speech Signal → End Point Detection → Extract Speaker Features → Clustering]
    › Represent each segment as a vector of fixed dimension using a DNN
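The clustering stage of this pipeline can be illustrated with a small sketch. It assumes the per-segment speaker vectors are already available and uses a simple greedy cosine-similarity clustering as a stand-in; the talk does not say which clustering algorithm is used, and the function names, threshold, and 2-dimensional toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster_segments(embeddings, threshold=0.8):
    """Assign a speaker label to each segment embedding, in order.

    A segment joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster (new speaker).
    """
    sums = []    # running sum of member vectors, one per cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, total in enumerate(sums):
            sim = cosine(emb, total)  # same direction as the cluster mean
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            sums.append(list(emb))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [t + e for t, e in zip(sums[best], emb)]
            labels.append(best)
    return labels

# Toy 2-dimensional "speaker features" for four segments.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = cluster_segments(segments)
print(labels)
```

Each distinct label corresponds to one speaker, so the toy input alternates between two speakers. A production system would typically cluster all segments jointly rather than greedily in arrival order.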
  22. Speaker Diarization: Toward better clustering
    › Contrastive learning
    › More precise clustering by transforming the features
    [Figure: Existing Clustering vs. Contrastive Learning]
  23. Speaker Diarization: International Challenge (DIHARD3)
    › 3rd place among 25 teams worldwide
    › https://sat.nist.gov/dihard3#tab_leaderboard
  24. Agenda - Introduction - What is E2E-ASR? - AI Voice Recording Service “CLOVA Note” - Make E2E-ASR Faster - Conclusion
  25. E2E-ASR vs Latency
    - Many recent E2E models read the whole speech signal and only then start to recognize it, which increases the user's response time (latency) → Not suitable for applications that require real-time interaction
    - Latency increases further when the input speech is longer
    [Figure: input speech “Highway and freeway mean the same thing” → Result]
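  26. InterCTC [Lee, Watanabe 2021]
    [Figure, Training: Speech → Encoder → Encoder → Encoder → Encoder → Decoder → CTC loss (against Text), plus a second Decoder → CTC loss branch from an intermediate Encoder]
    [Figure, Inference: Speech → Encoder → Encoder → Encoder → Encoder → Decoder → Text]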
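The training diagram attaches a CTC loss to an intermediate encoder output in addition to the final one. As described in the cited InterCTC paper, the two losses are combined as a weighted sum. The symbols below are notation chosen here for illustration, and the value of the weight $w$ used in CLOVA Note is not given in the talk.

```latex
\mathcal{L}_{\text{total}}
  = (1 - w)\,\mathcal{L}_{\text{CTC}}
  + w\,\mathcal{L}_{\text{InterCTC}},
\qquad 0 \le w \le 1
```

Here $\mathcal{L}_{\text{CTC}}$ is the CTC loss on the final encoder output and $\mathcal{L}_{\text{InterCTC}}$ is the CTC loss on the intermediate output. The intermediate branch only regularizes training, which is why it is absent from the inference diagram.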
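  27. Self-Conditioned CTC [Nozaki, Komatsu 2021]
    [Figure, Training: Speech → Encoder → Encoder → Encoder → Encoder → Decoder → CTC loss (against Text), plus an intermediate Decoder → CTC loss]
    [Figure, Inference: Speech → Encoder → Encoder → Encoder → Encoder → Decoder → Text, with the intermediate Decoder kept]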
  28. ASR Performance & Inference Speed
    Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” ASRU 2021
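One reason CTC-based non-autoregressive models are fast at inference is that decoding can be a single pass over the per-frame predictions. The sketch below shows standard greedy CTC decoding (merge repeated labels, then drop blanks); the label ids and the tiny vocabulary are made up, and the talk does not state which decoding strategy the service uses.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame argmax label ids into an output label sequence.

    Step 1: merge consecutive repeats. Step 2: remove the blank label.
    """
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]

# Hypothetical per-frame argmax ids (0 is the CTC blank).
vocab = {1: "h", 2: "i"}
frames = [1, 1, 0, 2, 2, 2, 0]
decoded = ctc_greedy_decode(frames)
text = "".join(vocab[i] for i in decoded)
print(text)
```

A blank between two identical labels keeps them separate, so `[3, 0, 3]` decodes to two tokens while `[3, 3]` decodes to one.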
  29. How to Handle Queries
    [Figure: Queries (chunk) from A, B, and C arriving over Time; GPU scheduling of those chunks into Slots over Time]
  30. Triton Inference Server
    [Figure: Queries (chunk) from A, B, and C arriving over Time; GPU scheduling of those chunks into multiple Slots over Time]
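The idea of packing chunks from several queries into GPU slots can be illustrated with a toy scheduler. This is only a sketch under assumptions: it is not Triton Inference Server's scheduler or API, the function name and the rule "at most one chunk per query per step" are chosen here so that chunks of one query stay in order, and arrival times are ignored.

```python
def schedule_chunks(chunks, num_slots):
    """Group query chunks into GPU steps of at most `num_slots` chunks.

    chunks: query ids in arrival order, one entry per audio chunk.
    Each step holds at most one chunk per query, so the chunks of a
    query are processed in their original order.
    """
    pending = list(chunks)
    steps = []
    while pending:
        step, waiting = [], []
        for query in pending:
            if query not in step and len(step) < num_slots:
                step.append(query)     # query takes a free slot
            else:
                waiting.append(query)  # slot busy or query already served
        steps.append(step)
        pending = waiting
    return steps

# Hypothetical arrival order of chunks from queries A, B, and C.
plan = schedule_chunks(["A", "A", "C", "B"], num_slots=2)
print(plan)
```

With two slots, the second chunk of A waits for the next step while C fills the free slot, which is the kind of interleaving the slide's timeline depicts.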
  31. Agenda - Introduction - What is E2E-ASR? - AI Voice Recording Service “CLOVA Note” - Make E2E-ASR Faster - Conclusion
  32. Conclusion
    › CLOVA Note will be released soon! Don’t miss it!!
    › LINE/NAVER keeps on innovating!
    › Working intensely on making E2E-ASR faster