Practical Application of End-to-End Speech Recognition Technology and AI Voice Recording Service "CLOVA Note"

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. None
  2. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  3. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  4. CLOVA Speech

  5. Self Introduction - Presenter
    Hyuksu Ryu
    › Leader of Speech Global Team
    › Global Speech Recognition including JP
    History
    › 2017.09 ~ : CLOVA Speech in NAVER
    Interest
    › Foreign Languages
    › Travel…. T^T …… → Domestic Camping
  6. CLOVA Speech Team
    Various Language + Various Domain
    › Korean / Japanese / English …
    › Speaker (Wave, Friends, Friends Mini, Desk…), AiCall, CareCall, …
    Speaker Recognition
    › Who is speaking
    Speech Recognition
    › Classical DNN-HMM ASR
    › End-to-End ASR
  7. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  8. What is E2E ASR? Comparison to the existing ASR system
    The existing ASR system
    › Acoustic / Language / Pronunciation Model for ASR
    › All 3 models are needed for an update
    Pros
    › Robust for patterned speech
    › Easy to customize
    [Figure: waveform → End Point Detection → Feature Extraction → Recognition (Acoustic Model, Language Model, Pronunciation Model) → TEXT]
  9. What is E2E ASR? End-to-End (NEST)
    › Neural End-to-end Speech Transcriber
    › Simple recognition with a single model from the speech feature
    › Speech Feature → TEXT : Direct Connection
    [Figure: waveform → End Point Detection → Feature Extraction → Recognition (NEST) → TEXT]
  10. What is E2E ASR? End-to-End (NEST)
    › Neural End-to-end Speech Transcriber
    › Simple recognition with a single model from the speech feature
    Pros
    › Robust for free speech with disfluency (Conversational Style)
    › Robust in noisy environments
    › Applicable to service areas with free speech - Subtitle / Customer centre
    Cons
    › Needs a very large amount of data
    › High computational cost
    › Difficult customization
  11. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  12. What is CLOVA Note?

  13. CLOVA Note What is CLOVA Note?

  14. CLOVA Note What is CLOVA Note?

  15. CLOVA Note
    › You just record
    › We recognize, analyze, and save
    › Now in beta service in KR
    › For Korean, Japanese, English
  16. What can we do in CLOVA Note

  17. Functions: Select & Listen
    › You can select and listen to the desired part based on the recognition result
  18. Functions: Memo
    › You can take memos while recording

  19. Functions: Search Keywords
    › You can search keywords and find what you want
  20. Functions: Note Sharing
    › You can share your notes with your friends & colleagues
  21. Functions: CLOVA Note with Zoom
    › You can create CLOVA Notes during a Zoom meeting
    › Available from Zoom Marketplace (for Zoom Pro account holders)
  22. Functions CLOVA Note with Zoom

  23. CLOVA Note in KR

  24. Release
    2020. Nov. 19 CLOVA Note Release
    AI Voice Recording Service CLOVA Note
    › Make your speech into meaningful writing
    › Let CLOVA Note write, so you can concentrate on the meeting itself
  25. How many?
    › App Download: 1000K
    › Total Users: 900K
    › 6.2 times more than Jan.
    › Sept. Avg. WAU: 83K
    Market potential confirmed as an exclusive AI tech service in 2021
  26. How many? CLOVA Note App Download > 1 MILLION
    [Figure: app download chart; axis labels 2021.02, 2021.04, 2021.06, 2021.10; 1000K mark]
  27. Technologies in CLOVA Note

  28. CLOVA Note: Which Technology is necessary?
    Speaker Diarization (Clustering)
    › “WHO” speaks when?
    Speech Recognition
    › Says “WHAT”?
    › Free Conversation / No Patterns
    › Many Proper nouns, Technical Terms
  29. Self-supervised Learning

  30. Self-supervised Learning
    Motivation
    › An E2E system needs a massive amount of data (incl. LABELING !!!!!!)
    › LABELING is the BOTTLENECK for data collection
    › Expensive & time-consuming
    › What if we can use UNLABELED data for training? → WOW!!!
    Basic Concept
    › Two-Step Training: Pre-training & Fine-tuning
    › Pre-training: Using unlabeled data, clustering similar data
    › Fine-tuning: Using labeled data, tuning in detail
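The two-step idea on slide 30 can be illustrated with a deliberately tiny toy. This is only a sketch of the "cluster first, label later" intuition, not how wav2vec 2.0 or CLOVA's models are trained: the 1-D data points, the `kmeans_1d` helper, the starting centers, and the labels are all made up for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain 1-D k-means: a stand-in for 'pre-training' that groups similar unlabeled data."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def nearest_cluster(x, centers):
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))


# Step 1 (pre-training stand-in): many unlabeled points, no labels needed.
unlabeled = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
centers = kmeans_1d(unlabeled, centers=[0.0, 10.0])

# Step 2 (fine-tuning stand-in): a handful of labeled points give each cluster a name.
labeled = [(1.0, "pizza"), (5.0, "coffee")]
cluster_name = {nearest_cluster(x, centers): label for x, label in labeled}


def predict(x):
    return cluster_name[nearest_cluster(x, centers)]


print(predict(1.3), predict(4.5))
```

Only two labeled examples are needed here because the expensive grouping work was done without labels, which is the point the slide makes about labeling being the bottleneck.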
  31. Self-supervised Learning Pre-training

  32. Self-supervised Learning Pre-training #hh #cc #vv #pp

  33. Pre-training #hh #cc #vv #pp Self-supervised Learning

  34. Pre-training #hh #cc #vv #pp Self-supervised Learning

  35. Pre-training #hh Self-supervised Learning

  36. Pre-training #hh #cc #vv #pp Self-supervised Learning

  37. Pre-training Self-supervised Learning #pp

  38. Pre-training Self-supervised Learning #pp #hh #cc #vv

  39. Fine-tuning Self-supervised Learning pizza hamburger coffee vegetable Training using Label

  40. Self-supervised Learning: Performance
    Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
  41. Keyword Boosting

  42. Keyword Boosting
    Motivation
    › So many proper nouns & technical terms
    › They are difficult to recognize among words with similar pronunciation
    › How can we treat them?
    Premise
    › User memo in Notes
    › The memo would be highly related to the contents of the recording
  43. Keyword Boosting
    How to do
    › Parse the memo text
    › Extract keywords from the parsed text using tf-idf
  44. Keyword Boosting: Example
    User memo
    - アップロード フェイル (upload fail)
    - TTSが上がるのを確認必要 (need TTS check)
    - カウントする仕組みが必要 (necessary structure for counting)
    Parsed words from memo
    - アップロード, フェイル, TTS, 確認, カウント, 必要, 仕組み
    Extracted KEYWORDS by tf-idf
    - アップロード, フェイル, TTS, カウント
    Example 1 (セール “sale” corrected to フェイル “fail”)
    - (before) できなくてセールがなった時に
    - (after) できなくてフェイルがなった時に
    Example 2 (フィットネス “fitness” corrected to TTS)
    - (before) 次はフィットネス障害の数ですが
    - (after) 次はTTS障害の数ですが
    Example 3 (チケットスカウト “ticket scout” corrected to TTS カウント “TTS count”)
    - (before) チケットスカウトも大事ですが
    - (after) TTS カウントも大事ですが
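The tf-idf extraction described on slides 43 and 44 can be sketched roughly as follows. This is a minimal illustration, not the CLOVA Note implementation: the parsed memo words come from the slide, while the `extract_keywords` helper, the background corpus, and the `top_k` cut-off are hypothetical. How the extracted keywords are then used to boost recognition is not described in the deck and is not shown here.

```python
import math
from collections import Counter


def extract_keywords(memo_tokens, background_docs, top_k=4):
    """Rank memo tokens by tf-idf against a background corpus (hypothetical helper)."""
    tf = Counter(memo_tokens)
    n_docs = len(background_docs) + 1  # background documents plus the memo itself
    scores = {}
    for token, count in tf.items():
        # Document frequency: the memo plus every background document containing the token.
        df = 1 + sum(1 for doc in background_docs if token in doc)
        idf = math.log(n_docs / df)
        scores[token] = (count / len(memo_tokens)) * idf
    # Highest score first; ties are broken by the token string for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Parsed words from the memo on slide 44.
memo = ["アップロード", "フェイル", "TTS", "確認", "カウント", "必要", "仕組み"]

# Assumed background corpus in which generic words (確認, 必要, 仕組み) are common.
background = [
    {"確認", "必要", "会議"},
    {"仕組み", "必要", "確認"},
    {"確認", "仕組み", "資料"},
]

keywords = extract_keywords(memo, background)
print(keywords)
```

With this assumed corpus, the generic words receive low idf and drop out, leaving the same four domain terms the slide lists as extracted keywords.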
  45. Speaker Diarization

  46. Speaker Diarization
    What is Speaker Diarization?
    › Partitioning an input audio stream into homogeneous segments according to the speaker identity
    › Part of speaker recognition
    › A task to identify “who spoke when”
    [Figure: Speech Signal; Recognition: “Hello. Nice to meet you. Good to see you, too. Thank you in advance.”; Diarization: Speaker 1, Speaker 2, Speaker 1]
  47. Speaker Diarization: Speech Signal → Feature Extraction → Clustering
    › Speech Signal → End Point Detection → Extract Speaker Features → Clustering
    › Represent each segment as a vector of fixed dimension using a DNN
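The clustering stage on slide 47 can be sketched with a simple greedy scheme. This is an illustrative assumption, not the method used in CLOVA Note: the 2-D vectors stand in for DNN speaker embeddings, and the `cluster_segments` helper and the 0.8 cosine threshold are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cluster_segments(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar speaker centroid, or open a new speaker."""
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments assigned to each speaker
    labels = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            counts[k] += 1
            # Incremental mean update of the matched centroid.
            centroids[k] = [c + (e - c) / counts[k] for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)
            centroids.append(list(emb))
            counts.append(1)
        labels.append(k)
    return labels


# Toy fixed-dimension vectors, one per speech segment (made-up values).
segments = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2)]
labels = cluster_segments(segments)
print(["Speaker %d" % (k + 1) for k in labels])
```

The first and third segments point in nearly the same direction, so they share a speaker label, while the second opens a new one.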
  48. Speaker Diarization: Toward better clustering
    › Contrastive learning
    › Generating more precise clustering through feature conversion
    [Figure: Existing Clustering vs. Contrastive Learning]
  49. Speaker Diarization: International Challenge (DIHARD3)
    › 3rd place among 25 teams worldwide
    › https://sat.nist.gov/dihard3#tab_leaderboard
  50. Release SOON in JP!!!
    We look forward to meeting you SOON with CLOVA Note!!!
  51. Yusuke Kida
    LINE Manager of Speech Team
    Biography
    • Mainly working on speech recognition through Toshiba and Yahoo JAPAN
    • Joined LINE and leading speech team
    Favorite
    • Third wave coffee, cat
  52. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  53. E2E-ASR vs Latency
    - Many recent E2E models read the whole speech signal before starting recognition, which increases the user's response time (latency) → Not suitable for applications that require real-time interaction
    - Latency increases further as the input speech gets longer
    [Figure: Result: “Highway and freeway mean the same thing”]
  54. Why E2E-ASR Slow?
    [Figure: Transformer Model. Decoder, Concat; tokens: <sos> → highway, highway → and, and → freeway, …]
  55. Chunk-based Decoding
    [Figure: Decoder, Concat: <sos> → highway, highway → and; Decoder, Concat: and → freeway, freeway → means, …]
  56. Why E2E-ASR Slow?
    [Figure: Transformer Model. Decoder, Concat; tokens: <sos> → highway, highway → and, and → freeway, …]
  57. Non-Autoregressive ASR
    [Figure: Transformer Model. Decoder, Concat; tokens: highway, and, freeway]
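The contrast between the autoregressive decoder of slides 54 and 56 and the non-autoregressive decoder of slide 57 can be made concrete by counting sequential decoder passes. This is a toy sketch: `TARGET` and `toy_decoder_step` are hypothetical stand-ins for a trained Transformer, and only the control flow is meant to be representative.

```python
TARGET = ["highway", "and", "freeway"]  # toy stand-in for what a trained model would predict


def toy_decoder_step(prefix):
    """Stand-in for one autoregressive decoder pass: next token given the prefix so far."""
    position = len(prefix) - 1  # the prefix always starts with <sos>
    return TARGET[position] if position < len(TARGET) else "<eos>"


def autoregressive_decode():
    prefix, calls = ["<sos>"], 0
    while True:
        token = toy_decoder_step(prefix)
        calls += 1
        if token == "<eos>":
            break
        prefix.append(token)  # concatenate the output onto the next input
    return prefix[1:], calls


def non_autoregressive_decode():
    # A single pass predicts every position at once; no token waits on another.
    return list(TARGET), 1


ar_tokens, ar_calls = autoregressive_decode()
nar_tokens, nar_calls = non_autoregressive_decode()
print(ar_tokens, ar_calls)
print(nar_tokens, nar_calls)
```

The autoregressive loop needs one pass per output token plus one for the end marker, and each pass must wait for the previous one, so the number of sequential steps grows with the length of the utterance.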

  58. CTC (Connectionist Temporal Classification)
    [Figure labels: Speech; Encoder ×4; Decoder; CTC loss; Text]
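Slide 58 shows CTC only as a diagram. As general background, the standard greedy way to turn per-frame CTC outputs into text is to take the best label for each frame, merge consecutive repeats, and drop the blank symbol. The sketch below assumes a made-up three-symbol vocabulary and hand-written frame scores; `ctc_greedy_decode` is an illustrative helper, not code from the talk.

```python
from itertools import groupby

BLANK = "_"


def ctc_greedy_decode(frame_scores, vocab):
    """Pick the best label per frame, merge repeated labels, then drop blanks."""
    best_path = [
        vocab[max(range(len(vocab)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    merged = [label for label, _ in groupby(best_path)]
    return [label for label in merged if label != BLANK]


vocab = [BLANK, "a", "b"]
# One score vector per frame; the best path is: a a _ a b b
frame_scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]

decoded = ctc_greedy_decode(frame_scores, vocab)
print(decoded)
```

Because every frame is scored independently of the other output labels, all frames can be processed in parallel, which is why CTC is a natural fit for non-autoregressive recognition. The blank between the two runs of "a" is what keeps them from being merged into one.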
  59. InterCTC [Lee, Watanabe 2021]
    Training
    [Figure labels: Speech; Encoder ×4; Decoder ×2; CTC loss ×2; Text]
    Inference
    [Figure labels: Speech; Encoder ×4; Decoder; Text]
  60. Self-Conditioned CTC [Nozaki, Komatsu 2021]
    Training
    [Figure labels: Speech; Encoder ×4; Decoder ×2; CTC loss ×2; Text]
    Inference
    [Figure labels: Speech; Encoder ×4; Decoder ×2; Text]
  61. ASR Performance & Inference Speed
    Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” ASRU 2021
  62. How to Handle Queries
    [Figure: Queries (chunk) A, B, C over Time; GPU scheduling into Slots over Time]
  63. LINE NVIDIA Collaboration

  64. Triton Inference Server
    [Figure: Queries (chunk) A, B, C over Time; GPU scheduling into Slots over Time]
  65. Publications from LINE/NAVER Speech 9+9 papers accepted!

  66. Agenda
    - Introduction
    - What is E2E-ASR?
    - AI Voice Recording Service “CLOVA Note”
    - Make E2E-ASR Faster
    - Conclusion
  67. Conclusion
    › CLOVA Note will be released soon! Don’t miss it!!
    › LINE/NAVER keep on innovating!
    › Working intensely on making E2E-ASR faster