CLOVA Speech: End-to-End Speech Recognition in Everyday Life


LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda › Introduction › What is End-to-End ASR (NEST)? › NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  3. Self Introduction › Presenter: Hyuksu Ryu › Tech Leader of Speech Global Team › Japanese Speech Recognition › History: 2017.09 ~ : CLOVA Speech in NAVER › Interests: Foreign Languages › Travel…. T^T this year……
  4. CLOVA Speech Team › Various Languages + Various Domains: Korean / Japanese / English / Chinese … › Speaker (Wave, Friends, Friends Mini, Desk…), Car Navi, AiCall, CareCall, … › Speech Recognition: Classical DNN-HMM ASR › End-to-End ASR › Speaker Recognition: Who says the speech
  5. Agenda › Introduction › What is End-to-End ASR (NEST)? › NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  6. What is E2E ASR? › Comparison to the existing ASR system › The existing ASR system uses an Acoustic Model, a Language Model, and a Pronunciation Model › All 3 models are needed for an update › Pros: Robust for patterned speech › Easy to customize › Pipeline: End Point Detection → Feature Extraction → Recognition (Acoustic Model + Language Model + Pronunciation Model) → TEXT (see the sketch just below)
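For contrast with the single-model approach on the next slides, here is a minimal sketch of how the three components of this classical pipeline interact at decode time. The scores, the lexicon entries, and the weighting are toy placeholders, not CLOVA's implementation:

    # Toy pronunciation lexicon (pronunciation model): word -> phone sequence.
    LEXICON = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    def acoustic_score(features, phones):
        # Placeholder for log P(features | phones) from a DNN-HMM acoustic model.
        return -0.5 * len(phones)

    def language_score(words):
        # Placeholder for log P(words) from an n-gram language model.
        return -1.0 * len(words)

    def decode(features, candidates, lm_weight=0.8):
        # Classical decoding: pick the word sequence maximizing the combined
        # acoustic + weighted language-model score; the lexicon bridges the two.
        def total(words):
            phones = [p for w in words for p in LEXICON[w]]
            return acoustic_score(features, phones) + lm_weight * language_score(words)
        return max(candidates, key=total)

    print(decode([0.1] * 100, [["hello"], ["hello", "world"]]))

The slide's point about updates follows from this structure: because the final score mixes all three models, changing any one of them means re-validating the whole pipeline.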
  7. What is E2E ASR? › End-to-End (NEST) › Neural End-to-end Speech Transcriber › Simple recognition with a single model from speech features › Speech Feature → TEXT: a direct connection (see the sketch below) › Pipeline: End Point Detection → Feature Extraction → Recognition (NEST) → TEXT
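The "Speech Feature → TEXT: direct connection" above can be sketched as one network that maps feature frames straight to token logits. NEST's actual architecture is not disclosed in this talk, so everything below (layer types, sizes, the greedy per-frame readout) is an illustrative assumption, not the production model:

    import torch
    import torch.nn as nn

    class TinyE2EASR(nn.Module):
        def __init__(self, n_feats=80, n_tokens=30, hidden=128):
            super().__init__()
            # One model end to end: encode acoustic feature frames ...
            self.encoder = nn.LSTM(n_feats, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
            # ... and map encoder states straight to text-token logits,
            # with no separate acoustic / pronunciation / language models.
            self.to_tokens = nn.Linear(2 * hidden, n_tokens)

        def forward(self, feats):           # feats: (batch, time, n_feats)
            enc, _ = self.encoder(feats)
            return self.to_tokens(enc)      # (batch, time, n_tokens)

    model = TinyE2EASR()
    feats = torch.randn(1, 200, 80)         # ~2 s of log-mel features (assumed)
    logits = model(feats)
    tokens = logits.argmax(dim=-1)          # greedy per-frame readout
    print(tokens.shape)                     # torch.Size([1, 200])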
  8. What is E2E ASR? › End-to-End (NEST) › Neural End-to-end Speech Transcriber › Simple recognition with a single model from speech features › Pros: Robust for free speech with disfluency (conversational style) › Robust in noisy environments › Opens service areas for free speech - Subtitles / Customer centre › Cons: Needs a massive amount of data › High computational cost › Difficult to customize
  9. Agenda › Introduction › What is End-to-End ASR (NEST)? › NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  10. CLOVA Note › What is CLOVA Note? › Aim: Meeting minutes › For when you want to take a note › For when you want to search who says what
  11. CLOVA Note › Which technology is necessary? › Speaker Recognition: "WHO" says it? › Estimate how many members there are › Speech Recognition: Says "WHAT"? › Free conversation › No patterns
  12. CLOVA Note Example › You just record › We recognize, analyze, and save › Now in beta service in KR
  13. NEST for NAVER News Demo #1 › Subtitles for NAVER News › Since Q1 2020 › Good performance on free-style speech https://news.naver.com/main/read.nhn?mode=LSD&mid=tvh&sid2=355&oid=056&aid=0010923940
  14. NEST for NAVER News Demo #2 › Music in the background? No problem! › Robust in noisy environments https://entertain.naver.com/read?oid=214&aid=0001069943
  15. Agenda › Introduction › What is End-to-End ASR (NEST)? › NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  16. NEST for LINE LIVE Demo #1 › Good performance on conversational free-style speech › Robust to disfluency
  17. NEST for LINE LIVE Demo #2 › Good performance even with background music › Robust to laughing/applause…
  18. Active Learning › Motivation: The NEST system needs a massive amount of data (incl. LABELING!) › LABELING is the BOTTLENECK for data collection › Expensive & time-consuming › Basic concept: High confidence → use the recognition result as the LABEL › Low confidence → human labelling › Do this recursively › Aim: Make labeling efficient (a code sketch of the loop follows the flow below)
  19. Active Learning › Flow: Initial Model → Recognize D_U → Compute Score for D_U → Select Samples for Labeling → Human Labeling → Train Model Using D_U & D_L → Is Satisfied? (No → repeat from Recognize; Yes → End of Training) › D_L: labelled data › D_U: unlabelled data
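The flow above can be read as the following loop. The recognizer, the confidence measure, the threshold, and the stopping test are placeholder stubs here, since the talk does not specify them:

    import random

    # Placeholder stubs -- the talk does not specify these components.
    def recognize(model, utt):        return f"hyp({utt})"
    def confidence(model, utt, hyp):  return random.random()
    def human_label(utt):             return f"ref({utt})"
    def retrain(model, labelled):     return model
    def satisfied(model):             return True

    def active_learning(model, D_U, D_L, threshold=0.9):
        while True:
            # Recognize the unlabelled pool D_U and score each hypothesis.
            hyps = {u: recognize(model, u) for u in D_U}
            for u, hyp in hyps.items():
                # High confidence -> keep the recognition result as the label;
                # low confidence -> route the utterance to human labelling.
                D_L[u] = hyp if confidence(model, u, hyp) >= threshold else human_label(u)
            # Retrain on D_L, which now includes the machine-labelled data.
            model = retrain(model, D_L)
            if satisfied(model):          # "Is Satisfied?" -> end of training
                return model, D_L
            # Otherwise repeat ("do recursively"), e.g. on fresh unlabelled data.

    model, D_L = active_learning(model=None, D_U=["utt1", "utt2"], D_L={})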
  25. Active Learning Effect › Efficient labeling › With fair performance › [Chart: Relative Accuracy vs. Hand Label (Ratio)]
      hand label 100% (386 hr)
      hand label 33.3% (137 hr) + inference 249 hr
      hand label 20.0% (80 hr) + inference 306 hr
      hand label 14.3% (57 hr) + inference 329 hr
      hand label 10.0% (39 hr) + inference 347 hr
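A quick consistency check on these configurations: in each one, the hand-labelled hours plus the machine-labelled ("inference") hours cover the same 386 hr of training audio:

    configs = [("100%", 386, 0), ("33.3%", 137, 249), ("20.0%", 80, 306),
               ("14.3%", 57, 329), ("10.0%", 39, 347)]
    for ratio, hand, inferred in configs:
        assert hand + inferred == 386
        print(f"hand label {ratio}: {hand} hr hand + {inferred} hr inferred = 386 hr")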
  26. Agenda › Introduction › What is End-to-End ASR (NEST)? › NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  27. Conclusion › CLOVA End-to-End ASR = NEST › Better performance for free-style conversation › Robust in noisy environments › Multiple languages