
CLOVA Speech: End-to-End Speech Recognition in Everyday Life


LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  3. Self Introduction Presenter › Hyuksu Ryu › Tech Leader

    of Speech Global Team › Japanese Speech Recognition History › 2017.09 ~ : CLOVA Speech in NAVER Interests › Foreign Languages › Travel…. T^T this year……
  4. CLOVA Speech Team Various Languages + Various Domains ›

    Korean / Japanese / English / Chinese … › Speakers (Wave, Friends, Friends Mini, Desk…), Car Navi, AiCall, CareCall, … Speech Recognition › Classical DNN-HMM ASR › End-to-End ASR Speaker Recognition › Who is speaking
  5. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  6. What is E2E ASR? Comparison with the existing ASR system

    The existing ASR system › Acoustic / Language / Pronunciation Models for ASR › All three models are needed for an update Pros › Robust for patterned speech › Easy to customize [Pipeline diagram: speech waveform → End Point Detection → Feature Extraction → Recognition (Acoustic Model + Language Model + Pronunciation Model) → TEXT]
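For reference (not on the slide), the three models implement the standard noisy-channel formulation of ASR. A textbook sketch, with Q ranging over the phone sequences supplied by the pronunciation model (lexicon):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}},
\qquad
P(X \mid W) \approx \max_{Q} P(X \mid Q)\, P(Q \mid W)
```

This is why an update touches all three components: each factor is a separately trained model.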
  7. What is E2E ASR? End-to-End (NEST) › Neural End-to-end Speech

    Transcriber › Simple recognition with a single model from speech features › Speech Feature → TEXT : direct connection [Pipeline diagram: speech waveform → End Point Detection → Feature Extraction → Recognition (NEST) → TEXT]
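A minimal sketch of the "single model, speech feature → TEXT" idea. The talk does not describe NEST's actual architecture, so the encoder, output head, shapes, and vocabulary size below are illustrative assumptions (a generic CTC-style per-frame classifier in PyTorch), not the production system:

```python
import torch
import torch.nn as nn

class TinyE2EASR(nn.Module):
    """Toy end-to-end recognizer: acoustic features in, token log-probs out."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab: int = 5000):
        super().__init__()
        # Encoder: acoustic frames -> high-level representations
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        # Output head: per-frame token scores (CTC-style)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(feats)            # (batch, time, hidden)
        return self.head(enc).log_softmax(dim=-1)  # (batch, time, vocab)

model = TinyE2EASR()
feats = torch.randn(1, 200, 80)   # 200 frames of 80-dim log-mel features
log_probs = model(feats)          # decode greedily or with a CTC beam search
print(log_probs.shape)            # torch.Size([1, 200, 5000])
```

The point of the sketch is the contrast with slide 6: one trainable module maps features directly to token probabilities, with no separate acoustic, pronunciation, or language models to maintain.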
  8. What is E2E ASR? End-to-End (NEST) › Neural End-to-end

    Speech Transcriber › Simple recognition with a single model from speech features Pros › Robust for free speech with disfluency (conversational style) › Robust in noisy environments › Applicable to free-speech service areas - subtitles / customer centre Cons › Needs a very large amount of data › High computational cost › Difficult to customize
  9. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  10. CLOVA Note What is CLOVA Note? › If you want

    to take a note › If you want to search who says what › Aim: Meeting minutes
  11. CLOVA Note Which technologies are necessary? Speaker Recognition ›

    WHO says it? › Estimate how many speakers there are Speech Recognition › Says WHAT? › Free conversation › No patterns
  12. CLOVA Note Example › You just record › We

    recognize, analyze, and save › Now in beta service in Korea
  13. NEST for NAVER News Demo #1 › Subtitles for NAVER

    News › Since Q1 2020 › Good performance on free-style speech https://news.naver.com/main/read.nhn?mode=LSD&mid=tvh&sid2=355&oid=056&aid=0010923940
  14. NEST for NAVER News Demo #2 › Music in the

    background? No problem! › Robust in noisy environments https://entertain.naver.com/read?oid=214&aid=0001069943
  15. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  16. NEST for LINE LIVE Demo #1 › Good performance in

    conversational free style › Robust to disfluency
  17. NEST for LINE LIVE Demo #2 › Good performance even

    with background music › Robust to laughing/applause…
  18. Active Learning Motivation › The NEST system needs a massive

    amount of data (incl. LABELING!) › LABELING is the BOTTLENECK of data collection › Expensive & time-consuming Aim › Make labeling efficient Basic Concept › High confidence → use the recognition result as the LABEL › Low confidence → human labeling › Repeat recursively
  19. Active Learning [Flowchart] Initial Model → Recognize D_U →

    Compute score for D_U → Select samples for labeling → Human labeling → Train model using D_U & D_L → Is satisfied? (No → repeat / Yes → end of training) D_L : labeled data, D_U : unlabeled data
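A minimal Python sketch of the loop in the flowchart above. `recognize`, `human_label`, `train`, and `satisfied` are hypothetical callables supplied by the caller (not CLOVA APIs), and the 0.9 threshold is an arbitrary illustration; the control flow mirrors the slide: high-confidence hypotheses are kept as labels, low-confidence samples go to human labeling, and the model is retrained until the stopping criterion is met.

```python
def active_learning(model, d_labeled, unlabeled_batches,
                    recognize, human_label, train, satisfied,
                    threshold=0.9):
    """Confidence-based active learning: one pass per batch of unlabeled data."""
    for d_u in unlabeled_batches:                 # loop until "Is satisfied?" = Yes
        auto, to_human = [], []
        for utt in d_u:
            hyp, conf = recognize(model, utt)     # "Recognize D_U" + "Compute score for D_U"
            if conf >= threshold:
                auto.append((utt, hyp))           # high confidence: hypothesis becomes the label
            else:
                to_human.append(utt)              # "Select samples for labeling"
        d_labeled += auto + [(u, human_label(u)) for u in to_human]  # "Human labeling"
        model = train(model, d_labeled)           # "Train model using D_U & D_L"
        if satisfied(model):                      # Yes: end of training
            break
    return model, d_labeled
```

Only the low-confidence samples cost human time, which is what makes labeling efficient: the recognizer effectively labels the easy utterances for free.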
  25. Active Learning Effect › Efficient labeling › With fair performance

    [Chart: relative accuracy vs. hand-label ratio. Conditions: hand label 100% (386hr); hand label 33.3% (137hr) + inference 249hr; hand label 20.0% (80hr) + inference 306hr; hand label 14.3% (57hr) + inference 329hr; hand label 10.0% (39hr) + inference 347hr]
  26. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  27. Conclusion CLOVA End-to-End ASR = NEST › Better performance for

    free-style conversation › Robust in noisy environments › Multiple languages