
CLOVA Speech: End-to-End Speech Recognition in Everyday Life


LINE DevDay 2020

November 25, 2020

Transcript

  1. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  3. Self Introduction Presenter › Hyuksu Ryu › Tech Leader

    of Speech Global Team › Japanese Speech Recognition History › 2017.09 ~ : CLOVA Speech in NAVER Interests › Foreign Languages › Travel…. T^T this year……
  4. CLOVA Speech Team Various Languages + Various Domains ›

    Korean / Japanese / English / Chinese … › Speakers (Wave, Friends, Friends Mini, Desk…), Car Navi, AiCall, CareCall, … Speech Recognition › Classical DNN-HMM ASR › End-to-End ASR Speaker Recognition › Who is speaking
  5. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  6. What is E2E ASR? Comparison with the existing ASR system

    The existing ASR system › Acoustic / Language / Pronunciation Models for ASR › All three models are needed for an update Pros › Robust for patterned speech › Easy to customize [Pipeline diagram: speech waveform → End Point Detection → Feature Extraction → Recognition (Acoustic Model + Language Model + Pronunciation Model) → TEXT]
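For reference (not on the slide), the three models implement the standard noisy-channel formulation of ASR. A textbook sketch, with Q ranging over the phone sequences supplied by the pronunciation model (lexicon):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}},
\qquad
P(X \mid W) \approx \max_{Q} P(X \mid Q)\, P(Q \mid W)
```

This is why an update touches all three components: each factor is a separately trained model.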
  7. What is E2E ASR? End-to-End (NEST) › Neural End-to-end Speech

    Transcriber › Simple recognition with a single model from speech features › Speech Feature → TEXT : direct connection [Pipeline diagram: speech waveform → End Point Detection → Feature Extraction → Recognition (NEST) → TEXT]
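A minimal sketch of the "single model, speech feature → TEXT" idea. The talk does not describe NEST's actual architecture, so the encoder, output head, shapes, and vocabulary size below are illustrative assumptions (a generic CTC-style per-frame classifier in PyTorch), not the production system:

```python
import torch
import torch.nn as nn

class TinyE2EASR(nn.Module):
    """Toy end-to-end recognizer: acoustic features in, token log-probs out."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab: int = 5000):
        super().__init__()
        # Encoder: acoustic frames -> high-level representations
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        # Output head: per-frame token scores (CTC-style)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(feats)            # (batch, time, hidden)
        return self.head(enc).log_softmax(dim=-1)  # (batch, time, vocab)

model = TinyE2EASR()
feats = torch.randn(1, 200, 80)   # 200 frames of 80-dim log-mel features
log_probs = model(feats)          # decode greedily or with a CTC beam search
print(log_probs.shape)            # torch.Size([1, 200, 5000])
```

The point of the sketch is the contrast with slide 6: one trainable module maps features directly to token probabilities, with no separate acoustic, pronunciation, or language models to maintain.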
  8. What is E2E ASR? End-to-End (NEST) › Neural End-to-end

    Speech Transcriber › Simple recognition with a single model from speech features Pros › Robust for free speech with disfluency (conversational style) › Robust in noisy environments › Applicable to free-speech service areas - subtitles / customer centre Cons › Needs a very large amount of data › High computational cost › Difficult to customize
  9. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  10. CLOVA Note What is CLOVA Note? › If you want

    to take a note › If you want to search who says what › Aim: Meeting minutes
  11. CLOVA Note Which technologies are necessary? Speaker Recognition ›

    WHO says it? › Estimate how many speakers there are Speech Recognition › Says WHAT? › Free conversation › No patterns
  12. CLOVA Note Example › You just record › We

    recognize, analyze, and save › Now in beta service in Korea
  13. NEST for NAVER News Demo #1 › Subtitles for NAVER

    News › Since Q1 2020 › Good performance on free-style speech https://news.naver.com/main/read.nhn?mode=LSD&mid=tvh&sid2=355&oid=056&aid=0010923940
  14. NEST for NAVER News Demo #2 › Music in the

    background? No problem! › Robust in noisy environments https://entertain.naver.com/read?oid=214&aid=0001069943
  15. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  16. NEST for LINE LIVE Demo #1 › Good performance in

    conversational free style › Robust to disfluency
  17. NEST for LINE LIVE Demo #2 › Good performance even

    with background music › Robust to laughing/applause…
  18. Active Learning Motivation › The NEST system needs a massive

    amount of data (incl. LABELING!) › LABELING is the BOTTLENECK of data collection › Expensive & time-consuming Aim › Make labeling efficient Basic Concept › High confidence → use the recognition result as the LABEL › Low confidence → human labeling › Repeat recursively
  19. Active Learning [Flowchart] Initial Model → Recognize D_U →

    Compute score for D_U → Select samples for labeling → Human labeling → Train model using D_U & D_L → Is satisfied? (No → repeat / Yes → end of training) D_L : labeled data, D_U : unlabeled data
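A minimal Python sketch of the loop in the flowchart above. `recognize`, `human_label`, `train`, and `satisfied` are hypothetical callables supplied by the caller (not CLOVA APIs), and the 0.9 threshold is an arbitrary illustration; the control flow mirrors the slide: high-confidence hypotheses are kept as labels, low-confidence samples go to human labeling, and the model is retrained until the stopping criterion is met.

```python
def active_learning(model, d_labeled, unlabeled_batches,
                    recognize, human_label, train, satisfied,
                    threshold=0.9):
    """Confidence-based active learning: one pass per batch of unlabeled data."""
    for d_u in unlabeled_batches:                 # loop until "Is satisfied?" = Yes
        auto, to_human = [], []
        for utt in d_u:
            hyp, conf = recognize(model, utt)     # "Recognize D_U" + "Compute score for D_U"
            if conf >= threshold:
                auto.append((utt, hyp))           # high confidence: hypothesis becomes the label
            else:
                to_human.append(utt)              # "Select samples for labeling"
        d_labeled += auto + [(u, human_label(u)) for u in to_human]  # "Human labeling"
        model = train(model, d_labeled)           # "Train model using D_U & D_L"
        if satisfied(model):                      # Yes: end of training
            break
    return model, d_labeled
```

Only the low-confidence samples cost human time, which is what makes labeling efficient: the recognizer effectively labels the easy utterances for free.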
  25. Active Learning Effect › Efficient labeling › With fair performance

    [Chart: relative accuracy vs. hand-label ratio. Conditions: hand label 100% (386hr); hand label 33.3% (137hr) + inference 249hr; hand label 20.0% (80hr) + inference 306hr; hand label 14.3% (57hr) + inference 329hr; hand label 10.0% (39hr) + inference 347hr]
  26. Agenda › Introduction › What is End-to-End ASR (NEST)? ›

    NEST Case Study #1 – Korean › NEST Case Study #2 – Japanese › Conclusion
  27. Conclusion CLOVA End-to-End ASR = NEST › Better performance for

    free-style conversation › Robust in noisy environments › Multiple languages