Practical Application of End-to-End Speech Recognition Technology and AI Voice Recording Service "CLOVA Note"

Agenda
- Introduction
- What is E2E-ASR?
- AI Voice Recording Service “CLOVA Note”
- Make E2E-ASR Faster
- Conclusion

CLOVA Speech

Self Introduction: Presenter
Hyuksu Ryu
› Leader of Speech Global Team
› Global Speech Recognition including JP
History
› 2017.09 ~ : CLOVA Speech in NAVER
Interest
› Foreign Languages
› Travel… T^T → Domestic Camping

CLOVA Speech Team
Various Language + Various Domain
› Korean / Japanese / English …
› Speaker (Wave, Friends, Friends Mini, Desk…), AiCall, CareCall, …
Speaker Recognition
› Who is speaking
Speech Recognition
› Classical DNN-HMM ASR
› End-to-End ASR

What is E2E ASR? Comparison to the existing ASR system
The existing ASR System
› Acoustic / Language / Pronunciation Model for ASR
› An update requires all 3 models
Pros
› Robust for patterned speech
› Easy to customize
[Figure: speech waveform → End Point Detection → Feature Extraction → Recognition (Acoustic Model, Language Model, Pronunciation Model) → TEXT]

What is E2E ASR?
End-to-End (NEST)
› Neural End-to-end Speech Transcriber
› Simple recognition with a single model from speech features
› Speech Feature → TEXT: Direct Connection
[Figure: speech waveform → End Point Detection → Feature Extraction → Recognition (NEST) → TEXT]

What is E2E ASR?
End-to-End (NEST)
› Neural End-to-end Speech Transcriber
› Simple recognition with a single model from speech features
Pros
› Robust for free speech with disfluency (Conversational Style)
› Robust in noisy environments
› Applicable to service areas with free speech: Subtitles / Customer centre
Cons
› Needs a very large amount of data
› High computational cost
› Difficult to customize

What is CLOVA Note?

CLOVA Note What is CLOVA Note?

CLOVA Note
› You just record
› We recognize, analyze, and save
› Now in beta service in KR
› For Korean, Japanese, English

What can we do in CLOVA Note

Functions: Select & Listen
› You can select and listen to the desired part based on the recognition result

Functions: Memo
› You can take memos while recording

Functions: Search Keywords
› You can search keywords and find what you want

Functions: Note Sharing
› You can share your notes with your friends & colleagues

Functions: CLOVA Note with Zoom
› You can make CLOVA Notes during a Zoom meeting
› Available from Zoom Marketplace (for Zoom Pro account holders)

Functions CLOVA Note with Zoom

CLOVA Note in KR

Release
2020. Nov. 19 CLOVA Note Release
AI Voice Recording Service CLOVA Note
Make your speech into meaningful writing
Let CLOVA Note do the writing, so you can concentrate on the meeting itself.

How many?
App Download: 1000K
Total Users: 900K
6.2 times more than Jan.
Sept. Avg. WAU: 83K
Market potential as an exclusive AI tech service confirmed in 2021

How many?
CLOVA Note App Download > 1 MILLION
[Chart: app downloads over time (2021.02, 2021.04, 2021.06, 2021.10), reaching 1000K]

Technologies in CLOVA Note

CLOVA Note: Which Technology is necessary?
Speaker Diarization (Clustering)
› “WHO” speaks when?
Speech Recognition
› Says “WHAT”?
› Free Conversation / No Patterns
› Many Proper Nouns, Technical Terms

Self-supervised Learning

Self-supervised Learning
Motivation
› An E2E system needs a massive amount of data (incl. LABELING!)
› LABELING is the BOTTLENECK for data collection
› Expensive & time consuming
› What if we could use UNLABELED data for training? → WOW!
Basic Concept
› Two-Step Training: Pre-training & Fine-tuning
› Pre-training: Using unlabeled data, clustering similar data
› Fine-tuning: Using labeled data, tuning the model in detail

Self-supervised Learning: Pre-training
[Figure sequence: unlabeled data is grouped step by step into clusters #hh, #cc, #vv, #pp]

Self-supervised Learning: Fine-tuning
Training using labels
[Figure: data labeled pizza, hamburger, coffee, vegetable]

Self-supervised Learning: Performance
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.

Keyword Boosting

Keyword Boosting
Motivation
› So many proper nouns & technical terms
› They are difficult to distinguish from words with similar pronunciation
› How can we handle them?
Premise
› User memo in Notes
› The memo is likely to be highly related to the contents of the recording

Keyword Boosting: How to do
› Parse the memo text
› Extract keywords from the parsed text using tf-idf
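
The tf-idf step above could look roughly like the sketch below. It is only an illustration under assumptions: the slides do not say how idf is computed or which corpus is used, so the background memos, the smoothing, the `top_k` cut-off, and the function name `tfidf_keywords` are all hypothetical. The memo tokens are English glosses of the parsed words shown on the example slide.

```python
import math
from collections import Counter

def tfidf_keywords(memo_tokens, background_docs, top_k=4):
    """Rank the tokens of one memo by tf-idf against a background corpus."""
    tf = Counter(memo_tokens)
    n_docs = len(background_docs) + 1  # background docs plus the memo itself
    scores = {}
    for word, count in tf.items():
        # Document frequency: the memo itself plus every background doc with the word.
        df = 1 + sum(1 for doc in background_docs if word in doc)
        idf = math.log(n_docs / df) + 1.0  # "+ 1.0" keeps every score positive
        scores[word] = (count / len(memo_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

# Assumed input: the memo is already parsed into words.
memo = ["upload", "fail", "TTS", "check", "count", "need", "structure"]
# Assumed background memos in which generic words are common.
background = [
    ["check", "need", "schedule"],
    ["structure", "need", "check"],
    ["need", "structure", "budget"],
]
keywords = tfidf_keywords(memo, background)
print(keywords)
```

Words that also appear in many other memos (check, need, structure) get a low idf and drop out, while the memo-specific terms remain as boosting candidates.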

Keyword Boosting: Example
User memo
- アップロードフェイル (upload fail)
- TTSが上がるのを確認必要 (need TTS check)
- カウントする仕組みが必要 (necessary structure for counting)
Parsed words from memo
- アップロード, フェイル, TTS, 確認, カウント, 必要, 仕組み (upload, fail, TTS, check, count, necessary, structure)
Extracted KEYWORDS by tf-idf
- アップロード, フェイル, TTS, カウント (upload, fail, TTS, count)
Example 1: セール (sale) corrected to フェイル (fail)
- (before) できなくてセールがなった時に
- (after) できなくてフェイルがなった時に
Example 2: フィットネス (fitness) corrected to TTS
- (before) 次はフィットネス障害の数ですが
- (after) 次はTTS障害の数ですが
Example 3: チケットスカウト (ticket scout) corrected to TTS カウント (TTS count)
- (before) チケットスカウトも大事ですが
- (after) TTS カウントも大事ですが

Speaker Diarization

Speaker Diarization: What is Speaker Diarization?
› Partitioning an input audio stream into homogeneous segments according to the speaker identity
› Part of speaker recognition
  o a task to identify “who spoke when”
[Figure: Speech Signal → Recognition (“Hello. Nice to meet you.” “Good to see you, too.” “Thank you in advance.”) → Diarization (Speaker 1, Speaker 2, Speaker 1)]

Speaker Diarization
Speech Signal → Feature Extraction → Clustering
› Speech Signal → End Point Detection → Extract Speaker Features → Clustering
› Represent each segment as a vector of fixed dimension using a DNN
[Figure: segment vectors grouped by clustering]
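
The clustering stage of this pipeline can be sketched as follows. The slides do not name the clustering algorithm, so this uses a simple greedy cosine-similarity clustering as a stand-in. The 3-dimensional vectors, the `threshold` value, and the function names are assumptions for illustration; real speaker embeddings come from a DNN and have many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_segments(embeddings, threshold=0.8):
    """Assign each segment to the most similar existing cluster whose centroid
    reaches the threshold; otherwise open a new cluster (a new speaker)."""
    centroids, sizes, labels = [], [], []
    for vec in embeddings:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = sizes[best]
            # Update the running mean of the chosen cluster.
            centroids[best] = [(c * n + v) / (n + 1) for c, v in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
    return labels

# Hypothetical fixed-dimension vectors for three speech segments.
segments = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.8, 0.2, 0.1]]
labels = cluster_segments(segments)
print(labels)
```

Segments with the same label are attributed to the same speaker, which gives the "who spoke when" output once the labels are joined with the segment timestamps.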

Speaker Diarization: Toward better clustering
› Contrastive learning
› Generating more precise clustering by converting the features
[Figure: Existing Clustering vs. Contrastive Learning]

Speaker Diarization: International Challenge (DIHARD3)
› 3rd place among 25 teams worldwide
› https://sat.nist.gov/dihard3#tab_leaderboard

Release SOON in JP!
We look forward to meeting you SOON with CLOVA Note!

Yusuke Kida
LINE, Manager of Speech Team
Biography
• Mainly working on speech recognition through Toshiba and Yahoo JAPAN
• Joined LINE and leading speech team
Favorite
• Third wave coffee, cat

E2E-ASR vs Latency
- Many recent E2E models read the whole speech signal before starting to recognize it, which increases the user's response time (latency)
  → Not suitable for applications that require real-time interaction
- Latency increases further as the input speech gets longer
[Figure: speech input → Result: “Highway and freeway mean the same thing”]

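Why is E2E-ASR Slow?
[Figure: Transformer Model. The Decoder starts from <sos> and outputs “highway”; each output is concatenated (Concat) to the input for the next step: highway → and → freeway → …]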

Chunk-based Decoding
[Figure: one Decoder + Concat per chunk. First chunk: <sos> → highway → and; next chunk: and → freeway → means → …]

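Why is E2E-ASR Slow?
[Figure: Transformer Model. The Decoder starts from <sos> and outputs “highway”; each output is concatenated (Concat) to the input for the next step: highway → and → freeway → …]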

Non-Autoregressive ASR
[Figure: Transformer Model with Decoder and Concat; the output “highway and freeway” is produced at once]
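
The difference between the autoregressive decoder of the "Why is E2E-ASR Slow?" slide and the non-autoregressive one can be illustrated with a toy sketch. `ToyDecoder` is purely hypothetical: it returns a fixed answer and only counts how many sequential decoder calls each strategy needs. It is not the model used in the talk.

```python
TARGET = ["highway", "and", "freeway", "<eos>"]

class ToyDecoder:
    """Stand-in for a Transformer decoder; it only counts its own calls."""

    def __init__(self):
        self.calls = 0

    def next_token(self, prefix):
        # Autoregressive step: one token predicted from all previous tokens.
        self.calls += 1
        return TARGET[len(prefix) - 1]  # prefix always starts with <sos>

    def all_tokens(self):
        # Non-autoregressive pass: every position predicted in parallel.
        self.calls += 1
        return list(TARGET)

def decode_autoregressive(decoder, max_len=10):
    prefix = ["<sos>"]
    while len(prefix) <= max_len:
        token = decoder.next_token(prefix)
        if token == "<eos>":
            break
        prefix.append(token)  # "Concat": the output becomes part of the next input
    return prefix[1:]

def decode_non_autoregressive(decoder):
    return [t for t in decoder.all_tokens() if t != "<eos>"]

ar = ToyDecoder()
ar_text = decode_autoregressive(ar)
nar = ToyDecoder()
nar_text = decode_non_autoregressive(nar)
print(ar_text, ar.calls)
print(nar_text, nar.calls)
```

Both strategies return the same words here, but the autoregressive loop needs one sequential decoder call per output token (plus one for the end marker), while the non-autoregressive pass needs a single call. That sequential dependency is why latency grows with utterance length.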

CTC (Connectionist Temporal Classification)
[Figure: Speech → Encoder × 4 → Decoder → Text, trained with CTC loss]
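
A CTC model emits one label distribution per frame, including a special blank label, and the text is recovered by collapsing repeats and dropping blanks. The slide does not specify the decoding method, so the sketch below shows greedy (best-path) decoding as a common baseline. The vocabulary and the frame probabilities are made up for illustration.

```python
def ctc_greedy_decode(frame_probs, vocab, blank=0):
    """Best-path CTC decoding: take the argmax label of each frame,
    merge consecutive repeats, then remove blank labels."""
    best_ids = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    tokens, prev = [], None
    for idx in best_ids:
        if idx != prev and idx != blank:
            tokens.append(vocab[idx])
        prev = idx
    return tokens

vocab = ["<blank>", "a", "b"]
# Per-frame probabilities whose argmax sequence is: a a <blank> a b b
frame_probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
decoded = ctc_greedy_decode(frame_probs, vocab)
print(decoded)
```

The blank between the second and third "a" frames is what keeps the two "a" tokens separate. All frames can be scored in parallel, which is why CTC suits non-autoregressive recognition.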

InterCTC [Lee, Watanabe 2021]
Training
[Figure: Speech → Encoder × 4 → Decoder → Text, with a CTC loss on the final output and an additional Decoder + CTC loss on an intermediate Encoder layer]
Inference
[Figure: Speech → Encoder × 4 → Decoder → Text]
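
As a rough reading of the training diagram, the two CTC losses are combined into one objective. The formula below follows the weighted sum described in the cited InterCTC paper. The symbol names and the weight $w$ are notation chosen here, and the slide itself does not give the weight used.

```latex
\mathcal{L} = (1 - w)\,\mathcal{L}_{\mathrm{CTC}} + w\,\mathcal{L}_{\mathrm{InterCTC}},
\qquad 0 \le w \le 1
```

Here $\mathcal{L}_{\mathrm{CTC}}$ is the loss on the final encoder output and $\mathcal{L}_{\mathrm{InterCTC}}$ is the loss on the intermediate layer. Since the intermediate branch serves as a training-time regularizer, inference cost is unchanged.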

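Self-Conditioned CTC [Nozaki, Komatsu 2021]
Training
[Figure: Speech → Encoder × 4 → Decoder → Text, with a CTC loss on the final output and an additional Decoder + CTC loss on an intermediate Encoder layer]
Inference
[Figure: Speech → Encoder × 4 → Decoder → Text; the intermediate Decoder is also used at inference]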

ASR Performance & Inference Speed
Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” ASRU 2021

How to Handle Queries
[Figure: Queries (chunks) A, B, C arriving over time; GPU scheduling assigns the chunks to Slots over time]

LINE NVIDIA Collaboration

Triton Inference Server
[Figure: Queries (chunks) A, B, C arriving over time; GPU scheduling assigns the chunks to Slots over time]

Publications from LINE/NAVER Speech 9+9 papers accepted!

Conclusion
› CLOVA Note will be released soon! Don’t miss it!
› LINE/NAVER will keep on innovating!
› Working intensely on making E2E-ASR faster

More Decks by LINE DEVDAY 2021
