LINE CLOVA: Recent Achievements and Future Plans

Industrial Workshop in APSIPA 2021, 2021.12

Yusuke Kida, Manager of Speech Team (ASR)
Biography: MS of Kyoto University; Toshiba Corporation; Yahoo JAPAN; LINE

Kentaro Tachibana, Manager of Voice Team (TTS)
Biography: MS of Nara Institute of Technology; Toshiba Corporation (Loaned to NICT); DeNA; LINE

Agenda
- About LINE & LINE CLOVA
- Recent Achievements of Speech Technologies
- Future Plans (R&D Vision)
- Q&A

MAU: 173 million (Top 4 Regions): 89 million, 52 million, 21 million, 10 million

LINE's Mission

LINE’s AI Technology
LINE AI: Speech, Video, Voice, NLU, Data, OCR, Vision, Face
LINE Shopping Lens Adult Image Filter Scene Classification Ad image Filter Visual Search Analogous image Product Image Lip Reading Fashion Image Spot Clustering Food Image Indonesia LINE Split Bill LINE MUSIC Playlist OCR LINE CONOMI Handwritten Font Receipt OCR Credit card OCR Bill OCR Document Intelligence Identification Face Sign eKYC Face Sign Auto Cut Auto Cam Transcription Telephone network Voice recognition Single-Demand STT Simple voice High quality voice Voice Style Transfer Active Learning Federated Learning Action recognition Pose estimation Speech Note Vlive Auto Highlight Content Center AI CLOVA Dubbing LINE AiCall CLOVA Speaker Gatebox Papago Video Insight LINE CLOVA AI Interactive Avatar Interactive Avatar Media 3D Avatar LINE Profile Lip Reading

LINE CLOVA: LINE’s AI Technology Brand
Products: CLOVA Chatbot, CLOVA OCR, CLOVA Voice, CLOVA Speech, CLOVA Text Analytics, CLOVA Face, CLOVA Assistant
Solutions: LINE AiCall, LINE eKYC
Devices: CLOVA Friends, CLOVA Friends mini, CLOVA Desk, CLOVA WAVE

Speech Team
R&D of ASR and related technologies
- End-to-End ASR
- DNN-HMM Hybrid ASR
- Acoustic Event Detection
- Speech Enhancement

CTC-based ASR
• Pros & Cons
  • Predicts tokens in parallel ⇒ Fast decoding
  • The conditional independence assumption between tokens ⇒ Low accuracy
• Our Target
  • Fast & Accurate CTC-Decoding
[Diagram: Speech, stacked Encoder layers, Softmax, CTC loss, Text]
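The "predict tokens in parallel" property can be illustrated with standard greedy CTC decoding. The following is a minimal sketch, not the decoder used by the team: the blank index, the toy vocabulary, and the posterior values are all assumptions made for illustration.

```python
BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_greedy_decode(frame_posteriors, vocab):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    # Each frame picks its best token independently of all other frames,
    # which is why this step can run in parallel over time.
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    out = []
    prev = None
    for idx in best:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != BLANK:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


# Toy example with a hypothetical three-symbol vocabulary.
vocab = ["<blank>", "a", "b"]
posteriors = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two "a" tokens)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
print(ctc_greedy_decode(posteriors, vocab))  # aab
```

Because every frame is scored without looking at the tokens chosen for other frames, decoding is fast, and the same independence is what the slide names as the source of lower accuracy.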

Self-Conditioned CTC [Nozaki, Komatsu 2021]
[Diagram, Training: Speech, stacked Encoder layers, Softmax and CTC loss at an intermediate layer and at the final layer, Text]
[Diagram, Inference: Speech, stacked Encoder layers with an intermediate Softmax, Decoder, Text]
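The idea of conditioning later encoder layers on intermediate predictions can be sketched as below. This is a toy forward pass under stated assumptions, not the published model: the encoder layer is a stand-in (residual plus tanh of a linear map), all weights are random, and the names `W_out`, `W_back`, and `condition_after` are hypothetical.

```python
import math
import random


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encoder_layer(W, frames):
    # Toy stand-in for a real encoder block: residual + tanh(linear).
    return [[h + math.tanh(z) for h, z in zip(x, matvec(W, x))] for x in frames]


def self_conditioned_forward(frames, layers, W_out, W_back, condition_after):
    """Run the encoder stack, feeding intermediate token posteriors back in."""
    h = frames
    intermediate = []
    for i, W in enumerate(layers, start=1):
        h = encoder_layer(W, h)
        if i in condition_after:
            # Intermediate prediction: per-frame posterior over tokens.
            post = [softmax(matvec(W_out, x)) for x in h]
            intermediate.append(post)
            # Project the posterior back to the hidden size and add it,
            # so the following layers are conditioned on this prediction.
            h = [[a + b for a, b in zip(x, matvec(W_back, p))]
                 for x, p in zip(h, post)]
    final = [softmax(matvec(W_out, x)) for x in h]
    return final, intermediate


rng = random.Random(0)
HIDDEN, VOCAB, FRAMES = 4, 3, 5  # toy sizes
layers = [rand_matrix(HIDDEN, HIDDEN, rng) for _ in range(4)]
W_out = rand_matrix(VOCAB, HIDDEN, rng)   # hidden -> token logits
W_back = rand_matrix(HIDDEN, VOCAB, rng)  # token posterior -> hidden
frames = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(FRAMES)]
final, intermediate = self_conditioned_forward(
    frames, layers, W_out, W_back, condition_after={2})
print(len(final), len(intermediate))  # 5 1
```

During training, a CTC loss would be attached to both the intermediate and the final posteriors, matching the two CTC losses in the training diagram; the loss itself is omitted here.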

ASR Results
Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation,” ASRU 2021

CLOVA Note

CLOVA Note with Zoom

Technologies in CLOVA Note
- Self-Supervised Learning
- Speaker Diarization (3rd place in DIHARD3)
- Keyword Boosting

Keyword Boosting
Problem: how to recognize proper nouns and technical terms in E2E-ASR, which are rarely seen in training data.
Keyword Boosting in CLOVA Note: User Note / Memo → Keyword Detection → ASR with Boosting
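The slide does not say how boosting is applied inside the recognizer. One simple way to picture it is rescoring an n-best list so that hypotheses containing a detected keyword receive a score bonus. The sketch below assumes exactly that; the function name, the `bonus` value, and the hypothesis scores are hypothetical, and a production system may instead bias scores during beam search.

```python
def boost_nbest(nbest, keywords, bonus=2.0):
    """Rescore (text, log_score) hypotheses, rewarding keyword occurrences."""
    rescored = []
    for text, score in nbest:
        # Count how many times any boosted keyword appears in the hypothesis.
        hits = sum(text.count(k) for k in keywords)
        rescored.append((text, score + bonus * hits))
    # Highest score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list: the acoustically preferred hypothesis
# misrecognizes a product name that appears in the user's memo.
nbest = [
    ("open clover note", -4.0),
    ("open CLOVA Note", -5.0),
]
keywords = ["CLOVA Note"]  # e.g. detected from the user's note or memo
best_text, best_score = boost_nbest(nbest, keywords)[0]
print(best_text, best_score)  # open CLOVA Note -3.0
```

The bonus trades off recall of rare terms against false insertions, so in practice it would need tuning.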

Voice team
Controllable, high-quality, and expressive TTS
- High-quality Neural Vocoder
- Controllable TTS with emotion strength
- Text Analyzer

Controllable TTS with emotion strength
Feed the emotion label and strength to an acoustic model.
Pipeline: Text (ありがとう) → Text analyzer → Contextual features [ a- ri^ ga- to- o- ] → Acoustic model → Vocoder → Waveform
Additional acoustic-model inputs: emotion label (e.g., Happy) and emotional strength (0 to 1)
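One common way to feed an emotion label and a strength value to an acoustic model is to append them to each frame of contextual features. The sketch below assumes that scheme; the label set, the one-hot encoding, and the feature values are illustrative assumptions rather than the team's actual input format.

```python
EMOTIONS = ["neutral", "happy", "sad"]  # hypothetical label set


def conditioning_vector(contextual_features, emotion, strength):
    """Append a one-hot emotion label and a scalar strength to the features."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    one_hot = [1.0 if e == emotion else 0.0 for e in EMOTIONS]
    return list(contextual_features) + one_hot + [strength]


# Toy two-dimensional contextual features for a single frame.
frame = conditioning_vector([0.2, -0.1], "happy", 0.7)
print(frame)  # [0.2, -0.1, 0.0, 1.0, 0.0, 0.7]
```

At synthesis time the strength value can then be varied between 0 and 1 for the same text and label to control how strongly the emotion is expressed.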

Prediction of emotion strength
Consider two ways to annotate emotion strength (weak to strong, for happy and sad) from a speech waveform:
1. Human annotators: label emotion strength by listening.
   Pro: More expressive in TTS
   Con: High cost
2. Feature extractor: predict emotion strength with a ranking algorithm.
   Pro: Annotates automatically
   Con: In some cases, reduction of expressiveness in TTS
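The slide names only "a ranking algorithm", so the following is a generic pairwise-ranking sketch rather than the team's method. It assumes that every emotional utterance should score higher than every neutral one, learns a linear scoring function with a margin-based perceptron update, and min-max normalizes the scores to the 0 to 1 strength range. The feature vectors are made up.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_ranker(pairs, dim, epochs=100, lr=0.1, margin=1.0):
    """Learn w so that w.stronger exceeds w.weaker by a margin for each pair."""
    w = [0.0] * dim
    for _ in range(epochs):
        for stronger, weaker in pairs:
            # Update only when the pair is ranked incorrectly or too closely.
            if dot(w, stronger) - dot(w, weaker) < margin:
                w = [wi + lr * (s - k) for wi, s, k in zip(w, stronger, weaker)]
    return w


def strengths(w, utterances):
    """Min-max normalize ranking scores to [0, 1]."""
    scores = [dot(w, x) for x in utterances]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Made-up two-dimensional acoustic features per utterance.
neutral = [[0.1, 0.2], [0.2, 0.1]]
happy = [[0.9, 0.3], [0.6, 0.2]]
# Assumption: each happy utterance is "stronger" than each neutral one.
pairs = [(h, n) for h in happy for n in neutral]
w = train_ranker(pairs, dim=2)
s = strengths(w, neutral + happy)
print([round(v, 2) for v in s])
```

The resulting values give every utterance an automatic strength label, which is the low-cost alternative to human listening described on the slide.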

Acoustic model training
Train the acoustic model conditioned on the predicted emotion strength and the emotion label.
Pipeline: Text (ありがとう) → Text analyzer → Contextual features [ a- ri^ ga- to- o- ] → Acoustic model → Vocoder → Speech
Conditioning inputs: emotion strength (0 to 1), predicted from speech, and emotion label (e.g., Happiness)

Demonstration
Speech samples in Japanese: Happy, Sad, Neutral

Neural vocoder
Goal: develop a fast, small-footprint neural vocoder with GPUs.

WaveNet (WN) [1] (AR vocoder)
- Generates speech one sample at a time from the acoustic features and the previous speech samples.
- Pro: High quality
- Con: Very slow

Parallel WaveGAN (PWG) [2] (non-AR vocoder)
- Generates the speech samples for the whole input length at once, from noise and the acoustic features.
- Achieves high quality using a GAN: a discriminator judges whether speech is real (recorded) or fake (generated).
- Pro: High speed (10,000 times faster than WN)
- Con: Low quality compared to WN

[1] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv, 2016.
[2] R. Yamamoto et al., "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020.

High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]
Drawback of PWG: deterioration of quality on harmonic components.
[Figure: spectrograms (time vs. frequency) for each method: Recording, WaveNet, PWG]
[3] M.-J. Hwang et al., “High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model,” in Proc. INTERSPEECH, 2021.

High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]
Adapt the harmonic-plus-noise model to the PWG’s generator.
Split speech into harmonic and noise components.
[Diagram: Speech = Harmonic Component + Noise Component]

High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]
Model the harmonic and noise components with a separate generator for each.
[Diagram: Harmonic source → Harmonic generator → Harmonic Component; Noise source → Noise generator → Noise Component; the two components are summed into Speech]
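The two-branch structure can be sketched with simple signal sources. This sketch assumes a sine excitation driven by the F0 contour as the harmonic source and Gaussian noise as the noise source, which is common in source-filter style vocoders but is not stated on the slide. The two generators are identity stand-ins for the neural networks.

```python
import math
import random


def harmonic_source(f0_per_sample, sample_rate):
    """Sine excitation that follows F0; silent where F0 is 0 (unvoiced)."""
    phase = 0.0
    out = []
    for f0 in f0_per_sample:
        if f0 > 0.0:
            # Accumulate phase so the sine stays continuous as F0 changes.
            phase += 2.0 * math.pi * f0 / sample_rate
            out.append(math.sin(phase))
        else:
            out.append(0.0)
    return out


def noise_source(n, rng, scale=0.1):
    return [rng.gauss(0.0, scale) for _ in range(n)]


def synthesize(f0_per_sample, sample_rate, rng,
               harmonic_gen=lambda x: x, noise_gen=lambda x: x):
    """Sum the outputs of the (stand-in) harmonic and noise generators."""
    h = harmonic_gen(harmonic_source(f0_per_sample, sample_rate))
    n = noise_gen(noise_source(len(f0_per_sample), rng))
    return [a + b for a, b in zip(h, n)], h, n


rng = random.Random(0)
sr = 16000
# 10 ms voiced at 200 Hz followed by 10 ms unvoiced.
f0 = [200.0] * 160 + [0.0] * 160
speech, harmonic, noise = synthesize(f0, sr, rng)
print(len(speech))  # 320
```

Keeping the periodic and aperiodic parts in separate branches lets each generator specialize, which is the motivation the slides give for improving harmonic quality over plain PWG.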

High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]
Speech samples: Harmonic, Noise, Harmonic + noise, Ground truth

High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]
Subjective and objective evaluations showed the advantage of HN-PWG.

Model               Model size (M) ↓   Inference speed (RTF) ↓   MOS ↑ (analysis/synthesis)   MOS ↑ (TTS)
WaveNet             3.81               294.12                    4.22                         4.03
PWG                 0.94               0.02                      3.46                         3.56
HN-PWG              0.94               0.02                      4.18                         4.01
Multi-band HN-PWG   0.99               0.02                      4.29                         4.03

Recordings: MOS 4.41 (model size and RTF not applicable).
RTF: smaller values are faster. MOS: scale from 1 to 5; higher values mean higher quality.
[3] M.-J. Hwang et al., “High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model,” in Proc. INTERSPEECH, 2021.
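The real-time factor (RTF) is conventionally the processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. The short sketch below applies that definition and computes speedups from the rounded RTF values in the table; the helper names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Rounded RTF values copied from the table above.
RTF = {
    "WaveNet": 294.12,
    "PWG": 0.02,
    "HN-PWG": 0.02,
    "Multi-band HN-PWG": 0.02,
}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`, based on RTF."""
    return RTF[slow] / RTF[fast]


# Example: 10 s of audio generated in 0.2 s gives RTF 0.02.
print(round(real_time_factor(0.2, 10.0), 4))
# Speedup of PWG-family models over WaveNet, from the rounded table values.
print(round(speedup("WaveNet", "HN-PWG")))
```

Because the table values are rounded to two decimals, the computed ratio is only approximate; it lands in the same order of magnitude as the "10,000 times faster" figure quoted for PWG.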

Large-scale general-purpose language models

Modeling status of HyperCLOVA
Model sizes shown: 1.3B → 6.7B → 13B → 39B; 13B → 39B; 82B; 204B~ (in 2022)
Labels shown: Multi-lingual Model; Large model JP / Multi-lingual; Hyper scale JP Model; JP Model; Work in progress

AI R&D Division
Teams: Speech, Voice, NLP, CV, Trustworthy AI, …
Each team: Manager, Researchers, Engineers

Major Publications

INTERSPEECH 2021
• Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions (J. Nozaki, T. Komatsu)
• Acoustic Event Detection with Classifier Chains (T. Komatsu, S. Watanabe, K. Miyazaki, T. Hayashi)
• Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation (Y. Nakagome, M. Togami, T. Ogawa, T. Kobayashi)
• Sound Source Localization with Majorization Minimization (M. Togami, R. Scheibler)

ICASSP 2021
• End-to-End Learning for Convolutive Multi-Channel Wiener Filtering (M. Togami)
• Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS (D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari)
• Surrogate Source Model Learning for Determined Source Separation (R. Scheibler, M. Togami)
• Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold (R. Scheibler, M. Togami)
• Joint Dereverberation and Separation with Iterative Source Steering (T. Nakashima, R. Scheibler, M. Togami, N. Ono)
