
LINE CLOVA: Recent Achievements and Future Plans


Yusuke Kida (LINE Corporation)
Kentaro Tachibana (LINE Corporation)
Presentation material at the lunchtime workshop of APSIPA 2021
https://www.apsipa2021.org/


LINE Developers

December 15, 2021


Transcript

  1. LINE CLOVA: Recent Achievements and Future Plans Industrial Workshop in

    APSIPA 2021 2021.12
  2. Yusuke Kida, Manager of Speech Team (ASR). Biography: M.S. of Kyoto University ・ Toshiba Corporation ・ Yahoo JAPAN ・ LINE

    Kentaro Tachibana, Manager of Voice Team (TTS). Biography: M.S. of Nara Institute of Technology ・ Toshiba Corporation (loaned to NICT) ・ DeNA ・ LINE
  3. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  4. None
  5. MAU: 173 million (Top 4 regions: 89 million, 52 million, 21 million, 10 million)

  6. LINE's Mission

  7. LINE AI Speech Video Voice NLU Data OCR Vision Face

    LINE Shopping Lens Adult Image Filter Scene Classification Ad image Filter Visual Search Analogous image Product Image Lip Reading Fashion Image Spot Clustering Food Image Indonesia LINE Split Bill LINE MUSIC Playlist OCR LINE CONOMI Handwritten Font Receipt OCR Credit card OCR Bill OCR Document Intelligence Identification Face Sign eKYC Face Sign Auto Cut Auto Cam Transcription Telephone network Voice recognition Single-Demand STT Simple voice High quality voice Voice Style Transfer Active Learning Federated Learning Action recognition Pose estimation Speech Note Vlive Auto Highlight Content Center AI CLOVA Dubbing LINE AiCall CLOVA Speaker Gatebox Papago Video Insight LINE CLOVA AI Interactive Avatar Interactive Avatar Media 3D Avatar LINE Profile Lip Reading LINE’s AI Technology
  8. LINE CLOVA Products CLOVA Chatbot CLOVA OCR CLOVA Voice CLOVA

    Speech CLOVA Text Analytics CLOVA Face CLOVA Assistant LINE AiCall LINE eKYC Solutions Devices CLOVA Friends CLOVA Friends mini CLOVA Desk CLOVA WAVE LINE’s AI Technology Brand
  9. None
  10. None
  11. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  12. Speech Team R&D of ASR and related technologies: End-to-End ASR,

    DNN-HMM Hybrid ASR, Acoustic Event Detection, Speech Enhancement
  13. CTC-based ASR • Pros & Cons • Predicts tokens in

    parallel ⇒ fast decoding • Conditional independence assumption between tokens ⇒ low accuracy • Our target: fast & accurate CTC decoding [Diagram: Speech → stacked Encoders → Softmax → CTC loss → Text]
  14. Self-Conditioned CTC [Nozaki, Komatsu 2021] Training: [Diagram: Speech → Encoder ×N → Softmax → CTC loss → Text,

    with an intermediate Encoder output also passed through a Softmax and an auxiliary CTC loss] Inference: [Diagram: Speech → Encoder ×N → Softmax → Decoder → Text]
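The core trick of self-conditioned CTC is to take the intermediate CTC posteriors, map them back into the feature space, and add them to the hidden sequence, so later layers see the earlier prediction and the independence assumption is relaxed. A minimal numpy sketch of that conditioning step (shapes and projection names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, D, V = 6, 8, 5                    # frames, hidden size, vocab size (toy)
W_out = rng.normal(size=(D, V))      # projection to token posteriors
W_in = rng.normal(size=(V, D))       # maps posteriors back to feature space

h = rng.normal(size=(T, D))              # output of an intermediate block
intermediate_post = softmax(h @ W_out)   # intermediate CTC prediction
h_conditioned = h + intermediate_post @ W_in  # self-conditioning: feed it back
```

During training, `intermediate_post` also receives its own CTC loss, which is what makes the intermediate prediction informative.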
  15. Y. Higuchi et al., “A Comparative Study on Non-Autoregressive Modelings

    for Speech-to-Text Generation,” ASRU 2021. ASR Results
  16. CLOVA Note

  17. CLOVA Note with Zoom

  18. Technologies in CLOVA Note Self-Supervised Learning Speaker Diarization Keyword Boosting

    3rd place in DIHARD3
  19. Keyword Boosting How to recognize proper nouns & technical terms

    in E2E ASR that are rarely seen in training data. [Diagram: User Note / Memo → Keyword Detection → ASR with Boosting] Keyword Boosting in CLOVA Note
  20. Voice team Controllable, high-quality, and expressive TTS High-quality Neural Vocoder

    Controllable TTS with emotion strength Text Analyzer
  21. Controllable TTS with emotion strength Feed an emotion label and strength

    to the acoustic model. [Diagram: Text (ありがとう, [ a- ri^ ga- to- o- ]) → Text analyzer → Contextual features → Acoustic model → Vocoder → Waveform; conditioned on the emotion (Happy) and an emotional strength in 0–1]
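Conditioning on an emotion label and a scalar strength can be as simple as appending a strength-scaled one-hot vector to each frame of contextual features. A minimal sketch under that assumption (the emotion set, feature sizes, and function name are illustrative, not LINE's):

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "sad"]  # hypothetical label set

def condition_features(contextual, emotion, strength):
    """Append an emotion one-hot scaled by strength in [0, 1] to each frame."""
    one_hot = np.zeros(len(EMOTIONS))
    one_hot[EMOTIONS.index(emotion)] = strength
    cond = np.tile(one_hot, (contextual.shape[0], 1))
    return np.concatenate([contextual, cond], axis=1)

ctx = np.zeros((4, 10))                  # 4 frames of contextual features
x = condition_features(ctx, "happy", 0.8)
print(x.shape)  # -> (4, 13)
```

Using a continuous strength (rather than a discrete label alone) is what makes the expressiveness controllable at synthesis time.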
  22. Prediction of emotion strength Consider two ways to annotate emotion

    strength (for happy and sad) from speech waveforms. (1) Human annotators label emotion strength (weak to strong) by listening. Pros: more expressive TTS. Cons: high cost. (2) A feature extractor predicts emotion strength with a ranking algorithm. Pros: annotates automatically. Cons: in some cases, reduced expressiveness in TTS.
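A ranking algorithm can learn a strength scorer from ordered pairs of utterances (weaker, stronger) without absolute labels. The sketch below is a hypothetical illustration of that idea with a linear scorer and a pairwise perceptron, not the specific algorithm used by the Voice team:

```python
import numpy as np

def train_ranker(pairs, dim, epochs=100, lr=0.1):
    """Learn w so that w @ strong > w @ weak for every ordered pair."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for weak, strong in pairs:
            if w @ strong <= w @ weak:   # mis-ordered pair -> perceptron update
                w += lr * (strong - weak)
    return w

rng = np.random.default_rng(1)
true_w = np.array([1.0, -0.5, 2.0])      # hidden "ground-truth" strength
feats = rng.normal(size=(20, 3))         # toy utterance features
scores = feats @ true_w
pairs = [(feats[i], feats[j])
         for i in range(20) for j in range(20)
         if scores[i] < scores[j] - 0.5]  # keep clearly ordered pairs

w = train_ranker(pairs, dim=3)
acc = np.mean([float(w @ s > w @ wk) for wk, s in pairs])
```

The learned score can then be rescaled to [0, 1] and used as the automatic strength annotation.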
  23. Acoustic model training Train the acoustic model conditioned on the predicted emotion

    strength and the emotion label. [Diagram: Text (ありがとう, [ a- ri^ ga- to- o- ]) → Text analyzer → Contextual features → Acoustic model → Vocoder → Speech; the emotion strength in 0–1 is predicted from speech and combined with the emotion label (e.g., happiness)]
  24. Demonstration 😄 Happy 😨 Sad 😐 Neutral Speech samples in

    Japanese
  25. Neural vocoder Develop fast and small-footprint neural vocoder with GPUs

    WaveNet (WN) [1]: AR vocoder; generates each speech sample from acoustic features and the previous speech samples. Pros: high quality. Cons: very slow. [1] A. van den Oord et al., “WaveNet: A Generative Model for Raw Audio,” arXiv, 2016. Parallel WaveGAN (PWG) [2]: non-AR vocoder; generates the speech samples for the whole input length at once from noise and acoustic features, and achieves high quality using a GAN (a discriminator judges generated vs. recorded speech as real or fake). Pros: high speed (10,000 times faster than WN). Cons: lower quality than WN. [2] R. Yamamoto et al., “Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram,” in Proc. ICASSP, 2020.
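The speed gap between the two families comes from the generation loop itself: an AR vocoder must produce samples one at a time because each depends on its own previous outputs, while a non-AR vocoder maps the whole conditioning sequence to a waveform in one vectorized pass. A toy contrast (not WaveNet or PWG, just the two computation patterns):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
cond = rng.normal(size=T)  # toy conditioning features, one per sample

def ar_generate(cond, a=0.9):
    """Sequential: sample n cannot be computed before sample n-1 exists."""
    y = np.zeros(len(cond))
    for n in range(1, len(cond)):
        y[n] = a * y[n - 1] + cond[n]
    return y

def parallel_generate(cond):
    """Non-AR: one vectorized pass over the whole conditioning sequence."""
    return np.convolve(cond, [1.0, 0.5], mode="same")

print(ar_generate(cond).shape, parallel_generate(cond).shape)
```

On real hardware the parallel pass maps onto a single batched GPU kernel, which is where the quoted 10,000x speedup comes from.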
  26. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3] Drawback of PWG: deterioration of quality on harmonic components. [3] M.-J. Hwang et al.,

    “High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-plus-Noise Model,” in Proc. INTERSPEECH, 2021. [Figure: spectrograms (time vs. frequency) for Recording, WaveNet, and PWG]
  27. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3] Adapt the harmonic-plus-noise model to the PWG’s generator:

    split speech into a harmonic component and a noise component.
  28. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3] Model the harmonic and noise components with separate

    generators. [Diagram: harmonic source → Harmonic generator → Harmonic Component; noise source → Noise generator → Noise Component; the two sum to Speech]
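The decomposition the two generators target can be illustrated with a toy signal: the harmonic part is a sum of sinusoids at multiples of f0, and the noise part is the aperiodic residual. This is only an illustration of the split, not the HN-PWG generator itself (which learns both parts with neural networks):

```python
import numpy as np

sr, f0, dur = 16000, 200.0, 0.05
t = np.arange(int(sr * dur)) / sr

# Harmonic component: sinusoids at integer multiples of f0.
harmonic = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 4))
# Noise component: aperiodic part of the signal.
noise = 0.1 * np.random.default_rng(2).normal(size=t.size)
speech = harmonic + noise            # toy "recorded" signal

residual = speech - harmonic         # in practice this must be estimated
```

Modeling the two components separately lets the harmonic generator focus on the periodic structure that plain PWG degrades.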
  29. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3] Speech samples: Harmonic, Noise, Harmonic + noise,

    Ground truth
  30. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3] Subjective and objective evaluations showed the advantage

    of HN-PWG.

    Model              | Size ↓ (M) | Speed ↓ (RTF) | MOS ↑ (analysis/synthesis) | MOS ↑ (TTS)
    WaveNet            | 3.81       | 294.12        | 4.22                       | 4.03
    PWG                | 0.94       | 0.02          | 3.46                       | 3.56
    HN-PWG             | 0.94       | 0.02          | 4.18                       | 4.01
    Multi-band HN-PWG  | 0.99       | 0.02          | 4.29                       | 4.03
    Recordings         | -          | -             | 4.41                       |

    Smaller RTF values are faster. MOS is on a 1-to-5 scale; higher is better. [3] M.-J. Hwang et al., “High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-plus-Noise Model,” in Proc. INTERSPEECH, 2021.
  33. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  34. None
  35. None
  36. None
  37. None
  38. None
  39. Large-scale general-purpose language models

  40. Modeling status of HyperCLOVA JP Model: 1.3B → 6.7B → 13B →

    39B; Multi-lingual Model: 13B → 39B; Large JP / Multi-lingual model: 82B; Hyper-scale JP Model: 204B〜 (in 2022, work in progress)
  41. None
  42. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  43. None
  44. AI R&D Division: Speech, Voice, NLP, CV,

    Trustworthy AI teams, each with a Manager, Researchers, and Engineers …
  45. INTERSPEECH 2021 • Relaxing the Conditional Independence Assumption of CTC-based

    ASR by Conditioning on Intermediate Predictions J. Nozaki, T. Komatsu • Acoustic Event Detection with Classifier Chains T. Komatsu, S. Watanabe, K. Miyazaki, T. Hayashi • Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation Y. Nakagome, M. Togami, T. Ogawa, T. Kobayashi • Sound Source Localization with Majorization Minimization M. Togami, R. Scheibler ICASSP 2021 • End-to-End Learning for Convolutive Multi-Channel Wiener Filtering M. Togami • Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari • Surrogate Source Model Learning for Determined Source Separation R. Scheibler, M. Togami • Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold R. Scheibler, M. Togami • Joint Dereverberation and Separation with Iterative Source Steering T. Nakashima, R. Scheibler, M. Togami, N. Ono Major Publications
  46. None