LINE CLOVA: Recent Achievements and Future Plans


Yusuke Kida (LINE Corporation)
Kentaro Tachibana (LINE Corporation)
Presentation material at the lunchtime workshop of APSIPA 2021
https://www.apsipa2021.org/

LINE Developers

December 15, 2021
Transcript

  1. Yusuke Kida, Manager of Speech Team (ASR). Biography: MS of Kyoto University / Toshiba Corporation / Yahoo JAPAN / LINE

    Kentaro Tachibana, Manager of Voice Team (TTS). Biography: MS of Nara Institute of Technology / Toshiba Corporation (loaned to NICT) / DeNA / LINE
  2. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  3. LINE's AI Technology: an overview map spanning Speech, Video, Voice, NLU, Data, OCR, Vision, and Face. Products and technologies shown include LINE Shopping Lens, Adult Image Filter, Scene Classification, Ad Image Filter, Visual Search, Analogous Image, Product Image, Lip Reading, Fashion Image, Spot Clustering, Food Image, Indonesia LINE Split Bill, LINE MUSIC Playlist OCR, LINE CONOMI, Handwritten Font, Receipt OCR, Credit Card OCR, Bill OCR, Document Intelligence, Identification, Face Sign, eKYC, Auto Cut, Auto Cam, Transcription, Telephone-network Voice Recognition, Single-Demand STT, Simple Voice, High-Quality Voice, Voice Style Transfer, Active Learning, Federated Learning, Action Recognition, Pose Estimation, Speech Note, Vlive Auto Highlight, Content Center AI, CLOVA Dubbing, LINE AiCall, CLOVA Speaker, Gatebox, Papago, Video Insight, Interactive Avatar, Media 3D Avatar, and LINE Profile.
  4. LINE CLOVA Products CLOVA Chatbot CLOVA OCR CLOVA Voice CLOVA

    Speech CLOVA Text Analytics CLOVA Face CLOVA Assistant LINE AiCall LINE eKYC Solutions Devices CLOVA Friends CLOVA Friends mini CLOVA Desk CLOVA WAVE LINE’s AI Technology Brand
  5. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  6. Speech Team: R&D of ASR and related technologies, including End-to-End ASR, DNN-HMM Hybrid ASR, Acoustic Event Detection, and Speech Enhancement.
  7. CTC-based ASR. Pros and cons: tokens are predicted in parallel, so decoding is fast, but the conditional independence assumption between tokens lowers accuracy. Our target: fast and accurate CTC decoding. (Diagram: Speech → Encoder stack → Softmax → CTC loss → Text.)
  8. Self-Conditioned CTC [Nozaki, Komatsu 2021]. (Diagrams: Training, Speech → Encoder stack with an intermediate Softmax and CTC loss whose output conditions the later encoder layers, plus a final Softmax and CTC loss against Text; Inference, Speech → Encoder stack → Decoder → Text.)
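The conditioning step in Self-Conditioned CTC can be sketched at the shape level: an intermediate softmax produces token posteriors mid-encoder, and those posteriors are projected back to the model dimension and added to the hidden sequence, so later layers see the intermediate prediction. This is an illustrative sketch with assumed toy sizes, not the paper's code:

```python
import numpy as np

# Self-conditioned CTC, shape-level sketch. The intermediate CTC posterior
# (T, V) is projected back to the model dimension (T, D) and added to the
# hidden features, relaxing the conditional independence between tokens.

rng = np.random.default_rng(0)
T, D, V = 50, 8, 30                      # frames, model dim, vocab (toy sizes)
W_out = rng.normal(size=(D, V)) * 0.1    # output projection for the softmax
W_in = rng.normal(size=(V, D)) * 0.1     # projects posteriors back to dim D

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_condition(hidden):
    """hidden: (T, D) mid-encoder features -> conditioned (T, D) features."""
    posterior = softmax(hidden @ W_out)  # (T, V) intermediate CTC prediction
    return hidden + posterior @ W_in     # later layers are conditioned on it

h = rng.normal(size=(T, D))
h_cond = self_condition(h)
print(h_cond.shape)  # (50, 8): same shape, ready for the next encoder layer
```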
  9. ASR Results: Y. Higuchi et al., "A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation," ASRU 2021.
  10. Keyword Boosting: how to recognize proper nouns and technical terms in E2E ASR, which are rarely seen in training data. Keyword Boosting in CLOVA Note. (Diagram: User Note/Memo → Keyword Detection → ASR with Boosting.)
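One common way to realize keyword boosting is shallow-fusion-style rescoring: during beam search, hypotheses that extend a prefix of a registered keyword receive a score bonus. The sketch below illustrates that idea only; the CLOVA Note internals are not public, and the keyword list and bonus weight are assumptions:

```python
# Keyword boosting, shallow-fusion-style sketch. A hypothesis ending in a
# prefix of a registered keyword gets a bonus proportional to the matched
# length, steering the decoder toward rare proper nouns and terms.

def keyword_bonus(hypothesis, keywords, bonus_per_char=0.5):
    """Return the boost for the longest keyword prefix ending the hypothesis."""
    best = 0.0
    for kw in keywords:
        for n in range(1, len(kw) + 1):
            if hypothesis.endswith(kw[:n]):
                best = max(best, n * bonus_per_char)
    return best

keywords = ["APSIPA", "CLOVA"]            # assumed user-registered keywords
print(keyword_bonus("WELCOME TO APSIPA", keywords))  # 3.0: full keyword matched
print(keyword_bonus("HELLO", keywords))              # 0.0: no keyword prefix
```

In a real decoder the bonus would be applied incrementally per token (and retracted if the keyword is not completed), typically with a prefix trie instead of this quadratic scan.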
  11. Controllable TTS with emotion strength: feed an emotion label and strength to the acoustic model. (Diagram: Text ありがとう → Text analyzer → contextual features [a- ri^ ga- to- o-] → Acoustic model, conditioned on the emotion label Happy and an emotion strength between 0 and 1 → Vocoder → Waveform.)
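A shape-level sketch of how such conditioning is commonly done: the emotion label becomes a one-hot vector scaled by the strength and is concatenated to every frame of contextual features before the acoustic model consumes them. The label set and dimensions are assumptions for illustration, not LINE's model:

```python
import numpy as np

# Emotion conditioning for TTS, shape-level sketch: contextual (linguistic)
# features of shape (frames, dim) are augmented with an emotion one-hot
# vector scaled by a strength in [0, 1].

EMOTIONS = ["neutral", "happy", "sad"]   # assumed label set

def condition_features(contextual, emotion, strength):
    """Append a strength-scaled emotion one-hot to each frame."""
    one_hot = np.zeros(len(EMOTIONS))
    one_hot[EMOTIONS.index(emotion)] = strength
    cond = np.tile(one_hot, (contextual.shape[0], 1))
    return np.concatenate([contextual, cond], axis=1)

ctx = np.random.default_rng(0).normal(size=(20, 16))  # 20 frames, 16-dim
x = condition_features(ctx, "happy", 0.8)
print(x.shape)  # (20, 19): 16 contextual dims + 3 emotion dims
```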
  12. Prediction of emotion strength: consider two ways to annotate emotion strength (weak → strong, for happy and sad) from speech waveforms. (1) Human annotators label emotion strength by listening: more expressive TTS, but high annotation cost. (2) A feature extractor predicts emotion strength with a ranking algorithm: annotation is automatic, but in some cases expressiveness in TTS is reduced.
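The slide names a "ranking algorithm" without details; a margin-based pairwise ranking loss is one common choice for this kind of task, so the sketch below is illustrative rather than necessarily their method. Given pairs of utterances where one is known to sound stronger, a scorer is trained so the stronger one out-scores the weaker, and the learned score then serves as the automatic strength label:

```python
# Pairwise margin ranking loss, a toy sketch of learning emotion strength
# from ordered pairs: zero loss once the stronger utterance out-scores the
# weaker one by at least the margin, positive loss otherwise.

def margin_ranking_loss(score_strong, score_weak, margin=1.0):
    """Hinge-style pairwise ranking loss for one (strong, weak) pair."""
    return max(0.0, margin - (score_strong - score_weak))

print(margin_ranking_loss(2.5, 0.5))  # 0.0: correctly ordered with margin
print(margin_ranking_loss(0.5, 2.5))  # 3.0: mis-ordered pair is penalized
```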
  13. Acoustic model training: train the acoustic model conditioned on the predicted emotion strength and the emotion label. (Diagram: Text ありがとう → Text analyzer → contextual features [a- ri^ ga- to- o-] → Acoustic model; Speech → emotion-strength prediction → emotion strength between 0 and 1 plus the emotion label, e.g. Happiness → Acoustic model → Vocoder → Speech.)
  14. Neural vocoder: develop a fast, small-footprint neural vocoder on GPUs.
    WaveNet (WN) [1] is an AR vocoder that generates each sample from the acoustic features and the previous speech samples: high quality, but very slow. [1] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv, 2016.
    Parallel WaveGAN (PWG) [2] is a non-AR vocoder that generates all speech samples for an input at once from noise and acoustic features, and achieves high quality using a GAN (a discriminator judges generated versus recorded speech as real or fake): high speed (10,000 times faster than WN), but lower quality than WN. [2] R. Yamamoto et al., "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," Proc. ICASSP, 2020.
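The "multi-resolution spectrogram" in the PWG title refers to an auxiliary loss comparing generated and real speech in several STFT settings. A simplified toy version (magnitude distance only; the paper combines spectral-convergence and log-magnitude terms, and the FFT sizes here are assumptions):

```python
import numpy as np

# Multi-resolution STFT loss, simplified sketch: average the magnitude
# spectrogram distance between generated and reference audio over several
# STFT resolutions.

def stft_mag(x, fft_size, hop):
    """Magnitude spectrogram via framed, Hann-windowed rFFT."""
    frames = [x[i:i + fft_size] for i in range(0, len(x) - fft_size + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames) * np.hanning(fft_size), axis=1))

def multi_res_stft_loss(gen, ref, resolutions=((256, 64), (512, 128))):
    loss = 0.0
    for fft_size, hop in resolutions:
        G, R = stft_mag(gen, fft_size, hop), stft_mag(ref, fft_size, hop)
        loss += np.mean(np.abs(G - R))   # magnitude distance per resolution
    return loss / len(resolutions)

t = np.arange(4096) / 16000.0
ref = np.sin(2 * np.pi * 220 * t)            # toy "reference speech"
print(multi_res_stft_loss(ref, ref))         # identical signals → 0.0
print(multi_res_stft_loss(ref * 0.5, ref) > 0)  # mismatch → positive loss
```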
  15. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]. Drawback of PWG: deterioration of quality on harmonic components. (Figure: time-frequency spectrograms for Recording, WaveNet, and PWG.) [3] M.-J. Hwang et al., "High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model," Proc. INTERSPEECH, 2021.
  16. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]. Adapt the harmonic-plus-noise model to the PWG's generator: split speech into harmonic and noise components. (Diagram: Speech = Harmonic Component + Noise Component.)
  17. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]. Model the harmonic and noise components with separate generators. (Diagram: a harmonic source feeds a harmonic generator and a noise source feeds a noise generator; the resulting harmonic and noise components sum to speech.)
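The harmonic/noise decomposition idea can be illustrated with toy excitation signals: a harmonic source built from sinusoids at integer multiples of f0, a noise source of white noise, and a weighted sum standing in for the two learned generators. This is a signal-level sketch of the concept only; HN-PWG's actual generators are neural networks, and the sample rate, f0, and gains here are assumptions:

```python
import numpy as np

# Harmonic-plus-noise excitation, toy sketch: speech-like output as the sum
# of a shaped harmonic component and a shaped noise component.

SR = 16000  # assumed sample rate in Hz

def harmonic_source(f0, n_samples, n_harmonics=5):
    """Sum of sinusoids at integer multiples of the fundamental f0."""
    t = np.arange(n_samples) / SR
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))

def noise_source(n_samples, seed=0):
    """White-noise excitation for the unvoiced/noise component."""
    return np.random.default_rng(seed).normal(size=n_samples)

# Fixed gains stand in for the harmonic and noise generators of the diagram:
speech = 0.8 * harmonic_source(220.0, SR) + 0.1 * noise_source(SR)
print(speech.shape)  # one second of toy output
```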
  18. High-fidelity neural vocoder: Harmonic-plus-Noise (HN) PWG [3]. Subjective and objective evaluations showed the advantage of HN-PWG.

    Model              | Model size ↓ (M) | Inference speed ↓ (RTF) | MOS ↑ (analysis/synthesis) | MOS ↑ (TTS)
    WaveNet            | 3.81             | 294.12                  | 4.22                       | 4.03
    PWG                | 0.94             | 0.02                    | 3.46                       | 3.56
    HN-PWG             | 0.94             | 0.02                    | 4.18                       | 4.01
    Multi-band HN-PWG  | 0.99             | 0.02                    | 4.29                       | 4.03
    Recordings         | -                | -                       | 4.41                       |

    Smaller RTF values are faster; MOS is on a 1-to-5 scale, higher values mean higher quality. [3] M.-J. Hwang et al., "High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model," Proc. INTERSPEECH, 2021.
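The RTF (real-time factor) numbers read as processing time divided by audio duration, so RTF below 1 means faster than real time. A minimal sketch assuming that standard definition, using the slide's values:

```python
# Real-time factor: processing time divided by audio duration. RTF 0.02
# means 1 second of audio takes 0.02 seconds to synthesize.

def real_time_factor(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

def seconds_to_generate(rtf, audio_seconds):
    """Wall-clock time to synthesize a clip at a given RTF."""
    return rtf * audio_seconds

print(seconds_to_generate(294.12, 1.0))  # WaveNet: ~294 s per 1 s of audio
print(seconds_to_generate(0.02, 1.0))    # PWG / HN-PWG: ~0.02 s per 1 s
```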
  21. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  22. Modeling status of HyperCLOVA. JP model: 1.3B → 6.7B → 13B → 39B. Multi-lingual model: 13B → 39B. Large JP / multi-lingual model: 82B. Hyper-scale JP model: 204B〜 (in 2022, work in progress).
  23. Agenda - About LINE & LINE CLOVA - Recent Achievements

    of Speech Technologies - Future Plans – R&D Vision – - Q&A
  24. AI R&D Division: teams for Speech, Voice, NLP, CV, Trustworthy AI, and more, each with a manager, researchers, and engineers.
  25. Major Publications.
    INTERSPEECH 2021:
    - Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions. J. Nozaki, T. Komatsu
    - Acoustic Event Detection with Classifier Chains. T. Komatsu, S. Watanabe, K. Miyazaki, T. Hayashi
    - Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation. Y. Nakagome, M. Togami, T. Ogawa, T. Kobayashi
    - Sound Source Localization with Majorization Minimization. M. Togami, R. Scheibler
    ICASSP 2021:
    - End-to-End Learning for Convolutive Multi-Channel Wiener Filtering. M. Togami
    - Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS. D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari
    - Surrogate Source Model Learning for Determined Source Separation. R. Scheibler, M. Togami
    - Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold. R. Scheibler, M. Togami
    - Joint Dereverberation and Separation with Iterative Source Steering. T. Nakashima, R. Scheibler, M. Togami, N. Ono