Yusuke Kida (LINE Corporation)
Kentaro Tachibana (LINE Corporation)
Presentation material at the lunchtime workshop of APSIPA 2021
https://www.apsipa2021.org/
LINE’s AI Technology
Figure labels: LINE Shopping Lens; Adult Image Filter; Scene Classification; Ad image Filter; Visual Search; Analogous image; Product Image; Lip Reading; Fashion Image; Spot Clustering; Food Image; Indonesia LINE Split Bill; LINE MUSIC Playlist OCR; LINE CONOMI; Handwritten Font; Receipt OCR; Credit card OCR; Bill OCR; Document Intelligence; Identification; Face Sign; eKYC; Face Sign; Auto Cut; Auto Cam; Transcription; Telephone network; Voice recognition; Single-Demand STT; Simple voice; High quality voice; Voice Style Transfer; Active Learning; Federated Learning; Action recognition; Pose estimation; Speech Note; Vlive Auto Highlight; Content Center AI; CLOVA Dubbing; LINE AiCall; CLOVA Speaker; Gatebox; Papago; Video Insight; LINE CLOVA AI Interactive Avatar; Interactive Avatar Media; 3D Avatar; LINE Profile; Lip Reading
LINE’s AI Technology Brand
Speech, CLOVA Text Analytics, CLOVA Face, CLOVA Assistant
Solutions: LINE AiCall, LINE eKYC
Devices: CLOVA Friends, CLOVA Friends mini, CLOVA Desk, CLOVA WAVE
strength from speech
Input: waveform. Target: emotion strength from weak to strong, for happy and for sad.
Human annotators: label emotion strength by listening.
  Pro: more expressive in TTS.
  Con: high cost.
Feature extractor: predict emotion strength with a ranking algorithm.
  Pro: annotates automatically.
  Con: in some cases, reduced expressiveness in TTS.
Emotions: happy and sad.
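The ranking-based option can be illustrated with a minimal sketch. The slide only says "ranking algorithm", so everything concrete here is an assumption: the two-dimensional features, the linear scoring function, the pairwise hinge-style update, and the names `train_ranker`, `strength`, and `normalize` are hypothetical and chosen only to show how relative orderings (emotional above neutral) can yield a continuous strength score without human strength labels.

```python
import random

def train_ranker(pairs, dim, epochs=200, lr=0.1, margin=1.0):
    """Learn w so that w.x_strong exceeds w.x_weak by a margin (pairwise hinge update)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for strong, weak in pairs:
            diff = [s - k for s, k in zip(strong, weak)]
            score = sum(wi * di for wi, di in zip(w, diff))
            if score < margin:  # pair is mis-ranked or inside the margin
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def strength(w, x):
    """Raw ranking score of one utterance-level feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def normalize(scores):
    """Map raw scores to [0, 1] so they can be read as weak (0) to strong (1)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

# Toy data with hypothetical features [pitch_range, energy]; not from the slides.
random.seed(0)
neutral = [[random.gauss(0.2, 0.05), random.gauss(0.3, 0.05)] for _ in range(20)]
happy = [[random.gauss(0.8, 0.05), random.gauss(0.9, 0.05)] for _ in range(20)]

# Each pair states only an ordering: the happy utterance is "stronger" than the neutral one.
pairs = [(h, n) for h in happy for n in neutral]
w = train_ranker(pairs, dim=2)
scores = normalize([strength(w, x) for x in neutral + happy])
```

In this sketch the first 20 entries of `scores` belong to the neutral utterances and the last 20 to the happy ones; the normalized values could then serve as automatic strength labels for TTS training, with the caveat from the slide that automatic labels may reduce expressiveness in some cases.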
WaveNet (WN) [1]
Autoregressive (AR) vocoder: generates speech one sample at a time.
  Pro: high quality.
  Con: very slow.
Diagram: acoustic feature + previous speech samples -> AR vocoder -> speech waveform.

Parallel WaveGAN (PWG) [2]
Non-autoregressive (non-AR) vocoder: generates all speech samples for the input length at once. Achieves high quality using a GAN.
  Pro: high speed (10,000 times faster than WN).
  Con: lower quality than WN.
Diagram (generation): noise + acoustic feature -> non-AR vocoder -> speech waveform.
Diagram (training): noise + acoustic feature -> non-AR vocoder -> generated speech; the discriminator receives generated speech and recorded speech and judges real or fake.

[1] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," arXiv, 2016.
[2] R. Yamamoto et al., "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020.
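The speed difference between the two vocoder types comes from their data dependencies, which a toy sketch can make concrete. This is not WaveNet or Parallel WaveGAN: the functions `upsample`, `ar_generate`, and `non_ar_generate`, the `tanh` update rules, and the feature values are all hypothetical stand-ins, assumed only to show that an AR generator must loop over samples in order while a non-AR generator computes each sample from noise and conditioning alone.

```python
import math
import random

def upsample(features, hop):
    """Repeat each frame-level feature `hop` times to reach the sample rate."""
    return [f for f in features for _ in range(hop)]

def ar_generate(cond, a=0.9):
    """AR toy: sample t depends on sample t-1, so generation is strictly sequential."""
    out, prev = [], 0.0
    for c in cond:
        prev = math.tanh(a * prev + c)
        out.append(prev)
    return out

def non_ar_generate(cond, noise):
    """Non-AR toy: each sample depends only on its own noise and conditioning,
    so all samples could be computed in parallel."""
    return [math.tanh(z + c) for z, c in zip(noise, cond)]

random.seed(0)
features = [0.1, -0.2, 0.3]            # hypothetical frame-level acoustic features
cond = upsample(features, hop=4)       # 12 sample-level conditioning values
noise = [random.gauss(0.0, 1.0) for _ in cond]

ar_wave = ar_generate(cond)            # 12 sequential steps
par_wave = non_ar_generate(cond, noise)  # 12 independent computations
```

In a real non-AR vocoder the per-sample independence is what allows the whole waveform to be produced in one forward pass; the discriminator shown in the training diagram is omitted from this sketch.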
“High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model,” in Proc. INTERSPEECH, 2021.
Figure: spectrograms for each method (Recording, WaveNet, PWG); axes: time and frequency.
Major Publications

• ASR by Conditioning on Intermediate Predictions
  J. Nozaki, T. Komatsu
• Acoustic Event Detection with Classifier Chains
  T. Komatsu, S. Watanabe, K. Miyazaki, T. Hayashi
• Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation
  Y. Nakagome, M. Togami, T. Ogawa, T. Kobayashi
• Sound Source Localization with Majorization Minimization
  M. Togami, R. Scheibler

ICASSP 2021
• End to End Learning for Convolutive Multi-Channel Wiener Filtering
  M. Togami
• Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS
  D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari
• Surrogate Source Model Learning for Determined Source Separation
  R. Scheibler, M. Togami
• Refinement of Direction of Arrival Estimators by Majorization-Minimization Optimization on the Array Manifold
  R. Scheibler, M. Togami
• Joint Dereverberation and Separation with Iterative Source Steering
  T. Nakashima, R. Scheibler, M. Togami, N. Ono