LINE Research on speech source separation with deep learning

2019 DevDay
LINE Research on Speech Source Separation With Deep Learning
> Masahito Togami
> LINE Research Labs Senior Researcher

Self Introduction Masahito Togami, Ph.D.

LINE Research Labs (April 2018-)
> Collaboration with the National Institute of Informatics (NII) / Center for Robust Intelligence and Social Technology (CRIS)
> Papers submitted to major international conferences (3 ICASSP2019, 3 INTERSPEECH2019, 1 WASPAA2019, 1 BigData2019, 1 SIGIR2019)
> The objective is to pursue fundamental research that will contribute to future LINE business

Agenda
> What is speech source separation?
> Speech source separation with statistical modeling
> Speech source separation with deep neural network
> LINE’s research on deeply integrated approach

What is speech source separation?

Demonstration Speech source separation of the female speaker

Block diagram
> Clean speech signal
> Mixed signal
> Speech source separation

Applications

Multiple speech stream recognition for AI speaker

Structuring volatile meeting information

Virtual member of orchestra

Speech source separation with statistical modeling

What is a speech signal?
(Figure axes: frequency, time)
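As a minimal sketch of the time-frequency view on this slide, the snippet below computes a magnitude spectrogram with a plain NumPy short-time Fourier transform. The function name `stft_mag`, the frame length, the hop size, the 16 kHz sampling rate, and the 440 Hz test tone are illustrative assumptions, not values taken from the talk.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Cut the signal into overlapping, windowed frames
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # One FFT per frame gives the frequency content at that time
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000                              # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                  # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)    # stand-in for a speech signal
spec = stft_mag(tone)                   # rows: time frames, columns: frequency bins
peak_hz = spec.mean(axis=0).argmax() * fs / 512   # lands in the bin nearest 440 Hz
```

Each row of `spec` is one time frame and each column one frequency bin, which is the representation the separation methods in the rest of the deck operate on.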

Blind speech source separation (BSS)
> Source signals $s_1$ and $s_2$, mixed signal $x$
> $x = s_1 + s_2$

Binaural information (spatial model)
(Figure: each source reaches the nearer microphone earlier and bigger, and the farther microphone delayed and smaller)
> Source signals $s = (s_1, s_2)^T$, microphone signals $x = (x_1, x_2)^T$
> Mixing: $x = A s$
> Separation: $s = A^{-1} x$
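The spatial model can be sketched numerically for a single frequency bin. The mixing matrix `A` below is a hypothetical example (gain 1 at the nearer microphone, an attenuated and phase-delayed path to the farther one); in blind separation `A` is not known and has to be estimated from the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two source signals in one frequency bin, shape (sources, frames)
s = rng.standard_normal((2, 1000)) + 1j * rng.standard_normal((2, 1000))

# Hypothetical mixing matrix: column k holds the paths from source k to both microphones.
# The off-diagonal entries are smaller in magnitude and carry a phase delay.
A = np.array([[1.0, 0.6 * np.exp(-0.8j)],
              [0.5 * np.exp(-1.1j), 1.0]])

x = A @ s                        # microphone observation: x = A s
s_hat = np.linalg.inv(A) @ x     # separation with the inverse: s = A^{-1} x
```

With the true `A` the sources are recovered exactly; the statistical and DNN-based methods in this deck address the case where only `x` is observed.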

Statistical speech source modeling
> Speech likelihood
> Independent Vector Analysis (IVA) [Kim 2006][Hiroe 2006]
> Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016]
(Figure label: clean signal spectral)
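A small sketch of why a source model that spans all frequency bins is useful, assuming the spherical Laplace prior commonly used with IVA (the slide does not state which prior is used). Under that prior the negative log-likelihood of one source in one frame is proportional to the L2 norm over frequency, so outputs whose frequency bins belong to the same speaker score better than outputs with swapped bins. The function name and the toy activity pattern are illustrative assumptions.

```python
import numpy as np

def iva_laplace_cost(y):
    """Sum over sources and frames of sqrt(sum_f |y|^2); y has shape (frames, freqs, sources)."""
    return np.sqrt((np.abs(y) ** 2).sum(axis=1)).sum()

T, F = 8, 4
aligned = np.zeros((T, F, 2))
aligned[: T // 2, :, 0] = 1.0    # source 0 is active in the first half of the frames
aligned[T // 2 :, :, 1] = 1.0    # source 1 is active in the second half

# Swap the two outputs in the upper half of the frequency bins (a permutation error)
permuted = aligned.copy()
permuted[:, F // 2 :, :] = aligned[:, F // 2 :, ::-1]

cost_aligned = iva_laplace_cost(aligned)     # 16.0
cost_permuted = iva_laplace_cost(permuted)   # 16 * sqrt(2), i.e. larger
```

The permuted arrangement has the higher cost, which is how a joint model over frequency discourages the permutation problem that [Hiroe 2006] addresses.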

Speech source separation with deep neural network

Deep neural network based speech source model
(Block diagram: Deep Neural Network, Loss)

DNN based speech source model + speech source separation
(Block diagram: Deep Neural Network, Spatial model estimation (Time-Frequency masking) [Heymann 2016] [Yoshioka 2018], Separation)
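A rough sketch of how a time-frequency mask can drive spatial model estimation and separation for one frequency bin. It computes mask-weighted spatial covariance matrices and then an MVDR beamformer; MVDR is swapped in here for brevity and is not claimed to be the beamformer used in the cited papers. The mask is a hand-made stand-in for a DNN output, and the steering vector, noise level, and function names are assumptions.

```python
import numpy as np

def masked_covariance(x, mask):
    """Mask-weighted spatial covariance; x: (frames, mics) complex, mask: (frames,) in [0, 1]."""
    weighted = x * mask[:, None]
    return weighted.T @ x.conj() / max(mask.sum(), 1e-10)

def mvdr_weights(r_target, r_noise, ref=0):
    """MVDR filter w = R_n^{-1} R_s u / trace(R_n^{-1} R_s), with u selecting microphone `ref`."""
    numerator = np.linalg.solve(r_noise, r_target)
    return numerator[:, ref] / np.trace(numerator)

rng = np.random.default_rng(0)
n_frames = 400
a = np.array([1.0, 0.7 * np.exp(-0.5j)])       # hypothetical steering vector (2 microphones)
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
noise = 0.3 * (rng.standard_normal((n_frames, 2)) + 1j * rng.standard_normal((n_frames, 2)))
speech_mask = (np.arange(n_frames) < n_frames // 2).astype(float)   # stand-in for a DNN mask
x = speech_mask[:, None] * s[:, None] * a[None, :] + noise

r_noise = masked_covariance(x, 1.0 - speech_mask)        # spatial model of the noise
r_target = masked_covariance(x, speech_mask) - r_noise   # spatial model of the target
w = mvdr_weights(r_target, r_noise)
y = x @ w.conj()                                         # beamformer output w^H x per frame
```

In this arrangement the network only has to predict the mask; the spatial covariance estimation and the filter are fixed signal processing steps placed after it.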

DNN based speech source model + speech source separation
> Is it optimal to train the DNN without considering the spatial model and the separation part?

Deeply integrated multi-channel speech source separation

LINE’s research on deeply integrated approach

Research direction
> Insertion of speech source separation into DNN structure as a spatial constraint
> Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling
> DNN is trained so as to maximize output speech quality after speech source separation

DNN training to maximize output speech quality [Togami ICASSP2019 (1)] [Masuyama INTERSPEECH2019]
(Block diagram: Deep Neural Network, Speech source model, Spatial model estimation, Separation, Loss, Back Propagation)

DNN training to maximize output speech quality [Togami ICASSP2019 (1)] [Masuyama INTERSPEECH2019]
> Oracle clean signal: $s$
> Estimated speech signal: $\hat{s}$
> Estimated variance: $\hat{R}$
> L2 loss: $\mathrm{Loss} = \| s - \hat{s} \|^2$
> Proposed loss [Masuyama INTERSPEECH2019]: $\mathrm{Loss} = (s - \hat{s})^{H} \hat{R}^{-1} (s - \hat{s}) + \log \det \hat{R}$
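The two losses on this slide can be compared in a few lines. The sketch assumes the variance-weighted loss is the negative log-likelihood of a complex Gaussian (up to constants) for one time-frequency point, with the estimated variance treated as a covariance matrix across microphones; the function names and the numbers are illustrative assumptions.

```python
import numpy as np

def l2_loss(s, s_hat):
    """Squared error between the clean and the estimated multi-channel signal."""
    e = s - s_hat
    return float(np.real(np.vdot(e, e)))

def gaussian_nll_loss(s, s_hat, cov):
    """(s - s_hat)^H cov^{-1} (s - s_hat) + log det cov for one time-frequency point."""
    e = s - s_hat
    quad = np.real(np.vdot(e, np.linalg.solve(cov, e)))   # error weighted by the inverse variance
    _, logdet = np.linalg.slogdet(cov)                    # penalty that stops the variance growing freely
    return float(quad + logdet)

s = np.array([1.0 + 1.0j, 0.5])        # oracle clean signal at two microphones
s_hat = np.array([0.8 + 1.0j, 0.4])    # estimated signal

loss_l2 = l2_loss(s, s_hat)
loss_unit = gaussian_nll_loss(s, s_hat, np.eye(2))          # identity covariance: same as L2
loss_wide = gaussian_nll_loss(s, s_hat, 4.0 * np.eye(2))    # larger variance: smaller error term, larger log term
```

With an identity covariance the two losses coincide, so the variance term is what lets the network express how much it trusts each estimate.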

Speech source separation performance

                 SIR (dB)   SDR (dB)
L2 loss          11.99      9.46
Proposed loss    12.16      10.17
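For orientation on the dB figures in these tables, here is a simplified signal-to-distortion ratio. The talk does not say which toolkit produced its SIR and SDR numbers, so this is only the basic energy-ratio definition, not the full evaluation procedure that splits the error into interference and artifact parts. The function name is an assumption.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """10 log10(||reference||^2 / ||reference - estimate||^2), in dB."""
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(np.abs(reference) ** 2) / np.sum(np.abs(distortion) ** 2))

reference = np.array([3.0, 4.0])                # energy 25
estimate = reference + np.array([0.3, 0.4])     # error energy 0.25
sdr = simple_sdr_db(reference, estimate)        # about 20 dB
```

On this scale a 1 dB gain corresponds to roughly 20 % less error energy relative to the reference.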

DNN training to maximize output speech quality [Masuyama INTERSPEECH2019]

Spatial model estimation performance

                                  SIR (dB)   SDR (dB)
Conventional DNN training (PSA)   6.21       5.79
Proposed                          7.57       7.10

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)]
> Cascade structure: BLSTM, BLSTM, BLSTM, then speech source separation
> Nest structure: BLSTM, speech source separation, BLSTM, speech source separation, BLSTM, speech source separation
> Back Propagation
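The difference between the two structures is only the order in which the blocks are applied, which the schematic below makes explicit. `blstm` and `separation` are hypothetical stand-ins that record the call order and pass their input through; they are not implementations of a BLSTM layer or of the separation step.

```python
def blstm(features, log):
    """Hypothetical stand-in for a BLSTM layer: records the call and returns its input."""
    log.append("BLSTM")
    return features

def separation(features, log):
    """Hypothetical stand-in for the multi-channel speech source separation step."""
    log.append("SEP")
    return features

def cascade(x, n_layers=3):
    """All BLSTM layers first, then one separation step at the end."""
    log = []
    for _ in range(n_layers):
        x = blstm(x, log)
    x = separation(x, log)
    return x, log

def nest(x, n_layers=3):
    """A separation step after every BLSTM layer, so each layer sees separated input."""
    log = []
    for _ in range(n_layers):
        x = blstm(x, log)
        x = separation(x, log)
    return x, log

_, cascade_order = cascade([0.0])
_, nest_order = nest([0.0])
```

In the nest arrangement the spatial constraint is applied between layers, so every later layer works on features that have already passed through a separation step.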

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)]

4 microphones
           SIR (dB)   SDR (dB)
Cascade    15.04      13.87
Nest       15.36      14.06

8 microphones
           SIR (dB)   SDR (dB)
Cascade    17.45      15.73
Nest       18.14      16.33

Unsupervised DNN training [Togami arxiv2019]
> It is hard to obtain the oracle clean signal!

Unsupervised DNN training [Togami arxiv2019]
> DNN branch: Deep Neural Network, speech source model, spatial model estimation, separated signal and estimated variance
> Non-DNN branch: non-DNN speech source separation, separated signal and estimated variance
> Loss: Kullback-Leibler Divergence (KLD), with back propagation into the Deep Neural Network
> Non-DNN speech source separation is utilized as a pseudo clean signal generator!
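Because both branches output a separated signal and an estimated variance, each output can be read as a Gaussian distribution, and the two can be compared with a Kullback-Leibler divergence. The sketch below uses the closed form for circular complex Gaussians; which branch plays the role of `p` or `q`, and the exact form used in [Togami arxiv2019], are assumptions here.

```python
import numpy as np

def complex_gaussian_kld(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) for circular complex Gaussians p = N_c(mu_p, cov_p), q = N_c(mu_q, cov_q)."""
    dim = mu_p.shape[0]
    diff = mu_q - mu_p
    trace_term = np.real(np.trace(np.linalg.solve(cov_q, cov_p)))      # tr(cov_q^{-1} cov_p)
    quad_term = np.real(np.vdot(diff, np.linalg.solve(cov_q, diff)))   # mean mismatch
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return float(trace_term + quad_term - dim + logdet_q - logdet_p)

# Pseudo clean signal and variance from the non-DNN branch (illustrative numbers)
mu_p, cov_p = np.array([0.0 + 0.0j, 0.0 + 0.0j]), np.eye(2)
# Separated signal and variance from the DNN branch (illustrative numbers)
mu_q, cov_q = np.array([1.0 + 0.0j, 0.0 + 0.0j]), 2.0 * np.eye(2)

kld_same = complex_gaussian_kld(mu_p, cov_p, mu_p, cov_p)   # identical distributions
kld_diff = complex_gaussian_kld(mu_p, cov_p, mu_q, cov_q)   # differs in mean and variance
```

Unlike an L2 loss on the separated signals alone, this loss also compares the variances, so the uncertainty of the pseudo clean signal enters the training target.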

Unsupervised DNN training [Togami arxiv2019]

            SIR (dB)   SDR (dB)
Non-DNN     7.76       3.84
L2 loss     6.02       3.62
KLD loss    10.27      5.71

Unsupervised DNN training [Togami arxiv2019]
(Figure panels: oracle clean, noisy microphone input, non-DNN, DNN (KLD loss))

Acknowledgements
> Special thanks to my internship students, Mr. Nakagome and Mr. Masuyama (Waseda University)
> Thanks for fruitful discussions to Prof. Kobayashi, Prof. Ogawa (Waseda University), Prof. Kawahara, Prof. Yoshii (Kyoto University), Prof. Hirose (NII), and Mr. Komatsu

Conclusions
> Integrating deep neural networks with multi-channel speech source separation is key to the practical use of speech source separation
> Conventional non-DNN speech source separation is back in the spotlight, e.g., in unsupervised DNN training
> Speech source separation is an emerging technique that enables speech applications to be used in more adverse environments

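Rapid prototyping

    # Python script
    import numpy as np
    from scipy import signal
    import pyroomacoustics as pa

    # wav_l / wav_r: 1-D time-series wave signals of the left / right microphone
    # STFT parameters shared by the forward and the inverse transform
    stft_kw = dict(fs=16000, window='hann', nperseg=1024, noverlap=256)

    # Short-time Fourier transform; scipy returns (freqs, times, Zxx) with Zxx as (freq, time)
    _, _, spec_l = signal.stft(wav_l, **stft_kw)
    _, _, spec_r = signal.stft(wav_r, **stft_kw)

    # spec: (time, freq, channel), the layout expected by pa.bss
    spec = np.stack((spec_l.T, spec_r.T), axis=-1)

    # y: (time, freq, source)
    y = pa.bss.auxiva(spec)
    # y = pa.bss.ilrma(spec)

    # Conversion into time domain; istft expects (..., freq, time)
    _, odata = signal.istft(y.transpose(2, 1, 0), **stft_kw)

Pyroomacoustics [Scheibler 2018]

    # Cmd prompt
    pip install pyroomacoustics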

References
[Kim 2006] T. Kim, et al., “Independent vector analysis: an extension of ICA to multivariate components,” in ICA, pp. 165–172, Mar. 2006.
[Hiroe 2006] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in ICA, pp. 601–608, Mar. 2006.
[Kitamura 2016] D. Kitamura, et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626–1641, 2016.
[Scheibler 2018] R. Scheibler, et al., “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,” in ICASSP, 2018, pp. 351–355.
[Heymann 2016] J. Heymann, et al., “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, 2016, pp. 196–200.
[Yoshioka 2018] T. Yoshioka, et al., “Multi-Microphone Neural Speech Separation for Far-Field Multi-Talker Speech Recognition,” in ICASSP, 2018, pp. 5739–5743.
[Togami ICASSP2019 (1)] M. Togami, “Multi-channel Itakura Saito Distance Minimization with deep neural network,” in ICASSP, 2019, pp. 536–540.
[Masuyama INTERSPEECH2019] Y. Masuyama, et al., “Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming,” in INTERSPEECH, Sep. 2019.
[Togami ICASSP2019 (2)] M. Togami, “Spatial Constraint on Multi-channel Deep Clustering,” in ICASSP, 2019, pp. 531–535.
[Togami arxiv2019] M. Togami, et al., “Unsupervised Training for Deep Speech Source Separation with Kullback-Leibler Divergence Based Probabilistic Loss Function,” arXiv:1911.04228, 2019.

Thank you for listening
