Slide 1

LINE DevDay 2019 > Research on Speech Source Separation With Deep Learning > Masahito Togami > Senior Researcher, LINE Research Labs

Slide 2

Self Introduction > Masahito Togami, Ph.D.

Slide 3

LINE Research Labs (April 2018-) > Collaboration with the National Institute of Informatics (NII) / Center for Robust Intelligence and Social Technology (CRIS) > Papers submitted to major international conferences (3 ICASSP 2019, 3 INTERSPEECH 2019, 1 WASPAA 2019, 1 BigData 2019, 1 SIGIR 2019) > Our objective is to conduct fundamental research that will contribute to future LINE business

Slide 4

Agenda > What is speech source separation? > Speech source separation with statistical modeling > Speech source separation with deep neural network > LINE’s research on deeply integrated approach

Slide 5

What is speech source separation?

Slide 6

Demonstration > Speech source separation of the female speaker

Slide 7

Block diagram > Mixed signal → Speech source separation → Clean speech signal

Slide 8

Applications

Slide 9

Multiple speech stream recognition for AI speaker

Slide 10

Structuring volatile meeting information

Slide 11

Virtual member of orchestra

Slide 12

Speech source separation with statistical modeling

Slide 13

What is a speech signal? (figure: time-frequency representation; axes: time, frequency)

Slide 14

Blind speech source separation (BSS) > Observed mixture: x(t) = s1(t) + s2(t) > The individual sources s1, s2 must be estimated from the mixture x alone

Slide 15

Binaural information (spatial model) > Each source arrives at the nearer microphone earlier and bigger, and at the farther microphone delayed and smaller

Slide 16

Binaural information (spatial model) > Each source arrives at the nearer microphone earlier and bigger, and at the farther microphone delayed and smaller > Mixing: x = As, where the mixing matrix A collects the inter-microphone delays and attenuations > Demixing: s = A^{-1} x

Slide 17

Binaural information (spatial model) > Separation: s_hat = Wx, where the demixing matrix W approximates A^{-1} (diagram: sources s1, s2; microphones x1, x2)
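As a tiny numerical sketch of this spatial model (the notation x = As for mixing and W for the demixing matrix is mine, following standard BSS usage): in one frequency bin, the microphone observations are a linear mixture of the sources, and with the oracle mixing matrix the demixing matrix is simply its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# One frequency bin, 2 sources, 2 microphones, 100 time frames.
s = rng.standard_normal((2, 100)) + 1j * rng.standard_normal((2, 100))  # sources
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))      # mixing matrix
x = A @ s                                                               # microphone signals

# With the oracle mixing matrix, the demixing matrix is simply its inverse.
W = np.linalg.inv(A)
s_hat = W @ x

print(np.allclose(s_hat, s))  # True: the sources are recovered
```

In blind separation, of course, A is unknown, so W must be estimated from x alone (e.g., by IVA or ILRMA, discussed next).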

Slide 18

Speech likelihood: statistical speech source modeling > Independent Vector Analysis (IVA) [Kim 2006][Hiroe 2006] > Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016] (figure: clean signal spectrum)

Slide 20

Speech source separation with deep neural network

Slide 21

Deep neural network based speech source model > Diagram: Deep Neural Network, trained with a Loss

Slide 22

DNN based speech source model + speech source separation > Spatial model estimation via time-frequency masking [Heymann 2016][Yoshioka 2018] > Diagram: Deep Neural Network → Spatial model estimation (time-frequency masking) → Separation
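A minimal numpy sketch of the time-frequency masking step (the mask here is a random placeholder standing in for the output of a trained network such as a BLSTM): the DNN predicts a value in [0, 1] per time-frequency bin, which is multiplied with the mixture spectrogram to extract the target speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture STFT: (freq, time). In the real system this comes from the microphones.
X = rng.standard_normal((513, 100)) + 1j * rng.standard_normal((513, 100))

# Placeholder for the DNN output: a mask in (0, 1) per time-frequency bin.
# A trained network would predict this from features such as |X|.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(X.shape)))  # sigmoid of random logits

# Masking extracts the target speech; (1 - mask) * X would be the residual.
S_hat = mask * X
assert S_hat.shape == X.shape
```

In the multi-channel systems cited above, such masks are not applied directly but used to estimate the spatial model (e.g., spatial covariance matrices for beamforming).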

Slide 23

DNN based speech source model + speech source separation > Is it optimal to train the DNN without taking the spatial model and the separation part into account?

Slide 24

Deeply integrated multi-channel speech source separation

Slide 25

LINE’s research on deeply integrated approach

Slide 26

Research direction > Insertion of speech source separation into DNN structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation

Slide 28

DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019] > Diagram: Deep Neural Network (speech source model) → Spatial model estimation → Separation → Loss, with back propagation through the entire chain

Slide 29

DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019] > Oracle clean signal s, estimated speech signal s_hat > Loss = ||s - s_hat||^2

Slide 30

DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019] > Oracle clean signal s, estimated speech signal s_hat, estimated variance Sigma > Loss = ||s - s_hat||^2

Slide 31

DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019] > Oracle clean signal s, estimated speech signal s_hat, estimated variance Sigma > Loss = (s - s_hat)^H Sigma^{-1} (s - s_hat) + log det Sigma
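The loss on this slide weights the estimation error by the inverse of the estimated covariance and adds a log-determinant penalty, i.e., a Gaussian negative log-likelihood per time-frequency bin. A minimal numpy sketch (my notation: oracle clean s, estimate s_hat, estimated covariance Sigma; the published loss of [Togami ICASSP2019 (1)] is the multi-channel Itakura-Saito distance, of which this is only a simplified illustration):

```python
import numpy as np

def gaussian_nll_loss(s, s_hat, sigma):
    """(s - s_hat)^H Sigma^{-1} (s - s_hat) + log det Sigma for one T-F bin.

    s, s_hat: complex vectors (channels); sigma: Hermitian positive-definite matrix.
    """
    e = s - s_hat
    _, logdet = np.linalg.slogdet(sigma)
    return np.real(e.conj() @ np.linalg.solve(sigma, e)) + logdet

s = np.array([1 + 1j, 2 - 1j])
s_hat = np.array([1 + 0j, 2 + 0j])

# With identity covariance the loss reduces to the plain squared L2 error
# (log det I = 0), linking this slide back to the previous one.
loss_l2 = gaussian_nll_loss(s, s_hat, np.eye(2, dtype=complex))
print(loss_l2)  # 2.0: |1j|^2 + |-1j|^2
```

A large estimated variance thus down-weights the error in that bin, while the log-determinant term penalizes simply inflating the variance everywhere.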

Slide 32

DNN training to maximize output speech quality [Togami ICASSP2019 (1)]

Speech source separation performance:
                 SIR (dB)  SDR (dB)
  L2 loss         11.99      9.46
  Proposed loss   12.16     10.17

Slide 33

DNN training to maximize output speech quality [Masuyama INTERSPEECH2019]

Spatial model estimation performance:
                                   SIR (dB)  SDR (dB)
  Conventional DNN training (PSA)    6.21      5.79
  Proposed                           7.57      7.10

Slide 34

Research direction > Insertion of speech source separation into DNN structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation

Slide 35

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)] > Cascade structure: BLSTM → BLSTM → BLSTM → Speech source separation

Slide 38

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)] > Cascade structure: BLSTM → BLSTM → BLSTM → Speech source separation > Nest structure: (BLSTM → Speech source separation) repeated three times
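A structural sketch of the two topologies, with placeholder functions of mine standing in for the BLSTM layers and the separation step (the real layers are trained networks; these are fixed random maps):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders: each "BLSTM" is a fixed random nonlinear map here, and the
# "separation" step is represented by a simple normalization.
W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))
blstm1 = lambda h: np.tanh(W1 @ h)
blstm2 = lambda h: np.tanh(W2 @ h)
blstm3 = lambda h: np.tanh(W3 @ h)
separate = lambda h: h / (np.linalg.norm(h) + 1e-8)

x = rng.standard_normal(8)

# Cascade: separation applied once, after all BLSTM layers.
cascade_out = separate(blstm3(blstm2(blstm1(x))))

# Nest: separation inserted after every BLSTM layer as a spatial constraint.
nest_out = separate(blstm3(separate(blstm2(separate(blstm1(x))))))

assert cascade_out.shape == nest_out.shape == (8,)
```

The nest structure constrains every intermediate representation with the spatial separation step, which is what the results on the following slide compare against the cascade.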

Slide 39

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)] > Cascade structure: BLSTM → BLSTM → BLSTM → Speech source separation > Nest structure: (BLSTM → Speech source separation) repeated three times, with back propagation through the inserted separation steps

Slide 40

Insertion of speech source separation into DNN structure as a spatial constraint [Togami ICASSP2019 (2)]

4 microphones:
            SIR (dB)  SDR (dB)
  Cascade    15.04     13.87
  Nest       15.36     14.06

8 microphones:
            SIR (dB)  SDR (dB)
  Cascade    17.45     15.73
  Nest       18.14     16.33

Slide 41

Research direction > Insertion of speech source separation into DNN structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation

Slide 42

Unsupervised DNN training [Togami arxiv2019] > It is hard to obtain the oracle clean signal!

Slide 44

Unsupervised DNN training [Togami arxiv2019] > Diagram: Deep Neural Network (speech source model) → Spatial model estimation → separated signal and estimated variance; in parallel, non-DNN speech source separation → separated signal and estimated variance; the Loss compares the two, with back propagation into the DNN > Non-DNN speech source separation is utilized as a pseudo clean signal generator!

Slide 48

Unsupervised DNN training [Togami arxiv2019] > Diagram: Deep Neural Network (speech source model) → Spatial model estimation → separated signal and estimated variance; in parallel, non-DNN speech source separation → separated signal and estimated variance; the Loss compares the two, with back propagation into the DNN > Non-DNN speech source separation is utilized as a pseudo clean signal generator! > Loss: Kullback-Leibler Divergence (KLD) between the two outputs
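A sketch of such a KLD loss under a simplifying assumption of mine: treating the DNN output and the non-DNN pseudo target as circular complex Gaussians (mean = separated signal, variance = estimated variance) per time-frequency bin, the divergence has a closed form. The paper's formulation is multivariate; this scalar version only illustrates the idea.

```python
import numpy as np

def kld_complex_gaussian(mu1, v1, mu2, v2):
    """KL( N_c(mu1, v1) || N_c(mu2, v2) ) for circular complex scalar Gaussians."""
    r = v1 / v2
    return np.real(r + np.abs(mu1 - mu2) ** 2 / v2 - np.log(r) - 1.0)

# Identical distributions -> zero divergence; any mismatch in the separated
# signal or its variance makes the loss strictly positive.
print(kld_complex_gaussian(1 + 1j, 0.5, 1 + 1j, 0.5))  # 0.0
```

Unlike a plain L2 loss against the pseudo target, this loss also matches the estimated variances, which the results on the next slide suggest is what makes the pseudo labels usable.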

Slide 49

Unsupervised DNN training [Togami arxiv2019]

            SIR (dB)  SDR (dB)
  Non-DNN     7.76      3.84
  L2 loss     6.02      3.62
  KLD loss   10.27      5.71

Slide 50

Unsupervised DNN training [Togami arxiv2019] > Audio examples: oracle clean, noisy microphone input, non-DNN, DNN (KLD loss)

Slide 53

Acknowledgements > Special thanks to my internship students, Mr. Nakagome and Mr. Masuyama (Waseda University) > Thanks for fruitful discussions to Prof. Kobayashi and Prof. Ogawa (Waseda University), Prof. Kawahara and Prof. Yoshii (Kyoto University), Prof. Hirose (NII), and Mr. Komatsu

Slide 54

Conclusions > Integration of deep neural networks and multi-channel speech source separation is a key to the practical use of speech source separation > Conventional non-DNN speech source separation is back in the spotlight, e.g., for unsupervised DNN training > Speech source separation is an emerging technique that enables speech applications to be used in more adverse environments

Slide 56

Rapid prototyping with Pyroomacoustics [Scheibler 2018]

# Cmd prompt
pip install pyroomacoustics

# Python script
import numpy as np
from scipy import signal
import pyroomacoustics as pa

# Short-time Fourier transform of the time-domain wave signals (wav_l / wav_r)
_, _, spec_l = signal.stft(wav_l, fs=16000, window='hann', nperseg=1024, noverlap=256)
_, _, spec_r = signal.stft(wav_r, fs=16000, window='hann', nperseg=1024, noverlap=256)
# spec_l/r: (freq, time); stack into (time, freq, channel) as pa.bss expects
spec = np.stack((spec_l.T, spec_r.T), axis=-1)
# y: (time, freq, source)
y = pa.bss.auxiva(spec)
# y = pa.bss.ilrma(spec)
# Conversion back into the time domain, one waveform per separated source
_, odata = signal.istft(y[:, :, 0].T, fs=16000, window='hann', nperseg=1024, noverlap=256)

Slide 58

References
[Kim 2006] T. Kim, et al., “Independent vector analysis: an extension of ICA to multivariate components,” in ICA, pp. 165-172, Mar. 2006.
[Hiroe 2006] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in ICA, pp. 601-608, Mar. 2006.
[Kitamura 2016] D. Kitamura, et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626-1641, 2016.
[Scheibler 2018] R. Scheibler, et al., “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in ICASSP, 2018, pp. 351-355.
[Heymann 2016] J. Heymann, et al., “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, 2016, pp. 196-200.
[Yoshioka 2018] T. Yoshioka, et al., “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in ICASSP, 2018, pp. 5739-5743.
[Togami ICASSP2019 (1)] M. Togami, “Multi-channel Itakura Saito distance minimization with deep neural network,” in ICASSP, 2019, pp. 536-540.
[Masuyama INTERSPEECH2019] Y. Masuyama, et al., “Multichannel loss function for supervised speech source separation by mask-based beamforming,” in INTERSPEECH, Sep. 2019.
[Togami ICASSP2019 (2)] M. Togami, “Spatial constraint on multi-channel deep clustering,” in ICASSP, 2019, pp. 531-535.
[Togami arxiv2019] M. Togami, et al., “Unsupervised training for deep speech source separation with Kullback-Leibler divergence based probabilistic loss function,” arXiv:1911.04228, 2019.

Slide 59

Thank you for listening