LINE Research on speech source separation with deep learning

Masahito Togami
LINE Research Labs Senior Researcher
https://linedevday.linecorp.com/jp/2019/sessions/F1-5

LINE DevDay 2019

November 20, 2019
Transcript

  1. 2019 DevDay LINE Research on Speech Source Separation With Deep

    Learning > Masahito Togami > LINE Research Labs Senior Researcher
  2. LINE Research Labs (April 2018-) > Collaboration with the National
    Institute of Informatics (NII) / Center for Robust Intelligence and Social Technology (CRIS) > Papers submitted to major international conferences (3 ICASSP2019, 3 INTERSPEECH2019, 1 WASPAA2019, 1 BigData2019, 1 SIGIR2019) > The objective is to pursue fundamental research that will contribute to future LINE business
  3. Agenda
    > What is speech source separation?
    > Speech source separation with statistical modeling
    > Speech source separation with deep neural networks
    > LINE's research on a deeply integrated approach
  4. Binaural information (spatial model): each microphone observes a
    mixture of the sources. The signal from the nearer source arrives earlier and bigger; the signal from the farther source arrives delayed and smaller (sources s1, s2; observed signals x1, x2).
  5. Speech likelihood: statistical speech source modeling. Independent
    Vector Analysis (IVA) [Kim 2006][Hiroe 2006]; Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016]. Clean signal spectral model.
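As a sketch, the standard source priors behind the two methods named on this slide can be written as follows (notation mine, following the cited papers):

```latex
% IVA [Kim 2006][Hiroe 2006]: all frequency bins of one source are tied
% together by a spherical prior, which avoids the frequency permutation
% problem of frequency-wise ICA:
p\bigl(y_n(t)\bigr) \;\propto\; \exp\!\Bigl(-\sqrt{\textstyle\sum_{f} |y_{n,f}(t)|^{2}}\Bigr)

% ILRMA [Kitamura 2016]: each source is zero-mean complex Gaussian per
% time-frequency bin, with a low-rank (NMF) variance model:
y_{n,f}(t) \sim \mathcal{CN}\bigl(0,\, r_{n,f,t}\bigr),
\qquad r_{n,f,t} = \sum_{k} t_{n,f,k}\, v_{n,k,t}
```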
  7. DNN based speech source model + speech source separation: a deep
    neural network performs spatial model estimation via time-frequency masking [Heymann 2016][Yoshioka 2018], followed by separation.
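One common way a DNN mask feeds the spatial model estimation in this line of work ([Heymann 2016]-style mask-based beamforming) is a mask-weighted spatial covariance. A minimal numpy sketch; the function name and array shapes are my own illustrative choices:

```python
import numpy as np

def masked_spatial_covariance(specs, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency bin.

    specs: complex STFT, shape (channels, frames, freqs)
    mask:  time-frequency mask in [0, 1], shape (frames, freqs),
           e.g. the output of a DNN mask estimator
    returns: (freqs, channels, channels) covariance matrices
    """
    C, T, F = specs.shape
    cov = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        X = specs[:, :, f]            # (C, T): all channels at frequency f
        w = mask[:, f]                # (T,): mask weights over time
        # sum_t w_t * x_t x_t^H, normalized by the mask mass
        cov[f] = (w * X) @ X.conj().T / (w.sum() + eps)
    return cov
```

A beamformer (e.g. MVDR or GEV) is then computed from the speech and noise covariances estimated this way.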
  8. DNN based speech source model + speech source separation: Is
    it optimal to train the DNN without considering the spatial model and the separation part?
  9. Research direction
    > Insertion of speech source separation into the DNN structure as a spatial constraint
    > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling
    > DNN trained so as to maximize output speech quality after speech source separation
  11. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]
    [Masuyama INTERSPEECH2019]. (Diagram: deep neural network → speech source model → spatial model estimation → separation → loss, with back propagation through the chain.)
  12. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]
    [Masuyama INTERSPEECH2019]. Oracle clean signal x, estimated speech signal x̂. Loss = ‖x − x̂‖²
  13. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]
    [Masuyama INTERSPEECH2019]. The DNN additionally outputs an estimated variance Σ. Loss = ‖x − x̂‖²
  14. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]
    [Masuyama INTERSPEECH2019]. Loss = (x − x̂)ᴴ Σ⁻¹ (x − x̂) + log |Σ|
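A scalar-per-bin sketch contrasting the plain L2 loss with the variance-weighted loss above (my simplification for illustration; the paper's loss is multichannel with a full covariance matrix):

```python
import numpy as np

def l2_loss(x, x_hat):
    # conventional loss: squared error against the oracle clean signal
    return np.mean(np.abs(x - x_hat) ** 2)

def gaussian_nll_loss(x, x_hat, var, eps=1e-8):
    # variance-weighted loss: |x - x_hat|^2 / var + log var per bin,
    # i.e. a Gaussian negative log-likelihood with estimated variance;
    # the DNN learns both the estimate and its own uncertainty
    var = np.maximum(var, eps)
    return np.mean(np.abs(x - x_hat) ** 2 / var + np.log(var))
```

With a correct estimate and unit variance the loss is zero; a confident (small-variance) wrong estimate is penalized harder than under plain L2.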
  15. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]
    Speech source separation performance — SIR (dB) / SDR (dB): L2 loss 11.99 / 9.46; proposed loss 12.16 / 10.17
  16. DNN training to maximize output speech quality [Masuyama INTERSPEECH2019]
    Spatial model estimation performance — SIR (dB) / SDR (dB): conventional DNN training (PSA) 6.21 / 5.79; proposed 7.57 / 7.10
  17. Research direction
    > Insertion of speech source separation into the DNN structure as a spatial constraint
    > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling
    > DNN trained so as to maximize output speech quality after speech source separation
  18. Insertion of speech source separation into the DNN structure as a
    spatial constraint [Togami ICASSP2019 (2)]. Cascade structure: BLSTM → BLSTM → BLSTM → speech source separation
  21. Insertion of speech source separation into the DNN structure as a
    spatial constraint [Togami ICASSP2019 (2)]. Cascade structure: BLSTM → BLSTM → BLSTM → speech source separation. Nest structure: BLSTM → speech source separation → BLSTM → speech source separation → BLSTM → speech source separation
  22. Insertion of speech source separation into the DNN structure as a
    spatial constraint [Togami ICASSP2019 (2)]. Cascade structure: BLSTM → BLSTM → BLSTM → speech source separation. Nest structure: BLSTM → speech source separation → BLSTM → speech source separation → BLSTM → speech source separation; with back propagation
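The two structures can be sketched abstractly as follows; `cascade`, `nest`, and the callables are illustrative stand-ins, not the paper's code:

```python
def cascade(x, blstm_layers, separate):
    # cascade: separation is applied once, after all BLSTM layers
    for layer in blstm_layers:
        x = layer(x)
    return separate(x)

def nest(x, blstm_layers, separate):
    # nest: a separation step follows every BLSTM layer, acting as a
    # spatial constraint inside the network; during training, gradients
    # back-propagate through each separation step
    for layer in blstm_layers:
        x = separate(layer(x))
    return x
```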
  23. Insertion of speech source separation into the DNN structure as a
    spatial constraint [Togami ICASSP2019 (2)]
    4 microphones — SIR (dB) / SDR (dB): Cascade 15.04 / 13.87; Nest 15.36 / 14.06
    8 microphones — SIR (dB) / SDR (dB): Cascade 17.45 / 15.73; Nest 18.14 / 16.33
  24. Research direction
    > Insertion of speech source separation into the DNN structure as a spatial constraint
    > Unsupervised DNN training with speech source separation based on non-DNN statistical speech source modeling
    > DNN trained so as to maximize output speech quality after speech source separation
  25. Unsupervised DNN training [Togami arxiv2019]. (Diagram: a deep
    neural network — speech source model and spatial model estimation — outputs a separated signal and estimated variance; a non-DNN speech source separation outputs its own separated signal and estimated variance; the loss between the two drives back propagation.) Non-DNN speech source separation is utilized as a pseudo clean signal generator!
  29. Unsupervised DNN training [Togami arxiv2019]. The loss between the
    DNN output (separated signal and estimated variance) and the non-DNN pseudo target is the Kullback-Leibler divergence (KLD).
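A per-bin sketch of a KLD loss between two circular complex Gaussian distributions (my scalar-variance simplification for illustration; the paper uses a multichannel probabilistic formulation):

```python
import numpy as np

def complex_gaussian_kld(mu1, var1, mu2, var2, eps=1e-8):
    # KL( CN(mu1, var1) || CN(mu2, var2) ) per time-frequency bin, averaged.
    # Here (mu1, var1) plays the role of the DNN-path separated signal and
    # estimated variance, (mu2, var2) the non-DNN pseudo-clean target.
    var1 = np.maximum(var1, eps)
    var2 = np.maximum(var2, eps)
    kld = np.log(var2 / var1) + (var1 + np.abs(mu1 - mu2) ** 2) / var2 - 1.0
    return np.mean(kld)
```

Unlike a plain L2 loss on the pseudo target, this compares whole distributions, so the uncertainty of the pseudo-clean generator is taken into account.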
  30. Unsupervised DNN training [Togami arxiv2019]
    SIR (dB) / SDR (dB): Non-DNN 7.76 / 3.84; L2 loss 6.02 / 3.62; KLD loss 10.27 / 5.71
  31. Acknowledgements > Special thanks to my internship students, Mr. Nakagome
    and Mr. Masuyama (Waseda University) > Thanks for fruitful discussions to Prof. Kobayashi and Prof. Ogawa (Waseda University), Prof. Kawahara and Prof. Yoshii (Kyoto University), Prof. Hirose (NII), and Mr. Komatsu
  32. Conclusions
    > Integration of deep neural networks and multi-channel speech source separation is key to the practical use of speech source separation
    > Conventional non-DNN speech source separation is back in the spotlight, e.g., for unsupervised DNN training
    > Speech source separation is an emerging technique that enables speech applications to be used in more adverse environments
  34. Rapid prototyping with Pyroomacoustics [Scheibler 2018]
    #Cmd prompt
    pip install pyroomacoustics
    #Python script
    import numpy as np
    import scipy.signal as sp
    import pyroomacoustics as pa
    # short-time Fourier transform of the time-domain wave signals (wav_l/r)
    _, _, spec_l = sp.stft(wav_l, fs=16000, window='hann', nperseg=1024, noverlap=256)
    _, _, spec_r = sp.stft(wav_r, fs=16000, window='hann', nperseg=1024, noverlap=256)
    # spec_l/r: freq, time -> stack into (time, freq, channel) as pa.bss expects
    spec = np.concatenate((spec_l.T[..., np.newaxis], spec_r.T[..., np.newaxis]), axis=-1)
    # y: time, freq, source
    y = pa.bss.auxiva(spec)
    # y = pa.bss.ilrma(spec)
    # conversion back into the time domain (one source)
    _, odata = sp.istft(y[:, :, 0].T, fs=16000, …)
  36. References
    [Kim 2006] T. Kim, et al., “Independent vector analysis: an extension of ICA to multivariate components,” in ICA, pp. 165–172, Mar. 2006.
    [Hiroe 2006] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in ICA, pp. 601–608, Mar. 2006.
    [Kitamura 2016] D. Kitamura, et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626–1641, 2016.
    [Scheibler 2018] R. Scheibler, et al., “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in ICASSP, 2018, pp. 351–355.
    [Heymann 2016] J. Heymann, et al., “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, 2016, pp. 196–200.
    [Yoshioka 2018] T. Yoshioka, et al., “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in ICASSP, 2018, pp. 5739–5743.
    [Togami ICASSP2019 (1)] M. Togami, “Multi-channel Itakura Saito distance minimization with deep neural network,” in ICASSP, 2019, pp. 536–540.
    [Masuyama INTERSPEECH2019] Y. Masuyama, et al., “Multichannel loss function for supervised speech source separation by mask-based beamforming,” in INTERSPEECH, Sep. 2019.
    [Togami ICASSP2019 (2)] M. Togami, “Spatial constraint on multi-channel deep clustering,” in ICASSP, 2019, pp. 531–535.
    [Togami arxiv2019] M. Togami, et al., “Unsupervised training for deep speech source separation with Kullback-Leibler divergence based probabilistic loss function,” arXiv:1911.04228, 2019.