LINE Research on speech source separation with deep learning

Masahito Togami
LINE Research Labs Senior Researcher
https://linedevday.linecorp.com/jp/2019/sessions/F1-5


LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay LINE Research on Speech Source Separation With Deep

    Learning > Masahito Togami > LINE Research Labs Senior Researcher
  2. Self-introduction: Masahito Togami, Ph.D.

  3. LINE Research Labs (April 2018-) > Collaboration with National Institute of Informatics (NII) / Center for Robust Intelligence and Social Technology (CRIS) > Submitting papers to major international conferences (3 ICASSP2019, 3 INTERSPEECH2019, 1 WASPAA2019, 1 BigData2019, 1 SIGIR2019) > Objective is to pursue fundamental research that will contribute to future LINE business
  4. Agenda > What is speech source separation? > Speech source separation with statistical modeling > Speech source separation with deep neural network > LINE’s research on deeply integrated approach
  5. What is speech source separation?

  6. Demonstration: speech source separation of the female speaker

  7. Block diagram: mixed signal → speech source separation → clean speech signal

  8. Applications

  9. Multiple speech stream recognition for AI speaker

  10. Structuring volatile meeting information

  11. Virtual member of orchestra

  12. Speech source separation with statistical modeling

  13. What is a speech signal? (spectrogram: frequency vs. time)

  14. Blind speech source separation (BSS): the observed mixture is the sum of the source signals, x = s_1 + s_2
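Not part of the slides: the mixture model x = s_1 + s_2 can be sketched in numpy, with random noise standing in for speech:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two stand-in "source" waveforms (random noise instead of real speech)
s1 = rng.standard_normal(16000)
s2 = rng.standard_normal(16000)

# The observed microphone signal is simply the sum of the two sources;
# BSS tries to recover s1 and s2 given only mixtures like x
x = s1 + s2
```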
  15. Binaural information (spatial model): each source arrives at the near ear earlier and bigger, and at the far ear delayed and smaller
  16. Binaural information (spatial model): the two-channel observation is modeled as x = A s, where A is the mixing matrix, so the sources can be recovered as s = A^{-1} x
  17. Binaural information (spatial model): separation is written as y = W x, where the demixing matrix W = A^{-1} is estimated from the microphone signals x_1, x_2
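Not from the talk: a minimal numpy sketch of the mixing/demixing model for a single frequency bin, with an arbitrary made-up mixing matrix A:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2x2 mixing matrix A: how each source reaches each microphone
A = np.array([[1.0, 0.4],   # source 1: earlier/bigger at the left mic
              [0.3, 1.0]])  # source 2: earlier/bigger at the right mic
s = rng.standard_normal((2, 100))  # two source spectra over 100 frames
x = A @ s                          # observed two-channel mixture: x = A s

# With the demixing matrix W = A^{-1}, the sources are recovered: s = W x
W = np.linalg.inv(A)
s_hat = W @ x
```

In real BSS, A is unknown, so W must be estimated blindly (e.g., by IVA or ILRMA) rather than by inverting a known A.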
  18. Speech likelihood: statistical source modeling of the clean signal spectrum, e.g., Independent Vector Analysis (IVA) [Kim 2006][Hiroe 2006] and Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016]
  20. Speech source separation with deep neural network

  21. Deep-neural-network-based speech source model (diagram: deep neural network trained with a loss on its output)
  22. DNN-based speech source model + speech source separation: the DNN estimates the spatial model via time-frequency masking [Heymann 2016][Yoshioka 2018], followed by separation
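Not from the talk: a minimal numpy sketch of mask-based spatial model estimation in the spirit of [Heymann 2016]. The random mask stands in for a DNN output, and the MVDR-style beamformer and all shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
M, F, T = 2, 5, 50  # microphones, frequency bins, time frames
# Complex multi-channel spectrogram (random data standing in for speech)
X = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))
# A time-frequency speech mask in [0, 1]; in practice a DNN outputs this
mask = rng.uniform(size=(F, T))

# Mask-weighted spatial covariance matrices per frequency (speech / noise)
R_s = np.einsum('ft,mft,nft->fmn', mask, X, X.conj()) / mask.sum(axis=1)[:, None, None]
R_n = np.einsum('ft,mft,nft->fmn', 1.0 - mask, X, X.conj()) / (1.0 - mask).sum(axis=1)[:, None, None]

# MVDR-style beamformer per frequency, steering along R_s's principal eigenvector
Y = np.empty((F, T), dtype=complex)
for f in range(F):
    _, vecs = np.linalg.eigh(R_s[f])
    steer = vecs[:, -1]                 # principal eigenvector as steering vector
    num = np.linalg.solve(R_n[f], steer)
    w = num / (steer.conj() @ num)      # w = R_n^{-1} d / (d^H R_n^{-1} d)
    Y[f] = w.conj() @ X[:, f, :]        # beamformed single-channel output
```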
  23. DNN-based speech source model + speech source separation: is it optimal to train the DNN without considering the spatial model and the separation part?
  24. Deeply integrated multi-channel speech source separation

  25. LINE’s research on deeply integrated approach

  26. Research direction > Insertion of speech source separation into DNN

    structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non- DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation
  28. DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019] (block diagram: DNN speech source model → spatial model estimation → separation → loss, with back propagation through the whole chain)
  29. DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019]: with oracle clean signal s and estimated speech signal ŝ, the L2 loss is Loss = |s − ŝ|²
  30. DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019]: in addition to the estimated speech signal ŝ, the DNN also outputs an estimated variance Σ
  31. DNN training to maximize output speech quality [Togami ICASSP2019 (1)][Masuyama INTERSPEECH2019]: the proposed loss is Loss = (s − ŝ)^H Σ^{-1} (s − ŝ) + log det Σ
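Not from the talk: a minimal numpy sketch of a Gaussian negative-log-likelihood-style loss of the general form (s − ŝ)^H Σ^{-1} (s − ŝ) + log det Σ, evaluated for a single time-frequency bin; the function name and all values are made up for illustration:

```python
import numpy as np

def mc_gauss_loss(s, s_hat, Sigma):
    """Loss = (s - s_hat)^H Sigma^{-1} (s - s_hat) + log det Sigma,
    for one time-frequency bin over M microphones."""
    err = s - s_hat
    quad = (err.conj() @ np.linalg.solve(Sigma, err)).real
    return quad + np.log(np.linalg.det(Sigma).real)

rng = np.random.default_rng(3)
s = rng.standard_normal(2) + 1j * rng.standard_normal(2)  # oracle clean signal
# With Sigma = I the loss reduces to the plain L2 loss |s - s_hat|^2
loss = mc_gauss_loss(s, s + 0.1, np.eye(2))
```

With a learned Σ, the loss down-weights errors in bins the DNN declares uncertain, which is the intuition behind the improvement over the plain L2 loss on the next slide.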
  32. DNN training to maximize output speech quality [Togami ICASSP2019 (1)]: speech source separation performance

                         SIR (dB)  SDR (dB)
      L2 loss               11.99      9.46
      Proposed loss         12.16     10.17
  33. DNN training to maximize output speech quality [Masuyama INTERSPEECH2019]: spatial model estimation performance

                                         SIR (dB)  SDR (dB)
      Conventional DNN training (PSA)        6.21      5.79
      Proposed                               7.57      7.10
  34. Research direction > Insertion of speech source separation into DNN

    structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non- DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation
  35. Insertion of speech source separation into the DNN structure as a spatial constraint [Togami ICASSP2019 (2)] (cascade structure: BLSTM → BLSTM → BLSTM → speech source separation)
  39. Insertion of speech source separation into the DNN structure as a spatial constraint [Togami ICASSP2019 (2)] (nest structure: a speech source separation block after every BLSTM layer, versus the cascade structure with separation only at the end; back propagation runs through the separation blocks)
  40. Insertion of speech source separation into the DNN structure as a spatial constraint [Togami ICASSP2019 (2)]

      4 microphones   SIR (dB)  SDR (dB)
      Cascade            15.04     13.87
      Nest               15.36     14.06

      8 microphones   SIR (dB)  SDR (dB)
      Cascade            17.45     15.73
      Nest               18.14     16.33
  41. Research direction > Insertion of speech source separation into DNN

    structure as a spatial constraint > Unsupervised DNN training with speech source separation based on non- DNN statistical speech source modeling > DNN is trained so as to maximize output speech quality after speech source separation
  42. Unsupervised DNN training [Togami arxiv2019]: it is hard to obtain the oracle clean signal s
  44. Unsupervised DNN training [Togami arxiv2019] (block diagram: the DNN branch, speech source model plus spatial model estimation, and a non-DNN speech source separation branch each produce a separated signal and estimated variance; the non-DNN branch is utilized as a pseudo clean signal generator, and the loss between the two branches is back-propagated into the DNN)
  48. Unsupervised DNN training [Togami arxiv2019]: the loss between the DNN branch and the non-DNN pseudo clean signal generator, each producing a separated signal and an estimated variance, is the Kullback-Leibler divergence (KLD), back-propagated into the DNN
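Not from the talk: the KLD between two circular complex Gaussians, each parameterized by a separated signal (mean) and an estimated variance, has a closed form; a numpy sketch per time-frequency bin (the exact distributions used in [Togami arxiv2019] may differ):

```python
import numpy as np

def kl_complex_gauss(mu1, v1, mu2, v2):
    """KL( N_c(mu1, v1) || N_c(mu2, v2) ) between univariate circular
    complex Gaussians, applied independently at each time-frequency bin."""
    return np.log(v2 / v1) + (v1 + np.abs(mu1 - mu2) ** 2) / v2 - 1.0

# Identical distributions give zero divergence; a mismatch gives a positive loss
zero = kl_complex_gauss(0.3 + 0.4j, 0.5, 0.3 + 0.4j, 0.5)
gap = kl_complex_gauss(0.3 + 0.4j, 0.5, -0.1 + 0.2j, 0.8)
```

Unlike the L2 loss, the KLD also compares the variances, so the DNN is penalized for being over- or under-confident relative to the non-DNN teacher.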
  49. Unsupervised DNN training [Togami arxiv2019]

                 SIR (dB)  SDR (dB)
      Non-DNN        7.76      3.84
      L2 loss        6.02      3.62
      KLD loss      10.27      5.71
  50. Unsupervised DNN training [Togami arxiv2019] (comparison: oracle clean vs. noisy microphone input vs. non-DNN vs. DNN with KLD loss)
  53. Acknowledgements > Special thanks to my internship students, Mr. Nakagome and Mr. Masuyama (Waseda University) > Thanks for fruitful discussions to Prof. Kobayashi, Prof. Ogawa (Waseda University), Prof. Kawahara, Prof. Yoshii (Kyoto University), Prof. Hirose (NII), and Mr. Komatsu
  54. Conclusions > Integration of deep neural networks and multi-channel speech source separation is a key to practical use of speech source separation > Conventional non-DNN speech source separation is back in the spotlight, e.g., for unsupervised DNN training > Speech source separation is an emerging technique that enables speech applications to work in more adverse environments
  56. Rapid prototyping with Pyroomacoustics [Scheibler 2018]

      # Cmd prompt
      pip install pyroomacoustics

      # Python script
      import numpy as np
      from scipy import signal
      import pyroomacoustics as pa

      # Short-time Fourier transform of the time-series wave signals (wav_l / wav_r)
      _, _, spec_l = signal.stft(wav_l, fs=16000, window='hann', nperseg=1024, noverlap=256)
      _, _, spec_r = signal.stft(wav_r, fs=16000, window='hann', nperseg=1024, noverlap=256)
      # spec_l/r: (freq, time) -> stack into (time, freq, channel)
      spec = np.stack((spec_l.T, spec_r.T), axis=2)
      # y: (time, freq, source)
      y = pa.bss.auxiva(spec)
      # y = pa.bss.ilrma(spec)
      # Conversion back into the time domain (one source shown)
      _, odata = signal.istft(y[..., 0].T, fs=16000, window='hann', nperseg=1024, noverlap=256)
  58. References
      [Kim 2006] T. Kim, et al., “Independent vector analysis: an extension of ICA to multivariate components,” in Proc. ICA, pp. 165-172, Mar. 2006.
      [Hiroe 2006] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in Proc. ICA, pp. 601-608, Mar. 2006.
      [Kitamura 2016] D. Kitamura, et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626-1641, 2016.
      [Scheibler 2018] R. Scheibler, et al., “Pyroomacoustics: a Python package for audio room simulation and array processing algorithms,” in Proc. ICASSP, 2018, pp. 351-355.
      [Heymann 2016] J. Heymann, et al., “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. ICASSP, 2016, pp. 196-200.
      [Yoshioka 2018] T. Yoshioka, et al., “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in Proc. ICASSP, 2018, pp. 5739-5743.
      [Togami ICASSP2019 (1)] M. Togami, “Multi-channel Itakura Saito distance minimization with deep neural network,” in Proc. ICASSP, 2019, pp. 536-540.
      [Masuyama INTERSPEECH2019] Y. Masuyama, et al., “Multichannel loss function for supervised speech source separation by mask-based beamforming,” in Proc. INTERSPEECH, Sep. 2019.
      [Togami ICASSP2019 (2)] M. Togami, “Spatial constraint on multi-channel deep clustering,” in Proc. ICASSP, 2019, pp. 531-535.
      [Togami arxiv2019] M. Togami, et al., “Unsupervised training for deep speech source separation with Kullback-Leibler divergence based probabilistic loss function,” arXiv:1911.04228, 2019.
  59. Thank you for listening