Pyroomacoustics: Open source library for audio room simulation and prototyping speech enhancement systems

Robin Scheibler, LINE Corporation
Presentation slides from Tokyo BISH Bash #01.
https://tokyo-bish-bash.connpass.com/event/171564/

LINE Developers

April 07, 2020
Transcript

  1. pyroomacoustics: Room Simulation / Multichannel Audio Processing

    Robin Scheibler, LINE Corporation, Speech Team
    Tokyo BISH Bash #1, 2020/04/07
  2. Self-Introduction: Robin Scheibler

    role: Senior Researcher @ LINE (since 2020/03/01)
    education: Ph.D. in Signal Processing from EPFL (Switzerland)
    previously:
      • Post Doc at Tokyo Metropolitan University
      • Intern/Researcher at NEC, IBM
      • Built mobile Geiger counters (Safecast)
      • Since 2014, developer of pyroomacoustics
    research:
      • Fast transforms (Fourier, Hadamard, sparse, etc.)
      • Multi-channel audio processing
      • Reproducible research
    hobby: Skiing, DIY electronics, fermentation
    homepage: http://www.robinscheibler.org
    github: @fakufaku
    twitter: @fakufakurevenge
  3. Pyroomacoustics: Motivation

    Figure: a target speaker says "Hello!" to a smart speaker while a noise source is active. The smart speaker connects to services over the Internet.
    Front-end
      - denoising
      - beamforming
      - DOA
      - separation
    Services
      - Personal assistant
      - Speech-to-text
      - Search
      - ...
  7. Pyroomacoustics Python Package Summary

    Content
      • Room acoustics simulator (C/C++)
      • Multi-channel audio processing algorithms
    Install: $ pip install pyroomacoustics (binary wheels for Mac and Windows)
    Python: 3.7, 3.6, 3.5, (2.7)
    Requires: numpy, scipy
    Optional: matplotlib, sounddevice, samplingrate
    Doc: https://pyroomacoustics.readthedocs.io
    GitHub: https://github.com/LCAV/pyroomacoustics
  8. Motivation

    Development loop: TRY to run an algorithm → LISTEN to its output → REASON and modify it
    Prototyping of multichannel algorithms
      • Without pyroomacoustics: experiments → time consuming
      • With pyroomacoustics: simulation → fast → short cycle
    Data augmentation
      • Without pyroomacoustics: few examples of room impulse responses (RIR), difficult to collect
      • With pyroomacoustics: easy to generate lots of examples
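  9. Sound Propagation in a Room

    • Described by the wave equation:
        $\left( \nabla^2 - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} \right) u(\mathbf{r}, t) = 0$
    • Point source in free space:
        $u(\mathbf{r}, t) = \frac{1}{4 \pi \|\mathbf{r} - \mathbf{r}_0\|} \, \delta\left( t - \frac{\|\mathbf{r} - \mathbf{r}_0\|}{c} \right)$
    • Difficult for arbitrary boundaries (i.e. rooms)
    • Precise simulation ⇒ finite element methods (FEM)
    • Approximate ⇒ image source model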
  10. The Image Source Model

    • Walls are perfect reflectors
    • Impulse response from an image source is an impulse
    • Simple to implement
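A minimal sketch of the idea, not the pyroomacoustics implementation: it mirrors a source across the six walls of a shoebox room (first-order images only) and turns each position into one impulse, using the free-space delay and the 1/(4πd) gain of a point source. The helper names `first_order_images` and `impulse_tap` are hypothetical, the speed of sound of 343 m/s is an assumed value, and the room, source, and microphone coordinates are borrowed from the deck's example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def first_order_images(room_dim, source):
    """Mirror the source across each of the six walls of a shoebox room."""
    images = []
    for axis, length in enumerate(room_dim):
        for wall in (0.0, float(length)):
            image = list(source)
            # Reflection of a coordinate x across the plane x = wall
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images


def impulse_tap(position, mic, c=SPEED_OF_SOUND):
    """Delay (s) and gain of the impulse from one (image) source in free space."""
    distance = math.dist(position, mic)
    return distance / c, 1.0 / (4.0 * math.pi * distance)


room_dim = [10.0, 5.0, 3.2]
source = (2.5, 1.7, 1.69)
mic = (5.71, 2.31, 1.4)

# Direct path plus six first-order reflections, sorted by arrival time
taps = sorted(
    impulse_tap(p, mic) for p in [source] + first_order_images(room_dim, source)
)
for delay, gain in taps:
    print(f"{1000.0 * delay:6.2f} ms  gain {gain:.4f}")
```

Higher reflection orders repeat the mirroring on the images themselves; with perfectly reflecting walls the gain depends only on distance.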
  11. Pyroomacoustics Example

        import numpy as np
        import pyroomacoustics as pra

        # my_signal: 1-D array of audio samples at 16 kHz, loaded beforehand
        room = pra.ShoeBox(
            [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
        )
        # add one source at a time, with source signal
        room.add_source([2.5, 1.7, 1.69], signal=my_signal)
        # add microphone array, R.shape == (3, n_mics)
        R = np.array([[5.71, 2.31, 1.4], [5.72, 2.32, 1.4]]).T
        room.add_microphone_array(pra.MicrophoneArray(R, fs=room.fs))
        room.simulate()
        output_signal = room.mic_array.signals  # (n_mics, n_samples)
        room.plot(img_order=2)  # show room
        room.plot_rir()  # show RIR
  12. Image Sources and Impulse Response

    Figure: image sources and impulse response for maximum reflection orders 0, 1, 2, 3, 10, 30. Impulse response T60 = 25 ms
  13. Image Sources and Impulse Response

    Figure: image sources and impulse response for maximum reflection orders 0, 1, 2, 3, 10, 30. Impulse response T60 = 45 ms
  14. Image Sources and Impulse Response

    Figure: image sources and impulse response for maximum reflection orders 0, 1, 2, 3, 10, 30. Impulse response T60 = 65 ms
  15. Image Sources and Impulse Response

    Figure: image sources and impulse response for maximum reflection orders 0, 1, 2, 3, 10, 30. Impulse response T60 = 245 ms
  16. Image Sources and Impulse Response

    Figure: image sources and impulse response for maximum reflection orders 0, 1, 2, 3, 10, 30. Impulse response T60 = 712 ms
  17. Choosing Parameters for Desired T60

        room = pra.ShoeBox(
            [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
        )

    absorption
      • Use Sabine’s formula: $T_{60} = \frac{24 \ln 10}{c} \frac{V}{S a}$
        V: volume, S: surface, c: speed of sound
      • ⇒ solve for a
    max_order
      • Image sources are contained in a diamond
      • Min. integer such that a sphere with radius $T_{60} \cdot c$ is enclosed
    Code ref: https://github.com/fakufaku/bss_speech_dataset/blob/master/room_builder.py#L12
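As an illustration of the "solve for a" step, the sketch below evaluates Sabine's formula for a shoebox room and inverts it for the absorption. It is not the code from the referenced room_builder.py: the function names `sabine_t60` and `absorption_for_t60` are hypothetical, and the speed of sound of 343 m/s is an assumed value. The max_order rule is not covered here.

```python
import math


def _volume_and_surface(room_dim):
    """Volume and total wall surface of a shoebox room [lx, ly, lz] in meters."""
    lx, ly, lz = room_dim
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    return volume, surface


def sabine_t60(room_dim, absorption, c=343.0):
    """Sabine's formula: T60 = (24 ln 10 / c) * V / (S * a)."""
    volume, surface = _volume_and_surface(room_dim)
    return 24.0 * math.log(10.0) / c * volume / (surface * absorption)


def absorption_for_t60(room_dim, t60, c=343.0):
    """Sabine's formula solved for the absorption a, given a target T60 in seconds."""
    volume, surface = _volume_and_surface(room_dim)
    return 24.0 * math.log(10.0) * volume / (c * surface * t60)


room_dim = [10.0, 5.0, 3.2]
t60 = sabine_t60(room_dim, absorption=0.25)
print(f"T60 for a = 0.25: {t60:.3f} s")
print(f"a for T60 = 0.3 s: {absorption_for_t60(room_dim, 0.3):.3f}")
```

Under these assumptions, the room from the example with absorption 0.25 comes out at a T60 of roughly half a second.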
  18. Samples

    Coming soon [branch next_gen_simulator]
      • Ray tracing (complex geometries, scattering)
      • Frequency dependent absorption
      • Air absorption
    Simulation Results
      Sim. method: Dry sound / ISM / Hybrid (scattering) / Hybrid (scattering, air absorption)
      Rooms: Bedroom (small), Office (medium), Hall (large)
  19. Data Augmentation for Training a Keyword Spotter

    Courtesy of Eric Bezzam, Snips (now part of Sonos)
    Task: Keyword spotting, i.e. recognize "Hey Snips!"
    Clean samples: Recordings of keyword ("Hey Snips!")
    Noise samples: MUSAN (sounds) and Librispeech (speech)
    Test samples: Hold-out set of "Hey Snips" re-recorded
    Prior art [1]: ISM, T60 sampled randomly (ISM T60)

    1. Chanwoo Kim et al., “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” Interspeech, 2017.
  20. Results

    Configurations: ISM T60, ISM MAT, HYB MAT, HYB FREQ, HYB FREQ AIR
    Features compared: ISM only, Hybrid, Rand. material, Scattering, Multi-freq., Air absorption

    SNR     Noise     ISM T60   ISM MAT   HYB MAT   HYB FREQ   HYB FREQ AIR
    clean   -         0.92%     0.58%     0.53%     0.46%      0.42%
    5 dB    sounds    9.42%     7.14%     7.25%     6.04%      5.42%
    5 dB    speech    16.0%     13.1%     14.7%     12.5%      12.5%
    2 dB    sounds    16.8%     14.6%     14.2%     12.3%      11.2%
    2 dB    speech    30.4%     27.1%     29.9%     26.0%      26.6%
    Avg. rel. improv.  -        20.8%     18.2%     29.9%      33.0%

    Table: False rejection rates (in percent) for a false alarm per hour rate of 0.125 (three false alarms per day).
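The slide does not say how the "Avg. rel. improv." row is computed. One plausible reading, assumed here, is the mean over the five test conditions of the relative reduction in false rejection rate against the ISM T60 baseline. The sketch below applies that assumed definition to the rounded values from the table (the helper name `avg_rel_improvement` is hypothetical); it lands within about 0.2 percentage points of the reported row, which is compatible with rounding of the table entries.

```python
# False rejection rates (%) from the table, one value per test condition:
# clean, 5 dB sounds, 5 dB speech, 2 dB sounds, 2 dB speech
frr = {
    "ISM T60": [0.92, 9.42, 16.0, 16.8, 30.4],
    "ISM MAT": [0.58, 7.14, 13.1, 14.6, 27.1],
    "HYB MAT": [0.53, 7.25, 14.7, 14.2, 29.9],
    "HYB FREQ": [0.46, 6.04, 12.5, 12.3, 26.0],
    "HYB FREQ AIR": [0.42, 5.42, 12.5, 11.2, 26.6],
}


def avg_rel_improvement(baseline, candidate):
    """Mean relative reduction (%) of the candidate's rates w.r.t. the baseline.

    Assumed definition; the deck does not state the formula.
    """
    reductions = [(b - c) / b for b, c in zip(baseline, candidate)]
    return 100.0 * sum(reductions) / len(reductions)


improvements = {
    name: avg_rel_improvement(frr["ISM T60"], rates)
    for name, rates in frr.items()
    if name != "ISM T60"
}
for name, value in improvements.items():
    print(f"{name:13s} {value:5.1f}%")
```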
  24. Background

    Cases (M microphones, K sources): Underdet. (M < K), Determined (M = K), Overdet. (M > K)
    Frequency Domain Blind Source Separation
      Figure: sources → mics → spectrograms (time, frequency) → separated sources
    Advantage of BSS
      • No prior information required, only signals!
      • Reliable enhancement via separation
  28. BSS Algorithms in pyroomacoustics

    Columns on the slide: Algorithm, Source model, Under., Det., Over.

    Algorithm                   Source model
    AuxIVA [1] / OverIVA [2]    spherical
    SparseAuxIVA [3]            spherical
    ILRMA [4]                   low-rank
    FastMNMF [5]                low-rank

    1. N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” WASPAA, 2011.
    2. R. Scheibler and N. Ono, “Independent vector analysis with more microphones than sources,” WASPAA, 2019.
    3. J. Janský et al., “A computationally cheaper method for blind speech separation based on AuxIVA and incomplete demixing transform,” Proc. IWAENC, 2016.
    4. D. Kitamura et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. ASLP, 2016.
    5. K. Sekiguchi et al., “Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices,” EUSIPCO, 2019.
  29. Example in Pyroomacoustics

        import pyroomacoustics as pra
        from pyroomacoustics.transform import stft
        from scipy.io import wavfile

        fs, audio = wavfile.read("path/to/multichannel_audio.wav")

        # STFT parameters
        nfft = 4096  # 256 ms frame @ 16 kHz
        hop = nfft // 4  # 64 ms shifts
        win_a = pra.hamming(nfft)  # window function
        win_s = stft.compute_synthesis_window(win_a, hop)

        # X.shape == (n_frames, n_freq, n_channels)
        X = stft.analysis(audio, nfft, hop, win=win_a)

        # Separation, n_iter ~ 10 times n_channels
        Y = pra.bss.auxiva(X, n_iter=30)

        audio_output = stft.synthesis(Y, nfft, hop, win=win_s)
        wavfile.write("path/to/output/file.wav", fs, audio_output)
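The timing comments in the STFT parameters follow from simple arithmetic, sketched below. The helper `stft_timing` is hypothetical and not part of pyroomacoustics, and the sampling rate of 16 kHz is the value assumed in the slide's comment rather than one read from a file.

```python
def stft_timing(nfft, hop, fs):
    """Return frame length (ms), hop (ms), and overlap fraction of an STFT."""
    frame_ms = 1000.0 * nfft / fs
    hop_ms = 1000.0 * hop / fs
    overlap = 1.0 - hop / nfft
    return frame_ms, hop_ms, overlap


fs = 16000  # Hz, assumed as in the slide's comment
nfft = 4096
hop = nfft // 4

frame_ms, hop_ms, overlap = stft_timing(nfft, hop, fs)
print(f"frame: {frame_ms} ms, hop: {hop_ms} ms, overlap: {100.0 * overlap}%")
```

A file recorded at a different sampling rate would change both durations proportionally, so the comments hold only at 16 kHz.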
  30. Example of Separated Outputs

    Algorithm   Time     SIR (Source 1)   SIR (Source 2)   SIR (Source 3)
    Clean       -        ∞                ∞                ∞
    Mix         -        -2.8             -2.89            -2.75
    AuxIVA      6.33 s   10.13            15.95            11.56
    ILRMA       8.84 s   10.48            16.08            12.03
    FastMNMF    35.9 s   11.38            17.12            10.60
  31. Conclusion

    pyroomacoustics
      • Simulation of room acoustics
      • Reference implementations of multichannel processing algorithms
      • Data augmentation effective for ASR/KWS systems
      • Rapid prototyping and faster experiment cycle
    What’s next?
      • Release next_gen_simulator (ray tracing, air absorption)
      • Desired: directional microphones and sources
      • Help is very welcome! https://github.com/LCAV/pyroomacoustics