Slide 1

pyroomacoustics: Room Simulation / Multichannel Audio Processing
Robin Scheibler, LINE Corporation, Speech Team
Tokyo BISH Bash #1 — 2020/04/07

Slide 2

Self-Introduction: Robin Scheibler
Role: Senior Researcher @ LINE (since 2020/03/01)
Education: Ph.D. in Signal Processing from EPFL (Switzerland)
Previously:
• Postdoc at Tokyo Metropolitan University
• Intern/Researcher at NEC, IBM
• Built mobile Geiger counters for Safecast
• Since 2014, developer of pyroomacoustics
Research:
• Fast transforms (Fourier, Hadamard, sparse, etc.)
• Multi-channel audio processing
• Reproducible research
Hobbies: ski, DIY electronics, fermentation
Homepage: http://www.robinscheibler.org
GitHub: @fakufaku
Twitter: @fakufakurevenge

Slide 3

Outline
1. Pyroomacoustics General
2. Room Simulation
3. Blind Source Separation

Slide 4

Pyroomacoustics General

Slide 5

Pyroomacoustics: Motivation
[Diagram: smart speaker scenario. A target speaker says "Hello!" while a noise source interferes; the device front-end (denoising, beamforming, DOA, separation) cleans the signals before sending them over the internet to services (personal assistant, speech-to-text, search, ...).]

Slide 9

Pyroomacoustics Python Package Summary
Content: room acoustics simulator (C/C++), multi-channel audio processing algorithms
Install: $ pip install pyroomacoustics (binary wheels for Mac and Windows)
Python: 3.7, 3.6, 3.5, (2.7)
Requires: numpy, scipy
Optional: matplotlib, sounddevice, samplerate
Doc: https://pyroomacoustics.readthedocs.io
GitHub: https://github.com/LCAV/pyroomacoustics

Slide 10

Motivation
Development loop: TRY to run an algorithm → LISTEN to its output → REASON and modify it.
Prototyping of multichannel algorithms:
• Without pyroomacoustics: real experiments → time consuming
• With pyroomacoustics: simulation → fast → short development cycle
Data augmentation:
• Without pyroomacoustics: few example RIRs, difficult to collect
• With pyroomacoustics: easy to generate lots of examples

Slide 11

Room Simulation

Slide 12

Sound Propagation in a Room
• Described by the wave equation: $\left( \nabla^2 - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} \right) u(\mathbf{r}, t) = 0$
• Point source in free space: $u(\mathbf{r}, t) = \frac{1}{4\pi \|\mathbf{r} - \mathbf{r}_0\|} \, \delta\!\left( t - \frac{\|\mathbf{r} - \mathbf{r}_0\|}{c} \right)$
• Difficult to solve for arbitrary boundaries (i.e., rooms)
• Precise simulation ⇒ finite element methods (FEM)
• Approximate simulation ⇒ image source model

Slide 13

The Image Source Model
• Walls are perfect reflectors
• The impulse response from an image source is an impulse
• Simple to implement
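To illustrate the last point, here is a minimal sketch (my own illustration, not pyroomacoustics internals) of the first reflection order in a shoebox room: each wall mirrors the source, and each image contributes a delayed, attenuated impulse to the RIR.

import numpy as np

def first_order_images(source, room_dim):
    # Mirror the source across each of the six walls of a shoebox room.
    images = []
    for axis in range(3):
        for wall in (0.0, room_dim[axis]):
            img = np.array(source, dtype=float)
            img[axis] = 2.0 * wall - img[axis]  # reflect across the wall plane
            images.append(img)
    return images

# Each image adds an impulse to the RIR with delay d / c and amplitude
# proportional to beta / (4 * pi * d), d being the image-microphone distance.
source, mic = [2.5, 1.7, 1.69], [5.7, 2.3, 1.4]
for img in first_order_images(source, [10.0, 5.0, 3.2]):
    d = np.linalg.norm(img - np.array(mic))
    print(f"image at {np.round(img, 2)}, delay {1e3 * d / 343.0:.1f} ms")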

Slide 14

Image Source Model: Example
[Figure: animated example of the image source model.]

Slide 19

Implementation Concept
[Diagram: the inputs (Room, Mics, Sources) feed the image source model, whose outputs are the image sources and the RIRs.]
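In code, this pipeline is a short sketch (same room as on the next slide); `compute_rir` runs the image source model and fills `room.rir`, indexed as [microphone][source]:

import numpy as np
import pyroomacoustics as pra

room = pra.ShoeBox([10, 5, 3.2], fs=16000, absorption=0.25, max_order=17)
room.add_source([2.5, 1.7, 1.69])       # no signal needed to compute RIRs
R = np.array([[5.71], [2.31], [1.4]])   # one microphone, shape (3, 1)
room.add_microphone_array(pra.MicrophoneArray(R, fs=room.fs))

room.compute_rir()     # run the image source model
rir = room.rir[0][0]   # RIR from source 0 to microphone 0, a 1-D array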

Slide 20

Pyroomacoustics Example

import numpy as np
import pyroomacoustics as pra

room = pra.ShoeBox(
    [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
)

# add one source at a time, with its source signal
room.add_source([2.5, 1.7, 1.69], signal=my_signal)

# add microphone array, R.shape == (3, n_mics)
R = np.array([[5.71, 2.31, 1.4], [5.72, 2.32, 1.4]]).T
room.add_microphone_array(pra.MicrophoneArray(R, fs=room.fs))

room.simulate()
output_signal = room.mic_array.signals  # (n_mics, n_samples)

room.plot(img_order=2)  # show room
room.plot_rir()         # show RIR
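Here `my_signal` is the dry source signal as a 1-D array; it can be loaded, for instance (hypothetical path), from a mono WAV file matching the room's sampling rate:

from scipy.io import wavfile

fs, my_signal = wavfile.read("path/to/dry_speech.wav")  # fs should equal room.fs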

Slide 21

Image Sources and Impulse Response
[Figure: image sources and the simulated impulse response as the maximum reflection order grows; orders 0, 1, 2, 3, 10, and 30 yield t60 values of roughly 2, 25, 45, 65, 245, and 712 ms.]
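Such t60 values can be checked directly on a simulated RIR; a sketch assuming the `measure_rt60` helper from `pyroomacoustics.experimental.rt60` (present in recent versions):

import numpy as np
import pyroomacoustics as pra
from pyroomacoustics.experimental.rt60 import measure_rt60

room = pra.ShoeBox([10, 5, 3.2], fs=16000, absorption=0.25, max_order=30)
room.add_source([2.5, 1.7, 1.69])
R = np.array([[5.71], [2.31], [1.4]])
room.add_microphone_array(pra.MicrophoneArray(R, fs=room.fs))
room.compute_rir()

rt60 = measure_rt60(room.rir[0][0], fs=room.fs)  # in seconds
print(f"measured t60: {1e3 * rt60:.0f} ms")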

Slide 27

Choosing Parameters for a Desired T60

room = pra.ShoeBox(
    [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
)

absorption: use Sabine's formula $T_{60} = \frac{24 \ln 10}{c} \cdot \frac{V}{S a}$ (V: volume, S: surface, c: speed of sound) and solve for the absorption a.
max_order: the image sources of a given order are contained in a diamond; pick the smallest integer such that the sphere of radius $c \cdot T_{60}$ is enclosed.
Code ref: https://github.com/fakufaku/bss_speech_dataset/blob/master/room_builder.py#L12
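A minimal sketch of this recipe (a hypothetical helper following the formula above; recent pyroomacoustics versions ship a similar `inverse_sabine` function):

import numpy as np

def absorption_and_max_order(t60, room_dim, c=343.0):
    L = np.array(room_dim, dtype=float)
    V = np.prod(L)                                # room volume
    S = 2 * (L[0]*L[1] + L[1]*L[2] + L[0]*L[2])   # total wall surface
    a = 24 * np.log(10) / c * V / (S * t60)       # Sabine, solved for a
    # order-n images extend roughly n * min(L) from the room, so pick the
    # smallest order whose diamond encloses a sphere of radius c * t60
    max_order = int(np.ceil(c * t60 / np.min(L)))
    return a, max_order

a, max_order = absorption_and_max_order(0.3, [10, 5, 3.2])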

Slide 28

Samples
Coming soon [branch next_gen_simulator]:
• Ray tracing (complex geometries, scattering)
• Frequency-dependent absorption
• Air absorption
[Audio demo: a dry sound rendered in a small bedroom, a medium office, and a large hall, simulated with plain ISM, a hybrid method with scattering, and a hybrid method with scattering and air absorption.]

Slide 29

Data Augmentation for Training a Keyword Spotter
Courtesy of Eric Bezzam, Snips (now part of Sonos)
Task: keyword spotting, i.e., recognize "Hey Snips!"
Clean samples: recordings of the keyword ("Hey Snips!")
Noise samples: MUSAN (sounds) and Librispeech (speech)
Test samples: hold-out set of "Hey Snips!" re-recorded
Prior art [1]: ISM with randomly sampled T60 (ISM T60)

[1] Chanwoo Kim et al., "Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home," Interspeech, 2017.
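A minimal sketch of this kind of augmentation (file paths, sampling ranges, and SNRs are illustrative assumptions, not the study's actual settings):

import numpy as np
import pyroomacoustics as pra
from scipy.io import wavfile

fs, keyword = wavfile.read("hey_snips.wav")  # clean keyword recording
_, noise = wavfile.read("musan_noise.wav")   # noise, at least as long as the output

rng = np.random.default_rng(0)
room_dim = rng.uniform([3.0, 3.0, 2.5], [10.0, 8.0, 4.0])  # random room size
absorption = rng.uniform(0.1, 0.5)                         # random wall absorption

room = pra.ShoeBox(room_dim, fs=fs, absorption=absorption, max_order=17)
room.add_source(rng.uniform(0.5, room_dim - 0.5), signal=keyword.astype(float))
mic = rng.uniform(0.5, room_dim - 0.5)[:, None]            # shape (3, 1)
room.add_microphone_array(pra.MicrophoneArray(mic, fs=fs))
room.simulate()

reverberant = room.mic_array.signals[0]
# mix with noise at a random SNR
snr_db = rng.uniform(2.0, 5.0)
n = noise[: len(reverberant)].astype(float)
g = np.sqrt(np.var(reverberant) / (np.var(n) * 10 ** (snr_db / 10)))
augmented = reverberant + g * n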

Slide 30

Results
Configurations: ISM T60 (prior art, ISM only), ISM MAT, HYB MAT, HYB FREQ, HYB FREQ AIR; the features compared are ISM-only vs. hybrid simulation, random materials, scattering, multi-frequency absorption, and air absorption.

SNR    Noise   | ISM T60  ISM MAT  HYB MAT  HYB FREQ  HYB FREQ AIR
clean  -       | 0.92%    0.58%    0.53%    0.46%     0.42%
5 dB   sounds  | 9.42%    7.14%    7.25%    6.04%     5.42%
5 dB   speech  | 16.0%    13.1%    14.7%    12.5%     12.5%
2 dB   sounds  | 16.8%    14.6%    14.2%    12.3%     11.2%
2 dB   speech  | 30.4%    27.1%    29.9%    26.0%     26.6%
Avg. rel. improv. | -     20.8%    18.2%    29.9%     33.0%

Table: false rejection rates (in percent) at a false alarm rate of 0.125 per hour (three false alarms per day).

Slide 31

Blind Source Separation

Slide 32

Background: Frequency-Domain Blind Source Separation
Regimes (M microphones, K sources): underdetermined (M < K), determined (M = K), overdetermined (M > K).
[Diagram: the microphone spectrograms (time × frequency) are unmixed into the separated source spectrograms.]
Advantages of BSS:
• No prior information required, only signals!
• Reliable enhancement via separation
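In the determined case (M = K), the model underlying these methods is, per frequency bin f and time frame t (standard notation, not from the slides):

$\mathbf{x}(f, t) = \mathbf{A}(f)\, \mathbf{s}(f, t), \qquad \hat{\mathbf{s}}(f, t) = \mathbf{W}(f)\, \mathbf{x}(f, t)$

where $\mathbf{x}(f, t)$ stacks the M microphone STFT coefficients, $\mathbf{A}(f)$ is the unknown mixing matrix, and BSS estimates the demixing matrices $\mathbf{W}(f) \approx \mathbf{A}(f)^{-1}$, up to per-source scaling and permutation, from the observed signals alone.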

Slide 39

BSS Algorithms in pyroomacoustics

Algorithm                 | Source model
AuxIVA [1] / OverIVA [2]  | spherical
SparseAuxIVA [3]          | spherical
ILRMA [4]                 | low-rank
FastMNMF [5]              | low-rank

[1] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," WASPAA, 2011.
[2] R. Scheibler and N. Ono, "Independent vector analysis with more microphones than sources," WASPAA, 2019.
[3] J. Janský et al., "A computationally cheaper method for blind speech separation based on AuxIVA and incomplete demixing transform," IWAENC, 2016.
[4] D. Kitamura et al., "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Trans. ASLP, 2016.
[5] K. Sekiguchi et al., "Fast multichannel source separation based on jointly diagonalizable spatial covariance matrices," EUSIPCO, 2019.
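These all expose a similar interface on the STFT-domain tensor (the full pipeline is on the next slide), so swapping separators is a one-line change; a sketch with a dummy input, assuming the `auxiva`, `ilrma`, and `fastmnmf` entry points of `pra.bss`:

import numpy as np
import pyroomacoustics as pra

# dummy STFT tensor of shape (n_frames, n_freq, n_channels); in practice,
# obtain it with stft.analysis as on the next slide
X = np.random.randn(200, 257, 2) + 1j * np.random.randn(200, 257, 2)

Y = pra.bss.auxiva(X, n_iter=20)    # spherical source model
Y = pra.bss.ilrma(X, n_iter=20)     # low-rank (NMF) source model
Y = pra.bss.fastmnmf(X, n_iter=20)  # low-rank, jointly diagonalizable covariances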

Slide 40

Example in Pyroomacoustics

import pyroomacoustics as pra
from pyroomacoustics.transform import stft
from scipy.io import wavfile

fs, audio = wavfile.read("path/to/multichannel_audio.wav")

# STFT parameters
nfft = 4096                # 256 ms frames @ 16 kHz
hop = nfft // 4            # 64 ms shifts
win_a = pra.hamming(nfft)  # analysis window function
win_s = stft.compute_synthesis_window(win_a, hop)

# X.shape == (n_frames, n_freq, n_channels)
X = stft.analysis(audio, nfft, hop, win=win_a)

# separation, n_iter ~ 10 times n_channels
Y = pra.bss.auxiva(X, n_iter=30)

audio_output = stft.synthesis(Y, nfft, hop, win=win_s)
wavfile.write("path/to/output/file.wav", fs, audio_output)

Slide 41

Example of Separated Outputs

Method    | Time    | SIR (dB): source 1 | source 2 | source 3
Clean     | -       | ∞     | ∞     | ∞
Mix       | -       | -2.8  | -2.89 | -2.75
AuxIVA    | 6.33 s  | 10.13 | 15.95 | 11.56
ILRMA     | 8.84 s  | 10.48 | 16.08 | 12.03
FastMNMF  | 35.9 s  | 11.38 | 17.12 | 10.60
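SIR tables like this one can be computed from reference and separated time-domain signals; a sketch using the external mir_eval package (not part of pyroomacoustics; the arrays here are placeholders):

import numpy as np
from mir_eval.separation import bss_eval_sources

refs = np.random.randn(3, 16000)       # reference sources, (n_sources, n_samples)
estimates = np.random.randn(3, 16000)  # separated outputs, same shape

sdr, sir, sar, perm = bss_eval_sources(refs, estimates)
print("SIR per source (dB):", sir)     # `perm` gives the matching permutation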

Slide 42

Conclusion
pyroomacoustics:
• Simulation of room acoustics
• Reference implementations of multichannel processing algorithms
• Data augmentation effective for ASR/KWS systems
• Rapid prototyping and a faster experiment cycle
What's next?
• Release next_gen_simulator (ray tracing, air absorption)
• Desired: directional microphones and sources
• Help is very welcome! https://github.com/LCAV/pyroomacoustics