pyroomacoustics: Room Simulation / Multichannel Audio Processing
Robin Scheibler
LINE Corporation, Speech Team
Tokyo BISH Bash #1, 2020/04/07

Self-Introduction
Robin Scheibler
Role: Senior Researcher @ LINE (since 2020/03/01)
Education: Ph.D. in Signal Processing from EPFL (Switzerland)
Previously:
• Postdoc at Tokyo Metropolitan University
• Intern/Researcher at NEC, IBM
• Built mobile Geiger counters with Safecast
• Since 2014, developer of pyroomacoustics
Research:
• Fast transforms (Fourier, Hadamard, sparse, etc.)
• Multichannel audio processing
• Reproducible research
Hobbies: skiing, DIY electronics, fermentation
Links: homepage, GitHub @fakufaku, Twitter @fakufakurevenge

Outline
1. Pyroomacoustics General
2. Room Simulation
3. Blind Source Separation

Pyroomacoustics General

Pyroomacoustics: Motivation
[Figure: a smart speaker ("Hello!") in a room with a target speaker and a noise source, connected to services over the Internet]
Front-end
- Denoising
- Beamforming
- DOA (direction of arrival)
- Separation
Services
- Personal assistant
- Speech-to-text
- Search
- ...

Pyroomacoustics Python Package Summary
Content:
• Room acoustics simulator (C/C++)
• Multichannel audio processing algorithms
Install: $ pip install pyroomacoustics (binary wheels for Mac and Windows)
Python: 3.7, 3.6, 3.5, (2.7)
Requires: numpy, scipy
Optional: matplotlib, sounddevice, samplerate
Links: documentation, GitHub

Motivation
Development loop:
• TRY to run an algorithm
• LISTEN to its output
• REASON about the result and modify the algorithm
Prototyping of multichannel algorithms:
• Without pyroomacoustics: physical experiments, which are time-consuming
• With pyroomacoustics: simulation, which is fast and shortens the cycle
Data augmentation:
• Without pyroomacoustics: few example room impulse responses (RIRs), difficult to collect
• With pyroomacoustics: easy to generate many examples

Room Simulation

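Sound Propagation in a Room
• Described by the wave equation:
  $$\left( \nabla^2 - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} \right) u(\mathbf{r}, t) = 0$$
• Point source at $\mathbf{r}_0$ in free space:
  $$u(\mathbf{r}, t) = \frac{1}{4\pi \|\mathbf{r} - \mathbf{r}_0\|} \, \delta\!\left( t - \frac{\|\mathbf{r} - \mathbf{r}_0\|}{c} \right)$$
• Difficult to solve for arbitrary boundaries (i.e. rooms)
• Precise simulation ⇒ finite element methods (FEM)
• Approximate simulation ⇒ image source model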
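The free-space solution says that a point source is heard as a single impulse, delayed by distance / c and scaled by 1 / (4π · distance). A minimal sketch of that computation follows. The function name `free_space_response` and the value c = 343 m/s are assumptions made for illustration and are not part of pyroomacoustics.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def free_space_response(src, mic, fs, c=SPEED_OF_SOUND):
    """Return (delay in samples, gain) of the free-space point-source response.

    Hypothetical helper: it evaluates the two quantities that appear in
    u(r, t) = delta(t - ||r - r0|| / c) / (4 pi ||r - r0||).
    """
    dist = math.sqrt(sum((s - m) ** 2 for s, m in zip(src, mic)))  # ||r - r0||
    delay = dist / c * fs                  # arrival time, in (fractional) samples
    gain = 1.0 / (4.0 * math.pi * dist)    # spherical spreading loss
    return delay, gain


# Source and microphone positions in meters, sampling rate in Hz
delay, gain = free_space_response([2.5, 1.7, 1.69], [5.71, 2.31, 1.4], fs=16000)
print(f"delay = {delay:.1f} samples, gain = {gain:.4f}")
```

The delay is generally fractional, so a discrete-time simulator has to interpolate the impulse rather than round it to the nearest sample.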

The Image Source Model
• Walls are perfect reflectors
• The impulse response from each image source is an impulse
• Simple to implement
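To make the idea concrete, here is a small sketch of the first reflection order only, assuming a shoebox room with one corner at the origin. Each wall mirrors the source once, and each image contributes one free-space impulse at the microphone. The helper names `first_order_images` and `arrival` are hypothetical, and c = 343 m/s is an assumed value. This is an illustration of the principle, not the library's implementation.

```python
import math


def first_order_images(src, room_dim):
    """Mirror `src` across the 6 walls of the shoebox [0, Lx] x [0, Ly] x [0, Lz]."""
    images = []
    for axis, length in enumerate(room_dim):
        for wall in (0.0, float(length)):
            img = list(src)
            img[axis] = 2.0 * wall - src[axis]  # reflection across the plane at `wall`
            images.append(img)
    return images


def arrival(src, mic, fs, c=343.0):
    """Delay (samples) and gain of the free-space impulse from `src` to `mic`."""
    dist = math.sqrt(sum((s - m) ** 2 for s, m in zip(src, mic)))
    return dist / c * fs, 1.0 / (4.0 * math.pi * dist)


room_dim = [10, 5, 3.2]
source = [2.5, 1.7, 1.69]
mic = [5.71, 2.31, 1.4]

# Direct path plus the 6 first-order echoes, sorted by arrival time.
# Walls are perfect reflectors here, so spreading is the only attenuation.
echoes = sorted(
    arrival(p, mic, fs=16000) for p in [source] + first_order_images(source, room_dim)
)
for delay, gain in echoes:
    print(f"{delay:8.1f} samples  gain {gain:.4f}")
```

Higher orders follow by mirroring the images again, which is why the number of image sources grows quickly with the maximum reflection order.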

Image Source Model: Example
[Figure]

Implementation Concept
[Diagram: Input (room, mics, sources) → image source model (images) → Output (RIRs)]

Pyroomacoustics Example

    import numpy as np
    import pyroomacoustics as pra

    # my_signal: 1-D numpy array holding the source audio, sampled at 16 kHz
    room = pra.ShoeBox(
        [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
    )

    # add one source at a time, with source signal
    room.add_source([2.5, 1.7, 1.69], signal=my_signal)

    # add microphone array, R.shape == (3, n_mics)
    R = np.array([[5.71, 2.31, 1.4], [5.72, 2.32, 1.4]]).T
    room.add_microphone_array(pra.MicrophoneArray(R, fs=room.fs))

    room.simulate()
    output_signal = room.mic_array.signals  # (n_mics, n_samples)

    room.plot(img_order=2)  # show room
    room.plot_rir()  # show RIR

Image Sources and Impulse Response
[Figure: image sources and the resulting impulse response for maximum reflection order 0, 1, 2, 3, 10, 30; displayed t60 = 2 ms]

Choosing Parameters for Desired T60

    room = pra.ShoeBox(
        [10, 5, 3.2], fs=16000, absorption=0.25, max_order=17
    )

absorption
• Use Sabine’s formula:
  $$T_{60} = \frac{24 \ln 10}{c} \, \frac{V}{S a}$$
  V: volume, S: surface area, c: speed of sound, a: absorption
• ⇒ solve for a
max_order
• Image sources are contained in a diamond
• Choose the minimum integer such that a sphere with radius T60 · c is enclosed in that diamond
Code ref:
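Both rules can be sketched in a few lines of plain Python. The function names below are hypothetical, and the sketch rests on these assumptions: c = 343 m/s, the same absorption a on all six walls, and a diamond for order N that has its vertices at ±N·L along each axis (so its inscribed sphere has radius N / sqrt(1/Lx² + 1/Ly² + 1/Lz²)). The library's own code may use a different rule, so treat this as one reading of the bullets above.

```python
import math


def sabine_t60(room_dim, absorption, c=343.0):
    """T60 = 24 ln(10) V / (c S a) for a shoebox with uniform wall absorption."""
    lx, ly, lz = room_dim
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    return 24.0 * math.log(10.0) * volume / (c * surface * absorption)


def absorption_for_t60(room_dim, t60, c=343.0):
    """Solve Sabine's formula for the absorption a that gives the desired T60."""
    lx, ly, lz = room_dim
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)
    return 24.0 * math.log(10.0) * volume / (c * surface * t60)


def max_order_for_t60(room_dim, t60, c=343.0):
    """Smallest order N whose diamond encloses a sphere of radius c * T60.

    Assumes the order-N diamond has vertices at +/- N * L_i on each axis.
    """
    inv_radius_per_order = math.sqrt(sum(1.0 / (length ** 2) for length in room_dim))
    return math.ceil(c * t60 * inv_radius_per_order)


room_dim = [10, 5, 3.2]
t60 = sabine_t60(room_dim, absorption=0.25)
print(f"T60 for a = 0.25: {t60:.3f} s")
print(f"a for that T60:   {absorption_for_t60(room_dim, t60):.3f}")
print(f"max_order:        {max_order_for_t60(room_dim, t60)}")
```

Under these assumptions the order needed to cover the full T60 grows large quickly, which illustrates why the choice of max_order is a trade-off between accuracy of the late reverberation and computation time.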

Samples
Coming soon [branch next_gen_simulator]
• Ray tracing (complex geometries, scattering)
• Frequency-dependent absorption
• Air absorption
Simulation Results (audio samples)
Sim. method    ISM    Hybrid        Hybrid
Features       -      scattering    scattering, air absorption
Rows: Dry sound, Bedroom (small), Office (medium), Hall (large)

Data Augmentation for Training a Keyword Spotter
Courtesy of Eric Bezzam, Snips (now part of Sonos)
Task: keyword spotting, i.e. recognize "Hey Snips!"
Clean samples: recordings of the keyword ("Hey Snips!")
Noise samples: MUSAN (sounds) and Librispeech (speech)
Test samples: hold-out set of "Hey Snips" re-recorded
Prior art [1]: ISM, T60 sampled randomly (ISM T60)
[1] Chanwoo Kim et al., "Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home," Interspeech, 2017.

Results
Configurations: ISM T60, ISM MAT, HYB MAT, HYB FREQ, HYB FREQ AIR
Features that distinguish them: ISM only, Hybrid, Rand. material, Scattering, Multi-freq., Air absorption

SNR                Noise    ISM T60   ISM MAT   HYB MAT   HYB FREQ   HYB FREQ AIR
clean              -        0.92%     0.58%     0.53%     0.46%      0.42%
5 dB               sounds   9.42%     7.14%     7.25%     6.04%      5.42%
5 dB               speech   16.0%     13.1%     14.7%     12.5%      12.5%
2 dB               sounds   16.8%     14.6%     14.2%     12.3%      11.2%
2 dB               speech   30.4%     27.1%     29.9%     26.0%      26.6%
Avg. rel. improv.           -         20.8%     18.2%     29.9%      33.0%

Table: False rejection rates (in percent) at a false alarm rate of 0.125 per hour (three false alarms per day).
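The "Avg. rel. improv." row can be checked from the other rows. The sketch below assumes it is the mean, over the five test conditions, of the relative reduction in false rejection rate with respect to the ISM T60 baseline. That definition is an assumption, since the slide does not spell it out. Recomputing from the rounded table entries reproduces the reported values to within about 0.2 percentage points, which supports this reading.

```python
# False rejection rates (%) per condition, in table order:
# clean, 5 dB sounds, 5 dB speech, 2 dB sounds, 2 dB speech
frr = {
    "ISM T60":      [0.92, 9.42, 16.0, 16.8, 30.4],
    "ISM MAT":      [0.58, 7.14, 13.1, 14.6, 27.1],
    "HYB MAT":      [0.53, 7.25, 14.7, 14.2, 29.9],
    "HYB FREQ":     [0.46, 6.04, 12.5, 12.3, 26.0],
    "HYB FREQ AIR": [0.42, 5.42, 12.5, 11.2, 26.6],
}


def avg_rel_improvement(baseline, values):
    """Mean relative reduction (in %) of `values` with respect to `baseline`."""
    gains = [(b - v) / b for b, v in zip(baseline, values)]
    return 100.0 * sum(gains) / len(gains)


improvement = {
    name: avg_rel_improvement(frr["ISM T60"], values)
    for name, values in frr.items()
    if name != "ISM T60"
}
for name, value in improvement.items():
    print(f"{name:13s} {value:5.1f}%")
```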

Blind Source Separation

Background
[Figure: frequency-domain blind source separation; labeled case: Underdet. (M < K)]
Advantages of BSS
• No prior information required, only signals!
• Reliable enhancement via separation

BSS Algorithms in pyroomacoustics

Algorithm                  Source model
AuxIVA [1] / OverIVA [2]   spherical
SparseAuxIVA [3]           spherical
ILRMA [4]                  low-rank
FastMNMF [5]               low-rank

Each algorithm is also classified by the cases it handles: Under., Det., Over.

[1] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," WASPAA, 2011.
[2] R. Scheibler and N. Ono, "Independent vector analysis with more microphones than sources," WASPAA, 2019.
[3] J. Janský et al., "A computationally cheaper method for blind speech separation based on AuxIVA and incomplete demixing transform," Proc. IWAENC, 2016.
[4] D. Kitamura et al., "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Trans. ASLP, 2016.
[5] K. Sekiguchi et al., "Fast multichannel source separation based on jointly diagonalizable spatial covariance matrices," EUSIPCO, 2019.

Example in Pyroomacoustics

    import pyroomacoustics as pra
    from pyroomacoustics.transform import stft
    from scipy.io import wavfile

    # audio.shape == (n_samples, n_channels)
    fs, audio = wavfile.read("path/to/multichannel_audio.wav")

    # STFT parameters
    nfft = 4096  # 256 ms frame @ 16 kHz
    hop = nfft // 4  # 64 ms shifts
    win_a = pra.hamming(nfft)  # analysis window function
    win_s = stft.compute_synthesis_window(win_a, hop)  # matching synthesis window

    # X.shape == (n_frames, n_freq, n_channels)
    X = stft.analysis(audio, nfft, hop, win=win_a)

    # Separation, n_iter ~ 10 times n_channels
    Y = pra.bss.auxiva(X, n_iter=30)

    audio_output = stft.synthesis(Y, nfft, hop, win=win_s)
    wavfile.write("path/to/output/file.wav", fs, audio_output)

Example of Separated Outputs

            Time     Source 1   Source 2   Source 3
                     SIR (dB)   SIR (dB)   SIR (dB)
Clean       -        ∞          ∞          ∞
Mix         -        -2.8       -2.89      -2.75
AuxIVA      6.33 s   10.13      15.95      11.56
ILRMA       8.84 s   10.48      16.08      12.03
FastMNMF    35.9 s   11.38      17.12      10.60

Conclusion
pyroomacoustics
• Simulation of room acoustics
• Reference implementations of multichannel processing algorithms
• Data augmentation effective for ASR/KWS systems
• Rapid prototyping and faster experiment cycle
What's next?
• Release next_gen_simulator (ray tracing, air absorption)
• Desired: directional microphones and sources
• Help is very welcome!