Slide 1

Variational Autoencoders for Replay Spoof Detection in Automatic Speaker Verification

Bhusan Chettri 1,2, Tomi Kinnunen 2, Emmanouil Benetos 1
1 School of EECS, Queen Mary University of London, United Kingdom
2 School of Computing, University of Eastern Finland, Joensuu, Finland

November 20, 2019

Slide 2

Outline
- Introduction/background
- Motivation
- Methodology
- Experimental setup
- Results
- Conclusion

Slide 3

Automatic speaker verification (ASV)
Is the speaker who they claim to be?
Application: user authentication (e.g. banks, call centres, smartphones)

Slide 4

ASV spoofing and countermeasures
Spoofing: attempting to gain unauthorised access to the biometric system by posing as a registered user.
Types of attack:
1. Text-to-speech
2. Voice conversion
3. Impersonation/mimicry
4. Replay
Countermeasures guard ASV systems against spoofing attacks. They consist of:
- Frontend: extracts discriminative features
- Backend: classification and decision making

Slide 5

Why spoofing countermeasures? Taken from “http://www.asvspoof.org/slides ASVspoof2017 Interspeech.pdf” with permission.

Slide 6

Towards securing ASV systems
We focus on the replay spoofing attack: simple to perform, yet difficult to detect reliably.

Slide 7

Genuine/bonafide vs replayed speech

Slide 8

Overview of this work Figure 1: High level overview. An automatic spoofing detection pipeline using a generative model backend.

Slide 9

Motivation and objectives
Gaussian mixture models (GMMs):
- popular backend classifier
- an unsupervised generative model
Why VAEs for spoofing detection?
- widely used in other domains (e.g. computer vision)
- an unsupervised deep generative model
- ability to generate data
- analyse/manipulate the latent space - interpretability!
Main objectives: a feasibility study of VAEs as a backend classifier
- two-class setting, as with GMMs (this paper)
- one-class VAE: model the true/bonafide data distribution!

Slide 10

Methodology
We study three VAE variants as backends:
- Vanilla VAE
- Conditional VAE (C-VAE)
- C-VAE with an auxiliary classifier

Slide 11

Variational Autoencoders (VAE) Figure 2: Naive VAE. Separate bonafide and spoof VAE models are trained using the respective-class training audio files.

Slide 12

VAE training and testing
The VAE is trained by minimising a regularised reconstruction loss (equivalently, maximising the evidence lower bound). Let X = \{x_n\}_{n=1}^{N} denote the training set, with x_n \in \mathbb{R}^D. The loss for the entire training set,

L(\theta, \phi) = \sum_{n=1}^{N} \ell_n(\theta, \phi),   (1)

decomposes into a sum of per-example losses. The loss of the n-th training example is a regularised reconstruction loss:

\ell_n(\theta, \phi) = \underbrace{-\mathbb{E}_{z \sim q_\phi(z|x_n)}\big[\log p_\theta(x_n|z)\big]}_{\text{reconstruction error}} + \underbrace{\mathrm{KL}\big(q_\phi(z|x_n) \,\|\, p(z)\big)}_{\text{regulariser}},   (2)

where \phi and \theta denote the encoder and decoder network parameters, respectively.
Testing: the same loss, Eq. (2), is used for scoring.
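As a concrete reading of Eq. (2), here is a minimal numpy sketch of the per-example loss, assuming a diagonal-Gaussian encoder and a Gaussian decoder (so the reconstruction term reduces to a squared error up to an additive constant, and the KL term has a closed form). All names are illustrative, not the implementation used in this work.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Per-example VAE loss of Eq. (2): reconstruction error plus
    KL(q_phi(z|x) || N(0, I)), assuming a diagonal-Gaussian encoder
    with mean `mu` and log-variance `log_var`."""
    # Gaussian decoder: reconstruction error is a squared error
    # (up to a constant independent of the parameters)
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

# Toy check: if q_phi(z|x) equals the prior and the reconstruction is
# perfect, both terms vanish
x = np.ones(8)
print(vae_loss(x, x, np.zeros(4), np.zeros(4)))  # 0.0
```

Note that the KL term is zero exactly when the encoder posterior matches the standard-normal prior, which is what makes it act as a regulariser on the latent space.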

Slide 13

Conditional VAE (C-VAE) Figure 3: C-VAE. A single model is trained on the whole training dataset but with class labels.

Slide 14

Auxiliary classifier C-VAE (AC-VAE)
Figure 4: AC-VAE. Add an auxiliary classifier on the latent mean or on the decoder output.
Loss function:

\ell_n(\theta, \phi, \psi) = \alpha \cdot \ell_n(\theta, \phi) + \beta \cdot \ell_n(\psi),   (3)

where \psi denotes the auxiliary classifier's parameters and \alpha, \beta weight the two terms.
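A minimal sketch of Eq. (3), assuming for illustration a linear binary classifier on the latent mean (the "classifier on the latent space" variant); the classifier form and all names here are assumptions, not the architecture used in this work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def aux_cls_loss(mu, w, b, label):
    """Binary cross-entropy of a linear classifier applied to the
    latent mean `mu` (illustrative stand-in for the auxiliary
    classifier's loss, ell_n(psi))."""
    p = sigmoid(w @ mu + b)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def ac_vae_loss(vae_loss_n, cls_loss_n, alpha=1.0, beta=1.0):
    """Eq. (3): weighted sum of the C-VAE loss and the auxiliary
    classifier's loss for the n-th example."""
    return alpha * vae_loss_n + beta * cls_loss_n
```

The weights alpha and beta trade off reconstruction quality against how discriminative the latent space becomes; the slides note that tuning them well is non-trivial.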

Slide 15

Experimental setup
1. Replay spoofing datasets: ASVspoof 2017 v2.0; ASVspoof 2019 PA
2. Input representation (100 × D): constant-Q cepstral coefficients (CQCC); log power spectrogram
3. Architecture: deep CNN encoder and decoder with convolutional and deconvolutional layers; no pooling layers, but stride > 1
4. Scoring/testing: log-likelihood difference between the bonafide and spoof models; the higher the score, the more likely the utterance is bonafide
5. Evaluation metrics: equal error rate (EER); tandem detection cost function (t-DCF)
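The scoring rule and the EER metric above can be sketched as follows. This is a simplified illustration (a naive threshold sweep over the pooled scores), not the official ASVspoof evaluation code, and all names are assumptions.

```python
import numpy as np

def llr_score(loss_bonafide, loss_spoof):
    """Score an utterance by the log-likelihood difference between the
    bonafide and spoof models. Each model's loss approximates its
    negative log-likelihood, so loss_spoof - loss_bonafide is larger
    when the bonafide model explains the utterance better."""
    return loss_spoof - loss_bonafide

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER: the operating point where the false-rejection rate on
    bonafide trials equals the false-acceptance rate on spoof trials.
    Found here by a naive sweep over all candidate thresholds."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, best_eer = 1.0, 0.0
    for t in thresholds:
        frr = np.mean(bonafide_scores < t)   # bonafide wrongly rejected
        far = np.mean(spoof_scores >= t)     # spoof wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer
```

With perfectly separated score distributions the EER is 0; with fully overlapping distributions it approaches 0.5 (chance level), which is why the near-45% EERs in Table 2 on ASVspoof 2019 PA indicate almost no discrimination.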

Slide 16

Replay spoofing corpora
Table 1: Database statistics. Spkr: speaker; Bon: bonafide/genuine; Spf: spoof/replay. Each of the three subsets has non-overlapping speakers. The ASVspoof 2017 dataset has male speakers only, while ASVspoof 2019 has both male and female speakers.

                 ASVspoof 2017            ASVspoof 2019 PA
Subset   # Spkr   # Bon   # Spf    # Spkr   # Bon    # Spf
Train        10    1507    1507        20    5400    48600
Dev           8     760     950        20    5400    24300
Eval         24    1298   12008        67   18090   116640
Total        42    3565   14465       107   28890   189540

Slide 17

Quantitative results
Table 2: Performance of the GMM and the VAE variants. AC-VAE1: auxiliary classifier on top of the latent space; AC-VAE2: auxiliary classifier at the output of the decoder. Lower is better.

                    ASVspoof 2017                    ASVspoof 2019 PA
              Dev             Eval              Dev              Eval
Model      EER    t-DCF    EER    t-DCF     EER    t-DCF     EER    t-DCF
GMM       19.07  0.4365   22.60  0.6211    43.77  0.9973    45.48  0.9988
VAE       29.20  0.7532   32.37  0.8079    45.24  0.9855    45.53  0.9978
C-VAE     18.10  0.4635   28.10  0.7020    34.06  0.8129    36.66  0.9104
AC-VAE1   21.80  0.4914   29.30  0.7365    34.73  0.8516    36.42  0.9036
AC-VAE2   17.78  0.4469   29.73  0.7368    34.87  0.8430    36.42  0.8963

Slide 18

Qualitative results Figure 5: Left: C-VAE with genuine class conditioning. Right: spoof-class conditioning. Top: bonafide example. Bottom: spoof example.

Slide 19

Figure 6: t-SNE visualisation. Top left: 10 speaker clusters. Top right: male and female clusters. Bottom left and right: distribution of bonafide and 9 attack conditions for a male and a female speaker.

Slide 20

t-SNE - different phrases in ASVspoof 2017 Figure 7: Visualisation of the latent space for 10 different sentences in the ASVspoof 2017 training set by C-VAE.

Slide 21

Conclusions and future work
1. Challenges: getting a reasonable reconstruction; making the latent space z retain discriminative information
2. The vanilla VAE approach did not work well: the bonafide and spoof VAE models seem to focus on retaining information relevant for reconstruction
3. C-VAE models show encouraging results
4. Using an auxiliary classifier with the C-VAE did not help much: parameters not optimised well; room for exploration and improvement
5. Future work: frame-level C-VAE
6. Future work: one-class or semi-supervised C-VAE approach

Slide 22

References
[1] J. Altosaar, "What is a variational autoencoder?" https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
[2] C. Doersch, "Tutorial on Variational Autoencoders," https://arxiv.org/pdf/1606.05908.pdf

Slide 23

Questions