
Generative models: Variational Autoencoder vs Gaussian Mixture Models for voice anti-spoofing


In this study, Bhusan Chettri explores the feasibility of generative models such as the Variational Autoencoder (VAE) for voice anti-spoofing and compares their performance with Gaussian Mixture Models (GMMs), another family of generative models.

Bhusan Chettri

February 17, 2023

Transcript

  1. Variational Autoencoders for Replay Spoof Detection in Automatic Speaker Verification

    Bhusan Chettri¹,², Tomi Kinnunen², Emmanouil Benetos¹
    ¹School of EECS, Queen Mary University of London, United Kingdom
    ²School of Computing, University of Eastern Finland, Joensuu, Finland
    November 20, 2019
  2. Automatic speaker verification (ASV)

    Is the speaker who he/she claims to be? Application: user authentication (e.g. banks, call centres, smartphones).
  3. ASV spoofing and countermeasures

    Spoofing: attempting to gain unauthorised access to the biometric system of a registered user.
    Types of attacks:
      1. Text-to-Speech
      2. Voice conversion
      3. Impersonation/mimicry
      4. Replay
    Countermeasures guard ASV systems from spoofing attacks and consist of:
      - Frontend: extracts discriminative features
      - Backend: classification and decision making
  4. Towards securing ASV systems

    We focus on the replay spoofing attack: simple to perform, yet difficult to detect reliably.
  5. Overview of this work

    Figure 1: High-level overview of an automatic spoofing detection pipeline using a generative model backend.
  6. Motivation and objectives

    Gaussian mixture models (GMMs):
      - popular backend classifier
      - an unsupervised generative model
    Why VAEs for spoofing detection?
      - widely used in other domains (e.g. computer vision)
      - an unsupervised deep generative model
      - ability to generate data
      - analyse/manipulate the latent space: interpretability!
    Main objectives:
      - feasibility study of using VAEs as a backend classifier in a 2-class setting, as with GMMs (this paper)
      - one-class VAE: model the true/bonafide data distribution!
  7. Methodology

    We study different variants of VAEs as a backend:
      - vanilla VAE
      - conditional VAE (C-VAE)
      - C-VAE with an auxiliary classifier
  8. Variational Autoencoders (VAE)

    Figure 2: Naive VAE. Separate bonafide and spoof VAE models are trained using the training audio files of the respective class.
  9. VAE training and testing

    The VAE is trained by maximizing a regularized log-likelihood function. Let $X = \{x_n\}_{n=1}^{N}$ denote the training set, with $x_n \in \mathbb{R}^D$. The training loss for the entire training set,

      $\mathcal{L}(\theta, \phi) = \sum_{n=1}^{N} \ell_n(\theta, \phi)$,   (1)

    decomposes into a sum of data-point-specific losses. The loss of the nth training example is a regularized reconstruction loss:

      $\ell_n(\theta, \phi) = \underbrace{-\mathbb{E}_{z \sim q_\phi(z \mid x_n)}\big[\log p_\theta(x_n \mid z)\big]}_{\text{reconstruction error}} + \underbrace{\mathrm{KL}\big(q_\phi(z \mid x_n) \,\|\, p(z)\big)}_{\text{regularizer}}$,   (2)

    where $\phi$ and $\theta$ denote the encoder and decoder network parameters. Testing: the same loss function, Eq. (2), is used during scoring. A minimal training-loss sketch follows below.
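A minimal PyTorch-style sketch of Eq. (2) for one batch, assuming a Gaussian encoder that outputs (mu, logvar) and a fixed-variance Gaussian likelihood so the reconstruction term reduces to a squared error; the function and its arguments are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, decoder, mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    # giving a one-sample Monte Carlo estimate of the expectation in Eq. (2).
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)

    # Reconstruction error: -log p_theta(x | z) for a fixed-variance
    # Gaussian likelihood reduces to a squared error (up to a constant).
    recon = F.mse_loss(decoder(z), x, reduction="sum")

    # KL( q_phi(z | x) || N(0, I) ) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl  # Eq. (2), summed over the batch
```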
  10. Conditional VAE (C-VAE)

    Figure 3: C-VAE. A single model is trained on the whole training dataset, but with class labels (a conditioning sketch follows below).
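One common way to realise the class conditioning in Figure 3 is to append a one-hot label to both the encoder input and the latent code before decoding. The concatenation scheme below is an assumption for illustration; the slide does not specify the exact mechanism.

```python
import torch

def cvae_forward(encoder, decoder, x, y):
    # x: flattened input features, shape (batch, D)
    # y: one-hot class label, shape (batch, 2) for {bonafide, spoof}
    mu, logvar = encoder(torch.cat([x, y], dim=-1))   # q_phi(z | x, y)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)              # reparameterized sample
    x_hat = decoder(torch.cat([z, y], dim=-1))        # p_theta(x | z, y)
    return x_hat, mu, logvar
```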
  11. Auxiliary classifier C-VAE (AC-VAE)

    Figure 4: AC-VAE. Add an auxiliary classifier on the latent mean or on the decoder output. Loss function:

      $\ell_n(\theta, \phi, \psi) = \alpha \cdot \ell_n(\theta, \phi) + \beta \cdot \ell_n(\psi)$,   (3)

    where $\psi$ denotes the auxiliary classifier's parameters (a sketch follows below).
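A sketch of Eq. (3), reusing vae_loss from the slide-9 sketch and assuming a hypothetical classifier head on the latent mean with cross-entropy as the auxiliary term; the weights alpha and beta are hyperparameters.

```python
import torch.nn.functional as F

def ac_vae_loss(x, y, decoder, classifier, mu, logvar, alpha=1.0, beta=1.0):
    # Eq. (2): regularized reconstruction loss (vae_loss from the sketch above).
    l_vae = vae_loss(x, decoder, mu, logvar)
    # Auxiliary term l_n(psi): cross-entropy of a classifier on the latent
    # mean (AC-VAE1); for AC-VAE2 it would read the decoder output instead.
    # y: integer class indices, shape (batch,).
    l_cls = F.cross_entropy(classifier(mu), y)
    return alpha * l_vae + beta * l_cls  # Eq. (3)
```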
  12. Experimental setup

    1. Replay spoofing datasets: ASVspoof 2017 v2.0 and ASVspoof 2019 PA.
    2. Input representation (100 × D): constant Q cepstral coefficients (CQCC) or log power spectrogram.
    3. Architecture: deep CNN with convolutional and deconvolutional layers for the encoder and decoder networks; no pooling layers, but stride > 1.
    4. Scoring/testing: log-likelihood difference between the bonafide and spoof models; the higher the score, the higher the probability of being bonafide (a scoring sketch follows below).
    5. Evaluation metrics: equal error rate (EER) and tandem detection cost function (t-DCF).
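A sketch of steps 4 and 5, assuming two trained VAEs that expose the Eq. (2) loss (the .loss attribute is hypothetical) and using a standard ROC-based EER computation.

```python
import numpy as np
from sklearn.metrics import roc_curve

def llr_score(x, bonafide_vae, spoof_vae):
    # Log-likelihood difference: higher score -> more likely bonafide.
    # The Eq. (2) loss is a negative (regularized) log-likelihood, so negate it.
    return -bonafide_vae.loss(x) + spoof_vae.loss(x)

def equal_error_rate(labels, scores):
    # EER: the operating point where the false-acceptance rate equals
    # the miss (false-rejection) rate. labels: 1 = bonafide, 0 = spoof.
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return 0.5 * (fpr[i] + fnr[i])
```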
  13. Replay spoofing corpora

    Table 1: Database statistics. Spkr: speakers, Bon: bonafide/genuine, Spf: spoof/replay. Each of the three subsets has non-overlapping speakers. The ASVspoof 2017 dataset has male speakers only, while ASVspoof 2019 has both male and female speakers.

                    ASVspoof 2017            ASVspoof 2019 PA
    Subset   # Spkr   # Bon   # Spf    # Spkr   # Bon    # Spf
    Train        10    1507    1507        20    5400    48600
    Dev           8     760     950        20    5400    24300
    Eval         24    1298   12008        67   18090   116640
    Total        42    3565   14465       107   28890   189540
  14. Quantitative results

    Table 2: Performance of the GMM and the different VAE models. AC-VAE1: auxiliary classifier on top of the latent space. AC-VAE2: auxiliary classifier at the output of the decoder. Lower is better.

                      ASVspoof 2017                ASVspoof 2019 PA
                   Dev            Eval            Dev             Eval
    Model      EER   t-DCF    EER   t-DCF    EER    t-DCF    EER    t-DCF
    GMM      19.07  0.4365   22.6  0.6211   43.77  0.9973   45.48  0.9988
    VAE       29.2  0.7532  32.37  0.8079   45.24  0.9855   45.53  0.9978
    C-VAE     18.1  0.4635   28.1  0.7020   34.06  0.8129   36.66  0.9104
    AC-VAE1   21.8  0.4914   29.3  0.7365   34.73  0.8516   36.42  0.9036
    AC-VAE2  17.78  0.4469  29.73  0.7368   34.87  0.8430   36.42  0.8963
  15. Qualitative results

    Figure 5: Left: C-VAE with genuine-class conditioning. Right: spoof-class conditioning. Top: bonafide example. Bottom: spoof example.
  16. Figure 6: t-SNE visualisation

    Top left: 10 speaker clusters. Top right: male and female clusters. Bottom left and right: distribution of bonafide and 9 attack conditions for a male and a female speaker.
  17. t-SNE - different phrases in ASVspoof 2017

    Figure 7: Visualisation of the C-VAE latent space for 10 different sentences in the ASVspoof 2017 training set (a t-SNE sketch follows below).
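In outline, the projections behind Figures 6 and 7 can be produced as follows, assuming encoder latent means have been extracted to an array; the file name and perplexity value are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical (n_utterances, latent_dim) array of C-VAE encoder means.
mu = np.load("latent_means.npy")
z2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(mu)
# z2d can then be scattered and coloured by speaker, gender, sentence ID,
# or attack condition to produce plots like Figures 6 and 7.
```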
  18. Conclusions and future work

    1. Challenges:
       - getting a reasonable reconstruction
       - making the latent space 'z' retain discriminative information
    2. The vanilla VAE approach did not work well: the bonafide and spoof VAE models seem to focus on retaining information relevant for reconstruction.
    3. C-VAE models show encouraging results.
    4. Use of an auxiliary classifier with the C-VAE did not help much: parameters not optimised well; room for exploration and improvement.
    5. Future work: frame-level C-VAE.
    6. Future work: a one-class or semi-supervised C-VAE approach.