Research introduction of anomalous sound detection for real-world scenarios

Takuya Fujimura (Nagoya University)
Technical seminar on anomalous sound detection@Doshisha University 2024/10/28

1/49 Self-introduction
Takuya Fujimura ◼1st year Ph.D. student@Nagoya University My research topics ◼Anomalous sound detection ◼(Unsupervised) speech enhancement Our ASD team ◼Prof. Tomoki Toda ◼Mr. Ibuki Kuroyanagi (2nd year Ph.D. student) ◼Me Co-researcher ◼Assoc. Prof. Keisuke Imoto

2/49 Research introduction of anomalous sound detection for real-world scenarios
I'll be talking mainly about my recent work, which has focused on practical challenges.
T. Fujimura, K. Imoto, T. Toda, "Discriminative neighborhood smoothing for generative anomalous sound detection," Proc. EUSIPCO, Aug. 2024. [arxiv: https://arxiv.org/abs/2403.11508]
T. Fujimura, I. Kuroyanagi, T. Toda, "Improvements of Discriminative Feature Space Training for Anomalous Sound Detection in Unlabeled Conditions," arXiv e-prints: 2409.09332, 2024. [arxiv: https://arxiv.org/abs/2409.09332]

3/49 Outline Introduction of ASD task ◼Problem settings ◼Two basic
approaches: Generative and Discriminative Practical challenge 1: Instability ◼“Discriminative neighborhood smoothing for generative anomalous sound detection” Practical challenge 2: Data collection (annotation) costs ◼“Improvements of Discriminative Feature Space Training for Anomalous Sound Detection in Unlabeled Conditions” Summary

4/49 Anomalous Sound Detection (ASD)
Detecting anomalous behavior from machine sound ◼Calculate an anomaly score from machine sound ◼Detect anomalies based on the anomaly score Problem setting ◼Difficult to collect anomalous sound samples → Develop ASD systems using only normal sound data Basic approaches ◼Generative and Discriminative [Figure: ASD system → anomaly score → thresholding → normal or anomalous]
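The thresholding step on this slide can be sketched as follows. This is only an illustration: the slides do not say how the threshold is chosen, so the rule used here (mean plus three standard deviations of the scores of normal training data), the function names, and the example scores are all assumptions.

```python
from statistics import mean, stdev

def fit_threshold(normal_scores, n_sigma=3.0):
    """Pick a threshold from anomaly scores of normal training data only.

    Assumption: mean + n_sigma * std. The slides do not specify a rule.
    """
    return mean(normal_scores) + n_sigma * stdev(normal_scores)

def detect(score, threshold):
    """Return 'Anomalous' if the score exceeds the threshold, else 'Normal'."""
    return "Anomalous" if score > threshold else "Normal"

# Made-up anomaly scores of normal training sounds.
normal_scores = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
threshold = fit_threshold(normal_scores)
print(detect(1.02, threshold), detect(2.5, threshold))
```

Only normal data is used to set the threshold, which matches the problem setting in which anomalous samples are unavailable during development.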

5/49 Generative Approach
◼Directly construct a generative model for normal sounds in the audio feature domain ◼Anomaly score: Degree of deviation from the generative model Example: Autoencoder (AE) ◼AE is trained to reconstruct audio features of normal sounds ◼Anomaly score: Reconstruction error of the observed sound ◼Anomalies are not accurately reconstructed because they are not in the training data [Figure: the AE is trained on normal sounds; its reconstruction error on an observation is the anomaly score]
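The reconstruction-error score can be illustrated with a small stand-in. The sketch below does not train a neural AE; it swaps in a linear autoencoder (PCA via SVD), which is an assumption made to keep the example short. The synthetic 8-dimensional features and all names are hypothetical.

```python
import numpy as np

def fit_linear_ae(normal_feats, n_latent):
    """Fit a linear 'autoencoder' (PCA) on features of normal sounds only."""
    mean = normal_feats.mean(axis=0)
    # Rows of vt are principal directions; the top n_latent form the bottleneck.
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:n_latent]

def anomaly_score(feat, mean, components):
    """Mean squared reconstruction error of one feature vector."""
    latent = (feat - mean) @ components.T   # encode
    recon = latent @ components + mean      # decode
    return float(np.mean((feat - recon) ** 2))

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 8))  # synthetic normal sounds lie near a 2-D subspace
normal = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 8))
mean, components = fit_linear_ae(normal, n_latent=2)

normal_obs = rng.normal(size=2) @ basis
anomalous_obs = normal_obs + 3.0 * rng.normal(size=8)  # leaves the subspace
print(anomaly_score(normal_obs, mean, components),
      anomaly_score(anomalous_obs, mean, components))
```

The anomalous observation has a component the model never saw during training, so it is reconstructed poorly and receives the larger score.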

6/49 Discriminative Approach
◼Construct a discriminative model to classify differences in normal sounds using annotated labels ◼Anomaly score: Distance between the observation and the training data in the discriminative feature space (i.e., decrease in the posterior probability of the correct class) [Figure: normal sounds and their labels (e.g., machine types and operation params.) train a feature extractor; example classes: Valve (pat 01), Valve (pat 02), Slide rail (vel 300, ac 0.30)]

◼The labels (e.g., machine types and operation params.) are those provided in DCASE Challenge Task2

◼Anomalies are not accurately classified because they are not in the training data
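The distance-based score of the discriminative approach can be sketched as below. The feature extractor itself is omitted; the 2-D embeddings are made up, and using the mean Euclidean distance to the k nearest training embeddings is an assumption (the slides only say "distance between observation and training data").

```python
import math

def discriminative_anomaly_score(obs_embedding, train_embeddings, k=1):
    """Mean Euclidean distance from the observation to its k nearest training embeddings."""
    dists = sorted(math.dist(obs_embedding, e) for e in train_embeddings)
    return sum(dists[:k]) / k

# Made-up 2-D embeddings of normal training sounds from two classes.
train = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
near = discriminative_anomaly_score((0.05, 0.0), train)  # close to one class
far = discriminative_anomaly_score((2.5, 2.5), train)    # far from every class
print(near, far)
```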

11/49 Basic approaches and research direction
Problem setting during training: Generative = Modeling a probability density function (Harder); Discriminative = Modeling a posterior probability function (Easier)
Performance: Generative = Insufficient → Improve to a sufficient level; Discriminative = High → Further improve
Stability: Generative = Relatively stable; Discriminative = Unstable → Improve stability
Use of labels: Useful but the annotation is costly → Reduce the annotation cost

13/49 Practical challenge 1: Instability
Performance examples of Gen. and Dis. approaches
Machine   A    B    C    D    Basic performance   Stability
Gen.      60   65   55   58   ✗                   ✓
Dis.      80   85   30   86   ✓                   ✗
In practical applications, stability under various conditions is important

14/49 Practical challenge 1: Instability
Performance examples of Gen. and Dis. approaches
Machine   A    B    C    D    Basic performance   Stability
Gen.      60   65   55   58   ✗                   ✓
Dis.      80   85   30   86   ✓                   ✗
Goal      68   70   72   70   ✓                   ✓
Goal: To achieve stable and (moderately) good performance
Specific goals
1. To outperform Gen.
2. To avoid critical performance degradation seen in Dis.
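One way to read the example table is to summarize each row by its worst machine and by a harmonic mean over machines. The numbers below are the illustrative values from the slide; summarizing them with a harmonic mean across machines is an assumption made for this sketch.

```python
from statistics import harmonic_mean

# Illustrative per-machine scores (machines A to D) taken from the slide.
scores = {
    "Gen.": [60, 65, 55, 58],
    "Dis.": [80, 85, 30, 86],
    "Goal": [68, 70, 72, 70],
}
for name, per_machine in scores.items():
    print(name,
          "worst:", min(per_machine),
          "arithmetic mean:", sum(per_machine) / len(per_machine),
          "harmonic mean:", round(harmonic_mean(per_machine), 2))
```

The arithmetic mean of Dis. (70.25) is well above that of Gen. (59.5), but the single collapse on machine C pulls its harmonic mean below Gen., which is the kind of instability this slide is concerned with.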

15/49 Analysis of critical performance degradation
Ideal discriminative feature space: The distance from the training data is the correct anomaly score. Case of critical performance degradation: ☹ Normal sounds are also misclassified (due to the high difficulty of the classification task), so the distance from the training data does not work as the anomaly score.

16/49 Proposed method
Specific goals 1. To outperform Gen. 2. To avoid critical performance degradation seen in Dis. Two important observations 1. Gen. is relatively stable, but its performance is insufficient 2. Even in a discriminative feature space that leads to critical performance degradation, normal and anomalous samples still tend to be distinguished (i.e., although the distance from the training data does not work, the space still provides useful information)

17/49 Discriminative neighborhood smoothing of generative anomaly scores

18/49 Discriminative neighborhood smoothing of generative anomaly scores
◼ Performance of Gen. is not high ◼ We assume that normal and anomalous sounds are still distinguished in the discriminative feature space

19/49 Discriminative neighborhood smoothing of generative anomaly scores

20/49 Discriminative neighborhood smoothing of generative anomaly scores
Specific goals 1. To outperform Gen. ← Improve the performance of the original Gen. in an ensemble manner utilizing the discriminative space (Note: this method utilizes test data) 2. To avoid critical performance degradation seen in Dis. ← Remove the risk by not measuring the distance from the training data in the discriminative feature space
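The smoothing idea on this slide can be sketched as follows. This is a simplified reading, not the exact formulation of the paper: each test sample's generative score is replaced by the plain mean of the generative scores of its K nearest test samples (itself included) in the discriminative feature space. The Euclidean distance, the unweighted mean, the function name, and the toy data are all assumptions.

```python
import math

def neighborhood_smoothing(gen_scores, dis_embeddings, k):
    """Average each generative score over the k nearest test samples
    (including the sample itself) in the discriminative feature space."""
    smoothed = []
    for query in dis_embeddings:
        order = sorted(range(len(dis_embeddings)),
                       key=lambda j: math.dist(query, dis_embeddings[j]))
        neighbors = order[:k]
        smoothed.append(sum(gen_scores[j] for j in neighbors) / len(neighbors))
    return smoothed

# Made-up test set: three normal samples followed by three anomalous ones.
dis_embeddings = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
gen_scores = [0.2, 0.9, 0.1, 0.8, 0.3, 0.9]  # noisy generative anomaly scores
smoothed = neighborhood_smoothing(gen_scores, dis_embeddings, k=3)
print(smoothed)
```

No distance to the training data is measured here: the discriminative space is used only to decide which test samples share scores, which is why a poorly placed training set does not directly hurt the result.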

21/49 Experimental evaluation Goal ◼To outperform Gen. ◼To avoid critical
performance degradation seen in Dis. Comparison methods ◼Gen.: AE[Koizumi+, 2020] ◼Dis.: SerialOE[Kuroyanagi+, 2022] ◼Proposed method: Combines generative anomaly scores calculated by AE and the discriminative feature extractor of SerialOE Setups ◼Dataset: DCASE2021 Task2 ◼Metric: Harmonic mean of AUC and pAUC in the source domain (0 to 100, higher is better)
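The metric can be computed along these lines. The sketch uses scikit-learn's `roc_auc_score`, whose `max_fpr` argument returns a standardized partial AUC. The value `max_fpr=0.1` follows the usual DCASE Task 2 convention and is an assumption here, since the slides do not state it; the function name and toy scores are hypothetical.

```python
from statistics import harmonic_mean
from sklearn.metrics import roc_auc_score

def auc_pauc_hmean(labels, scores, max_fpr=0.1):
    """Harmonic mean of AUC and pAUC on a 0-100 scale.

    labels: 1 for anomalous, 0 for normal. scores: anomaly scores.
    """
    auc = roc_auc_score(labels, scores)
    pauc = roc_auc_score(labels, scores, max_fpr=max_fpr)
    return 100 * harmonic_mean([auc, pauc])

# Made-up scores with perfect separation of normal (0) and anomalous (1).
print(auc_pauc_hmean([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

Because pAUC only looks at the low false-positive-rate region, the harmonic mean rewards systems that stay accurate when few false alarms are tolerated.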

22/49 Evaluation results
Goal ◼To outperform Gen. ◼To avoid critical performance degradation seen in Dis. Settings of K (# of neighbors) for the proposed method ◼K was simply determined by validation ◼The best K was selected from the hyperparameter set

Performance of the proposed method ◼Significantly improves the performance of AE, leading to an absolute improvement of 7% in All-hmean ◼It has the potential to achieve better performance than Dis. when a proper # of neighbors is provided

ToyCar-5 [Figure: discriminative feature space; legend: Training (Normal), Test (Anomalous), Test (Normal)] ☹ Normal sounds are also misclassified → Dis. results in performance degradation ☺ Even if such a feature space is formed, Proposed avoids performance degradation

27/49 Visualization of the smoothing process
Original AE (Gen.): AUC 72.87% Proposed (Oracle): AUC 77.29% [Figure annotations: Modified to higher values; Modified to lower values]

28/49 Practical challenge 1: Instability
Background: Trade-off between performance and stability ◼The performance of Gen. is insufficient ◼Dis. can sometimes cause critical performance degradation Proposed method 1. Improve the performance of Gen. by utilizing the discriminative feature space 2. Remove the risk by not measuring the distance from the training data in the discriminative space Results ◼Significantly improved the performance of the original Gen. ◼Worked robustly even when Dis. faced the critical performance degradation problem

30/49 Practical challenge 2: Annotation costs
Effectiveness of labels in the discriminative approach
Information of labels   Effectiveness
Operation param.        Capture differences in machine sounds → ☺ Detect anomalies based on differences in machine sounds
Detailed param.         Capture more detailed differences → ☺ Detect more subtle anomalies
Noise type              Capture differences in noise → ☹ Detect anomalies based on differences in noise
Annotating operation params is costly
Annotation example: 10:00 speed 30, 10:05 speed 20, 10:10 speed 25, …

31/49 Practical challenge 2: Annotation costs
Goal: To improve performance without relying on annotated labels

32/49 Situation we will consider
DCASE 2024 Task2 Challenge settings ◼Machine types (Slide rail, Valve, …): Available ◼Operation param.: Unavailable (on all or some machines)

33/49 Proposed method: pseudo-labeling
Normal sounds → FE for pseudo-labeling → Feature space → GMM w/ BIC → pseudo-labels → FE for ASD ◼The FE for pseudo-labeling should give a feature space that reflects differences in machine sounds ◼We obtain pseudo-labels by clustering (e.g., Cluster-A, Cluster-B, Cluster-C)
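The "GMM w/ BIC" step can be sketched with scikit-learn as below. This is a minimal illustration under stated assumptions: the candidate range of component counts, the default full covariance, and the synthetic 2-D features are not from the slides, and `pseudo_label` is a hypothetical helper name.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pseudo_label(features, max_components=6, seed=0):
    """Fit GMMs with 1..max_components components, keep the one with the
    lowest BIC, and return its cluster assignments as pseudo-labels."""
    best_bic, best_gmm = None, None
    for n in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=n, random_state=seed).fit(features)
        bic = gmm.bic(features)
        if best_bic is None or bic < best_bic:
            best_bic, best_gmm = bic, gmm
    return best_gmm.predict(features), best_gmm.n_components

# Synthetic stand-in for features of normal sounds: three separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.concatenate(
    [c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
pseudo_labels, n_clusters = pseudo_label(features)
print(n_clusters, np.bincount(pseudo_labels))
```

BIC penalizes extra components, so the number of pseudo-classes does not have to be fixed in advance. The resulting pseudo-labels would then take the place of annotated labels when training the FE for ASD.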

34/49 4 types of FE for pseudo-labeling
Normal sounds → FE for pseudo-labeling → Feature space → GMM w/ BIC → pseudo-labels → FE for ASD
Method name | Training task | Training data
Class | Classification of available labels (supervised) | DCASE: Small-scale machine sound dataset (including target machine sounds)
Triplet | Triplet learning (self-supervised) | DCASE: Small-scale machine sound dataset (including target machine sounds)
PANNs | Audio event classification (supervised) | AudioSet: Large-scale audio event dataset (not including target machine sounds)
OpenL3 | Audio-video clip correspondence prediction (self-supervised) | AudioSet: Large-scale audio event dataset (not including target machine sounds)

35/49 Triplet
◼Anchor (original audio) ◼Positive = Anchor + Noise: pulled toward the anchor → Ignore noise → Noise-robust feature space ◼Negative = Resize(Anchor): pushed away from the anchor → Reflect differences in the machine sound
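The triplet construction can be sketched as below. The slides do not define Resize(⋅), so `resize_time` here is a hypothetical version that stretches the time axis by linear interpolation and crops back to the original length. The margin-based loss is the standard triplet loss, and the flattened arrays stand in for embeddings from a feature extractor, which is omitted.

```python
import numpy as np

def resize_time(spec, scale):
    """Hypothetical Resize: stretch the time axis by `scale`, then crop
    to the original number of frames."""
    n_time = spec.shape[1]
    t = np.arange(n_time)
    return np.stack([np.interp(t / scale, t, row) for row in spec])

def make_triplet(anchor, noise, scale=1.5):
    """Return (anchor, positive, negative) as described on the slide."""
    return anchor, anchor + noise, resize_time(anchor, scale)

def triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Pull the positive toward the anchor, push the negative away."""
    d_pos = np.linalg.norm(z_anchor - z_pos)
    d_neg = np.linalg.norm(z_anchor - z_neg)
    return max(0.0, float(d_pos - d_neg + margin))

rng = np.random.default_rng(0)
# Made-up (freq, time) spectrogram-like anchor with a periodic pattern.
anchor = np.sin(np.linspace(0, 20, 64))[None, :] * np.ones((4, 1))
a, p, n = make_triplet(anchor, 0.01 * rng.normal(size=anchor.shape))
print(triplet_loss(a.ravel(), p.ravel(), n.ravel()))
```

Since the positive differs from the anchor only by noise, minimizing this loss encourages features that ignore noise, while the resized negative forces the features to respond to changes in the machine sound itself.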

36/49 Experimental evaluation
Goal ◼To improve performance under unlabeled conditions
Comparison methods ◼N/A: This does not use pseudo-labels ◼Proposed pseudo-labeling: Class, Triplet, PANNs, and OpenL3 ◼GT: This uses ground-truth labels
Dataset
Dataset name | # of machines w/ operation param. | # of machines w/o operation param.
Original DCASE2023 | 14 | 0
Modified DCASE2023 | 0 | 14
Original DCASE2024 | 9 | 7

37/49 Evaluation results
◼Skipping evaluation results of 9 machines w/ operation param. from DCASE 2024 (Results were almost the same) ◼Metric: the harmonic mean of AUC and pAUC over all domains ◼Show the arithmetic mean and standard deviation across 5 trials [Result panels: 7 machines w/o operation param. (DCASE 2023); 7 other machines w/o operation param. (DCASE 2023); 7 machines w/o operation param. (DCASE 2024)]

38/49 Evaluation results ◼Ground truth labels are extremely effective in
most cases ☺ All pseudo-labeling methods improve performance ◼PANNs and OpenL3 basically achieve better performance

◼For example, OpenL3 achieved an absolute improvement of 30%

43/49 Good case of OpenL3 Shaker in 23eval (AUC of
N/A was 44.02) ◼OpenL3 successfully reflected the ground-truth labels ☺ It significantly improved the performance Feature spaces colored by ground truth labels

44/49 Bad case of PANNs and OpenL3
valve in 23dev (AUC of N/A was 74.18) ◼PANNs and OpenL3 did not reflect the ground-truth labels ☹ The generated pseudo-labels degraded performance

45/49 Bad case of PANNs and OpenL3
valve in 23dev (AUC of N/A was 74.18) ◼The clusters also reflected differences in types of noise → unhelpful pseudo-labels [Figure annotations: Contain similar noise; Machine sound]

46/49 Good case of Triplet
valve in 23dev (AUC of N/A was 74.18) ◼Resize(⋅) effectively captured differences in machine sounds ☺ The generated pseudo-labels improved the performance [Figure annotations: Machine sound; Noise; No noise]

47/49 Practical challenge 2: Annotation costs
Background: Annotation is costly Proposed method: Pseudo-labeling Results ◼All types of our pseudo-labeling methods were effective ◼PANNs and OpenL3, which are trained on AudioSet, provided especially useful pseudo-labels, but they sometimes suffered from noise ◼For some machines, Resize(⋅) very effectively captured differences in machine sounds Future work ◼More detailed analysis (e.g., differences between PANNs and OpenL3, further ablation studies) ◼Noise-robust pseudo-labeling method

49/49 Summary
Promising research directions ◼Accumulate and utilize observed data ◼Combine or selectively use Gen. and Dis. ◼Develop noise-robust pseudo-labels ◼There is still room for improvement in domain shift and stability (first-shot) problems
Recap of the two approaches
Performance: Generative = Insufficient; Discriminative = High
Stability: Generative = Relatively stable; Discriminative = Unstable (1st topic)
Use of labels: Useful but the annotation is costly (2nd topic)

Thank you!
