
Intrinsic Self-Supervision for Data Quality Audits


Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning.
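To make the ranking versus scoring formulation concrete, here is a minimal sketch assuming per-sample issue scores are already available; the score values, sample count, and quantile cut-off are placeholders for illustration, not the paper's actual decision rule. Ranking means a human inspects samples from most to least suspicious, while scoring means deriving an automated cut-off from the score distribution.

```python
# Illustrative sketch only: `scores` is a stand-in for per-sample issue scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1_000)

# Ranking formulation: inspect samples from most to least suspicious.
ranking = np.argsort(-scores)
to_inspect = ranking[:50]                 # e.g. review the 50 most suspicious samples

# Scoring formulation: an automated decision from the score distribution
# (a simple 95% quantile cut-off here, not the rule used in the paper).
threshold = np.quantile(scores, 0.95)
flagged = np.where(scores > threshold)[0]
```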

Fabian Gröger

January 13, 2025

Transcript

  1. Intrinsic Self-Supervision for Data Quality Audits
     Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander A. Navarini, Marc Pouly
  2. Everyone who is doing ML knows …
     … training and evaluation data can be messy.
     … noise during evaluation leads to inconsistent performance estimates.
  3. Everyone who is doing ML knows …
     … training and evaluation data can be messy.
     … noise during evaluation leads to inconsistent performance estimates.
     BUT, manual data cleaning …
     … can be time-consuming, labor-intensive, and error-prone.
     … is the least enjoyable task for many practitioners*.
     * Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Forbes, 2016
  4. Goals of this project
     (1) Reliably detect data quality issues, such as off-topic images, near duplicates, and label errors, in image datasets without introducing significant biases.
     (2) Reduce the time needed for detecting and confirming data quality issues.
     (3) Investigate the influence of data quality issues on training and evaluation.
  5. Our findings
     • Self-supervised learning (SSL) representations can be exploited to find data quality issues.
     • Context-aware SSL representations can capture the dataset context with minimal bias.
     • The combination of SSL representations and distance-based indicators effectively finds quality issues (see the sketches below).
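As a rough illustration of these findings, the sketch below extracts frozen self-supervised embeddings for an image folder. Note the assumptions: the paper's context-aware setting pre-trains DINO on the audited dataset itself, whereas this sketch simply loads the public DINO ViT-S/16 checkpoint from torch.hub, and the dataset path is a placeholder.

```python
# Sketch: frozen DINO ViT-S/16 embeddings for an image folder (placeholder path).
import torch
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
dataset = datasets.ImageFolder("path/to/dataset", transform=preprocess)  # placeholder
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

embeddings, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        embeddings.append(model(images.to(device)).cpu())
        labels.append(targets)
embeddings = torch.nn.functional.normalize(torch.cat(embeddings), dim=1).numpy()
labels = torch.cat(labels).numpy()
```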
  6. SelfClean overview (diagram): noisy data (with examples drawn from ImageNet, CheXpert, and Fitzpatrick17k) is mapped to a self-supervised representation, on which three distance-based indicators operate: pairwise distances flag near duplicates (exact and approximate), agglomerative clustering separates clustered from isolated, i.e. off-topic, samples, and the intra-/extra-class distance ratio surfaces label errors.
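The three indicators in the diagram can be sketched on top of such embeddings as follows. This is not SelfClean's exact scoring: the off-topic ranking is approximated here with a k-nearest-neighbour distance rather than the agglomerative-clustering criterion, the label-error ratio uses only nearest neighbours, and the dense distance matrix scales quadratically with dataset size.

```python
# Sketch of the three distance-based indicators, given `embeddings` (n x d,
# L2-normalised) and integer `labels` (n,) as produced by the sketch above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

dist = squareform(pdist(embeddings, metric="euclidean"))  # dense n x n distances

# Near duplicates: pairs with the smallest pairwise distance come first.
iu, ju = np.triu_indices_from(dist, k=1)
pair_order = np.argsort(dist[iu, ju])
duplicate_pairs = list(zip(iu[pair_order], ju[pair_order]))

# Off-topic samples: isolated samples score high; the paper uses agglomerative
# clustering, here approximated by the distance to the k-th nearest neighbour.
k = 10
offtopic_rank = np.argsort(-np.sort(dist, axis=1)[:, k])

# Label errors: a sample whose nearest other-class neighbour is close relative
# to its nearest same-class neighbour is a label-error candidate.
label_scores = np.zeros(len(labels))
for i, y in enumerate(labels):
    same = labels == y
    same[i] = False  # exclude the sample itself
    if same.any():
        intra = dist[i, same].min()
        extra = dist[i, labels != y].min()
        label_scores[i] = intra / (intra + extra)
label_error_rank = np.argsort(-label_scores)
```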
  7. Results (1)
     • Evaluation on both synthetic and natural contamination showed a significant improvement compared to current solutions.
     Figure: synthetic evaluation results on STL-10, DDI, and VDR, reporting average precision (AP) against the contamination level for off-topic samples (XR, BLUR), near duplicates (AUG, ARTE), and label errors (LBLC, LBL); baselines are HBOS, ECOD, SSIM, pHASH, CLearning, and FastDup with INet and DINO features. Green shows SelfClean’s performance and higher is better.
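The synthetic part of this evaluation can be reproduced in spirit with a few lines: inject a known fraction of issues into an otherwise clean set, score every sample, and compute average precision against the ground truth. The contamination strategies (XR, AUG, LBLC, and so on) are paper-specific; the detector score below is faked purely for illustration.

```python
# Sketch of the synthetic evaluation protocol with a fake detector score.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n, contamination = 1_000, 0.10

is_issue = np.zeros(n, dtype=bool)
is_issue[rng.choice(n, int(contamination * n), replace=False)] = True

# Stand-in for any per-sample issue score, e.g. the indicators sketched earlier.
detector_score = is_issue.astype(float) + rng.normal(scale=0.5, size=n)

print("AP:", average_precision_score(is_issue, detector_score))
```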
  8. Results (2)
     • Applied to multiple image benchmarks, we identify up to 16% of issues and confirm an improvement in evaluation reliability upon cleaning.
     Figure: analysis of ImageNet-1k, showing (a) near duplicates, (b) off-topic samples, and (c) label errors.
  9. Results (3)
     • For a typical dataset, SelfClean can reduce the inspection effort by a factor between 5 and 50.
     Figure: analysis of the inspection effort saved on STL-10, reporting the fraction of effort (FE) against recall for off-topic samples (XR), near duplicates (AUG), and label errors (LBLC); baselines are HBOS, ECOD, SSIM, pHASH, CLearning, and FastDup with INet and DINO features. Green shows SelfClean’s performance and lower is better.
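One plausible reading of the fraction-of-effort (FE) metric behind this factor is sketched below: given a ranking, FE is the share of the dataset a human must inspect before reaching a target recall of the true issues. The exact definition in the paper may differ; the toy data here is synthetic.

```python
# Sketch of a fraction-of-effort style metric on a toy ranked dataset.
import numpy as np

def fraction_of_effort(is_issue, score, target_recall=0.9):
    """Share of the ranked dataset to inspect to reach `target_recall` of issues."""
    order = np.argsort(-score)                 # inspect most suspicious samples first
    hits = np.cumsum(is_issue[order])          # issues found after k inspections
    needed = np.searchsorted(hits, target_recall * is_issue.sum()) + 1
    return needed / len(is_issue)

rng = np.random.default_rng(0)
is_issue = rng.random(2_000) < 0.05            # 5% true issues
score = is_issue.astype(float) + rng.normal(scale=0.3, size=2_000)

# Far below the ~0.9 a random inspection order would need to hit 90% recall.
print(fraction_of_effort(is_issue, score))
```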
  10. Intrinsic Self-Supervision for Data Quality Audits
      Give it a try (QR code)
      Co-authors: Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander Navarini, Marc Pouly