
Intrinsic Self-Supervision for Data Quality Audits


Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning.
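To make the ranking versus scoring formulation concrete, here is a minimal sketch assuming per-sample issue scores are already available; the score values, sample count, and quantile cut-off are placeholders for illustration, not the paper's actual decision rule. Ranking means a human inspects samples from most to least suspicious, while scoring means deriving an automated cut-off from the score distribution.

```python
# Illustrative sketch only: `scores` is a stand-in for per-sample issue scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1_000)

# Ranking formulation: inspect samples from most to least suspicious.
ranking = np.argsort(-scores)
to_inspect = ranking[:50]                 # e.g. review the 50 most suspicious samples

# Scoring formulation: an automated decision from the score distribution
# (a simple 95% quantile cut-off here, not the rule used in the paper).
threshold = np.quantile(scores, 0.95)
flagged = np.where(scores > threshold)[0]
```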

Fabian Gröger

January 13, 2025

Transcript

  1. Intrinsic Self-Supervision for Data Quality Audits
     Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander A. Navarini, Marc Pouly
  2. Everyone who is doing ML knows …
     … training and evaluation data can be messy.
     … noise during evaluation leads to inconsistent performance estimates.
  3. Everyone who is doing ML knows …
     … training and evaluation data can be messy.
     … noise during evaluation leads to inconsistent performance estimates.
     BUT, manual data cleaning …
     … can be time-consuming, labor-intensive, and error-prone.
     … is the least enjoyable task for many practitioners*.
     * Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Forbes, 2016
  4. Goals of this project
     (1) Reliably detect data quality issues, such as off-topic images, near duplicates, and label errors, in image datasets without introducing significant biases.
     (2) Reduce the time needed for detecting and confirming data quality issues.
     (3) Investigate the influence of data quality issues on training and evaluation.
  5. Our findings
     • Self-supervised learning (SSL) representations can be exploited to find data quality issues.
     • Context-aware SSL representations can capture the dataset context with minimal bias.
     • The combination of SSL representations and distance-based indicators effectively finds quality issues (see the sketches below).
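As a rough illustration of these findings, the sketch below extracts frozen self-supervised embeddings for an image folder. Note the assumptions: the paper's context-aware setting pre-trains DINO on the audited dataset itself, whereas this sketch simply loads the public DINO ViT-S/16 checkpoint from torch.hub, and the dataset path is a placeholder.

```python
# Sketch: frozen DINO ViT-S/16 embeddings for an image folder (placeholder path).
import torch
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
dataset = datasets.ImageFolder("path/to/dataset", transform=preprocess)  # placeholder
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)

embeddings, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        embeddings.append(model(images.to(device)).cpu())
        labels.append(targets)
embeddings = torch.nn.functional.normalize(torch.cat(embeddings), dim=1).numpy()
labels = torch.cat(labels).numpy()
```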
  6. SelfClean overview (diagram): noisy data (with examples drawn from ImageNet, CheXpert, and Fitzpatrick17k) is mapped to a self-supervised representation, on which three distance-based indicators operate: pairwise distances flag near duplicates (exact and approximate), agglomerative clustering separates clustered from isolated, i.e. off-topic, samples, and the intra-/extra-class distance ratio surfaces label errors.
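The three indicators in the diagram can be sketched on top of such embeddings as follows. This is not SelfClean's exact scoring: the off-topic ranking is approximated here with a k-nearest-neighbour distance rather than the agglomerative-clustering criterion, the label-error ratio uses only nearest neighbours, and the dense distance matrix scales quadratically with dataset size.

```python
# Sketch of the three distance-based indicators, given `embeddings` (n x d,
# L2-normalised) and integer `labels` (n,) as produced by the sketch above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

dist = squareform(pdist(embeddings, metric="euclidean"))  # dense n x n distances

# Near duplicates: pairs with the smallest pairwise distance come first.
iu, ju = np.triu_indices_from(dist, k=1)
pair_order = np.argsort(dist[iu, ju])
duplicate_pairs = list(zip(iu[pair_order], ju[pair_order]))

# Off-topic samples: isolated samples score high; the paper uses agglomerative
# clustering, here approximated by the distance to the k-th nearest neighbour.
k = 10
offtopic_rank = np.argsort(-np.sort(dist, axis=1)[:, k])

# Label errors: a sample whose nearest other-class neighbour is close relative
# to its nearest same-class neighbour is a label-error candidate.
label_scores = np.zeros(len(labels))
for i, y in enumerate(labels):
    same = labels == y
    same[i] = False  # exclude the sample itself
    if same.any():
        intra = dist[i, same].min()
        extra = dist[i, labels != y].min()
        label_scores[i] = intra / (intra + extra)
label_error_rank = np.argsort(-label_scores)
```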
  7. Results (1)
     • Evaluation on both synthetic and natural contamination showed a significant improvement compared to current solutions.
     Figure: synthetic evaluation results on STL-10, DDI, and VDR, reporting average precision (AP) against the contamination level for off-topic samples (XR, BLUR), near duplicates (AUG, ARTE), and label errors (LBLC, LBL); baselines are HBOS, ECOD, SSIM, pHASH, CLearning, and FastDup with INet and DINO features. Green shows SelfClean’s performance and higher is better.
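The synthetic part of this evaluation can be reproduced in spirit with a few lines: inject a known fraction of issues into an otherwise clean set, score every sample, and compute average precision against the ground truth. The contamination strategies (XR, AUG, LBLC, and so on) are paper-specific; the detector score below is faked purely for illustration.

```python
# Sketch of the synthetic evaluation protocol with a fake detector score.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n, contamination = 1_000, 0.10

is_issue = np.zeros(n, dtype=bool)
is_issue[rng.choice(n, int(contamination * n), replace=False)] = True

# Stand-in for any per-sample issue score, e.g. the indicators sketched earlier.
detector_score = is_issue.astype(float) + rng.normal(scale=0.5, size=n)

print("AP:", average_precision_score(is_issue, detector_score))
```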
  8. Results (2)
     • Applied to multiple image benchmarks, we identify up to 16% of issues and confirm an improvement in evaluation reliability upon cleaning.
     Figure: analysis of ImageNet-1k, showing (a) near duplicates, (b) off-topic samples, and (c) label errors.
  9. Results (3)
     • For a typical dataset, SelfClean can reduce the inspection effort by a factor between 5 and 50.
     Figure: analysis of the inspection effort saved on STL-10, reporting the fraction of effort (FE) against recall for off-topic samples (XR), near duplicates (AUG), and label errors (LBLC); baselines are HBOS, ECOD, SSIM, pHASH, CLearning, and FastDup with INet and DINO features. Green shows SelfClean’s performance and lower is better.
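One plausible reading of the fraction-of-effort (FE) metric behind this factor is sketched below: given a ranking, FE is the share of the dataset a human must inspect before reaching a target recall of the true issues. The exact definition in the paper may differ; the toy data here is synthetic.

```python
# Sketch of a fraction-of-effort style metric on a toy ranked dataset.
import numpy as np

def fraction_of_effort(is_issue, score, target_recall=0.9):
    """Share of the ranked dataset to inspect to reach `target_recall` of issues."""
    order = np.argsort(-score)                 # inspect most suspicious samples first
    hits = np.cumsum(is_issue[order])          # issues found after k inspections
    needed = np.searchsorted(hits, target_recall * is_issue.sum()) + 1
    return needed / len(is_issue)

rng = np.random.default_rng(0)
is_issue = rng.random(2_000) < 0.05            # 5% true issues
score = is_issue.astype(float) + rng.normal(scale=0.3, size=2_000)

# Far below the ~0.9 a random inspection order would need to hit 90% recall.
print(fraction_of_effort(is_issue, score))
```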
  10. Intrinsic Self-Supervision for Data Quality Audits
      Give it a try (QR code)
      Co-authors: Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander Navarini, Marc Pouly