Slide 1

Slide 1 text

Intrinsic Self-Supervision for Data Quality Audits
Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander A. Navarini, Marc Pouly

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Off-topic samples

Slide 4

Slide 4 text

Off-topic samples · Near duplicates

Slide 5

Slide 5 text

Off-topic samples · Near duplicates · Label errors

Slide 6

Slide 6 text

Everyone who is doing ML knows …
… training and evaluation data can be messy.
… noise during evaluation leads to inconsistent performance estimates.

Slide 7

Slide 7 text

Everyone who is doing ML knows …
… training and evaluation data can be messy.
… noise during evaluation leads to inconsistent performance estimates.
BUT, manual data cleaning …
… can be time-consuming, labor-intensive, and error-prone.
… is the least enjoyable task for many practitioners*.
* Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Forbes, 2016

Slide 8

Slide 8 text

Goals of this project
(1) Reliably detect data quality issues, such as off-topic images, near duplicates, and label errors, in image datasets without introducing significant biases.
(2) Reduce the time needed for detecting and confirming data quality issues.
(3) Investigate the influence of data quality issues on training and evaluation.

Slide 9

Slide 9 text

Our findings
• Self-supervised learned (SSL) representations can be exploited to find data quality issues.
• Context-aware SSL representations can capture the dataset context with minimal bias.
• Combination of SSL representations and distance-based indicators effectively finds quality issues.
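The distance-based idea behind these findings can be sketched for near duplicates: rank all sample pairs by the distance between their SSL embeddings, so the closest pairs surface first for review. A minimal NumPy illustration (the helper name and details are hypothetical, not the SelfClean API):

```python
import numpy as np

def near_duplicate_ranking(embeddings):
    """Rank sample pairs by embedding distance: the closest pairs are
    the most likely near duplicates. (Illustrative sketch only.)"""
    # Normalise so Euclidean distance follows cosine-distance ordering.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Full pairwise Euclidean distance matrix.
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    # Keep each unordered pair once (upper triangle, i < j).
    i, j = np.triu_indices(len(emb), k=1)
    order = np.argsort(dists[i, j])  # ascending: smallest distance first
    return list(zip(i[order], j[order], dists[i, j][order]))

# Toy example: samples 0 and 1 are almost identical, sample 2 is distinct.
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
pairs = near_duplicate_ranking(emb)
print(int(pairs[0][0]), int(pairs[0][1]))  # closest pair first: 0 1
```

In practice the embeddings would come from a self-supervised model (e.g. DINO) trained on the dataset itself, which is what makes the representation context-aware.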

Slide 10

Slide 10 text

[Diagram] SelfClean pipeline: noisy data (e.g. ImageNet, CheXpert, Fitzpatrick17k) is encoded with a self-supervised representation, and three distance-based indicators then flag issues: off-topic samples via agglomerative clustering (clustered vs. isolated samples), near duplicates via pairwise distance (exact vs. approximate duplicates), and label errors via the intra-/extra-class distance ratio (example labels in the diagram: kite; Atelectasis: positive; benign epidermal).
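The label-error indicator above scores each sample by an intra-/extra-class distance ratio in embedding space. A minimal sketch of such a ratio (the function name and exact scoring details are illustrative assumptions, not the SelfClean implementation):

```python
import numpy as np

def label_error_scores(embeddings, labels):
    """Score each sample by (distance to nearest same-label neighbour) /
    (distance to nearest other-label neighbour). A large ratio means the
    sample sits closer to another class than to its own, i.e. a likely
    label error. (Illustrative sketch only.)"""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore self-distance
    scores = []
    for k in range(len(emb)):
        same = labels == labels[k]
        same[k] = False
        # Singleton classes get an infinite ratio (maximally suspicious).
        intra = dists[k, same].min() if same.any() else np.inf
        extra = dists[k, ~same].min()
        scores.append(intra / extra)
    return np.array(scores)

# Toy example: two tight clusters; sample 4 carries label 0 but sits in
# the label-1 cluster, so it gets the highest score.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = [0, 0, 1, 1, 0]
scores = label_error_scores(emb, labels)
print(int(scores.argmax()))  # 4
```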

Slide 11

Slide 11 text

Results (1)
• Evaluation on both synthetic and natural contamination showed a significant improvement compared to current solutions.
[Figure] Synthetic evaluation results: average precision (AP) vs. contamination (0–50%) on STL, DDI, and VDR for off-topic samples (XR, BLUR), near duplicates (AUG, ARTE), and label errors (LBLC, LBL), comparing HBOS, ECOD, SSIM, pHASH, CLearning, FastDup, and SelfClean (each with INet and DINO features). Green shows SelfClean's performance and higher is better.

Slide 12

Slide 12 text

Results (2)
• Applied to multiple image benchmarks, we identify up to 16% of samples as issues and confirm an improvement in evaluation reliability upon cleaning.
[Figure] Analysis of ImageNet-1k: (a) near duplicates, (b) off-topic samples, (c) label errors.

Slide 13

Slide 13 text

Results (3)
• For a typical dataset, SelfClean can reduce the inspection effort by a factor between 5 and 50.
[Figure] Analysis of the inspection effort saved: fraction of effort (FE) vs. recall on STL for off-topic samples (XR), near duplicates (AUG), and label errors (LBLC), comparing HBOS, ECOD, SSIM, pHASH, CLearning, FastDup, and SelfClean (each with INet and DINO features). Green shows SelfClean's performance and lower is better.
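The "fraction of effort" idea can be sketched as: follow the ranking an indicator produces and count how much of the dataset a reviewer must inspect before reaching a target recall of true issues. A toy illustration (the function and metric details are assumptions mirroring the figure's axis labels, not the paper's exact definition):

```python
import numpy as np

def fraction_of_effort(scores, is_issue, recall=1.0):
    """Fraction of the dataset to inspect, following the ranking induced
    by `scores` (highest first), to recover `recall` of the true issues.
    Assumes the ranking can actually reach that recall. (Sketch only.)"""
    order = np.argsort(-np.asarray(scores))           # inspect highest scores first
    hits = np.cumsum(np.asarray(is_issue)[order])     # issues found so far
    target = int(np.ceil(recall * np.sum(is_issue)))  # issues needed for this recall
    n_inspected = int(np.argmax(hits >= target)) + 1  # first position reaching target
    return n_inspected / len(scores)

# Toy example: 2 issues among 10 samples; a good ranking puts them on top,
# so full recall requires inspecting only 20% of the data.
scores   = [0.9, 0.8, 0.1, 0.2, 0.05, 0.3, 0.15, 0.25, 0.12, 0.07]
is_issue = [1,   1,   0,   0,   0,    0,   0,    0,    0,    0  ]
print(fraction_of_effort(scores, is_issue))  # 0.2
```

A random ranking would need roughly the whole dataset to reach full recall, which is the gap behind the reported 5–50x effort reduction.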

Slide 14

Slide 14 text

Intrinsic Self-Supervision for Data Quality Audits
Give it a try (QR code)
Co-authors: Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Matthew Groh, Alexander Navarini, Marc Pouly