
Pietro Gori


(Télécom Paris, France)

Title — Contrastive Learning in Medical Imaging - A metric learning approach

Abstract — Contrastive Learning (CL) is a paradigm designed for self-supervised representation learning that has been applied to unsupervised, weakly supervised and supervised problems. The objective of CL is to estimate a parametric mapping function that maps positive samples (semantically similar) close together in the representation space and negative samples (semantically dissimilar) far away from each other. In general, positive samples can be defined in different ways depending on the problem: transformations (i.e., augmentations) of the same image (unsupervised setting), samples belonging to the same class (supervised setting), or samples with similar image attributes (weakly supervised setting). The definition of negative samples varies accordingly.

In this talk, we will show how a metric learning approach to CL allows us to: (1) better formalize recent contrastive losses, such as InfoNCE and SupCon; (2) derive new losses for unsupervised, supervised, and weakly supervised problems, for both classification and regression; and (3) propose new regularization terms for debiasing.

Furthermore, leveraging the proposed metric learning approach and kernel theory, we will describe a novel loss, called decoupled uniformity, that allows the integration of prior knowledge, given either by generative models or by weak attributes, and removes the positive-negative coupling problem present in the InfoNCE loss.

We validate the usefulness of the proposed losses on standard vision datasets and medical imaging data.

Biography — Pietro Gori is Maître de Conférences in Artificial Intelligence and Medical Imaging at Télécom Paris (IPParis), in the IMAGES group. He did his academic training with Inria at the ARAMIS Lab in Paris and then at Neurospin (CEA). Prior to that, he obtained an MSc in Mathematical Modelling and Computation from DTU in Copenhagen and an MSc in Biomedical Engineering from the University of Padova. He participated in the development of the open-source software suite Deformetrica for statistical shape analysis and of the software platform Clinica for clinical neuroimaging studies. His research interests lie primarily in the fields of machine learning, AI, representation learning, medical imaging and computational anatomy. He has 45 publications in international peer-reviewed journals and conferences and has had the pleasure of working with 10 Master's students, 14 PhD students and 2 post-docs.

S³ Seminar

June 30, 2023



Transcript

  1. Contrastive Learning in Medical Imaging
    An ϵ-margin metric learning approach
    Pietro Gori
    Assistant Professor
    Télécom Paris (IPParis)
    Paris, France


  2. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    2/91


  3. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    3/91


  4. Introduction - Computer Vision
    • Deep learning (e.g., CNNs or ViTs) is a lazy and inefficient statistical method that needs
    millions, if not billions, of examples to learn a precise task → data hungry
    4/91


  5. Introduction - Computer Vision
    • Many specific tasks in Computer Vision, such as object detection [1] (e.g., YOLO), image
    classification [2] (e.g., ResNet-50), or semantic segmentation (e.g., U-Net), have reached
    astonishing results in recent years.
    • This has been possible mainly because large (N > 10^6), labeled datasets were easily
    accessible and freely available
    1T.-Y. Lin et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014.
    2J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    5/91


  6. Introduction - Medical Imaging
    • In medical imaging, current research datasets are:
    ▶ small: N < 2k for common pathologies and N < 200 for rare pathologies
    ▶ biased: images are acquired in a precise hospital, following a specific protocol with
    a particular machine (nuisance site effect)
    ▶ multi-modal: many imaging modalities can be available as well as text, clinical,
    biological, genetic data.
    ▶ anonymized, quality checked, accessible, quite homogeneous
    • Clinical datasets are harder to analyze since they are usually not anonymized, not
    quality checked, not freely accessible, highly heterogeneous.
    • In this talk, we will focus on research medical imaging datasets
    6/91


  7. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    7/91


  8. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    8/91


  9. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common.
    8/91


  10. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature
    reuse) increases with [3,4,5,6]:
    ▶ reduced target data size (small N_target)
    ▶ visual similarity between the pre-training and target domains (small FID)
    ▶ models with fewer inductive biases (TL works better for ViTs than CNNs)
    ▶ larger architectures (more parameters)
    3B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    4C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    5B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    6M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    9/91


  11. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature
    reuse) increases with [7,8,9,10]:
    ▶ reduced target data size (small N_target)
    ▶ visual similarity between the pre-training and target domains (small FID)
    ▶ models with fewer inductive biases (TL works better for ViTs than CNNs)
    ▶ larger architectures (more parameters)
    7B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    8C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    9B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    10M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    10/91


  12. Introduction - Transfer Learning
    • Natural [11] and medical [12] images can be visually very different! → Domain gap
    • Furthermore, medical images can be 3D, while ImageNet is 2D.
    • Need for a large, annotated, 3D medical dataset
    11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    11/91


  13. Introduction - Transfer Learning
    • Natural [11] and medical [12] images can be visually very different! → Domain gap
    • Furthermore, medical images can be 3D, while ImageNet is 2D.
    • Need for a large, annotated, 3D medical dataset → PROBLEM!
    11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    11/91


  14. Introduction - Transfer Learning
    • Supervised pre-training is not a valid option in medical imaging. We need another
    kind of pre-training.
    13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications
    (2020).
    14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In:
    NeuroImage (2022).
    12/91


  15. Introduction - Transfer Learning
    • Supervised pre-training is not a valid option in medical imaging. We need another
    kind of pre-training.
    • Recently, big multi-site international datasets of healthy subjects have emerged, such as
    UK Biobank [13] (N > 100k) and OpenBHB [14] (N > 10k)
    13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications
    (2020).
    14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In:
    NeuroImage (2022).
    12/91


  16. Introduction - Transfer Learning
    • How can we employ a healthy (thus unlabeled) dataset for pre-training?
    13/91


  17. Introduction - Transfer Learning
    • How can we employ a healthy (thus unlabeled) dataset for pre-training? →
    Self-supervised pre-training!
    13/91


  18. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  19. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    • Pretext task should only use the visual information and context of the images
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  20. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    • Pretext task should only use the visual information and context of the images
    • Examples of pretext tasks:
    ▶ Context prediction [15]
    ▶ Generative models [16,17]
    ▶ Instance discrimination (Contrastive Learning) [18]
    ▶ Teacher/Student [19]
    ▶ Information Maximization [20]
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  21. Self-supervised Learning - Preliminaries
    • Pre-text tasks should produce image representations that are:
    1. Transferable: we can easily reuse/fine-tune them in different downstream tasks
    (e.g., segmentation, object detection, classification, etc.)
    2. Generalizable: they should not be specific to a single task but work well in several
    different downstream tasks
    3. High-level: representations should characterize the high-level
    semantics/structure and not low-level features (color, texture, etc.)
    4. Invariant: image representations should be invariant to geometric or appearance
    transformations that do not modify the information content of the image (i.e.,
    irrelevant for downstream task)
    5. Semantically coherent: semantically similar images should be close in the
    representation space
    15/91


  22. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    16/91


  23. Contrastive Learning
    • Contrastive learning methods outperform the other pretext tasks21
    21T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    17/91


  24. Contrastive Learning
    • Recently, there has been a plethora of works on it that are closing the
    performance gap with supervised pre-training [22,23,24]
    22J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    23M. Caron et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS. 2020.
    24J. Zhou et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022.
    18/91


  25. Contrastive Learning - A bit of history
    • Goal: given a set of images x_k ∈ X, learn a mapping function f_θ : X → F such that:
    if x_a and x_b are semantically similar → f(x_a) ≈ f(x_b)
    if x_a and x_b are semantically different → f(x_a) ≠ f(x_b)
    • These conditions can be reformulated from a mathematical point of view using either a
    geometric approach, based on a distance d(f(x_a), f(x_b)), or an information-theoretic
    approach, based on a statistical dependence measure, such as the Mutual Information
    I(f(x_a), f(x_b)):
    if x_a and x_b are semantically similar → arg min_f d(f(x_a), f(x_b))   or   arg max_f I(f(x_a), f(x_b))
    if x_a and x_b are semantically different → arg max_f d(f(x_a), f(x_b))   or   arg min_f I(f(x_a), f(x_b))
    19/91


  26. Contrastive Learning - A bit of history
    Geometric approach (Y. LeCun)
    ▶ Pairwise lossa
    ▶ Triplet lossb
    ▶ Tuplet lossc’d’e
    aS. Chopra et al. “Learning a Similarity Metric
    Discriminatively, with Application to Face Verification”.
    In: CVPR. 2005.
    bF. Schroff et al. “FaceNet: A Unified Embedding for
    Face Recognition and Clustering”. In: CVPR. 2015.
    cH. O. Song et al. “Deep Metric Learning via Lifted
    Structured Feature Embedding”. In: CVPR. 2016.
    dK. Sohn. “Improved Deep Metric Learning with
    Multi-class N-pair Loss Objective”. In: NIPS. 2016.
    eB. Yu et al. “Deep Metric Learning With Tuplet
    Margin Loss”. In: ICCV. 2019.
    Information theory approach (G. Hinton)
    ▶ Soft Nearest Neighbora’b
    ▶ Contrastive Predictive Coding (CPC)c
    ▶ Non-Parametric Instance Discriminationd
    ▶ Deep InfoMax (DIM)e
    aR. Salakhutdinov et al. “Learning a Nonlinear Embedding by
    Preserving Class ...”. In: AISTATS. 2007.
    bN. Frosst et al. “Analyzing and Improving Representations with
    the Soft Nearest Neighbor”. In: ICML. 2019.
    cA. v. d. Oord et al. Representation Learning with Contrastive
    Predictive Coding. 2018.
    dZ. Wu et al. “Unsupervised Feature Learning via Non-parametric
    Instance Discrimination”. In: CVPR. 2018.
    eR. D. Hjelm et al. “Learning deep representations by mutual
    information estimation ...”. In: ICLR. 2019.
    20/91


  27. Contrastive Learning - A bit of history
    Geometric approach (Y. LeCun)a
    ▶ Need to define positive (x, x+)
    (semantically similar) and negative
    pairs (x, x−) (semantically different)
    ▶ Need to define similarity measure (or
    distance) that is maximized (or
    minimized)
    ▶ No constraints/hypotheses about
    negative samples
    aS. Chopra et al. “Learning a Similarity Metric
    Discriminatively, with Application to Face Verification”. In:
    CVPR. 2005.
    Information theory approach (G. Hinton) [a]
    ▶ Need to define the pdf of positive pairs (x, x⁺) ∼ p(x, x⁺) and negative pairs
    (x, x⁻) ∼ p(x)p(x⁻), where x⁻ ⊥⊥ (x, x⁺)
    ▶ Maximize the Mutual Information (I) between positive pairs, given independent
    negative pairs: I(x; x⁺) = I(x; x⁺, x⁻) = E_{x⁻∼p(x⁻)} I(x; x⁺) [b]
    ▶ Need to define an estimator of I
    aS. Becker et al. “Self-organizing neural network that
    discovers surfaces in random ...”. In: Nature (1992).
    bB. Poole et al. “On Variational Bounds of Mutual
    Information”. In: ICML. 2019.
    21/91


  28. Contrastive Learning - A bit of history
    • The information-theoretic approach is mathematically sound and well grounded in
    the role of Mutual Information (I) estimation in representation learning.
    • But... a large I is not necessarily predictive of downstream performance. Good results
    may depend on architecture choices and inductive biases rather than on an accurate
    estimation of I [25]
    • Furthermore, a geometric approach:
    ▶ is easy to understand and explain
    ▶ can easily formalize abstract ideas for defining new losses or regularization terms
    (e.g., data biases)
    ▶ needs no implausible hypotheses (e.g., independence of the negative samples).
    25M. Tschannen et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020.
    22/91


  29. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    23/91


  30. Contrastive Learning - Geometric approach
    ▶ Let x ∈ X be a sample (anchor)
    ▶ Let x⁺_i be a similar (positive) sample
    ▶ Let x⁻_j be a different (negative) sample
    ▶ Let P be the number of positive samples
    ▶ Let N be the number of negative samples
    ▶ Let f : X → S^{d−1} be the mapping
    ▶ Let F = S^{d−1}, the (d−1)-sphere
    Figure: From Schroff et al. [a]
    a F. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    24/91


  31. Contrastive Learning - Geometric approach
    ▶ Let x ∈ X be a sample (anchor)
    ▶ Let x⁺_i be a similar (positive) sample
    ▶ Let x⁻_j be a different (negative) sample
    ▶ Let P be the number of positive samples
    ▶ Let N be the number of negative samples
    ▶ Let f : X → S^{d−1} be the mapping
    ▶ Let F = S^{d−1}, the (d−1)-sphere
    Figure: From Schroff et al. [a]
    a F. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    How can we define positive and negative samples?
    24/91


  32. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [26] or a
    nearest-neighbor from a support set [27].
    26 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    27 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    25/91


  33. Unsupervised setting
    26/91


  34. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [28] or a
    nearest-neighbor from a support set [29].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [30].
    28 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    29 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    30 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    27/91


  35. Supervised setting
    Figure: Image taken from [31]
    31 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    28/91


  36. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [32] or a
    nearest-neighbor from a support set [33].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [34].
    ▶ In regression [35] or weakly-supervised classification [36]: x⁺_i is a sample with a
    continuous/weak label similar to that of x.
    • The definition of negative samples x⁻_j varies accordingly.
    32 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    33 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    34 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    35 C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    36 B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    29/91


  37. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [32] or a
    nearest-neighbor from a support set [33].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [34].
    ▶ In regression [35] or weakly-supervised classification [36]: x⁺_i is a sample with a
    continuous/weak label similar to that of x.
    • The definition of negative samples x⁻_j varies accordingly.
    How can we contrast positive and negative samples from a mathematical point of view?
    32 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    33 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    34 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    35 C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    36 B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    29/91


  38. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    30/91


  39. Contrastive Learning - ϵ-margin metric
    • We propose to use an ϵ-margin metric learning point of view [37].
    • If we have a single positive x⁺ and several negatives x⁻_j (e.g., tuplet loss), we look for f such that:
      d(f(x), f(x⁺)) − d(f(x), f(x⁻_j)) ≤ −ϵ   ⟺   s(f(x), f(x⁻_j)) − s(f(x), f(x⁺)) ≤ −ϵ   ∀j
      (denoting d⁺ = d(f(x), f(x⁺)), d⁻_j = d(f(x), f(x⁻_j)), s⁺ = s(f(x), f(x⁺)), s⁻_j = s(f(x), f(x⁻_j)))
    • where ϵ ≥ 0 is a margin between positives and negatives and s(f(a), f(b)) = ⟨f(a), f(b)⟩_2.
    37 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    31/91


  40. Contrastive Learning - ϵ-margin metric
    • We propose to use an ϵ-margin metric learning point of view [37].
    • If we have a single positive x⁺ and several negatives x⁻_j, we look for f such that:
      d⁺ − d⁻_j ≤ −ϵ   ⟺   s⁻_j − s⁺ ≤ −ϵ   ∀j
    • where ϵ ≥ 0 is a margin between positives and negatives and s(f(a), f(b)) = ⟨f(a), f(b)⟩_2.
    • Two possible ways to transform this condition into an optimization problem are:
      arg min_f max(0, {s⁻_j − s⁺ + ϵ}_{j=1,...,N})        arg min_f Σ_{j=1}^{N} max(0, s⁻_j − s⁺ + ϵ)
    • When these losses equal 0, the condition is fulfilled. The first (max) is a lower bound of the second (sum).
    37 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    31/91


  41. Contrastive Learning - ϵ-margin metric
    LogSumExp operator (LSE)
    The LogSumExp operator is a smooth approximation of the max function. It is defined as:
      max(x_1, x_2, ..., x_N) ≤ LSE(x_1, x_2, ..., x_N) = log( Σ_{i=1}^{N} exp(x_i) )

      arg min_f max(0, {s⁻_j − s⁺ + ϵ}_{j=1,...,N}) ≈ arg min_f − log [ exp(s⁺) / ( exp(s⁺ − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-InfoNCE) [38]
    38 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    32/91
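    As an illustration, here is a minimal PyTorch sketch of the ϵ-InfoNCE objective above (not the authors' implementation): it assumes one positive per anchor, cosine similarities on L2-normalized embeddings, and omits the temperature scaling that is commonly added in practice.

```python
# Minimal sketch of eps-InfoNCE (assumption: one positive per anchor, cosine
# similarity on L2-normalized features; no temperature, as in the slide formula).
import torch
import torch.nn.functional as F

def eps_infonce(anchor, positive, negatives, eps=0.25):
    """anchor, positive: (B, d); negatives: (B, N, d)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    s_pos = (anchor * positive).sum(-1)                      # s+    (B,)
    s_neg = torch.einsum("bd,bnd->bn", anchor, negatives)    # s-_j  (B, N)
    # -log exp(s+) / (exp(s+ - eps) + sum_j exp(s-_j))
    denom = torch.logsumexp(torch.cat([(s_pos - eps).unsqueeze(1), s_neg], dim=1), dim=1)
    return (denom - s_pos).mean()
```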


  42. Contrastive Learning - ϵ-margin metric
    • When ϵ = 0, we retrieve the InfoNCE loss [39], whereas when ϵ → ∞ we obtain
    InfoL1O (or the Decoupled loss [40]).
    • These two losses are respectively a lower and an upper bound of I(X⁺, X) [41]:

      E_{(x,x⁺)∼p(x,x⁺), x⁻_j∼p(x⁻)} [ log exp(s⁺) / ( exp(s⁺) + Σ_j exp(s⁻_j) ) ]   (InfoNCE)
        ≤ I(X⁺, X) ≤
      E_{(x,x⁺)∼p(x,x⁺), x⁻_j∼p(x⁻)} [ log exp(s⁺) / ( Σ_j exp(s⁻_j) ) ]   (InfoL1O)    (1)

    • Varying ϵ ∈ [0, ∞) can give a tighter approximation of I(X⁺, X): the term exp(s⁺ − ϵ)
    in the denominator monotonically decreases as ϵ increases.
    39T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    40C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    41B. Poole et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019.
    33/91


  43. Contrastive Learning - ϵ-margin metric
    • The inclusion of multiple positive samples (s⁺_i) can lead to different formulations
    (see [42]). Here, we use the simplest one:
      s⁻_j − s⁺_i ≤ −ϵ   ∀i, j
      Σ_i max(−ϵ, {s⁻_j − s⁺_i}_{j=1,...,N}) ≈ − Σ_i log [ exp(s⁺_i) / ( exp(s⁺_i − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-SupInfoNCE)   (2)
    • Another formulation is the SupCon loss [43], which has been presented as the “most
    straightforward way to generalize” the InfoNCE loss to multiple positives. However...
    42C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    43P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    34/91
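    A minimal PyTorch sketch of ϵ-SupInfoNCE under the same assumptions (cosine similarities on normalized embeddings, positives defined by shared labels within the batch, no temperature); the explicit loop over anchors is for clarity rather than speed.

```python
# Minimal sketch of eps-SupInfoNCE for a labeled batch (assumption: cosine
# similarities on normalized features; positives = samples sharing the anchor's label).
import torch
import torch.nn.functional as F

def eps_sup_infonce(features, labels, eps=0.25):
    """features: (B, d) projected embeddings; labels: (B,) integer class labels."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t()                                    # pairwise cosine similarities
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    pos_mask, neg_mask = same & ~eye, ~same
    loss, count = 0.0, 0
    for i in range(B):
        s_pos = sim[i][pos_mask[i]]                    # s+_i: positives of anchor i
        s_neg = sim[i][neg_mask[i]]                    # s-_j: negatives of anchor i
        if s_pos.numel() == 0:
            continue
        # -sum_i log exp(s+_i) / (exp(s+_i - eps) + sum_j exp(s-_j))
        denom = torch.logsumexp(
            torch.cat([(s_pos - eps).unsqueeze(1),
                       s_neg.unsqueeze(0).expand(s_pos.size(0), -1)], dim=1), dim=1)
        loss = loss + (denom - s_pos).sum()
        count += 1
    return loss / max(count, 1)
```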


  44. Contrastive Learning - ϵ-margin metric
    • ... it actually contains a non-contrastive constraint [44] on the positive samples:
    s⁺_t − s⁺_i ≤ 0 ∀i, t.
      s⁻_j − s⁺_i ≤ −ϵ   ∀i, j    and    s⁺_t − s⁺_i ≤ 0   ∀i, t ≠ i
      (1/P) Σ_i max(0, {s⁻_j − s⁺_i + ϵ}_j, {s⁺_t − s⁺_i}_{t≠i}) ≈ ϵ − (1/P) Σ_i log [ exp(s⁺_i) / ( Σ_t exp(s⁺_t − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-SupCon)
    • When ϵ = 0 we retrieve exactly L^sup_out.
    • It tries to align all positive samples to a single point in the representation space,
    thus losing intra-class variability.
    44C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    35/91


  45. Supervised Contrastive Learning - Results
    Table: Accuracy on vision datasets. SimCLR and Max-Margin results from45. Results denoted with * are
    (re)implemented with mixed precision due to memory constraints.
    Dataset Network SimCLR Max-Margin SimCLR* CE* SupCon* ϵ-SupInfoNCE*
    CIFAR-10 ResNet-50 93.6 92.4 91.74±0.05 94.73±0.18 95.64±0.02 96.14±0.01
    CIFAR-100 ResNet-50 70.7 70.5 68.94±0.12 73.43±0.08 75.41±0.19 76.04±0.01
    ImageNet-100 ResNet-50 - - 66.14±0.08 82.1±0.59 81.99±0.08 83.3±0.06
    Table: Comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100 in terms of top-1 accuracy (%).
    Loss ϵ = 0.1 ϵ = 0.25 ϵ = 0.5
    ϵ-SupInfoNCE 83.25±0.39 83.02±0.41 83.3±0.06
    ϵ-SupCon 82.83±0.11 82.54±0.09 82.77±0.14
    45P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    36/91


  46. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    37/91


  47. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    38/91


  48. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    • Let y be the continuous/weak label of the anchor x and y_k that of a sample x_k.
    • Simple solution: threshold the distance d between y and y_k at τ to create positive and
    negative samples: x_k is a positive x⁺ if d(y, y_k) < τ → Problem: how to choose τ?
    38/91


  49. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    • Let y be the continuous/weak label of the anchor x and y_k that of a sample x_k.
    • Simple solution: threshold the distance d between y and y_k at τ to create positive and
    negative samples: x_k is a positive x⁺ if d(y, y_k) < τ → Problem: how to choose τ?
    • Our solution: define a degree of “positiveness” between samples using a kernel
    function w_k = K_σ(y − y_k), where 0 ≤ w_k ≤ 1.
    • New goal: learn f that maps samples with a high degree of positiveness (w_k ∼ 1) close
    together in the latent space and samples with a low degree (w_k ∼ 0) far away from each other.
    38/91
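    As an illustration, a minimal sketch of the degree of positiveness, assuming a Gaussian (RBF) kernel K_σ; any kernel bounded in [0, 1] would fit the definition above, and the function name and σ value are illustrative.

```python
# Minimal sketch of the "degree of positiveness" w_k = K_sigma(y - y_k),
# here with a Gaussian (RBF) kernel so that 0 < w_k <= 1 (assumption).
import torch

def degree_of_positiveness(y_anchor, y_batch, sigma=5.0):
    """y_anchor: scalar tensor; y_batch: (N,) continuous labels (e.g., age)."""
    diff = y_batch - y_anchor
    return torch.exp(-diff.pow(2) / (2 * sigma**2))   # w_k -> 1 when y_k is close to y_anchor

# e.g., anchor aged 20 vs. samples aged 15, 64, 20
w = degree_of_positiveness(torch.tensor(20.0), torch.tensor([15.0, 64.0, 20.0]))
```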


  50. Contrastive Learning - Weakly supervised
    Question: Which pair of subjects are closer in your opinion (brain MRI, axial plane)?
    Subject A
    Subject B
    Subject C
    39/91


  51. Contrastive Learning - Weakly supervised
    Question: Which pair of subjects are closer in your opinion (brain MRI, axial plane)?
    Subject A Age=15
    Subject B Age=64
    Subject C Age=20
    39/91


  52. Contrastive Learning - Weakly supervised
    Figure: SimCLR vs. y-Aware Contrastive Learning. Samples x_1, x_2, x_3 with meta-data
    y_1, y_2, y_3 ∈ ℝ are augmented (t_1, t′_1, t_2, t′_2, t_3, t′_3 ∼ T) and mapped to the latent space Z;
    in y-aware CL the alignment between samples is weighted by the kernel values w_σ(y_1, y_2), w_σ(y_2, y_3).
    40/91


  53. Contrastive Learning - Weakly supervised
    • In [46,47], we propose a new contrastive condition for weakly supervised problems:
      (w_k / Σ_j w_j) (s_t − s_k) ≤ 0   ∀ k, t ≠ k ∈ A
    • where A contains the indices of the samples ≠ x, and we consider as positives only the
    samples with w_k > 0 and align them with a strength proportional to w_k.
    • As before, we can transform it into an optimization problem, obtaining the y-aware loss:
      arg min_f Σ_k max(0, (w_k / Σ_t w_t) {s_t − s_k}_{t=1,...,N, t≠k}) ≈ L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    46B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    47B. Dufumier et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop. 2021.
    41/91
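    A minimal PyTorch sketch of the y-aware loss, under simplifying assumptions: one embedding per sample (each row acts as anchor), cosine similarities, a Gaussian kernel on the labels, and a denominator taken over all samples other than the anchor.

```python
# Minimal sketch of the y-aware loss (weakly supervised CL), simplified to a
# single view per sample; the anchor is each row of the batch in turn.
import torch
import torch.nn.functional as F

def y_aware_loss(features, y, sigma=5.0):
    """features: (N, d) embeddings; y: (N,) continuous proxy labels (e.g., age)."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t()                                        # similarities s_t
    w = torch.exp(-(y.unsqueeze(0) - y.unsqueeze(1)).pow(2) / (2 * sigma**2))
    eye = torch.eye(len(y), dtype=torch.bool, device=z.device)
    w = w.masked_fill(eye, 0.0)                            # exclude the anchor itself
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)     # w_k / sum_t w_t
    # log exp(s_k) / sum_t exp(s_t), with the sum taken over t != anchor
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    return -(w * log_prob).sum(dim=1).mean()
```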


  54. Contrastive Learning - Weakly supervised
    42/91


  55. Results - Linear evaluation
    (a) 5-fold CV Stratified on Site.
    (b) 5-fold CV Leave-Site-Out
    43/91


  56. Results - Robustness to σ and transformations
    ▶ Linear classification performance remains stable for a range σ ∈ [1, 5]
    ▶ Adding more transformations improves the representations (in line with SimCLR)
    ▶ Cutout remains competitive while being computationally cheap
    44/91


  57. Results - Fine-tuning
    Task (Ntrain) Test Set | Baseline | Age-Aware Contrastive [48] | Model Genesis [49] | Contrastive Learning [50] | VAE | Age Sup.
    SCZ vs. HC ↑ (Ntrain = 933) Internal Test 85.27±1.60 85.17±0.37 76.31±1.77 82.31±2.03 82.56±0.68 83.05±1.36
    SCZ vs. HC ↑ (Ntrain = 933) External Test 75.52±0.12 77.00±0.55 67.40±1.59 75.48±2.54 75.11±1.65 74.36±2.28
    BD vs. HC ↑ (Ntrain = 832) Internal Test 76.49±2.16 78.81±2.48 76.25±1.48 72.71±2.06 71.61±0.81 77.21±1.00
    BD vs. HC ↑ (Ntrain = 832) External Test 68.57±4.72 77.06±1.90 65.66±0.90 71.23±3.05 71.70±0.23 73.02±2.66
    ASD vs. HC ↑ (Ntrain = 1526) Internal Test 65.74±1.47 66.36±1.14 63.58±4.35 61.92±1.67 59.67±2.04 67.11±1.76
    ASD vs. HC ↑ (Ntrain = 1526) External Test 62.93±2.40 68.76±1.70 54.95±3.58 61.93±1.93 57.45±0.81 62.07±2.98
    Table: Fine-tuning results [51]. Pre-training strategies: weakly self-supervised (Age-Aware Contrastive),
    self-supervised (Model Genesis, Contrastive Learning), generative (VAE), discriminative (Age Sup.).
    All pre-trained models use a dataset of 8754 3D MRIs of healthy brains. We report the average AUC (%)
    and the standard deviation over three repetitions of each experiment. Baseline is a DenseNet121 backbone.
    48B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    49Z. Zhou et al. “Models Genesis”. In: MedIA (2021).
    50T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    51B. Dufumier et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In: NeuroImage
    (under review) ().
    45/91


  58. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    46/91


  59. Contrastive Learning - Regression
    • We could use L_y-aware also in regression. But...
      L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    • ... the numerator aligns x_k, and the denominator focuses more on the closest samples
    in the representation space.
    47/91


  60. Contrastive Learning - Regression
    • We could use L_y-aware also in regression. But...
      L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    • ... the numerator aligns x_k, and the denominator focuses more on the closest samples
    in the representation space. → Problem! These samples might have a greater degree
    of positiveness with the anchor than the considered x_k
    47/91


  61. Contrastive Learning - Regression
    • We thus propose two new losses:
      w_k (s_t − s_k) ≤ 0   if   w_t − w_k ≤ 0   ∀ k, t ≠ k ∈ A(i)
      L_thr = − Σ_k (w_k / Σ_t δ_{w_t} w_t) log [ exp(s_k) / Σ_{t≠k} δ_{w_t} exp(s_t) ]
    • L_thr repels only the samples whose label is farther from the anchor's than that of x_k
    (i.e., w_t ≤ w_k), but it still focuses more on the closest samples.
    48/91


  62. Contrastive Learning - Regression
      w_k [s_t (1 − w_t) − s_k] ≤ 0   ∀ k, t ≠ k ∈ A(i)
      L_exp = − (1 / Σ_t w_t) Σ_k w_k log [ exp(s_k) / Σ_{t≠k} exp(s_t (1 − w_t)) ]
    • L_exp has a repulsion strength inversely proportional to the similarity between the y values,
    whatever their distance.
    • The repulsion strength only depends on the distance in the kernel space → samples
    close in the kernel space will be close in the representation space.
    49/91
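    A minimal PyTorch sketch of L_exp under the same simplifying assumptions as before (one embedding per sample, cosine similarities, Gaussian kernel on the labels, denominator taken over all samples other than the anchor).

```python
# Minimal sketch of the L_exp regression loss: repulsion between samples is
# scaled by (1 - w_t), i.e., inversely proportional to their label similarity.
import torch
import torch.nn.functional as F

def l_exp(features, y, sigma=5.0):
    """features: (N, d) embeddings; y: (N,) continuous labels (e.g., age)."""
    z = F.normalize(features, dim=-1)
    s = z @ z.t()                                          # pairwise similarities s_t
    w = torch.exp(-(y.unsqueeze(0) - y.unsqueeze(1)).pow(2) / (2 * sigma**2))
    eye = torch.eye(len(y), dtype=torch.bool, device=z.device)
    w = w.masked_fill(eye, 0.0)
    # denominator: sum over t of exp(s_t * (1 - w_t)), excluding the anchor
    logits = (s * (1.0 - w)).masked_fill(eye, float("-inf"))
    log_prob = s - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(w * log_prob).sum() / w.sum().clamp_min(1e-8)
```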


  63. Results - OpenBHB Challenge
    • OpenBHB Challenge: age prediction with site-effect removal → brain age ≠
    chronological age in neurodegenerative disorders!
    • Ntrain
    : 5330 3D brain MRI scans (different subjects) from 71 acquisition sites.
    • Two private test data-sets (internal and external)
    • To participate https://ramp.studio/problems/brain_age_with_site_removal 50/91


  64. Results - Regression
    Method Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Ly−aware 2.66±0.00 6.60±0.17 4.10±0.01 1.82
    Lthr 2.95±0.01 5.73±0.15 4.10±0.01 1.74
    Lexp 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    Table: Comparison of contrastive losses.
    Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Baseline (ℓ1) DenseNet 2.55±0.01 8.0±0.9 7.13±0.05 3.34
    Baseline (ℓ1) ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86
    Baseline (ℓ1) AlexNet 2.72±0.01 8.3±0.2 4.66±0.05 2.21
    ComBat DenseNet 5.92±0.01 2.23±0.06 10.48±0.17 3.38
    ComBat ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88
    ComBat AlexNet 3.37±0.01 6.8±0.3 5.23±0.12 2.33
    Lexp DenseNet 2.85±0.00 5.34±0.06 4.43±0.00 1.84
    Lexp ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    Lexp AlexNet 2.77±0.01 5.8±0.1 4.01±0.01 1.71
    Table: Final scores on the OpenBHB Challenge leaderboard. MAE: Mean Absolute Error.
    BAcc: Balanced Accuracy for site prediction. Challenge score: Lc = BAcc^0.3 · MAE_ext.
    51/91


  65. The Issue of Biases
    • Contrastive learning is more robust than traditional end-to-end approaches, such as
    cross-entropy, against noise in the data or in the labels52.
    • What about data bias, such as the site-effect ?
    Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Baseline (ℓ1
    ) ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86
    ComBat ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88
    Lexp ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    • L_exp shows a small overfitting on the internal sites but also a low debiasing capability
    towards the site effect → BAcc should equal random chance: 1/n_sites = 1/64 ≈ 1.56%
    • We need to include debiasing regularization terms, such as FairKL [53].
    52F. Graf et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021.
    53C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    52/91


  66. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    53/91


  67. The Issue of Biases
    ▶ Contrastive learning can generally guarantee good downstream performance.
    However, it does not take into account the presence of data biases.
    ▶ Data biases: visual features that are correlated with the target downstream task
    (e.g., the yellow colour) but do not actually characterize it (the digit).
    ▶ We employ the notion of bias-aligned and bias-conflicting samples54:
    1. bias-aligned: shares the same bias attribute of the anchor. We denote it as x+,b
    2. bias-conflicting: has a different bias attribute. We denote it as x+,b′
    (a) Anchor x (b) Bias-aligned x+,b (c) Bias-conflicting x+,b′
    54J. Nam et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020.
    54/91


  68. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample.
    55/91


  69. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample. → We would like them to be equidistant
    from the anchor ! Remove the effect of the bias
    55/91


  70. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample. → We would like them to be equidistant
    from the anchor ! Remove the effect of the bias
    ▶ Thus, we say that there is a bias if we can identify an ordering on the learned
    representations: d^{+,b}_i < d^{+,b′}_k ≤ d^{−,·}_j − ϵ   ∀i, k, j
    Note
    This represents the worst-case scenario, where the ordering is total (i.e., ∀i, k, j). Of
    course, there can also be cases in which the bias is not as strong, and the ordering may
    be partial. Furthermore, the same reasoning can be applied to negative samples
    (omitted for brevity).
    55/91


  71. The Issue of Biases
    • Ideally, we would like d^{+,b′}_k − d^{+,b}_i = 0 ∀i, k and d^{−,b′}_t − d^{−,b}_j = 0 ∀t, j
    • However, this condition is very strict, as it would enforce a uniform distance among all
    positive (resp. negative) samples.
    • We propose a more relaxed condition where we force the distributions of distances,
    {d^{+,b′}_k} and {d^{+,b}_i}, to be similar (same for the negatives).
    • Assuming that the distance distributions are normal, B^{+,b} ∼ N(µ_{+,b}, σ²_{+,b}) and
    B^{+,b′} ∼ N(µ_{+,b′}, σ²_{+,b′}), we minimize the Kullback-Leibler divergence between the two
    distributions with the FairKL regularization term:
      R_FairKL = D_KL(B^{+,b} || B^{+,b′}) = (1/2) [ (σ²_{+,b} + (µ_{+,b} − µ_{+,b′})²) / σ²_{+,b′} − log(σ²_{+,b} / σ²_{+,b′}) − 1 ]
    56/91
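    A minimal sketch of the FairKL term, assuming the anchor-positive distances of bias-aligned and bias-conflicting samples have already been collected and are treated as Gaussian, exactly as in the closed-form KL above.

```python
# Minimal sketch of the FairKL regularizer: KL divergence between the (assumed
# Gaussian) distance distributions of bias-aligned vs. bias-conflicting positives.
import torch

def fairkl(d_pos_aligned, d_pos_conflicting, eps=1e-8):
    """d_pos_aligned: 1D tensor of anchor-positive distances for bias-aligned samples;
    d_pos_conflicting: same for bias-conflicting samples."""
    mu_b, var_b = d_pos_aligned.mean(), d_pos_aligned.var() + eps
    mu_bp, var_bp = d_pos_conflicting.mean(), d_pos_conflicting.var() + eps
    # D_KL( N(mu_b, var_b) || N(mu_bp, var_bp) )
    return 0.5 * ((var_b + (mu_b - mu_bp) ** 2) / var_bp - torch.log(var_b / var_bp) - 1.0)
```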


  72. FairKL visual explanation
    • Positive and negative samples are marked with a + or − symbol respectively; filling colors
    represent different biases. Comparison between FairKL, EnD [55], which only constrains
    the first moments (µ_{+,b} = µ_{+,b′}), and EnD with margin ϵ.
    (a) EnD: partial ordering.   (b) EnD + ϵ: still a partial ordering.   (c) FairKL: no partial ordering.
    55E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021.
    57/91


  73. FairKL visual explanation
    • Simulated toy example where distances follow a Gaussian distribution. Bias-aligned
    samples in blue. Bias-conflicting samples in orange.
    (a) µ_{+,b} ≠ µ_{+,b′}, σ²_{+,b} = σ²_{+,b′}   (b) µ_{+,b} = µ_{+,b′}, σ²_{+,b} ≠ σ²_{+,b′}   (c) µ_{+,b} = µ_{+,b′}, σ²_{+,b} = σ²_{+,b′}
    58/91


  74. Results - Biased-MNIST
    Table: Top-1 accuracy (%) on Biased-MNIST (bias = background color). Reference results from56. Results
    denoted with * are re-implemented without color-jittering and bias-conflicting oversampling.
    Method 0.999 0.997 0.995 0.99
    CE 11.8±0.7 62.5±2.9 79.5±0.1 90.8±0.3
    LNL57 18.2±1.2 57.2±2.2 72.5±0.9 86.0±0.2
    ϵ-SupInfoNCE 33.16±3.57 73.86±0.81 83.65±0.36 91.18±0.49
    BiasCon+BiasBal* 30.26±11.08 82.83±4.17 88.20±2.27 95.04±0.86
    EnD58 59.5±2.3 82.70±0.3 94.0±0.6 94.8±0.3
    BiasBal 76.8±1.6 91.2±0.2 93.9±0.1 96.3±0.2
    BiasCon+CE* 15.06±2.22 90.48±5.26 95.95±0.11 97.67±0.09
    ϵ-SupInfoNCE + FairKL 90.51±1.55 96.19±0.23 97.00±0.06 97.86±0.02
    56Y. Hong et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS. 2021.
    57B. Kim et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019.
    58E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021.
    59/91


  75. Results - Corrupted CIFAR-10 and bFFHQ
    Table: Top-1 accuracy (%) on Corrupted CIFAR-10 with different corruption ratio (%) where each class is
    correlated with a certain texture. bFFHQ: facial images where most of the females are young and most of
    the males are old.
    Corrupted CIFAR-10 bFFHQ
    Ratio Ratio
    Method 0.5 1.0 2.0 5.0 0.5
    Vanilla 23.08±1.25 25.82±0.33 30.06±0.71 39.42±0.64 56.87±2.69
    EnD 19.38±1.36 23.12±1.07 34.07±4.81 36.57±3.98 56.87±1.42
    HEX 13.87±0.06 14.81±0.42 15.20±0.54 16.04±0.63 52.83±0.90
    ReBias 22.27±0.41 25.72±0.20 31.66±0.43 43.43±0.41 59.46±0.64
    LfF 28.57±1.30 33.07±0.77 39.91±0.30 50.27±1.56 62.2±1.0
    DFA 29.95±0.71 36.49±1.79 41.78±2.29 51.13±1.28 63.87±0.31
    ϵ-SupInfoNCE + FairKL 33.33±0.38 36.53±0.38 41.45±0.42 50.73±0.90 64.8±0.43
    60/91


  76. Results - Biased ImageNet
    Table: Top-1 accuracy (%) on 9-Class ImageNet biased and unbiased (UNB) sets, and ImageNet-A (IN-A).
    They have textural biases.
    Vanilla SIN LM RUBi ReBias LfF SoftCon ϵ-SupInfoNCE
    + FairKL
    Biased 94.0±0.1 88.4±0.9 79.2±1.1 93.9±0.2 94.0±0.2 91.2±0.1 95.3±0.2 95.1±0.1
    UNB 92.7±0.2 86.6±1.0 76.6±1.2 92.5±0.2 92.7±0.2 89.6±0.3 94.1±0.3 94.8±0.3
    IN-A 30.5±0.5 24.6±2.4 19.0±1.2 31.0±0.2 30.5±0.2 29.4±0.8 34.1±0.6 35.7±0.5
    61/91


  77. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    62/91


  78. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion.
    63/91


  79. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion. → they may induce bias and be inadequate for medical imaging !
    ▶ dominant objects can prevent the model from learning features of smaller objects
    ▶ few, irrelevant and easy-to-learn features are sufficient to collapse the
    representation (a.k.a feature suppression)59
    ▶ in medical imaging transformations need to preserve discriminative anatomical
    information while removing unwanted noise (e.g., crop → tumor suppression)
    59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    63/91


  80. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion. → they may induce bias and be inadequate for medical imaging !
    ▶ dominant objects can prevent the model from learning features of smaller objects
    ▶ few, irrelevant and easy-to-learn features are sufficient to collapse the
    representation (a.k.a feature suppression)59
    ▶ in medical imaging transformations need to preserve discriminative anatomical
    information while removing unwanted noise (e.g., crop → tumor suppression)
    • Question: can we integrate prior information into CL to make it less dependent on
    augmentations ?
    59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    63/91


  81. Augmentation graph
    • As prior information, we consider weak attributes (e.g., age) or representations of a
    pre-trained generative model (e.g., VAE, GAN).
    • Using the theoretical understanding of CL through the augmentation graph60, we
    make the connection with kernel theory and introduce a novel loss61
    60Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    61B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    64/91


  82. Augmentation graph
    • Augmentation graph: each point is an original image. Two points are connected if
    they can be transformed into the same augmented image (support=light disk).
    • Colors represent semantic (unknown) classes.
    • An incomplete augmentation graph (1) (intra-class samples not connected due to
    augmentations not adapted), is reconnected (3) using a kernel defined on prior (2).
    65/91


  83. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  84. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    • Problem: current InfoNCE-based losses (e.g., the y-aware loss) cannot give tight bounds
    on the classification loss under these assumptions.
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  85. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    • Problem: current InfoNCE-based losses (e.g., the y-aware loss) cannot give tight bounds
    on the classification loss under these assumptions.
    We need a new loss. Why not leverage multiple views?
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  86. Contrastive Learning - Multi-views centroids
    • We can create several views (i.e., transformations) of each sample.
    • Let x^v_i be the v-th view of sample x_i and V the total number of views per sample:
      s⁺_i = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_i))      (similarities between views of x_i)
      s⁺_j = (1/V²) Σ_{v,v′} s(f(x^v_j), f(x^{v′}_j))      (similarities between views of x_j)
      s⁻_ij = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_j))     (similarities between views of x_i and x_j)
    • We propose to look for an f such that [65]:
      s⁺_i + s⁺_j > 2 s⁻_ij + ϵ   ∀i ≠ j
      arg min_f log [ exp(−ϵ) + Σ_{i≠j} exp(−s⁺_i − s⁺_j + 2 s⁻_ij) ]
    65B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    67/91


  87. Decoupled Uniformity
    • Call µ_i = (1/V) Σ_{v=1}^{V} f(x^v_i) a centroid, defined as the average of the representations of
    the views of sample x_i.
    • In the limit ϵ → ∞, we retrieve a new loss that we call Decoupled Uniformity [66]:
      L̂^de_unif = log (1/(n(n−1))) Σ_{i≠j} exp(−s⁺_i − s⁺_j + 2 s⁻_ij) = log (1/(n(n−1))) Σ_{i≠j} exp(−||µ_i − µ_j||²)
    • It repels distinct centroids through an average pairwise Gaussian potential (similar to
    uniformity [67]).
    66B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    67T. Wang et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on the
    Hypersphere”. In: ICML. 2020.
    68/91
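    A minimal PyTorch sketch of the empirical Decoupled Uniformity loss, assuming the V view embeddings of each sample are already computed; they are L2-normalized and averaged into centroids inside the function.

```python
# Minimal sketch of Decoupled Uniformity: average the V view embeddings of each
# sample into a centroid, then repel distinct centroids with a Gaussian potential.
import torch
import torch.nn.functional as F

def decoupled_uniformity(view_features):
    """view_features: (n, V, d) embeddings of V views for each of n samples."""
    z = F.normalize(view_features, dim=-1)
    mu = z.mean(dim=1)                                   # centroids mu_i, shape (n, d)
    sq_dist = torch.cdist(mu, mu).pow(2)                 # ||mu_i - mu_j||^2
    n = mu.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=mu.device)
    # log (1/(n(n-1))) sum_{i != j} exp(-||mu_i - mu_j||^2)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```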


  88. Decoupled Uniformity - Properties
    • The properties of this new loss L̂^de_unif = log (1/(n(n−1))) Σ_{i≠j} exp(−||µ_i − µ_j||²) are:
    1. It implicitly imposes alignment between positives → no need to explicitly add an
    alignment term.
    2. It solves the negative-positive coupling problem of InfoNCE [68] → positive samples
    are not attracted (alignment) and repelled (uniformity) at the same time.
    3. It allows the integration of prior knowledge z(x_i) about x_i (weak attribute,
    generative model representation) → use a kernel K_σ(z(x_i), z(x_j)) on the priors to
    better estimate the centroids. Intuitively, if (z(x_i), z(x_j)) are close in the kernel
    space, so should be (f(x_i), f(x_j)) in the representation space.
    68T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    69/91


  89. Decoupled Uniformity - Properties
    • Similarly to69, we compute the gradient w.r.t. the v-th view f(x_k^v):

      ∇_{f(x_k^v)} L_unif^de = 2 Σ_{j≠k} w_{k,j} µ_j  −  2 w_k µ_k        (3)
                               (repel hard negatives)    (align hard positives)

    • w_{k,j} = exp(−||µ_k − µ_j||²) / Σ_{p≠q} exp(−||µ_p − µ_q||²) → quantifies whether the negative sample x_j
    is “hard” (i.e., close to the positive sample x_k)
    • w_k = Σ_{j≠k} w_{k,j}, s.t. Σ_k w_k = 1 → quantifies whether the positive sample x_k is “hard” (i.e.,
    close to other samples in the batch)
    • We thus implicitly align all views of each sample in the same direction (in particular the
    hard positives) and repel the hard negatives (a short sketch of these weights follows below).
    • No positive-negative coupling → w_{k,j} only depends on the relative positions of the centroids µ
    69C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
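    A minimal sketch of these hardness weights (same assumptions as the previous sketches, not the authors' code); by construction the weights over all ordered pairs sum to 1:

```python
import torch

def hardness_weights(mu: torch.Tensor):
    # mu: (n, d) centroids; returns w_kj (n x n, zero diagonal) and w_k (n,)
    g = torch.exp(-torch.cdist(mu, mu).pow(2))         # exp(-||mu_k - mu_j||^2)
    g.fill_diagonal_(0.0)                              # exclude k == j
    w_kj = g / g.sum()                                 # normalize over all pairs p != q
    w_k = w_kj.sum(dim=1)                              # w_k = sum_{j != k} w_kj; sums to 1 over k
    return w_kj, w_k

# a larger w_k flags a "hard" positive whose centroid sits close to other centroids in the batch
mu = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
w_kj, w_k = hardness_weights(mu)
```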
    70/91


  90. Decoupled Uniformity without Prior
    • Comparison of Decoupled Uniformity (without prior) with the InfoNCE70 and DC71 losses. Batch
    size n = 256. All models are trained for 400 epochs.

    Dataset       Network    L_InfoNCE     L_DC          L_unif^de
    CIFAR-10      ResNet18   82.18±0.30    84.87±0.27    85.05±0.37
    CIFAR-100     ResNet18   55.11±0.20    58.27±0.34    58.41±0.05
    ImageNet100   ResNet50   68.76         73.98         77.18
    70T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    71C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    71/91


  91. Decoupled Uniformity with Prior
    • How can we estimate the centroids using a kernel on the prior? → conditional mean
    embedding theory72
    Definition - Empirical Kernel Decoupled Uniformity Loss
    Let (x_i)_{i∈[1..n]} be the n samples with their V views x_i^v, and K_n = [K_σ(z(x_i), z(x_j))]_{i,j∈[1..n]} the
    kernel prior matrix, where K_σ is a standard kernel (e.g., Gaussian or cosine). We define
    the new centroid estimator as

      µ̂_x̄_j = (1/V) Σ_{v=1}^{V} Σ_{i=1}^{n} α_{i,j} f(x_i^v)   with   α_{i,j} = ((K_n + λ n I_n)^{−1} K_n)_{ij},

    where λ = O(n^{−1/2}) is a regularization constant.
    The empirical Kernel Decoupled Uniformity loss is then:

      L̂_unif^de(f) := log [ 1/(n(n−1)) Σ_{i≠j} exp(−||µ̂_x̄_i − µ̂_x̄_j||²) ]
    72L. Song et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013).
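    A minimal sketch of this estimator and of the resulting loss (not the authors' implementation), reusing the (n, V, d) view tensor and an (n, n) prior kernel matrix such as the RBF kernel sketched earlier; torch.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import torch

def kernel_decoupled_uniformity(z: torch.Tensor, K: torch.Tensor, lam=None) -> torch.Tensor:
    # z: (n, V, d) view embeddings; K: (n, n) prior kernel matrix [K_sigma(z(x_i), z(x_j))]_ij
    n = z.shape[0]
    lam = n ** (-0.5) if lam is None else lam           # lambda = O(n^{-1/2})
    eye = torch.eye(n, device=K.device, dtype=K.dtype)
    alpha = torch.linalg.solve(K + lam * n * eye, K)    # alpha = (K_n + lambda*n*I_n)^{-1} K_n
    mu = z.mean(dim=1)                                  # (1/V) sum_v f(x_i^v)
    mu_hat = alpha.T @ mu                               # mu_hat_j = sum_i alpha_{i,j} mu_i
    sq_dist = torch.cdist(mu_hat, mu_hat).pow(2)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=z.device)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```

    Note that with K = I_n (no informative prior) the estimated centroids reduce to the plain view averages up to a 1/(1 + λn) scaling, so the sketch falls back to the prior-free Decoupled Uniformity above.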
    72/91


  92. Decoupled Uniformity with Prior
      L̂_unif^de(f) := log [ 1/(n(n−1)) Σ_{i≠j} exp(−||µ̂_x̄_i − µ̂_x̄_j||²) ]

    • The added computational cost is roughly O(n³) (to compute the inverse of the n × n kernel
    matrix), but it remains negligible compared to the back-propagation time.
    • We obtain tight bounds on the classification loss with weaker assumptions than current
    work (extended-graph connection)73,74,75
    73J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    74Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    75N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    73/91


  93. Linear evaluation on ImageNet100
    Model                   ImageNet100
    SimCLR                  68.76
    BYOL                    72.26
    CMC                     73.58
    DCL                     74.6
    AlignUnif               76.3
    DC                      73.98
    BigBiGAN                72.0
    Decoupled Unif          77.18
    K_GAN Decoupled Unif    78.02
    Supervised              82.1±0.59
    Table: Linear evaluation accuracy (%) on ImageNet100 using ResNet50 trained for 400 epochs with batch
    size n = 256. We leverage the BigBiGAN representation76, pre-trained on ImageNet, as prior. We define the
    kernel K_GAN(x̄, x̄′) = K(z(x̄), z(x̄′)) with K an RBF kernel and z(·) BigBiGAN's encoder.
    76J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    74/91


  94. Chest radiography interpretation - CheXpert
    Model                   Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
    SimCLR                  82.42        77.62         90.52          89.08  86.83
    BYOL                    83.04        81.54         90.98          90.18  85.99
    MoCo-CXR∗               75.8         73.7          77.1           86.7   85.0
    GLoRIA                  86.70        86.39         90.41          90.58  91.82
    CCLK                    86.31        83.67         92.45          91.59  91.23
    K_Gl Dec. Unif (ours)   86.92        85.88         93.03          92.39  91.93
    Supervised∗             81.6         79.7          90.5           86.8   89.9
    Table: AUC scores (%) under linear evaluation for discriminating 5 pathologies on CheXpert. The ResNet18
    backbone is trained for 400 epochs (batch size n = 1024). As prior, we use GLoRIA77, a multi-modal approach
    trained with (medical report, image) pairs, and an RBF kernel K_Gl.
    77S.-C. Huang et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021.
    75/91


  95. Bipolar disorder detection
    Model                         BD vs HC
    SimCLR                        60.46±1.23
    BYOL                          58.81±0.91
    MoCo v2                       59.27±1.50
    Model Genesis                 59.94±0.81
    VAE                           52.86±1.24
    K_VAE Decoupled Unif (ours)   62.19±1.58
    Supervised                    67.42±0.31
    Table: Linear evaluation AUC scores (%) using a 5-fold leave-site-out CV scheme with a DenseNet121
    backbone on the brain MRI dataset BIOBD (N ∼ 700). As prior, we use a VAE representation pre-trained on
    BHB to define K_VAE(x̄, x̄′) = K(µ(x̄), µ(x̄′)), where µ(·) is the mean of the Gaussian distribution of x̄ in
    the VAE latent space and K is a standard RBF kernel.
    76/91


  96. Removing optimal augmentations
                      CIFAR-10                                CIFAR-100
    Model             All     w/o Color  w/o Color & Crop     All     w/o Color  w/o Color & Crop
    SimCLR            83.06   65.00      24.47                55.11   37.63      6.62
    BYOL              84.71   81.45      50.17                53.15   49.59      27.9
    Barlow Twins      81.61   53.97      47.52                52.27   28.52      24.17
    VAE∗              41.37   41.37      41.37                14.34   14.34      14.34
    DCGAN∗            66.71   66.71      66.71                26.17   26.17      26.17
    K_GAN Dec. Unif   85.85   82.0       69.19                58.42   54.17      35.98
    Table: When removing optimal augmentations, generative models provide a good kernel to connect
    intra-class points not connected by augmentations. All models are trained for 400 epochs with batch size
    n = 256, except BYOL and SimCLR, which are trained with a bigger batch size n = 1024.
    77/91


  97. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    78/91


  98. Conclusions
    • Thanks to an ϵ-margin geometric approach, we better formalized and understood
    current CL losses
    • We proposed new losses for unsupervised, supervised, weakly-supervised and
    regression settings
    • Using a geometric approach and the conditional mean embedding theory, we also
    tackled two important problems in CL: data biases (FairKL) and prior inclusion
    (Decoupled Uniformity)
    • We applied the proposed geometric CL losses on computer vision and medical
    imaging datasets obtaining SOTA results
    79/91


  99. Non-Contrastive Learning
    • In 1992, Hinton78 proposed a Siamese network that maximizes agreement (i.e., MI)
    between 1D output signals (without negatives).
    • In 2020/2021, we came back to the same idea, but... architectural tricks such as
    the momentum encoder79 or stop-gradient operations are needed to avoid collapsing.
    • It is not clear why and how they work so well... Maybe a geometric approach, as in80,81?
    78S. Becker et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992).
    79J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    80C. Zhang et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022.
    81Q. Garrido et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023.
    80/91


  100. Team
    • The work presented here has been accomplished during the PhDs of:
    Carlo Alberto Barbano and Benoit Dufumier
    • It has been published in:
    1. Barbano et al., “Unbiased Supervised Contrastive Learning”
    2. Barbano et al., “Contrastive learning for regression in multi-site brain age
    prediction”
    3. Dufumier et al., “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI
    Classification”
    4. Dufumier et al., “Integrating Prior Knowledge in Contrastive Learning with Kernel”
    81/91


  101. Team
    82/91


  102. Institutions & Partners
    83/91


  103. References
    Barbano, C. A. et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    Barbano, C. A. et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    Bardes, A. et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In:
    ICLR. 2022.
    Becker, S. et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992).
    Caron, M. et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS.
    2020.
    Chen, T. et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    Chen, T. et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    Chopra, S. et al. “Learning a Similarity Metric Discriminatively, with Application to Face Verification”. In: CVPR.
    2005.
    Deng, J. et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    Doersch, C. et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    Donahue, J. et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    Dufumier, B. et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop.
    2021.
    84/91


  104. References
    Dufumier, B. et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In:
    MICCAI. 2021.
    Dufumier, B. et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In:
    NeuroImage (under review) ().
    Dufumier, B. et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    Dufumier, B. et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”.
    In: NeuroImage (2022).
    Dwibedi, D. et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    Frosst, N. et al. “Analyzing and Improving Representations with the Soft Nearest Neighbor”. In: ICML. 2019.
    Garrido, Q. et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023.
    Graf, F. et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021.
    Grill, J.-B. et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    HaoChen, J. Z. et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive
    Loss”. In: NeurIPS. 2021.
    He, K. et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    Hjelm, R. D. et al. “Learning deep representations by mutual information estimation ...”. In: ICLR. 2019.
    85/91


  105. References
    Hong, Y. et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS.
    2021.
    Huang, S.-C. et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021.
    Khosla, P. et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    Kim, B. et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019.
    Lin, T.-Y. et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014.
    Littlejohns, T. J. et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature
    Communications (2020).
    Matsoukas, C. et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    Mustafa, B. et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    Nam, J. et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020.
    Neyshabur, B. et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    Oord, A. v. d. et al. Representation Learning with Contrastive Predictive Coding. 2018.
    Poole, B. et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019.
    Raghu, M. et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    Salakhutdinov, R. et al. “Learning a Nonlinear Embedding by Preserving Class ...”. In: AISTATS. 2007.
    86/91


  106. References
    Saunshi, N. et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML.
    2022.
    Schroff, F. et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    Sohn, K. “Improved Deep Metric Learning with Multi-class N-pair Loss Objective”. In: NIPS. 2016.
    Song, H. O. et al. “Deep Metric Learning via Lifted Structured Feature Embedding”. In: CVPR. 2016.
    Song, L. et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013).
    Tartaglione, E. et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR.
    2021.
    Tschannen, M. et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020.
    Wang, T. et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on
    the Hypersphere”. In: ICML. 2020.
    Wang, Y. et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR.
    2022.
    Wu, Z. et al. “Unsupervised Feature Learning via Non-parametric Instance Discrimination”. In: CVPR. 2018.
    Yeh, C.-H. et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    Yu, B. et al. “Deep Metric Learning With Tuplet Margin Loss”. In: ICCV. 2019.
    Zhang, C. et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022.
    87/91


  107. References
    Zhou, J. et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022.
    Zhou, Z. et al. “Models Genesis”. In: MedIA (2021).
    88/91


  108. Supplementary - Data
    Datasets              Disease   # Subjects  # Scans  Age      Sex (%F)  # Sites  Accessibility
    OpenBHB
      IXI                 -         559         559      48 ± 16  55        3        Open
      CoRR                -         1366        2873     26 ± 16  50        19       Open
      NPC                 -         65          65       26 ± 4   55        1        Open
      NAR                 -         303         323      22 ± 5   58        1        Open
      RBP                 -         40          40       22 ± 5   52        1        Open
      GSP                 -         1570        1639     21 ± 3   58        5        Open
      ABIDE I             ASD       567         567      17 ± 8   12        20       Open
      ABIDE I             HC        566         566      17 ± 8   17        20       Open
      ABIDE II            ASD       481         481      14 ± 8   15        19       Open
      ABIDE II            HC        542         555      15 ± 9   30        19       Open
      Localizer           -         82          82       25 ± 7   56        2        Open
      MPI-Leipzig         -         316         317      37 ± 19  40        2        Open
      HCP                 -         1113        1113     29 ± 4   45        1        Restricted
      OASIS 3             Only HC   578         1166     68 ± 9   62        4        Restricted
      ICBM                -         606         939      30 ± 12  45        3        Restricted
    BIOBD                 BD        306         306      40 ± 12  55        8        Private
    BIOBD                 HC        356         356      40 ± 13  55        8        Private
    SCHIZCONNECT-VIP      SCZ       275         275      34 ± 12  28        4        Open
    SCHIZCONNECT-VIP      HC        329         329      32 ± 13  47        4        Open
    PRAGUE                HC        90          90       26 ± 7   55        1        Private
    BSNIP                 HC        198         198      32 ± 12  58        5        Private
    BSNIP                 SCZ       190         190      34 ± 12  30        5        Private
    BSNIP                 BD        116         116      37 ± 12  66        5        Private
    CANDI                 HC        25          25       10 ± 3   41        1        Open
    CANDI                 SCZ       20          20       13 ± 3   45        1        Open
    CNP                   HC        123         123      31 ± 9   47        1        Open
    CNP                   SCZ       50          50       36 ± 9   24        1        Open
    CNP                   BD        49          49       35 ± 9   43        1        Open
    Total                           10882       13412    32 ± 19  50        101
    89/91


  109. Supplementary - Data
    Task         Split           Datasets                        # Subjects  # Scans  Age      Sex (%F)
    SCZ vs. HC   Training        SCHIZCONNECT-VIP, CNP,          933         933      33 ± 12  43
                                 PRAGUE, BSNIP, CANDI
                 Validation                                      116         116      32 ± 11  37
                 External Test                                   133         133      32 ± 12  45
                 Internal Test                                   118         118      33 ± 13  34
    BD vs. HC    Training        BIOBD, BSNIP,                   832         832      38 ± 13  56
                                 CNP, CANDI
                 Validation                                      103         103      37 ± 12  51
                 External Test                                   131         131      37 ± 12  52
                 Internal Test                                   107         107      37 ± 13  56
    ASD vs. HC   Training        ABIDE 1+2                       1488        1526     16 ± 8   17
                 Validation                                      188         188      17 ± 10  17
                 External Test                                   207         207      12 ± 3   30
                 Internal Test                                   184         186      17 ± 9   18
    Table: Training/Validation/Test splits used for the detection of the 3 mental disorders. Out-of-site images
    always make up the external test set, and each participant falls into only one split, avoiding data leakage.
    The internal test set is always stratified according to age, sex, site and diagnosis with respect to the
    training and validation sets. All models use the same splits.
    90/91


  110. Supplementary - Results
    Figure: UMAP Visualization on ADNI
    91/91
