Slide 1

Slide 1 text

Contrastive Learning in Medical Imaging: An ϵ-margin metric learning approach
Pietro Gori, Assistant Professor, Télécom Paris (IP Paris), Paris, France

Slide 2

Slide 2 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 2/91

Slide 3

Slide 3 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 3/91

Slide 4

Slide 4 text

Introduction - Computer Vision • Deep learning (e.g., CNNs or ViTs) is a lazy and inefficient statistical method that needs millions if not billions of examples to learn a precise task → data hungry 4/91

Slide 5

Slide 5 text

Introduction - Computer Vision • Many specific tasks in Computer Vision, such as object detection1 (e.g., YOLO), image classification2 (e.g., ResNet-50), or semantic segmentation (e.g., U-Net), have reached astonishing results in recent years. • This has been possible mainly because large (N > 10^6), labeled data-sets were easily accessible and freely available 1T.-Y. Lin et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014. 2J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009. 5/91

Slide 6

Slide 6 text

Introduction - Medical Imaging • In medical imaging, current research datasets are: ▶ small: N < 2k for common pathologies and N < 200 for rare pathologies ▶ biased: images are acquired at a specific hospital, following a specific protocol, with a particular machine (nuisance site effect) ▶ multi-modal: many imaging modalities can be available, as well as text, clinical, biological, and genetic data. ▶ anonymized, quality checked, accessible, quite homogeneous • Clinical datasets are harder to analyze since they are usually not anonymized, not quality checked, not freely accessible, and highly heterogeneous. • In this talk, we will focus on research medical imaging datasets 6/91

Slide 7

Slide 7 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 7/91

Slide 8

Slide 8 text

Introduction - Transfer Learning • When dealing with small labelled datasets, a common strategy is Transfer Learning: 1. pre-training a model on a large dataset and then 2. fine-tuning it on the small, labelled target dataset 8/91

Slide 9

Slide 9 text

Introduction - Transfer Learning • When dealing with small labelled datasets, a common strategy is Transfer Learning: 1. pre-training a model on a large dataset and then 2. fine-tuning it on the small, labelled target dataset • Supervised pre-training from ImageNet is common. 8/91

Slide 10

Slide 10 text

Introduction - Transfer Learning • When dealing with small labelled datasets, a common strategy is Transfer Learning: 1. pre-training a model on a large dataset and then 2. fine-tuning it on the small, labelled target dataset • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature reuse) increases with3456: ▶ reduced target data size (small Ntarget ) ▶ visual similarity between pre-train and target domains (small FID) ▶ models with fewer inductive biases (TL works better for ViTs than CNNs) ▶ larger architectures (more parameters) 3B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021. 4C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022. 5B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020. 6M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019. 9/91

Slide 11

Slide 11 text

Introduction - Transfer Learning • When dealing with small labelled datasets, a common strategy is Transfer Learning: 1. pre-training a model on a large dataset and then 2. fine-tuning it on the small, labelled target dataset • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature reuse) increases with78910: ▶ reduced target data size (small Ntarget ) ▶ visual similarity between pre-train and target domains (small FID) ▶ models with fewer inductive biases (TL works better for ViTs than CNNs) ▶ larger architectures (more parameters) 7B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021. 8C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022. 9B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020. 10M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019. 10/91
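
As a concrete illustration of the two-step recipe above, here is a minimal PyTorch sketch (not the code used in the talk): a torchvision ResNet-50 pre-trained on ImageNet is loaded, its head is replaced, and the network is fine-tuned on a hypothetical small labelled target dataset; the number of classes and the hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Step 1: pre-trained model (here, supervised ImageNet pre-training).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Step 2: adapt the head to the small labelled target dataset and fine-tune.
num_target_classes = 2                      # e.g., patients vs. controls (placeholder)
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Optional linear probing: freeze the backbone and train only the new head.
# for name, p in model.named_parameters():
#     if not name.startswith("fc."):
#         p.requires_grad = False

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One fine-tuning step on a batch of the target dataset."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```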

Slide 12

Slide 12 text

Introduction - Transfer Learning • Natural11 and Medical12 images can be visually very different! → Domain gap • Furthermore, medical images can be 3D. ImageNet is 2D. • Need for a large, annotated, 3D medical dataset 11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009. 12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022. 11/91

Slide 13

Slide 13 text

Introduction - Transfer Learning • Natural11 and Medical12 images can be visually very different! → Domain gap • Furthermore, medical images can be 3D. ImageNet is 2D. • Need for a large, annotated, 3D medical dataset → PROBLEM! 11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009. 12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022. 11/91

Slide 14

Slide 14 text

Introduction - Transfer Learning • Supervised pre-training is not a valid option in medical imaging. Need for another kind of pre-training. 13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications (2020). 14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In: NeuroImage (2022). 12/91

Slide 15

Slide 15 text

Introduction - Transfer Learning • Supervised pre-training is not a valid option in medical imaging. Need for another kind of pre-training. • Recently, big multi-site international data-sets of healthy subjects have emerged, such as UK Biobank13 (N > 100k) and OpenBHB14 (N > 10k) 13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications (2020). 14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In: NeuroImage (2022). 12/91

Slide 16

Slide 16 text

Introduction - Transfer Learning • How can we employ a healthy (thus unlabeled) data-set for pre-training? 13/91

Slide 17

Slide 17 text

Introduction - Transfer Learning • How can we employ a healthy (thus unlabeled) data-set for pre-training? → Self-Supervised pre-training! 13/91

Slide 18

Slide 18 text

Introduction - Transfer Learning • Self-supervised pre-training: leverage an annotation-free pretext task to provide a surrogate supervision signal for feature learning. 15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015. 16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022. 17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019. 18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. 20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022. 14/91

Slide 19

Slide 19 text

Introduction - Transfer Learning • Self-supervised pre-training: leverage an annotation-free pretext task to provide a surrogate supervision signal for feature learning. • Pretext task should only use the visual information and context of the images 15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015. 16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022. 17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019. 18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. 20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022. 14/91

Slide 20

Slide 20 text

Introduction - Transfer Learning • Self-supervised pre-training: leverage an annotation-free pretext task to provide a surrogate supervision signal for feature learning. • Pretext task should only use the visual information and context of the images • Examples of pretext tasks: ▶ Context prediction15 ▶ Generative models’16’17 ▶ Instance discrimination (Contrastive Learning)18 ▶ Teacher/Student19 ▶ Information Maximization20 15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015. 16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022. 17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019. 18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. 20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022. 14/91

Slide 21

Slide 21 text

Self-supervised Learning - Preliminaries • Pre-text tasks should produce image representations that are: 1. Transferable: we can easily reuse/fine-tune them in different downstream tasks (e.g., segmentation, object detection, classification, etc.) 2. Generalizable: they should not be specific to a single task but work well in several different downstream tasks 3. High-level: representations should characterize the high-level semantics/structure and not low-level features (color, texture, etc.) 4. Invariant: image representations should be invariant to geometric or appearance transformations that do not modify the information content of the image (i.e., irrelevant for downstream task) 5. Semantically coherent: semantically similar images should be close in the representation space 15/91

Slide 22

Slide 22 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 16/91

Slide 23

Slide 23 text

Contrastive Learning • Contrastive learning methods outperform the other pretext tasks21 21T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 17/91

Slide 24

Slide 24 text

Contrastive Learning • And recently there has been a plethora of work on it that is closing the performance gap with supervised pretraining22’23’24 22J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. 23M. Caron et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS. 2020. 24J. Zhou et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022. 18/91

Slide 25

Slide 25 text

Contrastive Learning - A bit of history • Goal: given a set of images xk ∈ X, learn a mapping function fθ : X → F such that: if xa and xb are semantically similar → f(xa) ≈ f(xb); if xa and xb are semantically different → f(xa) ≠ f(xb) • These conditions can be reformulated from a mathematical point of view using either a geometric approach, based on a distance d(f(xa), f(xb)), or an information theoretic approach, based on a statistical dependence measure, such as Mutual Information I(f(xa), f(xb)): if xa and xb are semantically similar → arg min_f d(f(xa), f(xb)) or arg max_f I(f(xa), f(xb)); if xa and xb are semantically different → arg max_f d(f(xa), f(xb)) or arg min_f I(f(xa), f(xb)) 19/91

Slide 26

Slide 26 text

Contrastive Learning - A bit of history Geometric approach (Y. LeCun) ▶ Pairwise lossa ▶ Triplet lossb ▶ Tuplet lossc’d’e aS. Chopra et al. “Learning a Similarity Metric Discriminatively, with Application to Face Verification”. In: CVPR. 2005. bF. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015. cH. O. Song et al. “Deep Metric Learning via Lifted Structured Feature Embedding”. In: CVPR. 2016. dK. Sohn. “Improved Deep Metric Learning with Multi-class N-pair Loss Objective”. In: NIPS. 2016. eB. Yu et al. “Deep Metric Learning With Tuplet Margin Loss”. In: ICCV. 2019. Information theory approach (G. Hinton) ▶ Soft Nearest Neighbora’b ▶ Contrastive Predictive Coding (CPC)c ▶ Non-Parametric Instance Discriminationd ▶ Deep InfoMax (DIM)e aR. Salakhutdinov et al. “Learning a Nonlinear Embedding by Preserving Class ...”. In: AISTATS. 2007. bN. Frosst et al. “Analyzing and Improving Representations with the Soft Nearest Neighbor”. In: ICML. 2019. cA. v. d. Oord et al. Representation Learning with Contrastive Predictive Coding. 2018. dZ. Wu et al. “Unsupervised Feature Learning via Non-parametric Instance Discrimination”. In: CVPR. 2018. eR. D. Hjelm et al. “Learning deep representations by mutual information estimation ...”. In: ICLR. 2019. 20/91

Slide 27

Slide 27 text

Contrastive Learning - A bit of history Geometric approach (Y. LeCun)a ▶ Need to define positive (x, x+) (semantically similar) and negative pairs (x, x−) (semantically different) ▶ Need to define similarity measure (or distance) that is maximized (or minimized) ▶ No constraints/hypotheses about negative samples aS. Chopra et al. “Learning a Similarity Metric Discriminatively, with Application to Face Verification”. In: CVPR. 2005. Information theory approach (G. Hinton)a ▶ Need to define pdf of positive (x, x+) ∼ p(x, x+) and negative pairs (x, x−) ∼ p(x)p(x−) where x− ⊥ ⊥ x, x+ ▶ Maximize Mutual Information (I) between positive pairs, given independent negative pairs: I(x; x+) = I(x; x+, x−) = Ex−∼p(x−) I(x; x+)b ▶ Need to define an estimator of I aS. Becker et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992). bB. Poole et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019. 21/91

Slide 28

Slide 28 text

Contrastive Learning - A bit of history • The information theoretic approach is mathematically sound and well grounded in the role of Mutual Information (I) estimation in representation learning. • But ... a large I is not necessarily predictive of downstream performance. Good results may depend on architecture choices and inductive biases rather than on an accurate estimation of I25 • Furthermore, a geometric approach: ▶ is easy to understand and explain ▶ can easily formalize abstract ideas for defining new losses or regularization terms (e.g., data biases) ▶ has no need for implausible hypotheses (e.g., independence of the negative samples). 25M. Tschannen et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020. 22/91

Slide 29

Slide 29 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 23/91

Slide 30

Slide 30 text

Contrastive Learning - Geometric approach ▶ Let x ∈ X be a sample (anchor) ▶ Let x+ i be a similar (positive) sample ▶ Let x− j be a different (negative) sample ▶ Let P be the number of positive samples ▶ Let N be the number of negative samples ▶ Let f : X → Sd−1 be the mapping ▶ Let F = Sd−1, a (d-1)-sphere Figure: From Schroff et al.a aF. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015. 24/91

Slide 31

Slide 31 text

Contrastive Learning - Geometric approach ▶ Let x ∈ X be a sample (anchor) ▶ Let x+ i be a similar (positive) sample ▶ Let x− j be a different (negative) sample ▶ Let P be the number of positive samples ▶ Let N be the number of negative samples ▶ Let f : X → Sd−1 be the mapping ▶ Let F = Sd−1, a (d-1)-sphere Figure: From Schroff et al.a aF. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015. How can we define positive and negative samples ? 24/91

Slide 32

Slide 32 text

Contrastive Learning - Semantic definition • Positive samples x+ i can be defined in different ways: ▶ Unsupervised setting (no label): x+ i is a transformation of the anchor x26 or a nearest-neighbor from a support set27. 26T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 27D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021. 25/91

Slide 33

Slide 33 text

Unsupervised setting 26/91

Slide 34

Slide 34 text

Contrastive Learning - Semantic definition • Positive samples x+ i can be defined in different ways: ▶ Unsupervised setting (no label): x+ i is a transformation of the anchor x28 or a nearest-neighbor from a support set29. ▶ Supervised classification setting (label): x+ i is a sample belonging to the same class as x.30 28T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 29D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021. 30P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 27/91

Slide 35

Slide 35 text

Supervised setting Figure: Image taken from31 31P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 28/91

Slide 36

Slide 36 text

Contrastive Learning - Semantic definition • Positive samples x+ i can be defined in different ways: ▶ Unsupervised setting (no label): x+ i is a transformation of the anchor x32 or a nearest-neighbor from a support set33. ▶ Supervised classification setting (label): x+ i is a sample belonging to the same class as x.34 ▶ In regression35 or weakly-supervised classification36: x+ i is a sample with a similar continuous/weak label of x. • The definition of negative samples x− j varies accordingly. 32T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 33D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021. 34P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 35C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023. 36B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021. 29/91
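
A minimal sketch of how positives can be built in the first two settings (the weak-label case is sketched later, in the weakly supervised section); `augment` stands for a hypothetical stochastic transformation pipeline, and the function names are illustrative.

```python
import torch

def unsupervised_positive(x, augment):
    """Unsupervised setting (SimCLR-style): the positive of an anchor is simply
    another random augmentation of the same image."""
    return augment(x), augment(x)           # two views of the same sample

def supervised_positive_mask(labels):
    """Supervised classification setting (SupCon-style): the positives of an anchor
    are the other samples of the batch sharing its class label."""
    labels = labels.view(-1, 1)
    mask = (labels == labels.T).float()     # mask[i, j] = 1 if x_j is a positive of x_i
    mask.fill_diagonal_(0)                  # a sample is not its own positive
    return mask
```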

Slide 37

Slide 37 text

Contrastive Learning - Semantic definition • Positive samples x+ i can be defined in different ways: ▶ Unsupervised setting (no label): x+ i is a transformation of the anchor x32 or a nearest-neighbor from a support set33. ▶ Supervised classification setting (label): x+ i is a sample belonging to the same class as x.34 ▶ In regression35 or weakly-supervised classification36: x+ i is a sample with a similar continuous/weak label of x. • The definition of negative samples x− j varies accordingly. How can we contrast positive and negative samples from a mathematical point of view? 32T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 33D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021. 34P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 35C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023. 36B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021. 29/91

Slide 38

Slide 38 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 30/91

Slide 39

Slide 39 text

Contrastive Learning - ϵ-margin metric • We propose to use an ϵ-margin metric learning point of view37. • If we have a single positive x+ and several negatives x−_j (e.g., tuplet loss), we look for f such that: d+ − d−_j < −ϵ ⇐⇒ s−_j − s+ ≤ −ϵ ∀j, where d+ = d(f(x), f(x+)), d−_j = d(f(x), f(x−_j)), s+ = s(f(x), f(x+)) and s−_j = s(f(x), f(x−_j)) • where ϵ ≥ 0 is a margin between positive and negative, and s(f(a), f(b)) = ⟨f(a), f(b)⟩ is the similarity (dot product). 37C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 31/91

Slide 40

Slide 40 text

Contrastive Learning - ϵ-margin metric • We propose to use an ϵ-margin metric learning point of view37. • If we have a single positive x+ and several negatives x−_j, we look for f such that: d+ − d−_j < −ϵ ⇐⇒ s−_j − s+ ≤ −ϵ ∀j, where d+ = d(f(x), f(x+)), d−_j = d(f(x), f(x−_j)), s+ = s(f(x), f(x+)) and s−_j = s(f(x), f(x−_j)) • where ϵ ≥ 0 is a margin between positive and negative, and s(f(a), f(b)) = ⟨f(a), f(b)⟩ is the similarity (dot product). • Two possible ways to transform this equation into an optimization problem are: arg min_f max(0, {s−_j − s+ + ϵ}_{j=1,..,N}) and arg min_f Σ_{j=1}^N max(0, s−_j − s+ + ϵ) • when these losses are equal to 0, the condition is fulfilled. The first (max) is a lower bound of the second (sum). 37C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 31/91

Slide 41

Slide 41 text

Contrastive Learning - ϵ-margin metric LogSumExp operator LSE The LogSumExp operator LSE is a smooth approximation of the max function. It is defined as: max(x_1, x_2, ..., x_N) ≤ LSE(x_1, x_2, ..., x_N) = log(Σ_{i=1}^N exp(x_i)) arg min_f max(0, {s−_j − s+ + ϵ}_{j=1,...,N}) ≈ arg min_f − log [ exp(s+) / (exp(s+ − ϵ) + Σ_j exp(s−_j)) ] (ϵ−InfoNCE)38 38C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 32/91
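
A possible PyTorch sketch of the ϵ-InfoNCE objective above, for one positive per anchor; `s_pos` and `s_neg` are assumed to be pre-computed (temperature-scaled) similarities, and LogSumExp is used for numerical stability.

```python
import torch

def epsilon_infonce(s_pos, s_neg, eps=0.1):
    """epsilon-InfoNCE: -log[ exp(s+) / (exp(s+ - eps) + sum_j exp(s-_j)) ].

    s_pos: [B]    similarity between each anchor and its positive
    s_neg: [B, N] similarities between each anchor and its N negatives
    eps:   margin (eps = 0 recovers InfoNCE)
    """
    # log denominator = logsumexp over {s+ - eps, s-_1, ..., s-_N}
    log_denom = torch.logsumexp(
        torch.cat([(s_pos - eps).unsqueeze(1), s_neg], dim=1), dim=1
    )
    return (log_denom - s_pos).mean()
```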

Slide 42

Slide 42 text

Contrastive Learning - ϵ-margin metric • When ϵ = 0, we retrieve the InfoNCE loss39, whereas when ϵ → ∞ we obtain the InfoL1O (or Decoupled) loss40. • These two losses are a lower and an upper bound of I(X+, X) respectively41: E_{(x,x+)∼p(x,x+), x−_j∼p(x−)} [ log ( exp s+ / (exp s+ + Σ_j exp s−_j) ) ] (InfoNCE) ≤ I(X+, X) ≤ E_{(x,x+)∼p(x,x+), x−_j∼p(x−)} [ log ( exp s+ / Σ_j exp s−_j ) ] (InfoL1O) (1) • Changing ϵ ∈ [0, ∞) can lead to a tighter approximation of I(X+, X): the factor exp(−ϵ) in the denominator monotonically decreases as ϵ increases. 39T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 40C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022. 41B. Poole et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019. 33/91

Slide 43

Slide 43 text

Contrastive Learning - ϵ-margin metric • The inclusion of multiple positive samples (s+_i) can lead to different formulations (see42). Here, we use the simplest one: s−_j − s+_i ≤ −ϵ ∀i, j Σ_i max(−ϵ, {s−_j − s+_i}_{j=1,...,N}) ≈ − Σ_i log [ exp(s+_i) / (exp(s+_i − ϵ) + Σ_j exp(s−_j)) ] (ϵ−SupInfoNCE) (2) • Another formulation is the SupCon loss43, which has been presented as the “most straightforward way to generalize” the InfoNCE loss to multiple positives. However... 42C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 43P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 34/91
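
The same sketch extended to P positives per anchor, following Eq. (2): one ϵ-InfoNCE term per positive, sharing the same negatives (again a hedged illustration, not the authors' released implementation).

```python
import torch

def epsilon_sup_infonce(s_pos, s_neg, eps=0.1):
    """epsilon-SupInfoNCE: sum_i -log[ exp(s+_i) / (exp(s+_i - eps) + sum_j exp(s-_j)) ].

    s_pos: [B, P] similarities between each anchor and its P positives
    s_neg: [B, N] similarities between each anchor and its N negatives
    """
    B, P = s_pos.shape
    neg = s_neg.unsqueeze(1).expand(-1, P, -1)               # [B, P, N] shared negatives
    log_denom = torch.logsumexp(
        torch.cat([(s_pos - eps).unsqueeze(2), neg], dim=2), dim=2
    )                                                        # [B, P]
    return (log_denom - s_pos).sum(dim=1).mean()
```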

Slide 44

Slide 44 text

Contrastive Learning - ϵ-margin metric • ... it actually contains a non-contrastive constraint44 on the positive samples: s+_t − s+_i ≤ 0 ∀i, t. s−_j − s+_i ≤ −ϵ ∀i, j and s+_t − s+_i ≤ 0 ∀i, t ≠ i (1/P) Σ_i max(0, {s−_j − s+_i + ϵ}_j, {s+_t − s+_i}_{t≠i}) ≈ ϵ − (1/P) Σ_i log [ exp(s+_i) / (Σ_t exp(s+_t − ϵ) + Σ_j exp(s−_j)) ] (ϵ−SupCon) • when ϵ = 0 we retrieve exactly Lsup_out. • It tries to align all positive samples to a single point in the representation space, thus losing intra-class variability. 44C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 35/91

Slide 45

Slide 45 text

Supervised Contrastive Learning - Results Table: Accuracy on vision datasets. SimCLR and Max-Margin results from45. Results denoted with * are (re)implemented with mixed precision due to memory constraints. Dataset Network SimCLR Max-Margin SimCLR* CE* SupCon* ϵ-SupInfoNCE* CIFAR-10 ResNet-50 93.6 92.4 91.74±0.05 94.73±0.18 95.64±0.02 96.14±0.01 CIFAR-100 ResNet-50 70.7 70.5 68.94±0.12 73.43±0.08 75.41±0.19 76.04±0.01 ImageNet-100 ResNet-50 - - 66.14±0.08 82.1±0.59 81.99±0.08 83.3±0.06 Table: Comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100 in terms of top-1 accuracy (%). Loss ϵ = 0.1 ϵ = 0.25 ϵ = 0.5 ϵ-SupInfoNCE 83.25±0.39 83.02±0.41 83.3±0.06 ϵ-SupCon 82.83±0.11 82.54±0.09 82.77±0.14 45P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. 36/91

Slide 46

Slide 46 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 37/91

Slide 47

Slide 47 text

Contrastive Learning - Weakly supervised • The previous framework works well when samples are either positive or negative (unsupervised and supervised setting). But what about continuous/weak labels ? • Not possible to determine a hard boundary between positive and negative samples → all samples are positive and negative at the same time 38/91

Slide 48

Slide 48 text

Contrastive Learning - Weakly supervised • The previous framework works well when samples are either positive or negative (unsupervised and supervised settings). But what about continuous/weak labels? • Not possible to determine a hard boundary between positive and negative samples → all samples are positive and negative at the same time • Let y be the continuous/weak label of the anchor x and yk that of a sample xk. • Simple solution: threshold the distance d between y and yk at τ to create positive and negative samples: xk is a positive x+ if d(y, yk) < τ → Problem: how to choose τ? 38/91

Slide 49

Slide 49 text

Contrastive Learning - Weakly supervised • The previous framework works well when samples are either positive or negative (unsupervised and supervised settings). But what about continuous/weak labels? • Not possible to determine a hard boundary between positive and negative samples → all samples are positive and negative at the same time • Let y be the continuous/weak label of the anchor x and yk that of a sample xk. • Simple solution: threshold the distance d between y and yk at τ to create positive and negative samples: xk is a positive x+ if d(y, yk) < τ → Problem: how to choose τ? • Our solution: define a degree of “positiveness” between samples using a kernel function wk = Kσ(y − yk ), where 0 ≤ wk ≤ 1. • New goal: learn f that maps samples with a high degree of positiveness (wk ∼ 1) close in the latent space and samples with a low degree (wk ∼ 0) far away from each other. 38/91
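
A small sketch of the degree of positiveness with a Gaussian (RBF) kernel on the continuous meta-data (e.g., age); the bandwidth σ is a hyper-parameter (its robustness is discussed a few slides below).

```python
import torch

def degree_of_positiveness(y, sigma=2.0):
    """w[k, t] = K_sigma(y_k - y_t): pairwise degree of positiveness in (0, 1].

    y:     [B] continuous/weak labels of the batch (e.g., ages)
    sigma: kernel bandwidth; the larger sigma, the more samples count as positives
    """
    diff = y.view(-1, 1) - y.view(1, -1)                 # [B, B] pairwise label differences
    return torch.exp(-diff ** 2 / (2 * sigma ** 2))      # w = 1 when y_k == y_t
```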

Slide 50

Slide 50 text

Contrastive Learning - Weakly supervised Question: Which pair of subjects is closer in your opinion (brain MRI, axial plane)? Subject A Subject B Subject C 39/91

Slide 51

Slide 51 text

Contrastive Learning - Weakly supervised Question: Which pair of subjects is closer in your opinion (brain MRI, axial plane)? Subject A Age=15 Subject B Age=64 Subject C Age=20 39/91

Slide 52

Slide 52 text

Contrastive Learning - Weakly supervised Meta-data 𝑦 ∈ ℝ Latent Space 𝒵 Latent Space 𝒵 𝑤𝜎 𝑦1 , 𝑦2 𝑤𝜎 𝑦2 , 𝑦3 𝑡1 ∼ 𝒯 𝑡1 ′ ∼ 𝒯 𝑦1 𝑦2 𝑦3 SimCLR 𝒚-Aware Contrastive Learning 𝑥1 𝑥2 𝑥3 𝑡2 ∼ 𝒯 𝑡2 ′ ∼ 𝒯 𝑡3 ′ ∼ 𝒯 𝑡3 ∼ 𝒯 40/91

Slide 53

Slide 53 text

Contrastive Learning - Weakly supervised • In46’47, we propose a new contrastive condition for weakly supervised problems: (wk / Σ_j wj) (s_t − s_k) ≤ 0 ∀ k, t ≠ k ∈ A • where A contains the indices of the samples ≠ x; we consider as positives only the samples with wk > 0, and align them with a strength proportional to wk. • As before, we can transform it into an optimization problem, obtaining the y-aware loss: arg min_f Σ_k max(0, (wk / Σ_t wt) {s_t − s_k}_{t=1,...,N, t≠k}) ≈ L_y−aware = − Σ_k (wk / Σ_t wt) log [ exp(s_k) / Σ_{t=1}^N exp(s_t) ] 46B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021. 47B. Dufumier et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop. 2021. 41/91
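
A sketch of the y-aware loss as written above, reusing the pairwise weights from the previous snippet; `sim` holds the (temperature-scaled) similarities between all samples of the batch, anchors on the rows, and the anchor itself is excluded from both sums. This is an illustration, not the released code.

```python
import torch

def y_aware_loss(sim, w):
    """L_y-aware = - sum_k (w_k / sum_t w_t) * log[ exp(s_k) / sum_t exp(s_t) ], per anchor.

    sim: [B, B] pairwise similarities s, anchors on the rows
    w:   [B, B] degrees of positiveness w_k = K_sigma(y - y_k) (previous snippet)
    """
    B = sim.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float('-inf'))               # drop the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)    # log softmax over t
    log_prob = log_prob.masked_fill(self_mask, 0.0)               # avoid 0 * (-inf) below
    w = w.masked_fill(self_mask, 0.0)
    w = w / w.sum(dim=1, keepdim=True)                            # w_k / sum_t w_t
    return -(w * log_prob).sum(dim=1).mean()
```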

Slide 54

Slide 54 text

Contrastive Learning - Weakly supervised 42/91

Slide 55

Slide 55 text

Results - Linear evaluation (a) 5-fold CV Stratified on Site. (b) 5-fold CV Leave-Site-Out 43/91

Slide 56

Slide 56 text

Results - Robustness to σ and transformations ▶ Linear classification performance remains stable for a range σ ∈ [1, 5] ▶ Adding more transformations improves the representation (in line with SimCLR) ▶ Cutout remains competitive while being computationally cheaper 44/91

Slide 57

Slide 57 text

Results - Fine-tuning Task Test Set Pre-training Strategies Weakly Self-Supervised Self-Supervised Generative Discriminative Baseline Age-Aware Contrastive48 Model Genesis49 Contrastive Learning50 VAE Age Sup. SCZ vs. HC ↑ Ntrain = 933 Internal Test 85.27±1.60 85.17±0.37 76.31±1.77 82.31±2.03 82.56±0.68 83.05±1.36 External Test 75.52±0.12 77.00±0.55 67.40±1.59 75.48±2.54 75.11±1.65 74.36±2.28 BD vs. HC ↑ Ntrain = 832 Internal Test 76.49±2.16 78.81±2.48 76.25±1.48 72.71±2.06 71.61±0.81 77.21±1.00 External Test 68.57±4.72 77.06±1.90 65.66±0.90 71.23±3.05 71.70±0.23 73.02±2.66 ASD vs. HC ↑ Ntrain = 1526 Internal Test 65.74±1.47 66.36±1.14 63.58±4.35 61.92±1.67 59.67±2.04 67.11±1.76 External Test 62.93±2.40 68.76±1.70 54.95±3.58 61.93±1.93 57.45±0.81 62.07±2.98 Table: Fine-tuning results.51 All pre-trained models use a data-set of 8754 3D MRI of healthy brains. We reported average AUC(%) for all models and the standard deviation by repeating each experiment three times. Baseline is a DenseNet121 backbone. 48B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021. 49Z. Zhou et al. “Models Genesis”. In: MedIA (2021). 50T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 51B. Dufumier et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In: NeuroImage (under review) (). 45/91

Slide 58

Slide 58 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 46/91

Slide 59

Slide 59 text

Contrastive Learning - Regression • We could use L_y−aware also in regression. But... L_y−aware = − Σ_k (wk / Σ_t wt) log [ exp(s_k) / Σ_{t=1}^N exp(s_t) ] • ... the numerator aligns x_k, and the denominator focuses more on the closest samples in the representation space. 47/91

Slide 60

Slide 60 text

Contrastive Learning - Regression • We could use L_y−aware also in regression. But... L_y−aware = − Σ_k (wk / Σ_t wt) log [ exp(s_k) / Σ_{t=1}^N exp(s_t) ] • ... the numerator aligns x_k, and the denominator focuses more on the closest samples in the representation space. → Problem! These samples might have a greater degree of positiveness with the anchor than the considered x_k 47/91

Slide 61

Slide 61 text

Contrastive Learning - Regression • We thus propose two new losses: wk (s_t − s_k) ≤ 0 if wt − wk ≤ 0 ∀ k, t ≠ k ∈ A(i) L_thr = − (1 / Σ_t wt) Σ_k wk log [ exp(s_k) / Σ_{t≠k} δ_{wt≤wk} exp(s_t) ] → only the samples t with wt ≤ wk are repelled from the anchor.

Slide 62

Slide 62 text

Contrastive Learning - Regression wk [s_t(1 − wt) − s_k] ≤ 0 ∀ k, t ≠ k ∈ A(i) L_exp = − (1 / Σ_t wt) Σ_k wk log [ exp(s_k) / Σ_{t≠k} exp(s_t(1 − wt)) ] • L_exp has a repulsion strength inversely proportional to the similarity between the y values, whatever their distance in the representation space. • The repulsion strength only depends on the distance in the kernel space → samples close in the kernel space will be close in the representation space. 49/91
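
A sketch of L_exp under the same conventions as the previous snippets (L_thr is analogous, with the denominator restricted to the samples t with wt ≤ wk); the exact implementation details of the paper may differ.

```python
import torch

def exp_loss(sim, w):
    """L_exp = -(1 / sum_t w_t) sum_k w_k log[ exp(s_k) / sum_{t != k} exp(s_t (1 - w_t)) ].

    sim: [B, B] pairwise similarities s, anchors on the rows
    w:   [B, B] degrees of positiveness w = K_sigma(y_anchor - y_t)
    """
    B = sim.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=sim.device)

    # exp(s_t * (1 - w_t)) for every candidate t (the anchor itself is zeroed out)
    e = torch.exp(sim * (1.0 - w)).masked_fill(self_mask, 0.0)    # [B, B]
    # denominator for each positive k: sum over t != k (and t != anchor)
    denom = e.sum(dim=1, keepdim=True) - e                        # [B, B]

    log_ratio = sim - torch.log(denom)
    wk = w.masked_fill(self_mask, 0.0)
    return (-(wk * log_ratio).sum(dim=1) / wk.sum(dim=1)).mean()
```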

Slide 63

Slide 63 text

Results - OpenBHB Challenge • OpenBHB Challenge: age prediction with site-effect removal → Brain age ̸= chronological age in neurodegenerative disorders ! • Ntrain : 5330 3D brain MRI scans (different subjects) from 71 acquisition sites. • Two private test data-sets (internal and external) • To participate https://ramp.studio/problems/brain_age_with_site_removal 50/91

Slide 64

Slide 64 text

Results - Regression Method Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓ Ly−aware 2.66±0.00 6.60±0.17 4.10±0.01 1.82 Lthr 2.95±0.01 5.73±0.15 4.10±0.01 1.74 Lexp 2.55±0.00 5.1±0.1 3.76±0.01 1.54 Table: Comparison of contrastive losses. Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓ Baseline (ℓ1 ) DenseNet 2.55±0.01 8.0±0.9 7.13±0.05 3.34 ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86 AlexNet 2.72±0.01 8.3±0.2 4.66±0.05 2.21 ComBat DenseNet 5.92±0.01 2.23±0.06 10.48±0.17 3.38 ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88 AlexNet 3.37±0.01 6.8±0.3 5.23±0.12 2.33 Lexp DenseNet 2.85±0.00 5.34±0.06 4.43±0.00 1.84 ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54 AlexNet 2.77±0.01 5.8±0.1 4.01±0.01 1.71 Table: Final scores on the OpenBHB Challenge leaderboard using a 3D ResNet-18. MAE: Mean Absolute Error. BAcc: Balanced Accuracy for site prediction. Challenge score: Lc = BAcc0.3 · MAEext . 51/91
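
As a sanity check on the reported numbers, the challenge score of the L_exp / ResNet-18 entry can be recomputed from the table (with BAcc entered as a fraction):

```python
# Challenge score: L_c = BAcc^0.3 * MAE_ext, with BAcc expressed as a fraction.
bacc, mae_ext = 0.051, 3.76     # L_exp with ResNet-18: 5.1% site BAcc, 3.76 external MAE
l_c = bacc ** 0.3 * mae_ext
print(round(l_c, 2))            # -> 1.54, matching the leaderboard entry
```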

Slide 65

Slide 65 text

The Issue of Biases • Contrastive learning is more robust than traditional end-to-end approaches, such as cross-entropy, against noise in the data or in the labels52. • What about data bias, such as the site effect? Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓ Baseline (ℓ1 ) ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86 ComBat ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88 Lexp ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54 • Lexp shows slight overfitting on the internal sites and a limited debiasing capability towards the site effect → BAcc should be equal to random chance: 1/n_sites = 1/64 ≈ 1.56% • Need to include debiasing regularization terms, such as FairKL53. 52F. Graf et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021. 53C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. 52/91

Slide 66

Slide 66 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 53/91

Slide 67

Slide 67 text

The Issue of Biases ▶ Contrastive learning can generally guarantee good downstream performance. However, it does not take into account the presence of data biases. ▶ Data biases: visual features that are correlated with the target downstream task (yellow colour) but don’t actually characterize it (digit). ▶ We employ the notion of bias-aligned and bias-conflicting samples54: 1. bias-aligned: shares the same bias attribute as the anchor. We denote it as x+,b 2. bias-conflicting: has a different bias attribute. We denote it as x+,b′ (a) Anchor x (b) Bias-aligned x+,b (c) Bias-conflicting x+,b′ 54J. Nam et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020. 54/91

Slide 68

Slide 68 text

The Issue of Biases ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned sample x+,b will probably be closer to the anchor x in the representation space than a positive bias-conflicting sample. 55/91

Slide 69

Slide 69 text

The Issue of Biases ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned sample x+,b will probably be closer to the anchor x in the representation space than a positive bias-conflicting sample. → We would like them to be equidistant from the anchor ! Remove the effect of the bias 55/91

Slide 70

Slide 70 text

The Issue of Biases ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned sample x+,b will probably be closer to the anchor x in the representation space than a positive bias-conflicting sample. → We would like them to be equidistant from the anchor ! Remove the effect of the bias ▶ Thus, we say that there is a bias if we can identify an ordering on the learned representations: d+,b i < d+,b′ k ≤ d−,· j − ϵ ∀i, k, j Note This represents the worst-case scenario, where the ordering is total (i.e., ∀i, k, j). Of course, there can also be cases in which the bias is not as strong, and the ordering may be partial. Furthermore, the same reasoning can be applied to negative samples (omitted for brevity). 55/91

Slide 71

Slide 71 text

The Issue of Biases • Ideally, we would like d+,b′_k − d+,b_i = 0 ∀ i, k and d−,b′_t − d−,b_j = 0 ∀ t, j • However, this condition is very strict, as it would enforce a uniform distance among all positive (resp. negative) samples. • We propose a more relaxed condition where we force the distributions of distances, {d+,b′_k} and {d+,b_i}, to be similar (same for the negatives). • Assuming that the distance distributions follow a normal distribution, B+,b ∼ N(µ+,b, σ²+,b) and B+,b′ ∼ N(µ+,b′, σ²+,b′), we minimize the Kullback-Leibler divergence between the two distributions with the FairKL regularization term: R_FairKL = D_KL(B+,b || B+,b′) = (1/2) [ (σ²+,b + (µ+,b − µ+,b′)²) / σ²+,b′ − log(σ²+,b / σ²+,b′) − 1 ] 56/91
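
A sketch of the FairKL regularizer for the positive pairs of one batch: estimate the first two moments of the anchor-positive distances within the bias-aligned and bias-conflicting groups, and penalize the KL divergence between the two Gaussian fits; `d_aligned` and `d_conflicting` are assumed to be 1-D tensors of such distances.

```python
import torch

def fairkl(d_aligned, d_conflicting, eps=1e-8):
    """R_FairKL = KL( N(mu_b, var_b) || N(mu_b', var_b') ) between the distance
    distributions of bias-aligned and bias-conflicting positive pairs.

    d_aligned:     [M] anchor-positive distances for bias-aligned pairs
    d_conflicting: [K] anchor-positive distances for bias-conflicting pairs
    """
    mu_b, var_b = d_aligned.mean(), d_aligned.var() + eps
    mu_c, var_c = d_conflicting.mean(), d_conflicting.var() + eps
    return 0.5 * ((var_b + (mu_b - mu_c) ** 2) / var_c - torch.log(var_b / var_c) - 1.0)

# Sketch of the total objective: contrastive loss plus the weighted regularizer
# (the same term can be added for the negative pairs):
# loss = epsilon_sup_infonce(s_pos, s_neg) + lam * fairkl(d_pos_aligned, d_pos_conflicting)
```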

Slide 72

Slide 72 text

FairKL visual explanation • Positive and negative samples have a +, − symbol inside respectively. Filling colors represent different biases. Comparison between FairKL, EnD55, which only constrains the first moments (µ+,b = µ+,b′), and EnD with margin ϵ. d=1 d=2 d=3 (a) EnD. Partial ordering. d=1 d=2 d=3 (b) EnD + ϵ. Still partial ordering. d=2 (c) FairKL. No partial ordering. 55E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021. 57/91

Slide 73

Slide 73 text

FairKL visual explanation • Simulated toy example where distances follow a Gaussian distribution. Bias-aligned samples in blue. Bias-conflicting samples in orange. (a) µ+,b ̸= µ+,b′ σ2 +,b = σ2 +,b′ (b) µ+,b = µ+,b′ σ2 +,b ̸= σ2 +,b′ (c) µ+,b = µ+,b′ σ2 +,b = σ2 +,b′ 58/91

Slide 74

Slide 74 text

Results - Biased-MNIST Table: Top-1 accuracy (%) on Biased-MNIST (bias = background color). Reference results from56. Results denoted with * are re-implemented without color-jittering and bias-conflicting oversampling. Method 0.999 0.997 0.995 0.99 CE 11.8±0.7 62.5±2.9 79.5±0.1 90.8±0.3 LNL57 18.2±1.2 57.2±2.2 72.5±0.9 86.0±0.2 ϵ-SupInfoNCE 33.16±3.57 73.86±0.81 83.65±0.36 91.18±0.49 BiasCon+BiasBal* 30.26±11.08 82.83±4.17 88.20±2.27 95.04±0.86 EnD58 59.5±2.3 82.70±0.3 94.0±0.6 94.8±0.3 BiasBal 76.8±1.6 91.2±0.2 93.9±0.1 96.3±0.2 BiasCon+CE* 15.06±2.22 90.48±5.26 95.95±0.11 97.67±0.09 ϵ-SupInfoNCE + FairKL 90.51±1.55 96.19±0.23 97.00±0.06 97.86±0.02 56Y. Hong et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS. 2021. 57B. Kim et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019. 58E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021. 59/91

Slide 75

Slide 75 text

Results - Corrupted CIFAR-10 and bFFHQ Table: Top-1 accuracy (%) on Corrupted CIFAR-10 with different corruption ratio (%) where each class is correlated with a certain texture. bFFHQ: facial images where most of the females are young and most of the males are old. Corrupted CIFAR-10 bFFHQ Ratio Ratio Method 0.5 1.0 2.0 5.0 0.5 Vanilla 23.08±1.25 25.82±0.33 30.06±0.71 39.42±0.64 56.87±2.69 EnD 19.38±1.36 23.12±1.07 34.07±4.81 36.57±3.98 56.87±1.42 HEX 13.87±0.06 14.81±0.42 15.20±0.54 16.04±0.63 52.83±0.90 ReBias 22.27±0.41 25.72±0.20 31.66±0.43 43.43±0.41 59.46±0.64 LfF 28.57±1.30 33.07±0.77 39.91±0.30 50.27±1.56 62.2±1.0 DFA 29.95±0.71 36.49±1.79 41.78±2.29 51.13±1.28 63.87±0.31 ϵ-SupInfoNCE + FairKL 33.33±0.38 36.53±0.38 41.45±0.42 50.73±0.90 64.8±0.43 60/91

Slide 76

Slide 76 text

Results - Biased ImageNet Table: Top-1 accuracy (%) on 9-Class ImageNet biased and unbiased (UNB) sets, and ImageNet-A (IN-A). They have textural biases. Vanilla SIN LM RUBi ReBias LfF SoftCon ϵ-SupInfoNCE + FairKL Biased 94.0±0.1 88.4±0.9 79.2±1.1 93.9±0.2 94.0±0.2 91.2±0.1 95.3±0.2 95.1±0.1 UNB 92.7±0.2 86.6±1.0 76.6±1.2 92.5±0.2 92.7±0.2 89.6±0.3 94.1±0.3 94.8±0.3 IN-A 30.5±0.5 24.6±2.4 19.0±1.2 31.0±0.2 30.5±0.2 29.4±0.8 34.1±0.6 35.7±0.5 61/91

Slide 77

Slide 77 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 62/91

Slide 78

Slide 78 text

Prior in Contrastive Learning • In unsupervised CL, since labels are unknown, positive and negative samples are defined via transformations/augmentations → the choice of augmentations conditions the quality of the representations. • The most-used augmentations for visual representations involve aggressive cropping and color distortion. 63/91

Slide 79

Slide 79 text

Prior in Contrastive Learning • In unsupervised CL, since labels are unknown, positive and negative samples are defined via transformations/augmentations → the choice of augmentations conditions the quality of the representations. • The most-used augmentations for visual representations involve aggressive cropping and color distortion. → they may induce bias and be inadequate for medical imaging! ▶ dominant objects can prevent the model from learning features of smaller objects ▶ a few irrelevant and easy-to-learn features are sufficient to collapse the representation (a.k.a. feature suppression)59 ▶ in medical imaging, transformations need to preserve discriminative anatomical information while removing unwanted noise (e.g., crop → tumor suppression) 59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021. 63/91

Slide 80

Slide 80 text

Prior in Contrastive Learning • In unsupervised CL, since labels are unknown, positive and negative samples are defined via transformations/augmentations → the choice of augmentations conditions the quality of the representations. • The most-used augmentations for visual representations involve aggressive cropping and color distortion. → they may induce bias and be inadequate for medical imaging! ▶ dominant objects can prevent the model from learning features of smaller objects ▶ a few irrelevant and easy-to-learn features are sufficient to collapse the representation (a.k.a. feature suppression)59 ▶ in medical imaging, transformations need to preserve discriminative anatomical information while removing unwanted noise (e.g., crop → tumor suppression) • Question: can we integrate prior information into CL to make it less dependent on augmentations? 59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021. 63/91

Slide 81

Slide 81 text

Augmentation graph • As prior information, we consider weak attributes (e.g., age) or representations of a pre-trained generative model (e.g., VAE, GAN). • Using the theoretical understanding of CL through the augmentation graph60, we make the connection with kernel theory and introduce a novel loss61 60Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. 61B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023. 64/91

Slide 82

Slide 82 text

Augmentation graph • Augmentation graph: each point is an original image. Two points are connected if they can be transformed into the same augmented image (support=light disk). • Colors represent semantic (unknown) classes. • An incomplete augmentation graph (1) (intra-class samples not connected due to augmentations not adapted), is reconnected (3) using a kernel defined on prior (2). 65/91

Slide 83

Slide 83 text

Weaker Assumptions • We assume that the extended graph is class-connected → Previous works assumed that the augmentation graph was class-connected62’63’64 • No need for optimal augmentations. If augmentations are inadequate → use a kernel such that disconnected points in the augmentation graph are connected in the Kernel graph. 62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In: NeurIPS. 2021. 63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. 64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022. 66/91

Slide 84

Slide 84 text

Weaker Assumptions • We assume that the extended graph is class-connected → Previous works assumed that the augmentation graph was class-connected62’63’64 • No need for optimal augmentations. If augmentations are inadequate → use a kernel such that disconnected points in the augmentation graph are connected in the Kernel graph. • Problem: Current InfoNCE-based losses (e.g., y-aware loss) can not give tight bounds on the classification loss with these assumptions. 62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In: NeurIPS. 2021. 63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. 64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022. 66/91

Slide 85

Slide 85 text

Weaker Assumptions • We assume that the extended graph is class-connected → Previous works assumed that the augmentation graph was class-connected62’63’64 • No need for optimal augmentations. If augmentations are inadequate → use a kernel such that disconnected points in the augmentation graph are connected in the kernel graph. • Problem: Current InfoNCE-based losses (e.g., the y-aware loss) cannot give tight bounds on the classification loss with these assumptions. Need for a new loss. Why not leverage multiple views? 62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In: NeurIPS. 2021. 63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. 64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022. 66/91

Slide 86

Slide 86 text

Contrastive Learning - Multi-view centroids • We can create several views (i.e., transformations) of each sample • Let x^v_i be the v-th view of sample x_i and V the total number of views per sample: s+_i = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_i)) (similarities between the views of x_i), s+_j = (1/V²) Σ_{v,v′} s(f(x^v_j), f(x^{v′}_j)) (similarities between the views of x_j), s−_ij = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_j)) (similarities between the views of x_i and x_j) • We propose to look for an f such that65: s+_i + s+_j > 2 s−_ij + ϵ ∀ i ≠ j arg min_f log [ exp(−ϵ) + Σ_{i≠j} exp(−s+_i − s+_j + 2 s−_ij) ] 65B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023. 67/91

Slide 87

Slide 87 text

Decoupled Uniformity • Calling µ_i = (1/V) Σ_{v=1}^V f(x^v_i) a centroid, defined as the average of the representations of the views of sample x_i • In the limit ϵ → ∞, we retrieve a new loss that we call Decoupled Uniformity66: L̂_de-unif = log [ (1/(n(n − 1))) Σ_{i≠j} exp(−s+_i − s+_j + 2 s−_ij) ] = log [ (1/(n(n − 1))) Σ_{i≠j} exp(−||µ_i − µ_j||²) ] • It repels distinct centroids through an average pairwise Gaussian potential (similar to uniformity67). 66B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023. 67T. Wang et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”. In: ICML. 2020. 68/91
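
A sketch of the empirical Decoupled Uniformity loss: average the V view embeddings of each sample into a centroid, then apply the pairwise Gaussian-potential repulsion between distinct centroids; `feats` is assumed to contain the (normalized) representations with shape [n, V, d].

```python
import torch

def decoupled_uniformity(feats):
    """L_de-unif = log[ (1 / (n (n - 1))) * sum_{i != j} exp(-||mu_i - mu_j||^2) ].

    feats: [n, V, d] representations of the V views of each of the n samples
    """
    mu = feats.mean(dim=1)                                   # [n, d] centroids
    sq_dist = torch.cdist(mu, mu) ** 2                       # [n, n] ||mu_i - mu_j||^2
    n = mu.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=mu.device)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```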

Slide 88

Slide 88 text

Decoupled Uniformity - Properties • The properties of this new loss L̂_de-unif = log [ (1/(n(n−1))) Σ_{i≠j} exp(−||µ_i − µ_j||²) ] are: 1. It implicitly imposes alignment between positives → no need to explicitly add an alignment term. 2. It solves the negative-positive coupling problem of InfoNCE68 → positive samples are not attracted (alignment) and repelled (uniformity) at the same time. 3. It allows the integration of prior knowledge z(x_i) about x_i (weak attribute, generative model representation) → use a kernel Kσ(z(x_i), z(x_j)) on the priors to better estimate the centroids. Intuitively, if (z(x_i), z(x_j)) are close in the kernel space, so should (f(x_i), f(x_j)) be in the representation space. 68T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 69/91

Slide 89

Slide 89 text

Decoupled Uniformity - Properties • Similarly to69, we compute the gradient w.r.t. the v-th view f(x^v_k): ∇_{f(x^v_k)} L_de-unif = 2 Σ_{j≠k} w_{k,j} µ_j (repel hard negatives) − 2 w_k µ_k (align hard positives) (3) • w_{k,j} = exp(−||µ_k − µ_j||²) / Σ_{p≠q} exp(−||µ_p − µ_q||²) → quantifies whether the negative sample x_j is “hard” (i.e. close to the positive sample x_k) • w_k = Σ_{j≠k} w_{k,j}, s.t. Σ_k w_k = 1 → quantifies whether the positive sample x_k is “hard” (i.e. close to other samples in the batch) • We thus implicitly align all views of each sample in the same direction (in particular the hard positives) and repel the hard negatives. • No positive-negative coupling → w_{k,j} only depends on the relative position of the centroids µ 69C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022. 70/91

Slide 90

Slide 90 text

Decoupled Uniformity without Prior • Comparison of Decoupled Uniformity without prior with InfoNCE70 and DC71 loss. Batch size n = 256. All models are trained for 400 epochs. Dataset Network LInfoNCE LDC Lde unif CIFAR-10 ResNet18 82.18±0.30 84.87±0.27 85.05±0.37 CIFAR-100 ResNet18 55.11±0.20 58.27±0.34 58.41±0.05 ImageNet100 ResNet50 68.76 73.98 77.18 70T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. 71C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022. 71/91

Slide 91

Slide 91 text

Decoupled Uniformity with Prior • How can we estimate the centroids using a kernel on the prior? → conditional mean embedding theory72 Definition - Empirical Kernel Decoupled Uniformity Loss Let (x_i)_{i∈[1..n]} be the n samples with their V views x^v_i, and K_n = [Kσ(z(x_i), z(x_j))]_{i,j∈[1..n]} the kernel prior matrix, where Kσ is a standard kernel (e.g., Gaussian or cosine). We define the new centroid estimator as µ̂_x̄j = (1/V) Σ_{v=1}^V Σ_{i=1}^n α_{i,j} f(x^v_i) with α_{i,j} = ((K_n + λ n I_n)^{−1} K_n)_{ij}, λ = O(n^{−1/2}) a regularization constant. The empirical Kernel Decoupled Uniformity loss is then: L̂_de-unif(f) := log [ (1/(n(n − 1))) Σ_{i≠j} exp(−||µ̂_x̄i − µ̂_x̄j||²) ] 72L. Song et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013). 72/91

Slide 92

Slide 92 text

Decoupled Uniformity with Prior L̂_de-unif(f) := log [ (1/(n(n − 1))) Σ_{i≠j} exp(−||µ̂_x̄i − µ̂_x̄j||²) ] • The added computational cost is roughly O(n³) (to compute the inverse of an n × n matrix), but it remains negligible compared to the back-propagation time. • We have tight bounds on the classification loss under weaker assumptions than current work (connectivity of the extended graph)73’74’75 73J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In: NeurIPS. 2021. 74Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. 75N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022. 73/91
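
A sketch of the kernel-based variant: build the prior kernel matrix K_n from the prior representations z(x_i) (weak attributes, VAE or GAN codes, ...), compute α = (K_n + λ n I_n)^{-1} K_n, and plug the re-weighted centroids into the same uniformity term. Names and the choice of an RBF kernel are illustrative.

```python
import torch

def kernel_decoupled_uniformity(feats, z_prior, sigma=1.0):
    """Empirical Kernel Decoupled Uniformity loss (sketch).

    feats:   [n, V, d] view representations f(x_i^v)
    z_prior: [n, p]    prior representations z(x_i)
    """
    n = feats.size(0)
    view_mean = feats.mean(dim=1)                            # [n, d] mean over the V views

    # RBF kernel on the prior: K_n[i, j] = exp(-||z_i - z_j||^2 / (2 sigma^2))
    k = torch.exp(-torch.cdist(z_prior, z_prior) ** 2 / (2 * sigma ** 2))

    lam = n ** -0.5                                          # lambda = O(n^{-1/2})
    alpha = torch.linalg.solve(k + lam * n * torch.eye(n, device=k.device), k)

    mu_hat = alpha.T @ view_mean                             # mu_hat_j = sum_i alpha[i, j] * mean_v f(x_i^v)
    sq_dist = torch.cdist(mu_hat, mu_hat) ** 2
    off_diag = ~torch.eye(n, dtype=torch.bool, device=k.device)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```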

Slide 93

Slide 93 text

Linear evaluation on ImageNet100 Model ImageNet100 SimCLR 68.76 BYOL 72.26 CMC 73.58 DCL 74.6 AlignUnif 76.3 DC 73.98 BigBiGAN 72.0 Decoupled Unif 77.18 KGAN Decoupled Unif 78.02 Supervised 82.1±0.59 Table: Linear evaluation accuracy (%) on ImageNet100 using ResNet50 trained for 400 epochs with batch size n = 256. We leverage BigBiGAN representation76, pre-trained on ImageNet, as prior. We define the kernel KGAN (¯ x, ¯ x′) = K(z(¯ x), z(¯ x′)) with K an RBF kernel and z(·) BigBiGAN’s encoder. 76J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019. 74/91

Slide 94

Slide 94 text

Chest radiography interpretation - CheXpert Model Atelectasis Cardiomegaly Consolidation Edema Pleural Effusion SimCLR 82.42 77.62 90.52 89.08 86.83 BYOL 83.04 81.54 90.98 90.18 85.99 MoCo-CXR∗ 75.8 73.7 77.1 86.7 85.0 GLoRIA 86.70 86.39 90.41 90.58 91.82 CCLK 86.31 83.67 92.45 91.59 91.23 KGl Dec. Unif (ours) 86.92 85.88 93.03 92.39 91.93 Supervised∗ 81.6 79.7 90.5 86.8 89.9 Table: AUC scores(%) under linear evaluation for discriminating 5 pathologies on CheXpert. ResNet18 backbone is trained for 400 epochs (batch size n = 1024). As prior, we use Gloria77, a multi-modal approach trained with (medical report, image) pairs, and a RBF kernel KGl . 77S.-C. Huang et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021. 75/91

Slide 95

Slide 95 text

Bipolar disorder detection Model BD vs HC SimCLR 60.46±1.23 BYOL 58.81±0.91 MoCo v2 59.27±1.50 Model Genesis 59.94±0.81 VAE 52.86±1.24 KVAE Decoupled Unif (ours) 62.19±1.58 Supervised 67.42±0.31 Table: Linear evaluation AUC scores(%) using a 5-fold leave-site-out CV scheme with DenseNet121 backbone on the brain MRI dataset BIOBD (N ∼ 700). We use a VAE representation as prior to define KVAE (¯ x, ¯ x′) = K(µ(¯ x), µ(¯ x′)) pre-trained on BHB where µ(·) is the mean Gaussian distribution of ¯ x in the VAE latent space and K is a standard RBF kernel. 76/91

Slide 96

Slide 96 text

Removing optimal augmentations Model CIFAR-10 CIFAR-100 All w/o Color w/o Color and Crop All w/o Color w/o Color and Crop SimCLR 83.06 65.00 24.47 55.11 37.63 6.62 BYOL 84.71 81.45 50.17 53.15 49.59 27.9 Barlow Twins 81.61 53.97 47.52 52.27 28.52 24.17 VAE∗ 41.37 41.37 41.37 14.34 14.34 14.34 DCGAN∗ 66.71 66.71 66.71 26.17 26.17 26.17 KGAN Dec. Unif 85.85 82.0 69.19 58.42 54.17 35.98 Table: When removing optimal augmentations, generative models provide a good kernel to connect intra-class points not connected by augmentations. All models are trained for 400 epochs under batch size n = 256 except BYOL and SimCLR trained under bigger batch size n = 1024. 77/91

Slide 97

Slide 97 text

Summary 1. Introduction 1.1 Transfer Learning 2. Contrastive Learning 2.1 A geometric approach 2.2 ϵ-margin metric learning 2.3 Weakly supervised 2.4 Regression 3. Debiasing with FairKL 4. Prior in Contrastive Learning 5. Conclusions and Perspectives 78/91

Slide 98

Slide 98 text

Conclusions • Thanks to an ϵ-margin geometric approach, we better formalized and understood current CL losses • We proposed new losses for unsupervised, supervised, weakly-supervised and regression settings • Using a geometric approach and the conditional mean embedding theory, we also tackled two important problems in CL: data biases (FairKL) and prior inclusion (Decoupled Uniformity) • We applied the proposed geometric CL losses on computer vision and medical imaging datasets obtaining SOTA results 79/91

Slide 99

Slide 99 text

Non-Contrastive Learning • In 1992, Hinton78 proposed a Siamese network that maximizes agreement (i.e., MI) between 1D output signals (without negatives). • In 2020/2021, we came back to the same idea but... architectural tricks, such as the momentum encoder79 or stop-gradient operations, are needed to avoid collapse. • It is not clear why and how they work so well... Maybe a geometric approach as in80’81? 78S. Becker et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992). 79J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. 80C. Zhang et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022. 81Q. Garrido et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023. 80/91

Slide 100

Slide 100 text

Team • The work presented here has been accomplished during the PhD of: Carlo Alberto Barbano Benoit Dufumier • It has been published in: 1. Barbano et al., “Unbiased Supervised Contrastive Learning” 2. Barbano et al., “Contrastive learning for regression in multi-site brain age prediction” 3. Dufumier et al., “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification” 4. Dufumier et al., “Integrating Prior Knowledge in Contrastive Learning with Kernel” 81/91

Slide 101

Slide 101 text

Team 82/91

Slide 102

Slide 102 text

Institutions & Partners 83/91

Slide 103

Slide 103 text

References Barbano, C. A. et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023. Barbano, C. A. et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023. Bardes, A. et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022. Becker, S. et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992). Caron, M. et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS. 2020. Chen, T. et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020. Chen, T. et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021. Chopra, S. et al. “Learning a Similarity Metric Discriminatively, with Application to Face Verification”. In: CVPR. 2005. Deng, J. et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009. Doersch, C. et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015. Donahue, J. et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019. Dufumier, B. et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop. 2021. 84/91

Slide 104

Slide 104 text

References Dufumier, B. et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021. Dufumier, B. et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In: NeuroImage (under review). Dufumier, B. et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023. Dufumier, B. et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In: NeuroImage (2022). Dwibedi, D. et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021. Frosst, N. et al. “Analyzing and Improving Representations with the Soft Nearest Neighbor”. In: ICML. 2019. Garrido, Q. et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023. Graf, F. et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021. Grill, J.-B. et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020. HaoChen, J. Z. et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In: NeurIPS. 2021. He, K. et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022. Hjelm, R. D. et al. “Learning deep representations by mutual information estimation ...”. In: ICLR. 2019. 85/91

Slide 105

Slide 105 text

References Hong, Y. et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS. 2021. Huang, S.-C. et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021. Khosla, P. et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020. Kim, B. et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019. Lin, T.-Y. et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014. Littlejohns, T. J. et al. “The UK Biobank imaging enhancement of 100,000 participants ...”. In: Nature Communications (2020). Matsoukas, C. et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022. Mustafa, B. et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021. Nam, J. et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020. Neyshabur, B. et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020. Oord, A. v. d. et al. Representation Learning with Contrastive Predictive Coding. 2018. Poole, B. et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019. Raghu, M. et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019. Salakhutdinov, R. et al. “Learning a Nonlinear Embedding by Preserving Class ...”. In: AISTATS. 2007. 86/91

Slide 106

Slide 106 text

References Saunshi, N. et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022. Schroff, F. et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015. Sohn, K. “Improved Deep Metric Learning with Multi-class N-pair Loss Objective”. In: NIPS. 2016. Song, H. O. et al. “Deep Metric Learning via Lifted Structured Feature Embedding”. In: CVPR. 2016. Song, L. et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013). Tartaglione, E. et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021. Tschannen, M. et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020. Wang, T. et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”. In: ICML. 2020. Wang, Y. et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022. Wu, Z. et al. “Unsupervised Feature Learning via Non-parametric Instance Discrimination”. In: CVPR. 2018. Yeh, C.-H. et al. “Decoupled Contrastive Learning”. In: ECCV. 2022. Yu, B. et al. “Deep Metric Learning With Tuplet Margin Loss”. In: ICCV. 2019. Zhang, C. et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022. 87/91

Slide 107

Slide 107 text

References Zhou, J. et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022. Zhou, Z. et al. “Models Genesis”. In: MedIA (2021). 88/91

Slide 108

Slide 108 text

Supplementary - Data

Datasets             Disease   # Subjects   # Scans   Age       Sex (%F)   # Sites   Accessibility
OpenBHB (grouping the indented datasets, IXI through ICBM):
  IXI                -         559          559       48 ± 16   55         3         Open
  CoRR               -         1366         2873      26 ± 16   50         19        Open
  NPC                -         65           65        26 ± 4    55         1         Open
  NAR                -         303          323       22 ± 5    58         1         Open
  RBP                -         40           40        22 ± 5    52         1         Open
  GSP                -         1570         1639      21 ± 3    58         5         Open
  ABIDE I            ASD       567          567       17 ± 8    12         20        Open
                     HC        566          566       17 ± 8    17         20        Open
  ABIDE II           ASD       481          481       14 ± 8    15         19        Open
                     HC        542          555       15 ± 9    30         19        Open
  Localizer          -         82           82        25 ± 7    56         2         Open
  MPI-Leipzig        -         316          317       37 ± 19   40         2         Open
  HCP                -         1113         1113      29 ± 4    45         1         Restricted
  OASIS 3            Only HC   578          1166      68 ± 9    62         4         Restricted
  ICBM               -         606          939       30 ± 12   45         3         Restricted
BIOBD                BD        306          306       40 ± 12   55         8         Private
                     HC        356          356       40 ± 13   55         8         Private
SCHIZCONNECT-VIP     SCZ       275          275       34 ± 12   28         4         Open
                     HC        329          329       32 ± 13   47         4         Open
PRAGUE               HC        90           90        26 ± 7    55         1         Private
BSNIP                HC        198          198       32 ± 12   58         5         Private
                     SCZ       190          190       34 ± 12   30         5         Private
                     BD        116          116       37 ± 12   66         5         Private
CANDI                HC        25           25        10 ± 3    41         1         Open
                     SCZ       20           20        13 ± 3    45         1         Open
CNP                  HC        123          123       31 ± 9    47         1         Open
                     SCZ       50           50        36 ± 9    24         1         Open
                     BD        49           49        35 ± 9    43         1         Open
Total                          10882        13412     32 ± 19   50         101

89/91

Slide 109

Slide 109 text

Supplementary - Data

Task         Split           Datasets                                     # Subjects   # Scans   Age       Sex (%F)
SCZ vs. HC   Training        SCHIZCONNECT-VIP, CNP, PRAGUE, BSNIP, CANDI  933          933       33 ± 12   43
             Validation                                                   116          116       32 ± 11   37
             External Test                                                133          133       32 ± 12   45
             Internal Test                                                118          118       33 ± 13   34
BD vs. HC    Training        BIOBD, BSNIP, CNP, CANDI                     832          832       38 ± 13   56
             Validation                                                   103          103       37 ± 12   51
             External Test                                                131          131       37 ± 12   52
             Internal Test                                                107          107       37 ± 13   56
ASD vs. HC   Training        ABIDE 1+2                                    1488         1526      16 ± 8    17
             Validation                                                   188          188       17 ± 10   17
             External Test                                                207          207       12 ± 3    30
             Internal Test                                                184          186       17 ± 9    18

Table: Training/Validation/Test splits used for detecting the 3 mental illness disorders. Out-of-site images always form the external test set, and each participant falls into only one split, avoiding data leakage. The internal test set is always stratified according to age, sex, site, and diagnosis with respect to the training and validation sets. All models use the same splits. 90/91

Slide 110

Slide 110 text

Supplementary - Results Figure: UMAP Visualization on ADNI 91/91