
Pietro Gori


(Télécom Paris, France)

Title — Contrastive Learning in Medical Imaging - A metric learning approach

Abstract — Contrastive Learning (CL) is a paradigm designed for self-supervised representation learning that has been applied to unsupervised, weakly supervised and supervised problems. The objective of CL is to estimate a parametric mapping function that maps positive samples (semantically similar) close together in the representation space and negative samples (semantically dissimilar) far away from each other. In general, positive samples can be defined in different ways depending on the problem: transformations (i.e., augmentations) of the same image (unsupervised setting), samples belonging to the same class (supervised setting), or samples with similar image attributes (weakly supervised setting). The definition of negative samples varies accordingly.

In this talk, we will show how a metric learning approach to CL allows us to: (1) better formalize recent contrastive losses, such as InfoNCE and SupCon; (2) derive new losses for unsupervised, supervised, and weakly supervised problems, for both classification and regression; and (3) propose new regularization terms for debiasing.

Furthermore, leveraging the proposed metric learning approach and kernel theory, we will describe a novel loss, called decoupled uniformity, that allows the integration of prior knowledge, given either by generative models or by weak attributes, and removes the positive-negative coupling problem present in the InfoNCE loss.

We validate the usefulness of the proposed losses on standard vision datasets and medical imaging data.

Biography — Pietro Gori is Maître de Conférences in Artificial Intelligence and Medical Imaging at Télécom Paris (IPParis), in the IMAGES group. He did his academic training with Inria at the ARAMIS Lab in Paris and then at Neurospin (CEA). Prior to that, he obtained an MSc in Mathematical Modelling and Computation from DTU in Copenhagen and an MSc in Biomedical Engineering from the University of Padova. He participated in the development of the open-source software suite Deformetrica for statistical shape analysis and of the software platform Clinica for clinical neuroimaging studies. His research interests lie primarily in the fields of machine learning, AI, representation learning, medical imaging and computational anatomy. He has 45 publications in international peer-reviewed journals and conferences and has had the pleasure of working with 10 Master's students, 14 PhD students and 2 post-docs.

S³ Seminar

June 30, 2023



Transcript

  1. Contrastive Learning in Medical Imaging
    An ϵ-margin metric learning approach
    Pietro Gori
    Assistant Professor
    Télécom Paris (IPParis)
    Paris, France


  2. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    2/91


  3. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    3/91


  4. Introduction - Computer Vision
    • Deep learning (e.g., CNNs or ViTs) is a lazy and inefficient statistical method that needs
    millions, if not billions, of examples to learn a precise task → data hungry
    4/91


  5. Introduction - Computer Vision
    • Many specific tasks in Computer Vision, such as object detection [1] (e.g., YOLO), image
    classification [2] (e.g., ResNet-50), or semantic segmentation (e.g., U-Net), have reached
    astonishing results in recent years.
    • This has been possible mainly because large (N > 10^6), labeled datasets were easily
    accessible and freely available
    1T.-Y. Lin et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014.
    2J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    5/91


  6. Introduction - Medical Imaging
    • In medical imaging, current research datasets are:
    ▶ small: N < 2k for common pathologies and N < 200 for rare pathologies
    ▶ biased: images are acquired in a precise hospital, following a specific protocol with
    a particular machine (nuisance site effect)
    ▶ multi-modal: many imaging modalities can be available as well as text, clinical,
    biological, genetic data.
    ▶ anonymized, quality checked, accessible, quite homogeneous
    • Clinical datasets are harder to analyze since they are usually not anonymized, not
    quality checked, not freely accessible, highly heterogeneous.
    • In this talk, we will focus on research medical imaging datasets
    6/91


  7. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    7/91


  8. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    8/91


  9. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common.
    8/91


  10. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature
    reuse) increases with [3,4,5,6]:
    ▶ reduced target data size (small N_target)
    ▶ visual similarity between the pre-training and target domains (small FID)
    ▶ models with fewer inductive biases (TL works better for ViTs than CNNs)
    ▶ larger architectures (more parameters)
    3B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    4C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    5B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    6M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    9/91


  11. Introduction - Transfer Learning
    • When dealing with small labelled datasets, a common strategy is Transfer Learning:
    1. pre-training a model on a large dataset and then
    2. fine-tuning it on the small target and labelled dataset
    • Supervised pre-training from ImageNet is common. Its usefulness (that is, feature
    reuse) increases with [7,8,9,10]:
    ▶ reduced target data size (small N_target)
    ▶ visual similarity between the pre-training and target domains (small FID)
    ▶ models with fewer inductive biases (TL works better for ViTs than CNNs)
    ▶ larger architectures (more parameters)
    7B. Mustafa et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    8C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    9B. Neyshabur et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    10M. Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    10/91


  12. Introduction - Transfer Learning
    • Natural [11] and medical [12] images can be visually very different! → Domain gap
    • Furthermore, medical images can be 3D, while ImageNet is 2D.
    • Need for a large, annotated, 3D medical dataset
    11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    11/91


  13. Introduction - Transfer Learning
    • Natural [11] and medical [12] images can be visually very different! → Domain gap
    • Furthermore, medical images can be 3D, while ImageNet is 2D.
    • Need for a large, annotated, 3D medical dataset → PROBLEM!
    11J. Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    12C. Matsoukas et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    11/91


  14. Introduction - Transfer Learning
    • Supervised pre-training is not a valid option in medical imaging. We need another
    kind of pre-training.
    13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications
    (2020).
    14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In:
    NeuroImage (2022).
    12/91


  15. Introduction - Transfer Learning
    • Supervised pre-training is not a valid option in medical imaging. We need another
    kind of pre-training.
    • Recently, big multi-site international datasets of healthy subjects have emerged, such as
    UK Biobank [13] (N > 100k) and OpenBHB [14] (N > 10k)
    13T. J. Littlejohns et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature Communications
    (2020).
    14B. Dufumier et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”. In:
    NeuroImage (2022).
    12/91


  16. Introduction - Transfer Learning
    • How can we employ a healthy (thus unlabeled) dataset for pre-training?
    13/91


  17. Introduction - Transfer Learning
    • How can we employ a healthy (thus unlabeled) dataset for pre-training? →
    Self-supervised pre-training!
    13/91


  18. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  19. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    • Pretext task should only use the visual information and context of the images
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  20. Introduction - Transfer Learning
    • Self-supervised pre-training: leverage an annotation-free pretext task to provide a
    surrogate supervision signal for feature learning.
    • Pretext task should only use the visual information and context of the images
    • Examples of pretext tasks:
    ▶ Context prediction [15]
    ▶ Generative models [16,17]
    ▶ Instance discrimination (Contrastive Learning) [18]
    ▶ Teacher/Student [19]
    ▶ Information Maximization [20]
    15C. Doersch et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    16K. He et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    17J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    18T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    19J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    20A. Bardes et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In: ICLR. 2022.
    14/91


  21. Self-supervised Learning - Preliminaries
    • Pre-text tasks should produce image representations that are:
    1. Transferable: we can easily reuse/fine-tune them in different downstream tasks
    (e.g., segmentation, object detection, classification, etc.)
    2. Generalizable: they should not be specific to a single task but work well in several
    different downstream tasks
    3. High-level: representations should characterize the high-level
    semantics/structure and not low-level features (color, texture, etc.)
    4. Invariant: image representations should be invariant to geometric or appearance
    transformations that do not modify the information content of the image (i.e.,
    irrelevant for downstream task)
    5. Semantically coherent: semantically similar images should be close in the
    representation space
    15/91


  22. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    16/91


  23. Contrastive Learning
    • Contrastive learning methods outperform the other pretext tasks21
    21T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    17/91


  24. Contrastive Learning
    • Recently, there has been a plethora of works on it that are closing the
    performance gap with supervised pre-training [22,23,24]
    22J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    23M. Caron et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS. 2020.
    24J. Zhou et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022.
    18/91


  25. Contrastive Learning - A bit of history
    • Goal: given a set of images x_k ∈ X, learn a mapping function f_θ : X → F such that:
    if x_a and x_b are semantically similar → f(x_a) ≈ f(x_b)
    if x_a and x_b are semantically different → f(x_a) ≠ f(x_b)
    • These conditions can be reformulated from a mathematical point of view using either a
    geometric approach, based on a distance d(f(x_a), f(x_b)), or an information-theoretic
    approach, based on a statistical dependence measure, such as the Mutual Information
    I(f(x_a), f(x_b)):
    if x_a and x_b are semantically similar → arg min_f d(f(x_a), f(x_b))   or   arg max_f I(f(x_a), f(x_b))
    if x_a and x_b are semantically different → arg max_f d(f(x_a), f(x_b))   or   arg min_f I(f(x_a), f(x_b))
    19/91


  26. Contrastive Learning - A bit of history
    Geometric approach (Y. LeCun)
    ▶ Pairwise lossa
    ▶ Triplet lossb
    ▶ Tuplet lossc’d’e
    aS. Chopra et al. “Learning a Similarity Metric
    Discriminatively, with Application to Face Verification”.
    In: CVPR. 2005.
    bF. Schroff et al. “FaceNet: A Unified Embedding for
    Face Recognition and Clustering”. In: CVPR. 2015.
    cH. O. Song et al. “Deep Metric Learning via Lifted
    Structured Feature Embedding”. In: CVPR. 2016.
    dK. Sohn. “Improved Deep Metric Learning with
    Multi-class N-pair Loss Objective”. In: NIPS. 2016.
    eB. Yu et al. “Deep Metric Learning With Tuplet
    Margin Loss”. In: ICCV. 2019.
    Information theory approach (G. Hinton)
    ▶ Soft Nearest Neighbora’b
    ▶ Contrastive Predictive Coding (CPC)c
    ▶ Non-Parametric Instance Discriminationd
    ▶ Deep InfoMax (DIM)e
    aR. Salakhutdinov et al. “Learning a Nonlinear Embedding by
    Preserving Class ...”. In: AISTATS. 2007.
    bN. Frosst et al. “Analyzing and Improving Representations with
    the Soft Nearest Neighbor”. In: ICML. 2019.
    cA. v. d. Oord et al. Representation Learning with Contrastive
    Predictive Coding. 2018.
    dZ. Wu et al. “Unsupervised Feature Learning via Non-parametric
    Instance Discrimination”. In: CVPR. 2018.
    eR. D. Hjelm et al. “Learning deep representations by mutual
    information estimation ...”. In: ICLR. 2019.
    20/91


  27. Contrastive Learning - A bit of history
    Geometric approach (Y. LeCun)a
    ▶ Need to define positive (x, x+)
    (semantically similar) and negative
    pairs (x, x−) (semantically different)
    ▶ Need to define similarity measure (or
    distance) that is maximized (or
    minimized)
    ▶ No constraints/hypotheses about
    negative samples
    aS. Chopra et al. “Learning a Similarity Metric
    Discriminatively, with Application to Face Verification”. In:
    CVPR. 2005.
    Information theory approach (G. Hinton) [a]
    ▶ Need to define the pdf of positive pairs (x, x⁺) ∼ p(x, x⁺) and negative pairs
    (x, x⁻) ∼ p(x)p(x⁻), where x⁻ ⊥⊥ (x, x⁺)
    ▶ Maximize the Mutual Information (I) between positive pairs, given independent
    negative pairs: I(x; x⁺) = I(x; x⁺, x⁻) = E_{x⁻∼p(x⁻)} I(x; x⁺) [b]
    ▶ Need to define an estimator of I
    aS. Becker et al. “Self-organizing neural network that
    discovers surfaces in random ...”. In: Nature (1992).
    bB. Poole et al. “On Variational Bounds of Mutual
    Information”. In: ICML. 2019.
    21/91


  28. Contrastive Learning - A bit of history
    • The information-theoretic approach is mathematically sound and well grounded in
    the role of Mutual Information (I) estimation in representation learning.
    • But... a large I is not necessarily predictive of downstream performance. Good results
    may depend on architecture choices and inductive biases rather than on an accurate
    estimation of I [25]
    • Furthermore, a geometric approach:
    ▶ is easy to understand and explain
    ▶ can easily formalize abstract ideas for defining new losses or regularization terms
    (e.g., data biases)
    ▶ needs no implausible hypotheses (e.g., independence of the negative samples).
    25M. Tschannen et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020.
    22/91


  29. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    23/91


  30. Contrastive Learning - Geometric approach
    ▶ Let x ∈ X be a sample (anchor)
    ▶ Let x⁺_i be a similar (positive) sample
    ▶ Let x⁻_j be a different (negative) sample
    ▶ Let P be the number of positive samples
    ▶ Let N be the number of negative samples
    ▶ Let f : X → S^{d−1} be the mapping
    ▶ Let F = S^{d−1}, the (d−1)-sphere
    Figure: From Schroff et al. [a]
    a F. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    24/91


  31. Contrastive Learning - Geometric approach
    ▶ Let x ∈ X be a sample (anchor)
    ▶ Let x⁺_i be a similar (positive) sample
    ▶ Let x⁻_j be a different (negative) sample
    ▶ Let P be the number of positive samples
    ▶ Let N be the number of negative samples
    ▶ Let f : X → S^{d−1} be the mapping
    ▶ Let F = S^{d−1}, the (d−1)-sphere
    Figure: From Schroff et al. [a]
    a F. Schroff et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    How can we define positive and negative samples?
    24/91


  32. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [26] or a
    nearest-neighbor from a support set [27].
    26 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    27 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    25/91


  33. Unsupervised setting
    26/91


  34. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [28] or a
    nearest-neighbor from a support set [29].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [30].
    28 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    29 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    30 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    27/91


  35. Supervised setting
    Figure: Image taken from [31]
    31 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    28/91


  36. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [32] or a
    nearest-neighbor from a support set [33].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [34].
    ▶ In regression [35] or weakly-supervised classification [36]: x⁺_i is a sample with a
    continuous/weak label similar to that of x.
    • The definition of negative samples x⁻_j varies accordingly.
    32 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    33 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    34 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    35 C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    36 B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    29/91


  37. Contrastive Learning - Semantic definition
    • Positive samples x⁺_i can be defined in different ways:
    ▶ Unsupervised setting (no label): x⁺_i is a transformation of the anchor x [32] or a
    nearest-neighbor from a support set [33].
    ▶ Supervised classification setting (label): x⁺_i is a sample belonging to the same
    class as x [34].
    ▶ In regression [35] or weakly-supervised classification [36]: x⁺_i is a sample with a
    continuous/weak label similar to that of x.
    • The definition of negative samples x⁻_j varies accordingly.
    How can we contrast positive and negative samples from a mathematical point of view?
    32 T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    33 D. Dwibedi et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    34 P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    35 C. A. Barbano et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    36 B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    29/91


  38. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    30/91


  39. Contrastive Learning - ϵ-margin metric
    • We propose to use an ϵ-margin metric learning point of view [37].
    • If we have a single positive x⁺ and several negatives x⁻_j (e.g., tuplet loss), we look for f such that:
      d(f(x), f(x⁺)) − d(f(x), f(x⁻_j)) ≤ −ϵ   ⟺   s(f(x), f(x⁻_j)) − s(f(x), f(x⁺)) ≤ −ϵ   ∀j
      (denoting d⁺ = d(f(x), f(x⁺)), d⁻_j = d(f(x), f(x⁻_j)), s⁺ = s(f(x), f(x⁺)), s⁻_j = s(f(x), f(x⁻_j)))
    • where ϵ ≥ 0 is a margin between positives and negatives and s(f(a), f(b)) = ⟨f(a), f(b)⟩_2.
    37 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    31/91


  40. Contrastive Learning - ϵ-margin metric
    • We propose to use an ϵ-margin metric learning point of view [37].
    • If we have a single positive x⁺ and several negatives x⁻_j, we look for f such that:
      d⁺ − d⁻_j ≤ −ϵ   ⟺   s⁻_j − s⁺ ≤ −ϵ   ∀j
    • where ϵ ≥ 0 is a margin between positives and negatives and s(f(a), f(b)) = ⟨f(a), f(b)⟩_2.
    • Two possible ways to transform this condition into an optimization problem are:
      arg min_f max(0, {s⁻_j − s⁺ + ϵ}_{j=1,...,N})        arg min_f Σ_{j=1}^{N} max(0, s⁻_j − s⁺ + ϵ)
    • When these losses equal 0, the condition is fulfilled. The first (max) is a lower bound of the second (sum).
    37 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    31/91


  41. Contrastive Learning - ϵ-margin metric
    LogSumExp operator (LSE)
    The LogSumExp operator is a smooth approximation of the max function. It is defined as:
      max(x_1, x_2, ..., x_N) ≤ LSE(x_1, x_2, ..., x_N) = log( Σ_{i=1}^{N} exp(x_i) )

      arg min_f max(0, {s⁻_j − s⁺ + ϵ}_{j=1,...,N}) ≈ arg min_f − log [ exp(s⁺) / ( exp(s⁺ − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-InfoNCE) [38]
    38 C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    32/91
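    As an illustration, here is a minimal PyTorch sketch of the ϵ-InfoNCE objective above (not the authors' implementation): it assumes one positive per anchor, cosine similarities on L2-normalized embeddings, and omits the temperature scaling that is commonly added in practice.

```python
# Minimal sketch of eps-InfoNCE (assumption: one positive per anchor, cosine
# similarity on L2-normalized features; no temperature, as in the slide formula).
import torch
import torch.nn.functional as F

def eps_infonce(anchor, positive, negatives, eps=0.25):
    """anchor, positive: (B, d); negatives: (B, N, d)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    s_pos = (anchor * positive).sum(-1)                      # s+    (B,)
    s_neg = torch.einsum("bd,bnd->bn", anchor, negatives)    # s-_j  (B, N)
    # -log exp(s+) / (exp(s+ - eps) + sum_j exp(s-_j))
    denom = torch.logsumexp(torch.cat([(s_pos - eps).unsqueeze(1), s_neg], dim=1), dim=1)
    return (denom - s_pos).mean()
```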


  42. Contrastive Learning - ϵ-margin metric
    • When ϵ = 0, we retrieve the InfoNCE loss [39], whereas when ϵ → ∞ we obtain
    InfoL1O (or the Decoupled loss [40]).
    • These two losses are respectively a lower and an upper bound of I(X⁺, X) [41]:

      E_{(x,x⁺)∼p(x,x⁺), x⁻_j∼p(x⁻)} [ log exp(s⁺) / ( exp(s⁺) + Σ_j exp(s⁻_j) ) ]   (InfoNCE)
        ≤ I(X⁺, X) ≤
      E_{(x,x⁺)∼p(x,x⁺), x⁻_j∼p(x⁻)} [ log exp(s⁺) / ( Σ_j exp(s⁻_j) ) ]   (InfoL1O)    (1)

    • Varying ϵ ∈ [0, ∞) can give a tighter approximation of I(X⁺, X): the term exp(s⁺ − ϵ)
    in the denominator monotonically decreases as ϵ increases.
    39T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    40C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    41B. Poole et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019.
    33/91


  43. Contrastive Learning - ϵ-margin metric
    • The inclusion of multiple positive samples (s⁺_i) can lead to different formulations
    (see [42]). Here, we use the simplest one:
      s⁻_j − s⁺_i ≤ −ϵ   ∀i, j
      Σ_i max(−ϵ, {s⁻_j − s⁺_i}_{j=1,...,N}) ≈ − Σ_i log [ exp(s⁺_i) / ( exp(s⁺_i − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-SupInfoNCE)   (2)
    • Another formulation is the SupCon loss [43], which has been presented as the “most
    straightforward way to generalize” the InfoNCE loss to multiple positives. However...
    42C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    43P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    34/91
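    A minimal PyTorch sketch of ϵ-SupInfoNCE under the same assumptions (cosine similarities on normalized embeddings, positives defined by shared labels within the batch, no temperature); the explicit loop over anchors is for clarity rather than speed.

```python
# Minimal sketch of eps-SupInfoNCE for a labeled batch (assumption: cosine
# similarities on normalized features; positives = samples sharing the anchor's label).
import torch
import torch.nn.functional as F

def eps_sup_infonce(features, labels, eps=0.25):
    """features: (B, d) projected embeddings; labels: (B,) integer class labels."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t()                                    # pairwise cosine similarities
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    pos_mask, neg_mask = same & ~eye, ~same
    loss, count = 0.0, 0
    for i in range(B):
        s_pos = sim[i][pos_mask[i]]                    # s+_i: positives of anchor i
        s_neg = sim[i][neg_mask[i]]                    # s-_j: negatives of anchor i
        if s_pos.numel() == 0:
            continue
        # -sum_i log exp(s+_i) / (exp(s+_i - eps) + sum_j exp(s-_j))
        denom = torch.logsumexp(
            torch.cat([(s_pos - eps).unsqueeze(1),
                       s_neg.unsqueeze(0).expand(s_pos.size(0), -1)], dim=1), dim=1)
        loss = loss + (denom - s_pos).sum()
        count += 1
    return loss / max(count, 1)
```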


  44. Contrastive Learning - ϵ-margin metric
    • ... it actually contains a non-contrastive constraint [44] on the positive samples:
    s⁺_t − s⁺_i ≤ 0 ∀i, t.
      s⁻_j − s⁺_i ≤ −ϵ   ∀i, j    and    s⁺_t − s⁺_i ≤ 0   ∀i, t ≠ i
      (1/P) Σ_i max(0, {s⁻_j − s⁺_i + ϵ}_j, {s⁺_t − s⁺_i}_{t≠i}) ≈ ϵ − (1/P) Σ_i log [ exp(s⁺_i) / ( Σ_t exp(s⁺_t − ϵ) + Σ_j exp(s⁻_j) ) ]   (ϵ-SupCon)
    • When ϵ = 0 we retrieve exactly L^sup_out.
    • It tries to align all positive samples to a single point in the representation space,
    thus losing intra-class variability.
    44C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    35/91


  45. Supervised Contrastive Learning - Results
    Table: Accuracy on vision datasets. SimCLR and Max-Margin results from45. Results denoted with * are
    (re)implemented with mixed precision due to memory constraints.
    Dataset Network SimCLR Max-Margin SimCLR* CE* SupCon* ϵ-SupInfoNCE*
    CIFAR-10 ResNet-50 93.6 92.4 91.74±0.05 94.73±0.18 95.64±0.02 96.14±0.01
    CIFAR-100 ResNet-50 70.7 70.5 68.94±0.12 73.43±0.08 75.41±0.19 76.04±0.01
    ImageNet-100 ResNet-50 - - 66.14±0.08 82.1±0.59 81.99±0.08 83.3±0.06
    Table: Comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100 in terms of top-1 accuracy (%).
    Loss ϵ = 0.1 ϵ = 0.25 ϵ = 0.5
    ϵ-SupInfoNCE 83.25±0.39 83.02±0.41 83.3±0.06
    ϵ-SupCon 82.83±0.11 82.54±0.09 82.77±0.14
    45P. Khosla et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    36/91


  46. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    37/91


  47. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    38/91


  48. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    • Let y be the continuous/weak label of the anchor x and y_k that of a sample x_k.
    • Simple solution: threshold the distance d between y and y_k at τ to create positive and
    negative samples: x_k is a positive x⁺ if d(y, y_k) < τ → Problem: how to choose τ?
    38/91


  49. Contrastive Learning - Weakly supervised
    • The previous framework works well when samples are either positive or negative
    (unsupervised and supervised setting). But what about continuous/weak labels ?
    • Not possible to determine a hard boundary between positive and negative samples →
    all samples are positive and negative at the same time
    • Let y be the continuous/weak label of the anchor x and y_k that of a sample x_k.
    • Simple solution: threshold the distance d between y and y_k at τ to create positive and
    negative samples: x_k is a positive x⁺ if d(y, y_k) < τ → Problem: how to choose τ?
    • Our solution: define a degree of “positiveness” between samples using a kernel
    function w_k = K_σ(y − y_k), where 0 ≤ w_k ≤ 1.
    • New goal: learn f that maps samples with a high degree of positiveness (w_k ∼ 1) close
    together in the latent space and samples with a low degree (w_k ∼ 0) far away from each other.
    38/91
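    As an illustration, a minimal sketch of the degree of positiveness, assuming a Gaussian (RBF) kernel K_σ; any kernel bounded in [0, 1] would fit the definition above, and the function name and σ value are illustrative.

```python
# Minimal sketch of the "degree of positiveness" w_k = K_sigma(y - y_k),
# here with a Gaussian (RBF) kernel so that 0 < w_k <= 1 (assumption).
import torch

def degree_of_positiveness(y_anchor, y_batch, sigma=5.0):
    """y_anchor: scalar tensor; y_batch: (N,) continuous labels (e.g., age)."""
    diff = y_batch - y_anchor
    return torch.exp(-diff.pow(2) / (2 * sigma**2))   # w_k -> 1 when y_k is close to y_anchor

# e.g., anchor aged 20 vs. samples aged 15, 64, 20
w = degree_of_positiveness(torch.tensor(20.0), torch.tensor([15.0, 64.0, 20.0]))
```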


  50. Contrastive Learning - Weakly supervised
    Question: Which pair of subjects are closer in your opinion (brain MRI, axial plane)?
    Subject A
    Subject B
    Subject C
    39/91


  51. Contrastive Learning - Weakly supervised
    Question: Which pair of subjects are closer in your opinion (brain MRI, axial plane)?
    Subject A Age=15
    Subject B Age=64
    Subject C Age=20
    39/91


  52. Contrastive Learning - Weakly supervised
    Figure: SimCLR vs. y-Aware Contrastive Learning. Samples x_1, x_2, x_3 with meta-data
    y_1, y_2, y_3 ∈ ℝ are augmented (t_1, t′_1, t_2, t′_2, t_3, t′_3 ∼ T) and mapped to the latent space Z;
    in y-aware CL the alignment between samples is weighted by the kernel values w_σ(y_1, y_2), w_σ(y_2, y_3).
    40/91


  53. Contrastive Learning - Weakly supervised
    • In [46,47], we propose a new contrastive condition for weakly supervised problems:
      (w_k / Σ_j w_j) (s_t − s_k) ≤ 0   ∀ k, t ≠ k ∈ A
    • where A contains the indices of the samples ≠ x, and we consider as positives only the
    samples with w_k > 0 and align them with a strength proportional to w_k.
    • As before, we can transform it into an optimization problem, obtaining the y-aware loss:
      arg min_f Σ_k max(0, (w_k / Σ_t w_t) {s_t − s_k}_{t=1,...,N, t≠k}) ≈ L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    46B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    47B. Dufumier et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop. 2021.
    41/91
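    A minimal PyTorch sketch of the y-aware loss, under simplifying assumptions: one embedding per sample (each row acts as anchor), cosine similarities, a Gaussian kernel on the labels, and a denominator taken over all samples other than the anchor.

```python
# Minimal sketch of the y-aware loss (weakly supervised CL), simplified to a
# single view per sample; the anchor is each row of the batch in turn.
import torch
import torch.nn.functional as F

def y_aware_loss(features, y, sigma=5.0):
    """features: (N, d) embeddings; y: (N,) continuous proxy labels (e.g., age)."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t()                                        # similarities s_t
    w = torch.exp(-(y.unsqueeze(0) - y.unsqueeze(1)).pow(2) / (2 * sigma**2))
    eye = torch.eye(len(y), dtype=torch.bool, device=z.device)
    w = w.masked_fill(eye, 0.0)                            # exclude the anchor itself
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)     # w_k / sum_t w_t
    # log exp(s_k) / sum_t exp(s_t), with the sum taken over t != anchor
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    return -(w * log_prob).sum(dim=1).mean()
```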


  54. Contrastive Learning - Weakly supervised
    42/91


  55. Results - Linear evaluation
    (a) 5-fold CV Stratified on Site.
    (b) 5-fold CV Leave-Site-Out
    43/91


  56. Results - Robustness to σ and transformations
    ▶ Linear classification performance remains stable for a range σ ∈ [1, 5]
    ▶ Adding more transformations improves the representations (in line with SimCLR)
    ▶ Cutout remains competitive while being computationally cheap
    44/91


  57. Results - Fine-tuning
    Task (Ntrain) Test Set | Baseline | Age-Aware Contrastive [48] | Model Genesis [49] | Contrastive Learning [50] | VAE | Age Sup.
    SCZ vs. HC ↑ (Ntrain = 933) Internal Test 85.27±1.60 85.17±0.37 76.31±1.77 82.31±2.03 82.56±0.68 83.05±1.36
    SCZ vs. HC ↑ (Ntrain = 933) External Test 75.52±0.12 77.00±0.55 67.40±1.59 75.48±2.54 75.11±1.65 74.36±2.28
    BD vs. HC ↑ (Ntrain = 832) Internal Test 76.49±2.16 78.81±2.48 76.25±1.48 72.71±2.06 71.61±0.81 77.21±1.00
    BD vs. HC ↑ (Ntrain = 832) External Test 68.57±4.72 77.06±1.90 65.66±0.90 71.23±3.05 71.70±0.23 73.02±2.66
    ASD vs. HC ↑ (Ntrain = 1526) Internal Test 65.74±1.47 66.36±1.14 63.58±4.35 61.92±1.67 59.67±2.04 67.11±1.76
    ASD vs. HC ↑ (Ntrain = 1526) External Test 62.93±2.40 68.76±1.70 54.95±3.58 61.93±1.93 57.45±0.81 62.07±2.98
    Table: Fine-tuning results [51]. Pre-training strategies: weakly self-supervised (Age-Aware Contrastive),
    self-supervised (Model Genesis, Contrastive Learning), generative (VAE), discriminative (Age Sup.).
    All pre-trained models use a dataset of 8754 3D MRIs of healthy brains. We report the average AUC (%)
    and the standard deviation over three repetitions of each experiment. Baseline is a DenseNet121 backbone.
    48B. Dufumier et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In: MICCAI. 2021.
    49Z. Zhou et al. “Models Genesis”. In: MedIA (2021).
    50T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    51B. Dufumier et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In: NeuroImage
    (under review) ().
    45/91


  58. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    46/91


  59. Contrastive Learning - Regression
    • We could use L_y-aware also in regression. But...
      L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    • ... the numerator aligns x_k, and the denominator focuses more on the closest samples
    in the representation space.
    47/91


  60. Contrastive Learning - Regression
    • We could use L_y-aware also in regression. But...
      L_y-aware = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t=1}^{N} exp(s_t) ]
    • ... the numerator aligns x_k, and the denominator focuses more on the closest samples
    in the representation space. → Problem! These samples might have a greater degree
    of positiveness with the anchor than the considered x_k
    47/91


  61. Contrastive Learning - Regression
    • We thus propose two new losses:
      w_k (s_t − s_k) ≤ 0   if   w_t − w_k ≤ 0   ∀ k, t ≠ k ∈ A(i)
      L_thr = − Σ_k (w_k / Σ_t δ_{w_t} w_t) log [ exp(s_k) / Σ_{t≠k} δ_{w_t} exp(s_t) ]
    • L_thr repels only the samples whose label is farther from the anchor's than that of x_k
    (i.e., w_t ≤ w_k), but it still focuses more on the closest samples.
    48/91


  62. Contrastive Learning - Regression
      w_k [s_t (1 − w_t) − s_k] ≤ 0   ∀ k, t ≠ k ∈ A(i)
      L_exp = − (1 / Σ_t w_t) Σ_k w_k log [ exp(s_k) / Σ_{t≠k} exp(s_t (1 − w_t)) ]
    • L_exp has a repulsion strength inversely proportional to the similarity between the y values,
    whatever their distance.
    • The repulsion strength only depends on the distance in the kernel space → samples
    close in the kernel space will be close in the representation space.
    49/91
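    A minimal PyTorch sketch of L_exp under the same simplifying assumptions as before (one embedding per sample, cosine similarities, Gaussian kernel on the labels, denominator taken over all samples other than the anchor).

```python
# Minimal sketch of the L_exp regression loss: repulsion between samples is
# scaled by (1 - w_t), i.e., inversely proportional to their label similarity.
import torch
import torch.nn.functional as F

def l_exp(features, y, sigma=5.0):
    """features: (N, d) embeddings; y: (N,) continuous labels (e.g., age)."""
    z = F.normalize(features, dim=-1)
    s = z @ z.t()                                          # pairwise similarities s_t
    w = torch.exp(-(y.unsqueeze(0) - y.unsqueeze(1)).pow(2) / (2 * sigma**2))
    eye = torch.eye(len(y), dtype=torch.bool, device=z.device)
    w = w.masked_fill(eye, 0.0)
    # denominator: sum over t of exp(s_t * (1 - w_t)), excluding the anchor
    logits = (s * (1.0 - w)).masked_fill(eye, float("-inf"))
    log_prob = s - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(w * log_prob).sum() / w.sum().clamp_min(1e-8)
```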


  63. Results - OpenBHB Challenge
    • OpenBHB Challenge: age prediction with site-effect removal → brain age ≠
    chronological age in neurodegenerative disorders!
    • Ntrain
    : 5330 3D brain MRI scans (different subjects) from 71 acquisition sites.
    • Two private test data-sets (internal and external)
    • To participate https://ramp.studio/problems/brain_age_with_site_removal 50/91


  64. Results - Regression
    Method Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Ly−aware 2.66±0.00 6.60±0.17 4.10±0.01 1.82
    Lthr 2.95±0.01 5.73±0.15 4.10±0.01 1.74
    Lexp 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    Table: Comparison of contrastive losses.
    Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Baseline (ℓ1) DenseNet 2.55±0.01 8.0±0.9 7.13±0.05 3.34
    Baseline (ℓ1) ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86
    Baseline (ℓ1) AlexNet 2.72±0.01 8.3±0.2 4.66±0.05 2.21
    ComBat DenseNet 5.92±0.01 2.23±0.06 10.48±0.17 3.38
    ComBat ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88
    ComBat AlexNet 3.37±0.01 6.8±0.3 5.23±0.12 2.33
    Lexp DenseNet 2.85±0.00 5.34±0.06 4.43±0.00 1.84
    Lexp ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    Lexp AlexNet 2.77±0.01 5.8±0.1 4.01±0.01 1.71
    Table: Final scores on the OpenBHB Challenge leaderboard. MAE: Mean Absolute Error.
    BAcc: Balanced Accuracy for site prediction. Challenge score: Lc = BAcc^0.3 · MAE_ext.
    51/91


  65. The Issue of Biases
    • Contrastive learning is more robust than traditional end-to-end approaches, such as
    cross-entropy, against noise in the data or in the labels52.
    • What about data bias, such as the site-effect ?
    Method Model Int. MAE ↓ BAcc ↓ Ext. MAE ↓ Lc ↓
    Baseline (ℓ1
    ) ResNet-18 2.67±0.05 6.7±0.1 4.18±0.01 1.86
    ComBat ResNet-18 4.15±0.01 4.5±0.0 4.76±0.03 1.88
    Lexp ResNet-18 2.55±0.00 5.1±0.1 3.76±0.01 1.54
    • L_exp shows a small overfitting on the internal sites but also a low debiasing capability
    towards the site effect → BAcc should equal random chance: 1/n_sites = 1/64 ≈ 1.56%
    • We need to include debiasing regularization terms, such as FairKL [53].
    52F. Graf et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021.
    53C. A. Barbano et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    52/91


  66. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    53/91


  67. The Issue of Biases
    ▶ Contrastive learning can generally guarantee good downstream performance.
    However, it does not take into account the presence of data biases.
    ▶ Data biases: visual features that are correlated with the target downstream task
    (e.g., the yellow colour) but do not actually characterize it (the digit).
    ▶ We employ the notion of bias-aligned and bias-conflicting samples54:
    1. bias-aligned: shares the same bias attribute of the anchor. We denote it as x+,b
    2. bias-conflicting: has a different bias attribute. We denote it as x+,b′
    (a) Anchor x (b) Bias-aligned x+,b (c) Bias-conflicting x+,b′
    54J. Nam et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020.
    54/91


  68. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample.
    55/91


  69. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample. → We would like them to be equidistant
    from the anchor ! Remove the effect of the bias
    55/91


  70. The Issue of Biases
    ▶ Given an anchor x, if the bias is “strong” and easy-to-learn, a positive bias-aligned
    sample x+,b will probably be closer to the anchor x in the representation space
    than a positive bias-conflicting sample. → We would like them to be equidistant
    from the anchor ! Remove the effect of the bias
    ▶ Thus, we say that there is a bias if we can identify an ordering on the learned
    representations: d^{+,b}_i < d^{+,b′}_k ≤ d^{−,·}_j − ϵ   ∀i, k, j
    Note
    This represents the worst-case scenario, where the ordering is total (i.e., ∀i, k, j). Of
    course, there can also be cases in which the bias is not as strong, and the ordering may
    be partial. Furthermore, the same reasoning can be applied to negative samples
    (omitted for brevity).
    55/91


  71. The Issue of Biases
    • Ideally, we would like d^{+,b′}_k − d^{+,b}_i = 0 ∀i, k and d^{−,b′}_t − d^{−,b}_j = 0 ∀t, j
    • However, this condition is very strict, as it would enforce a uniform distance among all
    positive (resp. negative) samples.
    • We propose a more relaxed condition where we force the distributions of distances,
    {d^{+,b′}_k} and {d^{+,b}_i}, to be similar (same for the negatives).
    • Assuming that the distance distributions are normal, B^{+,b} ∼ N(µ_{+,b}, σ²_{+,b}) and
    B^{+,b′} ∼ N(µ_{+,b′}, σ²_{+,b′}), we minimize the Kullback-Leibler divergence between the two
    distributions with the FairKL regularization term:
      R_FairKL = D_KL(B^{+,b} || B^{+,b′}) = (1/2) [ (σ²_{+,b} + (µ_{+,b} − µ_{+,b′})²) / σ²_{+,b′} − log(σ²_{+,b} / σ²_{+,b′}) − 1 ]
    56/91
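    A minimal sketch of the FairKL term, assuming the anchor-positive distances of bias-aligned and bias-conflicting samples have already been collected and are treated as Gaussian, exactly as in the closed-form KL above.

```python
# Minimal sketch of the FairKL regularizer: KL divergence between the (assumed
# Gaussian) distance distributions of bias-aligned vs. bias-conflicting positives.
import torch

def fairkl(d_pos_aligned, d_pos_conflicting, eps=1e-8):
    """d_pos_aligned: 1D tensor of anchor-positive distances for bias-aligned samples;
    d_pos_conflicting: same for bias-conflicting samples."""
    mu_b, var_b = d_pos_aligned.mean(), d_pos_aligned.var() + eps
    mu_bp, var_bp = d_pos_conflicting.mean(), d_pos_conflicting.var() + eps
    # D_KL( N(mu_b, var_b) || N(mu_bp, var_bp) )
    return 0.5 * ((var_b + (mu_b - mu_bp) ** 2) / var_bp - torch.log(var_b / var_bp) - 1.0)
```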


  72. FairKL visual explanation
    • Positive and negative samples are marked with a + or − symbol respectively; filling colors
    represent different biases. Comparison between FairKL, EnD [55], which only constrains
    the first moments (µ_{+,b} = µ_{+,b′}), and EnD with margin ϵ.
    (a) EnD: partial ordering.   (b) EnD + ϵ: still a partial ordering.   (c) FairKL: no partial ordering.
    55E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021.
    57/91


  73. FairKL visual explanation
    • Simulated toy example where distances follow a Gaussian distribution. Bias-aligned
    samples in blue. Bias-conflicting samples in orange.
    (a) µ_{+,b} ≠ µ_{+,b′}, σ²_{+,b} = σ²_{+,b′}   (b) µ_{+,b} = µ_{+,b′}, σ²_{+,b} ≠ σ²_{+,b′}   (c) µ_{+,b} = µ_{+,b′}, σ²_{+,b} = σ²_{+,b′}
    58/91


  74. Results - Biased-MNIST
    Table: Top-1 accuracy (%) on Biased-MNIST (bias = background color). Reference results from56. Results
    denoted with * are re-implemented without color-jittering and bias-conflicting oversampling.
    Method 0.999 0.997 0.995 0.99
    CE 11.8±0.7 62.5±2.9 79.5±0.1 90.8±0.3
    LNL57 18.2±1.2 57.2±2.2 72.5±0.9 86.0±0.2
    ϵ-SupInfoNCE 33.16±3.57 73.86±0.81 83.65±0.36 91.18±0.49
    BiasCon+BiasBal* 30.26±11.08 82.83±4.17 88.20±2.27 95.04±0.86
    EnD58 59.5±2.3 82.70±0.3 94.0±0.6 94.8±0.3
    BiasBal 76.8±1.6 91.2±0.2 93.9±0.1 96.3±0.2
    BiasCon+CE* 15.06±2.22 90.48±5.26 95.95±0.11 97.67±0.09
    ϵ-SupInfoNCE + FairKL 90.51±1.55 96.19±0.23 97.00±0.06 97.86±0.02
    56Y. Hong et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS. 2021.
    57B. Kim et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019.
    58E. Tartaglione et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR. 2021.
    59/91


  75. Results - Corrupted CIFAR-10 and bFFHQ
    Table: Top-1 accuracy (%) on Corrupted CIFAR-10 with different corruption ratio (%) where each class is
    correlated with a certain texture. bFFHQ: facial images where most of the females are young and most of
    the males are old.
    Corrupted CIFAR-10 bFFHQ
    Ratio Ratio
    Method 0.5 1.0 2.0 5.0 0.5
    Vanilla 23.08±1.25 25.82±0.33 30.06±0.71 39.42±0.64 56.87±2.69
    EnD 19.38±1.36 23.12±1.07 34.07±4.81 36.57±3.98 56.87±1.42
    HEX 13.87±0.06 14.81±0.42 15.20±0.54 16.04±0.63 52.83±0.90
    ReBias 22.27±0.41 25.72±0.20 31.66±0.43 43.43±0.41 59.46±0.64
    LfF 28.57±1.30 33.07±0.77 39.91±0.30 50.27±1.56 62.2±1.0
    DFA 29.95±0.71 36.49±1.79 41.78±2.29 51.13±1.28 63.87±0.31
    ϵ-SupInfoNCE + FairKL 33.33±0.38 36.53±0.38 41.45±0.42 50.73±0.90 64.8±0.43
    60/91


  76. Results - Biased ImageNet
    Table: Top-1 accuracy (%) on 9-Class ImageNet biased and unbiased (UNB) sets, and ImageNet-A (IN-A).
    They have textural biases.
    Vanilla SIN LM RUBi ReBias LfF SoftCon ϵ-SupInfoNCE
    + FairKL
    Biased 94.0±0.1 88.4±0.9 79.2±1.1 93.9±0.2 94.0±0.2 91.2±0.1 95.3±0.2 95.1±0.1
    UNB 92.7±0.2 86.6±1.0 76.6±1.2 92.5±0.2 92.7±0.2 89.6±0.3 94.1±0.3 94.8±0.3
    IN-A 30.5±0.5 24.6±2.4 19.0±1.2 31.0±0.2 30.5±0.2 29.4±0.8 34.1±0.6 35.7±0.5
    61/91


  77. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    62/91


  78. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion.
    63/91


  79. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion. → they may induce bias and be inadequate for medical imaging !
    ▶ dominant objects can prevent the model from learning features of smaller objects
    ▶ few, irrelevant and easy-to-learn features are sufficient to collapse the
    representation (a.k.a feature suppression)59
    ▶ in medical imaging transformations need to preserve discriminative anatomical
    information while removing unwanted noise (e.g., crop → tumor suppression)
    59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    63/91


  80. Prior in Contrastive Learning
    • In unsupervised CL, since labels are unknown, positive and negative samples are
    defined via transformations/augmentations → the choice of augmentations conditions
    the quality of the representations.
    • The most-used augmentations for visual representations involve aggressive crop and
    color distortion. → they may induce bias and be inadequate for medical imaging !
    ▶ dominant objects can prevent the model from learning features of smaller objects
    ▶ few, irrelevant and easy-to-learn features are sufficient to collapse the
    representation (a.k.a feature suppression)59
    ▶ in medical imaging transformations need to preserve discriminative anatomical
    information while removing unwanted noise (e.g., crop → tumor suppression)
    • Question: can we integrate prior information into CL to make it less dependent on
    augmentations ?
    59T. Chen et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    63/91


  81. Augmentation graph
    • As prior information, we consider weak attributes (e.g., age) or representations of a
    pre-trained generative model (e.g., VAE, GAN).
    • Using the theoretical understanding of CL through the augmentation graph60, we
    make the connection with kernel theory and introduce a novel loss61
    60Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    61B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    64/91


  82. Augmentation graph
    • Augmentation graph: each point is an original image. Two points are connected if
    they can be transformed into the same augmented image (support=light disk).
    • Colors represent semantic (unknown) classes.
    • An incomplete augmentation graph (1) (intra-class samples not connected due to
    augmentations not adapted), is reconnected (3) using a kernel defined on prior (2).
    65/91


  83. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  84. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    • Problem: current InfoNCE-based losses (e.g., the y-aware loss) cannot give tight bounds
    on the classification loss under these assumptions.
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  85. Weaker Assumptions
    • We assume that the extended graph is class-connected → Previous works assumed
    that the augmentation graph was class-connected62’63’64
    • No need for optimal augmentations. If augmentations are inadequate → use a
    kernel such that disconnected points in the augmentation graph are connected in the
    Kernel graph.
    • Problem: current InfoNCE-based losses (e.g., the y-aware loss) cannot give tight bounds
    on the classification loss under these assumptions.
    We need a new loss. Why not leverage multiple views?
    62J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    63Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    64N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    66/91


  86. Contrastive Learning - Multi-views centroids
    • We can create several views (i.e., transformations) of each sample.
    • Let x^v_i be the v-th view of sample x_i and V the total number of views per sample:
      s⁺_i = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_i))      (similarities between views of x_i)
      s⁺_j = (1/V²) Σ_{v,v′} s(f(x^v_j), f(x^{v′}_j))      (similarities between views of x_j)
      s⁻_ij = (1/V²) Σ_{v,v′} s(f(x^v_i), f(x^{v′}_j))     (similarities between views of x_i and x_j)
    • We propose to look for an f such that [65]:
      s⁺_i + s⁺_j > 2 s⁻_ij + ϵ   ∀i ≠ j
      arg min_f log [ exp(−ϵ) + Σ_{i≠j} exp(−s⁺_i − s⁺_j + 2 s⁻_ij) ]
    65B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    67/91


  87. Decoupled Uniformity
    • Call µ_i = (1/V) Σ_{v=1}^{V} f(x^v_i) a centroid, defined as the average of the representations of
    the views of sample x_i.
    • In the limit ϵ → ∞, we retrieve a new loss that we call Decoupled Uniformity [66]:
      L̂^de_unif = log (1/(n(n−1))) Σ_{i≠j} exp(−s⁺_i − s⁺_j + 2 s⁻_ij) = log (1/(n(n−1))) Σ_{i≠j} exp(−||µ_i − µ_j||²)
    • It repels distinct centroids through an average pairwise Gaussian potential (similar to
    uniformity [67]).
    66B. Dufumier et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    67T. Wang et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on the
    Hypersphere”. In: ICML. 2020.
    68/91
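    A minimal PyTorch sketch of the empirical Decoupled Uniformity loss, assuming the V view embeddings of each sample are already computed; they are L2-normalized and averaged into centroids inside the function.

```python
# Minimal sketch of Decoupled Uniformity: average the V view embeddings of each
# sample into a centroid, then repel distinct centroids with a Gaussian potential.
import torch
import torch.nn.functional as F

def decoupled_uniformity(view_features):
    """view_features: (n, V, d) embeddings of V views for each of n samples."""
    z = F.normalize(view_features, dim=-1)
    mu = z.mean(dim=1)                                   # centroids mu_i, shape (n, d)
    sq_dist = torch.cdist(mu, mu).pow(2)                 # ||mu_i - mu_j||^2
    n = mu.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=mu.device)
    # log (1/(n(n-1))) sum_{i != j} exp(-||mu_i - mu_j||^2)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```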


  88. Decoupled Uniformity - Properties
    • The properties of this new loss L̂^de_unif = log (1/(n(n−1))) Σ_{i≠j} exp(−||µ_i − µ_j||²) are:
    1. It implicitly imposes alignment between positives → no need to explicitly add an
    alignment term.
    2. It solves the negative-positive coupling problem of InfoNCE [68] → positive samples
    are not attracted (alignment) and repelled (uniformity) at the same time.
    3. It allows the integration of prior knowledge z(x_i) about x_i (weak attribute,
    generative model representation) → use a kernel K_σ(z(x_i), z(x_j)) on the priors to
    better estimate the centroids. Intuitively, if (z(x_i), z(x_j)) are close in the kernel
    space, so should be (f(x_i), f(x_j)) in the representation space.
    68T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    69/91


  89. Decoupled Uniformity - Properties
    • Similarly to69, we compute the gradient w.r.t. the v-th view f(x_k^v):

      ∇_{f(x_k^v)} L_unif^de = 2 Σ_{j≠k} w_{k,j} µ_j  −  2 w_k µ_k        (3)
                               (repel hard negatives)    (align hard positives)

    • w_{k,j} = exp(−||µ_k − µ_j||²) / Σ_{p≠q} exp(−||µ_p − µ_q||²) → quantifies whether the negative sample x_j
    is “hard” (i.e., close to the positive sample x_k)
    • w_k = Σ_{j≠k} w_{k,j}, s.t. Σ_k w_k = 1 → quantifies whether the positive sample x_k is “hard” (i.e.,
    close to other samples in the batch)
    • We thus implicitly align all views of each sample in the same direction (in particular the
    hard positives) and repel the hard negatives (a short sketch of these weights follows below).
    • No positive-negative coupling → w_{k,j} only depends on the relative positions of the centroids µ
    69C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
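    A minimal sketch of these hardness weights (same assumptions as the previous sketches, not the authors' code); by construction the weights over all ordered pairs sum to 1:

```python
import torch

def hardness_weights(mu: torch.Tensor):
    # mu: (n, d) centroids; returns w_kj (n x n, zero diagonal) and w_k (n,)
    g = torch.exp(-torch.cdist(mu, mu).pow(2))         # exp(-||mu_k - mu_j||^2)
    g.fill_diagonal_(0.0)                              # exclude k == j
    w_kj = g / g.sum()                                 # normalize over all pairs p != q
    w_k = w_kj.sum(dim=1)                              # w_k = sum_{j != k} w_kj; sums to 1 over k
    return w_kj, w_k

# a larger w_k flags a "hard" positive whose centroid sits close to other centroids in the batch
mu = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
w_kj, w_k = hardness_weights(mu)
```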
    70/91


  90. Decoupled Uniformity without Prior
    • Comparison of Decoupled Uniformity (without prior) with the InfoNCE70 and DC71 losses. Batch
    size n = 256. All models are trained for 400 epochs.

    Dataset       Network    L_InfoNCE     L_DC          L_unif^de
    CIFAR-10      ResNet18   82.18±0.30    84.87±0.27    85.05±0.37
    CIFAR-100     ResNet18   55.11±0.20    58.27±0.34    58.41±0.05
    ImageNet100   ResNet50   68.76         73.98         77.18
    70T. Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    71C.-H. Yeh et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    71/91


  91. Decoupled Uniformity with Prior
    • How can we estimate the centroids using a kernel on the prior? → conditional mean
    embedding theory72
    Definition - Empirical Kernel Decoupled Uniformity Loss
    Let (x_i)_{i∈[1..n]} be the n samples with their V views x_i^v, and K_n = [K_σ(z(x_i), z(x_j))]_{i,j∈[1..n]} the
    kernel prior matrix, where K_σ is a standard kernel (e.g., Gaussian or cosine). We define
    the new centroid estimator as

      µ̂_x̄_j = (1/V) Σ_{v=1}^{V} Σ_{i=1}^{n} α_{i,j} f(x_i^v)   with   α_{i,j} = ((K_n + λ n I_n)^{−1} K_n)_{ij},

    where λ = O(n^{−1/2}) is a regularization constant.
    The empirical Kernel Decoupled Uniformity loss is then:

      L̂_unif^de(f) := log [ 1/(n(n−1)) Σ_{i≠j} exp(−||µ̂_x̄_i − µ̂_x̄_j||²) ]
    72L. Song et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013).
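    A minimal sketch of this estimator and of the resulting loss (not the authors' implementation), reusing the (n, V, d) view tensor and an (n, n) prior kernel matrix such as the RBF kernel sketched earlier; torch.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import torch

def kernel_decoupled_uniformity(z: torch.Tensor, K: torch.Tensor, lam=None) -> torch.Tensor:
    # z: (n, V, d) view embeddings; K: (n, n) prior kernel matrix [K_sigma(z(x_i), z(x_j))]_ij
    n = z.shape[0]
    lam = n ** (-0.5) if lam is None else lam           # lambda = O(n^{-1/2})
    eye = torch.eye(n, device=K.device, dtype=K.dtype)
    alpha = torch.linalg.solve(K + lam * n * eye, K)    # alpha = (K_n + lambda*n*I_n)^{-1} K_n
    mu = z.mean(dim=1)                                  # (1/V) sum_v f(x_i^v)
    mu_hat = alpha.T @ mu                               # mu_hat_j = sum_i alpha_{i,j} mu_i
    sq_dist = torch.cdist(mu_hat, mu_hat).pow(2)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=z.device)
    return torch.log(torch.exp(-sq_dist[off_diag]).mean())
```

    Note that with K = I_n (no informative prior) the estimated centroids reduce to the plain view averages up to a 1/(1 + λn) scaling, so the sketch falls back to the prior-free Decoupled Uniformity above.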
    72/91


  92. Decoupled Uniformity with Prior
      L̂_unif^de(f) := log [ 1/(n(n−1)) Σ_{i≠j} exp(−||µ̂_x̄_i − µ̂_x̄_j||²) ]

    • The added computational cost is roughly O(n³) (to compute the inverse of the n × n kernel
    matrix), but it remains negligible compared to the back-propagation time.
    • We obtain tight bounds on the classification loss with weaker assumptions than current
    work (extended-graph connection)73,74,75
    73J. Z. HaoChen et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss”. In:
    NeurIPS. 2021.
    74Y. Wang et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR. 2022.
    75N. Saunshi et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML. 2022.
    73/91


  93. Linear evaluation on ImageNet100
    Model                   ImageNet100
    SimCLR                  68.76
    BYOL                    72.26
    CMC                     73.58
    DCL                     74.6
    AlignUnif               76.3
    DC                      73.98
    BigBiGAN                72.0
    Decoupled Unif          77.18
    K_GAN Decoupled Unif    78.02
    Supervised              82.1±0.59
    Table: Linear evaluation accuracy (%) on ImageNet100 using ResNet50 trained for 400 epochs with batch
    size n = 256. We leverage the BigBiGAN representation76, pre-trained on ImageNet, as prior. We define the
    kernel K_GAN(x̄, x̄′) = K(z(x̄), z(x̄′)) with K an RBF kernel and z(·) BigBiGAN's encoder.
    76J. Donahue et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    74/91


  94. Chest radiography interpretation - CheXpert
    Model                   Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
    SimCLR                  82.42        77.62         90.52          89.08  86.83
    BYOL                    83.04        81.54         90.98          90.18  85.99
    MoCo-CXR∗               75.8         73.7          77.1           86.7   85.0
    GLoRIA                  86.70        86.39         90.41          90.58  91.82
    CCLK                    86.31        83.67         92.45          91.59  91.23
    K_Gl Dec. Unif (ours)   86.92        85.88         93.03          92.39  91.93
    Supervised∗             81.6         79.7          90.5           86.8   89.9
    Table: AUC scores (%) under linear evaluation for discriminating 5 pathologies on CheXpert. The ResNet18
    backbone is trained for 400 epochs (batch size n = 1024). As prior, we use GLoRIA77, a multi-modal approach
    trained with (medical report, image) pairs, and an RBF kernel K_Gl.
    77S.-C. Huang et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021.
    75/91


  95. Bipolar disorder detection
    Model                         BD vs HC
    SimCLR                        60.46±1.23
    BYOL                          58.81±0.91
    MoCo v2                       59.27±1.50
    Model Genesis                 59.94±0.81
    VAE                           52.86±1.24
    K_VAE Decoupled Unif (ours)   62.19±1.58
    Supervised                    67.42±0.31
    Table: Linear evaluation AUC scores (%) using a 5-fold leave-site-out CV scheme with a DenseNet121
    backbone on the brain MRI dataset BIOBD (N ∼ 700). As prior, we use a VAE representation pre-trained on
    BHB to define K_VAE(x̄, x̄′) = K(µ(x̄), µ(x̄′)), where µ(·) is the mean of the Gaussian distribution of x̄ in
    the VAE latent space and K is a standard RBF kernel.
    76/91


  96. Removing optimal augmentations
                      CIFAR-10                                CIFAR-100
    Model             All     w/o Color  w/o Color & Crop     All     w/o Color  w/o Color & Crop
    SimCLR            83.06   65.00      24.47                55.11   37.63      6.62
    BYOL              84.71   81.45      50.17                53.15   49.59      27.9
    Barlow Twins      81.61   53.97      47.52                52.27   28.52      24.17
    VAE∗              41.37   41.37      41.37                14.34   14.34      14.34
    DCGAN∗            66.71   66.71      66.71                26.17   26.17      26.17
    K_GAN Dec. Unif   85.85   82.0       69.19                58.42   54.17      35.98
    Table: When removing optimal augmentations, generative models provide a good kernel to connect
    intra-class points not connected by augmentations. All models are trained for 400 epochs with batch size
    n = 256, except BYOL and SimCLR, which are trained with a bigger batch size n = 1024.
    77/91


  97. Summary
    1. Introduction
    1.1 Transfer Learning
    2. Contrastive Learning
    2.1 A geometric approach
    2.2 ϵ-margin metric learning
    2.3 Weakly supervised
    2.4 Regression
    3. Debiasing with FairKL
    4. Prior in Contrastive Learning
    5. Conclusions and Perspectives
    78/91


  98. Conclusions
    • Thanks to an ϵ-margin geometric approach, we better formalized and understood
    current CL losses
    • We proposed new losses for unsupervised, supervised, weakly-supervised and
    regression settings
    • Using a geometric approach and the conditional mean embedding theory, we also
    tackled two important problems in CL: data biases (FairKL) and prior inclusion
    (Decoupled Uniformity)
    • We applied the proposed geometric CL losses on computer vision and medical
    imaging datasets obtaining SOTA results
    79/91


  99. Non-Contrastive Learning
    • In 1992, Hinton78 proposed a Siamese network that maximizes agreement (i.e., MI)
    between 1D output signals (without negatives).
    • In 2020/2021, we came back to the same idea, but... architectural tricks such as
    the momentum encoder79 or stop-gradient operations are needed to avoid collapsing.
    • It is not clear why and how they work so well... Maybe a geometric approach, as in80,81?
    78S. Becker et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992).
    79J.-B. Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    80C. Zhang et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022.
    81Q. Garrido et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023.
    80/91


  100. Team
    • The work presented here has been accomplished during the PhDs of:
    Carlo Alberto Barbano and Benoit Dufumier
    • It has been published in:
    1. Barbano et al., “Unbiased Supervised Contrastive Learning”
    2. Barbano et al., “Contrastive learning for regression in multi-site brain age
    prediction”
    3. Dufumier et al., “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI
    Classification”
    4. Dufumier et al., “Integrating Prior Knowledge in Contrastive Learning with Kernel”
    81/91


  101. Team
    82/91


  102. Institutions & Partners
    83/91


  103. References
    Barbano, C. A. et al. “Contrastive learning for regression in multi-site brain age prediction”. In: IEEE ISBI. 2023.
    Barbano, C. A. et al. “Unbiased Supervised Contrastive Learning”. In: ICLR. 2023.
    Bardes, A. et al. “VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning”. In:
    ICLR. 2022.
    Becker, S. et al. “Self-organizing neural network that discovers surfaces in random ...”. In: Nature (1992).
    Caron, M. et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”. In: NeurIPS.
    2020.
    Chen, T. et al. “A Simple Framework for Contrastive Learning of Visual Representations”. In: ICML. 2020.
    Chen, T. et al. “Intriguing Properties of Contrastive Losses”. In: NeurIPS. 2021.
    Chopra, S. et al. “Learning a Similarity Metric Discriminatively, with Application to Face Verification”. In: CVPR.
    2005.
    Deng, J. et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR. 2009.
    Doersch, C. et al. “Unsupervised Visual Representation Learning by Context Prediction”. In: ICCV. 2015.
    Donahue, J. et al. “Large Scale Adversarial Representation Learning”. In: NeurIPS. 2019.
    Dufumier, B. et al. “Conditional Alignment and Uniformity for Contrastive Learning...”. In: NeurIPS Workshop.
    2021.
    84/91


  104. References
    Dufumier, B. et al. “Contrastive Learning with Continuous Proxy Meta-data for 3D MRI Classification”. In:
    MICCAI. 2021.
    Dufumier, B. et al. “Deep Learning Improvement over Standard Machine Learning in Neuroimaging”. In:
    NeuroImage (under review) ().
    Dufumier, B. et al. “Integrating Prior Knowledge in Contrastive Learning with Kernel”. In: ICML. 2023.
    Dufumier, B. et al. “OpenBHB: a Large-Scale Multi-Site Brain MRI Data-set for Age Prediction and Debiasing”.
    In: NeuroImage (2022).
    Dwibedi, D. et al. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning”. In: ICCV. 2021.
    Frosst, N. et al. “Analyzing and Improving Representations with the Soft Nearest Neighbor”. In: ICML. 2019.
    Garrido, Q. et al. “On the duality between contrastive and non-contrastive self-supervised”. In: ICLR. 2023.
    Graf, F. et al. “Dissecting Supervised Contrastive Learning”. In: ICML. 2021.
    Grill, J.-B. et al. “Bootstrap your own latent: A new approach to self-supervised Learning”. In: NeurIPS. 2020.
    HaoChen, J. Z. et al. “Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive
    Loss”. In: NeurIPS. 2021.
    He, K. et al. “Masked Autoencoders Are Scalable Vision Learners”. In: CVPR. 2022.
    Hjelm, R. D. et al. “Learning deep representations by mutual information estimation ...”. In: ICLR. 2019.
    85/91


  105. References
    Hong, Y. et al. “Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning”. In: NeurIPS.
    2021.
    Huang, S.-C. et al. “GLoRIA: A Multimodal Global-Local Representation Learning Framework ...”. In: ICCV. 2021.
    Khosla, P. et al. “Supervised Contrastive Learning”. In: NeurIPS. 2020.
    Kim, B. et al. “Learning Not to Learn: Training Deep Neural Networks With Biased Data”. In: CVPR. 2019.
    Lin, T.-Y. et al. “Microsoft COCO: Common Objects in Context”. In: ECCV. 2014.
    Littlejohns, T. J. et al. “The UK Biobank imaging enhancement of 100,000 participants:” in: Nature
    Communications (2020).
    Matsoukas, C. et al. “What Makes Transfer Learning Work for Medical Images”. In: CVPR. 2022.
    Mustafa, B. et al. Supervised Transfer Learning at Scale for Medical Imaging. 2021.
    Nam, J. et al. “Learning from Failure: De-biasing Classifier from Biased Classifier”. In: NeurIPS. 2020.
    Neyshabur, B. et al. “What is being transferred in transfer learning?” In: NeurIPS. 2020.
    Oord, A. v. d. et al. Representation Learning with Contrastive Predictive Coding. 2018.
    Poole, B. et al. “On Variational Bounds of Mutual Information”. In: ICML. 2019.
    Raghu, M. et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”. In: NeurIPS. 2019.
    Salakhutdinov, R. et al. “Learning a Nonlinear Embedding by Preserving Class ...”. In: AISTATS. 2007.
    86/91


  106. References
    Saunshi, N. et al. “Understanding Contrastive Learning Requires Incorporating Inductive Biases”. In: ICML.
    2022.
    Schroff, F. et al. “FaceNet: A Unified Embedding for Face Recognition and Clustering”. In: CVPR. 2015.
    Sohn, K. “Improved Deep Metric Learning with Multi-class N-pair Loss Objective”. In: NIPS. 2016.
    Song, H. O. et al. “Deep Metric Learning via Lifted Structured Feature Embedding”. In: CVPR. 2016.
    Song, L. et al. “Kernel Embeddings of Conditional Distributions”. In: IEEE Signal Processing Magazine (2013).
    Tartaglione, E. et al. “EnD: Entangling and Disentangling deep representations for bias correction”. In: CVPR.
    2021.
    Tschannen, M. et al. “On Mutual Information Maximization for Representation Learning”. In: ICLR. 2020.
    Wang, T. et al. “Understanding Contrastive Representation Learning through Alignment and Uniformity on
    the Hypersphere”. In: ICML. 2020.
    Wang, Y. et al. “Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via ...”. In: ICLR.
    2022.
    Wu, Z. et al. “Unsupervised Feature Learning via Non-parametric Instance Discrimination”. In: CVPR. 2018.
    Yeh, C.-H. et al. “Decoupled Contrastive Learning”. In: ECCV. 2022.
    Yu, B. et al. “Deep Metric Learning With Tuplet Margin Loss”. In: ICCV. 2019.
    Zhang, C. et al. “How Does SimSiam Avoid Collapse Without Negative Samples?” In: ICLR. 2022.
    87/91


  107. References
    Zhou, J. et al. “Image BERT Pre-training with Online Tokenizer”. In: ICLR. 2022.
    Zhou, Z. et al. “Models Genesis”. In: MedIA (2021).
    88/91


  108. Supplementary - Data
    Datasets              Disease   # Subjects  # Scans  Age      Sex (%F)  # Sites  Accessibility
    OpenBHB
      IXI                 -         559         559      48 ± 16  55        3        Open
      CoRR                -         1366        2873     26 ± 16  50        19       Open
      NPC                 -         65          65       26 ± 4   55        1        Open
      NAR                 -         303         323      22 ± 5   58        1        Open
      RBP                 -         40          40       22 ± 5   52        1        Open
      GSP                 -         1570        1639     21 ± 3   58        5        Open
      ABIDE I             ASD       567         567      17 ± 8   12        20       Open
      ABIDE I             HC        566         566      17 ± 8   17        20       Open
      ABIDE II            ASD       481         481      14 ± 8   15        19       Open
      ABIDE II            HC        542         555      15 ± 9   30        19       Open
      Localizer           -         82          82       25 ± 7   56        2        Open
      MPI-Leipzig         -         316         317      37 ± 19  40        2        Open
      HCP                 -         1113        1113     29 ± 4   45        1        Restricted
      OASIS 3             Only HC   578         1166     68 ± 9   62        4        Restricted
      ICBM                -         606         939      30 ± 12  45        3        Restricted
    BIOBD                 BD        306         306      40 ± 12  55        8        Private
    BIOBD                 HC        356         356      40 ± 13  55        8        Private
    SCHIZCONNECT-VIP      SCZ       275         275      34 ± 12  28        4        Open
    SCHIZCONNECT-VIP      HC        329         329      32 ± 13  47        4        Open
    PRAGUE                HC        90          90       26 ± 7   55        1        Private
    BSNIP                 HC        198         198      32 ± 12  58        5        Private
    BSNIP                 SCZ       190         190      34 ± 12  30        5        Private
    BSNIP                 BD        116         116      37 ± 12  66        5        Private
    CANDI                 HC        25          25       10 ± 3   41        1        Open
    CANDI                 SCZ       20          20       13 ± 3   45        1        Open
    CNP                   HC        123         123      31 ± 9   47        1        Open
    CNP                   SCZ       50          50       36 ± 9   24        1        Open
    CNP                   BD        49          49       35 ± 9   43        1        Open
    Total                           10882       13412    32 ± 19  50        101
    89/91


  109. Supplementary - Data
    Task         Split           Datasets                        # Subjects  # Scans  Age      Sex (%F)
    SCZ vs. HC   Training        SCHIZCONNECT-VIP, CNP,          933         933      33 ± 12  43
                                 PRAGUE, BSNIP, CANDI
                 Validation                                      116         116      32 ± 11  37
                 External Test                                   133         133      32 ± 12  45
                 Internal Test                                   118         118      33 ± 13  34
    BD vs. HC    Training        BIOBD, BSNIP,                   832         832      38 ± 13  56
                                 CNP, CANDI
                 Validation                                      103         103      37 ± 12  51
                 External Test                                   131         131      37 ± 12  52
                 Internal Test                                   107         107      37 ± 13  56
    ASD vs. HC   Training        ABIDE 1+2                       1488        1526     16 ± 8   17
                 Validation                                      188         188      17 ± 10  17
                 External Test                                   207         207      12 ± 3   30
                 Internal Test                                   184         186      17 ± 9   18
    Table: Training/Validation/Test splits used for the detection of the 3 mental disorders. Out-of-site images
    always make up the external test set, and each participant falls into only one split, avoiding data leakage.
    The internal test set is always stratified according to age, sex, site and diagnosis with respect to the
    training and validation sets. All models use the same splits.
    90/91


  110. Supplementary - Results
    Figure: UMAP Visualization on ADNI
    91/91
