
Analysis on Negative Sample Size in Contrastive Unsupervised Representation Learning

Kento Nozawa
November 09, 2022

Transcript

1. Analysis on Negative Sample Size in Contrastive Unsupervised Representation Learning
   Kento Nozawa¹,² ᵅ, Issei Sato¹, Han Bao¹,² ᵝ, Yoshihiro Nagano¹,² ᵝ
   Currently, α: IBM Research Tokyo; β: Kyoto University
2. Unsupervised representation learning
   Goal: learning a feature encoder f that can extract generic representations automatically from an unlabelled dataset.
   • Feature encoder: f : 𝒳 → ℝ^d, e.g., a deep neural net.
   • Representation: a real-valued vector z = f(x) ∈ ℝ^d.
   [Figure: a massive unlabelled dataset {x_i}_{i=1}^M; an input x is mapped by f(·) to an extracted representation z ∈ ℝ^d, e.g., (10.1, −0.3, …, 1.7).]
   (A toy code sketch of such an encoder follows below.)
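To make the encoder concrete, here is a minimal, hypothetical PyTorch sketch of an f : 𝒳 → ℝ^d. The tiny architecture and the choice d = 128 are illustrative assumptions, not the settings used in the talk's experiments.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A toy feature encoder f: X -> R^d (architecture is illustrative only)."""

    def __init__(self, d: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, d),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of images of shape (B, 3, H, W); returns one z in R^d per image.
        return self.backbone(x)

f = Encoder(d=128)
x = torch.randn(4, 3, 32, 32)   # a mini-batch of unlabelled images
z = f(x)                        # extracted representations, shape (4, 128)
```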
3. Contrastive unsupervised representation learning
   Draw K + 1 samples from an unlabelled training dataset.
   • x: anchor sample.
   • x⁻: negative sample. It can be a set of samples {x⁻_k}_{k=1}^K.
   [Figure: an anchor image x and negative images x⁻_k.]
4. Contrastive unsupervised representation learning
   Apply data augmentation to the samples:
   • For the anchor image x, we draw and apply two data augmentations a, a⁺.
   • For each negative image x⁻, we draw and apply a single data augmentation a⁻.
   [Figure: the anchor x yields augmented views a and a⁺; each negative x⁻_k yields a⁻_k.]
5. Contrastive unsupervised representation learning
   Extract the representations z, z⁺, z⁻ with the feature encoder f.
   [Figure: the views a, a⁺, a⁻_k are passed through f to obtain z, z⁺, z⁻_k.]
   (A sketch of this sampling/augmentation/encoding pipeline follows below.)
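Slides 3–5 together describe the data pipeline: sample an anchor and K negatives, augment them, and encode the views. A minimal sketch under some assumptions: the dataset is a tensor of images, two torchvision transforms stand in for the augmentation distribution, and f is any encoder mapping (B, 3, H, W) to (B, d), e.g., the toy Encoder above.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([          # placeholder augmentation distribution
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
])

def contrastive_batch(dataset: torch.Tensor, f, K: int = 8):
    """Draw one anchor x and K negatives, augment, and encode (illustrative only).

    dataset: tensor of images with shape (N, 3, H, W) -- an assumption for brevity.
    """
    idx = torch.randperm(len(dataset))[: K + 1]
    x, negatives = dataset[idx[0]], [dataset[i] for i in idx[1:]]

    a, a_pos = augment(x), augment(x)                 # two views of the anchor
    a_negs = [augment(x_neg) for x_neg in negatives]  # one view per negative

    z, z_pos = f(a.unsqueeze(0)), f(a_pos.unsqueeze(0))
    z_negs = f(torch.stack(a_negs))                   # shape (K, d)
    return z, z_pos, z_negs
```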
6. Contrastive unsupervised representation learning
   Minimise a contrastive loss L_con given the extracted representations z, z⁺, z⁻_k.
   • sim(·, ·): a similarity function, such as a dot product or cosine similarity.
   • z, z⁺: make similar. z, z⁻: make dissimilar.
   Contrastive loss [1]:
   L_con = −ln ( exp[sim(z, z⁺)] / ( exp[sim(z, z⁺)] + Σ_{k=1}^{K} exp[sim(z, z⁻_k)] ) )
   (A code sketch of this loss follows below.)
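This loss (the InfoNCE-style objective of [1]) can be computed directly from the representations. A minimal sketch, assuming cosine similarity as sim(·, ·) and the shapes returned by the pipeline sketch above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_pos, z_negs):
    """L_con = -ln exp(sim(z,z+)) / (exp(sim(z,z+)) + sum_k exp(sim(z,z-_k)))."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)  # choice of sim(.,.) is an assumption
    pos = sim(z, z_pos)                  # shape (1,)
    negs = sim(z, z_negs)                # shape (K,)
    logits = torch.cat([pos, negs])      # positive first, then the K negatives
    # -ln softmax over the K+1 candidates, with the positive as the "correct" one
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```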
7. Common technique: enlarge the number of negative samples K
   By increasing the number of negative samples K, the learnt f̂ yields informative features for a linear classifier.
   • For ImageNet-1000 classification, MoCo [2] used K = 65 536 and SimCLR [3] used K = 8 190.
   • For CIFAR-10 classification, K = 512 is better than smaller K.
   Can we explain this theoretically?
   [Figure: validation risk on CIFAR-10 vs. number of negative samples + 1 (32–512).]
8. A theory of contrastive representation learning
   Informal bound [4], modified for our setting: for all f ∈ ℱ,
   L_sup(f) ≤ (1 / (1 − τ_K)) · ( L_con(f) + 𝒪(f, K) ).
   • L_sup(f): supervised loss with feature encoder f.
   • τ_K: collision probability, i.e., the probability that the anchor's label also appears among the negatives' labels.
   Note: the original bound is for a meta-learning-ish loss rather than a single supervised loss.
   (A numerical sketch of the 1/(1 − τ_K) factor follows below.)
   [4] Arora et al., "A Theoretical Analysis of Contrastive Unsupervised Representation Learning", In ICML, 2019.
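To get a feel for why the 1/(1 − τ_K) factor is troublesome, one can evaluate it under a simplifying assumption of C equally likely classes (the paper's τ_K is defined for the actual class distribution, so these numbers are only illustrative): then τ_K = 1 − (1 − 1/C)^K, and the factor grows rapidly with K.

```python
def arora_coefficient(K: int, C: int) -> float:
    """1 / (1 - tau_K) under a uniform prior over C classes (simplifying assumption).

    With C equally likely classes,
      tau_K = P(some negative shares the anchor's class) = 1 - (1 - 1/C) ** K,
    so 1 / (1 - tau_K) = 1 / (1 - 1/C) ** K.
    """
    return 1.0 / (1.0 - 1.0 / C) ** K

for K in [32, 64, 128, 256, 512]:
    print(K, f"{arora_coefficient(K, C=10):.2e}")   # e.g., C = 10 for CIFAR-10
```

Already at K = 512 with C = 10 the factor is astronomically large, qualitatively matching the explosion shown on the next slide.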
9. The bound of L_sup explodes with large K
   Plots of the upper-bound values of the supervised loss, L_sup(f̂) ≤ (1 / (1 − τ_K)) · ( L_con(f̂) + 𝒪(f̂, K) ), on CIFAR-10 (left) and CIFAR-100 (right).
   • To explain the performance gain, we would expect the upper bound to decrease as K grows; instead, it explodes.
   [Figure: Arora et al.'s upper bound and the validation risk vs. number of negative samples + 1, on CIFAR-10 (left) and CIFAR-100 (right).]
10. Key idea: use an alternative probability
    • Intuition: a large K is necessary to draw samples from various classes in order to solve the downstream task.
    • Example: if the downstream task is ImageNet-1000 classification, we need K + 1 ≥ 1 000 in the contrastive task.
    • Use the Coupon collector's problem [5]: υ_{K+1} is the probability that the labels of the K + 1 drawn samples include all supervised classes.
11. Proposed upper bound of supervised loss
    Informal proposed bound: for all f ∈ ℱ,
    L_sup(f) ≤ (1 / υ_{K+1}) · ( 2 L_con(f) + 𝒪(f, K) ).
    • The coefficient 1/υ_{K+1} of L_con converges to 1 as K increases.
    • Additional insight from υ_{K+1}: the expected K + 1 needed to sample all supervised labels of ImageNet-1000 is about 7 700.
    • Recall that K = 65 536 in MoCo and K = 8 190 in SimCLR.
    (A numerical sketch of these coupon-collector quantities follows below.)
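The coupon-collector quantities on slides 10–11 are easy to sanity-check numerically. Under a uniform class prior (an assumption; the slide's ≈7 700 presumably reflects ImageNet's actual, slightly non-uniform label distribution), the expected number of draws needed to see all C classes is C·H_C, and υ_{K+1} can be estimated by simple Monte Carlo.

```python
import random

def expected_draws_all_classes(C: int) -> float:
    """Coupon collector: expected # of draws to see all C classes, uniform prior (assumption)."""
    return C * sum(1.0 / i for i in range(1, C + 1))   # C * H_C

def coverage_prob(n_draws: int, C: int, trials: int = 2000) -> float:
    """Monte Carlo estimate of v_{K+1}: P(n_draws uniform draws cover all C classes)."""
    hits = sum(len({random.randrange(C) for _ in range(n_draws)}) == C
               for _ in range(trials))
    return hits / trials

print(expected_draws_all_classes(1000))                 # ~7485 under uniformity (slide reports ~7700)
print(coverage_prob(n_draws=129, C=10))                  # CIFAR-10-scale example: v_{K+1} for K = 128
print(coverage_prob(n_draws=8191, C=1000, trials=200))   # SimCLR-scale K + 1 on 1000 classes
```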
12. Our bound doesn't explode
    [Figure: Arora et al.'s upper bound, our upper bound, and the validation risk vs. number of negative samples + 1, on CIFAR-10 (left) and CIFAR-100 (right).]
13. Our bound doesn't explode…? 🤔
    [Figure: the CIFAR-100 panel from the previous slide.]
    Limitations:
    1. The bound is not defined when K is smaller than the number of classes of the downstream task.
    2. The bound looks vacuous if K is not large enough (red part of the plot).
    Can we fix these limitations?
14. Improved upper bound of supervised loss
    We derive the following bound for all K > 0:
    L_sup(f) ≤ L_con(f) + 𝒪(ln 1/K).
    Figure: the proposed bound (blue line) is consistently tight for all K.
    [Figure: upper bound and supervised loss values vs. K (negative sample size, 4–512), comparing Ours, Arora et al., Nozawa & Sato, Ash et al., and the supervised loss.]
15. Conclusion
    • We improved the existing bound for contrastive learning.
      • Practice: a large negative sample size K doesn't hurt classification performance.
      • Existing theory: a large K hurts classification performance.
    • We proposed
      • a new bound using the Coupon collector's problem, with a milder limitation;
      • a much tighter upper bound for all K > 0.
    • A more recent topic from Prof. Arora's group:
      • Saunshi et al., "Understanding Contrastive Learning Requires Incorporating Inductive Biases", In ICML, 2022.
16. References
    1. Oord et al., "Representation Learning with Contrastive Predictive Coding", arXiv, 2018.
    2. He et al., "Momentum Contrast for Unsupervised Visual Representation Learning", In CVPR, 2020.
    3. Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations", In ICML, 2020.
    4. Arora et al., "A Theoretical Analysis of Contrastive Unsupervised Representation Learning", In ICML, 2019.
    5. Nakata & Kubo, "A Coupon Collector's Problem with Bonuses", In Fourth Colloquium on Mathematics and Computer Science, 2006.
17. Our papers
    • Kento Nozawa and Issei Sato. Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. In NeurIPS, 2021.
    • Han Bao, Yoshihiro Nagano and Kento Nozawa. On the Surrogate Gap between Contrastive and Supervised Losses. In ICML, 2022.
    • Bonus (another extension of Arora et al., 2019):
      • Kento Nozawa, Pascal Germain and Benjamin Guedj. PAC-Bayesian Contrastive Unsupervised Representation Learning. In UAI, 2020.