
Analysis on Negative Sample Size in Contrastive Unsupervised Representation Learning

Kento Nozawa
November 09, 2022

Transcript

1. Analysis on Negative Sample Size in Contrastive Unsupervised Representation Learning
   Kento Nozawa¹,² ᵅ, Issei Sato¹, Han Bao¹,² ᵝ, Yoshihiro Nagano¹,² ᵝ
   Currently, α: IBM Research Tokyo; β: Kyoto University
2. Unsupervised representation learning
   Goal: learning a feature encoder f that can extract generic representations automatically from an unlabelled dataset.
   • Feature encoder: f : 𝒳 → ℝ^d, e.g., a deep neural net.
   • Representation: a real-valued vector z = f(x) ∈ ℝ^d.
   [Figure: a massive unlabelled dataset {x_i}_{i=1}^M; an input x is mapped by f(·) to an extracted representation z ∈ ℝ^d, e.g., (10.1, −0.3, …, 1.7).]
   (A toy code sketch of such an encoder follows below.)
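To make the encoder concrete, here is a minimal, hypothetical PyTorch sketch of an f : 𝒳 → ℝ^d. The tiny architecture and the choice d = 128 are illustrative assumptions, not the settings used in the talk's experiments.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A toy feature encoder f: X -> R^d (architecture is illustrative only)."""

    def __init__(self, d: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, d),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of images of shape (B, 3, H, W); returns one z in R^d per image.
        return self.backbone(x)

f = Encoder(d=128)
x = torch.randn(4, 3, 32, 32)   # a mini-batch of unlabelled images
z = f(x)                        # extracted representations, shape (4, 128)
```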
3. Contrastive unsupervised representation learning
   Draw K + 1 samples from an unlabelled training dataset.
   • x: anchor sample.
   • x⁻: negative sample. It can be a set of samples {x⁻_k}_{k=1}^K.
   [Figure: an anchor image x and negative images x⁻_k.]
4. Contrastive unsupervised representation learning
   Apply data augmentation to the samples:
   • For the anchor image x, we draw and apply two data augmentations a, a⁺.
   • For each negative image x⁻, we draw and apply a single data augmentation a⁻.
   [Figure: the anchor x yields augmented views a and a⁺; each negative x⁻_k yields a⁻_k.]
5. Contrastive unsupervised representation learning
   Extract the representations z, z⁺, z⁻ with the feature encoder f.
   [Figure: the views a, a⁺, a⁻_k are passed through f to obtain z, z⁺, z⁻_k.]
   (A sketch of this sampling/augmentation/encoding pipeline follows below.)
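Slides 3–5 together describe the data pipeline: sample an anchor and K negatives, augment them, and encode the views. A minimal sketch under some assumptions: the dataset is a tensor of images, two torchvision transforms stand in for the augmentation distribution, and f is any encoder mapping (B, 3, H, W) to (B, d), e.g., the toy Encoder above.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([          # placeholder augmentation distribution
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
])

def contrastive_batch(dataset: torch.Tensor, f, K: int = 8):
    """Draw one anchor x and K negatives, augment, and encode (illustrative only).

    dataset: tensor of images with shape (N, 3, H, W) -- an assumption for brevity.
    """
    idx = torch.randperm(len(dataset))[: K + 1]
    x, negatives = dataset[idx[0]], [dataset[i] for i in idx[1:]]

    a, a_pos = augment(x), augment(x)                 # two views of the anchor
    a_negs = [augment(x_neg) for x_neg in negatives]  # one view per negative

    z, z_pos = f(a.unsqueeze(0)), f(a_pos.unsqueeze(0))
    z_negs = f(torch.stack(a_negs))                   # shape (K, d)
    return z, z_pos, z_negs
```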
6. Contrastive unsupervised representation learning
   Minimise a contrastive loss L_con given the extracted representations z, z⁺, z⁻_k.
   • sim(·, ·): a similarity function, such as a dot product or cosine similarity.
   • z, z⁺: make similar. z, z⁻: make dissimilar.
   Contrastive loss [1]:
   L_con = −ln ( exp[sim(z, z⁺)] / ( exp[sim(z, z⁺)] + Σ_{k=1}^{K} exp[sim(z, z⁻_k)] ) )
   (A code sketch of this loss follows below.)
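This loss (the InfoNCE-style objective of [1]) can be computed directly from the representations. A minimal sketch, assuming cosine similarity as sim(·, ·) and the shapes returned by the pipeline sketch above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_pos, z_negs):
    """L_con = -ln exp(sim(z,z+)) / (exp(sim(z,z+)) + sum_k exp(sim(z,z-_k)))."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)  # choice of sim(.,.) is an assumption
    pos = sim(z, z_pos)                  # shape (1,)
    negs = sim(z, z_negs)                # shape (K,)
    logits = torch.cat([pos, negs])      # positive first, then the K negatives
    # -ln softmax over the K+1 candidates, with the positive as the "correct" one
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```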
7. Common technique: enlarge the number of negative samples K
   By increasing the number of negative samples K, the learnt f̂ yields informative features for a linear classifier.
   • For ImageNet-1000 classification, MoCo [2] used K = 65 536 and SimCLR [3] used K = 8 190.
   • For CIFAR-10 classification, K = 512 is better than smaller K.
   Can we explain this theoretically?
   [Figure: validation risk on CIFAR-10 vs. number of negative samples + 1 (32–512).]
8. A theory of contrastive representation learning
   Informal bound [4], modified for our setting: for all f ∈ ℱ,
   L_sup(f) ≤ (1 / (1 − τ_K)) · ( L_con(f) + 𝒪(f, K) ).
   • L_sup(f): supervised loss with feature encoder f.
   • τ_K: collision probability, i.e., the probability that the anchor's label also appears among the negatives' labels.
   Note: the original bound is for a meta-learning-ish loss rather than a single supervised loss.
   (A numerical sketch of the 1/(1 − τ_K) factor follows below.)
   [4] Arora et al., "A Theoretical Analysis of Contrastive Unsupervised Representation Learning", In ICML, 2019.
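To get a feel for why the 1/(1 − τ_K) factor is troublesome, one can evaluate it under a simplifying assumption of C equally likely classes (the paper's τ_K is defined for the actual class distribution, so these numbers are only illustrative): then τ_K = 1 − (1 − 1/C)^K, and the factor grows rapidly with K.

```python
def arora_coefficient(K: int, C: int) -> float:
    """1 / (1 - tau_K) under a uniform prior over C classes (simplifying assumption).

    With C equally likely classes,
      tau_K = P(some negative shares the anchor's class) = 1 - (1 - 1/C) ** K,
    so 1 / (1 - tau_K) = 1 / (1 - 1/C) ** K.
    """
    return 1.0 / (1.0 - 1.0 / C) ** K

for K in [32, 64, 128, 256, 512]:
    print(K, f"{arora_coefficient(K, C=10):.2e}")   # e.g., C = 10 for CIFAR-10
```

Already at K = 512 with C = 10 the factor is astronomically large, qualitatively matching the explosion shown on the next slide.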
9. The bound of L_sup explodes with large K
   Plots of the upper-bound values of the supervised loss, L_sup(f̂) ≤ (1 / (1 − τ_K)) · ( L_con(f̂) + 𝒪(f̂, K) ), on CIFAR-10 (left) and CIFAR-100 (right).
   • To explain the performance gain, we would expect the upper bound to decrease as K grows; instead, it explodes.
   [Figure: Arora et al.'s upper bound and the validation risk vs. number of negative samples + 1, on CIFAR-10 (left) and CIFAR-100 (right).]
10. Key idea: use an alternative probability
    • Intuition: a large K is necessary to draw samples from various classes in order to solve the downstream task.
    • Example: if the downstream task is ImageNet-1000 classification, we need K + 1 ≥ 1 000 in the contrastive task.
    • Use the Coupon collector's problem [5]: υ_{K+1} is the probability that the labels of the K + 1 drawn samples include all supervised classes.
11. Proposed upper bound of supervised loss
    Informal proposed bound: for all f ∈ ℱ,
    L_sup(f) ≤ (1 / υ_{K+1}) · ( 2 L_con(f) + 𝒪(f, K) ).
    • The coefficient 1/υ_{K+1} of L_con converges to 1 as K increases.
    • Additional insight from υ_{K+1}: the expected K + 1 needed to sample all supervised labels of ImageNet-1000 is about 7 700.
    • Recall that K = 65 536 in MoCo and K = 8 190 in SimCLR.
    (A numerical sketch of these coupon-collector quantities follows below.)
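The coupon-collector quantities on slides 10–11 are easy to sanity-check numerically. Under a uniform class prior (an assumption; the slide's ≈7 700 presumably reflects ImageNet's actual, slightly non-uniform label distribution), the expected number of draws needed to see all C classes is C·H_C, and υ_{K+1} can be estimated by simple Monte Carlo.

```python
import random

def expected_draws_all_classes(C: int) -> float:
    """Coupon collector: expected # of draws to see all C classes, uniform prior (assumption)."""
    return C * sum(1.0 / i for i in range(1, C + 1))   # C * H_C

def coverage_prob(n_draws: int, C: int, trials: int = 2000) -> float:
    """Monte Carlo estimate of v_{K+1}: P(n_draws uniform draws cover all C classes)."""
    hits = sum(len({random.randrange(C) for _ in range(n_draws)}) == C
               for _ in range(trials))
    return hits / trials

print(expected_draws_all_classes(1000))                 # ~7485 under uniformity (slide reports ~7700)
print(coverage_prob(n_draws=129, C=10))                  # CIFAR-10-scale example: v_{K+1} for K = 128
print(coverage_prob(n_draws=8191, C=1000, trials=200))   # SimCLR-scale K + 1 on 1000 classes
```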
12. Our bound doesn't explode
    [Figure: Arora et al.'s upper bound, our upper bound, and the validation risk vs. number of negative samples + 1, on CIFAR-10 (left) and CIFAR-100 (right).]
13. Our bound doesn't explode…? 🤔
    [Figure: the CIFAR-100 panel from the previous slide.]
    Limitations:
    1. The bound is not defined when K is smaller than the number of classes of the downstream task.
    2. The bound looks vacuous if K is not large enough (red part of the plot).
    Can we fix these limitations?
14. Improved upper bound of supervised loss
    We derive the following bound for all K > 0:
    L_sup(f) ≤ L_con(f) + 𝒪(ln 1/K).
    Figure: the proposed bound (blue line) is consistently tight for all K.
    [Figure: upper bound and supervised loss values vs. K (negative sample size, 4–512), comparing Ours, Arora et al., Nozawa & Sato, Ash et al., and the supervised loss.]
15. Conclusion
    • We improved the existing bound for contrastive learning.
      • Practice: a large negative sample size K doesn't hurt classification performance.
      • Existing theory: a large K hurts classification performance.
    • We proposed
      • a new bound using the Coupon collector's problem, with a milder limitation;
      • a much tighter upper bound for all K > 0.
    • A more recent topic from Prof. Arora's group:
      • Saunshi et al., "Understanding Contrastive Learning Requires Incorporating Inductive Biases", In ICML, 2022.
16. References
    1. Oord et al., "Representation Learning with Contrastive Predictive Coding", arXiv, 2018.
    2. He et al., "Momentum Contrast for Unsupervised Visual Representation Learning", In CVPR, 2020.
    3. Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations", In ICML, 2020.
    4. Arora et al., "A Theoretical Analysis of Contrastive Unsupervised Representation Learning", In ICML, 2019.
    5. Nakata & Kubo, "A Coupon Collector's Problem with Bonuses", In Fourth Colloquium on Mathematics and Computer Science, 2006.
17. Our papers
    • Kento Nozawa and Issei Sato. Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. In NeurIPS, 2021.
    • Han Bao, Yoshihiro Nagano and Kento Nozawa. On the Surrogate Gap between Contrastive and Supervised Losses. In ICML, 2022.
    • Bonus (another extension of Arora et al., 2019):
      • Kento Nozawa, Pascal Germain and Benjamin Guedj. PAC-Bayesian Contrastive Unsupervised Representation Learning. In UAI, 2020.