Slide 1

Slide 1 text

50: ରরతࣗݾڭࢣ෇͖දݱֶशʹ͓͚Δෛྫ਺ͷղੳ ໺୔ ݈ਓ , ࠤ౻ Ұ੣ ʢ ౦ژେֶ, RIKEN AIPʣ 1,2 1 1 2 128 256 384 512 640 768 896 1024 # negative samples +1 101 102 103 104 105 Upper bound of supervised loss Arora et al. Ours Validation accuracy 42 43 44 45 46 47 Validation accuracy (%) on CIFAR-100 • We point out the inconsistency between self-supervised learning’s common practice and an existing theoretical analysis. • Practice: Large # negative samples don’t hurt classi fi cation performance. • Theory: they hurt classi fi cation performance. • We propose an novel analysis using Coupon 
 collector’s problem. Accuracy: higher is better Bound: lower is better Paper: https://arxiv.org/abs/2102.06866 (to appear at NeurIPS2021) Code: https://github.com/nzw0301/Understanding-Negative-Samples

Slide 2

Slide 2 text

Instance discriminative self-supervised representation learning

Goal: Learn a generic feature encoder $f$, for example a deep neural net, for a downstream task such as classification. The feature representations help a linear classifier attain classification accuracy comparable to a supervised method trained from scratch.

Slide 3

Slide 3 text

Overview of instance discriminative self-supervised representation learning

Draw $K + 1$ samples from an unlabeled dataset:
• $x$: anchor sample.
• $x^-$: negative sample. It can be a set of $K$ samples $\{x^-_k\}_{k=1}^K$.

Slide 4

Slide 4 text

Overview of instance discriminative self-supervised representation learning

Apply data augmentation to the samples:
• For the anchor sample $x$, we draw and apply two data augmentations $a, a^+$.
• For the negative sample $x^-$, we draw and apply a single data augmentation $a^-$.

Slide 5

Slide 5 text

Overview of instance discriminative self-supervised representation learning

The feature encoder $f$ maps the augmented samples to feature vectors $h, h^+, h^-$.
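To make slides 3–5 concrete, here is a minimal runnable sketch of the sampling, augmentation, and encoding steps. Everything in it is illustrative rather than the paper's actual pipeline: the dataset is random vectors, `augment` is a stand-in for real augmentations such as crops and flips, and the encoder is a random linear map instead of a deep neural net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled dataset: 1,000 flattened "images" as 128-dim vectors (illustrative only).
dataset = rng.normal(size=(1000, 128))
K = 7  # number of negative samples

# Slide 3: draw K + 1 samples -- one anchor x and K negatives {x^-_k}.
idx = rng.choice(len(dataset), size=K + 1, replace=False)
x, x_negs = dataset[idx[0]], dataset[idx[1:]]

def augment(v):
    """Stand-in data augmentation (additive Gaussian noise);
    real pipelines use crops, flips, color jitter, etc."""
    return v + 0.1 * rng.normal(size=v.shape)

# Slide 4: two augmentations of the anchor, one per negative.
a, a_pos = augment(x), augment(x)
a_negs = np.stack([augment(xn) for xn in x_negs])

# Slide 5: stand-in encoder f, a fixed random linear map to 32-dim features
# (in practice f is a deep neural net with trainable weights).
W = rng.normal(size=(128, 32)) / np.sqrt(128)
f = lambda v: v @ W

h, h_pos, h_negs = f(a), f(a_pos), f(a_negs)
print(h.shape, h_pos.shape, h_negs.shape)  # (32,) (32,) (7, 32)
```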

Slide 6

Slide 6 text

Overview of instance discriminative self-supervised representation learning

Contrastive loss function, e.g. InfoNCE [1]:

$-\ln \dfrac{\exp[\mathrm{sim}(h, h^+)]}{\exp[\mathrm{sim}(h, h^+)] + \exp[\mathrm{sim}(h, h^-)]}$

• Minimize the contrastive loss given the feature representations.
• $\mathrm{sim}(\cdot, \cdot)$: a similarity function, such as cosine similarity.
• The learned $\hat{f}$ works as a feature extractor for a downstream task.

[1] Oord et al. Representation Learning with Contrastive Predictive Coding, arXiv, 2018.
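Continuing the sketch above, here is a minimal NumPy implementation of the InfoNCE loss, generalized from the single-negative form on this slide to $K$ negatives by summing over them in the denominator (the usual multi-negative form). Cosine similarity with temperature 1 is assumed; `info_nce` and `cosine_sim` are illustrative helpers, not from the paper's code.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between vector u and vector(s) v (last axis = features)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return v @ u

def info_nce(h, h_pos, h_negs):
    """InfoNCE for one anchor:
    -ln( e^{sim(h, h+)} / (e^{sim(h, h+)} + sum_k e^{sim(h, h-_k)}) )."""
    pos = np.exp(cosine_sim(h, h_pos))
    negs = np.exp(cosine_sim(h, h_negs)).sum()
    return -np.log(pos / (pos + negs))
```

With the random features from the previous sketch, `info_nce(h, h_pos, h_negs)` sits near the chance value $\ln(K + 1)$; minimizing it over $f$ is what makes the learned representations informative.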

Slide 7

Slide 7 text

Common technique: use a large number of negative samples $K$

By increasing the number of negative samples, the learned $\hat{f}$ yields informative features for a linear classifier in practice. For ImageNet:
• MoCo [2]: $K$ = 65,536.
• SimCLR [3]: $K$ = 8,190, or even more.

[Figure: validation accuracy (%) on CIFAR-10 against # negative samples + 1 (32 to 512).]

[2] He et al. Momentum Contrast for Unsupervised Visual Representation Learning, In CVPR, 2020.
[3] Chen et al. A Simple Framework for Contrastive Learning of Visual Representations, In ICML, 2020.

Slide 8

Slide 8 text

A theory of contrastive representation learning

Informal bound [4], modified for self-supervised learning:

$L_{\mathrm{cont}}(f) \ge \underbrace{\tau_K \ln(\mathrm{Col} + 1)}_{\text{collision term}} + (1 - \tau_K)\bigl(L_{\mathrm{sup}}(f) + L_{\mathrm{sub}}(f)\bigr) + d(f)$

• $\tau_K$: collision probability that the anchor's label appears among the negatives' labels.
• $\mathrm{Col}$: the number of negative labels duplicating the anchor's label.
• $L_{\mathrm{sup}}(f)$: supervised loss with $f$.
• $L_{\mathrm{sub}}(f)$: supervised loss over a subset of labels with $f$.
• $d(f)$: a function of $f$, but an almost constant term in practice.

[4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.

Slide 9

Slide 9 text

The bound of $L_{\mathrm{sup}}$ explodes with large $K$

• The bound on CIFAR-10, where the number of classes is 10, with $K = 31$:
  • About 96% of draws contribute to the collision term, which is unrelated to the supervised loss, due to $\tau_K$.
• The plots show the rearranged upper bound:

$L_{\mathrm{sup}}(f) \le (1 - \tau_K)^{-1}\bigl[L_{\mathrm{cont}}(f) - \tau_K \ln(\mathrm{Col} + 1) - d(f)\bigr] - L_{\mathrm{sub}}(f)$

[Figures: upper bound of the supervised loss (Arora et al., log scale) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100.]
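A quick sanity check of the 96% figure, under the simplifying assumption of uniform class probabilities: $\tau_K$ is then the probability that at least one of the $K$ negatives shares the anchor's label, i.e. $1 - (1 - 1/C)^K$.

```python
# Collision probability under a uniform label distribution (simplifying assumption):
# tau_K = P(at least one of the K negatives shares the anchor's label) = 1 - (1 - 1/C)^K.
def tau(C, K):
    return 1.0 - (1.0 - 1.0 / C) ** K

print(f"{tau(C=10, K=31):.3f}")  # 0.962 -- the ~96% on this slide for CIFAR-10, K = 31
```

As $K$ grows, $\tau_K \to 1$, so the $(1 - \tau_K)^{-1}$ factor in the rearranged bound diverges, matching the explosion in the plots.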

Slide 10

Slide 10 text

Contributions: novel lower bound of the contrastive loss

Informal proposed bound:

$L_{\mathrm{cont}}(f) \ge \frac{1}{2}\bigl\{\upsilon_{K+1} L_{\mathrm{sup}}(f) + (1 - \upsilon_{K+1}) L_{\mathrm{sub}}(f) + \ln(\mathrm{Col} + 1)\bigr\} + d(f)$

• Key idea: replace the collision probability $\tau$ with the Coupon collector's problem's probability $\upsilon_{K+1}$ that the $K + 1$ samples' labels include all supervised labels.
• Additional insight: the expected $K + 1$ needed to draw all supervised labels from ImageNet-1K is about 7,700.
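For intuition on the ≈7,700 figure: under a uniform label distribution (a simplification; the slide's number presumably reflects ImageNet-1K's actual, slightly non-uniform class frequencies), the classic coupon collector's expectation is $C \cdot H_C$, which for $C = 1000$ classes already gives the same order of magnitude.

```python
# Coupon collector's expectation under a uniform label distribution (assumption):
# expected number of draws to observe all C labels at least once is C * H_C,
# where H_C is the C-th harmonic number.
def coupon_collector_expectation(C):
    return C * sum(1.0 / i for i in range(1, C + 1))

print(round(coupon_collector_expectation(1000)))  # 7485 for C = 1000 classes
```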

Slide 11

Slide 11 text

Our bound doesn't explode

[Figures: upper bound of the supervised loss (Arora et al. vs. Ours, log scale) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100; unlike Arora et al.'s bound, ours does not explode as $K$ grows.]

Slide 12

Slide 12 text

Conclusion

• We pointed out the inconsistency between self-supervised learning's common practice and the existing bound.
  • Practice: large $K$ doesn't hurt classification performance.
  • Theory: large $K$ hurts classification performance.
• We proposed a new bound using the Coupon collector's problem.
• Additional results:
  • An upper bound of the collision term.
  • Optimality when $\upsilon = 0$ with too small $K$.
  • Experiments on an NLP dataset.

Paper: https://arxiv.org/abs/2102.06866 (to appear at NeurIPS 2021)
Code: https://github.com/nzw0301/Understanding-Negative-Samples