
Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning

Kento Nozawa
October 18, 2021


  1. Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning
    Kento Nozawa (1,2), Issei Sato (1)
    Paper: https://arxiv.org/abs/2102.06866
    Code: https://github.com/nzw0301/Understanding-Negative-Samples
  2. Short summary of this talk
    • We point out an inconsistency between self-supervised learning's common practice and an existing theoretical analysis.
    • Practice: large numbers of negative samples don't hurt classification performance.
    • Theory: they hurt classification performance.
    • We propose a novel analysis using the Coupon collector's problem.
    [Figure: upper bound of the supervised loss (Arora et al. vs. ours; lower is better) and validation accuracy (%) on CIFAR-100 (higher is better), plotted against # negative samples + 1.]
  3. Instance discriminative self-supervised representation learning
    Goal: learn a generic feature encoder $f$, for example a deep neural net, for a downstream task such as classification.
    Feature representations help a linear classifier attain classification accuracy comparable to a supervised method trained from scratch.
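    As a concrete picture of this linear-evaluation protocol, here is a minimal sketch (not the authors' code): features from a frozen encoder are fed to a logistic-regression classifier. The encoder, data shapes, and feature dimension are placeholder assumptions.

    ```python
    # Minimal linear-evaluation sketch (illustrative only, not the paper's code).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    projection = rng.normal(size=(32 * 32 * 3, 128))  # stands in for a pretrained, frozen f

    def encode(images: np.ndarray) -> np.ndarray:
        """Placeholder for the frozen encoder f: one feature vector per image."""
        return images.reshape(len(images), -1) @ projection

    # Toy stand-ins for a labeled downstream dataset (CIFAR-10-like shapes).
    train_images = rng.normal(size=(512, 32, 32, 3))
    train_labels = rng.integers(0, 10, size=512)
    test_images = rng.normal(size=(128, 32, 32, 3))
    test_labels = rng.integers(0, 10, size=128)

    # Fit a linear classifier on frozen features and evaluate it.
    clf = LogisticRegression(max_iter=1000).fit(encode(train_images), train_labels)
    print("linear-eval accuracy:", clf.score(encode(test_images), test_labels))
    ```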
  4. Overview of instance discriminative self-supervised representation learning
    Draw $K + 1$ samples from an unlabeled dataset.
    • $x$: anchor sample.
    • $x^-$: negative sample. It can be a set of $K$ samples $\{x^-_k\}_{k=1}^K$.
  5. Overview of instance discriminative self-supervised representation learning
    Apply data augmentation to the samples:
    • For the anchor sample $x$, we draw and apply two data augmentations $a, a^+$.
    • For the negative sample $x^-$, we draw and apply a single data augmentation $a^-$.
  6. Overview of instance discriminative self-supervised representation learning
    The feature encoder $f$ maps the augmented samples to feature vectors $h, h^+, h^-$.
  7. Overview of instance discriminative self-supervised representation learning
    Contrastive loss function, e.g. InfoNCE [1]:
    $-\ln \frac{\exp[\mathrm{sim}(h, h^+)]}{\exp[\mathrm{sim}(h, h^+)] + \exp[\mathrm{sim}(h, h^-)]}$
    • Minimize the contrastive loss given the feature representations.
    • $\mathrm{sim}(\cdot, \cdot)$: a similarity function, such as cosine similarity.
    • The learned $\hat{f}$ works as a feature extractor for a downstream task.
    [1] Oord et al. Representation Learning with Contrastive Predictive Coding, arXiv, 2018.
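    To make the loss concrete, here is a minimal NumPy sketch of InfoNCE for one anchor and $K$ negatives, assuming cosine similarity; the feature vectors are random placeholders standing in for encoder outputs, not the paper's implementation.

    ```python
    # Minimal InfoNCE sketch for one anchor and K negatives (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)

    def cosine_sim(u: np.ndarray, V: np.ndarray) -> np.ndarray:
        """sim(., .): cosine similarity between a vector u and each row of V."""
        u = u / np.linalg.norm(u)
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
        return V @ u

    # Placeholders for h, h+, and {h-_k}: in practice these are encoder outputs
    # of the augmented anchor views and the augmented negative samples.
    dim, K = 128, 31
    h = rng.normal(size=dim)            # anchor view
    h_pos = rng.normal(size=dim)        # second anchor view (positive)
    h_neg = rng.normal(size=(K, dim))   # K negative views

    # InfoNCE: -ln( exp[sim(h, h+)] / (exp[sim(h, h+)] + sum_k exp[sim(h, h-_k)]) )
    pos_logit = cosine_sim(h, h_pos[None])   # shape (1,)
    neg_logits = cosine_sim(h, h_neg)        # shape (K,)
    logits = np.concatenate([pos_logit, neg_logits])
    loss = -pos_logit[0] + np.log(np.sum(np.exp(logits)))
    print("InfoNCE loss:", loss)
    ```

    In practice the loss is averaged over a mini-batch and the similarities are usually divided by a temperature, but the core computation is the softmax cross-entropy above with the positive pair playing the role of the correct "class".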
  8. Common technique: use a large number of negative samples $K$
    By increasing the number of negative samples, the learned $\hat{f}$ yields informative features for a linear classifier in practice. For ImageNet,
    • MoCo [2]: $K$ = 65,536.
    • SimCLR [3]: $K$ = 8,190 or even more.
    [Figure: validation accuracy (%) on CIFAR-10 against # negative samples + 1.]
    [2] He et al. Momentum Contrast for Unsupervised Visual Representation Learning, In CVPR, 2020.
    [3] Chen et al. A Simple Framework for Contrastive Learning of Visual Representations, In ICML, 2020.
  9. A theory of contrastive representation learning
    Informal bound [4], modified for self-supervised learning:
    $L_{\mathrm{cont}}(f) \ge (1 - \tau_K)\,(L_{\mathrm{sup}}(f) + L_{\mathrm{sub}}(f)) + \underbrace{\tau_K \ln(\mathrm{Col} + 1)}_{\text{collision term}} + d(f)$
    • $\tau_K$: collision probability that the anchor's label appears among the negatives' labels.
    • $L_{\mathrm{sup}}(f)$: supervised loss with $f$.
    • $L_{\mathrm{sub}}(f)$: supervised loss over a subset of labels with $f$.
    • $\mathrm{Col}$: the number of negative labels duplicated with the anchor's label.
    • $d(f)$: a function of $f$, but an almost constant term in practice.
    [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.
  10. The bound on $L_{\mathrm{sup}}$ explodes with large $K$
    • The bound on CIFAR-10, where the number of classes is 10, with $K = 31$: about 96% of the samples contribute to the collision term, which is unrelated to the supervised loss, due to $\tau_K$.
    • The plots show the rearranged upper bound:
      $L_{\mathrm{sup}}(f) \le (1 - \tau_K)^{-1}\left[L_{\mathrm{cont}}(f) - \tau_K \ln(\mathrm{Col} + 1) - d(f)\right] - L_{\mathrm{sub}}(f)$
    [Figure: upper bound of the supervised loss (Arora et al.) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100.]
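    As a sanity check on the 96% figure, the sketch below computes the collision probability $\tau_K$ under the simplifying assumption that labels are drawn i.i.d. from a uniform distribution over $C$ classes, i.e. $\tau_K = 1 - (1 - 1/C)^K$; this is an illustrative assumption, not necessarily the paper's exact computation.

    ```python
    # Collision probability tau_K under a uniform label distribution (illustrative).
    def collision_probability(num_classes: int, num_negatives: int) -> float:
        """P(at least one of the K negatives shares the anchor's label)."""
        return 1.0 - (1.0 - 1.0 / num_classes) ** num_negatives

    # CIFAR-10 setting from the slide: C = 10 classes, K = 31 negatives.
    tau_K = collision_probability(num_classes=10, num_negatives=31)
    print(f"tau_K = {tau_K:.3f}")  # ~0.962, i.e. about 96%

    # The (1 - tau_K)^-1 factor in the rearranged bound grows quickly with K,
    # which is why the upper bound on the supervised loss explodes.
    for K in (31, 63, 127, 255, 511):
        tau = collision_probability(10, K)
        print(f"K = {K:4d}  1 / (1 - tau_K) = {1.0 / (1.0 - tau):.3g}")
    ```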
  11. Contributions: a novel lower bound on the contrastive loss
    Informal proposed bound:
    $L_{\mathrm{cont}}(f) \ge \frac{1}{2}\left\{\upsilon_{K+1} L_{\mathrm{sup}}(f) + (1 - \upsilon_{K+1}) L_{\mathrm{sub}}(f) + \ln(\mathrm{Col} + 1)\right\} + d(f)$
    • Key idea: replace the collision probability $\tau_K$ with the Coupon collector's problem probability $\upsilon_{K+1}$ that the $K + 1$ samples' labels include all of the supervised labels.
    • Additional insight: the expected $K + 1$ needed to draw all supervised labels from ImageNet-1K is about 7,700.
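    For intuition about the ~7,700 figure, the sketch below computes the Coupon collector's expectation under the simplifying assumption of a uniform distribution over the 1,000 ImageNet-1K labels; the uniform estimate comes out around 7,500, and since a uniform distribution minimizes the expected collection time, a non-uniform empirical label distribution can only push the expectation higher, consistent with the figure quoted on the slide.

    ```python
    # Coupon collector sketch (illustrative): expected number of i.i.d. draws
    # needed to observe every one of C labels, assuming uniform label frequencies.
    import numpy as np

    def expected_draws_uniform(num_classes: int) -> float:
        """E[#draws to collect all labels] = C * (1 + 1/2 + ... + 1/C)."""
        return num_classes * float(np.sum(1.0 / np.arange(1, num_classes + 1)))

    print(f"uniform ImageNet-1K estimate: {expected_draws_uniform(1000):.0f} draws")

    # Monte Carlo check of the same quantity.
    rng = np.random.default_rng(0)

    def simulate_draws(num_classes: int) -> int:
        seen, draws = set(), 0
        while len(seen) < num_classes:
            seen.add(int(rng.integers(num_classes)))
            draws += 1
        return draws

    print("Monte Carlo estimate:", np.mean([simulate_draws(1000) for _ in range(50)]))
    ```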
  12. Our bound doesn't explode
    [Figure: upper bound of the supervised loss (Arora et al. vs. ours) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100.]
  13. Conclusion
    • We pointed out the inconsistency between self-supervised learning's common practice and the existing bound.
    • Practice: a large $K$ doesn't hurt classification performance.
    • Theory: a large $K$ hurts classification performance.
    • We proposed the new bound using the Coupon collector's problem.
    • Additional results:
      • An upper bound on the collision term.
      • Optimality when $\upsilon = 0$ because $K$ is too small.
      • Experiments on an NLP dataset.