
[IBIS2021] Analysis of the Number of Negative Samples in Contrastive Self-Supervised Representation Learning

Kento Nozawa
November 04, 2021


Slides used for the presentation "Analysis of the Number of Negative Samples in Contrastive Self-Supervised Representation Learning" at IBIS2021 (https://ibisml.org/).

Authors (affiliations): Kento Nozawa, Issei Sato (The University of Tokyo / RIKEN; The University of Tokyo)
Abstract: Contrastive self-supervised representation learning has been empirically reported to improve downstream classification performance when a large number of negative samples is used, whereas existing theoretical analyses indicate that a large number of negative samples degrades classification performance. In this work, we propose an analysis framework based on the coupon collector's problem and derive a bound that explains the empirical behavior. We verify that the proposed analysis holds on multiple datasets.

Related materials:
More details: https://arxiv.org/abs/2102.06866 , https://openreview.net/forum?id=pZ5X_svdPQ
Code: https://github.com/nzw0301/Understanding-Negative-Samples

Transcript

  1. Analysis of the Number of Negative Samples in Contrastive Self-Supervised Representation Learning
     Kento Nozawa¹,², Issei Sato¹ (¹The University of Tokyo, ²RIKEN AIP)
     [Figure: upper bound of the supervised loss (Arora et al. vs. Ours) and validation accuracy (%) on CIFAR-100, plotted against # negative samples + 1.]
     • We point out the inconsistency between self-supervised learning's common practice and an existing theoretical analysis.
     • Practice: large # negative samples don't hurt classification performance.
     • Theory: they hurt classification performance.
     • We propose a novel analysis using the coupon collector's problem.
     Accuracy: higher is better. Bound: lower is better.
     Paper: https://arxiv.org/abs/2102.06866 (to appear at NeurIPS 2021)
     Code: https://github.com/nzw0301/Understanding-Negative-Samples
  2. Instance discriminative self-supervised representation learning
     Goal: learn a generic feature encoder f, for example a deep neural net, for a downstream task such as classification.
     Feature representations help a linear classifier attain classification accuracy comparable to a supervised method trained from scratch.
  3. Overview of instance discriminative self-supervised representation learning
     [Diagram: anchor x and negative x^- drawn from an unlabeled dataset.]
     Draw K + 1 samples from an unlabeled dataset.
     • x: anchor sample.
     • x^-: negative sample. It can be a set of samples $\{x^-_k\}_{k=1}^{K}$.
  4. Overview of instance discriminative self-supervised representation learning
     [Diagram: anchor x with augmented views a, a^+; negative x^- with augmented view a^-.]
     Apply data augmentation to the samples:
     • For the anchor sample x, we draw and apply two data augmentations, giving a and a^+.
     • For the negative sample x^-, we draw and apply a single data augmentation, giving a^-.
  5. Overview of instance discriminative self-supervised representation learning
     [Diagram: f maps the augmented views a, a^+, a^- to h, h^+, h^-.]
     The feature encoder f maps the augmented samples to feature vectors h, h^+, h^-.
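     The augmentation-and-encoding step on slides 3–5 can be sketched in code. The following is only a minimal sketch under assumed choices (a torchvision augmentation recipe and a ResNet-18 backbone standing in for the encoder f); it is not the repository's implementation.

     ```python
     # Minimal sketch of slides 3-5: augment an anchor twice and a negative once,
     # then map all views to feature vectors with a shared encoder f.
     # The augmentation recipe and backbone are illustrative assumptions.
     import torch
     import torchvision.transforms as T
     from torchvision.models import resnet18

     augment = T.Compose([
         T.RandomResizedCrop(32),
         T.RandomHorizontalFlip(),
         T.ColorJitter(0.4, 0.4, 0.4, 0.1),
         T.ToTensor(),
     ])

     encoder = resnet18(num_classes=128)  # f: images -> 128-dim feature vectors

     def views(anchor_img, negative_img):
         """Return h, h^+, h^- for one anchor and one negative PIL image."""
         a, a_pos = augment(anchor_img), augment(anchor_img)  # two views of the anchor
         a_neg = augment(negative_img)                        # one view of the negative
         batch = torch.stack([a, a_pos, a_neg])
         h, h_pos, h_neg = encoder(batch)
         return h, h_pos, h_neg
     ```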
  6. Overview of instance discriminative self-supervised representation learning
     [Diagram: f maps a, a^+, a^- to h, h^+, h^-, which are fed into the contrastive loss.]
     Contrastive loss function, e.g., InfoNCE [1]:
     $-\ln \frac{\exp[\mathrm{sim}(h, h^+)]}{\exp[\mathrm{sim}(h, h^+)] + \exp[\mathrm{sim}(h, h^-)]}$
     • Minimize a contrastive loss given the feature representations.
     • sim(·, ·): a similarity function, such as cosine similarity.
     • The learned $\hat{f}$ works as a feature extractor for a downstream task.
     [1] Oord et al. Representation Learning with Contrastive Predictive Coding, arXiv, 2018.
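     Below is a minimal sketch of the InfoNCE loss above for one anchor, assuming cosine similarity as sim and allowing K negatives (the slide shows a single negative; extra negatives just add exp terms to the denominator). The `temperature` argument is a common practical knob and an assumption here, not something stated on the slide.

     ```python
     # Minimal InfoNCE sketch for a single anchor:
     # -ln( exp(sim(h, h+)) / (exp(sim(h, h+)) + sum_k exp(sim(h, h-_k))) )
     import torch
     import torch.nn.functional as F

     def info_nce(h, h_pos, h_negs, temperature=1.0):
         """h, h_pos: (d,) feature vectors; h_negs: (K, d) negative features."""
         pos = F.cosine_similarity(h, h_pos, dim=0) / temperature                        # sim(h, h+)
         negs = F.cosine_similarity(h.expand_as(h_negs), h_negs, dim=1) / temperature    # sim(h, h-_k)
         logits = torch.cat([pos.reshape(1), negs])
         # -log_softmax at index 0 is exactly the -ln fraction above
         return -F.log_softmax(logits, dim=0)[0]

     loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(5, 128))
     ```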
  7. Common technique: use a large # negative samples K
     By increasing # negative samples, the learned $\hat{f}$ yields informative features for a linear classifier in practice. For ImageNet:
     • MoCo [2]: K = 65,536.
     • SimCLR [3]: K = 8,190 or even more.
     [Figure: validation accuracy (%) on CIFAR-10 against # negative samples + 1.]
     [2] He et al. Momentum Contrast for Unsupervised Visual Representation Learning, In CVPR, 2020.
     [3] Chen et al. A Simple Framework for Contrastive Learning of Visual Representations, In ICML, 2020.
  8. A theory of contrastive representation learning
     Informal bound [4], modified for self-supervised learning:
     $L_{\mathrm{cont}}(f) \ge \underbrace{\tau_K \ln(\mathrm{Col} + 1)}_{\text{collision term}} + (1 - \tau_K)\bigl(L_{\mathrm{sup}}(f) + L_{\mathrm{sub}}(f)\bigr) + d(f)$
     • τ_K: the collision probability that the anchor's label appears among the negatives' labels.
     • Col: the number of negative labels duplicated with the anchor's label.
     • L_sup(f): the supervised loss with f.
     • L_sub(f): the supervised loss over a subset of labels with f.
     • d(f): a function of f, but an almost constant term in practice.
     [4] Arora et al. A Theoretical Analysis of Contrastive Unsupervised Representation Learning, In ICML, 2019.
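     For intuition about the collision term, here is a small sketch of τ_K under the simplifying assumption that labels are uniform over C classes, so that τ_K = 1 − ((C − 1)/C)^K; the definition in [4] uses the latent class distribution, so this is only an approximation, not the paper's computation.

     ```python
     # Collision probability tau_K under a uniform-label assumption:
     # tau_K = P(at least one of the K negatives shares the anchor's label)
     #       = 1 - ((C - 1) / C) ** K
     def tau(K: int, C: int) -> float:
         return 1.0 - ((C - 1) / C) ** K

     for K in (31, 63, 127, 255, 511):
         print(f"C=10, K={K:4d}: tau_K = {tau(K, 10):.4f}")
     # tau_K approaches 1 quickly, so the (1 - tau_K) weight on the
     # supervised-loss terms in the bound vanishes as K grows.
     ```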
  9. The bound of L_sup explodes with large K
     • The bound on CIFAR-10, where # classes is 10, with K = 31: about 96% of samples contribute to the collision term, which is not related to the supervised loss, due to τ_K.
     • The plots show the rearranged upper bound:
     $L_{\mathrm{sup}}(f) \le (1 - \tau_K)^{-1}\bigl[L_{\mathrm{cont}}(f) - \tau_K \ln(\mathrm{Col} + 1) - d(f)\bigr] - L_{\mathrm{sub}}(f)$
     [Figures: upper bound of the supervised loss (Arora et al.) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100.]
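     The explosion comes mainly from the (1 − τ_K)^{-1} factor in the rearranged bound. Under the same uniform-label assumption as in the previous sketch, computing that factor alone (the bracketed term and L_sub are empirical quantities and are left out) already shows the qualitative blowup:

     ```python
     # How the multiplicative factor (1 - tau_K)^{-1} in the rearranged upper bound
     # grows with K on CIFAR-10 (C = 10), under a uniform-label assumption.
     def inv_one_minus_tau(K: int, C: int = 10) -> float:
         tau_K = 1.0 - ((C - 1) / C) ** K
         return 1.0 / (1.0 - tau_K)  # equals (C / (C - 1)) ** K

     for K in (31, 63, 127, 255, 511):
         print(f"K={K:4d}: (1 - tau_K)^-1 = {inv_one_minus_tau(K):.3e}")
     ```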
  10. Contributions: a novel lower bound of the contrastive loss
     Informal proposed bound:
     $L_{\mathrm{cont}}(f) \ge \frac{1}{2}\bigl\{\upsilon_{K+1} L_{\mathrm{sup}}(f) + (1 - \upsilon_{K+1}) L_{\mathrm{sub}}(f) + \ln(\mathrm{Col} + 1)\bigr\} + d(f)$
     • Key idea: replace the collision probability τ with the coupon collector's problem's probability υ_{K+1} that the K + 1 samples' labels include all supervised labels.
     • Additional insight: the expected K + 1 to draw all supervised labels from ImageNet-1K is about 7,700.
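     A small sketch of the coupon-collector quantities under a uniform-label assumption (not the paper's exact computation): `upsilon(n, C)` is the probability that n draws cover all C labels (via inclusion–exclusion), and `expected_draws(C)` is the classic expected number of draws C·H_C, which for C = 1000 is of the same order as the ≈7,700 quoted on the slide.

     ```python
     # Coupon-collector quantities for C classes, assuming uniform label probabilities.
     from math import comb

     def upsilon(n: int, C: int) -> float:
         # P(n i.i.d. uniform draws cover all C labels), by inclusion-exclusion.
         return sum((-1) ** j * comb(C, j) * ((C - j) / C) ** n for j in range(C + 1))

     def expected_draws(C: int) -> float:
         # Expected number of draws to see every label at least once: C * H_C.
         return C * sum(1 / i for i in range(1, C + 1))

     print(upsilon(32, 10))       # coverage probability for C=10 with K+1=32 draws
     print(expected_draws(1000))  # ~7,485 under uniformity; same order as the ~7,700 on the slide
     ```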
  11. Our bound doesn't explode
     [Figures: upper bound of the supervised loss (Arora et al. vs. Ours) and validation accuracy (%) against # negative samples + 1, on CIFAR-10 and CIFAR-100.]
  12. Conclusion
     • We pointed out the inconsistency between self-supervised learning's common practice and the existing bound.
     • Practice: a large K doesn't hurt classification performance.
     • Theory: a large K hurts classification performance.
     • We proposed a new bound using the coupon collector's problem.
     • Additional results:
       • An upper bound of the collision term.
       • Optimality when υ = 0 with too small K.
       • Experiments on an NLP dataset.
     Paper: https://arxiv.org/abs/2102.06866 (to appear at NeurIPS 2021)
     Code: https://github.com/nzw0301/Understanding-Negative-Samples