
Introduction of PAC-Bayes and its Application for Contrastive Unsupervised Representation Learning

Kento Nozawa
December 02, 2020


Invited talk at StatsML Symposium'20 in Japan.
https://sites.google.com/view/statsmlsymposium20

Paper: https://arxiv.org/abs/1910.04464
Pre-recorded video (Japanese): https://www.youtube.com/watch?v=in6c_0B6YU4


Transcript

  1. December 4, 2020 @ The 5th Statistics & Machine Learning Young Researchers Symposium
     Kento Nozawa (github/twitter: @nzw0301)
     Slides: https://tinyurl.com/y6nc7md5
     Pre-recorded video: https://www.youtube.com/watch?v=in6c_0B6YU4
     Introduction of PAC-Bayes and its Application for Contrastive Unsupervised Representation Learning
  2. Supervised classification
     Goal: learn a classifier $f$ whose test risk $L(f)$ on an unseen test dataset is as small as possible, by minimising the empirical risk $\hat{L}(f)$ on a training dataset $D$.
     • Training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where input $x \in \mathcal{X} = \mathbb{R}^d$ and label $y \in \mathcal{Y} = \{-1, 1\}$.
     • Classifier $f \colon \mathcal{X} \to \mathbb{R}$. Ex: linear classifier $f(x) = w^\top x + b$, where $w \in \mathbb{R}^d, b \in \mathbb{R}$.
     • Loss $\ell \colon \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. Ex: zero-one loss $\ell[f(x), y] = \mathbb{1}[\mathrm{sign}(f(x)) \neq y]$.
     • Empirical risk: $\hat{L}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell[f(x_i), y_i]$.
     • Test risk: $L(f)$.
     • $\hat{f} = \mathrm{argmin}_f\, \hat{L}(f)$: the classifier obtained by empirical risk minimisation.
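The slide's definitions translate almost directly into code. A minimal NumPy sketch of the zero-one loss and empirical risk for a linear classifier; the toy data and weights are my own illustration, not from the talk:

```python
import numpy as np

def zero_one_loss(score, y):
    """Zero-one loss: 1 if sign(f(x)) disagrees with the label y, else 0."""
    return float(np.sign(score) != y)

def empirical_risk(w, b, X, y):
    """Empirical risk L_hat(f) = (1/N) * sum_i loss(f(x_i), y_i)
    for the linear classifier f(x) = w^T x + b."""
    scores = X @ w + b
    return np.mean(np.sign(scores) != y)

# Toy data: two points per class in R^2, labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0
print(empirical_risk(w, b, X, y))  # -> 0.0 (all points correctly classified)
```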
  3. Let’s consider a binary classification problem.

  4. Which classifier is supposed to be better, $\hat{f}_0$ or $\hat{f}_1$?
  5. Which classifier is supposed to be better?
     $\hat{L}(\hat{f}_0) = 0.02$ is lower than $\hat{L}(\hat{f}_1) = 0.14$. Does $\hat{f}_0$ look better?
  6. Overfitting
     No! $L(\hat{f}_0) = 0.18$ is larger than $\hat{L}(\hat{f}_0) = 0.02$, while $L(\hat{f}_1) = 0.15$ is almost the same as $\hat{L}(\hat{f}_1) = 0.14$.
  7. How to avoid overfitting?
     • Apply regularisation: weight decay, Dropout, mixup, etc.
     • Create a validation dataset from the training dataset, then use it for early stopping and model/parameter selection.
     • Check the generalisation error gap (this is where PAC-Bayes comes in!)
  8. Notations
     • Test data distribution $\mathcal{D}$: a joint distribution over $\mathcal{X} \times \mathcal{Y}$.
     • Training dataset: $D = \{(x_i, y_i)\}_{i=1}^{N} \sim \mathcal{D}^N$.
     • Test risk: $L(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \ell(f(x), y)$. Since $\mathcal{D}$ is unseen, we cannot minimise it.
     • Empirical risk: $\hat{L}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i)$. Since $D$ is seen, we minimise it instead of $L(f)$.
  9. Generalisation error gap
     The generalisation error gap is the difference between the test risk and the empirical risk: $\Delta(f) = |L(f) - \hat{L}(f)|$.
     • If $\Delta(f)$ is large, $f$ tends to overfit.
     • In the previous binary classification example: $\Delta(\hat{f}_0) = 0.16 > \Delta(\hat{f}_1) = 0.01$.
     • Minimising $\Delta(f)$ sounds great, but it depends on $L(f)$, so we cannot minimise it.
     • Instead, we upper-bound $\Delta(f)$ without using $L(f)$.
  10. Probably Approximately Correct (PAC) learning
     • A framework that gives upper or lower bounds on $\Delta(f)$.
     • Roughly speaking, a PAC upper bound gives the following statement:
     Theorem: Let $\mathcal{F}$ be a function class of binary classifiers $f$. Then, $\forall \delta > 0$, with probability at least $1 - \delta$, the following holds $\forall f \in \mathcal{F}$:
     $L(f) \leq \hat{L}(f) + \underbrace{\epsilon(\mathcal{R}(\mathcal{F}), \delta)}_{\text{complexity term}}$.
  11. Vapnik–Chervonenkis (VC) dimension generalisation bound
     [Mohri et al. 2018, Corollary 3.4]: Let $\mathcal{F}$ be a function class of binary classifiers $f$ with VC-dimension $\mathrm{VCdim}(\mathcal{F})$. Then, $\forall \delta > 0$, with probability at least $1 - \delta$, the following holds $\forall f \in \mathcal{F}$:
     $L(f) \leq \hat{L}(f) + \sqrt{\frac{2\,\mathrm{VCdim}(\mathcal{F}) \ln \frac{eN}{\mathrm{VCdim}(\mathcal{F})}}{N}} + \sqrt{\frac{\ln \frac{1}{\delta}}{2N}}$,
     where the last two terms form the complexity term. When does the complexity term become small?
     1. $N$ is large: the number of training data is large.
     2. $\mathrm{VCdim}(\mathcal{F})$ is small: $\mathcal{F}$ does not have too much expressive power.
     3. $\delta$ is large: but a large $\delta$ lowers the probability that the bound holds, so this value is usually kept small.
  12. Let’s consider the VC-dimension for neural networks…
     • Assume a ReLU network with 600 hidden units for binarised MNIST (odd vs. even digits).
     • The number of training data: $N = 60{,}000$.
     • The number of trainable parameters: $W = 471{,}000$.
     • For this neural net, $\mathrm{VCdim}(\mathcal{F}) \leq \mathcal{O}(LW \ln W)$ [P. Bartlett et al. 2019], where $L$ = #layers.
     • Thus the complexity term is always greater than 1: $\Delta(f) = L(f) - \hat{L}(f) \leq 1 < \sqrt{\frac{2\,\mathrm{VCdim}(\mathcal{F}) \ln \frac{eN}{\mathrm{VCdim}(\mathcal{F})}}{N}} + \sqrt{\frac{\ln \frac{1}{\delta}}{2N}}$.
     • This upper bound cannot explain generalisation: it is a vacuous bound.
     • PAC-Bayes can obtain a non-vacuous bound in this setting [Dziugaite & Roy. 2017].
  13. Proof of the vacuousness of the VC-dimension bound
     The previous VC-dimension generalisation bound is an upper bound of the following growth-function generalisation bound:
     $L(f) \leq \hat{L}(f) + \sqrt{\frac{2 \ln \Pi_{\mathcal{F}}(N)}{N}} + \sqrt{\frac{\ln \frac{1}{\delta}}{2N}}. \quad (1)$
     We show that the complexity term of (1) is always greater than 1. When $N \leq \mathrm{VCdim}(\mathcal{F})$, the growth function from Sauer’s lemma is $\Pi_{\mathcal{F}}(N) = 2^N$, so the first complexity term of (1) becomes $\sqrt{\frac{2N \ln 2}{N}} = \sqrt{2 \ln 2} \approx 1.18$. Since $\sqrt{2 \ln 2} > 1$, bound (1) is vacuous.
     If you would like to know these technical terms, see [Mohri et al. Chapter 3, 2018].
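The arithmetic on this slide can be checked in a few lines; the constant $\sqrt{2 \ln 2}$ evaluates to the ≈1.18 quoted above, independently of $N$:

```python
import math

def growth_function_term(n, log_growth):
    """First complexity term of the growth-function bound:
    sqrt(2 * ln Pi_F(n) / n), with ln Pi_F(n) passed in directly."""
    return math.sqrt(2.0 * log_growth / n)

# When n <= VCdim(F), Sauer's lemma gives Pi_F(n) = 2^n, so
# ln Pi_F(n) = n * ln 2 and the term collapses to sqrt(2 ln 2) > 1:
# the bound is vacuous no matter how large n is.
n = 60_000  # binarised MNIST training-set size from the previous slide
term = growth_function_term(n, n * math.log(2.0))
print(round(term, 4))  # -> 1.1774
```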
  14. Preliminaries for PAC-Bayes
     PAC-Bayes is a Bayesian-flavoured PAC learning framework.
     • Assume two probability distributions over the function class $\mathcal{F}$:
     • Prior $P$: it cannot depend on the training data.
     • Posterior $Q$: it can depend on the training data.
     • Ex: the weights of a linear classifier $f$ are sampled from a multivariate Gaussian.
     • $\mathrm{KL}(Q \| P) = \mathbb{E}_{f \sim Q} \ln \frac{Q(f)}{P(f)}$.
     • The empirical risk and the test risk are re-defined by taking expectations over $Q$:
     • Empirical expected-Q risk: $\hat{L}(Q) = \mathbb{E}_{f \sim Q}\, \hat{L}(f)$.
     • Test expected-Q risk: $L(Q) = \mathbb{E}_{f \sim Q}\, L(f)$.
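When $P$ and $Q$ are Gaussians, as in this talk's examples, the KL term has a closed form. A small helper for diagonal multivariate Gaussians (my own sketch, not code from the talk):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(Q || P) between two diagonal multivariate Gaussians
    Q = N(mu_q, diag(var_q)) and P = N(mu_p, diag(var_p)),
    summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu = np.zeros(3)
var = np.ones(3)
print(kl_diag_gaussians(mu, var, mu, var))       # -> 0.0 (Q = P)
print(kl_diag_gaussians(mu + 1.0, var, mu, var))  # -> 1.5 (shifted mean)
```

Note that the KL divergence is asymmetric: swapping `mu_q`/`var_q` with `mu_p`/`var_p` generally changes the value.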
  15. The first PAC-Bayes bound [D. McAllester. 1998]
     Theorem: Given a prior $P$ over $\mathcal{F}$, with probability at least $1 - \delta$ over training samples $D \sim \mathcal{D}^N$, the following holds for all $Q$ over $\mathcal{F}$:
     $L(Q) \leq \hat{L}(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln \frac{2\sqrt{N}}{\delta}}{2N}}$.
     Q. When does the complexity term become small?
     A. When $\mathrm{KL}(Q \| P)$ is small. But a poor prior $P$ causes underfitting of $Q$, and we cannot optimise $P$, since it must be fixed before seeing the training data.
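A McAllester-style bound of the form $L(Q) \leq \hat{L}(Q) + \sqrt{(\mathrm{KL}(Q\|P) + \ln(2\sqrt{N}/\delta))/(2N)}$ is cheap to evaluate numerically. A sketch, where the confidence term and all input numbers are my own assumptions for illustration:

```python
import math

def mcallester_bound(emp_risk, kl, n, delta):
    """McAllester-style PAC-Bayes upper bound on the test expected-Q risk:
    L(Q) <= L_hat(Q) + sqrt((KL(Q||P) + ln(2 sqrt(n) / delta)) / (2 n))."""
    complexity = math.sqrt(
        (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    )
    return emp_risk + complexity

# Hypothetical numbers: the bound tightens as n grows and KL shrinks.
print(mcallester_bound(emp_risk=0.05, kl=10.0, n=60_000, delta=0.05))
```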
  16. What is the intuitive benefit of $L(Q)$?
     Let’s consider a test risk $L(f)$ that has both flat minima and sharp minima.
  17. Flat minima’s risk tends to be smaller!
     • Suppose we have two Gaussian posteriors $Q_0 = \mathcal{N}(-3, 1)$ and $Q_1 = \mathcal{N}(0.75, 1)$.
  18. Flat minima’s risk tends to be smaller!
     • With the two posteriors $Q_0 = \mathcal{N}(-3, 1)$ and $Q_1 = \mathcal{N}(0.75, 1)$, the sharp minimum is penalised more than the flat one.
     • Note: $L(Q_0) = 0.16 < L(Q_1) = 0.39$.
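This effect is easy to reproduce with a Monte-Carlo estimate of $L(Q) = \mathbb{E}_{f \sim Q} L(f)$. The 1-D risk curve below is my own stand-in for the slide's figure, so the numbers differ from 0.16 and 0.39, but the ordering is the same: the posterior centred on the flat basin has the lower expected risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(w):
    """Toy 1-D test risk (an illustration, not the slide's exact curve):
    a wide, flat basin around w = -3 and a narrow, sharp one around w = 0.75."""
    flat = 1.0 - np.exp(-((w + 3.0) ** 2) / 8.0)     # wide basin
    sharp = 1.0 - np.exp(-((w - 0.75) ** 2) / 0.05)  # narrow basin
    return np.minimum(flat, sharp)

def expected_risk(mu, sigma, n_samples=100_000):
    """Monte-Carlo estimate of L(Q) = E_{w ~ N(mu, sigma^2)} L(w)."""
    w = rng.normal(mu, sigma, size=n_samples)
    return risk(w).mean()

# Both minima reach risk ~0 at their centre, but averaging over the
# posterior penalises the sharp basin much more.
print(expected_risk(-3.0, 1.0) < expected_risk(0.75, 1.0))  # -> True
```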
  19. Related to this topic
     • [Dziugaite & Roy. 2017] show that PAC-Bayes is a useful tool to measure the flatness of a solution of neural networks.
     • PAC-Bayes has been used for generalisation analysis [Neyshabur et al. 2018, Dziugaite & Roy. 2018, Jiang et al. 2020, Tsuzuku et al. 2020].
     • There was a competition @ NeurIPS to predict generalisation in deep learning! https://sites.google.com/view/pgdl2020/
  20. Pros / Cons of PAC-Bayes
     • Pros
     • Not too difficult to use a PAC-Bayes bound.
     • (Maybe) tighter than non-PAC-Bayes bounds.
     • Similar concepts to Bayesian neural networks [Pérez-Ortiz et al. 2020].
     • Cons (but several papers tackle these issues):
     • Stochastic models require additional computational cost: realising a model needs sampling, and both the posterior’s and the prior’s parameters must be stored.
     • The bound might still be conservative (especially for neural networks).
     • If we select a poor prior $P$, the $\mathrm{KL}$ term becomes large: vacuous bound :(
  21. Many other applications
     • Kernel-based machine learning [Ambroladze et al. 2006, Germain et al. 2011]
     • Reinforcement learning [Fard & Pineau. 2010, Majumdar & Goldstein. 2018]
     • Meta learning [Amit & Meir. 2018]
     • Transfer learning [McNamara & Balcan. 2017]
     • Domain adaptation [Germain et al. 2020]
     • Gaussian processes [Seeger. 2002, Suzuki. 2012, Reeb et al. 2018]
     • Connection to Bayesian inference [Germain et al. 2016]
     • Connection to Rademacher complexity [Kakade et al. 2008, Yang et al. 2019]
     • Other divergences: Rényi ($\alpha$-)divergence [Bégin et al. 2016], $f$-divergence [Alquier & Guedj. 2018]
     • …
  22. Further materials to dive into PAC-Bayes
     • Recent talks by Benjamin Guedj
     • His introduction videos based on his ICML 2019 tutorial
     • A Primer on PAC-Bayesian Learning [Guedj. 2019]
     • A short survey; the basic parts of his talks come from this paper.
     • NIPS 2017 Workshop: (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights
  23. PAC-Bayesian Contrastive Unsupervised Representation Learning
     (Kento Nozawa, Pascal Germain, Benjamin Guedj)
     Nozawa, K., Germain, P., & Guedj, B. PAC-Bayesian Contrastive Unsupervised Representation Learning. In UAI, PMLR 124:21–30, 2020. https://arxiv.org/abs/1910.04464
  24. Contrastive unsupervised representation learning (CURL)
     Goal: learn a good feature extractor $f$, e.g. a DNN.
     • $x$ and $x^+$ are similar; $x$ and $x^-$ are dissimilar.
  25. Contrastive unsupervised representation learning (CURL)
     Goal: learn a good feature extractor $f$, e.g. a DNN, such that $f(x)$ and $f(x^+)$ are similar, while $f(x)$ and $f(x^-)$ are dissimilar.
     Contrastive loss: $\ell[f(x) \cdot (f(x^+) - f(x^-))]$.
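A sketch of this contrastive loss, instantiating the generic $\ell$ with the logistic surrogate $\ell(v) = \ln(1 + e^{-v})$ as one common convex choice (the slide leaves $\ell$ unspecified); the example representations are my own:

```python
import numpy as np

def logistic_loss(v):
    """Logistic surrogate loss ln(1 + exp(-v)): small when v is large."""
    return np.log1p(np.exp(-v))

def contrastive_loss(fx, fx_pos, fx_neg):
    """Contrastive loss ell[ f(x) . (f(x+) - f(x-)) ]: small when f(x)
    has a larger inner product with the similar sample's representation
    than with the dissimilar one's."""
    return logistic_loss(fx @ (fx_pos - fx_neg))

fx = np.array([1.0, 0.0])
fx_pos = np.array([0.9, 0.1])   # similar -> large inner product with fx
fx_neg = np.array([-0.8, 0.5])  # dissimilar -> small inner product
print(contrastive_loss(fx, fx_pos, fx_neg) < logistic_loss(0.0))  # -> True
```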
  26. Learnt representations work for supervised tasks
     A classifier $g$ on top of the fixed feature extractor $\hat{f}$ produces the predicted label $\hat{y} = g(\hat{f}(x))$ from the input $x$.
     Why does CURL perform well?
  27. Our contributions
     • We show PAC-Bayes bounds for CURL and derive new algorithms by minimising the bounds.
     • We replace the Rademacher complexity term of Arora et al. [2019] with a $\mathrm{KL}$ divergence term, which is easier to compute in general.
     • The PAC-Bayes bound directly suggests a (theory-driven) learning algorithm.
     • We also show a PAC-Bayes bound for non-iid contrastive data.
     • The iid assumption seems unrealistic in many settings and is unlikely to hold for contrastive datasets, e.g. word2vec [Mikolov et al. 2013].
  28. The first theoretical guarantees for CURL [Arora et al. 2019]
     Finding a good representation $\hat{f}$ guarantees generalising well.
     Informal bound: $L_{\mathrm{sup}}(\hat{f}) \leq \alpha L_{\mathrm{un}}(f) + \underbrace{\epsilon(\mathcal{R}(\mathcal{F}), \delta)}_{\text{complexity}}$, $\forall f \in \mathcal{F}$, w.h.p. $1 - \delta$.
     • $\alpha$: a constant.
     • $\mathcal{R}(\mathcal{F})$: Rademacher complexity of the function class $\mathcal{F}$.
     • $\delta$: confidence of PAC learning.
  29. The first PAC-Bayesian generalisation bound for CURL
     Informal bound: $L_{\mathrm{sup}}(Q) \leq \alpha \hat{L}_{\mathrm{un}}(Q) + \underbrace{\epsilon(\mathrm{KL}(Q \| P), \delta)}_{\text{complexity}}$, $\forall Q$ over $\mathcal{F}$, w.h.p. $1 - \delta$.
     • The complexity term is easier to compute than the Rademacher one.
     • Since all terms on the right-hand side are explicit or easy to approximate, we can minimise the bound directly.
  30. Beyond the iid assumption
     Informal bound: $L_{\mathrm{sup}}(Q) \leq \alpha \hat{L}_{\mathrm{un}}(Q) + \epsilon(D_f(Q \| P), \delta)$, $\forall Q$ over $\mathcal{F}$, w.h.p. $1 - \delta$.
     • Both previous bounds assume the training samples are drawn iid from the test distribution.
     • But this assumption does not hold in practice.
     • E.g., a similar sample $x^+$ is generated from a sample $x$ by applying data augmentation.
     Note: $D_f$ denotes an $f$-divergence.
  31. Learning algorithms & experiments
     • Minimise $\hat{L}_{\mathrm{un}}(Q) + \epsilon(\mathrm{KL}(Q \| P), \delta)$ with respect to $Q$.
     • $Q$ and $P$ are multivariate Gaussians with diagonal covariance.
     • We optimise $Q$’s mean and covariance by using SGD [Dziugaite & Roy. 2017].
     • Approximate $\hat{L}_{\mathrm{un}}(Q)$ by sampling the weights of neural networks from $Q$.
     • Evaluation procedure:
     • Learning: on contrastive unsupervised data.
     • Evaluation: test 0-1 risk on supervised data using a centroid classifier.
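The steps on this slide can be sketched as a single evaluation of the training objective. This is my own reconstruction assuming a McAllester-style complexity term and diagonal Gaussians, not the authors' released code (see their repository for the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def pac_bayes_objective(mu, log_sigma, mu_prior, sigma_prior,
                        emp_risk_fn, n, delta, n_samples=8):
    """One evaluation of L_hat_un(Q) + sqrt((KL(Q||P) + ln(2 sqrt(n)/delta)) / (2n)),
    with Q = N(mu, diag(sigma^2)) and P = N(mu_prior, diag(sigma_prior^2)) over the
    weights, and L_hat_un(Q) approximated by sampling weights from Q
    (the reparameterisation w = mu + sigma * eps keeps mu and log_sigma
    differentiable in an autodiff framework)."""
    sigma = np.exp(log_sigma)
    # Monte-Carlo estimate of the empirical expected-Q risk.
    risks = [emp_risk_fn(mu + sigma * rng.standard_normal(mu.shape))
             for _ in range(n_samples)]
    emp_risk_q = float(np.mean(risks))
    # Closed-form KL between the diagonal Gaussians Q and P.
    kl = 0.5 * np.sum(np.log(sigma_prior**2 / sigma**2)
                      + (sigma**2 + (mu - mu_prior)**2) / sigma_prior**2 - 1.0)
    complexity = np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))
    return emp_risk_q + complexity

# Toy check with a dummy risk: zero risk everywhere and Q = P, so only the
# confidence term remains.
mu = np.zeros(4)
obj = pac_bayes_objective(mu, np.zeros(4), mu, np.ones(4),
                          emp_risk_fn=lambda w: 0.0, n=10_000, delta=0.05)
print(obj > 0.0)  # -> True (the confidence term never vanishes)
```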
  32. Experimental results: supervised performance & bound
     • AVG-2 risk: the 0-1 risk averaged over all combinations of two classes in the supervised data.
     • PAC-Bayes bound: computed on the stochastic neural networks.
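A sketch of the AVG-2 evaluation with a centroid classifier, assuming nearest-centroid assignment in Euclidean distance (the paper's exact protocol may differ); the toy representations are my own:

```python
from itertools import combinations

import numpy as np

def avg2_risk(reps, labels):
    """AVG-2 risk: the 0-1 risk of a centroid classifier, averaged over every
    pair of classes. The centroid classifier assigns a representation to the
    class whose mean representation (centroid) is nearest."""
    classes = np.unique(labels)
    centroids = {c: reps[labels == c].mean(axis=0) for c in classes}
    risks = []
    for a, b in combinations(classes, 2):
        mask = (labels == a) | (labels == b)
        r, y = reps[mask], labels[mask]
        # Classify each point by its nearest of the two class centroids.
        pred = np.where(
            np.linalg.norm(r - centroids[a], axis=1)
            <= np.linalg.norm(r - centroids[b], axis=1), a, b)
        risks.append(np.mean(pred != y))
    return float(np.mean(risks))

# Toy representations: three well-separated classes -> zero AVG-2 risk.
reps = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
                 [-5.0, 5.0], [-5.1, 5.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
print(avg2_risk(reps, labels))  # -> 0.0
```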
  33. Conclusion
     • We provide the first PAC-Bayes generalisation bounds for CURL.
     • This allows deriving new algorithms by directly optimising the bound.
     • More results in the paper:
     • A general PAC-Bayes bound for multiple dissimilar samples.
     • A learning algorithm and experiments for the non-iid case.
     Paper: https://arxiv.org/abs/1910.04464
     Code: https://github.com/nzw0301/pb-contrastive
  34. References 1/2
     - P. Alquier and B. Guedj, “Simpler PAC-Bayesian Bounds for Hostile Data,” Mach. Learn., vol. 107, no. 5, pp. 887–902, 2018.
     - A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor, “Tighter PAC-Bayes Bounds,” in NeurIPS, 2006.
     - R. Amit and R. Meir, “Meta-Learning by Adjusting Priors based on Extended PAC-Bayes Theory,” in ICML, 2018, pp. 205–214.
     - S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi, “A Theoretical Analysis of Contrastive Unsupervised Representation Learning,” in ICML, 2019, pp. 5628–5637.
     - P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, “Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks,” J. Mach. Learn. Res., vol. 20, pp. 1–17, 2019.
     - L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy, “PAC-Bayesian Bounds based on the Rényi Divergence,” in AISTATS, 2016, pp. 435–444.
     - G. K. Dziugaite and D. M. Roy, “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data,” in UAI, 2017.
     - G. K. Dziugaite and D. M. Roy, “Data-dependent PAC-Bayes Priors via Differential Privacy,” in NeurIPS, 2018.
     - M. M. Fard and J. Pineau, “PAC-Bayesian Model Selection for Reinforcement Learning,” in NeurIPS, 2010.
     - P. Germain, A. Lacoste, F. Laviolette, M. Marchand, and S. Shanian, “A PAC-Bayes Sample Compression Approach to Kernel Methods,” in ICML, 2011.
     - P. Germain, A. Habrard, F. Laviolette, and E. Morvant, “A New PAC-Bayesian Perspective on Domain Adaptation,” in ICML, 2016, pp. 859–868.
     - P. Germain, A. Habrard, F. Laviolette, and E. Morvant, “PAC-Bayes and Domain Adaptation,” Neurocomputing, vol. 379, pp. 379–397, 2020.
  35. References 2/2
     - B. Guedj, “A Primer on PAC-Bayesian Learning,” J. Société Mathématiques Fr., 2019.
     - Y. Jiang, B. Neyshabur, D. Krishnan, H. Mobahi, and S. Bengio, “Fantastic Generalization Measures and Where to Find Them,” in ICLR, 2020.
     - S. M. Kakade, K. Sridharan, and A. Tewari, “On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization,” in NeurIPS, 2008.
     - A. Majumdar and M. Goldstein, “PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments,” in CoRL, 2018.
     - D. A. McAllester, “Some PAC-Bayesian Theorems,” in COLT, 1998, pp. 230–234.
     - D. McNamara and M.-F. Balcan, “Risk Bounds for Transferring Representations With and Without Fine-Tuning,” in ICML, 2017, pp. 2373–2381.
     - T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NeurIPS, 2013, pp. 3111–3119.
     - M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (2nd ed.), The MIT Press, 2018.
     - B. Neyshabur, S. Bhojanapalli, and N. Srebro, “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks,” in ICLR, 2018.
     - D. Reeb, A. Doerr, S. Gerwinn, and B. Rakitsch, “Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds,” in NeurIPS, 2018.
     - M. Seeger, “PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification,” J. Mach. Learn. Res., vol. 3, pp. 233–269, 2002.
     - T. Suzuki, “PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model,” in COLT, 2012.
     - Y. Tsuzuku, I. Sato, and M. Sugiyama, “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis,” in ICML, 2020.
     - J. Yang, S. Sun, and D. M. Roy, “Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes,” in NeurIPS, 2019.