Introduction of PAC-Bayes and its Application for Contrastive Unsupervised Representation Learning
https://tinyurl.com/y6nc7md5
Pre-recorded video: https://www.youtube.com/watch?v=in6c_0B6YU4

GENERALISATION
test risk L(f) as small as possible on an unseen test dataset by minimising the empirical risk L̂(f) on the training dataset D.
• Training dataset: D = {(xᵢ, yᵢ)}_{i=1}^N, where input x ∈ 𝒳 = ℝ^d and label y ∈ 𝒴 = {−1, 1}.
• Ex: linear classifier f : 𝒳 → ℝ, f(x) = w⊤x + b, where w ∈ ℝ^d, b ∈ ℝ.
• Loss function ℓ : ℝ × 𝒴 → ℝ₊. Ex: zero-one loss, ℓ[f(x), y] = 1[sign(f(x)) ≠ y].
• Empirical risk: L̂(f) = (1/N) ∑_{i=1}^N ℓ[f(xᵢ), yᵢ].
• Test risk: L(f).
• f̂ = argmin_f L̂(f): the classifier obtained by empirical risk minimisation.
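As a toy illustration (not from the slides; the dataset and weights below are synthetic), the definitions above can be written directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training dataset D = {(x_i, y_i)}: x in R^d, y in {-1, +1}.
N, d = 100, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)  # labels realisable by a linear classifier

def zero_one_loss(scores, labels):
    # l[f(x), y] = 1[sign(f(x)) != y]
    return (np.sign(scores) != labels).astype(float)

def empirical_risk(w, b):
    # hat L(f) = (1/N) * sum_i l[f(x_i), y_i] for f(x) = w^T x + b
    return zero_one_loss(X @ w + b, y).mean()

print(empirical_risk(w_true, 0.0))   # 0.0: w_true separates D perfectly
print(empirical_risk(-w_true, 0.0))  # 1.0: flipping the weights misclassifies everything
```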
decay, Dropout, mixup, etc.
• Create a validation dataset from the training dataset, then:
  • Early stopping
  • Model / parameter selection
  • Check the generalisation error gap (PAC-Bayes is here!)
• Training dataset: D = {(xᵢ, yᵢ)}_{i=1}^N ∼ 𝒟^N.
• Test risk: L(f) = 𝔼_{(x,y)∼𝒟} ℓ(f(x), y).
  • Since 𝒟 is unseen, we cannot minimise it.
• Empirical risk: L̂(f) = (1/N) ∑_{i=1}^N ℓ(f(xᵢ), yᵢ).
  • Since D is seen, we can minimise it instead of L(f).
between the test risk and the empirical risk: Δ(f) = |L(f) − L̂(f)|.
• If Δ(f) is large, f tends to overfit.
• The previous binary classifiers case: Δ(f̂₀) = 0.16 > Δ(f̂₁) = 0.01.
• Minimising Δ(f) sounds great, but it depends on the unseen test risk L(f), so we cannot minimise it.
• Instead, we bound Δ(f) without using L(f).
framework that can give upper or lower bounds of Δ(f).
• Roughly speaking, a PAC upper bound gives the following statement:

Theorem: Let ℱ be a function class of binary classifiers f. Then, ∀δ > 0, with probability at least 1 − δ, the following holds ∀f ∈ ℱ:
  L(f) ≤ L̂(f) + Complexity(ℛ(ℱ), δ),
where the last term is the complexity term.
Corollary 3.4]: Let ℱ be a function class of binary classifiers f with VC-dimension VCdim(ℱ). Then, ∀δ > 0, with probability at least 1 − δ, the following holds ∀f ∈ ℱ:
  L(f) ≤ L̂(f) + √(2 VCdim(ℱ) ln(eN / VCdim(ℱ)) / N) + √(ln(1/δ) / (2N)),
where the last two terms are the complexity term.

When does the complexity term become small?
1. N is large: the number of training data is large.
2. VCdim(ℱ) is small: ℱ does not have too much expressive power.
3. δ is large: but a larger δ makes the probability 1 − δ that the bound holds smaller, so usually this value is small.
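A quick numerical check of points 1–2 (a sketch; `vc_complexity` is a hypothetical helper, not from the slides):

```python
import math

def vc_complexity(N, vcdim, delta):
    # sqrt(2*VCdim*ln(e*N/VCdim)/N) + sqrt(ln(1/delta)/(2*N))
    return (math.sqrt(2 * vcdim * math.log(math.e * N / vcdim) / N)
            + math.sqrt(math.log(1 / delta) / (2 * N)))

# Larger N shrinks the term; larger VCdim grows it.
print(vc_complexity(N=10_000, vcdim=50, delta=0.05))     # ~0.26
print(vc_complexity(N=100_000, vcdim=50, delta=0.05))    # ~0.10
print(vc_complexity(N=10_000, vcdim=5_000, delta=0.05))  # > 1: vacuous
```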
assume ReLU networks with 600 hidden units for binarised MNIST (odd vs. even).
• The number of training data: N = 60 000.
• The number of trainable parameters: W = 471 000.
• For this neural net, VCdim(ℱ) ≤ 𝒪(LW ln W) [P. Bartlett et al. 2019], where L = #layers.
• Thus, the complexity term is always greater than 1:
  Δ(f) = L(f) − L̂(f) ≤ 1 < √(2 VCdim(ℱ) ln(eN / VCdim(ℱ)) / N) + √(ln(1/δ) / (2N)).
• This upper bound cannot explain generalisation: a vacuous bound.
• PAC-Bayes could obtain a non-vacuous bound in this setting [Dziugaite & Roy. 2017].
previous VC-dimension generalisation bound is an upper bound of the following growth-function generalisation bound:
  L(f) ≤ L̂(f) + √(2 ln Π_ℱ(N) / N) + √(ln(1/δ) / (2N)).  (1)
We show that the complexity term in (1) is always greater than 1. When N ≤ VCdim(ℱ), the growth function from Sauer's lemma is Π_ℱ(N) = 2^N, so the right-hand side of (1) becomes
  L̂(f) + √(2N ln 2 / N) + √(ln(1/δ) / (2N)).
Since √(2N ln 2 / N) = √(2 ln 2) ≈ 1.18, (1) is always larger than 1: a vacuous bound.

If you would like to know these technical terms, see [Mohri et al. Chapter 3, 2018].
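The arithmetic in the vacuity argument can be verified directly (a sketch, standard library only):

```python
import math

# When N <= VCdim(F), Sauer's lemma gives Pi_F(N) = 2^N, so the first
# complexity term sqrt(2*ln(Pi_F(N))/N) = sqrt(2*N*ln 2 / N) = sqrt(2*ln 2),
# independently of N.
N = 60_000
first_term = math.sqrt(2 * N * math.log(2) / N)
print(round(first_term, 2))  # 1.18 > 1, so the bound on the 0-1 risk is vacuous
```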
learning framework.
• Assume two probability distributions over the function class ℱ:
  • Prior P: it cannot depend on the training data.
  • Posterior Q: it can depend on the training data.
  • Ex: the weights of a linear classifier are sampled from a multivariate Gaussian.
• KL divergence: KL(Q‖P) = 𝔼_{f∼Q} ln(Q(f)/P(f)).
• The empirical risk and the test risk are re-defined by using Q:
  • Empirical expected-Q risk: L̂(Q) = 𝔼_{f∼Q} L̂(f).
  • Test expected-Q risk: L(Q) = 𝔼_{f∼Q} L(f).
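For the Gaussian example above, KL(Q‖P) has a closed form; a sketch for diagonal covariances (the parameter values below are made up):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # KL(Q || P) for Q = N(mu_q, diag(var_q)) and P = N(mu_p, diag(var_p))
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

mu_p, var_p = np.zeros(3), np.ones(3)                      # prior: data-independent
mu_q, var_q = np.array([0.5, -0.2, 0.1]), np.full(3, 0.5)  # posterior: fit to data

print(kl_diag_gaussians(mu_q, var_q, mu_p, var_p))  # > 0: Q moved away from P
print(kl_diag_gaussians(mu_p, var_p, mu_p, var_p))  # 0.0: KL vanishes iff Q = P
```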
a prior P over ℱ, with probability at least 1 − δ over training samples D ∼ 𝒟^N, the following holds ∀Q over ℱ:
  L(Q) ≤ L̂(Q) + √((KL(Q‖P) + ln(2√N/δ)) / (2N)),
where the last term is the complexity term.

Q. When does the complexity term become small?
A. KL(Q‖P) is small. But a poor prior P causes underfitting of Q, and we cannot optimise P.
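The right-hand side is fully computable once KL(Q‖P) is known; a sketch of evaluating it numerically (the risk and KL values below are made up):

```python
import math

def mcallester_bound(emp_risk, kl, N, delta):
    # hat L(Q) + sqrt((KL(Q||P) + ln(2*sqrt(N)/delta)) / (2*N))
    return emp_risk + math.sqrt(
        (kl + math.log(2 * math.sqrt(N) / delta)) / (2 * N)
    )

# Small KL and large N: tight bound.  Huge KL: vacuous bound.
print(mcallester_bound(emp_risk=0.05, kl=100.0, N=60_000, delta=0.05))  # ~0.08
print(mcallester_bound(emp_risk=0.05, kl=1e6, N=60_000, delta=0.05))    # > 1
```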
2017] show that PAC-Bayes is a useful tool to measure the flatness of a solution of neural networks.
• PAC-Bayes has been used for generalisation analysis [Neyshabur et al. 2018, Dziugaite & Roy. 2018, Jiang et al. 2020, Tsuzuku et al. 2020].
• There was a competition @ NeurIPS to predict generalisation in deep learning! https://sites.google.com/view/pgdl2020/
too difficult to use the PAC-Bayes bound
• (Maybe) tighter than non-PAC-Bayes bounds
• Similar concepts to Bayesian neural networks [Pérez-Ortiz et al. 2020]
• Cons (but several papers tackle these issues):
  • Stochastic models require additional computational cost: we need to realise a model by sampling, and to store the posterior's and prior's parameters.
  • The bound might still be conservative (especially for neural networks).
  • If we select a poor prior P, then the KL term becomes larger: vacuous bound :(
by Benjamin Guedj
• Introductory videos based on his ICML 2019 tutorial
• A Primer on PAC-Bayesian Learning [Guedj. 2019]
  • A short survey. The basic parts of his talk come from this paper.
• NIPS 2017 Workshop: (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights
feature extractor f, e.g. DNNs.
• f(x) and f(x+) are similar; f(x) and f(x−) are dissimilar, where x+ is a sample similar to x and x− is a sample dissimilar to x.
• Contrastive loss: ℓ[f(x) ⋅ (f(x+) − f(x−))].
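A sketch of this loss with the logistic loss plugged in for ℓ (the embedding vectors below are toy values):

```python
import numpy as np

def logistic_loss(v):
    # l(v) = log(1 + exp(-v)); small when v is large and positive
    return np.log1p(np.exp(-v))

def contrastive_loss(z, z_pos, z_neg):
    # l[f(x) . (f(x+) - f(x-))] with z = f(x), z_pos = f(x+), z_neg = f(x-)
    return logistic_loss(z @ (z_pos - z_neg))

z = np.array([1.0, 0.0])
good = contrastive_loss(z, np.array([0.9, 0.1]), np.array([-1.0, 0.0]))
bad = contrastive_loss(z, np.array([-1.0, 0.0]), np.array([0.9, 0.1]))
print(good, bad)  # loss is small when z aligns with z_pos, large otherwise
```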
and derive new algorithms by minimising the bounds.
• We replace the Rademacher complexity term from Arora et al. [2019] with a KL divergence term, which is easier to compute in general.
• The PAC-Bayes bound directly suggests a (theory-driven) learning algorithm.
• We also show a PAC-Bayes bound for non-iid contrastive data.
  • The iid assumption seems unrealistic in many settings and is unlikely to hold for contrastive datasets, e.g. word2vec [Mikolov et al. 2013].
Informal bound: ∀Q over ℱ, w.h.p. 1 − δ,
  L_un(Q) ≤ α L̂_un(Q) + Complexity(KL(Q‖P), δ).
• The complexity term is easier to compute than the Rademacher one.
• Since all terms on the right-hand side are explicit or easy to approximate, we can minimise the bound directly.
Informal bound: ∀Q over ℱ, w.h.p. 1 − δ, the complexity term becomes Complexity(D_f(Q‖P), δ).
• Both previous bounds assume training samples are sampled iid from the test distribution.
• But this assumption does not hold in practice.
  • E.g., a similar sample x+ is generated from a sample x by applying data augmentation.
Note: D_f denotes an f-divergence.
We minimise the bound L̂_un(Q) + Complexity(KL(Q‖P), δ).
• Q and P are multivariate Gaussians with diagonal covariance.
• We optimise Q's mean and covariance by using SGD [Dziugaite & Roy. 2017].
• We approximate L̂_un(Q) by sampling the weights of neural networks from Q.
• Evaluation procedures:
  • Learning: minimise the bound on contrastive unsupervised data.
  • Evaluation: test 0-1 risk on supervised data by using the centroid classifier.
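A sketch of the sampling approximation of the expected-Q risk (a toy linear model and the 0-1 loss stand in for the paper's networks and contrastive loss; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a toy "network": f_w(x) = w^T x.
N, d = 200, 4
X = rng.normal(size=(N, d))
y = np.sign(X @ np.ones(d))

def emp_risk(w):
    # hat L(f_w): zero-one empirical risk of one sampled model
    return float(np.mean(np.sign(X @ w) != y))

def expected_q_risk(mu_q, sigma_q, n_samples=100):
    # hat L(Q) ~ (1/M) sum_m hat L(f_{w_m}),  w_m ~ N(mu_q, diag(sigma_q^2))
    ws = mu_q + sigma_q * rng.normal(size=(n_samples, d))
    return float(np.mean([emp_risk(w) for w in ws]))

# A posterior concentrated near good weights has a small expected risk.
print(expected_q_risk(np.ones(d), 0.05 * np.ones(d)))
```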
for CURL.
• This allows us to derive new algorithms by directly optimising the bound.
• More results in the paper:
  • A general PAC-Bayes bound for multiple dissimilar samples.
  • A learning algorithm and experiments for the non-iid case.

Paper: https://arxiv.org/abs/1910.04464
Code: https://github.com/nzw0301/pb-contrastive
PAC-Bayesian Bounds for Hostile Data,” Mach. Learn., vol. 107, no. 5, pp. 887–902, 2018. - A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor, “Tighter PAC-Bayes Bounds,” in NeurIPS, 2006. - R. Amit and R. Meir, “Meta-Learning by Adjusting Priors based on Extended PAC-Bayes Theory,” in ICML, 2018, pp. 205–214. - S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi, “A Theoretical Analysis of Contrastive Unsupervised Representation Learning,” in ICML, 2019, pp. 5628–5637. - P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, “Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks,” J. Mach. Learn. Res., vol. 20, pp. 1–17, 2019. - L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy, “PAC-Bayesian Bounds based on the Rényi Divergence,” in AISTATS, 2016, pp. 435–444. - G. K. Dziugaite and D. M. Roy, “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data,” in UAI, 2017. - G. K. Dziugaite and D. M. Roy, “Data-dependent PAC-Bayes Priors via Differential Privacy,” in NeurIPS, 2018. - M. M. Fard and J. Pineau, “PAC-Bayesian Model Selection for Reinforcement Learning,” in NeurIPS, 2010. - P. Germain, A. Lacoste, F. Laviolette, M. Marchand, and S. Shanian, “A PAC-Bayes Sample Compression Approach to Kernel Methods,” in ICML, 2011. - P. Germain, A. Habrard, F. Laviolette, and E. Morvant, “A New PAC-Bayesian Perspective on Domain Adaptation,” in ICML, 2016, pp. 859–868. - P. Germain, A. Habrard, F. Laviolette, and E. Morvant, “PAC-Bayes and domain adaptation,” Neurocomputing, vol. 379, pp. 379–397, 2020.
Learning,” J. Société Mathématiques Fr., 2019. - Y. Jiang, B. Neyshabur, D. Krishnan, H. Mobahi, and S. Bengio, “Fantastic Generalization Measures and Where to Find Them,” in ICLR, 2020. - S. M. Kakade, K. Sridharan, and A. Tewari, “On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization,” in NeurIPS, 2008. - A. Majumdar and M. Goldstein, “PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments,” in CoRL, 2018. - D. A. McAllester, “Some PAC-Bayesian Theorems,” in COLT, 1998, pp. 230–234. - D. McNamara and M.-F. Balcan, “Risk Bounds for Transferring Representations With and Without Fine-Tuning,” in ICML, 2017, pp. 2373–2381. - T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in NeurIPS, 2013, pp. 3111–3119. - M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning (2nd ed.), The MIT Press, 2018. - B. Neyshabur, S. Bhojanapalli, and N. Srebro, “A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks,” in ICLR, 2018. - D. Reeb, A. Doerr, S. Gerwinn, and B. Rakitsch, “Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds,” in NeurIPS, 2018. - M. Seeger, “PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification,” J. Mach. Learn. Res., vol. 3, pp. 233–269, 2002. - T. Suzuki, “PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model,” in COLT, 2012. - Y. Tsuzuku, I. Sato, and M. Sugiyama, “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis,” in ICML, 2020. - J. Yang, S. Sun, and D. M. Roy, “Fast-rate PAC-Bayes Generalization Bounds via Shifted Rademacher Processes,” in NeurIPS, 2019.