Kazu Ghalamkari
May 04, 2020

Deep Neural Networks as Gaussian Processes

https://arxiv.org/abs/1711.00165


Transcript

1. Deep Neural Networks as Gaussian Processes ICLR 2018 Jaehoon Lee,

Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
2. Prolusion
Background:
・A fully-connected NN is equivalent to a GP in the limit of infinite network width.
Contributions:
・GP classification results on CIFAR-10 and MNIST were better than those of the corresponding NNs.
・Proposes a way to find the GP kernel function numerically.
Why I read this paper:
・GPs overcome some weaknesses of NNs.
・We can use GPs in PyTorch and TF, with GPU speed-up.

3. Contents
Background:
・GP as extended linear regression
・Kernel trick
・The relationship between NN and GP
The contents of the paper:
・How to calculate the kernel function numerically
・Experimental results
・Phase transition related to the GP hyperparameters
・Conclusion

4. Linear regression
Regression by a linear combination of basis functions.
Basis functions $\phi(x) = (\phi_1(x), \cdots, \phi_H(x))^T$, weight vector $w = (w_1, \cdots, w_H)^T$:
$y(x) = w^T \phi(x) = \sum_{h=1}^{H} w_h\, \phi_h(x)$
Design matrix $\Phi \in \mathbb{R}^{N \times H}$:
$\Phi = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix}$
Estimate $w$ from the training data set $\{(x_n, y_n)\}_{n=1}^{N}$: $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$, with $y = (y_1, \cdots, y_N)^T$.
The predicted value at any (test) point is then $y(x) = \hat{w}^T \phi(x)$.

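The least-squares fit on this slide can be sketched in a few lines of NumPy; the polynomial basis, the toy sine target, and all sizes below are my illustrative choices, not the slide's:

```python
import numpy as np

def design_matrix(x, H=4):
    # Phi[n, h] = phi_h(x_n), here with the polynomial basis phi_h(x) = x**h
    return np.stack([x ** h for h in range(H)], axis=1)

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 20)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(20)

Phi = design_matrix(x_train)
# Least-squares estimate w = (Phi^T Phi)^{-1} Phi^T y
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)

# Predicted value at any (test) point: y(x) = w^T phi(x)
y_pred = design_matrix(np.array([0.5])) @ w_hat
```

`np.linalg.solve` on the normal equations is used instead of forming the inverse explicitly, which is the numerically preferred way to evaluate $(\Phi^T\Phi)^{-1}\Phi^T y$.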
5. Weakness of linear regression
$\phi = (1, x, x^2)$?  $\phi = (1, x, x^2, x^3, x^4)$?  $\phi = (1, x, \sin x)$?  $\phi = \,$????
We have to find proper basis functions manually.

6. Radial basis function regression
$\phi_h(x) = \exp\left(-\frac{(x - \mu_h)^2}{2\sigma^2}\right)$, with centers $\mu = (\mu_{-H}, \mu_{-H+1}, \cdots, \mu_{H-1}, \mu_H)$ and weight vector $w = (w_{-H}, \cdots, w_H)^T \in \mathbb{R}^{2H+1}$
Use shifted Gaussians as basis functions. Gaussians are expressive!

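A minimal sketch of the shifted-Gaussian design matrix and fit, assuming illustrative values for $\sigma$, the centers, and a toy sine target (none of these come from the slide):

```python
import numpy as np

def rbf_design_matrix(x, centers, sigma=0.3):
    # Phi[n, h] = exp(-(x_n - mu_h)^2 / (2 sigma^2))
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

centers = np.linspace(-1.0, 1.0, 11)   # mu_{-H}, ..., mu_{H} with H = 5
x = np.linspace(-1.0, 1.0, 50)
Phi = rbf_design_matrix(x, centers)

# Least-squares fit of a smooth target; tiny jitter for conditioning
y = np.sin(np.pi * x)
w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(11), Phi.T @ y)
```

With 11 Gaussians the smooth target is reproduced almost exactly, which is the "Gaussians are expressive" point of the slide.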
7. Radial basis function regression
Number of basis functions: 10 ($\sigma = 1.0$). RBF is expressive.
But in an $n$-dimensional problem, $10^n$ RBFs are needed as basis functions, so a $10^n$-dimensional weight vector has to be estimated: the curse of dimensionality.

8. Derivation of Gaussian process
RBFR → GPR: the number of parameters to be estimated increases in high dimensions; avoid the curse of dimensionality by integration with respect to $w$.
In RBFR, the output at any input is $y = \Phi w$, i.e.
$\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_H \end{pmatrix}$
We introduce a prior distribution $w \sim N(0, \lambda^2 I)$, so that
$E[w] = 0$, $\quad E[w w^T] - E[w]E[w]^T = \lambda^2 I$  (used later).

9. Derivation of Gaussian process
GPR: because $w$ follows a Gaussian, $y = \Phi w$ with $w \sim N(0, \lambda^2 I)$ also follows a Gaussian:
$E[y] = \Phi\, E[w] = 0$
$\mathrm{cov}[y] = E[y y^T] - E[y] E[y]^T = \Phi\, E[w w^T]\, \Phi^T = \lambda^2 \Phi \Phi^T$
so $y \sim N(0, \lambda^2 \Phi \Phi^T)$. We do not have to know the weight vector even if the dimension of $w$ is high: the curse of dimensionality is avoided by integration with respect to $w$.
Definition: if, for any input set $x_1, \cdots, x_N$, the joint distribution of the outputs $y = (y_1, \cdots, y_N)$ follows a Gaussian, the relation between $x$ and $y$ follows a Gaussian process.

10. Kernel trick
$y \sim N(0, \lambda^2 \Phi \Phi^T) \equiv N(0, K)$, with kernel $K(x, x') = \lambda^2\, \phi(x)^T \phi(x')$: the inner product of the basis-function vectors $\phi(x) = (\phi_1(x), \cdots, \phi_H(x))^T$ (a measure of how much $x$ and $x'$ look similar).
Example: RBF kernel. Put RBFs $\phi_h(x) = \exp\left(-\frac{(x - h/H)^2}{2\sigma^2}\right)$ whose centers $h/H$ are placed at even $1/H$ intervals in the range $x \in [-M, M]$, $h = -MH, \cdots, MH$:
$K(x, x') = \lambda^2 \sum_{h=-MH}^{MH} \phi_h(x)\, \phi_h(x')$
$\lim_{H \to \infty} K(x, x') = \lambda^2 \int_{-\infty}^{\infty} \exp\left(-\frac{(x - h)^2}{2\sigma^2}\right) \exp\left(-\frac{(x' - h)^2}{2\sigma^2}\right) dh = \theta_1 \exp\left(-\frac{(x - x')^2}{\theta_2}\right)$
with $\theta_1 = \sqrt{\pi \sigma^2}\, \lambda^2$ and $\theta_2 = 4\sigma^2$ (absorbing the $1/H$ grid spacing into $\lambda^2$).

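The $H \to \infty$ limit can be verified numerically: a dense Riemann sum over evenly spaced Gaussian basis products should match the closed-form Gaussian integral. The value of $\sigma$ and the grids are illustrative:

```python
import numpy as np

sigma = 0.4
centers = np.linspace(-20.0, 20.0, 40001)   # dense, evenly spaced centers
dh = centers[1] - centers[0]

def phi(x):
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

x, xp = 0.3, -0.5
k_sum = np.sum(phi(x) * phi(xp)) * dh       # Riemann sum ~ the integral
# Closed form of the Gaussian integral:
#   int exp(-(x-h)^2 / 2s^2) exp(-(x'-h)^2 / 2s^2) dh
#   = sqrt(pi) * sigma * exp(-(x - x')^2 / (4 sigma^2))
k_exact = np.sqrt(np.pi) * sigma * np.exp(-(x - xp) ** 2 / (4 * sigma ** 2))
```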
11. How to do regression with a GP?
$y \sim N(0, K)$, with $K_{n n'} \equiv K(x_n, x_{n'}) = \theta_1 \exp\left(-\frac{(x_n - x_{n'})^2}{\theta_2}\right)$
How do we predict the unknown value $y_*$ corresponding to a new input $x_*$?
$\begin{pmatrix} y \\ y_* \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} K & k_* \\ k_*^T & k_{**} \end{pmatrix}\right)$, with $k_* = (K(x_*, x_1), \cdots, K(x_*, x_N))^T$ and $k_{**} = K(x_*, x_*)$
Conditioning on the data $D$: $p(y_* \mid x_*, D) = N\left(k_*^T K^{-1} y,\; k_{**} - k_*^T K^{-1} k_*\right)$

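The conditioning formulas translate directly into code. A minimal GP-regression sketch, assuming illustrative kernel hyperparameters, noise level, and a toy sine dataset:

```python
import numpy as np

def rbf_kernel(a, b, theta1=1.0, theta2=0.5):
    # K(x, x') = theta1 * exp(-(x - x')^2 / theta2)
    return theta1 * np.exp(-((a[:, None] - b[None, :]) ** 2) / theta2)

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 25)
y = np.sin(x) + 0.1 * rng.standard_normal(25)

K = rbf_kernel(x, x) + 1e-2 * np.eye(25)     # kernel matrix plus noise term
x_star = np.array([0.0, 1.5])
k_star = rbf_kernel(x, x_star)               # k_* = K(x_n, x_*)
k_ss = rbf_kernel(x_star, x_star)            # k_** = K(x_*, x_*)

# Posterior: mean = k_*^T K^{-1} y, cov = k_** - k_*^T K^{-1} k_*
mean = k_star.T @ np.linalg.solve(K, y)
cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)
```

This is the "only matrix calculation" property mentioned later: prediction is a couple of linear solves, with no iterative optimization.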

13. Relationship between NN and GP
(Diagram of a one-hidden-layer fully-connected network with inputs $x_1, \cdots, x_{N_0}$ and hidden units $x^1_1, \cdots, x^1_{N_1}$.)
$z^1(x) = \sum_{j=1}^{N_1} W^1_j\, \phi\!\left(\sum_{k=1}^{N_0} W^0_{jk}\, x_k\right)$, with activation function $\phi$, weights $W^0 \sim N(0, \sigma_w^2 / N_0)$, $W^1 \sim N(0, \sigma_w^2 / N_1)$, and $N_1$ the number of units in the hidden layer.
In the limit $N_1 \to \infty$, $z^1$ follows a Gaussian because of the central limit theorem: a fully-connected NN is equivalent to a GP in the limit of infinite network width [Neal 1994] [Williams 1997].

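The CLT claim can be illustrated by sampling many random one-hidden-layer networks and inspecting the distribution of the output. The tanh activation, the biases, and all sizes here are my illustrative assumptions, not the slide's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z1(x, width, n_nets, sigma_w=1.0, sigma_b=0.1):
    # z^1(x) = b^1 + sum_j W^1_j phi(b^0_j + sum_k W^0_jk x_k),
    # one draw of z^1 per random network; weights ~ N(0, sigma_w^2 / fan_in)
    d_in = len(x)
    W0 = rng.normal(0.0, sigma_w / np.sqrt(d_in), (n_nets, width, d_in))
    b0 = rng.normal(0.0, sigma_b, (n_nets, width))
    h = np.tanh(W0 @ x + b0)                       # hidden post-activations
    W1 = rng.normal(0.0, sigma_w / np.sqrt(width), (n_nets, width))
    b1 = rng.normal(0.0, sigma_b, n_nets)
    return (W1 * h).sum(axis=1) + b1

x = np.array([0.5, -1.0])
z = sample_z1(x, width=256, n_nets=20_000)

# As the width grows, the CLT drives z toward a Gaussian:
# skewness ~ 0 and kurtosis ~ 3
skew = float(np.mean((z - z.mean()) ** 3) / z.std() ** 3)
kurt = float(np.mean((z - z.mean()) ** 4) / z.std() ** 4)
```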
14. Relationship between NN and GP
$z^1 \sim GP(\mu^1, K^1)$, with $W^0 \sim N(0, \sigma_w^2/N_0)$, $W^1 \sim N(0, \sigma_w^2/N_1)$:
$\mu^1(x) = E[z^1(x)] = \sum_j E[W^1_j]\, E[\phi_j(x)] = 0$
$K^1(x, x') = E[z^1(x)\, z^1(x')] - \mu^1(x)\,\mu^1(x') = \sum_j E[(W^1_j)^2]\, E[\phi_j(x)\, \phi_j(x')] = \sigma_w^2\, E[\phi(x)\, \phi(x')]$
In general it is very difficult to get an analytical formula, but closed forms are known [Cho & Saul, "Kernel Methods for Deep Learning", NIPS 2009]:
When $\phi$ is ReLU: $K^1(x, x') = \frac{\sigma_w^2}{2\pi}\, \|x\|\|x'\| \left(\sin\theta_{x,x'} + (\pi - \theta_{x,x'})\cos\theta_{x,x'}\right)$
When $\phi$ is the step function: $K^1(x, x') = \frac{\sigma_w^2}{2\pi}\, (\pi - \theta_{x,x'})$
where $\theta_{x,x'} = \cos^{-1}\left(\frac{x \cdot x'}{\|x\|\|x'\|}\right)$.

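The ReLU closed form can be checked against a Monte Carlo estimate over random weight vectors. The function below is the arccos kernel of Cho & Saul (2009), written for $w \sim N(0, I)$; the test inputs are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_arccos_kernel(x, y):
    # E[relu(w.x) relu(w.y)] for w ~ N(0, I)  [Cho & Saul 2009]
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

x = np.array([1.0, 0.5])
y = np.array([-0.2, 1.3])

# Monte Carlo estimate over many random weight vectors w ~ N(0, I)
W = rng.standard_normal((2_000_000, 2))
mc = float(np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0)))
analytic = float(relu_arccos_kernel(x, y))
```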
15. Expanding to a multi-layer NN
$z^l_i(x) = \sum_{j=1}^{N_l} W^l_{ij}\, \phi(z^{l-1}_j(x))$. In the limit $N_l \to \infty$, it is again equivalent to a GP because of the CLT, with the layer-wise recursion
$K^l(x, x') = \sigma_w^2\, E\!\left[\phi(z^{l-1}(x))\, \phi(z^{l-1}(x'))\right] = \sigma_w^2\, F_\phi\!\left(K^{l-1}(x, x'),\, K^{l-1}(x, x),\, K^{l-1}(x', x')\right)$
In general it is very difficult to get an analytical formula for $F_\phi$, but when $\phi$ is ReLU [Cho & Saul 2009]:
$K^l(x, x') = \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')} \left(\sin\theta^{l-1}_{x,x'} + (\pi - \theta^{l-1}_{x,x'})\cos\theta^{l-1}_{x,x'}\right)$
$\theta^{l-1}_{x,x'} = \cos^{-1}\left(\frac{K^{l-1}(x, x')}{\sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')}}\right)$

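A sketch of the layer-wise recursion with the ReLU closed form. The bias term $\sigma_b^2$ follows the paper's version of the recursion (the transcript's formulas omit biases), and $\sigma_w^2$, $\sigma_b^2$, the inputs, and the depth are all illustrative choices:

```python
import numpy as np

def next_kernel(K, sigma_w2=1.6, sigma_b2=0.1):
    # One step of the layer recursion using the ReLU closed form
    # [Cho & Saul 2009]: K^l = sigma_b^2 + sigma_w^2 * F(K^{l-1})
    d = np.sqrt(np.outer(np.diag(K), np.diag(K)))       # sqrt(K_xx K_x'x')
    theta = np.arccos(np.clip(K / d, -1.0, 1.0))
    F = d * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
    return sigma_b2 + sigma_w2 * F

# Input-layer kernel K^0(x, x') = sigma_w^2 * (x . x') / d_in + sigma_b^2
X = np.array([[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]])
K = 1.6 * (X @ X.T) / X.shape[1] + 0.1
for _ in range(3):                 # kernel of a 3-hidden-layer ReLU NNGP
    K = next_kernel(K)
```

Each step maps a valid kernel matrix to a valid kernel matrix, so the result stays symmetric positive semi-definite at any depth.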
16. How to find the kernel function numerically
Cost of finding the kernel function corresponding to an L-layer NN: the proposed lookup-table scheme is cheaper than the existing direct computation.
1. Take evenly spaced grids: pre-activations $u = [-u_{\max}, \cdots, u_{\max}] \in \mathbb{R}^{n_g}$; variances $s = [0, \cdots, s_{\max}] \in \mathbb{R}^{n_v}$; correlations $c = [-1, \cdots, 1] \in \mathbb{R}^{n_c}$, with $|c| < 1$.
2. On the grid, tabulate $F_{s,c} = \dfrac{\sum_{i,j} \phi(u_i)\, \phi(u_j)\, \exp\left(-\frac{1}{2} u^T \Sigma^{-1} u\right)}{\sum_{i,j} \exp\left(-\frac{1}{2} u^T \Sigma^{-1} u\right)}$, where $u = (u_i, u_j)^T$ and $\Sigma$ is the $2 \times 2$ covariance built from the variance $s$ and correlation $c$.
3. Approximate the function between grid points by bilinear interpolation into the matrix.

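Steps 1-3 can be sketched for an activation with no closed form (tanh). This is a simplified 1-D slice, with fixed unit variances and a table over the correlation only, rather than the paper's full (variance, correlation) grid with bilinear interpolation; all grid sizes are illustrative:

```python
import numpy as np

# Tabulate F(c) = E[phi(u) phi(v)] for unit-variance Gaussians (u, v)
# with correlation c, using a quadrature grid, then interpolate.
g = np.linspace(-6.0, 6.0, 201)           # pre-activation grid
w = np.exp(-g ** 2 / 2.0)
w /= w.sum()                              # normalized Gaussian weights
c_grid = np.linspace(-0.999, 0.999, 101)  # correlation grid, |c| < 1

U = g[:, None]                            # u samples
Z = g[None, :]                            # independent auxiliary samples
W2 = w[:, None] * w[None, :]
# v = c*u + sqrt(1 - c^2)*z has unit variance and correlation c with u
F_tab = np.array([
    float(np.sum(np.tanh(U) * np.tanh(c * U + np.sqrt(1.0 - c * c) * Z) * W2))
    for c in c_grid
])

def F(c):
    # table lookup between grid points (1-D linear interpolation here)
    return np.interp(c, c_grid, F_tab)
```

Once the table is built, every kernel evaluation in every layer is a cheap lookup instead of a fresh two-dimensional numerical integral.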
17. Numerical calculation of the kernel
(Figure: in the case of ReLU, the numerical result is compared to the analytical solution.)

18. Experimental details
The activation function is ReLU or tanh. The loss function is MSE. No dropout. Weight and bias initializations are tuned with the Google Vizier hyperparameter tuner.
How to classify with a regression method? → Regress to 0.9 for the right label and -0.1 for the wrong labels (so the expectation over classes is 0) [Rifkin & Klautau 2004].
Compare the results of NNs trained with SGD (Adam) against NNGPs on MNIST and CIFAR-10.

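The regression-target encoding can be written as a small helper; the function name is my own, while the 0.9 / -0.1 values follow the slide:

```python
import numpy as np

def regression_targets(labels, n_classes=10):
    # One-vs-rest targets [Rifkin & Klautau 2004]: 0.9 for the correct
    # class, -0.1 elsewhere, so each row sums to 0 for 10 classes.
    T = -0.1 * np.ones((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 0.9
    return T

T = regression_targets(np.array([3, 0]))
```

Prediction then regresses all 10 outputs jointly and takes the argmax of the posterior mean as the class label.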

21. A comparison between NN and GP
The advantages of GP:
・No optimization
・Due to its Bayesian nature, all predictions have uncertainty estimates
・Only matrix calculations
・No overfitting
The pros of increasing the number of units in a NN:
・The generalization gap (= test error - train error) becomes smaller

22. Phase transition related to the GP hyperparameters
Theory [Schoenholz et al. 2017] and experimental results: a phase transition occurs as a function of the variances $\sigma_w^2$ and $\sigma_b^2$ of the prior distributions of the weights and biases.

23. Discussion
・A disadvantage of GPs is sensitivity to outliers → a Student's t distribution can overcome it.
・Some researchers are trying to find GPs that correspond to CNNs or LSTMs [2017 Mark van der Wilk] [2017 Maruan Al-Shedivat].
・The cost of GP regression is $O(N^3)$, where $N$ is the number of training data points. → There are many ways to reduce this cost (cf. inducing-variable methods).
・There is a way to estimate the right kernel function automatically from the input data [2011 Marc Peter Deisenroth].