Slide 1

Slide 1 text

Deep Neural Networks as Gaussian Processes (ICLR 2018)
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

Slide 2

Slide 2 text

Prolusion

Background
・A fully-connected NN is equivalent to a GP in the limit of infinite network width.

Contributions
・Classification results on CIFAR-10 and MNIST with the GP were better than with the NN.
・Proposes a way to compute the kernel function of the GP numerically.

Why I read this paper
・GPs overcome some weaknesses of NNs.
・We can use GPs in PyTorch and TF, with GPU speed-up.

Slide 3

Slide 3 text

Contents

Background
・GP as extended linear regression
・Kernel trick
・The relationship between NNs and GPs

The contents of the paper
・How to calculate the kernel function numerically
・Experimental results
・Phase transition related to the hyperparameters of the GP
・Conclusion

Slide 4

Slide 4 text

Linear regression

Regression by a linear combination of basis functions.

Basis functions: $\phi(x) = (\phi_1(x), \ldots, \phi_H(x))^\top$
Weight vector: $w = (w_1, \ldots, w_H)^\top$

$y(x) = w^\top \phi(x) = \sum_{h=1}^{H} w_h \phi_h(x)$

Design matrix $\Phi \in \mathbb{R}^{N \times H}$, $\Phi_{nh} = \phi_h(x_n)$:
$\Phi = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix}$

Estimate $w$ from the training data set $D = \{(x_n, y_n) \mid n = 1, \ldots, N\}$ by least squares: $\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$.
The predicted value at any test point $x^*$ is $y(x^*) = \hat{w}^\top \phi(x^*)$.
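A minimal NumPy sketch of this slide, assuming a 1-D input, a polynomial basis, and noise-free least squares (all names and values here are illustrative):

```python
import numpy as np

def design_matrix(x, basis_funcs):
    """Phi[n, h] = phi_h(x_n) for a list of basis functions."""
    return np.stack([phi(x) for phi in basis_funcs], axis=1)

# Illustrative basis: 1, x, x^2
basis_funcs = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

# Toy training data
x_train = np.linspace(-1, 1, 20)
y_train = 0.5 + 2.0 * x_train - 1.5 * x_train**2

Phi = design_matrix(x_train, basis_funcs)
# Least-squares estimate w_hat = (Phi^T Phi)^{-1} Phi^T y (via lstsq for stability)
w_hat, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Prediction at test points: y* = w_hat^T phi(x*)
x_test = np.linspace(-1, 1, 5)
y_pred = design_matrix(x_test, basis_funcs) @ w_hat
print(w_hat, y_pred)
```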

Slide 5

Slide 5 text

Weakness of linear regression

$\phi(x) = (1, x, x^2)$?
$\phi(x) = (1, x, x^2, x^3, x^4)$?
$\phi(x) = (1, x, \sin x)$?
$\phi(x) = ?$

We have to find proper basis functions manually.

Slide 6

Slide 6 text

Radial basis function regression

Use shifted Gaussians as basis functions; the Gaussian is expressive!

$\phi_h(x) = \exp\left(-\frac{(x - c_h)^2}{2\sigma^2}\right)$

Basis functions: $\phi(x) = (\phi_{-H}(x), \phi_{-H+1}(x), \ldots, \phi_{H-1}(x), \phi_H(x))^\top$
Weight vector: $w = (w_{-H}, \ldots, w_H)^\top \in \mathbb{R}^{2H+1}$

$y(x) = w^\top \phi(x)$
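A hedged NumPy sketch of RBF regression with $2H+1$ shifted Gaussian bases (the centers, width, and toy target are arbitrary choices for illustration):

```python
import numpy as np

def rbf_basis(x, centers, sigma=1.0):
    """Phi[n, h] = exp(-(x_n - c_h)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

H = 5
centers = np.linspace(-1, 1, 2 * H + 1)   # 2H+1 shifted Gaussian centers

x_train = np.random.uniform(-1, 1, 30)
y_train = np.sin(3 * x_train)             # toy target function

Phi = rbf_basis(x_train, centers)
w_hat, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

x_test = np.linspace(-1, 1, 7)
y_pred = rbf_basis(x_test, centers) @ w_hat
```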

Slide 7

Slide 7 text

Radial basis function regression

Number of basis functions: 10 per dimension ($\sigma = 1.0$). RBF regression is expressive.

But in an $n$-dimensional problem, $10^n$ RBFs are needed as basis functions, so a $10^n$-dimensional weight vector has to be estimated: the curse of dimensionality.

Slide 8

Slide 8 text

Derivation of the Gaussian process

The number of parameters that must be estimated increases in high dimensions.

RBFR → GPR: avoid the curse of dimensionality by integrating with respect to $w$.

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_H(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_N) & \cdots & \phi_H(x_N) \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_H \end{pmatrix} = \Phi w$

We introduce a prior distribution $w \sim \mathcal{N}(0, \lambda^2 I)$.

In RBFR, the output for any input set is $y = \Phi w$, so
$\mathbb{E}[y] = \Phi\,\mathbb{E}[w] = 0$, $\operatorname{Cov}[y] = \mathbb{E}[y y^\top] = \Phi\,\mathbb{E}[w w^\top]\,\Phi^\top = \lambda^2 \Phi \Phi^\top$ (use it later).

Slide 9

Slide 9 text

Derivation of the Gaussian process

GPR: avoid the curse of dimensionality by integrating with respect to $w$.

$y = \Phi w$, $w \sim \mathcal{N}(0, \lambda^2 I)$. Because $w$ follows a Gaussian and $y$ is a linear transformation of $w$, $y$ also follows a Gaussian:
$\mathbb{E}[y] = \Phi\,\mathbb{E}[w] = 0$
$\Sigma = \mathbb{E}[y y^\top] - \mathbb{E}[y]\,\mathbb{E}[y]^\top = \Phi\,\mathbb{E}[w w^\top]\,\Phi^\top = \lambda^2 \Phi \Phi^\top$
$\Rightarrow y \sim \mathcal{N}(0, \lambda^2 \Phi \Phi^\top)$

We don't have to know the weight vector $w$ even if its dimension is high.

Definition (Gaussian process): for any input set $x_1, \ldots, x_N$, the joint distribution of the outputs $y = (y_1, \ldots, y_N)$ follows a Gaussian. Then the relation between $x$ and $y$ follows a Gaussian process.
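A quick NumPy check of this step (purely illustrative, with a random stand-in design matrix): sample many weight vectors from the prior and compare the empirical covariance of $y = \Phi w$ with $\lambda^2 \Phi \Phi^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.7
N, H = 5, 50                       # 5 inputs, 50 basis functions

Phi = rng.normal(size=(N, H))      # stand-in design matrix Phi[n, h] = phi_h(x_n)

# Sample w ~ N(0, lambda^2 I) many times and map through y = Phi w
W = lam * rng.normal(size=(H, 100000))
Y = Phi @ W                        # each column is one draw of y

emp_cov = np.cov(Y)                # empirical covariance over the draws
theory = lam**2 * Phi @ Phi.T      # lambda^2 Phi Phi^T
print(np.max(np.abs(emp_cov - theory)))   # small -> y ~ N(0, lambda^2 Phi Phi^T)
```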

Slide 10

Slide 10 text

Kernel trick

$y \sim \mathcal{N}(0, \lambda^2 \Phi \Phi^\top) \equiv \mathcal{N}(0, K)$, $K$: kernel matrix
$K_{nn'} = \lambda^2\, \phi(x_n)^\top \phi(x_{n'}) \equiv k(x_n, x_{n'})$: the inner product of the basis functions (how much they look similar)

Basis functions: $\phi(x) = (\phi_{-H}(x), \ldots, \phi_H(x))^\top$

Example: RBF kernel. Put RBFs with centers $c_h = h/H$ placed at even intervals $1/H$ over the input range:
$\phi_h(x) = \exp\left(-\frac{(x - h/H)^2}{2\sigma^2}\right)$
$k(x, x') = \lambda^2\, \phi(x)^\top \phi(x') = \lambda^2 \sum_{h=-H}^{H} \phi_h(x)\,\phi_h(x')$

Taking the limit $H \to \infty$, the sum becomes an integral over the centers:
$\lim_{H \to \infty} k(x, x') \propto \int_{-\infty}^{\infty} \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right) \exp\left(-\frac{(x' - c)^2}{2\sigma^2}\right) dc = \theta_1 \exp\left(-\frac{(x - x')^2}{2\theta_2}\right)$,
with $\theta_2 = 2\sigma^2$ and $\theta_1$ a constant absorbing $\lambda$ and $\sigma$.
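A small NumPy illustration of the limit above (the grid, $\sigma$, and test points are assumed values): the inner product of many evenly spaced Gaussian bases, scaled by the spacing, approaches an RBF kernel with $\theta_2 = 2\sigma^2$.

```python
import numpy as np

sigma, lam = 0.3, 1.0
H = 2000
centers = np.linspace(-5, 5, 2 * H + 1)       # dense, evenly spaced centers
spacing = centers[1] - centers[0]

def phi(x):
    """Vector of Gaussian basis functions evaluated at x."""
    return np.exp(-(x - centers)**2 / (2 * sigma**2))

x, xp = 0.4, 0.9
# Finite-sum kernel (scaled by the spacing so it approximates the integral)
k_sum = lam**2 * spacing * phi(x) @ phi(xp)
# Analytic limit of the integral: sqrt(pi) * sigma * exp(-(x - x')^2 / (4 sigma^2))
k_int = lam**2 * np.sqrt(np.pi) * sigma * np.exp(-(x - xp)**2 / (4 * sigma**2))
print(k_sum, k_int)    # close: the sum over RBF bases behaves like an RBF kernel
```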

Slide 11

Slide 11 text

How to do regression with a GP?

$y \sim \mathcal{N}(0, K)$, $K_{nn'} \equiv k(x_n, x_{n'}) = \theta_1 \exp\left(-\frac{(x_n - x_{n'})^2}{2\theta_2}\right)$

How do we predict the unknown value $y^*$ corresponding to a new input $x^*$?

The joint distribution is still Gaussian:
$\begin{pmatrix} y \\ y^* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K & k_* \\ k_*^\top & k_{**} \end{pmatrix}\right)$, where $k_* = (k(x^*, x_1), \ldots, k(x^*, x_N))^\top$ and $k_{**} = k(x^*, x^*)$.

Conditioning on the training data $D$:
$p(y^* \mid x^*, D) = \mathcal{N}\left(k_*^\top K^{-1} y,\; k_{**} - k_*^\top K^{-1} k_*\right)$
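A compact NumPy sketch of these prediction equations (RBF kernel, illustrative hyperparameters and toy data; a small jitter is added to $K$ for numerical stability, which the slide's formulas omit):

```python
import numpy as np

def rbf_kernel(xa, xb, theta1=1.0, theta2=0.1):
    """k(x, x') = theta1 * exp(-(x - x')^2 / (2 * theta2))."""
    return theta1 * np.exp(-(xa[:, None] - xb[None, :])**2 / (2 * theta2))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 15)
y_train = np.sin(x_train) + 0.05 * rng.normal(size=15)
x_test = np.linspace(-3, 3, 50)

K = rbf_kernel(x_train, x_train) + 1e-6 * np.eye(len(x_train))   # jitter
k_star = rbf_kernel(x_train, x_test)          # shape (N, N*)
k_ss = rbf_kernel(x_test, x_test)

K_inv_y = np.linalg.solve(K, y_train)
mean = k_star.T @ K_inv_y                                  # k*^T K^{-1} y
cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)         # k** - k*^T K^{-1} k*
std = np.sqrt(np.clip(np.diag(cov), 0, None))              # predictive uncertainty
```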

Slide 12

Slide 12 text

The result

Slide 13

Slide 13 text

Relationship between NN and GP

[Diagram: fully-connected network with inputs $x_k$, one hidden layer $x_j^1$, and outputs $z_i^1$]

$x = (x_1, \ldots, x_{d_{in}})$
$x_j^1(x) = \phi\left(b_j^0 + \sum_{k=1}^{d_{in}} W_{jk}^0 x_k\right)$, with $\phi$ the activation function
$z_i^1(x) = b_i^1 + \sum_{j=1}^{N_1} W_{ij}^1 x_j^1(x)$

Priors: $W^0 \sim \mathcal{N}(0, \sigma_w^2/d_{in})$, $W^1 \sim \mathcal{N}(0, \sigma_w^2/N_1)$, $b \sim \mathcal{N}(0, \sigma_b^2)$
($N_1$: number of units in the hidden layer)

In the limit $N_1 \to \infty$, $z_i^1$ follows a Gaussian because of the central limit theorem.
→ A fully-connected NN is equivalent to a GP in the limit of infinite network width [Neal 1994][Williams 1997].
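A hedged NumPy experiment illustrating this claim: draw many random wide one-hidden-layer networks and check that the output at a fixed input is approximately Gaussian (the width, variances, and the tanh activation are arbitrary choices here, not values from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, N1 = 5, 1000             # wide hidden layer
sigma_w, sigma_b = 1.0, 0.1
x = rng.normal(size=d_in)      # one fixed input
n_nets = 2000                  # number of random networks to sample

# Sample parameters for all networks at once: W^0 ~ N(0, sigma_w^2/d_in), etc.
W0 = rng.normal(0, sigma_w / np.sqrt(d_in), size=(n_nets, N1, d_in))
b0 = rng.normal(0, sigma_b, size=(n_nets, N1))
W1 = rng.normal(0, sigma_w / np.sqrt(N1), size=(n_nets, N1))
b1 = rng.normal(0, sigma_b, size=(n_nets,))

h = np.tanh(W0 @ x + b0)                  # hidden activations x^1(x)
z = np.einsum('nj,nj->n', W1, h) + b1     # outputs z^1(x), one per network

# As N1 grows, the histogram of z approaches a Gaussian (central limit theorem)
print(z.mean(), z.std())
```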

Slide 14

Slide 14 text

Relationship between NN and GP

$z^1 \sim \mathrm{GP}(\mu^1, K^1)$, with priors $W^0 \sim \mathcal{N}(0, \sigma_w^2/d_{in})$, $W^1 \sim \mathcal{N}(0, \sigma_w^2/N_1)$.

Mean: $\mu^1(x) = \mathbb{E}\left[z_i^1(x)\right] = \sum_j \mathbb{E}\left[W_{ij}^1\right]\,\mathbb{E}\left[x_j^1(x)\right] = 0$

Covariance (kernel):
$K^1(x, x') = \mathbb{E}\left[z_i^1(x)\, z_i^1(x')\right] - \mathbb{E}\left[z_i^1(x)\right]\mathbb{E}\left[z_i^1(x')\right] = \mathbb{E}\left[\sum_{j=1}^{N_1}\sum_{j'=1}^{N_1} W_{ij}^1 W_{ij'}^1\, x_j^1(x)\, x_{j'}^1(x')\right]$

For a general activation $\phi$ it is very difficult to get an analytical formula, but closed forms are known for some activations [Cho & Saul, "Kernel Methods for Deep Learning", NIPS 2009]:

When $\phi$ is ReLU:
$K^1(x, x') = \frac{\sigma_w^2}{2\pi}\, \lVert x \rVert\, \lVert x' \rVert\, \left(\sin\theta + (\pi - \theta)\cos\theta\right)$

When $\phi$ is the step function:
$K^1(x, x') = \frac{\sigma_w^2}{2\pi}\,(\pi - \theta)$

where $\theta = \cos^{-1}\left(\frac{x \cdot x'}{\lVert x \rVert\, \lVert x' \rVert}\right)$.
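A small NumPy check (illustrative only) comparing the ReLU arc-cosine formula above with a Monte Carlo estimate of $\mathbb{E}\left[\phi(w^\top x)\,\phi(w^\top x')\right]$ for $w \sim \mathcal{N}(0, I)$; the $\sigma_w^2$ factor is dropped for simplicity and the inputs are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_arccos_kernel(x, xp):
    """(1/(2*pi)) * |x| |x'| * (sin(theta) + (pi - theta) * cos(theta))."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    theta = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
    return nx * nxp * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

x  = np.array([1.0, 0.5, -0.3])
xp = np.array([0.2, -1.0, 0.8])

# Monte Carlo estimate of E[ relu(w.x) * relu(w.x') ] with w ~ N(0, I)
W = rng.normal(size=(500000, 3))
mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ xp, 0))

print(relu_arccos_kernel(x, xp), mc)   # these should agree closely
```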

Slide 15

Slide 15 text

Expand to a multi-layer NN

$z_i^l(x) = b_i^l + \sum_{j=1}^{N_l} W_{ij}^l\, x_j^l(x)$, where $x_j^l(x) = \phi\left(z_j^{l-1}(x)\right)$

In the limit $N_l \to \infty$, each layer is again equivalent to a GP because of the CLT, and the kernel is defined recursively:

$K^l(x, x') = \mathbb{E}\left[z_i^l(x)\, z_i^l(x')\right] = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{z^{l-1} \sim \mathrm{GP}(0, K^{l-1})}\left[\phi\left(z^{l-1}(x)\right) \phi\left(z^{l-1}(x')\right)\right] = \sigma_b^2 + \sigma_w^2\, F_\phi\left(K^{l-1}(x, x'),\, K^{l-1}(x, x),\, K^{l-1}(x', x')\right)$

For a general $\phi$ it is very difficult to get an analytical formula. When $\phi$ is ReLU [Cho & Saul 2009]:

$K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi}\, \sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')}\, \left(\sin\theta_{x,x'}^{l-1} + \left(\pi - \theta_{x,x'}^{l-1}\right)\cos\theta_{x,x'}^{l-1}\right)$

$\theta_{x,x'}^{l-1} = \cos^{-1}\left(\frac{K^{l-1}(x, x')}{\sqrt{K^{l-1}(x, x)\, K^{l-1}(x', x')}}\right)$
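A hedged NumPy sketch of this recursion for the ReLU case. The base kernel $K^0(x, x') = \sigma_b^2 + \sigma_w^2\, x \cdot x' / d_{in}$ and the hyperparameter values are assumptions following the paper's general setup, not quantities given on this slide.

```python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=1.6, sigma_b2=0.1):
    """Recursive NNGP kernel for ReLU activations.

    X: (N, d_in) array of inputs. Returns the (N, N) kernel after `depth` layers.
    """
    d_in = X.shape[1]
    # Base case (assumed): K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d_in
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        outer = np.outer(diag, diag)
        cos_theta = np.clip(K / outer, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # K^l = sigma_b^2 + (sigma_w^2 / (2*pi)) * sqrt(K^{l-1}(x,x) K^{l-1}(x',x'))
        #       * (sin(theta) + (pi - theta) * cos(theta))
        K = sigma_b2 + (sigma_w2 / (2 * np.pi)) * outer * (
            np.sin(theta) + (np.pi - theta) * cos_theta)
    return K

X = np.random.default_rng(0).normal(size=(8, 20))
K3 = nngp_relu_kernel(X, depth=3)
```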

Slide 16

Slide 16 text

How to find the kernel function numerically

Cost of finding the kernel function corresponding to an L-layer NN: the existing (exact) method is compared with the proposed lookup-table method.

1. Build grids, each with elements placed at even intervals: pre-activations $u = [-u_{max}, \ldots, u_{max}] \in \mathbb{R}^{n_g}$, variances $s = [0, \ldots, s_{max}] \in \mathbb{R}^{n_v}$, and correlations $c = (-1, \ldots, 1) \in \mathbb{R}^{n_c}$.

2. On the grid, approximate the Gaussian expectation $F_\phi$ by a weighted sum:
$F_\phi \approx \frac{\sum_{i,j} \phi(u_i)\,\phi(u_j)\, \exp\left(-\tfrac{1}{2} u_{ij}^\top \Sigma^{-1} u_{ij}\right)}{\sum_{i,j} \exp\left(-\tfrac{1}{2} u_{ij}^\top \Sigma^{-1} u_{ij}\right)}$, where $u_{ij} = (u_i, u_j)^\top$ and $\Sigma$ is the $2 \times 2$ covariance built from the grid values of $s$ and $c$.

3. Approximate the kernel recursion by bilinear interpolation into the pre-computed matrix.
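A simplified NumPy sketch of step 2, under the assumption that the bivariate Gaussian has equal marginal variance $s$ and correlation $c$; the grid sizes and the comparison against the ReLU closed form are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def F_phi_grid(s, c, phi=lambda u: np.maximum(u, 0), u_max=6.0, n_g=201):
    """Numerically approximate F_phi(s, c) = E[phi(z) phi(z')], where (z, z')
    is bivariate Gaussian with variance s and correlation c, on a u-grid."""
    u = np.linspace(-u_max, u_max, n_g)                # pre-activation grid
    U1, U2 = np.meshgrid(u, u, indexing="ij")
    # Quadratic form of the bivariate Gaussian with covariance [[s, c*s], [c*s, s]]
    quad = (U1**2 - 2 * c * U1 * U2 + U2**2) / (s * (1 - c**2))
    w = np.exp(-0.5 * quad)                            # unnormalized density weights
    return np.sum(phi(U1) * phi(U2) * w) / np.sum(w)

# Example: compare with the analytic ReLU result for s = 1, c = 0.5
s, c = 1.0, 0.5
theta = np.arccos(c)
analytic = s * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
print(F_phi_grid(s, c), analytic)
```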

Slide 17

Slide 17 text

Numerical calculation of the kernel in the ReLU case, compared to the analytical solution.

Slide 18

Slide 18 text

Experimental details

・Activation function: ReLU or tanh. Loss function: MSE. No dropout.
・The Google Vizier hyperparameter tuner is used to choose the weight and bias initialization.
・How to classify with a regression method? → Regression targets are 0.9 for the correct label and −0.1 for the wrong labels (so the expectation is 0) [Rifkin & Klautau 2004].
・Compare the results of SGD-trained NNs (Adam) and NNGPs on MNIST and CIFAR-10.
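A tiny illustration of the classification-as-regression target encoding described above (the class count and labels are made up):

```python
import numpy as np

def regression_targets(labels, num_classes):
    """0.9 for the correct class, -0.1 elsewhere (rows sum to zero for 10 classes)."""
    T = np.full((len(labels), num_classes), -0.1)
    T[np.arange(len(labels)), labels] = 0.9
    return T

print(regression_targets(np.array([2, 0, 1]), num_classes=10))
```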

Slide 19

Slide 19 text

Experimental Result

Slide 20

Slide 20 text

Experimental Result

Slide 21

Slide 21 text

A comparison between NN and GP

Advantages of the GP
・No optimization needed.
・Due to its Bayesian nature, all predictions have uncertainty estimates.
・Only matrix computations are required.
・No overfitting.

The benefit of increasing the number of units in the NN
・The generalization gap (= test error − train error) becomes smaller.

Slide 22

Slide 22 text

Phase transition related to the hyperparameters of the GP

Theory [Schoenholz et al. 2017] vs. experimental result: a phase transition occurs depending on the prior variances of the weights and biases, $\sigma_w^2$ and $\sigma_b^2$.

Slide 23

Slide 23 text

Discussion

・A disadvantage of GPs is sensitivity to outliers → a Student-t likelihood can overcome it.
・Some researchers are trying to find GPs that correspond to CNNs or LSTMs [van der Wilk et al. 2017][Al-Shedivat et al. 2017].
・The cost of GP regression is $O(N^3)$, where N is the number of training data points. → There are many ways to reduce the cost (cf. inducing-variable / sparse GP methods).
・There is a way to estimate the correct kernel function automatically from the input data [Deisenroth 2011].