
Deep Neural Networks as Gaussian Processes

Presentation given at a paper-reading seminar.
https://arxiv.org/abs/1711.00165

Kazu Ghalamkari

May 04, 2020

Transcript

  1. Deep Neural Networks as Gaussian Processes ICLR 2018 Jaehoon Lee,

    Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
  2. Prolusion

    Background: ・A fully-connected NN is equivalent to a GP in the limit of infinite network width. Contributions: ・The classification results on CIFAR-10 and MNIST obtained with the GP were better than those of the NN. ・A way to find the kernel function of the GP numerically is proposed. The reasons why I read this paper: ・GPs overcome some of the NN's weaknesses. ・GPs can be used in PyTorch and TF, with speed-up by GPU.
  3. Contents

    Background: ・GP as extended linear regression ・The kernel trick ・The relationship between NNs and GPs. The contents of the paper: ・The way to calculate the kernel function numerically ・Experimental results ・Phase transition related to the hyperparameters of the GP ・Conclusion
  4. Linear regression

    Regression by a linear combination of basis functions. Basis functions φ(x) = (φ_1(x), …, φ_H(x))^T, weight vector w = (w_1, …, w_H)^T, and the model y(x) = w^T φ(x) = Σ_{h=1}^{H} w_h φ_h(x). Design matrix Φ ∈ ℝ^{N×H} with Φ_{nh} = φ_h(x_n), built from the training data set D = {(x_n, y_n) | n = 1, …, N}. Estimate w from the training data as w = (Φ^T Φ)^{-1} Φ^T y; the predicted value at any test point x* is then y* = w^T φ(x*).
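A minimal numpy sketch of this fit via the normal equations (the polynomial basis and toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def design_matrix(x, basis_fns):
    """Phi[n, h] = phi_h(x_n)."""
    return np.stack([f(x) for f in basis_fns], axis=1)

# Basis functions phi(x) = (1, x, x^2)
basis_fns = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + 0.1 * rng.standard_normal(20)

Phi = design_matrix(x_train, basis_fns)                # N x H
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)      # w = (Phi^T Phi)^{-1} Phi^T y

x_test = np.array([0.5])
y_pred = design_matrix(x_test, basis_fns) @ w          # y* = w^T phi(x*)
print(y_pred)
```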
  5. Weakness of linear regression

    φ(x) = (1, x, x²)?  φ(x) = (1, x, x², x³, x⁴)?  φ(x) = (1, x, sin x)?  We have to find proper basis functions manually.
  6. Radial basis function regression

    Use shifted Gaussians as basis functions: φ_h(x) = exp(−(x − μ_h)²/r²). The basis is φ(x) = (φ_{−H}(x), φ_{−H+1}(x), …, φ_{H−1}(x), φ_H(x))^T with weight vector w = (w_{−H}, …, w_H)^T ∈ ℝ^{2H+1}. Gaussians are expressive!
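A minimal sketch of the same normal-equation fit with shifted-Gaussian (RBF) basis functions; the centers, width r, small ridge term, and toy target are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 0.2
centers = np.linspace(-1, 1, 21)                                 # mu_h at even intervals
x_train = np.linspace(-1, 1, 60)
y_train = np.sin(6 * x_train) + 0.05 * rng.standard_normal(60)   # hard for low-order polynomials

Phi = np.exp(-(x_train[:, None] - centers[None, :])**2 / r**2)   # Phi[n, h] = phi_h(x_n)
# small ridge term only for numerical stability of the solve
w = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(len(centers)), Phi.T @ y_train)

phi_test = np.exp(-(0.3 - centers)**2 / r**2)
print(phi_test @ w)                                              # prediction near sin(1.8)
```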
  7. Radial basis function regression

    [Figure: a fit using 10 RBF basis functions.] RBFs are expressive, but in an n-dimensional problem 10^n RBFs are needed as a basis, so a 10^n-dimensional weight vector has to be estimated: the curse of dimensionality.
  8. Derivation of Gaussian process

    The number of parameters to be estimated grows quickly in high dimensions. RBFR → GPR: avoid the curse of dimensionality by integrating with respect to w. In RBFR, the outputs for any inputs are y = (y_1, …, y_N)^T = Φ w with Φ_{nh} = φ_h(x_n). We introduce the prior distribution w ~ N(0, λ² I). Then E[y] = Φ E[w] = 0 and Cov[y] = E[y y^T] − E[y] E[y]^T = Φ E[w w^T] Φ^T = λ² Φ Φ^T (used later).
  9. Derivation of Gaussian process

    GPR: y = Φ w with w ~ N(0, λ² I). Because w follows a Gaussian, y also follows a Gaussian: E[y] = Φ E[w] = 0, Cov[y] = E[y y^T] − E[y] E[y]^T = Φ E[w w^T] Φ^T = λ² Φ Φ^T, so y ~ N(0, λ² Φ Φ^T). We do not have to know the weight vector even if the dimension of φ is high: the curse of dimensionality is avoided by integrating with respect to w. Definition: if, for any input set X = (x_1, …, x_N), the outputs y = (y_1, …, y_N) jointly follow a Gaussian distribution, then the relation between X and y follows a Gaussian process.
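A minimal sketch of the point being made here: sampling w and forming Φw gives the same prior over outputs as sampling y directly from N(0, λ²ΦΦ^T), so the weights never need to be represented (the RBF basis, grid, and jitter term are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, r = 1.0, 0.3
x = np.linspace(-1, 1, 50)                        # inputs x_1..x_N
centers = np.linspace(-1, 1, 21)                  # RBF centers mu_h
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / r**2)       # Phi[n, h] = phi_h(x_n)

# (a) sample weights, then y = Phi w
w = lam * rng.standard_normal(len(centers))
y_from_w = Phi @ w

# (b) sample y directly from N(0, lambda^2 Phi Phi^T), never touching w
K = lam**2 * Phi @ Phi.T + 1e-9 * np.eye(len(x))  # jitter only for stable sampling
y_direct = rng.multivariate_normal(np.zeros(len(x)), K)

print(y_from_w.shape, y_direct.shape)             # both are draws from the same prior
```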
  10. Kernel trick

    y ~ N(0, λ² Φ Φ^T) ≡ N(0, K), where the kernel K_{nm} = λ² φ(x_n)^T φ(x_m) ≡ k(x_n, x_m) is the inner product of the basis functions (it measures how similar two inputs look). Example: RBF kernel. Put RBFs whose centers h/H are placed at even intervals 1/H over the range x ∈ [−c, c]: φ_h(x) = exp(−(x − h/H)²/r²). Then k(x, x') = λ² Σ_h φ_h(x) φ_h(x'), and in the limit H → ∞, c → ∞: k(x, x') = λ² ∫ exp(−(x − h)²/r²) exp(−(x' − h)²/r²) dh = θ_1 exp(−(x − x')²/θ_2), with θ_1 = λ² √(π r²/2) and θ_2 = 2 r².
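A minimal numerical check of this limit: a dense sum of RBF basis products (scaled as a Riemann sum) approaches θ_1 exp(−(x − x')²/θ_2); the grid spacing and the 1/H scaling are assumptions of this sketch:

```python
import numpy as np

lam, r = 1.0, 0.5
H, c = 200, 10.0                                   # many centers over a wide range
centers = np.arange(-c * H, c * H + 1) / H         # centers h/H at intervals 1/H

def k_sum(x, xp):
    phi_x = np.exp(-(x - centers)**2 / r**2)
    phi_xp = np.exp(-(xp - centers)**2 / r**2)
    return lam**2 * np.sum(phi_x * phi_xp) / H     # Riemann sum approximating the integral

def k_closed(x, xp):
    theta1 = lam**2 * np.sqrt(np.pi * r**2 / 2)
    theta2 = 2 * r**2
    return theta1 * np.exp(-(x - xp)**2 / theta2)

print(k_sum(0.3, -0.7), k_closed(0.3, -0.7))       # the two values should nearly match
```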
  11. How to do regression with a GP?

    y ~ N(0, K), K_{nm} ≡ k(x_n, x_m) = θ_1 exp(−(x_n − x_m)²/θ_2). How do we predict an unknown value y* corresponding to a new input x*? The joint distribution is
    (y, y*) ~ N(0, [[K, k_*], [k_*^T, k_**]]), with k_* = (k(x*, x_1), …, k(x*, x_N))^T and k_** = k(x*, x*).
    Conditioning on the training data D gives p(y* | x*, D) = N(k_*^T K^{-1} y, k_** − k_*^T K^{-1} k_*).
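A minimal sketch of GP prediction with these formulas (the kernel hyperparameters, toy data, and jitter term are illustrative assumptions):

```python
import numpy as np

def kernel(a, b, theta1=1.0, theta2=0.5):
    return theta1 * np.exp(-(a[:, None] - b[None, :])**2 / theta2)

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 15)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(15)
x_test = np.array([0.0, 1.5])

K = kernel(x_train, x_train) + 1e-6 * np.eye(len(x_train))     # jitter for stability
k_star = kernel(x_train, x_test)                               # N x M
k_ss = kernel(x_test, x_test)

alpha = np.linalg.solve(K, y_train)
mean = k_star.T @ alpha                                        # k_*^T K^{-1} y
cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)             # k_** - k_*^T K^{-1} k_*
print(mean, np.diag(cov))
```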
  12. Relationship between NN and GP

    [Figure: a fully-connected network with input x, one hidden layer, and output z^1.] Input x = (x_1, …, x_{N_0}), output z^1(x) = Σ_{j=1}^{N_1} W_j^1 x_j^1(x) = Σ_{j=1}^{N_1} W_j^1 φ(Σ_{k=1}^{N_0} W_{jk}^0 x_k), where φ is the activation function and the weights are drawn as W^0 ~ N(0, σ_w²/N_0), W^1 ~ N(0, σ_w²/N_1). In the limit N_1 → ∞ (N_1: number of units in the hidden layer), z^1 follows a Gaussian because of the central limit theorem: a fully-connected NN is equivalent to a GP in the limit of infinite network width [Neal 1994] [Williams 1997].
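A minimal sketch of the central-limit-theorem argument: draw many random one-hidden-layer networks with the variance scaling above and watch the output distribution for a fixed input settle down as the width grows (the widths, ReLU activation, and input are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)                   # one fixed input, N_0 = 20
sigma_w = 1.5

def sample_output(n_hidden, n_draws=2000):
    W0 = rng.normal(0, sigma_w / np.sqrt(len(x)), size=(n_draws, n_hidden, len(x)))
    W1 = rng.normal(0, sigma_w / np.sqrt(n_hidden), size=(n_draws, n_hidden))
    hidden = np.maximum(W0 @ x, 0.0)          # ReLU hidden units, shape (n_draws, n_hidden)
    return np.einsum("dh,dh->d", W1, hidden)  # z^1(x) for each random network

for width in (1, 10, 200):
    z = sample_output(width)
    print(width, z.mean(), z.std())           # mean/std of z^1(x) stabilize as width grows
```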
  13. Relationship between NN and GP

    z^1 ~ GP(μ^1, K^1) with W^0 ~ N(0, σ_w²/N_0), W^1 ~ N(0, σ_w²/N_1). Mean: μ^1(x) = E[z^1(x)] = 0. Covariance: K^1(x, x') = E[z^1(x) z^1(x')] − E[z^1(x)] E[z^1(x')] = E[z^1(x) z^1(x')] = σ_w² E[ φ(Σ_{k=1}^{N_0} W_{jk}^0 x_k) φ(Σ_{k=1}^{N_0} W_{jk}^0 x'_k) ]. In general it is very difficult to get an analytical formula, but some activations are known [Cho & Saul 2009, "Kernel Methods for Deep Learning", NIPS]:
    When φ is ReLU: K^1(x, x') = (σ_w²/2π) ‖x‖ ‖x'‖ (sin θ + (π − θ) cos θ).
    When φ is the step function: K^1(x, x') = (σ_w²/2π) (π − θ).
    Here θ = cos⁻¹( x·x' / (‖x‖ ‖x'‖) ).
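A minimal sketch of the ReLU case of this kernel, following the formula on the slide (the value of σ_w and the absence of bias terms are assumptions of the sketch):

```python
import numpy as np

def relu_kernel_1layer(x, xp, sigma_w=1.0):
    """K^1(x, x') for a ReLU hidden layer (arc-cosine, degree-1 form)."""
    norm_x, norm_xp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_t = np.clip(x @ xp / (norm_x * norm_xp), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return (sigma_w**2 / (2 * np.pi)) * norm_x * norm_xp * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta))

x = np.array([1.0, 0.5, -0.2])
xp = np.array([0.3, -0.1, 0.8])
print(relu_kernel_1layer(x, xp))
```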
  14. Expanding to a multi-layer NN

    z_i^l(x) = Σ_{j=1}^{N_l} W_{ij}^l x_j^l(x), with x_j^l(x) = φ(z_j^{l−1}(x)). In the limit N_l → ∞, it is again equivalent to a GP because of the CLT, with
    K^l(x, x') = E[z_i^l(x) z_i^l(x')] = σ_w² E_{z^{l−1} ~ GP(0, K^{l−1})}[ φ(z^{l−1}(x)) φ(z^{l−1}(x')) ] = σ_w² F_φ( K^{l−1}(x, x'), K^{l−1}(x, x), K^{l−1}(x', x') ).
    In general it is very difficult to get an analytical formula, but when φ is ReLU [Cho & Saul 2009]:
    K^l(x, x') = (σ_w²/2π) √( K^{l−1}(x, x) K^{l−1}(x', x') ) ( sin θ^{l−1}_{x,x'} + (π − θ^{l−1}_{x,x'}) cos θ^{l−1}_{x,x'} ),
    θ^l_{x,x'} = cos⁻¹( K^l(x, x') / √( K^l(x, x) K^l(x', x') ) ).
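A minimal sketch of this layer-wise recursion for ReLU (no bias terms; the base kernel K^0 = σ_w² x·x'/d_in and the value of σ_w are assumptions of this sketch):

```python
import numpy as np

def nngp_relu_kernel(X, n_layers=3, sigma_w=1.6):
    """X: (N, d) inputs. Returns the L-layer NNGP kernel matrix (N, N)."""
    d_in = X.shape[1]
    K = sigma_w**2 * (X @ X.T) / d_in                      # K^0
    for _ in range(n_layers):
        diag = np.sqrt(np.diag(K))
        cos_t = np.clip(K / np.outer(diag, diag), -1.0, 1.0)
        theta = np.arccos(cos_t)
        K = (sigma_w**2 / (2 * np.pi)) * np.outer(diag, diag) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))   # K^l from K^{l-1}
    return K

X = np.random.default_rng(0).standard_normal((5, 10))
print(nngp_relu_kernel(X).shape)
```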
  15. How to find the kernel function numerically

    Cost of finding the kernel function corresponding to an L-layer NN: existing methods pay the cost of the numerical integration for every pair of inputs at every layer, whereas the proposed method builds a lookup table once and then only interpolates at each layer.
    1. Generate grids: pre-activations u = [−u_max, …, u_max] ∈ ℝ^{n_g}, variances s = [0, …, s_max] ∈ ℝ^{n_v} with s_max < u_max², and correlations c = [−1, …, 1] ∈ ℝ^{n_c}, each with elements placed at even intervals.
    2. Fill the lookup table with a Gaussian-weighted average,
       F = Σ_{a,b} exp(−½ u_{ab}^T Σ^{-1} u_{ab}) φ(u_a) φ(u_b) / Σ_{a,b} exp(−½ u_{ab}^T Σ^{-1} u_{ab}),
       where u_{ab} = (u_a, u_b)^T and Σ is the 2×2 covariance built from (s_i, c_j).
    3. Approximate the function F_φ at each layer by bilinear interpolation into this matrix.
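A minimal sketch of a single lookup-table entry in the spirit of step 2: estimate E[φ(z) φ(z')] for a bivariate Gaussian with variance s and correlation c by a Gaussian-weighted sum over a pre-activation grid (grid size, ranges, and the ReLU default are assumptions; this is not the paper's exact implementation):

```python
import numpy as np

def lookup_entry(s, corr, phi=lambda u: np.maximum(u, 0.0), u_max=10.0, n_g=201):
    u = np.linspace(-u_max, u_max, n_g)
    cov = s * np.array([[1.0, corr], [corr, 1.0]])         # 2x2 covariance from (s, c)
    prec = np.linalg.inv(cov)
    ua, ub = np.meshgrid(u, u, indexing="ij")
    quad = prec[0, 0] * ua**2 + 2 * prec[0, 1] * ua * ub + prec[1, 1] * ub**2
    weight = np.exp(-0.5 * quad)                           # unnormalized Gaussian weight
    return np.sum(weight * phi(ua) * phi(ub)) / np.sum(weight)

print(lookup_entry(s=1.0, corr=0.5))                       # estimate of E[ReLU(z) ReLU(z')]
```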
  16. Experimental details

    The activation function is ReLU or tanh, the loss function is MSE, and no dropout is used. The Google Vizier hyperparameter tuner is used to choose the weight and bias initialization. How do you classify with a regression method? → Regress onto targets that are 0.9 for the right label and −0.1 for the wrong labels, so the expectation is 0 [Rifkin & Klautau 2004]. The results of NNs trained with SGD (Adam) and of NNGPs are compared on MNIST and CIFAR-10.
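A minimal sketch of that target encoding (the label array is an illustrative assumption):

```python
import numpy as np

def encode_targets(labels, n_classes):
    """0.9 for the correct class, -0.1 otherwise (zero-mean one-hot targets)."""
    Y = np.full((len(labels), n_classes), -0.1)
    Y[np.arange(len(labels)), labels] = 0.9
    return Y

print(encode_targets(np.array([2, 0, 1]), n_classes=3))
```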
  17. A comparison between NN and GP

    Advantages of the GP: ・no optimization ・due to its Bayesian nature, all predictions have uncertainty estimates ・only matrix calculations ・no overfitting. A pro of increasing the number of units in the NN: the generalization gap (= test error − train error) becomes smaller.
  18. Phase transition related to the hyperparameters of the GP

    [Figures: theory (Schoenholz et al. 2017) vs. experimental result.] A phase transition occurs depending on the variances σ_w² and σ_b² of the prior distributions of the weights and biases.
  19. Discussion

    ・A disadvantage of GPs is sensitivity to outliers → a Student-t distribution (Student-t process) can overcome it. ・Some researchers are trying to find GPs that correspond to CNNs or LSTMs [2017 Mark van der Wilk] [2017 Maruan Al-Shedivat]. ・The cost of GP regression is O(N³), where N is the number of training data → there are many ways to reduce this cost (e.g. the inducing variable method). ・There is also a way to estimate a suitable kernel function automatically from the input data [2011 Marc Peter Deisenroth].