From Random Matrices to Deep Learning

Slide 1

Slide 1 text

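From Random Matrices to Deep Learning. Tomohiro Hayase (早瀬友裕), Senior Research Scientist, Cluster Metaverse Lab. ACT-X "Frontier of Mathematics and Information Science" (数理・情報のフロンティア), first cohort. 2022/09/12, JST workshop of the three mathematics-related research areas: "New Mathematical Sciences Pioneered with Information Science" (情報科学と拓く新しい数理科学)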

Slide 2

Slide 2 text

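Biography: Tomohiro Hayase, Ph.D. (Mathematical Sciences)
1. Ph.D. program (Graduate School of Mathematical Sciences, The University of Tokyo)
   a. Operator algebras, in particular free probability theory
   b. Internships in CV and ML
2. Fujitsu Artificial Intelligence Laboratory
   a. ML in practice (CV)
   b. Foundations of ML (JST ACT-X "Mathematics and Information" area, first cohort, "Research on Deep Learning via Free Probability Theory")
3. [Now] Cluster Metaverse Lab
   a. ML in practice (VR, 3DCV, CG)
   b. Foundations of ML (NNGP)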

Slide 3

Slide 3 text

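Overview: What do we gain by considering neural networks with random parameters?
1. Methods for tuning the initialization and the learning rate
2. Bayesian inference with the NNGP
3. Learning-curve prediction with the NTK
- Computation time is greatly reduced compared with training the NN by gradient descent.
- Random matrix theory (and its extension, free probability theory [Voiculescu '85]) lies in the background.
JST ACT-X "Research on Deep Learning via Free Probability Theory"
- R. Karakida & TH, AISTATS2020: spectral distribution of the Fisher information matrix and the learning rate
- B. Collins & TH, Comm. in Math. Phys. (2022): asymptotic freeness of the Jacobians of DNNs
Figure: Google AI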

Slide 4

Slide 4 text

Multilayer Perceptron

Slide 5

Slide 5 text

Deep Neural Network
A standard formulation of deep learning is as follows:
1. We are given a deep neural network (DNN). It satisfies (one of) the following conditions:
   1. It is a parameterized family of transformations, each mapping a real vector to a real vector.
   2. It is a composition of simple parameterized transformations (e.g. linear transformations and non-linear elementwise functions).
2. We are given an objective function, e.g.
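
The two conditions above can be sketched in a few lines of NumPy. This is a minimal illustration and not the formulation used on the slide: the ReLU activation, the 1/sqrt(fan_in) weight scale, the half mean squared error objective, and all function names (`init_mlp`, `mlp_forward`, `mse_objective`) are assumptions chosen for the example.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Return a list of (W, b) pairs; sizes = [input_dim, hidden..., output_dim]."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # Assumed scale: standard deviation 1/sqrt(fan_in) for the weights.
        W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

def mlp_forward(params, x):
    """Compose affine maps with an elementwise ReLU; the last layer stays linear."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W, b = params[-1]
    return W @ h + b

def mse_objective(params, xs, ys):
    """Half the mean squared error over a dataset (one possible objective)."""
    preds = np.stack([mlp_forward(params, x) for x in xs])
    return 0.5 * np.mean(np.sum((preds - ys) ** 2, axis=1))

rng = np.random.default_rng(0)
params = init_mlp([3, 16, 16, 2], rng)
xs = rng.normal(size=(5, 3))
ys = rng.normal(size=(5, 2))
loss = mse_objective(params, xs, ys)
```

The network is a parameterized map from R^3 to R^2, and every layer is either a linear map plus bias or an elementwise nonlinearity, which is the structure the later slides exploit.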

Slide 6

Slide 6 text

Optimization. Stochastic Gradient Descent: We need
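
As a rough illustration of stochastic gradient descent, the sketch below runs minibatch SGD on a linear least-squares problem, where the gradient with respect to the parameters can be written by hand. The model, the loss, the function name `sgd_linear_regression`, and the hyperparameters are all assumptions for the example; for a DNN the same update would use gradients obtained by backpropagation.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    """Minibatch SGD for the loss (1/2n) * ||X w - y||^2 over a weight vector w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the minibatch loss with respect to w.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                               # noiseless targets
w_hat = sgd_linear_regression(X, y)
```

Because the targets are noiseless, every minibatch gradient vanishes at `w_true`, so the iterates settle there rather than fluctuating around it.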

Slide 7

Slide 7 text

Initialization of Parameters e.g.

Slide 8

Slide 8 text

In the infinite-width limit, the output follows a Gaussian distribution. Figure from [Google “Fast and Easy Infinitely Wide Networks with Neural Tangents”]
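
The Gaussian behaviour can be checked numerically on a small case. The sketch below is an assumption-laden toy, not the setting of the figure: it takes a one-hidden-layer ReLU network f(x) = v . relu(W x) / sqrt(M) with W_ij ~ N(0, 1/d) and v_i ~ N(0, 1), samples many independent initializations, and compares the empirical variance of f(x) with the value ||x||^2 / (2d) that this particular parameterization gives.

```python
import numpy as np

def random_relu_net_outputs(x, width, n_samples, rng):
    """Outputs f(x) = v . relu(W x) / sqrt(width) for independently drawn (W, v)."""
    d = x.shape[0]
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_samples, width, d))
    v = rng.normal(0.0, 1.0, size=(n_samples, width))
    hidden = np.maximum(W @ x, 0.0)          # shape (n_samples, width)
    return np.sum(v * hidden, axis=1) / np.sqrt(width)

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0, 0.5, 2.0])
outputs = random_relu_net_outputs(x, width=256, n_samples=4000, rng=rng)

# Variance under this parameterization: E[relu(u)^2] with u ~ N(0, ||x||^2 / d).
predicted_var = np.dot(x, x) / (2.0 * x.shape[0])
```

A histogram of `outputs` should look close to a centered normal density with variance `predicted_var`, and the agreement improves as `width` grows.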

Slide 9

Slide 9 text

Neural Network Gaussian Process [Lee et al., Deep Neural Networks as Gaussian Processes. ICLR 2018.] where Estimation:
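
Once a kernel is available, estimation reduces to Gaussian process regression. The sketch below uses the textbook GP posterior formulas, mean = K(x*, X) (K(X, X) + noise I)^{-1} y and the matching covariance, on the assumption that this is the estimator meant here. The function names, the `noise` jitter, and the linear stand-in kernel are hypothetical choices for the demo; an NNGP kernel would be plugged in the same way.

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, y_train, noise=1e-6):
    """Posterior mean and covariance of a GP from precomputed kernel blocks.

    K_train: (n, n) train/train, K_cross: (m, n) test/train, K_test: (m, m).
    """
    n = K_train.shape[0]
    # Cholesky factor of the regularized train kernel.
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross @ alpha
    V = np.linalg.solve(L, K_cross.T)
    cov = K_test - V.T @ V
    return mean, cov

def linear_kernel(A, B):
    """Stand-in kernel K(x, x') = 1 + x . x' / d (hypothetical, for the demo only)."""
    return 1.0 + A @ B.T / A.shape[1]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(3, 8))
y_train = np.array([0.5, -1.0, 2.0])
X_test = rng.normal(size=(2, 8))

K_tt = linear_kernel(X_train, X_train)
mean_test, cov_test = gp_posterior(
    K_tt, linear_kernel(X_test, X_train), linear_kernel(X_test, X_test), y_train
)
# Predicting back on the training inputs should reproduce the targets.
mean_train, _ = gp_posterior(K_tt, K_tt, K_tt, y_train)
```

No gradient descent is involved: inference is a single linear solve against the training kernel matrix.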

Slide 10

Slide 10 text

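Kernel Propagation
As in an ordinary DNN, the final kernel is determined inductively by composing the kernel maps of the individual layers.
* For specific activation functions (ReLU, GeLU, etc.), this integral can be computed explicitly.
* Inference only requires the following computation, so the computational cost is O(N^2).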
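
For ReLU the layer-wise integral has a well-known closed form (the arc-cosine kernel), so the recursion can be written directly. The sketch below assumes the recursion K^{l+1} = sigma_b^2 + sigma_w^2 E[relu(u) relu(v)] with (u, v) drawn from the Gaussian defined by K^l, and a first-layer kernel sigma_b^2 + sigma_w^2 x . x' / d; the function names and default variances are illustrative, and inputs are assumed to be nonzero.

```python
import numpy as np

def relu_expectation(k11, k12, k22):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]])."""
    norm = np.sqrt(k11 * k22)
    cos_theta = np.clip(k12 / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel matrix of a ReLU MLP with `depth` hidden layers on the rows of X."""
    d = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d
    for _ in range(depth):
        diag = np.diag(K)
        # Each layer touches every pair of inputs once: O(N^2) work per layer.
        E = relu_expectation(diag[:, None], K, diag[None, :])
        K = sigma_b2 + sigma_w2 * E
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = nngp_kernel(X, depth=3)
```

With `sigma_w2=2.0` and `sigma_b2=0.0` the diagonal of the kernel is preserved from layer to layer, which gives a quick sanity check on the recursion.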

Slide 11

Slide 11 text

Neural Tangent Kernel. Informal [Jacot+ NeurIPS2018, Lee+ NeurIPS2019]: In the wide limit $M \to \infty$, the learning of the DNN is approximated by where Learning dynamics of the parameters is given by: Learning dynamics of the DNN is given by: where (Neural Tangent Kernel)
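
For a squared loss, the linearized dynamics commonly associated with the NTK read df_t/dt = -eta * Theta (f_t - y) on the training set, with solution f_t = y + exp(-eta * Theta * t)(f_0 - y). The sketch below evaluates this closed form through an eigendecomposition. It assumes this standard form and a constant kernel `theta`; the kernel, targets, and function name are hypothetical placeholders.

```python
import numpy as np

def ntk_train_predictions(theta, f0, y, lr, t):
    """Training-set outputs at time t under df/dt = -lr * theta @ (f - y)."""
    eigvals, eigvecs = np.linalg.eigh(theta)
    decay = np.exp(-lr * eigvals * t)
    # Apply exp(-lr * theta * t) to the initial residual in the eigenbasis.
    residual = eigvecs @ (decay * (eigvecs.T @ (f0 - y)))
    return y + residual

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))
theta = A @ A.T / 6.0 + 0.1 * np.eye(4)   # hypothetical positive-definite kernel
f0 = rng.normal(size=4)                   # outputs at initialization
y = np.array([1.0, 0.0, -1.0, 2.0])       # training targets

f_start = ntk_train_predictions(theta, f0, y, lr=1.0, t=0.0)
f_late = ntk_train_predictions(theta, f0, y, lr=1.0, t=1e4)
```

Each eigen-direction of the kernel decays independently at rate `lr * eigenvalue`, so the whole learning curve follows from the spectrum without running gradient descent.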

Slide 12

Slide 12 text

The NTK also predicts the training process. Figures from [Google “Fast and Easy Infinitely Wide Networks with Neural Tangents”]

Slide 13

Slide 13 text

Also applicable to CNNs and ResNets. Figure from [Google “Fast and Easy Infinitely Wide Networks with Neural Tangents”]

Slide 14

Slide 14 text

Also applicable to attention. Infinite attention: NNGP and NTK for deep attention networks [https://arxiv.org/abs/2006.10540]

Slide 15

Slide 15 text

Spectrum of NTK. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks”, https://arxiv.org/abs/2005.11879. They treat the standard formulation (Gaussian initialization x multiple samples x small output dimension), and they get:

Slide 16

Slide 16 text

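Current Work. MLP-type NNs used to be nothing more than toy models, but they are now applied to real images and 3D data, e.g. gMLP, NeRF, etc. They are a well-balanced research subject: easy to compute explicitly in theory and also practical! They are also expected to be used in VR. Topics: theoretical analysis of MLP-type networks, and the design of lightweight surrogate models based on NNGP/NTK.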

Slide 17

Slide 17 text

ACT-X: Research on Deep Learning via Free Probability Theory

Slide 18

Slide 18 text

Vanishing/Exploding Gradients. Optimizing a DNN requires the derivatives of the objective with respect to its parameters. Since the DNN is a composition of functions, these derivatives are computed by backpropagation. The input-output Jacobian is given by
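
By the chain rule, the input-output Jacobian of an MLP is a product of per-layer factors D_l W_l, where D_l is the diagonal matrix of activation derivatives. The sketch below builds that product for a bias-free ReLU network with Gaussian weights of variance `sigma_w2 / fan_in`; the architecture, the variance values, and the function name are assumptions for the example. With this setup each layer scales squared norms by roughly `sigma_w2 / 2`, so the Jacobian shrinks or grows geometrically with depth unless `sigma_w2` is near 2.

```python
import numpy as np

def input_output_jacobian(x, depth, width, sigma_w2, rng):
    """Jacobian of a bias-free ReLU MLP at x, accumulated as a product of D_l W_l."""
    h = x
    J = np.eye(x.shape[0])
    for _ in range(depth):
        fan_in = h.shape[0]
        W = rng.normal(0.0, np.sqrt(sigma_w2 / fan_in), size=(width, fan_in))
        pre = W @ h
        D = (pre > 0).astype(float)       # elementwise derivative of ReLU
        J = (D[:, None] * W) @ J          # chain rule: new factor on the left
        h = np.maximum(pre, 0.0)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=64)
J_small = input_output_jacobian(x, depth=20, width=64, sigma_w2=0.5, rng=rng)
J_large = input_output_jacobian(x, depth=20, width=64, sigma_w2=8.0, rng=rng)

norm_small = np.linalg.norm(J_small, 2)   # largest singular value, vanishing regime
norm_large = np.linalg.norm(J_large, 2)   # largest singular value, exploding regime
```

Backpropagated gradients are multiplied by these same factors, which is why the spectrum of the Jacobian controls vanishing and exploding gradients.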

Slide 19

Slide 19 text

Dynamical Isometry. (Dynamical isometry) If the eigenvalue distribution of $J J^\top$, where $J$ is the input-output Jacobian, is concentrated around 1, then we can prevent exploding/vanishing gradients. [Pennington, Schoenholz, Ganguli, AISTATS2018; Benoit Collins & TH, Comm. in Math. Phys. 2022] If we initialize the parameters as Haar orthogonal matrices and choose an appropriate activation function, then the DNN achieves dynamical isometry.
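
A Haar orthogonal matrix can be sampled by a QR decomposition of a Gaussian matrix followed by a sign correction. The sketch below uses that recipe and, as the simplest possible case, a deep linear network (identity activation, an assumption made only for this example), where the Jacobian is the product of the weight matrices. Orthogonal weights keep every singular value at 1, while Gaussian weights of matching scale spread them over many orders of magnitude.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure via QR."""
    Z = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(Z)
    # Multiply each column by the sign of the matching diagonal entry of R,
    # which removes the sign ambiguity of the QR decomposition.
    return Q * np.sign(np.diag(R))

def product_singular_values(weights):
    """Singular values of W_L ... W_1, the Jacobian of a deep linear network."""
    J = np.eye(weights[0].shape[1])
    for W in weights:
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
n, depth = 32, 50
sv_orth = product_singular_values([haar_orthogonal(n, rng) for _ in range(depth)])
sv_gauss = product_singular_values(
    [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)) for _ in range(depth)]
)
```

For nonlinear networks the activation derivatives enter the product as well, which is why the statement above also requires an appropriate activation function.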

Slide 20

Slide 20 text

Limit Spectral Distributions [Pennington, Schoenholz, Ganguli, AISTATS2018]

Slide 21

Slide 21 text

T. H. & R. Karakida 2020, “The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry”, https://arxiv.org/abs/2006.07814. When the DNN achieves dynamical isometry, the spectrum of the (one-sample x high-dim output) “NTK” concentrates around its maximal value, and the maximal value is O(L). Key recursive equation:
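
The O(L) scaling can be seen exactly in a toy case that is much simpler than the setting of the paper: a deep linear network f = W_L ... W_1 x with orthogonal square weights and a single unit-norm input. This is an illustrative assumption, not the paper's derivation. There the one-sample kernel Theta = sum_l (df/dW_l)(df/dW_l)^T equals L * ||x||^2 * I, so all eigenvalues sit at the maximal value L. The function names below are hypothetical.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n Haar orthogonal matrix via sign-corrected QR."""
    Z = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

def one_sample_ntk(weights, x):
    """Theta = sum_l (df/dW_l)(df/dW_l)^T for the linear network f = W_L ... W_1 x."""
    n_out = weights[-1].shape[0]
    hs = [x]                                 # hs[l] is the input to layer l + 1
    for W in weights[:-1]:
        hs.append(W @ hs[-1])
    theta = np.zeros((n_out, n_out))
    A = np.eye(n_out)                        # product of the layers above the current one
    for W, h in zip(reversed(weights), reversed(hs)):
        # df/dW_l flattened to shape (n_out, rows * cols of W_l).
        block = np.kron(A, h[None, :])
        theta += block @ block.T
        A = A @ W
    return theta

rng = np.random.default_rng(0)
n, depth = 8, 6
weights = [haar_orthogonal(n, rng) for _ in range(depth)]
x = rng.normal(size=n)
x /= np.linalg.norm(x)                       # unit-norm input
eigs = np.linalg.eigvalsh(one_sample_ntk(weights, x))
```

In this toy case the spectrum is a single point at `depth`; the paper's result concerns nonlinear networks, where the spectrum concentrates near a maximal value of the same order.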

Slide 22

Slide 22 text

Training under Dynamical Isometry. Red line (the boundary of the exploding gradients): this line is predicted by our theory!

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Spectrum of NTK. The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. e.g. => The learning dynamics does not converge. The condition number determines the convergence speed.
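
Both statements can be illustrated with the discrete-time linearized dynamics r_{t+1} = (I - eta * Theta) r_t for the training residual r. For this recursion the standard stability condition is eta < 2 / lambda_max, and it is an assumption of this sketch that this is the kind of condition meant above. The diagonal kernel, the learning rates, and the function name are hypothetical.

```python
import numpy as np

def residual_norms(theta, r0, lr, steps):
    """Norm of the training residual under r_{t+1} = (I - lr * theta) r_t."""
    r = r0.copy()
    norms = [np.linalg.norm(r)]
    for _ in range(steps):
        r = r - lr * (theta @ r)
        norms.append(np.linalg.norm(r))
    return np.array(norms)

theta = np.diag([4.0, 1.0, 0.1])     # hypothetical NTK with eigenvalues 4, 1, 0.1
r0 = np.ones(3)
lam_max = 4.0

# Just below the threshold 2 / lam_max the residual decays to zero.
stable = residual_norms(theta, r0, lr=1.9 / lam_max, steps=500)
# Just above it the component along the top eigenvector blows up.
unstable = residual_norms(theta, r0, lr=2.1 / lam_max, steps=500)
```

In the stable run the slowest mode contracts by `1 - lr * 0.1` per step, so a larger ratio between the largest and smallest eigenvalues (the condition number) means slower convergence at any admissible learning rate.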

Slide 25

Slide 25 text

Training under Dynamical Isometry. Red line (the boundary of the exploding gradients): this line is predicted by our theory!

Slide 26

Slide 26 text

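Summary: What do we gain by considering neural networks with random parameters?
1. Methods for tuning the initialization and the learning rate
2. Bayesian inference with the NNGP
3. Learning-curve prediction with the NTK
- Computation time is greatly reduced compared with training the NN by gradient descent.
- Random matrix theory (and its extension, free probability theory [Voiculescu '85]) lies in the background.
Future Work: beyond toy models. Using MLP-type networks as a foothold, push the boundary where theoretical guarantees and applicability to real data can both be achieved.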