From Random Matrices to Deep Learning

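Tomohiro Hayase, Cluster Metaverse Lab., Senior Research Scientist
ACT-X "Frontier of Mathematics and Information Science", first cohort
2022/09/12, JST workshop of the three mathematics-related research areas, "New Mathematical Sciences Pioneered with Information Science"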

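Biography
Tomohiro Hayase, Ph.D. (Mathematical Sciences)
1. Ph.D. program (Graduate School of Mathematical Sciences, The University of Tokyo)
   a. Operator algebras, in particular free probability theory
   b. Internships in CV and ML
2. Fujitsu Artificial Intelligence Laboratory
   a. ML in practice (CV)
   b. Foundations of ML (JST ACT-X "Mathematics and Information" area, first cohort; "Research on Deep Learning via Free Probability Theory")
3. [Now] Cluster Metaverse Lab
   a. ML in practice (VR, 3D CV, CG)
   b. Foundations of ML (NNGP)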

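Overview
What happens if we make the parameters of a neural network random?
1. Methods for tuning the initialization and the learning rate
2. Bayesian inference with the NNGP
3. Prediction of learning curves with the NTK
- Computation time is greatly reduced compared with training the NN by gradient descent.
- Random matrix theory (and its extension, free probability theory [Voiculescu’85]) lies in the background.
JST ACT-X "Research on Deep Learning via Free Probability Theory"
- R. Karakida & TH, AISTATS2020: spectral distribution of the Fisher information matrix and the learning rate
- B. Collins & TH, Comm. in Math. Phys. (2022): asymptotic freeness of the Jacobians of DNNs
Figure: Google AI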

Multilayer Perceptron

Deep Neural Network
A standard formulation of deep learning is as follows:
1. We are given a deep neural network (DNN), which satisfies (one of) the following conditions:
   1. It is a parameterized family of transformations, each mapping a real vector to a real vector.
   2. It is a composition of simple parameterized transformations (e.g. linear transformations and elementwise non-linear functions).
2. We are given an objective function.
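The compositional structure described on this slide can be sketched as below. This is a minimal illustration, not code from the talk: the helper name `mlp_forward`, the layer widths, the ReLU activation, and the Gaussian weights are all assumptions chosen for the example.

```python
import numpy as np

def mlp_forward(x, weights, biases, phi=lambda z: np.maximum(z, 0.0)):
    """Hypothetical helper: compose affine maps with an elementwise nonlinearity."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(W @ h + b)  # hidden layer: linear map, then elementwise activation
    return weights[-1] @ h + biases[-1]  # the last layer is left linear

rng = np.random.default_rng(0)
widths = [3, 16, 16, 2]  # illustrative: input dim 3, two hidden layers, output dim 2
weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
           for m, n in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n) for n in widths[1:]]

y = mlp_forward(rng.normal(size=3), weights, biases)  # maps R^3 to R^2
```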

Optimization
Stochastic Gradient Descent
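As a rough illustration of stochastic gradient descent, the sketch below fits a linear least-squares model with mini-batch updates. The data, the learning rate `eta`, the batch size, and the number of steps are hypothetical values picked for the example, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # synthetic inputs (assumption)
theta_true = rng.normal(size=5)
y = X @ theta_true                   # noiseless targets, so the optimum is exact

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta = np.zeros(5)
eta = 0.05                           # learning rate (assumption)
initial_loss = loss(theta)
for step in range(500):
    idx = rng.choice(len(X), size=20, replace=False)        # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)   # mini-batch gradient
    theta -= eta * grad                                      # SGD update
final_loss = loss(theta)
```

Each step needs three ingredients: the current parameters (hence an initialization), a gradient estimate on a mini-batch, and a learning rate.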

Initialization of Parameters
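The example formula from this slide is not in the transcript. One common choice in the NNGP/NTK literature, given here only as an assumed illustration, is an i.i.d. Gaussian initialization whose weight variance is scaled by the input width of the layer. The symbols $\sigma_w^2$, $\sigma_b^2$, and $M_{\ell-1}$ (width of layer $\ell-1$) are notation introduced for this sketch.

```latex
W^{\ell}_{ij} \sim \mathcal{N}\!\left(0, \frac{\sigma_w^2}{M_{\ell-1}}\right), \qquad
b^{\ell}_{i} \sim \mathcal{N}\!\left(0, \sigma_b^2\right)
```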

In the infinite-width limit, the output is Gaussian
[Figure from Google, “Fast and Easy Infinitely Wide Networks with Neural Tangents”]
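The Gaussian limit can be checked numerically with a small experiment. The sketch below, which is an assumed setup rather than the figure's, draws many random one-hidden-layer ReLU networks, evaluates them at one fixed input, and compares the excess kurtosis of the outputs for a narrow and a wide network (a Gaussian has excess kurtosis 0). The function names, widths, and input are hypothetical.

```python
import numpy as np

def sample_outputs(width, n_draws, x, rng):
    """Outputs at x of n_draws independent random one-hidden-layer ReLU networks."""
    d = x.shape[0]
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_draws, width, d))  # first layer
    v = rng.normal(0.0, 1.0, size=(n_draws, width))                  # readout layer
    hidden = np.maximum(W @ x, 0.0)                                  # (n_draws, width)
    return (v * hidden).sum(axis=1) / np.sqrt(width)                 # 1/sqrt(width) scaling

def excess_kurtosis(s):
    s = s - s.mean()
    return np.mean(s ** 4) / np.mean(s ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 2.0])
kurt_narrow = excess_kurtosis(sample_outputs(1, 5000, x, rng))    # far from Gaussian
kurt_wide = excess_kurtosis(sample_outputs(256, 5000, x, rng))    # close to Gaussian
```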

Neural Network Gaussian Process
[Lee et al., Deep Neural Networks as Gaussian Processes. ICLR 2018.]
Estimation: Gaussian process regression with the NNGP kernel.
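The estimation formulas on this slide did not survive extraction. For reference, the standard Gaussian process regression posterior is written below. The notation is an assumption of this sketch: $K$ is the NNGP kernel, $(X, y)$ the training data, $x_*$ a test input, and $\sigma_\varepsilon^2$ an observation-noise variance.

```latex
\begin{aligned}
f(x_*) \mid X, y &\sim \mathcal{N}(\mu_*, \Sigma_*), \\
\mu_* &= K(x_*, X)\,\bigl(K(X, X) + \sigma_\varepsilon^2 I\bigr)^{-1} y, \\
\Sigma_* &= K(x_*, x_*) - K(x_*, X)\,\bigl(K(X, X) + \sigma_\varepsilon^2 I\bigr)^{-1} K(X, x_*).
\end{aligned}
```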

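Kernel Propagation
As in an ordinary DNN, the final kernel is determined inductively by composing the kernel maps of the individual layers.
* For specific activation functions (ReLU, GeLU, etc.), this integral can be computed explicitly.
* Inference only requires this kernel computation, so building the kernel matrix costs O(N^2) for N samples.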
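For ReLU the layer-wise integral has a closed form (the arc-cosine kernel), so the recursion can be sketched in a few lines. This is an illustration under assumptions, not the speaker's implementation: the function names, the defaults `sigma_w2=2.0` and `sigma_b2=0.0`, the depth, and the synthetic data are all hypothetical.

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of a ReLU MLP via layer-by-layer kernel propagation."""
    d = X1.shape[1]
    k12 = sigma_b2 + sigma_w2 * X1 @ X2.T / d               # input-layer kernel
    k11 = sigma_b2 + sigma_w2 * np.sum(X1 ** 2, axis=1) / d
    k22 = sigma_b2 + sigma_w2 * np.sum(X2 ** 2, axis=1) / d
    for _ in range(depth):
        norm = np.sqrt(np.outer(k11, k22))
        cos_t = np.clip(k12 / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # Closed form of E[relu(u) relu(v)] for jointly Gaussian (u, v)
        k12 = sigma_b2 + sigma_w2 * norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
        k11 = sigma_b2 + sigma_w2 * k11 / 2                  # E[relu(u)^2] = k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12

def nngp_predict(X_train, y_train, X_test, depth=3, noise=1e-6):
    """Posterior mean of GP regression with the kernel above."""
    K = relu_nngp_kernel(X_train, X_train, depth)
    K_star = relu_nngp_kernel(X_test, X_train, depth)
    return K_star @ np.linalg.solve(K + noise * np.eye(len(X_train)), y_train)

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(10, 5)), rng.normal(size=(4, 5))
y_train = np.sin(X_train[:, 0])
preds = nngp_predict(X_train, y_train, X_test)
```

No gradient-based training is involved: prediction reduces to evaluating the kernel and solving one linear system.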

Neural Tangent Kernel
Informal [Jacot+NeurIPS2018, Lee+NeurIPS2019]: In the wide limit $M \to \infty$, the learning of the DNN is approximated by kernel dynamics. The learning dynamics of the parameters induce learning dynamics of the DNN output, and the latter are governed by the Neural Tangent Kernel (NTK).
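The formulas from this slide are missing in the transcript. In the standard presentation of the cited papers, gradient flow on the parameters and the induced dynamics of the network output take the following form. The notation is assumed for this sketch: $\theta_t$ are the parameters, $f_t$ the network output, $\mathcal{L}$ the loss, $\eta$ the learning rate, $x_1, \dots, x_N$ the training inputs, and $\Theta_t$ the NTK.

```latex
\begin{aligned}
\frac{d\theta_t}{dt} &= -\eta\, \nabla_\theta \mathcal{L}(\theta_t), \\
\frac{d f_t(x)}{dt} &= -\eta \sum_{i=1}^{N} \Theta_t(x, x_i)\, \frac{\partial \mathcal{L}}{\partial f_t(x_i)}, \\
\Theta_t(x, x') &= \nabla_\theta f_t(x)\, \nabla_\theta f_t(x')^{\top}.
\end{aligned}
```

In the wide limit the kernel stays close to its value at initialization, $\Theta_t \approx \Theta_0$, which is what makes the dynamics tractable.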

The NTK also predicts the training process
[Two figures from Google, “Fast and Easy Infinitely Wide Networks with Neural Tangents”]

Also applicable to CNNs and ResNets
[Figure from Google, “Fast and Easy Infinitely Wide Networks with Neural Tangents”]

Also applicable to attention
Infinite attention: NNGP and NTK for deep attention networks [https://arxiv.org/abs/2006.10540]

Spectrum of NTK
“Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks” https://arxiv.org/abs/2005.11879
They treat the standard formulation (Gaussian initialization x multiple samples x small output dimension) and characterize the limiting spectra of both kernels.

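Current Work
MLP-type NNs used to be no more than toy models, but they are now applied to real images and 3D data as well, e.g. gMLP, NeRF, etc.
They are a well-balanced research subject: easy to compute explicitly in theory, yet practical. They are also expected to be used in VR.
Theoretical analysis of MLP-type networks, and design of lightweight surrogate models based on NNGP/NTK.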

ACT-X: Research on Deep Learning via Free Probability Theory

Vanishing/Exploding Gradients
Optimizing a DNN requires the derivatives of the objective with respect to its parameters. Since the DNN is a composition of functions, these derivatives are computed by backpropagation. The key object is the input-output Jacobian of the network.
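The Jacobian formula itself is not preserved in the transcript. For an MLP it has the standard product form below, written in notation assumed for this sketch: $W^{\ell}$ and $b^{\ell}$ are the weights and biases of layer $\ell$, $\varphi$ is the activation, $x^{0}$ is the input, and $L$ is the depth.

```latex
h^{\ell} = W^{\ell} x^{\ell-1} + b^{\ell}, \qquad x^{\ell} = \varphi(h^{\ell}), \qquad
J = \frac{\partial x^{L}}{\partial x^{0}} = D^{L} W^{L} D^{L-1} W^{L-1} \cdots D^{1} W^{1}, \qquad
D^{\ell} = \operatorname{diag}\bigl(\varphi'(h^{\ell})\bigr).
```

Because $J$ is a product of $L$ matrices, its singular values can shrink or grow exponentially with depth, which is the source of vanishing and exploding gradients.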

Dynamical Isometry
(Dynamical Isometry) If the eigenvalue distribution of $J J^{\top}$, where $J$ is the input-output Jacobian, is concentrated around 1, then we can prevent exploding/vanishing gradients.
[Pennington, Schoenholz, Ganguli, AISTATS2018; Benoit Collins & TH, Comm. in Math. Phys. 2022] If we initialize the weights as Haar orthogonal matrices and choose an appropriate activation function, then the DNN achieves dynamical isometry.
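The effect of the initialization on the Jacobian spectrum can be seen in a simplified setting. The sketch below assumes a deep linear network (identity activation, so every $D^{\ell}$ is the identity), which is a deliberate simplification of the nonlinear case treated in the cited papers. The width, depth, and helper names are hypothetical.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an n x n Haar-distributed orthogonal matrix via QR."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))  # column sign fix so that Q is Haar distributed

def jacobian_singular_values(weights):
    """Singular values of the input-output Jacobian of a deep linear network."""
    J = np.eye(weights[0].shape[1])
    for W in weights:
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
n, depth = 64, 20
orth = [haar_orthogonal(n, rng) for _ in range(depth)]
gauss = [rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)) for _ in range(depth)]

sv_orth = jacobian_singular_values(orth)    # all equal to 1: dynamical isometry
sv_gauss = jacobian_singular_values(gauss)  # spread over many orders of magnitude
```

With orthogonal weights the product stays orthogonal at any depth, while the Gaussian product becomes increasingly ill-conditioned as layers are added.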

Limit Spectral Distributions
[Pennington, Schoenholz, Ganguli, AISTATS2018]

T. H. & R. Karakida 2020
“The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry” https://arxiv.org/abs/2006.07814
When the DNN achieves dynamical isometry, the spectrum of the (one-sample x high-dimensional output) "NTK" concentrates around its maximal value, and the maximal value is O(L), where L is the depth.

Training under Dynamical Isometry
Red line (the borderline of exploding gradients): this line is predicted by our theory!

Spectrum of NTK
The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. For example, if the learning rate is too large relative to the largest eigenvalue, the learning dynamics do not converge. The condition number determines the convergence speed.
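The role of the largest eigenvalue can be illustrated with linearized dynamics. The sketch below assumes a fixed kernel with eigenvalues `eigvals`, for which gradient descent multiplies each residual component by `1 - eta * lambda` per step, so convergence requires `eta < 2 / lambda_max`. The spectrum and the learning rates are hypothetical, and the exact condition shown on the slide is not preserved.

```python
import numpy as np

def residual_norm_after(eigvals, eta, steps):
    """Residual norm under r <- (1 - eta * lambda) * r in the kernel eigenbasis."""
    r = np.ones_like(eigvals)
    for _ in range(steps):
        r = (1.0 - eta * eigvals) * r
    return np.linalg.norm(r)

eigvals = np.array([0.1, 1.0, 10.0])  # hypothetical spectrum, condition number 100
lam_max = eigvals.max()

stable = residual_norm_after(eigvals, 1.9 / lam_max, 200)    # eta below 2 / lambda_max
unstable = residual_norm_after(eigvals, 2.1 / lam_max, 200)  # eta above 2 / lambda_max
```

In the stable run the slowest component decays at rate `1 - eta * lambda_min`, so a large condition number means slow convergence even when the learning rate is near its upper limit.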

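Summary
What happens if we make the parameters of a neural network random?
1. Methods for tuning the initialization and the learning rate
2. Bayesian inference with the NNGP
3. Prediction of learning curves with the NTK
- Computation time is greatly reduced compared with training the NN by gradient descent.
- Random matrix theory (and its extension, free probability theory [Voiculescu’85]) lies in the background.
Future Work: Beyond the Toy Model
Using MLP-type networks as a foothold, we push toward the boundary where theoretical guarantees and real data can both be handled.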
