## Slide 1

### Slide 1 text

From Random Matrices to Deep Learning. Tomohiro Hayase, Cluster Metaverse Lab., Senior Research Scientist. ACT-X "Frontiers of Mathematics and Information" 1st-term member. 2022/09/12, JST workshop of the three mathematics-related areas, "New Mathematical Science Opened Up with Information Science".

## Slide 4

### Slide 4 text

Multilayer Perceptron

## Slide 5

### Slide 5 text

Deep Neural Network

A standard formulation of deep learning is as follows:

1. We are given a deep neural network (DNN), satisfying (one of) the following conditions:
   1. It is a parameterized family of transformations mapping a real vector to a real vector.
   2. It is a composition of simple parametrized transformations (e.g. linear transformations and elementwise non-linear functions).
2. We are given an objective function, e.g. an empirical loss over training data.
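
The second condition can be sketched directly in code: an MLP as a composition of affine maps and an elementwise non-linearity. The widths and the tanh activation below are illustrative choices, not from the slides.

```python
import numpy as np

def mlp_forward(x, weights, biases, activation=np.tanh):
    """Apply an MLP: a composition of simple parametrized transformations."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = activation(W @ h + b)      # affine map, then elementwise non-linearity
    W, b = weights[-1], biases[-1]
    return W @ h + b                   # linear readout layer

rng = np.random.default_rng(0)
dims = [3, 16, 16, 2]                  # input dim 3, two hidden layers, output dim 2
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
y = mlp_forward(rng.standard_normal(3), weights, biases)
```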

## Slide 6

### Slide 6 text

Optimization

Stochastic Gradient Descent: we need the gradient of the objective with respect to the parameters, estimated on mini-batches.
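
A minimal SGD sketch on a least-squares objective (the objective, data sizes, and learning rate are illustrative assumptions): each mini-batch gradient is an unbiased estimate of the full gradient.

```python
import numpy as np

# Toy least-squares objective L(w) = ||X w - y||^2 / (2 n).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true

w, lr, batch = np.zeros(5), 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)       # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # unbiased gradient estimate
    w -= lr * grad                                   # gradient step
```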

## Slide 7

### Slide 7 text

Initialization of Parameters, e.g. i.i.d. Gaussian entries with variance scaled by the layer width.
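
One common scaling (an illustrative sketch, not necessarily the slide's exact formula): Gaussian entries with variance $\sigma_w^2 / M$, so that pre-activations stay $O(1)$ as the width $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma_w = 512, 1.0
# i.i.d. N(0, sigma_w^2 / M) entries
W = rng.standard_normal((M, M)) * sigma_w / np.sqrt(M)

x = rng.standard_normal(M)
# E ||W x||^2 ≈ sigma_w^2 ||x||^2, independent of the width M
ratio = np.linalg.norm(W @ x) ** 2 / np.linalg.norm(x) ** 2
```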

## Slide 9

### Slide 9 text

Neural Network Gaussian Process [Lee et al., Deep Neural Networks as Gaussian Processes. ICLR 2018.]

In the infinite-width limit, a randomly initialized DNN is a Gaussian process, so Bayesian estimation reduces to Gaussian-process regression with the corresponding NNGP kernel.
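
The estimation step is ordinary Gaussian-process regression: for any kernel $K$, the posterior mean at test points is $K_{*x}(K_{xx} + \sigma^2 I)^{-1} y$. A hedged sketch, where a plain RBF kernel stands in for the NNGP kernel (the kernel choice and hyperparameters are assumptions for illustration):

```python
import numpy as np

def rbf(A, B, ell=0.3):
    """Stand-in kernel; with NNGP this would be the infinite-width network kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-4):
    """Posterior mean K_*x (K_xx + noise I)^{-1} y of GP regression."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf(X_test, X_train)
    return K_star @ np.linalg.solve(K, y_train)

X = np.linspace(-1, 1, 20)[:, None]
y = np.sin(3 * X[:, 0])
pred = gp_posterior_mean(X, y, X)      # posterior mean at the training inputs
```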

## Slide 10

### Slide 10 text

Kernel Propagation

As with an ordinary DNN, the final kernel is determined inductively by composing the kernel map of each layer.

* For particular activation functions (ReLU, GELU, etc.) this integral can be computed in closed form.
* Inference is possible as long as the kernel values can be computed, so the computational cost is O(N^2).
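
For ReLU the layer-to-layer Gaussian integral has a known closed form (the arc-cosine kernel of Cho & Saul), so kernel propagation is just a recursion on the N×N Gram matrix, which is where the O(N^2) cost comes from. A sketch (the weight/bias variances and depth are illustrative):

```python
import numpy as np

def relu_kernel_step(K, sigma_w2=2.0, sigma_b2=0.0):
    """One layer of kernel propagation with ReLU activation (closed form)."""
    diag = np.sqrt(np.diag(K))
    outer = np.outer(diag, diag)
    cos_t = np.clip(K / outer, -1.0, 1.0)
    theta = np.arccos(cos_t)
    # E[relu(u) relu(v)] for (u, v) ~ N(0, [[Kxx, Kxy], [Kxy, Kyy]])
    Ephi = outer * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
    return sigma_w2 * Ephi + sigma_b2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = X @ X.T / X.shape[1]          # input-layer kernel
for _ in range(4):                # compose four layers' kernel maps
    K = relu_kernel_step(K)
```

With `sigma_w2=2.0` and no bias, each step preserves the diagonal of the kernel, which is the usual "critical" scaling for ReLU.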

## Slide 11

### Slide 11 text

Neural Tangent Kernel (informal) [Jacot+ NeurIPS 2018, Lee+ NeurIPS 2019]: under the wide limit $M \to \infty$, the learning of the DNN is approximated by kernel gradient descent with a fixed kernel.

The learning dynamics of the parameters is given by gradient flow, $\dot\theta_t = -\nabla_\theta \mathcal{L}(\theta_t)$, and the learning dynamics of the DNN is given by $\dot f_t(x) = -\sum_{x'} \Theta(x, x')\, \nabla_{f_t(x')} \mathcal{L}$, where $\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$ (the Neural Tangent Kernel).
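
At finite width the kernel $\Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$ can be computed empirically; a sketch using numerical gradients on a tiny one-hidden-layer network (architecture and sizes are illustrative assumptions; the slide's statement concerns the infinite-width limit):

```python
import numpy as np

def f(theta, x, M=8):
    """Scalar output of a one-hidden-layer tanh network with flattened params theta."""
    d = len(x)
    W1 = theta[: M * d].reshape(M, d)
    w2 = theta[M * d:]
    return w2 @ np.tanh(W1 @ x) / np.sqrt(M)

def grad_theta(theta, x, eps=1e-6):
    """Central-difference gradient of f with respect to theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d, M = 3, 8
theta = rng.standard_normal(M * d + M)
xs = [rng.standard_normal(d) for _ in range(4)]
grads = np.stack([grad_theta(theta, x) for x in xs])
ntk = grads @ grads.T              # empirical NTK Gram matrix: symmetric, PSD
```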

## Slide 12

### Slide 12 text

The NTK also predicts the training process. Figure from [Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"].

## Slide 13

### Slide 13 text

Also covers CNNs and ResNets. Figure from [Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"].

## Slide 14

### Slide 14 text

Also covers attention. Infinite attention: NNGP and NTK for deep attention networks [https://arxiv.org/abs/2006.10540]

## Slide 15

### Slide 15 text

Spectrum of NTK

"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks" https://arxiv.org/abs/2005.11879

They treat the standard setting (Gaussian initialization × multiple samples × small output dimension) and obtain the limiting spectral distributions.

## Slide 16

### Slide 16 text

Current Work

MLP-type networks used to be mere toy models, but they are now applied to real images and 3D data, e.g. gMLP, NeRF. They are a research target that is both explicitly computable in theory and practically useful; applications in VR are also expected. Goals: theoretical analysis of MLP-type networks, and design of lightweight surrogate models via NNGP/NTK.

## Slide 17

### Slide 17 text

ACT-X: Research on Deep Learning via Free Probability Theory

## Slide 18

### Slide 18 text

Vanishing/Exploding Gradients

Optimizing a DNN requires its parameter derivatives. Since the DNN is a composition of functions, the parameter derivatives are computed by backpropagation. For an MLP, the input-output Jacobian is given by the layerwise product $J = \prod_{\ell} D_\ell W_\ell$, where $D_\ell$ is the diagonal matrix of activation derivatives at layer $\ell$.
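
The product structure $J = \prod_\ell D_\ell W_\ell$ can be verified numerically against finite differences (widths, depth, and tanh activation are illustrative assumptions):

```python
import numpy as np

def forward_with_jacobian(x, Ws, phi=np.tanh, dphi=lambda u: 1 - np.tanh(u) ** 2):
    """Return f(x) and the input-output Jacobian J = prod_l D_l W_l."""
    J = np.eye(len(x))
    h = x
    for W in Ws:
        u = W @ h
        J = np.diag(dphi(u)) @ W @ J   # chain rule, one layer at a time
        h = phi(u)
    return h, J

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 4)) / 2 for _ in range(3)]
x = rng.standard_normal(4)
y, J = forward_with_jacobian(x, Ws)

# finite-difference check of the first Jacobian column
eps = 1e-6
e0 = np.zeros(4)
e0[0] = eps
col0 = (forward_with_jacobian(x + e0, Ws)[0] - forward_with_jacobian(x - e0, Ws)[0]) / (2 * eps)
```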

## Slide 19

### Slide 19 text

Dynamical Isometry

(Dynamical Isometry) If the eigenvalue distribution of $JJ^\top$ is concentrated around 1, then we can prevent exploding/vanishing gradients.

[Pennington, Schoenholz, Ganguli, AISTATS 2018; Benoit Collins & TH, CIMP2022] If we set the initialization of the parameters to be Haar orthogonal and choose an appropriate activation function, then we can make the DNN achieve dynamical isometry.
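
A quick numerical illustration of the orthogonal-vs-Gaussian contrast, with linear activations for simplicity (the cited results of course treat non-linear networks): a product of Haar orthogonal matrices keeps every singular value at 1, while a product of Gaussian matrices spreads them over many orders of magnitude.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample a Haar-distributed orthogonal matrix via QR with sign correction."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
n, L = 64, 50

J_orth = np.eye(n)
J_gauss = np.eye(n)
for _ in range(L):
    J_orth = haar_orthogonal(n, rng) @ J_orth
    J_gauss = (rng.standard_normal((n, n)) / np.sqrt(n)) @ J_gauss

s_orth = np.linalg.svd(J_orth, compute_uv=False)    # all equal to 1
s_gauss = np.linalg.svd(J_gauss, compute_uv=False)  # widely spread: exploding/vanishing
```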

## Slide 20

### Slide 20 text

Limit Spectral Distributions [Pennington, Schoenholz, Ganguli, AISTATS 2018]

## Slide 21

### Slide 21 text

T. H. & R. Karakida 2020, "The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry" https://arxiv.org/abs/2006.07814

When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional-output) "NTK" concentrates around its maximal value, and the maximal value is O(L). The proof rests on a key recursive equation.

## Slide 22

### Slide 22 text

Training under Dynamical Isometry

The red line (the border of exploding gradients) is predicted by our theory.

## Slide 24

### Slide 24 text

Spectrum of NTK

The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics: e.g. if the learning rate exceeds $2/\lambda_{\max}$, the learning dynamics does not converge, and the condition number $\lambda_{\max}/\lambda_{\min}$ determines the convergence speed.
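
Gradient descent on a toy quadratic with Hessian/kernel $\Theta$ illustrates both claims: the error in the eigendirection with eigenvalue $\lambda$ contracts by $|1 - \eta\lambda|$ per step, so $\eta > 2/\lambda_{\max}$ diverges, and the slowest direction (governed by the condition number) limits the rate. The spectrum below is an illustrative assumption.

```python
import numpy as np

eigvals = np.array([0.1, 1.0, 10.0])            # toy kernel spectrum; lambda_max = 10
Theta = np.diag(eigvals)
target = np.ones(3)

def run_gd(lr, steps=200):
    """Error norm after gradient descent; error evolves as (I - lr * Theta)^t."""
    err = target.copy()
    for _ in range(steps):
        err = err - lr * (Theta @ err)
    return np.linalg.norm(err)

stable = run_gd(lr=0.19)      # lr < 2 / lambda_max = 0.2 : converges
unstable = run_gd(lr=0.21)    # lr > 2 / lambda_max       : diverges
```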

## Slide 26

### Slide 26 text

Summary

What do we gain by considering neural networks with random parameters?

1. Methods for tuning initialization and learning rates.
2. Bayesian estimation via NNGP.
3. Prediction of learning curves via NTK.

- Computation time is greatly reduced compared with training the NN by gradient descent.
- Random matrix theory (and its extension, free probability theory [Voiculescu '85]) lies behind these results.

Future Work: beyond the toy model. Using MLP-type networks as a foothold, push the boundary of models that admit theoretical guarantees while handling real data.