as follows:
1. We are given a deep neural network (DNN). It satisfies (one of) the following conditions:
   1. It is a parameterized family of transformations, each of which maps a real vector to a real vector.
   2. It is a composition of simple parameterized transformations (e.g. linear transformations and non-linear elementwise functions).
2. We are given an objective function, e.g.
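The setup above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the slides: the layers are affine maps followed by `tanh`, the objective is a squared error, and the names `affine`, `dnn`, `squared_error` and `init_params` are hypothetical.

```python
import math
import random


def affine(w, b, x):
    """One parameterized transformation: x -> W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def dnn(params, x, act=math.tanh):
    """Compose the layers; the elementwise nonlinearity follows every layer but the last."""
    h = x
    for i, (w, b) in enumerate(params):
        h = affine(w, b, h)
        if i < len(params) - 1:
            h = [act(v) for v in h]
    return h


def squared_error(params, data):
    """Assumed example objective: half the summed squared error over (x, y) pairs."""
    return 0.5 * sum(
        sum((o - t) ** 2 for o, t in zip(dnn(params, x), y))
        for x, y in data
    )


def init_params(widths, rng):
    """Gaussian weights with variance 1/n_in and zero biases (an assumed choice)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
        params.append((w, [0.0] * n_out))
    return params


rng = random.Random(0)
params = init_params([3, 4, 2], rng)            # maps R^3 -> R^4 -> R^2
out = dnn(params, [1.0, -0.5, 0.25])
loss = squared_error(params, [([1.0, -0.5, 0.25], [0.0, 1.0])])
```

Here `params` plays the role of the parameter vector, and `dnn(params, .)` is one member of the parameterized family of maps from real vectors to real vectors.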
M \to \infty, the learning of the DNN is approximated by
where
The learning dynamics of the parameters are given by:
The learning dynamics of the DNN are given by:
where (Neural Tangent Kernel)
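In the standard gradient-flow formulation, these dynamics can be written as below. The notation is assumed here rather than taken from the slides: $\theta_t$ for the parameters, $f_\theta$ for the network, $(x_i, y_i)$ for the training samples, $\ell$ for the per-sample loss, $\eta$ for the learning rate, and $M$ read as the layer width.

```latex
\begin{align}
  \frac{d\theta_t}{dt}
    &= -\eta \sum_{i} \left(\frac{\partial f_{\theta_t}(x_i)}{\partial \theta}\right)^{\!\top}
       \nabla_{f}\,\ell\bigl(f_{\theta_t}(x_i), y_i\bigr), \\
  \frac{d f_{\theta_t}(x)}{dt}
    &= -\eta \sum_{i} \Theta_t(x, x_i)\,
       \nabla_{f}\,\ell\bigl(f_{\theta_t}(x_i), y_i\bigr), \\
  \Theta_t(x, x')
    &= \frac{\partial f_{\theta_t}(x)}{\partial \theta}
       \left(\frac{\partial f_{\theta_t}(x')}{\partial \theta}\right)^{\!\top}.
\end{align}
```

The second line follows from the first by the chain rule. In the wide limit the kernel $\Theta_t$ stays close to its value at initialization, $\Theta_0$, so the function-space dynamics become linear in $f$.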
Tangent Kernel for linear-width neural networks” https://arxiv.org/abs/2005.11879
They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension), and they get:
concentrated around 1, then we can prevent exploding/vanishing gradients.
[Pennington, Schoenholz, Ganguli, AISTATS2018; Benoit Collins & TH, CIMP2022]
If we initialize the parameters as Haar orthogonal matrices and choose an appropriate activation function, then the DNN achieves dynamical isometry.
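The effect of Haar orthogonal initialization can be checked numerically in the simplest case. The sketch below assumes a deep linear network (identity activation, which is a simplification not made in the slides), samples Haar orthogonal weights by Gram-Schmidt on Gaussian vectors, and measures how far the input-output Jacobian J = W_L ... W_1 is from an isometry through the largest entry of |J^T J - I|. All function names are hypothetical.

```python
import math
import random


def haar_orthogonal(n, rng):
    """Sample an n x n Haar orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(2):  # second pass improves numerical orthogonality
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def jacobian(weights):
    """Input-output Jacobian W_L ... W_1 of a deep linear network."""
    j = weights[0]
    for w in weights[1:]:
        j = matmul(w, j)
    return j


def isometry_defect(j):
    """max |(J^T J - I)_ik|; zero exactly when all singular values of J equal 1."""
    g = matmul(transpose(j), j)
    n = len(g)
    return max(abs(g[i][k] - (1.0 if i == k else 0.0))
               for i in range(n) for k in range(n))


rng = random.Random(0)
n, depth = 8, 20

orth = [haar_orthogonal(n, rng) for _ in range(depth)]
# Gaussian baseline with entry variance 1/n, for comparison.
gauss = [[[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]
          for _ in range(n)] for _ in range(depth)]

orth_defect = isometry_defect(jacobian(orth))
gauss_defect = isometry_defect(jacobian(gauss))
print("orthogonal init defect:", orth_defect)
print("Gaussian init defect:  ", gauss_defect)
```

With orthogonal weights the product is again orthogonal, so the defect is zero up to floating-point error at any depth. With Gaussian weights the singular values of the product spread out as depth grows.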
Information of Deep Networks Achieving Dynamical Isometry” https://arxiv.org/abs/2006.07814
When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dim output) "NTK" concentrates around its maximal value, and the maximal value is O(L).
Key recursive equation:
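To see where the O(L) scale comes from, it may help to work through the linear case by hand. The derivation below is an illustration under assumptions (identity activation, orthogonal weights $W_\ell$, no biases, notation $h_\ell = W_\ell h_{\ell-1}$ with $h_0 = x$) and is not claimed to be the recursion of the cited paper. In the nonlinear case the chain rule additionally brings in the activation derivatives.

```latex
\begin{align}
  \Theta_\ell
    &:= \sum_{k \le \ell}
        \frac{\partial h_\ell}{\partial W_k}
        \left(\frac{\partial h_\ell}{\partial W_k}\right)^{\!\top},
        \qquad \Theta_0 = 0, \\
  \Theta_\ell
    &= \lVert h_{\ell-1} \rVert^2 \, I \;+\; W_\ell \, \Theta_{\ell-1} \, W_\ell^{\top}, \\
  \intertext{and, since orthogonal $W_\ell$ give $\lVert h_\ell \rVert = \lVert x \rVert$
             and $W_\ell W_\ell^{\top} = I$,}
  \Theta_\ell
    &= \ell \, \lVert x \rVert^2 \, I
    \quad\Longrightarrow\quad
  \Theta_L = L \, \lVert x \rVert^2 \, I .
\end{align}
```

In this toy case every eigenvalue of the one-sample NTK equals $L \lVert x \rVert^2$, so the spectrum sits exactly at its maximal value and that value grows linearly in the depth $L$.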