
Paper introduction: On the Variance of the Fisher Information for Deep Learning

Masanari Kimura

May 17, 2022



Transcript

  1. Paper introduction: On the Variance of the Fisher Information for Deep Learning
     Masanari Kimura (SOKENDAI, Department of Statistical Science, Hino Laboratory), [email protected]
  2. Intro
  3. Introduction
  4. TL;DR
     ▶ Analyze the behavior of estimators of the Fisher information matrix of DNNs in terms of their variance;
     ▶ Soen and Sun [2021]
  5. Basic notations
     Einstein summation is used throughout (e.g., a_i b_i = Σ_i a_i b_i).
     ▶ I(θ): Fisher information matrix (FIM);
     ▶ Î(θ): estimator of the FIM;
     ▶ h_l: output of the l-th layer of the neural network;
     ▶ h_L: output of the final layer of the neural network (the natural parameter);
     ▶ n_l: size of the l-th layer of the neural network;
     ▶ θ = {W_{l−1}}_{l=1}^{L}: parameters of an L-layer neural network;
     ▶ z = (x, y).
  6. Fisher Information Matrix
  7. Fisher Information Matrix
     The Fisher information matrix (FIM) is defined as
       I(θ) = E_{p(z|θ)}[ (∂ℓ/∂θ)(∂ℓ/∂θ^⊤) ].   (1)
     Under mild regularity conditions, an equivalent expression holds:
       I(θ) = −E_{p(z|θ)}[ ∂²ℓ/∂θ∂θ^⊤ ].   (2)
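As a quick illustration of the two equivalent forms (an example added here, not from the slides): for a univariate Gaussian z ∼ N(μ, σ²) with θ = μ, ℓ = −(z − μ)²/(2σ²) + const, so ∂ℓ/∂μ = (z − μ)/σ² and ∂²ℓ/∂μ² = −1/σ². Eq. (1) gives E[(z − μ)²]/σ⁴ = 1/σ², and Eq. (2) gives −E[−1/σ²] = 1/σ², so both forms yield I(μ) = 1/σ².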
  8. Estimators for FIM
     ▶ The FIM is an important concept for DNNs both in theory and in applications (e.g., the natural gradient);
     ▶ The true FIM involves an expectation and cannot be computed exactly, so we need to consider estimators:
       Î₁(θ) = (1/N) Σ_{i=1}^{N} (∂ℓ_i/∂θ)(∂ℓ_i/∂θ^⊤),   (3)
       Î₂(θ) = −(1/N) Σ_{i=1}^{N} ∂²ℓ_i/∂θ∂θ^⊤.   (4)
     To discuss how far these deviate from the true I(θ), and how quickly they converge to it, we want to study their variance.
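To make Eqs. (3) and (4) concrete, here is a minimal sketch (not from the slides) assuming a logistic-regression model, i.e., a Bernoulli output whose natural parameter is w·x; both estimators are simple sample averages:

import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 10_000
w = rng.normal(size=d)                    # parameters theta
X = rng.normal(size=(N, d))               # inputs
p = 1.0 / (1.0 + np.exp(-X @ w))          # P(y = 1 | x, w)
y = rng.binomial(1, p)                    # sample z = (x, y) from the model

# I1_hat: average of outer products of per-sample score vectors, Eq. (3).
scores = (y - p)[:, None] * X             # d log p / dw for each sample
I1_hat = scores.T @ scores / N

# I2_hat: minus the average per-sample Hessian, Eq. (4).
# For logistic regression, -d^2 log p / dw dw^T = p (1 - p) x x^T.
I2_hat = (X * (p * (1 - p))[:, None]).T @ X / N

print(np.round(I1_hat, 3))
print(np.round(I2_hat, 3))                # both converge to I(w) as N grows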
  9. FIM for the Neural Networks
  10. Feed-Forward Neural Networks
      A neural network with parameters θ = {W_{l−1}}_{l=1}^{L} whose output follows an exponential family can be written as
        p(y|x) = exp{ t^⊤(y) h_L − F(h_L) },
        h_L = W_{L−1} h̄_{L−1},
        h_l = σ(W_{l−1} h̄_{l−1}),
        h̄_l = (h_l^⊤, 1)^⊤,  h_0 = x,   (5)
      where t(y) is the sufficient statistic of y, F(h) = log ∫ exp(t^⊤(y) h) dy is the log-partition function, and σ : R → R is a nonlinear activation function.
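As a concrete instance of Eq. (5), here is a minimal sketch (not from the slides) assuming a categorical (softmax) output, so t(y) is the one-hot encoding of y, F(h) = log Σ_j exp(h_j), and η = softmax(h_L), with tanh activations and arbitrary layer sizes:

import numpy as np

def forward(x, Ws, sigma=np.tanh):
    # Return the natural parameter h_L for weights Ws = [W_0, ..., W_{L-1}].
    h = x
    for W in Ws[:-1]:
        h = sigma(W @ np.append(h, 1.0))   # h_{l+1} = sigma(W_l h_bar_l), h_bar_l = (h_l, 1)
    return Ws[-1] @ np.append(h, 1.0)      # h_L = W_{L-1} h_bar_{L-1}, no activation at the top

def log_likelihood(x, y, Ws):
    # log p(y|x) = t(y)^T h_L - F(h_L); t(y) is one-hot, F(h) = log sum_j exp(h_j).
    hL = forward(x, Ws)
    return hL[y] - np.log(np.sum(np.exp(hL)))

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                       # n_0, ..., n_L (hypothetical layer sizes)
Ws = [rng.normal(scale=0.5, size=(m, n + 1)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])
print(forward(x, Ws), log_likelihood(x, 2, Ws))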
  11. FIM for the Neural Networks
      Lemma
      For a neural network expressed as above, the following holds:
        I(h_L) = Cov(t(y)) = ∂η/∂h_L,   (6)
      where Cov(·) is the covariance matrix with respect to p(y|x, θ), η := η(h_L) := ∂F/∂h_L is the expectation parameter, and ∂η/∂h_L is the Jacobian of the map h_L → η.
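For the categorical output assumed in the sketch above, the lemma can be checked numerically (again a sketch, not from the slides): with η = softmax(h_L), both Cov(t(y)) and the Jacobian ∂η/∂h_L equal diag(η) − ηη^⊤.

import numpy as np

hL = np.array([0.2, -1.0, 0.7])
eta = np.exp(hL) / np.sum(np.exp(hL))       # expectation parameter eta = dF/dh_L
cov_t = np.diag(eta) - np.outer(eta, eta)   # Cov(t(y)) for one-hot t(y)

eps = 1e-6                                  # finite-difference Jacobian of h_L -> eta
jac = np.zeros((3, 3))
for j in range(3):
    h = hL.copy(); h[j] += eps
    jac[:, j] = (np.exp(h) / np.sum(np.exp(h)) - eta) / eps

print(np.max(np.abs(cov_t - jac)))          # tiny: both sides of Eq. (6) agree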
  12. Derivatives of the log-likelihood of the Neural Networks
      The log-likelihood ℓ(θ) := log p(x, y|θ) is essential in computing the Fisher information matrix:
        ∂ℓ/∂θ = (∂h_L/∂θ)^⊤ (t(y) − η(h_L)) = (∂h^a_L/∂θ)(t_a − η_a).   (7)
      From Eqs. (7) and (6),
        ∂²ℓ/∂θ∂θ^⊤ = (t_a − η_a) ∂²h^a_L/∂θ∂θ^⊤ − (∂h^a_L/∂θ)(∂η_a/∂θ^⊤)   (8)
                   = (t_a − η_a) ∂²h^a_L/∂θ∂θ^⊤ − (∂h^a_L/∂θ) I_{ab}(h_L) (∂h^b_L/∂θ^⊤).   (9)
  13. Smoothness of the Activation Functions
      Theorem
      Any neural network of the form in Eq. (5) whose activation function satisfies σ ∈ C²(R) admits the equivalent FIM expression I(θ) = E_p[−∂²ℓ/∂θ∂θ^⊤].
      Corollary
      Since ReLU(z) is not differentiable at z = 0, a neural network with ReLU activations does not admit this equivalent FIM expression.
  14. Estimators for FIM on the Neural Networks
      Î₁(θ) and Î₂(θ) are estimators of the FIM with realistic computational cost. When p(y|x, θ) has the form of Eq. (5), Eqs. (7) and (8) give
        Î₁(θ) = (∂h^a_L/∂θ) · [ (1/N) Σ_{i=1}^{N} (t_a(y_i) − η_a)(t_b(y_i) − η_b) ] · (∂h^b_L/∂θ^⊤),   (10)
        Î₂(θ) = ( η_a − (1/N) Σ_{i=1}^{N} t_a(y_i) ) ∂²h^a_L/∂θ∂θ^⊤ + (∂h^a_L/∂θ) I_{ab}(h_L) (∂h^b_L/∂θ^⊤).   (11)
      Since the second term on the right-hand side of Eq. (11) is exactly the FIM, the first term is the error term.
  15. The Variance of the FIM Estimators
  16. Variance of Î₁(θ)
      Theorem
        [Cov(Î₁(θ))]_{ijkl} = (1/N) · Cov( (∂ℓ/∂θ_i)(∂ℓ/∂θ_j), (∂ℓ/∂θ_k)(∂ℓ/∂θ_l) )   (12)
                            = (1/N) · ∂_i h^a_L(x) ∂_j h^b_L(x) ∂_k h^c_L(x) ∂_l h^d_L(x) · (K_{abcd}(t) − I_{ab}(h_L) · I_{cd}(h_L)),   (13)
      where the 4th-order tensor
        K_{abcd}(t) := E[(t_a − η_a(h_L(x)))(t_b − η_b(h_L(x)))(t_c − η_c(h_L(x)))(t_d − η_d(h_L(x)))]   (14)
      is the 4th (central) moment of t(y) and ∂_i h_L(x) := ∂h_L(x)/∂θ_i.
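The 1/N factor can be observed empirically (a sketch, not from the slides, reusing the hypothetical logistic-regression model from the earlier example): the variance of an entry of Î₁(θ) over repeated draws roughly halves when N doubles.

import numpy as np

rng = np.random.default_rng(1)
d, w = 3, np.array([0.5, -1.0, 2.0])       # hypothetical parameter vector

def I1_hat(N):
    # One draw of the estimator of Eq. (3) for the logistic model.
    X = rng.normal(size=(N, d))
    p = 1.0 / (1.0 + np.exp(-X @ w))
    y = rng.binomial(1, p)
    g = (y - p)[:, None] * X               # per-sample score vectors
    return g.T @ g / N

for N in (100, 200, 400):
    vals = [I1_hat(N)[0, 0] for _ in range(2000)]
    print(N, np.var(vals))                 # variance roughly halves as N doubles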
  17. Variance of Î₂(θ)
      Theorem
        [Cov(Î₂(θ))]_{ijkl} = (1/N) · Cov( −∂²ℓ/∂θ_i∂θ_j, −∂²ℓ/∂θ_k∂θ_l )   (15)
                            = (1/N) · ∂²_{ij} h^α_L(x) ∂²_{kl} h^β_L(x) I_{αβ}(h_L),   (16)
      where ∂²_{ij} h_L(x) := ∂²h_L(x)/∂θ_i∂θ_j.
  18. Upper Bounds
      Theorem
        ‖Cov(Î₁(θ))‖_F ≤ (1/N) · ‖∂h_L/∂θ‖⁴_F · ‖K(t) − I(h_L) ⊗ I(h_L)‖_F.   (17)
      Theorem
        ‖Cov(Î₂(θ))‖_F ≤ (1/N) · ‖∂²h_L(x)/∂θ∂θ^⊤‖²_F · ‖I(h_L)‖_F.   (18)
  19. Effect of Neural Network Derivatives
  20. Effect of Neural Network Derivatives I
      The preceding theorems show that the derivatives of the DNN affect the variance of both estimators.
      Lemma
        ∂ℓ/∂W_l = D_l (∂ℓ/∂h_{l+1}) h̄_l^⊤,  ∂ℓ/∂h_l = B_l^⊤ (t(y) − η(h_L)),  ∂h^a_L/∂W_l = D_l B_{l+1}^⊤ e_a h̄_l^⊤,   (19)
      where e_a is the a-th standard basis vector, and B_l and D_l are defined recursively by
        B_L = I,  B_l = B_{l+1} D_l W_l^−,  D_{L−1} = I,  D_l = diag(σ′(W_l h̄_l)).
  21. Effect of Neural Network Derivatives II
      From the previous lemma, the FIM with respect to a hidden layer h_l can be estimated as
        Î₁(h_l) = (1/N) Σ_{i=1}^{N} (∂ℓ_i/∂h_l)(∂ℓ_i/∂h_l^⊤)
                = B_l^⊤ [ (1/N) Σ_{i=1}^{N} (t(y_i) − η(h_L))(t(y_i) − η(h_L))^⊤ ] B_l.
      Since B_l is evaluated recursively, layer by layer from the output, the FIM can likewise be estimated recursively based on Î(θ). This is analogous to the backpropagation procedure (see the sketch below).
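Here is a minimal sketch of that recursion (not from the slides; it reuses the hypothetical softmax/tanh network from the earlier example, so D_{L−1} = I at the linear output layer and σ′(z) = 1 − tanh²(z) elsewhere):

import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                                  # n_0, ..., n_L (L = 3 weight layers)
Ws = [rng.normal(scale=0.5, size=(m, n + 1)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])

# Forward pass, keeping D_l = diag(sigma'(W_l h_bar_l)); D_{L-1} = I (linear top layer).
Ds, h = [], x
for l, W in enumerate(Ws):
    z = W @ np.append(h, 1.0)
    h = z if l == len(Ws) - 1 else np.tanh(z)
    Ds.append(np.eye(len(z)) if l == len(Ws) - 1 else np.diag(1.0 - h ** 2))
hL = h

# Backward recursion of Eq. (19): B_L = I, B_l = B_{l+1} D_l W_l^-.
Bs = [None] * len(Ws) + [np.eye(sizes[-1])]
for l in range(len(Ws) - 1, -1, -1):
    Bs[l] = Bs[l + 1] @ Ds[l] @ Ws[l][:, :-1]         # W_l^- drops the bias column

# Layer-wise estimate I1_hat(h_l) = B_l^T S B_l, with S the empirical second
# moment of t(y) - eta over N labels sampled from the model p(y|x, theta).
eta = np.exp(hL) / np.sum(np.exp(hL))
N = 1000
T = np.eye(len(eta))[rng.choice(len(eta), size=N, p=eta)]   # one-hot t(y_i)
S = (T - eta).T @ (T - eta) / N
l = 1
I1_hl = Bs[l].T @ S @ Bs[l]
print(I1_hl.shape)                                    # (n_l, n_l) = (8, 8)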
  22. Effect of Neural Network Derivatives III
      Theorem
      If the activation function has bounded gradient with |σ′(z)| ≤ 1 for all z ∈ R, then
        ‖∂h_L/∂W_l‖_F = ‖B_{l+1} D_l‖_F · ‖h̄_l‖₂ ≤ Π_{i=l+1}^{L−1} ‖W_i^−‖_F · ‖h̄_l‖₂.   (20)
      Hence, regularizing the norms of the weight parameters is useful for reducing the variance of Î₁(θ).
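A small numeric check of the bound in Eq. (20) (a sketch, not from the slides): for a tiny tanh network, |σ′(z)| = 1 − tanh²(z) ≤ 1, ∂h_L/∂W_1 is assembled from the lemma of Eq. (19), and its Frobenius norm is compared against the product bound.

import numpy as np

rng = np.random.default_rng(0)
W0, W1, W2 = (rng.normal(scale=0.5, size=s) for s in [(4, 4), (5, 5), (2, 6)])
x = rng.normal(size=3)

h1 = np.tanh(W0 @ np.append(x, 1.0)); h1b = np.append(h1, 1.0)   # h_1, h_bar_1
h2 = np.tanh(W1 @ h1b)                                           # h_2
D1 = np.diag(1.0 - h2 ** 2)                                      # sigma'(W_1 h_bar_1)
B2 = W2[:, :-1]                                                  # B_2 = B_3 D_2 W_2^- with B_3 = D_2 = I

# dh_L/dW_1 as a 3-tensor via Eq. (19): slice a equals D_1 B_2^T e_a h_bar_1^T.
grad = np.stack([np.outer(D1 @ B2[a], h1b) for a in range(B2.shape[0])])
lhs = np.sqrt(np.sum(grad ** 2))                                 # ||dh_L/dW_1||_F
rhs = np.linalg.norm(B2, "fro") * np.linalg.norm(h1b)            # bound of Eq. (20)
print(lhs, "<=", rhs)                                            # the bound holds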
  23. Effect of Neural Network Derivatives IV
      If the activation function has bounded gradient with |σ′(z)| ≤ 1 for all z ∈ R, then
        ‖∂h_L/∂W_l‖_{2σ} ≤ Π_{i=l+1}^{L−1} s_max(W_i^−) · ‖h̄_l‖₂,   (21)
      where s_max(·) is the largest singular value and ‖T‖_{2σ} is the spectral norm of a 3rd-order tensor T:
        ‖T‖_{2σ} = max{ ⟨T, α ⊗ β ⊗ γ⟩ : ‖α‖₂ = ‖β‖₂ = ‖γ‖₂ = 1 }.   (22)
      Hence, regularizing the largest singular values of the weight parameters contributes to improving the accuracy of FIM estimation.
  24. Conclusion
      ▶ Better FIM estimation leads to better performance of algorithms that depend on the FIM (e.g., natural gradient descent), so assessing the accuracy of FIM estimators is important;
      ▶ The variance behavior of the two FIM estimators is analyzed and upper bounds are derived;
      ▶ Appropriate regularization of the weight parameters is shown to contribute to better FIM estimation.
  25. References I
      Alexander Soen and Ke Sun. On the variance of the Fisher information for deep learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 5708–5719. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/2d290e496d16c9dcaa9b4ded5cac10cc-Paper.pdf.