Paper introduction: On the Variance of the Fisher Information for Deep Learning

Masanari Kimura

May 17, 2022

  1. Paper introduction: On the Variance of the Fisher Information for Deep Learning
     Outline: Intro / Fisher Information Matrix / FIM for the Neural Networks / The Variance of the FIM Estimators / Effect of Neural Network Derivatives / References
     Masanari Kimura, SOKENDAI, Department of Statistical Science, Hino Laboratory, [email protected]
  2. Intro
  3. Introduction
  4. TL;DR
     ▶ Analyzes the estimators of the Fisher information matrix of DNNs, characterizing their behavior in terms of variance;
     ▶ Soen and Sun [2021].
  5. Basic notations
     Einstein summation is used throughout (e.g., $a_i b_i = \sum_i a_i b_i$).
     ▶ $I(\theta)$: Fisher information matrix (FIM);
     ▶ $\hat{I}(\theta)$: an estimator of the FIM;
     ▶ $h_l$: output of the $l$-th layer of the neural network;
     ▶ $h_L$: output of the final layer (the natural parameter);
     ▶ $n_l$: size of the $l$-th layer;
     ▶ $\theta = \{W_{l-1}\}_{l=1}^{L}$: parameters of an $L$-layer neural network;
     ▶ $z = (x, y)$.
  6. Fisher Information Matrix
  7. Fisher Information Matrix
     The Fisher information matrix (FIM) is defined as
     $$I(\theta) = \mathbb{E}_{p(z|\theta)}\left[\frac{\partial \ell}{\partial \theta}\frac{\partial \ell}{\partial \theta^T}\right]. \quad (1)$$
     Under mild regularity conditions, the equivalent expression
     $$I(\theta) = -\mathbb{E}_{p(z|\theta)}\left[\frac{\partial^2 \ell}{\partial \theta \partial \theta^T}\right] \quad (2)$$
     also holds.
  8. Estimators for FIM
     ▶ The FIM is an important concept in both the theory and applications of DNNs (e.g., natural gradient);
     ▶ The true FIM involves an expectation and cannot be computed exactly, so we must consider estimators:
     $$\hat{I}_1(\theta) = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \ell_i}{\partial \theta}\frac{\partial \ell_i}{\partial \theta^T}, \quad (3)$$
     $$\hat{I}_2(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2 \ell_i}{\partial \theta \partial \theta^T}. \quad (4)$$
     To discuss how far these estimators deviate from the true $I(\theta)$, and how quickly they converge, we want to consider their variance.
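As a sanity check, both estimators (3) and (4) can be computed for a toy model where the true FIM is known. The sketch below is an illustration, not from the paper: for $x \sim N(\theta, 1)$, the score is $x - \theta$, the per-sample Hessian is $-1$, and the true FIM is $1$.

```python
import numpy as np

# Toy model with known FIM: x ~ N(theta, 1), so I(theta) = 1.
rng = np.random.default_rng(0)
theta, N = 2.0, 100_000
x = rng.normal(theta, 1.0, size=N)

# Per-sample score: d/dtheta log p(x_i | theta) = (x_i - theta)
score = x - theta

# Estimator (3): empirical second moment of the score
I1_hat = np.mean(score ** 2)

# Estimator (4): negative mean of the per-sample Hessian, which is -1 here
I2_hat = -np.mean(np.full(N, -1.0))

print(I1_hat, I2_hat)  # both near the true value 1.0; I2_hat is exact here
```

Here $\hat{I}_2$ has zero variance only because the Hessian of this toy log-likelihood is constant; for neural networks both estimators fluctuate, which is what the following sections quantify.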
  9. FIM for the Neural Networks
  10. Feed-Forward Neural Networks
      A neural network with parameters $\theta = \{W_{l-1}\}_{l=1}^{L}$ whose output follows an exponential family can be written as
      $$p(y|x) = \exp\{t^T(y) h_L - F(h_L)\}, \quad h_L = W_{L-1}\bar{h}_{L-1}, \quad h_l = \sigma(W_{l-1}\bar{h}_{l-1}), \quad \bar{h}_l = (h_l^T, 1)^T, \quad h_0 = x, \quad (5)$$
      where $t(y)$ is the sufficient statistic of $y$, $F(h) = \log \int \exp(t^T(y) h)\,dy$ is the log-partition function, and $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function.
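A minimal sketch of a forward pass of the form of Eq. (5), with hypothetical layer sizes and $\tanh$ as the activation $\sigma$; the bias is handled by appending a constant 1 to each layer's output, as in $\bar{h}_l$:

```python
import numpy as np

def forward(x, weights, sigma=np.tanh):
    """Forward pass of Eq. (5): h_l = sigma(W_{l-1} hbar_{l-1}),
    with hbar_l = (h_l, 1) appending a constant 1 for the bias;
    the final layer is linear and returns the natural parameter h_L."""
    h = x
    for W in weights[:-1]:
        h = sigma(W @ np.append(h, 1.0))
    return weights[-1] @ np.append(h, 1.0)

# Hypothetical sizes: input dim 3, one hidden layer of size 4, output dim 2
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]
h_L = forward(rng.normal(size=3), weights)
print(h_L.shape)  # (2,)
```

Each $W_{l-1}$ has one extra column acting on the appended 1, which is why the shapes above are $(4, 4)$ and $(2, 5)$ rather than $(4, 3)$ and $(2, 4)$.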
  11. FIM for the Neural Networks
      Lemma. For a neural network represented as above,
      $$I(h_L) = \mathrm{Cov}(t(y)) = \frac{\partial \eta}{\partial h_L}, \quad (6)$$
      where $\mathrm{Cov}(\cdot)$ is the covariance with respect to $p(y|x, \theta)$, $\eta := \eta(h_L) := \partial F/\partial h_L$ is the expectation parameter, and $\partial \eta/\partial h_L$ is the Jacobian of the map $h_L \mapsto \eta$.
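As an illustration of the lemma (not from the paper), take a Bernoulli output: the natural parameter $h$ is a logit, $t(y) = y$, $F(h) = \log(1 + e^h)$, and $\eta = \mathrm{sigmoid}(h)$. Then $\mathrm{Cov}(t(y)) = \eta(1 - \eta)$ coincides with the Jacobian $\partial \eta/\partial h$:

```python
import numpy as np

# Bernoulli output: natural parameter h is a logit, t(y) = y,
# F(h) = log(1 + e^h), eta = dF/dh = sigmoid(h).
h = 0.7
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = sigmoid(h)

cov_t = eta * (1.0 - eta)        # Cov(t(y)) = Var(y) for Bernoulli(eta)
eps = 1e-6                        # central finite difference for d eta / d h
d_eta = (sigmoid(h + eps) - sigmoid(h - eps)) / (2 * eps)

print(cov_t, d_eta)  # equal up to finite-difference error: both are I(h)
```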
  12. Derivatives of the log-likelihood of the Neural Networks
      The log-likelihood $\ell(\theta) := \log p(x, y|\theta)$ is central to computing the Fisher information matrix:
      $$\frac{\partial \ell}{\partial \theta} = \left(\frac{\partial h_L}{\partial \theta}\right)^T (t(y) - \eta(h_L)) = \frac{\partial h_L^a}{\partial \theta}(t_a - \eta_a). \quad (7)$$
      From Eqs. (7) and (6),
      $$\frac{\partial^2 \ell}{\partial \theta \partial \theta^T} = (t_a - \eta_a)\frac{\partial^2 h_L^a}{\partial \theta \partial \theta^T} - \frac{\partial h_L^a}{\partial \theta}\frac{\partial \eta_a}{\partial \theta^T} \quad (8)$$
      $$= (t_a - \eta_a)\frac{\partial^2 h_L^a}{\partial \theta \partial \theta^T} - \frac{\partial h_L^a}{\partial \theta} I_{ab}(h_L) \frac{\partial h_L^b}{\partial \theta^T}. \quad (9)$$
  13. Smoothness of the Activation Functions
      Theorem. Any neural network of the form of Eq. (5) with an activation function $\sigma \in C^2(\mathbb{R})$ admits the equivalent FIM expression $I(\theta) = \mathbb{E}_p[-\partial^2 \ell/\partial \theta \partial \theta^T]$.
      Corollary. Since $\mathrm{ReLU}(z)$ is not differentiable at $z = 0$, a neural network with ReLU activations does not admit this equivalent expression of the FIM.
  14. Estimators for FIM on the Neural Networks
      $\hat{I}_1(\theta)$ and $\hat{I}_2(\theta)$ are estimators of the FIM computable at realistic cost. When $p(y|x, \theta)$ takes the form of Eq. (5), Eqs. (7) and (8) give
      $$\hat{I}_1(\theta) = \frac{\partial h_L^a}{\partial \theta} \cdot \left[\frac{1}{N}\sum_{i=1}^{N}(t_a(y_i) - \eta_a)(t_b(y_i) - \eta_b)\right] \cdot \frac{\partial h_L^b}{\partial \theta^T}, \quad (10)$$
      $$\hat{I}_2(\theta) = \left(\eta_a - \frac{1}{N}\sum_{i=1}^{N} t_a(y_i)\right)\frac{\partial^2 h_L^a}{\partial \theta \partial \theta^T} + \frac{\partial h_L^a}{\partial \theta} I_{ab}(h_L) \frac{\partial h_L^b}{\partial \theta^T}. \quad (11)$$
      Since the second term on the right-hand side of Eq. (11) is exactly the FIM, the first term is the bias term.
  15. The Variance of the FIM Estimators
  16. Variance of $\hat{I}_1(\theta)$
      Theorem.
      $$\left[\mathrm{Cov}(\hat{I}_1(\theta))\right]_{ijkl} = \frac{1}{N}\cdot\mathrm{Cov}\left(\frac{\partial \ell}{\partial \theta_i}\frac{\partial \ell}{\partial \theta_j},\ \frac{\partial \ell}{\partial \theta_k}\frac{\partial \ell}{\partial \theta_l}\right) \quad (12)$$
      $$= \frac{1}{N}\,\partial_i h_L^a(x)\,\partial_j h_L^b(x)\,\partial_k h_L^c(x)\,\partial_l h_L^d(x)\cdot\left(K_{abcd}(t) - I_{ab}(h_L)\cdot I_{cd}(h_L)\right), \quad (13)$$
      where the 4-dimensional tensor
      $$K_{abcd}(t) := \mathbb{E}\left[(t_a - \eta_a(h_L(x)))(t_b - \eta_b(h_L(x)))(t_c - \eta_c(h_L(x)))(t_d - \eta_d(h_L(x)))\right] \quad (14)$$
      is the fourth moment of $t(y)$ and $\partial_i h_L(x) := \partial h_L(x)/\partial \theta_i$.
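The $1/N$ factor in Eq. (12) can be checked empirically on the Gaussian toy model $x \sim N(0, 1)$ (an illustration, not an experiment from the paper): there the fourth-moment term gives $K - I \cdot I = 3 - 1 = 2$, so $\mathrm{Var}(\hat{I}_1)$ should be roughly $2/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 4000

def I1_hat(N):
    s = rng.normal(size=N)   # score samples for x ~ N(theta, 1) with theta = 0
    return np.mean(s ** 2)   # estimator (3)

# Sample variance of I1_hat across many trials, for two sample sizes;
# each value should be roughly 2 / N.
v = {N: np.var([I1_hat(N) for _ in range(trials)]) for N in (100, 400)}
print(v)
```

Quadrupling $N$ should cut the variance by about a factor of four, matching the $1/N$ rate in the theorem.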
  17. Variance of $\hat{I}_2(\theta)$
      Theorem.
      $$\left[\mathrm{Cov}(\hat{I}_2(\theta))\right]_{ijkl} = \frac{1}{N}\cdot\mathrm{Cov}\left(-\frac{\partial^2 \ell}{\partial \theta_i \partial \theta_j},\ -\frac{\partial^2 \ell}{\partial \theta_k \partial \theta_l}\right) \quad (15)$$
      $$= \frac{1}{N}\cdot\partial^2_{ij} h_L^\alpha(x)\,\partial^2_{kl} h_L^\beta(x)\,I_{\alpha\beta}(h_L), \quad (16)$$
      where $\partial^2_{ij} h_L(x) := \partial^2 h_L(x)/\partial \theta_i \partial \theta_j$.
  18. Upper Bounds
      Theorem.
      $$\left\|\mathrm{Cov}(\hat{I}_1(\theta))\right\|_F \le \frac{1}{N}\cdot\left\|\frac{\partial h_L}{\partial \theta}\right\|_F^4 \cdot \left\|K(t) - I(h_L)\otimes I(h_L)\right\|_F. \quad (17)$$
      Theorem.
      $$\left\|\mathrm{Cov}(\hat{I}_2(\theta))\right\|_F \le \frac{1}{N}\cdot\left\|\frac{\partial^2 h_L(x)}{\partial \theta \partial \theta^T}\right\|_F^2 \cdot \left\|I(h_L)\right\|_F. \quad (18)$$
  19. Effect of Neural Network Derivatives
  20. Effect of Neural Network Derivatives I
      The preceding theorems show that the derivatives of the DNN affect the variance of both estimators.
      Lemma.
      $$\frac{\partial \ell}{\partial W_l} = D_l \frac{\partial \ell}{\partial h_{l+1}}, \quad \frac{\partial \ell}{\partial h_l} = B_l^T (t(y) - \eta(h_L)), \quad \frac{\partial h_L^a}{\partial W_l} = D_l B_{l+1}^T e_a \bar{h}_l^T, \quad (19)$$
      where $e_a$ is the $a$-th standard basis vector, and $B_l$ and $D_l$ are defined recursively by
      $$B_L = I, \quad B_l = B_{l+1} D_l W_l^-, \quad D_{L-1} = I, \quad D_l = \mathrm{diag}(\sigma'(W_l \bar{h}_l)).$$
  21. Effect of Neural Network Derivatives II
      From the previous lemma, the FIM with respect to a hidden layer $h_l$ can be estimated as
      $$\hat{I}_1(h_l) = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial \ell_i}{\partial h_l}\frac{\partial \ell_i}{\partial h_l^T} = B_l^T\left[\frac{1}{N}\sum_{i=1}^{N}(t(y_i) - \eta(h_L))(t(y_i) - \eta(h_L))^T\right] B_l.$$
      Since $B_l$ is evaluated recursively from one layer to the next, the FIM can likewise be estimated recursively via $\hat{I}(\theta)$, a procedure analogous to backpropagation.
  22. Effect of Neural Network Derivatives III
      Theorem. If the activation function has a bounded gradient with $|\sigma'(z)| \le 1$ for all $z \in \mathbb{R}$, then
      $$\left\|\frac{\partial h_L}{\partial W_l}\right\|_F = \|B_{l+1} D_l\| \cdot \|\bar{h}_l\|_2 \le \prod_{i=l+1}^{L-1}\|W_i^-\|_F \cdot \|\bar{h}_l\|_2. \quad (20)$$
      Hence, regularizing the norms of the weight matrices is useful for reducing the variance of $\hat{I}_1(\theta)$.
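Bound (20) can be spot-checked numerically with a finite-difference Jacobian on a tiny two-layer $\tanh$ network (hypothetical sizes, not from the paper). With $L = 2$ the product reduces to $\|W_1^-\|_F$, where $W_1^-$ is taken here to be $W_1$ without its bias column (an assumption about the notation):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=3)
W0 = rng.normal(size=(4, 4))   # maps hbar_0 (dim 4) to h_1 (dim 4)
W1 = rng.normal(size=(2, 5))   # maps hbar_1 (dim 5) to h_L (dim 2)

def h_L(W0_flat):
    """Two-layer network of Eq. (5), viewed as a function of W_0's entries."""
    h1 = np.tanh(W0_flat.reshape(4, 4) @ np.append(x, 1.0))
    return W1 @ np.append(h1, 1.0)

# Central finite-difference Jacobian of h_L w.r.t. the entries of W_0
eps, base = 1e-6, W0.ravel()
J = np.stack([(h_L(base + eps * e) - h_L(base - eps * e)) / (2 * eps)
              for e in np.eye(base.size)], axis=1)

lhs = np.linalg.norm(J)                               # ||dh_L / dW_0||_F
rhs = np.linalg.norm(W1[:, :-1]) * np.linalg.norm(np.append(x, 1.0))
print(lhs <= rhs)  # True: bound (20) holds, since |tanh'(z)| <= 1
```

Shrinking the weight norms shrinks the right-hand side, which is the mechanism behind the regularization claim above.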
  23. Effect of Neural Network Derivatives IV
      If the activation function has a bounded gradient with $|\sigma'(z)| \le 1$ for all $z \in \mathbb{R}$, then
      $$\left\|\frac{\partial h_L}{\partial W_l}\right\|_{2\sigma} \le \prod_{i=l+1}^{L-1} s_{\max}(W_i^-) \cdot \|\bar{h}_l\|_2, \quad (21)$$
      where $s_{\max}(\cdot)$ is the largest singular value and $\|T\|_{2\sigma}$ is the spectral norm of a 3-dimensional tensor $T$:
      $$\|T\|_{2\sigma} = \max\{\langle T, \alpha \otimes \beta \otimes \gamma \rangle : \|\alpha\|_2 = \|\beta\|_2 = \|\gamma\|_2 = 1\}. \quad (22)$$
      Hence, regularizing the largest singular values of the weight matrices contributes to improving the estimation accuracy of the FIM.
  24. Conclusion
      ▶ Improving the estimation accuracy of the FIM improves the performance of algorithms that depend on it (e.g., natural gradient descent), so evaluating the accuracy of FIM estimators is important;
      ▶ The paper analyzes the variance behavior of the two FIM estimators and derives upper bounds for each;
      ▶ It shows that appropriate regularization of the weight parameters contributes to better FIM estimation.
  25. References
      Alexander Soen and Ke Sun. On the variance of the Fisher information for deep learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 5708–5719. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/2d290e496d16c9dcaa9b4ded5cac10cc-Paper.pdf.