Slide 1

Paper introduction: On the Variance of the Fisher Information for Deep Learning
Masanari Kimura
SOKENDAI, Department of Statistical Science, Hino Laboratory
mkimura@ism.ac.jp

Slide 2

Intro

Slide 3

Introduction

Slide 4

TL;DR
▶ Analyzes the estimators of the Fisher information matrix of DNNs and characterizes their behavior in terms of variance;
▶ Soen and Sun [2021].

Slide 5

Basic notations
Einstein summation is used throughout (e.g., $a_i b_i = \sum_i a_i b_i$).
▶ $I(\theta)$: Fisher information matrix (FIM);
▶ $\hat{I}(\theta)$: estimator of the FIM;
▶ $h_l$: output of layer $l$ of the neural network;
▶ $h_L$: output of the final layer of the neural network (the natural parameter);
▶ $n_l$: size of layer $l$ of the neural network;
▶ $\theta = \{W_{l-1}\}_{l=1}^{L}$: parameters of an $L$-layer neural network;
▶ $z = (x, y)$.

Slide 6

Fisher Information Matrix

Slide 7

Fisher Information Matrix
The Fisher information matrix (FIM) is defined as
I(\theta) = \mathbb{E}_{p(z|\theta)}\left[ \frac{\partial \ell}{\partial \theta} \frac{\partial \ell}{\partial \theta^\top} \right].  (1)
Under mild regularity conditions, the following equivalent expression also holds:
I(\theta) = -\mathbb{E}_{p(z|\theta)}\left[ \frac{\partial^2 \ell}{\partial \theta \partial \theta^\top} \right].  (2)
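
As a sanity check (my own illustration, not from the slides), the two expressions (1) and (2) can be compared by Monte Carlo for a one-parameter Bernoulli model, whose exact FIM is $1/(\theta(1-\theta))$. A minimal sketch assuming only NumPy; the parameter value and sample size are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                     # Bernoulli parameter
N = 200_000                     # Monte Carlo sample size
y = rng.binomial(1, theta, size=N).astype(float)

# log-likelihood: l(theta) = y*log(theta) + (1 - y)*log(1 - theta)
score = y / theta - (1 - y) / (1 - theta)           # dl/dtheta
hess = -y / theta**2 - (1 - y) / (1 - theta)**2     # d^2 l / dtheta^2

fim_score = np.mean(score**2)    # Eq. (1): E[(dl/dtheta)^2]
fim_hess = -np.mean(hess)        # Eq. (2): -E[d^2 l / dtheta^2]
fim_exact = 1.0 / (theta * (1 - theta))
print(fim_score, fim_hess, fim_exact)   # all approximately 4.76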

Slide 8

Estimators for FIM
▶ The FIM is an important concept for DNNs both in theory and in applications (e.g., natural gradient);
▶ The true FIM involves an expectation and cannot be computed exactly, so we need to consider estimators:
\hat{I}_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_i}{\partial \theta} \frac{\partial \ell_i}{\partial \theta^\top},  (3)
\hat{I}_2(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \ell_i}{\partial \theta \partial \theta^\top}.  (4)
To discuss how far these estimators are from the true $I(\theta)$, and how quickly they converge, we want to study their variance (see the sketch below).
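
The following sketch evaluates $\hat{I}_1$ and $\hat{I}_2$ from Eqs. (3) and (4) over repeated trials on the same Bernoulli toy model (my own illustration, not the paper's setting). Both are centered on the true FIM here, but their fluctuations differ, which is what the variance analysis later quantifies:

import numpy as np

rng = np.random.default_rng(1)
theta, N, trials = 0.3, 100, 2000
I_exact = 1.0 / (theta * (1 - theta))

I1_vals, I2_vals = [], []
for _ in range(trials):
    y = rng.binomial(1, theta, size=N).astype(float)
    grad = y / theta - (1 - y) / (1 - theta)            # dl_i/dtheta
    hess = -y / theta**2 - (1 - y) / (1 - theta)**2     # d^2 l_i / dtheta^2
    I1_vals.append(np.mean(grad**2))    # Eq. (3)
    I2_vals.append(-np.mean(hess))      # Eq. (4)

I1_vals, I2_vals = np.array(I1_vals), np.array(I2_vals)
print("exact FIM:", I_exact)
print("I1 estimator: mean %.3f, variance %.4f" % (I1_vals.mean(), I1_vals.var()))
print("I2 estimator: mean %.3f, variance %.4f" % (I2_vals.mean(), I2_vals.var()))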

Slide 9

FIM for the Neural Networks

Slide 10

Feed-Forward Neural Networks
A neural network with parameters $\theta = \{W_{l-1}\}_{l=1}^{L}$ whose output is an exponential family can be written as
p(y|x) = \exp\{ t^\top(y) h_L - F(h_L) \},
h_L = W_{L-1} \bar{h}_{L-1},  h_l = \sigma(W_{l-1} \bar{h}_{l-1}),  \bar{h}_l = (h_l^\top, 1)^\top,  h_0 = x.  (5)
Here $t(y)$ is the sufficient statistic of $y$, $F(h) = \log \int \exp(t^\top(y) h)\, dy$ is the log-partition function, and $\sigma : \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function.
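
As a concrete instance of Eq. (5) (my own illustration, not from the slides): for a categorical output, $t(y)$ is the one-hot vector and $F$ is the log-sum-exp. The sketch below builds a two-layer tanh network whose final layer outputs the natural parameter $h_L$; the layer sizes are arbitrary choices, and later sketches reuse this toy setting:

import numpy as np

def one_hot(y, k):
    t = np.zeros(k)
    t[y] = 1.0
    return t                           # sufficient statistic t(y) of a categorical output

def forward(x, W0, W1):
    """Eq. (5) with L = 2, sigma = tanh, categorical output."""
    h0_bar = np.append(x, 1.0)         # \bar h_0 = (h_0^T, 1)^T with h_0 = x
    h1 = np.tanh(W0 @ h0_bar)          # h_1 = sigma(W_0 \bar h_0)
    h1_bar = np.append(h1, 1.0)
    hL = W1 @ h1_bar                   # natural parameter h_L = W_1 \bar h_1
    return h1, hL

def log_partition(hL):
    """F(h_L) = log sum_y exp(t(y)^T h_L) for the categorical case."""
    m = hL.max()
    return m + np.log(np.sum(np.exp(hL - m)))

rng = np.random.default_rng(2)
n0, n1, k = 3, 4, 5                    # input dim, hidden dim, number of classes
W0 = rng.normal(size=(n1, n0 + 1)) / np.sqrt(n0)
W1 = rng.normal(size=(k, n1 + 1)) / np.sqrt(n1)

x, y = rng.normal(size=n0), 2
_, hL = forward(x, W0, W1)
log_p = one_hot(y, k) @ hL - log_partition(hL)   # log p(y|x) = t(y)^T h_L - F(h_L)
print("h_L =", hL)
print("log p(y|x) =", log_p)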

Slide 11

FIM for the Neural Networks
Lemma
For a neural network of the form above, the following holds:
I(h_L) = \mathrm{Cov}(t(y)) = \frac{\partial \eta}{\partial h_L}.  (6)
Here $\mathrm{Cov}(\cdot)$ is the covariance matrix with respect to $p(y|x, \theta)$, $\eta := \eta(h_L) := \partial F / \partial h_L$ is the expectation parameter, and $\partial \eta / \partial h_L$ is the Jacobian of the map $h_L \mapsto \eta$.
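
For the categorical case assumed in the earlier sketch, Eq. (6) can be checked numerically: $\eta = \mathrm{softmax}(h_L)$, $\mathrm{Cov}(t(y)) = \mathrm{diag}(\eta) - \eta\eta^\top$, and the Jacobian $\partial\eta/\partial h_L$ is approximated by finite differences. A minimal sketch:

import numpy as np

def eta(hL):
    """Expectation parameter eta = dF/dh_L = softmax(h_L) in the categorical case."""
    e = np.exp(hL - hL.max())
    return e / e.sum()

rng = np.random.default_rng(3)
hL = rng.normal(size=4)
p = eta(hL)

# Cov(t(y)) under p(y|h_L): t(y) is one-hot, so Cov = diag(p) - p p^T.
cov_t = np.diag(p) - np.outer(p, p)

# Jacobian d eta / d h_L by central finite differences.
eps = 1e-6
jac = np.zeros((4, 4))
for j in range(4):
    d = np.zeros(4)
    d[j] = eps
    jac[:, j] = (eta(hL + d) - eta(hL - d)) / (2 * eps)

print(np.max(np.abs(cov_t - jac)))   # close to zero: the two expressions of I(h_L) agree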

Slide 12

Derivatives of the log-likelihood of the Neural Networks
The log-likelihood $\ell(\theta) := \log p(x, y|\theta)$ is essential for computing the Fisher information matrix:
\frac{\partial \ell}{\partial \theta} = \left( \frac{\partial h_L}{\partial \theta} \right)^\top (t(y) - \eta(h_L)) = \frac{\partial h_L^a}{\partial \theta} (t_a - \eta_a).  (7)
From Eqs. (7) and (6),
\frac{\partial^2 \ell}{\partial \theta \partial \theta^\top} = (t_a - \eta_a) \frac{\partial^2 h_L^a}{\partial \theta \partial \theta^\top} - \frac{\partial h_L^a}{\partial \theta} \frac{\partial \eta_a}{\partial \theta^\top}  (8)
= (t_a - \eta_a) \frac{\partial^2 h_L^a}{\partial \theta \partial \theta^\top} - \frac{\partial h_L^a}{\partial \theta} I_{ab}(h_L) \frac{\partial h_L^b}{\partial \theta^\top}.  (9)
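
Eq. (7) is easy to verify for the last-layer weights, where $\partial h_L^a / \partial (W_{L-1})_{ij} = \delta_{ai} (\bar{h}_{L-1})_j$ and the formula reduces to $(t(y) - \eta)\,\bar{h}_{L-1}^\top$. A sketch under the same categorical-output toy setting (my own choice), compared against a finite-difference gradient:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loglik(W1, h1_bar, y):
    """log p(y|x) = h_L[y] - logsumexp(h_L) with h_L = W_1 \bar h_1."""
    hL = W1 @ h1_bar
    m = hL.max()
    return hL[y] - (m + np.log(np.sum(np.exp(hL - m))))

rng = np.random.default_rng(4)
k, n1 = 5, 4
W1 = rng.normal(size=(k, n1 + 1))
h1_bar = np.append(np.tanh(rng.normal(size=n1)), 1.0)   # \bar h_1 from some hidden state
y = 2
t = np.zeros(k)
t[y] = 1.0

# Eq. (7) specialised to the last layer: dh_L^a/dW_1[i,j] = delta_{ai} * h1_bar[j],
# so dl/dW_1 = (t(y) - eta(h_L)) h1_bar^T.
eta = softmax(W1 @ h1_bar)
grad_eq7 = np.outer(t - eta, h1_bar)

# Finite-difference check of the same gradient.
eps = 1e-6
grad_fd = np.zeros_like(W1)
for i in range(k):
    for j in range(n1 + 1):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_fd[i, j] = (loglik(Wp, h1_bar, y) - loglik(Wm, h1_bar, y)) / (2 * eps)

print(np.max(np.abs(grad_eq7 - grad_fd)))   # close to zero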

Slide 13

Smoothness of the Activation Functions
Theorem
Any neural network of the form of Eq. (5) whose activation function satisfies $\sigma \in C^2(\mathbb{R})$ admits the equivalent expression of the FIM, $I(\theta) = \mathbb{E}_p[-\partial^2 \ell / \partial \theta \partial \theta^\top]$.
Corollary
Since $\mathrm{ReLU}(z)$ is not differentiable at $z = 0$, a neural network with ReLU activations does not admit this equivalent expression of the FIM.

Slide 14

Estimators for FIM on the Neural Networks
$\hat{I}_1(\theta)$ and $\hat{I}_2(\theta)$ are estimators of the FIM with realistic computational cost. When $p(y|x, \theta)$ has the form of Eq. (5), Eqs. (7) and (8) give
\hat{I}_1(\theta) = \frac{\partial h_L^a}{\partial \theta} \cdot \left[ \frac{1}{N} \sum_{i=1}^{N} (t_a(y_i) - \eta_a)(t_b(y_i) - \eta_b) \right] \cdot \frac{\partial h_L^b}{\partial \theta^\top},  (10)
\hat{I}_2(\theta) = \left[ \eta_a - \frac{1}{N} \sum_{i=1}^{N} t_a(y_i) \right] \frac{\partial^2 h_L^a}{\partial \theta \partial \theta^\top} + \frac{\partial h_L^a}{\partial \theta} I_{ab}(h_L) \frac{\partial h_L^b}{\partial \theta^\top}.  (11)
Since the second term on the right-hand side of Eq. (11) is the FIM itself, the first term is the bias term.
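
The sketch below (my own toy illustration under the categorical-output assumption) evaluates Eq. (10) restricted to the last-layer weights, where the Jacobian $\partial h_L / \partial \theta$ has a simple closed form, and shows $\hat{I}_1$ approaching the exact FIM $(\partial h_L/\partial\theta)^\top I(h_L) (\partial h_L/\partial\theta)$ as $N$ grows:

import numpy as np

rng = np.random.default_rng(5)
k, n1 = 3, 2
W1 = rng.normal(size=(k, n1 + 1))
h1_bar = np.append(np.tanh(rng.normal(size=n1)), 1.0)

hL = W1 @ h1_bar
eta = np.exp(hL - hL.max())
eta /= eta.sum()
I_hL = np.diag(eta) - np.outer(eta, eta)          # I(h_L) = Cov(t(y)), Eq. (6)

# Jacobian of h_L w.r.t. theta = vec(W_1): dh_L^a / dW_1[i, j] = delta_{ai} * h1_bar[j].
P = k * (n1 + 1)
J = np.zeros((k, P))
for a in range(k):
    J[a, a * (n1 + 1):(a + 1) * (n1 + 1)] = h1_bar

I_exact = J.T @ I_hL @ J                          # second term of Eq. (11) = true FIM

for N in [10, 100, 10_000]:
    ys = rng.choice(k, size=N, p=eta)
    T = np.eye(k)[ys]                             # one-hot t(y_i), shape (N, k)
    S = (T - eta).T @ (T - eta) / N               # middle factor of Eq. (10)
    I1 = J.T @ S @ J
    print(N, np.linalg.norm(I1 - I_exact))        # error shrinks roughly like 1/sqrt(N)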

Slide 15

The Variance of the FIM Estimators

Slide 16

Variance of \hat{I}_1(\theta)
Theorem
\left[ \mathrm{Cov}(\hat{I}_1(\theta)) \right]_{ijkl} = \frac{1}{N} \cdot \mathrm{Cov}\left( \frac{\partial \ell}{\partial \theta_i} \frac{\partial \ell}{\partial \theta_j}, \frac{\partial \ell}{\partial \theta_k} \frac{\partial \ell}{\partial \theta_l} \right)  (12)
= \frac{1}{N} \partial_i h_L^a(x) \partial_j h_L^b(x) \partial_k h_L^c(x) \partial_l h_L^d(x) \cdot \left( K_{abcd}(t) - I_{ab}(h_L) \cdot I_{cd}(h_L) \right),  (13)
where the 4th-order tensor
K_{abcd}(t) := \mathbb{E}\left[ (t_a - \eta_a(h_L(x)))(t_b - \eta_b(h_L(x)))(t_c - \eta_c(h_L(x)))(t_d - \eta_d(h_L(x))) \right]  (14)
is the 4th moment of $t(y)$, and $\partial_i h_L(x) := \partial h_L(x) / \partial \theta_i$.
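
A quick way to see the $1/N$ factor in Eq. (12) (my own Monte Carlo illustration, not the paper's experiment) is the scalar Bernoulli case, where the theorem reduces to $\mathrm{Var}(\hat{I}_1) = \mathrm{Var}((\partial\ell/\partial\theta)^2)/N$, so $N \cdot \mathrm{Var}(\hat{I}_1)$ should stay roughly constant:

import numpy as np

rng = np.random.default_rng(6)
theta, trials = 0.3, 5000

def I1_hat(N):
    y = rng.binomial(1, theta, size=N).astype(float)
    g = y / theta - (1 - y) / (1 - theta)
    return np.mean(g**2)                     # scalar \hat I_1(theta), Eq. (3)

for N in [50, 200, 800]:
    vals = np.array([I1_hat(N) for _ in range(trials)])
    # In 1-D, Eq. (12) reads Var(I1_hat) = Var((dl/dtheta)^2) / N,
    # so N * Var(I1_hat) should be roughly the same for every N.
    print(N, vals.var(), N * vals.var())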

Slide 17

Variance of \hat{I}_2(\theta)
Theorem
\left[ \mathrm{Cov}(\hat{I}_2(\theta)) \right]_{ijkl} = \frac{1}{N} \cdot \mathrm{Cov}\left( -\frac{\partial^2 \ell}{\partial \theta_i \partial \theta_j}, -\frac{\partial^2 \ell}{\partial \theta_k \partial \theta_l} \right)  (15)
= \frac{1}{N} \cdot \partial^2_{ij} h_L^\alpha(x)\, \partial^2_{kl} h_L^\beta(x)\, I_{\alpha\beta}(h_L),  (16)
where $\partial^2_{ij} h_L(x) := \partial^2 h_L(x) / \partial \theta_i \partial \theta_j$.

Slide 18

Upper Bounds
Theorem
\left\| \mathrm{Cov}(\hat{I}_1(\theta)) \right\|_F \le \frac{1}{N} \cdot \left\| \frac{\partial h_L}{\partial \theta} \right\|_F^4 \cdot \left\| K(t) - I(h_L) \otimes I(h_L) \right\|_F,  (17)
Theorem
\left\| \mathrm{Cov}(\hat{I}_2(\theta)) \right\|_F \le \frac{1}{N} \cdot \left\| \frac{\partial^2 h_L(x)}{\partial \theta \partial \theta^\top} \right\|_F^2 \cdot \left\| I(h_L) \right\|_F.  (18)

Slide 19

Effect of Neural Network Derivatives

Slide 20

Effect of Neural Network Derivatives I
The preceding theorems show that the derivatives of the DNN affect the variance of the two estimators.
Lemma
\frac{\partial \ell}{\partial W_l} = D_l \frac{\partial \ell}{\partial h_{l+1}} \bar{h}_l^\top,  \frac{\partial \ell}{\partial h_l} = B_l^\top (t(y) - \eta(h_L)),  \frac{\partial h_L^a}{\partial W_l} = D_l B_{l+1}^\top e_a \bar{h}_l^\top,  (19)
where $e_a$ is the $a$-th standard basis vector, and $B_l$ and $D_l$ are defined recursively as
B_L = I,  B_l = B_{l+1} D_l W_l^-,  D_{L-1} = I,  D_l = \mathrm{diag}(\sigma'(W_l \bar{h}_l)).
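
As a sanity check of the recursion (my own sketch, reusing the tanh/categorical toy setting and starting from a hypothetical hidden state $h_1$), the code below builds $B_l$ and $D_l$ as in Eq. (19) and compares $\partial\ell/\partial h_1 = B_1^\top(t(y) - \eta(h_L))$ against a finite-difference gradient with respect to that hidden layer:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def upper_forward(h1, W1, W2):
    """Forward pass from hidden layer h_1 upwards (tanh hidden, linear output)."""
    h2 = np.tanh(W1 @ np.append(h1, 1.0))
    hL = W2 @ np.append(h2, 1.0)
    return h2, hL

def loglik_from_h1(h1, W1, W2, y):
    _, hL = upper_forward(h1, W1, W2)
    m = hL.max()
    return hL[y] - (m + np.log(np.sum(np.exp(hL - m))))

rng = np.random.default_rng(7)
n1, n2, k = 4, 3, 5
W1 = rng.normal(size=(n2, n1 + 1)) / np.sqrt(n1)
W2 = rng.normal(size=(k, n2 + 1)) / np.sqrt(n2)
h1 = np.tanh(rng.normal(size=n1))           # pretend output of the first hidden layer
y = 1
t = np.zeros(k)
t[y] = 1.0

h2, hL = upper_forward(h1, W1, W2)
eta = softmax(hL)

# Recursion of Eq. (19): B_L = I, B_l = B_{l+1} D_l W_l^-  (W^- drops the bias column).
W2_minus, W1_minus = W2[:, :-1], W1[:, :-1]
D2 = np.eye(k)                              # last layer is linear, D_{L-1} = I
D1 = np.diag(1.0 - h2**2)                   # sigma'(z) = 1 - tanh(z)^2, evaluated via h_2
B2 = np.eye(k) @ D2 @ W2_minus              # B_{L-1}
B1 = B2 @ D1 @ W1_minus

grad_h1 = B1.T @ (t - eta)                  # dl/dh_1 = B_1^T (t(y) - eta(h_L))

# Finite-difference check.
eps = 1e-6
fd = np.zeros(n1)
for j in range(n1):
    d = np.zeros(n1)
    d[j] = eps
    fd[j] = (loglik_from_h1(h1 + d, W1, W2, y) - loglik_from_h1(h1 - d, W1, W2, y)) / (2 * eps)

print(np.max(np.abs(grad_h1 - fd)))         # close to zero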

Slide 21

Effect of Neural Network Derivatives II
From the previous lemma, the FIM with respect to a hidden layer $h_l$ can be estimated as
\hat{I}_1(h_l) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell_i}{\partial h_l} \frac{\partial \ell_i}{\partial h_l^\top} = B_l^\top \left[ \frac{1}{N} \sum_{i=1}^{N} (t(y_i) - \eta(h_L))(t(y_i) - \eta(h_L))^\top \right] B_l.
Since $B_l$ is evaluated recursively from one layer to the next, the FIM can likewise be estimated recursively based on $\hat{I}(\theta)$. This is analogous to the backpropagation procedure.

Slide 22

Effect of Neural Network Derivatives III
Theorem
If the activation function has a bounded derivative with $|\sigma'(z)| \le 1$ for all $z \in \mathbb{R}$, then
\left\| \frac{\partial h_L}{\partial W_l} \right\|_F = \left\| B_{l+1} D_l \right\| \cdot \left\| \bar{h}_l \right\|_2 \le \prod_{i=l+1}^{L-1} \left\| W_i^- \right\|_F \cdot \left\| \bar{h}_l \right\|_2.  (20)
Hence, regularizing the norms of the weight parameters is useful for reducing the variance of $\hat{I}_1(\theta)$.
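
The bound (20) can be checked numerically on a small tanh network (my own sketch; it interprets the norm of $B_{l+1} D_l$ as a Frobenius norm, which is an assumption on my part, and the layer sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
sizes = [3, 6, 5, 4]                      # n_0, n_1, n_2, n_L  (L = 3)
Ws = [rng.normal(size=(sizes[i + 1], sizes[i] + 1)) / np.sqrt(sizes[i])
      for i in range(3)]                  # W_0, W_1, W_2

# Forward pass (tanh hidden layers, linear last layer), keeping \bar h_l and D_l.
h = rng.normal(size=sizes[0])
h_bars, Ds = [], []
for l, W in enumerate(Ws):
    h_bar = np.append(h, 1.0)
    h_bars.append(h_bar)
    z = W @ h_bar
    if l < len(Ws) - 1:
        h = np.tanh(z)
        Ds.append(np.diag(1.0 - h**2))    # D_l = diag(sigma'(W_l \bar h_l))
    else:
        Ds.append(np.eye(sizes[-1]))      # D_{L-1} = I (linear output layer)

# Build B_{l+1} for the first hidden layer's weights (l = 0) via B_l = B_{l+1} D_l W_l^-.
l = 0
B = np.eye(sizes[-1])                     # B_L = I
for i in range(len(Ws) - 1, l, -1):
    B = B @ Ds[i] @ Ws[i][:, :-1]         # after the loop, B = B_{l+1}

lhs = np.linalg.norm(B @ Ds[l], 'fro') * np.linalg.norm(h_bars[l])   # left side of Eq. (20)
rhs = np.prod([np.linalg.norm(Ws[i][:, :-1], 'fro')
               for i in range(l + 1, len(Ws))]) * np.linalg.norm(h_bars[l])
print(lhs, "<=", rhs)                     # the inequality holds since |tanh'| <= 1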

Slide 23

Effect of Neural Network Derivatives IV
If the activation function has a bounded derivative with $|\sigma'(z)| \le 1$ for all $z \in \mathbb{R}$, then
\left\| \frac{\partial h_L}{\partial W_l} \right\|_{2\sigma} \le \prod_{i=l+1}^{L-1} s_{\max}(W_i^-) \cdot \left\| \bar{h}_l \right\|_2,  (21)
where $s_{\max}(\cdot)$ is the largest singular value and $\|T\|_{2\sigma}$ is the spectral norm of a 3rd-order tensor $T$:
\|T\|_{2\sigma} = \max\{ \langle T, \alpha \otimes \beta \otimes \gamma \rangle : \|\alpha\|_2 = \|\beta\|_2 = \|\gamma\|_2 = 1 \}.  (22)
Hence, regularizing the largest singular values of the weight parameters contributes to improving the estimation accuracy of the FIM.

Slide 24

Conclusion
▶ Improving the estimation accuracy of the FIM improves the performance of algorithms that depend on it (e.g., natural gradient descent), so evaluating the accuracy of FIM estimators is important;
▶ The variance behavior of the two FIM estimators is analyzed and upper bounds are derived;
▶ Appropriate regularization of the weight parameters is shown to contribute to better FIM estimation.

Slide 25

References I
Alexander Soen and Ke Sun. On the variance of the Fisher information for deep learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 5708–5719. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/2d290e496d16c9dcaa9b4ded5cac10cc-Paper.pdf.