Paper Introduction: Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

Slide 1

Slide 1 text

Intro / Critical discussion of the empirical Fisher / Conclusions / References

Paper Introduction: Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
Masanari Kimura
SOKENDAI, Department of Statistical Science, Hino Laboratory
[email protected]

Slide 2

Slide 2 text

Intro

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

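TL;DR

▶ The Fisher information matrix, which is useful in many areas of statistics, also plays a central role in natural gradient descent, which follows the steepest-descent direction when a statistical model is optimized with SGD;
▶ Computing the Fisher information matrix during optimization is difficult in practice, so the empirical Fisher is widely used as an approximation;
▶ This paper points out that the empirical Fisher fails to do the job that the Fisher information matrix does in natural gradient descent: correcting the gradient toward the steepest-descent direction.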

Slide 5

Slide 5 text

Fisher Information

Definition
The Fisher information matrix of a statistical model $p_\theta$ with parameters $\theta$ is given by

$$F(\theta) := \sum_n \mathbb{E}_{p_\theta(y \mid x_n)}\left[\nabla_\theta \log p_\theta(y \mid x_n)\, \nabla_\theta \log p_\theta(y \mid x_n)^\top\right], \tag{1}$$

where $\{x_n\}$ are the input data.
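
As a minimal sketch of Eq. (1), assume a logistic regression model $p_\theta(y = 1 \mid x) = \sigma(\theta^\top x)$ with labels in {0, 1}. The model, the inputs `xs`, and the parameter values are hypothetical choices for illustration and do not come from the paper. Because the label space has only two values, the expectation over $y$ can be computed exactly by enumeration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(theta, x, y):
    """Gradient of log p_theta(y | x) for logistic regression, y in {0, 1}."""
    p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p1) * xi for xi in x]

def fisher(theta, xs):
    """Eq. (1): for every input x_n, average the outer product of the score
    over labels drawn from the model's own predictive distribution."""
    d = len(theta)
    F = [[0.0] * d for _ in range(d)]
    for x in xs:
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for y, prob in ((0, 1.0 - p1), (1, p1)):
            g = score(theta, x, y)
            for i in range(d):
                for j in range(d):
                    F[i][j] += prob * g[i] * g[j]
    return F

# Hypothetical inputs (bias term, one feature) and parameters.
xs = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
theta = [0.1, -0.3]
F = fisher(theta, xs)
```

Note that no training labels appear anywhere in this computation: the Fisher depends only on the inputs and the model.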

Slide 6

Slide 6 text

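Natural Gradient

▶ SGD updates the parameters along the steepest-descent direction in Euclidean space;
▶ A statistical model does not live in Euclidean space but forms a Riemannian manifold in general, so the gradient used when updating a statistical model with SGD is not necessarily the steepest-descent direction.

Theorem
The following update rule guarantees that the parameters are updated along the steepest-descent direction [Amari, 1998]:

$$\theta_{t+1} = \theta_t - \alpha_t F^{-1}(\theta_t) \nabla_\theta L(\theta_t), \tag{2}$$

where $\alpha_t > 0$ is the learning rate and $L(\theta_t)$ is the loss function.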
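
The update in Eq. (2) can be sketched on a toy problem. The example below assumes a linear-Gaussian model $p_\theta(y \mid x) = \mathcal{N}(y; \theta^\top x, 1)$ with loss $L(\theta) = \frac{1}{2}\sum_n (\theta^\top x_n - y_n)^2$, for which the Fisher is $\sum_n x_n x_n^\top$. The data, the helper names, and the step sizes are hypothetical. In this special case the Fisher coincides with the Hessian, so a single natural gradient step with $\alpha = 1$ reaches the least-squares solution; this is a property of the toy problem, not a general guarantee.

```python
def solve2(A, b):
    """Solve the 2x2 linear system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def loss_grad(theta, xs, ys):
    """Gradient of L(theta) = 0.5 * sum_n (theta . x_n - y_n)^2."""
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        r = theta[0] * x[0] + theta[1] * x[1] - y
        g[0] += r * x[0]
        g[1] += r * x[1]
    return g

def fisher(xs):
    """Fisher of N(y; theta . x, 1): sum_n x_n x_n^T (independent of theta)."""
    F = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        for i in range(2):
            for j in range(2):
                F[i][j] += x[i] * x[j]
    return F

def gd_step(theta, xs, ys, alpha):
    g = loss_grad(theta, xs, ys)
    return [t - alpha * gi for t, gi in zip(theta, g)]

def ngd_step(theta, xs, ys, alpha):
    """Eq. (2): precondition the gradient with the inverse Fisher."""
    direction = solve2(fisher(xs), loss_grad(theta, xs, ys))
    return [t - alpha * di for t, di in zip(theta, direction)]

# Hypothetical data: bias term plus one feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [0.1, 0.9, 2.2, 2.8]
theta0 = [0.0, 0.0]

theta_ngd = ngd_step(theta0, xs, ys, alpha=1.0)
theta_gd = gd_step(theta0, xs, ys, alpha=0.05)
```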

Slide 7

Slide 7 text

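Empirical Fisher

▶ Parameter updates with the natural gradient require the Fisher information matrix of the statistical model;
▶ Computing the Fisher information matrix is difficult in practice, so the following empirical Fisher approximation is used instead:

$$\tilde{F}(\theta) := \sum_n \nabla_\theta \log p_\theta(y_n \mid x_n)\, \nabla_\theta \log p_\theta(y_n \mid x_n)^\top. \tag{3}$$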
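
Eq. (3) only needs the per-example gradients at the observed labels, which is why it is cheap to compute. The sketch below again assumes a logistic regression model with labels in {0, 1}; the data and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(theta, x, y):
    """Gradient of log p_theta(y | x) for logistic regression, y in {0, 1}."""
    p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p1) * xi for xi in x]

def empirical_fisher(theta, xs, ys):
    """Eq. (3): sum of outer products of the scores at the observed labels."""
    d = len(theta)
    EF = [[0.0] * d for _ in range(d)]
    for x, y in zip(xs, ys):
        g = score(theta, x, y)
        for i in range(d):
            for j in range(d):
                EF[i][j] += g[i] * g[j]
    return EF

# Hypothetical training set (bias term, one feature) and parameters.
xs = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
ys = [1, 0, 1]
theta = [0.1, -0.3]
EF = empirical_fisher(theta, xs, ys)
```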

Slide 8

Slide 8 text

Differences between the Fisher information matrix and the empirical Fisher

▶ The empirical Fisher replaces the expectation over the model's predictive distribution in the Fisher information matrix with a sum over the training labels;
▶ Contrary to what its name suggests, the empirical Fisher is not an empirical estimate of the Fisher information matrix.
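
The difference can be made concrete with a small numerical sketch. It assumes a logistic regression model with a single scalar parameter (chosen for brevity); the inputs and the two label sets are hypothetical. The Fisher is computed without any labels, while the empirical Fisher changes when the labels are flipped on the same inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fisher(theta, xs):
    """Eq. (1) for scalar logistic regression: expectation over y ~ p_theta(y | x)."""
    total = 0.0
    for x in xs:
        p1 = sigmoid(theta * x)
        # The score is (y - p1) * x; average its square over y in {0, 1}.
        total += (1.0 - p1) * ((0 - p1) * x) ** 2 + p1 * ((1 - p1) * x) ** 2
    return total

def empirical_fisher(theta, xs, ys):
    """Eq. (3): squared score evaluated only at the observed label y_n."""
    return sum(((y - sigmoid(theta * x)) * x) ** 2 for x, y in zip(xs, ys))

xs = [0.5, -1.0, 2.0]   # hypothetical inputs
theta = 0.4             # hypothetical parameter
labels_a = [1, 0, 1]
labels_b = [0, 1, 0]    # same inputs, flipped labels

F = fisher(theta, xs)                          # no labels involved
EF_a = empirical_fisher(theta, xs, labels_a)
EF_b = empirical_fisher(theta, xs, labels_b)
print(F, EF_a, EF_b)
```

In this toy setting the two empirical Fisher values differ from each other and from the Fisher, even though the model and inputs are identical.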

Slide 9

Slide 9 text

GD vs. NGD vs. Empirical NGD

Figure: GD vs. NGD vs. Empirical NGD.

Slide 10

Slide 10 text

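Gauss-Newton Method

Definition
The original Gauss-Newton method is an approximation for nonlinear least-squares problems, based on the following decomposition of the Hessian:

$$\nabla^2 L(\theta) = \sum_n \nabla_\theta f(x_n; \theta)\, \nabla_\theta f(x_n; \theta)^\top + \sum_n r_n \nabla_\theta^2 f(x_n; \theta) \tag{4}$$

$$= G(\theta) + R(\theta), \tag{5}$$

where $L(\theta) = \frac{1}{2} \sum_n (f(x_n; \theta) - y_n)^2$ and $r_n = f(x_n; \theta) - y_n$. When the residual term is small, $G(\theta)$ is an approximation of the Hessian.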
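
The decomposition in Eqs. (4) and (5) can be checked numerically. The sketch below assumes a hypothetical scalar-parameter model $f(x; \theta) = \exp(\theta x)$ and made-up data with small residuals; it compares $G(\theta) + R(\theta)$ with a central finite-difference estimate of the second derivative of the loss.

```python
import math

def f(x, theta):
    """Hypothetical nonlinear model with a scalar parameter."""
    return math.exp(theta * x)

def loss(theta, xs, ys):
    return 0.5 * sum((f(x, theta) - y) ** 2 for x, y in zip(xs, ys))

def gauss_newton_terms(theta, xs, ys):
    """Return (G, R) from Eqs. (4)-(5) for the scalar-parameter model."""
    G, R = 0.0, 0.0
    for x, y in zip(xs, ys):
        fx = f(x, theta)
        df = x * fx           # first derivative of f with respect to theta
        d2f = x * x * fx      # second derivative of f with respect to theta
        r = fx - y            # residual r_n
        G += df * df
        R += r * d2f
    return G, R

xs = [0.0, 0.5, 1.0]
ys = [1.0, 1.2, 1.3]   # chosen so that the residuals are small at theta = 0.3
theta = 0.3

G, R = gauss_newton_terms(theta, xs, ys)

# Central finite difference for the second derivative of the loss.
h = 1e-4
hessian_fd = (loss(theta + h, xs, ys) - 2.0 * loss(theta, xs, ys)
              + loss(theta - h, xs, ys)) / h ** 2
print(G, R, hessian_fd)
```

With these small residuals, $R(\theta)$ is a small fraction of $G(\theta)$, so $G(\theta)$ alone is already close to the Hessian.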

Slide 11

Slide 11 text

Generalized Gauss-Newton Method

Definition
The generalized Gauss-Newton method generalizes the objective of the Gauss-Newton method to the form $L(\theta) = \sum_n a_n(b_n(\theta))$, for which

$$\nabla^2 L(\theta) = \sum_n (J_\theta b_n(\theta))^\top\, \nabla_b^2 a_n(b_n(\theta))\, (J_\theta b_n(\theta)) + \sum_{n,m} \left[\nabla_b a_n(b_n(\theta))\right]_m \nabla_\theta^2 b_n^{(m)}(\theta). \tag{6}$$
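
As a quick consistency check, Eq. (6) should reduce to Eq. (4) in the least-squares case. The worked step below assumes the split $a_n(b) = \frac{1}{2}(b - y_n)^2$, $b_n(\theta) = f(x_n; \theta)$ with scalar output, which is one natural choice rather than the only possible one.

```latex
\begin{align*}
a_n(b) &= \tfrac{1}{2}(b - y_n)^2, \qquad b_n(\theta) = f(x_n; \theta), \\
J_\theta b_n(\theta) &= \nabla_\theta f(x_n; \theta)^\top, \qquad
\nabla_b^2 a_n(b_n(\theta)) = 1, \qquad
\nabla_b a_n(b_n(\theta)) = f(x_n; \theta) - y_n = r_n, \\
\nabla^2 L(\theta) &= \sum_n \nabla_\theta f(x_n; \theta)\, \nabla_\theta f(x_n; \theta)^\top
  + \sum_n r_n \nabla_\theta^2 f(x_n; \theta)
  = G(\theta) + R(\theta).
\end{align*}
```

The first sum in Eq. (6) plays the role of $G(\theta)$ and the second the role of $R(\theta)$; a different split into $a_n$ and $b_n$ gives a different first term.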

Slide 12

Slide 12 text

Critical discussion of the empirical Fisher

Slide 13

Slide 13 text

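The empirical Fisher as a generalized Gauss-Newton matrix

With the following choice of split for $G(\theta)$, the empirical Fisher coincides with a generalized Gauss-Newton matrix:

$$a_n(b) = -\log b, \qquad b_n(\theta) = p(y_n \mid f(x_n; \theta)).$$

This manipulation is formally correct, but:

▶ The Gauss-Newton matrix $G(\theta)$ approximates the Hessian well when the residuals are small;
▶ The empirical Fisher approaches 0 as the residuals shrink; in the least-squares setting,

$$\tilde{F}(\theta) = \sum_n r_n^2\, \nabla_\theta f(x_n; \theta)\, \nabla_\theta f(x_n; \theta)^\top; \tag{7}$$

▶ The true Fisher, in contrast, approximates the Hessian when the residuals are small.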
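
The vanishing behaviour in Eq. (7) can be seen in a small sketch. It assumes a scalar linear model $f(x; \theta) = \theta x$ with unit-variance Gaussian noise, for which the Fisher is $\sum_n x_n^2$ and equals the Hessian of the least-squares loss. The inputs, the generating parameter `theta_star`, and the fixed noise pattern are hypothetical.

```python
def fisher(xs):
    """Fisher of N(y; theta * x, 1): sum_n x_n^2, independent of the labels."""
    return sum(x * x for x in xs)

def empirical_fisher(theta, xs, ys):
    """Eq. (7) with f(x; theta) = theta * x: sum_n r_n^2 * x_n^2."""
    return sum(((theta * x - y) * x) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
theta_star = 0.5
noise = [1.0, -1.0, 0.5]   # hypothetical fixed noise pattern

results = []
for eps in (1.0, 0.1, 0.01, 0.0):
    # Shrink the residuals at theta_star by scaling the noise.
    ys = [theta_star * x + eps * e for x, e in zip(xs, noise)]
    results.append((eps, empirical_fisher(theta_star, xs, ys), fisher(xs)))

for eps, ef, fi in results:
    print(eps, ef, fi)
```

As the noise scale goes to zero the empirical Fisher goes to zero, while the Fisher (and the Hessian) stays fixed, so preconditioning with the empirical Fisher would inflate the step in exactly the regime where the Gauss-Newton argument is supposed to apply.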

Slide 14

Slide 14 text

The empirical Fisher near a minimum

Figure: Examples of model misspecification and the effect on the empirical and true Fisher.

Slide 15

Slide 15 text

Preconditioning with the empirical Fisher far from an optimum

Figure: While the EF can be a good approximation for preconditioning on some problems (e.g., a1a), it is not guaranteed to be.

Slide 16

Slide 16 text

Conclusions

Slide 17

Slide 17 text

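Conclusions

▶ The empirical Fisher fits the formal definition of a generalized Gauss-Newton matrix, but it does not retain useful second-order information;
▶ The relationship between the empirical Fisher and the Fisher information matrix holds only under at least the following strong assumptions:
  1. the model is correctly specified, and
  2. the amount of data is large relative to the model capacity.
▶ Preconditioning the gradient with the empirical Fisher is far from optimal, which complicates step-size tuning and can degrade model performance;
▶ As an alternative explanation for the empirical success of the empirical Fisher, the authors conjecture that it reduces the effect of gradient noise in SGD [Kunstner et al., 2019].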

Slide 18

Slide 18 text

References

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximation for natural gradient descent. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 4156–4167, 2019.