# Natural Gradient Learning by Shun'ichi Amari, RIKEN - RRS #1

October 28, 2016

## Transcript

1. ### Natural Gradient Learning (Shun'ichi Amari, RIKEN Brain Science Institute)
2. ### Optimization Problem

Parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$; cost (loss) function $L(\theta)$.

Minimize: $\theta^* = \operatorname{argmin}_{\theta} L(\theta)$
3. ### What Is the Gradient: Steepest Direction

$\nabla L(\theta) = \left( \dfrac{\partial L}{\partial \theta_1}, \dfrac{\partial L}{\partial \theta_2}, \ldots, \dfrac{\partial L}{\partial \theta_n} \right)$
4. ### Gradient Descent Method for Optimization

$\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t)$

Effective when the cost is unimodal (convex); for multi-modal costs, methods such as simulated annealing are used.
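As a concrete illustration, the update $\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t)$ can be sketched in Python on a hypothetical convex quadratic cost; the matrix `A`, starting point, and step size below are illustrative assumptions, not from the talk:

```python
import numpy as np

# Minimal sketch of gradient descent on a hypothetical convex quadratic
# cost L(theta) = theta^T A theta / 2, where grad L = A theta.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # assumed positive-definite cost matrix

def grad_L(theta):
    return A @ theta

eta = 0.1                               # learning rate
theta = np.array([4.0, -3.0])           # arbitrary starting point
for _ in range(200):
    theta = theta - eta * grad_L(theta) # theta_{t+1} = theta_t - eta * grad L

print(theta)                            # converges toward the minimizer 0
```

On a convex cost like this, any sufficiently small $\eta$ converges; the multi-modal case mentioned on the slide is where plain gradient descent gets trapped.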
5. ### On-line Gradient Learning

Instantaneous loss function for target $y = f^*(x)$:

$l(x, \theta) = \dfrac{1}{2}\,\{ f^*(x) - f(x, \theta) \}^2, \qquad L(\theta) = E[\, l(x, \theta)\, ]$

$\theta_{t+1} = \theta_t - \eta\, \nabla l(x_t, \theta_t)$
6. ### Riemannian Manifold

$ds^2 = \sum_{i,j} g_{ij}\, d\theta_i\, d\theta_j = d\theta^T G\, d\theta, \qquad G = \left( g_{ij}(\theta) \right)$
7. ### Manifold of Probability Distributions

Parameterized model: $M = \{\, p(x, \theta) \,\}$

Gaussian: $p(x, \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}, \quad \theta = (\mu, \sigma)$

Discrete $x$: probability simplex
8. ### Information Geometry?

The Gaussian distributions form a manifold $S = \{\, p(x; \mu, \sigma) \,\}$ with coordinates $\theta = (\mu, \sigma)$:

$p(x; \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}$
9. ### Manifold of Probability Distributions

Discrete case: $M = \{\, p(x) \,\}$, $x = 1, 2, 3$; $p = (p_1, p_2, p_3)$ with $p_1 + p_2 + p_3 = 1$ (the probability simplex).
10. ### Fisher Information Matrix and Riemannian Metric

$g_{ij} = E\left[ \partial_i \log p(x, \theta)\; \partial_j \log p(x, \theta) \right]$

Estimation error (Cramér–Rao bound) for $T$ observations: $E\left[ (\hat\theta - \theta)(\hat\theta - \theta)^T \right] \ge \dfrac{1}{T}\, G^{-1}$
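The definition $g_{ij} = E[\partial_i \log p \cdot \partial_j \log p]$ can be checked numerically. A minimal sketch for the Gaussian family of slide 7, whose Fisher matrix in $(\mu, \sigma)$ coordinates is known to be $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$; sample size and parameter values below are arbitrary choices:

```python
import numpy as np

# Monte Carlo check of g_ij = E[ d_i log p * d_j log p ] for the Gaussian
# p(x; mu, sigma), whose Fisher matrix is diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score vector (d/dmu, d/dsigma) of log p(x; mu, sigma):
score = np.stack([(x - mu) / sigma**2,
                  (x - mu)**2 / sigma**3 - 1.0 / sigma])

G = score @ score.T / x.size            # empirical E[score score^T]
print(G)                                # approx [[0.25, 0], [0, 0.5]]
```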
11. ### Minimizing Cost in a Riemannian Manifold

Minimize $L(\theta + a)$ subject to $|a|^2 = a^T G\, a = \varepsilon^2$. The steepest-descent direction is

$a \propto -G^{-1}\, \nabla L(\theta)$
12. ### Steepest Direction --- Natural Gradient

Under the Riemannian constraint $\langle d, d \rangle = d^T G\, d = \varepsilon^2$, the direction of steepest descent of $l$ is the natural gradient

$\tilde\nabla l = G^{-1}\, \nabla l, \qquad (\tilde\nabla l)^i = \sum_j g^{ij}\, \partial_j l$
13. ### Natural Gradient Learning

$\theta_{t+1} = \theta_t - \eta_t\, \tilde\nabla l(x_t, \theta_t), \qquad \tilde\nabla l = G^{-1}\, \nabla l$
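A minimal sketch of the update $\theta_{t+1} = \theta_t - \eta_t\, G^{-1} \nabla l$, assuming the metric $G$ is known exactly; the ill-conditioned quadratic cost is a hypothetical example chosen so that preconditioning by $G^{-1}$ removes the anisotropy that slows plain gradient descent:

```python
import numpy as np

# Sketch of natural gradient descent, theta <- theta - eta * G^{-1} grad L,
# on a hypothetical ill-conditioned quadratic L(theta) = theta^T A theta / 2,
# taking the metric G = A as known. Then G^{-1} grad L = theta, so every
# coordinate contracts at the same rate despite condition number 100.
A = np.diag([100.0, 1.0])
G_inv = np.linalg.inv(A)

def grad_L(theta):
    return A @ theta

eta = 0.5
theta = np.array([1.0, 1.0])
for _ in range(30):
    theta = theta - eta * G_inv @ grad_L(theta)

print(theta)  # approx (1 - eta)^30 * theta_0, close to the minimum at 0
```

With plain gradient descent on the same cost, the step size would have to be tiny in the stiff direction, making the soft direction painfully slow; the metric-aware step sidesteps that trade-off.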
14. ### Information Geometry of MLP: Adaptive Natural Gradient Learning

$\hat G_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat G_t^{-1} - \varepsilon_t\, \hat G_t^{-1}\, \nabla f\, (\nabla f)^T\, \hat G_t^{-1}$
15. ### Independent Component Analysis

$x = As$, $x_i = \sum_j A_{ij}\, s_j$; $A$: unknown mixing matrix. Demixing: $y = Wx$, $W \to A^{-1}$.

observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$
16. ### Mixture and Unmixture of Independent Signals

$x_i = \sum_{j=1}^n A_{ij}\, s_j, \qquad x = As$

Independent sources $s_1, \ldots, s_n$ are mixed into observations $x_1, \ldots, x_m$.
17. ### $x = As$, $y = Wx$: $W = A^{-1}$

$A$: unknown matrix; $s$: unknown sources with independent distribution $r(s) = r_1(s_1)\, r_2(s_2) \cdots r_n(s_n)$, $E[s] = 0$

observations: $x(1), x(2), \ldots, x(t)$

cost function $l(y, W)$: degree of non-independence
18. ### Riemannian Manifold of Demixing Matrices

$ds^2 = \sum_{i,j} g_{ij}\, dW_i\, dW_j = dW^T G\, dW$ over the matrices $W$; Euclidean case: $G = I$
19. ### Space of Matrices: A Lie Group

Non-holonomic basis: $dX = dW\, W^{-1}$

$ds^2 = \operatorname{tr}\left( dX\, dX^T \right) = \operatorname{tr}\left( W^{-T}\, dW^T\, dW\, W^{-1} \right)$
20. ### Natural Gradient

$\tilde\nabla l(y, W) = \nabla l\; W^T W$
21. ### Natural Gradient Learning Algorithm

$\Delta W = \eta\, \{\, I - \varphi(y)\, y^T \,\}\, W, \qquad y = Wx$

componentwise $\Delta W_{ij} = \eta \sum_k \{\, I_{ik} - \varphi_i(y_i)\, y_k \,\}\, W_{kj}$, where $\varphi_i(y_i) = -\dfrac{d}{dy_i} \log q_i(y_i)$
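A minimal batch sketch of the rule $\Delta W = \eta\{I - \varphi(y)y^T\}W$; the Laplace sources, the mixing matrix, and the choice $\varphi(y) = \tanh(y)$ (a common pick for super-Gaussian sources) are illustrative assumptions:

```python
import numpy as np

# Batch sketch of the natural-gradient ICA rule
#     Delta W = eta * (I - phi(y) y^T) W,   y = W x,
# with the assumed nonlinearity phi(y) = tanh(y) (suited to
# super-Gaussian sources) and hypothetical Laplace sources.
rng = np.random.default_rng(1)
n, T = 2, 5000
S = rng.laplace(size=(n, T))            # independent super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # "unknown" mixing matrix
X = A @ S                               # observations x = A s

W = np.eye(n)
eta = 0.02
for _ in range(1000):
    Y = W @ X
    C = np.tanh(Y) @ Y.T / T            # empirical E[phi(y) y^T]
    W += eta * (np.eye(n) - C) @ W      # natural-gradient step

print(W @ A)  # near a scaled permutation matrix when separation succeeds
```

At a fixed point $E[\varphi(y)\, y^T] = I$, which holds when the outputs are independent with a matching scale; ICA recovers sources only up to permutation and scaling.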
22. ### Mathematical Neurons

$y = \varphi\left( \sum_i w_i x_i - h \right) = \varphi(w \cdot x)$
23. ### Multilayer Perceptrons

$y = \sum_{i=1}^m v_i\, \varphi(w_i \cdot x) + n, \qquad f(x, \theta) = \sum_i v_i\, \varphi(w_i \cdot x)$

$p(y \mid x; \theta) = c \exp\left\{ -\dfrac{1}{2}\left( y - f(x, \theta) \right)^2 \right\}$

$x = (x_1, x_2, \ldots, x_n), \qquad \theta = (w_1, \ldots, w_m;\, v_1, \ldots, v_m)$

25. ### Geometry of Singular Models

$y = v\, \varphi(w \cdot x) + n$: the model is singular where $v = 0$ or $|w| = 0$, since the hidden unit's parameters become unidentifiable there.
26. ### Learning, Estimation, and Model Selection

$E_{\mathrm{gen}} = D\left[ p_0(y \mid x) : p(y \mid x; \hat\theta) \right], \qquad E_{\mathrm{train}} = D_{\mathrm{emp}}\left[ \hat p(y \mid x) : p(y \mid x; \hat\theta) \right]$

For a regular model of dimension $d$ with $n$ examples: $E_{\mathrm{gen}} = \dfrac{d}{2n}, \quad E_{\mathrm{gen}} - E_{\mathrm{train}} = \dfrac{d}{n}$
27. ### Problems of Backprop

- slow convergence: plateaus and saddles
- local minima

$\theta_{t+1} = \theta_t - \eta_t\, \nabla l(x_t, y_t; \theta_t)$
28. ### Flaws of MLP

- slow convergence: plateaus
- local minima
- boosting and bagging error
29. ### Information Geometry of MLP: Natural Gradient Learning (S. Amari, H. Y. Park)

$\hat G_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat G_t^{-1} - \varepsilon_t\, \hat G_t^{-1}\, \nabla f\, (\nabla f)^T\, \hat G_t^{-1}$
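The adaptive update of $\hat G^{-1}$ can be sketched by feeding it random gradient samples with a known second moment and checking that the estimate tracks $G^{-1} = (E[\nabla f\, \nabla f^T])^{-1}$; the covariance `G_true`, step size $\varepsilon$, and iteration count below are illustrative assumptions:

```python
import numpy as np

# Sketch of the adaptive inverse-Fisher update
#     G_inv <- (1 + eps) * G_inv - eps * G_inv @ grad_f @ grad_f.T @ G_inv
# fed with hypothetical random gradients whose second moment is a known
# matrix G_true, so the estimate should drift toward inv(G_true).
rng = np.random.default_rng(2)
G_true = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
L = np.linalg.cholesky(G_true)          # for sampling E[g g^T] = G_true

eps = 0.002
G_inv = np.eye(2)                       # initial estimate
for _ in range(50_000):
    grad_f = L @ rng.normal(size=2)     # simulated per-example gradient
    u = G_inv @ grad_f                  # G_inv stays symmetric here
    G_inv = (1 + eps) * G_inv - eps * np.outer(u, u)

print(G_inv)                            # fluctuates around inv(G_true)
```

The update is the first-order (in $\varepsilon$) expansion of inverting $\hat G_{t+1} = (1 - \varepsilon)\hat G_t + \varepsilon\, \nabla f\, \nabla f^T$, which avoids an explicit matrix inversion at every step.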
30. ### Reinforcement Learning: Markov Decision Process

- state: $s_t \in S$
- action: $a_t \in A$
- reward: $r(s, a) \in R$
- policy: $P(a \mid s; \theta)$
- value: $V = E\left[ \sum_t r(s_t, a_t) \right]$

32. ### Fisher Information Matrix: Policy Natural Gradient

$G = E\left[ \nabla \log P(a \mid s; \theta)\; \nabla \log P(a \mid s; \theta)^T \right]$

$\theta_{t+1} = \theta_t - \eta_t\, G^{-1} \nabla l(x_t, \theta_t)$
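A sketch of estimating the policy Fisher matrix by Monte Carlo for a hypothetical one-parameter logistic (two-action) policy; the state distribution and parameter value are illustrative assumptions. For this policy the score is $\partial_\theta \log P(a \mid s) = (a - p)\, s$ with $p = P(a{=}1 \mid s)$:

```python
import numpy as np

# Monte Carlo estimate of the policy Fisher information
#     G = E[ grad log P(a|s; theta) grad log P(a|s; theta)^T ]
# for a hypothetical logistic policy P(a=1|s) = sigmoid(theta * s),
# with states s ~ N(0, 1) (an assumed state distribution).
rng = np.random.default_rng(3)
theta = 0.3

s = rng.normal(size=100_000)                  # sampled states
p = 1.0 / (1.0 + np.exp(-theta * s))          # P(a = 1 | s)
a = (rng.random(100_000) < p).astype(float)   # actions drawn from the policy

score = (a - p) * s                           # d/dtheta log P(a|s)
G = np.mean(score**2)                         # scalar Fisher information

nat_grad = lambda grad: grad / G              # natural gradient: G^{-1} grad
print(G)
```

For a vector parameter the same recipe gives a matrix: average the outer products of the score vectors, then precondition the policy gradient with its inverse.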