
Natural Gradient Learning by Shun'ichi Amari, RIKEN - RRS #1

Transcript

  1. Optimization Problem. Parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$; cost (loss) function $L(\theta)$. Minimize: $\theta^* = \operatorname{argmin}_\theta L(\theta)$.
  2. What is gradient: steepest direction. $\nabla L = \left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_n}\right)$.
  3. On-line gradient learning. Instantaneous loss function $l(x, \theta) = \frac{1}{2}\{y^*(x) - f(x, \theta)\}^2$; expected cost $L(\theta) = E[l(x, \theta)]$; update rule $\theta_{t+1} = \theta_t - \eta_t \nabla l(x_t, \theta_t)$.
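The on-line update above can be sketched in a few lines of NumPy. This is a minimal toy example, not from the slides: the linear model $f(x, \theta) = \theta \cdot x$, the true parameter, and the step size are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target y*(x) = theta_true . x; model f(x, theta) = theta . x
theta_true = np.array([2.0, -1.0])

def grad_l(theta, x, y):
    # gradient of the instantaneous loss l = 0.5 * (y - theta . x)^2
    return -(y - theta @ x) * x

theta = np.zeros(2)
eta = 0.1  # learning rate (arbitrary choice)
for t in range(2000):
    x = rng.normal(size=2)
    y = theta_true @ x
    theta -= eta * grad_l(theta, x, y)  # theta_{t+1} = theta_t - eta * grad l

print(theta)  # approaches theta_true
```

Each step uses only the current sample, which is exactly what "instantaneous loss" means on the slide: no pass over a full dataset is needed.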
  4. Riemannian manifold. $ds^2 = \sum_{ij} g_{ij}\, d\theta_i\, d\theta_j = d\theta^T G\, d\theta$, where $G(\theta) = (g_{ij}(\theta))$.
  5. Manifold of probability distributions. Parameterized model: $M = \{p(x, \theta)\}$. Gaussian: $p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\{-\frac{(x - \mu)^2}{2\sigma^2}\}$, $\theta = (\mu, \sigma)$. Discrete $x$: probability simplex.
  6. Information Geometry? The family of Gaussian distributions $S = \{p(x; \mu, \sigma)\}$, $p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\{-\frac{(x - \mu)^2}{2\sigma^2}\}$, $\theta = (\mu, \sigma)$, forms a manifold.
  7. Manifold of Probability Distributions. Discrete $x = 1, 2, 3$: $M = \{p(x)\}$, $p = (p_1, p_2, p_3)$, $p_1 + p_2 + p_3 = 1$ (the probability simplex).
  8. Fisher information matrix and Riemannian metric. $g_{ij}(\theta) = E[\partial_i \log p(x, \theta)\, \partial_j \log p(x, \theta)]$, $G = (g_{ij})$. Estimation error (Cramér-Rao bound): $E[\Delta\theta_i \Delta\theta_j] \ge \frac{1}{T} (G^{-1})_{ij}$.
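The definition $g_{ij} = E[\partial_i \log p\, \partial_j \log p]$ can be checked numerically for the Gaussian family of slide 5, whose Fisher matrix in $(\mu, \sigma)$ coordinates is the known closed form $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$. A Monte Carlo sketch (sample size and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
n = 200_000
x = rng.normal(mu, sigma, size=n)

# score vector: gradient of log p(x; mu, sigma) for the Gaussian
score = np.stack([(x - mu) / sigma**2,
                  (x - mu)**2 / sigma**3 - 1.0 / sigma], axis=1)
G = score.T @ score / n  # Monte Carlo estimate of E[score score^T]

print(G)  # close to diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)
```

The off-diagonal entries vanish in expectation, reflecting the orthogonality of the $\mu$ and $\sigma$ coordinates in this family.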
  9. Minimizing cost in Riemannian manifold. Minimize $L(\theta + a)$ subject to $|a|^2 = a^T G a = \varepsilon^2$; the steepest-descent direction is $a \propto -G^{-1} \nabla L(\theta)$.
  10. Steepest Direction --- Natural Gradient. $\tilde{\nabla} l = G^{-1} \nabla l$, with squared length $\langle d, d \rangle = d^T G\, d$; update $\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} l(x_t, y_t; \theta_t)$.
  11. Natural Gradient Learning. $\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} l(x_t, \theta_t)$, where $\tilde{\nabla} l = G^{-1} \nabla l$.
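The difference between the plain and natural gradient steps is easiest to see on a badly scaled quadratic. In the sketch below the metric $G$ is taken to be the Hessian of the cost, an assumption made purely for this toy example (the slides define $G$ via the Riemannian/Fisher metric, not the Hessian):

```python
import numpy as np

# Toy quadratic cost L(theta) = 0.5 theta^T A theta with a badly scaled A.
# Assumption for this example only: the metric G equals the Hessian A.
A = np.diag([100.0, 1.0])
G = A

def grad_L(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
eta = 0.5
for _ in range(20):
    # natural gradient step: theta <- theta - eta * G^{-1} grad L
    theta = theta - eta * np.linalg.solve(G, grad_L(theta))

print(theta)  # converges to the minimum at the origin
```

With the same step size, plain gradient descent diverges along the stiff coordinate (the contraction factor $|1 - \eta \cdot 100| = 49 > 1$), while the natural gradient contracts both coordinates uniformly by $1 - \eta$.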
  12. Information Geometry of MLP. Adaptive Natural Gradient Learning: $\theta_{t+1} = \theta_t - \eta_t \hat{G}_t^{-1} \nabla l$, with $\hat{G}_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat{G}_t^{-1} - \varepsilon_t (\hat{G}_t^{-1} \nabla f)(\hat{G}_t^{-1} \nabla f)^T$.
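The point of the adaptive rule is to track $\hat{G}^{-1}$ directly with a rank-one recursion (a first-order-in-$\varepsilon$ approximation to the Sherman-Morrison update), avoiding a matrix inversion at every step. A sketch checking that the recursion stays close to the true inverse; the random vectors stand in for $\nabla f$ and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-3  # small adaptation rate (arbitrary)

# Running metric estimate: G_{t+1} = (1 - eps) G_t + eps grad_f grad_f^T.
# The slide's recursion tracks G^{-1} directly:
#   G^{-1}_{t+1} = (1 + eps) G^{-1}_t - eps (G^{-1}_t grad_f)(G^{-1}_t grad_f)^T
G = np.eye(3)
G_inv = np.eye(3)
for _ in range(500):
    g = rng.normal(size=3)  # stand-in for grad f (an assumption)
    G = (1 - eps) * G + eps * np.outer(g, g)
    v = G_inv @ g
    G_inv = (1 + eps) * G_inv - eps * np.outer(v, v)

print(np.max(np.abs(G_inv - np.linalg.inv(G))))  # stays small
```

Each step costs only a matrix-vector product and an outer product, i.e. $O(d^2)$ instead of the $O(d^3)$ of a fresh inversion.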
  13. Independent Component Analysis. Mixing $x = A s$, $x_i = \sum_j A_{ij} s_j$; demixing $y = W x$, $W = A^{-1}$. Observations: $x(1), x(2), \ldots, x(t)$; recover: $s(1), s(2), \ldots, s(t)$.
  14. Mixture and unmixing of independent signals. Sources $s_1, \ldots, s_n$, observations $x_1, \ldots, x_m$: $x_i = \sum_{j=1}^n A_{ij} s_j$, $x = A s$.
  15. $x = A s$, $y = W x$, $W = A^{-1}$. $A$: unknown matrix; $s$: unknown; $r(s)$: unknown. Observations: $x(1), x(2), \ldots, x(t)$. Independent distribution: $r(s) = r_1(s_1)\, r_2(s_2) \cdots r_n(s_n)$, $E[s] = 0$. Cost function $l(y, W)$: degree of non-independence.
  16. Riemannian manifold. $ds^2 = \sum_{ij} g_{ij}\, d\theta_i\, d\theta_j = d\theta^T G\, d\theta$; here $\theta = W$. Euclidean case: $G = I$.
  17. Space of Matrices: Lie group. Non-holonomic basis $dX = dW\, W^{-1}$; $ds^2 = \mathrm{tr}(dX\, dX^T) = \mathrm{tr}(dW\, W^{-1} W^{-T}\, dW^T)$; natural gradient $\tilde{\nabla} l = \nabla l\; W^T W$.
  18. Natural Gradient Learning Algorithm. $\Delta W = \eta\, \{I - \varphi(y)\, y^T\}\, W$, $y = W x$, where $\varphi_i(y_i) = -\frac{d}{dy_i} \log q_i(y_i)$.
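The update $\Delta W = \eta\{I - \varphi(y) y^T\} W$ can be run on a toy two-source mixture. The mixing matrix, source distribution (Laplacian), score function $\varphi = \tanh$ (a common choice for super-Gaussian sources), and all step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent Laplacian (super-Gaussian) sources, fixed mixing matrix
s = rng.laplace(size=(2, 10000))
A = np.array([[2.0, 1.0], [1.0, 1.5]])
x = A @ s

phi = np.tanh  # score function, assumed suitable for super-Gaussian sources

W = np.eye(2)
eta = 0.05
for t in range(500):
    y = W @ x
    C = (phi(y) @ y.T) / x.shape[1]    # sample estimate of E[phi(y) y^T]
    W = W + eta * (np.eye(2) - C) @ W  # Delta W = eta {I - phi(y) y^T} W

print(W @ A)  # approaches a scaled permutation matrix
```

Because sources can only be recovered up to scale and permutation, the natural check is that $W A$ is close to a scaled permutation, equivalently that each recovered component correlates strongly with one source.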
  19. Mathematical Neurons. $y = \varphi\left(\sum_i w_i x_i - h\right) = \varphi(w \cdot x)$, with activation function $\varphi(u)$.
  20. Multilayer Perceptrons. $y = \sum_i v_i\, \varphi(w_i \cdot x) + n$ (noise $n$); $p(y \mid x; \theta) = c \exp\{-\frac{1}{2}(y - f(x, \theta))^2\}$, $f(x, \theta) = \sum_i v_i\, \varphi(w_i \cdot x)$; $x = (x_1, x_2, \ldots, x_n)$, $\theta = (w_1, \ldots, w_m; v_1, \ldots, v_m)$.
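A minimal sketch of the slide's model $f(x, \theta) = \sum_i v_i \varphi(w_i \cdot x)$ with $\varphi = \tanh$, together with the parameter gradients a (natural) gradient learner would need; the sizes and names are illustrative, and the gradients are verified against finite differences:

```python
import numpy as np

# Minimal version of the slide's model: f(x; theta) = sum_i v_i tanh(w_i . x)
def f(x, W, v):
    return v @ np.tanh(W @ x)

def grads(x, W, v):
    h = np.tanh(W @ x)
    dv = h                                # df/dv_i = phi(w_i . x)
    dW = ((1.0 - h**2) * v)[:, None] * x  # df/dW_ij = v_i phi'(w_i . x) x_j
    return dW, dv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))  # 3 hidden units, 2 inputs (arbitrary sizes)
v = rng.normal(size=3)
x = rng.normal(size=2)
dW, dv = grads(x, W, v)
```

Under the Gaussian noise model on the slide, the gradient of the instantaneous loss is just $-(y - f)\,\partial f/\partial\theta$, so these two arrays are all a learning rule needs.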
  21. Geometry of singular model. $y = \sum_i v_i\, \varphi(w_i \cdot x) + n$; the model is singular where $v = 0$ or $|w| = 0$.
  22. Learning, Estimation, and Model Selection. Generalization error: $E_{\mathrm{gen}} = D[p_0(y \mid x) : p(y \mid x; \hat{\theta})]$; training error $E_{\mathrm{train}}$: empirical. With $d$ the model dimension and $n$ the sample size: $E_{\mathrm{gen}} = \frac{d}{2n}$, $E_{\mathrm{gen}} - E_{\mathrm{train}} = \frac{d}{n}$.
  23. Flaws of MLP. Slow convergence: plateau; local minima. Boosting and Bagging.
  24. Information Geometry of MLP. Natural Gradient Learning: S. Amari; H. Y. Park. $\theta_{t+1} = \theta_t - \eta_t \hat{G}_t^{-1} \nabla l$, with $\hat{G}_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat{G}_t^{-1} - \varepsilon_t (\hat{G}_t^{-1} \nabla f)(\hat{G}_t^{-1} \nabla f)^T$.
  25. Reinforcement learning: Markov decision process. State: $\{s_t\} \in S$; action: $\{a_t\} \in A$; reward: $R$, $r(s, a)$; policy: $P(a \mid s; \theta)$; value: $V = E[\sum_t r(s_t, a_t)]$.
  26. Fisher information matrix: policy natural gradient. $G = E[\nabla \log P(a \mid s; \theta)\, \nabla \log P(a \mid s; \theta)^T]$; update $\theta_{t+1} = \theta_t - \eta_t\, G^{-1} \nabla l(x_t, \theta_t)$.
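The policy Fisher matrix above can be checked on the simplest possible policy. The sketch below drops the state dependence and uses a one-state, two-action Bernoulli policy $P(a{=}1; \theta) = \sigma(\theta)$, an illustrative assumption; its Fisher information has the closed form $p(1-p)$ with $p = \sigma(\theta)$:

```python
import numpy as np

# One-state, two-action sketch: Bernoulli policy P(a=1; theta) = sigmoid(theta)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))

rng = np.random.default_rng(0)
a = rng.random(200_000) < p        # sample actions from the policy
score = np.where(a, 1.0 - p, -p)   # d/dtheta log P(a; theta)
G_mc = np.mean(score**2)           # Monte Carlo estimate of the Fisher information

print(G_mc)  # close to the closed form p * (1 - p)
```

In actual policy-gradient methods the expectation also runs over states visited under the policy; this toy keeps only the action sampling to isolate the Fisher computation.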