Slide 40
Slide 40 text
Dynamic Programming
$$
\begin{aligned}
q(a_t \mid s_t) &= \arg\max \; E_{s_t \sim q(s_t)}\Big[ E_{a_t \sim q(a_t \mid s_t)}\big[ r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}[V(s_{t+1})] \big] + \mathcal{H}\big(q(a_t \mid s_t)\big) \Big] \\
&= \arg\max \; E_{s_t \sim q(s_t)}\Big[ E_{a_t \sim q(a_t \mid s_t)}\big[ Q(s_t, a_t) \big] + \mathcal{H}\big(q(a_t \mid s_t)\big) \Big] \\
&= \arg\max \; E_{s_t \sim q(s_t)}\Big[ E_{a_t \sim q(a_t \mid s_t)}\big[ Q(s_t, a_t) - \log q(a_t \mid s_t) \big] \Big]
\end{aligned}
$$
$$
q(a_t \mid s_t) = \arg\max \sum_{t=1}^{T} \Big( E_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\big[ r(s_t, a_t) \big] + E_{s_t \sim q(s_t)}\big[ \mathcal{H}\big(q(a_t \mid s_t)\big) \big] \Big)
$$
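Since the value from the previously solved time step is known, imagine the original objective expanded out and substitute that value into it.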
$$
E_{s_T \sim q(s_T)}\Big[ E_{a_T \sim q(a_T \mid s_T)}\big[ r(s_T, a_T) - \log q(a_T \mid s_T) \big] \Big] = E_{s_T \sim q(s_T)}\Big[ E_{a_T \sim q(a_T \mid s_T)}\big[ V(s_T) \big] \Big]
$$
$$
Q(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ V(s_{t+1}) \big]
$$
Written this way, it becomes the usual Bellman backup!!
maximized when $q(a_t \mid s_t) \propto \exp\big(Q(s_t, a_t)\big)$
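A short worked step may help show why the maximizer is proportional to $\exp(Q)$. It assumes the normalizer $V(s_t) = \log \int \exp\big(Q(s_t, a_t)\big)\, da_t$ is finite, and it uses the KL divergence $D_{\mathrm{KL}}$, which the slide itself does not write out.

$$
\begin{aligned}
E_{a_t \sim q(a_t \mid s_t)}\big[ Q(s_t, a_t) - \log q(a_t \mid s_t) \big]
&= -E_{a_t \sim q(a_t \mid s_t)}\Big[ \log q(a_t \mid s_t) - \big( Q(s_t, a_t) - V(s_t) \big) \Big] + V(s_t) \\
&= -D_{\mathrm{KL}}\Big( q(a_t \mid s_t) \,\Big\|\, \exp\big( Q(s_t, a_t) - V(s_t) \big) \Big) + V(s_t)
\end{aligned}
$$

Because the KL term is non-negative and vanishes only when the two distributions coincide, the expression is largest at $q(a_t \mid s_t) = \exp\big(Q(s_t, a_t) - V(s_t)\big)$, where it equals $V(s_t)$.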
$$
q(a_t \mid s_t) = \exp\big( Q(s_t, a_t) - V(s_t) \big)
$$
$$
V(s_t) = \log \int \exp\big( Q(s_t, a_t) \big)\, da_t
$$
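As a rough illustration, the value and policy for a single state can be sketched in Python. This assumes a finite set of actions, so the integral becomes a sum; the function name `soft_value_and_policy` and the example Q-values are hypothetical and not from the slide.

```python
import numpy as np

def soft_value_and_policy(q_values):
    """Return V = log sum_a exp(Q(s, a)) and q(a|s) = exp(Q(s, a) - V) for one state.

    Assumption: actions are discrete, so the integral over a is replaced by a sum.
    """
    q_values = np.asarray(q_values, dtype=float)
    q_max = q_values.max()  # subtract the max so exp() cannot overflow
    v = q_max + np.log(np.exp(q_values - q_max).sum())
    policy = np.exp(q_values - v)  # already normalized, because V is the log-normalizer
    return v, policy

# Hypothetical Q-values for three actions in one state.
v, policy = soft_value_and_policy([1.0, 2.0, 0.5])
```

Since V is a "soft" maximum, it is never smaller than the largest Q-value, and the resulting policy puts the most mass on the highest-valued action without becoming fully greedy.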
Variational inference: application
$$
V(s_t) = E_{a_t \sim q(a_t \mid s_t)}\big[ Q(s_t, a_t) - \log q(a_t \mid s_t) \big]
$$
Substitute
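The whole backward recursion can be sketched on a small tabular problem. This is a minimal sketch under stated assumptions, not the slide's own implementation: states and actions are discrete, the reward and transition tables do not depend on time, the value after the final step is taken to be zero (so that Q at time T is just the reward), and all names (`soft_backward_pass`, `reward`, `transition`) are hypothetical.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def soft_backward_pass(reward, transition, horizon):
    """Run the dynamic program from t = T down to t = 1.

    reward:     array (S, A), r(s, a)                (assumed time-invariant)
    transition: array (S, A, S), p(s' | s, a)        (assumed time-invariant)
    Returns lists of V_t (shape S) and q_t(a|s) (shape S x A), ordered t = 1..T.
    """
    num_states = reward.shape[0]
    v_next = np.zeros(num_states)  # assumption: value beyond the horizon is 0
    values, policies = [], []
    for _ in range(horizon):
        q = reward + transition @ v_next   # Q(s, a) = r(s, a) + E[V(s')]
        v = logsumexp(q, axis=1)           # V(s) = log sum_a exp(Q(s, a))
        policy = np.exp(q - v[:, None])    # q(a|s) = exp(Q(s, a) - V(s))
        values.append(v)
        policies.append(policy)
        v_next = v
    return values[::-1], policies[::-1]

# Hypothetical random problem: 3 states, 2 actions, horizon 4.
rng = np.random.default_rng(0)
S, A, T = 3, 2, 4
reward = rng.normal(size=(S, A))
transition = rng.random((S, A, S))
transition /= transition.sum(axis=2, keepdims=True)
values, policies = soft_backward_pass(reward, transition, T)

# Check the substitution step: V(s_1) = E_q[Q(s_1, a_1) - log q(a_1 | s_1)].
q_first = reward + transition @ values[1]
v_check = (policies[0] * (q_first - np.log(policies[0]))).sum(axis=1)
```

The state distribution q(s_t) never appears in the code because the optimum is found separately for each state, so the outer expectation over states does not change the maximizer.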