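$$\rho_\pi(s, a) = \pi(a \mid s) \sum_{t} \gamma^{t} P(s_t = s \mid \pi)$$

$$\mathbb{E}_\pi[c(s, a)] = \sum_{s,a} \rho_\pi(s, a)\, c(s, a)$$

$$H(\pi) = -\sum_{s,a} \rho_\pi(s, a) \log\!\left( \frac{\rho_\pi(s, a)}{\sum_{a'} \rho_\pi(s, a')} \right) = \bar{H}(\rho_\pi)$$

Generative Adversarial Imitation Learning (https://arxiv.org/abs/1606.03476)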
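The occupancy-measure identities above can be checked numerically on a small tabular problem. The following is a minimal sketch, not taken from the paper: the two-state MDP, the policy, the cost table, and every function name are hypothetical, and the infinite sum over t is truncated at a finite horizon.

```python
import math

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2

# Hypothetical toy MDP (illustration only): P[s][a][s2] = P(s2 | s, a).
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
p0 = [1.0, 0.0]                   # initial state distribution
pi = [[0.6, 0.4], [0.25, 0.75]]   # pi[s][a] = pi(a | s)
c = [[1.0, 2.0], [0.0, 3.0]]      # c[s][a] = cost of (s, a)


def occupancy(pi, horizon=2000):
    """rho(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi), truncated at `horizon`."""
    d = [0.0] * N_STATES          # discounted state visitation
    p_t = list(p0)                # P(s_t = . | pi) at the current step
    discount = 1.0
    for _ in range(horizon):
        for s in range(N_STATES):
            d[s] += discount * p_t[s]
        nxt = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                for s2 in range(N_STATES):
                    nxt[s2] += p_t[s] * pi[s][a] * P[s][a][s2]
        p_t = nxt
        discount *= GAMMA
    return [[pi[s][a] * d[s] for a in range(N_ACTIONS)] for s in range(N_STATES)]


def expected_cost_from_rho(rho):
    """E_pi[c] written as sum_{s,a} rho(s, a) c(s, a)."""
    return sum(rho[s][a] * c[s][a] for s in range(N_STATES) for a in range(N_ACTIONS))


def expected_cost_by_policy_evaluation(pi, sweeps=2000):
    """Independent check: iterate V = c_pi + gamma * P_pi V, then return p0 . V."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [
            sum(
                pi[s][a]
                * (c[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_STATES)))
                for a in range(N_ACTIONS)
            )
            for s in range(N_STATES)
        ]
    return sum(p0[s] * V[s] for s in range(N_STATES))


def causal_entropy(pi, rho):
    """H(pi) = E_pi[-log pi(a|s)], evaluated with the occupancy measure."""
    return sum(
        -rho[s][a] * math.log(pi[s][a]) for s in range(N_STATES) for a in range(N_ACTIONS)
    )


def entropy_of_occupancy(rho):
    """H_bar(rho) = -sum_{s,a} rho(s, a) log(rho(s, a) / sum_{a'} rho(s, a'))."""
    total = 0.0
    for s in range(N_STATES):
        row_sum = sum(rho[s])
        for a in range(N_ACTIONS):
            total -= rho[s][a] * math.log(rho[s][a] / row_sum)
    return total


rho = occupancy(pi)
print(expected_cost_from_rho(rho), expected_cost_by_policy_evaluation(pi))
print(causal_entropy(pi, rho), entropy_of_occupancy(rho))
```

Each printed pair should agree up to floating-point error, and the entries of rho should sum to 1 / (1 - GAMMA), because the occupancy measure here is unnormalized.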
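$$\max_{c}\; -\psi(c) + \left( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_\pi[c(s, a)] \right) - \mathbb{E}_{\pi_E}[c(s, a)]$$

With $\psi$ constant:

$$\max_{c} \min_{\rho}\; -\bar{H}(\rho) + \sum_{s,a} c(s, a) \left( \rho(s, a) - \rho_E(s, a) \right)$$

which is the Lagrangian dual of

$$\min_{\rho}\; -\bar{H}(\rho) \quad \text{s.t.} \quad \rho(s, a) = \rho_E(s, a)$$

Lagrangian duality:

$$\operatorname*{minimize}_{x}\; f(x) \;\; \text{s.t.} \;\; h(x) = 0 \quad \Longleftrightarrow \quad \operatorname*{maximize}_{\lambda}\; \inf_{x} \left( f(x) + \sum_{i} \lambda_i h_i(x) \right)$$

http://ir5.hatenablog.com/entry/20141214/1418553079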
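The duality statement above can be illustrated on a toy equality-constrained problem. This is a sketch under stated assumptions: the objective f(x, y) = x^2 + y^2, the constraint h(x, y) = x + y - 1, and all function names are hypothetical choices for illustration and have nothing to do with the imitation-learning objective itself. For this f the inner infimum has a closed form, and the outer maximization uses a plain ternary search.

```python
def f(x, y):
    # Hypothetical convex objective.
    return x * x + y * y


def h(x, y):
    # Hypothetical equality constraint, h(x, y) = 0.
    return x + y - 1.0


def dual(lam):
    """g(lam) = inf_{x,y} f(x, y) + lam * h(x, y).

    Setting the gradient of the Lagrangian to zero gives x = y = -lam / 2.
    """
    x = y = -lam / 2.0
    return f(x, y) + lam * h(x, y)


def maximize_dual(lo=-5.0, hi=5.0, iters=200):
    """Ternary search for the maximizer of g, valid because g is concave."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


lam_star = maximize_dual()
x_star = y_star = -lam_star / 2.0   # primal point recovered from the multiplier
print(lam_star, dual(lam_star), f(x_star, y_star), h(x_star, y_star))
```

For this problem the dual value at the maximizer matches the constrained minimum of f, and the recovered primal point satisfies the constraint up to numerical error. In the imitation-learning setting above, the cost c(s, a) plays the role of the multiplier on the constraint that the occupancy measures match.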