in Markov decision processes is the construction of a reward function given observed expert behavior.

RL and IRL

Keita Watanabe, Journal club, Nov 2019
An MDP is a tuple M = (S, A, P, γ, R), where
∗ S = {s_1, s_2, ..., s_N} is a finite set of N states.
∗ A = {a_1, a_2, ..., a_k} is a finite set of k actions.
∗ P(s_j | s_i, a) is the probability of transitioning from state s_i to state s_j upon taking action a.
∗ γ ∈ [0, 1) is the discount factor.
∗ R : S → ℝ is the reward function.
A policy is a map π : S → A.
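As a concrete toy example (all numbers are illustrative, not from the talk), the tuple can be written with NumPy arrays; evaluating a fixed policy π amounts to solving the linear system V = R + γ P_π V:

```python
import numpy as np

# Illustrative MDP: N = 3 states, k = 2 actions.
# P[a, i, j] = P(s_j | s_i, a); each row sums to 1.
P = np.array([
    [[0.0, 1.0, 0.0],   # action 0 from s0 -> s1
     [0.0, 0.0, 1.0],   # action 0 from s1 -> s2
     [0.0, 0.0, 1.0]],  # action 0 from s2 -> s2 (absorbing)
    [[1.0, 0.0, 0.0],   # action 1: stay in s0
     [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0]],
])
R = np.array([0.0, 0.0, 1.0])   # reward R(s_i) for each state
gamma = 0.9                     # discount factor
pi = np.array([0, 0, 0])        # deterministic policy: always action 0

# Policy evaluation: solve (I - gamma * P_pi) V = R exactly.
P_pi = P[pi, np.arange(3)]      # (3, 3) transition matrix under pi
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
```

Here `V[2] = R[2] / (1 - γ) = 10` for the absorbing rewarded state, and the values of the earlier states are discounted accordingly.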
of an expert.

Algorithm:
1. Evaluate the expert's behavior.
2. Learn a policy using the estimated reward function.
3. Update the reward function to bring it closer to the expert's evaluation.
4. Go back to 2.

Method                                         Evaluation        Reward function    Problem
Linear Programming [Ng and Russell, 2000] (*)  Policy            Linear             Max Margin
Apprenticeship learning [Abbeel and Ng, 2004]  State transition  Linear             Max Margin
Max Entropy [Ziebart et al., 2008]             State transition  Linear/Non-linear  Max Entropy
Bayesian [Ramachandran and Amir, 2007]         State transition  Distribution       Bayesian Inference

(*) Discussed in the last J.C.

We assume that we have M expert trajectories ξ_i, i = 1, ..., M.
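As a toy, runnable illustration of steps 1–4 (none of the concrete choices below come from any of the cited papers): the "policy" induced by the current reward is modeled as a softmax state-visitation, and the reward is nudged toward the expert's behavior each round:

```python
import numpy as np

N = 3
# Hypothetical expert state-visitation frequencies (step 1: evaluate expert).
expert_visits = np.array([0.2, 0.3, 0.5])

theta = np.zeros(N)   # reward parameters, R(s) = theta[s]
for _ in range(200):
    # Step 2: "learn a policy" -- here, the behavior induced by the current
    # reward is modeled as a softmax visitation over states.
    visits = np.exp(theta) / np.exp(theta).sum()
    # Step 3: update the reward so induced behavior moves toward the expert's.
    theta += 0.5 * (expert_visits - visits)
    # Step 4: loop back to step 2.
visits = np.exp(theta) / np.exp(theta).sum()
```

After the loop, the visitation induced by the learned reward closely matches the expert's, which is the fixed point every method in the table above seeks in its own way.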
shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories." (Abstract)
under a given condition is the most non-committal distribution: it contains no information beyond the condition itself. Choosing a distribution on this basis is called the "maximum entropy principle". Non-negativity of the probabilities and normalization (they sum to 1) are always imposed as constraints.
Maximize:  H(p) = −∑_{i=1}^{N} p_i log p_i
Constraint:  ∑_i p_i = 1
Lagrangian:  L(p; λ) = −∑_{i=1}^{N} p_i log p_i − λ(∑_i p_i − 1)
Solution:  p_i = 1/N, so H(p) = −∑_{i=1}^{N} (1/N) log(1/N) = log N
That is, the uniform distribution is the one that maximizes entropy when nothing beyond normalization is imposed.
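The derivation referenced on the slide is the standard Lagrange-multiplier argument, which can be sketched as:

```latex
% Stationarity of L(p;\lambda) = -\sum_i p_i \log p_i - \lambda(\sum_i p_i - 1):
\frac{\partial L}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\;\Longrightarrow\; p_i = e^{-(1+\lambda)} \quad \text{(the same value for every } i\text{)}.
% The normalization constraint \sum_{i=1}^{N} p_i = 1 then fixes
p_i = \frac{1}{N}, \qquad H(p) = \log N .
```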
Maximize:  H(p) = −∑_{i=1}^{N} p_i log p_i
Constraints:  ∑_i p_i = 1,  ∑_i E_i p_i = U
Lagrangian:  L(p; α, β) = −∑_{i=1}^{N} p_i log p_i − α(∑_i p_i − 1) − β(∑_i E_i p_i − U)
Interpretation of the second constraint: the probability of a state with energy E_i is p_i, and the expected energy is fixed at some value U.
Solution:  p_i = exp(−βE_i)/Z(β),  where Z(β) = ∑_j exp(−βE_j)
This is called the Boltzmann distribution.
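The stated solution can be checked numerically in a few lines of Python (the energies and β below are arbitrary illustrative values):

```python
import numpy as np

# Boltzmann distribution for illustrative energies E_i at inverse temperature beta.
E = np.array([0.0, 1.0, 2.0])
beta = 1.0

p = np.exp(-beta * E)
p /= p.sum()                  # divide by Z(beta) = sum_j exp(-beta * E_j)

U = (E * p).sum()             # expected energy, fixed by the second constraint
H = -(p * np.log(p)).sum()    # entropy of the maximum-entropy solution
```

Lower-energy states receive higher probability, and the entropy H is strictly below log N = log 3 because the energy constraint removes probability mass from the uniform solution.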
Replacing (negative) energies with rewards, the Boltzmann form carries over:

p_i = exp(−βE_i) / ∑_j exp(−βE_j)  →  P(ξ_i|θ) = exp(R(ξ_i|θ)) / ∑_{j=1}^{M} exp(R(ξ_j|θ))

This distribution assigns probability P(ξ_i|θ) according to the reward R(ξ_i|θ) while maximizing entropy.
∗ P(ξ_i|θ): probability that the expert takes trajectory ξ_i.
∗ R(ξ_i|θ): reward of trajectory ξ_i.
The objective is to find the R(ξ_i|θ) that maximizes the likelihood ∏_i P(ξ_i|θ) of the expert trajectories!
The reward is linear in state features:
R(ξ_i) = θ^T f_{ξ_i} = ∑_{s ∈ ξ_i} θ^T f_s,  where f_s is a one-hot feature vector.
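A hedged sketch of fitting θ by gradient ascent on this likelihood, normalizing over a small finite candidate set of trajectories for tractability (the trajectories, state count, and step size are invented for illustration; the gradient of the log-likelihood is the expert feature mean minus the model's expected features):

```python
import numpy as np

N_STATES = 4
trajs = [[0, 1, 3], [0, 2, 3], [0, 1, 2]]     # candidate trajectories xi_j
expert = [[0, 1, 3], [0, 1, 3], [0, 2, 3]]    # observed expert trajectories

def feat(traj):
    """Trajectory feature f_xi = sum of one-hot state features f_s."""
    f = np.zeros(N_STATES)
    for s in traj:
        f[s] += 1.0
    return f

F = np.stack([feat(t) for t in trajs])            # features of candidates
f_expert = np.mean([feat(t) for t in expert], 0)  # empirical feature mean

theta = np.zeros(N_STATES)
for _ in range(200):
    r = F @ theta                          # R(xi|theta) = theta^T f_xi
    p = np.exp(r - r.max())
    p /= p.sum()                           # P(xi|theta): softmax over candidates
    grad = f_expert - p @ F                # expert feature mean - model mean
    theta += 0.1 * grad                    # gradient ascent on log-likelihood
```

At convergence the model's expected features match the expert's, so the distribution concentrates on [0, 1, 3], the path the expert demonstrated most often.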
with known structure. We model this structure for the road network surrounding Pittsburgh, Pennsylvania, as a deterministic MDP with over 300,000 states (i.e., road segments) and 900,000 actions (i.e., transitions at intersections)."

Features:
∗ Road type
∗ Speed
∗ Lanes
∗ Transitions
path distribution given a partially traveled path. The partial path is heading westward, which is a very inefficient (i.e., improbable) partial route to any of the eastern destinations (3, 4, 5). The posterior destination probability is therefore split between destinations 1 and 2, primarily based on the prior distribution over destinations. Right: posterior prediction accuracy over five destinations given a partial path. [Ziebart et al., 2008], Figures 4 and 5.
used to represent the reward is too simple → [Wulfmeier et al., 2015]
∗ High computational complexity → [Boularias et al., 2011, Finn et al., 2016]

Relative Entropy IRL [Boularias et al., 2011] uses importance sampling to avoid policy optimization. When estimating the expected state-transition features, trajectories close to the expert's state transitions receive high weight and distant ones low weight. Because samples from a non-optimized policy are reweighted appropriately, the feature expectations f can be estimated without learning a policy at all. Guided Cost Learning [Finn et al., 2016] further improves efficiency by sampling the trajectories used to compute the state-transition feature expectations from the policy being learned.
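The importance-sampling idea can be sketched as follows (the sampling distribution q, the feature vectors, and θ are all illustrative, not from [Boularias et al., 2011]; the weights follow the self-normalized form w_i ∝ exp(θᵀ f_{ξ_i}) / q(ξ_i)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature vectors of 1000 trajectories sampled from an arbitrary policy q
# (no policy optimization is needed -- any sampler with known q works).
F = rng.normal(size=(1000, 3))     # f_xi for each sampled trajectory
q = np.full(1000, 1.0 / 1000)      # sampling probability of each trajectory
theta = np.array([1.0, -0.5, 0.0])

# Self-normalized importance weights: trajectories whose features score
# high under theta (i.e., resemble the expert's) dominate the estimate.
w = np.exp(F @ theta) / q
w /= w.sum()

f_expected = w @ F   # estimated feature expectation under P(xi|theta)
```

The reweighted average tilts toward features favored by θ, mimicking expectations under the MaxEnt trajectory distribution without ever optimizing a policy.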
[Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 1–8.
[Boularias et al., 2011] Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning.
[Finn et al., 2016] Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization.
[Ng and Russell, 2000] Ng, A. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 663–670.
[Ramachandran and Amir, 2007] Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2586–2591.
[Wulfmeier et al., 2015] Wulfmeier, M., Ondruska, P., and Posner, I. (2015). Maximum entropy deep inverse reinforcement learning.
[yrlu] yrlu/irl-imitation: Implementation of inverse reinforcement learning (IRL) algorithms in Python/TensorFlow. Deep MaxEnt, MaxEnt, LPIRL. GitHub repository.
[Ziebart et al., 2008] Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the National Conference on Artificial Intelligence (AAAI 2008), volume 3, pages 1433–1438.