
Journal Club: Maximum Entropy Inverse Reinforcement Learning

Keita Watanabe
October 30, 2021

Transcript

  1. Inverse Reinforcement Learning

    The problem of inverse reinforcement learning (IRL) in Markov decision processes is the construction of a reward function given observed expert behavior. (Slide figure: RL and IRL.)
  2. Markov Decision Processes (MDP)

    An MDP $M$ is a tuple $M = (S, A, P, \gamma, R)$, where
    ∗ $S = \{s_1, s_2, \dots\}$ is a finite set of $N$ states,
    ∗ $A = \{a_1, a_2, \dots\}$ is a finite set of $k$ actions,
    ∗ $P(s_j \mid s_i, a_{i,j})$ is the probability of transitioning from state $s_i$ to state $s_j$ when action $a_{i,j}$ is taken,
    ∗ $\gamma \in [0, 1)$ is the discount factor, and
    ∗ $R : S \to \mathbb{R}$ is the reward function.
    A policy is a map $\pi : S \to A$.
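    As a concrete illustration (my own minimal sketch, not from the slides), the tuple above can be written directly with NumPy arrays; the chain-world states, actions, and deterministic transitions below are assumptions chosen only to keep the example small:

        import numpy as np

        # Hypothetical 1-D chain MDP with N states and two actions (left/right),
        # matching the tuple M = (S, A, P, gamma, R) defined on the slide.
        N, K = 5, 2                       # number of states, number of actions
        gamma = 0.9                       # discount factor in [0, 1)

        # P[a, i, j] = probability of landing in state j after action a in state i
        P = np.zeros((K, N, N))
        for i in range(N):
            P[0, i, max(i - 1, 0)] = 1.0       # action 0: step left (deterministic)
            P[1, i, min(i + 1, N - 1)] = 1.0   # action 1: step right (deterministic)

        R = np.zeros(N)
        R[-1] = 1.0                       # reward only in the last state

        pi = np.ones(N, dtype=int)        # a deterministic policy S -> A: always step right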
  3. Three different types of IRL

    Objective: estimate the reward function of an expert.
    Generic algorithm:
    1 Evaluate the expert's behavior.
    2 Learn a policy using the estimated reward function.
    3 Update the reward function so that its evaluation moves closer to the expert's.
    4 Go back to 2.

    Method                                        | Action Evaluation | Reward Function   | Problem
    Linear Programming [Ng and Russell, 2000] (*) | Policy            | Linear            | Max Margin
    Apprenticeship learning [Abbeel and Ng, 2004] | State Transition  | Linear            | Max Margin
    Max Entropy [Ziebart et al., 2008]            | State Transition  | Linear/Non-Linear | Max Entropy
    Bayesian [Ramachandran and Amir, 2007]        | State Transition  | Distribution      | Bayesian Inference

    (*) Discussed in the last journal club. We assume that we have $M$ expert trajectories $\xi_i$, $i = 1, \dots, M$.
  4. Today we focus on Maximum Entropy IRL

    “Recent research has shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories.” (Abstract)
  5. The principle of maximum entropy

    The distribution that maximizes entropy under given constraints is the most universal one: it carries no information beyond what the constraints impose. Choosing a distribution in this way is called the “maximum entropy principle”. Non-negativity of the probabilities and normalization (they sum to 1) are always required in addition to the stated constraints.
  6. Unconstrained case

    Objective: $H(p) = -\sum_{i=1}^{N} p_i \log p_i$
    Constraint: $\sum_i p_i = 1$
    Lagrangian: $L(p; \lambda) = -\sum_{i=1}^{N} p_i \log p_i - \lambda \left( \sum_i p_i - 1 \right)$
    Solution: $p_i = \frac{1}{N}$, giving $H(p) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N$.
    (*) Derivation: see the sketch below.
    That is, the uniform distribution is the one that maximizes entropy when nothing but normalization is imposed.
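    The derivation itself lives in backup slides that are not part of the transcript; the following is a standard reconstruction of the Lagrangian argument:

        % Stationarity of L(p; \lambda) with respect to each p_i:
        \frac{\partial L}{\partial p_i} = -\log p_i - 1 - \lambda = 0
        \quad\Rightarrow\quad p_i = e^{-1 - \lambda} \;\text{(the same constant for every } i\text{)}.
        % The normalization constraint fixes that constant:
        \sum_{i=1}^{N} p_i = N e^{-1 - \lambda} = 1
        \quad\Rightarrow\quad p_i = \frac{1}{N},
        \qquad
        H(p) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N .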
  7. Constrained case

    Objective: $H(p) = -\sum_{i=1}^{N} p_i \log p_i$
    Constraints: $\sum_i p_i = 1$, $\sum_i E_i p_i = U$
    Lagrangian: $L(p; \alpha, \beta) = -\sum_{i=1}^{N} p_i \log p_i - \alpha \left( \sum_i p_i - 1 \right) - \beta \left( \sum_i E_i p_i - U \right)$
    Interpretation of the second constraint: if $p_i$ is the probability of a state with energy $E_i$, then the expected energy is fixed at some value $U$.
    Solution: $p_i = \frac{\exp(-\beta E_i)}{Z(\beta)} = \frac{\exp(-\beta E_i)}{\sum_j \exp(-\beta E_j)}$
    This is called the Boltzmann distribution.
    (*) Derivation: see the sketch below.
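    Again the derivation is in the backup slides; a short reconstruction of the standard argument:

        % Stationarity of L(p; \alpha, \beta) with respect to each p_i:
        \frac{\partial L}{\partial p_i} = -\log p_i - 1 - \alpha - \beta E_i = 0
        \quad\Rightarrow\quad p_i = e^{-1 - \alpha} \, e^{-\beta E_i}.
        % Normalization absorbs the constant e^{-1-\alpha} into the partition function:
        \sum_j p_j = 1
        \quad\Rightarrow\quad
        p_i = \frac{e^{-\beta E_i}}{Z(\beta)}, \qquad Z(\beta) = \sum_j e^{-\beta E_j},
        % and \beta is determined implicitly by the energy constraint \sum_i E_i p_i = U.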
  8. The principle of maximum entropy in IRL

    $p_i = \frac{\exp(-\beta E_i)}{\sum_j \exp(-\beta E_j)} \quad\rightarrow\quad P(\xi_i \mid \theta) = \frac{\exp(R(\xi_i \mid \theta))}{\sum_{j=1}^{M} \exp(R(\xi_j \mid \theta))}$
    This distribution assigns probability $P(\xi_i \mid \theta)$ to a trajectory with reward $R(\xi_i \mid \theta)$ while maximizing entropy.
    ∗ $P(\xi_i \mid \theta)$: probability that the expert takes trajectory $\xi_i$.
    ∗ $R(\xi_i \mid \theta)$: reward of trajectory $\xi_i$.
    The objective is to find the $R(\xi_i \mid \theta)$ that maximizes the likelihood on the next slide, with
    $R(\xi_i) = \theta^T f_{\xi_i} = \sum_{s \in \xi_i} \theta^T f_s$,
    where $f_s$ is a one-hot feature vector. (A numerical sketch follows below.)
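    As a small numerical sketch (my own, not from the paper), the distribution above is simply a softmax over trajectory rewards; the toy trajectories and weights below are assumptions:

        import numpy as np

        # MaxEnt trajectory distribution P(xi | theta) ∝ exp(theta^T f_xi), with the
        # linear reward R(xi) = theta^T f_xi and f_xi the sum of one-hot state features.
        n_states = 4
        theta = np.array([0.0, 0.2, 0.5, 1.0])        # one weight per state feature

        # Toy expert trajectories, given as sequences of state indices.
        trajectories = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]

        def traj_features(traj, n_states):
            """Sum of the one-hot feature vectors of the states visited."""
            f = np.zeros(n_states)
            for s in traj:
                f[s] += 1.0
            return f

        rewards = np.array([theta @ traj_features(t, n_states) for t in trajectories])
        probs = np.exp(rewards - rewards.max())       # subtract the max for numerical stability
        probs /= probs.sum()                          # normalized over the demo set, as on the slide
        print(probs)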
  9. Learning the parameter θ

    $\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{i=1}^{M} \log P(\xi_i \mid \theta) = \arg\max_\theta \left\{ \frac{1}{M} \sum_{i=1}^{M} \theta^T f_{\xi_i} - \log \sum_{j=1}^{M} \exp(\theta^T f_{\xi_j}) \right\}$
    $\nabla L(\theta) = \frac{1}{M} \sum_{i=1}^{M} f_{\xi_i} - \sum_{\xi} P(\xi \mid \theta) f_{\xi} = \tilde{f} - \sum_{s_i} D_{s_i} f_{s_i}$
    $\tilde{f}$: the expert's empirical feature expectation; $D_{s_i}$: expected state visitation frequencies.
    (*) Derivation: a numerical sketch of this gradient ascent follows below.
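    A simplified end-to-end sketch of the gradient ascent (my own illustration on a toy chain MDP with one-hot features, a short fixed horizon, and an assumed uniform start distribution; not the paper's code):

        import numpy as np

        # Toy chain MDP: action 0 = step left, action 1 = step right (deterministic).
        n_states, n_actions, horizon = 5, 2, 5
        P = np.zeros((n_actions, n_states, n_states))   # P[a, i, j]
        for s in range(n_states):
            P[0, s, max(s - 1, 0)] = 1.0
            P[1, s, min(s + 1, n_states - 1)] = 1.0

        features = np.eye(n_states)                     # one-hot state features f_s
        demos = [[0, 1, 2, 3, 4], [1, 2, 3, 4, 4]]      # toy expert trajectories
        f_expert = np.mean([features[d].sum(axis=0) for d in demos], axis=0)   # expert feature expectation

        theta = np.zeros(n_states)
        for _ in range(200):
            r = features @ theta                        # current reward estimate per state
            # Backward pass: soft (maximum entropy) value iteration -> stochastic policy.
            V = np.zeros(n_states)
            for _ in range(horizon):
                Q = r[:, None] + np.einsum('aij,j->ia', P, V)   # Q[s, a]
                V = np.log(np.exp(Q).sum(axis=1))               # soft maximum over actions
            policy = np.exp(Q - V[:, None])                     # pi(a | s)
            # Forward pass: expected state visitation frequencies D_s over the horizon.
            d = np.ones(n_states) / n_states                    # assumed uniform start distribution
            D = d.copy()
            for _ in range(horizon - 1):
                d = np.einsum('s,sa,asj->j', d, policy, P)
                D += d
            grad = f_expert - features.T @ D                    # gradient: expert minus model feature expectation
            theta += 0.01 * grad                                # gradient ascent on L(theta)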
  10. Driver Route Modeling

    “Road networks present a large planning space with known structure. We model this structure for the road network surrounding Pittsburgh, Pennsylvania, as a deterministic MDP with over 300,000 states (i.e., road segments) and 900,000 actions (i.e., transitions at intersections).”
    Features:
    ∗ Road type
    ∗ Speed
    ∗ Lanes
    ∗ Transitions
  11. Destination Prediction

    Left: destination distribution (over 5 destinations) and remaining-path distribution given a partially traveled path. The partially traveled path is heading westward, which is a very inefficient (i.e., improbable) partial route to any of the eastern destinations (3, 4, 5). The posterior destination probability is therefore split between destinations 1 and 2, primarily based on the prior distribution over destinations. Right: posterior prediction accuracy over five destinations given a partial path. ([Ziebart et al., 2008], Figures 4 and 5.)
  12. Two problems in [Ziebart et al., 2008]

    ∗ The model used to represent the reward is too simple → [Wulfmeier et al., 2015]
    ∗ High computational complexity → [Boularias et al., 2011, Finn et al., 2016]
    Relative Entropy IRL [Boularias et al., 2011] uses importance sampling to avoid optimizing a policy. When estimating the feature expectations over state transitions, sampled trajectories whose state transitions are close to the expert's receive high importance weights, and distant ones receive low weights. Because the samples are reweighted in this way even though they do not come from an optimized policy, the feature expectations can be estimated without learning a policy (see the sketch below). Guided Cost Learning [Finn et al., 2016] improves efficiency further by drawing the trajectories used to estimate those feature expectations from the policy currently being learned.
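    A minimal sketch of the importance-sampling idea (my own illustration with an assumed uniform sampling distribution and toy parameters, not the authors' code):

        import numpy as np

        # Estimate the feature expectation under P(xi | theta) ∝ exp(theta^T f_xi)
        # by reweighting trajectories drawn from a simple baseline distribution q,
        # instead of sampling from an optimized policy.
        rng = np.random.default_rng(0)
        n_states = 4
        theta = np.array([0.0, 0.3, 0.6, 1.0])

        def sample_trajectory(length=5):
            return rng.integers(0, n_states, size=length)    # uniform random walk ~ q

        samples = [sample_trajectory() for _ in range(1000)]
        F = np.array([np.bincount(t, minlength=n_states) for t in samples], dtype=float)

        log_w = F @ theta            # log weight ∝ theta^T f_xi - log q(xi); q is uniform here,
        log_w -= log_w.max()         # so it only adds a constant that normalization removes
        w = np.exp(log_w)
        w /= w.sum()                 # self-normalized importance weights

        f_hat = w @ F                # estimated feature expectation under P(xi | theta)
        print(f_hat)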
  13. Maximum Entropy Deep Inverse Reinforcement Learning [Wulfmeier et al., 2015]

    Schema of a neural-network-based approximation of the reward function from the feature representation of the MDP states ([Wulfmeier et al., 2015], Fig. 2):
    $r \approx g(f, \theta_1, \theta_2, \dots, \theta_n) = g_1(g_2(\dots g_n(f, \theta_n) \dots, \theta_2), \theta_1)$
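    A minimal sketch of this nested composition (a hypothetical two-layer network of my own, not the architecture from the paper):

        import numpy as np

        # r ≈ g1(g2(f, theta_2), theta_1): each layer g_k is an affine map plus a
        # nonlinearity, and the outermost layer returns a scalar reward per state.
        rng = np.random.default_rng(0)
        feature_dim, hidden = 3, 8

        theta2 = (0.1 * rng.normal(size=(hidden, feature_dim)), np.zeros(hidden))  # (W2, b2)
        theta1 = (0.1 * rng.normal(size=(1, hidden)), np.zeros(1))                 # (W1, b1)

        def g2(f, theta):
            W, b = theta
            return np.maximum(W @ f + b, 0.0)     # inner layer with ReLU

        def g1(h, theta):
            W, b = theta
            return (W @ h + b).item()             # outer layer: scalar reward

        def reward(f):
            return g1(g2(f, theta2), theta1)      # nested form from the slide

        print(reward(np.array([1.0, 0.0, 0.5])))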
  14. Reward reconstruction sample for the Binaryworld benchmark, given N = 128 demonstrations

    White: high reward; black: low reward. ([Wulfmeier et al., 2015], Fig. 7)
  15. References I

    [Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 1–8.
    [Boularias et al., 2011] Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
    [Finn et al., 2016] Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016).
    [Ng and Russell, 2000] Ng, A. Y. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 663–670.
    [Ramachandran and Amir, 2007] Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2586–2591.
  16. References II

    [Wulfmeier et al., 2015] Wulfmeier, M., Ondruska, P., and Posner, I. (2015). Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888.
    [Yrlu] yrlu/irl-imitation: Implementation of Inverse Reinforcement Learning (IRL) algorithms in Python/TensorFlow (Deep MaxEnt, MaxEnt, LPIRL). GitHub repository.
    [Ziebart et al., 2008] Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008), pages 1433–1438.