
Journal Club: Maximum Entropy Inverse Reinforcement Learning

Keita Watanabe
October 30, 2021

Transcript

  1. Inverse Reinforcement Learning

    The problem of inverse reinforcement learning (IRL) in Markov decision processes is the construction of a reward function given observed expert behavior. (Slide figure: RL and IRL.)
  2. Markov Decision Processes (MDP)

    An MDP $M$ is a tuple $M = (S, A, P, \gamma, R)$, where
    ∗ $S = \{s_1, s_2, \dots\}$ is a finite set of $N$ states,
    ∗ $A = \{a_1, a_2, \dots\}$ is a finite set of $k$ actions,
    ∗ $P(s_j \mid s_i, a_{i,j})$ is the probability of transitioning from state $s_i$ to state $s_j$ when action $a_{i,j}$ is taken,
    ∗ $\gamma \in [0, 1)$ is the discount factor, and
    ∗ $R : S \to \mathbb{R}$ is the reward function.
    A policy is a map $\pi : S \to A$.
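    As a concrete illustration (my own minimal sketch, not from the slides), the tuple above can be written directly with NumPy arrays; the chain-world states, actions, and deterministic transitions below are assumptions chosen only to keep the example small:

        import numpy as np

        # Hypothetical 1-D chain MDP with N states and two actions (left/right),
        # matching the tuple M = (S, A, P, gamma, R) defined on the slide.
        N, K = 5, 2                       # number of states, number of actions
        gamma = 0.9                       # discount factor in [0, 1)

        # P[a, i, j] = probability of landing in state j after action a in state i
        P = np.zeros((K, N, N))
        for i in range(N):
            P[0, i, max(i - 1, 0)] = 1.0       # action 0: step left (deterministic)
            P[1, i, min(i + 1, N - 1)] = 1.0   # action 1: step right (deterministic)

        R = np.zeros(N)
        R[-1] = 1.0                       # reward only in the last state

        pi = np.ones(N, dtype=int)        # a deterministic policy S -> A: always step right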
  3. Three different types of IRL

    Objective: estimate the reward function of an expert.
    Generic algorithm:
    1 Evaluate the expert's behavior.
    2 Learn a policy using the estimated reward function.
    3 Update the reward function so that its evaluation moves closer to the expert's.
    4 Go back to 2.

    Method                                        | Action Evaluation | Reward Function   | Problem
    Linear Programming [Ng and Russell, 2000] (*) | Policy            | Linear            | Max Margin
    Apprenticeship learning [Abbeel and Ng, 2004] | State Transition  | Linear            | Max Margin
    Max Entropy [Ziebart et al., 2008]            | State Transition  | Linear/Non-Linear | Max Entropy
    Bayesian [Ramachandran and Amir, 2007]        | State Transition  | Distribution      | Bayesian Inference

    (*) Discussed in the last journal club. We assume that we have $M$ expert trajectories $\xi_i$, $i = 1, \dots, M$.
  4. Today we focus on Maximum Entropy IRL

    “Recent research has shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories.” (Abstract)
  5. The principle of maximum entropy

    The distribution that maximizes entropy under given constraints is the most universal one: it carries no information beyond what the constraints impose. Choosing a distribution in this way is called the “maximum entropy principle”. Non-negativity of the probabilities and normalization (they sum to 1) are always required in addition to the stated constraints.
  6. Unconstrained case

    Objective: $H(p) = -\sum_{i=1}^{N} p_i \log p_i$
    Constraint: $\sum_i p_i = 1$
    Lagrangian: $L(p; \lambda) = -\sum_{i=1}^{N} p_i \log p_i - \lambda \left( \sum_i p_i - 1 \right)$
    Solution: $p_i = \frac{1}{N}$, giving $H(p) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N$.
    (*) Derivation: see the sketch below.
    That is, the uniform distribution is the one that maximizes entropy when nothing but normalization is imposed.
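    The derivation itself lives in backup slides that are not part of the transcript; the following is a standard reconstruction of the Lagrangian argument:

        % Stationarity of L(p; \lambda) with respect to each p_i:
        \frac{\partial L}{\partial p_i} = -\log p_i - 1 - \lambda = 0
        \quad\Rightarrow\quad p_i = e^{-1 - \lambda} \;\text{(the same constant for every } i\text{)}.
        % The normalization constraint fixes that constant:
        \sum_{i=1}^{N} p_i = N e^{-1 - \lambda} = 1
        \quad\Rightarrow\quad p_i = \frac{1}{N},
        \qquad
        H(p) = -\sum_{i=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N .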
  7. Constrained case

    Objective: $H(p) = -\sum_{i=1}^{N} p_i \log p_i$
    Constraints: $\sum_i p_i = 1$, $\sum_i E_i p_i = U$
    Lagrangian: $L(p; \alpha, \beta) = -\sum_{i=1}^{N} p_i \log p_i - \alpha \left( \sum_i p_i - 1 \right) - \beta \left( \sum_i E_i p_i - U \right)$
    Interpretation of the second constraint: if $p_i$ is the probability of a state with energy $E_i$, then the expected energy is fixed at some value $U$.
    Solution: $p_i = \frac{\exp(-\beta E_i)}{Z(\beta)} = \frac{\exp(-\beta E_i)}{\sum_j \exp(-\beta E_j)}$
    This is called the Boltzmann distribution.
    (*) Derivation: see the sketch below.
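    Again the derivation is in the backup slides; a short reconstruction of the standard argument:

        % Stationarity of L(p; \alpha, \beta) with respect to each p_i:
        \frac{\partial L}{\partial p_i} = -\log p_i - 1 - \alpha - \beta E_i = 0
        \quad\Rightarrow\quad p_i = e^{-1 - \alpha} \, e^{-\beta E_i}.
        % Normalization absorbs the constant e^{-1-\alpha} into the partition function:
        \sum_j p_j = 1
        \quad\Rightarrow\quad
        p_i = \frac{e^{-\beta E_i}}{Z(\beta)}, \qquad Z(\beta) = \sum_j e^{-\beta E_j},
        % and \beta is determined implicitly by the energy constraint \sum_i E_i p_i = U.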
  8. The principle of maximum entropy in IRL

    $p_i = \frac{\exp(-\beta E_i)}{\sum_j \exp(-\beta E_j)} \quad\rightarrow\quad P(\xi_i \mid \theta) = \frac{\exp(R(\xi_i \mid \theta))}{\sum_{j=1}^{M} \exp(R(\xi_j \mid \theta))}$
    This distribution assigns probability $P(\xi_i \mid \theta)$ to a trajectory with reward $R(\xi_i \mid \theta)$ while maximizing entropy.
    ∗ $P(\xi_i \mid \theta)$: probability that the expert takes trajectory $\xi_i$.
    ∗ $R(\xi_i \mid \theta)$: reward of trajectory $\xi_i$.
    The objective is to find the $R(\xi_i \mid \theta)$ that maximizes the likelihood on the next slide, with
    $R(\xi_i) = \theta^T f_{\xi_i} = \sum_{s \in \xi_i} \theta^T f_s$,
    where $f_s$ is a one-hot feature vector. (A numerical sketch follows below.)
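    As a small numerical sketch (my own, not from the paper), the distribution above is simply a softmax over trajectory rewards; the toy trajectories and weights below are assumptions:

        import numpy as np

        # MaxEnt trajectory distribution P(xi | theta) ∝ exp(theta^T f_xi), with the
        # linear reward R(xi) = theta^T f_xi and f_xi the sum of one-hot state features.
        n_states = 4
        theta = np.array([0.0, 0.2, 0.5, 1.0])        # one weight per state feature

        # Toy expert trajectories, given as sequences of state indices.
        trajectories = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]

        def traj_features(traj, n_states):
            """Sum of the one-hot feature vectors of the states visited."""
            f = np.zeros(n_states)
            for s in traj:
                f[s] += 1.0
            return f

        rewards = np.array([theta @ traj_features(t, n_states) for t in trajectories])
        probs = np.exp(rewards - rewards.max())       # subtract the max for numerical stability
        probs /= probs.sum()                          # normalized over the demo set, as on the slide
        print(probs)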
  9. Learning the parameter θ

    $\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{i=1}^{M} \log P(\xi_i \mid \theta) = \arg\max_\theta \left\{ \frac{1}{M} \sum_{i=1}^{M} \theta^T f_{\xi_i} - \log \sum_{j=1}^{M} \exp(\theta^T f_{\xi_j}) \right\}$
    $\nabla L(\theta) = \frac{1}{M} \sum_{i=1}^{M} f_{\xi_i} - \sum_{\xi} P(\xi \mid \theta) f_{\xi} = \tilde{f} - \sum_{s_i} D_{s_i} f_{s_i}$
    $\tilde{f}$: the expert's empirical feature expectation; $D_{s_i}$: expected state visitation frequencies.
    (*) Derivation: a numerical sketch of this gradient ascent follows below.
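    A simplified end-to-end sketch of the gradient ascent (my own illustration on a toy chain MDP with one-hot features, a short fixed horizon, and an assumed uniform start distribution; not the paper's code):

        import numpy as np

        # Toy chain MDP: action 0 = step left, action 1 = step right (deterministic).
        n_states, n_actions, horizon = 5, 2, 5
        P = np.zeros((n_actions, n_states, n_states))   # P[a, i, j]
        for s in range(n_states):
            P[0, s, max(s - 1, 0)] = 1.0
            P[1, s, min(s + 1, n_states - 1)] = 1.0

        features = np.eye(n_states)                     # one-hot state features f_s
        demos = [[0, 1, 2, 3, 4], [1, 2, 3, 4, 4]]      # toy expert trajectories
        f_expert = np.mean([features[d].sum(axis=0) for d in demos], axis=0)   # expert feature expectation

        theta = np.zeros(n_states)
        for _ in range(200):
            r = features @ theta                        # current reward estimate per state
            # Backward pass: soft (maximum entropy) value iteration -> stochastic policy.
            V = np.zeros(n_states)
            for _ in range(horizon):
                Q = r[:, None] + np.einsum('aij,j->ia', P, V)   # Q[s, a]
                V = np.log(np.exp(Q).sum(axis=1))               # soft maximum over actions
            policy = np.exp(Q - V[:, None])                     # pi(a | s)
            # Forward pass: expected state visitation frequencies D_s over the horizon.
            d = np.ones(n_states) / n_states                    # assumed uniform start distribution
            D = d.copy()
            for _ in range(horizon - 1):
                d = np.einsum('s,sa,asj->j', d, policy, P)
                D += d
            grad = f_expert - features.T @ D                    # gradient: expert minus model feature expectation
            theta += 0.01 * grad                                # gradient ascent on L(theta)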
  10. Driver Route Modeling

    “Road networks present a large planning space with known structure. We model this structure for the road network surrounding Pittsburgh, Pennsylvania, as a deterministic MDP with over 300,000 states (i.e., road segments) and 900,000 actions (i.e., transitions at intersections).”
    Features:
    ∗ Road type
    ∗ Speed
    ∗ Lanes
    ∗ Transitions
  11. Destination Prediction

    Left: destination distribution (over 5 destinations) and remaining-path distribution given a partially traveled path. The partially traveled path is heading westward, which is a very inefficient (i.e., improbable) partial route to any of the eastern destinations (3, 4, 5). The posterior destination probability is therefore split between destinations 1 and 2, primarily based on the prior distribution over destinations. Right: posterior prediction accuracy over five destinations given a partial path. ([Ziebart et al., 2008], Figures 4 and 5.)
  12. Two problems in [Ziebart et al., 2008]

    ∗ The model used to represent the reward is too simple → [Wulfmeier et al., 2015]
    ∗ High computational complexity → [Boularias et al., 2011, Finn et al., 2016]
    Relative Entropy IRL [Boularias et al., 2011] uses importance sampling to avoid optimizing a policy. When estimating the feature expectations over state transitions, sampled trajectories whose state transitions are close to the expert's receive high importance weights, and distant ones receive low weights. Because the samples are reweighted in this way even though they do not come from an optimized policy, the feature expectations can be estimated without learning a policy (see the sketch below). Guided Cost Learning [Finn et al., 2016] improves efficiency further by drawing the trajectories used to estimate those feature expectations from the policy currently being learned.
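    A minimal sketch of the importance-sampling idea (my own illustration with an assumed uniform sampling distribution and toy parameters, not the authors' code):

        import numpy as np

        # Estimate the feature expectation under P(xi | theta) ∝ exp(theta^T f_xi)
        # by reweighting trajectories drawn from a simple baseline distribution q,
        # instead of sampling from an optimized policy.
        rng = np.random.default_rng(0)
        n_states = 4
        theta = np.array([0.0, 0.3, 0.6, 1.0])

        def sample_trajectory(length=5):
            return rng.integers(0, n_states, size=length)    # uniform random walk ~ q

        samples = [sample_trajectory() for _ in range(1000)]
        F = np.array([np.bincount(t, minlength=n_states) for t in samples], dtype=float)

        log_w = F @ theta            # log weight ∝ theta^T f_xi - log q(xi); q is uniform here,
        log_w -= log_w.max()         # so it only adds a constant that normalization removes
        w = np.exp(log_w)
        w /= w.sum()                 # self-normalized importance weights

        f_hat = w @ F                # estimated feature expectation under P(xi | theta)
        print(f_hat)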
  13. Maximum Entropy Deep Inverse Reinforcement Learning [Wulfmeier et al., 2015]

    Schema of a neural-network-based approximation of the reward function from the feature representation of the MDP states ([Wulfmeier et al., 2015], Fig. 2):
    $r \approx g(f, \theta_1, \theta_2, \dots, \theta_n) = g_1(g_2(\dots g_n(f, \theta_n) \dots, \theta_2), \theta_1)$
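    A minimal sketch of this nested composition (a hypothetical two-layer network of my own, not the architecture from the paper):

        import numpy as np

        # r ≈ g1(g2(f, theta_2), theta_1): each layer g_k is an affine map plus a
        # nonlinearity, and the outermost layer returns a scalar reward per state.
        rng = np.random.default_rng(0)
        feature_dim, hidden = 3, 8

        theta2 = (0.1 * rng.normal(size=(hidden, feature_dim)), np.zeros(hidden))  # (W2, b2)
        theta1 = (0.1 * rng.normal(size=(1, hidden)), np.zeros(1))                 # (W1, b1)

        def g2(f, theta):
            W, b = theta
            return np.maximum(W @ f + b, 0.0)     # inner layer with ReLU

        def g1(h, theta):
            W, b = theta
            return (W @ h + b).item()             # outer layer: scalar reward

        def reward(f):
            return g1(g2(f, theta2), theta1)      # nested form from the slide

        print(reward(np.array([1.0, 0.0, 0.5])))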
  14. Reward reconstruction sample for the Binaryworld benchmark, given N = 128 demonstrations

    White: high reward; black: low reward. ([Wulfmeier et al., 2015], Fig. 7)
  15. References I

    [Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 1–8.
    [Boularias et al., 2011] Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
    [Finn et al., 2016] Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016).
    [Ng and Russell, 2000] Ng, A. Y. and Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 663–670.
    [Ramachandran and Amir, 2007] Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2586–2591.
  16. References II

    [Wulfmeier et al., 2015] Wulfmeier, M., Ondruska, P., and Posner, I. (2015). Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888.
    [Yrlu] yrlu/irl-imitation: Implementation of Inverse Reinforcement Learning (IRL) algorithms in Python/TensorFlow (Deep MaxEnt, MaxEnt, LPIRL). GitHub repository.
    [Ziebart et al., 2008] Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI 2008), pages 1433–1438.