▪ A set of states s ∈ S
▪ A set of actions (per state) A
▪ A model T(s,a,s’)
▪ A reward function R(s,a,s’)
▪ Looking for a policy π(s)
▪ Don’t know T or R, so must try out actions
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Compute V*, Q*, π*            Value / policy iteration
Evaluate a fixed policy π     Policy evaluation

Unknown MDP: Model-Based
Goal                          Technique
Compute V*, Q*, π*            VI/PI on approx. MDP
Evaluate a fixed policy π     PE on approx. MDP

Unknown MDP: Model-Free
Goal                          Technique
Compute V*, Q*, π*            Q-learning
Evaluate a fixed policy π     Value learning

>> Could humans help the agent’s learning?
>> Could the agent learn from humans?
• Verbal Communication in Social Robots
• Communicating with Nonverbal Behaviors
• Understanding Human Intentions
• Human-Robot Collaboration
• Social Navigation
• Robot Learning from Humans
From Computational Human-Robot Interaction by Thomaz et al.
1. Implicit Imitation in Multiagent Reinforcement Learning by Price and Boutilier, ICML 1999
2. Apprenticeship Learning via Inverse Reinforcement Learning by Abbeel and Ng, ICML 2004
3. Interactively shaping agents via human reinforcement: The TAMER framework by Knox and Stone, K-CAP 2009
4. Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao, NIPS 2006
• Agents do not interact
• The mentor is executing π*
• At each iteration, the observer is at s_o, takes a_o, and ends up in t_o; it also observes the mentor moving from s_m to t_m
• The observer knows R(s) but does not know Pr(t | s, a), and is looking for π
value iteration) with the following changes:
A. Augmented Bellman Backup
With each transition <s, a, t> the observer takes, the estimated observer model Pr_o(t | s, a) is updated and an augmented backup (Equation 2) is performed at state s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.
B. Focusing Mechanism
With each observed mentor transition <s, t>, the estimated mentor model Pr_m(t | s) is updated and an augmented backup is performed at s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.
From Implicit Imitation in Multiagent Reinforcement Learning by Price and Boutilier
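To make the augmented backup concrete, here is a minimal sketch. It assumes the backup takes the form V(s) = R(s) + γ max{ max_a Σ_t Pr_o(t | s, a) V(t), Σ_t Pr_m(t | s) V(t) }, i.e., the observer backs up the better of its own best action and the mentor's observed behavior. The array names `P_obs` and `P_mentor`, the toy numbers, and the function name are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def augmented_backup(s, V, R, P_obs, P_mentor, gamma=0.9):
    """One augmented Bellman backup at state s (illustrative sketch).

    P_obs[s, a, t]  : estimated observer model Pr_o(t | s, a)
    P_mentor[s, t]  : estimated mentor model Pr_m(t | s)
    """
    own_best = (P_obs[s] @ V).max()   # max over the observer's own actions
    mentor = P_mentor[s] @ V          # value of following the mentor's transitions
    return R[s] + gamma * max(own_best, mentor)

# Toy example with 3 states and 2 actions; only the rows for state 0 are filled in.
V = np.array([0.0, 1.0, 10.0])
R = np.array([0.0, 0.0, 1.0])
P_obs = np.zeros((3, 2, 3))
P_obs[0, 0, 0] = 1.0      # action 0 was only ever seen to stay in state 0
P_obs[0, 1, 1] = 1.0      # action 1 was only ever seen to reach state 1
P_mentor = np.zeros((3, 3))
P_mentor[0, 2] = 1.0      # the mentor was observed moving from state 0 to state 2

backed_up = augmented_backup(0, V, R, P_obs, P_mentor)
```

In this toy case the mentor term dominates, so the observer's value at state 0 reflects a transition it has not yet managed to produce with its own actions.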
different from that of the agent, e.g., a humanoid robot imitating a human body movement
• Discusses potential extensions to model-free approaches and feature-based representations
• Limitations
  • State space is huge
  • Hard to find an optimal mentor
  • The agent designer needs to define R
2. Apprenticeship Learning via Inverse Reinforcement Learning by Abbeel and Ng, ICML 2004
Inverse Reinforcement Learning
Problem setup
• P_sa(s_{t+1} | s_t, a_t)
• No reward function
• Teacher’s demonstration: s0, a0, s1, a1, s2, a2, … (= trace of the teacher’s policy π*)
• Inverse RL:
  • Can we recover R?
• Apprenticeship learning via inverse RL:
  • Can we then use this R to find a good policy?
• Behavioral cloning:
  • Can we directly learn the teacher’s policy using supervised learning?
Courtesy of Pieter Abbeel
that the expert outperforms other policies
Feature matching
• Let R(s) = w⊤φ(s), where w ∈ ℝⁿ and φ : S → ℝⁿ.
• Find w* such that w*⊤μ(π*) ≥ w*⊤μ(π) ∀π
• Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match:
  ‖μ(π) − μ(π*)‖₁ ≤ ε implies that for all w* with ‖w*‖∞ ≤ 1: |w*⊤μ(π) − w*⊤μ(π*)| ≤ ε
Courtesy of Pieter Abbeel
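The feature-matching bound can be checked numerically. The sketch below assumes μ(π) denotes the discounted feature expectations E[Σ_t γ^t φ(s_t) | π], estimated here by averaging over sample trajectories; the one-hot features, the two toy trajectories, and the weight vector are made up for illustration. The bound itself is an instance of Hölder's inequality, |w⊤x| ≤ ‖w‖∞ ‖x‖₁.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t)]."""
    total = sum(
        sum(gamma ** t * phi(s) for t, s in enumerate(traj))
        for traj in trajectories
    )
    return total / len(trajectories)

def phi(s, n_states=3):
    """One-hot state features (an illustrative choice)."""
    v = np.zeros(n_states)
    v[s] = 1.0
    return v

gamma = 0.9
mu_expert = feature_expectations([[0, 1, 2, 2]], phi, gamma)   # hypothetical expert trace
mu_learner = feature_expectations([[0, 1, 1, 2]], phi, gamma)  # hypothetical learner trace

eps = np.abs(mu_learner - mu_expert).sum()       # l1 distance between feature expectations
w = np.array([0.2, -1.0, 1.0])                   # any reward weights with max-norm <= 1
value_gap = abs(w @ mu_learner - w @ mu_expert)  # difference in expected return under w
```

Whatever weights satisfying ‖w‖∞ ≤ 1 are chosen, `value_gap` should not exceed `eps`, which is why matching feature expectations is enough even though the true reward weights are unknown.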
pick some controller π0.
• Iterate for i = 1, 2, …:
  • “Guess” the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers.
  • Find the optimal control policy πi for the current guess of the reward function Rw.
  • If the teacher’s margin over all previously found controllers is small enough, exit the algorithm.
Courtesy of Pieter Abbeel
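The loop above can be sketched end to end on a toy problem. This is not the max-margin formulation described on the slide: for brevity it swaps in the simpler projection variant from the same Abbeel and Ng paper, where the reward guess is w = μ_E − μ̄ and μ̄ is the running projection of the found feature expectations. The five-state chain MDP, one-hot features, and all function names are hypothetical choices made for this example.

```python
import numpy as np

GAMMA = 0.9
N_STATES = 5            # toy chain MDP with states 0..4 (hypothetical)
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic chain dynamics, clipped at both ends."""
    return max(s - 1, 0) if a == LEFT else min(s + 1, N_STATES - 1)

def feature_expectations(policy, s0=0):
    """Exact mu(pi) for one-hot features: discounted state visitation from s0."""
    P = np.zeros((N_STATES, N_STATES))
    for s in range(N_STATES):
        P[s, step(s, policy[s])] = 1.0
    e0 = np.zeros(N_STATES)
    e0[s0] = 1.0
    return np.linalg.solve(np.eye(N_STATES) - GAMMA * P.T, e0)

def optimal_policy(w, iters=500):
    """Value iteration for the guessed reward R_w(s) = w[s]."""
    V = np.zeros(N_STATES)
    for _ in range(iters):
        Q = np.array([[w[s] + GAMMA * V[step(s, a)] for a in (LEFT, RIGHT)]
                      for s in range(N_STATES)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def apprenticeship_learning(mu_expert, eps=1e-6, max_iters=50):
    policy = np.full(N_STATES, LEFT)          # initial controller pi_0
    mu_bar = feature_expectations(policy)
    for _ in range(max_iters):
        w = mu_expert - mu_bar                # "guess" the reward weights
        margin = np.linalg.norm(w)            # how far the teacher is still ahead
        if margin <= eps:
            break
        policy = optimal_policy(w)            # best controller for the guess
        d = feature_expectations(policy) - mu_bar
        if d @ d == 0:
            break
        # Project mu_expert onto the line through mu_bar and the new mu.
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return policy, w, margin

expert_policy = np.full(N_STATES, RIGHT)      # hypothetical teacher: always move right
mu_expert = feature_expectations(expert_policy)
learned_policy, w, margin = apprenticeship_learning(mu_expert)
```

The sketch returns the last controller found; in the paper the output is chosen from (or mixed over) all controllers found so far, which this toy version does not reproduce.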
Behavioral cloning
Q: Can’t we directly learn the teacher’s policy using supervised learning?
• Fix a policy class
  • E.g., support vector machine, neural network, decision tree, deep belief net, …
• Estimate a policy (= mapping from states to actions) from the training examples (s0, a0), (s1, a1), (s2, a2), …
• Two of the most notable success stories:
  • Pomerleau, NIPS 1989: ALVINN
  • Sammut et al., ICML 1992: Learning to fly (flight sim)
Courtesy of Pieter Abbeel
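As a minimal illustration of behavioral cloning, the sketch below uses the simplest possible policy class, a lookup table that picks the teacher's most frequent action in each state. The demonstration pairs are made up, and a practical system would use one of the function approximators listed above instead of a table.

```python
from collections import Counter, defaultdict

def clone_policy(demonstrations):
    """Fit a tabular policy by majority vote over (state, action) pairs."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical teacher trace (s0, a0), (s1, a1), ... with one noisy label.
demos = [(0, "right"), (1, "right"), (0, "right"), (1, "left"), (1, "right")]
policy = clone_policy(demos)
```

The table has no entry for states the teacher never visited, which is one simple way to see why cloned policies can struggle once the learner drifts away from the demonstrated states.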
Inverse RL vs. behavioral cloning
• π* vs. R*?
• Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy.
Courtesy of Pieter Abbeel
Reward function trades off:
• Staying “on-road,”
• Forward vs. reverse driving,
• Amount of switching between forward and reverse,
• Lane keeping,
• On-road vs. off-road,
• Curvature of paths.
Courtesy of Pieter Abbeel
setup
• Run our apprenticeship learning algorithm to find the reward function.
• Receive “test parking lot” map + starting point and destination.
• Find the trajectory that maximizes the learned reward function for navigating the test parking lot.
Courtesy of Pieter Abbeel
3. Interactively shaping agents via human reinforcement: The TAMER framework by Knox and Stone, K-CAP 2009
who observes the agent and understands a predefined performance metric provides occasional scalar reinforcement signals (R_H : S × A → ℝ)
• How can an agent learn the best possible task policy (π : S → A), as measured by the performance metric?
and long-term effects
• Use them instead of expected rewards, e.g., those computed by an expensive value-based method!
• Exploration is not needed!
• The human trainer’s reinforcement function (R_H) is a moving target, e.g., it changes to continuously improve the agent’s policy
• Model and update the estimate R̂_H on the fly
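The idea of modeling the human's reinforcement and acting on it directly can be sketched in a few lines. This is a simplified tabular illustration, not the TAMER implementation: the class name, learning rate, and simulated trainer are assumptions, and the credit assignment for delayed human feedback is left out.

```python
import numpy as np

class HumanRewardModel:
    """Tabular estimate of the trainer's reinforcement function R_H(s, a)."""

    def __init__(self, n_states, n_actions, lr=0.5):
        self.r_hat = np.zeros((n_states, n_actions))
        self.lr = lr

    def act(self, s):
        # Greedy with respect to the predicted human reward; no exploration step.
        return int(np.argmax(self.r_hat[s]))

    def update(self, s, a, human_reward):
        # Incremental update keeps tracking R_H even if the trainer changes it.
        self.r_hat[s, a] += self.lr * (human_reward - self.r_hat[s, a])

# Simulated trainer (hypothetical): +1 for the desired action, -1 otherwise.
desired = [2, 0, 1]
agent = HumanRewardModel(n_states=3, n_actions=3)
for _ in range(5):
    for s in range(3):
        a = agent.act(s)
        agent.update(s, a, 1.0 if a == desired[s] else -1.0)

learned = [agent.act(s) for s in range(3)]
```

Because negative feedback lowers the estimate for the tried action, the greedy choice moves on to untried actions by itself, which gives some intuition for why a separate exploration mechanism may be unnecessary here.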
TAMER framework
• Uses reward feedback to help shape the agent’s reward
• Can be used in addition to other reward signals from the domain
From Interactively Shaping Agents via Human Reinforcement by Knox and Stone
• Non-myopically learn from humans, i.e., be robust against the human bias towards giving positive rewards
See Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance by Knox and Stone for more details
future actions
• A positive bias in RL rewards
• Human-generated reward signal changes over time
• RL should accommodate the above findings!
• New ways to transfer tasks or skills to robots?
Humans Providing Online Rewards
• Sophie’s Kitchen
• Human trainer can award a scalar reward signal r ∈ [−1, 1]
Teachable robots: Understanding human teaching behavior to build more effective robot learners, Artificial Intelligence Journal
From Reinforcement Learning with Human Teachers: Understanding How People Want to Teach Robots by Thomaz et al.
4. Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao, NIPS 2006
on goal inference.
Goal-based imitation
• Infants above 1.5 years of age can imitate an action even from an unsuccessful demonstration (Meltzoff & Brook 1998)
[Figure: Demonstration]
of others.
• Self-experience plays an important role in goal inference and imitation.
• Computational Model: Probabilistic instantiation of the “Like-me” hypothesis.
Meltzoff, Dev Sci, 2007
Meltzoff, Acta Psychologica, 2007
Bayesian Network)
• Given initial state s and goal state Goal_g, infer:
[Figure: graphical models of the Standard MDP and the Goal-based MDP, with nodes State, Action, Goal, Reached, and Observation]
From Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao
Also check out Probabilistic inference for solving discrete and continuous state Markov Decision Processes by Toussaint and Storkey