UW CSEP590B Robotics (2018 Spr) Guest Lecture "Human-Robot Interaction"

Mike Chung
May 30, 2018

University of Washington CSEP590B Robotics week 10 lecture "Human-Robot Interaction" slides

https://courses.cs.washington.edu/courses/csep590b/18sp/
Transcript

  1. Reinforcement Learning ▪ We assume an MDP: ▪A set of

    states s ∈ S ▪A set of actions (per state) A ▪A model T(s,a,s’) ▪A reward function R(s,a,s’) ▪ Looking for a policy π(s) ▪ Don’t know T or R, so must try out actions [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
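To make the notation above concrete, here is a minimal tabular MDP container one might use in code; the two-state example and all names are illustrative, not taken from the lecture.

```python
# Minimal tabular MDP container matching the slide's notation: states S, actions A,
# transition model T(s, a, s'), reward R(s, a, s'). The two-state example is
# purely illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                              # S
    actions: List[str]                             # A (same action set per state here)
    T: Dict[Tuple[str, str], Dict[str, float]]     # (s, a) -> {s': P(s' | s, a)}
    R: Callable[[str, str, str], float]            # R(s, a, s')
    gamma: float = 0.9

example = MDP(
    states=["left", "right"],
    actions=["stay", "move"],
    T={("left", "stay"): {"left": 1.0},
       ("left", "move"): {"right": 0.9, "left": 0.1},
       ("right", "stay"): {"right": 1.0},
       ("right", "move"): {"left": 0.9, "right": 0.1}},
    R=lambda s, a, s2: 1.0 if s2 == "right" else 0.0,
)
```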
  2. MDPs and RL Known MDP: Offline Solution Goal Technique Compute

    V*, Q*, π* Value / policy iteration Evaluate a fixed policy π Policy evaluation Unknown MDP: Model-Based Unknown MDP: Model-Free Goal Technique Compute V*, Q*, π* VI/PI on approx. MDP Evaluate a fixed policy π PE on approx. MDP Goal Technique Compute V*, Q*, π* Q-learning Evaluate a fixed policy π Value Learning [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] >> Could humans help the agent’s learning? >> Could the agent learn from humans?
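Since the table above names Q-learning as the model-free technique for computing Q* without knowing T or R, here is a minimal tabular Q-learning sketch; the env interface (reset()/step()) and the hyperparameters are assumptions for illustration, not part of the slides.

```python
# Minimal tabular Q-learning sketch (model-free: never uses T or R directly).
# The env object with reset() -> s and step(a) -> (s', r, done) is an assumed
# interface, not something defined in the lecture.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, a)], zero-initialized
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration: "must try out actions"
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            target = r + gamma * max(Q[(s2, x)] for x in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```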
  3. DISCUSSION How is HRI different from HCI? >>What is a

    robot? Courtesy of Maya Cakmak >> Detour to HRI
  4. DISCUSSION How is HRI different from HCI? >>What is a

    robot? What does that imply for studying HRI? Should robots exploit being perceived as agents? Courtesy of Maya Cakmak
  5. Human-Robot Interaction Topics • Perceiving Humans for Social Interaction •

    Verbal Communication in Social Robots • Communicating with Nonverbal Behaviors • Understanding Human Intentions • Human-Robot Collaboration • Social Navigation • Robot Learning from Humans From Computational Human-Robot Interaction by Thomaz et al.
  6. HUMAN-ROBOT INTERACTION: programming a robot, commanding a robot,

     collaborating with a robot. Courtesy of Maya Cakmak. At Maya’s lab
  7. METHODS • Asking users – Questionnaires, interviews, focus groups, contextual inquiry • Observing

     users – Passive observation, empirical user studies, think-aloud protocol, ethnography, field studies • Making users observe themselves – Diaries, experience sampling • Asking experts – Heuristic evaluation, cognitive walkthrough Courtesy of Maya Cakmak
  8. Today’s Outline 1. Implicit Imitation in
 Multiagent Reinforcement Learning
 by

    Price and Boutilier, ICML 1999 2. Apprenticeship Learning via
 Inverse Reinforcement Learning
 by Abbeel and Ng, ICML 2004 3. Interactively shaping agents via
 human reinforcement: The TAMER framework
 by Knox and Stone, KCAP 2009 4. Goal-Based Imitation as
 Probabilistic Inference over Graphical Models
 by Verma and Rao, NIPS 2006
  9. Today’s Outline 1. Implicit Imitation in
 Multiagent Reinforcement Learning
 by

    Price and Boutilier, ICML 1999 2. Apprenticeship Learning via
 Inverse Reinforcement Learning
 by Abbeel and Ng, ICML 2004 3. Interactively shaping agents via
 human reinforcement: The TAMER framework
 by Knox and Stone, KCAP 2009 4. Goal-Based Imitation as
 Probabilistic Inference over Graphical Models
 by Verma and Rao, NIPS 2006
  10. Implicit Imitation: Setup • ≥ 2 identical MDPs; “mentor(s)” and “observer”

    • Agents do not interact • The mentor is executing π* • At each iteration, the observer is at s_o, takes a_o, and ends up in t_o; it also observes the mentor moving from s_m to t_m • The observer knows R(s) but does not know
 Pr(t|s, a) and is looking for π
  11. Model Extraction • Augmented Bellman Equation: • Focusing Mechanism: Back

    up the states visited by the mentor • Action Selection: Guess the mentor’s action with
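The augmented Bellman equation and action-selection rule on this slide were shown as images and are not in the transcript. For reference, the augmented backup in Price and Boutilier's paper has roughly the following form (a sketch from the paper, with notation lightly adapted):

```latex
% Augmented Bellman backup (sketch): the observer backs up either its own best
% estimated action model or the mentor's observed transition model, whichever
% promises higher value.
V(s) \;=\; R(s) \;+\; \gamma \,\max\!\Big\{
  \max_{a \in A} \sum_{t} \Pr_o(t \mid s, a)\, V(t),\;
  \sum_{t} \Pr_m(t \mid s)\, V(t)
\Big\}
```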
  12. Implicit Imitation Algorithm: Sketch • Update prioritized sweeping (an asynchronous

    value iteration) with the following changes: A. Augmented Bellman Backup With each transition <s, a, s'> 
    the observer takes, the observer’s estimated model Pr_o(t | s, a) is updated and an augmented backup (Equation 2) is performed at state s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation. B. Focusing Mechanism With each observed mentor transition
    <s, t>, the mentor’s estimated model Pr_m(t | s) is updated and an augmented backup is performed at s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation. From Implicit Imitation in Multiagent Reinforcement Learning by Price and Boutilier
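A minimal sketch of the augmented backup at the heart of this algorithm, assuming tabular state/action spaces and count-based model estimates; the priority-queue bookkeeping and the confidence testing from the paper are omitted for brevity, and all names are illustrative.

```python
# Hedged sketch of the augmented backup in implicit imitation (Price & Boutilier
# 1999). Simplified: tabular, no prioritized sweeping, count-based model estimates.
from collections import defaultdict

class ImplicitImitator:
    def __init__(self, states, actions, R, gamma=0.95):
        self.states, self.actions, self.R, self.gamma = states, actions, R, gamma
        self.V = defaultdict(float)
        self.own_counts = defaultdict(lambda: defaultdict(int))      # (s, a) -> {t: n}
        self.mentor_counts = defaultdict(lambda: defaultdict(int))   # s -> {t: n}

    def observe_own(self, s, a, t):
        self.own_counts[(s, a)][t] += 1

    def observe_mentor(self, s, t):
        self.mentor_counts[s][t] += 1

    def _expect(self, counts, V):
        n = sum(counts.values())
        return sum(c / n * V[t] for t, c in counts.items()) if n else 0.0

    def augmented_backup(self, s):
        # Best value achievable under the observer's own estimated model ...
        own = max((self._expect(self.own_counts[(s, a)], self.V)
                   for a in self.actions), default=0.0)
        # ... or by "following" the mentor's observed dynamics at s.
        mentor = self._expect(self.mentor_counts[s], self.V)
        self.V[s] = self.R(s) + self.gamma * max(own, mentor)
```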
  13. Extensions & Limitations • Heterogeneous settings; the mentor’s P(t|s,a) is

    different from that of the agent, e.g., a humanoid robot imitating human body movement • Discusses potential extensions to model-free approaches and feature-based representations • Limitations • State space is huge • Hard to find an optimal mentor • The agent designer needs to define R
  14. Today’s Outline 1. Implicit Imitation in
 Multiagent Reinforcement Learning
 by

    Price and Boutilier, ICML 1999 2. Apprenticeship Learning via
 Inverse Reinforcement Learning
 by Abbeel and Ng, ICML 2004 3. Interactively shaping agents via
 human reinforcement: The TAMER framework
 by Knox and Stone, KCAP 2009 4. Goal-Based Imitation as
 Probabilistic Inference over Graphical Models
 by Verma and Rao, NIPS 2006
  15. Motivation Would be nice if • agent designers don’t need

    to define R and agents learn it from human demonstrations • human demonstrations do not have to be perfect
  16. Inverse Reinforcement Learning: Problem Setup • Input: state space, action space, transition model

     P_sa(s_t+1 | s_t, a_t); no reward function; teacher’s demonstration: s_0, a_0, s_1, a_1, s_2, a_2, … (= trace of the teacher’s policy π*) • Inverse RL: can we recover R? • Apprenticeship learning via inverse RL: can we then use this R to find a good policy? • Behavioral cloning: can we directly learn the teacher’s policy using supervised learning? Courtesy of Pieter Abbeel
  17. Feature-based reward function Let R(s) = w⊤φ(s),

     where w ∈ ℜⁿ and φ : S → ℜⁿ. Then E[ Σ_{t=0..∞} γ^t R(s_t) | π ] = E[ Σ_{t=0..∞} γ^t w⊤φ(s_t) | π ] = w⊤ E[ Σ_{t=0..∞} γ^t φ(s_t) | π ] = w⊤μ(π), where μ(π) is the expected cumulative discounted sum of feature values, or “feature expectations”. Substituting into E[ Σ_{t=0..∞} γ^t R*(s_t) | π* ] ≥ E[ Σ_{t=0..∞} γ^t R*(s_t) | π ] ∀π gives: Find w* such that w*⊤μ(π*) ≥ w*⊤μ(π) ∀π. Courtesy of Pieter Abbeel
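A hedged sketch of how the feature expectations μ(π) defined above could be estimated by Monte-Carlo rollouts; the env interface (reset()/step()) and the feature map phi are assumptions for illustration, not from the lecture.

```python
# Hedged sketch: Monte-Carlo estimate of the feature expectations mu(pi),
# assuming a hypothetical env with reset() -> s and step(a) -> (s', r, done),
# and a feature map phi(s) -> numpy array.
import numpy as np

def feature_expectations(env, policy, phi, gamma=0.99, n_rollouts=100, horizon=200):
    """Estimate mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ] by sampling rollouts."""
    mu = None
    for _ in range(n_rollouts):
        s = env.reset()
        discount = 1.0
        for _ in range(horizon):
            f = discount * phi(s)
            mu = f if mu is None else mu + f
            s, _, done = env.step(policy(s))     # environment reward is ignored
            if done:
                break
            discount *= gamma
    return mu / n_rollouts
```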
  18. Feature matching • Inverse RL starting point: find a reward function such

     that the expert outperforms other policies. Let R(s) = w⊤φ(s), where w ∈ ℜⁿ and φ : S → ℜⁿ. Find w* such that w*⊤μ(π*) ≥ w*⊤μ(π) ∀π. • Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match: ‖μ(π) − μ(π*)‖_1 ≤ ε implies that for all w with ‖w‖_∞ ≤ 1: |w⊤μ(π) − w⊤μ(π*)| ≤ ε. Courtesy of Pieter Abbeel
  19. Apprenticeship learning [Abbeel & Ng, 2004] • Assume a feature-based reward R(s) = w⊤φ(s) (as above) • Initialize:

     pick some controller π_0. • Iterate for i = 1, 2, … : • “Guess” the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers. • Find the optimal control policy π_i for the current guess of the reward function R_w. • If the teacher’s margin over all previously found controllers is sufficiently small, exit the algorithm. Courtesy of Pieter Abbeel
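A minimal sketch of the iteration above, using the projection variant of Abbeel and Ng's algorithm; solve_mdp and estimate_mu are assumed helper callables (e.g., value iteration under the guessed reward, and the feature-expectation estimator sketched earlier), not functions from the lecture.

```python
# Hedged sketch of the apprenticeship-learning loop (projection variant of
# Abbeel & Ng 2004). mu_expert is the expert's feature expectations.
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_mu, eps=1e-3, max_iters=50):
    policy = solve_mdp(np.zeros_like(mu_expert, dtype=float))   # arbitrary pi_0
    mu_bar = estimate_mu(policy)                 # running "projection" toward mu_expert
    for _ in range(max_iters):
        w = mu_expert - mu_bar                   # "guess" a reward direction in which
        t = np.linalg.norm(w)                    # the expert still outperforms
        if t <= eps:                             # expert's margin is small: done
            return policy, w
        policy = solve_mdp(w)                    # optimal policy for the guessed reward
        mu = estimate_mu(policy)
        # Project mu_bar toward the new mu (the paper's projection step).
        d = mu - mu_bar
        denom = d @ d
        if denom == 0:
            break
        mu_bar = mu_bar + ((d @ (mu_expert - mu_bar)) / denom) * d
    return policy, mu_expert - mu_bar
```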
  20. Theoretical guarantees • Guarantee w.r.t. the unrecoverable reward function of the teacher.

     • Sample complexity does not depend on the complexity of the teacher’s policy π*.
  21. Behavioral cloning Q: Can’t we directly learn the teacher’s policy using supervised learning? • Formulate as a standard machine learning problem • Fix a

     policy class, e.g., support vector machine, neural network, decision tree, deep belief net, … • Estimate a policy (= mapping from states to actions) from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), … • Two of the most notable success stories: Pomerleau, NIPS 1989: ALVINN; Sammut et al., ICML 1992: Learning to fly (flight sim) Courtesy of Pieter Abbeel
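Behavioral cloning reduces to ordinary supervised learning, so a sketch is short; the choice of scikit-learn's DecisionTreeClassifier as the policy class is purely illustrative.

```python
# Hedged sketch of behavioral cloning: fit a supervised classifier to
# (state, action) pairs from the teacher's demonstrations.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def behavioral_cloning(states, actions):
    """states: array of shape (N, state_dim); actions: array of shape (N,)."""
    clf = DecisionTreeClassifier(max_depth=8)    # example policy class
    clf.fit(states, actions)
    # The learned policy maps a single state to the teacher's (predicted) action.
    return lambda s: clf.predict(np.asarray(s).reshape(1, -1))[0]
```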
  22. Inverse RL vs. behavioral cloning • Which has the most succinct description: π*

     or R*? • Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy. Courtesy of Pieter Abbeel
  23. Parking lot navigation [Abbeel et al., IROS 08] • Reward

     function trades off: • Staying “on-road,” • Forward vs. reverse driving, • Amount of switching between forward and reverse, • Lane keeping, • On-road vs. off-road, • Curvature of paths. Courtesy of Pieter Abbeel
  24. Experimental setup • Demonstrate parking lot navigation on “train parking

     lots.” • Run our apprenticeship learning algorithm to find the reward function. • Receive “test parking lot” map + starting point and destination. • Find the trajectory that maximizes the learned reward function for navigating the test parking lot. Courtesy of Pieter Abbeel
  25. Today’s Outline 1. Implicit Imitation in
 Multiagent Reinforcement Learning
 by

    Price and Boutilier, ICML 1999 2. Apprenticeship Learning via
 Inverse Reinforcement Learning
 by Abbeel and Ng, ICML 2004 3. Interactively shaping agents via
 human reinforcement: The TAMER framework
 by Knox and Stone, KCAP 2009 4. Goal-Based Imitation as
 Probabilistic Inference over Graphical Models
 by Verma and Rao, NIPS 2006
  26. Motivation Would be nice if • agents could use simple

    feedback instead of full demonstrations • agent designers don’t need to define R
  27. Shaping Problem • Given • MDP/R • A human trainer

    who observes the agent, understands a predefined performance metric, and provides occasional scalar reinforcement signals (R_H : S × A → ℝ) • How can an agent learn the best possible task policy (π : S → A), as measured by the performance metric?
  28. Motivating Insights • Human reinforcement signals capture both short-term

     and long-term effects • use them instead of expected rewards, e.g., computed by an expensive value-based method! • exploration is not needed! • The human trainer’s reinforcement function (R_H) is a moving target, e.g., to continuously improve the agent’s policy • model and update the estimate of R_H on-the-fly
  29. • Human reinforcement signals are delayed; hence, 
 a model

    for assigning credit over time: Temporal Credit Assignment From Interactively Shaping Agents via Human Reinforcement by Knox and Stone
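A hedged sketch of the idea: a delayed human reward signal is spread over the recently taken (state, action) pairs to produce training labels for the reward model. The uniform credit window used here is an illustrative simplification of the delay-density weighting in the TAMER paper.

```python
# Hedged sketch of TAMER-style temporal credit assignment: a delayed human reward
# signal is split across the recent (state, action) pairs within an assumed delay
# window, producing labeled examples for the learned reward model H_hat.
def assign_credit(history, human_reward, now, window=0.8):
    """history: list of (timestamp, state, action) tuples, oldest first."""
    recent = [(s, a) for (t, s, a) in history if 0.0 <= now - t <= window]
    if not recent:
        return []
    credit = human_reward / len(recent)            # split the signal uniformly
    return [(s, a, credit) for (s, a) in recent]   # (state, action, target) examples
```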
  30. TAMER Algorithm: Sketch Humans Providing Online Rewards (W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2010. Best student paper.) •

     TAMER framework • Uses reward feedback to help shape the agent’s reward • Can be used in addition to other reward signals from the domain From Interactively Shaping Agents via Human Reinforcement by Knox and Stone
  31. TAMER Algorithm Humans Providing Online Rewards (W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. AAMAS, May 2010. Best student paper.) • TAMER

     framework • Uses reward feedback to help shape the agent’s reward • Can be used in addition to other reward signals from the domain From Interactively Shaping Agents via Human Reinforcement by Knox and Stone
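A minimal sketch of a TAMER-style agent, assuming a linear model of the human's reinforcement over a user-supplied feature map phi(s, a); the agent acts myopically (greedily) with respect to its learned model and never uses an environment reward. Names and hyperparameters are illustrative, not from the paper.

```python
# Hedged sketch of a TAMER-style loop (Knox & Stone 2009): learn a model H_hat of
# the human's reinforcement and act greedily with respect to it.
import numpy as np

class TamerAgent:
    def __init__(self, actions, phi, lr=0.05):
        self.actions, self.phi, self.lr = actions, phi, lr
        self.w = None                                  # weights of H_hat

    def _h_hat(self, s, a):
        f = self.phi(s, a)
        if self.w is None:
            self.w = np.zeros_like(f, dtype=float)
        return float(self.w @ f)

    def act(self, s):
        # Myopic action selection: greedily maximize predicted human reward.
        return max(self.actions, key=lambda a: self._h_hat(s, a))

    def update(self, s, a, human_reward):
        # One SGD step toward the (credit-assigned) human reward label.
        error = human_reward - self._h_hat(s, a)
        self.w += self.lr * error * self.phi(s, a)
```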
  32. Extensions • Combining signals from environments (R) • Episodic tasks

    • Non-myopically learn from humans, i.e., robust against the human bias towards giving positive rewards. See Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance by Knox and Stone for more details
  33. Findings about Human Teachers • Tendency to provide guidance of

    future actions • A positive bias in RL rewards • The human-generated reward signal changes over time • RL should accommodate the above findings! • New ways to transfer tasks or skills to robots? Humans Providing Online Rewards • Sophie’s Kitchen • Human trainer can award a scalar reward signal r ∈ [−1, 1] (Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence Journal.) From Reinforcement Learning with Human Teachers: Understanding How People Want to Teach Robots by Thomaz et al.
  34. Today’s Outline 1. Implicit Imitation in
 Multiagent Reinforcement Learning
 by

    Price and Boutilier, ICML 1999 2. Apprenticeship Learning via
 Inverse Reinforcement Learning
 by Abbeel and Ng, ICML 2004 3. Interactively shaping agents via
 human reinforcement: The TAMER framework
 by Knox and Stone, KCAP 2009 4. Goal-Based Imitation as
 Probabilistic Inference over Graphical Models
 by Verma and Rao, NIPS 2006
  35. Imitation Learning in Humans Not mere trajectory following. Imitation based

    on goal inference. [Figure: Demonstration vs. Goal-based imitation] Infants above 1.5 years of age can imitate an action even from an unsuccessful demonstration (Meltzoff & Brooks 1998)
  36. Another Example: Gaze Following and Blindfolds 12-month-olds Meltzoff

     & Brooks, Dev Psych [Figure: blindfold experience; conditions: no training vs. after training]
  37. The “Like-Me” Hypothesis Self-experience allows infants to interpret the acts

     of others. Self-experience plays an important role in goal inference and imitation. Computational Model: Probabilistic instantiation of the “Like-me” hypothesis. Meltzoff, Dev Sci, 2007; Meltzoff, Acta Psychologica, 2007
  38. Planning as Probabilistic Inference • MDP as graphical model (Dynamic

     Bayesian Network) • Given initial state s and goal state g, infer: [Figure panels: Standard MDP; Goal-based MDP. Graphical model nodes: State, Action, Observation, Goal, Reached] From Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao. Also check out Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes by Toussaint and Storkey
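A hedged sketch of goal-based action selection in this spirit: over a finite-horizon DBN with a tabular transition model, pick the first action that maximizes the probability of eventually reaching the goal state. The data layout and horizon are illustrative assumptions, not the paper's exact inference procedure.

```python
# Hedged sketch of goal-based action selection as inference over a finite-horizon
# model: choose the action maximizing the probability of reaching the goal.
def infer_action(P, s0, goal, horizon=20):
    """P: dict mapping s -> a -> {s': prob}. Returns the best first action from s0."""
    states = list(P.keys())
    # prob_reach[s] = P(goal reached within the remaining steps | current state s)
    prob_reach = {s: 1.0 if s == goal else 0.0 for s in states}
    for _ in range(horizon - 1):
        prob_reach = {
            s: 1.0 if s == goal else max(
                sum(p * prob_reach[s2] for s2, p in P[s][a].items())
                for a in P[s])
            for s in states
        }
    # First-step inference: the action with the highest goal-reaching probability.
    return max(P[s0], key=lambda a: sum(p * prob_reach[s2]
                                        for s2, p in P[s0][a].items()))
```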
  39. Policy Learning From Goal-Based Imitation as Probabilistic Inference over Graphical

    Models by Verma and Rao Motor control learning / Body babbling Skill / Task learning
  40. Related Work: Modeling Humans [Figure: Agent and Mentor trajectories

     with true vs. inferred fixation points; axes: x position (cm), y position (cm)] Friesen & Rao, Cog Sci, 2011; Meltzoff et al., Neural Networks, 2010
  41. Conclusion • Exciting area of research! • More generally covers

    topics related to: • Programming for non-programmers • Customization by non-programmers • Faster and/or less costly learning • Study how people teach