UW CSEP590B Robotics (2018 Spr) Guest Lecture "Human-Robot Interaction"

Reinforcement Learning in Human-Robot Interaction
Michael Jae-Yoon Chung
May 30, 2018, CSE P 590 B Robotics

Reinforcement Learning
▪ We assume an MDP:
  ▪ A set of states s ∈ S
  ▪ A set of actions (per state) A
  ▪ A model T(s,a,s’)
  ▪ A reward function R(s,a,s’)
▪ Looking for a policy π(s)
▪ Don’t know T or R, so must try out actions
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

MDPs and RL
Known MDP: Offline Solution
• Goal: compute V*, Q*, π*. Technique: value / policy iteration
• Goal: evaluate a fixed policy π. Technique: policy evaluation
Unknown MDP: Model-Based
• Goal: compute V*, Q*, π*. Technique: VI/PI on approx. MDP
• Goal: evaluate a fixed policy π. Technique: PE on approx. MDP
Unknown MDP: Model-Free
• Goal: compute V*, Q*, π*. Technique: Q-learning
• Goal: evaluate a fixed policy π. Technique: value learning
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
>> Could humans help the agent’s learning?
>> Could the agent learn from humans?
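The model-free case (Q-learning) listed above can be sketched on a toy problem. This is a minimal illustration, not part of the original slides: the four-state chain, the `step` function, and all hyperparameters are assumptions chosen only to keep the example small. The learner never reads T or R directly; it only sees sampled transitions.

```python
import random

N_STATES, GOAL = 4, 3          # hypothetical chain 0-1-2-3; state 3 is terminal
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(s, a):
    """Hidden dynamics and reward: the learner only sees sampled outcomes."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, (1.0 if s2 == GOAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy: mostly exploit, sometimes try a random action.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[s][b])
            s2, r = step(s, a)
            # Q-learning target uses the best next action (Q[GOAL] stays 0).
            target = r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning()
policy = [max(ACTIONS, key=lambda b: Q[s][b]) for s in range(GOAL)]
```

On this chain the greedy policy read off from `Q` moves right in every non-terminal state.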

DISCUSSION How is HRI different from HCI? >>What is a
robot? Courtesy of Maya Cakmak >> Detour to HRI

AGENCY & INTENTIONALITY [Heider & Simmel, 1944] https://www.youtube.com/watch?v=76p64j3H1Ng

DISCUSSION How is HRI different from HCI? >>What is a
robot? What does that imply for studying HRI? Should robots exploit being perceived as agents? Courtesy of Maya Cakmak

Human-Robot Interaction Topics • Perceiving Humans for Social Interaction •
Verbal Communication in Social Robots • Communicating with Nonverbal Behaviors • Understanding Human Intentions • Human-Robot Collaboration • Social Navigation • Robot Learning from Humans From Computational Human-Robot Interaction by Thomaz et al.

HUMAN-ROBOT INTERACTION: programming a robot, commanding a robot, collaborating with a robot
Courtesy of Maya Cakmak At Maya’s lab

METHODS
• Asking users
– Questionnaires, interviews, focus groups, contextual inquiry
• Observing users
– Passive observation, empirical user studies, think-aloud protocol, ethnography, field studies
• Make users observe themselves
– Diaries, experience sampling
• Ask experts
– Heuristic evaluation, cognitive walkthrough
Courtesy of Maya Cakmak

METHODS IN HCI Discovery Evaluation pre-design (formative) during/post-design (summative) Courtesy
of Maya Cakmak

DATA OBTAINED (subjective) (objective) Data Source Data type Courtesy of
Maya Cakmak

Today’s Outline
1. Implicit Imitation in Multiagent Reinforcement Learning by Price and Boutilier, ICML 1999
2. Apprenticeship Learning via Inverse Reinforcement Learning by Abbeel and Ng, ICML 2004
3. Interactively shaping agents via human reinforcement: The TAMER framework by Knox and Stone, K-CAP 2009
4. Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao, NIPS 2006

Implicit Imitation: Setup
• ≥ 2 identical MDPs; “mentor(s)” and “observer”
• Agents do not interact
• The mentor is executing π*
• At each iteration, the observer is at $s_o$, takes $a_o$, and ends up in $t_o$; it also observes the mentor moving from $s_m$ to $t_m$
• The observer knows R(s) but does not know Pr(t|s, a) and is looking for π

Model Extraction • Augmented Bellman Equation: • Focusing Mechanism: Back
up the states visited by mentor • Action Selection: Guess the mentor’s action with

Implicit Imitation Algorithm: Sketch
• Update prioritized sweeping (an asynchronous value iteration) with the following changes:
A. Augmented Bellman Backup: With each transition <s, a, t> the observer takes, the estimated model $\Pr^{\text{est}}_o(t \mid s, a)$ is updated and an augmented backup (Equation 2) is performed at state s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.
B. Focusing Mechanism: With each observed mentor transition <s, t>, the estimated model $\Pr^{\text{est}}_m(t \mid s)$ is updated and an augmented backup is performed at s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.
From Implicit Imitation in Multiagent Reinforcement Learning by Price and Boutilier
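A single augmented backup can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the form V(s) = R(s) + γ max( max_a Σ_t Pr_o(t|s,a) V(t), Σ_t Pr_m(t|s) V(t) ), estimates both models by relative frequencies of counts, and omits the priority queue. The function names, the count tables, and the three-state example are hypothetical.

```python
from collections import defaultdict

def estimate(counts):
    """Turn visit counts {t: n} into relative-frequency estimates of Pr(t | .)."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def augmented_backup(s, R, V, own_counts, mentor_counts, actions, gamma=0.9):
    """Back up state s using the better of the observer's and the mentor's model."""
    candidates = []
    for a in actions:
        if own_counts[(s, a)]:                      # observer has tried a in s
            p = estimate(own_counts[(s, a)])
            candidates.append(sum(pt * V[t] for t, pt in p.items()))
    if mentor_counts[s]:                            # mentor was seen leaving s
        p = estimate(mentor_counts[s])
        candidates.append(sum(pt * V[t] for t, pt in p.items()))
    best = max(candidates) if candidates else 0.0
    return R[s] + gamma * best

# Hypothetical data: the observer only tried "stay" in state 1,
# but saw the mentor move from state 1 to the rewarding state 2 twice.
own = defaultdict(lambda: defaultdict(int))
mentor = defaultdict(lambda: defaultdict(int))
own[(1, "stay")][1] += 1
mentor[1][2] += 2
R = {0: 0.0, 1: 0.0, 2: 1.0}
V = {0: 0.0, 1: 0.0, 2: 1.0}

v_with = augmented_backup(1, R, V, own, mentor, ["stay", "go"])
no_mentor = defaultdict(lambda: defaultdict(int))
v_without = augmented_backup(1, R, V, own, no_mentor, ["stay", "go"])
```

With the mentor's transitions the backed-up value of state 1 rises to 0.9; without them it stays at 0.0, which is the sense in which observation speeds up learning.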

Experiments From Implicit Imitation in Multiagent Reinforcement Learning by Price
and Boutilier

Extensions & Limitations
• Heterogeneous settings: the mentor’s P(t|s,a) is different from that of the agent, e.g., a humanoid robot imitating human body movement
• Discusses potential extensions to model-free approaches and feature-based representations
• Limitations
– State space is huge
– Hard to find an optimal mentor
– The agent designer needs to define R

Motivation
Would be nice if
• agent designers don’t need to define R and agents learn it from human demonstrations
• human demonstrations do not have to be perfect

Inverse Reinforcement Learning: Problem setup
• Input:
– State space, action space
– Transition model $P_{sa}(s_{t+1} \mid s_t, a_t)$
– No reward function
– Teacher’s demonstration: $s_0, a_0, s_1, a_1, s_2, a_2, \ldots$ (= trace of the teacher’s policy π*)
• Inverse RL:
– Can we recover R?
• Apprenticeship learning via inverse RL
– Can we then use this R to find a good policy?
• Behavioral cloning
– Can we directly learn the teacher’s policy using supervised learning?
Courtesy of Pieter Abbeel

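Feature-based reward function
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$.
$E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = E[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \mid \pi] = w^\top E[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi] = w^\top \mu(\pi)$
Here $\mu(\pi)$ is the expected cumulative discounted sum of feature values, or “feature expectations”.
Substituting into $E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*] \ge E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi] \;\; \forall \pi$ gives us: find $w^*$ such that $w^{*\top} \mu(\pi^*) \ge w^{*\top} \mu(\pi) \;\; \forall \pi$.
Courtesy of Pieter Abbeel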
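The feature expectations μ(π) can be estimated from sampled trajectories by averaging discounted feature sums. The sketch below is illustrative only: the two-state one-hot feature map `phi`, the two short trajectories, and the truncation to finite length are assumptions, not part of the slides.

```python
def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi]."""
    n = len(phi(trajectories[0][0]))
    mu = [0.0] * n
    for states in trajectories:
        for t, s in enumerate(states):
            f = phi(s)
            for i in range(n):
                mu[i] += (gamma ** t) * f[i]
    return [m / len(trajectories) for m in mu]

def phi(s):
    """Hypothetical one-hot features over two states."""
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]

# Two made-up state sequences sampled from some policy.
trajectories = [[0, 1, 1], [0, 0, 1]]
mu = feature_expectations(trajectories, phi, gamma=0.5)

# With R(s) = w^T phi(s), the policy's estimated value is simply w^T mu.
w = [0.0, 1.0]
value = sum(wi * mi for wi, mi in zip(w, mu))
```

Here `mu` comes out as `[1.25, 0.5]` and `value` as `0.5`, which matches the average discounted return of the two trajectories under the reward that pays 1 in state 1.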

Feature matching
• Inverse RL starting point: find a reward function such that the expert outperforms other policies.
Let $R(s) = w^\top \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi : S \to \mathbb{R}^n$. Find $w^*$ such that $w^{*\top} \mu(\pi^*) \ge w^{*\top} \mu(\pi) \;\; \forall \pi$.
• Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match:
$\|\mu(\pi) - \mu(\pi^*)\|_1 \le \epsilon$ implies that for all $w$ with $\|w\|_\infty \le 1$: $|w^{*\top} \mu(\pi) - w^{*\top} \mu(\pi^*)| \le \epsilon$
Courtesy of Pieter Abbeel
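The feature-matching bound is an instance of Hölder's inequality, |wᵀ(μ(π) − μ(π*))| ≤ ‖w‖∞ ‖μ(π) − μ(π*)‖₁, and a small numeric check makes it concrete. All vectors below are made-up numbers chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

mu_pi = [1.0, 0.5, 2.0]        # made-up feature expectations of a learned policy
mu_expert = [1.25, 0.25, 2.5]  # made-up feature expectations of the expert
diff = [a - b for a, b in zip(mu_pi, mu_expert)]   # [-0.25, 0.25, -0.5]
eps = sum(abs(d) for d in diff)                     # L1 distance between the two

# Any reward weights with max-norm <= 1 change the value by at most eps.
candidates = [[1, 1, 1], [0.5, -1, 0.25], [-1, 0, 1]]
gaps = [abs(dot(w, diff)) for w in candidates]

# The bound is tight: w = sign(diff) attains it exactly.
w_worst = [1.0 if d >= 0 else -1.0 for d in diff]
worst_gap = abs(dot(w_worst, diff))
```

Because the bound holds for every admissible w, matching feature expectations guarantees near-expert performance without knowing the true w*.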

Apprenticeship learning [Abbeel & Ng, 2004]
• Assume $R_w(s) = w^\top \phi(s)$
• Initialize: pick some controller $\pi_0$.
• Iterate for i = 1, 2, … :
– “Guess” the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers.
– Find the optimal control policy $\pi_i$ for the current guess of the reward function $R_w$.
– If , exit the algorithm.
Courtesy of Pieter Abbeel
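The “guess the reward” step can be sketched with the projection variant described in Abbeel and Ng's paper, which replaces the max-margin optimization named on the slide with a closed-form update. This is a sketch under assumptions: only the reward-guessing step is shown, the RL solver that would produce each new policy's feature expectations is omitted, and the numeric vectors are invented.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def projection_step(mu_expert, mu_bar, mu_new):
    """Project mu_expert onto the line through mu_bar and mu_new.

    Returns the updated mu_bar, the reward weights w = mu_expert - mu_bar,
    and the margin t = ||w||_2 used as the stopping criterion.
    """
    d = [m - b for m, b in zip(mu_new, mu_bar)]
    e = [m - b for m, b in zip(mu_expert, mu_bar)]
    denom = dot(d, d)
    step = dot(d, e) / denom if denom > 0 else 0.0
    mu_bar_next = [b + step * di for b, di in zip(mu_bar, d)]
    w = [m - b for m, b in zip(mu_expert, mu_bar_next)]
    return mu_bar_next, w, sqrt(dot(w, w))

# Invented feature expectations for the expert, the initial controller,
# and the controller found for the first reward guess.
mu_expert = [1.0, 1.0]
mu_0 = [0.0, 0.0]
mu_1 = [2.0, 0.0]

mu_bar, w, t = projection_step(mu_expert, mu_0, mu_1)
```

In this example the margin shrinks from √2 (distance between `mu_expert` and `mu_0`) to 1.0, and the new weights `[0.0, 1.0]` reward the one feature the found controllers still under-deliver. In the paper's version the loop stops once this margin drops below a chosen ε.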

Theoretical guarantees
• Guarantee w.r.t. unrecoverable reward function of teacher.
• Sample complexity does not depend on complexity of teacher’s policy π*.

Behavioral cloning
Q: Can’t we directly learn the teacher’s policy using supervised learning?
• Formulate as a standard machine learning problem
– Fix a policy class, e.g., support vector machine, neural network, decision tree, deep belief net, …
– Estimate a policy (= mapping from states to actions) from the training examples $(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots$
• Two of the most notable success stories:
– Pomerleau, NIPS 1989: ALVINN
– Sammut et al., ICML 1992: Learning to fly (flight sim)
Courtesy of Pieter Abbeel
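Behavioral cloning treats the demonstration as a labeled dataset. The sketch below uses the simplest possible policy class, a lookup table that predicts the teacher's most frequent action per state; this stands in for the SVMs or neural networks named above and is an assumption for illustration, as are the state and action names.

```python
from collections import Counter, defaultdict

def clone_policy(demos):
    """Fit a tabular policy: each state maps to the teacher's majority action."""
    votes = defaultdict(Counter)
    for state, action in demos:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

# Hypothetical (state, action) pairs recorded from a teacher.
demos = [("s0", "right"), ("s1", "right"), ("s0", "right"),
         ("s1", "up"), ("s1", "up")]
policy = clone_policy(demos)
unseen = policy.get("s2")   # states never demonstrated have no prediction
```

The missing entry for `"s2"` hints at a known weakness of this approach: the learned policy says nothing reliable about states the teacher never visited.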

Inverse RL vs. behavioral cloning
• Which has the most succinct description: π* vs. R*?
• Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy.
Courtesy of Pieter Abbeel

Parking lot navigation [Abbeel et al., IROS 08]
• Reward function trades off:
– Staying “on-road,”
– Forward vs. reverse driving,
– Amount of switching between forward and reverse,
– Lane keeping,
– On-road vs. off-road,
– Curvature of paths.
Courtesy of Pieter Abbeel

Experimental setup
• Demonstrate parking lot navigation on “train parking lots.”
• Run our apprenticeship learning algorithm to find the reward function.
• Receive “test parking lot” map + starting point and destination.
• Find the trajectory that maximizes the learned reward function for navigating the test parking lot.
Courtesy of Pieter Abbeel

Nice driving style Courtesy of Pieter Abbeel

Sloppy driving-style Courtesy of Pieter Abbeel

“Don’t mind reverse” driving-style Courtesy of Pieter Abbeel

Motivation
Would be nice if
• agents could use simple feedback instead of full demonstrations
• agent designers don’t need to define R

Shaping Problem
• Given
– an MDP\R (an MDP without the reward function R)
– a human trainer who observes the agent, understands a predefined performance metric, and provides occasional scalar reinforcement signals ($R_H : S \times A \to \mathbb{R}$)
• How can an agent learn the best possible task policy (π : S → A), as measured by the performance metric?

Motivating Insights
• Human reinforcement signals capture both short-term and long-term effects
– use them instead of expected rewards, e.g., those computed by an expensive value-based method!
– exploration is not needed!
• The human trainer’s reinforcement function ($R_H$) is a moving target, e.g., it changes to continuously improve the agent’s policy
– model and update $R^{\text{est}}_H$ on the fly

• Human reinforcement signals are delayed; hence, a model for assigning credit over time: Temporal Credit Assignment
From Interactively Shaping Agents via Human Reinforcement by Knox and Stone

TAMER Algorithm: Sketch
Humans Providing Online Rewards
W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems, May 2010. Best student paper.
• TAMER framework
– Uses reward feedback to help shape the agent’s reward
– Can be used in addition to other reward signals from the domain
From Interactively Shaping Agents via Human Reinforcement by Knox and Stone

TAMER Algorithm
From Interactively Shaping Agents via Human Reinforcement by Knox and Stone
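The core loop of learning from human reinforcement can be sketched as: predict the human's reward for each action, act greedily on that prediction, and update the prediction by supervised learning when feedback arrives. The code below is a simplified sketch, not the published TAMER algorithm: the tabular model, the uniform credit split over a fixed window of recent steps, and the class and parameter names are all assumptions.

```python
from collections import defaultdict, deque

class TamerSketch:
    def __init__(self, actions, lr=0.5, window=1):
        self.actions = list(actions)
        self.lr = lr
        self.H = defaultdict(float)          # estimated human reward per (state, action)
        self.recent = deque(maxlen=window)   # (state, action) pairs eligible for credit

    def act(self, state):
        # Greedy with respect to predicted human reward; no exploration term.
        action = max(self.actions, key=lambda a: self.H[(state, a)])
        self.recent.append((state, action))
        return action

    def feedback(self, h):
        """Spread a scalar human signal h over the recent steps and update H."""
        if not self.recent:
            return
        c = 1.0 / len(self.recent)           # assumed uniform credit weights
        predicted = sum(c * self.H[sa] for sa in self.recent)
        error = h - predicted
        for sa in self.recent:
            self.H[sa] += self.lr * c * error

agent = TamerSketch(actions=["left", "right"])
first = agent.act("s")      # all estimates are 0, so the first listed action wins
agent.feedback(-1.0)        # trainer disapproves
second = agent.act("s")
agent.feedback(+1.0)        # trainer approves
third = agent.act("s")
```

After one negative and one positive signal the agent switches from `"left"` to `"right"` and stays there. A larger `window` would spread delayed feedback over several earlier steps.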

Extensions
• Combining signals from environments (R)
• Episodic tasks
• Non-myopically learn from humans, i.e., robust against the human bias towards giving positive rewards
See Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance by Knox and Stone for more details

Findings about Human Teachers
• Tendency to provide guidance of future actions
• A positive bias in RL rewards
• Human-generated reward signal changes over time
• RL should accommodate the above findings!
• New ways to transfer tasks or skills to robots?
Humans Providing Online Rewards
• Sophie’s Kitchen
• Human trainer can award a scalar reward signal r ∈ [−1, 1]
Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence Journal.
From Reinforcement Learning with Human Teachers: Understanding How People Want to Teach Robots by Thomaz et al.

Imitation Learning in Humans Not mere trajectory following. Imitation based
on goal inference. Demonstration Goal-based imitation Infants above 1.5 years of age can imitate action even from an unsuccessful demonstration (Meltzoff & Brook 1998)

Another Example: Gaze Following and Blindfolds 12 month olds Meltzoff
& Brooks, Dev Psych Blind fold experience no training after training

The “Like-Me” Hypothesis
Self-experience allows infants to interpret the acts of others. Self-experience plays an important role in goal inference and imitation.
Computational Model: Probabilistic instantiation of “Like-me” hypothesis.
Meltzoff, Dev Sci, 2007
Meltzoff, Acta Psychologica, 2007

Planning as Probabilistic Inference
• MDP as graphical model (Dynamic Bayesian Network)
• Given initial state s and goal state $Goal_g$, infer:
[Figures: Standard MDP; Goal-based MDP, with nodes Reached, Observation, Goal, State, Action]
From Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao
Also check out Probabilistic inference for solving discrete and continuous state Markov Decision Processes by Toussaint and Storkey
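The idea of planning by inference can be illustrated on a tiny discrete model: condition on the goal being reached after a fixed number of steps and compute the posterior over the first action. This is a simplified sketch, not the algorithm from the paper: it assumes a uniform action prior, computes a marginal posterior with a backward message, and uses an invented three-state chain; the function and variable names are hypothetical.

```python
def action_posterior(T, start, goal, horizon):
    """P(a_1 | s_1 = start, s_{horizon+1} = goal) under a uniform action prior.

    T[s][a][s2] is the transition probability P(s2 | s, a).
    """
    n_states, n_actions = len(T), len(T[0])
    prior = 1.0 / n_actions
    # beta[s] = probability of being at the goal after k more steps, starting at s.
    beta = [1.0 if s == goal else 0.0 for s in range(n_states)]
    for _ in range(horizon - 1):
        beta = [
            sum(prior * sum(T[s][a][s2] * beta[s2] for s2 in range(n_states))
                for a in range(n_actions))
            for s in range(n_states)
        ]
    scores = [
        prior * sum(T[start][a][s2] * beta[s2] for s2 in range(n_states))
        for a in range(n_actions)
    ]
    z = sum(scores)
    return [x / z for x in scores] if z > 0 else None

# Invented deterministic chain 0-1-2; action 0 = left, action 1 = right.
T = [
    [[1, 0, 0], [0, 1, 0]],   # from state 0
    [[1, 0, 0], [0, 0, 1]],   # from state 1
    [[0, 1, 0], [0, 0, 1]],   # from state 2
]
posterior = action_posterior(T, start=0, goal=2, horizon=2)
```

Reaching state 2 from state 0 in two steps is only possible by moving right first, so all posterior mass lands on action 1. The function returns `None` when the goal cannot be reached within the horizon.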

Policy Learning From Goal-Based Imitation as Probabilistic Inference over Graphical
Models by Verma and Rao Motor control learning / Body babbling Skill / Task learning

Goal-Inference Online Imitation
From Goal-Based Imitation as Probabilistic Inference over Graphical Models by Verma and Rao

Related Work: Dynamic, Heterogeneous Imitation Grimes et al., RSS, 2006

Related Work: Modeling Humans
[Figure: x position (cm) vs. y position (cm) for Agent and Mentor; true fixation points and inferred fixation points]
Friesen & Rao, Cog Sci, 2011
Meltzoff et al., Neural Networks, 2010

Conclusion • Exciting area of research! • More generally covers
topics related to: • Programming for non-programmers • Customization by non-programmers • Faster and/or less costly learning • Study how people teach
