
Human-in-the-Loop Reinforcement Learning (Pieter Abbeel, UC Berkeley | Covariant | The Robot Brains Podcast)


Deep reinforcement learning (Deep RL) has seen many successes, including learning to play Atari games, the classical game of Go, robotic locomotion and manipulation. However, now that Deep RL has become fairly capable of optimizing reward, a new challenge has arisen: How to choose the reward function that is to be optimized? Indeed, this often becomes the key engineering time sink for practitioners. In this talk, I will present some recent progress on human-in-the-loop reinforcement learning. The newly proposed algorithm, PEBBLE, empowers a human supervisor to directly teach an AI agent new skills without the usual extensive reward engineering or curriculum design efforts.

Anyscale

July 15, 2021

Transcript

  1. Fast Progress on Deep RL. 2013: Atari (DQN) [DeepMind]: Pong, Enduro, Beamrider, Q*bert.
  2. Deep Q-Network (DQN): From Pixels to Joystick Commands. [Source: Mnih et al., Nature 2015 (DeepMind)]
  3. Fast Progress on Deep RL. AlphaGo: Silver et al., Nature 2016; AlphaGo Zero: Silver et al., Nature 2017; AlphaZero: Silver et al., 2017; see also Tian et al., 2016; Maddison et al., 2014; Clark et al., 2015. Timeline: 2013 Atari (DQN) [DeepMind], 2015 AlphaGo [DeepMind].
  4. Fast Progress on Deep RL. Timeline: 2013 Atari (DQN) [DeepMind], 2015 AlphaGo [DeepMind], 2016 3D locomotion (TRPO+GAE) [Berkeley]. [Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]
  5. Fast Progress on Deep RL. Timeline: 2013 Atari (DQN) [DeepMind], 2015 AlphaGo [DeepMind], 2016 3D locomotion (TRPO+GAE) [Berkeley], 2016 Real Robot Manipulation (GPS) [Berkeley]. [Levine*, Finn*, Darrell, Abbeel, JMLR 2016]
  6. Fast Progress on Deep RL. Timeline: 2013 Atari (DQN) [DeepMind], 2015 AlphaGo [DeepMind], 2016 3D locomotion (TRPO+GAE) [Berkeley], 2016 Real Robot Manipulation (GPS) [Berkeley], 2019 Rubik's Cube (PPO+DR) [OpenAI].
  7. Challenge: Designing a Suitable Reward. The agent makes decisions (actions) and observes consequences (observations, rewards), but for some tasks a reward is hard to define (e.g., cooking).
  8. Challenge: Designing a Suitable Reward. Besides tasks where a reward is hard to define (e.g., cooking), hand-specified rewards can be exploited by the agent (reward exploitation): https://openai.com/blog/faulty-reward-functions/
  9. What is an Alternative Solution? • Putting (non-expert) humans into the agent learning loop! [Diagram: RL algorithm and environment exchange actions and observations; instead of a reward 💰, a human 👩💻 watches the behaviors of the agent and provides feedback.]
  10. What is an Alternative Solution? • Putting (non-expert) humans into the agent learning loop! [Diagram: the human 👩💻 provides preference feedback on the agent's behaviors.] Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S. and Amodei, D., Reward learning from human preferences and demonstrations in Atari. NeurIPS, 2018. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S. and Amodei, D., Deep reinforcement learning from human preferences. NeurIPS, 2017.
  11. What is an Alternative Solution? • Putting (non-expert) humans into the agent learning loop! The human can interactively guide agents according to their progress → can teach harder tasks, where we can't easily define the reward → can avoid reward exploitation.
  12. The PEBBLE Algorithm • Step 1. Collect samples via interactions with the environment • Step 2. Collect human preferences 👩💻
  13. The PEBBLE Algorithm • Step 1. Collect samples via interactions with the environment • Step 2. Collect human preferences • Step 3. Optimize a reward model using a cross-entropy loss 👩💻
  14. Learning Reward from Preferences • Fitting a reward model [1] ◦ Main idea: formulate this problem as binary classification! ◦ Following the Bradley-Terry model [2], we can model a preference predictor as follows (reconstructed below): [1] Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S. and Amodei, D., Deep reinforcement learning from human preferences. NeurIPS, 2017. [2] Bradley, R.A. and Terry, M.E., Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), pp. 324-345, 1952.
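
    The predictor referenced on this slide is not rendered in the transcript. Reconstructed from [1] and [2] (notation assumed here: σ^0 and σ^1 are the two behavior segments shown to the human, r̂_ψ is the learned reward model, y is the human's preference label), it presumably has the form

        P_\psi[\sigma^1 \succ \sigma^0] = \frac{\exp \sum_t \hat{r}_\psi(s_t^1, a_t^1)}{\exp \sum_t \hat{r}_\psi(s_t^0, a_t^0) + \exp \sum_t \hat{r}_\psi(s_t^1, a_t^1)}

    and the reward model is fit by minimizing the cross-entropy loss over the dataset of labeled comparisons:

        \mathcal{L}(\psi) = -\mathbb{E}_{(\sigma^0, \sigma^1, y) \sim \mathcal{D}} \big[ y(0) \log P_\psi[\sigma^0 \succ \sigma^1] + y(1) \log P_\psi[\sigma^1 \succ \sigma^0] \big]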
  15. The PEBBLE Algorithm • Step 1. Collect samples via interactions with the environment • Step 2. Collect human preferences • Step 3. Optimize a reward model using a cross-entropy loss • Step 4. Optimize a policy using off-policy algorithms 👩💻
  16. The PEBBLE Algorithm • Step 1. Collect samples via interactions with the environment • Step 2. Collect human preferences • Step 3. Optimize a reward model using a cross-entropy loss • Step 4. Optimize a policy using off-policy algorithms • Repeat Steps 1-4 👩💻 (see the sketch below)
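
    A minimal sketch of how Steps 1-4 fit together in one outer loop. This is not the authors' code; the helper names (collect_rollouts, query_human_preferences, reward_model, sac_update) and the use of PyTorch are assumptions for illustration only:

        import torch

        def pebble_outer_loop(env, policy, reward_model, replay_buffer,
                              collect_rollouts, query_human_preferences, sac_update,
                              num_iterations=100):
            """Hedged sketch of the PEBBLE loop; all callables are hypothetical."""
            for _ in range(num_iterations):
                # Step 1: collect samples by acting in the environment with the current policy
                segments = collect_rollouts(env, policy, replay_buffer)

                # Step 2: ask the human to compare pairs of behavior segments
                preferences = query_human_preferences(segments)  # list of (sigma0, sigma1, label)

                # Step 3: fit the reward model with the Bradley-Terry cross-entropy loss
                for sigma0, sigma1, label in preferences:
                    r0 = reward_model.sum_reward(sigma0)   # sum_t r_hat(s_t, a_t) over segment 0
                    r1 = reward_model.sum_reward(sigma1)
                    p1 = torch.exp(r1) / (torch.exp(r0) + torch.exp(r1))  # P[sigma1 preferred]
                    loss = -(label * torch.log(p1) + (1 - label) * torch.log(1 - p1))
                    reward_model.optimize(loss)

                # Step 4: optimize the policy with an off-policy algorithm (e.g. SAC),
                # using the learned reward model in place of an environment reward
                sac_update(policy, replay_buffer, reward_fn=reward_model)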
  17. Unsupervised Pre-training: APT • Obtaining good initial state-space coverage is important! ◦ Otherwise the human can't convey much meaningful information to the agent
  18. Unsupervised Pre-training: APT • Obtaining good initial state-space coverage is important! ◦ Otherwise the human can't convey much meaningful information to the agent [Videos: behavior from a random exploration policy vs. behavior from the pre-trained policy] (see the entropy-bonus sketch below)
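
    For reference, a minimal sketch (an assumption about the mechanism, not code from the talk) of the particle-based state-entropy bonus that APT-style unsupervised pre-training maximizes: states far from their k nearest neighbors in the batch get a higher intrinsic reward, pushing the agent toward broad state-space coverage. APT computes this in a learned representation space; raw state vectors are used here for simplicity:

        import torch

        def apt_intrinsic_reward(states: torch.Tensor, k: int = 12) -> torch.Tensor:
            # states: (N, d) batch of states (APT encodes observations first; omitted here)
            dists = torch.cdist(states, states)               # (N, N) pairwise Euclidean distances
            knn, _ = dists.topk(k + 1, dim=1, largest=False)  # k nearest neighbors + self (distance 0)
            knn = knn[:, 1:]                                  # drop the self-distance
            # r_int(s) = log(1 + mean distance to k nearest neighbors): particle entropy estimate
            return torch.log(1.0 + knn.mean(dim=1))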
  19. Can Humans Teach Novel Behaviors with PEBBLE?
  20. Can Humans Teach Novel Behaviors with PEBBLE? • 40 queries in less than 5 minutes [Videos: counterclockwise vs. clockwise]
  21. Can Humans Teach Novel Behaviors with PEBBLE? • 200 queries in less than 30 minutes [Videos: waving the right front leg vs. waving the left front leg]
  22. Can Humans Teach Novel Behaviors with PEBBLE? • 400 queries in less than one hour
  23. Can We Avoid Reward Exploitation? • SAC with the task reward on Walker-walk uses only one leg, even though the score is ~1000
  24. Can We Avoid Reward Exploitation? • 150 queries in less than 20 minutes • SAC with the task reward on Walker-walk uses only one leg, even though the score is ~1000 • SAC trained with human feedback uses both legs
  25. Comparison: Locomotion Tasks • Learning curves (10 random seeds) [Plot note: asymptotic performance of PPO and Preference PPO is indicated by dotted lines of the corresponding color]
  26. Summary • Reinforcement learning algorithms have become effective at optimizing reward • BUT: outside of games, reward can be hard to specify correctly • PEBBLE enables effective human-in-the-loop reinforcement learning