
OpenTalks.AI - Александр Новиков, Overview of the main papers and results in RL in 2020

opentalks3
February 05, 2021



Transcript

  1. (Some) cool RL papers
    from 2020
    Alex Novikov


  2. What is Reinforcement Learning (RL)
    [Diagram: the environment sends the agent a state and a reward (e.g. reward: 1); the agent responds with an action (e.g. “go left”).]


  3. Recap: how does RL work?
    In a few (oversimplified) words
    1) Try random stuff
    2) “Reinforce” (do more of it in the future) the stuff that worked better according to the reward (a toy sketch below)
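
    A toy illustration of “reinforce what worked” (my own sketch, not from the talk): one state with two actions, where action 0 always pays reward 1, learned with the plain score-function (REINFORCE) update.

```python
import torch

# One state, two actions; action 0 always pays reward 1, action 1 pays 0.
logits = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                          # 1) try random stuff
    reward = 1.0 if action.item() == 0 else 0.0
    loss = -dist.log_prob(action) * reward          # 2) reinforce what earned reward
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=0))  # probability of action 0 ends up close to 1
```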


  4. Learning to Summarize with Human Feedback
    Nisan Stiennon, Long Ouyang, et al., OpenAI [blog post, Neurips paper]
    1) Finetune GPT-3 to summarize Reddit posts into TL;DRs
    2) Sample a lot of artificial summaries and ask humans to compare pairs of them


  5. Learning to Summarize with Human Feedback
    1) Finetune GPT-3 to summarize Reddit posts into TL;DRs
    2) Sample a lot of artificial summaries and ask humans to compare pairs of them
    3) Train a neural net (reward model) to predict human labels


  6. Learning to Summarize with Human Feedback
    1) Finetune GPT-3 to summarize Reddit posts into TL;DRs
    2) Sample a lot of artificial summaries and ask humans to compare pairs of them
    3) Train a neural net (reward model) to predict human labels
    4) Use RL to finetune the summarizer using the learned reward function (e.g. preferring longer summaries)
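
    Step 3 above is usually a pairwise ranking loss over the human comparisons; a hedged sketch (Bradley-Terry style; `reward_model` and its (posts, summaries) -> scalar interface are assumptions, not the paper's code):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, posts, preferred, rejected):
    """Train the reward model to score the human-preferred summary higher.

    reward_model(posts, summaries) is assumed to return one scalar per example.
    """
    r_pref = reward_model(posts, preferred)
    r_rej = reward_model(posts, rejected)
    # Maximize log sigmoid(r_pref - r_rej): the probability of the human label.
    return -F.logsigmoid(r_pref - r_rej).mean()
```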


  7. Autonomous navigation of stratospheric balloons using
    reinforcement learning
    Marc G. Bellemare, Salvatore Candido, et al., Google Brain [Nature paper]
    [Figure panels: training in simulation; deploying in the real world.]


  8. Emergent Complexity and Zero-shot Transfer via
    Unsupervised Environment Design
    Michael Dennis, Natasha Jaques, et al., Google Brain [talk, Neurips paper]


  9. Emergent Complexity and Zero-shot Transfer via
    Unsupervised Environment Design



  12. Asymmetric self-play for automatic goal discovery in
    robotic manipulation
    Matthias Plappert, Raul Sampedro, et al., OpenAI [paper, blogpost]
    Similarly, can we generate goals?


  13. Asymmetric self-play for automatic goal discovery in
    robotic manipulation
    Similarly, can we generate goals?
    Alice end-state = goal for Bob


  14. Asymmetric self-play for automatic goal discovery in
    robotic manipulation
    Similarly, can we generate goals?
    Alice end-state = goal for Bob
    Bob’s reward is to reach the same state (preferably faster than Alice)
    Alice’s reward is to make Bob fail


  15. Asymmetric self-play for automatic goal discovery in
    robotic manipulation
    Similarly, can we generate goals?
    Alice end-state = goal for Bob
    Bob’s reward is to reach the same state (preferably faster than Alice)
    Alice’s reward is to make Bob fail
    Additionally, Bob can cheat and look at how Alice did it (a rough sketch of one round follows below)
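
    A rough sketch of one Alice/Bob round as described above (my paraphrase; `env`, `alice`, `bob`, their methods and the 0/1 rewards are illustrative assumptions, not the paper's code):

```python
def asymmetric_self_play_round(env, alice, bob, max_steps=100):
    # Alice acts freely from a shared start state; her final state becomes Bob's goal.
    start = env.reset()
    alice_traj = alice.rollout(env, steps=max_steps)
    goal = alice_traj.final_state

    # Bob restarts from the same start state and must reproduce Alice's end state.
    env.set_state(start)
    bob_traj = bob.rollout(env, steps=max_steps, goal=goal)
    bob_solved = env.states_match(bob_traj.final_state, goal)

    # Bob is rewarded for reaching the goal; Alice for proposing goals Bob fails on.
    bob.update(bob_traj, reward=1.0 if bob_solved else 0.0)
    alice.update(alice_traj, reward=0.0 if bob_solved else 1.0)

    # The "cheat": when Bob fails, he can also imitate Alice's demonstration.
    if not bob_solved:
        bob.behaviour_clone(alice_traj)
```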


  16. Asymmetric self-play for automatic goal discovery in
    robotic manipulation
    Bob learns to reach new goals without any finetuning (zero shot)


  17. Never Give Up: Learning Directed Exploration Strategies
    Adrià Puigdomènech Badia, Pablo Sprechmann, et al., DeepMind [ICLR paper]
    How exploration usually works
    1. A separate network tells you how novel your current state is.
    2. Add a “novelty” bonus to your reward: r_t = r_t^extrinsic + β · novelty(s_t) (sketched below).
    3. Push β to 0 with time to start exploiting.
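
    In code, the usual recipe is just a reward bonus with an annealed coefficient (a sketch; `novelty_net` stands for whatever network scores state novelty):

```python
def augmented_reward(r_extrinsic, state, novelty_net, beta):
    # Step 2: add the novelty bonus to the environment reward.
    return r_extrinsic + beta * novelty_net(state)

def beta_schedule(step, total_steps, beta_start=0.3):
    # Step 3: push beta to 0 over time so the agent eventually exploits.
    return beta_start * max(0.0, 1.0 - step / total_steps)
```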


  18. Never Give Up: Learning Directed Exploration Strategies
    Classic approach for building a novelty network: learn to predict s_t+1 from s_t; the prediction error is the novelty (sketched below).
    But the agent gets stuck watching TV: observations that are inherently unpredictable keep a high prediction error and so look “novel” forever.
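
    The “classic” novelty network from this slide, sketched with an assumed forward dynamics model (illustration only; the same squared error is both the training loss and the novelty bonus):

```python
import torch
import torch.nn as nn

class ForwardModelNovelty(nn.Module):
    """Novelty(s_t, a_t, s_{t+1}) = error of predicting s_{t+1} from (s_t, a_t).

    The noisy-TV failure mode: truly unpredictable observations keep a high
    prediction error forever, so they look eternally "novel".
    """

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action, next_state):
        pred = self.net(torch.cat([state, action], dim=-1))
        # Per-sample squared prediction error; minimize it to train the model,
        # and feed the same value back to the agent as the novelty bonus.
        return ((pred - next_state) ** 2).mean(dim=-1)
```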


  19. Never Give Up: Learning Directed Exploration Strategies
    [Yuri Burda, Harrison Edwards, et al., Exploration by Random Network Distillation, OpenAI & Univ. of Edinburgh, ICLR 2019]


  20. Never Give Up: Learning Directed Exploration Strategies
    Insights
    1. Novelty measures update slowly. Instead, combine lifelong novelty (which updates slowly) with an episodic memory (quick, but only within the episode); see the sketch below.
    2. Only count states that are novel in a controllable way (e.g. a novel TV picture doesn’t count).
    3. Don’t stop exploring when you get better. Instead, have separate exploration and exploitation policies and run them all in parallel.
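
    A hedged sketch of insight 1 (roughly the NGU combination: an episodic bonus scaled by a clipped lifelong-novelty modulator; the input values are assumed to be computed elsewhere):

```python
def ngu_style_intrinsic_reward(episodic_novelty, lifelong_novelty, clip_max=5.0):
    """Episodic novelty resets every episode (e.g. from an episodic memory of
    state embeddings); lifelong novelty updates slowly across training."""
    modulator = min(max(lifelong_novelty, 1.0), clip_max)
    return episodic_novelty * modulator

def total_reward(r_extrinsic, r_intrinsic, beta):
    # Insight 3: rather than annealing a single beta to 0, run a family of
    # policies with different betas (including beta = 0) in parallel.
    return r_extrinsic + beta * r_intrinsic
```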



  23. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically


  24. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true


  25. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory


  26. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … is the discounted return (“how good did the episode end”)


  27. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t is the discounted return
    [Figure: the agent crashed, so the return is bad (“future is bad”), but the value function already expected it (“it’s obvious that the future will be bad”); the advantage is near zero, so the taken action is neither encouraged nor penalized.]


  28. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t is the discounted return
    [Figure: the agent did not crash, so the return is fine even though the value function forecast a bad future (“it’s obvious that the future will be bad”); the advantage is positive, and the action that avoided the forecast crash is strongly encouraged.]


  29. Discovering Reinforcement Learning Algorithms
    Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t is the discounted return
    c. Update the value function φ by doing a gradient step on (V_φ(s_t) − G_t)², where G_t is the same discounted return (see the sketch below)
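
    Steps 2b and 2c written out as a hedged sketch (a vanilla policy gradient with a value baseline; the networks, optimizers and trajectory format are assumptions):

```python
import torch
import torch.nn.functional as F

def simple_rl_update(policy, value_fn, pi_opt, v_opt, trajectory, gamma=0.99):
    """One iteration of the recap: trajectory = (states, actions, rewards)
    collected by running the policy for one episode (step 2a)."""
    states, actions, rewards = trajectory
    states = torch.stack(states)                  # [T, state_dim]
    actions = torch.as_tensor(actions)            # [T], discrete actions

    # Discounted returns G_t ("how good did the episode end").
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # Step 2b: policy gradient weighted by the advantage G_t - V(s_t).
    values = value_fn(states).squeeze(-1)
    advantages = (returns - values).detach()
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    pi_loss = -(log_probs * advantages).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Step 2c: fit the value function to the discounted return.
    v_loss = F.mse_loss(value_fn(states).squeeze(-1), returns)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()
```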


  30. Discovering Reinforcement Learning Algorithms
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t is the discounted return
    c. Update the value function φ by doing a gradient step on (V_φ(s_t) − G_t)²
    Keep: the overall structure of the loop (steps 1 and 2a)
    Learn: the update rule (what the policy is pushed towards)


  31. Discovering Reinforcement Learning Algorithms
    Use RL to discover an RL algorithm (e.g. something like PPO) automatically
    Recap of a simple RL algorithm
    1. Initialize parameters of policy and of value function
    2. While true
    a. Run the policy for an episode and collect a trajectory
    b. Update the policy θ: take a gradient step to increase log π_θ(a_t | s_t) · (G_t − V_φ(s_t)), where G_t is the discounted return
    c. Update the value function φ by doing a gradient step on (V_φ(s_t) − G_t)²
    Keep: the overall structure of the loop (steps 1 and 2a)
    Learn: both the policy update and what the value-like prediction should be regressed towards


  32. Discovering Reinforcement Learning Algorithms
    1. Discovered learning algorithm generalizes from grid worlds to Atari
    2. Came up with the idea of value function (and how to learn it) on its own
    [Figure: learning curves on gridworlds during meta-training vs. on Atari during meta-test.]


  33. Offline (batch) RL
    Can we use RL to learn to drive cars or control datacenter cooling systems?


  34. Offline (batch) RL
    Can we use RL to learn to drive cars or control datacenter cooling systems?
    The old recipe is too dangerous and slow:
    1) Try random stuff (“drive randomly”)
    2) Reinforce the stuff that worked better (“try to not repeat actions that led to
    crashes”)


  35. Offline (batch) RL
    Can we use RL to learn to drive cars or control datacenter cooling systems?
    Instead, collect human data and try to learn from it
    [Diagram, data collection: a behaviour policy interacts with the environment (receiving states and rewards, sending actions) and the resulting states, rewards and actions are written to a dataset.]


  36. Offline (batch) RL
    Can we use RL to learn to drive cars or control datacenter cooling systems?
    Instead, collect human data and try to learn from it
    [Diagram: data collection into a dataset as above, plus a learning stage where the agent is trained purely from the dataset.]


  37. Offline (batch) RL
    Can we use RL to learn to drive cars or control datacenter cooling systems?
    Instead, collect human data and try to learn from it
    [Diagram: data collection into a dataset, learning from the dataset, then testing the learned agent by letting it interact with the environment.]


  38. Offline RL pros and cons
    1. Can solve harder tasks (safer; no need for collecting billions of frames)
    2. (But of course there is no need to apply offline RL to e.g. Atari)
    3. Cheaper research (no need for 10k CPUs per run, since data is prerecorded)
    4. Existing datasets and code examples to get started
    5. Some unique challenges compared to classic RL
    6. More low-hanging fruit :)


  39. How to train offline RL agents?
    1. Behaviour Cloning (BC): just train a neural net to predict actions from states via supervised learning (a minimal sketch below)
    ○ Only works when all the data is of high quality, i.e. BC can’t do better than the data.
    2. Just apply classic RL -- will that work?
    ○ Not well, because of overestimation: classic RL can randomly decide that something not present in the data (“drive into a wall”) is worth a try.
    Most offline RL methods therefore focus on avoiding this overestimation.
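
    Behaviour cloning from item 1, as a minimal sketch (supervised learning on the logged (state, action) pairs; batch format and networks are assumptions):

```python
import torch.nn.functional as F

def behaviour_cloning_step(policy, optimizer, states, actions):
    """Predict the logged action from the logged state.

    states: float tensor [B, state_dim]; actions: long tensor [B] (assumed).
    BC can at best reproduce the behaviour that is already in the dataset.
    """
    loss = F.cross_entropy(policy(states), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```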


  40. Critic regularized regression
    Ziyu Wang, Alex Novikov, Konrad Zołna, et al., DeepMind [code, Neurips paper]
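
    The core CRR idea is critic-filtered behaviour cloning: clone a logged action only when the learned critic thinks it is at least as good as what the current policy would do on its own. A hedged sketch of the policy loss (the binary “indicator” variant; discrete actions and the `q_net`/`policy` interfaces are simplifications, not the paper's code):

```python
import torch

def crr_policy_loss(policy, q_net, states, actions, n_samples=4):
    """Weighted behaviour cloning: weight = 1[Q(s, a_data) > V(s)], where V(s)
    is estimated by averaging Q over actions sampled from the current policy.
    (Another variant uses exp(advantage / beta) as the weight.)"""
    dist = torch.distributions.Categorical(logits=policy(states))

    with torch.no_grad():
        sampled = dist.sample((n_samples,))                       # [n, B]
        v_estimate = torch.stack(
            [q_net(states, a) for a in sampled]).mean(dim=0)      # [B]
        advantage = q_net(states, actions) - v_estimate
        weight = (advantage > 0).float()

    # Imitate only the dataset actions the critic judges to be good.
    return -(weight * dist.log_prob(actions)).mean()
```

    On non-expert or mixed-quality data (see the slide below about random data), this critic weighting is what lets the method do better than plain BC.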


  41. Critic regularized regression



  44. Critic regularized regression
    It doesn’t make sense to use BC on this non-expert (random) data



  46. Critic regularized regression
    [Figure captions: “Thanks!”, “Imitate it now!”]



  49. Any questions?
