
OpenTalks.AI - Александр Новиков, Review of the main works and results in RL in 2020

opentalks3

February 05, 2021

Transcript

  1. Recap: how does RL work? In a few (oversimplified) words

    1) Try random stuff 2) “Reinforce” (do more of it in the future) the stuff that worked better according to the reward
  2. Learning to Summarize with Human Feedback Nisan Stiennon, Long Ouyang,

    et al., OpenAI [blog post, Neurips paper] 1) Finetune GPT-3 to summarize Reddit posts into TL;DRs 2) Sample a lot of model-generated summaries and ask humans to compare pairs of them
  3. Learning to Summarize with Human Feedback 1) Finetune GPT-3 to

    summarize Reddit posts into TL;DRs 2) Sample a lot of model-generated summaries and ask humans to compare pairs of them 3) Train a neural net (a reward model) to predict the human labels
  4. Learning to Summarize with Human Feedback 1) Finetune GPT-3 to

    summarize Reddit posts into TL;DRs 2) Sample a lot of model-generated summaries and ask humans to compare pairs of them 3) Train a neural net (a reward model) to predict the human labels 4) Use RL to finetune the summarizer using the learned reward function (e.g. preferring longer summaries)
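
    A minimal sketch of the pairwise loss behind step 3 (training a reward model on human comparisons), assuming a hypothetical reward_model that maps a (post, summary) pair to a scalar score; this illustrates the idea, not the paper's actual training code:

        import torch
        import torch.nn.functional as F

        def preference_loss(reward_model, post, summary_preferred, summary_rejected):
            # The reward model should score the human-preferred summary higher.
            r_pref = reward_model(post, summary_preferred)   # scalar score
            r_rej = reward_model(post, summary_rejected)     # scalar score
            # Maximize P(preferred beats rejected) = sigmoid(r_pref - r_rej)
            return -F.logsigmoid(r_pref - r_rej).mean()

        # Dummy stand-in just to show the call shape; a real reward model
        # would be a fine-tuned transformer over the tokenized text.
        dummy_model = lambda post, summary: torch.randn(1)
        loss = preference_loss(dummy_model, "reddit post", "good tl;dr", "bad tl;dr")

    In step 4, the trained reward model's score becomes the reward signal for the RL fine-tuning of the summarizer.
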
  5. Autonomous navigation of stratospheric balloons using reinforcement learning Marc G.

    Bellemare, Salvatore Candido, et al., Google Brain [Nature paper]. Training in simulation; deploying in the real world
  6. Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design Michael

    Dennis, Natasha Jaques, et al., Google Brain [talk, Neurips paper]
  7. Asymmetric self-play for automatic goal discovery in robotic manipulation Matthias

    Plappert, Raul Sampedro, et al., OpenAI [paper, blogpost] Similarly, can we generate goals?
  8. Asymmetric self-play for automatic goal discovery in robotic manipulation Similarly,

    can we generate goals? Alice's end-state = goal for Bob
  9. Asymmetric self-play for automatic goal discovery in robotic manipulation Similarly,

    can we generate goals? Alice's end-state = goal for Bob. Bob's reward is to reach the same state (preferably faster than Alice); Alice's reward is to make Bob fail
  10. Asymmetric self-play for automatic goal discovery in robotic manipulation Similarly,

    can we generate goals? Alice's end-state = goal for Bob. Bob's reward is to reach the same state (preferably faster than Alice); Alice's reward is to make Bob fail. Additionally, Bob can cheat and look into how Alice did it
  11. Asymmetric self-play for automatic goal discovery in robotic manipulation Bob

    learns to reach new goals without any finetuning (zero shot)
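
    A schematic of one asymmetric self-play round, to make the Alice/Bob setup from slides 8-10 concrete. The env, alice and bob objects and their methods are invented placeholders (the real system works on robotic manipulation goals and has more detail), but the reward structure follows the slides:

        def self_play_round(env, alice, bob, max_steps=100):
            # Alice just acts in the environment; her end-state becomes Bob's goal.
            state = env.reset()
            alice_trajectory = []
            for _ in range(max_steps):
                action = alice.act(state)
                state = env.step(action)
                alice_trajectory.append((state, action))
            goal = state

            # Bob tries to reach Alice's end-state (ideally faster than Alice did).
            state = env.reset()
            for _ in range(max_steps):
                action = bob.act(state, goal)
                state = env.step(action)
                if env.states_match(state, goal):
                    return {"bob_reward": 1.0, "alice_reward": 0.0, "demo": None}

            # Bob failed: Alice is rewarded for proposing a goal Bob couldn't reach,
            # and Bob may "cheat" by imitating Alice's trajectory (behavioural cloning).
            return {"bob_reward": 0.0, "alice_reward": 1.0, "demo": alice_trajectory}
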
  12. Never Give Up: Learning Directed Exploration Strategies Adrià Puigdomènech Badia,

    Pablo Sprechmann, et al., DeepMind [ICLR paper] How exploration usually works: 1. A separate network tells you how novel your current state is. 2. Add “novelty” to your reward: r_t = r^extrinsic_t + β · r^novelty_t. 3. Push β to 0 with time to start exploiting.
  13. Classic approach for building a novelty network: learn to predict

    s_{t+1} from s_t; the prediction error is the novelty. But the agent gets stuck watching TV (an unpredictable screen keeps the prediction error, and hence the “novelty”, high forever). Never Give Up: Learning Directed Exploration Strategies
  14. Never Give Up: Learning Directed Exploration Strategies [Yuri Burda, Harrison

    Edwards, et al., Exploration by Random Network Distillation, OpenAI & Univ. of Edinburgh, ICLR 2019]
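
    To illustrate the classic lifelong novelty signal referenced here (Random Network Distillation, from the cited Burda et al. paper): a predictor network is trained to match a fixed, randomly initialized target network, and its prediction error on a state serves as the novelty bonus. The network sizes and the 64-dimensional observations below are made up for the sketch:

        import torch
        import torch.nn as nn

        class RNDNovelty(nn.Module):
            # Novelty = how badly the predictor matches a frozen random target net.
            def __init__(self, obs_dim=64, feat_dim=32):
                super().__init__()
                self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                            nn.Linear(128, feat_dim))
                self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                               nn.Linear(128, feat_dim))
                for p in self.target.parameters():
                    p.requires_grad_(False)      # the target stays random forever

            def forward(self, obs):
                error = (self.predictor(obs) - self.target(obs)) ** 2
                return error.mean(dim=-1)        # per-state novelty score

        rnd = RNDNovelty()
        opt = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)
        obs_batch = torch.randn(8, 64)           # stand-in for real observations
        novelty = rnd(obs_batch)                 # use as the intrinsic reward
        # Minimizing the error on visited states makes frequently seen states
        # stop looking novel over time.
        opt.zero_grad(); novelty.mean().backward(); opt.step()
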
  15. Never Give Up: Learning Directed Exploration Strategies Insights 1. Novelty

    measures update slowly. Instead: combine the normal lifelong novelty (which updates slowly) with an episodic memory (quick, but only within an episode). 2. Only count states which are novel in a controllable way (e.g. a novel TV picture doesn't count). 3. Don't stop exploring when you get better. Instead, have separate exploration and exploitation policies and run them all in parallel.
  16. Never Give Up: Learning Directed Exploration Strategies Insights 1. Novelty

    measures update slowly. Instead: combine the normal lifelong novelty (which updates slowly) with an episodic memory (quick, but only within an episode). 2. Only count states which are novel in a controllable way (e.g. a novel TV picture doesn't count). 3. Don't stop exploring when you get better. Instead, have separate exploration and exploitation policies and run them all in parallel.
  17. Never Give Up: Learning Directed Exploration Strategies Insights 1. Novelty

    measures update slowly. Instead: combine the normal lifelong novelty (which updates slowly) with an episodic memory (quick, but only within an episode). 2. Only count states which are novel in a controllable way (e.g. a novel TV picture doesn't count). 3. Don't stop exploring when you get better. Instead, have separate exploration and exploitation policies and run them all in parallel.
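
    A rough sketch of how insight 1 can be turned into an intrinsic reward: a quick episodic bonus (computed against a memory of the current episode only) is modulated by the slowly updating lifelong novelty (e.g. the RND error above), and the result is mixed with the task reward via the β from slide 12. The exact NGU formulas are more involved; the helpers below are an approximation with invented constants:

        import numpy as np

        def episodic_novelty(state_embedding, episode_memory, k=10, eps=1e-3):
            # Low if the state is close to states already visited in this episode,
            # high otherwise (rough k-nearest-neighbour kernel bonus).
            if not episode_memory:
                return 1.0
            dists = np.sort([np.linalg.norm(state_embedding - m) for m in episode_memory])[:k]
            similarity = (eps / (dists + eps)).sum()   # near neighbours -> large similarity
            return 1.0 / np.sqrt(similarity + 1.0)

        def intrinsic_reward(episodic_bonus, lifelong_novelty, max_multiplier=5.0):
            # The slow lifelong novelty only scales the quick episodic bonus up or down.
            return episodic_bonus * np.clip(lifelong_novelty, 1.0, max_multiplier)

        def total_reward(extrinsic, intrinsic, beta=0.3):
            # Slide 12's mixing; per insight 3, a family of policies with different
            # betas runs in parallel instead of annealing a single beta to zero.
            return extrinsic + beta * intrinsic
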
  18. Discovering Reinforcement Learning Algorithms Junhyuk Oh, Matteo Hessel, et al.,

    DeepMind [Neurips paper] Use RL to discover an RL algorithm (e.g. something like PPO) automatically
  19. Discovering Reinforcement Learning Algorithms Junhyuk Oh, Matteo Hessel, et al.,

    DeepMind [Neurips paper] Use RL to discover an RL algorithm (e.g. something like PPO) automatically Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true
  20. Discovering Reinforcement Learning Algorithms Junhyuk Oh, Matteo Hessel, et al.,

    DeepMind [Neurips paper] Use RL to discover an RL algorithm (e.g. something like PPO) automatically Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory
  21. Use RL to discover an RL algorithm (e.g. something like

    PPO) automatically. Discovering Reinforcement Learning Algorithms, Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function toward G_t, the discounted return (“how good did the episode end”): G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
  22. Use RL to discover an RL algorithm (e.g. something like

    PPO) automatically. Discovering Reinforcement Learning Algorithms, Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function toward the discounted return. [Illustration (image source): the car crashed -- the future is bad, but it was obvious that the future would be bad, so don't encourage / penalize the taken action]
  23. Use RL to discover an RL algorithm (e.g. something like

    PPO) automatically. Discovering Reinforcement Learning Algorithms, Junhyuk Oh, Matteo Hessel, et al., DeepMind [Neurips paper]. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function toward the discounted return. [Illustration: the car did not crash -- the future is OK even though it was obvious that the future would be bad, i.e. the action avoided the forecasted crash, awesome]
  24. Discovering Reinforcement Learning Algorithms Junhyuk Oh, Matteo Hessel, et al.,

    DeepMind [Neurips paper] Use RL to discover an RL algorithm (e.g. something like PPO) automatically. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function V(s_t) toward the discounted return G_t c. Update the policy by doing a gradient step on log π(a_t | s_t) · (G_t - V(s_t)), where G_t - V(s_t) measures whether the episode went better or worse than the value function predicted
  25. Discovering Reinforcement Learning Algorithms Use RL to discover an RL

    algorithm (e.g. something like PPO) automatically. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function toward the discounted return c. Update the policy by doing a gradient step on log π(a_t | s_t) · (G_t - V(s_t)). Slide annotations: “Keep” (the overall loop) and “Learn” (the update rule)
  26. Discovering Reinforcement Learning Algorithms Use RL to discover an RL

    algorithm (e.g. something like PPO) automatically. Recap of a simple RL algorithm 1. Initialize parameters of policy and of value function 2. While true a. Run policy in the episode and collect a trajectory b. Update the value function toward the discounted return c. Update the policy by doing a gradient step on log π(a_t | s_t) · (G_t - V(s_t)). Slide annotations: “Keep” (the overall loop) and “Learn” (what the value function and the policy are updated toward, i.e. both update rules)
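
    The recap above leaves the actual update formulas to the slide graphics; the sketch below spells out one standard version (Monte-Carlo returns with a value baseline) on a made-up toy setup. In the paper, this hand-designed update is the part that gets replaced by a discovered, meta-learned rule, while the surrounding loop is kept:

        import torch
        import torch.nn as nn

        # Toy stand-ins: tiny policy and value networks over a made-up 4-dim state.
        policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # 2 actions
        value = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
        opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=1e-3)
        gamma = 0.99

        # a. Run the policy and collect a trajectory (fake random dynamics here).
        states, actions, rewards = [], [], []
        s = torch.randn(4)
        for _ in range(20):
            a = torch.distributions.Categorical(logits=policy(s)).sample()
            states.append(s); actions.append(a); rewards.append(torch.randn(()))
            s = torch.randn(4)

        # b. Discounted return G_t ("how good did the episode end") for every step.
        returns, G = [], torch.tensor(0.0)
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        policy_loss, value_loss = 0.0, 0.0
        for s, a, G in zip(states, actions, returns):
            V = value(s).squeeze()
            advantage = (G - V).detach()    # did the episode go better than predicted?
            logp = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
            policy_loss = policy_loss - logp * advantage   # c. policy gradient step
            value_loss = value_loss + (V - G) ** 2         # b. move V toward G
        opt.zero_grad(); (policy_loss + value_loss).backward(); opt.step()
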
  27. Discovering Reinforcement Learning Algorithms 1. Discovered learning algorithm generalizes from

    grid worlds to Atari 2. Came up with the idea of a value function (and how to learn it) on its own. [Figures: training on gridworld during meta-training; training on Atari during meta-test]
  28. Offline (batch) RL Can we use RL to learn to

    drive cars or control datacenter cooling systems?
  29. Offline (batch) RL Can we use RL to learn to

    drive cars or control datacenter cooling systems? The old recipe is too dangerous and slow: 1) Try random stuff (“drive randomly”) 2) Reinforce the stuff that worked better (“try to not repeat actions that led to crashes”)
  30. Offline (batch) RL Can we use RL to learn to

    drive cars or control datacenter cooling systems? Instead, collect human data and try to learn from it. [Diagram -- data collection: the environment produces states and rewards, actions are taken, and the resulting states, rewards, and actions are stored in a dataset]
  31. Offline (batch) RL Can we use RL to learn to

    drive cars or control datacenter cooling systems? Instead, collect human data and try to learn from it. [Diagram -- data collection: states, rewards, and actions from the environment are stored in a dataset; learning: the agent is then trained from that dataset]
  32. Offline (batch) RL Can we use RL to learn to

    drive cars or control datacenter cooling systems? Instead, collect human data and try to learn from it. [Diagram -- data collection: states, rewards, and actions from the environment are stored in a dataset; learning: the agent is trained from that dataset; testing: the trained agent is deployed back into the environment]
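
    A schematic of the three phases in the diagram (data collection, learning, testing). The dataset layout and the env/agent interfaces are invented for the sketch; the point is only that the learner never interacts with the environment until testing:

        def collect(env, behaviour_policy, episodes=100, steps=200):
            # Phase 1 -- data collection (e.g. humans driving); no learning here.
            dataset = []
            for _ in range(episodes):
                state = env.reset()
                for _ in range(steps):
                    action = behaviour_policy(state)
                    next_state, reward = env.step(action)
                    dataset.append((state, action, reward, next_state))
                    state = next_state
            return dataset

        def learn(agent, dataset, updates=10_000):
            # Phase 2 -- learning: the agent only ever sees the fixed dataset.
            for _ in range(updates):
                agent.update(agent.sample_batch(dataset))   # BC, conservative Q-learning, ...

        def test(env, agent, steps=200):
            # Phase 3 -- testing: deploy the trained agent in the real environment.
            state, total = env.reset(), 0.0
            for _ in range(steps):
                state, reward = env.step(agent.act(state))
                total += reward
            return total
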
  33. Offline RL pros and cons 1. Can solve harder tasks

    (safer; no need to collect billions of frames) 2. (But of course there is no need to apply offline RL to e.g. Atari) 3. Cheaper research (no need for 10k CPUs per run, since the data is prerecorded) 4. Existing datasets and code examples to get started 5. Some unique challenges compared to classic RL 6. More low-hanging fruit :)
  34. How to train offline RL agents? 1. Behaviour Cloning: just

    train a neural net to predict actions from states via supervised learning (see the sketch below) ◦ Only works when all the data is of high quality, i.e. BC can't do better than the data. 2. Just apply classic RL -- will that work? ◦ Not well, because of overestimation of some actions: classic RL can randomly decide that something not present in the data ("drive into the wall") is worth a try. Most offline RL methods therefore focus on avoiding this overestimation.
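
    A minimal behaviour-cloning sketch for option 1, on made-up logged data (1000 8-dimensional states and the discrete actions the demonstrators took); the closing comment notes where naive Q-learning on the same data goes wrong:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        # Made-up offline dataset: states and the actions the demonstrators took.
        states = torch.randn(1000, 8)
        actions = torch.randint(0, 4, (1000,))

        # Behaviour Cloning: plain supervised learning -- predict the logged action.
        # It can imitate the data but, by construction, cannot outperform it.
        policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
        opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
        for _ in range(200):
            idx = torch.randint(0, 1000, (64,))
            loss = F.cross_entropy(policy(states[idx]), actions[idx])
            opt.zero_grad(); loss.backward(); opt.step()

        # Naive Q-learning on the same data would bootstrap through max_a Q(s', a),
        # including actions that never appear in the dataset ("drive into the wall"),
        # and can wildly overestimate their value -- which is why most offline RL
        # methods penalize or constrain out-of-distribution actions.
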