et al., OpenAI [blog post, NeurIPS paper]
1) Fine-tune GPT-3 to summarize Reddit posts into TL;DRs
2) Sample a lot of artificial summaries and ask humans to compare pairs of them
3) Train a neural net (reward model) to predict the human labels
4) Use RL to fine-tune the summarizer against the learned reward function (which can, e.g., end up preferring longer summaries)
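A minimal sketch of step 3, the reward model fit on pairwise human comparisons. The class and tensor names below are illustrative placeholders (assuming a pretrained encoder that returns pooled features), not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    """Maps a (post, summary) encoding to a single scalar score."""
    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder              # e.g. a pretrained transformer (assumed)
        self.head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        h = self.encoder(tokens)            # assumed to return (batch, hidden_dim) pooled features
        return self.head(h).squeeze(-1)     # (batch,) scalar rewards

def preference_loss(reward_model, preferred_tokens, rejected_tokens):
    """Pairwise (Bradley-Terry style) loss: the human-preferred summary should score higher."""
    r_pref = reward_model(preferred_tokens)
    r_rej = reward_model(rejected_tokens)
    return -F.logsigmoid(r_pref - r_rej).mean()
```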
Can we generate goals automatically? Alice's end state = goal for Bob. Bob's reward is for reaching the same state (preferably faster than Alice); Alice's reward is for making Bob fail. Additionally, Bob can cheat and look at how Alice did it.
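A rough sketch of how the rewards could be assigned at the end of one Alice/Bob round. The variable names and the time bonus are illustrative, not the paper's exact formulation:

```python
def self_play_rewards(bob_reached_goal: bool, bob_steps: int, alice_steps: int,
                      time_bonus: float = 0.01):
    if bob_reached_goal:
        # Bob is rewarded for reproducing Alice's end state, with a small
        # bonus for getting there faster than Alice did.
        bob_reward = 1.0 + time_bonus * max(0, alice_steps - bob_steps)
        alice_reward = 0.0
    else:
        # Alice is rewarded only when she found a goal Bob could not reach,
        # which pushes her towards tasks just beyond Bob's current ability.
        bob_reward = 0.0
        alice_reward = 1.0
    return alice_reward, bob_reward
```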
Pablo Sprechmann, et al., DeepMind [ICLR paper]
How exploration usually works:
1. A separate network tells you how novel your current state is.
2. Add "novelty" to your reward: total reward = extrinsic reward + beta * novelty.
3. Push beta to 0 with time to start exploiting.
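In code, the usual recipe is just a scaled bonus with an annealed beta (a sketch; the names and the linear schedule are illustrative):

```python
def exploration_reward(extrinsic_reward: float, novelty: float,
                       step: int, total_steps: int, beta_start: float = 0.3):
    # Anneal beta towards 0 so the agent gradually switches to pure exploitation.
    beta = beta_start * max(0.0, 1.0 - step / total_steps)
    return extrinsic_reward + beta * novelty
```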
1. Novelty measures update slowly. Instead: use normal lifelong novelty (which updates slowly) AND episodic memory (quick, but only within the episode).
2. Only count states which are novel in a controllable way (e.g. a novel TV picture doesn't count).
3. Don't stop exploring when you get better. Instead, have separate exploration and exploitation policies and run them all in parallel.
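A hedged sketch of combining the two novelty signals, in the spirit of the paper. The state embedding is assumed to come from an encoder trained with an inverse-dynamics loss (so uncontrollable things like the TV don't register), and the kNN bonus and clipping constants are simplifications:

```python
import numpy as np

class EpisodicNovelty:
    """Quick, within-episode novelty: distance to states already seen this episode."""
    def __init__(self):
        self.memory = []   # embeddings of states visited in the current episode

    def reset(self):       # call at the start of every episode
        self.memory.clear()

    def bonus(self, embedding: np.ndarray, k: int = 10) -> float:
        # The embedding is assumed to come from an encoder trained to predict the
        # action between consecutive states, so only controllable features matter.
        if not self.memory:
            self.memory.append(embedding)
            return 1.0
        dists = sorted(float(np.linalg.norm(embedding - m)) for m in self.memory)[:k]
        self.memory.append(embedding)
        return float(np.mean(dists))   # far from everything seen this episode -> novel

def intrinsic_reward(episodic_bonus: float, lifelong_novelty: float,
                     max_scale: float = 5.0) -> float:
    # Lifelong novelty (e.g. a slowly-changing RND-style error) only scales the
    # episodic bonus, so revisiting a state within one episode is never rewarded
    # even if that state is globally rare.
    return episodic_bonus * float(np.clip(lifelong_novelty, 1.0, max_scale))
```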
Discovering Reinforcement Learning Algorithms, Junhyuk Oh, Matteo Hessel, et al., DeepMind [NeurIPS paper]
Use RL to discover an RL algorithm (e.g. something like PPO) automatically.
Recap of a simple RL algorithm:
1. Initialize parameters of the policy and of the value function.
2. While true:
   a. Run the policy in the episode and collect a trajectory.
   b. Update the value function $V(s_t)$ towards the discounted return $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ ("how good did the episode end").
   c. Update the policy by doing a gradient step on $\log \pi(a_t \mid s_t)\,(G_t - V(s_t))$, where $G_t - V(s_t)$ is the advantage of the taken action.
Why the advantage: if a crash was already forecast (the value function says the future is bad) and the car crashed anyway, the taken action is neither encouraged nor penalized; if the forecasted crash didn't happen and the future turned out ok, the action that avoided the crash gets a large positive advantage, awesome. [Figure: driving example, image source]
Keep vs. learn: the outer loop (run the policy, collect trajectories, take gradient steps) is kept; what the policy and its predictions are updated towards in steps (b) and (c) is learned by the meta-learner.
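A hedged PyTorch sketch of the recap above (REINFORCE with a value baseline). It assumes `policy(states)` returns a torch distribution, `value_fn` returns per-step values, and one optimizer covers both sets of parameters; this is not the LPG meta-learner itself, which replaces the hand-written targets below with learned ones:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + ... ("how good did the episode end")."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))

def update(policy, value_fn, optimizer, states, actions, rewards):
    returns = discounted_returns(rewards)
    values = value_fn(states).squeeze(-1)
    log_probs = policy(states).log_prob(actions)
    advantages = returns - values.detach()          # forecast vs. actual outcome
    policy_loss = -(log_probs * advantages).mean()  # step (c)
    value_loss = (values - returns).pow(2).mean()   # step (b)
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```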
Results:
1. Generalizes from grid worlds to Atari.
2. Came up with the idea of a value function (and how to learn it) on its own.
[Figures: training on gridworlds during meta-training; training on Atari during meta-test.]
How do we get RL to drive cars or control datacenter cooling systems? The old recipe is too dangerous and slow:
1) Try random stuff ("drive randomly")
2) Reinforce the stuff that worked better ("try not to repeat actions that led to crashes")
Instead, collect human data and try to learn from it.
[Figure: offline RL pipeline: data collection (an existing policy acts in the environment; states, rewards and actions are logged into a dataset), learning (from the dataset only), testing (the learned policy acts in the environment).]
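A schematic of the three phases in the diagram above, assuming a gymnasium-style environment and an `agent` object with illustrative `update_from_batch` / `act` methods (not from any particular library):

```python
import random

def data_collection(env, behavior_policy, num_episodes):
    """Phase 1: only log transitions; the learner never acts here."""
    dataset = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = behavior_policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            dataset.append((state, action, reward, next_state))
            state, done = next_state, terminated or truncated
    return dataset

def learning(dataset, agent, num_steps, batch_size=256):
    """Phase 2: the agent only ever sees the fixed dataset."""
    for _ in range(num_steps):
        agent.update_from_batch(random.sample(dataset, batch_size))
    return agent

def testing(env, agent, num_episodes):
    """Phase 3: deploy the learned policy and measure its returns."""
    returns = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        done, total = False, 0.0
        while not done:
            state, reward, terminated, truncated, _ = env.step(agent.act(state))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns
```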
1. No interaction with the environment during training (safer; no need for collecting billions of frames)
2. (But of course there is no need to apply offline RL to e.g. Atari)
3. Cheaper research (no need for 10k CPUs per run, since data is prerecorded)
4. Existing datasets and code examples to get started
5. Some unique challenges compared to classic RL
6. More low-hanging fruit :)
1. Behavior cloning (BC): train a neural net to predict actions from states via supervised learning.
   ◦ Only works when all data is of high quality, i.e. BC can't do better than the data.
2. Just apply classic RL -- will that work?
   ◦ Not well, because of overestimation of some actions: classic RL can randomly think that something not present in the data ("drive into a wall") is worth a try. Most offline RL methods thus focus on avoiding this overestimation.
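Minimal sketches of both options: plain behavior cloning as a supervised loss, plus an illustrative CQL-style regularizer showing one way offline RL methods fight overestimation of actions never seen in the data. `policy_net` and `q_net` are placeholder networks over discrete actions, not a specific library's API:

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(policy_net, states, expert_actions):
    """Option 1: plain supervised learning, predict the logged action in each state."""
    logits = policy_net(states)                              # (batch, num_actions)
    return F.cross_entropy(logits, expert_actions)

def conservative_q_penalty(q_net, states, dataset_actions):
    """Illustrative CQL-style term: push down Q-values of all actions while pushing
    up the Q-values of actions actually present in the dataset, so unseen actions
    ("drive into a wall") are not overestimated."""
    q_all = q_net(states)                                    # (batch, num_actions)
    q_data = q_all.gather(1, dataset_actions.unsqueeze(1))   # (batch, 1)
    return (torch.logsumexp(q_all, dim=1) - q_data.squeeze(1)).mean()
```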