Slide 1

Offline Reinforcement Learning
Sergey Levine, UC Berkeley

Slide 2

What makes modern machine learning work?

Slide 3

Machine learning is automated decision-making

If ML is ultimately always about making a decision, why don't we treat every machine learning problem like a reinforcement learning problem?

Typical supervised learning problems have assumptions that make them "easy":
• independent datapoints
• outputs don't influence future inputs
• ground truth labels are provided at training time

Decision-making problems often don't satisfy these assumptions:
• current actions influence future observations
• the goal is to maximize some utility (reward)
• optimal actions are not provided

These are not just issues for control: in many cases, real-world deployment of ML has these same feedback issues. Example: decisions made by a traffic prediction system might affect the routes that people take, which in turn changes traffic.

Slide 4

So why aren't we all using RL?

Reinforcement learning is two different things:
1. A framework for learning-based decision making (almost all real-world learning problems look like this)
2. Active, online learning algorithms for control (almost all real-world learning problems make it very difficult to do this)

(Figure: supervised prediction of an object label or a decision, contrasted with the online RL loop, where data collection and training are done many times.)

Slide 5

But it works in practice, right?

(Figure: examples of online RL in practice from Mnih et al. '13, Schulman et al. '14 & '15, and Levine*, Finn*, et al. '16, annotated "this is done many times" and "enormous gulf".)

Slide 6

Making RL look more like supervised learning

(Figure: contrasting on-policy RL, which repeatedly collects new data to make each decision, with offline reinforcement learning, which learns from a fixed dataset.)

Slide 7

The offline RL workflow

Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. '20

1. Collect a dataset using any policy or mixture of policies (this is only done once):
   • humans performing the task
   • an existing system performing the task
   • random behaviors
   • any combination of the above
2. Run offline RL on this dataset to learn a policy
   • think of it as the best policy we can get from the provided data
3. Deploy the policy in the real world
   • if we are not happy with how well it does, modify the algorithm and go back to step 2, reusing the same data!

Slide 8

Why is offline RL hard?

Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. '20

Fundamental problem: counterfactual queries. (Figure panels: the training data vs. what the policy wants to do.) Is that action good? Bad? How do we know if we didn't see it in the data?

Online RL algorithms don't have to handle this, because they can simply try the action and see what happens. Offline RL methods must somehow account for these unseen ("out-of-distribution") actions, ideally in a safe way, while still making use of generalization to come up with behaviors that are better than the best thing seen in the data! One way to state this formally is sketched below.
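As a hedged formal sketch (my notation, not taken from the slide): the dataset is generated by some behavior policy \pi_\beta, and the counterfactual query asks for the value of actions that \pi_\beta essentially never took.

    \mathcal{D} = \{(s_i, a_i, s_i', r_i)\}, \qquad a_i \sim \pi_\beta(\cdot \mid s_i)
    \text{counterfactual query: } \; Q(s, a) \;\text{ for }\; a \sim \pi(\cdot \mid s) \;\text{ with }\; \pi_\beta(a \mid s) \approx 0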

Slide 9

What do we expect offline RL methods to do?

Bad intuition: it's like imitation learning. Though offline RL can be shown to be provably better than imitation learning, even with optimal data, under some structural assumptions! See: Kumar, Hong, Singh, Levine. Should I Run Offline Reinforcement Learning or Behavioral Cloning?

Better intuition: get order from chaos.
• "Macro-scale" stitching: recombining parts of different trajectories in the dataset. But this is just the clearest example!
• "Micro-scale" stitching: if we have algorithms that properly perform dynamic programming, we can take this idea much further and get near-optimal policies from highly suboptimal data.

Slide 10

A vivid example

Singh, Yu, Yang, Zhang, Kumar, Levine. COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning. '20

(Figure panels: the training task, plus harder initial conditions such as a closed drawer or the drawer blocked by an object.)

Slide 11

Off-policy RL: a quick primer

Enforce this equation at all states! (A standard form of the equation is sketched below.)
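As a hedged reconstruction in standard Q-learning notation (the slide's own notation may differ in details), the condition being enforced is the Bellman optimality equation for the Q-function:

    Q^*(s, a) \;=\; r(s, a) \;+\; \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} \!\left[ \max_{a'} Q^*(s', a') \right]

In practice, deep off-policy methods minimize the squared difference between Q(s, a) and the bootstrapped target r(s, a) + \gamma \max_{a'} Q(s', a') over transitions sampled from a replay buffer or dataset, which is what allows them to reuse data collected by other policies.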

Slide 12

What's the problem?

Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS '19

(Plot: x-axis is amount of data on a log scale; the curve for "how well it thinks it does" (the Q-values) sits far above the curve for "how well it does" (actual performance): massive overestimation.)

Slide 13

Where do we suffer from distribution shift?

Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS '19

Distribution shift enters through the target value: the Q-function is trained on actions from the behavior policy, but the bootstrapped target evaluates it at actions chosen by the learned policy. This is why "how well it thinks it does" (the Q-values) diverges from "how well it does". A sketch of where this happens is given below.
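As a hedged sketch in standard actor-critic notation (not copied verbatim from the slide), the training objective and its bootstrapped target make the mismatch explicit:

    \min_Q \; \mathbb{E}_{(s, a, s') \sim \mathcal{D}} \big[ \left( Q(s, a) - y(s, a, s') \right)^2 \big],
    \qquad y(s, a, s') = r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi_{\mathrm{new}}(\cdot \mid s')} \big[ Q(s', a') \big]

The regression error is only small where the behavior policy \pi_\beta puts probability mass, but \pi_{\mathrm{new}} is chosen to maximize Q and therefore tends to pick exactly the out-of-distribution actions a' whose erroneously large values are never corrected, producing the overestimation seen on the previous slide.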

Slide 14

Conservative Q-learning

Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning. '20

CQL augments the regular objective with a term that pushes down big Q-values, so that the learned Q-function does not overshoot the true Q-function and "how well it thinks it does" (the Q-values) no longer exceeds "how well it does". A sketch of the resulting loss is shown below.
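For concreteness, here is a minimal PyTorch-style sketch of a discrete-action CQL-style loss. The names (q_net, target_q_net, the batch fields) and the exact form are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Sketch of a discrete-action conservative Q-learning loss.

    Assumes q_net(obs) returns Q-values of shape [batch, num_actions];
    target_q_net is a delayed copy of q_net used for the bootstrapped
    target; batch holds tensors "obs", "action", "reward", "next_obs",
    and "done" (done as a float mask).
    """
    q_values = q_net(batch["obs"])                                   # [B, A]
    q_taken = q_values.gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    # Regular objective: bootstrapped TD regression.
    with torch.no_grad():
        next_q = target_q_net(batch["next_obs"]).max(dim=1).values
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    td_loss = F.mse_loss(q_taken, target)

    # Conservative term: push down Q-values over all actions (logsumexp
    # acts as a soft maximum) while pushing up Q-values on dataset actions.
    conservative_term = torch.logsumexp(q_values, dim=1).mean() - q_taken.mean()

    return td_loss + alpha * conservative_term

Because the soft maximum concentrates on whichever actions currently look best, minimizing it pushes down precisely the overestimated out-of-distribution Q-values, while the second term keeps the values of in-distribution actions anchored to the data.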

Slide 15

How does CQL compare today?

The last two years (since CQL came out) have seen enormous growth in interest in offline RL, and CQL is quite widely used so far. Some excerpts from evaluations in other papers: one studies methodological problems in offline RL evaluation; another adds data augmentation, and the CQL results get even better.

Slide 16

The power of offline RL

Standard real-world RL process:
1. Instrument the task so that we can run RL
   • safety mechanisms
   • autonomous collection
   • rewards, resets, etc.
2. Wait a long time for online RL to run
3. Change the algorithm in some small way
4. Throw it all in the garbage and start over for the next task

Offline RL process:
1. Collect an initial dataset
   • human-provided
   • scripted controller
   • baseline policy
   • all of the above
2. Train a policy offline
3. Change the algorithm in some small way
4. Collect more data, add to the growing dataset
5. Keep the dataset and use it again for the next project!

Slide 17

Offline RL in robotic manipulation: MT-Opt, AMs

• 12 different tasks
• Thousands of objects
• Months of data collection

Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning of Vision-Based Robotic Manipulation Skills.
Kalashnikov, Varley, Chebotar, Swanson, Jonschkowski, Finn, Levine, Hausman. MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale. 2021.

New hypothesis: could we learn these tasks without rewards using goal-conditioned RL, reusing the same exact data?

Slide 18

Actionable Models: Offline RL with Goals

Chebotar, Hausman, Lu, Xiao, Kalashnikov, Varley, Irpan, Eysenbach, Julian, Finn, Levine. Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills. 2021.

• No reward function at all: the task is defined entirely using a goal image!
• Uses a conservative offline RL method designed for goal-reaching, based on CQL
• Works very well as an unsupervised pretraining objective:
  1. Train a goal-conditioned Q-function with offline RL
  2. Finetune with a task reward and limited data

Slide 19

More examples

Early 2020: Greg Kahn collects 40 hours of robot navigation data.
Kahn, Abbeel, Levine. BADGR: An Autonomous Self-Supervised Learning-Based Navigation System. 2020.

Late 2020: Dhruv Shah uses it to build a goal-conditioned navigation system.
Shah, Eysenbach, Kahn, Rhinehart, Levine. ViNG: Learning Open-World Navigation with Visual Goals. 2020.

Early 2021: Dhruv Shah uses the same dataset to train an exploration system.
Shah, Eysenbach, Rhinehart, Levine. RECON: Rapid Exploration for Open-World Navigation with Latent Goal Models. 2021.

Slide 20

Other topics in offline RL

Model-based offline RL: generally a great fit for offline RL! Similar principles apply: conservatism, handling OOD actions (and states).
• Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. NeurIPS 2021. ("a model-based version of CQL")
• Much more practical with huge models (e.g., Transformers)! Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. NeurIPS 2021. (Trajectory Transformer: "model-based offline RL with a Transformer", making accurate predictions to hundreds of steps)

Multi-task offline RL:
• Yu, Kumar, Chebotar, Hausman, Levine, Finn. Conservative Data Sharing for Multi-Task Offline Reinforcement Learning. NeurIPS 2021. ("share data between tasks to minimize distribution shift")

...and many more.

Slide 21

Takeaways, conclusions, future directions

(Figure: the three-step recipe, 1. collect a dataset using any policy or mixture of policies, 2. run offline RL on this dataset to learn a policy, 3. deploy the policy in the real world; annotated with "the gap" between current offline RL algorithms and "the dream".)

Open directions:
• An offline RL workflow: supervised learning has a clear train/test-split workflow; what is the offline RL equivalent?
• Statistical guarantees: the biggest challenge is distributional shift/counterfactuals; can we make any guarantees?
• Scalable methods and large-scale applications: dialogue systems, data-driven navigation and driving.

A starting point: Kumar, Singh, Tian, Finn, Levine. A Workflow for Offline Model-Free Robotic Reinforcement Learning. CoRL 2021.

Slide 22

RAIL: Robotic AI & Learning Lab
website: http://rail.eecs.berkeley.edu