Offline reinforcement learning

Reinforcement learning (RL) provides an algorithmic framework for rational sequential decision making. However, the kinds of problem domains where RL has been applied successfully seem to differ substantially from the settings where supervised machine learning has been successful. RL algorithms can learn to play Atari or board games, whereas supervised machine learning algorithms can make highly accurate predictions in complex open-world settings.
Virtually all the problems that we want to solve with machine learning are really decision making problems — deciding which product to show to a customer, deciding how to tag a photo, or deciding how to translate a string of text — so why aren't we solving them all with RL? One of the biggest issues with modern RL is that it does not effectively utilize the kinds of large and highly diverse datasets that have been instrumental to the success of supervised machine learning.
In this talk, I will discuss the technologies that can help us address this issue: enabling RL methods to use large datasets via offline RL. Offline RL algorithms can analyze large, previously collected datasets to extract the most effective policies, and then fine-tune these policies with additional online interaction as needed. I will cover the technical foundations of offline RL, discuss recent algorithm advances, and present several applications.

Anyscale

March 30, 2022

Transcript

  1. Machine learning is automated decision-making
    If ML is ultimately always about making a decision, why don't we treat every machine learning problem like a reinforcement learning problem? Typical supervised learning problems have assumptions that make them "easy":
    • independent datapoints
    • outputs don't influence future inputs
    • ground-truth labels are provided at training time
    Decision-making problems often don't satisfy these assumptions:
    • current actions influence future observations
    • the goal is to maximize some utility (reward)
    • optimal actions are not provided
    These are not just issues for control: in many cases, real-world deployment of ML has the same feedback issues. Example: decisions made by a traffic prediction system might affect the routes people take, which in turn changes traffic.
  2. So why aren't we all using RL?
    Reinforcement learning is two different things:
    1. a framework for learning-based decision making
    2. active, online learning algorithms for control
    Almost all real-world learning problems look like the first; almost all real-world learning problems make the second very difficult to do.
    [Figure: supervised prediction of an object label vs. a decision-making loop in which active data collection "is done many times"]
  3. But it works in practice, right?
    [Figure: successes such as Mnih et al. '13, Schulman et al. '14 & '15, and Levine*, Finn*, et al. '16, all relying on active collection that "is done many times"; an enormous gulf separates these settings from most real-world problems]
  4. The offline RL workflow
    Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. '20
    1. Collect a dataset using any policy or mixture of policies: humans performing the task, an existing system performing the task, random behaviors, or any combination of the above. This is only done once.
    2. Run offline RL on this dataset to learn a policy; think of it as the best policy we can get from the provided data.
    3. Deploy the policy in the real world. If we are not happy with how well it does, modify the algorithm and go back to step 2, reusing the same data!
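A minimal sketch of this loop, purely for illustration: every function below (collect_dataset, train_offline_rl, evaluate) is a hypothetical stand-in rather than any real library, and the "algorithm" is just a trivial value estimate. The point is the shape of the workflow: the dataset is collected once and reused every time the algorithm changes.

```python
import random

def collect_dataset(behavior_policy, n=1000):
    """Step 1: log (state, action, reward) tuples from ANY existing policy."""
    data = []
    for _ in range(n):
        s = random.random()
        a = behavior_policy(s)
        r = float(a == 1) * s              # toy reward standing in for the real task
        data.append((s, a, r))
    return data

def train_offline_rl(dataset, conservatism):
    """Step 2: stand-in for an offline RL algorithm (e.g., CQL).
    `conservatism` is a placeholder knob that this stub ignores."""
    returns = {0: [], 1: []}
    for s, a, r in dataset:
        returns[a].append(r)
    value = {a: sum(rs) / max(1, len(rs)) for a, rs in returns.items()}
    return lambda s: max(value, key=value.get)   # greedy w.r.t. estimated values

def evaluate(policy, n=100):
    """Step 3: deploy the policy; here just a toy score."""
    return sum(policy(random.random()) == 1 for _ in range(n)) / n

dataset = collect_dataset(behavior_policy=lambda s: random.randint(0, 1))  # done ONCE

for conservatism in (0.1, 1.0, 10.0):        # "modify the algorithm and go back to 2"
    policy = train_offline_rl(dataset, conservatism)
    print(conservatism, evaluate(policy))    # no new data collection needed
```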
  5. Why is offline RL hard?
    Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. '20
    Fundamental problem: counterfactual queries. The training data shows one thing; the policy wants to do something else. Is that good? Bad? How do we know, if we didn't see it in the data? Online RL algorithms don't have to handle this, because they can simply try the action and see what happens. Offline RL methods must somehow account for these unseen ("out-of-distribution") actions, ideally in a safe way, while still making use of generalization to come up with behaviors that are better than the best thing seen in the data!
  6. What do we expect offline RL methods to do?
    Bad intuition: it's like imitation learning. In fact, offline RL can be shown to be provably better than imitation learning even with optimal data, under some structural assumptions! See: Kumar, Hong, Singh, Levine. Should I Run Offline Reinforcement Learning or Behavioral Cloning?
    Better intuition: get order from chaos. "Macro-scale" stitching is just the clearest example. "Micro-scale" stitching: if we have algorithms that properly perform dynamic programming, we can take this idea much further and get near-optimal policies from highly suboptimal data.
  7. A vivid example
    Singh, Yu, Yang, Zhang, Kumar, Levine. COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning. '20
    [Figure panels: training task; closed drawer; blocked by drawer; blocked by object]
  8. What's the problem?
    Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS '19
    [Plot: amount of data (log scale) vs. how well the policy does and how well it thinks it does (Q-values); the Q-values show massive overestimation]
  9. Where do we suffer from distribution shift?
    Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS '19
    [Slide annotates the Q-learning update: the target value is evaluated on actions that may differ from the behavior policy; the plot again contrasts how well the policy does with how well it thinks it does (Q-values)]
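A toy illustration of the issue on these two slides (made-up numbers, not the paper's experiment): the bootstrapped target maximizes over all actions, including actions the behavior policy never took, so any spuriously high Q-value on an unseen action leaks straight into the target.

```python
import numpy as np

n_states, gamma, lr = 3, 0.99, 0.5
Q = np.zeros((n_states, 4))

# The behavior policy only ever took action 0, with reward 0, so the dataset
# contains no information about actions 1-3. The true value of action 0 is exactly 0.
dataset = [(s, 0, 0.0, (s + 1) % n_states) for s in range(n_states)]

# Pretend function-approximation error assigns the unseen actions high values.
# Nothing in the data ever corrects these entries.
Q[:, 1:] = 10.0

for _ in range(50):
    for s, a, r, s_next in dataset:
        target = r + gamma * Q[s_next].max()    # the max ranges over OOD actions too
        Q[s, a] += lr * (target - Q[s, a])

print(Q[:, 0])   # ~9.9 for every state, even though the true value is 0
```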
  10. Conservative Q-learning
    Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning. '20
    [Slide shows the CQL objective: the regular objective plus a term to push down big Q-values, annotated relative to the true Q-function; the plot contrasts how well the policy does with how well it thinks it does (Q-values)]
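A minimal sketch of what this looks like as a loss for discrete actions, assuming a PyTorch Q-network: the regular TD objective plus a term that pushes down large Q-values (via a log-sum-exp over all actions) while pushing the Q-values of dataset actions back up. The network sizes, alpha, and the random batch below are illustrative placeholders, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                        # (B, n_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for dataset actions
    with torch.no_grad():                                   # bootstrapped target
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    bellman = F.mse_loss(q_data, target)                    # "regular objective"
    # conservative term: push down big Q-values, push up Q-values on dataset actions
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return bellman + alpha * conservative

# toy usage on random data
q_net = nn.Linear(8, 4)
target_q_net = nn.Linear(8, 4)
batch = (torch.randn(32, 8), torch.randint(0, 4, (32,)),
         torch.randn(32), torch.randn(32, 8), torch.zeros(32))
loss = cql_loss(q_net, target_q_net, batch)
loss.backward()
```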
  11. How does CQL compare today?
    The last two years (since CQL came out) have seen enormous growth in interest in offline RL, and CQL is quite widely used so far! [Slide shows excerpts from evaluations in other papers: one studies methodological problems in offline RL evaluation; another adds data augmentation, and the CQL results get even better.]
  12. The power of offline RL
    Standard real-world RL process:
    1. instrument the task so that we can run RL (safety mechanisms, autonomous collection, rewards, resets, etc.)
    2. wait a long time for online RL to run
    3. change the algorithm in some small way
    4. throw it all in the garbage and start over for the next task
    Offline RL process:
    1. collect an initial dataset (human-provided, scripted controller, baseline policy, or all of the above)
    2. train a policy offline
    3. change the algorithm in some small way
    4. collect more data and add it to the growing dataset
    5. keep the dataset and use it again for the next project!
  13. Offline RL in robotic manipulation: MT-Opt, AMs
    • 12 different tasks
    • thousands of objects
    • months of data collection
    Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning of Vision-Based Robotic Manipulation Skills.
    Kalashnikov, Varley, Chebotar, Swanson, Jonschkowski, Finn, Levine, Hausman. MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale. 2021.
    New hypothesis: could we learn these tasks without rewards, using goal-conditioned RL and reusing the exact same data?
  14. Actionable Models: Offline RL with Goals
    Chebotar, Hausman, Lu, Xiao, Kalashnikov, Varley, Irpan, Eysenbach, Julian, Finn, Levine. Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills. 2021.
    • No reward function at all; the task is defined entirely by a goal image!
    • Uses a conservative offline RL method designed for goal-reaching, based on CQL
    • Works very well as an unsupervised pretraining objective!
    1. Train a goal-conditioned Q-function with offline RL
    2. Fine-tune with a task reward and limited data
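A rough sketch of the data-side idea only, under my own simplifying assumption of low-dimensional states rather than goal images: when there is no reward function, goal-conditioned training tuples can be created by hindsight relabeling, treating states actually reached later in a trajectory as goals. The real method additionally uses a conservative, CQL-style goal-conditioned critic; nothing below is its actual implementation.

```python
import random

def relabel_with_goals(trajectories, tuples_per_traj=10):
    """trajectories: list of [(state, action), ...] logged with no rewards at all."""
    relabeled = []
    for traj in trajectories:
        for _ in range(tuples_per_traj):
            t = random.randrange(len(traj) - 1)
            g_idx = random.randrange(t + 1, len(traj))   # a state actually reached later
            s, a = traj[t]
            s_next, _ = traj[t + 1]
            goal = traj[g_idx][0]
            reward = 1.0 if g_idx == t + 1 else 0.0      # sparse "goal reached" signal
            relabeled.append((s, a, goal, reward, s_next))
    return relabeled

# toy usage: 5 trajectories of 20 random 2-D states with scalar actions
trajs = [[([random.random(), random.random()], random.random()) for _ in range(20)]
         for _ in range(5)]
data = relabel_with_goals(trajs)   # feed this to a goal-conditioned offline RL learner
```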
  15. More examples
    Early 2020: Greg Kahn collects 40 hours of robot navigation data. Kahn, Abbeel, Levine. BADGR: An Autonomous Self-Supervised Learning-Based Navigation System. 2020.
    Late 2020: Dhruv Shah uses it to build a goal-conditioned navigation system. Shah, Eysenbach, Kahn, Rhinehart, Levine. ViNG: Learning Open-World Navigation with Visual Goals. 2020.
    Early 2021: Dhruv Shah uses the same dataset to train an exploration system. Shah, Eysenbach, Rhinehart, Levine. RECON: Rapid Exploration for Open-World Navigation with Latent Goal Models. 2021.
  16. Other topics in offline RL
    Model-based offline RL: generally a great fit for offline RL! Similar principles apply: conservatism, OOD actions (+ states). Much more practical with huge models (e.g., Transformers)!
    • Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. NeurIPS 2021. ("model-based version of CQL")
    • Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. NeurIPS 2021. ("model-based offline RL with a Transformer"; the Trajectory Transformer makes accurate predictions out to hundreds of steps)
    Multi-task offline RL:
    • Yu, Kumar, Chebotar, Hausman, Levine, Finn. Conservative Data Sharing for Multi-Task Offline Reinforcement Learning. NeurIPS 2021. ("share data between tasks to minimize distribution shift")
    ...and many more.
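A hedged sketch of the general recipe that conservative model-based methods such as COMBO build on; all names and the toy model below are placeholders of mine, not the actual algorithm. The recipe: fit a dynamics model to the logged data, branch short imagined rollouts from dataset states, and train a conservative critic on the mixture of real and model-generated transitions, treating model samples roughly the way CQL treats out-of-distribution actions.

```python
import random

def short_model_rollouts(dataset, model, policy, horizon=5, n=100):
    """Branch short imagined rollouts from states that appear in the data."""
    synthetic = []
    for _ in range(n):
        s, *_ = random.choice(dataset)      # start from a real, in-distribution state
        for _ in range(horizon):            # keep rollouts SHORT to limit model error
            a = policy(s)
            s_next, r = model(s, a)
            synthetic.append((s, a, r, s_next, True))   # True = model-generated
            s = s_next
    return synthetic

# toy stand-ins for the logged data, the learned model, and the current policy
dataset = [(random.random(), 0, 0.0, random.random(), False) for _ in range(50)]
model = lambda s, a: (min(1.0, s + 0.01 * a), -abs(s - 0.5))
policy = lambda s: random.choice([-1, 1])

mixed_batch = dataset + short_model_rollouts(dataset, model, policy)
# a conservative critic update would then push down Q-values on the transitions
# flagged as model-generated (the final True entries above)
```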
  17. Takeaways, conclusions, future directions
    "The dream": 1. collect a dataset using any policy or mixture of policies; 2. run offline RL on this dataset to learn a policy; 3. deploy the policy in the real world. "The gap" is what still separates current offline RL algorithms from this dream.
    • An offline RL workflow. The supervised learning workflow has a train/test split; what is the equivalent offline RL workflow? A starting point: Kumar, Singh, Tian, Finn, Levine. A Workflow for Offline Model-Free Robotic Reinforcement Learning. CoRL 2021.
    • Statistical guarantees. The biggest challenge is distributional shift/counterfactuals; can we make any guarantees?
    • Scalable methods and large-scale applications: dialogue systems, data-driven navigation and driving.