
Offline reinforcement learning

Anyscale
March 30, 2022

Reinforcement learning (RL) provides an algorithmic framework for rational sequential decision making. However, the kinds of problem domains where RL has been applied successfully seem to differ substantially from the settings where supervised machine learning has been successful. RL algorithms can learn to play Atari or board games, whereas supervised machine learning algorithms can make highly accurate predictions in complex open-world settings.
Virtually all the problems that we want to solve with machine learning are really decision making problems — deciding which product to show to a customer, deciding how to tag a photo, or deciding how to translate a string of text — so why aren't we solving them all with RL? One of the biggest issues with modern RL is that it does not effectively utilize the kinds of large and highly diverse datasets that have been instrumental to the success of supervised machine learning.
In this talk, I will discuss the technologies that can help us address this issue: enabling RL methods to use large datasets via offline RL. Offline RL algorithms can analyze large, previously collected datasets to extract the most effective policies, and then fine-tune these policies with additional online interaction as needed. I will cover the technical foundations of offline RL, discuss recent algorithm advances, and present several applications.

Transcript

  1. Machine learning is automated decision-making
    If ultimately ML is always about making a decision, why don’t we treat every machine learning problem like a reinforcement learning problem? Typical supervised learning problems have assumptions that make them “easy”:
    - independent datapoints
    - outputs don’t influence future inputs
    - ground-truth labels are provided at training time
    Decision-making problems often don’t satisfy these assumptions:
    - current actions influence future observations
    - the goal is to maximize some utility (reward)
    - optimal actions are not provided
    These are not just issues for control: in many cases, real-world deployment of ML has these same feedback issues. Example: decisions made by a traffic prediction system might affect the route that people take, which changes traffic.
  2. So why aren’t we all using RL?
    Reinforcement learning is two different things:
    1. A framework for learning-based decision making (the output is a decision rather than, say, an object label): almost all real-world learning problems look like this.
    2. Active, online learning algorithms for control, where the collect-data-then-update loop is done many times: almost all real-world learning problems make it very difficult to do this.
  3. But it works in practice, right?
    The well-known successes (Mnih et al. ‘13; Schulman et al. ’14 & ‘15; Levine*, Finn*, et al. ‘16) all rely on online data collection that is done many times, and there is an enormous gulf between those settings and most real-world problems.
  4. The offline RL workflow
    Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ‘20
    1. Collect a dataset using any policy or mixture of policies (this is only done once):
       - humans performing the task
       - an existing system performing the task
       - random behaviors
       - any combination of the above
    2. Run offline RL on this dataset to learn a policy: think of it as the best policy we can get from the provided data.
    3. Deploy the policy in the real world. If we are not happy with how well it does, modify the algorithm and go back to step 2, reusing the same data! A minimal sketch of this loop follows this slide.
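
A hedged sketch of this workflow in Python (not code from the talk): the dataset format and the helpers collect_dataset, train_policy, and offline_score are hypothetical placeholders. The point is that data is collected once and then reused every time the algorithm or its hyperparameters change.

```python
import random

def collect_dataset(num_transitions=1000):
    """Step 1 (done once): log (s, a, r, s_next, done) tuples from humans, an
    existing controller, scripted behaviors, or any mixture of policies."""
    return [(random.random(), random.randint(0, 1), random.random(),
             random.random(), random.random() < 0.05)
            for _ in range(num_transitions)]

def train_policy(dataset, conservatism_alpha):
    """Step 2: run an offline RL algorithm (e.g., a CQL-style method) on the
    fixed dataset. A trivial placeholder 'policy' is returned here."""
    return {"alpha": conservatism_alpha, "act": lambda s: 1 if s > 0.5 else 0}

def offline_score(policy, dataset):
    """Placeholder for offline policy evaluation on held-out data."""
    return random.random()

dataset = collect_dataset()                    # collected once, kept forever
candidates = [train_policy(dataset, alpha)     # iterate on the algorithm...
              for alpha in (0.1, 1.0, 10.0)]   # ...reusing the same data
best = max(candidates, key=lambda p: offline_score(p, dataset))
# Step 3: deploy `best` in the real world; if unsatisfied, modify the
# algorithm and return to step 2 with the same dataset.
```
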
  5. Why is offline RL hard?
    Levine, Kumar, Tucker, Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ‘20
    Fundamental problem: counterfactual queries. The policy wants to take actions that differ from what appears in the training data. Are they good? Bad? How do we know if we didn’t see them in the data? Online RL algorithms don’t have to handle this, because they can simply try the action and see what happens. Offline RL methods must somehow account for these unseen (“out-of-distribution”) actions, ideally in a safe way, while still making use of generalization to come up with behaviors that are better than the best thing seen in the data! The snippet after this slide illustrates such a query.
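
A minimal numpy illustration of the counterfactual query (the action space and numbers are invented for illustration): the Bellman backup's maximum ranges over every action, including ones the dataset never contains, so the target can be inflated by an unchecked extrapolation of the Q-function.

```python
import numpy as np

# Learned Q(s', .) for three actions; action 2 never appears in the dataset,
# so its value (9.7) is pure extrapolation that nothing in the data checks.
q_next = np.array([0.2, 0.5, 9.7])
actions_in_data = [0, 1]

r, gamma = 1.0, 0.99
naive_target = r + gamma * q_next.max()                        # trusts the OOD action
in_support_target = r + gamma * q_next[actions_in_data].max()  # restricted to data support

print(f"naive target:      {naive_target:.2f}")   # inflated by the unseen action
print(f"in-support target: {in_support_target:.2f}")
```
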
  6. What do we expect offline RL methods to do?
    Bad intuition: it’s like imitation learning. In fact, offline RL can be shown to be provably better than imitation learning even with optimal data, under some structural assumptions! See: Kumar, Hong, Singh, Levine. Should I Run Offline Reinforcement Learning or Behavioral Cloning?
    Better intuition: get order from chaos. “Macro-scale” stitching (recombining pieces of different trajectories) is just the clearest example. “Micro-scale” stitching: if we have algorithms that properly perform dynamic programming, we can take this idea much further and get near-optimal policies from highly suboptimal data. A tiny tabular example follows this slide.
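
A tiny, self-contained tabular sketch of stitching (the MDP, dataset, and hyperparameters are invented for illustration): neither logged trajectory goes all the way from the start state A to the goal C, but dynamic programming over the pooled transitions recovers a policy that does.

```python
import numpy as np

states = {"A": 0, "B": 1, "C": 2}   # C is the goal (terminal)
n_states, n_actions = 3, 2          # action 0 = "stay", action 1 = "move right"

# Offline dataset of (s, a, r, s_next, done) tuples from two suboptimal trajectories.
dataset = [
    (states["A"], 1, 0.0, states["B"], False),  # trajectory 1: A -> B, no reward seen
    (states["B"], 0, 0.0, states["B"], False),  # trajectory 1: wanders around B
    (states["B"], 1, 1.0, states["C"], True),   # trajectory 2: B -> C, reward 1
]

gamma, lr = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

# Fitted tabular Q-iteration on the fixed dataset (no new environment interaction).
for _ in range(200):
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

# The greedy policy "stitches" the two trajectories: move right in A and in B.
# (C is terminal, so its entry is meaningless.)
print({name: int(Q[idx].argmax()) for name, idx in states.items()})
```
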
  7. A vivid example
    Singh, Yu, Yang, Zhang, Kumar, Levine. COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning. ‘20
    [Figure panels: the training task, plus harder initial conditions where the target is blocked by an object, blocked by the drawer, or the drawer starts closed.]
  8. What’s the problem?
    Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS ‘19
    [Plot: actual return (“how well it does”) vs. estimated Q-values (“how well it thinks it does”) as a function of the amount of data (log scale), showing massive overestimation.]
  9. Where do we suffer from distribution shift?
    Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS ‘19
    In the target value: the Bellman backup evaluates the Q-function at actions chosen by the learned policy, but the Q-function was only trained on actions from the behavior policy that collected the data. This is exactly where “how well it thinks it does” (the Q-values) comes apart from “how well it does”; the formula below makes the mismatch explicit.
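
As a reference in standard notation (not transcribed from the slide), the target value queries the learned Q-function at actions drawn from the new policy, while the Q-function was only fit on actions from the behavior policy that produced the dataset:

```latex
% Bellman backup target used during training: a' comes from the learned policy \pi
% (e.g., the argmax), but Q was trained only on actions sampled from the behavior
% policy \pi_\beta that collected the dataset D.
y(s, a) = r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\!\left[ Q(s', a') \right],
\qquad \pi \neq \pi_\beta \;\Rightarrow\; \text{out-of-distribution queries of } Q .
```
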
  10. Conservative Q-learning
    Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning. ‘20
    Keep the regular objective and add a term to push down big Q-values, so that the learned Q-values stay at or below the true Q-function and “how well it thinks it does” no longer runs far ahead of “how well it does.” A sketch of such a regularizer follows this slide.
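
A hedged PyTorch sketch of a CQL-style loss for discrete actions, following the published CQL(H) form rather than anything shown verbatim on this slide; the tensor names and the alpha value are illustrative.

```python
import torch

def cql_loss(q_values, actions, bellman_targets, alpha=1.0):
    """q_values: [batch, n_actions] current Q(s, .); actions: [batch] dataset actions;
    bellman_targets: [batch] standard TD targets from the target network."""
    q_data = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) on data
    # Push down a soft maximum of Q over all actions (what the policy could exploit)
    # while pushing up Q on the actions actually present in the dataset.
    conservative_penalty = torch.logsumexp(q_values, dim=1) - q_data
    # Regular objective: standard Bellman (TD) error.
    bellman_error = (q_data - bellman_targets) ** 2
    return (alpha * conservative_penalty + 0.5 * bellman_error).mean()

# Example usage with random tensors standing in for a minibatch.
q = torch.randn(32, 4)
a = torch.randint(0, 4, (32,))
y = torch.randn(32)
loss = cql_loss(q, a, y)
```
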
  11. How does CQL compare today?
    The last two years (since CQL came out) have seen enormous growth in interest in offline RL, and CQL is quite widely used so far! Some excerpts from evaluations in other papers: one paper studies methodological problems in offline RL evaluation; another adds data augmentation, and the CQL results get even better.
  12. The power of offline RL
    Standard real-world RL process:
    1. Instrument the task so that we can run RL (safety mechanisms, autonomous collection, rewards, resets, etc.)
    2. Wait a long time for online RL to run
    3. Change the algorithm in some small way
    4. Throw it all in the garbage and start over for the next task
    Offline RL process:
    1. Collect an initial dataset (human-provided, scripted controller, baseline policy, or all of the above)
    2. Train a policy offline
    3. Change the algorithm in some small way
    4. Collect more data and add it to the growing dataset
    5. Keep the dataset and use it again for the next project!
  13. Offline RL in robotic manipulation: MT-Opt, AMs
    - 12 different tasks
    - Thousands of objects
    - Months of data collection
    Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning of Vision-Based Robotic Manipulation Skills.
    Kalashnikov, Varley, Chebotar, Swanson, Jonschkowski, Finn, Levine, Hausman. MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale. 2021.
    New hypothesis: could we learn these tasks without rewards, using goal-conditioned RL and reusing the same exact data?
  14. Actionable Models: Offline RL with Goals
    Chebotar, Hausman, Lu, Xiao, Kalashnikov, Varley, Irpan, Eysenbach, Julian, Finn, Levine. Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills. 2021.
    - No reward function at all; the task is defined entirely by a goal image!
    - Uses a conservative offline RL method designed for goal-reaching, based on CQL
    - Works very well as an unsupervised pretraining objective: 1. train a goal-conditioned Q-function with offline RL; 2. finetune with a task reward and limited data
    A generic goal-relabeling sketch follows this slide.
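
A generic hindsight goal-relabeling sketch, an assumed and simplified stand-in for how reward-free trajectories can supervise a goal-conditioned Q-function; it is not the exact procedure from the Actionable Models paper.

```python
import random

def relabel_trajectory(trajectory):
    """trajectory: list of (observation, action) pairs logged without any reward.
    Returns (observation, action, goal, reward) tuples where goals are future
    observations from the same trajectory (e.g., goal images)."""
    relabeled = []
    for t, (obs, act) in enumerate(trajectory[:-1]):
        goal_idx = random.randint(t + 1, len(trajectory) - 1)  # pick a future state
        goal = trajectory[goal_idx][0]
        reached = (goal_idx == t + 1)                          # sparse "goal reached" signal
        relabeled.append((obs, act, goal, 1.0 if reached else 0.0))
    return relabeled

# Example: a four-step trajectory of (observation, action) pairs with no rewards.
print(relabel_trajectory([("o0", "a0"), ("o1", "a1"), ("o2", "a2"), ("o3", None)]))
```
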
  15. More examples
    Kahn, Abbeel, Levine. BADGR: An Autonomous Self-Supervised Learning-Based Navigation System. 2020.
    Early 2020: Greg Kahn collects 40 hours of robot navigation data.
    Shah, Eysenbach, Kahn, Rhinehart, Levine. ViNG: Learning Open-World Navigation with Visual Goals. 2020.
    Late 2020: Dhruv Shah uses it to build a goal-conditioned navigation system.
    Early 2021: Dhruv Shah uses the same dataset to train an exploration system.
    Shah, Eysenbach, Rhinehart, Levine. RECON: Rapid Exploration for Open-World Navigation with Latent Goal Models. 2021.
  16. Other topics in offline RL
    Model-based offline RL: generally a great fit for offline RL! Similar principles apply: conservatism, OOD actions (and states).
    - Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. NeurIPS 2021. (“model-based version of CQL”)
    - Much more practical with huge models (e.g., Transformers)! Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. NeurIPS 2021. (“model-based offline RL with a Transformer”; the Trajectory Transformer makes accurate predictions out to hundreds of steps)
    Multi-task offline RL:
    - Yu, Kumar, Chebotar, Hausman, Levine, Finn. Conservative Data Sharing for Multi-Task Offline Reinforcement Learning. NeurIPS 2021. (“share data between tasks to minimize distribution shift”)
    ...and many more.
  17. Takeaways, conclusions, future directions
    “The dream”: 1. collect a dataset using any policy or mixture of policies; 2. run offline RL on this dataset to learn a policy; 3. deploy the policy in the real world. “The gap” is what still separates current offline RL algorithms from this dream. Open directions:
    - An offline RL workflow: supervised learning has the train/test split; what is the equivalent for offline RL? A starting point: Kumar, Singh, Tian, Finn, Levine. A Workflow for Offline Model-Free Robotic Reinforcement Learning. CoRL 2021.
    - Statistical guarantees: the biggest challenge is distributional shift/counterfactuals; can we make any guarantees?
    - Scalable methods and large-scale applications: dialogue systems, data-driven navigation and driving.