
Offline Reinforcement Learning with RLlib (Edi Palencia)


Reinforcement Learning is a fast-growing field that is starting to make an impact across different engineering areas. However, Reinforcement Learning is typically framed as an Online Learning approach where an Environment (simulated or real) is required during the learning process. The need for an environment is typically a constraint that prevents the application of RL techniques in fields where building a simulator is very hard or infeasible (e.g., health, NLP, etc.).
In this talk, we will show how to apply Reinforcement Learning to AI/ML problems where the only available resource is a Dataset, i.e., a recording of the interactions of an Agent in an Environment. To this end, we will show how RLlib can be used to train an Agent using only previously collected Data (Offline Data).


Anyscale

July 13, 2021


Transcript

  1. Offline RL with RLlib Edilmo Palencia Microsoft Principal AI Engineer

  2. Agenda

     1. Who we are and how we use Ray & RLlib
     2. What Offline RL is and why we need it
     3. How we leverage RLlib to handle our Offline RL needs
     4. How our users benefit from Offline RL
     5. What is challenging about Offline RL?
     6. What is next?
  3. Who are we? Project Bonsai is a Low Code AI

    Development Platform that speeds the creation of AI-powered automation. Without requiring data scientists, engineers can build specific AI components that provide operator guidance or directly make decisions.
  4. Help Industry to move from automation to autonomy

  5. Help Industry to move from automation to autonomy

  6. Building Brains for Autonomous Systems. Value Proposition: Machine

     Teaching, a way to transfer knowledge from humans to the underlying Machine Learning algorithm, which combines AI with traditional optimization and control.
  7. Building Brains for Autonomous Systems: Machine Teaching Tool Chain

     1. Machine Teaching injects subject matter expertise into brain training
     2. Simulation tools for accelerated integration and scale of training
     3. AI Engine automates the generation and management of neural networks and DRL algorithms
     4. Flexible runtime to deploy and scale models in the real world
  8. Reinforcement Learning is Key. And it's hard for now, but

     it will not be forever. (Slide labels: Value Proposition, Machine Teaching, Open Source, Today, Tomorrow, Autonomous Systems, RL)
  9. How do we use RLlib? RLlib is our bet for

     Reinforcement Learning. Comprehensive: Algorithms, Execution plans (former optimizers), Metrics, Models, Pre-processing, Envs/Sims. Extensible. Flexible. It gives us New Algorithms, Distributed Training, Faster prototyping and experimentation, Multiple HW support.
  10. Distributed training is a need, and not only because of Reinforcement

      Learning. Machine Teaching requirements: Simulation Dynamism, Just-In-Time.
  11. How do we use Ray? Ray is our framework for

      Distributed Training, and not only because of RLlib. Micro-Service Architecture: an SDK in front of Machine Teaching Services and an API on Kubernetes; a Machine Teaching Engine (Inkling Compiler, AI Engine [custom], RLlib) running on Ray on Kubernetes. Ray is used only for training (AI focus).
  12. We have a challenge!!! Simulators are hard to implement and

      sometimes very slow; sometimes only data is available. The same holds for Reinforcement Learning.
  13. Why do we need Offline RL? Our users must be

      able to train directly from data, without simulators.
      Online Training (Simulators, RL): Design a Simulator → Create Sim Package in Bonsai → Train a Brain with Simulators.
      Offline Training (No Simulators, Offline RL): Collect Data → Create Dataset in Bonsai → Train a Brain directly from Data.
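The two training flows above map onto RLlib configuration in a simple way. A minimal sketch (the environment name and dataset path are hypothetical) of how the same config switches between online training against a simulator and offline training from recorded data, using RLlib's standard `env`, `input`, `input_evaluation`, and `explore` options:

```python
# Online: RLlib steps a registered simulator/environment itself.
online_config = {
    "env": "Moab-v0",  # hypothetical registered simulator name
    "framework": "tf",
}

# Offline: no environment; RLlib reads recorded experiences from files.
offline_config = {
    "env": None,
    "input": "/tmp/moab-dataset",        # hypothetical dataset directory
    "input_evaluation": ["is", "wis"],   # importance-sampling OPE estimators
    "explore": False,                    # no exploration without a simulator
    "framework": "tf",
}
```

The key point is that only the data-source options change; the rest of the training setup stays the same.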
  14. What is Offline RL? More than RL without simulators; even

      more than training from data. (Diagram: Offline RL inside Machine Learning.)
  15. Comparing Offline RL and Online RL. Online could be on-policy

      or off-policy. Online RL: an agent acting in the world. Offline RL: several recordings of agents acting.
  16. Comparing Offline RL and Machine Learning. Both benefit from large,

      diverse and previously collected datasets. Machine Learning (Unsupervised Learning, Supervised Learning): Recognition, Passive. Offline RL: Decision, Active, learned from several recordings of agents acting in the world.
  17. How did we leverage RLlib to handle our Offline RL

      needs? Implement a new Algorithm. Reuse existing RLlib components. Extend the RLlib CLI.
  18. The algorithm selected was CQL: a TensorFlow implementation over SAC and

      DQN. Other algorithms: simple, but limited to simple environments and low performance. CQL: combines decisions from sub-optimal episodes, improves over the behavioral policy, and learns lower-bounded Q-Values.
  19. CQL: Reuse existing RLlib components. SAC and DQN were extended:

      a slight modification of SAC and DQN to make them reusable; CQL losses built over the SAC and DQN losses; new agents CQL-SAC & CQL-DQN.
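The CQL losses mentioned above add a conservative regularizer on top of the base SAC/DQN Q-loss. A toy NumPy sketch of that term (an illustration of the idea, not the RLlib implementation): alpha times the log-sum-exp of Q over all actions minus Q of the action in the dataset, which pushes down Q-values of unseen actions and yields lower-bounded Q-Values:

```python
import numpy as np

def cql_regularizer(q_all_actions, q_data_action, alpha=1.0):
    # Conservative term added to the base SAC/DQN Q-loss:
    #   alpha * mean( logsumexp_a Q(s, a) - Q(s, a_data) )
    # Penalizing actions outside the dataset lower-bounds the learned Q-values.
    logsumexp = np.log(np.sum(np.exp(q_all_actions), axis=-1))
    return alpha * float(np.mean(logsumexp - q_data_action))

# Toy batch: 2 states, 3 discrete actions each (CQL-DQN style).
q_all = np.array([[1.0, 2.0, 0.5],
                  [0.0, 1.0, 3.0]])
q_data = np.array([2.0, 3.0])  # Q-values of the actions taken in the dataset
reg = cql_regularizer(q_all, q_data)  # added to the usual TD loss
```

Since log-sum-exp is always at least the maximum Q-value, the term is non-negative and grows as the policy's Q-values drift away from the dataset's actions.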
  20. Extend RLlib CLI: a new Dataset command.

      rllib dataset 250 --steps 1000000 -f /Users/edilmo/tests/rllib/moab-expert-ppo.yaml

      - Training online with any simulator to collect data and generate datasets
      - Any algorithm could be used, like PPO, regardless of which offline algorithm will be used later
      - Five datasets generated in multiple formats:
        - Expert: trained up to Max Reward
        - Medium: trained up to half of Max Reward
        - Random: first policy initialized
        - MediumExpert: half episodes from Expert and half from Medium
        - MediumRandom: half episodes from Medium and half from Random
      - Use these datasets to test any Offline RL algorithm under different behavioral policies with different levels of optimality
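Generating such datasets relies on RLlib's standard experience-recording options. A hypothetical sketch (environment name and paths are made up) of the config an expert-style collection run could use:

```python
# Train any online algorithm (e.g. PPO) with RLlib's "output" option set,
# which records all sampled experiences to JSON files for later offline use.
expert_config = {
    "env": "Moab-v0",              # hypothetical simulator name
    "output": "/tmp/moab-expert",  # experiences are written here
    "output_max_file_size": 64 * 1024 * 1024,  # rotate files at 64 MiB
}

# Mixed datasets (MediumExpert, MediumRandom) can then be assembled by
# combining half of the episode files from each source policy's output.
```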
  21. How will our users benefit from Offline RL? Let's see

      a preview comparing Offline and Online Training. (Chart: Offline RL vs. Online RL.)
  22. What is challenging about Offline RL? With the techniques and

      the lack of simulators: Distributional Shift, No exploration, Evaluation. CQL deals with action distributional shift, because it happens at training time, but it does not save us from state distributional shift, because that happens at evaluation time. Results will only be as good as the dataset and the behavioral policies present in it allow. FQE or model-based OPE is our recommendation.
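As a point of comparison for the recommended FQE or model-based OPE, the simplest off-policy evaluation baseline is ordinary importance sampling over logged episodes. A small NumPy sketch with toy data (an illustration, not Bonsai's implementation):

```python
import numpy as np

def importance_sampling_ope(returns, target_probs, behavior_probs):
    # Ordinary importance-sampling estimate of the target policy's value:
    # weight each logged episode's return by the product over its steps of
    # pi(a|s) / beta(a|s), then average across episodes.
    estimates = []
    for ret, pi, beta in zip(returns, target_probs, behavior_probs):
        weight = np.prod(np.asarray(pi) / np.asarray(beta))
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy example: two logged episodes with per-step action probabilities.
returns = [10.0, 4.0]
pi = [[0.9, 0.8], [0.5]]    # target policy probabilities
beta = [[0.6, 0.8], [0.5]]  # behavioral policy probabilities
v_hat = importance_sampling_ope(returns, pi, beta)  # 1.5*10 and 1.0*4 -> 9.5
```

The per-episode weights make this estimator unbiased but high-variance on long episodes, which is one reason FQE or model-based OPE is preferred in practice.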
  23. What is next? With Ray & RLlib and Offline RL:

      Ray: Placement Groups. RLlib: Offline Model-Based. Offline RL: Hybrid Modes.
  24. Thank you!