
Offline Reinforcement Learning with RLlib (Edi Palencia)


Reinforcement Learning is a fast-growing field that is starting to make an impact across different engineering areas. However, Reinforcement Learning is typically framed as an Online Learning approach where an Environment (simulated or real) is required during the learning process. The need for an environment is typically a constraint that prevents the application of RL techniques in fields where building a simulator is very hard or infeasible (e.g., health, NLP, etc.).
In this talk, we will show how to apply Reinforcement Learning to AI/ML problems where the only available resource is a Dataset, i.e., a recording of the interactions of an Agent in an Environment. To this end, we will show how RLlib can be used to train an Agent using only previously collected Data (Offline Data).


Anyscale

July 13, 2021


Transcript

  1. Offline RL with RLlib Edilmo Palencia Microsoft Principal AI Engineer

  2. Agenda

     1. Who we are and how we use Ray & RLlib
     2. What Offline RL is and why we need it
     3. How we leverage RLlib to handle our Offline RL needs
     4. How our users benefit from Offline RL
     5. What is challenging about Offline RL?
     6. What is next?
  3. Who are we? Project Bonsai is a Low Code AI

    Development Platform that speeds the creation of AI-powered automation. Without requiring data scientists, engineers can build specific AI components that provide operator guidance or directly make decisions.
  4. Help Industry to move from automation to autonomy

  5. Help Industry to move from automation to autonomy

  6. Building Brains for Autonomous Systems. Value Proposition: Machine

     Teaching, a way to transfer knowledge from humans to the underlying Machine Learning algorithm, which combines AI with traditional optimization and control.
  7. Building Brains for Autonomous Systems: Machine Teaching Tool Chain

     1. Machine Teaching injects subject matter expertise into brain training
     2. Simulation tools for accelerated integration and scale of training
     3. AI Engine automates the generation and management of neural networks and DRL algorithms
     4. Flexible runtime to deploy and scale models in the real world
  8. Reinforcement Learning is Key. And it's hard for now, but

     it will not be forever. (Slide labels: Value Proposition, Machine Teaching, Open Source, Today, Tomorrow, Autonomous Systems, RL)
  9. How do we use RLlib? RLlib is our bet for

     Reinforcement Learning. Comprehensive: Algorithms, Execution plans (former optimizers), Metrics, Models, Pre-processing, Envs/Sims. Extensible. Flexible. It gives us New Algorithms, Distributed Training, Faster prototyping and experimentation, Multiple HW support.
  10. Distributed training is a need, and not only because of Reinforcement

      Learning. Machine Teaching requirements: Simulation Dynamism, Just-In-Time.
  11. How do we use Ray? Ray is our framework for

      Distributed Training, and not only because of RLlib. Micro-Service Architecture: an SDK in front of Machine Teaching Services and an API on Kubernetes; a Machine Teaching Engine (Inkling Compiler, AI Engine [custom], RLlib) running on Ray on Kubernetes. Ray is used only for training (AI focus).
  12. We have a challenge!!! Simulators are hard to implement and

      sometimes very slow; sometimes only data is available. The same holds for Reinforcement Learning.
  13. Why do we need Offline RL? Our users must be

      able to train directly from data, without simulators.
      Online Training (Simulators, RL): Design a Simulator → Create Sim Package in Bonsai → Train a Brain with Simulators.
      Offline Training (No Simulators, Offline RL): Collect Data → Create Dataset in Bonsai → Train a Brain directly from Data.
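The two training flows above map onto RLlib configuration in a simple way. A minimal sketch (the environment name and dataset path are hypothetical) of how the same config switches between online training against a simulator and offline training from recorded data, using RLlib's standard `env`, `input`, `input_evaluation`, and `explore` options:

```python
# Online: RLlib steps a registered simulator/environment itself.
online_config = {
    "env": "Moab-v0",  # hypothetical registered simulator name
    "framework": "tf",
}

# Offline: no environment; RLlib reads recorded experiences from files.
offline_config = {
    "env": None,
    "input": "/tmp/moab-dataset",        # hypothetical dataset directory
    "input_evaluation": ["is", "wis"],   # importance-sampling OPE estimators
    "explore": False,                    # no exploration without a simulator
    "framework": "tf",
}
```

The key point is that only the data-source options change; the rest of the training setup stays the same.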
  14. What is Offline RL? More than RL without simulators; even

      more than training from data. (Diagram: Offline RL inside Machine Learning.)
  15. Comparing Offline RL and Online RL. Online could be on-policy

      or off-policy. Online RL: an agent acting in the world. Offline RL: several recordings of agents acting.
  16. Comparing Offline RL and Machine Learning. Both benefit from large,

      diverse and previously collected datasets. Machine Learning (Unsupervised Learning, Supervised Learning): Recognition, Passive. Offline RL: Decision, Active, learned from several recordings of agents acting in the world.
  17. How did we leverage RLlib to handle our Offline RL

      needs? Implement a new Algorithm. Reuse existing RLlib components. Extend the RLlib CLI.
  18. The algorithm selected was CQL: a TensorFlow implementation over SAC and

      DQN. Other algorithms: simple, but limited to simple environments and low performance. CQL: combines decisions from sub-optimal episodes, improves over the behavioral policy, and learns lower-bounded Q-Values.
  19. CQL: Reuse existing RLlib components. SAC and DQN were extended:

      a slight modification of SAC and DQN to make them reusable; CQL losses built over the SAC and DQN losses; new agents CQL-SAC & CQL-DQN.
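The CQL losses mentioned above add a conservative regularizer on top of the base SAC/DQN Q-loss. A toy NumPy sketch of that term (an illustration of the idea, not the RLlib implementation): alpha times the log-sum-exp of Q over all actions minus Q of the action in the dataset, which pushes down Q-values of unseen actions and yields lower-bounded Q-Values:

```python
import numpy as np

def cql_regularizer(q_all_actions, q_data_action, alpha=1.0):
    # Conservative term added to the base SAC/DQN Q-loss:
    #   alpha * mean( logsumexp_a Q(s, a) - Q(s, a_data) )
    # Penalizing actions outside the dataset lower-bounds the learned Q-values.
    logsumexp = np.log(np.sum(np.exp(q_all_actions), axis=-1))
    return alpha * float(np.mean(logsumexp - q_data_action))

# Toy batch: 2 states, 3 discrete actions each (CQL-DQN style).
q_all = np.array([[1.0, 2.0, 0.5],
                  [0.0, 1.0, 3.0]])
q_data = np.array([2.0, 3.0])  # Q-values of the actions taken in the dataset
reg = cql_regularizer(q_all, q_data)  # added to the usual TD loss
```

Since log-sum-exp is always at least the maximum Q-value, the term is non-negative and grows as the policy's Q-values drift away from the dataset's actions.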
  20. Extend RLlib CLI: a new Dataset command.

      rllib dataset 250 --steps 1000000 -f /Users/edilmo/tests/rllib/moab-expert-ppo.yaml

      - Training online with any simulator to collect data and generate datasets
      - Any algorithm could be used, like PPO, regardless of which offline algorithm will be used later
      - Five datasets generated in multiple formats:
        - Expert: trained up to Max Reward
        - Medium: trained up to half of Max Reward
        - Random: first policy initialized
        - MediumExpert: half episodes from Expert and half from Medium
        - MediumRandom: half episodes from Medium and half from Random
      - Use these datasets to test any Offline RL algorithm under different behavioral policies with different levels of optimality
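Generating such datasets relies on RLlib's standard experience-recording options. A hypothetical sketch (environment name and paths are made up) of the config an expert-style collection run could use:

```python
# Train any online algorithm (e.g. PPO) with RLlib's "output" option set,
# which records all sampled experiences to JSON files for later offline use.
expert_config = {
    "env": "Moab-v0",              # hypothetical simulator name
    "output": "/tmp/moab-expert",  # experiences are written here
    "output_max_file_size": 64 * 1024 * 1024,  # rotate files at 64 MiB
}

# Mixed datasets (MediumExpert, MediumRandom) can then be assembled by
# combining half of the episode files from each source policy's output.
```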
  21. How will our users benefit from Offline RL? Let's see

      a preview comparing Offline and Online Training. (Chart: Offline RL vs. Online RL.)
  22. What is challenging about Offline RL? With the techniques and

      the lack of simulators: Distributional Shift, No exploration, Evaluation. CQL deals with action distributional shift, because it happens at training time, but it does not save us from state distributional shift, because that happens at evaluation time. Results will only be as good as the dataset and the behavioral policies present in it allow. FQE or model-based OPE is our recommendation.
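As a point of comparison for the recommended FQE or model-based OPE, the simplest off-policy evaluation baseline is ordinary importance sampling over logged episodes. A small NumPy sketch with toy data (an illustration, not Bonsai's implementation):

```python
import numpy as np

def importance_sampling_ope(returns, target_probs, behavior_probs):
    # Ordinary importance-sampling estimate of the target policy's value:
    # weight each logged episode's return by the product over its steps of
    # pi(a|s) / beta(a|s), then average across episodes.
    estimates = []
    for ret, pi, beta in zip(returns, target_probs, behavior_probs):
        weight = np.prod(np.asarray(pi) / np.asarray(beta))
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy example: two logged episodes with per-step action probabilities.
returns = [10.0, 4.0]
pi = [[0.9, 0.8], [0.5]]    # target policy probabilities
beta = [[0.6, 0.8], [0.5]]  # behavioral policy probabilities
v_hat = importance_sampling_ope(returns, pi, beta)  # 1.5*10 and 1.0*4 -> 9.5
```

The per-episode weights make this estimator unbiased but high-variance on long episodes, which is one reason FQE or model-based OPE is preferred in practice.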
  23. What is next? With Ray & RLlib and Offline RL:

      Ray: Placement Groups. RLlib: Offline Model-Based. Offline RL: Hybrid Modes.
  24. Thank you!