
OFRL: Designing an Offline Reinforcement Learning and Policy Evaluation Platform from Practical Perspectives

Haruka Kiyohara
September 22, 2022

CONSEQUENCES+REVEAL WS @ RecSys2022 (Day1, REVEAL)
About WS: https://sites.google.com/view/consequences2022

Transcript

  1. OFRL: Designing an Offline Reinforcement Learning and
    Policy Evaluation Platform from Practical Perspectives
    Haruka Kiyohara, Kosuke Kawakami
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara

  2. From online to offline RL in RecSys
    Reinforcement Learning (RL) models sequential interactions.
    Example on a search engine
    [Diagram] The policy interacts with the environment: the environment issues a search
    query as the state (𝒔), the policy chooses a document as the action (𝒂), and a click is
    observed as the reward (𝒓); the loop repeats over (state, action, reward, next state).
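
    To make the interaction loop concrete, the following is a minimal sketch in the style of
    the OpenAI Gym API [Brockman+,16] (pre-0.26 signatures); the environment id and the random
    action choice are placeholders rather than anything shown on the slide.

    # Minimal Gym-style interaction loop: in the search-engine example, the state would be a
    # search query, the action a document to show, and the reward a click.
    import gym

    env = gym.make("CartPole-v0")           # placeholder environment
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # a learned policy would choose the action here
        state, reward, done, info = env.step(action)  # observe reward and next state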

  3. From online to offline RL in RecSys
    Reinforcement Learning (RL) models sequential interactions.
    Example on a search engine
    ...however, online learning and evaluation can be risky and costly.
    [Diagram] The same interaction loop, but the policy should now be optimized to maximize
    the reward (clicks).

  4. From online to offline RL in RecSys
    Offline RL is growing as a safe and cost-effective substitute for its online counterpart.
    We’ve implemented a new offline RL platform to facilitate its practical application.
    (especially focusing on the policy evaluation part)
    [Diagram] Instead of online interaction, a new policy is learned from logged data.

  5. Motivation
    Why do we need a new offline RL platform?

  10. Desirable workflow of offline RL
    Providing a streamlined implementation is important to facilitate practical applications.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    • facilitates applications to under-explored settings
    • learns a new policy safely from logged data
    • ensures safe and cost-effective policy deployment
    • validates the reliability of OPE/OPS methods

  11. Desirable workflow of offline RL
    Providing a streamlined implementation is important to facilitate practical applications.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    Unfortunately, most of the existing platforms and benchmark suites are insufficient to
    enable an end-to-end implementation...

  18. Limitations of the existing platforms
    None of the existing platforms enables an end-to-end implementation.

                                         data collection     offline RL   OPE           evaluation of OPE
    Offline RL libraries
    (d3rlpy, Horizon, RLlib)             ✓                   ✓            × (limited)   ×
    Benchmarks for OPE (DOPE, COBS)      × (not flexible)    ×            ✓             ✓
    OPE platform (OBP)                   × (implements the whole procedure, but is not applicable to RL)
    Others (RecoGym, RL4RS)              × (specific)        ✓            × (limited)   ×

  19. OFRL
    An end-to-end platform for offline RL and OPE

  22. Overview of the OFRL platform
    OFRL enables an end-to-end implementation of offline RL and OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    (the first two stages are the focus of many existing platforms; the latter two are our focus)

  23. Overview of the OFRL platform
    OFRL enables an end-to-end implementation of offline RL and OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    The OPE/OPS stage also supports cumulative distribution OPE, which estimates 𝐹(𝜋) and
    adds risk function estimation (e.g., CVaR, variance) [Chandak+,21] [Huang+,21] [Huang+,22].

  24. Example Usage
    A quick demo of an end-to-end offline RL and OPE with OFRL

  29. Step 1: data collection
    We need only 6 lines of code. Quick demo with RTBGym.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: data collection)
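
    The code shown on the slide is not preserved in this transcript. The sketch below is a
    hedged illustration of what the data-collection step could look like: the module paths and
    class names (ofrl.dataset.SyntheticDataset, ofrl.policy.EpsilonGreedyHead), the environment
    id, and all parameter names are assumptions, not the actual OFRL API.

    # Illustrative sketch only -- ofrl.* names and the environment id are assumed.
    import gym
    import rtbgym                                   # RTBGym: real-time bidding environments [Kiyohara+,21]
    from d3rlpy.algos import DoubleDQN              # backbone of the behavior policy [Hasselt+,16]
    from ofrl.dataset import SyntheticDataset       # hypothetical OFRL module
    from ofrl.policy import EpsilonGreedyHead       # hypothetical stochastic policy wrapper

    env = gym.make("RTBEnv-discrete-v0")            # placeholder environment id
    behavior_policy = EpsilonGreedyHead(            # make the logging policy stochastic
        DoubleDQN(), n_actions=env.action_space.n, epsilon=0.3,
    )
    dataset = SyntheticDataset(env=env, behavior_policy=behavior_policy)
    logged_feedback = dataset.obtain_episodes(n_trajectories=10000)  # logged data for offline RL and OPE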

  30. Step 2: learning a new policy offline (offline RL)
    We use d3rlpy for the offline RL part.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: offline RL)
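
    As above, the on-slide code is not in the transcript. Below is a minimal sketch of offline
    policy learning with d3rlpy (the library the slide names), using the v1-style API and CQL
    [Kumar+,20]; the conversion of the logged feedback into an MDPDataset (in particular the
    dictionary keys) is an assumption.

    # Minimal offline RL sketch with d3rlpy; the logged_feedback keys are assumed.
    from d3rlpy.dataset import MDPDataset
    from d3rlpy.algos import DiscreteCQL

    mdp_dataset = MDPDataset(
        observations=logged_feedback["state"],
        actions=logged_feedback["action"],
        rewards=logged_feedback["reward"],
        terminals=logged_feedback["done"],
    )
    cql = DiscreteCQL()
    cql.fit(mdp_dataset, n_epochs=10)   # learn a new policy purely from the logged data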

  35. Step 3: Basic OPE to evaluate the policy value
    Users can compare various policies and OPE estimators at once.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: OPE/OPS)

  36. Step 3: Basic OPE to evaluate the policy value
    Users can compare various policies and OPE estimators at once.
    [Figure] Estimated policy value of each candidate policy under each OPE estimator.
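
    The following is a hedged sketch of how comparing several OPE estimators at once might look.
    The classes CreateOPEInput and OffPolicyEvaluation and the estimator names are assumed
    stand-ins for OFRL's actual interface, and the candidate policies (cql_policy, ddqn_policy)
    are placeholders for wrapped versions of the policies learned in Step 2. The estimators
    correspond to DM [Le+,19], IPS [Precup+,00], and DR [Jiang&Li,16] [Thomas&Brunskill,16]
    from the references.

    # Hedged sketch: compare candidate policies across several OPE estimators at once.
    # All ofrl.* names are assumptions.
    from ofrl.ope import CreateOPEInput, OffPolicyEvaluation
    from ofrl.ope import DirectMethod, TrajectoryWiseImportanceSampling, DoublyRobust

    prep = CreateOPEInput(logged_dataset=logged_feedback)
    input_dict = prep.obtain_whole_inputs(evaluation_policies=[cql_policy, ddqn_policy])

    ope = OffPolicyEvaluation(
        logged_dataset=logged_feedback,
        ope_estimators=[DirectMethod(), TrajectoryWiseImportanceSampling(), DoublyRobust()],
    )
    ope.visualize_off_policy_estimates(input_dict)   # plot estimated policy values side by side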

  37. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: OPE/OPS)

  38. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Figure] Estimated cumulative distribution function and estimated conditional value at risk
    (for various ranges).

  39. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Figure] Estimated interquartile range (10%-90%).
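
    Keeping the same caveat, here is a sketch of what the cumulative distribution OPE step could
    look like if it mirrors the basic-OPE interface above; CumulativeDistributionOPE, the
    estimator classes, and the visualization method names are all assumptions.

    # Hedged sketch of cumulative distribution OPE [Chandak+,21] [Huang+,21] [Huang+,22].
    # All ofrl.* names are assumptions mirroring the basic-OPE sketch above.
    from ofrl.ope import CumulativeDistributionOPE
    from ofrl.ope import CumulativeDistributionDirectMethod, CumulativeDistributionImportanceSampling

    cd_ope = CumulativeDistributionOPE(
        logged_dataset=logged_feedback,
        ope_estimators=[CumulativeDistributionDirectMethod(), CumulativeDistributionImportanceSampling()],
    )
    cd_ope.visualize_cumulative_distribution_function(input_dict)   # estimated CDF of the return
    cd_ope.visualize_conditional_value_at_risk(input_dict, alphas=[0.1, 0.3, 0.5])
    cd_ope.visualize_interquartile_range(input_dict, alpha=0.1)     # 10%-90% interquartile range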

  40. Step 5: OPS and evaluation of OPE/OPS
    Users can also easily implement both OPS and evaluation of OPE/OPS.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    (this step: OPE/OPS and evaluation of OPE)

  41. Step 5: OPS and evaluation of OPE/OPS
    Users can also easily implement both OPS and evaluation of OPE/OPS.
    [Figure] Comparing the true (x-axis) and estimated (y-axis) variance, and evaluating the
    quality of the OPS results.
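
    As with the previous steps, the slide's code is not preserved, so the sketch below only
    illustrates one plausible shape for off-policy selection (OPS) and for validating OPE/OPS
    against ground-truth values obtained by rolling policies out in the simulator; the
    OffPolicySelection class and its method and metric names are assumptions.

    # Hedged sketch of OPS and of evaluating OPE/OPS quality; ofrl.* names are assumptions.
    from ofrl.ope import OffPolicySelection

    ops = OffPolicySelection(ope=ope, cumulative_distribution_ope=cd_ope)

    # rank the candidate policies by their estimated policy value (or by CVaR / variance)
    ranking = ops.select_by_policy_value(input_dict)

    # compare OPE estimates and OPS rankings against ground-truth policy values,
    # which RTBGym can provide by running each policy online in simulation
    report = ops.evaluate_performance_of_estimators(
        input_dict, metrics=["mse", "rank_correlation", "regret"],
    )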

  42. Summary
    • Offline RL and OPE are practically relevant, as they are safe and cost-effective
      substitutes for their online counterparts.
    • To facilitate their practical application, we are building a new software package, OFRL.
    • OFRL is the first end-to-end platform for offline RL and OPE.
    • OFRL provides informative insights into policy performance using cumulative
      distribution OPE.
    OFRL enables quick and flexible prototyping of offline RL and OPE,
    facilitating practical applications in a range of problem settings.

  43. Thank you for listening!
    Slides are now available at: https://sites.google.com/view/harukakiyohara
    contact: [email protected]

  44. Cumulative distribution OPE for risk function estimation
    Recent advances in OPE enable estimating the cumulative distribution function (CDF) of the
    return, 𝐹(𝜋), and risk functions derived from it, which is of great practical relevance
    [Chandak+,21] [Huang+,21] [Huang+,22].
    [Figure] Cumulative distribution OPE estimates the CDF 𝐹(𝜋) (cumulative probability of the
    return); various risk functions are then calculated from the CDF.
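
    To make this concrete, here is a small self-contained illustration (independent of OFRL) of
    how risk functions such as CVaR, the interquartile range, and the variance follow directly
    from a CDF, using an empirical CDF over sampled returns as a stand-in for an OPE estimate;
    all numbers are synthetic.

    # Once the CDF of the return is (estimated to be) known, risk functions are simple
    # functionals of it. Synthetic returns stand in for an OPE-based CDF estimate.
    import numpy as np

    returns = np.random.default_rng(0).normal(loc=10.0, scale=3.0, size=10_000)
    sorted_returns = np.sort(returns)
    cdf = np.arange(1, len(sorted_returns) + 1) / len(sorted_returns)   # empirical CDF F(y)

    alpha = 0.1
    cvar = sorted_returns[cdf <= alpha].mean()     # CVaR: mean return over the worst alpha-fraction
    q10, q90 = np.quantile(returns, [0.1, 0.9])    # 10%-90% interquartile range
    variance = returns.var()                       # variance, another risk functional
    print(f"CVaR(10%) = {cvar:.2f}, IQR = [{q10:.2f}, {q90:.2f}], Var = {variance:.2f}")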

  45. References

  46. References (1/6)
    [Seno+,21 (d3rlpy)] Takuma Seno and Michita Imai. “d3rlpy: An Offline Deep
    Reinforcement Learning Library.” 2021. https://arxiv.org/abs/2111.03788
    [Gauci+,18 (Horizon)] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri,
    Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott
    Fujimoto. “Horizon: Facebook's Open Source Applied Reinforcement Learning
    Platform.” 2018.
    https://arxiv.org/abs/1811.00260
    [Liang+,18 (RLlib)] Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox,
    Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib:
    Abstractions for Distributed Reinforcement Learning.” ICML, 2018.
    https://arxiv.org/abs/1712.09381

  47. References (2/6)
    [Fu+,21 (DOPE)] Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu
    Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral
    Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep
    Off-Policy Evaluation.” ICLR, 2021. https://arxiv.org/abs/2103.16596
    [Voloshin+,21 (COBS)] Cameron Voloshin, Hoang M. Le, Nan Jiang, and Yisong Yue.
    “Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning.” NeurIPS
    dataset&benchmark, 2021. https://arxiv.org/abs/1911.06854
    [Rohde+,18 (RecoGym)] David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile,
    and Alexandros Karatzoglou. “RecoGym: A Reinforcement Learning Environment for
    the problem of Product Recommendation in Online Advertising.” 2018.
    https://arxiv.org/abs/1808.00720

  48. References (3/6)
    [Wang+,21 (RL4RS)] Kai Wang, Zhene Zou, Yue Shang, Qilin Deng, Minghao Zhao,
    Yile Liang, Runze Wu, Jianrong Tao, Xudong Shen, Tangjie Lyu, and Changjie Fan.
    “RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender
    System.” 2021. https://arxiv.org/abs/2110.11073
    [Saito+,21 (OBP)] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke
    Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible
    Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021.
    https://arxiv.org/abs/2008.07146
    [Brockman+,16 (OpenAI Gym)] Greg Brockman, Vicki Cheung, Ludwig Pettersson,
    Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.”
    2016. https://arxiv.org/abs/1606.01540

  49. References (4/6)
    [Kiyohara+,21 (RTBGym)] Haruka Kiyohara, Kosuke Kawakami, and Yuta Saito.
    “Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and
    Recommendation: Potential Use of Simulation.” 2021.
    https://arxiv.org/abs/2109.08331
    [Chandak+,21 (cumulative distribution OPE)] Yash Chandak, Scott Niekum, Bruno
    Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal
    Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2104.12820
    [Huang+,21 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C.
    Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual
    Bandits.” NeurIPS, 2021. https://arxiv.org/abs/2104.08977

  50. References (5/6)
    [Huang+,22 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C.
    Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov
    Decision Processes.” AISTATS, 2022.
    https://proceedings.mlr.press/v151/huang22b.html
    [Hasselt+,16 (DDQN)] Hado van Hasselt, Arthur Guez, and David Silver. “Deep
    Reinforcement Learning with Double Q-learning.” AAAI, 2016.
    https://arxiv.org/abs/1509.06461
    [Kumar+,20 (CQL)] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine.
    “Conservative Q-Learning for Offline Reinforcement Learning.” NeurIPS, 2020.
    https://arxiv.org/abs/2006.04779
    [Le+,19 (DM)] Hoang M. Le, Cameron Voloshin, and Yisong Yue. “Batch Policy
    Learning under Constraints.” ICML, 2019. https://arxiv.org/abs/1903.08738

  51. References (6/6)
    [Precup+,00 (IPS)] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility
    Traces for Off-Policy Policy Evaluation.” ICML, 2000.
    https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Jiang&Li,16 (DR)] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value
    Evaluation for Reinforcement Learning.” ICML, 2016.
    https://arxiv.org/abs/1511.03722
    [Thomas&Brunskill,16 (DR)] Philip S. Thomas and Emma Brunskill. “Data-Efficient
    Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016.
    https://arxiv.org/abs/1604.00923