OFRL: Designing an Offline Reinforcement Learning and Policy Evaluation Platform from Practical Perspectives

Haruka Kiyohara
September 22, 2022

CONSEQUENCES+REVEAL WS @ RecSys2022 (Day1, REVEAL)
About WS: https://sites.google.com/view/consequences2022

Transcript

  1. OFRL: Designing an Offline Reinforcement Learning and Policy Evaluation Platform from Practical Perspectives

    Haruka Kiyohara, Kosuke Kawakami. Haruka Kiyohara, Tokyo Institute of Technology, https://sites.google.com/view/harukakiyohara. September 2022. OFRL: Offline RL and OPE platform @ CONSEQUENCES+REVEAL WS.
  2. From online to offline RL in RecSys

    Reinforcement Learning (RL) models sequential interactions. Example on a search engine: the policy observes a search query (state 𝒔) and returns a document (action 𝒂); the environment responds with a click (reward 𝒓) and the next state. The policy should be optimized to maximize reward. However, online learning and evaluation can be risky and costly.
  4. From online to offline RL in RecSys

    Offline RL is growing as a safe and cost-effective substitute for its online counterpart. We've implemented a new offline RL platform to facilitate its practical application, focusing especially on the policy evaluation part. (Diagram labels: online, logged data, policy.)
  5. Motivation

    Why do we need a new offline RL platform?
  6. Desirable workflow of offline RL

    Providing a streamlined implementation is important to facilitate practical applications. Workflow: data collection, offline RL, OPE/OPS, evaluation of OPE.
    • facilitates applications to under-explored settings
    • learns a new policy safely from logged data
    • ensures safe and cost-effective policy deployment
    • validates the reliability of OPE/OPS methods
    Unfortunately, most of the existing platforms / benchmark suites are insufficient to enable an end-to-end implementation.
  12. Limitations of the existing platforms

    None of the existing platforms enables an end-to-end implementation. Workflow steps compared: data collection, offline RL, OPE, evaluation of OPE.
    • Offline RL libraries (d3rlpy, Horizon, RLlib): ✓, ✓, × (limited), ×
    • Benchmarks for OPE (DOPE, COBS): × (not flexible), ✓, ✓
    • OPE platform (OBP): × (implements whole procedures, but is not applicable to RL)
    • Others (RecoGym, RL4RS): × (specific), ✓, × (limited), ×
  19. OFRL

    An end-to-end platform for offline RL and OPE
  20. Overview of the OFRL platform

    OFRL enables an end-to-end implementation of offline RL and OPE. Workflow: data collection, offline RL, OPE/OPS, evaluation of OPE. The workflow diagram carries two labels: "focus of many existing platforms" and "our focus". Cumulative distribution OPE estimates $\hat{F}(\pi)$ [Chandak+,21] [Huang+,21] [Huang+,22], plus risk function estimation (e.g., CVaR, variance).
  24. Example Usage

    A quick demo of an end-to-end offline RL and OPE with OFRL
  25. Step 1: data collection

    We need only 6 lines of code. Quick demo with RTBGym.
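The code listing shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of what a data collection step produces: trajectories of (state, action, reward, behavior-policy propensity) logged by an epsilon-greedy behavior policy. Everything in it (`ToyEnv`, `behavior_probs`, `collect_logged_data`, the `pscore` field) is hypothetical and written for illustration; it is not the OFRL or RTBGym API.

```python
import random


class ToyEnv:
    """Hypothetical 2-state, 2-action environment (a stand-in, not RTBGym)."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        self.state = 0
        return self.state

    def step(self, action):
        # The action whose index matches the current state is rewarded more often.
        p_reward = 0.8 if action == self.state else 0.2
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        self.state = self.rng.randrange(2)
        self.t += 1
        return self.state, reward, self.t >= self.horizon


def behavior_probs(state, epsilon=0.3, n_actions=2):
    """Epsilon-greedy probabilities around a base policy that always picks action 0."""
    probs = [epsilon / n_actions] * n_actions
    probs[0] += 1.0 - epsilon
    return probs


def collect_logged_data(env, n_episodes, seed=0):
    """Roll out the behavior policy and log each step together with its propensity."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_episodes):
        state, done, episode = env.reset(), False, []
        while not done:
            probs = behavior_probs(state)
            action = rng.choices(range(len(probs)), weights=probs)[0]
            next_state, reward, done = env.step(action)
            episode.append(
                {"state": state, "action": action, "reward": reward, "pscore": probs[action]}
            )
            state = next_state
        dataset.append(episode)
    return dataset


logged = collect_logged_data(ToyEnv(), n_episodes=100)
```

Logging the propensity (`pscore`) alongside each action is the design choice that matters here: importance-sampling-based OPE estimators need the behavior policy's action probabilities to reweight the logged rewards.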
  30. Step 2: learning a new policy offline (offline RL)

    We use d3rlpy for the offline RL part.
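To make the idea of learning from a fixed batch concrete, here is a minimal tabular sketch of offline Q-learning (fitted Q iteration) over logged transitions. It is an illustration under simplifying assumptions, not d3rlpy and not the code from the slide: the function names and the hand-written batch are hypothetical, and it omits the conservatism that practical offline RL algorithms such as CQL add to handle actions that are rare in the data.

```python
def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.9, n_iters=50):
    """Tabular offline Q-learning over a fixed batch of (s, a, r, s_next, done) tuples."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        sums = [[0.0] * n_actions for _ in range(n_states)]
        counts = [[0] * n_actions for _ in range(n_states)]
        for s, a, r, s_next, done in transitions:
            # Bellman target computed from the previous iterate; no new interaction.
            target = r if done else r + gamma * max(q[s_next])
            sums[s][a] += target
            counts[s][a] += 1
        # Average the targets per (s, a); pairs never seen in the batch keep their old value.
        q = [
            [sums[s][a] / counts[s][a] if counts[s][a] else q[s][a] for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Hand-written logged batch: action 0 pays off in state 0, action 1 in state 1.
batch = [
    (0, 0, 1.0, 1, False), (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 0, False), (1, 0, 0.0, 0, False),
    (0, 0, 1.0, 0, True), (1, 1, 1.0, 1, True),
]
q_values = fitted_q_iteration(batch, n_states=2, n_actions=2)
policy = greedy_policy(q_values)
```

The learned `policy` is only a candidate at this point. Whether it is actually better than the behavior policy is the question the OPE step answers.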
  31. Step 3: Basic OPE to evaluate the policy value

    Users can compare various policies and OPE estimators at once. Figure: estimated policy value.
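Comparing several policies and estimators at once can be sketched as follows, using two standard importance-sampling estimators (trajectory-wise and per-decision). The dataset layout, the policy functions, and every name below are assumptions made for this illustration, not the OFRL interface, and the two-episode dataset is made up and far too small for a meaningful estimate.

```python
def trajectory_wise_is(dataset, eval_probs, gamma=1.0):
    """Weight each episode's return by the product of all its importance ratios."""
    total = 0.0
    for episode in dataset:
        weight, ret = 1.0, 0.0
        for t, step in enumerate(episode):
            weight *= eval_probs(step["state"])[step["action"]] / step["pscore"]
            ret += gamma ** t * step["reward"]
        total += weight * ret
    return total / len(dataset)


def per_decision_is(dataset, eval_probs, gamma=1.0):
    """Weight each reward only by the importance ratios up to its own time step."""
    total = 0.0
    for episode in dataset:
        weight = 1.0
        for t, step in enumerate(episode):
            weight *= eval_probs(step["state"])[step["action"]] / step["pscore"]
            total += weight * gamma ** t * step["reward"]
    return total / len(dataset)


def uniform_policy(state):
    return [0.5, 0.5]


def state_matching_policy(state):
    # Prefers the action whose index equals the state.
    return [0.8, 0.2] if state == 0 else [0.2, 0.8]


# Made-up logged data collected by a uniform behavior policy (pscore = 0.5).
logged = [
    [{"state": 0, "action": 0, "reward": 1.0, "pscore": 0.5},
     {"state": 1, "action": 1, "reward": 1.0, "pscore": 0.5}],
    [{"state": 0, "action": 1, "reward": 0.0, "pscore": 0.5},
     {"state": 1, "action": 0, "reward": 0.0, "pscore": 0.5}],
]

policies = {"uniform": uniform_policy, "state_matching": state_matching_policy}
estimators = {"tis": trajectory_wise_is, "pdis": per_decision_is}

# One estimate per (policy, estimator) pair.
estimates = {
    (p_name, e_name): estimator(logged, policy)
    for p_name, policy in policies.items()
    for e_name, estimator in estimators.items()
}
```

When the evaluation policy equals the behavior policy, every importance ratio is 1 and both estimators reduce to the average logged return, which is a useful sanity check.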
  37. Step 4: Cumulative distribution OPE

    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE. Figures: estimated cumulative distribution function; estimated conditional value at risk (with various ranges); estimated interquartile range (10%-90%).
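The quantities named on these slides can all be read off an estimated CDF of the trajectory return. Below is a minimal sketch, assuming a self-normalized importance-weighted CDF estimate in the spirit of the cited work; the function names and the toy returns and weights are hypothetical, and this is not the OFRL implementation.

```python
def estimate_cdf(returns, weights):
    """Self-normalized weighted CDF as sorted (value, cumulative probability) pairs."""
    total = sum(weights)
    cdf, cum = [], 0.0
    for value, w in sorted(zip(returns, weights)):
        cum += w / total
        if cdf and cdf[-1][0] == value:
            cdf[-1] = (value, cum)  # merge repeated return values
        else:
            cdf.append((value, cum))
    return cdf


def quantile(cdf, alpha):
    """Smallest return whose cumulative probability reaches alpha."""
    for value, cum in cdf:
        if cum >= alpha - 1e-12:
            return value
    return cdf[-1][0]


def mean_and_variance(cdf):
    """Mean and variance of the return under the estimated distribution."""
    prev, mean, second = 0.0, 0.0, 0.0
    for value, cum in cdf:
        p = cum - prev
        mean += p * value
        second += p * value ** 2
        prev = cum
    return mean, second - mean ** 2


def cvar(cdf, alpha):
    """Average return over the worst alpha fraction of outcomes (lower tail)."""
    prev_cum, acc = 0.0, 0.0
    for value, cum in cdf:
        mass = min(cum, alpha) - prev_cum
        if mass <= 0:
            break
        acc += mass * value
        prev_cum = cum
    return acc / alpha


# Toy inputs: per-trajectory returns and their importance weights.
returns = [0.0, 1.0, 2.0, 3.0]
weights = [1.0, 1.0, 1.0, 1.0]

cdf = estimate_cdf(returns, weights)
mean, variance = mean_and_variance(cdf)
cvar_50 = cvar(cdf, alpha=0.5)
range_10_90 = quantile(cdf, 0.9) - quantile(cdf, 0.1)
```

Because every risk measure is derived from the same estimated CDF, adding a new one (for example another quantile range) requires no new estimator, only a new function of `cdf`.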
  40. Step 5: OPS and evaluation of OPE/OPS

    Users can also easily implement both OPS and evaluation of OPE/OPS. Figures: comparing the true (x) and estimated (y) variance; evaluating the quality of OPS results.
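Evaluating OPE and OPS amounts to comparing estimates against ground-truth values, which are available when the candidate policies can be run in a simulator. The sketch below uses three common metrics: mean squared error, rank correlation, and the regret of the selected policy. The metric choice, the function names, and the numbers are assumptions for illustration and are not results from the talk.

```python
def mse(true_values, estimated_values):
    """Mean squared error of the estimates."""
    n = len(true_values)
    return sum((t - e) ** 2 for t, e in zip(true_values, estimated_values)) / n


def ranks(values):
    """Rank of each entry (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(true_values, estimated_values):
    """Spearman rank correlation between true and estimated orderings (no ties)."""
    n = len(true_values)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(true_values), ranks(estimated_values)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def regret_at_1(true_values, estimated_values):
    """True value lost by deploying the policy with the highest estimate."""
    chosen = max(range(len(estimated_values)), key=estimated_values.__getitem__)
    return max(true_values) - true_values[chosen]


# Made-up values for three hypothetical candidate policies.
true_values = [1.0, 2.0, 3.0]
estimated_values = [1.2, 2.9, 2.5]

error = mse(true_values, estimated_values)
rank_corr = spearman(true_values, estimated_values)
regret = regret_at_1(true_values, estimated_values)
```

The same comparison applies to any estimated statistic, including the variance that the slide plots as true (x) versus estimated (y).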
  42. Summary

    • Offline RL and OPE are practically relevant, as they are safe and cost-effective substitutes for their online counterparts.
    • To facilitate their practical application, we are building new software (OFRL).
    • OFRL is the first end-to-end platform for offline RL and OPE.
    • OFRL provides informative insights on policy performance using cumulative distribution OPE.
    OFRL enables quick and flexible prototyping of offline RL and OPE, facilitating practical applications in a range of problem settings.
  43. Thank you for listening!

    Slides are now available at: https://sites.google.com/view/harukakiyohara
    contact: kiyohara.h.aa@m.titech.ac.jp
  44. Cumulative distribution OPE for risk function estimation

    Recent advances in OPE make it possible to estimate the cumulative distribution function (CDF) of the policy's performance, $F(\pi)$, and from it various risk functions, which is of great practical relevance. [Chandak+,21] [Huang+,21] [Huang+,22]
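For reference, the relationship between the CDF and the risk functions can be written out as follows. The notation is an assumption chosen for this note (trajectory $\tau$, horizon $T$, discount $\gamma$, threshold $m$, lower-tail CVaR at level $\alpha$) and may differ from the notation in the cited papers.

```latex
% CDF of the discounted trajectory return under policy \pi
F(m; \pi) := \mathbb{E}_{\tau \sim p_{\pi}}\left[ \mathbb{1}\left\{ \sum_{t=0}^{T-1} \gamma^{t} r_{t} \le m \right\} \right]

% Risk functions computed from the CDF
\mu(\pi) = \int m \, \mathrm{d}F(m; \pi), \qquad
\sigma^{2}(\pi) = \int \left( m - \mu(\pi) \right)^{2} \mathrm{d}F(m; \pi), \qquad
\mathrm{CVaR}_{\alpha}(\pi) = \frac{1}{\alpha} \int_{0}^{\alpha} F^{-1}(u; \pi) \, \mathrm{d}u
```

Cumulative distribution OPE replaces $F$ with an estimate $\hat{F}$ built from logged data, and each risk function is then evaluated on $\hat{F}$.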
  45. References
  46. References (1/6)

    [Seno+,21 (d3rlpy)] Takuma Seno and Michita Imai. “d3rlpy: An Offline Deep Reinforcement Learning Library.” 2021. https://arxiv.org/abs/2111.03788
    [Gauci+,18 (Horizon)] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott Fujimoto. “Horizon: Facebook's Open Source Applied Reinforcement Learning Platform.” 2018. https://arxiv.org/abs/1811.00260
    [Liang+,18 (RLlib)] Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib: Abstractions for Distributed Reinforcement Learning.” ICML, 2018. https://arxiv.org/abs/1712.09381
  47. References (2/6)

    [Fu+,21 (DOPE)] Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep Off-Policy Evaluation.” ICLR, 2021. https://arxiv.org/abs/2103.16596
    [Voloshin+,21 (COBS)] Cameron Voloshin, Hoang M. Le, Nan Jiang, and Yisong Yue. “Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/1911.06854
    [Rohde+,18 (RecoGym)] David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. “RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising.” 2018. https://arxiv.org/abs/1808.00720
  48. References (3/6)

    [Wang+,21 (RL4RS)] Kai Wang, Zhene Zou, Yue Shang, Qilin Deng, Minghao Zhao, Yile Liang, Runze Wu, Jianrong Tao, Xudong Shen, Tangjie Lyu, and Changjie Fan. “RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System.” 2021. https://arxiv.org/abs/2110.11073
    [Saito+,21 (OBP)] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/2008.07146
    [Brockman+,16 (OpenAI Gym)] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016. https://arxiv.org/abs/1606.01540
  49. References (4/6)

    [Kiyohara+,21 (RTBGym)] Haruka Kiyohara, Kosuke Kawakami, and Yuta Saito. “Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation.” 2021. https://arxiv.org/abs/2109.08331
    [Chandak+,21 (cumulative distribution OPE)] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2104.12820
    [Huang+,21 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” NeurIPS, 2021. https://arxiv.org/abs/2104.08977
  50. References (5/6)

    [Huang+,22 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov Decision Processes.” AISTATS, 2022. https://proceedings.mlr.press/v151/huang22b.html
    [Hasselt+,16 (DDQN)] Hado van Hasselt, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-learning.” AAAI, 2016. https://arxiv.org/abs/1509.06461
    [Kumar+,20 (CQL)] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. “Conservative Q-Learning for Offline Reinforcement Learning.” NeurIPS, 2020. https://arxiv.org/abs/2006.04779
    [Le+,19 (DM)] Hoang M. Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” ICML, 2019. https://arxiv.org/abs/1903.08738
  51. References (6/6)

    [Precup+,00 (IPS)] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Jiang&Li,16 (DR)] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722
    [Thomas&Brunskill,16 (DR)] Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923