
OFRL: Designing an Offline Reinforcement Learning and Policy Evaluation Platform from Practical Perspectives

Haruka Kiyohara
September 22, 2022

CONSEQUENCES+REVEAL WS @ RecSys2022 (Day1, REVEAL)
About WS: https://sites.google.com/view/consequences2022

Transcript

  1. OFRL: Designing an Offline Reinforcement Learning and
    Policy Evaluation Platform from Practical Perspectives
    Haruka Kiyohara, Kosuke Kawakami
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara

  2. From online to offline RL in RecSys
    Reinforcement Learning (RL) models sequential interactions.
    Example on a search engine
    [Diagram] The policy interacts with the environment: the environment issues a search
    query as the state (𝒔), the policy chooses a document as the action (𝒂), and a click is
    observed as the reward (𝒓); the loop repeats over (state, action, reward, next state).
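
    To make the interaction loop concrete, the following is a minimal sketch in the style of
    the OpenAI Gym API [Brockman+,16] (pre-0.26 signatures); the environment id and the random
    action choice are placeholders rather than anything shown on the slide.

    # Minimal Gym-style interaction loop: in the search-engine example, the state would be a
    # search query, the action a document to show, and the reward a click.
    import gym

    env = gym.make("CartPole-v0")           # placeholder environment
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # a learned policy would choose the action here
        state, reward, done, info = env.step(action)  # observe reward and next state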

  3. From online to offline RL in RecSys
    Reinforcement Learning (RL) models sequential interactions.
    Example on a search engine
    ...however, online learning and evaluation can be risky and costly.
    [Diagram] The same interaction loop, but the policy should now be optimized to maximize
    the reward (clicks).

  4. From online to offline RL in RecSys
    Offline RL is growing as a safe and cost-effective substitute for its online counterpart.
    We’ve implemented a new offline RL platform to facilitate its practical application.
    (especially focusing on the policy evaluation part)
    [Diagram] Instead of online interaction, a new policy is learned from logged data.

  5. Motivation
    Why do we need a new offline RL platform?

  10. Desirable workflow of offline RL
    Providing a streamlined implementation is important to facilitate practical applications.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    • facilitates applications to under-explored settings
    • learns a new policy safely from logged data
    • ensures safe and cost-effective policy deployment
    • validates the reliability of OPE/OPS methods

  11. Desirable workflow of offline RL
    Providing a streamlined implementation is important to facilitate practical applications.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    Unfortunately, most of the existing platforms and benchmark suites are insufficient to
    enable an end-to-end implementation...

  18. Limitations of the existing platforms
    None of the existing platforms enables an end-to-end implementation.

                                         data collection     offline RL   OPE           evaluation of OPE
    Offline RL libraries
    (d3rlpy, Horizon, RLlib)             ✓                   ✓            × (limited)   ×
    Benchmarks for OPE (DOPE, COBS)      × (not flexible)    ×            ✓             ✓
    OPE platform (OBP)                   × (implements the whole procedure, but is not applicable to RL)
    Others (RecoGym, RL4RS)              × (specific)        ✓            × (limited)   ×

  19. OFRL
    An end-to-end platform for offline RL and OPE

  22. Overview of the OFRL platform
    OFRL enables an end-to-end implementation of offline RL and OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    (the first two stages are the focus of many existing platforms; the latter two are our focus)

  23. Overview of the OFRL platform
    OFRL enables an end-to-end implementation of offline RL and OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    The OPE/OPS stage also supports cumulative distribution OPE, which estimates 𝐹(𝜋) and
    adds risk function estimation (e.g., CVaR, variance) [Chandak+,21] [Huang+,21] [Huang+,22].

  24. Example Usage
    A quick demo of an end-to-end offline RL and OPE with OFRL

  29. Step 1: data collection
    We need only 6 lines of code. Quick demo with RTBGym.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: data collection)
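
    The code shown on the slide is not preserved in this transcript. The sketch below is a
    hedged illustration of what the data-collection step could look like: the module paths and
    class names (ofrl.dataset.SyntheticDataset, ofrl.policy.EpsilonGreedyHead), the environment
    id, and all parameter names are assumptions, not the actual OFRL API.

    # Illustrative sketch only -- ofrl.* names and the environment id are assumed.
    import gym
    import rtbgym                                   # RTBGym: real-time bidding environments [Kiyohara+,21]
    from d3rlpy.algos import DoubleDQN              # backbone of the behavior policy [Hasselt+,16]
    from ofrl.dataset import SyntheticDataset       # hypothetical OFRL module
    from ofrl.policy import EpsilonGreedyHead       # hypothetical stochastic policy wrapper

    env = gym.make("RTBEnv-discrete-v0")            # placeholder environment id
    behavior_policy = EpsilonGreedyHead(            # make the logging policy stochastic
        DoubleDQN(), n_actions=env.action_space.n, epsilon=0.3,
    )
    dataset = SyntheticDataset(env=env, behavior_policy=behavior_policy)
    logged_feedback = dataset.obtain_episodes(n_trajectories=10000)  # logged data for offline RL and OPE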

  30. Step 2: learning a new policy offline (offline RL)
    We use d3rlpy for the offline RL part.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: offline RL)
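
    As above, the on-slide code is not in the transcript. Below is a minimal sketch of offline
    policy learning with d3rlpy (the library the slide names), using the v1-style API and CQL
    [Kumar+,20]; the conversion of the logged feedback into an MDPDataset (in particular the
    dictionary keys) is an assumption.

    # Minimal offline RL sketch with d3rlpy; the logged_feedback keys are assumed.
    from d3rlpy.dataset import MDPDataset
    from d3rlpy.algos import DiscreteCQL

    mdp_dataset = MDPDataset(
        observations=logged_feedback["state"],
        actions=logged_feedback["action"],
        rewards=logged_feedback["reward"],
        terminals=logged_feedback["done"],
    )
    cql = DiscreteCQL()
    cql.fit(mdp_dataset, n_epochs=10)   # learn a new policy purely from the logged data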

  35. Step 3: Basic OPE to evaluate the policy value
    Users can compare various policies and OPE estimators at once.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: OPE/OPS)

  36. Step 3: Basic OPE to evaluate the policy value
    Users can compare various policies and OPE estimators at once.
    [Figure] Estimated policy value of each candidate policy under each OPE estimator.
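
    The following is a hedged sketch of how comparing several OPE estimators at once might look.
    The classes CreateOPEInput and OffPolicyEvaluation and the estimator names are assumed
    stand-ins for OFRL's actual interface, and the candidate policies (cql_policy, ddqn_policy)
    are placeholders for wrapped versions of the policies learned in Step 2. The estimators
    correspond to DM [Le+,19], IPS [Precup+,00], and DR [Jiang&Li,16] [Thomas&Brunskill,16]
    from the references.

    # Hedged sketch: compare candidate policies across several OPE estimators at once.
    # All ofrl.* names are assumptions.
    from ofrl.ope import CreateOPEInput, OffPolicyEvaluation
    from ofrl.ope import DirectMethod, TrajectoryWiseImportanceSampling, DoublyRobust

    prep = CreateOPEInput(logged_dataset=logged_feedback)
    input_dict = prep.obtain_whole_inputs(evaluation_policies=[cql_policy, ddqn_policy])

    ope = OffPolicyEvaluation(
        logged_dataset=logged_feedback,
        ope_estimators=[DirectMethod(), TrajectoryWiseImportanceSampling(), DoublyRobust()],
    )
    ope.visualize_off_policy_estimates(input_dict)   # plot estimated policy values side by side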

  37. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE (this step: OPE/OPS)

  38. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Figure] Estimated cumulative distribution function and estimated conditional value at risk
    (for various ranges).

  39. Step 4: Cumulative distribution OPE
    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE.
    [Figure] Estimated interquartile range (10%-90%).
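
    Keeping the same caveat, here is a sketch of what the cumulative distribution OPE step could
    look like if it mirrors the basic-OPE interface above; CumulativeDistributionOPE, the
    estimator classes, and the visualization method names are all assumptions.

    # Hedged sketch of cumulative distribution OPE [Chandak+,21] [Huang+,21] [Huang+,22].
    # All ofrl.* names are assumptions mirroring the basic-OPE sketch above.
    from ofrl.ope import CumulativeDistributionOPE
    from ofrl.ope import CumulativeDistributionDirectMethod, CumulativeDistributionImportanceSampling

    cd_ope = CumulativeDistributionOPE(
        logged_dataset=logged_feedback,
        ope_estimators=[CumulativeDistributionDirectMethod(), CumulativeDistributionImportanceSampling()],
    )
    cd_ope.visualize_cumulative_distribution_function(input_dict)   # estimated CDF of the return
    cd_ope.visualize_conditional_value_at_risk(input_dict, alphas=[0.1, 0.3, 0.5])
    cd_ope.visualize_interquartile_range(input_dict, alpha=0.1)     # 10%-90% interquartile range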

  40. Step 5: OPS and evaluation of OPE/OPS
    Users can also easily implement both OPS and evaluation of OPE/OPS.
    [Workflow] data collection → offline RL → OPE/OPS → evaluation of OPE
    (this step: OPE/OPS and evaluation of OPE)

  41. Step 5: OPS and evaluation of OPE/OPS
    Users can also easily implement both OPS and evaluation of OPE/OPS.
    [Figure] Comparing the true (x-axis) and estimated (y-axis) variance, and evaluating the
    quality of the OPS results.
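
    As with the previous steps, the slide's code is not preserved, so the sketch below only
    illustrates one plausible shape for off-policy selection (OPS) and for validating OPE/OPS
    against ground-truth values obtained by rolling policies out in the simulator; the
    OffPolicySelection class and its method and metric names are assumptions.

    # Hedged sketch of OPS and of evaluating OPE/OPS quality; ofrl.* names are assumptions.
    from ofrl.ope import OffPolicySelection

    ops = OffPolicySelection(ope=ope, cumulative_distribution_ope=cd_ope)

    # rank the candidate policies by their estimated policy value (or by CVaR / variance)
    ranking = ops.select_by_policy_value(input_dict)

    # compare OPE estimates and OPS rankings against ground-truth policy values,
    # which RTBGym can provide by running each policy online in simulation
    report = ops.evaluate_performance_of_estimators(
        input_dict, metrics=["mse", "rank_correlation", "regret"],
    )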

  42. Summary
    • Offline RL and OPE are practically relevant, as they are safe and cost-effective
      substitutes for their online counterparts.
    • To facilitate their practical application, we are building a new software package, OFRL.
    • OFRL is the first end-to-end platform for offline RL and OPE.
    • OFRL provides informative insights into policy performance using cumulative
      distribution OPE.
    OFRL enables quick and flexible prototyping of offline RL and OPE,
    facilitating practical applications in a range of problem settings.

  43. Thank you for listening!
    Slides are now available at: https://sites.google.com/view/harukakiyohara
    contact: [email protected]

  44. Cumulative distribution OPE for risk function estimation
    Recent advances in OPE enable estimating the cumulative distribution function (CDF) of the
    return, 𝐹(𝜋), and risk functions derived from it, which is of great practical relevance
    [Chandak+,21] [Huang+,21] [Huang+,22].
    [Figure] Cumulative distribution OPE estimates the CDF 𝐹(𝜋) (cumulative probability of the
    return); various risk functions are then calculated from the CDF.
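
    To make this concrete, here is a small self-contained illustration (independent of OFRL) of
    how risk functions such as CVaR, the interquartile range, and the variance follow directly
    from a CDF, using an empirical CDF over sampled returns as a stand-in for an OPE estimate;
    all numbers are synthetic.

    # Once the CDF of the return is (estimated to be) known, risk functions are simple
    # functionals of it. Synthetic returns stand in for an OPE-based CDF estimate.
    import numpy as np

    returns = np.random.default_rng(0).normal(loc=10.0, scale=3.0, size=10_000)
    sorted_returns = np.sort(returns)
    cdf = np.arange(1, len(sorted_returns) + 1) / len(sorted_returns)   # empirical CDF F(y)

    alpha = 0.1
    cvar = sorted_returns[cdf <= alpha].mean()     # CVaR: mean return over the worst alpha-fraction
    q10, q90 = np.quantile(returns, [0.1, 0.9])    # 10%-90% interquartile range
    variance = returns.var()                       # variance, another risk functional
    print(f"CVaR(10%) = {cvar:.2f}, IQR = [{q10:.2f}, {q90:.2f}], Var = {variance:.2f}")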

  45. References

  46. References (1/6)
    [Seno+,21 (d3rlpy)] Takuma Seno and Michita Imai. “d3rlpy: An Offline Deep
    Reinforcement Learning Library.” 2021. https://arxiv.org/abs/2111.03788
    [Gauci+,18 (Horizon)] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri,
    Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott
    Fujimoto. “Horizon: Facebook's Open Source Applied Reinforcement Learning
    Platform.” 2018.
    https://arxiv.org/abs/1811.00260
    [Liang+,18 (RLlib)] Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox,
    Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib:
    Abstractions for Distributed Reinforcement Learning.” ICML, 2018.
    https://arxiv.org/abs/1712.09381

  47. References (2/6)
    [Fu+,21 (DOPE)] Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu
    Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral
    Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep
    Off-Policy Evaluation.” ICLR, 2021. https://arxiv.org/abs/2103.16596
    [Voloshin+,21 (COBS)] Cameron Voloshin, Hoang M. Le, Nan Jiang, and Yisong Yue.
    “Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning.” NeurIPS
    dataset&benchmark, 2021. https://arxiv.org/abs/1911.06854
    [Rohde+,18 (RecoGym)] David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile,
    and Alexandros Karatzoglou. “RecoGym: A Reinforcement Learning Environment for
    the problem of Product Recommendation in Online Advertising.” 2018.
    https://arxiv.org/abs/1808.00720

  48. References (3/6)
    [Wang+,21 (RL4RS)] Kai Wang, Zhene Zou, Yue Shang, Qilin Deng, Minghao Zhao,
    Yile Liang, Runze Wu, Jianrong Tao, Xudong Shen, Tangjie Lyu, and Changjie Fan.
    “RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender
    System.” 2021. https://arxiv.org/abs/2110.11073
    [Saito+,21 (OBP)] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke
    Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible
    Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021.
    https://arxiv.org/abs/2008.07146
    [Brockman+,16 (OpenAI Gym)] Greg Brockman, Vicki Cheung, Ludwig Pettersson,
    Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.”
    2016. https://arxiv.org/abs/1606.01540

  49. References (4/6)
    [Kiyohara+,21 (RTBGym)] Haruka Kiyohara, Kosuke Kawakami, and Yuta Saito.
    “Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and
    Recommendation: Potential Use of Simulation.” 2021.
    https://arxiv.org/abs/2109.08331
    [Chandak+,21 (cumulative distribution OPE)] Yash Chandak, Scott Niekum, Bruno
    Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal
    Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2104.12820
    [Huang+,21 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C.
    Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual
    Bandits.” NeurIPS, 2021. https://arxiv.org/abs/2104.08977

  50. References (5/6)
    [Huang+,22 (cumulative distribution OPE)] Audrey Huang, Liu Leqi, Zachary C.
    Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov
    Decision Processes.” AISTATS, 2022.
    https://proceedings.mlr.press/v151/huang22b.html
    [Hasselt+,16 (DDQN)] Hado van Hasselt, Arthur Guez, and David Silver. “Deep
    Reinforcement Learning with Double Q-learning.” AAAI, 2016.
    https://arxiv.org/abs/1509.06461
    [Kumar+,20 (CQL)] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine.
    “Conservative Q-Learning for Offline Reinforcement Learning.” NeurIPS, 2020.
    https://arxiv.org/abs/2006.04779
    [Le+,19 (DM)] Hoang M. Le, Cameron Voloshin, and Yisong Yue. “Batch Policy
    Learning under Constraints.” ICML, 2019. https://arxiv.org/abs/1903.08738

  51. References (6/6)
    [Precup+,00 (IPS)] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility
    Traces for Off-Policy Policy Evaluation.” ICML, 2000.
    https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Jiang&Li,16 (DR)] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value
    Evaluation for Reinforcement Learning.” ICML, 2016.
    https://arxiv.org/abs/1511.03722
    [Thomas&Brunskill,16 (DR)] Philip S. Thomas and Emma Brunskill. “Data-Efficient
    Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016.
    https://arxiv.org/abs/1604.00923