[arXiv'23] SCOPE-RL: A Python Library for Offline RL and Off-Policy Evaluation

Haruka Kiyohara
November 30, 2023

Transcript

  1. SCOPE-RL: A Python package for offline RL, off-policy evaluation and

    selection (OPE/OPS) Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito May 2024 SCOPE-RL package description 1
  2. Real-world sequential decision making Example of sequential decision-making in healthcare

    We aim to optimize such decisions as a Reinforcement Learning (RL) problem. May 2024 SCOPE-RL package description 2 Other applications include.. • Robotics • Education • Recommender systems • … Sequential decision-making is everywhere!
  3. Online and Offline Reinforcement Learning (RL)

    • Online RL – learns a policy through interaction; may harm the real system with bad action choices. • Offline RL – learns and evaluates a policy solely from offline data; can be a safe alternative to online RL. We talk about a library for offline RL and policy evaluation. May 2024 SCOPE-RL package description 3
  4. Content • Motivation • Key features of SCOPE-RL • End-to-end

    implementation of offline RL and off-policy evaluation (OPE) • Various OPE estimators • Assessment protocols for OPE • User-friendly APIs • Appendix • Quick demo on the usage • Implemented OPE estimators and assessment protocols May 2024 SCOPE-RL package description 4
  5. Two sides of offline RL: policy learning and evaluation Both

    policy learning and evaluation are critical for deploying a well-performing policy. May 2024 SCOPE-RL package description 6 Off-Policy Evaluation (OPE) (+ online A/B tests) If policy learning fails.. → we cannot include a well-performing policy in the candidate sets. If policy evaluation fails.. → we may choose a poor-performing policy as the final production policy. Both cases result in a poor-performing production policy, which should be avoided.
  6. Desirable workflow of offline RL

    Providing a streamlined implementation of offline RL and OPE is the key. A flexible, end-to-end implementation facilitates.. • real-world applications (practice) • benchmarking and quick testing of new offline RL/OPE algorithms (research) (workflow: data collection → offline RL → OPE/OPS → evaluation of OPE) May 2024 SCOPE-RL package description 7-8
  8. Desirable property of each module

    Providing a streamlined implementation of offline RL and OPE is the key. Each module should be.. • applicable to various RL environments, including under-explored settings • able to implement and compare various offline RL algorithms • able to evaluate various policies with various OPE estimators • able to validate the reliability of OPE and the downstream policy selection methods (workflow: data collection → offline RL → OPE/OPS → evaluation of OPE) May 2024 SCOPE-RL package description 9-12
  12. Issue of existing libraries for offline RL and OPE Providing

    a streamlined implementation of offline RL and OPE is the key. May 2024 SCOPE-RL package description 13 data collection offline RL OPE/OPS evaluation of OPE Unfortunately, most of the existing platforms / benchmark suites are insufficient to enable an end-to-end implementation..
  13. Issue of existing libraries for offline RL and OPE

    None of the existing platforms enables an end-to-end implementation; in particular, only a few libraries support OPE implementations.
    • Offline RL libraries (d3rlpy, Horizon, RLlib): data collection ✓ / offline RL ✓ / OPE × (limited) / evaluation of OPE ×
    • Benchmarks for OPE (DOPE, COBS): data collection × (not flexible) / offline RL ✓ / OPE ✓ / evaluation of OPE ×
    • OPE platform (OBP): implements the whole procedure, but is not applicable to RL
    • Application-specific platforms (RecoGym, RL4RS): data collection × (specific) / offline RL ✓ / OPE × (limited) / evaluation of OPE ×
    May 2024 SCOPE-RL package description 14-20
  20. Summary of the contribution

    In the following slides, we will discuss each feature one by one: • End-to-end implementation of offline RL and OPE • Variety of OPE estimators and assessment protocols • Cumulative distribution OPE for risk function estimation • Risk-return assessments of OPE and the downstream policy selection • User-friendly APIs, visualization tools, and documentation May 2024 SCOPE-RL package description 22
  21. End-to-end implementation of offline RL and OPE

    We streamline the implementation with the following four modules for the first time: data collection (compatibility with OpenAI Gym/Gymnasium) → offline RL (integration with d3rlpy) → OPE/OPS → evaluation of OPE (our particular focus: various OPE estimators and assessment protocols of OPE). May 2024 SCOPE-RL package description 23
  22. Our particular interest is in policy evaluation

    A Markov Decision Process (MDP) is defined as ⟨S, A, T, P_r, γ⟩. • s_t ∈ S: state • a_t ∈ A: action • r_t: reward • t: timestep • T(s_{t+1} | s_t, a_t): state transition • P_r(r_t | s_t, a_t): reward function • γ: discount factor • τ: trajectory ▼ our interest: evaluating a policy π(a_t | s_t). May 2024 SCOPE-RL package description 24
  23. Variety of OPE estimators and assessment protocols SCOPE-RL implements various

    OPE estimators to estimate the expected rewards. • (Basic) Direct Method (DM) / Per-Decision Importance Sampling (PDIS) / Doubly Robust (DR) • (Advanced) State(-action) Marginal Importance Sampling (S(A)MIS) and Doubly Robust (S(A)MDR) / Double Reinforcement Learning (DRL) • (Options) Self-normalized estimators / Spectrum of OPE (SOPE) / kernel-based estimators for continuous actions May 2024 SCOPE-RL package description 25 See Appendix for the details.
  24. Variety of OPE estimators and assessment protocols SCOPE-RL additionally provides

    more fine-grained evaluation protocols. May 2024 SCOPE-RL package description 26 policy evaluation Evaluation-of-OPE
  25. (1) Cumulative distribution OPE for risk function estim.

    Cumulative distribution OPE (CD-OPE) estimates the whole performance distribution. (plot: CDF F(π) over the reward threshold) May 2024 SCOPE-RL package description 27
  26. (1) Cumulative distribution OPE for risk function estim.

    Then, using the estimated cumulative distribution function (CDF), we can derive risk metrics of the trajectory-wise reward, which enables us to compare, e.g., the worst-case policy value. Note: CVaR is the average of the worst (1 - 𝛼)% of trials. May 2024 SCOPE-RL package description 28
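    To make this concrete, with G(τ) := Σ_{t=0}^{T-1} γ^t r_t denoting the trajectory-wise reward, the quantities above can be written as follows (a sketch following the convention in the note above; the exact definitions used in SCOPE-RL may differ slightly):

        F_\pi(m) := \Pr_{\tau \sim p_\pi}\big( G(\tau) \le m \big),
        \qquad
        \mathrm{CVaR}_{\alpha}(\pi) := \frac{1}{1-\alpha} \int_{0}^{1-\alpha} F_\pi^{-1}(u) \, du ,

    i.e., the CDF over reward thresholds m, and the mean of the lowest (1 - 𝛼) fraction of the return distribution.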
  27. (2) Assessment protocols for OPE

    From the next slide, we will move on to the assessment of OPE. (recap of the desirable properties) • applicable to various RL environments, including under-explored settings • able to implement and compare various offline RL algorithms • able to evaluate various policies with various OPE estimators • able to validate the reliability of OPE and the downstream policy selection methods (workflow: data collection → offline RL → OPE/OPS → evaluation of OPE) May 2024 SCOPE-RL package description 29
  28. (2) Assessment protocols for OPE

    SCOPE-RL implements the following three conventional metrics. • Mean squared error (MSE) – “accuracy” of policy evaluation • Rank correlation (RankCorr) – “accuracy” of policy alignment • Regret – “accuracy” of the downstream policy selection See Appendix for the definitions. May 2024 SCOPE-RL package description 30
  29. (2) Assessment protocols for OPE

    Three existing metrics are suitable for the top-1 selection, where we directly choose the production policy via OPE: low MSE, high RankCorr, and low Regret indicate a near-best production policy. .. but in practice, we cannot solely rely on the OPE result. May 2024 SCOPE-RL package description 31-32
  31. (2) Assessing top-𝑘 policy selection results

    SCOPE-RL additionally reports the statistics of the top-𝑘 policy portfolio selected by OPE. OPE works as a screening process: the top-𝑘 candidates are passed to online A/B tests and the A/B test results are combined for the final policy selection. The statistics of the policy portfolio let us assess the risk and return of these online A/B tests. May 2024 SCOPE-RL package description 33-34
  33. (2) Assessing top-𝑘 policy selection results

    SCOPE-RL additionally reports the statistics of the top-𝑘 policy portfolio selected by OPE.
    • best@𝑘 (return; the higher, the better): measures the performance of the final production policy.
    • worst@𝑘, mean@𝑘, std@𝑘, safety violation rate@𝑘 (risk): measure the risk of deploying poor-performing policies in online A/B tests (the higher the better for worst@𝑘 and mean@𝑘; the lower the better for std@𝑘 and safety violation rate@𝑘).
    • SharpeRatio@k (efficiency; the higher, the better): measures the return (best@𝑘) over the risk-free baseline (𝐽(𝜋_b)), discounted by the risk of deploying poor policies (std@𝑘).
    See Appendix for the definitions. May 2024 SCOPE-RL package description 35-38
  37. Summary of the distinctive features of the OPE module

    SCOPE-RL additionally provides more fine-grained evaluation protocols. (policy evaluation / evaluation-of-OPE) May 2024 SCOPE-RL package description 39
  38. Summary of the contribution of SCOPE-RL

    • SCOPE-RL is the first end-to-end open-source platform for offline RL and OPE. • Unlike most existing offline RL libraries, SCOPE-RL puts weight on the OPE module. • .. implements a variety of OPE estimators. • .. supports cumulative distribution OPE for the first time. • .. handles assessments of OPE estimators. SCOPE-RL can be used as a quick testbed for OPE estimators! May 2024 SCOPE-RL package description 40
  39. Last but not least, our API is very easy to

    use. E.g., we can obtain and visualize the assessment results in a few lines of code. May 2024 SCOPE-RL package description 41 GitHub Install now!!
  40. Our documentation also provides detailed tips for use. May 2024

    SCOPE-RL package description 42 documentation Install now!!
  41. Find more in the SCOPE-RL reference pages! • Webpage (documentation):

    https://scope-rl.readthedocs.io/en/latest/ • Package reference: https://scope-rl.readthedocs.io/en/latest/ documentation/scope_rl_api.html • GitHub: https://github.com/hakuhodo-technologies/scope-rl • PyPI: https://pypi.org/project/scope-rl/ • Google Group: https://groups.google.com/g/scope-rl May 2024 SCOPE-RL package description 43 documentation GitHub PyPI
  42. Example Usage A quick demo for streamlining offline RL and

    OPE May 2024 SCOPE-RL package description 45
  43. Step 1: data collection

    We need only 6 lines of code. Quick demo with RTBGym. (workflow: data collection → offline RL → OPE/OPS → evaluation of OPE) May 2024 SCOPE-RL package description 46-50
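    The code screenshots are not reproduced in this transcript; the following is a minimal Python sketch of the data-collection step, written from recollection of the SCOPE-RL documentation. The module paths, class names (SyntheticDataset, EpsilonGreedyHead), the environment id, and all arguments are assumptions and may differ from the actual API:

        # Sketch of SCOPE-RL data collection on RTBGym (names/arguments are assumptions).
        import gym
        import rtbgym  # registers the "RTBEnv-discrete-v0" environment (assumed id)
        from d3rlpy.algos import DoubleDQN              # d3rlpy v1-style API (assumed)
        from scope_rl.dataset import SyntheticDataset   # assumed module path
        from scope_rl.policy import EpsilonGreedyHead   # assumed behavior-policy wrapper

        env = gym.make("RTBEnv-discrete-v0")

        # wrap a base policy (trained online in the demo) as a stochastic behavior policy
        ddqn = DoubleDQN()
        behavior_policy = EpsilonGreedyHead(
            ddqn, n_actions=env.action_space.n, epsilon=0.3,
            name="ddqn_epsilon_0.3", random_state=12345,
        )

        # collect logged trajectories with the behavior policy
        dataset = SyntheticDataset(env=env, max_episode_steps=env.step_per_episode)
        logged_dataset = dataset.obtain_episodes(
            behavior_policies=behavior_policy, n_trajectories=10000, random_state=12345,
        )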
  48. Step 2: learning a new policy offline (offline RL)

    We use d3rlpy for the offline RL part. (workflow: data collection → offline RL → OPE/OPS → evaluation of OPE) May 2024 SCOPE-RL package description 51
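    As a hedged illustration (continuing the variables from the sketch above), the offline RL step with d3rlpy might look like the following; the d3rlpy v1-style API and the logged-dataset field names are assumptions:

        # Sketch of the offline RL step with d3rlpy (v1-style API; field names are assumptions).
        from d3rlpy.dataset import MDPDataset
        from d3rlpy.algos import DiscreteCQL

        # build a d3rlpy dataset from the logged data
        offline_dataset = MDPDataset(
            observations=logged_dataset["state"],
            actions=logged_dataset["action"],
            rewards=logged_dataset["reward"],
            terminals=logged_dataset["done"],
        )

        # learn a new (candidate) policy offline with Conservative Q-Learning [Kumar+,20]
        cql = DiscreteCQL()
        cql.fit(offline_dataset, n_epochs=10)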
  49. Step 3: Basic OPE to evaluate the policy value

    Users can compare various policies and OPE estimators at once. (plot: estimated policy value) May 2024 SCOPE-RL package description 52-57
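    A minimal sketch of the basic OPE step, again with class and method names recalled from the SCOPE-RL documentation and therefore to be treated as assumptions:

        # Sketch of basic OPE in SCOPE-RL (module paths and arguments are assumptions).
        from scope_rl.ope import CreateOPEInput, OffPolicyEvaluation as OPE
        from scope_rl.ope.discrete import (
            DirectMethod as DM,
            PerDecisionImportanceSampling as PDIS,
            DoublyRobust as DR,
        )

        # candidate (evaluation) policies, e.g., the offline-learned CQL policy
        evaluation_policies = [
            EpsilonGreedyHead(cql, n_actions=env.action_space.n, epsilon=0.0,
                              name="cql_greedy", random_state=12345),
        ]

        # prepare OPE inputs (value predictions, importance weights, ...) for each policy
        prep = CreateOPEInput(env=env)
        input_dict = prep.obtain_whole_inputs(
            logged_dataset=logged_dataset,
            evaluation_policies=evaluation_policies,
            require_value_prediction=True,
            random_state=12345,
        )

        # estimate and visualize the policy value with several estimators at once
        ope = OPE(logged_dataset=logged_dataset, ope_estimators=[DM(), PDIS(), DR()])
        ope.visualize_off_policy_estimates(input_dict, random_state=12345, sharey=True)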
  55. Step 4: Cumulative distribution OPE

    Users can conduct cumulative distribution OPE in a manner similar to the basic OPE. (plots: estimated cumulative distribution function; estimated conditional value at risk (with various ranges); estimated interquartile range (10%-90%)) May 2024 SCOPE-RL package description 58-60
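    A corresponding sketch for cumulative distribution OPE; the class and method names are assumptions based on the documentation:

        # Sketch of cumulative distribution OPE (class/method names are assumptions).
        from scope_rl.ope import CumulativeDistributionOPE
        from scope_rl.ope.discrete import (
            CumulativeDistributionDM as CD_DM,
            CumulativeDistributionTIS as CD_IS,
            CumulativeDistributionTDR as CD_DR,
        )

        cd_ope = CumulativeDistributionOPE(
            logged_dataset=logged_dataset,
            ope_estimators=[CD_DM(), CD_IS(), CD_DR()],
        )
        # estimated CDF of the trajectory-wise reward (CVaR and interquartile range follow from it)
        cd_ope.visualize_cumulative_distribution_function(input_dict, random_state=12345)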
  58. Step 5: OPS and evaluation of OPE/OPS

    Users can also easily implement both OPS and the evaluation of OPE/OPS. (plots: comparing the true (x-axis) and estimated (y-axis) variance; evaluating the quality of OPS results) May 2024 SCOPE-RL package description 61-62
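    A sketch of the OPS step and the evaluation of OPE/OPS; the method names are assumptions:

        # Sketch of OPS and evaluation of OPE/OPS (method names are assumptions).
        from scope_rl.ope import OffPolicySelection

        ops = OffPolicySelection(ope=ope, cumulative_distribution_ope=cd_ope)

        # rank the candidate policies by their estimated policy value
        ranking_df = ops.select_by_policy_value(input_dict)

        # validate OPE/OPS quality, e.g., by comparing true and estimated values
        ops.visualize_policy_value_for_validation(input_dict)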
  60. Step6: Evaluating the risk-return tradeoff of OPE/OPS Users can also

    compare top-𝑘 policy selection results. May 2024 SCOPE-RL package description 63 data collection offline RL OPE/OPS evaluation of OPE
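    A sketch of the top-𝑘 risk-return assessment; the method name is an assumption:

        # Sketch of the top-k risk-return assessment (method name is an assumption).
        # Reports/plots best@k, worst@k, mean@k, std@k, etc. of the top-k portfolio
        # selected by each OPE estimator.
        ops.visualize_topk_policy_value_selected_by_standard_ope(
            input_dict, random_state=12345,
        )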
  61. Implemented OPE estimators and metrics

    p65-79: standard OPE / p80-85: cumulative distribution OPE / p86-96: assessment protocols of OPE May 2024 SCOPE-RL package description 64
  62. Preliminary

    A Markov Decision Process (MDP) is defined as ⟨S, A, T, P_r, γ⟩. • s_t ∈ S: state • a_t ∈ A: action • r_t: reward • t: timestep • T(s_{t+1} | s_t, a_t): state transition • P_r(r_t | s_t, a_t): reward function • γ: discount factor • τ: trajectory ▼ our interest: evaluating a policy π(a_t | s_t). May 2024 SCOPE-RL package description 65
  63. Estimators for the standard OPE

    We aim to estimate the expected trajectory-wise reward (i.e., policy value) of an evaluation policy with an OPE estimator, using only logged data collected by a past (behavior) policy. The key difficulty is handling counterfactuals and the distribution shift between the behavior and evaluation policies. May 2024 SCOPE-RL package description 66
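    In LaTeX, the estimand and the logged data can be written as (standard OPE notation, consistent with the preliminary slide):

        J(\pi) := \mathbb{E}_{\tau \sim p_\pi(\tau)}\Big[ \sum_{t=0}^{T-1} \gamma^t r_t \Big],
        \qquad
        \mathcal{D} := \{ \tau^{(i)} \}_{i=1}^{n}, \quad
        \tau^{(i)} = \{ (s_t^{(i)}, a_t^{(i)}, r_t^{(i)}) \}_{t=0}^{T-1} \sim p_{\pi_b}(\tau).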
  64. Direct Method (DM) [Le+,19]

    DM trains a value predictor Q̂ (estimating the expected reward at future timesteps) and estimates the policy value from the prediction by taking the empirical average over the logged data (𝑛 is the data size and 𝑖 is the trajectory index). Pros: variance is small. Cons: bias can be large when Q̂ is inaccurate. May 2024 SCOPE-RL package description 67
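    As a worked form (standard DM; the notation in the package may differ slightly):

        \hat{J}_{\mathrm{DM}}(\pi; \mathcal{D})
        := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{a \sim \pi(a \mid s_0^{(i)})}\big[ \hat{Q}(s_0^{(i)}, a) \big],
        \qquad
        \hat{Q}(s, a) \approx \mathbb{E}\Big[ \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} \,\Big|\, s_t = s, a_t = a \Big].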
  65. Per-Decision Importance Sampling (PDIS) [Precup+,00]

    PDIS applies importance sampling to correct the distribution shift, using an importance weight defined as the product of step-wise importance weights. Pros: unbiased under the common support assumption (𝜋(a|s) > 0 ⇒ 𝜋_b(a|s) > 0). Cons: variance can grow exponentially as 𝑡 grows. May 2024 SCOPE-RL package description 68
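    The standard PDIS form, for reference:

        \hat{J}_{\mathrm{PDIS}}(\pi; \mathcal{D})
        := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \, w_{0:t}^{(i)} \, r_t^{(i)},
        \qquad
        w_{0:t}^{(i)} := \prod_{t'=0}^{t} \frac{\pi(a_{t'}^{(i)} \mid s_{t'}^{(i)})}{\pi_b(a_{t'}^{(i)} \mid s_{t'}^{(i)})}.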
  66. Doubly Robust (DR) [Jiang&Li,16] [Thomas&Brunskill,16]

    DR is a hybrid of DM and importance sampling, which applies importance sampling only to the residual: the importance weight is multiplied on the residual value after timestep 𝑡 (the estimator also has a recursive form). Pros: unbiased, and often reduces variance compared to PDIS. Cons: can still suffer from high variance when 𝑡 is large. May 2024 SCOPE-RL package description 69-70
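    The standard (non-recursive) DR form, for reference, with w_{0:-1} := 1:

        \hat{J}_{\mathrm{DR}}(\pi; \mathcal{D})
        := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \Big(
            w_{0:t}^{(i)} \big( r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \big)
            + w_{0:t-1}^{(i)} \, \mathbb{E}_{a \sim \pi(a \mid s_t^{(i)})}\big[ \hat{Q}(s_t^{(i)}, a) \big]
        \Big).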
  68. State-action Marginal IS (SAM-IS) [Uehara+,20]

    To alleviate variance, SAM-IS applies importance sampling on the (state-action) marginal distribution, using an (estimated) marginal importance weight ρ̂(s, a) based on the state-action visitation probability. Pros: unbiased when ρ̂ is correct, and reduces variance compared to PDIS. Cons: accurate estimation of ρ̂ is often challenging, resulting in some bias. May 2024 SCOPE-RL package description 71
  69. State-action Marginal DR (SAM-DR) [Uehara+,20]

    SAM-DR is a DR variant that leverages the (state-action) marginal distribution: the marginal importance weight is multiplied on the residual. Pros: unbiased when ρ̂ or Q̂ is accurate, and reduces variance compared to DR. Cons: accurate estimation of ρ̂ is often challenging, resulting in some bias. May 2024 SCOPE-RL package description 72
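    A sketch of the SAM-IS form (up to discounting/normalization conventions); SAM-DR applies the same marginal weight to the residual term of DR:

        \hat{J}_{\mathrm{SAM\text{-}IS}}(\pi; \mathcal{D})
        := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \, \hat{\rho}(s_t^{(i)}, a_t^{(i)}) \, r_t^{(i)},
        \qquad
        \hat{\rho}(s, a) \approx \frac{d^{\pi}(s, a)}{d^{\pi_b}(s, a)},

    where d^π(s, a) is the state-action visitation probability under π.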
  70. State Marginal estimators (SM-IS/DR) [Liu+,18]

    Likewise, state marginal estimators use the (estimated) state marginal importance weight ρ̂(s), combined with the step-wise importance weight π(a_t | s_t) / π_b(a_t | s_t) at timestep 𝑡. May 2024 SCOPE-RL package description 73
  71. Spectrum of Off-Policy Evaluation (SOPE) [Yuan+,21] SOPE interpolates between marginal

    IS and per-decision IS to balance bias-variance. May 2024 SCOPE-RL package description 74
  72. Spectrum of Off-Policy Evaluation (SOPE) [Yuan+,21] For example, SAM-IS/DR w/

    SOPE are defined as follows. May 2024 SCOPE-RL package description 75
  73. Double Reinforcement Learning (DRL) [Kallus&Uehara,20]

    DRL achieves the lowest variance among unbiased estimators. DRL also uses cross-fitting, which estimates ρ̂ and Q̂ on D∖D_k and estimates Ĵ on D_k (i.e., different subsets of the data), to alleviate potential bias in estimation. May 2024 SCOPE-RL package description 76
  74. Self-normalized estimators [Kallus&Uehara,19]

    Self-normalized estimators alleviate variance by normalizing the importance weight. Self-normalized estimators are no longer unbiased, but remain consistent. May 2024 SCOPE-RL package description 77
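    For example, the self-normalized per-decision weight replaces w_{0:t} in PDIS/DR with:

        \tilde{w}_{0:t}^{(i)} := \frac{w_{0:t}^{(i)}}{\frac{1}{n} \sum_{j=1}^{n} w_{0:t}^{(j)}}.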
  75. Extension to continuous action spaces [Kallus&Zhou,18]

    As the naive importance weight rejects almost all logged actions in continuous action spaces, we exploit the similarity between actions using a kernel function (e.g., a Gaussian kernel). May 2024 SCOPE-RL package description 79
  76. Estimating high-confidence intervals

    SCOPE-RL uses the following inequalities (or resampling methods) to derive a probability bound on the policy value: • Hoeffding • Empirical Bernstein • Student’s t-test • Bootstrapping (𝛼: confidence level) May 2024 SCOPE-RL package description 80
  77. Estimators for the cumulative distribution OPE

    Cumulative distribution OPE (CD-OPE) estimates the whole performance distribution 𝐹(𝜋) over the reward threshold [Chandak+,21] [Huang+,21,22], which enables us to compare, e.g., the worst-case policy value. May 2024 SCOPE-RL package description 81-82
  79. DM for CD-OPE DM is a model-based approach, which uses

    predicted reward. May 2024 SCOPE-RL package description 83 reward prediction
  80. Trajectory-wise IS (TIS) for CD-OPE TIS applies IS to estimate

    the cumulative distribution function (CDF). As the probability may exceed 1 due to large importance weight, we apply clipping. May 2024 SCOPE-RL package description 84 trajectory-wise importance weight
  81. Trajectory-wise DR (TDR) for CD-OPE We can also define a

    DR-style estimator by combining DM and TIS. May 2024 SCOPE-RL package description 85 importance sampling on the residual
  82. Implemented conventional assessment metrics There are four metrics used to

    assess the accuracy of OPE. • Mean squared error (MSE) – “accuracy” of policy evaluation • Rank correlation (RankCorr) – “accuracy” of policy alignment • Regret – “accuracy” of policy selection • Type I and Type II error rates – “accuracy” of safety validation May 2024 SCOPE-RL package description 87
  83. Implemented conventional assessment metrics (1/4)

    There are four metrics used to assess the accuracy of OPE and policy selection. • Mean squared error (MSE) – “accuracy” of policy evaluation [Voloshin+,21]: compares the estimated and true policy values (the lower, the better). May 2024 SCOPE-RL package description 88
  84. Implemented conventional assessment metrics (2/4)

    • Rank correlation (RankCorr) – “accuracy” of policy alignment [Fu+,21]: compares the estimated ranking of candidate policies with the true ranking (the higher, the better). May 2024 SCOPE-RL package description 89
  85. Implemented conventional assessment metrics (3/4)

    • Regret – “accuracy” of policy selection [Doroudi+,18]: the gap between the performance of the true best policy and that of the estimated best policy (the lower, the better). May 2024 SCOPE-RL package description 90
  86. Implemented conventional assessment metrics (4/4)

    • Type I and Type II error rates – “accuracy” of safety validation against the safety threshold 𝐽̄, i.e., how often an unsafe policy is judged safe (false positive) and vice versa (the lower, the better). May 2024 SCOPE-RL package description 91
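    As a worked reference (standard definitions; the exact normalization used in SCOPE-RL may differ), for a candidate policy set Π:

        \mathrm{MSE} := \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \big( \hat{J}(\pi) - J(\pi) \big)^2,
        \qquad
        \mathrm{RankCorr} := \mathrm{spearman}\big( \{ J(\pi) \}_{\pi \in \Pi}, \{ \hat{J}(\pi) \}_{\pi \in \Pi} \big),

        \mathrm{Regret@1} := J(\pi^{\ast}) - J(\hat{\pi}^{\ast}),
        \quad
        \pi^{\ast} := \arg\max_{\pi \in \Pi} J(\pi), \;
        \hat{\pi}^{\ast} := \arg\max_{\pi \in \Pi} \hat{J}(\pi),

    and the Type I / Type II error rates are the rates of judging an unsafe policy (J(π) < 𝐽̄) as safe and a safe policy as unsafe, respectively.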
  87. Top-𝑘 risk-return tradeoff metrics

    SCOPE-RL additionally reports the statistics of the top-𝑘 policy portfolio selected by OPE.
    • best@𝑘 (return; the higher, the better): measures the performance of the final production policy.
    • worst@𝑘, mean@𝑘 (risk; the higher, the better): measure the risk of deploying poor-performing policies in online A/B tests.
    • std@𝑘 (risk; the lower, the better): measures the risk of deploying poor-performing policies in online A/B tests.
    • safety violation rate@𝑘 (risk; the lower, the better): measures how often policies below the safety threshold 𝐽̄ are deployed in online A/B tests.
    • SharpeRatio@k (efficiency; the higher, the better) [Kiyohara+,23]: measures the return (best@𝑘) over the risk-free baseline (𝐽(𝜋_b)), discounted by the risk of deploying poor policies (std@𝑘).
    May 2024 SCOPE-RL package description 92-97
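    Consistent with the descriptions above (see [Kiyohara+,23] for the precise definition), with Π_k the top-𝑘 portfolio selected by an OPE estimator:

        \mathrm{best@}k := \max_{\pi \in \Pi_k} J(\pi),
        \qquad
        \mathrm{SharpeRatio@}k := \frac{\mathrm{best@}k - J(\pi_b)}{\mathrm{std@}k},

    where std@k is the standard deviation of {J(π)}_{π ∈ Π_k} and J(π_b) is the behavior policy's value (the risk-free baseline).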
  93. References (1/9) [Seno&Imai,22 (d3rlpy)] Takuma Seno and Michita Imai. “d3rlpy:

    An Offline Deep Reinforcement Learning Library.” JMLR, 2022. https://arxiv.org/abs/2111.03788 [Gauci+,18 (Horizon)] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott Fujimoto. “Horizon: Facebook's Open Source Applied Reinforcement Learning Platform.” 2018. https://arxiv.org/abs/1811.00260 [Liang+,18 (RLlib)] Eric Liang, Richard Liaw, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib: Abstractions for Distributed Reinforcement Learning.” ICML, 2018. https://arxiv.org/abs/1712.09381 May 2024 SCOPE-RL package description 99
  94. References (2/9) [Fu+,21 (DOPE)] Justin Fu, Mohammad Norouzi, Ofir Nachum,

    George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep Off-Policy Evaluation.” ICLR, 2021. https://arxiv.org/abs/2103.16596 [Voloshin+,21 (COBS)] Cameron Voloshin, Hoang M. Le, Nan Jiang, and Yisong Yue. “Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/1911.06854 [Rohde+,18 (RecoGym)] David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou “RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising.” 2018. https://arxiv.org/abs/1808.00720 May 2024 SCOPE-RL package description 100
  95. References (3/9) [Wang+,21 (RL4RS)] Kai Wang, Zhene Zou, Yue Shang,

    Qilin Deng, Minghao Zhao, Yile Liang, Runze Wu, Jianrong Tao, Xudong Shen, Tangjie Lyu, and Changjie Fan. “RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System.” 2021. https://arxiv.org/abs/2110.11073 [Saito+,21 (OBP)] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off- Policy Evaluation.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/2008.07146 [Brockman+,16 (OpenAI Gym)] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym.” 2016. https://arxiv.org/abs/1606.01540 May 2024 SCOPE-RL package description 101
  96. References (4/9) [Kiyohara+,21 (RTBGym)] Haruka Kiyohara, Kosuke Kawakami, and Yuta

    Saito. “Accelerating Offline Reinforcement Learning Application in Real-Time Bidding and Recommendation: Potential Use of Simulation.” 2021. https://arxiv.org/abs/2109.08331 [Chandak+,21 (CD-OPE)] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2104.12820 [Huang+,21 (CD-OPE)] Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” NeurIPS, 2021. https://arxiv.org/abs/2104.12820 May 2024 SCOPE-RL package description 102
  97. References (5/9) [Huang+,22 (CD-OPE)] Audrey Huang, Liu Leqi, Zachary C.

    Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov Decision Processes.” AISTATS, 2022. https://proceedings.mlr.press/v151/huang22b.html [Hasselt+,16 (DDQN)] Hado van Hasselt, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-learning.” AAAI, 2016. https://arxiv.org/abs/1509.06461 [Kumar+,20 (CQL)] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. “Conservative Q-Learning for Offline Reinforcement Learning.” NeurIPS, 2020. https://arxiv.org/abs/2006.04779 [Le+,19 (DM)] Hoang M. Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” ICML, 2019. https://arxiv.org/abs/1903.08738 May 2024 SCOPE-RL package description 103
  98. References (6/9) [Precup+,00 (IPS)] Doina Precup, Richard S. Sutton, and

    Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs [Jiang&Li,16 (DR)] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722 [Thomas&Brunskill,16 (DR)] Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923 [Uehara+,20 (SAM-IS/DR)] Masatoshi Uehara, Jiawei Huang, Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/1910.12809 May 2024 SCOPE-RL package description 104
  99. References (7/9) [Liu+,18 (SM-IS/DR)] Qiang Liu, Lihong Li, Ziyang Tang,

    Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” NeurIPS, 2018. https://arxiv.org/abs/1810.12429 [Yuan+,21 (SOPE)] Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” NeurIPS, 2021. https://arxiv.org/abs/2111.03936 [Kallus&Uehara,20 (DRL)] Nathan Kallus, Masatoshi Uehara. “Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes.” JMLR, 2020. https://arxiv.org/abs/1908.08526 [Kallus&Uehara,19 (Self-normalized estimators)] Nathan Kallus, Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” NeurIPS, 2019. https://arxiv.org/abs/1906.03735 May 2024 SCOPE-RL package description 105
  100. References (8/9) [Kallus&Zhou,18 (extension to continuous actions)] Nathan Kallus, Angela

    Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” AISTATS, 2018. https://arxiv.org/abs/1802.06037 [Thomas+,15 (high-confidence OPE)] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh. “High Confidence Off-Policy Evaluation.” AAAI, 2015. https://people.cs.umass.edu/~pthomas/papers/Thomas2015.pdf [Thomas+,15 (high-confidence OPE)] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh. “High Confidence Policy Improvement.” ICML, 2015. https://people.cs.umass.edu/~pthomas/papers/Thomas2015b.pdf [Voloshin+,21 (MSE)] Cameron Voloshin, Hoang M. Le, Nan Jiang, Yisong Yue. “Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning.” NeurIPS datasets&benchmarks, 2021. https://arxiv.org/abs/1911.06854 May 2024 SCOPE-RL package description 106
  101. References (9/9) [Fu+,21 (RankCorr)] Justin Fu, Mohammad Norouzi, Ofir Nachum,

    George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Tom Le Paine. “Benchmarks for Deep Off- Policy Evaluation.” ICLR, 2021. https://arxiv.org/abs/2103.16596 [Doroudi+,18 (Regret)] Shayan Doroudi, Philip S. Thomas, Emma Brunskill. “Importance Sampling for Fair Policy Selection.” IJCAI, 2018. https://people.cs.umass.edu/~pthomas/papers/Daroudi2017.pdf [Kiyohara+,23 (SharpeRatio@k)] Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. “Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning.” 2023. May 2024 SCOPE-RL package description 107