Evaluating the Robustness of Off-Policy Evaluation

Haruka Kiyohara
September 27, 2021

Slides for the oral presentation at RecSys 2021.
paper: https://dl.acm.org/doi/10.1145/3460231.3474245

Transcript

  1. Evaluating the Robustness of Off-Policy Evaluation @ RecSys2021, September 2021

    Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno. Haruka Kiyohara, Tokyo Institute of Technology. https://sites.google.com/view/harukakiyohara
  2. Machine decision making in recommenders

    A policy (e.g., contextual bandit) makes decisions to recommend items, with the goal of maximizing the (expected) reward. [Diagram: a coming user, an item, a reward (e.g., click).]
  3. The system also produces logged data

    Logged bandit feedback is collected by a behavior policy $\pi_b$: a coming user (context $x$), an item (action $a$), and a reward such as a click (reward $r$). Motivation: we want to evaluate future policies using the logged data.
  4. Outline

    • Off-Policy Evaluation (OPE)
    • Emerging challenge: Selection of OPE estimators
    • Our goal: Evaluating the Robustness of Off-Policy Evaluation
  5. Off-Policy Evaluation (OPE)

    In OPE, we aim to evaluate the performance of a new evaluation policy $\pi_e$ using logged bandit feedback collected by the behavior policy $\pi_b$. [Formula omitted in transcript. Its annotations: the target is the expected reward obtained by running $\pi_e$ on the real system; the OPE estimator $\hat{V}$ has hyperparameters; the gap between $\pi_b$ and $\pi_e$ is a distribution shift.]
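The formula on this slide did not survive transcription. One common way to write the OPE objective in the contextual bandit setting, consistent with the surviving annotations, is sketched below. The notation ($\mathcal{D}_b$ for the logged data, $\theta$ for the estimator's hyperparameters, $n$ for the number of logged rounds) is an assumption and is not copied from the slide.

```latex
V(\pi_e) := \mathbb{E}_{p(x)\,\pi_e(a \mid x)\,p(r \mid x, a)}[r]
\;\approx\; \hat{V}(\pi_e; \mathcal{D}_b, \theta),
\qquad
\mathcal{D}_b = \{(x_i, a_i, r_i)\}_{i=1}^{n} \sim p(x)\,\pi_b(a \mid x)\,p(r \mid x, a)
```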
  6. Off-Policy Evaluation (OPE)

    In OPE, we aim to evaluate the performance of a new evaluation policy $\pi_e$ using logged bandit feedback collected by the behavior policy $\pi_b$. An accurate OPE is beneficial because it:
    • avoids deploying poor policies, without requiring A/B tests
    • identifies promising new policies among many candidates
    Growing interest in OPE!
  7. Direct Method (DM)

    DM estimates the mean reward function. Large bias*, small variance. *Due to inaccuracy of $\hat{q}$. Hyperparameter: $\hat{q}$. [Formula omitted in transcript; its annotation marks the empirical average over the logged data.]
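The DM formula itself is missing from the transcript. In its standard form, DM averages $\sum_a \pi_e(a \mid x_i)\,\hat{q}(x_i, a)$ over the logged contexts. The sketch below assumes that form; the function name, argument layout, and toy numbers are made up for illustration.

```python
def direct_method(pi_e_probs, q_hat):
    """Average over rounds of sum_a pi_e(a | x_i) * q_hat(x_i, a)."""
    per_round = [
        sum(p * q for p, q in zip(probs, qs))
        for probs, qs in zip(pi_e_probs, q_hat)
    ]
    return sum(per_round) / len(per_round)


# Toy data (hypothetical): two logged contexts, two actions.
pi_e_probs = [[0.2, 0.8], [0.5, 0.5]]  # pi_e(a | x_i) for each action
q_hat = [[0.1, 0.6], [0.4, 0.2]]       # estimated mean reward q_hat(x_i, a)

dm_value = direct_method(pi_e_probs, q_hat)
print(dm_value)
```

DM never looks at the logged rewards directly, so any error in $\hat{q}$ carries straight into the estimate. That is the source of the bias noted on the slide.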
  8. Inverse Probability Weighting (IPW) [Strehl+, 2010]

    IPW mitigates the distribution shift between $\pi_b$ and $\pi_e$ using importance sampling. Unbiased*, but large variance. *When $\pi_b$ is known or accurately estimated. Hyperparameter: $\hat{\pi}_b$ (when $\pi_b$ is unknown). [Formula omitted in transcript.]
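The IPW formula is not preserved in the transcript. The standard estimator reweights each logged reward by the importance weight $w_i = \pi_e(a_i \mid x_i) / \pi_b(a_i \mid x_i)$ and averages. A minimal sketch under that assumption, with hypothetical names and toy values:

```python
def ipw(pi_e_logged, pi_b_logged, rewards):
    """Mean of w_i * r_i with w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i)."""
    weights = [pe / pb for pe, pb in zip(pi_e_logged, pi_b_logged)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Toy logged data (hypothetical): probabilities of the logged actions.
pi_e_logged = [0.1, 0.4, 0.8, 0.3]  # under the evaluation policy
pi_b_logged = [0.2, 0.2, 0.2, 0.2]  # under the behavior policy
rewards = [1.0, 0.0, 1.0, 0.0]

ipw_value = ipw(pi_e_logged, pi_b_logged, rewards)
print(ipw_value)
```

With binary rewards the estimate here exceeds 1, which no policy value can. A single large weight (4 in the third round) is enough to cause this, and it illustrates the variance problem.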
  9. Doubly Robust (DR) [Dudík+, 2014]

    DR tackles the variance of IPW by using the estimate $\hat{q}$ as a baseline and applying importance weighting only to its residual. Unbiased* and lower variance than IPW. *When $\pi_b$ is known or accurately estimated. Hyperparameters: $\hat{\pi}_b$ (when $\pi_b$ is unknown) and $\hat{q}$. [Formula omitted in transcript; its annotations mark the baseline term and the importance weighting on the residual.]
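The two annotated parts of DR, a baseline plus importance weighting on the residual, can be sketched as follows. This assumes the standard DR form; `q_hat_pi_e[i]` stands for $\sum_a \pi_e(a \mid x_i)\,\hat{q}(x_i, a)$, and all names and numbers are illustrative.

```python
def doubly_robust(weights, rewards, q_hat_logged, q_hat_pi_e):
    """Mean of baseline + w_i * (r_i - q_hat(x_i, a_i))."""
    terms = [
        q_pi + w * (r - q_a)
        for w, r, q_a, q_pi in zip(weights, rewards, q_hat_logged, q_hat_pi_e)
    ]
    return sum(terms) / len(terms)


# Toy data (hypothetical).
weights = [0.5, 2.0, 4.0, 1.5]       # importance weights pi_e / pi_b
rewards = [1.0, 0.0, 1.0, 0.0]
q_hat_logged = [0.5, 0.5, 0.5, 0.5]  # q_hat(x_i, a_i) for the logged action
q_hat_pi_e = [0.4, 0.6, 0.5, 0.5]    # expected q_hat under pi_e

dr_value = doubly_robust(weights, rewards, q_hat_logged, q_hat_pi_e)
print(dr_value)
```

Setting $\hat{q}$ to zero everywhere recovers IPW, which is one way to see DR as IPW with a control variate.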
  10. Pessimistic Shrinkage (IPWps, DRps) [Su+, 2020]

    IPWps and DRps further reduce the variance by clipping large importance weights. Lower variance than IPW / DR. Hyperparameters: $\hat{\pi}_b$ (when $\pi_b$ is unknown), $\hat{q}$ (DRps only), and $\lambda$. [Formula omitted in transcript; its annotation marks the clipped importance weight.]
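Clipping can be sketched by replacing each weight $w_i$ with $\min(w_i, \lambda)$ before applying IPW or DR. The code assumes that form of pessimistic shrinkage; names and data are hypothetical.

```python
def clip_weights(weights, lam):
    """Pessimistic shrinkage: cap every importance weight at lam."""
    return [min(w, lam) for w in weights]


def ipw_ps(weights, rewards, lam):
    clipped = clip_weights(weights, lam)
    return sum(w * r for w, r in zip(clipped, rewards)) / len(rewards)


def dr_ps(weights, rewards, q_hat_logged, q_hat_pi_e, lam):
    clipped = clip_weights(weights, lam)
    terms = [
        q_pi + w * (r - q_a)
        for w, r, q_a, q_pi in zip(clipped, rewards, q_hat_logged, q_hat_pi_e)
    ]
    return sum(terms) / len(terms)


# Toy data (hypothetical).
weights = [0.5, 2.0, 4.0, 1.5]
rewards = [1.0, 0.0, 1.0, 0.0]
q_hat_logged = [0.5, 0.5, 0.5, 0.5]
q_hat_pi_e = [0.4, 0.6, 0.5, 0.5]

ipw_ps_value = ipw_ps(weights, rewards, lam=2.0)
dr_ps_value = dr_ps(weights, rewards, q_hat_logged, q_hat_pi_e, lam=2.0)
print(ipw_ps_value, dr_ps_value)
```

Only the weight of 4 is affected at $\lambda = 2$. Clipping trades some bias for lower variance, so the result depends on how $\lambda$ is chosen.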
  11. Self-Normalization (SNIPW, SNDR) [Swaminathan & Joachims, 2015]

    SNIPW and SNDR address the variance issue of IPW and DR by using self-normalized importance weights. Consistent* and lower variance than IPW / DR. *When $\pi_b$ is known or accurately estimated. Hyperparameters: $\hat{\pi}_b$ (when $\pi_b$ is unknown) and $\hat{q}$ (SNDR only). [Formula omitted in transcript; its annotation marks the self-normalization.]
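Self-normalization divides by the sum of the importance weights instead of the number of rounds. The sketch below assumes the usual SNIPW form and applies the same normalization to the residual term for SNDR; names and toy values are made up.

```python
def snipw(weights, rewards):
    """Weighted average of rewards: sum(w_i * r_i) / sum(w_i)."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def sndr(weights, rewards, q_hat_logged, q_hat_pi_e):
    """Baseline mean plus the self-normalized weighted residual."""
    baseline = sum(q_hat_pi_e) / len(q_hat_pi_e)
    residual = sum(
        w * (r - q_a) for w, r, q_a in zip(weights, rewards, q_hat_logged)
    ) / sum(weights)
    return baseline + residual


# Toy data (hypothetical).
weights = [0.5, 2.0, 4.0, 1.5]
rewards = [1.0, 0.0, 1.0, 0.0]
q_hat_logged = [0.6, 0.4, 0.5, 0.5]
q_hat_pi_e = [0.4, 0.6, 0.5, 0.5]

snipw_value = snipw(weights, rewards)
sndr_value = sndr(weights, rewards, q_hat_logged, q_hat_pi_e)
print(snipw_value, sndr_value)
```

Because SNIPW is a weighted average, it always stays within the range of the observed rewards, and it needs no extra hyperparameter beyond the weights themselves.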
  12. Switch-DR [Wang+, 2017]

    Switch-DR interpolates between DM and DR ($\tau \to 0$ gives DM, $\tau \to \infty$ gives DR) by using importance weighting only when the weight is small. Lower variance than DR. Hyperparameters: $\hat{\pi}_b$ (when $\pi_b$ is unknown), $\hat{q}$, and $\tau$. [Formula omitted in transcript.]
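The switching rule can be sketched by keeping the DR correction only in rounds where $w_i \le \tau$ and falling back to the baseline otherwise. This is one common way to write Switch-DR and is assumed here rather than taken from the slide; names and data are hypothetical.

```python
def switch_dr(weights, rewards, q_hat_logged, q_hat_pi_e, tau):
    """DR correction is applied only where the importance weight is <= tau."""
    terms = [
        q_pi + (w * (r - q_a) if w <= tau else 0.0)
        for w, r, q_a, q_pi in zip(weights, rewards, q_hat_logged, q_hat_pi_e)
    ]
    return sum(terms) / len(terms)


# Toy data (hypothetical).
weights = [0.5, 2.0, 4.0, 1.5]
rewards = [1.0, 0.0, 1.0, 0.0]
q_hat_logged = [0.5, 0.5, 0.5, 0.5]
q_hat_pi_e = [0.4, 0.6, 0.5, 0.5]

args = (weights, rewards, q_hat_logged, q_hat_pi_e)
as_dm = switch_dr(*args, tau=0.0)           # no round passes the switch
as_dr = switch_dr(*args, tau=float("inf"))  # every round passes the switch
in_between = switch_dr(*args, tau=2.0)      # drops only the weight of 4
print(as_dm, in_between, as_dr)
```

The two extreme settings reproduce the DM and DR values for this toy data (0.5 and 0.625), matching the limits stated on the slide.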
  13. DR with Optimistic Shrinkage (DRos) [Su+, 2020]

    DRos uses a new weight function to bridge DM and DR ($\lambda \to 0$ gives DM, $\lambda \to \infty$ gives DR). The weight function is chosen to minimize a sharp bound on the mean-squared error. Hyperparameters: $\hat{\pi}_b$ (when $\pi_b$ is unknown), $\hat{q}$, and $\lambda$. [Formula omitted in transcript.]
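The weight function is not preserved in the transcript. A form often given for optimistic shrinkage is $\hat{w} = \lambda w / (w^2 + \lambda)$, which has the two limits stated on the slide. The sketch below assumes that form; names and data are hypothetical.

```python
def shrink_weight(w, lam):
    """Assumed optimistic-shrinkage weight: lam * w / (w**2 + lam)."""
    # Requires w > 0 or lam > 0 to avoid division by zero.
    return lam * w / (w * w + lam)


def dr_os(weights, rewards, q_hat_logged, q_hat_pi_e, lam):
    terms = [
        q_pi + shrink_weight(w, lam) * (r - q_a)
        for w, r, q_a, q_pi in zip(weights, rewards, q_hat_logged, q_hat_pi_e)
    ]
    return sum(terms) / len(terms)


# Toy data (hypothetical).
weights = [0.5, 2.0, 4.0, 1.5]
rewards = [1.0, 0.0, 1.0, 0.0]
q_hat_logged = [0.5, 0.5, 0.5, 0.5]
q_hat_pi_e = [0.4, 0.6, 0.5, 0.5]

args = (weights, rewards, q_hat_logged, q_hat_pi_e)
near_dm = dr_os(*args, lam=0.0)   # all shrunk weights are 0
near_dr = dr_os(*args, lam=1e12)  # shrunk weights are almost the raw weights
print(near_dm, near_dr)
```

Unlike hard clipping, this weight varies smoothly with $w$, so large weights are damped rather than cut off at a threshold.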
  14. Many estimators with different hyperparameters

    Estimator Selection: Which OPE estimator (and hyperparameters) should be used in practice?
  15. What properties are desirable in practice?

    • An estimator that works without significant hyperparameter tuning, because good hyperparameters may depend on the logged data and the evaluation policy, and tuning them risks overfitting.
    • An estimator that is stably accurate across various evaluation policies, because we need to evaluate many candidate policies to choose from.
    • An estimator that shows acceptable errors in the worst case, because the uncertainty of estimation is of great interest.
    We want to evaluate the estimators’ robustness to possible changes in configurations such as hyperparameters and evaluation policies!
  16. Is conventional evaluation sufficient?

    Conventional OPE experiments compare mean-squared error to evaluate the performance (estimation accuracy) of OPE estimators. Pitfall: they evaluate only on a single set of configurations, so they fail to assess the estimators’ robustness to configuration changes (such as hyperparameters $\theta$ and evaluation policy $\pi_e$).
  17. Towards more informative evaluation for practice

    To tackle the issues in the conventional experimental procedure, we propose Interpretable Evaluation for Offline Evaluation (IEOE), which can:
    ✓ evaluate the estimators’ robustness to possible configuration changes
    ✓ provide a visual interpretation of the distribution of estimation errors
    ✓ be easily implemented using our open-source Python software, pyIEOE
  18. Interpretable Evaluation for Offline Evaluation (IEOE)

    ① set configuration spaces (hyperparameters $\theta$ and evaluation policies $\pi_e$)
    ② for each random seed $s$, sample configurations
    ③ calculate the estimators’ squared error (SE) on the sampled configurations
    ④ obtain an error distribution
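The four steps can be sketched as a small loop. This is a toy illustration only and not the pyIEOE API: the function `ieoe_squared_errors`, the stand-in estimator `noisy_estimate`, and the configuration values are all hypothetical.

```python
import random


def ieoe_squared_errors(estimate, true_value, hyperparam_space, policy_space, n_seeds):
    """Collect one squared error per random seed (steps 2 to 4)."""
    squared_errors = []
    for seed in range(n_seeds):
        rng = random.Random(seed)
        theta = rng.choice(hyperparam_space)  # step 2: sample a hyperparameter
        pi_e = rng.choice(policy_space)       # step 2: sample an evaluation policy
        error = estimate(pi_e, theta, rng) - true_value(pi_e)
        squared_errors.append(error ** 2)     # step 3: squared error
    return sorted(squared_errors)             # step 4: empirical error distribution


# Step 1: configuration spaces (toy values, purely illustrative).
hyperparam_space = [0.01, 0.05, 0.1]  # here: noise scale of a fake estimator
policy_space = [0.1, 0.5, 0.9]        # here: probability of recommending item 1


def true_value(pi_e):
    # Toy two-item problem: item 0 has mean reward 0.2, item 1 has 0.6.
    return 0.2 * (1 - pi_e) + 0.6 * pi_e


def noisy_estimate(pi_e, theta, rng):
    # Stand-in for an OPE estimator: ground truth plus Gaussian noise.
    return true_value(pi_e) + rng.gauss(0.0, theta)


errors = ieoe_squared_errors(
    noisy_estimate, true_value, hyperparam_space, policy_space, n_seeds=20
)
print(errors[:3], errors[-3:])
```

Seeding a fresh generator per trial keeps each sampled configuration reproducible, so the same error distribution is obtained on every run.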
  19. Visual comparison of OPE estimators

    The next step is to approximate the cumulative distribution function (CDF) of the squared errors (SEs). The resulting curves show how robust each estimator is across the given configurations.
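One simple way to approximate the CDF is the empirical CDF: for a level $z$, the fraction of trials whose squared error is at most $z$. A minimal sketch with made-up squared errors follows; the function name is hypothetical.

```python
def empirical_cdf(squared_errors, z):
    """Fraction of trials whose squared error is at most z."""
    return sum(1 for se in squared_errors if se <= z) / len(squared_errors)


# Toy squared errors from four trials (hypothetical).
squared_errors = [0.1, 0.2, 0.4, 1.0]

# Points of the step curve, one per observed error level.
cdf_points = [(z, empirical_cdf(squared_errors, z)) for z in sorted(squared_errors)]
print(cdf_points)
```

Plotting such points for several estimators on shared axes gives the visual comparison: a curve that rises faster belongs to an estimator whose errors are small in more of the sampled configurations.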
  20. Quantitative performance measure

    Based on the CDF, we can define summary scores that are useful for quantitative performance comparisons.
    • Area under the CDF curve (AU-CDF) compares the estimators’ squared errors below a given threshold.
    • Conditional value-at-risk (CVaR) compares the expected values of the estimators’ squared errors in the worst $\alpha \times 100\%$ of trials.
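Both scores can be computed directly from the sampled squared errors. The sketch below assumes AU-CDF is the integral of the empirical CDF from 0 to the threshold and follows the slide's wording for CVaR (mean over the worst fraction of trials). The function names and the `worst_frac` parameter are hypothetical.

```python
import math


def au_cdf(squared_errors, threshold):
    """Area under the empirical CDF on [0, threshold] (errors must be >= 0)."""
    # Each trial contributes the length of [se, threshold], if non-empty.
    return sum(max(threshold - se, 0.0) for se in squared_errors) / len(squared_errors)


def cvar(squared_errors, worst_frac):
    """Mean squared error over the worst `worst_frac` share of trials."""
    k = max(1, math.ceil(worst_frac * len(squared_errors)))
    worst = sorted(squared_errors, reverse=True)[:k]
    return sum(worst) / k


# Toy squared errors from four trials (hypothetical).
squared_errors = [0.1, 0.2, 0.4, 1.0]

au_cdf_value = au_cdf(squared_errors, threshold=0.5)
cvar_value = cvar(squared_errors, worst_frac=0.5)
print(au_cdf_value, cvar_value)
```

A larger AU-CDF means more of the errors sit well below the threshold, while a smaller CVaR means milder worst-case errors. The two scores therefore point in opposite directions when ranking estimators.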
  21. Experiments in a real-world application

    • We applied IEOE to estimator selection on a real e-commerce platform.
    • The result demonstrates that SNIPW is stable across various configurations.
    The platform now uses SNIPW based on our analysis! (The conclusion may change when we consider different applications.)
    *The values are normalized by that of SNIPW.
  22. Take-Home Message

    • OPE is beneficial for identifying “good” counterfactual policies.
    • In practice, we want to use OPE estimators that are robust to configuration changes.
    • However, the conventional experimental procedure fails to provide informative results.
    • IEOE evaluates estimators’ robustness and conveys results in an interpretable way.
    IEOE can help practitioners choose a reliable OPE estimator!
  23. Thank you for listening!

    Find out more (e.g., quantitative metrics and experiments) in the full paper! Contact: kiyohara.h.aa@m.titech.ac.jp
  24. References

    [Strehl+, 2010] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. Learning from Logged Implicit Exploration Data. NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Dudík+, 2014] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 2014. https://arxiv.org/abs/1503.02834
    [Swaminathan & Joachims, 2015] Adith Swaminathan and Thorsten Joachims. The Self-Normalized Estimator for Counterfactual Learning. NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
    [Wang+, 2017] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. ICML, 2017. https://arxiv.org/abs/1612.01205
    [Su+, 2020] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust Off-policy Evaluation with Shrinkage. ICML, 2020. https://arxiv.org/abs/1907.09623
    [Narita+, 2021] Yusuke Narita, Shota Yasui, and Kohei Yata. Debiased Off-Policy Evaluation for Recommendation Systems. RecSys, 2021. https://arxiv.org/abs/2002.08536