Slide 1

Slide 1 text

Evaluating the Robustness of Off-Policy Evaluation
Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno
Presenter: Haruka Kiyohara, Tokyo Institute of Technology
https://sites.google.com/view/harukakiyohara
September 2021

Slide 2

Slide 2 text

Machine decision making in recommenders
A policy (e.g., a contextual bandit) makes decisions to recommend items, with the goal of maximizing the (expected) reward.
(Figure: a coming user → an item → a reward (e.g., click))

Slide 3

Slide 3 text

The system also produces logged data
(Figure: a coming user (context x) → an item (action a) → a reward, e.g., a click (reward r); this logged bandit feedback is collected by a behavior policy π_b)
Motivation: We want to evaluate future policies using the logged data.

Slide 4

Slide 4 text

Outline
• Off-Policy Evaluation (OPE)
• Emerging challenge: Selection of OPE estimators
• Our goal: Evaluating the Robustness of Off-Policy Evaluation

Slide 5

Slide 5 text

Off-Policy Evaluation (OPE)
In OPE, we aim to evaluate the performance of a new evaluation policy π_e using logged bandit feedback collected by the behavior policy π_b.
(Equation: an OPE estimator V̂, with its own hyperparameters, should approximate the expected reward obtained by running π_e on the real system, despite the distribution shift between π_b and π_e.)
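To make the estimation target concrete, here is a minimal sketch in standard OPE notation; the symbols D (logged data), θ (estimator hyperparameters), and the expectation form are assumptions based on the literature, not shown on the slide:

```latex
% Policy value: the expected reward obtained by running \pi_e on the real system
V(\pi_e) = \mathbb{E}_{p(x)\,\pi_e(a \mid x)\,p(r \mid x, a)}[r]

% An OPE estimator maps logged data D = \{(x_i, a_i, r_i)\}_{i=1}^n collected by \pi_b
% and its hyperparameters \theta to an estimate of the policy value:
\hat{V}(\pi_e; D, \theta) \approx V(\pi_e)
```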

Slide 6

Slide 6 text

Off-Policy Evaluation (OPE)
In OPE, we aim to evaluate the performance of a new evaluation policy π_e using logged bandit feedback collected by the behavior policy π_b.
An accurate OPE is beneficial, because it..
• avoids deploying poor policies without A/B tests
• identifies promising new policies among many candidates
Growing interest in OPE!

Slide 7

Slide 7 text

Direct Method (DM)
DM estimates the mean reward function with a model q̂ and averages the model's predictions under π_e.
Large bias*, small variance. (*due to the inaccuracy of q̂)
Hyperparameter: q̂
(Equation: the DM estimate is the empirical average, over the logged contexts, of the predicted reward q̂ under π_e.)
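The DM formula itself is missing from the transcript; the following is a standard form from the OPE literature, assuming E_n denotes the empirical average over the n logged samples (matching the slide's ":empirical average" note):

```latex
\hat{V}_{\mathrm{DM}}(\pi_e; D, \hat{q})
  = \mathbb{E}_n\Big[ \sum_{a \in \mathcal{A}} \pi_e(a \mid x_i)\, \hat{q}(x_i, a) \Big],
\qquad \mathbb{E}_n[f(x_i)] := \frac{1}{n} \sum_{i=1}^{n} f(x_i)
```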

Slide 8

Slide 8 text

Inverse Probability Weighting (IPW) [Strehl+, 2010]
IPW mitigates the distribution shift between π_b and π_e using importance sampling.
Unbiased*, but large variance. (*when π_b is known or accurately estimated)
Hyperparameter: π̂_b (when π_b is unknown)
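As with DM, the IPW formula is not in the transcript; a standard form, assuming the importance weight w(x, a) defined below (π̂_b is substituted for π_b when the behavior policy is unknown):

```latex
\hat{V}_{\mathrm{IPW}}(\pi_e; D)
  = \mathbb{E}_n\big[ w(x_i, a_i)\, r_i \big],
\qquad w(x, a) := \frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}
```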

Slide 9

Slide 9 text

Doubly Robust (DR) [Dudík+, 2014]
DR tackles the variance of IPW by leveraging a baseline estimate q̂ and performing importance weighting only on its residual.
Unbiased* and lower variance than IPW. (*when π_b is known or accurately estimated)
Hyperparameter: π̂_b (when π_b is unknown) + q̂
(Equation: a baseline term plus importance weighting on the residual.)
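A standard form of the DR estimator, matching the slide's "baseline" and "importance weighting on the residual" labels (with E_n and w as in the DM/IPW sketches above):

```latex
\hat{V}_{\mathrm{DR}}(\pi_e; D, \hat{q})
  = \mathbb{E}_n\Big[ \underbrace{\textstyle\sum_{a} \pi_e(a \mid x_i)\, \hat{q}(x_i, a)}_{\text{baseline}}
    + \underbrace{w(x_i, a_i)\,\big( r_i - \hat{q}(x_i, a_i) \big)}_{\text{weighting on the residual}} \Big]
```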

Slide 10

Slide 10 text

Pessimistic Shrinkage (IPWps, DRps) [Su+, 2020]
IPWps and DRps further reduce the variance by clipping large importance weights.
Lower variance than IPW / DR.
Hyperparameter: π̂_b (when π_b is unknown) (, q̂) + λ
(Equation: the importance weight is clipped at λ.)
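The clipping itself can be written as follows (a sketch of the standard pessimistic-shrinkage weight, which replaces w in IPW/DR):

```latex
w_{\mathrm{clip}}(x, a; \lambda) := \min\{\, w(x, a),\ \lambda \,\}
```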

Slide 11

Slide 11 text

Self-Normalization (SNIPW, SNDR) [Swaminathan & Joachims, 2015]
SNIPW and SNDR address the variance issue of IPW and DR by self-normalizing the importance weights.
Consistent* and lower variance than IPW / DR. (*when π_b is known or accurately estimated)
Hyperparameter: π̂_b (when π_b is unknown) (, q̂)
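A standard form of the SNIPW estimator (SNDR applies the same normalization to the residual term of DR):

```latex
\hat{V}_{\mathrm{SNIPW}}(\pi_e; D)
  = \frac{\sum_{i=1}^{n} w(x_i, a_i)\, r_i}{\sum_{i=1}^{n} w(x_i, a_i)}
```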

Slide 12

Slide 12 text

Switch-DR [Wang+, 2017]
Switch-DR interpolates between DM and DR (τ → 0 recovers DM, τ → ∞ recovers DR) by using importance weighting only when the weight is small.
Lower variance than DR.
Hyperparameter: π̂_b (when π_b is unknown), q̂ + τ
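A standard form of Switch-DR, which applies the residual correction only when the importance weight does not exceed the threshold τ:

```latex
\hat{V}_{\mathrm{SwitchDR}}(\pi_e; D, \hat{q}, \tau)
  = \mathbb{E}_n\Big[ \sum_{a} \pi_e(a \mid x_i)\, \hat{q}(x_i, a)
    + w(x_i, a_i)\,\big( r_i - \hat{q}(x_i, a_i) \big)\, \mathbb{1}\{ w(x_i, a_i) \le \tau \} \Big]
```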

Slide 13

Slide 13 text

DR with Optimistic Shrinkage (DRos) [Su+, 2020]
DRos uses a new weight function that bridges DM and DR (λ → 0 recovers DM, λ → ∞ recovers DR), derived by minimizing a sharp bound on the mean squared error.
Hyperparameter: π̂_b (when π_b is unknown), q̂ + λ
(Equation: a weight function chosen to minimize error bounds.)
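The shrunk weight used by DRos, in a standard form from [Su+, 2020]; it replaces w in the residual term of DR:

```latex
w_{o}(x, a; \lambda) := \frac{\lambda}{w(x, a)^{2} + \lambda}\, w(x, a)
% \lambda \to 0 gives w_o \to 0 (recovers DM); \lambda \to \infty gives w_o \to w (recovers DR)
```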

Slide 14

Slide 14 text

Many estimators with different hyperparameters
Estimator selection: which OPE estimator (and which hyperparameters) should be used in practice?

Slide 15

Slide 15 text

What properties are desirable in practice?
• An estimator that works without significant hyperparameter tuning
  .. because hyperparameters may depend on the logged data and evaluation policy, which also entails a risk of overfitting.
• An estimator that is stably accurate across various evaluation policies
  .. because we need to evaluate various candidate policies to choose from.
• An estimator that shows acceptable errors in the worst case
  .. because the uncertainty of the estimation is of great interest.
We want to evaluate the estimators' robustness to possible changes in configurations, such as hyperparameters and evaluation policies!

Slide 16

Slide 16 text

Is conventional evaluation sufficient?
Conventional OPE experiments compare the mean squared error to evaluate the performance (estimation accuracy) of OPE estimators, evaluating only a single set of configurations.
Pitfall: this fails to evaluate the estimators' robustness to configuration changes (such as the hyperparameters θ and the evaluation policy π_e).
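For reference, the conventional comparison metric can be written as follows (the expectation over logged datasets D is our notation, and the metric is computed under a single fixed configuration (π_e, θ)):

```latex
\mathrm{MSE}(\hat{V}; \pi_e, \theta)
  = \mathbb{E}_{D}\Big[ \big( V(\pi_e) - \hat{V}(\pi_e; D, \theta) \big)^{2} \Big]
```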

Slide 17

Slide 17 text

Towards a more informative evaluation for practice
To tackle the issues in the conventional experimental procedure, we propose Interpretable Evaluation for Offline Evaluation (IEOE), which can..
✓ evaluate the estimators' robustness to possible configuration changes
✓ provide a visual interpretation of the distribution of estimation errors
✓ be easily implemented using our open-source Python software, pyIEOE

Slide 18

Slide 18 text

Interpretable Evaluation for Offline Evaluation (IEOE)
① set the configuration spaces (hyperparameters θ and evaluation policies π_e)
② for each random seed s, sample a configuration
③ calculate the estimators' squared error (SE) on the sampled configuration
④ obtain an error distribution
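A minimal Python sketch of steps ①–④; it does not use the actual pyIEOE API, and the estimator interface, configuration spaces, and ground-truth value function are illustrative assumptions:

```python
import numpy as np

def ieoe_error_distribution(estimators, hyperparam_space, policy_space,
                            logged_data, true_value_fn, n_seeds=100):
    """Collect squared-error distributions of OPE estimators over
    randomly sampled configurations (hyperparameters + evaluation policy)."""
    se = {name: [] for name in estimators}
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        # (2) sample one configuration for this seed
        pi_e = policy_space[rng.integers(len(policy_space))]
        theta = {name: rng.choice(values) for name, values in hyperparam_space.items()}
        v_true = true_value_fn(pi_e)  # ground-truth policy value, known in experiments
        # (3) squared error of each estimator under the sampled configuration
        for name, estimate in estimators.items():
            v_hat = estimate(pi_e, logged_data, **theta)
            se[name].append((v_hat - v_true) ** 2)
    # (4) the collected squared errors form the error distribution per estimator
    return {name: np.asarray(errors) for name, errors in se.items()}
```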

Slide 19

Slide 19 text

Visual comparison of OPE estimators
The next step is to approximate the cumulative distribution function (CDF) of the squared errors.
From the CDF, we can interpret how robust the estimators are across the given configurations.
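One way to draw the empirical CDF from the squared errors collected above; the matplotlib-based plotting is an illustrative choice, not part of the paper's tooling:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_se_cdf(se_by_estimator):
    """Plot the empirical CDF of squared errors for each estimator."""
    for name, se in se_by_estimator.items():
        x = np.sort(se)
        y = np.arange(1, len(x) + 1) / len(x)  # empirical cumulative probabilities
        plt.plot(x, y, label=name)
    plt.xlabel("squared error")
    plt.ylabel("cumulative probability")
    plt.legend()
    plt.show()
```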

Slide 20

Slide 20 text

Quantitative performance measures
Based on the CDF, we can define summary scores that are useful for quantitative performance comparisons.
• Area under the CDF (AU-CDF) compares the estimators' squared errors below a threshold.
• Conditional value-at-risk (CVaR) compares the expected squared error in the worst α × 100% of trials.
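A small sketch of how these two scores can be computed from an array of squared errors; the exact definitions used in the paper may differ in detail (e.g., normalization), so treat this as an assumption-laden illustration:

```python
import numpy as np

def au_cdf(se, z_max):
    """Area under the empirical CDF of squared errors up to the threshold z_max.
    Larger values mean the errors concentrate below the threshold."""
    grid = np.linspace(0.0, z_max, 1000)
    cdf = np.array([(se <= z).mean() for z in grid])
    return np.trapz(cdf, grid)

def cvar(se, alpha=0.3):
    """Mean squared error over the worst alpha * 100% of trials (smaller is better)."""
    threshold = np.quantile(se, 1.0 - alpha)
    return se[se >= threshold].mean()
```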

Slide 21

Slide 21 text

Experiments in a real-world application
• We applied IEOE to estimator selection on a real e-commerce platform.
• The results demonstrate that SNIPW is stable across various configurations.*
The platform now uses SNIPW based on our analysis!
(*The reported values are normalized by those of SNIPW; the conclusion may change for different applications.)

Slide 22

Slide 22 text

Take-Home Message
• OPE is beneficial for identifying "good" counterfactual policies.
• In practice, we want OPE estimators that are robust to configuration changes.
• However, the conventional experimental procedure fails to provide informative results.
• IEOE evaluates the estimators' robustness and conveys the results in an interpretable way.
IEOE can help practitioners choose a reliable OPE estimator!

Slide 23

Slide 23 text

Thank you for listening!
Find out more (e.g., quantitative metrics and experiments) in the full paper!
contact: [email protected]

Slide 24

Slide 24 text

References
[Strehl+, 2010] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. Learning from Logged Implicit Exploration Data. NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Dudík+, 2014] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 2014. https://arxiv.org/abs/1503.02834
[Swaminathan & Joachims, 2015] Adith Swaminathan and Thorsten Joachims. The Self-Normalized Estimator for Counterfactual Learning. NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
[Wang+, 2017] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. ICML, 2017. https://arxiv.org/abs/1612.01205
[Su+, 2020] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust Off-policy Evaluation with Shrinkage. ICML, 2020. https://arxiv.org/abs/1907.09623
[Narita+, 2021] Yusuke Narita, Shota Yasui, and Kohei Yata. Debiased Off-Policy Evaluation for Recommendation Systems. RecSys, 2021. https://arxiv.org/abs/2002.08536