Improving Accuracy of Off-Policy Evaluation via Policy Adaptive Estimator Selection

Haruka Kiyohara
September 22, 2022

CONSEQUENCES+REVEAL WS @ RecSys 2022 (Day 2, CONSEQUENCES)
About the workshop: https://sites.google.com/view/consequences2022
Transcript

  1. Improving Accuracy of Off-Policy Evaluation via Policy Adaptive Estimator Selection

    Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Kei Tateno
    Presented by Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara
  2. Off-Policy Evaluation: Motivation towards Estimator Selection
  3. Interactions in recommender systems: A behavior policy interacts with users and collects logged data. [Figure: a coming user (context) → an item (action) → user feedback (reward).]
  4. Interactions in recommender systems: A behavior policy interacts with users and collects logged data. [Figure: the behavior policy 𝝅𝒃 repeats context → action → reward interactions and stores them as logged bandit feedback.]
  5. Off-Policy Evaluation: The goal is to evaluate the performance of an evaluation policy 𝜋𝑒, i.e., an "offline A/B test". [Figure: logged bandit feedback collected by the behavior policy 𝝅𝒃 → OPE estimator → estimated policy performance.]
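For reference, this is the standard contextual-bandit formulation behind these slides (conventional notation; the deck shows it only in figures):

```latex
% Logged bandit feedback collected by the behavior policy \pi_b:
%   D = \{ (x_i, a_i, r_i) \}_{i=1}^n, \quad
%   x_i \sim p(x),\; a_i \sim \pi_b(a \mid x_i),\; r_i \sim p(r \mid x_i, a_i)
% The estimand is the value of the evaluation policy \pi_e:
V(\pi_e)
  = \mathbb{E}_{p(x)\,\pi_e(a \mid x)\,p(r \mid x, a)}[\, r \,]
  = \mathbb{E}_{p(x)}\!\Big[ \textstyle\sum_{a} \pi_e(a \mid x)\, q(x, a) \Big],
  \qquad q(x, a) := \mathbb{E}[\, r \mid x, a \,].
```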
  6. Representative OPE estimators: We aim to reduce both bias and variance to enable an accurate OPE.

    | Estimator | Model-based (reward predictor) | Importance sampling-based (importance weight) | Bias | Variance |
    | Direct Method (DM) [Beygelzimer&Langford,09] | ✓ | --- | high | low |
    | Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10] | --- | ✓ | unbiased | very high |
    | Doubly Robust (DR) [Dudík+,14] | ✓ | ✓ | unbiased | lower than IPS, but still high |

    * The definition of each estimator is in the Appendix.
  10. Advanced OPE estimators: To reduce the variance of IPS/DR, we make the importance weights smaller. Modifications on importance weights:

    - Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
    - Clipped (IPS/DR) * [Su+,20a]
    - Switch (DR) * [Wang+,17]
    - Optimistic Shrinkage (DR) * [Su+,20a]
    - Subgaussian (IPS/DR) * [Metelli+,21]

    * requires hyperparameter tuning of 𝜆, e.g., via SLOPE [Su+,20b] [Tucker&Lee,21]
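As a minimal sketch (not the authors' code) of how such modifications change the raw importance weights, the snippet below implements plain IPS, weight clipping, and self-normalization; the threshold `lam` plays the role of the hyperparameter 𝜆 mentioned above.

```python
import numpy as np

def ips_estimate(pi_e, pi_b, rewards, weight_fn=None):
    """IPS-style estimate with an optional modification of the importance weights.

    pi_e, pi_b: probabilities of the logged actions under the evaluation / behavior
    policy; rewards: observed rewards. All arguments are 1-D arrays of length n.
    """
    w = pi_e / pi_b                      # raw importance weights
    if weight_fn is not None:
        w = weight_fn(w)                 # e.g., clipping or shrinkage
    return np.mean(w * rewards)

def clipped(w, lam=10.0):
    """Clipped weights: cap each weight at lam (lower variance, some added bias)."""
    return np.minimum(w, lam)

def self_normalized_ips(pi_e, pi_b, rewards):
    """Self-normalized IPS: normalize by the sum of weights instead of n."""
    w = pi_e / pi_b
    return np.sum(w * rewards) / np.sum(w)
```

For example, `ips_estimate(pi_e, pi_b, rewards, weight_fn=clipped)` gives a clipped-IPS estimate.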
  11. Advanced OPE estimators (cont.): Given the list above, which OPE estimator should be used to enable an accurate OPE?
  12. Motivation towards data-driven estimator selection. [Figure: MSE of each OPE estimator when evaluating a policy 𝜋𝑒 with data logged by 𝜋𝑏.]
  13. Estimator Selection is important!
  14. ...but the best estimator can be different under different situations.
  15. [Figure: an estimator that is among the best in one situation...]
  16. [...can be among the worst in another.]
  17. The data size, the evaluation policy, and the reward noise all matter.
  18. Estimator Selection: How do we identify the most accurate OPE estimator using only the available logged data?
  19. Estimator Selection for OPE
  20. Objective for estimator selection: The goal is to identify the most accurate OPE estimator in terms of MSE.
  21. The MSE is measured against the true policy value (the estimand).
  22. In practice, the MSE itself has to be estimated from the logged data.
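Written out explicitly (conventional notation; the slides show the formula only as an image), the selection problem is:

```latex
% Accuracy of a candidate OPE estimator \hat{V}_m is measured by its MSE
% against the true policy value V(\pi_e), which is the (unknown) estimand:
\mathrm{MSE}\big(\hat{V}_m\big)
  = \mathbb{E}_{D}\Big[ \big( V(\pi_e) - \hat{V}_m(\pi_e; D) \big)^2 \Big]
% Estimator selection picks the candidate with the smallest *estimated* MSE,
% since V(\pi_e) is unknown and the MSE must be estimated from the logged data:
\hat{m} = \operatorname*{arg\,min}_{m \in \mathcal{M}} \; \widehat{\mathrm{MSE}}\big(\hat{V}_m; D\big)
```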
  23. Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]: Suppose we have logged data from previous A/B tests.
  24. One of the past policies is treated as a pseudo-evaluation policy.
  25. Each estimator's OPE estimate, computed on bootstrapped logged data, is compared against the pseudo-evaluation policy's on-policy policy value to obtain an estimated MSE. (※ 𝑆 is a set of random states for bootstrapping.)
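A minimal sketch of this bootstrap-based MSE estimate (an interpretation of the procedure on the slide, not the authors' code); `behavior_logs` and `eval_policy_rewards` are hypothetical names for the two parts of the past A/B-test data.

```python
import numpy as np

def estimate_mse_non_adaptive(estimator, behavior_logs, eval_policy_rewards,
                              n_bootstrap=100, seed=0):
    """Non-adaptive heuristic: estimate an OPE estimator's MSE from past A/B-test logs.

    estimator: callable mapping a (bootstrapped) log, given as a dict of arrays,
               to a scalar policy-value estimate.
    behavior_logs: logs collected by the pseudo-behavior policy.
    eval_policy_rewards: rewards observed under the pseudo-evaluation policy;
                         their mean serves as the on-policy "ground-truth" value.
    """
    rng = np.random.default_rng(seed)
    on_policy_value = np.mean(eval_policy_rewards)   # ground truth of the pseudo task
    n = len(next(iter(behavior_logs.values())))
    squared_errors = []
    for _ in range(n_bootstrap):                     # S: set of random states
        idx = rng.integers(0, n, size=n)             # bootstrap resample
        boot = {key: value[idx] for key, value in behavior_logs.items()}
        squared_errors.append((estimator(boot) - on_policy_value) ** 2)
    return np.mean(squared_errors)
```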
  26. Does the non-adaptive heuristic work? Do the estimators it selects really work well? [Figure: estimated MSE of each estimator under the non-adaptive heuristic (evaluation policy parameter = -2).]
  27. [Figure: the estimators' true performance when evaluating 𝜋𝐴 with data logged by 𝜋𝑏, shown for comparison.]
  29. How can we be adaptive to the given OPE task (e.g., to the evaluation policy)?
  30. PAS-IF: Policy Adaptive Estimator Selection via Importance Fitting
  31. Key idea of PAS-IF: How about subsampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to that of the given OPE task?
  32. (As in the baseline, 𝑆 is a set of random states for bootstrapping.)
  33. Solving the optimization of the subsampling rule: PAS-IF optimizes the subsampling function to replicate the original distribution. [Formula: subsampling function.]
  34. [Formula: objective of importance fitting.]
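The formulas appear only as images in the deck; as a rough paraphrase of the importance-fitting idea (an editorial reconstruction, not necessarily the paper's exact objective), the subsampling rule is chosen so that the importance weight it induces between the pseudo-behavior and pseudo-evaluation splits matches the true importance weight of the given OPE task:

```latex
% \rho_\theta(x, a) \in (0, 1): probability of routing a logged sample (x, a, r)
% into the pseudo-evaluation dataset \tilde{D}_e (the rest forms \tilde{D}_b).
% The induced pseudo importance weight \tilde{w}_\theta(x, a) is proportional to
% \rho_\theta / (1 - \rho_\theta) (up to normalization by the two split sizes), and
% importance fitting matches it to the true weight of the given OPE task:
\min_{\theta} \; \sum_{i=1}^{n}
  \Big( \tilde{w}_\theta(x_i, a_i) - \frac{\pi_e(a_i \mid x_i)}{\pi_b(a_i \mid x_i)} \Big)^{2}
```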
  35. What is importance fitting doing? Note: the slide considers a simplified case (the conditions are shown only as formula images).
  37. ⇒ In this simplified case, PAS-IF can produce a similar distribution shift!
  38. Summary of the key points:

    • PAS-IF produces the pseudo OPE datasets via adaptive subsampling,
    • and thereby enables accurate and adaptive estimator selection.
  39. Synthetic Experiment
  40. Experimental settings: We compare PAS-IF and the non-adaptive heuristic in two tasks: 1. Estimator Selection, and 2. Policy Selection using the selected estimator.
  41. Task 1 (Estimator Selection) covers both estimator selection and hyperparameter tuning*. (* via SLOPE [Su+,20b] [Tucker&Lee,21])
  42. PAS-IF enables accurate estimator selection: PAS-IF enables far more accurate estimator selection by being adaptive, and it is accurate across various evaluation policies. [Figure: relative MSE of the selected estimator 𝑚̂ versus the true best estimator 𝑚* (lower is better), for behavior policies 𝜋𝑏1 and 𝜋𝑏2.]
  43. Experimental settings (recap): 1. Estimator Selection, 2. Policy Selection using the selected estimator.
  44. Task 2 (Policy Selection): PAS-IF selects a potentially different estimator for each candidate policy (V̂1, V̂2, V̂3), whereas the non-adaptive heuristic applies one universal estimator V̂ to all policies.
  45. Moreover, PAS-IF also benefits policy selection: PAS-IF also shows a favorable result in the policy selection task. PAS-IF can identify better policies among many candidates by using a different (appropriate) estimator for each policy! [Figure: regret of the selected policy 𝜋̂ versus the true best policy 𝜋* (lower is better).]
  46. Summary:

    • Estimator Selection is important to enable an accurate OPE.
    • PAS-IF enables adaptive and accurate estimator selection by subsampling and optimizing the pseudo OPE datasets.
    • The empirical results show that PAS-IF is beneficial in both estimator selection and the downstream policy selection task.

    PAS-IF will help practitioners identify an accurate OPE estimator!
  47. Thank you for listening! Slides are now available at: https://sites.google.com/view/harukakiyohara
    Contact: kiyohara.h.aa@m.titech.ac.jp
  48. Detailed optimization procedure of PAS-IF: We optimize the subsampling rule 𝜌𝜃 via gradient descent. To keep the data size similar to that of the original OPE task, PAS-IF also imposes a regularization term on the data size, and the regularization weight 𝜆 is tuned so that the size condition is satisfied. [Formula: the regularized objective and the size condition.]
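Below is a minimal sketch of what such a gradient-descent optimization could look like (an editorial illustration under stated assumptions, not the authors' implementation): 𝜌𝜃 is parameterized as a small sigmoid network, the induced pseudo importance weight is taken as 𝜌/(1−𝜌) rescaled by the split sizes, and the data-size regularizer keeps the expected pseudo-evaluation fraction near a target; `reg_strength` stands in for the tuned 𝜆.

```python
import torch

def fit_subsampling_rule(features, true_weights, target_eval_fraction=0.5,
                         reg_strength=1.0, n_steps=200, lr=1e-2):
    """Sketch of PAS-IF-style importance fitting by gradient descent.

    features: (n, d) tensor of (context, action) features of the logged data.
    true_weights: (n,) tensor of pi_e(a|x) / pi_b(a|x) for the given OPE task.
    Returns rho_theta, a model giving each sample's probability of being routed
    to the pseudo-evaluation split.
    """
    _, d = features.shape
    rho_theta = torch.nn.Sequential(
        torch.nn.Linear(d, 32), torch.nn.ReLU(),
        torch.nn.Linear(32, 1), torch.nn.Sigmoid(),
    )
    optimizer = torch.optim.Adam(rho_theta.parameters(), lr=lr)
    for _ in range(n_steps):
        rho = rho_theta(features).squeeze(-1).clamp(1e-3, 1 - 1e-3)
        # pseudo importance weight induced by the subsampling rule (assumed form)
        scale = (1 - rho).sum() / rho.sum()
        pseudo_w = scale * rho / (1 - rho)
        fit_loss = ((pseudo_w - true_weights) ** 2).mean()       # importance fitting
        size_reg = (rho.mean() - target_eval_fraction) ** 2      # data-size regularization
        loss = fit_loss + reg_strength * size_reg
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return rho_theta
```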
  49. Definition of the representative OPE estimators:

    • Direct Method (DM) – model-based (uses a reward predictor)
    • Inverse Propensity Scoring (IPS) – importance sampling-based (uses the importance weight)
    • Doubly Robust (DR) – hybrid of the two
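The formulas themselves appear only as images in the deck; the snippet below gives the standard definitions of the three estimators (a conventional reconstruction, not the authors' code).

```python
import numpy as np

# q_hat: (n, n_actions) predicted rewards from the reward predictor;
# pi_e_all: (n, n_actions) evaluation-policy probabilities over all actions;
# actions, rewards: (n,) logged actions and rewards;
# pi_e, pi_b: (n,) probabilities of the logged actions under each policy.

def direct_method(q_hat, pi_e_all):
    """DM: plug the reward predictor into the evaluation policy's expectation."""
    return np.mean(np.sum(pi_e_all * q_hat, axis=1))

def inverse_propensity_scoring(pi_e, pi_b, rewards):
    """IPS: reweight logged rewards by the importance weight w = pi_e / pi_b."""
    return np.mean((pi_e / pi_b) * rewards)

def doubly_robust(q_hat, pi_e_all, pi_e, pi_b, actions, rewards):
    """DR: DM baseline plus an importance-weighted correction of its residual."""
    n = len(rewards)
    baseline = np.sum(pi_e_all * q_hat, axis=1)        # E_{pi_e}[q_hat(x, a)]
    q_logged = q_hat[np.arange(n), actions]            # q_hat(x_i, a_i)
    correction = (pi_e / pi_b) * (rewards - q_logged)
    return np.mean(baseline + correction)
```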
  50. References
  51. References (1/4)

    [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. "The Offset Tree for Learning with Partial Labels." KDD, 2009.
    [Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. "Eligibility Traces for Off-Policy Policy Evaluation." ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. "Doubly Robust Policy Evaluation and Optimization." Statistical Science, 2014. https://arxiv.org/abs/1503.02834
  52. References (2/4)

    [Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. "The Self-Normalized Estimator for Counterfactual Learning." NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
    [Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. "Optimal and Adaptive Off-policy Evaluation in Contextual Bandits." ICML, 2017. https://arxiv.org/abs/1612.01205
    [Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. "Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning." NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
  53. References (3/4)

    [Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. "Doubly Robust Off-policy Evaluation with Shrinkage." ICML, 2020. https://arxiv.org/abs/1907.09623
    [Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. "Adaptive Estimator Selection for Off-Policy Evaluation." ICML, 2020. https://arxiv.org/abs/2002.07729
    [Tucker&Lee,21] George Tucker and Jonathan Lee. "Improved Estimator Selection for Off-Policy Evaluation." 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
    [Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. "Debiased Off-Policy Evaluation for Recommendation Systems." RecSys, 2021. https://arxiv.org/abs/2002.08536
  54. References (4/4)

    [Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS Datasets and Benchmarks Track, 2021. https://arxiv.org/abs/2008.07146
    [Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. "Evaluating the Robustness of Off-Policy Evaluation." RecSys, 2021. https://arxiv.org/abs/2108.13703