
[AAAI'23] Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Haruka Kiyohara

February 10, 2023

Transcript

  1. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Presenter: Haruka Kiyohara, Tokyo Institute of Technology. https://sites.google.com/view/harukakiyohara
  2. Content: • Introduction to Off-Policy Evaluation (OPE) • Estimator Selection for OPE • Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF) • Synthetic Experiments on Estimator Selection and Policy Selection
  3. Interactions in recommender systems: a behavior policy 𝜋𝑏 interacts with users and collects logged data. A coming user is the context, an item is the action, and the user feedback is the reward; the collected data is called logged bandit feedback.
  5. Off-Policy Evaluation (OPE): the goal is to evaluate the performance of an evaluation policy 𝜋𝑒 using only the logged bandit feedback collected by the behavior policy 𝜋𝑏. An OPE estimator thus acts as an offline alternative to an A/B test for estimating policy performance.
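
In standard notation (not shown explicitly in this transcript), the estimand is the policy value

$$V(\pi_e) := \mathbb{E}_{x \sim p(x),\; a \sim \pi_e(a \mid x),\; r \sim p(r \mid x, a)}\left[\, r \,\right],$$

and an OPE estimator $\hat{V}(\pi_e; \mathcal{D})$ approximates it from the logged data $\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^{n}$ collected by $\pi_b$.
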
  6. Representative OPE estimators: we aim to reduce both bias and variance to enable an accurate OPE.
     • Direct Method (DM) [Beygelzimer&Langford,09]: model-based, uses a reward predictor. Bias: high; variance: low.
     • Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10]: importance sampling-based, uses the importance weight (evaluation policy over behavior policy). Bias: unbiased; variance: very high.
     • Doubly Robust (DR) [Dudík+,14]: combines both, using the reward predictor as a control variate together with the importance weight. Bias: unbiased; variance: lower than IPS, but still high.
     A sketch of these estimators follows below.
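
A minimal NumPy sketch of the three estimators (illustrative only, not the authors' implementation; the array shapes and argument names are assumptions):

```python
import numpy as np

def dm(q_hat, pi_e):
    """Direct Method: average the predicted rewards q_hat(x, a) under pi_e.
    q_hat, pi_e: arrays of shape (n, n_actions)."""
    return np.mean(np.sum(pi_e * q_hat, axis=1))

def ips(rewards, w):
    """IPS: reweight the observed rewards by the importance weight
    w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i)."""
    return np.mean(w * rewards)

def dr(rewards, w, q_hat, pi_e, actions):
    """Doubly Robust: DM baseline plus an importance-weighted residual,
    using the reward predictor as a control variate."""
    n = len(rewards)
    q_taken = q_hat[np.arange(n), actions]      # predicted reward of the logged action
    baseline = np.sum(pi_e * q_hat, axis=1)     # DM term per sample
    return np.mean(baseline + w * (rewards - q_taken))
```
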
  12. Advanced OPE estimators: to reduce the variance of IPS/DR, many estimators based on modified importance weights have been proposed, e.g., Self-Normalized (IPS/DR) [Swaminathan&Joachims,15], Clipped (IPS/DR)* [Su+,20a], Switch (DR)* [Wang+,17], Optimistic Shrinkage (DR)* [Su+,20a], and Subgaussian (IPS/DR)* [Metelli+,21]. (* requires hyperparameter tuning of 𝜆, e.g., via SLOPE [Su+,20b] [Tucker&Lee,21].) So, which OPE estimator should be used to enable an accurate OPE? Two of the weight modifications are sketched below.
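
For illustration, two of these weight modifications in the same sketch style (hedged; 𝜆 here is simply the clipping threshold mentioned above):

```python
import numpy as np

def clipped_ips(rewards, w, lam):
    """Clipped IPS: cap the importance weights at lam, trading a little bias
    for a large reduction in variance."""
    return np.mean(np.minimum(w, lam) * rewards)

def snips(rewards, w):
    """Self-Normalized IPS: normalize by the empirical mean of the weights,
    which keeps the estimate within the observed reward range."""
    return np.sum(w * rewards) / np.sum(w)
```
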
  14. Motivation towards data-driven estimator selection: estimator selection is important, but the best estimator can be different under different situations. For a given behavior policy 𝜋𝑏, an estimator that is among the best for one evaluation policy can be among the worst for another; the data size, the evaluation policy, and the reward noise all matter. Estimator selection therefore asks: how can we identify the most accurate OPE estimator using only the available logged data?
  20. Objective for estimator selection: the goal is to identify the most accurate OPE estimator in terms of MSE, i.e., the expected squared error between the value estimated from the logged data and the true policy value (the estimand), written out below.
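
In standard notation (the slide's own equation is not in the transcript), the selection problem is roughly

$$\hat{m} = \arg\min_{m \in \mathcal{M}} \; \mathbb{E}_{\mathcal{D}}\big[\,(\hat{V}_m(\pi_e; \mathcal{D}) - V(\pi_e))^2\,\big],$$

where $\mathcal{M}$ is the set of candidate estimators; since $V(\pi_e)$ is unknown, this MSE must itself be estimated from the available logged data.
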
  23. Baseline: the non-adaptive heuristic [Saito+,21a] [Saito+,21b]. Suppose we have logged data from previous A/B tests run with two logging policies. One logging policy is treated as a pseudo-evaluation policy: its OPE estimate, computed from the other policy's data, is compared against its on-policy policy value, and the comparison is repeated over 𝑆, a set of random states for bootstrapping. A sketch follows below.
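
A rough sketch of this heuristic MSE estimate, assuming dictionary-of-arrays logged data and an `estimator(data, pi)` callable (both interfaces are illustrative, not from the paper):

```python
import numpy as np

def nonadaptive_mse(estimator, data_B, pi_A, on_policy_value_A,
                    n_bootstrap=100, seed=0):
    """Non-adaptive heuristic (sketch): treat logging policy pi_A as a
    pseudo-evaluation policy, estimate its value from the other A/B arm's
    data D_B, and compare against the on-policy value measured on D_A."""
    rng = np.random.default_rng(seed)
    n = len(data_B["reward"])
    errors = []
    for _ in range(n_bootstrap):                       # S: set of random states
        idx = rng.integers(n, size=n)                  # bootstrap resample of D_B
        boot = {key: value[idx] for key, value in data_B.items()}
        v_hat = estimator(boot, pi_A)                  # OPE estimate V_hat(pi_A; D_B)
        errors.append((v_hat - on_policy_value_A) ** 2)
    return float(np.mean(errors))

# The heuristic then picks the estimator with the smallest estimated MSE:
# best = min(candidates, key=lambda m: nonadaptive_mse(m, data_B, pi_A, v_A))
```
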
  27. Does the non-adaptive heuristic work? The heuristic scores estimators by the estimate V̂(𝜋𝐴; 𝐷𝐵) on a fixed pair of logging policies, but do the selected estimators really work well? Their true performance on the OPE task we actually face can differ, because the non-adaptive heuristic does not consider the differences among OPE tasks. How can we choose OPE estimators adaptively to the given OPE task (e.g., to the given evaluation policy)?
  30. Is it possible to make the pseudo-policies adaptive? The non-adaptive heuristic calculates MSE using two datasets collected by A/B tests: the total amount of logged data (collected under 𝜋𝑏) is split by the logging policies into a pseudo-behavior part (~𝜋𝐵) and a pseudo-evaluation part (~𝜋𝐴). We instead aim to split the logged data adaptively to the given OPE task, into pseudo-behavior data (~π̃𝑏) and pseudo-evaluation data (~π̃𝑒).
  33. Subsampling function controls the pseudo-policies: we now introduce a subsampling function 𝜌 that splits the total amount of logged data (collected by 𝜋𝑏) into the pseudo-behavior dataset (from π̃𝑏) and the pseudo-evaluation dataset (from π̃𝑒).
  35. How to optimize the subsampling function? PAS-IF optimizes 𝜌 via importance fitting, so that the pseudo OPE task reproduces the bias-variance tradeoff of the original OPE task; the objective of importance fitting is sketched below.
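
The transcript omits the slide's equation, so the following is only one plausible, simplified form of the importance-fitting objective: the subsampling probabilities ρθ(x, a) induce pseudo-policies proportional to ρ·𝜋𝑏 and (1 − ρ)·𝜋𝑏, and ρ is fit so that the induced importance weight matches that of the original OPE task.

```python
import torch

def importance_fitting_loss(rho, pi_e, pi_b, actions):
    """One plausible, simplified form of the importance-fitting objective
    (the exact equation is in the paper, not in this transcript).
    rho:        (n, n_actions) subsampling probabilities rho_theta(x, a) in (0, 1)
    pi_e, pi_b: (n, n_actions) action-choice probabilities of the original task
    actions:    (n,) LongTensor of logged actions."""
    idx = torch.arange(rho.shape[0])
    # pseudo-policies induced by the subsampling function:
    #   pi~_e(a|x) proportional to rho(x, a)       * pi_b(a|x)
    #   pi~_b(a|x) proportional to (1 - rho(x, a)) * pi_b(a|x)
    pe_tilde = rho * pi_b
    pe_tilde = pe_tilde / pe_tilde.sum(dim=1, keepdim=True)
    pb_tilde = (1.0 - rho) * pi_b
    pb_tilde = pb_tilde / pb_tilde.sum(dim=1, keepdim=True)
    # fit the pseudo importance weight to the weight of the original OPE task
    w_tilde = pe_tilde[idx, actions] / pb_tilde[idx, actions]
    w_true = pi_e[idx, actions] / pi_b[idx, actions]
    return torch.mean((w_tilde - w_true) ** 2)
```
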
  38. Key contribution of PAS-IF: PAS-IF enables MSE estimation that is data-driven, by splitting the logged data into pseudo datasets, and adaptive, by optimizing the subsampling function to simulate the distribution shift of the original OPE task. Together, this yields accurate estimator selection.
  39. Experimental settings: we compare PAS-IF and the non-adaptive heuristic in two tasks, (1) estimator selection and (2) policy selection using the selected estimator. In the estimator selection task, the hyperparameters of the candidate estimators are first tuned with SLOPE [Su+,20b] [Tucker&Lee,21], and then each selection method picks an estimator.
  41. PAS-IF enables far more accurate estimator selection by being adaptive, and it remains accurate across various evaluation policies and both behavior policies (𝜋𝑏1, 𝜋𝑏2). (Figure: comparison of the selected estimator m̂ against the true best estimator m*; lower is better.)
  42. Experimental settings for the policy selection task: PAS-IF can select a different estimator (V̂1, V̂2, V̂3) for each candidate evaluation policy, whereas the non-adaptive heuristic applies one universal estimator V̂ to all policies.
  44. Moreover, PAS-IF also benefits policy selection: PAS-IF shows a favorable result in the policy selection task, identifying better policies among many candidates by using a different (appropriate) estimator for each policy. (Figure: comparison of the selected policy π̂ against the true best policy π*; lower is better.) A sketch of this workflow follows below.
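
A tiny sketch of that workflow under assumed interfaces (`select_estimator` stands in for PAS-IF-based estimator selection; all names are illustrative):

```python
def select_policy(candidate_policies, logged_data, select_estimator):
    """Rank candidate policies, letting PAS-IF pick a (possibly different)
    estimator for each one."""
    values = {}
    for pi in candidate_policies:
        estimator = select_estimator(pi)         # estimator chosen for this policy
        values[pi] = estimator(logged_data, pi)  # estimated policy value
    return max(values, key=values.get)           # deploy the highest-value policy
```
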
  45. Summary: • Estimator selection is important to enable an accurate OPE. • The non-adaptive heuristic fails to adapt to the given OPE task. • PAS-IF enables adaptive and accurate estimator selection by subsampling the logged data into pseudo OPE datasets and optimizing that split. PAS-IF will help identify an accurate OPE estimator in practice!
  46. Thank you for listening! Feel free to ask any questions, and discussions are welcome!
  47. Example case of importance fitting: in a simplified case, PAS-IF can produce a distribution shift similar to that of the original OPE task.
  48. Detailed optimization procedure of PAS-IF: we optimize the subsampling rule 𝜌𝜃 via gradient descent. To keep the pseudo datasets similar in size to those of the original OPE task, PAS-IF also imposes a regularization term on the data size, and 𝜆 is tuned so that the resulting data sizes satisfy this requirement. A sketch of the loop follows below.
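
A hedged sketch of that loop, reusing `importance_fitting_loss` from the earlier sketch; the squared data-size penalty and the `target_ratio` argument are assumptions, not the paper's exact regularizer:

```python
import torch

def optimize_subsampling(model, pi_e, pi_b, contexts, actions, target_ratio,
                         lam=1.0, n_steps=1000, lr=1e-2):
    """Gradient-descent sketch for the subsampling rule rho_theta.
    `model(contexts)` -> (n, n_actions) logits; `target_ratio` is the intended
    fraction of samples assigned to the pseudo-evaluation dataset."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    idx = torch.arange(len(actions))
    for _ in range(n_steps):
        opt.zero_grad()
        rho = torch.sigmoid(model(contexts))        # rho_theta(x, a) in (0, 1)
        loss = importance_fitting_loss(rho, pi_e, pi_b, actions)
        # regularization: keep the pseudo dataset sizes close to the target split
        size_penalty = (rho[idx, actions].mean() - target_ratio) ** 2
        (loss + lam * size_penalty).backward()
        opt.step()
    return model
```
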
  49. Key idea of PAS-IF: how about sampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to the given OPE task? (𝑆 is a set of random states for bootstrapping.)
  50. References (1/4)
     [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
     [Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
     [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
     [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834
  51. References (2/4)
     [Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
     [Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
     [Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
  52. References (3/4)
     [Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
     [Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020.
     [Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
     [Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536
  53. References (4/4)
     [Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
     [Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703