[AAAI'23] Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Haruka Kiyohara

February 10, 2023

Transcript

  1. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Presented by Haruka Kiyohara, Tokyo Institute of Technology. https://sites.google.com/view/harukakiyohara
  2. Content • Introduction to Off-Policy Evaluation (OPE) • Estimator Selection for OPE • Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF) • Synthetic Experiments • Estimator Selection • Policy Selection
  3-4. Interactions in recommender systems. A behavior policy interacts with users and collects logged data: a coming user (context), an item (action), and user feedback (reward) together form the logged bandit feedback collected by the behavior policy π_b.
  5. Off-Policy Evaluation. The goal is to evaluate the performance of an evaluation policy π_e using only the logged bandit feedback collected by the behavior policy π_b. An OPE estimator turns the logged data into an estimate of the policy performance, serving as an offline A/B test.
  6-11. Representative OPE estimators. We aim to reduce both bias and variance to enable an accurate OPE.

    Estimator                                                  | model-based | importance sampling-based | bias     | variance
    Direct Method (DM) [Beygelzimer&Langford,09]               | ✓           | ---                       | high     | low
    Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10] | ---         | ✓                         | unbiased | very high
    Doubly Robust (DR) [Dudík+,14]                             | ✓           | ✓                         | unbiased | lower than IPS, but still high

    (The slide builds highlight each component in turn: the reward predictor used by DM and DR, the importance weight, i.e., the ratio of the evaluation policy to the behavior policy, and the control variate through which DR reduces the variance of IPS.) A minimal code sketch of the three estimators follows below.
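To make the table concrete, here is a minimal NumPy sketch of the three estimators (our illustration, not the authors' code); the array names and shapes (`q_hat` for the reward predictor, `w = pi_e / pi_b` for the importance weights) are assumptions:

```python
import numpy as np

def direct_method(q_hat, pi_e):
    """DM: average the predicted rewards q_hat under the evaluation policy.
    q_hat, pi_e: (n, n_actions) reward predictions / action probabilities."""
    return np.mean(np.sum(pi_e * q_hat, axis=1))

def ips(r, w):
    """IPS: reweight observed rewards r by importance weights w = pi_e / pi_b."""
    return np.mean(w * r)

def doubly_robust(r, a, w, q_hat, pi_e):
    """DR: DM baseline plus an importance-weighted correction of the residual."""
    idx = np.arange(len(r))
    baseline = np.sum(pi_e * q_hat, axis=1)  # DM term (the control variate)
    residual = r - q_hat[idx, a]             # prediction error on logged actions
    return np.mean(baseline + w * residual)
```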
  12-13. Advanced OPE estimators. To reduce the variance of IPS/DR, many OPE estimators have been proposed, mostly via modifications of the importance weights: Self-Normalized (IPS/DR) [Swaminathan&Joachims,15], Clipped (IPS/DR)* [Su+,20a], Switch (DR)* [Wang+,17], Optimistic Shrinkage (DR)* [Su+,20a], and Subgaussian (IPS/DR)* [Metelli+,21]. (* requires hyperparameter tuning of λ, e.g., via SLOPE [Su+,20b] [Tucker&Lee,21].) So, which OPE estimator should be used to enable an accurate OPE? See the sketch below for what "modifying the importance weights" means.
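A rough illustration of weight modification (a sketch under our assumptions, not the exact estimators from the cited papers):

```python
import numpy as np

def clipped_weight(w, lam):
    """Clipping: cap each importance weight at a threshold lam to curb variance."""
    return np.minimum(w, lam)

def shrunk_weight(w, lam):
    """Optimistic-shrinkage-style weight: lam * w / (w**2 + lam) shrinks large weights."""
    return lam * w / (w ** 2 + lam)

def self_normalized_ips(r, w):
    """Self-normalized IPS: normalize by the weight sum instead of the sample size."""
    return np.sum(w * r) / np.sum(w)
```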
  14-20. Motivation towards data-driven estimator selection. Estimator selection is important! But the best estimator can differ across situations: an estimator that is among the best in one setting can be among the worst in another, and the data size, the evaluation policy, and the reward noise all matter. This raises the question of estimator selection: how can we identify the most accurate OPE estimator using only the available logged data?
  21-23. Objective for estimator selection. The goal is to identify the most accurate OPE estimator in terms of MSE, i.e., to select the estimator m̂ minimizing MSE(V̂_m) = E[(V(π_e) - V̂_m(π_e; D))²], where V(π_e) is the true policy value (the estimand). Since the true value is unknown, the MSE must itself be estimated from the logged data, as the sketch below illustrates.
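In a synthetic setup where the true value is known, the MSE and the resulting selection can be computed directly (a minimal sketch; in practice the true value is unavailable and the MSE must be estimated, which is the whole point of the selection problem):

```python
import numpy as np

def empirical_mse(estimates, true_value):
    """MSE of one estimator over estimates from independently sampled logged datasets."""
    estimates = np.asarray(estimates)
    return np.mean((estimates - true_value) ** 2)

def select_estimator(mse_by_estimator):
    """Estimator selection: pick the estimator with the smallest (estimated) MSE."""
    return min(mse_by_estimator, key=mse_by_estimator.get)
```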
  24-27. Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]. Suppose we have logged data from previous A/B tests of two policies. One of them is treated as a pseudo-evaluation policy: its on-policy policy value, computed from its own logged data, serves as the ground truth, and each estimator's OPE estimate (computed from the other policy's data) is compared against it to approximate the MSE. ※ S is a set of random states for bootstrapping. A sketch of this procedure follows below.
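A sketch of our reading of this heuristic (illustrative, not the code of [Saito+,21a,b]); `D_A` and `D_B` are assumed to be dict-of-array logged datasets from the two A/B-tested policies:

```python
import numpy as np

def estimate_mse_non_adaptive(D_A, D_B, estimator, pi_A, n_bootstrap=100, seed=0):
    """Treat pi_A from a past A/B test as a pseudo-evaluation policy: use the
    on-policy value from its own logged data D_A as ground truth, and compare it
    with the OPE estimate computed on D_B over bootstrap resamples (the set S)."""
    rng = np.random.default_rng(seed)
    on_policy_value = np.mean(D_A["r"])  # on-policy estimate of V(pi_A)
    sq_errors = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(D_B["r"]), size=len(D_B["r"]))  # resample with replacement
        D_boot = {k: v[idx] for k, v in D_B.items()}
        sq_errors.append((estimator(D_boot, pi_A) - on_policy_value) ** 2)
    return np.mean(sq_errors)
```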
  28-30. Does the non-adaptive heuristic work? Do the estimators it selects really work well? The heuristic compares the estimate V̂(π_A; D_B) against the true performance for one fixed pseudo-evaluation policy π_A, so it does not consider the differences among OPE tasks: the estimator that looks best for π_A need not be best for the evaluation policy we actually care about. How can we choose OPE estimators adaptively to the given OPE task (e.g., to the evaluation policy)?
  31-33. Is it possible to make the pseudo-policies adaptive? The non-adaptive heuristic calculates the MSE using two datasets collected by A/B tests: the total amount of logged data is split as collected, with π_B acting as the pseudo-behavior policy and π_A as the pseudo-evaluation policy. We instead aim to split the logged data adaptively to the given OPE task, inducing our own pseudo-behavior policy π̃_b and pseudo-evaluation policy π̃_e.
  34-35. The subsampling function controls the pseudo-policies. We now introduce a subsampling function ρ: it routes each logged sample into one of two pseudo-datasets, thereby inducing the pseudo-behavior policy π̃_b and the pseudo-evaluation policy π̃_e from the total amount of logged data, as sketched below.
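A minimal sketch of the subsampling step, assuming ρ gives each logged sample its probability of being routed to the pseudo-evaluation dataset (this parameterization is ours, for illustration):

```python
import numpy as np

def split_by_rho(D, rho, seed=0):
    """Split logged data D (a dict of aligned arrays) into a pseudo-evaluation
    dataset and a pseudo-behavior dataset according to the subsampling function.
    rho: (n,) per-sample probabilities of going to the pseudo-evaluation side."""
    rng = np.random.default_rng(seed)
    to_eval = rng.random(len(rho)) < rho
    D_tilde_e = {k: v[to_eval] for k, v in D.items()}
    D_tilde_b = {k: v[~to_eval] for k, v in D.items()}
    return D_tilde_e, D_tilde_b
```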
  36-38. How do we optimize the subsampling function? PAS-IF optimizes ρ to reproduce the bias-variance tradeoff of the original OPE task. Objective of importance fitting: choose ρ so that the importance weights induced between the pseudo-policies, π̃_e/π̃_b, match the importance weights of the original task, π_e/π_b, on the logged data (see the sketch below).
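A runnable sketch of importance fitting under our assumptions: one subsampling logit per logged sample, pseudo-weights taken proportional to ρ/(1-ρ) with per-context normalization omitted, and a synthetic stand-in for the target weights. The exact parameterization and objective are in the paper; this only conveys the mechanics:

```python
import torch

n = 1000
torch.manual_seed(0)
w_original = torch.exp(torch.randn(n))  # stand-in for pi_e/pi_b on the logged data

theta = torch.zeros(n, requires_grad=True)   # one subsampling logit per sample
optimizer = torch.optim.Adam([theta], lr=0.05)
k, lam = 0.5, 1.0  # target split fraction and regularization strength

for _ in range(500):
    rho = torch.sigmoid(theta)        # subsampling probabilities in (0, 1)
    w_pseudo = rho / (1.0 - rho)      # induced pseudo-weights, up to normalization
    fit = torch.mean((w_pseudo - w_original) ** 2)  # importance fitting term
    size_reg = (rho.mean() - k) ** 2                # data-size regularization
    loss = fit + lam * size_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```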
  39. Key contribution of PAS-IF. PAS-IF enables MSE estimation that is: data-driven, by splitting the logged data into pseudo-datasets; and adaptive, by optimizing the subsampling function to simulate the distribution shift of the original OPE task. → Accurate estimator selection!
  40-41. Experimental settings. We compare PAS-IF and the non-adaptive heuristic in two tasks: 1. estimator selection, which includes hyperparameter tuning* of the candidate estimators (* via SLOPE [Su+,20b] [Tucker&Lee,21]); 2. policy selection using the selected estimator.
  42. PAS-IF enables accurate estimator selection. PAS-IF achieves far more accurate estimator selection by being adaptive, and it is accurate across various evaluation policies (panels for behavior policies π_b1 and π_b2; lower is better; m̂ denotes the selected estimator and m* the true best).
  43-44. Experimental settings (policy selection). In the second task, policy selection using the selected estimator, PAS-IF may pick a different estimator (V̂_1, V̂_2, V̂_3) for each candidate policy, whereas the non-adaptive heuristic applies a single universal estimator V̂ to all policies, as the sketch below contrasts.
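The contrast between the two pipelines, as a small sketch (names and signatures are ours for illustration):

```python
def select_non_adaptive(policies, estimators, estimate_mse_global):
    """Non-adaptive baseline: one universal estimator for all candidate policies."""
    best = min(estimators, key=estimate_mse_global)
    return {pi: best for pi in policies}

def select_pas_if(policies, estimators, estimate_mse_for):
    """PAS-IF: select, per candidate policy, the estimator whose PAS-IF-estimated
    MSE for that specific OPE task is smallest."""
    return {pi: min(estimators, key=lambda m: estimate_mse_for(m, pi))
            for pi in policies}
```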
  45. Moreover, PAS-IF also benefits policy selection. PAS-IF also shows a favorable result in the policy selection task: it can identify better policies among many candidates by using a different (appropriate) estimator for each policy (lower is better; π̂ denotes the selected policy and π* the true best).
  46. Summary • Estimator selection is important for an accurate OPE. • The non-adaptive heuristic fails to adapt to the given OPE task. • PAS-IF enables adaptive and accurate estimator selection by subsampling and optimizing pseudo OPE datasets. PAS-IF will help identify an accurate OPE estimator in practice!
  47. Thank you for listening! Feel free to ask any questions; discussions are welcome!
  48. Example case of importance fitting. When we have …, PAS-IF can produce a similar distribution shift! Note: the simplified case of ….
  49. Detailed optimization procedure of PAS-IF. We optimize the subsampling rule ρ_θ via gradient descent. To keep the pseudo-dataset sizes similar to those of the original OPE task, PAS-IF also imposes a regularization on the data size; we tune λ so that the resulting data-size constraint is satisfied, as sketched below.
  50. Key idea of PAS-IF. How about subsampling the logged data to construct a pseudo-evaluation policy whose bias-variance tradeoff is similar to that of the given OPE task? (S is a set of random states for bootstrapping.)
  51. References (1/4) [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. "The Offset Tree for Learning with Partial Labels." KDD, 2009. [Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. "Eligibility Traces for Off-Policy Policy Evaluation." ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120 [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. "Doubly Robust Policy Evaluation and Optimization." Statistical Science, 2014. https://arxiv.org/abs/1503.02834
  52. References (2/4) [Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. "The Self-Normalized Estimator for Counterfactual Learning." NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600 [Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. "Optimal and Adaptive Off-policy Evaluation in Contextual Bandits." ICML, 2017. https://arxiv.org/abs/1612.01205 [Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. "Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning." NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
  53. References (3/4) [Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. "Doubly Robust Off-policy Evaluation with Shrinkage." ICML, 2020. https://arxiv.org/abs/1907.09623 [Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. "Adaptive Estimator Selection for Off-Policy Evaluation." ICML, 2020. https://arxiv.org/abs/2002.07729 [Tucker&Lee,21] George Tucker and Jonathan Lee. "Improved Estimator Selection for Off-Policy Evaluation." 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf [Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. "Debiased Off-Policy Evaluation for Recommendation Systems." RecSys, 2021. https://arxiv.org/abs/2002.08536
  54. References (4/4) [Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146 [Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. "Evaluating the Robustness of Off-Policy Evaluation." RecSys, 2021. https://arxiv.org/abs/2108.13703