Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Slide 1

Slide 1 text

Policy Adaptive Estimator Selection for Off-Policy Evaluation
Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno
Presenter: Haruka Kiyohara, Tokyo Institute of Technology
https://sites.google.com/view/harukakiyohara
September 2022
Policy Adaptive Estimator Selection (PAS-IF)

Slide 2

Slide 2 text

Content
• Introduction to Off-Policy Evaluation (OPE)
• Estimator Selection for OPE
• Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF)
• Synthetic Experiments
  • Estimator Selection
  • Policy Selection

Slide 3

Slide 3 text

Off-Policy Evaluation
Motivation towards Estimator Selection

Slide 4

Slide 4 text

Interactions in recommender systems
A behavior policy interacts with users and collects logged data.
Figure labels: a coming user (context), an item (action), user feedback (reward).

Slide 5

Slide 5 text

Interactions in recommender systems
A behavior policy interacts with users and collects logged data.
Figure labels: a coming user (context), an item (action), user feedback (reward), logged bandit feedback, behavior policy $\pi_b$.

Slide 6

Slide 6 text

Off-Policy Evaluation
The goal is to evaluate the performance of an evaluation policy $\pi_e$.
Figure labels: offline A/B test, logged bandit feedback, behavior policy $\pi_b$, OPE estimator (policy performance).

Slide 7

Slide 7 text

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.

Estimator | model-based (reward predictor) | importance sampling-based (importance weight) | bias | variance
Direct Method (DM) [Beygelzimer&Langford,09] | ✓ | --- | high | low
Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10] | --- | ✓ | unbiased | very high
Doubly Robust (DR) [Dudík+,14] | ✓ | ✓ | unbiased | lower than IPS, but still high
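As a rough illustration of the three estimators named on this slide, the sketch below implements DM, IPS, and DR with NumPy on a toy problem. The toy setup (three actions, no context dependence, a uniform behavior policy, a perfectly accurate reward predictor) and all function and variable names are assumptions made for this example; they do not come from the slides.

```python
import numpy as np


def direct_method(q_hat, pi_e):
    """DM: average the predicted reward under the evaluation policy."""
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))


def ips(reward, action, pscore, pi_e):
    """IPS: reweight observed rewards by pi_e(a|x) / pi_b(a|x)."""
    idx = np.arange(len(reward))
    weight = pi_e[idx, action] / pscore
    return float(np.mean(weight * reward))


def doubly_robust(reward, action, pscore, pi_e, q_hat):
    """DR: DM baseline plus an importance-weighted residual correction."""
    idx = np.arange(len(reward))
    weight = pi_e[idx, action] / pscore
    baseline = np.sum(pi_e * q_hat, axis=1)
    return float(np.mean(baseline + weight * (reward - q_hat[idx, action])))


# Hypothetical toy problem: 3 actions, rewards do not depend on the context.
rng = np.random.default_rng(0)
n = 20_000
q_true = np.array([0.2, 0.5, 0.8])           # expected reward of each action
pi_b_row = np.array([1 / 3, 1 / 3, 1 / 3])   # uniform behavior policy
pi_e_row = np.array([0.1, 0.2, 0.7])         # evaluation policy
true_value = float(pi_e_row @ q_true)

# Logged bandit feedback collected by the behavior policy.
action = rng.choice(3, size=n, p=pi_b_row)
reward = rng.binomial(1, q_true[action]).astype(float)
pscore = pi_b_row[action]

pi_e = np.tile(pi_e_row, (n, 1))
q_hat = np.tile(q_true, (n, 1))  # idealized reward predictor (assumption)

v_dm = direct_method(q_hat, pi_e)
v_ips = ips(reward, action, pscore, pi_e)
v_dr = doubly_robust(reward, action, pscore, pi_e, q_hat)
print(true_value, v_dm, v_ips, v_dr)
```

With an imperfect reward predictor, DM would inherit its bias, while DR would stay unbiased as long as the logged propensities are correct; the idealized predictor here only keeps the example short.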

Slide 8

Slide 8 text

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
Highlight: the model-based estimators (DM and DR) use a reward predictor.

Slide 9

Slide 9 text

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
Highlight: the importance sampling-based estimators (IPS and DR) use the importance weight.

Slide 10

Slide 10 text

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
Highlight: the importance weight is the ratio of the evaluation policy to the behavior policy.

Slide 12

Slide 12 text

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
Highlight: DR uses the reward predictor as a control variate.

Slide 13

Slide 13 text

Advanced OPE estimators
To reduce the variance of IPS/DR, many OPE estimators have been proposed. Each modifies the importance weights:
• Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
• Clipped (IPS/DR) * [Su+,20a]
• Switch (DR) * [Wang+,17]
• Optimistic Shrinkage (DR) * [Su+,20a]
• Subgaussian (IPS/DR) * [Metelli+,21]
* requires hyperparameter tuning of $\lambda$, e.g., SLOPE [Su+,20b] [Tucker&Lee,21]

Slide 14

Slide 14 text

Advanced OPE estimators
To reduce the variance of IPS/DR, many OPE estimators have been proposed. Each modifies the importance weights:
• Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
• Clipped (IPS/DR) * [Su+,20a]
• Switch (DR) * [Wang+,17]
• Optimistic Shrinkage (DR) * [Su+,20a]
• Subgaussian (IPS/DR) * [Metelli+,21]
* requires hyperparameter tuning, e.g., SLOPE [Su+,20b]
Which OPE estimator should be used to enable an accurate OPE?
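To make "modification on importance weights" concrete, here is a minimal sketch of two of the listed variants applied to IPS: self-normalization and clipping at a threshold `lam`. The function names and the tiny hand-written arrays are assumptions for illustration only; the other estimators on the slide (Switch, Optimistic Shrinkage, Subgaussian) transform the weights differently and are not shown.

```python
import numpy as np


def ips(reward, weight):
    """Plain IPS: mean of importance-weighted rewards."""
    return float(np.mean(weight * reward))


def self_normalized_ips(reward, weight):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    return float(np.sum(weight * reward) / np.sum(weight))


def clipped_ips(reward, weight, lam):
    """Clipped IPS: cap each importance weight at the hyperparameter lam."""
    return float(np.mean(np.minimum(weight, lam) * reward))


# Hand-written toy values (assumption): importance weights and observed rewards.
weight = np.array([0.5, 2.0, 4.0, 0.5])
reward = np.array([1.0, 0.0, 1.0, 1.0])

v_ips = ips(reward, weight)
v_snips = self_normalized_ips(reward, weight)
v_clip = clipped_ips(reward, weight, lam=2.0)
print(v_ips, v_snips, v_clip)
```

Clipping trades variance for bias: a smaller `lam` suppresses large weights more aggressively, which is why such estimators need the hyperparameter tuning noted on the slide.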

Slide 15

Slide 15 text

Motivation towards data-driven estimator selection
Figure labels: $\pi_b$, $\pi_e$.

Slide 16

Slide 16 text

Motivation towards data-driven estimator selection
Estimator Selection is important!
Figure labels: $\pi_b$, $\pi_e$.

Slide 17

Slide 17 text

Motivation towards data-driven estimator selection
Estimator Selection is important, but the best estimator can differ from one situation to another.
Figure label: $\pi_b$.

Slide 18

Slide 18 text

Motivation towards data-driven estimator selection
Estimator Selection is important, but the best estimator can differ from one situation to another.
Figure labels: $\pi_b$; annotation: among the best.

Slide 19

Slide 19 text

Motivation towards data-driven estimator selection
Estimator Selection is important, but the best estimator can differ from one situation to another.
Figure labels: $\pi_b$; annotations: among the best, among the worst.

Slide 20

Slide 20 text

Motivation towards data-driven estimator selection
Estimator Selection is important, but the best estimator can differ from one situation to another. The data size, the evaluation policy, and the reward noise all matter.
Figure labels: $\pi_b$; annotations: among the best, among the worst.

Slide 21

Slide 21 text

Motivation towards data-driven estimator selection
Estimator Selection is important, but the best estimator can differ from one situation to another. The data size, the evaluation policy, and the reward noise all matter.
Estimator Selection: How can we identify the most accurate OPE estimator using only the available logged data?

Slide 22

Slide 22 text

Estimator Selection for OPE

Slide 23

Slide 23 text

Objective for estimator selection
The goal is to identify the most accurate OPE estimator in terms of MSE.

Slide 24

Slide 24 text

Objective for estimator selection
The goal is to identify the most accurate OPE estimator in terms of MSE.
Annotation: true policy value (estimand).

Slide 25

Slide 25 text

Objective for estimator selection
The goal is to identify the most accurate OPE estimator in terms of MSE.
Annotation: estimated from the logged data.
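The formula on these slides did not survive extraction. A standard way to write the objective, using notation that is assumed here rather than taken from the slides ($\hat{V}_m$ for a candidate estimator, $\mathcal{M}$ for the candidate set, $\mathcal{D}_b$ for the logged data, $V(\pi_e)$ for the true policy value), is:

```latex
\mathrm{MSE}\bigl(\hat{V}_m\bigr)
  = \mathbb{E}_{\mathcal{D}_b}\Bigl[\bigl(\hat{V}_m(\pi_e; \mathcal{D}_b) - V(\pi_e)\bigr)^2\Bigr]
  = \mathrm{Bias}\bigl(\hat{V}_m\bigr)^2 + \mathrm{Var}\bigl(\hat{V}_m\bigr),
\qquad
m^{*} = \operatorname*{arg\,min}_{m \in \mathcal{M}} \mathrm{MSE}\bigl(\hat{V}_m\bigr)
```

Because $V(\pi_e)$ is unknown, this MSE cannot be computed directly and has to be estimated from the logged data, which is what the selection methods on the following slides attempt.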

Slide 26

Slide 26 text

Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]
Suppose we have logged data from previous A/B tests.

Slide 28

Slide 28 text

Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]
Suppose we have logged data from previous A/B tests.
Annotations: pseudo-evaluation policy, OPE estimate, on-policy policy value.

Slide 29

Slide 29 text

Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]
Suppose we have logged data from previous A/B tests.
Annotations: pseudo-evaluation policy, OPE estimate, on-policy policy value.
※ $S$ is a set of random states for bootstrapping.
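One way to read the annotations on these slides: treat one A/B-test policy as a pseudo-evaluation policy, compare each estimator's OPE estimate (computed on bootstrap resamples of the other policy's data) against the on-policy value, and average the squared errors over the random states $S$. The sketch below follows that reading. The toy policies, the two candidate estimators, and every name in the code are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def ips(data, pi_eval):
    """IPS estimate of pi_eval's value from logged data (context-free toy)."""
    weight = pi_eval[data["action"]] / data["pscore"]
    return float(np.mean(weight * data["reward"]))


def naive_average(data, pi_eval):
    """Deliberately biased candidate: ignores the policy shift entirely."""
    return float(np.mean(data["reward"]))


def estimate_mse(estimator, data_b, pi_a, v_on_a, random_states):
    """Average squared gap between bootstrapped OPE estimates and the on-policy value."""
    n = len(data_b["reward"])
    sq_errors = []
    for s in random_states:
        rng = np.random.default_rng(s)
        idx = rng.integers(0, n, size=n)  # bootstrap resample of D_B
        boot = {key: value[idx] for key, value in data_b.items()}
        sq_errors.append((estimator(boot, pi_a) - v_on_a) ** 2)
    return float(np.mean(sq_errors))


def collect(pi, q_true, n, rng):
    """Simulate logged bandit feedback under policy pi."""
    action = rng.choice(len(pi), size=n, p=pi)
    reward = rng.binomial(1, q_true[action]).astype(float)
    return {"action": action, "reward": reward, "pscore": pi[action]}


rng = np.random.default_rng(0)
q_true = np.array([0.2, 0.5, 0.8])
pi_a = np.array([0.1, 0.2, 0.7])  # pseudo-evaluation policy
pi_b = np.array([0.6, 0.3, 0.1])  # pseudo-behavior policy
data_a = collect(pi_a, q_true, 5000, rng)
data_b = collect(pi_b, q_true, 5000, rng)
v_on_a = float(np.mean(data_a["reward"]))  # on-policy value of pi_a

candidates = {"ips": ips, "naive": naive_average}
mse_hat = {
    name: estimate_mse(fn, data_b, pi_a, v_on_a, random_states=range(20))
    for name, fn in candidates.items()
}
selected = min(mse_hat, key=mse_hat.get)
print(mse_hat, selected)
```

Note that the pseudo-evaluation policy here is fixed by whatever A/B tests happened to be run, so the estimated MSE reflects the shift between `pi_b` and `pi_a`, not the shift in the OPE task one actually cares about.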

Slide 30

Slide 30 text

Does the non-adaptive heuristic work?
Do these estimators really work well?
Figure labels: non-adaptive heuristic (estimation), $\hat{V}(\pi_A; D_B)$.

Slide 31

Slide 31 text

Does the non-adaptive heuristic work?
Do these estimators really work well?
The non-adaptive heuristic does not consider the differences among OPE tasks.
Figure labels: non-adaptive heuristic (estimation), $\hat{V}(\pi_A; D_B)$, true performance, $\pi_b$, $\pi_A$.

Slide 32

Slide 32 text

Does the non-adaptive heuristic work?
Do these estimators really work well?
The non-adaptive heuristic does not consider the differences among OPE tasks.
How can we choose OPE estimators adaptively to the given OPE task (e.g., the evaluation policy)?
Figure labels: non-adaptive heuristic (estimation), $\hat{V}(\pi_A; D_B)$, true performance, $\pi_b$, $\pi_A$.

Slide 33

Slide 33 text

PAS-IF
Policy Adaptive Estimator Selection via Importance Fitting

Slide 34

Slide 34 text

Is it possible to make pseudo-policies adaptive?
The non-adaptive heuristic calculates the MSE using two datasets collected by A/B tests.
Figure labels: total amount of logged data ($\sim \pi_b$), pseudo-behavior policy ($\sim \pi_B$), pseudo-evaluation policy ($\sim \pi_A$).

Slide 35

Slide 35 text

Is it possible to make pseudo-policies adaptive?
The non-adaptive heuristic calculates the MSE using two datasets collected by A/B tests.
Figure labels: total amount of logged data ($\sim \pi_b$), pseudo-behavior policy ($\sim \pi_B$, behavior), pseudo-evaluation policy ($\sim \pi_A$, evaluation).

Slide 36

Slide 36 text

Is it possible to make pseudo-policies adaptive?
The non-adaptive heuristic calculates the MSE using two datasets collected by A/B tests.
We aim to split the logged data adaptively to the given OPE task.
Figure labels: total amount of logged data ($\sim \pi_b$); non-adaptive split: pseudo-behavior policy ($\sim \pi_B$), pseudo-evaluation policy ($\sim \pi_A$); adaptive split: pseudo-behavior policy ($\sim \tilde{\pi}_b$), pseudo-evaluation policy ($\sim \tilde{\pi}_e$).

Slide 37

Slide 37 text

Subsampling function controls the pseudo-policies
We now introduce a subsampling function $\rho$.
Figure labels: total amount of logged data ($\sim \pi_b$), pseudo-behavior policy ($\sim \tilde{\pi}_b$), pseudo-evaluation policy ($\sim \tilde{\pi}_e$).
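A minimal sketch of how a subsampling function can control the pseudo-policies, under the assumption that $\rho(a) \in (0, 1)$ is the probability of sending a logged sample with action $a$ to the pseudo-evaluation dataset (the remainder goes to the pseudo-behavior dataset). By Bayes' rule the two subsets then follow $\tilde{\pi}_e(a) \propto \pi_b(a)\rho(a)$ and $\tilde{\pi}_b(a) \propto \pi_b(a)(1-\rho(a))$. The context-free setting, the specific numbers, and all names are assumptions for this example; the slides do not show the exact definition.

```python
import numpy as np


def induced_pseudo_policies(pi_b, rho):
    """Action distributions of the two subsets implied by subsampling with rho."""
    mean_rho = float(np.sum(pi_b * rho))
    pi_e_tilde = pi_b * rho / mean_rho
    pi_b_tilde = pi_b * (1.0 - rho) / (1.0 - mean_rho)
    return pi_e_tilde, pi_b_tilde


def split_logged_data(action, reward, rho, rng):
    """Send each logged sample to the pseudo-evaluation set with probability rho[action]."""
    to_eval = rng.random(len(action)) < rho[action]
    pseudo_eval = {"action": action[to_eval], "reward": reward[to_eval]}
    pseudo_behavior = {"action": action[~to_eval], "reward": reward[~to_eval]}
    return pseudo_eval, pseudo_behavior


rng = np.random.default_rng(0)
n = 50_000
pi_b = np.array([0.6, 0.3, 0.1])    # behavior policy (toy, context-free)
rho = np.array([0.1, 0.4, 0.9])     # hypothetical subsampling probabilities
q_true = np.array([0.2, 0.5, 0.8])  # expected reward per action (toy)

action = rng.choice(3, size=n, p=pi_b)
reward = rng.binomial(1, q_true[action]).astype(float)

pseudo_eval, pseudo_behavior = split_logged_data(action, reward, rho, rng)
pi_e_tilde, pi_b_tilde = induced_pseudo_policies(pi_b, rho)

# Empirical action frequencies of each subset should match the induced policies.
freq_eval = np.bincount(pseudo_eval["action"], minlength=3) / len(pseudo_eval["action"])
freq_behavior = np.bincount(pseudo_behavior["action"], minlength=3) / len(pseudo_behavior["action"])
print(pi_e_tilde, freq_eval)
print(pi_b_tilde, freq_behavior)
```

Changing `rho` changes both pseudo-policies at once, which is the lever that makes the split adaptive to a given OPE task.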

Slide 39

Slide 39 text

How to optimize the subsampling function?
PAS-IF optimizes the subsampling function $\rho$ to reproduce the bias-variance tradeoff of the original OPE.

Slide 42

Slide 42 text

Key contribution of PAS-IF
PAS-IF enables MSE estimation that is:
• Data-driven -> by splitting the logged data into pseudo datasets
• Adaptive -> by optimizing the subsampling function to simulate the distribution shift of the original OPE task
-> Accurate Estimator Selection!

Slide 43

Slide 43 text

Synthetic Experiment

Slide 45

Slide 45 text

Experimental settings
We compare PAS-IF and the non-adaptive heuristic in two tasks:
1. Estimator Selection
2. Policy Selection using the selected estimator
Task 1 (Estimator Selection): hyperparameter tuning*, followed by estimator selection.
* SLOPE [Su+,20b] [Tucker&Lee,21]

Slide 46

Slide 46 text

PAS-IF enables an accurate estimator selection
PAS-IF enables far more accurate estimator selection by being adaptive.
PAS-IF is accurate across various evaluation policies.
Figure: lower is better; panels for $\pi_{b1}$ and $\pi_{b2}$; $\hat{m}$ denotes the selected estimator and $m^{*}$ the true best estimator.

Slide 48

Slide 48 text

Experimental settings
We compare PAS-IF and the non-adaptive heuristic in two tasks:
1. Estimator Selection
2. Policy Selection using the selected estimator
PAS-IF: a different estimator for each policy ($\hat{V}_1$, $\hat{V}_2$, $\hat{V}_3$).
Non-adaptive heuristic: one universal estimator $\hat{V}$ for all policies.

Slide 49

Slide 49 text

Moreover, PAS-IF also benefits policy selection
PAS-IF also shows favorable results in the policy selection task.
PAS-IF can identify better policies among many candidates by using a different (appropriate) estimator for each policy!
Figure: lower is better; $\hat{\pi}$ denotes the selected policy and $\pi^{*}$ the true best policy.

Slide 50

Slide 50 text

Summary
• Estimator Selection is important to enable an accurate OPE.
• The non-adaptive heuristic fails to adapt to the given OPE task.
• PAS-IF enables adaptive and accurate estimator selection by subsampling and optimizing the pseudo OPE datasets.
PAS-IF will help identify an accurate OPE estimator in practice!

Slide 51

Slide 51 text

Thank you for listening!
Feel free to ask any questions, and discussions are welcome!

Slide 52

Slide 52 text

Example case of importance fitting
When we have … ⇒ PAS-IF can produce a similar distribution shift!
Note: the simplified case of … .

Slide 53

Slide 53 text

Detailed optimization procedure of PAS-IF
We optimize the subsampling rule $\rho_\theta$ via gradient descent.
To keep the data size similar to that of the original OPE task, PAS-IF also imposes a regularization on the data size.
We tune $\lambda$ so that … .
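The exact objective is not recoverable from this slide, so the following is only a sketch of one plausible reading of "importance fitting": choose $\theta$ so that the importance weight induced by the pseudo-policies, $\tilde{\pi}_e / \tilde{\pi}_b$, matches the target weight $\pi_e / \pi_b$, with a penalty weighted by $\lambda$ on the fraction of data sent to the pseudo-evaluation set. The context-free setting, the sigmoid parameterization, the squared-error loss, the penalty form with the hypothetical `target_frac`, and the finite-difference gradient are all assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pseudo_weight(theta, pi_b):
    """Importance weight pi_e_tilde / pi_b_tilde induced by rho = sigmoid(theta)."""
    rho = sigmoid(theta)
    mean_rho = np.sum(pi_b * rho)
    pi_e_tilde = pi_b * rho / mean_rho
    pi_b_tilde = pi_b * (1.0 - rho) / (1.0 - mean_rho)
    return pi_e_tilde / pi_b_tilde, float(mean_rho)


def loss(theta, pi_b, w_target, lam, target_frac):
    """Squared importance-weight gap plus a data-size penalty (assumed form)."""
    w_tilde, mean_rho = pseudo_weight(theta, pi_b)
    fit = np.sum(pi_b * (w_tilde - w_target) ** 2)
    return float(fit + lam * (mean_rho - target_frac) ** 2)


def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences; stands in for automatic differentiation."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad


def fit_theta(pi_b, pi_e, lam=1.0, target_frac=0.2, lr=0.02, n_steps=3000):
    """Plain gradient descent on the loss above, starting from rho = 0.5."""
    w_target = pi_e / pi_b
    theta = np.zeros(len(pi_b))

    def objective(t):
        return loss(t, pi_b, w_target, lam, target_frac)

    init_loss = objective(theta)
    for _ in range(n_steps):
        theta -= lr * numerical_grad(objective, theta)
    return theta, init_loss, objective(theta)


pi_b = np.array([0.5, 0.3, 0.2])  # behavior policy (toy)
pi_e = np.array([0.2, 0.3, 0.5])  # evaluation policy (toy)
theta, init_loss, final_loss = fit_theta(pi_b, pi_e)
w_tilde, mean_rho = pseudo_weight(theta, pi_b)
print(init_loss, final_loss)
print(pi_e / pi_b, w_tilde, mean_rho)
```

In this toy parameterization the weight gap alone can only be driven toward zero by shrinking the pseudo-evaluation set, so some control on the data size is needed to keep both pseudo datasets usable; this is consistent with the slide's remark about regularizing the data size, though the authors' actual regularizer may differ.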

Slide 54

Slide 54 text

Key idea of PAS-IF
How about sampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to the given OPE task?
($S$ is a set of random states for bootstrapping)

Slide 55

Slide 55 text

References

Slide 56

Slide 56 text

References (1/4)
[Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
[Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834

Slide 57

Slide 57 text

References (2/4)
[Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
[Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
[Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html

Slide 58

Slide 58 text

References (3/4)
[Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
[Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/1907.09623
[Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
[Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536

Slide 59

Slide 59 text

References (4/4)
[Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
[Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703