[AAAI'23] Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Policy Adaptive Estimator Selection for Off-Policy Evaluation
Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno
Haruka Kiyohara, Tokyo Institute of Technology
https://sites.google.com/view/harukakiyohara
February 2023, Policy Adaptive Estimator Selection @ AAAI2023

Content
• Introduction to Off-Policy Evaluation (OPE)
• Estimator Selection for OPE
• Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF)
• Synthetic Experiments
  • Estimator Selection
  • Policy Selection

Off-Policy Evaluation: Motivation towards Estimator Selection

Interactions in recommender systems
A behavior policy 𝜋𝑏 interacts with users and collects logged data (logged bandit feedback). Each interaction consists of:
• a coming user (context)
• an item (action)
• a user feedback (reward)

Off-Policy Evaluation
The goal is to evaluate the performance of an evaluation policy 𝜋𝑒 offline, using only the logged bandit feedback collected by the behavior policy 𝜋𝑏.
[Diagram: offline A/B test. Logged bandit feedback from the behavior policy 𝜋𝑏 → OPE estimator → estimated policy performance]

Representative OPE estimators
We aim to reduce both bias and variance to enable an accurate OPE.
• Direct Method (DM) [Beygelzimer&Langford,09]: model-based (uses a reward predictor), not importance sampling-based; bias: high; variance: low
• Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10]: importance sampling-based (uses the importance weight), not model-based; bias: unbiased; variance: very high
• Doubly Robust (DR) [Dudík+,14]: both model-based and importance sampling-based (the reward predictor acts as a control variate); bias: unbiased; variance: lower than IPS, but still high
The importance weight is the ratio of the evaluation policy to the behavior policy, 𝜋𝑒(a|x) / 𝜋𝑏(a|x).
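The bias-variance contrast among DM, IPS, and DR can be sketched on toy data. This is a minimal illustration, not the authors' code: it assumes a context-free bandit with three actions, and the reward means `q`, the deliberately biased reward predictor `q_hat`, and both policies are made-up values.

```python
import numpy as np

# Assumed toy setup (not from the slides): 3 actions, no context.
q = np.array([0.2, 0.5, 0.8])        # true expected reward per action
q_hat = np.array([0.3, 0.4, 0.6])    # hypothetical, biased reward predictor
pi_b = np.array([0.6, 0.3, 0.1])     # behavior policy
pi_e = np.array([0.1, 0.3, 0.6])     # evaluation policy

def dm(q_hat, pi_e):
    """Direct Method: average the predicted reward under pi_e."""
    return float(np.sum(pi_e * q_hat))

def ips(a, r, pi_b, pi_e):
    """Inverse Propensity Scoring: reweight observed rewards by pi_e / pi_b."""
    w = pi_e[a] / pi_b[a]
    return float(np.mean(w * r))

def dr(a, r, pi_b, pi_e, q_hat):
    """Doubly Robust: DM baseline plus an importance-weighted residual correction."""
    w = pi_e[a] / pi_b[a]
    return float(np.sum(pi_e * q_hat) + np.mean(w * (r - q_hat[a])))

# Logged bandit feedback collected by the behavior policy.
rng = np.random.default_rng(0)
n = 200_000
a = rng.choice(3, size=n, p=pi_b)
r = rng.binomial(1, q[a]).astype(float)

true_value = float(np.sum(pi_e * q))
v_dm = dm(q_hat, pi_e)
v_ips = ips(a, r, pi_b, pi_e)
v_dr = dr(a, r, pi_b, pi_e, q_hat)
print(f"true={true_value:.3f} DM={v_dm:.3f} IPS={v_ips:.3f} DR={v_dr:.3f}")
```

With a large sample, DM keeps the bias of its reward predictor, while IPS and DR concentrate around the true value; with small samples the importance weights make IPS and DR much noisier.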

Advanced OPE estimators
To reduce the variance of IPS/DR, many OPE estimators have been proposed. Each modifies the importance weights:
• Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
• Clipped (IPS/DR) * [Su+,20a]
• Switch (DR) * [Wang+,17]
• Optimistic Shrinkage (DR) * [Su+,20a]
• Subgaussian (IPS/DR) * [Metelli+,21]
* requires tuning of the hyperparameter 𝜆, e.g., with SLOPE [Su+,20b] [Tucker&Lee,21]
Which OPE estimator should be used to enable an accurate OPE?
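Two of these weight modifications are simple enough to sketch directly. The snippet below is an illustrative assumption of the common forms (self-normalization divides by the sum of weights; clipping caps each weight at 𝜆), not a reproduction of any specific library; the function names and toy arrays are hypothetical.

```python
import numpy as np

def ips_from_weights(w, r):
    """Plain IPS given precomputed importance weights w = pi_e / pi_b."""
    return float(np.mean(w * r))

def snips(w, r):
    """Self-normalized IPS: normalize by the sum of weights instead of n."""
    return float(np.sum(w * r) / np.sum(w))

def clipped_ips(w, r, lam):
    """Clipped IPS: cap every importance weight at the hyperparameter lam."""
    return float(np.mean(np.minimum(w, lam) * r))

# Hypothetical toy values for three logged samples.
w = np.array([0.5, 2.0, 4.0])
r = np.array([1.0, 0.0, 1.0])

print(ips_from_weights(w, r))     # large weights dominate
print(snips(w, r))                # bounded by the reward range
print(clipped_ips(w, r, lam=1.0)) # lower variance, but biased for small lam
```

A smaller 𝜆 trades variance for bias, which is why these estimators need a tuning procedure before they can be compared with each other.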

Motivation towards data-driven estimator selection
[Figure: estimator accuracy for given 𝜋𝑏 and 𝜋𝑒; annotations "among the best" and "among the worst"]
Estimator Selection is important! But the best estimator can be different under different situations: an estimator that is among the best in one setting can be among the worst in another. The following matter:
• data size
• evaluation policy
• reward noise
Estimator Selection: How to identify the most accurate OPE estimator using only the available logged data?

Estimator Selection for OPE

Objective for estimator selection
The goal is to identify the most accurate OPE estimator in terms of MSE. The MSE measures the squared gap between the true policy value (the estimand) and the OPE estimate, which is estimated from the logged data.
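The objective can be written out as follows. This is a standard formulation given as a sketch: the candidate set \mathcal{M} and the dataset symbol \mathcal{D} are notational assumptions, while m and m^{*} follow the labels used in the experiment slides.

```latex
\mathrm{MSE}\bigl(\hat{V}_m; \pi_e\bigr)
  = \mathbb{E}_{\mathcal{D}}\Bigl[\bigl(V(\pi_e) - \hat{V}_m(\pi_e; \mathcal{D})\bigr)^{2}\Bigr],
\qquad
m^{*} = \operatorname*{arg\,min}_{m \in \mathcal{M}} \; \mathrm{MSE}\bigl(\hat{V}_m; \pi_e\bigr)
```

Because the true policy value V(\pi_e) is unknown, the MSE itself has to be estimated, which is the problem the remaining slides address.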

Baseline: non-adaptive heuristic [Saito+,21a] [Saito+,21b]
Suppose we have logged data from previous A/B tests. One of the logging policies is treated as a pseudo-evaluation policy, and the MSE of each estimator is estimated by comparing its OPE estimate with the on-policy policy value of that pseudo-evaluation policy.
※ 𝑆 is a set of random states for bootstrapping.
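The heuristic can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: data is context-free, `pi_A` plays the pseudo-evaluation policy, `pi_B` the pseudo-behavior policy, the on-policy value is the mean reward in the data collected by `pi_A`, and `seeds` stands in for the set 𝑆 of bootstrap random states. All names and toy values are hypothetical.

```python
import numpy as np

def ips(a, r, pi_log, pi_target):
    w = pi_target[a] / pi_log[a]
    return float(np.mean(w * r))

def snips(a, r, pi_log, pi_target):
    w = pi_target[a] / pi_log[a]
    return float(np.sum(w * r) / np.sum(w))

def estimate_mse_non_adaptive(estimator, data_A, data_B, pi_A, pi_B, seeds):
    """Average squared gap between the OPE estimate on bootstrapped D_B
    and the on-policy value of pi_A computed from D_A."""
    a_A, r_A = data_A
    a_B, r_B = data_B
    v_on = float(np.mean(r_A))  # on-policy policy value of pi_A
    sq_errors = []
    for s in seeds:
        rng = np.random.default_rng(s)
        idx = rng.integers(0, len(a_B), size=len(a_B))  # bootstrap resample
        v_hat = estimator(a_B[idx], r_B[idx], pi_B, pi_A)
        sq_errors.append((v_hat - v_on) ** 2)
    return float(np.mean(sq_errors))

def collect(pi, q, n, seed):
    """Simulate logged bandit feedback under policy pi (toy helper)."""
    rng = np.random.default_rng(seed)
    a = rng.choice(len(pi), size=n, p=pi)
    r = rng.binomial(1, q[a]).astype(float)
    return a, r

q = np.array([0.2, 0.5, 0.8])
pi_A = np.array([0.2, 0.3, 0.5])
pi_B = np.array([0.5, 0.3, 0.2])
data_A = collect(pi_A, q, 2000, seed=1)
data_B = collect(pi_B, q, 2000, seed=2)

candidates = {"IPS": ips, "SNIPS": snips}
mse_hat = {
    name: estimate_mse_non_adaptive(f, data_A, data_B, pi_A, pi_B, seeds=range(20))
    for name, f in candidates.items()
}
selected = min(mse_hat, key=mse_hat.get)
print(mse_hat, "selected:", selected)
```

Note that nothing in this procedure depends on the evaluation policy of the actual OPE task: the same estimator is selected regardless of which 𝜋𝑒 is to be evaluated.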

Does non-adaptive heuristic work?
Do these estimators really work well?
[Figure: non-adaptive heuristic (estimation), based on V̂(𝜋𝐴; 𝐷𝐵), versus true performance; labels 𝜋𝑏, 𝜋𝐴]
Non-adaptive heuristic does not consider the difference among OPE tasks.
How to choose OPE estimators adaptively to the given OPE task (e.g., evaluation policy)?

PAS-IF: Policy Adaptive Estimator Selection via Importance Fitting

Is it possible to make pseudo-policies adaptive?
Non-adaptive heuristic calculates MSE using two datasets collected by A/B tests: out of the total amount of logged data (~𝜋𝑏), the data collected by 𝜋𝐵 (~𝜋𝐵) serves as the pseudo-behavior policy data and the data collected by 𝜋𝐴 (~𝜋𝐴) as the pseudo-evaluation policy data, standing in for the behavior and evaluation policies of the original OPE task.
We aim to split the logged datasets adaptively to the given OPE task, into pseudo-behavior policy data (~𝜋̃𝑏) and pseudo-evaluation policy data (~𝜋̃𝑒).

Subsampling function controls the pseudo-policies
We now introduce a subsampling function 𝜌. It splits the total amount of logged data (~𝜋𝑏) into pseudo-behavior policy data (~𝜋̃𝑏) and pseudo-evaluation policy data (~𝜋̃𝑒).
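How a subsampling function induces pseudo-policies can be sketched as follows. This is a minimal illustration under assumptions: data is context-free, `rho[a]` is the probability that a logged sample with action `a` is sent to the pseudo-evaluation set, and the induced policies are obtained by reweighting 𝜋𝑏 and renormalizing. The function names and toy values are hypothetical.

```python
import numpy as np

def pseudo_policies(pi_b, rho):
    """Policies induced by the split: rho[a] sends a sample to the
    pseudo-evaluation set, 1 - rho[a] keeps it in the pseudo-behavior set."""
    z = np.sum(pi_b * rho)                    # expected share of pseudo-evaluation data
    pi_e_tilde = pi_b * rho / z
    pi_b_tilde = pi_b * (1.0 - rho) / (1.0 - z)
    return pi_e_tilde, pi_b_tilde

def subsample(a, r, rho, seed=0):
    """Randomly split logged (action, reward) pairs according to rho."""
    rng = np.random.default_rng(seed)
    to_eval = rng.random(len(a)) < rho[a]
    return (a[to_eval], r[to_eval]), (a[~to_eval], r[~to_eval])

pi_b = np.array([0.6, 0.3, 0.1])
rho = np.array([0.1, 0.3, 0.6])   # hypothetical subsampling probabilities

rng = np.random.default_rng(42)
n = 100_000
a = rng.choice(3, size=n, p=pi_b)
r = rng.binomial(1, 0.5, size=n).astype(float)

pi_e_tilde, pi_b_tilde = pseudo_policies(pi_b, rho)
(a_eval, r_eval), (a_beh, r_beh) = subsample(a, r, rho)
freq_eval = np.bincount(a_eval, minlength=3) / len(a_eval)
print("induced pseudo-evaluation policy:", pi_e_tilde)
print("empirical action frequencies    :", freq_eval)
```

Changing `rho` changes how far the pseudo-evaluation policy sits from the pseudo-behavior policy, which is the handle an adaptive method can tune.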

How to optimize the subsampling function?
PAS-IF optimizes the subsampling function 𝜌 with the objective of importance fitting, so that the pseudo-policies reproduce the bias-variance tradeoff of the original OPE.
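One way to formalize importance fitting is to match the importance weights between the pseudo-policies to those of the original task. The equation on the slide did not survive in this transcript, so the squared-error form below is an assumption made for illustration; \tilde{\pi}_e and \tilde{\pi}_b denote the pseudo-policies induced by \rho.

```latex
w(x, a) = \frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}, \qquad
\tilde{w}_{\rho}(x, a) = \frac{\tilde{\pi}_e(a \mid x)}{\tilde{\pi}_b(a \mid x)}, \qquad
\min_{\rho} \; \mathbb{E}_{p(x)\,\pi_b(a \mid x)}
  \Bigl[\bigl(w(x, a) - \tilde{w}_{\rho}(x, a)\bigr)^{2}\Bigr]
```

When the two sets of weights are close, the pseudo OPE task exhibits a distribution shift similar to the original one, so an estimator's bias and variance on the pseudo task are informative about the real task.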

Key contribution of PAS-IF
PAS-IF enables MSE estimation that is:
• Data-driven: by splitting the logged data into pseudo datasets
• Adaptive: by optimizing the subsampling function to simulate the distribution shift of the original OPE task
→ Accurate Estimator Selection!

Synthetic Experiment

Experimental settings
We compare PAS-IF and non-adaptive heuristic in two tasks.
1. Estimator Selection
2. Policy Selection using the selected estimator
In task 1, hyperparameter tuning* of each estimator is followed by estimator selection.
* SLOPE [Su+,20b] [Tucker&Lee,21]

PAS-IF enables an accurate estimator selection
PAS-IF enables far more accurate estimator selection by being adaptive. PAS-IF is accurate across various evaluation policies.
[Figure: estimator selection performance, lower is better; panel labels 𝜋𝑏1 and 𝜋𝑏2; m̂ is the selected estimator and m* is the true best estimator]

In the estimator selection step that follows hyperparameter tuning:
• PAS-IF: a different estimator for each policy (V̂1, V̂2, V̂3)
• non-adaptive: a universal estimator V̂ for all policies

Moreover, PAS-IF also benefits policy selection
PAS-IF also shows a favorable result in the policy selection task. PAS-IF can identify better policies among many candidates by using a different (appropriate) estimator for each policy!
[Figure: policy selection performance, lower is better; 𝜋̂ is the selected policy and 𝜋* is the true best policy]

Summary
• Estimator Selection is important to enable an accurate OPE.
• Non-adaptive heuristic fails to be adaptive to the given OPE task.
• PAS-IF enables an adaptive and accurate estimator selection by subsampling and optimizing the pseudo OPE datasets.
PAS-IF will help identify an accurate OPE estimator in practice!

Thank you for listening!
Feel free to ask any questions, and discussions are welcome!

Example case of importance fitting
When we have [equation] ⇒ PAS-IF can produce a similar distribution shift!
Note: the simplified case of [equation].

Detailed optimization procedure of PAS-IF
We optimize the subsampling rule 𝜌𝜃 via gradient descent. To keep the data size similar to that of the original OPE task, PAS-IF also imposes a regularization on the data size. We tune 𝜆 so that [equation].
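The procedure can be sketched on a toy problem. Everything below is an assumption made for illustration rather than the authors' implementation: the setting is context-free, 𝜌𝜃 is a sigmoid of one parameter per action, the fitting term is the squared gap between the original and pseudo importance weights, the regularizer is the squared gap between the expected pseudo-evaluation share and a hypothetical target `k`, and gradients are taken by finite differences with backtracking.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fitting_loss(theta, pi_b, pi_e, lam, k):
    """Importance-fitting term plus a data-size regularizer (assumed forms)."""
    rho = sigmoid(theta)
    z = np.sum(pi_b * rho)                       # expected pseudo-evaluation share
    w = pi_e / pi_b                              # original importance weights
    w_tilde = (rho / (1.0 - rho)) * ((1.0 - z) / z)  # pseudo importance weights
    fit = np.sum(pi_b * (w - w_tilde) ** 2)
    reg = (z - k) ** 2
    return float(fit + lam * reg)

def numerical_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

def fit_subsampling(pi_b, pi_e, lam=1.0, k=0.2, n_steps=200, lr=0.5):
    """Gradient descent with backtracking so that every accepted step lowers the loss."""
    def f(th):
        return fitting_loss(th, pi_b, pi_e, lam, k)
    theta = np.zeros(len(pi_b))
    for _ in range(n_steps):
        g = numerical_grad(f, theta)
        step = lr
        # Halve the step until the loss strictly decreases (also rejects NaN/inf).
        while step > 1e-8 and not (f(theta - step * g) < f(theta)):
            step *= 0.5
        if step <= 1e-8:
            break
        theta = theta - step * g
    return theta

pi_b = np.array([0.6, 0.3, 0.1])
pi_e = np.array([0.1, 0.3, 0.6])
theta0 = np.zeros(3)
theta = fit_subsampling(pi_b, pi_e)
loss_before = fitting_loss(theta0, pi_b, pi_e, lam=1.0, k=0.2)
loss_after = fitting_loss(theta, pi_b, pi_e, lam=1.0, k=0.2)
print("rho:", sigmoid(theta), "loss:", loss_before, "->", loss_after)
```

In this sketch a larger 𝜆 keeps the pseudo-evaluation share closer to `k` at the cost of a looser fit to the original importance weights.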

Key idea of PAS-IF
How about sampling the logged data and constructing a pseudo-evaluation policy that has a bias-variance tradeoff similar to the given OPE task?
(𝑆 is a set of random states for bootstrapping)

References

References (1/4)
[Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
[Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834

References (2/4)
[Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
[Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
[Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html

References (3/4)
[Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
[Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/2002.07729
[Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
[Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536

References (4/4)
[Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/2008.07146
[Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703
