Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML'22)

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score weighting – degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.

usaito

July 04, 2022

Transcript

  1. Off-Policy Evaluation
    for Large Action Spaces via Embeddings
    (ICML2022)
    Yuta Saito and Thorsten Joachims

  2. Outline
    ● Standard Off-Policy Evaluation for Contextual Bandits
    ● Issues of Existing Estimators for Large Action Spaces
    ● New Framework and Estimator using Action Embeddings
    ● Some Experimental Results
    How can we best utilize auxiliary information about actions for offline evaluation?
    (the typical importance weighting approach fails in large action spaces)

  3. Machine Learning for Decision Making (Bandit / RL)
    We often use machine learning to make decisions, not predictions
    incoming user
    Decision Making: item recommendation
    clicks
    policy
    Ultimate Goal:
    reward maximization, not CTR prediction

  4. Many Applications of “Machine Decision Making”
    How can we evaluate the performance of a new decision-making
    policy using only data collected by a logging (past) policy?
    Motivation of OPE
    ● video recommendation (Youtube)
    ● playlist recommendation (Spotify)
    ● artwork personalization (Netflix)
    ● ad allocation optimization (Criteo)

  5. Many Applications of “Machine Decision Making”
    ● video recommendation (Youtube)
    ● playlist recommendation (Spotify)
    ● artwork personalization (Netflix)
    ● ad allocation optimization (Criteo)
    How can we evaluate the performance of a new decision-making
    policy using only data collected by a logging (past) policy?
    Large Action Spaces
    thousands/millions
    (or even more) of actions
    Motivation of OPE

  6. Data Generating Process (contextual bandit setting)
    the logging policy interacts with the environment
    and produces the log data:
    Observe context (user info)
    A “logging” policy picks an action (movie recommendation)
    Observe reward (clicks, conversions, etc.)

  7. Off-Policy Evaluation: Logged Bandit Data
    We are given logged bandit data collected by the logging policy \pi_0:
    \mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^n, where x \sim p(x) (unknown),
    a \sim \pi_0(a \mid x) (known), and r \sim p(r \mid x, a) (unknown)

  8. Off-Policy Evaluation: Goal
    Our goal is to estimate the value/performance of evaluation policy
    value of eval policy
    (our “estimand”)
    = expected reward we get when we (hypothetically) implement the eval policy
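    In the usual contextual bandit notation, the estimand is
    V(\pi_e) := \mathbb{E}_{p(x)\,\pi_e(a|x)\,p(r|x,a)}[r]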

  9. Off-Policy Evaluation: Goal
    Technically, our goal is to develop an accurate estimator \hat{V} such that
    \hat{V}(\pi_e; \mathcal{D}) \approx V(\pi_e) (the estimand),
    where the logging policy \pi_0 is different from the evaluation policy \pi_e

  10. Off-Policy Evaluation: Goal
    An estimator’s accuracy is quantified by its mean squared error (MSE)
    Bias and Variance are equally
    important for a small MSE
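    For reference, the standard decomposition behind this slide is
    \mathrm{MSE}(\hat{V}) := \mathbb{E}_{\mathcal{D}}\big[(V(\pi_e) - \hat{V}(\pi_e; \mathcal{D}))^2\big] = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V})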

  11. The Inverse Propensity Score (IPS) Estimator
    IPS uses importance weighting to estimate the policy value without bias:
    \hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w(x_i, a_i)\, r_i,
    where w(x,a) := \pi_e(a \mid x) / \pi_0(a \mid x) is the (vanilla) importance weight
    Very easy to implement, with some nice statistical properties;
    popular in practice, and the basis of many advanced estimators
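    As a minimal illustrative sketch (not the paper's implementation; array names are assumptions), IPS can be computed as:

    ```python
    import numpy as np

    def ips_estimate(reward, pi_e_prob, pi_0_prob):
        """Vanilla IPS estimate of the policy value.

        reward:     (n,) observed rewards r_i
        pi_e_prob:  (n,) evaluation-policy probabilities pi_e(a_i | x_i)
        pi_0_prob:  (n,) logging-policy probabilities pi_0(a_i | x_i)
        """
        w = pi_e_prob / pi_0_prob   # vanilla importance weights w(x_i, a_i)
        return np.mean(w * reward)  # sample average of weighted rewards
    ```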

  12. Inverse Propensity Score (IPS): Unbiased OPE
    IPS is unbiased (bias = zero) in the sense that \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common support assumption (checkable):
    \pi_e(a \mid x) > 0 \Rightarrow \pi_0(a \mid x) > 0
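    The (standard) one-line argument, writing q(x, a) := \mathbb{E}[r \mid x, a] for the expected reward:
    \mathbb{E}[w(x,a)\, r] = \mathbb{E}_{p(x)}\Big[\sum_{a} \pi_0(a|x)\, \tfrac{\pi_e(a|x)}{\pi_0(a|x)}\, q(x,a)\Big] = \mathbb{E}_{p(x)}\Big[\sum_{a} \pi_e(a|x)\, q(x,a)\Big] = V(\pi_e),
    where common support guarantees that the importance weight is well defined wherever \pi_e puts probability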

  13. IPS is inaccurate with growing number of actions
    [Plot: MSE vs. an increasing number of actions, n=3,000; DM shown as a “biased” baseline]
    IPS gets significantly worse as the number of actions grows.
    This is simply due to the use of the vanilla importance weight:
    the max importance weight grows with the number of actions.

  14. Recent Advances: Combining DM and IPS
    More Advanced Estimators
    ● Doubly Robust (DR) [Dudik+11,14]
    ● Switch DR [Wang+17]
    ● DR with Optimistic Shrinkage [Su+20]
    ● DR with \lambda-smoothing [Metelli+21]
    etc..
    All of these heavily rely on the vanilla importance weight,
    so they still suffer from high variance or
    introduce a large bias
    (losing unbiasedness and consistency)

  15. Simply combining IPS and DM is not enough for large action spaces
    Advanced estimators still have the same issues:
    they are still using the vanilla importance weight, the source of high variance
    [Plot: MSE vs. number of actions]
    DR suffers from a very large variance;
    other estimators work almost the same as DM,
    by aggressively modifying the importance weight
    (losing unbiasedness & consistency)

  16. Recent Advances: Combining DM and IPS
    More Advanced Estimators
    ● Doubly Robust (DR) [Dudik+11,14]
    ● Switch DR [Wang+17]
    ● DR with Optimistic Shrinkage [Su+20]
    ● DR with \lambda-smoothing [Metelli+21]
    etc..
    All of these heavily rely on the vanilla importance weight,
    so they still suffer from high variance or
    introduce a large bias
    (losing unbiasedness and consistency)
    How can we gain a large variance reduction without sacrificing
    unbiasedness/consistency in large action spaces?
    (we may need to avoid relying on the vanilla importance weight somehow)

  17. Typical Logged Bandit Data
    Contexts | Actions | ??? | ??? | Conversion Rate
    User 1   | Item A  | ??? | ??? | 5%
    User 2   | Item B  | ??? | ??? | 2%
    …        | …       | …   | …   | …
    Example OPE Situation: Product Recommendations

  18. We Should Be Able to Use Some “Action Embeddings”
    Contexts | Actions | Category  | Price | Conversion Rate
    User 1   | Item A  | Books     | $20   | 5%
    User 2   | Item B  | Computers | $500  | 2%
    …        | …       | …         | …     | …
    Example OPE Situation: Product Recommendations

  19. Idea: Auxiliary Information about the Actions
    Key idea: why not leverage auxiliary information about the actions?
    we additionally observe
    action embeddings
    typical logged
    bandit data for OPE
    logged bandit data
    w/ action embeddings

  20. Idea: Auxiliary Information about the Actions
    We naturally generalize the typical DGP as:
    \mathcal{D} = \{(x_i, a_i, e_i, r_i)\}_{i=1}^n, where x \sim p(x) (unknown), a \sim \pi_0(a \mid x) (known),
    e \sim p(e \mid x, a) (unknown), and r \sim p(r \mid x, a, e) (unknown)
    p(e \mid x, a) is the action embedding distribution given context “x” and action “a”;
    it may be context-dependent, stochastic, and continuous

  21. Action Embeddings: Examples
    Contexts | Actions | Category  | Price | Conversion Rate
    User 1   | Item A  | Books     | $20   | 5%
    User 2   | Item B  | Computers | $500  | 2%
    …        | …       | …         | …     | …
    Category:
    ● discrete
    ● context-independent
    ● deterministic
    Price:
    ● continuous
    ● context-dependent
    ● stochastic
    (if price is given by some personalized algorithm)

  22. Idea: Auxiliary Information about the Actions
    We generalize the typical DGP as above, with e \sim p(e \mid x, a) as the action embedding distribution.
    How should we utilize action embeddings for accurate OPE?
    *we assume the existence of some action embeddings and analyze the connection between their
    quality and OPE accuracy; optimizing or learning action embeddings is an interesting direction for future work

  23. A Key Assumption: No Direct Effect
    To construct a new estimator, we make the no direct effect assumption:
    every causal effect of the action “a” on the reward “r”
    must be mediated by the embedding “e”,
    i.e., action embeddings should be informative enough
    (causal graph: action “a” → action embedding “e” → reward “r”)

  24. A Key Assumption: No Direct Effect
    To construct a new estimator, we make the no direct effect assumption;
    this implies p(r \mid x, a, e) = p(r \mid x, e), i.e., r is independent of a given (x, e)

  25. No Direct Effect Assumption: Example
    movie (“a”) | category (“e”) | CVR (“r”)
    Tenet       | SF             | 10%
    Rocky       | Sport          | 5%
    Star Wars   | SF             | 20%
    Moneyball   | Sport          | 30%
    Action embeddings fail to explain the variation in the reward,
    so the assumption is violated
    (there must be some direct effect of “a” on “r”)

  26. No Direct Effect Assumption: Example
    movie (“a”) | category (“e”) | CVR (“r”)
    Tenet       | SF             | 10%
    Rocky       | Sport          | 20%
    Star Wars   | SF             | 10%
    Moneyball   | Sport          | 20%
    Action embeddings fully explain the variation in the reward,
    so the assumption now holds
    (no direct effect of “a” on “r”)

  27. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    a new expression without using the action variable “a”
    embedding “e” is enough to specify the value
    value of eval policy
    (our “estimand”)

  28. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    where p(e \mid x, \pi) is the marginal distribution of
    action embeddings induced by a particular policy \pi
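    Concretely (for discrete embeddings), the marginal embedding distribution and the resulting expression are
    p(e \mid x, \pi) := \sum_{a \in \mathcal{A}} p(e \mid x, a)\, \pi(a \mid x), \qquad V(\pi_e) = \mathbb{E}_{p(x)\, p(e|x,\pi_e)\, p(r|x,e)}[r]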

  29. Marginal Embedding Distribution
    movie (“a”) | \pi(a|x) | category (“e”) | p(e|x,\pi)
    Tenet       | 0.2      | SF             | 0.4
    Rocky       | 0.1      | Sport          | 0.6
    Star Wars   | 0.2      | SF             | 0.4
    Moneyball   | 0.5      | Sport          | 0.6
    Given a policy and the embeddings, the marginal distribution is easily computed:
    e.g., p(SF | x, \pi) = 0.2 + 0.2 = 0.4 and p(Sport | x, \pi) = 0.1 + 0.5 = 0.6
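    A tiny sketch reproducing the numbers on this slide (deterministic category embeddings; names are illustrative):

    ```python
    import numpy as np

    # policy probabilities pi(a|x) for the four movies, in the slide's order
    pi = np.array([0.2, 0.1, 0.2, 0.5])  # Tenet, Rocky, Star Wars, Moneyball

    # deterministic embedding p(e|a): rows = movies, columns = categories [SF, Sport]
    p_e_given_a = np.array([[1, 0],   # Tenet     -> SF
                            [0, 1],   # Rocky     -> Sport
                            [1, 0],   # Star Wars -> SF
                            [0, 1]])  # Moneyball -> Sport

    # marginal embedding distribution p(e|x, pi) = sum_a p(e|a) * pi(a|x)
    print(pi @ p_e_given_a)  # [0.4 0.6] -> SF: 0.4, Sport: 0.6
    ```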

  30. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    a new expression without using the action variable “a”
    embedding “e” is enough to specify the value
    value of eval policy
    (our “estimand”)

  31. The Marginalized Inverse Propensity Score (MIPS) Estimator
    The new expression of the policy value leads to the following new estimator,
    Marginalized IPS (MIPS):
    \hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w(x_i, e_i)\, r_i,
    where w(x,e) := p(e \mid x, \pi_e) / p(e \mid x, \pi_0) is the marginal importance weight
    computed over the action embedding space
    -> MIPS ignores the vanilla importance weight of IPS
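    A minimal sketch of MIPS for the case where p(e | x, a) is known and discrete (illustrative only; in general the marginal weights must be estimated, as discussed later):

    ```python
    import numpy as np

    def mips_estimate(reward, emb_idx, pi_e, pi_0, p_e_given_xa):
        """MIPS estimate with a known, discrete embedding distribution.

        reward:        (n,) observed rewards r_i
        emb_idx:       (n,) index of the observed embedding e_i
        pi_e, pi_0:    (n, |A|) action probabilities of the two policies per round
        p_e_given_xa:  (n, |A|, |E|) embedding distribution p(e | x_i, a)
        """
        # marginal embedding distributions p(e | x_i, pi) for both policies
        p_e_pi_e = np.einsum("na,nae->ne", pi_e, p_e_given_xa)
        p_e_pi_0 = np.einsum("na,nae->ne", pi_0, p_e_given_xa)
        rows = np.arange(reward.shape[0])
        # marginal importance weight w(x_i, e_i) at the logged embedding
        w = p_e_pi_e[rows, emb_idx] / p_e_pi_0[rows, emb_idx]
        return np.mean(w * reward)
    ```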

  32. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  33. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  34. MIPS is Unbiased with Alternative Assumptions
    Under the no direct effect assumption, MIPS is unbiased, i.e.,
    \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common embedding support assumption
    (weaker than the common support assumption of IPS)

  35. MIPS is Unbiased with Alternative Assumptions
    Under the no direct effect assumption, MIPS is unbiased, i.e.,
    \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common embedding support assumption
    (weaker than the common support assumption of IPS)
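    The unbiasedness argument mirrors the one for IPS, now over the embedding space, with q(x, e) := \mathbb{E}[r \mid x, e] (well defined under no direct effect; sums become integrals for continuous embeddings):
    \mathbb{E}[w(x,e)\, r] = \mathbb{E}_{p(x)}\Big[\sum_{e} p(e|x,\pi_0)\, \tfrac{p(e|x,\pi_e)}{p(e|x,\pi_0)}\, q(x,e)\Big] = \mathbb{E}_{p(x)}\Big[\sum_{e} p(e|x,\pi_e)\, q(x,e)\Big] = V(\pi_e)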

  36. Bias of MIPS with Violated Assumption (Thm 3.5)
    If the no direct effect assumption is NOT true...
    MIPS is no longer unbiased; its bias is characterized by two factors, (1) and (2)

  37. Bias of MIPS with Violated Assumption (Thm 3.5)
    If the no direct effect assumption is NOT True...
    The bias of MIPS depends on two factors, (1) and (2):
    (1) how identifiable an action is
    from action embeddings
    embeddings should be as informative as possible to reduce the bias

  38. Bias of MIPS with Violated Assumption
    If the no direct effect assumption is NOT True...
    The bias of MIPS depends on two factors, (1) and (2):
    (2) amount of direct effect
    from “a” to “r”
    embeddings should be as informative as possible to reduce the bias

  39. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  40. Variance Reduction by MIPS (Thm 3.6)
    MIPS’s variance is never worse than that of IPS
    Comparing the variance of IPS and MIPS
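    One way to see this: conditioning on (x, e) and applying the law of total variance to the per-sample terms gives (see Thm 3.6 in the paper for the precise statement)
    n\,\big(\mathbb{V}[\hat{V}_{\mathrm{IPS}}] - \mathbb{V}[\hat{V}_{\mathrm{MIPS}}]\big) = \mathbb{E}_{p(x)\, p(e|x,\pi_0)}\Big[\mathbb{E}[r^2 \mid x, e]\; \mathbb{V}_{\pi_0(a|x,e)}\big[w(x,a)\big]\Big] \ge 0,
    so the gain is large exactly when the vanilla weights w(x, a) vary a lot among actions that share the same embedding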

  41. Variance Reduction by MIPS (Thm 3.6)
    We get a large variance reduction when
    ● the vanilla importance weights have a high variance (many actions)
    ● the embeddings are NOT too informative (i.e., \pi_0(a \mid x, e) remains stochastic)
    (the opposite motivation compared to bias reduction)

  42. Bias-Variance Trade-Off of MIPS
    ● To reduce the bias, we should use informative action embeddings
    (high dimensional, high cardinality)
    ● To reduce the variance, we should use coarse/noisy action embeddings
    (low dimensional, low cardinality)
    We do not necessarily have to satisfy the “no direct effect”
    assumption to achieve a small MSE (= Bias^2 + Var)
    (there are some interesting empirical results about this trade-off)

  43. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    where the minimization is over subsets of the available action features
    (the bias term is small when the selected subset is informative; the variance term is small when it is coarse)

  44. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    ● The problem is that estimating the bias is as difficult as OPE itself
    (the bias depends on the true policy value)

  45. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    ● The problem is that estimating the bias is as difficult as OPE itself
    ● So, we adapt “SLOPE” [Su+20] [Tucker+21] to our setup
    ● SLOPE was originally developed to tune hyperparameters of OPE
    and does not need to estimate the bias of the estimator
    (detailed in the paper and in the appendix)

  46. Estimating the Marginal Importance Weights
    ● Even if we know the logging policy, we may have to estimate the marginal
    importance weight w(x, e), because we do not know the true distribution p(e \mid x, a)
    ● A simple procedure is to utilize the following transformation:
    w(x, e) = \mathbb{E}_{\pi_0(a \mid x, e)}[\, \pi_e(a \mid x) / \pi_0(a \mid x) \,]
    So, our task is to estimate \pi_0(a \mid x, e) (e.g., via supervised classification)
    and use it to compute \hat{w}(x, e)
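    A minimal sketch of this procedure (variable names and the choice of classifier are assumptions for illustration, not the paper's implementation):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_marginal_weights(x, a, e, pi_e, pi_0):
        """Estimate w(x_i, e_i) = E_{pi_0(a|x_i,e_i)}[ pi_e(a|x_i) / pi_0(a|x_i) ].

        x: (n, d_x) contexts; a: (n,) logged actions; e: (n, d_e) logged embeddings
        pi_e, pi_0: (n, |A|) action probabilities of the two policies
        (assumes actions are labeled 0..|A|-1 and every action appears in the log)
        """
        # classifier approximating pi_0(a | x, e) via supervised learning
        clf = LogisticRegression(max_iter=1000)
        clf.fit(np.hstack([x, e]), a)
        pi_0_a_given_xe = clf.predict_proba(np.hstack([x, e]))  # (n, |A|)
        vanilla_w = pi_e / pi_0                                  # (n, |A|)
        return np.sum(pi_0_a_given_xe * vanilla_w, axis=1)       # (n,) estimated w(x_i, e_i)
    ```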

  47. Summary
    ● IPS (and many estimators based on it) becomes very inaccurate and impractical
    with a growing action set, mainly due to huge variance
    ● We assume auxiliary knowledge about the actions in the form of action
    embeddings and develop MIPS based on the no direct effect assumption
    ● We characterize the bias and variance of MIPS, which imply an
    interesting bias-variance trade-off in terms of the quality of the embeddings
    ● Sometimes we should intentionally violate “no direct effect” to improve the MSE of MIPS

  48. Synthetic Experiment: Setup
    ● Baselines
    ○ DM, IPS, DR (other baselines are compared in the paper)
    ○ MIPS (estimated weight) and MIPS (true)
    (MIPS (true) uses the true marginal importance weights and provides the best achievable accuracy)
    ● Basic Setting
    ○ n=10,000, |A|=1,000 (much larger than typical OPE experiments)
    ○ continuous rewards with some Gaussian noise
    ○ 3-dimensional categorical action embeddings,
    where the cardinality of each dimension is 10, i.e., |E| = 10^3 = 1,000
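    For concreteness, a tiny sketch of how such categorical embeddings could be synthesized (an assumption for illustration, not the exact generator used in the paper; see the linked experiment code for that):

    ```python
    import numpy as np

    rng = np.random.default_rng(12345)
    n_actions, n_dim, n_cat = 1000, 3, 10  # |A| = 1,000; 3 dims of cardinality 10 -> |E| = 10^3

    # each action gets a fixed 3-dimensional categorical embedding
    action_embeddings = rng.integers(n_cat, size=(n_actions, n_dim))
    ```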

  49. MIPS is More Robust to an Increasing Number of Actions
    robust to growing action sets
    vs. IPS & DR:
    ● MIPS is over 10x better for
    a large number of actions
    vs. DM:
    ● MIPS is also consistently
    better than DM
    Achieving a large variance reduction

  50. MIPS Makes Full Use of Increasing Data
    make full use of large data
    ● MIPS works like IPS/DR, and is
    much better with small sample sizes
    ● MIPS is also increasingly better
    than DM for larger sample sizes
    Avoids sacrificing consistency

  51. Resolving the Large Action Space Problem
    MIPS dominates DM/IPS/DR for both crucial cases
    robust to growing action sets make full use of large data

  52. How does MIPS perform when the assumption is violated?
    Synthesize 20-dimensional action
    embeddings, where all dimensions are necessary to
    satisfy the assumption.
    Then gradually drop necessary
    embedding dimensions (out of 20).
    Dropping some dimensions
    minimizes the MSE of MIPS,
    even though the no direct effect
    assumption is then violated

  53. How does MIPS perform when the assumption is violated?
    Missing dimensions
    introduce some bias,
    but they also reduce the variance
    (squared bias vs. variance)

  54. How does data-driven action embedding selection work?
    We can select, in a data-driven way,
    which embedding dimensions
    to drop to improve the MSE
    (detailed in the paper;
    it is based on “SLOPE” [Su+20])

  55. Other Benefits of MIPS
    In addition to being robust to a large number of actions, MIPS is also more
    robust to (near-)deterministic evaluation policies and to noisy rewards
    [Plots: MSE as the evaluation policy varies from deterministic to uniform, and as reward noise grows]

  56. Summary of Empirical Results
    ● With a growing number of actions, MIPS provides a large variance
    reduction (working even better than DM), while the MSEs of IPS & DR inflate
    ● With a growing sample size, MIPS works like an unbiased/consistent
    estimator (similarly to IPS/DR), while DM remains highly biased
    ● Even if we violate the assumption and introduce some bias, intentionally
    dropping some embedding dimensions can provide a greater MSE gain
    Potential to improve many other estimators that depend on IPS

  57. More About The Paper
    ● Contact: [email protected]
    ● arXiv: https://arxiv.org/abs/2202.06317
    ● Experiment code: https://github.com/usaito/icml2022-mips
    ● Generic Implementation: https://github.com/st-tech/zr-obp
    Thank you!
