usaito
July 04, 2022

Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML'22)

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score weighting – degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.


Transcript

1. Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML 2022) Yuta

Saito and Thorsten Joachims
2. Outline • Standard Off-Policy Evaluation for Contextual Bandits • Issues

of Existing Estimators for Large Action Spaces • New Framework and Estimator using Action Embeddings • Some Experimental Results. How can we best utilize auxiliary information about actions for offline evaluation? The typical importance weighting approach fails.
3. Machine Learning for Decision Making (Bandit / RL) We often

use machine learning to make decisions, not predictions: an incoming user arrives, the policy makes a decision (e.g., an item recommendation), and we observe clicks. Ultimate Goal: reward maximization, not CTR prediction.
4. Many Applications of “Machine Decision Making” How can we evaluate

the performance of a new decision making policy using only the data collected by a past logging policy? Motivation of OPE • video recommendation (Youtube) • playlist recommendation (Spotify) • artwork personalization (Netflix) • ad allocation optimization (Criteo)
5. Many Applications of “Machine Decision Making” • video recommendation (Youtube)

• playlist recommendation (Spotify) • artwork personalization (Netflix) • ad allocation optimization (Criteo) How can we evaluate the performance of a new decision making policy using only the data collected by a past logging policy? Large Action Spaces: thousands/millions (or even more) of actions. Motivation of OPE
6. Data Generating Process (contextual bandit setting) the logging policy interacts

with the environment and produces the log data: observe a context (user info); a "logging" policy picks an action (a movie recommendation); observe a reward (clicks, conversions, etc.)
7. Off-Policy Evaluation: Logged Bandit Data We are given logged bandit

data collected by the logging policy, where the context distribution and the reward distribution are unknown while the logging policy itself is known
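As a minimal, hypothetical sketch of this data generating process (the softmax logging policy, the linear reward model, and all parameters below are illustrative assumptions, not the paper's setup), logged bandit data can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, dim_x = 1000, 10, 5

# contexts x_i ~ p(x)  (unknown to the OPE estimator)
X = rng.normal(size=(n, dim_x))

# a hypothetical softmax logging policy pi_0(a|x)  (known)
theta = rng.normal(size=(dim_x, n_actions))
scores = X @ theta
pi_0 = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# the logging policy picks an action, then a reward is observed
actions = np.array([rng.choice(n_actions, p=p) for p in pi_0])
q = X @ rng.normal(size=(dim_x, n_actions))  # hypothetical q(x, a) = E[r|x, a]
rewards = q[np.arange(n), actions] + rng.normal(scale=0.1, size=n)

# the logged data is D = {(x_i, a_i, r_i)}_{i=1}^n
```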
8. Off-Policy Evaluation: Goal Our goal is to estimate the value/performance

of an evaluation policy. The value of the eval policy (our "estimand") is the expected reward we get when we (hypothetically) implement the eval policy.
9. Off-Policy Evaluation: Goal Technically, our goal is to develop an

accurate estimator of the value (the estimand), where the logging policy is different from the evaluation policy.
10. Off-Policy Evaluation: Goal An estimator’s accuracy is quantified by its

mean squared error (MSE), where bias and variance are equally important for a small MSE, since MSE = Bias^2 + Variance.
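Since the MSE decomposes as Bias^2 + Variance, in synthetic experiments (where the true value is known) it can be approximated by repeated simulation. A minimal sketch, with a made-up `mse_by_simulation` helper and a trivial sample-mean "estimator" as the toy example:

```python
import numpy as np

def mse_by_simulation(estimator, sample_data, true_value, n_trials=2000, seed=0):
    """Approximate an estimator's MSE, squared bias, and variance by
    repeated simulation (possible in synthetic experiments, where the
    true policy value is known)."""
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(sample_data(rng)) for _ in range(n_trials)])
    bias_sq = (estimates.mean() - true_value) ** 2
    variance = estimates.var()
    mse = np.mean((estimates - true_value) ** 2)
    return mse, bias_sq, variance

# toy check: the sample mean of n=100 Gaussian draws as an "estimator"
mse, bias_sq, variance = mse_by_simulation(
    estimator=np.mean,
    sample_data=lambda rng: rng.normal(loc=1.0, size=100),
    true_value=1.0,
)
# the decomposition MSE = Bias^2 + Variance holds up to floating-point error
```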
11. The Inverse Propensity Score (IPS) Estimator IPS uses importance weighting

to unbiasedly estimate the policy value via the (vanilla) importance weight. It is very easy to implement, has some nice statistical properties, is popular in practice, and is the basis of many advanced estimators.
12. Inverse Propensity Score (IPS): Unbiased OPE IPS is unbiased (bias=zero)

in the sense that its bias is zero for any evaluation policy satisfying the common support assumption (which is checkable from data)
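A minimal numpy sketch of the vanilla IPS estimator, checked on a made-up two-action toy problem (the policies and rewards below are illustrative assumptions, not from the paper):

```python
import numpy as np

def ips_estimate(rewards, pi_e_at_a, pi_0_at_a):
    """Vanilla IPS: average of importance-weighted rewards, with
    w_i = pi_e(a_i|x_i) / pi_0(a_i|x_i)."""
    return np.mean((pi_e_at_a / pi_0_at_a) * rewards)

# toy check: 2 actions, one context, pi_0 = [0.5, 0.5], pi_e = [0.9, 0.1],
# deterministic rewards r = 1 for action 0 and r = 0 for action 1,
# so the true value of pi_e is 0.9 * 1 + 0.1 * 0 = 0.9
rng = np.random.default_rng(1)
a = rng.choice(2, size=100_000, p=[0.5, 0.5])
r = (a == 0).astype(float)
pi_e_at_a = np.where(a == 0, 0.9, 0.1)
pi_0_at_a = np.full_like(pi_e_at_a, 0.5)
v_hat = ips_estimate(r, pi_e_at_a, pi_0_at_a)  # close to 0.9
```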
13. IPS is inaccurate with growing number of actions number of

data n=3000. With an increasing number of actions, IPS gets significantly worse (the max importance weight grows). This is simply due to the use of the vanilla importance weight; the "biased" baseline (DM) does not degrade in the same way.

14. • Doubly Robust (DR) [Dudik+11,14] • Switch DR [Wang+17] • DR with Optimistic Shrinkage [Su+20] • DR with \lambda-smoothing [Metelli+21] etc. All heavily rely on the importance weight, and still suffer from variance or introduce a large bias (losing unbiasedness and consistency).
15. Simply combining IPS and DM seems not enough for large

actions. Advanced estimators still have the same issues: they still use the vanilla importance weight, the source of the high variance. With a growing number of actions, DR suffers from a very large variance, while the other estimators work almost the same as DM because they aggressively modify the importance weight (losing unbiasedness & consistency).

16. • Doubly Robust (DR) [Dudik+11,14] • Switch DR [Wang+17] • DR with Optimistic Shrinkage [Su+20] • DR with \lambda-smoothing [Metelli+21] etc. All heavily rely on the importance weight, and still suffer from variance or introduce a large bias (losing unbiasedness and consistency). How can we gain a large variance reduction without sacrificing unbiasedness/consistency in large action spaces? (We may need to avoid relying on the vanilla importance weight somehow.)
17. Typical Logged Bandit Data (Example OPE Situation: Product Recommendations)

Contexts | Actions | ??? | ??? | Conversion Rate
User 1   | Item A  | ??? | ??? | 5%
User 2   | Item B  | ??? | ??? | 2%
…        | …       | …   | …   | …
18. We Should Be Able to Use Some "Action Embeddings" (Example OPE Situation: Product Recommendations)

Contexts | Actions | Category  | Price | Conversion Rate
User 1   | Item A  | Books     | $20   | 5%
User 2   | Item B  | Computers | $500  | 2%
…        | …       | …         | …     | …
19. Idea: Auxiliary Information about the Actions Key idea: why not

leverage auxiliary information about the actions? We additionally observe action embeddings: from the typical logged bandit data for OPE to logged bandit data w/ action embeddings.
20. Idea: Auxiliary Information about the Actions We naturally generalize the

typical DGP by adding an (unknown) action embedding distribution: given context "x" and action "a", the embedding may be context-dependent, stochastic, and continuous.
21. Action Embeddings: Examples

Contexts | Actions | Category  | Price | Conversion Rate
User 1   | Item A  | Books     | $20   | 5%
User 2   | Item B  | Computers | $500  | 2%
…        | …       | …         | …     | …

Category: discrete, context-independent, deterministic. Price: continuous, context-dependent, stochastic (if the price is given by some personalized algorithm).
22. Idea: Auxiliary Information about the Actions We generalize the typical

DGP as follows. How should we utilize action embeddings for an accurate OPE? *We assume the existence of some action embeddings and analyze the connection between their quality and OPE accuracy; optimizing or learning action embeddings is interesting future work.
23. A Key Assumption: No Direct Effect To construct a new

estimator, we assume the no direct effect assumption: every causal effect of "a" on "r" must be mediated by "e". The action embeddings should be informative enough.
24. A Key Assumption: No Direct Effect this implies To construct

a new estimator, we assume the no direct effect assumption; this implies that the reward is conditionally independent of the action given the context and the embedding.
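In symbols (a reconstruction of the implication shown on the slide; the original figure is omitted from this transcript):

```latex
% no direct effect: r is conditionally independent of a given (x, e)
r \perp a \mid x, e
\quad \Longrightarrow \quad
p(r \mid x, a, e) = p(r \mid x, e)
```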
25. No Direct Effect Assumption: Example

movie ("a") | category ("e") | CVR ("r")
Tenet       | SF             | 10%
Rocky       | Sport          | 5%
Star Wars   | SF             | 20%
Moneyball   | Sport          | 30%

The action embeddings fail to explain the variation in the reward (CVRs differ within each category), so the assumption is violated: there must be some direct effect.
26. No Direct Effect Assumption: Example

movie ("a") | category ("e") | CVR ("r")
Tenet       | SF             | 10%
Rocky       | Sport          | 20%
Star Wars   | SF             | 10%
Moneyball   | Sport          | 20%

The action embeddings fully explain the variation in the reward, so the assumption now holds: no direct effect.
27. New Expression of the Policy Value If the no direct

effect assumption is true, we have a new expression of the value of the eval policy (our "estimand") that does not use the action variable "a": the embedding "e" is enough to specify the value.
28. New Expression of the Policy Value If the no direct

effect assumption is true, we have a new expression in terms of the marginal distribution of action embeddings induced by a particular policy, p(e|x, π) = Σ_a π(a|x) p(e|x, a).
29. Marginal Embedding Distribution Given a policy and embeddings, the marginal distribution is very easily defined:

movie ("a") | π(a) | category ("e") | p(e|π)
Tenet       | 0.2  | SF             | 0.4
Rocky       | 0.1  | Sport          | 0.6
Star Wars   | 0.2  | SF             | 0.4
Moneyball   | 0.5  | Sport          | 0.6
30. New Expression of the Policy Value If the no direct

effect assumption is true, we have a new expression of the value of the eval policy (our "estimand") that does not use the action variable "a": the embedding "e" is enough to specify the value.
31. The Marginalized Inverse Propensity Score (MIPS) Estimator The new expression

of the policy value leads to the following new estimator, Marginalized IPS (MIPS): it replaces the vanilla importance weight of IPS with a marginal importance weight computed over the action embedding space, ignoring the vanilla weight entirely.
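A minimal sketch of MIPS next to IPS on a made-up toy problem where two actions share a single embedding, so both policies induce the same marginal embedding distribution and all marginal weights equal 1 (everything below is illustrative, not the paper's experiment):

```python
import numpy as np

def mips_estimate(rewards, p_e_eval, p_e_log):
    """Marginalized IPS: importance weighting over the embedding space,
    with w(x_i, e_i) = p(e_i|x_i, pi_e) / p(e_i|x_i, pi_0)."""
    return np.mean((p_e_eval / p_e_log) * rewards)

# toy setting: 2 actions sharing one embedding, pi_0 = [0.5, 0.5],
# pi_e = [0.9, 0.1]; the reward depends only on the embedding (E[r] = 0.3)
rng = np.random.default_rng(2)
n = 10_000
a = rng.choice(2, size=n, p=[0.5, 0.5])
r = rng.binomial(1, 0.3, size=n).astype(float)

# vanilla IPS weights are 1.8 or 0.2 -> extra variance
v_ips = np.mean(np.where(a == 0, 0.9 / 0.5, 0.1 / 0.5) * r)
# both policies induce the same marginal embedding distribution -> w = 1
v_mips = mips_estimate(r, p_e_eval=np.ones(n), p_e_log=np.ones(n))
```

Here `v_mips` carries no importance-weighting variance at all, while `v_ips` does, illustrating the variance reduction discussed on the following slides.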
32. Nice Properties of MIPS

• Large variance reduction for a large number of actions (compared to IPS) • Unbiased with alternative assumptions (compared to IPS)
33. Nice Properties of MIPS

• Large variance reduction for a large number of actions (compared to IPS) • Unbiased with alternative assumptions (compared to IPS)
34. MIPS is Unbiased with Alternative Assumptions Under the no direct

effect assumption, MIPS is unbiased, i.e., for any evaluation policy satisfying the common embedding support assumption (weaker than common support of IPS)
35. MIPS is Unbiased with Alternative Assumptions Under the no direct

effect assumption, MIPS is unbiased, i.e., for any evaluation policy satisfying the common embedding support assumption (weaker than common support of IPS)
36. Bias of MIPS with Violated Assumption (Thm 3.5) If the

no direct effect assumption does not hold, MIPS has a bias consisting of two terms, (1) and (2).
37. Bias of MIPS with Violated Assumption (Thm 3.5) If the

no direct effect assumption does not hold, MIPS has a bias consisting of two terms, (1) and (2). Term (1) measures how identifiable an action is from the action embeddings: the embeddings should be as informative as possible to reduce the bias.
38. Bias of MIPS with Violated Assumption If the no direct

effect assumption does not hold, MIPS has a bias consisting of two terms, (1) and (2). Term (2) measures the amount of direct effect from "a" to "r": again, the embeddings should be as informative as possible to reduce the bias.
39. Nice Properties of MIPS

• Large variance reduction for a large number of actions (compared to IPS) • Unbiased with alternative assumptions (compared to IPS)
40. Variance Reduction by MIPS (Thm 3.6) MIPS’s variance is never

worse than that of IPS, comparing the variance of IPS and MIPS under the no direct effect assumption
41. Variance Reduction by MIPS (Thm 3.6) We get a large

variance reduction when • the vanilla importance weights have a high variance (many actions) • the embeddings are NOT too informative (the embedding distribution should be stochastic), which is the opposite motivation compared to the bias reduction
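As a hedged reconstruction of this comparison (consult Theorem 3.6 in the paper for the exact statement), the variance gain can be written as:

```latex
n \bigl( \mathbb{V}(\hat{V}_{\mathrm{IPS}}) - \mathbb{V}(\hat{V}_{\mathrm{MIPS}}) \bigr)
  = \mathbb{E}_{p(x)\, p(e \mid x, \pi_0)}
    \Bigl[ \mathbb{E}\bigl[ r^2 \mid x, e \bigr]\,
           \mathbb{V}_{\pi_0(a \mid x, e)}\bigl( w(x, a) \bigr) \Bigr]
  \;\ge\; 0
```

The gain is large exactly when the vanilla weights w(x, a) vary a lot given the embedding, i.e., when the embedding does not pin down the action.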
42. Bias-Variance Trade-Off of MIPS • To reduce the bias, we

should use informative action embeddings (high dimensional, high cardinality) • To reduce the variance, we should use coarse/noisy action embeddings (low dimensional, low cardinality). We do not necessarily have to satisfy the "no direct effect" assumption to achieve a small MSE (= Bias^2 + Var); there are some interesting empirical results about this trade-off.
43. Data-Driven Action Embedding Selection • We want to identify a

set of action embeddings/features that minimizes the MSE of the resulting MIPS, where the candidate set is the available action features: the bias is small when the selected embedding is informative, while the variance is small when it is coarse.
44. Data-Driven Action Embedding Selection • We want to identify a

set of action embeddings/features that minimizes the MSE of the resulting MIPS • The problem is that estimating the bias is as difficult as OPE itself, since it depends on the true policy value.
45. Data-Driven Action Embedding Selection • We want to identify a

set of action embeddings/features that minimizes the MSE of the resulting MIPS • The problem is that estimating the bias is as difficult as OPE itself • So, we adapt "SLOPE" [Su+20][Tucker+21] to our setup • SLOPE was originally developed to tune hyperparameters of OPE and does not need to estimate the bias of the estimator (detailed in the paper and its appendix).
46. Estimating the Marginal Importance Weights • Even if we know

the logging policy, we may have to estimate the marginal importance weight because we do not know the true embedding distribution • A simple procedure utilizes the following transformation: our task is to estimate the distribution of the action given the context and embedding, and use it to compute the marginal weight.
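A minimal sketch of this procedure with a made-up, counting-based estimate of p(a|e) in a discrete, context-free toy setting (the paper instead fits a classifier for p(a|x, e); the helper names below are illustrative):

```python
import numpy as np

def estimate_p_a_given_e(actions, embeds, n_actions, n_embeds):
    """Empirical estimate of p(a|e) by counting (a context-free
    simplification of fitting a classifier for p(a|x, e))."""
    counts = np.zeros((n_embeds, n_actions))
    for a, e in zip(actions, embeds):
        counts[e, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def marginal_weights(embeds, pi_e, pi_0, p_a_given_e):
    """w(e) = sum_a p(a|e) * pi_e(a) / pi_0(a), evaluated at each logged e_i."""
    w_per_embed = p_a_given_e @ (pi_e / pi_0)
    return w_per_embed[embeds]

# toy check: 2 actions mapping one-to-one onto 2 embeddings, so p(a|e)
# is the identity and the marginal weight reduces to the vanilla weight
actions = np.array([0, 1, 0, 1])
embeds = np.array([0, 1, 0, 1])
p_hat = estimate_p_a_given_e(actions, embeds, n_actions=2, n_embeds=2)
w = marginal_weights(
    embeds, pi_e=np.array([0.9, 0.1]), pi_0=np.array([0.5, 0.5]), p_a_given_e=p_hat
)
# w equals the vanilla weights [1.8, 0.2, 1.8, 0.2] in this degenerate case
```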
47. Summary • IPS (and many others based on it) become

very inaccurate and impractical with a growing action set, mainly due to their huge variance • We assume auxiliary knowledge about the actions in the form of action embeddings and develop MIPS based on the no direct effect assumption • We characterize the bias and variance of MIPS, which implies an interesting bias-variance trade-off in terms of the quality of the embeddings • Sometimes we should violate "no direct effect" to improve the MSE of MIPS
48. Synthetic Experiment: Setup • Baselines ◦ DM, IPS, DR (other

baselines are compared in the paper) ◦ MIPS (estimated weight) and MIPS (true), where the latter provides the best achievable accuracy • Basic Setting ◦ n=10000, |A|=1000 (much larger than typical OPE experiments) ◦ continuous rewards with some Gaussian noise ◦ 3-dimensional categorical action embeddings where the cardinality of each dimension is 10, i.e., |E|=10^3=1,000
49. MIPS is More Robust to Increasing Number of Actions robust

to growing action sets. • vs IPS & DR: MIPS is over 10x better for a large number of actions, achieving a large variance reduction • vs DM: MIPS is also consistently better.
50. MIPS Makes Full Use of Increasing Data make full use

of large data. • vs IPS/DR: MIPS works like them for large samples and is much better for small sample sizes • vs DM: MIPS is increasingly better for larger sample sizes, avoiding the sacrifice of consistency.
51. Resolving the Large Action Space Problem MIPS dominates DM/IPS/DR for

both crucial cases: it is robust to growing action sets and makes full use of large data.
52. How does MIPS perform with the violated assumption? Synthesize 20

dimensional action embeddings, all of which are necessary to satisfy the assumption, then gradually drop necessary embedding dimensions (out of 20). Dropping some dimensions minimizes the MSE of MIPS even if the no direct effect assumption is violated.
53. How does MIPS perform with the violated assumption? missing dimensions

introduce some bias, but they also reduce the variance (see the squared bias vs. variance decomposition).
54. How does data-driven action embedding selection work? We can

select, in a data-driven way, which embedding dimensions to drop to improve the MSE (detailed in the paper; based on "SLOPE" [Su+20]).
55. Other Benefits of MIPS robust to deterministic evaluation policies robust

to noisy rewards. In addition to its robustness to the number of actions, MIPS is more robust along these two axes (e.g., as the evaluation policy ranges from uniform to deterministic).
56. Summary of Empirical Results • With growing number of actions,

MIPS provides a large variance reduction (and works even better than DM), while the MSEs of IPS & DR inflate • With a growing sample size, MIPS works like an unbiased/consistent estimator (similarly to IPS/DR), while DM remains highly biased • Even if we violate the assumption and introduce some bias, intentionally dropping some embedding dimensions can provide a greater MSE gain. There is potential to improve many other estimators that depend on IPS.
57. More About The Paper • Contact: [email protected] • arXiv: https://arxiv.org/abs/2202.06317

• Experiment code: https://github.com/usaito/icml2022-mips • Generic Implementation: https://github.com/st-tech/zr-obp Thank you!