Slide 1

Slide 1 text

Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML 2022)
Yuta Saito and Thorsten Joachims

Slide 2

Slide 2 text

Outline
● Standard Off-Policy Evaluation for Contextual Bandits
● Issues of Existing Estimators for Large Action Spaces (the typical importance weighting approach fails)
● New Framework and Estimator using Action Embeddings (how to best utilize auxiliary information about actions for offline evaluation?)
● Some Experimental Results

Slide 3

Slide 3 text

Machine Learning for Decision Making (Bandit / RL)
We often use machine learning to make decisions, not predictions.
Decision making: incoming user → policy → item recommendation → clicks
Ultimate goal: reward maximization, not CTR prediction

Slide 4

Slide 4 text

Many Applications of “Machine Decision Making”
● video recommendation (YouTube)
● playlist recommendation (Spotify)
● artwork personalization (Netflix)
● ad allocation optimization (Criteo)
Motivation of OPE: How can we evaluate the performance of a new decision making policy using only the data collected by a past, logging policy?

Slide 5

Slide 5 text

Many Applications of “Machine Decision Making”
● video recommendation (YouTube)
● playlist recommendation (Spotify)
● artwork personalization (Netflix)
● ad allocation optimization (Criteo)
Large Action Spaces: thousands/millions (or even more) of actions
Motivation of OPE: How can we evaluate the performance of a new decision making policy using only the data collected by a past, logging policy?

Slide 6

Slide 6 text

Data Generating Process (contextual bandit setting)
The logging policy interacts with the environment and produces the log data:
● Observe a context (user info)
● A “logging” policy picks an action (movie recommendation)
● Observe a reward (clicks, conversions, etc.)
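The three steps above can be simulated in a few lines. This is a minimal sketch, not the experiment code of the paper: the linear reward model, the softmax logging policy, and the name `generate_logged_data` are all assumptions made for illustration.

```python
import numpy as np

def generate_logged_data(n, n_actions, dim_context, rng):
    """Simulate logging: x ~ p(x), a ~ pi_0(a|x), r ~ p(r|x,a)."""
    # Hypothetical environment: expected reward q(x, a) = x @ theta[:, a]
    theta = rng.normal(size=(dim_context, n_actions))
    x = rng.normal(size=(n, dim_context))              # observe contexts
    # Hypothetical logging policy: softmax over random linear scores
    logits = x @ rng.normal(size=(dim_context, n_actions))
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi_0 = np.exp(logits)
    pi_0 /= pi_0.sum(axis=1, keepdims=True)
    # The logging policy picks one action per context
    a = np.array([rng.choice(n_actions, p=p) for p in pi_0])
    q = x @ theta
    r = q[np.arange(n), a] + rng.normal(size=n)        # observe noisy rewards
    return x, a, r, pi_0

rng = np.random.default_rng(0)
x, a, r, pi_0 = generate_logged_data(n=500, n_actions=10, dim_context=5, rng=rng)
```

Only `x`, `a`, `r`, and the logging probabilities `pi_0` would be available to an OPE estimator; `theta` stays hidden, mirroring the unknown reward distribution.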

Slide 7

Slide 7 text

Off-Policy Evaluation: Logged Bandit Data
We are given logged bandit data collected by the logging policy, where the context distribution and the reward distribution are unknown and the logging policy is known
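In standard contextual-bandit notation (the symbols are assumptions here, since the slide text does not spell them out), the logged data can be written as below, with $p(x)$ and $p(r \mid x, a)$ unknown and the logging policy $\pi_0$ known:

```latex
\mathcal{D} := \{(x_i, a_i, r_i)\}_{i=1}^{n}
  \sim \prod_{i=1}^{n} p(x_i)\, \pi_0(a_i \mid x_i)\, p(r_i \mid x_i, a_i)
```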

Slide 8

Slide 8 text

Off-Policy Evaluation: Goal
Our goal is to estimate the value/performance of the evaluation policy.
Value of the eval policy (our “estimand”) = the expected reward we get when we (hypothetically) implement the eval policy
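Written out under the usual notation (assumed here: $\pi$ is the evaluation policy and $q(x, a) := \mathbb{E}[r \mid x, a]$ is the expected reward), the estimand is:

```latex
V(\pi) := \mathbb{E}_{p(x)\,\pi(a \mid x)\,p(r \mid x, a)}[\, r \,]
        = \mathbb{E}_{p(x)\,\pi(a \mid x)}[\, q(x, a) \,]
```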

Slide 9

Slide 9 text

Off-Policy Evaluation: Goal
Technically, our goal is to develop an accurate estimator of the value (the estimand) from the logged data, where the logging policy is different from the evaluation policy

Slide 10

Slide 10 text

Off-Policy Evaluation: Goal
An estimator’s accuracy is quantified by its mean squared error (MSE), which decomposes into squared bias plus variance, so bias and variance are equally important for a small MSE
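For reference, the standard bias-variance decomposition reads as follows, where $\hat{V}(\pi; \mathcal{D})$ denotes a generic estimator (notation assumed for illustration):

```latex
\mathrm{MSE}(\hat{V}) := \mathbb{E}_{\mathcal{D}}\big[(V(\pi) - \hat{V}(\pi; \mathcal{D}))^2\big]
  = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V}),
\qquad
\mathrm{Bias}(\hat{V}) := \mathbb{E}_{\mathcal{D}}[\hat{V}(\pi; \mathcal{D})] - V(\pi),
\qquad
\mathrm{Var}(\hat{V}) := \mathbb{E}_{\mathcal{D}}\big[(\hat{V}(\pi; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{V}(\pi; \mathcal{D})])^2\big]
```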

Slide 11

Slide 11 text

The Inverse Propensity Score (IPS) Estimator
IPS uses importance weighting to unbiasedly estimate the policy value, reweighting each logged reward by the (vanilla) importance weight.
It is very easy to implement and has some nice statistical properties; it is popular in practice and is the basis of many advanced estimators.
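A minimal sketch of IPS, assuming the policies are given as arrays of action probabilities per logged context (the function name `ips_estimate` and the toy data are illustrative, not taken from the paper's code). The vanilla importance weight is $w(x, a) = \pi(a \mid x) / \pi_0(a \mid x)$.

```python
import numpy as np

def ips_estimate(actions, rewards, pi_0, pi_e):
    """IPS: average of rewards reweighted by w(x, a) = pi_e(a|x) / pi_0(a|x).

    pi_0, pi_e: arrays of shape (n, n_actions) holding the action
    probabilities of the logging and evaluation policies per logged context.
    """
    idx = np.arange(len(actions))
    w = pi_e[idx, actions] / pi_0[idx, actions]   # vanilla importance weight
    return float(np.mean(w * rewards))

# Toy check: two actions, one repeated context, deterministic rewards.
pi_0 = np.tile([0.5, 0.5], (4, 1))
pi_e = np.tile([0.8, 0.2], (4, 1))
actions = np.array([0, 0, 1, 1])
rewards = np.array([1.0, 1.0, 0.0, 0.0])
v_hat = ips_estimate(actions, rewards, pi_0, pi_e)
```

In this toy case action 0 always yields reward 1 and action 1 yields 0, so the true value of the evaluation policy is 0.8, which is what the estimate returns.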

Slide 12

Slide 12 text

Inverse Propensity Score (IPS): Unbiased OPE
IPS is unbiased (bias = zero), in the sense that its expectation over the logged data equals the true policy value, for any evaluation policy satisfying the common support assumption (which is checkable)
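In symbols (notation assumed as above, with $\pi$ the evaluation policy and $\pi_0$ the logging policy), unbiasedness and the common support condition can be stated as:

```latex
\mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{IPS}}(\pi; \mathcal{D})\big] = V(\pi)
\quad \text{whenever} \quad
\pi(a \mid x) > 0 \;\Rightarrow\; \pi_0(a \mid x) > 0
\quad \text{for all } a \in \mathcal{A},\ x \in \mathcal{X}
```

The condition is checkable because both policies are known to the analyst.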

Slide 13

Slide 13 text

IPS is inaccurate with a growing number of actions
[Figure: IPS vs. the “biased” baseline (DM) with an increasing number of actions; number of data = 3000]
IPS gets significantly worse as the number of actions grows. This is simply due to the use of the vanilla importance weight, whose maximum value grows with the number of actions.
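A small illustration of why the maximum importance weight grows: with a uniform logging policy and a deterministic evaluation policy (both hypothetical choices made only for this sketch), the largest weight equals the number of actions.

```python
import numpy as np

def max_importance_weight(n_actions):
    """Largest vanilla weight for uniform logging vs. deterministic evaluation."""
    pi_0 = np.full(n_actions, 1.0 / n_actions)   # uniform logging policy
    pi_e = np.zeros(n_actions)
    pi_e[0] = 1.0                                # always pick action 0
    return float((pi_e / pi_0).max())

max_weights = {k: max_importance_weight(k) for k in (10, 100, 1000)}
```

Since the variance of IPS scales with the second moment of these weights, a tenfold increase in the action set can inflate it substantially in this setting.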

Slide 14

Slide 14 text

Recent Advances: Combining DM and IPS
More Advanced Estimators
● Doubly Robust (DR) [Dudik+11,14]
● Switch DR [Wang+17]
● DR with Optimistic Shrinkage [Su+20]
● DR with \lambda-smoothing [Metelli+21]
etc.
All of these heavily rely on the importance weight, so they either still suffer from variance or introduce a large bias (losing unbiasedness and consistency).

Slide 15

Slide 15 text

Simply combining IPS and DM does not seem to be enough for large action spaces
Advanced estimators still have the same issues: they still use the vanilla importance weight, the source of high variance.
[Figure: estimators compared over the number of actions]
● DR suffers from a very large variance
● Other estimators work almost the same as DM, because they aggressively modify the importance weight (losing unbiasedness & consistency)

Slide 16

Slide 16 text

Recent Advances: Combining DM and IPS
More Advanced Estimators
● Doubly Robust (DR) [Dudik+11,14]
● Switch DR [Wang+17]
● DR with Optimistic Shrinkage [Su+20]
● DR with \lambda-smoothing [Metelli+21]
etc.
All of these heavily rely on the importance weight, so they either still suffer from variance or introduce a large bias (losing unbiasedness and consistency).
How can we gain a large variance reduction without sacrificing unbiasedness/consistency in large action spaces? (We may need to avoid relying on the vanilla importance weight somehow.)

Slide 17

Slide 17 text

Typical Logged Bandit Data
Example OPE Situation: Product Recommendations
Contexts   Actions   ???         ???     Conversion Rate
User 1     Item A    ???         ???     5%
User 2     Item B    ???         ???     2%
…          …         …           …       …

Slide 18

Slide 18 text

We Should Be Able to Use Some “Action Embeddings”
Example OPE Situation: Product Recommendations
Contexts   Actions   Category    Price   Conversion Rate
User 1     Item A    Books       $20     5%
User 2     Item B    Computers   $500    2%
…          …         …           …       …

Slide 19

Slide 19 text

Idea: Auxiliary Information about the Actions
Key idea: why not leverage auxiliary information about the actions?
We additionally observe action embeddings: typical logged bandit data for OPE → logged bandit data w/ action embeddings

Slide 20

Slide 20 text

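Idea: Auxiliary Information about the Actions
We naturally generalize the typical DGP by adding an action embedding distribution given context “x” and action “a”, which may be context-dependent, stochastic, and continuous. The context distribution, the action embedding distribution, and the reward distribution are unknown; the logging policy is known.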
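One way to write the generalized data generating process (symbols assumed for illustration: $e$ is the action embedding and $p(e \mid x, a)$ its distribution) is:

```latex
\mathcal{D} := \{(x_i, a_i, e_i, r_i)\}_{i=1}^{n}
  \sim \prod_{i=1}^{n} p(x_i)\, \pi_0(a_i \mid x_i)\, p(e_i \mid x_i, a_i)\, p(r_i \mid x_i, a_i, e_i)
```

Dropping $e_i$ and integrating it out of the reward distribution recovers the typical logged bandit data.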

Slide 21

Slide 21 text

Action Embeddings: Examples
Contexts   Actions   Category    Price   Conversion Rate
User 1     Item A    Books       $20     5%
User 2     Item B    Computers   $500    2%
…          …         …           …       …
Category: discrete, context-independent, deterministic
Price: continuous, context-dependent, and stochastic if the price is given by some personalized algorithm

Slide 22

Slide 22 text

Idea: Auxiliary Information about the Actions
We generalize the typical DGP as follows, by adding an action embedding distribution.
How should we utilize action embeddings for an accurate OPE?
*We assume the existence of some action embeddings and analyze the connection between their quality and OPE accuracy; optimizing or learning action embeddings is interesting future work.

Slide 23

Slide 23 text

A Key Assumption: No Direct Effect
To construct a new estimator, we make the no direct effect assumption: every causal effect of “a” on “r” should be mediated by “e” (action → embedding → reward).
Action embeddings should be informative enough.

Slide 24

Slide 24 text

A Key Assumption: No Direct Effect
To construct a new estimator, we make the no direct effect assumption, which implies that the expected reward depends on the action only through the embedding
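Stated as a conditional independence (notation assumed here, with $q(x, a, e) := \mathbb{E}[r \mid x, a, e]$ and $q(x, e) := \mathbb{E}[r \mid x, e]$), the assumption and its implication are:

```latex
a \perp r \mid x, e
\quad \Longrightarrow \quad
q(x, a, e) = q(x, e) \quad \text{for all } x, a, e
```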

Slide 25

Slide 25 text

No Direct Effect Assumption: Example
movie (“a”)   category (“e”)   CVR (“r”)
Tenet         SF               10%
Rocky         Sport            5%
Star Wars     SF               20%
Many Ball     Sport            30%
Action embeddings fail to explain the variation in the reward, so there should be some direct effect and the assumption is violated.

Slide 26

Slide 26 text

No Direct Effect Assumption: Example
movie (“a”)   category (“e”)   CVR (“r”)
Tenet         SF               10%
Rocky         Sport            20%
Star Wars     SF               10%
Many Ball     Sport            20%
Action embeddings explain the variation in the reward well, so there is no direct effect and the assumption now holds.

Slide 27

Slide 27 text

New Expression of the Policy Value
If the no direct effect assumption is true, we have a new expression of the value of the eval policy (our “estimand”) that does not use the action variable “a”: the embedding “e” is enough to specify the value.

Slide 28

Slide 28 text

New Expression of the Policy Value
If the no direct effect assumption is true, the policy value can be expressed in terms of the marginal distribution of action embeddings induced by a particular policy
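With notation assumed for illustration ($p(e \mid x, \pi)$ for the marginal embedding distribution induced by policy $\pi$), the expression can be sketched as:

```latex
V(\pi) = \mathbb{E}_{p(x)\, p(e \mid x, \pi)\, p(r \mid x, e)}[\, r \,],
\qquad
p(e \mid x, \pi) := \sum_{a \in \mathcal{A}} p(e \mid x, a)\, \pi(a \mid x)
```

The action $a$ no longer appears inside the expectation; it enters only through the marginalization that defines $p(e \mid x, \pi)$.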

Slide 29

Slide 29 text

Marginal Embedding Distribution
Given a policy and embeddings, the marginal distribution is very easily defined:
movie (“a”)   policy   category (“e”)   marginal distribution
Tenet         0.2      SF               0.4
Rocky         0.1      Sport            0.6
Star Wars     0.2      SF               0.4
Many Ball     0.5      Sport            0.6
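The numbers in this example can be reproduced by summing the policy's probabilities within each category. The sketch below assumes a deterministic, context-independent embedding encoded as a one-hot matrix; the variable names are illustrative.

```python
import numpy as np

# Policy over the four movies of the example, for one fixed context.
movies = ["Tenet", "Rocky", "Star Wars", "Many Ball"]
pi = np.array([0.2, 0.1, 0.2, 0.5])

# p(e | a) as a one-hot matrix; columns follow categories = ["SF", "Sport"].
categories = ["SF", "Sport"]
p_e_given_a = np.array([
    [1.0, 0.0],  # Tenet     -> SF
    [0.0, 1.0],  # Rocky     -> Sport
    [1.0, 0.0],  # Star Wars -> SF
    [0.0, 1.0],  # Many Ball -> Sport
])

# Marginal embedding distribution: p(e | pi) = sum_a p(e | a) * pi(a)
p_e_given_pi = pi @ p_e_given_a
marginal = dict(zip(categories, p_e_given_pi))
```

With a stochastic embedding, the rows of `p_e_given_a` would simply be general probability vectors instead of one-hot vectors.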

Slide 30

Slide 30 text

New Expression of the Policy Value
If the no direct effect assumption is true, we have a new expression of the value of the eval policy (our “estimand”) that does not use the action variable “a”: the embedding “e” is enough to specify the value.

Slide 31

Slide 31 text

The Marginalized Inverse Propensity Score (MIPS) Estimator
The new expression of the policy value leads to a new estimator, Marginalized IPS (MIPS).
MIPS ignores the vanilla importance weight of IPS and instead uses the marginal importance weight, which is computed over the action embedding space.
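A minimal sketch of MIPS for discrete embeddings, assuming the embedding distribution $p(e \mid a)$ is known and context-independent (a simplification made for this example; the function name `mips_estimate` is illustrative). The marginal importance weight is taken to be $w(x, e) = p(e \mid x, \pi) / p(e \mid x, \pi_0)$.

```python
import numpy as np

def mips_estimate(embeds, rewards, pi_0, pi_e, p_e_given_a):
    """MIPS with a known, context-independent embedding distribution.

    embeds:      (n,) observed discrete embedding (category) indices
    pi_0, pi_e:  (n, n_actions) logging / evaluation policy probabilities
    p_e_given_a: (n_actions, n_embeds) embedding distribution p(e | a)
    """
    # Marginal embedding distributions p(e | x, pi) = sum_a p(e | a) pi(a | x)
    p_e_pi_0 = pi_0 @ p_e_given_a
    p_e_pi_e = pi_e @ p_e_given_a
    idx = np.arange(len(embeds))
    w = p_e_pi_e[idx, embeds] / p_e_pi_0[idx, embeds]  # marginal importance weight
    return float(np.mean(w * rewards))

# Toy data: 4 movies, 2 categories; reward depends only on the category.
p_e_given_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
pi_0 = np.tile([0.25, 0.25, 0.25, 0.25], (4, 1))
pi_e = np.tile([0.2, 0.1, 0.2, 0.5], (4, 1))
embeds = np.array([0, 1, 0, 1])
rewards = np.array([0.1, 0.2, 0.1, 0.2])
v_hat = mips_estimate(embeds, rewards, pi_0, pi_e, p_e_given_a)
```

Here the weights are 0.8 for SF and 1.2 for Sport, and the estimate matches the true value 0.4 × 0.1 + 0.6 × 0.2 = 0.16. Note that the logged actions themselves are never used.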

Slide 32

Slide 32 text

Nice Properties of MIPS
● Large Variance Reduction for a Large Number of Actions (compared to IPS)
● Unbiased with Alternative Assumptions (compared to IPS)

Slide 33

Slide 33 text

Nice Properties of MIPS
● Large Variance Reduction for a Large Number of Actions (compared to IPS)
● Unbiased with Alternative Assumptions (compared to IPS)

Slide 34

Slide 34 text

MIPS is Unbiased with Alternative Assumptions
Under the no direct effect assumption, MIPS is unbiased, i.e., its expectation over the logged data equals the true policy value, for any evaluation policy satisfying the common embedding support assumption (weaker than the common support assumption of IPS)
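In symbols (notation assumed for illustration, with $p(e \mid x, \pi)$ the marginal embedding distribution under policy $\pi$), the statement reads:

```latex
\mathbb{E}_{\mathcal{D}}\big[\hat{V}_{\mathrm{MIPS}}(\pi; \mathcal{D})\big] = V(\pi)
\quad \text{whenever} \quad
p(e \mid x, \pi) > 0 \;\Rightarrow\; p(e \mid x, \pi_0) > 0
\quad \text{for all } e \in \mathcal{E},\ x \in \mathcal{X}
```

This is weaker than common support because an embedding value can be covered by the logging policy through any of the actions that produce it, even when some individual action has zero logging probability.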

Slide 35

Slide 35 text

MIPS is Unbiased with Alternative Assumptions
Under the no direct effect assumption, MIPS is unbiased, i.e., its expectation over the logged data equals the true policy value, for any evaluation policy satisfying the common embedding support assumption (weaker than the common support assumption of IPS)

Slide 36

Slide 36 text

Bias of MIPS with Violated Assumption (Thm 3.5)
If the no direct effect assumption is NOT true, the bias of MIPS involves two highlighted terms, (1) and (2).

Slide 37

Slide 37 text

Bias of MIPS with Violated Assumption (Thm 3.5)
If the no direct effect assumption is NOT true, the bias of MIPS involves two highlighted terms, (1) and (2).
(1) how identifiable an action is from action embeddings: embeddings should be as informative as possible to reduce the bias

Slide 38

Slide 38 text

Bias of MIPS with Violated Assumption
If the no direct effect assumption is NOT true, the bias of MIPS involves two highlighted terms, (1) and (2).
(2) amount of direct effect from “a” to “r”: embeddings should be as informative as possible to reduce the bias

Slide 39

Slide 39 text

Nice Properties of MIPS
● Large Variance Reduction for a Large Number of Actions (compared to IPS)
● Unbiased with Alternative Assumptions (compared to IPS)

Slide 40

Slide 40 text

Variance Reduction by MIPS (Thm 3.6)
Comparing the variance of IPS and MIPS: MIPS’s variance is never worse than that of IPS

Slide 41

Slide 41 text

Variance Reduction by MIPS (Thm 3.6)
We get a large variance reduction when
● the vanilla importance weights have a high variance (many actions)
● the embeddings are NOT too informative (the distribution of the action given the context and embedding should be stochastic)
This is the opposite motivation compared to the bias reduction.
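Both conditions can be read off from the variance gap. Assuming both estimators are unbiased and the no direct effect assumption holds, and using $w(x, e) = \mathbb{E}_{\pi_0(a \mid x, e)}[w(x, a)]$ (notation assumed here), a short derivation gives:

```latex
n \big( \mathrm{Var}[\hat{V}_{\mathrm{IPS}}] - \mathrm{Var}[\hat{V}_{\mathrm{MIPS}}] \big)
  = \mathbb{E}\big[ w(x, a)^2 r^2 \big] - \mathbb{E}\big[ w(x, e)^2 r^2 \big]
  = \mathbb{E}_{p(x)\, p(e \mid x, \pi_0)}\Big[
      \mathbb{E}_{p(r \mid x, e)}[r^2] \;
      \mathrm{Var}_{\pi_0(a \mid x, e)}\big[ w(x, a) \big]
    \Big] \;\ge\; 0
```

The gap is large when $w(x, a)$ varies a lot and $\pi_0(a \mid x, e)$ is spread over many actions; it vanishes when the embedding pins down the action exactly.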

Slide 42

Slide 42 text

Bias-Variance Trade-Off of MIPS
● To reduce the bias, we should use informative action embeddings (high dimensional, high cardinality)
● To reduce the variance, we should use coarse/noisy action embeddings (low dimensional, low cardinality)
We do not necessarily have to satisfy the “no direct effect” assumption to achieve a small MSE (= Bias^2 + Var) (there are some interesting empirical results about this trade-off)

Slide 43

Slide 43 text

Data-Driven Action Embedding Selection
● We want to identify a set of action embeddings/features, chosen from a set of available action features, that minimizes the MSE of the resulting MIPS
● The bias is small when the selected embedding is informative; the variance is small when the selected embedding is coarse

Slide 44

Slide 44 text

Data-Driven Action Embedding Selection
● We want to identify a set of action embeddings/features that minimizes the MSE of the resulting MIPS
● The problem is that estimating the bias is as difficult as OPE itself, because the bias depends on the true policy value

Slide 45

Slide 45 text

Data-Driven Action Embedding Selection
● We want to identify a set of action embeddings/features that minimizes the MSE of the resulting MIPS
● The problem is that estimating the bias is as difficult as OPE itself
● So, we adapt “SLOPE” [Su+20] [Tucker+21] to our setup
● SLOPE was originally developed to tune hyperparameters of OPE and does not need to estimate the bias of the estimator (detailed in the paper and in the appendix)

Slide 46

Slide 46 text

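Estimating the Marginal Importance Weights
● Even if we know the logging policy, we may have to estimate the marginal importance weight because we do not know the true action embedding distribution
● A simple procedure is to utilize a transformation that expresses the marginal importance weight through the conditional distribution of the action given the context and embedding under the logging policy
So, our task is to estimate this conditional distribution and use it to compute the marginal importance weight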
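One transformation consistent with this description is $w(x, e) = \mathbb{E}_{\pi_0(a \mid x, e)}[w(x, a)]$. The sketch below checks the identity numerically on a toy example; here $\pi_0(a \mid x, e)$ is computed exactly by Bayes' rule from an assumed known $p(e \mid a)$, whereas in practice it would be estimated from the logged data, for example with a classifier.

```python
import numpy as np

pi_0 = np.array([0.25, 0.25, 0.25, 0.25])   # logging policy for one context
pi_e = np.array([0.2, 0.1, 0.2, 0.5])       # evaluation policy
p_e_given_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

w_a = pi_e / pi_0                            # vanilla weights w(x, a)

# pi_0(a | x, e) via Bayes' rule; this is the quantity to estimate in practice
joint = pi_0[:, None] * p_e_given_a          # pi_0(a|x) p(e|x,a), shape (4, 2)
pi_0_a_given_e = joint / joint.sum(axis=0, keepdims=True)

# Transformation: w(x, e) = sum_a w(x, a) pi_0(a | x, e)
w_e_via_transform = w_a @ pi_0_a_given_e

# Direct definition: p(e | x, pi) / p(e | x, pi_0)
w_e_direct = (pi_e @ p_e_given_a) / (pi_0 @ p_e_given_a)
```

Both routes give the weights 0.8 and 1.2 for the two categories, so only $\pi_0(a \mid x, e)$ is needed in addition to the known policies.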

Slide 47

Slide 47 text

Summary
● IPS (and many other estimators based on it) becomes very inaccurate and impractical with a growing action set, mainly due to its huge variance
● We assume auxiliary knowledge about the actions in the form of action embeddings and develop MIPS based on the no direct effect assumption
● We characterize the bias and variance of MIPS, which implies an interesting bias-variance trade-off in terms of the quality of the embeddings
● Sometimes we should violate “no direct effect” to improve the MSE of MIPS

Slide 48

Slide 48 text

Synthetic Experiment: Setup
● Baselines
○ DM, IPS, DR (other baselines are compared in the paper)
○ MIPS (estimated weight), and MIPS (true), which provides the best achievable accuracy
● Basic Setting
○ n=10000, |A|=1000 (much larger than typical OPE experiments)
○ continuous rewards with some Gaussian noise
○ 3-dimensional categorical action embeddings where the cardinality of each dimension is 10, i.e., |E|=10^3=1,000

Slide 49

Slide 49 text

MIPS is More Robust to Increasing Number of Actions
Achieving Large Variance Reduction: robust to growing action sets
vs IPS&DR
● MIPS is over 10x better for a large number of actions
vs DM
● MIPS is also consistently better than DM

Slide 50

Slide 50 text

MIPS Makes Full Use of Increasing Data
Avoid Sacrificing Consistency: make full use of large data
● MIPS works like IPS/DR, and is much better for small sample sizes
● MIPS is also increasingly better than DM for larger sample sizes

Slide 51

Slide 51 text

Resolving the Large Action Space Problem
MIPS dominates DM/IPS/DR for both crucial cases:
● robust to growing action sets
● make full use of large data

Slide 52

Slide 52 text

How does MIPS perform with the violated assumption?
We synthesize 20-dimensional action embeddings, all of which are necessary to satisfy the assumption, and gradually drop necessary embedding dimensions (out of 20).
Dropping some dimensions minimizes the MSE of MIPS, even though the no direct effect assumption is then violated.

Slide 53

Slide 53 text

How does MIPS perform with the violated assumption?
Missing dimensions introduce some bias, but they also reduce the variance.
[Figure panels: squared bias, variance]

Slide 54

Slide 54 text

How does data-driven action embedding selection work?
We can select, in a data-driven way, which embedding dimensions to drop to improve the MSE (detailed in the paper, but based on “SLOPE” [Su+20])

Slide 55

Slide 55 text

Other Benefits of MIPS
In addition to the number of actions, MIPS is more...
● robust to deterministic evaluation policies (evaluation policies ranging from uniform to deterministic)
● robust to noisy rewards

Slide 56

Slide 56 text

Summary of Empirical Results
● With a growing number of actions, MIPS provides a large variance reduction (works even better than DM), while the MSEs of IPS&DR inflate
● With a growing sample size, MIPS works like an unbiased/consistent estimator (similarly to IPS/DR), while DM remains highly biased
● Even if we violate the assumption and introduce some bias, intentionally dropping some embedding dimensions might provide a greater MSE gain
Potential to improve many other estimators that depend on IPS

Slide 57

Slide 57 text

More About The Paper
● Contact: [email protected]
● arXiv:
● Experiment code:
● Generic Implementation:
Thank you!