
Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML'22)

usaito
July 04, 2022


Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score weighting – degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.


Transcript

  1. Outline
     • Standard Off-Policy Evaluation for Contextual Bandits
     • Issues of Existing Estimators for Large Action Spaces
     • New Framework and Estimator using Action Embeddings
     • Some Experimental Results
     How can we best utilize auxiliary information about actions for offline evaluation? The typical importance weighting approach fails.
  2. Machine Learning for Decision Making (Bandit / RL)
     We often use machine learning to make decisions, not predictions: an incoming user arrives, a policy makes a decision (e.g., an item recommendation), and we observe clicks. The ultimate goal is reward maximization, not CTR prediction.
  3. Many Applications of “Machine Decision Making”
     • video recommendation (YouTube)
     • playlist recommendation (Spotify)
     • artwork personalization (Netflix)
     • ad allocation optimization (Criteo)
     Motivation of OPE: how can we evaluate the performance of a new decision making policy using only the data collected by a past logging policy?
  4. Many Applications of “Machine Decision Making”
     • video recommendation (YouTube)
     • playlist recommendation (Spotify)
     • artwork personalization (Netflix)
     • ad allocation optimization (Criteo)
     Motivation of OPE: how can we evaluate the performance of a new decision making policy using only the data collected by a past logging policy?
     These are large action spaces: thousands/millions (or even more) of actions.
  5. Data Generating Process (contextual bandit setting)
     The logging policy interacts with the environment and produces the log data: observe a context (user info), a “logging” policy picks an action (movie recommendation), and observe a reward (clicks, conversions, etc.).
  6. Off-Policy Evaluation: Logged Bandit Data
     We are given logged bandit data collected by the logging policy, where the context and reward distributions are unknown and the logging policy is known.
  7. Off-Policy Evaluation: Goal
     Our goal is to estimate the value/performance of an evaluation policy. The value of the eval policy (our “estimand”) is the expected reward we would get if we (hypothetically) implemented the eval policy.
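For reference, this estimand can be written out explicitly; the following is the standard contextual-bandit formulation assumed here, with q(x,a) introduced as notation for the expected reward:

```latex
% Value of the evaluation policy \pi (the "estimand"):
V(\pi) = \mathbb{E}_{p(x)\,\pi(a|x)\,p(r|x,a)}[\, r \,]
       = \mathbb{E}_{p(x)\,\pi(a|x)}[\, q(x,a) \,],
\qquad q(x,a) := \mathbb{E}[\, r \mid x, a \,].
```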
  8. Off-Policy Evaluation: Goal
     Technically, our goal is to develop an accurate estimator of the value (the estimand), using only data collected by a logging policy that is different from the evaluation policy.
  9. Off-Policy Evaluation: Goal
     An estimator’s accuracy is quantified by its mean squared error (MSE), where bias and variance are equally important for a small MSE.
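The bias-variance decomposition behind this statement, written out (a standard identity; the notation is assumed here, not taken from the slide):

```latex
% MSE of an estimator \hat{V} of V(\pi), over the randomness of the logged data D:
\mathrm{MSE}\big(\hat{V}(\pi)\big)
  = \mathbb{E}_{D}\Big[\big(\hat{V}(\pi; D) - V(\pi)\big)^{2}\Big]
  = \mathrm{Bias}\big(\hat{V}(\pi)\big)^{2} + \mathrm{Var}\big(\hat{V}(\pi; D)\big).
```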
  10. The Inverse Propensity Score (IPS) Estimator
     IPS uses the (vanilla) importance weight to unbiasedly estimate the policy value. It is very easy to implement, has some nice statistical properties, is popular in practice, and is the basis of many advanced estimators.
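A minimal sketch of IPS in Python/NumPy (not the paper's code; the array layout and argument names are illustrative assumptions):

```python
import numpy as np

def ips_estimate(reward, action, pi_e, pscore):
    """Vanilla IPS estimate of the evaluation policy's value.

    reward : (n,) observed rewards r_i
    action : (n,) logged action indices a_i
    pi_e   : (n, |A|) evaluation policy probabilities pi_e(a | x_i)
    pscore : (n,) logging propensities pi_0(a_i | x_i)
    """
    n = reward.shape[0]
    # vanilla importance weight w(x_i, a_i) = pi_e(a_i | x_i) / pi_0(a_i | x_i)
    iw = pi_e[np.arange(n), action] / pscore
    return np.mean(iw * reward)
```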
  11. Inverse Propensity Score (IPS): Unbiased OPE
     IPS is unbiased (bias = zero) for any evaluation policy satisfying the common support assumption (which is checkable).
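The one-line argument behind this unbiasedness, under common support (pi_0(a|x) > 0 whenever pi(a|x) > 0):

```latex
\mathbb{E}\big[\hat{V}_{\mathrm{IPS}}(\pi)\big]
 = \mathbb{E}_{p(x)}\!\left[\sum_{a} \pi_0(a|x)\,\frac{\pi(a|x)}{\pi_0(a|x)}\, q(x,a)\right]
 = \mathbb{E}_{p(x)}\!\left[\sum_{a} \pi(a|x)\, q(x,a)\right]
 = V(\pi).
```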
  12. IPS is inaccurate with a growing number of actions
     (Figure: MSE against an increasing number of actions, with 3,000 data points; a “biased” baseline (DM) is shown for reference.) IPS gets significantly worse as the number of actions grows. This is simply due to the use of the vanilla importance weight: the maximum importance weight keeps growing.
  13. Recent Advances: Combining DM and IPS
     More advanced estimators:
     • Doubly Robust (DR) [Dudik+11,14]
     • Switch DR [Wang+17]
     • DR with Optimistic Shrinkage [Su+20]
     • DR with \lambda-smoothing [Metelli+21], etc.
     They all heavily rely on the importance weight, so they still suffer from variance or introduce a large bias (losing unbiasedness and consistency).
  14. Simply combining IPS and DM is not enough for large action spaces
     Advanced estimators still have the same issues because they still use the vanilla importance weight, the source of high variance. (Figure: MSE against the number of actions.) DR suffers from a very large variance, while the other estimators work almost the same as DM because they aggressively modify the importance weight (losing unbiasedness and consistency).
  15. Recent Advances: Combining DM and IPS
     More advanced estimators:
     • Doubly Robust (DR) [Dudik+11,14]
     • Switch DR [Wang+17]
     • DR with Optimistic Shrinkage [Su+20]
     • DR with \lambda-smoothing [Metelli+21], etc.
     They all heavily rely on the importance weight, so they still suffer from variance or introduce a large bias (losing unbiasedness and consistency). How can we gain a large variance reduction without sacrificing unbiasedness/consistency in large action spaces? (We may need to avoid relying on the vanilla importance weight somehow.)
  16. Typical Logged Bandit Data
     Example OPE situation: product recommendations.
     Contexts | Actions | ??? | ??? | Conversion Rate
     User 1   | Item A  | ??? | ??? | 5%
     User 2   | Item B  | ??? | ??? | 2%
     …        | …       | …   | …   | …
  17. We Should Be Able to Use Some “Action Embeddings”
     Example OPE situation: product recommendations.
     Contexts | Actions | Category  | Price | Conversion Rate
     User 1   | Item A  | Books     | $20   | 5%
     User 2   | Item B  | Computers | $500  | 2%
     …        | …       | …         | …     | …
  18. Idea: Auxiliary Information about the Actions
     Key idea: why not leverage auxiliary information about the actions? In addition to the typical logged bandit data for OPE, we observe action embeddings (logged bandit data w/ action embeddings).
  19. Idea: Auxiliary Information about the Actions
     We naturally generalize the typical DGP: besides the (unknown) context and reward distributions and the (known) logging policy, there is an (unknown) action embedding distribution given context “x” and action “a”, which may be context-dependent, stochastic, and continuous.
  20. Action Embeddings: Examples
     Contexts | Actions | Category  | Price | Conversion Rate
     User 1   | Item A  | Books     | $20   | 5%
     User 2   | Item B  | Computers | $500  | 2%
     …        | …       | …         | …     | …
     Category is discrete, context-independent, and deterministic. Price is continuous, context-dependent, and stochastic if the price is given by some personalized algorithm.
  21. Idea: Auxiliary Information about the Actions
     We generalize the typical DGP with an action embedding distribution. How should we utilize action embeddings for accurate OPE? (*We assume the existence of some action embeddings and analyze the connection between their quality and OPE accuracy; optimizing or learning action embeddings is an interesting direction for future work.)
  22. A Key Assumption: No Direct Effect
     To construct a new estimator, we assume the no direct effect assumption: every causal effect of “a” on “r” should be mediated by “e”, i.e., the action embeddings should be informative enough.
  23. A Key Assumption: No Direct Effect
     To construct a new estimator, we assume the no direct effect assumption, which implies that the reward distribution no longer depends on the action once the context and the embedding are given.
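In symbols (a reconstruction consistent with the paper's notation), the no direct effect assumption is the conditional independence of a and r given (x, e), which implies:

```latex
p(r \mid x, a, e) = p(r \mid x, e)
\quad\Longrightarrow\quad
q(x, a, e) := \mathbb{E}[\, r \mid x, a, e \,] = \mathbb{E}[\, r \mid x, e \,] =: q(x, e).
```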
  24. No Direct Effect Assumption: Example
     movie (“a”) | category (“e”) | CVR (“r”)
     Tenet       | SF             | 10%
     Rocky       | Sport          | 5%
     Star Wars   | SF             | 20%
     Moneyball   | Sport          | 30%
     The action embeddings fail to explain the variation in the reward, so the assumption is violated (there must be some direct effect).
  25. No Direct Effect Assumption: Example
     movie (“a”) | category (“e”) | CVR (“r”)
     Tenet       | SF             | 10%
     Rocky       | Sport          | 20%
     Star Wars   | SF             | 10%
     Moneyball   | Sport          | 20%
     The action embeddings fully explain the variation in the reward, so the assumption now holds (no direct effect).
  26. New Expression of the Policy Value
     If the no direct effect assumption is true, we have a new expression of the value of the eval policy (our “estimand”) without using the action variable “a”: the embedding “e” is enough to specify the value.
  27. New Expression of the Policy Value
     If the no direct effect assumption is true, the value can be written in terms of the marginal distribution of action embeddings induced by a particular policy.
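A reconstruction of this expression in the paper's notation, with p(e | x, π) the marginal embedding distribution induced by a policy π:

```latex
p(e \mid x, \pi) = \sum_{a \in \mathcal{A}} \pi(a \mid x)\, p(e \mid x, a),
\qquad
V(\pi) = \mathbb{E}_{p(x)\, p(e \mid x, \pi)}\big[\, q(x, e) \,\big].
```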
  28. Marginal Embedding Distribution
     Given a policy and embeddings, the marginal distribution is very easily defined.
     movie (“a”) | policy prob. | category (“e”) | marginal prob.
     Tenet       | 0.2          | SF             | 0.4
     Rocky       | 0.1          | Sport          | 0.6
     Star Wars   | 0.2          | SF             | 0.4
     Moneyball   | 0.5          | Sport          | 0.6
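A tiny numeric check of the table above (the probabilities and categories are taken from the slide; the snippet itself is only an illustration):

```python
from collections import defaultdict

# policy probability over movies ("a") and their category embedding ("e")
policy = {"Tenet": 0.2, "Rocky": 0.1, "Star Wars": 0.2, "Moneyball": 0.5}
category = {"Tenet": "SF", "Rocky": "Sport", "Star Wars": "SF", "Moneyball": "Sport"}

# marginal embedding distribution induced by the policy:
# p(e | pi) = sum of pi(a) over movies a whose category is e
marginal = defaultdict(float)
for movie, prob in policy.items():
    marginal[category[movie]] += prob

print(dict(marginal))  # -> {'SF': 0.4, 'Sport': 0.6}
```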
  29. New Expression of the Policy Value
     If the no direct effect assumption is true, we have a new expression of the value of the eval policy (our “estimand”) without using the action variable “a”: the embedding “e” is enough to specify the value.
  30. The Marginalized Inverse Propensity Score (MIPS) Estimator
     The new expression of the policy value leads to the following new estimator, Marginalized IPS (MIPS): it replaces the vanilla importance weight of IPS with the marginal importance weight computed over the action embedding space.
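A minimal MIPS sketch under the same illustrative conventions as the IPS snippet above, assuming the marginal importance weights are already available (their estimation is discussed on a later slide):

```python
import numpy as np

def mips_estimate(reward, marginal_iw):
    """Marginalized IPS (MIPS) estimate of the policy value.

    reward      : (n,) observed rewards r_i
    marginal_iw : (n,) marginal importance weights
                  w(x_i, e_i) = p(e_i | x_i, pi_e) / p(e_i | x_i, pi_0),
                  computed over the action embedding space instead of the
                  action space.
    """
    return np.mean(marginal_iw * reward)
```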
  31. Nice Properties of MIPS
     • Unbiased under alternative assumptions (compared to IPS)
     • Large variance reduction for a large number of actions (compared to IPS)
  32. Nice Properties of MIPS
     • Unbiased under alternative assumptions (compared to IPS)
     • Large variance reduction for a large number of actions (compared to IPS)
  33. MIPS is Unbiased with Alternative Assumptions
     Under the no direct effect assumption, MIPS is unbiased for any evaluation policy satisfying the common embedding support assumption (weaker than the common support assumption of IPS).
  34. MIPS is Unbiased with Alternative Assumptions
     Under the no direct effect assumption, MIPS is unbiased for any evaluation policy satisfying the common embedding support assumption (weaker than the common support assumption of IPS).
  35. Bias of MIPS with Violated Assumption (Thm 3.5)
     If the no direct effect assumption is NOT true, MIPS has a bias characterized by two factors, (1) and (2).
  36. Bias of MIPS with Violated Assumption (Thm 3.5)
     If the no direct effect assumption is NOT true, MIPS has a bias characterized by two factors. Factor (1) is how identifiable an action is from the action embeddings: the embeddings should be as informative as possible to reduce the bias.
  37. Bias of MIPS with Violated Assumption
     If the no direct effect assumption is NOT true, MIPS has a bias characterized by two factors. Factor (2) is the amount of direct effect from “a” to “r”: again, the embeddings should be as informative as possible to reduce the bias.
  38. Nice Properties of MIPS
     • Unbiased under alternative assumptions (compared to IPS)
     • Large variance reduction for a large number of actions (compared to IPS)
  39. Variance Reduction by MIPS (Thm 3.6)
     Comparing the variance of IPS and MIPS, MIPS’s variance is never worse than that of IPS.
  40. Variance Reduction by MIPS (Thm 3.6)
     We get a large variance reduction when
     • the vanilla importance weights have high variance (many actions), and
     • the embeddings are NOT too informative (the action should remain stochastic given the embedding).
     This is the opposite motivation compared to the bias reduction.
  41. Bias-Variance Trade-Off of MIPS
     • To reduce the bias, we should use informative action embeddings (high dimensional, high cardinality).
     • To reduce the variance, we should use coarse/noisy action embeddings (low dimensional, low cardinality).
     We do not necessarily have to satisfy the “no direct effect” assumption to achieve a small MSE (= Bias^2 + Var). (There are some interesting empirical results about this trade-off.)
  42. Data-Driven Action Embedding Selection
     • We want to identify a set of action embeddings/features, out of the available action features, that minimizes the MSE of the resulting MIPS: the bias is small when the chosen embedding is informative, and the variance is small when it is coarse.
  43. Data-Driven Action Embedding Selection
     • We want to identify a set of action embeddings/features that minimizes the MSE of the resulting MIPS.
     • The problem is that estimating the bias is as difficult as OPE itself, since it depends on the true policy value.
  44. Data-Driven Action Embedding Selection
     • We want to identify a set of action embeddings/features that minimizes the MSE of the resulting MIPS.
     • The problem is that estimating the bias is as difficult as OPE itself.
     • So, we adapt “SLOPE” [Su+20][Tucker+21] to our setup. SLOPE was originally developed to tune hyperparameters of OPE and does not need to estimate the bias of the estimator (detailed in the paper and in the appendix).
  45. Estimating the Marginal Importance Weights
     • Even if we know the logging policy, we may have to estimate the marginal importance weight, because we do not know the true embedding distribution.
     • A simple procedure is to utilize the transformation of the marginal weight into an expectation of the vanilla weights under the logging policy’s conditional action distribution given the context and embedding. So, our task is to estimate that conditional distribution and use it to compute the weight.
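A rough sketch of this procedure, assuming a scikit-learn classifier for the logging policy's conditional action distribution given (x, e); the function and variable names are illustrative, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_marginal_iw(context, embedding, action, pi_e, pi_0):
    """Estimate w(x_i, e_i) via the transformation
    w(x, e) = E_{pi_0(a | x, e)}[ pi_e(a | x) / pi_0(a | x) ].

    context   : (n, d_x) context features
    embedding : (n, d_e) observed action embeddings e_i
    action    : (n,) logged action indices a_i
    pi_e, pi_0: (n, |A|) action choice probabilities of the evaluation
                and logging policies for each logged context
    """
    # estimate pi_0(a | x, e) with an off-the-shelf multiclass classifier
    features = np.hstack([context, embedding])
    clf = LogisticRegression(max_iter=1000).fit(features, action)
    pi_0_given_xe = clf.predict_proba(features)   # (n, #observed actions)

    # vanilla weights for every action, averaged w.r.t. the estimated pi_0(a | x, e)
    w_xa = pi_e / pi_0                            # (n, |A|)
    w_xa = w_xa[:, clf.classes_]                  # align columns with the classifier
    return np.sum(pi_0_given_xe * w_xa, axis=1)   # (n,) estimated w(x_i, e_i)
```

Once the conditional distribution is estimated, the marginal weight is just an expectation of the vanilla weights under it, which is exactly the transformation referred to on this slide.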
  46. Summary
     • IPS (and the many estimators based on it) becomes very inaccurate and impractical as the action set grows, mainly due to its huge variance.
     • We assume auxiliary knowledge about the actions in the form of action embeddings and develop MIPS based on the no direct effect assumption.
     • We characterize the bias and variance of MIPS, which implies an interesting bias-variance trade-off in terms of the quality of the embeddings.
     • Sometimes we should violate “no direct effect” to improve the MSE of MIPS.
  47. Synthetic Experiment: Setup
     • Baselines
       ◦ DM, IPS, DR (other baselines are compared in the paper)
       ◦ MIPS (estimated weight), and MIPS (true), which provides the best achievable accuracy
     • Basic setting
       ◦ n = 10,000, |A| = 1,000 (much larger than typical OPE experiments)
       ◦ continuous rewards with some Gaussian noise
       ◦ 3-dimensional categorical action embeddings where the cardinality of each dimension is 10, i.e., |E| = 10^3 = 1,000
  48. MIPS is More Robust to an Increasing Number of Actions
     MIPS is robust to growing action sets, achieving a large variance reduction.
     • vs IPS & DR: MIPS is over 10x better for a large number of actions.
     • vs DM: MIPS is also consistently better than DM.
  49. MIPS Makes Full Use of Increasing Data
     MIPS makes full use of large data without sacrificing consistency.
     • MIPS works like IPS/DR, and is much better for small sample sizes.
     • MIPS is also increasingly better than DM for larger sample sizes.
  50. Resolving the Large Action Space Problem
     MIPS dominates DM/IPS/DR in both crucial regimes: it is robust to growing action sets and makes full use of large data.
  51. How does MIPS perform when the assumption is violated?
     We synthesize 20-dimensional action embeddings, all of which are necessary to satisfy the assumption, and then gradually drop necessary embedding dimensions (out of 20). Dropping some dimensions minimizes the MSE of MIPS even though the no direct effect assumption is then violated.
  52. How does MIPS perform when the assumption is violated?
     Missing dimensions introduce some (squared) bias, but they also reduce the variance.
  53. How does data-driven action embedding selection work?
     We can select, in a data-driven way, which embedding dimensions to drop to improve the MSE (detailed in the paper; based on “SLOPE” [Su+20]).
  54. Other Benefits of MIPS
     In addition to the number of actions, MIPS is also more robust to deterministic evaluation policies and to noisy rewards.
  55. Summary of Empirical Results
     • With a growing number of actions, MIPS provides a large variance reduction (it even works better than DM), while the MSEs of IPS & DR inflate.
     • With a growing sample size, MIPS works like an unbiased/consistent estimator (similarly to IPS/DR), while DM remains highly biased.
     • Even if we violate the assumption and introduce some bias, intentionally dropping some embedding dimensions can provide a greater MSE gain.
     MIPS has the potential to improve many other estimators that depend on IPS.
  56. More About The Paper
     • Contact: [email protected]
     • arXiv: https://arxiv.org/abs/2202.06317
     • Experiment code: https://github.com/usaito/icml2022-mips
     • Generic implementation: https://github.com/st-tech/zr-obp
     Thank you!