Off-Policy Evaluation for Large Action Spaces via Embeddings (ICML'22)

Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators – most of which are based on inverse propensity score weighting – degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.

usaito

July 04, 2022

Transcript

  1. Off-Policy Evaluation
    for Large Action Spaces via Embeddings
    (ICML2022)
    Yuta Saito and Thorsten Joachims

  2. Outline
    ● Standard Off-Policy Evaluation for Contextual Bandits
    ● Issues of Existing Estimators for Large Action Spaces
    ● New Framework and Estimator using Action Embeddings
    ● Some Experimental Results
    How can we best utilize auxiliary information about actions for offline evaluation?
    (the typical importance weighting approach fails in large action spaces)

  3. Machine Learning for Decision Making (Bandit / RL)
    We often use machine learning to make decisions, not predictions
    incoming user
    Decision Making: item recommendation
    clicks
    policy
    Ultimate Goal:
    reward maximization, not CTR prediction

  4. Many Applications of “Machine Decision Making”
    How can we evaluate the performance of a new decision-making
    policy using only data collected by a logging (past) policy?
    Motivation of OPE
    ● video recommendation (Youtube)
    ● playlist recommendation (Spotify)
    ● artwork personalization (Netflix)
    ● ad allocation optimization (Criteo)

  5. Many Applications of “Machine Decision Making”
    ● video recommendation (Youtube)
    ● playlist recommendation (Spotify)
    ● artwork personalization (Netflix)
    ● ad allocation optimization (Criteo)
    How can we evaluate the performance of a new decision-making
    policy using only data collected by a logging (past) policy?
    Large Action Spaces
    thousands/millions
    (or even more) of actions
    Motivation of OPE

  6. Data Generating Process (contextual bandit setting)
    the logging policy interacts with the environment
    and produces the log data:
    Observe context (user info)
    A “logging” policy picks an action (movie recommendation)
    Observe reward (clicks, conversions, etc.)

  7. Off-Policy Evaluation: Logged Bandit Data
    We are given logged bandit data collected by the logging policy \pi_0:
    \mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^n, where x \sim p(x) (unknown),
    a \sim \pi_0(a \mid x) (known), and r \sim p(r \mid x, a) (unknown)

  8. Off-Policy Evaluation: Goal
    Our goal is to estimate the value/performance of evaluation policy
    value of eval policy
    (our “estimand”)
    = expected reward we get when we (hypothetically) implement the eval policy
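    In the usual contextual bandit notation, the estimand is
    V(\pi_e) := \mathbb{E}_{p(x)\,\pi_e(a|x)\,p(r|x,a)}[r]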

  9. Off-Policy Evaluation: Goal
    Technically, our goal is to develop an accurate estimator \hat{V} such that
    \hat{V}(\pi_e; \mathcal{D}) \approx V(\pi_e) (the estimand),
    where the logging policy \pi_0 is different from the evaluation policy \pi_e

  10. Off-Policy Evaluation: Goal
    An estimator’s accuracy is quantified by its mean squared error (MSE)
    Bias and Variance are equally
    important for a small MSE
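    For reference, the standard decomposition behind this slide is
    \mathrm{MSE}(\hat{V}) := \mathbb{E}_{\mathcal{D}}\big[(V(\pi_e) - \hat{V}(\pi_e; \mathcal{D}))^2\big] = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V})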

  11. The Inverse Propensity Score (IPS) Estimator
    IPS uses importance weighting to estimate the policy value without bias:
    \hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w(x_i, a_i)\, r_i,
    where w(x,a) := \pi_e(a \mid x) / \pi_0(a \mid x) is the (vanilla) importance weight
    Very easy to implement, with some nice statistical properties;
    popular in practice, and the basis of many advanced estimators
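    As a minimal illustrative sketch (not the paper's implementation; array names are assumptions), IPS can be computed as:

    ```python
    import numpy as np

    def ips_estimate(reward, pi_e_prob, pi_0_prob):
        """Vanilla IPS estimate of the policy value.

        reward:     (n,) observed rewards r_i
        pi_e_prob:  (n,) evaluation-policy probabilities pi_e(a_i | x_i)
        pi_0_prob:  (n,) logging-policy probabilities pi_0(a_i | x_i)
        """
        w = pi_e_prob / pi_0_prob   # vanilla importance weights w(x_i, a_i)
        return np.mean(w * reward)  # sample average of weighted rewards
    ```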

  12. Inverse Propensity Score (IPS): Unbiased OPE
    IPS is unbiased (bias = zero) in the sense that \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common support assumption (checkable):
    \pi_e(a \mid x) > 0 \Rightarrow \pi_0(a \mid x) > 0
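    The (standard) one-line argument, writing q(x, a) := \mathbb{E}[r \mid x, a] for the expected reward:
    \mathbb{E}[w(x,a)\, r] = \mathbb{E}_{p(x)}\Big[\sum_{a} \pi_0(a|x)\, \tfrac{\pi_e(a|x)}{\pi_0(a|x)}\, q(x,a)\Big] = \mathbb{E}_{p(x)}\Big[\sum_{a} \pi_e(a|x)\, q(x,a)\Big] = V(\pi_e),
    where common support guarantees that the importance weight is well defined wherever \pi_e puts probability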

  13. IPS is inaccurate with growing number of actions
    [Plot: MSE vs. an increasing number of actions, n=3,000; DM shown as a “biased” baseline]
    IPS gets significantly worse as the number of actions grows.
    This is simply due to the use of the vanilla importance weight:
    the max importance weight grows with the number of actions.

  14. Recent Advances: Combining DM and IPS
    More Advanced Estimators
    ● Doubly Robust (DR) [Dudik+11,14]
    ● Switch DR [Wang+17]
    ● DR with Optimistic Shrinkage [Su+20]
    ● DR with \lambda-smoothing [Metelli+21]
    etc..
    All of these heavily rely on the vanilla importance weight,
    so they still suffer from high variance or
    introduce a large bias
    (losing unbiasedness and consistency)

  15. Simply combining IPS and DM is not enough for large action spaces
    Advanced estimators still have the same issues:
    they are still using the vanilla importance weight, the source of high variance
    [Plot: MSE vs. number of actions]
    DR suffers from a very large variance;
    other estimators work almost the same as DM,
    by aggressively modifying the importance weight
    (losing unbiasedness & consistency)

  16. Recent Advances: Combining DM and IPS
    More Advanced Estimators
    ● Doubly Robust (DR) [Dudik+11,14]
    ● Switch DR [Wang+17]
    ● DR with Optimistic Shrinkage [Su+20]
    ● DR with \lambda-smoothing [Metelli+21]
    etc..
    All of these heavily rely on the vanilla importance weight,
    so they still suffer from high variance or
    introduce a large bias
    (losing unbiasedness and consistency)
    How can we gain a large variance reduction without sacrificing
    unbiasedness/consistency in large action spaces?
    (we may need to avoid relying on the vanilla importance weight somehow)

  17. Typical Logged Bandit Data
    Contexts | Actions | ??? | ??? | Conversion Rate
    User 1   | Item A  | ??? | ??? | 5%
    User 2   | Item B  | ??? | ??? | 2%
    …        | …       | …   | …   | …
    Example OPE Situation: Product Recommendations

  18. We Should Be Able to Use Some “Action Embeddings”
    Contexts | Actions | Category  | Price | Conversion Rate
    User 1   | Item A  | Books     | $20   | 5%
    User 2   | Item B  | Computers | $500  | 2%
    …        | …       | …         | …     | …
    Example OPE Situation: Product Recommendations

  19. Idea: Auxiliary Information about the Actions
    Key idea: why not leverage auxiliary information about the actions?
    we additionally observe
    action embeddings
    typical logged
    bandit data for OPE
    logged bandit data
    w/ action embeddings

  20. Idea: Auxiliary Information about the Actions
    We naturally generalize the typical DGP as:
    \mathcal{D} = \{(x_i, a_i, e_i, r_i)\}_{i=1}^n, where x \sim p(x) (unknown), a \sim \pi_0(a \mid x) (known),
    e \sim p(e \mid x, a) (unknown), and r \sim p(r \mid x, a, e) (unknown)
    p(e \mid x, a) is the action embedding distribution given context “x” and action “a”;
    it may be context-dependent, stochastic, and continuous

  21. Action Embeddings: Examples
    Contexts | Actions | Category  | Price | Conversion Rate
    User 1   | Item A  | Books     | $20   | 5%
    User 2   | Item B  | Computers | $500  | 2%
    …        | …       | …         | …     | …
    Category:
    ● discrete
    ● context-independent
    ● deterministic
    Price:
    ● continuous
    ● context-dependent
    ● stochastic
    (if price is given by some personalized algorithm)

  22. Idea: Auxiliary Information about the Actions
    We generalize the typical DGP as above, with e \sim p(e \mid x, a) as the action embedding distribution.
    How should we utilize action embeddings for accurate OPE?
    *we assume the existence of some action embeddings and analyze the connection between their
    quality and OPE accuracy; optimizing or learning action embeddings is an interesting direction for future work

  23. A Key Assumption: No Direct Effect
    To construct a new estimator, we make the no direct effect assumption:
    every causal effect of the action “a” on the reward “r”
    must be mediated by the embedding “e”,
    i.e., action embeddings should be informative enough
    (causal graph: action “a” → action embedding “e” → reward “r”)

  24. A Key Assumption: No Direct Effect
    To construct a new estimator, we make the no direct effect assumption;
    this implies p(r \mid x, a, e) = p(r \mid x, e), i.e., r is independent of a given (x, e)

  25. No Direct Effect Assumption: Example
    movie (“a”) | category (“e”) | CVR (“r”)
    Tenet       | SF             | 10%
    Rocky       | Sport          | 5%
    Star Wars   | SF             | 20%
    Moneyball   | Sport          | 30%
    Action embeddings fail to explain the variation in the reward,
    so the assumption is violated
    (there must be some direct effect of “a” on “r”)

  26. No Direct Effect Assumption: Example
    movie (“a”) | category (“e”) | CVR (“r”)
    Tenet       | SF             | 10%
    Rocky       | Sport          | 20%
    Star Wars   | SF             | 10%
    Moneyball   | Sport          | 20%
    Action embeddings fully explain the variation in the reward,
    so the assumption now holds
    (no direct effect of “a” on “r”)

  27. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    a new expression without using the action variable “a”
    embedding “e” is enough to specify the value
    value of eval policy
    (our “estimand”)

  28. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    where p(e \mid x, \pi) is the marginal distribution of
    action embeddings induced by a particular policy \pi
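    Concretely (for discrete embeddings), the marginal embedding distribution and the resulting expression are
    p(e \mid x, \pi) := \sum_{a \in \mathcal{A}} p(e \mid x, a)\, \pi(a \mid x), \qquad V(\pi_e) = \mathbb{E}_{p(x)\, p(e|x,\pi_e)\, p(r|x,e)}[r]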

  29. Marginal Embedding Distribution
    movie (“a”) | \pi(a|x) | category (“e”) | p(e|x,\pi)
    Tenet       | 0.2      | SF             | 0.4
    Rocky       | 0.1      | Sport          | 0.6
    Star Wars   | 0.2      | SF             | 0.4
    Moneyball   | 0.5      | Sport          | 0.6
    Given a policy and the embeddings, the marginal distribution is easily computed:
    e.g., p(SF | x, \pi) = 0.2 + 0.2 = 0.4 and p(Sport | x, \pi) = 0.1 + 0.5 = 0.6
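    A tiny sketch reproducing the numbers on this slide (deterministic category embeddings; names are illustrative):

    ```python
    import numpy as np

    # policy probabilities pi(a|x) for the four movies, in the slide's order
    pi = np.array([0.2, 0.1, 0.2, 0.5])  # Tenet, Rocky, Star Wars, Moneyball

    # deterministic embedding p(e|a): rows = movies, columns = categories [SF, Sport]
    p_e_given_a = np.array([[1, 0],   # Tenet     -> SF
                            [0, 1],   # Rocky     -> Sport
                            [1, 0],   # Star Wars -> SF
                            [0, 1]])  # Moneyball -> Sport

    # marginal embedding distribution p(e|x, pi) = sum_a p(e|a) * pi(a|x)
    print(pi @ p_e_given_a)  # [0.4 0.6] -> SF: 0.4, Sport: 0.6
    ```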

  30. New Expression of the Policy Value
    If the no direct effect assumption is true, we have
    a new expression without using the action variable “a”
    embedding “e” is enough to specify the value
    value of eval policy
    (our “estimand”)

  31. The Marginalized Inverse Propensity Score (MIPS) Estimator
    The new expression of the policy value leads to the following new estimator,
    Marginalized IPS (MIPS):
    \hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w(x_i, e_i)\, r_i,
    where w(x,e) := p(e \mid x, \pi_e) / p(e \mid x, \pi_0) is the marginal importance weight
    computed over the action embedding space
    -> MIPS ignores the vanilla importance weight of IPS
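    A minimal sketch of MIPS for the case where p(e | x, a) is known and discrete (illustrative only; in general the marginal weights must be estimated, as discussed later):

    ```python
    import numpy as np

    def mips_estimate(reward, emb_idx, pi_e, pi_0, p_e_given_xa):
        """MIPS estimate with a known, discrete embedding distribution.

        reward:        (n,) observed rewards r_i
        emb_idx:       (n,) index of the observed embedding e_i
        pi_e, pi_0:    (n, |A|) action probabilities of the two policies per round
        p_e_given_xa:  (n, |A|, |E|) embedding distribution p(e | x_i, a)
        """
        # marginal embedding distributions p(e | x_i, pi) for both policies
        p_e_pi_e = np.einsum("na,nae->ne", pi_e, p_e_given_xa)
        p_e_pi_0 = np.einsum("na,nae->ne", pi_0, p_e_given_xa)
        rows = np.arange(reward.shape[0])
        # marginal importance weight w(x_i, e_i) at the logged embedding
        w = p_e_pi_e[rows, emb_idx] / p_e_pi_0[rows, emb_idx]
        return np.mean(w * reward)
    ```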

  32. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  33. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  34. MIPS is Unbiased with Alternative Assumptions
    Under the no direct effect assumption, MIPS is unbiased, i.e.,
    \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common embedding support assumption
    (weaker than the common support assumption of IPS)

  35. MIPS is Unbiased with Alternative Assumptions
    Under the no direct effect assumption, MIPS is unbiased, i.e.,
    \mathbb{E}_{\mathcal{D}}[\hat{V}_{\mathrm{MIPS}}(\pi_e; \mathcal{D})] = V(\pi_e)
    for any evaluation policy satisfying the common embedding support assumption
    (weaker than the common support assumption of IPS)
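    The unbiasedness argument mirrors the one for IPS, now over the embedding space, with q(x, e) := \mathbb{E}[r \mid x, e] (well defined under no direct effect; sums become integrals for continuous embeddings):
    \mathbb{E}[w(x,e)\, r] = \mathbb{E}_{p(x)}\Big[\sum_{e} p(e|x,\pi_0)\, \tfrac{p(e|x,\pi_e)}{p(e|x,\pi_0)}\, q(x,e)\Big] = \mathbb{E}_{p(x)}\Big[\sum_{e} p(e|x,\pi_e)\, q(x,e)\Big] = V(\pi_e)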

  36. Bias of MIPS with Violated Assumption (Thm 3.5)
    If the no direct effect assumption is NOT true...
    MIPS is no longer unbiased; its bias is characterized by two factors, (1) and (2)

  37. Bias of MIPS with Violated Assumption (Thm 3.5)
    If the no direct effect assumption is NOT True...
    The bias of MIPS depends on two factors, (1) and (2):
    (1) how identifiable an action is
    from action embeddings
    embeddings should be as informative as possible to reduce the bias

  38. Bias of MIPS with Violated Assumption
    If the no direct effect assumption is NOT True...
    The bias of MIPS depends on two factors, (1) and (2):
    (2) amount of direct effect
    from “a” to “r”
    embeddings should be as informative as possible to reduce the bias

  39. Nice Properties of MIPS
    ● Unbiased with Alternative Assumptions (compared to IPS)
    ● Large Variance Reduction for a Large Number of Actions (compared to IPS)

  40. Variance Reduction by MIPS (Thm 3.6)
    MIPS’s variance is never worse than that of IPS
    Comparing the variance of IPS and MIPS
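    One way to see this: conditioning on (x, e) and applying the law of total variance to the per-sample terms gives (see Thm 3.6 in the paper for the precise statement)
    n\,\big(\mathbb{V}[\hat{V}_{\mathrm{IPS}}] - \mathbb{V}[\hat{V}_{\mathrm{MIPS}}]\big) = \mathbb{E}_{p(x)\, p(e|x,\pi_0)}\Big[\mathbb{E}[r^2 \mid x, e]\; \mathbb{V}_{\pi_0(a|x,e)}\big[w(x,a)\big]\Big] \ge 0,
    so the gain is large exactly when the vanilla weights w(x, a) vary a lot among actions that share the same embedding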

  41. Variance Reduction by MIPS (Thm 3.6)
    We get a large variance reduction when
    ● the vanilla importance weights have a high variance (many actions)
    ● the embeddings are NOT too informative (i.e., \pi_0(a \mid x, e) remains stochastic)
    (the opposite motivation compared to bias reduction)

  42. Bias-Variance Trade-Off of MIPS
    ● To reduce the bias, we should use informative action embeddings
    (high dimensional, high cardinality)
    ● To reduce the variance, we should use coarse/noisy action embeddings
    (low dimensional, low cardinality)
    We do not necessarily have to satisfy the “no direct effect”
    assumption to achieve a small MSE (= Bias^2 + Var)
    (there are some interesting empirical results about this trade-off)

  43. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    where the minimization is over subsets of the available action features
    (the bias term is small when the selected subset is informative; the variance term is small when it is coarse)

  44. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    ● The problem is that estimating the bias is as difficult as OPE itself
    (the bias depends on the true policy value)

  45. Data-Driven Action Embedding Selection
    ● We want to identify a set of action embeddings/features
    that minimizes the MSE of the resulting MIPS
    ● The problem is that estimating the bias is as difficult as OPE itself
    ● So, we adapt “SLOPE” [Su+20] [Tucker+21] to our setup
    ● SLOPE was originally developed to tune hyperparameters of OPE
    and does not need to estimate the bias of the estimator
    (detailed in the paper and in the appendix)

  46. Estimating the Marginal Importance Weights
    ● Even if we know the logging policy, we may have to estimate the marginal
    importance weight w(x, e), because we do not know the true distribution p(e \mid x, a)
    ● A simple procedure is to utilize the following transformation:
    w(x, e) = \mathbb{E}_{\pi_0(a \mid x, e)}[\, \pi_e(a \mid x) / \pi_0(a \mid x) \,]
    So, our task is to estimate \pi_0(a \mid x, e) (e.g., via supervised classification)
    and use it to compute \hat{w}(x, e)
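    A minimal sketch of this procedure (variable names and the choice of classifier are assumptions for illustration, not the paper's implementation):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_marginal_weights(x, a, e, pi_e, pi_0):
        """Estimate w(x_i, e_i) = E_{pi_0(a|x_i,e_i)}[ pi_e(a|x_i) / pi_0(a|x_i) ].

        x: (n, d_x) contexts; a: (n,) logged actions; e: (n, d_e) logged embeddings
        pi_e, pi_0: (n, |A|) action probabilities of the two policies
        (assumes actions are labeled 0..|A|-1 and every action appears in the log)
        """
        # classifier approximating pi_0(a | x, e) via supervised learning
        clf = LogisticRegression(max_iter=1000)
        clf.fit(np.hstack([x, e]), a)
        pi_0_a_given_xe = clf.predict_proba(np.hstack([x, e]))  # (n, |A|)
        vanilla_w = pi_e / pi_0                                  # (n, |A|)
        return np.sum(pi_0_a_given_xe * vanilla_w, axis=1)       # (n,) estimated w(x_i, e_i)
    ```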

  47. Summary
    ● IPS (and many estimators based on it) becomes very inaccurate and impractical
    with a growing action set, mainly due to huge variance
    ● We assume auxiliary knowledge about the actions in the form of action
    embeddings and develop MIPS based on the no direct effect assumption
    ● We characterize the bias and variance of MIPS, which imply an
    interesting bias-variance trade-off in terms of the quality of the embeddings
    ● Sometimes we should intentionally violate “no direct effect” to improve the MSE of MIPS

  48. Synthetic Experiment: Setup
    ● Baselines
    ○ DM, IPS, DR (other baselines are compared in the paper)
    ○ MIPS (estimated weight) and MIPS (true)
    (MIPS (true) uses the true marginal importance weights and provides the best achievable accuracy)
    ● Basic Setting
    ○ n=10,000, |A|=1,000 (much larger than typical OPE experiments)
    ○ continuous rewards with some Gaussian noise
    ○ 3-dimensional categorical action embeddings,
    where the cardinality of each dimension is 10, i.e., |E| = 10^3 = 1,000
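    For concreteness, a tiny sketch of how such categorical embeddings could be synthesized (an assumption for illustration, not the exact generator used in the paper; see the linked experiment code for that):

    ```python
    import numpy as np

    rng = np.random.default_rng(12345)
    n_actions, n_dim, n_cat = 1000, 3, 10  # |A| = 1,000; 3 dims of cardinality 10 -> |E| = 10^3

    # each action gets a fixed 3-dimensional categorical embedding
    action_embeddings = rng.integers(n_cat, size=(n_actions, n_dim))
    ```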

  49. MIPS is More Robust to an Increasing Number of Actions
    robust to growing action sets
    vs. IPS & DR:
    ● MIPS is over 10x better for
    a large number of actions
    vs. DM:
    ● MIPS is also consistently
    better than DM
    Achieving a large variance reduction

  50. MIPS Makes Full Use of Increasing Data
    make full use of large data
    ● MIPS works like IPS/DR, and is
    much better with small sample sizes
    ● MIPS is also increasingly better
    than DM for larger sample sizes
    Avoids sacrificing consistency

  51. Resolving the Large Action Space Problem
    MIPS dominates DM/IPS/DR for both crucial cases
    robust to growing action sets make full use of large data

  52. How does MIPS perform when the assumption is violated?
    Synthesize 20-dimensional action
    embeddings, where all dimensions are necessary to
    satisfy the assumption.
    Then gradually drop necessary
    embedding dimensions (out of 20).
    Dropping some dimensions
    minimizes the MSE of MIPS,
    even though the no direct effect
    assumption is then violated

  53. How does MIPS perform when the assumption is violated?
    Missing dimensions
    introduce some bias,
    but they also reduce the variance
    (squared bias vs. variance)

  54. How does data-driven action embedding selection work?
    We can select, in a data-driven way,
    which embedding dimensions
    to drop to improve the MSE
    (detailed in the paper;
    it is based on “SLOPE” [Su+20])

  55. Other Benefits of MIPS
    In addition to being robust to a large number of actions, MIPS is also more
    robust to (near-)deterministic evaluation policies and to noisy rewards
    [Plots: MSE as the evaluation policy varies from deterministic to uniform, and as reward noise grows]

  56. Summary of Empirical Results
    ● With a growing number of actions, MIPS provides a large variance
    reduction (working even better than DM), while the MSEs of IPS & DR inflate
    ● With a growing sample size, MIPS works like an unbiased/consistent
    estimator (similarly to IPS/DR), while DM remains highly biased
    ● Even if we violate the assumption and introduce some bias, intentionally
    dropping some embedding dimensions can provide a greater MSE gain
    Potential to improve many other estimators that depend on IPS

  57. More About The Paper
    ● Contact: [email protected]
    ● arXiv: https://arxiv.org/abs/2202.06317
    ● Experiment code: https://github.com/usaito/icml2022-mips
    ● Generic Implementation: https://github.com/st-tech/zr-obp
    Thank you!
