
Evaluating the Robustness of Off-Policy Evaluation

Haruka Kiyohara
September 27, 2021


Transcript

  1. Evaluating the Robustness of Off-Policy Evaluation
    Yuta Saito, Takuma Udagawa, Haruka Kiyohara,
    Kazuki Mogi, Yusuke Narita, Kei Tateno
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara
    September 2021


  2. Machine decision making in recommenders
    A policy (e.g., a contextual bandit) makes decisions to recommend items,
    with the goal of maximizing the (expected) reward.
    [figure: a coming user → an item → reward (e.g., click)]


  3. The system also produces logged data
    [figure: a coming user (context 𝒙) → an item (action 𝒂) → reward (e.g., click) (reward 𝒓);
    this logged bandit feedback is collected by a behavior policy 𝝅𝒃]
    Motivation:
    We want to evaluate future policies using the logged data.


  4. Outline
    • Off-Policy Evaluation (OPE)
    • Emerging challenge: Selection of OPE estimators
    • Our goal: Evaluating the Robustness of Off-Policy Evaluation


  5. Off-Policy Evaluation (OPE)
    In OPE, we aim to evaluate the performance of a new evaluation policy π_e
    using logged bandit feedback collected by the behavior policy π_b.
    The target is V(π_e), the expected reward obtained by running π_e on the real system.
    An OPE estimator V̂(π_e; D, θ) approximates it from the logged data D, where θ denotes
    the hyperparameters of the OPE estimator and the estimation must account for the
    distribution shift between π_b and π_e.
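    As a concrete illustration of this setup, here is a minimal Python sketch (not from the paper or pyIEOE; the synthetic environment and names such as softmax_policy are hypothetical) that generates logged bandit feedback under a behavior policy π_b and computes the ground-truth value V(π_e) that the estimators below try to recover.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, n_actions, dim = 10000, 5, 3

    # Hypothetical synthetic environment: contexts, a true reward model, two softmax policies.
    X = rng.normal(size=(n, dim))                        # contexts x_i
    theta = rng.normal(size=(dim, n_actions))
    q = 1.0 / (1.0 + np.exp(-X @ theta))                 # true mean reward q(x, a) = E[r | x, a]

    def softmax_policy(X: np.ndarray, beta: float) -> np.ndarray:
        """Action distribution pi(a | x) proportional to exp(beta * score)."""
        logits = beta * (X @ theta)
        z = np.exp(logits - logits.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)

    pi_b = softmax_policy(X, beta=0.3)                   # behavior policy pi_b
    pi_e = softmax_policy(X, beta=3.0)                   # evaluation policy pi_e

    # Logged bandit feedback D = {(x_i, a_i, r_i)} collected by pi_b.
    actions = np.array([rng.choice(n_actions, p=p) for p in pi_b])
    rewards = rng.binomial(1, q[np.arange(n), actions])  # binary reward (e.g., click)

    # Ground-truth value of pi_e (unknown in practice; this is what OPE estimates).
    V_pi_e = float((pi_e * q).sum(axis=1).mean())
    print(f"V(pi_e) = {V_pi_e:.4f}")
    ```

    The estimator sketches on the following slides assume arrays shaped like these (pi_e, q_hat, actions, rewards, and per-record importance weights iw).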


  6. Off-Policy Evaluation (OPE)
    In OPE, we aim to evaluate the performance of a new evaluation policy π_e
    using logged bandit feedback collected by the behavior policy π_b.
    An accurate OPE is beneficial because it
    • avoids deploying poor policies without A/B tests
    • identifies promising new policies among many candidates
    Growing interest in OPE!


  7. Direct Method (DM)
    DM estimates the mean reward function q̂(x, a) and plugs it into the empirical average
    E_n[·] over the logged data:  V̂_DM(π_e; D, q̂) := E_n[ Σ_a π_e(a|x_i) q̂(x_i, a) ].
    Large bias*, small variance.  (*due to the inaccuracy of q̂)
    Hyperparameter: q̂
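    A minimal numpy sketch of DM under the setup above (q_hat would be the predictions of a fitted reward model; the function name and signature are illustrative, not pyIEOE's API):

    ```python
    import numpy as np

    def direct_method(pi_e: np.ndarray, q_hat: np.ndarray) -> float:
        """DM: plug the estimated mean reward into the empirical average.

        pi_e  : (n, n_actions) evaluation-policy probabilities pi_e(a | x_i)
        q_hat : (n, n_actions) estimated mean rewards q_hat(x_i, a)
        """
        # E_n[ sum_a pi_e(a | x_i) * q_hat(x_i, a) ]
        return float((pi_e * q_hat).sum(axis=1).mean())
    ```

    Since DM never touches importance weights, its variance stays small, but any misspecification of q̂ shows up directly as bias.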


  8. Inverse Probability Weighting (IPW) [Strehl+, 2010]
    IPW mitigates the distribution shift between π_b and π_e using importance sampling:
    V̂_IPW(π_e; D) := E_n[ w(x_i, a_i) r_i ],  where w(x, a) := π_e(a|x) / π_b(a|x)
    is the importance weight.
    Unbiased*, but large variance.  (*when π_b is known or accurately estimated)
    Hyperparameter: π̂_b (when π_b is unknown)
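    A corresponding sketch of IPW (again with illustrative names; iw is assumed to hold the per-record importance weights):

    ```python
    import numpy as np

    def ipw(rewards: np.ndarray, iw: np.ndarray) -> float:
        """IPW: reweight logged rewards by w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i).

        rewards : (n,) observed rewards r_i
        iw      : (n,) importance weights for the logged (x_i, a_i) pairs
        """
        return float((iw * rewards).mean())

    # With the arrays from the earlier setup sketch, the weights would be, e.g.:
    # iw = pi_e[np.arange(n), actions] / pi_b[np.arange(n), actions]
    ```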


  9. Doubly Robust (DR) [Dudík+, 2014]
    DR tackles the variance of IPW by leveraging the baseline estimate q̂ and performing
    importance weighting only on its residual:
    V̂_DR(π_e; D, q̂) := E_n[ Σ_a π_e(a|x_i) q̂(x_i, a) + w(x_i, a_i) (r_i − q̂(x_i, a_i)) ].
    Unbiased* and lower variance than IPW.  (*when π_b is known or accurately estimated)
    Hyperparameters: π̂_b (when π_b is unknown) + q̂
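    A sketch of DR combining the two previous pieces (illustrative signature):

    ```python
    import numpy as np

    def doubly_robust(rewards: np.ndarray, actions: np.ndarray, iw: np.ndarray,
                      pi_e: np.ndarray, q_hat: np.ndarray) -> float:
        """DR: DM baseline plus importance weighting applied only to the residual."""
        n = rewards.shape[0]
        baseline = (pi_e * q_hat).sum(axis=1)              # sum_a pi_e(a | x_i) q_hat(x_i, a)
        residual = rewards - q_hat[np.arange(n), actions]  # r_i - q_hat(x_i, a_i)
        return float((baseline + iw * residual).mean())
    ```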


  10. Pessimistic Shrinkage (IPWps, DRps) [Su+, 2020]
    IPWps and DRps further reduce the variance by clipping large importance weights,
    replacing w(x_i, a_i) with the clipped weight min{ w(x_i, a_i), λ }.
    Lower variance than IPW / DR.
    Hyperparameters: π̂_b (when π_b is unknown) (+ q̂ for DRps) + λ
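    The clipping itself is a one-liner; IPWps / DRps are then just IPW / DR run with the clipped weights (sketch with illustrative names):

    ```python
    import numpy as np

    def clip_weights(iw: np.ndarray, lam: float) -> np.ndarray:
        """Pessimistic shrinkage: cap each importance weight at lambda."""
        return np.minimum(iw, lam)

    # e.g., IPWps as a composition with the earlier ipw() sketch:
    # v_ipw_ps = ipw(rewards, clip_weights(iw, lam=10.0))
    ```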


  11. Self-Normalization (SNIPW, SNDR) [Swaminathan & Joachims, 2015]
    SNIPW and SNDR address the variance issue of IPW and DR by using
    self-normalized importance weights, i.e., w(x_i, a_i) / E_n[ w(x_j, a_j) ].
    Consistent* and lower variance than IPW / DR.  (*when π_b is known or accurately estimated)
    Hyperparameters: π̂_b (when π_b is unknown) (+ q̂ for SNDR)
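    A sketch of SNIPW (illustrative names); SNDR applies the same self-normalized weights to the DR residual term:

    ```python
    import numpy as np

    def snipw(rewards: np.ndarray, iw: np.ndarray) -> float:
        """SNIPW: normalize by the sum of importance weights instead of the sample size,
        which keeps the estimate within the observed reward range."""
        return float((iw * rewards).sum() / iw.sum())
    ```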


  12. Switch-DR [Wang+, 2017]
    Switch-DR interpolates between DM and DR (τ → 0 recovers DM, τ → ∞ recovers DR)
    by using importance weighting only when the weight is small, i.e., w(x_i, a_i) ≤ τ.
    Lower variance than DR.
    Hyperparameters: π̂_b (when π_b is unknown), q̂ + τ
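    A sketch of Switch-DR (illustrative signature), which simply masks the residual term wherever the weight exceeds τ:

    ```python
    import numpy as np

    def switch_dr(rewards: np.ndarray, actions: np.ndarray, iw: np.ndarray,
                  pi_e: np.ndarray, q_hat: np.ndarray, tau: float) -> float:
        """Switch-DR: importance-weight the residual only where w_i <= tau."""
        n = rewards.shape[0]
        baseline = (pi_e * q_hat).sum(axis=1)
        residual = rewards - q_hat[np.arange(n), actions]
        switch = (iw <= tau).astype(float)   # tau -> 0 recovers DM, tau -> inf recovers DR
        return float((baseline + switch * iw * residual).mean())
    ```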


  13. DR with Optimistic Shrinkage (DRos) [Su+, 2020]
    DRos uses a new weight function to bridge DM and DR (λ → 0 recovers DM, λ → ∞ recovers DR),
    shrinking the importance weights so as to minimize a sharp bound on the mean squared error.
    Hyperparameters: π̂_b (when π_b is unknown), q̂ + λ
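    If I recall Su et al. (2020) correctly, the optimistic-shrinkage weight has the form λw / (w² + λ); treat the exact expression as an assumption and check the paper. DRos is then DR run with the shrunk weights (illustrative sketch):

    ```python
    import numpy as np

    def optimistic_shrinkage(iw: np.ndarray, lam: float) -> np.ndarray:
        """Shrunk weight lam * w / (w^2 + lam): -> 0 (DM) as lam -> 0, -> w (DR) as lam -> inf."""
        return lam * iw / (iw ** 2 + lam)

    # e.g., composed with the earlier doubly_robust() sketch:
    # v_dr_os = doubly_robust(rewards, actions, optimistic_shrinkage(iw, lam=100.0), pi_e, q_hat)
    ```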


  14. Many estimators with different hyperparameters
    Estimator Selection:
    Which OPE estimator (and hyperparameters) should be used in practice?


  15. What properties are desirable in practice?
    • An estimator that works without significant hyperparameter tuning.
    .. because suitable hyperparameters may depend on the logged data and the evaluation policy,
    which also entails a risk of overfitting.
    • An estimator that is stably accurate across various evaluation policies.
    .. because we need to evaluate various candidate policies to choose from.
    • An estimator that shows acceptable errors in the worst case.
    .. because uncertainty of estimation is of great interest.
    We want to evaluate the estimators’ robustness to the possible changes
    in configurations such as hyperparameters and evaluation policies!


  16. Is conventional evaluation sufficient?
    Conventional OPE experiments compare the mean squared error to evaluate the
    performance (estimation accuracy) of OPE estimators, evaluating only a single
    set of configurations.
    Pitfall: this fails to evaluate the estimators’ robustness to configuration changes
    (such as the hyperparameters θ and the evaluation policy π_e).


  17. Towards more informative evaluation for practice
    To tackle the issues in the conventional experimental procedure, we propose
    Interpretable evaluation for offline evaluation (IEOE), which can:
    ✓ evaluate the estimators’ robustness to possible configuration changes
    ✓ provide a visual interpretation of the distribution of estimation errors
    ✓ be easily implemented using our open-source Python software, pyIEOE


  18. Interpretable evaluation for offline evaluation (IEOE)
    ① set configuration spaces (hyperparameters θ and evaluation policies π_e)
    ② for each random seed s, sample configurations
    ③ calculate the estimators’ squared error (SE) on the sampled configurations
    ④ obtain an error distribution
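    A minimal sketch of steps ①–④ (this is not the pyIEOE API; sample_config and compute_se are hypothetical callables the user would supply):

    ```python
    import numpy as np

    def ieoe_error_distribution(estimators, config_space, sample_config, compute_se,
                                n_seeds: int = 100):
        """Sample a configuration per seed, score every estimator on it,
        and collect each estimator's squared-error distribution.

        estimators    : dict mapping estimator name -> estimator callable
        config_space  : configuration spaces fixed in step (1)
        sample_config : callable (config_space, rng) -> sampled configuration   (step 2)
        compute_se    : callable (estimator, config) -> squared error           (step 3)
        """
        se = {name: [] for name in estimators}
        for seed in range(n_seeds):
            rng = np.random.default_rng(seed)
            config = sample_config(config_space, rng)
            for name, estimator in estimators.items():
                se[name].append(compute_se(estimator, config))
        return {name: np.asarray(errors) for name, errors in se.items()}        # step 4
    ```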





  19. Visual comparison of OPE estimators
    The next step is to approximate the cumulative distribution function (CDF) of the SEs.
    We can then interpret how robust the estimators are across the given configurations.
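    A sketch of the empirical CDF used for the visual comparison (illustrative helper):

    ```python
    import numpy as np

    def cdf_of_se(squared_errors: np.ndarray):
        """Empirical CDF of squared errors: sorted SEs and P(SE <= value)."""
        se = np.sort(squared_errors)
        probs = np.arange(1, se.size + 1) / se.size
        return se, probs

    # Plotting probs against se (one curve per estimator) gives the visual comparison:
    # curves that rise earlier indicate estimators that are more robust across configurations.
    ```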


  20. Quantitative performance measure
    Based on the CDF, we can define some summary scores, which are useful for
    quantitative performance comparisons.
    • Area under the curve (AU-CDF) compares the estimators’ squared errors
    below a given threshold.
    • Conditional value-at-risk (CVaR) compares the expected value of the estimators’
    squared error in the worst α × 100% of trials.
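    Sketches of the two scores, computed directly from the SE samples; the exact normalization and conventions may differ from the paper, so treat these as assumptions:

    ```python
    import numpy as np

    def au_cdf(squared_errors: np.ndarray, threshold: float) -> float:
        """Area under the empirical CDF of SEs up to `threshold` (larger is better)."""
        se = np.sort(squared_errors)
        grid = np.linspace(0.0, threshold, 1000)
        cdf_on_grid = np.searchsorted(se, grid, side="right") / se.size
        dx = grid[1] - grid[0]
        return float(cdf_on_grid.sum() * dx)   # simple Riemann-sum approximation of the area

    def cvar(squared_errors: np.ndarray, alpha: float = 0.3) -> float:
        """Mean SE over the worst alpha * 100% of trials (smaller is better)."""
        k = max(1, int(np.ceil(alpha * squared_errors.size)))
        return float(np.sort(squared_errors)[-k:].mean())
    ```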


  21. Experiments in a real-world application
    • We applied IEOE to estimator selection in a real e-commerce platform.
    • The result demonstrates that SNIPW is stable across various configurations.
    The platform now uses SNIPW based on our analysis!
    *The values are normalized by those of SNIPW.
    (the conclusion may change for different applications)


  22. Take-Home Message
    • OPE is beneficial for identifying “good” counterfactual policies.
    • In practice, we want OPE estimators that are robust to configuration changes.
    • However, the conventional experimental procedure fails to provide informative results.
    • IEOE evaluates estimators’ robustness and conveys the results in an interpretable way.
    IEOE can help practitioners choose a reliable OPE estimator!


  23. Thank you for listening!
    Find out more (e.g., quantitative metrics and experiments) in the full paper!
    contact: [email protected]


  24. References
    [Strehl+, 2010] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. Learning from Logged Implicit
    Exploration Data. NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Dudík+, 2014] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy
    Evaluation and Optimization. Statistical Science, 2014. https://arxiv.org/abs/1503.02834
    [Swaminathan & Joachims, 2015] Adith Swaminathan and Thorsten Joachims. The Self-Normalized
    Estimator for Counterfactual Learning. NeurIPS, 2015.
    https://dl.acm.org/doi/10.5555/2969442.2969600
    [Wang+, 2017] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and Adaptive Off-policy
    Evaluation in Contextual Bandits. ICML, 2017. https://arxiv.org/abs/1612.01205
    [Su+, 2020] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust
    Off-policy Evaluation with Shrinkage. ICML, 2020. https://arxiv.org/abs/1907.09623
    [Narita+, 2021] Yusuke Narita, Shota Yasui, Kohei Yata. Debiased Off-Policy Evaluation for
    Recommendation Systems. RecSys, 2021. https://arxiv.org/abs/2002.08536
