
Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Haruka Kiyohara
September 22, 2022


AAAI2023
arXiv: https://arxiv.org/abs/2211.13904

CONSEQUENCES+REVEAL WS @ RecSys2022 (Day2, CONSEQUENCES)
About WS: https://sites.google.com/view/consequences2022

CFML Study Group #7 (CFML勉強会 #7)
https://cfml.connpass.com/event/264017/

RecSys Paper Reading Session 2022 (RecSys読み会2022)
https://connpass.com/event/261571/


Transcript

  1. Policy Adaptive Estimator Selection
    for Off-Policy Evaluation
    Takuma Udagawa, Haruka Kiyohara, Yusuke Narita,
    Yuta Saito, Kei Tateno
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara
    September 2022 Policy Adaptive Estimator Selection (PAS-IF) 1


  2. Content
    • Introduction to Off-Policy Evaluation (OPE)
    • Estimator Selection for OPE
    • Our proposal: Policy-Adaptive Estimator Selection via Importance Fitting (PAS-IF)
    • Synthetic Experiments
    • Estimator Selection
    • Policy Selection

  3. Off-Policy Evaluation
    Motivation towards Estimator Selection

  4. Interactions in recommender systems
    A behavior policy interacts with users and collects logged data:
    a coming user (context) is shown an item (action) and gives feedback (reward),
    and these interactions are stored as logged bandit feedback by the behavior policy 𝜋_𝑏.

  6. Off-Policy Evaluation
    The goal is to evaluate the performance of an evaluation policy 𝜋_𝑒 (an offline A/B test),
    using only the logged bandit feedback collected by the behavior policy 𝜋_𝑏.
    An OPE estimator estimates the policy performance from this data.

  7. Representative OPE estimators
    We aim to reduce both bias and variance to enable an accurate OPE.
    • Direct Method (DM) [Beygelzimer&Langford,09]: model-based, plugging in a reward predictor;
      high bias, low variance.
    • Inverse Propensity Scoring (IPS) [Precup+,00] [Strehl+,10]: importance sampling-based,
      reweighting by the importance weight (the evaluation / behavior policy ratio);
      unbiased, very high variance.
    • Doubly Robust (DR) [Dudík+,14]: combines both, using the reward predictor as a control variate;
      unbiased, variance lower than IPS but still high.

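For concreteness, the three estimators above can be sketched on synthetic logged bandit feedback. This is a minimal illustration under assumed names and synthetic data, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 1000, 5

# synthetic logged bandit feedback
pi_b = rng.dirichlet(np.ones(n_actions), size=n)        # behavior policy pi_b(a|x)
pi_e = rng.dirichlet(np.ones(n_actions), size=n)        # evaluation policy pi_e(a|x)
actions = np.array([rng.choice(n_actions, p=p) for p in pi_b])
q_true = rng.uniform(size=(n, n_actions))               # true expected reward q(x, a)
rewards = rng.binomial(1, q_true[np.arange(n), actions])
q_hat = q_true + rng.normal(0, 0.1, size=q_true.shape)  # (possibly misspecified) reward predictor

# Direct Method (DM): plug in the reward predictor
v_dm = np.mean(np.sum(pi_e * q_hat, axis=1))

# Inverse Propensity Scoring (IPS): reweight observed rewards by pi_e / pi_b
w = pi_e[np.arange(n), actions] / pi_b[np.arange(n), actions]  # importance weight
v_ips = np.mean(w * rewards)

# Doubly Robust (DR): DM baseline + IPS correction, reward predictor as control variate
v_dr = np.mean(np.sum(pi_e * q_hat, axis=1)
               + w * (rewards - q_hat[np.arange(n), actions]))
```

DM inherits the reward predictor's bias but has low variance; IPS is unbiased but its variance grows with the importance weights; DR sits in between, which is exactly the tradeoff the table describes.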

  13. Advanced OPE estimators
    To reduce the variance of IPS/DR, many OPE estimators have been proposed,
    mainly via modifications of the importance weight:
    • Self-Normalized (IPS/DR) [Swaminathan&Joachims,15]
    • Clipped (IPS/DR) * [Su+,20a]
    • Switch (DR) * [Wang+,17]
    • Optimistic Shrinkage (DR) * [Su+,20a]
    • Subgaussian (IPS/DR) * [Metelli+,21]
    * requires hyperparameter tuning of 𝜆, e.g., via SLOPE [Su+,20b] [Tucker&Lee,21]

  14. Which OPE estimator should be used to enable an accurate OPE?

  15. Motivation towards data-driven estimator selection
    Estimator Selection is important!
    but.. the best estimator can be different under different situations:
    the same estimator can be among the best for one evaluation policy
    and among the worst for another. Data size, evaluation policy, and reward noise all matter.
    Estimator Selection: how to identify the most accurate OPE estimator
    using only the available logged data?

  22. Estimator Selection for OPE

  23. Objective for estimator selection
    The goal is to identify the most accurate OPE estimator in terms of MSE
    against the true policy value (the estimand).
    Since the true policy value is unknown, the MSE must be estimated from the logged data.
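In symbols (a standard formulation with assumed notation: ℳ is the candidate estimator set, V̂_m estimator m's estimate, V(𝜋_𝑒) the true policy value):

```latex
\hat{m} = \operatorname*{arg\,min}_{m \in \mathcal{M}} \mathrm{MSE}\big(\hat{V}_m\big)
        = \operatorname*{arg\,min}_{m \in \mathcal{M}}
          \mathbb{E}_{\mathcal{D}} \Big[ \big( \hat{V}_m(\pi_e; \mathcal{D}) - V(\pi_e) \big)^2 \Big]
```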

  26. Baseline – non-adaptive heuristic [Saito+,21a] [Saito+,21b]
    Suppose we have logged data from previous A/B tests of two policies.
    Regard one of them as a pseudo-evaluation policy, and compare each estimator's
    OPE estimate against the on-policy policy value of the pseudo-evaluation policy.
    ※ 𝑆 is a set of random states for bootstrapping.
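The heuristic above can be sketched as follows. This is illustrative only: the `estimator` callable and the dict-of-arrays data format are assumptions, not the paper's API.

```python
import numpy as np

def estimate_mse_non_adaptive(estimator, data_A, data_B, n_bootstrap=100, seed=0):
    """Score an OPE estimator by comparing its estimate on data_B (the
    pseudo-behavior data) against the on-policy value computed from data_A
    (the pseudo-evaluation policy's own data), over bootstrap resamples
    (the set S of random states)."""
    rng = np.random.default_rng(seed)
    on_policy_value = data_A["rewards"].mean()  # ground-truth proxy for V(pi_A)
    n = len(data_B["rewards"])
    sq_errors = []
    for _ in range(n_bootstrap):
        idx = rng.integers(n, size=n)           # one bootstrap resample of data_B
        boot = {k: v[idx] for k, v in data_B.items()}
        sq_errors.append((estimator(boot) - on_policy_value) ** 2)
    return float(np.mean(sq_errors))
```

Note that the pseudo-policy pair is fixed by whichever A/B test happened to be run, which is exactly the non-adaptivity the next slides criticize.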

  30. Does the non-adaptive heuristic work?
    Do the estimators selected by this estimation, 𝑉̂(𝜋_𝐴; 𝐷_𝐵), really work well?
    Compared with the true performance, the non-adaptive heuristic does not consider
    the difference among OPE tasks.
    How to choose OPE estimators adaptively to the given OPE task (e.g., evaluation policy)?

  33. PAS-IF
    Policy Adaptive Estimator Selection via Importance Fitting

  34. Is it possible to make pseudo-policies adaptive?
    The non-adaptive heuristic calculates MSE using two datasets collected by A/B tests:
    the total logged data is fixed as a subset from the pseudo-behavior policy (~𝜋_𝐵)
    and a subset from the pseudo-evaluation policy (~𝜋_𝐴).
    We instead aim to split the logged data (~𝜋_𝑏) into pseudo-behavior (~𝜋̃_𝑏)
    and pseudo-evaluation (~𝜋̃_𝑒) subsets, adaptive to the given OPE task.

  37. Subsampling function controls the pseudo-policies
    We now introduce a subsampling function, which splits the total logged data (~𝜋_𝑏)
    into a pseudo-behavior subset (~𝜋̃_𝑏) and a pseudo-evaluation subset (~𝜋̃_𝑒).

  39. How to optimize the subsampling function?
    PAS-IF optimizes the subsampling function 𝜌 to reproduce the bias-variance tradeoff
    of the original OPE, via the objective of importance fitting: making the importance
    weight induced by the pseudo-policies (𝜋̃_𝑒 / 𝜋̃_𝑏) match that of the original
    OPE task (𝜋_𝑒 / 𝜋_𝑏).
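A minimal sketch of the importance-fitting idea, in the simplified setting the appendix slide alludes to: parameterize a per-sample subsampling probability 𝜌 by a logit 𝜃, and fit it so the induced pseudo importance weight matches the target weight. This is an illustration of the fitting principle only; it ignores normalization and the data-size regularization of the actual method, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
w_target = rng.lognormal(0.0, 1.0, size=n)   # target importance weights pi_e / pi_b

# Parameterize rho in (0, 1) per sample via a logit theta.  In the simplified case,
# the pseudo importance weight induced by the split is rho / (1 - rho), and
# importance fitting means matching it to the target weight.  Since
# log(rho / (1 - rho)) = theta, the squared-log loss is quadratic in theta
# and plain gradient descent converges.
theta = np.zeros(n)
lr = 0.1
for _ in range(500):
    grad = 2.0 * (theta - np.log(w_target))  # d/d theta of (theta - log w_target)^2
    theta -= lr * grad

rho = 1.0 / (1.0 + np.exp(-theta))           # fitted subsampling probabilities
w_pseudo = rho / (1.0 - rho)                 # induced pseudo importance weight
```

After fitting, samples with large target weight are routed to the pseudo-evaluation subset with high probability, reproducing the distribution shift of the original OPE task.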

  42. Key contribution of PAS-IF
    PAS-IF enables MSE estimation that is:
    • Data-driven: by splitting the logged data into pseudo datasets
    • Adaptive: by optimizing the subsampling function to simulate
      the distribution shift of the original OPE task
    -> Accurate Estimator Selection!

  43. Synthetic Experiment

  44. Experimental settings
    We compare PAS-IF and the non-adaptive heuristic in two tasks:
    1. Estimator Selection (with hyperparameter tuning via SLOPE [Su+,20b] [Tucker&Lee,21])
    2. Policy Selection using the selected estimator

  46. PAS-IF enables an accurate estimator selection
    PAS-IF enables far more accurate estimator selection by being adaptive,
    and is accurate across various evaluation policies and behavior policies (𝜋_𝑏1, 𝜋_𝑏2).
    (Lower is better; 𝑚̂ is the selected estimator and 𝑚* the true best.)

  47. Experimental settings
    We compare PAS-IF and the non-adaptive heuristic in two tasks:
    1. Estimator Selection
    2. Policy Selection using the selected estimator
    For policy selection, PAS-IF can choose a different estimator for each candidate
    policy's value estimate (𝑉̂_1, 𝑉̂_2, 𝑉̂_3), whereas the non-adaptive heuristic
    uses a single universal estimator for all policies.
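The per-policy selection scheme can be sketched as a short routine. The helper callables `estimate_mse` and `estimate_value` are hypothetical stand-ins (e.g., backed by PAS-IF and the logged data), not the paper's API:

```python
def select_policy(candidate_policies, estimator_names, estimate_mse, estimate_value):
    """For each candidate policy, adaptively pick the estimator with the smallest
    estimated MSE, then return the policy with the highest estimated value."""
    values = {}
    for pi in candidate_policies:
        # adaptive choice: a different estimator may win for each policy
        best = min(estimator_names, key=lambda m: estimate_mse(m, pi))
        values[pi] = estimate_value(best, pi)
    return max(values, key=values.get)
```

The contrast with the non-adaptive baseline is the inner `min`: the baseline would hoist it out of the loop and use one universal estimator for every policy.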

  49. Moreover, PAS-IF also benefits policy selection
    PAS-IF also shows a favorable result in the policy selection task:
    it can identify better policies among many candidates
    by using a different (appropriate) estimator for each policy!
    (Lower is better; 𝜋̂ is the selected policy and 𝜋* the true best.)

  50. Summary
    • Estimator Selection is important to enable an accurate OPE.
    • The non-adaptive heuristic fails to adapt to the given OPE task.
    • PAS-IF enables adaptive and accurate estimator selection by subsampling
      and optimizing the pseudo OPE datasets.
    PAS-IF will help identify an accurate OPE estimator in practice!

  51. Thank you for listening!
    Feel free to ask any questions, and discussions are welcome!

  52. Example case of importance fitting
    When we have
    ⇒ PAS-IF can produce a similar distribution shift!
    Note: the simplified case of .


  53. Detailed optimization procedure of PAS-IF
    We optimize the subsampling rule 𝜌_𝜃 via gradient descent.
    To maintain a data size similar to the original OPE task, PAS-IF also imposes
    a regularization on the data size, and we tune its strength 𝜆 so that the
    pseudo datasets keep the intended sizes.

  54. Key idea of PAS-IF
    How about sampling the logged data and constructing a pseudo-evaluation policy
    that has a bias-variance tradeoff similar to the given OPE task?
    (𝑆 is a set of random states for bootstrapping)


  55. References

  56. References (1/4)
    [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009.
    [Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834

  57. References (2/4)
    [Swaminathan&Joachims,15] Adith Swaminathan and Thorsten Joachims. “The Self-Normalized Estimator for Counterfactual Learning.” NeurIPS, 2015. https://dl.acm.org/doi/10.5555/2969442.2969600
    [Wang+,17] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. “Optimal and Adaptive Off-policy Evaluation in Contextual Bandits.” ICML, 2017. https://arxiv.org/abs/1612.01205
    [Metelli+,21] Alberto M. Metelli, Alessio Russo, and Marcello Restelli. “Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning.” NeurIPS, 2021. https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html

  58. References (3/4)
    [Su+,20a] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. “Doubly Robust Off-policy Evaluation with Shrinkage.” ICML, 2020. https://arxiv.org/abs/1907.09623
    [Su+,20b] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020.
    [Tucker&Lee,21] George Tucker and Jonathan Lee. “Improved Estimator Selection for Off-Policy Evaluation.” ICML Workshop on Reinforcement Learning Theory, 2021. https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf
    [Narita+,21] Yusuke Narita, Shota Yasui, and Kohei Yata. “Debiased Off-Policy Evaluation for Recommendation Systems.” RecSys, 2021. https://arxiv.org/abs/2002.08536

  59. References (4/4)
    [Saito+,21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
    [Saito+,21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703