
[WSDM'22] Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Best Paper Runner-Up Award @ WSDM2022
Proceedings: https://dl.acm.org/doi/10.1145/3488560.3498380
arXiv: https://arxiv.org/abs/2202.01562

Haruka Kiyohara

February 23, 2022

Transcript

  1. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade

    Behavior Model Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto Haruka Kiyohara, Tokyo Institute of Technology https://sites.google.com/view/harukakiyohara February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 1
  2. Real world ranking decision making Examples of recommending a ranking

    of items February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 3 Applications include • Search Engine • Music Streaming • E-commerce • News • and more! Can we evaluate the value of these ranking decisions?
  3. Content • Overview of Off-Policy Evaluation (OPE) of Ranking Policies

    • Existing Estimators and Challenges • Seminal Work: Doubly Robust Estimator • Proposal: Cascade Doubly Robust (Cascade-DR) • The Benefit of Cascade-DR February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 4
  4. Ranking decision making February 2022 Cascade Doubly Robust Off-Policy Evaluation

    @ WSDM2022 5 song 1 song 2 song 3 song 4 click no click click no click a coming user ranking position a ranked list of items rewards
  5. The policy also produces logged data February 2022 Cascade Doubly

    Robust Off-Policy Evaluation @ WSDM2022 6 user feedback (reward vector) a coming user (context) a ranked list of items (action vector) behavior policy 𝝅𝒃 logged bandit feedback
  6. Off-Policy Evaluation (OPE) The goal is to evaluate the performance

    of an evaluation policy 𝜋𝑒 . February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 7 where logged bandit feedback collected by 𝝅𝒃 expected reward obtained by deploying 𝝅𝒆 in the real system (e.g., sum of clicks)
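(Reference sketch, not on the slide: with x a context, a a ranked action vector, r the reward vector, and α_l an optional position weight, the target policy value can be written as
V(π_e) = E_{x ~ p(x), a ~ π_e(·|x), r ~ p(r|x,a)} [ Σ_{l=1}^{L} α_l r_l ],
and an OPE estimator V-hat(π_e; D) approximates this quantity using only the logged bandit feedback D = {(x_i, a_i, r_i)}_{i=1}^{n} collected by π_b.)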
  7. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 8
  8. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 9 Bias is caused by the distribution shift.
  9. Distribution Shift Behavior and evaluation policies (𝜋𝑒 and 𝜋𝑏 )

    follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 10 behavior evaluation
  10. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 11 Variance increases with the size of the combinatorial action space and decreases with the data size.
  11. How large is the action space in slate? Non-factorizable case

    – policy chooses actions without duplication. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 12 Song 1 Song 2 Song 3 Song 4 When there are 10 unique actions (|𝑨| = 10), the policy has 𝑷(𝑳, |𝑨|) permutation choices: 10 x 9 x 8 x 7. When 𝐿 = 10, the number of possible rankings is 3,628,800 (= 10!).
  12. How large is the action space in slate? Factorizable case

    – policy chooses actions independently. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 13 Song 1 Song 2 Song 3 Song 4 When there are 10 unique actions (|𝑨| = 10), the policy has |𝑨|^𝑳 (exponentiation) choices: 10 x 10 x 10 x 10. When 𝐿 = 10, the number of possible rankings is 10,000,000,000 (= 10^10).
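(Quick arithmetic check of the two counts above, using standard combinatorics rather than anything on the slides: without duplication the policy can choose |A| × (|A|−1) × ⋯ × (|A|−L+1) ranked lists, whereas a factorizable policy can choose |A|^L. With |A| = L = 10 these give 10! = 3,628,800 and 10^10 = 10,000,000,000, matching the numbers on the two slides.)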
  13. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 14 Bias is caused by the distribution shift. Variance increases with the size of the combinatorial action space and decreases with the data size.
  14. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 16 𝑛: data size importance weight w.r.t combinatorial actions song 1 song 2 song 3 song 4
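(Reference sketch of the IPS estimator in its standard form, where w(x, a) = π_e(a|x) / π_b(a|x) is the importance weight over the whole ranking and α_l a position weight:
V-hat_IPS(π_e; D) = (1/n) Σ_{i=1}^{n} w(x_i, a_i) Σ_{l=1}^{L} α_l r_{i,l}.)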
  15. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 17 behavior evaluation
  16. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 18 evaluation behavior
  17. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . • pros: unbiased under all possible user behavior (i.e., click model) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 19 song 1 song 2 song 3 song 4 𝑛: data size importance weight w.r.t combinatorial actions
  18. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . • pros: unbiased under all possible user behavior (i.e., click model) • cons: suffers from a very high variance due to combinatorial actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 20 song 1 song 2 song 3 song 4 combinations importance weight w.r.t combinatorial actions 𝑛: data size
  19. Huge importance weight of IPS Behavior and evaluation policies (𝜋𝑒

    and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 21 An excessively large importance weight makes IPS overly sensitive to data points observed with small probability. behavior evaluation
  20. products of the slot-level importance weights (factorizable case) Source of

    high variance in IPS IPS assumes that the reward at slot 𝑙 depends on all the actions in the ranking. • pros: unbiased under all possible user behavior (i.e., click model) • cons: suffers from a very high variance due to combinatorial actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 22
  21. IIPS assumes that users interact with items independently of the

    other positions. • pros: substantially reduces the variance of IPS Independent IPS (IIPS) [Li+, 18] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 23 independence assumption product disappears
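(Reference sketch of IIPS under the independence assumption, where π(a_l | x) denotes the marginal probability that slot l receives action a_l; the weight no longer involves the other slots:
V-hat_IIPS(π_e; D) = (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,l} | x_i) / π_b(a_{i,l} | x_i)) α_l r_{i,l}.)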
  22. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 24 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒
  23. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 25 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IPS
  24. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 26 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  25. IIPS assumes that users interact with items independently of the

    other positions. • pros: substantially reduces the variance of IPS • cons: may suffer from a large bias due to the strong independence assumption on user behavior Independent IPS (IIPS) [Li+, 18] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 27 independence assumption product disappears
  26. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 28 cascade assumption considers only higher positions
  27. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 29 cascade assumption considers only higher positions
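(Reference sketch of RIPS under the cascade assumption; the weight for slot l covers only the actions at positions 1 through l:
V-hat_RIPS(π_e; D) = (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,1:l} | x_i) / π_b(a_{i,1:l} | x_i)) α_l r_{i,l}.)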
  28. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 30 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  29. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 31 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS RIPS
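(Reading off the factorizable example above, where every slot-level importance weight equals 10.0: for the reward at slot l = 2, IPS multiplies the weights of all four slots, 10 × 10 × 10 × 10 = 10,000; RIPS multiplies only slots 1 and 2, 10 × 10 = 100; IIPS uses slot 2 alone, 10.)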
  30. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟒, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 32 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  31. Let’s compare the importance weight February 2022 Cascade Doubly Robust

    Off-Policy Evaluation @ WSDM2022 33 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS RIPS When we evaluate slot-level reward at slot 𝒍 = 𝟒, (factorizable case)
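(For the last slot l = 4 in the same example, IIPS still uses a weight of 10, while RIPS multiplies all four slot-level weights, 10 × 10 × 10 × 10 = 10,000, and thus coincides with IPS at the bottom position; this is why RIPS still suffers from high variance when L is large, as the next slide notes.)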
  32. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS • cons: still suffers from a high variance when 𝐿 is large February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 34 cascade assumption considers only higher positions
  33. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 35 [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard; the best existing estimator changes across configurations (RIPS or IPS / IIPS or RIPS / IIPS).]
  34. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 36 [Same figure as the previous slide: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.] Our goal: we want an OPE estimator that works well in various situations.
  35. Our goal: Can we dominate all existing estimators? February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 37 Bias Variance achievable bias-variance tradeoff by modifying click models IIPS RIPS IPS independent cascade standard click model
  36. Our goal: Can we dominate all existing estimators? February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 38 Bias Variance IIPS RIPS IPS independent cascade standard click model achievable bias-variance tradeoff by modifying click models Can we further reduce the variance of RIPS, while remaining unbiased under the Cascade assumption?
  37. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 40 (hereinafter)
  38. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS. + unbiased and small variance February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 41 baseline estimation importance weighting on residual (hereinafter) control variate
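(Reference sketch of DR in the single-action setting, following [Dudík+, 14], with w(x, a) = π_e(a|x) / π_b(a|x) and q-hat(x, a) the estimated expected reward:
V-hat_DR(π_e; D) = (1/n) Σ_{i=1}^{n} [ E_{a ~ π_e(·|x_i)}[ q-hat(x_i, a) ] + w(x_i, a_i) ( r_i − q-hat(x_i, a_i) ) ],
where the first term is the baseline estimation (control variate) and the importance weight is applied only to the residual.)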
  39. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 42 importance weight leads to variance
  40. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 43 scale down the weighted value importance weight leads to variance
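(Plugging the slide's numbers in as a rough check: the IPS contribution is w(x, a) · r = 10 × 1 = 10, whereas the DR contribution is q-hat(x, π_e) + w(x, a)(r − q-hat(x, a)) = 0.9 + 10 × (1 − 0.9) = 1.9, since q-hat(x, a) = 0.9 for every a. The importance weight now multiplies only the small residual, so the weighted value is scaled down as the slide says.)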
  41. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 0.8, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 44 scale down the weighted value importance weight leads to variance We want to define DR in ranking OPE, but how can we do it under the complex Cascade assumption? This term is computationally intractable.
  42. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 46
  43. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 47 policy value after position 𝒍 policy value after position 𝒍
  44. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 48 policy value after position 𝒍 policy value after position 𝒍 Now, the importance weight depends only on 𝒂𝒍 .
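(Reference sketch of the recursion, in my notation, with the slot-level weight w_l = π_e(a_l | x, a_{1:l−1}) / π_b(a_l | x, a_{1:l−1}):
V-hat_{L+1} = 0,   V-hat_l = w_l · ( α_l r_l + V-hat_{l+1} )   for l = L, …, 1,
and RIPS averages V-hat_1 over the logged data. Unrolling the recursion recovers the cascade importance weights of the original form, but each step now involves only the weight for a_l.)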
  45. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 49
  46. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 50 baseline estimation importance weighting only on residual control variate policy value after position 𝒍
  47. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 51 existing works: click model bias-variance tradeoff policy value after position 𝒍
  48. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 52 our idea: control variate reduce variance more! existing works: click model bias-variance tradeoff policy value after position 𝒍
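(Reference sketch of the recursion with the control variate, reconstructed by analogy with DR for MDPs described in the appendix slides; Q-hat(x, a_{1:l}) is the estimated value after position l and Q-hat(x, a_{1:l−1}, π_e) its expectation over a_l ~ π_e:
V-hat_{L+1} = 0,   V-hat_l = Q-hat(x, a_{1:l−1}, π_e) + w_l · ( α_l r_l + V-hat_{l+1} − Q-hat(x, a_{1:l}) ),
so the importance weight is applied only to the residual after subtracting the baseline estimation.)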
  49. • pros: reduces the variance of RIPS • pros: still

    unbiased under Cascade Statistical advantages of Cascade-DR February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 54 (under a reasonable assumption on Q-hat) Better bias-variance tradeoff than IPS, IIPS, and RIPS!
  50. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 55 [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.] The best estimator can change with the data size and the true click model (estimator selection is hard).
  51. Cascade-DR clearly dominates IPS, IIPS, RIPS February 2022 Cascade Doubly

    Robust Off-Policy Evaluation @ WSDM2022 56 Cascade-DR clearly dominates all existing estimators across various configurations! (no need for difficult estimator selection anymore) [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.]
  52. Cascade-DR performs well on a real platform February 2022 Cascade

    Doubly Robust Off-Policy Evaluation @ WSDM2022 57 the lower, the better Cascade-DR is the most accurate and stable even under realistic user behavior.
  53. Summary • OPE of ranking policies has a variety of

    applications (e.g., search engines). • Existing estimators suffer from either a large bias or a large variance, and the best estimator can change depending on the true click model and the data size. • Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the Cascade assumption. Cascade-DR enables an accurate OPE of real-world ranking decisions! February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 58
  54. Cascade-DR is available in OpenBanditPipeline! Implemented as `obp.ope.SlateCascadeDoublyRobust`. February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 59 https://github.com/st-tech/zr-obp Our experimental code also uses obp. Only four lines of code are needed to implement OPE.
    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)
    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)
  55. Thank you for listening! Find out more (e.g., theoretical analysis

    and experiments) in the full paper! contact: [email protected] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 60
  56. Cascade Doubly Robust (Cascade-DR) By solving the recursive form, we

    obtain Cascade-DR. If the estimation error of Q-hat is within ±100%, then Cascade-DR can reduce the variance of RIPS. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 61 recursively estimate baseline
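(To make the recursion concrete, here is a minimal NumPy sketch of the Cascade-DR computation under the reconstruction above; the function name and inputs are hypothetical and this is not obp's implementation — use obp.ope.SlateCascadeDoublyRobust in practice.)
import numpy as np

def cascade_dr_estimate(slot_weights, rewards, q_hat, q_hat_pi_e, alpha=None):
    # slot_weights: (n, L) per-slot importance weights pi_e(a_l | x, a_{1:l-1}) / pi_b(a_l | x, a_{1:l-1})
    # rewards:      (n, L) observed slot-level rewards
    # q_hat:        (n, L) Q-hat(x, a_{1:l}) evaluated at the logged actions
    # q_hat_pi_e:   (n, L) expectation of Q-hat over a_l ~ pi_e(. | x, a_{1:l-1})
    # alpha:        (L,) optional position weights (defaults to all ones)
    n, L = rewards.shape
    if alpha is None:
        alpha = np.ones(L)
    v = np.zeros(n)  # V-hat_{L+1} = 0
    for l in reversed(range(L)):  # solve the recursion from the bottom slot up
        v = q_hat_pi_e[:, l] + slot_weights[:, l] * (alpha[l] * rewards[:, l] + v - q_hat[:, l])
    return v.mean()  # average V-hat_1 over the logged data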
  57. How accurate are the OPE estimators? The smaller the squared error

    (SE), the more accurate the estimator V-hat is. We use OpenBanditPipeline [Saito+, 21a]. • reward structure (user behavior assumption) • slate size 𝐿 and data size 𝑛 • policy similarity 𝜆 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 63 experimental configurations
  58. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 64 :sigmoid determined by the corresponding action interaction from the other slots (linear)
  59. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 65 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 independence
  60. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 66 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 cascade from the higher slots
  61. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 67 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 standard all the other slots
  62. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 68 :sigmoid determined by the corresponding action interaction from the other slots standard all the other slots cascade from the previous slots independence no interaction (linear)
  63. Experimental setup: interaction function Two ways to define the interaction function. •

    additive effect from co-occurrence • decay effect from neighboring actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 69 symmetric matrix decay function Song 1 Song 2 Song 3 Song 4
  64. Experimental setup: policies Behavior and evaluation policies are factorizable. •

    behavior policy • evaluation policy February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 70 𝝀 → 𝟏 makes the evaluation policy similar to 𝝅𝒃 and 𝝀 → −𝟏 makes it dissimilar; 𝝀 → 𝟏 accounts for 𝝅𝒃 more, while |𝝀| → 𝟎 approaches uniform
  65. Study Design How the estimators’ performance and their superiority change

    depending on • reward structure • data size / slate size / policy similarity February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 71
  66. Experimental procedure [Saito+, 21b] • We first randomly sample configurations

    (10000 times) and calculate SE. • Then, we aggregate the results and calculate the mean value of SE (MSE). February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 72 1. Define the configuration space. 2. For each random seed.. 2-1. sample a configuration based on the seed 2-2. calculate SE on the sampled evaluation policy and dataset (configuration). -> we can evaluate V-hat in various situations.
  67. Varying data size Cascade-DR stably performs well on various configurations!

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 73 [Figure: relative MSE vs. data size 𝑛, panels for the standard / cascade / independence true click models (𝐿 = 5), with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR)
  68. Varying data size Cascade-DR reduces the variance of IPS and

    RIPS a lot. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 74 [Figure: relative MSE vs. data size 𝑛 under the standard true click model, with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> IPS; Large data size -> IPS, Cascade-DR, ..; Small data size -> Cascade-DR, RIPS, ..
  69. Varying data size Cascade-DR is the best, being unbiased while

    reducing the variance. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 75 [Figure: relative MSE vs. data size 𝑛 under the cascade true click model, with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> IPS, RIPS, Cascade-DR; Large data size -> Cascade-DR, RIPS, ..; Small data size -> Cascade-DR, IIPS, ..
  70. Varying data size Cascade-DR is the best among the estimators

    using reasonable assumptions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 76 [Figure: relative MSE vs. data size 𝑛 under the independence true click model, with variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> all estimators; Large data size -> IIPS, Cascade-DR, ..; Small data size -> IIPS, Cascade-DR, ..
  71. Varying data size Cascade-DR stably performs well on various configurations!

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 77 [Figure: relative MSE vs. data size 𝑛, panels for the standard / cascade / independence true click models (𝐿 = 5), with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR)
  72. Varying slate size • Cascade-DR stably outperforms RIPS on various

    slate sizes. • When the baseline estimation is successful, Cascade-DR becomes more powerful. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 78 [Figure: relative MSE vs. slate size 𝐿 (𝑛 = 1000), panels for the standard / cascade / independence true click models, with annotations for difficult vs. easy baseline estimation.]
  73. Varying policy similarity • When the behavior and evaluation policies

    are dissimilar, Cascade-DR is more promising. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 79 [Figure: relative MSE vs. policy similarity 𝜆 (𝑛 = 1000), panels for the standard / cascade / independence true click models, with bias and variance annotations.]
  74. How to calculate ground-truth policy value? In synthetic experiments, we

    take the expectation of the following weighted slate-level expected reward over the contexts. 1. Enumerate all combinations of actions (|𝐴|^𝐿). 2. For each action vector, calculate the evaluation policy pscore 𝜋𝑒(𝑎|𝑥) and its slate-level expected reward. 3. Calculate the weighted sum of the slate-level expected rewards using the pscore. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 80
  75. Experimental procedure of the real world experiment 1. We first

    run two different policies 𝜋𝐴 and 𝜋𝐵 to construct datasets 𝐷𝐴 and 𝐷𝐵 . Here, we use 𝐷𝐴 to estimate 𝑉(𝜋𝐵 ) by OPE. 2. Then, approximate the ground-truth policy value by on-policy estimation as follows. 3. To see a distribution of errors, replicate the dataset using bootstrap sampling. 4. Finally, calculate the squared errors on the bootstrapped datasets 𝐷𝐴 ′. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 81
  76. Inspiration from Reinforcement Learning (RL) DR leverages the recursive structure

    of Markov Decision Process (MDP). February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 83 baseline estimation policy value after visiting 𝒙𝒍 recursive derivation importance weighting on residual [Jiang&Li, 16] [Thomas&Brunskill, 16]
  77. Causal similarity between MDP and Cascade asm. Cascade assumption can

    be interpreted as a special case of MDP. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 84
  78. PseudoInverse (PI) [Swaminathan+, 17] • Designed for the situation where

    the slot-level rewards are unobservable. • Implicitly relies on the independence assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 85
  79. References (1/2) [Precup+, 00] Doina Precup, Richard S. Sutton, and

    Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs [Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120 [Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. “Offline Evaluation of Ranking Policies with Click Models.” KDD, 2018. https://arxiv.org/abs/1804.10488 [McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. “Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986 [Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834 [Jiang&Li, 16] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 87
  80. References (2/2) [Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill.

    “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923 [Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2008.07146 [Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703 [Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. “Off-policy evaluation for slate recommendation.” NeurIPS, 2017. https://arxiv.org/abs/1605.04812 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 88