

Haruka Kiyohara
February 23, 2022

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Best Paper Runner-Up Award @ WSDM2022
Proceedings: https://dl.acm.org/doi/10.1145/3488560.3498380
arXiv: https://arxiv.org/abs/2202.01562

RL4RealLife Workshop @ ICML2021
About the workshop: https://sites.google.com/view/RL4RealLife2021

CFML Study Group (CFML勉強会) #6
https://cfml.connpass.com/event/249531/

IR reading, Tokyo, 2022 Spring
https://sigir.jp/post/2022-05-21-irreading_2022spring/

Explanation article (Yahoo! JAPAN Tech Blog)
https://techblog.yahoo.co.jp/entry/2021121130233784/

Transcript

  1. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Presented by Haruka Kiyohara, Tokyo Institute of Technology. https://sites.google.com/view/harukakiyohara

  2. Real-world ranking decision making. Recommending a ranking of items appears in many applications: search engines, music streaming, e-commerce, news, and more. Can we evaluate the value of these ranking decisions?

  3. Contents: overview of Off-Policy Evaluation (OPE) of ranking policies; existing estimators and their challenges; seminal work: the Doubly Robust estimator; proposal: Cascade Doubly Robust (Cascade-DR); the benefits of Cascade-DR.

  4. Ranking decision making. (Figure: a coming user is shown a ranked list of items, e.g., songs 1-4 at ranking positions 1-4, and each position yields a reward such as click / no click.)

  5. The policy also produces logged data. The behavior policy π_b observes a coming user (context), presents a ranked list of items (action vector), and receives user feedback (reward vector); together these form the logged bandit feedback.

  6. Off-Policy Evaluation (OPE). The goal is to estimate the performance of an evaluation policy π_e, i.e., the expected reward (e.g., the sum of clicks) obtained by deploying π_e in the real system, using only the logged bandit feedback collected by π_b.
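The target quantity can be written as follows (a minimal LaTeX sketch assuming the slot-level reward notation r_l from the slides; the paper also allows position weights, omitted here):

    V(\pi_e)
      = \mathbb{E}_{p(x)\,\pi_e(\boldsymbol{a} \mid x)\,p(\boldsymbol{r} \mid x, \boldsymbol{a})}
        \left[ \sum_{l=1}^{L} r_l \right],
    \qquad
    \mathcal{D} = \{(x_i, \boldsymbol{a}_i, \boldsymbol{r}_i)\}_{i=1}^{n} \;\text{collected by}\; \pi_b .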
  7. How can we derive an accurate OPE estimate? We need to keep both bias and variance moderately low.

  8. How can we derive an accurate OPE estimate? Bias is caused by the distribution shift.

  9. Distribution shift. The behavior and evaluation policies (π_b and π_e) follow different probability distributions over rankings.

  10. How can we derive an accurate OPE estimate? Variance increases with the size of the combinatorial action space and decreases with the data size.

  11. How large is the action space of a slate? Non-factorizable case: the policy chooses actions without duplication. With 10 unique actions (|A| = 10), a four-slot ranking already has 10 × 9 × 8 × 7 choices, and with L = 10 the number of permutations P(L, |A|) reaches 3,628,800.

  12. How large is the action space of a slate? Factorizable case: the policy chooses the action at each slot independently. With 10 unique actions there are 10 × 10 × 10 × 10 choices for four slots, and with L = 10 the number of combinations |A|^L reaches 10,000,000,000.
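A quick back-of-the-envelope check of these two counts (a minimal Python sketch; the numbers are the ones on the slides):

    from math import perm

    n_actions = 10  # |A|: number of unique actions
    L = 10          # slate size

    # non-factorizable: ordered selection without duplication -> P(|A|, L) permutations
    print(perm(n_actions, L))  # 3,628,800

    # factorizable: each slot is chosen independently -> |A| ** L combinations
    print(n_actions ** L)      # 10,000,000,000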
  13. How can we derive an accurate OPE estimate? We need to keep both bias and variance moderately low: bias is caused by the distribution shift, while variance increases with the size of the combinatorial action space and decreases with the data size.

  14. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]. IPS corrects the distribution shift between π_e and π_b by re-weighting each of the n logged rankings with an importance weight defined over the whole combinatorial action (the entire ranked list).

  15. Dealing with the distribution shift by IPS. The behavior and evaluation policies (π_b and π_e) follow different probability distributions; importance weighting re-weights the logged data accordingly.

  16. Dealing with the distribution shift by IPS. After re-weighting, the logged data are distributed as if they had been collected by the evaluation policy.

  17. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]. Pros: unbiased under any possible user behavior (i.e., any click model).

  18. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]. Pros: unbiased under any possible user behavior (i.e., any click model). Cons: suffers from a very high variance because the importance weight is defined over the combinatorial action space.
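Concretely, the IPS estimator weights the whole observed reward vector by the ranking-level importance weight (a sketch consistent with the slides; position weights are omitted):

    \hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D})
      = \frac{1}{n} \sum_{i=1}^{n}
        \frac{\pi_e(\boldsymbol{a}_i \mid x_i)}{\pi_b(\boldsymbol{a}_i \mid x_i)}
        \sum_{l=1}^{L} r_{i,l}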
  19. Huge importance weights of IPS. Because the two policies follow different distributions, the ranking-level importance weight can become very large, which makes IPS overly sensitive to data points observed with a small probability.

  20. Source of the high variance in IPS. IPS regards the reward at slot l as potentially depending on all the actions in the ranking, so its weight is the product of the slot-level importance weights (in the factorizable case). This is why IPS is unbiased under any click model, and also why its variance is so high.

  21. Independent IPS (IIPS) [Li+, 18]. IIPS assumes that users interact with the item at each position independently of the other positions, so the product over slots disappears. Pros: substantially reduces the variance of IPS.

  22. Let's compare the importance weights when we evaluate the slot-level reward at slot l = 2 (factorizable case). Suppose every slot-level importance weight in a four-song ranking equals 10.0.

  23. For the reward at slot l = 2, IPS multiplies the slot-level weights of all four positions: 10.0^4 = 10,000.

  24. IIPS, in contrast, uses only the weight of slot 2 itself: 10.0.

  25. Independent IPS (IIPS) [Li+, 18]. Pros: substantially reduces the variance of IPS. Cons: may suffer from a large bias due to the strong independence assumption on user behavior.
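The IIPS estimator therefore re-weights each slot-level reward only by its own slot-level weight (a sketch consistent with the slides, written for the factorizable case):

    \hat{V}_{\mathrm{IIPS}}(\pi_e; \mathcal{D})
      = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \frac{\pi_e(a_{i,l} \mid x_i)}{\pi_b(a_{i,l} \mid x_i)} \, r_{i,l}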
  26. Reward interaction IPS (RIPS) [McInerney+, 20]. RIPS assumes that users interact with items sequentially from top to bottom (i.e., the cascade assumption), so the weight for slot l considers only the positions at or above l.

  27. Reward interaction IPS (RIPS) [McInerney+, 20]. Pros: reduces the bias of IIPS and the variance of IPS.

  28. Let's compare the importance weights again at slot l = 2 (factorizable case, every slot-level weight equal to 10.0): IPS uses 10.0^4 = 10,000 while IIPS uses 10.0.

  29. RIPS uses the weights of slots 1 and 2 only: 10.0^2 = 100, in between IIPS and IPS.

  30. When we instead evaluate the slot-level reward at slot l = 4 (factorizable case), IPS still uses 10.0^4 = 10,000 and IIPS still uses 10.0.

  31. At slot l = 4, RIPS uses the weights of slots 1 through 4: 10.0^4 = 10,000, the same as IPS at the bottom position.

  32. Reward interaction IPS (RIPS) [McInerney+, 20]. Pros: reduces the bias of IIPS and the variance of IPS. Cons: still suffers from a high variance when L is large, because the cumulative weight grows toward the bottom of the ranking.
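The RIPS estimator uses the cumulative importance weight of the top-l prefix for the reward at slot l (a sketch consistent with the slides):

    \hat{V}_{\mathrm{RIPS}}(\pi_e; \mathcal{D})
      = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \frac{\pi_e(\boldsymbol{a}_{i,1:l} \mid x_i)}{\pi_b(\boldsymbol{a}_{i,1:l} \mid x_i)} \, r_{i,l}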
  33. A difficult tradeoff remains for the existing estimators. (Figure: MSE, lower is better, versus data size n for L = 5 under three true click models. The best estimator differs across panels: roughly IPS or RIPS under the standard model, RIPS or IIPS under the cascade model, and IIPS under the independence model.)

  34. Our goal: we want an OPE estimator that works well across these various situations (true click models and data sizes).

  35. Our goal: can we dominate all existing estimators? (Figure: the bias-variance tradeoff achievable by modifying the assumed click model, with IIPS (independent), RIPS (cascade), and IPS (standard) lying along the curve.)

  36. Can we further reduce the variance of RIPS while remaining unbiased under the cascade assumption?
  37. From IPS to Doubly Robust (DR) [Dudík+, 14]. In the single-action setting (L = 1, hereinafter), we often use DR to reduce the variance of IPS.

  38. From IPS to Doubly Robust (DR) [Dudík+, 14]. DR combines a baseline estimation (a control variate) with importance weighting applied only to the residual, which keeps the estimator unbiased while shrinking its variance.
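For reference, the standard single-action DR estimator has the following form (a sketch of the well-known estimator; q-hat denotes the estimated reward function):

    \hat{V}_{\mathrm{DR}}(\pi_e; \mathcal{D})
      = \frac{1}{n} \sum_{i=1}^{n}
        \Big(
          \mathbb{E}_{\pi_e(a \mid x_i)}\!\left[ \hat{q}(x_i, a) \right]
          + w(x_i, a_i) \big( r_i - \hat{q}(x_i, a_i) \big)
        \Big),
    \qquad
    w(x, a) := \frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}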
  39. Variance reduction of DR. Suppose w(x, a) = 10, q(x, a) = 1, and r = 1 for all a, with the estimate q̂(x, a) = 0.9. In IPS, the importance weight multiplies the full reward, and this is what drives the variance.

  40. Variance reduction of DR. With the same numbers, DR scales down the weighted value: the importance weight is applied only to the residual r − q̂(x, a).

  41. Variance reduction of DR. Even with q(x, a) = 0.8 and q̂(x, a) = 0.9, the weighted residual stays small. We want to define DR in ranking OPE, but how can we do it under the complex cascade assumption? The baseline term, an expectation over the combinatorial action space under π_e, is computationally intractable.
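A quick numeric illustration of this "scale down the weighted value" effect, using the numbers on the slides (a minimal Python sketch; since q̂ = 0.9 for every action, the baseline expectation is also 0.9):

    w, r, q_hat = 10.0, 1.0, 0.9

    ips_term = w * r                    # 10.0: the weight multiplies the full reward
    dr_term = q_hat + w * (r - q_hat)   # 0.9 + 10.0 * 0.1 = 1.9: the weight hits only the residual
    print(ips_term, dr_term)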
  42. Recursive form of RIPS. We transform RIPS into a recursive form.

  43. Recursive form of RIPS. The recursion is written in terms of the policy value after position l.

  44. Recursive form of RIPS. In the recursive form, the importance weight at each step depends only on a_l (given the actions placed above it), rather than on the whole ranking.

  45. Introducing a control variate (Q-hat). Now we can define DR under the cascade assumption.

  46. Introducing a control variate (Q-hat). At each position, a baseline estimation of the policy value after position l serves as a control variate, and importance weighting is applied only to the residual.

  47. Existing works control the bias-variance tradeoff only through the assumed click model.

  48. Our idea adds a control variate on top of the cascade click model to reduce the variance even further.
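Putting the recursion and the control variate together, the resulting estimator looks roughly as follows (a sketch reconstructed from the slides; see the paper for the exact definition, including position weights):

    \hat{V}_{\mathrm{Cascade\text{-}DR}}(\pi_e; \mathcal{D})
      = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \Big(
          w_{1:l}(x_i, \boldsymbol{a}_{i,1:l}) \big( r_{i,l} - \hat{Q}_l(x_i, \boldsymbol{a}_{i,1:l}) \big)
          + w_{1:l-1}(x_i, \boldsymbol{a}_{i,1:l-1})
            \, \mathbb{E}_{\pi_e(a_l \mid x_i, \boldsymbol{a}_{i,1:l-1})}\!\big[ \hat{Q}_l(x_i, \boldsymbol{a}_{i,1:l-1}, a_l) \big]
        \Big),
    \qquad
    w_{1:l}(x, \boldsymbol{a}_{1:l}) := \frac{\pi_e(\boldsymbol{a}_{1:l} \mid x)}{\pi_b(\boldsymbol{a}_{1:l} \mid x)}

Here Q-hat estimates the policy value after position l, and the expectation in the second term is over a single slot only, which is what keeps the baseline term tractable under the cascade assumption.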
  49. Statistical advantages of Cascade-DR. Pros: reduces the variance of RIPS (under a reasonable assumption on Q-hat) and is still unbiased under the cascade assumption. This yields a better bias-variance tradeoff than IPS, IIPS, and RIPS.

  50. A difficult tradeoff remains for the existing estimators: the best estimator can change with the data size and the true click model, so estimator selection is hard. (Figure: MSE, lower is better, versus data size n for L = 5 under the standard, cascade, and independence click models.)

  51. Cascade-DR clearly dominates IPS, IIPS, and RIPS on various configurations, so the difficult estimator selection is no longer needed. (Figure: the same MSE comparison with Cascade-DR added.)

  52. Cascade-DR performs well on a real platform: it is the most accurate and stable estimator even under realistic user behavior (the lower the error, the better).

  53. Summary. OPE of ranking policies has a variety of applications (e.g., search engines). Existing estimators suffer from either a large bias or a large variance, and the best estimator can change depending on the true click model and the data size. Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the cascade assumption, enabling accurate OPE of real-world ranking decisions.

  54. Cascade-DR is available in Open Bandit Pipeline (https://github.com/st-tech/zr-obp), implemented as `obp.ope.SlateCascadeDoublyRobust`. Our experimental code also uses obp. Only four lines of code are needed to run the OPE:

    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)
    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)
  55. Thank you for listening! Find out more (e.g., the theoretical analysis and experiments) in the full paper. Contact: [email protected]

  56. Cascade Doubly Robust (Cascade-DR). By solving the recursive form, we obtain Cascade-DR, which recursively estimates the baseline. If the estimation error of Q-hat is within ±100%, Cascade-DR reduces the variance of RIPS.

  57. How accurate are the OPE estimators? The smaller the squared error (SE), the more accurate the estimator V̂. We use Open Bandit Pipeline [Saito+, 21a] and vary the experimental configurations: the reward structure (user behavior assumption), the slate size L and data size n, and the policy similarity λ.
  58. Experimental setup: reward function. We define the slot-level mean reward function as a sigmoid applied to a base term determined by the corresponding action plus a (linear) interaction term from the other slots.

  59. Under the independence structure, there is no interaction term: each slot's reward is determined only by its own action.

  60. Under the cascade structure, the interaction term comes only from the higher (previous) slots.

  61. Under the standard structure, the interaction term comes from all the other slots.

  62. In short, the three reward structures differ only in which slots contribute to the interaction: none (independence), the previous slots (cascade), or all the other slots (standard).
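A minimal sketch of this reward structure (the function and argument names are illustrative and not obp's actual synthetic-data API):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def slot_level_mean_reward(base, interaction, l, behavior="cascade"):
        """base[l]: effect of the action placed at slot l.
        interaction[l, k]: linear effect of the action at slot k on the reward at slot l."""
        L = len(base)
        if behavior == "independence":
            other = []                             # no interaction from the other slots
        elif behavior == "cascade":
            other = list(range(l))                 # only the higher (previous) slots
        else:                                      # "standard"
            other = [k for k in range(L) if k != l]
        return sigmoid(base[l] + sum(interaction[l, k] for k in other))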
  63. Experimental setup: interaction function. Two ways to define the interaction term: an additive effect from co-occurrence (a symmetric matrix over action pairs) and a decay effect from neighboring actions (a decay function of the distance between slots).

  64. Experimental setup: policies. Both the behavior and evaluation policies are factorizable. The evaluation policy is controlled by a similarity parameter λ: λ → 1 makes it similar to the behavior policy π_b, λ → −1 makes it dissimilar, and |λ| → 0 makes it close to uniform.

  65. Study design. We examine how the estimators' performance, and which estimator is superior, change depending on the reward structure, the data size, the slate size, and the policy similarity.

  66. Experimental procedure [Saito+, 21b]. We first randomly sample configurations (10,000 times) and calculate the SE, then aggregate the results into the mean squared error (MSE): 1. define the configuration space; 2. for each random seed, 2-1. sample a configuration based on the seed and 2-2. calculate the SE on the sampled evaluation policy and dataset. This lets us evaluate V̂ across a wide variety of situations.
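The aggregation itself is just an average of squared errors over seeds (a minimal Python sketch; the trial function below is a dummy stand-in for "sample a configuration, run OPE, and compute the true value"):

    import numpy as np

    def estimate_mse(run_one_trial, n_seeds=10_000):
        """run_one_trial(seed) -> (v_hat, v_true); returns the mean squared error over seeds."""
        squared_errors = [
            (v_hat - v_true) ** 2
            for v_hat, v_true in (run_one_trial(seed) for seed in range(n_seeds))
        ]
        return float(np.mean(squared_errors))

    # dummy usage: a noisy estimator of a true value of 1.0
    print(estimate_mse(lambda seed: (np.random.default_rng(seed).normal(1.0, 0.1), 1.0), n_seeds=100))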
  67. Varying the data size: Cascade-DR performs stably well across configurations. (Figure: relative MSE, defined as MSE(V̂) / MSE(V̂_Cascade-DR), versus data size n for L = 5 under the standard, cascade, and independence click models.)

  68. Under the standard click model, only IPS is unbiased; with a large data size IPS and Cascade-DR are among the best, while with a small data size Cascade-DR and RIPS lead. Cascade-DR reduces the variance of IPS and RIPS by a large margin.

  69. Under the cascade click model, IPS, RIPS, and Cascade-DR are unbiased; Cascade-DR and RIPS lead with a large data size, and Cascade-DR and IIPS with a small one. Cascade-DR is the best, being unbiased while reducing the variance.

  70. Under the independence click model, all estimators are unbiased; IIPS and Cascade-DR perform best at both large and small data sizes, and Cascade-DR is the best among the estimators that rely on reasonable assumptions.

  71. Overall, when varying the data size, Cascade-DR performs stably well under all three click models.

  72. Varying the slate size (n = 1000): Cascade-DR stably outperforms RIPS across slate sizes L, and when the baseline estimation is successful, Cascade-DR becomes even more powerful.

  73. Varying the policy similarity λ (n = 1000): when the behavior and evaluation policies are dissimilar, Cascade-DR is even more promising.
  74. How do we calculate the ground-truth policy value? In the synthetic experiments, we take the expectation over contexts of the following weighted slate-level expected reward: 1. enumerate all combinations of actions (|A|^L of them); 2. for each action vector, calculate the evaluation policy's pscore π_e(a | x) and its slate-level expected reward; 3. take the pscore-weighted sum of the slate-level expected rewards.
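A small sketch of this enumeration for one context and a toy factorizable policy (the policy and reward functions here are stand-ins, not the experiment's actual ones):

    from itertools import product

    import numpy as np

    def ground_truth_value(pi_e_slot, expected_slate_reward, n_actions, L):
        """Exact value for one context: sum over all |A|^L action vectors of pscore * slate-level reward.

        pi_e_slot[l, a]: probability that the factorizable evaluation policy places action a at slot l.
        expected_slate_reward(a_vec): expected sum of slot-level rewards for the action vector a_vec.
        """
        value = 0.0
        for a_vec in product(range(n_actions), repeat=L):  # enumerate all |A|^L combinations
            pscore = np.prod([pi_e_slot[l, a] for l, a in enumerate(a_vec)])
            value += pscore * expected_slate_reward(a_vec)
        return value

    # toy usage: 3 actions, 2 slots, uniform policy, reward = sum of action indices -> approximately 2.0
    pi_e = np.full((2, 3), 1.0 / 3.0)
    print(ground_truth_value(pi_e, lambda a_vec: float(sum(a_vec)), n_actions=3, L=2))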
  75. Experimental procedure of the real-world experiment. 1. We first run two different policies π_A and π_B to construct datasets D_A and D_B, and use D_A to estimate V(π_B) by OPE. 2. We then approximate the ground-truth policy value by on-policy estimation on D_B. 3. To see a distribution of errors, we replicate the dataset via bootstrap sampling. 4. Finally, we calculate the squared errors on the bootstrapped datasets D_A'.

  76. Inspiration from Reinforcement Learning (RL). DR for RL [Jiang&Li, 16] [Thomas&Brunskill, 16] leverages the recursive structure of the Markov Decision Process (MDP): a baseline estimation of the policy value after visiting x_l, importance weighting applied only to the residual, and a recursive derivation.
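The RL version is usually written as a backward recursion over the steps of an episode (a sketch of the estimator from [Jiang&Li, 16] and [Thomas&Brunskill, 16]; notation assumed):

    \hat{V}_{L+1} := 0, \qquad
    \hat{V}_{l}
      = \hat{V}(x_l)
      + \frac{\pi_e(a_l \mid x_l)}{\pi_b(a_l \mid x_l)}
        \left( r_l + \gamma \hat{V}_{l+1} - \hat{Q}(x_l, a_l) \right),
    \quad l = L, \ldots, 1 .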
  77. Causal similarity between the MDP and the cascade assumption. The cascade assumption can be interpreted as a special case of the MDP.

  78. PseudoInverse (PI) [Swaminathan+, 17]. Designed for the situation where the slot-level rewards are unobservable; it implicitly makes the independence assumption.
  79. References (1/2)
[Precup+, 00] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation." ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. "Offline Evaluation of Ranking Policies with Click Models." KDD, 2018. https://arxiv.org/abs/1804.10488
[McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. "Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions." KDD, 2020. https://arxiv.org/abs/2007.12986
[Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. "Doubly Robust Policy Evaluation and Optimization." Statistical Science, 2014. https://arxiv.org/abs/1503.02834
[Jiang&Li, 16] Nan Jiang and Lihong Li. "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning." ICML, 2016. https://arxiv.org/abs/1511.03722

  80. References (2/2)
[Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill. "Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning." ICML, 2016. https://arxiv.org/abs/1604.00923
[Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS, 2021. https://arxiv.org/abs/2008.07146
[Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. "Evaluating the Robustness of Off-Policy Evaluation." RecSys, 2021. https://arxiv.org/abs/2108.13703
[Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. "Off-policy Evaluation for Slate Recommendation." NeurIPS, 2017. https://arxiv.org/abs/1605.04812