
[WSDM'22] Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Best Paper Runner-Up Award @ WSDM2022
Proceedings: https://dl.acm.org/doi/10.1145/3488560.3498380
arXiv: https://arxiv.org/abs/2202.01562

Haruka Kiyohara

February 23, 2022

Transcript

  1. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade

    Behavior Model Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto Haruka Kiyohara, Tokyo Institute of Technology https://sites.google.com/view/harukakiyohara February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 1
  2. Real world ranking decision making Examples of recommending a ranking

    of items February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 3 Applications include • Search Engine • Music Streaming • E-commerce • News • and more! Can we evaluate the value of these ranking decisions?
  3. Content • Overview of Off-Policy Evaluation (OPE) of Ranking Policies

    • Existing Estimators and Challenges • Seminal Work: Doubly Robust Estimator • Proposal: Cascade Doubly Robust (Cascade-DR) • The Benefit of Cascade-DR February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 4
  4. Ranking decision making February 2022 Cascade Doubly Robust Off-Policy Evaluation

    @ WSDM2022 5 song 1 song 2 song 3 song 4 click no click click no click a coming user ranking position a ranked list of items rewards
  5. The policy also produces logged data February 2022 Cascade Doubly

    Robust Off-Policy Evaluation @ WSDM2022 6 user feedback (reward vector) a coming user (context) a ranked list of items (action vector) behavior policy 𝝅𝒃 logged bandit feedback
  6. Off-Policy Evaluation (OPE) The goal is to evaluate the performance

    of an evaluation policy 𝜋𝑒 . February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 7 where logged bandit feedback collected by 𝝅𝒃 expected reward obtained by deploying 𝝅𝒆 in the real system (e.g., sum of clicks)
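(Reference sketch, not on the slide: with x a context, a a ranked action vector, r the reward vector, and α_l an optional position weight, the target policy value can be written as
V(π_e) = E_{x ~ p(x), a ~ π_e(·|x), r ~ p(r|x,a)} [ Σ_{l=1}^{L} α_l r_l ],
and an OPE estimator V-hat(π_e; D) approximates this quantity using only the logged bandit feedback D = {(x_i, a_i, r_i)}_{i=1}^{n} collected by π_b.)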
  7. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 8
  8. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 9 Bias is caused by the distribution shift.
  9. Distribution Shift Behavior and evaluation policies (𝜋𝑒 and 𝜋𝑏 )

    follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 10 behavior evaluation
  10. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 11 Variance increases with the size of the combinatorial action space and decreases with the data size.
  11. How large is the action space in slate? Non-factorizable case

    – policy chooses actions without duplication. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 12 Song 1 Song 2 Song 3 Song 4 When there are 10 unique actions (|𝑨| = 10), the policy has 𝑷(𝑳, |𝑨|) permutation choices: 10 x 9 x 8 x 7. When 𝐿 = 10, the number of possible rankings is 3,628,800 (= 10!).
  12. How large is the action space in slate? Factorizable case

    – policy chooses actions independently. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 13 Song 1 Song 2 Song 3 Song 4 When there are 10 unique actions (|𝑨| = 10), the policy has |𝑨|^𝑳 (exponentiation) choices: 10 x 10 x 10 x 10. When 𝐿 = 10, the number of possible rankings is 10,000,000,000 (= 10^10).
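(Quick arithmetic check of the two counts above, using standard combinatorics rather than anything on the slides: without duplication the policy can choose |A| × (|A|−1) × ⋯ × (|A|−L+1) ranked lists, whereas a factorizable policy can choose |A|^L. With |A| = L = 10 these give 10! = 3,628,800 and 10^10 = 10,000,000,000, matching the numbers on the two slides.)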
  13. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance moderately. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 14 Bias is caused by the distribution shift. Variance increases with the size of the combinatorial action space and decreases with the data size.
  14. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 16 𝑛: data size importance weight w.r.t combinatorial actions song 1 song 2 song 3 song 4
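(Reference sketch of the IPS estimator in its standard form, where w(x, a) = π_e(a|x) / π_b(a|x) is the importance weight over the whole ranking and α_l a position weight:
V-hat_IPS(π_e; D) = (1/n) Σ_{i=1}^{n} w(x_i, a_i) Σ_{l=1}^{L} α_l r_{i,l}.)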
  15. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 17 behavior evaluation
  16. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 18 evaluation behavior
  17. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . • pros: unbiased under all possible user behavior (i.e., click model) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 19 song 1 song 2 song 3 song 4 𝑛: data size importance weight w.r.t combinatorial actions
  18. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 . • pros: unbiased under all possible user behavior (i.e., click model) • cons: suffers from a very high variance due to combinatorial actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 20 song 1 song 2 song 3 song 4 combinations importance weight w.r.t combinatorial actions 𝑛: data size
  19. Huge importance weight of IPS Behavior and evaluation policies (𝜋𝑒

    and 𝜋𝑏 ) follow different probability distributions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 21 An excessively large importance weight makes IPS overly sensitive to data points observed with small probability. behavior evaluation
  20. products of the slot-level importance weights (factorizable case) Source of

    high variance in IPS IPS assumes that the reward at slot 𝑙 depends on all the actions in the ranking. • pros: unbiased under all possible user behavior (i.e., click model) • cons: suffers from a very high variance due to combinatorial actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 22
  21. IIPS assumes that users interact with items independently of the

    other positions. • pros: substantially reduces the variance of IPS Independent IPS (IIPS) [Li+, 18] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 23 independence assumption product disappears
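(Reference sketch of IIPS under the independence assumption, where π(a_l | x) denotes the marginal probability that slot l receives action a_l; the weight no longer involves the other slots:
V-hat_IIPS(π_e; D) = (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,l} | x_i) / π_b(a_{i,l} | x_i)) α_l r_{i,l}.)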
  22. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 24 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒
  23. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 25 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IPS
  24. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 26 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  25. IIPS assumes that users interact with items independently of the

    other positions. • pros: substantially reduces the variance of IPS • cons: may suffer from a large bias due to the strong independence assumption on user behavior Independent IPS (IIPS) [Li+, 18] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 27 independence assumption product disappears
  26. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 28 cascade assumption considers only higher positions
  27. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 29 cascade assumption considers only higher positions
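(Reference sketch of RIPS under the cascade assumption; the weight for slot l covers only the actions at positions 1 through l:
V-hat_RIPS(π_e; D) = (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,1:l} | x_i) / π_b(a_{i,1:l} | x_i)) α_l r_{i,l}.)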
  28. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 30 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  29. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 31 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS RIPS
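(Reading off the factorizable example above, where every slot-level importance weight equals 10.0: for the reward at slot l = 2, IPS multiplies the weights of all four slots, 10 × 10 × 10 × 10 = 10,000; RIPS multiplies only slots 1 and 2, 10 × 10 = 100; IIPS uses slot 2 alone, 10.)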
  30. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟒, (factorizable case) February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 32 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS
  31. Let’s compare the importance weight February 2022 Cascade Doubly Robust

    Off-Policy Evaluation @ WSDM2022 33 Song 1 Song 2 Song 3 Song 4 10.0 10.0 10.0 10.0 slot-level importance weight 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 IIPS IPS RIPS When we evaluate slot-level reward at slot 𝒍 = 𝟒, (factorizable case)
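(For the last slot l = 4 in the same example, IIPS still uses a weight of 10, while RIPS multiplies all four slot-level weights, 10 × 10 × 10 × 10 = 10,000, and thus coincides with IPS at the bottom position; this is why RIPS still suffers from high variance when L is large, as the next slide notes.)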
  32. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom. (i.e., cascade assumption) • pros: reduces the bias of IIPS and the variance of IPS • cons: still suffers from a high variance when 𝐿 is large February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 34 cascade assumption considers only higher positions
  33. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 35 [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard; the best existing estimator changes across configurations (RIPS or IPS / IIPS or RIPS / IIPS).]
  34. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 36 [Same figure as the previous slide: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.] Our goal: we want an OPE estimator that works well in various situations.
  35. Our goal: Can we dominate all existing estimators? February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 37 Bias Variance achievable bias-variance tradeoff by modifying click models IIPS RIPS IPS independent cascade standard click model
  36. Our goal: Can we dominate all existing estimators? February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 38 Bias Variance IIPS RIPS IPS independent cascade standard click model achievable bias-variance tradeoff by modifying click models Can we further reduce the variance of RIPS, while remaining unbiased under the Cascade assumption?
  37. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 40 (hereinafter)
  38. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS. + unbiased and small variance February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 41 baseline estimation importance weighting on residual (hereinafter) control variate
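(Reference sketch of DR in the single-action setting, following [Dudík+, 14], with w(x, a) = π_e(a|x) / π_b(a|x) and q-hat(x, a) the estimated expected reward:
V-hat_DR(π_e; D) = (1/n) Σ_{i=1}^{n} [ E_{a ~ π_e(·|x_i)}[ q-hat(x_i, a) ] + w(x_i, a_i) ( r_i − q-hat(x_i, a_i) ) ],
where the first term is the baseline estimation (control variate) and the importance weight is applied only to the residual.)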
  39. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 42 importance weight leads to variance
  40. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 43 scale down the weighted value importance weight leads to variance
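(Plugging the slide's numbers in as a rough check: the IPS contribution is w(x, a) · r = 10 × 1 = 10, whereas the DR contribution is q-hat(x, π_e) + w(x, a)(r − q-hat(x, a)) = 0.9 + 10 × (1 − 0.9) = 1.9, since q-hat(x, a) = 0.9 for every a. The importance weight now multiplies only the small residual, so the weighted value is scaled down as the slide says.)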
  41. Variance reduction of DR When 𝑤(𝑥, 𝑎) = 10,

    𝑞(𝑥, 𝑎) = 0.8, 𝑟 = 1, ∀𝑎, q-hat(𝑥, 𝑎) = 0.9, February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 44 scale down the weighted value importance weight leads to variance We want to define DR in ranking OPE, but how can we do it under the complex Cascade assumption? This term is computationally intractable.
  42. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 46
  43. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 47 policy value after position 𝒍 policy value after position 𝒍
  44. Recursive form of RIPS Transform RIPS into the recursive form.

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 48 policy value after position 𝒍 policy value after position 𝒍 Now, the importance weight depends only on 𝒂𝒍 .
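(Reference sketch of the recursion, in my notation, with the slot-level weight w_l = π_e(a_l | x, a_{1:l−1}) / π_b(a_l | x, a_{1:l−1}):
V-hat_{L+1} = 0,   V-hat_l = w_l · ( α_l r_l + V-hat_{l+1} )   for l = L, …, 1,
and RIPS averages V-hat_1 over the logged data. Unrolling the recursion recovers the cascade importance weights of the original form, but each step now involves only the weight for a_l.)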
  45. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 49
  46. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 50 baseline estimation importance weighting only on residual control variate policy value after position 𝒍
  47. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 51 existing works: click model bias-variance tradeoff policy value after position 𝒍
  48. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 52 our idea: control variate reduce variance more! existing works: click model bias-variance tradeoff policy value after position 𝒍
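(Reference sketch of the recursion with the control variate, reconstructed by analogy with DR for MDPs described in the appendix slides; Q-hat(x, a_{1:l}) is the estimated value after position l and Q-hat(x, a_{1:l−1}, π_e) its expectation over a_l ~ π_e:
V-hat_{L+1} = 0,   V-hat_l = Q-hat(x, a_{1:l−1}, π_e) + w_l · ( α_l r_l + V-hat_{l+1} − Q-hat(x, a_{1:l}) ),
so the importance weight is applied only to the residual after subtracting the baseline estimation.)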
  49. • pros: reduces the variance of RIPS • pros: still

    unbiased under Cascade Statistical advantages of Cascade-DR February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 54 (under a reasonable assumption on Q-hat) Better bias-variance tradeoff than IPS, IIPS, and RIPS!
  50. A difficult tradeoff remains for the existing estimators February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 55 [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.] The best estimator can change with the data size and the true click model (estimator selection is hard).
  51. Cascade-DR clearly dominates IPS, IIPS, RIPS February 2022 Cascade Doubly

    Robust Off-Policy Evaluation @ WSDM2022 56 Cascade-DR clearly dominates all existing estimators across various configurations! (no need for difficult estimator selection anymore) [Figure: MSE (lower is better) vs. data size 𝑛, standard and cascade settings (𝐿 = 5), true click model varied among independence, cascade, and standard.]
  52. Cascade-DR performs well on a real platform February 2022 Cascade

    Doubly Robust Off-Policy Evaluation @ WSDM2022 57 the lower, the better Cascade-DR is the most accurate and stable even under realistic user behavior.
  53. Summary • OPE of ranking policies has a variety of

    applications (e.g., search engines). • Existing estimators suffer from either a large bias or a large variance, and the best estimator can change depending on the true click model and the data size. • Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the Cascade assumption. Cascade-DR enables an accurate OPE of real-world ranking decisions! February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 58
  54. Cascade-DR is available in OpenBanditPipeline! Implemented as `obp.ope.SlateCascadeDoublyRobust`. February 2022

    Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 59 https://github.com/st-tech/zr-obp Our experimental code also uses obp. Only four lines of code are needed to implement OPE.
    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)
    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)
  55. Thank you for listening! Find out more (e.g., theoretical analysis

    and experiments) in the full paper! contact: [email protected] February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 60
  56. Cascade Doubly Robust (Cascade-DR) By solving the recursive form, we

    obtain Cascade-DR. If the estimation error of Q-hat is within ±100%, then Cascade-DR can reduce the variance of RIPS. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 61 recursively estimate baseline
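(To make the recursion concrete, here is a minimal NumPy sketch of the Cascade-DR computation under the reconstruction above; the function name and inputs are hypothetical and this is not obp's implementation — use obp.ope.SlateCascadeDoublyRobust in practice.)
import numpy as np

def cascade_dr_estimate(slot_weights, rewards, q_hat, q_hat_pi_e, alpha=None):
    # slot_weights: (n, L) per-slot importance weights pi_e(a_l | x, a_{1:l-1}) / pi_b(a_l | x, a_{1:l-1})
    # rewards:      (n, L) observed slot-level rewards
    # q_hat:        (n, L) Q-hat(x, a_{1:l}) evaluated at the logged actions
    # q_hat_pi_e:   (n, L) expectation of Q-hat over a_l ~ pi_e(. | x, a_{1:l-1})
    # alpha:        (L,) optional position weights (defaults to all ones)
    n, L = rewards.shape
    if alpha is None:
        alpha = np.ones(L)
    v = np.zeros(n)  # V-hat_{L+1} = 0
    for l in reversed(range(L)):  # solve the recursion from the bottom slot up
        v = q_hat_pi_e[:, l] + slot_weights[:, l] * (alpha[l] * rewards[:, l] + v - q_hat[:, l])
    return v.mean()  # average V-hat_1 over the logged data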
  57. How accurate are the OPE estimators? The smaller the squared error

    (SE), the more accurate the estimator V-hat is. We use OpenBanditPipeline [Saito+, 21a]. • reward structure (user behavior assumption) • slate size 𝐿 and data size 𝑛 • policy similarity 𝜆 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 63 experimental configurations
  58. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 64 :sigmoid determined by the corresponding action interaction from the other slots (linear)
  59. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 65 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 independence
  60. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 66 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 cascade from the higher slots
  61. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 67 :sigmoid determined by the corresponding action interaction from the other slots (linear) Song 1 Song 2 Song 3 Song 4 𝒓𝟏 𝒓𝟐 𝒓𝟑 𝒓𝟒 standard all the other slots
  62. Experimental setup: reward function We define slot-level mean reward function

    as follows. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 68 :sigmoid determined by the corresponding action interaction from the other slots standard all the other slots cascade from the previous slots independence no interaction (linear)
  63. Experimental setup: interaction function Two ways to define the interaction function. •

    additive effect from co-occurrence • decay effect from neighboring actions February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 69 symmetric matrix decay function Song 1 Song 2 Song 3 Song 4
  64. Experimental setup: policies Behavior and evaluation policies are factorizable. •

    behavior policy • evaluation policy February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 70 𝝀 → 𝟏 makes the evaluation policy similar to 𝝅𝒃 and 𝝀 → −𝟏 makes it dissimilar; 𝝀 → 𝟏 accounts for 𝝅𝒃 more, while |𝝀| → 𝟎 approaches uniform
  65. Study Design How the estimators’ performance and their superiority change

    depending on • reward structure • data size / slate size / policy similarity February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 71
  66. Experimental procedure [Saito+, 21b] • We first randomly sample configurations

    (10000 times) and calculate SE. • Then, we aggregate the results and calculate the mean value of SE (MSE). February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 72 1. Define the configuration space. 2. For each random seed.. 2-1. sample a configuration based on the seed 2-2. calculate SE on the sampled evaluation policy and dataset (configuration). -> we can evaluate V-hat in various situations.
  67. Varying data size Cascade-DR stably performs well on various configurations!

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 73 [Figure: relative MSE vs. data size 𝑛, panels for the standard / cascade / independence true click models (𝐿 = 5), with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR)
  68. Varying data size Cascade-DR reduces the variance of IPS and

    RIPS a lot. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 74 [Figure: relative MSE vs. data size 𝑛 under the standard true click model, with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> IPS; Large data size -> IPS, Cascade-DR, ..; Small data size -> Cascade-DR, RIPS, ..
  69. Varying data size Cascade-DR is the best, being unbiased while

    reducing the variance. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 75 [Figure: relative MSE vs. data size 𝑛 under the cascade true click model, with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> IPS, RIPS, Cascade-DR; Large data size -> Cascade-DR, RIPS, ..; Small data size -> Cascade-DR, IIPS, ..
  70. Varying data size Cascade-DR is the best among the estimators

    using reasonable assumptions. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 76 [Figure: relative MSE vs. data size 𝑛 under the independence true click model, with variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR) Unbiased -> all estimators; Large data size -> IIPS, Cascade-DR, ..; Small data size -> IIPS, Cascade-DR, ..
  71. Varying data size Cascade-DR stably performs well on various configurations!

    February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 77 [Figure: relative MSE vs. data size 𝑛, panels for the standard / cascade / independence true click models (𝐿 = 5), with bias and variance annotations.] relative MSE (V-hat) = MSE (V-hat) / MSE (V-hat_Cascade-DR)
  72. Varying slate size • Cascade-DR stably outperforms RIPS on various

    slate sizes. • When the baseline estimation is successful, Cascade-DR becomes more powerful. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 78 [Figure: relative MSE vs. slate size 𝐿 (𝑛 = 1000), panels for the standard / cascade / independence true click models, with annotations for difficult vs. easy baseline estimation.]
  73. Varying policy similarity • When the behavior and evaluation policies

    are dissimilar, Cascade-DR is more promising. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 79 [Figure: relative MSE vs. policy similarity 𝜆 (𝑛 = 1000), panels for the standard / cascade / independence true click models, with bias and variance annotations.]
  74. How to calculate ground-truth policy value? In synthetic experiments, we

    take the expectation of the following weighted slate-level expected reward over the contexts. 1. Enumerate all combinations of actions (|𝐴|^𝐿). 2. For each action vector, calculate the evaluation policy pscore 𝜋𝑒(𝑎|𝑥) and its slate-level expected reward. 3. Calculate the weighted sum of the slate-level expected rewards using the pscore. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 80
  75. Experimental procedure of the real world experiment 1. We first

    run two different policies 𝜋𝐴 and 𝜋𝐵 to construct datasets 𝐷𝐴 and 𝐷𝐵 . Here, we use 𝐷𝐴 to estimate 𝑉(𝜋𝐵 ) by OPE. 2. Then, approximate the ground-truth policy value by on-policy estimation as follows. 3. To see a distribution of errors, replicate the dataset using bootstrap sampling. 4. Finally, calculate the squared errors on the bootstrapped datasets 𝐷𝐴 ′. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 81
  76. Inspiration from Reinforcement Learning (RL) DR leverages the recursive structure

    of Markov Decision Process (MDP). February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 83 baseline estimation policy value after visiting 𝒙𝒍 recursive derivation importance weighting on residual [Jiang&Li, 16] [Thomas&Brunskill, 16]
  77. Causal similarity between MDP and Cascade asm. Cascade assumption can

    be interpreted as a special case of MDP. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 84
  78. PseudoInverse (PI) [Swaminathan+, 17] • Designed for the situation where

    the slot-level rewards are unobservable. • Implicitly relies on the independence assumption. February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 85
  79. References (1/2) [Precup+, 00] Doina Precup, Richard S. Sutton, and

    Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs [Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120 [Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. “Offline Evaluation of Ranking Policies with Click Models.” KDD, 2018. https://arxiv.org/abs/1804.10488 [McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. “Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986 [Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834 [Jiang&Li, 16] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 87
  80. References (2/2) [Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill.

    “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923 [Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2008.07146 [Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703 [Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. “Off-policy evaluation for slate recommendation.” NeurIPS, 2017. https://arxiv.org/abs/1605.04812 February 2022 Cascade Doubly Robust Off-Policy Evaluation @ WSDM2022 88