
Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Haruka Kiyohara
February 23, 2022


Best Paper Runner-Up Award @ WSDM2022
Proceedings: https://dl.acm.org/doi/10.1145/3488560.3498380
arXiv: https://arxiv.org/abs/2202.01562

RL4RealLife WS @ ICML2021
About WS: https://sites.google.com/view/RL4RealLife2021

CFML勉強会 #6
https://cfml.connpass.com/event/249531/

IR reading, Tokyo, 2022 Spring
https://sigir.jp/post/2022-05-21-irreading_2022spring/

Explanatory article (Yahoo! JAPAN Tech Blog)
https://techblog.yahoo.co.jp/entry/2021121130233784/

Transcript

  1. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade

    Behavior Model Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Haruka Kiyohara, Tokyo Institute of Technology. https://sites.google.com/view/harukakiyohara
  2. Real world ranking decision making Examples of recommending a ranking

    of items. Applications include • Search Engine • Music Streaming • E-commerce • News • and more! Can we evaluate the value of these ranking decisions?
  3. Content • Overview of Off-Policy Evaluation (OPE) of Ranking Policies

    • Existing Estimators and Challenges • Seminal Work: Doubly Robust Estimator • Proposal: Cascade Doubly Robust (Cascade-DR) • The Benefit of Cascade-DR
  4. Ranking decision making

    [figure: a coming user, a ranked list of items (songs 1–4) across ranking positions, and the rewards (click / no click) observed at each position]
  5. The policy also produces logged data

    [figure: a coming user (context), a ranked list of items (action vector) chosen by the behavior policy 𝝅𝒃, and the user feedback (reward vector), collected together as logged bandit feedback]
  6. Off-Policy Evaluation (OPE) The goal is to evaluate the performance

    of an evaluation policy 𝜋𝑒 using logged bandit feedback collected by 𝝅𝒃. The target is the expected reward obtained by deploying 𝝅𝒆 in the real system (e.g., the sum of clicks).
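
Written out, the target above is the slate-level policy value (a sketch in generic notation; position weights can also be attached to each slot-level reward):

$$ V(\pi_e) = \mathbb{E}_{x \sim p(x),\, \boldsymbol{a} \sim \pi_e(\cdot \mid x),\, \boldsymbol{r} \sim p(\boldsymbol{r} \mid x, \boldsymbol{a})} \left[ \sum_{l=1}^{L} r_l \right] $$
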
  7. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance.
  8. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance. Bias is caused by the distribution shift.
  9. Distribution Shift Behavior and evaluation policies (𝜋 𝑒 and 𝜋

    𝑏) follow different probability distributions. [figure: the behavior and evaluation policies' distributions over rankings]
  10. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance. Variance increases with the size of the combinatorial action space and decreases with the data size.
  11. How large is the action space in slate? Non-factorizable case

    – the policy chooses actions without duplication. When there are 10 unique actions (|𝐴| = 10), the number of possible rankings is the permutation count |𝑨|! / (|𝑨| − 𝑳)!: 10 x 9 x 8 x 7 choices for the first four positions, and with 𝐿 = 10 the total is 10! = 3,628,800.
  12. How large is the action space in slate? Factorizable case

    – the policy chooses the action at each position independently. When there are 10 unique actions (|𝐴| = 10), there are 10 x 10 x 10 x 10 choices for the first four positions; the total is the exponentiation |𝑨|^𝑳, and with 𝐿 = 10 it is 10,000,000,000.
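
As a quick sanity check of these counts, a standalone snippet (not from the talk):

```python
import math

n_unique_actions = 10  # |A|
slate_size = 10        # L

# non-factorizable: ordered selection without duplication -> |A|! / (|A| - L)!
n_permutations = math.perm(n_unique_actions, slate_size)

# factorizable: each position is chosen independently -> |A| ** L
n_factorizable = n_unique_actions ** slate_size

print(n_permutations)   # 3628800
print(n_factorizable)   # 10000000000
```
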
  13. How to derive an accurate OPE estimation? We need to

    reduce both bias and variance. Bias is caused by the distribution shift. Variance increases with the size of the combinatorial action space and decreases with the data size.
  14. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 by reweighting each logged ranking with an importance weight w.r.t. the combinatorial action (𝑛: data size).
  15. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏) follow different probability distributions. [figure: the behavior policy's distribution over rankings]
  16. Dealing with the distribution shift by IPS Behavior and evaluation

    policies (𝜋𝑒 and 𝜋𝑏) follow different probability distributions. [figure: importance weighting maps the logged data from the behavior distribution to the evaluation distribution]
  17. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏 using an importance weight w.r.t. the combinatorial action. • pros: unbiased under any possible user behavior (i.e., click model)
  18. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10] IPS corrects

    the distribution shift between 𝜋𝑒 and 𝜋𝑏. • pros: unbiased under any possible user behavior (i.e., click model) • cons: suffers from very high variance due to the combinatorial action space
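
For reference, the slate-level IPS estimator takes the following form (written out from the description above; the paper's exact notation may differ slightly):

$$ \hat{V}_{\mathrm{IPS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_e(\boldsymbol{a}_i \mid x_i)}{\pi_b(\boldsymbol{a}_i \mid x_i)} \sum_{l=1}^{L} r_{i,l} $$

The importance weight is defined on the whole ranking, which is why the variance grows with the combinatorial action space.
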
  19. Huge importance weight of IPS Behavior and evaluation policies (𝜋

    𝑒 and 𝜋𝑏) follow different probability distributions. A very large importance weight makes IPS overly sensitive to data points observed with small probability. [figure: behavior vs. evaluation policy distributions]
  20. Source of high variance in IPS

    IPS assumes that the reward at slot 𝑙 depends on all the actions in the ranking, so its importance weight is the product of the slot-level importance weights (factorizable case). • pros: unbiased under any possible user behavior (i.e., click model) • cons: suffers from very high variance due to the combinatorial action space
  21. Independent IPS (IIPS) [Li+, 18]

    IIPS assumes that users interact with the item at each position independently of the other positions (independence assumption), so the product over positions in the importance weight disappears. • pros: substantially reduces the variance of IPS
  22. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐 (factorizable case): [figure: songs 1–4 with a slot-level importance weight of 10.0 at every position and rewards 𝒓𝟏–𝒓𝟒]
  23. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐 (factorizable case): [figure: with slot-level weights of 10.0 at every position, the IPS weight is the product over all positions, 10.0^4 = 10,000]
  24. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐 (factorizable case): [figure: IIPS uses only slot 2's weight, 10.0, while IPS uses the product over all positions, 10,000]
  25. Independent IPS (IIPS) [Li+, 18]

    IIPS assumes that users interact with the item at each position independently of the other positions (independence assumption), so the product over positions in the importance weight disappears. • pros: substantially reduces the variance of IPS • cons: may suffer from a large bias due to the strong independence assumption on user behavior
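
The resulting position-wise estimator looks as follows (again a sketch from the description; the weight uses only the probability of the item shown at position l):

$$ \hat{V}_{\mathrm{IIPS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L} \frac{\pi_e(a_{i,l} \mid x_i)}{\pi_b(a_{i,l} \mid x_i)} \, r_{i,l} $$
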
  26. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom (i.e., the cascade assumption), so the importance weight for position 𝑙 considers only position 𝑙 and the higher positions. • pros: reduces the bias of IIPS and the variance of IPS
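
Under the cascade assumption, the weight for position l covers the top-l prefix of the ranking (a sketch consistent with the description above):

$$ \hat{V}_{\mathrm{RIPS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L} \frac{\pi_e(a_{i,1:l} \mid x_i)}{\pi_b(a_{i,1:l} \mid x_i)} \, r_{i,l} $$
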
  28. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐 (factorizable case): [figure: IIPS uses only slot 2's weight, 10.0, while IPS uses the product over all positions, 10,000]
  29. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟐 (factorizable case): [figure: IIPS uses 10.0, RIPS uses the product over slots 1–2, 100, and IPS uses the product over all positions, 10,000]
  30. Let’s compare the importance weight When we evaluate slot-level reward

    at slot 𝒍 = 𝟒 (factorizable case): [figure: IIPS uses 10.0 while IPS uses the product over all positions, 10,000]
  31. Let’s compare the importance weight

    When we evaluate the slot-level reward at slot 𝒍 = 𝟒 (factorizable case): [figure: IIPS uses 10.0, while RIPS and IPS both use the product over slots 1–4, 10,000]
  32. Reward interaction IPS (RIPS) [McInerney+, 20] RIPS assumes that users

    interact with items sequentially from top to bottom (i.e., the cascade assumption), so the importance weight for position 𝑙 considers only position 𝑙 and the higher positions. • pros: reduces the bias of IIPS and the variance of IPS • cons: still suffers from high variance when 𝐿 is large
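
A toy reproduction of the comparison in the preceding slides (a standalone sketch, not the talk's code): with a factorizable policy pair whose slot-level importance weight is 10.0 at every position, the weight each estimator applies to the reward at slot l is:

```python
import numpy as np

slate_size = 4
slot_level_weight = np.full(slate_size, 10.0)  # pi_e(a_l | x) / pi_b(a_l | x) at each slot

for l in range(slate_size):  # 0-indexed slot
    w_iips = slot_level_weight[l]                 # only the slot itself
    w_rips = np.prod(slot_level_weight[: l + 1])  # top-l prefix (cascade assumption)
    w_ips = np.prod(slot_level_weight)            # the whole ranking
    print(f"slot {l + 1}: IIPS={w_iips:.0f}, RIPS={w_rips:.0f}, IPS={w_ips:.0f}")

# slot 2: IIPS=10, RIPS=100,   IPS=10000
# slot 4: IIPS=10, RIPS=10000, IPS=10000
```
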
  33. A difficult tradeoff remains for the existing estimators

    [figure: MSE (lower is better) vs. data size 𝑛 for each true click model (standard, cascade, independence), 𝐿 = 5; the best existing estimator changes across settings, e.g., RIPS or IPS under the standard model, IIPS or RIPS under cascade, and IIPS under independence]
  34. A difficult tradeoff remains for the existing estimators

    We want an OPE estimator that works well across all of these situations (our goal). [figure: MSE (lower is better) vs. data size 𝑛 for the standard, cascade, and independence click models, 𝐿 = 5]
  35. Our goal: Can we dominate all existing estimators?

    [figure: bias vs. variance; IIPS (independence), RIPS (cascade), and IPS (standard) trace the bias-variance tradeoff achievable by modifying the assumed click model]
  36. Our goal: Can we dominate all existing estimators?

    Can we further reduce the variance of RIPS while remaining unbiased under the Cascade assumption? [figure: the same bias-variance tradeoff across IIPS, RIPS, and IPS]
  37. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single-action setting (𝐿 = 1), we often use DR to reduce the variance of IPS.
  38. From IPS to Doubly Robust (DR) [Dudík+, 14] In a

    single-action setting (𝐿 = 1), we often use DR to reduce the variance of IPS. DR combines a baseline estimation with importance weighting applied only to the residual, using the reward model as a control variate: it stays unbiased while achieving a smaller variance.
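
The standard single-action DR estimator has this structure (a textbook form consistent with the slide's annotations; q̂ is the estimated reward model and w(x, a) = π_e(a | x) / π_b(a | x)):

$$ \hat{V}_{\mathrm{DR}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{E}_{a \sim \pi_e(\cdot \mid x_i)}\big[\hat{q}(x_i, a)\big] + w(x_i, a_i)\,\big(r_i - \hat{q}(x_i, a_i)\big) \Big) $$
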
  39. Variance reduction of DR

    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, and 𝑞̂(𝑥, 𝑎) = 0.9 for all 𝑎: in IPS, the importance weight multiplies the full reward and leads to variance.
  40. Variance reduction of DR

    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, and 𝑞̂(𝑥, 𝑎) = 0.9 for all 𝑎: DR scales down the importance-weighted value by weighting only the residual 𝑟 − 𝑞̂(𝑥, 𝑎).
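
Plugging in the numbers from this slide (a quick check, assuming the standard DR form above; since q̂ is constant, the baseline term equals 0.9):

```python
w, r, q_hat = 10.0, 1.0, 0.9

ips_term = w * r                   # IPS contribution: 10.0, far from the true value q = 1.0
dr_term = q_hat + w * (r - q_hat)  # DR contribution: 0.9 + 10.0 * 0.1 = 1.9, much closer to 1.0

print(ips_term, dr_term)
```
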
  41. Variance reduction of DR

    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 0.8, 𝑟 = 1, and 𝑞̂(𝑥, 𝑎) = 0.9 for all 𝑎: DR again scales down the importance-weighted value by weighting only the residual. We want to define DR in ranking OPE, but how can we do it under the complex Cascade assumption? The baseline expectation term over the combinatorial action space is computationally intractable.
  42. Recursive form of RIPS Transform RIPS into the recursive form.

  43. Recursive form of RIPS Transform RIPS into the recursive form.

    [equation: RIPS rewritten in a recursive form, expressed in terms of the policy value after position 𝒍]
  44. Recursive form of RIPS Transform RIPS into the recursive form.

    [equation: the recursive form, expressed in terms of the policy value after position 𝒍] Now, the importance weight at each step depends only on 𝒂𝒍.
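
A sketch of this recursion for the factorizable case (my reading of the slides above): with w_l = π_e(a_l | x) / π_b(a_l | x) and V̂_{L+1} = 0,

$$ \hat{V}_{l} = w_l \left( r_l + \hat{V}_{l+1} \right), \qquad \hat{V}_{\mathrm{RIPS}}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \hat{V}_{i,1} $$

Unrolling the recursion recovers the prefix-product weights of RIPS.
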
  45. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption.
  46. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption: a baseline estimation plus importance weighting applied only to the residual, with Q-hat (an estimate of the policy value after position 𝒍) acting as the control variate.
  47. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. Existing works control the bias-variance tradeoff only through the assumed click model.
  48. Introducing a control variate (Q-hat) Now we can define DR

    under the Cascade assumption. Existing works control the bias-variance tradeoff only through the assumed click model; our idea is to add a control variate to reduce the variance further.
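
Putting the pieces together, here is a minimal sketch of the recursive computation for a single logged slate, assuming factorizable policies and a hypothetical baseline model q_hat (it illustrates the structure described above; it is not the paper's or OBP's implementation):

```python
import numpy as np

def cascade_dr_single_slate(slot_weights, rewards, q_hat_logged, q_hat_baseline):
    """Recursive Cascade-DR style statistic for one logged slate.

    slot_weights[l]   : pi_e(a_l | x) / pi_b(a_l | x) for the logged action at slot l
    rewards[l]        : observed reward at slot l
    q_hat_logged[l]   : baseline estimate of the value from slot l onward, for the logged actions
    q_hat_baseline[l] : its expectation over a_l ~ pi_e (the tractable baseline term)
    """
    v_next = 0.0  # estimated policy value after the last position
    for l in reversed(range(len(rewards))):
        # baseline estimation + importance weighting applied only to the residual
        v_next = q_hat_baseline[l] + slot_weights[l] * (rewards[l] + v_next - q_hat_logged[l])
    return v_next  # averaging this statistic over the n logged slates gives the estimate

# toy usage with made-up numbers
print(cascade_dr_single_slate(
    slot_weights=np.array([2.0, 0.5, 1.0]),
    rewards=np.array([1.0, 0.0, 1.0]),
    q_hat_logged=np.array([1.8, 0.9, 0.8]),
    q_hat_baseline=np.array([1.7, 1.0, 0.9]),
))
```

Setting both q_hat arrays to zero recovers the recursive RIPS form above.
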
  49. Statistical advantages of Cascade-DR

    • pros: reduces the variance of RIPS (under a reasonable assumption on 𝑄̂) • pros: still unbiased under the Cascade assumption. A better bias-variance tradeoff than IPS, IIPS, and RIPS!
  50. A difficult tradeoff remains for the existing estimators

    The best estimator can change with the data size and the true click model, so estimator selection is hard. [figure: MSE (lower is better) vs. data size 𝑛 for the standard, cascade, and independence click models, 𝐿 = 5]
  51. Cascade-DR clearly dominates IPS, IIPS, and RIPS

    Cascade-DR clearly dominates all existing estimators across these configurations, so the difficult estimator selection is no longer needed. [figure: MSE (lower is better) vs. data size 𝑛 for the standard, cascade, and independence click models, 𝐿 = 5]
  52. Cascade-DR performs well on a real platform

    Cascade-DR is the most accurate and stable estimator even under realistic user behavior. [figure: error distributions on real data; the lower, the better]
  53. Summary • OPE of ranking policies has a variety of

    applications (e.g., search engines). • Existing estimators suffer from either a large bias or a large variance, and the best estimator can change depending on the true click model and the data size. • Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the Cascade assumption. Cascade-DR enables accurate OPE of real-world ranking decisions!
  54. Cascade-DR is available in OpenBanditPipeline! Implemented as `obp.ope.SlateCascadeDoublyRobust`.

    https://github.com/st-tech/zr-obp Our experimental code also uses obp. Only four lines of code are needed to run OPE:
    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)
    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)
  55. Thank you for listening! Find out more (e.g., theoretical analysis

    and experiments) in the full paper! contact: [email protected]
  56. Cascade Doubly Robust (Cascade-DR) By solving the recursive form, we

    obtain Cascade-DR. If the estimation error of 𝑄̂ is within ±100%, then Cascade-DR can reduce the variance of RIPS. The baseline is estimated recursively.
  57. How accurate are the OPE estimators?

    The smaller the squared error (SE), the more accurate the estimator 𝑉̂ is. We use OpenBanditPipeline [Saito+, 21a]. Experimental configurations: • reward structure (user behavior assumption) • slate size 𝐿 and data size 𝑛 • policy similarity 𝜆
  58. Experimental setup: reward function We define slot-level mean reward function

    as follows: a sigmoid applied to a term determined by the corresponding action plus a (linear) interaction term from the other slots.
  59. Experimental setup: reward function We define slot-level mean reward function

    as follows: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots. Under the independence structure, there is no interaction between slots. [figure: songs 1–4 with rewards 𝒓𝟏–𝒓𝟒]
  60. Experimental setup: reward function We define slot-level mean reward function

    as follows: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots. Under the cascade structure, the interaction comes only from the higher slots. [figure: songs 1–4 with rewards 𝒓𝟏–𝒓𝟒]
  61. Experimental setup: reward function We define slot-level mean reward function

    as follows: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots. Under the standard structure, the interaction comes from all the other slots. [figure: songs 1–4 with rewards 𝒓𝟏–𝒓𝟒]
  62. Experimental setup: reward function We define slot-level mean reward function

    as follows: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots. Standard: all the other slots; cascade: the previous (higher) slots; independence: no interaction.
  63. Experimental setup: interaction function

    Two ways to define the interaction function: • an additive effect from co-occurrence (a symmetric matrix) • a decay effect from neighboring actions (a decay function)
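
A rough sketch of how such a slot-level reward function could be composed, with made-up shapes for the base term and the co-occurrence matrix (this mirrors the description above and is not OBP's actual synthetic generator):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slot_level_mean_reward(base_term, slate, interaction_matrix, structure="cascade"):
    """base_term[a]: term determined by action a; interaction_matrix[a, b]: additive co-occurrence effect."""
    q = np.zeros(len(slate))
    for l in range(len(slate)):
        if structure == "independence":
            others = []                    # no interaction from the other slots
        elif structure == "cascade":
            others = slate[:l]             # only the higher (previous) slots
        else:  # "standard"
            others = np.delete(slate, l)   # all the other slots
        interaction = sum(interaction_matrix[slate[l], b] for b in others)
        q[l] = sigmoid(base_term[slate[l]] + interaction)  # sigmoid(own term + linear interaction)
    return q

# toy usage
rng = np.random.default_rng(0)
base = rng.normal(size=10)
W = rng.normal(scale=0.3, size=(10, 10))
W = (W + W.T) / 2  # symmetric co-occurrence matrix
print(slot_level_mean_reward(base, np.array([3, 1, 7, 2]), W, structure="standard"))
```
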
  64. Experimental setup: policies Behavior and evaluation policies are factorizable. •

    behavior policy • evaluation policy. The policy-similarity parameter 𝝀 controls the evaluation policy: 𝝀 → 𝟏 weighs 𝝅𝒃 more and makes the policies similar, 𝝀 → −𝟏 makes them dissimilar, and |𝝀| → 𝟎 approaches a uniform policy.
  65. Study Design How the estimators’ performance and their superiority change

    depending on • reward structure • data size / slate size / policy similarity
  66. Experimental procedure [Saito+, 21b] • We first randomly sample configurations

    (10,000 times) and calculate the SE. • Then, we aggregate the results and calculate the mean of the SE (MSE). Procedure: 1. Define the configuration space. 2. For each random seed: 2-1. sample a configuration based on the seed; 2-2. calculate the SE on the sampled evaluation policy and dataset (configuration). -> This lets us evaluate 𝑉̂ across various situations.
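
In pseudocode-like Python, the protocol looks roughly like this (the callables are placeholders for the synthetic components described above; in the actual experiments these are built with obp):

```python
import numpy as np

def run_evaluation_protocol(sample_configuration, generate_dataset_and_policy,
                            ground_truth_policy_value, estimators, n_seeds=10_000):
    """Sample a configuration per seed, compute each estimator's SE, and report the MSE."""
    squared_errors = {name: [] for name in estimators}
    for seed in range(n_seeds):
        config = sample_configuration(seed)                  # reward structure, L, n, lambda, ...
        dataset, pi_e = generate_dataset_and_policy(config)  # logged data from pi_b + evaluation policy
        v_true = ground_truth_policy_value(pi_e, config)     # computable in the synthetic setting
        for name, estimate in estimators.items():
            squared_errors[name].append((estimate(dataset, pi_e) - v_true) ** 2)
    return {name: float(np.mean(se)) for name, se in squared_errors.items()}  # MSE per estimator
```
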
  67. Varying data size Cascade-DR stably performs well on various configurations!

    relative MSE(𝑉̂) = MSE(𝑉̂) / MSE(𝑉̂ of Cascade-DR). [figure: relative MSE (decomposed into bias and variance) vs. data size 𝑛 under the standard, cascade, and independence click models, 𝐿 = 5]
  68. Varying data size Cascade-DR reduces the variance of IPS and

    RIPS substantially. [figure: relative MSE vs. data size 𝑛 under the standard click model. Unbiased: IPS. With a large data size: IPS, Cascade-DR, .. perform best; with a small data size: Cascade-DR, RIPS, ..]
  69. Varying data size Cascade-DR is the best, being unbiased while

    reducing the variance. [figure: relative MSE vs. data size 𝑛 under the cascade click model. Unbiased: IPS, RIPS, Cascade-DR. With a large data size: Cascade-DR, RIPS, .. perform best; with a small data size: Cascade-DR, IIPS, ..]
  70. Varying data size Cascade-DR is the best among the estimators

    using reasonable assumptions. [figure: relative MSE vs. data size 𝑛 under the independence click model. Unbiased: all estimators. With both large and small data sizes: IIPS, Cascade-DR, .. perform best]
  72. Varying slate size • Cascade-DR stably outperforms RIPS on various

    slate sizes. • When the baseline estimation is successful, Cascade-DR becomes even more powerful. [figure: relative MSE vs. slate size 𝐿 under the standard, cascade, and independence click models, 𝑛 = 1000]
  73. Varying policy similarity • When the behavior and evaluation policies

    are dissimilar, Cascade-DR is more promising. [figure: relative MSE vs. policy similarity 𝜆 under the standard, cascade, and independence click models, 𝑛 = 1000]
  74. How to calculate ground-truth policy value? In synthetic experiments, we

    take the expectation of the following weighted slate-level expected reward over the contexts. 1. Enumerate all combinations of actions (|𝑨|^𝑳 of them). 2. For each action vector, calculate the evaluation policy's pscore 𝜋𝑒(𝒂 | 𝑥) and its slate-level expected reward. 3. Calculate the weighted sum of the slate-level expected rewards using the pscores.
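
A direct transcription of these three steps (a sketch with hypothetical callables for the policy pscore and the slate-level expected reward; it is feasible only because the synthetic action space is small enough to enumerate):

```python
import itertools
import numpy as np

def ground_truth_policy_value(contexts, pi_e_pscore, slate_expected_reward,
                              n_unique_actions, slate_size):
    """V(pi_e) by exhaustive enumeration over the combinatorial action space."""
    values = []
    for x in contexts:
        v_x = 0.0
        # 1. enumerate all |A|^L action vectors (factorizable case)
        for a_vec in itertools.product(range(n_unique_actions), repeat=slate_size):
            # 2. evaluation-policy pscore and slate-level expected reward for this action vector
            # 3. accumulate the pscore-weighted expected reward
            v_x += pi_e_pscore(x, a_vec) * slate_expected_reward(x, a_vec)
        values.append(v_x)
    return float(np.mean(values))  # expectation over the contexts
```
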
  75. Experimental procedure of the real world experiment 1. We first

    run two different policies 𝜋𝐴 and 𝜋𝐵 to construct datasets 𝐷𝐴 and 𝐷𝐵. Here, we use 𝐷𝐴 to estimate 𝑉(𝜋𝐵) by OPE. 2. Then, we approximate the ground-truth policy value by on-policy estimation on 𝐷𝐵. 3. To see a distribution of errors, we replicate the dataset using bootstrap sampling. 4. Finally, we calculate the squared errors on the bootstrapped datasets 𝐷𝐴′.
  76. Inspiration from Reinforcement Learning (RL) DR leverages the recursive structure

    of the Markov Decision Process (MDP): the policy value after visiting 𝒙𝒍 is derived recursively, combining a baseline estimation with importance weighting on the residual [Jiang&Li, 16] [Thomas&Brunskill, 16].
  77. Causal similarity between the MDP and the Cascade assumption

    The Cascade assumption can be interpreted as a special case of an MDP.
  78. PseudoInverse (PI) [Swaminathan+, 17] • Designed for the situation where

    the slot-level rewards are unobservable. • Implicitly relies on the independence assumption.
  79. References (1/2)

    [Precup+, 00] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. “Offline Evaluation of Ranking Policies with Click Models.” KDD, 2018. https://arxiv.org/abs/1804.10488
    [McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. “Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986
    [Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834
    [Jiang&Li, 16] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722
  80. References (2/2)

    [Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923
    [Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS, 2021. https://arxiv.org/abs/2008.07146
    [Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703
    [Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. “Off-policy evaluation for slate recommendation.” NeurIPS, 2017. https://arxiv.org/abs/1605.04812