Slide 1

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model
Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto
Presenter: Haruka Kiyohara, Tokyo Institute of Technology
https://sites.google.com/view/harukakiyohara
July 2022, CFML勉強会 (CFML study group)

Slide 2

OPE of Ranking Policies

Slide 3

Real-world ranking decision making
Examples of recommending a ranking of items. Applications include:
• Search Engine
• Music Streaming
• E-commerce
• News
• and more!
Can we evaluate the value of these ranking decisions?

Slide 4

Contents
• Overview of Off-Policy Evaluation (OPE) of Ranking Policies
• Existing Estimators and Challenges
• Seminal Work: Doubly Robust Estimator
• Proposal: Cascade Doubly Robust (Cascade-DR)
• The Benefit of Cascade-DR

Slide 5

Ranking decision making
(Figure: a coming user is shown a ranked list of items, e.g., Songs 1-4 at ranking positions 1-4, and each position yields a reward such as click / no click.)

Slide 6

The policy also produces logged data
(Figure: the behavior policy π_b observes a coming user (context), presents a ranked list of items (action vector), and receives user feedback (reward vector); together these form the logged bandit feedback.)

Slide 7

Off-Policy Evaluation (OPE)
The goal is to evaluate the performance of an evaluation policy π_e:
(Equation: the policy value of π_e, i.e., the expected reward obtained by deploying π_e in the real system (e.g., the sum of clicks), estimated from the logged bandit feedback collected by π_b.)
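In the notation of the paper, the estimand can be sketched as follows (the slides describe it as the expected sum of clicks over the ranking):

    V(\pi_e) := \mathbb{E}_{p(x)\, \pi_e(\boldsymbol{a} \mid x)} \Big[ \textstyle\sum_{l=1}^{L} r_l \Big],
    \qquad \mathcal{D} = \{ (x_i, \boldsymbol{a}_i, \boldsymbol{r}_i) \}_{i=1}^{n} \ \text{collected by } \pi_b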

Slide 8

How can we derive an accurate OPE estimate?
We need to reduce both bias and variance moderately.

Slide 9

How can we derive an accurate OPE estimate?
We need to reduce both bias and variance moderately.
Bias is caused by the distribution shift.

Slide 10

Distribution Shift
The behavior and evaluation policies (π_b and π_e) follow different probability distributions.
(Figure: the behavior and evaluation action distributions.)

Slide 11

How can we derive an accurate OPE estimate?
We need to reduce both bias and variance moderately.
Variance increases with the size of the combinatorial action space and decreases with the data size.

Slide 12

How large is the action space in a slate?
Non-factorizable case: the policy chooses actions without duplication.
When there are 10 unique actions (|A| = 10), the number of possible rankings is the permutation P(|A|, L), e.g., 10 x 9 x 8 x 7 choices for Songs 1-4.
When L = 10, the number of combinations is 3,628,800!

Slide 13

How large is the action space in a slate?
Factorizable case: the policy chooses actions independently.
When there are 10 unique actions (|A| = 10), the number of possible rankings is the exponentiation |A|^L, e.g., 10 x 10 x 10 x 10 choices for Songs 1-4.
When L = 10, the number of combinations is 10,000,000,000!
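A quick check of these two counts (a minimal sketch; the numbers match the slides):

    from math import perm

    n_unique_actions = 10  # |A|
    slate_size = 10        # L

    # non-factorizable: ordered rankings without duplication -> permutations
    print(perm(n_unique_actions, slate_size))  # 3628800

    # factorizable: each slot is chosen independently -> |A|^L
    print(n_unique_actions ** slate_size)      # 10000000000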

Slide 14

How can we derive an accurate OPE estimate?
We need to reduce both bias and variance moderately.
Bias is caused by the distribution shift.
Variance increases with the size of the combinatorial action space and decreases with the data size.

Slide 15

Existing Approaches

Slide 16

Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
IPS corrects the distribution shift between π_e and π_b.
(Equation: the IPS estimator; n is the data size, and the importance weight is defined w.r.t. the combinatorial action, i.e., the whole ranking of Songs 1-4.)
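In the notation above, the IPS estimator can be sketched as follows (consistent with [Precup+, 00] [Strehl+, 10]; a_{1:L} denotes the whole ranking):

    \hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \frac{\pi_e(\boldsymbol{a}_{i,1:L} \mid x_i)}{\pi_b(\boldsymbol{a}_{i,1:L} \mid x_i)} \, r_{i,l}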

Slide 17

Dealing with the distribution shift by IPS
The behavior and evaluation policies (π_b and π_e) follow different probability distributions.
(Figure: the behavior and evaluation action distributions.)

Slide 18

Dealing with the distribution shift by IPS
The behavior and evaluation policies (π_b and π_e) follow different probability distributions.
(Figure: the logged behavior distribution is reweighted toward the evaluation distribution.)

Slide 19

Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
IPS corrects the distribution shift between π_e and π_b.
• pros: unbiased under all possible user behavior (i.e., click model)
(Equation: the IPS estimator; n is the data size, and the importance weight is defined w.r.t. the combinatorial action, i.e., the whole ranking of Songs 1-4.)

Slide 20

Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
IPS corrects the distribution shift between π_e and π_b.
• pros: unbiased under all possible user behavior (i.e., click model)
• cons: suffers from a very high variance due to the combinatorial action space
(Equation: the IPS estimator; n is the data size, and the importance weight is defined w.r.t. the combinatorial action, i.e., all combinations of the ranking of Songs 1-4.)

Slide 21

Huge importance weight of IPS
The behavior and evaluation policies (π_b and π_e) follow different probability distributions.
An excessively large importance weight makes IPS overly sensitive to data points observed with small probability.
(Figure: the behavior and evaluation action distributions.)

Slide 22

Source of high variance in IPS
IPS assumes that the reward at slot l may depend on all the actions in the ranking.
• pros: unbiased under all possible user behavior (i.e., click model)
• cons: suffers from a very high variance due to the combinatorial action space
(In the factorizable case, the IPS weight is the product of the slot-level importance weights.)

Slide 23

Independent IPS (IIPS) [Li+, 18]
IIPS assumes that users interact with items independently of the other positions.
• pros: substantially reduces the variance of IPS
(Under the independence assumption, the product of slot-level importance weights disappears.)
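A sketch of the IIPS estimator in the same notation (consistent with [Li+, 18]; π(a_l | x) denotes the marginal probability of placing a_l at position l):

    \hat{V}_{\mathrm{IIPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \frac{\pi_e(a_{i,l} \mid x_i)}{\pi_b(a_{i,l} \mid x_i)} \, r_{i,l}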

Slide 24

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 2 (factorizable case):
(Figure: Songs 1-4 with rewards r_1, r_2, r_3, r_4; the slot-level importance weight is 10.0 at every position.)

Slide 25

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 2 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS weight is highlighted.)

Slide 26

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 2 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS and IIPS weights are highlighted.)

Slide 27

Independent IPS (IIPS) [Li+, 18]
IIPS assumes that users interact with items independently of the other positions.
• pros: substantially reduces the variance of IPS
• cons: may suffer from a large bias due to the strong independence assumption on user behavior
(Under the independence assumption, the product of slot-level importance weights disappears.)

Slide 28

Reward interaction IPS (RIPS) [McInerney+, 20]
RIPS assumes that users interact with items sequentially from top to bottom (i.e., the cascade assumption).
• pros: reduces the bias of IIPS and the variance of IPS
(Under the cascade assumption, the importance weight considers only the higher positions.)
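A sketch of the RIPS estimator (consistent with [McInerney+, 20]; the weight now covers only positions 1 through l):

    \hat{V}_{\mathrm{RIPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L}
        \frac{\pi_e(\boldsymbol{a}_{i,1:l} \mid x_i)}{\pi_b(\boldsymbol{a}_{i,1:l} \mid x_i)} \, r_{i,l}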

Slide 29

Reward interaction IPS (RIPS) [McInerney+, 20]
RIPS assumes that users interact with items sequentially from top to bottom (i.e., the cascade assumption).
• pros: reduces the bias of IIPS and the variance of IPS
(Under the cascade assumption, the importance weight considers only the higher positions.)

Slide 30

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 2 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS and IIPS weights are highlighted.)

Slide 31

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 2 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS, IIPS, and RIPS weights are highlighted.)
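With these numbers, the weights applied to r_2 differ substantially across the estimators (a worked example; the factorizable IPS weight is the product over all four slots):

    IIPS: w_2 = 10
    RIPS: w_1 * w_2 = 10 * 10 = 100
    IPS:  w_1 * w_2 * w_3 * w_4 = 10^4 = 10,000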

Slide 32

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 4 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS and IIPS weights are highlighted.)

Slide 33

Let's compare the importance weights
When we evaluate the slot-level reward at slot l = 4 (factorizable case):
(Figure: slot-level importance weight 10.0 at every position for Songs 1-4 with rewards r_1-r_4; the IPS, IIPS, and RIPS weights are highlighted.)
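At the bottom position, the RIPS weight grows as large as the IPS weight (a worked example with the same numbers), which is why RIPS still suffers from a high variance when L is large:

    IIPS: w_4 = 10
    RIPS: w_1 * w_2 * w_3 * w_4 = 10^4 = 10,000
    IPS:  w_1 * w_2 * w_3 * w_4 = 10^4 = 10,000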

Slide 34

Reward interaction IPS (RIPS) [McInerney+, 20]
RIPS assumes that users interact with items sequentially from top to bottom (i.e., the cascade assumption).
• pros: reduces the bias of IIPS and the variance of IPS
• cons: still suffers from a high variance when L is large
(Under the cascade assumption, the importance weight considers only the higher positions.)

Slide 35

A difficult tradeoff remains for the existing estimators
(Figure: MSE (lower is better) vs. data size n under the true click models 'independence', 'cascade', and 'standard' (L = 5); the best estimator changes across panels: IIPS, IIPS or RIPS, and RIPS or IPS, respectively.)

Slide 36

A difficult tradeoff remains for the existing estimators
Our goal: we want an OPE estimator that works well in various situations.
(Figure: MSE (lower is better) vs. data size n under the true click models 'independence', 'cascade', and 'standard' (L = 5).)

Slide 37

Our goal: Can we dominate all existing estimators?
(Figure: bias-variance plane showing the tradeoff achievable by modifying the assumed click model; IIPS (independent), RIPS (cascade), and IPS (standard) lie along this curve.)

Slide 38

Our goal: Can we dominate all existing estimators?
Can we further reduce the variance of RIPS, while remaining unbiased under the cascade assumption?
(Figure: bias-variance plane showing the tradeoff achievable by modifying the assumed click model; IIPS (independent), RIPS (cascade), and IPS (standard) lie along this curve.)

Slide 39

Seminal work

Slide 40

From IPS to Doubly Robust (DR) [Dudík+, 14]
In a single-action setting (L = 1), we often use DR to reduce the variance of IPS.
(Equation: the single-action IPS estimator; the notation introduced here is used hereinafter.)

Slide 41

From IPS to Doubly Robust (DR) [Dudík+, 14]
In a single-action setting (L = 1), we often use DR to reduce the variance of IPS.
(Equation: the DR estimator = baseline estimation + importance weighting on the residual; the baseline acts as a control variate. The notation introduced here is used hereinafter.)
+ unbiased and small variance
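A sketch of the single-action DR estimator in this notation, with q̂(x, a) an estimate of the expected reward q(x, a) and w(x, a) the importance weight:

    \hat{V}_{\mathrm{DR}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \Big(
        \mathbb{E}_{\pi_e(a \mid x_i)} \big[ \hat{q}(x_i, a) \big]
        + w(x_i, a_i) \big( r_i - \hat{q}(x_i, a_i) \big) \Big),
    \qquad w(x, a) := \frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}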

Slide 42

Variance reduction of DR
When w(x, a) = 10, q(x, a) = 1, r = 1, and q̂(x, a) = 0.9 for all a:
(Figure: in IPS, the importance weight multiplies the full reward, which leads to variance.)

Slide 43

Variance reduction of DR
When w(x, a) = 10, q(x, a) = 1, r = 1, and q̂(x, a) = 0.9 for all a:
(Figure: in IPS, the importance weight multiplies the full reward, which leads to variance; DR scales down the importance-weighted value.)
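Plugging the numbers from the slide into the two per-sample contributions (a worked example):

    IPS: w(x, a) * r = 10 * 1 = 10
    DR:  E_{pi_e}[q̂(x, a)] + w(x, a) * (r - q̂(x, a)) = 0.9 + 10 * (1 - 0.9) = 1.9

The importance weight now multiplies only the small residual, so the estimate fluctuates far less across samples.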

Slide 44

Variance reduction of DR
When w(x, a) = 10, q(x, a) = 0.8, r = 1, and q̂(x, a) = 0.9 for all a:
(Figure: the importance weight leads to variance; DR scales down the importance-weighted value.)
We want to define DR in ranking OPE, but how can we do it under the complex cascade assumption? The baseline term (an expectation over the combinatorial action space) is computationally intractable.

Slide 45

Cascade Doubly Robust

Slide 46

Recursive form of RIPS
Transform RIPS into the recursive form.

Slide 47

Recursive form of RIPS
Transform RIPS into the recursive form.
(Equation: the recursion, written in terms of the policy value after position l.)

Slide 48

Recursive form of RIPS
Transform RIPS into the recursive form.
(Equation: the recursion, written in terms of the policy value after position l.)
Now, the importance weight depends only on a_l.
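A sketch of this recursion (consistent with the paper's derivation; V̂_{i,l} is the estimated value from position l onward for sample i, and the slot-level weight conditions on the higher positions):

    \hat{V}_{i, L+1} := 0, \qquad
    \hat{V}_{i, l} := w_l(x_i, \boldsymbol{a}_{i,1:l}) \big( r_{i,l} + \hat{V}_{i, l+1} \big), \qquad
    w_l(x, \boldsymbol{a}_{1:l}) := \frac{\pi_e(a_l \mid x, \boldsymbol{a}_{1:l-1})}{\pi_b(a_l \mid x, \boldsymbol{a}_{1:l-1})}

Unrolling the recursion recovers \hat{V}_{\mathrm{RIPS}} = \frac{1}{n} \sum_{i} \hat{V}_{i,1}.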

Slide 49

Introducing a control variate (Q-hat)
Now we can define DR under the cascade assumption.

Slide 50

Introducing a control variate (Q-hat)
Now we can define DR under the cascade assumption.
(Equation: the recursion in terms of the policy value after position l, now with a baseline estimate (control variate) and importance weighting only on the residual.)
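A sketch of the resulting recursion (Cascade-DR), where Q̂_l(x, a_{1:l}) estimates the expected cumulative reward from position l onward; each baseline expectation is over a single slot, so it remains tractable:

    \hat{V}_{i, L+1} := 0,
    \hat{V}_{i, l} := \mathbb{E}_{\pi_e(a_l \mid x_i, \boldsymbol{a}_{i,1:l-1})} \big[ \hat{Q}_l(x_i, \boldsymbol{a}_{i,1:l-1}, a_l) \big]
        + w_l(x_i, \boldsymbol{a}_{i,1:l}) \big( r_{i,l} + \hat{V}_{i, l+1} - \hat{Q}_l(x_i, \boldsymbol{a}_{i,1:l}) \big),
    \qquad \hat{V}_{\text{Cascade-DR}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \hat{V}_{i,1}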

Slide 51

Introducing a control variate (Q-hat)
Now we can define DR under the cascade assumption.
(Equation: the recursion in terms of the policy value after position l. Existing works control the bias-variance tradeoff only through the assumed click model.)

Slide 52

Introducing a control variate (Q-hat)
Now we can define DR under the cascade assumption.
(Equation: the recursion in terms of the policy value after position l. Existing works control the bias-variance tradeoff only through the assumed click model; our idea is to add a control variate to reduce the variance further.)

Slide 53

Benefits of Cascade-DR

Slide 54

Statistical advantages of Cascade-DR
• pros: reduces the variance of RIPS (under a reasonable assumption on Q̂)
• pros: still unbiased under the cascade assumption
Better bias-variance tradeoff than IPS, IIPS, and RIPS!

Slide 55

A difficult tradeoff remains for the existing estimators
The best estimator can change with data sizes and click models (estimator selection is hard).
(Figure: MSE (lower is better) vs. data size n under the true click models 'independence', 'cascade', and 'standard' (L = 5).)

Slide 56

Cascade-DR clearly dominates IPS, IIPS, and RIPS
Cascade-DR clearly dominates all existing estimators across various configurations (no need for the difficult estimator selection anymore)!
(Figure: MSE (lower is better) vs. data size n under the true click models 'independence', 'cascade', and 'standard' (L = 5).)

Slide 57

Cascade-DR performs well on a real platform
Cascade-DR is the most accurate and stable even under realistic user behavior.
(Figure: estimation error on the real platform; the lower, the better.)

Slide 58

Summary
• OPE of ranking policies has a variety of applications (e.g., search engines).
• Existing estimators suffer either from a large bias or a large variance, and the best estimator can change depending on the true click model and the data size.
• Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the cascade assumption.
Cascade-DR enables accurate OPE of real-world ranking decisions!

Slide 59

Cascade-DR is available in OpenBanditPipeline!
Implemented as `obp.ope.SlateCascadeDoublyRobust`.
https://github.com/st-tech/zr-obp
Our experimental code also uses obp. Only four lines of code to implement OPE:

    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)

    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)

Slide 60

Thank you for listening!
Find out more (e.g., theoretical analysis and experiments) in the full paper!
contact: [email protected]

Slide 61

Cascade Doubly Robust (Cascade-DR)
By solving the recursive form, we obtain Cascade-DR, which recursively estimates the baseline Q̂.
If the estimation error of Q̂ is within ±100% of the true value, then Cascade-DR can reduce the variance of RIPS.

Slide 62

Additional experimental results

Slide 63

How accurate are the OPE estimators?
The smaller the squared error (SE), the more accurate the estimator V̂ is. We use OpenBanditPipeline [Saito+, 21a].
Experimental configurations:
• reward structure (user behavior assumption)
• slate size L and data size n
• policy similarity λ

Slide 64

Experimental setup: reward function
We define the slot-level mean reward function as follows.
(Equation: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots.)

Slide 65

Experimental setup: reward function
We define the slot-level mean reward function as follows.
(Equation: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots.)
(Figure: under 'independence', the reward r_l of each of Songs 1-4 depends only on its own slot.)

Slide 66

Experimental setup: reward function
We define the slot-level mean reward function as follows.
(Equation: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots.)
(Figure: under 'cascade', the reward r_l receives interaction from the higher slots.)

Slide 67

Experimental setup: reward function
We define the slot-level mean reward function as follows.
(Equation: a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots.)
(Figure: under 'standard', the reward r_l receives interaction from all the other slots.)

Slide 68

Experimental setup: reward function
We define the slot-level mean reward function as a sigmoid of a term determined by the corresponding action plus a (linear) interaction term from the other slots:
• standard: interaction from all the other slots
• cascade: interaction from the previous (higher) slots
• independence: no interaction
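A sketch of this construction (the exact functional form in the paper may differ; Φ(l) denotes the set of interacting slots: empty for 'independence', {1, ..., l-1} for 'cascade', and all other slots for 'standard'):

    q_l(x, \boldsymbol{a}) = \sigma \Big( f(x, a_l) + \sum_{l' \in \Phi(l)} g(x, a_l, a_{l'}) \Big)

where σ is the sigmoid, f is the base term determined by the corresponding action, and g is the linear interaction term.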

Slide 69

Experimental setup: interaction function
Two ways to define the interaction function:
• additive effect from co-occurrence (a symmetric matrix)
• decay effect from neighboring actions (a decay function)
(Figure: illustration with Songs 1-4.)

Slide 70

Experimental setup: policies
Behavior and evaluation policies are factorizable.
• behavior policy
• evaluation policy
(Equations: the two policies, parameterized by λ; λ → 1 makes them similar and λ → −1 dissimilar; λ → 1 accounts for π_b more, while |λ| → 0 approaches the uniform policy.)

Slide 71

Study Design
How the estimators' performance and relative superiority change depending on:
• reward structure
• data size / slate size / policy similarity

Slide 72

Experimental procedure [Saito+, 21b]
• We first randomly sample configurations (10,000 times) and calculate the SE.
• Then, we aggregate the results and calculate the mean of the SE (MSE).
1. Define the configuration space.
2. For each random seed:
   2-1. sample a configuration based on the seed
   2-2. calculate the SE on the sampled evaluation policy and dataset (configuration)
-> We can evaluate V̂ in various situations.
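A minimal sketch of this loop (the helper functions `sample_config`, `ground_truth_value`, and `run_ope` are hypothetical stand-ins for the actual obp-based experiment code):

    import numpy as np

    def estimate_mse(estimator, n_seeds=10_000):
        """Aggregate squared errors over randomly sampled configurations."""
        squared_errors = []
        for seed in range(n_seeds):
            config = sample_config(seed)          # hypothetical: reward structure, n, L, lambda
            v_true = ground_truth_value(config)   # hypothetical: true policy value of pi_e
            v_hat = run_ope(estimator, config)    # hypothetical: OPE estimate from logged data
            squared_errors.append((v_hat - v_true) ** 2)
        return np.mean(squared_errors)            # MSE over configurations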

Slide 73

Varying data size
Cascade-DR stably performs well on various configurations!
(Figure: relative MSE vs. data size n under the true click models 'standard', 'cascade', and 'independence' (L = 5), where relative MSE(V̂) = MSE(V̂) / MSE(V̂_Cascade-DR); annotations indicate whether bias or variance dominates in each panel.)

Slide 74

Varying data size
Cascade-DR reduces the variance of IPS and RIPS a lot.
(Figure: relative MSE vs. data size n under the 'standard' click model, where relative MSE(V̂) = MSE(V̂) / MSE(V̂_Cascade-DR). Unbiased: IPS. Best with a large data size: IPS, Cascade-DR, ... Best with a small data size: Cascade-DR, RIPS, ...)

Slide 75

Varying data size
Cascade-DR is the best, being unbiased while reducing the variance.
(Figure: relative MSE vs. data size n under the 'cascade' click model, where relative MSE(V̂) = MSE(V̂) / MSE(V̂_Cascade-DR). Unbiased: IPS, RIPS, Cascade-DR. Best with a large data size: Cascade-DR, RIPS, ... Best with a small data size: Cascade-DR, IIPS, ...)

Slide 76

Varying data size
Cascade-DR is the best among the estimators using reasonable assumptions.
(Figure: relative MSE vs. data size n under the 'independence' click model, where relative MSE(V̂) = MSE(V̂) / MSE(V̂_Cascade-DR). Unbiased: all estimators. Best with a large data size: IIPS, Cascade-DR, ... Best with a small data size: IIPS, Cascade-DR, ...)

Slide 77

Varying data size
Cascade-DR stably performs well on various configurations!
(Figure: relative MSE vs. data size n under the true click models 'standard', 'cascade', and 'independence' (L = 5), where relative MSE(V̂) = MSE(V̂) / MSE(V̂_Cascade-DR); annotations indicate whether bias or variance dominates in each panel.)

Slide 78

Varying slate size
• Cascade-DR stably outperforms RIPS across various slate sizes.
• When baseline estimation is successful, Cascade-DR becomes more powerful.
(Figure: relative MSE vs. slate size L under the true click models 'standard', 'cascade', and 'independence' (n = 1000); annotations mark 'difficult' and 'easy' settings for baseline estimation.)

Slide 79

Varying policy similarity
• When the behavior and evaluation policies are dissimilar, Cascade-DR is more promising.
(Figure: relative MSE vs. policy similarity λ under the true click models 'standard', 'cascade', and 'independence' (n = 1000); annotations indicate whether bias or variance dominates.)

Slide 80

How to calculate the ground-truth policy value?
In synthetic experiments, we take the expectation of the following weighted slate-level expected reward over the contexts.
1. Enumerate all combinations of actions (|A|^L).
2. For each action vector, calculate the evaluation policy pscore π_e(a | x) and its slate-level expected reward.
3. Calculate the weighted sum of the slate-level expected rewards using the pscores.
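In formula form, this amounts to the following (a sketch; q_l denotes the slot-level mean reward defined earlier):

    V(\pi_e) = \mathbb{E}_{p(x)} \Big[ \sum_{\boldsymbol{a} \in \mathcal{A}^{L}} \pi_e(\boldsymbol{a} \mid x) \sum_{l=1}^{L} q_l(x, \boldsymbol{a}) \Big]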

Slide 81

Experimental procedure of the real-world experiment
1. We first run two different policies π_A and π_B to construct datasets D_A and D_B. Here, we use D_A to estimate V(π_B) by OPE.
2. Then, we approximate the ground-truth policy value by on-policy estimation. (Equation: the on-policy estimate of V(π_B).)
3. To see the distribution of errors, we duplicate the dataset using bootstrap sampling.
4. Finally, we calculate the squared errors on the bootstrapped datasets D_A'.

Slide 82

Other related work

Slide 83

Inspiration from Reinforcement Learning (RL)
DR for RL leverages the recursive structure of the Markov Decision Process (MDP) [Jiang&Li, 16] [Thomas&Brunskill, 16].
(Equation: a recursive derivation combining baseline estimation and importance weighting on the residual, written in terms of the policy value after visiting x_l.)
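For reference, a sketch of that RL recursion per trajectory (following [Jiang&Li, 16]; V̂(x_l) = E_{π_e}[Q̂(x_l, a)], ρ_l is the per-step importance weight, and γ is the discount factor):

    \hat{V}_{L+1} := 0, \qquad
    \hat{V}_{l} := \hat{V}(x_l) + \rho_l \big( r_l + \gamma \hat{V}_{l+1} - \hat{Q}(x_l, a_l) \big), \qquad
    \rho_l := \frac{\pi_e(a_l \mid x_l)}{\pi_b(a_l \mid x_l)}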

Slide 84

Causal similarity between MDP and the cascade assumption
The cascade assumption can be interpreted as a special case of the MDP.

Slide 85

PseudoInverse (PI) [Swaminathan+, 17]
• Designed for the situation where slot-level rewards are unobservable.
• Implicitly relies on the independence assumption.

Slide 86

References

Slide 87

References (1/2)
[Precup+, 00] Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation." ICML, 2000. https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
[Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. "Offline Evaluation of Ranking Policies with Click Models." KDD, 2018. https://arxiv.org/abs/1804.10488
[McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. "Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions." KDD, 2020. https://arxiv.org/abs/2007.12986
[Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. "Doubly Robust Policy Evaluation and Optimization." Statistical Science, 2014. https://arxiv.org/abs/1503.02834
[Jiang&Li, 16] Nan Jiang and Lihong Li. "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning." ICML, 2016. https://arxiv.org/abs/1511.03722

Slide 88

References (2/2)
[Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill. "Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning." ICML, 2016. https://arxiv.org/abs/1604.00923
[Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS, 2021. https://arxiv.org/abs/2008.07146
[Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. "Evaluating the Robustness of Off-Policy Evaluation." RecSys, 2021. https://arxiv.org/abs/2108.13703
[Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. "Off-policy evaluation for slate recommendation." NeurIPS, 2017. https://arxiv.org/abs/1605.04812