Ranking policies present a ranked list of items. Applications include:
• Search Engine
• Music Streaming
• E-commerce
• News
• and more!
Can we evaluate the value of these ranking decisions?
Logged bandit feedback collected by the behavior policy π_b consists of:
• a coming user (context x)
• a ranked list of items (action vector a)
• user feedback (reward vector r)
Off-policy evaluation (OPE) aims to estimate the value of an evaluation policy π_e,

V(π_e) := E_{p(x) π_e(a|x) p(r|x,a)} [ Σ_{l=1}^{L} r_l ],

i.e., the expected reward obtained by deploying π_e in the real system (e.g., the sum of clicks), using only the logged bandit feedback collected by π_b.
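As a concrete picture, here is a minimal Python sketch of how such logged ranking feedback might be stored (the field names are illustrative, not the talk's actual data format):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LoggedRankingFeedback:
    """Logged bandit feedback for ranking (illustrative field names)."""
    context: np.ndarray  # x_i: user features, shape (n, dim_context)
    action: np.ndarray   # a_i: ranked item ids, shape (n, L)
    reward: np.ndarray   # r_i: per-slot feedback, e.g., clicks, shape (n, L)
    pscore: np.ndarray   # pi_b(a_i | x_i): behavior policy propensities, shape (n,)
```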
An accurate OPE estimator should reduce both bias and variance. Variance increases with the size of the combinatorial action space and decreases with the data size.
Case 1: the policy chooses actions without duplication (a permutation). With |A| = 10 unique actions, slot 1 has 10 choices, slot 2 has 9, slot 3 has 8, slot 4 has 7, and so on, for P(|A|, L) rankings in total. When L = 10, that is 10! = 3,628,800 possible rankings!
Case 2: the policy chooses actions independently (duplicates allowed). With |A| = 10 unique actions, every slot has 10 choices, for |A|^L rankings in total (exponentiation). When L = 10, that is 10^10 = 10,000,000,000 possible rankings!
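The two counting rules are easy to verify; a minimal Python sketch:

```python
import math

def n_rankings_without_duplication(n_actions: int, L: int) -> int:
    """Permutations P(|A|, L): each action appears at most once."""
    return math.perm(n_actions, L)

def n_rankings_with_duplication(n_actions: int, L: int) -> int:
    """Exponentiation |A|^L: each slot is filled independently."""
    return n_actions ** L

print(n_rankings_without_duplication(10, 10))  # 3628800
print(n_rankings_with_duplication(10, 10))     # 10000000000
```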
An accurate OPE estimator should reduce both bias and variance. Bias is caused by the distribution shift between π_e and π_b. Variance increases with the size of the combinatorial action space and decreases with the data size.
Inverse Propensity Scoring (IPS) corrects the distribution shift between π_e and π_b by reweighting the observed rewards with the importance weight w.r.t. the combinatorial action (the whole ranking):

V̂_IPS(π_e; D) := (1/n) Σ_{i=1}^{n} (π_e(a_i | x_i) / π_b(a_i | x_i)) Σ_{l=1}^{L} r_{i,l}

where n is the data size (a code sketch follows below).
• pros: unbiased under all possible user behaviors (i.e., any click model)
• cons: suffers from a very high variance due to the combinatorial action space
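A minimal numpy sketch of the slate-level IPS estimator above, assuming the slate-level propensities of both policies are precomputed:

```python
import numpy as np

def ips_estimate(reward, pscore_e, pscore_b):
    """Slate-level IPS.
    reward:   shape (n, L), per-slot rewards r_{i,l}
    pscore_e: shape (n,), pi_e(a_i | x_i) for the whole ranking
    pscore_b: shape (n,), pi_b(a_i | x_i) for the whole ranking
    """
    w = pscore_e / pscore_b  # one importance weight per combinatorial action
    return float(np.mean(w * reward.sum(axis=1)))
```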
Why the high variance? π_e and π_b follow different probability distributions, so the importance weight can blow up: an excessively large weight makes IPS too sensitive to data points observed with small probability under the behavior policy.
The source of the high variance in IPS: IPS assumes that the reward at slot l may depend on all the actions in the ranking, so it must weight the entire ranking at once.
• pros: unbiased under all possible user behaviors (i.e., any click model)
• cons: suffers from a very high variance due to the combinatorial action space
Independent IPS (IIPS) [Li+, 18] assumes that the reward at slot l depends only on the action at slot l, independent of the other positions. Under this independence assumption, the product over slots disappears and only the position-wise (marginal) importance weight remains (sketched in code below):

V̂_IIPS(π_e; D) := (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,l} | x_i) / π_b(a_{i,l} | x_i)) r_{i,l}

• pros: substantially reduces the variance of IPS
• cons: may suffer from a large bias because the independence assumption on user behavior is strong
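A corresponding sketch of IIPS, assuming the position-wise marginal propensities are available:

```python
import numpy as np

def iips_estimate(reward, pscore_e_pos, pscore_b_pos):
    """Position-wise IIPS.
    reward:       shape (n, L)
    pscore_e_pos: shape (n, L), marginal pi_e(a_{i,l} | x_i)
    pscore_b_pos: shape (n, L), marginal pi_b(a_{i,l} | x_i)
    """
    w = pscore_e_pos / pscore_b_pos  # one weight per slot, no product over slots
    return float(np.mean((w * reward).sum(axis=1)))
```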
Reward interaction IPS (RIPS) [McInerney+, 20] assumes that users interact with items sequentially from top to bottom (the cascade assumption). The cascade importance weight for slot l therefore considers only slot l and the higher positions, a_{1:l} (sketched in code below):

V̂_RIPS(π_e; D) := (1/n) Σ_{i=1}^{n} Σ_{l=1}^{L} (π_e(a_{i,1:l} | x_i) / π_b(a_{i,1:l} | x_i)) r_{i,l}

• pros: reduces the bias of IIPS and the variance of IPS
• cons: still suffers from a high variance when L is large
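And a sketch of RIPS, assuming the prefix (top-l) propensities are available:

```python
import numpy as np

def rips_estimate(reward, pscore_e_prefix, pscore_b_prefix):
    """Cascade RIPS.
    reward:          shape (n, L)
    pscore_e_prefix: shape (n, L), pi_e(a_{i,1:l} | x_i) for l = 1..L
    pscore_b_prefix: shape (n, L), pi_b(a_{i,1:l} | x_i) for l = 1..L
    """
    w = pscore_e_prefix / pscore_b_prefix  # weight for slot l covers positions 1..l
    return float(np.mean((w * reward).sum(axis=1)))
```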
[Figure: MSE (lower is better) vs. data size n with L = 5, under the independence, cascade, and standard true click models. The best estimator differs by setting: IIPS under independence, IIPS or RIPS under cascade, and RIPS or IPS under standard.]
Our goal: an OPE estimator that works well across these various situations (click models and data sizes).
[Figure: the bias-variance tradeoff achievable by modifying the assumed click model; IIPS (independence) sits at the low-variance/high-bias end, IPS (standard) at the low-bias/high-variance end, and RIPS (cascade) in between.]
Can we further reduce the variance of RIPS while remaining unbiased under the cascade assumption?
Key idea: Doubly Robust (DR). In the single-action setting (L = 1), we often use DR to reduce the variance of IPS. Writing w(x, a) := π_e(a | x) / π_b(a | x) for the importance weight (hereinafter):

V̂_DR(π_e; D) := (1/n) Σ_{i=1}^{n} [ E_{π_e(a|x_i)}[ q̂(x_i, a) ] + w(x_i, a_i) (r_i − q̂(x_i, a_i)) ]

The baseline estimation E_{π_e}[q̂] works as a control variate, and importance weighting is applied only to the residual r − q̂, so DR remains unbiased while achieving a small variance.
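A sketch of the single-action DR estimator above, assuming q̂ and its π_e-expectation are precomputed:

```python
import numpy as np

def dr_estimate(reward, w, q_hat_factual, q_hat_baseline):
    """DR for L = 1.
    reward:         shape (n,), observed rewards r_i
    w:              shape (n,), importance weights w(x_i, a_i)
    q_hat_factual:  shape (n,), q_hat(x_i, a_i) at the logged action
    q_hat_baseline: shape (n,), E_{pi_e(a|x_i)}[q_hat(x_i, a)]
    """
    # baseline (control variate) + importance weighting only on the residual
    return float(np.mean(q_hat_baseline + w * (reward - q_hat_factual)))
```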
For example, when q(x, a) = 0.8 for all a, an observed reward r = 1 and an estimate q̂(x, a) = 0.9 give a residual r − q̂ = 0.1, far smaller in scale than r itself: DR scales down the importance-weighted value, since it is the importance weight multiplied by a large value that leads to variance.
We want to define DR for ranking OPE, but how can we do it under the complex cascade assumption? The baseline term E_{π_e(a|x)}[q̂(x, a)] now requires an expectation over all combinatorial actions, which is computationally intractable.
Key trick: decompose the value recursively via the policy value after position l. Under the cascade assumption, the policy value after position l − 1 can be written in terms of the policy value after position l, and in this recursive form the importance weight at each step depends only on a_l (given the prefix a_{1:l−1}).
Cascade-DR applies the DR construction recursively and is unbiased under the cascade assumption: at each position, the baseline estimation of the policy value after position l works as a control variate, and importance weighting is applied only to its residual.
Existing works control the bias-variance tradeoff only through the assumed click model. Our idea introduces a control variate, the estimated policy value after position l, to reduce the variance further while remaining unbiased under the cascade assumption.
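A sketch of Cascade-DR in its backward-recursive form, assuming the per-position conditional importance weights and the Q̂ values (at the logged prefix, and averaged over π_e) are precomputed; the array names are illustrative:

```python
import numpy as np

def cascade_dr_estimate(reward, w_cond, q_hat, q_hat_baseline):
    """Cascade-DR via backward recursion over positions.
    reward:         shape (n, L), per-slot rewards
    w_cond:         shape (n, L), pi_e(a_l | x, a_{1:l-1}) / pi_b(a_l | x, a_{1:l-1})
    q_hat:          shape (n, L), estimated policy value from position l onward
                    given the logged prefix a_{1:l}
    q_hat_baseline: shape (n, L), E_{a_l ~ pi_e}[Q_hat_l(x, a_{1:l-1}, a_l)]
    """
    n, L = reward.shape
    z = np.zeros(n)  # estimated remaining value below the last position
    for l in reversed(range(L)):
        # DR step at position l: baseline + weight only on the residual
        z = q_hat_baseline[:, l] + w_cond[:, l] * (reward[:, l] + z - q_hat[:, l])
    return float(np.mean(z))
```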
Statistical advantages of Cascade-DR:
• unbiased under the cascade assumption
• smaller variance than RIPS (under a reasonable assumption on Q̂)
→ a better bias-variance tradeoff than IPS, IIPS, and RIPS!
[Figure: MSE (lower is better) vs. data size n with L = 5, under the independence, cascade, and standard true click models.] The best existing estimator changes with the data size and the true click model, so estimator selection is hard.
[Figure: the same setting, now including Cascade-DR.] Cascade-DR clearly dominates all existing estimators across these configurations, so the difficult estimator selection is no longer needed.
Summary
• Ranking decisions appear in many real-world applications (e.g., search engines).
• Existing estimators suffer from either a large bias or a large variance, and the best estimator can change depending on the true click model and the data size.
• Cascade-DR achieves a better bias-variance tradeoff than all existing estimators by introducing a control variate under the cascade assumption.
Cascade-DR enables accurate OPE of real-world ranking decisions!
(Appendix) We recursively estimate the baseline Q̂ to obtain Cascade-DR. If the relative estimation error of Q̂ is within ±100% (i.e., 0 < Q̂ < 2Q), then Cascade-DR can reduce the variance of RIPS.
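A simplified sketch of the recursive baseline fitting, going backward from the last position; the featurization and the use of the plug-in prediction as the next-step value are simplifications for illustration, not the paper's exact procedure:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_q_hat_recursively(features, reward):
    """Fit Q_hat_L, ..., Q_hat_1 backward.
    features: list of L arrays; features[l] encodes the context and the
              logged prefix up to slot l, shape (n, d_l)
    reward:   shape (n, L)
    """
    n, L = reward.shape
    models, next_value = [None] * L, np.zeros(n)
    for l in reversed(range(L)):
        target = reward[:, l] + next_value  # r_l + estimated value below
        models[l] = Ridge().fit(features[l], target)
        # simplification: plug-in prediction at the logged prefix
        next_value = models[l].predict(features[l])
    return models
```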
Synthetic experiments: the smaller the squared error (SE), the more accurate the estimator V̂ is. We use OpenBanditPipeline [Saito+, 21a] and vary the following experimental configurations:
• reward structure (user behavior assumption)
• slate size L and data size n
• policy similarity λ
We synthesize the expected reward as follows:

q_l(x, a) = sigmoid( (base reward determined by the corresponding action a_l) + (linear interaction effect from the other slots) )

Which slots contribute to the interaction depends on the user behavior model:
• independence: no interaction
• cascade: only the higher (previous) slots
• standard: all the other slots
The interaction term combines:
• an additive effect from co-occurrence, encoded by a symmetric matrix over action pairs
• a decay effect from neighboring actions, encoded by a decay function of the distance between slots
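A sketch of this synthetic reward function, with hypothetical parameters: base_reward (per-action base effect), a symmetric co-occurrence matrix W, and a decay function of the slot distance:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_reward(actions, base_reward, W, decay, behavior="standard"):
    """q_l = sigmoid(base effect of a_l + linear interaction from other slots).
    actions:     length-L sequence of item ids
    base_reward: shape (n_actions,), per-action base effect
    W:           shape (n_actions, n_actions), symmetric co-occurrence matrix
    decay:       function of the slot distance |l - k|
    """
    L = len(actions)
    q = np.empty(L)
    for l in range(L):
        if behavior == "independence":
            others = []                                # no interaction
        elif behavior == "cascade":
            others = range(l)                          # higher slots only
        else:                                          # "standard"
            others = [k for k in range(L) if k != l]   # all the other slots
        interaction = sum(W[actions[l], actions[k]] * decay(abs(l - k))
                          for k in others)
        q[l] = sigmoid(base_reward[actions[l]] + interaction)
    return q
```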
Evaluation procedure (sketched in code below):
1. Define the configuration space.
2. For each random seed (10,000 times):
 2-1. sample a configuration based on the seed;
 2-2. calculate the SE of each estimator on the sampled evaluation policy and dataset.
3. Aggregate the results and calculate the mean of the SEs (MSE).
→ this evaluates V̂ across various situations.
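A sketch of this loop, where sample_configuration and the config fields are hypothetical stand-ins for the OpenBanditPipeline objects actually used:

```python
import numpy as np

def run_benchmark(estimators, sample_configuration, n_seeds=10_000):
    """estimators: dict name -> function(config) returning an estimate V_hat."""
    se = {name: [] for name in estimators}
    for seed in range(n_seeds):
        config = sample_configuration(seed)          # 2-1. sample configuration
        for name, estimate in estimators.items():
            v_hat = estimate(config)                 # 2-2. OPE estimate
            se[name].append((v_hat - config.ground_truth_value) ** 2)
    return {name: float(np.mean(v)) for name, v in se.items()}  # 3. MSE
```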
(Appendix) To compute the ground-truth policy value, we take the expectation of the following weighted slate-level expected reward over the contexts (sketched in code below):
1. Enumerate all combinations of actions (|A|^L of them).
2. For each action vector, calculate the evaluation policy's pscore π_e(a | x) and its slate-level expected reward.
3. Calculate the weighted sum of the slate-level expected rewards using the pscores.
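A brute-force sketch for a single context x; pi_e_pscore and slate_expected_reward are hypothetical helpers (step 1 makes this exponential in L, hence only feasible for small |A| and L):

```python
import itertools

def ground_truth_value_for_context(x, n_actions, L, pi_e_pscore,
                                   slate_expected_reward):
    value = 0.0
    for a in itertools.product(range(n_actions), repeat=L):  # 1. all |A|^L slates
        pscore = pi_e_pscore(x, a)                           # 2. pi_e(a | x)
        value += pscore * sum(slate_expected_reward(x, a))   # 3. weighted sum
    return value  # averaging over sampled contexts gives V(pi_e)
```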
(Appendix) Real-world experiment protocol (steps 3-4 sketched in code below):
1. Run two different policies π_A and π_B to construct datasets D_A and D_B. Here, we use D_A to estimate V(π_B) by OPE.
2. Approximate the ground-truth policy value by on-policy estimation: V(π_B) ≈ (1/|D_B|) Σ_{i∈D_B} Σ_{l=1}^{L} r_{i,l}.
3. To see a distribution of errors, replicate the dataset using bootstrap sampling.
4. Finally, calculate the squared errors on the bootstrapped datasets D_A′.
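A sketch of steps 3-4; subsample and n_rounds are hypothetical accessors on the dataset object:

```python
import numpy as np

def bootstrap_squared_errors(data_A, v_on_policy_B, estimate,
                             n_boot=100, seed=12345):
    """Squared errors of an OPE estimate on bootstrapped copies of D_A."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, data_A.n_rounds, size=data_A.n_rounds)
        v_hat = estimate(data_A.subsample(idx))  # OPE estimate of V(pi_B) on D_A'
        errors.append((v_hat - v_on_policy_B) ** 2)
    return np.asarray(errors)
```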