Slide 1

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior
Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito
Presenter: Haruka Kiyohara (https://sites.google.com/view/harukakiyohara)
August 2023, KDD'23

Slide 2

Real-world ranking decision making
Examples of recommending a ranking of items:
• Search Engine
• Music Streaming
• E-commerce
• News
• and more!
Can we evaluate the value of these rankings offline, in advance?

Slide 3

How does a ranking system work?
[Figure: a coming user (context) → a ranking with K items → clicks (rewards)]

Slide 4

How does a ranking system work?
[Figure: a ranking policy chooses the ranking with K items for a coming user (context), which yields clicks (rewards); this policy is what we evaluate]

Slide 5

Evaluating with the policy value
We evaluate a ranking policy with its expected ranking-wise reward.

Slide 6

Evaluating with the policy value
We evaluate a ranking policy with its expected ranking-wise reward.
[Figure: the position-wise policy value, which depends on the whole ranking]
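As a reading aid, a hedged LaTeX sketch of the policy value the slide depicts, assuming the paper's notation (x: user context, a = (a(1), ..., a(K)): ranking, α_k: position weight, r(k): reward observed at position k); the exact weighting is not shown in the extracted text:

```latex
V(\pi) := \mathbb{E}_{p(x)\,\pi(a \mid x)\,p(r \mid x,a)}
\left[ \sum_{k=1}^{K} \alpha_k \, r(k) \right]
```

Each position-wise reward r(k) may depend on the whole ranking, which is what makes this quantity hard to estimate.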

Slide 7

Off-policy evaluation (OPE)
[Figure: a logging policy chooses rankings with K items for coming users (contexts) and observes clicks (rewards)]

Slide 8

Off-policy evaluation (OPE)
[same figure as the previous slide]

Slide 9

Off-policy evaluation (OPE)
[Figure: the logged data from a logging policy are used to evaluate a new evaluation policy]

Slide 10

Off-policy evaluation (OPE)
[Figure: an OPE estimator estimates the value of an evaluation policy from data logged by a logging policy]

Slide 11

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased
[Figure: the IPS estimator with its importance weight highlighted; see the sketch below]
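For reference, a sketch of the ranking-level IPS estimator in the above notation; the slide itself only labels the "importance weight" term, so the exact form is assumed from the standard OPE literature:

```latex
\hat{V}_{\mathrm{IPS}}(\pi; \mathcal{D})
= \frac{1}{n} \sum_{i=1}^{n}
\underbrace{\frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}}_{\text{importance weight}}
\sum_{k=1}^{K} \alpha_k \, r_i(k)
```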

Slide 12

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased
[Figure: evaluation vs. logging propensities; ranking A is more likely under evaluation and less likely under logging, ranking B the opposite]

Slide 13

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased: the importance weight corrects the distribution shift between logging and evaluation
[same figure as the previous slide]

Slide 14

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased
• variance: becomes large when the importance weight is large
[Figure: ranking A is more likely under evaluation and less likely under logging]

Slide 15

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased
• variance: when π0 is the uniform random policy, ...
[Equation on slide: the size of the importance weight under uniform logging]

Slide 16

De-facto approach: Inverse Propensity Scoring [Strehl+,10]
• unbiased
• variance!! when π0 is the uniform random policy, ...
[Equation on slide: the importance weight under uniform logging; see the sketch below]
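A hedged reconstruction of the equation the slide renders as an image: if π0 draws a ranking of K distinct items uniformly at random from a candidate set A (assuming sampling without replacement), then

```latex
\pi_0(a \mid x) = \frac{(|A|-K)!}{|A|!}
\quad\Longrightarrow\quad
\frac{\pi(a \mid x)}{\pi_0(a \mid x)} = \pi(a \mid x) \cdot \frac{|A|!}{(|A|-K)!},
```

so the importance weight can reach |A|!/(|A|-K)! (already 720 for |A| = 10 and K = 3), which is why the variance explodes.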

Slide 17

User behavior assumptions for variance reduction
We assume that users are affected only by some subset of the actions.
• Independent IPS [Li+,18]

Slide 18

User behavior assumptions for variance reduction
We assume that users are affected only by some subset of the actions.
• Independent IPS [Li+,18]
• Reward Interaction IPS [McInerney+,20]
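A sketch of the two assumption-specific importance weights, following [Li+,18] and [McInerney+,20]; here a_i(k) is the action at position k, a_i(1:k) is the top-k prefix, and the marginal propensities are assumed to be computable:

```latex
% IIPS (independent: the reward at position k depends only on a(k))
\hat{V}_{\mathrm{IIPS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K}
\frac{\pi(a_i(k) \mid x_i)}{\pi_0(a_i(k) \mid x_i)} \, \alpha_k \, r_i(k)

% RIPS (cascade: the reward at position k depends on the prefix a(1:k))
\hat{V}_{\mathrm{RIPS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K}
\frac{\pi(a_i(1\!:\!k) \mid x_i)}{\pi_0(a_i(1\!:\!k) \mid x_i)} \, \alpha_k \, r_i(k)
```

The weaker the assumed independence, the larger the weight and hence the variance; this is the tradeoff the next slides discuss.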

Slide 19

Introducing an assumption, but is this enough?
[Figure: bias-variance positions of IIPS (independent), RIPS (cascade), and IPS (standard click model)]
There is a bias-variance tradeoff depending on a single user behavior assumption.
IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00]

Slide 20

Introducing an assumption, but is this enough?
[same figure as the previous slide]
There is a bias-variance tradeoff depending on the user behavior assumption.
Are they enough to capture real-world user behaviors?
IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00]

Slide 21

Adaptive IPS for diverse users

Slide 22

User behavior can change with the user context
query: "clothes" (general) → users only browse the top results
query: "T-shirts" (specific) → users click after browsing more items
User behavior can change with the search query, the user's browsing history, etc.

Slide 23

Our idea: Adapting to user behavior
[Figure: existing estimators apply a single, universal assumption to every user]

Slide 24

Our idea: Adapting to user behavior
[Figure: a single assumed behavior vs. the true ones: mismatch!]

Slide 25

Our idea: Adapting to user behavior
[Figure: mismatch with the true behaviors causes bias]

Slide 26

Our idea: Adapting to user behavior
[Figure: mismatch with the true behaviors causes excessive variance]

Slide 27

Our idea: Adapting to user behavior
[Figure: an adaptive assumption reduces the mismatch on assumptions]

Slide 28

Our idea: Adapting to user behavior
[Figure: an adaptive assumption reduces the mismatch on assumptions; shown is an example of complex user behaviors that are not captured by cascade, etc.]
Further, we aim to model more diverse and complex user behaviors as well.

Slide 29

Our proposal: Adaptive IPS
Statistical benefits:
• Unbiased under any given user behavior model.
• Minimum variance among IPS-based unbiased estimators.
AIPS applies the importance weight over only the actions that matter (see the sketch below).
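A minimal Python sketch of the switching idea, not the paper's implementation: it assumes policies that factorize over positions, with per-position marginal propensities already precomputed, and restricts the candidate behaviors to {independent, cascade} for illustration; all names and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 3  # logged rounds and ranking length

# Hypothetical precomputed marginal propensities pi(a_i(k) | x_i)
# under the evaluation (pe) and logging (p0) policies, plus rewards.
pe_pos = rng.uniform(0.2, 0.8, size=(n, K))
p0_pos = rng.uniform(0.2, 0.8, size=(n, K))
reward = rng.binomial(1, 0.3, size=(n, K)).astype(float)

# Prefix propensities pi(a_i(1:k) | x_i); the product form holds only
# because this sketch assumes the policies factorize over positions.
pe_prefix = np.cumprod(pe_pos, axis=1)
p0_prefix = np.cumprod(p0_pos, axis=1)

def aips_estimate(behavior_is_cascade: np.ndarray) -> float:
    """AIPS idea: switch the importance weight per context.

    behavior_is_cascade[i] is True when the (estimated) behavior model
    for round i is cascade; otherwise the independent weight is used.
    """
    w_independent = pe_pos / p0_pos       # IIPS-style weight at position k
    w_cascade = pe_prefix / p0_prefix     # RIPS-style weight at position k
    w = np.where(behavior_is_cascade[:, None], w_cascade, w_independent)
    return float(np.mean(np.sum(w * reward, axis=1)))

# Example: a hypothetical behavior model labels half the users as cascade.
print(aips_estimate(behavior_is_cascade=(rng.random(n) < 0.5)))
```

Because each weight only covers the actions the user's behavior model deems relevant, the irrelevant positions contribute no importance-weighting variance.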

Slide 30

How much variance is reduced by AIPS?
AIPS removes the variance coming from actions unrelated to the reward.
[Figure legend: relevant actions vs. irrelevant actions]

Slide 31

What will the bias be when c is unavailable?
In practice, the user behavior c is often unobservable, so we consider an estimate ĉ instead.
[Figure: the overlap between ĉ and the true c matters]

Slide 32

What will the bias be when c is unavailable?
In practice, the user behavior c is often unobservable, so we consider an estimate ĉ instead.
[Figure: the overlap between ĉ and the true c matters; a large overlap means small bias, a small overlap means large bias, and the mismatch is the source of bias]

Slide 33

Controlling the bias-variance tradeoff
[Figure: the action set assumed by the user behavior model]

Slide 34

Controlling the bias-variance tradeoff
A weaker assumption (e.g., no assumption) is unbiased (or less biased), but has large variance.
[Figure: the action set assumed by the user behavior model]

Slide 35

Controlling the bias-variance tradeoff
A weaker assumption (e.g., no assumption) is unbiased (or less biased), but has large variance.
A stronger assumption (e.g., independent) is more biased, but has lower variance.
[Figure: the action set assumed by the user behavior model]

Slide 36

Controlling the bias-variance tradeoff
A weaker assumption (e.g., no assumption) is unbiased (or less biased), but has large variance.
A stronger assumption (e.g., independent) is more biased, but has lower variance.
Why not optimize ĉ instead of using c for a better bias-variance tradeoff?
[Figure: the action set assumed by the user behavior model]

Slide 37

Controlling the bias-variance tradeoff
Why not optimize ĉ instead of using c for a better bias-variance tradeoff?

user behavior          | bias | variance | MSE (= bias² + variance)
true one               | 0.0  | 0.5      | 0.50
optimized counterpart  | 0.1  | 0.3      | 0.31

Slide 38

Controlling the bias-variance tradeoff
Why not optimize ĉ instead of using c for a better bias-variance tradeoff?

user behavior          | bias | variance | MSE (= bias² + variance)
true one               | 0.0  | 0.5      | 0.50
optimized counterpart  | 0.1  | 0.3      | 0.31

We aim to optimize the user behavior model adaptively to the context.

Slide 39

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.

Slide 40

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
MSE estimation: [Su+,20] [Udagawa+,23]

Slide 41

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
[Figure: partitioning the context space]

Slide 42

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
[same figure as the previous slide: partitioning the context space]

Slide 43

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
[same figure as the previous slide: partitioning the context space]

Slide 44

How to optimize the user behavior model?
Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE over the partitioned context space (see the sketch below).
[same figure as the previous slide: partitioning the context space]
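A minimal sketch of this selection step under heavy assumptions: the context space is already partitioned (here by a hypothetical `assign_region`), the candidate behavior models are enumerated, and `estimate_mse` stands in for the data-driven MSE estimators of [Su+,20] / [Udagawa+,23]; none of these names come from the paper.

```python
from typing import Callable, Dict, List
import numpy as np

def select_behavior_per_region(
    contexts: np.ndarray,
    assign_region: Callable[[np.ndarray], np.ndarray],  # contexts -> region ids
    candidates: List[str],                              # candidate behavior models
    estimate_mse: Callable[[str, np.ndarray], float],   # (behavior, mask) -> MSE
) -> Dict[int, str]:
    """For each region of the context partition, pick the candidate behavior
    model whose estimated MSE (bias^2 + variance) is smallest."""
    region_ids = assign_region(contexts)
    chosen: Dict[int, str] = {}
    for region in np.unique(region_ids):
        mask = region_ids == region
        chosen[int(region)] = min(candidates, key=lambda c: estimate_mse(c, mask))
    return chosen

# Usage sketch: two regions split on one feature, with constant MSE stubs.
ctx = np.random.default_rng(0).normal(size=(100, 4))
regions = lambda X: (X[:, 0] > 0).astype(int)
mse_stub = lambda c, m: {"independent": 0.4, "cascade": 0.3, "standard": 0.5}[c]
print(select_behavior_per_region(ctx, regions, ["independent", "cascade", "standard"], mse_stub))
```

The per-region selection is what makes the resulting AIPS weight adaptive to the user context rather than global.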

Slide 45

Experiments

Slide 46

Experiment on diverse user behaviors
[Figure: synthetic user behavior distributions ranging from (simple) to (diverse) to (complex), varying the interactions from related actions]

Slide 47

AIPS works well across various user behaviors
IPS (red) has high variance across various user behaviors.
[Figure: estimation performance (lower is better) across (simple) / (diverse) / (complex) user behavior distributions]

Slide 48

AIPS works well across various user behaviors
IIPS (blue) and RIPS (purple) have high bias under complex user behaviors.
[same figure as the previous slide]

Slide 49

AIPS works well across various user behaviors
AIPS (true) (gray) reduces variance while remaining unbiased.
[same figure as the previous slide]

Slide 50

AIPS works well across various user behaviors
AIPS (true) (gray), however, suffers increasing variance as the user behavior becomes more complex.
[same figure as the previous slide]

Slide 51

AIPS works well across various user behaviors
AIPS (ours) (green) reduces both bias and variance, and is thus accurate.
[same figure as the previous slide]

Slide 52

AIPS works well across various user behaviors
AIPS (ours) (green) works well even under diverse and complex user behaviors!
[same figure as the previous slide]

Slide 53

AIPS also works well across various configurations
AIPS adaptively balances the bias-variance tradeoff and minimizes the MSE.
[Figures: results across slate sizes and data sizes]

Slide 54

Real-world experiment
We conduct an experiment with data from an e-commerce platform.
[Figure: AIPS is the best in more than 75% of trials and improves the worst-case performance]

Slide 55

Summary
• Effectively controlling the bias-variance tradeoff is the key to OPE of ranking policies.
• However, existing estimators apply a single user behavior assumption, incurring both excessive bias and excessive variance in the presence of diverse user behaviors.
• In response, we propose Adaptive IPS (AIPS), which switches importance weights depending on the user context to minimize the estimation error.
AIPS enables accurate OPE under diverse user behaviors!

Slide 56

Thank you for listening!
contact: [email protected]

Slide 57

References

Slide 58

References (1/2)
[Saito+,21] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
[Li+,18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. "Offline Evaluation of Ranking Policies with Click Models." KDD, 2018. https://arxiv.org/abs/1804.10488
[McInerney+,20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. "Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions." KDD, 2020. https://arxiv.org/abs/2007.12986
[Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
[Athey&Imbens,16] Susan Athey and Guido Imbens. "Recursive Partitioning for Heterogeneous Causal Effects." PNAS, 2016. https://arxiv.org/abs/1504.01132

Slide 59

References (2/2)
[Kiyohara+,22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. "Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model." WSDM, 2022. https://arxiv.org/abs/2202.01562
[Su+,20] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. "Adaptive Estimator Selection for Off-Policy Evaluation." ICML, 2020. https://arxiv.org/abs/2002.07729
[Udagawa+,23] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, and Kei Tateno. "Policy-Adaptive Estimator Selection for Off-Policy Evaluation." AAAI, 2023. https://arxiv.org/abs/2211.13904
[Precup+,00] Doina Precup, Richard S. Sutton, and Satinder Singh. "Eligibility Traces for Off-Policy Policy Evaluation." ICML, 2000.