[KDD'23] Off-Policy Evaluation of Ranking Policies under Diverse User Behaviors

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior Haruka
Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito Haruka Kiyohara https://sites.google.com/view/harukakiyohara August 2023 Adaptive OPE of Ranking Policies @ KDD'23 1

Real world ranking decision making Examples of recommending a ranking
of items August 2023 Adaptive OPE of Ranking Policies @ KDD'23 2 • Search Engine • Music Streaming • E-commerce • News • and more..! Can we evaluate the value of these rankings offline in advance? … …

How does a ranking system work? August 2023 Adaptive OPE
of Ranking Policies @ KDD'23 3 ranking with 𝑲 items a coming user context clicks reward(s) …

How does a ranking system work? August 2023 Adaptive OPE
of Ranking Policies @ KDD'23 4 ranking with 𝑲 items a coming user context clicks reward(s) a ranking policy … ▼ evaluate this one

Evaluating with the policy value We evaluate a ranking policy
with its expected ranking-wise reward. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 5

Evaluating with the policy value We evaluate a ranking policy
with its expected ranking-wise reward. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 6 position-wise policy value (depends on the whole ranking)

Off-policy evaluation; OPE August 2023 Adaptive OPE of Ranking Policies
@ KDD'23 7 ranking with 𝑲 items a coming user context clicks reward(s) a logging policy …

@ KDD'23 8 ranking with 𝑲 items a coming user context clicks reward(s) a logging policy …

@ KDD'23 9 ranking with 𝑲 items a coming user context clicks reward(s) a logging policy an evaluation policy …

@ KDD'23 10 a logging policy an evaluation policy OPE estimator

De-facto approach: Inverse Propensity Scoring [Strehl+,10] August 2023 Adaptive OPE
of Ranking Policies @ KDD'23 11 importance weight ・unbiased

of Ranking Policies @ KDD'23 12 importance weight ・unbiased evaluation logging ranking A ranking B more less less more

of Ranking Policies @ KDD'23 13 importance weight ・unbiased correcting the distribution shift evaluation logging ranking A ranking B more less less more

of Ranking Policies @ KDD'23 14 importance weight ・unbiased ・variance evaluation logging ranking A more less when the importance weight is large

of Ranking Policies @ KDD'23 15 importance weight ・unbiased ・variance When 𝜋0 is the uniform random policy,

of Ranking Policies @ KDD'23 16 importance weight ・unbiased ・variance!! When 𝜋0 is the uniform random policy,

User behavior assumptions for variance reduction We assume that users
are affected only by some subsets of actions. • Independent IPS [Li+,18] August 2023 Adaptive OPE of Ranking Policies @ KDD'23 17

User behavior assumptions for variance reduction We assume that users
are affected only by some subsets of actions. • Independent IPS [Li+,18] • Reward Interaction IPS [McInerney+,20] August 2023 Adaptive OPE of Ranking Policies @ KDD'23 18

Introducing an assumption, but is this enough? August 2023 Adaptive
OPE of Ranking Policies @ KDD'23 19 Bias Variance IIPS RIPS IPS independent cascade standard click model IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00] bias variance tradeoff depending on a single user behavior assumption

Introducing an assumption, but is this enough? August 2023 Adaptive
OPE of Ranking Policies @ KDD'23 20 Bias Variance IIPS RIPS IPS independent cascade standard click model IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00] Are they enough to capture real-world user behaviors..? bias variance tradeoff depending on the user behavior assumption

Adaptive IPS for diverse users August 2023 Adaptive OPE of
Ranking Policies @ KDD'23 21

User behavior can change with the user context August 2023
Adaptive OPE of Ranking Policies @ KDD'23 22 query: clothes (general) -> only browse top results query: T-shirts (specific) -> click after browsing more items clothes … T-shirts … User behavior can change with search query, users’ browsing history, etc..

Our idea: Adapting to user behavior August 2023 Adaptive OPE
of Ranking Policies @ KDD'23 23 Our idea A single, universal assumption

of Ranking Policies @ KDD'23 24 Our idea true ones mismatch!

of Ranking Policies @ KDD'23 25 Our idea true ones mismatch! bias

of Ranking Policies @ KDD'23 26 Our idea true ones mismatch! excessive variance

of Ranking Policies @ KDD'23 27 Our idea adaptive! -> reduces mismatch on assumptions

of Ranking Policies @ KDD'23 28 Our idea adaptive! -> reduces mismatch on assumptions Our idea … example of complex (1) user behaviors that are not captured by cascade, etc Further, we aim to model more diverse and complex user behaviors as well.

Our proposal: Adaptive IPS August 2023 Adaptive OPE of Ranking
Policies @ KDD'23 29 Statistical benefits • Unbiased under any given user behavior model. • Minimum variance among other IPS-based unbiased estimators. importance weight of only actions that matters

How much variance is reduced by AIPS? AIPS reduces the
variance of unrelated actions August 2023 Adaptive OPE of Ranking Policies @ KDD'23 30 : relevant actions : irrelevant actions

What the bias will be when 𝑐 is unavailable? In
practice, user behavior 𝑐 is often unobservable, thus consider ̂ 𝑐 instead. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 31 overlap matters

What the bias will be when 𝑐 is unavailable? In
practice, user behavior 𝑐 is often unobservable, thus consider ̂ 𝑐 instead. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 32 overlap matters small bias large bias source of bias

Controlling the bias-variance tradeoff August 2023 Adaptive OPE of Ranking
Policies @ KDD'23 33 action set

Policies @ KDD'23 34 weaker assumption (e.g., no assumption) = unbiased (or less biased), but have large variance action set

Policies @ KDD'23 35 stronger assumption (e.g., independent) = more biased, but have lower variance action set weaker assumption (e.g., no assumption) = unbiased (or less biased), but have large variance

Policies @ KDD'23 36 stronger assumption (e.g., independent) = more biased, but have lower variance action set weaker assumption (e.g., no assumption) = unbiased (or less biased), but have large variance Why not optimize # 𝒄 instead of using 𝒄 for a better bias-variance tradeoff?

Policies @ KDD'23 37 action set Why not optimize # 𝒄 instead of using 𝒄 for a better bias-variance tradeoff? user behavior bias variance MSE true one 0.0 0.5 0.50 optimized counterpart 0.1 0.3 0.31 (bias)2 + variance＝MSE

Policies @ KDD'23 38 action set Why not optimize # 𝒄 instead of using 𝒄 for a better bias-variance tradeoff? user behavior bias variance MSE true one 0.0 0.5 0.50 optimized counterpart 0.1 0.3 0.31 (bias)2 + variance＝MSE We aim to optimize the user behavior model adaptive to the context.

How to estimate optimize the user behavior model? Based on
the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 39 to minimize MSE

the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 40 MSE estimation: [Su+,20] [Udagawa+,23] to minimize MSE

the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 41 context space to minimize MSE

the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 42 context space to minimize MSE

the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 43 to minimize MSE context space

the bias-variance analysis, we optimize the user behavior to minimize MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 44 context space to minimize MSE context space

Experiments August 2023 Adaptive OPE of Ranking Policies @ KDD'23
45

Experiment on diverse user behaviors August 2023 Adaptive OPE of
Ranking Policies @ KDD'23 46 interactions from related actions (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors IPS (red) has
high variance across various user behaviors. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 47 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors IIPS (blue) RIPS
(purple) has high bias under complex user behaviors. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 48 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors AIPS (true) (gray)
reduces variance while being unbiased. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 49 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors AIPS (true) (gray),
however, increases variance as user behavior becomes complex. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 50 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors AIPS (ours) (green)
reduces both bias and variance, and is thus accurate. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 51 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS works well across various user behaviors AIPS (ours) (green)
works well even under diverse and complex user behaviors! August 2023 Adaptive OPE of Ranking Policies @ KDD'23 52 performance: a lower value is better (simple) (diverse) (complex) user behavior distributions

AIPS also works well across various configurations AIPS adaptively balances
the bias-variance tradeoff and minimizes MSE. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 53 slate sizes data sizes

Real-world experiment We conduct the experiment with the data from
an e-commerce platform. August 2023 Adaptive OPE of Ranking Policies @ KDD'23 54 the best in more than 75% of trials improves worst case performance

Summary • Effectively controlling the bias-variance tradeoff is the key
in OPE of ranking policies. • However, existing estimators apply a single user behavior, arising both excessive bias and variance in the presence of diverse user behaviors. • In response, we propose Adaptive IPS, which switches importance weight to minimize the estimation error depending on the user context. AIPS enables an accurate OPE estimation under diverse user behaviors! August 2023 Adaptive OPE of Ranking Policies @ KDD'23 55

Thank you for listening! contact: [email protected] August 2023 Adaptive OPE
of Ranking Policies @ KDD'23 56

References August 2023 Adaptive OPE of Ranking Policies @ KDD'23
57

References (1/2) [Saito+,21] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and
Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS dataset&benchmark, 2021. https://arxiv.org/abs/2008.07146 [Li+,18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. “Offline Evaluation of Ranking Policies with Click Models.” KDD, 2018. https://arxiv.org/abs/1804.10488 [McInerney+,20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. “Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986 [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120 [Athey&Imbens,16] Susan Athey and Guido Imbens. “Recursive Partitioning for Heterogeneous Causal Effects.” PNAS, 2016. https://arxiv.org/abs/1504.01132 August 2023 Adaptive OPE of Ranking Policies @ KDD'23 58

References (2/2) [Kiyohara+,22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke
Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. “Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model.” WSDM, 2022. https://arxiv.org/abs/2202.01562 [Su+,20] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/2002.07729 [Udagawa+,23] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, and Kei Tateno. “Policy-Adaptive Estimator Selection for Off-Policy Evaluation.” AAAI, 2023. https://arxiv.org/abs/2211.13904 August 2023 Adaptive OPE of Ranking Policies @ KDD'23 59

[KDD'23] Off-Policy Evaluation of Ranking Polic...

[KDD'23] Off-Policy Evaluation of Ranking Policies under Diverse User Behaviors

More Decks by Haruka Kiyohara

Other Decks in Research

Featured

Transcript