
[KDD'23] Off-Policy Evaluation of Ranking Policies under Diverse User Behaviors

Haruka Kiyohara

July 16, 2023

Transcript

  1. Off-Policy Evaluation of Ranking Policies under Diverse User Behaviors. Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito. https://sites.google.com/view/harukakiyohara
  2. Real-world ranking decision making. Examples of recommending a ranking of items: Search Engine, Music Streaming, E-commerce, News, and more. Can we evaluate the value of these rankings offline, in advance?
  3. How does a ranking system work? A user arrives with a context, the system presents a ranking of K items, and the user returns clicks (rewards).
  4. How does a ranking system work? A ranking policy maps the coming user's context to a ranking of K items, and the user returns clicks (rewards). This policy is what we want to evaluate.
  5. Evaluating with the policy value. We evaluate a ranking policy with its expected ranking-wise reward.
  6. Evaluating with the policy value. We evaluate a ranking policy with its expected ranking-wise reward: the policy value aggregates position-wise rewards, each of which can depend on the whole ranking.
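     The value formula on the slide does not survive transcription. A standard formalization from the OPE-of-ranking literature (the position weights α_k and the notation below are that literature's convention, not recovered from the slide):

         V(\pi) := \mathbb{E}_{p(x)\,\pi(\boldsymbol{a}|x)\,p(\boldsymbol{r}|x,\boldsymbol{a})}\Big[\sum_{k=1}^{K} \alpha_k\, r(k)\Big],

     where a is a ranking of K items, r(k) is the reward observed at position k, and α_k ≥ 0 weights position k.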
  7. Off-policy evaluation (OPE). The logged data record, for each coming user (context), the ranking of K items chosen by a logging policy and the resulting clicks (rewards).
  9. Off-policy evaluation (OPE). The logged data come from the logging policy, but the policy we want to evaluate is a different, evaluation policy.
  10. Off-policy evaluation (OPE). An OPE estimator estimates the value of the evaluation policy using only the data logged by the logging policy.
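     In symbols: given logged data D = {(x_i, a_i, r_i)}_{i=1}^n collected by the logging policy π_0, an OPE estimator \hat{V}(π; D) aims to approximate V(π) for an evaluation policy π that never interacted with the users (standard notation in this literature, not recovered from the slide).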
  11. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. IPS reweights each logged reward by the importance weight, which makes it unbiased.
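     Reconstructed in the notation above (the standard ranking-wise form of IPS, consistent with the slide's description rather than recovered from it):

         \hat{V}_{\mathrm{IPS}}(\pi; \mathcal{D}) := \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(\boldsymbol{a}_i \mid x_i)}{\pi_0(\boldsymbol{a}_i \mid x_i)} \sum_{k=1}^{K} \alpha_k\, r_i(k),

     where the probability ratio is the (ranking-wise) importance weight.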
  12. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. The importance weight compares how likely each logged ranking is under the evaluation versus the logging policy (e.g., ranking A: more likely under evaluation, less likely under logging; ranking B: the opposite).
  13. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. By reweighting with the importance weight, IPS corrects the distribution shift between the logging and evaluation policies, which is why it is unbiased.
  14. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. IPS is unbiased, but its variance grows when the importance weight is large, i.e., for rankings much more likely under the evaluation policy than under the logging policy.
  15. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. How large can the weight get? Consider the case where π0 is the uniform random policy.
  16. De-facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]. When π0 is the uniform random policy, the importance weight scales with the number of possible rankings, and the variance explodes.
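     To make the blow-up concrete, assume π_0 draws uniformly from all ordered lists of K items out of m unique candidates (m is not specified in the transcript). Then

         \pi_0(\boldsymbol{a} \mid x) = \frac{(m-K)!}{m!} \quad\Longrightarrow\quad \frac{\pi(\boldsymbol{a} \mid x)}{\pi_0(\boldsymbol{a} \mid x)} = \frac{m!}{(m-K)!}\,\pi(\boldsymbol{a} \mid x).

     Even m = 10 and K = 5 gives a factor of 10!/5! = 30240, so a near-deterministic evaluation policy yields importance weights, and hence variance, on that scale.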
  17. User behavior assumptions for variance reduction. We assume that users are affected only by some subset of the actions in the ranking. • Independent IPS (IIPS) [Li+,18]
  18. User behavior assumptions for variance reduction. We assume that users are affected only by some subset of the actions in the ranking; the resulting importance weights are shown below. • Independent IPS (IIPS) [Li+,18] • Reward Interaction IPS (RIPS) [McInerney+,20]
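     Their importance weights, as defined in the cited papers and rewritten in the notation above: IIPS assumes the reward at position k depends only on the action placed at k, so it weights by the position-wise marginal; RIPS assumes cascade-style (top-down) interactions, so it weights by the top-k prefix:

         \hat{V}_{\mathrm{IIPS}} := \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \frac{\pi(a_i(k) \mid x_i, k)}{\pi_0(a_i(k) \mid x_i, k)}\, \alpha_k\, r_i(k), \qquad \hat{V}_{\mathrm{RIPS}} := \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \frac{\pi(a_i(1{:}k) \mid x_i)}{\pi_0(a_i(1{:}k) \mid x_i)}\, \alpha_k\, r_i(k).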
  19. Introducing an assumption, but is this enough? Each estimator pairs with a click model and sits at a different point on the bias-variance tradeoff: IIPS (independent click model) [Li+,18], RIPS (cascade) [McInerney+,20], and IPS (standard, i.e., no restriction) [Precup+,00]. The tradeoff is fixed by a single user behavior assumption.
  20. Introducing an assumption, but is this enough? The bias-variance tradeoff depends entirely on the user behavior assumption each estimator adopts. Are these assumptions enough to capture real-world user behaviors?
  21. User behavior can change with the user context. Example: for the query "clothes" (general), users only browse the top results; for the query "T-shirts" (specific), users click after browsing more items. User behavior can change with the search query, the user's browsing history, etc.
  22. Our idea: Adapting to user behavior. Existing estimators impose a single, universal assumption on every user.
  23. Our idea: Adapting to user behavior. A single, universal assumption can mismatch the true user behaviors, which differ across users.
  24. Our idea: Adapting to user behavior. When the assumed behavior misses true interactions, the mismatch produces bias.
  25. Our idea: Adapting to user behavior. When the assumed behavior posits more interactions than the true one, the mismatch produces excessive variance.
  26. Our idea: Adapting to user behavior. Our idea is to adapt the assumption to each user, which reduces the mismatch of assumptions.
  27. Our idea: Adapting to user behavior. Adapting the assumption to each user reduces the mismatch. Further, we aim to model more diverse and complex user behaviors as well, e.g., ones that are not captured by the cascade model.
  28. Our proposal: Adaptive IPS (AIPS). AIPS applies the importance weight of only the actions that matter to each user (see the sketch below). Statistical benefits: • Unbiased under any given user behavior model. • Minimum variance among IPS-based unbiased estimators.
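     As a toy illustration of "the importance weight of only the actions that matter", here is a minimal Python sketch. It assumes policies are represented explicitly as distributions over all rankings (feasible only for tiny action spaces) and that the behavior model is given as a map from each position to the positions whose actions affect its reward; all names are hypothetical, not the authors' code:

```python
# Minimal AIPS sketch. pi_e / pi_0: dicts mapping a ranking (tuple of item
# ids, one per position) to its probability. `relevant` encodes the user
# behavior model: relevant[k] is the tuple of positions whose actions
# affect the reward at position k.

def marginal_prob(policy, positions, ranking):
    """Probability that a ranking drawn from `policy` agrees with
    `ranking` on `positions`, marginalizing over all other positions."""
    target = tuple(ranking[k] for k in positions)
    return sum(p for r, p in policy.items()
               if tuple(r[k] for k in positions) == target)

def aips_estimate(logs, pi_e, pi_0, relevant, alpha):
    """AIPS point estimate from logs = [(ranking, rewards), ...] under pi_0."""
    n = len(logs)
    value = 0.0
    for ranking, rewards in logs:
        for k, r_k in enumerate(rewards):
            # importance weight over only the actions that matter at k
            w = (marginal_prob(pi_e, relevant[k], ranking)
                 / marginal_prob(pi_0, relevant[k], ranking))
            value += alpha[k] * w * r_k / n
    return value
```

     Setting relevant[k] = (k,) recovers IIPS, relevant[k] = (0, ..., k) recovers RIPS, and relevant[k] = all positions recovers vanilla IPS, which is the sense in which AIPS spans the existing estimators.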
  29. How much variance is reduced by AIPS? AIPS removes the variance coming from irrelevant actions: the importance weight is computed over the relevant actions only.
  30. What will the bias be when c is unavailable? In practice, the user behavior c is often unobservable, so we consider an estimate ĉ instead; the overlap between ĉ and the true c matters.
  31. What will the bias be when c is unavailable? The overlap between ĉ and the true c determines the bias: large overlap means small bias, small overlap means large bias. The mismatched part is the source of bias.
  32. Controlling the bias-variance tradeoff. A weaker assumption on the action set (e.g., no assumption: every action may matter) is unbiased (or less biased), but has large variance.
  33. Controlling the bias-variance tradeoff. A stronger assumption on the action set (e.g., independent: only the action at the same position matters) is more biased but has lower variance; a weaker assumption (e.g., no assumption) is unbiased (or less biased) but has large variance.
  34. Controlling the bias-variance tradeoff. Given this tradeoff: why not optimize ĉ, instead of using the true c, for a better bias-variance tradeoff?
  35. Controlling the bias-variance tradeoff. Why not optimize ĉ, instead of using the true c, for a better bias-variance tradeoff? Since MSE = (bias)^2 + variance, a slightly biased but lower-variance behavior model can achieve a smaller MSE:

      user behavior          bias   variance   MSE
      true one               0.0    0.5        0.50
      optimized counterpart  0.1    0.3        0.31
  36. Controlling the bias-variance tradeoff. We aim to optimize the user behavior model ĉ adaptively to the user context.
  37. How do we optimize the user behavior model? Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
  38. How do we optimize the user behavior model? Based on the bias-variance analysis, we optimize the user behavior model to minimize the estimated MSE, relying on existing MSE estimation techniques [Su+,20] [Udagawa+,23]; see the sketch below.
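     A hedged sketch of the selection step, assuming per-datapoint AIPS terms have been computed for each candidate behavior model. The bias proxy compares each candidate against the weakest (unbiased, high-variance) candidate, in the spirit of [Su+,20]; this is an illustration of the idea, not the paper's exact procedure:

```python
import numpy as np

def estimated_mse(samples, reference_value):
    """samples: per-datapoint AIPS terms under one candidate behavior model."""
    n = len(samples)
    variance = np.var(samples, ddof=1) / n       # variance of the mean
    bias = np.mean(samples) - reference_value    # proxy for the true bias
    return bias ** 2 + variance

def select_behavior_model(per_model_samples):
    """per_model_samples: {model_name: array of per-datapoint AIPS terms}.
    'standard' (all interactions; unbiased but noisy) is the reference."""
    reference = np.mean(per_model_samples["standard"])
    mses = {name: estimated_mse(s, reference)
            for name, s in per_model_samples.items()}
    return min(mses, key=mses.get)
```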
  39. How do we optimize the user behavior model? We do this adaptively over the context space: the context space is split into regions, and the MSE-minimizing behavior model is selected within each region.
  43. Experiment on diverse user behaviors. We synthesize user behavior distributions ranging from simple to diverse to complex, varying how rewards interact with related actions.
  44. AIPS works well across various user behaviors. IPS (red) has high variance across all user behavior distributions (performance metric: lower is better; distributions range from simple to diverse to complex).
  45. AIPS works well across various user behaviors. IIPS (blue) and RIPS (purple) have high bias under complex user behaviors.
  46. AIPS works well across various user behaviors. AIPS (true) (gray) reduces variance while remaining unbiased.
  47. AIPS works well across various user behaviors. AIPS (true) (gray), however, suffers increasing variance as user behavior becomes more complex.
  48. AIPS works well across various user behaviors. AIPS (ours) (green) reduces both bias and variance, and is thus accurate.
  49. AIPS works well across various user behaviors. AIPS (ours) (green) works well even under diverse and complex user behaviors!
  50. AIPS also works well across various configurations. AIPS adaptively balances the bias-variance tradeoff and minimizes the MSE across slate sizes and data sizes.
  51. Real-world experiment. On data from an e-commerce platform, AIPS is the best estimator in more than 75% of trials and improves worst-case performance.
  52. Summary. • Effectively controlling the bias-variance tradeoff is the key in OPE of ranking policies. • However, existing estimators apply a single user behavior assumption, incurring both excessive bias and excessive variance in the presence of diverse user behaviors. • In response, we propose Adaptive IPS (AIPS), which switches the importance weight depending on the user context so as to minimize the estimation error. AIPS enables accurate OPE under diverse user behaviors!
  53. References (1/2)
     [Saito+,21] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation." NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2008.07146
     [Li+,18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng Wen. "Offline Evaluation of Ranking Policies with Click Models." KDD, 2018. https://arxiv.org/abs/1804.10488
     [McInerney+,20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben Carterette. "Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions." KDD, 2020. https://arxiv.org/abs/2007.12986
     [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
     [Athey&Imbens,16] Susan Athey and Guido Imbens. "Recursive Partitioning for Heterogeneous Causal Effects." PNAS, 2016. https://arxiv.org/abs/1504.01132
  54. References (2/2)
     [Kiyohara+,22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. "Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model." WSDM, 2022. https://arxiv.org/abs/2202.01562
     [Su+,20] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. "Adaptive Estimator Selection for Off-Policy Evaluation." ICML, 2020. https://arxiv.org/abs/2002.07729
     [Udagawa+,23] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, and Kei Tateno. "Policy-Adaptive Estimator Selection for Off-Policy Evaluation." AAAI, 2023. https://arxiv.org/abs/2211.13904