Off-Policy Evaluation of Ranking Policies under Diverse User Behavior

Haruka Kiyohara

June 28, 2023

Transcript

  1. Off-Policy Evaluation of Ranking Policies
    under Diverse User Behavior
    Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita,
    Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito
    Haruka Kiyohara
    https://sites.google.com/view/harukakiyohara
    August 2023 Adaptive OPE of Ranking Policies @ KDD'23 1

  2. Real-world ranking decision-making
    Examples of recommending a ranking of items:
    • Search Engine
    • Music Streaming
    • E-commerce
    • News
    • and more!
    Can we evaluate the value of these rankings offline, in advance?

  3. How does a ranking system work?
    [Figure: a user (context) arrives and is shown a ranking with 𝑲 items; the user's clicks yield reward(s).]

  4. How does a ranking system work?
    [Figure: a ranking policy shows a ranking with 𝑲 items to an arriving user (context), and the user's clicks yield reward(s). This ranking policy is the one we want to evaluate.]

  5. Evaluating with the policy value
    We evaluate a ranking policy with its expected ranking-wise reward.

  6. Evaluating with the policy value
    We evaluate a ranking policy with its expected ranking-wise reward.
    [Equation with annotation: position-wise policy value (depends on the whole ranking)]
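To make the policy value concrete, here is a minimal Python sketch for a single, fixed context. Everything in it is an assumption for illustration, not taken from the slides: a toy catalog of three items, rankings of length 𝑲 = 2, policies given as explicit probabilities over rankings, hypothetical position weights `alpha` that combine the position-wise rewards, and a made-up interaction in which the expected reward at a position depends on the whole ranking (not only on the item placed there).

```python
from itertools import permutations

# Hypothetical toy setup (not from the slides): 3 candidate items, rankings of length K = 2.
ITEMS = ["a", "b", "c"]
K = 2
RANKINGS = list(permutations(ITEMS, K))

# Hypothetical click probabilities of each item when examined in isolation.
BASE_CLICK = {"a": 0.5, "b": 0.2, "c": 0.1}


def expected_reward(ranking, k):
    """E[r_k | ranking]: depends on the whole ranking, not only on ranking[k].

    Toy interaction (assumption): if item "a" is shown above position k,
    the click probability at position k is halved.
    """
    q = BASE_CLICK[ranking[k]]
    if "a" in ranking[:k]:
        q *= 0.5
    return q


def policy_value(policy, alpha):
    """V(pi) = sum_a pi(a) * sum_k alpha_k * E[r_k | a], for one fixed context.

    policy: dict mapping each ranking (tuple of items) to its probability.
    alpha:  position weights alpha_k (assumption; all ones = expected total clicks).
    """
    return sum(
        prob * sum(alpha[k] * expected_reward(ranking, k) for k in range(K))
        for ranking, prob in policy.items()
    )


uniform = {r: 1 / len(RANKINGS) for r in RANKINGS}
greedy = {("a", "b"): 1.0}
print(policy_value(greedy, alpha=[1.0, 1.0]))   # 0.5 + 0.2 * 0.5 = 0.6 (up to rounding)
print(policy_value(uniform, alpha=[1.0, 1.0]))
```

Because `expected_reward` looks at `ranking[:k]`, the second position's value changes with the item placed above it, which is the sense in which a position-wise value "depends on the whole ranking".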

  7. Off-policy evaluation (OPE)
    [Figure: a logging policy shows a ranking with 𝑲 items to each arriving user (context), and the user's clicks yield reward(s).]

  8. Off-policy evaluation (OPE)
    [Figure: a logging policy shows a ranking with 𝑲 items to each arriving user (context), and the user's clicks yield reward(s).]

  9. Off-policy evaluation (OPE)
    [Figure: a logging policy shows a ranking with 𝑲 items to each arriving user (context), and the user's clicks yield reward(s); an evaluation policy is the policy we want to evaluate.]

  10. Off-policy evaluation (OPE)
    [Figure: an OPE estimator uses the data collected by a logging policy to estimate the value of an evaluation policy.]

  11. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased

  12. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased
    [Figure: ranking A is chosen more often by the evaluation policy than by the logging policy; ranking B is chosen less often.]

  13. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased, by correcting the distribution shift
    [Figure: ranking A is chosen more often by the evaluation policy than by the logging policy; ranking B is chosen less often.]
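The IPS idea on these slides can be sketched in a few lines of Python. This is a minimal sketch under toy assumptions that are not in the slides: a single fixed context, three items, rankings of length 𝑲 = 2, a uniform logging policy `pi_0`, a hypothetical evaluation policy `pi`, and clicks that are independent across positions. Each logged ranking-wise reward is multiplied by the importance weight pi(a|x) / pi_0(a|x) of the whole ranking, which corrects the distribution shift between the two policies.

```python
import random
from itertools import permutations

# Hypothetical toy setup (not from the slides): one fixed context, 3 items, rankings of length K = 2.
ITEMS = ["a", "b", "c"]
K = 2
RANKINGS = list(permutations(ITEMS, K))
CLICK_PROB = {"a": 0.6, "b": 0.3, "c": 0.1}  # assumed; clicks are independent across positions here

# Logging policy pi_0: uniform over rankings. Evaluation policy pi: prefers ("a", "b").
pi_0 = {r: 1 / len(RANKINGS) for r in RANKINGS}
pi = {r: (0.7 if r == ("a", "b") else 0.06) for r in RANKINGS}


def sample_logged_data(n, rng):
    """Collect n (ranking, position-wise clicks) pairs by running pi_0."""
    weights = [pi_0[r] for r in RANKINGS]
    data = []
    for _ in range(n):
        ranking = rng.choices(RANKINGS, weights=weights)[0]
        clicks = [1.0 if rng.random() < CLICK_PROB[item] else 0.0 for item in ranking]
        data.append((ranking, clicks))
    return data


def ips(data, alpha):
    """Standard IPS: reweight the ranking-wise reward by pi(a|x) / pi_0(a|x) of the whole ranking."""
    total = 0.0
    for ranking, clicks in data:
        weight = pi[ranking] / pi_0[ranking]  # importance weight of the entire ranking
        total += weight * sum(a_k * r_k for a_k, r_k in zip(alpha, clicks))
    return total / len(data)


def true_value(alpha):
    """Ground-truth value of pi, available here only because the toy environment is known."""
    return sum(p * sum(alpha[k] * CLICK_PROB[r[k]] for k in range(K)) for r, p in pi.items())


alpha = [1.0, 1.0]
data = sample_logged_data(100_000, random.Random(0))
estimate, truth = ips(data, alpha), true_value(alpha)
print(f"IPS estimate: {estimate:.3f}, true value: {truth:.3f}")
```

Rankings that `pi` favors but `pi_0` rarely shows (like ranking A on the slide) receive weights above 1; here ("a", "b") gets 0.7 / (1/6) = 4.2, and such large weights are exactly where the variance of IPS comes from.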

  14. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased
    • variance: large when the importance weight is large
    [Figure: ranking A is chosen more often by the evaluation policy than by the logging policy.]

  15. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased
    • variance
    When 𝜋0 is the uniform random policy,

  16. De facto approach: Inverse Propensity Scoring (IPS) [Strehl+,10]
    [Equation with annotation: importance weight]
    • unbiased
    • variance!!
    When 𝜋0 is the uniform random policy,
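The slide's formula is not recoverable from the transcript, but the size of the problem is easy to check under a simple assumption: if 𝜋0 is uniform over all ordered rankings of 𝑲 distinct items drawn from m candidates and the evaluation policy deterministically picks one ranking, the importance weight of that ranking equals the number of possible rankings, m!/(m−𝑲)!. The function name below is illustrative; `math.perm` requires Python 3.8 or later.

```python
import math


def uniform_ips_weight(num_items: int, slate_size: int) -> int:
    """pi(a|x) / pi_0(a|x) when pi_0 is uniform over all ordered rankings of
    `slate_size` distinct items out of `num_items`, and pi deterministically picks one ranking.

    pi_0(a|x) = 1 / (number of rankings), so the weight equals the number of rankings.
    """
    return math.perm(num_items, slate_size)  # num_items! / (num_items - slate_size)!


for k in (1, 3, 5, 10):
    print(f"K={k:2d}: weight = {uniform_ips_weight(10, k):,}")
# K= 1: weight = 10
# K= 3: weight = 720
# K= 5: weight = 30,240
# K=10: weight = 3,628,800
```

With only 10 candidate items, going from one position to ten multiplies the largest possible weight by more than 360,000, which is why the variance of plain IPS becomes unmanageable for rankings.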

  17. User behavior assumptions for variance reduction
    We assume that users are affected only by some subset of the actions.
    • Independent IPS (IIPS) [Li+,18]

  18. User behavior assumptions for variance reduction
    We assume that users are affected only by some subset of the actions.
    • Independent IPS (IIPS) [Li+,18]
    • Reward Interaction IPS (RIPS) [McInerney+,20]
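The two assumptions can be illustrated by the importance weights they imply. The sketch below assumes toy policies given as explicit distributions over rankings and uses the plain, non-normalized forms of the weights as they are commonly described: IIPS weights only the item at position k (independent assumption), and RIPS weights the items at positions 0..k (cascade assumption). The original estimators may include details not reproduced here; all names (`marginal`, `iips_weight`, `rips_weight`, `estimate`) and the toy policies are hypothetical.

```python
from itertools import permutations

# Hypothetical toy setup (not from the slides): 3 items, rankings of length K = 2,
# policies given as explicit distributions over rankings for one fixed context.
ITEMS = ["a", "b", "c"]
K = 2
RANKINGS = list(permutations(ITEMS, K))
pi_0 = {r: 1 / len(RANKINGS) for r in RANKINGS}                  # logging policy
pi = {r: (0.7 if r == ("a", "b") else 0.06) for r in RANKINGS}   # evaluation policy


def marginal(policy, ranking, positions):
    """Probability that `policy` puts the same items as `ranking` at all given positions."""
    return sum(p for r, p in policy.items() if all(r[j] == ranking[j] for j in positions))


def ips_weight(ranking, k):
    # No assumption: every position may depend on the whole ranking (k is unused).
    return pi[ranking] / pi_0[ranking]


def iips_weight(ranking, k):
    # Independent assumption: the reward at position k depends only on the item at k.
    return marginal(pi, ranking, [k]) / marginal(pi_0, ranking, [k])


def rips_weight(ranking, k):
    # Cascade assumption: the reward at position k depends on the items at positions 0..k.
    top = range(k + 1)
    return marginal(pi, ranking, top) / marginal(pi_0, ranking, top)


def estimate(data, weight_fn, alpha):
    """Position-wise estimator: mean over samples of sum_k alpha_k * w_k * r_k."""
    return sum(
        sum(alpha[k] * weight_fn(ranking, k) * clicks[k] for k in range(K))
        for ranking, clicks in data
    ) / len(data)


r = ("a", "b")
print([round(iips_weight(r, k), 2) for k in range(K)])   # [2.28, 2.28]
print([round(rips_weight(r, k), 2) for k in range(K)])   # [2.28, 4.2]
print(round(ips_weight(r, 0), 2))                        # 4.2
```

The stronger the assumption, the fewer items enter the weight and the smaller the weights tend to be (lower variance); the price is bias whenever the assumption does not hold for the actual users.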

  19. Introducing an assumption, but is this enough?
    [Figure: bias vs. variance of IIPS (independent click model), RIPS (cascade click model), and IPS (standard, i.e., no assumption)]
    IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00]
    The bias-variance tradeoff depends on a single user behavior assumption.

  20. Introducing an assumption, but is this enough?
    [Figure: bias vs. variance of IIPS (independent click model), RIPS (cascade click model), and IPS (standard, i.e., no assumption)]
    IIPS: [Li+,18], RIPS: [McInerney+,20], IPS: [Precup+,00]
    Are these assumptions enough to capture real-world user behaviors?
    The bias-variance tradeoff depends on the user behavior assumption.

  21. Adaptive IPS for diverse users

  22. User behavior can change with the user context
    • query: clothes (general) -> users only browse the top results
    • query: T-shirts (specific) -> users click after browsing more items
    User behavior can change with the search query, the user's browsing history, etc.

  23. Our idea: Adapting to user behavior
    [Figure: a single, universal assumption vs. our idea]

  24. Our idea: Adapting to user behavior
    [Figure: under a single assumption, the assumed and the true user behaviors mismatch!]

  25. Our idea: Adapting to user behavior
    [Figure: under a single assumption, the assumed and the true user behaviors mismatch, which causes bias.]

  26. Our idea: Adapting to user behavior
    [Figure: under a single assumption, the assumed and the true user behaviors mismatch, which causes excessive variance.]

  27. Our idea: Adapting to user behavior
    [Figure: our idea is adaptive, which reduces the mismatch between the assumed and the true user behaviors.]

  28. Our idea: Adapting to user behavior
    [Figure: our idea is adaptive, which reduces the mismatch between the assumed and the true user behaviors.]
    [Figure: example (1) of complex user behaviors that are not captured by cascade or other existing assumptions]
    Further, we aim to model more diverse and complex user behaviors as well.

  29. Our proposal: Adaptive IPS (AIPS)
    Statistical benefits
    • Unbiased under any given user behavior model.
    • Minimum variance among IPS-based unbiased estimators.
    [Equation with annotation: importance weight of only the actions that matter]
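One way to read "importance weight of only the actions that matter" is sketched below. The assumptions are ours, not the authors' implementation: a user behavior c is represented as a function mapping each position k to the set of positions whose items affect the reward at k, and the weight for position k uses the marginal probabilities of the items at those positions only. Under this representation, the hypothetical `independent`, `cascade`, and `standard` behaviors reproduce IIPS-, RIPS-, and IPS-style weights, while `neighbors` stands in for a behavior that none of them captures. The contexts, policies, and data are made up.

```python
from itertools import permutations

# Hypothetical toy setup (not from the slides): 4 items, rankings of length K = 3,
# explicit policy distributions for one fixed set of contexts.
ITEMS = ["a", "b", "c", "d"]
K = 3
RANKINGS = list(permutations(ITEMS, K))
pi_0 = {r: 1 / len(RANKINGS) for r in RANKINGS}                           # uniform logging policy
pi = {r: (0.5 if r == ("a", "b", "c") else 0.5 / 23) for r in RANKINGS}  # evaluation policy


# A user behavior c maps each position k to the positions whose items affect the reward at k.
def independent(k):
    return {k}


def cascade(k):
    return set(range(k + 1))


def standard(k):
    return set(range(K))


def neighbors(k):
    # Hypothetical "complex" behavior: the items directly above and below also matter.
    return {j for j in (k - 1, k, k + 1) if 0 <= j < K}


def marginal(policy, ranking, positions):
    """Probability that `policy` puts the same items as `ranking` at all given positions."""
    return sum(p for r, p in policy.items() if all(r[j] == ranking[j] for j in positions))


def aips_weight(ranking, k, behavior):
    """Importance weight of only the actions that matter for position k under `behavior`."""
    relevant = behavior(k)
    return marginal(pi, ranking, relevant) / marginal(pi_0, ranking, relevant)


def aips(data, behavior_of, alpha):
    """data: list of (context, ranking, clicks); behavior_of: context -> user behavior c(x)."""
    total = 0.0
    for x, ranking, clicks in data:
        c = behavior_of(x)
        total += sum(alpha[k] * aips_weight(ranking, k, c) * clicks[k] for k in range(K))
    return total / len(data)


r = ("a", "b", "c")
for name, c in [("independent", independent), ("cascade", cascade),
                ("neighbors", neighbors), ("standard", standard)]:
    print(name, [round(aips_weight(r, k, c), 2) for k in range(K)])

# Hypothetical context-dependent behavior c(x): each context gets its own assumption.
behavior_by_context = {"query_1": independent, "query_2": cascade}
data = [("query_1", r, [1.0, 0.0, 0.0]), ("query_2", r, [0.0, 0.0, 1.0])]
print(aips(data, behavior_by_context.get, alpha=[1.0, 1.0, 1.0]))
```

At the middle position (k = 1), the weight of the target ranking grows from `independent` to `cascade` to `standard`, because more positions are treated as relevant; dropping irrelevant positions is what reduces the variance, while dropping relevant ones would introduce bias.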

  30. How much variance is reduced by AIPS?
    AIPS removes the variance caused by the importance weights of unrelated (irrelevant) actions.
    [Figure legend: relevant actions; irrelevant actions]

  31. What will the bias be when 𝑐 is unavailable?
    In practice, the user behavior 𝑐 is often unobservable, so we consider an estimate 𝑐̂ instead.
    [Equation with annotation: overlap matters]

  32. What will the bias be when 𝑐 is unavailable?
    In practice, the user behavior 𝑐 is often unobservable, so we consider an estimate 𝑐̂ instead.
    [Equation with annotation: overlap matters]
    [Figure: small bias vs. large bias; annotation: source of bias]

  33. Controlling the bias-variance tradeoff
    [Figure: action set]

  34. Controlling the bias-variance tradeoff
    • weaker assumption (e.g., no assumption): unbiased (or less biased), but large variance
    [Figure: action set]

  35. Controlling the bias-variance tradeoff
    • stronger assumption (e.g., independent): more biased, but lower variance
    • weaker assumption (e.g., no assumption): unbiased (or less biased), but large variance
    [Figure: action set]

  36. Controlling the bias-variance tradeoff
    • stronger assumption (e.g., independent): more biased, but lower variance
    • weaker assumption (e.g., no assumption): unbiased (or less biased), but large variance
    [Figure: action set]
    Why not optimize 𝒄̂ instead of using 𝒄 for a better bias-variance tradeoff?

  37. Controlling the bias-variance tradeoff
    [Figure: action set]
    Why not optimize 𝒄̂ instead of using 𝒄 for a better bias-variance tradeoff?
    user behavior           bias   variance   MSE
    true one                0.0    0.5        0.50
    optimized counterpart   0.1    0.3        0.31
    (MSE = (bias)² + variance)
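The numbers in the table follow from the decomposition MSE = (bias)² + variance: accepting a bias of 0.1 costs only 0.1² = 0.01 in MSE, while the variance drops by 0.2. A quick check in Python (the helper name is illustrative):

```python
def mse(bias: float, variance: float) -> float:
    """Mean squared error of an estimator: (bias)^2 + variance."""
    return bias ** 2 + variance


table = {"true one": (0.0, 0.5), "optimized counterpart": (0.1, 0.3)}
for behavior, (bias, var) in table.items():
    print(f"{behavior:<22} MSE = {mse(bias, var):.2f}")
```

Running it prints an MSE of 0.50 for the true behavior and 0.31 for the optimized counterpart, matching the slide.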

  38. Controlling the bias-variance tradeoff
    [Figure: action set]
    Why not optimize 𝒄̂ instead of using 𝒄 for a better bias-variance tradeoff?
    user behavior           bias   variance   MSE
    true one                0.0    0.5        0.50
    optimized counterpart   0.1    0.3        0.31
    (MSE = (bias)² + variance)
    We aim to optimize the user behavior model adaptively to the context.

  39. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    [Equation with annotation: to minimize MSE]

  40. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    MSE estimation: [Su+,20], [Udagawa+,23]
    [Equation with annotation: to minimize MSE]

  41. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    [Equation with annotation: to minimize MSE]
    [Figure: context space]

  42. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    [Equation with annotation: to minimize MSE]
    [Figure: context space]

  43. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    [Equation with annotation: to minimize MSE]
    [Figure: context space]

  44. How do we optimize the user behavior model?
    Based on the bias-variance analysis, we optimize the user behavior model to minimize the MSE.
    [Equation with annotation: to minimize MSE]
    [Figure: context space, shown in two panels]
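The slides choose the user behavior model per region of the context space by minimizing an estimate of the MSE, citing [Su+,20] and [Udagawa+,23] for MSE estimation; the transcript does not give the estimator itself. The sketch below is only a simplified illustration under our own assumptions: the context space is already split into groups, the bias of each candidate behavior is estimated by its gap to the unbiased "standard" (no-assumption) estimate on the same group, and the per-sample numbers are made up. It does not learn the partition; the reference list also cites recursive partitioning [Athey&Imbens,16], but how the authors use it is not recoverable here.

```python
from statistics import mean, variance


def estimated_mse(samples, reference):
    """Crude MSE proxy (an assumption, not the paper's estimator):
    squared gap to an unbiased reference estimate + variance of the sample mean."""
    bias_hat = mean(samples) - mean(reference)
    return bias_hat ** 2 + variance(samples) / len(samples)


def select_behavior(per_sample, reference_key="standard"):
    """per_sample: user behavior name -> per-sample estimates on one context group.

    The "standard" (no-assumption) estimates serve as the unbiased reference.
    Returns the behavior with the smallest estimated MSE for this group.
    """
    reference = per_sample[reference_key]
    return min(per_sample, key=lambda c: estimated_mse(per_sample[c], reference))


# Hypothetical per-sample estimates for two context groups (made-up numbers for illustration).
groups = {
    # Here the independent assumption looks (nearly) unbiased and has far less variance.
    "group_1": {"standard": [0.0, 4.0] * 4, "independent": [2.1, 1.9] * 4},
    # Here the independent assumption is clearly biased, so the unbiased estimator wins.
    "group_2": {"standard": [0.0, 4.0] * 4, "independent": [1.0] * 8},
}
for name, per_sample in groups.items():
    print(name, "->", select_behavior(per_sample))
# group_1 -> independent
# group_2 -> standard
```

The same candidate can win in one region of the context space and lose in another, which is the adaptivity the slides aim for; the bias proxy itself is noisy, which is one reason more careful MSE estimators such as those cited are needed in practice.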

  45. Experiments

  46. Experiment on diverse user behaviors
    [Figure: user behavior distributions, from (simple) to (diverse) to (complex), in terms of interactions from related actions]

  47. AIPS works well across various user behaviors
    IPS (red) has high variance across various user behaviors.
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  48. AIPS works well across various user behaviors
    IIPS (blue) and RIPS (purple) have high bias under complex user behaviors.
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  49. AIPS works well across various user behaviors
    AIPS (true) (gray), i.e., AIPS with the true user behavior 𝑐, reduces variance while remaining unbiased.
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  50. AIPS works well across various user behaviors
    AIPS (true) (gray), however, suffers from increasing variance as user behavior becomes complex.
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  51. AIPS works well across various user behaviors
    AIPS (ours) (green) reduces both bias and variance, and is thus accurate.
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  52. AIPS works well across various user behaviors
    AIPS (ours) (green) works well even under diverse and complex user behaviors!
    [Figure: performance (a lower value is better) under (simple), (diverse), and (complex) user behavior distributions]

  53. AIPS also works well across various configurations
    AIPS adaptively balances the bias-variance tradeoff and minimizes MSE.
    [Figure panels: varying slate sizes; varying data sizes]

  54. Real-world experiment
    We conduct an experiment with data from an e-commerce platform.
    • AIPS performs the best in more than 75% of trials.
    • AIPS improves the worst-case performance.

  55. Summary
    • Effectively controlling the bias-variance tradeoff is the key to OPE of ranking policies.
    • However, existing estimators apply a single user behavior assumption, which causes both
    excessive bias and excessive variance in the presence of diverse user behaviors.
    • In response, we propose Adaptive IPS (AIPS), which switches the importance weight
    depending on the user context to minimize the estimation error.
    AIPS enables accurate OPE under diverse user behaviors!

  56. Thank you for listening!
    contact: [email protected]

  57. References

  58. References (1/2)
    [Saito+,21] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita.
    “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy
    Evaluation.” NeurIPS Datasets and Benchmarks Track, 2021. https://arxiv.org/abs/2008.07146
    [Li+,18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa
    Vinay, and Zheng Wen. “Offline Evaluation of Ranking Policies with Click Models.”
    KDD, 2018. https://arxiv.org/abs/1804.10488
    [McInerney+,20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra,
    and Ben Carterette. “Counterfactual Evaluation of Slate Recommendations with
    Sequential Reward Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986
    [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from
    Logged Implicit Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Athey&Imbens,16] Susan Athey and Guido Imbens. “Recursive Partitioning for
    Heterogeneous Causal Effects.” PNAS, 2016. https://arxiv.org/abs/1504.01132

  59. References (2/2)
    [Kiyohara+,22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita,
    Nobuyuki Shimizu, and Yasuo Yamamoto. “Doubly Robust Off-Policy Evaluation for
    Ranking Policies under the Cascade Behavior Model.” WSDM, 2022.
    https://arxiv.org/abs/2202.01562
    [Su+,20] Yi Su, Pavithra Srinath, and Akshay Krishnamurthy. “Adaptive Estimator
    Selection for Off-Policy Evaluation.” ICML, 2020. https://arxiv.org/abs/2002.07729
    [Udagawa+,23] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, and
    Kei Tateno. “Policy-Adaptive Estimator Selection for Off-Policy Evaluation.” AAAI, 2023.
    https://arxiv.org/abs/2211.13904