
Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Haruka Kiyohara
February 23, 2022

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Best Paper Runner-Up Award @ WSDM2022
Proceedings: https://dl.acm.org/doi/10.1145/3488560.3498380
arXiv: https://arxiv.org/abs/2202.01562

RL4RealLife WS @ ICML2021
About WS: https://sites.google.com/view/RL4RealLife2021

CFML勉強会 #6
https://cfml.connpass.com/event/249531/

IR reading, Tokyo, 2022 Spring
https://sigir.jp/post/2022-05-21-irreading_2022spring/

Explanatory article (Yahoo! Japan Techblog)
https://techblog.yahoo.co.jp/entry/2021121130233784/



Transcript

  1. Doubly Robust Off-Policy Evaluation for Ranking
    Policies under the Cascade Behavior Model
    Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro,
    Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto
    Haruka Kiyohara, Tokyo Institute of Technology
    https://sites.google.com/view/harukakiyohara
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 1


  2. OPE of Ranking Policies
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 2


  3. Real world ranking decision making
    Examples of recommending a ranking of items
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 3
    Applications include
    • Search Engine
    • Music Streaming
    • E-commerce
    • News
    • and more..!
    Can we evaluate the value of
    these ranking decisions?


  4. Content
    • Overview of Off-Policy Evaluation (OPE) of Ranking Policies
    • Existing Estimators and Challenges
    • Seminal Work: Doubly Robust Estimator
    • Proposal: Cascade Doubly Robust (Cascade-DR)
    • The Benefit of Cascade-DR
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 4


  5. Ranking decision making
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 5
    [Figure: a coming user is shown a ranking of four songs; each position yields a reward
    (click / no click).]


  6. The policy also produces logged data
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 6
    [Figure: the behavior policy 𝜋_b observes a coming user (context), presents a ranked list of items
    (action vector), and receives user feedback (reward vector); together these form the logged bandit feedback.]


  7. Off-Policy Evaluation (OPE)
    The goal is to estimate V(𝜋_e), the expected reward obtained by deploying the evaluation policy 𝜋_e
    in the real system (e.g., the sum of clicks), using only the logged bandit feedback collected by 𝜋_b.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 7
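    For reference, in common ranking-OPE notation (my own notation, not copied from the slide), the target value
    and the logged data can be written as:
    V(\pi_e) = \mathbb{E}_{p(x)\,\pi_e(\boldsymbol{a} \mid x)\,p(\boldsymbol{r} \mid x, \boldsymbol{a})}\Bigl[\sum_{l=1}^{L} \alpha_l\, r_l\Bigr],
    \qquad \mathcal{D} = \{(x_i, \boldsymbol{a}_i, \boldsymbol{r}_i)\}_{i=1}^{n}, \quad \boldsymbol{a}_i \sim \pi_b(\cdot \mid x_i),
    where 𝒂 is the ranking (action vector), 𝒓 the slot-level rewards, and 𝛼_l optional position weights.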


  8. How to derive an accurate OPE estimation?
    We need to reduce both bias and variance moderately.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 8


  9. How to derive an accurate OPE estimation?
    We need to reduce both bias and variance moderately.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 9
    Bias is caused by
    the distribution shift.


  10. Distribution Shift
    Behavior and evaluation policies (𝜋_e and 𝜋_b) follow different probability distributions.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 10


  11. How to derive an accurate OPE estimation?
    We need to reduce both bias and variance moderately.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 11
    Variance increases with the size of the combinatorial action space
    and decreases with the data size.


  12. How large is the action space in a slate?
    Non-factorizable case – the policy chooses actions without duplication.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 12
    When there are 10 unique actions (|𝐴| = 10), the number of choices per slot is
    10 x 9 x 8 x 7 x ... (the permutation 𝑃(𝐿, |𝐴|)).
    When 𝐿 = 10, the number of possible rankings is 3,628,800!


  13. How large is the action space in a slate?
    Factorizable case – the policy chooses actions independently for each slot.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 13
    When there are 10 unique actions (|𝐴| = 10), the number of choices per slot is
    10 x 10 x 10 x 10 x ... (the exponentiation |𝐴|^𝐿).
    When 𝐿 = 10, the number of possible rankings is 10,000,000,000!
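    A quick sanity check of these counts (a small Python sketch, not part of the slides):

    import math

    L, num_actions = 10, 10
    non_factorizable = math.perm(num_actions, L)  # ordered choices without duplication: 10 * 9 * ... * 1
    factorizable = num_actions ** L               # independent choice at every slot: 10^10
    print(non_factorizable)  # 3628800
    print(factorizable)      # 10000000000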


  14. How to derive an accurate OPE estimation?
    We need to reduce both bias and variance moderately.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 14
    Bias is caused by
    the distribution shift.
    Variance increases with the size of the combinatorial action space
    and decreases with the data size.

    View Slide

  15. Existing Approaches
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 15


  16. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
    IPS corrects the distribution shift between 𝜋_e and 𝜋_b
    using the importance weight w.r.t. the combinatorial action (𝑛: data size).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 16
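    For reference, a sketch of the IPS estimator in this setting (my notation; the weight is taken over the whole action vector):
    \hat{V}_{\mathrm{IPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_e(\boldsymbol{a}_i \mid x_i)}{\pi_b(\boldsymbol{a}_i \mid x_i)} \sum_{l=1}^{L} \alpha_l\, r_{i,l}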


  17. Dealing with the distribution shift by IPS
    Behavior and evaluation policies (𝜋_e and 𝜋_b) follow different probability distributions.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 17


  18. Dealing with the distribution shift by IPS
    Behavior and evaluation policies (𝜋_e and 𝜋_b) follow different probability distributions.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 18
    [Figure: importance weighting reweights the logged (behavior) distribution toward the evaluation distribution.]


  19. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
    IPS corrects the distribution shift between 𝜋_e and 𝜋_b
    using the importance weight w.r.t. the combinatorial action (𝑛: data size).
    • pros: unbiased under all possible user behavior (i.e., click model)
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 19


  20. Inverse Propensity Scoring (IPS) [Precup+, 00] [Strehl+, 10]
    IPS corrects the distribution shift between 𝜋_e and 𝜋_b
    using the importance weight w.r.t. the combinatorial action (𝑛: data size).
    • pros: unbiased under all possible user behavior (i.e., click model)
    • cons: suffers from very high variance due to the combinatorial action space
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 20


  21. Huge importance weight of IPS
    Behavior and evaluation policies (𝜋_e and 𝜋_b) follow different probability distributions.
    An excessively large importance weight makes IPS
    too sensitive to data points observed with a small probability.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 21


  22. Source of high variance in IPS
    IPS regards the reward at slot 𝑙 as depending on all the actions in the ranking,
    so its importance weight is the product of the slot-level importance weights (factorizable case).
    • pros: unbiased under all possible user behavior (i.e., click model)
    • cons: suffers from very high variance due to the combinatorial action space
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 22


  23. Independent IPS (IIPS) [Li+, 18]
    IIPS assumes that users interact with each item independently of the other positions,
    so the product over slots disappears and only the slot-level importance weight remains.
    • pros: substantially reduces the variance of IPS
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 23
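    For reference, a sketch of IIPS in my notation, where the joint weight is replaced by a slot-level (marginal) weight:
    \hat{V}_{\mathrm{IIPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L} \frac{\pi_e(a_{i,l} \mid x_i)}{\pi_b(a_{i,l} \mid x_i)}\, \alpha_l\, r_{i,l},
    with \pi(a_l \mid x) denoting the marginal probability of placing a_l at position l.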


  24. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟐 (factorizable case),
    suppose the slot-level importance weight is 10.0 at every slot (Songs 1-4, rewards 𝑟_1-𝑟_4).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 24


  25. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟐 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses the product over all slots: 10.0 x 10.0 x 10.0 x 10.0 = 10,000.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 25


  26. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟐 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses the product over all slots (10,000),
    while IIPS uses only slot 2's weight (10.0).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 26


  27. Independent IPS (IIPS) [Li+, 18]
    IIPS assumes that users interact with each item independently of the other positions,
    so the product over slots disappears and only the slot-level importance weight remains.
    • pros: substantially reduces the variance of IPS
    • cons: may suffer from a large bias due to
    the strong independence assumption on user behavior
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 27


  28. Reward interaction IPS (RIPS) [McInerney+, 20]
    RIPS assumes that users interact with items sequentially from top to bottom.
    (i.e., cascade assumption)
    • pros: reduces the bias of IIPS and the variance of IPS
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 28
    (cascade assumption: the importance weight considers only the current and higher positions)
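    For reference, a sketch of RIPS in my notation, where the weight covers positions 1 through l only:
    \hat{V}_{\mathrm{RIPS}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{L} \frac{\pi_e(a_{i,1:l} \mid x_i)}{\pi_b(a_{i,1:l} \mid x_i)}\, \alpha_l\, r_{i,l}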


  29. Reward interaction IPS (RIPS) [McInerney+, 20]
    RIPS assumes that users interact with items sequentially from top to bottom.
    (i.e., cascade assumption)
    • pros: reduces the bias of IIPS and the variance of IPS
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 29
    (cascade assumption: the importance weight considers only the current and higher positions)


  30. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟐 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses the product over all slots (10,000), IIPS uses only slot 2's weight (10.0).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 30


  31. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟐 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses all slots (10,000), IIPS uses only slot 2 (10.0),
    and RIPS uses slots 1-2 (10.0 x 10.0 = 100).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 31


  32. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟒 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses all slots (10,000), IIPS uses only slot 4 (10.0).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 32


  33. Let’s compare the importance weight
    When we evaluate the slot-level reward at slot 𝒍 = 𝟒 (factorizable case),
    with a slot-level importance weight of 10.0 at every slot:
    IPS uses all slots (10,000), IIPS uses only slot 4 (10.0),
    and RIPS uses slots 1-4 (10.0^4 = 10,000), which is as large as IPS at the bottom slot.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 33


  34. Reward interaction IPS (RIPS) [McInerney+, 20]
    RIPS assumes that users interact with items sequentially from top to bottom.
    (i.e., cascade assumption)
    • pros: reduces the bias of IIPS and the variance of IPS
    • cons: still suffers from a high variance when 𝐿 is large
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 34
    (cascade assumption: the importance weight considers only the current and higher positions)


  35. A difficult tradeoff remains for the existing estimators
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 35
    [Figure: MSE (lower is better) vs. data size 𝑛 for 𝐿 = 5, under the three true click models
    (standard, cascade, independence). The best existing estimator changes with the setting:
    roughly RIPS or IPS under standard, IIPS or RIPS under cascade, and IIPS under independence.]


  36. A difficult tradeoff remains for the existing estimators
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 36
    [Figure: the same MSE vs. data size 𝑛 plot (𝐿 = 5) under the standard, cascade, and independence
    click models.]
    Our goal: we want an OPE estimator that works well across various situations.


  37. Our goal: Can we dominate all existing estimators?
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 37
    [Figure: the bias-variance plane, showing the tradeoff achievable by modifying the assumed click model:
    IIPS (independent), RIPS (cascade), and IPS (standard).]


  38. Our goal: Can we dominate all existing estimators?
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 38
    [Figure: the same bias-variance plane (IIPS / RIPS / IPS for the independent / cascade / standard click models).]
    Can we further reduce the variance of RIPS,
    while remaining unbiased under the cascade assumption?


  39. Seminal work
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 39


  40. From IPS to Doubly Robust (DR) [Dudík+, 14]
    In a single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 40
    (hereinafter)


  41. From IPS to Doubly Robust (DR) [Dudík+, 14]
    In a single action setting (𝐿 = 1), we often use DR to reduce the variance of IPS.
    + unbiased and small variance
    (baseline estimation serves as a control variate; importance weighting is applied only to the residual)
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 41
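    For reference, the single-action DR estimator takes the standard form below (my notation; q̂ is the estimated reward function):
    \hat{V}_{\mathrm{DR}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \Bigl[ \mathbb{E}_{\pi_e(a \mid x_i)}\bigl[\hat{q}(x_i, a)\bigr] + w(x_i, a_i)\bigl(r_i - \hat{q}(x_i, a_i)\bigr) \Bigr],
    \qquad w(x, a) = \frac{\pi_e(a \mid x)}{\pi_b(a \mid x)}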


  42. Variance reduction of DR
    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, and q̂(𝑥, 𝑎) = 0.9, ∀𝑎,
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 42
    the large importance weight leads to variance.


  43. Variance reduction of DR
    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 1, 𝑟 = 1, and q̂(𝑥, 𝑎) = 0.9, ∀𝑎,
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 43
    the importance weight leads to variance, but DR scales down the weighted value
    by applying the weight only to the residual 𝑟 − q̂(𝑥, 𝑎).
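    A quick numeric check of this example (the numbers follow the slide; the code itself is my sketch):

    # single logged sample with importance weight w = 10, observed reward r = 1, reward model q_hat = 0.9
    w, r, q_hat = 10.0, 1.0, 0.9

    ips_term = w * r                   # IPS contribution: 10.0
    dr_term = q_hat + w * (r - q_hat)  # DR contribution: 0.9 + 10 * 0.1 = 1.9
    print(ips_term, dr_term)

    The weight multiplies only the residual 𝑟 − q̂ = 0.1, so the DR contribution stays close to the true value 1;
    this is where the variance reduction comes from.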


  44. Variance reduction of DR
    When 𝑤(𝑥, 𝑎) = 10, 𝑞(𝑥, 𝑎) = 0.8, 𝑟 = 1, and q̂(𝑥, 𝑎) = 0.9, ∀𝑎,
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 44
    the importance weight leads to variance, but DR scales down the weighted value.
    We want to define DR in ranking OPE,
    but how can we do it under the complex
    cascade assumption?
    The baseline expectation term is computationally intractable
    over the combinatorial action space.


  45. Cascade Doubly Robust
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 45


  46. Recursive form of RIPS
    Transform RIPS into the recursive form.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 46


  47. Recursive form of RIPS
    Transform RIPS into the recursive form.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 47
    (expressed in terms of the policy value after position 𝒍)


  48. Recursive form of RIPS
    Transform RIPS into the recursive form.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 48
    (expressed in terms of the policy value after position 𝒍)
    Now, the importance weight depends only on 𝒂_𝒍.
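    A sketch of the recursion in LaTeX (my notation, reconstructed from the description; w_l is the slot-level weight conditioned on the higher positions):
    \hat{V}_{L+1} := 0, \qquad \hat{V}_{l} := w_l(x, a_{1:l}) \bigl( \alpha_l r_l + \hat{V}_{l+1} \bigr), \qquad w_l(x, a_{1:l}) = \frac{\pi_e(a_l \mid x, a_{1:l-1})}{\pi_b(a_l \mid x, a_{1:l-1})},
    so that averaging \hat{V}_{1} over the logged data recovers RIPS (the slot-level weights telescope back into the position-wise weights).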


  49. Introducing a control variate (Q-hat)
    Now we can define DR under the Cascade assumption.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 49


  50. Introducing a control variate (Q-hat)
    Now we can define DR under the Cascade assumption.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 50
    (the baseline estimation serves as a control variate; importance weighting is applied only to the residual
    of the policy value after position 𝒍)
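    As a sketch (my reconstruction from the description above, not a verbatim copy of the paper's equation), the recursion becomes:
    \hat{V}^{\mathrm{DR}}_{L+1} := 0, \qquad \hat{V}^{\mathrm{DR}}_{l} := \mathbb{E}_{\pi_e(a \mid x, a_{1:l-1})}\bigl[\hat{Q}(x, a_{1:l-1}, a)\bigr] + w_l(x, a_{1:l}) \bigl( \alpha_l r_l + \hat{V}^{\mathrm{DR}}_{l+1} - \hat{Q}(x, a_{1:l}) \bigr),
    and Cascade-DR averages \hat{V}^{\mathrm{DR}}_{1} over the logged data. The baseline expectation is now over a single slot-level action,
    so it stays tractable even though the full combinatorial expectation is not.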


  51. Introducing a control variate (Q-hat)
    Now we can define DR under the Cascade assumption.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 51
    (existing works control the bias-variance tradeoff only through the assumed click model)


  52. Introducing a control variate (Q-hat)
    Now we can define DR under the Cascade assumption.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 52
    (existing works control the bias-variance tradeoff only through the assumed click model;
    our idea is to add a control variate for the policy value after position 𝒍 to reduce the variance further)


  53. Benefits of Cascade-DR
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 53


  54. Statistical advantages of Cascade-DR
    • pros: reduces the variance of RIPS
    • pros: still unbiased under the cascade assumption
    (under a reasonable assumption on Q̂)
    Better bias-variance tradeoff
    than IPS, IIPS, and RIPS!
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 54


  55. A difficult tradeoff remains for the existing estimators
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 55
    [Figure: MSE (lower is better) vs. data size 𝑛 for 𝐿 = 5, under the standard, cascade, and
    independence click models.]
    The best estimator can change with the data size and the true click model..
    (estimator selection is hard)


  56. Cascade-DR clearly dominates IPS, IIPS, RIPS
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 56
    [Figure: the same MSE vs. data size 𝑛 plot (𝐿 = 5) under the standard, cascade, and
    independence click models, now including Cascade-DR.]
    Cascade-DR clearly dominates all existing estimators across various configurations!
    (no need for the difficult estimator selection anymore)


  57. Cascade-DR performs well on a real platform
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 57
    [Figure: estimation error on real data; the lower, the better.]
    Cascade-DR is the most accurate and stable even under realistic user behavior.


  58. Summary
    • OPE of ranking policies has a variety of applications (e.g., search engine).
    • Existing estimators suffer from either a large bias or a large variance, and
    the best estimator can change depending on the true click model and the data size.
    • Cascade-DR achieves a better bias-variance tradeoff than all existing estimators
    by introducing a control variate under the cascade assumption.
    Cascade-DR enables accurate OPE of real-world ranking decisions!
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 58


  59. Cascade-DR is available in OpenBanditPipeline!
    Implemented as `obp.ope.SlateCascadeDoublyRobust`.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 59
    https://github.com/st-tech/zr-obp
    Our experimental code also uses obp.
    Only four lines of code to implement OPE.
    # estimate q_hat
    regression_model = obp.ope.SlateRegressionModel(..)
    q_hat = regression_model.fit_predict(..)
    # estimate policy value
    cascade_dr = obp.ope.SlateCascadeDoublyRobust(..)
    policy_value = cascade_dr.estimate_policy_value(..)


  60. Thank you for listening!
    Find out more (e.g., theoretical analysis and experiments) in the full paper!
    contact: [email protected]
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 60


  61. Cascade Doubly Robust (Cascade-DR)
    By solving the recursive form, we obtain Cascade-DR.
    If the estimation error of the baseline Q̂ is within ±100% ( , ),
    then Cascade-DR can reduce the variance of RIPS.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 61
    recursively estimate baseline


  62. Additional experimental results
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 62


  63. How accurate are the OPE estimators?
    The smaller the squared error (SE), the more accurate the estimator V̂ is.
    We use OpenBanditPipeline [Saito+, 21a].
    Experimental configurations:
    • reward structure (user behavior assumption)
    • slate size 𝐿 and data size 𝑛
    • policy similarity 𝜆
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 63


  64. Experimental setup: reward function
    We define the slot-level mean reward function as follows.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 64
    (the slot-level mean reward is a sigmoid of a term determined by the corresponding action
    plus a linear interaction term from the other slots)


  65. Experimental setup: reward function
    We define the slot-level mean reward function as follows.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 65
    [Figure: under the independence assumption, each slot's reward depends only on its own action,
    with no interaction from the other slots.]


  66. Experimental setup: reward function
    We define the slot-level mean reward function as follows.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 66
    [Figure: under the cascade assumption, each slot's reward depends on its own action plus
    interaction from the higher slots.]


  67. Experimental setup: reward function
    We define the slot-level mean reward function as follows.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 67
    [Figure: under the standard assumption, each slot's reward depends on its own action plus
    interaction from all the other slots.]


  68. Experimental setup: reward function
    We define the slot-level mean reward function as follows.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 68
    (the slot-level mean reward is a sigmoid of a term determined by the corresponding action
    plus a linear interaction term from the other slots)
    • standard: interaction from all the other slots
    • cascade: interaction from the previous (higher) slots
    • independence: no interaction
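    A compact way to write this structure (my notation; the exact functional form is given in the paper):
    q_l(x, \boldsymbol{a}) = \sigma\Bigl( g(x, a_l) + \sum_{l' \in \Phi(l)} h(x, a_l, a_{l'}) \Bigr),
    where \sigma is the sigmoid, g is the term determined by the corresponding action, h is the linear interaction term,
    and \Phi(l) is all the other slots (standard), the previous slots (cascade), or empty (independence).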


  69. Experimental setup: interaction function
    Two ways to define the interaction function:
    • additive effect from co-occurrence
    • decay effect from neighboring actions
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 69
    (the additive effect uses a symmetric co-occurrence matrix over action pairs;
    the decay effect uses a decay function over the distance between positions)


  70. Experimental setup: policies
    Behavior and evaluation policies are factorizable.
    • behavior policy
    • evaluation policy
    (𝝀 → 𝟏: similar, 𝝀 → −𝟏: dissimilar;
    𝝀 → 𝟏 accounts for 𝝅_𝒃 more, |𝝀| → 𝟎 approaches a uniform policy)
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 70


  71. Study Design
    How the estimators’ performance and their superiority change depending on
    • reward structure
    • data size / slate size / policy similarity
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 71


  72. Experimental procedure [Saito+, 21b]
    • We first randomly sample configurations (10000 times) and calculate SE.
    • Then, we aggregate results and calculate the mean value of SE (MSE).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 72
    1. Define configuration space.
    2. For each random seed..
    2-1. sample configuration based on the seed
    2-2. calculate SE on the sampled evaluation
    policy and dataset (configuration).
    -> we can evaluate V̂ in various situations.
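    A minimal sketch of this loop in Python (the helper functions and numbers here are hypothetical placeholders, not the actual experiment code):

    import numpy as np

    def sample_configuration(rng):
        # hypothetical stand-in for step 2-1: sample a configuration from the pre-defined space
        return {"n": int(rng.choice([500, 1000, 2000])), "len_list": int(rng.choice([3, 5, 8]))}

    def run_ope(config, rng):
        # hypothetical stand-in for step 2-2: return (ground-truth value, estimated value) on the sampled setting
        v_true = 1.0
        v_hat = v_true + rng.normal(scale=1.0 / np.sqrt(config["n"]))
        return v_true, v_hat

    squared_errors = []
    for seed in range(100):  # the actual experiment uses 10,000 random configurations
        rng = np.random.default_rng(seed)
        v_true, v_hat = run_ope(sample_configuration(rng), rng)
        squared_errors.append((v_hat - v_true) ** 2)
    print("MSE:", np.mean(squared_errors))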


  73. Varying data size
    Cascade-DR stably performs well on various configurations!
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 73
    relative MSE (V̂) = MSE (V̂) / MSE (V̂_Cascade-DR)
    [Figure: relative MSE vs. data size 𝑛 (𝐿 = 5) under the standard, cascade, and independence
    click models, annotated with the dominant source of error (bias or variance).]


  74. Varying data size
    Cascade-DR reduces the variance of IPS and RIPS a lot.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 74
    relative MSE (V̂) = MSE (V̂) / MSE (V̂_Cascade-DR)
    [Figure: relative MSE vs. data size 𝑛 under the standard click model.]
    Unbiased -> IPS
    Best with a large data size -> IPS, Cascade-DR, ..
    Best with a small data size -> Cascade-DR, RIPS, ..


  75. Varying data size
    Cascade-DR is the best, being unbiased while reducing the variance.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 75
    relative MSE (V̂) = MSE (V̂) / MSE (V̂_Cascade-DR)
    [Figure: relative MSE vs. data size 𝑛 under the cascade click model.]
    Unbiased -> IPS, RIPS, Cascade-DR
    Best with a large data size -> Cascade-DR, RIPS, ..
    Best with a small data size -> Cascade-DR, IIPS, ..


  76. Varying data size
    Cascade-DR is the best among the estimators using reasonable assumptions.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 76
    relative MSE (V̂) = MSE (V̂) / MSE (V̂_Cascade-DR)
    [Figure: relative MSE vs. data size 𝑛 under the independence click model.]
    Unbiased -> all estimators
    Best with a large data size -> IIPS, Cascade-DR, ..
    Best with a small data size -> IIPS, Cascade-DR, ..


  77. Varying data size
    Cascade-DR stably performs well on various configurations!
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 77
    relative MSE (V̂) = MSE (V̂) / MSE (V̂_Cascade-DR)
    [Figure: relative MSE vs. data size 𝑛 (𝐿 = 5) under the standard, cascade, and independence
    click models, annotated with the dominant source of error (bias or variance).]


  78. Varying slate size
    • Cascade-DR stably outperforms RIPS across various slate sizes.
    • When the baseline estimation is successful, Cascade-DR becomes even more powerful.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 78
    [Figure: relative MSE vs. slate size 𝐿 (𝑛 = 1000) under the standard, cascade, and independence
    click models; panels range from difficult to easy baseline estimation.]


  79. Varying policy similarity
    • When the behavior and evaluation policies are dissimilar,
    Cascade-DR is more promising.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 79
    [Figure: relative MSE vs. policy similarity 𝜆 (𝑛 = 1000) under the standard, cascade, and
    independence click models.]


  80. How to calculate ground-truth policy value?
    In synthetic experiments, we take the expectation of the weighted slate-level
    expected reward over the contexts:
    1. Enumerate all combinations of actions (|𝐴|^𝐿 of them).
    2. For each action vector, calculate the evaluation policy pscore 𝜋_𝑒(𝒂 | 𝑥) and
    its slate-level expected reward.
    3. Calculate the weighted sum of the slate-level expected rewards using the pscores.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 80


  81. Experimental procedure of the real world experiment
    1. We first run two different policies 𝜋_𝐴 and 𝜋_𝐵 to construct datasets 𝐷_𝐴 and 𝐷_𝐵.
    Here, we use 𝐷_𝐴 to estimate 𝑉(𝜋_𝐵) by OPE.
    2. Then, we approximate the ground-truth policy value 𝑉(𝜋_𝐵) by on-policy estimation on 𝐷_𝐵.
    3. To see the distribution of errors, we duplicate the dataset using bootstrap sampling.
    4. Finally, we calculate the squared errors on the bootstrapped datasets 𝐷_𝐴′.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 81


  82. Other related work
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 82


  83. Inspiration from Reinforcement Learning (RL)
    DR leverages the recursive structure of Markov Decision Process (MDP).
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 83
    (recursive derivation: a baseline estimation of the policy value after visiting 𝒙_𝒍,
    with importance weighting applied only to the residual)
    [Jiang&Li, 16] [Thomas&Brunskill, 16]
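    For reference, the DR recursion for RL takes the following form ([Jiang&Li, 16]; my notation):
    \hat{V}^{\mathrm{DR}}_{t} := \hat{V}(s_t) + \rho_t \bigl( r_t + \gamma \hat{V}^{\mathrm{DR}}_{t+1} - \hat{Q}(s_t, a_t) \bigr), \qquad \rho_t = \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}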


  84. Causal similarity between MDP and Cascade asm.
    The cascade assumption can be interpreted as a special case of an MDP.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 84


  85. PseudoInverse (PI) [Swaminathan+, 17]
    • Designed for the situation where the slot-level rewards are unobservable.
    • Implicitly relies on the independence assumption.
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 85


  86. References
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 86


  87. References (1/2)
    [Precup+, 00] Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy
    Policy Evaluation.” ICML, 2000.
    https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1079&context=cs_faculty_pubs
    [Strehl+, 10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit
    Exploration Data.” NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Li+, 18] Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, and Zheng
    Wen. “Offline Evaluation of Ranking Policies with Click Models.” KDD, 2018.
    https://arxiv.org/abs/1804.10488
    [McInerney+, 20] James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Ben
    Carterette. “Counterfactual Evaluation of Slate Recommendations with Sequential Reward
    Interactions.” KDD, 2020. https://arxiv.org/abs/2007.12986
    [Dudík+, 14] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. “Doubly Robust Policy
    Evaluation and Optimization.” Statistical Science, 2014. https://arxiv.org/abs/1503.02834
    [Jiang&Li, 16] Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement
    Learning.” ICML, 2016. https://arxiv.org/abs/1511.03722
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 87


  88. References (2/2)
    [Thomas&Brunskill, 16] Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy
    Evaluation for Reinforcement Learning.” ICML, 2016. https://arxiv.org/abs/1604.00923
    [Saito+, 21a] Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit
    Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” NeurIPS, 2021.
    https://arxiv.org/abs/2008.07146
    [Saito+, 21b] Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, Kei Tateno.
    “Evaluating the Robustness of Off-Policy Evaluation.” RecSys, 2021. https://arxiv.org/abs/2108.13703
    [Swaminathan+, 17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John
    Langford, Damien Jose, Imed Zitouni. “Off-policy evaluation for slate recommendation.” NeurIPS, 2017.
    https://arxiv.org/abs/1605.04812
    July 2022 Cascade Doubly Robust Off-Policy Evaluation @ CFML勉強会 88
