[WWW'24] Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction

Haruka Kiyohara

April 15, 2024
Transcript

  1. Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. Haruka Kiyohara¹, Masahiro Nomura², Yuta Saito¹ (¹Cornell University, ²CyberAgent AI Lab). April 2024.
  2. Slate recommendations. In online ads or medicine, we often optimize combinatorial actions called slates. (Example ad: "Start a convenient life with smart devices", "20% discounts!"; Decision 1: slogan, Decision 2: key visual, Decision 3: discount rate.) Our goal is to evaluate the slate policy using logged data.
  3. Data generation process and objectives
    • context $x \in X$: e.g., a user's demographic profile
    • slate $s = (a_1, a_2, \cdots, a_L) \in S = \prod_{l \in [L]} A_l$, with slot action $a_l \in A_l$; e.g., in an email promotion, $A_1$ can be a set of subject lines and $A_2$ a set of candidate visuals
    • reward $r$: e.g., a click or purchase
  4. Data generation process and objectives (continued, same setup as above). The goal of Off-Policy Evaluation (OPE) is to accurately estimate the policy value $V(\pi) := \mathbb{E}_{p(x)\pi(s|x)}[q(x, s)]$, where $q(x, s) := \mathbb{E}[r \mid x, s]$, using the logged data collected by a logging policy $\pi_0$.
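As a concrete illustration of this setup, here is a minimal Python sketch of logged-data generation. The slot counts, logging policy, and reward function are hypothetical stand-ins for illustration only, not the paper's experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, n = 3, 5, 1000  # slots per slate, candidate actions per slot, logged interactions

def pi_0_sample(x):
    """Logging policy pi_0(s|x): a stand-in that picks each slot action
    uniformly at random; a real logger would condition on the context x."""
    return tuple(rng.integers(K, size=L))

def true_q(x, s):
    """q(x, s) = E[r | x, s]: a toy non-linear expected reward for illustration."""
    return 1.0 / (1.0 + np.exp(-(0.5 * x + s[0] - 0.3 * s[1] * s[2])))

# logged dataset D = {(x_i, s_i, r_i)}_{i=1}^n collected by pi_0
xs = rng.normal(size=n)
data = [(x, s, rng.binomial(1, true_q(x, s)))
        for x, s in ((x, pi_0_sample(x)) for x in xs)]
```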
  5. Distribution shift and importance sampling [Strehl+,10]. Because $\pi_0$ and $\pi$ are different, we need to address the distribution shift. (Figure: the logging policy $\pi_0$ and the evaluation policy $\pi$ assign different probabilities to rankings A and B, one showing ranking A more and ranking B less, the other the reverse.)
  6. Distribution shift and importance sampling [Strehl+,10] (continued). Importance sampling reweights the logged rewards to correct for this shift, and the resulting estimate is unbiased.
  7. Distribution shift and importance sampling [Strehl+,10] (continued). Although unbiased, the slate-level importance weight leads to extremely high variance.
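A minimal sketch of the vanilla IPS estimator described above; `pi_e` and `pi_0` are hypothetical callables returning the slate-level probabilities π(s|x) and π0(s|x).

```python
def ips_estimate(data, pi_e, pi_0):
    """Vanilla IPS over slates: reweight each logged reward by the slate-level
    importance weight pi_e(s|x) / pi_0(s|x). Unbiased whenever pi_0 covers the
    support of pi_e, but the weight ranges over the combinatorial slate space,
    which is what makes the variance explode."""
    return sum(pi_e(s, x) / pi_0(s, x) * r for x, s, r in data) / len(data)
```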
  8. Linearity assumption and PseudoInverse (PI) [Swaminathan+,17]. To deal with the variance issue, PI adopts the following linearity assumption: expected reward of an email = value of title + value of coupon + value of visual + …
  9. Linearity assumption and PseudoInverse (PI) [Swaminathan+,17] (continued). Under this assumption, PI estimates the policy value as $\hat{V}_{\mathrm{PI}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \big( \sum_{l=1}^{L} \frac{\pi(a_{i,l} \mid x_i)}{\pi_0(a_{i,l} \mid x_i)} - (L - 1) \big) r_i$: the importance weight (iw) is no longer the product but is instead the sum of slot-wise iw.
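A sketch of the PI estimate under the linearity assumption, assuming factored policies; `pi_e_slot(l, a, x)` and `pi_0_slot(l, a, x)` are hypothetical slot-level probability callables.

```python
def pi_estimate(data, pi_e_slot, pi_0_slot, L):
    """PseudoInverse (PI): under linearity, the slate weight becomes the SUM
    of slot-wise importance ratios (minus L - 1), avoiding the exploding
    product that vanilla IPS uses."""
    total = 0.0
    for x, s, r in data:
        w = sum(pi_e_slot(l, s[l], x) / pi_0_slot(l, s[l], x)
                for l in range(L)) - (L - 1)
        total += w * r
    return total / len(data)
```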
  10. Summary of existing estimators
    • Inverse Propensity Scoring (IPS), no assumption: (pros) unbiased without any assumption on the reward; (cons) extremely high variance due to the large action space.
    • PseudoInverse (PI), linearity assumption: (pros) reduces variance compared to IPS; (cons) induces bias when linearity does not hold.
    Can we reduce the variance of IPS without relying on a restrictive assumption?
  11. Key idea: using latent abstraction. To deal with the large slate space without introducing reward assumptions, we apply importance sampling in the latent abstraction space. Key features: • leverages similarity among different slates • reduces variance while keeping bias small, without a restrictive assumption. (Figure: an encoder maps slate features to a latent abstraction and a decoder maps it back to slate features and the reward; importance sampling is applied on the latent abstraction.)
  12. Our proposal: Latent IPS (LIPS). Using a slate abstraction function $\phi: S \rightarrow Z$, we define Latent IPS as $\hat{V}_{\mathrm{LIPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(\phi(s_i) \mid x_i)}{\pi_0(\phi(s_i) \mid x_i)} r_i$, where $\pi(z \mid x) = \sum_{s: \phi(s) = z} \pi(s \mid x)$ defines the importance weight on the latent abstraction.
  13. Our proposal: Latent IPS (LIPS) (continued). Generalizing this to a context-dependent and stochastic abstraction $\phi_\theta(z \mid x, s)$, we also have the analogous estimator with the latent importance weight $\pi(z \mid x) / \pi_0(z \mid x)$, where $\pi(z \mid x) = \sum_{s} \pi(s \mid x)\, \phi_\theta(z \mid x, s)$.
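A brute-force sketch of LIPS with a deterministic abstraction φ. The marginal latent probabilities are computed by enumerating the slate space, which is only feasible in toy problems; the `slate_space` iterable and the probability callables are hypothetical, following the earlier sketches.

```python
def lips_estimate(data, phi, pi_e, pi_0, slate_space):
    """Latent IPS (LIPS): apply importance weighting on z = phi(s) instead of
    on the raw slate. The marginal policy over the latent space is
    pi(z|x) = sum of pi(s|x) over {s : phi(s) = z}."""
    def marginal(pi, z, x):
        return sum(pi(s, x) for s in slate_space if phi(s) == z)
    total = 0.0
    for x, s, r in data:
        z = phi(s)
        total += marginal(pi_e, z, x) / marginal(pi_0, z, x) * r
    return total / len(data)
```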
  14. Key properties of LIPS (1/3). LIPS is unbiased when the following condition of sufficient slate abstraction holds. (Sufficient slate abstraction) A slate abstraction function $\phi_\theta$ is said to be "sufficient" if it satisfies $q(x, s) = q(x, s')$ for all $x \in X$, $s \in S$, and $s' \in \{ s'' \in S \mid \phi_\theta(s) = \phi_\theta(s'') \}$. (For example, a one-hot embedding of the original slate is a sufficient slate abstraction.)
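On small problems the sufficiency condition can be checked directly; a sketch, assuming access to the true expected reward `q` and an enumerable `slate_space` (both hypothetical helpers).

```python
from collections import defaultdict

def is_sufficient(phi, q, x, slate_space, tol=1e-8):
    """Sufficient slate abstraction: within every cluster {s : phi(s) = z},
    all slates must share the same expected reward q(x, s)."""
    clusters = defaultdict(list)
    for s in slate_space:
        clusters[phi(s)].append(q(x, s))
    return all(max(qs) - min(qs) <= tol for qs in clusters.values())
```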
  15. Key properties of LIPS (2/3). Even when a sufficient abstraction does not hold, the bias of LIPS can be derived in closed form; the expression decomposes into three factors ① ② ③, explained on the next slide.
  16. Key properties of LIPS (2/3, continued). ① identifiability of the slates from their abstractions; ② difference in the expected rewards between a pair of slates; ③ difference in the slate-wise importance weights between a pair of slates. The bias can be kept small with a fine-grained abstraction.
  17. Key properties of LIPS (3/3). Given a slate abstraction function $\phi_\theta$, LIPS achieves a variance reduction compared to naive IPS that is governed by ① the variance of the reward conditioned on the slate abstraction and ② the variance of the slate-wise iw with respect to the conditional distribution $\pi_0(s \mid x, \phi_\theta(s))$. The variance can be kept small with a coarse-grained abstraction.
  18. Bias-variance tradeoff. The granularity of abstraction is a key factor in the bias-variance tradeoff. (Figure: moving from a coarse-grained to a fine-grained abstraction decreases the bias but increases the variance.)
  19. Bias-variance tradeoff (continued). Thus, we can minimize the mean-squared error (MSE) by optimizing the abstraction: $\theta^* = \arg\min_{\theta} \mathrm{MSE}\big(\hat{V}_{\mathrm{LIPS}}(\pi; \phi_\theta)\big)$.
  20. How to optimize the slate abstraction? We use the following loss function to balance the bias and variance of LIPS, with a hyperparameter β controlling the tradeoff.
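The exact loss appears on the slide image; the sketch below is an illustrative stand-in only, not the paper's objective. It shows the general shape of such a loss under the slide's description: a reward-prediction term that keeps the abstraction informative (controlling bias), plus a β-weighted regularizer that pushes toward a coarser latent (controlling variance). The architecture and the regularizer are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AbstractionModel(nn.Module):
    """Hypothetical encoder for phi_theta; not the paper's architecture."""
    def __init__(self, slate_dim, z_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(slate_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
        self.reward_head = nn.Linear(z_dim, 1)

    def loss(self, slate_feat, reward, beta):
        z = self.encoder(slate_feat)
        # bias proxy: z should retain enough information to predict the reward
        bias_term = F.mse_loss(self.reward_head(z).squeeze(-1), reward)
        # variance proxy (stand-in): shrink the latent toward a coarser code
        variance_term = z.pow(2).mean()
        return bias_term + beta * variance_term
```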
  21. Semi-synthetic experiments
    • Experiments on two datasets: Wiki10-31K and Eurlex-4K.
    • Three non-linear reward functions: co-occurrence, conditional reward, and a reward where a representative slot matters.
    ※ $(a_1, a_2, \cdots, a_{[L/2]})$ is a sufficient slate abstraction.
  22. Compared estimators
    • Inverse Propensity Scoring (IPS)
    • PseudoInverse (PI)
    • LIPS (w/ data-driven choice of β)
    • LIPS (w/ best β)
  23. Compared estimators (continued)
    • Inverse Propensity Scoring (IPS)
    • PseudoInverse (PI)
    • Direct Method (DM): a regression-based approach [Beygelzimer&Langford,09]
    • Marginalized IPS: uses the iw of a sufficient slate abstraction, $(a_1, a_2, \cdots, a_{[L/2]})$ [Saito&Joachims,22]
    • LIPS (w/ data-driven choice of β)
    • LIPS (w/ best β)
  24. Evaluation metrics and configurations. We compare the mean-squared error (MSE), $\mathrm{MSE}(\hat{V}) := \mathbb{E}\big[ (\hat{V}(\pi; D) - V(\pi))^2 \big]$, estimated from logged data generated with 50 different random seeds.
  25. Evaluation metrics and configurations (continued). We compare the MSE (estimated from logged data generated with 50 different random seeds) across varying configurations: • data sizes (n): {1000, 2000, 4000, 8000, 16000} • lengths of slate (L): {4, 6, 8, 10, 12} (one value in each set serves as the default for the other experiments).
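The evaluation protocol amounts to the following loop; `estimator` and `make_logged_data` are hypothetical helpers, and `v_true` is the ground-truth policy value computed from the semi-synthetic simulator.

```python
import numpy as np

def mse_over_seeds(estimator, make_logged_data, v_true, n_seeds=50):
    """Empirical MSE of an OPE estimator: regenerate the logged data under
    each random seed, estimate the policy value, and average the squared
    error against the ground-truth V(pi)."""
    sq_errs = [(estimator(make_logged_data(seed)) - v_true) ** 2
               for seed in range(n_seeds)]
    return float(np.mean(sq_errs))
```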
  26. Results with varying data sizes (n). The proposed LIPS is accurate even when the data size is small.
  27. Results with varying lengths of slate (L). The proposed LIPS is accurate across varying slate lengths and reward functions.
  28. Takeaways
    • We studied OPE of slate bandit policies, which involves combinatorial actions.
    • Baseline estimators suffer either from variance caused by the large action space or from bias caused by a restrictive linearity assumption.
    • We proposed to apply importance weighting in a latent abstraction space, together with a way to learn an abstraction that reduces both bias and variance.
    • Semi-synthetic experiments demonstrate that the proposed LIPS performs well across various data sizes and slate lengths.
  29. Comparing with DR-type estimators. Doubly Robust (DR) estimators combine importance sampling with a regression model that serves as a control variate [Dudík+,14] [Vlassis+,21] [Saito+,23].
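A generic slate-level DR sketch (the cited works differ in how they construct the control variate); `q_hat(x, s)` is a hypothetical regression model and `expected_q_hat(x)` its expectation under the evaluation policy.

```python
def dr_estimate(data, pi_e, pi_0, q_hat, expected_q_hat):
    """Doubly Robust (DR): the regression model acts as a control variate.
    The importance-weighted term only corrects the residual r - q_hat(x, s),
    which lowers variance when q_hat is accurate while retaining unbiasedness
    if either the regression or the weights are correct."""
    total = 0.0
    for x, s, r in data:
        w = pi_e(s, x) / pi_0(s, x)
        total += expected_q_hat(x) + w * (r - q_hat(x, s))
    return total / len(data)
```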
  30. Results with varying data sizes (n). The proposed LIPS is accurate even when the data size is small.
  31. Results with varying lengths of slate (L). The proposed LIPS is accurate across varying slate lengths and reward functions.
  32. Ablations with varying values of β. We can see a tradeoff here: bias is small when β is small, and variance is small when β is large.
  33. References (1/2)
    [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, Lihong Li. "Learning from Logged Implicit Exploration Data." NeurIPS, 2010. https://arxiv.org/abs/1003.0120
    [Swaminathan+,17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. "Off-Policy Evaluation for Slate Recommendation." NeurIPS, 2017. https://arxiv.org/abs/1605.04812
    [Beygelzimer&Langford,09] Alina Beygelzimer, John Langford. "The Offset Tree for Learning with Partial Labels." KDD, 2009. https://arxiv.org/abs/0812.4044
    [Saito&Joachims,22] Yuta Saito, Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Embeddings." ICML, 2022. https://arxiv.org/abs/2202.06317
    [Dudík+,14] Miroslav Dudík, Dumitru Erhan, John Langford, Lihong Li. "Doubly Robust Policy Evaluation and Optimization." Statistical Science, 2014. https://arxiv.org/abs/1503.02834
  34. References (2/2)
    [Vlassis+,21] Nikos Vlassis, Ashok Chandrashekar, Fernando Amat Gil, Nathan Kallus. "Control Variates for Slate Off-Policy Evaluation." NeurIPS, 2021. https://arxiv.org/abs/2106.07914
    [Saito+,23] Yuta Saito, Qingyang Ren, Thorsten Joachims. "Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling." ICML, 2023. https://arxiv.org/abs/2305.08062