[Guest lecture Fall'25] Off-policy evaluation and learning in "CS6784: ML in feedback systems" at Cornell

The slides used for the guest lecture in the ML in feedback systems class (CS6784) at Cornell.

video recording: https://vod.video.cornell.edu/media/Guest+Lecture%3A+Off-Policy+Evaluation+and+Learning+%28ML+In+Feedback+Sys+F25%29+/1_eyiazrlc
class info: https://github.com/ml-feedback-sys/materials-f25/tree/main

----

The main content is (1) an intro to off-policy evaluation and learning (OPE/L) and (2) a research example of OPL for sentence personalization. I also briefly described my research interests related to "ML in feedback systems" topics and mentioned the following papers in the lecture.

OPL for sentence personalization
paper: https://arxiv.org/abs/2504.02646
slides: https://speakerdeck.com/harukakiyohara_/opl-prompt

Steering systems for long-term objectives
paper: https://arxiv.org/abs/2502.01792
slides: https://speakerdeck.com/harukakiyohara_/dynamics-two-stage-rec

Scalable and adaptable RecSys under practical constraints
(on-going, workshop) paper: https://drive.google.com/file/d/1pc7aa5dvv9cpMRnbUDaeh9-dKP6wzC5J/view?usp=drive_link

----

My research statement is available here.
https://drive.google.com/file/d/1LqONxB8Qw4Z0GSUavAwSSV_oCANcl9TI/view?usp=sharing

Haruka Kiyohara

October 23, 2025

Transcript

  1. Off-policy evaluation and learning with a research example in real-life

    Haruka Kiyohara (hk844 [at] cornell.edu) CS6784: ML in feedback systems @ Cornell October 2025 Off-policy evaluation and learning @ CS6784 1 (Guest lecture in ML in feedback systems)
  2. Self-introduction • 3rd year Ph.D. student at Cornell CS (advised

    by Thorsten Joachims and Sarah Dean) • If you know me, I’m a TA of this class. • ML and RecSys research (e.g., ICML, NeurIPS, ICLR and KDD, WSDM, RecSys) • Funai Overseas Scholarship (2023-2025) • Quad Fellowship (2025-2026) October 2025 Off-policy evaluation and learning @ CS6784 2 Haruka Kiyohara
  3. Self-introduction • 3rd year Ph.D. student at Cornell CS (advised

    by Thorsten Joachims and Sarah Dean) • If you know me, I’m a TA of this class. • ML and RecSys research (e.g., ICML, NeurIPS, ICLR and KDD, WSDM, RecSys) Guest lecturer in today’s class! October 2025 Off-policy evaluation and learning @ CS6784 3 Haruka Kiyohara
  4. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. October 2025 Off-policy evaluation and learning @ CS6784 4
  5. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. My research is especially related to contextual bandits/RL. October 2025 Off-policy evaluation and learning @ CS6784 5
  6. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. My research is especially related to contextual bandits/RL. “off-policy evaluation and learning” October 2025 Off-policy evaluation and learning @ CS6784 6
  7. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (2) how to steer systems for long-term success (3) how to build a scalable and adaptable RecSys October 2025 Off-policy evaluation and learning @ CS6784 7
  8. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 8
  9. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 9 Main topic
  10. Today’s topic • Intro to off-policy evaluation and learning (OPE/OPL)

    • Recent work on OPL for personalized sentence generation • (If I have time..) showcasing projects in the other two research topics October 2025 Off-policy evaluation and learning @ CS6784 10
  11. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 12 recommendation/ranking … • Music streaming • Search engine • Fashion e-commerce • News platform • SNS..
  12. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 13 recommendation/ranking … a coming user
  13. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 14 recommendation/ranking … a coming user clicks
  14. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 15 recommendation/ranking … a coming user context clicks reward action 𝑎 contextual bandits
  15. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 16 recommendation/ranking … a policy a coming user context clicks reward action 𝑎
  16. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 17 recommendation/ranking … a policy ▼ evaluate this one a coming user context clicks reward action 𝑎
  17. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. October 2025 Off-policy evaluation and learning @ CS6784 18
  18. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. October 2025 Off-policy evaluation and learning @ CS6784 19 Online A/B tests are a straightforward evaluation, but.. online testing may harm user experience when the policy performs poorly..
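Note: the policy value on this slide is shown as an image; for reference, the standard definition used in OPE, consistent with the contextual-bandit setup above (context x, action a, reward r), is:

```latex
% Standard definition of the policy value (reference notation, not taken from the slide)
V(\pi)
= \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(a \mid x)}\, \mathbb{E}_{r \sim p(r \mid x, a)} [\, r \,]
= \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(a \mid x)} [\, q(x, a) \,],
\qquad q(x, a) := \mathbb{E}[\, r \mid x, a \,].
```

An online A/B test estimates this quantity by actually deploying the policy, which is exactly what OPE tries to avoid.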
  19. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 20 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  20. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 21 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  21. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 22 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward)
  22. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 23 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward) “bandit” in that the reward is observed only for the actions chosen by the logging policy = no “counterfactual” outcome
  23. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 24 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward) “bandit” in that the reward is observed only for the actions chosen by the logging policy = no “counterfactual” outcome an evaluation policy
  24. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 25 a logging policy an evaluation policy OPE estimator
  25. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 26 Let’s take the empirical average!
  26. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 27 Let’s take the empirical average! What is the issue of this estimator?
  27. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 28 [figure: action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more] bias caused by distribution shift
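Note: the estimator on this slide is presumably the plain empirical average of logged rewards; written out in the notation above, it converges to the value of the logging policy rather than the evaluation policy, which is the distribution-shift bias illustrated here:

```latex
% Naive empirical average over logged data D = {(x_i, a_i, r_i)}_{i=1}^n collected by \pi_0
\hat{V}_{\mathrm{avg}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} r_i,
\qquad
\mathbb{E}_{\mathcal{D} \sim \pi_0}\big[ \hat{V}_{\mathrm{avg}} \big] = V(\pi_0) \neq V(\pi_e) \text{ in general.}
```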
  28. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 29 [figure: action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more] correcting the distribution shift with the importance weight ・unbiased
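Note: a minimal sketch of the importance sampling (IPS) estimator described on this slide, assuming logged (context, action, reward) tuples and access to both policies' action-choice probabilities; the function names and interfaces below are hypothetical, not from the lecture:

```python
import numpy as np

def ips_estimate(logged_data, pi_e_prob, pi_0_prob):
    """Importance sampling (IPS) estimate of the evaluation policy's value.

    logged_data: iterable of (context, action, reward) tuples collected by pi_0.
    pi_e_prob(x, a), pi_0_prob(x, a): probability that the evaluation / logging
    policy chooses action a in context x (hypothetical interfaces).
    """
    values = []
    for x, a, r in logged_data:
        w = pi_e_prob(x, a) / pi_0_prob(x, a)  # importance weight corrects the distribution shift
        values.append(w * r)
    return float(np.mean(values))  # unbiased for V(pi_e) when pi_0 has full support
```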
  29. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 30 [figure: action A: evaluation more, logging less] when the importance weight is large ・unbiased ・variance
  30. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 31 [figure: action A: evaluation more, logging less] when the importance weight is large ・unbiased ・variance How should we apply IS efficiently? => Many OPE estimators have been invented by the community (better bias-variance)
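Note: two standard examples of such estimators (named here only for illustration, not necessarily the ones covered in the lecture) are weight clipping and self-normalized IPS, both of which accept a small bias in exchange for lower variance:

```latex
% w_i := \pi_e(a_i \mid x_i) / \pi_0(a_i \mid x_i) is the importance weight
\hat{V}_{\mathrm{clip}} = \frac{1}{n} \sum_{i=1}^{n} \min\{ w_i, \lambda \}\, r_i,
\qquad
\hat{V}_{\mathrm{SNIPS}} = \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i}.
```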
  31. Questions so far? October 2025 Off-policy evaluation and learning @

    CS6784 32 • OPE aims to evaluate a (new) policy using logged data collected by a different policy. • Importance sampling is a default approach, but the variance issue is problematic. • Many OPE estimators have been invented to achieve better bias-variance tradeoffs. Intro to off-policy evaluation and learning (OPE/L)
  32. Questions so far? Next >> Presenting an efficient OPL approach

    along with a practical application October 2025 Off-policy evaluation and learning @ CS6784 33 • OPE aims to evaluate a (new) policy using logged data collected by a different policy. • Importance sampling is a default approach, but the variance issue is problematic. • Many OPE estimators have been invented to achieve better bias-variance tradeoffs. Intro to off-policy evaluation and learning (OPE/L)
  33. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

    Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims October 2025 Off-policy evaluation and learning @ CS6784 34 (Presented at RecSys2025)
  34. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 35 “WALL-E (2008)”
  35. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 36 “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated film with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  36. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 37 For sci-fi lovers, In the distant future, one little robot sparked a cosmic revolution. For romance lovers, In a lonely world, a small robot discovers the power of connection. We’d like to personalize the sentence to each user. “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated film with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  37. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E October 2025 Off-policy evaluation and learning @ CS6784 38
  38. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious October 2025 Off-policy evaluation and learning @ CS6784 39
  39. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” October 2025 Off-policy evaluation and learning @ CS6784 40
  40. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” • reward 𝑟 • click, purchase October 2025 Off-policy evaluation and learning @ CS6784 41
  41. Our goal is to optimize the policy to maximize the

    total reward: Goal of Off-Policy Learning (OPL) October 2025 Off-policy evaluation and learning @ CS6784 42
  42. Our goal is to optimize the policy to maximize the

    total reward: , using the logged data collected by a logging policy 𝜋0 . Goal of Off-Policy Learning (OPL) October 2025 Off-policy evaluation and learning @ CS6784 43 need to deal with the partial rewards and distribution shift
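Note: the objective on this slide is shown as an image; written out roughly, in notation following the previous slides (not taken verbatim from the paper), the OPL goal is:

```latex
% OPL objective: maximize the policy value using only data logged under \pi_0
\theta^{*} = \arg\max_{\theta}\; V(\pi_{\theta}),
\qquad
V(\pi_{\theta}) = \mathbb{E}_{(u, q)}\, \mathbb{E}_{a \sim \pi_{\theta}(a \mid u, q)}\,
\mathbb{E}_{s \sim p_{\mathrm{LLM}}(s \mid q, a)} \big[\, \mathbb{E}[\, r \mid u, q, s \,] \,\big],
```

given only the logged data D = {(u_i, q_i, a_i, s_i, r_i)}_{i=1}^n collected by the logging policy, which is where the partial rewards and the distribution shift come in.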
  43. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 44 (true action policy gradient) 𝜃: policy parameter
  44. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 45 (true action policy gradient) 𝜃: policy parameter This is just like using gradient descent for squared error minimization for the regression task.
  45. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 46 (true action policy gradient) 𝜃: policy parameter This is just like using gradient descent for squared error minimization for the regression task. from Sarah’s lecture slides 𝑟 can be −𝑐 (max. reward = min. cost)
  46. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 47 (true action policy gradient) 𝜃: policy parameter
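Note: the “true action policy gradient” on these slides is shown as an image; in standard log-derivative (REINFORCE-style) notation it reads:

```latex
% True action policy gradient via the log-derivative trick (standard form)
\nabla_{\theta} V(\pi_{\theta})
= \mathbb{E}_{x}\, \mathbb{E}_{a \sim \pi_{\theta}(a \mid x)}
\big[\, q(x, a)\, \nabla_{\theta} \log \pi_{\theta}(a \mid x) \,\big],
```

which follows from the identity \nabla_{\theta} \pi_{\theta}(a \mid x) = \pi_{\theta}(a \mid x) \nabla_{\theta} \log \pi_{\theta}(a \mid x); the OPL methods on the next slides differ in how they estimate this expectation from logged data.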
  47. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 48 regression-based [Konda&Tsitsiklis,99] importance sampling-based [Swaminathan&Joachims,16] • imputes the regressed reward • introduces bias when the regression is inaccurate • (regression is often demanding due to partial rewards and covariate shift) • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts)
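Note: a rough sketch of the two naive gradient estimators (regression-based and importance-sampling-based) for a generic differentiable policy; the helper interfaces below are hypothetical, and the precise estimators are given in the cited references:

```python
import numpy as np

def is_policy_gradient(logged_data, pi_theta_prob, pi_0_prob, grad_log_pi_theta):
    """Importance-sampling-based estimate of the action policy gradient.

    logged_data: iterable of (x, a, r) tuples collected under pi_0.
    grad_log_pi_theta(x, a): score function grad_theta log pi_theta(a | x) as a vector.
    """
    grads = []
    for x, a, r in logged_data:
        w = pi_theta_prob(x, a) / pi_0_prob(x, a)  # unbiased, but high variance when w is large
        grads.append(w * r * grad_log_pi_theta(x, a))
    return np.mean(grads, axis=0)

def regression_policy_gradient(contexts, candidate_actions, q_hat, pi_theta_prob, grad_log_pi_theta):
    """Regression-based estimate that imputes rewards with a fitted model q_hat(x, a).

    Avoids importance weights, but is biased whenever q_hat is inaccurate.
    """
    grads = []
    for x in contexts:
        for a in candidate_actions:  # expectation over pi_theta taken exactly over the prompt set
            grads.append(pi_theta_prob(x, a) * q_hat(x, a) * grad_log_pi_theta(x, a))
    return np.sum(grads, axis=0) / len(contexts)
```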
  48. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 49 • correct the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is to treat each prompt independently. But, we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? important sampling-based [Swaminathan&Joachims,16]
  49. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 50 • correct the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is to treat each prompt independently. But, we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? important sampling-based [Swaminathan&Joachims,16]
  50. How to leverage similarities among sentences? October 2025 Off-policy evaluation

    and learning @ CS6784 51 A. Take the gradient directly in the sentence space.
  51. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. October 2025 Off-policy evaluation and learning @ CS6784 52 (true sentence policy gradient) gradient w.r.t. sentence distribution however, the issue is that the original sentence space is high dimensional..
  52. How to leverage similarities among sentences? We consider estimating the

    following gradient in the marginalized sentence space. October 2025 Off-policy evaluation and learning @ CS6784 53 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  53. How to leverage similarities among sentences? We consider estimating the

    following gradient in the marginalized sentence space. where October 2025 Off-policy evaluation and learning @ CS6784 54 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  54. We estimate the sentence policy gradient using logged data as

    follows. How can we actually estimate/implement the weighted score function? Direct Sentence Off-policy gradient (DSO) October 2025 Off-policy evaluation and learning @ CS6784 55 (estimating marginalized sentence policy gradient)
  55. Estimation of the weighted score function We use the following

    re-sampling technique. October 2025 Off-policy evaluation and learning @ CS6784 56 See Appendix for the derivation.
  56. Estimation of the weighted score function We use the following

    re-sampling technique. , which suggests that ① DSO does implicit data augmentation via resampling (𝑎, 𝑠′) from the policy 𝜋𝜃 . ② DSO uses soft rejection sampling using the kernel weight. ③ DSO corrects the logging distribution in the marginalized sentence space. October 2025 Off-policy evaluation and learning @ CS6784 57 ① ② ③ See Appendix for the derivation.
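Note: the following is only a rough sketch consistent with the three properties ①–③ above, not the exact DSO estimator (which, together with its derivation, is in the paper and its appendix); every helper function is a hypothetical interface:

```python
import numpy as np

def dso_style_gradient(logged_data, sample_from_policy, grad_log_pi_theta,
                       embed, kernel, marginal_weight, n_resample=16):
    """Rough DSO-style gradient sketch (illustrative only, not the paper's estimator).

    logged_data: iterable of (x, a, s, r) tuples collected under pi_0, with x = (user, query).
    sample_from_policy(x): draws a prompt/sentence pair (a', s') from pi_theta and the frozen LLM.
    kernel(e1, e2): similarity weight between two sentence embeddings.
    marginal_weight(x, s): importance weight defined in the marginalized sentence space.
    """
    grads = []
    for x, a, s, r in logged_data:
        w = marginal_weight(x, s)                   # (3) correct the logging distribution
        weighted_scores, kernel_mass = [], []
        for _ in range(n_resample):                 # (1) implicit data augmentation via resampling
            a_new, s_new = sample_from_policy(x)
            k = kernel(embed(s_new), embed(s))      # (2) soft rejection via the kernel weight
            weighted_scores.append(k * grad_log_pi_theta(x, a_new))
            kernel_mass.append(k)
        if np.sum(kernel_mass) > 0:
            score = np.sum(weighted_scores, axis=0) / np.sum(kernel_mass)
            grads.append(w * r * score)
    return np.mean(grads, axis=0) if grads else None
```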
  57. Theoretical analysis; summary ① DSO is less likely to incur

    deficient support. ② DSO has small bias when kernel bandwidth 𝛕 is small. ③ DSO has a large variance reduction when kernel bandwidth 𝛕 is large. October 2025 Off-policy evaluation and learning @ CS6784 58 See Appendix for the details.
  58. Theoretical analysis; bias-variance tradeoff ④ Kernel bandwidth 𝛕 plays an important

    role in controlling the bias-variance tradeoff. October 2025 Off-policy evaluation and learning @ CS6784 59
  59. Theoretical analysis; bias-variance tradeoff ④ Kernel bandwidth 𝛕 plays an important

    role in controlling the bias-variance tradeoff. October 2025 Off-policy evaluation and learning @ CS6784 60 DSO often achieves a better Pareto frontier of the bias-variance tradeoff than action-based IS.
  60. Synthetic experiments evaluation metric October 2025 Off-policy evaluation and learning

    @ CS6784 61 optimal policy uniform random compared methods • Regression [Konda&Tsitsiklis,99] • IS [Swaminathan&Joachims,16] • DR [Dudík+,11] • POTEC [Saito+,24] • DSO (ours) the higher, the better DR: hybrid of regression and IS POTEC: two-stage policy that uses the cluster of actions
  61. Synthetic experiments configurations • data sizes: {500, 1000, 2000, 4000,

    8000} • number of candidate prompts: {10, 50, 100, 500, 1000} • reward noises: {0.0, 1.0, 2.0, 3.0} • For DSO, we use the Gaussian kernel with 𝜏 = 𝟏. 𝟎. October 2025 Off-policy evaluation and learning @ CS6784 62 value: default value
  62. Results October 2025 Off-policy evaluation and learning @ CS6784 63

    • DSO works particularly well when the number of actions and the reward noise are large. • DSO is much more data-efficient than the baselines.
  63. MovieLens Results • DSO often performs better than other OPL

    methods. • Especially compared to those involving importance sampling (IS, DR, POTEC), DSO is more robust to performance degradation. Note: “policy value” is the improvement observed over the sentences generated without a prompt, which we call the no-prompt baseline. Experimental results are from 25 different trials. October 2025 Off-policy evaluation and learning @ CS6784 64
  64. Summary • We studied OPL for prompt-guided language personalization. •

    The key challenge is dealing with the large action space of prompts, and we proposed to leverage similarity among sentences via kernels. • DSO reduces variance by (1) applying IS in the marginalized sentence space and (2) applying implicit data augmentation via the re-sampling technique. • Experiments on synthetic/full-LLM environments demonstrate that DSO works well by reducing the variance while keeping the bias small. October 2025 Off-policy evaluation and learning @ CS6784 65
  65. Questions? October 2025 Off-policy evaluation and learning @ CS6784 66

    • OPL aims to learn a new policy using logged data collected by an old policy. • Importance sampling has a variance issue, and we resolve it by using similarity among sentences. • The Kernel-IS gradient estimator (ours) enables data-efficient OPL (better bias-variance). OPE/L and its application to sentence personalization
  66. Extra Next >> Showcasing some other research in ML in

    feedback systems October 2025 Off-policy evaluation and learning @ CS6784 67
  67. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 68
  68. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 69
  69. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 70 … … viewer provider
  70. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 71 … … provider Viewers receive content recommendations. If viewers are happy/dissatisfied with the content, they may increase/decrease participation. Providers supply content to the platform. If providers receive high/inadequate exposure, they may increase/decrease production. Many papers assume that both viewer and provider populations are static, but..
  71. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 72 … … provider How should we design content allocation to pursue long-term success? Applications include video streaming, online ads, job matching, SNS, and more! HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025.
  72. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 73
  73. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 74 candidate retrieval (i.e., fast screening) Large-scale Recsys Retrieval Augmented Generation (RAG)
  74. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we improve the early-stage retrieval process to enhance the overall quality of the two-stage decisions? September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 75 Large-scale Recsys less research More research
  75. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we present diverse items to users by improving the candidate retrieval process? September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 76 Large-scale Recsys less research More research In some applications, diversity is very important! (e.g., news recommendation, opinion/review summarization) HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two Stage Decisions. 2025.
  76. Thank you for listening! If you are interested in related

    topics, feel free to reach out to me! October 2025 Off-policy evaluation and learning @ CS6784 77
  77. Appendix for the RecSys paper >> Please refer to the

    following slides instead: https://speakerdeck.com/harukakiyohara_/opl-prompt October 2025 Off-policy evaluation and learning @ CS6784 78
  78. Reference (1/6) [SST20] Noveen Sachdeva, Yi Su, Thorsten Joachims. Off-policy

    Bandits with Deficient Support. KDD, 2020. [FDRC22] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, Paolo Cremonesi. Off-Policy Evaluation with Deficient Support Using Side Information. NeurIPS, 2022. [KSU23] Samir Khan, Martin Saveski, Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. 2023. [TGTRV21] Hung Tran-The, Sunil Gupta, Thanh Nguyen-Tang, Santu Rana, Svetha Venkatesh. Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support. 2021. [KKKKNS24] Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. ICLR, 2024. [UKNST24] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. AAAI, 2023. October 2025 Off-policy evaluation and learning @ CS6784 80
  79. Reference (2/6) [SSK20] Yi Su, Pavithra Srinath, Akshay Krishnamurthy. Adaptive

    Estimator Selection for Off-Policy Evaluation. ICML, 2020. [SAAC24] Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin. Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning. 2024. [CNSMBT21] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas. Universal Off-Policy Evaluation. NeurIPS, 2021. [HLLA21] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Contextual Bandits. NeurIPS, 2021. [HLLA22] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Markov Decision Processes. AISTATS, 2022. [WUS23] Runzhe Wu, Masatoshi Uehara, Wen Sun. Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML, 2023. [YJ22] Yuta Saito, Thorsten Joachims. Off-Policy Evaluation for Large Action Spaces via Embeddings. ICML, 2022. October 2025 Off-policy evaluation and learning @ CS6784 81
  80. Reference (3/6) [YRJ22] Yuta Saito, Qingyang Ren, Thorsten Joachims. Off-Policy

    Evaluation for Large Action Spaces via Conjunct Effect Modeling. ICML, 2023. [TDCT23] Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton. Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits. NeurIPS, 2023. [SWLKM24] Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, Julian McAuley. Off-Policy Evaluation for Large Action Spaces via Policy Convolution. WWW, 2024. [KSMNSY22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. WSDM, 2022. [KUNSYS23] Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. KDD, 2023. [KNS24] Haruka Kiyohara, Masahiro Nomura, Yuta Saito. Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. WWW, 2024. October 2025 Off-policy evaluation and learning @ CS6784 82
  81. Reference (4/6) [STKKNS24] Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka

    Kiyohara, Masahiro Nomura, Yuta Saito. Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits. RecSys, 2024. [SKADLJZ17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. Off-policy evaluation for slate recommendation. NeurIPS, 2017. [SAACL24] Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, Mounia Lalmas. Long- term Off-Policy Evaluation and Learning. WWW, 2024. [Beygelzimer&Langford,00] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009. [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. October 2025 Off-policy evaluation and learning @ CS6784 83
  82. Reference (5/6) [Konda&Tsitsiklis,99] Vijay Konda and John Tsitsiklis. Actor-critic algorithms.

    NeurIPS, 1999. [Swaminathan&Joachims,16] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2016. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. [Saito+,24] Yuta Saito, Jihan Yao, and Thorsten Joachims. Potec: Off-policy learning for large action spaces via two-stage policy decomposition. 2024. [Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020. October 2025 Off-policy evaluation and learning @ CS6784 84
  83. Reference (6/6) [Jiang+,23] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch,

    Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mistral 7b. 2023. [Sanh+,19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 2019. [Harper&Konstan,15] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. TIIS, 2015. October 2025 Off-policy evaluation and learning @ CS6784 85