[Guest lecture Fall'25] Off-policy evaluation and learning in "CS6784: ML in feedback systems" at Cornell

The slides used for the guest lecture in the ML in feedback systems class (CS6784) at Cornell.

video recording: https://vod.video.cornell.edu/media/Guest+Lecture%3A+Off-Policy+Evaluation+and+Learning+%28ML+In+Feedback+Sys+F25%29+/1_eyiazrlc
class info: https://github.com/ml-feedback-sys/materials-f25/tree/main

----

The main content is (1) an intro to off-policy evaluation and learning (OPE/L) and (2) a research example of OPL for sentence personalization. I also briefly described my research interests related to "ML in feedback systems" topics and mentioned the following papers in the lecture.

OPL for sentence personalization
paper: https://arxiv.org/abs/2504.02646
slides: https://speakerdeck.com/harukakiyohara_/opl-prompt

Steering systems for long-term objectives
paper: https://arxiv.org/abs/2502.01792
slides: https://speakerdeck.com/harukakiyohara_/dynamics-two-stage-rec

Scalable and adaptable RecSys under practical constraints
(on-going, workshop) paper: https://drive.google.com/file/d/1pc7aa5dvv9cpMRnbUDaeh9-dKP6wzC5J/view?usp=drive_link

----

My research statement is available here.
https://drive.google.com/file/d/1LqONxB8Qw4Z0GSUavAwSSV_oCANcl9TI/view?usp=sharing

Haruka Kiyohara

October 23, 2025

Transcript

  1. Off-policy evaluation and learning with a research example in real-life

    Haruka Kiyohara (hk844 [at] cornell.edu) CS6784: ML in feedback systems @ Cornell October 2025 Off-policy evaluation and learning @ CS6784 1 (Guest lecture in ML in feedback systems)
  2. Self-introduction • 3rd year Ph.D. student at Cornell CS (advised

    by Thorsten Joachims and Sarah Dean) • If you know me, I’m a TA of this class. • ML and RecSys research (e.g., ICML, NeurIPS, ICLR and KDD, WSDM, RecSys) • Funai Overseas Scholarship (2023-2025) • Quad Fellowship (2025-2026) October 2025 Off-policy evaluation and learning @ CS6784 2 Haruka Kiyohara
  3. Self-introduction • 3rd year Ph.D. student at Cornell CS (advised

    by Thorsten Joachims and Sarah Dean) • If you know me, I’m a TA of this class. • ML and RecSys research (e.g., ICML, NeurIPS, ICLR and KDD, WSDM, RecSys) Guest lecturer in today’s class! October 2025 Off-policy evaluation and learning @ CS6784 3 Haruka Kiyohara
  4. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. October 2025 Off-policy evaluation and learning @ CS6784 4
  5. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. My research is especially related to contextual bandits/RL. October 2025 Off-policy evaluation and learning @ CS6784 5
  6. What will I talk about today? We’ve learned many topics

    so far.. I will showcase what actual research in ML in feedback systems looks like. My research is especially related to contextual bandits/RL. “off-policy evaluation and learning” October 2025 Off-policy evaluation and learning @ CS6784 6
  7. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (2) how to steer systems for long-term success (3) how to build a scalable and adaptable RecSys October 2025 Off-policy evaluation and learning @ CS6784 7
  8. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 8
  9. Before going to the lecture.. Let me share my research

    interests, Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 9 Main topic
  10. Today’s topic • Intro to off-policy evaluation and learning (OPE/OPL)

    • Recent work on OPL for personalized sentence generation • (If I have time..) showcasing projects in the other two research topics October 2025 Off-policy evaluation and learning @ CS6784 10
  11. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 12 recommendation/ranking … • Music streaming • Search engine • Fashion e-commerce • News platform • SNS..
  12. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 13 recommendation/ranking … a coming user
  13. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 14 recommendation/ranking … a coming user clicks
  14. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 15 recommendation/ranking … a coming user context clicks reward action 𝑎 contextual bandits
  15. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 16 recommendation/ranking … a policy a coming user context clicks reward action 𝑎
  16. How does a recommender/ranking system work? October 2025 Off-policy evaluation

    and learning @ CS6784 17 recommendation/ranking … a policy ▼ evaluate this one a coming user context clicks reward action 𝑎
  17. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. October 2025 Off-policy evaluation and learning @ CS6784 18
  18. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. October 2025 Off-policy evaluation and learning @ CS6784 19 Online A/B tests are a straightforward evaluation, but.. online testing may harm user experience when the policy performs poorly..
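Note: the policy value on this slide is shown as an image; for reference, the standard definition used in OPE, consistent with the contextual-bandit setup above (context x, action a, reward r), is:

```latex
% Standard definition of the policy value (reference notation, not taken from the slide)
V(\pi)
= \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(a \mid x)}\, \mathbb{E}_{r \sim p(r \mid x, a)} [\, r \,]
= \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(a \mid x)} [\, q(x, a) \,],
\qquad q(x, a) := \mathbb{E}[\, r \mid x, a \,].
```

An online A/B test estimates this quantity by actually deploying the policy, which is exactly what OPE tries to avoid.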
  19. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 20 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  20. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 21 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  21. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 22 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward)
  22. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 23 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward) “bandit” in that the reward is observed only for the actions chosen by the logging policy = no “counterfactual” outcome
  23. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 24 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 “logged bandit data” • Which user (context) visited/observed • Which item (action) was presented • What was the user feedback (reward) “bandit” in that the reward is observed only for the actions chosen by the logging policy = no “counterfactual” outcome an evaluation policy
  24. Off-policy evaluation; OPE October 2025 Off-policy evaluation and learning @

    CS6784 25 a logging policy an evaluation policy OPE estimator
  25. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 26 Let’s take the empirical average!
  26. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 27 Let’s take the empirical average! What is the issue of this estimator?
  27. Now, let’s consider the following estimator October 2025 Off-policy evaluation

    and learning @ CS6784 28 [figure: action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more] bias caused by distribution shift
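Note: the estimator on this slide is presumably the plain empirical average of logged rewards; written out in the notation above, it converges to the value of the logging policy rather than the evaluation policy, which is the distribution-shift bias illustrated here:

```latex
% Naive empirical average over logged data D = {(x_i, a_i, r_i)}_{i=1}^n collected by \pi_0
\hat{V}_{\mathrm{avg}}(\pi_e; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} r_i,
\qquad
\mathbb{E}_{\mathcal{D} \sim \pi_0}\big[ \hat{V}_{\mathrm{avg}} \big] = V(\pi_0) \neq V(\pi_e) \text{ in general.}
```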
  28. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 29 [figure: action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more] correcting the distribution shift with the importance weight ・unbiased
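Note: a minimal sketch of the importance sampling (IPS) estimator described on this slide, assuming logged (context, action, reward) tuples and access to both policies' action-choice probabilities; the function names and interfaces below are hypothetical, not from the lecture:

```python
import numpy as np

def ips_estimate(logged_data, pi_e_prob, pi_0_prob):
    """Importance sampling (IPS) estimate of the evaluation policy's value.

    logged_data: iterable of (context, action, reward) tuples collected by pi_0.
    pi_e_prob(x, a), pi_0_prob(x, a): probability that the evaluation / logging
    policy chooses action a in context x (hypothetical interfaces).
    """
    values = []
    for x, a, r in logged_data:
        w = pi_e_prob(x, a) / pi_0_prob(x, a)  # importance weight corrects the distribution shift
        values.append(w * r)
    return float(np.mean(values))  # unbiased for V(pi_e) when pi_0 has full support
```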
  29. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 30 [figure: action A: evaluation more, logging less] when the importance weight is large ・unbiased ・variance
  30. Importance sampling [Strehl+,10] October 2025 Off-policy evaluation and learning @

    CS6784 31 [figure: action A: evaluation more, logging less] when the importance weight is large ・unbiased ・variance How should we apply IS efficiently? => Many OPE estimators have been invented by the community (better bias-variance)
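Note: two standard examples of such estimators (named here only for illustration, not necessarily the ones covered in the lecture) are weight clipping and self-normalized IPS, both of which accept a small bias in exchange for lower variance:

```latex
% w_i := \pi_e(a_i \mid x_i) / \pi_0(a_i \mid x_i) is the importance weight
\hat{V}_{\mathrm{clip}} = \frac{1}{n} \sum_{i=1}^{n} \min\{ w_i, \lambda \}\, r_i,
\qquad
\hat{V}_{\mathrm{SNIPS}} = \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i}.
```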
  31. Questions so far? October 2025 Off-policy evaluation and learning @

    CS6784 32 • OPE aims to evaluate a (new) policy using logged data collected by a different policy. • Importance sampling is a default approach, but the variance issue is problematic. • Many OPE estimators have been invented to achieve better bias-variance tradeoffs. Intro to off-policy evaluation and learning (OPE/L)
  32. Questions so far? Next >> Presenting an efficient OPL approach

    along with a practical application October 2025 Off-policy evaluation and learning @ CS6784 33 • OPE aims to evaluate a (new) policy using logged data collected by a different policy. • Importance sampling is a default approach, but the variance issue is problematic. • Many OPE estimators have been invented to achieve better bias-variance tradeoffs. Intro to off-policy evaluation and learning (OPE/L)
  33. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

    Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims October 2025 Off-policy evaluation and learning @ CS6784 34 (Presented at RecSys2025)
  34. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 35 “WALL-E (2008)”
  35. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 36 “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated film with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  36. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: October 2025 Off-policy evaluation and learning @ CS6784 37 For sci-fi lovers, In the distant future, one little robot sparked a cosmic revolution. For romance lovers, In a lonely world, a small robot discovers the power of connection. We’d like to personalize the sentence to each user. “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated film with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  37. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E October 2025 Off-policy evaluation and learning @ CS6784 38
  38. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious October 2025 Off-policy evaluation and learning @ CS6784 39
  39. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” October 2025 Off-policy evaluation and learning @ CS6784 40
  40. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” • reward 𝑟 • click, purchase October 2025 Off-policy evaluation and learning @ CS6784 41
  41. Our goal is to optimize the policy to maximize the

    total reward: Goal of Off-Policy Learning (OPL) October 2025 Off-policy evaluation and learning @ CS6784 42
  42. Our goal is to optimize the policy to maximize the

    total reward: , using the logged data collected by a logging policy 𝜋0 . Goal of Off-Policy Learning (OPL) October 2025 Off-policy evaluation and learning @ CS6784 43 need to deal with the partial rewards and distribution shift
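Note: the objective on this slide is shown as an image; written out roughly, in notation following the previous slides (not taken verbatim from the paper), the OPL goal is:

```latex
% OPL objective: maximize the policy value using only data logged under \pi_0
\theta^{*} = \arg\max_{\theta}\; V(\pi_{\theta}),
\qquad
V(\pi_{\theta}) = \mathbb{E}_{(u, q)}\, \mathbb{E}_{a \sim \pi_{\theta}(a \mid u, q)}\,
\mathbb{E}_{s \sim p_{\mathrm{LLM}}(s \mid q, a)} \big[\, \mathbb{E}[\, r \mid u, q, s \,] \,\big],
```

given only the logged data D = {(u_i, q_i, a_i, s_i, r_i)}_{i=1}^n collected by the logging policy, which is where the partial rewards and the distribution shift come in.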
  43. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 44 (true action policy gradient) 𝜃: policy parameter
  44. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 45 (true action policy gradient) 𝜃: policy parameter This is just like using gradient descent for squared error minimization for the regression task.
  45. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 46 (true action policy gradient) 𝜃: policy parameter This is just like using gradient descent for squared error minimization for the regression task. from Sarah’s lecture slides 𝑟 can be −𝑐 (max. reward = min. cost)
  46. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. October 2025 Off-policy evaluation and learning @ CS6784 47 (true action policy gradient) 𝜃: policy parameter
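Note: the “true action policy gradient” on these slides is shown as an image; in standard log-derivative (REINFORCE-style) notation it reads:

```latex
% True action policy gradient via the log-derivative trick (standard form)
\nabla_{\theta} V(\pi_{\theta})
= \mathbb{E}_{x}\, \mathbb{E}_{a \sim \pi_{\theta}(a \mid x)}
\big[\, q(x, a)\, \nabla_{\theta} \log \pi_{\theta}(a \mid x) \,\big],
```

which follows from the identity \nabla_{\theta} \pi_{\theta}(a \mid x) = \pi_{\theta}(a \mid x) \nabla_{\theta} \log \pi_{\theta}(a \mid x); the OPL methods on the next slides differ in how they estimate this expectation from logged data.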
  47. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 48 regression-based [Konda&Tsitsiklis,99] importance sampling-based [Swaminathan&Joachims,16] • imputes the regressed reward • introduces bias when the regression is inaccurate • (regression is often demanding due to partial rewards and covariate shift) • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts)
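Note: a rough sketch of the two naive gradient estimators (regression-based and importance-sampling-based) for a generic differentiable policy; the helper interfaces below are hypothetical, and the precise estimators are given in the cited references:

```python
import numpy as np

def is_policy_gradient(logged_data, pi_theta_prob, pi_0_prob, grad_log_pi_theta):
    """Importance-sampling-based estimate of the action policy gradient.

    logged_data: iterable of (x, a, r) tuples collected under pi_0.
    grad_log_pi_theta(x, a): score function grad_theta log pi_theta(a | x) as a vector.
    """
    grads = []
    for x, a, r in logged_data:
        w = pi_theta_prob(x, a) / pi_0_prob(x, a)  # unbiased, but high variance when w is large
        grads.append(w * r * grad_log_pi_theta(x, a))
    return np.mean(grads, axis=0)

def regression_policy_gradient(contexts, candidate_actions, q_hat, pi_theta_prob, grad_log_pi_theta):
    """Regression-based estimate that imputes rewards with a fitted model q_hat(x, a).

    Avoids importance weights, but is biased whenever q_hat is inaccurate.
    """
    grads = []
    for x in contexts:
        for a in candidate_actions:  # expectation over pi_theta taken exactly over the prompt set
            grads.append(pi_theta_prob(x, a) * q_hat(x, a) * grad_log_pi_theta(x, a))
    return np.sum(grads, axis=0) / len(contexts)
```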
  48. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 49 • correct the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is to treat each prompt independently. But, we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? important sampling-based [Swaminathan&Joachims,16]
  49. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. October 2025 Off-policy evaluation and learning @ CS6784 50 • correct the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is to treat each prompt independently. But, we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? important sampling-based [Swaminathan&Joachims,16]
  50. How to leverage similarities among sentences? October 2025 Off-policy evaluation

    and learning @ CS6784 51 A. Take the gradient directly in the sentence space.
  51. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. October 2025 Off-policy evaluation and learning @ CS6784 52 (true sentence policy gradient) gradient w.r.t. sentence distribution however, the issue is that the original sentence space is high dimensional..
  52. How to leverage similarities among sentences? We consider estimating the

    following gradient in the marginalized sentence space. October 2025 Off-policy evaluation and learning @ CS6784 53 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  53. How to leverage similarities among sentences? We consider estimating the

    following gradient in the marginalized sentence space. where October 2025 Off-policy evaluation and learning @ CS6784 54 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  54. We estimate the sentence policy gradient using logged data as

    follows. How can we actually estimate/implement the weighted score function? Direct Sentence Off-policy gradient (DSO) October 2025 Off-policy evaluation and learning @ CS6784 55 (estimating marginalized sentence policy gradient)
  55. Estimation of the weighted score function We use the following

    re-sampling technique. October 2025 Off-policy evaluation and learning @ CS6784 56 See Appendix for the derivation.
  56. Estimation of the weighted score function We use the following

    re-sampling technique. , which suggests that ① DSO does implicit data augmentation via resampling (𝑎, 𝑠′) from the policy 𝜋𝜃 . ② DSO uses soft rejection sampling using the kernel weight. ③ DSO corrects the logging distribution in the marginalized sentence space. October 2025 Off-policy evaluation and learning @ CS6784 57 ① ② ③ See Appendix for the derivation.
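Note: the following is only a rough sketch consistent with the three properties ①–③ above, not the exact DSO estimator (which, together with its derivation, is in the paper and its appendix); every helper function is a hypothetical interface:

```python
import numpy as np

def dso_style_gradient(logged_data, sample_from_policy, grad_log_pi_theta,
                       embed, kernel, marginal_weight, n_resample=16):
    """Rough DSO-style gradient sketch (illustrative only, not the paper's estimator).

    logged_data: iterable of (x, a, s, r) tuples collected under pi_0, with x = (user, query).
    sample_from_policy(x): draws a prompt/sentence pair (a', s') from pi_theta and the frozen LLM.
    kernel(e1, e2): similarity weight between two sentence embeddings.
    marginal_weight(x, s): importance weight defined in the marginalized sentence space.
    """
    grads = []
    for x, a, s, r in logged_data:
        w = marginal_weight(x, s)                   # (3) correct the logging distribution
        weighted_scores, kernel_mass = [], []
        for _ in range(n_resample):                 # (1) implicit data augmentation via resampling
            a_new, s_new = sample_from_policy(x)
            k = kernel(embed(s_new), embed(s))      # (2) soft rejection via the kernel weight
            weighted_scores.append(k * grad_log_pi_theta(x, a_new))
            kernel_mass.append(k)
        if np.sum(kernel_mass) > 0:
            score = np.sum(weighted_scores, axis=0) / np.sum(kernel_mass)
            grads.append(w * r * score)
    return np.mean(grads, axis=0) if grads else None
```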
  57. Theoretical analysis; summary ① DSO is less likely to incur

    deficient support. ② DSO has small bias when kernel bandwidth 𝛕 is small. ③ DSO has a large variance reduction when kernel bandwidth 𝛕 is large. October 2025 Off-policy evaluation and learning @ CS6784 58 See Appendix for the details.
  58. Theoretical analysis; bias-variance tradeoff ④ Kernel bandwidth 𝛕 plays an important

    role in controlling the bias-variance tradeoff. October 2025 Off-policy evaluation and learning @ CS6784 59
  59. Theoretical analysis; bias-variance tradeoff ④ Kernel bandwidth 𝛕 plays an important

    role in controlling the bias-variance tradeoff. October 2025 Off-policy evaluation and learning @ CS6784 60 DSO often achieves a better Pareto frontier of the bias-variance tradeoff than action-based IS.
  60. Synthetic experiments evaluation metric October 2025 Off-policy evaluation and learning

    @ CS6784 61 optimal policy uniform random compared methods • Regression [Konda&Tsitsiklis,99] • IS [Swaminathan&Joachims,16] • DR [Dudík+,11] • POTEC [Saito+,24] • DSO (ours) the higher, the better DR: hybrid of regression and IS POTEC: two-stage policy that uses the cluster of actions
  61. Synthetic experiments configurations • data sizes: {500, 1000, 2000, 4000,

    8000} • number of candidate prompts: {10, 50, 100, 500, 1000} • reward noises: {0.0, 1.0, 2.0, 3.0} • For DSO, we use the Gaussian kernel with 𝜏 = 𝟏. 𝟎. October 2025 Off-policy evaluation and learning @ CS6784 62 value: default value
  62. Results October 2025 Off-policy evaluation and learning @ CS6784 63

    • DSO works particularly well when the number of actions and the reward noise are large. • DSO is much more data-efficient than the baselines.
  63. MovieLens Results • DSO often performs better than other OPL

    methods. • Especially compared to those involving importance sampling (IS, DR, POTEC), DSO is more robust to performance degradation. Note: “policy value” is the improvement observed over the sentences generated without a prompt, which we call the no-prompt baseline. Experimental results are from 25 different trials. October 2025 Off-policy evaluation and learning @ CS6784 64
  64. Summary • We studied OPL for prompt-guided language personalization. •

    The key challenge is dealing with the large action space of prompts, and we proposed to leverage similarity among sentences via kernels. • DSO reduces variance by (1) applying IS in the marginalized sentence space and (2) applying implicit data augmentation via the re-sampling technique. • Experiments on synthetic/full-LLM environments demonstrate that DSO works well by reducing the variance while keeping the bias small. October 2025 Off-policy evaluation and learning @ CS6784 65
  65. Questions? October 2025 Off-policy evaluation and learning @ CS6784 66

    • OPL aims to learn a new policy using logged data collected by an old policy. • Importance sampling has a variance issue, and we resolve it by using similarity among sentences. • The Kernel-IS gradient estimator (ours) enables data-efficient OPL (better bias-variance). OPE/L and its application to sentence personalization
  66. Extra Next >> Showcasing some other research in ML in

    feedback systems October 2025 Off-policy evaluation and learning @ CS6784 67
  67. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 68
  68. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 69
  69. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 70 … … viewer provider
  70. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 71 … … provider Viewers receive content recommendations. If viewers are happy/dissatisfied with the content, they may increase/decrease participation. Providers supply content to the platform. If providers receive high/inadequate exposure, they may increase/decrease production. Many papers assume that both viewer and provider populations are static, but..
  71. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. July 2025 Participation Dynamics in Two-sided Platforms @ ICML 72 … … provider How should we design content allocation to pursue long-term success? Applications include video streaming, online ads, job matching, SNS, and more! HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025.
  72. I am working on several topics.. Support human decisions using

    ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) October 2025 Off-policy evaluation and learning @ CS6784 73
  73. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 74 candidate retrieval (i.e., fast screening) Large-scale Recsys Retrieval Augmented Generation (RAG)
  74. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we improve the early-stage retrieval process to enhance the overall quality of the two-stage decisions? September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 75 Large-scale Recsys less research More research
  75. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we present diverse items to users by improving the candidate retrieval process? September 2025 Diversity-aware OPL for two-stage decisions @ CONSEQUENCES 76 Large-scale Recsys less research More research In some applications, diversity is very important! (e.g., news recommendation, opinion/review summarization) HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two Stage Decisions. 2025.
  76. Thank you for listening! If you are interested in related

    topics, feel free to reach out to me! October 2025 Off-policy evaluation and learning @ CS6784 77
  77. Appendix for the RecSys paper >> Please refer to the

    following slides instead: https://speakerdeck.com/harukakiyohara_/opl-prompt October 2025 Off-policy evaluation and learning @ CS6784 78
  78. Reference (1/6) [SST20] Noveen Sachdeva, Yi Su, Thorsten Joachims. Off-policy

    Bandits with Deficient Support. KDD, 2020. [FDRC22] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, Paolo Cremonesi. Off-Policy Evaluation with Deficient Support Using Side Information. NeurIPS, 2022. [KSU23] Samir Khan, Martin Saveski, Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. 2023. [TGTRV21] Hung Tran-The, Sunil Gupta, Thanh Nguyen-Tang, Santu Rana, Svetha Venkatesh. Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support. 2021. [KKKKNS24] Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. ICLR, 2024. [UKNST24] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. AAAI, 2023. October 2025 Off-policy evaluation and learning @ CS6784 80
  79. Reference (2/6) [SSK20] Yi Su, Pavithra Srinath, Akshay Krishnamurthy. Adaptive

    Estimator Selection for Off-Policy Evaluation. ICML, 2020. [SAAC24] Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin. Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning. 2024. [CNSMBT21] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas. Universal Off-Policy Evaluation. NeurIPS, 2021. [HLLA21] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Contextual Bandits. NeurIPS, 2021. [HLLA22] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Markov Decision Processes. AISTATS, 2022. [WUS23] Runzhe Wu, Masatoshi Uehara, Wen Sun. Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML, 2023. [YJ22] Yuta Saito, Thorsten Joachims. Off-Policy Evaluation for Large Action Spaces via Embeddings. ICML, 2022. October 2025 Off-policy evaluation and learning @ CS6784 81
  80. Reference (3/6) [YRJ22] Yuta Saito, Qingyang Ren, Thorsten Joachims. Off-Policy

    Evaluation for Large Action Spaces via Conjunct Effect Modeling. ICML, 2023. [TDCT23] Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton. Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits. NeurIPS, 2023. [SWLKM24] Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, Julian McAuley. Off-Policy Evaluation for Large Action Spaces via Policy Convolution. WWW, 2024. [KSMNSY22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. WSDM, 2022. [KUNSYS23] Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. KDD, 2023. [KNS24] Haruka Kiyohara, Masahiro Nomura, Yuta Saito. Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. WWW, 2024. October 2025 Off-policy evaluation and learning @ CS6784 82
  81. Reference (4/6) [STKKNS24] Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka

    Kiyohara, Masahiro Nomura, Yuta Saito. Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits. RecSys, 2024. [SKADLJZ17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. Off-policy evaluation for slate recommendation. NeurIPS, 2017. [SAACL24] Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, Mounia Lalmas. Long- term Off-Policy Evaluation and Learning. WWW, 2024. [Beygelzimer&Langford,00] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009. [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. October 2025 Off-policy evaluation and learning @ CS6784 83
  82. Reference (5/6) [Konda&Tsitsiklis,99] Vijay Konda and John Tsitsiklis. Actor-critic algorithms.

    NeurIPS, 1999. [Swaminathan&Joachims,16] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2016. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. [Saito+,24] Yuta Saito, Jihan Yao, and Thorsten Joachims. Potec: Off-policy learning for large action spaces via two-stage policy decomposition. 2024. [Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020. October 2025 Off-policy evaluation and learning @ CS6784 84
  83. Reference (6/6) [Jiang+,23] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch,

    Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mistral 7b. 2023. [Sanh+,19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. 2019. [Harper&Konstan,15] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. TIIS, 2015. October 2025 Off-policy evaluation and learning @ CS6784 85