[A-exam'25] Data-efficiency, steerability, and adaptability for personalized decisions at scale

The slides I used for the A-exam (PhD candidacy exam, thesis proposal) at Cornell.

The corresponding research statement is shared under the following link:
https://drive.google.com/file/d/1LqONxB8Qw4Z0GSUavAwSSV_oCANcl9TI/view

Haruka Kiyohara

November 20, 2025

Transcript

  1. Data efficiency, steerability, and adaptability for personalized decisions at scale

    Haruka Kiyohara ([email protected]) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 1 Committee: Thorsten Joachims (chair), Sarah Dean (co-chair), Nikhil Garg A-exam
  2. Machine decision-making systems are everywhere! November 2025 PhD candidacy exam

    (A-exam) @ Cornell CS 2 search recommendation SNS Chatbots/AI assistance (LLMs) Creatives (GenAI)
  3. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 3
  4. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 4 user service/platform logs daily interaction feedback
  5. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 5 human AI inquiry/ logs idea model update adaptation
  6. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 6 Large-scale Recsys Millions, or billions of items!
  7. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness ⇨ Support human decisions using ML systems November 2025 PhD candidacy exam (A-exam) @ Cornell CS 7
  8. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness ⇨ Support human decisions using ML systems November 2025 PhD candidacy exam (A-exam) @ Cornell CS 8 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints)
  9. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 9 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints) The work I’ve done so far in the first two years of my PhD
  10. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness In the first 2/3 of this talk (~30min), I will present: • (Quick intro to OPE/L) • A paper I’ve worked on in the first research direction November 2025 PhD candidacy exam (A-exam) @ Cornell CS 10 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints) The work I’ve done so far in the first two years of my PhD
  11. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 11 (off-policy evaluation and learning) (practical constraints) Main research direction in the rest of my PhD
  12. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness In the last 1/3 of this talk (~15min), I will present: • Research plan for the third research direction November 2025 PhD candidacy exam (A-exam) @ Cornell CS 12 (off-policy evaluation and learning) (practical constraints) Main research direction in the rest of my PhD
  13. How does a recommender/ranking system work? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 14 recommendation/ranking … a policy a coming user context clicks reward action 𝑎
  14. How does a recommender/ranking system work? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 15 recommendation/ranking … a policy ▼ evaluate this one a coming user context clicks reward action 𝑎
  15. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 16
  16. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 17 Online A/B testing is a straightforward way to evaluate a policy, but online testing may harm the user experience when the policy performs poorly.
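For reference, the policy value being evaluated on these slides is typically defined as the expected reward under the evaluation policy; the equation below is the standard definition (the slide's own equation image is not reproduced in this transcript).

```latex
V(\pi) \;=\; \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\, \mathbb{E}_{r \sim p(r \mid x, a)} \left[\, r \,\right]
```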
  17. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 18 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  18. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 19 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  19. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 20 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 an evaluation policy
  20. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 21 a logging policy an evaluation policy OPE estimator
  21. Now, let’s consider the following estimator November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 22 bias caused by distribution shift
  22. Now, let’s consider the following estimator November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 23 bias caused by distribution shift: action A (reward = +5) is chosen more by the evaluation policy but less by the logging policy; action B (reward = -5) is chosen less by the evaluation policy but more by the logging policy
  23. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 24 action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more — the importance weight corrects the distribution shift ・unbiased ・high variance (data inefficient)
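A minimal numpy sketch of the importance sampling (IS/IPS) estimator described here, assuming logged (action, reward) data with known logging propensities; the variable names and the toy two-action setup mirror the slide's example and are otherwise illustrative.

```python
# IPS estimate of V(pi_e) from data logged under pi_0.
import numpy as np

def ips_estimate(rewards, pi_e_probs, pi_0_probs):
    """rewards: observed rewards; pi_e_probs / pi_0_probs: probabilities of the
    logged actions under the evaluation and logging policies."""
    w = pi_e_probs / pi_0_probs          # importance weights correct the distribution shift
    return np.mean(w * rewards)          # unbiased, but high variance when w is large

# Toy example mirroring the slide: logging favors action B (reward -5),
# evaluation favors action A (reward +5).
rng = np.random.default_rng(0)
n = 10_000
actions = rng.choice([0, 1], size=n, p=[0.2, 0.8])       # pi_0: A w.p. 0.2, B w.p. 0.8
rewards = np.where(actions == 0, 5.0, -5.0)
pi_0 = np.where(actions == 0, 0.2, 0.8)
pi_e = np.where(actions == 0, 0.8, 0.2)                  # pi_e: A w.p. 0.8, B w.p. 0.2
print(ips_estimate(rewards, pi_e, pi_0))                 # close to 0.8*5 + 0.2*(-5) = 3.0
```

Because the weights blow up whenever the evaluation policy puts mass where the logging policy rarely explores, the estimate is unbiased but data-inefficient, which is exactly the variance issue raised on the next slides.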
  24. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 25 evaluation logging more less action A when the importance weight is large ・unbiased ・variance (data inefficient)
  25. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 26 evaluation more, logging less for action A — when the importance weight is large Key question: How can we apply IS efficiently in large action spaces? ・unbiased ・high variance (data inefficient)
  26. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 27 evaluation more, logging less for action A — when the importance weight is large Key question: How can we apply IS efficiently in large action spaces? => Many OPE estimators have been invented by the community (with better bias-variance tradeoffs) ・unbiased ・high variance (data inefficient)
  27. Examples of OPE/L applications • Ranking recommendation • Set selection

    (e.g., fashion) • Sequential decisions (e.g., medicine) • Emerging applications November 2025 PhD candidacy exam (A-exam) @ Cornell CS 28 [KSMSY, WSDM22 Best Paper Award Runner-up; KUNSYS, KDD23; KTKSYS, 24] [KNS, WebConf24; STKKNS, RecSys24] [U*K*BCJKSS, NeurIPS23, KKKKNS, ICLR24] Large-scale RecSys & Gen AI [KCSJ, RecSys25; KKJ, 25]
  28. A paper from my previous work at Cornell November 2025

    PhD candidacy exam (A-exam) @ Cornell CS 29
  29. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

    Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims November 2025 PhD candidacy exam (A-exam) @ Cornell CS 30 (Presented at RecSys2025)
  30. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 31 “WALL-E (2008)” short summary https://movies.disney.com/wall-e
  31. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 32 “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated films with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  32. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 33 For sci-fi lovers, In the distant future, one little robot sparked a cosmic revolution. For romance lovers, In a lonely world, a small robot discovers the power of connection. We’d like to personalize the sentence to each user. “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated films with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  33. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E November 2025 PhD candidacy exam (A-exam) @ Cornell CS 34
  34. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious November 2025 PhD candidacy exam (A-exam) @ Cornell CS 35
  35. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” November 2025 PhD candidacy exam (A-exam) @ Cornell CS 36
  36. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” • reward 𝑟 • click, purchase November 2025 PhD candidacy exam (A-exam) @ Cornell CS 37
  37. Our goal is to optimize the policy to maximize the

    total reward: Goal of Off-Policy Learning (OPL) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 38
  38. Our goal is to optimize the policy to maximize the

    total reward, using the logged data collected by a logging policy 𝜋0 . Goal of Off-Policy Learning (OPL) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 39 need to deal with the partial rewards and distribution shift
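The OPL objective referenced here (whose equation is an image on the slide) commonly takes the following form; this is the standard prompt-policy objective written with the slide's notation (user u, query q, prompt a, sentence s from the frozen LLM) and is a paraphrase, not the paper's exact equation.

```latex
\theta^{\ast} = \arg\max_{\theta} V(\pi_{\theta}),
\qquad
V(\pi_{\theta}) = \mathbb{E}_{(u,q)}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid u,q)}\, \mathbb{E}_{s \sim p_{\mathrm{LLM}}(\cdot \mid q,a)} \big[\, \mathbb{E}[\, r \mid u, q, s \,] \,\big]
```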
  39. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 40 (true action policy gradient) 𝜃: policy parameter
  40. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 41 regression-based [Konda&Tsitsiklis,99] importance sampling-based [Swaminathan&Joachims,16] • imputes the regressed reward • introduces bias when the regression is inaccurate • (regression is often demanding due to partial rewards and covariate shift) • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts)
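A minimal numpy sketch of the two naive gradient estimators contrasted above, assuming a linear softmax prompt policy pi_theta(a|x) ∝ exp(x·theta_a); the function and variable names (and the reward model r_hat) are illustrative, not from the paper.

```python
import numpy as np

def softmax_probs(theta, X):
    """pi_theta(a|x) for a linear softmax policy; theta has shape (|A|, d)."""
    logits = X @ theta.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def is_policy_gradient(theta, X, A, R, pi0_probs):
    """IS-based PG: (1/n) sum_i w_i * r_i * grad log pi_theta(a_i|x_i).
    Unbiased under full support, but high variance for large prompt sets."""
    n = X.shape[0]
    P = softmax_probs(theta, X)
    w = P[np.arange(n), A] / pi0_probs              # importance weights
    grad = np.zeros_like(theta)
    for i in range(n):
        score = -np.outer(P[i], X[i])               # -pi(a'|x) * x for every a'
        score[A[i]] += X[i]                         # +x for the logged action
        grad += w[i] * R[i] * score
    return grad / n

def regression_policy_gradient(theta, X, r_hat):
    """Regression-based PG: gradient of sum_a pi_theta(a|x) * r_hat(x, a).
    Low variance, but biased when r_hat is inaccurate."""
    n = X.shape[0]
    num_actions = theta.shape[0]
    P = softmax_probs(theta, X)
    grad = np.zeros_like(theta)
    for i in range(n):
        rh = r_hat(X[i])                            # (|A|,) predicted rewards for all prompts
        baseline = P[i] @ rh                        # E_{a ~ pi_theta}[r_hat]
        for a in range(num_actions):
            # d/d theta_a of E_{a'~pi}[r_hat(a')] = pi(a|x) * x * (r_hat(a) - baseline)
            grad[a] += P[i, a] * (rh[a] - baseline) * X[i]
    return grad / n
```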
  41. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 42 • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is that they discard rich information about sentences! But we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? importance sampling-based [Swaminathan&Joachims,16]
  42. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 43 • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is that they discard rich information about sentences! But we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? importance sampling-based [Swaminathan&Joachims,16]
  43. How to leverage similarities among sentences? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 44 A. Take the gradient directly in the sentence space.
  44. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 45 (true sentence policy gradient) gradient w.r.t. sentence distribution however, the issue is that the original sentence space is high-dimensional.
  45. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 46 (true sentence policy gradient) gradient w.r.t. sentence distribution (true marginalized sentence policy gradient) gradient w.r.t. marginalized sentence distribution 𝜙(𝑠): kernel-based neighbors of sentence 𝑠
  46. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 47 (true sentence policy gradient) gradient w.r.t. sentence distribution (true marginalized sentence policy gradient) gradient w.r.t. marginalized sentence distribution 𝜙(𝑠): kernel-based neighbors of sentence 𝑠
  47. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 48 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  48. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. where November 2025 PhD candidacy exam (A-exam) @ Cornell CS 49 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
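A small sketch of the kernel neighborhood 𝜙(𝑠) idea, using the Gaussian kernel that the experiments later adopt (τ = 1.0); the embedding function and normalization are assumptions.

```python
import numpy as np

def gaussian_kernel_weights(s_emb, s_prime_embs, tau=1.0):
    """Soft 'neighbor' weights of a logged sentence embedding s_emb (d,)
    against candidate sentence embeddings s_prime_embs (k, d), bandwidth tau."""
    sq_dists = np.sum((s_prime_embs - s_emb) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * tau ** 2))   # larger tau -> broader neighborhoods

# Small tau: only near-identical sentences count as neighbors (low bias, high variance).
# Large tau: many sentences share weight (more smoothing, lower variance, more bias).
```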
  49. We estimate the sentence policy gradient using logged data as

    follows. How can we actually estimate/implement the weighted score function? Direct Sentence Off-policy gradient (DSO) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 50 (estimating the marginalized sentence policy gradient)
  50. Estimation of the weighted score function We use the following

    re-sampling technique. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 51 See Appendix for the derivation.
  51. Estimation of the weighted score function We use the following

    re-sampling technique, which suggests that ① DSO does implicit data augmentation via resampling (𝑎, 𝑠′) from the policy 𝜋𝜃 , ② DSO uses soft rejection sampling via the kernel weight, and ③ DSO corrects the logging distribution in the marginalized sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 52 ① ② ③ See Appendix for the derivation.
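A schematic sketch of steps ①–③, meant only to illustrate the mechanism (resampling from π_θ, soft rejection via the kernel weight, correction by the logging marginal density); the helper names (sample_with_score, llm_generate, embed, logging_marginal_density) are hypothetical and this is not the paper's exact estimator.

```python
import numpy as np

def dso_gradient_step(logged, policy, llm_generate, embed,
                      logging_marginal_density, tau=1.0, num_resamples=8):
    """logged: iterable of (context x, logged sentence s, reward r)."""
    grad = 0.0
    for x, s_logged, r in logged:
        e_logged = embed(s_logged)
        for _ in range(num_resamples):                    # (1) implicit data augmentation:
            a, score_vec = policy.sample_with_score(x)    #     resample a prompt from pi_theta
            s_prime = llm_generate(x, a)                  #     and a sentence s' for it
            k = np.exp(-np.sum((embed(s_prime) - e_logged) ** 2) / (2 * tau ** 2))
            w = k / logging_marginal_density(x, s_logged) # (2) kernel weight (soft rejection)
            grad += w * r * score_vec                     # (3) over the logging marginal density
    return grad / (len(logged) * num_resamples)
```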
  52. Theoretical analysis; bias-variance tradeoff Kernel bandwidth 𝛕 plays an important role

    in controlling the bias-variance tradeoff. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 53 See Appendix for the detailed analysis.
  53. Theoretical analysis; bias-variance tradeoff Kernel bandwidth 𝛕 plays an important role

    in controlling the bias-variance tradeoff. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 54 DSO often achieves a better Pareto frontier of the bias-variance tradeoff than action-based IS.
  54. Synthetic experiments evaluation metric November 2025 PhD candidacy exam (A-exam)

    @ Cornell CS 55 optimal policy uniform random compared methods • Regression [Konda&Tsitsiklis,99] • IS [Swaminathan&Joachims,16] • DR [Dudík+,11] • POTEC [Saito+,24] • DSO (ours) the higher, the better DR: hybrid of regression and IS POTEC: two-stage policy that uses the cluster of actions
  55. Synthetic experiments configurations • data sizes: {500, 1000, 2000, 4000,

    8000} • number of candidate prompts: {10, 50, 100, 500, 1000} • reward noises: {0.0, 1.0, 2.0, 3.0} • For DSO, we use the Gaussian kernel with 𝜏 = 𝟏. 𝟎. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 56 value: default value
  56. Results November 2025 PhD candidacy exam (A-exam) @ Cornell CS

    57 • DSO particularly works well when # of actions and reward noises are large. • DSO is much more data-efficient than the baselines.
  57. MovieLens Result • DSO often performs better than other OPL

    methods. • Especially compared to those involving importance sampling (IS, DR, POTEC), DSO is more robust to performance degradation. Note: “policy value” is the improvement observed over the sentences generated without a prompt, which we call the no-prompt baseline. Experimental results are from 25 different trials. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 58
  58. Summary • We studied OPL for prompt-guided language personalization. •

    The key challenge is dealing with large action spaces of prompts, and we proposed to leverage similarity among sentences via kernels. • Experiments on synthetic/full-LLM envs demonstrate that DSO works well by reducing the variance while keeping the bias small. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 59
  59. My research goal and interests Support human decisions using ML

    systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 61 The work I’ve done so far in the first two years of my PhD
  60. My research goal and interests Support human decisions using ML

    systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 62 Main research direction in the rest of my PhD
  61. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 63 candidate retrieval (i.e., fast screening) Large-scale Recsys Retrieval Augmented Generation (RAG)
  62. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we improve the early-stage retrieval process to enhance the overall quality of the two-stage decisions? November 2025 PhD candidacy exam (A-exam) @ Cornell CS 64 Large-scale Recsys less research more research
  63. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 65 Two-tower models are often employed for fast inference. Encoding user and item info separately:
  64. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 66 Two-tower models are often employed for fast inference. Encoding user and item info separately:
  65. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 67 Two-tower models are often employed for fast inference. Encoding user and item info separately: Limitation: items may concentrate [Guo+,21].
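A minimal numpy sketch of the two-tower retrieval pattern described above: separate user and item encoders, precomputed item embeddings, and a fast inner-product top-k screening step; the tower sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, weights):
    """A tiny feed-forward encoder: alternating linear layers and ReLU, L2-normalized."""
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

d_user, d_item, d_emb, n_items = 32, 48, 16, 100_000
user_tower = [rng.normal(size=(d_user, 64)), rng.normal(size=(64, d_emb))]
item_tower = [rng.normal(size=(d_item, 64)), rng.normal(size=(64, d_emb))]

# Item embeddings can be precomputed offline; only the user tower runs at request time.
item_embs = tower(rng.normal(size=(n_items, d_item)), item_tower)

def retrieve_top_k(user_features, k=100):
    u = tower(user_features[None, :], user_tower)[0]
    scores = item_embs @ u                      # one matrix-vector product over all items
    return np.argpartition(-scores, k)[:k]      # fast screening of candidates

print(retrieve_top_k(rng.normal(size=d_user))[:5])
```

In practice the inner-product search is served by an approximate nearest-neighbor index, which is what makes the candidate-retrieval stage fast enough for millions or billions of items.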
  66. Research direction How to achieve the conflicting objectives of computational

    efficiency and model adaptability/expressiveness simultaneously? • How to introduce diversity in candidates while using two-tower model?* • How to be adaptive to rich runtime info like user prompts and real-time news? • How to bridge the gap of input/model complexity in two-stage RecSys? November 2025 PhD candidacy exam (A-exam) @ Cornell CS 68 * HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two Stage Decisions. 2025.
  67. Overview of my PhD research November 2025 PhD candidacy exam

    (A-exam) @ Cornell CS 69 Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025. HK, Daniel Yiming Cao, Yuta Saito, Thorsten Joachims. An Off-Policy Learning Approach for Steering Sentence Generation toward Personalization. RecSys, 2025. HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two-stage Decisions. 2025. .. and more to come!
  68. Thank you for listening! Many thanks to my committee and

    amazing collaborators, and kind support from Funai Overseas Scholarship and Quad Fellowship! November 2025 PhD candidacy exam (A-exam) @ Cornell CS 70
  69. We observe the reward only for the sentence generated by

    the chosen prompt. Data generation process November 2025 PhD candidacy exam (A-exam) @ Cornell CS 72 (examples are generated by ChatGPT-3.5 [Brown+,20])
  70. Two axes for optimizing/personalizing LLMs • Model params (fine-tuning) •

    have flexibility in optimization • can be expensive in computation & memory • Prompts • do not require costly model training • users and third-party companies can exploit them • less specificity compared to fine-tuning November 2025 PhD candidacy exam (A-exam) @ Cornell CS 73 • Pairwise feedback (RLHF, DPO) • learns reward from preference data • human annotation can be costly & unethical • Online interaction data (RL) • can retrieve reward for any decisions • extensive exploration often negatively impacts user feedback Params Datasets
  71. Two axes for optimizing/personalizing LLMs • Model params (fine-tuning) •

    have flexibility in optimization • can be expensive in computation & memory • Prompts • do not require costly model training • users and third-party companies can exploit them • less specificity compared to fine-tuning November 2025 PhD candidacy exam (A-exam) @ Cornell CS 74 • Pairwise feedback (RLHF, DPO) • learns reward from preference data • human annotation can be costly & unethical • Online interaction data (RL) • can retrieve reward for any decisions • extensive exploration often negatively impacts user feedback • Logged bandit feedback (Ours) • allows safe and low-cost data collection • needs to deal with counterfactuals & dist. shift Params Datasets for the first time!
  72. Theoretical analysis; support condition ① DSO is less likely to

    incur deficient support. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 75 (similar sentence support) (action support) because the similar sentence support is a relaxed condition of the action support.
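For reference, the two support (overlap) conditions contrasted here typically take the shape below; this is a paraphrase of the standard conditions in the slide's notation, not the paper's exact statement.

```latex
\text{(action support)} \quad
\pi_{\theta}(a \mid x) > 0 \;\Rightarrow\; \pi_{0}(a \mid x) > 0 \quad \forall\, x, a
\\[6pt]
\text{(similar sentence support)} \quad
p_{\pi_{\theta}}\!\left(\phi(s) \mid x\right) > 0 \;\Rightarrow\; p_{\pi_{0}}\!\left(\phi(s) \mid x\right) > 0 \quad \forall\, x, s
```

Since many different prompts can produce sentences that fall in the same kernel neighborhood 𝜙(𝑠), the sentence-level condition is easier to satisfy than requiring support for every individual prompt.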
  73. Theoretical analysis; bias ② DSO has small bias when kernel

    bandwidth 𝛕 is small. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 76
  74. Theoretical analysis; bias ② DSO has small bias when kernel

    bandwidth 𝛕 is small. • • • November 2025 PhD candidacy exam (A-exam) @ Cornell CS 77 This term comes from the within-neighbor reward shift. These terms come from applying marginalization via kernels in the sentence space.
  75. Theoretical analysis; variance ③ DSO has a large variance reduction

    when kernel bandwidth 𝛕 is large. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 78
  76. Theoretical analysis; variance ③ DSO has a large variance reduction

    when kernel bandwidth 𝛕 is large. • • November 2025 PhD candidacy exam (A-exam) @ Cornell CS 79 This term reduces variance by avoiding within-neighbor importance weights: This term reduces variance by doing implicit data augmentation and soft-rejection sampling.
  77. How to estimate the logging marginal density? To use DSO,

    we need to estimate the logging marginal density, i.e., the density of the kernel neighborhood 𝜙(𝑠) under the logging policy. We can use function approximation trained on the following MSE loss. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 80
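One plausible way to implement the MSE-based function approximation mentioned here, under the assumption that the marginal density is a kernel-smoothed expectation over sentences drawn under π_0; the model class (ridge regression on concatenated features) and the sampling helper are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Ridge

def gaussian_kernel(e1, e2, tau=1.0):
    return np.exp(-np.sum((e1 - e2) ** 2, axis=-1) / (2 * tau ** 2))

def fit_logging_marginal_density(contexts, sent_embs, sample_sentence_under_pi0,
                                 tau=1.0, mc_samples=16):
    """contexts: (n, d_x); sent_embs: (n, d_s) embeddings of the logged sentences.
    sample_sentence_under_pi0(x) returns one sentence embedding drawn under pi_0."""
    X, y = [], []
    for x, e in zip(contexts, sent_embs):
        draws = np.stack([sample_sentence_under_pi0(x) for _ in range(mc_samples)])
        target = gaussian_kernel(draws, e, tau).mean()   # Monte Carlo kernel-density target
        X.append(np.concatenate([x, e]))                 # simple (context, sentence) features
        y.append(target)
    model = Ridge(alpha=1.0).fit(np.array(X), np.array(y))  # squared (MSE) loss
    return lambda x, e: float(model.predict(np.concatenate([x, e])[None, :])[0])
```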
  78. Synthetic experiments data generation process • sentence distribution • reward

    distribution November 2025 PhD candidacy exam (A-exam) @ Cornell CS 81 sentence distribution (prompt → sentence): smooth, different prompts can result in similar sentences; reward distribution (sentence → reward): smooth, different sentences result in different rewards
  79. Synthetic experiments ablations • kernel bandwidth: {0.5, 1.0, 2.0, 4.0}

    • logging marginal density: {w/ and w/o function approx.} (w/o is the Monte Carlo estimation) • add noise 𝝈𝒔 = 𝟏. 𝟎 to the sentence embeddings to measure the distance November 2025 PhD candidacy exam (A-exam) @ Cornell CS 82 value: default value
  80. • We observe some bias-variance tradeoff when using Monte Carlo estimation.

    • Using a Gaussian kernel and the function approx. improves the robustness of DSO to the choice of bandwidth hyperparameter 𝜏. Ablation results November 2025 PhD candidacy exam (A-exam) @ Cornell CS 83
  81. Why does function approx. improve the robustness of DSO? A.

    Because we use the MSE loss to fit the marginal density model. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 84 For example, when the true marginal density is 1e-5, estimating it as 1e-5 or 1e-4 barely changes the MSE loss. In contrast, when the density appears in the denominator of the weight, 1e-4 vs. 1e-5 makes a significant (10x) difference. Using function approximation, we can avoid being too precise about small values of the marginal density.
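A tiny numeric illustration of this point, with assumed example values:

```python
# Confusing 1e-5 with 1e-4 costs almost nothing under an MSE loss, but changes
# a density-in-the-denominator weight by a factor of 10.
true_density, est_density = 1e-5, 1e-4
print((est_density - true_density) ** 2)        # ~8.1e-09 : negligible MSE penalty
print(1.0 / true_density, 1.0 / est_density)    # 100000 vs 10000 : 10x weight difference
```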
  82. Full-LLM experiment using MovieLens [Harper&Konstan,15] Original dataset:

    𝑢: user id, 𝑞: item id (movie title), 𝑟: ratings. Augmented dataset: movie descriptions generated by Mistral-7B (zero-shot, w/o prompt). Reward simulator: user id embedding (・) inner product with DistilBert-encoded movie description; loss function: MSE in reward prediction. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 85
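A sketch of the reward-simulator architecture summarized above (user-ID embedding, inner product with a DistilBERT encoding of the movie description, trained with an MSE loss on ratings); the pooling choice, dimensions, and training-loop line are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class RewardSimulator(nn.Module):
    def __init__(self, num_users, emb_dim=768):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

    def forward(self, user_ids, descriptions):
        tokens = self.tokenizer(descriptions, return_tensors="pt",
                                padding=True, truncation=True)
        desc = self.encoder(**tokens).last_hidden_state.mean(dim=1)  # (batch, 768), mean-pooled
        u = self.user_emb(user_ids)                                  # (batch, 768)
        return (u * desc).sum(dim=-1)                                # inner product -> predicted reward

# Training sketch: minimize MSE between predicted and observed ratings.
# loss = nn.MSELoss()(model(user_ids, descriptions), ratings.float())
```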
  83. Full-LLM experiment • Semi-synthetic experiments on the MovieLens-10M dataset [Harper&Konstan,15].

    • DistilBert [Sanh+,19]-based reward simulator is trained on the data. (next page) • User and query (i.e., movie) are sampled from the dataset. • Candidate prompts are retrieved from RelatedWord.io. • Using Mistral-7B [Jiang+,23] as the frozen LLM to generate the sentence. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 86
  84. Reward simulation results of full-LLM bench. November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 88 (Left) “positive” indicates the movies with a rating of 5, while “negative” indicates those with ratings of 0-3. (Right) Showing the distribution of the normalized reward, which indicates the improvement in expected reward gained by using the given prompt, compared to that of the sentence generated without prompts. The normalized value is multiplied by 10, so the differences become evident when running policy learning methods.
  85. Derivation of the weighted score function (1/2) As a preparation,

    we first derive the following expression of the importance weight. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 89
  86. Derivation of the weighted score function (2/2) Then, we transform

    the weighted score function as follows. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 90
  87. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 92 … … viewer provider
  88. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 93 … … provider Viewers receive content recommendations. If viewers are happy/dissatisfied with the content, they may increase/decrease participation. Providers supply content to the platform. If providers receive high/inadequate exposure, they may increase/decrease production. Many papers assume that both viewer and provider populations are static, but..
  89. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 94 … … provider How should we design content allocation to pursue the long-term success? Applications include video streaming, online ads, job matching, SNS, and more! HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025.
  90. Reference (1/4) [SST20] Noveen Sachdeva, Yi Su, Thorsten Joachims. Off-policy

    Bandits with Deficient Support. KDD, 2020. [FDRC22] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, Paolo Cremonesi. Off-Policy Evaluation with Deficient Support Using Side Information. NeurIPS, 2022. [KSU23] Samir Khan, Martin Saveski, Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. 2023. [TGTRV21] Hung Tran-The, Sunil Gupta, Thanh Nguyen-Tang, Santu Rana, Svetha Venkatesh. Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support. 2021. [KKKKNS24] Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. ICLR, 2024. [UKNST24] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. AAAI, 2023. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 96
  91. Reference (2/4) [SSK20] Yi Su, Pavithra Srinath, Akshay Krishnamurthy. Adaptive

    Estimator Selection for Off-Policy Evaluation. ICML, 2020. [SAAC24] Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin. Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning. 2024. [CNSMBT21] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas. Universal Off-Policy Evaluation. NeurIPS, 2021. [HLLA21] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Contextual Bandits. NeurIPS, 2021. [HLLA22] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Markov Decision Processes. AISTATS, 2022. [WUS23] Runzhe Wu, Masatoshi Uehara, Wen Sun. Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML, 2023. [YJ22] Yuta Saito, Thorsten Joachims. Off-Policy Evaluation for Large Action Spaces via Embeddings. ICML, 2022. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 97
  92. Reference (3/4) [YRJ22] Yuta Saito, Qingyang Ren, Thorsten Joachims. Off-Policy

    Evaluation for Large Action Spaces via Conjunct Effect Modeling. ICML, 2023. [TDCT23] Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton. Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits. NeurIPS, 2023. [SWLKM24] Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, Julian McAuley. Off-Policy Evaluation for Large Action Spaces via Policy Convolution. WWW, 2024. [KSMNSY22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. WSDM, 2022. [KUNSYS23] Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. KDD, 2023. [KNS24] Haruka Kiyohara, Masahiro Nomura, Yuta Saito. Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. WWW, 2024. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 98
  93. Reference (4/4) [STKKNS24] Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka

    Kiyohara, Masahiro Nomura, Yuta Saito. Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits. RecSys, 2024. [SKADLJZ17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. Off-policy evaluation for slate recommendation. NeurIPS, 2017. [SAACL24] Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, Mounia Lalmas. Long-term Off-Policy Evaluation and Learning. WWW, 2024. [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009. [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 99
  94. Reference (5/6) [Konda&Tsitsiklis,99] Vijay Konda and John Tsitsiklis. Actor-critic algorithms.

    NeurIPS, 1999. [Swaminathan&Joachims,16] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2016. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. [Saito+,24] Yuta Saito, Jihan Yao, and Thorsten Joachims. Potec: Off-policy learning for large action spaces via two-stage policy decomposition. 2024. [Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 100
  95. Reference (6/6) [Jiang+,23] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch,

    Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mistral 7B. 2023. [Sanh+,19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019. [Harper&Konstan,15] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. TIIS, 2015. [Guo+,21] Wenshuo Guo, Karl Krauth, Michael I. Jordan, and Nikhil Garg. The Stereotyping Problem in Collaboratively Filtered Recommender Systems. EAAMO, 2021.