[A-exam'25] Data-efficiency, steerability, and adaptability for personalized decisions at scale

The slides I used for the A-exam (PhD candidacy exam, thesis proposal) at Cornell.

The corresponding research statement is shared under the following link:
https://drive.google.com/file/d/1LqONxB8Qw4Z0GSUavAwSSV_oCANcl9TI/view

Haruka Kiyohara

November 20, 2025

Transcript

  1. Data efficiency, steerability, and adaptability for personalized decisions at scale

    Haruka Kiyohara ([email protected]) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 1 Committee: Thorsten Joachims (chair), Sarah Dean (co-chair), Nikhil Garg A-exam
  2. Machine decision-making systems are everywhere! November 2025 PhD candidacy exam

    (A-exam) @ Cornell CS 2 search recommendation SNS Chatbots/AI assistance (LLMs) Creatives (GenAI)
  3. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 3
  4. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 4 user service/platform logs daily interaction feedback
  5. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 5 human AI inquiry/ logs idea model update adaptation
  6. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 6 Large-scale Recsys Millions, or billions of items!
  7. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness ⇨ Support human decisions using ML systems November 2025 PhD candidacy exam (A-exam) @ Cornell CS 7
  8. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness ⇨ Support human decisions using ML systems November 2025 PhD candidacy exam (A-exam) @ Cornell CS 8 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints)
  9. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 9 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints) The work I’ve done so far in the first two years of my PhD
  10. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness In the first 2/3 of this talk (~30min), I will present: • (Quick intro to OPE/L) • A paper I’ve worked on in the first research direction November 2025 PhD candidacy exam (A-exam) @ Cornell CS 10 (off-policy evaluation and learning) (dynamics, control, social aspects) (practical constraints) The work I’ve done so far in the first two years of my PhD
  11. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness November 2025 PhD candidacy exam (A-exam) @ Cornell CS 11 (off-policy evaluation and learning) (practical constraints) Main research direction in the rest of my PhD
  12. Three key challenges and my research goal • Naturally collect

    interaction data -> how to efficiently use logged data? • Human-in-the-loop feedback process -> how to steer systems for social goods? • Dealing with large action (item) space -> fast inference, scalability, expressiveness In the last 1/3 of this talk (~15min), I will present: • Research plan for the third research direction November 2025 PhD candidacy exam (A-exam) @ Cornell CS 12 (off-policy evaluation and learning) (practical constraints) Main research direction in the rest of my PhD
  13. How does a recommender/ranking system work? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 14 recommendation/ranking … a policy a coming user context clicks reward action 𝑎
  14. How does a recommender/ranking system work? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 15 recommendation/ranking … a policy ▼ evaluate this one a coming user context clicks reward action 𝑎
  15. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 16
  16. Goal: evaluating with the policy value We evaluate a policy

    with its expected reward. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 17 Online A/B testing is a straightforward way to evaluate a policy, but online testing may harm the user experience when the policy performs poorly.
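For reference, the policy value being evaluated on these slides is typically defined as the expected reward under the evaluation policy; the equation below is the standard definition (the slide's own equation image is not reproduced in this transcript).

```latex
V(\pi) \;=\; \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\, \mathbb{E}_{r \sim p(r \mid x, a)} \left[\, r \,\right]
```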
  17. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 18 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  18. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 19 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎
  19. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 20 recommendation/ranking … a coming user context clicks reward a logging (default) policy action 𝑎 an evaluation policy
  20. Off-policy evaluation; OPE November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 21 a logging policy an evaluation policy OPE estimator
  21. Now, let’s consider the following estimator November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 22 bias caused by distribution shift
  22. Now, let’s consider the following estimator November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 23 bias caused by distribution shift: action A (reward = +5) is chosen more by the evaluation policy but less by the logging policy; action B (reward = -5) is chosen less by the evaluation policy but more by the logging policy
  23. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 24 action A (reward = +5): evaluation more, logging less; action B (reward = -5): evaluation less, logging more — the importance weight corrects the distribution shift ・unbiased ・high variance (data inefficient)
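A minimal numpy sketch of the importance sampling (IS/IPS) estimator described here, assuming logged (action, reward) data with known logging propensities; the variable names and the toy two-action setup mirror the slide's example and are otherwise illustrative.

```python
# IPS estimate of V(pi_e) from data logged under pi_0.
import numpy as np

def ips_estimate(rewards, pi_e_probs, pi_0_probs):
    """rewards: observed rewards; pi_e_probs / pi_0_probs: probabilities of the
    logged actions under the evaluation and logging policies."""
    w = pi_e_probs / pi_0_probs          # importance weights correct the distribution shift
    return np.mean(w * rewards)          # unbiased, but high variance when w is large

# Toy example mirroring the slide: logging favors action B (reward -5),
# evaluation favors action A (reward +5).
rng = np.random.default_rng(0)
n = 10_000
actions = rng.choice([0, 1], size=n, p=[0.2, 0.8])       # pi_0: A w.p. 0.2, B w.p. 0.8
rewards = np.where(actions == 0, 5.0, -5.0)
pi_0 = np.where(actions == 0, 0.2, 0.8)
pi_e = np.where(actions == 0, 0.8, 0.2)                  # pi_e: A w.p. 0.8, B w.p. 0.2
print(ips_estimate(rewards, pi_e, pi_0))                 # close to 0.8*5 + 0.2*(-5) = 3.0
```

Because the weights blow up whenever the evaluation policy puts mass where the logging policy rarely explores, the estimate is unbiased but data-inefficient, which is exactly the variance issue raised on the next slides.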
  24. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 25 evaluation logging more less action A when the importance weight is large ・unbiased ・variance (data inefficient)
  25. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 26 evaluation more, logging less for action A — when the importance weight is large Key question: How can we apply IS efficiently in large action spaces? ・unbiased ・high variance (data inefficient)
  26. Importance sampling [Strehl+,10] November 2025 PhD candidacy exam (A-exam) @

    Cornell CS 27 evaluation more, logging less for action A — when the importance weight is large Key question: How can we apply IS efficiently in large action spaces? => Many OPE estimators have been invented by the community (with better bias-variance tradeoffs) ・unbiased ・high variance (data inefficient)
  27. Examples of OPE/L applications • Ranking recommendation • Set selection

    (e.g., fashion) • Sequential decisions (e.g., medicine) • Emerging applications November 2025 PhD candidacy exam (A-exam) @ Cornell CS 28 [KSMSY, WSDM22 Best Paper Award Runner-up; KUNSYS, KDD23; KTKSYS, 24] [KNS, WebConf24; STKKNS, RecSys24] [U*K*BCJKSS, NeurIPS23, KKKKNS, ICLR24] Large-scale RecSys & Gen AI [KCSJ, RecSys25; KKJ, 25]
  28. A paper from my previous work at Cornell November 2025

    PhD candidacy exam (A-exam) @ Cornell CS 29
  29. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

    Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims November 2025 PhD candidacy exam (A-exam) @ Cornell CS 30 (Presented at RecSys2025)
  30. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 31 “WALL-E (2008)” short summary https://movies.disney.com/wall-e
  31. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 32 “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated films with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  32. Motivation for the personalized sentence generation Example of summary/review/reasons for

    recommendations: November 2025 PhD candidacy exam (A-exam) @ Cornell CS 33 For sci-fi lovers, In the distant future, one little robot sparked a cosmic revolution. For romance lovers, In a lonely world, a small robot discovers the power of connection. We’d like to personalize the sentence to each user. “WALL-E (2008)” ・A robot called “WALL-E” and his adventure into space ・Animated films with beautiful pictures and pretty characters ・Science-fiction focusing on environmental destruction ・Heart-warming drama about love and companionship ・Re-discovery of Earth and humanity in a dystopia ・Silent film without explicit quotes
  33. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E November 2025 PhD candidacy exam (A-exam) @ Cornell CS 34
  34. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious November 2025 PhD candidacy exam (A-exam) @ Cornell CS 35
  35. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” November 2025 PhD candidacy exam (A-exam) @ Cornell CS 36
  36. Prompt personalization as a contextual bandit problem • user 𝑢,

    query (movie) 𝑞 • user ID embedding learnt from past interaction / title of the movie, e.g., Star Wars, Wall-E • prompt 𝑎 • genre or tone of movie that user may like, e.g., sci-fi, romance, joyful, serious • sentence 𝑠 • movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection” • reward 𝑟 • click, purchase November 2025 PhD candidacy exam (A-exam) @ Cornell CS 37
  37. Our goal is to optimize the policy to maximize the

    total reward: Goal of Off-Policy Learning (OPL) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 38
  38. Our goal is to optimize the policy to maximize the

    total reward, using the logged data collected by a logging policy 𝜋0 . Goal of Off-Policy Learning (OPL) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 39 need to deal with the partial rewards and distribution shift
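The OPL objective referenced here (whose equation is an image on the slide) commonly takes the following form; this is the standard prompt-policy objective written with the slide's notation (user u, query q, prompt a, sentence s from the frozen LLM) and is a paraphrase, not the paper's exact equation.

```latex
\theta^{\ast} = \arg\max_{\theta} V(\pi_{\theta}),
\qquad
V(\pi_{\theta}) = \mathbb{E}_{(u,q)}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid u,q)}\, \mathbb{E}_{s \sim p_{\mathrm{LLM}}(\cdot \mid q,a)} \big[\, \mathbb{E}[\, r \mid u, q, s \,] \,\big]
```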
  39. Naive approaches Naive approaches estimate the action policy gradient (PG)

    to update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 40 (true action policy gradient) 𝜃: policy parameter
  40. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 41 regression-based [Konda&Tsitsiklis,99] importance sampling-based [Swaminathan&Joachims,16] • imputes the regressed reward • introduces bias when the regression is inaccurate • (regression is often demanding due to partial rewards and covariate shift) • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts)
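A minimal numpy sketch of the two naive gradient estimators contrasted above, assuming a linear softmax prompt policy pi_theta(a|x) ∝ exp(x·theta_a); the function and variable names (and the reward model r_hat) are illustrative, not from the paper.

```python
import numpy as np

def softmax_probs(theta, X):
    """pi_theta(a|x) for a linear softmax policy; theta has shape (|A|, d)."""
    logits = X @ theta.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def is_policy_gradient(theta, X, A, R, pi0_probs):
    """IS-based PG: (1/n) sum_i w_i * r_i * grad log pi_theta(a_i|x_i).
    Unbiased under full support, but high variance for large prompt sets."""
    n = X.shape[0]
    P = softmax_probs(theta, X)
    w = P[np.arange(n), A] / pi0_probs              # importance weights
    grad = np.zeros_like(theta)
    for i in range(n):
        score = -np.outer(P[i], X[i])               # -pi(a'|x) * x for every a'
        score[A[i]] += X[i]                         # +x for the logged action
        grad += w[i] * R[i] * score
    return grad / n

def regression_policy_gradient(theta, X, r_hat):
    """Regression-based PG: gradient of sum_a pi_theta(a|x) * r_hat(x, a).
    Low variance, but biased when r_hat is inaccurate."""
    n = X.shape[0]
    num_actions = theta.shape[0]
    P = softmax_probs(theta, X)
    grad = np.zeros_like(theta)
    for i in range(n):
        rh = r_hat(X[i])                            # (|A|,) predicted rewards for all prompts
        baseline = P[i] @ rh                        # E_{a ~ pi_theta}[r_hat]
        for a in range(num_actions):
            # d/d theta_a of E_{a'~pi}[r_hat(a')] = pi(a|x) * x * (r_hat(a) - baseline)
            grad[a] += P[i, a] * (rh[a] - baseline) * X[i]
    return grad / n
```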
  41. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 42 • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is that they discard rich information about sentences! But we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? importance sampling-based [Swaminathan&Joachims,16]
  42. Naive approaches Naive approaches estimate the policy gradient (PG) to

    update the policy. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 43 • corrects the distribution shift to be unbiased • variance can be significantly high • (especially with a rich set of prompts) The key limitation here is that they discard rich information about sentences! But we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL? importance sampling-based [Swaminathan&Joachims,16]
  43. How to leverage similarities among sentences? November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 44 A. Take the gradient directly in the sentence space.
  44. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 45 (true sentence policy gradient) gradient w.r.t. sentence distribution however, the issue is that the original sentence space is high-dimensional.
  45. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 46 (true sentence policy gradient) gradient w.r.t. sentence distribution (true marginalized sentence policy gradient) gradient w.r.t. marginalized sentence distribution 𝜙(𝑠): kernel-based neighbors of sentence 𝑠
  46. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 47 (true sentence policy gradient) gradient w.r.t. sentence distribution (true marginalized sentence policy gradient) gradient w.r.t. marginalized sentence distribution 𝜙(𝑠): kernel-based neighbors of sentence 𝑠
  47. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 48 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
  48. How to leverage similarities among sentences? We consider estimating the

    following gradient in the sentence space. where November 2025 PhD candidacy exam (A-exam) @ Cornell CS 49 (true marginalized sentence policy gradient) 𝜙(𝑠): kernel-based neighbors of sentence 𝑠 gradient w.r.t. marginalized sentence distribution
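A small sketch of the kernel neighborhood 𝜙(𝑠) idea, using the Gaussian kernel that the experiments later adopt (τ = 1.0); the embedding function and normalization are assumptions.

```python
import numpy as np

def gaussian_kernel_weights(s_emb, s_prime_embs, tau=1.0):
    """Soft 'neighbor' weights of a logged sentence embedding s_emb (d,)
    against candidate sentence embeddings s_prime_embs (k, d), bandwidth tau."""
    sq_dists = np.sum((s_prime_embs - s_emb) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * tau ** 2))   # larger tau -> broader neighborhoods

# Small tau: only near-identical sentences count as neighbors (low bias, high variance).
# Large tau: many sentences share weight (more smoothing, lower variance, more bias).
```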
  49. We estimate the sentence policy gradient using logged data as

    follows. How can we actually estimate/implement the weighted score function? Direct Sentence Off-policy gradient (DSO) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 50 (estimating the marginalized sentence policy gradient)
  50. Estimation of the weighted score function We use the following

    re-sampling technique. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 51 See Appendix for the derivation.
  51. Estimation of the weighted score function We use the following

    re-sampling technique, which suggests that ① DSO does implicit data augmentation via resampling (𝑎, 𝑠′) from the policy 𝜋𝜃 , ② DSO uses soft rejection sampling via the kernel weight, and ③ DSO corrects the logging distribution in the marginalized sentence space. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 52 ① ② ③ See Appendix for the derivation.
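A schematic sketch of steps ①–③, meant only to illustrate the mechanism (resampling from π_θ, soft rejection via the kernel weight, correction by the logging marginal density); the helper names (sample_with_score, llm_generate, embed, logging_marginal_density) are hypothetical and this is not the paper's exact estimator.

```python
import numpy as np

def dso_gradient_step(logged, policy, llm_generate, embed,
                      logging_marginal_density, tau=1.0, num_resamples=8):
    """logged: iterable of (context x, logged sentence s, reward r)."""
    grad = 0.0
    for x, s_logged, r in logged:
        e_logged = embed(s_logged)
        for _ in range(num_resamples):                    # (1) implicit data augmentation:
            a, score_vec = policy.sample_with_score(x)    #     resample a prompt from pi_theta
            s_prime = llm_generate(x, a)                  #     and a sentence s' for it
            k = np.exp(-np.sum((embed(s_prime) - e_logged) ** 2) / (2 * tau ** 2))
            w = k / logging_marginal_density(x, s_logged) # (2) kernel weight (soft rejection)
            grad += w * r * score_vec                     # (3) over the logging marginal density
    return grad / (len(logged) * num_resamples)
```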
  52. Theoretical analysis; bias-variance tradeoff Kernel bandwidth 𝛕 plays an important role

    in controlling the bias-variance tradeoff. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 53 See Appendix for the detailed analysis.
  53. Theoretical analysis; bias-variance tradeoff Kernel bandwidth 𝛕 plays an important role

    in controlling the bias-variance tradeoff. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 54 DSO often achieves a better Pareto frontier of the bias-variance tradeoff than action-based IS.
  54. Synthetic experiments evaluation metric November 2025 PhD candidacy exam (A-exam)

    @ Cornell CS 55 optimal policy uniform random compared methods • Regression [Konda&Tsitsiklis,99] • IS [Swaminathan&Joachims,16] • DR [Dudík+,11] • POTEC [Saito+,24] • DSO (ours) the higher, the better DR: hybrid of regression and IS POTEC: two-stage policy that uses the cluster of actions
  55. Synthetic experiments configurations • data sizes: {500, 1000, 2000, 4000,

    8000} • number of candidate prompts: {10, 50, 100, 500, 1000} • reward noises: {0.0, 1.0, 2.0, 3.0} • For DSO, we use the Gaussian kernel with 𝜏 = 𝟏. 𝟎. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 56 value: default value
  56. Results November 2025 PhD candidacy exam (A-exam) @ Cornell CS

    57 • DSO particularly works well when # of actions and reward noises are large. • DSO is much more data-efficient than the baselines.
  57. MovieLens Result • DSO often performs better than other OPL

    methods. • Especially compared to those involving importance sampling (IS, DR, POTEC), DSO is more robust to performance degradation. Note: “policy value” is the improvement observed over the sentences generated without a prompt, which we call the no-prompt baseline. Experimental results are from 25 different trials. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 58
  58. Summary • We studied OPL for prompt-guided language personalization. •

    The key challenge is dealing with large action spaces of prompts, and we proposed to leverage similarity among sentences via kernels. • Experiments on synthetic/full-LLM envs demonstrate that DSO works well by reducing the variance while keeping the bias small. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 59
  59. My research goal and interests Support human decisions using ML

    systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 61 The work I’ve done so far in the first two years of my PhD
  60. My research goal and interests Support human decisions using ML

    systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) November 2025 PhD candidacy exam (A-exam) @ Cornell CS 62 Main research direction in the rest of my PhD
  61. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 63 candidate retrieval (i.e., fast screening) Large-scale Recsys Retrieval Augmented Generation (RAG)
  62. Two-stage decision-making systems In large-scale recommender systems, we often employ

    a two-stage selection. How can we improve the early-stage retrieval process to enhance the overall quality of the two-stage decisions? November 2025 PhD candidacy exam (A-exam) @ Cornell CS 64 Large-scale Recsys less research more research
  63. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 65 Two-tower models are often employed for fast inference. Encoding user and item info separately:
  64. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 66 Two-tower models are often employed for fast inference. Encoding user and item info separately:
  65. Default practical approach for large-scale RecSys November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 67 Two-tower models are often employed for fast inference. Encoding user and item info separately: Limitation: items may concentrate [Guo+,21].
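A minimal numpy sketch of the two-tower retrieval pattern described above: separate user and item encoders, precomputed item embeddings, and a fast inner-product top-k screening step; the tower sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, weights):
    """A tiny feed-forward encoder: alternating linear layers and ReLU, L2-normalized."""
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

d_user, d_item, d_emb, n_items = 32, 48, 16, 100_000
user_tower = [rng.normal(size=(d_user, 64)), rng.normal(size=(64, d_emb))]
item_tower = [rng.normal(size=(d_item, 64)), rng.normal(size=(64, d_emb))]

# Item embeddings can be precomputed offline; only the user tower runs at request time.
item_embs = tower(rng.normal(size=(n_items, d_item)), item_tower)

def retrieve_top_k(user_features, k=100):
    u = tower(user_features[None, :], user_tower)[0]
    scores = item_embs @ u                      # one matrix-vector product over all items
    return np.argpartition(-scores, k)[:k]      # fast screening of candidates

print(retrieve_top_k(rng.normal(size=d_user))[:5])
```

In practice the inner-product search is served by an approximate nearest-neighbor index, which is what makes the candidate-retrieval stage fast enough for millions or billions of items.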
  66. Research direction How to achieve the conflicting objectives of computational

    efficiency and model adaptability/expressiveness simultaneously? • How to introduce diversity in candidates while using two-tower model?* • How to be adaptive to rich runtime info like user prompts and real-time news? • How to bridge the gap of input/model complexity in two-stage RecSys? November 2025 PhD candidacy exam (A-exam) @ Cornell CS 68 * HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two Stage Decisions. 2025.
  67. Overview of my PhD research November 2025 PhD candidacy exam

    (A-exam) @ Cornell CS 69 Support human decisions using ML systems (1) how to leverage logged data (off-policy evaluation and learning) (2) how to steer systems for long-term success (dynamics, control, social aspects) (3) how to build a scalable and adaptable RecSys (practical constraints) HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025. HK, Daniel Yiming Cao, Yuta Saito, Thorsten Joachims. An Off-Policy Learning Approach for Steering Sentence Generation toward Personalization. RecSys, 2025. HK, Rayhan Khanna, Thorsten Joachims. Off-Policy Learning for Diversity-aware Candidate Retrieval in Two-stage Decisions. 2025. .. and more to come!
  68. Thank you for listening! Many thanks to my committee and

    amazing collaborators, and kind support from Funai Overseas Scholarship and Quad Fellowship! November 2025 PhD candidacy exam (A-exam) @ Cornell CS 70
  69. We observe the reward only for the sentence generated by

    the chosen prompt. Data generation process November 2025 PhD candidacy exam (A-exam) @ Cornell CS 72 (examples are generated by ChatGPT-3.5 [Brown+,20])
  70. Two axes for optimizing/personalizing LLMs • Model params (fine-tuning) •

    have flexibility in optimization • can be expensive in computation & memory • Prompts • do not require costly model training • users and third-party companies can exploit them • less specificity compared to fine-tuning November 2025 PhD candidacy exam (A-exam) @ Cornell CS 73 • Pairwise feedback (RLHF, DPO) • learns reward from preference data • human annotation can be costly & unethical • Online interaction data (RL) • can retrieve reward for any decisions • extensive exploration often negatively impacts user feedback Params Datasets
  71. Two axes for optimizing/personalizing LLMs • Model params (fine-tuning) •

    have flexibility in optimization • can be expensive in computation & memory • Prompts • do not require costly model training • users and third-party companies can exploit them • less specificity compared to fine-tuning November 2025 PhD candidacy exam (A-exam) @ Cornell CS 74 • Pairwise feedback (RLHF, DPO) • learns reward from preference data • human annotation can be costly & unethical • Online interaction data (RL) • can retrieve reward for any decisions • extensive exploration often negatively impacts user feedback • Logged bandit feedback (Ours) • allows safe and low-cost data collection • needs to deal with counterfactuals & dist. shift Params Datasets for the first time!
  72. Theoretical analysis; support condition ① DSO is less likely to

    incur deficient support. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 75 (similar sentence support) (action support) because the similar sentence support is a relaxed condition of the action support.
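For reference, the two support (overlap) conditions contrasted here typically take the shape below; this is a paraphrase of the standard conditions in the slide's notation, not the paper's exact statement.

```latex
\text{(action support)} \quad
\pi_{\theta}(a \mid x) > 0 \;\Rightarrow\; \pi_{0}(a \mid x) > 0 \quad \forall\, x, a
\\[6pt]
\text{(similar sentence support)} \quad
p_{\pi_{\theta}}\!\left(\phi(s) \mid x\right) > 0 \;\Rightarrow\; p_{\pi_{0}}\!\left(\phi(s) \mid x\right) > 0 \quad \forall\, x, s
```

Since many different prompts can produce sentences that fall in the same kernel neighborhood 𝜙(𝑠), the sentence-level condition is easier to satisfy than requiring support for every individual prompt.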
  73. Theoretical analysis; bias ② DSO has small bias when kernel

    bandwidth 𝛕 is small. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 76
  74. Theoretical analysis; bias ② DSO has small bias when kernel

    bandwidth 𝛕 is small. • • • November 2025 PhD candidacy exam (A-exam) @ Cornell CS 77 This term comes from the within-neighbor reward shift. These terms come from applying marginalization via kernels in the sentence space.
  75. Theoretical analysis; variance ③ DSO has a large variance reduction

    when kernel bandwidth 𝛕 is large. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 78
  76. Theoretical analysis; variance ③ DSO has a large variance reduction

    when kernel bandwidth 𝛕 is large. • • November 2025 PhD candidacy exam (A-exam) @ Cornell CS 79 This term reduces variance by avoiding within-neighbor importance weights: This term reduces variance by doing implicit data augmentation and soft-rejection sampling.
  77. How to estimate the logging marginal density? To use DSO,

    we need to estimate the logging marginal density, i.e., the density of the kernel neighborhood 𝜙(𝑠) under the logging policy. We can use function approximation trained on the following MSE loss. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 80
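One plausible way to implement the MSE-based function approximation mentioned here, under the assumption that the marginal density is a kernel-smoothed expectation over sentences drawn under π_0; the model class (ridge regression on concatenated features) and the sampling helper are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Ridge

def gaussian_kernel(e1, e2, tau=1.0):
    return np.exp(-np.sum((e1 - e2) ** 2, axis=-1) / (2 * tau ** 2))

def fit_logging_marginal_density(contexts, sent_embs, sample_sentence_under_pi0,
                                 tau=1.0, mc_samples=16):
    """contexts: (n, d_x); sent_embs: (n, d_s) embeddings of the logged sentences.
    sample_sentence_under_pi0(x) returns one sentence embedding drawn under pi_0."""
    X, y = [], []
    for x, e in zip(contexts, sent_embs):
        draws = np.stack([sample_sentence_under_pi0(x) for _ in range(mc_samples)])
        target = gaussian_kernel(draws, e, tau).mean()   # Monte Carlo kernel-density target
        X.append(np.concatenate([x, e]))                 # simple (context, sentence) features
        y.append(target)
    model = Ridge(alpha=1.0).fit(np.array(X), np.array(y))  # squared (MSE) loss
    return lambda x, e: float(model.predict(np.concatenate([x, e])[None, :])[0])
```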
  78. Synthetic experiments data generation process • sentence distribution • reward

    distribution November 2025 PhD candidacy exam (A-exam) @ Cornell CS 81 sentence distribution (prompt → sentence): smooth, different prompts can result in similar sentences; reward distribution (sentence → reward): smooth, different sentences result in different rewards
  79. Synthetic experiments ablations • kernel bandwidth: {0.5, 1.0, 2.0, 4.0}

    • logging marginal density: {w/ and w/o function approx.} (w/o is the Monte Carlo estimation) • add noise 𝝈𝒔 = 𝟏. 𝟎 to the sentence embeddings to measure the distance November 2025 PhD candidacy exam (A-exam) @ Cornell CS 82 value: default value
  80. • We observe some bias-variance tradeoff when using Monte Carlo estimation.

    • Using a Gaussian kernel and the function approx. improves the robustness of DSO to the choice of bandwidth hyperparameter 𝜏. Ablation results November 2025 PhD candidacy exam (A-exam) @ Cornell CS 83
  81. Why does function approx. improve the robustness of DSO? A.

    Because we use the MSE loss to fit the marginal density model. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 84 For example, when the true marginal density is 1e-5, estimating it as 1e-5 or 1e-4 barely changes the MSE loss. In contrast, when the density appears in the denominator of the weight, 1e-4 vs. 1e-5 makes a significant (10x) difference. Using function approximation, we can avoid being too precise about small values of the marginal density.
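A tiny numeric illustration of this point, with assumed example values:

```python
# Confusing 1e-5 with 1e-4 costs almost nothing under an MSE loss, but changes
# a density-in-the-denominator weight by a factor of 10.
true_density, est_density = 1e-5, 1e-4
print((est_density - true_density) ** 2)        # ~8.1e-09 : negligible MSE penalty
print(1.0 / true_density, 1.0 / est_density)    # 100000 vs 10000 : 10x weight difference
```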
  82. Full-LLM experiment using MovieLens [Harper&Konstan,15] Original dataset:

    𝑢: user id, 𝑞: item id (movie title), 𝑟: ratings. Augmented dataset: movie descriptions generated by Mistral-7B (zero-shot, w/o prompt). Reward simulator: user id embedding (・) inner product with DistilBert-encoded movie description; loss function: MSE in reward prediction. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 85
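A sketch of the reward-simulator architecture summarized above (user-ID embedding, inner product with a DistilBERT encoding of the movie description, trained with an MSE loss on ratings); the pooling choice, dimensions, and training-loop line are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class RewardSimulator(nn.Module):
    def __init__(self, num_users, emb_dim=768):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

    def forward(self, user_ids, descriptions):
        tokens = self.tokenizer(descriptions, return_tensors="pt",
                                padding=True, truncation=True)
        desc = self.encoder(**tokens).last_hidden_state.mean(dim=1)  # (batch, 768), mean-pooled
        u = self.user_emb(user_ids)                                  # (batch, 768)
        return (u * desc).sum(dim=-1)                                # inner product -> predicted reward

# Training sketch: minimize MSE between predicted and observed ratings.
# loss = nn.MSELoss()(model(user_ids, descriptions), ratings.float())
```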
  83. Full-LLM experiment • Semi-synthetic experiments on the MovieLens-10M dataset [Harper&Konstan,15].

    • DistilBert [Sanh+,19]-based reward simulator is trained on the data. (next page) • User and query (i.e., movie) are sampled from the dataset. • Candidate prompts are retrieved from RelatedWord.io. • Using Mistral-7B [Jiang+,23] as the frozen LLM to generate the sentence. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 86
  84. Reward simulation results of full-LLM bench. November 2025 PhD candidacy

    exam (A-exam) @ Cornell CS 88 (Left) “positive” indicates the movies with a rating of 5, while “negative” indicates those with ratings of 0-3. (Right) Showing the distribution of the normalized reward, which indicates the improvement in expected reward gained by using the given prompt, compared to that of the sentence generated without prompts. The normalized value is multiplied by 10, so the differences become evident when running policy learning methods.
  85. Derivation of the weighted score function (1/2) As a preparation,

    we first derive the following expression of the importance weight. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 89
  86. Derivation of the weighted score function (2/2) Then, we transform

    the weighted score function as follows. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 90
  87. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 92 … … viewer provider
  88. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 93 … … provider Viewers receive content recommendations. If viewers are happy/dissatisfied with the content, they may increase/decrease participation. Providers supply content to the platform. If providers receive high/inadequate exposure, they may increase/decrease production. Many papers assume that both viewer and provider populations are static, but..
  89. (2) how to steer systems for long-term success Both viewers

    and providers are essential for the success of two-sided platforms. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 94 … … provider How should we design content allocation to pursue the long-term success? Applications include video streaming, online ads, job matching, SNS, and more! HK, Fan Yao, Sarah Dean. Policy Design in Two-sided Platforms with Participation Dynamics. ICML, 2025.
  90. Reference (1/4) [SST20] Noveen Sachdeva, Yi Su, Thorsten Joachims. Off-policy

    Bandits with Deficient Support. KDD, 2020. [FDRC22] Nicolò Felicioni, Maurizio Ferrari Dacrema, Marcello Restelli, Paolo Cremonesi. Off-Policy Evaluation with Deficient Support Using Side Information. NeurIPS, 2022. [KSU23] Samir Khan, Martin Saveski, Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. 2023. [TGTRV21] Hung Tran-The, Sunil Gupta, Thanh Nguyen-Tang, Santu Rana, Svetha Venkatesh. Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support. 2021. [KKKKNS24] Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. ICLR, 2024. [UKNST24] Takuma Udagawa, Haruka Kiyohara, Yusuke Narita, Yuta Saito, Kei Tateno. Policy-Adaptive Estimator Selection for Off-Policy Evaluation. AAAI, 2023. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 96
  91. Reference (2/4) [SSK20] Yi Su, Pavithra Srinath, Akshay Krishnamurthy. Adaptive

    Estimator Selection for Off-Policy Evaluation. ICML, 2020. [SAAC24] Otmane Sakhi, Imad Aouali, Pierre Alquier, Nicolas Chopin. Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning. 2024. [CNSMBT21] Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip S. Thomas. Universal Off-Policy Evaluation. NeurIPS, 2021. [HLLA21] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Contextual Bandits. NeurIPS, 2021. [HLLA22] Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli. Off-Policy Risk Assessment in Markov Decision Processes. AISTATS, 2022. [WUS23] Runzhe Wu, Masatoshi Uehara, Wen Sun. Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML, 2023. [YJ22] Yuta Saito, Thorsten Joachims. Off-Policy Evaluation for Large Action Spaces via Embeddings. ICML, 2022. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 97
  92. Reference (3/4) [YRJ22] Yuta Saito, Qingyang Ren, Thorsten Joachims. Off-Policy

    Evaluation for Large Action Spaces via Conjunct Effect Modeling. ICML, 2023. [TDCT23] Muhammad Faaiz Taufiq, Arnaud Doucet, Rob Cornish, Jean-Francois Ton. Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits. NeurIPS, 2023. [SWLKM24] Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, Julian McAuley. Off-Policy Evaluation for Large Action Spaces via Policy Convolution. WWW, 2024. [KSMNSY22] Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto. Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. WSDM, 2022. [KUNSYS23] Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito. Off-Policy Evaluation of Ranking Policies under Diverse User Behavior. KDD, 2023. [KNS24] Haruka Kiyohara, Masahiro Nomura, Yuta Saito. Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction. WWW, 2024. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 98
  93. Reference (4/4) [STKKNS24] Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka

    Kiyohara, Masahiro Nomura, Yuta Saito. Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits. RecSys, 2024. [SKADLJZ17] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni. Off-policy evaluation for slate recommendation. NeurIPS, 2017. [SAACL24] Yuta Saito, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, Mounia Lalmas. Long-term Off-Policy Evaluation and Learning. WWW, 2024. [Beygelzimer&Langford,09] Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” KDD, 2009. [Strehl+,10] Alex Strehl, John Langford, Sham Kakade, and Lihong Li. “Learning from Logged Implicit Exploration Data.” NeurIPS, 2010. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 99
  94. Reference (5/6) [Konda&Tsitsiklis,99] Vijay Konda and John Tsitsiklis. Actor-critic algorithms.

    NeurIPS, 1999. [Swaminathan&Joachims,16] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2016. [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011. [Saito+,24] Yuta Saito, Jihan Yao, and Thorsten Joachims. Potec: Off-policy learning for large action spaces via two-stage policy decomposition. 2024. [Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020. November 2025 PhD candidacy exam (A-exam) @ Cornell CS 100
  95. Reference (6/6) [Jiang+,23] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch,

    Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mistral 7B. 2023. [Sanh+,19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019. [Harper&Konstan,15] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. TIIS, 2015. [Guo+,21] Wenshuo Guo, Karl Krauth, Michael I. Jordan, and Nikhil Garg. The Stereotyping Problem in Collaboratively Filtered Recommender Systems. EAAMO, 2021.