[RecSys'25] An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization


Haruka Kiyohara

September 23, 2025

Transcript

  1. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization
    Haruka Kiyohara, Daniel Cao, Yuta Saito, Thorsten Joachims
  2. Motivation for personalized sentence generation
    Example of a summary / review / reason for a recommendation: “WALL-E (2008)”
    ・A robot called “WALL-E” and his adventure into space
    ・Animated film with beautiful pictures and pretty characters
    ・Science fiction focusing on environmental destruction
    ・Heart-warming drama about love and companionship
    ・Re-discovery of earth and humanity in a dystopia
    ・Silent film without explicit quotes
    For sci-fi lovers: “In the distant future, one little robot sparked a cosmic revolution.”
    For romance lovers: “In a lonely world, a small robot discovers the power of connection.”
    We’d like to personalize the sentence to each user.
  3. We observe the reward only for the sentence generated by the chosen prompt. Running a logging policy, we collect the logged data.
    (examples are generated by ChatGPT-3.5 [Brown+,20])
  4. More formally, we generate the data as follows:
    • user 𝑢, query 𝑞: user ID embedding learned from past interactions / title of the movie, e.g., Star Wars, WALL-E
    • prompt 𝑎 (and its vectorial embedding 𝑒𝑎): an aspect of the movie that the user may like, e.g., sci-fi, romance, joyful, serious
    • sentence 𝑠: movie slogan generated by the (frozen) LLM, e.g., “.. cosmic revolution” or “.. power of connection”
    • reward 𝑟: click, purchase
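    As a reading aid, a minimal sketch of what one logged record could look like in code; the field names are hypothetical illustrations, not taken from the paper.

        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class LoggedRecord:
            """One row of the logged bandit data (hypothetical field names)."""
            user: np.ndarray          # user ID embedding u, learned from past interactions
            query: str                # query q, e.g., the movie title "WALL-E"
            prompt: int               # index of the prompt a chosen by the logging policy pi_0
            prompt_emb: np.ndarray    # prompt embedding e_a
            sentence_emb: np.ndarray  # embedding of the sentence s generated by the frozen LLM
            reward: float             # observed reward r, e.g., click or purchase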
  5. Goal of Off-Policy Learning (OPL)
    Our goal is to optimize the policy to maximize the total reward, using the logged data collected by a logging policy 𝜋0.
    We need to deal with the partial rewards and the distribution shift.
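    The objective itself is an equation image on the slide; in standard notation, consistent with the definitions on slide 4 (a reconstruction, not a verbatim transcription), it reads:

        V(\pi_\theta) = \mathbb{E}_{p(u,q)\,\pi_\theta(a \mid u,q)\,p_{\mathrm{LLM}}(s \mid q,a)}\big[\, \mathbb{E}[r \mid u,q,s] \,\big],
        \qquad
        \mathcal{D} = \{(u_i, q_i, a_i, s_i, r_i)\}_{i=1}^{n}, \quad a_i \sim \pi_0(\cdot \mid u_i, q_i).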
  6. Naive approaches
    Naive approaches estimate the action policy gradient (PG) to update the policy.
    (true action policy gradient; 𝜃: policy parameter)
  7. Naive approaches
    Naive approaches estimate the policy gradient (PG) to update the policy.
    Regression-based [Konda&Tsitsiklis,99]:
    • imputes the regressed reward
    • introduces bias when the regression is inaccurate
    • (regression is often demanding due to partial rewards and covariate shift)
    Importance sampling-based [Swaminathan&Joachims,16]:
    • corrects the distribution shift to be unbiased
    • variance can be significantly high
    • (especially with a rich set of prompts)
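    For concreteness, the two naive gradient estimators contrasted here are usually written as follows (standard forms, not transcribed from the slide; \hat{q} is a reward regression):

        \widehat{\nabla_\theta V}_{\mathrm{Reg}}(\pi_\theta)
          = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\pi_\theta(a \mid u_i,q_i)}\big[\, \hat{q}(u_i,q_i,a)\, \nabla_\theta \log \pi_\theta(a \mid u_i,q_i) \,\big],
        \qquad
        \widehat{\nabla_\theta V}_{\mathrm{IS}}(\pi_\theta)
          = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_\theta(a_i \mid u_i,q_i)}{\pi_0(a_i \mid u_i,q_i)}\, r_i\, \nabla_\theta \log \pi_\theta(a_i \mid u_i,q_i).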
  8. Naive approaches
    Naive approaches estimate the policy gradient (PG) to update the policy.
    Importance sampling-based [Swaminathan&Joachims,16]:
    • corrects the distribution shift to be unbiased
    • variance can be significantly high
    • (especially with a rich set of prompts)
    The key limitation here is treating each prompt independently. But we know that using word and sentence embeddings is often beneficial in many NLP tasks. .. can we leverage similarities in OPL?
  10. How to leverage similarities among sentences?
    A. Take the gradient directly in the sentence space.
  11. How to leverage similarities among sentences?
    We consider estimating the following gradient in the sentence space.
    (true sentence policy gradient: the gradient w.r.t. the sentence distribution)
    However, the issue is that the original sentence space is high dimensional.
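    The sentence-space gradient referred to here can be written via the sentence distribution induced by the policy and the frozen LLM (a reconstruction in standard notation):

        \nabla_\theta V(\pi_\theta)
          = \mathbb{E}_{p(u,q)\, p_{\pi_\theta}(s \mid u,q)}\big[\, q(u,q,s)\, \nabla_\theta \log p_{\pi_\theta}(s \mid u,q) \,\big],
        \quad \text{where } p_{\pi_\theta}(s \mid u,q) := \sum_{a} \pi_\theta(a \mid u,q)\, p_{\mathrm{LLM}}(s \mid q,a).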
  12. How to leverage similarities among sentences?
    We consider estimating the following gradient in the marginalized sentence space.
    (true marginalized sentence policy gradient: the gradient w.r.t. the marginalized sentence distribution; 𝜙(𝑠): kernel-based neighbors of sentence 𝑠)
  13. How to leverage similarities among sentences?
    We consider estimating the following gradient in the marginalized sentence space, where the marginalized quantities are defined on the next slide.
    (true marginalized sentence policy gradient; 𝜙(𝑠): kernel-based neighbors of sentence 𝑠)
  14. How to leverage similarities among sentences?
    We consider estimating the following gradient in the marginalized sentence space, where the two marginalized quantities are the expected reward within 𝜙(𝑠) under policy 𝜋 and the probability of observing a sentence within 𝜙(𝑠).
    (true marginalized sentence policy gradient: the gradient w.r.t. the marginalized sentence distribution; 𝜙(𝑠): kernel-based neighbors of sentence 𝑠)
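    Putting the annotations of slides 12-14 together, the marginalized sentence policy gradient should look roughly as follows (a reconstruction; a hard neighborhood indicator is used for readability where the paper uses kernel weights):

        \nabla_\theta V(\pi_\theta)
          \approx \mathbb{E}_{p(u,q)\, p_{\pi_\theta}(s \mid u,q)}\big[\, q_{\pi_\theta}\!\big(u,q,\phi(s)\big)\, \nabla_\theta \log p_{\pi_\theta}\!\big(\phi(s) \mid u,q\big) \,\big],
        \text{where}
        p_{\pi_\theta}\!\big(\phi(s) \mid u,q\big) := \mathbb{E}_{p_{\pi_\theta}(s' \mid u,q)}\big[\, \mathbb{1}\{s' \in \phi(s)\} \,\big]
        \quad \text{(probability of observing a sentence within } \phi(s)\text{)},
        q_{\pi_\theta}\!\big(u,q,\phi(s)\big) := \mathbb{E}\big[\, r \mid u,q,\, s' \in \phi(s);\ \pi_\theta \,\big]
        \quad \text{(expected reward within } \phi(s) \text{ under the policy)}.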
  15. Direct Sentence Off-policy gradient (DSO)
    We estimate the sentence policy gradient using the logged data as follows.
    (estimating the marginalized sentence policy gradient)
    How can we actually estimate/implement the weighted score function?
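    The estimator itself is an equation image; given the marginalized gradient above, its general shape should be an empirical average over the logged data, with the logged reward multiplied by a "weighted score function", i.e., the importance weight in the marginalized sentence space times the score of the marginal sentence distribution (a reconstruction):

        \widehat{\nabla_\theta V}_{\mathrm{DSO}}(\pi_\theta)
          = \frac{1}{n}\sum_{i=1}^{n} r_i\,
            \underbrace{\frac{p_{\pi_\theta}\!\big(\phi(s_i) \mid u_i,q_i\big)}{p_{\pi_0}\!\big(\phi(s_i) \mid u_i,q_i\big)}\,
            \nabla_\theta \log p_{\pi_\theta}\!\big(\phi(s_i) \mid u_i,q_i\big)}_{\text{weighted score function}}.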
  16. Estimation of the weighted score function
    We use the following re-sampling technique.
    See Appendix for the derivation.
  17. Estimation of the weighted score function
    We use the following re-sampling technique, which suggests that
    ① DSO does implicit data augmentation via resampling (𝑎, 𝑠′) from the policy 𝜋𝜃,
    ② DSO uses soft rejection sampling via the kernel weight, and
    ③ DSO corrects the logging distribution in the marginalized sentence space.
    See Appendix for the derivation.
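    A minimal Monte-Carlo sketch of this re-sampling procedure, assuming hypothetical callables for the policy, the frozen LLM sampler, the kernel, and a fitted logging marginal density; it illustrates steps ①-③ above, and is not the authors' implementation.

        def dso_weighted_score(x, s_logged, policy, llm_sample, kernel,
                               logging_marginal_density, n_resamples=32):
            """Monte-Carlo estimate of the weighted score function for one logged record.

            x: context (user embedding + query); s_logged: logged sentence embedding.
            policy.sample / policy.grad_log_prob, llm_sample, kernel, and
            logging_marginal_density are hypothetical callables.
            """
            grad = 0.0
            for _ in range(n_resamples):
                a_new = policy.sample(x)         # (1) implicit data augmentation: resample a prompt from pi_theta
                s_new = llm_sample(x, a_new)     #     ... and a sentence from the frozen LLM
                w = kernel(s_new, s_logged)      # (2) soft rejection sampling via the kernel weight
                grad = grad + w * policy.grad_log_prob(x, a_new)
            grad = grad / n_resamples            # approximates E_{pi_theta}[ K(s', s) * grad log pi_theta(a | x) ]
            return grad / logging_marginal_density(x, s_logged)  # (3) correct by the logging marginal density

    The full DSO gradient is then the average of 𝑟𝑖 times this quantity over the logged data.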
  18. How to estimate the logging marginal density?
    To use DSO, we need to estimate the logging marginal density, i.e., the probability of observing a sentence within 𝜙(𝑠) under the logging policy 𝜋0.
    We can use function approximation trained with the following MSE loss.
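    A minimal sketch of such a function approximator, assuming the training targets are Monte-Carlo kernel weights computed under the logging policy; the architecture and names are illustrative, not the paper's.

        import torch
        import torch.nn as nn

        class LoggingMarginalDensity(nn.Module):
            """Regress p_{pi_0}(phi(s) | x) from (context, sentence-embedding) pairs."""
            def __init__(self, dim_x, dim_s, hidden=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(dim_x + dim_s, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1), nn.Softplus(),  # the density is non-negative
                )

            def forward(self, x, s):
                return self.net(torch.cat([x, s], dim=-1)).squeeze(-1)

        def mse_step(model, optimizer, x, s, mc_kernel_targets):
            """One gradient step on the MSE loss; the targets are Monte-Carlo
            estimates of E_{pi_0}[K(s', s)] computed from the logged data."""
            loss = ((model(x, s) - mc_kernel_targets) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()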
  19. Theoretical analysis: support condition
    ① DSO is less likely to incur deficient support, because the similar-sentence support is a relaxed condition of the action support.
    (similar sentence support) (action support)
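    The two support conditions are not spelled out in the transcript; in standard form they should read roughly as follows (a reconstruction):

        \text{Action support: } \pi_\theta(a \mid u,q) > 0 \implies \pi_0(a \mid u,q) > 0 \quad \forall a,
        \qquad
        \text{Similar-sentence support: } p_{\pi_\theta}\!\big(\phi(s) \mid u,q\big) > 0 \implies p_{\pi_0}\!\big(\phi(s) \mid u,q\big) > 0.

    The latter is weaker: even if the exact prompt that generates 𝑠 is never chosen by 𝜋0, the neighborhood 𝜙(𝑠) can still be reachable under 𝜋0 through similar sentences.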
  20. Theoretical analysis: bias
    ② DSO has a small bias when the kernel bandwidth 𝜏 is small.
  21. Theoretical analysis: bias
    ② DSO has a small bias when the kernel bandwidth 𝜏 is small.
    One term of the bias comes from the within-neighbor reward shift; the other terms come from applying marginalization via kernels in the sentence space.
  22. Theoretical analysis: variance
    ③ DSO has a large variance reduction when the kernel bandwidth 𝜏 is large.
  23. Theoretical analysis: variance
    ③ DSO has a large variance reduction when the kernel bandwidth 𝜏 is large.
    One term reduces variance by avoiding within-neighbor importance weights; the other term reduces variance by doing implicit data augmentation and soft rejection sampling.
  24. Theoretical analysis: bias-variance tradeoff
    ④ The kernel bandwidth 𝜏 plays an important role in controlling the bias-variance tradeoff.
    DSO often achieves a better Pareto frontier of the bias-variance tradeoff than action-based IS.
  25. Synthetic experiments: evaluation metric and compared methods
    Evaluation metric: policy value relative to the optimal policy and a uniform-random policy (the higher, the better).
    Compared methods:
    • Regression [Konda&Tsitsiklis,99]
    • IS [Swaminathan&Joachims,16]
    • DR [Dudík+,11] (hybrid of regression and IS)
    • POTEC [Saito+,24] (two-stage policy that uses clusters of actions)
    • DSO (ours)
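    The exact metric is shown as an equation on the slide; a common normalization consistent with the mention of the optimal and uniform-random policies (an assumption, not confirmed by the transcript) is the relative policy value:

        \text{relative value}(\pi_\theta) = \frac{V(\pi_\theta) - V(\pi_{\mathrm{unif}})}{V(\pi^{*}) - V(\pi_{\mathrm{unif}})}.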
  26. Synthetic experiments: data generation process
    • sentence distribution (prompt → sentence): smooth, so that different prompts can result in similar sentences
    • reward distribution (sentence → reward): smooth, so that different sentences result in different rewards
  27. Synthetic experiments: configurations
    • data sizes: {500, 1000, 2000, 4000, 8000}
    • number of candidate prompts: {10, 50, 100, 500, 1000}
    • reward noises: {0.0, 1.0, 2.0, 3.0}
    • For DSO, we use the Gaussian kernel with 𝜏 = 1.0.
    (bold values on the slide indicate the defaults)
  28. Results
    • DSO works particularly well when the number of actions and the reward noise are large.
    • DSO is much more data-efficient than the baselines.
  29. Synthetic experiments: ablations
    • kernel bandwidth: {0.5, 1.0, 2.0, 4.0}
    • logging marginal density: {w/ and w/o function approximation} (w/o uses the Monte-Carlo estimate)
    • noise 𝜎𝑠 = 1.0 is added to the sentence embeddings used to measure the distance
    (bold values on the slide indicate the defaults)
  30. Ablation results
    • We observe some bias-variance tradeoff when using the Monte-Carlo estimate.
    • Using a Gaussian kernel and the function approximation improves the robustness of DSO to the choice of the bandwidth hyperparameter 𝜏.
  31. Full-LLM experiment
    • Semi-synthetic experiments on the MovieLens-10M dataset [Harper&Konstan,15].
    • A DistilBERT [Sanh+,19]-based reward simulator is trained on the data (next page).
    • Users and queries (i.e., movies) are sampled from the dataset.
    • Candidate prompts are retrieved from RelatedWord.io.
    • Mistral-7B [Jiang+,23] is used as the frozen LLM to generate the sentences.
  32. Reward simulator fine-tuning on MovieLens-10M
    Original CF dataset: 𝑢: user id, 𝑞: item id (movie title), 𝑟: ratings.
    Augmented dataset: adds a movie description generated by Mistral-7B (zero-shot, w/o prompt).
    Reward simulator: inner product of the user id embedding and the DistilBERT encoding of the movie description.
    Loss function: MSE in reward prediction.
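    A minimal sketch of such a reward simulator, assuming the architecture described above (the details and names are illustrative, not the authors' code):

        import torch
        import torch.nn as nn
        from transformers import AutoModel

        class RewardSimulator(nn.Module):
            """Predicted reward = <user-id embedding, DistilBERT encoding of the movie description>."""
            def __init__(self, n_users, model_name="distilbert-base-uncased"):
                super().__init__()
                self.encoder = AutoModel.from_pretrained(model_name)
                dim = self.encoder.config.dim            # DistilBERT hidden size (768)
                self.user_emb = nn.Embedding(n_users, dim)

            def forward(self, user_ids, input_ids, attention_mask):
                # first-token encoding of the movie description
                h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
                return (self.user_emb(user_ids) * h).sum(dim=-1)  # inner product

        # Training minimizes the MSE between predicted and observed ratings, e.g.
        #   loss = ((model(u, ids, mask) - ratings) ** 2).mean()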
  33. Result
    • DSO often performs better than the other OPL methods.
    • Especially compared to the methods involving importance sampling (IS, DR, POTEC), DSO is more robust against performance degradation.
    Note: “policy value” is the improvement over the sentences generated without a prompt, which we call the no-prompt baseline.
  34. Summary
    • We studied OPL for prompt-based language generation.
    • The key challenge is dealing with the large action space of prompts, and we proposed to leverage similarity among sentences via kernels.
    • DSO reduces variance by (1) applying IS in the marginalized sentence space and (2) applying implicit data augmentation via the re-sampling technique.
    • Experiments in synthetic and full-LLM environments demonstrate that DSO works very well by reducing the variance while maintaining a small bias.
  35. Two axes for optimizing/personalizing LLMs
    Params:
    • Model parameters (fine-tuning): flexibility in optimization, but can be expensive in computation and memory.
    • Prompts: no costly model training, and users and third-party companies can exploit them, but less specificity compared to fine-tuning.
    Datasets:
    • Pairwise feedback (RLHF, DPO): learns a reward from preference data, but human annotation can be costly and unethical.
    • Online interaction data (RL): can retrieve the reward for any decision, but extensive exploration often negatively impacts the user feedback.
  36. Two axes for optimizing/personalizing LLMs
    Params:
    • Model parameters (fine-tuning): flexibility in optimization, but can be expensive in computation and memory.
    • Prompts: no costly model training, and users and third-party companies can exploit them, but less specificity compared to fine-tuning.
    Datasets:
    • Pairwise feedback (RLHF, DPO): learns a reward from preference data, but human annotation can be costly and unethical.
    • Online interaction data (RL): can retrieve the reward for any decision, but extensive exploration often negatively impacts the user feedback.
    • Logged bandit feedback (ours, for the first time!): allows safe and costless data collection, but needs to deal with counterfactuals and distribution shift.
  37. Why does the function approximation improve the robustness of DSO?
    A. Because we use the MSE loss to fit the marginal density model.
    For example, when the true marginal density is 1e-5, estimating it as 1e-5 or 1e-4 barely changes the MSE loss; but once the estimate enters the denominator of the weight, 1e-4 versus 1e-5 makes a tenfold difference. Using function approximation, we can avoid being too precise about small values of the marginal density.
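    A two-line illustration of this point, using the 1e-5 / 1e-4 numbers from the example above:

        # the squared error between 1e-4 and 1e-5 is tiny ...
        print((1e-4 - 1e-5) ** 2)       # 8.1e-09: nearly invisible to an MSE loss
        # ... but in the denominator of the weight it is a 10x difference
        print(1.0 / 1e-5, 1.0 / 1e-4)   # 100000.0 10000.0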
  38. Examples of sentence generation in the full-LLM benchmark.
  39. Reward simulation results of the full-LLM benchmark.
    (Left) “positive” indicates the movies with a rating of 5, while “negative” indicates those with ratings of 0-3.
    (Right) Distribution of the normalized reward, which indicates the improvement of the expected reward gained by using the given prompt, compared to that of the sentence generated without a prompt. The normalized value is multiplied by 10, so the difference becomes evident when running the policy learning methods.
  40. Derivation of the weighted score function (1/2)
    As a preparation, we first derive the following expression of the importance weight.
  41. Derivation of the weighted score function (2/2)
    Then, we transform the weighted score function as follows.
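    The derivation itself is an equation image; the key identity behind the re-sampling estimator of slide 17 should be the log-derivative trick applied to the marginal sentence distribution (a reconstruction, written with a hard neighborhood indicator in place of the kernel, and with x := (u, q)):

        p_{\pi_\theta}\!\big(\phi(s) \mid x\big)\, \nabla_\theta \log p_{\pi_\theta}\!\big(\phi(s) \mid x\big)
          = \nabla_\theta\, p_{\pi_\theta}\!\big(\phi(s) \mid x\big)
          = \nabla_\theta\, \mathbb{E}_{\pi_\theta(a \mid x)\, p_{\mathrm{LLM}}(s' \mid x,a)}\big[\, \mathbb{1}\{s' \in \phi(s)\} \,\big]
          = \mathbb{E}_{\pi_\theta(a \mid x)\, p_{\mathrm{LLM}}(s' \mid x,a)}\big[\, \mathbb{1}\{s' \in \phi(s)\}\, \nabla_\theta \log \pi_\theta(a \mid x) \,\big],
        so that
        \frac{p_{\pi_\theta}(\phi(s) \mid x)}{p_{\pi_0}(\phi(s) \mid x)}\, \nabla_\theta \log p_{\pi_\theta}\!\big(\phi(s) \mid x\big)
          = \frac{\mathbb{E}_{\pi_\theta(a \mid x)\, p_{\mathrm{LLM}}(s' \mid x,a)}\big[\, \mathbb{1}\{s' \in \phi(s)\}\, \nabla_\theta \log \pi_\theta(a \mid x) \,\big]}{p_{\pi_0}\!\big(\phi(s) \mid x\big)},

    which matches the resampling (①, ②) and the logging-density correction (③) on slide 17.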
  42. A baseline approach: Doubly Robust (DR) [Dudík+,11]
    DR uses the regression as a control variate, for the variance reduction purpose.
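    The DR gradient estimator is shown as an equation on the slide; in its standard form it reads roughly as follows (a reconstruction), with \hat{q} the regressed reward used as the control variate:

        \widehat{\nabla_\theta V}_{\mathrm{DR}}(\pi_\theta)
          = \frac{1}{n}\sum_{i=1}^{n}\Big[
              \frac{\pi_\theta(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\,\big(r_i - \hat{q}(x_i,a_i)\big)\, \nabla_\theta \log \pi_\theta(a_i \mid x_i)
              + \mathbb{E}_{\pi_\theta(a \mid x_i)}\big[\, \hat{q}(x_i,a)\, \nabla_\theta \log \pi_\theta(a \mid x_i) \,\big]
            \Big].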
  43. A baseline approach: POTEC [Saito+,24]
    POTEC considers two-stage policies to leverage clusters among prompts (𝑐: cluster): IS with a regression-based control variate w.r.t. the clustering space (estimating the cluster policy gradient), combined with regression-based greedy selection within clusters.
    Note that POTEC does not leverage the information about the generated sentences.
  44. Is it possible to define a DR-style variant of DSO?
    When defining a DR-style estimator, the baseline term should be as follows. However, estimating the gradient involves another importance sampling, and we cannot reduce the variance by using this formulation.
  45. Is it possible to define a DR-style variant of DSO?
    When defining a DR-style estimator, the baseline term should be as follows. However, estimating the gradient involves another importance sampling, and we cannot reduce the variance by using this formulation.
    It would be interesting to explore how we can efficiently combine the regression and DSO as potential future work!
  46. Reference (1/2)
    [Konda&Tsitsiklis,99] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. NeurIPS, 1999.
    [Swaminathan&Joachims,16] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2016.
    [Dudík+,11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. ICML, 2011.
    [Saito+,24] Yuta Saito, Jihan Yao, and Thorsten Joachims. POTEC: Off-policy learning for large action spaces via two-stage policy decomposition. 2024.
    [Brown+,20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020.
  47. Reference (2/2)
    [Jiang+,23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. 2023.
    [Sanh+,19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019.
    [Harper&Konstan,15] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. TIIS, 2015.