[ICML'26] Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

arXiv: https://arxiv.org/abs/2605.26385
Japanese slides: coming soon

Haruka Kiyohara

June 23, 2026

Transcript

  1. Credit-assigned Policy Gradient for Early-Stage Retrieval in Two-Stage Ranking

    Haruka Kiyohara. Internship work at Meta 2025 (Central Applied Science, M+TH team). Joint work with: (Meta) Nitzan Razin, Mihaela Curmei, Israel Nir, Shankar Kalyanaraman, Arieal Evnine, Roxana Pop, Udi Weinsberg; (Cornell) Thorsten Joachims, Sarah Dean. ICML 2026, Seoul, Korea. July 2026. Credit-assigned policy gradient in two stage ranking @ ICML
  2. Two-stage decision-making systems

    In large-scale recommender systems, we often employ a two-stage selection, starting with candidate retrieval (i.e., fast screening). Examples: large-scale RecSys and Retrieval-Augmented Generation (RAG).
  3. Two-stage decision-making systems

    In large-scale recommender systems, we often employ a two-stage selection. How can we improve the overall decision process by appropriately assigning credit to the early-stage retrieval? How does the early stage contribute to the outcome? [Figure: large-scale RecSys pipeline, with stages annotated “less research” and “more research”]
  4. Data generation process

    [Figure: context, candidate set, ranking, reward; stages annotated “optimize” and “fixed”] where the (two-stage) policy is defined as:
  5. How do we calculate the “vanilla” policy gradient (V-PG)?

    Our objective is to maximize the policy value (expected reward), defined with a ranking-wise reward with position weight 𝛼.
  6. How do we calculate the “vanilla” policy gradient (V-PG)?

    Our objective is to maximize the policy value (expected reward). We use the reward-independence assumption, i.e., the reward at one position does not depend on the actions at other positions.
  7. How do we calculate the “vanilla” policy gradient (V-PG)?

    Our objective is to maximize the policy value (expected reward). We use the reward-independence assumption, i.e., the reward at one position does not depend on the actions at other positions. An arbitrary gradient then also decomposes into a linear sum of position-wise reward gradients.
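The objective and its decomposition can be sketched as follows. The notation is an assumption reconstructed from the slide text, since the original equations are not part of the transcript: 𝑥 is the context, 𝑎_{1:𝐿} the ranking of length 𝐿, 𝛼_𝑙 the position weight, 𝑟(𝑥, 𝑎) the expected reward of action 𝑎, and 𝜋_𝜃(𝑎_𝑙 = 𝑎 | 𝑥) the marginal probability of placing 𝑎 at position 𝑙.

```latex
% Policy value with position-weighted, ranking-wise reward (assumed notation)
V(\pi_\theta)
  = \mathbb{E}_{x,\; a_{1:L} \sim \pi_\theta(\cdot \mid x)}
    \Big[ \sum_{l=1}^{L} \alpha_l \, r(x, a_l) \Big]
  = \mathbb{E}_{x} \Big[ \sum_{l=1}^{L} \alpha_l
    \sum_{a \in \mathcal{A}} \pi_\theta(a_l = a \mid x) \, r(x, a) \Big]

% Under reward independence, the gradient is a linear sum over positions
\nabla_\theta V(\pi_\theta)
  = \sum_{l=1}^{L} \alpha_l \, \mathbb{E}_{x} \Big[
    \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a_l = a \mid x) \, r(x, a) \Big]
```

The second equality in the first line relies on the reward at position 𝑙 depending only on 𝑎_𝑙, which is what the reward-independence assumption states.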
  8. How do we calculate the “vanilla” policy gradient (V-PG)?

    Now, consider the gradient of the joint policy (ESR + LSR), written with the (marginal) probability of having action 𝑎_𝑙 at position 𝑙.
  9. How do we calculate the “vanilla” policy gradient (V-PG)?

    Now, consider the gradient of the joint policy (ESR + LSR). We rewrite the probability using the two stage policies.
  10. How do we calculate the “vanilla” policy gradient (V-PG)?

    Now, consider the gradient of the joint policy (ESR + LSR). We rewrite the probability using the two stage policies, then propagate the gradient to the ESR’s candidate selection probability [Ma+,20].
  11. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Variance issue
    • The action space scales up exponentially, ≈ 𝑂(|𝐴|^𝐾).
    • In contrast, we can only sample results with a single candidate set.
  12. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Variance issue
    • The action space scales up exponentially, ≈ 𝑂(|𝐴|^𝐾).
    • In contrast, we can only sample results with a single candidate set.
    [Plot: # of combinations vs. candidate set size (𝐾)] → Even when |𝐴| = 10, the total number of combinations (|𝐴|^𝐾) becomes exponentially large! (In practice, we may have |𝐴| = 1,000,000 or even more!)
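As a rough illustration of this growth, the sketch below counts candidate sets for a small action space. The function name is hypothetical, and whether a candidate set is treated as an ordered tuple or an unordered subset is an assumption; the slide only states the ≈ 𝑂(|𝐴|^𝐾) scaling.

```python
import math


def count_candidate_sets(num_actions: int, k: int) -> dict:
    """Count size-k candidate sets drawn from num_actions actions."""
    return {
        # |A|^K: ordered picks with repetition allowed (the bound quoted on the slide)
        "power": num_actions ** k,
        # |A|! / (|A| - K)!: ordered picks without repetition
        "ordered": math.perm(num_actions, k),
        # C(|A|, K): unordered subsets without repetition
        "unordered": math.comb(num_actions, k),
    }


if __name__ == "__main__":
    # Even with only 10 actions, the counts grow quickly with K.
    for k in (1, 2, 5, 10):
        print(k, count_candidate_sets(10, k))
```

With a single sampled candidate set per context, only one of these combinations is ever observed, which is the source of the variance issue.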
  13. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
  14. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
    ※ [Figure: gradient propagation workflow]
  15. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
    ※ [Figure: gradient propagation workflow]
  16. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
    ※ [Figure: gradient propagation workflow]
  17. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
    ※ [Figure: gradient propagation workflow]
  18. Looks like a promising approach, but...

    V-PG does not scale when the candidate set size (𝐾) becomes large, due to:
    • Credit-assignment issue (inefficient policy gradient)
    • We cannot weight the importance of items in the candidate set.
    ※ [Figure: gradient propagation workflow] The gradient is propagated to irrelevant actions!
  19. What is the “ideal” credit assignment?

    Think about the gradient when estimating with a single Monte Carlo sample. (Ideal?) Propagate the gradient to the corresponding action! Efficient policy gradient = variance reduction?
  20. How can we achieve the “ideal” credit assignment?

    We propose the following credit-assigned policy gradient (CA-PG):
  21. How can we achieve the “ideal” credit assignment?

    We propose the following credit-assigned policy gradient (CA-PG), where the two factors are:
    • the (marginal) probability of having action 𝑎_𝑙 included in the top-𝐾 (in some candidate set), and
    • the (marginal) probability of selecting action 𝑎_𝑙 given that 𝑎_𝑙 is in (some) top-𝐾.
  22. How can we achieve the “ideal” credit assignment?

    We propose the following credit-assigned policy gradient (CA-PG), where the two factors are:
    • the (marginal) probability of having action 𝑎_𝑙 included in the top-𝐾 (in some candidate set), and
    • the (marginal) probability of selecting action 𝑎_𝑙 given that 𝑎_𝑙 is in (some) top-𝐾.
    The first factor is defined over the set of candidate sets 𝐴_𝐾 that contain action 𝑎_𝑙, and is the sum of the ESR probabilities of those candidate sets.
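The "sum of the ESR probabilities over candidate sets containing the action" can be made concrete with a brute-force sketch. Everything here is illustrative: the function names are hypothetical, and the toy ESR (uniform over all size-𝐾 subsets) is an assumption chosen only so the result is easy to check by hand. Enumeration like this is feasible only for tiny action spaces.

```python
from itertools import combinations


def inclusion_prob(candidate_probs: dict, action: int) -> float:
    """Sum the ESR probabilities of all candidate sets that contain `action`."""
    return sum(p for cand, p in candidate_probs.items() if action in cand)


def uniform_esr(num_actions: int, k: int) -> dict:
    """Toy ESR (illustrative assumption): uniform over all size-k subsets."""
    subsets = [frozenset(c) for c in combinations(range(num_actions), k)]
    return {s: 1.0 / len(subsets) for s in subsets}


if __name__ == "__main__":
    esr = uniform_esr(num_actions=4, k=2)
    # Marginal probability that action 0 appears in the sampled candidate set.
    print(inclusion_prob(esr, action=0))
```

Summing the inclusion probabilities over all actions gives 𝐾, because every candidate set contributes its probability once for each of its 𝐾 members.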
  23. Theoretical analysis (2/3)

    2) What is the relation between the two PGs? Now, all the factors that depend on 𝑨_𝑲 are ignored!
  24. Theoretical analysis (2/3)

    2) What is the relation between the two PGs? By ignoring which candidate set 𝑎_𝑙 comes from, CA-PG reduces variance.
  25. Theoretical analysis (1/3)

    1) What do V-PG and CA-PG optimize for? The reward discounted by the LSR’s action choice probability.
  26. Theoretical analysis (1/3)

    1) What do V-PG and CA-PG optimize for? CA-PG uses the LSR’s choice as a reward signal.
  27. Theoretical analysis (3/3)

    3) When can CA-PG learn the accurate alignment of actions? The alignment of the LSR’s action choice probability matters.
  28. Theoretical analysis (3/3)

    3) When can CA-PG learn the accurate alignment of actions? The alignment of the LSR’s action choice probability matters.
    ・Any (oracle) epsilon-greedy and softmax-type policies satisfy the alignment condition of the LSR.
    ・Even when the LSR policy makes mistakes in the action alignment, the following mistakes cause no problem:
      ・Any misalignment among the top-1 to top-𝐾 (i.e., top) items.
      ・Any misalignment among the top-(𝐾+1) to |𝐴| (i.e., tail) items.
      ・Any misalignment whose probability ratio is bounded by the reward ratio:
  29. Theoretical analysis (3/3)

    3) When can CA-PG learn the accurate alignment of actions? CA-PG works with a reasonably accurate (practical) LSR policy.
    ・Any (oracle) epsilon-greedy and softmax-type policies satisfy the alignment condition of the LSR.
    ・Even when the LSR policy makes mistakes in the action alignment, the following mistakes cause no problem:
      ・Any misalignment among the top-1 to top-𝐾 (i.e., top) items.
      ・Any misalignment among the top-(𝐾+1) to |𝐴| (i.e., tail) items.
      ・Any misalignment whose probability ratio is bounded by the reward ratio:
  30. Key takeaways from theoretical analysis

    • The proposed method (CA-PG) is a partial PG of the vanilla PG.
    • CA-PG enables credit assignment within the candidate set by considering the marginal probability that an action is selected in (one of) the top-𝐾.
    • CA-PG intentionally ignores which candidate set an action comes from, greatly reducing variance by shrinking the action space from 𝑂(|𝐴|^𝐾) to 𝑂(|𝐴|).
    • CA-PG can learn the accurate alignment with a practical choice of LSR.
  31. A potential drawback of credit-assigned PG

    CA-PG requires some computational overhead to compute 𝜋(𝑆_𝐾(𝑎)|𝑥). The gradient computation of CA-PG requires 𝑂(𝐾𝐿), while that of V-PG is 𝑂(𝐾).
  32. Our recommendation: TOP1-PG

    As a practical solution, we suggest a simplified alternative called TOP1-PG: use 𝑆_1 instead of 𝑆_𝐾 only for the gradient computation (i.e., we still sample 𝐾 actions using the ESR). Cost: 𝑂(𝐿).
  33. Our recommendation: TOP1-PG

    As a practical solution, we suggest a simplified alternative called TOP1-PG, with the commonly used Plackett-Luce policy: use 𝑆_1 instead of 𝑆_𝐾 only for the gradient computation (i.e., we still sample 𝐾 actions using the ESR). Cost: 𝑂(𝐿). (We will test the performance in experiments.)
  34. Our recommendation: TOP1-PG

    As a practical solution, we suggest a simplified alternative called TOP1-PG, with the commonly used Plackett-Luce policy: use 𝑆_1 instead of 𝑆_𝐾 only for the gradient computation (i.e., we still sample 𝐾 actions using the ESR). Cost: 𝑂(𝐿). TOP1-PG is equivalent to CA-PG when:
    • Using the sampling-with-replacement (SwR) approximation to compute the probability
    • Using a single model for selecting the top-𝐾 actions (i.e., not using mixture-of-experts)
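To see why replacing 𝑆_𝐾 with 𝑆_1 is cheap, assume the ESR is a softmax over per-action logits (an assumption here; the slides only name the Plackett-Luce policy). The probability that an action is the top-1 pick is then its softmax probability, so the score function reduces to the familiar log-softmax gradient. The sketch below covers only this score term, not the full TOP1-PG estimator, and `top1_score` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_score(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits.

    Equals one_hot(action) - softmax(logits).
    """
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


if __name__ == "__main__":
    # Pushes up the logit of the chosen action and pushes down the others.
    print(top1_score([0.0, 0.0, 0.0], action=0))
```

No enumeration over candidate sets is needed for this term, which is consistent with the lower cost stated on the slide.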
  35. Pros and cons of PG methods

    Summarizing the properties of each PG, we have: [Table: properties of each PG method, color-coded from worst to best] ※ SwR: sampling-with-replacement approximation; MoE: mixture-of-experts
  36. Pros and cons of PG methods

    Summarizing the properties of each PG, we have: [Table: properties of each PG method, color-coded from worst to best] In experiments, we test:
    • How does the performance of each PG change with varying # of candidates (𝐾), # of outputs (𝐿), and optimality of the LSR?
    • How does the computational time change with varying # of candidates (𝐾) and # of outputs (𝐿)?
  37. Synthetic experiment (1/4)

    We first look at the results when the model is (almost) well-specified. With increased candidate set sizes (𝑲), CA-PG shows faster convergence and better stability!
  38. Synthetic experiment (2/4)

    We first look at the results when the model is (almost) well-specified. When the LSR’s alignment condition is satisfied, CA-PG works well!
  39. Synthetic experiment (3/4)

    We first look at the results when the model is (almost) well-specified. [Figure: results when using multiple sub-retrievers to sample the candidate set] CA-PG benefits from combining multiple sub-retrievers (mixture of experts; MoE).
  40. Synthetic experiment (4/4)

    We also look at the computational time of each method. Combined with the SwR approximation, the computational time does not increase with 𝑲 or 𝑳. [Plot axes: 𝑲, 𝑳, 𝑲, 𝑴]
  41. Real-data experiment

    We test with larger candidate set sizes (𝐾) in {50, 100, 200}, where |𝐴| = 1000. We observed results similar to the synthetic setting: CA-PG-SwR converges faster when 𝑲 is large.
  42. Takeaways

    • We studied how to improve the early-stage retrieval of two-stage decisions.
    • The key challenges were the high variance and credit-assignment issues of vanilla PG.
    • We proposed credit-assigned PG and its computationally fast alternative, the SwR version.
    CA-PG-SwR works very well and fast, making practical implementation of PG more feasible!
  43. Additional results combined with GRPO [Shao+,24]

    CA-PG can be easily combined with other variance reduction methods. As a proof of concept, we combined CA-PG with GRPO:
    1. Query 𝑚 samples per context and action, 𝑟_𝑗(𝑥, 𝑎), 𝑗 ∈ [𝑚].
    2. Normalize the reward as 𝑟′_𝑗 = (𝑟_𝑗 − mean(𝑟)) / std(𝑟).
    3. (Add a constant value to scale rewards to be positive.)
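The normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of the population standard deviation, the small `eps` guard against division by zero, and the `shift` parameter standing in for the unspecified constant in step 3 are all assumptions.

```python
import statistics


def normalize_rewards(rewards, shift=0.0, eps=1e-8):
    """Group-normalize rewards: (r - mean) / std, then add a constant shift.

    `shift` is a hypothetical stand-in for the constant in step 3, and
    `eps` avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) + shift for r in rewards]


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0]  # m = 3 reward samples for one (context, action)
    print(normalize_rewards(samples))
    print(normalize_rewards(samples, shift=2.0))
```

After step 2 the normalized rewards are centered at zero, so some are negative; the shift in step 3 moves them back to a positive range.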
  44. Plackett-Luce and Sampling-with-Replacement (SwR)

    The PL policy selects the candidate set by recursively applying softmax to the remaining actions. The SwR approximation calculates the probability as if softmax were applied independently at each step.
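The difference between the two probability computations can be sketched in a few lines. The function names are hypothetical and the code is a toy illustration under the description above, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pl_prob(logits, ordering):
    """Plackett-Luce: apply softmax recursively over the remaining actions."""
    remaining = list(range(len(logits)))
    prob = 1.0
    for a in ordering:
        probs = softmax([logits[i] for i in remaining])
        prob *= probs[remaining.index(a)]
        remaining.remove(a)  # the chosen action cannot be picked again
    return prob


def swr_prob(logits, ordering):
    """SwR approximation: treat every pick as an independent softmax draw."""
    probs = softmax(logits)
    return math.prod(probs[a] for a in ordering)


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0, 0.0]
    print(pl_prob(logits, (0, 1)), swr_prob(logits, (0, 1)))
```

The two agree for a single pick. For longer orderings of distinct actions the SwR value is smaller, because it does not renormalize over the remaining actions, but it only needs one softmax over the full action set.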
  45. Gradient Computation of CA-PG (1/2)

    The score function (log action choice probability) can be calculated as follows. We approximate this probability with small relative errors (~6%) in the next slide.
  46. Gradient Computation of CA-PG (2/2)

    The probability is approximated as follows: we replace the expectation over all possible candidate sets with the most likely candidate set.
  47. References

    [Ma et al., 2020] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, Ed Chi. Off-policy Learning in Two-stage Recommender Systems. WWW, 2020.
    [Gao et al., 2022] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, Tat-Seng Chua. KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems. CIKM, 2022.
    [Shao et al., 2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.