[ICML'26] Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Credit-assigned Policy Gradient for Early-Stage Retrieval in Two-Stage Ranking
Haruka Kiyohara
Internship work at Meta, 2025 (Central Applied Science, M+TH team)
Joint work with: (Meta) Nitzan Razin, Mihaela Curmei, Israel Nir, Shankar Kalyanaraman, Arieal Evnine, Roxana Pop, Udi Weinsberg; (Cornell) Thorsten Joachims, Sarah Dean
ICML 2026, Seoul, Korea
July 2026 Credit-assigned policy gradient in two stage ranking @ ICML

Two-stage decision-making systems
In large-scale recommender systems, we often employ a two-stage selection.
Figure labels: candidate retrieval (i.e., fast screening). Examples: large-scale RecSys, Retrieval-Augmented Generation (RAG).

Two-stage decision-making systems
How can we improve the overall decision process by appropriately assigning credit to the early-stage retrieval?
Figure labels (large-scale RecSys): less research / more research. How does the early stage contribute to the outcome?

Data generation process
Context → candidate set → ranking → reward, where the (two-stage) policy is defined as: [equation not captured in transcript]
Figure labels: optimize (the early-stage retrieval, ESR) / fixed (the ranking stage, LSR).
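A minimal sketch of this data generation process for a single fixed context. It assumes a Plackett-Luce ESR over hypothetical item scores and a deterministic LSR that keeps the best-scoring candidates. The score tables, the helper names (`sample_candidates`, `rank_candidates`, `simulate`), and the Bernoulli click reward are illustrative assumptions, not taken from the slides.

```python
import math
import random

def sample_candidates(esr_scores, k, rng):
    """ESR (assumed Plackett-Luce): draw K distinct items, re-applying a softmax to the remaining items."""
    remaining = list(range(len(esr_scores)))
    chosen = []
    for _ in range(k):
        weights = [math.exp(esr_scores[a]) for a in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

def rank_candidates(candidates, lsr_scores, l):
    """LSR (assumed fixed and deterministic): keep the L candidates with the highest LSR score."""
    return sorted(candidates, key=lambda a: lsr_scores[a], reverse=True)[:l]

def simulate(esr_scores, lsr_scores, click_prob, k, l, rng):
    """One pass: candidate set -> ranking -> per-position reward."""
    candidates = sample_candidates(esr_scores, k, rng)
    ranking = rank_candidates(candidates, lsr_scores, l)
    rewards = [1.0 if rng.random() < click_prob[a] else 0.0 for a in ranking]
    return candidates, ranking, rewards

rng = random.Random(0)
n_items = 8
esr_scores = [0.0] * n_items                      # hypothetical: uniform ESR
lsr_scores = [float(a) for a in range(n_items)]   # hypothetical: LSR prefers larger ids
click_prob = [a / n_items for a in range(n_items)]
candidates, ranking, rewards = simulate(esr_scores, lsr_scores, click_prob, k=4, l=2, rng=rng)
print(candidates, ranking, rewards)
```

Only the ESR scores would be updated by a policy gradient in this setup; the LSR stays fixed.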

How do we calculate the “vanilla” policy gradient (V-PG)?
Our objective is to maximize the policy value (expected reward), defined with a ranking-wise reward and position weights α.

Reward-independence assumption: the reward at each position does not depend on the actions at other positions.
Under this assumption, an arbitrary gradient also decomposes into a linear sum of position-wise reward gradients.
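Written out, the objective described here might look as follows. This is a reconstruction from the slide annotations, since the original equation is not in the transcript; the symbols $V(\pi)$, $p(x)$, and $r(x, a_l)$ (expected reward of action $a_l$ shown at position $l$) are assumed notation.

```latex
\begin{aligned}
V(\pi)
  &= \mathbb{E}_{p(x)\,\pi(a_{1:L} \mid x)}\!\left[ \sum_{l=1}^{L} \alpha_l \, r(x, a_l) \right] \\
  &= \sum_{l=1}^{L} \alpha_l \, \mathbb{E}_{p(x)\,\pi(a_l \mid x)}\!\left[ r(x, a_l) \right]
\end{aligned}
```

The second line relies on the reward-independence assumption: each position's reward depends only on the action at that position, so only the marginal $\pi(a_l \mid x)$ matters.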

How do we calculate the “vanilla” policy gradient (V-PG)?
Now, consider the gradient of the joint policy (ESR + LSR), i.e., of the (marginal) probability of having action a_l at position l.

Rewriting this probability using the two stage policies propagates the gradient to the ESR's candidate-selection probability [Ma+,20].
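One way to write this step down is sketched below. The slide's equations are not in the transcript, so this is a reconstruction that assumes the LSR is fixed and only the ESR parameters $\theta$ are optimized; the superscripts and $\theta$ are assumed notation.

```latex
\begin{aligned}
\pi_\theta(a_l \mid x)
  &= \sum_{A_K} \pi^{\mathrm{ESR}}_\theta(A_K \mid x)\, \pi^{\mathrm{LSR}}(a_l \mid x, A_K), \\
\nabla_\theta V(\pi_\theta)
  &= \mathbb{E}_{p(x)\,\pi^{\mathrm{ESR}}_\theta(A_K \mid x)\,\pi^{\mathrm{LSR}}(a_{1:L} \mid x, A_K)}
     \!\left[ \sum_{l=1}^{L} \alpha_l \, r(x, a_l) \,
     \nabla_\theta \log \pi^{\mathrm{ESR}}_\theta(A_K \mid x) \right]
\end{aligned}
```

Every reward in the ranking multiplies the score of the whole candidate set $A_K$, which is the property the following slides identify as the source of the variance and credit-assignment issues.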

Looks like a promising approach, but...
V-PG does not scale when the candidate set size (K) becomes large, due to
• Variance issue
• The action space scales up exponentially, ≈ O(|A|^K).
• In contrast, we can only sample results with a single candidate set.

Figure: number of combinations vs. candidate set size (K).
→ Even when |A| = 10, the total number of combinations (|A|^K) becomes exponentially large! (In practice, we may have |A| = 1,000,000 or even more!)
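To make the growth concrete, the short sketch below evaluates the slide's rough count |A|^K next to the number of unordered candidate sets of K distinct actions. The helper names are hypothetical, and the binomial count is shown only for comparison; it is not a quantity from the slides.

```python
import math

def num_ordered_tuples(n_actions, k):
    """|A|^K: ordered K-tuples when repeats are allowed (the slide's rough count)."""
    return n_actions ** k

def num_candidate_sets(n_actions, k):
    """C(|A|, K): unordered candidate sets of K distinct actions."""
    return math.comb(n_actions, k)

for k in (1, 2, 5):
    print(k, num_ordered_tuples(10, k), num_candidate_sets(10, k))
# prints:
# 1 10 10
# 2 100 45
# 5 100000 252
```

Either way of counting grows far faster than the single candidate set that is sampled per interaction.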

V-PG does not scale when the candidate set size (K) becomes large, due to
• Credit-assignment issue (inefficient policy gradient)
• We cannot weight the importance of the individual items in the candidate set.
• The gradient propagates to irrelevant actions!
※ Figure: gradient propagation workflow

What is the “ideal” credit assignment?
Think about the gradient when it is estimated with a single Monte Carlo sample.
(Ideal) Propagate the gradient only to the corresponding action! Efficient policy gradient = variance reduction?

How can we achieve the “ideal” credit assignment?
We propose the following credit-assigned policy gradient (CA-PG): [equation not captured in transcript]

where the annotated terms are
• the (marginal) probability that action a_l is included in the top-K (in some candidate set), and
• the (marginal) probability of selecting action a_l given that a_l is in (some) top-K.
Here S_K(a_l) is the set of candidate sets A_K that contain action a_l, and its probability is the sum of the ESR probabilities of those candidate sets.
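The inclusion probability described here, π(S_K(a)|x), can be checked by brute force on a tiny action space. The sketch below assumes a Plackett-Luce ESR with hypothetical scores and enumerates every ordered candidate set; the function names are illustrative, and the enumeration is only feasible for very small |A| and K.

```python
import itertools
import math

def pl_prob(ordered_set, scores):
    """Plackett-Luce probability of drawing `ordered_set` in exactly that order."""
    remaining_mass = sum(math.exp(s) for s in scores)
    prob = 1.0
    for a in ordered_set:
        w = math.exp(scores[a])
        prob *= w / remaining_mass
        remaining_mass -= w
    return prob

def inclusion_prob(action, scores, k):
    """pi(S_K(a) | x): total ESR probability of all candidate sets that contain `action`."""
    return sum(
        pl_prob(cand, scores)
        for cand in itertools.permutations(range(len(scores)), k)
        if action in cand
    )

scores = [2.0, 1.0, 0.0, -1.0]  # hypothetical ESR scores for one context
probs = [inclusion_prob(a, scores, k=2) for a in range(len(scores))]
print([round(p, 3) for p in probs])
```

The inclusion probabilities sum to K because every candidate set contains exactly K actions, and with K = 1 they reduce to the plain softmax probabilities.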

Theoretical analysis (1/3)
1) What is the relation between the two PGs?
Now, all the factors that depend on A_K are ignored!

By ignoring which candidate set a_l comes from, CA-PG reduces variance.

Theoretical analysis (2/3)
2) What do V-PG and CA-PG optimize for?
Equation annotation: a reward discounted by the LSR's action-choice probability.

CA-PG uses the LSR's choice as a reward signal.

Theoretical analysis (3/3)
3) When can CA-PG learn the accurate alignment of actions?
The alignment of the LSR's action-choice probability matters.

・Any (oracle) epsilon-greedy or softmax-type policy satisfies the alignment condition of the LSR.
・Even when the LSR policy makes mistakes in the action alignment, the following mistakes cause no problem:
  ・Any misalignment among the top-1 to top-K (i.e., top) items.
  ・Any misalignment among the top-(K+1) to |A| (i.e., tail) items.
  ・Any misalignment whose probability ratio is bounded by the reward ratio: [inequality not captured in transcript]

CA-PG works with a reasonably accurate (practical) LSR policy.

Key takeaways from theoretical analysis
• The proposed method (CA-PG) is a partial PG of the vanilla PG.
• CA-PG enables credit assignment within the candidate set by considering the marginal probability that an action is selected in (one of) the top-K.
• CA-PG intentionally ignores which candidate set an action comes from, greatly reducing variance by shrinking the action space from O(|A|^K) to O(|A|).
• CA-PG can learn the accurate alignment with a practical choice of LSR.

A potential drawback of credit-assigned PG
CA-PG requires some computational overhead to compute π(S_K(a)|x). The gradient computation of CA-PG requires O(KL), while that of V-PG is O(K).

Our recommendation: TOP1-PG
As a practical solution, we suggest a simplified alternative called TOP1-PG: use S_1 instead of S_K only for the gradient computation (i.e., we still sample K actions using the ESR). The gradient computation then costs O(L).

TOP1-PG is used with the commonly used Plackett-Luce policy. (We will test its performance in experiments.)

TOP1-PG is equivalent to CA-PG when
• using the sampling-with-replacement (SwR) approximation to compute the probability, and
• using a single model for selecting the top-K actions (i.e., not using a mixture-of-experts).

Pros and cons of PG methods
Summarizing the properties of each PG, we have: [comparison table not captured in transcript; legend: worst / best]
※ SwR: sampling-with-replacement approximation; MoE: mixture-of-experts

In experiments, we test:
• How does the performance of each PG change with the number of candidates (K), the number of outputs (L), and the optimality of the LSR?
• How does the computation time change with the number of candidates (K) and the number of outputs (L)?

Synthetic experiment (1/4)
We first look at the results when the model is (almost) well-specified.
With increased candidate set sizes (K), CA-PG shows faster convergence and better stability!

When the LSR's alignment condition is satisfied, CA-PG works well!

Results when using multiple sub-retrievers to sample the candidate set: CA-PG benefits from combining multiple sub-retrievers (mixture-of-experts; MoE).

Synthetic experiment (4/4)
We also look at the computation time of each method.
Combined with the SwR approximation, the computation time does not increase with K or L. (Figure panel axes: K, L, K, M.)

Real-data experiment on KuaiRec [Gao+,22]
We test with larger candidate set sizes (K) in {50, 100, 200}, where |A| = 1000.
We observed results similar to the synthetic setting: CA-PG-SwR converges faster when K is large.

Takeaways
• We studied how to improve the early-stage retrieval of two-stage decisions.
• The key challenges were the high variance and credit-assignment issues of vanilla PG.
• We proposed credit-assigned PG (CA-PG) and its computationally fast alternative, the SwR version (CA-PG-SwR).
CA-PG-SwR works very well and fast, making practical implementations of PG more feasible!

Appendix

Additional results combined with GRPO [Shao+,24]
CA-PG can be easily combined with other variance reduction methods. As a proof of concept, we combined CA-PG with GRPO:
1. Query m samples per context and action, r_j(x, a), j ∈ [m].
2. Normalize the reward as r'_j = (r_j − mean(r)) / std(r).
3. (Add a constant value to scale the rewards to be positive.)
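The normalization in steps 2 and 3 can be sketched as below. The function name `normalize_rewards`, the `shift` and `eps` parameters, and the use of the population standard deviation are assumptions made for illustration; the slides do not specify these details.

```python
import statistics

def normalize_rewards(rewards, shift=0.0, eps=1e-8):
    """Group-normalize m sampled rewards: (r_j - mean(r)) / std(r), plus an optional constant shift.

    `eps` guards against division by zero when all rewards are equal (an added assumption).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the slides do not say which variant is used
    return [(r - mean) / (std + eps) + shift for r in rewards]

rewards = [0.0, 1.0, 1.0, 0.0]                    # m = 4 hypothetical samples for one (x, a)
normalized = normalize_rewards(rewards)           # roughly [-1, 1, 1, -1]
shifted = normalize_rewards(rewards, shift=2.0)   # step 3: all values positive
print(normalized, shifted)
```

The normalized rewards would then replace the raw rewards inside the CA-PG estimate.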

Plackett-Luce and Sampling-with-Replacement (SwR)
The PL policy selects the candidate set by recursively applying a softmax over the remaining actions. The SwR approximation calculates the probability as if each softmax were applied independently.
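The difference between the two probabilities can be seen in a small sketch. It assumes uniform hypothetical scores over 100 actions and an ordered candidate set of three actions; `pl_prob` and `swr_prob` are illustrative names, not functions from the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pl_prob(ordered_set, scores):
    """Plackett-Luce: the softmax is re-normalized over the actions not yet selected."""
    probs = softmax(scores)
    remaining = 1.0
    out = 1.0
    for a in ordered_set:
        out *= probs[a] / remaining
        remaining -= probs[a]
    return out

def swr_prob(ordered_set, scores):
    """SwR approximation: treat each pick as an independent draw from the same softmax."""
    probs = softmax(scores)
    return math.prod(probs[a] for a in ordered_set)

scores = [0.0] * 100   # hypothetical uniform ESR scores
cand = [3, 17, 42]     # hypothetical ordered candidate set
print(pl_prob(cand, scores), swr_prob(cand, scores))
```

With uniform scores the exact PL value is 1/(100·99·98) and the SwR value is 1/100^3, so the two differ by about 3% in this toy case. The SwR form avoids the sequential re-normalization.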

Gradient Computation of CA-PG (1/2)
The score function (gradient of the log action-choice probability) can be calculated as follows: [equation not captured in transcript]
We approximate this probability with small relative errors (~6%) in the next slides.

Gradient Computation of CA-PG (2/2)
The probability is approximated as follows, replacing the expectation over all possible candidate sets with the most likely candidate set: [equation not captured in transcript]

References
[Ma et al., 2020] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, Ed Chi. Off-policy Learning in Two-stage Recommender Systems. WWW, 2020.
[Gao et al., 2022] Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, Tat-Seng Chua. KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems. CIKM, 2022.
[Shao et al., 2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
