

Richa Rastogi
November 09, 2025

MultiScale Contextual Bandits for Long Term Objectives

The feedback that AI systems (e.g., recommender systems, chatbots) collect from user interactions is a crucial source of training data. While short-term feedback (e.g., clicks, engagement) is widely used for training, there is ample evidence that optimizing short-term feedback does not necessarily achieve the desired long-term objectives. Unfortunately, directly optimizing for long-term objectives is challenging, and we identify the disconnect between the timescales of short-term interventions (e.g., rankings) and long-term feedback (e.g., user retention) as one of the key obstacles. To overcome this disconnect, we introduce the framework of MultiScale Policy Learning, which contextually reconciles the fact that AI systems need to act and optimize feedback at multiple interdependent timescales. Following a PAC-Bayes motivation, we show how the lower timescales, where data is more plentiful, can provide a data-dependent hierarchical prior for faster learning at the higher timescales, where data is scarce. As a result, the policies at all levels effectively optimize for the long-term objective. We instantiate the framework with MultiScale Off-Policy Bandit Learning (MSBL) and demonstrate its effectiveness on three tasks relating to recommender and conversational systems.


Transcript

  1. Motivation • In many interactive AI systems (recommender, conversational systems), there is abundant short-term feedback (e.g., clicks, generated response quality) • Prior work shows that optimizing for short-term feedback does not necessarily achieve the desired long-term objective (e.g., clickbait feeds do not lead to user retention) [Figure: beneficial dialogue outcomes; user retention]
  2. Motivation • A key problem: long-term feedback is at a different timescale than the short-term interventions [Figure: rankings, clicks] • We address it by contextually reconciling this disconnect in timescales
  3. MultiScale Policy Framework • Consider two levels: a micro level that operates at a faster timescale (e.g., clicks, response quality), and a macro level that operates at a slower timescale (e.g., user retention)
  4. MultiScale Policy Framework • Even though VL2(πL1*) < VL2(πL2*), πL1* is typically much better than a random policy from Π • Hard: πL2* ← arg maxπ∈Π VL2(π) • Easy: πL1* ← arg maxπ∈Π VL1(π)
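A minimal sketch of these two optimization problems, using the standard inverse-propensity-scoring (IPS) off-policy value estimate over a finite candidate set; the helper names and the logged-data format are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def ips_value(policy, logs):
    """Standard IPS estimate of V(policy) from logged bandit data.
    `logs` is assumed to be a list of (context, action, reward, logging_prob)
    tuples, and policy(x) returns a probability vector over actions."""
    return float(np.mean([policy(x)[a] / p0 * r for (x, a, r, p0) in logs]))

def best_policy(candidates, logs):
    """pi* <- arg max over a finite candidate set, as on the slide."""
    return max(candidates, key=lambda pi: ips_value(pi, logs))

# With abundant micro-level logs D_L1 the estimate of V_L1 is tight, so the "easy"
# arg max is reliable; with scarce macro-level logs D_L2, searching the full class
# Pi for V_L2 is high-variance, which motivates a much smaller macro policy space.
```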
  5. MultiScale Policy Framework • Can we exploit feedback at the micro level to learn the long-term optimal policy?
  6. MultiScale Policies • Factorization of policies: Π ≜ ΠL1 ⋅ ΠL2 • ΠL1: learns a large part of the parameter space; provides an inductive bias for the long-term optimal policy • ΠL2: a small policy space; simplifies learning at the macro level
  7. Policy Learning at micro level • Using micro-level data DL1, learn a family of promising micro policies Π̂L1 = {π̂L1^aL2 : aL2 ∈ 𝒜L2} via policy or feedback modification • The macro action space is identified with this family, 𝒜L2 ≅ Π̂L1: a macro action aL2 indexes (selects) the micro policy, and micro actions are drawn as aL1 ∼ π̂L1^aL2 [Figure: (b) learning a family of promising policies from micro-level data DL1; (c) learning a macro policy πL2 from macro-level data DL2]
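A minimal sketch of this indexing structure, assuming a small discrete macro action space and softmax micro policies (all sizes and the scoring model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N_MICRO_ACTIONS, CTX_DIM = 5, 3   # illustrative sizes

def make_micro_policy(theta):
    """One micro policy per macro action: a softmax over micro-action scores."""
    def pi_l1(x_l1):
        logits = theta @ x_l1
        p = np.exp(logits - logits.max())
        return p / p.sum()
    return pi_l1

# Learned family of micro policies, indexed by the macro action: A_L2 ≅ Π̂_L1.
micro_family = {a_l2: make_micro_policy(rng.normal(size=(N_MICRO_ACTIONS, CTX_DIM)))
                for a_l2 in range(4)}

# A macro action selects a micro policy, which then draws the micro action a_L1.
x_l1 = rng.normal(size=CTX_DIM)
a_l2 = 2                                        # e.g., sampled from the macro policy
a_l1 = rng.choice(N_MICRO_ACTIONS, p=micro_family[a_l2](x_l1))
```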
  8. Policy Learning at micro level (example) • Feedback modification: the micro-level reward is a vector over feedback signals, rL1 ≜ [Clicks, Likes], and each macro action is a preference weighting over it, e.g., [0.8, 0.2] or [0.4, 0.6] • Training one micro policy per weighting yields the family 𝒜L2 ≅ Π̂L1 = {π̂L1^[0.8, 0.2], π̂L1^[0.4, 0.6], …}
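A sketch of this feedback-modification idea under simplifying assumptions: one micro policy is fit per preference weighting by optimizing the scalarized reward w · rL1 (the crude bucketed trainer and the log format are stand-ins, not the paper's learning algorithm):

```python
import numpy as np

WEIGHTINGS = [(0.8, 0.2), (0.4, 0.6)]      # preference weightings over [Clicks, Likes]

def train_micro_policy(micro_logs, w):
    """Stand-in trainer: per (bucketed) context, keep the logged action with the
    best scalarized reward w · r_L1.  micro_logs: list of (context, action, r_vec)."""
    best = {}
    for x, a, r_vec in micro_logs:
        key = tuple(np.round(x, 1))         # crude context bucketing
        score = float(np.dot(w, r_vec))
        if key not in best or score > best[key][1]:
            best[key] = (a, score)
    return lambda x: best.get(tuple(np.round(x, 1)), (0, 0.0))[0]

# Family of micro policies, one per weighting: Π̂_L1 = {π̂_L1^w : w ∈ A_L2}.
rng = np.random.default_rng(1)
micro_logs = [(rng.normal(size=2), int(rng.integers(3)), rng.random(2))
              for _ in range(200)]
micro_family = {w: train_micro_policy(micro_logs, w) for w in WEIGHTINGS}
```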
  9. Policy Learning at macro level • Learn a macro policy π̂L2 ∈ ΠL2 from macro-level data DL2, with the macro action space 𝒜L2 ≅ Π̂L1 • Given a macro context xL2, π̂L2(⋅|xL2) selects one of the learned micro policies, e.g., π̂L1^[0.8, 0.2] for one context and π̂L1^[0.4, 0.6] for another
  10. MultiScale Contextual Bandits • Training is bottom-up: learn the family Π̂L1 from micro-level data DL1 (via policy or feedback modification), then learn π̂L2 from macro-level data DL2 over the action space 𝒜L2 ≅ Π̂L1 • Inference is top-down: sample aL2 ∼ π̂L2, let the macro action index the micro policy, then sample aL1 ∼ π̂L1^aL2
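A compact, self-contained sketch of this bottom-up training / top-down inference flow; the micro policies are stubs, the macro learner is a simple context-free IPS arg-max over logged macro data, and the log formats are assumptions rather than the actual MSBL update:

```python
import numpy as np

rng = np.random.default_rng(2)
MACRO_ACTIONS = [(0.8, 0.2), (0.4, 0.6)]     # A_L2, e.g., preference weightings

# --- Bottom-up training ---
# 1) One micro policy per macro action, learned from micro-level data D_L1
#    (trivial stubs here; see the feedback-modification sketch above).
micro_family = {a: (lambda x_l1: int(x_l1.sum() > 0)) for a in MACRO_ACTIONS}

# 2) Macro policy learned from macro-level logs D_L2 = (x_L2, a_L2, r_L2, logging_prob).
macro_logs = [(rng.normal(size=3), MACRO_ACTIONS[int(rng.integers(2))], rng.random(), 0.5)
              for _ in range(50)]

def ips_value(a):
    """IPS estimate of always playing macro action `a`, under the logging propensities."""
    return float(np.mean([(a_i == a) * r / p0 for (_, a_i, r, p0) in macro_logs]))

def macro_policy(x_l2):
    """Context-free stand-in: pick the macro action with the highest estimated value."""
    return max(MACRO_ACTIONS, key=ips_value)

# --- Top-down inference ---
x_l2, x_l1 = rng.normal(size=3), rng.normal(size=3)
a_l2 = macro_policy(x_l2)          # macro action indexes the micro policy
a_l1 = micro_family[a_l2](x_l1)    # micro action, e.g., a ranking or a response choice
```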
  11. Experiments: Multi-turn Conversation • The micro-level policy is an LLM policy π̂L1^aL2 that responds to user prompts over 5 turns • The macro action aL2 ∼ πL2 is a preference weighting over {Child, Expert} • An LLM evaluator provides the micro-level reward rL1 and the macro-level reward rL2 [Figure: User ↔ LLM over 5 turns of prompt/response, scored by an LLM evaluator]
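A schematic of this evaluation loop with every LLM and evaluator call mocked out; the function names, prompts, and the aggregation used for rL2 are placeholders, not the actual experimental pipeline:

```python
import random

random.seed(0)
STYLES = ("Child", "Expert")

def llm_respond(prompt, weighting):
    """Mock micro-level LLM policy π̂_L1^{a_L2}: answers under a preference weighting."""
    style = random.choices(STYLES, weights=weighting)[0]
    return f"[{style}-style answer to: {prompt}]"

def llm_evaluate(prompt, response):
    """Mock LLM evaluator returning a per-turn micro reward r_L1 in [0, 1]."""
    return random.random()

def run_conversation(macro_policy, n_turns=5):
    a_l2 = macro_policy()                    # a_L2 ~ π_L2: weighting over {Child, Expert}
    per_turn_r_l1 = []
    for t in range(n_turns):
        prompt = f"user prompt {t}"          # stand-in for the simulated user
        response = llm_respond(prompt, a_l2)
        per_turn_r_l1.append(llm_evaluate(prompt, response))
    r_l2 = sum(per_turn_r_l1) / n_turns      # placeholder; r_L2 is a separate
                                             # conversation-level signal in the experiments
    return a_l2, per_turn_r_l1, r_l2

a_l2, r_l1, r_l2 = run_conversation(lambda: (0.8, 0.2))
```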
  12. Summary • Introduce a principled framework to optimize for long-term objectives • Motivated by using plentiful short-term data for faster learning with scarcer long-term feedback • We discuss two ways, policy modification and feedback modification, to learn a family of policies at the micro level • Propose a practical bandit algorithm for recursively learning policies at multiple interdependent levels • Check out the paper for more results and analysis: the PAC-Bayesian motivation, updating micro and macro policies after deployment when new data is available, scaling the action space, and more!