Deep Reinforcement Learning for Recommender Systems

wing.nus
December 23, 2021

Recommender systems have become increasingly important in our daily lives because they mitigate the information overload problem, especially in user-oriented online services. Most recommender systems treat the recommendation procedure as a static process and make recommendations following a fixed greedy strategy, which may fail given the dynamic nature of users’ preferences. They are also designed to maximize the immediate reward of recommendations, while completely overlooking their long-term influence on user experience. To learn an adaptive recommendation policy, we consider the recommendation procedure as sequential interactions between users and recommender agents, and leverage Reinforcement Learning (RL) to automatically learn an optimal recommendation strategy (policy) that maximizes cumulative rewards from users without any specific instructions. Recommender systems based on reinforcement learning have two advantages. First, they can continuously update their strategies during the interactions. Second, the optimal strategy is learned by maximizing the expected long-term cumulative reward from users. This talk will introduce the fundamentals and advances of deep reinforcement learning and its applications in recommender systems.
Seminar page: https://wing-nus.github.io/ir-seminar/speaker-xiangyu
YouTube Video recording: https://www.youtube.com/watch?v=spx6Pocc104

Transcript

  1. Applied Machine Learning Lab CityU Deep Reinforcement Learning for Recommender

    Systems Xiangyu Zhao Assistant Professor School of Data Science City University of Hong Kong Dec 22, 2021 @ WING, NUS
  2. Applied Machine Learning Lab CityU Biography Applied Machine Learning Lab

    Homepage: zhaoxyai.github.io Email: [email protected] • Research interests: data mining and machine learning, especially Reinforcement Learning, AutoML, and Multimodal and their applications in Recommender System and Smart City • Published more than 30 papers in top conferences and journals (e.g., KDD, WWW, AAAI, SIGIR, ICDE, CIKM, ICDM) • His research received ICDM’21 Best-ranked Papers, Global Top 100 Chinese New Stars in AI, CCF-Tencent Open Fund, Criteo Research Award, and Bytedance Research Award • Top conference (senior) program committee members and session chairs, and journal reviewers • Organizer of DRL4KDD and DRL4KD at KDD’19, WWW’21 and SIGIR’20/21, and a lead tutor at WWW’21/22 and IJCAI’21 • Founding academic committee members of MLNLP, the largest AI community in China with 800,000 followers • The models and algorithms from his research have been launched in the online system of many companies
  3. Applied Machine Learning Lab CityU Recommender Systems § Intelligent system

    that assists users’ information seeking tasks Music Video Ecommerce News Social Friends Location based
  4. Applied Machine Learning Lab CityU Recommender Systems § Intelligent system

    that assists users’ information seeking tasks § Goal: Suggesting items that best match users’ preferences Music Video Ecommerce News Social Friends Location based
  5. Applied Machine Learning Lab CityU Recommender Systems § Intelligent system

    that assists users’ information seeking tasks § Goal: Suggesting items that best match users’ preferences Music Video Ecommerce News Social Friends Location based Browsing History
  6. Applied Machine Learning Lab CityU Recommender Systems § Intelligent system

    that assists users’ information seeking tasks § Goal: Suggesting items that best match users’ preferences Music Video Ecommerce News Social Friends Location based System User
  7. Applied Machine Learning Lab CityU § Considering recommendation as an

    offline optimization problem § Following a greedy strategy to maximize the immediate rewards from users Existing Recommendation Policies System User
  9. Applied Machine Learning Lab CityU Existing Recommendation Policies § Considering

    recommendation as an offline optimization problem § Following a greedy strategy to maximize the immediate rewards from users § Disadvantages § Overlooking real-time feedback System User
  10. Applied Machine Learning Lab CityU Existing Recommendation Policies § Considering

    recommendation as an offline optimization problem § Following a greedy strategy to maximize the immediate rewards from users § Disadvantages § Overlooking real-time feedback § Overlooking the long-term influence on user experience System User
  11. Applied Machine Learning Lab CityU § Considering recommendation as an

    offline optimization problem § Following a greedy strategy to maximize the immediate rewards from users § Disadvantages § Overlooking real-time feedback § Overlooking the long-term influence on user experience Existing Recommendation Policies System
  12. Applied Machine Learning Lab CityU § RL is a general-purpose

    framework for decision-making § RL is for an agent with the capacity to take actions § Success is measured by a reward from the environment § Each action influences the agent’s future state § Goal: select actions to maximize future reward Reinforcement Learning in a nutshell
  13. Applied Machine Learning Lab CityU § RL is a general-purpose

    framework for decision-making § RL is for an agent with the capacity to take actions § Success is measured by a reward from the environment § Each action influences the agent’s future state § Goal: select actions to maximize future reward Reinforcement Learning in a nutshell state actions
  14. Applied Machine Learning Lab CityU Agent and Environment § At

    each step t the agent: § Receives state st § Receives scalar reward rt § Executes action at § The environment: § Receives action at § Emits state st § Emits scalar reward rt state reward action at rt st
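
The interaction loop on this slide can be sketched in a few lines of Python. This is a toy illustration, not code from the talk: `ToyUserEnvironment`, `run_session` and the reward rule (1 for a liked item, 0 otherwise) are hypothetical names and assumptions.

```python
class ToyUserEnvironment:
    """Hypothetical stand-in for a user; the state is the last item shown."""

    def __init__(self, liked_items):
        self.liked_items = set(liked_items)
        self.state = None

    def reset(self):
        # No history at the start of a session (assumption).
        self.state = None
        return self.state

    def step(self, action):
        """Receive action a_t, emit the next state and a scalar reward r_t."""
        reward = 1.0 if action in self.liked_items else 0.0
        self.state = action
        return self.state, reward


def run_session(env, policy, num_steps):
    """Run the agent-environment loop and return the rewards collected."""
    state = env.reset()
    rewards = []
    for _ in range(num_steps):
        action = policy(state)            # agent executes a_t
        state, reward = env.step(action)  # environment emits state and reward
        rewards.append(reward)
    return rewards
```

Here `policy` is any function from state to action, which matches the deterministic-policy view used later in the deck.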
  15. Applied Machine Learning Lab CityU Examples of Deep RL @DeepMind

    § Play games: Atari, poker, Go, ... § Explore worlds: 3D worlds, Labyrinth, ... § Control physical systems: manipulate, walk, swim, ... § Interact with users: recommend, personalize, optimize, ...
  16. Applied Machine Learning Lab CityU Major Components of an RL

    Agent § An RL agent may include one or more of these components: § Value function (Q-value): prediction of value for each state and action § Policy: maps current state to action § Model: agent’s representation of the environment
  17. Applied Machine Learning Lab CityU Deep Reinforcement Learning § Use

    deep neural networks to represent § Value function (Q-value) § Policy § Model § Optimize loss function by stochastic gradient descent Q-value Table Deep Q-Network
  18. Applied Machine Learning Lab CityU Value Function

    § A value function is a prediction of future reward § “How much reward will I get from action a in state s?” § Q-value function gives expected total reward § from state s and action a § under policy π § with discount factor γ [Figure: a state with three candidate actions whose values are +2, +1 and -1 (value of taking the action)]
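
As a small worked example of "total reward with a discount factor", the sketch below computes the discounted sum of one observed reward sequence. The Q-value is the expectation of this quantity over trajectories that start with action a in state s and then follow π. `discounted_return` is a hypothetical helper name, and using a finite reward list is an assumption.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With rewards `[1.0, 1.0, 1.0]` and `gamma = 0.5` this gives 1 + 0.5 + 0.25 = 1.75.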
  19. Applied Machine Learning Lab CityU Policy § A policy is

    the agent’s behavior § It is a map from state to action: • Deterministic policy: a = π(s) • Stochastic policy: π (a|s) = P [a|s] 0.7 0.2 0.1 Probability of taking the action
  20. Applied Machine Learning Lab CityU Model § Model is learnt

    from experience (interactions) § Model acts as proxy for environment § Planner interacts with model § e.g. using lookahead search observation reward action at rt ot
  21. Applied Machine Learning Lab CityU Approaches To Reinforcement Learning §

    Policy-based RL § Search directly for the optimal policy π* § This is the policy achieving maximum future reward § Value-based RL § Estimate the optimal value function Q*(s,a) § This is the maximum value achievable under any policy § Model-based RL § Build a transition model of the environment § Plan (e.g. by lookahead) using model
  22. Applied Machine Learning Lab CityU § Continuously updating the recommendation

    strategies during the interactions Reinforcement Learning for Recommendations
  23. Applied Machine Learning Lab CityU § Continuously updating the recommendation

    strategies during the interactions § Maximizing the long-term reward from users Reinforcement Learning for Recommendation Policies Recommendation Session t0 t1 t2 t3
  24. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  26. Applied Machine Learning Lab CityU User-System Interactions § The system

    recommends a page of items to a user § The user provides real-time feedback and the system updates its policy § The system recommends a new page of items
  27. Applied Machine Learning Lab CityU Challenges § Updating strategy according

    to user’s real-time feedback § Diverse and complementary recommendations
  28. Applied Machine Learning Lab CityU Challenges § Updating strategy according

    to user’s real-time feedback § Diverse and complementary recommendations § Displaying items in a 2-D page
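  29. Applied Machine Learning Lab CityU Actor-Critic

    § Value-based (DQN): $Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\big]$ § Fixed item space; the max → enumerating all possible items § Actor-Critic: $Q(s, a) = \mathbb{E}_{s'}\big[r + \gamma Q(s', a') \mid s, a\big]$ [Figure: (a) network takes state s and outputs Q(s, a1), Q(s, a2), ... (b) network takes state s and action ai and outputs Q(s, ai) (c) Actor maps state s to action a; Critic maps state s and action a to Q(s, a)]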
  30. Applied Machine Learning Lab CityU Actor Design § Goal: Generating

    a page of recommendations according to user’s browsing history h1 h2 Actor state s action a
  31. Applied Machine Learning Lab CityU Actor Architecture

    § Goal: Generating a page of items according to user’s browsing history [Figure: Actor architecture. Encoder: item embeddings (ei, ci, fi) of the prior pages X1, ..., XM → CNN layer → hidden states h1, ..., hT → attention weights α1, ..., αT → user’s preference s. Decoder: DeCNN maps s to a page of items e1, ..., eM]
  32. Applied Machine Learning Lab CityU Embedding Layer

    § Three types of information § $e_i$: item’s identifier § $c_i$: item’s category § $f_i$: user’s feedback § Item embedding: $X_i = \mathrm{concat}(E_i, C_i, F_i) = \tanh\big(\mathrm{concat}(W_E e_i + b_E,\ W_C c_i + b_C,\ W_F f_i + b_F)\big)$, where $E_i$ is the identifier embedding, $C_i$ the category embedding and $F_i$ the feedback embedding [Figure: encoder diagram with the embedding inputs highlighted]
  33. Applied Machine Learning Lab CityU Page-wise CNN Layer

    [Figure: encoder diagram with the CNN layer highlighted]
  34. Applied Machine Learning Lab CityU RNN & Attention Layer

    § GRU: $z_t = \sigma(W_z E_t + U_z h_{t-1})$, $r_t = \sigma(W_r E_t + U_r h_{t-1})$, $\hat{h}_t = \tanh[W E_t + U(r_t \cdot h_{t-1})]$, $h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t$ § Attention: $s = \sum_{t=1}^{T} \alpha_t h_t$ where $\alpha_t = \frac{\exp(W_\alpha h_t + b_\alpha)}{\sum_j \exp(W_\alpha h_j + b_\alpha)}$ § s: user preference [Figure: GRU over Page 1, Page 2, ..., Page T followed by the attention layer]
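
The attention step on this slide (softmax weights over the hidden states, then a weighted sum) can be sketched in plain Python. This is a minimal illustration under assumptions: the hidden states are given as lists of floats, $W_\alpha$ is treated as a single weight vector, and `attention_pool` is a hypothetical name.

```python
import math


def attention_pool(hidden_states, w_alpha, b_alpha):
    """Return (alpha, s) where s = sum_t alpha_t * h_t."""
    # Unnormalized score of each hidden state: W_alpha . h_t + b_alpha
    scores = [sum(w * h for w, h in zip(w_alpha, h_t)) + b_alpha
              for h_t in hidden_states]
    # Softmax over the scores, shifted by the max for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(score - max_score) for score in scores]
    normalizer = sum(exp_scores)
    alpha = [e / normalizer for e in exp_scores]
    # Weighted sum of the hidden states, one dimension at a time.
    dim = len(hidden_states[0])
    pooled = [sum(a * h_t[d] for a, h_t in zip(alpha, hidden_states))
              for d in range(dim)]
    return alpha, pooled
```

With all-zero attention parameters every page gets the same weight, so the pooled vector is the plain average of the hidden states.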
  35. Applied Machine Learning Lab CityU Decoder

    § Goal: Generating a page of items according to user’s preference ✓ Task 1: Generating a set of items ✓ Task 2: Displaying items in a page [Figure: the Actor’s decoder maps the user preference s (vector) to a page of items e1, ..., eM (matrix), the action $a^{cur}_{pro}$]
  36. Applied Machine Learning Lab CityU Decoder

    § Goal: Generating a page of items according to user’s preference § DeCNN: user preference (vector) → a page of items (matrix) § Analogy: a deconvolution neural network recovers an image (matrix) from a representation (vector) [Figure: the Actor’s decoder maps the user preference s to a page of items e1, ..., eM, the action $a^{cur}_{pro}$]
  37. Applied Machine Learning Lab CityU Actor Architecture

    [Figure: Actor architecture. Encoder: item embeddings (ei, ci, fi) of the prior pages X1, ..., XM → CNN layer → hidden states h1, ..., hT → attention weights α1, ..., αT → user’s preference s. Decoder: DeCNN maps s to a page of items e1, ..., eM]
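  38. Applied Machine Learning Lab CityU Critic Architecture

    § Learning action-value function Q(s, a): $Q_{\theta^\mu}(s, a) = \mathbb{E}_{s'}\big[r + \gamma\, Q_{\theta^\mu}(s', a')\big]$ § Left-hand side: evaluation network; right-hand side: target network (fixed) § Next state: $s'$; next action: $a' = f_{\theta^\pi}(s')$ § Short-term reward: $r = \sum_{m=1}^{M} \mathrm{reward}(e_m)$ [Figure: the Critic takes the user preference s (from prior pages h1, h2, ...) and, through a CNN, the page of items e1, ..., eM (action $a^{cur}_{val}$), and outputs Q(s, a); the user returns the reward r]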
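  39. Applied Machine Learning Lab CityU Critic Architecture

    § Learning action-value function Q(s, a): $Q_{\theta^\mu}(s, a) = \mathbb{E}_{s'}\big[r + \gamma\, Q_{\theta^\mu}(s', a')\big]$ § Left-hand side: evaluation network; right-hand side: target network (fixed) § Next state: $s'$; next action: $a' = f_{\theta^\pi}(s')$ § Short-term reward: $r = \sum_{m=1}^{M} \mathrm{reward}(e_m)$ § DeepPage § user’s real-time feedback § long-term reward § putting items in a page [Figure: the Critic takes the user preference s (from prior pages h1, h2, ...) and, through a CNN, the page of items e1, ..., eM (action $a^{cur}_{val}$), and outputs Q(s, a); the user returns the reward r]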
  40. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  41. Applied Machine Learning Lab CityU Why Negative Feedback? § What

    users may not like § Positive: click or purchase § Negative: skip or leave § Advantage: § Avoiding bad recommendation cases
  42. Applied Machine Learning Lab CityU Why Negative Feedback? § What

    users may not like § Positive: click or purchase § Negative: skip or leave § Advantage: § Avoiding bad recommendation cases § Challenges § Negative feedback could bury the positive ones § May not be caused by users disliking them § Weak/wrong negative feedback can introduce noise
  43. Applied Machine Learning Lab CityU Novel DQN Architecture § Intuition:

    § recommend an item that is similar to the clicked/ordered items (left part) § while dissimilar to the skipped items (right part) § RNN with Gated Recurrent Units (GRU) to capture users’ sequential preference Recently clicked or ordered items Recently skipped items
  44. Applied Machine Learning Lab CityU Weak or Wrong Negative Feedback

    § Recommender systems often recommend items belonging to the same category (e.g., cell phones), while users click/order some of them and skip the others
  45. Applied Machine Learning Lab CityU Weak or Wrong Negative Feedback

    § Recommender systems often recommend items belonging to the same category (e.g., cell phones), while users click/order some of them and skip the others § The partial order of user’s preference over these two items in category B § At time 2, we name a5 as the competitor item of a2
  46. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  47. Applied Machine Learning Lab CityU Recommendation as MDP

    § Environment: User Pool + News Pool § Agent: Recommendation Algorithm § State: Feature Representation for Users § Action: Feature Representation for News § Reward: User Feedback § click/skip labels § estimation of user activeness [Figure: agent-environment loop. The environment (users, news) sends state and reward (click / no click, user activeness) to the agent; the agent (DQN with explore and memory modules) returns one of the actions 1, ..., m]
  48. Applied Machine Learning Lab CityU User Activeness Modelling

    § Hazard function § User activeness [Figure: user activeness (0.0 to 1.0) plotted over time, with marks at t1, ..., t10]
  49. Applied Machine Learning Lab CityU Duelling Network Architecture § State

    features: User features and Context features § Action features: User news features and Context features § Value function V(s) § state features § Advantage function A(s, a) § state features + action features § Q-function = V(s) + A(s, a) V(s) A(s, a) Q(s, a) User features Context features User-news features News features
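
The decomposition on this slide, Q(s, a) = V(s) + A(s, a), can be illustrated with linear stand-ins for the two branches. This is a sketch under assumptions: the real branches are neural networks, and `dot`, `dueling_q` and the weight vectors are hypothetical names introduced for the example.

```python
def dot(weights, features):
    """Inner product of two equal-length lists of floats."""
    return sum(w * x for w, x in zip(weights, features))


def dueling_q(state_features, action_features, value_weights, advantage_weights):
    """Q(s, a) = V(s) + A(s, a) with toy linear V and A."""
    # V(s) depends on the state features only.
    value = dot(value_weights, state_features)
    # A(s, a) depends on the state features plus the action features.
    advantage = dot(advantage_weights, state_features + action_features)
    return value + advantage
```

Because V(s) does not depend on the candidate news item, it can be computed once per request and reused when scoring every candidate action.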
  50. Applied Machine Learning Lab CityU Effective Exploration

    § Random exploration § Harms the user experience in the short term § Multi-armed Bandit § Large variance § Long time to converge § Steps § Get recommendation lists from the current network $Q$ and the explore network $\tilde{Q}$ § Probabilistically interleave these two lists § Get feedback from the user and compare the performance of the two networks § If $\tilde{Q}$ performs better, update $Q$ towards it [Figure: the current network and the explore network each produce a list; the lists are probabilistically interleaved and pushed to the user; based on the collected feedback, the current network either steps towards the explore network or is kept]
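
The last two steps above (compare the two networks, then move the current one towards the explore network if it did better) can be sketched as follows. The slide does not give the update rule, so the perturbation (each weight scaled by a uniform random factor), the step size `eta`, and the function names are all assumptions made for illustration.

```python
import random


def explore_weights(weights, alpha, rng):
    """Build explore-network weights: w + alpha * u * w with u ~ Uniform(-1, 1)."""
    return [w + alpha * rng.uniform(-1.0, 1.0) * w for w in weights]


def update_towards(weights, explored, eta, explore_reward, current_reward):
    """Step towards the explore network only if it earned more user reward."""
    if explore_reward > current_reward:
        return [w + eta * (w_tilde - w) for w, w_tilde in zip(weights, explored)]
    # Otherwise keep the current network unchanged.
    return list(weights)
```

A typical call would draw `explored = explore_weights(weights, 0.1, random.Random(0))`, serve the interleaved list, and then pass the two observed rewards to `update_towards`.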
  51. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  52. Applied Machine Learning Lab CityU Background § Users sequentially interact

    with multiple scenarios § Different scenario has different objective Entrance Page skip Objective: matching user’s various preferences
  53. Applied Machine Learning Lab CityU Background § Users sequentially interact

    with multiple scenarios § Different scenario has different objective Entrance Page skip Objective: comparing with the clicked item
  54. Applied Machine Learning Lab CityU Background § Users sequentially interact

    with multiple scenarios § Different scenario has different objective Entrance Page skip
  55. Applied Machine Learning Lab CityU Motivation § Optimizing each recommender

    agent for each scenario § Ignoring sequential dependency § Missing information § Sub-optimal overall objective Item Detail Page Entrance Page click return X
  56. Applied Machine Learning Lab CityU Whole-Chain Recommendation § Goal §

    Jointly optimizing multiple recommendation strategies § Maximizing the overall performance of the whole session
  57. Applied Machine Learning Lab CityU Whole-Chain Recommendation § Goal §

    Jointly optimizing multiple recommendation strategies § Maximizing the overall performance of the whole session § Actor-Critic § Actor: recommender agent in one scenario § Critic: controlling actors
  58. Applied Machine Learning Lab CityU Whole-Chain Recommendation § Goal §

    Jointly optimizing multiple recommendation strategies § Maximizing the overall performance of the whole session § Actor-Critic § Actor: recommender agent in one scenario § Critic: controlling actors § Advantages § Agents are sequentially activated § Agents share the same memory § Agents work collaboratively
  59. Applied Machine Learning Lab CityU Entrance Page

    $y_t = \big[\, p^s_m(s_t, a_t) \cdot \gamma Q_\mu(s_{t+1}, \pi_m(s_{t+1})) + p^c_m(s_t, a_t) \cdot \big(r_t + \gamma Q_\mu(s_{t+1}, \pi_d(s_{t+1}))\big) + p^l_m(s_t, a_t) \cdot r_t \,\big] \mathbb{1}_m$ § 1st term: skip behavior § 2nd term: click behavior § 3rd term: leave behavior [Figure: Actor_m (entrance page) and Actor_d (item detail page), connected by the user’s click, skip and return behaviors]
  60. Applied Machine Learning Lab CityU Optimization

    $y_t = \big[\, p^s_m(s_t, a_t) \cdot \gamma Q_\mu(s_{t+1}, \pi_m(s_{t+1})) + p^c_m(s_t, a_t) \cdot \big(r_t + \gamma Q_\mu(s_{t+1}, \pi_d(s_{t+1}))\big) + p^l_m(s_t, a_t) \cdot r_t \,\big] \mathbb{1}_m + \big[\, p^c_d(s_t, a_t) \cdot \big(r_t + \gamma Q_\mu(s_{t+1}, \pi_d(s_{t+1}))\big) + p^s_d(s_t, a_t) \cdot \gamma Q_\mu(s_{t+1}, \pi_m(s_{t+1})) + p^l_d(s_t, a_t) \cdot r_t \,\big] \mathbb{1}_d$ [Figure: Actor_m (entrance page) and Actor_d (item detail page), connected by the user’s click, skip and return behaviors]
  61. Applied Machine Learning Lab CityU Why Model-based RL?

    § Advantages § Reducing training data amount requirement § Performing accurate optimization of the Q-function § Model-based target: $y_t = \big[\, p^s_m(s_t, a_t) \cdot \gamma Q_\mu(s_{t+1}, \pi_m(s_{t+1})) + p^c_m(s_t, a_t) \cdot \big(r_t + \gamma Q_\mu(s_{t+1}, \pi_d(s_{t+1}))\big) + p^l_m(s_t, a_t) \cdot r_t \,\big] \mathbb{1}_m + \big[\, p^c_d(s_t, a_t) \cdot \big(r_t + \gamma Q_\mu(s_{t+1}, \pi_d(s_{t+1}))\big) + p^s_d(s_t, a_t) \cdot \gamma Q_\mu(s_{t+1}, \pi_m(s_{t+1})) + p^l_d(s_t, a_t) \cdot r_t \,\big] \mathbb{1}_d$
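
To make the model-based target concrete, the sketch below evaluates the entrance-page part of $y_t$ for given behavior probabilities. It is an illustration under assumptions: the click term is read as probability times (reward plus discounted next value), the probabilities are supplied by some learned user-behavior model, and `entrance_page_target` and its parameter names are hypothetical.

```python
def entrance_page_target(p_skip, p_click, p_leave, reward, gamma,
                         q_next_entrance, q_next_detail):
    """Target y_t for an action taken on the entrance page.

    q_next_entrance: Q(s_{t+1}, pi_m(s_{t+1})), used when the user skips.
    q_next_detail:   Q(s_{t+1}, pi_d(s_{t+1})), used when the user clicks.
    """
    # Skip: no reward now, the session continues on the entrance page.
    skip_term = p_skip * gamma * q_next_entrance
    # Click: reward now, the session continues on the item detail page.
    click_term = p_click * (reward + gamma * q_next_detail)
    # Leave: the session ends, so only the reward term remains.
    leave_term = p_leave * reward
    return skip_term + click_term + leave_term
```

For example, with probabilities 0.5 / 0.25 / 0.25, reward 1, discount 0.5 and next values 2 and 4, the three terms are 0.5, 0.75 and 0.25, so the target is 1.5.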
  62. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  64. Applied Machine Learning Lab CityU Reinforcement Learning for Advertisements §

    Goal: maximizing the advertising impression revenue from advertisers § Assigning the right ads to the right users at the right place § Reinforcement learning for advertisements § Continuously updating the advertising strategies & maximizing the long-term revenue Normal Recommendations Sponsored Products Ad
  65. Applied Machine Learning Lab CityU Reinforcement Learning for Advertisements

    § Challenges: § Different teams, goals and models → suboptimal overall performance § Goal: § Jointly optimizing advertising revenue and user experience § KDD’2020, AAAI’2021 [Figure: Advertising Revenue vs. User Experience]
  66. Applied Machine Learning Lab CityU Reinforcement Learning Framework

    § Two-level Deep Q-networks: § first-level: recommender system (RS) § second-level: advertising system (AS) § State: rec/ads browsing history § Action: $a_t = (a^{rs}_t, a^{as}_t)$ § Reward: $r_t(s_t, a^{rs}_t)$ and $r_t(s_t, a^{as}_t)$ § Transition: $s_t$ to $s_{t+1}$
  67. Applied Machine Learning Lab CityU Recommender System § Goal: long-term

    user experience or engagement § Challenge: combinatorial action space
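  68. Applied Machine Learning Lab CityU Cascading DQN for RS

    § Complexity: $O\big(\binom{N}{k}\big) \to O(kN)$ § N: number of candidate items § k: length of rec-list [Figure: cascading DQN that takes historical recs, historical ads and context as input and selects the rec items]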
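
A quick arithmetic check shows why the reduction on this slide matters. The sketch assumes the first expression on the slide is the binomial coefficient "N choose k" (the number of possible k-item lists), and the function names are hypothetical.

```python
import math


def joint_evaluations(num_candidates, list_length):
    """Number of k-item subsets a single joint argmax would have to score."""
    return math.comb(num_candidates, list_length)


def cascading_evaluations(num_candidates, list_length):
    """Upper bound when items are chosen one position at a time: k * N."""
    return list_length * num_candidates
```

With N = 100 candidates and a list of k = 3 items, the joint search covers 161,700 subsets while the cascading bound is 300 evaluations.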
  69. Applied Machine Learning Lab CityU Advertising System § Goal: §

    maximize the advertising revenue § minimize the negative influence of ads on user experience § Decisions: § interpolate an ad? § the optimal location § the optimal ad
  70. Applied Machine Learning Lab CityU Novel DQN for AS §

    Three decisions: 1. interpolate an ad? 2. the optimal location 3. the optimal ad Historical Rec Historical Ads Context Rec-list Ad item Decision 1 Decision 2 Decision 3
  71. Applied Machine Learning Lab CityU Systems Update § Target User:

    § browses the mixed rec-ads list § provides her/his feedback
  72. Applied Machine Learning Lab CityU Advantage

    § The first individual DQN architecture that can simultaneously evaluate the Q-values of multiple levels’ related actions [Figure: two classic DQN architectures, one taking the state and outputting Q-value 1, ..., Q-value L, the other taking the state and an action and outputting a single Q-value; the proposed network takes state $s_t$ and action $a^{ad}_t$ and outputs $Q(s_t, a^{ad}_t)^0, Q(s_t, a^{ad}_t)^1, \ldots$]
  73. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  74. Applied Machine Learning Lab CityU Real-time Feedback § The most

    practical and precise way is online A/B test
  75. Applied Machine Learning Lab CityU Real-time Feedback § The most

    practical and precise way is online A/B test § Online A/B test is inefficient and expensive § Taking several weeks to collect sufficient data § Numerous engineering efforts § Bad user experience
  76. Applied Machine Learning Lab CityU Real-time Feedback § The most

    practical and precise way is online A/B test § Online A/B test is inefficient and expensive § Taking several weeks to collect sufficient data § Numerous engineering efforts § Bad user experience Real-time Feedback UserSim System
  77. Applied Machine Learning Lab CityU Overview § Simulating users’ real-time

    feedback is challenging § Underlying distribution of item sequences is extremely complex § Data available to each user is rather limited
  78. Applied Machine Learning Lab CityU Overview

    § Simulating users’ real-time feedback is challenging § Underlying distribution of item sequences is extremely complex § Data available to each user is rather limited [Figure: UserSim framework. Generator: an encoder-decoder over the browsing history i1, ..., iN that outputs $G_\theta(s)$, supervised by the real action a. Discriminator: an RNN over the browsing history plus MLPs over the real a or fake $G_\theta(s)$, followed by a softmax over lR1, ..., lRK, lF1, ..., lFK, supervised by the ground truth feedback]
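  79. Applied Machine Learning Lab CityU Discriminator

    § $D_\phi(s, a) = \sum_{k=1}^{K} p_{model}(l_k \mid s, a)$ (sum over feedback for real items) § $D_\phi(s, G_\theta(s)) = \sum_{k=K+1}^{2K} p_{model}(l_k \mid s, G_\theta(s))$ (sum over feedback for fake items) § $L^{unsup}_D = -\{\mathbb{E}_{s,a \sim p_{data}} \log D_\phi(s, a) + \mathbb{E}_{s \sim p_{data}} \log D_\phi(s, G_\theta(s))\}$ [Figure: discriminator network with inputs I1, ..., IN (e1, ..., eN and f1, ..., fN), hidden states h1, ..., hN, the real a or fake $G_\theta(s)$, fully connected layers FC1 to FC3, and a softmax over lR1, ..., lRK, lF1, ..., lFK; a supervised component uses the ground truth feedback]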
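  80. Applied Machine Learning Lab CityU Discriminator

    § $D_\phi(s, a) = \sum_{k=1}^{K} p_{model}(l_k \mid s, a)$ § $D_\phi(s, G_\theta(s)) = \sum_{k=K+1}^{2K} p_{model}(l_k \mid s, G_\theta(s))$ § $L^{unsup}_D = -\{\mathbb{E}_{s,a \sim p_{data}} \log D_\phi(s, a) + \mathbb{E}_{s \sim p_{data}} \log D_\phi(s, G_\theta(s))\}$ § $L^{sup}_D = -\{\mathbb{E}_{s,a,r \sim p_{data}}[\log p_{model}(l_k \mid s, a, k \le K)] + \lambda \cdot \mathbb{E}_{s,r \sim p_{data}}[\log p_{model}(l_k \mid s, G_\theta(s), K < k \le 2K)]\}$ [Figure: discriminator network with inputs I1, ..., IN (e1, ..., eN and f1, ..., fN), hidden states h1, ..., hN, the real a or fake $G_\theta(s)$, fully connected layers FC1 to FC3, and a softmax over lR1, ..., lRK, lF1, ..., lFK; a supervised component uses the ground truth feedback]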
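  81. Applied Machine Learning Lab CityU Discriminator

    § $D_\phi(s, a) = \sum_{k=1}^{K} p_{model}(l_k \mid s, a)$ § $D_\phi(s, G_\theta(s)) = \sum_{k=K+1}^{2K} p_{model}(l_k \mid s, G_\theta(s))$ § $L^{unsup}_D = -\{\mathbb{E}_{s,a \sim p_{data}} \log D_\phi(s, a) + \mathbb{E}_{s \sim p_{data}} \log D_\phi(s, G_\theta(s))\}$ § $L^{sup}_D = -\{\mathbb{E}_{s,a,r \sim p_{data}}[\log p_{model}(l_k \mid s, a, k \le K)] + \lambda \cdot \mathbb{E}_{s,r \sim p_{data}}[\log p_{model}(l_k \mid s, G_\theta(s), K < k \le 2K)]\}$ § $L_D = L^{unsup}_D + \alpha \cdot L^{sup}_D$ [Figure: discriminator network with inputs I1, ..., IN (e1, ..., eN and f1, ..., fN), hidden states h1, ..., hN, the real a or fake $G_\theta(s)$, fully connected layers FC1 to FC3, and a softmax over lR1, ..., lRK, lF1, ..., lFK; a supervised component uses the ground truth feedback]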
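
The discriminator outputs a softmax over 2K labels: K feedback types for real items followed by K for fake items. The sketch below shows how the two sums and the unsupervised loss on this slide could be computed for a single sample. It is a toy illustration; the function names and the use of raw logits as input are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def real_and_fake_mass(logits, num_feedback_types):
    """Split a 2K-way softmax into the mass on real and on fake labels."""
    probs = softmax(logits)
    k = num_feedback_types
    return sum(probs[:k]), sum(probs[k:2 * k])


def unsupervised_d_loss(real_logits, fake_logits, num_feedback_types):
    """Single-sample version of L_D^unsup = -(log D(s, a) + log D(s, G(s)))."""
    d_real, _ = real_and_fake_mass(real_logits, num_feedback_types)
    _, d_fake = real_and_fake_mass(fake_logits, num_feedback_types)
    return -(math.log(d_real) + math.log(d_fake))
```

The loss shrinks as the discriminator puts more mass on the first K labels for real items and on the last K labels for generated items.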
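  82. Applied Machine Learning Lab CityU Generator

    § $L^{unsup}_G = \mathbb{E}_{s \sim p_{data}}[\log D_\phi(s, G_\theta(s))]$ [Figure: generator network with an encoder over inputs I1, ..., IN (e1, ..., eN and f1, ..., fN; hidden states h1, ..., hN) and a decoder with fully connected layers FC1, FC2 and an output layer producing $G_\theta(s)$; a supervised component uses the real action a]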
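  83. Applied Machine Learning Lab CityU Generator

    § $L^{unsup}_G = \mathbb{E}_{s \sim p_{data}}[\log D_\phi(s, G_\theta(s))]$ § $L^{sup}_G = \mathbb{E}_{s,a \sim p_{data}} \lVert a - G_\theta(s) \rVert_2^2$ (real item a vs. fake item $G_\theta(s)$) [Figure: generator network with an encoder over inputs I1, ..., IN (e1, ..., eN and f1, ..., fN; hidden states h1, ..., hN) and a decoder with fully connected layers FC1, FC2 and an output layer producing $G_\theta(s)$; a supervised component uses the real action a]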
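  84. Applied Machine Learning Lab CityU Generator

    § $L^{unsup}_G = \mathbb{E}_{s \sim p_{data}}[\log D_\phi(s, G_\theta(s))]$ § $L^{sup}_G = \mathbb{E}_{s,a \sim p_{data}} \lVert a - G_\theta(s) \rVert_2^2$ (real item a vs. fake item $G_\theta(s)$) § $L_G = L^{unsup}_G + \beta \cdot L^{sup}_G$ [Figure: generator network with an encoder over inputs I1, ..., IN (e1, ..., eN and f1, ..., fN; hidden states h1, ..., hN) and a decoder with fully connected layers FC1, FC2 and an output layer producing $G_\theta(s)$; a supervised component uses the real action a]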
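
The generator objective combines an adversarial term with a supervised squared-distance term between the real item and the generated one. A single-sample sketch, assuming items are represented as lists of floats and using hypothetical function names:

```python
def squared_l2(real_item, generated_item):
    """Squared Euclidean distance between the real and the generated item."""
    return sum((a - g) ** 2 for a, g in zip(real_item, generated_item))


def generator_loss(unsup_loss, real_item, generated_item, beta):
    """L_G = L_G^unsup + beta * ||a - G(s)||_2^2 for one sample."""
    return unsup_loss + beta * squared_l2(real_item, generated_item)
```

Here `beta` trades off fooling the discriminator against staying close to the item the user actually interacted with.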
  85. Applied Machine Learning Lab CityU RL-based Recommender Training § Metric:

    average reward of a session § Baselines: Historical Logs, IRecGAN
  86. Applied Machine Learning Lab CityU RL-based Recommender Training

    § Metric: average reward of a session § Baselines: Historical Logs, IRecGAN § The recommender trained with UserSim converges to an avg_reward similar to the one trained on historical data § The recommender trained with UserSim is much more stable than the one trained with IRecGAN
  87. Applied Machine Learning Lab CityU Other Simulators RecoGym @ Criteo

    Virtual-Taobao @ Alibaba RecSim @ Google GAN-PW @ Alibaba
  88. Applied Machine Learning Lab CityU Outline § Recommendations in Single

    Scenario § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018) § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018) § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018) § Recommendations in Multiple Scenarios § DeepChain - Whole-Chain Recommendations (CIKM’2020) § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018) § RAM - Jointly Learning to Recommend and Advertise (KDD’2020) § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021) § Online Environment Simulator § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021) § Surveys
  89. Applied Machine Learning Lab CityU Surveys § Deep Reinforcement Learning

    for Search, Recommendation, and Online Advertising: A Survey (SIGWEB’2019) § Papers are grouped based on recommendation problems to solve § Exploitation/Exploration § Users’ Dynamic Preference Modeling § Long Term User Engagement § Slate Recommendation
  90. Applied Machine Learning Lab CityU Surveys § Deep Reinforcement Learning

    for Search, Recommendation, and Online Advertising: A Survey (SIGWEB’2019) § Papers are grouped based on recommendation problems to solve § Exploitation/Exploration § Users’ Dynamic Preference Modeling § Long Term User Engagement § Slate Recommendation § Reinforcement Learning based Recommender Systems: A Survey (Arxiv’2021) § Papers are grouped based on classic RL methodologies § Q-learning (DQN) Methods § REINFORCE (Policy Gradient) methods § Actor-Critic Methods § Compound Methods
  91. Applied Machine Learning Lab CityU Surveys § A Survey on

    Reinforcement Learning for Recommender Systems (Arxiv’2021) § Papers are grouped based on recommendation applications § Interactive Recommendation § Conversational Recommendation § Sequential Recommendation § Explainable Recommendation
  92. Applied Machine Learning Lab CityU Surveys § A Survey on

    Reinforcement Learning for Recommender Systems (Arxiv’2021) § Papers are grouped based on recommendation applications § Interactive Recommendation § Conversational Recommendation § Sequential Recommendation § Explainable Recommendation § A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions § Papers are grouped based on RL methodologies § Component Optimization in Deep RL based RS, such as Environment Simulation, State Representation, Reward Functions § Emerging topics, such as Multi-Agent, Hierarchical, Inverse, GNN-based, Self-Supervised Deep RL
  93. Applied Machine Learning Lab CityU § Continuously updating the recommendation

    strategies during the interactions § Maximizing the long-term reward from users Conclusion Recommendation Session t0 t1 t2 t3