Slide 1

Applied Machine Learning Lab, CityU

Deep Reinforcement Learning for Recommender Systems
Xiangyu Zhao, Assistant Professor
School of Data Science, City University of Hong Kong
Dec 22, 2021 @ WING, NUS

Slide 2

Biography
Applied Machine Learning Lab
Homepage: zhaoxyai.github.io | Email: [email protected]
• Research interests: data mining and machine learning, especially reinforcement learning, AutoML, and multimodal learning, and their applications in recommender systems and smart cities
• Published more than 30 papers in top conferences and journals (e.g., KDD, WWW, AAAI, SIGIR, ICDE, CIKM, ICDM)
• His research received the ICDM'21 Best-ranked Papers award, Global Top 100 Chinese New Stars in AI, the CCF-Tencent Open Fund, the Criteo Research Award, and the ByteDance Research Award
• (Senior) program committee member and session chair at top conferences, and journal reviewer
• Organizer of the DRL4KDD and DRL4IR workshops at KDD'19, WWW'21, and SIGIR'20/21, and a lead tutor at WWW'21/22 and IJCAI'21
• Founding academic committee member of MLNLP, the largest AI community in China with 800,000 followers
• The models and algorithms from his research have been launched in the online systems of many companies

Slide 3

Recommender Systems
§ Intelligent systems that assist users' information-seeking tasks
§ Example domains: music, video, e-commerce, news, social friends, location-based services

Slide 4

Recommender Systems
§ Intelligent systems that assist users' information-seeking tasks
§ Goal: suggesting the items that best match users' preferences
§ Example domains: music, video, e-commerce, news, social friends, location-based services
[Figure: interaction loop between the system and a user, informed by the user's browsing history]

Slide 7

Existing Recommendation Policies
§ Considering recommendation as an offline optimization problem
§ Following a greedy strategy to maximize the immediate rewards from users

Slide 9

Existing Recommendation Policies
§ Considering recommendation as an offline optimization problem
§ Following a greedy strategy to maximize the immediate rewards from users
§ Disadvantages
  § Overlooking real-time feedback
  § Overlooking the long-term influence on user experience

Slide 12

Reinforcement Learning in a Nutshell
§ RL is a general-purpose framework for decision-making
§ RL is for an agent with the capacity to take actions
§ Success is measured by a reward from the environment
§ Each action influences the agent's future state
§ Goal: select actions to maximize future reward

Slide 14

Agent and Environment
§ At each step t, the agent:
  § receives state s_t
  § receives scalar reward r_t
  § executes action a_t
§ The environment:
  § receives action a_t
  § emits the next state s_{t+1}
  § emits the next scalar reward r_{t+1}

Slide 15

Examples of Deep RL @ DeepMind
§ Play games: Atari, poker, Go, ...
§ Explore worlds: 3D worlds, Labyrinth, ...
§ Control physical systems: manipulate, walk, swim, ...
§ Interact with users: recommend, personalize, optimize, ...

Slide 16

Major Components of an RL Agent
§ An RL agent may include one or more of these components:
  § Value function (Q-value): prediction of the value of each state-action pair
  § Policy: maps the current state to an action
  § Model: the agent's representation of the environment

Slide 17

Deep Reinforcement Learning
§ Use deep neural networks to represent
  § the value function (Q-value)
  § the policy
  § the model
§ Optimize the loss function by stochastic gradient descent
§ From a Q-value table to a Deep Q-Network

Slide 18

Value Function
§ A value function is a prediction of future reward
§ "How much reward will I get from action a in state s?"
§ The Q-value function gives the expected total reward
  § from state s and action a
  § under policy π
  § with discount factor γ
[Figure: example values +2, +1, -1 for taking each of three actions in a state]
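The Q-value definition above can be sketched in a few lines of Python. The reward sequences and discount factor below are illustrative values, not from the talk:

```python
# Sketch of the slide's definition: a Q-value is the expected total reward,
# i.e. the discounted sum of future rewards after taking an action.

def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted per step: r0 + g*r1 + g^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With no future reward, the Q-value is just the immediate reward,
# like the +2 / +1 / -1 examples on the slide.
assert discounted_return([2]) == 2

# A trajectory of three unit rewards with gamma = 0.9: 1 + 0.9 + 0.81
assert abs(discounted_return([1, 1, 1], 0.9) - 2.71) < 1e-9
```

In practice the expectation is taken over trajectories sampled under the policy π; this sketch computes the return of a single trajectory.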

Slide 19

Deep Q-Network (DQN) Architectures

Slide 20

Policy
§ A policy is the agent's behavior
§ It is a map from state to action:
  § Deterministic policy: a = π(s)
  § Stochastic policy: π(a|s) = P[a|s]
[Figure: example action probabilities 0.7, 0.2, 0.1]
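The two policy types on this slide can be contrasted in a short sketch; the per-action scores (0.7 / 0.2 / 0.1, matching the slide's figure) are illustrative:

```python
import random

# Hypothetical per-action probabilities for one state, as in the slide's figure.
PI = {"a1": 0.7, "a2": 0.2, "a3": 0.1}

def deterministic_policy(scores):
    # a = pi(s): always return the highest-scoring action
    return max(scores, key=scores.get)

def stochastic_policy(scores, rng=random):
    # pi(a|s) = P[a|s]: sample an action with probability proportional to its score
    actions, weights = zip(*scores.items())
    return rng.choices(actions, weights=weights, k=1)[0]

assert deterministic_policy(PI) == "a1"   # always the arg-max action
assert stochastic_policy(PI) in PI        # any action, sampled by probability
```

A deterministic policy always exploits; a stochastic one naturally mixes in exploration.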

Slide 21

Model
§ The model is learnt from experience (interactions)
§ The model acts as a proxy for the environment
§ The planner interacts with the model, e.g., using lookahead search

Slide 22

Approaches to Reinforcement Learning
§ Policy-based RL
  § Search directly for the optimal policy π*
  § This is the policy achieving maximum future reward
§ Value-based RL
  § Estimate the optimal value function Q*(s, a)
  § This is the maximum value achievable under any policy
§ Model-based RL
  § Build a transition model of the environment
  § Plan (e.g., by lookahead) using the model
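The value-based approach can be made concrete with one tabular Q-learning update. The tiny transition below (states "s0"/"s1", actions "a0"/"a1") is invented for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One step of Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)  # unseen (state, action) pairs start at value 0.0

# Observe one transition: (s0, a0) -> reward 1.0 -> s1
q_learning_update(Q, "s0", "a0", 1.0, "s1", ["a0", "a1"])
assert Q[("s0", "a0")] == 0.5   # 0 + 0.5 * (1.0 + 0.9 * 0 - 0)
```

Deep Q-learning replaces the table with a neural network but keeps the same target r + γ max Q.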

Slide 23

Reinforcement Learning for Recommendations
§ Continuously updating the recommendation strategies during the interactions

Slide 24

Reinforcement Learning for Recommendation Policies
§ Continuously updating the recommendation strategies during the interactions
§ Maximizing the long-term reward from users
[Figure: a recommendation session unfolding over steps t0-t3]

Slide 25

Outline
§ Recommendations in a Single Scenario
  § DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys'2018)
  § DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD'2018)
  § DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW'2018)
§ Recommendations in Multiple Scenarios
  § DeepChain - Whole-Chain Recommendations (CIKM'2020)
  § MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW'2018)
  § RAM - Jointly Learning to Recommend and Advertise (KDD'2020)
  § DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI'2021)
§ Online Environment Simulator
  § UserSim - User Simulation via Supervised Generative Adversarial Network (WWW'2021)
§ Surveys


Slide 27

User-System Interactions
§ The system recommends a page of items to a user
§ The user provides real-time feedback, and the system updates its policy
§ The system recommends a new page of items

Slide 28

Challenges
§ Updating the strategy according to the user's real-time feedback
§ Making diverse and complementary recommendations
§ Displaying items in a 2-D page

Slide 31

Actor-Critic
§ Conventional value-based architectures: (a) input state s, output Q(s, a_i) for every action; (b) input a state-action pair (s, a_i), output Q(s, a_i)
§ (c) Actor-Critic: the Actor maps state s to action a; the Critic estimates Q(s, a)

Q(s, a) = E_{s'}[ r + γ Q(s', a') | s, a ]
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

§ With a fixed item space, the max requires enumerating all possible items
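The slide's argument for the Actor-Critic design can be sketched as follows. The scoring function and the toy actor are stand-ins invented for illustration, not the paper's networks:

```python
# Value-based selection must evaluate Q once per candidate item (cost grows
# with the item space), while the Actor proposes an action in a single pass
# and the Critic scores only that one (state, action) pair.

def value_based_select(state, items, q_fn):
    # one Q evaluation per candidate item
    return max(items, key=lambda a: q_fn(state, a))

def actor_critic_select(state, actor, critic):
    a = actor(state)            # the Actor proposes the action directly
    return a, critic(state, a)  # the Critic evaluates just that pair

def q_fn(s, a):
    return -abs(a - s)          # toy score: prefer items "near" the state

assert value_based_select(5, [1, 4, 9], q_fn) == 4
a, q = actor_critic_select(5, actor=lambda s: s - 1, critic=q_fn)
assert (a, q) == (4, -1)
```

With millions of items, the per-item enumeration in the first function is exactly the bottleneck the Actor-Critic architecture avoids.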

Slide 32

Actor Design
§ Goal: generating a page of recommendations according to the user's browsing history

Slide 33

Actor Architecture
§ Goal: generating a page of items according to the user's browsing history
[Figure: the Encoder (item embeddings → page-wise CNN → GRU with attention over prior pages) produces the user's preference s; the Decoder (DeCNN) maps s to a page of items e_1 ... e_M]

Slide 34

Embedding Layer
§ Three types of information for each item i:
  § e_i: the item's identifier
  § c_i: the item's category
  § f_i: the user's feedback

X_i = concat(E_i, C_i, F_i) = tanh( concat(W_E e_i + b_E, W_C c_i + b_C, W_F f_i + b_F) )
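The embedding formula above can be sketched with tiny hand-set weights. The real W_E, W_C, W_F and biases are learned, and the shapes here are purely illustrative:

```python
import math

def dense(x, W, b):
    """Affine layer W @ x + b for a plain-list vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def item_embedding(e_i, c_i, f_i, params):
    # X_i = tanh(concat(W_E e_i + b_E, W_C c_i + b_C, W_F f_i + b_F))
    parts = (dense(e_i, *params["E"]) + dense(c_i, *params["C"])
             + dense(f_i, *params["F"]))
    return [math.tanh(v) for v in parts]

params = {"E": ([[1.0, 0.0]], [0.0]),   # 2-d identifier  -> 1-d
          "C": ([[0.0, 1.0]], [0.0]),   # 2-d category    -> 1-d
          "F": ([[1.0]], [0.0])}        # 1-d feedback    -> 1-d

X = item_embedding([0.5, 0.0], [0.0, 0.25], [1.0], params)
assert len(X) == 3 and all(-1.0 < v < 1.0 for v in X)  # tanh keeps values in (-1, 1)
```

Concatenating before the tanh keeps the three information types in separate coordinates of X_i.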

Slide 35

Page-wise CNN Layer
[Figure: a CNN over the page-wise item embeddings X_1 ... X_M produces one representation per page, which feeds the encoder RNN]

Slide 36

RNN & Attention Layer
§ A GRU encodes the sequence of page representations E_1 ... E_T (pages 1 to T):

z_t = σ(W_z E_t + U_z h_{t−1})
r_t = σ(W_r E_t + U_r h_{t−1})
ĥ_t = tanh[ W E_t + U(r_t · h_{t−1}) ]
h_t = (1 − z_t) h_{t−1} + z_t ĥ_t

§ Attention over the hidden states yields the user preference:

s = Σ_{t=1}^{T} α_t h_t,   where α_t = exp(W_α h_t + b_α) / Σ_j exp(W_α h_j + b_α)
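The attention step above reduces to a softmax-weighted average. The sketch below uses scalar "hidden states" to keep the arithmetic visible; the real h_t are vectors and the attention scores come from W_α h_t + b_α:

```python
import math

def attention_weights(scores):
    """alpha_t = exp(score_t) / sum_j exp(score_j)  (softmax)."""
    exps = [math.exp(x) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def user_preference(hidden_states, scores):
    """s = sum_t alpha_t * h_t"""
    return sum(a * h for a, h in zip(attention_weights(scores), hidden_states))

# Equal scores -> equal weights -> plain average of the hidden states.
assert attention_weights([0.0, 0.0]) == [0.5, 0.5]
assert user_preference([2.0, 4.0], [0.0, 0.0]) == 3.0
```

Higher attention scores shift s toward the pages that best reflect the user's current preference.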

Slide 37

Decoder
§ Goal: generating a page of items according to the user's preference
§ Input: the user preference (a vector); output: a page of items (a matrix)
§ Task 1: generating a set of items
§ Task 2: displaying the items in a page

Slide 38

Decoder
§ Goal: generating a page of items according to the user's preference
§ Deconvolution neural network (DeCNN): analogous to recovering an image (a matrix) from its representation (a vector)


Slide 40

Critic Architecture
§ Learning the action-value function Q(s, a)

Q_{θ^µ}(s, a) = E_{s'}[ r + γ Q_{θ^{µ'}}(s', f_{θ^{π'}}(s')) ]

§ Short-term reward of a page: r = Σ_{m=1}^{M} reward(e_m)
§ The next action is produced by the fixed target Actor f_{θ^{π'}}(s') and evaluated by the target Critic
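The Critic's training target above can be sketched in a few lines. All functions below are toy stand-ins for the target Actor f_{θ^{π'}} and target Critic Q_{θ^{µ'}}, with invented values:

```python
def page_reward(item_rewards):
    """r = sum of the per-item rewards on the recommended page."""
    return sum(item_rewards)

def critic_target(item_rewards, s_next, target_actor, target_critic, gamma=0.95):
    """y = r + gamma * Q'(s', f'(s')), evaluated with the fixed target networks."""
    a_next = target_actor(s_next)           # target Actor proposes the next page
    return page_reward(item_rewards) + gamma * target_critic(s_next, a_next)

y = critic_target([1.0, 0.0, 1.0], s_next="s1",
                  target_actor=lambda s: "page",
                  target_critic=lambda s, a: 2.0)
assert y == 2.0 + 0.95 * 2.0   # page reward 2.0 plus discounted next Q-value
```

Holding the target networks fixed between updates is what keeps this regression target stable during training.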

Slide 41

Critic Architecture
§ DeepPage thus accounts for
  § the user's real-time feedback
  § the long-term reward
  § putting items in a page


Slide 43

Why Negative Feedback?
§ Capturing what users may not like
  § Positive feedback: click or purchase
  § Negative feedback: skip or leave
§ Advantage: avoiding bad recommendation cases

Slide 44

Why Negative Feedback?
§ Capturing what users may not like
  § Positive feedback: click or purchase
  § Negative feedback: skip or leave
§ Advantage: avoiding bad recommendation cases
§ Challenges
  § Negative feedback could bury the positive signals (skips are far more frequent than clicks)
  § A skip may not be caused by the user disliking the item
  § Weak or wrong negative feedback can introduce noise

Slide 45

Novel DQN Architecture
§ Intuition: recommend items similar to the clicked/ordered items (left part) and dissimilar to the skipped items (right part)
§ An RNN with gated recurrent units (GRU) captures the user's sequential preference
§ Inputs: recently clicked or ordered items; recently skipped items

Slide 46

Weak or Wrong Negative Feedback
§ Recommender systems often recommend items belonging to the same category (e.g., cell phones), while users click/order some of them and skip the others

Slide 47

Weak or Wrong Negative Feedback
§ Recommender systems often recommend items belonging to the same category (e.g., cell phones), while users click/order some of them and skip the others
§ This induces a partial order of the user's preference over two items in category B
§ At time 2, a_5 is named the competitor item of a_2


Slide 49

Recommendation as MDP
§ Environment: user pool + news pool
§ Agent: the recommendation algorithm (a DQN)
§ State: feature representation of the user
§ Action: feature representation of candidate news
§ Reward: user feedback
  § click/skip labels
  § an estimate of user activeness

Slide 50

User Activeness Modelling
§ Modelled with a hazard function
[Figure: user activeness (0.0-1.0) over time, across return times t1 ... t10]

Slide 51

Duelling Network Architecture
§ State features: user features and context features
§ Action features: news features and user-news interaction features
§ Value function V(s): computed from the state features
§ Advantage function A(s, a): computed from the state and action features
§ Q(s, a) = V(s) + A(s, a)
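The dueling decomposition on this slide can be sketched directly. (The published dueling DQN also subtracts the mean advantage for identifiability; the slide's simpler Q = V + A is what is shown here, with invented values.)

```python
def dueling_q(v_s, advantages):
    """Q(s, a) = V(s) + A(s, a) for every candidate action a."""
    return {a: v_s + adv for a, adv in advantages.items()}

# V(s) comes from state features only; A(s, a) adds the action-specific part.
q = dueling_q(v_s=1.0, advantages={"news1": 0.5, "news2": -0.2})
assert q == {"news1": 1.5, "news2": 0.8}
```

Separating V and A lets the network learn how good a state is even when most candidate news items there are never clicked.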

Slide 52

Effective Exploration
§ Random exploration: harms the user experience in the short term
§ Multi-armed bandits: large variance, long time to converge
§ Steps
  § Get recommendations from the current network Q and the explore network Q̃
  § Probabilistically interleave the two lists and push the result to the user
  § Collect the user's feedback and compare the performance of the two networks
  § If Q̃ performs better, take a step from Q towards it; otherwise keep Q
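The interleaving step above can be sketched as follows. The item lists, coin-flip probability, and de-duplication rule are illustrative choices, not the paper's exact procedure:

```python
import random

def probabilistic_interleave(current_list, explore_list, p_current=0.5, rng=random):
    """Merge two ranked lists by repeatedly coin-flipping which source fills
    the next slot, skipping items that were already placed."""
    merged, seen = [], set()
    cur, exp = list(current_list), list(explore_list)
    while cur or exp:
        # pick a source for the next slot (only sources with items left)
        take_cur = cur and (not exp or rng.random() < p_current)
        src = cur if take_cur else exp
        item = src.pop(0)
        if item not in seen:     # each item appears at most once in the result
            seen.add(item)
            merged.append(item)
    return merged

rng = random.Random(0)
merged = probabilistic_interleave(["A", "B", "C"], ["A", "C", "D"], rng=rng)
assert set(merged) == {"A", "B", "C", "D"} and len(merged) == 4
```

Tracking which source contributed each clicked item is what lets the system judge whether Q or Q̃ performed better.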


Slide 54

Background
§ Users sequentially interact with multiple scenarios
§ Different scenarios have different objectives
  § Entrance page: matching the user's various preferences
  § Item detail page: comparing with the clicked item

Slide 57

Motivation
§ Optimizing a separate recommender agent for each scenario leads to
  § ignoring the sequential dependency between scenarios
  § missing information across scenarios
  § a sub-optimal overall objective
[Figure: the user clicks from the entrance page to the item detail page and returns]

Slide 58

Whole-Chain Recommendation
§ Goal
  § Jointly optimizing multiple recommendation strategies
  § Maximizing the overall performance of the whole session
§ Actor-Critic
  § Actor: the recommender agent of one scenario
  § Critic: coordinating the actors
§ Advantages
  § Agents are sequentially activated
  § Agents share the same memory
  § Agents work collaboratively

Slide 61

Entrance Page (Actor_m) and Item Detail Page (Actor_d)
§ Target for the entrance-page agent (case 1_m):

y_t = p^s_m(s_t, a_t) · γ Q_µ(s_{t+1}, π_m(s_{t+1}))
    + p^c_m(s_t, a_t) · [ r_t + γ Q_µ(s_{t+1}, π_d(s_{t+1})) ]
    + p^l_m(s_t, a_t) · r_t

§ 1st row: skip behavior
§ 2nd row: click behavior
§ 3rd row: leave behavior

Slide 62

Optimization
§ Combining both scenarios with indicators 1_m (entrance page) and 1_d (item detail page):

y_t = [ p^s_m(s_t, a_t) · γ Q_µ(s_{t+1}, π_m(s_{t+1})) + p^c_m(s_t, a_t) · ( r_t + γ Q_µ(s_{t+1}, π_d(s_{t+1})) ) + p^l_m(s_t, a_t) · r_t ] 1_m
    + [ p^c_d(s_t, a_t) · ( r_t + γ Q_µ(s_{t+1}, π_d(s_{t+1})) ) + p^s_d(s_t, a_t) · γ Q_µ(s_{t+1}, π_m(s_{t+1})) + p^l_d(s_t, a_t) · r_t ] 1_d

Slide 63

Why Model-based RL?
§ The behavior probabilities p^s, p^c, p^l in the target y_t are predicted by a learned model of the user
§ Advantages
  § Reducing the amount of training data required
  § Performing accurate optimization of the Q-function
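The entrance-page case of the target above (the 1_m term) can be sketched as an expectation over the three user behaviors. The probabilities, reward, and next-step Q-values below are toy stand-ins:

```python
def entrance_target(p_skip, p_click, p_leave, r, q_next_m, q_next_d, gamma=0.9):
    """y_t = p_s * gamma * Q(s', pi_m(s'))          (skip: stay on entrance page)
           + p_c * (r + gamma * Q(s', pi_d(s')))    (click: move to detail page)
           + p_l * r                                (leave: session ends)
    """
    assert abs(p_skip + p_click + p_leave - 1.0) < 1e-9  # behaviors are exhaustive
    return (p_skip * gamma * q_next_m
            + p_click * (r + gamma * q_next_d)
            + p_leave * r)

y = entrance_target(p_skip=0.5, p_click=0.25, p_leave=0.25,
                    r=1.0, q_next_m=2.0, q_next_d=4.0)
assert y == 0.5 * 0.9 * 2.0 + 0.25 * (1.0 + 0.9 * 4.0) + 0.25 * 1.0
```

Because the target averages over *predicted* behaviors instead of a single sampled one, each logged transition carries more training signal, which is the data-efficiency advantage of the model-based approach.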


Slide 65

Overall Model Architecture

Slide 68

Detailed Structure of MA-RDPG


Slide 72

Reinforcement Learning for Advertisements
§ Goal: maximizing the advertising impression revenue from advertisers
§ Assigning the right ads to the right users at the right place
§ Reinforcement learning for advertising: continuously updating the advertising strategies and maximizing the long-term revenue
[Figure: a feed mixing normal recommendations with a sponsored product (ad)]

Slide 73

Reinforcement Learning for Advertisements
§ Challenge: different teams, goals, and models lead to suboptimal overall performance
§ Goal: jointly optimizing the advertising revenue and the user experience (KDD'2020, AAAI'2021)

Slide 74

Reinforcement Learning Framework
§ Two-level deep Q-networks
  § first level: the recommender system (RS)
  § second level: the advertising system (AS)
§ State s_t: recommendation and ad browsing history
§ Action: a_t = (a_t^rs, a_t^as)
§ Rewards: r_t(s_t, a_t^rs) and r_t(s_t, a_t^as)
§ Transition: from s_t to s_{t+1}

Slide 75

Recommender System
§ Goal: long-term user experience or engagement
§ Challenge: combinatorial action space

Slide 76

Cascading DQN for RS
§ Reduces the cost of rec-list selection from O(N^k) to O(kN)
  § N: number of candidate items
  § k: length of the rec-list
§ Inputs: historical recommendations, historical ads, context, and the candidate rec items
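The complexity reduction on this slide can be sketched as a greedy, position-by-position selection. The scoring function below is a toy stand-in for the cascading Q-networks:

```python
# Scoring every ordered length-k list of N items costs O(N^k) evaluations;
# filling one position at a time, conditioned on the items already chosen,
# costs O(kN) evaluations instead.

def cascading_select(items, k, score):
    """Greedily fill k slots; score(chosen_so_far, candidate) per position."""
    chosen, evaluations = [], 0
    for _ in range(k):
        best, best_s = None, float("-inf")
        for it in items:
            if it in chosen:
                continue
            evaluations += 1
            s = score(chosen, it)
            if s > best_s:
                best, best_s = it, s
        chosen.append(best)
    return chosen, evaluations

def score(chosen, it):
    return -it                      # toy preference: smaller item ids first

chosen, evals = cascading_select(list(range(10)), k=3, score=score)
assert chosen == [0, 1, 2]
assert evals <= 3 * 10              # O(kN) evaluations, not O(N^k)
```

Conditioning each position's score on the items already picked is what preserves list-level quality despite the greedy decomposition.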

Slide 77

Advertising System
§ Goals
  § maximize the advertising revenue
  § minimize the negative influence of ads on the user experience
§ Decisions
  § whether to interpolate an ad
  § the optimal location
  § the optimal ad

Slide 78

Novel DQN for AS
§ Three decisions:
  1. whether to interpolate an ad
  2. the optimal location
  3. the optimal ad
§ Inputs: historical recommendations, historical ads, context, the rec-list, and the candidate ad item

Slide 79

Systems Update
§ The target user browses the mixed rec-ads list and provides her/his feedback

Slide 80

Advantage
§ The first individual DQN architecture that can simultaneously evaluate the Q-values of multiple levels' related actions
[Figure: (a) a conventional DQN takes a state-action pair and outputs one Q-value; (b) the proposed DQN takes the state and outputs Q(s_t, a_t^ad)_0 ... Q(s_t, a_t^ad)_L in one pass]


Slide 82

Real-time Feedback
§ The most practical and precise way is an online A/B test

Slide 83

Real-time Feedback
§ The most practical and precise way is an online A/B test
§ But online A/B tests are inefficient and expensive
  § taking several weeks to collect sufficient data
  § requiring numerous engineering efforts
  § risking a bad user experience
§ Alternative: a simulator (UserSim) provides the system with simulated real-time feedback

Slide 85

Overview
§ Simulating users' real-time feedback is challenging
  § the underlying distribution of item sequences is extremely complex
  § the data available for each user is rather limited
[Figure: UserSim architecture; an encoder-decoder generator G_θ(s) produces an item from the browsing history, and a discriminator with supervised components classifies real vs. generated actions and predicts the feedback]

Slide 87

Discriminator
§ The softmax output covers 2K classes: feedback classes l_{R1} ... l_{RK} for real items and l_{F1} ... l_{FK} for fake (generated) items

D_φ(s, a) = Σ_{k=1}^{K} p_model(l_k | s, a)
D_φ(s, G_θ(s)) = Σ_{k=K+1}^{2K} p_model(l_k | s, G_θ(s))

L_D^unsup = −{ E_{s,a∼p_data} log D_φ(s, a) + E_{s∼p_data} log D_φ(s, G_θ(s)) }

Slide 88

Discriminator
§ Supervised component, trained with the ground-truth feedback:

L_D^sup = −{ E_{s,a,r∼p_data} [ log p_model(l_k | s, a) ], k ≤ K
          + λ · E_{s,r∼p_data} [ log p_model(l_k | s, G_θ(s)) ], K < k ≤ 2K }

Slide 90

Generator
§ An encoder-decoder network G_θ(s) generates the next item from the browsing history, trained adversarially against the discriminator:

L_G^unsup = E_{s∼p_data} [ log D_φ(s, G_θ(s)) ]

Slide 91

Generator
§ Supervised component: regress the generated (fake) item toward the real item a

L_G^sup = E_{s,a∼p_data} || a − G_θ(s) ||_2^2

Slide 92

Generator
§ Overall loss:

L_G = L_G^unsup + β · L_G^sup

Slide 93

RL-based Recommender Training
§ Metric: average reward of a session
§ Baselines: historical logs, IRecGAN

Slide 94

RL-based Recommender Training
§ Metric: average reward of a session
§ Baselines: historical logs, IRecGAN
§ A recommender trained in UserSim converges to a similar average reward as one trained on historical data
§ Training with UserSim is much more stable than training upon IRecGAN

Slide 95

Other Simulators
§ RecoGym @ Criteo
§ Virtual-Taobao @ Alibaba
§ RecSim @ Google
§ GAN-PW @ Alibaba


Slide 97

Surveys
§ Deep Reinforcement Learning for Search, Recommendation, and Online Advertising: A Survey (SIGWEB'2019)
  § Papers are grouped by the recommendation problem they solve
  § Exploitation/exploration
  § Users' dynamic preference modeling
  § Long-term user engagement
  § Slate recommendation

Slide 98

Surveys
§ Reinforcement Learning based Recommender Systems: A Survey (arXiv'2021)
  § Papers are grouped by classic RL methodology
  § Q-learning (DQN) methods
  § REINFORCE (policy gradient) methods
  § Actor-critic methods
  § Compound methods

Slide 99

Surveys
§ A Survey on Reinforcement Learning for Recommender Systems (arXiv'2021)
  § Papers are grouped by recommendation application
  § Interactive recommendation
  § Conversational recommendation
  § Sequential recommendation
  § Explainable recommendation

Slide 100

Surveys
§ A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions
  § Papers are grouped by RL methodology
  § Component optimization in deep-RL-based RS, such as environment simulation, state representation, and reward functions
  § Emerging topics, such as multi-agent, hierarchical, inverse, GNN-based, and self-supervised deep RL

Slide 101

Conclusion
§ Continuously updating the recommendation strategies during the interactions
§ Maximizing the long-term reward from users
[Figure: a recommendation session unfolding over steps t0-t3]