Deep Reinforcement Learning for Recommender Systems

Applied Machine Learning Lab, CityU
Deep Reinforcement Learning for Recommender Systems
Xiangyu Zhao, Assistant Professor, School of Data Science, City University of Hong Kong
Dec 22, 2021 @ WING, NUS

Biography
Homepage: zhaoxyai.github.io
Email: [email protected]
• Research interests: data mining and machine learning, especially Reinforcement Learning, AutoML, and Multimodal, and their applications in Recommender Systems and Smart City
• Published more than 30 papers in top conferences and journals (e.g., KDD, WWW, AAAI, SIGIR, ICDE, CIKM, ICDM)
• His research received ICDM’21 Best-ranked Papers, Global Top 100 Chinese New Stars in AI, CCF-Tencent Open Fund, Criteo Research Award, and Bytedance Research Award
• Serves as a (senior) program committee member and session chair at top conferences, and as a journal reviewer
• Organizer of DRL4KDD and DRL4KD at KDD’19, WWW’21 and SIGIR’20/21, and a lead tutor at WWW’21/22 and IJCAI’21
• Founding academic committee member of MLNLP, the largest AI community in China with 800,000 followers
• The models and algorithms from his research have been deployed in the online systems of many companies

Recommender Systems
• Intelligent systems that assist users’ information-seeking tasks
• Goal: suggesting items that best match users’ preferences
• Application domains: music, video, e-commerce, news, social friends, location-based services
[Figure: the System recommends items to the User based on the user’s browsing history]

Existing Recommendation Policies
• Considering recommendation as an offline optimization problem
• Following a greedy strategy to maximize the immediate rewards from users
• Disadvantages
  • Overlooking real-time feedback
  • Overlooking the long-term influence on user experience
[Figure: interaction between the System and the User]

Reinforcement Learning in a Nutshell
• RL is a general-purpose framework for decision-making
• RL is for an agent with the capacity to take actions
• Success is measured by a reward from the environment
• Each action influences the agent’s future state
• Goal: select actions to maximize future reward
[Figure: a state and the actions available from it]

Agent and Environment
• At each step t the agent:
  • Receives state $s_t$
  • Receives scalar reward $r_t$
  • Executes action $a_t$
• The environment:
  • Receives action $a_t$
  • Emits state $s_t$
  • Emits scalar reward $r_t$
[Figure: agent-environment loop with state $s_t$, reward $r_t$, and action $a_t$]
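The loop on this slide can be sketched in a few lines of Python. This is only an illustration: `ToyEnvironment`, `RandomAgent`, and their method names are hypothetical and do not come from the talk.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: the state is a step counter."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state

    def step(self, action):
        # Reward 1.0 when the action matches the parity of the current step.
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done  # next state, scalar reward, end flag


class RandomAgent:
    """Hypothetical agent that ignores the state and acts uniformly at random."""

    def __init__(self, n_actions=2, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.randrange(self.n_actions)


def run_episode(env, agent):
    """Agent receives a state, executes an action; environment emits state and reward."""
    state = env.reset()
    total_reward, done, trajectory = 0.0, False, []
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return total_reward, trajectory
```

Calling `run_episode(ToyEnvironment(), RandomAgent())` returns the episode's total reward together with the list of (state, action, reward) triples.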

Examples of Deep RL @DeepMind
• Play games: Atari, poker, Go, ...
• Explore worlds: 3D worlds, Labyrinth, ...
• Control physical systems: manipulate, walk, swim, ...
• Interact with users: recommend, personalize, optimize, ...

Major Components of an RL Agent
• An RL agent may include one or more of these components:
  • Value function (Q-value): prediction of value for each state and action
  • Policy: maps current state to action
  • Model: agent’s representation of the environment

Deep Reinforcement Learning
• Use deep neural networks to represent
  • Value function (Q-value)
  • Policy
  • Model
• Optimize loss function by stochastic gradient descent
[Figure: a Q-value table compared with a Deep Q-Network]

Value Function
• A value function is a prediction of future reward
  • “How much reward will I get from action a in state s?”
• The Q-value function gives the expected total reward
  • from state s and action a
  • under policy π
  • with discount factor γ
[Figure: a state with three actions whose values of taking the action are +2, +1, and -1]
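The formula behind these bullets did not survive in the transcript. The standard textbook definition is written out below as a reminder; the time indexing of the rewards is an assumption, and $\gamma \in [0, 1]$ denotes the discount factor.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s_t = s,\ a_t = a \right]
```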

Deep Q-Network (DQN) Architectures

Policy
• A policy is the agent’s behavior
• It is a map from state to action:
  • Deterministic policy: a = π(s)
  • Stochastic policy: π(a|s) = P[a|s]
[Figure: a state with three actions whose probabilities of being taken are 0.7, 0.2, and 0.1]
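The two kinds of policy can be contrasted with a small sketch. The action names and the `ACTION_PROBS` table are hypothetical; the probabilities reuse the 0.7 / 0.2 / 0.1 example from the slide, and the deterministic policy here is simply assumed to pick the most probable action.

```python
import random

# Hypothetical action probabilities for a single state (values from the slide's example).
ACTION_PROBS = {"a1": 0.7, "a2": 0.2, "a3": 0.1}


def deterministic_policy(probs):
    """a = pi(s): always returns the same action, here the most probable one."""
    return max(probs, key=probs.get)


def stochastic_policy(probs, rng):
    """pi(a|s) = P[a|s]: samples an action in proportion to its probability."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Repeated calls to `stochastic_policy` return `a1` roughly 70% of the time, whereas `deterministic_policy` never varies.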

Model
• Model is learnt from experience (interactions)
• Model acts as proxy for environment
• Planner interacts with model
  • e.g. using lookahead search
[Figure: the planner exchanges observation $o_t$, reward $r_t$, and action $a_t$ with the model]

Approaches To Reinforcement Learning
• Policy-based RL
  • Search directly for the optimal policy π*
  • This is the policy achieving maximum future reward
• Value-based RL
  • Estimate the optimal value function Q*(s, a)
  • This is the maximum value achievable under any policy
• Model-based RL
  • Build a transition model of the environment
  • Plan (e.g. by lookahead) using model

Reinforcement Learning for Recommendation Policies
• Continuously updating the recommendation strategies during the interactions
• Maximizing the long-term reward from users
[Figure: a recommendation session unfolding over time steps t0, t1, t2, t3]

Outline
• Recommendations in Single Scenario
  • DeepPage - Deep Reinforcement Learning for Page-wise Recommendations (RecSys’2018)
  • DEERS - Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (KDD’2018)
  • DRN - A Deep Reinforcement Learning Framework for News Recommendation (WWW’2018)
• Recommendations in Multiple Scenarios
  • DeepChain - Whole-Chain Recommendations (CIKM’2020)
  • MA-RDPG - Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning (WWW’2018)
  • RAM - Jointly Learning to Recommend and Advertise (KDD’2020)
  • DEAR - Deep Reinforcement Learning for Online Advertising in Recommender Systems (AAAI’2021)
• Online Environment Simulator
  • UserSim - User Simulation via Supervised Generative Adversarial Network (WWW’2021)
• Surveys

User-System Interactions
• The system recommends a page of items to a user
• The user provides real-time feedback and the system updates its policy
• The system recommends a new page of items

Challenges
• Updating strategy according to user’s real-time feedback
• Diverse and complementary recommendations
• Displaying items in a 2-D page

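Actor-Critic
• $Q(s, a) = \mathbb{E}_{s'}\left[ r + \gamma Q(s', a') \mid s, a \right]$
• $Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right]$
  • Fixed item space
  • max → enumerating all possible items
[Figure: three architectures. (a) input state s, outputs Q(s, a1), Q(s, a2), ... for every action; (b) inputs state s and action ai, output Q(s, ai); (c) Actor-Critic: the Actor maps state s to action a, and the Critic maps state s and action a to Q(s, a)]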
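The difference between the two targets on this slide can be made concrete with a small NumPy sketch. The function names, the `actor` and `critic` callables, and the default `gamma` are hypothetical stand-ins, not the networks from the talk.

```python
import numpy as np


def q_learning_target(reward, next_q_values, gamma=0.9):
    """r + gamma * max_a' Q*(s', a'): needs a Q-value for every candidate item."""
    return reward + gamma * np.max(next_q_values)


def actor_critic_target(reward, next_state, actor, critic, gamma=0.9):
    """r + gamma * Q(s', a') with a' produced by the actor: no enumeration of items."""
    next_action = actor(next_state)
    return reward + gamma * critic(next_state, next_action)
```

With a large or changing item catalogue, `next_q_values` in the first function grows with the number of items, while the second function evaluates the critic once per transition.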

Actor Design
• Goal: Generating a page of recommendations according to user’s browsing history
[Figure: the Actor maps state s to action a]

Actor Architecture
• Goal: Generating a page of items according to user’s browsing history
[Figure: encoder-decoder actor. Encoder: the items $X_1, \ldots, X_M$ of each prior page (each built from $e_i$, $c_i$, $f_i$) pass through a CNN layer, then a recurrent layer with hidden states $h_1, \ldots, h_T$ and attention weights $\alpha_1, \ldots, \alpha_T$, giving the user’s preference s. Decoder: a DeCNN maps s to a page of items $e_1, \ldots, e_M$]

Embedding Layer
• Three types of information
  • $e_i$: item’s identifier
  • $c_i$: item’s category
  • $f_i$: user’s feedback
• Item embedding: $X_i = \mathrm{concat}(E_i, C_i, F_i) = \tanh\left( \mathrm{concat}(W_E e_i + b_E,\ W_C c_i + b_C,\ W_F f_i + b_F) \right)$
  • $E_i$: identifier embedding
  • $C_i$: category embedding
  • $F_i$: feedback embedding
[Figure: the embedding layer at the input of the encoder]
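The item embedding formula can be sketched with NumPy as follows. The parameter dictionary, the helper `init_params`, and all dimensions are hypothetical; only the formula itself comes from the slide.

```python
import numpy as np


def init_params(rng, in_dims, out_dims):
    """Hypothetical random initialisation of W_E, W_C, W_F and zero biases b_E, b_C, b_F."""
    params = {}
    for name, d_in, d_out in zip("ECF", in_dims, out_dims):
        params["W_" + name] = rng.normal(scale=0.1, size=(d_out, d_in))
        params["b_" + name] = np.zeros(d_out)
    return params


def item_embedding(e_i, c_i, f_i, params):
    """X_i = tanh(concat(W_E e_i + b_E, W_C c_i + b_C, W_F f_i + b_F))."""
    E_i = params["W_E"] @ e_i + params["b_E"]  # identifier part
    C_i = params["W_C"] @ c_i + params["b_C"]  # category part
    F_i = params["W_F"] @ f_i + params["b_F"]  # feedback part
    return np.tanh(np.concatenate([E_i, C_i, F_i]))
```

The length of the resulting vector is the sum of the three output dimensions, so each item carries its identifier, category, and feedback information side by side.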

Page-wise CNN Layer
[Figure: the CNN layer of the encoder, applied to the item embeddings $X_1, \ldots, X_M$ of each page]

RNN & Attention Layer
• GRU over the page representations $E_t$ of Page 1, Page 2, ..., Page T:
  • $z_t = \sigma(W_z E_t + U_z h_{t-1})$
  • $r_t = \sigma(W_r E_t + U_r h_{t-1})$
  • $\hat{h}_t = \tanh\left[ W E_t + U (r_t \cdot h_{t-1}) \right]$
  • $h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t$
• Attention yields the user preference:
  • $s = \sum_{t=1}^{T} \alpha_t h_t$, where $\alpha_t = \dfrac{\exp(W_\alpha h_t + b_\alpha)}{\sum_j \exp(W_\alpha h_j + b_\alpha)}$
[Figure: the GRU and attention layers of the encoder]
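The GRU and attention equations can be traced in a short NumPy sketch. It assumes the product $r_t \cdot h_{t-1}$ is element-wise and that $W_\alpha$ is a vector producing one scalar score per page; the parameter dictionary and the dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(E_t, h_prev, p):
    """One GRU update following the slide's equations."""
    z = sigmoid(p["W_z"] @ E_t + p["U_z"] @ h_prev)        # update gate z_t
    r = sigmoid(p["W_r"] @ E_t + p["U_r"] @ h_prev)        # reset gate r_t
    h_hat = np.tanh(p["W"] @ E_t + p["U"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_hat


def attention_pool(H, w_alpha, b_alpha):
    """s = sum_t alpha_t h_t with alpha = softmax over t of (w_alpha . h_t + b_alpha)."""
    scores = H @ w_alpha + b_alpha
    weights = np.exp(scores - scores.max())  # shift for numerical stability
    alpha = weights / weights.sum()
    return alpha @ H, alpha


def encode_pages(page_vectors, p, w_alpha, b_alpha):
    """Run the GRU over T page representations, then pool the hidden states."""
    h = np.zeros(p["U"].shape[0])
    states = []
    for E_t in page_vectors:
        h = gru_step(E_t, h, p)
        states.append(h)
    return attention_pool(np.stack(states), w_alpha, b_alpha)
```

`encode_pages` returns the preference vector `s` and the attention weights, which sum to one across the T pages.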

Decoder
• Goal: Generating a page of items according to user’s preference
  • Task 1: Generating a set of items
  • Task 2: Displaying items in a page
• DeCNN (Deconvolution Neural Network)
  • Recovers an image (matrix) from a representation (vector)
  • Here: maps the user preference (vector) to a page of items (matrix)
[Figure: the decoder of the Actor maps s to the items $e_1, \ldots, e_M$ of the proto-action $a^{cur}_{pro}$]


Critic Architecture
• Learning the action-value function Q(s, a)
• $Q_{\theta^\mu}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma Q_{\theta^\mu}(s', a') \right]$
  • Left-hand side: evaluation; right-hand side: target (fixed)
  • Short-term reward: $r = \sum_{m=1}^{M} \mathrm{reward}(e_m)$
  • Next state: $s'$; next action: $a' = f_{\theta^\pi}(s')$
• DeepPage
  • user’s real-time feedback
  • long-term reward
  • putting items in a page
[Figure: the Critic takes the user preference s (from hidden states $h_1, h_2$) and a page of items $a^{cur}_{val}$ = ($e_1, \ldots, e_M$, processed by a CNN) and outputs Q(s, a); the user returns the reward r]

Why Negative Feedback?
• What users may not like
  • Positive: click or purchase
  • Negative: skip or leave
• Advantage:
  • Avoiding bad recommendation cases
• Challenges
  • Negative feedback could bury the positive ones
  • May not be caused by users disliking them
  • Weak/wrong negative feedback can introduce noise

Novel DQN Architecture
• Intuition:
  • recommend an item that is similar to the clicked/ordered items (left part)
  • while dissimilar to the skipped items (right part)
• RNN with Gated Recurrent Units (GRU) to capture users’ sequential preference
[Figure: DQN with two inputs: recently clicked or ordered items (left) and recently skipped items (right)]

Weak or Wrong Negative Feedback
• Recommender systems often recommend items that belong to the same category (e.g., cell phones), while users click/order some of them and skip the others
• The partial order of user’s preference over these two items in category B
• At time 2, we name a5 as the competitor item of a2

Recommendation as MDP
• Environment: User Pool + News Pool
• Agent: Recommendation Algorithm
• State: Feature Representation for Users
• Action: Feature Representation for News
• Reward: User Feedback
  • click/skip labels
  • estimation of user activeness
[Figure: the Agent (a DQN with Explore and Memory components) sends Action 1, Action 2, ..., Action m to the Environment (User and News), which returns the State and the Reward (click / no click, user activeness)]

User Activeness Modelling
• Hazard function
• User activeness
[Figure: user activeness (0.0 to 1.0) plotted against time, with time points t1 to t10 marked]

Duelling Network Architecture
• State features: User features and Context features
• Action features: User-news features and Context features
• Value function V(s)
  • state features
• Advantage function A(s, a)
  • state features + action features
• Q-function = V(s) + A(s, a)
[Figure: User features, Context features, User-news features, and News features feed the V(s) and A(s, a) branches, which are combined into Q(s, a)]
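The decomposition Q(s, a) = V(s) + A(s, a) can be illustrated as follows. Linear maps stand in for the two sub-networks, which is an assumption made purely to keep the sketch short; `w_v` and `w_a` are hypothetical weights.

```python
import numpy as np


def dueling_q(state_feat, action_feats, w_v, w_a):
    """Q(s, a) = V(s) + A(s, a) for every candidate action.

    state_feat:   shape (d_s,)   state features
    action_feats: shape (m, d_a) one row of action features per candidate
    """
    v = float(w_v @ state_feat)  # V(s) uses state features only
    tiled_state = np.tile(state_feat, (len(action_feats), 1))
    state_action = np.hstack([tiled_state, action_feats])
    adv = state_action @ w_a     # A(s, a) uses state + action features
    return v + adv
```

The value term is shared by all candidates, so the ranking of actions for a given state is decided by the advantage term alone.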

Effective Exploration
• Random exploration
  • Harms the user experience in the short term
• Multi-armed Bandit
  • Large variance
  • Long time to converge
• Steps
  • Get recommendation lists from the current network $Q$ and the explore network $\tilde{Q}$
  • Probabilistically interleave these two lists
  • Get feedback from the user and compare the performance of the two networks
  • If $\tilde{Q}$ performs better, update $Q$ towards it
[Figure: the Current Network and the Explore Network each produce a list (e.g., A, B, C and C, D, B); the lists are probabilistically interleaved (e.g., A, C, D) and pushed to the user; feedback is collected, and the model choice is either to step towards the explore network or to keep the current one]
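The four steps can be sketched as below. This is a simplified illustration under stated assumptions: a fair coin decides which list supplies the next item, "performs better" is taken to mean "collected more clicks", and the step size `eta` is hypothetical. None of these details are specified on the slide.

```python
import random


def interleave(list_current, list_explore, k, rng):
    """Simplified probabilistic interleave: a coin flip picks which list supplies the next item."""
    pools = {"current": list(list_current), "explore": list(list_explore)}
    merged, source = [], []
    while len(merged) < k and (pools["current"] or pools["explore"]):
        name = rng.choice(["current", "explore"])
        if not pools[name]:  # chosen list is exhausted, fall back to the other one
            name = "explore" if name == "current" else "current"
        item = pools[name].pop(0)
        if item not in merged:  # skip items already placed by the other list
            merged.append(item)
            source.append(name)
    return merged, source


def update_towards(weights, explore_weights, clicks, source, eta=0.1):
    """Move the current weights towards the explore weights only if its items won more clicks."""
    wins = {"current": 0, "explore": 0}
    for name, clicked in zip(source, clicks):
        wins[name] += int(clicked)
    if wins["explore"] > wins["current"]:
        return [w + eta * (we - w) for w, we in zip(weights, explore_weights)]
    return list(weights)
```

Because the explore network only contributes part of each list, a poor exploratory model affects a fraction of the shown items rather than the whole page.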

Background
• Users sequentially interact with multiple scenarios
• Different scenarios have different objectives
  • Objective: matching user’s various preferences
  • Objective: comparing with the clicked item
[Figure: an Entrance Page with a skip action, shown alongside each objective]

Motivation
• Optimizing each recommender agent for each scenario
  • Ignoring sequential dependency
  • Missing information
  • Sub-optimal overall objective
[Figure: click and return transitions between the Entrance Page and the Item Detail Page]

Whole-Chain Recommendation
• Goal
  • Jointly optimizing multiple recommendation strategies
  • Maximizing the overall performance of the whole session
• Actor-Critic
  • Actor: recommender agent in one scenario
  • Critic: controlling actors
• Advantages
  • Agents are sequentially activated
  • Agents share the same memory
  • Agents work collaboratively

Optimization
$$
\begin{aligned}
y_t ={} & \Big[\, p^{s}_{m}(s_t, a_t) \cdot \gamma Q_{\mu}\big(s_{t+1}, \pi_m(s_{t+1})\big) \\
& + p^{c}_{m}(s_t, a_t) \cdot \big( r_t + \gamma Q_{\mu}(s_{t+1}, \pi_d(s_{t+1})) \big) \\
& + p^{l}_{m}(s_t, a_t) \cdot r_t \,\Big] \mathbb{1}_m \\
& + \Big[\, p^{c}_{d}(s_t, a_t) \cdot \big( r_t + \gamma Q_{\mu}(s_{t+1}, \pi_d(s_{t+1})) \big) \\
& + p^{s}_{d}(s_t, a_t) \cdot \gamma Q_{\mu}\big(s_{t+1}, \pi_m(s_{t+1})\big) \\
& + p^{l}_{d}(s_t, a_t) \cdot r_t \,\Big] \mathbb{1}_d
\end{aligned}
$$
• Entrance Page terms (indicator $\mathbb{1}_m$)
  • 1st row: skip behavior
  • 2nd row: click behavior
  • 3rd row: leave behavior
• Item Detail Page terms (indicator $\mathbb{1}_d$): click, skip, and leave behavior
[Figure: Actor$_m$ serves the Entrance Page and Actor$_d$ serves the Item Detail Page, connected by click, return, and skip transitions]
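The target $y_t$ has the same structure for both scenarios and differs only in the behavior probabilities that are plugged in, which a small sketch makes visible. The grouping of the click term as $p^c \cdot (r_t + \gamma Q)$ is an assumed reading of the slide, and the function name, argument names, and default `gamma` are hypothetical.

```python
def whole_chain_target(r_t, q_next_m, q_next_d, probs, gamma=0.95):
    """Target y_t for one transition in the currently active scenario.

    q_next_m: Q(s_{t+1}, pi_m(s_{t+1})), next action from the entrance-page actor
    q_next_d: Q(s_{t+1}, pi_d(s_{t+1})), next action from the item-detail-page actor
    probs:    skip / click / leave probabilities of the active scenario
    """
    skip = probs["skip"] * gamma * q_next_m            # skip: no reward, entrance actor acts next
    click = probs["click"] * (r_t + gamma * q_next_d)  # click: reward, detail actor acts next
    leave = probs["leave"] * r_t                       # leave: reward only, session ends
    return skip + click + leave
```

The indicators $\mathbb{1}_m$ and $\mathbb{1}_d$ then amount to choosing which scenario's probabilities are passed as `probs`.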

Why Model-based RL?
• Advantages
  • Reducing the amount of training data required
  • Performing accurate optimization of the Q-function
[Equation: the target $y_t$ from the Optimization slide is repeated here with the annotation “Model-based”]

Overall Model Architecture

Detailed Structure of MA-RDPG

Reinforcement Learning for Advertisements
• Goal: maximizing the advertising impression revenue from advertisers
  • Assigning the right ads to the right users at the right place
• Reinforcement learning for advertisements
  • Continuously updating the advertising strategies & maximizing the long-term revenue
[Figure: a page mixing normal recommendations with sponsored products (ads)]

Reinforcement Learning for Advertisements
• Challenges:
  • Different teams, goals and models → suboptimal overall performance
• Goal:
  • Jointly optimizing advertising revenue and user experience
  • KDD’2020, AAAI’2021
[Figure: Advertising Revenue vs. User Experience]

Reinforcement Learning Framework
• Two-level Deep Q-networks:
  • first-level: recommender system (RS)
  • second-level: advertising system (AS)
• State: rec/ads browsing history
• Action: $a_t = (a^{rs}_t, a^{as}_t)$
• Reward: $r_t(s_t, a^{rs}_t)$ and $r_t(s_t, a^{as}_t)$
• Transition: $s_t$ to $s_{t+1}$

Recommender System
• Goal: long-term user experience or engagement
• Challenge: combinatorial action space

Cascading DQN for RS
• Complexity: $O\!\left(\binom{N}{k}\right) \rightarrow O(kN)$
  • N: number of candidate items
  • k: length of rec-list
[Figure: the cascading DQN takes historical recs, historical ads, and context as input and outputs the rec items]
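The reduction to O(kN) comes from choosing the list one position at a time instead of scoring every possible list. A minimal sketch of that idea follows; `cascade_select` and the `q_value` callable are hypothetical stand-ins for the cascading Q-networks and are not taken from the talk.

```python
def cascade_select(candidates, k, q_value):
    """Build a rec-list of length k greedily, one position at a time.

    q_value(chosen, item) scores adding `item` given the already chosen prefix.
    Each step scores only the remaining candidates, so at most k * N scores in total.
    """
    chosen, remaining, evaluations = [], list(candidates), 0
    for _ in range(min(k, len(remaining))):
        scores = [q_value(tuple(chosen), item) for item in remaining]
        evaluations += len(remaining)
        best = remaining[scores.index(max(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen, evaluations
```

Because each position is conditioned on the items already picked, the scoring function can still reward complementary items, while the number of score evaluations stays linear in both k and N.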

Advertising System
• Goal:
  • maximize the advertising revenue
  • minimize the negative influence of ads on user experience
• Decisions:
  • interpolate an ad?
  • the optimal location
  • the optimal ad

Novel DQN for AS
• Three decisions:
  1. interpolate an ad?
  2. the optimal location
  3. the optimal ad
[Figure: the DQN takes historical recs, historical ads, context, the rec-list, and an ad item as input and outputs Decision 1, Decision 2, and Decision 3]

Systems Update
• Target User:
  • browses the mixed rec-ads list
  • provides her/his feedback

Advantage
• The first individual DQN architecture that can simultaneously evaluate the Q-values of multiple levels’ related actions
[Figure: (a) State → neural network → Q-value$_1$, ..., Q-value$_L$; (b) State and Action → neural network → Q-value; proposed architecture: state $s_t$ and action $a^{ad}_t$ → neural network → $Q(s_t, a^{ad}_t)^0, Q(s_t, a^{ad}_t)^1, \ldots$]

Real-time Feedback
• The most practical and precise way to obtain real-time feedback is an online A/B test
• Online A/B tests are inefficient and expensive
  • Taking several weeks to collect sufficient data
  • Requiring substantial engineering effort
  • Risking a bad user experience
[Figure: UserSim supplies real-time feedback to the System]

Overview
• Simulating users’ real-time feedback is challenging
  • Underlying distribution of item sequences is extremely complex
  • Data available to each user is rather limited
[Figure: UserSim architecture. Generator: an encoder-decoder maps the browsing history $i_1, i_2, \ldots, i_N$ to $G_\theta(s)$, with a supervised component that uses the real action a. Discriminator: an RNN over the browsing history plus MLPs and a softmax take the real action a or the fake $G_\theta(s)$ and output $l^R_1, \ldots, l^R_K, l^F_1, \ldots, l^F_K$, with a supervised component that uses the ground truth feedback]

Discriminator
• Outputs (softmax over 2K classes: feedback for real items $l^R_1, \ldots, l^R_K$ and feedback for fake items $l^F_1, \ldots, l^F_K$):
  • $D_\phi(s, a) = \sum_{k=1}^{K} p_{model}(l_k \mid s, a)$
  • $D_\phi(s, G_\theta(s)) = \sum_{k=K+1}^{2 \times K} p_{model}(l_k \mid s, G_\theta(s))$
• Unsupervised loss:
  • $L^{unsup}_D = -\left\{ \mathbb{E}_{s, a \sim p_{data}} \log D_\phi(s, a) + \mathbb{E}_{s \sim p_{data}} \log D_\phi(s, G_\theta(s)) \right\}$
• Supervised loss (ground truth feedback):
  • $L^{sup}_D = -\left\{ \mathbb{E}_{s, a, r \sim p_{data}} \left[ \log p_{model}(l_k \mid s, a, k \le K) \right] + \lambda \cdot \mathbb{E}_{s, r \sim p_{data}} \left[ \log p_{model}(l_k \mid s, G_\theta(s), K < k \le 2K) \right] \right\}$
• Overall loss:
  • $L_D = L^{unsup}_D + \alpha \cdot L^{sup}_D$
[Figure: the discriminator encodes the browsing history (items $e_1, \ldots, e_N$ with feedback $f_1, \ldots, f_N$, inputs $I_1, \ldots, I_N$, hidden states $h_1, \ldots, h_N$) into $P_D$, embeds the real a or fake $G_\theta(s)$ as $e_D$, and passes them through fully connected layers FC1, FC2, FC3 and a softmax]
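A single-sample version of the discriminator loss can be written directly from the formulas. The sketch assumes the softmax has 2K outputs ordered as K real-item feedback classes followed by K fake-item feedback classes, and that `feedback` is the zero-based index of the ground truth feedback class; the function names and default weights are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def discriminator_loss(logits_real, logits_fake, feedback, K, lam=1.0, alpha=1.0):
    """Single-sample estimate of L_D = L_unsup + alpha * L_sup.

    logits_real: 2K scores for the pair (s, a) with the real item a
    logits_fake: 2K scores for the pair (s, G(s)) with the generated item
    feedback:    ground truth feedback class index in [0, K)
    """
    p_real, p_fake = softmax(logits_real), softmax(logits_fake)
    d_real = p_real[:K].sum()       # D(s, a): mass on the K real-item classes
    d_fake = p_fake[K:2 * K].sum()  # D(s, G(s)): mass on the K fake-item classes
    unsup = -(np.log(d_real) + np.log(d_fake))
    sup = -(np.log(p_real[feedback]) + lam * np.log(p_fake[K + feedback]))
    return unsup + alpha * sup, d_real, d_fake
```

The unsupervised term only cares about the real-versus-fake split, while the supervised term additionally asks for the correct feedback class within each half.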

Generator
• Unsupervised loss:
  • $L^{unsup}_G = \mathbb{E}_{s \sim p_{data}} \left[ \log D_\phi(s, G_\theta(s)) \right]$
• Supervised loss (distance between the real item and the fake item):
  • $L^{sup}_G = \mathbb{E}_{s, a \sim p_{data}} \left\| a - G_\theta(s) \right\|_2^2$
• Overall loss:
  • $L_G = L^{unsup}_G + \beta \cdot L^{sup}_G$
[Figure: the generator’s encoder reads the browsing history (items $e_1, \ldots, e_N$ with feedback $f_1, \ldots, f_N$, inputs $I_1, \ldots, I_N$, hidden states $h_1, \ldots, h_N$) into $P_E$; the decoder (FC layers FC1, FC2 and an output layer) produces $G_\theta(s)$, which the supervised component compares with the real action a]
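For a single sample, the generator loss combines the discriminator's output on the generated item with the squared distance to the real item. In this sketch `d_fake` stands for $D_\phi(s, G_\theta(s))$ and is passed in as a number; the function name and the default `beta` are hypothetical.

```python
import numpy as np


def generator_loss(d_fake, action_real, action_fake, beta=1.0):
    """Single-sample estimate of L_G = L_unsup + beta * L_sup."""
    unsup = np.log(d_fake)                           # log D(s, G(s))
    sup = np.sum((action_real - action_fake) ** 2)   # squared L2 distance ||a - G(s)||^2
    return unsup + beta * sup
```

Minimizing the first term lowers the probability that the discriminator assigns the generated item to the fake classes, and the second term pulls the generated item towards the item the user actually saw.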

RL-based Recommender Training
• Metric: average reward of a session
• Baselines: Historical Logs, IRecGAN
• The recommender trained with UserSim converges to an avg_reward similar to that of the one trained on historical data
• The recommender trained with UserSim is much more stable than the one trained with IRecGAN

Other Simulators
• RecoGym @ Criteo
• Virtual-Taobao @ Alibaba
• RecSim @ Google
• GAN-PW @ Alibaba

Surveys
• Deep Reinforcement Learning for Search, Recommendation, and Online Advertising: A Survey (SIGWEB’2019)
  • Papers are grouped based on recommendation problems to solve
    • Exploitation/Exploration
    • Users’ Dynamic Preference Modeling
    • Long Term User Engagement
    • Slate Recommendation
• Reinforcement Learning based Recommender Systems: A Survey (Arxiv’2021)
  • Papers are grouped based on classic RL methodologies
    • Q-learning (DQN) Methods
    • REINFORCE (Policy Gradient) Methods
    • Actor-Critic Methods
    • Compound Methods

Surveys
• A Survey on Reinforcement Learning for Recommender Systems (Arxiv’2021)
  • Papers are grouped based on recommendation applications
    • Interactive Recommendation
    • Conversational Recommendation
    • Sequential Recommendation
    • Explainable Recommendation
• A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions
  • Papers are grouped based on RL methodologies
  • Component Optimization in Deep RL based RS, such as Environment Simulation, State Representation, Reward Functions
  • Emerging topics, such as Multi-Agent, Hierarchical, Inverse, GNN-based, Self-Supervised Deep RL

Conclusion
• Continuously updating the recommendation strategies during the interactions
• Maximizing the long-term reward from users
[Figure: a recommendation session unfolding over time steps t0, t1, t2, t3]
