Slide 1

Slide 1 text

Intro to Reinforcement Learning + Deep Q-Networks Robin Ranjit Singh Chauhan [email protected] Pathway Intelligence Inc

Slide 2

Slide 2 text

● Aanchan Mohan ○ for suggesting I do this, and organizing ● Reviewers ○ Valuable comments ● Other meetup organizers ● Hive ● Sutton+Barto, Berkeley, UCL, DeepMind, OpenAI ○ for publishing openly Props Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 2 “Many unpaid hours were sacrificed to bring us this information”

Slide 3

Slide 3 text

Why ● Aim for today ○ Deliver you hard-won insight on silver platter ○ Things I wish I had known ○ Curated best existing content + original content ● Exchange ○ There is a lot to know ○ I hope others present on RL topics ○ If you have serious interest in RL I would like to chat Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 3 “Many unpaid hours were sacrificed to bring us this information”

Slide 4

Slide 4 text

● Head of Engineering @ AgFunder ○ SF-based VC focussed on AgTech + FoodTech investing ○ Investments include companies doing Machine Learning in these spaces ■ ImpactVision, The Yield ● Pathway Intelligence ○ BC-based consulting company ● Past ○ Microsoft PM in Fintech Payment Fraud ○ Transportation ○ HPC for Environmental engineering ● Computer Engineering @ Waterloo Me 4 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 4

Slide 5

Slide 5 text

You ● Comfort levels ○ ML ○ RL ○ Experience? ● Interest areas ● Lots of slides! ○ Speed vs depth?

Slide 6

Slide 6 text

IRL: RL = trial and error + learning; trial and error = variation and selection, search (explore/exploit); Learning = Association + Memory - Sutton + Barto

Slide 7

Slide 7 text

Types of ML ● Unsupervised ● Supervised ● Reinforcement Learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 7

Slide 8

Slide 8 text

Unsupervised Learning vs Supervised Learning (+semi-supervised) vs Reinforcement Learning
Training Data: Collect training data | Collect training data | Agent creates data through exploration
Labels: None | Explicit label per example ** | Sparse, Delayed Reward -> Temporal Credit Assignment problem
Evaluation: Case-specific, can be subjective | Often Accuracy / Loss metrics per instance | Regret; Total Reward; Inherent vs Artificial Reward
Training / Fitting: Training set | Training set | Behaviour policy
Testing: Test set | Test set | Target policy
Exploration: n/a | n/a | Exploration strategy ** (typically part of Behavior Policy)
Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 8 Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 9

Slide 9 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 9 Image credit: Complementary roles of basal ganglia and cerebellum in learning and motor control, Kenji Doya, 2000

Slide 10

Slide 10 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 10 Yann LeCun, January 2017 Asilomar, Future of Life Institute

Slide 11

Slide 11 text

Related Fields Image credit: UCL MSc Course on RL, David Silver, University College London Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 11

Slide 12

Slide 12 text

● Know what a “good / bad result” looks like ○ Don’t want to/cannot specify how to get to it ● When you need Tactics + Strategy ○ Action, not just prediction ● Cases ○ Medical treatment ○ Complex robot control ○ Games ○ Dialog systems ○ Vehicle Control ** ○ More as RL and DL advances When to consider RL Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 12 Image credit: (internet)...

Slide 13

Slide 13 text

● Simulatable (or have large existing data) ○ Else: training IRL usually infeasible ** ● Vast state spaces require exploration ○ Else: enumerate + plan ● Dependencies across time ○ Delayed reward ○ Else: supervised ● Avoid RL unless needed ○ Complicated ○ Can be data-hungry ○ Explainability? (Depends on fn approx) ○ Maturity? When to consider RL Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 13 Image credit: (internet)...

Slide 14

Slide 14 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 14 Image credit: OpenAI https://blog.openai.com/ai-and-compute

Slide 15

Slide 15 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 15 Image credit: OpenAI https://blog.openai.com/ai-and-compute 1 day on world's fastest supercomputer (peak perf) 1 day on NVIDIA DGX-2: 16 Volta GPUs $400k HPC stats from top500.org

Slide 16

Slide 16 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 16 Image credit: OpenAI https://blog.openai.com/ai-and-compute 1 day on Sunway TaihuLight peak perf 1 day on NVIDIA DGX-2: 16 Volta GPUs $400k HPC stats from top500.org 4 of the 5 most data-hungry AI training runs are RL 1 day on Intel Core i9 Extreme ($2k chip, CPU perf only) 1 day on DOE Summit w/GPUs

Slide 17

Slide 17 text

Hype vs Reality ● Behind many recent AI milestones ● Better than human perf ● “Scared of AI” == Scared of RL ○ Jobs ○ Killing / Enslaving ○ Paperclips ○ AGI ○ Sentience 17 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 17 ● Few applications so far ● Slow learning ● Practical for robots? ● Progressing quickly

Slide 18

Slide 18 text

"RL + DL = general intelligence" David Silver Google DeepMind ICML 2016

Slide 19

Slide 19 text

“I think reinforcement learning is one class of technology where the PR excitement is vastly disproportional relative to the ... actual deployments today” Andrew Ng Chief Scientist of Baidu EmTech Nov 2017 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 19

Slide 20

Slide 20 text

● Methods ○ RL Algorithms ● Approximators ○ Deep Learning models in general ○ RL-specific DL techniques ● Gear ○ GPU ○ TPU, Other custom silicon ● Data ○ Sensors + Sensor data ● All of these are on fire ○ Safe to expect non-linear advancement in RL RL Trajectory Dependencies Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 20 Image credit: (internet)

Slide 21

Slide 21 text

● DeepMind (Google) ● OpenAI ● UAlberta ● Google Brain ● Berkeley, CMU, Oxford ● Many more... Who Does RL Research? Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 21

Slide 22

Slide 22 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 22

Slide 23

Slide 23 text

● Safety ● Animal rights ● Unemployment ● Civil Liberties ● Peace + Conflict ● Power Centralization RL+AI Ethics Dimensions Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 23 Consider donating to organizations dedicated to protecting values you cherish

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Reinforcement Learning ● learning to decide + act over time ● often online learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 25 Image credit: Reinforcement Learning: An Introduction, Sutton and Barto

Slide 26

Slide 26 text

● Sequential Task ○ Pick one of K arms ○ Each has its own fixed, unknown, (stochastic?) reward distribution ● Goal ○ Maximize reward ● Challenge ○ Explore vs Exploit ○ Either alone is not optimal ○ Supervised learning alone cannot solve: does not explore (Stochastic Multi-armed) Bandit Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 26 Image credit: Microsoft Research Image credit: https://github.com/CamDavidsonPilon
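A toy sketch of that explore/exploit loop for a K-armed bandit with epsilon-greedy action selection (all names and constants here are invented for illustration, not taken from the slides):

import numpy as np

np.random.seed(0)
K = 4                                    # number of arms
true_means = np.random.normal(0, 1, K)   # unknown to the agent
q_estimates = np.zeros(K)                # running estimate of each arm's value
counts = np.zeros(K)
EPSILON = 0.1

for t in range(10000):
    if np.random.rand() < EPSILON:
        a = np.random.randint(K)         # explore: random arm
    else:
        a = int(np.argmax(q_estimates))  # exploit: best arm so far
    reward = np.random.normal(true_means[a], 1.0)
    counts[a] += 1
    # incremental sample-average update of the estimate
    q_estimates[a] += (reward - q_estimates[a]) / counts[a]

print(q_estimates, true_means)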

Slide 27

Slide 27 text

Contextual Multi-armed Bandit ● Rewards depend on Context ● Context independent of action (Diagram: arms with reward distributions F_a1, F_a2, F_a3, F_a4 at Edgewater Casino (Context a), and F_b1, F_b2, F_b3, F_b4 at Hard Rock Casino (Context b)) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 27

Slide 28

Slide 28 text

Reinforcement Learning Image credit: CMU Graduate AI course slides ● Context change depends on action ● Learn an MDP from experience only ● Game setting ○ Experiences effects of rules (wins/loss/tie) ○ Does not “know” rules Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 28

Slide 29

Slide 29 text

Markov Chains ● Markov property: the current state captures all relevant history ● Transitions ○ Probability ○ Destination Image credit: Wikipedia Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 29

Slide 30

Slide 30 text

Markov Decision Process (MDP) ● Markov Chains ○ States linked w/o history ● Actions ○ Choice ● Rewards ○ Motivation ● Variants ○ Bandit = MDP with single state! ○ MC + Rewards = MRP ○ Partially observed (POMDP) ○ Semi-MDP Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 30 Image credit: Wikipedia Q: where will you often find the MDP in RL codebase? Non-MB vs MB

Slide 31

Slide 31 text

MDP and Friends Image credit: Aaron Schumacher, planspace.org Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 31

Slide 32

Slide 32 text

Reward Signal ● Reward drives learning ○ Details of reward signal often critical ● Too sparse ○ complete learning failure ● Too generous ○ optimization limited ● Problem specific Image credit: Wikipedia Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 32

Slide 33

Slide 33 text

Montezuma’s Actual Revenge Chart credit: Schaul et al, Prioritized Experience Replay, DeepMind Feb 2016 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 33

Slide 34

Slide 34 text

Broad Applicability ● Environment / Episodes ○ Finite length / Endless ● Action space ○ Discrete / Continuous ○ Few / Vast ● State space ○ Discrete / Continuous ○ Tree / Graph / Cyclic ○ Deterministic / Stochastic ○ Partially / Fully observed ● Reward signals ○ Deterministic / Stochastic ○ Continuous / Sparse ○ Immediate / Delayed 34 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 34 Image credit: Wikipedia

Slide 35

Slide 35 text

Types of RL ● Value Based ○ state-value fn V(s) ○ state-action value fn Q(s,a) ○ action-advantage fn A(s,a) ● Policy Based ○ Directly construct π*(s) ● Model Based ○ Learn model of environment ○ Plan using model ● Hybrids Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 35 Image credit: UCL MSc Course on RL, David Silver, University College London

Slide 36

Slide 36 text

Reinforcement Learning vs Planning / Search ● Planning and RL can be combined
Goal: Planning / Search = Improved policy | Reinforcement Learning = Improved policy
Method: Planning / Search = Computing on a known model | Reinforcement Learning = Interacting with an unknown environment
State Space Model: Planning / Search = Known | Reinforcement Learning = Unknown
Algos: Planning / Search = Heuristic state-space search, Dynamic programming | Reinforcement Learning = Q-Learning, Monte Carlo rollouts
Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 36 Content paraphrased from: UCL MSc Course on RL, David Silver, University College London

Slide 37

Slide 37 text

RL Rollouts vs Planning rollouts Image credit: Aaron Schumacher, planspace.org

Slide 38

Slide 38 text

Elementary approaches ● Monte Carlo RL (MC) ○ Value = mean return of multiple runs ● Value Iteration + Policy Iteration ○ Both require enumerating all states ○ Both require knowing the transition model T(s, a, s′) ● Dynamic Programming (DP) ○ Value = value of next state + reward in this state ○ Iteration propagates reward from terminal state back to beginning Images credit: Reinforcement Learning: An Introduction, Sutton and Barto Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 38

Slide 39

Slide 39 text

Elementary approaches: Value Iteration Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 39 Image credit: Pieter Abbeel UC Berkeley EECS
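The value-iteration sweep in that figure, as a toy sketch; the three-state MDP, its transition dict T and reward dict R below are invented purely for illustration:

GAMMA = 0.9
states = [0, 1, 2]
actions = [0, 1]
# T[(s, a)] = list of (probability, next_state); R[(s, a)] = expected immediate reward
T = {(s, a): [(1.0, min(s + a, 2))] for s in states for a in actions}
R = {(s, a): (1.0 if min(s + a, 2) == 2 else 0.0) for s in states for a in actions}

V = {s: 0.0 for s in states}
for _ in range(100):   # sweep until (approximately) converged
    # Bellman backup: V(s) = max_a sum_s' P(s'|s,a) [ R(s,a) + gamma * V(s') ]
    V = {s: max(sum(p * (R[(s, a)] + GAMMA * V[s2]) for p, s2 in T[(s, a)])
                for a in actions)
         for s in states}
print(V)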

Slide 40

Slide 40 text

Elementary approaches: Policy Iteration Image credit: Pascal Poupart CS886 University of Waterloo Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 40

Slide 41

Slide 41 text

RL Algo Zoo ● Discrete / Continuous ● Model-based / Model-Free ● On- / Off-policy ● Derivative-based / not Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 41

Slide 42

Slide 42 text

RL Algo Zoo ● Discrete / Continuous ● Model-based / Model-Free ● On- / Off-policy ● Derivative-based / not ● Memory, Imagination ● Imitation, Inverse ● Hierarchical ● Mastery / Generalization Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 42

Slide 43

Slide 43 text

RL Algo Zoo ● Discrete / Continuous ● Model-based / Model-Free ● On- / Off-policy ● Derivative-based / not ● Memory, Imagination ● Imitation, Inverse ● Hierarchical ● Mastery / Generalization ● Scalability ● Sample efficiency? Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp , plus additions in red by Robin Chauhan GAE V-trace (Impala) Dyna-Q family AlphaGo AlphaZero MPPI MMC PAL HER GPS+DDP Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 43 Sarsa Distro-DQN TD Search DDO FRL MAXQ Options UNREAL HAM OptCrit hDQN iLQR MPC Pri-Sweep ReinfPlan NAC ACER A0C Rainbow MERLIN DQRN (POMDP)

Slide 44

Slide 44 text

Image credit: Sergey Levine via Chelsea Finn and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 44

Slide 45

Slide 45 text

● Wide task variety ○ Toy tasks ○ Continuous + Discrete ○ 2D, 3D, Text, Atari ● Common API for env + agent ○ Compare algos ● Similar ○ OpenAI’s Retro: Genesis, Atari arcade ○ DeepMind’s Lab: Quake-based 3D env ○ Microsoft’s Malmo: Minecraft ○ Facebook’s CommAI: Text comms ○ Poznan University, Poland: VizDoom OpenAI gym Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 45 Image credit: Open AI gym

Slide 46

Slide 46 text

OpenAI gym

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 46 Sample code from https://gym.openai.com/docs/

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

StarCraft II Learning Environment

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 51

Slide 52

Slide 52 text

RL IRL ● Most results in hermetic envs ○ Board games ○ Computer games ○ Simulatable robot controllers ● Sim != Reality ● Model-based : Sample efficiency ++ ○ But: Model errors accumulate ● Techniques to account for model errors ● Theme: Bridge Sim -> Reality Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 52

Slide 53

Slide 53 text

RL IRL: Robotics ● Simple IRL manipulations hard for present day RL Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 53 Image credit: Chelsea Finn Image credit: Sergey Levine Image credit: Google Research

Slide 54

Slide 54 text

Q Learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 54

Slide 55

Slide 55 text

Q (s,a)

Slide 56

Slide 56 text

Types of RL ● Policy Based ○ Directly construct π*(s) ● Value Based ○ State-action value fn Q*(s,a) ○ State-value fn V*(s) ○ State-action advantage fn A*(s,a) ● Model Based ○ Learn model of environment ○ Plan using model ● Hybrids Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 56 Image credit: UCL MSc Course on RL, David Silver, University College London

Slide 57

Slide 57 text

Reinforcement Learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 57 Image credit: Reinforcement Learning: An Introduction, Sutton and Barto

Slide 58

Slide 58 text

Reinforcement Learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 58 Image credit: Reinforcement Learning: An Introduction, Sutton and Barto

Slide 59

Slide 59 text

Types of RL ● Policy Based ○ Directly construct π*(s) ● Value Based ○ State-action value fn Q*(s,a) ○ State-value fn V*(s) ○ State-action advantage fn A*(s,a) ● Model Based ○ Learn model of environment ○ Plan using model ● Hybrids Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 59 Image credit: UCL MSc Course on RL, David Silver, University College London

Slide 60

Slide 60 text

Name / Notation / Intuition / Where Used:
● Policy, π(s): What action do we take in state s? (π* is optimal). Where used: Policy-based methods (but all RL methods have some kind of policy)
● State value function, V^π(s): How good is state s? (using policy π). Where used: Value-based methods
● State-action value function, Q^π(s,a): In state s, how good is action a? (using policy π). Where used: Q-Learning, DDPG
● Advantage function, A^π(s,a) = Q^π(s,a) - V^π(s): In state s, how much better is action a than the “overall” V^π(s)? (using policy π). Where used: Dueling DQN, Advantage Actor-Critic, A3C
● Transition prediction function, P(s′,r|s,a): In state s, if I take action a, what is the expected next state and reward? Where used: Model-based RL
● Reward prediction function, R(s,a): In state s, if I take action a, what is the expected reward? Where used: Model-based RL
Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 60

Slide 61

Slide 61 text

Q-Learning ● From s, which a is best? Q( state, action ) = E[ Σ r ] (expected return) ● Q implies a policy: π*(s) = argmax_a Q*(s, a) ● Use Temporal Difference (TD) Learning to find Q for each (s, a) ○ Introduced by Watkins in 1989 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 61

Slide 62

Slide 62 text

Q-Learning properties Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 62 ● Discrete, finite action spaces ○ stochastic env ○ changing envs (unlike Go) ● Model-free RL ○ Naive about action effects ● TD(0) ○ Propagates reward only 1 step back per update

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Intuition: Q-function Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 64 Image credit: (internet) Image credit: AlphaXos, Pathway Intelligence Inc.

Slide 65

Slide 65 text

Image credit: Vlad Mnih, Deepmind at Deep RL Bootcamp, Berkeley Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 65

Slide 66

Slide 66 text

Image credit: AlphaXos, Pathway Intelligence Inc. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 66

Slide 67

Slide 67 text

Temporal Difference (TD) Learning ● Predict future values ○ Incremental ● Many Variants ○ TD(0) vs TD(1) vs TD(λ) ○ n-step ● Not new ○ From Witten 1977, Sutton and Barto 1981 ● Here we use it to predict expected reward Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 67

Slide 68

Slide 68 text

Intuition: TD learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 68 Image credit: (internet)... Image credit: Author

Slide 69

Slide 69 text

Intuition: TD Learning and MDP Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 69 Image credit: (internet)...

Slide 70

Slide 70 text

TD vs other Basic Methods Images credit: Reinforcement Learning: An Introduction, Sutton and Barto Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 70

Slide 71

Slide 71 text

Images credit: Reinforcement Learning: An Introduction, Sutton and Barto Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 71

Slide 72

Slide 72 text

Bellman’s Principle of Optimality An optimal policy has the property that: ● whatever the initial state and initial decision are, ● the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision (See Bellman, 1957, Chap. III.3.) => optimal path is made up of optimal sub-paths Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 72

Slide 73

Slide 73 text

Bellman Equation for Q-learning Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 73 Image credit: Robin Chauhan, Pathway Intelligence Inc.
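Since the equation itself lives in the slide image, here is the standard form it refers to (written in the usual Q-learning notation, not copied from the slide): the Bellman optimality equation Q*(s,a) = E[ r + γ · max_a′ Q*(s′,a′) | s, a ], and the TD(0) / Q-learning update built from it, Q(s,a) ← Q(s,a) + α [ r + γ · max_a′ Q(s′,a′) - Q(s,a) ].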

Slide 74

Slide 74 text

TD(0) updates / Backups / Bellman Updates Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 74 Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 75

Slide 75 text

Q-Learning (non-deep) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 75 Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
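The boxed algorithm from Sutton and Barto maps to a few lines of code. A minimal tabular sketch, assuming an old-style gym env with small discrete observation and action spaces (hyperparameters and the FrozenLake usage line are illustrative):

import numpy as np

def tabular_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table over (state, action); only workable for small discrete spaces
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # TD(0) / Bellman backup toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

# e.g. Q = tabular_q_learning(gym.make('FrozenLake-v0'))   # hypothetical usage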

Slide 76

Slide 76 text

Q-Learning Policies ● Policy: course of action ● Greedy policy ○ pick action w/ max Q ● ε-Greedy policy ○ ε: Explore: random action ○ 1-ε: Exploit: action w/ max Q Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 76 Image credit: (internet…)

Slide 77

Slide 77 text

Q-Learning Policies: Behavior vs Target ● “Behaviour policy”: ϵ-Greedy ○ Discuss: Why not purely random? ● “Target policy”: Greedy ○ Discuss: Is 100% greedy always best? Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 77

Slide 78

Slide 78 text

Q-Learning Policies: Exploration ● ε-Greedy Alternatives ○ Sample based on action value (Softmax / “Boltzmann” from physics); Noise + Greedy ○ Bandit methods: Optimistic, Pessimistic, UCB, Thompson, Bayesian, … ■ But they don’t directly account for uncertainty in future MDP return ○ Rmax … ○ Theoretically optimal exploration expensive ■ Explicitly represent information in MDP: Bayes-adaptive MDP (+ Monte Carlo) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 78
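A sketch of the softmax / Boltzmann alternative mentioned above (the temperature `tau` is an illustrative knob; lower tau behaves more greedily):

import numpy as np

def boltzmann_action(q_values, tau=1.0):
    # sample actions in proportion to exp(Q/tau): high-value actions are more likely,
    # but low-value actions still get explored occasionally
    prefs = np.exp((q_values - np.max(q_values)) / tau)   # subtract max for numerical stability
    probs = prefs / prefs.sum()
    return np.random.choice(len(q_values), p=probs)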

Slide 79

Slide 79 text

On-policy / Off-policy ● On-policy learning ○ Learn on the job ○ Behaviour policy == Target policy ○ Learn effects of exploration ● Off-policy learning ○ Look over someone’s shoulder ○ Behaviour policy != Target policy ○ Ignore effects of exploration ● Q-Learning = off-policy ○ ϵ-Greedy != Greedy ○ On-policy variant: “Sarsa” Paraphrased from: Reinforcement Learning: An Introduction, Sutton and Barto Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 79

Slide 80

Slide 80 text

Guarantees ● Q learning has some theoretical guarantees if ○ Infinite visitation of actions, states ○ Learning rate schedule within a goldilocks zone ● Requirements can be soft in practice Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 80

Slide 81

Slide 81 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 81 Images credit: Shreyas Skandan

Slide 82

Slide 82 text

Images credit: Yan Duan, Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 82

Slide 83

Slide 83 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 83 Images credit: Yan Duan, Berkeley Deep RL Bootcamp

Slide 84

Slide 84 text

Q-Learning methods ● Tabular ● FQI: Fitted Q Iteration ○ Batch mode: learn after batch ○ Fitting of regression, tree or (any) other fn approx for Q fn ● DQN ○ “Make Q-learning look like supervised (deep) learning” ○ Training deep net approx for Q fn ● Bayesian ○ Update Q fn priors Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 84 Image credit: (internet…)

Slide 85

Slide 85 text

Deep Q-Networks and Friends Robin Ranjit Singh Chauhan [email protected] Pathway Intelligence Inc

Slide 86

Slide 86 text

RL Algo Zoo ● Discrete / Continuous ● Model-based / Model-Free ● On- / Off-policy ● Derivative-based / not ● Memory, Imagination ● Imitation, Inverse ● Hierarchical ● Mastery / Generalization ● Scalability ● Sample efficiency? Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp , plus additions in red by Robin Chauhan GAE V-trace (Impala) Dyna-Q family AlphaGo AlphaZero MPPI MMC PAL HER GPS+DDP Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 86 Sarsa Distro-DQN TD Search DDO FRL MAXQ Options UNREAL HAM OptCrit hDQN iLQR MPC Pri-Sweep ReinfPlan NAC ACER A0C Rainbow MERLIN DQRN (POMDP)

Slide 87

Slide 87 text

RL + DL = GI ● Single agent for any human level task ● RL defines objective ● DL gives the mechanism Above text paraphrased from: Tutorial on Deep Reinforcement Learning David Silver ICML 2016 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 87

Slide 88

Slide 88 text

Arcade Learning Environment (ALE) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 88 Image credit: Marc Bellemare

Slide 89

Slide 89 text

Deep Q Networks ● Key insight: Use Deep Learning for Q function approximator Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015
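A sketch of such a Q-function approximator in Keras, using the layer sizes reported in the 2015 paper (the code itself is a reconstruction for illustration, not the authors' code):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def build_q_network(n_actions, input_shape=(84, 84, 4)):
    # Q(s, ·): stacked preprocessed frames in, one Q-value per discrete action out
    model = Sequential([
        Conv2D(32, 8, strides=4, activation='relu', input_shape=input_shape),
        Conv2D(64, 4, strides=2, activation='relu'),
        Conv2D(64, 3, strides=1, activation='relu'),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(n_actions, activation='linear'),   # one output head per action
    ])
    return model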

Slide 90

Slide 90 text

Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 90

Slide 91

Slide 91 text

Imma let you finish, but... ….are Atari games really that hard? (What was astonishing here?) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 91

Slide 92

Slide 92 text

92 Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015 Hyperparameters! Hyperparameters! Hyperparameters! Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 92

Slide 93

Slide 93 text

K: Kolmogorov complexity; more weight on simpler tasks Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 93

Slide 94

Slide 94 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 94 Image credit: OpenAI https://blog.openai.com/ai-and-compute 1 day on Sunway TaihuLight peak perf 1 day on NVIDIA DGX-2: 16 Volta GPUs $400k HPC stats from top500.org 4 of the 5 most data-hungry AI training runs are RL 1 day on Intel Core i9 Extreme ($2k chip, CPU perf only) 1 day on DOE Summit w/GPUs

Slide 95

Slide 95 text

DQN Innovations Q-Learning: based on TD learning + Deep Learning : Q function approximation + Experience Replay: stabilizes learning (Lin 93) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 95
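A minimal sketch of the experience-replay piece, assuming a uniform-sampling buffer (class and parameter names here are mine, not from the paper):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random mini-batch; breaks correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)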

Slide 96

Slide 96 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 96 DQN Setup: Training Agent DQN Training Algo XRP Memory Agent network “Behaviour”/ Exploration Policy Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 97

Slide 97 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 97 DQN Setup: Test Agent Agent network Exploration Policy Agent Agent network Target Policy argmax Q(s,a) Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 98

Slide 98 text

DQN+XRP Algo: Intuition ● Starting policy ○ try random actions (Epsilon-Greedy Policy) ○ eventually get some reward (Environment) ● Remember each transition ○ space-limited memory (XRP memory) ● Train in little bits as we go ○ Train on a few random memories (from XRP memory) ○ Stretch reward backwards in time for remembered states (TD learning) ○ Train to learn the stretched future reward from remembered states (Deep Learning) ○ Train to generalize over similar states (Deep Learning) ● Final policy ○ always choose the action the Q model says will reward best from the current state (Greedy policy) Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 98 Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 99

Slide 99 text

● Reward Causality: TD Learning ○ temporal credit assignment for Q values ○ stretch rewards back in time; provides the learning objective ● Value Estimator: Q function ○ Q( state, action ) -> future reward ○ Compute Q for each a, pick best a ● Value Estimator Generalization: Deep Learning ○ Q function is a FF NN ○ generalization: predict Q for unseen states ● Memory: Experience replay ○ Improve learning convergence ○ Remember some transitions: ( state, action, next state, reward ) ○ Q is trained on random memory mini-batches, not just live experience DQN+XRP: Components Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 99 Image credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 100

Slide 100 text

DQN+XRP Algo: Informal ● Forward step ○ Agent gets State from Env ○ Env gets Action from Agent ■ Greedy: action w/ max Q ■ Epsilon greedy: random ○ Env gives Reward to Agent ● Backup step ○ Mem stores ( state, action, next state, reward ) ○ Train Q network ** ■ Sample random mini-batch from mem ■ Update target Q values** for mini-batch ■ Mini-batch gradient descent Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 100 Image credit: Robin Chauhan, Pathway Intelligence Inc.
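A sketch of the backup step under those assumptions, with a uniform replay buffer and a compiled Keras Q-model (function and variable names are illustrative, not from the DQN paper):

import numpy as np

def train_on_minibatch(q_model, replay, batch_size=32, gamma=0.99):
    batch = replay.sample(batch_size)
    states      = np.array([t[0] for t in batch])
    actions     = np.array([t[1] for t in batch])
    rewards     = np.array([t[2] for t in batch])
    next_states = np.array([t[3] for t in batch])
    dones       = np.array([t[4] for t in batch], dtype=float)

    # TD(0) targets: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal states
    next_q = q_model.predict(next_states)
    targets = q_model.predict(states)               # start from current predictions...
    targets[np.arange(batch_size), actions] = (
        rewards + (1.0 - dones) * gamma * next_q.max(axis=1)
    )                                               # ...and overwrite only the taken action
    q_model.fit(states, targets, verbose=0)         # one gradient step on the mini-batch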

Slide 101

Slide 101 text

DQN+XRP Algo: Formal Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 101 Algorithm 1 credit: Playing Atari with Deep Reinforcement Learning, Mnih et al 2013 Diagram credit: Robin Chauhan, Pathway Intelligence Inc.

Slide 102

Slide 102 text

Algorithm 1 credit: Human-level control through deep reinforcement Learning, Mnih et al 2015 Diagram credit: Robin Chauhan, Pathway Intelligence Inc. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 102

Slide 103

Slide 103 text

Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015

Slide 104

Slide 104 text

DQN Limits ● Discrete actions only ○ Continuous variant: NAF ● TD(0) slow; needs lots of samples ○ n-step DQN, TD(λ) ● XRP can consume lots of memory ● Epsilon-Greedy exploration weak ○ No systematic exploration; ignores repetition ● Model-Free is dumb ○ No learning of the MDP -> no planning possible ○ Eg. Dyna-Q variants; not always feasible Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 104

Slide 105

Slide 105 text

DQN Variants Image credit: Justesen et al, Deep Learning for Video Game Playing 105 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 105

Slide 106

Slide 106 text

Detail: DDQN Fixed Target Network

Slide 107

Slide 107 text

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 107 DDQN Setup: Training Agent DQN Training Algo XRP Memory Agent network Exploration Policy Image credit: Robin Chauhan, Pathway Intelligence Inc.
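A sketch of the fixed target network and the Double-DQN flavour of the target it enables, assuming two Keras models with identical architecture (the sync frequency and all names here are illustrative):

import numpy as np
from keras.models import clone_model

def make_target_network(q_model):
    # frozen copy of the online network; only updated by periodic syncs
    target = clone_model(q_model)
    target.set_weights(q_model.get_weights())
    return target

def sync_target(q_model, target_model):
    target_model.set_weights(q_model.get_weights())   # e.g. every few thousand steps

def double_dqn_targets(q_model, target_model, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online net picks the argmax action, the target net evaluates it
    best_actions = np.argmax(q_model.predict(next_states), axis=1)
    next_q = target_model.predict(next_states)
    evaluated = next_q[np.arange(len(best_actions)), best_actions]
    return rewards + (1.0 - dones) * gamma * evaluated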

Slide 108

Slide 108 text

Prioritized Replay ● Q network outputs less accurate for some states ● Focus learning on those ● Sample minibatch memories based on TD error XRP Memory Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 108
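A sketch of proportional prioritization as described above, using a plain list scan instead of the paper's sum-tree and omitting the importance-sampling correction (names and constants are illustrative):

import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=50000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha   # alpha controls how strongly TD error skews sampling
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size=32):
        probs = np.array(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return idx, [self.data[i] for i in idx]    # keep idx so priorities can be updated later

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha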

Slide 109

Slide 109 text

Distributional DQN Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 109 Image: A Distributional Perspective on Reinforcement Learning, Bellemare et al 2017

Slide 110

Slide 110 text

Distributional DQN Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 110 Image: A Distributional Perspective on Reinforcement Learning, Bellemare et al 2017

Slide 111

Slide 111 text

Dueling DQN Increases stability Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 111 Image: Dueling Network Architectures for Deep Reinforcement Learning, Wang et al 2016
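A sketch of the dueling head in the Keras functional API: separate V(s) and A(s,a) streams recombined as Q = V + A - mean(A), the identifiability trick from the paper (layer sizes are illustrative):

from keras.layers import Input, Dense, Lambda
from keras.models import Model
import keras.backend as K

def build_dueling_head(n_actions, state_dim):
    inp = Input(shape=(state_dim,))
    hidden = Dense(64, activation='relu')(inp)
    value = Dense(1)(Dense(32, activation='relu')(hidden))                 # V(s) stream
    advantage = Dense(n_actions)(Dense(32, activation='relu')(hidden))     # A(s,a) stream
    # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
    q = Lambda(lambda va: va[0] + va[1] - K.mean(va[1], axis=1, keepdims=True))([value, advantage])
    return Model(inputs=inp, outputs=q)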

Slide 112

Slide 112 text

Noisy Nets ● More efficient exploration than epsilon-greedy ● Parametric noise added to weights ● Parameters of the noise learned with gradient descent along with the remaining network weights Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 112 Image: Noisy Networks for Exploration, Fortunato et al, 2017

Slide 113

Slide 113 text

N-step DQN ● TD(0) updates only 1 timestep Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 113 Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
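A sketch of the n-step target that replaces the 1-step TD(0) target; it assumes `rewards` holds the next few rewards along the trajectory and `bootstrap_value` is max_a Q(s_{t+n}, a), or 0 if the episode terminated inside the window (names and defaults are illustrative):

def n_step_target(rewards, bootstrap_value, gamma=0.99, n=3):
    # n-step TD target: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * bootstrap_value
    target = 0.0
    for k, r in enumerate(rewards[:n]):
        target += (gamma ** k) * r
    target += (gamma ** min(n, len(rewards))) * bootstrap_value
    return target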

Slide 114

Slide 114 text

Rainbow DQN Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 114 Image: Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al 2017

Slide 115

Slide 115 text

OpenAI gym: Baseline DQN

import gym
from baselines import deepq

def main():
    env = gym.make("CartPole-v0")
    model = deepq.models.mlp([64])
    act = deepq.learn(
        env,
        q_func=model,
        lr=1e-3,
        max_timesteps=100000,
        buffer_size=50000,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,)
    act.save("cartpole_model.pkl")

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 115 Sample code from https://gym.openai.com/docs/

Slide 116

Slide 116 text

keras-rl Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 116 ● Keras-based :) RL algos for OpenAI gym-type envs ○ https://github.com/keras-rl/keras-rl pip install keras-rl ● But: Development stalled :( ○ Needs a lot of love; But seems new maintainer has been found ● Worthy competitors ○ TensorForce ○ OpenAI Baseline implementations ○ Rllab , tensorflow/agents , Ray RLlib, anyrl (go) , tensorflow-rl , ShangtongZhang/DeepRL , BlueWhale, SLM-Lab, pytorch-a2c-ppo-acktr , dennybritz/reinforcement-learning

Slide 117

Slide 117 text

keras-rl

# The full example also imports (at the top of the file):
#   import gym
#   from keras.models import Sequential
#   from keras.layers import Dense, Activation, Flatten

# Get the environment and extract the number of actions.
env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 117 Sample code from keras-rl https://github.com/keras-rl/keras-rl/blob/master/examples/dqn_cartpole.py

Slide 118

Slide 118 text

keras-rl

# Continued from the previous slide. The full example also imports:
#   from keras.optimizers import Adam
#   from rl.agents.dqn import DQNAgent
#   from rl.policy import BoltzmannQPolicy
#   from rl.memory import SequentialMemory
# and defines ENV_NAME = 'CartPole-v0'.

memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=100, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)
dqn.test(env, nb_episodes=5, visualize=True)

Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 118 Sample code from keras-rl https://github.com/keras-rl/keras-rl/blob/master/examples/dqn_cartpole.py

Slide 119

Slide 119 text

People from first DQN Atari paper ● Volodymyr Mnih ○ PhD at UofT under Hinton ○ Masters at UofA under Csaba Szepesvari (also at DeepMind) ● David Silver ○ Met Demis Hassabis at Cambridge ○ CTO of Elixir Studios (game company) ○ PhD @ UofA in RL ○ AlphaGo ● Koray Kavukcuoglu ○ NEC Labs ML ○ PhD: w/ Yann LeCun, NYU ○ Aerospace ● Alex Graves ○ UofT under Hinton ● Ioannis Antonoglou ○ Masters at U of Edinburgh in ML/AI ● Daan Wierstra ○ Masters at Utrecht U (NL) ● Martin Riedmiller ○ Prof @ U of Freiburg (Germany) ○ PhD on neural controllers at U of Karlsruhe Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 119

Slide 120

Slide 120 text

"literally nothing is solved yet" Volodymyr Mnih Google DeepMind August 2017 Berkeley Deep RL Bootcamp

Slide 121

Slide 121 text

● Self-Play, Multi-Agent Reinforcement Learning, AlphaZero + AlphaXos ● RL in Medicine ● Actor-Critic methods ● Selected RL papers 2017-2018 ● Artificial General Intelligence Possible Next Talks Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 121

Slide 122

Slide 122 text

Questions Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 122

Slide 123

Slide 123 text

● reinforcement learning: an introduction by sutton and barto http://incompleteideas.net/book/bookdraft2018jan1.pdf ● david silver's RL course: https://www.youtube.com/watch?v=2pWv7GOvuf0 ● Berkeley Deep RL bootcamp: https://sites.google.com/view/deep-rl-bootcamp/lectures ● openai gym: https://gym.openai.com/ ● arxiv.org Resources Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 123

Slide 124

Slide 124 text

Thank you! Robin Ranjit Singh Chauhan [email protected] https://github.com/pathway https://ca.linkedin.com/in/robinc https://pathway.com/aiml 124 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 124

Slide 125

Slide 125 text

Appendix Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 125

Slide 126

Slide 126 text

Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 126