Intro to Reinforcement Learning + Deep Q-Networks: Part 1

We will introduce Reinforcement Learning concepts, methods, and applications. We will look at tools and frameworks for posing RL problems, including OpenAI gym. We introduce Q-learning and set the stage for DQN.
Robin aims to share the best insights from the top researchers in a lucid and entertaining way. We assume only basic knowledge of machine learning and math.


Transcript

  1. Props
     • Aanchan Mohan ◦ for suggesting I do this, and organizing
     • Bruce Sharpe ◦ Video!
     • Reviewers ◦ Valuable comments
     • Other meetup organizers
     • Hive
     • Sutton+Barto, Berkeley, UCL, DeepMind, OpenAI ◦ for publishing openly
     “Many unpaid hours were sacrificed to bring us this information”
  2. Why
     • Aim for today ◦ Deliver you hard-won insight on a silver platter ◦ Things I wish I had known ◦ Curated best existing content + original content
     • Exchange ◦ There is a lot to know ◦ I hope others present on RL topics ◦ If you have serious interest in RL I would like to chat
     “Many unpaid hours were sacrificed to bring us this information”
  3. Me
     • Head of Engineering @ AgFunder ◦ SF-based VC focussed on AgTech + FoodTech investing ◦ Investments include companies doing Machine Learning in these spaces ▪ ImpactVision, The Yield
     • Pathway Intelligence ◦ BC-based consulting company
     • Past ◦ Microsoft PM in Fintech Payment Fraud ◦ Transportation ◦ HPC for Environmental engineering
     • Computer Engineering @ Waterloo
  4. You
     • Comfort levels ◦ ML ◦ RL ◦ Experience?
     • Interest areas
     • Lots of slides! ◦ Speed vs depth?
  5. IRL
     RL = trial and error + learning
     Trial and error = variation and selection, search (explore/exploit)
     Learning = association + memory
     - Sutton + Barto
  6. Types of ML
     • Unsupervised
     • Supervised
     • Reinforcement Learning
  7. Unsupervised Learning vs Supervised Learning (+semi-supervised) vs Reinforcement Learning
     • Training Data: collect training data / collect training data / agent creates data through exploration
     • Labels: none / explicit label per example / sparse, delayed reward -> temporal credit assignment problem **
     • Evaluation: case-specific, can be subjective / often accuracy or loss metrics per instance / regret; total reward; inherent vs artificial reward
     • Training / Fitting: training set / training set / behaviour policy
     • Testing: test set / test set / target policy
     • Exploration: n/a / n/a / exploration strategy ** (typically part of behaviour policy)
     Image credit: Robin Chauhan, Pathway Intelligence Inc.
  8. [Image slide]
     Image credit: Complementary roles of basal ganglia and cerebellum in learning and motor control, Kenji Doya, 2000
  9. [Image slide]
     Yann LeCun, January 2017, Asilomar, Future of Life Institute
  10. Related Fields
     Image credit: UCL MSc Course on RL, David Silver, University College London
  11. When to consider RL
     • Know what a “good / bad result” looks like ◦ Don’t want to / cannot specify how to get to it
     • When you need Tactics + Strategy ◦ Action, not just prediction
     • Cases ◦ Games ◦ Complex robot control ◦ Dialog systems ◦ Vehicle Control ** ◦ More as RL and DL advance
     Image credit: (internet)...
  12. When to consider RL
     • Simulatable ◦ Else: training IRL usually infeasible **
     • Vast state spaces require exploration ◦ Else: enumerate + plan
     • Dependencies across time ◦ Delayed reward ◦ Else: supervised
     • Avoid RL unless needed ◦ Immature ◦ Complicated ◦ Data-hungry
     Image credit: (internet)...
  13. [Image slide]
     Image credit: OpenAI https://blog.openai.com/ai-and-compute
  14. [Image slide, annotated]
     Image credit: OpenAI https://blog.openai.com/ai-and-compute
     • 1 day on world’s fastest supercomputer at peak performance
     • 1 day on NVIDIA DGX-2: 16 Volta GPUs, $400k
     • HPC stats from top500.org
  15. [Image slide, annotated]
     Image credit: OpenAI https://blog.openai.com/ai-and-compute
     • 1 day on world’s fastest supercomputer at peak performance
     • 1 day on NVIDIA DGX-2: 16 Volta GPUs, $400k
     • HPC stats from top500.org
     • 4 of the 5 most data-hungry AI training runs are RL
  16. Hype vs Reality
     Hype:
     • Behind many recent AI milestones
     • Better-than-human performance
     • “Scared of AI” == scared of RL ◦ Jobs ◦ Killing / Enslaving ◦ Paperclips ◦ AGI ◦ Sentience
     Reality:
     • Few applications so far
     • Slow learning
     • Practical for robots?
     • Progressing quickly
  17. “I think reinforcement learning is one class of technology where the PR excitement is vastly disproportional relative to the ... actual deployments today”
     - Andrew Ng, Chief Scientist of Baidu, EmTech, Nov 2017
  18. RL Trajectory Dependencies
     • Methods ◦ RL Algorithms
     • Approximators ◦ Deep Learning models in general ◦ RL-specific DL techniques
     • Gear ◦ GPU ◦ TPU, other custom silicon
     • Data ◦ Sensors + sensor data
     • All of these are on fire ◦ Safe to expect non-linear advancement in RL
     Image credit: (internet)
  19. Who Does RL Research?
     • DeepMind (Google)
     • OpenAI
     • UAlberta
     • Google Brain
     • Berkeley, CMU, Oxford
     • Many more...
  20. RL+AI Ethics Dimensions
     • Safety
     • Animal rights
     • Unemployment
     • Civil Liberties
     • Peace + Conflict
     • Power Centralization
     Consider donating to organizations dedicated to protecting values you cherish
  21. Reinforcement Learning
     • Learning to decide + act over time
     • Often online learning
     Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
  22. (Stochastic Multi-armed) Bandit
     • Sequential task ◦ Pick one of K arms ◦ Each has its own fixed, unknown, (stochastic?) reward distribution
     • Goal ◦ Maximize reward
     • Challenge ◦ Explore vs Exploit ◦ Either alone is not optimal ◦ Supervised learning alone cannot solve this: it does not explore
     Image credit: Microsoft Research
     Image credit: https://github.com/CamDavidsonPilon
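
     A minimal sketch of this loop in Python (not from the deck; the four arm means, ε = 0.1, and the step count are invented for illustration):

         import random

         true_means = [0.2, 0.5, 0.1, 0.8]   # hypothetical 4-armed bandit

         def pull(arm):
             # Each arm pays a stochastic (Gaussian) reward with a fixed, unknown mean.
             return random.gauss(true_means[arm], 1.0)

         epsilon = 0.1
         counts = [0] * len(true_means)      # pulls per arm
         values = [0.0] * len(true_means)    # agent's running mean reward per arm

         for step in range(10000):
             if random.random() < epsilon:
                 arm = random.randrange(len(true_means))                     # explore
             else:
                 arm = max(range(len(true_means)), key=lambda a: values[a])  # exploit
             r = pull(arm)
             counts[arm] += 1
             values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update

         print(values)   # estimates approach true_means; the best arm gets pulled most

     Pure exploitation can lock onto a suboptimal arm early; pure exploration never cashes in, which is why either alone is not optimal.
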
  23. Contextual Multi-armed Bandit
     • Rewards depend on Context
     • Context independent of action
     [Figure: per-context reward distributions F_a1..F_a4 at Edgewater Casino (Context a) and F_b1..F_b4 at Hard Rock Casino (Context b)]
  24. Reinforcement Learning
     • Context change depends on action
     • Learn an MDP from experience only
     • Game setting ◦ Experiences effects of rules (win/loss/tie) ◦ Does not “know” the rules
     Image credit: CMU Graduate AI course slides
  25. Markov Chains
     • State fully defines history (the Markov property: the future depends only on the current state)
     • Transitions ◦ Probability ◦ Destination
     Image credit: Wikipedia
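
     A tiny sampling sketch of a Markov chain (the two-state weather chain and its probabilities are made up for illustration):

         import random

         states = ["sunny", "rainy"]
         # P[s][s2] = probability of transitioning s -> s2; each row sums to 1
         P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
              "rainy": {"sunny": 0.5, "rainy": 0.5}}

         def step(s):
             # The next state depends only on the current state, not on the path taken.
             return random.choices(states, weights=[P[s][s2] for s2 in states])[0]

         s = "sunny"
         for _ in range(10):
             s = step(s)
             print(s)
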
  26. Markov Decision Process (MDP)
     • Markov Chains ◦ States linked w/o history
     • Actions ◦ Choice
     • Rewards ◦ Motivation
     • Variants ◦ Bandit = MDP with a single state! ◦ MC + Rewards = MRP ◦ Partially observed (POMDP) ◦ Semi-MDP
     Image credit: Wikipedia
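
     In standard notation (a compact summary, not a reproduction of the slide), an MDP is the tuple:

         \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
         P(s' \mid s, a)\ \text{(transition probabilities)}, \quad
         R(s, a)\ \text{(rewards)}, \quad
         \gamma \in [0, 1]\ \text{(discount factor)}

     Dropping actions and rewards leaves a Markov chain; keeping rewards but dropping actions gives an MRP, matching the variants listed above.
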
  27. MDP and Friends
     Image credit: Aaron Schumacher, planspace.org
  28. Reward Signal
     • Reward drives learning ◦ Details of the reward signal are often critical
     • Too sparse ◦ Complete learning failure
     • Too generous ◦ Optimization limited
     • Problem specific
     Image credit: Wikipedia
  29. Montezuma’s Actual Revenge
     Chart credit: Schaul et al., Prioritized Experience Replay, DeepMind, Feb 2016
  30. Broad Applicability
     • Environment / Episodes ◦ Finite length / Endless
     • Action space ◦ Discrete / Continuous ◦ Few / Vast
     • State space ◦ Discrete / Continuous ◦ Tree / Graph / Cyclic ◦ Deterministic / Stochastic ◦ Partially / Fully observed
     • Reward signals ◦ Deterministic / Stochastic ◦ Continuous / Sparse ◦ Immediate / Delayed
     Image credit: Wikipedia
  31. Types of RL
     • Value Based ◦ Construct state-action value function Q*(s,a)
     • Policy Based ◦ Directly construct π*(s)
     • Model Based ◦ Learn model of environment ◦ Plan using model
     • Hybrids
     Image credit: UCL MSc Course on RL, David Silver, University College London
  32. Reinforcement Learning vs Planning / Search
     • Planning and RL can be combined
                          Planning / Search              Reinforcement Learning
     Goal                 Improved policy                Improved policy
     Method               Computing on a known model     Interacting with an unknown environment
     State space model    Known                          Unknown
     Algos                Heuristic state-space search,  Q-Learning,
                          dynamic programming            Monte Carlo rollouts
     Content paraphrased from: UCL MSc Course on RL, David Silver, University College London
  33. Elementary approaches
     • Monte Carlo RL (MC) ◦ Value = mean return over multiple runs
     • Value Iteration + Policy Iteration ◦ Both require enumerating all states ◦ Both require knowing the transition model T(s, a, s′)
     • Dynamic Programming (DP) ◦ Value = reward in this state + value of next state ◦ Iteration propagates reward from terminal state back to the beginning
     Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
  34. Elementary approaches: Value Iteration
     Image credit: Pieter Abbeel, UC Berkeley EECS
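
     A minimal value-iteration sketch in Python (the 3-state MDP, its rewards, and γ = 0.9 are made up for illustration):

         GAMMA = 0.9
         STATES = [0, 1, 2]
         ACTIONS = [0, 1]

         # T[(s, a)] = list of (probability, next_state, reward) triples
         T = {
             (0, 0): [(1.0, 0, 0.0)],
             (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
             (1, 0): [(1.0, 0, 0.0)],
             (1, 1): [(1.0, 2, 10.0)],
             (2, 0): [(1.0, 2, 0.0)],
             (2, 1): [(1.0, 2, 0.0)],
         }

         V = {s: 0.0 for s in STATES}
         for sweep in range(100):   # or sweep until the largest change is tiny
             # Bellman optimality backup for every state, using last sweep's V
             V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[(s, a)])
                         for a in ACTIONS)
                  for s in STATES}

         print(V)   # converges toward the optimal state values V*

     Note the loop over STATES and the direct use of T: exactly the two requirements (enumerable states, known transition model) called out above.
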
  35. Elementary approaches: Policy Iteration
     Image credit: Pascal Poupart, CS886, University of Waterloo
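
     A matching policy-iteration sketch on the same made-up MDP: repeatedly evaluate the current policy, then act greedily against that evaluation, until the policy stops changing:

         GAMMA = 0.9
         STATES = [0, 1, 2]
         ACTIONS = [0, 1]
         T = {   # (s, a) -> [(probability, next_state, reward)]
             (0, 0): [(1.0, 0, 0.0)], (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
             (1, 0): [(1.0, 0, 0.0)], (1, 1): [(1.0, 2, 10.0)],
             (2, 0): [(1.0, 2, 0.0)], (2, 1): [(1.0, 2, 0.0)],
         }

         def q(s, a, V):
             return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[(s, a)])

         policy = {s: 0 for s in STATES}
         while True:
             # 1. Policy evaluation: iterate V toward the value of the current policy
             V = {s: 0.0 for s in STATES}
             for _ in range(100):
                 V = {s: q(s, policy[s], V) for s in STATES}
             # 2. Policy improvement: act greedily with respect to V
             new_policy = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in STATES}
             if new_policy == policy:   # stable policy => done
                 break
             policy = new_policy

         print(policy, V)
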
  36. RL Algo Zoo
     • Discrete / Continuous
     • Model-based / Model-Free
     • On- / Off-policy
     • Derivative-based / not
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp
  37. RL Algo Zoo
     • Discrete / Continuous
     • Model-based / Model-Free
     • On- / Off-policy
     • Derivative-based / not
     • Memory, Imagination
     • Imitation, Inverse
     • Hierarchical
     • Mastery / Generalization
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp
  38. RL Algo Zoo
     • Discrete / Continuous
     • Model-based / Model-Free
     • On- / Off-policy
     • Derivative-based / not
     • Memory, Imagination
     • Imitation, Inverse
     • Hierarchical
     • Mastery / Generalization
     • Scalability
     • Sample efficiency?
     Algorithms labeled on the chart: Sarsa, Distro-DQN, TD Search, DDO, FRL, MAXQ, Options, UNREAL, HAM, OptCrit, hDQN, iLQR, MPC, Pri-Sweep, ReinfPlan, NAC, ACER, A0C, Rainbow, MERLIN, DQRN (POMDP), GAE, V-trace (Impala), Dyna-Q family, AlphaGo, AlphaZero, MPPI, MMC, PAL, HER, GPS+DDP
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp, plus additions in red by Robin Chauhan
  39. [Notation table: Name / Notation / Intuition / Where Used]
     • State value function, V(s): How good is state s? Used in: value-based methods.
     • State-action value function, Q(s,a): In state s, how good is action a? Used in: Q-Learning, DDPG.
     • Policy, π(s): What action do we take in state s? Used in: policy-based methods (but all RL methods have some kind of policy).
     • Advantage function, A(s,a): In state s, how much better is action a than the “average” V(s)? Used in: Dueling DQN, Advantage Actor-Critic, A3C.
     • Transition prediction function, P(s′,r|s,a): In state s, if I take action a, what are the expected next state and reward? Used in: model-based RL.
     • Reward prediction function, R(s,a): In state s, if I take action a, what is the expected reward? Used in: model-based RL.
  40. [Image slide]
     Image credit: Sergey Levine via Chelsea Finn and Berkeley Deep RL Bootcamp
  41. OpenAI gym
     • Wide task variety ◦ Toy tasks ◦ Continuous + Discrete ◦ 2D, 3D, Text, Atari
     • Common API for env + agent ◦ Compare algos
     • Similar ◦ OpenAI’s Retro: Genesis, Atari arcade ◦ DeepMind’s Lab: Quake-based 3D env ◦ Microsoft’s Malmo: Minecraft ◦ Facebook’s CommAI: Text comms ◦ Poznan University, Poland: VizDoom
     Image credit: OpenAI gym
  42. OpenAI gym

         import gym

         env = gym.make('CartPole-v0')
         for i_episode in range(20):
             observation = env.reset()                # start a new episode
             for t in range(100):
                 env.render()
                 print(observation)
                 action = env.action_space.sample()   # random action (no learning yet)
                 observation, reward, done, info = env.step(action)
                 if done:
                     print("Episode finished after {} timesteps".format(t+1))
                     break

     Sample code from https://gym.openai.com/docs/
  43. RL IRL
     • Most results in hermetic envs ◦ Board games ◦ Computer games ◦ Simulatable robot controllers
     • Sim != Reality
     • Model-based: Sample efficiency ++ ◦ But: Model errors accumulate
     • Techniques to account for model errors
     • Theme: Bridge Sim -> Reality
  44. RL IRL
     • Simple IRL manipulations are hard for present-day RL
     Image credit: Chelsea Finn
     Image credit: Sergey Levine
     Image credit: Google Research
  45. Q-Learning
     • From state s, which action a is best? Q(state, action) = E[Σr], the expected total reward
     • Q implies a policy: π*(s) = argmax_a Q*(s, a)
     • Use TD Learning to find Q for each s ◦ Introduced by Watkins in 1989
  46. Q-Learning
     • Discrete, finite action spaces ◦ Stochastic env ◦ Changing env (unlike Go)
     • Model-free RL ◦ Naive about action effects
     • TD(0) ◦ Each update propagates reward only 1 step back in time
  47. Intuition: Q-function
     Image credit: (internet)
     Image credit: AlphaXos, Pathway Intelligence Inc.
  48. [Image slide]
     Image credit: Vlad Mnih, DeepMind, at Deep RL Bootcamp, Berkeley
  49. [Image slide]
     Image credit: AlphaXos, Pathway Intelligence Inc.
  50. Temporal Difference (TD) Learning
     • Predict future values ◦ Incremental
     • Many variants ◦ TD(0) vs TD(1) vs TD(λ)
     • Not new ◦ From Witten 1977, Sutton and Barto 1981
     • Here we use it to predict expected reward
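
     A TD(0) prediction sketch in Python on the classic 5-state random walk (the example is Sutton and Barto's, not this deck's; α and the episode count are illustrative):

         import random

         ALPHA, GAMMA = 0.1, 1.0
         V = [0.0] * 7   # states 0..6; 0 and 6 are terminal; reward 1 for reaching 6

         for episode in range(1000):
             s = 3                                  # always start in the middle
             while s not in (0, 6):
                 s_next = s + random.choice([-1, 1])
                 r = 1.0 if s_next == 6 else 0.0
                 target = r if s_next in (0, 6) else r + GAMMA * V[s_next]
                 V[s] += ALPHA * (target - V[s])    # nudge V(s) toward 1-step target
                 s = s_next

         print(V[1:6])   # approaches the true values 1/6, 2/6, 3/6, 4/6, 5/6

     The update never waits for the episode to finish: each step bootstraps from the current estimate of the next state, which is the "incremental" property above.
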
  51. Intuition: TD learning
     Image credit: (internet)...
     Image credit: Author
  52. Intuition: TD Learning and State Space
     Image credit: (internet)...
  53. Bellman Equation for Q-learning
     Image credit: Robin Chauhan, Pathway Intelligence Inc.
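
     The slide's equation is an image; in standard notation, the Bellman optimality equation for Q, and the TD-style update built from it, are:

         Q^*(s, a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]

         Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]

     The bracketed term in the update is the TD error: the gap between the one-step bootstrapped target and the current estimate.
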
  54. TD(0) updates / Backups / Bellman Updates
     Image credit: Robin Chauhan, Pathway Intelligence Inc.
  55. Q-Learning (non-deep)
     Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
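
     A tabular Q-learning sketch in the spirit of that pseudocode, run on gym's discrete FrozenLake environment (this assumes the same old-style gym API as the slide 42 sample; 'FrozenLake-v0' and the hyperparameters are illustrative choices, not from the deck):

         import random
         import gym

         env = gym.make('FrozenLake-v0')            # discrete states and actions
         Q = [[0.0] * env.action_space.n for _ in range(env.observation_space.n)]
         alpha, gamma, epsilon = 0.1, 0.99, 0.1

         for episode in range(5000):
             s = env.reset()
             done = False
             while not done:
                 # Epsilon-greedy behaviour policy
                 if random.random() < epsilon:
                     a = env.action_space.sample()
                 else:
                     a = max(range(env.action_space.n), key=lambda i: Q[s][i])
                 s_next, r, done, info = env.step(a)
                 # Off-policy backup: bootstrap from the greedy (max) action in s_next
                 target = r if done else r + gamma * max(Q[s_next])
                 Q[s][a] += alpha * (target - Q[s][a])
                 s = s_next
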
  56. Q-Learning Policies
     • Policy: course of action
     • Greedy policy ◦ Pick action w/ max Q
     • Epsilon-Greedy policy ◦ Explore: random action, with probability ε ◦ Exploit: action w/ max Q, otherwise
     • Alternatives ◦ Sample over action distro ◦ Noise + Greedy
     Image credit: (internet…)
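
     The “sample over action distro” alternative can be sketched as Boltzmann (softmax) exploration; this helper is illustrative, not from the deck:

         import math
         import random

         def softmax_action(q_values, temperature=1.0):
             # Higher-Q actions are sampled more often; temperature -> 0 approaches
             # greedy, large temperature approaches uniform random.
             prefs = [math.exp(q / temperature) for q in q_values]
             total = sum(prefs)
             return random.choices(range(len(q_values)),
                                   weights=[p / total for p in prefs])[0]
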
  57. On-policy / Off-policy
     • On-policy learning ◦ Learn on the job
     • Off-policy learning ◦ Look over someone’s shoulder
     • Q-Learning = off-policy
     Paraphrased from: Reinforcement Learning: An Introduction, Sutton and Barto
  58. Basic Methods
     Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
  59. Resources
     • Reinforcement Learning: An Introduction by Sutton and Barto: http://incompleteideas.net/book/bookdraft2018jan1.pdf
     • David Silver’s RL course: https://www.youtube.com/watch?v=2pWv7GOvuf0
     • Berkeley Deep RL Bootcamp: https://sites.google.com/view/deep-rl-bootcamp/lectures
     • OpenAI gym: https://gym.openai.com/
     • arxiv.org