Intro to Reinforcement Learning + Deep Q-Networks: Part 1

We will introduce Reinforcement Learning concepts, methods, and applications. We will look at tools and frameworks for posing and solving RL problems, including OpenAI gym. We introduce Q-learning and set the stage for DQN.
Robin aims to share the best insights from the top researchers in a lucid and entertaining way. We assume only basic knowledge of machine learning and math.

Transcript

  1. Intro to
    Reinforcement Learning +
    Deep Q-Networks:
    Part 1
    Robin Ranjit Singh Chauhan
    [email protected]
    Pathway Intelligence Inc

  2. Props
     ● Aanchan Mohan
       ○ for suggesting I do this, and organizing
     ● Bruce Sharpe
       ○ Video!
     ● Reviewers
       ○ Valuable comments
     ● Other meetup organizers
     ● Hive
     ● Sutton+Barto, Berkeley, UCL, DeepMind, OpenAI
       ○ for publishing openly
     “Many unpaid hours were sacrificed to bring us this information”

  3. Why
     ● Aim for today
       ○ Deliver you hard-won insight on a silver platter
       ○ Things I wish I had known
       ○ Curated best existing content + original content
     ● Exchange
       ○ There is a lot to know
       ○ I hope others present on RL topics
       ○ If you have a serious interest in RL, I would like to chat
     “Many unpaid hours were sacrificed to bring us this information”

  4. Me
     ● Head of Engineering @ AgFunder
       ○ SF-based VC focused on AgTech + FoodTech investing
       ○ Investments include companies doing Machine Learning in these spaces
         ■ ImpactVision, The Yield
     ● Pathway Intelligence
       ○ BC-based consulting company
     ● Past
       ○ Microsoft PM in Fintech Payment Fraud
       ○ Transportation
       ○ HPC for Environmental engineering
     ● Computer Engineering @ Waterloo

  5. You
    ● Comfort levels
    ○ ML
    ○ RL
    ○ Experience?
    ● Interest areas
    ● Lots of slides!
    ○ Speed vs depth?

  6. IRL
    RL = trial and error + learning
    trial and error = variation and
    selection, search (explore/exploit)
    Learning = Association + Memory
    - Sutton + Barto

  7. Types of ML
    ● Unsupervised
    ● Supervised
    ● Reinforcement Learning

  8. (Comparison table)
     Aspect | Unsupervised Learning | Supervised Learning (+semi-supervised) | Reinforcement Learning
     Training data | Collect training data | Collect training data | Agent creates data through exploration
     Labels | None | Explicit label per example ** | Sparse, delayed reward -> temporal credit assignment problem
     Evaluation | Case-specific, can be subjective | Often accuracy / loss metrics per instance | Regret; total reward; inherent vs artificial reward
     Training / fitting | Training set | Training set | Behaviour policy
     Testing | Test set | Test set | Target policy
     Exploration | n/a | n/a | Exploration strategy ** (typically part of behaviour policy)
     Image credit: Robin Chauhan, Pathway Intelligence Inc.

  9. Image credit: Complementary roles of basal ganglia and cerebellum in learning and motor control, Kenji Doya, 2000

  10. Yann LeCun, January 2017 Asilomar, Future of Life Institute

  11. Related Fields
     Image credit: UCL MSc Course on RL, David Silver, University College London

  12. When to consider RL
     ● Know what a “good / bad result” looks like
       ○ Don’t want to / cannot specify how to get to it
     ● When you need Tactics + Strategy
       ○ Action, not just prediction
     ● Cases
       ○ Games
       ○ Complex robot control
       ○ Dialog systems
       ○ Vehicle Control **
       ○ More as RL and DL advance
     Image credit: (internet)...

  13. When to consider RL
     ● Simulatable
       ○ Else: training IRL usually infeasible **
     ● Vast state spaces require exploration
       ○ Else: enumerate + plan
     ● Dependencies across time
       ○ Delayed reward
       ○ Else: supervised
     ● Avoid RL unless needed
       ○ Immature
       ○ Complicated
       ○ Data-hungry
     Image credit: (internet)...

  14. Image credit: OpenAI https://blog.openai.com/ai-and-compute

  15. Image credit: OpenAI https://blog.openai.com/ai-and-compute
     Chart annotations: 1 day on the world’s fastest supercomputer (peak perf); 1 day on an NVIDIA DGX-2 (16 Volta GPUs, $400k). HPC stats from top500.org.

  16. Same chart as the previous slide, with an added callout: 4 of the 5 most data-hungry AI training runs are RL.
     Image credit: OpenAI https://blog.openai.com/ai-and-compute
     HPC stats from top500.org

  17. Hype vs Reality
     Hype:
     ● Behind many recent AI milestones
     ● Better-than-human performance
     ● “Scared of AI” == scared of RL
       ○ Jobs
       ○ Killing / Enslaving
       ○ Paperclips
       ○ AGI
       ○ Sentience
     Reality:
     ● Few applications so far
     ● Slow learning
     ● Practical for robots?
     ● Progressing quickly

  18. "RL + DL =
    general intelligence"
    David Silver
    Google DeepMind
    ICML 2016

  19. “I think reinforcement learning is one class of technology where the PR excitement is vastly disproportional relative to the ... actual deployments today”
     Andrew Ng
     Chief Scientist of Baidu
     EmTech Nov 2017

  20. RL Trajectory Dependencies
     ● Methods
       ○ RL Algorithms
     ● Approximators
       ○ Deep Learning models in general
       ○ RL-specific DL techniques
     ● Gear
       ○ GPU
       ○ TPU, other custom silicon
     ● Data
       ○ Sensors + sensor data
     ● All of these are on fire
       ○ Safe to expect non-linear advancement in RL
     Image credit: (internet)

  21. ● DeepMind (Google)
    ● OpenAI
    ● UAlberta
    ● Google Brain
    ● Berkeley, CMU, Oxford
    ● Many more...
    Who Does RL Research?

  22. (image-only slide)

  23. ● Safety
    ● Animal rights
    ● Unemployment
    ● Civil Liberties
    ● Peace + Conflict
    ● Power Centralization
    RL+AI Ethics Dimensions
    Consider donating to organizations dedicated to protecting values you cherish

  24. (image-only slide)

  25. Reinforcement Learning
    ● learning to decide + act over time
    ● often online learning
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto

  26. (Stochastic Multi-armed) Bandit
     ● Sequential task
       ○ Pick one of K arms
       ○ Each has its own fixed, unknown, (stochastic?) reward distribution
     ● Goal
       ○ Maximize reward
     ● Challenge
       ○ Explore vs Exploit: either alone is not optimal (see the sketch below)
       ○ Supervised learning alone cannot solve this: it does not explore
     Image credit: Microsoft Research
     Image credit: https://github.com/CamDavidsonPilon
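To make the explore/exploit trade-off concrete, here is a minimal epsilon-greedy bandit sketch (not from the deck; the arm payout probabilities and ε below are invented for illustration):

    import random

    true_means = [0.2, 0.5, 0.7, 0.4]  # hypothetical arm payout probabilities, unknown to the agent
    K = len(true_means)
    counts = [0] * K                   # pulls per arm
    estimates = [0.0] * K              # running mean reward per arm
    epsilon = 0.1                      # exploration probability

    for t in range(10000):
        if random.random() < epsilon:
            a = random.randrange(K)    # explore: random arm
        else:
            a = max(range(K), key=lambda i: estimates[i])  # exploit: best estimate so far
        r = 1.0 if random.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean update

    print(estimates)  # approaches true_means, with arm 2 pulled most often

Pure exploitation can lock onto a bad arm early; pure exploration never cashes in on what it has learned. ε trades the two off.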

  27. Contextual Multi-armed Bandit
     ● Rewards depend on Context
     ● Context independent of action
     (Diagram: four slot machines a1–a4 at Edgewater Casino (Context a), and four slot machines b1–b4 at Hard Rock Casino (Context b))

  28. Reinforcement Learning
     ● Context change depends on action
     ● Learn an MDP from experience only
     ● Game setting
       ○ Experiences the effects of the rules (win/loss/tie)
       ○ Does not “know” the rules
     Image credit: CMU Graduate AI course slides

  29. Markov Chains
    ● State fully defines history
    ● Transitions
    ○ Probability
    ○ Destination
    Image credit: Wikipedia
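As a minimal sketch of the idea (the two-state chain below is invented for illustration): each transition has a probability and a destination, and the next state depends only on the current state.

    import random

    transitions = {  # hypothetical two-state weather chain: P(next | current)
        "sunny": (["sunny", "rainy"], [0.9, 0.1]),
        "rainy": (["sunny", "rainy"], [0.5, 0.5]),
    }

    state = "sunny"
    for _ in range(10):
        nexts, probs = transitions[state]
        state = random.choices(nexts, weights=probs)[0]  # depends only on the current state
        print(state)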

  30. Markov Decision Process (MDP)
    ● Markov Chains
    ○ States linked w/o history
    ● Actions
    ○ Choice
    ● Rewards
    ○ Motivation
    ● Variants
    ○ Bandit = MDP with single state!
    ○ MC + Rewards = MRP
    ○ Partially observed (POMDP)
    ○ Semi-MDP
    Image credit: Wikipedia

  31. MDP and Friends
    Image credit: Aaron Schumacher, planspace.org

  32. Reward Signal
    ● Reward drives learning
    ○ Details of reward signal often critical
    ● Too sparse
    ○ complete learning failure
    ● Too generous
    ○ optimization limited
    ● Problem specific
    Image credit: Wikipedia

  33. Montezuma’s Actual Revenge
     Chart credit: Schaul et al, Prioritized Experience Replay, DeepMind Feb 2016

  34. Broad Applicability
    ● Environment / Episodes
    ○ Finite length / Endless
    ● Action space
    ○ Discrete / Continuous
    ○ Few / Vast
    ● State space
    ○ Discrete / Continuous
    ○ Tree / Graph / Cyclic
    ○ Deterministic / Stochastic
    ○ Partially / Fully observed
    ● Reward signals
    ○ Deterministic / Stochastic
    ○ Continuous / Sparse
    ○ Immediate / Delayed
    Image credit: Wikipedia

  35. Types of RL
    ● Value Based
    ○ Construct state-action value
    function Q*(s,a)
    ● Policy Based
    ○ Directly construct π*(s)
    ● Model Based
    ○ Learn model of environment
    ○ Plan using model
    ● Hybrids
    Image credit: UCL MSc Course on RL, David Silver, University College London

  36. Reinforcement Learning vs Planning / Search
     ● Planning and RL can be combined
     Aspect | Planning / Search | Reinforcement Learning
     Goal | Improved policy | Improved policy
     Method | Computing on a known model | Interacting with an unknown environment
     State space model | Known | Unknown
     Algos | Heuristic state-space search; dynamic programming | Q-Learning; Monte Carlo rollouts
     Content paraphrased from: UCL MSc Course on RL, David Silver, University College London

  37. RL Rollouts vs Planning
     Image credit: Aaron Schumacher, planspace.org

  38. Elementary approaches
     ● Monte Carlo RL (MC)
       ○ Value = mean return of multiple runs
     ● Value Iteration + Policy Iteration
       ○ Both require enumerating all states
       ○ Both require knowing the transition model T(s)
     ● Dynamic Programming (DP)
       ○ Value = value of next state + reward in this state
       ○ Iteration propagates reward from the terminal state back to the beginning (see the sketch below)
     Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
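As a sketch of value iteration on a toy problem (the 4-state chain MDP below is invented for illustration; note that, per the bullets above, it assumes a known, enumerable transition model):

    gamma = 0.9
    states = [0, 1, 2, 3]       # state 3 is terminal (the goal)
    actions = ["left", "right"]

    def next_state(s, a):       # known, deterministic transition model
        return max(s - 1, 0) if a == "left" else min(s + 1, 3)

    def reward(s2):
        return 1.0 if s2 == 3 else 0.0  # reward only on reaching the goal

    V = {s: 0.0 for s in states}
    for _ in range(100):        # sweep until approximately converged
        for s in states[:-1]:   # terminal state keeps V = 0
            V[s] = max(reward(next_state(s, a)) + gamma * V[next_state(s, a)]
                       for a in actions)

    print(V)  # {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}: value decays by gamma with distance from the goal

Each sweep is a DP backup: a state's value is pulled toward the best one-step reward plus the discounted value of the successor, which is exactly how reward propagates back from the terminal state.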

  39. Elementary approaches: Value Iteration
    Image credit: Pieter Abbeel UC Berkeley EECS

  40. Elementary approaches: Policy Iteration
    Image credit: Pascal Poupart CS886 University of Waterloo

  41. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp

  42. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
    ● Memory, Imagination
    ● Imitation, Inverse
    ● Hierarchical
    ● Mastery / Generalization
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp

  43. RL Algo Zoo
     ● Discrete / Continuous
     ● Model-based / Model-Free
     ● On- / Off-policy
     ● Derivative-based / not
     ● Memory, Imagination
     ● Imitation, Inverse
     ● Hierarchical
     ● Mastery / Generalization
     ● Scalability
     ● Sample efficiency?
     Algorithms added in red to the chart: GAE, V-trace (Impala), Dyna-Q family, AlphaGo, AlphaZero, MPPI, MMC, PAL, HER, GPS+DDP, Sarsa, Distro-DQN, TD Search, DDO, FRL, MAXQ, Options, UNREAL, HAM, OptCrit, hDQN, iLQR, MPC, Pri-Sweep, ReinfPlan, NAC, ACER, A0C, Rainbow, MERLIN, DQRN (POMDP)
     Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp, plus additions in red by Robin Chauhan

  44. Name | Notation | Intuition | Where used
     State value function | V(s) | How good is state s? | Value-based methods
     State-action value function | Q(s,a) | In state s, how good is action a? | Q-Learning, DDPG
     Policy | π(s) | What action do we take in state s? | Policy-based methods (but all RL methods have some kind of policy)
     Advantage function | A(s,a) | In state s, how much better is action a than the “average” V(s)? | Duelling DQN, Advantage Actor Critic, A3C
     Transition prediction function | P(s′,r|s,a) | In state s, if I take action a, what are the expected next state and reward? | Model-based RL
     Reward prediction function | R(s,a) | In state s, if I take action a, what is the expected reward? | Model-based RL

  45. Image credit: Sergey Levine via Chelsea Finn and Berkeley Deep RL Bootcamp

  46. OpenAI gym
     ● Wide task variety
       ○ Toy tasks
       ○ Continuous + Discrete
       ○ 2D, 3D, Text, Atari
     ● Common API for env + agent
       ○ Compare algos
     ● Similar
       ○ OpenAI’s Retro: Genesis, Atari arcade
       ○ DeepMind’s Lab: Quake-based 3D env
       ○ Microsoft’s Malmo: Minecraft
       ○ Facebook’s CommAI: Text comms
       ○ Poznan University, Poland: VizDoom
     Image credit: OpenAI gym

  47. OpenAI gym
     import gym

     env = gym.make('CartPole-v0')
     for i_episode in range(20):
         observation = env.reset()
         for t in range(100):
             env.render()
             print(observation)
             action = env.action_space.sample()  # random policy: sample from the action space
             observation, reward, done, info = env.step(action)
             if done:
                 print("Episode finished after {} timesteps".format(t+1))
                 break
     Sample code from https://gym.openai.com/docs/

  48. (image-only slide)

  49. StarCraft II Learning Environment

  50. (image-only slide)

  51. (image-only slide)

  52. (image-only slide)

  53. RL IRL
    ● Most results in hermetic envs
    ○ Board games
    ○ Computer games
    ○ Simulatable robot controllers
    ● Sim != Reality
    ● Model-based : Sample efficiency ++
    ○ But: Model errors accumulate
    ● Techniques to account for model errors
    ● Theme: Bridge Sim -> Reality

  54. RL IRL
     ● Simple IRL manipulations hard for present-day RL
     Image credit: Chelsea Finn
     Image credit: Sergey Levine
     Image credit: Google Research

  55. Q Learning

  56. Q(s,a)

  57. Q-Learning
     ● From s, which a is best?
       Q(state, action) = E[ Σ r ]
     ● Q implies a policy:
       π*(s) = argmax_a Q*(s, a)
     ● Use TD Learning to find Q for each s
       ○ By Watkins in 1989

  58. Q-Learning
    ● Discrete, finite action spaces
    ○ stochastic env
    ○ changing env (unlike Go)
    ● Model-free RL
    ○ Naive about action effects
    ● TD(0)
    ○ Only draws reward 1 step into the past, each iteration

  59. (image-only slide)

  60. Intuition: Q-function
    Image credit: (internet)
    Image credit: AlphaXos, Pathway Intelligence Inc.

  61. Image credit: Vlad Mnih, Deepmind at Deep RL Bootcamp, Berkeley

  62. Image credit: AlphaXos, Pathway Intelligence Inc.

  63. Temporal Difference (TD) Learning
     ● Predict future values
       ○ Incremental
     ● Many variants
       ○ TD(0) vs TD(1) vs TD(λ)
     ● Not new
       ○ From Witten 1977, Sutton and Barto 1981
     ● Here we use it to predict expected reward (update rule below)
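For reference, the standard TD(0) update for state values (from Sutton and Barto), where α is the step size and the bracketed term is the TD error:

     V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]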

  64. Intuition: TD learning
    Image credit: (internet)...
    Image credit: Author

  65. Intuition: TD Learning and State Space
    Image credit: (internet)...

  66. Bellman Equation for Q-learning
    Image credit: Robin Chauhan, Pathway Intelligence Inc.
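For reference, the standard tabular Q-learning update (Watkins 1989) that this slide's equation relates to:

     Q(s,a) ← Q(s,a) + α [ r + γ max_a′ Q(s′,a′) − Q(s,a) ]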

  67. TD(0) updates / Backups / Bellman Updates
    Image credit: Robin Chauhan, Pathway Intelligence Inc.

  68. Q-Learning (non-deep)
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
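A minimal runnable sketch of the tabular algorithm, using the same gym API as the earlier CartPole sample; the environment choice and hyperparameters here are illustrative, not from the slides:

    import random
    import gym

    env = gym.make('FrozenLake-v0')  # small discrete-state, discrete-action env (illustrative choice)
    Q = [[0.0] * env.action_space.n for _ in range(env.observation_space.n)]
    alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate

    for episode in range(5000):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy (see the policies slide below)
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(env.action_space.n), key=lambda i: Q[s][i])
            s2, r, done, info = env.step(a)
            # TD(0) backup toward r + gamma * max_a' Q(s2, a'); no bootstrapping past terminal states
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2

Because the backup uses the max over Q(s′, ·) rather than the action the behaviour policy actually takes next, Q-learning is off-policy (see the on-/off-policy slide).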

  69. Q-Learning Policies
     ● Policy: course of action
     ● Greedy policy
       ○ pick action w/ max Q
     ● Epsilon-Greedy policy
       ○ With probability ε, explore: random action
       ○ Otherwise, exploit: action w/ max Q
     ● Alternatives
       ○ Sample over action distro
       ○ Noise + Greedy
     Image credit: (internet…)

  70. On-policy / Off-policy
    ● On-policy learning
    ○ Learn on the job
    ● Off-policy learning
    ○ Look over someone’s shoulder
    ● Q-Learning = off-policy
     Paraphrased from: Reinforcement Learning: An Introduction, Sutton and Barto

  71. Basic Methods
    Images credit: Reinforcement Learning: An Introduction, Sutton and Barto

  72. Resources
     ● Reinforcement Learning: An Introduction by Sutton and Barto:
       http://incompleteideas.net/book/bookdraft2018jan1.pdf
     ● David Silver’s RL course: https://www.youtube.com/watch?v=2pWv7GOvuf0
     ● Berkeley Deep RL Bootcamp: https://sites.google.com/view/deep-rl-bootcamp/lectures
     ● OpenAI gym: https://gym.openai.com/
     ● arxiv.org

  73. DQN in depth
     Part 2 of this talk coming soon

  74. Thank you!
    Robin Ranjit Singh Chauhan
    [email protected]
    https://github.com/pathway
    https://ca.linkedin.com/in/robinc
    https://pathway.com/aiml
