
Intro to Reinforcement Learning + Deep Q-Networks


We introduce RL concepts, methods, and applications. We look at tools and frameworks for posing and solving RL problems, including OpenAI gym. We then more closely examine Q-learning and Deep Q-Networks (DQN), a popular contemporary deep reinforcement learning algorithm. We aim to share the best insights from top researchers in a lucid and entertaining way. We assume only basic knowledge of machine learning and math.

Robin Chauhan has a longstanding interest in AI and machine learning theory and practice. He has applied AI to venture capital, fintech fraud, transportation, and environmental engineering. He holds a BASc in Computer Engineering from the University of Waterloo.


Transcript

  1. Intro to
    Reinforcement Learning +
    Deep Q-Networks
    Robin Ranjit Singh Chauhan
    [email protected]
    Pathway Intelligence Inc


  2. ● Aanchan Mohan
    ○ for suggesting I do this, and
    organizing
    ● Reviewers
    ○ Valuable comments
    ● Other meetup organizers
    ● Hive
    ● Sutton+Barto, Berkeley, UCL,
    DeepMind, OpenAI
    ○ for publishing openly
    Props
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 2
    “Many unpaid hours were sacrificed
    to bring us this information”


  3. Why
    ● Aim for today
    ○ Deliver hard-won insight on a silver platter
    ○ Things I wish I had known
    ○ Curated best existing content +
    original content
    ● Exchange
    ○ There is a lot to know
    ○ I hope others present on RL
    topics
    ○ If you have a serious interest in RL, I would like to chat
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 3
    “Many unpaid hours were sacrificed
    to bring us this information”


  4. ● Head of Engineering @ AgFunder
    ○ SF-based VC focused on AgTech + FoodTech investing
    ○ Investments include companies doing Machine Learning in these spaces
    ■ ImpactVision, The Yield
    ● Pathway Intelligence
    ○ BC-based consulting company
    ● Past
    ○ Microsoft PM in Fintech Payment Fraud
    ○ Transportation
    ○ HPC for Environmental engineering
    ● Computer Engineering @ Waterloo
    Me
    4
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 4


  5. You
    ● Comfort levels
    ○ ML
    ○ RL
    ○ Experience?
    ● Interest areas
    ● Lots of slides!
    ○ Speed vs depth?


  6. IRL
    RL = trial and error + learning
    trial and error = variation and
    selection, search (explore/exploit)
    Learning = Association + Memory
    - Sutton + Barto


  7. Types of ML
    ● Unsupervised
    ● Supervised
    ● Reinforcement Learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 7


  8. Unsupervised Learning vs Supervised Learning (+semi-supervised) vs Reinforcement Learning
    Training Data:      Collect training data | Collect training data | Agent creates data through exploration
    Labels:             None | Explicit label per example ** | Sparse, delayed reward -> temporal credit assignment problem
    Evaluation:         Case-specific, can be subjective | Often accuracy / loss metrics per instance | Regret; total reward; inherent vs artificial reward
    Training / Fitting: Training set | Training set | Behaviour policy
    Testing:            Test set | Test set | Target policy
    Exploration:        n/a | n/a | Exploration strategy ** (typically part of behaviour policy)
    8
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 8
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  9. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 9
    Image credit: Complementary roles of basal ganglia and cerebellum in learning and motor control, Kenji Doya, 2000


  10. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 10
    Yann LeCun, January 2017 Asilomar, Future of Life Institute


  11. Related Fields
    Image credit: UCL MSc Course on RL, David Silver, University College London Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 11


  12. ● Know what a “good / bad result” looks like
    ○ Don’t want to/cannot specify how to get to it
    ● When you need Tactics + Strategy
    ○ Action, not just prediction
    ● Cases
    ○ Medical treatment
    ○ Complex robot control
    ○ Games
    ○ Dialog systems
    ○ Vehicle Control **
    ○ More as RL and DL advance
    When to consider RL
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 12
    Image credit: (internet)...


  13. ● Simulatable (or have large existing data)
    ○ Else: training IRL usually infeasible **
    ● Vast state spaces require exploration
    ○ Else: enumerate + plan
    ● Dependencies across time
    ○ Delayed reward
    ○ Else: supervised
    ● Avoid RL unless needed
    ○ Complicated
    ○ Can be data-hungry
    ○ Explainability? (Depends on fn approx)
    ○ Maturity?
    When to consider RL
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 13
    Image credit: (internet)...


  14. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 14
    Image credit: OpenAI https://blog.openai.com/ai-and-compute


  15. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 15
    Image credit: OpenAI https://blog.openai.com/ai-and-compute
    1 day on world's fastest supercomputer, peak perf
    1 day on NVIDIA
    DGX-2: 16 Volta GPUs
    $400k
    HPC stats from top500.org


  16. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 16
    Image credit: OpenAI https://blog.openai.com/ai-and-compute
    1 day on Sunway
    TaihuLight peak perf
    1 day on NVIDIA
    DGX-2: 16 Volta GPUs
    $400k
    HPC stats from top500.org
    4 of the 5 most
    data-hungry AI
    training runs
    are RL
    1 day on Intel Core i9
    Extreme ($2k chip,
    CPU perf only)
    1 day on DOE Summit
    w/GPUs


  17. Hype vs Reality
    ● Behind many recent AI milestones
    ● Better than human perf
    ● “Scared of AI” == Scared of RL
    ○ Jobs
    ○ Killing / Enslaving
    ○ Paperclips
    ○ AGI
    ○ Sentience
    17
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 17
    ● Few applications so far
    ● Slow learning
    ● Practical for robots?
    ● Progressing quickly


  18. "RL + DL =
    general intelligence"
    David Silver
    Google DeepMind
    ICML 2016


  19. “I think reinforcement learning is one
    class of technology where the PR
    excitement is vastly disproportional
    relative to the ... actual deployments
    today”
    Andrew Ng
    Chief Scientist of Baidu
    EmTech Nov 2017
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 19


  20. ● Methods
    ○ RL Algorithms
    ● Approximators
    ○ Deep Learning models in general
    ○ RL-specific DL techniques
    ● Gear
    ○ GPU
    ○ TPU, Other custom silicon
    ● Data
    ○ Sensors + Sensor data
    ● All of these are on fire
    ○ Safe to expect non-linear advancement in RL
    RL Trajectory Dependencies
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 20
    Image credit: (internet)


  21. ● DeepMind (Google)
    ● OpenAI
    ● UAlberta
    ● Google Brain
    ● Berkeley, CMU, Oxford
    ● Many more...
    Who Does RL Research?
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 21


  22. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 22


  23. ● Safety
    ● Animal rights
    ● Unemployment
    ● Civil Liberties
    ● Peace + Conflict
    ● Power Centralization
    RL+AI Ethics Dimensions
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 23
    Consider donating to organizations dedicated to protecting values you cherish

  24. (image-only slide)

  25. Reinforcement Learning
    ● learning to decide + act over time
    ● often online learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 25
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto


  26. ● Sequential Task
    ○ Pick one of K arms
    ○ Each has its own fixed, unknown (possibly stochastic) reward distribution
    ● Goal
    ○ Maximize reward
    ● Challenge
    ○ Explore vs Exploit
    ○ Either alone not optimal
    ○ Supervised learning
    alone cannot solve:
    does not explore
    (Stochastic Multi-armed) Bandit
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 26
    Image credit: Microsoft Research
    Image credit: https://github.com/CamDavidsonPilon
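    A minimal ε-greedy bandit loop, as a sketch of the explore/exploit trade-off described on this slide (illustrative code, not from the deck; the arm means are made up):

    import random

    true_means = [0.2, 0.5, 0.7, 0.4]   # unknown to the agent
    K = len(true_means)
    q = [0.0] * K                        # estimated value of each arm
    n = [0] * K                          # pull counts
    epsilon = 0.1

    for t in range(10000):
        if random.random() < epsilon:
            a = random.randrange(K)                    # explore
        else:
            a = max(range(K), key=lambda i: q[i])      # exploit
        r = random.gauss(true_means[a], 1.0)           # stochastic reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                      # incremental mean update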


  27. Contextual Multi-armed Bandit
    ● Rewards depend on Context
    ● Context independent of action
    [Diagram: two contexts, each a casino of four slot machines: Edgewater Casino (Context a) with arms a1-a4, and Hard Rock Casino (Context b) with arms b1-b4]
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 27


  28. Reinforcement Learning
    Image credit: CMU Graduate AI course slides
    ● Context change depends on
    action
    ● Learn an MDP from experience
    only
    ● Game setting
    ○ Experiences the effects of the rules (win/loss/tie)
    ○ Does not “know” rules
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 28


  29. Markov Chains
    ● Current state summarizes all relevant history (Markov property)
    ● Transitions
    ○ Probability
    ○ Destination
    Image credit: Wikipedia
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 29


  30. Markov Decision Process (MDP)
    ● Markov Chains
    ○ States linked w/o history
    ● Actions
    ○ Choice
    ● Rewards
    ○ Motivation
    ● Variants
    ○ Bandit = MDP with single state!
    ○ MC + Rewards = MRP
    ○ Partially observed (POMDP)
    ○ Semi-MDP
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 30
    Image credit: Wikipedia
    Q: where will you often find the MDP
    in RL codebase? Non-MB vs MB
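    For reference (standard definition, not from the slide): an MDP is a tuple (S, A, P, R, γ): a state set S, an action set A, transition probabilities P(s′ | s, a), a reward function R(s, a), and a discount factor γ in [0, 1]. The bandit above is the special case with a single state.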


  31. MDP and Friends
    Image credit: Aaron Schumacher, planspace.org
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 31


  32. Reward Signal
    ● Reward drives learning
    ○ Details of reward signal often critical
    ● Too sparse
    ○ complete learning failure
    ● Too generous
    ○ little pressure to optimize toward the real goal
    ● Problem specific
    Image credit: Wikipedia
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 32


  33. Montezuma’s Actual Revenge
    Chart credit: Schaul et al, Prioritized Experience Replay, DeepMind Feb 2016 Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 33


  34. Broad Applicability
    ● Environment / Episodes
    ○ Finite length / Endless
    ● Action space
    ○ Discrete / Continuous
    ○ Few / Vast
    ● State space
    ○ Discrete / Continuous
    ○ Tree / Graph / Cyclic
    ○ Deterministic / Stochastic
    ○ Partially / Fully observed
    ● Reward signals
    ○ Deterministic / Stochastic
    ○ Continuous / Sparse
    ○ Immediate / Delayed
    34
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 34
    Image credit: Wikipedia


  35. Types of RL
    ● Value Based
    ○ state-value fn V(s)
    ○ state-action value fn Q(s,a)
    ○ action-advantage fn A(s,a)
    ● Policy Based
    ○ Directly construct π*(s)
    ● Model Based
    ○ Learn model of environment
    ○ Plan using model
    ● Hybrids
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 35
    Image credit: UCL MSc Course on RL, David Silver, University College London


  36. Reinforcement Learning vs Planning / Search
    ● Planning and RL can be combined
    Planning / Search vs Reinforcement Learning
    Goal:              Improved policy | Improved policy
    Method:            Computing on a known model | Interacting with an unknown environment
    State Space Model: Known | Unknown
    Algos:             Heuristic state-space search, Dynamic programming | Q-Learning, Monte Carlo rollouts
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 36
    Content paraphrased from: UCL MSc Course on RL, David Silver, University College
    London


  37. RL Rollouts vs Planning
    rollouts
    Image credit: Aaron Schumacher, planspace.org


  38. Elementary approaches
    ● Monte Carlo RL (MC)
    ○ Value = mean return of multiple runs
    ● Value Iteration + Policy Iteration
    ○ Both require enumerating all states
    ○ Both require knowing the transition model T(s,a,s′)
    ● Dynamic Programming (DP)
    ○ Value = value of next state + reward in this
    state
    ○ Iteration propagates reward from terminal
    state back to beginning
    Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 38
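    A sketch of tabular value iteration for the case where the transition model is known (illustrative only; the dict-based layout of T and R is an assumption, not from the deck):

    def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
        # T[s][a] = list of (prob, next_state); R[s][a] = immediate reward
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                           for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V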


  39. Elementary approaches: Value Iteration
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 39
    Image credit: Pieter Abbeel UC Berkeley EECS


  40. Elementary approaches: Policy Iteration
    Image credit: Pascal Poupart CS886 University of Waterloo
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 40


  41. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
    Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 41


  42. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
    ● Memory, Imagination
    ● Imitation, Inverse
    ● Hierarchical
    ● Mastery / Generalization
    Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 42


  43. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
    ● Memory, Imagination
    ● Imitation, Inverse
    ● Hierarchical
    ● Mastery / Generalization
    ● Scalability
    ● Sample efficiency?
    Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp , plus
    additions in red by Robin Chauhan
    Algorithms shown in the figure: GAE, V-trace (Impala), Dyna-Q family, AlphaGo, AlphaZero,
    MPPI, MMC, PAL, HER, GPS+DDP, Sarsa, Distro-DQN, TD Search, DDO, FRL, MAXQ, Options,
    UNREAL, HAM, OptCrit, hDQN, iLQR, MPC, Pri-Sweep, ReinfPlan, NAC, ACER, A0C, Rainbow,
    MERLIN, DQRN (POMDP)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 43


  44. Image credit: Sergey Levine via Chelsea Finn and Berkeley Deep RL Bootcamp Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 44


  45. ● Wide task variety
    ○ Toy tasks
    ○ Continuous + Discrete
    ○ 2D, 3D, Text, Atari
    ● Common API for env + agent
    ○ Compare algos
    ● Similar
    ○ OpenAI’s Retro: Genesis, Atari arcade
    ○ DeepMind’s Lab: Quake-based 3D env
    ○ Microsoft’s Malmo: Minecraft
    ○ Facebook’s CommAI: Text comms
    ○ Poznan University, Poland: VizDoom
    OpenAI gym
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 45
    Image credit: Open AI gym


  46. OpenAI gym
    import gym

    env = gym.make('CartPole-v0')
    for i_episode in range(20):
        observation = env.reset()
        for t in range(100):
            env.render()
            print(observation)
            action = env.action_space.sample()   # random policy: sample an action
            observation, reward, done, info = env.step(action)
            if done:
                print("Episode finished after {} timesteps".format(t+1))
                break
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 46
    Sample code from https://gym.openai.com/docs/

  47. (image-only slide)

  48. StarCraft II Learning Environment

  49. (image-only slide)

  50. (image-only slide)

  51. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 51


  52. RL IRL
    ● Most results in hermetic envs
    ○ Board games
    ○ Computer games
    ○ Simulatable robot controllers
    ● Sim != Reality
    ● Model-based : Sample efficiency ++
    ○ But: Model errors accumulate
    ● Techniques to account for model errors
    ● Theme: Bridge Sim -> Reality
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 52


  53. RL IRL: Robotics
    ● Simple IRL manipulations are hard for present-day RL
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 53
    Image credit: Chelsea Finn
    Image credit: Sergey Levine
    Image credit: Google Research


  54. Q Learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 54


  55. Q(s,a)


  56. Types of RL
    ● Policy Based
    ○ Directly construct π*(s)
    ● Value Based
    ○ State-action value fn Q*(s,a)
    ○ State-value fn V*(s)
    ○ State-action advantage fn A*(s,a)
    ● Model Based
    ○ Learn model of environment
    ○ Plan using model
    ● Hybrids
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 56
    Image credit: UCL MSc Course on RL, David Silver, University College London


  57. Reinforcement Learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 57
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto


  58. Reinforcement Learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 58
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto


  59. Types of RL
    ● Policy Based
    ○ Directly construct π*(s)
    ● Value Based
    ○ State-action value fn Q*(s,a)
    ○ State-value fn V*(s)
    ○ State-action advantage fn A*(s,a)
    ● Model Based
    ○ Learn model of environment
    ○ Plan using model
    ● Hybrids
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 59
    Image credit: UCL MSc Course on RL, David Silver, University College London


  60. Name | Notation | Intuition | Where Used
    Policy | π(s) | What action do we take in state s? (π* is optimal) | Policy-based methods (but all RL methods have some kind of policy)
    State value function | Vπ(s) | How good is state s? (using policy π) | Value-based methods
    State-action value function | Qπ(s,a) | In state s, how good is action a? (using policy π) | Q-Learning, DDPG
    Advantage function | Aπ(s,a) = Qπ(s,a) - Vπ(s) | In state s, how much better is action a than the “overall” Vπ(s)? (using policy π) | Dueling DQN, Advantage Actor-Critic, A3C
    Transition prediction function | P(s′,r|s,a) | In state s, if I take action a, what are the expected next state and reward? | Model-based RL
    Reward prediction function | R(s,a) | In state s, if I take action a, what is the expected reward? | Model-based RL
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 60


  61. Q-Learning
    ● From s, which a is best?
    Q( state, action ) = E[ Σ future rewards ]
    ● Q implies a policy:
    π*(s) = argmax_a Q*(s, a)
    ● Use Temporal Difference (TD) Learning to find Q for each s (update rule below)
    ○ Introduced by Watkins in 1989
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 61
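    For reference, the one-step Q-learning update this refers to (standard textbook form): Q(s,a) ← Q(s,a) + α [ r + γ max_a′ Q(s′,a′) - Q(s,a) ], with learning rate α and discount factor γ.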


  62. Q-Learning properties
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 62
    ● Discrete, finite action spaces
    ○ stochastic env
    ○ changing envs (unlike Go)
    ● Model-free RL
    ○ Naive about action effects
    ● TD(0)
    ○ Propagates reward only 1 step back per update

  63. (image-only slide)

  64. Intuition: Q-function
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 64
    Image credit: (internet)
    Image credit: AlphaXos, Pathway Intelligence Inc.


  65. Image credit: Vlad Mnih, Deepmind at Deep RL Bootcamp, Berkeley
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 65


  66. Image credit: AlphaXos, Pathway Intelligence Inc.
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 66


  67. Temporal Difference (TD) Learning
    ● Predict future values
    ○ Incremental
    ● Many Variants
    ○ TD(0) vs TD(1) vs TD(λ)
    ○ n-step
    ● Not new
    ○ From Witten 1977, Sutton and Barto 1981
    ● Here we use it to predict expected reward
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 67


  68. Intuition: TD learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 68
    Image credit: (internet)...
    Image credit: Author


  69. Intuition: TD Learning and MDP
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 69
    Image credit: (internet)...


  70. TD vs other Basic Methods
    Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 70


  71. Images credit: Reinforcement Learning: An Introduction, Sutton and Barto
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 71


  72. Bellman’s Principle of Optimality
    An optimal policy has the property that:
    ● whatever the initial state and initial decision are,
    ● the remaining decisions must constitute an optimal policy with regard to the
    state resulting from the first decision
    (See Bellman, 1957, Chap. III.3.)
    => optimal path is made up of optimal sub-paths
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 72


  73. Bellman Equation for Q-learning
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 73
    Image credit: Robin Chauhan, Pathway Intelligence Inc.
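    For reference (standard form; the slide itself is an image): the Bellman optimality equation for the action-value function is Q*(s,a) = E[ r + γ max_a′ Q*(s′,a′) | s, a ], i.e. the value of acting optimally now equals the immediate reward plus the discounted value of acting optimally from the next state.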


  74. TD(0) updates / Backups / Bellman Updates
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 74
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  75. Q-Learning (non-deep)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 75
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
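    A minimal tabular Q-learning loop in the style of the deck's other gym samples, as a sketch of the algorithm on this slide (FrozenLake-v0 and the hyperparameters are illustrative assumptions):

    import random
    from collections import defaultdict
    import gym

    env = gym.make('FrozenLake-v0')              # small discrete env (assumption)
    Q = defaultdict(float)                        # Q[(state, action)], defaults to 0
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    for episode in range(5000):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(env.action_space.n), key=lambda i: Q[(s, i)])
            s2, r, done, _ = env.step(a)
            # TD(0) backup toward the greedy (target-policy) value of s2
            target = r + gamma * max(Q[(s2, i)] for i in range(env.action_space.n))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2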


  76. Q-Learning Policies
    ● Policy: course of action
    ● Greedy policy
    ○ pick action w / max Q
    ● ε-Greedy policy
    ○ ε: Explore: random action
    ○ 1-ε: Exploit : action w / max Q
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 76
    Image credit: (internet…)


  77. Q-Learning Policies: Behavior vs Target
    ● “Behaviour policy”: ϵ-Greedy
    ○ Discuss: Why not purely random?
    ● “Target policy”: Greedy
    ○ Discuss: Is 100% greedy always best?
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 77


  78. Q-Learning Policies: Exploration
    ● ε-Greedy Alternatives
    ○ Sample based on action value (Softmax / “Boltzmann” from physics); Noise + Greedy
    ○ Bandit methods: Optimistic, Pessimistic, UCB, Thompson, Bayesian, …
    ■ But they don’t directly account for uncertainty in future MDP return
    ○ Rmax …
    ○ Theoretically optimal exploration expensive
    ■ Explicitly represent information in MDP: Bayes-adaptive MDP (+ Monte Carlo)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 78


  79. On-policy / Off-policy
    ● On-policy learning
    ○ Learn on the job
    ○ Behaviour policy == Target policy
    ○ Learn effects of exploration
    ● Off-policy learning
    ○ Look over someone’s shoulder
    ○ Behaviour policy != Target policy
    ○ Ignore effects of exploration
    ● Q-Learning = off-policy
    ○ ϵ-Greedy != Greedy
    ○ On-policy variant: “Sarsa”
    Paraphrased from: Reinforcement Learning: An Introduction, Sutton and Barto Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 79


  80. Guarantees
    ● Q learning has some theoretical guarantees if
    ○ Infinite visitation of actions, states
    ○ Learning rate schedule within a goldilocks zone
    ● Requirements can be soft in practice
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 80
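    Concretely, the usual conditions are: every state-action pair is visited infinitely often, and the per-step learning rates α_t satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞ (for example α_t = 1/t); this is the "goldilocks zone" referred to above.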


  81. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 81
    Images credit: Shreyas Skandan


  82. Images credit: Yan Duan, Berkeley Deep RL Bootcamp
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 82


  83. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 83
    Images credit: Yan Duan, Berkeley Deep RL Bootcamp


  84. Q-Learning methods
    ● Tabular
    ● FQI: Fitted Q Iteration
    ○ Batch mode: learn after batch
    ○ Fitting of regression, tree or (any) other fn approx for Q fn
    ● DQN
    ○ “Make Q-learning look like supervised (deep) learning”
    ○ Training deep net approx for Q fn
    ● Bayesian
    ○ Update Q fn priors
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 84
    Image credit: (internet…)


  85. Deep Q-Networks
    and Friends
    Robin Ranjit Singh Chauhan
    [email protected]
    Pathway Intelligence Inc


  86. RL Algo Zoo
    ● Discrete / Continuous
    ● Model-based / Model-Free
    ● On- / Off-policy
    ● Derivative-based / not
    ● Memory, Imagination
    ● Imitation, Inverse
    ● Hierarchical
    ● Mastery / Generalization
    ● Scalability
    ● Sample efficiency?
    Image credit: Aaron Schumacher and Berkeley Deep RL Bootcamp , plus
    additions in red by Robin Chauhan
    Algorithms shown in the figure: GAE, V-trace (Impala), Dyna-Q family, AlphaGo, AlphaZero,
    MPPI, MMC, PAL, HER, GPS+DDP, Sarsa, Distro-DQN, TD Search, DDO, FRL, MAXQ, Options,
    UNREAL, HAM, OptCrit, hDQN, iLQR, MPC, Pri-Sweep, ReinfPlan, NAC, ACER, A0C, Rainbow,
    MERLIN, DQRN (POMDP)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 86


  87. RL + DL = GI
    ● Single agent for any human level task
    ● RL defines objective
    ● DL gives the mechanism
    Above text paraphrased from: Tutorial on Deep Reinforcement Learning
    David Silver ICML 2016
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 87


  88. Arcade Learning Environment (ALE)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 88
    Image credit: Marc Bellemare


  89. Deep Q Networks
    ● Key insight: Use Deep Learning for Q
    function approximator
    Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015


  90. Image credit: Human-level control through deep reinforcement
    learning, Mnih et al, 2015
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 90


  91. Imma let you finish, but...
    ….are Atari games
    really that hard?
    (What was astonishing here?)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 91


  92. 92
    Image credit: Human-level control through deep reinforcement learning,
    Mnih et al, 2015
    Hyperparameters!
    Hyperparameters!
    Hyperparameters!
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 92


  93. K: Kolmogorov complexity; more weight on simpler tasks
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 93


  94. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 94
    Image credit: OpenAI https://blog.openai.com/ai-and-compute
    1 day on Sunway
    TaihuLight peak perf
    1 day on NVIDIA
    DGX-2: 16 Volta GPUs
    $400k
    HPC stats from top500.org
    4 of the 5 most
    data-hungry AI
    training runs
    are RL
    1 day on Intel Core i9
    Extreme ($2k chip,
    CPU perf only)
    1 day on DOE Summit
    w/GPUs


  95. DQN Innovations
    Q-Learning: based on TD learning
    +
    Deep Learning : Q function approximation
    +
    Experience Replay: stabilizes learning (Lin 93)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 95
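    A minimal experience-replay ("XRP") buffer sketch (illustrative, not the deck's implementation):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=50000):
            self.buffer = deque(maxlen=capacity)          # oldest transitions drop off

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # uniform random mini-batches break the correlation between consecutive steps
            return random.sample(self.buffer, batch_size)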


  96. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 96
    DQN Setup: Training
    Agent
    DQN
    Training Algo
    XRP
    Memory
    Agent network
    “Behaviour”/
    Exploration
    Policy
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  97. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 97
    DQN Setup: Test
    Agent
    Agent network
    Exploration
    Policy
    Agent
    Agent network
    Target Policy
    argmax
    Q(s,a)
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  98. DQN+XRP Algo: Intuition
    ● Starting policy
    ○ try random actions (Epsilon-Greedy Policy)
    ○ eventually get some reward (Environment)
    ● Remember each transition
    ○ space-limited memory (XRP memory)
    ● Train in little bits as we go
    ○ Train on a few random memories (from XRP memory)
    ○ Stretching reward backwards in time for remembered state (TD learning)
    ○ Train to learn stretched future reward from remembered state (Deep Learning)
    ○ Train to generalize over similar states (Deep Learning)
    ● Final policy
    ○ always choose actions which Q model says will reward well from this state (Greedy policy)
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 98
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  99. ● Reward Causality: TD Learning
    ○ temporal credit assignment for Q values
    ○ stretch rewards back in time; provide objective
    ● Value Estimator: Q function
    ○ Q( state, action ) -> future reward
    ○ Compute Q for each a, pick best a
    ● Value Estimator Generalization: Deep Learning
    ○ Q function is FF NN
    ○ generalization: predicts values for unseen states
    ● Memory: Experience replay
    ○ Improve learning convergence
    ○ Remember some transitions: ( state, action, next state, reward )
    ○ Q is trained on random memory mini-batches, not just live experience
    DQN+XRP: Components
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 99
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  100. DQN+XRP Algo: Informal
    ● Forward step
    ○ Agent gets State from Env
    ○ Env gets Action from Agent
    ■ Greedy: action w / max Q
    ■ Epsilon greedy: random
    ○ Env gives Reward to Agent
    ● Backup step
    ○ Mem stores ( state, action, next state, reward )
    ○ Train Q network **
    ■ Sample random mini-batch from mem
    ■ Update target Q values** for mini-batch
    ■ Mini-batch gradient descent
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 100
    Image credit: Robin Chauhan, Pathway Intelligence Inc.
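    A sketch of the "update target Q values / mini-batch gradient descent" step above (assumes a Keras-style model with predict/fit; names and hyperparameters are illustrative, not the deck's code):

    import numpy as np

    def train_on_batch(q_model, batch, gamma=0.99):
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        q = q_model.predict(states)                        # current Q(s, .) estimates
        q_next = q_model.predict(next_states)              # bootstrap values Q(s', .)
        targets = rewards + gamma * q_next.max(axis=1) * (1.0 - dones)
        q[np.arange(len(batch)), actions] = targets        # only the taken action's target moves
        q_model.fit(states, q, verbose=0)                  # regression toward the TD targets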


  101. DQN+XRP Algo: Formal
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 101
    Algorithm 1 credit: Playing Atari with Deep Reinforcement Learning, Mnih et al 2013
    Diagram credit: Robin Chauhan, Pathway Intelligence Inc.


  102. Algorithm 1 credit: Human-level control through deep reinforcement Learning,
    Mnih et al 2015 Diagram credit: Robin Chauhan, Pathway Intelligence Inc.
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 102


  103. Image credit: Human-level control through deep reinforcement learning, Mnih et al, 2015


  104. DQN Limits
    ● Discrete only
    ○ Continuous variant: NAF
    ● TD(0) slow
    ○ Needs lots of samples
    ○ n-step DQN, TD( λ )
    ● XRP can consume lots of Memory
    ● Epsilon-Greedy exploration weak
    ○ No systematic exploration
    ○ Ignores repetition
    ● Model-Free is dumb
    ○ No learning of MDP -> No planning possible
    ○ Eg. Dyna-Q variants
    ○ Not always feasible
    104
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 104


  105. DQN Variants
    Image credit: Justesen et al, Deep Learning for Video Game Playing
    105
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 105


  106. Detail: DDQN Fixed Target Network
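    For reference (standard formulations; the slide detail itself is a diagram): with a separate, periodically copied target network θ⁻, the DQN target is y = r + γ max_a′ Q(s′, a′; θ⁻), which keeps the regression target from chasing the network being trained; Double DQN additionally selects the action with the online network but evaluates it with the target network, y = r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ⁻), reducing overestimation.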


  107. Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 107
    DDQN Setup: Training
    Agent
    DQN
    Training Algo
    XRP
    Memory
    Agent network
    Exploration
    Policy
    Image credit: Robin Chauhan, Pathway Intelligence Inc.


  108. Prioritized Replay
    ● Q network outputs less accurate for some states
    ● Focus learning on those
    ● Sample minibatch memories based on TD error
    XRP
    Memory
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 108
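    For reference, in the prioritized replay paper (Schaul et al. 2016) each transition i gets priority p_i = |δ_i| + ε from its TD error δ_i and is sampled with probability P(i) = p_i^α / Σ_k p_k^α; an importance-sampling weight w_i = (N · P(i))^-β corrects the resulting bias.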


  109. Distributional DQN
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 109
    Image: A Distributional Perspective on Reinforcement Learning,
    Bellemare et al 2017


  110. Distributional DQN
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 110
    Image: A Distributional Perspective on Reinforcement Learning,
    Bellemare et al 2017


  111. Dueling DQN
    Increases stability
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 111
    Image: Dueling Network Architectures for Deep Reinforcement Learning,
    Wang et al 2016


  112. Noisy Nets
    ● More efficient exploration than epsilon-greedy
    ● Parametric noise added to weights
    ● Parameters of the noise learned with gradient descent along with the
    remaining network weights
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 112
    Image: Noisy Networks for Exploration, Fortunato et al, 2017
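    For reference, a NoisyNet linear layer replaces y = Wx + b with y = (μ_W + σ_W ⊙ ε_W) x + (μ_b + σ_b ⊙ ε_b), where the μ and σ parameters are learned and the noise ε is resampled; exploration then comes from the learned weight noise rather than from ε-greedy action selection.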


  113. N-step DQN
    ● TD(0) updates only 1 timestep
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 113
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
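    For reference, the n-step target replaces the one-step TD(0) target with r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n max_a Q(s_{t+n}, a), so each update propagates reward n steps back instead of one.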


  114. Rainbow DQN
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 114
    Image: Rainbow: Combining Improvements in Deep Reinforcement
    Learning, Hessel et al 2017


  115. OpenAI gym: Baseline DQN
    import gym
    from baselines import deepq

    def main():
        env = gym.make("CartPole-v0")
        model = deepq.models.mlp([64])          # small MLP Q-network
        act = deepq.learn(
            env, q_func=model, lr=1e-3,
            max_timesteps=100000,
            buffer_size=50000,                  # experience replay size
            exploration_fraction=0.1,
            exploration_final_eps=0.02)
        act.save("cartpole_model.pkl")

    if __name__ == '__main__':
        main()
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 115
    Sample code from OpenAI Baselines: https://github.com/openai/baselines


  116. keras-rl
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 116
    ● Keras-based :) RL algos for OpenAI gym-type envs
    ○ https://github.com/keras-rl/keras-rl
    pip install keras-rl
    ● But: Development stalled :(
    ○ Needs a lot of love; But seems new maintainer has been found
    ● Worthy competitors
    ○ TensorForce
    ○ OpenAI Baseline implementations
    ○ Rllab, tensorflow/agents, Ray RLlib, anyrl (go), tensorflow-rl, ShangtongZhang/DeepRL,
    BlueWhale, SLM-Lab, pytorch-a2c-ppo-acktr, dennybritz/reinforcement-learning


  117. # Imports from the full keras-rl example (dqn_cartpole.py):
    import gym
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Flatten

    # Get the environment and extract the number of actions.
    env = gym.make('CartPole-v0')
    nb_actions = env.action_space.n
    # Next, we build a very simple model.
    model = Sequential()
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    model.add(Dense(16))
    model.add(Activation('relu'))
    model.add(Dense(nb_actions))
    model.add(Activation('linear'))
    keras-rl
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 117
    Sample code from keras-rl
    https://github.com/keras-rl/keras-rl/blob/master/examples/dqn_cartpole.py


  118. # Imports and ENV_NAME from the full keras-rl example (dqn_cartpole.py):
    from keras.optimizers import Adam
    from rl.agents.dqn import DQNAgent
    from rl.policy import BoltzmannQPolicy
    from rl.memory import SequentialMemory
    ENV_NAME = 'CartPole-v0'

    memory = SequentialMemory(limit=50000, window_length=1)
    policy = BoltzmannQPolicy()
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
                   nb_steps_warmup=10, target_model_update=100, policy=policy)
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])
    dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)
    dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)
    dqn.test(env, nb_episodes=5, visualize=True)
    keras-rl
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 118
    Sample code from keras-rl
    https://github.com/keras-rl/keras-rl/blob/master/examples/dqn_cartpole.py


  119. ● Volodymyr Mnih
    ○ PhD UofT under Hinton
    ○ Masters at UofA under Csaba Szepesvari
    (also at DeepMind)
    ● David Silver
    ○ Met Demis Hassabis at Cambridge
    ○ CTO of Elixir Studios (game company)
    ○ PhD @ UofA in RL
    ○ AlphaGo
    ● Koray Kavukcuoglu
    ○ NEC Labs ML
    ○ PhD: w/Yann LeCun, NYU
    ○ Aerospace
    People from first DQN Atari paper
    ● Alex Graves
    ○ UofT under Hinton
    ● Ioannis Antonoglou
    ○ Masters U of Edinburgh ML/AI
    ● Daan Wierstra
    ○ Masters Utrecht U (NL)
    ● Martin Riedmiller
    ○ Prof @ U of Freiburg (Germany)
    ○ PhD on neural controllers U of Karlsruhe
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 119


  120. "literally nothing is
    solved yet"
    Volodymyr Mnih
    Google DeepMind
    August 2017
    Berkeley Deep RL Bootcamp


  121. ● Self-Play, Multi-Agent Reinforcement Learning, AlphaZero + AlphaXos
    ● RL in Medicine
    ● Actor-Critic methods
    ● Selected RL papers 2017-2018
    ● Artificial General Intelligence
    Possible Next Talks
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 121


  122. Questions
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 122


  123. ● reinforcement learning: an introduction by sutton and barto
    http://incompleteideas.net/book/bookdraft2018jan1.pdf
    ● david silver's RL course: https://www.youtube.com/watch?v=2pWv7GOvuf0
    ● Berkeley Deep RL bootcamp:
    https://sites.google.com/view/deep-rl-bootcamp/lectures
    ● openai gym: https://gym.openai.com/
    ● arxiv.org
    Resources
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 123


  124. Thank you!
    Robin Ranjit Singh Chauhan
    [email protected]
    https://github.com/pathway
    https://ca.linkedin.com/in/robinc
    https://pathway.com/aiml
    124
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 124


  125. Appendix
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 125


  126. Image credit: Human-level control through deep reinforcement learning, Mnih et al,
    2015
    Intro to RL+DQN by Robin Chauhan, Pathway Intelligence Inc. 126
