
Erik Daxberger - Introduction to Deep Reinforcement Learning

The London-based AI company DeepMind recently gained considerable attention after succeeding in developing a single AI agent capable of attaining human-level performance on a wide range of Atari video games - entirely self-taught and using only the raw pixels and game scores as input. In 2016, DeepMind again made headlines when its self-taught AI system AlphaGo succeeded in beating a world champion at the board game of Go, a feat that experts expected to be at least a decade away. What both systems have in common is that they are fundamentally grounded in a technique called Deep Reinforcement Learning. In this talk, we will demystify the mechanisms underlying this increasingly popular Machine Learning approach, which combines the agent-centered paradigm of Reinforcement Learning with state-of-the-art Deep Learning techniques.


Munich DataGeeks

May 04, 2017

Transcript

  1. Introduction - Motivation
     Ultimate goal of ML/AI research: general-purpose intelligence, i.e. an intelligent agent excelling at a wide variety of human-level tasks. The real world is insanely complex → use video games as a testbed! Is it possible to design a single AI agent that can play a wide variety of different games, end-to-end, at a human level?
  2. Introduction - Motivation
     Recently, DeepMind published a paper succeeding at that task. Shortly after, DeepMind was bought by Google for $500M and published a Nature cover paper.
  3. Introduction - Motivation
     But how?! → deep reinforcement learning!
  4. Introduction - Contents
     1 Introduction · 2 Reinforcement Learning · 3 Deep Q-Learning · 4 Demo: Learning to Play Pong · 5 Conclusions
  5. Contents (section divider: 2 Reinforcement Learning)
  6. Reinforcement Learning - Machine Learning Paradigms
     1 Supervised learning
       • in: example input/output pairs (given by a "teacher")
       • out: a general mapping rule
       • example: learn to detect dogs in images → classification
     2 Unsupervised learning
       • in: inputs without outputs
       • out: hidden structure in the inputs
       • example: learn groups of different animals → clustering
     3 Reinforcement learning
       • in: observations and feedback (rewards or punishments) from interacting with an environment
       • out: achieve some goal
       • example: learn to play Atari games from scratch
  7. Reinforcement Learning - The RL Setting
     RL is a general-purpose framework for AI. At each timestep t, an agent interacts with an environment E and
       • observes state s_t ∈ R^d
       • receives reward r_t ∈ R
       • executes action a_t ∈ A
     Reward signals may be sparse, noisy and delayed. → agent-environment loop (a minimal toy sketch follows below)
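     As a minimal toy sketch of that loop (my own illustration in Python, not code from the talk; the environment and its reward are made up):

        import random

        def toy_env_step(state, action):
            # Hypothetical 1-D environment E: move left/right, reward +1 upon reaching position 5.
            next_state = state + (1 if action == "RIGHT" else -1)
            reward = 1.0 if next_state == 5 else 0.0
            return next_state, reward

        state = 0
        for t in range(20):
            action = random.choice(["LEFT", "RIGHT"])     # a_t (here: a random policy)
            state, reward = toy_env_step(state, action)   # environment returns s_{t+1} and r_t
            print(t, state, reward)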
  8. Reinforcement Learning - Example: TicTacToe
     • states s_t: board configurations of X and O (board diagrams from the slide omitted)
     • rewards: r_t = +1 if we won, −1 if we lost, 0 otherwise
     • actions a_t: placing an X on one of the free squares
  9. Reinforcement Learning - Reinforcement Learning vs. Deep Learning
     Why don't we just use (supervised) Deep Learning?!
     Reinforcement Learning vs. Deep Learning:
       • sparse and noisy rewards vs. hand-labeled training data
       • delay between actions and rewards (credit assignment problem) vs. direct association between inputs and outputs
       • highly correlated inputs from a non-stationary data distribution vs. i.i.d. data samples
  10. Reinforcement Learning - Fundamental RL Concepts
     • Policy π: a = π(s)
     • Return (cumulative discounted reward): R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}
     • Value function: Q^π(s, a) = E[R_t | s, a] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s, a]
     • Optimal value function: Q*(s, a) = max_π Q^π(s, a)
     (A small numeric sketch of R_t follows below.)
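     To make the return concrete, here is a small sketch (my own, not from the slides) computing R_t for a reward sequence with γ = 0.9:

        # Discounted return R_t = sum over t' >= t of gamma^(t'-t) * r_{t'} for a toy reward sequence.
        def discounted_return(rewards, gamma=0.9):
            R = 0.0
            for k, r in enumerate(rewards):   # k = t' - t
                R += (gamma ** k) * r
            return R

        # Five steps of zero reward followed by a win (+1), as in the TicTacToe example below:
        print(discounted_return([0, 0, 0, 0, 0, 1]))  # ≈ 0.59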
  11. Reinforcement Learning - Example: TicTacToe
     • Policy π(s): maps each board configuration to the move to play (board diagrams from the slide omitted)
     • Let γ = 0.9. Then R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = 0 + 0.9 × 0 + ... + 0.9^5 × 1 ≈ 0.59
     • Value function Q^π(s, a): a table with one entry per (board configuration, move) pair, e.g. 0.3, 0.5, ..., 0.9
  12. Reinforcement Learning - Bellman Equation and Value Iteration
     The optimal value function can be unrolled recursively → Bellman equation:
       Q*(s, a) = max_π E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s, a] = E_{s'∼E}[r + γ max_{a'} Q*(s', a') | s, a]
     Value iteration algorithms estimate Q* using iterative Bellman updates:
       Q_{i+1}(s, a) = E[r + γ max_{a'} Q_i(s', a') | s, a]
     This procedure is guaranteed to converge, i.e. Q_i → Q* as i → ∞. (A tabular sketch of the updates follows below.)
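     To illustrate the updates, a minimal tabular value-iteration sketch (my own toy MDP with known transitions, not an example from the talk):

        import numpy as np

        # Toy MDP (made up): 2 states, 2 actions; P[s][a] = list of (prob, next_state, reward).
        P = {
            0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
            1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
        }
        gamma = 0.9
        Q = np.zeros((2, 2))

        # Iterative Bellman update: Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
        for i in range(100):
            Q_new = np.zeros_like(Q)
            for s in P:
                for a in P[s]:
                    Q_new[s, a] = sum(p * (r + gamma * Q[s_next].max())
                                      for p, s_next, r in P[s][a])
            Q = Q_new

        print(Q)  # approaches Q* as the number of iterations grows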
  13. Reinforcement Learning - Function Approximators
     These tabular methods are highly impractical:
     1 All Q-values are stored separately → technical challenge
       Example: chess has ~10^47 states and ~35 possible moves per state
       → 10^47 × 35 × 1 byte ≈ 10^27 zettabytes (the entire Internet is about 10 zettabytes...)
     2 No generalization over unseen states → "dumb" approach
     =⇒ use a function approximator to estimate Q! (A quick check of the storage estimate follows below.)
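     A back-of-the-envelope check of that storage estimate (my own arithmetic, taking 1 zettabyte = 10^21 bytes):

        # Tabular Q-function over ~10^47 chess states × 35 moves, at 1 byte per entry.
        bytes_needed = 1e47 * 35 * 1        # ≈ 3.5e48 bytes
        zettabytes = bytes_needed / 1e21    # 1 ZB = 1e21 bytes
        print(f"{zettabytes:.1e} ZB")       # ≈ 3.5e27 ZB, i.e. on the order of 10^27 zettabytes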
  14. Contents (section divider: 3 Deep Q-Learning)
  15. Deep Q-Learning - DQN Idea
     Approximate Q by an ANN with weights θ → deep Q-network (DQN): Q(s, a; θ) ≈ Q^π(s, a)
     Loss function = MSE in Q-values, where r + γ max_{a'} Q(s', a'; θ) is the target:
       L(θ) = E_{s,a∼ρ(·); s'∼E}[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))^2]
     Q-learning gradient:
       ∇_θ L(θ) = E_{s,a∼ρ(·); s'∼E}[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)) ∇_θ Q(s, a; θ)]
     → Optimize L(θ) with ∇_θ L(θ) using stochastic gradient descent. (A small numpy sketch of one such update follows below.)
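     To illustrate the loss and its gradient, a minimal sketch of one stochastic gradient step (my own toy example using a linear Q-function Q(s, a; θ) = θ[a]·s, not the network from the paper):

        import numpy as np

        gamma, lr = 0.99, 0.01
        n_actions, state_dim = 2, 4
        theta = np.zeros((n_actions, state_dim))   # one weight vector per action

        def q_values(state, weights):
            return weights @ state                 # Q(s, a; theta) for all actions a

        # A single (s, a, r, s') transition with random toy data.
        s = np.random.randn(state_dim)
        a = 0
        r = 1.0
        s_next = np.random.randn(state_dim)

        # Target y = r + gamma * max_a' Q(s', a'; theta); TD error = y - Q(s, a; theta).
        target = r + gamma * q_values(s_next, theta).max()
        td_error = target - q_values(s, theta)[a]

        # For a linear Q-function, the gradient of Q(s, a; theta) w.r.t. theta[a] is s,
        # so this SGD step moves Q(s, a; theta) towards the target.
        theta[a] += lr * td_error * s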
  16. Deep Q-Learning - Stability Issues & Solutions
     Standard Q-learning oscillates or diverges when using ANNs:
     1 Sequential nature of the data → non-i.i.d., as successive samples are (highly) correlated
     2 Policy changes rapidly with slight changes to Q-values → the policy may oscillate and the data distribution may swing between extremes
     Tweaks DQN uses to stabilize training:
     1 Experience replay: store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) in a replay memory D = {e_1, ..., e_N} → apply minibatch updates to samples e ∼ D (a sketch follows below)
     2 Frozen target network: hold the previous parameters θ⁻ fixed in the Q-learning target when optimizing L_i(θ), and update them only periodically, i.e. θ⁻ ← θ
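     As a sketch of the experience-replay idea (my own minimal version, not DeepMind's implementation), the replay memory can be as simple as a bounded deque from which minibatches are sampled:

        import random
        from collections import deque

        class ReplayMemory:
            """Stores experiences e_t = (s_t, a_t, r_t, s_{t+1}) and samples minibatches from them."""
            def __init__(self, capacity=100000):
                self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

            def store(self, state, action, reward, next_state):
                self.buffer.append((state, action, reward, next_state))

            def sample(self, batch_size=32):
                return random.sample(self.buffer, batch_size)

     Sampling uniformly from this memory breaks up the correlation between successive frames; the frozen target network θ⁻ would simply be a second copy of the Q-network's weights, synchronized with θ only every fixed number of steps.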
  17. Deep Q-Learning - DQN Pseudocode
     Algorithm: Deep Q-learning
       Initialize Q-function with random weights
       for episode = 1, ..., M do
         for t = 1, ..., T do
           With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
           Execute a_t in the emulator and observe reward r_t and image s_{t+1}
           Store transition (s_t, a_t, r_t, s_{t+1}) in D
           Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
           Set y_j = r_j for terminal s_{j+1}, and y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ) for non-terminal s_{j+1}
           Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
         end
       end
     (A small sketch of the ε-greedy action selection follows below.)
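     The action-selection line is the usual ε-greedy rule; a minimal sketch (assuming some q_values_for_state array with one Q-value per action, which is my own placeholder):

        import random
        import numpy as np

        def epsilon_greedy(q_values_for_state, epsilon=0.1):
            """Pick a random action with probability epsilon, otherwise the greedy action."""
            if random.random() < epsilon:
                return random.randrange(len(q_values_for_state))
            return int(np.argmax(q_values_for_state))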
  18. Deep Q-Learning - DQN Atari Results
     (Results chart from the slide not reproduced in the transcript.)
  19. Contents (section divider: 4 Demo: Learning to Play Pong)
  20. Demo: Learning to Play Pong - Framework
     Goal: an agent that learns how to play Pong, end-to-end
       • input: raw image frame in the form of a 210 × 160 × 3 byte array, i.e. s_t ∈ {0, ..., 255}^100800
       • output: a distribution over actions ρ(a_t) (→ stochastic policy, i.e. we sample a_t ∼ ρ(a_t) at every t), with a_t ∈ {UP, DOWN}
       • → the game emulator executes a_t and emits s_{t+1} and a reward: r_t = +1 if we score a point, −1 if our opponent scores a point, 0 otherwise
     (A frame-preprocessing sketch follows below.)
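     Before frames are fed to a network they are usually preprocessed; this sketch follows the approach in Karpathy's "Pong from Pixels" post (listed in the references) and is a paraphrase of that idea, not code shown in the talk:

        import numpy as np

        def preprocess(frame):
            """Turn a 210x160x3 uint8 Pong frame into an 80x80 = 6400-dim binary float vector."""
            img = frame[35:195].astype(np.float64)  # crop away score bar and bottom border, copy to float
            img = img[::2, ::2, 0]                  # downsample by a factor of 2, keep one colour channel
            img[img == 144] = 0                     # erase background colour 1
            img[img == 109] = 0                     # erase background colour 2
            img[img != 0] = 1                       # paddles and ball become 1
            return img.ravel()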
  21. Demo: Learning to Play Pong - Architecture
     Policy network: a 2-layer fully-connected ANN with 200 hidden nodes
     → Goal: find the optimal weights θ of the policy network (a forward-pass sketch follows below)
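     A minimal numpy sketch of such a policy network's forward pass, again modelled on Karpathy's post rather than on code shown in the talk; the 6400-dimensional input assumes the preprocessing sketched above:

        import numpy as np

        D, H = 80 * 80, 200                               # input dimension (assumed) and hidden units
        model = {
            "W1": np.random.randn(H, D) / np.sqrt(D),     # input -> hidden weights
            "W2": np.random.randn(H) / np.sqrt(H),        # hidden -> single output weights
        }

        def policy_forward(x):
            """Return P(action = UP | x) and the hidden activations (needed later for backprop)."""
            h = np.maximum(0, model["W1"] @ x)            # ReLU hidden layer
            logit = model["W2"] @ h
            p_up = 1.0 / (1.0 + np.exp(-logit))           # sigmoid output
            return p_up, h

        # Sampling the stochastic policy: a_t = UP with probability p_up, DOWN otherwise, e.g.
        #   p_up, h = policy_forward(preprocess(frame))
        #   action = "UP" if np.random.uniform() < p_up else "DOWN"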
  22. Demo: Learning to Play Pong - Implementation
     Implemented in Python, using only numpy and OpenAI Gym:

        import gym

        env = gym.make("Pong-v0")
        s_0 = env.reset()                              # initial observation
        for _ in range(1000):
            env.render()
            a_t = env.action_space.sample()            # pick an action (here: at random)
            s_t1, r_t1, done, info = env.step(a_t)     # observe s_{t+1}, r_{t+1}
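     Note that this snippet drives the emulator with random actions via env.action_space.sample(); in the trained agent, the action would instead be sampled from the policy network sketched above, with its UP/DOWN output mapped onto the corresponding Gym action indices.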
  23. Demo: Learning to Play Pong - Results
     (Plot: evolution of the reward r_t, mean ± std, over roughly 10,000 training episodes; r_t axis from −20 to 10.)
  24. Contents (section divider: 5 Conclusions)
  25. Conclusions - Summary
     • RL = a general-purpose AI framework, modelled through the agent-environment loop
     • can solve problems involving sparse, noisy and delayed rewards as well as correlated inputs
     • the agent's goal: find the policy maximizing its return
     • → use iterative Bellman updates to find the optimal value function
     • storing Q in a table is infeasible and "dumb" → use a function approximator
     • DQN uses an ANN to approximate Q → exploits state-of-the-art deep learning techniques
     • possibility of oscillation/divergence → stabilize training using experience replay and a frozen target network
  26. Conclusions - Now what about AlphaGo?!
     Chess vs. Go: 10^47 states vs. 10^170 states; rule-based vs. intuition-based.
     So how does AlphaGo work? Deep reinforcement learning + Monte Carlo tree search
     → Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
  27. Conclusions - Open Problems and Current Research
     Despite the empirical successes, there are still many fundamental unsolved problems. For example, current algorithms
       • are bad at long-term planning (i.e. act myopically)
       • need a long training time (i.e. have low data efficiency)
       • are unable to understand abstract concepts (i.e. lack a model)
     → Deep reinforcement learning is a very hot research topic!
  28. Conclusions - Ok, so what's next?!
  29. Thank you for your attention!
  30. References
     • Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
     • Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
     • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
     • Karpathy, Andrej. "Deep Reinforcement Learning: Pong from Pixels." http://karpathy.github.io/2016/05/31/rl/ (accessed 04.05.2017).
     • Brockman, Greg, et al. "OpenAI Gym." arXiv preprint arXiv:1606.01540 (2016) and https://gym.openai.com/ (accessed 04.05.2017).
     • Silver, David. "Deep Reinforcement Learning." Tutorial at ICLR (2015), http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf (accessed 04.05.2017).