
Erik Daxberger - Introduction to Deep Reinforcement Learning

The London-based AI company DeepMind recently gained considerable attention after succeeding in developing a single AI agent capable of attaining human-level performance on a wide range of Atari video games - entirely self-taught and using only the raw pixels and game scores as input. In 2016, DeepMind again made headlines when its self-taught AI system AlphaGo succeeded in beating a world champion at the board game of Go, a feat that experts expected to be at least a decade away. What both systems have in common is that they are fundamentally grounded in a technique called Deep Reinforcement Learning. In this talk, we will demystify the mechanisms underlying this increasingly popular Machine Learning approach, which combines the agent-centered paradigm of Reinforcement Learning with state-of-the-art Deep Learning techniques.


Munich DataGeeks

May 04, 2017

Transcript

  1. Introduction - Motivation
     Ultimate goal of ML/AI research: general-purpose intelligence, i.e. an intelligent agent excelling at a wide variety of human-level tasks. The real world is insanely complex → use video games as a testbed! Is it possible to design a single AI agent that can play a wide variety of different games, end-to-end, at a human level?
  2. Introduction - Motivation
     Recently, DeepMind published a paper succeeding at that task. Shortly after, DeepMind was bought by Google for $500M and published a Nature cover paper.
  3. Introduction - Motivation
     But how?! → deep reinforcement learning!
  4. Introduction - Contents
     1 Introduction · 2 Reinforcement Learning · 3 Deep Q-Learning · 4 Demo: Learning to Play Pong · 5 Conclusions
  5. Contents (section divider: 2 Reinforcement Learning)
  6. Reinforcement Learning - Machine Learning Paradigms
     1 Supervised learning
       • in: example input/output pairs (given by a "teacher")
       • out: a general mapping rule
       • example: learn to detect dogs in images → classification
     2 Unsupervised learning
       • in: inputs without outputs
       • out: hidden structure in the inputs
       • example: learn groups of different animals → clustering
     3 Reinforcement learning
       • in: observations and feedback (rewards or punishments) from interacting with an environment
       • out: achieve some goal
       • example: learn to play Atari games from scratch
  7. Reinforcement Learning - The RL Setting
     RL is a general-purpose framework for AI. At each timestep t, an agent interacts with an environment E and
       • observes state s_t ∈ R^d
       • receives reward r_t ∈ R
       • executes action a_t ∈ A
     Reward signals may be sparse, noisy and delayed. → agent-environment loop (a minimal toy sketch follows below)
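     As a minimal toy sketch of that loop (my own illustration in Python, not code from the talk; the environment and its reward are made up):

        import random

        def toy_env_step(state, action):
            # Hypothetical 1-D environment E: move left/right, reward +1 upon reaching position 5.
            next_state = state + (1 if action == "RIGHT" else -1)
            reward = 1.0 if next_state == 5 else 0.0
            return next_state, reward

        state = 0
        for t in range(20):
            action = random.choice(["LEFT", "RIGHT"])     # a_t (here: a random policy)
            state, reward = toy_env_step(state, action)   # environment returns s_{t+1} and r_t
            print(t, state, reward)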
  8. Reinforcement Learning - Example: TicTacToe
     • states s_t: board configurations of X and O (board diagrams from the slide omitted)
     • rewards: r_t = +1 if we won, −1 if we lost, 0 otherwise
     • actions a_t: placing an X on one of the free squares
  9. Reinforcement Learning - Reinforcement Learning vs. Deep Learning
     Why don't we just use (supervised) Deep Learning?!
     Reinforcement Learning vs. Deep Learning:
       • sparse and noisy rewards vs. hand-labeled training data
       • delay between actions and rewards (credit assignment problem) vs. direct association between inputs and outputs
       • highly correlated inputs from a non-stationary data distribution vs. i.i.d. data samples
  10. Reinforcement Learning - Fundamental RL Concepts
     • Policy π: a = π(s)
     • Return (cumulative discounted reward): R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}
     • Value function: Q^π(s, a) = E[R_t | s, a] = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s, a]
     • Optimal value function: Q*(s, a) = max_π Q^π(s, a)
     (A small numeric sketch of R_t follows below.)
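     To make the return concrete, here is a small sketch (my own, not from the slides) computing R_t for a reward sequence with γ = 0.9:

        # Discounted return R_t = sum over t' >= t of gamma^(t'-t) * r_{t'} for a toy reward sequence.
        def discounted_return(rewards, gamma=0.9):
            R = 0.0
            for k, r in enumerate(rewards):   # k = t' - t
                R += (gamma ** k) * r
            return R

        # Five steps of zero reward followed by a win (+1), as in the TicTacToe example below:
        print(discounted_return([0, 0, 0, 0, 0, 1]))  # ≈ 0.59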
  11. Reinforcement Learning - Example: TicTacToe
     • Policy π(s): maps each board configuration to the move to play (board diagrams from the slide omitted)
     • Let γ = 0.9. Then R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = 0 + 0.9 × 0 + ... + 0.9^5 × 1 ≈ 0.59
     • Value function Q^π(s, a): a table with one entry per (board configuration, move) pair, e.g. 0.3, 0.5, ..., 0.9
  12. Reinforcement Learning - Bellman Equation and Value Iteration
     The optimal value function can be unrolled recursively → Bellman equation:
       Q*(s, a) = max_π E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s, a] = E_{s'∼E}[r + γ max_{a'} Q*(s', a') | s, a]
     Value iteration algorithms estimate Q* using iterative Bellman updates:
       Q_{i+1}(s, a) = E[r + γ max_{a'} Q_i(s', a') | s, a]
     This procedure is guaranteed to converge, i.e. Q_i → Q* as i → ∞. (A tabular sketch of the updates follows below.)
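     To illustrate the updates, a minimal tabular value-iteration sketch (my own toy MDP with known transitions, not an example from the talk):

        import numpy as np

        # Toy MDP (made up): 2 states, 2 actions; P[s][a] = list of (prob, next_state, reward).
        P = {
            0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
            1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
        }
        gamma = 0.9
        Q = np.zeros((2, 2))

        # Iterative Bellman update: Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
        for i in range(100):
            Q_new = np.zeros_like(Q)
            for s in P:
                for a in P[s]:
                    Q_new[s, a] = sum(p * (r + gamma * Q[s_next].max())
                                      for p, s_next, r in P[s][a])
            Q = Q_new

        print(Q)  # approaches Q* as the number of iterations grows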
  13. Reinforcement Learning - Function Approximators
     These tabular methods are highly impractical:
     1 All Q-values are stored separately → technical challenge
       Example: chess has ~10^47 states and ~35 possible moves per state
       → 10^47 × 35 × 1 byte ≈ 10^27 zettabytes (the entire Internet is about 10 zettabytes...)
     2 No generalization over unseen states → "dumb" approach
     =⇒ use a function approximator to estimate Q! (A quick check of the storage estimate follows below.)
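     A back-of-the-envelope check of that storage estimate (my own arithmetic, taking 1 zettabyte = 10^21 bytes):

        # Tabular Q-function over ~10^47 chess states × 35 moves, at 1 byte per entry.
        bytes_needed = 1e47 * 35 * 1        # ≈ 3.5e48 bytes
        zettabytes = bytes_needed / 1e21    # 1 ZB = 1e21 bytes
        print(f"{zettabytes:.1e} ZB")       # ≈ 3.5e27 ZB, i.e. on the order of 10^27 zettabytes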
  14. Contents (section divider: 3 Deep Q-Learning)
  15. Deep Q-Learning - DQN Idea
     Approximate Q by an ANN with weights θ → deep Q-network (DQN): Q(s, a; θ) ≈ Q^π(s, a)
     Loss function = MSE in Q-values, where r + γ max_{a'} Q(s', a'; θ) is the target:
       L(θ) = E_{s,a∼ρ(·); s'∼E}[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ))^2]
     Q-learning gradient:
       ∇_θ L(θ) = E_{s,a∼ρ(·); s'∼E}[(r + γ max_{a'} Q(s', a'; θ) − Q(s, a; θ)) ∇_θ Q(s, a; θ)]
     → Optimize L(θ) with ∇_θ L(θ) using stochastic gradient descent. (A small numpy sketch of one such update follows below.)
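     To illustrate the loss and its gradient, a minimal sketch of one stochastic gradient step (my own toy example using a linear Q-function Q(s, a; θ) = θ[a]·s, not the network from the paper):

        import numpy as np

        gamma, lr = 0.99, 0.01
        n_actions, state_dim = 2, 4
        theta = np.zeros((n_actions, state_dim))   # one weight vector per action

        def q_values(state, weights):
            return weights @ state                 # Q(s, a; theta) for all actions a

        # A single (s, a, r, s') transition with random toy data.
        s = np.random.randn(state_dim)
        a = 0
        r = 1.0
        s_next = np.random.randn(state_dim)

        # Target y = r + gamma * max_a' Q(s', a'; theta); TD error = y - Q(s, a; theta).
        target = r + gamma * q_values(s_next, theta).max()
        td_error = target - q_values(s, theta)[a]

        # For a linear Q-function, the gradient of Q(s, a; theta) w.r.t. theta[a] is s,
        # so this SGD step moves Q(s, a; theta) towards the target.
        theta[a] += lr * td_error * s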
  16. Deep Q-Learning - Stability Issues & Solutions
     Standard Q-learning oscillates or diverges when using ANNs:
     1 Sequential nature of the data → non-i.i.d., as successive samples are (highly) correlated
     2 Policy changes rapidly with slight changes to Q-values → the policy may oscillate and the data distribution may swing between extremes
     Tweaks DQN uses to stabilize training:
     1 Experience replay: store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) in a replay memory D = {e_1, ..., e_N} → apply minibatch updates to samples e ∼ D (a sketch follows below)
     2 Frozen target network: hold the previous parameters θ⁻ fixed in the Q-learning target when optimizing L_i(θ), and update them only periodically, i.e. θ⁻ ← θ
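     As a sketch of the experience-replay idea (my own minimal version, not DeepMind's implementation), the replay memory can be as simple as a bounded deque from which minibatches are sampled:

        import random
        from collections import deque

        class ReplayMemory:
            """Stores experiences e_t = (s_t, a_t, r_t, s_{t+1}) and samples minibatches from them."""
            def __init__(self, capacity=100000):
                self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

            def store(self, state, action, reward, next_state):
                self.buffer.append((state, action, reward, next_state))

            def sample(self, batch_size=32):
                return random.sample(self.buffer, batch_size)

     Sampling uniformly from this memory breaks up the correlation between successive frames; the frozen target network θ⁻ would simply be a second copy of the Q-network's weights, synchronized with θ only every fixed number of steps.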
  17. Deep Q-Learning - DQN Pseudocode
     Algorithm: Deep Q-learning
       Initialize Q-function with random weights
       for episode = 1, ..., M do
         for t = 1, ..., T do
           With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
           Execute a_t in the emulator and observe reward r_t and image s_{t+1}
           Store transition (s_t, a_t, r_t, s_{t+1}) in D
           Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
           Set y_j = r_j for terminal s_{j+1}, and y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ) for non-terminal s_{j+1}
           Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
         end
       end
     (A small sketch of the ε-greedy action selection follows below.)
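     The action-selection line is the usual ε-greedy rule; a minimal sketch (assuming some q_values_for_state array with one Q-value per action, which is my own placeholder):

        import random
        import numpy as np

        def epsilon_greedy(q_values_for_state, epsilon=0.1):
            """Pick a random action with probability epsilon, otherwise the greedy action."""
            if random.random() < epsilon:
                return random.randrange(len(q_values_for_state))
            return int(np.argmax(q_values_for_state))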
  18. Deep Q-Learning - DQN Atari Results
     (Results chart from the slide not reproduced in the transcript.)
  19. Contents (section divider: 4 Demo: Learning to Play Pong)
  20. Demo: Learning to Play Pong - Framework
     Goal: an agent that learns how to play Pong, end-to-end
       • input: raw image frame in the form of a 210 × 160 × 3 byte array, i.e. s_t ∈ {0, ..., 255}^100800
       • output: a distribution over actions ρ(a_t) (→ stochastic policy, i.e. we sample a_t ∼ ρ(a_t) at every t), with a_t ∈ {UP, DOWN}
       • → the game emulator executes a_t and emits s_{t+1} and a reward: r_t = +1 if we score a point, −1 if our opponent scores a point, 0 otherwise
     (A frame-preprocessing sketch follows below.)
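     Before frames are fed to a network they are usually preprocessed; this sketch follows the approach in Karpathy's "Pong from Pixels" post (listed in the references) and is a paraphrase of that idea, not code shown in the talk:

        import numpy as np

        def preprocess(frame):
            """Turn a 210x160x3 uint8 Pong frame into an 80x80 = 6400-dim binary float vector."""
            img = frame[35:195].astype(np.float64)  # crop away score bar and bottom border, copy to float
            img = img[::2, ::2, 0]                  # downsample by a factor of 2, keep one colour channel
            img[img == 144] = 0                     # erase background colour 1
            img[img == 109] = 0                     # erase background colour 2
            img[img != 0] = 1                       # paddles and ball become 1
            return img.ravel()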
  21. Demo: Learning to Play Pong - Architecture
     Policy network: a 2-layer fully-connected ANN with 200 hidden nodes
     → Goal: find the optimal weights θ of the policy network (a forward-pass sketch follows below)
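     A minimal numpy sketch of such a policy network's forward pass, again modelled on Karpathy's post rather than on code shown in the talk; the 6400-dimensional input assumes the preprocessing sketched above:

        import numpy as np

        D, H = 80 * 80, 200                               # input dimension (assumed) and hidden units
        model = {
            "W1": np.random.randn(H, D) / np.sqrt(D),     # input -> hidden weights
            "W2": np.random.randn(H) / np.sqrt(H),        # hidden -> single output weights
        }

        def policy_forward(x):
            """Return P(action = UP | x) and the hidden activations (needed later for backprop)."""
            h = np.maximum(0, model["W1"] @ x)            # ReLU hidden layer
            logit = model["W2"] @ h
            p_up = 1.0 / (1.0 + np.exp(-logit))           # sigmoid output
            return p_up, h

        # Sampling the stochastic policy: a_t = UP with probability p_up, DOWN otherwise, e.g.
        #   p_up, h = policy_forward(preprocess(frame))
        #   action = "UP" if np.random.uniform() < p_up else "DOWN"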
  22. Demo: Learning to Play Pong - Implementation
     Implemented in Python, using only numpy and OpenAI Gym:

        import gym

        env = gym.make("Pong-v0")
        s_0 = env.reset()                              # initial observation
        for _ in range(1000):
            env.render()
            a_t = env.action_space.sample()            # pick an action (here: at random)
            s_t1, r_t1, done, info = env.step(a_t)     # observe s_{t+1}, r_{t+1}
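     Note that this snippet drives the emulator with random actions via env.action_space.sample(); in the trained agent, the action would instead be sampled from the policy network sketched above, with its UP/DOWN output mapped onto the corresponding Gym action indices.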
  23. Demo: Learning to Play Pong - Results
     (Plot: evolution of the reward r_t, mean ± std, over roughly 10,000 training episodes; r_t axis from −20 to 10.)
  24. Contents (section divider: 5 Conclusions)
  25. Conclusions - Summary
     • RL = a general-purpose AI framework, modelled through the agent-environment loop
     • can solve problems involving sparse, noisy and delayed rewards as well as correlated inputs
     • the agent's goal: find the policy maximizing its return
     • → use iterative Bellman updates to find the optimal value function
     • storing Q in a table is infeasible and "dumb" → use a function approximator
     • DQN uses an ANN to approximate Q → exploits state-of-the-art deep learning techniques
     • possibility of oscillation/divergence → stabilize training using experience replay and a frozen target network
  26. Conclusions - Now what about AlphaGo?!
     Chess vs. Go: 10^47 states vs. 10^170 states; rule-based vs. intuition-based.
     So how does AlphaGo work? Deep reinforcement learning + Monte Carlo tree search
     → Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
  27. Conclusions - Open Problems and Current Research
     Despite the empirical successes, there are still many fundamental unsolved problems. For example, current algorithms
       • are bad at long-term planning (i.e. act myopically)
       • need a long training time (i.e. have low data efficiency)
       • are unable to understand abstract concepts (i.e. lack a model)
     → Deep reinforcement learning is a very hot research topic!
  28. Conclusions - Ok, so what's next?!
  29. Thank you for your attention!
  30. References
     • Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
     • Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
     • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
     • Karpathy, Andrej. "Deep Reinforcement Learning: Pong from Pixels." http://karpathy.github.io/2016/05/31/rl/ (accessed 04.05.2017).
     • Brockman, Greg, et al. "OpenAI Gym." arXiv preprint arXiv:1606.01540 (2016) and https://gym.openai.com/ (accessed 04.05.2017).
     • Silver, David. "Deep Reinforcement Learning." Tutorial at ICLR (2015), http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf (accessed 04.05.2017).