Slide 1

Slide 1 text

Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller
An explanatory tutorial assembled by: Liang Gong, Electrical Engineering & Computer Science, University of California, Berkeley.

Slide 2

Slide 2 text

What is Deep Reinforcement Learning?
• Simply speaking: Q-learning + Neural Networks
• What is Q-learning?
• Q-Learning (a simple example)
• Q-Learning on Atari Games
• Why use it with Neural Networks?

Slide 3

Slide 3 text

What is Deep Reinforcement Learning?
• Simply speaking: Q-learning + Neural Networks
• What is Q-learning?
• Q-Learning (a simple example)
• Q-Learning on Atari Games
• Why use it with Neural Networks?

Slide 4

Slide 4 text

A Simple Tutorial on Q-learning
• Suppose we have a house with 5 rooms
• Starting from any room
• Destination: 5
Source: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

Slide 5

Slide 5 text

A Simple Tutorial on Q-learning
• Suppose we have a house with 5 rooms
• Starting from any room
• Destination: 5
(Diagram: rooms/states labeled 0, 1, 2, 3, 4, 5)

Slide 6

Slide 6 text

A Simple Tutorial on Q-learning
• Suppose we have a house with 5 rooms
• Starting from any room
• Destination: 5
(Diagram: the rooms 0, 1, 2, 3, 4, 5 drawn as a graph of states connected by doors)

Slide 7

Slide 7 text

A Simple Tutorial on Q-learning
• Destination: 5
• One-step reward matrix R (fixed)
• Q-matrix: Q (changing)
• Both map State * Action -> Reward
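To make the reward matrix concrete, here is a minimal sketch in Python, assuming the room layout from the linked mnemstudio tutorial (doors 0-4, 1-3, 1-5, 2-3, 3-4, 4-5; -1 marks moves with no door, 100 rewards any move into the goal state 5):

import numpy as np

# One-step reward matrix R: rows = current state, columns = action (the room moved to).
#  -1 : no door between the two rooms (action not allowed)
#   0 : a door exists, but the move does not reach the goal
# 100 : the move enters the goal state (5)
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # from room 0
    [-1, -1, -1,  0, -1, 100],   # from room 1
    [-1, -1, -1,  0, -1,  -1],   # from room 2
    [-1,  0,  0, -1,  0,  -1],   # from room 3
    [ 0, -1, -1,  0, -1, 100],   # from room 4
    [-1,  0, -1, -1,  0, 100],   # from room 5 (goal; it can loop to itself)
])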

Slide 8

Slide 8 text

A Simple Tutorial on Q-learning
• Destination: 5
• The goal of learning is a Q-graph (or matrix)
• At any state, the Q-graph tells us the optimal next state (in order to reach the goal state as quickly as possible)

Slide 9

Slide 9 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Initialize the Q-matrix to all zeros
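A minimal sketch of the initialization and of this update rule, reusing the R matrix sketched above and Gamma = 0.8 as used in the following slides:

import numpy as np

GAMMA = 0.8                          # discount factor ("Gamma" in the slides)
Q = np.zeros_like(R, dtype=float)    # initialize the Q-matrix to all zeros

def q_update(state, action):
    """One Q-learning update for taking `action` (moving to that room) in `state`."""
    next_state = action                      # in this example the action IS the next room
    allowed = R[next_state] >= 0             # actions that have a door from the next state
    Q[state, action] = R[state, action] + GAMMA * Q[next_state, allowed].max()
    return Q[state, action]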

Slide 10

Slide 10 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Randomly pick a state as the initial state: 1
• Possible next states: 3, 5; pick 5 as the next state

Slide 11

Slide 11 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

Slide 12

Slide 12 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Randomly pick a state: 3
• Possible next states: 1, 2, 4; randomly pick 1 as the next state

Slide 13

Slide 13 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
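The two hand calculations above can be reproduced with the q_update helper sketched earlier (assuming a fresh all-zero Q-matrix before the first call):

# Step 1: from state 1 move to 5 -> Q(1, 5) = 100 + 0.8 * 0 = 100
print(q_update(1, 5))   # 100.0

# Step 2: from state 3 move to 1 -> Q(3, 1) = 0 + 0.8 * max(Q(1, 3), Q(1, 5)) = 80
print(q_update(3, 1))   # 80.0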

Slide 14

Slide 14 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Repeat the process over and over -> convergence
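A minimal sketch of the full training loop, reusing R and q_update from above; the episode count (1000) and the purely random exploration policy are illustrative choices, not taken from the slides:

import random

GOAL = 5

for episode in range(1000):              # "repeat the process over and over"
    state = random.randrange(6)          # start from a random state
    while state != GOAL:                 # walk until the goal state is reached
        actions = [a for a in range(6) if R[state, a] >= 0]  # moves with a door
        action = random.choice(actions)  # explore randomly
        q_update(state, action)          # update Q for this (state, action)
        state = action                   # the action is the room we move to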

Slide 15

Slide 15 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Repeat the process over and over -> convergence
• Normalize
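Normalization here just rescales the converged Q-matrix so that its largest entry becomes 100, which makes the values easier to compare; a one-liner, assuming the Q produced by the loop above:

Q_normalized = Q / Q.max() * 100   # rescale so the best entry equals 100
print(Q_normalized.round())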

Slide 16

Slide 16 text

A Simple Tutorial on Q-learning
• Calculate the Q-matrix:
• Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
• Repeat the process over and over -> convergence
• Normalize

Slide 17

Slide 17 text

What is Deep Reinforcement Learning?
• Simply speaking: Q-learning + Neural Networks
• What is Q-learning?
• Q-Learning (a simple example)
• Q-Learning on Atari Games
• Why use it with Neural Networks?

Slide 18

Slide 18 text

What is Deep Reinforcement Learning?
• Simply speaking: Q-learning + Neural Networks
• What is Q-learning?
• Q-Learning (a simple example)
• Q-Learning on Atari Games
• Why use it with Neural Networks?

Slide 19

Slide 19 text

Q-Learning on Atari Games
• Each unique screenshot corresponds to one state
• At any state, only three actions: left, right, or nop
• Number of possible states: colors^(width * height), which is enormous

Slide 20

Slide 20 text

What is Deep Reinforcement Learning?
• Simply speaking: Q-learning + Neural Networks
• What is Q-learning?
• Q-Learning (a simple example)
• Q-Learning on Atari Games
• Why use it with Neural Networks?

Slide 21

Slide 21 text

Q-Learning on Atari Games
• Each unique screenshot corresponds to one state
• At any state, only three actions: left, right, or nop
• Number of possible states: colors^(width * height), which is enormous
Problem:
• Cannot start from a random state on demand
• Impractical: iterating over all possible states and actions
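A back-of-the-envelope calculation shows why a tabular Q-matrix is hopeless here; the numbers below assume the raw Atari 2600 screen of 160 x 210 pixels with a 128-color palette and are only illustrative:

# Every pixel can independently take any color value, so the number of
# distinct raw screens (states) is colors^(width * height).
width, height, colors = 160, 210, 128
num_states = colors ** (width * height)
print(len(str(num_states)))   # roughly 70,000 decimal digits: far too many states to tabulate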

Slide 22

Slide 22 text

Q-Learning on Atari Games
• Each unique screenshot corresponds to one state
• At any state, only three actions: left, right, or nop
• Number of possible states: colors^(width * height), which is enormous
Problem:
• Cannot start from a random state on demand
• Impractical: iterating over all possible states and actions
Solution:
• Use a predictor to estimate (or approximate) the converged Q-matrix
• Neural networks seem to work well!

Slide 23

Slide 23 text

This is called the Deep Q-Network (DQN).

Slide 24

Slide 24 text

This is called the Deep Q-Network (DQN).
Each output of the network is an estimated reward for one action.
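A minimal PyTorch sketch of such a network with one output per action, loosely following the architecture described in the paper (a stack of 4 preprocessed 84x84 frames, two convolutional layers, one hidden fully connected layer); the layer sizes come from the paper, but this is an illustrative sketch, not the authors' code:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 stacked frames in, 16 feature maps out
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 9x9 spatial size after the two convs
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one estimated reward (Q-value) per action
        )

    def forward(self, x):
        return self.net(x)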

Slide 25

Slide 25 text

This is called the Deep Q-Network (DQN).
During training, with probability ε, pick a random action (otherwise pick the action with the highest estimated reward).
Observe the actual reward as y and the estimated reward as y'.
Loss = (y - y')^2

Slide 26

Slide 26 text

This is called the Deep Q-Network (DQN).
Loss = (y - y')^2
Update the network weights using gradient descent.
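A minimal sketch of one training step, assuming the QNetwork above, an epsilon value, and a single transition (state, action, reward, next_state, done) already collected from the emulator; it shows the ε-greedy action choice, the squared loss (y - y')^2, and the gradient-descent weight update. This is a simplification of the paper's algorithm, which additionally trains on minibatches sampled from an experience-replay memory; the discount value 0.99 is an assumed default, not taken from the slides.

import random
import torch

def select_action(q_net, state, epsilon, n_actions):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

def train_step(q_net, optimizer, state, action, reward, next_state, done, gamma=0.99):
    """One gradient-descent step on Loss = (y - y')^2 for a single transition."""
    with torch.no_grad():
        # y: observed reward plus the discounted best estimate at the next state
        target = reward + (0.0 if done else gamma * q_net(next_state.unsqueeze(0)).max().item())
    # y': the network's current estimate for the action actually taken
    prediction = q_net(state.unsqueeze(0))[0, action]
    loss = (prediction - target) ** 2
    optimizer.zero_grad()
    loss.backward()          # backpropagate the squared error
    optimizer.step()         # update the network weights
    return loss.item()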

Slide 27

Slide 27 text

The Math in the Paper
The original Q-learning formula (impractical to compute directly):
Q*(s, a) = E_{s' ~ ε} [ r + γ max_{a'} Q*(s', a') | s, a ]
• E[...]: expectation over the next state s' produced by the emulator ε
• r: reward for taking action a under state s
• γ: discount factor on future rewards
• max_{a'} Q*(s', a'): after action a, the maximal reward achievable by the next possible action

Slide 28

Slide 28 text

The Math in the Paper
Calculate the loss function at step i:
L_i(θ_i) = E_{s, a} [ ( y_i − Q(s, a; θ_i) )^2 ]
where y_i = E_{s'} [ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ] is the target built from the actual reward r from the Atari game, Q(s, a; θ_i) is the NN-predicted reward under state s, action a, and weights θ_i, and θ_i are the NN weights in step i.

Slide 29

Slide 29 text

The Math in the Paper
Calculate the gradient: differentiate L_i with respect to the weights θ_i:
∇_{θ_i} L_i(θ_i) = E_{s, a, s'} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ]

Slide 30

Slide 30 text

DNN Libraries
• Theano: https://github.com/Theano/Theano
• Convnetjs: https://github.com/karpathy/convnetjs
• DNN-JS-Library: https://github.com/EngageSoftware/DNN-JavaScript-Libraries