Slide 1

Slide 1 text

Reinforcement Learning: First steps with Q-Learning

Slide 2

Slide 2 text

Who am I? Tomiwa Ijaware Software Engineer (Konga) Udacity ML nanodegree student e911miri@gmail.com

Slide 3

Slide 3 text

What is reinforcement learning? ● It involves allowing machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance. ● It involves learning how to map situations to actions so as to maximize a numerical reward signal. ● It explores how software agents should act to maximize cumulative reward.

Slide 4

Slide 4 text

Two notable mentions

Slide 5

Slide 5 text

AlphaGo ● Achieved a 99.8% win rate against other Go programs ● Combined deep neural networks with reinforcement learning ● Defeated a human professional 4-1

Slide 6

Slide 6 text

Google Self-Driving Car ● Over 1.5 million miles driven by a trained model ● It uses real sensors to tell where it is and what is around it ● It reads what to do next from its navigation planner

Slide 7

Slide 7 text

Where does it shine? ● Robotics ● Control systems ● Operations Research ● Games ● Economics

Slide 8

Slide 8 text

Q-Learning

Slide 9

Slide 9 text

What is Q-Learning? Q-learning is a model-free reinforcement learning technique. It can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
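For reference, the standard one-step Q-learning update is usually written as below. This is textbook background rather than something stated on the slide: $\alpha$ is the learning rate and $\gamma$ the discount factor (both covered on later slides), $r$ is the reward received for taking action $a$ in state $s$, and $s'$ is the state that follows.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

"Model-free" here means the update needs only observed transitions $(s, a, r, s')$, not the environment's transition probabilities.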

Slide 10

Slide 10 text

A few terms Some theory to get us started

Slide 11

Slide 11 text

States ● A state represents information about the agent’s position in its environment. ● A good state contains the location of the agent and its observable environment. ● In a game of tic-tac-toe, a state would be all the Xs and Os on the board. ● In a car, a state could contain traffic information, the directions from Google Maps, traffic lights and more.
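One possible way to encode the tic-tac-toe state from the slide in Python is as an immutable tuple, so it can be used as a dictionary key in a Q-table. The names `make_state`, `legal_actions` and `q_table` are made up for this sketch and are not taken from the talk's repository.

```python
# A tic-tac-toe state as an immutable, hashable tuple of 9 cells.
EMPTY, X, O = " ", "X", "O"

def make_state(cells):
    """Freeze a list of 9 cells into a tuple usable as a dict key."""
    if len(cells) != 9:
        raise ValueError("a tic-tac-toe board has exactly 9 cells")
    return tuple(cells)

def legal_actions(state):
    """Indices of empty cells, i.e. the moves available from this state."""
    return [i for i, cell in enumerate(state) if cell == EMPTY]

state = make_state([X, O, EMPTY,
                    EMPTY, X, EMPTY,
                    EMPTY, EMPTY, O])

# Q-values can then be keyed by (state, action) pairs.
q_table = {(state, action): 0.0 for action in legal_actions(state)}
```

A tuple is used instead of a list because lists are not hashable and so cannot serve as dictionary keys.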

Slide 12

Slide 12 text

Action ● This is a step you take. ● It leads you to another state ● It gives you a reward

Slide 13

Slide 13 text

Reward ● Numerical value representing what you get for taking an action. ● It could be positive or negative.

Slide 14

Slide 14 text

Long-term reward ● The cumulative reward for taking an action from a given state, divided by the number of times that action was taken from that state ● Q(s, a) = sum[R(s, a)] / t ● sum[R(s, a)] is the total reward collected after taking a from s ● t is the number of times a was taken from s
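The averaging idea on this slide can be sketched in a few lines of Python. The state and action labels (`"junction"`, `"wait"`) and the reward values are invented for illustration.

```python
from collections import defaultdict

totals = defaultdict(float)  # sum of rewards seen for each (state, action)
counts = defaultdict(int)    # t: how many times the action was taken from the state

def record(state, action, reward):
    """Add one observed reward and return the updated average Q(s, a)."""
    key = (state, action)
    totals[key] += reward
    counts[key] += 1
    return totals[key] / counts[key]

record("junction", "wait", 1.0)
record("junction", "wait", -1.0)
q = record("junction", "wait", 1.0)  # average of the rewards 1, -1 and 1
```

Keeping a running total and a count avoids storing every past reward.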

Slide 15

Slide 15 text

Optimal Action ● The action that gives the most long-term reward ● For example, obeying traffic rules means that LASTMA (the Lagos State Traffic Management Authority) does not catch you

Slide 16

Slide 16 text

Policies ● A policy tells the agent which action to take in each state; an optimal policy picks the optimal action in every state ● If we were training a car to drive in Lagos, the policy would be its personal rule book for surviving Lagos traffic
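Assuming the long-term rewards are stored in a table keyed by (state, action), a greedy policy can be read off by picking the best action in each state. The traffic-light states and the Q-values below are hypothetical numbers chosen for this sketch.

```python
# Hypothetical Q-values for a tiny traffic example.
q_table = {
    ("red_light", "stop"): 1.0,
    ("red_light", "go"): -5.0,
    ("green_light", "stop"): -0.5,
    ("green_light", "go"): 1.0,
}

def best_action(q, state):
    """Return the action with the highest Q-value in the given state."""
    actions = [a for (s, a) in q if s == state]
    return max(actions, key=lambda a: q[(state, a)])

def greedy_policy(q):
    """Map every state in the table to its best action."""
    states = {s for (s, _) in q}
    return {s: best_action(q, s) for s in states}

policy = greedy_policy(q_table)
```

Here `policy` works as the rule book: look up the current state, get the action to take.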

Slide 17

Slide 17 text

Some code goes here... https://github.com/e911miri/forloop-qlearning

Slide 18

Slide 18 text

Consider a game of blackjack ● Player’s hand: 10 + 2 = 12 ● Dealer’s hand: 3 + 10 = 13 Should I hit or stand?

Slide 19

Slide 19 text

Let’s model that problem as a Markov decision process ● States = My hand and the dealer’s hand ● Actions = Hit or stand ● Rewards (1 for a win, 0 for a draw, -1 for a loss) ● An ideal policy should tell me what to do in any given state.
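As a rough sketch of how this blackjack model could be trained with tabular Q-learning, the Python below uses heavily simplified rules. They are assumptions of this example, not something taken from the slides or the linked repository: cards come from an infinite deck, an ace always counts as 1, the state is the pair (player total, dealer total), the dealer draws until reaching 17 or more, and the second action is named "stand" as in standard blackjack terminology. The function names (`draw`, `step`, `train`), the epsilon-greedy exploration and the hyperparameter values are likewise illustrative.

```python
import random
from collections import defaultdict

ACTIONS = ("hit", "stand")
# Card values in an infinite deck; ace counted as 1 (simplification).
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def draw(rng):
    """Draw one card value at random."""
    return rng.choice(CARDS)

def step(state, action, rng):
    """Apply one action and return (next_state, reward, done)."""
    player, dealer = state
    if action == "hit":
        player += draw(rng)
        if player > 21:
            return (player, dealer), -1, True   # player busts and loses
        return (player, dealer), 0, False       # game continues
    # "stand": the dealer draws until reaching 17 or more, then hands are compared.
    while dealer < 17:
        dealer += draw(rng)
    if dealer > 21 or player > dealer:
        reward = 1
    elif player == dealer:
        reward = 0
    else:
        reward = -1
    return (player, dealer), reward, True

def train(episodes=20000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Run tabular Q-learning and return the Q-table keyed by (state, action)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        # Both hands start with two cards.
        state = (draw(rng) + draw(rng), draw(rng) + draw(rng))
        done = False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                        # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action, rng)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # One-step Q-learning update.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

After training with `q = train()`, the expression `max(ACTIONS, key=lambda a: q[((12, 13), a)])` gives the learned suggestion for the hand on the previous slide; the answer depends on the simplified rules and on the random seed.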

Slide 20

Slide 20 text

Learning process

Slide 21

Slide 21 text

To train it,

Slide 22

Slide 22 text

Learning rate ● This determines how much new information overrides old information. ● It ranges from 0 to 1 ● 0 means the agent learns nothing new ● 1 means the agent uses only the most recent information.
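The effect of the learning rate can be seen with a small Python sketch. The function name `blend` and the numbers are made up for illustration; the formula is the usual "move the old estimate toward the new information by a fraction alpha".

```python
def blend(old_estimate, new_information, alpha):
    """Move the old estimate toward the new information by a fraction alpha."""
    return old_estimate + alpha * (new_information - old_estimate)

old, new = 2.0, 10.0
no_learning = blend(old, new, alpha=0.0)   # stays at the old estimate
halfway = blend(old, new, alpha=0.5)       # midpoint of old and new
only_latest = blend(old, new, alpha=1.0)   # replaced by the new information
```

In noisy environments a small alpha is often preferred, since alpha = 1 throws away everything learned before the latest step.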

Slide 23

Slide 23 text

Discount Factor ● This regulates the importance of future rewards. ● It is a value from 0 to 1 ● 0 means consider only immediate rewards ● Values near 1 make the model strive for long-term rewards. ● At 1 and beyond, the sum of future rewards can grow without bound in tasks that never end, so the model may not converge
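To illustrate the discount factor, the sketch below computes the discounted sum of a short, invented reward sequence for different values of gamma. The helper name `discounted_return` is an assumption of this example.

```python
def discounted_return(rewards, gamma):
    """Sum rewards where the reward k steps ahead is weighted by gamma ** k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0, 10.0]  # a large reward arrives three steps later

myopic = discounted_return(rewards, gamma=0.0)      # only the first reward counts
farsighted = discounted_return(rewards, gamma=0.9)  # later rewards still matter
undiscounted = discounted_return(rewards, gamma=1.0)
```

With gamma = 0 the large final reward is ignored entirely, while with gamma = 0.9 it contributes 10 × 0.9³ = 7.29 to the total.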

Slide 24

Slide 24 text

Q-Learning in the real world Considering real world challenges and how to deal with those problems

Slide 25

Slide 25 text

References ● Brandon’s blogpost: http://outlace.com/Reinforcement-Learning-Part-1/ ● Udacity reinforcement learning course: https://www.udacity.com/course/machine-learning-reinforcement-learning--ud820 ● DeepMind AlphaGo: https://deepmind.com/alpha-go ● Google self-driving car: https://www.google.com/selfdrivingcar/ ● Code for this talk: https://github.com/e911miri/forloop-qlearning