Reinforcement Learning

Reinforcement Learning: First steps with Q-Learning

Who am I?
Tomiwa Ijaware, Software Engineer (Konga), Udacity ML nanodegree student

What is reinforcement learning?
• It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance.
• It involves learning how to map situations to actions so as to maximize a numerical reward signal.
• It explores how software agents should act to maximize cumulative reward.

Two notable mentions

AlphaGo
• Achieved a 99.8% win rate against other Go programs
• Combined deep neural networks with reinforcement learning
• Defeated a human professional 4-1

Google Self-Driving Car
• Over 1.5 million miles driven by a trained model
• It uses real sensors to tell where it is and what is around it
• It reads what to do next from its navigation planner

Where does it shine?
• Robotics
• Control systems
• Operations research
• Games
• Economics

Q-Learning

What is Q-Learning?
Q-learning is a model-free reinforcement learning technique. It can be used to find an optimal action-selection policy for any given finite Markov decision process (MDP).

A few terms: some theory to get us started

States
• A state represents information about the agent’s position in its environment.
• A good state contains the location of the agent and its observable environment.
• In a game of tic-tac-toe, a state would be all the Xs and Os on the board.
• In a car, a state could contain traffic information, the directions from Google Maps, traffic lights and more.

Action
• This is a step you take.
• It leads you to another state.
• It gives you a reward.

Reward
• A numerical value representing what you get for taking an action.
• It could be positive or negative.
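To make states, actions and rewards concrete, here is a minimal sketch of a toy environment. The corridor, the `GOAL` position and the `step` function are hypothetical names chosen for illustration; they are not taken from the talk's repository.

```python
# Hypothetical 1-D corridor: the agent stands on a position 0..4,
# and position 4 is the goal. All names here are illustrative.
GOAL = 4

def step(state, action):
    """Take an action from a state and return (next_state, reward)."""
    move = 1 if action == "right" else -1
    # The action leads to another state (clamped to the corridor).
    next_state = min(max(state + move, 0), GOAL)
    # The action also gives a reward: positive at the goal, negative otherwise.
    reward = 1 if next_state == GOAL else -1
    return next_state, reward

print(step(3, "right"))  # (4, 1)
print(step(0, "left"))   # (0, -1)
```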

Long term reward
• The cumulative reward for taking an action from a given state, divided by the number of times that action was taken from that state
• Q(s, a) = sum[rewards from taking a in s] / t
• t is the number of times a was taken from s
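The averaging idea above can be sketched with two small tables, one for the running reward total and one for the count t. The helper names `record` and `long_term_reward` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical bookkeeping tables, keyed by (state, action).
reward_sum = defaultdict(float)  # cumulative reward for each pair
times_taken = defaultdict(int)   # t: how often the action was taken from the state

def record(state, action, reward):
    """Store one observed reward for taking `action` in `state`."""
    reward_sum[(state, action)] += reward
    times_taken[(state, action)] += 1

def long_term_reward(state, action):
    """Average reward so far; 0.0 if the pair has never been tried."""
    t = times_taken[(state, action)]
    return reward_sum[(state, action)] / t if t else 0.0

for r in (1, -1, 1, 1):
    record("s", "right", r)

print(long_term_reward("s", "right"))  # 0.5
```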

Optimal Action
• The action that gives the most long term reward
• For example, obeying traffic rules means that LASTMA (the Lagos State Traffic Management Authority) does not catch you

Policies
• A policy is the set of optimal actions, one for each state
• If we were training a car to drive in Lagos, it would be the car’s personal rule book for surviving Lagos traffic
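As a rough sketch, picking the optimal action and building a policy from a table of long term rewards could look like this. The Q-table values and the traffic-light states are made up for illustration.

```python
# Hypothetical Q-table: Q[state][action] -> estimated long term reward.
Q = {
    "red_light":   {"stop": 1.0, "go": -5.0},
    "green_light": {"stop": -0.5, "go": 1.0},
}

def optimal_action(q_table, state):
    """Return the action with the highest estimated long term reward."""
    actions = q_table[state]
    return max(actions, key=actions.get)

def greedy_policy(q_table):
    """Map every known state to its optimal action."""
    return {state: optimal_action(q_table, state) for state in q_table}

policy = greedy_policy(Q)
print(policy)  # {'red_light': 'stop', 'green_light': 'go'}
```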

Some code goes here... https://github.com/e911miri/forloop-qlearning

Consider a game of blackjack
• Player’s hand: 10 + 2 = 12
• Dealer’s hand: 3 + 10 = 13
Should I hit or stand?

Let’s model that problem as a Markov decision process
• States = my hand and the dealer’s hand
• Actions = hit or stand
• Rewards = 1 for a win, 0 for a draw, -1 for a loss
• The ideal policy should tell me what to do given a particular state.
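One possible way to write this model down in code is sketched below. It assumes heavily simplified rules (no aces, no naturals, no splitting), and the names `State`, `ACTIONS` and `reward` are illustrative rather than taken from the linked repository. The second action is called "stand" here, meaning the player takes no more cards.

```python
from typing import NamedTuple

class State(NamedTuple):
    """A state is my hand total and the dealer's hand total."""
    player_total: int
    dealer_total: int

# The two available actions; "stand" means take no more cards.
ACTIONS = ("hit", "stand")

def reward(player_total, dealer_total):
    """End-of-hand reward: 1 for a win, 0 for a draw, -1 for a loss."""
    if player_total > 21:
        return -1  # player went bust
    if dealer_total > 21:
        return 1   # dealer went bust
    if player_total > dealer_total:
        return 1
    if player_total == dealer_total:
        return 0
    return -1

# The hand from the slide: player holds 12, dealer holds 13.
state = State(player_total=12, dealer_total=13)
print(reward(*state))  # -1 if both players stop here
```

A learned policy would then be a lookup from each `State` to one of the `ACTIONS`.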

Learning process

To train it,

Learning rate
• This determines how much the new information overrides the old information.
• It ranges from 0 to 1.
• 0 means do not learn anything.
• 1 means the agent should only use the most recent rewards.
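The effect of the learning rate can be sketched as a weighted blend of the old estimate and the new information. The function name `blend` is an assumption made for this example.

```python
def blend(old_estimate, new_information, learning_rate):
    """Mix old and new values; learning_rate must lie in [0, 1]."""
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("learning_rate must be between 0 and 1")
    return (1 - learning_rate) * old_estimate + learning_rate * new_information

print(blend(2.0, 10.0, 0.0))  # 2.0  -> nothing is learned
print(blend(2.0, 10.0, 0.5))  # 6.0  -> halfway between old and new
print(blend(2.0, 10.0, 1.0))  # 10.0 -> only the newest information counts
```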

Discount Factor
• This regulates the importance of future rewards.
• It is a value from 0 to 1.
• 0 means consider only current rewards.
• A value nearing 1 will make the model strive for long term rewards.
• At 1 and beyond, the learned values may fail to converge.
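Putting the learning rate (`alpha`) and the discount factor (`gamma`) together, a single tabular Q-learning step in its standard textbook form might look like the sketch below. The function name `q_update`, the dictionary layout and the default values are assumptions for illustration, not necessarily what the talk's repository does.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to Q[(state, action)] and return the new value."""
    # Best long term reward the agent currently expects from the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # gamma scales how much that future reward counts against the current one.
    target = reward + gamma * best_next
    # alpha controls how far the old estimate moves towards the target.
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q[(state, action)]

actions = ("left", "right")

Q = defaultdict(float)        # unseen pairs start at 0.0
Q[("s1", "right")] = 10.0     # pretend s1 is already known to be valuable
far_sighted = q_update(Q, "s0", "right", 1.0, "s1", actions, gamma=0.9)

Q2 = defaultdict(float)
Q2[("s1", "right")] = 10.0
short_sighted = q_update(Q2, "s0", "right", 1.0, "s1", actions, gamma=0.0)

print(far_sighted, short_sighted)
```

With `gamma=0.0` the value of the next state is ignored entirely, so the estimate only reflects the current reward.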

Q-Learning in the real world: considering real world challenges and how to deal with those problems

References
• Brandon’s blogpost: http://outlace.com/Reinforcement-Learning-Part-1/
• Udacity reinforcement learning course: https://www.udacity.com/course/machine-learning-reinforcement-learning--ud820
• DeepMind AlphaGo: https://deepmind.com/alpha-go
• Google self-driving car: https://www.google.com/selfdrivingcar/
• https://github.com/e911miri/forloop-qlearning

forLoop
