Reinforcement Learning

forLoop
August 22, 2016

Tomiwa Ijaware showed the forLoop Machine Learning attendees what reinforcement learning is.


Transcript

  1. What is reinforcement learning?
    • It involves allowing machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance.
    • It involves learning how to map situations to actions so as to maximize a numerical reward signal.
    • It explores how software agents should act to maximize cumulative reward.
  2. AlphaGo
    • Achieved a 99.8% win rate against other Go programs
    • Combined deep neural networks with reinforcement learning
    • Defeated a human professional 4-1
  3. Google Self-Driving Car
    • Over 1.5 million miles driven by a trained model
    • It uses real sensors to tell where it is and what is around it
    • It reads what to do next from its navigation planner
  4. Where does it shine?
    • Robotics
    • Control systems
    • Operations research
    • Games
    • Economics
  5. What is Q-learning?
    • Q-learning is a model-free reinforcement learning technique.
    • Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
  6. States
    • A state represents information about the agent’s position in its environment.
    • A good state contains the location of the agent and its observable environment.
    • In a game of tic-tac-toe, a state would be all the Xs and Os on the board.
    • In a car, a state could contain traffic information, the directions from Google Maps, traffic lights and more.
  7. Action
    • This is a step you take.
    • It leads you to another state.
    • It gives you a reward.
  8. Reward
    • A numerical value representing what you get for taking an action.
    • It could be positive or negative.
  9. Long-term reward
    • The cumulative reward for taking an action from a given state, divided by the number of times that action was taken from that state.
    • Q(s, a) = sum[r(s, a)] / t, where the sum runs over every reward r received for taking a from s
    • t is the number of times a was taken from s
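The averaging rule on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the names `reward_sum`, `visit_count`, `record` and `q_value`, as well as the example state, action and rewards, are hypothetical.

```python
from collections import defaultdict

# Running totals per (state, action) pair; all names are illustrative only.
reward_sum = defaultdict(float)   # total reward observed for (s, a)
visit_count = defaultdict(int)    # t: number of times a was taken from s


def record(state, action, reward):
    """Store one observed reward for taking `action` in `state`."""
    reward_sum[(state, action)] += reward
    visit_count[(state, action)] += 1


def q_value(state, action):
    """Average reward so far: sum of rewards / t (0.0 if never tried)."""
    t = visit_count[(state, action)]
    return reward_sum[(state, action)] / t if t else 0.0


# Hypothetical experience: stopping at a red light paid off twice, cost once.
for r in (1, 1, -1):
    record("red_light", "stop", r)

print(q_value("red_light", "stop"))
```

Keeping a running sum and a count means the estimate can be refreshed after every step without storing the full reward history.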
  10. Optimal action
    • The action that gives the most long-term reward.
    • For example, obeying traffic rules means that LASTMA does not catch you.
  11. Policies
    • A policy is the set of optimal actions, one for each state.
    • If we were training a car to drive in Lagos, it would be your personal rule book for surviving Lagos traffic.
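One way to picture the link between optimal actions and a policy is as a lookup over a table of long-term rewards. The sketch below assumes such a table already exists; the `q_table` values, the state and action names, and the helper `optimal_action` are all made up for illustration.

```python
# Hypothetical long-term rewards Q(s, a) for a tiny driving example.
q_table = {
    "red_light":   {"stop": 1.0, "drive": -1.0},
    "green_light": {"stop": -0.5, "drive": 1.0},
}


def optimal_action(state):
    """Return the action with the highest long-term reward in `state`."""
    actions = q_table[state]
    return max(actions, key=actions.get)


# A policy is then one optimal action per state.
policy = {state: optimal_action(state) for state in q_table}

print(policy)
```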
  12. Consider a game of blackjack
    • Player’s hand: 10 + 2 = 12
    • Dealer’s hand: 3 + 10 = 13
    • Should I hit or stand?
  13. Let’s model that problem as a Markov decision process
    • States = my hand and the dealer’s hand
    • Actions = hit or stand (take another card, or keep the current hand)
    • Rewards = 1 for a win, 0 for a draw, -1 for a loss
    • An ideal policy should tell me what to do in any given state.
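The blackjack model above could be written down roughly as follows. This is a simplified sketch under stated assumptions, not the code from the talk: hands are reduced to their totals, the action of not taking another card is called "stand", and the names `State`, `ACTIONS` and `reward` are hypothetical.

```python
from typing import NamedTuple


class State(NamedTuple):
    player_total: int   # value of my hand
    dealer_total: int   # value of the dealer's hand


ACTIONS = ("hit", "stand")


def reward(final: State) -> int:
    """Reward at the end of a round: 1 for a win, 0 for a draw, -1 for a loss."""
    if final.player_total > 21:
        return -1                      # going over 21 always loses
    if final.dealer_total > 21 or final.player_total > final.dealer_total:
        return 1
    if final.player_total == final.dealer_total:
        return 0
    return -1


# The situation from the previous slide: my 12 against the dealer's 13.
start = State(player_total=12, dealer_total=13)
print(start, ACTIONS)
print(reward(State(20, 18)), reward(State(18, 18)), reward(State(22, 18)))
```

A policy for this model would map every `State` to one of the two `ACTIONS`.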
  14. Learning rate
    • This determines how much the new information overrides the old information.
    • It ranges from 0 to 1.
    • 0 means do not learn anything.
    • 1 means the agent should only use the most recent rewards.
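The effect of the learning rate can be seen in a one-line update that blends an old estimate with new information. The function name `update` and the numbers used are illustrative assumptions, not taken from the talk.

```python
def update(old_estimate, new_information, learning_rate):
    """Move the old estimate toward the new information by `learning_rate`."""
    return old_estimate + learning_rate * (new_information - old_estimate)


old, new = 0.5, 1.0
for alpha in (0.0, 0.1, 1.0):
    # alpha = 0.0 keeps the old estimate; alpha = 1.0 replaces it entirely.
    print(alpha, update(old, new, alpha))
```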
  15. Discount factor
    • This regulates the importance of future rewards.
    • It is a value from 0 to 1.
    • 0 means consider only current rewards.
    • A value nearing 1 will make the model strive for long-term rewards.
    • 1 or above means the model may fail to converge.
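The learning rate and discount factor meet in the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + learning_rate × (reward + discount × max Q(s', a') − Q(s, a)). The sketch below shows that textbook rule and is not necessarily how the talk's own code is written; the table `Q`, the function `q_update`, the default parameter values and the example states are assumptions made for illustration.

```python
from collections import defaultdict

# Q-table: Q[state][action] -> estimated long-term reward, default 0.0.
Q = defaultdict(lambda: defaultdict(float))


def q_update(state, action, reward, next_state, next_actions,
             learning_rate=0.1, discount=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Best value reachable from the next state (0.0 if it has no actions).
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    target = reward + discount * best_next
    Q[state][action] += learning_rate * (target - Q[state][action])
    return Q[state][action]


# Hypothetical transition: a future value of 1.0 is already known in "s1".
Q["s1"]["go"] = 1.0
print(q_update("s0", "go", reward=0.0, next_state="s1", next_actions=["go"]))

# With discount=0.0 only the immediate reward (0.0 here) counts.
print(q_update("s0", "wait", reward=0.0, next_state="s1", next_actions=["go"],
               discount=0.0))
```

With a discount of 0.9 the known future value in "s1" pulls the estimate for "s0" upward, while a discount of 0 ignores it.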
  16. References
    • Brandon’s blogpost: http://outlace.com/Reinforcement-Learning-Part-1/
    • Udacity reinforcement learning course: https://www.udacity.com/course/machine-learning-reinforcement-learning--ud820
    • DeepMind AlphaGo: https://deepmind.com/alpha-go
    • Google self-driving car: https://www.google.com/selfdrivingcar/
    • https://github.com/e911miri/forloop-qlearning