determine the ideal behaviour within a specific context, in order to maximize its performance.
• It involves learning how to map situations to actions so as to maximize a numerical reward signal.
• It explores how software agents should act to maximize cumulative reward.
What is reinforcement learning?
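As a rough illustration of mapping situations to actions and adding up reward, here is a minimal sketch. The corridor world, the `step` function and the fixed `policy` are all hypothetical and invented for this example; a real agent would learn its policy rather than have it hard-coded.

```python
# Hypothetical toy world: a corridor with cells 0..4; reaching cell 4 pays +1.
GOAL = 4


def step(position, action):
    """Apply an action (-1 = left, +1 = right); return (new_position, reward, done)."""
    new_position = max(0, min(GOAL, position + action))
    done = new_position == GOAL
    reward = 1.0 if done else 0.0
    return new_position, reward, done


def policy(position):
    """A fixed mapping from situation (position) to action: always move right."""
    return +1


def run_episode(start=0, max_steps=20):
    """Let the agent act until the goal is reached and return the summed reward."""
    position, total_reward = start, 0.0
    for _ in range(max_steps):
        action = policy(position)
        position, reward, done = step(position, action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Calling `run_episode()` walks the agent from cell 0 to the goal and returns the reward it collected along the way.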
in its environment.
• A good state contains the location of the agent and its observable environment.
• In a game of tic-tac-toe, a state would be all the Xs and Os on the board.
• In a car, a state could contain traffic information, the directions on Google Maps, traffic lights and more.
• States = My hand and the dealer’s hand
• Actions = Hit or stand
• Rewards (1 for a win, 0 for a draw, -1 for a loss)
• The ideal policy should tell me what to do given a particular state.
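Assuming the card game described here is blackjack, the pieces above might be written down as follows. The `State` fields, the action names "hit" and "stand", and the threshold rule are all assumptions made for illustration; the threshold rule is only a placeholder and is not the ideal policy.

```python
from typing import NamedTuple


class State(NamedTuple):
    """Assumed state: my hand total and the dealer's visible card."""
    my_total: int
    dealer_showing: int


# Assumed action names; the rewards follow the values listed above.
ACTIONS = ("hit", "stand")
REWARDS = {"win": 1, "draw": 0, "lose": -1}


def threshold_policy(state, threshold=17):
    """A policy maps a state to an action; this one hits below a fixed total."""
    return "hit" if state.my_total < threshold else "stand"
```

For example, `threshold_policy(State(my_total=12, dealer_showing=10))` returns `"hit"`. A learned policy would have the same shape (state in, action out) but would choose its actions from experience.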
• It is a value from 0 to 1.
• 0 means the model considers only current rewards.
• A value near 1 makes the model strive for long-term rewards.
• At 1 and beyond, the model may fail to converge.
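Assuming the value described above is the discount factor (usually written gamma), its effect can be seen by computing a discounted sum of rewards, where a reward received k steps in the future is multiplied by gamma to the power k. The function name and the sample reward list below are made up for this sketch.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1, 1, 1]  # hypothetical rewards for three consecutive steps

only_now = discounted_return(rewards, 0.0)    # counts the first reward only
balanced = discounted_return(rewards, 0.5)    # later rewards count for less
far_sighted = discounted_return(rewards, 0.9) # later rewards count almost fully
```

With gamma at 0 only the immediate reward survives, and as gamma approaches 1 the result approaches the plain sum of all rewards.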