
Developing a Self-Learning Snake Game Using Deep Q-Learning in JavaScript

Bohdan
September 22, 2017


Transcript

  1. TYPES OF MACHINE LEARNING ▸ Supervised: task driven (regression / classification) ▸ Unsupervised: data driven (clustering) ▸ Reinforcement: algorithms learn to react to an environment
  2. WHAT IS REINFORCEMENT LEARNING ▸ Branch of machine learning concerned with taking sequences of actions ▸ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
  3. PRACTICAL APPLICATIONS ▸ Operations research: pricing, vehicle routing, target marketing ▸ Games: Atari, Solitaire, Chess, Checkers ▸ Economics: trading
  4. MARKOV DECISION PROCESS DEFINITION ▸ States S ▸ Actions A ▸ Transition probability P_a(s, s') = Pr(s' | s, a) ▸ Reward R_a(s, s') ▸ Discount factor γ ∈ [0, 1] (illustrated with a small example state-transition diagram)
  5. MARKOV DECISION PROCESS PROBLEM ▸ Find a function ("policy") π that specifies the action π(s) the decision maker will choose when in state s, maximizing the discounted return Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1})
  6. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
  7. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n ▸ Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n−t} r_n
  8. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n ▸ Discounted future reward, written recursively: R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ R_{t+1}
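The recursive form above is easy to check in code: a discounted return can be computed by folding the reward sequence from the end. A minimal sketch (the helper name discountedReturn is ours, not from the deck):

    // R_t = r_t + γ·R_{t+1}, folded from the last reward backwards
    function discountedReturn(rewards, gamma) {
      return rewards.reduceRight((future, r) => r + gamma * future, 0);
    }
    discountedReturn([1, 1, 1], 0.9); // 1 + 0.9·(1 + 0.9·1) = 2.71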
  9. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1}
  10. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1} ▸ π represents the policy, the rule for how we choose an action in each state: π(s) = argmax_a Q(s, a)
  11. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1} ▸ π represents the policy, the rule for how we choose an action in each state: π(s) = argmax_a Q(s, a) ▸ Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state: Q(s, a) = r + γ max_{a'} Q(s', a')
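Applied literally, the Bellman equation becomes the table update used on the next slides: overwrite Q(s, a) with the immediate reward plus γ times the best Q-value of the next state. A minimal sketch (helper names are ours; the next slides do the same arithmetic by hand):

    // Q(s, a) = r + γ·max_{a'} Q(s', a')
    function maxQ(Q, s, actions) {
      return Math.max(...actions.map(a => Q[s][a]));
    }
    function bellmanUpdate(Q, s, a, r, sNext, nextActions, gamma) {
      Q[s][a] = r + gamma * maxQ(Q, sNext, nextActions);
    }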
  12. Q-LEARNING (2×2 grid world) Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
  13. Q-LEARNING (2×2 grid world) Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
  14. Q-LEARNING Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,0); A = D; Q(00, D) = R(00, D) + γ·max(Q(01, U), Q(01, R)) = -10 + 0.8·0 = -10
  15. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,0); A = D; Q(00, D) = R(00, D) + γ·max(Q(01, U), Q(01, R)) = -10 + 0.8·0 = -10
  16. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,1); A = R; Q(01, R) = R(01, R) + γ·max(Q(11, U), Q(11, L)) = 10 + 0.8·0 = 10
  17. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 10
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,1); A = R; Q(01, R) = R(01, R) + γ·max(Q(11, U), Q(11, L)) = 10 + 0.8·0 = 10
  18. Q-LEARNING (2×2 grid world) Final Q-Table (state: U / D / L / R):
      0-0: 0, -2, 0, 7
      0-1: 0, 10, 4.6, 0
      1-0: 4.6, 0, 0, 10
      1-1: 0, 0, 0, 0
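Once the Q-table has converged, the policy from slide 10 is just an argmax per state. A minimal sketch using the final table above (the object-literal layout is ours):

    const Q = {
      "0-0": { U: 0, D: -2, L: 0, R: 7 },
      "0-1": { U: 0, D: 10, L: 4.6, R: 0 },
      "1-0": { U: 4.6, D: 0, L: 0, R: 10 },
      "1-1": { U: 0, D: 0, L: 0, R: 0 }
    };
    // π(s) = argmax_a Q(s, a)
    function greedyAction(Q, s) {
      return Object.keys(Q[s]).reduce((best, a) => (Q[s][a] > Q[s][best] ? a : best));
    }
    greedyAction(Q, "0-0"); // "R"
    greedyAction(Q, "0-1"); // "D"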
  19. DEEP Q NETWORK (state → neural network → Q(s, a) for every action)
      Layer | Input    | Filter Size | Stride | Num Filters | Activation | Output
      conv1 | 84x84x4  | 8x8         | 4      | 32          | ReLU       | 20x20x32
      conv2 | 20x20x32 | 4x4         | 2      | 64          | ReLU       | 9x9x64
      conv3 | 9x9x64   | 3x3         | 1      | 64          | ReLU       | 7x7x64
      fc4   | 7x7x64   |             |        | 512         | ReLU       | 512
      fc5   | 512      |             |        | 18          | Linear     | 18
  20. DEEP Q NETWORK ▸ Experience replay ▸ during gameplay all the experiences <s, a, r, s'> are stored in a replay memory ▸ when training the network, random minibatches from the replay memory are used instead of the most recent transition ▸ Exploration vs. exploitation ▸ ε-greedy policy: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value
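To make the two ideas above concrete, here is a minimal sketch of an ε-greedy choice and a replay memory sampled in minibatches (names such as epsilonGreedy, remember and sampleMinibatch are ours; ReinforceJS implements both mechanisms internally):

    // ε-greedy: with probability ε explore, otherwise pick the highest-Q action
    function epsilonGreedy(qValues, epsilon) {
      if (Math.random() < epsilon) {
        return Math.floor(Math.random() * qValues.length); // explore
      }
      return qValues.indexOf(Math.max(...qValues)); // exploit
    }
    // experience replay: store <s, a, r, s'> tuples, train on random minibatches
    const replay = [];
    function remember(experience, maxSize = 10000) {
      replay.push(experience);
      if (replay.length > maxSize) replay.shift();
    }
    function sampleMinibatch(size) {
      return Array.from({ length: size }, () => replay[Math.floor(Math.random() * replay.length)]);
    }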
  21. JS LIBRARIES ▸ NeuroJS - https://github.com/janhuenermann/neurojs A JavaScript deep learning and reinforcement learning library. ▸ ReinforceJS - https://github.com/karpathy/reinforcejs A JavaScript reinforcement learning library that implements several common RL algorithms, all with web demos.
  22. JS LIBRARIES NEUROJS ▸ No documentation at all ▸ Implements

    a full stack neural-network based machine learning framework ▸ Extended reinforcement-learning support ▸ Has several examples (self-driving cars, waterworld, xor)
  23. JS LIBRARIES NEUROJS
      var states = 29;
      var actions = 2;
      var input = states + (states + actions) * 1;
      var actor = new window.neurojs.Network.Model([
        { type: "input", size: input },
        { type: "fc", size: 50, activation: "relu" },
        { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
        { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
        { type: "fc", size: 2, activation: "sigmoid" }
      ]);
      window.brain = new window.neurojs.Agent({
        actor: actor.newConfiguration(),
        critic: null,
        states: states,
        actions: actions,
        algorithm: "ddpg",
        temporalWindow: 1
      });
  24. JS LIBRARIES REINFORCEJS ▸ Has basic documentation ▸ Implements several

    common RL algorithms ▸ Dynamic Programming ▸ Tabular Temporal Difference Learning ▸ Deep Q Learning ▸ Policy Gradients (unstable) ▸ Has examples for each algorithm
  25. JS LIBRARIES REINFORCEJS
      // create an environment object
      var env = {};
      env.getNumStates = function() { return 8; };
      env.getMaxNumActions = function() { return 4; };
      // create the DQN agent
      var spec = { alpha: 0.01 }; // see full options on DQN page
      agent = new RL.DQNAgent(env, spec);
      setInterval(function() { // start the learning loop
        var action = agent.act(s); // s is an array of length 8
        // ... execute action in environment and get the reward
        agent.learn(reward); // the agent improves its Q, policy, model
      }, 0);
  26. SNAKE IMPLEMENTATION GAME LOGIC
      export type State = {
        input: "up" | "down" | "left" | "right",
        snake: Snake,
        game: { height: number, width: number },
        food: Position,
        tick: number,
        reward: number
      };
      export type Snake = {
        dir: Direction,
        position: Position,
        dead: boolean,
        tail: Array<Position>
      };
      export type Food = { x: number, y: number };
      export type Position = { x: number, y: number };
      export type Direction = { x: -1 | 0 | 1, y: -1 | 0 | 1 };
  27. SNAKE IMPLEMENTATION GAME LOGIC
      // snake.js
      export function update(state: State): State {
        // return new state with next updates:
        //   update dead state if collided with tail
        //   update position based on direction
        //   update tail:
        //     if touched food - concat position to snake
        //     else concat position and remove last tail cell
      }
      export const setup: Snake = {
        dead: false,
        position: { x: 3, y: 1 },
        tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }],
        dir: RIGHT_DIR
      };
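The deck leaves the body of update() as comments; one way it could be filled in, as a sketch only (DIRECTIONS and samePosition are our assumed helpers, and wall collisions are ignored here):

    const DIRECTIONS = { up: { x: 0, y: -1 }, down: { x: 0, y: 1 }, left: { x: -1, y: 0 }, right: { x: 1, y: 0 } };
    function samePosition(a: Position, b: Position): boolean {
      return a.x === b.x && a.y === b.y;
    }
    export function update(state: State): State {
      const { snake, food, input } = state;
      const dir = DIRECTIONS[input] || snake.dir;
      const position = { x: snake.position.x + dir.x, y: snake.position.y + dir.y };
      const dead = snake.tail.some(cell => samePosition(cell, position));
      const grown = snake.tail.concat([snake.position]); // previous head joins the tail
      const tail = samePosition(position, food) ? grown : grown.slice(1); // no food: drop the oldest cell
      return { ...state, snake: { dir, position, dead, tail } };
    }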
  28. SNAKE IMPLEMENTATION GAME LOGIC
      // food.js
      function positionOnSnake(snake: Snake, position: Position): boolean {
        // true if position is on snake
      }
      function randomPositionFood(
        snake: Snake = snakeSetup,
        gameWidth: number,
        gameHeight: number
      ): Position {
        // random position while position is on snake
      }
      export function update(state: State): State {
        // if snake touched food - return new state with updated food location
        // (randomly generate a food location while it is on the snake)
      }
      export function setup(width: number, height: number): Food {
        return randomPositionFood(undefined, width, height);
      }
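Similarly, the two food.js helpers could be filled in with a simple rejection-sampling loop (again a sketch under the deck's types, not the author's exact code):

    function positionOnSnake(snake: Snake, position: Position): boolean {
      return [snake.position, ...snake.tail].some(
        cell => cell.x === position.x && cell.y === position.y
      );
    }
    function randomPositionFood(
      snake: Snake = snakeSetup,
      gameWidth: number,
      gameHeight: number
    ): Position {
      let position;
      do {
        position = {
          x: Math.floor(Math.random() * gameWidth),
          y: Math.floor(Math.random() * gameHeight)
        };
      } while (positionOnSnake(snake, position)); // re-roll while the food lands on the snake
      return position;
    }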
  29. SNAKE IMPLEMENTATION GAME LOGIC
      // engine.js
      const update: State => State = _.flow(
        updateSnake,
        updateFood,
        updateTick
      );
      function tick(prevState: State) {
        const input = someFunctionToGetLastInputFromKeyboard();
        const state = update({ ...prevState, input });
        draw(state);
        setTimeout(() => {
          window.requestAnimationFrame(() => tick(state));
        }, 1000 / FPS);
      }
      export function start() {
        tick(initialState({}));
      }
  30. SNAKE IMPLEMENTATION GAME BRAIN
      let spec = {};
      spec.gamma = 1; // discount factor, [0, 1)
      spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
      spec.alpha = 0.01; // value function learning rate
      // spec.experience_add_every = 50; // number of time steps before we add another experience to replay memory
      // spec.experience_size = 10000; // size of experience replay memory
      // spec.num_hidden_units = 50; // number of neurons in hidden layer
      const FOOD_REWARD: number = 5;
      const DEATH_REWARD: number = -10;
      const SILENCE_REWARD: number = -1;
      function getNumStates(): number {
        return GAME_WIDTH * GAME_HEIGHT;
      }
      const env = { getNumStates, getMaxNumActions: () => 4 };
      export const agent = new RL.DQNAgent(env, spec);
  31. SNAKE IMPLEMENTATION GAME BRAIN
      function getState(state: State): Array<number> {
        const { snake, food } = state;
        const zerosArray = new Array(getNumStates()).fill(0);
        zerosArray[calculateCellNumber(food)] = 2;
        zerosArray[calculateCellNumber(snake.position)] = 1;
        snake.tail.forEach(
          cell => (zerosArray[calculateCellNumber(cell)] = -1)
        );
        return zerosArray;
      }
      export function getAction(state: State): number {
        return agent.act(getState(state));
      }
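calculateCellNumber is not shown in the deck; a plausible implementation, assuming row-major indexing over the GAME_WIDTH × GAME_HEIGHT grid:

    // map a grid position to an index in the flat state array (row-major order)
    function calculateCellNumber(position: Position): number {
      return position.y * GAME_WIDTH + position.x;
    }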
  32. SNAKE IMPLEMENTATION GAME BRAIN
      export function learn(prevState: State, nextState: State): number {
        if (prevState.snake.tail.length < nextState.snake.tail.length) {
          agent.learn(FOOD_REWARD);
          return FOOD_REWARD;
        } else if (nextState.snake.dead) {
          agent.learn(DEATH_REWARD);
          return DEATH_REWARD;
        }
        agent.learn(SILENCE_REWARD);
        return SILENCE_REWARD;
      }
  33. SNAKE IMPLEMENTATION GAME BRAIN
      function tick(prevState: State) {
        const input = BRAIN_ACTIONS_MAPPING[getAction(prevState)];
        const nextState = update({ ...prevState, input });
        const state = nextState.snake.dead ? initialState : nextState;
        draw(state);
        setTimeout(() => {
          window.requestAnimationFrame(() => tick(state));
        }, 1000 / FPS);
      }
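BRAIN_ACTIONS_MAPPING is not defined in the deck; presumably it maps the four DQN action indices to the game's input strings, and learn() from slide 32 is invoked once the next state is known. A hedged sketch of how the loop could be wired together:

    // our assumption: action index from agent.act() → game input string
    const BRAIN_ACTIONS_MAPPING = ["up", "down", "left", "right"];
    function tick(prevState: State) {
      const input = BRAIN_ACTIONS_MAPPING[getAction(prevState)];
      const nextState = update({ ...prevState, input });
      learn(prevState, nextState); // reward the transition (slide 32)
      const state = nextState.snake.dead ? initialState : nextState;
      draw(state);
      setTimeout(() => {
        window.requestAnimationFrame(() => tick(state));
      }, 1000 / FPS);
    }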