Developing a self-learning Snake game using deep Q-learning in JavaScript

September 22, 2017

  1. TYPES OF MACHINE LEARNING
    ▸ Reinforcement: algorithms learn to react to an environment
    ▸ Unsupervised: data driven (clustering)
    ▸ Supervised: task driven (regression / classification)
  2. WHAT IS REINFORCEMENT LEARNING
    ▸ A branch of machine learning concerned with taking sequences of actions
    ▸ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward

    Target marketing
    GAMES ▸ Atari ▸ Solitaire ▸ Chess ▸ Checkers
    ECONOMICS ▸ Trading
  4. MARKOV DECISION PROCESS DEFINITION
    ▸ States: $S$
    ▸ Actions: $A$
    ▸ Probability: $P_a(s, s') = \Pr(s' \mid s, a)$
    ▸ Reward: $R_a(s, s')$
    ▸ Discount factor: $\gamma \in [0, 1]$
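  5. MARKOV DECISION PROCESS PROBLEM
    ▸ Find a function $\pi$ (the "policy") that specifies the action $\pi(s)$ that the decision maker will choose when in state $s$
    ▸ The policy is chosen to maximize the discounted sum of rewards: $\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})$, with $a_t = \pi(s_t)$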

    ▸ Total future reward: $R = r_1 + r_2 + r_3 + \dots + r_n$
    ▸ Total future reward from time $t$: $R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$

    ▸ Total future reward: $R = r_1 + r_2 + r_3 + \dots + r_n$
    ▸ Total future reward from time $t$: $R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$
    ▸ Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots + \gamma^{n-t} r_n$

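    ▸ Total future reward: $R = r_1 + r_2 + r_3 + \dots + r_n$
    ▸ Total future reward from time $t$: $R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$
    ▸ Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \dots + \gamma^{n-t} r_n$
    ▸ Recursive form: $R_t = r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \dots)) = r_t + \gamma R_{t+1}$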
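The recursive form $R_t = r_t + \gamma R_{t+1}$ can be evaluated by walking the reward list backwards. The following is a minimal sketch in plain JavaScript; the function name `discountedReturns` and the sample rewards are illustrative assumptions, not part of the talk.

```javascript
// Computes R_t = r_t + gamma * R_{t+1} for every t, iterating from the last step.
function discountedReturns(rewards, gamma) {
  const returns = new Array(rewards.length).fill(0);
  let next = 0; // no reward is collected after the final step
  for (let t = rewards.length - 1; t >= 0; t--) {
    next = rewards[t] + gamma * next;
    returns[t] = next;
  }
  return returns;
}

// Hypothetical episode: two steps with reward -1, then a reward of 10.
const sample = discountedReturns([-1, -1, 10], 0.5);
console.log(sample); // [1, 4, 10]
```

With `gamma = 1` the same function returns the undiscounted totals, which is why the plain sum is a special case of the discounted one.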
  9. Q-LEARNING
    ▸ In Q-learning we define a function $Q(s, a)$ representing the maximum discounted future reward when we perform action $a$ in state $s$ and continue optimally from that point on: $Q(s_t, a_t) = \max R_{t+1}$
  10. Q-LEARNING
    ▸ In Q-learning we define a function $Q(s, a)$ representing the maximum discounted future reward when we perform action $a$ in state $s$ and continue optimally from that point on: $Q(s_t, a_t) = \max R_{t+1}$
    ▸ $\pi$ represents the policy, the rule for how we choose an action in each state: $\pi(s) = \arg\max_a Q(s, a)$
  11. Q-LEARNING
    ▸ In Q-learning we define a function $Q(s, a)$ representing the maximum discounted future reward when we perform action $a$ in state $s$ and continue optimally from that point on: $Q(s_t, a_t) = \max R_{t+1}$
    ▸ $\pi$ represents the policy, the rule for how we choose an action in each state: $\pi(s) = \arg\max_a Q(s, a)$
    ▸ Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state: $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
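The Bellman update and the greedy policy translate almost directly into code. This is a minimal tabular sketch, assuming the Q-table is stored as a nested object `{ state: { action: value } }`; the state names `A` and `B` and all function names are hypothetical.

```javascript
// Highest Q-value available in a state; unknown or empty states count as 0.
function maxQ(qTable, state) {
  const values = Object.values(qTable[state] || {});
  return values.length > 0 ? Math.max(...values) : 0;
}

// Bellman update: Q(s, a) = r + gamma * max_a' Q(s', a').
function updateQ(qTable, state, action, reward, nextState, gamma) {
  qTable[state][action] = reward + gamma * maxQ(qTable, nextState);
  return qTable[state][action];
}

// Greedy policy: pi(s) = argmax_a Q(s, a).
function greedyAction(qTable, state) {
  const entries = Object.entries(qTable[state]);
  return entries.reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
}

const q = { A: { left: 0, right: 0 }, B: { left: 0, right: 2 } };
updateQ(q, "A", "right", 1, "B", 0.5); // 1 + 0.5 * 2 = 2
console.log(greedyAction(q, "A")); // "right"
```

The update here overwrites the old value, as on the slide. Implementations that learn from noisy rewards usually blend old and new values with a learning rate instead.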
  12. Q-LEARNING
    [Figure: 2x2 grid with cells indexed 0 and 1 along each axis]

    Initial Q-Table
           U     D     L     R
    0-0    0     0     0     0
    0-1    0     0     0     0
    1-0    0     0     0     0
    1-1    0     0     0     0
  13. Q-LEARNING
    [Figure: 2x2 grid with cells indexed 0 and 1 along each axis]

    Initial Q-Table
           U     D     L     R
    0-0    0     0     0     0
    0-1    0     0     0     0
    1-0    0     0     0     0
    1-1    0     0     0     0

    Reward table
           U     D     L     R
    0-0    E    -10    E    -1
    0-1    E    +10   -1     E
    1-0   -1     E     E    +10
    1-1   -1     E    -10    E
  14. Q-LEARNING
    Initial Q-Table
           U     D     L     R
    0-0    0     0     0     0
    0-1    0     0     0     0
    1-0    0     0     0     0
    1-1    0     0     0     0

    Reward table
           U     D     L     R
    0-0    E    -10    E    -1
    0-1    E    +10   -1     E
    1-0   -1     E     E    +10
    1-1   -1     E    -10    E

    S = 0-0; A = D
    $Q(\text{0-0}, D) = R(\text{0-0}, D) + \gamma \max\{Q(\text{1-0}, U),\ Q(\text{1-0}, R)\} = -10 + 0.8 \cdot 0 = -10$
  15. Q-LEARNING
    Q-Table
           U     D     L     R
    0-0    0    -10    0     0
    0-1    0     0     0     0
    1-0    0     0     0     0
    1-1    0     0     0     0

    Reward table
           U     D     L     R
    0-0    E    -10    E    -1
    0-1    E    +10   -1     E
    1-0   -1     E     E    +10
    1-1   -1     E    -10    E

    S = 0-0; A = D
    $Q(\text{0-0}, D) = R(\text{0-0}, D) + \gamma \max\{Q(\text{1-0}, U),\ Q(\text{1-0}, R)\} = -10 + 0.8 \cdot 0 = -10$
  16. Q-LEARNING
    Q-Table
           U     D     L     R
    0-0    0    -10    0     0
    0-1    0     0     0     0
    1-0    0     0     0     0
    1-1    0     0     0     0

    Reward table
           U     D     L     R
    0-0    E    -10    E    -1
    0-1    E    +10   -1     E
    1-0   -1     E     E    +10
    1-1   -1     E    -10    E

    S = 1-0; A = R
    $Q(\text{1-0}, R) = R(\text{1-0}, R) + \gamma \max\{Q(\text{1-1}, U),\ Q(\text{1-1}, L)\} = 10 + 0.8 \cdot 0 = 10$
  17. Q-LEARNING
    Q-Table
           U     D     L     R
    0-0    0    -10    0     0
    0-1    0     0     0     0
    1-0    0     0     0     10
    1-1    0     0     0     0

    Reward table
           U     D     L     R
    0-0    E    -10    E    -1
    0-1    E    +10   -1     E
    1-0   -1     E     E    +10
    1-1   -1     E    -10    E

    S = 1-0; A = R
    $Q(\text{1-0}, R) = R(\text{1-0}, R) + \gamma \max\{Q(\text{1-1}, U),\ Q(\text{1-1}, L)\} = 10 + 0.8 \cdot 0 = 10$
  18. Q-LEARNING
    [Figure: 2x2 grid with cells indexed 0 and 1 along each axis]

    Final Q-Table
           U     D     L     R
    0-0    0    -2     0     7
    0-1    0     10    4.6   0
    1-0    4.6   0     0     10
    1-1    0     0     0     0
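The final Q-table can be reproduced by applying the Bellman update to every valid state and action until the values stop changing. The sketch below makes several assumptions that the slides do not state explicitly: rows are labelled `row-column`, `E` marks a move off the grid, cell `1-1` is terminal (its row stays at 0), and $\gamma = 0.8$ as in the worked steps. All names are illustrative.

```javascript
const GAMMA = 0.8;
// Reward table from the slides; null stands for "E" (move not possible).
const REWARDS = {
  "0-0": { U: null, D: -10, L: null, R: -1 },
  "0-1": { U: null, D: 10, L: -1, R: null },
  "1-0": { U: -1, D: null, L: null, R: 10 },
  "1-1": { U: -1, D: null, L: -10, R: null },
};
const MOVES = { U: [-1, 0], D: [1, 0], L: [0, -1], R: [0, 1] };
const TERMINAL = "1-1"; // assumption: the episode ends in this cell

function nextState(state, action) {
  const [row, col] = state.split("-").map(Number);
  const [dRow, dCol] = MOVES[action];
  return `${row + dRow}-${col + dCol}`;
}

// Best Q-value over the valid actions of a state; 0 for the terminal cell.
function bestValue(q, state) {
  if (state === TERMINAL) return 0;
  const valid = Object.keys(MOVES).filter((a) => REWARDS[state][a] !== null);
  return Math.max(...valid.map((a) => q[state][a]));
}

function solve(sweeps) {
  const q = {};
  for (const s of Object.keys(REWARDS)) q[s] = { U: 0, D: 0, L: 0, R: 0 };
  for (let i = 0; i < sweeps; i++) {
    for (const s of Object.keys(REWARDS)) {
      if (s === TERMINAL) continue;
      for (const a of Object.keys(MOVES)) {
        const reward = REWARDS[s][a];
        if (reward === null) continue; // impossible moves keep Q = 0
        q[s][a] = reward + GAMMA * bestValue(q, nextState(s, a));
      }
    }
  }
  return q;
}

const finalQ = solve(10);
for (const s of Object.keys(finalQ)) {
  console.log(s, Object.values(finalQ[s]).map((v) => v.toFixed(1)).join(" "));
}
```

Under these assumptions the printed rows agree with the final table up to floating-point rounding, for example $Q(\text{0-0}, R) = -1 + 0.8 \cdot 10 = 7$.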
  19. DEEP Q NETWORK
    [Diagram labels: state, action, Neural Network, Q(s, a)]

    Layer  Input     Filter Size  Stride  Num Filters  Activation  Output
    conv1  84x84x4   8x8          4       32           ReLU        20x20x32
    conv2  20x20x32  4x4          2       64           ReLU        9x9x64
    conv3  9x9x64    3x3          1       64           ReLU        7x7x64
    fc4    7x7x64    -            -       512          ReLU        512
    fc5    512       -            -       18           Linear      18
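The output sizes in the table follow from the usual formula for a convolution without padding, `floor((input - filter) / stride) + 1`. The sketch below checks the three convolutional layers; the assumption of no padding and the helper name `convOutputSize` are mine, not from the slides.

```javascript
// Output width (or height) of a convolution without padding.
function convOutputSize(input, filter, stride) {
  return Math.floor((input - filter) / stride) + 1;
}

const layers = [
  { name: "conv1", filter: 8, stride: 4, filters: 32 },
  { name: "conv2", filter: 4, stride: 2, filters: 64 },
  { name: "conv3", filter: 3, stride: 1, filters: 64 },
];

let size = 84; // the input is 84x84x4
const shapes = layers.map((layer) => {
  size = convOutputSize(size, layer.filter, layer.stride);
  return `${layer.name}: ${size}x${size}x${layer.filters}`;
});
console.log(shapes.join("\n"));
// conv1: 20x20x32
// conv2: 9x9x64
// conv3: 7x7x64
```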
  20. DEEP Q NETWORK
    ▸ Experience replay
      ▸ During gameplay, all experiences <s, a, r, s'> are stored in a replay memory
      ▸ When training the network, random minibatches from the replay memory are used instead of the most recent transition
    ▸ Exploration - Exploitation
      ▸ ε-greedy policy: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value
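Both ideas are small enough to sketch in a few lines. The code below is an illustration only, not how any particular library implements them: `ReplayMemory`, `epsilonGreedy` and the injectable `random` parameter are hypothetical names, and the minibatch is sampled uniformly with replacement.

```javascript
// Fixed-capacity replay memory holding <s, a, r, s'> transitions.
class ReplayMemory {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.cursor = 0; // slot that will be overwritten next once the memory is full
  }

  add(transition) {
    if (this.items.length < this.capacity) {
      this.items.push(transition);
    } else {
      this.items[this.cursor] = transition; // replace the oldest entry
    }
    this.cursor = (this.cursor + 1) % this.capacity;
  }

  // Uniform random minibatch, sampled with replacement.
  sample(batchSize, random = Math.random) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      batch.push(this.items[Math.floor(random() * this.items.length)]);
    }
    return batch;
  }
}

// With probability epsilon pick a random action index, otherwise the greedy one.
function epsilonGreedy(qValues, epsilon, random = Math.random) {
  if (random() < epsilon) {
    return Math.floor(random() * qValues.length);
  }
  return qValues.indexOf(Math.max(...qValues));
}

const memory = new ReplayMemory(2);
memory.add({ s: 0, a: 1, r: -1, s2: 1 });
memory.add({ s: 1, a: 0, r: -1, s2: 2 });
memory.add({ s: 2, a: 3, r: 5, s2: 3 }); // overwrites the oldest transition
console.log(memory.items.length); // 2
console.log(epsilonGreedy([0.1, 0.9, 0.3], 0)); // 1
```

Sampling from memory rather than training on the latest transition breaks the correlation between consecutive frames, which is the reason the slide gives for using random minibatches.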
  21. JS LIBRARIES
    ▸ NeuroJS - https://github.com/janhuenermann/neurojs
      A JavaScript deep learning and reinforcement learning library.
    ▸ ReinforceJS - https://github.com/karpathy/reinforcejs
      A JavaScript reinforcement learning library that implements several common RL algorithms, all with web demos.
  22. JS LIBRARIES NEUROJS
    ▸ No documentation at all
    ▸ Implements a full-stack, neural-network-based machine learning framework
    ▸ Extended reinforcement learning support
    ▸ Has several examples (self-driving cars, waterworld, xor)
  23. JS LIBRARIES NEUROJS
    var states = 29;
    var actions = 2;
    // network input size, computed for a temporal window of 1
    var input = states + (states + actions) * 1;

    var actor = new window.neurojs.Network.Model([
      { type: "input", size: input },
      { type: "fc", size: 50, activation: "relu" },
      { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
      { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
      { type: "fc", size: 2, activation: "sigmoid" }
    ]);

    window.brain = new window.neurojs.Agent({
      actor: actor.newConfiguration(),
      critic: null,
      states: states,
      actions: actions,
      algorithm: "ddpg",
      temporalWindow: 1
    });
  24. JS LIBRARIES REINFORCEJS
    ▸ Has basic documentation
    ▸ Implements several common RL algorithms
      ▸ Dynamic Programming
      ▸ Tabular Temporal Difference Learning
      ▸ Deep Q Learning
      ▸ Policy Gradients (unstable)
    ▸ Has examples for each algorithm
  25. JS LIBRARIES REINFORCEJS
    // create an environment object
    var env = {};
    env.getNumStates = function() { return 8; };
    env.getMaxNumActions = function() { return 4; };

    // create the DQN agent
    var spec = { alpha: 0.01 }; // see full options on DQN page
    var agent = new RL.DQNAgent(env, spec);

    setInterval(function() { // start the learning loop
      var action = agent.act(s); // s is an array of length 8
      //... execute action in environment and get the reward
      agent.learn(reward); // the agent improves its Q,policy,model
    }, 0);
  26. SNAKE IMPLEMENTATION GAME LOGIC
    export type State = {
      input: "up" | "down" | "left" | "right",
      snake: Snake,
      game: { height: number, width: number },
      food: Position,
      tick: number,
      reward: number
    };

    export type Snake = {
      dir: Direction,
      position: Position,
      dead: boolean,
      tail: Array<Position>
    };

    export type Food = { x: number, y: number };
    export type Position = { x: number, y: number };
    export type Direction = { x: -1 | 0 | 1, y: -1 | 0 | 1 };
  27. SNAKE IMPLEMENTATION GAME LOGIC
    // snake.js
    export function update(state: State): State {
      // return new state with next updates:
      //   update dead state if collided with tail
      //   update position based on direction
      //   update tail:
      //     if touched food - concat position to snake
      //     else - concat position and remove last tail cell
    }

    export const setup: Snake = {
      dead: false,
      position: { x: 3, y: 1 },
      tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }],
      dir: RIGHT_DIR
    };
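The outline of `update` can be filled in as follows. This is one possible reading of the comments, written in plain JavaScript without the Flow annotations, and it rests on assumptions the slide does not confirm: the tail array is ordered oldest cell first, the direction is taken from `snake.dir` (keyboard or agent input is not handled), and walls are ignored. `samePosition` and `updateSnake` are hypothetical names.

```javascript
const RIGHT_DIR = { x: 1, y: 0 };

function samePosition(a, b) {
  return a.x === b.x && a.y === b.y;
}

// Returns a new state object; the input state is not mutated.
function updateSnake(state) {
  const { snake, food } = state;
  const position = {
    x: snake.position.x + snake.dir.x,
    y: snake.position.y + snake.dir.y,
  };
  const ate = samePosition(position, food);
  // The old head joins the tail; the oldest cell is dropped unless food was eaten.
  const grown = snake.tail.concat([snake.position]);
  const tail = ate ? grown : grown.slice(1);
  const dead = tail.some((cell) => samePosition(cell, position));
  return { ...state, snake: { ...snake, position, dead, tail } };
}

const start = {
  snake: {
    dead: false,
    position: { x: 3, y: 1 },
    tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }],
    dir: RIGHT_DIR,
  },
  food: { x: 4, y: 1 },
};
const fed = updateSnake(start);
console.log(fed.snake.position, fed.snake.tail.length); // { x: 4, y: 1 } 3
```

Returning a fresh object on every tick keeps the previous state available, which matters later when the previous and next states are compared to compute a reward.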
  28. SNAKE IMPLEMENTATION GAME LOGIC
    // food.js
    function positionOnSnake(snake: Snake, position: Position): boolean {
      // true if position is on snake
    }

    function randomPositionFood(
      snake: Snake = snakeSetup,
      gameWidth: number,
      gameHeight: number
    ): Position {
      // random position while position is on snake
    }

    export function update(state: State): State {
      // if snake touched food - return new state with updated food location
      // (random generate food location while food location is on snake)
    }

    export function setup(width: number, height: number): Food {
      return randomPositionFood(undefined, width, height);
    }
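The two helper bodies are only described in comments, so here is a minimal sketch of what they might look like in plain JavaScript. It uses rejection sampling, treats both the head and the tail as occupied, and adds an optional `random` parameter purely for illustration; none of these details are confirmed by the slides.

```javascript
// True if the position coincides with the head or any tail cell.
function positionOnSnake(snake, position) {
  const cells = snake.tail.concat([snake.position]);
  return cells.some((cell) => cell.x === position.x && cell.y === position.y);
}

// Rejection sampling: draw random cells until one is free.
// Assumes the snake does not cover the whole board, otherwise this never returns.
function randomPositionFood(snake, gameWidth, gameHeight, random = Math.random) {
  let position;
  do {
    position = {
      x: Math.floor(random() * gameWidth),
      y: Math.floor(random() * gameHeight),
    };
  } while (positionOnSnake(snake, position));
  return position;
}

// On a 2x2 board with three occupied cells, only (1, 1) is free.
const crowded = { position: { x: 0, y: 0 }, tail: [{ x: 1, y: 0 }, { x: 0, y: 1 }] };
const food = randomPositionFood(crowded, 2, 2);
console.log(food); // { x: 1, y: 1 }
```

Rejection sampling is fine while the snake is short; on a nearly full board it would be cheaper to list the free cells and pick one of them.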
  29. SNAKE IMPLEMENTATION GAME LOGIC
    // engine.js
    const update: State => State = _.flow(
      updateSnake,
      updateFood,
      updateTick
    );

    function tick(prevState: State) {
      const input = someFunctionToGetLastInputFromKeyboard();
      const state = update({ ...prevState, input });
      draw(state);
      setTimeout(() => {
        window.requestAnimationFrame(() => tick(state));
      }, 1000 / FPS);
    }

    export function start() {
      tick(initialState({}));
    }
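The engine builds `update` by chaining state-to-state functions from left to right. To show the idea without depending on lodash, the sketch below defines a small `flow` helper with the same left-to-right behaviour; `moveRight` and `countTick` are hypothetical stand-ins for `updateSnake`, `updateFood` and `updateTick`.

```javascript
// Left-to-right function composition, similar in spirit to lodash's _.flow.
function flow(...fns) {
  return (input) => fns.reduce((value, fn) => fn(value), input);
}

// Each step takes a state and returns a new one.
const moveRight = (state) => ({ ...state, x: state.x + 1 });
const countTick = (state) => ({ ...state, tick: state.tick + 1 });

const update = flow(moveRight, countTick);

const before = { x: 3, tick: 0 };
const after = update(before);
console.log(after); // { x: 4, tick: 1 }
```

Because every step is a pure function of the state, new behaviour can be added to the pipeline by appending another function to the chain.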
  30. SNAKE IMPLEMENTATION GAME BRAIN
    let spec = {};
    spec.gamma = 1; // discount factor, [0, 1)
    spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
    spec.alpha = 0.01; // value function learning rate
    // spec.experience_add_every = 50; // number of time steps before we add another experience to replay memory
    // spec.experience_size = 10000; // size of experience replay memory
    // spec.num_hidden_units = 50 // number of neurons in hidden layer

    const FOOD_REWARD: number = 5;
    const DEATH_REWARD: number = -10;
    const SILENCE_REWARD: number = -1;

    function getNumStates(): number {
      return GAME_WIDTH * GAME_HEIGHT;
    }

    const env = {
      getNumStates,
      getMaxNumActions: () => 4
    };

    export const agent = new RL.DQNAgent(env, spec);
  31. SNAKE IMPLEMENTATION GAME BRAIN
    function getState(state: State): Array<number> {
      const { snake, food } = state;
      const zerosArray = new Array(getNumStates()).fill(0);
      zerosArray[calculateCellNumber(food)] = 2;
      zerosArray[calculateCellNumber(snake.position)] = 1;
      snake.tail.forEach(
        cell => (zerosArray[calculateCellNumber(cell)] = -1)
      );
      return zerosArray;
    }

    export function getAction(state: State): number {
      return agent.act(getState(state));
    }
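`getState` flattens the board into one number per cell: 2 for food, 1 for the head, -1 for the tail and 0 elsewhere. The slides do not show `calculateCellNumber`, so the sketch below assumes a row-major index (`y * width + x`) and a hypothetical 4x3 board to make the encoding concrete.

```javascript
const GAME_WIDTH = 4; // hypothetical board size for this example
const GAME_HEIGHT = 3;

// Assumed definition: row-major cell index.
function calculateCellNumber(position) {
  return position.y * GAME_WIDTH + position.x;
}

function encodeState(snake, food) {
  const cells = new Array(GAME_WIDTH * GAME_HEIGHT).fill(0);
  cells[calculateCellNumber(food)] = 2;
  cells[calculateCellNumber(snake.position)] = 1;
  snake.tail.forEach((cell) => {
    cells[calculateCellNumber(cell)] = -1;
  });
  return cells;
}

const encoded = encodeState(
  { position: { x: 3, y: 1 }, tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }] },
  { x: 0, y: 0 }
);
console.log(encoded.join(" ")); // 2 0 0 0 0 -1 -1 1 0 0 0 0
```

The vector length equals `GAME_WIDTH * GAME_HEIGHT`, which matches what `getNumStates` reports to the agent.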
  32. SNAKE IMPLEMENTATION GAME BRAIN
    export function learn(prevState: State, nextState: State): number {
      if (prevState.snake.tail.length < nextState.snake.tail.length) {
        agent.learn(FOOD_REWARD);
        return FOOD_REWARD;
      } else if (nextState.snake.dead) {
        agent.learn(DEATH_REWARD);
        return DEATH_REWARD;
      }
      agent.learn(SILENCE_REWARD);
      return SILENCE_REWARD;
    }
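The reward logic can be separated from the call into the agent, which makes it easy to check on its own. The following sketch keeps the reward values from the slides but is otherwise an assumption: `rewardFor` is a hypothetical name, the agent is passed in as a parameter, and a stub that only records rewards stands in for the real DQN agent.

```javascript
const FOOD_REWARD = 5;
const DEATH_REWARD = -10;
const SILENCE_REWARD = -1;

// Pure reward function: a longer tail means food was eaten.
function rewardFor(prevState, nextState) {
  if (prevState.snake.tail.length < nextState.snake.tail.length) return FOOD_REWARD;
  if (nextState.snake.dead) return DEATH_REWARD;
  return SILENCE_REWARD;
}

// Passes the reward to whatever agent is supplied and returns it.
function learn(agent, prevState, nextState) {
  const reward = rewardFor(prevState, nextState);
  agent.learn(reward);
  return reward;
}

// Stub agent that records rewards instead of training a network.
const received = [];
const stubAgent = { learn: (reward) => received.push(reward) };

const shortSnake = { snake: { tail: [{ x: 1, y: 1 }], dead: false } };
const longSnake = { snake: { tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }], dead: false } };
const deadSnake = { snake: { tail: [{ x: 1, y: 1 }], dead: true } };

learn(stubAgent, shortSnake, longSnake); // food eaten
learn(stubAgent, shortSnake, deadSnake); // snake died
learn(stubAgent, shortSnake, shortSnake); // nothing happened
console.log(received); // [5, -10, -1]
```

The small negative reward for an uneventful step discourages the snake from wandering indefinitely without reaching the food.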
  33. SNAKE IMPLEMENTATION GAME BRAIN
    function tick(prevState: State) {
      const input = BRAIN_ACTIONS_MAPPING[getAction(prevState)];
      const nextState = update({ ...prevState, input });
      // restart from a fresh state when the snake dies (initialState is called as in start())
      const state = nextState.snake.dead ? initialState({}) : nextState;
      draw(state);
      setTimeout(() => {
        window.requestAnimationFrame(() => tick(state));
      }, 1000 / FPS);
    }