
Developing a Self-Learning Snake Game Using Deep Q-Learning in JavaScript

Bohdan
September 22, 2017


Transcript

  1. TYPES OF MACHINE LEARNING ▸ Supervised: task driven (regression / classification) ▸ Unsupervised: data driven (clustering) ▸ Reinforcement: algorithms learn to react to an environment
  2. WHAT IS REINFORCEMENT LEARNING ▸ Branch of machine learning concerned with taking sequences of actions ▸ Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
  3. PRACTICAL APPLICATIONS ▸ Operations research: pricing, vehicle routing, target marketing ▸ Games: Atari, Solitaire, Chess, Checkers ▸ Economics: trading
  4. MARKOV DECISION PROCESS DEFINITION ▸ States S ▸ Actions A ▸ Transition probability P_a(s, s') = Pr(s' | s, a) ▸ Reward R_a(s, s') ▸ Discount factor γ ∈ [0, 1] (illustrated with a small example state-transition diagram)
  5. MARKOV DECISION PROCESS PROBLEM ▸ Find a function ("policy") π that specifies the action π(s) the decision maker will choose when in state s, maximizing the discounted return Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1})
  6. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
  7. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n ▸ Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n−t} r_n
  8. MARKOV DECISION PROCESS DISCOUNTED FUTURE REWARD ▸ Total reward: R = r_1 + r_2 + r_3 + ... + r_n ▸ Total future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n ▸ Discounted future reward, written recursively: R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ R_{t+1}
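The recursive form above is easy to check in code: a discounted return can be computed by folding the reward sequence from the end. A minimal sketch (the helper name discountedReturn is ours, not from the deck):

    // R_t = r_t + γ·R_{t+1}, folded from the last reward backwards
    function discountedReturn(rewards, gamma) {
      return rewards.reduceRight((future, r) => r + gamma * future, 0);
    }
    discountedReturn([1, 1, 1], 0.9); // 1 + 0.9·(1 + 0.9·1) = 2.71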
  9. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1}
  10. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1} ▸ π represents the policy, the rule for how we choose an action in each state: π(s) = argmax_a Q(s, a)
  11. Q-LEARNING ▸ In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on: Q(s_t, a_t) = max R_{t+1} ▸ π represents the policy, the rule for how we choose an action in each state: π(s) = argmax_a Q(s, a) ▸ Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state: Q(s, a) = r + γ max_{a'} Q(s', a')
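Applied literally, the Bellman equation becomes the table update used on the next slides: overwrite Q(s, a) with the immediate reward plus γ times the best Q-value of the next state. A minimal sketch (helper names are ours; the next slides do the same arithmetic by hand):

    // Q(s, a) = r + γ·max_{a'} Q(s', a')
    function maxQ(Q, s, actions) {
      return Math.max(...actions.map(a => Q[s][a]));
    }
    function bellmanUpdate(Q, s, a, r, sNext, nextActions, gamma) {
      Q[s][a] = r + gamma * maxQ(Q, sNext, nextActions);
    }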
  12. Q-LEARNING (2×2 grid world) Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
  13. Q-LEARNING (2×2 grid world) Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
  14. Q-LEARNING Initial Q-Table (state: U / D / L / R):
      0-0: 0, 0, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,0); A = D; Q(00, D) = R(00, D) + γ·max(Q(01, U), Q(01, R)) = -10 + 0.8·0 = -10
  15. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,0); A = D; Q(00, D) = R(00, D) + γ·max(Q(01, U), Q(01, R)) = -10 + 0.8·0 = -10
  16. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 0
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,1); A = R; Q(01, R) = R(01, R) + γ·max(Q(11, U), Q(11, L)) = 10 + 0.8·0 = 10
  17. Q-LEARNING Q-Table (state: U / D / L / R):
      0-0: 0, -10, 0, 0
      0-1: 0, 0, 0, 10
      1-0: 0, 0, 0, 0
      1-1: 0, 0, 0, 0
      Reward table (state: U / D / L / R):
      0-0: E, -10, E, -1
      0-1: E, +10, -1, E
      1-0: -1, E, E, +10
      1-1: -1, E, -10, E
      S = (0,1); A = R; Q(01, R) = R(01, R) + γ·max(Q(11, U), Q(11, L)) = 10 + 0.8·0 = 10
  18. Q-LEARNING (2×2 grid world) Final Q-Table (state: U / D / L / R):
      0-0: 0, -2, 0, 7
      0-1: 0, 10, 4.6, 0
      1-0: 4.6, 0, 0, 10
      1-1: 0, 0, 0, 0
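Once the Q-table has converged, the policy from slide 10 is just an argmax per state. A minimal sketch using the final table above (the object-literal layout is ours):

    const Q = {
      "0-0": { U: 0, D: -2, L: 0, R: 7 },
      "0-1": { U: 0, D: 10, L: 4.6, R: 0 },
      "1-0": { U: 4.6, D: 0, L: 0, R: 10 },
      "1-1": { U: 0, D: 0, L: 0, R: 0 }
    };
    // π(s) = argmax_a Q(s, a)
    function greedyAction(Q, s) {
      return Object.keys(Q[s]).reduce((best, a) => (Q[s][a] > Q[s][best] ? a : best));
    }
    greedyAction(Q, "0-0"); // "R"
    greedyAction(Q, "0-1"); // "D"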
  19. DEEP Q NETWORK (state → neural network → Q(s, a) for every action)
      Layer | Input    | Filter Size | Stride | Num Filters | Activation | Output
      conv1 | 84x84x4  | 8x8         | 4      | 32          | ReLU       | 20x20x32
      conv2 | 20x20x32 | 4x4         | 2      | 64          | ReLU       | 9x9x64
      conv3 | 9x9x64   | 3x3         | 1      | 64          | ReLU       | 7x7x64
      fc4   | 7x7x64   |             |        | 512         | ReLU       | 512
      fc5   | 512      |             |        | 18          | Linear     | 18
  20. DEEP Q NETWORK ▸ Experience replay ▸ during gameplay all the experiences <s, a, r, s'> are stored in a replay memory ▸ when training the network, random minibatches from the replay memory are used instead of the most recent transition ▸ Exploration vs. exploitation ▸ ε-greedy policy: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value
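To make the two ideas above concrete, here is a minimal sketch of an ε-greedy choice and a replay memory sampled in minibatches (names such as epsilonGreedy, remember and sampleMinibatch are ours; ReinforceJS implements both mechanisms internally):

    // ε-greedy: with probability ε explore, otherwise pick the highest-Q action
    function epsilonGreedy(qValues, epsilon) {
      if (Math.random() < epsilon) {
        return Math.floor(Math.random() * qValues.length); // explore
      }
      return qValues.indexOf(Math.max(...qValues)); // exploit
    }
    // experience replay: store <s, a, r, s'> tuples, train on random minibatches
    const replay = [];
    function remember(experience, maxSize = 10000) {
      replay.push(experience);
      if (replay.length > maxSize) replay.shift();
    }
    function sampleMinibatch(size) {
      return Array.from({ length: size }, () => replay[Math.floor(Math.random() * replay.length)]);
    }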
  21. JS LIBRARIES ▸ NeuroJS - https://github.com/janhuenermann/neurojs A JavaScript deep learning and reinforcement learning library. ▸ ReinforceJS - https://github.com/karpathy/reinforcejs A JavaScript reinforcement learning library that implements several common RL algorithms, all with web demos.
  22. JS LIBRARIES NEUROJS ▸ No documentation at all ▸ Implements

    a full stack neural-network based machine learning framework ▸ Extended reinforcement-learning support ▸ Has several examples (self-driving cars, waterworld, xor)
  23. JS LIBRARIES NEUROJS
      var states = 29;
      var actions = 2;
      var input = states + (states + actions) * 1;
      var actor = new window.neurojs.Network.Model([
        { type: "input", size: input },
        { type: "fc", size: 50, activation: "relu" },
        { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
        { type: "fc", size: 50, activation: "relu", dropout: 0.5 },
        { type: "fc", size: 2, activation: "sigmoid" }
      ]);
      window.brain = new window.neurojs.Agent({
        actor: actor.newConfiguration(),
        critic: null,
        states: states,
        actions: actions,
        algorithm: "ddpg",
        temporalWindow: 1
      });
  24. JS LIBRARIES REINFORCEJS ▸ Has basic documentation ▸ Implements several

    common RL algorithms ▸ Dynamic Programming ▸ Tabular Temporal Difference Learning ▸ Deep Q Learning ▸ Policy Gradients (unstable) ▸ Has examples for each algorithm
  25. JS LIBRARIES REINFORCEJS
      // create an environment object
      var env = {};
      env.getNumStates = function() { return 8; };
      env.getMaxNumActions = function() { return 4; };
      // create the DQN agent
      var spec = { alpha: 0.01 }; // see full options on DQN page
      agent = new RL.DQNAgent(env, spec);
      setInterval(function() { // start the learning loop
        var action = agent.act(s); // s is an array of length 8
        // ... execute action in environment and get the reward
        agent.learn(reward); // the agent improves its Q, policy, model
      }, 0);
  26. SNAKE IMPLEMENTATION GAME LOGIC
      export type State = {
        input: "up" | "down" | "left" | "right",
        snake: Snake,
        game: { height: number, width: number },
        food: Position,
        tick: number,
        reward: number
      };
      export type Snake = {
        dir: Direction,
        position: Position,
        dead: boolean,
        tail: Array<Position>
      };
      export type Food = { x: number, y: number };
      export type Position = { x: number, y: number };
      export type Direction = { x: -1 | 0 | 1, y: -1 | 0 | 1 };
  27. SNAKE IMPLEMENTATION GAME LOGIC
      // snake.js
      export function update(state: State): State {
        // return new state with next updates:
        //   update dead state if collided with tail
        //   update position based on direction
        //   update tail:
        //     if touched food - concat position to snake
        //     else concat position and remove last tail cell
      }
      export const setup: Snake = {
        dead: false,
        position: { x: 3, y: 1 },
        tail: [{ x: 1, y: 1 }, { x: 2, y: 1 }],
        dir: RIGHT_DIR
      };
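The deck leaves the body of update() as comments; one way it could be filled in, as a sketch only (DIRECTIONS and samePosition are our assumed helpers, and wall collisions are ignored here):

    const DIRECTIONS = { up: { x: 0, y: -1 }, down: { x: 0, y: 1 }, left: { x: -1, y: 0 }, right: { x: 1, y: 0 } };
    function samePosition(a: Position, b: Position): boolean {
      return a.x === b.x && a.y === b.y;
    }
    export function update(state: State): State {
      const { snake, food, input } = state;
      const dir = DIRECTIONS[input] || snake.dir;
      const position = { x: snake.position.x + dir.x, y: snake.position.y + dir.y };
      const dead = snake.tail.some(cell => samePosition(cell, position));
      const grown = snake.tail.concat([snake.position]); // previous head joins the tail
      const tail = samePosition(position, food) ? grown : grown.slice(1); // no food: drop the oldest cell
      return { ...state, snake: { dir, position, dead, tail } };
    }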
  28. SNAKE IMPLEMENTATION GAME LOGIC
      // food.js
      function positionOnSnake(snake: Snake, position: Position): boolean {
        // true if position is on snake
      }
      function randomPositionFood(
        snake: Snake = snakeSetup,
        gameWidth: number,
        gameHeight: number
      ): Position {
        // random position while position is on snake
      }
      export function update(state: State): State {
        // if snake touched food - return new state with updated food location
        // (randomly generate a food location while it is on the snake)
      }
      export function setup(width: number, height: number): Food {
        return randomPositionFood(undefined, width, height);
      }
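Similarly, the two food.js helpers could be filled in with a simple rejection-sampling loop (again a sketch under the deck's types, not the author's exact code):

    function positionOnSnake(snake: Snake, position: Position): boolean {
      return [snake.position, ...snake.tail].some(
        cell => cell.x === position.x && cell.y === position.y
      );
    }
    function randomPositionFood(
      snake: Snake = snakeSetup,
      gameWidth: number,
      gameHeight: number
    ): Position {
      let position;
      do {
        position = {
          x: Math.floor(Math.random() * gameWidth),
          y: Math.floor(Math.random() * gameHeight)
        };
      } while (positionOnSnake(snake, position)); // re-roll while the food lands on the snake
      return position;
    }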
  29. SNAKE IMPLEMENTATION GAME LOGIC
      // engine.js
      const update: State => State = _.flow(
        updateSnake,
        updateFood,
        updateTick
      );
      function tick(prevState: State) {
        const input = someFunctionToGetLastInputFromKeyboard();
        const state = update({ ...prevState, input });
        draw(state);
        setTimeout(() => {
          window.requestAnimationFrame(() => tick(state));
        }, 1000 / FPS);
      }
      export function start() {
        tick(initialState({}));
      }
  30. SNAKE IMPLEMENTATION GAME BRAIN
      let spec = {};
      spec.gamma = 1; // discount factor, [0, 1)
      spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
      spec.alpha = 0.01; // value function learning rate
      // spec.experience_add_every = 50; // number of time steps before we add another experience to replay memory
      // spec.experience_size = 10000; // size of experience replay memory
      // spec.num_hidden_units = 50; // number of neurons in hidden layer
      const FOOD_REWARD: number = 5;
      const DEATH_REWARD: number = -10;
      const SILENCE_REWARD: number = -1;
      function getNumStates(): number {
        return GAME_WIDTH * GAME_HEIGHT;
      }
      const env = { getNumStates, getMaxNumActions: () => 4 };
      export const agent = new RL.DQNAgent(env, spec);
  31. SNAKE IMPLEMENTATION GAME BRAIN
      function getState(state: State): Array<number> {
        const { snake, food } = state;
        const zerosArray = new Array(getNumStates()).fill(0);
        zerosArray[calculateCellNumber(food)] = 2;
        zerosArray[calculateCellNumber(snake.position)] = 1;
        snake.tail.forEach(
          cell => (zerosArray[calculateCellNumber(cell)] = -1)
        );
        return zerosArray;
      }
      export function getAction(state: State): number {
        return agent.act(getState(state));
      }
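calculateCellNumber is not shown in the deck; a plausible implementation, assuming row-major indexing over the GAME_WIDTH × GAME_HEIGHT grid:

    // map a grid position to an index in the flat state array (row-major order)
    function calculateCellNumber(position: Position): number {
      return position.y * GAME_WIDTH + position.x;
    }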
  32. SNAKE IMPLEMENTATION GAME BRAIN
      export function learn(prevState: State, nextState: State): number {
        if (prevState.snake.tail.length < nextState.snake.tail.length) {
          agent.learn(FOOD_REWARD);
          return FOOD_REWARD;
        } else if (nextState.snake.dead) {
          agent.learn(DEATH_REWARD);
          return DEATH_REWARD;
        }
        agent.learn(SILENCE_REWARD);
        return SILENCE_REWARD;
      }
  33. SNAKE IMPLEMENTATION GAME BRAIN
      function tick(prevState: State) {
        const input = BRAIN_ACTIONS_MAPPING[getAction(prevState)];
        const nextState = update({ ...prevState, input });
        const state = nextState.snake.dead ? initialState : nextState;
        draw(state);
        setTimeout(() => {
          window.requestAnimationFrame(() => tick(state));
        }, 1000 / FPS);
      }
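BRAIN_ACTIONS_MAPPING is not defined in the deck; presumably it maps the four DQN action indices to the game's input strings, and learn() from slide 32 is invoked once the next state is known. A hedged sketch of how the loop could be wired together:

    // our assumption: action index from agent.act() → game input string
    const BRAIN_ACTIONS_MAPPING = ["up", "down", "left", "right"];
    function tick(prevState: State) {
      const input = BRAIN_ACTIONS_MAPPING[getAction(prevState)];
      const nextState = update({ ...prevState, input });
      learn(prevState, nextState); // reward the transition (slide 32)
      const state = nextState.snake.dead ? initialState : nextState;
      draw(state);
      setTimeout(() => {
        window.requestAnimationFrame(() => tick(state));
      }, 1000 / FPS);
    }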