
Mario Bros meets reinforcement learning

Do you want to learn the fundamentals behind the technology that made it possible to beat the world Go champion? Would you like to understand how a machine can be taught to solve problems without your direct intervention?

If so, this is a talk you can't miss! Mario Bros will no longer need you to finish the levels.

Python Pereira

June 29, 2019

Transcript

  1. About Me
     Cristian Vargas
     Lead developer at Swapps
     Electronic engineer
     PythonCali and Calidev organizer
     @cdvv7788
     https://github.com/cdvv7788
  2. Repeat until it works... or you give up trying
     Taken from: https://www.livechatinc.com/blog/what-is-artificial-intelligence-ai/
  3. Mario Library for Gym
     https://github.com/Kautenja/gym-super-mario-bros
  4. Basic program

     import gym
     from gym import wrappers
     from random_agent import RandomAgent
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros
     from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

     agent = RandomAgent()
     env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
     # Reduce the full NES button space to a small discrete action set
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
     # Record a video of every second episode
     env = gym.wrappers.Monitor(env, agent.get_recording_folder(),
                                video_callable=lambda episode_id: episode_id % 2 == 0,
                                force=True)

     done = True
     episode = 1
     while True:
         if done:
             print('Restarting env. Attempt # {}'.format(episode))
             state = env.reset()
             episode += 1
         state, reward, done, info = agent.run(env)
     env.close()
  5. Random Agent (Duuuhhh)

     class RandomAgent(Agent):
         """
         Randomly executes an action from the action space.
         Has no memory nor intelligence at all.
         """
         def get_recording_folder(self):
             return './random'

         def run(self, env):
             # Sample a random action from the action space and apply it
             state, reward, done, info = env.step(env.action_space.sample())
             return state, reward, done, info
  6. Q-Learning
     Is this applicable to the Mario game?
     Possible number of states, assuming we feed a single input at a time, resized to 84x84, converted to grayscale, and stacked 4 frames deep: 84 * 84 * 4 * 256 = 7,225,344, far too many for a plain Q-table.
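     For reference, a minimal tabular Q-learning sketch on a small Gym environment (the FrozenLake environment, hyperparameters, and variable names below are illustrative assumptions, not part of the talk):

     import gym
     import numpy as np

     ALPHA = 0.1    # learning rate (illustrative)
     GAMMA = 0.99   # discount factor (illustrative)
     EPSILON = 0.1  # exploration rate (illustrative)

     env = gym.make('FrozenLake-v0')
     q_table = np.zeros((env.observation_space.n, env.action_space.n))

     for episode in range(5000):
         state = env.reset()
         done = False
         while not done:
             # Epsilon-greedy action selection
             if np.random.rand() < EPSILON:
                 action = env.action_space.sample()
             else:
                 action = np.argmax(q_table[state])
             next_state, reward, done, info = env.step(action)
             # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
             td_target = reward + GAMMA * np.max(q_table[next_state])
             q_table[state, action] += ALPHA * (td_target - q_table[state, action])
             state = next_state

     Mario's image observations make a table like q_table impossibly large, which is what motivates the deep variant on the later slides.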
  7. Q-Learning
     Preprocessing steps
     I followed the approach proposed in the original DeepMind paper and built on earlier work from a Toptal post. The approach is:
     1. Scale to 84x84
     2. Take only every 4th frame (frameskip)
     3. Take 4 of these frames to create an input of 84x84x4
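     A minimal sketch of those steps, assuming OpenCV is available; the function and class names are illustrative, not the speaker's code. Frame skipping (step 2) is usually done by repeating the chosen action for 4 environment steps and keeping only the last frame.

     import cv2
     import numpy as np
     from collections import deque

     def preprocess_frame(frame):
         # Step 1: convert the RGB frame to grayscale and scale it down to 84x84
         gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
         return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

     class FrameStacker:
         """Step 3: keep the last 4 preprocessed frames as a single 84x84x4 input."""
         def __init__(self, stack_size=4):
             self.frames = deque(maxlen=stack_size)

         def reset(self, frame):
             processed = preprocess_frame(frame)
             for _ in range(self.frames.maxlen):
                 self.frames.append(processed)
             return np.stack(self.frames, axis=2)

         def step(self, frame):
             self.frames.append(preprocess_frame(frame))
             return np.stack(self.frames, axis=2)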
  8. Deep Q-Learning
     In short, we approximate the Q-table using a neural network.
     Challenge: the input data is highly correlated, breaking the assumption behind gradient descent that inputs are independent and identically distributed (i.i.d.) random variables.
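     To make the approximation concrete, a sketch of the usual DQN training loss, assuming PyTorch (the function signature, variable names, and GAMMA value are assumptions for illustration): instead of looking values up in a table, the network is pushed towards the one-step TD target.

     import torch
     import torch.nn.functional as F

     GAMMA = 0.99  # illustrative discount factor

     def dqn_loss(network, states, actions, rewards, next_states, dones):
         # Q(s, a) for the actions that were actually taken
         q_values = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
         with torch.no_grad():
             # One-step TD target: r + gamma * max_a' Q(s', a'), zero value for terminal states
             next_q = network(next_states).max(1)[0]
             targets = rewards + GAMMA * next_q * (1 - dones)
         return F.mse_loss(q_values, targets)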
  9. Deep Q-Learning
     1. Easy to overfit
     2. Gets stuck in local minima
     3. Quickly forgets previous experiences
     In general terms, using neural networks with reinforcement learning is an unstable process.
  10. Deep Q-Learning
      Network Architecture

      import torch.nn as nn

      class NeuralNetwork(nn.Module):
          def __init__(self, number_of_actions):
              super(NeuralNetwork, self).__init__()
              self.number_of_actions = number_of_actions
              # Convolutions over the stacked 84x84x4 input
              self.conv1 = nn.Conv2d(4, 32, 8, 4)
              self.relu1 = nn.ReLU(inplace=True)
              self.conv2 = nn.Conv2d(32, 64, 4, 2)
              self.relu2 = nn.ReLU(inplace=True)
              self.conv3 = nn.Conv2d(64, 64, 3, 1)
              self.relu3 = nn.ReLU(inplace=True)
              # 64 feature maps of 7x7 = 3136 inputs to the fully connected head
              self.fc1 = nn.Linear(3136, 512)
              self.relu4 = nn.ReLU(inplace=True)
              # One Q-value output per possible action
              self.fc2 = nn.Linear(512, self.number_of_actions)
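      The slide omits the forward pass. A plausible sketch, assuming the layer names above and the 7 actions of SIMPLE_MOVEMENT (both assumptions, not shown in the talk); the method would normally be defined inside the class:

      import torch

      def forward(self, x):
          # x: (batch, 4, 84, 84), four stacked grayscale frames
          out = self.relu1(self.conv1(x))    # -> (batch, 32, 20, 20)
          out = self.relu2(self.conv2(out))  # -> (batch, 64, 9, 9)
          out = self.relu3(self.conv3(out))  # -> (batch, 64, 7, 7)
          out = out.view(out.size(0), -1)    # flatten to (batch, 3136)
          out = self.relu4(self.fc1(out))
          return self.fc2(out)               # one Q-value per action

      NeuralNetwork.forward = forward  # attached here only for illustration

      net = NeuralNetwork(number_of_actions=7)
      print(net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 7])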
  11. Experience Replay
      1. We save every new step in a buffer of defined length.
      2. We train the neural network using random samples of defined length taken from the memory.
      3. We discard the oldest entries in the memory when it is full.
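      A minimal replay buffer along those lines (the class name, capacity, and batch size are illustrative assumptions):

      import random
      from collections import deque

      class ReplayBuffer:
          def __init__(self, capacity=100000):
              # A deque drops the oldest entries automatically once it is full (step 3)
              self.memory = deque(maxlen=capacity)

          def push(self, state, action, reward, next_state, done):
              # Save every new step (step 1)
              self.memory.append((state, action, reward, next_state, done))

          def sample(self, batch_size=32):
              # Random samples break the correlation between consecutive frames (step 2)
              return random.sample(self.memory, batch_size)

          def __len__(self):
              return len(self.memory)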
  12. Double Deep Q (DDQN)
      DQN suffers from overconfidence in its estimates. This means it can propagate noisy estimated values across the entire Q-table, which affects the stability of the algorithm.
  13. Double Deep Q (DDQN)
      A proposed solution named Double Learning attacks this problem: keep 2 different tables, take the action from table #1, and take the value for the proposed action from table #2.
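      A sketch of that idea in the deep setting, assuming PyTorch and two copies of the network from slide 10, an online network and a target network (the function name and GAMMA value are assumptions):

      import torch

      GAMMA = 0.99  # illustrative discount factor

      def ddqn_target(online_net, target_net, rewards, next_states, dones):
          with torch.no_grad():
              # "Table #1" (online network) proposes the best next action...
              best_actions = online_net(next_states).argmax(1, keepdim=True)
              # ...and "table #2" (target network) supplies the value for that action
              next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
          return rewards + GAMMA * next_q * (1 - dones)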
  14. Next Steps
      Prioritized Experience Replay
      Alternatives to epsilon-greedy exploration
      Asynchronous advantage actor critic (A3C)
      A lot more approaches, like IMPALA