
Supermario-Tutorial-1

Wonseok Jung
September 10, 2018

Training a Super Mario agent with reinforcement learning: Series Part 1


Transcript

  1. SUPERMARIO TUTORIAL SERIES
     1. Environment and DQN
     2. Deep Reinforcement Learning with Double Q-learning
     3. Dueling Network Architectures for Deep Reinforcement Learning
     4. Prioritized Experience Replay
     5. Noisy Networks for Exploration
     6. A Distributional Perspective on Reinforcement Learning
     7. Rainbow: Combining Improvements in Deep Reinforcement Learning
  2. SUPERMARIO TUTORIAL SERIES (same series overview as slide 1)
  3. CONTENTS
     1. Markov Decision Process
     2. How to install the Super Mario environment
     3. The Super Mario environment
     4. Training
     5. DQN
     6. Result
  4. MARKOV DECISION PROCESS
     [Agent-environment diagram: in state St the agent takes action At; the environment returns reward Rt+1 (a reward or a penalty) and the next state St+1.]
  5. MARKOV DECISION PROCESS
     [Same agent-environment diagram as slide 4.]
  6. INSTALL AND IMPORT ENVIRONMENT
     https://github.com/wonseokjung/gym-super-mario-bros

     pip install gym-super-mario-bros

     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.reset()
     env.render()
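     As a quick sanity check of the install, a minimal random-play loop (a sketch on the raw,
     unwrapped environment, so actions are sampled from the full button-combination space):

     import gym_super_mario_bros

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     state = env.reset()
     done = False
     while not done:
         action = env.action_space.sample()              # random action on the raw env
         state, reward, done, info = env.step(action)    # one frame of gameplay
         env.render()
     env.close()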
  7. WORLDS & LEVELS (WORLD 1~4)
     [Screenshots of Worlds 1-4.]
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  8. WORLDS & LEVELS (WORLD 5~8)
     [Screenshots of Worlds 5-8.]
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
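     For example, to load one specific stage (a sketch; world 1, level 2, and ROM version v0
     below are just illustrative choices for the placeholders above):

     import gym_super_mario_bros

     env = gym_super_mario_bros.make('SuperMarioBros-1-2-v0')   # world 1, level 2, v0 ROM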
  9. REWARD AND PENALTY
     Reward: moving closer to the flag; reaching the goal.
     Penalty: moving away from the flag; every tick of the clock; failing to reach the goal (dying).
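     A small sketch of how these rewards surface in code, summing the per-step reward over one
     episode of random play (random actions are only for illustration):

     import gym_super_mario_bros

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     state = env.reset()
     total_reward, done = 0.0, False
     while not done:
         # reward is positive while Mario moves toward the flag and
         # negative when he moves away, as time passes, or when he dies
         state, reward, done, info = env.step(env.action_space.sample())
         total_reward += reward
     print('episode return:', total_reward)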
  10. STATE, ACTION
     import gym_super_mario_bros
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.observation_space.shape   # (240, 256, 3) = [height, width, channels]
     env.action_space.n            # 256 raw button combinations

     SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'],
                        ['right', 'A', 'B'], ['A'], ['left']]
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
  11. OBSERVATION SPACE
     env.observation_space.shape   # (240, 256, 3) = [height, width, channels]
     (Code otherwise as on slide 10.)
  12. OBSERVATION SPACE (same as slide 11)
  13. ACTION SPACE
     env.action_space.n            # 256 raw button combinations
     (Code otherwise as on slide 10.)
  14. ACTION AFTER WRAPPER
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
     env.action_space.n            # 7 after wrapping: one discrete action per SIMPLE_MOVEMENT entry
     (Code otherwise as on slide 10.)
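     Making the effect of the wrapper explicit (a sketch; it assumes the nes_py version used in
     this deck, where the wrapper is still called BinarySpaceToDiscreteSpaceEnv; newer nes_py
     releases rename it to JoypadSpace):

     import gym_super_mario_bros
     from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     print(env.action_space.n)             # 256 raw button combinations

     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
     print(env.action_space.n)             # 7, one per SIMPLE_MOVEMENT entry
     print(env.observation_space.shape)    # (240, 256, 3): height, width, RGB channels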
  15. EXPLOITATION AND EXPLORATION
     Exploitation or exploration?

     def epsilon_greedy(q_value, step):
         if np.random.rand() < epsilon:
             action = np.random.randint(output)   # exploration: random action
         else:
             action = np.argmax(output)           # exploitation: greedy action

     next_state, reward, done, info = env.step(action)
  16. EXPLORATION (same code as slide 15)
  17. EXPLOITATION (same code as slide 15)
  18. ENV.STEP( )
     next_state, reward, done, info = env.step(action)
     (Code otherwise as on slide 15.)
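     A fuller sketch of the ε-greedy policy with the linear decay schedule whose constants
     (eps_max, eps_min, eps_decay_steps) appear on the following slides; n_outputs is assumed
     to be env.action_space.n after the SIMPLE_MOVEMENT wrapper:

     import numpy as np

     eps_max, eps_min, eps_decay_steps = 1.0, 0.1, 200000
     n_outputs = 7   # env.action_space.n after the wrapper

     def epsilon_greedy(q_values, step):
         # linearly anneal epsilon from eps_max to eps_min over eps_decay_steps steps
         epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
         if np.random.rand() < epsilon:
             return np.random.randint(n_outputs)   # explore: uniform random action
         return int(np.argmax(q_values))           # exploit: greedy action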
  19. EXPLORATION RATE AND REPLAY MEMORY BUFFER
     eps_max = 1.0
     eps_min = 0.1
     eps_decay_steps = 200000

     from collections import deque
     memory = deque([], maxlen=1000000)

     next_state, reward, done, info = env.step(action)
     memory.append((state, action, reward, next_state))   # store (St, At, Rt+1, St+1)
  20. REPLAY MEMORY BUFFER (same code as slide 19)
  21. REPLAY MEMORY BUFFER (same code as slide 19)
  22. REPLAY MEMORY BUFFER (same code as slide 19)
  23. REPLAY MEMORY BUFFER (same code as slide 19)
  24. REPLAY MEMORY BUFFER (same code as slide 19)
  25. REPLAY MEMORY BUFFER (same code as slide 19)
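     One common way to draw training minibatches from this buffer (a sketch; storing the done
     flag with each transition is an addition here, so targets can be cut off at episode ends):

     import random
     from collections import deque

     import numpy as np

     memory = deque([], maxlen=1000000)

     def store(state, action, reward, next_state, done):
         memory.append((state, action, reward, next_state, done))   # one transition

     def sample_batch(batch_size=32):
         batch = random.sample(memory, batch_size)
         states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
         return states, actions, rewards, next_states, dones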
  26. MINIMIZE LOSS
     Squared TD error on a stored transition (St, At, Rt+1, St+1):
     ( Rt+1 + γt+1 * max_a′ q_θ(St+1, a′) − q_θ(St, At) )²

     import tensorflow as tf

     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
  27. MINIMIZE LOSS (same as slide 26)
  28. MINIMIZE LOSS (same as slide 26)
  29. MINIMIZE LOSS (same as slide 26)
  30. MINIMIZE LOSS (same as slide 26)
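     A sketch of how the loss above fits into a TF1 graph. The Q-network itself is not shown on
     the slides, so a deliberately tiny dense network over an assumed preprocessed 84x84 input
     stands in for it; gamma and learning_rate are assumed values:

     import tensorflow as tf

     gamma = 0.99
     learning_rate = 0.00025
     n_outputs = 7                                            # actions after the wrapper

     X_state = tf.placeholder(tf.float32, [None, 84 * 84])    # flattened, preprocessed frame
     hidden = tf.layers.dense(X_state, 64, activation=tf.nn.relu)
     Q_values = tf.layers.dense(hidden, n_outputs)            # Q(s, .) for every action

     X_action = tf.placeholder(tf.int32, [None])              # actions taken in the batch
     y = tf.placeholder(tf.float32, [None])                   # TD targets for the batch

     # Q(s, a) of the action actually taken
     Q_action = tf.reduce_sum(Q_values * tf.one_hot(X_action, n_outputs), axis=1)

     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)

     # Outside the graph, for a sampled batch:
     #   y_batch = rewards + gamma * (1 - dones) * next_q_values.max(axis=1)
     # and one update is sess.run(training_op, feed_dict={X_state: ..., X_action: ..., y: ...}).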
  31. DOUBLE DQN
     [Diagram: the environment (Env) emits state s; the Q-network maps the input state to action
     values Q(s, a); the chosen action a is applied to the environment, and transitions
     (St, At, Rt+1, St+1) are stored in the replay memory.]
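     Since this slide looks ahead to part 2, a short sketch of how the Double-DQN target differs
     from the plain DQN target; online_q and target_q stand for the two networks' Q-values on the
     next states, and all names here are illustrative:

     import numpy as np

     def dqn_target(rewards, dones, target_q, gamma=0.99):
         # plain DQN: the target network both selects and evaluates the next action
         return rewards + gamma * (1.0 - dones) * target_q.max(axis=1)

     def double_dqn_target(rewards, dones, online_q, target_q, gamma=0.99):
         # Double DQN: the online network selects the action, the target network evaluates it
         best = online_q.argmax(axis=1)
         return rewards + gamma * (1.0 - dones) * target_q[np.arange(len(best)), best]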
  32. *Reference: terms and notation
     Time step             t
     State                 s
     Action                a
     Reward                r
     Set of states         S
     Set of actions        A
     Set of rewards        R
     Start state           S0
     Transition function   P(s′, r | s, a)
     Discount factor       γ
     Policy                π