
A.I Supermario with Reinforcement Learning
Almost Everything About AI Super Mario

Pycon KR 2018

Wonseok Jung
August 20, 2018


Transcript

  1. Wonseok Jung (정원석): City University of New York, Baruch College, Data Science major. ConnexionAI, AI Researcher. Deep Learning College, Reinforcement Learning Researcher. ModuLabs CTRL (Contest in RL) Leader.
     Interests: Reinforcement Learning, Object Detection, Chatbot. Github: https://github.com/wonseokjung / Facebook: https://www.facebook.com/wsjung / Blog: https://wonseokjung.github.io
  2. Table of Contents: 1. How Animals Learn 2. How Humans Learn 3. Reinforcement Learning 4. SuperMario with Reinforcement Learning. REINFORCEMENT LEARNING
  3. PREVIEW: reinforcement learning applies to animals, humans, and SuperMario. [Diagram: agent-environment loop; the agent takes action At in state St, and the environment returns reward Rt+1 and next state St+1.] REINFORCEMENT LEARNING
  4. ALL ANIMALS HAVE THE ABILITY TO LEARN
     - Every animal has the ability to learn.
     - Even C. elegans, a roundworm with only a small number of neurons, is able to learn.
     - Head-withdrawal reflex: a reflex action triggered by the judgment that a dangerous object may be present.
     - If you touch the head of C. elegans, it backs away a fixed distance.
     HOW ANIMALS LEARN
  5. LAW OF EFFECT
     - Edward Thorndike
     - Law of effect: if the result of an action is satisfying, the action is repeated the next time; if it is not satisfying, the action is not repeated.
     - Reinforcement: a stimulus that makes the preceding behavior more likely to be repeated.
     - Punishment: a stimulus that makes the preceding behavior be avoided.
     HOW ANIMALS LEARN
  6. HOW HUMANS LEARN: TAP BALL. [Chart: for each day of practice, the best score and the number of successful taps, both improving across days.] HOW HUMANS LEARN
  7. LEARNING: Reinforcement learning selects the action that maximizes reward. The learner tries out many actions and finds the one that yields the highest reward. The chosen action can affect not only the immediate reward but also the next situation and the rewards that follow. (Action -> immediate change of situation -> future situations; Reward -> future rewards.) REINFORCEMENT LEARNING
  8. MARKOV DECISION PROCESS. [Diagram: the agent acts with At in state St; the environment returns reward Rt+1 and next state St+1. Tapping the ball yields a positive reward.] REINFORCEMENT LEARNING

  9.-10. MARKOV DECISION PROCESS. [Diagram: the Agent takes Action At in State St; the Environment returns Reward Rt+1 and next state St+1. For SuperMario, the Reward signal consists of rewards (+) and penalties (-).] SUPERMARIO WITH R.L
  11. INSTALL AND IMPORT ENVIRONMENT (https://github.com/wonseokjung/gym-super-mario-bros) SUPERMARIO WITH R.L

     pip install gym-super-mario-bros

     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.reset()
     env.render()
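Not on the slides, but a minimal way to check the installation: a random agent that resets the environment, samples raw actions, and renders each frame. The episode count and the use of action_space.sample() are illustrative choices; total_reward simply accumulates the rewards and penalties described on the following slides.

     # Minimal random-agent smoke test (illustrative sketch, not from the deck).
     import gym_super_mario_bros

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     for episode in range(3):                    # a few short episodes are enough for a smoke test
         state = env.reset()                     # initial frame, shape (240, 256, 3)
         total_reward = 0.0
         done = False
         while not done:
             action = env.action_space.sample()  # random action from the raw action space
             state, reward, done, info = env.step(action)
             total_reward += reward              # accumulates rewards and penalties
             env.render()                        # draw the current frame
         print(episode, total_reward)
     env.close()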
  12. WORLDS & LEVELS (WORLD 1~4) SUPERMARIO WITH R.L
     World 1, World 2, World 3, World 4
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  13. WORLDS & LEVELS (WORLD 5~8) SUPERMARIO WITH R.L
     World 5, World 6, World 7, World 8
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
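Filling in the template above with concrete values, for example world 1, level 2, version 0 (the specific numbers here are only an illustration of the naming scheme):

     import gym_super_mario_bros

     # 'SuperMarioBros-<world>-<level>-v<version>' with example values world=1, level=2, version=0
     env = gym_super_mario_bros.make('SuperMarioBros-1-2-v0')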
  14. REWARD AND PENALTY SUPERMARIO WITH R.L
     Reward (+): moving closer to the flag; reaching the goal.
     Penalty (-): failing to reach the goal; each time step that passes; moving away from the flag.
  15.-18. STATE AND ACTION: OBSERVATION SPACE, ACTION SPACE, ACTION AFTER WRAPPER. SUPERMARIO WITH R.L

     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros

     SIMPLE_MOVEMENT = [
         ['nop'],
         ['right'],
         ['right', 'A'],
         ['right', 'B'],
         ['right', 'A', 'B'],
         ['A'],
         ['left'],
     ]

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.observation_space.shape   # (240, 256, 3): [height, width, channel]
     env.action_space.n            # 256: raw button combinations, before the wrapper
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
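A quick usage sketch of the wrapped environment, reusing the SIMPLE_MOVEMENT list and wrapper from the code above; the chosen action index is only an example. After the wrapper, an action is simply an index into SIMPLE_MOVEMENT, so the discrete action space has 7 entries (the length of the list).

     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros

     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)  # SIMPLE_MOVEMENT as defined above

     state = env.reset()
     print(state.shape)            # (240, 256, 3): [height, width, channel]
     print(env.action_space.n)     # 7: one discrete action per SIMPLE_MOVEMENT entry

     action = 1                    # index into SIMPLE_MOVEMENT, here ['right']
     next_state, reward, done, info = env.step(action)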
  19.-22. EXPLOITATION AND EXPLORATION, ENV.STEP() SUPERMARIO WITH R.L

     def epsilon_greedy(q_value, step):
         # epsilon: exploration rate (its decay schedule is given on the next slides)
         if np.random.rand() < epsilon:
             action = np.random.randint(len(q_value))   # Exploration: try a random action
         else:
             action = np.argmax(q_value)                # Exploitation: take the best-known action
         return action

     next_state, reward, done, info = env.step(action)
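A self-contained version of the selection rule above: a sketch in which epsilon is decayed linearly over training steps, using the eps_max / eps_min / eps_decay_steps constants introduced on the next slides; n_outputs is an assumed name for the 7 wrapped actions.

     import numpy as np

     eps_max = 1.0
     eps_min = 0.1
     eps_decay_steps = 200000
     n_outputs = 7                 # number of discrete actions after the SIMPLE_MOVEMENT wrapper

     def epsilon_greedy(q_value, step):
         # Decay epsilon linearly from eps_max to eps_min over eps_decay_steps training steps.
         epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
         if np.random.rand() < epsilon:
             return np.random.randint(n_outputs)   # Exploration: random action
         return int(np.argmax(q_value))            # Exploitation: greedy action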
  23.-29. EXPLORATION RATE AND REPLAY MEMORY BUFFER SUPERMARIO WITH R.L

     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000

     memory = deque([], maxlen=1000000)

     next_state, reward, done, info = env.step(action)
     memory.append((state, action, reward, next_state))   # store the transition (St, At, Rt+1, St+1)
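A sketch of how the replay buffer above is typically filled and sampled for training minibatches. The sample_memories helper, the batch size, and the extra done flag (so terminal transitions can be handled in the loss) are illustrative additions, not from the slides.

     from collections import deque
     import numpy as np

     memory = deque([], maxlen=1000000)            # keeps only the most recent 1M transitions

     def store(state, action, reward, next_state, done):
         # One transition (St, At, Rt+1, St+1) per tuple; done marks the end of an episode.
         memory.append((state, action, reward, next_state, done))

     def sample_memories(batch_size=32):
         # Draw a random minibatch of stored transitions for one training step.
         idx = np.random.randint(len(memory), size=batch_size)
         batch = [memory[i] for i in idx]
         states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
         return states, actions, rewards, next_states, dones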
  30.-34. MINIMIZE LOSS SUPERMARIO WITH R.L

     Loss for a sampled transition (St, At, Rt+1, St+1):
     (Rt+1 + γt+1 · max_a' q_θ(St+1, a') − q_θ(St, At))²

     import tensorflow as tf

     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
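A slightly fuller sketch of the same loss in TensorFlow 1.x style. Here y stands for the target Rt+1 + γ·max_a' q_θ(St+1, a') computed from sampled transitions, and Q_action is q_θ(St, At), the network's value for the action actually taken. The placeholder names, the tiny stand-in network, and the learning rate are assumptions for illustration; the real network would be convolutional.

     import tensorflow as tf        # TensorFlow 1.x style, as on the slides

     n_outputs = 7                  # discrete actions after the SIMPLE_MOVEMENT wrapper
     learning_rate = 0.00025        # illustrative value

     # Inputs for one sampled minibatch of transitions (St, At, Rt+1, St+1).
     X_state  = tf.placeholder(tf.float32, shape=[None, 84, 84, 4])  # preprocessed frames (assumed shape)
     X_action = tf.placeholder(tf.int32, shape=[None])               # At
     y        = tf.placeholder(tf.float32, shape=[None])             # Rt+1 + γ·max_a' q_θ(St+1, a')

     # Small stand-in Q-network producing q_θ(St, ·) for every action.
     hidden          = tf.layers.dense(tf.layers.flatten(X_state), 64, activation=tf.nn.relu)
     online_q_values = tf.layers.dense(hidden, n_outputs)

     # q_θ(St, At): keep only the predicted value of the action that was actually taken.
     Q_action = tf.reduce_sum(online_q_values * tf.one_hot(X_action, n_outputs), axis=1)

     loss = tf.reduce_mean(tf.square(y - Q_action))   # mean of (y − q_θ(St, At))²
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)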
  35. DOUBLE DQN SUPERMARIO WITH R.L [Diagram: transitions (St, At, Rt+1, St+1) flow from the Env into the replay memory; the Q-network takes a state s as input and outputs action values Q(s, a), from which the action a is chosen and the reward r observed.]
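The Double DQN idea in the diagram comes down to how the target y is built: the online network picks the best action for St+1, and a separate target network evaluates that action. A minimal sketch at the NumPy level, with next_online_q and next_target_q standing in for the two networks' outputs for St+1:

     import numpy as np

     def double_dqn_target(reward, done, next_online_q, next_target_q, gamma=0.99):
         # Online network selects the action, target network evaluates it.
         best_action = int(np.argmax(next_online_q))
         bootstrap = next_target_q[best_action]
         # y = Rt+1 + γ · q_target(St+1, argmax_a q_online(St+1, a)); no bootstrap on terminal states.
         return reward + (0.0 if done else gamma * bootstrap)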
  36. SUMMARY: 1. How Animals Learn 2. How Humans Learn 3. Reinforcement Learning 4. SuperMario with Reinforcement Learning. REINFORCEMENT LEARNING
  37. For reference: terms and symbols. REINFORCEMENT LEARNING
     Time step: t
     Action: a
     Transition function: P(s′, r ∣ s, a)
     Reward: r
     Set of states: S
     Set of actions: A
     Start state: S0
     Discount factor: γ
     Set of rewards: R
     Policy: π
     State: s
  38. REFERENCES
     1. Habituation: The Birth of Intelligence.
     2. Law of effect: The Birth of Intelligence, p. 171.
     3. Thorndike, E. L. (1905). The Elements of Psychology. New York: A. G. Seiler.
     4. Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Monographs: General and Applied, 2(4), i-109.
     5. SuperMario environment: https://github.com/Kautenja/gym-super-mario-bros
     6. http://faculty.coe.uh.edu/smcneil/cuin6373/idhistory/thorndike_extra.html