

A.I Supermario with Reinforcement Learning

Pycon KR 2018

Almost Everything About the AI Super Mario (인공지능 슈퍼마리오의 거의 모든것)


Wonseok Jung

August 20, 2018

Transcript

  1. Almost Everything About the AI Super Mario: Reinforcement Learning / Wonseok Jung

  2. Wonseok Jung (정원석): City University of New York, Baruch College, Data Science major; ConnexionAI, AI Researcher; Deep Learning College, Reinforcement Learning Researcher; Modulabs CTRL (Contest in RL) Leader.
     Interests: Reinforcement Learning, Object Detection, Chatbot.
     Github: https://github.com/wonseokjung / Facebook: https://www.facebook.com/wsjung / Blog: https://wonseokjung.github.io
  3. Table of Contents: 1. How Animals Learn 2. How Humans Learn 3. Reinforcement Learning 4. SuperMario with Reinforcement Learning REINFORCEMENT LEARNING
  4. PREVIEW REINFORCEMENT LEARNING: Animal, Human, SuperMario; Agent-Environment interaction loop (A, Env, R, S; At, Rt, St, Rt+1, St+1)
  5. HOW ANIMALS LEARN

  6. ALL ANIMALS HAVE THE ABILITY TO LEARN - All animals have the ability to learn - Even C. elegans, which has only a few hundred neurons, is able to learn - Head-withdrawal reflex: a reflexive behavior taken when the worm judges that a dangerous object is present - When the head of C. elegans is touched, it moves backward a fixed distance HOW ANIMALS LEARN
  7. HABITUATION HOW ANIMALS LEARN: First try, Second try, Third try

  8. LAW OF EFFECT - Edward Thorndike - Law of effect: if the consequence of a behavior is satisfying, the behavior is repeated; if it is not satisfying, the behavior is not repeated - Reinforcement: a stimulus that makes a previously performed behavior more likely to be repeated - Punishment: a stimulus that makes a previously performed behavior be avoided HOW ANIMALS LEARN
  9. EXAMPLE OF THE LAW OF EFFECT HOW ANIMALS LEARN

  10. HOW HUMANS LEARN

  11. INTERACTION WITH ENVIRONMENT: Environment, Experience, Learn, Interaction HOW HUMANS LEARN

  12. HOW HUMANS LEARN? - Reinforcement: a stimulus that makes a previously performed behavior more likely to be repeated - Punishment: a stimulus that makes a previously performed behavior be avoided HOW HUMANS LEARN

  13. HOW HUMANS LEARN: Experiment Using Tap Ball HOW HUMANS LEARN https://www.youtube.com/watch?v=2sicukP34fk

  14. HOW HUMANS LEARN - TAP BALL: Day 1 through Day 4, highest score and number of hits per day HOW HUMANS LEARN
  15. TAP BALL DAY 5: Day 5, highest score and number of hits HOW HUMANS LEARN

  16. SIMILAR LEARNING METHODS B/W ANIMALS AND HUMANS HOW HUMANS LEARN: Punishment
  17. REINFORCEMENT LEARNING

  18. REINFORCEMENT LEARNING: Environment, Experience, Learn, Interaction REINFORCEMENT LEARNING

  19. LEARNING Reinforcement learning selects the action that maximizes the reward. The learner tries various actions and looks for the action that yields the highest reward. The selected action affects not only the immediate reward; it can also influence the next situation and the rewards that follow. Action → change in the current situation, future situations, Reward, future Rewards REINFORCEMENT LEARNING
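     To make "influence the rewards that follow" concrete, the quantity the agent maximizes is the discounted return; the standard definition (not shown on the slides) is

         G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

     so an action that changes S_{t+1} also changes every later reward that enters G_t.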
  20. Agent Exploitation Exploration ? EXPLOITATION AND EXPLORATION REINFORCEMENT LEARNING

  21. IMPORTANCE OF EXPLORATION: two cats, a Russian Blue (driven by curiosity) and a Munchkin (driven by food) REINFORCEMENT LEARNING
  22. IMPORTANCE OF EXPLORATION-2 REINFORCEMENT LEARNING: one cat with zero exploration, the other exploring

  23. IMPORTANCE OF EXPLORATION-3 REINFORCEMENT LEARNING: Fail

  24. MARKOV DECISION PROCESS: Action, Agent, Environment, Reward, State (At, Rt, St, Rt+1, St+1) REINFORCEMENT LEARNING
  25. MARKOV DECISION PROCESS: Environment, Reward (At, Rt, St, Rt+1, St+1) REINFORCEMENT LEARNING. Tap the ball → Positive Reward

  26. STATE-VALUE FUNCTION: State value REINFORCEMENT LEARNING

  27. STATE-ACTION VALUE FUNCTION: State-Action value REINFORCEMENT LEARNING

  28. OPTIMAL POLICY: Optimal State-Value function, Optimal State-Action Value function REINFORCEMENT LEARNING
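     The value-function slides are images in the original deck; in standard notation (the usual definitions these names refer to, not transcribed from the slides):

         v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
         v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a), \qquad \pi_*(s) = \arg\max_a q_*(s, a)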

  29. SUPERMARIO WITH REINFORCEMENT LEARNING

  30. MARKOV DECISION PROCESS: Action, Agent, Environment, Reward, State (At, Rt, St, Rt+1, St+1) SUPERMARIO WITH R.L Reward (+), Penalty (-)
  31. MARKOV DECISION PROCESS: Action, Agent, Environment, Reward, State (At, Rt, St, Rt+1, St+1) SUPERMARIO WITH R.L Reward (+), Penalty (-)
  32. SUPERMARIO WITH R.L https://github.com/wonseokjung/gym-super-mario-bros INSTALL AND IMPORT ENVIRONMENT
     pip install gym-super-mario-bros
     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.reset()
     env.render()
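     Put together, a minimal loop that drives the environment with random actions looks roughly like this (a sketch using the same gym-style API as the slides; the random policy is only to verify the environment runs):

         import gym_super_mario_bros

         env = gym_super_mario_bros.make('SuperMarioBros-v0')
         state = env.reset()
         done = False
         while not done:
             action = env.action_space.sample()                 # random action from the raw NES action space
             next_state, reward, done, info = env.step(action)  # same step signature used on the later slides
             env.render()
             state = next_state
         env.close()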
  33. WORLDS & LEVELS (WORLD 1~4) SUPERMARIO WITH R.L World 1, World 2, World 3, World 4
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  34. WORLDS & LEVELS (WORLD 5~8) SUPERMARIO WITH R.L World 5, World 6, World 7, World 8
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  35. ALL WORLDS AND LEVELS SUPERMARIO WITH R.L env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  36. ALL WORLDS AND LEVELS SUPERMARIO WITH R.L env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  37. WORLDS & LEVELS SUPERMARIO WITH R.L Version 0, Version 1, Version 2, Version 3
     env = gym_super_mario_bros.make('SuperMarioBros-<world>-<level>-v<version>')
  38. GOAL SUPERMARIO WITH R.L

  39. REWARD AND PENALTY SUPERMARIO WITH R.L Reward (+): moving closer to the flag, reaching the goal. Penalty (-): failing to reach the goal, every time step that passes, moving away from the flag.
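     As an illustration only (the environment computes its reward internally), the components listed above could be written roughly as below; the info keys x_pos, time, and flag_get are assumptions about what the environment reports, and the goal bonus magnitude is made up:

         # illustrative reward shaping matching the slide's description, not the library's exact formula
         def shaped_reward(info, prev_info):
             r = 0.0
             r += info['x_pos'] - prev_info['x_pos']   # + when moving toward the flag, - when moving away
             r += info['time'] - prev_info['time']     # - as the in-game clock ticks down
             if info.get('flag_get'):                  # + when the goal flag is reached (bonus size illustrative)
                 r += 15.0
             return r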
  40. STATE, ACTION SUPERMARIO WITH R.L
     env.observation_space.shape  # (240, 256, 3) [height, width, channel]
     env.action_space.n           # 256
     SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
  41. OBSERVATION SPACE SUPERMARIO WITH R.L
     env.action_space.n           # 256
     SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
     env.observation_space.shape  # (240, 256, 3) [height, width, channel]
  42. ACTION SPACE SUPERMARIO WITH R.L
     env.action_space.n           # 256
     SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
     env.observation_space.shape  # (240, 256, 3) [height, width, channel]
  43. ACTION AFTER WRAPPER SUPERMARIO WITH R.L
     env.action_space.n           # 256
     SIMPLE_MOVEMENT = [['nop'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B'], ['A'], ['left']]
     import gym_super_mario_bros
     env = gym_super_mario_bros.make('SuperMarioBros-v0')
     env.observation_space.shape  # (240, 256, 3) [height, width, channel]
     from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
     env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
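     A sketch that makes the before/after of the wrapper explicit (SIMPLE_MOVEMENT is copied from the slides; the printed numbers follow the slides' values):

         from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
         import gym_super_mario_bros

         # the seven movements from the slides
         SIMPLE_MOVEMENT = [
             ['nop'], ['right'], ['right', 'A'], ['right', 'B'],
             ['right', 'A', 'B'], ['A'], ['left'],
         ]

         env = gym_super_mario_bros.make('SuperMarioBros-v0')
         print(env.action_space.n)            # 256: every raw NES button combination

         env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)
         print(env.action_space.n)            # 7: one integer per entry in SIMPLE_MOVEMENT
         print(env.observation_space.shape)   # (240, 256, 3): RGB frame [height, width, channel]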
  44. EXPLOITATION AND EXPLORATION SUPERMARIO WITH R.L
     def epsilon_greedy(q_value, step):
         if np.random.rand() < epsilon:
             action = np.random.randint(output)   # Exploration
         else:
             action = np.argmax(output)           # Exploitation
     next_state, reward, done, info = env.step(action)
  45. EXPLORATION SUPERMARIO WITH R.L
     def epsilon_greedy(q_value, step):
         if np.random.rand() < epsilon:
             action = np.random.randint(output)   # Exploration
         else:
             action = np.argmax(output)           # Exploitation
     next_state, reward, done, info = env.step(action)
  46. EXPLOITATION SUPERMARIO WITH R.L
     def epsilon_greedy(q_value, step):
         if np.random.rand() < epsilon:
             action = np.random.randint(output)   # Exploration
         else:
             action = np.argmax(output)           # Exploitation
     next_state, reward, done, info = env.step(action)
  47. ENV.STEP( ) SUPERMARIO WITH R.L
     def epsilon_greedy(q_value, step):
         if np.random.rand() < epsilon:
             action = np.random.randint(output)
         else:
             action = np.argmax(output)
     next_state, reward, done, info = env.step(action)
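     A self-contained version of the epsilon-greedy step might look like this (a sketch; n_outputs, the q_values array, and the linear decay schedule are assumptions filled in around the fragments on the slides):

         import numpy as np

         eps_max, eps_min, eps_decay_steps = 1.0, 0.1, 200000   # schedule values from the replay-buffer slides

         def epsilon_greedy(q_values, step, n_outputs):
             # linearly decay epsilon from eps_max down to eps_min over eps_decay_steps
             epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
             if np.random.rand() < epsilon:
                 return np.random.randint(n_outputs)    # Exploration: random action index
             return int(np.argmax(q_values))            # Exploitation: greedy action w.r.t. current Q-values

         # usage inside the interaction loop (q_values would come from the Q-network):
         # action = epsilon_greedy(q_values, step, env.action_space.n)
         # next_state, reward, done, info = env.step(action)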
  48. EXPLORATION RATE AND REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))   # (St, At, Rt+1, St+1)
  49. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
  50. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
  51. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
  52. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
  53. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
  54. REPLAY MEMORY BUFFER SUPERMARIO WITH R.L
     eps_max = 1
     eps_min = 0.1
     eps_decay_steps = 200000
     next_state, reward, done, info = env.step(action)
     memory = deque([], maxlen=1000000)
     memory.append((state, action, reward, next_state))
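     A sketch of how these pieces combine, with a sampling step added for the training batch (batch_size, the terminal flag in the stored tuple, and the sampling helper are assumptions; the slides only show the buffer and the epsilon constants):

         import random
         from collections import deque

         memory = deque([], maxlen=1000000)   # replay memory buffer

         # after each env.step, store the transition (S_t, A_t, R_{t+1}, S_{t+1});
         # a done flag is added here (assumption) because the training target needs it
         # next_state, reward, done, info = env.step(action)
         # memory.append((state, action, reward, next_state, done))

         def sample_batch(batch_size=32):
             # uniform random sampling breaks the correlation between consecutive frames
             batch = random.sample(memory, batch_size)
             states, actions, rewards, next_states, dones = zip(*batch)
             return states, actions, rewards, next_states, dones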
  55. MINIMIZE LOSS SUPERMARIO WITH R.L
     Squared TD error over stored transitions (S_t, A_t, R_{t+1}, S_{t+1}):
     \left(R_{t+1} + \gamma_{t+1} \max_{a'} q_\theta(S_{t+1}, a') - q_\theta(S_t, A_t)\right)^2
     import tensorflow as tf
     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
  56. MINIMIZE LOSS SUPERMARIO WITH R.L
     Squared TD error over stored transitions (S_t, A_t, R_{t+1}, S_{t+1}):
     \left(R_{t+1} + \gamma_{t+1} \max_{a'} q_\theta(S_{t+1}, a') - q_\theta(S_t, A_t)\right)^2
     import tensorflow as tf
     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
  57. MINIMIZE LOSS SUPERMARIO WITH R.L
     Squared TD error over stored transitions (S_t, A_t, R_{t+1}, S_{t+1}):
     \left(R_{t+1} + \gamma_{t+1} \max_{a'} q_\theta(S_{t+1}, a') - q_\theta(S_t, A_t)\right)^2
     import tensorflow as tf
     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
  58. MINIMIZE LOSS SUPERMARIO WITH R.L
     Squared TD error over stored transitions (S_t, A_t, R_{t+1}, S_{t+1}):
     \left(R_{t+1} + \gamma_{t+1} \max_{a'} q_\theta(S_{t+1}, a') - q_\theta(S_t, A_t)\right)^2
     import tensorflow as tf
     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
  59. MINIMIZE LOSS SUPERMARIO WITH R.L
     Squared TD error over stored transitions (S_t, A_t, R_{t+1}, S_{t+1}):
     \left(R_{t+1} + \gamma_{t+1} \max_{a'} q_\theta(S_{t+1}, a') - q_\theta(S_t, A_t)\right)^2
     import tensorflow as tf
     loss = tf.reduce_mean(tf.square(y - Q_action))
     optimizer = tf.train.AdamOptimizer(learning_rate)
     training_op = optimizer.minimize(loss)
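     Filled out a little, a TF1-style version of the same loss might look like this (a sketch; the stand-in network, the placeholder shapes, and the learning-rate value are assumptions, and the talk's actual network is a conv net over the game frames):

         import tensorflow as tf

         n_actions = 7            # size of SIMPLE_MOVEMENT after the wrapper
         learning_rate = 0.00025  # assumed value, not stated on the slides

         # a stand-in Q-network; preprocessed frames are assumed to be flattened 84x84 inputs
         X_state = tf.placeholder(tf.float32, [None, 84 * 84])
         q_values = tf.layers.dense(X_state, n_actions)                  # q_theta(S_t, .)

         X_action = tf.placeholder(tf.int32, [None])                     # A_t for each sampled transition
         y = tf.placeholder(tf.float32, [None])                          # R_{t+1} + gamma * max_a' q_target(S_{t+1}, a')

         # q_theta(S_t, A_t): value of the action actually taken
         Q_action = tf.reduce_sum(q_values * tf.one_hot(X_action, n_actions), axis=1)

         loss = tf.reduce_mean(tf.square(y - Q_action))                  # squared TD error from the slide
         optimizer = tf.train.AdamOptimizer(learning_rate)
         training_op = optimizer.minimize(loss)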
  60. APPROXIMATE ACTION-VALUE SUPERMARIO WITH R.L

  61. DOUBLE DQN SUPERMARIO WITH R.L: diagram with Env, input s, Q-Network, Action value Q(s, a), action a, reward r, s', and Replay memory storing (St, At, Rt+1, St+1)
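     The diagram corresponds to the Double DQN target, where the online network chooses the next action and the target network evaluates it. A small NumPy sketch (array names and the done-flag handling are assumptions, not from the slides):

         import numpy as np

         def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
             # online network selects the greedy action at S_{t+1} ...
             best_actions = np.argmax(next_q_online, axis=1)
             # ... and the target network evaluates that chosen action
             next_values = next_q_target[np.arange(len(best_actions)), best_actions]
             # dones is a 0/1 array; terminal transitions get no bootstrapped term
             return rewards + gamma * next_values * (1.0 - dones)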
  62. 1000 EPISODE, 3000 EPISODE TRAINING SUPERMARIO WITH R.L: gameplay after 1000 episodes and after 3000 episodes

  63. 5000 EPISODE SUPERMARIO WITH R.L: gameplay after 5000 episodes, with training time measured in days

  64. SUMMARY 1. How Animals Learn 2. How Humans Learn 3.

    Reinforcement Learning 4. SuperMario with Reinforcement Learning REINFORCEMENT LEARNING
  65. OTHER ENVIRONMENTS: OpenAI, DeepMind Lab, StarCraft, Super Mario, Sonic, Minecraft REINFORCEMENT LEARNING

  66. OTHER LEARNING METHODS: DQN, DDQN (tuned), Rainbow DQN (tuned), DDPG REINFORCEMENT LEARNING
  67. CURRICULUM LEARNING: Goal, Wall, Agent, Action, Action, Action REINFORCEMENT LEARNING

  68. IMITATION LEARNING: Imitation Learning, Teacher, Student REINFORCEMENT LEARNING

  69. REINFORCEMENT LEARNING

  70. How about making your own A.I SuperMario?

  71. Github: https://github.com/wonseokjung / Facebook: https://www.facebook.com/wsjung / Blog: https://wonseokjung.github.io Thank you

  72. Question?

  73. *Reference: Terms and Symbols REINFORCEMENT LEARNING
     Time step: t
     Action: a
     Transition function: P(s′, r ∣ s, a)
     Reward: r
     Set of states: S
     Set of actions: A
     Start state: S0
     Discount factor: γ
     Set of rewards: R
     Policy: π
     State: s
  74. REFERENCES
     1. Habituation: The Birth of Intelligence
     2. Law of effect: The Birth of Intelligence, p. 171
     3. Thorndike, E. L. (1905). The elements of psychology. New York: A. G. Seiler.
     4. Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Monographs: General and Applied, 2(4), i-109.
     5. SuperMario environment: https://github.com/Kautenja/gym-super-mario-bros
     6. http://faculty.coe.uh.edu/smcneil/cuin6373/idhistory/thorndike_extra.html