
Deep Reinforcement Learning Hands-on with ChainerRL and Minecraft


Slides for the "Deep Reinforcement Learning Hands-on with ChainerRL and Minecraft" session at DLLAB Engineer Days Day 1: Hands-on.

keisuke umezawa

October 06, 2019

Transcript

  1. Chainer Hands-on for Beginners
    chug (Chainer User Group). Please use the Twitter hashtags #chug_jp and #dllab.
    Questions: #chainer-handson-ms on Slack chainer-jp. Not registered yet? https://bit.ly/chainer-jp-slack
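  2. Requests for your cooperation
    • Please register for Slack chainer-jp!
    • This time, we would like to take questions in #chainer-handson-ms on Slack chainer-jp.
    • Tutors will answer during the lecture as well.
    • If you have not registered yet: https://bit.ly/join-chainer-jp-slack
    • We will take photos for posting on social media!
    • If you would prefer not to appear in photos, please let us know.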
  3. Please register for Slack chainer-jp!
    • This time, we would like to take questions in #chainer-handson-ms on Slack chainer-jp.
    • Tutors will answer during the lecture as well.
    • If you have not registered yet: https://bit.ly/chainer-jp-slack
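  4. CuPy
    • The library that handles all GPU computation in Chainer, now spun off as an independent library
    • Its NumPy-compatible API makes it cheap to move CPU code to the GPU
    • Runs linear-algebra algorithms such as singular value decomposition on the GPU
    • A growing set of examples such as KMeans and Gaussian Mixture Model
    NumPy (CPU):
        import numpy as np
        x = np.random.rand(10)
        W = np.random.rand(10, 5)
        y = np.dot(x, W)
    CuPy (GPU):
        import cupy as cp
        x = cp.random.rand(10)
        W = cp.random.rand(10, 5)
        y = cp.dot(x, W)
    https://github.com/cupy/cupy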
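Because the APIs match, one common pattern is to write a function once and let `cupy.get_array_module` pick NumPy or CuPy based on its arguments. The sketch below assumes NumPy is installed and treats CuPy as optional; the function name `linear` is hypothetical.

```python
import numpy as np

try:
    import cupy as cp  # optional: only usable with a CUDA-capable setup
except ImportError:
    cp = None


def linear(x, W):
    """Compute y = x @ W with whichever library owns x (NumPy or CuPy)."""
    # cupy.get_array_module returns the cupy module for CuPy arrays, numpy otherwise
    xp = cp.get_array_module(x) if cp is not None else np
    return xp.dot(x, W)


x = np.random.rand(10)
W = np.random.rand(10, 5)
y = linear(x, W)  # NumPy inputs, so this runs on the CPU
print(y.shape)  # (5,)
```

Passing `cp.random.rand(...)` arrays instead would run the same function on the GPU without code changes.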
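  5. Deep reinforcement learning in a bit more detail (2018/11/25)
    • Deep Q-Network (DQN)
    • In reinforcement learning, the value of each action in a given state is called its Q-value, and methods that learn it are called Q-learning.
    • The function that represents Q-values is called the Q-function; representing it with a deep neural network gives the Deep Q-Network.
    [Figure: state → Convolutional Neural Network → action values for move forward, move back, turn right, turn left]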
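To make Q-learning concrete, here is a NumPy sketch of the usual one-step target y = r + γ · max_a' Q(s', a') and the squared-error loss a DQN minimizes. The helper names are hypothetical, the batch layout (B transitions, 4 actions such as forward/back/turn right/turn left) is an assumption, and in DQN the `next_q_values` would normally come from a separate target network.

```python
import numpy as np


def q_learning_targets(rewards, next_q_values, dones, gamma=0.99):
    """TD targets y = r + gamma * max_a' Q(s', a') for a batch of transitions.

    rewards: (B,), next_q_values: (B, n_actions), dones: (B,) with 0/1 entries.
    Terminal transitions (done == 1) do not bootstrap from the next state.
    """
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q


def dqn_loss(q_values, actions, targets):
    """Mean squared error between Q(s, a) for the taken actions and the targets."""
    taken = q_values[np.arange(len(actions)), actions]
    return np.mean((taken - targets) ** 2)


targets = q_learning_targets(
    rewards=np.array([1.0, -0.1]),
    next_q_values=np.array([[0.5, 0.2, 0.1, 0.0], [0.3, 0.9, 0.2, 0.1]]),
    dones=np.array([0.0, 1.0]),
    gamma=0.9,
)
print(targets)  # approximately [1.45, -0.1]
```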
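  6. Tricks used in DQN
    • Experience Replay
    • Results accumulated as experience are sampled at random and used as training data
    [Figure: experience buffer of (state, action, reward) tuples such as (…, back, -0.1), (…, turn right, -0.1), (…, forward, 1.0), (…, forward, -0.1), sampled for replay]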
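Experience replay can be sketched with the standard library alone: a bounded FIFO of transitions that is sampled uniformly at random. The class and field names are hypothetical; ChainerRL ships its own replay buffers, and this is not their API.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size FIFO store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


random.seed(0)
buf = ReplayBuffer(capacity=3)
for step in range(4):
    buf.append(step, 0, -0.1, step + 1, False)
batch = buf.sample(2)
```

After four appends to a buffer of capacity 3, the transition from step 0 has been evicted.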
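  7. The overall flow of reinforcement learning
    [Figure: loop between the environment and the Q-function through state, action, and reward; experience entries such as (back, -0.1), (turn right, -0.1), (forward, 1.0), (forward, -0.1) are stored and replayed]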
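The loop in the figure can be sketched with a tabular Q-function and ε-greedy action selection. `ToyCorridor` is a hypothetical stand-in environment whose rewards mimic the figure (-0.1 per step, 1.0 at the goal); its `step` returns a simplified 3-tuple rather than Gym's 4-tuple, and transitions are collected in a plain list rather than a replay buffer.

```python
import random

ACTIONS = ["forward", "back", "turn_right", "turn_left"]


class ToyCorridor:
    """Hypothetical environment: move forward from cell 0 to cell `length`."""

    def __init__(self, length=3):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the state is just the agent's cell index

    def step(self, action):
        name = ACTIONS[action]
        if name == "forward":
            self.pos += 1
        elif name == "back":
            self.pos = max(0, self.pos - 1)
        # Turning does not change the cell in this toy environment
        done = self.pos >= self.length
        reward = 1.0 if done else -0.1
        return self.pos, reward, done


def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise pick argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_episode(env, q_table, epsilon=0.1, max_steps=100):
    """Environment -> state -> Q-function -> action -> reward, collecting experience."""
    experience = []
    state = env.reset()
    for _ in range(max_steps):
        q_values = q_table.setdefault(state, [0.0] * len(ACTIONS))
        action = epsilon_greedy(q_values, epsilon)
        next_state, reward, done = env.step(action)
        experience.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return experience


q_table = {s: [1.0, 0.0, 0.0, 0.0] for s in range(3)}  # prefers "forward"
episode = run_episode(ToyCorridor(), q_table, epsilon=0.0)
```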
  8. Rainbow
    • Many deep reinforcement learning methods have been proposed; this builds on the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning", which studies what happens when they are combined.
    • Detailed slides: https://www.slideshare.net/juneokumura/dqnrainbow
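  9. About MineRL
    • Malmö (developed by Microsoft): a Minecraft environment
      • Tasks are defined by "mission files"
      • Also supports a multi-agent protocol
    • MineRL: an OpenAI Gym-compatible environment built on Malmö, making it synchronous, stable, and faster
      • Synchronization matters especially: we want to act one step at a time!
      • Also provides a dataset for each task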
  10. About MineRL
    • Obtaining a diamond is a very difficult task
    • Several intermediate tasks are provided:
      • MineRLTreechop
      • MineRLNavigate
      • MineRLObtainIronPickaxe
      • MineRLObtainDiamond (the final task)
    • "Auxiliary" tasks: Navigate also comes in an Extreme version†, and Navigate/Obtain* also come in a Dense version‡
      †: starts in "extreme" terrain
      ‡: rewards are given more densely
  11. About MineRL: MineRLTreechop
    • Collect 64 logs
    • Logs are a key resource in Minecraft
    • Starts in a forest (many trees nearby) holding an iron axe
    • +1 reward for each log obtained
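  12. About MineRL: MineRLNavigate
    • Head to a specified goal location
    • The most basic skill in Minecraft
    • A "compass" is observable; it points toward the goal
    • +100 reward on reaching the goal
    • The "Dense" version additionally gives a reward every step according to the distance to the goal (a negative reward when moving away)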
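The difference between the sparse and Dense rewards can be sketched as follows. The +100 goal bonus comes from the slide; the shaping term (decrease in Euclidean distance to the goal per step) and the hypothetical `goal_radius` parameter are assumptions for illustration, not MineRL's exact definition.

```python
import math

GOAL_REWARD = 100.0  # sparse bonus for reaching the goal, as stated on the slide


def navigate_reward(prev_pos, pos, goal, dense=False, goal_radius=1.0):
    """Hypothetical Navigate-style reward.

    Sparse: +100 only when the agent is within goal_radius of the goal.
    Dense (assumed shaping): also add the per-step decrease in distance to the goal,
    so moving away yields a negative contribution.
    """
    dist = math.dist(pos, goal)
    reward = GOAL_REWARD if dist <= goal_radius else 0.0
    if dense:
        reward += math.dist(prev_pos, goal) - dist
    return reward


print(navigate_reward((1, 0), (0, 0), (5, 0), dense=True))  # -1.0: moved away
```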
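  13. About MineRL: MineRLObtainDiamond
    • Obtain a diamond
    • One of the most valuable items in Minecraft
    • Many intermediate items are required
    • +1024 reward for obtaining a diamond
    • Intermediate items also give (smaller) rewards
    • By default, a reward is given only the first time each item is obtained; in the "Dense" version it is given every time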
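The "first time only" versus "every time" rule can be sketched as a small stateful rewarder. Only the diamond value (+1024) comes from the slide; the `log` value and the `ItemRewarder` class are hypothetical placeholders.

```python
# Illustrative table: only "diamond" (+1024) is taken from the slide.
ITEM_REWARDS = {"log": 1.0, "diamond": 1024.0}


class ItemRewarder:
    """Hypothetical reward rule: default pays once per item, Dense pays every time."""

    def __init__(self, item_rewards, dense=False):
        self.item_rewards = item_rewards
        self.dense = dense
        self.rewarded = set()  # items that have already paid out once

    def __call__(self, item):
        if item not in self.item_rewards:
            return 0.0  # items outside the table give nothing
        if not self.dense and item in self.rewarded:
            return 0.0  # default mode: repeat acquisitions are not rewarded
        self.rewarded.add(item)
        return self.item_rewards[item]


default = ItemRewarder(ITEM_REWARDS)
dense = ItemRewarder(ITEM_REWARDS, dense=True)
```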
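  14. The MineRL Competition
    Problem:
    • Reinforcement learning research has advanced greatly in recent years, but the number of samples required keeps growing
      • This makes research difficult for anyone outside a few huge AI companies
      • It also makes real-world applications difficult
    MineRL Competition:
    • Compete to develop sample-efficient RL systems
    • However, human demonstration data (a dataset) may be used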
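  15. More about MineRL: the dataset
    • Demonstrations (a dataset) are provided for each task
      • Play data recorded by humans
      • Roughly 100 to 250 demonstrations per task
    • The number of interactions with the environment is limited, but the dataset may be used as many times as you like (within the time limit)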
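  16. More about MineRL: actions
    • Actions are defined as meaningful units of behavior rather than individual key presses
      • attack, camera, forward, craft, place, …
    • Actions can be executed simultaneously (defined as a gym Dict space), but note that some actions conflict (e.g. forward and back)
    • Some actions are unavailable in the auxiliary tasks
      • For example, craft and place are not available in Treechop (they are not needed)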
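One way to handle conflicting keys in a dict action is to post-process it before sending it to the environment. The pair list and the policy of releasing both keys are our own choices for illustration, not MineRL behavior; `resolve_conflicts` is a hypothetical helper.

```python
# Hypothetical list of mutually exclusive binary keys.
CONFLICTING_PAIRS = [("forward", "back"), ("left", "right")]


def resolve_conflicts(action):
    """Return a copy of a dict action where no conflicting pair is pressed together.

    Assumed policy: if both keys of a pair are on, release both.
    """
    fixed = dict(action)
    for a, b in CONFLICTING_PAIRS:
        if fixed.get(a) == 1 and fixed.get(b) == 1:
            fixed[a] = fixed[b] = 0
    return fixed


print(resolve_conflicts({"forward": 1, "back": 1, "attack": 1}))
# {'forward': 0, 'back': 0, 'attack': 1}
```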
  17. More about MineRL: observations
    • observation:
      • pov (first-person view image)
        • a 64x64x3 (uint8) numpy array
      • inventory (items held)
        • dirt, log, iron_ore, etc.
      • equipped_items (tools currently equipped)
        • wooden_axe, iron_pickaxe, etc.
      • compassAngle (direction to the goal)
        • available only in Navigate*
    • As with actions, the available observations differ between auxiliary tasks (pov is always present)
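Before feeding `pov` to a convolutional network, it is typically converted to floating point and reordered to channel-first layout, which is what Chainer's convolution links expect. A minimal sketch; the function name and the [0, 1] scaling are assumptions.

```python
import numpy as np


def preprocess_pov(pov):
    """Turn a 64x64x3 uint8 image (H, W, C) into a float32 (C, H, W) array in [0, 1]."""
    if pov.shape != (64, 64, 3) or pov.dtype != np.uint8:
        raise ValueError(f"unexpected pov: shape={pov.shape}, dtype={pov.dtype}")
    return pov.transpose(2, 0, 1).astype(np.float32) / 255.0


# Usage: add a batch axis before passing the result to a convolutional network.
pov = np.zeros((64, 64, 3), dtype=np.uint8)
batch = preprocess_pov(pov)[np.newaxis]
print(batch.shape)  # (1, 3, 64, 64)
```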
  18. More about MineRL
    >>> import gym
    >>> import minerl  # takes a while (Minecraft launches in the background)
    >>> env = gym.make('MineRLObtainDiamond-v0')
    • Basically follows the Gym API, so most existing RL tools work with it
  19. More about MineRL
    >>> obs, info = env.reset()  # info is returned as well; this differs slightly from the Gym API
    >>> obs
    {'equipped_items': {'mainhand': {'damage': 0, 'maxDamage': 0, 'type': 0}}, 'inventory': {'coal': 0, 'cobblestone': 0, ... }, 'pov': array([[[ 0, 0, 0], [ 16, 32, 9], ..., [ 75, 91, 118]], dtype=uint8)}
    >>> info
    {}
  20. More about MineRL
    >>> action = env.action_space.sample()  # get a random action
    >>> action
    OrderedDict([('attack', 0), ('back', 1), ('camera', array([ 39.44639 , -77.577675], dtype=float32)), ('craft', 3), ... ('nearbySmelt', 0), ('place', 3), ('right', 1), ('sneak', 0), ('sprint', 1)])
  21. More about MineRL
    >>> obs, reward, done, info = env.step(action)
    >>> obs
    (omitted)
    >>> reward
    0.0
    >>> done
    False
    >>> info
    {}
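Putting slides 18 to 21 together, a random-policy episode loop looks roughly like this. `DummyEnv` is a hypothetical stand-in with the same call signatures so the sketch runs without Minecraft; `reset` handling accepts either `(obs, info)`, as on the slide, or a bare `obs`, since this differs between versions.

```python
def run_random_episode(env, max_steps=1000):
    """Roll out one episode with random actions and return the total reward."""
    result = env.reset()
    # The slide shows reset() returning (obs, info); plain Gym returns just obs
    obs = result[0] if isinstance(result, tuple) else result
    total_reward = 0.0
    for _ in range(max_steps):
        action = env.action_space.sample()  # a real agent would compute this from obs
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


class DummyEnv:
    """Hypothetical stand-in environment: reward 1.0 per step, ends after a fixed length."""

    class _Space:
        def sample(self):
            return {"forward": 1}

    action_space = _Space()

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {"pov": None}, {}

    def step(self, action):
        self.t += 1
        return {"pov": None}, 1.0, self.t >= self.episode_length, {}


print(run_random_episode(DummyEnv()))  # 5.0
```

With MineRL installed, an environment created by `gym.make(...)` as on slide 18 could be passed in place of `DummyEnv()`.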
  22. MineRL tutorial with ChainerRL: baselines
    • Baselines built with ChainerRL (developed by PFN) are provided
      • https://github.com/minerllabs/quick_start/tree/master/chainerrl_baselines
    • As a "starter kit", they give a foothold on things such as
      • how to run the environment (and the dataset)
      • the naive approaches to try first
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
  23. MineRL tutorial with ChainerRL: a naive approach
    • Solve Treechop/Navigate naively
      • Ignore the dataset
      • Assume strong priors on the action/observation space
    [Figure: training reward vs. training episode for Treechop, Navigate, and NavigateDense]
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
  24. MineRL tutorial with ChainerRL: a naive approach
    • Treechop and NavigateDense can be cleared
    • Navigate is not solved
      → The Obtain* tasks are harder still!
    [Figure: results for Treechop, Navigate, and NavigateDense]
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
  25. MineRL tutorial with ChainerRL: the action space
    • Strong priors are assumed on the action space
      → Note that, as is, this cannot be transferred to other tasks
    • Treechop
      • Converted to 5 discrete actions
        i. {'forward': 1, 'jump': 0, 'camera': [0, 0]}
        ii. {'forward': 0, 'jump': 0, 'camera': [0, 0]}
        iii. {'forward': 1, 'jump': 1, 'camera': [0, 0]}
        iv. {'forward': 1, 'jump': 0, 'camera': [0, -10]}
        v. {'forward': 1, 'jump': 0, 'camera': [0, 10]}
      • attack is always on; everything else is always off
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
  26. MineRL tutorial with ChainerRL: the action space
    • Navigate/NavigateDense
      • Converted to 4 discrete actions
        i. {'jump': 0, 'camera': [0, 0]}
        ii. {'jump': 1, 'camera': [0, 0]}
        iii. {'jump': 0, 'camera': [0, -10]}
        iv. {'jump': 0, 'camera': [0, 10]}
      • forward, sprint, and attack are always on; everything else is always off
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
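The two discrete action tables on slides 25 and 26 can be expanded into full dict actions with a small lookup. The tables are transcribed from the slides; `BINARY_KEYS`, the helper name `to_env_action`, and representing `camera` as a float32 array are assumptions, and the actual baseline code may be structured differently.

```python
import numpy as np

# Discrete action tables transcribed from the slides.
TREECHOP_ACTIONS = [
    {"forward": 1, "jump": 0, "camera": [0, 0]},
    {"forward": 0, "jump": 0, "camera": [0, 0]},
    {"forward": 1, "jump": 1, "camera": [0, 0]},
    {"forward": 1, "jump": 0, "camera": [0, -10]},
    {"forward": 1, "jump": 0, "camera": [0, 10]},
]
TREECHOP_ALWAYS_ON = ("attack",)

NAVIGATE_ACTIONS = [
    {"jump": 0, "camera": [0, 0]},
    {"jump": 1, "camera": [0, 0]},
    {"jump": 0, "camera": [0, -10]},
    {"jump": 0, "camera": [0, 10]},
]
NAVIGATE_ALWAYS_ON = ("forward", "sprint", "attack")

# Assumed set of binary keys; a real wrapper would read them from env.action_space.
BINARY_KEYS = ("attack", "back", "forward", "jump", "left", "right", "sneak", "sprint")


def to_env_action(index, table, always_on):
    """Map a discrete action index to a full dict action."""
    action = {key: 0 for key in BINARY_KEYS}  # everything else stays off
    for key in always_on:
        action[key] = 1
    chosen = table[index]
    action.update(chosen)
    action["camera"] = np.asarray(chosen["camera"], dtype=np.float32)
    return action


treechop_action = to_env_action(3, TREECHOP_ACTIONS, TREECHOP_ALWAYS_ON)
navigate_action = to_env_action(1, NAVIGATE_ACTIONS, NAVIGATE_ALWAYS_ON)
```

Keeping the tables as data makes the prior explicit, which also shows why it does not transfer: another task would need a new table.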
  27. MineRL tutorial with ChainerRL: algorithms
    • DDDQN: Double Dueling DQN (just a nickname we made up)
    • Rainbow: a method that packs in many DQN variants
    • PPO: Proximal Policy Optimization
      • Handles actions defined in a continuous space naturally, but the baseline uses discretized actions
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
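The two ingredients of "Double Dueling DQN" fit in a few lines of NumPy: the dueling aggregation of a state value and action advantages, and the Double DQN target, which selects the next action with the online network but evaluates it with the target network. ChainerRL provides its own implementations; this is only an illustration with hypothetical function names.

```python
import numpy as np


def dueling_q(value, advantage):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').

    value: (B, 1), advantage: (B, n_actions).
    """
    return value + advantage - advantage.mean(axis=1, keepdims=True)


def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Choose a' with the online net, evaluate it with the target net."""
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated


q = dueling_q(np.array([[1.0]]), np.array([[1.0, 3.0]]))  # [[0.0, 2.0]]
y = double_dqn_targets(
    rewards=np.array([0.0]),
    dones=np.array([0.0]),
    next_q_online=np.array([[1.0, 2.0]]),  # online net prefers action 1
    next_q_target=np.array([[5.0, 3.0]]),  # target net scores action 1 as 3.0
    gamma=0.5,
)
```

Plain DQN would bootstrap from the target network's maximum (5.0 here); Double DQN uses 3.0, which reduces overestimation.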
  28. MineRL tutorial with ChainerRL: results
    • With the prior knowledge above and some rough hyperparameter tuning, we obtained the results below (shown again)
    [Figure: training reward vs. training episode for Treechop, Navigate, and NavigateDense]
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
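  29. MineRL tutorial with ChainerRL
    Because of the action-space constraints (in particular, the camera cannot move up or down), the behavior is not what a human would call optimal, but within that range, is it "clearing the task"?
    [Figure: MineRLTreechop]
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl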
  30. Learning more…
    • Imitation learning algorithms were recently added to the baselines
      • behavioral cloning
      • GAIL
      • DQfD
    • For a detailed explanation, see these slides:
      • https://www.slideshare.net/pfi/minerl-competition-tutorial-with-chainerrl-156927429
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
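  31. About the Chainer User Group
    • Slack chainer-jp
    • Twitter @chug_jp
    • Activities
      • Hosting meetups and hands-on sessions
      • Expanding information and documentation about Chainer
      • Creating web tutorials
    • We are looking for people to join our activities! Please drop by #chug-jp-management on Slack.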