Deep Reinforcement Learning Hands-on with ChainerRL and Minecraft

Materials for the hands-on session "Deep Reinforcement Learning Hands-on with ChainerRL and Minecraft" at DLLAB Engineer Days Day1: Hands-on.


keisuke umezawa

October 06, 2019

Transcript

  1. Chainer Hands-on for Beginners
    chug (Chainer User Group)
    Please use the Twitter hashtags #chug_jp and #dllab.
    Ask questions in #chainer-handson-ms on the chainer-jp Slack. If you have not registered yet: https://bit.ly/chainer-jp-slack
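  2. Requests for your cooperation
    • Please register on the chainer-jp Slack!
    • For this session, we would like to take questions in #chainer-handson-ms on the chainer-jp Slack.
    • Tutors will answer there during the lecture as well.
    • If you have not registered yet: https://bit.ly/join-chainer-jp-slack
    • We will take photos to post on social media!
    • If you do not want to appear in the photos, we would appreciate it if you let us know.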
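  3. Please register on the chainer-jp Slack!
    • For this session, we would like to take questions in #chainer-handson-ms on the chainer-jp Slack.
    • Tutors will answer there during the lecture as well.
    • If you have not registered yet: https://bit.ly/chainer-jp-slack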
  4. Self-introduction
    • Keisuke Umezawa
    • Research and development of financial market prediction models @AlpacaJapan
    • Chainer Evangelist @Preferred Networks
  5. Agenda
    1. Introduction to Chainer/CuPy
    2. Today's hands-on content
      1. About ChainerRL
      2. About MineRL
      3. Setting up the ChainerRL + MineRL environment
      4. More about MineRL
  6. Introduction to Chainer/CuPy

  7. Chainer
    • What is Chainer? (http://chainer.org/)
    • A deep learning framework made by Preferred Networks

  8. Chainer
    • In the same family as frameworks such as Google's TensorFlow

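  9. CuPy
    • The library that handles all GPU computation in Chainer, now an independent project
    • NumPy-compatible API, so CPU code can be moved to the GPU at low cost
    • Runs linear algebra algorithms such as singular value decomposition on the GPU
    • A growing set of examples such as KMeans and Gaussian Mixture Model
    NumPy (CPU):
        import numpy as np
        x = np.random.rand(10)
        W = np.random.rand(10, 5)
        y = np.dot(x, W)
    CuPy (GPU):
        import cupy as cp
        x = cp.random.rand(10)
        W = cp.random.rand(10, 5)
        y = cp.dot(x, W)
    https://github.com/cupy/cupy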
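  10. The growing Chainer family
    Chainer UI, Chainer Chemistry, reinforcement learning, image recognition, visualization, graph structures, large-scale distributed training, Menoh, inference-specialized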
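  11. Recent news (1): Chainer Tutorials released
    • Covers not only Chainer but a wide range, from the basics of mathematics and Python to the basics of machine learning and deep learning, coding, and applications.
    • Every chapter is distributed as a Jupyter notebook that can be run on Colab.
    https://tutorials.chainer.org/ja/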
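  12. Recent news (2): ChainerX released
    • Implements a NumPy-like ndarray with automatic differentiation in C++, which achieves the following:
      • Higher speed
      • Support for a variety of devices
      • Deployment from languages other than Python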
  13. Today's hands-on content

  14. Agenda
    1. Introduction to Chainer/CuPy
    2. Today's hands-on content
      1. About ChainerRL
      2. About MineRL
      3. Setting up the ChainerRL + MineRL environment
      4. More about MineRL
  15. ChainerRL Deep reinforcement learning library built on top of Chainer

  16. ChainerRL
    • A (deep) reinforcement learning framework built on Chainer
    • Also includes reproduction implementations of well-known reinforcement learning algorithms

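  17. ChainerRL
    • The reproduction implementations pursue reproducibility by carefully implementing not only the method itself but also:
      • Fine differences in environment settings
      • Hyperparameters
      • Evaluation protocols and metrics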

  18. Algorithms used in this session
    • DDQN (Double Deep Q Network)
    • Rainbow
    • PPO (Proximal Policy Optimization)
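  19. Deep reinforcement learning in a little more detail (2018/11/25)
    • Deep Q Network (DQN)
    • In reinforcement learning, the value assigned to each action in a given state is called the Q-value, and methods that use it are called Q-learning.
    • The function that represents Q-values is called the Q-function; because it is represented by a deep neural network, the method is called Deep Q Network.
    [Figure: state, Convolutional Neural Network, action values (forward, back, turn right, turn left)]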
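The Q-value idea above can be illustrated with a tabular update, which DQN generalizes by replacing the table with a neural network. This is a minimal sketch, not ChainerRL code: the state names, the learning rate `alpha`, and the discount `gamma` are assumptions chosen for illustration, and the action names follow the figure labels.

```python
from collections import defaultdict

# Action names follow the slide's figure; everything else is illustrative.
ACTIONS = ["forward", "back", "turn_right", "turn_left"]


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.5, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max_a Q(next_state, a)."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Q-table with every entry starting at 0.0 (hypothetical states "s0", "s1").
q = defaultdict(float)
updated = q_learning_update(q, "s0", "forward", 1.0, "s1", done=False)
```

With an all-zero table, the first update moves the entry halfway toward the observed reward because `alpha` is 0.5.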
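  20. Techniques used in DQN
    • Experience Replay
    • Results accumulated in the experience buffer are sampled at random and used as training data.
    Experience (state, action, reward): (image, back, -0.1), (image, turn right, -0.1), (image, forward, 1.0), (image, forward, -0.1)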
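Experience replay as described above can be sketched with a fixed-size buffer and uniform random sampling. This is an illustrative stand-in rather than ChainerRL's own replay buffer; the class name, the capacity, and the `frameN` state labels are assumptions, while the actions and rewards mirror the table on the slide.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up temporal correlation.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
steps = [("back", -0.1), ("turn_right", -0.1), ("forward", 1.0), ("forward", -0.1)]
for t, (action, reward) in enumerate(steps):
    buffer.append(f"frame{t}", action, reward, f"frame{t + 1}", False)

batch = buffer.sample(2, rng=random.Random(0))
```

Because the capacity is 3 and four transitions were stored, the first transition has already been evicted when the batch is drawn.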
  21. Overall flow of reinforcement learning
    [Figure: diagram connecting Experience, Replay, Q-function, environment, state, action, and reward; the Experience table repeats the entries from the previous slide]
  22. Double DQN
    • The Q-function is used both for action selection and for state evaluation; Double DQN is a DQN that trains separate networks for these two roles.
    • Detailed slides:
      • https://www.slideshare.net/juneokumura/dqnrainbow
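The difference between the plain DQN target and the Double DQN target can be sketched as follows. The function names and the example Q-values are assumptions for illustration; the logic follows the usual formulation in which one network selects the next action and the other network evaluates it.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Plain DQN: the same Q-values both select and evaluate the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: online values select the action, target values evaluate it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Illustrative Q-values for three actions in the next state.
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [3.0, 1.5, 0.2]

plain = dqn_target(0.0, next_q_target, gamma=0.5)
double = double_dqn_target(0.0, next_q_online, next_q_target, gamma=0.5)
```

In this example the plain target takes the largest target-network value, while the Double DQN target uses the value of the action preferred by the online network, which is lower here and shows how overestimation is reduced.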
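  23. Rainbow
    • Many deep reinforcement learning techniques have been proposed; this method originates from the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning", which asks what happens when they are combined.
    • Detailed slides:
      • https://www.slideshare.net/juneokumura/dqnrainbow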
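  24. PPO
    • A policy-based model in which a network represents the function that outputs action probabilities from the state.
    • A regularization term is added to the objective function so that the distribution of action probabilities does not change drastically during training.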
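One common way PPO keeps the policy from changing too much is the clipped surrogate objective, sketched below; the PPO paper also describes a KL-penalty variant, which is closer to a literal regularization term. The function name, `clip_eps=0.2`, and the sample probabilities are assumptions for illustration, not values from the slides.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two samples: the new policy moved far from the old one in both cases.
old_log_probs = [math.log(0.5), math.log(0.5)]
new_log_probs = [math.log(0.9), math.log(0.25)]
advantages = [1.0, -1.0]

objective = ppo_clipped_objective(new_log_probs, old_log_probs, advantages)
```

The first sample's ratio of 1.8 is clipped to 1.2, so pushing the probability even higher earns no extra objective; the second sample is penalized at the clipped ratio of 0.8 rather than the actual 0.5.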
  25. Agenda
    1. Introduction to Chainer/CuPy
    2. Today's hands-on content
      1. About ChainerRL
      2. About MineRL
      3. Setting up the ChainerRL + MineRL environment
      4. More about MineRL
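  26. About MineRL
    MalmÖ:
    • A Minecraft environment developed by Microsoft, compatible with OpenAI Gym
    • Tasks are defined by "mission files"
    • Also supports a multi-agent protocol
    MineRL:
    • Builds on MalmÖ and makes it synchronous, stable, and faster
    • Synchronization is especially important: we want to take actions one step at a time!
    • Also provides a dataset for each task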
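  27. About MineRL
    • Obtaining a diamond is a very difficult task.
    • Several intermediate tasks are provided:
      • MineRLTreechop
      • MineRLNavigate
      • MineRLObtainIronPickaxe
      • MineRLObtainDiamond (this is the final task)
    Navigate also has an Extreme version†, and Navigate/Obtain* also have Dense versions‡.
    †: starts in "extreme" terrain
    ‡: "auxiliary" tasks in which rewards are given more densely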
  28. About MineRL: MineRLTreechop
    • Collect 64 logs.
    • Logs are a key resource in Minecraft.
    • Starts in a forest (many trees nearby) with an iron axe in hand.
    • +1 reward for each log obtained
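  29. About MineRL: MineRLNavigate
    • Head to a specified goal location.
    • The most basic skill in Minecraft
    • A "compass" is observable; it points in the direction of the goal.
    • +100 reward for reaching the goal
    • The "Dense" version additionally gives a reward every step according to the distance to the goal (negative reward when moving away).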
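  30. About MineRL: MineRLObtainDiamond
    • Obtain a diamond.
    • One of the most valuable items in Minecraft
    • Many intermediate items are required.
    • +1024 reward for obtaining a diamond
    • (Smaller) rewards for obtaining intermediate items as well
    • By default, the reward is given only the first time each item is obtained; in the "Dense" version it is given every time.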
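The difference between the default and "Dense" reward rules can be sketched as below. This is not MineRL's implementation: only the +1024 diamond reward comes from the slide, while the function name, the reward table, and the value for `log` are hypothetical placeholders.

```python
# Only the +1024 diamond reward is from the slide; the "log" value is a placeholder.
HYPOTHETICAL_REWARDS = {"log": 1.0, "diamond": 1024.0}


def item_reward(item, collected, reward_table, dense=False):
    """Reward for obtaining `item`; `collected` remembers items seen so far."""
    first_time = item not in collected
    collected.add(item)
    value = reward_table.get(item, 0.0)
    # Default rule: pay only on the first pickup. Dense rule: pay every time.
    return value if (dense or first_time) else 0.0


pickups = ["log", "log", "diamond"]
sparse_seen, dense_seen = set(), set()
sparse = [item_reward(i, sparse_seen, HYPOTHETICAL_REWARDS) for i in pickups]
dense = [item_reward(i, dense_seen, HYPOTHETICAL_REWARDS, dense=True) for i in pickups]
```

Under the default rule the second log yields nothing, whereas the dense rule pays for it again, which gives the agent a much more frequent learning signal.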
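  31. About MineRL: MineRLObtainDiamond
    • Because obtaining a diamond requires many intermediate items, it is a very difficult task.
    • Obtain wood → obtain cobblestone → craft a furnace → obtain iron ore → smelt iron ingots → craft an iron pickaxe → obtain a diamond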

  32. About MineRL: details of the other environments
    • Please see the official documentation.

  33. Agenda
    1. Introduction to Chainer/CuPy
    2. Today's hands-on content
      1. About ChainerRL
      2. About MineRL
      3. Setting up the ChainerRL + MineRL environment
      4. More about MineRL
  34. Setup instructions https://bit.ly/2Vh8eG6

  35. Agenda
    1. Introduction to Chainer/CuPy
    2. Today's hands-on content
      1. About ChainerRL
      2. About MineRL
      3. Setting up the ChainerRL + MineRL environment
      4. More about MineRL
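  36. The MineRL Competition
    The challenge:
    • Reinforcement learning research has advanced greatly in recent years, but the number of samples required has grown.
      • Research is difficult for anyone other than a few giant AI companies.
      • Real-world application is difficult.
    MineRL Competition:
    • Compete to develop sample-efficient RL systems.
    • However, human demonstration data (a dataset) may be used.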
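  37. More about MineRL: the dataset
    • Demonstrations (a dataset) are provided for each task.
      • Play data recorded by humans
      • Roughly 100 to 250 demonstrations per task
    • The number of interactions with the environment is limited, but the dataset may be used as many times as you like (as long as time permits).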
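  38. More about MineRL: actions
    • Actions are defined not as single key inputs but as meaningful "action" units.
      • attack, camera, forward, craft, place, …
    • Actions can be executed simultaneously (defined as a gym Dict space), but note that some actions contradict each other (such as forward and back).
    • In the auxiliary tasks, some actions may be unavailable.
      • For example, in Treechop you cannot craft or place (they are not needed).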
  39. More about MineRL: observations
    • observation:
      • pov (first-person view image)
        • numpy array of shape 64x64x3 (uint8)
      • inventory (items held)
        • dirt, log, iron_ore, and so on
      • equipped_items (tool currently equipped)
        • wooden_axe, iron_pickaxe, and so on
      • compassAngle (indicates the direction of the goal)
        • Available only in Navigate*
    • As with actions, the available observations differ between auxiliary tasks (pov is always present).
  40. More about MineRL
    >>> import gym
    >>> import minerl  # takes a little while (Minecraft starts up in the background)
    >>> env = gym.make('MineRLObtainDiamond-v0')
    Basically compliant with the Gym API, so most existing RL tools work.
  41. More about MineRL
    >>> obs, info = env.reset()  # info is returned as well; slightly different from the Gym API
    >>> obs
    {'equipped_items': {'mainhand': {'damage': 0, 'maxDamage': 0, 'type': 0}},
     'inventory': {'coal': 0, 'cobblestone': 0, ... },
     'pov': array([[[ 0, 0, 0], [ 16, 32, 9], ..., [ 75, 91, 118]], dtype=uint8)}
    >>> info
    {}
  42. More about MineRL
    >>> action = env.action_space.sample()  # get an arbitrary action
    >>> action
    OrderedDict([('attack', 0), ('back', 1), ('camera', array([ 39.44639 , -77.577675], dtype=float32)), ('craft', 3), ... ('nearbySmelt', 0), ('place', 3), ('right', 1), ('sneak', 0), ('sprint', 1)])
  43. More about MineRL
    >>> obs, reward, done, info = env.step(action)
    >>> obs
    (omitted)
    >>> reward
    0.0
    >>> done
    False
    >>> info
    {}
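Putting the `reset()` and `step()` calls above together gives the usual episode loop. The sketch below runs against a small stand-in class so that it works without Minecraft; `StubEnv`, `run_episode`, and the reward rule are hypothetical, and only the return shapes of `reset()` and `step()` follow the slides.

```python
class StubEnv:
    """Stand-in with the reset()/step() shapes shown in the slides; not MineRL."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {"pov": None}, {}  # (obs, info), as in the slide

    def step(self, action):
        self.t += 1
        reward = 1.0 if action.get("attack") else 0.0  # made-up reward rule
        done = self.t >= self.episode_length
        return {"pov": None}, reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the summed reward."""
    obs, info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


total = run_episode(StubEnv(), lambda obs: {"attack": 1})
```

With a real environment, `StubEnv()` would be replaced by the object returned from `gym.make(...)`, and the lambda by an agent's action-selection function.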
  44. MineRL tutorial with ChainerRL: baselines
    • Baselines are provided using ChainerRL, developed by PFN.
      • https://github.com/minerllabs/quick_start/tree/master/chainerrl_baselines
    • As a "starter kit", they offer a starting point for:
      • How to run the environment (and the dataset)
      • The naive approach to try first
    https://github.com/minerllabs/baselines/tree/master/general/chainerrl
  45. MineRL tutorial with ChainerRL: the naive approach
    • Solve Treechop/Navigate naively.
      • Ignore the dataset.
      • Assume strong priors on the action/observation space.
    [Figure: training reward vs. training episode for Treechop, Navigate, and NavigateDense]
  46. MineRL tutorial with ChainerRL: the naive approach
    • Treechop and NavigateDense can be cleared.
    • Navigate cannot yet be cleared.
      → The Obtain* tasks are even harder than this!
    [Figure: results for Treechop, Navigate, and NavigateDense]
  47. MineRL tutorial with ChainerRL: the action space
    • A strong prior is assumed on the action space.
      → Note that it cannot be reused for other tasks as is.
    • Treechop
      • Converted to a discrete action with 5 choices:
        i. {'forward': 1, 'jump': 0, 'camera': [0, 0]}
        ii. {'forward': 0, 'jump': 0, 'camera': [0, 0]}
        iii. {'forward': 1, 'jump': 1, 'camera': [0, 0]}
        iv. {'forward': 1, 'jump': 0, 'camera': [0, -10]}
        v. {'forward': 1, 'jump': 0, 'camera': [0, 10]}
      attack is always on; everything else is always off.
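The Treechop discretization listed above can be sketched as a lookup from a discrete index to a Dict-style action. This is not the baseline's actual wrapper code: the function name `to_env_action` and the `NOOP` template are assumptions, and `NOOP` lists only a subset of keys for illustration.

```python
# The five choices are copied from the slide.
TREECHOP_ACTIONS = [
    {"forward": 1, "jump": 0, "camera": [0, 0]},
    {"forward": 0, "jump": 0, "camera": [0, 0]},
    {"forward": 1, "jump": 1, "camera": [0, 0]},
    {"forward": 1, "jump": 0, "camera": [0, -10]},
    {"forward": 1, "jump": 0, "camera": [0, 10]},
]

# Hypothetical "everything off" template; a real environment defines more keys.
NOOP = {"attack": 0, "back": 0, "camera": [0, 0], "forward": 0,
        "jump": 0, "right": 0, "sneak": 0, "sprint": 0}


def to_env_action(index, noop=NOOP):
    """Map a discrete index (0-4) to a full Dict-style action."""
    action = dict(noop)                          # start with everything off
    action["attack"] = 1                         # attack is always on
    action.update(TREECHOP_ACTIONS[index])
    action["camera"] = list(action["camera"])    # copy to avoid shared state
    return action


turn_action = to_env_action(3)
```

A DQN-style agent can then output an integer from 0 to 4, which keeps the network's output layer small at the cost of ruling out every other key combination.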
  48. MineRL tutorial with ChainerRL: the action space
    • Navigate/NavigateDense
      • Converted to a discrete action with 4 choices:
        i. {'jump': 0, 'camera': [0, 0]}
        ii. {'jump': 1, 'camera': [0, 0]}
        iii. {'jump': 0, 'camera': [0, -10]}
        iv. {'jump': 0, 'camera': [0, 10]}
      forward, sprint, and attack are always on; everything else is always off.
  49. MineRL tutorial with ChainerRL
    • DDDQN: Double Dueling DQN (just our own abbreviation)
    • Rainbow: a method that packs in many DQN variants
    • PPO: Proximal Policy Optimization
      • Handles actions defined in continuous spaces naturally, but the baseline uses discretized actions.
  50. MineRL tutorial with ChainerRL
    • With the prior knowledge above and roughly tuned hyperparameters, the results are as shown in the figure (shown again).
    [Figure: training reward vs. training episode for Treechop, Navigate, and NavigateDense]
  51. MineRL tutorial with ChainerRL: MineRLTreechop
    Because of the action-space constraints (in particular, the camera cannot be moved up or down), the behavior is not optimal from a human point of view, but within those limits the agent arguably "clears the task".
  52. To learn more
    • Imitation learning algorithms were also added to the baselines very recently:
      • behavioral cloning
      • GAIL
      • DQfD
    • A detailed explanation is in these slides:
      • https://www.slideshare.net/pfi/minerl-competition-tutorial-with-chainerrl-156927429
  53. Introducing chug (Chainer User Group)

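  54. Introducing the Chainer User Group
    • Slack chainer-jp
    • Twitter @chug_jp
    • Activities
      • Organizing meetups and hands-on sessions
      • Expanding information and documentation about Chainer
      • Creating web tutorials
    • We are looking for people to join our activities! Please come to #chug-jp-management on Slack.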
  55. Survey https://forms.gle/oNniPXrDtQv8BiVYA
