Slide 1

Slide 1 text

POMO: Policy Optimization with Multiple Optima for Reinforcement Learning Kwon, Yeong-Dae, et al. NeurIPS, 2020, vol.33

Slide 2

Slide 2 text

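Summary
• An end-to-end approximate solution method for combinatorial optimization problems, based on deep reinforcement learning.
• Compared with existing deep reinforcement learning methods, it substantially improves both computation time and accuracy.
• Validated on the traveling salesman problem and other problems.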

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

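Combinatorial optimization
• The field concerned with finding an optimal combination, typified by the traveling salesman problem, the vehicle routing problem, and the knapsack problem.

                      Accuracy        Computation time
Exact methods         Optimal         Slow
Approximate methods   Near-optimal    Fast

https://onl.tw/vzkASMX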
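The trade-off between exact and approximate methods can be illustrated on a tiny traveling salesman instance. The sketch below is only an illustration: the function names, the five sample points, and the choice of nearest-neighbor as the approximate method are assumptions, not taken from the slides.

```python
import itertools
import math

def tour_length(points, order):
    """Length of the closed tour that visits `points` in the given order."""
    return sum(
        math.dist(points[order[k]], points[order[(k + 1) % len(order)]])
        for k in range(len(order))
    )

def exact_tsp(points):
    """Exact method: enumerate every tour starting at node 0 (factorial time)."""
    rest = range(1, len(points))
    best = min(itertools.permutations(rest),
               key=lambda p: tour_length(points, (0, *p)))
    return [0, *best]

def nearest_neighbor_tsp(points):
    """Approximate method: always move to the closest unvisited node."""
    order, unvisited = [0], set(range(1, len(points)))
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, points[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Hypothetical sample instance inside the unit square.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
exact = tour_length(points, exact_tsp(points))
approx = tour_length(points, nearest_neighbor_tsp(points))
print(f"exact: {exact:.3f}  approximate: {approx:.3f}")
```

The exact search is guaranteed to be at least as good as the heuristic, but its running time grows factorially with the number of nodes, which is why approximate methods are attractive.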

Slide 5

Slide 5 text

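Reinforcement Learning (RL)
• RL: a method for solving sequential decision-making problems. The goal is to find a policy that maximizes the cumulative reward.
• The problem setup requires defining a state set, an action set, and a reward function.
https://onl.tw/98fQVvW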

Slide 6

Slide 6 text

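Policy-based REINFORCE
• Policy $\pi(s)$: a function that outputs the action $a$ to take in state $s$.
• $\pi_\theta$: a policy parameterized directly by the parameters $\theta$.
• Policy update rule, where $\alpha$ is the learning rate and $J(\pi_\theta)$ is the objective function:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$$
• Policy gradient, where $\mathbb{E}$ is the expectation, $R_t$ is the return, and $b(s)$ is the baseline:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta \cdot \left( R_t - b(s) \right) \right]$$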
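The update rule and the policy gradient can be tried out on a toy problem. The following is a minimal sketch under stated assumptions: a softmax policy over three actions with fixed, made-up rewards (a bandit, so the return equals the immediate reward) and a running-average baseline. None of these choices come from the slides.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, action, ret, baseline, alpha):
    """theta <- theta + alpha * (R - b) * grad log pi_theta(action).

    For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi(k).
    """
    probs = softmax(theta)
    advantage = ret - baseline
    return [
        t + alpha * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, t in enumerate(theta)
    ]

rng = random.Random(0)
rewards = [0.0, 1.0, 0.2]   # hypothetical fixed reward per action
theta = [0.0, 0.0, 0.0]     # policy parameters (one logit per action)
baseline = 0.0
for _ in range(2000):
    action = rng.choices(range(3), weights=softmax(theta))[0]
    ret = rewards[action]
    theta = reinforce_update(theta, action, ret, baseline, alpha=0.1)
    baseline += 0.05 * (ret - baseline)   # running-average baseline b

print([round(p, 3) for p in softmax(theta)])
```

Subtracting the baseline does not change the expected gradient, but it reduces its variance: actions are reinforced only when they do better than the current average.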

Slide 7

Slide 7 text

Prior work

Slide 8

Slide 8 text

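Pointer Networks (2015)
A network used for combinatorial optimization
• Selects input elements without duplication and generates the output sequence.
• Consists of an encoder, which extracts features from the input node information, and a decoder, which uses the encoder output to produce the route that forms the answer.
• Both the encoder and the decoder use LSTMs.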
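Selecting without duplication is commonly implemented by masking nodes that were already chosen before the softmax. The sketch below shows only that masking idea. It does not implement the LSTM encoder or decoder, and `score_fn` is a hypothetical stand-in for whatever scores a trained decoder would produce.

```python
import math

def masked_softmax(scores, visited):
    """Softmax over scores, with visited nodes forced to probability 0."""
    masked = [-math.inf if k in visited else s for k, s in enumerate(scores)]
    m = max(masked)                            # finite while a node is unvisited
    exps = [math.exp(x - m) for x in masked]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(score_fn, n):
    """Build a permutation of n nodes, one selection per decoding step."""
    tour, visited = [], set()
    for _ in range(n):
        probs = masked_softmax(score_fn(tour), visited)
        nxt = max(range(n), key=probs.__getitem__)
        tour.append(nxt)
        visited.add(nxt)
    return tour

# Hypothetical scores that ignore the partial tour, for illustration only.
fixed_scores = [0.1, 2.0, 0.5, 1.0]
print(greedy_decode(lambda partial_tour: fixed_scores, n=4))
```

Because masked nodes always receive probability 0, the decoded output is a valid permutation of the input nodes regardless of the scores.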

Slide 9

Slide 9 text

Attention Model (2019)
An improved version of Pointer Networks
• Like Pointer Networks, the model uses an encoder and a decoder.
• The LSTMs are dropped in favor of multi-head attention.

Slide 10

Slide 10 text

Method

Slide 11

Slide 11 text

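The idea behind the paper's method
• The first action strongly influences the agent's subsequent actions.
• The method exploits the symmetry that is common in combinatorial optimization problems.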

Slide 12

Slide 12 text

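POMO
• REINFORCE with baseline: uses a typical policy-gradient-based RL algorithm.
• Specifies multiple different starting actions and obtains multiple action sequences (trajectories).
• Does not use a <START> token.
(Figure: conventional method vs. POMO)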
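The multiple-start idea can be sketched for the traveling salesman problem by forcing each of the N nodes to be the first action and rolling out one trajectory per start. In the sketch below the greedy nearest-neighbor rule is an assumed stand-in for sampling from a learned policy, and all names are hypothetical.

```python
import math

def rollout(points, start):
    """One trajectory: begin at `start`, then keep choosing the next node.

    The nearest-neighbor rule stands in for a learned policy here.
    """
    tour = [start]
    unvisited = set(range(len(points))) - {start}
    while unvisited:
        here = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def multi_start_rollouts(points):
    """N trajectories, one per designated first action (start node)."""
    return [rollout(points, start) for start in range(len(points))]

# Hypothetical instance inside the unit square.
points = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.8), (0.2, 0.9)]
for tour in multi_start_rollouts(points):
    print(tour)
```

Since a closed tour can be written starting from any of its nodes, every start node can still reach an optimal tour, which is the symmetry the method relies on.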

Slide 13

Slide 13 text

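POMO
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\boldsymbol{\tau}^i) - b^i(s) \right) \nabla_\theta \log p_\theta(\boldsymbol{\tau}^i \mid s)$$
$$\text{where } p_\theta(\boldsymbol{\tau}^i \mid s) \equiv \prod_{t=2}^{M} p_\theta\left(a_t^i \mid s, a_{1:t-1}^i\right)$$
Trajectories:
$$\boldsymbol{\tau}^i = \left(a_1^i, a_2^i, \ldots, a_M^i\right) \quad \text{for } i = 1, 2, \ldots, N$$
Shared baseline:
$$b^i(s) = b_{\text{shared}}(s) = \frac{1}{N} \sum_{j=1}^{N} R(\boldsymbol{\tau}^j) \quad \text{for } i = 1, 2, \ldots, N$$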
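The shared baseline and the resulting gradient estimate are simple to compute once the N returns are known. The sketch below assumes the gradients of the log-probabilities are already available as plain lists (in practice an autodiff framework would supply them). The example returns are made-up numbers, read as negative tour lengths.

```python
def shared_baseline(returns):
    """b_shared(s): mean return over the N trajectories of one instance."""
    return sum(returns) / len(returns)

def pomo_advantages(returns):
    """R(tau^i) - b_shared(s) for every trajectory i."""
    b = shared_baseline(returns)
    return [r - b for r in returns]

def pomo_gradient(returns, grad_log_probs):
    """(1/N) * sum_i (R(tau^i) - b_shared(s)) * grad log p_theta(tau^i | s).

    grad_log_probs[i] is the gradient vector for trajectory i (assumed given).
    """
    n = len(returns)
    adv = pomo_advantages(returns)
    dim = len(grad_log_probs[0])
    return [
        sum(adv[i] * grad_log_probs[i][k] for i in range(n)) / n
        for k in range(dim)
    ]

# Hypothetical returns for N = 3 trajectories and 2 parameters.
returns = [-3.0, -4.0, -5.0]
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(pomo_advantages(returns))
print(pomo_gradient(returns, grads))
```

Because the baseline is the mean of the same N returns, the advantages always sum to zero: trajectories that beat the average are reinforced and those below it are suppressed.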

Slide 14

Slide 14 text

Pseudocode for the training part

Slide 15

Slide 15 text

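Instance Augmentation: an inference technique
• Inspired by data augmentation in image processing.
• The node coordinates used here lie inside the 1x1 unit square (first quadrant).
(Figure: the instance augmentation used in this work)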
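For coordinates in the unit square, instance augmentation can be sketched with the eight flips and rotations that map the square onto itself. The slide does not list the transformations, so the set below is an assumption based on the usual unit-square symmetries, and `greedy_solve` is a hypothetical stand-in for the trained model.

```python
import math

# The eight symmetries of the unit square (assumed set of augmentations).
TRANSFORMS = [
    lambda x, y: (x, y),
    lambda x, y: (y, x),
    lambda x, y: (x, 1 - y),
    lambda x, y: (y, 1 - x),
    lambda x, y: (1 - x, y),
    lambda x, y: (1 - y, x),
    lambda x, y: (1 - x, 1 - y),
    lambda x, y: (1 - y, 1 - x),
]

def augment(points):
    """Return eight equivalent versions of the same instance."""
    return [[f(x, y) for x, y in points] for f in TRANSFORMS]

def tour_length(points, tour):
    return sum(
        math.dist(points[tour[k]], points[tour[(k + 1) % len(tour)]])
        for k in range(len(tour))
    )

def greedy_solve(points):
    """Stand-in solver: order-dependent heuristic starting at node 0."""
    tour, unvisited = [0], set(range(1, len(points)))
    while unvisited:
        here = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def best_over_augmentations(points, solve):
    """Solve every augmented instance and keep the best tour on the original."""
    tours = [solve(aug) for aug in augment(points)]
    return min(tours, key=lambda tour: tour_length(points, tour))

points = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.8), (0.2, 0.9), (0.5, 0.4)]
best = best_over_augmentations(points, greedy_solve)
print(best, round(tour_length(points, best), 3))
```

The transformations preserve all pairwise distances, so a tour found on an augmented instance has the same length on the original. A learned model may still produce different tours for the different views, which is what makes taking the best of them worthwhile.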

Slide 16

Slide 16 text

Pseudocode for the inference part

Slide 17

Slide 17 text

Experiments

Slide 18

Slide 18 text

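Experiments
Experimental setup
• POMO is used to solve the following problems, and the results are compared with other representative methods:
  Traveling salesman problem
  Capacitated vehicle routing problem
  Knapsack problem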

Slide 19

Slide 19 text

Learning curves: traveling salesman problem
(Figure panels: 50 nodes, 100 nodes)

Slide 20

Slide 20 text

Traveling salesman problem (TSP)

Slide 21

Slide 21 text

Traveling salesman problem (TSP)

Slide 22

Slide 22 text

Capacitated vehicle routing problem (CVRP)

Slide 23

Slide 23 text

Knapsack problem (KP)

Slide 24

Slide 24 text

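Summary of the experiments
• For three combinatorial optimization problems with different settings, promising results were obtained with the same training method and the same neural network architecture.
• Both POMO, as a training and inference method, and Instance Augmentation, as an inference method, were confirmed to be effective.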

Slide 25

Slide 25 text

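Conclusion
This paper introduced a method that exploits symmetry in combinatorial optimization problems to improve the sample efficiency and accuracy of RL and to shorten inference time.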

Slide 26

Slide 26 text

References
Kwon, Yeong-Dae, et al. "POMO: Policy Optimization with Multiple Optima for Reinforcement Learning." Advances in Neural Information Processing Systems, 2020.
Kool, Wouter, Herke van Hoof, and Max Welling. "Attention, Learn to Solve Routing Problems!" International Conference on Learning Representations, 2019.
Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer Networks." Advances in Neural Information Processing Systems, 2015.