UNREAL
Built on the A3C reinforcement learning algorithm and combined with auxiliary tasks that use Experience Replay, UNREAL achieves 10x faster learning in 3D mazes
REINFORCEMENT LEARNING WITH
UNSUPERVISED AUXILIARY TASKS
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki et al. (DeepMind, 2016)
Slide 4
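Slide 4 text
Animal dreams
• While dreaming, animals replay events they have experienced, consolidating memories from the hippocampus into the neocortex
• They dream especially often about events tied to positive or negative rewards, and learn from them
• e.g. "I saw a lion at the watering hole and was put in danger"
• UNREAL takes this as its inspiration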
Slide 5
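Slide 5 text
Reinforcement learning
[Diagram: the agent sends an action (e.g. ⬆, ➡, ⬇) to the environment, and the environment returns a state s and a reward r to the agent]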
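The agent-environment loop in the diagram can be sketched as follows. `ToyEnv`, its reward rule, and `run_episode` are hypothetical names made up for illustration; they are not part of the slides or of any library.

```python
class ToyEnv:
    """Hypothetical environment: action 1 earns reward 1, episodes last 5 steps."""

    def reset(self):
        self.t = 0
        return self.t  # initial state s

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # reward r
        done = self.t >= 5
        return self.t, reward, done  # next state s, reward r, episode end flag


def run_episode(env, policy):
    """Run one episode: the agent picks an action, the environment answers with (s, r)."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```

A learning agent would adjust `policy` between steps so that `total_reward` grows; here the policy is any function from a state to an action.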
Slide 6
Slide 6 text
The road to UNREAL
DQN → A3C → UNREAL
Slide 7
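Slide 7 text
A3C
Asynchronous Advantage Actor-Critic
• Runs multiple environments asynchronously and in parallel, which makes learning both faster and more stable
Each local network computes only the gradient dθ from its own experience. It does not apply that gradient to its own weights; it applies it individually to the global weights θ.
The global weights are then copied back into each local network's weights.
[Diagram: several local networks each send a gradient dθ to the shared global weights θ]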
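The local/global update cycle can be sketched with plain Python threads. This is a simplified illustration under stated assumptions: `GlobalParams`, `worker`, `toy_grad`, and `train` are hypothetical names, the "network" is just a list of two numbers, the gradient comes from a toy quadratic loss, and a lock guards the shared weights for clarity (the A3C paper itself applies updates without locking).

```python
import threading


class GlobalParams:
    """Shared global weights theta, guarded by a lock (a simplification)."""

    def __init__(self, theta):
        self.theta = list(theta)
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return list(self.theta)

    def apply_gradient(self, grad, lr):
        with self.lock:
            for i, g in enumerate(grad):
                self.theta[i] -= lr * g


def toy_grad(theta):
    # Gradient of the toy loss sum((theta_i - 1)^2); stands in for a real d_theta.
    return [2.0 * (t - 1.0) for t in theta]


def worker(global_params, steps, lr):
    for _ in range(steps):
        local_theta = global_params.snapshot()   # copy global weights to local
        grad = toy_grad(local_theta)             # compute d_theta locally only
        global_params.apply_gradient(grad, lr)   # apply to global, never to local


def train(num_workers=4, steps=50, lr=0.05):
    params = GlobalParams([0.0, 0.0])
    threads = [threading.Thread(target=worker, args=(params, steps, lr))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params.theta
```

Because each worker may compute its gradient from a slightly stale copy, the updates are not identical to sequential gradient descent, but on this toy loss the global weights still settle near 1.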
Slide 10
Slide 10 text
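Policy π and value V
R is the discounted n-step return, as defined in the A3C paper:
R = r_t + γ r_{t+1} + ... + γ^(k-1) r_{t+k-1} + γ^k V(s_{t+k}; θv)
V network: dθv = ∂(R - V(s_t; θv))² / ∂θv
Policy network: dθ = ∇θ log π(a_t | s_t; θ) * (R - V(s_t; θv))
• V is updated so that it moves closer to R
• If R - V is positive, the update raises the probability of the action that was taken
• If R - V is negative, the update lowers the probability of the action that was taken
Note: in the notation above, V uses gradient descent and the policy uses gradient ascent: θv = θv - α * dθv, θ = θ + α * dθ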
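To make the role of R - V concrete, the sketch below computes discounted n-step returns and the resulting advantages for a short rollout. `n_step_returns` and `advantages` are hypothetical helper names, and the bootstrap value stands in for the critic's estimate of the state reached after the last reward; this is an illustration, not the presenter's code.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Return R for every step of a rollout, working backwards from the end."""
    R = bootstrap_value  # V of the state after the last reward (0 if terminal)
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns


def advantages(returns, values):
    """R - V per step: positive means the action did better than the critic expected."""
    return [R - v for R, v in zip(returns, values)]
```

For example, with rewards `[0, 0, 1]`, a bootstrap value of 0, and `gamma=0.5`, the returns are `[0.25, 0.5, 1.0]`. If the critic predicted 0.5 everywhere, the advantages are `[-0.25, 0.0, 0.5]`, so the first action is made less likely and the last one more likely.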
Slide 11
Slide 11 text
UNREAL
• Adds auxiliary tasks to A3C that make effective use of Experience Replay, speeding up learning even further
• Pixel Control
• Reward Prediction
• Value Function Replay
UNsupervised REinforcement and Auxiliary Learning
Slide 12
Slide 12 text
Experience Replay
• Stores a large number of [state, action, reward, next state] tuples and trains the network on samples drawn from them
• DQN did not learn stably without it
• A3C does not use it
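A replay memory of this kind can be sketched in a few lines. `ReplayBuffer` and its method names are hypothetical, and the fixed capacity with oldest-first eviction is an assumption, since the slide does not say how old entries are discarded.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random breaks up the correlation between consecutive frames, which is the usual explanation for why replay helps a learner such as DQN train stably.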
Slide 14
Slide 14 text
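Pixel Control
• Goal: have the agent act so that the pixels on screen change as much as possible
• An auxiliary task that treats changes in on-screen pixels as a pseudo-reward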
Slide 15
Slide 15 text
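Pixel Control
• Split the screen into a grid of pixel cells (20x20 cells in the UNREAL paper) and run Q-learning for each cell
• Q-learning with a Dueling Network
Note: the Q-values learned for Pixel Control are not used to select actions
[Diagram: Q output of shape grid cells × actions]
Q: the discounted cumulative reward when the mean pixel change in each grid cell is used as the reward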
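The per-cell pseudo-reward can be sketched as the mean absolute pixel change inside each grid cell. `pixel_change_rewards` is a hypothetical helper; frames are assumed to be grayscale 2D lists of equal size, and the default cell size of 4 pixels is an assumption taken from the UNREAL paper rather than from this slide.

```python
def pixel_change_rewards(prev_frame, frame, cell=4):
    """Mean absolute pixel change per cell, as a grid of pseudo-rewards.

    prev_frame, frame: 2D lists of grayscale intensities with identical shape.
    Rows or columns that do not fill a whole cell are ignored.
    """
    height, width = len(frame), len(frame[0])
    rewards = []
    for top in range(0, height - cell + 1, cell):
        row = []
        for left in range(0, width - cell + 1, cell):
            total = 0.0
            for y in range(top, top + cell):
                for x in range(left, left + cell):
                    total += abs(frame[y][x] - prev_frame[y][x])
            row.append(total / (cell * cell))
        rewards.append(row)
    return rewards
```

Each entry of the returned grid is the reward for one cell at one time step; the auxiliary Q-head is then trained to predict the discounted sum of these rewards for every cell and every action.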
Slide 16
Slide 16 text
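Reward Prediction
• An auxiliary task that takes consecutive frames from Experience Replay (three in the UNREAL paper) and predicts whether the reward at the next frame is positive, negative, or zero
• The sequences are sampled so that nonzero (+ or -) rewards and zero rewards appear in balanced proportions (equal in the UNREAL paper)
Useful reward events are therefore sampled frequently even though they are rare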
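The skewed sampling can be sketched as a two-pool draw. `sample_reward_prediction_index` and `reward_class` are hypothetical helpers; `history=3` and `p_nonzero=0.5` are assumed defaults that follow the UNREAL paper, not values stated on this slide.

```python
import random


def sample_reward_prediction_index(rewards, history=3, p_nonzero=0.5, rng=random):
    """Pick the index i whose reward will be predicted.

    Frames i-history .. i-1 would be the network input. With probability
    p_nonzero the index comes from frames with a nonzero reward, otherwise
    from frames with zero reward. Needs len(rewards) > history.
    """
    candidates = list(range(history, len(rewards)))
    nonzero = [i for i in candidates if rewards[i] != 0]
    zero = [i for i in candidates if rewards[i] == 0]
    if not nonzero or not zero:
        return rng.choice(candidates)  # only one kind available: fall back to uniform
    pool = nonzero if rng.random() < p_nonzero else zero
    return rng.choice(pool)


def reward_class(reward):
    """Three-way classification target: 0 = zero, 1 = positive, 2 = negative."""
    if reward == 0:
        return 0
    return 1 if reward > 0 else 2
```

With one rewarding frame among a hundred, uniform sampling would show it about 1% of the time, whereas this scheme shows it about half the time.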
Slide 17
Slide 17 text
Reward Prediction
Predict whether the next reward is +, -, or 0
Slide 18
Slide 18 text
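Value Function Replay
• Repeats the state-value V estimation that A3C already performs (the Critic side of Actor-Critic), this time on frames sampled from Experience Replay
• Unlike Reward Prediction, the sampling is not deliberately skewed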
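A minimal sketch of the idea: draw a contiguous stretch of replayed frames uniformly at random and compute n-step return targets for the critic. `sample_value_replay_targets` is a hypothetical helper, and `values` is assumed to hold one critic estimate per stored state plus one for the state after the last reward. A real implementation would recompute these estimates with the current network and handle episode ends.

```python
import random


def sample_value_replay_targets(rewards, values, length, gamma=0.99, rng=random):
    """Uniformly sample a window of `length` steps and return (start, targets).

    targets[j] is the discounted return from step start+j, bootstrapped from
    the value estimate of the state just past the window.
    """
    start = rng.randrange(0, len(rewards) - length + 1)  # uniform, no skew
    end = start + length
    R = values[end] if end < len(values) else 0.0
    targets = []
    for r in reversed(rewards[start:end]):
        R = r + gamma * R
        targets.append(R)
    targets.reverse()
    return start, targets
```

The critic is then regressed toward `targets` for the sampled states, exactly as in the on-policy value update, only with data taken from the replay memory.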