Distributed prioritized experience replay
umeco
July 03, 2018
Research
Research paper readings in my laboratory
Transcript
Distributed Prioritized Experience Replay
Horgan, Dan, et al. "Distributed prioritized experience replay." arXiv preprint arXiv:1803.00933 (2018).
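Contents: Reinforcement learning / Research background and objectives / Related work / Proposed method / Evaluation experiments / Analysis / Summary and discussion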
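Reinforcement learning: a method in which a model tries various actions on its own and learns the actions that lead to good rewards. Practical example: AlphaGo (learned how to play Go)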
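Elements of reinforcement learning: Policy
<Example> In a given Go position, play the move you think is most likely to win; the game is then won or lost.
By repeating "keep using this move if it wins, stop using it if it loses", the model learns which move to play in which position to win more often.
(Diagram: action → result → reward → update of the value function)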
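Research background
- Models that make effective use of powerful computing resources are emerging: Gorila, A3C, GA3C (GPU Advantage Actor-Critic)
- Current reinforcement learning methods: most models assume a single machine, so a model that uses many machines is needed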
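Research objectives
- Propose Ape-X, a reinforcement learning method:
  - distributed system + prioritized experience replay
  - a combination of recent algorithms
  - small modifications for practical operation
- Analyze how the proposed method's parameters affect learning:
  - the number of workers that generate experience
  - the amount of experience retained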
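Related work: distributed stochastic gradient descent
- Methods that compute deep-learning gradients in parallel; both synchronous and asynchronous update schemes have been proposed
- Nair et al. applied these ideas to reinforcement learning (Gorila):
  - distributed asynchronous gradient updates
  - distributed experience generation
- Strong results have also been obtained on a single machine with multiple threads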
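Related work: distributed importance sampling
- A technique widely used to improve learning
- Sampling by priority introduces a bias
- To correct it, the gradient updates for low-probability samples are made larger
- Alain et al. applied it to supervised learning and succeeded in applying it to a distributed system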
Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.
Related work: experience replay
- Experience Replay: store the generated experience and reuse it for learning many times
  - Generated experience can be used efficiently
  - Mixing in experience from older policies helps prevent overfitting
- Prioritized Experience Replay: replay useful experience more often
  - Priorities are assigned using the TD error
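The TD-error-based prioritization of Prioritized Experience Replay can be sketched with its proportional variant: each transition gets priority p_i = |δ_i| + ε and is sampled with probability P(i) = p_i^α / Σ_k p_k^α. This is a minimal sketch; the function name, the exponent α = 0.6 and the small constant ε are illustrative assumptions, not values taken from these slides.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization (Prioritized Experience Replay).

    p_i  = |delta_i| + eps   (eps keeps zero-error transitions sampleable)
    P(i) = p_i ** alpha / sum_k p_k ** alpha
    """
    scaled = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


# Larger |TD error| -> larger sampling probability; a zero error still gets a tiny one.
probs = sampling_probabilities([0.5, -2.0, 0.0, 1.0])
```

The exponent α interpolates between uniform sampling (α = 0) and fully greedy prioritization (α = 1).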
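Proposed method: overview of Ape-X
[Figure: Ape-X architecture with Learner (Network), Replay (Experiences), and Actor (Network, Environment)]
Reinforcement learning is split into three roles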
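Proposed method: Actor
- Each Actor has its own action-value network and its own environment
- Acts according to the policy and observes the state transitions
- Assigns a priority to each transition and sends it to the Replay Memory
- Actors do not train the action-value network
Many Actors act independently and generate large amounts of experience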
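A minimal single-process sketch of one Actor, assuming a hypothetical toy chain environment (`ChainEnv`) and a tabular action-value copy in place of a neural network. The Actor reads its local Q-values, acts ε-greedily, computes an initial priority for each transition from a one-step TD error (a simplification; Ape-X itself uses multi-step returns), and buffers the results for the Replay Memory without ever updating Q. All names here are hypothetical.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: a chain of n states; reaching the right end gives reward 1."""

    def __init__(self, n=5):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, anything else moves left; the state is clipped to [0, n - 1]
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.n - 1, self.state + move))
        done = self.state == self.n - 1
        return self.state, (1.0 if done else 0.0), done


def run_actor(q, env, epsilon, num_steps, gamma=0.99, seed=0):
    """Collect (transition, priority) pairs with an epsilon-greedy policy.

    `q` is the Actor's local copy of the action values (state -> [Q(s, left), Q(s, right)]).
    The Actor only reads it; learning happens in the Learner.
    """
    rng = random.Random(seed)
    outbox = []  # stands in for the batch sent to the shared Replay Memory
    s = env.reset()
    for _ in range(num_steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)  # explore: random action
        else:
            a = max(range(2), key=lambda act: q[s][act])  # exploit: greedy action
        s_next, r, done = env.step(a)
        # One-step target; no bootstrapping from a terminal state
        target = r if done else r + gamma * max(q[s_next])
        priority = abs(target - q[s][a])  # initial priority = |TD error|, computed by the Actor
        outbox.append(((s, a, r, s_next, done), priority))
        s = env.reset() if done else s_next
    return outbox


q = {s: [0.0, 0.0] for s in range(5)}
batch = run_actor(q, ChainEnv(5), epsilon=0.4, num_steps=20)
```

Computing priorities on the Actor side means the Learner never has to evaluate fresh transitions before they can be sampled.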
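Proposed method: Replay Memory
- Stores the experience sent by the Actors
- A single Replay Memory is shared by the whole system
- An upper limit is set on how much experience can be stored
- When the limit is exceeded, the oldest experience is removed (FIFO)
Holds a large amount of experience for the Learner to learn from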
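The Replay Memory behavior described above (one shared store, a capacity limit, FIFO eviction) maps naturally onto a bounded deque. This is a sketch under that assumption; the class and method names are hypothetical, and a large-scale implementation would typically keep priorities in a sum-tree so that prioritized sampling stays O(log N).

```python
from collections import deque


class ReplayMemory:
    """Single shared replay memory with a capacity limit and FIFO eviction."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest (leftmost) item automatically when full
        self.transitions = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)

    def add(self, transition, priority):
        self.transitions.append(transition)
        self.priorities.append(priority)

    def add_batch(self, batch):
        # batch: iterable of (transition, priority) pairs, as sent by an Actor
        for transition, priority in batch:
            self.add(transition, priority)

    def __len__(self):
        return len(self.transitions)


memory = ReplayMemory(capacity=3)
memory.add_batch([(("transition", t), 1.0) for t in range(5)])
kept = [tr[1] for tr in memory.transitions]  # [2, 3, 4]: transitions 0 and 1 were evicted
```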
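Proposed method: Learner
- Samples experience according to priority and learns from it
- Recomputes the priorities of the experience used for learning
- Sends its parameters to the Actors at fixed intervals
Learns preferentially from useful experience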
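The Learner's loop (sample by priority, recompute the priorities of the sampled transitions, periodically push parameters to the Actors) can be sketched as below. The gradient step is omitted, `td_error_fn` is a hypothetical stand-in for evaluating the current network on a batch, the parameter push is reduced to a counter, and α = 0.6 is an illustrative choice.

```python
import random


def learner_step(priorities, td_error_fn, batch_size, rng, alpha=0.6, eps=1e-6):
    """One Learner iteration on a list of stored priorities (modified in place)."""
    # 1) Sample indices with probability proportional to p_i ** alpha.
    weights = [p ** alpha for p in priorities]
    batch = rng.choices(range(len(priorities)), weights=weights, k=batch_size)
    # 2) A real Learner would take a gradient step on `batch` here (omitted).
    # 3) Recompute the priorities of the sampled transitions from their new TD errors.
    for i, delta in zip(batch, td_error_fn(batch)):
        priorities[i] = abs(delta) + eps
    return batch


def run_learner(priorities, td_error_fn, num_steps, sync_every, batch_size=4, seed=0):
    """Runs the Learner loop; returns how many times parameters would be sent to the Actors."""
    rng = random.Random(seed)
    syncs = 0
    for step in range(1, num_steps + 1):
        learner_step(priorities, td_error_fn, batch_size, rng)
        if step % sync_every == 0:
            syncs += 1  # placeholder for broadcasting the current parameters to every Actor
    return syncs


# Hypothetical usage: 8 stored transitions; the "network" now reports a TD error of 0.1 for each.
priorities = [1.0] * 8
syncs = run_learner(priorities, lambda batch: [0.1] * len(batch), num_steps=10, sync_every=5)
```

Transitions that have just been learned from get their priority lowered once their TD error shrinks, so sampling naturally shifts toward experience the network still predicts poorly.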
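Proposed method: summary of the Ape-X overview
[Figure: Ape-X architecture with Learner (Network), Replay (Experiences), and Actor (Network, Environment)]
- Actors: generate large amounts of experience in parallel
- Replay Memory: holds a large amount of experience
- Learner: learns so as to increase the reward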
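Proposed method: key features
- Does not require many GPUs
  - The Learner runs on a machine with a GPU (one machine)
  - The Actors run on CPU-only machines (many)
- Efficient use of experience
  - The Replay Memory is shared across the whole system
  - Priorities are assigned to experience
A useful discovery made by one Actor is shared across the whole system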
Proposed method: the Learner's model
- Learning algorithm: Double Deep Q-Network with a multi-step bootstrap target
- Q-function approximator: Dueling Network
- Data sampling: Prioritized Experience Replay
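Two of the Learner's components reduce to short formulas: the multi-step Double-DQN target, G = Σ_{k=0}^{n-1} γ^k r_{t+k+1} + γ^n Q_target(s_{t+n}, argmax_a Q_online(s_{t+n}, a)), and the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'). The sketch below computes both on plain Python lists; the function names and list-based inputs are assumptions for illustration, not the paper's code.

```python
def n_step_double_q_target(rewards, gamma, q_online_next, q_target_next, done):
    """Multi-step Double-DQN target.

    rewards:        [r_{t+1}, ..., r_{t+n}]
    q_online_next:  action values Q_online(s_{t+n}, .) from the network being trained
    q_target_next:  action values Q_target(s_{t+n}, .) from the target network
    done:           True if the episode ended within the n steps (no bootstrapping)
    """
    n = len(rewards)
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    if done:
        return g
    # Double Q-learning: select the action with the online network, evaluate it with the target network.
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return g + gamma ** n * q_target_next[a_star]


def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


target = n_step_double_q_target([1.0, 0.0, 1.0], 0.5, [0.2, 0.9], [3.0, 2.0], done=False)
q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
```

Selecting the action with one network and evaluating it with another is what reduces the overestimation bias of the plain max in Q-learning.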
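Proposed method: other details
- Each Actor follows an ε-greedy policy with its own value of ε
  - ε-greedy: act randomly with probability ε, otherwise take the action with the highest estimated value
  - Acting randomly helps prevent overfitting
  - Setting ε per Actor ensures diversity
- Because sampling is based on priority, the resulting bias is corrected with importance sampling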
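A sketch of the two details above. The per-Actor schedule ε_i = ε^(1 + α·i/(N-1)) with ε = 0.4 and α = 7 follows the one described in the Ape-X paper; the constants should be treated as illustrative. The importance-sampling weights follow Prioritized Experience Replay, w_i = (N·P(i))^(-β) normalized by the largest weight, with β = 0.4 as an assumed value. Function names are hypothetical.

```python
def actor_epsilons(num_actors, base_eps=0.4, alpha=7.0):
    """Per-Actor exploration rates: eps_i = base_eps ** (1 + alpha * i / (N - 1))."""
    if num_actors == 1:
        return [base_eps]
    return [base_eps ** (1 + alpha * i / (num_actors - 1)) for i in range(num_actors)]


def importance_weights(sample_probs, buffer_size, beta=0.4):
    """IS weights w_i = (N * P(i)) ** -beta, normalized so that the largest weight is 1.

    Rarely sampled (low-probability) transitions get the largest weights, which offsets the
    over-representation of high-priority transitions in the sampled batches.
    """
    raw = [(buffer_size * p) ** (-beta) for p in sample_probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


eps = actor_epsilons(8)  # Actor 0 explores the most, Actor 7 the least
weights = importance_weights([0.5, 0.25, 0.25], buffer_size=3)
```

The weights multiply each sample's loss term, so frequently replayed high-priority transitions contribute proportionally smaller gradient steps.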
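Evaluation: experimental setup
- Experiments: Atari games (e.g., Breakout)
- Actors: … Actors, … CPU per Actor
- Experience generated per Actor: … FPS
- Total experience generated: … K FPS (Repeat)
- Gradient updates: … per second
- Experience is compressed to PNG before storage to reduce memory usage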
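Evaluation: performance comparison at the end of training
[Figure: score vs. training time]
- Score: median across the games, compared with human scores
Both the final score and the training time are greatly improved over existing methods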
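Evaluation: reward over time
[Figure: reward vs. training time]
- Average reward obtained in each of … games
Compared with the other methods, Ape-X reaches higher rewards sooner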
Evaluation: experimental results
- Ape-X recorded the highest scores
- Distributed learning greatly shortened the training time
Analysis: number of Actors vs. reward
The more Actors, the better the reward obtained
Analysis: Replay Memory capacity vs. reward
The larger the capacity, the somewhat better the reward obtained
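Analysis: does learning from more recent experience improve the score?
- Recent experience: experience generated with the latest parameters
- Each Actor duplicates the experience it sends, so more copies of it reach the Replay Memory
- As a result, newer experience is sampled more often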
Analysis: recent experience vs. reward
Learning from the most recent experience is not linked to higher reward
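Analysis: summary of results
- Increasing the number of Actors increases the reward
  - Helps keep the agent from getting stuck in local optima
  - Large-scale exploration yields useful experience
- Increasing the Replay Memory capacity increases the reward
  - Useful experience can be kept for longer
- Recent experience does not directly contribute to the reward
  - Padding the memory with duplicated experience lowers diversity and hurts performance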
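Summary and discussion: summary
- Proposed a framework that combines distributed learning with prioritized experience replay
- Ape-X achieved the best performance in both wall-clock training time and final performance
- Overfitting is a major problem in reinforcement learning, and this work showed that the simple approach of generating large amounts of data is effective against it
- Future work should explore ways to use data more efficiently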
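Summary and discussion: discussion
- Ape-X is a method for collecting experience quickly and in large quantities
- Complex tasks have a huge number of state-action pairs (s, a)
- Generating large amounts of experience covered the (s, a) space broadly, which advanced learning
- Currently, unknown actions are experienced only through random exploration
- A possible direction: a method that focuses exploration on rarely visited (s, a) pairs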
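Q-function update rule in Q-learning
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a' \in A} Q(s_{t+1}, a') - Q(s_t, a_t) \right)$$
- s_t: state at time t
- a_t: action at time t
- Q(s_t, a_t): estimated return (discounted cumulative reward) for taking action a_t in state s_t
- r_t: reward at time t
- α: learning rate
- γ: discount factor
TD (temporal-difference) error: the term in parentheses, r_{t+1} + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t), i.e. the difference between the one-step target built from the observed reward and the current estimate Q(s_t, a_t)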
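The update rule can be checked on a tiny tabular example. The sketch below applies one update to a dictionary of action values and returns the TD error; the state/action encoding and the values α = 0.5, γ = 0.9 in the usage lines are arbitrary choices for illustration, and terminal states are not handled, matching the formula as written.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s, a)).
    `q` maps each state to a list of action values and is updated in place.
    Returns the TD error (the term in parentheses).
    """
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


q = {0: [0.0, 0.0], 1: [2.0, 1.0]}  # two states, two actions each (illustrative values)
td = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# td is about 2.8 (= 1 + 0.9 * 2 - 0) and q[0][1] becomes about 1.4
```

The returned TD error is exactly the quantity that Prioritized Experience Replay turns into a sampling priority.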