bellman方程式の導出.pdf
m_nshr
February 16, 2019
Deriving Bellman equations
Transcript
Deriving the Bellman Equations
Math & Coding: Let's Talk About Reinforcement Learning
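Contents
- Introduction
- Introducing value functions
- Deriving the Bellman equations from the definition of the value functions
- Viewing the Bellman equations through backup diagrams
- Introducing optimal value functions
- Viewing the optimal value functions through backup diagrams
- Summary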
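What is the goal of reinforcement learning?
"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." (Sutton & Barto, Reward Hypothesis)
→ Maximize the cumulative reward.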
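Cumulative reward (return)
Trajectory: $S_0 \xrightarrow{\pi} A_0 \to R_1 \to S_1 \xrightarrow{\pi} A_1 \to R_2 \to S_2 \xrightarrow{\pi} \cdots$
$$
\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
&= \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \\
&= R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+1+k+1} \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
where $\gamma \in [0, 1]$ is the discount rate. Note the recursive form in the last line: the return at time $t$ is the immediate reward plus the discounted return at time $t+1$.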
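As a quick numerical check of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, here is a minimal Python sketch. The function name `discounted_return`, the reward sequence, and the value of `gamma` are illustrative assumptions, not part of the slides, and the episode is assumed to be finite.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite list rewards = [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Walk backwards so that every step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: R_1 = 1, R_2 = 0, R_3 = 2, with gamma = 0.5.
rewards = [1.0, 0.0, 2.0]
gamma = 0.5

g0 = discounted_return(rewards, gamma)        # G_0
g1 = discounted_return(rewards[1:], gamma)    # G_1
# The same return computed directly from the definition sum_k gamma^k R_{t+1+k}.
direct = sum(gamma ** k * r for k, r in enumerate(rewards))
```

Both the recursive and the direct computation give the same value here, and `g0` equals `rewards[0] + gamma * g1`, which is exactly the last line of the derivation.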
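Bellman's principle of optimality
- The problem has a recursive structure.
- An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
  Reference: Bellman, Chap. III, Principle of Optimality
→ The problem may be solvable by dynamic programming (DP).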
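Notation
Policy: the probability of selecting action $a$ in state $s$.
$$\pi(a \mid s) \doteq \Pr(A_t = a \mid S_t = s)$$
State-transition probability (the environment's dynamics): the probability of moving to next state $s'$ after selecting action $a$ in state $s$.
$$p(s' \mid s, a) \doteq \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$
Reward function (immediate reward): the expected immediate reward when action $a$ is selected in state $s$ and the next state is $s'$.
$$r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$$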
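Value functions
State-value function:
$$V_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$
State-action value function:
$$Q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
Relation between the two:
$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \sum_a \pi(a \mid s)\, Q_\pi(s, a)
\end{aligned}
$$
← result: $V_\pi(s)$ is the policy-weighted average of $Q_\pi(s, a)$.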
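The relation $V_\pi(s) = \sum_a \pi(a \mid s) Q_\pi(s, a)$ is just a weighted average, which a tiny example makes concrete. The action names and all numbers below are made up for illustration; they describe a single fixed state $s$.

```python
# Hypothetical policy pi(a|s) and action values Q_pi(s, a) for one fixed state s.
pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 2.0, "right": 4.0}

# pi(.|s) must be a probability distribution over actions.
total_prob = sum(pi_s.values())

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) = 0.25 * 2.0 + 0.75 * 4.0
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)
```

Because the weights sum to one, `v_s` always lies between the smallest and the largest action value in that state.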
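Bellman equation for $V_\pi(s)$
$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right)
\end{aligned}
$$
The last step uses the Markov property: given $S_{t+1} = s'$, the return $G_{t+1}$ does not depend on $S_t$ or $A_t$, so the conditional expectation equals $V_\pi(s')$.
← result: the Bellman equation for $V_\pi(s)$.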
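One way to see this equation at work is iterative policy evaluation: repeatedly apply the right-hand side as an update until the values stop changing. The sketch below is not from the slides. The two-state MDP, the uniform random policy, and every name (`P`, `PI`, `reward`, `bellman_backup`, `evaluate_policy`) are assumptions chosen for illustration.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics: P[(s, a)] maps each next state s' to p(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}

# Hypothetical policy: choose each action with probability 0.5 in every state.
PI = {s: {"stay": 0.5, "go": 0.5} for s in STATES}


def reward(s, a, s_next):
    """Hypothetical r(s, a, s'): reward 1 whenever the next state is s1."""
    return 1.0 if s_next == "s1" else 0.0


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for V_pi at state s."""
    return sum(
        PI[s][a] * sum(
            prob * (reward(s, a, s_next) + GAMMA * v[s_next])
            for s_next, prob in P[(s, a)].items()
        )
        for a in ACTIONS
    )


def evaluate_policy(tol=1e-10):
    """Apply the backup to all states until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: bellman_backup(v, s) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_pi = evaluate_policy()
# Residual of the Bellman equation at the computed values; it should be close to 0.
residual = max(abs(v_pi[s] - bellman_backup(v_pi, s)) for s in STATES)
```

Because $\gamma < 1$, the backup is a contraction and the loop terminates. The resulting `v_pi` satisfies the Bellman equation up to the chosen tolerance.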
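Bellman equation for $Q_\pi(s, a)$
$$
\begin{aligned}
Q_\pi(s, a) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s'} p(s' \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
&= \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
&= \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \\
&= \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \right)
\end{aligned}
$$
The second-to-last line is a result in its own right: $Q_\pi$ expressed through $V_\pi$. The last line substitutes $V_\pi(s') = \sum_{a'} \pi(a' \mid s') Q_\pi(s', a')$.
← result: the Bellman equation for $Q_\pi(s, a)$.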
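The same fixed-point idea can be sketched for $Q_\pi$: iterate the right-hand side of its Bellman equation, then recover $V_\pi$ as the policy-weighted average and confirm that $Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)(r(s, a, s') + \gamma V_\pi(s'))$ holds. The MDP, the policy, and all names below are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics p(s'|s, a) and a uniform random policy pi(a|s).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}
PI = {s: {"stay": 0.5, "go": 0.5} for s in STATES}


def reward(s, a, s_next):
    """Hypothetical r(s, a, s'): reward 1 whenever the next state is s1."""
    return 1.0 if s_next == "s1" else 0.0


def q_backup(q, s, a):
    """Right-hand side of the Bellman equation for Q_pi(s, a)."""
    return sum(
        prob * (
            reward(s, a, s_next)
            + GAMMA * sum(PI[s_next][a2] * q[(s_next, a2)] for a2 in ACTIONS)
        )
        for s_next, prob in P[(s, a)].items()
    )


def evaluate_q(tol=1e-10):
    """Iterate the Q backup over all state-action pairs until convergence."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        new_q = {(s, a): q_backup(q, s, a) for s in STATES for a in ACTIONS}
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q


q_pi = evaluate_q()
# V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
v_pi = {s: sum(PI[s][a] * q_pi[(s, a)] for a in ACTIONS) for s in STATES}
# Q_pi recomputed from V_pi via Q = sum_s' p(s'|s,a) (r + gamma * V_pi(s')).
q_from_v = {
    (s, a): sum(
        prob * (reward(s, a, s_next) + GAMMA * v_pi[s_next])
        for s_next, prob in P[(s, a)].items()
    )
    for s in STATES
    for a in ACTIONS
}
```

Up to the tolerance, `q_from_v` matches `q_pi`, which is the substitution step of the derivation run in reverse.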
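Backup diagram
- Represents a sequence of states and actions as a diagram.
- An open circle (○) denotes a state; a filled circle (●) denotes an action (or a state-action pair).
- Used when computing the value of the root node.
- Shows which elements make up the value of the root node (the topmost node).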
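Checking the Bellman equations with backup diagrams (one step)
$$V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$$
[Backup diagram: root state $s$ with value $V_\pi(s)$; branches weighted by $\pi(a \mid s)$ lead to actions $a_1, a_2$ with values $Q_\pi(s, a_1)$, $Q_\pi(s, a_2)$.]
$$Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right)$$
[Backup diagram: root state-action pair $(s, a)$ with value $Q_\pi(s, a)$; branches weighted by $p(s' \mid s, a)$ lead to next states $s'_1, s'_2$ with rewards $r(s, a, s'_1)$, $r(s, a, s'_2)$ and values $V_\pi(s'_1)$, $V_\pi(s'_2)$.]
Both equations are restated from the derivation.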
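Checking the Bellman equations with backup diagrams (two steps)
$$
\begin{aligned}
V_\pi(s) &= \sum_a \pi(a \mid s)\, Q_\pi(s, a) \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right)
\end{aligned}
$$
[Backup diagram: root $V_\pi(s)$, a branch weighted by $\pi(a \mid s)$ down to $Q_\pi(s, a)$, then a branch weighted by $p$ with reward $r(s, a, s')$ down to $V_\pi(s')$.]
$$
\begin{aligned}
Q_\pi(s, a) &= \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \\
&= \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \right)
\end{aligned}
$$
[Backup diagram: root $Q_\pi(s, a)$, a branch weighted by $p$ with reward $r(s, a, s')$ down to $V_\pi(s')$, then a branch weighted by $\pi$ down to $Q_\pi(s', a')$.]
In each case the second line substitutes the other one-step relation into the first.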
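Optimal value functions
$$V^*(s) = \max_\pi V_\pi(s) \quad \text{for any } s \in \mathcal{S}$$
$$Q^*(s, a) = \max_\pi Q_\pi(s, a) \quad \text{for any } s \in \mathcal{S},\ a \in \mathcal{A}$$
- At least one policy $\pi$ that satisfies these relations exists (an optimal policy).
- Following this $\pi$ achieves the maximization of the return.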
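Checking the Bellman optimality equations with backup diagrams
For the optimal value functions, the policy-weighted sum $\sum_a \pi(a \mid s)$ is replaced by $\max_a$:
$$V^*(s) = \max_a Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V^*(s') \right)$$
Substituting one into the other gives the Bellman optimality equations:
$$V^*(s) = \max_a \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma V^*(s') \right)$$
$$Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left( r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right)$$
[Backup diagrams: the same trees as for $V_\pi$ and $Q_\pi$, with each action-selection branch marked $\max_a$.]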
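When the dynamics $p$ and the reward function $r$ are known, the Bellman optimality equation for $V^*$ can be turned into an update rule, which is the idea behind value iteration. The following is a minimal sketch under that assumption. The two-state MDP and all names (`P`, `reward`, `action_value`, `value_iteration`) are hypothetical and not taken from the slides.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics: P[(s, a)] maps each next state s' to p(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}


def reward(s, a, s_next):
    """Hypothetical r(s, a, s'): reward 1 whenever the next state is s1."""
    return 1.0 if s_next == "s1" else 0.0


def action_value(v, s, a):
    """sum_s' p(s'|s, a) * (r(s, a, s') + gamma * v(s'))."""
    return sum(
        prob * (reward(s, a, s_next) + GAMMA * v[s_next])
        for s_next, prob in P[(s, a)].items()
    )


def value_iteration(tol=1e-10):
    """Iterate V(s) <- max_a sum_s' p(s'|s,a) (r + gamma V(s')) until convergence."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: max(action_value(v, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
# A greedy policy with respect to V* picks argmax_a of the one-step lookahead.
greedy = {s: max(ACTIONS, key=lambda a: action_value(v_star, s, a)) for s in STATES}
```

In this toy MDP, staying in `s1` earns reward 1 forever, so $V^*(s_1) = 1 / (1 - \gamma) = 10$, and the greedy policy moves from `s0` toward `s1` and then stays there.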
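Summary
- What we did
  → Derived the Bellman equations.
- In cases where these equations can be solved directly, the optimal solution is obtained.
  → Dynamic programming (Bellman's principle of optimality)
- Usually, DP cannot be carried out, for various reasons.
  → This is where sampling-based methods come in (Monte Carlo methods, TD methods, and so on).
- However, many algorithms can be understood as approximate solution methods for the Bellman equations.
  → Differences in behavior, such as MC versus TD or Sarsa versus Q-learning, can be understood from the viewpoint of the Bellman equations.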
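To illustrate the last point, the sketch below contrasts the one-step targets of Sarsa and Q-learning. Sarsa's target is a sample of the Bellman equation for $Q_\pi$ (it uses the next action actually taken), while Q-learning's target is a sample of the Bellman optimality equation for $Q^*$ (it takes $\max_{a'}$). The function names, the table `q`, the sampled transition, and the step size `alpha` are all illustrative assumptions.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """Sampled Bellman equation for Q_pi: bootstrap from the action actually taken."""
    return reward + gamma * q[(next_state, next_action)]


def q_learning_target(q, reward, next_state, actions, gamma):
    """Sampled Bellman optimality equation for Q*: bootstrap from max over a'."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def td_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])


# Hypothetical action-value table and one sampled transition (s0, go, r=1, s1),
# after which the behavior policy happens to pick the action "go" in s1.
ACTIONS = ["stay", "go"]
q = {("s0", "go"): 0.0, ("s1", "stay"): 2.0, ("s1", "go"): 1.0}
gamma, alpha = 0.9, 0.5

t_sarsa = sarsa_target(q, 1.0, "s1", "go", gamma)            # 1 + 0.9 * 1.0
t_qlearn = q_learning_target(q, 1.0, "s1", ACTIONS, gamma)   # 1 + 0.9 * 2.0

td_update(q, "s0", "go", t_sarsa, alpha)  # Sarsa-style update of Q(s0, go)
```

The two targets differ exactly when the next action taken is not the greedy one, which is one way to read the on-policy versus off-policy distinction through the two kinds of Bellman equation.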
Thank you for your attention.