bellman方程式の導出.pdf
m_nshr
February 16, 2019
Deriving Bellman equations
Transcript
Deriving the Bellman equations
MathCoding: Let's talk about reinforcement learning
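
Contents
・Introduction
・Introducing value functions
・Deriving the Bellman equations from the definitions of the value functions
・Reading the Bellman equations off backup diagrams
・Introducing optimal value functions
・Reading the optimal value functions off backup diagrams
・Summary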
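
What is the goal of reinforcement learning?
"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." (Sutton & Barto, Reward Hypothesis)
→ Maximize the cumulative reward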
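
Cumulative reward (return)

\[
S_0 \xrightarrow{\pi} A_0 \to R_1 \to S_1 \xrightarrow{\pi} A_1 \to R_2 \to S_2 \xrightarrow{\pi} \cdots
\]

\[
\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
    &= \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \\
    &= R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+1+k+1} \\
    &= R_{t+1} + \gamma G_{t+1} \quad \leftarrow \text{focus on this}
\end{aligned}
\]

where \(\gamma \in [0, 1]\) is the discount factor.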
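
The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) can be checked numerically. The following is a minimal sketch, assuming a finite episode (so the return after the last reward is 0); the rewards, the value of gamma, and the name `discounted_returns` are illustrative and not taken from the slides.

```python
def discounted_returns(rewards, gamma):
    """Given rewards [R_1, ..., R_T], return [G_0, ..., G_{T-1}].

    Uses the recursion G_t = R_{t+1} + gamma * G_{t+1}, sweeping backward
    from the end of the episode, where the return is assumed to be 0.
    """
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is received after the final step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] holds R_{t+1}
        returns[t] = g
    return returns


rewards = [1.0, 0.0, 2.0]  # R_1, R_2, R_3 (made-up values)
gamma = 0.5

returns = discounted_returns(rewards, gamma)

# G_0 computed directly from the definition sum_k gamma^k R_{1+k}.
direct_g0 = sum(gamma**k * reward for k, reward in enumerate(rewards))

print(returns)    # [1.5, 1.0, 2.0]
print(direct_g0)  # 1.5
```

The backward sweep costs one multiplication and one addition per step, whereas evaluating the sum separately for each t would redo the same work repeatedly.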
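
Bellman's principle of optimality
・The problem has a recursive structure
・An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
  Reference: Bellman, Chap. III, Principle of Optimality
→ It may be solvable by dynamic programming (DP)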
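
Notation

Policy:
\[
\pi(a \mid s) \doteq \mathbb{P}(A_t = a \mid S_t = s)
\]
→ the probability of selecting action a in state s

Environment's dynamics

State transition probability:
\[
p(s' \mid s, a) \doteq \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a)
\]
→ the probability of moving to next state s' after selecting action a in state s

Reward function (immediate reward):
\[
r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']
\]
→ the expected immediate reward when action a is selected in state s and the next state is s'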
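
To make the notation concrete, here is a minimal sketch of how \(\pi\), \(p\) and \(r\) might be stored for a tiny tabular problem. The two-state MDP, its probabilities and rewards, and the helper `expected_reward` are all hypothetical choices made for illustration.

```python
# Hypothetical two-state MDP, used only for illustration.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# pi[s][a] = pi(a|s): probability of selecting action a in state s.
pi = {s: {a: 0.5 for a in ACTIONS} for s in STATES}

# p[(s, a)][s_next] = p(s'|s, a): state transition probability.
p = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}


def r(s, a, s_next):
    """r(s, a, s'): expected immediate reward; here 1 for landing in B."""
    return 1.0 if s_next == "B" else 0.0


# Both pi(.|s) and p(.|s, a) must be probability distributions.
for s in STATES:
    assert abs(sum(pi[s].values()) - 1.0) < 1e-12
    for a in ACTIONS:
        assert abs(sum(p[(s, a)].values()) - 1.0) < 1e-12


def expected_reward(s):
    """E_pi[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_s' p(s'|s,a) r(s,a,s')."""
    return sum(
        pi[s][a] * sum(prob * r(s, a, s2) for s2, prob in p[(s, a)].items())
        for a in ACTIONS
    )


print(expected_reward("A"))  # 0.25
print(expected_reward("B"))  # 0.5
```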
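
Value functions

"State" value function:
\[
V_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]
\]

"State-action" value function:
\[
Q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
\]

Relationship between the two:
\[
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
         &= \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
         &= \sum_a \pi(a \mid s)\, Q_\pi(s, a) \quad \leftarrow \text{result}
\end{aligned}
\]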
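
Bellman equation for \(V_\pi(s)\)

\[
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
  &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
  &= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
  &= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \bigr) \\
  &= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V_\pi(s') \bigr) \quad \leftarrow \text{result}
\end{aligned}
\]

The last step uses the Markov property: given \(S_{t+1} = s'\), the return \(G_{t+1}\) does not depend on \(S_t\) or \(A_t\), so the inner expectation equals \(V_\pi(s')\).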
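
Because the Bellman equation expresses \(V_\pi\) in terms of itself, one way to solve it is to apply the right-hand side repeatedly until the values stop changing (iterative policy evaluation). The following is a minimal sketch under stated assumptions: the two-state MDP, the uniform policy, \(\gamma = 0.9\), and the names `backup_v` and `evaluate_policy` are all hypothetical and not from the slides.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics p(s'|s, a) of a two-state MDP.
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}

# Uniform random policy pi(a|s).
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def r(s, a, s_next):
    """Hypothetical reward function: 1 for landing in state B."""
    return 1.0 if s_next == "B" else 0.0


def backup_v(v, s):
    """Right-hand side of the Bellman equation for V_pi at state s."""
    return sum(
        PI[s][a]
        * sum(prob * (r(s, a, s2) + GAMMA * v[s2]) for s2, prob in P[(s, a)].items())
        for a in ACTIONS
    )


def evaluate_policy(tol=1e-10):
    """Sweep the Bellman backup over all states until the change is below tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_value = backup_v(v, s)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v


v_pi = evaluate_policy()
print(v_pi)  # approximately {'A': 3.2258, 'B': 3.5484}
```

For this particular MDP the two Bellman equations form a 2x2 linear system whose exact solution is \(V_\pi(A) = 100/31\) and \(V_\pi(B) = 110/31\), which the iteration approaches.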
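
Bellman equation for \(Q_\pi(s, a)\)

\[
\begin{aligned}
Q_\pi(s, a) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
  &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
  &= \sum_{s'} p(s' \mid s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
  &= \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \bigr) \\
  &= \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V_\pi(s') \bigr) \quad \leftarrow \text{result} \\
  &= \sum_{s'} p(s' \mid s, a) \Bigl( r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \Bigr) \quad \leftarrow \text{result}
\end{aligned}
\]

The last line substitutes the relation \(V_\pi(s') = \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a')\).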
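
The same fixed-point idea applies to \(Q_\pi\). The sketch below iterates the Bellman equation for \(Q_\pi\) and then recovers \(V_\pi\) through \(V_\pi(s) = \sum_a \pi(a \mid s) Q_\pi(s, a)\). The two-state MDP, the uniform policy, \(\gamma = 0.9\), and the function names are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics p(s'|s, a) of a two-state MDP.
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}

# Uniform random policy pi(a|s).
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def r(s, a, s_next):
    """Hypothetical reward function: 1 for landing in state B."""
    return 1.0 if s_next == "B" else 0.0


def backup_q(q, s, a):
    """Right-hand side of the Bellman equation for Q_pi at the pair (s, a)."""
    return sum(
        prob * (r(s, a, s2) + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in ACTIONS))
        for s2, prob in P[(s, a)].items()
    )


def evaluate_q(tol=1e-10):
    """Sweep the Bellman backup over all (s, a) pairs until convergence."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        delta = 0.0
        for s in STATES:
            for a in ACTIONS:
                new_value = backup_q(q, s, a)
                delta = max(delta, abs(new_value - q[(s, a)]))
                q[(s, a)] = new_value
        if delta < tol:
            return q


q_pi = evaluate_q()

# Recover V_pi from Q_pi by averaging over the policy.
v_from_q = {s: sum(PI[s][a] * q_pi[(s, a)] for a in ACTIONS) for s in STATES}

print(q_pi)
print(v_from_q)  # approximately {'A': 3.2258, 'B': 3.5484}
```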
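
Backup diagram
・Represents a sequence of states and actions as a diagram
・An open circle ○ is a state; a filled circle ● is an action (or a state-action pair)
・Used when computing the value of the root node
・Shows which elements make up the value of the root node (the topmost node)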
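
Checking the Bellman equations with backup diagrams (one step)

\[
V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a) \quad \rightarrow \text{result (restated)}
\]

\[
Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V_\pi(s') \bigr) \quad \rightarrow \text{result (restated)}
\]

[Backup diagrams: (1) a root state s with value V_π(s) branches with probabilities π(a|s) to actions a_1, a_2, whose values are Q_π(s, a_1), Q_π(s, a_2); (2) a root pair (s, a) with value Q_π(s, a) branches with probabilities p(s'|s, a) to next states s'_1, s'_2, with rewards r(s, a, s'_1), r(s, a, s'_2) and values V_π(s'_1), V_π(s'_2).]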
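
Checking the Bellman equations with backup diagrams (two steps)

\[
\begin{aligned}
V_\pi(s) &= \sum_a \pi(a \mid s)\, Q_\pi(s, a) \quad \rightarrow \text{result (restated)} \\
         &= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V_\pi(s') \bigr)
\end{aligned}
\]

\[
\begin{aligned}
Q_\pi(s, a) &= \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V_\pi(s') \bigr) \quad \rightarrow \text{result (restated)} \\
            &= \sum_{s'} p(s' \mid s, a) \Bigl( r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a') \Bigr)
\end{aligned}
\]

[Backup diagrams: (1) from V_π(s), through π(a|s) to Q_π(s, a), then through p and r(s, a, s') to V_π(s'); (2) from Q_π(s, a), through p and r(s, a, s') to V_π(s'), then through π to Q_π(s', a').]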
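
Optimal value functions

\[
V^*(s) = \max_\pi V_\pi(s) \quad \text{for any } s \in \mathcal{S}
\]

\[
Q^*(s, a) = \max_\pi Q_\pi(s, a) \quad \text{for any } s \in \mathcal{S},\ a \in \mathcal{A}
\]

・At least one policy π satisfying these relations exists (an optimal policy)
・With this π, the return is maximized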
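
On a very small problem, the definition \(V^*(s) = \max_\pi V_\pi(s)\) can be checked by brute force. The sketch below enumerates only deterministic policies, relying on the standard fact that a finite MDP has a deterministic optimal policy. The two-state MDP, \(\gamma = 0.9\), and all function names are hypothetical, chosen for illustration.

```python
from itertools import product

GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics p(s'|s, a) of a two-state MDP.
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}


def r(s, a, s_next):
    """Hypothetical reward function: 1 for landing in state B."""
    return 1.0 if s_next == "B" else 0.0


def evaluate(policy, tol=1e-10):
    """Iteratively solve the Bellman equation for a deterministic policy."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            a = policy[s]  # deterministic: pi(a|s) = 1 for this action
            new_value = sum(
                prob * (r(s, a, s2) + GAMMA * v[s2]) for s2, prob in P[(s, a)].items()
            )
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v


# All deterministic policies: one action per state (4 in total here).
policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
values = [evaluate(policy) for policy in policies]

# V*(s): the best value achievable in each state, taken over policies.
v_star = {s: max(v[s] for v in values) for s in STATES}

# Policies that attain V*(s) in every state simultaneously.
optimal_policies = [
    policy
    for policy, v in zip(policies, values)
    if all(abs(v[s] - v_star[s]) < 1e-6 for s in STATES)
]

print(v_star)            # approximately {'A': 9.0909, 'B': 10.0}
print(optimal_policies)  # [{'A': 'go', 'B': 'stay'}]
```

Enumeration grows exponentially with the number of states, so this is only a way to see the definition at work, not a practical method.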
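
Checking the Bellman optimality equations with backup diagrams

Compared with the equations for \(V_\pi\) and \(Q_\pi\), the average over the policy, \(\sum_a \pi(a \mid s)\), is replaced by \(\max_a\):

\[
V^*(s) = \max_a Q^*(s, a)
\]

\[
Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V^*(s') \bigr)
\]

\[
V^*(s) = \max_a \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma V^*(s') \bigr)
\]

\[
Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \Bigl( r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Bigr)
\]

[Backup diagrams: the same trees as for V_π and Q_π, with the branching over actions marked "max a" instead of π.]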
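
When the dynamics \(p\) and \(r\) are known, the Bellman optimality equation for \(V^*\) can be turned into an update rule and iterated to a fixed point (value iteration, a dynamic programming method). The following is a minimal sketch on a hypothetical two-state MDP with \(\gamma = 0.9\); the MDP and the function names are assumptions made for illustration.

```python
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# Hypothetical dynamics p(s'|s, a) of a two-state MDP.
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"A": 0.5, "B": 0.5},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}


def r(s, a, s_next):
    """Hypothetical reward function: 1 for landing in state B."""
    return 1.0 if s_next == "B" else 0.0


def q_from_v(v, s, a):
    """sum_s' p(s'|s,a) (r(s,a,s') + gamma V(s')): one-step lookahead."""
    return sum(prob * (r(s, a, s2) + GAMMA * v[s2]) for s2, prob in P[(s, a)].items())


def value_iteration(tol=1e-10):
    """Iterate V(s) <- max_a sum_s' p(s'|s,a)(r + gamma V(s')) to convergence."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_value = max(q_from_v(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v


v_star = value_iteration()

# Q*(s, a) follows from V* by one-step lookahead.
q_star = {(s, a): q_from_v(v_star, s, a) for s in STATES for a in ACTIONS}

# A greedy policy with respect to Q* is optimal.
greedy_policy = {s: max(ACTIONS, key=lambda a: q_star[(s, a)]) for s in STATES}

print(v_star)         # approximately {'A': 9.0909, 'B': 10.0}
print(greedy_policy)  # {'A': 'go', 'B': 'stay'}
```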
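
Summary
・What we did
  → Derived the Bellman equations
・When these equations can be set up and solved, the optimal solution is obtained
  → Dynamic Programming (Bellman's principle of optimality)
・In practice, running DP is usually infeasible for a variety of reasons
  → This is where sampling-based methods come in (Monte Carlo methods, TD methods, etc.)
・Even so, many algorithms can be understood as approximate ways of solving the Bellman equations
  → The differences in behavior between MC and TD, or between Sarsa and Q-learning, can be understood from the perspective of the Bellman equations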
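
As one way to read the last point: the standard Sarsa update uses a sampled version of the Bellman equation for \(Q_\pi\) (it bootstraps from the action actually taken next), while the standard Q-learning update uses a sampled version of the Bellman optimality equation for \(Q^*\) (it bootstraps from the maximum over next actions). The sketch below applies both updates to the same transition; the Q-table values, the step size, \(\gamma\), and the function names are made up for illustration.

```python
GAMMA = 0.5  # discount factor (illustrative)
ALPHA = 0.5  # step size (illustrative)
ACTIONS = ["stay", "go"]


def sarsa_update(q, s, a, reward, s_next, a_next):
    """Target samples r + gamma * Q(s', a'), with a' drawn from the policy."""
    target = reward + GAMMA * q[(s_next, a_next)]
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def q_learning_update(q, s, a, reward, s_next):
    """Target samples r + gamma * max_a' Q(s', a'), regardless of the next action."""
    target = reward + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def fresh_q():
    """A made-up Q-table in which 'stay' looks better than 'go' in state B."""
    return {
        ("A", "stay"): 0.0,
        ("A", "go"): 0.0,
        ("B", "stay"): 2.0,
        ("B", "go"): 1.0,
    }


# One observed transition: in A, took 'go', received reward 1, landed in B,
# and the behavior policy then happened to pick the non-greedy action 'go'.
q_sarsa = fresh_q()
sarsa_update(q_sarsa, "A", "go", 1.0, "B", "go")

q_qlearning = fresh_q()
q_learning_update(q_qlearning, "A", "go", 1.0, "B")

print(q_sarsa[("A", "go")])      # 0.75
print(q_qlearning[("A", "go")])  # 1.0
```

The two updates differ only when the next action is not greedy, which is exactly the gap between the \(\sum_{a'} \pi(a' \mid s')\) and \(\max_{a'}\) terms in the two Bellman equations.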

Thank you for your attention.