bellman方程式の導出.pdf
m_nshr
February 16, 2019
Deriving Bellman equations
Transcript
Deriving the Bellman Equation
MathCoding: Let's Talk About Reinforcement Learning
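Contents
- Introduction
- Introducing value functions
- Deriving the Bellman equation from the definition of the value function
- Viewing the Bellman equation through backup diagrams
- Introducing optimal value functions
- Viewing optimal values through backup diagrams
- Summary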
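What is the goal of reinforcement learning?

"That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."
Sutton & Barto, Reward Hypothesis

In short: maximize the cumulative reward.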
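Cumulative reward (return), with discount factor $\gamma \in [0, 1]$:

$$
\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
    &= \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \\
    &= R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+1+k+1} \\
    &= R_{t+1} + \gamma G_{t+1} \qquad \leftarrow \text{focus on this recursion}
\end{aligned}
$$

Trajectory under policy $\pi$: $S_0 \xrightarrow{\pi} A_0 \to R_1 \to S_1 \xrightarrow{\pi} A_1 \to R_2 \to S_2 \xrightarrow{\pi} \cdots$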
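The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ can be checked numerically. The sketch below is only an illustration: the reward list, the value of `gamma`, and both function names are hypothetical, and the episode is assumed to be finite so that the return after the last reward is zero.

```python
def discounted_return(rewards, gamma):
    """Direct sum: G = sum_k gamma^k * rewards[k] for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def returns_by_recursion(rewards, gamma):
    """Backward pass using G_t = R_{t+1} + gamma * G_{t+1}, with G = 0 after the last reward."""
    returns = [0.0] * len(rewards)
    g_next = 0.0
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


rewards = [1.0, 0.0, 2.0, 3.0]  # hypothetical R_1, R_2, R_3, R_4
gamma = 0.9

recursive = returns_by_recursion(rewards, gamma)
direct = [discounted_return(rewards[t:], gamma) for t in range(len(rewards))]
```

Both lists agree up to floating-point error; the backward pass needs one multiplication per step, which is why the recursive form is the one usually implemented.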
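Bellman's principle of optimality

The problem has a recursive structure.

An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Reference: Bellman, Chap. III, Principle of Optimality

So the problem might be solvable with dynamic programming (DP).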
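Notation

- Policy: $\pi(a|s) \doteq \mathbb{P}(A_t = a \mid S_t = s)$, the probability of choosing action $a$ in state $s$.
- State transition probability (the environment's dynamics): $p(s'|s, a) \doteq \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a)$, the probability of moving to next state $s'$ after choosing action $a$ in state $s$.
- Immediate reward (reward function): $r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$, the expected immediate reward when action $a$ is chosen in state $s$ and the next state is $s'$.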
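Value functions

- State value function: $V_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
- State-action value function: $Q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

Relation between the two:

$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
         &= \sum_a \pi(a|s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
         &= \sum_a \pi(a|s)\, Q_\pi(s, a) \qquad \leftarrow \text{result}
\end{aligned}
$$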
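Bellman equation for $V_\pi(s)$

$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
         &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
         &= \sum_a \pi(a|s) \sum_{s'} p(s'|s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
         &= \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
         &= \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \qquad \leftarrow \text{result}
\end{aligned}
$$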
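One way to see the Bellman equation for $V_\pi$ at work is to apply its right-hand side repeatedly as an update until the values stop changing (iterative policy evaluation). This is a minimal sketch, not something shown on the slides: the two-state MDP `P`, the uniform policy `pi`, and the names `bellman_backup` and `evaluate_policy` are all hypothetical.

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (next_state, p(s'|s,a), r(s,a,s')) triples.
P = {
    0: {0: [(0, 0.5, 0.0), (1, 0.5, 1.0)],
        1: [(1, 1.0, 0.5)]},
    1: {0: [(0, 1.0, 2.0)],
        1: [(1, 1.0, 0.0)]},
}
# Hypothetical policy pi[s][a] = pi(a|s): uniform over both actions.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
gamma = 0.9


def bellman_backup(V, s):
    """Right-hand side of the Bellman equation for V_pi at state s."""
    return sum(
        pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
        for a in P[s]
    )


def evaluate_policy(tol=1e-12, max_sweeps=10_000):
    """Apply the backup to every state until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        new_V = {s: bellman_backup(V, s) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V
    return V


V = evaluate_policy()
# At the fixed point, V satisfies the Bellman equation, so this residual is ~0.
residual = max(abs(V[s] - bellman_backup(V, s)) for s in V)
```

Because $\gamma < 1$ here, each sweep shrinks the error by at least a factor of $\gamma$, so the loop converges regardless of the initial values.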
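Bellman equation for $Q_\pi(s, a)$

$$
\begin{aligned}
Q_\pi(s, a) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
            &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
            &= \sum_{s'} p(s'|s, a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
            &= \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
            &= \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \qquad \leftarrow \text{result} \\
            &= \sum_{s'} p(s'|s, a) \Big( r(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q_\pi(s', a') \Big) \qquad \leftarrow \text{result}
\end{aligned}
$$

The last step substitutes the earlier result $V_\pi(s') = \sum_{a'} \pi(a'|s')\, Q_\pi(s', a')$.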
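The two forms of the Bellman equation for $Q_\pi$ can be cross-checked numerically. The following sketch iterates the $Q$-only form to a fixed point, recovers $V_\pi$ through $V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s, a)$, and then measures how well the $V$-based form holds. The MDP `P`, the policy `pi`, and the helper `q_backup` are hypothetical and chosen only for illustration.

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (next_state, p(s'|s,a), r(s,a,s')) triples.
P = {
    0: {0: [(0, 0.5, 0.0), (1, 0.5, 1.0)],
        1: [(1, 1.0, 0.5)]},
    1: {0: [(0, 1.0, 2.0)],
        1: [(1, 1.0, 0.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # hypothetical uniform policy
gamma = 0.9


def q_backup(Q, s, a):
    """Right-hand side of the Bellman equation for Q_pi, written in terms of Q only."""
    return sum(
        p * (r + gamma * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in P[s2]))
        for s2, p, r in P[s][a]
    )


# Iterate the backup; with gamma = 0.9 the error after 2000 sweeps is negligible.
Q = {(s, a): 0.0 for s in P for a in P[s]}
for _ in range(2000):
    Q = {(s, a): q_backup(Q, s, a) for (s, a) in Q}

# Recover V_pi from Q_pi via V_pi(s) = sum_a pi(a|s) Q_pi(s, a).
V = {s: sum(pi[s][a] * Q[(s, a)] for a in P[s]) for s in P}

# Check the V-based form: Q_pi(s, a) = sum_s' p(s'|s,a) (r(s,a,s') + gamma V_pi(s')).
gap = max(
    abs(Q[(s, a)] - sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]))
    for (s, a) in Q
)
```

A `gap` near zero indicates that the two forms describe the same fixed point, which is what the substitution step in the derivation asserts.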
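Backup diagram

・A diagram that represents a sequence of states and actions.
・An open circle (○) is a state; a filled circle (●) is an action (or a state-action pair).
・It is used when computing the value of the root node.
・It shows which elements make up the value of the root node (the topmost node).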
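Confirming the Bellman equation with backup diagrams

$$
V_\pi(s) = \sum_a \pi(a|s)\, Q_\pi(s, a) \qquad \text{(result, restated)}
$$

$$
Q_\pi(s, a) = \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \qquad \text{(result, restated)}
$$

[Backup diagram 1: root state $s$ with value $V_\pi(s)$ branches with probability $\pi(a|s)$ to actions $a_1, a_2$ with values $Q_\pi(s, a_1)$, $Q_\pi(s, a_2)$.]
[Backup diagram 2: root pair $(s, a)$ with value $Q_\pi(s, a)$ branches with probability $p(s'|s, a)$ to next states $s'_1, s'_2$, with rewards $r(s, a, s'_1)$, $r(s, a, s'_2)$ and values $V_\pi(s'_1)$, $V_\pi(s'_2)$.]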
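Confirming the Bellman equation with backup diagrams (two-step view)

$$
\begin{aligned}
V_\pi(s) &= \sum_a \pi(a|s)\, Q_\pi(s, a) \\
         &= \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \qquad \text{(result, restated)}
\end{aligned}
$$

$$
\begin{aligned}
Q_\pi(s, a) &= \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V_\pi(s') \right) \\
            &= \sum_{s'} p(s'|s, a) \Big( r(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q_\pi(s', a') \Big) \qquad \text{(result, restated)}
\end{aligned}
$$

[Backup diagram 1: $V_\pi(s)$ at the root, then $\pi(a|s)$ down to $Q_\pi(s, a)$, then $p$ and $r(s, a, s')$ down to $V_\pi(s')$.]
[Backup diagram 2: $Q_\pi(s, a)$ at the root, then $p$ and $r(s, a, s')$ down to $V_\pi(s')$, then $\pi$ down to $Q_\pi(s', a')$.]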
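Optimal value functions

$$
V^*(s) = \max_\pi V_\pi(s) \quad \text{for any } s \in \mathcal{S}
$$

$$
Q^*(s, a) = \max_\pi Q_\pi(s, a) \quad \text{for any } s \in \mathcal{S},\ a \in \mathcal{A}
$$

・At least one policy $\pi$ satisfying these relations exists (an optimal policy).
・Following this $\pi$ achieves the maximum return.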
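Confirming the Bellman optimality equations with backup diagrams

Compared with the equations for $V_\pi$ and $Q_\pi$, the policy-weighted sum $\sum_a \pi(a|s)$ is replaced by $\max_a$:

$$
\begin{aligned}
V^*(s) &= \max_a Q^*(s, a) \\
Q^*(s, a) &= \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V^*(s') \right) \\
V^*(s) &= \max_a \sum_{s'} p(s'|s, a) \left( r(s, a, s') + \gamma V^*(s') \right) \\
Q^*(s, a) &= \sum_{s'} p(s'|s, a) \Big( r(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Big)
\end{aligned}
$$

[Backup diagrams: the same trees as for $V_\pi$ and $Q_\pi$, with each action branch marked $\max_a$ instead of $\pi$.]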
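When the dynamics are known, the Bellman optimality equation for $V^*$ can be used directly as an update rule (value iteration). The sketch below assumes a small hypothetical MDP `P`; the names `q_from_v`, `V_star`, `Q_star`, and `greedy` are illustrative and do not come from the slides.

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (next_state, p(s'|s,a), r(s,a,s')) triples.
P = {
    0: {0: [(0, 0.5, 0.0), (1, 0.5, 1.0)],
        1: [(1, 1.0, 0.5)]},
    1: {0: [(0, 1.0, 2.0)],
        1: [(1, 1.0, 0.0)]},
}
gamma = 0.9


def q_from_v(V, s, a):
    """sum_s' p(s'|s,a) * (r(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])


# Value iteration: V(s) <- max_a sum_s' p(s'|s,a) (r(s,a,s') + gamma V(s')).
V_star = {s: 0.0 for s in P}
for _ in range(2000):
    V_star = {s: max(q_from_v(V_star, s, a) for a in P[s]) for s in P}

# Q* follows from V*, and a greedy policy picks argmax_a Q*(s, a) in each state.
Q_star = {(s, a): q_from_v(V_star, s, a) for s in P for a in P[s]}
greedy = {s: max(P[s], key=lambda a: Q_star[(s, a)]) for s in P}
```

In this toy MDP the greedy policy alternates between the two states to collect the larger rewards; the only change from policy evaluation is the `max` over actions in place of the average under $\pi$.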
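Summary

・What we did
  → Derived the Bellman equations.
・In cases where these equations can be solved directly, the optimal solution is obtained.
  → Dynamic programming (Bellman's principle of optimality)
・Usually, for various reasons, running DP is not feasible.
  → This is where sampling-based methods come in (Monte Carlo methods, TD methods, and so on).
・Even so, many algorithms can be understood as approximate ways of solving the Bellman equations.
  → Differences in behavior, such as MC vs. TD or Sarsa vs. Q-learning, can be understood from the viewpoint of the Bellman equations.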
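As one illustration of the last point, the update targets of Sarsa and Q-learning can be read as sampled versions of the two Bellman equations for $Q$: Sarsa samples the next action (the equation for $Q_\pi$), while Q-learning takes a maximum over next actions (the optimality equation for $Q^*$). The sketch below shows only the targets, not full learning loops; the Q-table values and the sampled transition are hypothetical.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """Sampled Bellman equation for Q_pi: uses the action actually taken next."""
    return reward + gamma * Q[(next_state, next_action)]


def q_learning_target(Q, reward, next_state, actions, gamma):
    """Sampled Bellman optimality equation: uses the best next action under Q."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)


# Hypothetical Q-table entries for next state 1, and one sampled transition.
Q = {(1, 0): 1.0, (1, 1): 3.0}
gamma = 0.9
reward, next_state, next_action = 0.5, 1, 0

target_sarsa = sarsa_target(Q, reward, next_state, next_action, gamma)
target_q = q_learning_target(Q, reward, next_state, [0, 1], gamma)
```

The two targets differ whenever the sampled next action is not the greedy one, which is the source of the behavioral difference between the two algorithms.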
Thank you for your attention.