bellman方程式の導出.pdf
m_nshr
February 16, 2019
Deriving Bellman equations
Transcript
Deriving the Bellman Equations
MathCoding: Let's Talk Reinforcement Learning
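Contents
・Introduction
・Introducing value functions
・Deriving the Bellman equations from the definition of the value functions
・Viewing the Bellman equations through backup diagrams
・Introducing optimal value functions
・Viewing the optimal value functions through backup diagrams
・Summary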
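What is the goal of reinforcement learning?
"That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." (Sutton & Barto, Reward Hypothesis)
→ Maximizing the cumulative reward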
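Cumulative reward (return)
Trajectory: $S_0 \xrightarrow{\pi} A_0 \to R_1 \to S_1 \xrightarrow{\pi} A_1 \to R_2 \to S_2 \xrightarrow{\pi} \cdots$
$$
\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
&= \sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \\
&= R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+1+k+1} \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
where $\gamma \in [0,1]$ is the discount rate.
← Focus on the last line: the return at time $t$ is expressed through the return at time $t+1$.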
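The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ can be checked numerically. The following is a minimal sketch, assuming a finite episode (so the return after the last reward is 0); the reward list and the function names are hypothetical and not part of the original slides.

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_0 = sum over k of gamma^k * R_{k+1} for a finite reward list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def returns_by_recursion(rewards, gamma):
    """Backward pass using G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    returns = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        # rewards[t] plays the role of R_{t+1}
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns[:-1]


rewards = [1.0, 0.0, 2.0, 3.0]  # hypothetical R_1, ..., R_4
gamma = 0.5
g = returns_by_recursion(rewards, gamma)
print(g)                                  # [1.875, 1.75, 3.5, 3.0]
print(discounted_return(rewards, gamma))  # 1.875
```

Both routes give the same $G_0$, which is the property the later derivations rely on.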
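Bellman's principle of optimality
The return has a recursive structure.
An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Reference: Bellman, Chap. III, "Principle of Optimality"
→ The problem may be solvable with dynamic programming (DP).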
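Notation
Policy:
$$\pi(a|s) \doteq \mathbb{P}(A_t = a \mid S_t = s)$$
→ the probability of selecting action $a$ in state $s$
Environment's dynamics
State-transition probability:
$$p(s'|s,a) \doteq \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a)$$
→ the probability of moving to next state $s'$ after selecting action $a$ in state $s$
Immediate reward (reward function):
$$r(s,a,s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$$
→ the expected immediate reward when action $a$ is selected in state $s$ and the next state is $s'$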
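Value functions
State-value function:
$$V_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$$
State-action value function:
$$Q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
Relation between the two:
$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \sum_a \pi(a|s)\, \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \sum_a \pi(a|s)\, Q_\pi(s,a) \quad \leftarrow \text{result}
\end{aligned}
$$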
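As a small numeric illustration of $V_\pi(s) = \sum_a \pi(a|s) Q_\pi(s,a)$, the sketch below averages action values under a policy for a single state. The action names, probabilities, and action values are made-up numbers chosen only for this example.

```python
def state_value(policy_probs, action_values):
    """V_pi(s) as the policy-weighted average of Q_pi(s, a) over actions."""
    return sum(policy_probs[a] * action_values[a] for a in policy_probs)


# Hypothetical numbers for one state s with two actions.
pi_s = {"a1": 0.25, "a2": 0.75}  # pi(a|s)
q_s = {"a1": 4.0, "a2": 8.0}     # Q_pi(s, a)

v_s = state_value(pi_s, q_s)
print(v_s)  # 7.0
```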
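Bellman equation for $V_\pi(s)$
$$
\begin{aligned}
V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
&= \sum_a \pi(a|s) \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
&= \sum_a \pi(a|s) \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V_\pi(s') \right) \quad \leftarrow \text{result}
\end{aligned}
$$
The last step uses the Markov property: given $S_{t+1} = s'$, the return $G_{t+1}$ does not depend on $S_t$ or $A_t$, so $\mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = V_\pi(s')$.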
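One way to see this equation at work is to apply its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a hypothetical two-state, two-action MDP; the transition table `P`, the policy `PI`, and the discount `GAMMA` are invented for illustration and do not come from the slides.

```python
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical dynamics: P[(s, a)] is a list of
# (next_state, p(s'|s,a), r(s,a,s')) triples.
P = {
    (0, 0): [(0, 0.8, 0.0), (1, 0.2, 1.0)],
    (0, 1): [(1, 1.0, 0.5)],
    (1, 0): [(0, 1.0, 2.0)],
    (1, 1): [(1, 0.6, 0.0), (0, 0.4, 1.0)],
}
# Hypothetical stochastic policy: PI[s][a] = pi(a|s).
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for V_pi at state s."""
    return sum(
        PI[s][a] * sum(prob * (r + GAMMA * v[s2]) for s2, prob, r in P[(s, a)])
        for a in ACTIONS
    )


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Iterate the backup from V = 0 until the largest change is below tol."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: bellman_backup(v, s) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()
# At convergence, V_pi is (numerically) a fixed point of the backup.
residual = max(abs(bellman_backup(v_pi, s) - v_pi[s]) for s in STATES)
print(v_pi, residual)
```

The iteration converges here because the backup is a contraction when `GAMMA` is below 1; with `GAMMA = 1` this simple loop is not guaranteed to settle.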
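Bellman equation for $Q_\pi(s,a)$
$$
\begin{aligned}
Q_\pi(s,a) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s'} p(s'|s,a)\, \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \\
&= \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \right) \\
&= \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V_\pi(s') \right) \quad \leftarrow \text{result} \\
&= \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q_\pi(s',a') \right) \quad \leftarrow \text{result}
\end{aligned}
$$
↓ The last line substitutes the relation $V_\pi(s') = \sum_{a'} \pi(a'|s')\, Q_\pi(s',a')$.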
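The two forms of the equation for $Q_\pi$ can be cross-checked numerically. The sketch below, again on a hypothetical two-state MDP (the tables `P` and `PI` are assumptions made for this example), iterates the $Q$-only form and then verifies that the result also satisfies the form written with $V_\pi(s')$.

```python
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical dynamics: (next_state, p(s'|s,a), r(s,a,s')) triples.
P = {
    (0, 0): [(0, 0.8, 0.0), (1, 0.2, 1.0)],
    (0, 1): [(1, 1.0, 0.5)],
    (1, 0): [(0, 1.0, 2.0)],
    (1, 1): [(1, 0.6, 0.0), (0, 0.4, 1.0)],
}
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}  # hypothetical pi(a|s)


def q_backup(q, s, a):
    """Right-hand side of the Bellman equation for Q_pi, written with Q only."""
    return sum(
        prob * (r + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in ACTIONS))
        for s2, prob, r in P[(s, a)]
    )


def evaluate_q(tol=1e-12, max_iters=10_000):
    """Iterate the Q backup from Q = 0 until the largest change is below tol."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_iters):
        new_q = {key: q_backup(q, *key) for key in q}
        if max(abs(new_q[key] - q[key]) for key in q) < tol:
            return new_q
        q = new_q
    return q


q_pi = evaluate_q()
# V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
v_pi = {s: sum(PI[s][a] * q_pi[(s, a)] for a in ACTIONS) for s in STATES}
# Check the other form: Q_pi(s, a) = sum_s' p (r + gamma * V_pi(s'))
gap = max(
    abs(q_pi[(s, a)] - sum(prob * (r + GAMMA * v_pi[s2]) for s2, prob, r in P[(s, a)]))
    for s in STATES
    for a in ACTIONS
)
print(q_pi, gap)
```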
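Backup diagram
・Depicts a sequence of states and actions as a diagram
・○ denotes a state; ● denotes an action (or a state-action pair)
・Used when computing the value of the root node
・Expresses which elements make up the value of the root node (the topmost node)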
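Checking the Bellman equations with a backup diagram
$$V_\pi(s) = \sum_a \pi(a|s)\, Q_\pi(s,a) \quad \text{(result, restated)}$$
$$Q_\pi(s,a) = \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V_\pi(s') \right) \quad \text{(result, restated)}$$
[Figure: two one-step backup diagrams. Left: root state $s$ with value $V_\pi(s)$ branches with probability $\pi(a|s)$ to actions $a_1, a_2$ with values $Q_\pi(s,a_1), Q_\pi(s,a_2)$. Right: root state-action pair $(s,a)$ with value $Q_\pi(s,a)$ branches with probability $p(s'|s,a)$ to next states $s'_1, s'_2$, with rewards $r(s,a,s'_1), r(s,a,s'_2)$ and values $V_\pi(s'_1), V_\pi(s'_2)$.]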
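Checking the Bellman equations with a backup diagram
$$
\begin{aligned}
V_\pi(s) &= \sum_a \pi(a|s)\, Q_\pi(s,a) \\
&= \sum_a \pi(a|s) \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V_\pi(s') \right) \quad \text{(result, restated)}
\end{aligned}
$$
$$
\begin{aligned}
Q_\pi(s,a) &= \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V_\pi(s') \right) \\
&= \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q_\pi(s',a') \right) \quad \text{(result, restated)}
\end{aligned}
$$
↓ The second line of each equation substitutes the other relation.
[Figure: two two-step backup diagrams. Left: $V_\pi(s)$ → (via $\pi(a|s)$) $Q_\pi(s,a)$ → (via $p$, with reward $r(s,a,s')$) $V_\pi(s')$. Right: $Q_\pi(s,a)$ → (via $p$, with reward $r(s,a,s')$) $V_\pi(s')$ → (via $\pi$) $Q_\pi(s',a')$.]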
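Optimal value functions
$$V^*(s) = \max_\pi V_\pi(s) \quad \text{for any } s \in \mathcal{S}$$
$$Q^*(s,a) = \max_\pi Q_\pi(s,a) \quad \text{for any } s \in \mathcal{S},\ a \in \mathcal{A}$$
・At least one policy $\pi$ satisfying these relations exists (an optimal policy)
・With this $\pi$, the return is maximized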
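Checking the Bellman optimality equations with a backup diagram
$$V^*(s) = \max_a Q^*(s,a)$$
(the policy average $\sum_a \pi(a|s)$ is replaced by $\max_a$)
$$Q^*(s,a) = \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V^*(s') \right)$$
$$V^*(s) = \max_a \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma V^*(s') \right)$$
$$Q^*(s,a) = \sum_{s'} p(s'|s,a) \left( r(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right)$$
[Figure: the same backup diagrams as for $V_\pi$ and $Q_\pi$, with each branching over actions marked $\max_a$ instead of being weighted by $\pi$.]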
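The optimality equation for $V^*$ can be turned into an update rule by applying its right-hand side repeatedly (value iteration, one form of dynamic programming). The sketch below assumes a hypothetical deterministic two-state MDP chosen so the answer can be verified by hand; the table `P` and `GAMMA` are not from the slides.

```python
GAMMA = 0.5
STATES = (0, 1)
ACTIONS = (0, 1)

# Hypothetical deterministic dynamics: (next_state, probability, reward).
P = {
    (0, 0): [(0, 1.0, 0.0)],  # stay in state 0, reward 0
    (0, 1): [(1, 1.0, 1.0)],  # move to state 1, reward 1
    (1, 0): [(1, 1.0, 2.0)],  # stay in state 1, reward 2
    (1, 1): [(0, 1.0, 0.0)],  # move to state 0, reward 0
}


def q_from_v(v, s, a):
    """sum_s' p(s'|s,a) * (r(s,a,s') + gamma * v(s'))."""
    return sum(prob * (r + GAMMA * v[s2]) for s2, prob, r in P[(s, a)])


def value_iteration(tol=1e-12, max_iters=1_000):
    """Iterate V(s) <- max_a sum_s' p (r + gamma V(s')) until it stops changing."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: max(q_from_v(v, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v
    return v


v_star = value_iteration()
q_star = {(s, a): q_from_v(v_star, s, a) for s in STATES for a in ACTIONS}
# A greedy policy with respect to Q* picks argmax_a Q*(s, a) in each state.
greedy = {s: max(ACTIONS, key=lambda a: q_star[(s, a)]) for s in STATES}
print(v_star, greedy)  # V* is approximately {0: 3.0, 1: 4.0}; greedy is {0: 1, 1: 0}
```

By hand: staying in state 1 forever yields $2 / (1 - 0.5) = 4$, and moving from state 0 to state 1 yields $1 + 0.5 \cdot 4 = 3$, which matches the computed values.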
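Summary
・What we did
 → Derived the Bellman equations
・In cases where these equations can be solved directly, the optimal solution is obtained
 → Dynamic Programming (Bellman's principle of optimality)
・In practice, running DP is usually infeasible for various reasons
 → This is where sampling-based methods come in (Monte Carlo methods, TD methods, etc.)
・However, many algorithms can be understood as approximate methods for solving the Bellman equations
 → Differences in behavior, such as MC vs. TD or Sarsa vs. Q-learning, can be understood from the viewpoint of the Bellman equations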
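To make the last point concrete, the sketch below contrasts the one-step targets of Sarsa and Q-learning in their standard textbook form: Sarsa samples the Bellman equation for $Q_\pi$ (it uses the action actually taken next), while Q-learning samples the Bellman optimality equation for $Q^*$ (it takes the max over next actions). The table `q`, the transition, and the step size `alpha` are hypothetical values used only for illustration.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """Sample of r + gamma * Q(s', a'), with a' the action actually taken."""
    return reward + gamma * q[(next_state, next_action)]


def q_learning_target(q, reward, next_state, actions, gamma):
    """Sample of r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def td_update(q, state, action, target, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])


# Hypothetical action-value table and one observed transition.
actions = (0, 1)
q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 3.0}
reward, next_state, next_action, gamma = 1.0, 1, 0, 0.5

t_sarsa = sarsa_target(q, reward, next_state, next_action, gamma)  # 1 + 0.5 * 1
t_qlearn = q_learning_target(q, reward, next_state, actions, gamma)  # 1 + 0.5 * 3
td_update(q, 0, 1, t_qlearn, alpha=0.5)
print(t_sarsa, t_qlearn, q[(0, 1)])  # 1.5 2.5 1.25
```

The two targets differ exactly when the next action taken is not the greedy one, which is where the behavior of the two algorithms diverges.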
Thank you for your attention.