Slide 1

Slide 1 text

Sutton 2nd Edition Reading Group
Chapter: Policy Gradient Methods
Kosuke Miyoshi

Slide 2

Slide 2 text

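Policy Gradient Methods
• Previous chapters
  • Learn action values, then select actions based on those values (action-value methods)
• This chapter
  • Represent the policy directly with parameters (parameterized policy)
  • No value function is used for action selection (one may still be used to learn the policy itself)
  • Perform gradient ascent on the parameters that represent the policy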

Slide 3

Slide 3 text

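Parameterized policy

$$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$$

A stochastic policy is represented by a $d'$-dimensional parameter vector $\theta$.

To judge how good a policy $\pi$ is, introduce a scalar performance measure $J(\theta)$. $J$ is the quantity to maximize.

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$$

$\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of $J$ with respect to $\theta$. Like $\theta$, it is $d'$-dimensional.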

Slide 4

Slide 4 text

Performance Measure J
The criterion for judging how good a policy $\pi$ is.
• Episodic case
  • $J$ is the state value of the start state $S_0$ when following $\pi$
• Continuing case
  • $J$ is the average reward rate

Slide 5

Slide 5 text

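Policy Approximation and its Advantages
• Drawbacks of action-value functions
  • They offer no natural way to represent a stochastic policy
  • In imperfect-information games such as card games, the optimal policy is often stochastic (e.g. bluffing in poker)
• Strengths of a parameterized policy
  • It can represent stochastic policies
  • It can handle continuous actions
  • The policy changes smoothly
  • It is effective for problems where representing the policy directly is simpler than approximating it through a value function (e.g. Tetris)
  • Prior knowledge about the policy is easy to build in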

Slide 6

Slide 6 text

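Softmax in action preferences

$$\pi(a \mid s, \theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$

When the preferences are linear in the parameters $\theta$:

$$h(s, a, \theta) = \theta^\top x(s, a)$$

where $x(s, a)$ is the feature vector.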
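The softmax policy with linear preferences can be sketched in a few lines. This is only an illustration: the function name `softmax_policy`, the one-hot features, and the parameter values are assumptions made for the example, not part of the slides.

```python
import math

def softmax_policy(theta, features):
    """Return pi(.|s, theta) for linear preferences h(s, a, theta) = theta^T x(s, a).

    `features` holds one feature vector x(s, a) per action (hypothetical layout).
    """
    prefs = [sum(t * x for t, x in zip(theta, x_sa)) for x_sa in features]
    top = max(prefs)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(h - top) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: two actions with one-hot features.
theta = [0.0, math.log(3.0)]
features = [[1.0, 0.0], [0.0, 1.0]]
probs = softmax_policy(theta, features)
```

With these values the second action has preference ln 3 and the first has preference 0, so the second action is three times as likely as the first.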

Slide 7

Slide 7 text

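Policy Gradient Theorem

Performance measure in the episodic case:

$$J(\theta) \doteq v_{\pi_\theta}(s_0)$$

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta)$$

where $\mu(s)$ is the probability of being in state $s$ when following $\pi$.

The performance measure depends on
• action selection
• the distribution $\mu(s)$ of visited states
and both are affected by the policy parameters $\theta$. Nevertheless, the gradient $\nabla J$ of the performance measure with respect to $\theta$ is written in a form that contains no derivative of the state distribution.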

Slide 8

Slide 8 text

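REINFORCE: Monte Carlo Policy Gradient

Policy gradient theorem:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta) = \mathbb{E}_\pi\left[\sum_a q_\pi(S_t, a) \nabla \pi(a \mid S_t, \theta)\right]$$

Here the update still takes the form of a sum over all actions:

$$\theta_{t+1} \doteq \theta_t + \alpha \sum_a \hat{q}(S_t, a, w) \nabla \pi(a \mid S_t, \theta)$$

Next, consider replacing this sum with Monte Carlo sampling over actions.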

Slide 9

Slide 9 text

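REINFORCE: Monte Carlo Policy Gradient

$$
\begin{aligned}
\nabla J(\theta) &\propto \mathbb{E}_\pi\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\, \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right] \\
&= \mathbb{E}_\pi\left[q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] && \text{(replace the sum over all actions } a \text{ with the sampled action } A_t\text{)} \\
&= \mathbb{E}_\pi\left[G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right] && \text{(replace } q_\pi(S_t, A_t) \text{ with the return } G_t\text{)}
\end{aligned}
$$

Update rule for $\theta$, using $\nabla \ln x = \nabla x / x$:

$$\theta_{t+1} \doteq \theta_t + \alpha G_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$$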
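As a rough sketch of the update $\theta_{t+1} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$, the code below assumes a softmax policy with linear preferences, for which $\nabla \ln \pi(a \mid s, \theta) = x(s, a) - \sum_b \pi(b \mid s, \theta)\, x(s, b)$. The function names, the episode layout, and the undiscounted return are assumptions made for this illustration.

```python
import math

def softmax_policy(theta, features):
    prefs = [sum(t * x for t, x in zip(theta, x_sa)) for x_sa in features]
    top = max(prefs)
    exps = [math.exp(h - top) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, features, action):
    """grad ln pi(a|s, theta) = x(s, a) - sum_b pi(b|s, theta) x(s, b)."""
    probs = softmax_policy(theta, features)
    dim = len(theta)
    expected_x = [sum(p * x_sa[i] for p, x_sa in zip(probs, features))
                  for i in range(dim)]
    return [features[action][i] - expected_x[i] for i in range(dim)]

def reinforce_update(theta, episode, alpha):
    """Apply one REINFORCE pass over a finished episode (undiscounted).

    `episode` is a list of (features, action, reward) tuples, where
    `reward` is R_{t+1} (hypothetical layout chosen for this sketch).
    """
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g += reward            # G_t = R_{t+1} + R_{t+2} + ...
        returns.append(g)
    returns.reverse()
    for (features, action, _), g_t in zip(episode, returns):
        grad = grad_log_pi(theta, features, action)
        theta = [t + alpha * g_t * d for t, d in zip(theta, grad)]
    return theta

# Hypothetical one-step episode: two actions, one-hot features, reward 1.
one_hot = [[1.0, 0.0], [0.0, 1.0]]
new_theta = reinforce_update([0.0, 0.0], [(one_hot, 1, 1.0)], alpha=0.1)
```

In this toy episode the chosen action received a positive return, so its preference goes up and the other action's preference goes down by the same amount.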

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

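Short corridor
• Start at S; the episode terminates on reaching G
• Under state approximation, all states look identical
• In one state the actions are reversed: the right action moves left and the left action moves right. Everywhere else, right moves right and left moves left
• A reward is given on every step taken
• The optimal policy is stochastic: it selects right with a fixed probability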
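To see why a stochastic policy is optimal here, the start-state value can be evaluated for each probability of choosing right. The numeric details in this sketch are assumptions taken from the short-corridor example in Sutton and Barto rather than from the slide text, where the digits are not legible: three non-terminal states, actions reversed in the second state, moving left in the first state leaves the agent in place, and a reward of -1 per step.

```python
def start_value(p_right, sweeps=5000):
    """Iterative policy evaluation of v(S) when right is chosen with probability p_right.

    Assumed layout (states 0, 1, 2, then the terminal goal):
      state 0: right -> 1, left -> 0 (stays put)
      state 1: right -> 0, left -> 2 (actions reversed)
      state 2: right -> goal, left -> 1
    Every step yields reward -1 (assumption).
    """
    v = [0.0, 0.0, 0.0]
    for _ in range(sweeps):
        v = [
            -1 + p_right * v[1] + (1 - p_right) * v[0],
            -1 + p_right * v[0] + (1 - p_right) * v[2],
            -1 + (1 - p_right) * v[1],
        ]
    return v[0]

# Search a coarse grid of probabilities for the best start-state value.
grid = [i / 100 for i in range(5, 96)]
best_p = max(grid, key=start_value)
```

Under these assumptions the value has the closed form $(2p - 4) / (p(1 - p))$, which is maximized at $p = 2 - \sqrt{2} \approx 0.586$, so neither always-right nor always-left is optimal.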

Slide 12

Slide 12 text

REINFORCE on the short-corridor gridworld

Slide 13

Slide 13 text

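REINFORCE with Baseline

Introduce a baseline to reduce the variance of the estimate of the gradient $\nabla J$:

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big) \nabla \pi(a \mid s, \theta)$$

If the baseline is chosen so that it does not depend on $a$ (and is unrelated to $\theta$), the sum below is 0, so the baseline does not affect the expectation of the gradient estimate:

$$\sum_a b(s) \nabla \pi(a \mid s, \theta) = b(s) \nabla \sum_a \pi(a \mid s, \theta) = b(s) \nabla 1 = 0$$

Update rule for $\theta$:

$$\theta_{t+1} \doteq \theta_t + \alpha \big(G_t - b(S_t)\big) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$

The baseline should depend on the state. For example, when all action values in a state are high, a high baseline makes it easier to tell the actions apart. A natural choice of baseline is the state value $V$.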
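The effect of the baseline can be checked exactly on a tiny example. The sketch below assumes a single state, a softmax policy with one preference per action (so $\nabla \ln \pi(a) = \mathbf{1}_a - \pi$), and made-up action values; it enumerates the actions to compute the mean and total variance of the estimator $(q(A) - b) \nabla \ln \pi(A)$.

```python
def gradient_stats(probs, q, baseline):
    """Exact mean and total variance of (q(A) - b) * grad ln pi(A), with A ~ probs.

    Assumes a softmax over per-action preferences, so that
    grad ln pi(a) = one_hot(a) - probs.
    """
    n = len(probs)
    samples = []
    for a in range(n):
        grad_log = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        samples.append([(q[a] - baseline) * g for g in grad_log])
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = sum(probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
              for a in range(n))
    return mean, var

# Hypothetical numbers: both actions have high values, as in the slide's example.
probs = [0.5, 0.5]
q = [10.0, 11.0]
v = sum(p * qa for p, qa in zip(probs, q))   # state value used as the baseline

mean_plain, var_plain = gradient_stats(probs, q, baseline=0.0)
mean_base, var_base = gradient_stats(probs, q, baseline=v)
```

Both estimators have the same mean, while the variance is much smaller with the state-value baseline, which is the property the slide relies on.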

Slide 14

Slide 14 text

Introduce a parameter vector $w$ to approximate $V$, learn it with a Monte Carlo method as well, and use the result as the baseline.

Slide 15

Slide 15 text

Comparison of REINFORCE with and without a baseline

Slide 16

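Slide 16 text

Actor-Critic Methods

REINFORCE with $V$ as a baseline does not bootstrap. Introducing bias through bootstrapping reduces variance and speeds up learning.

Replace the sum of rewards up to the end of the episode with a one-step bootstrapped return:

$$
\begin{aligned}
\theta_{t+1} &\doteq \theta_t + \alpha \big(G_{t:t+1} - \hat{v}(S_t, w)\big) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \\
&= \theta_t + \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\big) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \\
&= \theta_t + \alpha \delta_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}
\end{aligned}
$$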

Slide 17

Slide 17 text

No content
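A minimal sketch of one step of the one-step actor-critic update from Slide 16, $\theta_{t+1} = \theta_t + \alpha \delta_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$. Several pieces are assumptions made for illustration: the softmax policy with linear preferences, the linear critic $\hat{v}(s, w) = w^\top x(s)$, the semi-gradient TD(0) critic update (the slide shows only the actor update), and all function and argument names.

```python
import math

def softmax_policy(theta, features):
    prefs = [sum(t * x for t, x in zip(theta, x_sa)) for x_sa in features]
    top = max(prefs)
    exps = [math.exp(h - top) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, features, action):
    probs = softmax_policy(theta, features)
    dim = len(theta)
    expected_x = [sum(p * x_sa[i] for p, x_sa in zip(probs, features))
                  for i in range(dim)]
    return [features[action][i] - expected_x[i] for i in range(dim)]

def v_hat(w, x):
    """Linear state-value estimate (assumed form of the critic)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def actor_critic_step(theta, w, x_s, x_next, features, action, reward, done,
                      alpha_theta, alpha_w, gamma=1.0):
    """One actor-critic step; returns (new_theta, new_w, delta)."""
    v_next = 0.0 if done else v_hat(w, x_next)       # terminal value is 0
    delta = reward + gamma * v_next - v_hat(w, x_s)  # one-step TD error
    grad = grad_log_pi(theta, features, action)
    new_theta = [t + alpha_theta * delta * g for t, g in zip(theta, grad)]
    new_w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x_s)]
    return new_theta, new_w, delta

# Hypothetical single transition with two actions and a one-feature critic.
one_hot = [[1.0, 0.0], [0.0, 1.0]]
theta, w, delta = actor_critic_step(
    theta=[0.0, 0.0], w=[0.0], x_s=[1.0], x_next=[1.0],
    features=one_hot, action=0, reward=1.0, done=False,
    alpha_theta=0.1, alpha_w=0.1)
```

Unlike the Monte Carlo version, this update can be applied after every transition, because the TD error needs only the next reward and the critic's estimate for the next state.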

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

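Continuing Problems

In the continuing case, the performance measure is the average reward per step when following $\pi$:

$$
\begin{aligned}
J(\theta) \doteq r(\pi) &\doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&= \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] \\
&= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\end{aligned}
$$

If the return and the value functions are defined as

$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots$$

$$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \qquad q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

then the same policy gradient theorem as in the episodic case can be derived:

$$\nabla J(\theta) = \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta)$$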

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

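Continuous Actions

Consider continuous actions, with the policy given by a Gaussian density:

$$\pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)$$

$$\mu(s, \theta) \doteq \theta_\mu^\top x_\mu(s), \qquad \sigma(s, \theta) \doteq \exp\left(\theta_\sigma^\top x_\sigma(s)\right), \qquad \theta = [\theta_\mu, \theta_\sigma]^\top$$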
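A small sketch of this Gaussian policy, together with the gradients of $\ln \pi$ that a REINFORCE-style update would need. Differentiating the density above gives $\nabla_{\theta_\mu} \ln \pi = \frac{a - \mu}{\sigma^2} x_\mu(s)$ and $\nabla_{\theta_\sigma} \ln \pi = \left(\frac{(a - \mu)^2}{\sigma^2} - 1\right) x_\sigma(s)$. The function names and the example feature vectors are assumptions made for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma):
    """mu(s) = theta_mu^T x_mu(s), sigma(s) = exp(theta_sigma^T x_sigma(s))."""
    return dot(theta_mu, x_mu), math.exp(dot(theta_sigma, x_sigma))

def sample_action(theta_mu, theta_sigma, x_mu, x_sigma, rng=random):
    mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
    return rng.gauss(mu, sigma)

def log_pi(a, theta_mu, theta_sigma, x_mu, x_sigma):
    mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
    return (-math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - (a - mu) ** 2 / (2 * sigma ** 2))

def grad_log_pi(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Gradients of ln pi with respect to theta_mu and theta_sigma."""
    mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
    g_mu = [(a - mu) / sigma ** 2 * x for x in x_mu]
    g_sigma = [((a - mu) ** 2 / sigma ** 2 - 1.0) * x for x in x_sigma]
    return g_mu, g_sigma

# Hypothetical state features and parameters.
x_mu, x_sigma = [1.0, 0.5], [1.0]
theta_mu, theta_sigma = [0.2, -0.4], [0.1]
action = sample_action(theta_mu, theta_sigma, x_mu, x_sigma)
g_mu, g_sigma = grad_log_pi(action, theta_mu, theta_sigma, x_mu, x_sigma)
```

Parameterizing $\sigma$ through an exponential keeps it positive for any value of $\theta_\sigma$, which is why the slide defines it that way.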

Slide 22

Slide 22 text

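Summary
• Policy gradient methods learn a parameterized policy directly, without relying on action-value estimates to select actions
• Advantages of a parameterized policy
  • It can represent stochastic policies, and it can move asymptotically between exploratory and deterministic behavior
• Policy gradient theorem
  • Describes how the performance measure is affected by the policy parameters in a form that contains no derivative of the state distribution
• REINFORCE
  • Monte Carlo
  • Introducing the state value as a baseline lowers variance without adding bias
• Actor-Critic
  • Uses the state value for bootstrapping (TD learning), giving lower variance than MC
  • The state value assigns credit to (criticizes) the policy's action selections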

Slide 23

Slide 23 text

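Proof of the Policy Gradient Theorem (episodic case)

$\Pr(s \to x, k, \pi)$: the probability of being in state $x$ $k$ steps after state $s$ when following $\pi$.

$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla \left[\sum_a \pi(a \mid s)\, q_\pi(s, a)\right] \\
&= \sum_a \big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \nabla q_\pi(s, a)\big] && \text{(product rule)} \\
&= \sum_a \Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \nabla \sum_{s', r} p(s', r \mid s, a)\big(r + v_\pi(s')\big)\Big] && \text{(expand } q_\pi\text{)} \\
&= \sum_a \Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s')\Big] && \text{(} r \text{ and } p(s' \mid s, a) \text{ do not depend on } \theta\text{)} \\
&= \sum_a \Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \sum_{a'} \big[\nabla \pi(a' \mid s')\, q_\pi(s', a') + \pi(a' \mid s') \sum_{s''} p(s'' \mid s', a') \nabla v_\pi(s'')\big]\Big] && \text{(unroll recursively in } s'\text{)} \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a)
\end{aligned}
$$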

Slide 24

Slide 24 text

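To obtain the performance measure, substitute $s_0$ for $s$:

$$
\begin{aligned}
\nabla J(\theta) &= \nabla v_\pi(s_0) \\
&= \sum_s \left(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\right) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s'} \eta(s') \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)
\end{aligned}
$$

$\eta(s)$: the number of times $s$ is visited. $\mu(s)$: the probability of being in $s$.

End of proof.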
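The identity $\nabla v_\pi(s_0) = \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$ can be checked numerically on a small example. The sketch below reuses an assumed short-corridor layout (three non-terminal states, actions reversed in the second one, reward -1 per step, no discounting) and parameterizes the policy by a single number $p$, the probability of choosing right, so that $\nabla \pi(\text{right} \mid s) = 1$ and $\nabla \pi(\text{left} \mid s) = -1$. All of these choices are assumptions made for illustration.

```python
# NEXT[s] = (state after left, state after right); 3 is the terminal state.
NEXT = {0: (0, 1), 1: (2, 0), 2: (1, 3)}

def evaluate(p_right, sweeps=5000):
    """Return v(s0), expected visit counts eta, and action values q."""
    pi = (1.0 - p_right, p_right)
    v = [0.0, 0.0, 0.0, 0.0]
    for _ in range(sweeps):  # iterative policy evaluation
        v = [-1.0 + sum(pi[a] * v[NEXT[s][a]] for a in (0, 1))
             for s in range(3)] + [0.0]
    eta = [0.0, 0.0, 0.0]
    for _ in range(sweeps):  # eta(s) = sum_k Pr(s0 -> s, k, pi)
        new_eta = [1.0, 0.0, 0.0]  # the episode starts in state 0
        for s in range(3):
            for a in (0, 1):
                if NEXT[s][a] < 3:
                    new_eta[NEXT[s][a]] += pi[a] * eta[s]
        eta = new_eta
    q = [[-1.0 + v[NEXT[s][a]] for a in (0, 1)] for s in range(3)]
    return v[0], eta, q

def theorem_gradient(p_right):
    """sum_s eta(s) sum_a grad pi(a|s) q(s, a), with grad pi = (-1, +1)."""
    _, eta, q = evaluate(p_right)
    return sum(eta[s] * (q[s][1] - q[s][0]) for s in range(3))

def finite_difference(p_right, h=1e-5):
    """Central-difference estimate of d v(s0) / d p."""
    return (evaluate(p_right + h)[0] - evaluate(p_right - h)[0]) / (2 * h)

grad_theorem = theorem_gradient(0.5)
grad_numeric = finite_difference(0.5)
```

The two gradients agree up to finite-difference error. Dividing `eta` by its sum gives the on-policy distribution $\mu(s)$, and that sum is exactly the proportionality constant dropped in the last line of the proof.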