Sutton "Reinforcement Learning" 2nd Edition Ch13: Policy Gradient Methods

"Reinforcement Learning" by Richard S. Sutton 2nd edition
Chapter 13
Policy Gradient Methods

Kosuke Miyoshi

October 02, 2019

Transcript

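  1. Policy Gradient Methods
    • Previous chapters
      • Learn action values and select actions based on them (action-value methods)
    • This chapter
      • Represent the policy directly with parameters (parameterized policy)
      • The value function is not used for action selection (it may still be used while learning the policy)
      • Perform gradient ascent on the parameters that represent the policy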
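  2. Parameterized policy
    $$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$$
    The stochastic policy is represented by a parameter vector $\theta$ of dimension $d'$.
    A performance measure $J(\theta)$ (a scalar) is introduced to judge how good the policy $\pi$ is. $J$ is the quantity to maximize.
    $$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$$
    $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of $J$ with respect to $\theta$. It has the same dimension $d'$ as $\theta$.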
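  3. Policy Approximation and its Advantages
    • Drawbacks of action-value functions
      • There is no natural way to represent a stochastic policy
      • In imperfect-information games such as card games, the optimal policy is often stochastic (e.g. bluffing in poker)
    • Advantages of a parameterized policy
      • It can represent stochastic policies
      • It can handle continuous actions
      • The policy changes smoothly
      • It is effective for problems where representing the policy directly is simpler than approximating it through a value function (e.g. Tetris)
      • Prior knowledge about the policy is easy to build in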
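  4. Softmax in action preferences
    $$\pi(a \mid s, \theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}$$
    When the preferences are linear in the parameter $\theta$:
    $$h(s, a, \theta) = \theta^\top x(s, a)$$
    where $x(s, a)$ is the feature vector.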
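The softmax policy with linear preferences can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the deck: the feature matrix, the parameter values, and the function names are all hypothetical. The gradient formula $\nabla \ln \pi(a \mid s, \theta) = x(s,a) - \sum_b \pi(b \mid s, \theta)\, x(s,b)$ follows from differentiating the softmax above.

```python
import numpy as np

def softmax_policy(theta, features):
    """Return pi(.|s, theta) for linear preferences h(s, a, theta) = theta^T x(s, a).

    features has shape (num_actions, d); row a is x(s, a) for one fixed state s.
    """
    prefs = features @ theta        # h(s, a, theta) for every action a
    prefs = prefs - prefs.max()     # shift for numerical stability (probabilities unchanged)
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_softmax(theta, features, action):
    """grad_theta ln pi(action|s, theta) = x(s, action) - sum_b pi(b|s, theta) x(s, b)."""
    probs = softmax_policy(theta, features)
    return features[action] - probs @ features

# Hypothetical state with 3 actions and 2 features per (state, action) pair.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.5, -0.25])
probs = softmax_policy(theta, features)
print(probs, probs.sum())
```

Because the preferences enter only through the softmax, adding a constant to all of them leaves the policy unchanged, which is why subtracting the maximum is safe.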
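  5. Policy Gradient Theorem
    Performance measure in the episodic case:
    $$J(\theta) \doteq v_{\pi_\theta}(s_0)$$
    $$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta)$$
    $\mu(s)$ is the probability of being in state $s$ when following $\pi$.
    The performance measure depends on
      • action selection
      • the frequency distribution $\mu(s)$ of visited states
    and both are affected by the policy parameter $\theta$. Nevertheless, the gradient $\nabla J$ of the performance measure $J$ with respect to $\theta$ can be written in a form that does not involve the derivative of the state distribution.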
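  6. REINFORCE: Monte Carlo Policy Gradient
    Policy gradient theorem:
    $$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta) = \mathbb{E}_\pi\left[\sum_a q_\pi(S_t, a) \nabla \pi(a \mid S_t, \theta)\right]$$
    $$\theta_{t+1} \doteq \theta_t + \alpha \sum_a \hat{q}(S_t, a, w) \nabla \pi(a \mid S_t, \theta)$$
    Here the update is a sum over all actions. The next step is to replace this sum with Monte Carlo sampling over actions.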
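  7. REINFORCE: Monte Carlo Policy Gradient
    $$\nabla J(\theta) \propto \mathbb{E}_\pi\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a) \frac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right]$$
    Replace the sum over all actions $a$ with the action $A_t$ actually taken (sampling):
    $$= \mathbb{E}_\pi\left[q_\pi(S_t, A_t) \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$$
    Replace $q_\pi(S_t, A_t)$ with the return $G_t$:
    $$= \mathbb{E}_\pi\left[G_t \frac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$$
    Update rule for $\theta$, using $\nabla \ln x = \frac{\nabla x}{x}$:
    $$\theta_{t+1} \doteq \theta_t + \alpha G_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$$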
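As a rough illustration of the update $\theta_{t+1} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$, the sketch below runs REINFORCE on a made-up task. The environment (a single state, three steps per episode, reward 1 for action 1 and 0 otherwise), the step size, and the function names are assumptions for the example only. The policy is a tabular softmax, for which $\nabla \ln \pi(a) = \mathrm{onehot}(a) - \pi$.

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def run_episode(theta, rng):
    """Hypothetical one-state task: action 1 pays reward 1, action 0 pays 0; 3 steps per episode."""
    trajectory = []
    for _ in range(3):
        action = rng.choice(2, p=softmax(theta))
        reward = float(action == 1)
        trajectory.append((action, reward))   # (A_t, R_{t+1})
    return trajectory

def reinforce_update(theta, trajectory, alpha):
    """Apply theta <- theta + alpha * G_t * grad ln pi(A_t|S_t, theta) for each step t."""
    rewards = [r for _, r in trajectory]
    for t, (action, _) in enumerate(trajectory):
        g_t = sum(rewards[t:])                # undiscounted return R_{t+1} + ... + R_T
        grad_log_pi = -softmax(theta)         # tabular softmax: onehot(action) - pi
        grad_log_pi[action] += 1.0
        theta = theta + alpha * g_t * grad_log_pi
    return theta

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    theta = reinforce_update(theta, run_episode(theta, rng), alpha=0.05)
final_probs = softmax(theta)
print(final_probs)
```

The return is computed without discounting to match the update on the slide. The policy should drift toward the rewarded action, though individual runs are noisy because $G_t$ is a Monte Carlo sample.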
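  8. REINFORCE with Baseline
    The idea is to introduce a baseline to lower the variance of the estimate of the gradient $\nabla J$:
    $$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left(q_\pi(s, a) - b(s)\right) \nabla \pi(a \mid s, \theta)$$
    If the baseline is chosen so that it does not depend on $a$ (it is unrelated to $\theta$), the sum below is 0, so the baseline does not affect the expectation of the gradient estimate:
    $$\sum_a b(s) \nabla \pi(a \mid s, \theta) = b(s) \nabla \sum_a \pi(a \mid s, \theta) = b(s) \nabla 1 = 0$$
    Update rule for $\theta$:
    $$\theta_{t+1} \doteq \theta_t + \alpha \left(G_t - b(S_t)\right) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$
    The baseline should depend on the state (e.g. when all actions have high values, a high baseline makes it easier to tell the actions apart). A natural choice of baseline is the state value $V$.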
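The two claims about the baseline (zero effect on the expectation, lower variance) can be checked exactly on a small example. The sketch below assumes a hypothetical single-state problem with three actions, a uniform tabular softmax policy, and deterministic returns of 10, 11 and 12; none of these numbers come from the deck.

```python
import numpy as np

def grad_log_pi(probs, action):
    """Tabular softmax: grad ln pi(action) = onehot(action) - pi."""
    g = -probs.copy()
    g[action] += 1.0
    return g

def update_mean_and_variance(probs, returns, baseline):
    """Exact mean and total variance of (G - b) * grad ln pi(A) when A ~ pi and G = returns[A]."""
    samples = np.array([(returns[a] - baseline) * grad_log_pi(probs, a)
                        for a in range(len(probs))])
    mean = probs @ samples
    variance = probs @ ((samples - mean) ** 2).sum(axis=1)
    return mean, variance

probs = np.full(3, 1.0 / 3.0)
returns = np.array([10.0, 11.0, 12.0])

# sum_a grad pi(a) = sum_a pi(a) * grad ln pi(a), which should vanish.
grad_pi_sum = sum(probs[a] * grad_log_pi(probs, a) for a in range(3))

mean_no_b, var_no_b = update_mean_and_variance(probs, returns, baseline=0.0)
mean_v, var_v = update_mean_and_variance(probs, returns, baseline=probs @ returns)
print(grad_pi_sum, mean_no_b, mean_v, var_no_b, var_v)
```

With the state value as the baseline, the expected update is identical but the variance is much smaller, because the large common offset in the returns no longer multiplies the score vector.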
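  9. Actor-Critic Methods
    REINFORCE with $V$ as a baseline does not bootstrap. Introducing some bias through bootstrapping reduces variance and speeds up learning.
    Replace the sum of rewards until the end of the episode with one-step bootstrapping:
    $$\theta_{t+1} \doteq \theta_t + \alpha \left(G_{t:t+1} - \hat{v}(S_t, w)\right) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$
    $$= \theta_t + \alpha \left(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\right) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$
    $$= \theta_t + \alpha \delta_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$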
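A single one-step actor-critic update might look like the sketch below. It assumes a tabular softmax policy (`theta[s]` holds the action preferences of state `s`) and a tabular state-value estimate `w[s]`. The critic update by semi-gradient TD(0) is the standard companion step and is an assumption here, since the slide only shows the actor update. All names and step sizes are hypothetical.

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def actor_critic_step(theta, w, s, a, r, s_next, done, alpha_theta, alpha_w, gamma):
    """One-step actor-critic update; returns new copies of theta and w plus the TD error."""
    v_next = 0.0 if done else w[s_next]      # the value of a terminal state is 0
    delta = r + gamma * v_next - w[s]        # TD error delta_t
    theta = theta.copy()
    w = w.copy()
    grad_log_pi = -softmax(theta[s])         # tabular softmax: onehot(a) - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi   # actor: theta + alpha * delta * grad ln pi
    w[s] += alpha_w * delta                         # critic: semi-gradient TD(0) (assumed)
    return theta, w, delta

# Hypothetical transition: state 0, action 1, reward 1, next state 1.
theta0 = np.zeros((2, 2))
w0 = np.zeros(2)
theta1, w1, delta = actor_critic_step(theta0, w0, s=0, a=1, r=1.0, s_next=1, done=False,
                                      alpha_theta=0.1, alpha_w=0.1, gamma=0.9)
print(theta1, w1, delta)
```

Unlike REINFORCE, this update can be applied after every transition because it needs only $R_{t+1}$ and the current value estimates, not the full return.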
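  10. Continuing Problems
    In the continuing case, the performance measure is the average reward per step when following $\pi$:
    $$J(\theta) \doteq r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] = \lim_{t \to \infty} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] = \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$
    If we define
    $$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots$$
    $$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right], \qquad q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
    then a policy gradient theorem of the same form as in the episodic case can be derived:
    $$\nabla J(\theta) = \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a \mid s, \theta)$$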
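  11. Continuous Actions
    For continuous actions, consider a Gaussian policy:
    $$\pi(a \mid s, \theta) \doteq \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\left(-\frac{(a - \mu(s, \theta))^2}{2\sigma(s, \theta)^2}\right)$$
    $$\mu(s, \theta) \doteq \theta_\mu^\top x_\mu(s), \qquad \sigma(s, \theta) \doteq \exp\left(\theta_\sigma^\top x_\sigma(s)\right)$$
    $$\theta = [\theta_\mu, \theta_\sigma]^\top$$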
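For the Gaussian policy above, the score function has a closed form: $\nabla_{\theta_\mu} \ln \pi = \frac{a - \mu}{\sigma^2} x_\mu(s)$ and $\nabla_{\theta_\sigma} \ln \pi = \left(\frac{(a - \mu)^2}{\sigma^2} - 1\right) x_\sigma(s)$. The sketch below implements the log-density and these gradients. The feature vectors and parameter values are hypothetical, chosen only to make the example concrete.

```python
import numpy as np

def gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma):
    """mu(s, theta) = theta_mu^T x_mu(s), sigma(s, theta) = exp(theta_sigma^T x_sigma(s))."""
    return theta_mu @ x_mu, np.exp(theta_sigma @ x_sigma)

def log_pi(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """ln pi(a|s, theta) for the Gaussian policy."""
    mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
    return -np.log(sigma) - 0.5 * np.log(2.0 * np.pi) - (a - mu) ** 2 / (2.0 * sigma ** 2)

def grad_log_pi(a, theta_mu, theta_sigma, x_mu, x_sigma):
    """Gradients of ln pi(a|s, theta) with respect to theta_mu and theta_sigma."""
    mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
    grad_mu = (a - mu) / sigma ** 2 * x_mu
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x_sigma
    return grad_mu, grad_sigma

# Hypothetical parameters and state features.
theta_mu = np.array([0.2, -0.1])
theta_sigma = np.array([0.3, 0.1])
x_mu = np.array([1.0, 2.0])
x_sigma = np.array([0.5, -1.0])

rng = np.random.default_rng(0)
mu, sigma = gaussian_params(theta_mu, theta_sigma, x_mu, x_sigma)
action = rng.normal(mu, sigma)               # sample A ~ pi(.|s, theta)
print(action, grad_log_pi(action, theta_mu, theta_sigma, x_mu, x_sigma))
```

Parameterizing $\sigma$ through an exponential keeps it positive for any $\theta_\sigma$, so the parameters can be updated by unconstrained gradient ascent.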
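  12. Summary
    • Policy gradient methods learn a parameterized policy directly, without relying on action-value estimates to select actions
    • Advantages of a parameterized policy
      • It can represent stochastic policies and exploration, and can approach a deterministic policy asymptotically
    • Policy gradient theorem
      • Describes how the performance measure is affected by the policy parameters, in a form that does not involve the derivative of the state distribution
    • REINFORCE
      • Monte Carlo
      • Introducing the state value as a baseline lowers variance without adding bias
    • Actor-Critic
      • Uses the state value for bootstrapping (TD learning), giving lower variance than Monte Carlo
      • The state value assigns credit to (criticizes) the policy's action selections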
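  13. Proof of the Policy Gradient Theorem (episodic case)
    $\Pr(s \to x, k, \pi)$: the probability of reaching state $x$ from state $s$ in $k$ steps under $\pi$.
    $$\nabla v_\pi(s) = \nabla \left[\sum_a \pi(a \mid s)\, q_\pi(s, a)\right]$$
    Product rule:
    $$= \sum_a \left[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \nabla q_\pi(s, a)\right]$$
    Decompose $q$:
    $$= \sum_a \left[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \nabla \sum_{s', r} p(s', r \mid s, a) \left(r + v_\pi(s')\right)\right]$$
    $r$ and $p(s', r \mid s, a)$ do not depend on $\theta$:
    $$= \sum_a \left[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s')\right]$$
    Expand recursively in $s'$:
    $$= \sum_a \left[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \sum_{a'} \left[\nabla \pi(a' \mid s')\, q_\pi(s', a') + \pi(a' \mid s') \sum_{s''} p(s'' \mid s', a') \nabla v_\pi(s'')\right]\right]$$
    $$= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a)$$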
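  14. Proof of the Policy Gradient Theorem (continued)
    To obtain the gradient of the performance measure, substitute $s_0$ for $s$:
    $$\nabla J(\theta) = \nabla v_\pi(s_0) = \sum_s \left(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\right) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$$
    $$= \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$$
    $$= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$$
    $$= \sum_{s'} \eta(s') \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$$
    $$\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$$
    $\eta(s)$: the number of times $s$ appears. $\mu(s)$: the probability that $s$ appears.
    End of proof.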
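The result of the proof can be checked numerically on a small episodic MDP. The sketch below is an illustration under stated assumptions: the two-state, two-action MDP (`P`, `R`), the tabular softmax policy, and all function names are made up. `P[s, a, t]` is the probability of moving from non-terminal state `s` to non-terminal state `t` under action `a` (the remaining probability mass terminates the episode), and `R[s, a]` is the expected immediate reward. The code computes $\eta(s) = \sum_k \Pr(s_0 \to s, k, \pi)$ by solving a linear system, then evaluates $\sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$.

```python
import numpy as np

def policy(theta):
    """Tabular softmax policy; theta[s] holds the action preferences of state s."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta, P, R):
    """Return pi, the state-to-state matrix P_pi, v_pi and q_pi (undiscounted, episodic)."""
    pi = policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(len(r_pi)) - P_pi, r_pi)   # v = r_pi + P_pi v
    q = R + P @ v                                          # q(s,a) = R(s,a) + sum_t P(s,a,t) v(t)
    return pi, P_pi, v, q

def expected_visits(P_pi, s0):
    """eta(s) = sum_k Pr(s0 -> s, k, pi), the expected number of visits to s."""
    start = np.zeros(P_pi.shape[0])
    start[s0] = 1.0
    return np.linalg.solve(np.eye(P_pi.shape[0]) - P_pi.T, start)

def theorem_gradient(theta, P, R, s0):
    """sum_s eta(s) sum_a grad pi(a|s) q(s,a); for tabular softmax this is eta * pi * (q - v)."""
    pi, P_pi, v, q = evaluate(theta, P, R)
    eta = expected_visits(P_pi, s0)
    return eta[:, None] * pi * (q - v[:, None])

# Hypothetical MDP: every (s, a) row of P sums to less than 1, so episodes terminate.
P = np.array([[[0.0, 0.5], [0.3, 0.3]],
              [[0.2, 0.0], [0.4, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
theta = np.array([[0.1, -0.2], [0.3, 0.0]])
s0 = 0

grad = theorem_gradient(theta, P, R, s0)
eta = expected_visits(evaluate(theta, P, R)[1], s0)
mu = eta / eta.sum()                                       # on-policy distribution mu(s)
print(grad, mu)
```

Comparing `grad` with a finite-difference estimate of $\nabla v_\pi(s_0)$ gives the same values up to numerical error, and the constant of proportionality between the $\eta$ form and the $\mu$ form is `eta.sum()`, the expected episode length.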