Sutton "Reinforcement Learning" 2nd Edition Ch7: n-step Bootstrapping

Kosuke Miyoshi

May 28, 2019

Transcript

  1. Contents
    - Prediction
    - n-step TD Prediction
    - Error reduction property
    - Random Walk
    - Control Algorithms
    - n-step Sarsa
    - n-step Expected Sarsa
    - n-step Off-policy Learning
    - Per-decision Methods with Control Variates
    - The n-step Tree Backup Algorithm
    - A Unifying Algorithm: n-step Q(σ)
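  2. $S_t, A_t, R_{t+1}$
    One-step TD target: $R_{t+1} + \gamma V_t(S_{t+1})$
    Monte Carlo target: $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
    Diagram labels: $S_t, R_{t+1}, R_{t+2}, \ldots, R_T$
    What about the cases in between?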
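  3. n-step TD prediction
    Monte Carlo: $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
    1-step TD: $G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})$
    2-step TD: $G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$
    n-step TD: $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$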
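  4. n-step TD prediction
    Return: $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
    Update: $V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$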
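As a minimal sketch of the update above, the following Python computes $G_{t:t+n}$ from a recorded episode and applies one n-step TD update. The list layout (`rewards[k]` holding $R_{k+1}$, `states[k]` holding $S_k$), the function names, and the toy numbers are assumptions made for illustration and do not come from the slides.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """Return G_{t:t+n}. Assumed layout: rewards[k] is R_{k+1}, states[k] is S_k."""
    T = len(rewards)              # the episode terminates at time T
    h = min(t + n, T)             # horizon, truncated at termination
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if h < T:                     # bootstrap only from a non-terminal S_{t+n}
        G += gamma ** n * V[states[h]]
    return G


def n_step_td_update(V, state, G, alpha):
    """V(S_t) <- V(S_t) + alpha * (G_{t:t+n} - V(S_t)), in place."""
    V[state] += alpha * (G - V[state])


# Toy episode (hypothetical numbers): S_0=0, S_1=1, S_2=2, then termination.
rewards = [1.0, 2.0, 3.0]
states = [0, 1, 2, 3]
V = {0: 0.0, 1: 2.0, 2: 4.0}
G = n_step_return(rewards, states, V, t=0, n=2, gamma=0.5)  # 1 + 0.5*2 + 0.25*4
n_step_td_update(V, states[0], G, alpha=0.5)
```

When `t + n` reaches past the end of the episode, the function falls back to the plain Monte Carlo return, which mirrors how the n-step return is truncated at termination.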
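  5. Error reduction property
    $\max_s \left| \mathbb{E}_\pi\left[ G_{t:t+n} \mid S_t = s \right] - v_\pi(s) \right| \le \gamma^n \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|$
    where $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
    Left-hand side: the maximum error between the expected n-step return and the true V
    Right-hand side: the maximum error between V and the true V
    Figure labels: $v_\pi$, $V$, $s$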
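  6. Random Walk
    Diagram labels: C, D, E
    - Start with the initial values of V(A) through V(E) all set to 0.5
    - Start from C, move C → D → E, then move right and obtain Reward 1
    - Consider the case γ = 1
    $G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})$
    $G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$
    With 1-step, only E is updated: V(E) moves toward 1
    With 2-step, D and E are updated: V(D) and V(E) move toward 1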
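The random-walk episode described on this slide (start at C, move C → D → E, terminate on the right with reward 1, all values initialized to 0.5, γ = 1) can be replayed in a few lines of Python. This is only a sketch: the function name, the step size `alpha=0.5`, and the choice to apply the updates in time order after the episode has ended are assumptions made for illustration.

```python
def n_step_td_episode(states, rewards, V, n, alpha, gamma=1.0):
    """Apply n-step TD updates for one finished episode and return a new value table.

    Assumed layout: states[k] is S_k (the last entry is the terminal state),
    rewards[k] is R_{k+1}.
    """
    V = dict(V)                                # leave the caller's table untouched
    T = len(rewards)
    for tau in range(T):                       # time step whose state is updated
        h = min(tau + n, T)
        G = sum(gamma ** (k - tau) * rewards[k] for k in range(tau, h))
        if h < T:                              # bootstrap from a non-terminal state
            G += gamma ** n * V[states[h]]
        V[states[tau]] += alpha * (G - V[states[tau]])
    return V


V0 = {s: 0.5 for s in "ABCDE"}                 # initial values
states = ["C", "D", "E", "terminal"]
rewards = [0.0, 0.0, 1.0]                      # reward 1 only on the final step
V1 = n_step_td_episode(states, rewards, V0, n=1, alpha=0.5)
V2 = n_step_td_episode(states, rewards, V0, n=2, alpha=0.5)
```

Under these assumptions `V1` differs from `V0` only at E, while `V2` differs at both D and E, which is the behavior the slide describes for the 1-step and 2-step methods.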
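  7. n-step Sarsa
    Return: $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$
    Update: $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$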
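  8. n-step Expected Sarsa
    n-step Sarsa: $G_{t:t+n} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$
    n-step Expected Sarsa: $G_{t:t+n} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \sum_a \pi(a \mid S_{t+n}) Q_{t+n-1}(S_{t+n}, a)$
    Figure labels: action, n-step Sarsa, n-step Expected Sarsa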
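The two returns differ only in the bootstrap term, which the following Python sketch makes explicit. The dictionary layout (`Q[s][a]`, `pi[s][a]`), the function name, and the toy numbers are assumptions for illustration; passing `pi=None` selects the Sarsa bootstrap and passing a policy selects the Expected Sarsa bootstrap.

```python
def n_step_action_return(rewards, states, actions, Q, t, n, gamma, pi=None):
    """G_{t:t+n} for n-step Sarsa (pi is None) or n-step Expected Sarsa (pi given).

    Assumed layout: rewards[k] is R_{k+1}, states[k] is S_k, actions[k] is A_k.
    """
    T = len(rewards)
    h = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if h < T:
        s = states[h]
        if pi is None:
            tail = Q[s][actions[h]]                       # sampled action A_{t+n}
        else:
            tail = sum(pi[s][a] * Q[s][a] for a in Q[s])  # expectation under pi
        G += gamma ** n * tail
    return G


# Hypothetical two-step episode with two actions per state.
Q = {"s1": {"L": 1.0, "R": 3.0}}
pi = {"s1": {"L": 0.5, "R": 0.5}}
states, actions, rewards = ["s0", "s1", "end"], ["R", "R"], [1.0, 0.0]
g_sarsa = n_step_action_return(rewards, states, actions, Q, 0, 1, 1.0)
g_expected = n_step_action_return(rewards, states, actions, Q, 0, 1, 1.0, pi=pi)
```

With these numbers the Sarsa return bootstraps from $Q(s_1, R) = 3$ while the Expected Sarsa return bootstraps from the policy-weighted average $2$.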
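  9. n-step Off-policy Learning
    $V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$
    $\rho_{t:h} \doteq \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
    $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$
    π: the policy whose V and Q are being estimated (example: Greedy)
    b: the policy that actually issues actions (example: ε-Greedy)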
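The importance sampling ratio $\rho_{t:h}$ can be sketched in Python as below, using a greedy target policy and an ε-greedy behavior policy as in the slide's examples. The helper names, the single-state setup, and ε = 0.5 are hypothetical choices for illustration.

```python
def importance_ratio(pi, b, states, actions, t, h):
    """rho_{t:h}: product of pi(A_k|S_k) / b(A_k|S_k) for k = t .. min(h, T-1)."""
    T = len(actions)                      # actions A_0 .. A_{T-1} were taken
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]
    return rho


def epsilon_greedy(greedy_action, action_set, epsilon):
    """Action probabilities of an epsilon-greedy policy in one state."""
    probs = {a: epsilon / len(action_set) for a in action_set}
    probs[greedy_action] += 1.0 - epsilon
    return probs


action_set = ["L", "R"]
pi = {"s": epsilon_greedy("R", action_set, 0.0)}   # greedy target policy
b = {"s": epsilon_greedy("R", action_set, 0.5)}    # epsilon-greedy behavior policy
on_target = importance_ratio(pi, b, ["s", "s"], ["R", "R"], t=0, h=1)
off_target = importance_ratio(pi, b, ["s", "s"], ["R", "L"], t=0, h=1)
```

A single action that the target policy would never take drives the whole product to zero, so `off_target` is 0 and that sample contributes nothing to the update.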
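  10. Per-decision Methods with Control Variates
    n-step return: $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$
    Written in recursive form: $G_{t:h} = R_{t+1} + \gamma G_{t+1:h}$, with $G_{h:h} \doteq V_{h-1}(S_h)$
    Off-policy: $G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right)$, with $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$
    If the probability of taking $A_t$ under π is 0, then $\rho_t$ is also 0, and the variance of the estimate of V becomes large.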
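  11. Off-policy return: $G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right)$
    $\rho_t$ has the property that its expected value is 1:
    $\mathbb{E}_b\left[ \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \right] \doteq \sum_a b(a \mid S_k) \frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_a \pi(a \mid S_k) = 1$
    Therefore adding a Control Variate does not change the expected value:
    $G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t) V_{h-1}(S_t)$
    The term $(1 - \rho_t) V_{h-1}(S_t)$ is the Control Variate.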
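The claim that $\rho_t$ has expected value 1 under the behavior policy, and that the control variate therefore has expected value 0, can be checked numerically. The sketch below uses hypothetical two-action policies and a hypothetical value estimate `v`; it computes exact expectations under b rather than sampling.

```python
def expectation_under_b(b, f):
    """Exact expectation of f(A) when A is drawn from the behavior policy b."""
    return sum(prob * f(a) for a, prob in b.items())


pi = {"L": 0.0, "R": 1.0}     # hypothetical greedy target policy in one state
b = {"L": 0.25, "R": 0.75}    # hypothetical behavior policy in the same state
v = 0.3                       # hypothetical current estimate V(S_t)


def rho(a):
    """Single-step importance sampling ratio pi(a|s) / b(a|s)."""
    return pi[a] / b[a]


mean_rho = expectation_under_b(b, rho)
mean_control_variate = expectation_under_b(b, lambda a: (1.0 - rho(a)) * v)
```

Up to floating-point rounding, `mean_rho` comes out as 1 and `mean_control_variate` as 0, so adding the control variate leaves the expected return unchanged.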
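  12. When $\rho_t$ is 0, the sample is discarded. The correct behavior is that V does not change.
    $G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t) V_{h-1}(S_t)$
    If the probability of taking $A_t$ under π is 0, then $\rho_t$ is 0.
    In that case the target becomes $V(S_t)$, so the update below leaves V unchanged, as desired:
    $V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right]$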
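A recursive Python sketch of the per-decision return with control variate shows the $\rho_t = 0$ case directly. The list layout, the precomputed `rhos` list, the use of a single value table in place of the time-indexed $V_{h-1}$, and the convention that the return from the terminal state is 0 are all assumptions made for illustration.

```python
def cv_return(rewards, states, rhos, V, t, h, gamma):
    """G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t).

    Assumed layout: rewards[k] is R_{k+1}, states[k] is S_k, rhos[k] is rho_k.
    """
    T = len(rewards)
    if t >= T:                       # terminal state: assumed to have value 0
        return 0.0
    if t == h:                       # recursion ends with G_{h:h} = V(S_h)
        return V[states[h]]
    tail = cv_return(rewards, states, rhos, V, t + 1, h, gamma)
    return rhos[t] * (rewards[t] + gamma * tail) + (1.0 - rhos[t]) * V[states[t]]


V = {"a": 0.3, "b": 0.7}
states, rewards = ["a", "b", "end"], [1.0, 2.0]
g_zero_rho = cv_return(rewards, states, [0.0, 1.0], V, 0, 2, 1.0)   # rho_0 = 0
g_on_policy = cv_return(rewards, states, [1.0, 1.0], V, 0, 2, 1.0)  # all rho = 1
td_error = g_zero_rho - V[states[0]]
```

With `rho_0 = 0` the return collapses to `V["a"]`, so `td_error` is 0 and the update leaves the estimate where it was. With every ratio equal to 1 the control variate vanishes and the ordinary n-step return is recovered.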
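  13. Per-decision Methods with Control Variates
    The action-value case:
    $G_{t:h} \doteq R_{t+1} + \gamma \left( \rho_{t+1} G_{t+1:h} + V_{h-1}(S_{t+1}) - \rho_{t+1} Q_{h-1}(S_{t+1}, A_{t+1}) \right)$
    $= R_{t+1} + \gamma \rho_{t+1} \left( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \right) + \gamma V_{h-1}(S_{t+1})$
    $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$
    $V_t(s) \doteq \sum_a \pi(a \mid s) Q_t(s, a)$
    $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$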
  14. The n-step Tree Backup Algorithm
    1 Step: $G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a)$
    The 1-step case is ordinary Expected Sarsa.
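  15. The n-step Tree Backup Algorithm
    1 Step: $G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a)$
    2 Step: $G_{t:t+2} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \left( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a) \right)$
    Recursive form: $G_{t:t+2} = R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}$
    The sum over $a \ne A_{t+1}$ covers the actions that were not taken; the term weighted by $\pi(A_{t+1} \mid S_{t+1})$ covers the action that was taken.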
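  16. The n-step Tree Backup Algorithm
    n-step return: $G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}$
    The sum over $a \ne A_{t+1}$ covers the actions that were not taken; the last term covers the action that was taken.
    Update: $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$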
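The tree-backup recursion can be written almost verbatim in Python. In this sketch the dictionary layout (`Q[s][a]`, `pi[s][a]`), the function name, the toy episode, and the convention that the recursion ends with $R_T$ when the next state is terminal are assumptions for illustration; a single table `Q` stands in for the time-indexed $Q_{t+n-1}$.

```python
def tree_backup_return(rewards, states, actions, Q, pi, t, h, gamma):
    """G_{t:h} for t < h, following the recursive tree-backup definition.

    Assumed layout: rewards[k] is R_{k+1}, states[k] is S_k, actions[k] is A_k.
    """
    T = len(rewards)
    r = rewards[t]
    if t + 1 >= T:                                   # S_{t+1} is terminal
        return r
    s1 = states[t + 1]
    if t + 1 == h:                                   # 1-step case: Expected Sarsa
        return r + gamma * sum(pi[s1][a] * Q[s1][a] for a in Q[s1])
    a1 = actions[t + 1]
    # Actions that were not taken contribute their current estimates.
    not_taken = sum(pi[s1][a] * Q[s1][a] for a in Q[s1] if a != a1)
    # The action that was taken contributes the rest of the sampled return.
    taken = pi[s1][a1] * tree_backup_return(
        rewards, states, actions, Q, pi, t + 1, h, gamma)
    return r + gamma * not_taken + gamma * taken


# Hypothetical three-step episode with two actions per state.
Q = {"s1": {"L": 1.0, "R": 3.0}, "s2": {"L": 2.0, "R": 6.0}}
pi = {s: {"L": 0.5, "R": 0.5} for s in Q}
states = ["s0", "s1", "s2", "end"]
actions = ["R", "R", "L"]
rewards = [1.0, 2.0, 4.0]
g1 = tree_backup_return(rewards, states, actions, Q, pi, 0, 1, 1.0)
g2 = tree_backup_return(rewards, states, actions, Q, pi, 0, 2, 1.0)
g3 = tree_backup_return(rewards, states, actions, Q, pi, 0, 3, 1.0)
```

No behavior-policy probabilities appear anywhere in the function, which reflects that tree backup is an off-policy method without importance sampling.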
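  17. A Unifying Algorithm: n-step Q(σ)
    n-step Tree Backup:
    $G_{t:h} = R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1}) Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:h}$
    $= R_{t+1} + \gamma V_{h-1}(S_{t+1}) - \gamma \pi(A_{t+1} \mid S_{t+1}) Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:h}$
    $= R_{t+1} + \gamma \pi(A_{t+1} \mid S_{t+1}) \left( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \right) + \gamma V_{h-1}(S_{t+1})$
    This is Control Variate Sarsa with $\rho$ replaced by $\pi(A \mid S)$.
    n-step Q(σ):
    $G_{t:h} \doteq R_{t+1} + \gamma \left( \sigma_{t+1} \rho_{t+1} + (1 - \sigma_{t+1}) \pi(A_{t+1} \mid S_{t+1}) \right) \left( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \right) + \gamma V_{h-1}(S_{t+1})$
    σ switches between Control Variate Sarsa and n-step Tree Backup.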
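The switching role of σ can be seen in a short Python sketch of the Q(σ) return. The data layout, the per-step `sigma` list, the toy episode, and the terminal convention (the recursion ends with $R_T$ when the next state is terminal) are assumptions for illustration; a single table `Q` stands in for the time-indexed $Q_{h-1}$.

```python
def q_sigma_return(rewards, states, actions, Q, pi, b, sigma, t, h, gamma):
    """G_{t:h} for n-step Q(sigma); sigma[k] in [0, 1] is the degree of sampling at step k.

    Assumed layout: rewards[k] is R_{k+1}, states[k] is S_k, actions[k] is A_k.
    """
    T = len(rewards)
    if t == h:                                   # recursion ends with Q(S_h, A_h)
        return Q[states[h]][actions[h]]
    r = rewards[t]
    if t + 1 >= T:                               # S_{t+1} is terminal
        return r
    s1, a1 = states[t + 1], actions[t + 1]
    v_bar = sum(pi[s1][a] * Q[s1][a] for a in Q[s1])   # expected value under pi
    rho = pi[s1][a1] / b[s1][a1]
    weight = sigma[t + 1] * rho + (1.0 - sigma[t + 1]) * pi[s1][a1]
    tail = q_sigma_return(rewards, states, actions, Q, pi, b, sigma, t + 1, h, gamma)
    return r + gamma * weight * (tail - Q[s1][a1]) + gamma * v_bar


# Hypothetical three-step episode with two actions per state.
Q = {"s1": {"L": 1.0, "R": 3.0}, "s2": {"L": 2.0, "R": 6.0}}
pi = {s: {"L": 0.5, "R": 0.5} for s in Q}
b = {s: {"L": 0.25, "R": 0.75} for s in Q}
states, actions, rewards = ["s0", "s1", "s2", "end"], ["R", "R", "L"], [1.0, 2.0, 4.0]
g_tree = q_sigma_return(rewards, states, actions, Q, pi, b, [0.0] * 4, 0, 2, 1.0)
g_sarsa = q_sigma_return(rewards, states, actions, Q, pi, b, [1.0] * 4, 0, 2, 1.0)
```

Setting every σ to 0 weights the sampled branch by $\pi(A_{t+1} \mid S_{t+1})$, giving the tree-backup return, while setting every σ to 1 weights it by $\rho_{t+1}$, giving the Sarsa return with control variate. Intermediate values of σ interpolate between the two weights.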
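  18. Summary
    - n-step TD lies between Monte Carlo and one-step TD
    - An intermediate n generally gives better results than either extreme
    - Applicable to both continuing tasks and episodic tasks
    - The computational cost is higher than for one-step methods
    - The last n steps must be kept in memory
    - Learning cannot be carried out until n steps later
    - As with TD, the computational cost per step is small and uniform
    - Generalization of the algorithms
    - Error reduction theory, Sarsa, Off-policy learning using Importance Sampling, Expected Sarsa, Tree Backup
    - n-step Q(σ), which generalizes all of them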