Sutton "Reinforcement Learning" 2nd Edition Ch6: TD-learning

Kosuke Miyoshi

May 09, 2019

Transcript

  1. Contents
     • Policy Evaluation
       • Predicting V for a policy π
       • TD compared with MC and DP: what they share and how they differ
     • Control Algorithms
       • On-policy: Sarsa
       • Off-policy: Q-learning
       • Expected Sarsa
       • The maximization bias problem
       • Double Learning
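  2. Temporal-Difference Learning
     • Dynamic Programming
       • Bootstrapping: update the estimate of V(S_t) using the estimate of V(S_{t+1})
       • Uses a model of the environment, P(S_{t+1}, R_{t+1} | S_t, A_t)
     • Monte Carlo
       • Samples the return (the sum of rewards)
     • TD learning takes bootstrapping from DP and sampling from MC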
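  3. Dynamic Programming
     $$V(S_t) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma V(S_{t+1})\right] = \sum_a \pi(a \mid S_t) \sum_{s', r} p(s', r \mid S_t, a)\left[r + \gamma V(s')\right]$$
     [Backup diagram: full one-step expected backup from S_t over actions a, rewards r, and next states s']
     http://incompleteideas.net/609%20dropbox/slides%20(pdf%20and%20keynote)/11-12-TD.pdf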
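
As a minimal sketch of the expected update on this slide, the Python below applies one DP backup to a single state. The dictionary layout of `policy` and `model`, the function name `dp_backup`, and the toy two-state MDP are assumptions made for illustration; they do not come from the slides.

```python
def dp_backup(V, state, policy, model, gamma=1.0):
    """Expected update: sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')]."""
    total = 0.0
    for action, pi_a in policy[state].items():
        # model[(s, a)] is a list of (probability, next_state, reward) triples.
        for prob, next_state, reward in model[(state, action)]:
            total += pi_a * prob * (reward + gamma * V[next_state])
    return total


# Hypothetical toy MDP: every action from "s" leads to the terminal state "T".
V = {"s": 0.0, "T": 0.0}
policy = {"s": {"left": 0.5, "right": 0.5}}
model = {
    ("s", "left"): [(1.0, "T", 0.0)],
    ("s", "right"): [(0.5, "T", 1.0), (0.5, "T", 0.0)],
}
V["s"] = dp_backup(V, "s", policy, model)
```

The backup needs the full transition model `p(s', r | s, a)`, which is exactly what the sampling-based methods avoid.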
  4. Monte Carlo
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$
     [Backup diagram: one sampled trajectory from S_t to a terminal state]
     http://incompleteideas.net/609%20dropbox/slides%20(pdf%20and%20keynote)/11-12-TD.pdf
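
The Monte Carlo update can be sketched as follows, assuming an episode is stored as a list of `(state, reward)` pairs where the reward is the one received on leaving that state. The name `mc_update`, the every-visit variant, and the episode layout are illustrative assumptions.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Constant-alpha, every-visit Monte Carlo update applied after an episode ends.

    `episode` is a list of (S_t, R_{t+1}) pairs in time order.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G              # return G_t, accumulated backwards
        V[state] += alpha * (G - V[state])  # V(S_t) <- V(S_t) + alpha [G_t - V(S_t)]
    return V


# Hypothetical two-step episode: A -> B with reward 0, then B -> terminal with reward 1.
V = mc_update({"A": 0.0, "B": 0.0}, [("A", 0.0), ("B", 1.0)], alpha=0.5)
```

Because `G` is only known once the episode has finished, the update cannot run until termination.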
  5. TD learning
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$
     [Backup diagram: one sampled step S_t, R_{t+1}, S_{t+1}]
     http://incompleteideas.net/609%20dropbox/slides%20(pdf%20and%20keynote)/11-12-TD.pdf
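
A minimal sketch of the one-step TD update, assuming state values live in a plain dictionary. The function name `td0_update` and the `terminal` flag are illustrative assumptions, not part of the slides.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """One-step TD update; returns the TD error.

    The target R_{t+1} + gamma * V(S_{t+1}) bootstraps from the current estimate.
    The value of a terminal next state is taken to be 0.
    """
    next_value = 0.0 if terminal else V[next_state]
    td_error = reward + gamma * next_value - V[state]
    V[state] += alpha * td_error
    return td_error


# Hypothetical single transition A -> B with reward 0.
V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", 0.0, "B", alpha=0.5)
```

Unlike the Monte Carlo update, this can be applied immediately after each step, using only the sampled transition.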
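  6. Monte Carlo (constant-α MC)
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$
     • G_t is the return (sum of rewards); it is known only when the episode ends
     TD(0)
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$
     • R_{t+1} + γV(S_{t+1}) is an estimate of the return; it is available right after one step
     • The bracketed quantity is the TD error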
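  7. $$\begin{aligned}
     v_\pi(s) &\doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
              &= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
              &= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right]
     \end{aligned}$$
     • MC uses an estimate of the first expression as its target
     • TD uses an estimate of the last expression as its target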
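  8. Random Walk
     • Start at C
     • In each state, move left or right with equal probability (a Markov Reward Process)
     • Moving right from E ends the episode with reward 1
     • Moving left from A ends the episode with reward 0
     • With γ = 1, the true state values V of A through E are 1/6, 2/6, 3/6, 4/6, 5/6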
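  9. Batch Training
     • Instead of applying these updates at every step, accumulate the increments per state and update V(S_t) once with the total over all episodes
     • Repeat until V(S_t) stops changing
     • Under batch updating, both MC and TD converge deterministically, but the values they converge to differ
     Monte Carlo
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$
     TD(0)
     $$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$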
  10. You are the predictor
      Suppose the following eight episodes were observed:
        A, 0, B, 0
        B, 1
        B, 1
        B, 1
        B, 1
        B, 1
        B, 1
        B, 0
      With γ = 1, what are V(A) and V(B)?
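  11. Episodes: A, 0, B, 0; then B, 1 six times; then B, 0
      View 1 (the batch MC result)
      • B appears 8 times, 6 of them with reward 1
      • A appears once, and its return was 0
      $$V(B) = \tfrac{3}{4}, \qquad V(A) = 0$$
      View 2 (the batch TD result)
      • B appears 8 times, 6 of them with reward 1
      • A transitions to B with probability 100% and reward 0, so V(A) = V(B)
      $$V(B) = \tfrac{3}{4}, \qquad V(A) = \tfrac{3}{4}$$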
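  12. • V(A) = 0 minimizes the mean squared error on the training data
        • This is the solution batch MC produces
      • Taking the sequential structure of the problem into account gives V(A) = 3/4
        • This corresponds to the maximum-likelihood estimate of the Markov model that generated the data
        • If the estimated model is correct, the value estimate computed from it is also correct
        • This is called the certainty-equivalence estimate
        • This is the solution batch TD produces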

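The two batch solutions for the eight A/B episodes can be reproduced with a short sketch. It assumes each episode is a list of `(state, reward)` pairs and that increments are summed over all episodes before `V` is changed; the function name `batch_train`, the step size, and the stopping tolerance are illustrative choices.

```python
# The eight episodes from the "You are the predictor" example, as (state, reward) pairs.
EPISODES = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]


def batch_train(episodes, method, alpha=0.01, gamma=1.0, tol=1e-9, max_sweeps=100_000):
    """Batch updating: sum the increments over all episodes, then apply them at once.

    method="mc" uses the full return G_t as the target;
    method="td" uses R_{t+1} + gamma * V(S_{t+1}).
    """
    V = {"A": 0.0, "B": 0.0}
    for _ in range(max_sweeps):
        increments = {s: 0.0 for s in V}
        for episode in episodes:
            G = 0.0
            for t in reversed(range(len(episode))):
                state, reward = episode[t]
                G = reward + gamma * G
                if method == "mc":
                    target = G
                else:
                    # V is held fixed during the sweep; a terminal successor has value 0.
                    next_v = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
                    target = reward + gamma * next_v
                increments[state] += alpha * (target - V[state])
        for s in V:
            V[s] += increments[s]
        if max(abs(d) for d in increments.values()) < tol:
            break
    return V


v_mc = batch_train(EPISODES, "mc")  # approaches V(A) = 0,   V(B) = 3/4
v_td = batch_train(EPISODES, "td")  # approaches V(A) = 3/4, V(B) = 3/4
```

Both methods agree on V(B); they differ only on V(A), where the TD target passes the estimate of V(B) back through the observed A to B transition.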
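  13. Sarsa: On-policy TD Control
      • Updates Q(S_t, A_t) using the Q-value of the action A_{t+1} actually taken in the next state
      • It is on-policy because it estimates q_π for the behavior policy π
      $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$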
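
A minimal sketch of the Sarsa update, assuming a tabular Q stored in a `defaultdict` keyed by `(state, action)` and an ε-greedy behavior policy. The helper names, the state and action labels, and the `terminal` flag are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Behavior policy: a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """Sarsa: the target uses the action a_next that the behavior policy actually chose."""
    next_q = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_q - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical transition: (s0, left) -> reward 1 -> s1, where the policy then picks an action.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
a_next = epsilon_greedy(Q, "s1", ["left", "right"], epsilon=0.0)  # greedy when epsilon = 0
new_q = sarsa_update(Q, "s0", "left", 1.0, "s1", a_next, alpha=0.5)
```

The same policy both selects `a_next` and is the one being evaluated, which is what makes the method on-policy.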
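  14. Expected Sarsa
      • Where Q-learning takes a maximum, Expected Sarsa takes an expectation weighted by the probability of each action under the current policy
      • It costs more computation than Sarsa, but it removes the variance in the Q-value estimate that comes from the random choice of A_{t+1}
      • Given S_{t+1}, Sarsa moves Q stochastically, because the action is chosen by ε-greedy or similar. Expected Sarsa moves deterministically in the direction that Sarsa moves in expectation, hence the name
      • It generally gives better results than Sarsa
      $$\begin{aligned}
      Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, \mathbb{E}_\pi\left[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\right] - Q(S_t, A_t)\right] \\
                  &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right]
      \end{aligned}$$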
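  15. Expected Sarsa
      • In this update the π used for the expectation is the behavior policy, so the method is on-policy. Using a different policy for the expectation is also allowed, which makes it off-policy
      • Making this π greedy gives Q-learning
      • Q-learning can therefore be seen as a special case of Expected Sarsa
      $$\begin{aligned}
      Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, \mathbb{E}_\pi\left[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\right] - Q(S_t, A_t)\right] \\
                  &\leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right]
      \end{aligned}$$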
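
The relationship between Expected Sarsa and Q-learning can be sketched in a few lines, assuming an ε-greedy target policy and a Q-table stored as a dictionary from state to a list of action values. All names here are illustrative, and ties in the greedy action are broken toward the first maximum.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities pi(a|s) of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, epsilon, alpha=0.1, gamma=1.0):
    """Expected Sarsa: the target averages Q(s_next, .) under the target policy.

    Assumes s_next is non-terminal (a terminal state can be given an all-zero row).
    """
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    expected_q = sum(p * q for p, q in zip(probs, Q[s_next]))
    Q[s][a] += alpha * (r + gamma * expected_q - Q[s][a])
    return Q[s][a]


# Hypothetical Q-tables with two actions per state.
Q_soft = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
q_expected = expected_sarsa_update(Q_soft, "s0", 0, 1.0, "s1", epsilon=0.2, alpha=0.5)

# With epsilon = 0 the target policy is greedy, and the expectation
# collapses to max_a Q(s_next, a): the Q-learning target.
Q_greedy = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
q_greedy = expected_sarsa_update(Q_greedy, "s0", 0, 1.0, "s1", epsilon=0.0, alpha=0.5)
```

In the first call the expectation is 0.1 · 1 + 0.9 · 3 = 2.8; in the second it is the maximum, 3.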
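  16. • Start at A
      • Moving right from A ends the episode with reward 0
      • Moving left from A leads to B with reward 0
      • B offers many actions
        • Whichever action is chosen, the episode ends with a reward of mean −0.1 and variance 1
      • The expected return for choosing left is −0.1
      • Right is the better choice, yet maximization bias makes the agent more likely to choose left
      [Plot: Q-learning with ε = 0.1; under the optimal policy, left is chosen 5% of the time]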
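  17. Double Q-learning
      • Keep two Q-value tables, Q_1 and Q_2
      • At each step, choose at random which of the two updates to apply:
      $$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q_2\left(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\right) - Q_1(S_t, A_t)\right]$$
      $$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q_1\left(S_{t+1}, \arg\max_a Q_2(S_{t+1}, a)\right) - Q_2(S_t, A_t)\right]$$
      • In the first update, the index of the maximum comes from Q_1, but the value used is read from Q_2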
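
A minimal sketch of the Double Q-learning update, assuming both tables are dictionaries from state to a list of action values. The function name, the `rng` parameter, and the example tables are hypothetical.

```python
import random


def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=1.0, rng=random):
    """Double Q-learning: one table selects the action, the other evaluates it.

    Assumes s_next is non-terminal (a terminal state can be given an all-zero row).
    """
    # Choose at random which table to update on this step.
    if rng.random() < 0.5:
        update, evaluate = Q1, Q2
    else:
        update, evaluate = Q2, Q1
    # The argmax index comes from the table being updated ...
    best = max(range(len(update[s_next])), key=lambda b: update[s_next][b])
    # ... but its value is read from the other table.
    target = r + gamma * evaluate[s_next][best]
    update[s][a] += alpha * (target - update[s][a])


# Hypothetical tables with two actions per state.
Q1 = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
Q2 = {"s0": [0.0, 0.0], "s1": [5.0, -1.0]}
double_q_update(Q1, Q2, "s0", 0, 0.0, "s1", alpha=0.5, rng=random.Random(0))
```

Because the selecting table and the evaluating table have independent noise, an action that merely looks best in one table is not also overvalued by the other, which is how the method counters maximization bias.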
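  18. Summary
      • Introduced TD learning
        • It takes bootstrapping from DP and sampling from MC
        • It faces fewer of the computational obstacles that MC and DP have
      • Control Algorithms
        • On-policy: Sarsa
        • Off-policy: Q-learning
        • Introduced Expected Sarsa
        • Avoiding maximization bias with Double Learning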