Introduction Deep Reinforcement Learning

Wonseok Jung
November 20, 2018
Transcript

  1. Wonseok Jung

    City University of New York, Baruch College, Data Science Major
    Connexion AI Founder
    Deep Learning College Reinforcement Learning Researcher
    Modulabs CTRL Leader
    Reinforcement Learning, Object Detection, Chatbot
    Github: https://github.com/wonseokjung
    Facebook: https://www.facebook.com/wsjung
    Blog: https://wonseokjung.github.io
    Youtube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw

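  2. Markov Decision Problem

    $\pi_\theta(a_t \mid o_t)$ - policy
    $\pi_\theta(a_t \mid s_t)$ - policy (fully observed)
    $s_t$ - state
    $o_t$ - observation
    $a_t$ - action
    Graph: observations $o_1, o_2, o_3$, states $s_1, s_2, s_3$, actions $a_1, a_2, a_3$; consecutive states are linked by $p(s_{t+1} \mid s_t, a_t)$.
    The Markov Decision Problem defines the world in reinforcement learning in terms of state, action, reward, and transition.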
  3. Reward functions

    High reward: arriving safely at the destination
    Low reward: a traffic accident
    Policy: learn a policy that drives safely
    Learning is driven by the reward function.
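  4. Markov chain (graphically)

    $\vec{\mu}_{t+1} = T \vec{\mu}_t$
    $\mu_{t,i} = p(s_t = i)$ - probability that the state is $i$ at timestep $t$
    $T_{i,j} = p(s_{t+1} = i \mid s_t = j)$ - matrix of probabilities that the state becomes $i$ given that the current state is $j$
    Graph: $s_1 \to s_2 \to s_3$, each arrow labeled $p(s_{t+1} \mid s_t)$.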
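As a rough illustration of the update $\vec{\mu}_{t+1} = T \vec{\mu}_t$ on the Markov chain slide, the sketch below pushes a state distribution forward three steps. The two-state matrix and the helper name `step` are assumptions made for this example; they do not come from the slides.

```python
# Column-stochastic transition matrix: T[i][j] = p(s_{t+1} = i | s_t = j).
# The numbers are illustrative assumptions, not values from the slides.
T = [
    [0.9, 0.5],
    [0.1, 0.5],
]


def step(T, mu):
    """Return mu_{t+1} = T mu_t for a distribution stored as a list."""
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


mu = [1.0, 0.0]  # start in state 0 with probability 1
history = [mu]
for _ in range(3):
    mu = step(T, mu)
    history.append(mu)

for t, dist in enumerate(history, start=1):
    print(f"mu_{t} = {dist}")
```

Because every column of `T` sums to one, each propagated vector remains a probability distribution.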
  5. Markov decision process

    $M = \{S, A, T, r\}$
    $S$ - state space, $s \in S$
    $A$ - action space, $a \in A$
    $T$ - transition operator
    $r$ - reward function
    Graph: states $s_1, s_2, s_3$ and actions $a_1, a_2$; consecutive states are linked by $p(s_{t+1} \mid s_t, a_t)$.
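  6. Markov decision process

    $\mu_{t,i} = p(s_t = i)$ - probability that the state is $i$ at timestep $t$
    $\xi_{t,k} = p(a_t = k)$ - probability that the action is $k$ at timestep $t$
    $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$ - probability that the state is $i$ at timestep $t+1$, given that at timestep $t$ the state is $j$ and the action is $k$
    $r : S \times A \to \mathbb{R}$ - reward function
    $\mu_{t+1,i} = \sum_{j,k} T_{i,j,k} \, \mu_{t,j} \, \xi_{t,k}$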
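The state-distribution update for an MDP, $\mu_{t+1,i} = \sum_{j,k} T_{i,j,k} \, \mu_{t,j} \, \xi_{t,k}$, can be sketched as a plain double sum. The tensor values and the function name `mdp_step` below are hypothetical, chosen only so that each $p(\cdot \mid j, k)$ sums to one.

```python
# T[i][j][k] = p(s_{t+1} = i | s_t = j, a_t = k); illustrative numbers only.
T = [
    [[1.0, 0.2], [0.6, 0.0]],  # next state i = 0
    [[0.0, 0.8], [0.4, 1.0]],  # next state i = 1
]


def mdp_step(T, mu, xi):
    """Return mu_{t+1}, where mu is p(s_t = j) and xi is p(a_t = k)."""
    n_states, n_actions = len(mu), len(xi)
    return [
        sum(
            T[i][j][k] * mu[j] * xi[k]
            for j in range(n_states)
            for k in range(n_actions)
        )
        for i in range(n_states)
    ]


mu_t = [0.5, 0.5]    # state distribution at timestep t
xi_t = [0.25, 0.75]  # action distribution at timestep t
mu_next = mdp_step(T, mu_t, xi_t)
print(mu_next)
```

As on the slide, the action distribution here does not depend on the state; a state-dependent policy would replace `xi[k]` with a per-state term.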
  7. Partially observed Markov decision process

    $M = \{S, A, O, T, E, r\}$
    $S$ - state space, $s \in S$
    $A$ - action space, $a \in A$
    $O$ - observation space, $o \in O$
    $T$ - transition operator
    $E$ - emission probability $p(o_t \mid s_t)$
    $r$ - reward function
    Graph: observations $o_1, o_2, o_3$, states $s_1, s_2, s_3$, actions $a_1, a_2, a_3$; consecutive states are linked by $p(s_{t+1} \mid s_t, a_t)$.
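To make the roles of the emission probability $p(o_t \mid s_t)$ and an observation-based policy $\pi(a_t \mid o_t)$ concrete, here is a minimal rollout sketch. The two-state model, the observation labels, the deterministic policy, and the reward are all assumptions invented for this example.

```python
import random

# Hypothetical two-state POMDP; every number below is an illustrative assumption.
STATES, ACTIONS, OBSERVATIONS = [0, 1], [0, 1], ["low", "high"]
# TRANSITION[s][a] = distribution over next states, p(s_{t+1} | s_t, a_t)
TRANSITION = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.7, 0.3], 1: [0.1, 0.9]},
}
# EMISSION[s] = distribution over observations, p(o_t | s_t)
EMISSION = {0: [0.8, 0.2], 1: [0.3, 0.7]}


def policy(observation):
    """Observation-based policy pi(a_t | o_t); deterministic for brevity."""
    return 1 if observation == "low" else 0


def reward(state, action):
    """r(s, a): reward 1 for being in state 1, regardless of the action."""
    return 1.0 if state == 1 else 0.0


def rollout(horizon, rng):
    """Sample (o_t, a_t, r_t) tuples; the agent never sees the state itself."""
    state = 0
    trajectory = []
    for _ in range(horizon):
        obs = rng.choices(OBSERVATIONS, weights=EMISSION[state])[0]
        action = policy(obs)
        trajectory.append((obs, action, reward(state, action)))
        state = rng.choices(STATES, weights=TRANSITION[state][action])[0]
    return trajectory


trajectory = rollout(horizon=5, rng=random.Random(0))
print(trajectory)
```

The hidden `state` drives both the reward and the next transition, while the policy only ever receives `obs`.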
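  8. The goal of reinforcement learning

    $p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$
    $\theta^* = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$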
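For a very small MDP, the objective $E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$ can be evaluated exactly by enumerating every trajectory and weighting its return by $p_\theta(\tau)$. The sketch below does this for a hypothetical two-state, two-action MDP with horizon 2; all tables and names are assumptions for illustration.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP with horizon T = 2; numbers are assumptions.
P0 = [1.0, 0.0]                     # p(s_1)
PI = [[0.5, 0.5], [0.1, 0.9]]       # PI[s][a] = pi_theta(a | s)
TRANS = [                           # TRANS[s][a][s2] = p(s2 | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
REWARD = [[0.0, 0.0], [1.0, 1.0]]   # REWARD[s][a] = r(s, a)
HORIZON = 2


def trajectory_probability(states, actions):
    """p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    The transition out of s_T is left out because s_{T+1} is not part of
    the enumerated trajectory; summed over s_{T+1} it contributes a factor 1.
    """
    prob = P0[states[0]]
    for t in range(HORIZON):
        prob *= PI[states[t]][actions[t]]
        if t + 1 < HORIZON:
            prob *= TRANS[states[t]][actions[t]][states[t + 1]]
    return prob


def expected_return():
    """Sum p_theta(tau) * return(tau) over all trajectories."""
    total = 0.0
    for states in product(range(2), repeat=HORIZON):
        for actions in product(range(2), repeat=HORIZON):
            ret = sum(REWARD[s][a] for s, a in zip(states, actions))
            total += trajectory_probability(states, actions) * ret
    return total


J = expected_return()
print(J)
```

Enumeration grows exponentially with the horizon, so this is only a way to check the definition on toy problems, not a practical estimator.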
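  9. Q function and Value function

    Q function
    $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$
    Computes the expected total reward that can be collected from state $s_t$ and action $a_t$ until the end ($T$).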
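On a finite-horizon tabular problem, the Q function defined above can be computed by a backward recursion from the final timestep. The sketch below assumes a hypothetical two-state, two-action MDP and a fixed policy; the tables and the name `q_values` are illustrative, not taken from the slides.

```python
# Hypothetical 2-state, 2-action MDP and fixed policy; numbers are assumptions.
PI = [[0.5, 0.5], [0.1, 0.9]]       # PI[s][a] = pi(a | s)
TRANS = [                           # TRANS[s][a][s2] = p(s2 | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
REWARD = [[0.0, 0.0], [1.0, 1.0]]   # REWARD[s][a] = r(s, a)
HORIZON = 2


def q_values(horizon):
    """Return q[t][s][a] for t = 0..horizon-1 (0-based time index)."""
    n_s, n_a = len(REWARD), len(REWARD[0])
    q = [None] * horizon
    # At the last timestep only the immediate reward remains.
    q[horizon - 1] = [row[:] for row in REWARD]
    for t in range(horizon - 2, -1, -1):
        # Expected Q under the policy at the next timestep, per next state.
        v_next = [
            sum(PI[s][a] * q[t + 1][s][a] for a in range(n_a))
            for s in range(n_s)
        ]
        q[t] = [
            [
                REWARD[s][a]
                + sum(TRANS[s][a][s2] * v_next[s2] for s2 in range(n_s))
                for a in range(n_a)
            ]
            for s in range(n_s)
        ]
    return q


q = q_values(HORIZON)
print(q[0])
```

Each entry `q[t][s][a]` is the immediate reward plus the expected remaining reward when the policy is followed from the next timestep onward.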
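  10. Q function and Value function

    Value function
    $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t \right]$
    Computes the expected total reward that can be collected from state $s_t$ until the end ($T$).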
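The two definitions are linked by standard identities, written out here for reference: the value function averages the Q function over the policy's action choice, and the expected value of the initial state equals the reinforcement learning objective.

```latex
V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right],
\qquad
\mathbb{E}_{s_1 \sim p(s_1)}\left[ V^\pi(s_1) \right]
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} r(s_t, a_t) \right]
```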
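  11. Why there are so many reinforcement learning algorithms

    - Amount of samples required
    - Hyperparameters
    - Stability (convergence)
    - MDP: stochastic or deterministic
    - Continuous or discrete
    - Finite or infinite MDP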