Introduction Deep Reinforcement Learning

Wonseok Jung
November 20, 2018

Transcript

  1. Wonseok Jung

    City University of New York, Baruch College, Data Science major
    ConnexionAI, Founder
    Deep Learning College, Reinforcement Learning Researcher
    Modulabs CTRL, Leader
    Reinforcement Learning, Object Detection, Chatbot

    GitHub: https://github.com/wonseokjung
    Facebook: https://www.facebook.com/wsjung
    Blog: https://wonseokjung.github.io
    YouTube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw (channel ID as extracted; some characters may be missing)

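  2. Markov Decision Problem

    $\pi_\theta(a_t \mid o_t)$: policy
    $\pi_\theta(a_t \mid s_t)$: policy (fully observed)
    $s_t$: state
    $o_t$: observation
    $a_t$: action

    Diagram: observations $o_1, o_2, o_3$, states $s_1, s_2, s_3$, and actions $a_1, a_2, a_3$, with consecutive states linked by the transition probability $p(s_{t+1} \mid s_t, a_t)$.

    Markov Decision Problem: defines the world in reinforcement learning, expressed in terms of states, actions, rewards, transitions, and so on.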
  3. Reward functions

    High reward: arriving at the destination safely
    Low reward: a traffic accident

    Policy: learn a policy that drives safely.
    The policy is learned through the reward function (high reward vs. low reward).
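  4. Markov chain (graphically)

    $\vec{\mu}_{t+1} = T \vec{\mu}_t$
    $\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at time step $t$
    $T_{i,j} = p(s_{t+1} = i \mid s_t = j)$: matrix of probabilities that the state becomes $i$ given state $j$

    Diagram: $s_1 \to s_2 \to s_3$, each arrow labeled $p(s_{t+1} \mid s_t)$.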
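
The Markov chain update (next state distribution equals the transition matrix applied to the current one) can be tried out on a toy example. The sketch below is only an illustration: the two-state matrix `T` and the helper name `step_distribution` are assumptions, not part of the slides. It follows the slide's convention that `T[i][j]` is the probability of moving to state `i` from state `j`, so each column sums to 1.

```python
def step_distribution(T, mu):
    """One Markov-chain step: mu_next[i] = sum_j T[i][j] * mu[j]."""
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]


# Hypothetical 2-state chain. Column j holds p(s_{t+1} = i | s_t = j).
T = [
    [0.9, 0.5],
    [0.1, 0.5],
]

mu = [1.0, 0.0]  # start in state 0 with certainty

# Propagate the state distribution three time steps forward.
for _ in range(3):
    mu = step_distribution(T, mu)
```

Because every column of `T` sums to 1, `mu` stays a valid probability distribution after each step.
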
  5. Markov decision process

    $M = \{S, A, T, r\}$
    $S$: state space, $s \in S$
    $A$: action space, $a \in A$
    $T$: transition operator
    $r$: reward function

    Diagram: $s_1, a_1 \to s_2, a_2 \to s_3$, each transition labeled $p(s_{t+1} \mid s_t, a_t)$.
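  6. Markov decision process

    $\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at time step $t$
    $\xi_{t,k} = p(a_t = k)$: probability that the action is $k$ at time step $t$
    $T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$: probability that the state is $i$ at time step $t+1$, given that the state is $j$ and the action is $k$ at time step $t$
    $r : S \times A \to \mathbb{R}$: reward function
    $\mu_{t+1,i} = \sum_{j,k} T_{i,j,k}\, \mu_{t,j}\, \xi_{t,k}$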
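
With actions in the picture, the next-state distribution sums the transition tensor over both the current state and the action. A minimal sketch of that sum follows, assuming (as the slide's notation does) that the action distribution does not depend on the state. The two-state, two-action tensor and the name `next_state_distribution` are hypothetical.

```python
def next_state_distribution(T, mu, xi):
    """mu_next[i] = sum over j, k of T[i][j][k] * mu[j] * xi[k]."""
    n_states, n_actions = len(mu), len(xi)
    return [
        sum(
            T[i][j][k] * mu[j] * xi[k]
            for j in range(n_states)
            for k in range(n_actions)
        )
        for i in range(n_states)
    ]


# Hypothetical MDP: T[i][j][k] = p(s_{t+1} = i | s_t = j, a_t = k).
# Action 0 keeps the current state; action 1 switches state with probability 0.8.
T = [
    [[1.0, 0.2], [0.0, 0.8]],  # next state 0, from state 0 / state 1
    [[0.0, 0.8], [1.0, 0.2]],  # next state 1, from state 0 / state 1
]

mu = [1.0, 0.0]  # currently in state 0
xi = [0.5, 0.5]  # each action is equally likely

mu_next = next_state_distribution(T, mu, xi)
```
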
  7. Partially observed Markov decision process

    $M = \{S, A, O, T, E, r\}$
    $S$: state space, $s \in S$
    $A$: action space, $a \in A$
    $O$: observation space, $o \in O$
    $T$: transition operator
    $E$: emission probability $p(o_t \mid s_t)$
    $r$: reward function

    Diagram: observations $o_1, o_2, o_3$ emitted from states $s_1, s_2, s_3$, with actions $a_1, a_2, a_3$ and transitions labeled $p(s_{t+1} \mid s_t, a_t)$.
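
In the partially observed case the agent never sees the state directly, only an observation drawn from the emission probability. The sketch below illustrates that sampling step. The noisy-sensor table `E` and the helper `emit` are assumptions made up for this example.

```python
import random


def emit(state, E, rng):
    """Sample an observation o_t ~ p(o_t | s_t) from the emission table E[state][obs]."""
    observations = list(range(len(E[state])))
    return rng.choices(observations, weights=E[state])[0]


# Hypothetical noisy sensor: it reports the true state 80% of the time.
E = [
    [0.8, 0.2],  # p(o | s = 0)
    [0.2, 0.8],  # p(o | s = 1)
]

rng = random.Random(0)
samples = [emit(1, E, rng) for _ in range(5000)]

# Fraction of observations that match the true state (1).
match_rate = sum(o == 1 for o in samples) / len(samples)
```

A policy of the form $\pi_\theta(a_t \mid o_t)$ would take these sampled observations, not the hidden state, as its input.
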
  8. The goal of reinforcement learning

    $p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
    $\theta^* = \arg\max_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$
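
The objective above is an expectation over trajectories, so one simple way to approximate it is to sample trajectories and average their total reward. The sketch below does this for a made-up environment and picks the best of three candidate parameters by brute force. The environment, the single-number policy parameter `theta`, and all function names are assumptions for illustration only; practical algorithms optimize the parameter far more efficiently.

```python
import random


def sample_return(theta, horizon, rng):
    """Sample one trajectory and return its total reward."""
    s = 0  # p(s_1): always start in state 0 (assumption)
    total = 0.0
    for _ in range(horizon):
        # a_t ~ pi_theta(a_t | s_t): choose action 1 with probability theta
        a = 1 if rng.random() < theta else 0
        # r(s_t, a_t): reward 1 whenever the agent is in state 1
        total += 1.0 if s == 1 else 0.0
        # s_{t+1} ~ p(. | s_t, a_t): land in state `a` with probability 0.9
        s = a if rng.random() < 0.9 else 1 - a
    return total


def estimate_objective(theta, horizon=5, episodes=2000, seed=0):
    """Monte Carlo estimate of the expected total reward under pi_theta."""
    rng = random.Random(seed)
    returns = [sample_return(theta, horizon, rng) for _ in range(episodes)]
    return sum(returns) / episodes


# Crude stand-in for the argmax over theta: compare a few candidates.
candidates = [0.1, 0.5, 0.9]
best_theta = max(candidates, key=estimate_objective)
```
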
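  9. Q function and Value function

    Q function: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$
    Computes the expected total reward that can be received from state $s_t$ and action $a_t$ until the end ($T$).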
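  10. Q function and Value function

    Value function: $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t \right]$
    Computes the expected total reward that can be received from state $s_t$ until the end ($T$).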
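
For a small tabular problem with a finite horizon, both functions can be computed exactly by working backward from the final step: the Q value is the immediate reward plus the expected value of the next state, and the value is the policy-weighted average of the Q values. The sketch below assumes a hypothetical two-state MDP with known tables `P`, `r`, and `pi`; none of these come from the slides.

```python
def evaluate_policy(P, r, pi, horizon):
    """Return (Q, V) with Q[t][s][a] and V[t][s] for t = 0 .. horizon - 1."""
    n_s, n_a = len(r), len(r[0])
    V_next = [0.0] * n_s  # no reward is collected after the final step
    Q, V = [], []
    for _ in range(horizon):
        # Q(s, a) = r(s, a) + sum over s2 of p(s2 | s, a) * V_next(s2)
        Q_t = [
            [
                r[s][a] + sum(P[s][a][s2] * V_next[s2] for s2 in range(n_s))
                for a in range(n_a)
            ]
            for s in range(n_s)
        ]
        # V(s) = sum over a of pi(a | s) * Q(s, a)
        V_t = [sum(pi[s][a] * Q_t[s][a] for a in range(n_a)) for s in range(n_s)]
        Q.insert(0, Q_t)
        V.insert(0, V_t)
        V_next = V_t
    return Q, V


# Hypothetical MDP: the chosen action deterministically becomes the next state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # P[s=0][a][s2]
    [[1.0, 0.0], [0.0, 1.0]],  # P[s=1][a][s2]
]
r = [[0.0, 0.0], [1.0, 1.0]]    # reward 1 for being in state 1
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform random policy

Q, V = evaluate_policy(P, r, pi, horizon=3)
```

Here `V[0]` holds the expected total reward over all three steps from each starting state, and `Q[0][s][a]` holds the same quantity when the first action is fixed to `a`.
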
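  11. Why there are so many reinforcement learning algorithms

    Amount of samples required
    Hyperparameters
    Stability (convergence)

    MDP:
    Stochastic or deterministic
    Continuous or discrete
    Finite or infinite MDP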