Slide 1

Slide 1 text

RLXDL
Wonseok Jung
Introduction to Reinforcement Learning (feat. CS)

Slide 2

Slide 2 text


Wonseok Jung
City University of New York, Baruch College, Data Science major
ConnexionAI founder
DeepLearningCollege reinforcement learning researcher
Modulabs CTRL leader
Reinforcement Learning, Object Detection, Chatbot
GitHub: https://github.com/wonseokjung
Facebook: https://www.facebook.com/wsjung
Blog: https://wonseokjung.github.io
YouTube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw


Slide 3

Slide 3 text

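Today
1. Definition of the Markov decision process
2. Definition of the reinforcement learning problem
3. Three categories of reinforcement learning algorithms
4. Overview of reinforcement learning algorithm types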

Slide 4

Slide 4 text

Definition

Slide 5

Slide 5 text

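Markov Decision Problem
$s_t$: state
$o_t$: observation
$a_t$: action
$\pi_\theta(a_t \mid o_t)$: policy
$\pi_\theta(a_t \mid s_t)$: policy (fully observed)
Diagram: $(o_1, s_1, a_1) \to (o_2, s_2, a_2) \to (o_3, s_3, a_3)$, each transition labeled $p(s_{t+1} \mid s_t, a_t)$
The Markov Decision Problem defines the world in reinforcement learning, expressed in terms of state, action, reward, transition, and so on.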

Slide 6

Slide 6 text

Reward functions
High reward: arriving safely at the destination
Low reward: a traffic accident
Policy: learn a policy that drives safely
The policy is learned through the reward function (high reward vs. low reward).

Slide 7

Slide 7 text

Markov chain
$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$
$\mathcal{S}$: state space, with states $s \in \mathcal{S}$
$\mathcal{T}$: transition operator

Slide 8

Slide 8 text

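Markov chain, graphically
$\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at timestep $t$
$\mathcal{T}_{i,j} = p(s_{t+1} = i \mid s_t = j)$: matrix of probabilities that the state becomes $i$ given that the current state is $j$
$\vec{\mu}_{t+1} = \mathcal{T} \vec{\mu}_t$
Diagram: $s_1 \to s_2 \to s_3$, each arrow labeled $p(s_{t+1} \mid s_t)$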
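The update $\vec{\mu}_{t+1} = \mathcal{T} \vec{\mu}_t$ is a plain matrix-vector product. The sketch below illustrates it with NumPy; the 3-state transition matrix and the starting distribution are made-up numbers, not taken from the slides.

```python
import numpy as np

# Hypothetical 3-state chain. Column j holds p(s_{t+1} = i | s_t = j),
# so every column sums to 1.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

def step(mu, T):
    """One step of the chain: mu_{t+1} = T mu_t."""
    return T @ mu

mu0 = np.array([1.0, 0.0, 0.0])  # start in state 0 with probability 1
mu1 = step(mu0, T)               # equals the first column of T
mu2 = step(mu1, T)
```

Because each column of `T` is a probability distribution, every `mu` produced this way still sums to 1.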

Slide 9

Slide 9 text

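Markov decision process
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$
$\mathcal{S}$: state space, with states $s \in \mathcal{S}$
$\mathcal{A}$: action space, with actions $a \in \mathcal{A}$
$\mathcal{T}$: transition operator
$r$: reward function
Diagram: $(s_1, a_1) \to (s_2, a_2) \to s_3$, each transition labeled $p(s_{t+1} \mid s_t, a_t)$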

Slide 10

Slide 10 text

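Markov decision process
$\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at timestep $t$
$\xi_{t,k} = p(a_t = k)$: probability that the action is $k$ at timestep $t$
$\mathcal{T}_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$: probability that the state is $i$ at timestep $t+1$, given that the state is $j$ and the action is $k$ at timestep $t$
$\mu_{t+1,i} = \sum_{j,k} \mathcal{T}_{i,j,k} \, \mu_{t,j} \, \xi_{t,k}$
$r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function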
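With actions included, the transition operator becomes a 3-index tensor, and the next state distribution $\mu_{t+1,i} = \sum_{j,k} \mathcal{T}_{i,j,k} \, \mu_{t,j} \, \xi_{t,k}$ is a tensor contraction. A minimal sketch with `np.einsum` follows; the 2-state, 2-action numbers are hypothetical. As in the formula, the action distribution is treated as independent of the state.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions.
# T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k)
T = np.zeros((2, 2, 2))
T[:, 0, 0] = [0.8, 0.2]  # from state 0, taking action 0
T[:, 0, 1] = [0.1, 0.9]  # from state 0, taking action 1
T[:, 1, 0] = [0.5, 0.5]  # from state 1, taking action 0
T[:, 1, 1] = [0.0, 1.0]  # from state 1, taking action 1

mu_t = np.array([0.6, 0.4])  # mu_{t,j} = p(s_t = j)
xi_t = np.array([0.5, 0.5])  # xi_{t,k} = p(a_t = k)

# mu_{t+1,i} = sum_{j,k} T[i, j, k] * mu_t[j] * xi_t[k]
mu_next = np.einsum("ijk,j,k->i", T, mu_t, xi_t)
```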

Slide 11

Slide 11 text

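Partially observed Markov decision process
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$
$\mathcal{S}$: state space, with states $s \in \mathcal{S}$
$\mathcal{A}$: action space, with actions $a \in \mathcal{A}$
$\mathcal{O}$: observation space, with observations $o \in \mathcal{O}$
$\mathcal{T}$: transition operator
$\mathcal{E}$: emission probability $p(o_t \mid s_t)$
$r$: reward function
Diagram: $(o_1, s_1, a_1) \to (o_2, s_2, a_2) \to (o_3, s_3, a_3)$, each transition labeled $p(s_{t+1} \mid s_t, a_t)$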

Slide 12

Slide 12 text

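The goal of reinforcement learning
$\pi_\theta(a \mid s)$: the policy
$\theta$: the weights of the neural network; the policy is parameterized by $\theta$
The neural network takes the state as input and outputs an action.
The environment takes the action as input and outputs the next state.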

Slide 13

Slide 13 text

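The goal of reinforcement learning
$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$
$\theta^\star = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$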
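The expectation over trajectories $\tau \sim p_\theta(\tau)$ can be approximated by sampling: roll the policy out in the environment many times and average the total reward. The sketch below does this for a small tabular MDP. The transition tensor, reward table, horizon, and softmax policy parameterization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP: 2 states, 2 actions, horizon 5.
T = np.zeros((2, 2, 2))      # T[i, j, k] = p(s_{t+1}=i | s_t=j, a_t=k)
T[:, 0, 0] = [0.9, 0.1]
T[:, 0, 1] = [0.2, 0.8]
T[:, 1, 0] = [0.5, 0.5]
T[:, 1, 1] = [0.0, 1.0]
r = np.array([[0.0, 1.0],    # r[s, a]
              [0.5, 2.0]])
p_s1 = np.array([1.0, 0.0])  # p(s_1): always start in state 0
HORIZON = 5

def policy(theta, s):
    """Softmax policy pi_theta(a | s) with one logit per (state, action)."""
    logits = theta[s] - theta[s].max()
    probs = np.exp(logits)
    return probs / probs.sum()

def rollout(theta):
    """Sample one trajectory tau ~ p_theta(tau) and return its total reward."""
    s = rng.choice(2, p=p_s1)
    total = 0.0
    for _ in range(HORIZON):
        a = rng.choice(2, p=policy(theta, s))
        total += r[s, a]
        s = rng.choice(2, p=T[:, s, a])
    return total

def estimate_objective(theta, n_samples=1000):
    """Monte Carlo estimate of E_{tau ~ p_theta}[sum_t r(s_t, a_t)]."""
    return float(np.mean([rollout(theta) for _ in range(n_samples)]))

theta = np.zeros((2, 2))     # all-zero logits give a uniform policy
J = estimate_objective(theta)
```

Finding $\theta^\star$ then amounts to searching for the `theta` that makes this estimate as large as possible.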

Slide 14

Slide 14 text

The anatomy of a reinforcement learning algorithm
1. Generate samples using the policy.
2. Estimate the return.
3. Update the policy using the return (reward).
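The three steps form a loop that repeats until the policy is good enough. The sketch below shows that loop on a deliberately tiny problem with one state, two actions, and one-step episodes. The slide does not say how the update is done, so a basic score-function update is swapped in here purely for illustration. The reward table, learning rate, and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARD = np.array([0.0, 1.0])  # hypothetical: action 1 pays 1, action 0 pays 0

def prob_action1(theta):
    """Policy pi_theta(a = 1): a sigmoid of a single parameter."""
    return 1.0 / (1.0 + np.exp(-theta))

def generate_samples(theta, n=20):
    """Step 1: run the policy to collect actions and rewards."""
    actions = (rng.random(n) < prob_action1(theta)).astype(int)
    return actions, REWARD[actions]

def estimate_return(rewards):
    """Step 2: estimate the return (one-step episodes, so just the reward)."""
    return rewards

def update_policy(theta, actions, returns, lr=0.5):
    """Step 3: nudge theta toward actions that earned a higher return."""
    grad_log_pi = actions - prob_action1(theta)  # d/dtheta log pi_theta(a)
    return theta + lr * np.mean(grad_log_pi * returns)

theta = 0.0  # start with a 50/50 policy
for _ in range(50):
    actions, rewards = generate_samples(theta)
    returns = estimate_return(rewards)
    theta = update_policy(theta, actions, returns)
```

After the loop, `prob_action1(theta)` has moved above 0.5, so the policy prefers the rewarded action.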

Slide 15

Slide 15 text

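Q function and Value function
Q function: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$
The Q function computes the expected total reward that can be received from taking action $a_t$ in state $s_t$ until the end ($T$).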

Slide 16

Slide 16 text

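Q function and Value function
Value function: $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t \right]$
The value function computes the expected total reward that can be received from state $s_t$ until the end ($T$).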
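For a small tabular MDP with a known model, both quantities can be computed exactly by working backward from the final timestep. This uses the standard identities $Q_t(s, a) = r(s, a) + \sum_{s'} p(s' \mid s, a) V_{t+1}(s')$ and $V_t(s) = \sum_a \pi(a \mid s) Q_t(s, a)$. The sketch below is one way to do this; the deterministic 2-state MDP, the uniform policy, and the horizon are hypothetical.

```python
import numpy as np

# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
T = np.zeros((2, 2, 2))    # T[i, j, k] = p(s' = i | s = j, a = k)
T[:, 0, 0] = [1.0, 0.0]
T[:, 0, 1] = [0.0, 1.0]
T[:, 1, 0] = [1.0, 0.0]
T[:, 1, 1] = [0.0, 1.0]
r = np.array([[0.0, 1.0],  # r[s, a]
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)  # pi[s, a]: uniform policy
horizon = 3

def q_and_v(T, r, pi, horizon):
    """Backward recursion for Q_t(s, a) and V_t(s), t = 0 .. horizon - 1."""
    n_states, n_actions = r.shape
    Q = np.zeros((horizon, n_states, n_actions))
    V = np.zeros((horizon + 1, n_states))  # V[horizon] = 0: nothing left to earn
    for t in reversed(range(horizon)):
        # Q_t(s, a) = r(s, a) + sum_i T[i, s, a] * V_{t+1}(i)
        Q[t] = r + np.einsum("isa,i->sa", T, V[t + 1])
        # V_t(s) = sum_a pi(a | s) * Q_t(s, a)
        V[t] = np.sum(pi * Q[t], axis=1)
    return Q, V[:horizon]

Q, V = q_and_v(T, r, pi, horizon)
```

Here `V[0]` is the expected total reward over all three steps from each starting state, and `Q[0]` is the same quantity with the first action fixed.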

Slide 17

Slide 17 text

Types of RL algorithms
Policy gradients
Value-based
Actor-critic
Model-based RL

Slide 18

Slide 18 text

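Why are there so many reinforcement learning algorithms?
Amount of samples required
Hyperparameters
Stability (convergence)
MDP: stochastic or deterministic
Continuous or discrete
Finite or infinite MDP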


Slide 19

Slide 19 text

NEXT chapter
Policy gradients