Introduction Deep Reinforcement Learning
Wonseok Jung
November 20, 2018
Transcript
RLXDL
Wonseok Jung
Introduction to Reinforcement Learning (feat. CS)
Wonseok Jung
City University of New York, Baruch College, Data Science Major
ConnexionAI Founder
Deep Learning College Reinforcement Learning Researcher
Modulabs CTRL Leader
Reinforcement Learning, Object Detection, Chatbot
GitHub: https://github.com/wonseokjung
Facebook: https://www.facebook.com/wsjung
Blog: https://wonseokjung.github.io
YouTube: https://www.youtube.com/channel/UCmTxWKdhlWvJUfrRw (channel ID incomplete in the transcript)
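Today
Markov decision process
The reinforcement learning problem
Three categories of reinforcement learning algorithms
Overview of reinforcement learning algorithm types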
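Markov Decision Problem
$\pi_\theta(a_t \mid o_t)$: policy
$\pi_\theta(a_t \mid s_t)$: policy (fully observed)
$s_t$: state, $o_t$: observation, $a_t$: action
Diagram: $(o_1, s_1, a_1) \to (o_2, s_2, a_2) \to (o_3, s_3, a_3)$, with transitions $p(s_{t+1} \mid s_t, a_t)$
Markov Decision Problem
In reinforcement learning, the world is represented in terms of state, action, reward, transition, and so on.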
Reward functions
High reward: arriving safely at the destination
Low reward: a traffic accident
Policy
Learn a policy that drives safely
High reward, low reward: the policy is learned through the reward function
Markov chain
$\mathcal{M} = \{S, T\}$
$S$: state space, $s \in S$
$T$: transition operator
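Markov chain, graphically
$\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at time step $t$
$T_{i,j} = p(s_{t+1} = i \mid s_t = j)$: matrix of probabilities that the state becomes $i$ given that it was $j$
$\vec{\mu}_{t+1} = T \vec{\mu}_t$
Diagram: $s_1 \to s_2 \to s_3$, with transitions $p(s_{t+1} \mid s_t)$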
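The update $\vec{\mu}_{t+1} = T \vec{\mu}_t$ can be tried out on a toy chain. The following is a minimal sketch in plain Python; the three states and all transition probabilities are made up for illustration and do not come from the slides.

```python
# Hypothetical 3-state Markov chain. T[i][j] = p(s_{t+1} = i | s_t = j),
# so every column of T sums to 1 (the numbers are illustrative only).
T = [
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.5],
    [0.0, 0.1, 0.5],
]

def step(T, mu):
    """Return mu_{t+1} = T mu_t for a state distribution mu_t."""
    n = len(mu)
    return [sum(T[i][j] * mu[j] for j in range(n)) for i in range(n)]

mu = [1.0, 0.0, 0.0]  # start in state 0 with probability 1
history = [mu]
for t in range(3):
    mu = step(T, mu)
    history.append(mu)
    print(t + 1, [round(p, 4) for p in mu])
```

Because each column of T sums to 1, every propagated vector remains a probability distribution.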
Markov decision process
$\mathcal{M} = \{S, A, T, r\}$
$S$: state space, $s \in S$
$A$: action space, $a \in A$
$T$: transition operator
$r$: reward function
Diagram: $(s_1, a_1) \to (s_2, a_2) \to s_3$, with transitions $p(s_{t+1} \mid s_t, a_t)$
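Markov decision process
$\mu_{t,i} = p(s_t = i)$: probability that the state is $i$ at time step $t$
$\xi_{t,k} = p(a_t = k)$: probability that the action is $k$ at time step $t$
$T_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$: probability that the state is $i$ at time step $t+1$, given state $j$ and action $k$ at time step $t$
$r : S \times A \to \mathbb{R}$: reward function
$\mu_{t+1,i} = \sum_{j,k} T_{i,j,k} \, \mu_{t,j} \, \xi_{t,k}$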
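The sum over $j$ and $k$ can be written out directly. This is a small sketch under assumed numbers: a hypothetical MDP with two states and two actions, and, as in the formula, an action distribution $\xi_t$ that does not depend on the state.

```python
# T[i][j][k] = p(s_{t+1} = i | s_t = j, a_t = k) for a hypothetical MDP with
# 2 states and 2 actions. For every (j, k) the entries sum to 1 over i.
T = [
    [[0.9, 0.2], [0.5, 0.0]],  # next state i = 0
    [[0.1, 0.8], [0.5, 1.0]],  # next state i = 1
]

def step(T, mu, xi):
    """mu_{t+1,i} = sum_{j,k} T[i][j][k] * mu[j] * xi[k]."""
    n_s, n_a = len(mu), len(xi)
    return [
        sum(T[i][j][k] * mu[j] * xi[k] for j in range(n_s) for k in range(n_a))
        for i in range(n_s)
    ]

mu = [1.0, 0.0]  # state distribution at time step t
xi = [0.5, 0.5]  # action distribution at time step t
mu_next = step(T, mu, xi)
print(mu_next)
```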
Partially observed Markov decision process
$\mathcal{M} = \{S, A, O, T, E, r\}$
$S$: state space, $s \in S$
$A$: action space, $a \in A$
$O$: observation space, $o \in O$
$T$: transition operator
$E$: emission probability $p(o_t \mid s_t)$
$r$: reward function
Diagram: $(o_1, s_1, a_1) \to (o_2, s_2, a_2) \to (o_3, s_3, a_3)$, with transitions $p(s_{t+1} \mid s_t, a_t)$
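To make the emission probability concrete: given a distribution over hidden states, the distribution over observations follows from $p(o_t = o) = \sum_i p(o_t = o \mid s_t = i) \, p(s_t = i)$. A minimal sketch, where the table `E` and the state distribution `mu` are hypothetical values chosen only for illustration:

```python
# E[o][i] = p(o_t = o | s_t = i): hypothetical emission probabilities for
# 2 observations and 3 hidden states (each column sums to 1).
E = [
    [0.8, 0.5, 0.1],
    [0.2, 0.5, 0.9],
]

def observation_distribution(E, mu):
    """p(o_t = o) = sum_i E[o][i] * mu[i] for a state distribution mu."""
    return [sum(row[i] * mu[i] for i in range(len(mu))) for row in E]

mu = [0.5, 0.3, 0.2]  # assumed distribution over the hidden state s_t
p_obs = observation_distribution(E, mu)
print(p_obs)
```

The agent only sees samples from `p_obs`, not the state itself, which is why the policy on the earlier slide is conditioned on $o_t$ rather than $s_t$ in the partially observed case.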
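The goal of reinforcement learning
$\pi_\theta(a \mid s)$: the policy, parameterized by $\theta$
$\theta$: weights of the artificial neural network
The neural network takes the state as input and outputs an action.
The environment takes the action as input and outputs a state.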
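The goal of reinforcement learning
$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$
$\theta^\star = \arg\max_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$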
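The trajectory distribution and the expected-return objective can be illustrated on a tiny tabular problem. The sketch below assumes a hypothetical MDP with two states and two actions and a fixed tabular policy. It computes the probability of one trajectory from the product formula and estimates the objective by averaging sampled returns. All numbers and names are illustrative.

```python
import random

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
p_s1 = [1.0, 0.0]                        # p(s_1)
P = {                                    # P[(s, a)][s2] = p(s2 | s, a)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # r(s, a)
pi = {0: [0.5, 0.5], 1: [0.1, 0.9]}      # pi(a | s), a fixed tabular policy

def trajectory_prob(states, actions):
    """p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    `states` has one more entry than `actions` (it includes s_{T+1}).
    """
    prob = p_s1[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s][a] * P[(s, a)][s_next]
    return prob

def sample_return(rng, horizon):
    """Roll out the policy for `horizon` steps and sum the rewards."""
    s = rng.choices([0, 1], weights=p_s1)[0]
    total = 0.0
    for _ in range(horizon):
        a = rng.choices([0, 1], weights=pi[s])[0]
        total += r[(s, a)]
        s = rng.choices([0, 1], weights=P[(s, a)])[0]
    return total

prob = trajectory_prob([0, 1, 1], [1, 1])  # 1.0 * (0.5 * 0.8) * (0.9 * 1.0)
rng = random.Random(0)
n = 2000
J_hat = sum(sample_return(rng, 5) for _ in range(n)) / n  # Monte Carlo estimate
print(prob, J_hat)
```

Maximizing this estimated quantity over the policy parameters is the optimization problem stated on the slide. Here the policy is held fixed, so the code only evaluates it.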
The anatomy of a reinforcement learning algorithm
Generate samples using the policy
Estimate the return
Update the policy using the return (reward)
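The three steps can be sketched as one loop. The example below is an assumption-laden toy, not the method from the slides: a single-state problem with two actions, a softmax policy over two preference parameters, and a REINFORCE-style gradient step as the policy update. The reward means, noise level, learning rate, and batch size are all hypothetical.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of preferences."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def run(iterations=200, batch=20, lr=0.5, seed=0):
    rng = random.Random(seed)
    mean_reward = [0.2, 0.8]  # hypothetical expected reward of each action
    theta = [0.0, 0.0]        # policy parameters (action preferences)
    for _ in range(iterations):
        probs = softmax(theta)
        # 1) Generate samples by running the current policy.
        actions = rng.choices([0, 1], weights=probs, k=batch)
        rewards = [mean_reward[a] + rng.gauss(0.0, 0.1) for a in actions]
        # 2) Estimate the return; the batch average serves as a baseline.
        baseline = sum(rewards) / batch
        # 3) Improve the policy: for a softmax policy,
        #    d log pi(a) / d theta_b = 1[a == b] - pi(b).
        grad = [0.0, 0.0]
        for a, rew in zip(actions, rewards):
            for b in range(2):
                indicator = 1.0 if a == b else 0.0
                grad[b] += (indicator - probs[b]) * (rew - baseline) / batch
        theta = [th + lr * g for th, g in zip(theta, grad)]
    return softmax(theta)

probs = run()
print(probs)  # probability mass should shift toward the higher-reward action
```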
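Q function and Value function
Q function
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$
The expected total reward obtainable from taking action $a_t$ in state $s_t$ until the end ($T$).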
Q function and Value function
Value function
$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta} \left[ r(s_{t'}, a_{t'}) \mid s_t \right]$
The expected total reward obtainable from state $s_t$ until the end ($T$).
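For a small tabular problem with a finite horizon, both functions can be computed exactly by working backward from the last step, using $Q(s, a) = r(s, a) + \sum_{s'} p(s' \mid s, a) V(s')$ and $V(s) = \sum_a \pi(a \mid s) Q(s, a)$, with the value after the final step set to zero. The sketch below assumes a hypothetical MDP with two states and two actions whose rewards, transitions, and policy do not change over time.

```python
# Hypothetical 2-state, 2-action MDP and fixed policy (illustrative numbers).
P = {                                    # P[(s, a)][s2] = p(s2 | s, a)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # r(s, a)
pi = {0: [0.5, 0.5], 1: [0.1, 0.9]}      # pi(a | s)

def evaluate(steps_left):
    """Return (Q, V) when `steps_left` >= 1 rewards remain to be collected."""
    Q = {}
    V = [0.0, 0.0]  # nothing can be collected after the final step
    for _ in range(steps_left):
        # Q(s, a): immediate reward plus expected value of the next state.
        Q = {
            (s, a): r[(s, a)] + sum(P[(s, a)][s2] * V[s2] for s2 in (0, 1))
            for s in (0, 1) for a in (0, 1)
        }
        # V(s): average of Q(s, a) over the policy's action distribution.
        V = [sum(pi[s][a] * Q[(s, a)] for a in (0, 1)) for s in (0, 1)]
    return Q, V

Q1, V1 = evaluate(1)
Q2, V2 = evaluate(2)
print(V1, Q2[(0, 1)])
```

With one step left, Q equals the reward table and V is its policy-weighted average. Each extra step adds the expected value of the state reached next.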
Types of RL algorithms
Policy gradients
Value-based
Actor-critic
Model-based RL
Why there are so many reinforcement learning algorithms
Required samples
Hyperparameters
Stability
Convergence
MDP
Stochastic or deterministic
Continuous or discrete
Finite or infinite MDP
NEXT chapter: Policy gradients