Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Introduction Deep Reinforcement Learning
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Wonseok Jung
November 20, 2018
180
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Introduction Deep Reinforcement Learning
Wonseok Jung
November 20, 2018
More Decks by Wonseok Jung
See All by Wonseok Jung
Ai for business -self car driving
wonseokjung
0
210
reinforcement_learning_.pdf
wonseokjung
2
1.5k
원석이의 모두연에서 강화학습 보석되기
wonseokjung
0
440
NeuralIPS
wonseokjung
0
440
Deep reinforcemenet learning -2
wonseokjung
0
220
Deep Reinforcement Learning - Introduction
wonseokjung
1
670
How to become a datascientist ?
wonseokjung
2
2.4k
Review of Taylor series
wonseokjung
1
130
꿈꾸는 Agent
wonseokjung
2
170
Featured
See All Featured
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
We Are The Robots
honzajavorek
0
250
Music & Morning Musume
bryan
47
7.2k
Jess Joyce - The Pitfalls of Following Frameworks
techseoconnect
PRO
1
170
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
34
2.8k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
3.5k
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.3k
Deep Space Network (abreviated)
tonyrice
0
210
Faster Mobile Websites
deanohume
310
31k
The agentic SEO stack - context over prompts
schlessera
0
820
SEOcharity - Dark patterns in SEO and UX: How to avoid them and build a more ethical web
sarafernandez
0
210
Transcript
3-9%- 8POTFPL+VOH *OUSPEVDUJPOUP3FJOGPSDFNFOU-FBSOJOH GFBUDT
8POTFPL+VOH $JUZ6OJWFSTJUZPG/FX:PSL#BSVDI$PMMFHF %BUB4DJFODF.BKPS $POOFYJPO"*'PVOEFS %FFQ-FBSOJOH$PMMFHF3FJOGPSDFNFOU-FBSOJOH3FTFBSDIFS .PEVMBCT$53--FBEFS 3FJOGPSDFNFOU-FBSOJOH 0CKFDU%FUFDUJPO
$IBUCPU (JUIVC IUUQTHJUIVCDPNXPOTFPLKVOH 'BDFCPPL IUUQTXXXGBDFCPPLDPNXTKVOH #MPH IUUQTXPOTFPLKVOHHJUIVCJP :PVUVCF IUUQTXXXZPVUVCFDPNDIBOOFM6$N5Y8,EIM8W+6GS3X
5PEBZ .BSLPWEFDJTJPOQSPDFTT ъചणޙઁ ъചणঌҊ્ܻࣁо࠙ܨ ъചणঌҊ્ܻఋੑѐਃ
.BSLPW%FDJTJPO1SPCMFN πθ (at ∣ ot ) πθ (at ∣ st
) - policy - policy ( fully observed ) st ot at - state - observation - action o1 s1 a1 o2 s2 a2 o3 s3 a3 p(st+1 ∣ st , at ) p(st+1 ∣ st , at ) .BSLPW%FDJTJPO1SPCMFN TUBUF BDUJPO SFXBSE USBOTJUJPO١ਵ۽അغחъചणীࢲࣁ࢚ਸ
3FXBSEGVODUJPOT )JHI3FXBSE ֫ࠁ࢚ উೞѱݾীب -PX3FXBSE ծࠁ࢚ ҮాࢎҊ
1PMJDZউೞѱೞѱೞח଼ण )JHI3FXBSE -PX3FXBSE 3FXBSEGVODUJPOਸా೧ण
.BSLPWDIBJO s ∈ S s T 4UBUFTQBDF 5SBOTJUJPOPQFSBUPS 4UBUFTQBDF M
= S, T
.BSLPWDIBJO HSBQIJDBMMZ →μt +1 = T→μt μt,i = p(st =
i) s1 s2 s3 p(st+1 ∣ st , at ) p(st+1 ∣ st , at ) Ti,j = p(st+1 = i ∣ st = j) TUBUFKоযਸٸTUBUFоJؼഛܫݫܼझ UJNFTUFQUীTUBUFоJੌഛܫ
.BSLPWEFDJTJPOQSPDFTT s1 a1 s2 a2 s3 p(st+1 ∣ st ,
at ) p(st+1 ∣ st , at ) s ∈ S s A T a ∈ A 4UBUFTQBDF "DUJPOTQBDF 5SBOTJUJPOPQFSBUPS 4UBUFTQBDF "DUJPOTQBDF M = {S, A, T, r} r 3FXBSEGVODUJPO
.BSLPWEFDJTJPOQSPDFTT μt,i = p(st = i) Ti,j,k = p(st+1 =
i ∣ st = j, at = k) UJNFTUFQUীTUBUFоJੌഛܫ UJNFTUFQUীࢲTUBUFоKҊBDUJPOLੌ⮶ UJNFTUFQU ীࢲTUBUFоJੌഛܫ ξt,k = p(at = k) UJNFTUFQUীBDUJPOLੌഛܫ r : SxA → R SFXBSEGVODUJPO μt,i = ∑ j,k Ti,j,k μt,j ξt,k
1BSUJBM0CTFSWFE.BSLPWEFDJTJPOQSPDFTT s ∈ S s A T a ∈ A
4UBUFTQBDF "DUJPOTQBDF 5SBOTJUJPOPQFSBUPS 4UBUFTQBDF "DUJPOTQBDF M = {S, A, O, T, E, r} O 0CTFSWBUJPOTQBDF E &NJTTJPOQSPCBCJMJUZ P(ot ∣ st ) r 3FXBSEGVODUJPO o ∈ O PCTFSWBUJPOTQBDF o1 s1 a1 o2 s2 a2 o3 s3 a3 p(st+1 ∣ st , at ) p(st+1 ∣ st , at )
5IFHPBMPGSFJOGPSDFNFOUMFBSOJOH πθ (a ∣ s) θ ੋҕन҃ݎXFJHIUT 1PMJDZחۄݫఠܳ߸ೠ θ ੋҕन҃ݎੑ۱ਵ۽TUBUFܳ߉ҊBDUJPOਸ۱ೠ
ജ҃BDUJPOਸੑ۱ਵ۽߉ҊTUBUFܳ۱ೠ
5IFHPBMPGSFJOGPSDFNFOUMFBSOJOH pθ (s1 , a1 , . . . .
. , ST , aT ) = p(s1 ) T ∏ t=1 πθ (at ∣ st )p(st+1 ∣ st , at ) pθ (τ) θ* = argmaxθ Eτ pθ (τ) [∑ t r(st , at )]
5IFBOBUPNZPGBSFJOGPSDFNFOUMFBSOJOHBMHPSJUIN 1PMJDZܳࢎਊೞৈ ࢠࢤࢿ SFUVSOчਸஏ SFUVSO SFXBSE ܳਊೞৈ଼
সؘ
2GVODUJPOBOE7BMVFGVODUJPO Qπ(st , at ) = ∑T t′=t Eπθ [r(s′
t , a′ t ) ∣ st , at ] 2GVODUJPO 4UBUFU BDUJPOUীࢲࠗఠզٸө 5 ߉ਸࣻח୨SFXBSEӝчਸ҅
2GVODUJPOBOE7BMVFGVODUJPO 7BMVFGVODUJPO Vπ(st ) = T ∑ t′=t Eπθ [r(s′
t , a′ t ) ∣ st ] TUBUFUীࢲࠗఠզٸө 5 ߉ਸࣻח୨SFXBSEӝчਸ҅
5ZQFPG3-BMHPSJUINT 1PMJDZHSBEJFOUT 7BMVFCBTFE "DUPSDSJUJD .PEFMCBTFE3-
ъചणঌҊ્ܻ݆ਬ ਃೠࢠ ೞಌۄݫఠ উࢿ $POWFSHF
.%1 4UPDIBTUJD %FUFSNJOJTUJD $POUJOVPVT %JTDSFUF .%1ೠ ޖೠ
/&95DIBQUFS 1PMJDZHSBEJFOUT