
Deep and Reinforced Learning

tewei
May 28, 2015

Deep Learning and Reinforcement Learning


Transcript

  1. Contents 1. Intro 2. Deep Learning 3. Reinforcement Learning 4. Experiments 5. Discussion About Me: ML summer camp @ NCTU. Learned RL and DL for… 3 months? http://ml.infor.org/
  2. Intro Machine Learning? Human Learning? How do we learn from our experience? Machine Learning: fields, algorithms … blah blah blah
  3. Deep Learning Pertaining to neural nets. Fundamental ideas: learning nonlinearity, backpropagation, feature selection and dimensionality reduction
  4. Markov Decision Process Environment + agent. States (s), actions (a) available in each state, reward R(s,a,s'), transition probability P(s,a,s'). Policy: which action to take in each state. [Diagram: the agent sends an action to the environment, which returns a reward and a new state]
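A minimal sketch of the agent-environment loop from the slide above. The env and agent objects and their reset/step/act/observe methods are illustrative assumptions, not code from the deck:

# Sketch of the MDP interaction loop (slide 4).
# `env` and `agent` are hypothetical objects with the interfaces assumed below.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # choose action a from the policy
        next_state, reward, done = env.step(action)       # environment returns R(s,a,s') and s'
        agent.observe(state, action, reward, next_state)  # let the agent update itself
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward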
  5. Q-Learning Q-value: the expected (discounted) future reward for taking an action in a state. Taking the action with the maximum Q-value gives the optimal policy!
  6. Q-Learning [Backup diagram: state s with actions a, leading to successor states s' with actions a'] Q*(s,a) = sum over s' of P(s,a,s') * (R(s,a,s') + r * max over a' of Q*(s',a')), where r is the discount factor
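The same Bellman optimality equation in standard notation, writing the discount factor (the slides' r) as gamma:

Q^*(s,a) = \sum_{s'} P(s,a,s') \left( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right)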
  7. POMDP Life isn't always good: we don't know P(s,a,s') and R(s,a,s'), and the MDP may only be partially observable. This calls for model-free methods QAQ
  8. Q-Learning Average each new observation into the previous Q-value: Q(s,a) = Q(s,a) + k*(R(s,a,s') + r*max(Q(s',a')|a') - Q(s,a)) with k << 1. Gather observations with the epsilon-greedy method: with probability p act randomly, and with probability (1-p) act on the current policy
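A sketch of this tabular update with epsilon-greedy exploration. The env object, its reset/step interface, and the default constants are assumptions for illustration; states are assumed hashable:

import random
from collections import defaultdict

# Tabular Q-learning with epsilon-greedy exploration (slide 8).
# k: learning rate, r: discount factor, p: probability of a random action.
def q_learning(env, actions, episodes=500, k=0.1, r=0.99, p=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value, default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < p:              # explore with probability p
                a = random.choice(actions)
            else:                                # otherwise act greedily on the current Q
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, reward, done = env.step(a)
            if done:
                target = reward                  # no future reward from a terminal state
            else:
                target = reward + r * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += k * (target - Q[(s, a)])   # average the observation into Q(s,a)
            s = s2
    return Q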
  9. Problems Too many states Orz. Apply feature selection, approximate Q-values (LR, NN…), but there are no end-to-end approaches QAQ
  10. Deep Q-Learning 1. Use artificial neural networks to approximate Q-values 2. Experience replay. Model-free, feature selection done OwO
  11. Deep Q-Learning “Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on.”
  12. NN and Q-values Input: state. Output: Q-values for each action. Update Q(s,a) = Q(s,a) + k*(R(s,a,s') + r*max(Q(s',a')|a') - Q(s,a)); training pairs (X,Y) = (state, R(s,a,s') + r*max(Q(s',a')|a')). Neural dynamic programming @@
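A rough sketch of experience replay plus the training targets above. The q_network object and its predict/train methods are hypothetical stand-ins for the deck's Theano MLP; actions are assumed to be integer indices:

import random
import numpy as np
from collections import deque

# Experience replay and Q-target construction (slides 10-12).
# r is the discount factor; q_network.predict(states) returns one row of Q-values per state.
replay = deque(maxlen=100000)                      # stores (s, a, reward, s2, done)

def store(s, a, reward, s2, done):
    replay.append((s, a, reward, s2, done))

def train_step(q_network, batch_size=32, r=0.99):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)      # random sampling breaks sample correlations
    states  = np.array([t[0] for t in batch])
    states2 = np.array([t[3] for t in batch])
    q      = q_network.predict(states)             # current Q(s, .)
    q_next = q_network.predict(states2)            # Q(s', .) used for bootstrapping
    for i, (s, a, reward, s2, done) in enumerate(batch):
        target = reward if done else reward + r * np.max(q_next[i])
        q[i][a] = target                           # only the taken action's output is changed
    q_network.train(states, q)                     # regress the network toward the targets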
  13. Playing Atari with Deep Reinforcement Learning “The leftmost two plots in figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress.”
  14. Implementation Theano~! Emulator, MLP with RMSprop (without dropout), Deep Q-Learning https://github.com/Newmu/Theano-Tutorials https://github.com/tewei/MLatINFOR/blob/master/DeepQLearning.ipynb
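A minimal numpy sketch of the RMSprop update used to train the MLP; the learning rate, decay, and epsilon values are illustrative defaults, not taken from the notebook:

import numpy as np

# RMSprop parameter update (slide 14). param, grad, and cache are numpy arrays of the same shape.
def rmsprop_update(param, grad, cache, lr=0.001, decay=0.9, eps=1e-6):
    cache[:] = decay * cache + (1 - decay) * grad ** 2   # running average of squared gradients
    param -= lr * grad / (np.sqrt(cache) + eps)          # scale each step by recent gradient magnitude
    return param, cache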
  15. Comparison Random moves: average 230 points QAQ. The Deep Q-Learning agent tends to put tiles on the sides. Good results == NaN. Either Deep Q-Learning failed or my implementation failed Orz
  16. Discussion What caused training to stagnate? How do architecture and hyperparameters affect performance? How to explore actions? Multi-agent games?