
Deep and Reinforced Learning

tewei
May 28, 2015

Deep Learning and Reinforcement Learning


Transcript

  1. Contents 1. Intro 2. Deep Learning 3. Reinforcement Learning 4. Experiments 5. Discussion About Me: ML summer camp @ NCTU. Learned RL and DL for… 3 months? http://ml.infor.org/
  2. Intro Machine Learning? Human Learning? How do we learn from our experience? Machine Learning: fields, algorithms … blah blah blah
  3. Deep Learning Pertaining to neural nets. Fundamental ideas: learning nonlinearity, backpropagation, feature selection and dimensionality reduction
  4. Markov Decision Process Environment + agent. States (s), actions (a) available in each state, reward R(s,a,s'), transition probability P(s,a,s'). Policy: which action to take in each state. [Diagram: the agent sends an action to the environment, which returns a reward and a new state]
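A minimal sketch of the agent-environment loop from the slide above. The env and agent objects and their reset/step/act/observe methods are illustrative assumptions, not code from the deck:

# Sketch of the MDP interaction loop (slide 4).
# `env` and `agent` are hypothetical objects with the interfaces assumed below.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # choose action a from the policy
        next_state, reward, done = env.step(action)       # environment returns R(s,a,s') and s'
        agent.observe(state, action, reward, next_state)  # let the agent update itself
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward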
  5. Q-Learning Q-value: the expected (discounted) future reward for taking an action in a state. Taking the action with the maximum Q-value gives the optimal policy!
  6. Q-Learning [Backup diagram: state s with actions a, leading to successor states s' with actions a'] Q*(s,a) = sum over s' of P(s,a,s') * (R(s,a,s') + r * max over a' of Q*(s',a')), where r is the discount factor
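The same Bellman optimality equation in standard notation, writing the discount factor (the slides' r) as gamma:

Q^*(s,a) = \sum_{s'} P(s,a,s') \left( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right)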
  7. POMDP Life isn't always good: we don't know P(s,a,s') and R(s,a,s'), and the MDP may only be partially observable. This calls for model-free methods QAQ
  8. Q-Learning Average each new observation into the previous Q-value: Q(s,a) = Q(s,a) + k*(R(s,a,s') + r*max(Q(s',a')|a') - Q(s,a)) with k << 1. Gather observations with the epsilon-greedy method: with probability p act randomly, and with probability (1-p) act on the current policy
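A sketch of this tabular update with epsilon-greedy exploration. The env object, its reset/step interface, and the default constants are assumptions for illustration; states are assumed hashable:

import random
from collections import defaultdict

# Tabular Q-learning with epsilon-greedy exploration (slide 8).
# k: learning rate, r: discount factor, p: probability of a random action.
def q_learning(env, actions, episodes=500, k=0.1, r=0.99, p=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> value, default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < p:              # explore with probability p
                a = random.choice(actions)
            else:                                # otherwise act greedily on the current Q
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, reward, done = env.step(a)
            if done:
                target = reward                  # no future reward from a terminal state
            else:
                target = reward + r * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += k * (target - Q[(s, a)])   # average the observation into Q(s,a)
            s = s2
    return Q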
  9. Problems Too many states Orz. Apply feature selection, approximate Q-values (LR, NN…), but there are no end-to-end approaches QAQ
  10. Deep Q-Learning 1. Use artificial neural networks to approximate Q-values 2. Experience replay. Model-free, feature selection done OwO
  11. Deep Q-Learning “Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on.”
  12. NN and Q-values Input: state. Output: Q-values for each action. Update Q(s,a) = Q(s,a) + k*(R(s,a,s') + r*max(Q(s',a')|a') - Q(s,a)); training pairs (X,Y) = (state, R(s,a,s') + r*max(Q(s',a')|a')). Neural dynamic programming @@
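A rough sketch of experience replay plus the training targets above. The q_network object and its predict/train methods are hypothetical stand-ins for the deck's Theano MLP; actions are assumed to be integer indices:

import random
import numpy as np
from collections import deque

# Experience replay and Q-target construction (slides 10-12).
# r is the discount factor; q_network.predict(states) returns one row of Q-values per state.
replay = deque(maxlen=100000)                      # stores (s, a, reward, s2, done)

def store(s, a, reward, s2, done):
    replay.append((s, a, reward, s2, done))

def train_step(q_network, batch_size=32, r=0.99):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)      # random sampling breaks sample correlations
    states  = np.array([t[0] for t in batch])
    states2 = np.array([t[3] for t in batch])
    q      = q_network.predict(states)             # current Q(s, .)
    q_next = q_network.predict(states2)            # Q(s', .) used for bootstrapping
    for i, (s, a, reward, s2, done) in enumerate(batch):
        target = reward if done else reward + r * np.max(q_next[i])
        q[i][a] = target                           # only the taken action's output is changed
    q_network.train(states, q)                     # regress the network toward the targets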
  13. Playing Atari with Deep Reinforcement Learning “The leftmost two plots in figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout. Both averaged reward plots are indeed quite noisy, giving one the impression that the learning algorithm is not making steady progress.”
  14. Implementation Theano~! Emulator, MLP with RMSprop (without dropout), Deep Q-Learning https://github.com/Newmu/Theano-Tutorials https://github.com/tewei/MLatINFOR/blob/master/DeepQLearning.ipynb
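A minimal numpy sketch of the RMSprop update used to train the MLP; the learning rate, decay, and epsilon values are illustrative defaults, not taken from the notebook:

import numpy as np

# RMSprop parameter update (slide 14). param, grad, and cache are numpy arrays of the same shape.
def rmsprop_update(param, grad, cache, lr=0.001, decay=0.9, eps=1e-6):
    cache[:] = decay * cache + (1 - decay) * grad ** 2   # running average of squared gradients
    param -= lr * grad / (np.sqrt(cache) + eps)          # scale each step by recent gradient magnitude
    return param, cache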
  15. Comparison Random moves: average 230 points QAQ. The Deep Q-Learning agent tends to put tiles on the sides. Good results == NaN. Either Deep Q-Learning failed or my implementation failed Orz
  16. Discussion What caused training to stagnate? How do architecture and hyperparameters affect performance? How to explore actions? Multi-agent games?