
DQN速習会@Wantedly

Brief workshop of Deep Q-Network at Wantedly

Takuma Seno

June 29, 2017

Transcript

  1. What made headlines in reinforcement learning (1): AlphaGo, the AI that defeated the Go champion.
     https://www.youtube.com/watch?v=f_r9smp4-0U
     Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; van den Driessche, George; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc; Dieleman, Sander; Grewe, Dominik; Nham, John; Kalchbrenner, Nal; Sutskever, Ilya; Lillicrap, Timothy; Leach, Madeleine; Kavukcuoglu, Koray; Graepel, Thore; Hassabis, Demis. Mastering the Game of Go with Deep Neural Networks and Tree Search, 2016.
  2. What made headlines in reinforcement learning (2): Deep Q-Network (DQN), which learned to score as well as or better than humans on Atari games.
     https://www.youtube.com/watch?v=TmPfTpjtdgg
     Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin. Playing Atari With Deep Reinforcement Learning, 2013.
     Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis. Human-level control through deep reinforcement learning, 2015.
  3. Markov property: the next state depends only on the current state and action,
     Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }.
     In reinforcement learning, the environment is assumed to have the Markov property.
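     Written out in full (the standard formulation; the slide keeps only the right-hand side), the Markov property says that the earlier history adds nothing once s_t and a_t are known:

         Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0 }
             = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }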
  4. TD learning (2): as an example, suppose the same state transitions occur every time.
     Initial values: V(s_{t+3}) = 0, V(s_{t+2}) = 0, V(s_{t+1}) = 0, V(s_t) = 0, with α = 1, γ = 0.9.
     Transition chain: s_t -> s_{t+1} -> s_{t+2} -> s_{t+3}, with rewards r_{t+1} = 0, r_{t+2} = 0, r_{t+3} = 1.
     Iteration 1:
       V(s_{t+2}) = 0 + {1 + 0 - 0} = 1
       V(s_{t+1}) = 0 + {0 + 0 - 0} = 0
       V(s_t)     = 0 + {0 + 0 - 0} = 0
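     For reference, each update above instantiates the standard tabular TD(0) rule (presumably introduced on the earlier TD learning slide, which is not part of this excerpt); with α = 1 the braced TD error is simply added to the old value:

         V(s_t) ← V(s_t) + α { r_{t+1} + γ V(s_{t+1}) - V(s_t) }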
  5. TD learning (3): same transition chain and rewards as above.
     Values so far: V(s_{t+3}) = 0, V(s_{t+2}) = 1, V(s_{t+1}) = 0, V(s_t) = 0, with α = 1, γ = 0.9.
     Iteration 2:
       V(s_{t+2}) = 1 + {1 + 0 - 1} = 1
       V(s_{t+1}) = 0 + {0 + 0.9 - 0} = 0.9
       V(s_t)     = 0 + {0 + 0 - 0} = 0
  6. TD learning (4): same transition chain and rewards as above.
     Values so far: V(s_{t+3}) = 0, V(s_{t+2}) = 1, V(s_{t+1}) = 0.9, V(s_t) = 0, with α = 1, γ = 0.9.
     Iteration 3:
       V(s_{t+2}) = 1 + {1 + 0 - 1} = 1
       V(s_{t+1}) = 0.9 + {0 + 0.9 - 0.9} = 0.9
       V(s_t)     = 0 + {0 + 0.81 - 0} = 0.81
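     A minimal Python sketch (not from the deck) that reproduces the three iterations above on the fixed chain s_t -> s_{t+1} -> s_{t+2} -> s_{t+3}; the list-based state encoding and variable names are illustrative:

     # Tabular TD(0) on the fixed transition chain from the slides.
     # Rewards along the chain are r_{t+1}=0, r_{t+2}=0, r_{t+3}=1, with alpha=1, gamma=0.9.
     alpha = 1.0
     gamma = 0.9
     rewards = [0.0, 0.0, 1.0]      # r_{t+1}, r_{t+2}, r_{t+3}
     V = [0.0, 0.0, 0.0, 0.0]       # V(s_t), V(s_{t+1}), V(s_{t+2}), V(s_{t+3})

     for iteration in range(1, 4):
         # Replay the episode forward in time and apply
         # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) at each step.
         for i in range(3):
             V[i] += alpha * (rewards[i] + gamma * V[i + 1] - V[i])
         print(f"iteration{iteration}:", [round(v, 4) for v in V])

     # iteration1: [0.0, 0.0, 1.0, 0.0]
     # iteration2: [0.0, 0.9, 1.0, 0.0]
     # iteration3: [0.81, 0.9, 1.0, 0.0]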
  7. A simple policy from the action-value function: if you already have the action-value function, always select the action with the largest action value. For example, in state s with Q(s, a_1) = 0.1, Q(s, a_2) = 0.2, Q(s, a_3) = 0.1, choose a_2. With this there is no need to reason about the next state (thanks to the Markov property).
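     A minimal Python sketch (illustrative, not the deck's code) of this greedy selection over the example Q-values:

     # Greedy policy: pick the action with the largest action value Q(s, a).
     q_values = {"a1": 0.1, "a2": 0.2, "a3": 0.1}   # Q(s, a) for the current state s
     greedy_action = max(q_values, key=q_values.get)
     print(greedy_action)  # -> a2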
  8. What is DQN (Deep Q-Network)? A deep reinforcement learning method published by DeepMind, which learned to achieve high scores on Atari games.
     Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin. Playing Atari With Deep Reinforcement Learning, 2013.
     Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis. Human-level control through deep reinforcement learning, 2015.
  9. DQN's results (from the Nature paper): DQN learned to score higher than humans on Atari 2600[1] games.
     [1] Bellemare, Naddaf, Veness, Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents, 2013.
  10. DQN's tricks: in fact, DQN is more than just a Q-function made deep. It also relies on:
      • Experience Replay
      • Freezing the target network
      • Clipping rewards
      • Skipping frames
      Only with all of these tricks did the scores finally go up! (A sketch of some of them follows below.)
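      A minimal Python sketch (illustrative, not the deck's or DeepMind's code) of two of these tricks, experience replay and reward clipping; target-network freezing and frame skipping are only indicated in comments, and all names here are assumptions:

      import random
      from collections import deque

      class ReplayBuffer:
          """Experience Replay: store transitions and sample random minibatches
          so that training data is decorrelated from the agent's current trajectory."""
          def __init__(self, capacity=100000):
              self.buffer = deque(maxlen=capacity)

          def __len__(self):
              return len(self.buffer)

          def push(self, state, action, reward, next_state, done):
              # Clipping rewards: squash rewards into [-1, 1] so one learning rate
              # works across games whose raw score scales differ wildly.
              clipped = max(-1.0, min(1.0, reward))
              self.buffer.append((state, action, clipped, next_state, done))

          def sample(self, batch_size):
              return random.sample(self.buffer, batch_size)

      # Usage sketch with dummy integer states.
      buffer = ReplayBuffer()
      for step in range(1000):
          buffer.push(state=step, action=0, reward=5.0, next_state=step + 1, done=False)
          if len(buffer) >= 32:
              batch = buffer.sample(32)          # minibatch for the Q-network update
          # Freezing the target network: every N steps, copy the online network's
          # weights into a separate target network used to compute the TD targets.
          # Skipping frames: the agent picks an action only every k-th frame and
          # repeats it in between, which makes both acting and learning cheaper.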