
Vein of AlphaGo

skydome20
July 20, 2018


The vein and vision (v&v) of the algorithms behind AlphaGo and AlphaGo Zero
==========================================================
【2019/10/10:Revision History】
[p.14] correct the size of the KGS data: 30 million -> 160,000 games & 29.4 million (s,a) pairs
[p.16] correct the input size of the policy network (19x19x48 -> 19x19x36); add the description of the 36 input features from the paper
[p.72] add a note: self-play generated 30 million games (data) for AlphaGo
[p.77] correct the input size (7 -> 17); add a note: self-play generated 29 million games (data) for AlphaGo Zero
[p.84] update hyperlinks of reference papers
==========================================================
- Development history (Aja Huang, Rémi Coulom)
- Policy Network
- Monte Carlo Tree Search
- Value Network
- Self Play
- Structural differences between AlphaGo and AlphaGo Zero
==========================================================
[2018.07.20] For POLab (https://github.com/PO-LAB) Summer Rookie Training



Transcript

  1. PROFILE — skydome20: SHAO-YEN HUNG (洪紹嚴), author of the R series of
    notes (R 系列筆記).
    2011 – 2015: NCKU CSIE · 2015 – 2018: NCKU IMIS · 2018 – now: CTL Data
    Science. R and Python learner. https://www.linkedin.com/in/skydome20/
  2. 黃士傑 (Aja Huang) https://www.facebook.com/aja.huang
    • Member of the first cohort of NTNU's Graduate Institute of Computer
      Science and Information Engineering, where he completed both his
      master's and Ph.D.
    • 2003 – master's thesis: 《電腦圍棋打劫的策略》 (ko-fight strategies in
      computer Go)
    • 2011 – Ph.D. thesis: 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》
      (new heuristic algorithms for Monte Carlo tree search in computer Go)
    • Developed the Go program Erica, which beat Zen, the strongest program
      at the time
    • Recruited by DeepMind's David Silver as employee no. 40
    • 2014 – DeepMind was acquired by Google
    • 2014 – 2015 – restarted the Go AI project, bringing in deep learning:
      《Move Evaluation in Go Using Deep Convolutional Neural Networks》
    • 2016 – AlphaGo debuted and stunned the world:
      《Mastering the game of Go with deep neural networks and tree search》
  3. Timeline — Aja Huang & Rémi Coulom
    • 2006: 《Efficient Selectivity and Backup Operators in Monte-Carlo Tree
      Search》 — Rémi Coulom; CrazyStone
    • 2011: 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》 — Aja Huang; Erica
    • 2015: 《Move Evaluation in Go Using Deep Convolutional Neural Networks》
      — Policy Network
    • 2016/2017: 《Mastering the game of Go with deep neural networks and
      tree search》 (AlphaGo vs. Lee Sedol); 《Mastering the Game of Go
      without Human Knowledge》 (AlphaGo Zero)
    http://pr.ntnu.edu.tw/news/index.php?mode=data&id=16487
  4. (Same timeline as slide 3.)
  5. 黃士傑 (Aja Huang) — 2015: 《Move Evaluation in Go Using Deep
    Convolutional Neural Networks》 https://www.facebook.com/aja.huang
    Policy Network
  6. Problem Description: the board has 19 x 19 = 361 points, and each point
    is in one of three states (white: 1) (black: -1) (empty: 0).
    Board state: $\vec{s} = (1, 0, -1, 0, \ldots)$
    In some state $\vec{s}$, ignoring illegal points for now, the space of
    possible next moves is 361-dimensional.
    Move action: $\vec{a} = (0, 0, 1, 0, \ldots)$
    (Definition of the Go problem) In any state $\vec{s}$, find the optimal
    move policy $\vec{a}$ that wins the most territory.
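A quick sketch of this encoding in numpy; the concrete stone positions are made up for illustration, and flattening the 19 x 19 board row by row gives the 361-dimensional vectors above:

```python
import numpy as np

board = np.zeros((19, 19), dtype=np.int8)  # 0 = empty point
board[0, 0] = 1                            # white stone -> +1
board[0, 2] = -1                           # black stone -> -1

s = board.flatten()                        # board state vector, 361 dims
a = np.zeros(361, dtype=np.int8)           # move action as a one-hot vector
a[3] = 1                                   # play at point index 3
```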
  7. Policy Network: $\vec{s} = (1, 0, -1, 0, \ldots)$ →
    $\vec{a} = (0, 0, \ldots, 1, 0, \ldots)$ — Supervised Learning
  8. Policy Network (a CNN that learns how humans play): $\vec{a} = h(\vec{s})$
    Input $\vec{s}$: 19 x 19 x 36 · Model: Convolutional Neural Network ·
    Output $\vec{a}$: 19 x 19
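A minimal sketch of such a policy network in PyTorch. The layer count and filter width are invented placeholders, not the paper's architecture; only the 19 x 19 x 36 input and 19 x 19 output shapes come from the slide:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """CNN mapping a board state to a distribution over the 361 points."""
    def __init__(self, in_planes=36, filters=128, n_layers=5):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, 3, padding=1), nn.ReLU()]
        for _ in range(n_layers - 2):
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, 1)]  # 1x1 conv -> one plane of logits
        self.body = nn.Sequential(*layers)

    def forward(self, s):                     # s: (batch, 36, 19, 19)
        logits = self.body(s).flatten(1)      # (batch, 361)
        return torch.softmax(logits, dim=1)   # probability of each move

probs = PolicyNet()(torch.zeros(1, 36, 19, 19))  # one encoded board state
```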
  9. Policy Network $\vec{a} = h(\vec{s})$
    Problem 1: the training game records all come from amateur players.
    Problem 2: the network plays at roughly amateur 6-dan.
    Problem 3: it could not beat CrazyStone, the strongest program at the time.
  10. (Same timeline as slide 3.)
  11. Rémi Coulom
    • 2006 – 《Efficient Selectivity and Backup Operators in Monte-Carlo
      Tree Search》
    • Developed CrazyStone using Monte Carlo tree search
    • 2011 – advised Aja Huang's Ph.D. thesis 《應用於電腦圍棋之蒙地卡羅樹
      搜尋法的新啟發式演算法》; Huang developed Erica
    • 2014 – 2015 – led DeepMind's AlphaGo development team
  12. Monte Carlo Tree Search: build up a tree and do tree search —
    Upper Confidence Bound
  13. Monte Carlo Tree Search — (figure) each node stores 勝數/次數
    (wins / visits), e.g. Black 1/1. Lose: 0. Take action by value.
  14. Monte Carlo Tree Search — (figure) wins / visits after several
    simulations: Black root 5/9; children 1/8, 1/22, 3/5, 4/7. Lose: 0.
  15. Monte Carlo Tree Search — (figure) a new leaf 0/1 is added after a
    lost simulation, and the root updates from 5/9 to 5/10. Lose: 0.
  16. Monte Carlo Tree Search — (figure) the same tree (root 5/10, leaf
    0/1). Take action by value or random.
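A minimal sketch of the bookkeeping behind these diagrams, using a hypothetical `Node` class: each node tracks wins/visits, selection follows the best empirical value, and one lost simulation adds a 0/1 leaf and bumps every count on the path to the root (the 5/9 → 5/10 update above):

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0          # 勝數: simulations won through this node
        self.visits = 0        # 次數: simulations run through this node
        self.children = []

    def value(self):           # empirical win rate, e.g. 5/9
        return self.wins / self.visits if self.visits else 0.0

def select(node):
    """'Take action by value': descend to the most promising child."""
    return max(node.children, key=Node.value)

def backup(leaf, won):
    """After one simulation, update every node back up to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.wins += int(won)
        node = node.parent
```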
  17. Monte Carlo Tree Search — (figure) Black moves first; the tree
    alternates Black and White levels. On our turn we take an action; on
    the opponent's turn we observe the enemy. Both sides are combined as
    one Monte Carlo tree.
  18. Monte Carlo Tree Search: build up a tree and do tree search —
    Upper Confidence Bound (UCB)
  19. Monte Carlo Tree Search — take action by value or at random?
    Exploration & exploitation? (e.g. root children at 3/6 vs. 1/8,
    Black/White) → Bandit algorithms.
  20. Monte Carlo Tree Search — the (multi-armed) bandit problem: a row of
    slot machines (one-armed bandits). How can you maximize the expected
    gain over a series of actions/choices?
  21. Monte Carlo Tree Search — the (multi-armed) bandit problem: facing K
    fixed slot machines (think recommended products/ads), with no prior
    knowledge of their expected payoffs, you choose one machine on each
    trial. How do you maximize the expected reward over this sequence of
    choices?
  22. Monte Carlo Tree Search — (multi-armed) bandit algorithms: ε-first,
    ε-greedy, εn-greedy, Upper Confidence Bound (UCB1 / UCB2), LinUCB,
    Thompson sampling.
  23. Monte Carlo Tree Search — Upper Confidence Bound:
    $\mathrm{UCB} = \bar{x}_i + \sqrt{2 \ln N / n_i}$
    • $\bar{x}_i$ = machine i's average historical reward
    • $N$ = total number of selections
    • $n_i$ = number of times machine i has been selected
    (Figure: the interval $\bar{x}_i \pm \delta$ contains the true mean
    with probability > α.)
  24. Monte Carlo Tree Search — the same UCB rule applied to the row of
    slot machines: $\mathrm{UCB} = \bar{x}_i + \sqrt{2 \ln N / n_i}$, with
    $\bar{x}_i$, $N$, $n_i$ as above.
  25. Monte Carlo Tree Search — $\mathrm{UCB} = \bar{x}_i + \sqrt{2 \ln N / n_i}$
    • For machine i, every selection updates $\bar{x}_i$ and $n_i$.
    • If all machines have been selected equally often, pick the one with
      the highest $\bar{x}_i$.
    • As one machine gets picked many times:
      • its high average historical reward $\bar{x}_i$ keeps winning: Exploit
      • the bonus term $\delta = \sqrt{2 \ln N / n_i}$ gradually shrinks
      • until another machine's UCB overtakes it and we switch: Explore
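A runnable sketch of this exploit/explore loop. The three payout probabilities are invented for the demo; only the UCB rule itself comes from the slides:

```python
import math, random

payout = [0.2, 0.5, 0.7]       # hidden machine probabilities (made up)
wins, plays = [0, 0, 0], [0, 0, 0]

for _ in range(10_000):
    if 0 in plays:             # try every machine once before using UCB
        i = plays.index(0)
    else:
        N = sum(plays)         # total selections so far
        i = max(range(3), key=lambda j: wins[j] / plays[j]
                + math.sqrt(2 * math.log(N) / plays[j]))
    wins[i] += random.random() < payout[i]
    plays[i] += 1

print(plays)  # the 0.7 machine dominates (exploit); the others still get visits (explore)
```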
  26. Monte Carlo Tree Search — UCT: build up a tree and do tree search,
    applying the Upper Confidence Bound $\bar{x}_i + \sqrt{2 \ln N / n_i}$
    at each node; from the root, take the action with the highest UCB.
  27. UCT for AlphaGo: build up a tree and do tree search, with the UCB
    replaced by Q(s, a) + u(s, a).
    • UCB update: Q(s, a) combines the value network's evaluation with the
      reward after simulation; u(s, a) is an exploration bonus driven by
      the policy network prior $h(\vec{s})$.
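A sketch of that selection rule. The exploration form below, u(s, a) ∝ P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a)), follows the PUCT variant described in the AlphaGo paper; the dictionary layout, move names, and c_puct value are assumptions for the demo:

```python
import math

def select_action(stats, c_puct=5.0):
    """argmax_a Q(s, a) + u(s, a) at one tree node.

    `stats` maps action -> {"N": visits, "W": total value, "P": prior},
    where P comes from the policy network h(s); the layout is hypothetical.
    """
    total_n = sum(st["N"] for st in stats.values())
    def score(a):
        st = stats[a]
        q = st["W"] / st["N"] if st["N"] else 0.0                  # exploit
        u = c_puct * st["P"] * math.sqrt(total_n) / (1 + st["N"])  # explore
        return q + u
    return max(stats, key=score)

# a barely-visited move with a strong prior can outscore a well-explored one
stats = {"D4":  {"N": 10, "W": 6.0, "P": 0.4},
         "Q16": {"N": 1,  "W": 0.0, "P": 0.3}}
print(select_action(stats))    # -> "Q16"
```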
  28. Monte Carlo Tree Search for AlphaGo — (figure: the MCTS diagram with
    move probabilities, from 《Mastering the game of Go with deep neural
    networks and tree search》).
  29. Monte Carlo Tree Search — 2016-03-13, game 4: the "divine move," Lee
    Sedol's move 78. Netflix – 《AlphaGo 世紀對決》 (the AlphaGo
    documentary). https://www.101weiqi.com/chessbook/chess/140037/
  30. Value Network — we need a value function v(s) to early-stop the game!
    (In Q(s, a), it stands alongside the reward after simulation.)
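As a hedged sketch of why v(s) allows early stopping: the AlphaGo paper mixes the value network's estimate with a fast-rollout outcome at the leaf, so a full playout is not always needed. The callables below are hypothetical stand-ins, and `lam` plays the role of the paper's mixing parameter:

```python
def evaluate_leaf(state, value_net, rollout_policy, lam=0.5):
    """Score a leaf without playing the whole game out.

    Mixes the learned estimate v(s) with the win/loss result z of one fast
    rollout, in the spirit of AlphaGo's leaf evaluation (1 - lam)*v + lam*z.
    `value_net` and `rollout_policy` are hypothetical callables.
    """
    v = value_net(state)                 # early stop: no simulation needed
    z = rollout_policy.play_out(state)   # +1 / -1 from one fast playout
    return (1 - lam) * v + lam * z
```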
  31. Value Network — (figure from 《Mastering the game of Go with deep
    neural networks and tree search》).
  32. Value Network – Self-play: the current policy network $h(\vec{s})$
    plays against earlier snapshots of itself, $h_{-1}(\vec{s})$,
    $h_{-2}(\vec{s})$, …, $h_{-k}(\vec{s})$, each matchup generating N data.
  33. Value Network — initially, the value network is copied from the
    policy network. (Figure from 《Mastering the game of Go with deep
    neural networks and tree search》.)
  34. Value Network – Self-play
    Result: the self-play-improved policy $h_{-}(\vec{s})$ beats
    $h(\vec{s})$ about 80% of the time.
    Problem: using $h_{-}(\vec{s})$ inside MCTS (Q(s, a) + u(s, a))
    actually made play weaker.
    Observation: $h_{-}(\vec{s})$'s moves are too concentrated, so it
    cannot Explore. (Self-play generated 30 million games of data.)
  35. Value Network – Self-play (the strategy Dr. Huang finally adopted
    when training v(s); see the sketch after this list):
    1. Use $h(\vec{s})$ for the first L moves.
    2. At move L+1, play one completely random move.
    3. Only then let $h_{-}(\vec{s})$ play the game to the end.
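A sketch of these three steps as code; `game`, `sl_policy`, and `rl_policy` are hypothetical objects standing in for the board environment, $h(\vec{s})$, and $h_{-}(\vec{s})$:

```python
import random

def generate_value_sample(game, sl_policy, rl_policy, L):
    """Produce one (state, outcome) training pair for v(s)."""
    for _ in range(L):                             # 1) h(s) plays L moves
        game.play(sl_policy.choose(game.state()))
    game.play(random.choice(game.legal_moves()))   # 2) one fully random move
    s = game.state()                               # the position v(s) will score
    while not game.over():                         # 3) h-(s) finishes the game
        game.play(rl_policy.choose(game.state()))
    return s, game.winner()                        # label: win / loss
```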
  36. (Figure: move distributions — $h_{-}(\vec{s})$ vs. the percentage
    frequency with which each action was selected during simulations vs.
    the real situation.)
  37. The full AlphaGo pipeline: human data (30,000,000 positions) →
    training the Policy Network $h(\vec{s})$ (input 19 x 19 x 48) → Value
    Network (initial copy) → Self-Play (take action; generated 30 million
    games) → training v(s) on win/loss outcomes → MCTS with
    Q(s, a) + u(s, a) ⇒ early stop via v(s).
  38. 黃士傑 (Aja Huang) — 2017: 《Mastering the Game of Go without Human
    Knowledge》 https://www.facebook.com/aja.huang ZERO
  39. One Network (AlphaGo Zero)
    • ResNet, 20–40 residual blocks
    • Batch Normalization
    • Rectifier non-linearities
    • Input: only black/white stone info, 19 x 19 x 17
    • Output: (p, v) — p = action probabilities, v = the current player's
      win probability
    • Loss: ……
    Self-play training loop: the network's (p, v) drives action selection
    and evaluation in MCTS with Q(s, a) + u(s, a) ⇒ early stop via v, and
    the results flow back as training targets. Self-play generated 29
    million games.
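A scaled-down sketch of this "dual" network in PyTorch: residual blocks with Batch Normalization and rectifiers, a 19 x 19 x 17 input, and the two (p, v) heads. The channel width, head shapes, and block count are placeholders far below the paper's 20–40 blocks:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.c1, self.b1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.c2, self.b2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):                      # conv-BN-ReLU, conv-BN, skip, ReLU
        y = torch.relu(self.b1(self.c1(x)))
        return torch.relu(x + self.b2(self.c2(y)))

class DualNet(nn.Module):
    def __init__(self, ch=64, n_blocks=4):     # the paper uses 20-40 blocks
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(17, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.p_head = nn.Sequential(nn.Conv2d(ch, 2, 1), nn.Flatten(),
                                    nn.Linear(2 * 19 * 19, 362))  # 361 moves + pass
        self.v_head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Flatten(),
                                    nn.Linear(19 * 19, 64), nn.ReLU(),
                                    nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):                      # x: (batch, 17, 19, 19)
        h = self.tower(self.stem(x))
        return self.p_head(h), self.v_head(h)  # move logits, value in [-1, 1]

p, v = DualNet()(torch.zeros(1, 17, 19, 19))
```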
  40. (Figure: the 19 x 19 x 48 input features of 《Mastering the game of
    Go with deep neural networks and tree search》 compared with the input
    of 《Mastering the Game of Go without Human Knowledge》.)
  41. Hardware (chart from 《Mastering the Game of Go without Human
    Knowledge》): AlphaGo Fan — 176 GPUs; AlphaGo Lee — 48 TPUs (≈ 720
    GPUs); AlphaGo Zero — 4 TPUs. In Google Cloud, 1 TPU ≈ 15 ~ 30 GPUs.
  42. Summary
    • Policy Network: SL, CNN
    • MCTS: build tree, UCB, Q(s, a), u(s, a)
    • Value Network: CNN, early stop
    • Self-Play: generate data, train network, update MCTS
    • ZERO / ResNet: one CNN, (p, v)
    • Result: insights; GPU, TPU
  43. (Same timeline as slide 3.)
  44. References
    • 深入浅出看懂AlphaGo如何下棋 (a plain-language look at how AlphaGo plays)
    • 深入浅出看懂AlphaGo元 (a plain-language look at AlphaGo Zero)
    • 《電腦圍棋打劫的策略》
    • 《Mastering the game of Go with deep neural networks and tree search》
    • 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》
    • 《Move Evaluation in Go Using Deep Convolutional Neural Networks》
    • 《Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search》
    • 《The Monte Carlo Method》
    • 《Algorithms for the multi-armed bandit problem》
    • 《Mastering the Game of Go without Human Knowledge》
    • 《Deep Residual Learning for Image Recognition》
    • 賭徒的人工智慧1:吃角子老虎 (Bandit) 問題 (the gambler's AI, part 1: the
      slot-machine (bandit) problem)
    • Monte Carlo Tree Search – beginner's guide
    • 构建自己的AlphaGo (build your own AlphaGo)
    • 《New Exponential Bounds and Approximations for the Computation of
      Error Probability in Fading Channels》
  45. Prerequisites
    (Reinforcement Learning) • Bellman equation: V(s), Q(s, a) •
    epsilon-greedy • TD, Q-Learning • Policy Gradient • DQN, Actor-Critic
    (Deep Learning) • CNN • filter, pooling, flatten, FC • ReLU • Adam •
    Batch Normalization • VGG-19, ResNet
    (Python: implementation) • numpy / pandas • Tensorflow / Pytorch •
    Keras • GPU / TPU • Google Cloud
  46. Monte Carlo Tree Search — Upper Confidence Bound (appendix:
    derivation): $\mathrm{UCB} = \bar{x}_i + \sqrt{2 \ln N / n_i}$, where
    $\bar{x}_i$ = machine i's average historical reward, $N$ = total
    number of selections, $n_i$ = number of times machine i has been
    selected. (Figure: the interval $\bar{x}_i \pm \delta$ contains the
    true mean with probability > α.)
  47. Monte Carlo Tree Search — Upper Confidence Bound
    $P\left(\left|\frac{1}{n}\sum_{t=1}^{n} x_t - E(x)\right| < \delta\right) > \alpha$ ---- (1)
    $\Rightarrow P\left(-\delta < \frac{1}{n}\sum_{t=1}^{n} x_t - E(x) < \delta\right) > \alpha$ ---- (2)
    (Recall) Central Limit Theorem: $\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.
    Taking $\sigma = 1$:
    $\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{x}-\mu)}{\sigma} = \sqrt{n}(\bar{x}-\mu) = \sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n} x_t - E(x)\right) \sim N(0, 1)$
    Substitute back into (2).
  48. Monte Carlo Tree Search — Upper Confidence Bound
    $P\left(-\delta < \frac{1}{n}\sum_{t=1}^{n} x_t - E(x) < \delta\right) > \alpha$ ---- (2)
    $\Rightarrow P\left(-\delta\sqrt{n} < \sqrt{n}\left(\frac{1}{n}\sum_{t=1}^{n} x_t - E(x)\right) < \delta\sqrt{n}\right) = \int_{-\delta\sqrt{n}}^{\delta\sqrt{n}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \operatorname{erf}\left(\frac{\delta\sqrt{n}}{\sqrt{2}}\right) > \alpha$ ---- (3)
    By [1], $\operatorname{erfc}(x) < \exp(-x^2)$, so
    $\operatorname{erf}\left(\frac{\delta\sqrt{n}}{\sqrt{2}}\right) = 1 - \operatorname{erfc}\left(\frac{\delta\sqrt{n}}{\sqrt{2}}\right) > 1 - \exp\left(-\left(\frac{\delta\sqrt{n}}{\sqrt{2}}\right)^2\right) = 1 - \exp\left(-\frac{n\delta^2}{2}\right) > \alpha$ ---- (4)
    [1]: 《New Exponential Bounds and Approximations for the Computation of
    Error Probability in Fading Channels》
  49. Monte Carlo Tree Search — Upper Confidence Bound
    $1 - \exp\left(-\frac{n\delta^2}{2}\right) > \alpha$ ---- (4)
    $\exp\left(-\frac{n\delta^2}{2}\right) < 1 - \alpha \Rightarrow -\frac{n\delta^2}{2} < \ln(1-\alpha) \Rightarrow \delta^2 < -\frac{2}{n}\ln(1-\alpha) = \frac{2}{n}\ln\frac{1}{1-\alpha} \Rightarrow \delta < \sqrt{\frac{2}{n}\ln\frac{1}{1-\alpha}}$
  50. Monte Carlo Tree Search — Upper Confidence Bound
    A bigger α is better; thus, let $\frac{1}{1-\alpha} = N$:
    $\delta < \sqrt{\frac{2}{n}\ln\frac{1}{1-\alpha}} \Rightarrow \delta < \sqrt{\frac{2\ln N}{n}}$
    $\mathrm{UCB} = \bar{x}_i + \sqrt{\frac{2\ln N}{n_i}}$, $\mathrm{LCB} = \bar{x}_i - \sqrt{\frac{2\ln N}{n_i}}$
    (The true mean lies in $[\bar{x}_i - \delta, \bar{x}_i + \delta]$ with
    probability > α.)
  51. ZERO — architecture comparison: separate policy and value networks
    ("sep") vs. a combined network ("dual"); convolutional ("conv") vs.
    residual ("res"). (Figure from 《Mastering the Game of Go without
    Human Knowledge》.)