Algorithm behind AlphaGo and AlphaGo Zero

A brief review of the AlphaGo and AlphaGo Zero algorithms: how they work, and how they learn from self-play. Presented at Young Webmaster Camp Programming Meeting #5.

Kosate Limpongsa

February 03, 2018

Transcript

  1. Neung (1) Kosate Limpongsa. Studying at Chulalongkorn University. #YWC12 #CP41. Developer of JWC7, developer of YWC13. Writer of: • "A deep dive into how AlphaGo works (readable even if you can't program)" • "The new AlphaGo Zero: better than every version before it, without using even a bit of human data (how?)"
  2. 1 Introduction: Why is Go hard? 19 x 19 = 361 intersections. Photo from https://en.wikipedia.org/wiki/Rules_of_Go
  3. 1 Introduction: Why is Go hard? The number of legal positions is 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 (about 2 x 10^170), more than the number of atoms in the observable universe (the sketch below checks the scale). Reference: http://tromp.github.io/go/legal.html
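As a quick check on that scale claim, here is a minimal Python sketch using arbitrary-precision integers; the legal-position count is Tromp's figure quoted above, and the ~10^80 atom count is the usual rough order-of-magnitude estimate:

```python
# Scale check for the numbers above (Python ints are arbitrary precision).
legal_positions = int(
    "208168199381979984699478633344862770286522453884530548425639"
    "456820927419612738015378525648451698519643907259916015628128"
    "546089888314427129715319317557736620397247064840935"
)
naive_upper_bound = 3 ** 361   # every point is empty, black, or white
atoms_in_universe = 10 ** 80   # rough order-of-magnitude estimate

print(len(str(legal_positions)) - 1)        # 170 -> about 2 x 10^170 positions
print(len(str(naive_upper_bound)) - 1)      # 172 -> the naive bound is ~10^172
print(legal_positions > atoms_in_universe)  # True: far more than atoms
```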
  4. 1 Introduction: Why is Go hard? Checking them all would take roughly 9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years, even if we used all the computers in the world.
  5. 3 Algorithm / AlphaGo. AlphaGo combines four learned components: • Rollout policy (trained on human data) • SL policy network (trained on human data) • RL policy network (trained by self-play) • Value network (trained from the RL policy's games)
  6. 3 Algorithm / AlphaGo. 4 MCTS Selection: choose the move that maximizes Q + u(P), where P is the prior probability the policy network assigns to the move and u(P) is an exploration bonus derived from it (sketched below).
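A minimal sketch of that selection rule, assuming a PUCT-style bonus as described in the AlphaGo paper; the `Edge` class, the `c_puct` constant, and the node layout are illustrative placeholders, not DeepMind's code:

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics for one candidate move (illustrative layout)."""
    prior: float              # P: policy network's prior probability
    visit_count: int = 0      # N: how often this move was explored
    total_value: float = 0.0  # W: sum of backed-up values

    @property
    def q(self) -> float:
        # Q = W / N: mean value of taking this move so far
        return self.total_value / self.visit_count if self.visit_count else 0.0

def select_move(edges: dict, c_puct: float = 5.0):
    """Pick the move maximizing Q + u(P).

    u(P) is large for moves the policy network likes (high P) and decays
    as a move accumulates visits, so the search balances exploitation
    (high Q) against exploration (high prior, few visits).
    """
    total_visits = sum(e.visit_count for e in edges.values())
    def score(edge: Edge) -> float:
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.q + u
    return max(edges, key=lambda move: score(edges[move]))
```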
  7. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: reward. How do you estimate the reward in the middle of a game? Remember the barrier: you can't calculate all the possibilities.
  8. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: 1) use the rollout policy to play until the end of the game.
  9. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: 1) use the rollout policy to play until the end of the game; 2) use the value network to estimate the win probability ("what percent do I win?"). The two estimates are blended, as sketched below.
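A sketch of how the two estimates are combined at a leaf, using the paper's mixing weight lambda = 0.5; `value_network` and `rollout_policy` are placeholder callables, the first returning a win estimate in [-1, 1] and the second playing fast moves to the end and returning the final result z:

```python
def evaluate_leaf(leaf_state, value_network, rollout_policy, lam=0.5):
    """Blend the two answers to 'how good is this position?'."""
    v = value_network(leaf_state)   # instant estimate: "what percent do I win?"
    z = rollout_policy(leaf_state)  # play to the end of the game: +1 win, -1 loss
    return (1 - lam) * v + lam * z  # lambda = 0.5 weights them equally
```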
  10. 3 Algorithm / AlphaGo. 4 MCTS Backup: 1) the rollout returns a reward such as +1; 2) the value network returns a win estimate such as +0.79. Both values are backed up into Q in the MCTS tree, as sketched below.
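A sketch of the backup step, reusing the illustrative `Edge` from the selection sketch; the sign flip assumes values are stored from the perspective of the player to move (implementations differ in exactly where they flip):

```python
def backup(path, leaf_value: float):
    """Push a leaf evaluation (e.g. a +1 rollout result or a +0.79 value
    network estimate) up the visited path, so each edge's Q = W / N
    reflects the new information on the next selection pass."""
    v = leaf_value  # from the perspective of the player to move at the leaf
    for node, edge in reversed(path):
        v = -v                 # this edge was chosen by the player one ply up
        edge.visit_count += 1  # N += 1
        edge.total_value += v  # W += v, shifting Q = W / N toward v
```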
  11. 3 Algorithm / AlphaGo. 5 How AlphaGo makes its decision: select the move with max(Q + u(P)) in MCTS, e.g. the branch scoring 0.73 beats the branch scoring 0.63 (decision loop sketched below).
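Putting the pieces together, a sketch of the decision loop implied by the slide; `run_simulation` is a placeholder for one selection-evaluation-backup pass, and `select_move` is the rule sketched earlier. (For the final choice the Nature paper reports picking the most-visited root move; the slide's max(Q + u(P)) rule is shown here.)

```python
def decide(root_edges: dict, run_simulation, num_simulations: int = 1600):
    """Spend the search budget, then pick the best-looking root move."""
    for _ in range(num_simulations):
        run_simulation()            # selection -> evaluation -> backup
    return select_move(root_edges)  # e.g. the 0.73 branch beats the 0.63 one
```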
  12. 3 Algorithm / AlphaGo Zero. AlphaGo Zero architecture: a single policy-value network that predicts the probability of every move and the probability of winning (e.g. +0.79). A minimal sketch follows.
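A deliberately tiny PyTorch sketch of the dual-head idea; the real network is a much deeper residual CNN, and the 17 input planes and layer sizes here are only indicative:

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """One trunk, two heads: move probabilities and a win estimate."""
    def __init__(self, in_channels: int = 17, filters: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(         # "predict every move"
            nn.Conv2d(filters, 2, kernel_size=1), nn.Flatten(),
            nn.Linear(2 * 19 * 19, 19 * 19 + 1),  # 361 points + pass
        )
        self.value_head = nn.Sequential(          # "what percent do I win?"
            nn.Conv2d(filters, 1, kernel_size=1), nn.Flatten(),
            nn.Linear(19 * 19, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),          # e.g. +0.79 means likely win
        )

    def forward(self, board):                     # board: (batch, 17, 19, 19)
        x = self.trunk(board)
        return self.policy_head(x), self.value_head(x)
```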
  13. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS) Selection: choose the move that maximizes Q + U, where the prior inside U now comes from the single policy-value network (the same selection rule sketched earlier).
  14. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS) Backup: use the network's value estimate directly (e.g. 79% win); there are no rollouts. These values update Q in the MCTS tree, as sketched below.
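A sketch of one full AlphaGo Zero-style simulation tying these steps together; the node API (`is_expanded`, `child`, `expand`, `state`) and `net.evaluate` are illustrative placeholders, and `select_move`/`backup` are the sketches from the AlphaGo section above:

```python
def simulate(root, net):
    """One simulation with no rollouts: network-only evaluation."""
    node, path = root, []
    while node.is_expanded():                 # selection: walk down by Q + U
        move = select_move(node.edges)
        path.append((node, node.edges[move]))
        node = node.child(move)
    priors, value = net.evaluate(node.state)  # one network call per leaf
    node.expand(priors)                       # children get P from the network
    backup(path, value)                       # e.g. 79% win flows up into Q
```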
  15. 1 Introduction: AlphaGo timeline. After AlphaGo was born, AlphaGo Fan beat Fan Hui (a 2-dan professional) 5-0, and AlphaGo Lee beat Lee Sedol (a 9-dan professional) 4-1.
  16. 1 Introduction: AlphaGo timeline. AlphaGo Master beat 60 professional players 60-0 and beat Ke Jie (the world's No. 1 player) 3-0; AlphaGo Zero followed, first in a nearly complete form and then complete.
  17. 4 Result: a stronger player. AlphaGo Zero vs AlphaGo Lee: 100-0. AlphaGo Zero vs AlphaGo Master: 89-11.
  18. Neung (1) Kosate Limpongsa, Chulalongkorn University. #YWC12 #CP41. Articles: • "A deep dive into how AlphaGo works" https://medium.com/kosate/3a1cf3631289 • "The new AlphaGo Zero" https://medium.com/kosate/9d6e5c059e7d. Contact: kosatelim (at) gmail.com, medium.com/kosate. References: • https://www.nature.com/articles/nature16961 • https://www.nature.com/articles/nature24270