
Algorithm behind AlphaGo and AlphaGo Zero


A brief review of the AlphaGo and AlphaGo Zero algorithms: how they work, and how they learn from self-play. Presented at Young Webmaster Camp Programming Meeting #5.


Kosate Limpongsa

February 03, 2018

Transcript

1. Neung (1) Kosate Limpongsa. Studying at Chulalongkorn University. #YWC12 #CP41. Developer of JWC7, developer of YWC13. Writer of: • A deep dive into how AlphaGo works (readable even if you can't program) • The new AlphaGo Zero: better than every previous version, without using any human data at all (how?)
2. 1 Introduction. Why is Go hard? 19 x 19 = 361. Photo from https://en.wikipedia.org/wiki/Rules_of_Go
3. 1 Introduction. Why is Go hard? The number of legal positions is 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 (about 2.08 x 10^170), more than the number of atoms in the universe. Reference: http://tromp.github.io/go/legal.html
4. 1 Introduction. Why is Go hard? Even using all the computers in the world, working through them would take about 9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years.
5. 3 Algorithm / AlphaGo. AlphaGo has four learned components: • Rollout policy (trained on human games) • SL Policy network (trained on human games) • RL Policy network (trained by self-play) • Value Network (trained on RL Policy self-play games)
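To make those roles concrete, here is a minimal illustrative sketch (not from the deck) of the four components and where their training data comes from; the class and field names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlphaGoComponents:
    """Illustrative container for AlphaGo's four learned components."""
    rollout_policy: Callable   # fast, shallow move predictor - trained on human games
    sl_policy: Callable        # supervised policy network - trained on human games
    rl_policy: Callable        # policy improved by self-play reinforcement learning
    value_network: Callable    # win predictor - trained on RL-policy self-play games
```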
6. 3 Algorithm / AlphaGo. 4 MCTS: Selection. Choose the move with the highest Q + u(P), where the prior P comes from the policy network and Q is the value estimate accumulated by the search.
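A minimal sketch of that selection rule, assuming each edge of the search tree stores a visit count N, a mean value Q, and a prior probability P from the policy network; the exploration bonus u(P) follows the form used in the AlphaGo paper, and c_puct is an illustrative constant.

```python
import math

def select_move(children, c_puct=5.0):
    """Pick the edge that maximizes Q + u(P) (PUCT-style selection).

    children: list of dicts, each with keys 'move', 'N', 'Q', 'P'.
    """
    total_visits = sum(child["N"] for child in children)
    best, best_score = None, -float("inf")
    for child in children:
        # u(P) is large for high-prior, rarely visited moves (exploration);
        # Q rewards moves that have evaluated well so far (exploitation).
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1 + child["N"])
        score = child["Q"] + u
        if score > best_score:
            best, best_score = child, score
    return best["move"]
```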
7. 3 Algorithm / AlphaGo. 4 MCTS: Evaluation. How do we estimate the reward in the middle of a game? Remember the barrier: you cannot calculate all the possibilities.
8. 3 Algorithm / AlphaGo. 4 MCTS: Evaluation. 1) Use the Rollout policy, which plays the game out to the end.
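A rough sketch of what a rollout does; the board object and its legal_moves/play/game_over/result methods are hypothetical helpers, and the default policy here is uniform random, whereas AlphaGo's actual rollout policy is a small, fast model trained on human games.

```python
import random

def rollout(board, fast_policy=None):
    """Play the position out to the end of the game with a cheap policy."""
    while not board.game_over():
        moves = board.legal_moves()
        move = fast_policy(board) if fast_policy else random.choice(moves)
        board.play(move)
    return board.result()  # e.g. +1 for a win, -1 for a loss
```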
9. 3 Algorithm / AlphaGo. 4 MCTS: Evaluation. 1) Use the Rollout policy, which plays the game out to the end. 2) Use the Value Network, which estimates the win probability ("What percent of the time do I win from here?").
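The two estimates are then blended into a single leaf value; the AlphaGo paper (nature16961, cited on the last slide) uses a weighted average of the value-network output and the rollout outcome, with equal weight. A minimal sketch:

```python
def evaluate_leaf(value_net_estimate, rollout_result, mix=0.5):
    """Blend the value network's estimate with the rollout outcome.

    value_net_estimate: value predicted by the value network for the leaf.
    rollout_result: +1 if the rollout ended in a win, -1 if it ended in a loss.
    mix: weight given to the rollout result (0.5 in the AlphaGo paper).
    """
    return (1 - mix) * value_net_estimate + mix * rollout_result
```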
10. 3 Algorithm / AlphaGo. 4 MCTS: Backup. 1) The Rollout returns a reward, e.g. +1. 2) The Value Network returns a win estimate, e.g. +0.79. Update these values into Q in the MCTS tree.
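A minimal sketch of that backup step, assuming the same per-edge statistics as in the selection sketch plus a running total W; the sign flip assumes each edge stores values from the point of view of the player to move, which is one common convention rather than something the deck specifies.

```python
def backup(path, leaf_value):
    """Propagate a leaf evaluation back up the path that was searched.

    path: list of edge-statistics dicts ('N', 'W', 'Q') from root to leaf.
    leaf_value: blended evaluation of the leaf (rollout + value network).
    """
    for edge in reversed(path):
        edge["N"] += 1                      # one more visit through this edge
        edge["W"] += leaf_value             # accumulate total value
        edge["Q"] = edge["W"] / edge["N"]   # mean value used by selection
        leaf_value = -leaf_value            # players alternate along the path
```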
11. 3 Algorithm / AlphaGo. 5 How AlphaGo makes a decision. Select the move with max(Q + u(P)) in the MCTS tree (e.g. 0.73 beats 0.63).
12. 3 Algorithm / AlphaGo Zero. AlphaGo Zero architecture: a single Policy-Value Network with two outputs. It predicts the probabilities of all possible moves (policy) and the probability of winning (value), e.g. Win +0.79.
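A toy sketch of such a two-headed network, written with PyTorch for illustration; the real AlphaGo Zero model is a deep residual convolutional network, not the tiny fully connected trunk used here, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Toy policy-value network: one shared trunk, two output heads."""

    def __init__(self, board_size=19):
        super().__init__()
        n = board_size * board_size
        self.trunk = nn.Sequential(nn.Linear(n, 256), nn.ReLU())
        self.policy_head = nn.Linear(256, n)  # a probability for every move
        self.value_head = nn.Linear(256, 1)   # a single win estimate

    def forward(self, board):
        h = self.trunk(board)
        policy = torch.softmax(self.policy_head(h), dim=-1)
        value = torch.tanh(self.value_head(h))  # in [-1, 1]: lose ... win
        return policy, value
```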
13. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS): Selection. Choose the move using the network's prior u plus the accumulated value Q.
14. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS): Backup. Use the network's value output (e.g. 79% win) and update these values into Q in the MCTS tree; no rollout is needed.
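Self-play is also where learning happens: in the Zero paper (nature24270, cited on the last slide), the MCTS visit counts at each position become the training target for the policy head and the final game outcome becomes the target for the value head. A minimal sketch of that loss, reusing the toy PolicyValueNet above and omitting the paper's L2 regularisation term:

```python
import torch
import torch.nn.functional as F

def zero_loss(net, states, search_policies, outcomes):
    """AlphaGo Zero-style loss for a batch of self-play positions.

    states: board features, shape (batch, 361) for the toy network above.
    search_policies: MCTS visit-count distributions, shape (batch, 361).
    outcomes: final game result (+1 / -1) from each position's player, shape (batch,).
    """
    policy, value = net(states)
    value_loss = F.mse_loss(value.squeeze(-1), outcomes)  # (z - v)^2 term
    policy_loss = -(search_policies * torch.log(policy + 1e-8)).sum(dim=-1).mean()
    return value_loss + policy_loss
```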
15. 1 Introduction. AlphaGo timeline: Born, AlphaGo Fan, AlphaGo Lee. Fan Hui (2-dan player): AlphaGo wins 5-0. Lee Sedol (9-dan player): AlphaGo wins 4-1.
16. 1 Introduction. AlphaGo timeline: AlphaGo Master, AlphaGo Zero (nearly complete), AlphaGo Zero (complete). 60 professional players: AlphaGo wins 60-0. Ke Jie (world No. 1 player): AlphaGo wins 3-0.
17. 4 Result. A stronger AlphaGo. AlphaGo Zero vs AlphaGo Lee: 100-0. AlphaGo Zero vs AlphaGo Master: 89-11.
18. Neung (1) Kosate Limpongsa, Chulalongkorn University, #YWC12 #CP41. Articles: • A deep dive into how AlphaGo works: https://medium.com/kosate/3a1cf3631289 • The new AlphaGo Zero: https://medium.com/kosate/9d6e5c059e7d. Contact: Kosate Limpongsa, kosatelim (at) gmail.com, medium.com/kosate. References: • https://www.nature.com/articles/nature16961 • https://www.nature.com/articles/nature24270