Algorithm behind AlphaGo and AlphaGo Zero

A brief review of the AlphaGo and AlphaGo Zero algorithms: how they work, and how they learn from self-play. Presented at Young Webmaster Camp Programming Meeting #5.

Kosate Limpongsa

February 03, 2018

Transcript

  1. Neung (1) Kosate Limpongsa. Studying at Chulalongkorn University. #YWC12 #CP41. Developer of JWC7, developer of YWC13. Writer of: • "A deep dive into how AlphaGo works (readable even if you can't program)" • "The new AlphaGo Zero: better than every version before it, without using even a bit of human data (how?)"
  2. 1 Introduction: Why is Go hard? 19 x 19 = 361 intersections. Photo from https://en.wikipedia.org/wiki/Rules_of_Go
  3. 1 Introduction: Why is Go hard? The number of legal positions is 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 (about 2 x 10^170), more than the number of atoms in the observable universe (the sketch below checks the scale). Reference: http://tromp.github.io/go/legal.html
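As a quick check on that scale claim, here is a minimal Python sketch using arbitrary-precision integers; the legal-position count is Tromp's figure quoted above, and the ~10^80 atom count is the usual rough order-of-magnitude estimate:

```python
# Scale check for the numbers above (Python ints are arbitrary precision).
legal_positions = int(
    "208168199381979984699478633344862770286522453884530548425639"
    "456820927419612738015378525648451698519643907259916015628128"
    "546089888314427129715319317557736620397247064840935"
)
naive_upper_bound = 3 ** 361   # every point is empty, black, or white
atoms_in_universe = 10 ** 80   # rough order-of-magnitude estimate

print(len(str(legal_positions)) - 1)        # 170 -> about 2 x 10^170 positions
print(len(str(naive_upper_bound)) - 1)      # 172 -> the naive bound is ~10^172
print(legal_positions > atoms_in_universe)  # True: far more than atoms
```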
  4. 1 Introduction: Why is Go hard? Checking them all would take roughly 9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years, even if we used all the computers in the world.
  5. 3 Algorithm / AlphaGo. AlphaGo combines four learned components: • Rollout policy (trained on human data) • SL policy network (trained on human data) • RL policy network (trained by self-play) • Value network (trained from the RL policy's games)
  6. 3 Algorithm / AlphaGo. 4 MCTS Selection: choose the move that maximizes Q + u(P), where P is the prior probability the policy network assigns to the move and u(P) is an exploration bonus derived from it (sketched below).
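A minimal sketch of that selection rule, assuming a PUCT-style bonus as described in the AlphaGo paper; the `Edge` class, the `c_puct` constant, and the node layout are illustrative placeholders, not DeepMind's code:

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics for one candidate move (illustrative layout)."""
    prior: float              # P: policy network's prior probability
    visit_count: int = 0      # N: how often this move was explored
    total_value: float = 0.0  # W: sum of backed-up values

    @property
    def q(self) -> float:
        # Q = W / N: mean value of taking this move so far
        return self.total_value / self.visit_count if self.visit_count else 0.0

def select_move(edges: dict, c_puct: float = 5.0):
    """Pick the move maximizing Q + u(P).

    u(P) is large for moves the policy network likes (high P) and decays
    as a move accumulates visits, so the search balances exploitation
    (high Q) against exploration (high prior, few visits).
    """
    total_visits = sum(e.visit_count for e in edges.values())
    def score(edge: Edge) -> float:
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.q + u
    return max(edges, key=lambda move: score(edges[move]))
```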
  7. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: reward. How do you estimate the reward in the middle of a game? Remember the barrier: you can't calculate all the possibilities.
  8. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: 1) use the rollout policy to play until the end of the game.
  9. 3 Algorithm / AlphaGo. 4 MCTS Evaluation: 1) use the rollout policy to play until the end of the game; 2) use the value network to estimate the win probability ("what percent do I win?"). The two estimates are blended, as sketched below.
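A sketch of how the two estimates are combined at a leaf, using the paper's mixing weight lambda = 0.5; `value_network` and `rollout_policy` are placeholder callables, the first returning a win estimate in [-1, 1] and the second playing fast moves to the end and returning the final result z:

```python
def evaluate_leaf(leaf_state, value_network, rollout_policy, lam=0.5):
    """Blend the two answers to 'how good is this position?'."""
    v = value_network(leaf_state)   # instant estimate: "what percent do I win?"
    z = rollout_policy(leaf_state)  # play to the end of the game: +1 win, -1 loss
    return (1 - lam) * v + lam * z  # lambda = 0.5 weights them equally
```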
  10. 3 Algorithm / AlphaGo. 4 MCTS Backup: 1) the rollout returns a reward such as +1; 2) the value network returns a win estimate such as +0.79. Both values are backed up into Q in the MCTS tree, as sketched below.
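A sketch of the backup step, reusing the illustrative `Edge` from the selection sketch; the sign flip assumes values are stored from the perspective of the player to move (implementations differ in exactly where they flip):

```python
def backup(path, leaf_value: float):
    """Push a leaf evaluation (e.g. a +1 rollout result or a +0.79 value
    network estimate) up the visited path, so each edge's Q = W / N
    reflects the new information on the next selection pass."""
    v = leaf_value  # from the perspective of the player to move at the leaf
    for node, edge in reversed(path):
        v = -v                 # this edge was chosen by the player one ply up
        edge.visit_count += 1  # N += 1
        edge.total_value += v  # W += v, shifting Q = W / N toward v
```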
  11. 3 Algorithm / AlphaGo. 5 How AlphaGo makes its decision: select the move with max(Q + u(P)) in MCTS, e.g. the branch scoring 0.73 beats the branch scoring 0.63 (decision loop sketched below).
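Putting the pieces together, a sketch of the decision loop implied by the slide; `run_simulation` is a placeholder for one selection-evaluation-backup pass, and `select_move` is the rule sketched earlier. (For the final choice the Nature paper reports picking the most-visited root move; the slide's max(Q + u(P)) rule is shown here.)

```python
def decide(root_edges: dict, run_simulation, num_simulations: int = 1600):
    """Spend the search budget, then pick the best-looking root move."""
    for _ in range(num_simulations):
        run_simulation()            # selection -> evaluation -> backup
    return select_move(root_edges)  # e.g. the 0.73 branch beats the 0.63 one
```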
  12. 3 Algorithm / AlphaGo Zero. AlphaGo Zero architecture: a single policy-value network that predicts the probability of every move and the probability of winning (e.g. +0.79). A minimal sketch follows.
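A deliberately tiny PyTorch sketch of the dual-head idea; the real network is a much deeper residual CNN, and the 17 input planes and layer sizes here are only indicative:

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """One trunk, two heads: move probabilities and a win estimate."""
    def __init__(self, in_channels: int = 17, filters: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, filters, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(         # "predict every move"
            nn.Conv2d(filters, 2, kernel_size=1), nn.Flatten(),
            nn.Linear(2 * 19 * 19, 19 * 19 + 1),  # 361 points + pass
        )
        self.value_head = nn.Sequential(          # "what percent do I win?"
            nn.Conv2d(filters, 1, kernel_size=1), nn.Flatten(),
            nn.Linear(19 * 19, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),          # e.g. +0.79 means likely win
        )

    def forward(self, board):                     # board: (batch, 17, 19, 19)
        x = self.trunk(board)
        return self.policy_head(x), self.value_head(x)
```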
  13. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS) Selection: choose the move that maximizes Q + U, where the prior inside U now comes from the single policy-value network (the same selection rule sketched earlier).
  14. 3 Algorithm / AlphaGo Zero. 1 Self-play (MCTS) Backup: use the network's value estimate directly (e.g. 79% win); there are no rollouts. These values update Q in the MCTS tree, as sketched below.
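A sketch of one full AlphaGo Zero-style simulation tying these steps together; the node API (`is_expanded`, `child`, `expand`, `state`) and `net.evaluate` are illustrative placeholders, and `select_move`/`backup` are the sketches from the AlphaGo section above:

```python
def simulate(root, net):
    """One simulation with no rollouts: network-only evaluation."""
    node, path = root, []
    while node.is_expanded():                 # selection: walk down by Q + U
        move = select_move(node.edges)
        path.append((node, node.edges[move]))
        node = node.child(move)
    priors, value = net.evaluate(node.state)  # one network call per leaf
    node.expand(priors)                       # children get P from the network
    backup(path, value)                       # e.g. 79% win flows up into Q
```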
  15. 1 Introduction: AlphaGo timeline. After AlphaGo was born, AlphaGo Fan beat Fan Hui (a 2-dan professional) 5-0, and AlphaGo Lee beat Lee Sedol (a 9-dan professional) 4-1.
  16. 1 Introduction: AlphaGo timeline. AlphaGo Master beat 60 professional players 60-0 and beat Ke Jie (the world's No. 1 player) 3-0; AlphaGo Zero followed, first in a nearly complete form and then complete.
  17. 4 Result: a stronger player. AlphaGo Zero vs AlphaGo Lee: 100-0. AlphaGo Zero vs AlphaGo Master: 89-11.
  18. Neung (1) Kosate Limpongsa, Chulalongkorn University. #YWC12 #CP41. Articles: • "A deep dive into how AlphaGo works" https://medium.com/kosate/3a1cf3631289 • "The new AlphaGo Zero" https://medium.com/kosate/9d6e5c059e7d. Contact: kosatelim (at) gmail.com, medium.com/kosate. References: • https://www.nature.com/articles/nature16961 • https://www.nature.com/articles/nature24270