Slide 1

Behind AlphaGo / AlphaGo Zero. Programming Meetup #5. Kosate Limpongsa. #ywc12

Slide 2

Neung (1) Kosate Limpongsa. Student at Chulalongkorn University. #YWC12 #CP41. Developer of JWC7. Developer of YWC13. Writer of • "A deep dive into how AlphaGo works (readable even if you can't program)" • "The new AlphaGo Zero: better than every predecessor, without using even a bit of human data (how?)"

Slide 3

1 Introduction 2 Background 3 Algorithm 4 Result

Slide 4

I am not proficient at Go

Slide 5

I am not proficient at Go. Just a nerd who read some papers.

Slide 6

1 Introduction 2 Background 3 Algorithm 4 Result

Slide 7

1 Introduction What is Go?

Slide 8

1 Introduction What is Go? Photo from https://www.scichallenge.eu/blogs/stories-experiences/game-go-a-strategy-game/

Slide 9

1 Introduction What is Go? Photo from AlphaGo Movie

Slide 10

1 Introduction Why is Go hard?

Slide 11

1 Introduction Why is Go hard? 19 × 19 = 361 points Photo from https://en.wikipedia.org/wiki/Rules_of_Go

Slide 12

1 Introduction Why is Go hard? Photo from https://youtu.be/SUbqykXVx0A

Slide 13

1 Introduction Why is Go hard? Reference http://tromp.github.io/go/legal.html 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 (≈ 2.08 × 10^170 legal positions) More than the number of atoms in the universe

Slide 14

1 Introduction Why is Go hard? ~9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years, even if we used all the computers in the world

Slide 15

1 Introduction

Slide 16

1 Introduction AlphaGo • Developed by DeepMind • Uses deep learning • Combined with tree search

Slide 17

1 Introduction AlphaGo Zero • Zero human knowledge • Harder, Better, Faster, Stronger

Slide 18

2 Background 3 Algorithm 4 Result

Slide 19

2 Background Reinforcement Learning

Slide 20

2 Background Reinforcement Learning Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q

Slide 21

2 Background Reinforcement Learning Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q Reward: +1 win, -1 loss, 0 otherwise

Slide 22

2 Background Goal: make a smart agent. Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q
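
To make the setup concrete, here is a toy sketch of the reinforcement-learning loop described on these slides. ToyGoEnv and the random policy are hypothetical stand-ins (nothing like AlphaGo's real environment); the point is only the reward shape: 0 during the game, +1 or -1 at the end.

```python
import random

class ToyGoEnv:
    """Hypothetical stand-in environment: a 'game' that ends after ten moves."""
    def reset(self):
        self.turns = 0
        return 0  # dummy state

    def step(self, move):
        self.turns += 1
        done = self.turns >= 10
        # Reward is 0 during the game; +1 (win) or -1 (loss) only at the end.
        reward = random.choice([+1, -1]) if done else 0
        return 0, reward, done

def play_episode(env, policy):
    state, done, outcome = env.reset(), False, 0
    while not done:
        state, reward, done = env.step(policy(state))
        outcome += reward
    return outcome  # +1 or -1: the only signal the agent ever learns from

print(play_episode(ToyGoEnv(), policy=lambda state: random.randrange(361)))
```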

Slide 23

2 Background Neural Network

Slide 24

2 Background Neural Network or Deep Learning

Slide 25

2 Background Neural Network Photo from http://cs231n.github.io/neural-networks-1/

Slide 26

2 Background Neural Network Photo from https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/

Slide 27

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward

Slide 28

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward

Slide 29

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward
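
The feed-forward pass illustrated on these slides is, at its core, repeated matrix multiplication with a nonlinearity in between. A minimal sketch in numpy; the layer sizes and random weights are made up for illustration:

```python
import numpy as np

def feed_forward(x, layers):
    """Push the input through each (weights, bias) pair, with a ReLU between layers."""
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)  # hidden layer: linear map + ReLU
    W, b = layers[-1]
    return W @ x + b                    # output layer: linear map only

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                       # made-up layer sizes
layers = [(rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]
print(feed_forward(rng.standard_normal(4), layers))
```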

Slide 30

2 Background Convolutional Neural Network Photo from https://ai.icymi.email/tag/alex-krizhevsky/
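
A convolutional network like the one pictured replaces fully-connected layers with small filters slid across the board, so the same pattern detector is reused at every position. A toy valid-mode 2D convolution (the board encoding and filter values here are arbitrary):

```python
import numpy as np

def conv2d(board, kernel):
    """Valid (no-padding) 2D convolution: slide the kernel over the board."""
    kh, kw = kernel.shape
    h = board.shape[0] - kh + 1
    w = board.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(board[i:i+kh, j:j+kw] * kernel)
    return out

board = np.random.default_rng(1).integers(-1, 2, size=(19, 19))  # -1/0/+1 stones
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])               # arbitrary 2x2 filter
print(conv2d(board, edge_filter).shape)                          # (18, 18) feature map
```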

Slide 31

3 Algorithm 4 Result

Slide 32

AlphaGo

Slide 33

3 Algorithm / AlphaGo AlphaGo Architecture

Slide 34

3 Algorithm / AlphaGo 1 Learns from human data (Input → Output)

Slide 35

3 Algorithm / AlphaGo 2 Self-play (Input → Output) produces a smarter version

Slide 36

3 Algorithm / AlphaGo 3 Value evaluation (Input → Output): Win +0.79

Slide 37

3 Algorithm / AlphaGo AlphaGo • Rollout (Human data) • SL Policy (Human data) • RL Policy (Self-play) • Value Network (RL Policy)
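
As a structural sketch, the four components in this list could be held together like this; the names and signatures are mine, not DeepMind's:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlphaGoComponents:
    rollout: Callable    # fast policy (human data): cheap move sampler for playouts
    sl_policy: Callable  # supervised policy (human data): move priors P for the tree
    rl_policy: Callable  # self-play-improved policy: used to train the value network
    value_net: Callable  # position -> estimated probability of winning

# Toy instantiation so the sketch runs (uniform priors, coin-flip-ish values):
bot = AlphaGoComponents(
    rollout=lambda state: 0,
    sl_policy=lambda state: {move: 1 / 362 for move in range(362)},
    rl_policy=lambda state: {move: 1 / 362 for move in range(362)},
    value_net=lambda state: 0.5,
)
print(bot.value_net(None))
```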

Slide 38

4 Monte Carlo Tree Search (MCTS)

Slide 39

3 Algorithm / AlphaGo 4 MCTS

Slide 40

3 Algorithm / AlphaGo 4 MCTS Photo from http://slideplayer.com/slide/8088626/ Tree Search

Slide 41

3 Algorithm / AlphaGo 4 MCTS Tree Search

Slide 42

3 Algorithm / AlphaGo 4 MCTS Selection: choose the move with the highest Q + u(P), where the prior P comes from the RL Policy
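
In code, that selection rule (spelled out on a later slide as max(Q + u(P))) might look like the sketch below. The u(P) form and the exploration constant are simplified from the paper's full PUCT formula:

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float     # P: probability the policy network assigned to this move
    visits: int = 0  # N: how often MCTS has tried it
    q: float = 0.0   # Q: mean value backed up through this move so far

def select(children, c_puct=5.0):
    """argmax over Q + u(P): u is large for high-prior, rarely visited moves."""
    n_total = sum(c.visits for c in children) or 1
    return max(children,
               key=lambda c: c.q + c_puct * c.prior * math.sqrt(n_total) / (1 + c.visits))

kids = [Child(prior=0.6), Child(prior=0.3, visits=10, q=0.2)]
print(select(kids).prior)  # 0.6: the unvisited high-prior move wins early on
```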

Slide 43

3 Algorithm / AlphaGo 4 MCTS Expand

Slide 44

3 Algorithm / AlphaGo 4 MCTS Evaluation: how do we estimate the reward in the middle of the game? Remember the barrier: you can't calculate all possibilities.

Slide 45

3 Algorithm / AlphaGo 4 MCTS Evaluation 1) Use Rollout to play until the end of the game

Slide 46

3 Algorithm / AlphaGo 4 MCTS Evaluation 1) Use Rollout to play until the end of the game 2) Use the Value Network to estimate the win probability ("What percent do I win?")

Slide 47

3 Algorithm / AlphaGo 4 MCTS Backup 1) Rollout gives a +1 reward 2) the Value Network gives +0.79 win. Update these values into Q in the MCTS tree.

Slide 48

3 Algorithm / AlphaGo 4 MCTS Loop
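
Putting Selection, Expand, Evaluation, and Backup together, one iteration of the loop could look like the sketch below. The data structures and the 50/50 blend of rollout and value network are simplifications of mine (the paper mixes the two with a tunable weight):

```python
import math, random

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # move -> Node

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts_iteration(root, policy, value_net, rollout, c_puct=5.0, lam=0.5):
    # 1) Selection: walk down, always taking the child maximizing Q + u(P).
    node, path = root, [root]
    while node.children:
        n_total = max(1, sum(c.visits for c in node.children.values()))
        _, node = max(node.children.items(),
                      key=lambda mc: mc[1].q + c_puct * mc[1].prior
                                     * math.sqrt(n_total) / (1 + mc[1].visits))
        path.append(node)
    # 2) Expand: create children with priors from the policy network.
    for move, p in policy(node).items():
        node.children[move] = Node(prior=p)
    # 3) Evaluation: blend a fast rollout with the value network's estimate.
    leaf_value = (1 - lam) * value_net(node) + lam * rollout(node)
    # 4) Backup: push the leaf value into Q along the whole path.
    for n in path:
        n.visits += 1
        n.value_sum += leaf_value

# Toy stand-ins so the sketch runs:
root = Node(prior=1.0)
for _ in range(100):
    mcts_iteration(root,
                   policy=lambda node: {m: 1 / 3 for m in range(3)},
                   value_net=lambda node: 0.58,
                   rollout=lambda node: random.choice([+1, -1]))
print(max(root.children.items(), key=lambda mc: mc[1].visits)[0])
```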

Slide 49

3 Algorithm / AlphaGo 5 How AlphaGo makes a decision: select the move with max(Q + u(P)) in MCTS (e.g. 0.73 vs 0.63)

Slide 50

That is everything about AlphaGo

Slide 51

AlphaGo Zero

Slide 52

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 53

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 54

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 55

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture Policy-Value Network: predicts probabilities for all moves, and predicts the probability of winning (Win +0.79)
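
A minimal numpy sketch of such a two-headed network: one shared body, a policy head that outputs a probability for every move (361 points plus pass), and a value head that outputs a single win estimate. Sizes and random weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 361  # 19 x 19 input points
W_body = rng.standard_normal((128, BOARD)) * 0.01
W_policy = rng.standard_normal((BOARD + 1, 128)) * 0.01  # every move + pass
W_value = rng.standard_normal((1, 128)) * 0.01

def policy_value(x):
    h = np.maximum(0.0, W_body @ x)                 # shared trunk
    logits = W_policy @ h
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax: P(move) for all moves
    value = float(np.tanh(W_value @ h)[0])          # single win estimate in [-1, 1]
    return policy, value

p, v = policy_value(rng.standard_normal(BOARD))
print(p.shape, v)  # (362,) move probabilities plus one value, e.g. +0.79-style
```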

Slide 56

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture MCTS

Slide 57

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture MCTS MCTS

Slide 58

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture AlphaGo vs AlphaGo Zero: it is no different

Slide 59

3 Algorithm / AlphaGo Zero 1 Self-play

Slide 60

3 Algorithm / AlphaGo Zero 1 Self-play: use MCTS from the beginning

Slide 61

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Selection: choose the move with the highest Q + u(P), where the prior P comes from the network

Slide 62

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Expand

Slide 63

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Evaluation: use only the value from the network (Win 79%)

Slide 64

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Backup: use the value (79% win) and update these values into Q in the MCTS tree

Slide 65

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Loop
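
Compared with the AlphaGo loop sketched earlier, only the evaluation step changes: there is no rollout at all, just the network's value head, and the same network also supplies the priors when a leaf is expanded. A small sketch (Node is redefined here so the snippet stands alone):

```python
class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}

# AlphaGo blended rollout and value network; AlphaGo Zero keeps only the network.
def evaluate_and_expand(node, policy_value_net):
    priors, value = policy_value_net(node)    # ({move: prior, ...}, win estimate)
    for move, p in priors.items():
        node.children[move] = Node(prior=p)
    return value                              # backed up into Q exactly as before

leaf = Node(prior=1.0)
print(evaluate_and_expand(leaf, lambda node: ({0: 0.7, 1: 0.3}, 0.79)))  # -> 0.79
```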

Slide 66

3 Algorithm / AlphaGo Zero But how does it learn?

Slide 67

3 Algorithm / AlphaGo Zero 2 Training π

Slide 68

3 Algorithm / AlphaGo Zero 2 Training π

Slide 69

3 Algorithm / AlphaGo Zero 2 Training π

Slide 70

3 Algorithm / AlphaGo Zero 2 Training π

Slide 71

3 Algorithm / AlphaGo Zero 2 Training π

Slide 72

3 Algorithm / AlphaGo Zero 2 Training Z-reward Notice?

Slide 73

3 Algorithm / AlphaGo Zero 2 Training Z-reward: +1 win, -1 loss

Slide 74

3 Algorithm / AlphaGo Zero 2 Training Z-reward π +1.00

Slide 75

3 Algorithm / AlphaGo Zero 2 Training Z-reward π +1.00 +0.79

Slide 76

3 Algorithm / AlphaGo Zero 2 Training Z-reward π: Predicted +0.79, Actual +1.00

Slide 77

Flashback time

Slide 78

3 Algorithm / AlphaGo Zero 2 Training Feed-Forward (Photo from https://medium.com/kosate/9d6e5c059e7d): Predicted +0.79, Actual +1.00, Errors: 23.512
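
The "error" on this slide is the training loss: the network's move probabilities p are pulled toward the MCTS probabilities π, and its value v toward the actual game outcome z. A sketch of that loss (the paper also adds an L2 weight-regularization term; the numbers below just echo the slides' +1.00 and +0.79):

```python
import numpy as np

def zero_loss(pi, p, z, v, eps=1e-12):
    """(z - v)^2 - pi . log(p): squared value error plus policy cross-entropy."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + eps))
    return value_loss + policy_loss

pi = np.array([0.12, 0.58, 0.30])  # MCTS-improved move probabilities (training target)
p  = np.array([0.20, 0.50, 0.30])  # network's predicted move probabilities
print(zero_loss(pi, p, z=+1.00, v=+0.79))  # z = actual outcome, v = predicted value
```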

Slide 79

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary +1.00

Slide 80

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary +1.00 0.12 0.58 0.79

Slide 81

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary

Slide 82

That is everything about AlphaGo Zero

Slide 83

4 Result

Slide 84

1 Introduction AlphaGo Timeline: Born → AlphaGo Fan → AlphaGo Lee. AlphaGo Fan vs Fan Hui (2-dan player): AlphaGo wins 5–0. AlphaGo Lee vs Lee Sedol (9-dan player): AlphaGo wins 4–1.

Slide 85

1 Introduction AlphaGo Timeline: AlphaGo Master → AlphaGo Zero (nearly complete) → AlphaGo Zero (complete). AlphaGo Master vs 60 professional players: AlphaGo wins 60–0. AlphaGo Master vs Ke Jie (world No. 1 player): AlphaGo wins 3–0.

Slide 86

4 Result Better

Slide 87

4 Result Faster

Slide 88

4 Result Stronger: Elo ratings 5,185 / 4,858 / 3,739 / 3,144

Slide 89

4 Result Stronger: AlphaGo Zero vs AlphaGo Lee 100–0; AlphaGo Zero vs AlphaGo Master 89–11

Slide 90

Nothing can beat AlphaGo Zero

Slide 91

Neung (1) Kosate Limpongsa. Chulalongkorn University. #YWC12 #CP41. Articles • "A deep dive into how AlphaGo works": https://medium.com/kosate/3a1cf3631289 • "The new AlphaGo Zero": https://medium.com/kosate/9d6e5c059e7d Contact: Kosate Limpongsa, kosatelim (at) gmail.com, medium.com/kosate References • https://www.nature.com/articles/nature16961 • https://www.nature.com/articles/nature24270

Slide 92

Extra slide AlphaZero

Slide 93

AlphaZero uses the same algorithm as AlphaGo Zero, except for the preprocessing.

Slide 94

But it is applied to more general cases, like chess and shogi.

Slide 95

AlphaZero: Chess, Shogi

Slide 96

AlphaZero Unbeatable

Slide 97

AlphaZero Unbeatable

Slide 98

AlphaZero Sorry for the long slides. Here's a potato.