Slide 1

Behind AlphaGo / AlphaGo Zero. Programming Meetup #5. Kosate Limpongsa. #ywc12

Slide 2

Neung (1) Kosate Limpongsa. Student at Chulalongkorn University. #YWC12 #CP41. Developer of JWC7. Developer of YWC13. Writer of • "A deep dive into how AlphaGo works (readable even if you can't program)" • "The new AlphaGo Zero: better than every predecessor, without using even a bit of human data (how?)"

Slide 3

1 Introduction 2 Background 3 Algorithm 4 Result

Slide 4

I am not proficient at Go

Slide 5

I am not proficient at Go. Just a nerd who read some papers.

Slide 6

1 Introduction 2 Background 3 Algorithm 4 Result

Slide 7

1 Introduction What is Go?

Slide 8

1 Introduction What is Go? Photo from https://www.scichallenge.eu/blogs/stories-experiences/game-go-a-strategy-game/

Slide 9

1 Introduction What is Go? Photo from AlphaGo Movie

Slide 10

1 Introduction Why is Go hard?

Slide 11

1 Introduction Why is Go hard? 19 × 19 = 361 points Photo from https://en.wikipedia.org/wiki/Rules_of_Go

Slide 12

1 Introduction Why is Go hard? Photo from https://youtu.be/SUbqykXVx0A

Slide 13

1 Introduction Why is Go hard? Reference http://tromp.github.io/go/legal.html 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 (≈ 2.08 × 10^170 legal positions) More than the number of atoms in the universe

Slide 14

1 Introduction Why is Go hard? ~9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years, even if we used all the computers in the world

Slide 15

1 Introduction

Slide 16

1 Introduction AlphaGo • Developed by DeepMind • Uses deep learning • Combined with tree search

Slide 17

1 Introduction AlphaGo Zero • Zero human knowledge • Harder, Better, Faster, Stronger

Slide 18

2 Background 3 Algorithm 4 Result

Slide 19

2 Background Reinforcement Learning

Slide 20

2 Background Reinforcement Learning Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q

Slide 21

2 Background Reinforcement Learning Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q Reward: +1 win, -1 loss, 0 otherwise

Slide 22

2 Background Goal: make a smart agent. Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q
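
To make the setup concrete, here is a toy sketch of the reinforcement-learning loop described on these slides. ToyGoEnv and the random policy are hypothetical stand-ins (nothing like AlphaGo's real environment); the point is only the reward shape: 0 during the game, +1 or -1 at the end.

```python
import random

class ToyGoEnv:
    """Hypothetical stand-in environment: a 'game' that ends after ten moves."""
    def reset(self):
        self.turns = 0
        return 0  # dummy state

    def step(self, move):
        self.turns += 1
        done = self.turns >= 10
        # Reward is 0 during the game; +1 (win) or -1 (loss) only at the end.
        reward = random.choice([+1, -1]) if done else 0
        return 0, reward, done

def play_episode(env, policy):
    state, done, outcome = env.reset(), False, 0
    while not done:
        state, reward, done = env.step(policy(state))
        outcome += reward
    return outcome  # +1 or -1: the only signal the agent ever learns from

print(play_episode(ToyGoEnv(), policy=lambda state: random.randrange(361)))
```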

Slide 23

2 Background Neural Network

Slide 24

2 Background Neural Network or Deep Learning

Slide 25

2 Background Neural Network Photo from http://cs231n.github.io/neural-networks-1/

Slide 26

2 Background Neural Network Photo from https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/

Slide 27

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward

Slide 28

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward

Slide 29

2 Background Neural Network Photo from https://medium.com/kosate/9d6e5c059e7d Feed-Forward
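
The feed-forward pass illustrated on these slides is, at its core, repeated matrix multiplication with a nonlinearity in between. A minimal sketch in numpy; the layer sizes and random weights are made up for illustration:

```python
import numpy as np

def feed_forward(x, layers):
    """Push the input through each (weights, bias) pair, with a ReLU between layers."""
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)  # hidden layer: linear map + ReLU
    W, b = layers[-1]
    return W @ x + b                    # output layer: linear map only

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                       # made-up layer sizes
layers = [(rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]
print(feed_forward(rng.standard_normal(4), layers))
```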

Slide 30

2 Background Convolutional Neural Network Photo from https://ai.icymi.email/tag/alex-krizhevsky/
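
A convolutional network like the one pictured replaces fully-connected layers with small filters slid across the board, so the same pattern detector is reused at every position. A toy valid-mode 2D convolution (the board encoding and filter values here are arbitrary):

```python
import numpy as np

def conv2d(board, kernel):
    """Valid (no-padding) 2D convolution: slide the kernel over the board."""
    kh, kw = kernel.shape
    h = board.shape[0] - kh + 1
    w = board.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(board[i:i+kh, j:j+kw] * kernel)
    return out

board = np.random.default_rng(1).integers(-1, 2, size=(19, 19))  # -1/0/+1 stones
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])               # arbitrary 2x2 filter
print(conv2d(board, edge_filter).shape)                          # (18, 18) feature map
```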

Slide 31

3 Algorithm 4 Result

Slide 32

AlphaGo

Slide 33

3 Algorithm / AlphaGo AlphaGo Architecture

Slide 34

3 Algorithm / AlphaGo 1 Learns from human data (Input → Output)

Slide 35

3 Algorithm / AlphaGo 2 Self-play (Input → Output) produces a smarter version

Slide 36

3 Algorithm / AlphaGo 3 Value evaluation (Input → Output): Win +0.79

Slide 37

3 Algorithm / AlphaGo AlphaGo • Rollout (Human data) • SL Policy (Human data) • RL Policy (Self-play) • Value Network (RL Policy)
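
As a structural sketch, the four components in this list could be held together like this; the names and signatures are mine, not DeepMind's:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlphaGoComponents:
    rollout: Callable    # fast policy (human data): cheap move sampler for playouts
    sl_policy: Callable  # supervised policy (human data): move priors P for the tree
    rl_policy: Callable  # self-play-improved policy: used to train the value network
    value_net: Callable  # position -> estimated probability of winning

# Toy instantiation so the sketch runs (uniform priors, coin-flip-ish values):
bot = AlphaGoComponents(
    rollout=lambda state: 0,
    sl_policy=lambda state: {move: 1 / 362 for move in range(362)},
    rl_policy=lambda state: {move: 1 / 362 for move in range(362)},
    value_net=lambda state: 0.5,
)
print(bot.value_net(None))
```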

Slide 38

4 Monte Carlo Tree Search (MCTS)

Slide 39

3 Algorithm / AlphaGo 4 MCTS

Slide 40

3 Algorithm / AlphaGo 4 MCTS Photo from http://slideplayer.com/slide/8088626/ Tree Search

Slide 41

3 Algorithm / AlphaGo 4 MCTS Tree Search

Slide 42

3 Algorithm / AlphaGo 4 MCTS Selection: choose the move with the highest Q + u(P), where the prior P comes from the RL Policy
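
In code, that selection rule (spelled out on a later slide as max(Q + u(P))) might look like the sketch below. The u(P) form and the exploration constant are simplified from the paper's full PUCT formula:

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float     # P: probability the policy network assigned to this move
    visits: int = 0  # N: how often MCTS has tried it
    q: float = 0.0   # Q: mean value backed up through this move so far

def select(children, c_puct=5.0):
    """argmax over Q + u(P): u is large for high-prior, rarely visited moves."""
    n_total = sum(c.visits for c in children) or 1
    return max(children,
               key=lambda c: c.q + c_puct * c.prior * math.sqrt(n_total) / (1 + c.visits))

kids = [Child(prior=0.6), Child(prior=0.3, visits=10, q=0.2)]
print(select(kids).prior)  # 0.6: the unvisited high-prior move wins early on
```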

Slide 43

3 Algorithm / AlphaGo 4 MCTS Expand

Slide 44

3 Algorithm / AlphaGo 4 MCTS Evaluation: how do we estimate the reward in the middle of the game? Remember the barrier: you can't calculate all possibilities.

Slide 45

3 Algorithm / AlphaGo 4 MCTS Evaluation 1) Use Rollout to play until the end of the game

Slide 46

3 Algorithm / AlphaGo 4 MCTS Evaluation 1) Use Rollout to play until the end of the game 2) Use the Value Network to estimate the win probability ("What percent do I win?")

Slide 47

3 Algorithm / AlphaGo 4 MCTS Backup 1) Rollout gives a +1 reward 2) the Value Network gives +0.79 win. Update these values into Q in the MCTS tree.

Slide 48

3 Algorithm / AlphaGo 4 MCTS Loop
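
Putting Selection, Expand, Evaluation, and Backup together, one iteration of the loop could look like the sketch below. The data structures and the 50/50 blend of rollout and value network are simplifications of mine (the paper mixes the two with a tunable weight):

```python
import math, random

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # move -> Node

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts_iteration(root, policy, value_net, rollout, c_puct=5.0, lam=0.5):
    # 1) Selection: walk down, always taking the child maximizing Q + u(P).
    node, path = root, [root]
    while node.children:
        n_total = max(1, sum(c.visits for c in node.children.values()))
        _, node = max(node.children.items(),
                      key=lambda mc: mc[1].q + c_puct * mc[1].prior
                                     * math.sqrt(n_total) / (1 + mc[1].visits))
        path.append(node)
    # 2) Expand: create children with priors from the policy network.
    for move, p in policy(node).items():
        node.children[move] = Node(prior=p)
    # 3) Evaluation: blend a fast rollout with the value network's estimate.
    leaf_value = (1 - lam) * value_net(node) + lam * rollout(node)
    # 4) Backup: push the leaf value into Q along the whole path.
    for n in path:
        n.visits += 1
        n.value_sum += leaf_value

# Toy stand-ins so the sketch runs:
root = Node(prior=1.0)
for _ in range(100):
    mcts_iteration(root,
                   policy=lambda node: {m: 1 / 3 for m in range(3)},
                   value_net=lambda node: 0.58,
                   rollout=lambda node: random.choice([+1, -1]))
print(max(root.children.items(), key=lambda mc: mc[1].visits)[0])
```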

Slide 49

3 Algorithm / AlphaGo 5 How AlphaGo makes a decision: select the move with max(Q + u(P)) in MCTS (e.g. 0.73 vs 0.63)

Slide 50

That is everything about AlphaGo

Slide 51

AlphaGo Zero

Slide 52

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 53

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 54

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture

Slide 55

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture Policy-Value Network: predicts probabilities for all moves, and predicts the probability of winning (Win +0.79)
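
A minimal numpy sketch of such a two-headed network: one shared body, a policy head that outputs a probability for every move (361 points plus pass), and a value head that outputs a single win estimate. Sizes and random weights are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD = 361  # 19 x 19 input points
W_body = rng.standard_normal((128, BOARD)) * 0.01
W_policy = rng.standard_normal((BOARD + 1, 128)) * 0.01  # every move + pass
W_value = rng.standard_normal((1, 128)) * 0.01

def policy_value(x):
    h = np.maximum(0.0, W_body @ x)                 # shared trunk
    logits = W_policy @ h
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax: P(move) for all moves
    value = float(np.tanh(W_value @ h)[0])          # single win estimate in [-1, 1]
    return policy, value

p, v = policy_value(rng.standard_normal(BOARD))
print(p.shape, v)  # (362,) move probabilities plus one value, e.g. +0.79-style
```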

Slide 56

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture MCTS

Slide 57

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture MCTS MCTS

Slide 58

3 Algorithm / AlphaGo Zero AlphaGo Zero Architecture AlphaGo vs AlphaGo Zero: it is no different

Slide 59

3 Algorithm / AlphaGo Zero 1 Self-play

Slide 60

3 Algorithm / AlphaGo Zero 1 Self-play: use MCTS from the beginning

Slide 61

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Selection: choose the move with the highest Q + u(P), where the prior P comes from the network

Slide 62

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Expand

Slide 63

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Evaluation: use only the value from the network (Win 79%)

Slide 64

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Backup: use the value (79% win) and update these values into Q in the MCTS tree

Slide 65

3 Algorithm / AlphaGo Zero 1 Self-play (MCTS) Loop
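
Compared with the AlphaGo loop sketched earlier, only the evaluation step changes: there is no rollout at all, just the network's value head, and the same network also supplies the priors when a leaf is expanded. A small sketch (Node is redefined here so the snippet stands alone):

```python
class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}

# AlphaGo blended rollout and value network; AlphaGo Zero keeps only the network.
def evaluate_and_expand(node, policy_value_net):
    priors, value = policy_value_net(node)    # ({move: prior, ...}, win estimate)
    for move, p in priors.items():
        node.children[move] = Node(prior=p)
    return value                              # backed up into Q exactly as before

leaf = Node(prior=1.0)
print(evaluate_and_expand(leaf, lambda node: ({0: 0.7, 1: 0.3}, 0.79)))  # -> 0.79
```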

Slide 66

3 Algorithm / AlphaGo Zero But how does it learn?

Slide 67

3 Algorithm / AlphaGo Zero 2 Training π

Slide 68

3 Algorithm / AlphaGo Zero 2 Training π

Slide 69

3 Algorithm / AlphaGo Zero 2 Training π

Slide 70

3 Algorithm / AlphaGo Zero 2 Training π

Slide 71

3 Algorithm / AlphaGo Zero 2 Training π

Slide 72

3 Algorithm / AlphaGo Zero 2 Training Z-reward Notice?

Slide 73

3 Algorithm / AlphaGo Zero 2 Training Z-reward: +1 win, -1 loss

Slide 74

3 Algorithm / AlphaGo Zero 2 Training Z-reward π +1.00

Slide 75

3 Algorithm / AlphaGo Zero 2 Training Z-reward π +1.00 +0.79

Slide 76

3 Algorithm / AlphaGo Zero 2 Training Z-reward π: Predicted +0.79, Actual +1.00

Slide 77

Flashback time

Slide 78

3 Algorithm / AlphaGo Zero 2 Training Feed-Forward (Photo from https://medium.com/kosate/9d6e5c059e7d): Predicted +0.79, Actual +1.00, Errors: 23.512
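
The "error" on this slide is the training loss: the network's move probabilities p are pulled toward the MCTS probabilities π, and its value v toward the actual game outcome z. A sketch of that loss (the paper also adds an L2 weight-regularization term; the numbers below just echo the slides' +1.00 and +0.79):

```python
import numpy as np

def zero_loss(pi, p, z, v, eps=1e-12):
    """(z - v)^2 - pi . log(p): squared value error plus policy cross-entropy."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + eps))
    return value_loss + policy_loss

pi = np.array([0.12, 0.58, 0.30])  # MCTS-improved move probabilities (training target)
p  = np.array([0.20, 0.50, 0.30])  # network's predicted move probabilities
print(zero_loss(pi, p, z=+1.00, v=+0.79))  # z = actual outcome, v = predicted value
```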

Slide 79

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary +1.00

Slide 80

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary +1.00 0.12 0.58 0.79

Slide 81

3 Algorithm / AlphaGo Zero AlphaGo Zero Summary

Slide 82

That is everything about AlphaGo Zero

Slide 83

4 Result

Slide 84

1 Introduction AlphaGo Timeline: Born → AlphaGo Fan → AlphaGo Lee. AlphaGo Fan vs Fan Hui (2-dan player): AlphaGo wins 5–0. AlphaGo Lee vs Lee Sedol (9-dan player): AlphaGo wins 4–1.

Slide 85

1 Introduction AlphaGo Timeline: AlphaGo Master → AlphaGo Zero (nearly complete) → AlphaGo Zero (complete). AlphaGo Master vs 60 professional players: AlphaGo wins 60–0. AlphaGo Master vs Ke Jie (world No. 1 player): AlphaGo wins 3–0.

Slide 86

4 Result Better

Slide 87

4 Result Faster

Slide 88

4 Result Stronger: Elo ratings 5,185 / 4,858 / 3,739 / 3,144

Slide 89

4 Result Stronger: AlphaGo Zero vs AlphaGo Lee 100–0; AlphaGo Zero vs AlphaGo Master 89–11

Slide 90

Nothing can beat AlphaGo Zero

Slide 91

Neung (1) Kosate Limpongsa. Chulalongkorn University. #YWC12 #CP41. Articles • "A deep dive into how AlphaGo works": https://medium.com/kosate/3a1cf3631289 • "The new AlphaGo Zero": https://medium.com/kosate/9d6e5c059e7d Contact: Kosate Limpongsa, kosatelim (at) gmail.com, medium.com/kosate References • https://www.nature.com/articles/nature16961 • https://www.nature.com/articles/nature24270

Slide 92

Extra slide AlphaZero

Slide 93

AlphaZero uses the same algorithm as AlphaGo Zero, except for the preprocessing.

Slide 94

But it is applied to more general cases, like chess and shogi.

Slide 95

AlphaZero: Chess, Shogi

Slide 96

AlphaZero Unbeatable

Slide 97

AlphaZero Unbeatable

Slide 98

AlphaZero Sorry for the long slides. Here's a potato.