Behind
AlphaGo
AlphaGo Zero
Programming Meetup #5
Kosate Limpongsa #ywc12
Slide 2
Slide 2 text
Neung (1)
Kosate Limpongsa
Studying at Chulalongkorn University
#YWC12 #CP41
Developer of JWC7
Developer of YWC13
Writer of
• How does AlphaGo work? A deep dive (readable even if you can't program)
• The new AlphaGo Zero: better than every predecessor, without using even a bit of human data (how?)
Slide 3
Slide 3 text
1 Introduction
2 Background
3 Algorithm
4 Result
Slide 4
Slide 4 text
I am not proficient at Go
Slide 5
Slide 5 text
I am not proficient at Go
Just a nerd who has read
some papers
Slide 6
Slide 6 text
2 Background
3 Algorithm
4 Result
Introduction
1
Slide 7
Slide 7 text
1 Introduction
What is Go?
Slide 8
Slide 8 text
1 Introduction
What is Go?
Photo from https://www.scichallenge.eu/blogs/stories-experiences/game-go-a-strategy-game/
Slide 9
Slide 9 text
1 Introduction
What is Go?
Photo from AlphaGo Movie
Slide 10
Slide 10 text
1 Introduction
Why is Go hard?
Slide 11
Slide 11 text
1 Introduction
Why is Go hard?
19 x 19 = 361
Photo from https://en.wikipedia.org/wiki/Rules_of_Go
Slide 12
Slide 12 text
1 Introduction
Why is Go hard?
Photo from https://youtu.be/SUbqykXVx0A
Slide 13
Slide 13 text
1 Introduction
Why is Go hard?
Reference http://tromp.github.io/go/legal.html
208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935
≈ 2.08 × 10^170 legal positions
More than the number of atoms in the observable universe
Slide 14
Slide 14 text
1 Introduction
Why is Go hard?
~ 9 trillion trillion trillion trillion trillion trillion trillion trillion trillion trillion years (about 9 × 10^120 years)
even if we used every computer in the world
Slide 15
Slide 15 text
1 Introduction
Slide 16
Slide 16 text
1 Introduction
AlphaGo
• Developed by DeepMind
• Uses deep learning
• Combined with tree search
Slide 17
Slide 17 text
1 Introduction
AlphaGo Zero
• Zero human knowledge
• Harder, Better, Faster, Stronger
Slide 18
Slide 18 text
3 Algorithm
4 Result
Background
2
Slide 19
Slide 19 text
Reinforcement Learning
2 Background
Slide 20
Slide 20 text
2 Background
Reinforcement Learning
Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q
Slide 21
Slide 21 text
2 Background
Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q
Reinforcement Learning
+1 for a win, -1 for a loss, 0 otherwise
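As a tiny Python sketch, the reward signal above is simply (illustrative only, not DeepMind's code):

```python
def reward(game_outcome: str) -> float:
    """+1 for a win, -1 for a loss, 0 for anything else (e.g. mid-game)."""
    if game_outcome == "win":
        return 1.0
    if game_outcome == "loss":
        return -1.0
    return 0.0
```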
Slide 22
Slide 22 text
2 Background
Goal
Photo from https://www.youtube.com/watch?v=OBzvN9FLx4Q
Make a smart agent
Slide 23
Slide 23 text
Neural Network
2 Background
Slide 24
Slide 24 text
2 Background
Neural Network or Deep Learning
Slide 25
Slide 25 text
2 Background
Neural Network
Photo from http://cs231n.github.io/neural-networks-1/
Slide 26
Slide 26 text
2 Background
Neural Network
Photo from https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/
Slide 27
Slide 27 text
2 Background
Neural Network
Photo from https://medium.com/kosate/9d6e5c059e7d
Feed-Forward
Slide 28
Slide 28 text
2 Background
Neural Network
Photo from https://medium.com/kosate/9d6e5c059e7d
Feed-Forward
Slide 29
Slide 29 text
2 Background
Neural Network
Photo from https://medium.com/kosate/9d6e5c059e7d
Feed-Forward
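A minimal feed-forward pass in Python/NumPy, matching the picture (layer sizes and random weights are placeholders; training would learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(361, 128))  # flattened 19x19 board -> hidden layer
W2 = rng.normal(size=(128, 1))    # hidden layer -> single output

def relu(x):
    return np.maximum(0.0, x)

def feed_forward(board_vector):
    hidden = relu(board_vector @ W1)  # weighted sum, then nonlinearity
    return hidden @ W2                # raw output score

print(feed_forward(rng.normal(size=361)))
```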
Slide 30
Slide 30 text
2 Background
Convolutional Neural Network
Photo from https://ai.icymi.email/tag/alex-krizhevsky/
Slide 31
Slide 31 text
4 Result
Algorithm
3
Slide 32
Slide 32 text
AlphaGo
Slide 33
Slide 33 text
3 Algorithm / AlphaGo
AlphaGo Architecture
Slide 34
Slide 34 text
3 Algorithm / AlphaGo
1 Learns from human data
Input
Output
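A rough sketch of this supervised step in Python/PyTorch (a toy MLP stands in for the real convolutional policy network; `human_positions` and `human_moves` are hypothetical tensors from a database of human games):

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Linear(361, 256), nn.ReLU(),
    nn.Linear(256, 361),               # one logit per board point
)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()        # "predict the human's next move"

def train_step(human_positions, human_moves):
    # human_positions: (batch, 361) floats; human_moves: (batch,) move indices
    loss = loss_fn(policy_net(human_positions), human_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```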
Slide 35
Slide 35 text
3 Algorithm / AlphaGo
2 Self-play
Input
Output
Smarter Version
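A sketch of the self-play improvement, REINFORCE-style (`play_game` is a hypothetical helper that plays one game with the current policy and returns the log-probabilities of the chosen moves plus the outcome z = +1/-1):

```python
import torch

def self_play_update(optimizer, play_game):
    log_probs, z = play_game()                 # z: +1 win / -1 loss
    loss = -z * torch.stack(log_probs).sum()   # reinforce moves from won games
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```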
Slide 36
Slide 36 text
3 Algorithm / AlphaGo
3 Value evaluation
Input
Output
Smarter Version
Win
+0.79
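A sketch of the value network in PyTorch (placeholder architecture; tanh keeps the output in [-1, 1], matching scores like the +0.79 above):

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(
    nn.Linear(361, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),   # -1 = certain loss, +1 = certain win
)

board = torch.randn(1, 361)         # a fake encoded position
print(value_net(board))             # e.g. ~ +0.79 for a winning position
```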
3 Algorithm / AlphaGo
4 MCTS
Photo from http://slideplayer.com/slide/8088626/
Tree Search
Slide 41
Slide 41 text
3 Algorithm / AlphaGo
4 MCTS
Tree Search
Slide 42
Slide 42 text
3 Algorithm / AlphaGo
4 MCTS
Selection
Choose the move maximizing Q + u(P):
u(P) is an exploration bonus from the RL policy's prior P,
and Q is the action value built up by search
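In code, the selection rule can be sketched like this (the constant and node fields are illustrative; this is the PUCT formula from the AlphaGo paper):

```python
import math

C_PUCT = 5.0  # exploration constant (illustrative value)

def select_child(children):
    """children: dicts with prior P, visit count N, action value Q."""
    total_N = sum(c["N"] for c in children)
    def score(c):
        u = C_PUCT * c["P"] * math.sqrt(total_N) / (1 + c["N"])  # u(P)
        return c["Q"] + u
    return max(children, key=score)
```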
Slide 43
Slide 43 text
3 Algorithm / AlphaGo
4 MCTS
Expand
Slide 44
Slide 44 text
3 Algorithm / AlphaGo
4 MCTS
Evaluation
Reward
How do we estimate the reward
in the middle of the game?
Remember?
You can’t calculate
all possibilities
Barrier
Slide 45
Slide 45 text
3 Algorithm / AlphaGo
4 MCTS
Evaluation
1) Rollout: play until
the end of the game
Slide 46
Slide 46 text
3 Algorithm / AlphaGo
4 MCTS
Evaluation
1) Rollout: play until
the end of the game
2) Value Network: estimate
the win probability
What percent
do I win?
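A sketch of how the two estimates are combined at a leaf (the AlphaGo paper mixes them with a weight λ = 0.5; `rollout_to_end` and `value_net` are stand-ins):

```python
LAMBDA = 0.5  # mixing weight; 0.5 worked best in the AlphaGo paper

def evaluate_leaf(position, rollout_to_end, value_net):
    z = rollout_to_end(position)   # 1) fast playout to the end: +1 / -1
    v = value_net(position)        # 2) value network estimate in [-1, 1]
    return (1 - LAMBDA) * v + LAMBDA * z
```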
Slide 47
Slide 47 text
3 Algorithm / AlphaGo
4 MCTS
Backup
1) Rollout gives +1 reward
2) Value Network gives +0.79 win
Update these values into Q
in the MCTS tree
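A sketch of the backup, assuming each node is a dict with visit count N, total value W, and mean value Q (flipping the sign between players is one common convention):

```python
def backup(path, leaf_value):
    """path: nodes from root to leaf; leaf_value: the evaluation above."""
    for node in reversed(path):
        node["N"] += 1
        node["W"] += leaf_value              # accumulate total value
        node["Q"] = node["W"] / node["N"]    # running mean used by selection
        leaf_value = -leaf_value             # opponent's perspective one level up
```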
Slide 48
Slide 48 text
3 Algorithm / AlphaGo
4 MCTS
Loop
Loop
Slide 49
Slide 49 text
3 Algorithm / AlphaGo
5 How AlphaGo makes decision
Select the move with max(Q + u(P)) in MCTS
0.73
0.63
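The decision in code, reusing the in-tree rule (note: the published paper actually plays the most-visited child, which is more robust; the slide's max(Q + u(P)) usually agrees):

```python
import math

C_PUCT = 5.0  # same illustrative constant as in the selection sketch

def choose_move(root_children):
    total_N = sum(c["N"] for c in root_children)
    def score(c):
        return c["Q"] + C_PUCT * c["P"] * math.sqrt(total_N) / (1 + c["N"])
    return max(root_children, key=score)  # e.g. picks 0.73 over 0.63
```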
Slide 50
Slide 50 text
That is everything about
AlphaGo
Slide 51
Slide 51 text
AlphaGo Zero
Slide 52
Slide 52 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
Slide 53
Slide 53 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
Slide 54
Slide 54 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
Slide 55
Slide 55 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
Policy-Value Network
Win
+0.79
Predicts probabilities for all moves
Predicts the probability of winning
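A sketch of that single two-headed network in PyTorch (a toy MLP trunk stands in for the real deep residual CNN):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(361, 256), nn.ReLU())  # shared body
        self.policy_head = nn.Linear(256, 361)                      # logits over moves
        self.value_head = nn.Sequential(nn.Linear(256, 1), nn.Tanh())

    def forward(self, board):
        h = self.trunk(board)
        return self.policy_head(h), self.value_head(h)  # (move logits, win estimate)

p_logits, v = PolicyValueNet()(torch.randn(1, 361))     # v like the +0.79 above
```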
Slide 56
Slide 56 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
MCTS
Slide 57
Slide 57 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
MCTS
MCTS
Slide 58
Slide 58 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Architecture
AlphaGo
AlphaGo
Zero
So far, it looks no different
Slide 59
Slide 59 text
3 Algorithm / AlphaGo Zero
1 Self-play
Slide 60
Slide 60 text
3 Algorithm / AlphaGo Zero
1 Self-play
Uses MCTS from the very beginning
Slide 61
Slide 61 text
1 Self-play (MCTS)
Selection
Choose the move maximizing Q + U:
U is an exploration bonus from the network's policy prior,
and Q is the action value built up by search
3 Algorithm / AlphaGo Zero
Slide 62
Slide 62 text
1 Self-play (MCTS)
Expand
3 Algorithm / AlphaGo Zero
Slide 63
Slide 63 text
3 Algorithm / AlphaGo Zero
1 Self-play (MCTS)
Evaluation
Use only the value output of the network (no rollouts)
Win
79%
Slide 64
Slide 64 text
3 Algorithm / AlphaGo Zero
1 Self-play (MCTS)
Backup
Use the value: 79% win
Update this value into Q
in the MCTS tree
Slide 65
Slide 65 text
3 Algorithm / AlphaGo Zero
1 Self-play (MCTS)
Loop
Slide 66
Slide 66 text
3 Algorithm / AlphaGo Zero
But how does it learn?
Slide 67
Slide 67 text
3 Algorithm / AlphaGo Zero
2 Training
Pi
Slide 68
Slide 68 text
3 Algorithm / AlphaGo Zero
2 Training
Pi
Slide 69
Slide 69 text
3 Algorithm / AlphaGo Zero
2 Training
Pi
Slide 70
Slide 70 text
3 Algorithm / AlphaGo Zero
2 Training
Pi
π
Slide 71
Slide 71 text
3 Algorithm / AlphaGo Zero
2 Training
Pi
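In the AlphaGo Zero paper, the π shown here comes from the root visit counts raised to 1/τ (a temperature). A minimal sketch:

```python
import numpy as np

def visit_counts_to_pi(N, tau=1.0):
    """tau = 1 keeps exploration early in the game; tau -> 0 plays greedily."""
    N = np.asarray(N, dtype=float)
    scaled = N ** (1.0 / tau)
    return scaled / scaled.sum()

print(visit_counts_to_pi([10, 40, 50]))  # -> [0.1, 0.4, 0.5]
```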
Slide 72
Slide 72 text
3 Algorithm / AlphaGo Zero
2 Training
Z-reward
Notice?
Slide 73
Slide 73 text
3 Algorithm / AlphaGo Zero
2 Training
Z-reward
+1 win
-1 loss
Slide 74
Slide 74 text
3 Algorithm / AlphaGo Zero
2 Training
Z-reward
Pi
+1.00
Slide 75
Slide 75 text
3 Algorithm / AlphaGo Zero
2 Training
Z-reward
Pi
+1.00
+0.79
Slide 76
Slide 76 text
3 Algorithm / AlphaGo Zero
2 Training
Z-reward
Pi
+1.00
+0.79
Predicted
Actual
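A sketch of the training loss that pulls Predicted toward Actual: the paper minimizes (z - v)^2 - π·log p (plus L2 regularization, omitted here); tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def zero_loss(p_logits, v, pi, z):
    """p_logits: (batch, 361) move logits; v, z: (batch, 1); pi: (batch, 361)."""
    value_loss = F.mse_loss(v, z)                              # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss
```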
Slide 77
Slide 77 text
Flashback time
Slide 78
Slide 78 text
Photo from https://medium.com/kosate/9d6e5c059e7d
Feed-Forward
+1.00
+0.79
Predicted
Actual
3 Algorithm / AlphaGo Zero
2 Training
Errors: 23.512
Slide 79
Slide 79 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Summary
+1.00
Slide 80
Slide 80 text
3 Algorithm / AlphaGo Zero
+1.00
0.12 0.58 0.79
AlphaGo Zero Summary
Slide 81
Slide 81 text
3 Algorithm / AlphaGo Zero
AlphaGo Zero Summary
Slide 82
Slide 82 text
That is everything about
AlphaGo Zero
Slide 83
Slide 83 text
Result
4
Slide 84
Slide 84 text
1 Introduction
AlphaGo Timeline
AlphaGo Fan AlphaGo Lee
Fan Hui (2-dan professional)
AlphaGo wins 5 – 0
Lee Sedol (9-dan professional)
AlphaGo wins 4 – 1
Born
Slide 85
Slide 85 text
1 Introduction
AlphaGo Timeline
AlphaGo Master
AlphaGo Zero (nearly complete)
AlphaGo Zero (complete)
60 professional players
AlphaGo wins 60 – 0
Ke Jie (world No. 1 player)
AlphaGo wins 3 – 0
Slide 86
Slide 86 text
4 Result
Better
Slide 87
Slide 87 text
4 Result
Faster
Slide 88
Slide 88 text
4 Result
Stronger
Elo ratings:
AlphaGo Zero: 5,185
AlphaGo Master: 4,858
AlphaGo Lee: 3,739
AlphaGo Fan: 3,144
Slide 89
Slide 89 text
4 Result
Stronger
AlphaGo Zero vs AlphaGo Lee: 100 – 0
AlphaGo Zero vs AlphaGo Master: 89 – 11