Notes on Deep RL, Self-Play, AlphaZero and DQN
Robin Ranjit Singh Chauhan [email protected]
Alexey Iskrov [email protected]
Vancouver Kaggle Meetup: ConnectX Competition, April 30, 2020
Reinforcement Learning
● Learning to decide and act over time
● Often (not always) online learning
Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
Environment: Markov Decision Process (MDP)
● Markov chains
  ○ States linked without history
● Actions
  ○ Choice
● Rewards
  ○ Motivation
● Variants
  ○ Bandit = MDP with a single state!
  ○ Markov chain + rewards = Markov Reward Process (MRP)
  ○ Partially observable (POMDP)
  ○ Semi-MDP
Image credit: Wikipedia
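To make the MDP pieces concrete, here is a minimal sketch of a two-state MDP as plain Python dictionaries. The state names, actions, transition probabilities and rewards are invented for illustration; only the structure (state, action → distribution over next state and reward) is the point.

```python
import random

# Hypothetical 2-state MDP: transitions[state][action] = list of
# (probability, next_state, reward) tuples.
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.5)],
           "go":   [(1.0, "s0", 0.0)]},
}

def step(state, action):
    """Sample (next_state, reward) from the dynamics. Markov property:
    the outcome depends only on the current state and action, not on history."""
    probs, outcomes = zip(*[(p, (s, r)) for p, s, r in transitions[state][action]])
    return random.choices(outcomes, weights=probs, k=1)[0]

state = "s0"
for _ in range(5):
    state, reward = step(state, "go")
    print(state, reward)
```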
Multi-Agent RL (MARL)
● Opponent-independent
  ○ No dependence on opponent strategy
● Opponent-aware
  ○ Adapts to the opponent's strategy
● Aware
  ○ Models opponent behaviour
● Cooperative
  ○ Shared vs. independent reward
● Generally harder!
Image credit: A Comprehensive Survey of Multiagent Reinforcement Learning, Busoniu et al., 2008
Self-Play: Domains
● Board games, DOTA, StarCraft, ...
What do these self-play domains mostly have in common?
● Easy to simulate
  ○ Notice none are messy real-world domains
● Many strong opponents
  ○ Software
  ○ Sometimes human
● Exploration is tractable
Not always applicable, but cool when it is
Self-Play RL
● Single agent + environment = "playing by itself", right?
  ○ No: self-play specifically means playing competitively against
    ■ Another agent (Dota)
    ■ A recent replica of itself (AlphaGo)
    ■ Itself in the mirror (AlphaZero)
● Self-play has specific and interesting properties, benefits, and challenges
Game of Go
● Ancient Chinese game
  ○ 围棋 wéiqí (pronounced "way-chee"), meaning the "surrounding game"
● Invented 3,000-4,000 years ago
  ○ Myth: the legendary Emperor Yao invented Go to enlighten his son, Dan Zhu
  ○ By 500 BC it had already become one of the "Four Accomplishments" that must be mastered by Chinese gentlemen
  ○ Professional system established 1978
Paraphrased from: British Go Association, GoBase.org, Wikipedia
Game of Go
● Associated with divination
  ○ Divination associated with agriculture
  ○ The Yellow River Diagram and the Luo Record were "magic squares"
    ■ Depicted in the same way as Go diagrams
    ■ Numbers shown not with numerals but with clusters of black and white "go" stones
● Properties of Go
  ○ Perfect information
  ○ Zero-sum
  ○ Deterministic
  ○ Very large action and state spaces
Image credit: Wikipedia
[Plot omitted; for scale: particles in the universe ≈ 10^86]
Image credit: data from Wikipedia, plotted by Robin in ggplot2
[Figure: Deep Blue evaluated ~200M positions per second]
Image credit: Demis Hassabis, Learning from First Principles, NIPS 2017; markup: Robin Chauhan
[Figure omitted]
Image credit: Moerland et al., "A0C: Alpha Zero in Continuous Action Space", https://arxiv.org/abs/1805.09613
See also: Adversarial Reasoning: Sampling-Based Search with the UCT Algorithm, Raghuram Ramanujan and Ashish Sabharwal, http://www.cs.cornell.edu/courses/cs6700/2016sp/lectures/CS6700-UCT.pdf
[Figure: AlphaGo Zero explained in one diagram]
Image credit: David Foster, https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0
AlphaZero Loss
l = (z − v)² − πᵀ log p + c‖θ‖²
  ○ z: actual game winner
  ○ π: search probabilities over moves, observed via MCTS (the training target)
  ○ v: network value head output (scalar)
  ○ p: network policy head output (vector over actions)
  ○ c: L2 regularization constant
  ○ Terms: mean squared error (value) + cross entropy (policy) + regularization
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al.
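A tiny numeric sketch of the first two loss terms (the values here are invented for illustration; the L2 term over the network weights θ is omitted):

```python
import numpy as np

z, v = 1.0, 0.6                          # actual winner, value-head output
pi = np.array([0.7, 0.2, 0.1])           # MCTS search probabilities (target)
p = np.array([0.5, 0.3, 0.2])            # policy-head output

value_loss = (z - v) ** 2                # mean squared error term
policy_loss = -np.sum(pi * np.log(p))    # cross-entropy term
print(value_loss + policy_loss)          # total, minus the c * ||theta||^2 regularizer
```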
[Figure omitted]
Credit: Surag Nair, https://web.stanford.edu/~surag/posts/alphazero.html
[Figures omitted]
From Surag Nair, https://web.stanford.edu/~surag/posts/alphazero.html
Learning to Play Othello Without Human Knowledge, Thakoor et al.
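Surag Nair's post walks through the self-play MCTS in a few dozen lines. Below is a hedged Python sketch in that spirit; the `game`/`nnet` interfaces and the constant names are assumptions, not the repo's exact API. The selection step uses the PUCT rule Q(s,a) + c_puct · P(s,a) · √(Σ_b N(s,b)) / (1 + N(s,a)).

```python
import math

# Hedged sketch of AlphaZero-style MCTS; `game` and `nnet` interfaces are assumed.
Q, N, P = {}, {}, {}   # mean action value, visit counts, network priors
C_PUCT = 1.0

def search(state, game, nnet):
    """Run one simulation from `state`; return the value from the parent's perspective."""
    if game.is_terminal(state):
        return -game.outcome(state)       # outcome is w.r.t. the player to move here

    if state not in P:                    # leaf: expand with the network
        P[state], value = nnet.predict(state)
        return -value

    # Select the action maximizing Q + U (PUCT)
    total_visits = sum(N.get((state, a), 0) for a in game.legal_actions(state))
    best, best_score = None, -float("inf")
    for a in game.legal_actions(state):
        q = Q.get((state, a), 0.0)
        u = C_PUCT * P[state][a] * math.sqrt(total_visits + 1e-8) / (1 + N.get((state, a), 0))
        if q + u > best_score:
            best, best_score = a, q + u

    value = search(game.next_state(state, best), game, nnet)

    # Back up: running mean of values seen through (state, best)
    n = N.get((state, best), 0)
    Q[(state, best)] = (n * Q.get((state, best), 0.0) + value) / (n + 1)
    N[(state, best)] = n + 1
    return -value
```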
[Figure omitted]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al., 2017
AlphaZero: Chess Output
● A chess move = 1) select a piece, 2) select a destination
● 73 planes of 8x8
  ○ Probability distribution over 73 × 8 × 8 = 4,672 possible moves
  ○ First 56 planes: "queen" moves for any piece
    ■ "a number of squares [1..7] in which the piece will be moved, along one of eight relative compass directions {N, NE, E, SE, S, SW, W, NW}"
  ○ Next 8 planes: possible knight moves for that piece
  ○ Final 9 planes: underpromotions (to knight, bishop or rook) for pawn moves or captures in the two possible diagonals
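A hedged sketch of how the 73-plane encoding can be indexed. The ordering conventions (direction-major within the queen planes, plane-major flattening) are assumptions for illustration; the paper only fixes the plane counts.

```python
# 8x8 board squares x 73 move-type planes = 4,672 policy logits, as described above.
N_SQUARES, N_PLANES = 8 * 8, 73  # 56 queen moves + 8 knight moves + 9 underpromotions

DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def queen_move_plane(direction: str, distance: int) -> int:
    """Plane index (0..55) for a 'queen-style' move of 1..7 squares
    in one of the eight relative compass directions (assumed ordering)."""
    assert 1 <= distance <= 7
    return DIRECTIONS.index(direction) * 7 + (distance - 1)

def move_index(from_square: int, plane: int) -> int:
    """Flat index into the 4,672-way policy output: the square the piece moves
    *from* picks the 8x8 position, the plane picks the move type (assumed layout)."""
    return plane * N_SQUARES + from_square

print(N_PLANES * N_SQUARES)                       # 4672
print(move_index(12, queen_move_plane("NE", 3)))  # example move
```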
[Figure omitted]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al., 2017
Extra Exploration: Dirichlet(α) Noise
● Samples sum to 1; softmax-ish
● Symmetric Dirichlet distribution
  ○ α has the same value for all elements
  ○ Symmetric is useful when "there is no prior knowledge favoring one component over another"
● α = 0.3, 0.15, 0.03 for chess, shogi and Go respectively
  ○ α = 1: uniform
  ○ α > 1: "similar" values
  ○ 0 < α < 1: sparse distributions; most components near zero, most of the mass on a few values
● Applied at the root node only
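A minimal sketch of how the root noise is mixed into the network priors. The mixing weight ε = 0.25 is the value reported in the AlphaGo Zero / AlphaZero papers; the function name is illustrative.

```python
import numpy as np

def add_root_dirichlet_noise(priors: np.ndarray, alpha: float = 0.3,
                             epsilon: float = 0.25) -> np.ndarray:
    """Mix symmetric Dirichlet noise into the root prior:
    P'(s, a) = (1 - epsilon) * P(s, a) + epsilon * eta_a,  eta ~ Dir(alpha).
    alpha = 0.3 / 0.15 / 0.03 for chess / shogi / Go; epsilon = 0.25 per the papers."""
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * priors + epsilon * noise

priors = np.array([0.5, 0.3, 0.1, 0.1])        # network policy at the root
print(add_root_dirichlet_noise(priors).sum())  # still sums to 1
```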
[Figure omitted]
Image credit: Human-level Control Through Deep Reinforcement Learning, Mnih et al., 2015
AlphaZero vs. DQN

                     AlphaZero                               DQN
Domain               Go: 361 moves;                          Atari: 16 actions;
                     seconds to minutes for thinking         milliseconds for thinking
Network              K-block ResNet                          Simpler CNN + FC
Inputs               Layers for each piece type**,           Layers with previous 4 frames
                     layer for side to move                  of the screen
Target               MCTS results                            Current reward + network
                                                             outputs (recursive!)
Loss                 MSE(value) + cross entropy(policy)      MSE(Q)
Exploration          UCT + Dirichlet noise                   Epsilon-greedy
Next move selection  MCTS(k)                                 Just take argmax Q
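The "Target" row is the key contrast: AlphaZero regresses toward MCTS search results and the game outcome, while DQN bootstraps from its own Q-estimates. A minimal sketch of the one-step DQN target (as in Mnih et al. 2015, with a target network; variable names are illustrative):

```python
import numpy as np

def dqn_td_target(reward: float, next_q_values: np.ndarray,
                  done: bool, gamma: float = 0.99) -> float:
    """One-step DQN target: r + gamma * max_a' Q_target(s', a').
    'Recursive' because the target is built from the network's own outputs."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))

# Example: immediate reward 1.0, target-network Q-estimates for the next state
print(dqn_td_target(1.0, np.array([0.2, 0.7, 0.1]), done=False))  # 1.0 + 0.99 * 0.7
```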
Applicability of the AlphaZero Algorithm
● "most directly applicable to zero-sum games of perfect information"
● Requires a model
  ○ Perfect: allows accurate simulation over many steps
  ○ Performant: allows many (~10k) simulated decisions at each step
● Value function is "bumpy"
  ○ The value estimate can change unpredictably within a small number of steps/plies
  ○ Otherwise MCTS would not be as important
    ■ The value network could completely capture the subtree value
    ■ Alternatively, the policy network could completely capture the best move
  ○ For the Go MDP, a value function alone could not capture enough detail
AlphaZero References
● AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search, Silver et al.
  ○ https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
● AlphaGo Zero: Mastering the Game of Go without Human Knowledge, Silver et al.
  ○ https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
● AlphaZero: A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play, Silver et al.
  ○ Science paper: https://science.sciencemag.org/content/sci/362/6419/1140.full.pdf
  ○ Preprint: https://arxiv.org/pdf/1712.01815.pdf
● Implementations
  ○ https://github.com/suragnair/alpha-zero-general
  ○ https://github.com/junxiaosong/AlphaZero_Gomoku
  ○ https://github.com/NeymarL/ChineseChess-AlphaZero
  ○ https://github.com/pytorch/ELF
  ○ https://github.com/topics/alphazero
Self-Play: Measuring Performance
● Measuring performance in self-play can be stranger
  ○ There is often no absolute score
  ○ Can't generally have humans in the loop
  ○ Performance is often only relative to other agents
● Special cases of competitors
  ○ AlphaGo / AlphaGo Zero: other Go programs
  ○ ConnectX: a perfect-play agent
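One standard way to turn purely relative results into a scalar rating is Elo (it is how the AlphaZero papers report strength); Elo is not named on the slide, so this is just a worked example of relative rating:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a game: score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1600, 1.0))   # the underdog wins, so it gains more than k/2 points
```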
[Diagram: a Board environment, a "Player" agent (keras-rl DQNAgent with replay memory) and an "Opponent" agent taking turns to make the next move]
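A hedged sketch of how the diagram's pieces could be wired up with keras-rl2 / tf.keras. The `ConnectXGymEnv` wrapper (a gym-style environment with the opponent baked in) is hypothetical; the keras-rl calls follow its standard DQN example.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import EpsGreedyQPolicy

env = ConnectXGymEnv(opponent="random")   # hypothetical gym-style wrapper around the Kaggle env
nb_actions = env.action_space.n           # 7 columns in ConnectX

# Simple fully-connected Q-network over the flattened 6x7 board
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

# Replay memory + epsilon-greedy exploration, as in the diagram above
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               policy=EpsGreedyQPolicy(eps=0.1),
               nb_steps_warmup=500, target_model_update=1000)
dqn.compile(Adam(learning_rate=1e-3), metrics=["mae"])
dqn.fit(env, nb_steps=100000, verbose=1)
```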
Games + Agents
● Various agent designs
  ○ With various deep learning network architectures
● Training regime
  ○ Distribution of opponents
● Evaluation
  ○ Round-robin tournaments
alpha-zero-general
https://github.com/suragnair/alpha-zero-general
● Nice implementation in Python
● Go, Connect4, TicTacToe (2D + 3D)
[Figure omitted]
Finite-time Analysis of the Multiarmed Bandit Problem, Auer et al., 2002
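For reference, the UCB1 rule from Auer et al. (2002), which UCT applies at each tree node; here $\bar{x}_j$ is the empirical mean reward of arm $j$, $n_j$ its play count, and $n$ the total number of plays so far:

```latex
\text{play the arm } j \text{ maximizing}\quad \bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}
```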
Sutskever on Self-Play / Multi-Agent Play
● Simple environments -> extremely complex strategy
● Convert compute into data
● Perfect curriculum
● Main open question
  ○ Design the self-play environment so that the result is useful for some external task
● Social life incentivizes the evolution of intelligence
● A society of agents which will have...
  ○ Language, theory of mind, negotiation, social skills, trade, economy, politics, a justice system...
  ○ All these things should happen inside a multi-agent environment
Comments from Ilya Sutskever's presentation in the MIT 6.S099 AGI class, April 2018