Upgrade to Pro — share decks privately, control downloads, hide ads and more …

OpenAI Five Dota 2 RL Agent Teardown

OpenAI Five Dota 2 RL Agent Teardown

A technical, biased, incomplete analysis of OpenAI’s awesome Five Agent design.

Robin Ranjit Singh Chauhan

August 28, 2018
Tweet

More Decks by Robin Ranjit Singh Chauhan

Other Decks in Technology

Transcript

  1. Robin Ranjit Singh Chauhan
    Pathway Intelligence Inc
    [email protected]
    https://pathwayi.com
    OpenAI Five
    DOTA 2 5v5
    RL Agent Teardown
    Kiran N. Chauhan
    [email protected]
    Biased, incomplete analysis of OpenAI’s awesome Five Agent design.

    View Slide

  2. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    ● Dota 1: 2003
    ○ Multiplayer online battle arena (MOBA) mod for the video game Warcraft III: Reign of
    Chaos (real-time strategy from Blizzard)
    ○ Valve of Kirkland WA (96) bought IP in 2009
    ● Dota 2: 2013
    ○ Played by over 10M people monthly
    ○ Highest betting of any esport in tens of M$ USD
    ● Goal
    ○ collectively destroy a large structure defended by the opposing team known as the
    "Ancient", whilst defending their own
    Defense of the Ancients 2

    View Slide

  3. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Valve’s Dota Bot API Simplifying restrictions
    ● Lua/C++ scripting of server
    ○ Directly access game engine info
    ○ Not pixel/memory level like
    ■ ALE Atari
    ■ Deepmind Lab
    ■ Karpathy’s Pong from pixels, etc
    ● Besides detailed “action level API”
    ○ also has higher level “teams” and “modes”
    APIs
    ○ See
    https://developer.valvesoftware.com/wiki/D
    ota_Bot_Scripting
    ● Mirror match of specific heroes (18 of 100+)
    ● No …
    ○ warding (vision+spying)
    ○ Roshan (most powerful neutral creep)
    ○ invisibility of units/heroes (consumables
    and relevant items)
    ○ summons (units via spells) /illusions
    (copies of heroes)
    ○ Divine Rapier, Bottle, Quelling Blade,
    Boots of Travel, Tome of Knowledge,
    Infused Raindrop (powerful items)
    ○ No Scan (minimap enemy hero detection)
    ○ 5 invulnerable couriers, no exploiting them
    by scouting or tanking
    ● Some restrictions lifted last week, expect
    more to be lifted over time...

    View Slide

  4. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Dota Complexity
    See “The problem” at https://blog.openai.com/openai-five/

    View Slide

  5. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-cov
    ers/openai-five/network-architecture.pdf
    “π(a|s)”

    View Slide

  6. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Agent Net
    Image credit: Robin Chauhan
    Based on OpenAI Five Architecture
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf
    π(a|s)
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf

    View Slide

  7. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    ● FC, FC Relu
    ● Embedding
    ● Concat, Slice
    ● MaxPool
    ● Softmax
    ● Dot, Attention
    ● No CNNs, ResNets,
    “exotic” layers
    Recall: Obs are not planar
    matrix, but discrete attributes.
    This is unlike:
    ● AlphaZero (board
    matrix),
    ● DQN atari
    (pixel/memory matrix)
    Layers
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf
    Inset Image: Robin Chauhan

    View Slide

  8. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Agents’ Nets
    Image credit: Robin Chauhan
    Based on OpenAI Five Architecture
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf
    ● Separate agent (LSTM) per hero
    ● LSTM: 1-layer, 1024-unit
    Image credit: Reinforcement Learning: An Introduction, Sutton and Barto
    Overlays: Robin Chauhan
    π
    i
    ( a
    i
    | s
    i
    )

    View Slide

  9. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    “Rapid”
    Image: OpenAI https://blog.openai.com/openai-five/
    Overlays by Robin Chauhan
    High speed
    multi-gpu /
    inter-node comms
    “Every step” ie.
    Synchronous
    Distributed
    training w/ Rapid
    + PPO:
    Compare to
    IMPALA+V-Trace,
    BA3C, Ape-X
    DQN, Gorila DQN

    View Slide

  10. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Reward Shaping
    ● Hero score
    ○ +: Experience, gold, mana, hero health, last hit, deny
    ○ -: kill enemy hero (?), dying
    ● Zero Sum
    ○ team's mean reward is subtracted from the rewards of the
    enemy team
    ■ hero_rewards[i] -= mean(enemy_rewards)
    ● Building: all heros on team
    ○ Weighted by type and health
    ● Time scaling of rewards
    ○ scale up all rewards early in the game and scale down rewards
    late in the game
    Source: OpenAI
    See https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a
    Image credit: Wikipedia
    Such
    reward

    View Slide

  11. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Actions: Selection
    LSTM -> FC ->
    Train / test
    Part of obs
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf
    Overlays: Robin Chauhan

    View Slide

  12. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    LSTM -> FC ->
    Unit observations
    (NOT thru LSTM)
    Actions: Unit attention/Focus
    Train / test
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf
    Overlays: Robin Chauhan

    View Slide

  13. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Actions: Multiple heads
    All Observations Obs about units
    “Each [action] head has semantic
    meaning, for example, the number
    of ticks to delay this action, which
    action to select, the X or Y
    coordinate of this action in a grid
    around the unit, etc. Action heads
    are computed independently.”
    - OpenAI
    Image credit: OpenAI
    https://d4mucfpksywv.cloudfront.net/research-covers/openai-five/network-architecture.pdf

    View Slide

  14. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    MARL: τ ≡ tau ≡ Team Spirit
    ● “how much each of OpenAI Five’s heroes should care about its individual
    reward … versus the average of the team’s reward”
    ○ Start focussed on self reward
    ○ end focussed on team reward
    ● τ
    ○ hero_rewards[i] =
    τ * mean(hero_rewards) + (1 - τ) * hero_rewards[i]
    ○ anneal τ from 0.2 to 0.97
    ● No comms, no central controller, no cutting edge MARL training
    ○ Independent LSTM, but any other shared weights?
    ○ Simple τ scheme -> team coordination

    View Slide

  15. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Proximal Policy Optimization
    Source: Proximal Policy Optimization Algorithms
    Schulman et al, 2017 https://arxiv.org/abs/1707.06347
    ● OpenAI’s flagship model-free RL algo
    ● fixed-length trajectory segs of N
    parallel actors
    ● Improved version of Policy Gradients
    (on-policy), TRPO w/clipped objective
    “Outperforms other online policy gradient
    methods, and overall strikes a favorable balance
    between sample complexity, simplicity, and
    wall-time.”
    - Schulman et al 2017

    View Slide

  16. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Training Performance
    Graphs: OpenAI
    See https://blog.openai.com/openai-five/
    https://blog.openai.com/more-on-dota-2/
    1v1 in 2017
    5v5 Reward shaping
    bugs=slower training
    Units: TrueSkill(TM): A
    Bayesian Skill Rating
    System by Microsoft
    4 months!

    View Slide

  17. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Training Compute
    GPUs 256
    CPU cores 128k
    Parameters 10M fp, 58MB
    Observation size 20k fp, 36.8 kB
    Action space ~1,000 valid
    Batch size 1M Obs
    Batches/Min ~60
    Play speed 330k x reatime
    1v1: 110k x realtime
    “The 256 P100 optimizers are less than $400/hr. You can rent 128000 preemptible vcpus for another $1280/hr. Toss in
    some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these
    results ran for just a weekend.” -- GCP employee on Hacker News
    Image credit: OpenAI https://blog.openai.com/ai-and-compute

    View Slide

  18. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Image credit: OpenAI https://blog.openai.com/ai-and-compute
    Five (Aug)
    Five (June)

    View Slide

  19. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    ● Reinforcement Learning
    ○ PPO => general purpose, previously existing, Model-free, on-policy, policy gradient Deep
    Reinforcement Learning
    ■ RL innovation in scaling with Rapid
    ■ Teamwork with Tau
    ● Deep learning
    ○ Specific network design tuned for inductive bias on this environment
    ■ Net design itself embodies “human knowledge” ** AlphaZero same same
    ○ “Single-layer LSTM” agent core = pure elegance
    ■ “Basic” RNN performance surprise?
    My take: Innovation, Intelligence

    View Slide

  20. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    ● OpenAI Five Formula
    ○ Hand-designed Deep Learning inductive bias ** Agent net
    ○ + Teamwork annealing ** cool
    ○ + Basic RNN ** LSTM
    ○ + General RL ** PPO
    ○ + (Well engineered) Brute Force ** Rapid => Results
    ■ Similar formula as Deepmind AlphaZero ** ResNet + innovative but very simple model-based RL
    ● RL Implications
    ○ Discouraging for Proponents of HRL, MARL-tailored algos, richer RL
    ■ Sample efficiency?
    My take: their AI Formula

    View Slide

  21. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    My take: Impact
    ● Brag rights to OpenAI for major achievement
    ○ Compare Deepmind’s AlphaGo Zero/AlphaZero
    ■ Dota has vastly larger action and obs spaces
    ● Largest scale Multi-agent mixed collaborative/competitive result so far
    ○ Compare to AlphaGo Zero/AlphaZero purely competitive self-play
    ○ Provides a potential approach for design
    ○ Influence StarCraft Agent design for DeepMind’s SC2LE?
    ● Create vs Destroy
    ○ Dota as betting platform? Game AI jobs?
    ● AGI
    ○ Not intended to directly address AGI
    ○ Useful datapoint on potential of general DRL for simulatable domain-specific intelligence

    View Slide

  22. OpenAI Dota 2 5v5 Agent Teardown
    by Robin Chauhan, Pathway Intelligence Inc
    Thank you!
    Robin Ranjit Singh Chauhan: The rest
    [email protected]
    https://github.com/pathway
    https://ca.linkedin.com/in/robinc
    https://pathway.com/aiml
    Kiran Chauhan: Memes and Dota info
    [email protected]
    Aug 5: Benchmark match
    ● Playing vs 99.95th-percentile Dota players
    12:30pm Pacific Time on August 5th in SF
    ● Streaming at https://www.twitch.tv/openai
    Aug 20–25: Dota 2 International
    ● Rogers Arena in Vancouver, Canada

    View Slide