Elastic{ON} 2018 - Watching Over Overwatch: An Elastic Story

Elastic Co

March 01, 2018

Transcript

  1. Blizzard Entertainment
    February 28, 2018
    Chris Burkhart: @ctide / Bill Warnecke: @ww
    Watching Overwatch at Activision Blizzard
    Chris Burkhart, Technical Lead, Principal I
    Bill Warnecke, Lead Software Engineer, Principal I

  2. Who are we?
    Chris Burkhart
    Technical Lead, Principal I
    Battle.net – Data Team
    William Warnecke
    Lead Software Engineer, Principal I
    Team 4 – Overwatch

  3. What This Talk Covers
    • Quick History
    • Blizzard’s Global Data Platform
    • Walkthrough of BEAM
    • Overwatch Monitoring
    • Future

  4. Quick History
    • 1991 - Founded as Silicon & Synapse
    • 1996 - Battle.net Classic
    • 2000 - Diablo II
    • 2004 - World of Warcraft
    • 2016 - Overwatch

  5. Earliest Monitoring
    • Host status
        • Physical or VM compute
    • Basic Hardware Utilization
        • CPU
        • Memory
        • Disk
    • OS Data
        • TCP Retransmit
        • File Descriptor Count

  6. Earliest Monitoring
    • Service Status
        • PID Monitoring
        • OS Exit Code
    • Service variables
        • Limited window into service internals
        • Can “track” variables to graph changes
    • Player Concurrency
    • Customer Service contacts
    • Player reports on forums

  7. Global Data Platform

  8. Global Data Platform
    • 28 Person Team
        • 14 Software Engineers, 4 System Engineers
        • 6 PMs, 4 Tech Leads
    • 6 Production Datacenters
        • Telem-Telem – Monitoring Pipeline in each datacenter
    • 7 SDKs
        • Events, Logs, Metrics

  9. Global Data Platform
    • Microservices (Node.js / Scala / Java)
    • Protocol Buffers only
        • All data is associated with a registered schema
        • Telemetry Development Kit
    • Multiple Datastores (Elastic, HDFS, Cassandra)
        • 7 day TTL for Elastic, much longer for HDFS
        • Cassandra for specific use cases

  10. GDP Ingest for Overwatch Launch
    Syslog Ingest
    Brubeck

  11. GDP Ingest for Overwatch Launch
    Syslog Ingest
    Brubeck
    Logstash

  12. GDP Ingest for Overwatch Launch
    Syslog Ingest
    Brubeck
    Kafka
    Logstash

  13. GDP Ingest for Overwatch Launch
    HTTP Ingest Kafka
    Syslog Ingest
    Brubeck
    Logstash Kafka

  14. GDP Microservices Architecture
    Kafka

  15. GDP Microservices Architecture
    Enrichment
    Ingest Topic
    Kafka

  16. GDP Microservices Architecture
    Enrichment
    Ingest Topic
    Schema Reg
    Kafka

  17. GDP Microservices Architecture
    Enrichment
    Ingest Topic
    Specific Topics
    Schema Reg
    Kafka

  18. GDP Microservices Architecture
    Enrichment
    Ingest Topic
    Specific Topics
    Schema Reg
    ES Processor
    Cassandra Processor
    HDFS Processor
    Kafka

  19. GDP Microservices Architecture
    Enrichment
    Ingest Topic
    Specific Topics
    Schema Reg
    ES Processor
    Cassandra Processor
    HDFS Processor
    HDFS
    Cassandra
    Elasticsearch
    Kafka
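
One way to read the diagram labels above is as a routing flow: messages land on an ingest topic, an enrichment step adds context, the schema registry decides which specific topic a message belongs to, and per-datastore processors consume from there. The sketch below models only the enrich-and-route step in memory. Every name in it (`SCHEMA_REGISTRY`, `enrich`, `route`, the topic and schema names, the `datacenter` field, and the `topic.unregistered` fallback) is a hypothetical placeholder, not something taken from the slides, and no real Kafka client is involved.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical registry: schema name -> specific topic for that schema.
SCHEMA_REGISTRY: Dict[str, str] = {
    "overwatch.disconnect": "topic.overwatch.disconnect",
    "overwatch.map_load": "topic.overwatch.map_load",
}


def enrich(message: dict, datacenter: str) -> dict:
    """Return a copy of the message with extra context attached."""
    return {**message, "datacenter": datacenter}


def route(ingest: List[dict], datacenter: str) -> Dict[str, List[dict]]:
    """Move messages from the ingest topic onto schema-specific topics."""
    topics: Dict[str, List[dict]] = defaultdict(list)
    for message in ingest:
        topic = SCHEMA_REGISTRY.get(message.get("schema"))
        if topic is None:
            # Assumption: data without a registered schema is set aside
            # unchanged rather than forwarded to a datastore processor.
            topics["topic.unregistered"].append(message)
            continue
        topics[topic].append(enrich(message, datacenter))
    return dict(topics)


routed = route(
    [
        {"schema": "overwatch.disconnect", "player": 1},
        {"schema": "not.registered"},
    ],
    datacenter="dc-1",
)
```

In this reading, the ES, Cassandra, and HDFS processors would each subscribe to the specific topics they care about and write to their own datastore.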

  20. Data Consumption

  21. BEAM
    • Blizzard’s custom monitoring solution
    • Poll datasources periodically
    • Transform data
    • Check conditions
    • Perform actions
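
The four steps above (poll, transform, check, act) can be sketched as a tiny rule loop. This is an illustration of the general pattern only, not BEAM's actual design or API: the `Rule` class, `run_once`, the metric names, and the 5% threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass
class Rule:
    """One hypothetical monitoring rule: poll -> transform -> check -> act."""
    name: str
    poll: Callable[[], Sample]
    transform: Callable[[Sample], Sample]
    condition: Callable[[Sample], bool]
    action: Callable[[str, Sample], None]


def run_once(rules: List[Rule]) -> List[str]:
    """Evaluate every rule once and return the names of the rules that fired."""
    fired = []
    for rule in rules:
        data = rule.transform(rule.poll())
        if rule.condition(data):
            rule.action(rule.name, data)
            fired.append(rule.name)
    return fired


alerts: List[str] = []


def fake_poll() -> Sample:
    # Stand-in for querying a real datasource.
    return {"logins": 1200.0, "login_errors": 90.0}


def error_rate(sample: Sample) -> Sample:
    return {"error_rate": sample["login_errors"] / sample["logins"]}


login_rule = Rule(
    name="login-error-rate",
    poll=fake_poll,
    transform=error_rate,
    condition=lambda d: d["error_rate"] > 0.05,  # made-up threshold
    action=lambda name, d: alerts.append(name),
)

fired = run_once([login_rule])
```

To poll periodically, a scheduler would call `run_once` on a fixed interval; that part is left out to keep the sketch short.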

  22. Overwatch Monitoring

  23. Incident Response - Without Data Platform?
    • Was the drop spread across all servers?
    • Did any services or hosts unexpectedly terminate?
    • Check for server crash emails
    • Compare concurrency to other Overwatch platforms
    • Compare concurrency to other Blizzard games
    • Spin up a bunch of resources to investigate if the drop was bad enough

  24. With Data Platform
    • Pipeline
        • Supports client telemetry
    • Data
        • Metrics have more associated data
    • Reporting
        • Easy to discover and pivot

  25. Disconnections
    By Platform
    By Continent
    By ISP
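
Breaking disconnections down by platform, continent, or ISP amounts to counting events grouped by one field at a time. A minimal in-memory sketch follows; the event fields and values are invented sample data, and a real deployment would presumably run the equivalent grouping inside the datastore rather than in application code.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by(events: Iterable[Dict[str, str]], field: str) -> Counter:
    """Count events per distinct value of `field` ("unknown" if missing)."""
    return Counter(event.get(field, "unknown") for event in events)


# Invented sample disconnect events.
events = [
    {"platform": "pc", "continent": "NA", "isp": "isp-a"},
    {"platform": "pc", "continent": "EU", "isp": "isp-b"},
    {"platform": "console", "continent": "NA", "isp": "isp-a"},
]

by_platform = count_by(events, "platform")
by_continent = count_by(events, "continent")
by_isp = count_by(events, "isp")
```

Pivoting between the three views is just a matter of changing the `field` argument, which is what makes a single spike easy to attribute to one platform, region, or provider.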

  26. Operationalizing Overwatch
    • Everyone was very excited about the potential of our data platform
    • Identify what is critical and focus there
        • Common flows like login, play a game
        • Critical flows like purchasing
    • Your instrumentation should get better over time
    • Define your KPIs

  27. Incident Management
    • 134 Major Incidents in 2017 that affected Overwatch
    • 78% were detected first by an Alert
    • 30% were recommended for review to improve monitoring
        • Did the alert identify the root cause?
        • Time to detect incident
        • Time taken for ops staff to validate incident

  28. Embracing Telemetry

  29. Map Load Stall

  30. Future – BEAM
    • RPC Message
    • Autoremediation?
    • Autoscaling?
    • Rules templates
    • Better auditing
    • Stateful Alerts
    • Maintenance Mode

  31. Future – Leveraging Elasticsearch
    • Cross cluster search
    • Multitenancy Challenges in Kibana
        • Hundreds of broken visualizations and dashboards
    • Unified data access layer / Query Engine
        • Presto, SparkSQL, Query Grid, Drill, Qubole?

  32. Questions?
    Visit us at the AMA

  33. www.elastic.co

  34. Except where otherwise noted, this work is licensed under
    http://creativecommons.org/licenses/by-nd/4.0/
    Creative Commons and the double C in a circle are
    registered trademarks of Creative Commons in the United States and other countries.
    Third party marks and brands are the property of their respective holders.
    Please attribute Elastic with a link to elastic.co

  35. Future Plans – Pipeline
    • Isolated pipelines for specific use cases
    • Higher guarantees, lower latencies
    • Still have lots of data flowing through old pipelines
    • Expanding esports initiatives
    • Self-supporting Kafka
