Elastic{ON} 2018 - Watching Over Overwatch: An Elastic Story

Elastic Co

March 01, 2018

Transcript

  1. Watching Overwatch at Activision Blizzard. Blizzard Entertainment, February 28, 2018. Chris Burkhart, Technical Lead, Principal I (@ctide); Bill Warnecke, Lead Software Engineer, Principal I (@ww)
  2. Who are we? Chris Burkhart, Technical Lead, Principal I, Battle.net – Data Team; William Warnecke, Lead Software Engineer, Principal I, Team 4 – Overwatch
  3. What This Talk Covers • Quick History • Blizzard’s Global Data Platform • Walkthrough of BEAM • Overwatch Monitoring • Future
  4. Quick History • 1991 - Founded as Silicon & Synapse • 1996 - Battle.net Classic • 2000 - Diablo II • 2004 - World of Warcraft • 2016 - Overwatch
  8. Earliest Monitoring • Host status • Physical or VM compute • Basic Hardware Utilization • CPU • Memory • Disk • OS Data • TCP Retransmit • File Descriptor Count
  9. Earliest Monitoring • Service Status • PID Monitoring • OS Exit Code • Service variables • Limited window into service internals • Can “track” variables to graph changes • Player Concurrency • Customer Service contacts • Player reports on forums
  10. Global Data Platform

  11. Global Data Platform • 28 Person Team • 14 Software Engineers, 4 System Engineers • 6 PMs, 4 Tech Leads • 6 Production Datacenters • Telem-Telem – Monitoring Pipeline in each datacenter • 7 SDKs • Events, Logs, Metrics
  12. Global Data Platform • Microservices (Node.js / Scala / Java) • Protocol Buffers only • All data is associated with a registered Schema • Telemetry Development Kit • Multiple Datastores (Elastic, HDFS, Cassandra) • 7 day TTL for Elastic, much longer for HDFS • Cassandra for specific use cases
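
Slide 12's rule that everything is Protocol Buffers tied to a registered schema means the pipeline can reject unrecognized payloads at ingest time. A minimal sketch of that check, with hypothetical schema IDs and an in-memory stand-in for the registry:

```typescript
// Illustrative only: an ingest-time schema check in the spirit of slide 12.
// Schema IDs and the in-memory "registry" below are hypothetical.

interface TelemetryEnvelope {
  schemaId: string;     // identifies a schema registered ahead of time
  producedAt: number;   // epoch milliseconds
  payload: Uint8Array;  // Protocol Buffers bytes, opaque to the pipeline
}

// Stand-in for a schema registry lookup; the real platform would call a service.
const registeredSchemas = new Set<string>([
  "overwatch.match_result.v1",
  "overwatch.disconnect.v2",
]);

function acceptForIngest(event: TelemetryEnvelope): boolean {
  if (!registeredSchemas.has(event.schemaId)) {
    console.warn(`rejecting event with unregistered schema ${event.schemaId}`);
    return false;
  }
  return true;
}

// A known schema passes; an unknown one is dropped at the edge.
console.log(acceptForIngest({ schemaId: "overwatch.disconnect.v2", producedAt: Date.now(), payload: new Uint8Array() })); // true
console.log(acceptForIngest({ schemaId: "overwatch.unknown.v9", producedAt: Date.now(), payload: new Uint8Array() }));    // false
```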
  13. GDP Ingest for Overwatch Launch Syslog Ingest Brubeck

  14. GDP Ingest for Overwatch Launch Syslog Ingest Brubeck Logstash

  15. GDP Ingest for Overwatch Launch Syslog Ingest Brubeck Kafka Logstash

  16. GDP Ingest for Overwatch Launch HTTP Ingest Kafka Syslog Ingest Brubeck Logstash Kafka
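
Slides 13–16 show the launch-era ingest path growing from syslog into an HTTP ingest edge in front of Kafka. A sketch of what such an edge can look like, assuming Node's built-in http module and the kafkajs client; the broker list, topic name, and port are placeholders rather than Blizzard's configuration:

```typescript
// Sketch of an HTTP ingest edge publishing to Kafka, assuming the kafkajs
// client; brokers, topic, and port are placeholders, not Blizzard's setup.
import { createServer } from "node:http";
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "http-ingest", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();

async function main() {
  await producer.connect();

  createServer((req, res) => {
    if (req.method !== "POST") {
      res.statusCode = 405;
      res.end();
      return;
    }
    const chunks: Buffer[] = [];
    req.on("data", (chunk) => chunks.push(chunk));
    req.on("end", async () => {
      try {
        // Hand the raw body to the ingest topic; downstream services enrich
        // it and fan it out to schema-specific topics (slides 17-22).
        await producer.send({
          topic: "telemetry-ingest",
          messages: [{ value: Buffer.concat(chunks) }],
        });
        res.statusCode = 202;
      } catch {
        res.statusCode = 503;
      }
      res.end();
    });
  }).listen(8080, () => console.log("HTTP ingest listening on :8080"));
}

main().catch((err) => { console.error(err); process.exit(1); });
```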
  17. GDP Microservices Architecture Kafka

  18. GDP Microservices Architecture Enrichment Ingest Topic Kafka

  19. GDP Microservices Architecture Enrichment Ingest Topic Schema Reg Kafka

  20. GDP Microservices Architecture Enrichment Ingest Topic Specific Topics Schema Reg Kafka
  21. GDP Microservices Architecture Enrichment Ingest Topic Specific Topics Schema Reg ES Processor Cassandra Processor HDFS Processor Kafka
  22. GDP Microservices Architecture Enrichment Ingest Topic Specific Topics Schema Reg ES Processor Cassandra Processor HDFS Processor HDFS Cassandra Elasticsearch Kafka
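
Slides 21–22 add per-datastore processors that move records from the schema-specific topics into Elasticsearch, Cassandra, and HDFS. A rough sketch of the Elasticsearch leg under stated assumptions (kafkajs 2.x, the 8.x @elastic/elasticsearch client, made-up topic and index names):

```typescript
// Rough sketch of an Elasticsearch processor: consume a schema-specific topic
// and index each record. Assumes kafkajs 2.x and the 8.x @elastic/elasticsearch
// client; topic, index, and host names are made up, and Protocol Buffers
// decoding is skipped (JSON is assumed for brevity).
import { Kafka } from "kafkajs";
import { Client } from "@elastic/elasticsearch";

const kafka = new Kafka({ clientId: "es-processor", brokers: ["kafka-1:9092"] });
const consumer = kafka.consumer({ groupId: "es-processor" });
const es = new Client({ node: "http://elasticsearch:9200" });

async function main() {
  await consumer.connect();
  await consumer.subscribe({ topics: ["overwatch.disconnect.v2"] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const doc = JSON.parse(message.value.toString());
      // Daily indices make a short retention window (the 7 day Elastic TTL
      // from slide 12) simple to enforce by deleting whole indices.
      const index = `ow-disconnect-${new Date().toISOString().slice(0, 10)}`;
      await es.index({ index, document: doc });
    },
  });
}

main().catch((err) => { console.error(err); process.exit(1); });
```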
  23. Data Consumption

  24. BEAM • Blizzard’s custom monitoring solution • Poll datasources periodically • Transform data • Check conditions • Perform actions
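
Slide 24's description of BEAM amounts to a poll, transform, check, act loop. BEAM is Blizzard-internal, but the contract it implies can be sketched in a few lines; the concurrency rule below is a made-up example:

```typescript
// A minimal sketch of the loop slide 24 describes: poll a datasource,
// transform the result, check a condition, perform an action. BEAM itself is
// internal to Blizzard; every name below is illustrative.

interface Datasource<T> { name: string; poll(): Promise<T>; }

interface Rule<T> {
  transform(raw: T): number;            // reduce raw data to a single value
  condition(value: number): boolean;    // should we act?
  action(value: number): Promise<void>; // page, email, open a ticket, ...
}

async function evaluate<T>(source: Datasource<T>, rule: Rule<T>): Promise<void> {
  const raw = await source.poll();
  const value = rule.transform(raw);
  if (rule.condition(value)) {
    await rule.action(value);
  }
}

// Hypothetical rule: alert when player concurrency drops below a threshold.
const concurrency: Datasource<{ players: number }> = {
  name: "concurrency",
  poll: async () => ({ players: Math.floor(Math.random() * 2_000_000) }), // stub
};

const lowConcurrency: Rule<{ players: number }> = {
  transform: (raw) => raw.players,
  condition: (players) => players < 100_000,
  action: async (players) => console.log(`ALERT: concurrency dropped to ${players}`),
};

// "Poll datasources periodically": evaluate the rule once a minute.
setInterval(() => evaluate(concurrency, lowConcurrency).catch(console.error), 60_000);
```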
  32. Overwatch Monitoring

  36. Incident Response - Without Data Platform? • Was the drop spread across all servers? • Did any services or hosts unexpectedly terminate? • Check for server crash emails • Compare concurrency to other Overwatch platforms • Compare concurrency to other Blizzard games • Spin up a bunch of resources to investigate if the drop was bad enough
  37. With Data Platform • Pipeline • Supports client telemetry • Data • Metrics have more associated data • Reporting • Easy to discover and pivot
  39. Disconnections By Platform By Continent By ISP
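
A breakdown like slide 39's falls out of nested terms aggregations once disconnect events carry platform, geo, and ISP fields. A hedged example using the 8.x Elasticsearch JavaScript client; the index pattern and field names are guesses, not the real Overwatch mapping:

```typescript
// Illustrative only: the slide 39 breakdown expressed as nested terms
// aggregations with the 8.x Elasticsearch JavaScript client. The index
// pattern and field names are guesses, not the real Overwatch mapping.
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://elasticsearch:9200" });

async function disconnectBreakdown() {
  const resp = await es.search({
    index: "ow-disconnect-*",
    size: 0, // aggregations only, no hits
    query: { range: { "@timestamp": { gte: "now-1h" } } },
    aggs: {
      by_platform: {
        terms: { field: "platform" },
        aggs: {
          by_continent: {
            terms: { field: "geo.continent" },
            aggs: { by_isp: { terms: { field: "network.isp" } } },
          },
        },
      },
    },
  });
  console.dir(resp.aggregations, { depth: null });
}

disconnectBreakdown().catch(console.error);
```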

  40. Operations

  41. Operationalizing Overwatch • Everyone was very excited about the potential of our data platform • Identify what is critical and focus there • Common flows like login, play a game • Critical flows like purchasing • Your instrumentation should get better over time • Define your KPIs
  42. Incident Management • 134 Major Incidents in 2017 that affected Overwatch • 78% were detected first by an Alert • 30% were recommended for review to improve monitoring • Did the alert identify the root cause? • Time to detect incident • Time taken for ops staff to validate incident
  43. Embracing Telemetry

  45. Map Load Stall

  49. Automation

  51. Future

  52. Future – BEAM • RPC Message • Autoremediation? • Autoscaling? • Rules templates • Better auditing • Stateful Alerts • Maintenance Mode
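
Of the future BEAM items, stateful alerts and maintenance mode are concrete enough to sketch: notify only on the transition into a bad state, and suppress actions during planned work. A hypothetical example, not BEAM's actual design:

```typescript
// Hypothetical sketch of two of the future items: a stateful alert fires only
// on the transition into a bad state, and maintenance mode suppresses actions
// during planned work. This is not BEAM's actual design.

type AlertState = "ok" | "firing";

class StatefulAlert {
  private state: AlertState = "ok";
  private maintenanceUntil = 0; // epoch ms; 0 means no window is open

  enterMaintenance(durationMs: number): void {
    this.maintenanceUntil = Date.now() + durationMs;
  }

  // Returns true only when a notification should actually be sent.
  evaluate(conditionMet: boolean): boolean {
    if (Date.now() < this.maintenanceUntil) return false; // suppressed
    const next: AlertState = conditionMet ? "firing" : "ok";
    const newlyFiring = next === "firing" && this.state !== "firing";
    this.state = next;
    return newlyFiring; // notify once per incident, not on every poll
  }
}

// Repeated bad polls produce a single notification.
const alert = new StatefulAlert();
console.log(alert.evaluate(true));  // true  -> notify
console.log(alert.evaluate(true));  // false -> already firing
console.log(alert.evaluate(false)); // false -> recovered
```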
  53. Future – Leveraging Elasticsearch • Cross cluster search • Multitenancy Challenges in Kibana • Hundreds of broken visualizations and dashboards • Unified data access layer / Query Engine • Presto, SparkSQL, Query Grid, Drill, Qubole?
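
Cross cluster search, the first item on slide 53, lets a single query fan out over several Elasticsearch clusters by prefixing index patterns with a remote-cluster alias. An illustrative query with the JavaScript client; the aliases and index pattern are placeholders:

```typescript
// Cross cluster search lets one query span clusters by prefixing index
// patterns with a remote-cluster alias. The aliases ("us-west", "eu-central")
// and index pattern here are placeholders for illustration.
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://search-coordinator:9200" });

async function searchAcrossRegions() {
  const resp = await es.search({
    // Remote clusters would be registered via cluster settings
    // (cluster.remote.<alias>.seeds) before these prefixes resolve.
    index: "us-west:ow-telemetry-*,eu-central:ow-telemetry-*",
    size: 0,
    aggs: { by_datacenter: { terms: { field: "datacenter" } } },
  });
  console.dir(resp.aggregations, { depth: null });
}

searchAcrossRegions().catch(console.error);
```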
  54. Thank You!

  55. Questions? Visit us at the AMA

  56. www.elastic.co

  57. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/ Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co
  58. Future Plans – Pipeline • Isolated pipelines for specific use cases • Higher guarantees, lower latencies • Still have lots of data flowing through old pipelines • Expanding esports initiatives • Self-supporting Kafka