Slide 1

Slide 1 text

Building a Near Real-Time Pipeline for All Things Blizzard
Jordan Irwin / Chris Burkhart, Technical Leads
Blizzard Entertainment • March 8th, 2017 • @jordanirwin / @ctide

Slide 2

Slide 2 text

Who are we?
• Jordan Irwin, Technical Lead, Data Team
• Chris Burkhart, Technical Lead, Data Team

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Is This Easy Mode?
• Step 1: Generate Good Data
• Step 2: Collect and Analyze
• Step 3: Profit!

Slide 5

Slide 5 text

Quest List
• Brief history of Big Data at Blizzard
  – Where it began
  – The world could use more hero…ic data
  – Glimpse into our future
• Elastic Stack: GG or OMG?
• Lessons learned
• Tidbits

Slide 6

Slide 6 text

A long long time ago…
• Protocol Buffers as IDL
• Server data only
• Publish directly to RMQ (sketch below)
• Federation galore
• Map/Reduce all the things
• Standard “data lake” approach
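For illustration, a minimal sketch of the "publish directly to RMQ" step, assuming a hypothetical queue name and payload; in the real pipeline the body would be a Protocol Buffers message serialized from classes generated out of the .proto IDL, not the stand-in bytes used here.

```python
# Minimal sketch: publish a serialized event directly to RabbitMQ.
# Queue name and payload are hypothetical, not Blizzard's actual schema.
import pika

# In practice this would come from generated protobuf code, e.g.:
#   payload = events_pb2.GameServerEvent(realm="us-east").SerializeToString()
payload = b"\x0a\x07us-east"  # stand-in for a protobuf-encoded event

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rmq.example.net")
)
channel = connection.channel()
channel.queue_declare(queue="telemetry", durable=True)

# Default exchange, queue name as routing key: the simplest direct publish.
channel.basic_publish(exchange="", routing_key="telemetry", body=payload)
connection.close()
```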

Slide 7

Slide 7 text

Back Then (diagram): Game Server → RMQ → Flume → Hadoop (Map/Reduce); Game Client → API*
* Limited client support eventually added…

Slide 8

Slide 8 text

The Good Parts
• From zero to hero: It worked!
• Data-driven decisions now possible
• Positive “data culture” formed
• Protocol Buffers well established
• Foundation for Big Data at Blizzard

Slide 9

Slide 9 text

The Bad Parts
• Schemas coordinated via email (if at all)
• Map/Reduce requires specialized expertise
• More effort spent preparing data than analyzing it
• RMQ scaling became non-trivial

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

You must construct additional pylons
Some goals…
• ~20 billion messages/day
• Schema Registry
• Collect data from anywhere
• “Free the data”

Slide 12

Slide 12 text

Road to Overwatch (diagram): Game Server and Game Client → SDK → Kafka → Hadoop (Map/Reduce) and Elasticsearch (Tribe → Kibana); schemas in a Git repo; Metrics and Logs shipped via Logstash; API access
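A rough sketch of the ingestion hop the diagram implies, using the kafka-python client; the topic name and event fields are assumptions for this example (the actual SDKs of this era were C#/C++).

```python
# Sketch of an SDK-style client publishing telemetry into Kafka.
# Topic name and message contents are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.example.net:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One event per client/server action; Kafka absorbs bursty producers
# far more gracefully than federated RMQ did.
producer.send("telemetry.game-client", {"event": "match_start", "map": "hanamura"})
producer.flush()
```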

Slide 13

Slide 13 text

(diagram): a Tribe node federating five Elasticsearch clusters

Slide 14

Slide 14 text

Immediate Winz
• Client data meant new ways to debug
  – CCU Drops tied to ISPs
  – Network Quality reports
  – Measurable customer impact
• Even better than server monitoring!
• Centralized log searching (query sketch below)
  – RIP grep
• All in near real-time!
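The "RIP grep" point in concrete terms: one query against the cluster replaces ssh-and-grep across many boxes. A hedged example with the official Python client; the index pattern and field names are assumptions.

```python
# What used to be ssh + grep across machines becomes one search.
# Index pattern and field names here are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-tribe.example.net:9200"])

result = es.search(
    index="logs-*",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"message": "connection timeout"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 20,
    },
)
for hit in result["hits"]["hits"]:
    print(hit["_source"]["message"])
```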

Slide 15

Slide 15 text

The Good Parts
• Elasticsearch + Kibana accessibility
  – Single “pane of glass”
  – Easy to use
  – Instant data
• “Free the data” worked
• Much higher scalability
• SDKs for multiple languages/platforms
  – C#, C++ (PC/Xbox/PS4)
• Offered a schema storage place
• The business LOVED IT

Slide 16

Slide 16 text

The Bad Parts
• Schemas not required and avoided
  – Not really a true registry
  – Dynamic mapping nightmares (see sketch below)
  – Converted “data lake” into “data swamp”
• Map/Reduce all-the-things still a problem
• Tribe Node instability meant frequent outages
• Metrics solution wasn’t scalable (ingest)
• Logging wasn’t sustainable (configuration)
We needed a bigger boat...
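One way to read "dynamic mapping nightmares": without an enforced schema, every producer's stray field becomes a new mapping entry until the index is unmanageable. A hedged sketch of the usual guard rail, an index template with strict dynamic mapping; the template name, pattern, and fields are invented, and this is not necessarily how Blizzard solved it.

```python
# Sketch: reject unregistered fields instead of letting dynamic mapping
# grow unbounded. Template name, pattern, and fields are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es.example.net:9200"])

es.indices.put_template(
    name="telemetry",
    body={
        "template": "telemetry-*",   # index pattern this template covers
        "mappings": {
            "event": {               # single mapping type, 2.x/5.x style
                "dynamic": "strict", # unknown fields now fail loudly at index time
                "properties": {
                    "@timestamp": {"type": "date"},
                    # "keyword" is the 5.x type; on 2.x this would be a
                    # not_analyzed string instead.
                    "game":  {"type": "keyword"},
                    "event": {"type": "keyword"},
                },
            }
        },
    },
)
```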

Slide 17

Slide 17 text

MOAR PYLONS!
Reconsidered goals…
• ~100 billion messages/day
• Schema Registry revisited and required
• Collect data from anywhere
• “Free the data” even more
• Easy to onboard
• Dogfood everything

Slide 18

Slide 18 text

Today’sh (diagram): Game Server, Game Client, Logs, and Metrics → SDK (TDK) → Kafka → Enrich → Kafka → Hadoop and Elasticsearch (Tribe* → Kibana); Schema Registry; API; …
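A rough sketch of the Enrich hop in the diagram: consume raw events from one Kafka topic, attach derived fields, and produce to an enriched topic. The topic names and the ISP lookup are hypothetical stand-ins for whatever enrichment actually runs.

```python
# Sketch of the Kafka -> Enrich -> Kafka hop. Topics and the lookup
# are hypothetical stand-ins.
import json
from kafka import KafkaConsumer, KafkaProducer

def lookup_isp(ip):
    # Stub: a real enricher would consult a GeoIP/ISP database here.
    return "unknown" if ip is None else "example-isp"

consumer = KafkaConsumer(
    "telemetry.raw",
    bootstrap_servers=["kafka.example.net:9092"],
    group_id="enrich",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["kafka.example.net:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    event = record.value
    # Example enrichment: map the client IP onto an ISP bucket so
    # downstream dashboards can slice CCU drops by provider.
    event["isp"] = lookup_isp(event.get("client_ip"))
    producer.send("telemetry.enriched", event)
```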

Slide 19

Slide 19 text

The Good Parts
• Required and robust Schema Registry
  – “What You Registered Is What You Get” (WYRWYG) (toy example below)
• Telemetry Development Kit (TDK)
• Improved and expanded SDKs
  – C#, C++ (PC/Mac/Xbox/PS4), Python, Java, NodeJS, Android, Unity, Go*
• Documentation prioritization
• Telem-Telem: Dog food is tasty
• Stable Tribe Nodes (Thanks Elastic!)
• Extensible for more features
• Less map/reduce, moar insight
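A toy illustration of the WYRWYG idea, with an invented registry structure and event (not Blizzard's actual registry API): only fields that were registered survive ingest, so what you registered is exactly what lands downstream.

```python
# Toy WYRWYG check: only registered fields survive ingest.
# Registry entry and event below are invented for illustration.
REGISTRY = {
    ("match_start", 1): {"game": str, "map": str, "players": int},
}

def conform(message_name, version, event):
    """Drop unregistered fields and type-check the rest."""
    schema = REGISTRY.get((message_name, version))
    if schema is None:
        raise ValueError(f"unregistered schema: {message_name} v{version}")
    return {
        field: event[field]
        for field, expected_type in schema.items()
        if field in event and isinstance(event[field], expected_type)
    }

# "cheat_flag" was never registered, so it never reaches storage.
print(conform("match_start", 1,
              {"game": "ow", "map": "hanamura", "players": 12, "cheat_flag": True}))
# -> {'game': 'ow', 'map': 'hanamura', 'players': 12}
```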

Slide 20

Slide 20 text

The Bad Parts
• Deprecated metrics support (for now)
• Limited logging support
• Dozens of global Elasticsearch clusters constituting a single system isn’t trivial (but still possible!)
  – Monitoring
  – Logging
  – Auditing
  – JVM GC
  – Updates…

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Future
• Upgrade to 5.x Elastic Stack (/shivers)
• Logging 4realz
• Metrics 4sho (w/rollup!)
• Custom transforms
• Subscriptions
• ODBC/JDBC
• Machine Learning
• … and much, much more!

Slide 23

Slide 23 text

The Good Parts
• Leverages existing foundation
• Low-risk updates allow major features
• Pairs tools with access patterns
• Favors extensibility

Slide 24

Slide 24 text

The Bad Parts
• NONE
• It’ll be PERFECT!

Slide 25

Slide 25 text

Think Globally
• Proven architecture
• Vetted by influential companies
• Best parts of popular pipelines
• Blizzard will be a global leader in Big Data, Soon™

Slide 26

Slide 26 text

GG Elastic
• “Free the data” contributor
• Kibana makes data accessible
• Tribe Nodes centralize data
• Aliases abstract index names (sketch below)
• Fast time to insight
• APIs allow tooling
• Shield controls access
• Communication with Elastic has been great!
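What "Aliases abstract index names" buys in practice: dashboards point at a stable alias while the backing index can be swapped atomically. A short sketch; the alias and index names are assumptions.

```python
# Sketch: swap the index behind a stable alias in one atomic call.
# Alias and index names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es.example.net:9200"])

# Kibana dashboards query "telemetry-current"; reindexing or daily
# rollover just repoints the alias, with no dashboard changes.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "telemetry-2017.03.07", "alias": "telemetry-current"}},
        {"add":    {"index": "telemetry-2017.03.08", "alias": "telemetry-current"}},
    ]
})
```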

Slide 27

Slide 27 text

OMG Elastic
• Shield can get complicated
• Kibana multi-tenancy needs love
• Tribes are great… when they work
• Logs can be spammy
• Auditing gaps (who did what?)
• Bad actors can ruin the fun

Slide 28

Slide 28 text

We were not prepared
• Take schema management seriously
• Let use-cases drive development
• Expect success
• Get data flowing ASAP

Slide 29

Slide 29 text

Data Data
• Message Rate
  – Billions/day
• Elasticsearch Storage
  – Hundreds of Terabytes
• HDFS Storage
  – Petabytes
So sorry, no real details ☹

Slide 30

Slide 30 text

Shameless “Plug”
• Using NodeJS with Kafka?
  – We open-sourced node-rdkafka
  – https://github.com/Blizzard/node-rdkafka
• We’re Hiring!
  – Know someone?
  – Am someone?
  – Java / Scala / Kafka / Hadoop / Big Data
  – http://careers.blizzard.com

Slide 31

Slide 31 text

Thanks for listening! Questions?

Slide 32

Slide 32 text

More Questions? Visit us at the AMA

Slide 33

Slide 33 text

www.elastic.co