Upgrade to Pro — share decks privately, control downloads, hide ads and more …

MZ Games: The Evolution of Log Insight

Dd9d954997353b37b4c2684f478192d3?s=47 Elastic Co
March 07, 2017

MZ Games: The Evolution of Log Insight

Machine Zone is the company behind massively-multiplayer online games Game of War, Fire Age, and Mobile Strike. Real-time operations is essential for a great user experience. The MZ engineering team shares how they use the Elastic Stack to track player activity to support customer service and more.

Eric Liaw l Director of Engineering l Machine Zone
Reggie Meisler l Developer l Machine Zone

Dd9d954997353b37b4c2684f478192d3?s=128

Elastic Co

March 07, 2017
Tweet

Transcript

  1. MZ March 7th 2017 eliaw@mz.com, rmeisler@mz.com Evolution of log insight

    @ mz games Eric Liaw, Director of Game Engineering Reggie Meisler, Software Engineer
  2. Agenda 2 1 Who is MZ? 2 “Application log insight

    w/ Kibana” by Eric Liaw 3 “Player ledger for customer service” by Reggie Meisler 4 Learnings & Best Practices 5 Questions
  3. Who is MZ?

  4. Who is MZ?

  5. Who is MZ? (MMO) Massively Multiplayer Online (RTS) Real Time

    Strategy
  6. Eric Liaw Application log insight w/ Kibana

  7. 2014 - pssh is all we need! # Finding a

    specific Error message instance $ pssh -h first50.txt -o logs grep ‘Unable to create instance’ /path/to/logs/application.log # Finding all logs for user_id ‘44273942’ $ pssh -h all_hosts.txt -o user_logs xzgrep ‘44273942’ /path/to/logs/application.log.gz
  8. pssh - Is it all we need? Couldn’t aggregate without

    writing code Couldn’t easily share the results Couldn’t explore the data Slow! (Not real time!)
  9. Along came ELK

  10. Elasticsearch/Logstash/Kibana a.k.a Elastic Stack Used by both engineers and non-engineers

    4+ Billion logs per day 100+ Dashboards created Used in 100% of incident response
  11. How to maximize insight from logs 1. Automatic Fields (5

    examples) 2. Case Specific Fields (2 examples)
  12. Automatic Fields (1/5): log_level What: 10 (Debug) - minutia (Not

    indexed) 20 (Info) - interesting information 30 (Warning) - unexpected, but recoverable 40 (Error) - incorrect state 50 (Critical) - catastrophic, unrecoverable events Sample Queries log_level: 50 log_level: [40 TO 50] NOT log_level: 20
  13. Automatic Fields (2/5): Controller/Action What: Public facing controller/action this log

    was triggered by Why: How does this impact the customer? Sample queries: controller: login AND action: register log_level: 50 AND controller: purchase
  14. Automatic Fields (3/5): hostname What: hostname that generated the log

    Why: Allows us to find bad hosts quickly Sample Queries: hostname: web-1-046 hostname: worker* hostname: web-1-0*
  15. Automatic Fields (4/5): fileline What: Concatenation of the filename &

    line number that generated the log Example: User.php:1337 Why: Logically groups logs by source Avoids searching for specific log terminology Bonus: Helps with log clean up
  16. Automatic Fields (5/5): request_id What: Generated uuid for a single

    controller/action instance Why: Get all logs from a specific API call Useful when backtracking from an error log state to see what lead to it
  17. Automatic Fields Review log_level: 50 AND controller: login AND action:

    register AND host: web* AND fileline: User.php:1337 AND NOT request_id: 6168f35d5a4ec33
  18. Case Specific Fields (1/2): user identification user_id What: Unique numerical

    identifier for a user Why: Track all logs generated by a particular player admin_user What: Email of a user using our administration tool Why: Track all logs generated by a particular CS agent or game operations manager
  19. Case Specific Fields(2/2): DB endpoints What: Data_source_name concatenates the hostname

    and database name for a MySQL log entry Why: Pinpoint application perceived issues on db endpoints
  20. Reggie Meisler Player Ledger for Customer Service

  21. Customer Service for Games Tracking Player History Using Elastic Stack

    Outline
  22. Customer Service for Games • Player creates CS ticket ◦“I

    should have gotten X, but did not!” •CS opens admintool, looks up user history ◦True ▪Award them X ◦False ▪“Sorry, here’s why you did not get X”
  23. Customer Service for Games Game Servers Desk CS Ticket Player

    complaint
  24. Customer Service for Games Game Servers Desk Admintool CS sees

    ticket Looks up history
  25. Customer Service for Games Game Servers Desk SQL Admintool Award

    user via game api Access to game dbs How does this piece work? Looks up history
  26. Tracking Player History • Leverages existing ORM ◦Define a model,

    including declarative metadata ◦Construct during player action, set fields, persist ◦Query models from admintool •Player state and history are first class data ◦Stored in SQL and cached in Redis ◦Accessible to runtime
  27. Tracking Player History class CombatReport extends Model { public $timestamp

    = [Type::timeuuid, INDEXKEY]; public $att_user_id = [Type::int64, INDEXKEY]; public $def_user_id = [Type::int64, INDEXKEY]; public $def_hp = [Type::int32]; public $def_starting_hp = [Type::int32]; // … } Describes SQL table schema Backed by SQL/Redis
  28. Tracking Player History public function performCombat() { $starting_hp = $defender->hp;

    // ...mutate data based on combat... $summary = new CombatReport(); $summary->timestamp = time(); $summary->att_user_id = $attacker->user_id; $summary->def_user_id = $defender->user_id; $summary->def_hp = $defender->hp; $summary->def_starting_hp = $starting_hp; $summary->save(); } Capture state and persist
  29. Tracking Player History public function getCombatReports($session, $from, $to) { $summaries

    = CombatReport::GetWhere([ “att_user_id” => $session->user_id, [“timestamp > %i”, $from], [“timestamp < %i”, $to] ]); // Return to admintool client for display return $summaries; } Query reports for display in CS admintool
  30. Tracking Player History - Challenges • Difficult to add new

    fields ◦Schema migration for every little change ◦Requires DBA oversight • Difficult to scale ◦Data grows rapidly for transactional data ◦Need to constantly add more DB shards 50,000 new records every second!
  31. Using Elastic Stack • Pros ◦High-throughput pipeline already set up

    for logs ◦Easy to add shards and fields ◦Rolling time series data •Cons ◦Impractical to perform updates ◦Lack of internal experts
  32. Using Elastic Stack class CombatReport extends ElasticsearchModel { public $timestamp

    = [Type::timeuuid, INDEXKEY]; public $att_user_id = [Type::int64, INDEXKEY]; public $def_user_id = [Type::int64, INDEXKEY]; public $def_hp = [Type::int32]; public $def_starting_hp = [Type::int32]; // … } Describes type mapping Backed by Elasticsearch
  33. curl -XPUT “$url/_template/$template” -d ‘{ “mappings”: { “type”: { “properties”:

    { “@timestamp”: {“type”: “date”}, “CombatReport”: { “properties”: { “att_user_id”: {“type”: “long”}, “def_user_id”: {“type”: “long”}, “def_hp”: {“type”: “int”, “index”: “no”}, “def_starting_hp”: {“type”: “int”, “index”:“no”}, ...more fields... } ... Using Elastic Stack
  34. Using Elastic Stack public function performCombat() { $starting_hp = $defender->hp;

    // ...mutate data based on combat... $summary = new CombatReport(); $summary->timestamp = time(); $summary->att_user_id = $attacker->user_id; $summary->def_user_id = $defender->user_id; $summary->def_hp = $defender->hp; $summary->def_starting_hp = $starting_hp; $summary->save(); } Same ORM api, no change necessary
  35. Using Elastic Stack public function getCombatReports($session, $from, $to) { $summaries

    = CombatReport::GetWhere([ “att_user_id” => $session->user_id, [“timestamp > %i”, $from], [“timestamp < %i”, $to] ]); // Return to admintool client for display return $summaries; } Equivalent to filter query
  36. Using Elastic Stack public function getCombatReports($session, $from, $to) { $summaries

    = CombatReport::GetWhere([ “att_user_id” => $session->user_id, [“timestamp > %i”, $from], [“timestamp < %i”, $to] ]); // Return to admintool client for display return $summaries; } “term”: { “att_user_id”:$user_id }
  37. Using Elastic Stack public function getCombatReports($session, $from, $to) { $summaries

    = CombatReport::GetWhere([ “att_user_id” => $session->user_id, [“timestamp > %i”, $from], [“timestamp < %i”, $to] ]); // Return to admintool client for display return $summaries; } “range”: { “@timestamp”:{“gt”:$date} }
  38. Using Elastic Stack public function getCombatReports($session, $from, $to) { $summaries

    = CombatReport::GetWhere([ “att_user_id” => $session->user_id, [“timestamp > %i”, $from], [“timestamp < %i”, $to] ]); // Return to admintool client for display return $summaries; } “range”: { “@timestamp”:{“lt”:$date} }
  39. Using Elastic Stack Game Servers SQL Admintool

  40. Using Elastic Stack Game Servers Elasticsearch ? Admintool

  41. Using Elastic Stack Game Servers Elasticsearch Admintool

  42. Using Elastic Stack Game Servers Logstash Log File Kafka Elasticsearch

    Storm Kafka Consumer Admintool
  43. •Ran some tests against a dev cluster ◦Determined we were

    bottlenecked on JVM memory ◦Hosts could fit 4x the JVM memory cap ◦Created 3 isolated clusters per ELS host •Initially overwhelmed our main cluster ◦> 50,000 index/sec (25,000 per game) ◦Game logs emit large indexed docs (30+ fields) Using Elastic Stack 1. Game of War 2. Mobile Strike 3. Player Ledger Number of Indexed Fields Index Rate
  44. •Elastic Stack was a great fit for this data ◦Easy

    to add fields ◦Easy to scale as throughput increased •The largest hurdle we faced was knowledge ◦Elastic.co has very good documentation ◦Company training was helpful Using Elastic Stack - Conclusion
  45. Learnings & Best Practices • Index only what is necessary

    ◦Indexed fields are determining factor for write speed •Loosen type restrictions ◦Less fragile to use largest type container •Improve visibility on issues with pipeline ◦Dump all error logs to Elastic Stack + ElastAlert
  46. Learnings & Best Practices •Reduce queries to “term” and “range”

    ◦Avoids overhead of scoring •Make pipeline idempotent ◦Use custom id and perform batch “create” • Some CS data should still go into SQL ◦Payments & permanent state changes
  47. Questions? Eric Liaw: eliaw@mz.com Reggie Meisler: rmeisler@mz.com We’re hiring!