Data Hero: Legends of Analytics at Activision

Elastic Co
February 18, 2016

Activision knows that Twitter is a critically important channel for interacting with its players. Learn how the team uses the Elastic Stack, natural language processing, and event detection algorithms to turn tweet streams into information that supports the operations of its games.

Transcript

  1. 1 Josh Hemann, Director - Analytic Services Will Kirwin, Data

    Scientist February 18, 2016 Data Hero: Legends of Analytics @ Activision
  2. 3 What this talk… • …IS NOT • A tutorial

    on Natural Language Processing • A tutorial on Big Data engineering • An attempt to be seen as Elastic Stack experts • …IS • A practical overview of how we use the Elastic Stack, some other open source tools, and math, to turn streams of data into information • In between you and your lunch
  3. 4 Today’s Two Tales

    • Pathos (me): Players talk about our games a lot on Twitter… so we built a tool on the Elastic Stack to turn a stream of tweets into information
    • gamertag-findernator (Will): We need “Did you mean…?” for many tens of millions of gamertags… so we built a tool on top of ES
  4. 5 Acknowledgements (doing this now because I forget to too often)

    • Our colleagues at Activision • Customer Service • My team • The studios • Affine Analytics http://activisiongamescience.github.io/
  5. 7

  6. 8

  7. 10 A Quick Aside… • Activision provides a ton of

    player support through call centers, social media, and even direct contact from studios • We also do a ton of player surveys and traditional consumer insights work • Social media is something everyone uses to keep a finger on the pulse of what players care about and monitor at a high level GOAL: Take something people are already doing and make it a more efficient, accurate process using analytics (and some fun tech)
  8. 11

  9. 12 Three Twitter Sources

    • Search API: programmatically get the same results you’d get by visiting https://twitter.com/search-home; limited results
    • Streaming API: get tweets via push rather than pull; limited results (up to 50%?), but much more extensive than via Search API
    • Firehose: same as Streaming API but guaranteed 100% of tweets based on selection criteria; licensing for application re-use
  10. 13 Three styles of persuasion • Emotion (Pathos) • Logic

    (Logos) • Authority (Ethos) https://en.wikipedia.org/wiki/Modes_of_persuasion
  11. 14

  12. 16 Querying challenges… Lots of people care about the data,

    but not everyone is a coder. Facets have been deprecated in favor of the aggregations API.
  13. 18

  14. 19 Goal: get some stat by time bin Even if

    you are a coder, stuff like this can be tricky… https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html
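As a rough illustration of "some stat by time bin", the sketch below builds a search body with a `date_histogram` aggregation and one metric sub-aggregation per bucket. The field names `created_at` and `sentiment`, the aggregation names, and the helper `stat_by_time_bin` are hypothetical and do not come from the talk. The `interval` parameter matches Elasticsearch versions of that era; newer releases use `calendar_interval` or `fixed_interval` instead.

```python
import json


def stat_by_time_bin(time_field, value_field, interval="1h", stat="avg"):
    """Build a search body that computes one statistic per time bucket."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "by_time": {
                # Outer aggregation: bucket documents by time.
                "date_histogram": {"field": time_field, "interval": interval},
                # Inner aggregation: one metric computed inside each bucket.
                "aggs": {"bin_stat": {stat: {"field": value_field}}},
            }
        },
    }


# Hypothetical field names, used only for illustration.
body = stat_by_time_bin("created_at", "sentiment")
print(json.dumps(body, indent=2))
```

Wrapping the body in a small helper like this is one way to let non-coders ask for "average X per hour" without hand-writing nested JSON.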
  15. 21

  16. 22 What we’ve done

    • Annotated tweet stream using NLP
    • Derived new stream for spike detection
    • Exposed multiple UIs: Kibana, SQL, client libs
    • Migrated from older stack (Kibana 3) to current Elastic Stack
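The deck does not say which spike detection algorithm is used. Purely as an illustration of the idea, the sketch below flags a time bin when its tweet count sits far above the trailing mean (a simple z-score rule). The function name, window size, threshold, and sample counts are all assumptions, not the speakers' method.

```python
from statistics import mean, stdev


def find_spikes(counts, window=6, threshold=3.0):
    """Return indices whose count exceeds the trailing mean by more than
    `threshold` trailing standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Guard against a perfectly flat history (standard deviation of zero).
        if sigma == 0:
            sigma = 1.0
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Made-up tweets-per-bin counts with one obvious burst at index 7.
tweets_per_bin = [10, 12, 11, 9, 10, 11, 10, 48, 12, 11]
print(find_spikes(tweets_per_bin))  # [7]
```

A rule like this is cheap to run over per-bin counts pulled from a time-bucketed aggregation, which is one reason to derive a separate, lighter stream for detection.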
  17. 23 What we’ll try

    • Move NLP from Python-based pre-processing into Logstash and/or ES analyzers
    • Use Watcher to do alerting on spikes (keep algorithm, but improve alerting management)
    • Timelion
    • Beats
  18. 24 Our Evolving Toolbox

    • Ingest (Python): requests, praw, tweepy, langid, twisted
    • Annotate (mostly Python): TextBlob, Pattern (CLiPS), NLTK, Gensim, sklearn, pyLDAvis, MALLET (Java)
    • Log (Python, Java, other): Logstash, statsd, twiggy
    • Interact: Elasticsearch; Chrome Sense plugin and Postman (write ES queries in your browser); client libs: elasticsearch-py (Python), elastisch (Clojure); Kibana (of course)
    • Ops: bash; Nagios (or ES Watcher); ES plugins/add-ons: Curator, Shield, Marvel; Docker, Jenkins, CI/CD stuff (aspirational)
  19. 26 How fun is smashing n00bs on Xbox live if

    you don't have a unique Gamertag for them to remember and fear? WikiHow: How to Choose a Good Xbox Gamertag
  20. 27 I had one of my friends change their gamertag

    to match mine almost exactly. Only one letter was changed from mine… A dude on the internet…
  21. 29 The Problem

    • Words in standard spellchecker: 120K
    • Allowed characters for PS4 gamertags: 64
    • Unique gamertags: 100M? (I actually have no idea)
  22. 30 Gamertag Analysis: where we have to search

    • Recognizable Names • Everything Else • We don’t even know what we’re looking for.
  23. 37 Gamercase: like lowercase, but better

    • (E, e, 3) → e
    • (O, o, 0) → o
    • …
    • count, c0unt, kownt, …, coRnt
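A minimal sketch of the gamercase idea in plain Python: lowercase the tag, then fold look-alike digits into letters. The digit-to-letter pairs are the ones implied by the `gc-3e`, `gc-4a`, `gc-5s`, `gc-0o`, and `gc-1l` filter names that appear later in the deck; the function name `gamercase` and the use of `str.translate` are illustrative choices, not the talk's implementation (which lives in Elasticsearch filters).

```python
# Digit-to-letter folds implied by the gc-* filter names in the deck.
LEET_TO_LETTER = str.maketrans({"3": "e", "4": "a", "5": "s", "0": "o", "1": "l"})


def gamercase(tag):
    """Lowercase a gamertag and replace look-alike digits with letters."""
    return tag.lower().translate(LEET_TO_LETTER)


print(gamercase("C0unt"))      # count
print(gamercase("C4m3lC4s3"))  # camelcase
```

After this normalization, `count` and `c0unt` collapse to the same string, so only genuinely different spellings such as `kownt` still need fuzzy matching.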
  24. 38 Nearest Neighbor w/ some string-edit distance

    • Jaro • Jaro-Winkler • Levenshtein • … Source: Wikipedia/Cluster_analysis
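To make "nearest neighbor with a string-edit distance" concrete, here is a small sketch using Levenshtein distance, one of the measures named on the slide. It is the textbook dynamic-programming version with a brute-force scan over candidates; the helper names and the sample tags are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def nearest(query, candidates):
    """Brute-force nearest neighbor: scan every candidate."""
    return min(candidates, key=lambda c: levenshtein(query, c))


print(levenshtein("count", "c0unt"))                 # 1
print(levenshtein("count", "kownt"))                 # 2
print(nearest("c0unt", ["kownt", "count", "mount"]))  # count
```

The linear scan is fine for a handful of tags but not for tens of millions, which is what motivates the indexed data structures on the next slide.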
  25. 39 Efficient Data Structures for NN search in an abstract metric space

    • k-d tree, VP tree, index (inverted), … Source: Steve Hanov’s Blog, http://stevehanov.ca/
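Of the structures listed, a vantage-point (VP) tree is the easiest to sketch for strings, since it only needs a distance function that obeys the triangle inequality. The code below is a compact illustration, not the talk's implementation (which ends up using an Elasticsearch inverted index); all names and the sample tags are assumptions.

```python
import random


def levenshtein(a, b):
    """Edit distance; it is a true metric, which a VP tree requires."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside


def build_vp_tree(points, dist=levenshtein):
    """Pick a random vantage point, then split the rest at the median distance."""
    points = list(points)
    if not points:
        return None
    vantage = points.pop(random.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0, None, None)
    distances = [dist(vantage, p) for p in points]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d < radius]
    outside = [p for p, d in zip(points, distances) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def search(node, query, max_dist, dist=levenshtein):
    """Return every stored point within max_dist of query."""
    if node is None:
        return []
    d = dist(query, node.point)
    found = [node.point] if d <= max_dist else []
    # Triangle inequality: a subtree is skipped when none of its points can
    # lie within max_dist of the query.
    if d - max_dist < node.radius:
        found += search(node.inside, query, max_dist, dist)
    if d + max_dist >= node.radius:
        found += search(node.outside, query, max_dist, dist)
    return found


tags = ["count", "c0unt", "kownt", "coRnt", "mountain", "xXsniperXx"]
tree = build_vp_tree(tags)
print(sorted(search(tree, "count", 1)))  # ['c0unt', 'coRnt', 'count']
```

The pruning tests are what save work: whole subtrees are skipped without computing a single edit distance inside them.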
  26. 40 Implementation Challenges: how to build an actual gamertag search

    • Concurrency • Duplicates • Availability • Resiliency • Ease
  27. 41 Bad programmers worry about the code. Good programmers worry

    about data structures and their relationships. Linus Torvalds
  28. 44 Elasticsearch Analyzer: analyzer = 3-step process

    • Incoming document
    • Character filter: strip HTML
    • Tokenizer: split into words
    • Token filter: remove stopwords (“the”, “and”, …)
    • Elasticsearch index
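The three analyzer stages can be mimicked in a few lines of plain Python. This is only a toy model of the pipeline shape (character filter, then tokenizer, then token filter); the regexes, the two-word stopword list, and the function names are simplifying assumptions and do not reproduce Elasticsearch's actual behavior.

```python
import re

STOPWORDS = {"the", "and"}


def char_filter(text):
    """Stage 1: operate on raw characters, here a crude HTML tag stripper."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Stage 2: split the character stream into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Stage 3: transform or drop individual tokens, here stopword removal."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def analyze(text):
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<b>The</b> quick and the dead"))  # ['quick', 'dead']
```

Because each stage has a narrow job, any one of them can be swapped out independently.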
  29. 45 GamerC4m3lC4s3 Analyzer: repurpose every step

    • Incoming gamertag
    • Character filter: ###...# → NNN, ## → YR
    • Tokenizer: C4m3lC4s3 → C4m3l, C4s3
    • Token filter: (E, e, 3) → e, (A, a, 4) → a, etc…
    • Elasticsearch index
  30. 46 Three Analyzers

    • gamercase: digit/decorator preprocess; whitespace-like tokenizer; gamercase
    • gamerCamelCase: digit/decorator preprocess; camelcase-like tokenizer; gamercase
    • raw: raw gamertags (exact hits are important)
  31. 47 Filter … one of many

    "gc-0o": { "pattern": "0", "type": "pattern_replace", "replacement": "o" }
  32. 48 Tokenizer gamerC4m3lC4se: fun with regex

    "tokenizer": {
      "C4m3lC4s3": {
        "type": "pattern",
        "pattern": ("([^\p{L}\d]+)|" +
                    "(?<=[\p{Ll}0,1,3-5,7])(?=\p{Lu})|" +
                    "(?<=\p{Lu})(?=\p{Lu}[\p{Ll}0,1,3-5,7])")
      }
    }
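The splitting behavior of this pattern can be tried outside Elasticsearch. The sketch below is an ASCII-only approximation for Python's `re` module, which does not support `\p{…}` classes: `[A-Za-z]`, `[a-z]`, and `[A-Z]` stand in for `\p{L}`, `\p{Ll}`, and `\p{Lu}`. It also assumes the digit class is meant to be the leet digits 0, 1, 3, 4, 5, 7 without literal commas, and drops the capturing group so separators are not returned. The function name is hypothetical.

```python
import re

# Split on: runs of non-alphanumerics, a lower/leet-digit to upper boundary,
# or an upper to (upper followed by lower/leet-digit) boundary.
CAMEL_SPLIT = re.compile(
    r"[^A-Za-z\d]+"
    r"|(?<=[a-z013457])(?=[A-Z])"
    r"|(?<=[A-Z])(?=[A-Z][a-z013457])"
)


def camel_tokens(tag):
    """Split a gamertag into camel-case-like pieces (requires Python 3.7+,
    where re.split honors zero-width matches)."""
    return [piece for piece in CAMEL_SPLIT.split(tag) if piece]


print(camel_tokens("C4m3lC4s3"))   # ['C4m3l', 'C4s3']
print(camel_tokens("dark_lord"))   # ['dark', 'lord']
print(camel_tokens("HTMLParser"))  # ['HTML', 'Parser']
```

Treating leet digits as lowercase letters in the lookarounds is what lets `C4m3lC4s3` split at the second capital rather than staying in one piece.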
  33. 49 gamerCamelCase Analyzer: put it all together

    "gamerCamelCase": {
      "type": "custom",
      "char_filter": ["html_strip", "digitNNN", "yearYR"],
      "tokenizer": "C4m3lC4s3",
      "filter": ["lowercase", "gc-3e", "gc-4a", "gc-5s", "gc-0o", "gc-1l"]
    }
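Only the `gc-0o` filter definition is shown in the deck. Assuming the other `gc-*` filters listed in the analyzer follow the same `pattern_replace` shape, with the name encoding the digit and its replacement letter, their definitions could be generated rather than hand-written. The helper below is a hypothetical sketch; the `digitNNN` and `yearYR` character filters are not defined in the deck and are left out.

```python
import json

# (digit, letter) pairs read off the filter names gc-3e, gc-4a, gc-5s, gc-0o, gc-1l.
LEET_PAIRS = [("3", "e"), ("4", "a"), ("5", "s"), ("0", "o"), ("1", "l")]


def gamercase_filters(pairs=LEET_PAIRS):
    """Build one pattern_replace token filter definition per leet pair."""
    return {
        "gc-{}{}".format(digit, letter): {
            "type": "pattern_replace",
            "pattern": digit,
            "replacement": letter,
        }
        for digit, letter in pairs
    }


filters = gamercase_filters()
print(json.dumps(filters["gc-0o"], sort_keys=True))
# {"pattern": "0", "replacement": "o", "type": "pattern_replace"}
```

The resulting dictionary would sit under the `filter` section of the index analysis settings, next to the tokenizer and analyzer definitions.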
  34. 50 Mapping: use several analyzers

    "properties": {
      "gamertag": {
        "type": "string",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed" },
          "gamercase": { "type": "string", "analyzer": "gamercase" },
          "gamercamel": { "type": "string", "analyzer": "gamerCamelCase" }
        }
      },
      …
    }
  35. 51 Query Template

    "query": { "bool": { "should": [
      {"fuzzy_like_this_field": { "gamertag.gamercase": {
        "like_text": "{{gamertag}}", "ignore_tf": "true", "fuzziness": 2,
        "prefix_length": 1, "boost": 2, "analyzer": "gamercase" } }},
      {"match": { "gamertag.raw": {
        "query": "{{gamertag}}", "boost": 10.0 } }},
      {"fuzzy_like_this_field": { "gamertag.gamercamel": {
        "like_text": "{{gamertag}}", "ignore_tf": "true", "fuzziness": 2,
        "prefix_length": 1, "boost": 1.0, "analyzer": "gamerCamelCase" } }}
    ]}}

    Fuzzy query: gamercase • Match query: raw tag • Fuzzy query: camelcase
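The `fuzzy_like_this_field` query was removed in Elasticsearch 2.0, so the template above only runs on older clusters. As a rough, not score-equivalent substitute, the sketch below builds the same three-clause `bool`/`should` shape using `match` queries with `fuzziness` and `prefix_length`. The helper name is hypothetical, `ignore_tf` has no counterpart here, and whether this ranks gamertags as well as the original is untested.

```python
import json


def gamertag_query(gamertag):
    """Build a bool/should query over the three gamertag sub-fields."""

    def fuzzy_match(field, analyzer, boost):
        # A fuzzy match query standing in for fuzzy_like_this_field.
        return {"match": {field: {
            "query": gamertag,
            "fuzziness": 2,
            "prefix_length": 1,
            "boost": boost,
            "analyzer": analyzer,
        }}}

    return {"query": {"bool": {"should": [
        fuzzy_match("gamertag.gamercase", "gamercase", 2.0),
        # Exact hits on the raw tag get the largest boost, as in the deck.
        {"match": {"gamertag.raw": {"query": gamertag, "boost": 10.0}}},
        fuzzy_match("gamertag.gamercamel", "gamerCamelCase", 1.0),
    ]}}}


print(json.dumps(gamertag_query("C4m3lC4s3"), indent=2))
```

Keeping the boosts in the same order (raw, then gamercase, then camel case) preserves the deck's intent that exact hits outrank fuzzy ones.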
  36. 52

  37. 53

  38. 54