
Data Hero: Legends of Analytics

Elastic Co
December 01, 2015

Twitter is an important social media channel for interacting with Activision’s players. In this talk, Josh and Will discuss how Activision uses the Elastic stack, natural language processing, and event detection algorithms to turn tweet streams into information that supports our games’ operations.

Josh Hemann and Will Kirwin | Elastic{ON} Tour Los Angeles | December 1, 2015

Transcript

  1. Data Hero: Legends of Analytics
     Josh Hemann, Director, Analytical Services
     Will Kirwin, Data Scientist
     Activision Analytic Services
  2. What this talk…
     … IS NOT
     • A tutorial on Natural Language Processing (NLP)
     • A tutorial on Big Data engineering
     • An attempt to be seen as ELK experts
     … IS
     • A practical overview of how we use open source tools and math to turn Twitter and Reddit data into instrumentation
     • In between you and coffee/food
  3. None
  4. Acknowledgements (doing this now because I forget to too often)
     • Affine Analytics (especially Rama Badrinath)
     • My colleagues at Activision
     • Customer Service
     • My team
     • Studios
  5. None
  6. Why not use a commercial vendor?
     • (very) Domain-specific vocabularies
     • (very) Domain-specific sentiment
     • And commercial services often use canned algorithms trained on book and movie reviews
     • Always changing
  7. Let’s start playing…

  8. None
  9. None
  10. None
  11. A quick aside…
     • Activision provides a ton of player support through call centers, social media, and even direct contact from studios
     • We also do a ton of player surveys and traditional consumer insights work
     • Social media is something everyone uses to keep a finger on the pulse of what players care about and monitor at a high level
  12. Emergent use case
     Take something that is being looked at a lot, manually, every day…
     …and make it more efficient and actionable using statistics
     → “Customer Support” use of tweet stream
  13. None
  14. http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
     • Small, unknown sample proportion
     • Bigger, unknown sample proportion
     • The universe, but costs $
  15. None
  16. None
  17. ELK as an SDK
     https://en.wikipedia.org/wiki/Model–view–controller
     Kibana | Logstash and Kibana | ElasticSearch

  18. None
  19. OK, cool, but…
     • No longer simple, Python-based batch processing
     • Harder, event-driven programming, lots of interdependencies
     • “IT” issues become the challenge, not NLP
     • When ingest stops you lose data
     • Need to monitor, test, alert
     • Tech stack, APIs, etc. constantly changing
     • ELK-ops (Java heap size, index curation)
  20. …It’s still totally doable
     • Lots of free social media data sources
     • Lots of great tools for turning text into data
     • Without tremendous effort, basic reporting and monitoring is possible
     • Use statistical techniques to characterize social data just like you would sales, customer engagement, etc.
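
The last bullet can be made concrete by treating tweet volume like any other time series. The sketch below is an illustration only, not the method used in the talk: it flags an interval as a spike when its count sits well above the mean of the preceding window, measured in standard deviations. The function name `spike_indices` and the parameters `window` and `threshold` are hypothetical.

```python
from statistics import mean, stdev


def spike_indices(counts, window=6, threshold=3.0):
    """Return indices whose count exceeds the trailing-window mean
    by more than `threshold` sample standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: a z-score is undefined, so skip this interval.
            continue
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical tweet counts per interval; the seventh value is a burst.
tweets_per_interval = [10, 12, 11, 9, 10, 11, 50, 12]
print(spike_indices(tweets_per_interval))
```

A trailing window keeps the baseline local, so slow drifts in volume are less likely to be flagged than sudden bursts.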
  21. Stuff we are struggling with…
     • Kibana 3 vs 4 (like 3, want newer ES)
     • End-user querying (it’s all there, but…)
  22. Toolbox
     • Ingest (Python)
       • requests, praw, tweepy, langid, twisted
     • Annotate (mostly Python)
       • TextBlob, Pattern (CLiPS), NLTK, Gensim, sklearn, MALLET (Java)
     • Log (Python, Java, other)
       • Logstash, statsd, twiggy (Python lib)
     • Interact
       • ElasticSearch (+ Curator, Shield, Marvel)
       • Chrome Sense plugin and Postman (write ES queries in your browser)
       • Client libs: elasticsearch-py (Python), elastisch (Clojure)
       • Kibana (of course)
     • Ops
       • bash
       • Nagios (or ES Watcher), ES Curator, ES Shield, ES Marvel
       • Docker, Jenkins, CI/CD stuff (aspirational)
  23. Gamertag Findernator3000 (how to search for misspellings)
  24. The Problem – Searching for Gamertags
     Gamertag
     • online “name” used in multiplayer video games, forums, etc…
     • unique (user ↔ gamertag is one-to-one)
     Motivation
     Given an approximate gamertag, find relevant actual gamertags.
  25. Challenges
     • (usually) deliberately misspelled
       saturnboyz, l3afknode, c0wnt_z3r0, l33kyBrainz86, …
     • (usually) many variations on each theme
       the-count-zero, count-zero, c0unt-zer0, kowntZ3r0, …
     • Lots of gamertags (millions and millions and millions…)
     How to search?
     search term: count zero
     results: [c0unt-zero, c0wnt_z3r0, …]
  26. Ideas and Prototypes
     • Nearest neighbor search using some string-edit distance
       Levenshtein, Jaro, Jaro-Winkler, …
     • c0unt should be closer to count than coRnt
       “gamercase”: analog of lowercase that maps
       (E, e, 3) → e; (O, o, 0) → o; etc…
     • Whitespace-like, CamelCase: countZero vs. count_zero vs. count zero
     • Efficient data structure(s) for NN search in an abstract metric space
       k-d tree, binary space partitioning, VP tree, index (inverted), …
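
To illustrate why “gamercase” matters for edit-distance search, here is a minimal sketch in plain Python. It is not the speakers' code: the function names are hypothetical, and the digit-to-letter pairs beyond (3 → e) and (0 → o) are assumptions inferred from the filter names that appear later in the deck (gc-4a, gc-5s, gc-1l).

```python
# Assumed gamercase mapping: lowercase, then digits that stand in for letters.
GAMERCASE = str.maketrans({"3": "e", "4": "a", "5": "s", "0": "o", "1": "l"})


def gamercase(tag):
    """Lowercase the tag and replace look-alike digits with letters."""
    return tag.lower().translate(GAMERCASE)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# On raw strings both candidates are one edit away from "count"...
print(levenshtein("c0unt", "count"), levenshtein("coRnt", "count"))
# ...but after gamercasing, "c0unt" becomes an exact match and "coRnt" does not.
print(levenshtein(gamercase("c0unt"), "count"),
      levenshtein(gamercase("coRnt"), "count"))
```

A raw edit distance cannot tell a deliberate substitution (0 for o) from an arbitrary one (R for u); normalizing first is what separates them.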
  27. Challenges
     1. Concurrency
     2. Duplicates (deduping)
     3. Availability
     4. Resiliency
     5. Scalability
     I neither want nor need to build a database from scratch.
     Elasticsearch to the rescue!
  28. (Ab)using Elasticsearch Analyzers
     Incoming Document → Analyzer → Elasticsearch Index
     Analyzer stages:
     • Char Filter: strip HTML
     • Tokenizer: split text into words
     • Filter: remove “stopwords” (the, and, …)
  29. A gamerC4m3lC4s3 Analyzer
     Incoming Gamertag → Analyzer → Elasticsearch Index
     Analyzer stages:
     • Char Filter: ###...# → NNN; ## → YR
     • Tokenizer: gamerCamelCase
     • Filter: (E, e, 3) → e; (A, a, 4) → a; …
  30. Gamertag → Analyzers (gamercase, gamerCamelCase, raw) → Index or Query
  31. Tokenizer:
     "tokenizer": {
       "C4m3lC4s3": {
         "pattern": (r"([^\p{L}\d]+)|" +                        # split on non-alphanumeric sequences
                     r"(?<=[\p{Ll}0,1,3-5,7])(?=\p{Lu})|" +       # or lower(gamer)case followed by uppercase
                     r"(?<=\p{Lu})(?=\p{Lu}[\p{Ll}0,1,3-5,7])"),  # or uppercase followed by upper + lower(gamer)case
         "type": "pattern"
       }
     }
     Filter (one of many):
     "gc-0o": {
       "pattern": "0",
       "type": "pattern_replace",
       "replacement": "o"
     },
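
To see what the three-part split pattern does, here is a rough Python approximation. It is a sketch under assumptions, not the deck's tokenizer: Python's built-in `re` module has no `\p{L}` classes, so ASCII ranges stand in for them, and the helper name `camel_tokens` is hypothetical. Splitting on zero-width matches requires Python 3.7 or later.

```python
import re

# ASCII approximation of the slide's Java regex (an assumption: the original
# uses Unicode classes \p{L}, \p{Ll}, \p{Lu}).
_SPLIT = re.compile(
    r"[^A-Za-z0-9]+"                    # runs of non-alphanumeric characters
    r"|(?<=[a-z013457])(?=[A-Z])"       # lower(gamer)case followed by uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z013457])"  # uppercase followed by upper + lower(gamer)case
)


def camel_tokens(tag):
    """Split a gamertag into word-like tokens, dropping empty pieces."""
    return [token for token in _SPLIT.split(tag) if token]


for tag in ["countZero", "count_zero", "count zero", "l33kyBrainz86"]:
    print(tag, camel_tokens(tag))
```

The digits 0, 1, 3, 4, 5 and 7 are treated as lowercase because they commonly stand in for lowercase letters, which lets a tag such as `c0untZ3r0` split at the same place as `countZero`.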
  32. Analyzer:
     "gamerCamelCase": {
       "type": "custom",
       "char_filter": [
         "html_strip",
         "digitNNN",
         "yearYR"
       ],
       "tokenizer": "C4m3lC4s3",
       "filter": [
         "lowercase",
         "gc-3e",
         "gc-4a",
         "gc-5s",
         "gc-0o",
         "gc-1l"
       ]
     }
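
The tokenizer, filter, and analyzer fragments fit together under the `analysis` section of an index's settings. The sketch below assembles them as a Python dictionary. It rests on assumptions: the slides show only `gc-0o`, so the other `gc-*` filters are defined by analogy with it, and the `digitNNN` and `yearYR` char filters are left out because their definitions do not appear in the deck. The helper names are hypothetical.

```python
import json

# Digit -> letter pairs, inferred from the filter names on the slide.
GC_PAIRS = {"3": "e", "4": "a", "5": "s", "0": "o", "1": "l"}

# Java regex from the tokenizer slide, passed through to Elasticsearch as-is.
CAMEL_PATTERN = (r"([^\p{L}\d]+)|"
                 r"(?<=[\p{Ll}0,1,3-5,7])(?=\p{Lu})|"
                 r"(?<=\p{Lu})(?=\p{Lu}[\p{Ll}0,1,3-5,7])")


def gc_filter(digit, letter):
    """One pattern_replace token filter, modeled on the slide's gc-0o."""
    return {"type": "pattern_replace", "pattern": digit, "replacement": letter}


def build_analysis():
    """Assemble tokenizer, filters, and analyzer into one settings body."""
    filters = {f"gc-{d}{l}": gc_filter(d, l) for d, l in GC_PAIRS.items()}
    return {
        "analysis": {
            "tokenizer": {
                "C4m3lC4s3": {"type": "pattern", "pattern": CAMEL_PATTERN}
            },
            "filter": filters,
            "analyzer": {
                "gamerCamelCase": {
                    "type": "custom",
                    # digitNNN / yearYR omitted: not defined in the slides.
                    "char_filter": ["html_strip"],
                    "tokenizer": "C4m3lC4s3",
                    "filter": ["lowercase"] + list(filters),
                }
            },
        }
    }


print(json.dumps({"settings": build_analysis()}, indent=2))
```

Generating the `gc-*` filters from one table keeps the analyzer's filter list and the filter definitions in sync.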
  33. Mapping:
     "properties": {
       "gamertag": {
         "type": "string",
         "fields": {
           "raw": {
             "type": "string",
             "index": "not_analyzed"
           },
           "gamercase": {
             "type": "string",
             "analyzer": "gamercase"
           },
           "gamercamel": {
             "type": "string",
             "analyzer": "gamerCamelCase"
           }
         }
       },
  34. Query Template:
     "query": {
       "bool": {
         "should": [
           {"fuzzy_like_this_field": {
             "gamertag.gamercase": {
               "like_text": "{{gamertag}}",
               "ignore_tf": "true",
               "fuzziness": 2,
               "prefix_length": 1,
               "boost": 2,
               "analyzer": "gamercase"
             }
           }},
           {"match": {
             "gamertag.raw": {
               "query": "{{gamertag}}",
               "boost": 10.0
             }
           }},
           {"fuzzy_like_this_field": {
             "gamertag.gamercamel": {
               "like_text": "{{gamertag}}",
               "ignore_tf": "true",
               "fuzziness": 2,
               "prefix_length": 1,
               "boost": 1.0,
               "analyzer": "gamerCamelCase"
             }
           }}
         ]
       }
     }
     Clauses, in order: fuzzy query on gamercase; match query on the raw tag; fuzzy query on camelcase.
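
The `fuzzy_like_this_field` query used in this template was removed in Elasticsearch 2.0. The sketch below shows how a similar three-clause query might be built with `match` plus `fuzziness` instead. This is a substitution, not the speakers' query: it assumes the same field names and boosts, drops `ignore_tf` (which has no direct `match` counterpart), and its scoring will differ. The function name `build_gamertag_query` is hypothetical.

```python
import json


def fuzzy_match(field, text, analyzer, boost):
    """A match clause with fuzzy matching, standing in for fuzzy_like_this_field."""
    return {"match": {field: {
        "query": text,
        "fuzziness": 2,
        "prefix_length": 1,
        "boost": boost,
        "analyzer": analyzer,
    }}}


def build_gamertag_query(gamertag):
    """Fill in the query template for one search term."""
    return {"query": {"bool": {"should": [
        fuzzy_match("gamertag.gamercase", gamertag, "gamercase", 2),
        # An exact hit on the raw tag gets the largest boost.
        {"match": {"gamertag.raw": {"query": gamertag, "boost": 10.0}}},
        fuzzy_match("gamertag.gamercamel", gamertag, "gamerCamelCase", 1.0),
    ]}}}


print(json.dumps(build_gamertag_query("count zero"), indent=2))
```

Because all three clauses sit under `should`, a document can match any of them, and the boosts rank exact raw matches above fuzzy ones.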
  35. None
  36. None
  37. None
  38. Thanks! josh.hemann@activision.com @yellowspur