Realtime Systems for Social Data Analysis (RICON East 2013)

Presentation delivered by Hilary Mason at RICON East 2013.

It's one thing to have a lot of data, and another to make it useful. This talk explores the interplay between infrastructure, algorithms, and data necessary to design robust systems that produce useful and measurable insights for realtime data products. We'll walk through several examples and discuss the design metaphors that bitly uses to rapidly develop these kinds of systems.

About Hilary

Hilary is the Chief Scientist at bitly, the URL-shortening and bookmarking service, where she makes beautiful things with data. She is a former computer science professor with a background in machine learning and data mining. A native New Yorker, Hilary was appointed to Mayor Bloomberg’s Technology and Innovation Advisory Council. She also co-founded HackNY, created dataists, and is a member of NYCResistor.

Basho Technologies

May 14, 2013

Transcript

  1. Hilary Mason Chief Scientist, bitly @hmason h@bit.ly Realtime Systems for

    Social Data Analysis
  2. None
  3. {"a": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "126F3CN", "i": "xx.xxx.xxx.xxx", "h": "126F3CM", "k": "xxxxxx-xxxxx-xxxxx-xxxxxxx", "l": "raycom", "al": "en-US,en;q=0.8", "hh": "bit.ly", "r": "https://www.facebook.com/", "u": "http://www.kltv.com/story/22237743/document-sheds-light-on-events-leading-up-to-longview-standoff?utm_content=bufferf4a8d&utm_source=buffer&utm_medium=facebook&utm_campaign=Buffer", "t": 1368478799, "hc": 1368476067, "cy": "Longview", "ll": [32.500701904296875, -94.740501403808594]}
  4. None
  5. hadoop?

  6. None
  7. Data engineering!

  8. When you have data you can learn a lot of

    things...
  9. ...but how do we make these capacities useful?

  10. None
  11. None
  12. https://github.com/bitly/dablooms/issues/35

  13. https://github.com/bitly/dablooms/issues/35 http://bit.ly/12aUf3F

  14. http://bit.ly/xj57gS

  15. None
  16. None
  17. None
  18. None
  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. None
  29. 10s of millions of URLs per day 100s of millions

    of clicks per day 10s of billions of URLs
  30. None
  31. (not the whole internet, just the bit people are paying

    attention to)
  32. Your social network is NOT my social network.

  33. None
  34. None
  35. http://tweet.onerandom.com

  36. None
  37. Identity?

  38. None
  39. Geography?

  40. None
  41. None
  42. [pizza in california]

  43. None
  44. None
  45. Networks impact behaviors.

  46. None
  47. twitter

  48. facebook

  49. tumblr

  50. Devices also change behavior.

  51. None
  52. None
  53. This is the real world.

  54. None
  55. None
  56. None
  57. None
  58. Revolution.

  59. 51

  60. None
  61. What People Share

  62. What People Share What People Read

  63. Things people share:

  64. None
  65. None
  66. None
  67. Things people click:

  68. None
  69. None
  70. None
  71. None
  72. None
  73. None
  74. None
  75. None
  76. None
  77. AT SCALE

  78. Data engineering is when the architecture of your system is

    dependent on characteristics of the data flowing through that system.
  79. 1. Research offline 2. Do fancy math – find the

    shortcuts 3. Design infrastructure 4. Re-design to run at scale and speed
  80. Open Source

  81. What language is this content?

  82. Classic Approach: Supervised Classification Take labeled content (Google Translate API,

    Europarl corpus, etc.) and use a classifier.
  83. Classic Approach: Supervised Classification Take labeled content (Google Translate API,

    Europarl corpus, etc.) and use a classifier. (sure)
  84. None
  85. wait, more data? "es" "en-us,en;q=0.5" "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4" "en-gb,en;q=0.5" "en-US,en;q=0.5" "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3" "de, en-gb;q=0.9, en;q=0.8"
  86. None
  87. use an entropy calculation!

        import numpy as np

        def ghash2lang(g, R, min_count=3, max_entropy=0.2):
            """
            Returns the majority vote of a language for a given hash.
            R is assumed to be a Redis client; g is the key of a sorted
            set that maps each language to its observed count.
            """
            # most frequent language for this hash
            lang = R.zrevrange(g, 0, 0)[0]
            # let's calculate the entropy!
            # possible languages
            x = R.zrange(g, 0, -1)
            # distribution over those languages
            p = np.array([R.zscore(g, langi) for langi in x])
            p /= p.sum()
            # info content
            I = [pi * np.log(pi) for pi in p]
            # entropy: smaller the more certain we are! - i.e. the lower our surprise
            H = -sum(I) / len(I)  # in nats!
            # note that this will give a perfect zero for a single count in one language
            # or for 5K counts in one language. So we also need the count..
            count = R.zscore(g, lang)
            # accept the vote only with enough observations and low enough entropy
            if count >= min_count and H <= max_entropy:
                return lang, count
            else:
                return None, 1
  88. None
  89. Simple > Fancy

  90. Are you a human?

  91. Offline research on the full lifecycle of a link.

  92. None
  93. Train random forest decision tree on offline model.

  94. Classify every click in ‘realtime’.

  95. Downstream systems decide how to use that score.

  96. {"ck": 1, "gr": "X4", "al": "en-US,en;q=0.8", "topic": "Sports", "cy": "Bargoed", "hc": 1368535661.0000002, "ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]}, "hh": "mirr.im", "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "GB", "nk": 1, "tz": "Europe/London", "g": "16a3eXk", "i": "xxx.xxx.xxx.xxx", "h": "16a3eXi", "k": "xxxxxxxx-xxxxx-xxxxxx-xxxxxxx", "l": "dailymirror", "p": "fans", "r": "http://t.co/hSpdnJzMIh", "u": "http://www.mirror.co.uk/sport/football/news/picture-special-david-beckhams-paris-1888664?utm_source=twitterfeed&utm_medium=twitter", "t": 1368536389.0, "ll": [51.683300018, -3.23329997]}
  97. What is the world paying attention to right now?

  98. http://rt.ly

  99. 1. Search 2. Bursts 3. Stories

  100. Search: I know what I want.

  101. Realtime Search Attributes calculated either at index time or query

    time. Rankings can vary by second.
  102. ‘Realtime’ Search • built on Zoie (Solr plugin) • only

    keeps documents in the index if they have been clicked* in the previous 24 hours
  103. Click! queue Solr processing Realtime scoring Content Extraction Crawlers

  104. QUERY Solr to find all documents Realtime scoring to rank

  105. None
  106. None
  107. Narrow the stream by ‘query’, rank it.

  108. Bursts: What’s happening?

  109. actual rate of clicks on phrases vs expected rate of

    clicks on phrases
  110. We calculate clickrate with a sort of moving average: where

    Dragoneye
  111. We represent as a sum of delta spikes. This simplifies

    to: Dragoneye
  112. Choosing is important. It must be interpretable, and smooth (but

    not too smooth). We use a distribution for that is a function that sums to 1. The function is 0 at the origin. Dragoneye
  113. The models are built into the database. We get the

    CPS calculation for free.
  114. None
  115. None
  116. None
  117. Bursts inform priority in the search index queue.

  118. None
  119. Stories: more than just links

  120. [story api]

  121. None
  122. None
  123. http://rt.ly

  124. http://dev.bitly.com

  125. None
  126. None
  127. None
  128. None
  129. the cutest kitten

  130. the cutest kitten

  131. h@bit.ly @hmason Thank you!
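
Sketch for slides 109 to 112 (bursts): the slides compare the actual rate of clicks on a phrase with the expected rate, and describe the rate as a sum of delta spikes smoothed by a function that sums to 1 and is 0 at the origin. The equations themselves did not survive in the transcript, so what follows is only a minimal illustration of that idea and not bitly's Dragoneye implementation. The kernel shape (s / tau^2) * exp(-s / tau), the time constant `tau`, and the function names `alpha_kernel`, `click_rate`, and `burst_score` are all assumptions made for this example.

```python
import math


def alpha_kernel(s, tau):
    """Assumed smoothing kernel: 0 at s = 0, peaks at s = tau,
    and integrates to 1 over s >= 0."""
    if s < 0:
        return 0.0
    return (s / (tau * tau)) * math.exp(-s / tau)


def click_rate(click_times, now, tau=60.0):
    """Smoothed clicks per second at time `now`.

    Each click is treated as a spike at its timestamp; the rate is the
    sum of the kernel evaluated at the age of every click.
    """
    return sum(alpha_kernel(now - t, tau) for t in click_times)


def burst_score(click_times, now, expected_rate, tau=60.0):
    """Ratio of the smoothed actual rate to the expected rate.

    Values well above 1 suggest the phrase is getting more clicks
    than expected.
    """
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    return click_rate(click_times, now, tau) / expected_rate
```

A larger `tau` gives a smoother but slower-reacting rate, which is one way to read the slide's "smooth (but not too smooth)"; because the kernel integrates to 1, a steady stream of one click per second yields a rate close to 1.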