Realtime Systems for Social Data Analysis (RICON East 2013)

Presentation delivered by Hilary Mason at RICON East 2013.

It's one thing to have a lot of data, and another to make it useful. This talk explores the interplay between infrastructure, algorithms, and data necessary to design robust systems that produce useful and measurable insights for realtime data products. We'll walk through several examples and discuss the design metaphors that bitly uses to rapidly develop these kinds of systems.

About Hilary

Hilary is the Chief Scientist at bitly, the URL-shortening and bookmarking service, where she makes beautiful things with data. She is a former computer science professor with a background in machine learning and data mining. As a native New Yorker, Hilary was appointed to Mayor Bloomberg’s Technology and Innovation Advisory Council. She also co-founded HackNY, created dataists, and is a member of NYCResistor.

Basho Technologies

May 14, 2013
Transcript

  1. Hilary Mason
    Chief Scientist, bitly
    @hmason
    Realtime Systems
    for
    Social Data Analysis

  2. {"a": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
    "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "126F3CN",
    "i": "xx.xxx.xxx.xxx", "h": "126F3CM", "k": "xxxxxx-xxxxx-xxxxx-xxxxxxx",
    "l": "raycom", "al": "en-US,en;q=0.8", "hh": "bit.ly",
    "r": "https://www.facebook.com/",
    "u": "http://www.kltv.com/story/22237743/document-sheds-light-on-events-leading-up-to-longview-standoff?utm_content=bufferf4a8d&utm_source=buffer&utm_medium=facebook&utm_campaign=Buffer",
    "t": 1368478799, "hc": 1368476067, "cy": "Longview",
    "ll": [32.500701904296875, -94.740501403808594]}

  3. Data engineering!
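
Working with a click record like the one on slide 2 starts with decoding it. The following is a minimal sketch, not bitly's code: the long field names in `FIELD_NAMES` are my assumptions about what the short keys mean (the slide does not define them), and `decode_click` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# Assumed meanings of the short keys; the slide itself does not define them.
FIELD_NAMES = {
    "a": "user_agent",
    "c": "country_code",
    "tz": "timezone",
    "r": "referrer",
    "u": "long_url",
    "t": "click_timestamp",
    "hc": "hash_created_timestamp",
    "al": "accept_language",
    "ll": "lat_long",
}

def decode_click(line):
    """Parse one JSON click record and rename the short keys we recognise."""
    raw = json.loads(line)
    click = {FIELD_NAMES.get(key, key): value for key, value in raw.items()}
    # Convert epoch seconds into timezone-aware datetimes.
    for key in ("click_timestamp", "hash_created_timestamp"):
        if key in click:
            click[key] = datetime.fromtimestamp(click[key], tz=timezone.utc)
    return click

sample = '{"c": "US", "tz": "America/Chicago", "t": 1368478799, "hc": 1368476067, "hh": "bit.ly"}'
click = decode_click(sample)
age = click["click_timestamp"] - click["hash_created_timestamp"]
print(click["country_code"], age.total_seconds())  # US 2732.0
```

Unrecognised keys such as `hh` pass through unchanged, so the sketch does not need a complete schema to be useful.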

  4. When you have data you can
    learn a lot of things...

  5. ...but how do we make these
    capacities useful?

  6. https://github.com/bitly/dablooms/issues/35

  7. https://github.com/bitly/dablooms/issues/35
    http://bit.ly/12aUf3F

  8. http://bit.ly/xj57gS

  9. 10s of millions of URLs per day
    100s of millions of clicks per day
    10s of billions of URLs

  10. (not the whole internet, just the bit
    people are paying attention to)

  11. Your social network is NOT
    my social network.

  12. http://tweet.onerandom.com

  13. [pizza in california]

  14. Networks impact behaviors.

  15. Devices also change behavior.

  16. This is the real world.

  17. What People Share

  18. What People Share
    What People Read

  19. Things people share:

  20. Things people click:

  21. Data engineering is when the architecture
    of your system is dependent on
    characteristics of the data flowing through
    that system.

  22. 1. Research offline
    2. Do fancy math – find the shortcuts
    3. Design infrastructure
    4. Re-design to run at scale and speed

  23. Open Source

  24. What language is this content?

  25. Classic Approach:
    Supervised Classification
    Take labeled content (google translate
    API, europarl corpus, etc.) and use a
    classifier.

  26. Classic Approach:
    Supervised Classification
    Take labeled content (google translate
    API, europarl corpus, etc.) and use a
    classifier.
    (sure)

  27. wait, more data?
    "es"
    "en-us,en;q=0.5"
    "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
    "en-gb,en;q=0.5"
    "en-US,en;q=0.5"
    "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
    "de, en-gb;q=0.9, en;q=0.8"
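
The strings on this slide are HTTP Accept-Language header values: comma-separated language tags, each with an optional `q` weight that defaults to 1. Turning them into a language vote could look like the sketch below. The function names are hypothetical, and reducing a tag such as `pt-BR` to its base language `pt` is my assumption about what a per-link vote would need.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending q value."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so equal weights keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

def primary_language(header):
    """Reduce a header to the base language of its top preference."""
    prefs = parse_accept_language(header)
    return prefs[0][0].split("-")[0] if prefs else None

print(primary_language("pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"))  # pt
print(primary_language("de, en-gb;q=0.9, en;q=0.8"))            # de
```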

  28. use an entropy calculation!
    import numpy as np

    def ghash2lang(g, R, min_count=3, max_entropy=0.2):
        """
        Returns the majority-vote language for a given hash, or None.
        R is a Redis client; key g holds a sorted set of language -> count.
        """
        lang = R.zrevrange(g, 0, 0)[0]
        # let's calculate the entropy!
        # possible languages
        x = R.zrange(g, 0, -1)
        # distribution over those languages
        p = np.array([R.zscore(g, langi) for langi in x])
        p /= p.sum()
        # info content
        I = [pi * np.log(pi) for pi in p]
        # entropy: smaller the more certain we are! - i.e. the lower our surprise
        H = -sum(I) / len(I)  # in nats!
        # note that this will give a perfect zero for a single count in one language
        # or for 5K counts in one language. So we also need the count..
        count = R.zscore(g, lang)
        # accept only when there are enough votes and the distribution is peaked
        if count >= min_count and H <= max_entropy:
            return lang, count
        else:
            return None, 1
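
The function on this slide needs a live Redis sorted set. To try the idea without one, the sketch below applies the same calculation to an in-memory `collections.Counter`. It is an illustration under assumptions: `majority_language` is a hypothetical name, and it accepts a vote only when the count is at least `min_count` and the entropy is at most `max_entropy`, which is how I read the intent of the slide's comments.

```python
import math
from collections import Counter

def majority_language(counts, min_count=3, max_entropy=0.2):
    """Return (language, votes) if the vote is large and lopsided enough, else (None, 0)."""
    if not counts:
        return None, 0
    lang, votes = counts.most_common(1)[0]
    total = sum(counts.values())
    probs = [n / total for n in counts.values() if n > 0]
    # Same normalisation as the slide: entropy in nats divided by
    # the number of languages seen.
    entropy = -sum(p * math.log(p) for p in probs) / len(probs)
    if votes >= min_count and entropy <= max_entropy:
        return lang, votes
    return None, 0

print(majority_language(Counter({"en": 50, "es": 1})))  # ('en', 50)
print(majority_language(Counter({"en": 5, "es": 5})))   # (None, 0)
print(majority_language(Counter({"en": 1})))            # (None, 0)
```

The third call shows why the count matters: a single vote has zero entropy but is not enough evidence.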

  29. Simple > Fancy

  30. Are you a human?

  31. Offline research on the full
    lifecycle of a link.

  32. Train random forest decision
    tree on offline model.

  33. Classify every click in ‘realtime’.

  34. Downstream systems decide
    how to use that score.

  35. {"ck": 1, "gr": "X4", "al": "en-US,en;q=0.8", "topic": "Sports",
    "cy": "Bargoed", "hc": 1368535661.0000002,
    "ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]},
    "hh": "mirr.im",
    "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
    "c": "GB", "nk": 1, "tz": "Europe/London", "g": "16a3eXk",
    "i": "xxx.xxx.xxx.xxx", "h": "16a3eXi", "k": "xxxxxxxx-xxxxx-xxxxxx-xxxxxxx",
    "l": "dailymirror", "p": "fans", "r": "http://t.co/hSpdnJzMIh",
    "u": "http://www.mirror.co.uk/sport/football/news/picture-special-david-beckhams-paris-1888664?utm_source=twitterfeed&utm_medium=twitter",
    "t": 1368536389.0, "ll": [51.683300018, -3.23329997]}
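
Slides 33 and 34 describe scoring every click and leaving the policy to downstream systems. The sketch below illustrates that split. It rests on assumptions the slides do not confirm: that the `ovi.proba` pair in the record above is the classifier's class-probability output, and that index 0 is the "human" class. The consumer names and cut-offs are hypothetical.

```python
import json

def human_probability(click, human_index=0):
    """Pull one class probability out of a click record, or None if unscored."""
    proba = click.get("ovi", {}).get("proba")
    if not proba or human_index >= len(proba):
        return None
    return proba[human_index]

# Hypothetical consumers: each picks its own cut-off for the same score.
THRESHOLDS = {"analytics": 0.5, "trending": 0.9, "billing": 0.99}

def accepted_by(click, thresholds=THRESHOLDS):
    """Return the names of the consumers that would count this click."""
    score = human_probability(click)
    if score is None:
        return []
    return sorted(name for name, cutoff in thresholds.items() if score >= cutoff)

record = json.loads('{"ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]}}')
print(accepted_by(record))  # ['analytics', 'trending']
```

Keeping the score in the record and the thresholds in the consumers means a stricter consumer can be added without reclassifying anything.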

  36. What is the world paying
    attention to right now?

  37. http://rt.ly

  38. 1. Search
    2. Bursts
    3. Stories

  39. Search: I know what I want.

  40. Realtime Search
    Attributes calculated either at index time or
    query time.
    Rankings can vary by second.

  41. ‘Realtime’ Search
    • built on Zoie (Solr plugin)
    • only keeps documents in the index if they
    have been clicked* in the previous 24
    hours
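
The 24-hour retention rule can be pictured as a map from document to its latest click time, swept periodically. This is an in-memory sketch of that rule only, not of Zoie or Solr; the class and method names are hypothetical.

```python
WINDOW_SECONDS = 24 * 60 * 60

class RecentClickIndex:
    """Tracks documents whose most recent click is inside the window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.last_click = {}  # doc_id -> timestamp of the latest click

    def record_click(self, doc_id, ts):
        # Keep the newest timestamp even if clicks arrive out of order.
        prev = self.last_click.get(doc_id, ts)
        self.last_click[doc_id] = max(prev, ts)

    def expire(self, now):
        """Drop documents not clicked within the window; return their ids."""
        cutoff = now - self.window
        stale = [d for d, ts in self.last_click.items() if ts < cutoff]
        for d in stale:
            del self.last_click[d]
        return stale

    def __contains__(self, doc_id):
        return doc_id in self.last_click

index = RecentClickIndex()
index.record_click("a", ts=0)
index.record_click("b", ts=80_000)
print(index.expire(now=90_000))  # ['a']
print("b" in index)              # True
```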

  42. Click!
    queue Solr processing
    Realtime scoring
    Content Extraction
    Crawlers

  43. QUERY
    Solr to find all documents
    Realtime scoring to rank

  44. Narrow the stream by ‘query’,
    rank it.
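
The two-step shape of slides 43 and 44 (find matching documents, then rank them with a realtime score) can be sketched in a few lines. This stands in for Solr with a plain substring match, and the `clicks_last_minute` field and sample documents are invented for illustration.

```python
def search(stream, query, score):
    """Keep documents containing every query term, best score first."""
    terms = query.lower().split()
    hits = [doc for doc in stream
            if all(term in doc["text"].lower() for term in terms)]
    return sorted(hits, key=score, reverse=True)

docs = [
    {"id": 1, "text": "Pizza in California", "clicks_last_minute": 4},
    {"id": 2, "text": "Best pizza in New York", "clicks_last_minute": 9},
    {"id": 3, "text": "Kitten videos", "clicks_last_minute": 50},
]
results = search(docs, "pizza", score=lambda d: d["clicks_last_minute"])
print([d["id"] for d in results])  # [2, 1]
```

Passing the scoring function in as an argument mirrors the separation on the slides: matching is stable, while the ranking signal can change from second to second.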

  45. Bursts: What’s happening?

  46. actual rate of clicks on phrases
    vs
    expected rate of clicks on phrases
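
One simple way to compare actual and expected rates is their ratio. The sketch below is an assumption about how such a comparison might be scored, not the method used in the talk; the `floor` guard, the threshold of 3, and the sample phrases and rates are all hypothetical.

```python
def burst_score(actual_rate, expected_rate, floor=1.0):
    """Ratio of observed to expected click rate.

    The floor keeps rare phrases with a near-zero expected rate
    from producing huge ratios.
    """
    return actual_rate / max(expected_rate, floor)

def bursting(rates, threshold=3.0):
    """Return phrases whose ratio is at least `threshold`, largest first."""
    scored = {p: burst_score(a, e) for p, (a, e) in rates.items()}
    return sorted((p for p, s in scored.items() if s >= threshold),
                  key=lambda p: scored[p], reverse=True)

# phrase -> (actual clicks/sec, expected clicks/sec), invented numbers
rates = {
    "longview standoff": (120.0, 10.0),
    "pizza": (40.0, 35.0),
    "beckham paris": (90.0, 15.0),
}
print(bursting(rates))  # ['longview standoff', 'beckham paris']
```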

  47. We calculate clickrate with a sort of moving
    average:
    where
    Dragoneye

  48. We represent as a sum of delta spikes.
    This simplifies to:
    Dragoneye

  49. Choosing is important.
    It must be interpretable, and smooth (but
    not too smooth).
    We use a distribution for that is a function
    that sums to 1. The function is 0 at the
    origin.
    Dragoneye
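
The formulas on slides 47 to 49 did not survive in this transcript, so the following is one plausible reading rather than the talk's actual model. If the click stream is a sum of delta spikes at the click times, a moving average with a kernel reduces to summing the kernel evaluated at the time since each click. The kernel chosen here, a gamma density with shape 2, is my assumption: it is 0 at the origin and integrates to 1, matching the two properties the slide states. The time constant `tau` is hypothetical.

```python
import math

def kernel(s, tau):
    """Gamma(shape=2) density: 0 at s=0, peaks at s=tau, integrates to 1."""
    if s < 0:
        return 0.0  # clicks in the future contribute nothing
    return (s / tau ** 2) * math.exp(-s / tau)

def click_rate(click_times, now, tau=60.0):
    """Estimated clicks per second at `now`: one kernel bump per click."""
    return sum(kernel(now - t, tau) for t in click_times)

# A steady stream of one click per second should yield a rate close to 1.
steady = range(6000)
print(round(click_rate(steady, now=6000), 3))
```

Because the kernel integrates to 1, the estimate stays in interpretable units (clicks per second), and a larger `tau` gives a smoother but slower-reacting estimate.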

  50. The models are built into the
    database.
    We get the CPS calculation for
    free.

  51. Bursts inform priority in the
    search index queue.
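
Letting bursts set indexing priority amounts to a priority queue keyed on burst score. The sketch below uses Python's `heapq`; the `IndexQueue` class, the URLs, and the scores are hypothetical, and the talk does not say how the real queue is implemented.

```python
import heapq
import itertools

class IndexQueue:
    """Pops the URL with the highest burst score first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal scores come out in insertion order.
        self._order = itertools.count()

    def push(self, url, burst_score):
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-burst_score, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = IndexQueue()
queue.push("http://example.com/quiet", 1.1)
queue.push("http://example.com/breaking", 12.0)
queue.push("http://example.com/rising", 6.0)
print(queue.pop())  # http://example.com/breaking
```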

  52. Stories: more than just links

  53. http://rt.ly

  54. http://dev.bitly.com

  55. the cutest kitten

  56. the cutest kitten
