Realtime Systems for Social Data Analysis (RICON East 2013)

Hilary Mason, Chief Scientist, bitly
@hmason [email protected]
Realtime Systems for Social Data Analysis

{"a": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "126F3CN", "i": "xx.xxx.xxx.xxx", "h": "126F3CM", "k": "xxxxxx-xxxxx-xxxxx-xxxxxxx", "l": "raycom", "al": "en-US,en;q=0.8", "hh": "bit.ly", "r": "https://www.facebook.com/", "u": "http://www.kltv.com/story/22237743/document-sheds-light-on-events-leading-up-to-longview-standoff?utm_content=bufferf4a8d&utm_source=buffer&utm_medium=facebook&utm_campaign=Buffer", "t": 1368478799, "hc": 1368476067, "cy": "Longview", "ll": [32.500701904296875, -94.740501403808594]}
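
A minimal sketch of reading one such click record in Python. The record is a trimmed copy of the one above. The field meanings are assumptions, not something the slide states: "t" is taken to be the click time, "hc" the time the short link was created, and "ll" a latitude/longitude pair. `summarize_click` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the click record shown above.
RAW = ('{"c": "US", "tz": "America/Chicago", "t": 1368478799, '
       '"hc": 1368476067, "cy": "Longview", '
       '"ll": [32.500701904296875, -94.740501403808594]}')

def summarize_click(line):
    """Decode one JSON click record into a small, readable summary."""
    rec = json.loads(line)
    # Assumption: "t" is the click time in Unix seconds.
    clicked = datetime.fromtimestamp(rec["t"], tz=timezone.utc)
    return {
        "country": rec["c"],
        "city": rec["cy"],
        "clicked_at": clicked.isoformat(),
        # Assumption: "hc" is when the short link was created.
        "link_age_seconds": rec["t"] - rec["hc"],
        "lat": rec["ll"][0],
        "lon": rec["ll"][1],
    }

summary = summarize_click(RAW)
print(summary)
```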

hadoop?

Data engineering!

When you have data you can learn a lot of
things...

...but how do we make these capacities useful?

https://github.com/bitly/dablooms/issues/35

http://bit.ly/12aUf3F

http://bit.ly/xj57gS

10s of millions of URLs per day
100s of millions of clicks per day
10s of billions of URLs

(not the whole internet, just the bit people are paying
attention to)

Your social network is NOT my social network.

http://tweet.onerandom.com

Identity?

Geography?

[pizza in california]

Networks impact behaviors.

twitter

facebook

tumblr

Devices also change behavior.

This is the real world.

Revolution.

What People Share

What People Share What People Read

Things people share:

Things people click:

AT SCALE

Data engineering is when the architecture of your system is
dependent on characteristics of the data flowing through that system.

1. Research offline
2. Do fancy math – find the shortcuts
3. Design infrastructure
4. Re-design to run at scale and speed

Open Source

What language is this content?

Classic Approach: Supervised Classification
Take labeled content (Google Translate API, Europarl corpus, etc.) and use a classifier.

(sure)

wait, more data?
"es"
"en-us,en;q=0.5"
"pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
"en-gb,en;q=0.5"
"en-US,en;q=0.5"
"es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
"de, en-gb;q=0.9, en;q=0.8"
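
The strings above are Accept-Language header values: comma-separated language tags with optional q weights (a missing q means 1.0). A minimal sketch of turning one into a ranked list follows. The function names are hypothetical, and real headers can carry edge cases this does not handle.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending q."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit q weight counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so ties keep the order given in the header
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

def primary_language(header):
    """Return the base language of the top-ranked tag, e.g. 'pt' for 'pt-BR'."""
    prefs = parse_accept_language(header)
    return prefs[0][0].split("-")[0] if prefs else None

print(primary_language("pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"))
```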

use an entropy calculation!

    import numpy as np

    def ghash2lang(g, Ri, min_count=3, max_entropy=0.2):
        """
        Returns the majority vote of a language for a given hash.
        Ri is assumed to be a Redis client; g is assumed to be the key of a
        sorted set mapping language -> count.
        """
        lang = Ri.zrevrange(g, 0, 0)[0]
        # let's calculate the entropy!
        # possible languages
        x = Ri.zrange(g, 0, -1)
        # distribution over those languages
        p = np.array([Ri.zscore(g, langi) for langi in x])
        p /= p.sum()
        # info content
        I = [pi * np.log(pi) for pi in p]
        # entropy: smaller the more certain we are! - i.e. the lower our surprise
        H = -sum(I) / len(I)  # in nats!
        # note that this will give a perfect zero for a single count in one language
        # or for 5K counts in one language. So we also need the count..
        count = Ri.zscore(g, lang)
        # accept the vote only with enough counts and low enough entropy
        if count >= min_count and H <= max_entropy:
            return lang, count
        else:
            return None, 1

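The same idea can be tried without Redis. This is a minimal in-memory sketch, assuming the per-link language counts are already in a `Counter`. It uses plain Shannon entropy in nats rather than the per-language average on the slide, and `majority_language` is a hypothetical name.

```python
import math
from collections import Counter

def majority_language(counts, min_count=3, max_entropy=0.2):
    """Return the dominant language, or None if the vote is too weak or too mixed."""
    total = sum(counts.values())
    if total == 0:
        return None
    lang, top = counts.most_common(1)[0]
    probs = [c / total for c in counts.values() if c > 0]
    # Shannon entropy in nats: 0 when every vote agrees, larger when votes are split
    entropy = -sum(p * math.log(p) for p in probs)
    # Entropy alone cannot tell 1 vote from 5,000 votes, so also require a count
    if top >= min_count and entropy <= max_entropy:
        return lang
    return None

print(majority_language(Counter({"en": 50, "es": 1})))
```
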
Simple > Fancy

Are you a human?

Offline research on the full lifecycle of a link.

Train a random forest classifier offline.

Classify every click in ‘realtime’.

Downstream systems decide how to use that score.

{"ck": 1, "gr": "X4", "al": "en-US,en;q=0.8", "topic": "Sports", "cy": "Bargoed", "hc": 1368535661.0000002, "ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]}, "hh": "mirr.im", "a": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31", "c": "GB", "nk": 1, "tz": "Europe/London", "g": "16a3eXk", "i": "xxx.xxx.xxx.xxx", "h": "16a3eXi", "k": "xxxxxxxx-xxxxx-xxxxxx-xxxxxxx", "l": "dailymirror", "p": "fans", "r": "http://t.co/hSpdnJzMIh", "u": "http://www.mirror.co.uk/sport/football/news/picture-special-david-beckhams-paris-1888664?utm_source=twitterfeed&utm_medium=twitter", "t": 1368536389.0, "ll": [51.683300018, -3.23329997]}

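A hedged sketch of how a downstream consumer might use the score carried on a record like the one above. It assumes "ovi" holds the classifier output and "proba" is a list of class probabilities. The slide does not say which index means what, so the caller passes the class index explicitly. `accept` is a hypothetical helper.

```python
import json

# Trimmed copy of the scored record shown above.
record = json.loads(
    '{"topic": "Sports", '
    '"ovi": {"count": 124.0, "proba": [0.935458874, 0.064541125]}}'
)

def class_probability(rec, class_index):
    """Return the probability for one class, or None if the record is unscored."""
    ovi = rec.get("ovi")
    if not ovi:
        return None
    return ovi["proba"][class_index]

def accept(rec, class_index, threshold):
    """Each consumer picks its own threshold; unscored records are rejected."""
    p = class_probability(rec, class_index)
    return p is not None and p >= threshold

# A strict consumer and a lenient one can disagree about the same record.
print(accept(record, 0, 0.95), accept(record, 0, 0.5))
```
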
What is the world paying attention to right now?

http://rt.ly

1. Search
2. Bursts
3. Stories

Search: I know what I want.

Realtime Search
Attributes are calculated either at index time or at query time. Rankings can vary by the second.

‘Realtime’ Search
• built on Zoie (Solr plugin)
• only keeps documents in the index if they have been clicked* in the previous 24 hours
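
The 24-hour rule can be illustrated in plain Python. This is not Zoie or Solr code; it is a minimal sketch of "keep a document only while it has a recent click". `RecentClickIndex` and its methods are hypothetical names.

```python
import time

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from the slide

class RecentClickIndex:
    """Tracks the last click per document and drops documents that go quiet."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.last_click = {}  # doc_id -> most recent click time (Unix seconds)

    def record_click(self, doc_id, ts=None):
        ts = time.time() if ts is None else ts
        prev = self.last_click.get(doc_id)
        if prev is None or ts > prev:
            self.last_click[doc_id] = ts

    def expire(self, now=None):
        """Remove and return documents with no click inside the window."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        stale = [d for d, ts in self.last_click.items() if ts < cutoff]
        for d in stale:
            del self.last_click[d]
        return stale

    def __contains__(self, doc_id):
        return doc_id in self.last_click
```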

[diagram: Click!, queue, processing, Solr, Realtime scoring, Content Extraction, Crawlers]

QUERY: Solr to find all documents, Realtime scoring to rank

Narrow the stream by ‘query’, rank it.
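
A toy sketch of the two steps, filter then rank. The document shape (a dict with a "text" field) and the scoring callable are assumptions for illustration. In the system described, Solr does the matching and a separate realtime scorer does the ranking.

```python
def search(docs, query, score):
    """Keep documents containing every query term, then rank by a realtime score."""
    terms = query.lower().split()
    hits = [d for d in docs if all(t in d["text"].lower() for t in terms)]
    # The score is computed at query time, so rankings can shift between calls
    return sorted(hits, key=score, reverse=True)

docs = [
    {"text": "Beckham Paris picture special", "clicks_last_minute": 5},
    {"text": "Paris weather", "clicks_last_minute": 9},
    {"text": "Longview standoff", "clicks_last_minute": 7},
]
results = search(docs, "paris", score=lambda d: d["clicks_last_minute"])
print([d["text"] for d in results])
```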

Bursts: What’s happening?

actual rate of clicks on phrases vs expected rate of
clicks on phrases
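
One simple way to compare the two rates is a ratio. The sketch below is an illustration only: the threshold, the floor on the expected rate, and the function names are assumptions, not the scoring used in the talk.

```python
def burst_score(actual_rate, expected_rate, floor=1e-3):
    """Ratio of observed to expected click rate; the floor avoids division by zero."""
    return actual_rate / max(expected_rate, floor)

def find_bursts(actual, expected, threshold=3.0):
    """Return phrases whose actual rate is at least `threshold` times the expected rate."""
    scores = {p: burst_score(r, expected.get(p, 0.0)) for p, r in actual.items()}
    bursting = [p for p, s in scores.items() if s >= threshold]
    return sorted(bursting, key=lambda p: scores[p], reverse=True)

actual = {"phrase a": 12.0, "phrase b": 5.0}
expected = {"phrase a": 0.5, "phrase b": 4.0}
print(find_bursts(actual, expected))
```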

Dragoneye
We calculate the click rate with a sort of moving average: [equation and symbol definitions]

We represent the clicks as a sum of delta spikes. This simplifies to: [equation]
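
A standard kernel-smoothing form that matches this description is sketched below. It is an assumption, not the slide's own notation: $r(t)$ is the click rate, $c(s)$ the click stream, $t_i$ the click times, and $K$ the weighting function.

```latex
r(t) = \int_{-\infty}^{t} c(s)\, K(t - s)\, ds,
\qquad
c(s) = \sum_i \delta(s - t_i)
\quad\Longrightarrow\quad
r(t) = \sum_{i \,:\, t_i \le t} K(t - t_i)
```

With the clicks written as delta spikes, the integral collapses to one kernel evaluation per click.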

Choosing the weighting function is important. It must be interpretable, and smooth (but not too smooth). For it we use a distribution: a function that sums to 1. The function is 0 at the origin.
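
A minimal sketch of such a function and the click rate it produces. The gamma (shape 2) density is an assumed choice that happens to meet the stated properties (zero at the origin, integrates to 1, smooth). The talk does not name the function it uses, and `tau` is a hypothetical smoothing parameter in seconds.

```python
import math

def kernel(dt, tau=60.0):
    """Gamma(shape=2) density: 0 at dt=0, peaks at dt=tau, integrates to 1."""
    if dt <= 0:
        return 0.0
    return (dt / tau ** 2) * math.exp(-dt / tau)

def click_rate(click_times, now, tau=60.0):
    """Smoothed clicks per second at `now`: one kernel evaluation per past click."""
    return sum(kernel(now - t, tau) for t in click_times)

# Ten clicks spread over the last five minutes
clicks = [1000.0 + 30.0 * i for i in range(10)]
print(click_rate(clicks, now=1300.0))
```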

The models are built into the database. We get the
CPS calculation for free.

Bursts inform priority in the search index queue.
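
A minimal sketch of a burst-aware indexing queue using Python's `heapq`. `IndexQueue` is a hypothetical name, and the real queue is not described in the talk. The point is only that a higher burst score moves a URL to the front.

```python
import heapq
import itertools

class IndexQueue:
    """Priority queue of URLs to index; the burstiest URL is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, burst_score=0.0):
        # heapq is a min-heap, so negate the score to pop the largest first
        heapq.heappush(self._heap, (-burst_score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = IndexQueue()
queue.push("http://example.com/quiet", burst_score=1.0)
queue.push("http://example.com/bursting", burst_score=5.0)
print(queue.pop())
```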

Stories: more than just links

[story api]

http://rt.ly

http://dev.bitly.com

the cutest kitten

[email protected] @hmason Thank you!
