The Beauty of Data Engineering (in 15 minutes or less)

Hilary Mason Chief Scientist bitly Speaker @hmason

Hilary Mason Chief Scientist, bitly @hmason [email protected]
The Beauty of Data Engineering (in 15 mins or less)

Hello, Hack+Startup! (thanks, FirstRound)

Hilary Mason 416 w 13th St New York, NY 10014
Hilary Mason c/o bitly 416 w 13th St New York, NY 10014
Hilary Mason bit.ly 416 w 13St suite #203 New York City 10014
Hilary Mason Chief Scientist Bitly 416 west 13th Street New York, NY 10014

+ + sum() [1,2,3]

agreed?

[archive photo]

jmstriegel: no, really. I'm quite human.
jmstriegel: test me if you want
shymuffin32: ok
shymuffin32: why do you like music?
jmstriegel: hmm. i've never really considered that.
jmstriegel: hell, i'm not going to be able to contrive a good answer for that one. ask me something else.
shymuffin32: jeesus, you're worse than eliza
http://bit.ly/b7dter

“How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”
-- Tom Mitchell, CMU

[this is working]

BIG data?

[excel]

Data BIG

useful BIG

small BIG

Data scientists do three things:
• Write code.
• Do math.
• Ask questions. (and answer them.)

a brief tour of the vocabulary and current thinking in applied machine learning

supervised vs unsupervised

classification vs clustering

[spam folders]
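
The two pairs of terms above can be made concrete with a toy one-dimensional example. This is only a sketch and not code from the talk: the function names, the nearest-centroid classifier, and the tiny k-means loop are all illustrative assumptions. The supervised function is given labels (as with a spam folder); the unsupervised one receives only the points and has to find the groups itself.

```python
def nearest_centroid_fit(points, labels):
    """Supervised: average the points that share each given label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def nearest_centroid_predict(centroids, x):
    """Classification: assign x the label of the closest centroid."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised clustering: no labels, groups come from the data alone."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
            groups[nearest].append(x)
        # keep the old center if a group ends up empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers
```

Both functions end up with the same two centers on well-separated data; the difference is whether a human had to supply the labels first.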

Entity disambiguation. This is important.

Text Me UGLY HAG

UGLY HAG Me

Entity disambiguation. This is important.
Disambiguation is a very common and very hard problem.
For example: IBM, International Business Machines, IBM Watson, TJ Watson Labs
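
As a rough illustration of why this is hard, the simplest possible approach is an alias table that maps normalized surface forms to one canonical entity. The table contents and the names `ALIASES`, `normalize`, and `resolve` below are hypothetical and not from the talk.

```python
import re

# Hypothetical alias table: normalized surface form -> canonical entity.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
}


def normalize(mention):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", mention.lower())
    return " ".join(cleaned.split())


def resolve(mention):
    """Return the canonical entity for a mention, or None if unknown."""
    return ALIASES.get(normalize(mention))
```

A lookup like this handles exact variants but says nothing about related mentions such as "IBM Watson", which is where the hard part of the problem begins.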

Recommender Systems

Search

http://rt.ly

How do we build it?

Data Engineering.

0. Save your data!

1. Do fancy math.

2. Win - know why this works.

3. Design infrastructure

4. build + test + monitor

5. redesign for speed

https://github.com/bitly/dablooms/issues/35 http://bit.ly/12aUf3F

http://bit.ly/xj57gS

10s of millions of URLs per day
100s of millions of clicks per day
10s of billions of URLs

Language Identification

Classic Approach: Supervised Classification
Take labeled content (Google Translate API, Europarl corpus, etc.) and use a classifier. (sure)

wait, more data?
"es"
"en-us,en;q=0.5"
"pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
"en-gb,en;q=0.5"
"en-US,en;q=0.5"
"es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
"de, en-gb;q=0.9, en;q=0.8"
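
The strings above are browser Accept-Language header values: comma-separated language tags, each with an optional `q` weight that defaults to 1.0. A minimal sketch of turning one header into weighted language votes follows. The helper name `parse_accept_language` is hypothetical, and the talk does not show how bitly actually parsed these headers.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language_tag, weight) pairs.

    Hypothetical helper for illustration. Tags are lowercased, and entries
    without an explicit q value get the default weight 1.0.
    """
    votes = []
    for entry in header.split(","):
        entry = entry.strip()
        if not entry:
            continue
        tag, _, params = entry.partition(";")
        weight = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                continue  # skip entries with a malformed weight
        votes.append((tag.strip().lower(), weight))
    # highest-weight language first; ties keep header order
    votes.sort(key=lambda pair: pair[1], reverse=True)
    return votes
```

Each click on a link then contributes votes for the languages its reader prefers, with no labeled training corpus required.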

use an entropy calculation!

    import numpy as np

    def ghash2lang(g, R, min_count=3, max_entropy=0.2):
        """
        returns the majority vote of a language for a given hash
        """
        # R: client for the sorted set of per-language counts stored at key g
        lang = R.zrevrange(g, 0, 0)[0]
        # let's calculate the entropy!
        # possible languages
        x = R.zrange(g, 0, -1)
        # distribution over those languages
        p = np.array([R.zscore(g, langi) for langi in x])
        p /= p.sum()
        # info content
        I = [pi * np.log(pi) for pi in p]
        # entropy: smaller the more certain we are! - i.e. the lower our surprise
        H = -sum(I) / len(I)  # in nats!
        # note that this will give a perfect zero for a single count in one language
        # or for 5K counts in one language. So we also need the count..
        count = R.zscore(g, lang)
        # trust the vote only with enough counts and low enough entropy
        if count >= min_count and H <= max_entropy:
            return lang, count
        else:
            return None, 1

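The comment about a single count versus 5K counts is easier to see without a datastore in the way. The sketch below applies the same idea to a plain dict of language counts. It is an illustration under assumptions, not the talk's code: the function names are hypothetical, and it uses standard Shannon entropy rather than the per-language average computed on the slide.

```python
import math


def vote_entropy(counts):
    """Shannon entropy (in nats) of a dict mapping language -> count."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)


def majority_language(counts, min_count=3, max_entropy=0.2):
    """Return the top language only if it has enough votes and low entropy."""
    lang = max(counts, key=counts.get)
    if counts[lang] >= min_count and vote_entropy(counts) <= max_entropy:
        return lang
    return None
```

One vote for English and 5,000 votes for English both have zero entropy, so the count threshold is what separates them. An even split between two languages has plenty of votes but high entropy, so it is rejected as well.
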
Realtime Search

Realtime Search
Attributes calculated either at index time or query time. Rankings can vary by second.

What are people paying attention to right now?

actual rate of clicks on phrases vs expected rate of clicks on phrases
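
One simple way to read this comparison is as a ratio of the clicks observed in the current window to a baseline expectation. The sketch below assumes the expected rate is the phrase's historical mean per window and adds a smoothing constant. The function name, its parameters, and the choice of baseline are all assumptions, since the talk does not spell out the calculation here.

```python
def attention_score(current_clicks, history, smoothing=1.0):
    """Ratio of the current window's clicks to the historical mean per window.

    `history` is a list of click counts from earlier windows. `smoothing`
    is added to both sides so that rare phrases do not produce huge ratios.
    """
    expected = sum(history) / len(history) if history else 0.0
    return (current_clicks + smoothing) / (expected + smoothing)
```

A score near 1 means a phrase is getting its usual attention; a score well above 1 means it is getting more clicks right now than its history predicts.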

We calculate clickrate with a sort of moving average: [equation] where [definition]
Dragoneye

We represent [symbol] as a sum of delta spikes. This simplifies to: [equation]
Dragoneye

Choosing [symbol] is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for [symbol] that is a function that sums to 1. The function is 0 at the origin.
Dragoneye
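
The formulas on these slides did not survive in the transcript, so what follows is only one reading of the text. It treats the click stream as a sum of unit spikes at the click times and the click rate as that stream weighted by a kernel, in which case the weighted average reduces to a sum of kernel values at the ages of past clicks. The kernel shape `k * exp(-k / tau)`, its parameters, and every name below are assumptions, picked only because the shape is 0 at the origin, smooth, and easy to normalize to sum to 1. Timestamps are assumed to be integers.

```python
import math


def make_kernel(length=60, tau=10.0):
    """Discrete kernel over integer lags 0..length-1.

    Shape k * exp(-k / tau): exactly 0 at lag 0, rises to a peak near
    lag tau, then decays. Weights are normalized so they sum to 1.
    """
    raw = [k * math.exp(-k / tau) for k in range(length)]
    total = sum(raw)
    return [w / total for w in raw]


def click_rate(click_times, now, kernel):
    """Kernel-weighted click rate at integer time `now`.

    Each click is a unit spike, so the weighted average reduces to
    summing the kernel at each click's age (now - click time).
    """
    rate = 0.0
    for t in click_times:
        age = now - t
        if 0 <= age < len(kernel):
            rate += kernel[age]
    return rate
```

A larger `tau` gives a smoother but slower-reacting rate, and a smaller one reacts faster but is noisier, which is one way to interpret "smooth (but not too smooth)".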

http://rt.ly (it’s realtime, baby)

[email protected] @hmason Thank you!

Hack+Startup Presented by First Round Capital
