

The Beauty of Data Engineering (in 15 minutes or less)

A talk from Hilary Mason at Hack+Startup in New York on Feb 28, 2013.

hackplusstartup.com

First Round Capital

February 28, 2013


Transcript

  1. Hilary Mason, 416 w 13th St, New York, NY 10014
     Hilary Mason, c/o bitly, 416 w 13th St, New York, NY 10014
     Hilary Mason, bit.ly, 416 w 13 St, suite #203, New York City 10014
     Hilary Mason, Chief Scientist, Bitly, 416 West 13th Street, New York, NY 10014
  2. jmstriegel: no, really. I'm quite human.
     jmstriegel: test me if you want
     shymuffin32: ok
     shymuffin32: why do you like music?
     jmstriegel: hmm. i've never really considered that.
     jmstriegel: hell, i'm not going to be able to contrive a good answer for that one. ask me something else.
     shymuffin32: jeesus, you're worse than eliza
     http://bit.ly/b7dter
  3. “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” -- Tom Mitchell, CMU
  4. Data scientists do three things:
     • Write code.
     • Do math.
     • Ask questions. (and answer them.)
  5. Entity disambiguation. This is important. Disambiguation is a very common and very hard problem. For example: IBM, International Business Machines, IBM Watson, TJ Watson Labs.
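The slide names the problem but not a method; a minimal sketch of the naive lookup-table approach (the `ALIASES` table and `canonicalize` helper below are illustrative, not bitly's actual system):

```python
# Naive entity disambiguation via a hand-built canonical alias table.
# Building the table is the hard part in practice: it must be
# curated or learned, and it grows with every new surface form.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "ibm watson": "IBM",
    "tj watson labs": "IBM",
}

def canonicalize(name):
    """Map a surface form to its canonical entity, or None if unknown."""
    return ALIASES.get(name.strip().lower())

print(canonicalize("International Business Machines"))  # IBM
print(canonicalize("Apple"))  # None -> unknown entity
```

A real system has to handle surface forms it has never seen, which is why disambiguation is hard: exact-match lookup fails the moment someone writes "I.B.M.".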
  6. 10s of millions of URLs per day. 100s of millions of clicks per day. 10s of billions of URLs.
  7. use an entropy calculation!

     import numpy as np

     def ghash2lang(g, R, min_count=3, max_entropy=0.2):
         """
         Returns the majority vote of a language for a given hash.
         R is a Redis client; g is the hash's sorted-set key.
         """
         lang = R.zrevrange(g, 0, 0)[0]
         # let's calculate the entropy!
         # possible languages
         x = R.zrange(g, 0, -1)
         # distribution over those languages
         p = np.array([R.zscore(g, langi) for langi in x])
         p /= p.sum()
         # info content
         I = [pi * np.log(pi) for pi in p]
         # entropy: the smaller, the more certain we are - i.e. the lower our surprise
         H = -sum(I) / len(I)  # in nats!
         # note that this will give a perfect zero for a single count in one language
         # or for 5K counts in one language, so we also need the count
         count = R.zscore(g, lang)
         # accept the majority vote only with enough evidence and low enough entropy
         if count >= min_count and H <= max_entropy:
             return lang, count
         else:
             return None, 1
  8. Choosing is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for that: a function that sums to 1. The function is 0 at the origin. Dragoneye
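The slide does not name the distribution; a gamma density is one function with the stated properties (it integrates to 1, is 0 at the origin whenever its shape parameter exceeds 1, and its smoothness is tunable), so this sketch assumes a gamma purely for illustration:

```python
import math

def gamma_pdf(t, shape=2.0, scale=1.0):
    """Gamma density: integrates to 1 over t >= 0 and equals 0 at
    t = 0 whenever shape > 1. Shape and scale tune the smoothness,
    matching the slide's "smooth, but not too smooth" requirement."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)
            / (math.gamma(shape) * scale ** shape))

print(gamma_pdf(0.0))  # 0.0 at the origin
print(gamma_pdf(1.0))  # e**-1, about 0.368, near the peak for shape=2
```

Any density with these properties would do; the appeal of a parametric family like this is that the parameters stay interpretable while controlling how aggressively the curve smooths the data.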