The Beauty of Data Engineering (in 15 minutes or less)

A talk from Hilary Mason at Hack+Startup in New York on Feb 28, 2013.


First Round Capital

February 28, 2013


  1. Hilary Mason, Chief Scientist, bitly. Speaker. @hmason

  2. Hilary Mason, Chief Scientist, bitly. @hmason. The Beauty of Data Engineering (in 15 mins or less)
  3. Hello, Hack+Startup! (thanks, FirstRound)

  4. None
  5. Hilary Mason 416 w 13th St New York, NY 10014

    Hilary Mason c/o bitly 416 w 13th St New York, NY 10014
    Hilary Mason 416 w 13St suite #203 New York City 10014
    Hilary Mason Chief Scientist Bitly 416 west 13th Street New York, NY 10014
  6. None
  7. + + sum() [1,2,3]

  8. agreed?

  9. [archive photo]

  10. None
  11. None
  12. ELIZA

  13. jmstriegel: no, really. I'm quite human.

    jmstriegel: test me if you want
    shymuffin32: ok
    shymuffin32: why do you like music?
    jmstriegel: hmm. i've never really considered that.
    jmstriegel: hell, i'm not going to be able to contrive a good answer for that one. ask me something else.
    shymuffin32: jeesus, you're worse than eliza
  14. “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” -- Tom Mitchell, CMU
  15. [this is working]

  16. BIG data?

  17. [excel]

  18. None
  19. Data BIG

  20. useful BIG

  21. small BIG

  22. None
  23. Data scientists do three things:

    • Write code.
    • Do math.
    • Ask questions. (and answer them.)
  24. None
  25. a brief tour of the vocabulary and current thinking in applied machine learning
  26. supervised vs unsupervised

  27. classification vs clustering

  28. [spam folders]

  29. Entity disambiguation. This is important.

  30. Text Me UGLY HAG

  31. UGLY HAG Me

  32. Entity disambiguation. This is important. Disambiguation is a very common and very hard problem. For example: IBM, International Business Machines, IBM Watson, TJ Watson Labs
  33. Recommender Systems

  34. None
  35. Search


  37. How do we build it?

  38. Data Engineering.

  39. None
  40. 0. Save your data!

  41. 1. Do fancy math.

  42. 2. Win - know why this works.

  43. 3. Design infrastructure

  44. 4. build + test + monitor

  45. 5. redesign for speed

  46. None

  48. http://

  49. None
  50. None
  51. None
  52. None
  53. 10s of millions of URLs per day

    100s of millions of clicks per day
    10s of billions of URLs
  54. None
  55. Language Identification

  56. Classic Approach: Supervised Classification. Take labeled content (google translate API, europarl corpus, etc.) and use a classifier. (sure)
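The classic approach on slide 56 can be sketched as a tiny character-trigram naive Bayes classifier. This is only an illustration: the class name, the toy training sentences, and the choice of trigrams with add-one smoothing are assumptions, not anything shown in the talk, and a real system would train on a large labeled corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageClassifier:
    """Naive Bayes over character n-grams with add-one smoothing (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()

    def fit(self, labeled_texts):
        for text, lang in labeled_texts:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of smoothed log-likelihoods; a uniform prior over languages is assumed.
            score = sum(
                math.log((counter[g] + 1) / (total + vocab_size)) for g in grams
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy labeled data, invented for this sketch.
TOY_TRAINING_DATA = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("what are people paying attention to right now", "en"),
    ("el rápido zorro marrón salta sobre el perro perezoso", "es"),
    ("a qué le presta atención la gente ahora mismo", "es"),
]

classifier = NgramLanguageClassifier().fit(TOY_TRAINING_DATA)
print(classifier.predict("the lazy dog jumps"))
print(classifier.predict("el perro perezoso salta"))
```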
  57. None
  58. wait, more data?

    "es"
    "en-us,en;q=0.5"
    "pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"
    "en-gb,en;q=0.5"
    "en-US,en;q=0.5"
    "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3"
    "de, en-gb;q=0.9, en;q=0.8"
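The strings on slide 58 look like HTTP Accept-Language headers sent by the browsers of people clicking a link. A minimal sketch of turning one header into a language vote follows; the function names are hypothetical and the parsing is deliberately simplified (it handles only `tag;q=value` items).

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending preference."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit q-value defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue  # skip malformed items rather than guess
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so ties keep the header's original order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def primary_language(header):
    """Return the base language (e.g. 'pt' for 'pt-BR') of the top preference."""
    prefs = parse_accept_language(header)
    return prefs[0][0].split("-")[0] if prefs else None


print(primary_language("pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4"))
print(parse_accept_language("de, en-gb;q=0.9, en;q=0.8"))
```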
  59. None
  60. use an entropy calculation!

    import numpy as np

    def ghash2lang(g, Ri, min_count=3, max_entropy=0.2):
        """
        returns the majority vote of a language for a given hash
        """
        # R is assumed to be a Redis client defined elsewhere; each hash g
        # is a sorted set of language -> count
        lang = R.zrevrange(g, 0, 0)[0]
        # let's calculate the entropy!
        # possible languages
        x = R.zrange(g, 0, -1)
        # distribution over those languages
        p = np.array([R.zscore(g, langi) for langi in x])
        p /= p.sum()
        # info content
        I = [pi * np.log(pi) for pi in p]
        # entropy: smaller the more certain we are! - i.e. the lower our surprise
        H = -sum(I) / len(I)  # in nats!
        # note that this will give a perfect zero for a single count in one language
        # or for 5K counts in one language. So we also need the count..
        count = R.zscore(g, lang)
        # trust the vote only with enough counts and low enough entropy
        if count >= min_count and H <= max_entropy:
            return lang, count
        else:
            return None, 1
  61. None
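The entropy check on slide 60 can be tried without Redis. The sketch below is an assumption-laden stand-in: it takes a plain list of language votes, uses the standard Shannon entropy in nats (the slide additionally divides by the number of languages, so the thresholds are not directly comparable), and the function name is hypothetical.

```python
import math
from collections import Counter


def majority_language(votes, min_count=3, max_entropy=0.2):
    """Return (language, count) if the vote is confident, else (None, count)."""
    counts = Counter(votes)
    if not counts:
        return None, 0
    lang, count = counts.most_common(1)[0]
    total = sum(counts.values())
    # Shannon entropy of the vote distribution: 0 when everyone agrees
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # A single vote also has zero entropy, so the count is checked as well
    if count >= min_count and entropy <= max_entropy:
        return lang, count
    return None, count


print(majority_language(["en"] * 50 + ["es"]))
print(majority_language(["en", "en", "es", "es", "pt"]))
print(majority_language(["en"]))
```

The first call has a clear majority and low entropy, the second is too mixed, and the third agrees perfectly but has too few votes to trust.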
  62. Realtime Search

  63. Realtime Search. Attributes calculated either at index time or query time. Rankings can vary by second.
  64. What are people paying attention to right now?

  65. actual rate of clicks on phrases vs expected rate of clicks on phrases
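One way to read "actual vs expected" on slide 65 is as a ratio per phrase. The sketch below assumes that reading; the smoothing constant, the function names, and the example numbers are all invented for illustration and do not come from the talk.

```python
def attention_score(observed_clicks, expected_clicks, smoothing=1.0):
    """Ratio of observed to expected clicks; smoothing avoids division by zero."""
    return (observed_clicks + smoothing) / (expected_clicks + smoothing)


def trending_phrases(observed, expected, top_n=3):
    """Rank phrases by how far their click rate exceeds the expected rate."""
    scores = {
        phrase: attention_score(clicks, expected.get(phrase, 0.0))
        for phrase, clicks in observed.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up counts: "weather" is popular but behaving normally, "meteor" is spiking.
observed = {"weather": 1000, "meteor": 300}
expected = {"weather": 1000, "meteor": 10}
print(trending_phrases(observed, expected))
```

A phrase with many clicks but an equally high baseline scores near 1, while a normally quiet phrase that suddenly gets clicks rises to the top.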
  66. We calculate clickrate with a sort of moving average: where

  67. We represent as a sum of delta spikes. This simplifies to: Dragoneye
  68. Choosing is important. It must be interpretable, and smooth (but not too smooth). We use a distribution for that is a function that sums to 1. The function is 0 at the origin. Dragoneye
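The formulas on slides 66 to 68 did not survive in this transcript, so the following is only a sketch of the general idea: treat each click as a spike at its timestamp and sum a smoothing kernel over the clicks. The specific kernel (a gamma-shaped curve with time constant `tau`) is an assumption chosen because it matches the two properties the slide states: it integrates to 1 and is 0 at the origin. The function names and the value of `tau` are hypothetical.

```python
import math


def kernel(age, tau=60.0):
    """Weight of a click that happened `age` seconds ago.

    (age / tau**2) * exp(-age / tau) integrates to 1 over age >= 0
    and is 0 at age == 0. Clicks in the future get no weight.
    """
    if age < 0:
        return 0.0
    return (age / tau ** 2) * math.exp(-age / tau)


def click_rate(click_times, now, tau=60.0):
    """Smoothed click rate at `now`: one kernel contribution per click."""
    return sum(kernel(now - t, tau) for t in click_times)


clicks = [0.0, 1.0, 2.0]  # made-up click timestamps, in seconds
print(click_rate(clicks, now=60.0))   # shortly after the burst
print(click_rate(clicks, now=600.0))  # long after the burst
```

A larger `tau` gives a smoother but slower-reacting rate, which is the "smooth, but not too smooth" trade-off the slide mentions.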
  69. (it’s realtime, baby)

  70. None
  71. None
  72. None
  73. None
  74. None
  75. None
  76. None
  77. @hmason Thank you!

  78. Hack+Startup Presented by First Round Capital