Scaling by Cheating: Approximation, Sampling and Fault-friendliness for Scalable Big Learning

Talk by Sean Owen (@Myrrix) at the Data Science London (@ds-ldn) meetup

Data Science London

September 09, 2013

Transcript

  1. “Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.” (David, Sr. IT Manager)
  2. “Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.” (Shelly, CTO)
  3. “Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.” (“Scooter”, White Lab)
  4. KIRK: What would you say the odds are on our getting out of here?
     SPOCK: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
     KIRK: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
     SPOCK: Seven thousand eight hundred twenty four point seven to one.
     KIRK: That's a pretty close approximation.
     (Star Trek, “Errand of Mercy”; image: http://www.redbubble.com/people/feelmeflow)
  5. THE MEAN
     • Huge number of data points: x1 … xN
     • Independent, drawn from one roughly normal distribution
     • “True” population mean µ?
     • Best estimate is the sample mean: µ_sample = (x1 + … + xN) / N
  6. “CLOSE ENOUGH” MEAN
     • Tight-enough estimate of µ?
     • With high confidence (> p), µ = (1 ± ε) µ_sample
     • Student’s t distribution with n−1 degrees of freedom (see the sketch below)
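
As a hedged illustration of slides 5 and 6 (this is not code from the talk; the function name and the synthetic data are my own), here is the sample mean with a Student's t confidence interval:

```python
# Sketch of slides 5-6: estimate a population mean from a sample and
# report a Student's t confidence interval around the sample mean.
import numpy as np
from scipy import stats

def mean_with_confidence(sample, confidence=0.95):
    """Return (sample mean, half-width of the t confidence interval)."""
    n = len(sample)
    mean = np.mean(sample)
    sem = stats.sem(sample)  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)  # n-1 degrees of freedom
    return mean, t_crit * sem

# Synthetic stand-in for a "huge number of data points"
sample = np.random.normal(loc=10.0, scale=2.0, size=1000)
mean, half_width = mean_with_confidence(sample)
print(f"mean = {mean:.3f} +/- {half_width:.3f} at 95% confidence")
```

Dividing the half-width by the mean gives the ε in slide 6's µ = (1 ± ε) µ_sample form.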
  7. COMMON CRAWL
     • s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
     • Count words, proper nouns, etc. in a 35 GB subset of the crawl (see the sketch below)
     • github.com/srowen/commoncrawl
     • Amazon EMR
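
The actual job lives in github.com/srowen/commoncrawl; purely as an illustration of the deck's sampling theme (everything below is my own, not the repo's code), counting a random fraction of the input and scaling up looks like this:

```python
# Illustration only, not the github.com/srowen/commoncrawl Hadoop job:
# count words over a random sample of lines, then scale up the counts.
import random
from collections import Counter

def approximate_word_counts(lines, sample_rate=0.01, seed=1):
    """Count words in ~sample_rate of the lines; scale to estimate totals."""
    rng = random.Random(seed)
    sampled = Counter()
    for line in lines:
        if rng.random() < sample_rate:
            sampled.update(line.split())
    # Scale the sampled counts up to an estimate over the full input
    return {word: round(count / sample_rate) for word, count in sampled.items()}
```

Frequent words get tight estimates (the confidence bound from slide 6 applies); rare words may be missed entirely, which is the accuracy-for-speed trade the deck is about.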
  8. PRUNING EXPERIMENT
     • Libimseti data set
     • 135K users x 165K items
     • 17M data points
     • Ratings on a scale of 1-10
     • Compute item-item Pearson correlation (see the sketch below)
     • Amazon EMR: 2 x m1.xlarge
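
A toy, in-memory version of this computation (hypothetical; the real experiment ran as a Hadoop job on EMR), with the two prunings the next slide varies: a cap on preferences per user and a correlation threshold.

```python
# Hypothetical sketch: item-item Pearson correlation with two prunings --
# skip users with too many preferences, and drop weakly correlated pairs.
from itertools import combinations
from statistics import correlation, StatisticsError  # Python 3.10+

def item_item_similarities(prefs, threshold=0.3, max_prefs_per_user=100):
    """prefs: {user: {item: rating}} -> {(item_a, item_b): correlation}."""
    by_item = {}
    for user, items in prefs.items():
        if len(items) > max_prefs_per_user:
            continue  # pruning 1: ignore overly prolific users
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating

    sims = {}
    for a, b in combinations(sorted(by_item), 2):
        common = by_item[a].keys() & by_item[b].keys()
        if len(common) < 2:
            continue  # Pearson needs at least two co-rating users
        xs = [by_item[a][u] for u in common]
        ys = [by_item[b][u] for u in common]
        try:
            r = correlation(xs, ys)
        except StatisticsError:
            continue  # zero variance on one side: correlation undefined
        if r > threshold:
            sims[(a, b)] = r  # pruning 2: keep only strong correlations
    return sims
```

The per-item pair cap from the results slide could be added the same way, by keeping only the top-N correlations per item.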
  9. RESULTS

     NO PRUNING:
     • 0 threshold
     • <10,000 pairs per item
     • <1,000 prefs per user
     • 178 minutes
     • 20,400 MB output

     PRUNING:
     • >0.3 threshold
     • <10 pairs per item
     • <100 prefs per user
     • 11 minutes
     • 2 MB output
  10. FAULTS THREATEN…

      CONSISTENCY:
      • Give the same answer in the same state
      • But answers might be approximate
      • Many answers are “close”
      • OK to give inconsistent answers?

      DURABILITY:
      • Data should not disappear
      • But deleting data often has …
  11. DESIGN FOR…

      FAST AVAILABILITY:
      • Multiple replicas
      • Replicas need not have a consistent view
      • Clients get a consistent view through smart load balancing (see the sketch below)

      FAST, 99% DURABILITY:
      • Push data into a durable store
      • Buffer a little locally
      • Tolerate loss of “a little”
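
One way to read “consistent view through smart load balancing” (my sketch, not Myrrix's implementation): pin each client to a single replica by hashing its ID, so replicas may disagree with one another but no single client ever sees two different answers.

```python
# Sketch (an assumption, not Myrrix's code): sticky routing. Replicas may
# hold slightly different approximate answers; hashing the client ID to a
# fixed replica keeps each individual client's view consistent.
import hashlib

REPLICAS = ["replica-a:8080", "replica-b:8080", "replica-c:8080"]  # hypothetical hosts

def replica_for(client_id: str) -> str:
    """Always route the same client to the same replica."""
    digest = hashlib.sha1(client_id.encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

assert replica_for("alice") == replica_for("alice")  # stable per client
```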
  12. ?