Slide 1

Slide 1 text

Scaling by Cheating: Approximation, Sampling and Fault-Friendliness

Slide 2

Slide 2 text

TWO BIG PROBLEMS

Slide 3

Slide 3 text

“Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.”
— David, Sr. IT Manager

Slide 4

Slide 4 text

“Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.”
— Shelly, CTO

Slide 5

Slide 5 text

TWO BIG SOLUTIONS

Slide 6

Slide 6 text

“Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.”
— “Scooter”, White Lab

Slide 7

Slide 7 text

Cheating

Slide 8

Slide 8 text

KIRK: What would you say the odds are on our getting out of here?
SPOCK: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
KIRK: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
SPOCK: Seven thousand eight hundred twenty four point seven to one.
KIRK: That’s a pretty close approximation.
— Star Trek, “Errand of Mercy”
(image: http://www.redbubble.com/people/feelmeflow)

Slide 9

Slide 9 text

WHEN TO CHEAT: APPROXIMATE
• Only a few significant digits matter

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

APPROXIMATION

Slide 12

Slide 12 text

THE MEAN
• Huge number of data points: x1 … xN
• Independent, from one roughly-normal distribution
• “True” population mean µ?
• Best estimate is sample mean:
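The formula image on this slide did not survive extraction; in the slide’s notation it is the standard sample mean:

```latex
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```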

Slide 13

Slide 13 text

“CLOSE ENOUGH” MEAN
• Tight-enough estimate of µ?
• With high confidence (> p), µ = (1 ± ε) µ_sample
• Student’s t distribution with n−1 degrees of freedom
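Made concrete: with sample mean x̄_n and sample standard deviation s_n over n points, the two-sided Student’s t interval gives, with confidence p,

```latex
\mu \;\in\; \bar{x}_n \,\pm\, t_{n-1,\,(1+p)/2}\,\frac{s_n}{\sqrt{n}}
\qquad\text{so stop sampling once}\qquad
t_{n-1,\,(1+p)/2}\,\frac{s_n}{\sqrt{n}} \;\le\; \varepsilon\,\lvert\bar{x}_n\rvert
```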

Slide 14

Slide 14 text

SAMPLING

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

WORD COUNT: TOY EXAMPLE
• Input: text documents
• Exactly how many times does each word occur?
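For reference, the toy version is the classic Hadoop word count; a minimal sketch using the standard Hadoop MapReduce API (class names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every token in every input document.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Sums the 1s to produce an exact per-word count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```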

Slide 17

Slide 17 text

WORD COUNT: USEFUL EXAMPLE
• Input: text documents
• About how many times does each word occur?

Slide 18

Slide 18 text

COMMON CRAWL
• s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
• Count words, Proper Nouns, etc. in 35GB subset of crawl
• github.com/srowen/commoncrawl
• Amazon EMR
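In a Hadoop job this input glob can be wired in directly (a sketch: Job.getInstance() is the Hadoop 2 API, and all other job configuration is omitted):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CrawlJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();  // mapper/reducer/output config omitted
    // The wildcard glob expands to every crawl segment's text data.
    FileInputFormat.addInputPath(job, new Path(
        "s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*"));
  }
}
```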

Slide 19

Slide 19 text

RAW RESULTS
• 40 minutes
• 40.1% Proper Nouns
• Most frequent words:

Slide 20

Slide 20 text

SAMPLE 10% OF DOCUMENTS
• 21 minutes
• 39.9% Proper Nouns
• Most frequent words:
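One way to implement the 10% sample (a hypothetical sketch, not the deck’s actual code) is to decide per document, from a hash of its key, whether to process it; this assumes SequenceFile input keyed by document ID:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word count over ~10% of documents: hash the document key and keep
// roughly 1 in 10. Hashing (rather than Random) keeps the sample
// deterministic across task retries.
public class SamplingWordCountMapper extends Mapper<Text, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();
  @Override
  protected void map(Text docId, Text doc, Context context)
      throws IOException, InterruptedException {
    if (Math.floorMod(docId.hashCode(), 10) != 0) {
      return;  // skip ~90% of documents
    }
    StringTokenizer tokens = new StringTokenizer(doc.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}
```

Counts from the sample are then scaled by 10 to estimate full-corpus counts; ratios like the Proper Noun percentage need no scaling at all.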

Slide 21

Slide 21 text

STOP WHEN “CLOSE ENOUGH”
• CloseEnoughMean.java
• Stop mapping documents when the mean is close enough (see the sketch below)
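The real CloseEnoughMean.java lives in the repo cited earlier; a minimal standalone sketch of the idea, using Apache Commons Math for the t quantile:

```java
import org.apache.commons.math3.distribution.TDistribution;
import org.apache.commons.math3.stat.descriptive.SummaryStatistics;

// Accumulates values and reports when the p-confidence interval for
// the mean has shrunk within relative tolerance epsilon, i.e. when
// mu = (1 +/- epsilon) * sampleMean holds with confidence > p.
public class CloseEnoughMean {
  private final double p;        // e.g. 0.95
  private final double epsilon;  // e.g. 0.01
  private final SummaryStatistics stats = new SummaryStatistics();

  public CloseEnoughMean(double p, double epsilon) {
    this.p = p;
    this.epsilon = epsilon;
  }

  public void add(double x) {
    stats.addValue(x);
  }

  public boolean isCloseEnough() {
    long n = stats.getN();
    if (n < 2) {
      return false;
    }
    // Two-sided t quantile with n-1 degrees of freedom
    double t = new TDistribution(n - 1)
        .inverseCumulativeProbability(1.0 - (1.0 - p) / 2.0);
    double halfWidth = t * stats.getStandardDeviation() / Math.sqrt(n);
    return halfWidth <= epsilon * Math.abs(stats.getMean());
  }
}
```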

Slide 22

Slide 22 text

MORE SAMPLING

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

ITEM-ITEM SIMILARITY
• Input: user-item ratings, click counts
• Compute item-item similarity
• Takes more than …

Slide 25

Slide 25 text

PRUNING
• ItemSimilarityJob
• --threshold
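A matching invocation might look like the following (a sketch: the option names follow Mahout’s ItemSimilarityJob, but paths and values are illustrative; verify flags against your Mahout version):

```
mahout itemsimilarity \
  --input ratings.csv \
  --output similarity-output \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION \
  --threshold 0.3 \
  --maxSimilaritiesPerItem 10 \
  --maxPrefsPerUser 100
```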

Slide 26

Slide 26 text

PRUNING EXPERIMENT
• Libimseti data set
• 135K users × 165K items
• 17M data points
• Rating on scale of 1–10
• Compute item-item Pearson correlation
• Amazon EMR: 2 × m1.xlarge
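For reference, the statistic computed between two items, over the users who rated both (a plain-Java sketch of the math, not Mahout’s distributed implementation):

```java
// Pearson correlation between two items' ratings over co-rating users.
// xs[i] and ys[i] are the two ratings from the same (i-th) user.
public static double pearson(double[] xs, double[] ys) {
  int n = xs.length;
  double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
  for (int i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
    sumXX += xs[i] * xs[i];
    sumYY += ys[i] * ys[i];
    sumXY += xs[i] * ys[i];
  }
  double cov = sumXY - sumX * sumY / n;
  double sdX = Math.sqrt(sumXX - sumX * sumX / n);
  double sdY = Math.sqrt(sumYY - sumY * sumY / n);
  return cov / (sdX * sdY);  // NaN if either item has constant ratings
}
```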

Slide 27

Slide 27 text

RESULTS

                  NO PRUNING     PRUNING
Threshold         0              >0.3
Pairs per item    <10000         <10
Prefs per user    <1000          <100
Running time      178 minutes    11 minutes
Output size       20,400 MB      2 MB

Slide 28

Slide 28 text

FAULT-FRIENDLINESS

Slide 29

Slide 29 text

Not a Bank

Slide 30

Slide 30 text

FAULTS THREATEN…

CONSISTENCY
• Give same answer in same state
• But answer might be approximate
• Many answers are “close”
• OK to give inconsistent answers?

DURABILITY
• Data should not disappear
• But deleting data often has …

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

DESIGN FOR…

FAST AVAILABILITY
• Multiple replicas
• Need not have a consistent view
• Clients have consistent view through smart load balancing

FAST 99% DURABILITY
• Push data into durable store
• Buffer a little locally
• Tolerate loss of “a little”
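A sketch of the “buffer a little locally” half of this design (all names hypothetical): writers never block on the durable store; a small bounded buffer absorbs bursts, and whatever sits in the buffer when the process dies is the “little” that may be lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Fast 99% durability: a bounded in-memory buffer in front of the
// durable store. At most `capacity` events are lost on a crash.
public class MostlyDurableWriter {
  public interface DurableStore {           // hypothetical sink
    void writeBatch(List<String> events);
  }

  private final BlockingQueue<String> buffer;
  private final DurableStore store;

  public MostlyDurableWriter(DurableStore store, int capacity) {
    this.store = store;
    this.buffer = new ArrayBlockingQueue<>(capacity);
    Thread drainer = new Thread(this::drainLoop, "drainer");
    drainer.setDaemon(true);
    drainer.start();
  }

  // Non-blocking: drops the event if the buffer is full ("a little" loss).
  public boolean write(String event) {
    return buffer.offer(event);
  }

  private void drainLoop() {
    List<String> batch = new ArrayList<>();
    while (true) {
      try {
        batch.add(buffer.take());           // wait for at least one event
        buffer.drainTo(batch, 99);          // then grab up to a batch
        store.writeBatch(batch);
        batch.clear();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }
}
```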

Slide 33

Slide 33 text

RESOURCES
• Apache Mahout

Slide 34

Slide 34 text

?