Scaling by Cheating: Approximation, Sampling and Fault-friendliness for Scalable Big Learning

Talk by Sean Owen (@Myrrix) at the Data Science London (@ds-ldn) meetup

Data Science London

September 09, 2013

Transcript

  1. “Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.” (David, Sr. IT Manager)
  2. “Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.” (Shelly, CTO)
  3. “Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.” (“Scooter”, White Lab)
  4. KIRK: What would you say the odds are on our getting out of here?
     SPOCK: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
     KIRK: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
     SPOCK: Seven thousand eight hundred twenty four point seven to one.
     KIRK: That's a pretty close approximation.
     (Star Trek, “Errand of Mercy”; image: http://www.redbubble.com/people/feelmeflow)
  5. THE MEAN
     • Huge number of data points: x1 … xN
     • Independent, drawn from one roughly normal distribution
     • “True” population mean µ?
     • Best estimate is the sample mean: µ_sample = (x1 + … + xN) / N
  6. “CLOSE ENOUGH” MEAN
     • Tight-enough estimate of µ?
     • With high confidence (> p), µ = (1 ± ε) µ_sample
     • Student’s t distribution with n−1 degrees of freedom (see the sketch below)
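
As a hedged illustration of slides 5 and 6 (this is not code from the talk; the function name and the synthetic data are my own), here is the sample mean with a Student's t confidence interval:

```python
# Sketch of slides 5-6: estimate a population mean from a sample and
# report a Student's t confidence interval around the sample mean.
import numpy as np
from scipy import stats

def mean_with_confidence(sample, confidence=0.95):
    """Return (sample mean, half-width of the t confidence interval)."""
    n = len(sample)
    mean = np.mean(sample)
    sem = stats.sem(sample)  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)  # n-1 degrees of freedom
    return mean, t_crit * sem

# Synthetic stand-in for a "huge number of data points"
sample = np.random.normal(loc=10.0, scale=2.0, size=1000)
mean, half_width = mean_with_confidence(sample)
print(f"mean = {mean:.3f} +/- {half_width:.3f} at 95% confidence")
```

Dividing the half-width by the mean gives the ε in slide 6's µ = (1 ± ε) µ_sample form.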
  7. COMMON CRAWL
     • s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
     • Count words, proper nouns, etc. in a 35 GB subset of the crawl (see the sketch below)
     • github.com/srowen/commoncrawl
     • Amazon EMR
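
The actual job lives in github.com/srowen/commoncrawl; purely as an illustration of the deck's sampling theme (everything below is my own, not the repo's code), counting a random fraction of the input and scaling up looks like this:

```python
# Illustration only, not the github.com/srowen/commoncrawl Hadoop job:
# count words over a random sample of lines, then scale up the counts.
import random
from collections import Counter

def approximate_word_counts(lines, sample_rate=0.01, seed=1):
    """Count words in ~sample_rate of the lines; scale to estimate totals."""
    rng = random.Random(seed)
    sampled = Counter()
    for line in lines:
        if rng.random() < sample_rate:
            sampled.update(line.split())
    # Scale the sampled counts up to an estimate over the full input
    return {word: round(count / sample_rate) for word, count in sampled.items()}
```

Frequent words get tight estimates (the confidence bound from slide 6 applies); rare words may be missed entirely, which is the accuracy-for-speed trade the deck is about.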
  8. PRUNING EXPERIMENT
     • Libimseti data set
     • 135K users x 165K items
     • 17M data points
     • Ratings on a scale of 1-10
     • Compute item-item Pearson correlation (see the sketch below)
     • Amazon EMR: 2 x m1.xlarge
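
A toy, in-memory version of this computation (hypothetical; the real experiment ran as a Hadoop job on EMR), with the two prunings the next slide varies: a cap on preferences per user and a correlation threshold.

```python
# Hypothetical sketch: item-item Pearson correlation with two prunings --
# skip users with too many preferences, and drop weakly correlated pairs.
from itertools import combinations
from statistics import correlation, StatisticsError  # Python 3.10+

def item_item_similarities(prefs, threshold=0.3, max_prefs_per_user=100):
    """prefs: {user: {item: rating}} -> {(item_a, item_b): correlation}."""
    by_item = {}
    for user, items in prefs.items():
        if len(items) > max_prefs_per_user:
            continue  # pruning 1: ignore overly prolific users
        for item, rating in items.items():
            by_item.setdefault(item, {})[user] = rating

    sims = {}
    for a, b in combinations(sorted(by_item), 2):
        common = by_item[a].keys() & by_item[b].keys()
        if len(common) < 2:
            continue  # Pearson needs at least two co-rating users
        xs = [by_item[a][u] for u in common]
        ys = [by_item[b][u] for u in common]
        try:
            r = correlation(xs, ys)
        except StatisticsError:
            continue  # zero variance on one side: correlation undefined
        if r > threshold:
            sims[(a, b)] = r  # pruning 2: keep only strong correlations
    return sims
```

The per-item pair cap from the results slide could be added the same way, by keeping only the top-N correlations per item.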
  9. RESULTS

     NO PRUNING:
     • 0 threshold
     • <10,000 pairs per item
     • <1,000 prefs per user
     • 178 minutes
     • 20,400 MB output

     PRUNING:
     • >0.3 threshold
     • <10 pairs per item
     • <100 prefs per user
     • 11 minutes
     • 2 MB output
  10. FAULTS THREATEN…

      CONSISTENCY:
      • Give the same answer in the same state
      • But answers might be approximate
      • Many answers are “close”
      • OK to give inconsistent answers?

      DURABILITY:
      • Data should not disappear
      • But deleting data often has …
  11. DESIGN FOR…

      FAST AVAILABILITY:
      • Multiple replicas
      • Replicas need not have a consistent view
      • Clients get a consistent view through smart load balancing (see the sketch below)

      FAST, 99% DURABILITY:
      • Push data into a durable store
      • Buffer a little locally
      • Tolerate loss of “a little”
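
One way to read “consistent view through smart load balancing” (my sketch, not Myrrix's implementation): pin each client to a single replica by hashing its ID, so replicas may disagree with one another but no single client ever sees two different answers.

```python
# Sketch (an assumption, not Myrrix's code): sticky routing. Replicas may
# hold slightly different approximate answers; hashing the client ID to a
# fixed replica keeps each individual client's view consistent.
import hashlib

REPLICAS = ["replica-a:8080", "replica-b:8080", "replica-c:8080"]  # hypothetical hosts

def replica_for(client_id: str) -> str:
    """Always route the same client to the same replica."""
    digest = hashlib.sha1(client_id.encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

assert replica_for("alice") == replica_for("alice")  # stable per client
```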
  12. ?