“Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.”
— David, Sr. IT Manager
“Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.”
— Shelly, CTO
TWO BIG SOLUTIONS
“Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.”
— “Scooter”, White Lab
Cheating
Kirk: What would you say the odds are on our getting out of here?
Spock: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
Kirk: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
Spock: Seven thousand eight hundred twenty four point seven to one.
Kirk: That’s a pretty close approximation.
— Star Trek, “Errand of Mercy”
http://www.redbubble.com/people/feelmeflow
WHEN TO CHEAT: APPROXIMATE
Only a few significant
APPROXIMATION
THE MEAN
Huge number of data points: x1 … xN
Independent, from one roughly-normal distribution
“True” population mean µ?
Best estimate is the sample mean: x̄ = (x1 + … + xN) / N
“CLOSE ENOUGH” MEAN
Tight-enough estimate of µ?
With high confidence (> p), µ ∈ (1 ± ε) · µsample
Student’s t distribution with n−1 degrees of freedom
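Spelled out, the stopping criterion the slide implies looks like this (notation as above; t here is the two-sided Student’s t quantile for confidence p with n−1 degrees of freedom):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
% Stop sampling once the confidence interval is tight enough,
% i.e. the half-width falls within a fraction eps of the mean:
t_{n-1,\,p}\,\frac{s}{\sqrt{n}} \;\le\; \varepsilon \left|\bar{x}\right|
```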
SAMPLING
WORD COUNT: TOY EXAMPLE
Input: text documents
Exactly how many times does each word occur?
WORD COUNT: USEFUL EXAMPLE
Input: text documents
About how many times does each word occur?
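One way to get an “about” answer is to count the word in a random fraction of the documents and scale up. A minimal sketch (the class and method names, the sampling rate, and the toy corpus are illustrative, not from the talk’s actual code):

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: estimate a word's total count by processing only a random
// sample of documents and scaling the sampled count back up.
public class SampledWordCount {

    // Count occurrences of `word` in roughly `rate` fraction of docs,
    // then divide by `rate` to estimate the count over all docs.
    static long approxCount(List<String> docs, String word, double rate, long seed) {
        Random rnd = new Random(seed);
        long sampled = 0;
        for (String doc : docs) {
            if (rnd.nextDouble() >= rate) continue;   // skip ~(1 - rate) of docs
            for (String token : doc.split("\\s+")) {
                if (token.equals(word)) sampled++;
            }
        }
        return Math.round(sampled / rate);
    }

    public static void main(String[] args) {
        // Toy corpus: 10,000 identical docs, each containing "the" twice,
        // so the exact answer is 20,000.
        List<String> docs = Collections.nCopies(10_000, "the cat sat on the mat");
        System.out.println(approxCount(docs, "the", 0.1, 1L));
    }
}
```

Only about a tenth of the documents are touched, and the estimate lands close to the true 20,000 with high probability.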
COMMONS CRAWL
s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
Count words, Proper Nouns, etc. in 35GB subset of crawl
github.com/srowen/commoncrawl
Amazon EMR
RAW RESULTS
40 minutes
40.1% Proper Nouns
Most frequent words:
SAMPLE 10% OF DOCUMENTS
21 minutes
39.9% Proper Nouns
Most frequent words:
STOP WHEN “CLOSE ENOUGH”
CloseEnoughMean.java
Stop mapping documents once the estimate is already “close enough”
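A minimal sketch of such a stopping rule (this is not the actual CloseEnoughMean.java; it substitutes the large-sample normal quantile z ≈ 1.96 for an exact Student’s t lookup to stay dependency-free, and a synthetic Gaussian stream stands in for per-document counts):

```java
import java.util.Random;

// Sketch of a "close enough" mean: keep consuming data points until the
// confidence interval half-width is within epsilon of the running mean,
// then stop early instead of processing everything.
public class CloseEnoughMeanSketch {

    // Returns the sample mean once (1 +/- epsilon) * mean brackets the
    // true mean with ~95% confidence, or after maxSamples at the latest.
    static double closeEnoughMean(Random data, double epsilon, int maxSamples) {
        double z = 1.96;          // ~95% two-sided normal quantile (t approx.)
        double mean = 0, m2 = 0;  // Welford's running mean and sum of squares
        int n = 0;
        while (n < maxSamples) {
            double x = 10 + data.nextGaussian();  // stand-in for one document's value
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
            if (n >= 30) {        // need some samples before trusting the interval
                double stderr = Math.sqrt(m2 / (n - 1) / n);  // s / sqrt(n)
                if (z * stderr <= epsilon * Math.abs(mean)) {
                    break;        // interval tight enough: stop mapping early
                }
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        double mean = closeEnoughMean(new Random(42), 0.01, 1_000_000);
        System.out.println("estimated mean = " + mean);
    }
}
```

With ε = 1% here, the loop typically stops after a few hundred points instead of a million.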
MORE SAMPLING
ITEM-ITEM SIMILARITY
Input: user-item ratings, click counts
Compute item-item similarity
Takes more than
PRUNING
ItemSimilarityJob
--threshold
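In the spirit of --threshold, a sketch of pruning item pairs by Pearson correlation: compute the similarity and drop the pair if it doesn’t clear the cutoff. The class and method names here are illustrative, not Mahout’s actual API:

```java
// Sketch of threshold pruning: score an item pair by Pearson correlation
// over their co-rated preference vectors, and keep the pair only if the
// similarity exceeds a threshold (so low-signal pairs are never emitted).
public class ThresholdPruning {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
        double mx = sx / n, my = sy / n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    // Keep only item pairs whose similarity clears the threshold.
    static boolean keepPair(double[] a, double[] b, double threshold) {
        return pearson(a, b) > threshold;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {2, 3, 4, 5, 6};   // perfectly correlated with a (r = 1.0)
        double[] c = {5, 1, 4, 2, 3};   // negatively correlated with a (r = -0.3)
        System.out.println(keepPair(a, b, 0.3)); // kept
        System.out.println(keepPair(a, c, 0.3)); // pruned
    }
}
```

Pruned pairs never reach the output, which is where the order-of-magnitude savings in the experiment below come from.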
PRUNING EXPERIMENT
Libimseti data set
135K users × 165K items
17M data points
Ratings on a scale of 1–10
Compute item-item Pearson correlation
Amazon EMR: 2 × m1.xlarge
RESULTS

                    NO PRUNING        PRUNING
Threshold           0                 >0.3
Pairs per item      <10,000           <10
Prefs per user      <1,000            <100
Run time            178 minutes       11 minutes
Output              20,400 MB         2 MB
FAULT-FRIENDLINESS
Not a Bank
FAULTS THREATEN…

CONSISTENCY
Give same answer in same state
But answer might be approximate
Many answers are “close”
OK to give inconsistent answers?

DURABILITY
Data should not disappear
But deleting data often has
DESIGN FOR…

FAST AVAILABILITY
Multiple replicas
Need not have a consistent view
Clients have consistent view through smart load balancing

FAST 99% DURABILITY
Push data into durable store
Buffer a little locally
Tolerate loss of “a little”
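A toy sketch of that durability trade (DurableStore is a stand-in interface, not from the talk): acknowledge writes immediately, flush to the durable store in batches, and accept that whatever is still buffered at crash time is the “little” that may be lost.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of "fast 99% durability": buffer a little locally, push to a
// durable store in bulk, tolerate losing only the unflushed tail.
public class BufferedDurability {

    interface DurableStore { void persist(List<String> batch); }

    private final Queue<String> buffer = new ArrayDeque<>();
    private final DurableStore store;
    private final int flushSize;

    BufferedDurability(DurableStore store, int flushSize) {
        this.store = store;
        this.flushSize = flushSize;
    }

    // Writes return immediately; durability arrives later, in bulk.
    void write(String record) {
        buffer.add(record);
        if (buffer.size() >= flushSize) flush();
    }

    void flush() {
        store.persist(new ArrayList<>(buffer));
        buffer.clear();   // a crash before this point loses only "a little"
    }

    public static void main(String[] args) {
        List<String> persisted = new ArrayList<>();
        BufferedDurability w = new BufferedDurability(persisted::addAll, 4);
        for (int i = 0; i < 10; i++) w.write("record-" + i);
        // Two batches of 4 are durable; 2 records remain only in the buffer.
        System.out.println("persisted=" + persisted.size()
                + " buffered=" + (10 - persisted.size()));
    }
}
```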