Slide 1

Slide 1 text

NOT EXACTLY! APPROXIMATE ALGORITHMS FOR BIG DATA
FANGJIN YANG · NELSON RAY
METAMARKETS

Slide 2

Slide 2 text

OVERVIEW
THE PROBLEM: MANAGE DATA COST EFFICIENTLY
THE DATA: DEALING WITH EVENT STREAMS
SIMPLIFYING STORAGE: DATA SUMMARIZATION
FINDING UNIQUES: HYPERLOGLOG
ESTIMATING DISTRIBUTION: APPROXIMATE HISTOGRAMS

Slide 3

Slide 3 text

THE PROBLEM

Slide 4

Slide 4 text

Fangjin Yang & Nelson Ray 2013
WE ARE METAMARKETS...
...AND WE ANALYZE DATA

Slide 5

Slide 5 text

Real-time Bidding

Slide 6

Slide 6 text

PROBLEMS
‣ Storing/processing billions of rows is expensive
‣ Reduce storage, improve performance
‣ Reduce storage by throwing away information
‣ Throwing away information reduces accuracy

Slide 7

Slide 7 text

THE DATA

Slide 8

Slide 8 text

THE DATA
Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Slide 9

Slide 9 text

DATA SUMMARIZATION
Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Timestamp      Revenue  Number of Prices
2013-10-28T02  2.28     3
2013-10-28T03  1.19     2
2013-10-28T04  0.15     1
2013-10-28T05  1.04     2
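A minimal sketch of the hourly roll-up on this slide, assuming events arrive as (timestamp, price) pairs. The function name `summarize_hourly` is illustrative and does not come from the talk.

```python
from collections import defaultdict

# Raw events from the slide: (ISO timestamp, bid price).
events = [
    ("2013-10-28T02:13:43Z", 1.19),
    ("2013-10-28T02:14:21Z", 0.05),
    ("2013-10-28T02:55:32Z", 1.04),
    ("2013-10-28T03:07:28Z", 0.16),
    ("2013-10-28T03:13:43Z", 1.03),
    ("2013-10-28T04:18:19Z", 0.15),
    ("2013-10-28T05:36:34Z", 0.01),
    ("2013-10-28T05:37:59Z", 1.03),
]


def summarize_hourly(rows):
    """Truncate each timestamp to the hour and keep (revenue, count) per hour."""
    totals = defaultdict(lambda: [0.0, 0])
    for timestamp, price in rows:
        hour = timestamp[:13]  # "2013-10-28T02"
        totals[hour][0] += price
        totals[hour][1] += 1
    # Round only for display; floating-point sums are not exact.
    return {hour: (round(rev, 2), n) for hour, (rev, n) in sorted(totals.items())}


hourly = summarize_hourly(events)
for hour, (revenue, count) in hourly.items():
    print(hour, revenue, count)
```

The individual prices are gone after this step; only the sum and the count per hour survive.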

Slide 10

Slide 10 text

COMBINING SUMMARIZATIONS
Timestamp      Revenue  Number of Prices
2013-10-28T02  2.28     3
2013-10-28T03  1.19     2
2013-10-28T04  0.15     1
2013-10-28T05  1.04     2

Timestamp   Revenue  Number of Prices
2013-10-28  4.66     8
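Sums and counts combine by simple addition, which is why hourly summaries can be rolled up to a day without the raw events. A small sketch, with `combine` as an illustrative name:

```python
# Hourly summaries from the slide: hour -> (revenue, number of prices).
hourly = {
    "2013-10-28T02": (2.28, 3),
    "2013-10-28T03": (1.19, 2),
    "2013-10-28T04": (0.15, 1),
    "2013-10-28T05": (1.04, 2),
}


def combine(summaries):
    """Merge (revenue, count) summaries by adding each component."""
    summaries = list(summaries)
    revenue = sum(rev for rev, _ in summaries)
    count = sum(n for _, n in summaries)
    return round(revenue, 2), count


daily = combine(hourly.values())
print("2013-10-28", daily)
```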

Slide 11

Slide 11 text


Slide 12

Slide 12 text

SUMMARIZATION SUMMARY
‣ Throw away information about individual events
‣ Drastically reduce storage and improve query speed
  • On average, a 40x reduction in storage with our own data
‣ We’ve lost info about individual prices
‣ Data summarization is not always trivial

Slide 13

Slide 13 text

CASE STUDY 1

Slide 14

Slide 14 text

CASE STUDY 1
‣ Problem: determine the number of unique elements in a set
‣ Use case: measuring the number of unique users
[Figure labels: DATA, BIG DATA]

Slide 15

Slide 15 text

EXACT SOLUTION
‣ Store every single username (in a Java HashSet)
‣ No loss of information, no accuracy tradeoff

Slide 16

Slide 16 text

HASHSET
Timestamp             Username
2013-10-28T02:13:43Z  user1
2013-10-28T02:14:21Z  user2
2013-10-28T02:55:32Z  user1
2013-10-28T03:07:28Z  user4
2013-10-28T03:13:43Z  user97
2013-10-28T04:18:19Z  user2
2013-10-28T05:36:34Z  user9834
2013-10-28T05:37:59Z  user97

Timestamp      Usernames
2013-10-28T02  {user1, user2}
2013-10-28T03  {user4, user97}
2013-10-28T04  {user2}
2013-10-28T05  {user9834, user97}

Slide 17

Slide 17 text

HASHSET
Timestamp      Usernames
2013-10-28T02  {user1, user2}
2013-10-28T03  {user4, user97}
2013-10-28T04  {user2}
2013-10-28T05  {user9834, user97}

Timestamp   Usernames
2013-10-28  {user1, user2, user4, user97, user9834}
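Unlike sums, unique counts cannot be added across hours, because the same user may appear in several of them. A short sketch of the exact approach, using Python sets in place of the Java HashSet the slides mention:

```python
# Hourly username sets from the slide.
hourly_users = {
    "2013-10-28T02": {"user1", "user2"},
    "2013-10-28T03": {"user4", "user97"},
    "2013-10-28T04": {"user2"},
    "2013-10-28T05": {"user9834", "user97"},
}

# Combining hours requires a set union, so every username must be kept around.
daily_users = set().union(*hourly_users.values())

# Adding the hourly counts double-counts user2 and user97.
naive_total = sum(len(users) for users in hourly_users.values())

print(len(daily_users), naive_total)
```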

Slide 18

Slide 18 text

EXACT SOLUTION
‣ Storage/Computation: O(# uniques)
‣ We’re not throwing away any information about usernames
‣ Accuracy: 100%

Slide 19

Slide 19 text

INFEASIBLE STORAGE
‣ High cardinality user dimensions == infeasible storage
  • Storage cost for 10^9 unique elements == ~48GB of storage

Slide 20

Slide 20 text

CARDINALITY ESTIMATION
‣ Plenty of literature
  • Linear Counting
  • Count-Min Sketch
  • Bloom Filters
  • LogLog

Slide 21

Slide 21 text

HYPERLOGLOG
‣ Storage: 1.5 KB (for cardinalities of 10^9 and above)
  • 99.999997% decrease in storage size
‣ Computation: O(1) (for cardinalities < ~10^10)
‣ Accuracy: 97%
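The storage figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and compares the 1.5 KB sketch against the ~48 GB quoted earlier for the exact approach:

```python
exact_bytes = 48e9    # ~48 GB for 10^9 unique elements (from the earlier slide)
sketch_bytes = 1.5e3  # 1.5 KB HyperLogLog sketch

# Percentage of storage saved by keeping the sketch instead of every element.
decrease_pct = (1 - sketch_bytes / exact_bytes) * 100

# Implied cost per stored element in the exact approach.
bytes_per_element = exact_bytes / 1e9

print(f"{decrease_pct:.6f}% smaller, ~{bytes_per_element:.0f} bytes per element")
```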

Slide 22

Slide 22 text

HYPERLOGLOG
‣ Instead of storing all the data, let’s store a “sketch” of the data that represents some result that we care about
‣ Analogy: imagine we wanted to know how many times we flipped a coin
  • A fair coin lands ~50% heads, ~50% tails
  • We could store the result of every coin flip as it occurs (HHTTTHTHHT)
  • Or we could store only the number of times heads appeared as we ingest data, and use the magic of probability to estimate the total (roughly twice the heads count)

Slide 23

Slide 23 text

HYPERLOGLOG
‣ Maintain a series of buckets
‣ Each bucket stores a number
‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen
‣ The phenomenon we care about is based on how bits are distributed when we hash a username
‣ We are looking for the position of the first ‘1’ bit
‣ Update a bucket if this position is greater than the existing value
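The update rule above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: it uses SHA-1 truncated to 64 bits as the hash, picks the bucket from the top `p` bits, and starts buckets at 0 rather than the -INF shown on the slides. The names `hash64`, `first_one_position`, and `add` are hypothetical.

```python
import hashlib


def hash64(value: str) -> int:
    """Assumed hash: the first 8 bytes of SHA-1, read as a 64-bit integer."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def first_one_position(bits: int, width: int) -> int:
    """1-based position of the first '1' bit in a width-bit word.

    01xxx -> 2, 001xx -> 3; an all-zero word yields width + 1.
    """
    return width - bits.bit_length() + 1


def add(buckets: list, value: str, p: int) -> None:
    """Route the value to one of 2**p buckets and keep the largest position seen."""
    h = hash64(value)
    index = h >> (64 - p)                # top p bits choose the bucket
    rest = h & ((1 << (64 - p)) - 1)     # remaining bits are scanned for the first 1
    buckets[index] = max(buckets[index], first_one_position(rest, 64 - p))


buckets = [0] * 4  # 2**2 buckets, as in the four-bucket slides
for user in ["user1", "user2", "user1", "user4", "user97"]:
    add(buckets, user, p=2)
print(buckets)
```

Seeing the same user twice never changes a bucket, which is what makes the structure count uniques rather than events.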

Slide 24

Slide 24 text

HYPERLOGLOG
Buckets: -INF  -INF  -INF  -INF

Slide 25

Slide 25 text

HYPERLOGLOG
user1 -> HashFn -> 01xxx...x
Buckets: 2  -INF  -INF  -INF

Slide 26

Slide 26 text

HYPERLOGLOG
user1  -> HashFn -> 01xxx...x
user4  -> HashFn -> 01xxx...x
user12 -> HashFn -> 01xxx...x
user7  -> HashFn -> 1xxxx...x
Buckets: 2  2  2  1

Slide 27

Slide 27 text

HYPERLOGLOG
user6 -> HashFn -> 001xx...x
Buckets: 2 -> 3  2  2  1

Slide 28

Slide 28 text

DETERMINING FINAL CARDINALITY
Buckets: 3  2  2  1  ->  MATH  ->  11.00
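The "MATH" step is not spelled out on the slide. In the HyperLogLog paper cited in the references, the raw estimate is a bias-corrected harmonic mean of the bucket values, E = alpha * m^2 / sum(2^-M[j]), where m is the number of buckets. The sketch below implements that formula with the constant `alpha` left as a parameter; the paper tabulates alpha for 16 or more buckets, so the value to use for this four-bucket toy example is an open assumption, and it is not clear that the slide's 11.00 was produced this way.

```python
def raw_estimate(buckets, alpha):
    """HyperLogLog raw estimate: alpha * m^2 / sum(2^-bucket)."""
    m = len(buckets)
    harmonic_sum = sum(2.0 ** -value for value in buckets)
    return alpha * m * m / harmonic_sum


buckets = [3, 2, 2, 1]

# With alpha = 1.0 (no bias correction) the formula gives 16 / 1.125.
uncorrected = raw_estimate(buckets, alpha=1.0)
print(round(uncorrected, 2))
```

Any alpha below 1 scales this value down proportionally; the paper also applies small-range and large-range corrections that are omitted here.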

Slide 29

Slide 29 text

HYPERLOGLOG
Timestamp      Buckets
2013-10-28T02  [3, 2, 2, 1]
2013-10-28T03  [1, 2, 1, 2]
2013-10-28T04  [2, 1, 4, 1]
2013-10-28T05  [2, 2, 3, 1]

Slide 30

Slide 30 text

HYPERLOGLOG
Timestamp   HLL Object
2013-10-28  [3, 2, 4, 2]
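The daily object is the element-wise maximum of the hourly bucket arrays, which is what makes HyperLogLog sketches combinable. A short sketch, with `merge` as an illustrative name:

```python
# Hourly bucket arrays from the previous slide.
hourly_sketches = [
    [3, 2, 2, 1],  # 2013-10-28T02
    [1, 2, 1, 2],  # 2013-10-28T03
    [2, 1, 4, 1],  # 2013-10-28T04
    [2, 2, 3, 1],  # 2013-10-28T05
]


def merge(*sketches):
    """Merge sketches of equal size by taking the max of each bucket."""
    return [max(column) for column in zip(*sketches)]


daily_sketch = merge(*hourly_sketches)
print(daily_sketch)
```

Taking the maximum gives the same buckets as if every user had been added to a single sketch, so merging loses nothing beyond the sketch's own approximation.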

Slide 31

Slide 31 text


Slide 32

Slide 32 text

RESULTS

Slide 33

Slide 33 text

CASE STUDY 2

Slide 34

Slide 34 text

CASE STUDY 2
‣ Problem: determine the distribution of values
‣ Use case: quantiles and histograms
‣ Hourly truncation

Slide 35

Slide 35 text

THE DATA
Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Slide 36

Slide 36 text

EXACT SOLUTION
Timestamp             Bid Price
2013-10-28T02:13:43Z  1.19
2013-10-28T02:14:21Z  0.05
2013-10-28T02:55:32Z  1.04
2013-10-28T03:07:28Z  0.16
2013-10-28T03:13:43Z  1.03
2013-10-28T04:18:19Z  0.15
2013-10-28T05:36:34Z  0.01
2013-10-28T05:37:59Z  1.03

Timestamp      Bid Prices
2013-10-28T02  [1.19, 0.05, 1.04]
2013-10-28T03  [0.16, 1.03]
2013-10-28T04  [0.15]
2013-10-28T05  [0.01, 1.03]

Slide 37

Slide 37 text

EXACT SOLUTION
Timestamp      Bid Prices
2013-10-28T02  [1.19, 0.05, 1.04]
2013-10-28T03  [0.16, 1.03]
2013-10-28T04  [0.15]
2013-10-28T05  [0.01, 1.03]

Timestamp   Bid Prices
2013-10-28  [1.19, 0.05, 1.04, 0.16, 1.03, 0.15, 0.01, 1.03]
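A minimal sketch of the exact approach: concatenate the hourly arrays and compute a quantile from every stored value. The median is used here only as an example statistic.

```python
import statistics

# Hourly arrays of raw bid prices from the slide.
hourly_prices = {
    "2013-10-28T02": [1.19, 0.05, 1.04],
    "2013-10-28T03": [0.16, 1.03],
    "2013-10-28T04": [0.15],
    "2013-10-28T05": [0.01, 1.03],
}

# Combining hours means concatenating arrays, so storage grows with every event.
daily_prices = [p for prices in hourly_prices.values() for p in prices]

daily_median = statistics.median(daily_prices)
print(len(daily_prices), daily_median)
```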

Slide 38

Slide 38 text

EXACT SOLUTION
‣ Arrays of values
‣ Storage: Linear
‣ Computation: Linear
‣ Accuracy: 100%
‣ Problem: Storing raw values can often be more expensive than storing the rest of the row.
‣ Solution: Store an approximate representation!

Slide 39

Slide 39 text

APPROXIMATE HISTOGRAMS
‣ “A Streaming Parallel Decision Tree Algorithm”
‣ Yael Ben-Haim & Elad Tom-Tov
‣ Storage: Sublinear/Linear
‣ Computation: Sublinear/Linear
‣ Accuracy: pretty good
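A simplified sketch of the streaming histogram idea from the cited paper: keep at most a fixed number of (mean, count) bins, and whenever a new value pushes the histogram over that limit, merge the two closest bins into their weighted mean. The names `merge_closest` and `add_point` are illustrative, and details such as special handling of repeated values are left out.

```python
def merge_closest(bins, max_bins):
    """Merge the two nearest (mean, count) bins until at most max_bins remain."""
    bins = sorted(bins)
    while len(bins) > max_bins:
        # Index of the adjacent pair with the smallest gap between means.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        merged = ((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)
        bins[i:i + 2] = [merged]
    return bins


def add_point(bins, x, max_bins):
    """Insert x as a bin of count 1, then shrink back to max_bins."""
    return merge_closest(bins + [(x, 1)], max_bins)


# Bid prices from the earlier slides, summarized with at most 3 bins.
prices = [1.19, 0.05, 1.04, 0.16, 1.03, 0.15, 0.01, 1.03]
histogram = []
for price in prices:
    histogram = add_point(histogram, price, max_bins=3)
print(histogram)
```

Storage is bounded by `max_bins` regardless of how many values are ingested, and the total count is preserved exactly.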

Slide 40

Slide 40 text

RAW DATA
• 40 Prices: 3.46, 5.37, 5.62, 5.87, 6.21, 6.79, 7.11, 7.36, 7.55, 7.64, 7.89, 7.9, 8.07, 8.44, 8.62, 8.78, 8.87, 9.03, 9.24, 9.36, 9.58, 9.59, 9.81, 10.31, 10.35, 10.39, 10.47, 10.77, 10.93, 11.04, 11.1, 13.1, 13.27, 13.29, 13.87, 14.29, 14.51, 14.9, 15.75, 17.07

Slide 41

Slide 41 text

RAW DATA

Slide 42

Slide 42 text

SUMMARIZE WITH (COUNT, MEAN)

Slide 43

Slide 43 text

SUMMARIZE WITH (COUNT, MEAN)

Slide 44

Slide 44 text

SUMMARIZE WITH (COUNT, MEAN)

Slide 45

Slide 45 text

COMBINING HISTOGRAMS

Slide 46

Slide 46 text

COMBINING HISTOGRAMS
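Two (count, mean) histograms can be combined by pooling their bins and then merging the closest pairs until the size limit is met again. The sketch below assumes bins are stored as (mean, count) tuples; the function name and the sample bins are made up for illustration.

```python
def merge_histograms(a, b, max_bins):
    """Pool the bins of two histograms, then merge nearest neighbours."""
    bins = sorted(a + b)
    while len(bins) > max_bins:
        # Find the adjacent pair of bins whose means are closest.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        bins[i:i + 2] = [((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)]
    return bins


# Hypothetical hourly histograms as (mean, count) bins.
hour_a = [(1.0, 2), (5.0, 1)]
hour_b = [(1.2, 3), (9.0, 4)]

combined = merge_histograms(hour_a, hour_b, max_bins=3)
print(combined)
```

Here the bins at 1.0 and 1.2 are the closest pair, so they collapse into one bin holding five values.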

Slide 47

Slide 47 text


Slide 48

Slide 48 text

COUNT # <= X
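One way to estimate the number of values at or below x from (mean, count) bins is the interpolation procedure described in the cited Ben-Haim and Tom-Tov paper: count all bins fully to the left, half of the bin just left of x, and a trapezoid-shaped share of the gap up to x. The sketch below follows that idea; treating everything at or beyond the last bin mean as "all values" is a simplification made here, and the name `count_le` is illustrative.

```python
def count_le(bins, x):
    """Estimate how many values are <= x from sorted (mean, count) bins."""
    if x < bins[0][0]:
        return 0.0
    if x >= bins[-1][0]:
        return float(sum(c for _, c in bins))  # simplification at the right edge
    # Last bin whose mean is <= x; the next bin's mean is then strictly > x.
    i = max(k for k in range(len(bins) - 1) if bins[k][0] <= x)
    (p_i, m_i), (p_j, m_j) = bins[i], bins[i + 1]
    frac = (x - p_i) / (p_j - p_i)
    m_x = m_i + (m_j - m_i) * frac      # interpolated height at x
    inside = (m_i + m_x) / 2 * frac     # trapezoid area between p_i and x
    return sum(c for _, c in bins[:i]) + m_i / 2 + inside


bins = [(1.0, 2), (3.0, 2)]
print(count_le(bins, 2.0))
```

Quantiles follow by searching for the x whose estimated count matches the desired fraction of the total.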

Slide 49

Slide 49 text


Slide 50

Slide 50 text

ACCURACY

Slide 51

Slide 51 text

BENCHMARKS
• 100 cc2.8xlarge (1600 cores, 6TB RAM) Druid cluster
• 27B summarized rows/s scan rate
• Add 16B summarized (~640B raw) rows/s
• Combine 4B HyperLogLog objects/s
• Combine 1.5B ApproximateHistogram objects/s

Slide 52

Slide 52 text

CONCLUSIONS
• Summarization for sums: substantially faster and smaller (e.g. ~40x for us)
  • 100% accuracy
• Sketches for cardinality/distribution: 1-2 orders of magnitude faster and smaller than raw data
  • 97% accuracy
• 40x lower cost is make-or-break
• Interactive queries that are accurate enough

Slide 53

Slide 53 text

MORE INFORMATION?
OFFICE HOURS
  time: Tuesday, Oct. 29, 3:25pm
  place: Third Floor Foyer, Table E
DRINKS
  time: Monday, Oct. 28, 6:00pm
  place: Old Castle Pub & Restaurant

Slide 54

Slide 54 text

THANK YOU

Slide 55

Slide 55 text

ACKNOWLEDGEMENTS
• Eric Tschetter
• Xavier Léauté
• Gian Merlino
• Aggregate Knowledge Blog
• High Scalability

Slide 56

Slide 56 text

REFERENCES
‣ “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”
  • Flajolet et al.
‣ “A Streaming Parallel Decision Tree Algorithm”
  • Yael Ben-Haim & Elad Tom-Tov
‣ http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
‣ http://metamarkets.com/2013/histograms/

Slide 57

Slide 57 text

HYPERLOGLOG
user1 -> HashFn -> 0xx...x
user2 -> HashFn -> 1xx...x

Slide 58

Slide 58 text

HYPERLOGLOG
user1 -> HashFn -> 00x...x
user2 -> HashFn -> 10x...x
user3 -> HashFn -> 01x...x
user4 -> HashFn -> 11x...x

Slide 59

Slide 59 text

HYPERLOGLOG
‣ 50% of hashed values will look like this: 1xxxxx…x
‣ 25% of hashed values will look like this: 01xxxx…x
‣ 12.5% of hashed values will look like this: 001xxx…x
‣ 6.25% of hashed values will look like this: 0001xx…x
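These fractions can be verified by brute force. The sketch below enumerates every 12-bit value (standing in for a uniformly distributed hash; the width is an arbitrary choice) and tallies where the first '1' bit falls.

```python
from collections import Counter

WIDTH = 12  # small enough to enumerate every possible value


def first_one_position(value, width):
    """1-based position of the first '1' bit, counted from the left."""
    return width - value.bit_length() + 1


counts = Counter(first_one_position(v, WIDTH) for v in range(1 << WIDTH))
fractions = {pos: n / (1 << WIDTH) for pos, n in sorted(counts.items())}

for pos in (1, 2, 3, 4):
    print(pos, fractions[pos])
```

Each extra leading zero halves the share of values, so position k occurs with probability 2^-k.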

Slide 60

Slide 60 text

HYPERLOGLOG
‣ Invert this logic
  • If highest index of ‘1’ is 2, we saw 4 unique values
  • If highest index of ‘1’ is 4, we saw 16 unique values
‣ Use the highest index of ‘1’ to determine cardinality
‣ For better accuracy, the highest index of ‘1’ is stored in a series of buckets
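A rough sketch of the inverted logic with a single register: hash every user, remember the highest first-'1' position seen, and report 2 raised to that position. SHA-1 truncated to 32 bits is an assumed hash, and the function names are illustrative. A single register gives a very noisy estimate, which is why the slides spread the values over a series of buckets.

```python
import hashlib


def hash32(value: str) -> int:
    """Assumed hash: the first 4 bytes of SHA-1 as a 32-bit integer."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")


def first_one_position(h: int, width: int = 32) -> int:
    """1-based position of the first '1' bit, counted from the left."""
    return width - h.bit_length() + 1


def crude_estimate(users) -> int:
    """Single-register estimate: 2 ** (highest first-'1' position seen)."""
    highest = max(first_one_position(hash32(u)) for u in users)
    return 2 ** highest


users = [f"user{i}" for i in range(1000)]
print(crude_estimate(users))
```

Duplicates never change the result, since hashing the same user again yields the same position.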