
Approximate Algorithms for Big Data

Nishant
August 07, 2014

Transcript

  1. OVERVIEW
     ‣ The problem: manage data cost efficiently
     ‣ The data: dealing with event streams
     ‣ Simplifying storage: data summarization
     ‣ Finding uniques: HyperLogLog
  2. PROBLEMS
     ‣ Storing/processing billions of rows is expensive
     ‣ Reduce storage, improve performance
     ‣ Reduce storage by throwing away information
     ‣ Throwing away information reduces accuracy
  3. THE DATA

     Timestamp              Bid Price
     2013-10-28T02:13:43Z   1.19
     2013-10-28T02:14:21Z   0.05
     2013-10-28T02:55:32Z   1.04
     2013-10-28T03:07:28Z   0.16
     2013-10-28T03:13:43Z   1.03
     2013-10-28T04:18:19Z   0.15
     2013-10-28T05:36:34Z   0.01
     2013-10-28T05:37:59Z   1.03
  4. DATA SUMMARIZATION

     Raw events:
     Timestamp              Bid Price
     2013-10-28T02:13:43Z   1.19
     2013-10-28T02:14:21Z   0.05
     2013-10-28T02:55:32Z   1.04
     2013-10-28T03:07:28Z   0.16
     2013-10-28T03:13:43Z   1.03
     2013-10-28T04:18:19Z   0.15
     2013-10-28T05:36:34Z   0.01
     2013-10-28T05:37:59Z   1.03

     Summarized by hour:
     Timestamp       Revenue   Number of Prices
     2013-10-28T02   2.28      3
     2013-10-28T03   1.19      2
     2013-10-28T04   0.15      1
     2013-10-28T05   1.04      2
  5. COMBINING SUMMARIZATIONS

     Hourly summaries:
     Timestamp       Revenue   Number of Prices
     2013-10-28T02   2.28      3
     2013-10-28T03   1.19      2
     2013-10-28T04   0.15      1
     2013-10-28T05   1.04      2

     Combined into a daily summary:
     Timestamp    Revenue   Number of Prices
     2013-10-28   4.66      8
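Summaries of this form combine by simple addition, which is what makes the rollup work. Below is a minimal Java sketch of the hourly-to-daily rollup shown above; the Summary record and the toDaily helper are illustrative names, not from the deck or from Druid:

    import java.util.Map;
    import java.util.TreeMap;

    public class Rollup {
        // Illustrative summary row: revenue is a sum, numPrices a count, so two
        // summaries combine by adding fields -- the operation shown on the slide.
        record Summary(double revenue, long numPrices) {
            Summary combine(Summary other) {
                return new Summary(revenue + other.revenue, numPrices + other.numPrices);
            }
        }

        // Merge hourly summaries (keyed "2013-10-28T02", ...) into daily ones
        // by truncating the key to the date ("2013-10-28").
        static Map<String, Summary> toDaily(Map<String, Summary> hourly) {
            Map<String, Summary> daily = new TreeMap<>();
            hourly.forEach((hour, summary) ->
                daily.merge(hour.substring(0, 10), summary, Summary::combine));
            return daily;
        }
    }

Running toDaily on the four hourly rows above yields the single daily row (revenue 4.66, 8 prices).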
  6. SUMMARIZATION SUMMARY
     ‣ Throw away information about individual events
     ‣ Drastically reduce storage and improve query speed
       • on average, a 40x reduction in storage with our own data
     ‣ We’ve lost info about individual prices
     ‣ Data summarization is not always trivial
  7. CARDINALITY ESTIMATION
     ‣ Problem: determine the number of unique elements in a set
     ‣ Use case: measuring the number of unique users
  8. EXACT SOLUTION
     ‣ Store every single username (in a Java HashSet)
     ‣ No loss of information, no accuracy tradeoff
  9. HASHSET

     Raw events:
     Timestamp              Username
     2013-10-28T02:13:43Z   user1
     2013-10-28T02:14:21Z   user2
     2013-10-28T02:55:32Z   user1
     2013-10-28T03:07:28Z   user4
     2013-10-28T03:13:43Z   user97
     2013-10-28T04:18:19Z   user2
     2013-10-28T05:36:34Z   user9834
     2013-10-28T05:37:59Z   user97

     Summarized by hour:
     Timestamp       Usernames
     2013-10-28T02   {user1, user2}
     2013-10-28T03   {user4, user97}
     2013-10-28T04   {user2}
     2013-10-28T05   {user9834, user97}
  10. HASHSET

     Hourly sets:
     Timestamp       Usernames
     2013-10-28T02   {user1, user2}
     2013-10-28T03   {user4, user97}
     2013-10-28T04   {user2}
     2013-10-28T05   {user9834, user97}

     Combined by set union into a daily set:
     Timestamp    Usernames
     2013-10-28   {user1, user2, user4, user97, user9834}
  11. EXACT SOLUTION
     ‣ Storage/computation: O(# uniques)
     ‣ We’re not throwing away any information about usernames
     ‣ Accuracy: 100%
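A minimal Java sketch of this exact approach, assuming usernames arrive as strings; the class and method names are illustrative:

    import java.util.HashSet;
    import java.util.Set;

    public class ExactUniques {
        private final Set<String> usernames = new HashSet<>();

        public void add(String username) {
            usernames.add(username);      // duplicates are ignored automatically
        }

        // Combining two summaries is a set union, as on the previous slide.
        public void merge(ExactUniques other) {
            usernames.addAll(other.usernames);
        }

        public int uniques() {
            return usernames.size();      // exact, but memory grows with # uniques
        }
    }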
  12. INFEASIBLE STORAGE
     ‣ High-cardinality user dimensions mean infeasible storage
       • Storage cost for 10^9 unique elements: ~48 GB (i.e. roughly 48 bytes per element, once string and HashSet entry overhead is counted)
  13. CARDINALITY ESTIMATION
     ‣ Plenty of literature:
       • Linear Counting
       • Count-Min Sketch
       • Bloom Filters
       • LogLog
  14. HYPERLOGLOG
     ‣ Storage: 1.5 KB (for cardinalities of 10^9 and above)
       • a 99.999997% decrease in storage size
     ‣ Computation: O(1) (for cardinalities < ~10^10)
     ‣ Accuracy: 98%
  15. HYPERLOGLOG
     ‣ Instead of storing all the data, store a “sketch” of the data that represents some result we care about
     ‣ Analogy: imagine we wanted to know how many times we flipped a coin
       • ~50% heads/tails
       • We could store the result of every coin flip as it occurs (HHTTTHTHHT)
       • Or we could just store the number of times heads appeared as we ingest data and use the magic of probability
  16. HYPERLOGLOG
     ‣ Maintain a series of buckets; each bucket stores a number
     ‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen
     ‣ The phenomenon we care about is based on how the bits are distributed when we hash a username
     ‣ We are looking for the position of the first ‘1’ bit
     ‣ Update a bucket if this position is greater than the existing value (see the sketch below)
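A minimal Java sketch of this update rule, assuming a 64-bit hash whose top bits pick a bucket and whose remaining bits are scanned for the first ‘1’; the bucket count and the placeholder hash function are illustrative choices, not Druid’s actual implementation:

    import java.nio.charset.StandardCharsets;

    public class HllSketch {
        private static final int P = 11;             // 2^11 = 2048 buckets
        private final byte[] buckets = new byte[1 << P];

        public void add(String username) {
            long hash = hash64(username.getBytes(StandardCharsets.UTF_8));
            int bucket = (int) (hash >>> (64 - P));  // top P bits pick the bucket
            long rest = hash << P;                   // remaining bits
            // Position of the first '1' bit in the remaining bits (1-based).
            byte firstOne = (byte) (Long.numberOfLeadingZeros(rest) + 1);
            if (firstOne > buckets[bucket]) {
                buckets[bucket] = firstOne;          // keep only the max seen
            }
        }

        // Placeholder hash for the sketch; a real implementation would use a
        // well-mixed 64-bit hash such as Murmur3.
        private static long hash64(byte[] data) {
            long h = 1125899906842597L;
            for (byte b : data) {
                h = 31 * h + b;
            }
            return h;
        }
    }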
  17. HYPERLOGLOG

     [Diagram: user1, user4, and user12 each hash to 01xxx...x (first ‘1’ bit at position 2); user7 hashes to 1xxxx...x (first ‘1’ bit at position 1); the buckets end up holding 2 2 2 1]
  18. HYPERLOGLOG

     Timestamp       Buckets
     2013-10-28T02   [3, 2, 2, 1]
     2013-10-28T03   [1, 2, 1, 2]
     2013-10-28T04   [2, 1, 4, 1]
     2013-10-28T05   [2, 2, 3, 1]
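Bucket arrays like these combine by taking the element-wise maximum, which is how hourly sketches roll up into a daily one. A minimal sketch, assuming equal-length bucket arrays as on the slide:

    public static byte[] merge(byte[] a, byte[] b) {
        byte[] merged = new byte[a.length];   // assumes a.length == b.length
        for (int i = 0; i < a.length; i++) {
            // Taking the max per bucket gives the same result as if the daily
            // sketch had ingested every event directly.
            merged[i] = (byte) Math.max(a[i], b[i]);
        }
        return merged;
    }

Merging the four hourly rows above gives [3, 2, 4, 2].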
  19. BENCHMARKS
     • 100-node cc2.8xlarge (1600 cores, 6TB RAM) Druid cluster
     • 27B summarized rows/s scan rate
     • Add 16B summarized (~640B raw) rows/s
     • Combine 4B HyperLogLog objects/s
  20. CONCLUSIONS
     • Summarization for sums: substantially faster with less storage (e.g. ~40x for us)
       • 100% accuracy
     • Sketches for cardinality/distribution: 1–2 orders of magnitude faster and less storage than raw data
       • 98% accuracy
     • A 40x reduction in cost can be make-or-break
     • Interactive queries that are accurate enough
  21. ACKNOWLEDGEMENTS
     • Eric Tschetter
     • Fangjin Yang
     • Nelson Ray
     • Xavier Léauté
     • Gian Merlino
     • Aggregate Knowledge Blog
     • High Scalability
  22. REFERENCES
     ‣ “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, Flajolet et al.
     ‣ http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
     ‣ http://metamarkets.com/2013/histograms/
  23. HYPERLOGLOG
     ‣ 50% of hashed values will look like this: 1xxxxx…x
     ‣ 25% of hashed values will look like this: 01xxxx…x
     ‣ 12.5% of hashed values will look like this: 001xxx…x
     ‣ 6.25% of hashed values will look like this: 0001xx…x
  24. HYPERLOGLOG
     ‣ Invert this logic:
       • If the highest position of a first ‘1’ bit is 2, we saw ~4 unique values
       • If the highest position of a first ‘1’ bit is 4, we saw ~16 unique values
     ‣ Use the highest position of ‘1’ to determine cardinality
     ‣ For better accuracy, the highest position of ‘1’ is stored in a series of buckets (see the estimator sketch below)
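A minimal sketch of turning bucket values into an estimate, following the bias-corrected harmonic mean from the Flajolet et al. paper in the references; the alpha constant below is their correction factor for large bucket counts (m >= 128), and the small- and large-range corrections are omitted:

    public static double estimate(byte[] buckets) {
        int m = buckets.length;
        double sum = 0.0;
        for (byte position : buckets) {
            sum += Math.pow(2.0, -position);  // harmonic mean over the 2^position values
        }
        double alphaM = 0.7213 / (1 + 1.079 / m);
        return alphaM * m * m / sum;          // raw HyperLogLog estimate
    }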