
Approximate Algorithms for Big Data

Nishant
August 07, 2014

Transcript

  1. OVERVIEW
     ‣ The problem: manage data cost efficiently
     ‣ The data: dealing with event streams
     ‣ Simplifying storage: data summarization
     ‣ Finding uniques: HyperLogLog
  2. PROBLEMS
     ‣ Storing/processing billions of rows is expensive
     ‣ Reduce storage, improve performance
     ‣ Reduce storage by throwing away information
     ‣ Throwing away information reduces accuracy
  3. THE DATA

     Timestamp              Bid Price
     2013-10-28T02:13:43Z   1.19
     2013-10-28T02:14:21Z   0.05
     2013-10-28T02:55:32Z   1.04
     2013-10-28T03:07:28Z   0.16
     2013-10-28T03:13:43Z   1.03
     2013-10-28T04:18:19Z   0.15
     2013-10-28T05:36:34Z   0.01
     2013-10-28T05:37:59Z   1.03
  4. DATA SUMMARIZATION

     Raw events:
     Timestamp              Bid Price
     2013-10-28T02:13:43Z   1.19
     2013-10-28T02:14:21Z   0.05
     2013-10-28T02:55:32Z   1.04
     2013-10-28T03:07:28Z   0.16
     2013-10-28T03:13:43Z   1.03
     2013-10-28T04:18:19Z   0.15
     2013-10-28T05:36:34Z   0.01
     2013-10-28T05:37:59Z   1.03

     Summarized by hour:
     Timestamp       Revenue   Number of Prices
     2013-10-28T02   2.28      3
     2013-10-28T03   1.19      2
     2013-10-28T04   0.15      1
     2013-10-28T05   1.04      2
  5. COMBINING SUMMARIZATIONS

     Hourly summaries:
     Timestamp       Revenue   Number of Prices
     2013-10-28T02   2.28      3
     2013-10-28T03   1.19      2
     2013-10-28T04   0.15      1
     2013-10-28T05   1.04      2

     Combined into a daily summary:
     Timestamp    Revenue   Number of Prices
     2013-10-28   4.66      8
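Summaries of this form combine by simple addition, which is what makes the rollup work. Below is a minimal Java sketch of the hourly-to-daily rollup shown above; the Summary record and the toDaily helper are illustrative names, not from the deck or from Druid:

    import java.util.Map;
    import java.util.TreeMap;

    public class Rollup {
        // Illustrative summary row: revenue is a sum, numPrices a count, so two
        // summaries combine by adding fields -- the operation shown on the slide.
        record Summary(double revenue, long numPrices) {
            Summary combine(Summary other) {
                return new Summary(revenue + other.revenue, numPrices + other.numPrices);
            }
        }

        // Merge hourly summaries (keyed "2013-10-28T02", ...) into daily ones
        // by truncating the key to the date ("2013-10-28").
        static Map<String, Summary> toDaily(Map<String, Summary> hourly) {
            Map<String, Summary> daily = new TreeMap<>();
            hourly.forEach((hour, summary) ->
                daily.merge(hour.substring(0, 10), summary, Summary::combine));
            return daily;
        }
    }

Running toDaily on the four hourly rows above yields the single daily row (revenue 4.66, 8 prices).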
  6. SUMMARIZATION SUMMARY
     ‣ Throw away information about individual events
     ‣ Drastically reduce storage and improve query speed
       • on average, a 40x reduction in storage with our own data
     ‣ We’ve lost info about individual prices
     ‣ Data summarization is not always trivial
  7. CARDINALITY ESTIMATION
     ‣ Problem: determine the number of unique elements in a set
     ‣ Use case: measuring the number of unique users
  8. EXACT SOLUTION
     ‣ Store every single username (in a Java HashSet)
     ‣ No loss of information, no accuracy tradeoff
  9. HASHSET

     Raw events:
     Timestamp              Username
     2013-10-28T02:13:43Z   user1
     2013-10-28T02:14:21Z   user2
     2013-10-28T02:55:32Z   user1
     2013-10-28T03:07:28Z   user4
     2013-10-28T03:13:43Z   user97
     2013-10-28T04:18:19Z   user2
     2013-10-28T05:36:34Z   user9834
     2013-10-28T05:37:59Z   user97

     Summarized by hour:
     Timestamp       Usernames
     2013-10-28T02   {user1, user2}
     2013-10-28T03   {user4, user97}
     2013-10-28T04   {user2}
     2013-10-28T05   {user9834, user97}
  10. HASHSET

     Hourly sets:
     Timestamp       Usernames
     2013-10-28T02   {user1, user2}
     2013-10-28T03   {user4, user97}
     2013-10-28T04   {user2}
     2013-10-28T05   {user9834, user97}

     Combined by set union into a daily set:
     Timestamp    Usernames
     2013-10-28   {user1, user2, user4, user97, user9834}
  11. EXACT SOLUTION
     ‣ Storage/computation: O(# uniques)
     ‣ We’re not throwing away any information about usernames
     ‣ Accuracy: 100%
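A minimal Java sketch of this exact approach, assuming usernames arrive as strings; the class and method names are illustrative:

    import java.util.HashSet;
    import java.util.Set;

    public class ExactUniques {
        private final Set<String> usernames = new HashSet<>();

        public void add(String username) {
            usernames.add(username);      // duplicates are ignored automatically
        }

        // Combining two summaries is a set union, as on the previous slide.
        public void merge(ExactUniques other) {
            usernames.addAll(other.usernames);
        }

        public int uniques() {
            return usernames.size();      // exact, but memory grows with # uniques
        }
    }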
  12. INFEASIBLE STORAGE
     ‣ High-cardinality user dimensions mean infeasible storage
       • Storage cost for 10^9 unique elements: ~48 GB (i.e. roughly 48 bytes per element, once string and HashSet entry overhead is counted)
  13. CARDINALITY ESTIMATION
     ‣ Plenty of literature:
       • Linear Counting
       • Count-Min Sketch
       • Bloom Filters
       • LogLog
  14. HYPERLOGLOG
     ‣ Storage: 1.5 KB (for cardinalities of 10^9 and above)
       • a 99.999997% decrease in storage size
     ‣ Computation: O(1) (for cardinalities < ~10^10)
     ‣ Accuracy: 98%
  15. HYPERLOGLOG
     ‣ Instead of storing all the data, store a “sketch” of the data that represents some result we care about
     ‣ Analogy: imagine we wanted to know how many times we flipped a coin
       • ~50% heads/tails
       • We could store the result of every coin flip as it occurs (HHTTTHTHHT)
       • Or we could just store the number of times heads appeared as we ingest data and use the magic of probability
  16. HYPERLOGLOG
     ‣ Maintain a series of buckets; each bucket stores a number
     ‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen
     ‣ The phenomenon we care about is based on how the bits are distributed when we hash a username
     ‣ We are looking for the position of the first ‘1’ bit
     ‣ Update a bucket if this position is greater than the existing value (see the sketch below)
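A minimal Java sketch of this update rule, assuming a 64-bit hash whose top bits pick a bucket and whose remaining bits are scanned for the first ‘1’; the bucket count and the placeholder hash function are illustrative choices, not Druid’s actual implementation:

    import java.nio.charset.StandardCharsets;

    public class HllSketch {
        private static final int P = 11;             // 2^11 = 2048 buckets
        private final byte[] buckets = new byte[1 << P];

        public void add(String username) {
            long hash = hash64(username.getBytes(StandardCharsets.UTF_8));
            int bucket = (int) (hash >>> (64 - P));  // top P bits pick the bucket
            long rest = hash << P;                   // remaining bits
            // Position of the first '1' bit in the remaining bits (1-based).
            byte firstOne = (byte) (Long.numberOfLeadingZeros(rest) + 1);
            if (firstOne > buckets[bucket]) {
                buckets[bucket] = firstOne;          // keep only the max seen
            }
        }

        // Placeholder hash for the sketch; a real implementation would use a
        // well-mixed 64-bit hash such as Murmur3.
        private static long hash64(byte[] data) {
            long h = 1125899906842597L;
            for (byte b : data) {
                h = 31 * h + b;
            }
            return h;
        }
    }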
  17. HYPERLOGLOG

     [Diagram: user1, user4, and user12 each hash to 01xxx...x (first ‘1’ bit at position 2); user7 hashes to 1xxxx...x (first ‘1’ bit at position 1); the buckets end up holding 2 2 2 1]
  18. HYPERLOGLOG

     Timestamp       Buckets
     2013-10-28T02   [3, 2, 2, 1]
     2013-10-28T03   [1, 2, 1, 2]
     2013-10-28T04   [2, 1, 4, 1]
     2013-10-28T05   [2, 2, 3, 1]
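Bucket arrays like these combine by taking the element-wise maximum, which is how hourly sketches roll up into a daily one. A minimal sketch, assuming equal-length bucket arrays as on the slide:

    public static byte[] merge(byte[] a, byte[] b) {
        byte[] merged = new byte[a.length];   // assumes a.length == b.length
        for (int i = 0; i < a.length; i++) {
            // Taking the max per bucket gives the same result as if the daily
            // sketch had ingested every event directly.
            merged[i] = (byte) Math.max(a[i], b[i]);
        }
        return merged;
    }

Merging the four hourly rows above gives [3, 2, 4, 2].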
  19. BENCHMARKS
     • 100-node cc2.8xlarge (1600 cores, 6TB RAM) Druid cluster
     • 27B summarized rows/s scan rate
     • Add 16B summarized (~640B raw) rows/s
     • Combine 4B HyperLogLog objects/s
  20. CONCLUSIONS
     • Summarization for sums: substantially faster with less storage (e.g. ~40x for us)
       • 100% accuracy
     • Sketches for cardinality/distribution: 1–2 orders of magnitude faster and less storage than raw data
       • 98% accuracy
     • A 40x reduction in cost can be make-or-break
     • Interactive queries that are accurate enough
  21. ACKNOWLEDGEMENTS
     • Eric Tschetter
     • Fangjin Yang
     • Nelson Ray
     • Xavier Léauté
     • Gian Merlino
     • Aggregate Knowledge Blog
     • High Scalability
  22. REFERENCES
     ‣ “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, Flajolet et al.
     ‣ http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
     ‣ http://metamarkets.com/2013/histograms/
  23. HYPERLOGLOG
     ‣ 50% of hashed values will look like this: 1xxxxx…x
     ‣ 25% of hashed values will look like this: 01xxxx…x
     ‣ 12.5% of hashed values will look like this: 001xxx…x
     ‣ 6.25% of hashed values will look like this: 0001xx…x
  24. HYPERLOGLOG
     ‣ Invert this logic:
       • If the highest position of a first ‘1’ bit is 2, we saw ~4 unique values
       • If the highest position of a first ‘1’ bit is 4, we saw ~16 unique values
     ‣ Use the highest position of ‘1’ to determine cardinality
     ‣ For better accuracy, the highest position of ‘1’ is stored in a series of buckets (see the estimator sketch below)
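A minimal sketch of turning bucket values into an estimate, following the bias-corrected harmonic mean from the Flajolet et al. paper in the references; the alpha constant below is their correction factor for large bucket counts (m >= 128), and the small- and large-range corrections are omitted:

    public static double estimate(byte[] buckets) {
        int m = buckets.length;
        double sum = 0.0;
        for (byte position : buckets) {
            sum += Math.pow(2.0, -position);  // harmonic mean over the 2^position values
        }
        double alphaM = 0.7213 / (1 + 1.079 / m);
        return alphaM * m * m / sum;          // raw HyperLogLog estimate
    }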