
Introduction to Big Data Engineering

How to construct algorithms that can scale

March 11, 2015

Transcript

1. Introduction to Big Data Engineering: How to construct algorithms that can scale
Felix Chern

Experience
• MapReduce and Hive
• Started Hadoop with an 8-node cluster
• Later joined OpenX to manage and develop on a 600+ node cluster
• 100+ TB/day data volume
• http://idryman.org

6. Solve Larger Problems
• Bridges: cross a river or sea
• Buildings: increase volume for human activities
• Cars: transport
• Rockets: explore outer space

8. Data Engineering Problems
• Maintain the data flow
• Extract Transform Load (ETL)
• Downstream consumers: reporting and analytics
• Data correctness
• System stability
9. Extract Transform Load
• De-duplicate (remove unwanted data)
• Join relevant data
• Aggregations from different perspectives
• select … from large_table group by buyers;
• select … from large_table group by publishers;

10. … ad hoc queries
11. Estimate instead of an actual run
• Estimate the properties of an aggregation (or any other data operation)
• Cardinality
• Frequency distribution (is the data skewed?)
• Top-k elements
• Doesn't need to be precise, but must be fast and stable
12. Probabilistic Data Structures/Algorithms
• Cardinality -> HyperLogLog
• Frequency distribution -> count-min sketch
• Top-k elements -> lossy counting
• Runtime: O(log(N)); data size: several KBs
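As one concrete example from that list, lossy counting fits in a few lines. This is a hypothetical Python illustration of the Manku-Motwani algorithm, not code from the talk; the `epsilon` parameter and method names are my own:

```python
import math

class LossyCounter:
    """Lossy counting: approximate heavy hitters in bounded memory.
    Counts are pruned at every bucket boundary, so each stored count
    undercounts the true one by at most epsilon * n."""

    def __init__(self, epsilon: float = 0.01):
        self.epsilon = epsilon
        self.w = math.ceil(1 / epsilon)      # bucket width
        self.n = 0                           # stream length so far
        self.entries = {}                    # item -> [count, delta]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.w)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # delta = maximum possible undercount for a late arrival
            self.entries[item] = [1, bucket - 1]
        if self.n % self.w == 0:             # prune at bucket boundary
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support: float):
        """Items whose true frequency may exceed support * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}
```

The memory used is bounded by the number of surviving entries rather than by the number of distinct items in the stream, which is what makes it a fit for the "fast and stable" estimation goal above.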
13. MapReduce, Jeffrey Dean et al.
When the data size is small, all computation can be done in memory.

15. Cardinality test
• Goal: find the distinct element count of a given query
• Naive approach 1: put everything into a hash set
• Not enough memory!
• Approach 2: hash the input into a BitSet (allow collisions)
• Linear counting [ACM '90]
• Estimate the impact of collisions
• Still requires about 1/10 the size of the input (too large!)
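The linear counting idea above can be sketched in a few lines of Python (a minimal illustration, not the paper's exact formulation; the bitmap size `m` and the MD5-based hash are assumptions): hash every element into an m-slot bitmap, allow collisions, then correct for them using the fraction of slots still zero.

```python
import hashlib
import math

def linear_count(items, m=1 << 16):
    """Linear counting sketch: hash items into an m-slot bitmap
    (collisions allowed), then estimate cardinality as -m * ln(V),
    where V is the fraction of slots still zero."""
    bitmap = [0] * m
    for x in items:
        slot = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m
        bitmap[slot] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap saturated; choose a larger m")
    return m * math.log(m / zeros)  # equivalent to -m * ln(zeros / m)

# Duplicates hit the same slot, so they don't inflate the estimate.
print(round(linear_count(list(range(10000)) * 3)))
```

The bitmap still has to be a constant fraction of the input's cardinality for the correction to be accurate, which is exactly the "too large" objection that motivates LogLog on the next slide.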
16. LogLog (basic idea)
• Pick a hash function h that maps each of the n elements to at least log2(n) bits
• For each a = h(x), let r(a) be the number of trailing zeros
• Record R = the maximum r(a) seen
• Cardinality estimate = 2^R
• Data size (R): 4 or 8 bytes
17. Example: n elements, m distinct ones, a = h(x)
01010001001100 -> r(a) = 2; Pr(r(a) >= 2) = 1/4
01000000000000 -> r(a) = 12; need about 2^12 distinct elements to see this occur!

18. LogLog (figure)
19. Implementations
• Aggregate Knowledge Inc.
• Blob schema: https://github.com/aggregateknowledge/hll-storage-spec
• https://github.com/aggregateknowledge/java-hll
• https://github.com/aggregateknowledge/js-hll
• https://github.com/aggregateknowledge/postgresql-hll
• AddThis Inc.
• https://github.com/addthis/stream-lib
• Others
• https://github.com/tvondra/distinct_estimators/tree/master/superloglog
20. Hive HLL

-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality FROM src;

-- create a hyperloglog cache per hour
FROM input_table src
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
  SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01';

-- read the cache and calculate the cardinality of the full day
SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

-- unpack the Hive hll struct so it is readable by postgresql-hll and js-hll,
-- developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';

https://github.com/dryman/hive-probabilistic-utils

…Agent, etc.) every hour/minute.
• How many unique users do we have every day/week/month?
• What is the churn rate?
• It can even be visualized in the browser! (js-hll)

24. Similar algorithms to HyperLogLog
• Count-min sketch (frequency estimation)
• Bloom filter (probabilistic set)
• Lossy counting (top-k elements)
• AMS (second-order moment estimation)
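The count-min sketch named above follows the same recipe as the other sketches here: constant space, one pass, bounded error. A minimal Python illustration (the width/depth defaults and the seeded-MD5 row hashing are assumptions, not a reference implementation):

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in sub-linear space: each of `depth`
    rows hashes the key to one of `width` counters. Collisions can only
    overcount, so the minimum across rows is an upper-biased estimate."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        for i in range(self.depth):  # one seeded hash per row
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, key, count: int = 1):
        for row, slot in zip(self.rows, self._slots(key)):
            row[slot] += count

    def estimate(self, key) -> int:
        # Never underestimates the true count.
        return min(row[slot] for row, slot in zip(self.rows, self._slots(key)))
```

Widening the rows shrinks the overcount; adding rows shrinks the probability of a bad estimate, the same accuracy-versus-space dial HyperLogLog exposes through its register count.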
25. Algorithms that extend our problem set
• Locality-Sensitive Hashing (finding similar items)
• ECLAT (frequent itemset mining)
• BigCLAM (graph clustering)
• Many more
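To make the first of those bullets concrete, MinHash is the signature-building block behind Locality-Sensitive Hashing for set similarity. A minimal sketch (the seeded-MD5 hash family and the 64-hash signature length are my assumptions):

```python
import hashlib

def minhash_signature(items, num_hashes: int = 64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the set. Two sets agree at a signature position with
    probability equal to their Jaccard similarity."""
    return [
        min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

LSH then bands these fixed-size signatures so that only pairs agreeing on many positions are compared directly, turning an all-pairs similarity search into a hash lookup.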