
# Introduction to Big Data Engineering

How to construct algorithms that can scale

March 11, 2015

## Transcript

1. ### Introduction to Big Data Engineering: How to construct algorithms that can scale (Felix Chern)
2. ### About me • Felix Chern • 2 years of Hadoop experience • MapReduce and Hive • Started Hadoop on an 8-node cluster • Later joined OpenX to manage and develop on a 600+ node cluster • 100+ TB/day data volume • http://idryman.org

6. ### Solve Larger Problems • Bridges: cross rivers or seas • Buildings: increase volume for human activities • Cars: transport • Rockets: explore outer space

8. ### Data Engineering Problems • Maintain the data flow • Extract Transform Load (ETL) • Downstream consumers: reporting and analytics • Data correctness • System stability
9. ### Extract Transform Load • De-duplicate (remove unwanted data) • Join relevant data • Aggregations from different perspectives • select … from large_table group by buyers; • select … from large_table group by publishers;

11. ### Estimate instead of actually running • Estimate the properties of an aggregation (or any other data operation) • Cardinality • Frequency distribution (is the data skewed?) • Top-k elements • Doesn't need to be precise, but must be fast and stable
12. ### Probabilistic Data Structures/Algorithms • Cardinality -> HyperLogLog • Frequency distribution -> Count-Min Sketch • Top-k elements -> lossy counting • Runtime: O(log(N)) • Data size: several KB
13. ### MapReduce (Jeffrey Dean et al.) • When the data size is small, all computation would fit in memory

15. ### Cardinality test • Goal: find the distinct element count of a given query • Naive approach 1: put everything into a hash set • not enough memory! • Approach 2: hash the input into a BitSet (allow collisions) • Linear counting [ACM '90] • Estimate the impact of collisions • Still requires ~1/10 the size of the input (too large!)
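The bitmap approach above (linear counting) fits in a few lines of Python. This is a minimal illustrative sketch, not code from the talk; the function name, the use of MD5, and the bitmap size `m` are all arbitrary choices:

```python
import hashlib
import math

def linear_count(items, m=4096):
    # Bitmap of m buckets; collisions are allowed.
    bits = [0] * m
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m
        bits[h] = 1
    empty = bits.count(0)
    if empty == 0:
        raise ValueError("bitmap saturated; increase m")
    # Linear counting estimator: correct for collisions with
    # n_hat = -m * ln(fraction of empty buckets)
    return -m * math.log(empty / m)
```

The estimator corrects for collisions, but the bitmap still has to grow linearly with the cardinality, which is the "too large" limitation the slide points out.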
16. ### LogLog (basic idea) • Pick a hash function h that maps each of the n elements to at least log(n) bits • For each a = h(x), let r(a) be the number of trailing zeros • Record R = the maximum r(a) seen • Cardinality estimate = 2^R • Data size (R): 4 or 8 bytes
17. ### n elements, m distinct ones • a = h(x) = 01010001001100, r(a) = 2 • Pr( r(a) = 2 ) = 1/4 • 01000000000000: you need about 2^12 distinct elements to see this occur!
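The trailing-zero idea from the two slides above can be sketched as a single Flajolet-Martin-style estimator in Python. Note this is an illustration, not production code: a lone 2^R estimate has very high variance, which is why LogLog and HyperLogLog split the input into many registers and average them. The names and the 32-bit truncation are assumptions:

```python
import hashlib

def trailing_zeros(n, width=32):
    # r(a): number of trailing zero bits in the hash value.
    if n == 0:
        return width
    tz = 0
    while n & 1 == 0:
        n >>= 1
        tz += 1
    return tz

def fm_estimate(items):
    # Keep only R = max trailing zeros seen; estimate cardinality as 2^R.
    R = 0
    for x in items:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

The state is just the single integer R, which is the point of the "4 or 8 bytes" claim on the previous slide.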

19. ### Implementations • Aggregate Knowledge Inc. • Blob schema: https://github.com/aggregateknowledge/hll-storage-spec • https://github.com/aggregateknowledge/java-hll • https://github.com/aggregateknowledge/js-hll • https://github.com/aggregateknowledge/postgresql-hll • AddThis Inc. • https://github.com/addthis/stream-lib • Others • https://github.com/tvondra/distinct_estimators/tree/master/superloglog
20. ### Hive HLL

-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality FROM src;

-- create a hyperloglog cache per hour
FROM input_table src
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'

-- read the cache and calculate the cardinality of the full day
SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

-- unpack the hive hll struct so it is readable by postgresql-hll and js-hll, developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';

https://github.com/dryman/hive-probabilistic-utils
21. ### Business Values <3 • Distinct count on users (cookie, User-Agent, etc.) every hour/minute • How many unique users do we have every day/week/month? • What is the churn rate? • It can even be visualized in the browser! (js-hll)

24. ### Similar algorithms to HyperLogLog • Count-Min Sketch (frequency estimation) • Bloom filter (probabilistic set) • Lossy counting (top-k elements) • AMS (2nd order moment estimation)
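As a rough illustration of how one of these works, here is a minimal Count-Min Sketch in Python. It is a from-scratch sketch, not any of the libraries listed earlier, and the width/depth parameters and hashing scheme are arbitrary:

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; each row indexed by an independent hash."""

    def __init__(self, w=1024, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _hashes(self, x):
        # Derive d hash functions by seeding one hash with the row index.
        for i in range(self.d):
            h = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def add(self, x, count=1):
        for i, j in enumerate(self._hashes(x)):
            self.table[i][j] += count

    def estimate(self, x):
        # Take the minimum across rows: collisions can only inflate
        # counters, so the estimate never undercounts.
        return min(self.table[i][j] for i, j in enumerate(self._hashes(x)))
```

The memory footprint is fixed (w × d counters) regardless of how many distinct keys flow through, which is the same trade-off HyperLogLog makes for cardinality.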
25. ### Algorithms that extend our problem sets • Locality-Sensitive Hashing (finding similar items) • ECLAT (frequent itemset mining) • BigCLAM (graph clustering) • Many more
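Locality-Sensitive Hashing for set similarity is commonly built on MinHash signatures: the probability that two sets share the same per-hash minimum equals their Jaccard similarity. A minimal illustrative sketch in Python (the function names and the 64-hash signature size are assumptions, not from the talk):

```python
import hashlib

def minhash_signature(items, num_hashes=64):
    # One minimum hash value per seeded hash function.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching minimums estimates |A ∩ B| / |A ∪ B|.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

An LSH index then bands these short signatures so that near-duplicate items hash into the same bucket, avoiding all-pairs comparisons over the full data.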