Slide 1

Introduction to Big Data Engineering
How to construct algorithms that can scale
Felix Chern

Slide 2

About me
• Felix Chern
• 2 years of Hadoop experience
• MapReduce and Hive
• Started Hadoop on an 8-node cluster
• Later joined OpenX to manage and develop on a 600+ node cluster
• 100+ TB/day data volume
• http://idryman.org

Slide 3

Big Data Engineering

Slide 4

Center of mass Physics Duct Tape Algorithms Framework and Tools

Slide 5

Mastering Physics: Bridges, Buildings, Cars, Rockets

Slide 6

Solve Larger Problems
• Bridges: cross a river or sea
• Buildings: increase volume for human activities
• Cars: transportation
• Rockets: explore outer space

Slide 7

Problems → Algorithms → Solutions: more solvable problems, more business value

Slide 8

Data Engineering Problems
• Maintain the data flow
• Extract, Transform, Load (ETL)
• Downstream consumers: reporting and analytics
• Data correctness
• System stability

Slide 9

Extract Transform Load
• De-duplicate (remove unwanted data)
• Join relevant data
• Aggregations from different perspectives
• select … from large_table group by buyers;
• select … from large_table group by publishers;

Slide 10

Pipeline: Hadoop jobs (Filter → Join → Aggregate) → DB → Reports / Ad hoc queries. Aggregation size?

Slide 11

Estimate instead of an actual run
• Estimate the properties of an aggregation (or any other data operation)
• Cardinality
• Frequency distribution (is the data skewed?)
• Top-k elements
• Doesn't need to be precise, but should be fast and stable

Slide 12

Probabilistic Data Structures/Algorithms
• Cardinality → HyperLogLog
• Frequency distribution → count-min sketch
• Top-k elements → lossy counting
Runtime: O(log(N)); data size: several KBs

Slide 13

MapReduce, Jeffrey Dean et al.
When the data size is small, all of the computation fits in memory.

Slide 14

Look ma, in-memory computing! https://flic.kr/p/3evfjL

Slide 15

Cardinality test
• Goal: find the distinct element count of a given query
• Naive approach 1: put everything into a hash set
  • Not enough memory!
• Approach 2: hash the input into a BitSet (allow collisions)
  • Linear counting [ACM '90]
  • Estimate the impact of collisions
  • Still requires about 1/10 the size of the input (too large!)
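A minimal sketch of approach 2 in Java, assuming a hypothetical LinearCounter class (not taken from any particular library): hash each element into an m-bit bitmap, then correct for collisions with the linear-counting estimate n ≈ -m * ln(fraction of zero bits). The bitmap still has to grow roughly with the input, which is why the slide calls it too large.

import java.util.BitSet;

// Hypothetical illustration of linear counting: a fixed m-bit bitmap plus a
// collision correction, instead of storing every distinct element.
public class LinearCounter {
    private final int m;          // bitmap size in bits
    private final BitSet bits;

    public LinearCounter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    public void offer(String element) {
        // Collisions are allowed on purpose; they are corrected for below.
        bits.set(Math.floorMod(element.hashCode(), m));
    }

    public long cardinality() {
        double zeroFraction = (double) (m - bits.cardinality()) / m;
        if (zeroFraction == 0.0) {
            return m;             // bitmap saturated; the estimate breaks down
        }
        return Math.round(-m * Math.log(zeroFraction));
    }
}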

Slide 16

LogLog (basic idea)
• Pick a hash function h that maps each of the n elements to at least log(n) bits
• For each a = h(x), let r(a) be the number of trailing zeros
• Record R = the maximum r(a) seen
• Cardinality estimate = 2^R
• Data size (R): 4 or 8 bytes
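A minimal sketch of this idea in Java (hypothetical class; a real LogLog/HyperLogLog splits the stream across many registers and averages them to reduce variance, and uses a proper hash such as MurmurHash):

// Keep only R, the largest number of trailing zeros seen in hashed values,
// and estimate the cardinality as 2^R.
public class TrailingZeroEstimator {
    private int maxTrailingZeros = 0;        // R: a few bytes of state in total

    public void offer(Object x) {
        long a = mix(x.hashCode());
        if (a == 0) {
            return;                          // avoid counting 64 trailing zeros
        }
        int r = Long.numberOfTrailingZeros(a);
        if (r > maxTrailingZeros) {
            maxTrailingZeros = r;
        }
    }

    public long cardinality() {
        return 1L << maxTrailingZeros;       // estimate = 2^R
    }

    // Stand-in bit mixer so the example is self-contained.
    private static long mix(long h) {
        h *= 0x9E3779B97F4A7C15L;
        return h ^ (h >>> 32);
    }
}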

Slide 17

n elements, m distinct ones, a = h(x)
01010001001100: r(a) = 2, Pr( r(a) = 2 ) = 1/4
01000000000000: r(a) = 12, need about 2^12 distinct elements to see this occur!

Slide 18

LogLog
M. Durand and P. Flajolet, "Loglog Counting of Large Cardinalities"

Slide 19

Implementations
• Aggregate Knowledge Inc.
  • Blob schema: https://github.com/aggregateknowledge/hll-storage-spec
  • https://github.com/aggregateknowledge/java-hll
  • https://github.com/aggregateknowledge/js-hll
  • https://github.com/aggregateknowledge/postgresql-hll
• AddThis Inc.
  • https://github.com/addthis/stream-lib
• Others
  • https://github.com/tvondra/distinct_estimators/tree/master/superloglog
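For example, a hedged usage sketch against stream-lib; the class and method names below (HyperLogLog(log2m), offer, cardinality) match the com.clearspring.analytics.stream.cardinality package in the versions I have seen, but check the repo for the current API:

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class DistinctUsersExample {
    public static void main(String[] args) {
        // log2m = 14 → 2^14 registers: kilobyte-scale state, roughly 1% standard error.
        HyperLogLog hll = new HyperLogLog(14);
        hll.offer("cookie-aaa");
        hll.offer("cookie-bbb");
        hll.offer("cookie-aaa");                  // duplicates do not inflate the count
        System.out.println(hll.cardinality());    // prints an estimate close to 2
    }
}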

Slide 20

Hive HLL

-- estimate the cardinality of: SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality FROM src;

-- create a hyperloglog cache per hour
FROM input_table src
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01', h='00')
  SELECT hll(col1, col2) WHERE d='2015-03-01' AND h='00'
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01', h='01')
  SELECT hll(col1, col2) WHERE d='2015-03-01' AND h='01';

-- read the cache and calculate the cardinality of the full day
SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

-- unpack the hive hll struct so it can be read by postgresql-hll and js-hll
-- (developed by Aggregate Knowledge, Inc.)
SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';

https://github.com/dryman/hive-probabilistic-utils

Slide 21

Business Values <3
• Distinct count of users (cookie, User-Agent, etc.) every hour/minute
• How many unique users do we have every day/week/month?
• What is the churn rate?
• It can even be visualized in the browser! (js-hll)

Slide 22

Problems → Algorithms → Solutions: more solvable problems, more business value

Slide 23

This is just one algorithm

Slide 24

Similar algorithms to HyperLogLog
• Count-min sketch (frequency estimation)
• Bloom filter (probabilistic set membership)
• Lossy counting (top-k elements)
• AMS (2nd-order moment estimation)
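As a taste of the family, here is a minimal count-min sketch in Java (hypothetical and self-contained; real implementations such as stream-lib use pairwise-independent hash functions rather than the hashCode-XOR-seed trick below). Each item increments one counter per row, and the frequency estimate is the minimum across rows, which can over-count because of collisions but never under-counts.

import java.util.Random;

// d hash rows of w counters each; total memory is d * w counters,
// independent of how many items flow through.
public class CountMinSketch {
    private final int width;
    private final int depth;
    private final long[][] counts;
    private final int[] seeds;

    public CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    private int bucket(String item, int row) {
        return Math.floorMod(item.hashCode() ^ seeds[row], width);
    }

    public void add(String item) {
        for (int i = 0; i < depth; i++) {
            counts[i][bucket(item, i)]++;
        }
    }

    public long estimateCount(String item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counts[i][bucket(item, i)]);
        }
        return min;
    }
}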

Slide 25

Algorithms that extend our problem sets
• Locality Sensitive Hashing (finding similar items)
• ECLAT (frequent item sets)
• BigCLAM (graph clustering)
• Many more

Slide 26

One more thing… We are hiring!

Slide 27

Contact www.openx.com/careers

Slide 28

Questions?