
Introduction to Big Data Engineering


How to construct algorithms that can scale

Felix Chern

March 11, 2015

Transcript

  1. About me
     • Felix Chern
     • 2 years of Hadoop experience
     • MapReduce and Hive
     • Started with an 8-node Hadoop cluster
     • Later joined OpenX to manage and develop on a 600+ node cluster
     • 100+ TB/day data volume
     • http://idryman.org
  2. Solve Larger Problems
     • Bridges: cross rivers or seas
     • Buildings: increase volume for human activities
     • Cars: transport
     • Rockets: explore outer space
  3. Data Engineering Problems
     • Maintain the data flow
     • Extract Transform Load (ETL)
     • Downstream consumers: reporting and analytics
     • Data correctness
     • System stability
  4. Extract Transform Load
     • De-duplicate (remove unwanted data)
     • Join relevant data
     • Aggregations from different perspectives:
       select … from large_table group by buyers;
       select … from large_table group by publishers;
  5. Estimate instead of the actual run
     • Estimate the properties of an aggregation (or any other data operation):
       • Cardinality
       • Frequency distribution (is the data skewed?)
       • Top-k elements
     • The estimate doesn't need to be precise, but it must be fast and stable
  6. Probabilistic Data Structures/Algorithms
     • Cardinality -> HyperLogLog
     • Frequency distribution -> count-min sketch (a sketch of it follows below)
     • Top-k elements -> lossy counting
     Runtime: O(log(N)); data size: several KBs
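Before the HyperLogLog deep dive, here is a taste of the second row above: a from-scratch count-min sketch written purely as an illustration (the class name and parameters are not from the talk). It never under-counts, and it over-counts each frequency by at most a small fraction of the total stream size.

    import java.util.Random;

    public class TinyCountMin {
        private final long[][] table;   // depth rows x width counters
        private final int[] seeds;      // one hash seed per row
        private final int width;

        public TinyCountMin(int depth, int width) {
            this.table = new long[depth][width];
            this.width = width;
            this.seeds = new Random(42).ints(depth).toArray();
        }

        private int bucket(Object item, int row) {
            int h = item.hashCode() * 31 + seeds[row];
            h ^= h >>> 16;                      // mix the bits a little
            return (h & 0x7fffffff) % width;
        }

        public void add(Object item) {
            for (int row = 0; row < table.length; row++) {
                table[row][bucket(item, row)]++;
            }
        }

        public long estimateCount(Object item) {
            long min = Long.MAX_VALUE;          // collisions only inflate counters,
            for (int row = 0; row < table.length; row++) {
                min = Math.min(min, table[row][bucket(item, row)]);  // so take the row minimum
            }
            return min;
        }
    }

A skew check then amounts to comparing estimateCount for a handful of hot keys against the total number of rows.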
  7. Cardinality test
     • Goal: find the distinct element count of a given query
     • Naive approach 1: put everything into a hash set (not enough memory!)
     • Approach 2: hash the input into a BitSet (allow collisions)
       • Linear counting [ACM '90]: estimate the collision impact (sketched below)
       • Still requires about 1/10 the size of the input (too large!)
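A minimal sketch of approach 2, assuming Object.hashCode stands in for a real hash function (a production version would use something like Murmur): hash into an m-bit BitSet and correct for collisions with the linear-counting estimate n ≈ -m · ln(fraction of zero bits).

    import java.util.BitSet;

    public class LinearCounter {
        private final BitSet bits;
        private final int m;                     // bitmap size, ~1/10 of the expected input

        public LinearCounter(int m) {
            this.m = m;
            this.bits = new BitSet(m);
        }

        public void offer(Object o) {
            int h = o.hashCode() & 0x7fffffff;   // non-negative hash
            bits.set(h % m);                     // collisions are allowed
        }

        public long cardinality() {
            int zeros = m - bits.cardinality();  // positions never hit
            if (zeros == 0) return m;            // bitmap saturated; the estimate breaks down
            return Math.round(-m * Math.log((double) zeros / m));
        }
    }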
  8. LogLog (basic idea)
     • Pick a hash function h that maps each of the n elements to at least log(n) bits
     • For each a = h(x), let r(a) be the number of trailing zeros
     • Record R = the maximum r(a) seen
     • Cardinality estimate = 2^R (toy code below)
     • Data size (R): 4 or 8 bytes
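A toy, single-register version of this idea (illustrative only; real LogLog/HyperLogLog keeps many registers and applies bias correction, and the hash below is an assumed stand-in):

    public class NaiveLogLog {
        private int maxTrailingZeros = 0;        // R

        public void offer(Object o) {
            long a = hash64(o);
            if (a == 0) return;                  // an all-zero hash has no defined r(a)
            int r = Long.numberOfTrailingZeros(a);
            maxTrailingZeros = Math.max(maxTrailingZeros, r);
        }

        public long cardinality() {
            return 1L << maxTrailingZeros;       // 2^R
        }

        // Stand-in for "pick a hash function h"; use a 64-bit Murmur in practice.
        private static long hash64(Object o) {
            long x = o.hashCode() * 0x9E3779B97F4A7C15L;
            x ^= x >>> 32;
            return x * 0xBF58476D1CE4E5B9L;
        }
    }

A single register is very noisy (one lucky hash inflates 2^R badly); the production algorithms split the input into many buckets by hash prefix and average the per-bucket estimates.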
  9. n elements, m distinct ones; a = h(x)
     • 01010001001100: r(a) = 2, and Pr( r(a) = 2 ) = 1/4
     • 01000000000000: r(a) = 12; you need about 2^12 distinct elements to see this occur!
  10. Implementations
      • Aggregate Knowledge Inc.
        • Blob schema: https://github.com/aggregateknowledge/hll-storage-spec
        • https://github.com/aggregateknowledge/java-hll
        • https://github.com/aggregateknowledge/js-hll
        • https://github.com/aggregateknowledge/postgresql-hll
      • AddThis Inc.
        • https://github.com/addthis/stream-lib (usage example below)
      • Others
        • https://github.com/tvondra/distinct_estimators/tree/master/superloglog
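With stream-lib, for instance, distinct counting takes a few lines. This example is written from memory of the library's API (the constructor argument is log2m, the register-count exponent), so verify against the current docs:

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    public class HllExample {
        public static void main(String[] args) {
            HyperLogLog hll = new HyperLogLog(14);     // 2^14 registers, roughly 10 KB
            for (int i = 0; i < 1_000_000; i++) {
                hll.offer("user-" + (i % 50_000));     // only 50,000 distinct values
            }
            System.out.println(hll.cardinality());     // ≈ 50,000, ~1% standard error
        }
    }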
  11. Hive HLL

      -- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
      SELECT hll(col1, col2).cardinality FROM src;

      -- create a hyperloglog cache per hour
      FROM input_table src
      INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
        SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
      INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
        SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01';

      -- read the cache and calculate the cardinality of the full day
      SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

      -- unpack the Hive hll struct so it is readable by the postgres-hll and
      -- js-hll tools developed by Aggregate Knowledge, Inc.
      SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';

      https://github.com/dryman/hive-probabilistic-utils
  12. Business Values <3
      • Distinct counts of users (cookie, User-Agent, etc.) every hour/minute
      • How many unique users do we have every day/week/month? (see the merge sketch below)
      • What is the churn rate?
      • It can even be visualized in the browser! (js-hll)
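The hourly HLL caches from the previous slide are what make the day/week/month questions cheap: sketches union losslessly, so daily uniques come from merging 24 hourly sketches instead of re-scanning the raw logs. A sketch using stream-lib's merge (assumed API; check the signatures):

    import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import com.clearspring.analytics.stream.cardinality.ICardinality;

    public class DailyUniques {
        // Union the 24 hourly sketches, then read off the daily distinct count.
        public static long dailyUniques(HyperLogLog[] hourly) throws CardinalityMergeException {
            ICardinality day = hourly[0];
            for (int h = 1; h < hourly.length; h++) {
                day = day.merge(hourly[h]);
            }
            return day.cardinality();
        }
    }

Churn-style questions can be approximated the same way: the merge gives |A ∪ B| for two days, and inclusion-exclusion with the two daily counts yields the (noisy) overlap.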
  13. Algorithms similar to HyperLogLog
      • Count-min sketch (frequency estimation)
      • Bloom filter (probabilistic set membership)
      • Lossy counting (top-k elements; implemented below)
      • AMS (second frequency moment estimation)
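Of these, lossy counting is simple enough to fit on a slide. Below is a from-scratch rendering of the Manku-Motwani algorithm (illustrative; none of this code comes from the talk or the libraries above). Stored counts under-estimate true frequencies by at most epsilon · N, and cold items are pruned at every bucket boundary, which keeps memory small:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class LossyCounter<T> {
        private static final class Entry {
            long count; final long delta;
            Entry(long count, long delta) { this.count = count; this.delta = delta; }
        }

        private final Map<T, Entry> entries = new HashMap<>();
        private final long bucketWidth;     // w = ceil(1/epsilon)
        private final double epsilon;
        private long n = 0;                 // items seen so far

        public LossyCounter(double epsilon) {
            this.epsilon = epsilon;
            this.bucketWidth = (long) Math.ceil(1.0 / epsilon);
        }

        public void offer(T item) {
            n++;
            long bucket = (long) Math.ceil((double) n / bucketWidth);
            Entry e = entries.get(item);
            if (e != null) e.count++;
            else entries.put(item, new Entry(1, bucket - 1));
            if (n % bucketWidth == 0) {     // prune at every bucket boundary
                Iterator<Entry> it = entries.values().iterator();
                while (it.hasNext()) {
                    Entry cur = it.next();
                    if (cur.count + cur.delta <= bucket) it.remove();
                }
            }
        }

        // Items whose true frequency is at least support * n; may contain a few
        // false positives but never misses a truly frequent item.
        public Map<T, Long> frequentItems(double support) {
            Map<T, Long> out = new HashMap<>();
            entries.forEach((item, e) -> {
                if (e.count >= (support - epsilon) * n) out.put(item, e.count);
            });
            return out;
        }
    }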
  14. Algorithms that extend our problem sets
      • Locality Sensitive Hashing (finding similar items; see the MinHash sketch below)
      • ECLAT (frequent item sets)
      • BigCLAM (graph clustering)
      • Many more
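As a closing pointer for the first item: MinHash is the classic LSH building block for set similarity. A slide-sized, from-scratch illustration (the hash family here is an assumption; real deployments use stronger hashes and band the signatures for candidate lookup):

    import java.util.Arrays;
    import java.util.Set;

    public class MinHash {
        // k-slot signature: for each of k hash functions, keep the minimum
        // hash value over all items in the set.
        public static int[] signature(Set<String> items, int k) {
            int[] sig = new int[k];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (String item : items) {
                for (int i = 0; i < k; i++) {
                    int h = item.hashCode() * 31 + i * 0x9E3779B9;  // i-th hash (assumed family)
                    h ^= h >>> 16;
                    sig[i] = Math.min(sig[i], h);
                }
            }
            return sig;
        }

        // The fraction of matching slots estimates the Jaccard similarity
        // of the two original sets.
        public static double similarity(int[] a, int[] b) {
            int match = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] == b[i]) match++;
            }
            return (double) match / a.length;
        }
    }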