Felix Chern
March 11, 2015

# Introduction to Big Data Engineering

How to construct algorithms that can scale

## Transcript

1. Introduction to Big Data Engineering
How to construct algorithms that can scale
Felix Chern

2. Felix Chern
• 2 years of Hadoop experience
• MapReduce and Hive
• Started Hadoop using an 8-node cluster
• Later joined OpenX to manage and develop on a 600+ node cluster
• 100+ TB/day data volume
• http://idryman.org

3. Big Data Engineering

4. Center of mass
Physics ↔ Duck Tape
Algorithms ↔ Framework and Tools

5. Mastering Physics
• Bridges
• Buildings
• Cars
• Rockets

6. Solve Larger Problems
• Bridges: cross a river or sea
• Buildings: increase volume for human activities
• Cars: transport
• Rockets: explore outer space

7. Algorithms
(Diagram: algorithms map problems to solutions; more solvable problems, more values.)

8. Data Engineering
Problems
• Maintain the data flow
• Downstream consumer: reporting and analytics
• Data correctness
• System stability

9.
• De-duplicate (remove unwanted data)
• Join relevant data
• Aggregations from different perspectives
• select … from large_table group by buyers;
• select … from large_table group by publishers;

10. Pipeline: Filter → Join → Aggregate → DB → Reports
Aggregation size?

11. Estimate vs. actual run
• Estimate the properties of an aggregation (or any other data operation)
• Cardinality
• Frequency distribution (is the data skewed?)
• Top-k elements
• Doesn't need to be precise, but should be fast and stable

12. Probabilistic Data Structures/Algorithms
• Cardinality → HyperLogLog
• Frequency distribution → count-min sketch
• Top-k elements → lossy counting
Runtime: O(log(N))
Data size: several KBs

13. MapReduce, Jeffrey Dean et al.
When the data size is small, all computation can stay in memory.

14. Look ma, in-memory computing!
https://flic.kr/p/3evfjL

15. Cardinality test
• Goal: find the distinct element count of a given query
• Naive approach 1: put everything into a hash set
• Not enough memory!
• Approach 2: hash the input to a BitSet (allow collisions)
• Linear counting [ACM '90]
• Estimate the collision impact
• Still requires 1/10 the size of the input (too large!)
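A minimal sketch of linear counting as described above, assuming a bitmap of m bits and using MD5 purely as an illustrative hash; the function name and sizes are not from the slides:

```python
import hashlib
import math

def linear_count(items, m=1024):
    """Linear counting: hash items into an m-bit bitmap (collisions allowed),
    then estimate cardinality from the fraction of bits still zero."""
    bits = [False] * m
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m
        bits[h] = True
    zeros = bits.count(False)
    if zeros == 0:
        return float(m)  # bitmap saturated; the estimate breaks down here
    # E[zeros] = m * exp(-n/m), so invert: n ≈ -m * ln(zeros/m)
    return -m * math.log(zeros / m)

est = linear_count(range(500), m=1024)
```

Note the memory trade-off the slide complains about: the bitmap must stay a meaningful fraction of the input's cardinality, which is why LogLog (next slide) wins for very large sets.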

16. LogLog (basic idea)
• Pick a hash function h that maps each of the n elements to at least log(n) bits
• For each a = h(x), let r(a) be the number of trailing zeros
• Record R = the maximum r(a) seen
• Cardinality estimate = 2^R
• Data size (R): 4 or 8 bytes
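The four steps above can be sketched in a few lines of Python. This is a single-register toy (real LogLog/HyperLogLog averages many registers to tame the huge variance of one register); SHA-1 stands in for the hash function h:

```python
import hashlib

def trailing_zeros(x, maxbits=64):
    """r(a): number of trailing zero bits in x."""
    if x == 0:
        return maxbits
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def loglog_estimate(items):
    """Single-register LogLog: track R = max trailing zeros of h(x)
    over all items; the cardinality estimate is 2**R."""
    R = 0
    for x in items:
        a = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(a))
    return 2 ** R

est = loglog_estimate(range(1000))
```

The state is just one small integer R, which is the "4 or 8 bytes" the slide mentions; the price is that a single register can easily be off by a factor of 2 or more.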

17. n elements, m distinct ones
a = h(x): 01010001001100 → r(a) = 2
Pr( r(a) = 2 ) = 1/4, so it takes about 4 distinct elements to see this occur.

18. LogLog
M. Durand and P. Flajolet, "Loglog Counting of Large Cardinalities"

19. Implementations
• Aggregate Knowledge Inc.
• Blob Schema https://github.com/aggregateknowledge/hll-storage-spec
• https://github.com/aggregateknowledge/java-hll
• https://github.com/aggregateknowledge/js-hll
• https://github.com/aggregateknowledge/postgresql-hll
• Others
• https://github.com/tvondra/distinct_estimators/tree/master/superloglog

20. Hive HLL
-- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
SELECT hll(col1, col2).cardinality FROM src;

-- create a hyperloglog cache per hour
FROM input_table src
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01'

-- read the cache and calculate the cardinality of the full day
SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

-- unpack the hive hll struct to make it readable by the postgresql-hll
-- and js-hll libraries developed by Aggregate Knowledge, Inc.
SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';
https://github.com/dryman/hive-probabilistic-utils
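The hourly-cache pattern works because LogLog-family sketches merge losslessly: the union of two sketches is the element-wise max of their registers. A toy Python illustration (the register count, hash, and sketch layout here are illustrative assumptions, not the hive-probabilistic-utils or Aggregate Knowledge storage format):

```python
import hashlib

M = 64  # number of registers (illustrative)

def sketch(items):
    """Build a toy HLL-style sketch: each item picks a register from its
    hash and the register keeps the max rank (trailing-zero count + 1)."""
    regs = [0] * M
    for x in items:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
        idx, rest = h % M, h // M
        rank = 1
        while rank < 40 and rest & 1 == 0:
            rest >>= 1
            rank += 1
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    """Union of two sketches = element-wise max. This is why per-hour
    caches combine into a per-day cardinality without rescanning data."""
    return [max(x, y) for x, y in zip(a, b)]

hour0 = sketch(range(0, 1000))
hour1 = sketch(range(500, 1500))  # overlaps hour0; duplicates are free
day = merge(hour0, hour1)
```

Because max is idempotent, overlapping hours never double-count, which is exactly what makes the daily rollup in the Hive snippet correct.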

21.
• Distinct count of users (cookie, User-Agent, etc.) every hour/minute
• How many unique users do we have every day/week/month?
• What is the churn rate?
• It can even be visualized in the browser! (js-hll)

22. Algorithms
(Diagram: algorithms map problems to solutions; more solvable problems, more values.)

23. This is just one algorithm

24. Similar algorithms to HyperLogLog
• Count-min sketch (frequency estimation)
• Bloom filter (probabilistic set)
• Lossy counting (top-k elements)
• AMS (2nd order moment estimation)
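Of the structures listed, the count-min sketch is the most compact to demonstrate. A minimal Python sketch (width, depth, hash choice, and the class name are all illustrative assumptions):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters. The frequency estimate is
    the minimum counter across rows: it may overestimate (collisions add),
    but never underestimates."""
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def _hashes(self, x):
        # Derive d independent-ish hashes by salting with the row index.
        for i in range(self.d):
            h = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def add(self, x, count=1):
        for i, h in enumerate(self._hashes(x)):
            self.rows[i][h] += count

    def estimate(self, x):
        return min(self.rows[i][h] for i, h in enumerate(self._hashes(x)))

cms = CountMinSketch()
for word in ["a"] * 100 + ["b"] * 5:
    cms.add(word)
```

The whole table is d × w small integers, which matches the "several KBs" footprint the deck claims for this family of algorithms.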

25. Algorithms that extend our problem sets
• Locality-Sensitive Hashing (finding similar items)
• ECLAT (frequent itemset mining)
• BigCLAM (graph clustering)
• Many more

26. One more thing…
We are hiring!

27. Contact
www.openx.com/careers

28. Questions?