Introduction to Big Data Engineering

How to construct algorithms that can scale

Felix Chern

March 11, 2015

Transcript

  1. Introduction to Big Data Engineering
    How to construct algorithms that can scale
    Felix Chern
  2. About me
    • Felix Chern
    • 2 years of Hadoop experience
    • MapReduce and Hive
    • Started with Hadoop on an 8-node cluster
    • Later joined OpenX to manage and develop on a cluster of 600+ nodes
    • 100+ TB/day data volume
    • http://idryman.org
  3. Big Data Engineering

  4. Center of mass: Physics, Duct Tape, Algorithms, Framework and Tools

  5. Mastering Physics: Bridges, Buildings, Cars, Rockets

  6. Solve Larger Problems
    • Bridges: cross a river or sea
    • Buildings: increase volume for human activities
    • Cars: transport
    • Rockets: explore outer space
  7. Algorithms, Solutions, More solvable problems, More business values, Problems

  8. Data Engineering Problems
    • Maintain the data flow
    • Extract Transform Load (ETL)
    • Downstream consumers: reporting and analytics
    • Data correctness
    • System stability
  9. Extract Transform Load
    • De-duplicate (remove unwanted data)
    • Join relevant data
    • Aggregations from different perspectives
    • select … from large_table group by buyers;
    • select … from large_table group by publishers;
  10. Filter, Join, Hadoop jobs, Aggregate, DB, Reports, Aggregation size?, Ad hoc queries
  11. Estimate instead of actual run
    • Estimate the properties of an aggregation (or any other data operation)
    • Cardinality
    • Frequency distribution (is the data skewed?)
    • Top-k elements
    • Doesn't need to be precise, but must be fast and stable
  12. Probabilistic Data Structures/Algorithms
    • Cardinality -> HyperLogLog
    • Frequency distribution -> count-min sketch
    • Top-k elements -> lossy counting
    • Runtime: O(log(N))
    • Data size: several KB
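To make the "frequency distribution -> count-min sketch" mapping concrete, here is a minimal Python sketch of a count-min sketch. It is an illustration only, not code from the talk: the class name, the `width` and `depth` defaults, and the use of SHA-1 as the hash family are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator; estimates can over-count, never under-count."""

    def __init__(self, width=2048, depth=4):
        # width and depth are illustrative defaults, not values from the talk.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by salting SHA-1 with the row index (assumption).
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the tightest bound.
        return min(self.table[row][col] for row, col in self._slots(item))


cms = CountMinSketch()
for _ in range(100):
    cms.add("buyer_a")
for _ in range(5):
    cms.add("buyer_b")
print(cms.estimate("buyer_a"), cms.estimate("buyer_b"))
```

The table size is fixed up front, so memory stays constant no matter how many rows flow through, which is what makes this kind of structure usable for checking whether data is skewed before running the real aggregation.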
  13. MapReduce, Jeffrey Dean et al.
    When the data size is small, all computation can be done in memory
  14. Look ma, in-memory computing! https://flic.kr/p/3evfjL

  15. Cardinality test
    • Goal: find the distinct element count of a given query
    • Naive approach 1: put everything into a hash set
      • Not enough memory!
    • Approach 2: hash the input into a BitSet (allowing collisions)
      • Linear counting [ACM’90]
      • Estimate the impact of collisions
      • Still requires about 1/10 the size of the input (too large!)
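Approach 2 on this slide can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the talk's code: the function name `linear_count`, the bitmap size `m`, and SHA-1 as the hash are all hypothetical choices. The collision correction used is the standard linear counting estimate, n ≈ -m · ln(empty slots / m).

```python
import hashlib
import math


def linear_count(items, m=1 << 16):
    """Estimate the number of distinct items using an m-slot bitmap."""
    # One byte per slot for clarity; a real BitSet would use one bit per slot.
    bitmap = bytearray(m)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % m
        bitmap[slot] = 1  # collisions are allowed and corrected for below
    empty = m - sum(bitmap)
    if empty == 0:
        raise ValueError("bitmap saturated; use a larger m")
    # Correct for collisions: the fraction of empty slots shrinks like exp(-n/m).
    return -m * math.log(empty / m)


# 20,000 distinct values, each seen three times.
print(round(linear_count(list(range(20000)) * 3)))
```

The bitmap has to grow roughly in proportion to the number of distinct values to stay unsaturated, which is the memory problem the slide points out.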
  16. LogLog (basic idea)
    • Pick a hash function h that maps each of the n elements to at least log(n) bits
    • For each a = h(x), let r(a) be the number of trailing zeros
    • Record R = maximum r(a) seen
    • Cardinality estimate = 2^R
    • Data size (R): 4 bytes or 8 bytes
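The basic idea on this slide fits in a short Python sketch. It keeps a single value R, exactly as described, so the estimate is always a power of two and has high variance; practical LogLog and HyperLogLog implementations average over many registers. The helper names and the choice of SHA-1 truncated to 64 bits are assumptions made for illustration.

```python
import hashlib


def hash64(item):
    """Hash an item to a 64-bit integer (SHA-1 truncated; an illustrative choice)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def trailing_zeros(a, width=64):
    """r(a): number of trailing zero bits in a."""
    if a == 0:
        return width
    # a & -a isolates the lowest set bit; its bit_length is (zeros + 1).
    return (a & -a).bit_length() - 1


def rough_cardinality(items):
    """Single-register estimate: 2 to the power R, where R = max r(h(x))."""
    R = 0
    for x in items:
        R = max(R, trailing_zeros(hash64(x)))
    return 2 ** R


print(trailing_zeros(0b01010001001100))  # 2
print(trailing_zeros(0b01000000000000))  # 12
print(rough_cardinality(range(10000)))
```

Duplicates hash to the same value, so they cannot change R. That is why the structure counts distinct elements rather than total elements.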
  17. n elements, m distinct ones
    • a = h(m) = 01010001001100, r(a) = 2, Pr( r(a) ≥ 2 ) = 1/4
    • a = 01000000000000: need about 2^12 distinct elements to see this occur!
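The numbers on this slide follow from treating each hash bit as an independent fair coin, which is the usual idealized assumption about the hash function. The 1/4 is the probability of seeing at least two trailing zeros, and the same rule gives the estimate for twelve trailing zeros:

```latex
\Pr\bigl(r(a) \ge k\bigr) = 2^{-k},
\qquad
\Pr\bigl(r(a) = k\bigr) = 2^{-(k+1)}

\text{Expected number of distinct hashes until } r(a) \ge k:\quad
\frac{1}{2^{-k}} = 2^{k},
\qquad k = 12 \;\Rightarrow\; 2^{12} = 4096
```

Reading this backwards gives the estimator: if the largest number of trailing zeros observed is R, then roughly 2^R distinct elements have been hashed.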
  18. LogLog
    M. Durand and P. Flajolet. Loglog Counting of Large Cardinalities
  19. Implementations
    • Aggregate Knowledge Inc.
      • Blob Schema https://github.com/aggregateknowledge/hll-storage-spec
      • https://github.com/aggregateknowledge/java-hll
      • https://github.com/aggregateknowledge/js-hll
      • https://github.com/aggregateknowledge/postgresql-hll
    • AddThis Inc.
      • https://github.com/addthis/stream-lib
    • Others
      • https://github.com/tvondra/distinct_estimators/tree/master/superloglog
  20. Hive HLL
    -- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
    SELECT hll(col1, col2).cardinality FROM src;

    -- create hyperloglog cache per hour
    FROM input_table src
      INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
        SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
      INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
        SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01';

    -- read the cache and calculate the cardinality of the full day
    SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

    -- unpack the hive hll struct and make it readable by postgresql-hll and js-hll, developed by Aggregate Knowledge, Inc.
    SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';

    https://github.com/dryman/hive-probabilistic-utils
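The per-hour cache works because HyperLogLog sketches are mergeable: the union of two sketches is the element-wise maximum of their registers, so hourly sketches combine into a daily one without rescanning the data. The Python below is a simplified sketch of that property, not the hive-probabilistic-utils implementation. The register count, the SHA-1 hash, the trailing-zero rank (matching the earlier slides), and all function names are assumptions.

```python
import hashlib
import math

P = 10          # assumed precision: 2**10 = 1024 registers
M = 1 << P


def hash64(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(sketch, item):
    h = hash64(item)
    idx = h & (M - 1)   # low P bits select the register
    rest = h >> P       # remaining bits determine the rank
    # rank = trailing zeros of `rest` plus one (capped if rest is all zeros)
    rank = (rest & -rest).bit_length() if rest else 64 - P + 1
    sketch[idx] = max(sketch[idx], rank)


def merge(a, b):
    """Union of two sketches: element-wise maximum of the registers."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting on empty registers.
        return M * math.log(M / zeros)
    return raw


# Two overlapping "hours" of user ids; 50,000 distinct ids in total.
hour0, hour1 = new_sketch(), new_sketch()
for user in range(30000):
    add(hour0, user)
for user in range(20000, 50000):
    add(hour1, user)
print(round(estimate(merge(hour0, hour1))))
```

Merging gives exactly the same registers as building one sketch over the whole day, so the hourly cache loses no accuracy compared with a full-day scan.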
  21. Business Values <3
    • Distinct count of users (cookie, User Agent, etc.) every hour/minute
    • How many unique users do we have every day/week/month?
    • What is the churn rate?
    • It can even be visualized in the browser! (js-hll)
  22. Algorithms, Solutions, More solvable problems, More business values, Problems

  23. This is just one algorithm

  24. Similar algorithms to HyperLogLog
    • Count-min sketch (frequency estimation)
    • Bloom filter (probabilistic set)
    • Lossy counting (top-k elements)
    • AMS (2nd-order moment estimation)
  25. Algorithms that extend our problem sets
    • Locality Sensitive Hashing (finding similar items)
    • ECLAT (frequent item sets)
    • BigCLAM (graph clustering)
    • Many more
  26. One more thing… We are hiring!

  27. Contact www.openx.com/careers

  28. Questions?