Introduction to Big Data Engineering

How to construct algorithms that can scale

Felix Chern

March 11, 2015
Transcript

  1. Introduction to Big Data Engineering
    How to construct algorithms that can scale
    Felix Chern

  2. About me
    • Felix Chern
    • 2 years of Hadoop experience
    • MapReduce and Hive
    • Started with an 8-node Hadoop cluster
    • Later joined OpenX to manage and develop on a 600+ node cluster
    • 100+ TB/day data volume
    • http://idryman.org

  3. Big Data
    Engineering

  4. Center of mass
    Physics / Duct Tape
    Algorithms / Frameworks and Tools

  5. Mastering
    Physics
    Bridges
    Buildings
    Cars
    Rockets

  6. Solve Larger Problems
    Bridges: cross a river or sea
    Buildings: increase the volume for human activities
    Cars: transport
    Rockets: explore outer space

  7. Algorithms
    Problems → Solutions
    More solvable problems, more business value

  8. Data Engineering Problems
    • Maintain the data flow
    • Extract Transform Load (ETL)
    • Downstream consumers: reporting and analytics
    • Data correctness
    • System stability

  9. Extract Transform Load
    • De-duplicate (remove unwanted data)
    • Join relevant data
    • Aggregations from different perspectives
    • select … from large_table group by buyers;
    • select … from large_table group by publishers;

  10. Hadoop jobs: Filter → Join → Aggregate → DB → Reports
    Aggregation size?
    Ad hoc queries

  11. Estimate instead of actually running
    • Estimate the properties of an aggregation (or any other data operation)
    • Cardinality
    • Frequency distribution (is the data skewed?)
    • Top-k elements
    • Doesn't need to be precise, but must be fast and stable

  12. Probabilistic Data Structures/Algorithms
    • Cardinality → HyperLogLog
    • Frequency distribution → count-min sketch
    • Top-k elements → lossy counting
    Runtime: O(log(N))
    Data size: several KB

  13. MapReduce, Jeffrey Dean et al.
    When the data size is small, all computation can be done in memory.

  14. Look ma, in-memory computing!
    https://flic.kr/p/3evfjL

  15. Cardinality test
    • Goal: find the distinct element count of a given query
    • Naive approach 1: put everything into a hash set
    • not enough memory!
    • Approach 2: hash the input into a BitSet (allow collisions)
    • Linear counting [ACM '90]
    • Estimate the collision impact
    • Still requires about 1/10 the size of the input (too large!)

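    Approach 2 above (linear counting) can be sketched in a few lines of Python. This is an illustrative toy, not code from the talk; the function name and parameters are assumptions.

    ```python
    import hashlib
    import math

    def linear_count(elements, m=4096):
        """Linear counting: hash each element into an m-bit bitmap
        (collisions allowed), then estimate the cardinality from the
        fraction of bits that are still zero."""
        bitmap = [0] * m
        for x in elements:
            bitmap[int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m] = 1
        zeros = bitmap.count(0)
        if zeros == 0:
            return float(m)  # bitmap saturated; m was too small
        return m * math.log(m / zeros)  # n ~= -m * ln(zeros / m)

    print(round(linear_count(range(500))))  # close to 500
    ```

    The logarithm corrects for collisions, but the bitmap still has to grow with the input — the slide's point that linear counting needs roughly 1/10 of the input size.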
  16. LogLog (basic idea)
    • Pick a hash function h that maps each of the n elements to at least log(n) bits
    • For each a = h(x), let r(a) be the number of trailing zeros
    • Record R = the maximum r(a) seen
    • Cardinality estimate = 2^R
    • Data size (R): 4 or 8 bytes

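    The idea on this slide can be sketched as a toy single-register estimator in Python (real LogLog/HyperLogLog averages many registers to cut the variance; function names here are illustrative):

    ```python
    import hashlib

    def trailing_zeros(a, width=64):
        """Count trailing zero bits in a (0 is treated as all zeros)."""
        if a == 0:
            return width
        r = 0
        while a & 1 == 0:
            a >>= 1
            r += 1
        return r

    def loglog_estimate(elements):
        """Single-register LogLog: keep only R = max trailing zeros seen,
        and estimate the cardinality as 2^R."""
        R = 0
        for x in elements:
            a = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
            R = max(R, trailing_zeros(a))
        return 2 ** R
    ```

    Note the state is just R, a few bytes regardless of input size — but a single register is very noisy (one lucky hash doubles the estimate), which is why the published algorithms average over many registers.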
  17. n elements, m distinct ones
    a = h(x)
    01010001001100 → r(a) = 2, Pr( r(a) ≥ 2 ) = 1/4
    01000000000000 → r(a) = 12: need about 2^12 distinct elements to see this occur!

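    The probability claim above (Pr( r(a) ≥ k ) = 2^-k for a uniform hash, so seeing 12 trailing zeros takes about 2^12 distinct elements) can be checked with a quick simulation; this code is illustrative, not from the talk:

    ```python
    import random

    random.seed(42)  # make the run deterministic

    def trailing_zeros(a):
        r = 0
        while a % 2 == 0 and r < 32:
            a //= 2
            r += 1
        return r

    # fraction of uniform 32-bit values with at least 2 trailing zeros
    N = 100_000
    frac = sum(trailing_zeros(random.getrandbits(32)) >= 2 for _ in range(N)) / N
    print(frac)  # close to 1/4
    ```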
  18. LogLog
    M. Durand and P. Flajolet, "Loglog Counting of Large Cardinalities"

  19. Implementations
    • Aggregate Knowledge Inc.
    • Blob Schema https://github.com/aggregateknowledge/hll-storage-spec
    • https://github.com/aggregateknowledge/java-hll
    • https://github.com/aggregateknowledge/js-hll
    • https://github.com/aggregateknowledge/postgresql-hll
    • AddThis Inc.
    • https://github.com/addthis/stream-lib
    • Others
    • https://github.com/tvondra/distinct_estimators/tree/master/superloglog

  20. Hive HLL
    -- estimate the cardinality of SELECT * FROM src GROUP BY col1, col2;
    SELECT hll(col1, col2).cardinality FROM src;

    -- create a hyperloglog cache per hour
    FROM input_table src
    INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='00')
    SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='00'
    INSERT OVERWRITE TABLE hll_cache PARTITION (d='2015-03-01',h='01')
    SELECT hll(col1,col2) WHERE d='2015-03-01' AND h='01';

    -- read the cache and compute the cardinality for the full day
    SELECT hll(hll_col).cardinality FROM hll_cache WHERE d='2015-03-01';

    -- unpack the Hive hll struct to make it readable by postgresql-hll
    -- and js-hll, developed by Aggregate Knowledge, Inc.
    SELECT hll_col.signature FROM hll_cache WHERE d='2015-03-01';
    https://github.com/dryman/hive-probabilistic-utils

  21. Business Values <3
    • Distinct counts of users (cookie, user agent, etc.) every hour/minute
    • How many unique users do we have every day/week/month?
    • What is the churn rate?
    • It can even be visualized in the browser! (js-hll)

  22. Algorithms
    Problems → Solutions
    More solvable problems, more business value

  23. This is just one algorithm

  24. Similar algorithms to HyperLogLog
    • Count-min sketch (frequency estimation)
    • Bloom filter (probabilistic set membership)
    • Lossy counting (top-k elements)
    • AMS (second-order moment estimation)

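    As a concrete example of one item on this list, a toy count-min sketch for frequency estimation might look like the following. This is a minimal sketch under assumed parameters (w, d, and the md5-based hashing are choices made here), not a production implementation:

    ```python
    import hashlib

    class CountMinSketch:
        """d hash rows of w counters; an item's frequency estimate is the
        minimum of its d counters, so the sketch can only overestimate."""

        def __init__(self, w=256, d=4):
            self.w, self.d = w, d
            self.table = [[0] * w for _ in range(d)]

        def _cells(self, x):
            # derive d independent-ish hash positions by salting with the row index
            for i in range(self.d):
                h = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
                yield i, int(h, 16) % self.w

        def add(self, x, count=1):
            for i, j in self._cells(x):
                self.table[i][j] += count

        def estimate(self, x):
            return min(self.table[i][j] for i, j in self._cells(x))

    cms = CountMinSketch()
    for word in ["ad"] * 50 + ["click"] * 7:
        cms.add(word)
    print(cms.estimate("ad"))  # at least 50, usually exactly 50
    ```

    Like HyperLogLog, the memory footprint (w × d counters) is fixed regardless of how many items flow through.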
  25. Algorithms that extend our problem sets
    • Locality-Sensitive Hashing (finding similar items)
    • ECLAT (frequent itemsets)
    • BigCLAM (graph clustering)
    • Many more

  26. One more thing…
    We are hiring!

  27. Contact
    www.openx.com/careers

  28. Questions?
