Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

Craig Kerstiens

September 18, 2013
Tweet

More Decks by Craig Kerstiens

Other Decks in Technology

Transcript

  1. Scalable Uniques in
    Postgres -
    Craig Kerstiens
    Heroku Postgres

    View Slide

  2. Postgresql-HLL

    View Slide

  3. Truviso
    • Extended Postgres to do streaming
    • Various markets
    • Ad space
    • Wanted unique impressions
    • Sort of wanted unique impressions

    View Slide

  4. SELECT count(*)

    View Slide

  5. Approx Top K

    View Slide

  6. Compressed Bitmap

    View Slide

  7. HyperLogLog

    View Slide

  8. HyperLogLog
    • KMV - K minimum value

    View Slide

  9. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns

    View Slide

  10. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging

    View Slide

  11. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging

    View Slide

  12. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging

    View Slide

  13. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging
    • Implemented by Aggregate Knowledge

    View Slide

  14. View Slide

  15. HyperLogLog
    Probabilistic uniques with small footprint

    View Slide

  16. HyperLogLog
    Probabilistic uniques with small footprint
    Close enough distinct with small footprint

    View Slide

  17. Use cases

    View Slide

  18. Use cases
    • Semi distinct count
    • Think pg_stat_statements
    • Ad networks
    • Web traffic

    View Slide

  19. Use cases
    • Semi distinct count
    • Think pg_stat_statements
    • Ad networks
    • Web traffic
    • With rollups/groupings

    View Slide

  20. Digging in
    CREATE  EXTENSION  hll;
       CREATE  TABLE  helloworld  (
               id        integer,
               set      hll
       );

    View Slide

  21. Digging in
    CREATE  EXTENSION  hll;
       CREATE  TABLE  helloworld  (
               id        integer,
               set      hll
       );

    View Slide

  22. Inserting data
    UPDATE  helloworld  
    SET  set  =  hll_add(set,  hll_hash_integer(12345))  
    WHERE  id  =  1;
    UPDATE  helloworld  
    SET  set  =  hll_add(set,  hll_hash_text('hello  world'))  
    WHERE  id  =  1;

    View Slide

  23. Real world
    CREATE  TABLE  daily_uniques  (
           date                        date  UNIQUE,
           users                      hll
    );

    View Slide

  24. Real world
    INSERT  INTO  daily_uniques(date,  users)
       SELECT  date,  hll_add_agg(hll_hash_integer(user_id))
       FROM  users
       GROUP  BY  1;

    View Slide

  25. Real world
    SELECT  
                 EXTRACT(MONTH  FROM  date)  AS  month,  
                 hll_cardinality(hll_union_agg(users))
    FROM  daily_uniques
    WHERE  date  >=  '2012-­‐01-­‐01'  AND
               date  <    '2013-­‐01-­‐01'
    GROUP  BY  1;

    View Slide

  26. Real world
    SELECT  
                 EXTRACT(MONTH  FROM  date)  AS  month,  
                 hll_cardinality(hll_union_agg(users))
    FROM  daily_uniques
    WHERE  date  >=  '2012-­‐01-­‐01'  AND
               date  <    '2013-­‐01-­‐01'
    GROUP  BY  1;

    View Slide

  27. Good practices

    View Slide

  28. Good practices

    View Slide

  29. Good practices
    • It uses update

    View Slide

  30. Good practices
    • It uses update
    • Do as a batch in most cases

    View Slide

  31. Good practices
    • It uses update
    • Do as a batch in most cases
    • Tweak the config

    View Slide

  32. Tuning Parameters

    View Slide

  33. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage

    View Slide

  34. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register

    View Slide

  35. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register
    • expthresh - threshold for explicit vs sparse

    View Slide

  36. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register
    • expthresh - threshold for explicit vs sparse
    • spareson - on/off for sparse

    View Slide

  37. Is it better?

    View Slide

  38. 1280 bytes
    Estimate count of 10s of billions
    Few percent error

    View Slide

  39. Resources
    • https://github.com/aggregateknowledge/
    postgresql-hll
    • http://blog.aggregateknowledge.com/
    2013/02/04/open-source-release-
    postgresql-hll/
    • http://tapoueh.org/blog/2013/02/25-
    postgresql-hyperloglog

    View Slide

  40. Questions

    View Slide