Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

Craig Kerstiens

September 18, 2013
Tweet

More Decks by Craig Kerstiens

Other Decks in Technology

Transcript

  1. Scalable Uniques in
    Postgres -
    Craig Kerstiens
    Heroku Postgres

    View full-size slide

  2. Postgresql-HLL

    View full-size slide

  3. Truviso
    • Extended Postgres to do streaming
    • Various markets
    • Ad space
    • Wanted unique impressions
    • Sort of wanted unique impressions

    View full-size slide

  4. SELECT count(*)

    View full-size slide

  5. Approx Top K

    View full-size slide

  6. Compressed Bitmap

    View full-size slide

  7. HyperLogLog
    • KMV - K minimum value

    View full-size slide

  8. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns

    View full-size slide

  9. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging

    View full-size slide

  10. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging

    View full-size slide

  11. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging

    View full-size slide

  12. HyperLogLog
    • KMV - K minimum value
    • Bit observable patterns
    • Stochastic averaging
    • Harmonic averaging
    • Implemented by Aggregate Knowledge

    View full-size slide

  13. HyperLogLog
    Probabilistic uniques with small footprint

    View full-size slide

  14. HyperLogLog
    Probabilistic uniques with small footprint
    Close enough distinct with small footprint

    View full-size slide

  15. Use cases
    • Semi distinct count
    • Think pg_stat_statements
    • Ad networks
    • Web traffic

    View full-size slide

  16. Use cases
    • Semi distinct count
    • Think pg_stat_statements
    • Ad networks
    • Web traffic
    • With rollups/groupings

    View full-size slide

  17. Digging in
    CREATE  EXTENSION  hll;
       CREATE  TABLE  helloworld  (
               id        integer,
               set      hll
       );

    View full-size slide

  18. Digging in
    CREATE  EXTENSION  hll;
       CREATE  TABLE  helloworld  (
               id        integer,
               set      hll
       );

    View full-size slide

  19. Inserting data
    UPDATE  helloworld  
    SET  set  =  hll_add(set,  hll_hash_integer(12345))  
    WHERE  id  =  1;
    UPDATE  helloworld  
    SET  set  =  hll_add(set,  hll_hash_text('hello  world'))  
    WHERE  id  =  1;

    View full-size slide

  20. Real world
    CREATE  TABLE  daily_uniques  (
           date                        date  UNIQUE,
           users                      hll
    );

    View full-size slide

  21. Real world
    INSERT  INTO  daily_uniques(date,  users)
       SELECT  date,  hll_add_agg(hll_hash_integer(user_id))
       FROM  users
       GROUP  BY  1;

    View full-size slide

  22. Real world
    SELECT  
                 EXTRACT(MONTH  FROM  date)  AS  month,  
                 hll_cardinality(hll_union_agg(users))
    FROM  daily_uniques
    WHERE  date  >=  '2012-­‐01-­‐01'  AND
               date  <    '2013-­‐01-­‐01'
    GROUP  BY  1;

    View full-size slide

  23. Real world
    SELECT  
                 EXTRACT(MONTH  FROM  date)  AS  month,  
                 hll_cardinality(hll_union_agg(users))
    FROM  daily_uniques
    WHERE  date  >=  '2012-­‐01-­‐01'  AND
               date  <    '2013-­‐01-­‐01'
    GROUP  BY  1;

    View full-size slide

  24. Good practices

    View full-size slide

  25. Good practices

    View full-size slide

  26. Good practices
    • It uses update

    View full-size slide

  27. Good practices
    • It uses update
    • Do as a batch in most cases

    View full-size slide

  28. Good practices
    • It uses update
    • Do as a batch in most cases
    • Tweak the config

    View full-size slide

  29. Tuning Parameters

    View full-size slide

  30. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage

    View full-size slide

  31. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register

    View full-size slide

  32. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register
    • expthresh - threshold for explicit vs sparse

    View full-size slide

  33. Tuning Parameters
    • log2m - log base 2 of registers
    • Between 4 and 17
    • Each 1 increase doubles storage
    • regwidth - bits per register
    • expthresh - threshold for explicit vs sparse
    • spareson - on/off for sparse

    View full-size slide

  34. Is it better?

    View full-size slide

  35. 1280 bytes
    Estimate count of 10s of billions
    Few percent error

    View full-size slide

  36. Resources
    • https://github.com/aggregateknowledge/
    postgresql-hll
    • http://blog.aggregateknowledge.com/
    2013/02/04/open-source-release-
    postgresql-hll/
    • http://tapoueh.org/blog/2013/02/25-
    postgresql-hyperloglog

    View full-size slide