Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

240da217b000604a2302dfc8b02a9cad?s=128

Craig Kerstiens

September 18, 2013
Tweet

Transcript

  1. Scalable Uniques in Postgres - Craig Kerstiens Heroku Postgres

  2. Postgresql-HLL

  3. Truviso • Extended Postgres to do streaming • Various markets

    • Ad space • Wanted unique impressions • Sort of wanted unique impressions
  4. SELECT count(*)

  5. Approx Top K

  6. Compressed Bitmap

  7. HyperLogLog

  8. HyperLogLog • KMV - K minimum value

  9. HyperLogLog • KMV - K minimum value • Bit observable

    patterns
  10. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging
  11. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  12. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging
  13. HyperLogLog • KMV - K minimum value • Bit observable

    patterns • Stochastic averaging • Harmonic averaging • Implemented by Aggregate Knowledge
  14. None
  15. HyperLogLog Probabilistic uniques with small footprint

  16. HyperLogLog Probabilistic uniques with small footprint Close enough distinct with

    small footprint
  17. Use cases

  18. Use cases • Semi distinct count • Think pg_stat_statements •

    Ad networks • Web traffic
  19. Use cases • Semi distinct count • Think pg_stat_statements •

    Ad networks • Web traffic • With rollups/groupings
  20. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  21. Digging in CREATE  EXTENSION  hll;    CREATE  TABLE  helloworld  (

               id        integer,            set      hll    );
  22. Inserting data UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_integer(12345))

      WHERE  id  =  1; UPDATE  helloworld   SET  set  =  hll_add(set,  hll_hash_text('hello  world'))   WHERE  id  =  1;
  23. Real world CREATE  TABLE  daily_uniques  (        date

                           date  UNIQUE,        users                      hll );
  24. Real world INSERT  INTO  daily_uniques(date,  users)    SELECT  date,  hll_add_agg(hll_hash_integer(user_id))

       FROM  users    GROUP  BY  1;
  25. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  26. Real world SELECT              

     EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users)) FROM  daily_uniques WHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01' GROUP  BY  1;
  27. Good practices

  28. Good practices

  29. Good practices • It uses update

  30. Good practices • It uses update • Do as a

    batch in most cases
  31. Good practices • It uses update • Do as a

    batch in most cases • Tweak the config
  32. Tuning Parameters

  33. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage
  34. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register
  35. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse
  36. Tuning Parameters • log2m - log base 2 of registers

    • Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse • spareson - on/off for sparse
  37. Is it better?

  38. 1280 bytes Estimate count of 10s of billions Few percent

    error
  39. Resources • https://github.com/aggregateknowledge/ postgresql-hll • http://blog.aggregateknowledge.com/ 2013/02/04/open-source-release- postgresql-hll/ • http://tapoueh.org/blog/2013/02/25-

    postgresql-hyperloglog
  40. Questions