Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres - Craig Kerstiens Heroku Postgres

Postgresql-HLL

Truviso • Extended Postgres to do streaming • Various markets
• Ad space • Wanted unique impressions • Sort of wanted unique impressions

SELECT count(*)

Approx Top K

Compressed Bitmap

HyperLogLog

HyperLogLog • KMV - K minimum value

HyperLogLog • KMV - K minimum value • Bit observable
patterns

patterns • Stochastic averaging

patterns • Stochastic averaging • Harmonic averaging

patterns • Stochastic averaging • Harmonic averaging • Implemented by Aggregate Knowledge

HyperLogLog Probabilistic uniques with small footprint

HyperLogLog Probabilistic uniques with small footprint Close enough distinct with
small footprint

Use cases

Use cases • Semi distinct count • Think pg_stat_statements •
Ad networks • Web traﬃc

Use cases • Semi distinct count • Think pg_stat_statements •
Ad networks • Web traﬃc • With rollups/groupings

Digging in CREATE EXTENSION hll; CREATE TABLE helloworld (
id integer, set hll );

Inserting data UPDATE helloworld SET set = hll_add(set, hll_hash_integer(12345))
WHERE id = 1; UPDATE helloworld SET set = hll_add(set, hll_hash_text('hello world')) WHERE id = 1;

Real world CREATE TABLE daily_uniques ( date
date UNIQUE, users hll );

Real world INSERT INTO daily_uniques(date, users) SELECT date, hll_add_agg(hll_hash_integer(user_id))
FROM users GROUP BY 1;

Real world SELECT
EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-‐01-‐01' AND date < '2013-‐01-‐01' GROUP BY 1;

Good practices

Good practices • It uses update

Good practices • It uses update • Do as a
batch in most cases

Good practices • It uses update • Do as a
batch in most cases • Tweak the conﬁg

Tuning Parameters

Tuning Parameters • log2m - log base 2 of registers
• Between 4 and 17 • Each 1 increase doubles storage

• Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register

• Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse

• Between 4 and 17 • Each 1 increase doubles storage • regwidth - bits per register • expthresh - threshold for explicit vs sparse • spareson - on/oﬀ for sparse

Is it better?

1280 bytes Estimate count of 10s of billions Few percent
error

Resources • https://github.com/aggregateknowledge/ postgresql-hll • http://blog.aggregateknowledge.com/ 2013/02/04/open-source-release- postgresql-hll/ • http://tapoueh.org/blog/2013/02/25-
postgresql-hyperloglog

Questions

Scalable Uniques in Postgres with HLL

Scalable Uniques in Postgres with HLL

More Decks by Craig Kerstiens

Other Decks in Technology

Featured

Transcript