Scalable Uniques in
Postgres -
Craig Kerstiens
Heroku Postgres
Slide 2
Slide 2 text
Postgresql-HLL
Slide 3
Slide 3 text
Truviso
• Extended Postgres to do streaming
• Various markets
• Ad space
• Wanted unique impressions
• Sort of wanted unique impressions
Slide 4
Slide 4 text
SELECT count(*)
Slide 5
Slide 5 text
Approx Top K
Slide 6
Slide 6 text
Compressed Bitmap
Slide 7
Slide 7 text
HyperLogLog
Slide 8
Slide 8 text
HyperLogLog
• KMV - K minimum value
Slide 9
Slide 9 text
HyperLogLog
• KMV - K minimum value
• Bit observable patterns
Slide 10
Slide 10 text
HyperLogLog
• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
Slide 11
Slide 11 text
HyperLogLog
• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
Slide 12
Slide 12 text
HyperLogLog
• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
Slide 13
Slide 13 text
HyperLogLog
• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
• Implemented by Aggregate Knowledge
Slide 14
Slide 14 text
No content
Slide 15
Slide 15 text
HyperLogLog
Probabilistic uniques with small footprint
Slide 16
Slide 16 text
HyperLogLog
Probabilistic uniques with small footprint
Close enough distinct with small footprint
Slide 17
Slide 17 text
Use cases
Slide 18
Slide 18 text
Use cases
• Semi distinct count
• Think pg_stat_statements
• Ad networks
• Web traffic
Slide 19
Slide 19 text
Use cases
• Semi distinct count
• Think pg_stat_statements
• Ad networks
• Web traffic
• With rollups/groupings
Slide 20
Slide 20 text
Digging in
CREATE
EXTENSION
hll;
CREATE
TABLE
helloworld
(
id
integer,
set
hll
);
Slide 21
Slide 21 text
Digging in
CREATE
EXTENSION
hll;
CREATE
TABLE
helloworld
(
id
integer,
set
hll
);
Slide 22
Slide 22 text
Inserting data
UPDATE
helloworld
SET
set
=
hll_add(set,
hll_hash_integer(12345))
WHERE
id
=
1;
UPDATE
helloworld
SET
set
=
hll_add(set,
hll_hash_text('hello
world'))
WHERE
id
=
1;
Slide 23
Slide 23 text
Real world
CREATE
TABLE
daily_uniques
(
date
date
UNIQUE,
users
hll
);
Slide 24
Slide 24 text
Real world
INSERT
INTO
daily_uniques(date,
users)
SELECT
date,
hll_add_agg(hll_hash_integer(user_id))
FROM
users
GROUP
BY
1;
Slide 25
Slide 25 text
Real world
SELECT
EXTRACT(MONTH
FROM
date)
AS
month,
hll_cardinality(hll_union_agg(users))
FROM
daily_uniques
WHERE
date
>=
'2012-‐01-‐01'
AND
date
<
'2013-‐01-‐01'
GROUP
BY
1;
Slide 26
Slide 26 text
Real world
SELECT
EXTRACT(MONTH
FROM
date)
AS
month,
hll_cardinality(hll_union_agg(users))
FROM
daily_uniques
WHERE
date
>=
'2012-‐01-‐01'
AND
date
<
'2013-‐01-‐01'
GROUP
BY
1;
Slide 27
Slide 27 text
Good practices
Slide 28
Slide 28 text
Good practices
Slide 29
Slide 29 text
Good practices
• It uses update
Slide 30
Slide 30 text
Good practices
• It uses update
• Do as a batch in most cases
Slide 31
Slide 31 text
Good practices
• It uses update
• Do as a batch in most cases
• Tweak the config
Slide 32
Slide 32 text
Tuning Parameters
Slide 33
Slide 33 text
Tuning Parameters
• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
Slide 34
Slide 34 text
Tuning Parameters
• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
Slide 35
Slide 35 text
Tuning Parameters
• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
• expthresh - threshold for explicit vs sparse
Slide 36
Slide 36 text
Tuning Parameters
• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
• expthresh - threshold for explicit vs sparse
• spareson - on/off for sparse
Slide 37
Slide 37 text
Is it better?
Slide 38
Slide 38 text
1280 bytes
Estimate count of 10s of billions
Few percent error