Distributed COUNT(DISTINCT) with Hyperloglog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

024d6a0dd14fb31c804969a57a06dfbe?s=47 Citus Data
October 26, 2017

Distributed COUNT(DISTINCT) with Hyperloglog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional SELECT COUNT(DISTINCT) queries works well in single machine setups, it is a difficult problem to solve in distributed systems. When you have this type of query, you cannot just push query to the workers and add up results, because most likely there will be overlapping records in different workers.

In this talk, we will focus on HyperLogLog(HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries in mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties which especially shine in distributed environment. During the talk, first, we’ll look at the internals of the HLL to understand why HLL algorithm is useful to solve distinct count problem in scalable way, then how it can be applied in a distributed fashion. Finally we will see some examples of HLL usage.

024d6a0dd14fb31c804969a57a06dfbe?s=128

Citus Data

October 26, 2017
Tweet