Running SELECT COUNT(DISTINCT) on your database is all too common. In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional SELECT COUNT(DISTINCT) queries works well in single machine setups, it is a difficult problem to solve in distributed systems. When you have this type of query, you cannot just push query to the workers and add up results, because most likely there will be overlapping records in different workers.
In this talk, we will focus on HyperLogLog(HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries in mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties which especially shine in distributed environment. During the talk, first, we’ll look at the internals of the HLL to understand why HLL algorithm is useful to solve distinct count problem in scalable way, then how it can be applied in a distributed fashion. Finally we will see some examples of HLL usage.