Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The TopN extension: Maintaining Top 10 lists at scale | PGConf EU 2019 | Furkan Sahin

Citus Data
October 17, 2019

The TopN extension: Maintaining Top 10 lists at scale | PGConf EU 2019 | Furkan Sahin

Whether it's the biggest, oldest, fastest, slowest one of everyone's favorite questions about data is: What is the top 10? SQL offers easy solutions like "SELECT item, count(*) FROM data GROUP BY item ORDER BY 2 DESC LIMIT 10", but if you have a lot of data and a dashboard that can generate many ever-changing top 10 lists, then such queries don't scale at all. What we need is to be able to incrementally update top N lists as new data comes in. PostgreSQL did not have an easy way to do that yet, so we added it via the postgresql-topn extension.

In this talk, I will focus on the PostgreSQL extension postgresql-topn (new as of March 2018). TopN can provide approximate answers to these kinds of aggregation queries within reasonable and configurable error bounds. TopN is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. In this talk, I will talk about the internals of TopN and how it estimates top items. At the end of this session, you will have fallen in love with the capabilities of TopN and you will be adding it to your analytics tool belt.

Citus Data

October 17, 2019
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. The TopN extension: Maintaining top 10 lists at scale •

    Furkan Şahin • Software Engineer • Citus Data @Microsoft • Author of postgresql-topn extension @sahinffurkan @furkansahin
  2. Presentation Flow • TopN lists and their usages • How

    to calculate TopN lists in PostgreSQL? • Why do we need an extension? • How does postgresql-topn work? • How do we use postgresql-topn? • Demo • Summary @sahinffurkan @furkansahin
  3. TopN Lists and Their Usages How to calculate TopN lists

    in PostgreSQL? Why do we need an extension? How does postgresql-topn work? How do we use postgresql-topn? Demo Summary
  4. TopN Lists and Their Usages Spotify: Your Top Songs 2018

    IMDB: Top Rated Movies E-commerce: Most selling items @sahinffurkan @furkansahin
  5. How to calculate TopN lists in PostgreSQL SELECT item, count(*)

    FROM table GROUP BY 1 ORDER BY 2 DESC LIMIT 10; SELECT topn(topn_add_agg(item),10) FROM table; @sahinffurkan @furkansahin
  6. Why do we need an extension? How does postgresql-topn work?

    How do we use postgresql-topn? Demo Summary TopN Lists and Their Usages How to calculate TopN lists in PostgreSQL
  7. Why do we need an extension? Continuous calculations Memory usage

    Scanning CPU usage Sorting @sahinffurkan @furkansahin
  8. Solution? An extension that uses count-min sketch Low memory footprint

    Low CPU usage Facilitate pre- processing @sahinffurkan @furkansahin
  9. How does postgresql-topn work? How do we use postgresql-topn? Demo

    Summary TopN Lists and Their Usages How to calculate TopN lists in PostgreSQL Why do we need an extension?
  10. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Heavy fuel 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  11. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Heavy fuel 1 Back in Black 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  12. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Back in Black 2 Heavy fuel 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  13. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Back in Black 3 Heavy fuel 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  14. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Back in Black 3 Heavy fuel 1 Englishman in New York 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  15. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Back in Black 3 Heavy fuel 2 Englishman in New York 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  16. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 Item Freq Back in Black 3 Heavy fuel 2 Englishman in New York 1 Breakfast in America 1 Wipe the bottom half SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin
  17. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 No more data SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin Item Freq Back in Black 3 Heavy fuel 2
  18. Example: Calculating the Top 1 song Date Music 10-10-2019 Heavy

    Fuel 11-10-2019 Back in Black 12-10-2019 Back in Black 13-10-2019 Back in Black 13-10-2019 Englishman in New York 14-10-2019 Heavy Fuel 14-10-2019 Breakfast in America topn.number_of_counters = 1 SELECT topn(topn_add_agg(Music),1) FROM my_music_history; @sahinffurkan @furkansahin Item Freq Back in Black 3
  19. How does postgresql-topn work? topn(column::JSONB, n::int) => returns Tuple Set(item,

    frequency) topn_add_agg(column::text) => returns JSONB topn_union_agg(column::JSONB) => returns JSONB topn_add(column::JSONB, data::text) => returns JSONB topn_union(column_1::JSONB, column_2::JSONB) => returns JSONB @sahinffurkan @furkansahin
  20. How do we use postgresql-topn? Demo Summary TopN Lists and

    Their Usages How to calculate TopN lists in PostgreSQL Why do we need an extension? How does postgresql-topn work?
  21. Demo Summary TopN Lists and Their Usages How to calculate

    TopN lists in PostgreSQL Why do we need an extension? How does postgresql-topn work? How do we use postgresql-topn?