Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The TopN extension: Maintaining Top 10 lists at scale | PGConf EU 2019 | Furkan Sahin

Citus Data
October 17, 2019

The TopN extension: Maintaining Top 10 lists at scale | PGConf EU 2019 | Furkan Sahin

Whether it's the biggest, oldest, fastest, slowest one of everyone's favorite questions about data is: What is the top 10? SQL offers easy solutions like "SELECT item, count(*) FROM data GROUP BY item ORDER BY 2 DESC LIMIT 10", but if you have a lot of data and a dashboard that can generate many ever-changing top 10 lists, then such queries don't scale at all. What we need is to be able to incrementally update top N lists as new data comes in. PostgreSQL did not have an easy way to do that yet, so we added it via the postgresql-topn extension.

In this talk, I will focus on the PostgreSQL extension postgresql-topn (new as of March 2018). TopN can provide approximate answers to these kinds of aggregation queries within reasonable and configurable error bounds. TopN is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. In this talk, I will talk about the internals of TopN and how it estimates top items. At the end of this session, you will have fallen in love with the capabilities of TopN and you will be adding it to your analytics tool belt.

Citus Data

October 17, 2019
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. The TopN extension: Maintaining top 10
    lists at scale
    • Furkan Şahin
    • Software Engineer
    • Citus Data @Microsoft
    • Author of postgresql-topn extension
    @sahinffurkan @furkansahin

    View Slide

  2. Presentation Flow
    • TopN lists and their usages
    • How to calculate TopN lists in PostgreSQL?
    • Why do we need an extension?
    • How does postgresql-topn work?
    • How do we use postgresql-topn?
    • Demo
    • Summary
    @sahinffurkan @furkansahin

    View Slide

  3. TopN Lists and Their Usages
    How to calculate TopN lists in PostgreSQL?
    Why do we need an extension?
    How does postgresql-topn work?
    How do we use postgresql-topn?
    Demo
    Summary

    View Slide

  4. TopN Lists and
    Their Usages
    Spotify: Your Top
    Songs 2018
    IMDB: Top Rated
    Movies
    E-commerce: Most
    selling items
    @sahinffurkan @furkansahin

    View Slide

  5. How to
    calculate
    TopN lists in
    PostgreSQL
    SELECT item, count(*)
    FROM table
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
    SELECT
    topn(topn_add_agg(item),10)
    FROM
    table;
    @sahinffurkan @furkansahin

    View Slide

  6. Why do we need an
    extension?
    How does postgresql-topn work?
    How do we use postgresql-topn?
    Demo
    Summary
    TopN Lists and Their Usages
    How to calculate TopN lists in PostgreSQL

    View Slide

  7. Why do we
    need an
    extension?
    Continuous calculations
    Memory usage Scanning
    CPU usage Sorting
    @sahinffurkan @furkansahin

    View Slide

  8. Why needed
    a TopN
    extension?
    High load times
    @sahinffurkan @furkansahin

    View Slide

  9. Solution? An extension that uses
    count-min sketch
    Low memory
    footprint
    Low CPU usage
    Facilitate pre-
    processing
    @sahinffurkan @furkansahin

    View Slide

  10. How does postgresql-topn
    work?
    How do we use postgresql-topn?
    Demo
    Summary
    TopN Lists and Their Usages
    How to calculate TopN lists in PostgreSQL
    Why do we need an extension?

    View Slide

  11. How does
    postgresql-topn
    work?
    Count-min sketch
    Limited number of counters
    @sahinffurkan @furkansahin

    View Slide

  12. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Heavy fuel 1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  13. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Heavy fuel 1
    Back in Black 1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  14. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Back in Black 2
    Heavy fuel 1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  15. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Back in Black 3
    Heavy fuel 1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  16. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Back in Black 3
    Heavy fuel 1
    Englishman in
    New York
    1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  17. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Back in Black 3
    Heavy fuel 2
    Englishman in
    New York
    1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  18. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    Item Freq
    Back in Black 3
    Heavy fuel 2
    Englishman in
    New York
    1
    Breakfast in
    America
    1
    Wipe the bottom half
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin

    View Slide

  19. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    No more data
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin
    Item Freq
    Back in Black 3
    Heavy fuel 2

    View Slide

  20. Example: Calculating the Top 1 song
    Date Music
    10-10-2019 Heavy Fuel
    11-10-2019 Back in Black
    12-10-2019 Back in Black
    13-10-2019 Back in Black
    13-10-2019 Englishman in New York
    14-10-2019 Heavy Fuel
    14-10-2019 Breakfast in America
    topn.number_of_counters = 1
    SELECT
    topn(topn_add_agg(Music),1)
    FROM
    my_music_history;
    @sahinffurkan @furkansahin
    Item Freq
    Back in Black 3

    View Slide

  21. How does
    postgresql-topn
    work?
    topn(column::JSONB, n::int) => returns
    Tuple Set(item, frequency)
    topn_add_agg(column::text) => returns
    JSONB
    topn_union_agg(column::JSONB) => returns
    JSONB
    topn_add(column::JSONB, data::text) =>
    returns JSONB
    topn_union(column_1::JSONB,
    column_2::JSONB) => returns JSONB
    @sahinffurkan @furkansahin

    View Slide

  22. Why JSONB?
    Human readable
    Portable
    Easy to process
    @sahinffurkan @furkansahin

    View Slide

  23. How do we use
    postgresql-topn?
    Demo
    Summary
    TopN Lists and Their Usages
    How to calculate TopN lists in PostgreSQL
    Why do we need an extension?
    How does postgresql-topn work?

    View Slide

  24. How do we use
    postgresql-topn?
    Pre-aggregation
    Materialization of TopNs

    View Slide

  25. Demo
    Summary
    TopN Lists and Their Usages
    How to calculate TopN lists in PostgreSQL
    Why do we need an extension?
    How does postgresql-topn work?
    How do we use postgresql-topn?

    View Slide

  26. Summary
    Requires pre-processing for performance gain
    Low memory footprint
    Approximates the results
    Open source
    Production ready

    View Slide

  27. Resources
    • https://github.com/citusdata/postgresql-topn
    • TopN for your Postgres database
    https://www.citusdata.com/blog/2018/03/27/topn-for-your-postgres-database/
    @sahinffurkan @furkansahin

    View Slide

  28. Thank you
    @sahinffurkan @furkansahin

    View Slide

  29. View Slide