Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Window Functions

Window Functions

A primer of SQL window functions

David Stokes

November 21, 2024
Tweet

More Decks by David Stokes

Other Decks in Technology

Transcript

  1. ©2023 Percona | Confidential Who Am I I am Dave

    Stokes Technology Evangelist at Percona Author of MySQL & JSON - A Practical Programming Guide Over a decade on the Oracle MySQL Community Team Started with MySQL 3.29 [email protected] @Stoker https://speakerdeck.com/stoker
  2. Percona is an open source database software, support, and services

    company that helps make databases and applications run better. Percona.com
  3. ©2023 Percona | Confidential This is not an easy subject

    6 Your brain may be hurt during this presentation.
  4. ©2023 Percona | Confidential In database management, an aggregate function

    or aggregation function is a function where multiple values are processed together to form a single summary statistic. Common aggregate functions include: • Average (i.e., arithmetic mean) • Count • Maximum • Median • Minimum • Mode • Range • Sum 7 https://en.wikipedia.org/wiki/Aggregate_function#:~:text=In%20database%20management%2C%20an%20aggregate,Sum
  5. ©2023 Percona | Confidential In SQL, a window function or

    analytic function[1] is a function which uses values from one or multiple rows to return a value for each row. (This contrasts with an aggregate function, which returns a single value for multiple rows.) Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function.[2] 8 https://en.wikipedia.org/wiki/Window_function_(SQL)
  6. ©2023 Percona | Confidential 9 Analytics, Statistics, Numbers, and Your

    Sanity • Presenting numbers is often tough ◦ Learning the ‘stuff’ to present numbers from a Structured Query Language Database is tougher ◦ Practice helps ◦ Build in ‘sanity checks’ -> Do the numbers make sense? • Aggregate Functions are fairly straightforward • Window Function are not straightforward ◦ Syntax is odd ◦ Partitioning can be hard ◦ Functions inside Window Functions are not ‘intuitive’ ◦ You need lots of practice! Lots!
  7. ©2023 Percona | Confidential Partitions 11 CURRENT ROW N Preceding

    N Following Unbounded Preceding Unbounded Following
  8. ©2023 Percona | Confidential Test Data test=# select * from

    data; id | brand | model | price ----+------------+-----------------+----------- 1 | Ford | Mustang | 48000.00 2 | Audi | TT | 54000.00 3 | Dodge | Charger | 74999.00 4 | Ford | Bronco | 37000.00 5 | Ferrari | 250 GTO | 999999.00 6 | Pontiac | GTO | 18000.00 7 | Chevrolet | Camaro | 64999.00 8 | Chevrolet | Corvette | 122500.00 9 | Ford | Mustang | 53400.00 10 | Dodge | Challenger | 73000.00 (10 rows) 12
  9. ©2023 Percona | Confidential SQL Aggregate Functions are fine but

    limited. test=# select count(id) as "# records", sum(price), round(avg(price),2), max(price) from data; # records | sum | round | max -----------+------------+-----------+----------- 10 | 1545897.00 | 154589.70 | 999999.00 (1 row) 13 It is easy to use aggregate functions to generate information on a column
  10. ©2023 Percona | Confidential GROUPing helps bring a little more

    clarity test=# select brand, count(brand), sum(price), round(avg(price),2), max(price) from data group by brand; brand | count | sum | round | max ------------+-------+-----------+-----------+----------- Ford | 3 | 138400.00 | 46133.33 | 53400.00 Dodge | 2 | 147999.00 | 73999.50 | 74999.00 Audi | 1 | 54000.00 | 54000.00 | 54000.00 Chevrolet | 2 | 187499.00 | 93749.50 | 122500.00 Pontiac | 1 | 18000.00 | 18000.00 | 18000.00 Ferrari | 1 | 999999.00 | 999999.00 | 999999.00 (6 rows) 14 GROUP BY can be used to put related information together.
  11. ©2023 Percona | Confidential Order can improve readability test=# select

    brand, count(brand), sum(price), round(avg(price),2), max(price) from data group by brand order by brand; brand | count | sum | round | max ------------+-------+-----------+-----------+----------- Audi | 1 | 54000.00 | 54000.00 | 54000.00 Chevrolet | 2 | 187499.00 | 93749.50 | 122500.00 Dodge | 2 | 147999.00 | 73999.50 | 74999.00 Ferrari | 1 | 999999.00 | 999999.00 | 999999.00 Ford | 3 | 138400.00 | 46133.33 | 53400.00 Pontiac | 1 | 18000.00 | 18000.00 | 18000.00 (6 rows) 15
  12. ©2023 Percona | Confidential But what about differences within those

    groups? Window functions allow the partitioning of the data for the purpose of analysis 16
  13. ©2023 Percona | Confidential OVER() is the Window Function call

    test=# select brand, price, sum(price) OVER () from data order by brand; brand | price | sum ------------+-----------+------------ Audi | 54000.00 | 1545897.00 Chevrolet | 64999.00 | 1545897.00 Chevrolet | 122500.00 | 1545897.00 Dodge | 74999.00 | 1545897.00 Dodge | 73000.00 | 1545897.00 Ferrari | 999999.00 | 1545897.00 Ford | 53400.00 | 1545897.00 Ford | 37000.00 | 1545897.00 Ford | 48000.00 | 1545897.00 Pontiac | 18000.00 | 1545897.00 (10 rows) 18 OVER() without anything in the brackets tells the database instance to apply the calculation over the ENTIRE column. Here, sum is for all the values is the price column
  14. ©2023 Percona | Confidential You can have multiple OVER()s test=#

    select brand, count(id) over(), sum(price) over () from data order by brand; brand | count | sum ------------+-------+------------ Audi | 10 | 1545897.00 Chevrolet | 10 | 1545897.00 Chevrolet | 10 | 1545897.00 Dodge | 10 | 1545897.00 Dodge | 10 | 1545897.00 Ferrari | 10 | 1545897.00 Ford | 10 | 1545897.00 Ford | 10 | 1545897.00 Ford | 10 | 1545897.00 Pontiac | 10 | 1545897.00 (10 rows) 19 Here are two WFs where we are counting all the columns with an ‘id’ value and another that is summing all the ‘price’ values.
  15. ©2023 Percona | Confidential Move window definition around test=# select

    brand, price, sum(price) OVER w from data window w as (); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 1545897.00 Audi | 54000.00 | 1545897.00 Dodge | 74999.00 | 1545897.00 Ford | 37000.00 | 1545897.00 Pontiac | 18000.00 | 1545897.00 Chevrolet | 64999.00 | 1545897.00 Chevrolet | 122500.00 | 1545897.00 Ford | 53400.00 | 1545897.00 Dodge | 73000.00 | 1545897.00 Ferrari | 999999.00 | 1545897.00 (10 rows) 20 We can also move the definition of the window away from the SELECT portion of the query, often provides better readability.
  16. ©2023 Percona | Confidential The definition clause can get weird

    test=# select brand, price, sum(price) over w from data window w as (rows current row); brand | price | sum ------------+-----------+----------- Ford | 48000.00 | 48000.00 Audi | 54000.00 | 54000.00 Dodge | 74999.00 | 74999.00 Ford | 37000.00 | 37000.00 Pontiac | 18000.00 | 18000.00 Chevrolet | 64999.00 | 64999.00 Chevrolet | 122500.00 | 122500.00 Ford | 53400.00 | 53400.00 Dodge | 73000.00 | 73000.00 Ferrari | 999999.00 | 999999.00 (10 rows) 21 Clarity is of prime importance so when in doubt, spell it out!
  17. ©2023 Percona | Confidential Unbounded? test=# select brand, price, sum(price)

    over w from data window w as (rows unbounded preceding); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 48000.00 Audi | 54000.00 | 102000.00 Dodge | 74999.00 | 176999.00 Ford | 37000.00 | 213999.00 Pontiac | 18000.00 | 231999.00 Chevrolet | 64999.00 | 296998.00 Chevrolet | 122500.00 | 419498.00 Ford | 53400.00 | 472898.00 Dodge | 73000.00 | 545898.00 Ferrari | 999999.00 | 1545897.00 (10 rows) 22
  18. ©2023 Percona | Confidential Partitions 23 CURRENT ROW N Preceding

    N Following Unbounded Preceding Unbounded Following This might be cleared after the examples
  19. ©2023 Percona | Confidential Unbounded return? test=# select brand, price,

    sum(price) over w from data window w as (rows unbounded preceding); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 48000.00sum (48,000 + previous row) Audi | 54000.00 | 102000.00sum (48,000 + 54,000) = 102,000 Dodge | 74999.00 | 176999.00sum (102,000 + 74,999) = 176,999 Ford | 37000.00 | 213999.00 Pontiac | 18000.00 | 231999.00 Chevrolet | 64999.00 | 296998.00 Chevrolet | 122500.00 | 419498.00 Ford | 53400.00 | 472898.00 Dodge | 73000.00 | 545898.00 Ferrari | 999999.00 | 1545897.00 (10 rows) 24
  20. ©2023 Percona | Confidential Rows between unbounded preceding and current

    row test=# select brand, price, sum(price) over w from data window w as (rows between unbounded preceding and current row); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 48000.00 Audi | 54000.00 | 102000.00 Dodge | 74999.00 | 176999.00 Ford | 37000.00 | 213999.00 Pontiac | 18000.00 | 231999.00 Chevrolet | 64999.00 | 296998.00 Chevrolet | 122500.00 | 419498.00 Ford | 53400.00 | 472898.00 Dodge | 73000.00 | 545898.00 Ferrari | 999999.00 | 1545897.00 (10 rows) 25 rows unbounded preceding = rows between unbounded preceding and current row Obvious, eh? This is why practice is needed
  21. ©2023 Percona | Confidential Rows between current row and unbounded

    following test=# select brand, price, sum(price) over w from data window w as (rows between current row and unbounded following); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 1545897.00(1,497,897 + 48,000) = 1,545,897 Audi | 54000.00 | 1497897.00 Dodge | 74999.00 | 1443897.00 Ford | 37000.00 | 1368898.00 Pontiac | 18000.00 | 1331898.00 Chevrolet | 64999.00 | 1313898.00 Chevrolet | 122500.00 | 1248899.00 Ford | 53400.00 | 1126399.00 Dodge | 73000.00 | 1072999.00 Ferrari | 999999.00 | 999999.00 (10 rows) 26 rows between unbounded preceding and current row is the opposite of rows between current row and unbounded following
  22. ©2023 Percona | Confidential Rows between 1 preceding and current

    row test=# select brand, price, sum(price) over w from data window w as (rows between 1 preceding and current row); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 48000.00 Audi | 54000.00 | 102000.00(48,000 + 54,000) = 102,000 Dodge | 74999.00 | 128999.00(74,999 + 54,000) = 128,999 Ford | 37000.00 | 111999.00(74,999 + 37,000) = 111,999 Pontiac | 18000.00 | 55000.00 Chevrolet | 64999.00 | 82999.00 Chevrolet | 122500.00 | 187499.00 Ford | 53400.00 | 175900.00 Dodge | 73000.00 | 126400.00 Ferrari | 999999.00 | 1072999.00 (10 rows) 27
  23. ©2023 Percona | Confidential Rows between 1 preceding and 1

    following test=# select brand, price, sum(price) over w from data window w as (rows between 1 preceding and 1 following); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 102000.0048,000 + 54,000 = 102,000 Audi | 54000.00 | 176999.0074,999 + 54,000 + 48,000 = 176,699 Dodge | 74999.00 | 165999.0037,000 + 74,999 + 54,000 = 165,999 Ford | 37000.00 | 129999.0018,000 + 37,000 + 74,999 = 129,999 Pontiac | 18000.00 | 119999.00 Chevrolet | 64999.00 | 205499.00 Chevrolet | 122500.00 | 240899.00 Ford | 53400.00 | 248900.00 Dodge | 73000.00 | 1126399.00 Ferrari | 999999.00 | 1072999.00 (10 rows) 28
  24. ©2023 Percona | Confidential Rows between 2 preceding and current

    row test=# select brand, price, sum(price) over w from data window w as (rows between 2 preceding and current row); brand | price | sum ------------+-----------+------------ Ford | 48000.00 | 48000.0048,000 = 48,000 Audi | 54000.00 | 102000.0048,000 + 54,000 = 102,000 Dodge | 74999.00 | 176999.0074,999 + 54,000 + 48,000 = 176,999 Ford | 37000.00 | 165999.0037,000 + 74,999 + 54,000 = 165,999 Pontiac | 18000.00 | 129999.00 Chevrolet | 64999.00 | 119999.00 Chevrolet | 122500.00 | 205499.00 Ford | 53400.00 | 240899.00 Dodge | 73000.00 | 248900.00 Ferrari | 999999.00 | 1126399.00 (10 rows) 29
  25. ©2023 Percona | Confidential You can also order within a

    window test=# select brand, price, sum(price) over w from data window w as (order by brand); brand | price | sum ------------+-----------+------------ Audi | 54000.00 | 54000.00 Chevrolet | 64999.00 | 241499.00 54,000 + 64,999 + 122,500 = 241,499 Chevrolet | 122500.00 | 241499.00 Dodge | 74999.00 | 389498.00 241,499 + 74,999 + 73,000 = 389,498 Dodge | 73000.00 | 389498.00 Ferrari | 999999.00 | 1389497.00 Ford | 53400.00 | 1527897.00 Ford | 37000.00 | 1527897.00 Ford | 48000.00 | 1527897.00 Pontiac | 18000.00 | 1545897.00 (10 rows) 30
  26. ©2023 Percona | Confidential Order an range in a window

    test=# select brand, price, sum(price) over w from data window w as (order by brand range current row); brand | price | sum ------------+-----------+----------- Audi | 54000.00 | 54000.00 Chevrolet | 64999.00 | 187499.00 64,999 + 122,500 = 187,499 Chevrolet | 122500.00 | 187499.00 Dodge | 74999.00 | 147999.00 74,999 + 73,000 = 147,999 Dodge | 73000.00 | 147999.00 Ferrari | 999999.00 | 999999.00 Ford | 53400.00 | 138400.00 53,400 + 37,000 + 48,000 = 138,400 Ford | 37000.00 | 138400.00 Ford | 48000.00 | 138400.00 Pontiac | 18000.00 | 18000.00 (10 rows) 31
  27. ©2023 Percona | Confidential Simple brand average test=# select brand,

    count(brand), sum(price), round(avg(price),2) as avg from data group by brand order by brand; brand | count | sum | avg ------------+-------+-----------+----------- Audi | 1 | 54000.00 | 54000.00 54,000/1 = 54,000 Chevrolet | 2 | 187499.00 | 93749.50 187,499/2 = 93,749 Dodge | 2 | 147999.00 | 73999.50 Ferrari | 1 | 999999.00 | 999999.00 Ford | 3 | 138400.00 | 46133.33 Pontiac | 1 | 18000.00 | 18000.00 (6 rows) 32
  28. ©2023 Percona | Confidential Rollup test=# select brand, count(brand), sum(price),

    round(avg(price),2) as avg from data group by rollup(brand) order by brand; brand | count | sum | avg ------------+-------+------------+----------- Audi | 1 | 54000.00 | 54000.00 Chevrolet | 2 | 187499.00 | 93749.50 Dodge | 2 | 147999.00 | 73999.50 Ferrari | 1 | 999999.00 | 999999.00 Ford | 3 | 138400.00 | 46133.33 Pontiac | 1 | 18000.00 | 18000.00 | 10 | 1545897.00 | 154589.70 This is the rollup line (7 rows) 33 NULL hiding here
  29. ©2023 Percona | Confidential CUBE - Multiple Grouping Sets test=#

    select brand, count(brand), model, sum(price), round(avg(price),2) as avg from data group by cube(model,brand) order by model,brand; brand | count | model | sum | avg ------------+-------+-----------------+------------+----------- Ferrari | 1 | 250 GTO | 999999.00 | 999999.00 | 1 | 250 GTO | 999999.00 | 999999.00 Ford | 1 | Bronco | 37000.00 | 37000.00 | 1 | Bronco | 37000.00 | 37000.00 Chevrolet | 1 | Camaro | 64999.00 | 64999.00 | 1 | Camaro | 64999.00 | 64999.00 Dodge | 1 | Challenger | 73000.00 | 73000.00 | 1 | Challenger | 73000.00 | 73000.00 Dodge | 1 | Charger | 73000.00 | 73000.00 | 1 | Charger | 73000.00 | 73000.00 Chevrolet | 1 | Corvette | 122500.00 | 122500.00 | 1 | Corvette | 122500.00 | 122500.00 Pontiac | 1 | GTO | 18000.00 | 18000.00 | 1 | GTO | 18000.00 | 18000.00 Ford | 2 | Mustang | 101400.00 | 50700.00 | 2 | Mustang | 101400.00 | 50700.00 Audi | 1 | TT | 73000.00 | 73000.00 | 1 | TT | 73000.00 | 73000.00 Audi | 1 | | 73000.00 | 73000.00 Chevrolet | 2 | | 187499.00 | 93749.50 Dodge | 2 | | 146000.00 | 73000.00 Ferrari | 1 | | 999999.00 | 999999.00 Ford | 3 | | 138400.00 | 46133.33 Pontiac | 1 | | 18000.00 | 18000.00 | 10 | | 1562898.00 | 156289.80 (25 rows) 34
  30. ©2023 Percona | Confidential Percentages test=# select brand, model, round(price

    / SUM(price) over() * 100,2) as percentage, price from data; brand | model | percentage | price ------------+-----------------+------------+----------- Ford | Mustang | 3.10 | 48000.00 Audi | TT | 3.49 | 54000.00 Dodge | Charger | 4.85 | 74999.00 Ford | Bronco | 2.39 | 37000.00 Pontiac | GTO | 1.16 | 18000.00 Chevrolet | Camaro | 4.20 | 64999.00 Chevrolet | Corvette | 7.92 | 122500.00 Ford | Mustang | 3.45 | 53400.00 Dodge | Challenger | 4.72 | 73000.00 Ferrari | 250 GTO | 64.69 | 999999.00 999,999 is 64.49% (10 rows) of 1,545,897 35
  31. ©2023 Percona | Confidential Cumulative (time series?) test=# select brand,

    model, price, sum(price) over (order by price DESC rows between unbounded preceding and current row) from data; brand | model | price | sum ------------+-----------------+-----------+------------ Ferrari | 250 GTO | 999999.00 | 999999.00 Chevrolet | Corvette | 122500.00 | 1122499.00 999,999 + 122,500 Dodge | Charger | 74999.00 | 1197498.00 1122499 + 74,999 Dodge | Challenger | 73000.00 | 1270498.00 Chevrolet | Camaro | 64999.00 | 1335497.00 Audi | TT | 54000.00 | 1389497.00 Ford | Mustang | 53400.00 | 1442897.00 Ford | Mustang | 48000.00 | 1490897.00 Ford | Bronco | 37000.00 | 1527897.00 Pontiac | GTO | 18000.00 | 1545897.00 (10 rows) 36
  32. ©2023 Percona | Confidential The Difference from average test=# select

    brand, model, price, round(AVG(price) over(), 2) as Avg, round(price - AVG(price) over (),2) as Difference from data order by price; brand | model | price | avg | difference ------------+-----------------+-----------+-----------+------------ Pontiac | GTO | 18000.00 | 154589.70 | -136589.70 Ford | Bronco | 37000.00 | 154589.70 | -117589.70 Ford | Mustang | 48000.00 | 154589.70 | -106589.70 Ford | Mustang | 53400.00 | 154589.70 | -101189.70 Audi | TT | 54000.00 | 154589.70 | -100589.70 Chevrolet | Camaro | 64999.00 | 154589.70 | -89590.70 Dodge | Challenger | 73000.00 | 154589.70 | -81589.70 Dodge | Charger | 74999.00 | 154589.70 | -79590.70 Chevrolet | Corvette | 122500.00 | 154589.70 | -32089.70 Ferrari | 250 GTO | 999999.00 | 154589.70 | 845409.30 (10 rows) 37
  33. ©2023 Percona | Confidential Difference between 1 previous and current

    test=# select brand, model, price, price - LEAD(price,1) over (order by price desc) as Increment from data order by price; brand | model | price | increment ------------+-----------------+-----------+----------- Pontiac | GTO | 18000.00 | Ford | Bronco | 37000.00 | 19000.00 37,000 - 18,000 Ford | Mustang | 48000.00 | 11000.00 48,000 - 37,000 Ford | Mustang | 53400.00 | 5400.00 Audi | TT | 54000.00 | 600.00 Chevrolet | Camaro | 64999.00 | 10999.00 Dodge | Challenger | 73000.00 | 8001.00 Dodge | Charger | 74999.00 | 1999.00 Chevrolet | Corvette | 122500.00 | 47501.00 Ferrari | 250 GTO | 999999.00 | 877499.00 (10 rows) 38 There is also a LAG function that works with subsequent values
  34. ©2023 Percona | Confidential It is possible to get very

    complicated 39 test=# select model, price, price - LAST_VALUE(price) over w as delta, round ((price - LAST_VALUE(price) over w) / LAST_VALUE(price) over w * 100) as percent_more from data window w as (order by price desc rows between unbounded preceding and unbounded following) order by price desc; model | price | delta | percent_more -----------------+-----------+-----------+-------------- 250 GTO | 999999.00 | 981999.00 | 5456 Corvette | 122500.00 | 104500.00 | 581 Charger | 74999.00 | 56999.00 | 317 Challenger | 73000.00 | 55000.00 | 306 Camaro | 64999.00 | 46999.00 | 261 TT | 54000.00 | 36000.00 | 200 Mustang | 53400.00 | 35400.00 | 197 Mustang | 48000.00 | 30000.00 | 167 48K is 167% of 18K Bronco | 37000.00 | 19000.00 | 106 37K is 106% of 18K GTO | 18000.00 | 0.00 | 0 (10 rows)
  35. ©2023 Percona | Confidential Need to update the data test=#

    update data set price = 73000.0 where id=3 or id = 2; UPDATE 2 40
  36. ©2023 Percona | Confidential RANK() & DENSE_RANK() test=# select id,

    model, price, RANK() over w, DENSE_RANK() over w from data WINDOW w as (order by price desc) order by price desc; id | model | price | rank | dense_rank ----+-----------------+-----------+------+------------ 5 | 250 GTO | 999999.00 | 1 | 1 8 | Corvette | 122500.00 | 2 | 2 10 | Challenger | 73000.00 | 3 | 3 3 | Charger | 73000.00 | 3 | 3 2 | TT | 73000.00 | 3 | 3 7 | Camaro | 64999.00 | 6 | 4 9 | Mustang | 53400.00 | 7 | 5 1 | Mustang | 48000.00 | 8 | 6 4 | Bronco | 37000.00 | 9 | 7 6 | GTO | 18000.00 | 10 | 8 (10 rows) 41
  37. ©2023 Percona | Confidential Window Function Functions CUME_DIST() Cumulative distribution

    value DENSE_RANK() Rank of current row within its partition, without gaps FIRST_VALUE() Value of argument from first row of window frame LAG() Value of argument from row lagging current row within partition LAST_VALUE()Value of argument from last row of window frame LEAD() Value of argument from row leading current row within partition NTH_VALUE() Value of argument from N-th row of window frame NTILE()Bucket number of current row within its partition. PERCENT_RANK() Percentage rank value RANK() Rank of current row within its partition, with gaps ROW_NUMBER() Number of current row within its partition 43
  38. ©2023 Percona | Confidential Usage is function name and window

    declaration test=# select row_number() over(), id, brand, model from data; row_number | id | brand | model ------------+----+------------+----------------- 1 | 1 | Ford | Mustang 2 | 4 | Ford | Bronco 3 | 6 | Pontiac | GTO 4 | 7 | Chevrolet | Camaro 5 | 8 | Chevrolet | Corvette 6 | 9 | Ford | Mustang 7 | 10 | Dodge | Challenger 8 | 5 | Ferrari | 250 GTO 9 | 2 | Audi | TT 10 | 3 | Dodge | Charger (10 rows) 44
  39. ©2023 Percona | Confidential Recommendations Window Functions are useful •

    Practice needed • Double check everything • Don’t forget to use aggregate functions 46