Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average

Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average

“So tell me,” my manager said, “what is an average?”

There’s probably nothing worse than that sinking feeling when you finish an analysis, email it to your manager or client to review, and they point out a mistake so basic you can’t even fathom how you missed it.

This talk is about mine: how to take an average.

Averages are something we use everywhere - it’s a simple np.mean() in pandas or AVG() in SQL. But recently I’ve come to appreciate just how easy it is to calculate this statistic incorrectly. We learn once - in middle school no less - how to take an average, and never revisit it. Then, when we are faced with multidimensional datasets (ie. pretty much every dataset out there), we never reconsider whether we should be taking an average the same way.

In this talk, we follow my arduous and humbling journey of learning how to properly take an average with multidimensional data. We will cover how improperly calculating it can produce grossly incorrect figures, which can slip into publications, research analyses and management reports.

https://us.pycon.org/2018/schedule/presentation/84/

De174d82b2bbfe9e6f14a5a8c38b14be?s=128

PyCon 2018

May 11, 2018
Tweet

Transcript

  1. What pandas and SQL Taught Me About Taking an Average

    Alex Petralia
  2. None
  3. pandas

  4. None
  5. None
  6. “What is our average daily trading volume for each exchange

    we trade on?”
  7. pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 =

    vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange
  8. * figures are fictitious

  9. * figures are fictitious

  10. pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 =

    vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange “What is our average daily trading volume for each exchange we trade on?”
  11. None
  12. “What do you think an average is?”

  13. x

  14. Outline for this talk 1. One-dimensional data 2. Two-dimensional data

    3. Three-dimensional data 4. The pitfall of multi-dimensional averages 5. The magic formula for taking averages
  15. Back to basics SUM(Volume) / COUNT(*)

  16. One-dimensional data Id ApplesSold 1 5 2 4 3 8

    4 10 5 6 SUM(AppleSold) COUNT(*) 5 + 4 + 8 + 10 + 6 5 = 6.6 “What is the average amount of apples sold?”
  17. One-dimensional data Id ApplesSold 1 5 2 4 3 8

    4 10 5 6 SELECT SUM(ApplesSold)/COUNT(ApplesSold) FROM apples SELECT AVG(ApplesSold) FROM apples “What is the average amount of apples sold?”
  18. Two-dimensional data SUM(AppleSold) COUNT(DISTINCT Date) 5 + 4 + 8

    + 10 + 6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”
  19. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”
  20. Two-dimensional data Date NumSold Monday 9 Tuesday 8 Thursday 16

    SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date 3 dates, after aggregating!
  21. SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM

    apples GROUP BY Date ) AS tmp Two-dimensional data AVG(NumSold) 11
  22. mean = apple.groupby(‘Date’).sum() \ [‘ApplesSold’].mean() Two-dimensional data mean 11

  23. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 The question matters... ...it determines our denominator! (our row COUNT()) “What is the average daily amount of apples sold?”
  24. Two-dimensional data our level of analysis primary key SELECT AVG(NumSold)

    FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6
  25. Who cares?

  26. * figures are fictitious

  27. x

  28. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”
  29. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary ? Bob ? Jane ? “What’s the average daily amount of apples sold for each seller?”
  30. Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary Bob Jane 5 6 16 “What’s the average daily amount of apples sold for each seller?”
  31. SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY

    Date, Seller Three-dimensional data Date Seller total Monday Mary 5 Monday Bob 4 Tuesday Bob 8 Thursday Jane 16 replacing the primary key with a new key relevant to this analysis “collapsing key” “What’s the average daily amount of apples sold for each seller?”
  32. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “What’s the average daily amount of apples sold for each seller?”
  33. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”
  34. Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis

    - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City)
  35. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”
  36. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?” ?
  37. Multi-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5

    2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”
  38. Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis

    - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City) Observation key: a unit of observation for this analysis - eg. “daily average” is (Date), “across regions” is (Region) - this defines how many rows are in the denominator
  39. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “What’s the average daily amount of apples sold for each seller?”
  40. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied
  41. The pitfall of multi-dimensional averages

  42. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “per day” is implied
  43. pd.groupby([‘Date’, ‘Seller’]) \ [‘ApplesSold’].sum() \ .groupby(level=‘Seller’).mean() Multi-dimensional data (pandas) Seller

    AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied
  44. Order of operations 1. Observation key: “daily” 2. Grouping key:

    “for each seller” 3. Collapsing key: [observation key] + [grouping key] = (Date, Seller) “What’s the average daily amount of apples sold for each seller?” collapsing key - grouping key = observation key
  45. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 observation key “per day” is implied 1 2 collapsing key - grouping key = observation key 3
  46. SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS

    total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied 2 1 1 2 3 3
  47. Solve the English question first, then write the code

  48. x

  49. What the blank 1. What if the collapsing key is

    blank? 2. What if the grouping key is blank? 3. What if both are blank? collapsing key - grouping key = observation key inner query outer query the difference
  50. 1. Blank collapsing key

  51. collapsing key - grouping key = observation key inner query

    outer query the difference
  52. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?” implied primary key
  53. Two-dimensional data SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS

    NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”
  54. If you don’t define a collapsing key, then the primary

    key is automatically your collapsing key inner query
  55. collapsing key - grouping key = observation key inner query

    outer query the difference
  56. primary key - grouping key = observation key

  57. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”
  58. Don’t trust the primary key

  59. collapsing key - grouping key = observation key inner query

    outer query the difference
  60. 2. Blank grouping key

  61. collapsing key - grouping key = observation key inner query

    outer query the difference
  62. collapsing key = observation key inner query

  63. SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM

    apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 (no grouping asked)
  64. 3. Blank collapsing key & blank grouping key

  65. collapsing key - grouping key = observation key inner query

    outer query the difference
  66. primary key - grouping key = observation key

  67. primary key = observation key

  68. SELECT AVG(ApplesSold) FROM apples One-dimensional data Id ApplesSold 1 5

    2 4 3 8 4 10 5 6 implied observation key is simply the primary key “What is the average amount of apples sold?”
  69. Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday

    4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples implied observation key is simply the primary key “What is the average daily amount of apples sold?”
  70. SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM

    apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6
  71. x

  72. collapsing key - grouping key = observation key inner query

    outer query the difference
  73. x

  74. What’s the big deal?

  75. Two-dimensional data 5 + 4 + 8 + 10 +

    6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t
  76. Two-dimensional data 5 + 4 + 8 + 10 +

    6 5 = 6.6 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(ApplesSold) FROM apples
  77. Two-dimensional data 5 + 4 + 8 + 10 +

    6 5 = 6.6 “What is the average daily amount of apples sold?” 5 + 4 + 8 + 10 + 6 3 = 11 vs.
  78. If we do not collapse data, we will understate our

    results
  79. x

  80. the big picture

  81. pandas

  82. pandas .mean() AVG()

  83. Beware the pitfall of multi-dimensional averages! collapsing key - grouping

    key = observation key inner query outer query the difference
  84. thank you