Alex Petralia - Analyzing Data: What pandas and SQL Taught Me About Taking an Average

What pandas and SQL Taught Me About Taking an Average
Alex Petralia

pandas

“What is our average daily trading volume for each exchange
we trade on?”

pandas:
    vol = pd.read_sql_table('trades', cnxn)
    vol['Date'] = pd.to_datetime(vol['Date'])  # parse dates before indexing
    vol = vol.set_index('Date')
    vol2018 = vol.loc['2018']  # partial-string selection on the DatetimeIndex
    df = vol2018.groupby('Exchange')['Volume'].mean()
SQL:
    SELECT Exchange, AVG(Volume)
    FROM `trades`
    WHERE Date LIKE '2018%'
    GROUP BY Exchange

* figures are fictitious


“What do you think an average is?”

Outline for this talk
1. One-dimensional data
2. Two-dimensional data
3. Three-dimensional data
4. The pitfall of multi-dimensional averages
5. The magic formula for taking averages

Back to basics
SUM(Volume) / COUNT(*)

One-dimensional data
“What is the average amount of apples sold?”
Id  ApplesSold
1   5
2   4
3   8
4   10
5   6
SUM(ApplesSold) / COUNT(*) = (5 + 4 + 8 + 10 + 6) / 5 = 6.6

One-dimensional data
“What is the average amount of apples sold?”
SELECT SUM(ApplesSold)/COUNT(ApplesSold) FROM apples
SELECT AVG(ApplesSold) FROM apples
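The equivalence of the two queries can be checked with a small sketch. It assumes an in-memory SQLite database as a stand-in for the real one, and the `apples` schema is inferred from the slide. One caveat that is specific to this stand-in: SQLite divides integers as integers, so the hand-written version multiplies by 1.0 first.

```python
import sqlite3

# The five rows from the slide: (Id, ApplesSold).
rows = [(1, 5), (2, 4), (3, 8), (4, 10), (5, 6)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apples (Id INTEGER PRIMARY KEY, ApplesSold INTEGER)")
conn.executemany("INSERT INTO apples VALUES (?, ?)", rows)

# Hand-written average; * 1.0 avoids SQLite's integer division.
manual = conn.execute(
    "SELECT SUM(ApplesSold) * 1.0 / COUNT(ApplesSold) FROM apples"
).fetchone()[0]

# Built-in aggregate.
builtin = conn.execute("SELECT AVG(ApplesSold) FROM apples").fetchone()[0]

print(manual, builtin)
```

Both values come out as 6.6, matching the arithmetic on the slide.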

Two-dimensional data
“What is the average daily amount of apples sold?”
Id  Date      ApplesSold
1   Monday    5
2   Monday    4
3   Tuesday   8
4   Thursday  10
5   Thursday  6
SUM(ApplesSold) / COUNT(DISTINCT Date) = (5 + 4 + 8 + 10 + 6) / 3 = 11

Two-dimensional data
“What is the average daily amount of apples sold?”
SELECT AVG(ApplesSold) FROM apples

Two-dimensional data
SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date
Date      NumSold
Monday    9
Tuesday   8
Thursday  16
3 dates, after aggregating!

Two-dimensional data
SELECT AVG(NumSold)
FROM (
    SELECT Date, SUM(ApplesSold) AS NumSold
    FROM apples
    GROUP BY Date
) AS tmp
Result: AVG(NumSold) = 11

Two-dimensional data
mean = apple.groupby('Date')['ApplesSold'].sum().mean()
Result: mean = 11
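The daily average can be reached two ways: the nested query, or the SUM / COUNT(DISTINCT Date) formula. The sketch below runs both, assuming an in-memory SQLite database and an `apples` schema inferred from the slides.

```python
import sqlite3

# (Id, Date, ApplesSold) rows from the slide.
rows = [
    (1, "Monday", 5),
    (2, "Monday", 4),
    (3, "Tuesday", 8),
    (4, "Thursday", 10),
    (5, "Thursday", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apples (Id INTEGER PRIMARY KEY, Date TEXT, ApplesSold INTEGER)"
)
conn.executemany("INSERT INTO apples VALUES (?, ?, ?)", rows)

# Collapse to one row per Date, then average those daily totals.
nested = conn.execute("""
    SELECT AVG(NumSold) FROM (
        SELECT Date, SUM(ApplesSold) AS NumSold
        FROM apples
        GROUP BY Date
    ) AS tmp
""").fetchone()[0]

# Same answer in one pass: total sold divided by the number of distinct days.
direct = conn.execute(
    "SELECT SUM(ApplesSold) * 1.0 / COUNT(DISTINCT Date) FROM apples"
).fetchone()[0]

print(nested, direct)
```

Both come out as 11.0. The single-pass form only works because the inner aggregate is a SUM; the nested form is the one that generalizes to the per-group cases later in the deck.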

The question matters...
...it determines our denominator! (our row COUNT())
“What is the average daily amount of apples sold?”

Two-dimensional data
Id  Date      ApplesSold
1   Monday    5
2   Monday    4
3   Tuesday   8
4   Thursday  10
5   Thursday  6
Id is the primary key; Date is our level of analysis.
SELECT AVG(NumSold)
FROM (
    SELECT Date, SUM(ApplesSold) AS NumSold
    FROM apples
    GROUP BY Date    -- “collapsing key”
) AS tmp

Who cares?

* figures are fictitious

Three-dimensional data
“What’s the average daily amount of apples sold for each seller?”
Id  Date      Seller  ApplesSold
1   Monday    Mary    5
2   Monday    Bob     4
3   Tuesday   Bob     8
4   Thursday  Jane    10
5   Thursday  Jane    6

Three-dimensional data
“What’s the average daily amount of apples sold for each seller?”
Seller  AVG(ApplesSold)
Mary    5
Bob     6
Jane    16

Three-dimensional data
“What’s the average daily amount of apples sold for each seller?”
SELECT Date, Seller, SUM(ApplesSold) AS total
FROM apples
GROUP BY Date, Seller
Date      Seller  total
Monday    Mary    5
Monday    Bob     4
Tuesday   Bob     8
Thursday  Jane    16
“Collapsing key”: replacing the primary key with a new key relevant to this analysis.

Three-dimensional data
“What’s the average daily amount of apples sold for each seller?”
SELECT Seller, AVG(total)
FROM (
    SELECT Date, Seller, SUM(ApplesSold) AS total
    FROM apples
    GROUP BY Date, Seller    -- “collapsing key”
) AS t
GROUP BY Seller
Seller  AVG(total)
Mary    5
Bob     6
Jane    16
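To see why the inner query matters, the sketch below runs the nested per-seller query next to a plain `AVG(ApplesSold) ... GROUP BY Seller`. It assumes an in-memory SQLite database with a schema inferred from the slide; the variable names `daily` and `naive` are illustrative only.

```python
import sqlite3

# (Id, Date, Seller, ApplesSold) rows from the slide.
rows = [
    (1, "Monday", "Mary", 5),
    (2, "Monday", "Bob", 4),
    (3, "Tuesday", "Bob", 8),
    (4, "Thursday", "Jane", 10),
    (5, "Thursday", "Jane", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apples "
    "(Id INTEGER PRIMARY KEY, Date TEXT, Seller TEXT, ApplesSold INTEGER)"
)
conn.executemany("INSERT INTO apples VALUES (?, ?, ?, ?)", rows)

# Collapse to (Date, Seller) first, then average per Seller.
daily = dict(conn.execute("""
    SELECT Seller, AVG(total) FROM (
        SELECT Date, Seller, SUM(ApplesSold) AS total
        FROM apples
        GROUP BY Date, Seller
    ) AS t
    GROUP BY Seller
""").fetchall())

# No collapsing: every row counts as its own observation.
naive = dict(conn.execute(
    "SELECT Seller, AVG(ApplesSold) FROM apples GROUP BY Seller"
).fetchall())

print(daily)
print(naive)
```

Mary and Bob agree in both results because each of their rows falls on a different day. Jane made two sales on Thursday, so the uncollapsed query reports 8 for her while the daily average is 16.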

Three-dimensional data
The inner GROUP BY Date, Seller is the “collapsing key”; the outer GROUP BY Seller is the “grouping key”.

Definitions
Collapsing key: the collapsed/aggregated data relevant to this analysis
- we are “overriding” the primary key (i.e. what a table defines as an observation)
Grouping key: the key defining a group
- e.g. “for each Seller” is (Seller), “for each Country and City” is (Country, City)


Multi-dimensional data
“What’s the average daily amount of apples sold for each seller?”
Id  Date      Seller  ApplesSold
1   Monday    Mary    5
2   Monday    Bob     4
3   Tuesday   Bob     8
4   Thursday  Jane    10
5   Thursday  Jane    6

Definitions
Collapsing key: the collapsed/aggregated data relevant to this analysis
- we are “overriding” the primary key (i.e. what a table defines as an observation)
Grouping key: the key defining a group
- e.g. “for each Seller” is (Seller), “for each Country and City” is (Country, City)
Observation key: a unit of observation for this analysis
- e.g. “daily average” is (Date), “across regions” is (Region)
- this defines how many rows are in the denominator
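The three keys can be made concrete without a database. The following is a minimal pure-Python sketch; `keyed_average` is a hypothetical helper written for this illustration, not something from the talk. It sums the value per collapsing key, then averages those sums per grouping key, so the denominator for each group is the number of distinct observation-key values.

```python
from collections import defaultdict


def keyed_average(rows, value, collapsing_key, grouping_key):
    """Sum `value` per collapsing key, then average the sums per grouping key."""
    if not set(grouping_key) <= set(collapsing_key):
        raise ValueError("grouping key must be part of the collapsing key")

    # Step 1: collapse, producing one total per collapsing-key combination.
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[c] for c in collapsing_key)] += row[value]

    # Step 2: group the collapsed totals and average within each group.
    groups = defaultdict(list)
    for key, total in totals.items():
        record = dict(zip(collapsing_key, key))
        groups[tuple(record[g] for g in grouping_key)].append(total)
    return {g: sum(v) / len(v) for g, v in groups.items()}


apples = [
    {"Id": 1, "Date": "Monday", "Seller": "Mary", "ApplesSold": 5},
    {"Id": 2, "Date": "Monday", "Seller": "Bob", "ApplesSold": 4},
    {"Id": 3, "Date": "Tuesday", "Seller": "Bob", "ApplesSold": 8},
    {"Id": 4, "Date": "Thursday", "Seller": "Jane", "ApplesSold": 10},
    {"Id": 5, "Date": "Thursday", "Seller": "Jane", "ApplesSold": 6},
]

# "average daily amount for each seller": collapse on (Date, Seller), group on (Seller).
per_seller = keyed_average(apples, "ApplesSold", ("Date", "Seller"), ("Seller",))

# "average daily amount": collapse on (Date), no grouping key.
overall = keyed_average(apples, "ApplesSold", ("Date",), ())

print(per_seller)
print(overall)
```

With an empty grouping key the result has a single entry keyed by the empty tuple, which corresponds to an outer query with no GROUP BY.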

Multi-dimensional data
“What’s the average daily amount of apples sold for each seller?”
collapsing key - grouping key = observation key

Multi-dimensional data
(Date, Seller) - (Seller) = (Date)
The observation key is (Date): “per day” is implied.

The pitfall of multi-dimensional averages


Multi-dimensional data (pandas)
apple.groupby(['Date', 'Seller'])['ApplesSold'].sum().groupby(level='Seller').mean()
Seller  AVG(total)
Mary    5
Bob     6
Jane    16
(Date, Seller) - (Seller) = (Date): the observation key “per day” is implied

Order of operations
“What’s the average daily amount of apples sold for each seller?”
1. Observation key: “daily”
2. Grouping key: “for each seller”
3. Collapsing key: [observation key] + [grouping key] = (Date, Seller)
collapsing key - grouping key = observation key
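The three steps translate fairly mechanically into the nested query. As a rough sketch, `build_average_query` below is a hypothetical helper (not from the talk) that forms the collapsing key as observation key plus grouping key and emits the inner and outer queries. It interpolates identifiers directly into the SQL string, so it is only suitable for trusted, hard-coded column names. An in-memory SQLite database stands in for the real one.

```python
import sqlite3


def build_average_query(table, value, observation_key, grouping_key=()):
    """Collapse on observation key + grouping key, then average per grouping key."""
    # Step 3 on the slide: collapsing key = observation key + grouping key.
    collapsing = ", ".join(tuple(observation_key) + tuple(grouping_key))
    inner = (
        f"SELECT {collapsing}, SUM({value}) AS total "
        f"FROM {table} GROUP BY {collapsing}"
    )
    if grouping_key:
        groups = ", ".join(grouping_key)
        return f"SELECT {groups}, AVG(total) FROM ({inner}) AS t GROUP BY {groups}"
    # Blank grouping key: the outer query has no GROUP BY.
    return f"SELECT AVG(total) FROM ({inner}) AS t"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apples "
    "(Id INTEGER PRIMARY KEY, Date TEXT, Seller TEXT, ApplesSold INTEGER)"
)
conn.executemany(
    "INSERT INTO apples VALUES (?, ?, ?, ?)",
    [
        (1, "Monday", "Mary", 5),
        (2, "Monday", "Bob", 4),
        (3, "Tuesday", "Bob", 8),
        (4, "Thursday", "Jane", 10),
        (5, "Thursday", "Jane", 6),
    ],
)

# "average daily amount for each seller"
per_seller_sql = build_average_query("apples", "ApplesSold", ("Date",), ("Seller",))
per_seller = dict(conn.execute(per_seller_sql).fetchall())

# "average daily amount" (no grouping asked)
overall_sql = build_average_query("apples", "ApplesSold", ("Date",))
overall = conn.execute(overall_sql).fetchone()[0]

print(per_seller_sql)
print(per_seller, overall)
```

The point of the helper is the order in which the arguments are decided: the observation key and grouping key come from the English question, and the collapsing key is derived from them rather than chosen independently.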

Order of operations
SELECT Seller, AVG(total)
FROM (
    SELECT Date, Seller, SUM(ApplesSold) AS total
    FROM apples
    GROUP BY Date, Seller    -- 3. collapsing key
) AS t
GROUP BY Seller              -- 2. grouping key
1. Observation key: (Date, Seller) - (Seller) = (Date), so “per day” is implied
Seller  AVG(total)
Mary    5
Bob     6
Jane    16

Solve the English question first, then write the code

What the blank
collapsing key (inner query) - grouping key (outer query) = observation key (the difference)
1. What if the collapsing key is blank?
2. What if the grouping key is blank?
3. What if both are blank?

1. Blank collapsing key

collapsing key (inner query) - grouping key (outer query) = observation key (the difference)

“What is the average daily amount of apples sold?”
SELECT AVG(ApplesSold) FROM apples
Collapsing key: the implied primary key

Two-dimensional data
“What is the average daily amount of apples sold?”
SELECT AVG(NumSold)
FROM (
    SELECT Date, SUM(ApplesSold) AS NumSold
    FROM apples
    GROUP BY Date    -- “collapsing key”
) AS tmp

If you don’t define a collapsing key (the inner query), then the primary key is automatically your collapsing key

primary key - grouping key = observation key


Don’t trust the primary key

2. Blank grouping key

collapsing key (inner query) = observation key

Two-dimensional data
“What is the average daily amount of apples sold?” (no grouping asked)
SELECT AVG(total)
FROM (
    SELECT Date, SUM(ApplesSold) AS total
    FROM apples
    GROUP BY Date    -- “collapsing key”
) AS t
Implied observation key: “Date”

3. Blank collapsing key & blank grouping key

primary key - grouping key = observation key

primary key = observation key

One-dimensional data
“What is the average amount of apples sold?”
SELECT AVG(ApplesSold) FROM apples
The implied observation key is simply the primary key.

“What is the average daily amount of apples sold?”
SELECT AVG(ApplesSold) FROM apples
The implied observation key is simply the primary key.

Two-dimensional data
“What is the average daily amount of apples sold?”
SELECT AVG(total)
FROM (
    SELECT Date, SUM(ApplesSold) AS total
    FROM apples
    GROUP BY Date    -- “collapsing key”
) AS t
Implied observation key: “Date”

What’s the big deal?

Two-dimensional data
“What is the average daily amount of apples sold?”
SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t
(5 + 4 + 8 + 10 + 6) / 3 = 11

“What is the average daily amount of apples sold?”
SELECT AVG(ApplesSold) FROM apples
(5 + 4 + 8 + 10 + 6) / 5 = 6.6
vs.
(5 + 4 + 8 + 10 + 6) / 3 = 11

If we do not collapse data, we will understate our results
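The understatement comes entirely from the denominator. A small sketch with the slide's numbers, in plain Python (the variable names are illustrative):

```python
# (Date, ApplesSold) rows from the two-dimensional example.
sales = [
    ("Monday", 5),
    ("Monday", 4),
    ("Tuesday", 8),
    ("Thursday", 10),
    ("Thursday", 6),
]

total = sum(sold for _, sold in sales)    # same numerator either way
row_count = len(sales)                    # denominator if we do not collapse
day_count = len({day for day, _ in sales})  # denominator after collapsing on Date

per_row = total / row_count
per_day = total / day_count

print(per_row, per_day)
```

The numerator is 33 in both cases; only the count changes, from 5 rows to 3 days. Because there can never be fewer rows than distinct days, the uncollapsed figure cannot exceed the daily one as long as the quantities being summed are non-negative, as sales counts are here.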

the big picture

pandas

pandas .mean() AVG()

Beware the pitfall of multi-dimensional averages!
collapsing key (inner query) - grouping key (outer query) = observation key (the difference)

thank you


More Decks by PyCon 2018
