Slide 1

Slide 1 text

What pandas and SQL Taught Me About Taking an Average Alex Petralia

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

pandas

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

“What is our average daily trading volume for each exchange we trade on?”

Slide 7

Slide 7 text

pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 = vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange

Slide 8

Slide 8 text

* figures are fictitious

Slide 9

Slide 9 text

* figures are fictitious

Slide 10

Slide 10 text

pandas vol = pd.read_sql_table(‘trades’, cnxn) vol = pd.to_datetime(vol[‘Date’]).set_index() vol2018 = vol[‘2018’] df = vol2018.groupby(‘Exchange’) \ [‘Volume’].mean() SELECT Exchange, AVG(Volume) FROM `trades` WHERE Date LIKE ‘2018%’ GROUP BY Exchange “What is our average daily trading volume for each exchange we trade on?”

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

“What do you think an average is?”

Slide 13

Slide 13 text

x

Slide 14

Slide 14 text

Outline for this talk 1. One-dimensional data 2. Two-dimensional data 3. Three-dimensional data 4. The pitfall of multi-dimensional averages 5. The magic formula for taking averages

Slide 15

Slide 15 text

Back to basics SUM(Volume) / COUNT(*)

Slide 16

Slide 16 text

One-dimensional data Id ApplesSold 1 5 2 4 3 8 4 10 5 6 SUM(AppleSold) COUNT(*) 5 + 4 + 8 + 10 + 6 5 = 6.6 “What is the average amount of apples sold?”

Slide 17

Slide 17 text

One-dimensional data Id ApplesSold 1 5 2 4 3 8 4 10 5 6 SELECT SUM(ApplesSold)/COUNT(ApplesSold) FROM apples SELECT AVG(ApplesSold) FROM apples “What is the average amount of apples sold?”

Slide 18

Slide 18 text

Two-dimensional data SUM(AppleSold) COUNT(DISTINCT Date) 5 + 4 + 8 + 10 + 6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”

Slide 19

Slide 19 text

Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”

Slide 20

Slide 20 text

Two-dimensional data Date NumSold Monday 9 Tuesday 8 Thursday 16 SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date 3 dates, after aggregating!

Slide 21

Slide 21 text

SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date ) AS tmp Two-dimensional data AVG(NumSold) 11

Slide 22

Slide 22 text

mean = apple.groupby(‘Date’).sum() \ [‘ApplesSold’].mean() Two-dimensional data mean 11

Slide 23

Slide 23 text

Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 The question matters... ...it determines our denominator! (our row COUNT()) “What is the average daily amount of apples sold?”

Slide 24

Slide 24 text

Two-dimensional data our level of analysis primary key SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6

Slide 25

Slide 25 text

Who cares?

Slide 26

Slide 26 text

* figures are fictitious

Slide 27

Slide 27 text

x

Slide 28

Slide 28 text

Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5 2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”

Slide 29

Slide 29 text

Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5 2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary ? Bob ? Jane ? “What’s the average daily amount of apples sold for each seller?”

Slide 30

Slide 30 text

Three-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5 2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 Seller AVG( ApplesSold) Mary Bob Jane 5 6 16 “What’s the average daily amount of apples sold for each seller?”

Slide 31

Slide 31 text

SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller Three-dimensional data Date Seller total Monday Mary 5 Monday Bob 4 Tuesday Bob 8 Thursday Jane 16 replacing the primary key with a new key relevant to this analysis “collapsing key” “What’s the average daily amount of apples sold for each seller?”

Slide 32

Slide 32 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “What’s the average daily amount of apples sold for each seller?”

Slide 33

Slide 33 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”

Slide 34

Slide 34 text

Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City)

Slide 35

Slide 35 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?”

Slide 36

Slide 36 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Three-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” “What’s the average daily amount of apples sold for each seller?” ?

Slide 37

Slide 37 text

Multi-dimensional data Id Date Seller ApplesSold 1 Monday Mary 5 2 Monday Bob 4 3 Tuesday Bob 8 4 Thursday Jane 10 5 Thursday Jane 6 “What’s the average daily amount of apples sold for each seller?”

Slide 38

Slide 38 text

Definitions Collapsing key: the collapsed/aggregated data relevant to this analysis - we are “overriding” the primary key (ie. what a table defines as an observation) Grouping key: the key defining a group - eg. “for each Seller” is (Seller), “for each Country and City” is (Country, City) Observation key: a unit of observation for this analysis - eg. “daily average” is (Date), “across regions” is (Region) - this defines how many rows are in the denominator

Slide 39

Slide 39 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “What’s the average daily amount of apples sold for each seller?”

Slide 40

Slide 40 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied

Slide 41

Slide 41 text

The pitfall of multi-dimensional averages

Slide 42

Slide 42 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Multi-dimensional data “collapsing key” Seller AVG(total) Mary 5 Bob 6 Jane 16 “grouping key” collapsing key - grouping key = observation key “per day” is implied

Slide 43

Slide 43 text

pd.groupby([‘Date’, ‘Seller’]) \ [‘ApplesSold’].sum() \ .groupby(level=‘Seller’).mean() Multi-dimensional data (pandas) Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied

Slide 44

Slide 44 text

Order of operations 1. Observation key: “daily” 2. Grouping key: “for each seller” 3. Collapsing key: [observation key] + [grouping key] = (Date, Seller) “What’s the average daily amount of apples sold for each seller?” collapsing key - grouping key = observation key

Slide 45

Slide 45 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 observation key “per day” is implied 1 2 collapsing key - grouping key = observation key 3

Slide 46

Slide 46 text

SELECT Seller, AVG(total) FROM ( SELECT Date, Seller, SUM(ApplesSold) AS total FROM apples GROUP BY Date, Seller ) AS t GROUP BY Seller Order of operations Seller AVG(total) Mary 5 Bob 6 Jane 16 (Date, Seller) - (Seller) = (Date) observation key “per day” is implied 2 1 1 2 3 3

Slide 47

Slide 47 text

Solve the English question first, then write the code

Slide 48

Slide 48 text

x

Slide 49

Slide 49 text

What the blank 1. What if the collapsing key is blank? 2. What if the grouping key is blank? 3. What if both are blank? collapsing key - grouping key = observation key inner query outer query the difference

Slide 50

Slide 50 text

1. Blank collapsing key

Slide 51

Slide 51 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 52

Slide 52 text

Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?” implied primary key

Slide 53

Slide 53 text

Two-dimensional data SELECT AVG(NumSold) FROM ( SELECT Date, SUM(ApplesSold) AS NumSold FROM apples GROUP BY Date ) AS tmp “collapsing key” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?”

Slide 54

Slide 54 text

If you don’t define a collapsing key, then the primary key is automatically your collapsing key inner query

Slide 55

Slide 55 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 56

Slide 56 text

primary key - grouping key = observation key

Slide 57

Slide 57 text

Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples “What is the average daily amount of apples sold?”

Slide 58

Slide 58 text

Don’t trust the primary key

Slide 59

Slide 59 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 60

Slide 60 text

2. Blank grouping key

Slide 61

Slide 61 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 62

Slide 62 text

collapsing key = observation key inner query

Slide 63

Slide 63 text

SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 (no grouping asked)

Slide 64

Slide 64 text

3. Blank collapsing key & blank grouping key

Slide 65

Slide 65 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 66

Slide 66 text

primary key - grouping key = observation key

Slide 67

Slide 67 text

primary key = observation key

Slide 68

Slide 68 text

SELECT AVG(ApplesSold) FROM apples One-dimensional data Id ApplesSold 1 5 2 4 3 8 4 10 5 6 implied observation key is simply the primary key “What is the average amount of apples sold?”

Slide 69

Slide 69 text

Two-dimensional data Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 SELECT AVG(ApplesSold) FROM apples implied observation key is simply the primary key “What is the average daily amount of apples sold?”

Slide 70

Slide 70 text

SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t Two-dimensional data implied “Date” observation key “collapsing key” “What is the average daily amount of apples sold?” Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6

Slide 71

Slide 71 text

x

Slide 72

Slide 72 text

collapsing key - grouping key = observation key inner query outer query the difference

Slide 73

Slide 73 text

x

Slide 74

Slide 74 text

What’s the big deal?

Slide 75

Slide 75 text

Two-dimensional data 5 + 4 + 8 + 10 + 6 3 = 11 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(total) FROM ( SELECT Date, SUM(ApplesSold) AS total FROM apples GROUP BY Date ) AS t

Slide 76

Slide 76 text

Two-dimensional data 5 + 4 + 8 + 10 + 6 5 = 6.6 Id Date ApplesSold 1 Monday 5 2 Monday 4 3 Tuesday 8 4 Thursday 10 5 Thursday 6 “What is the average daily amount of apples sold?” SELECT AVG(ApplesSold) FROM apples

Slide 77

Slide 77 text

Two-dimensional data 5 + 4 + 8 + 10 + 6 5 = 6.6 “What is the average daily amount of apples sold?” 5 + 4 + 8 + 10 + 6 3 = 11 vs.

Slide 78

Slide 78 text

If we do not collapse data, we will understate our results

Slide 79

Slide 79 text

x

Slide 80

Slide 80 text

the big picture

Slide 81

Slide 81 text

pandas

Slide 82

Slide 82 text

pandas .mean() AVG()

Slide 83

Slide 83 text

Beware the pitfall of multi-dimensional averages! collapsing key - grouping key = observation key inner query outer query the difference

Slide 84

Slide 84 text

thank you