## Slide 1

### Slide 1 text

Jake VanderPlas PyCon 2016

## Slide 2

### Slide 2 text

< About Me > - Astronomer by training - Statistician by accident - Active in Python science & open source - Data Scientist at UW eScience Institute - @jakevdp on Twitter & Github


## Slide 4

### Slide 4 text

Hacker (n.) 1. A person who is trying to steal your grandma’s bank password.

## Slide 5

### Slide 5 text

Hacker (n.) 1. A person who is trying to steal your grandma’s bank password. 2. A person whose natural approach to problem-solving involves writing code.

## Slide 6

### Slide 6 text

Statistics is Hard.

## Slide 7

### Slide 7 text

Statistics is Hard. Using programming skills, it can be easy.

## Slide 8

### Slide 8 text

My thesis today: If you can write a for-loop, you can do statistics

## Slide 10

### Slide 10 text

– Dr. Seuss (attr)

Warm-up

## Slide 12

### Slide 12 text

Warm-up: Coin Toss. You toss a coin 30 times and see 22 heads. Is it a fair coin?

## Slide 13

### Slide 13 text

A fair coin should show 15 heads in 30 tosses. This coin is biased. Even a fair coin could show 22 heads in 30 tosses. It might be just chance.

## Slide 14

### Slide 14 text

Classic Method: Assume the Skeptic is correct: test the Null Hypothesis. What is the probability of a fair coin showing 22 heads simply by chance?

## Slide 15

### Slide 15 text

Classic Method: Start computing probabilities . . .

Classic Method:

## Slide 17

### Slide 17 text

Classic Method: number of arrangements (binomial coefficient) × probability of N_H heads × probability of N_T tails

Classic Method:

Classic Method:

## Slide 20

### Slide 20 text

Classic Method: 0.8 %

## Slide 21

### Slide 21 text

Classic Method: 0.8 %. A probability of 0.8% (i.e. p = 0.008) of the observations given a fair coin → reject the fair-coin hypothesis at p < 0.05
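The classic computation, written out: sum the binomial probabilities for 22 or more heads out of 30 fair tosses.

```python
from math import comb

# P(22 or more heads in 30 fair tosses): sum the binomial probabilities
p = sum(comb(30, k) for k in range(22, 31)) / 2**30
print(round(p, 3))  # 0.008
```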

## Slide 22

### Slide 22 text

Could there be an easier way?

## Slide 23

### Slide 23 text

Easier Method: Just simulate it!

```python
from numpy.random import randint

M = 0
for i in range(10000):
    trials = randint(2, size=30)   # 30 fair flips: 0 = tails, 1 = heads
    if trials.sum() >= 22:
        M += 1
p = M / 10000  # 0.008149
```

→ reject fair coin at p = 0.008

## Slide 24

### Slide 24 text

In general . . . Computing the Sampling Distribution is Hard.

## Slide 25

### Slide 25 text

In general . . . Computing the Sampling Distribution is Hard. Simulating the Sampling Distribution is Easy.

## Slide 26

### Slide 26 text

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

## Slide 27

### Slide 27 text

Sneetches: Stars and Intelligence*

“Now, the Star-Belly Sneetches had bellies with stars. The Plain-Belly Sneetches had none upon thars . . .”

*inspired by John Rauser’s Statistics Without All The Agonizing Pain

## Slide 28

### Slide 28 text

Sneetches: Stars and Intelligence. Test Scores:

★: 84 72 57 46 63 76 99 91
❌: 81 69 74 61 56 87 69 65 66 44 62 69

★ mean: 73.5 ❌ mean: 66.9 difference: 6.6

## Slide 29

### Slide 29 text

★ mean: 73.5 ❌ mean: 66.9 difference: 6.6 Is this difference of 6.6 statistically significant?

## Slide 30

### Slide 30 text

Classic Method (Welch’s t-test)

## Slide 31

### Slide 31 text

Classic Method (Welch’s t-test)

## Slide 32

### Slide 32 text

Classic Method (Student’s t distribution)

## Slide 33

### Slide 33 text

Classic Method (Student’s t distribution) Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.” -Wikipedia

## Slide 34

### Slide 34 text

Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.” -Wikipedia Classic Method (Student’s t distribution)

## Slide 35

### Slide 35 text

Classic Method (Welch–Satterthwaite equation)

## Slide 36

### Slide 36 text

Classic Method (Welch–Satterthwaite equation)

Classic Method

Classic Method

## Slide 39

### Slide 39 text

Classic Method 1.7959

Classic Method

Classic Method

Classic Method

## Slide 43

### Slide 43 text

“The difference of 6.6 is not significant at the p=0.05 level”
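For reference, the Welch t statistic behind this conclusion can be computed directly. This is a sketch, not the talk's code; the group membership below is reconstructed to reproduce the slide's means of 73.5 and 66.9.

```python
import numpy as np

# Scores split into groups so the means match the slide (73.5 vs 66.9)
star  = np.array([84, 72, 57, 46, 63, 76, 99, 91])
cross = np.array([81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69])

# Welch's t: mean difference divided by its standard error
se = np.sqrt(star.var(ddof=1) / star.size + cross.var(ddof=1) / cross.size)
t = (star.mean() - cross.mean()) / se
print(round(t, 2))  # 0.93, below the one-sided critical value of 1.7959
```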


## Slide 45

### Slide 45 text

The biggest problem: we’ve entirely lost track of what question we’re answering!

## Slide 46

### Slide 46 text

< One popular alternative . . . > “Why don’t you just . . .”

```python
from statsmodels.stats.weightstats import ttest_ind

t, p, dof = ttest_ind(group1, group2,
                      alternative='larger',
                      usevar='unequal')
print(p)  # 0.186
```

## Slide 47

### Slide 47 text

< One popular alternative . . . > “Why don’t you just . . .”

```python
from statsmodels.stats.weightstats import ttest_ind

t, p, dof = ttest_ind(group1, group2,
                      alternative='larger',
                      usevar='unequal')
print(p)  # 0.186
```

. . . But what question is this answering?

## Slide 48

### Slide 48 text

Stepping Back . . . The deep meaning lies in the sampling distribution. Same principle as the coin example: 0.8 %

## Slide 49

### Slide 49 text

Let’s use a sampling method instead

## Slide 50

### Slide 50 text

The Problem: Unlike coin flipping, we don’t have a generative model . . .

## Slide 51

### Slide 51 text

The Problem: Unlike coin flipping, we don’t have a generative model . . . Solution: Shuffling

## Slide 52

### Slide 52 text

★ ❌ 84 72 81 69 57 46 74 61 63 76 56 87 99 91 69 65 66 44 62 69 Idea: Simulate the distribution by shuffling the labels repeatedly and computing the desired statistic. Motivation: if the labels really don’t matter, then switching them shouldn’t change the result!

## Slide 53

### Slide 53 text

★ ❌ 84 72 81 69 57 46 74 61 63 76 56 87 99 91 69 65 66 44 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 54

### Slide 54 text

★ ❌ 84 72 81 69 57 46 74 61 63 76 56 87 99 91 69 65 66 44 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 55

### Slide 55 text

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 56

### Slide 56 text

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 ★ mean: 72.4 ❌ mean: 67.6 difference: 4.8 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 57

### Slide 57 text

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 ★ mean: 72.4 ❌ mean: 67.6 difference: 4.8 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 58

### Slide 58 text

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 59

### Slide 59 text

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 ★ mean: 62.6 ❌ mean: 74.1 difference: -11.6 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 60

### Slide 60 text

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 61

### Slide 61 text

★ ❌ 74 56 72 69 61 63 84 57 87 76 81 65 91 99 46 69 66 62 44 69 ★ mean: 75.9 ❌ mean: 65.3 difference: 10.6 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 62

### Slide 62 text

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 63

### Slide 63 text

★ ❌ 84 81 69 69 61 69 87 74 65 76 56 57 99 44 46 63 66 91 62 72 1. Shuffle Labels 2. Rearrange 3. Compute means

## Slide 64

### Slide 64 text

1. Shuffle Labels 2. Rearrange 3. Compute means ★ ❌ 74 62 72 57 61 63 84 69 87 81 76 65 91 99 46 69 66 56 44 69

## Slide 65

### Slide 65 text

1. Shuffle Labels 2. Rearrange 3. Compute means ★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69

## Slide 66

### Slide 66 text

[Histogram of shuffled results: score difference (x-axis) vs. number (y-axis)]

## Slide 67

### Slide 67 text

[Histogram of shuffled results: score difference (x-axis) vs. number (y-axis)]

## Slide 68

### Slide 68 text

16 % of shuffled differences are at least as large as the observed one. [Histogram: score difference (x-axis) vs. number (y-axis)]

## Slide 69

### Slide 69 text

“A difference of 6.6 is not significant at p = 0.05.” That day, all the Sneetches forgot about stars And whether they had one, or not, upon thars.
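The shuffle test (steps 1-3 repeated many times) can be sketched as a short loop. As above, the group split is reconstructed to match the slide's means of 73.5 and 66.9; the fraction of shuffles beating the observed difference estimates the slide's 16 %.

```python
import numpy as np

rng = np.random.default_rng(0)

# Group split reconstructed from the slide's means (73.5 vs 66.9)
star  = np.array([84, 72, 57, 46, 63, 76, 99, 91])
cross = np.array([81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69])
observed = star.mean() - cross.mean()              # about 6.6

scores = np.concatenate([star, cross])
count = 0
for _ in range(10000):
    rng.shuffle(scores)                            # 1. shuffle labels / 2. rearrange
    diff = scores[:8].mean() - scores[8:].mean()   # 3. compute means
    if diff >= observed:
        count += 1

p = count / 10000
print(p)  # compare the slide's 16 %
```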

## Slide 70

### Slide 70 text

Notes on Shuffling:
- Works when the Null Hypothesis assumes two groups are equivalent
- Like all methods, it will only work if your samples are representative – always be careful about selection biases!
- Needs care for non-independent trials. Good discussion in Simon’s Resampling: The New Statistics

## Slide 71

### Slide 71 text

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

## Slide 72

### Slide 72 text

Yertle’s Turtle Tower On the far-away island of Sala-ma-Sond, Yertle the Turtle was king of the pond. . .

## Slide 73

### Slide 73 text

How High can Yertle stack his turtles? Observe 20 of Yertle’s turtle towers . . .

# of turtles: 48 24 32 61 51 12 32 18 19 24 21 41 29 21 25 23 42 18 23 13

- What is the mean of the number of turtles in Yertle’s stack?
- What is the uncertainty on this estimate?

## Slide 74

### Slide 74 text

Classic Method: Sample Mean: x̄ = (1/N) Σ xᵢ. Standard Error of the Mean: σ_x̄ = s / √N

## Slide 75

### Slide 75 text

What assumptions go into these formulae? Can we use sampling instead?

## Slide 76

### Slide 76 text

Problem: As before, we don’t have a generative model . . .

## Slide 77

### Slide 77 text

Problem: As before, we don’t have a generative model . . . Solution: Bootstrap Resampling

## Slide 78

### Slide 78 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution.

## Slide 79

### Slide 79 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution.

## Slide 80

### Slide 80 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21

## Slide 81

### Slide 81 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19

## Slide 82

### Slide 82 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25

## Slide 83

### Slide 83 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24

## Slide 84

### Slide 84 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23

## Slide 85

### Slide 85 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19

## Slide 86

### Slide 86 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41

## Slide 87

### Slide 87 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23

## Slide 88

### Slide 88 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41

## Slide 89

### Slide 89 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18

## Slide 90

### Slide 90 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61

## Slide 91

### Slide 91 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12

## Slide 92

### Slide 92 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42

## Slide 93

### Slide 93 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42

## Slide 94

### Slide 94 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42

## Slide 95

### Slide 95 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19

## Slide 96

### Slide 96 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18

## Slide 97

### Slide 97 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61

## Slide 98

### Slide 98 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29

## Slide 99

### Slide 99 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29 41

## Slide 100

### Slide 100 text

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29 41 → 31.05

## Slide 101

### Slide 101 text

Repeat this several thousand times . . .

## Slide 102

### Slide 102 text

```python
from numpy import mean, std
from numpy.random import randint

for i in range(10000):
    sample = N[randint(20, size=20)]   # resample the 20 tower counts, with replacement
    xbar[i] = mean(sample)

mean(xbar), std(xbar)  # (28.9, 2.9)
```

Recovers The Analytic Estimate! Height = 29 ± 3 turtles
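Slide 102's loop relies on names defined off-slide; a self-contained sketch using the tower counts from Slide 73 might look like this.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = np.array([48, 24, 32, 61, 51, 12, 32, 18, 19, 24,
                    21, 41, 29, 21, 25, 23, 42, 18, 23, 13])

xbar = np.empty(10000)
for i in range(10000):
    sample = rng.choice(heights, size=20)   # draw 20 values with replacement
    xbar[i] = sample.mean()

print(xbar.mean(), xbar.std())  # close to the analytic 28.9 ± 2.9
```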

## Slide 103

### Slide 103 text

Bootstrap sampling can be applied even to more involved statistics

## Slide 104

### Slide 104 text

Bootstrap on Linear Regression: What is the relationship between wind speed and the height of Yertle’s turtle tower?

## Slide 105

### Slide 105 text

Bootstrap on Linear Regression:

```python
for i in range(10000):
    idx = randint(20, size=20)               # resampled point indices (distinct from the loop counter)
    slope, intercept = fit(x[idx], y[idx])
    results[i] = (slope, intercept)
```
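A runnable version of the same idea, resampling (x, y) pairs and fitting a line to each resample. The talk's wind/height data aren't in the text, so the x and y below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wind speeds and tower heights (not the talk's data)
x = rng.uniform(0, 10, 20)
y = 30 - 2 * x + rng.normal(0, 3, 20)

slopes = np.empty(1000)
intercepts = np.empty(1000)
for i in range(1000):
    idx = rng.integers(0, 20, size=20)       # resample (x, y) pairs with replacement
    slopes[i], intercepts[i] = np.polyfit(x[idx], y[idx], 1)

# The spread of the bootstrap slopes gives the uncertainty on the fit
print(slopes.mean(), "+/-", slopes.std())
```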

## Slide 106

### Slide 106 text

Notes on Bootstrapping:
- Bootstrap resampling is well-studied and rests on solid theoretical grounds.
- Bootstrapping often doesn’t work well for rank-based statistics (e.g. maximum value)
- Works poorly with very few samples (N > 20 is a good rule of thumb)
- As always, be careful about selection biases & non-independent data!

## Slide 107

### Slide 107 text

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

## Slide 108

### Slide 108 text

Onceler Industries: Sales of Thneeds I'm being quite useful! This thing is a Thneed. A Thneed's a Fine-Something- That-All-People-Need!

## Slide 109

### Slide 109 text

Thneed sales seem to show a trend with temperature . . .

## Slide 110

### Slide 110 text

y = a + bx
y = a + bx + cx²
But which model is a better fit?

## Slide 111

### Slide 111 text

y = a + bx (RMS error = 63.0)
y = a + bx + cx² (RMS error = 51.5)
Can we judge by root-mean-square error?

## Slide 112

### Slide 112 text

In general, more flexible models will always have a lower RMS error.
y = a + bx
y = a + bx + cx²
y = a + bx + cx² + dx³
y = a + bx + cx² + dx³ + ex⁴
y = a + ⋯
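A quick demonstration of this effect on hypothetical data: because the polynomial models are nested least-squares fits, the training RMS error can only decrease as terms are added, even when the true trend is linear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical temperature/sales data with a truly linear trend
x = rng.uniform(0, 10, 20)
y = 20 + 3 * x + rng.normal(0, 5, 20)

for degree in [1, 2, 3, 5]:
    coeffs = np.polyfit(x, y, degree)
    rms = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    print(degree, round(rms, 2))   # RMS error only goes down as degree grows
```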

## Slide 113

### Slide 113 text

y = a + bx + cx² + dx³ + ex⁴ + fx⁵ + ⋯ + nx¹⁴
RMS error does not tell the whole story.

## Slide 114

### Slide 114 text

Not to worry: Statistics has figured this out.

## Slide 115

### Slide 115 text

Classic Method Difference in Mean Squared Error follows chi-square distribution:

## Slide 116

### Slide 116 text

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution:

## Slide 117

### Slide 117 text

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution: Plug in our numbers . . .

## Slide 118

### Slide 118 text

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution: Plug in our numbers . . . Wait… what question were we trying to answer again?

## Slide 119

### Slide 119 text

Another Approach: Cross Validation

Cross-Validation

## Slide 121

### Slide 121 text

Cross-Validation 1. Randomly Split data

## Slide 122

### Slide 122 text

Cross-Validation 1. Randomly Split data

## Slide 123

### Slide 123 text

Cross-Validation 2. Find the best model for each subset

## Slide 124

### Slide 124 text

Cross-Validation 3. Compare models across subsets

## Slide 125

### Slide 125 text

Cross-Validation 3. Compare models across subsets

## Slide 126

### Slide 126 text

Cross-Validation 3. Compare models across subsets

## Slide 127

### Slide 127 text

Cross-Validation 3. Compare models across subsets

## Slide 128

### Slide 128 text

Cross-Validation 4. Compute RMS error for each RMS = 48.9 RMS = 55.1 RMS estimate = 52.1

## Slide 129

### Slide 129 text

Cross-Validation Repeat for as long as you have patience . . .

## Slide 130

### Slide 130 text

Cross-Validation 5. Compare cross-validated RMS for models:

## Slide 131

### Slide 131 text

Cross-Validation Best model minimizes the cross-validated error. 5. Compare cross-validated RMS for models:
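The 2-fold procedure (steps 1-5) can be sketched in a few lines. The data here are hypothetical and truly linear, so the cross-validated error should favor the simple model over more flexible ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical temperature/sales data with a truly linear trend
x = rng.uniform(0, 10, 40)
y = 20 + 3 * x + rng.normal(0, 2, 40)

def cv_rms(degree, rounds=50):
    """Average 2-fold cross-validated RMS error for a polynomial fit."""
    errs = []
    for _ in range(rounds):
        idx = rng.permutation(40)                             # 1. randomly split
        for train, test in [(idx[:20], idx[20:]), (idx[20:], idx[:20])]:
            coeffs = np.polyfit(x[train], y[train], degree)   # 2. fit each subset
            resid = np.polyval(coeffs, x[test]) - y[test]     # 3. compare across subsets
            errs.append(np.sqrt(np.mean(resid ** 2)))         # 4. RMS error for each
    return np.mean(errs)                                      # 5. compare models

for degree in [1, 2, 5]:
    print(degree, round(cv_rms(degree), 2))   # flexible models no longer win
```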

## Slide 132

### Slide 132 text

. . . I biggered the loads of the thneeds I shipped out! I was shipping them forth, to the South, to the East to the West, to the North!

## Slide 133

### Slide 133 text

Notes on Cross-Validation:
- This was “2-fold” cross-validation; other CV schemes exist & may perform better for your data (see e.g. scikit-learn docs)
- Cross-validation is the go-to method for model evaluation in machine learning, as statistics of the models are often not known in the classical sense.
- Again: caveats about selection bias and independence in data.

## Slide 134

### Slide 134 text

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

## Slide 135

### Slide 135 text

Sampling Methods allow you to use intuitive computational approaches in place of often non-intuitive statistical rules. If you can write a for-loop, you can do statistical analysis.

## Slide 136

### Slide 136 text

Things I didn’t have time for:
- Bayesian Methods: very intuitive & powerful approaches to more sophisticated modeling. (see e.g. Bayesian Methods for Hackers by Cam Davidson-Pilon)
- Selection Bias: if you get data selection wrong, you’ll have a bad time. (See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science)
- Detailed considerations on use of sampling, shuffling, and bootstrapping. (I recommend Statistics Is Easy by Shasha & Wilson and Resampling: The New Statistics by Julian Simon)

## Slide 137

### Slide 137 text

– Dr. Seuss (attr)

## Slide 138

### Slide 138 text

~ Thank You! ~ Email: [email protected] Twitter: @jakevdp Github: jakevdp Web: http://vanderplas.com/ Blog: http://jakevdp.github.io/ Slides available at http://speakerdeck.com/jakevdp/statistics-for-hackers/