Slide 1

Statistics for Hackers
Jake VanderPlas, PyCon 2016

Slide 2

< About Me > - Astronomer by training - Statistician by accident - Active in Python science & open source - Data Scientist at UW eScience Institute - @jakevdp on Twitter & Github

Slide 3

No content

Slide 4

Hacker (n.) 1. A person who is trying to steal your grandma’s bank password.

Slide 5

Hacker (n.) 1. A person who is trying to steal your grandma’s bank password. 2. A person whose natural approach to problem-solving involves writing code.

Slide 6

Statistics is Hard.

Slide 7

Statistics is Hard. Using programming skills, it can be easy.

Slide 8

My thesis today: If you can write a for-loop, you can do statistics

Slide 9

Statistics is fundamentally about Asking the Right Question.

Slide 10

– Dr. Seuss (attr)

Slide 11

Warm-up

Slide 12

Warm-up: Coin Toss. You toss a coin 30 times and see 22 heads. Is it a fair coin?

Slide 13

One view: “A fair coin should show 15 heads in 30 tosses. This coin is biased.” The skeptic’s view: “Even a fair coin could show 22 heads in 30 tosses. It might be just chance.”

Slide 14

Classic Method: Assume the Skeptic is correct: test the Null Hypothesis. What is the probability of a fair coin showing 22 heads simply by chance?

Slide 15

Classic Method: Start computing probabilities . . .

Slide 16

Classic Method:

Slide 17

Classic Method: the probability of exactly $N_H$ heads and $N_T$ tails in $N = N_H + N_T$ tosses of a fair coin is

$$P = \binom{N}{N_H}\left(\frac{1}{2}\right)^{N_H}\left(\frac{1}{2}\right)^{N_T}$$

(the binomial coefficient counts the number of arrangements; the remaining factors are the probability of $N_H$ heads and the probability of $N_T$ tails).

Slide 18

Classic Method:

Slide 19

Classic Method:

Slide 20

Classic Method: 0.8 %

Slide 21

Classic Method: a 0.8% probability (i.e. p = 0.008) of these observations given a fair coin → reject the fair-coin hypothesis at p < 0.05
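
This tail probability can be checked in one line (a sketch assuming SciPy is available; binom.sf is the binomial survival function):

    from scipy.stats import binom

    binom.sf(21, 30, 0.5)   # P(22 or more heads in 30 fair flips) ≈ 0.008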

Slide 22

Could there be an easier way?

Slide 23

Easier Method: Just simulate it!

    from numpy.random import randint

    M = 0
    for i in range(10000):
        trials = randint(2, size=30)    # 30 random flips: 0 (tails) or 1 (heads)
        if trials.sum() >= 22:
            M += 1
    p = M / 10000    # 0.008149

→ reject fair coin at p = 0.008

Slide 24

In general . . . Computing the Sampling Distribution is Hard.

Slide 25

In general . . . Computing the Sampling Distribution is Hard. Simulating the Sampling Distribution is Easy.

Slide 26

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

Slide 27

Sneetches: Stars and Intelligence. “Now, the Star-Belly Sneetches had bellies with stars. The Plain-Belly Sneetches had none upon thars . . .” (*inspired by John Rauser’s Statistics Without All The Agonizing Pain)

Slide 28

Sneetches: Stars and Intelligence (Test Scores)

★: 84 72 57 46 63 76 99 91
❌: 81 69 74 61 56 87 69 65 66 44 62 69

★ mean: 73.5 ❌ mean: 66.9 difference: 6.6

Slide 29

★ mean: 73.5 ❌ mean: 66.9 difference: 6.6 Is this difference of 6.6 statistically significant?

Slide 30

Classic Method (Welch’s t-test)
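
For reference, the statistic behind this test is the standard Welch’s t for two samples with unequal variances:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}$$

where $\bar{x}_i$, $s_i^2$, and $N_i$ are each group’s mean, variance, and size.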

Slide 31

Classic Method (Welch’s t-test)

Slide 32

Classic Method (Student’s t distribution)

Slide 33

Classic Method (Student’s t distribution) Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.” -Wikipedia

Slide 34

Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.” -Wikipedia Classic Method (Student’s t distribution)

Slide 35

Classic Method (Welch–Satterthwaite equation)
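
For reference, the Welch–Satterthwaite equation approximates the effective degrees of freedom as:

$$\nu \approx \frac{\left(\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}\right)^2}{\dfrac{(s_1^2/N_1)^2}{N_1 - 1} + \dfrac{(s_2^2/N_2)^2}{N_2 - 1}}$$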

Slide 36

Classic Method (Welch–Satterthwaite equation)

Slide 37

Classic Method

Slide 38

Classic Method

Slide 39

Classic Method: critical value t = 1.7959

Slide 40

Classic Method

Slide 41

Classic Method

Slide 42

Classic Method

Slide 43

“The difference of 6.6 is not significant at the p=0.05 level”

Slide 44

No content

Slide 45

The biggest problem: We’ve entirely lost track of what question we’re answering!

Slide 46

< One popular alternative . . . > “Why don’t you just . . .”

    from statsmodels.stats.weightstats import ttest_ind

    t, p, dof = ttest_ind(group1, group2,
                          alternative='larger',
                          usevar='unequal')
    print(p)    # 0.186

Slide 47

< One popular alternative . . . > “Why don’t you just . . .”

    from statsmodels.stats.weightstats import ttest_ind

    t, p, dof = ttest_ind(group1, group2,
                          alternative='larger',
                          usevar='unequal')
    print(p)    # 0.186

. . . But what question is this answering?

Slide 48

Stepping Back . . . The deep meaning lies in the sampling distribution. Same principle as the coin example: 0.8%

Slide 49

Let’s use a sampling method instead

Slide 50

The Problem: Unlike coin flipping, we don’t have a generative model . . .

Slide 51

The Problem: Unlike coin flipping, we don’t have a generative model . . . Solution: Shuffling

Slide 52

★: 84 72 57 46 63 76 99 91
❌: 81 69 74 61 56 87 69 65 66 44 62 69

Idea: Simulate the distribution by shuffling the labels repeatedly and computing the desired statistic. Motivation: if the labels really don’t matter, then switching them shouldn’t change the result!
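
A minimal sketch of this shuffle test (assuming NumPy; the scores and star/cross labels are taken from the slide’s table):

    import numpy as np

    scores = np.array([84, 72, 57, 46, 63, 76, 99, 91,            # star group
                       81, 69, 74, 61, 56, 87, 69, 65,
                       66, 44, 62, 69])                            # cross group
    is_star = np.array([True] * 8 + [False] * 12)

    observed = scores[is_star].mean() - scores[~is_star].mean()   # 6.6
    count = 0
    for i in range(10000):
        shuffled = np.random.permutation(is_star)                 # shuffle the labels
        diff = scores[shuffled].mean() - scores[~shuffled].mean()
        if diff >= observed:
            count += 1
    p = count / 10000   # ≈ 0.16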

Slide 53

★ ❌ 84 72 81 69 57 46 74 61 63 76 56 87 99 91 69 65 66 44 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 54

★ ❌ 84 72 81 69 57 46 74 61 63 76 56 87 99 91 69 65 66 44 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 55

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 56

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 ★ mean: 72.4 ❌ mean: 67.6 difference: 4.8 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 57

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 ★ mean: 72.4 ❌ mean: 67.6 difference: 4.8 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 58

★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 59

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 ★ mean: 62.6 ❌ mean: 74.1 difference: -11.6 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 60

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 61

★ ❌ 74 56 72 69 61 63 84 57 87 76 81 65 91 99 46 69 66 62 44 69 ★ mean: 75.9 ❌ mean: 65.3 difference: 10.6 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 62

★ ❌ 84 56 72 69 61 63 74 57 65 66 81 87 62 44 46 69 76 91 99 69 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 63

★ ❌ 84 81 69 69 61 69 87 74 65 76 56 57 99 44 46 63 66 91 62 72 1. Shuffle Labels 2. Rearrange 3. Compute means

Slide 64

1. Shuffle Labels 2. Rearrange 3. Compute means ★ ❌ 74 62 72 57 61 63 84 69 87 81 76 65 91 99 46 69 66 56 44 69

Slide 65

1. Shuffle Labels 2. Rearrange 3. Compute means ★ ❌ 84 81 72 69 61 69 74 57 65 76 56 87 99 44 46 63 66 91 62 69

Slide 66

[Histogram: number of shuffles vs. score difference]

Slide 67

[Histogram: number of shuffles vs. score difference]

Slide 68

[Histogram: number of shuffles vs. score difference; 16% of shuffles show a difference of 6.6 or more]

Slide 69

“A difference of 6.6 is not significant at p = 0.05.” That day, all the Sneetches forgot about stars And whether they had one, or not, upon thars.

Slide 70

Notes on Shuffling: - Works when the Null Hypothesis assumes two groups are equivalent - Like all methods, it will only work if your samples are representative – always be careful about selection biases! - Needs care for non-independent trials. Good discussion in Simon’s Resampling: The New Statistics

Slide 71

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

Slide 72

Yertle’s Turtle Tower On the far-away island of Sala-ma-Sond, Yertle the Turtle was king of the pond. . .

Slide 73

How high can Yertle stack his turtles? Observe 20 of Yertle’s turtle towers . . .

# of turtles: 48 24 32 61 51 12 32 18 19 24 21 41 29 21 25 23 42 18 23 13

- What is the mean of the number of turtles in Yertle’s stack?
- What is the uncertainty on this estimate?

Slide 74

Classic Method: Sample Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. Standard Error of the Mean: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$.
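
A quick check of these formulae on the observed towers (a sketch assuming NumPy; std() here is the population standard deviation):

    import numpy as np

    counts = np.array([48, 24, 32, 61, 51, 12, 32, 18, 19, 24,
                       21, 41, 29, 21, 25, 23, 42, 18, 23, 13])
    xbar = counts.mean()                         # 28.85
    sem = counts.std() / np.sqrt(len(counts))    # ≈ 2.9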

Slide 75

What assumptions go into these formulae? Can we use sampling instead?

Slide 76

Problem: As before, we don’t have a generating model . . .

Slide 77

Problem: As before, we don’t have a generating model . . . Solution: Bootstrap Resampling

Slide 78

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution.

Slide 79

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution.

Slide 80

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21

Slide 81

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19

Slide 82

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25

Slide 83

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24

Slide 84

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23

Slide 85

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19

Slide 86

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41

Slide 87

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23

Slide 88

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41

Slide 89

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18

Slide 90

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61

Slide 91

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12

Slide 92

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42

Slide 93

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42

Slide 94

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42

Slide 95

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19

Slide 96

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18

Slide 97

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61

Slide 98

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29

Slide 99

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29 41

Slide 100

Bootstrap Resampling: 48 24 51 12 21 41 25 23 32 61 19 24 29 21 23 13 32 18 42 18 Idea: Simulate the distribution by drawing samples with replacement. Motivation: The data estimates its own distribution – we draw random samples from this distribution. 21 19 25 24 23 19 41 23 41 18 61 12 42 42 42 19 18 61 29 41 → 31.05

Slide 101

Repeat this several thousand times . . .

Slide 102

    from numpy import mean, std, zeros
    from numpy.random import randint

    # N is the array of 20 observed tower heights
    xbar = zeros(10000)
    for i in range(10000):
        sample = N[randint(20, size=20)]    # 20 draws with replacement
        xbar[i] = mean(sample)
    mean(xbar), std(xbar)    # (28.9, 2.9)

Recovers the analytic estimate! Height = 29 ± 3 turtles

Slide 103

Bootstrap sampling can be applied even to more involved statistics

Slide 104

Bootstrap on Linear Regression: What is the relationship between wind speed and the height of Yertle’s turtle tower?

Slide 105

Bootstrap on Linear Regression:

    from numpy.random import randint

    for i in range(10000):
        idx = randint(20, size=20)                # resample indices with replacement
        slope, intercept = fit(x[idx], y[idx])    # fit: the linear-fit routine
        results[i] = (slope, intercept)
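
A self-contained version of the same idea (a sketch assuming NumPy, with np.polyfit standing in for the slide’s fit routine; the wind-speed array x and height array y are hypothetical):

    import numpy as np

    rng = np.random.default_rng()
    x = rng.uniform(0, 10, 20)                       # hypothetical wind speeds
    y = 30 - 2 * x + rng.normal(0, 3, 20)            # hypothetical tower heights

    results = np.empty((10000, 2))
    for i in range(10000):
        idx = rng.integers(0, 20, 20)                # resample with replacement
        results[i] = np.polyfit(x[idx], y[idx], 1)   # (slope, intercept)

    slope_std, intercept_std = results.std(axis=0)   # bootstrap uncertainties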

Slide 106

Notes on Bootstrapping: - Bootstrap resampling is well-studied and rests on solid theoretical grounds. - Bootstrapping often doesn’t work well for rank-based statistics (e.g. maximum value) - Works poorly with very few samples (N > 20 is a good rule of thumb) - As always, be careful about selection biases & non-independent data!

Slide 107

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

Slide 108

Onceler Industries: Sales of Thneeds. I'm being quite useful! This thing is a Thneed. A Thneed's a Fine-Something-That-All-People-Need!

Slide 109

Thneed sales seem to show a trend with temperature . . .

Slide 110

y = a + bx
y = a + bx + cx²
But which model is a better fit?

Slide 111

y = a + bx: RMS error = 63.0
y = a + bx + cx²: RMS error = 51.5
Can we judge by root-mean-square error?

Slide 112

In general, more flexible models will always have a lower RMS error.
y = a + bx
y = a + bx + cx²
y = a + bx + cx² + dx³
y = a + bx + cx² + dx³ + ex⁴
y = a + ⋯
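
This is easy to verify numerically: the more terms the model has, the lower its error on the data it was fit to. A sketch (assuming NumPy; x and y are hypothetical temperature and sales arrays):

    import numpy as np

    rng = np.random.default_rng()
    x = rng.uniform(0, 10, 30)                         # hypothetical temperatures
    y = 10 + 3 * x + rng.normal(0, 5, 30)              # hypothetical sales

    for degree in [1, 2, 3, 4, 5]:
        coeffs = np.polyfit(x, y, degree)              # fit to ALL the data
        resid = y - np.polyval(coeffs, x)
        print(degree, np.sqrt(np.mean(resid ** 2)))    # RMS only goes down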

Slide 113

y = a + bx + cx² + dx³ + ex⁴ + fx⁵ + ⋯ + nx¹⁴. RMS error does not tell the whole story.

Slide 114

Not to worry: Statistics has figured this out.

Slide 115

Classic Method Difference in Mean Squared Error follows chi-square distribution:

Slide 116

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution:

Slide 117

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution: Plug in our numbers . . .

Slide 118

Classic Method Can estimate degrees of freedom easily because the models are nested . . . Difference in Mean Squared Error follows chi-square distribution: Plug in our numbers . . . Wait… what question were we trying to answer again?

Slide 119

Another Approach: Cross Validation

Slide 120

Cross-Validation

Slide 121

Cross-Validation 1. Randomly Split data

Slide 122

Cross-Validation 1. Randomly Split data

Slide 123

Cross-Validation 2. Find the best model for each subset

Slide 124

Cross-Validation 3. Compare models across subsets

Slide 125

Cross-Validation 3. Compare models across subsets

Slide 126

Cross-Validation 3. Compare models across subsets

Slide 127

Cross-Validation 3. Compare models across subsets

Slide 128

Cross-Validation 4. Compute the RMS error for each half: RMS = 48.9 and RMS = 55.1 → cross-validated RMS estimate = 52.1

Slide 129

Cross-Validation Repeat for as long as you have patience . . .

Slide 130

Cross-Validation 5. Compare cross-validated RMS for models:

Slide 131

Cross-Validation Best model minimizes the cross-validated error. 5. Compare cross-validated RMS for models:
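
A minimal sketch of the whole 2-fold procedure (assuming NumPy; the temperature array x and sales array y are hypothetical):

    import numpy as np

    rng = np.random.default_rng()
    x = rng.uniform(0, 10, 40)                     # hypothetical temperatures
    y = 20 + 5 * x + rng.normal(0, 8, 40)          # hypothetical thneed sales

    def cv_rms(x, y, degree):
        # 1. randomly split the data into two halves
        idx = rng.permutation(len(x))
        half_a, half_b = idx[:len(x) // 2], idx[len(x) // 2:]
        rms = []
        # 2-4. fit the model on one half, evaluate its RMS on the other
        for train, test in [(half_a, half_b), (half_b, half_a)]:
            coeffs = np.polyfit(x[train], y[train], degree)
            pred = np.polyval(coeffs, x[test])
            rms.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
        return np.mean(rms)

    # 5. compare cross-validated RMS across models; the best model minimizes it
    for degree in range(1, 15):
        print(degree, cv_rms(x, y, degree))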

Slide 132

. . . I biggered the loads of the thneeds I shipped out! I was shipping them forth, to the South, to the East to the West, to the North!

Slide 133

Notes on Cross-Validation: - This was “2-fold” cross-validation; other CV schemes exist & may perform better for your data (see e.g. scikit-learn docs) - Cross-validation is the go-to method for model evaluation in machine learning, as statistics of the models are often not known in the classical sense. - Again: caveats about selection bias and independence in data.
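
For the scikit-learn route mentioned above, a sketch of an equivalent model comparison (assuming scikit-learn; the data arrays are hypothetical, and sklearn wants a 2-D feature matrix):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng()
    x = rng.uniform(0, 10, 40)                # hypothetical temperatures
    y = 20 + 5 * x + rng.normal(0, 8, 40)     # hypothetical thneed sales
    X = x[:, np.newaxis]                      # 2-D feature matrix

    for degree in [1, 2, 3]:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        scores = cross_val_score(model, X, y, cv=2,
                                 scoring='neg_root_mean_squared_error')
        print(degree, -scores.mean())         # cross-validated RMS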

Slide 134

Four Recipes for Hacking Statistics: 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation

Slide 135

Sampling Methods allow you to use intuitive computational approaches in place of often non-intuitive statistical rules. If you can write a for-loop, you can do statistical analysis.

Slide 136

Things I didn’t have time for: - Bayesian Methods: very intuitive & powerful approaches to more sophisticated modeling. (see e.g. Bayesian Methods for Hackers by Cam Davidson-Pilon) - Selection Bias: if you get data selection wrong, you’ll have a bad time. (See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science) - Detailed considerations on use of sampling, shuffling, and bootstrapping. (I recommend Statistics Is Easy by Shasha & Wilson and Resampling: The New Statistics by Julian Simon)

Slide 137

– Dr. Seuss (attr)

Slide 138

~ Thank You! ~ Email: [email protected] Twitter: @jakevdp Github: jakevdp Web: http://vanderplas.com/ Blog: http://jakevdp.github.io/ Slides available at http://speakerdeck.com/jakevdp/statistics-for-hackers/