Statistics for Data Science: what you should know and why

Slide 1

Slide 1 text

Statistics for Data Science: what you should know and why Gabriela de Queiroz Data Scientist and Founder of R-Ladies

Slide 2

Slide 2 text

Lonely Statistician Lonely Data Scientist

Slide 3

Slide 3 text

TOP 5 STATISTICAL CONCEPTS

Slide 4

Slide 4 text

1. Know your data

Slide 5

Slide 5 text

Some ways to know your data

Slide 6

Slide 6 text

Summary Statistics

Slide 7

Slide 7 text

Anscombe's quartet
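A quick way to see why summary statistics alone can mislead is to compute them for Anscombe's quartet. The sketch below uses only the Python standard library; the values are transcribed from the published 1973 dataset, and the helper name `pearson` is just an illustrative choice.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


# mean(x), mean(y), var(x), var(y), correlation for each of the four sets
summary = {
    name: (mean(x), mean(y), variance(x), variance(y), pearson(x, y))
    for name, (x, y) in quartet.items()
}
for name, stats in summary.items():
    print(name, [round(s, 2) for s in stats])
```

The four rows come out nearly identical, even though the scatter plots look completely different, which is the argument for plotting the data before trusting the numbers.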

Slide 8

Slide 8 text

The Datasaurus Dozen AutoDesk Research: https://www.autodeskresearch.com/publications/samestats R-package: https://github.com/stephlocke/datasauRus

Slide 9

Slide 9 text

Think twice before using it

Slide 10

Slide 10 text

2. Correlation* ρ = -1 ρ = +1 * Pearson correlation

Slide 11

Slide 11 text

Correlation describes the strength of the linear relationship between two variables.
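The word "linear" matters here. As a minimal sketch (the function name `pearson_r` and the toy data are illustrative assumptions, not from the talk), the Pearson coefficient can be computed by hand and tried on a perfectly linear relationship and on a perfectly curved one:

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])   # increasing straight line
r_down = pearson_r(xs, [-x for x in xs])        # decreasing straight line
r_curve = pearson_r(xs, [x ** 2 for x in xs])   # perfect, but non-linear, dependence

print(r_up, r_down, r_curve)
```

The parabola is fully determined by x, yet its Pearson correlation is zero, because the coefficient only measures how well a straight line describes the relationship.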

Slide 12

Slide 12 text

What can we say about this chart? Credits: http://www2.stat.duke.edu/~mc301/ARTSCI101_Su16/post/slides/w2_d2_smoking_research.pdf

Slide 13

Slide 13 text

ICE CREAM SALES SHARK ATTACKS? CAUSE

Slide 14

Slide 14 text

ICE CREAM SALES SHARK ATTACKS? CAUSE X SUMMER?

Slide 15

Slide 15 text

observer

Slide 16

Slide 16 text

umbrella => rain

Slide 17

Slide 17 text

Where is the rain???

Slide 18

Slide 18 text

Correlation doesn’t imply causation

Slide 19

Slide 19 text

Causation vs Correlation • Causality indicates that one event is the result of the occurrence of the other event. • Correlation between two things can be caused by a third factor (confounder) that affects both of them.
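The confounder idea can be illustrated with a small simulation. Everything below is synthetic and hypothetical: temperature (standing in for "summer") drives both ice cream sales and shark attacks, and neither affects the other. The coefficients 2.0 and 1.5 and the helper names are assumptions made for the sketch.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


def residuals(y, x):
    """What is left of y after removing a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic data: the confounder drives both variables independently.
temperature = [rng.gauss(0, 1) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 1) for t in temperature]
shark_attacks = [1.5 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson_r(ice_cream, shark_attacks)

# Correlation after controlling for the confounder (a partial correlation).
partial_r = pearson_r(residuals(ice_cream, temperature),
                      residuals(shark_attacks, temperature))

print(f"raw correlation:                 {raw_r:.2f}")
print(f"after controlling for the cause: {partial_r:.2f}")
```

The raw correlation is strong, but once the shared cause is accounted for it drops to roughly zero, which is what one would expect if ice cream sales do not cause shark attacks.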

Slide 20

Slide 20 text

Is there ever a case where correlation implies causation? The gold standard for establishing cause and effect is a randomized controlled trial (aka A/B test).

Slide 21

Slide 21 text

3. A/B Testing

Slide 22

Slide 22 text

A/B Testing Online experiments are used to test a new design, a machine learning model, or any new feature.

Slide 23

Slide 23 text

A/B Testing - Hypothesis Tests A hypothesis test is a way to decide whether the data strongly support one point of view or another.

Slide 24

Slide 24 text

How do you set up an experiment?

Slide 25

Slide 25 text

DEFINE THE GOAL AND FORM THE HYPOTHESIS

Slide 26

Slide 26 text

DEFINE THE GOAL AND FORM THE HYPOTHESIS From stats: hypothesis tests, significance level

Slide 27

Slide 27 text

IDENTIFY THE CONTROL AND THE TREATMENT GROUP

Slide 28

Slide 28 text

IDENTIFY KEY METRICS AND DESIRED IMPROVEMENT From stats: effect size

Slide 29

Slide 29 text

DETERMINE THE FRACTION IN BOTH GROUPS
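One common way to split users into control and treatment with a chosen fraction is to hash a stable identifier, so each user always lands in the same group. This is a sketch of that idea only; the function name `assign_group`, the experiment name, and the user ID format are hypothetical, and the talk does not prescribe a particular assignment method.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hashing the experiment name together with the user ID gives each
    experiment its own independent split.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


# Hypothetical users and experiment name, used only to check the split.
groups = [assign_group(f"user-{i}", "new-checkout", 0.5) for i in range(10_000)]
treatment_share = groups.count("treatment") / len(groups)
print(f"share of users in treatment: {treatment_share:.3f}")
```

Lowering `treatment_fraction` exposes fewer users to the change, at the cost of needing a longer test to collect enough treatment observations.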

Slide 30

Slide 30 text

RUN THE TEST FOR A CERTAIN AMOUNT OF TIME From stats: sample size
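How long to run the test depends on how many users are needed to detect the desired improvement. A rough sketch, assuming the metric is a conversion rate and using the standard normal-approximation formula for comparing two proportions (the baseline of 10%, the target of 12%, and the defaults of 5% significance and 80% power are illustrative assumptions):

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    effect = p_treatment - p_control               # effect size to detect
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)


n_per_group = sample_size_per_group(0.10, 0.12)
print(f"users needed per group: {n_per_group}")
```

Dividing the required sample size by the expected daily traffic in each group gives a first estimate of the test duration. Smaller effects require many more users, since the sample size grows with the inverse square of the effect.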

Slide 31

Slide 31 text

ANALYZE THE RESULTS
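For a conversion-rate metric, one way to analyze the results is a two-proportion z-test. The sketch below is an assumption about the kind of analysis meant here, not the speaker's own method, and the counts are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical results: 200/2000 conversions in control, 250/2000 in treatment.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)

alpha = 0.05  # significance level chosen before running the test
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```

The significance level has to be fixed when the hypothesis is formed, not after looking at the p-value.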

Slide 32

Slide 32 text

4. Statistical Models

Slide 33

Slide 33 text

The response is the variable whose behavior we are trying to model using other variables (the explanatory variables). In any given model: • response variable (Y) • explanatory variables (X1, ..., Xn)

Slide 34

Slide 34 text

Examples of models Time Series Linear Regression Non-Linear Regression

Slide 35

Slide 35 text

Use Case: Improve Sales of a product • Let’s say we were hired to provide advice on how to improve sales of a particular product. • Our goal is to develop an accurate model that can be used to predict sales based on three media advertising budgets. Example extracted from the book "An Introduction to Statistical Learning with Applications in R"

Slide 36

Slide 36 text

The data consists of the sales of the product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

Slide 37

Slide 37 text

output variable: sales (in thousands of units) input variables: advertising budgets (in thousands of dollars) The sales for a particular product is a function of advertising budgets.

Slide 38

Slide 38 text

Suppose we are asked to suggest a marketing plan for next year that will result in high product sales. WHAT INFORMATION WOULD BE USEFUL TO PROVIDE?

Slide 39

Slide 39 text

1. Is there a relationship between advertising budget and sales? Our first goal should be to determine whether the data provide evidence of an association between advertising spend and sales.

Slide 40

Slide 40 text

2. How strong is the relationship between advertising budget and sales?

Slide 41

Slide 41 text

3. Which media contribute to sales? Do all three media contribute to sales, or do just one or two?

Slide 42

Slide 42 text

4. How accurately can we estimate the effect of each medium on sales? For every dollar spent on advertising in a particular medium, by what amount will sales increase?

Slide 43

Slide 43 text

5. How accurately can we predict future sales? For any given level of advertising, what is our prediction for sales, and what is the accuracy of this prediction?

Slide 44

Slide 44 text

6. Is the relationship linear? If the relationship between advertising spend in the various media and sales is approximately a straight line, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.

Slide 45

Slide 45 text

We could answer all those questions by setting up a multiple linear regression: sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε
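To make the regression concrete, here is a minimal least-squares sketch in plain Python. It does NOT use the real Advertising data from the book: the 200 markets are simulated, and the "true" coefficients (intercept 3.0, TV 0.05, radio 0.2, newspaper 0.0) are assumptions chosen only so the fit can be checked. The helper `fit_ols` is likewise illustrative; in practice a statistics library would be used.

```python
import random


def fit_ols(features, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)
true_beta = [3.0, 0.05, 0.2, 0.0]  # hypothetical: newspaper has no effect

# Simulated budgets (TV, radio, newspaper) for 200 markets.
markets = [(rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100))
           for _ in range(200)]
sales = [true_beta[0] + true_beta[1] * tv + true_beta[2] * radio
         + true_beta[3] * news + rng.gauss(0, 1.5)
         for tv, radio, news in markets]

beta_hat = fit_ols(markets, sales)
for name, b in zip(["intercept", "TV", "radio", "newspaper"], beta_hat):
    print(f"{name:>10}: {b: .4f}")
```

Each fitted coefficient reads directly as "the expected change in sales per extra unit of budget in that medium, holding the others fixed", which is what makes questions such as "which media contribute to sales?" answerable from the model.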

Slide 46

Slide 46 text

Why can’t we throw all these in a black box algorithm?

Slide 47

Slide 47 text

INTERPRETABILITY

Slide 48

Slide 48 text

5. Probability

Slide 49

Slide 49 text

• Naive Bayes • Logistic Regression • k-NN • Latent Dirichlet Allocation • Decision Trees • Association Rules (ex: Basket Analysis) • …
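Several of these techniques rest on conditional probability. As one small illustration, Bayes' rule (the basis of Naive Bayes) can be worked through for a single word in a spam filter. All three input probabilities below are made-up numbers for the sketch.

```python
# Hypothetical inputs for a one-word spam filter.
p_spam = 0.20             # prior: share of all mail that is spam
p_word_given_spam = 0.60  # the word appears in 60% of spam messages
p_word_given_ham = 0.05   # the word appears in 5% of legitimate messages

# Total probability of seeing the word in any message.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

Naive Bayes applies the same update across many words at once, under the simplifying assumption that the words are independent given the class.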

Slide 50

Slide 50 text

It doesn’t matter what technique you choose, the most important skill is critical thinking.

Slide 51

Slide 51 text

THANK YOU! @gdequeiroz @RLadiesGlobal www.rladies.org k-roz.com