Slide 1

Slide 1 text

Statistics for Data Science: what you should know and why
Gabriela de Queiroz
Senior Developer Advocate @ IBM http://codait.org
Founder of R-Ladies http://rladies.org

Slide 2

Slide 2 text

Agenda • Know your data • Correlation (and Causation) • A/B test • Statistical Models • Probability Bonus: R-Ladies

Slide 3

Slide 3 text

Where are you coming from?

Slide 4

Slide 4 text

Lonely Statistician Lonely Data Scientist

Slide 5

Slide 5 text

TOP 5 STATISTICAL CONCEPTS

Slide 6

Slide 6 text

1. Know your data

Slide 7

Slide 7 text

Some ways to know your data

Slide 8

Slide 8 text

Summary Statistics

Slide 9

Slide 9 text

Anscombe's quartet

Slide 10

Slide 10 text

The Datasaurus Dozen AutoDesk Research: https://www.autodeskresearch.com/publications/samestats R-package: https://github.com/stephlocke/datasauRus

Slide 11

Slide 11 text

Be sure to plot your data!
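As a minimal sketch of why plotting matters, the snippet below computes the mean, variance, and Pearson correlation for the four Anscombe datasets using only the Python standard library. The values are Anscombe's published data; the helper and variable names are illustrative and not from the talk.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Collect (mean of x, mean of y, sample variance of y, correlation) per dataset.
summary = {}
for name, (x, y) in quartet.items():
    summary[name] = (mean(x), mean(y), variance(y), pearson(x, y))
    mx, my, vy, r = summary[name]
    print(f"{name:>3}: mean_x={mx:.2f} mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```

All four rows print nearly the same numbers (mean of y about 7.5, correlation about 0.82), even though the scatter plots look completely different. Summary statistics alone cannot tell the datasets apart.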

Slide 12

Slide 12 text

Think twice before using it source: http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html

Slide 13

Slide 13 text

2. Correlation* ρ = -1 ρ = +1 * Pearson correlation

Slide 14

Slide 14 text

Correlation describes the strength of the linear relationship between two variables.
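The word "linear" in that definition matters. The following sketch, with made-up toy data and an illustrative `pearson` helper, shows that a perfect straight line gives a coefficient near +1 or -1, while a perfect but curved relationship can give a coefficient of zero.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


xs = [-2, -1, 0, 1, 2]  # toy data, symmetric around zero

r_up = pearson(xs, [2 * v + 1 for v in xs])   # increasing straight line
r_down = pearson(xs, [-3 * v for v in xs])    # decreasing straight line
r_curve = pearson(xs, [v ** 2 for v in xs])   # exact parabola, not linear

print(f"line up: {r_up:.3f}, line down: {r_down:.3f}, parabola: {r_curve:.3f}")
```

In the parabola case y is fully determined by x, yet the coefficient is zero, so a low correlation does not mean "no relationship".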

Slide 15

Slide 15 text

What can we say about this chart? Credits: http://www2.stat.duke.edu/~mc301/ARTSCI101_Su16/post/slides/w2_d2_smoking_research.pdf

Slide 16

Slide 16 text

ICE CREAM SALES SHARK ATTACKS? CAUSE

Slide 17

Slide 17 text

ICE CREAM SALES SHARK ATTACKS? CAUSE X SUMMER?

Slide 18

Slide 18 text

observer

Slide 19

Slide 19 text

umbrella => rain

Slide 20

Slide 20 text

Where is the rain???

Slide 21

Slide 21 text

Correlation doesn’t imply causation

Slide 22

Slide 22 text

Causation vs Correlation • Causality indicates that one event is the result of the occurrence of the other event. • Correlation between two things can be caused by a third factor (confounder) that affects both of them.
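The confounder idea can be illustrated with a small simulation. This is a sketch on synthetic data: the effect sizes (5 and 3), the noise level, and the variable names are assumptions chosen for illustration, not measurements. Summer raises both ice cream sales and shark attacks, and the two never influence each other directly.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Confounder: 1 on summer days, 0 otherwise (alternating for equal groups).
summer = [i % 2 for i in range(2000)]

# Both outcomes depend on summer plus independent noise, not on each other.
ice_cream = [5 * s + rng.gauss(0, 1) for s in summer]
sharks = [3 * s + rng.gauss(0, 1) for s in summer]


def within(flag):
    """Correlation restricted to days where the confounder equals flag."""
    pairs = [(a, b) for a, b, s in zip(ice_cream, sharks, summer) if s == flag]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])


r_all = pearson(ice_cream, sharks)
r_summer = within(1)
r_other = within(0)

print(f"all days: {r_all:.2f}, summer only: {r_summer:.2f}, rest: {r_other:.2f}")
```

Across all days the two series are strongly correlated, but once the confounder is held fixed (summer days only, or non-summer days only) the correlation largely disappears.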

Slide 23

Slide 23 text

Is there any time where correlation implies causation? The gold standard for establishing cause and effect is a controlled trial (aka A/B test).

Slide 24

Slide 24 text

3. A/B Testing

Slide 25

Slide 25 text

A/B Testing Online experiments are used to test a new design, a machine learning model, or any new feature.

Slide 26

Slide 26 text

A/B Testing - Hypothesis Tests A hypothesis test is a way to decide whether the data strongly support one point of view or another.
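One common way to run such a test on conversion rates is a two-proportion z-test. The sketch below is one possible approach, not necessarily the one used in the talk; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Null hypothesis: both groups share the same underlying rate.
    Returns the z statistic and the two-sided p-value.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate: the best estimate of the common rate under the null.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: control (A) converts 200 of 5000 visitors,
# treatment (B) converts 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

alpha = 0.05  # significance level, fixed before looking at the data
print("reject the null hypothesis" if p_value < alpha else "fail to reject")
```

The p-value is the probability of seeing a difference at least this large if the two variants truly performed the same. Comparing it with a significance level chosen in advance turns the data into a decision.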

Slide 27

Slide 27 text

How do you set up an experiment?

Slide 28

Slide 28 text

DEFINE THE GOAL AND FORM THE HYPOTHESIS

Slide 29

Slide 29 text

DEFINE THE GOAL AND FORM THE HYPOTHESIS From stats: hypothesis tests, significance level

Slide 30

Slide 30 text

IDENTIFY THE CONTROL AND THE TREATMENT GROUP

Slide 31

Slide 31 text

IDENTIFY KEY METRICS AND DESIRED IMPROVEMENT From stats: effect size

Slide 32

Slide 32 text

DETERMINE THE FRACTION IN BOTH GROUPS

Slide 33

Slide 33 text

RUN THE TEST FOR A CERTAIN AMOUNT OF TIME From stats: sample size

Slide 34

Slide 34 text

ANALYZE THE RESULTS
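Significance level, effect size, and sample size from the steps above are linked: fixing the first two (plus the desired power) determines how many users each group needs, and therefore how long the test must run. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name, the 10% baseline, and the 12% target are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_control vs p_treatment.

    Normal approximation for a two-sided test of two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical goal: lift conversion from 10% to 12%.
n_needed = sample_size_per_group(0.10, 0.12)
print(f"users needed per group: {n_needed}")

# Halving the desired improvement requires a much larger sample.
n_small_effect = sample_size_per_group(0.10, 0.11)
print(f"users needed per group for a 1-point lift: {n_small_effect}")
```

Dividing the required sample by the expected daily traffic per group gives a rough estimate of how long the experiment has to run.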

Slide 35

Slide 35 text

4. Statistical Models

Slide 36

Slide 36 text

The response is the variable we are trying to model using other variables (the explanatory variables). In any given model: • response variable (Y) • explanatory variables (X1, ..., Xn)

Slide 37

Slide 37 text

Examples of models Time Series Linear Regression Non-Linear Regression

Slide 38

Slide 38 text

Use Case: Improve Sales of a product • Let’s say we were hired to provide advice on how to improve sales of a particular product. • Our goal is to develop a model that can be used to predict sales based on the advertising budgets for three media. Example extracted from the book "An Introduction to Statistical Learning with Applications in R".

Slide 39

Slide 39 text

The data consists of the sales of the product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

Slide 40

Slide 40 text

output variable: sales (in thousands of units) input variables: advertising budgets (in thousands of dollars) The sales for a particular product is a function of advertising budgets.

Slide 41

Slide 41 text

Suppose we are asked to suggest a marketing plan for next year that will result in high product sales. WHAT INFORMATION WOULD BE USEFUL TO PROVIDE?

Slide 42

Slide 42 text

1. Is there a relationship between advertising budget and sales? Our first goal should be to determine whether the data provide evidence of an association between advertising spend and sales.

Slide 43

Slide 43 text

2. How strong is the relationship between advertising budget and sales?

Slide 44

Slide 44 text

3. Which media contribute to sales? Do all three media contribute to sales, or do just one or two?

Slide 45

Slide 45 text

4. How accurately can we estimate the effect of each medium on sales? For every dollar spent on advertising in a particular medium, by what amount will sales increase?

Slide 46

Slide 46 text

5. How accurately can we predict future sales? For any given level of advertising, what is our prediction for sales, and what is the accuracy of this prediction?

Slide 47

Slide 47 text

6. Is the relationship linear? If the relationship between advertising spend in the various media and sales is approximately a straight line, then linear regression is an appropriate tool. If not, it may still be possible to transform the predictor or the response so that linear regression can be used.
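As a hedged illustration of the transformation idea, the sketch below builds a made-up exponential relationship between budget and sales (the constants 2.0 and 0.4 are arbitrary assumptions) and compares the Pearson correlation before and after taking the logarithm of the response.

```python
from math import exp, log, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic data: sales grow exponentially with budget (not a straight line).
budget = list(range(1, 11))
sales = [2.0 * exp(0.4 * b) for b in budget]

r_raw = pearson(budget, sales)                    # curved relationship
r_log = pearson(budget, [log(s) for s in sales])  # log(sales) is linear in budget

print(f"raw response: r = {r_raw:.3f}, log response: r = {r_log:.3f}")
```

After the log transform the points fall on a straight line, so a linear model of log(sales) on budget fits this synthetic data well, while a linear model of raw sales would not.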

Slide 48

Slide 48 text

We could answer all those questions by setting up a multiple linear regression: $\text{sales} = \beta_0 + \beta_1\,\text{TV} + \beta_2\,\text{radio} + \beta_3\,\text{newspaper} + \epsilon$
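A multiple linear regression of this form can be fitted by ordinary least squares. The sketch below does so with the normal equations in plain Python. It runs on synthetic data: the 200 simulated markets, the budget ranges, and the "true" coefficients are assumptions made up for the illustration, not the Advertising dataset from the book. In practice one would use a statistics library rather than the hand-written `fit_ols` helper.

```python
import random


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(xi[a] * xi[b] for xi in X) for b in range(k)]
        + [sum(xi[a] * yi for xi, yi in zip(X, y))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)  # fixed seed for reproducibility

# Assumed "true" model used only to generate the synthetic data:
# intercept, TV, radio, newspaper (newspaper has no effect here).
true_beta = [3.0, 0.05, 0.2, 0.0]

markets, sales = [], []
for _ in range(200):
    tv, radio, paper = rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100)
    markets.append((tv, radio, paper))
    noise = rng.gauss(0, 1)
    sales.append(true_beta[0] + true_beta[1] * tv + true_beta[2] * radio
                 + true_beta[3] * paper + noise)

beta_hat = fit_ols(markets, sales)
for label, b in zip(["intercept", "TV", "radio", "newspaper"], beta_hat):
    print(f"{label:>9}: {b:.3f}")
```

Each estimated coefficient reads as the expected change in sales for a one-unit increase in that budget with the other budgets held fixed. A coefficient near zero, as for newspaper in this simulation, suggests that medium adds little once the others are accounted for.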

Slide 49

Slide 49 text

Why can’t we throw all these in a black box algorithm?

Slide 50

Slide 50 text

INTERPRETABILITY

Slide 51

Slide 51 text

5. Probability

Slide 52

Slide 52 text

• Naive Bayes • Logistic Regression • k-NN • Latent Dirichlet Allocation (LDA) • Decision Trees • Association Rules (ex: Basket Analysis) • …

Slide 53

Slide 53 text

Whatever technique you choose, the most important skill is critical thinking.

Slide 54

Slide 54 text

Resources

Slide 56

Slide 56 text

Worldwide organization that promotes gender diversity in the R community via meetups and mentorship in a friendly and safe environment

Slide 57

Slide 57 text

Our mission More women/non-binary • coders • developers • speakers • leaders More gender minorities developing R packages and being part of the R community.

Slide 58

Slide 58 text

http://bit.ly/rladiesgroups

Slide 59

Slide 59 text

How can I start my own chapter? #rcatladies

Slide 60

Slide 60 text

Send an email to [email protected]

Slide 61

Slide 61 text

We’ll send everything you'll need! #rdogladies

Slide 62

Slide 62 text

What do you get? 1) Starter-Kit ▪ Tech Infrastructure ▪ Tips on how to organize events ▪ Code of Conduct (En/Spanish) 2) @rladies.org email 3) Organizer slack channel 4) Shared training material 5) Financial support to cover meetup registration/renewal fees!

Slide 63

Slide 63 text

And there is more! YOU WILL: - be part of an incredible family - learn a lot (not only R)! - have unlimited support - meet other R-Ladies + MUCH MORE!

Slide 64

Slide 64 text

COME JOIN US!

Slide 65

Slide 65 text

Make sure to schedule your 1:1 Session https://ibmcommunityday.bemyapp.com/#/mentors You can also reach me via: twitter: @gdequeiroz linkedin: http://bit.ly/linkedin-gdq

Slide 66

Slide 66 text

THANK YOU! @RLadiesGlobal www.rladies.org @gdequeiroz www.k-roz.com

Slide 67

Slide 67 text

Additional Resources • R-Ladies: www.rladies.org • Call for Code: https://developer.ibm.com/callforcode/ • Intro to Statistics with R - DataCamp • R4DS: http://r4ds.had.co.nz/ • Think Stats - Probability and Statistics for Programmers: http://greenteapress.com/thinkstats/ • Statistical Learning online class: https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning-self-paced

Slide 68

Slide 68 text

Continue the conversation and join: https://community.ibm.com/datascience