Statistics for Data Science: what
you should know and why
—
Gabriela de Queiroz
Senior Developer Advocate @ IBM
Founder of R-Ladies
http://codait.org
http://rladies.org
Slide 2
Slide 2 text
Agenda
• Know your data
• Correlation (and Causation)
• A/B test
• Statistical Models
• Probability
Bonus: R-Ladies
Slide 3
Slide 3 text
Where are you coming from?
Slide 4
Slide 4 text
Lonely Statistician
Lonely Data Scientist
Slide 5
Slide 5 text
TOP 5 STATISTICAL
CONCEPTS
Slide 6
Slide 6 text
1. Know your data
Slide 7
Slide 7 text
Some ways to know your data
Slide 8
Slide 8 text
Summary Statistics
Slide 9
Slide 9 text
Anscombe's quartet
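
As a small illustration of why summary statistics alone can mislead, the sketch below compares datasets I and II of Anscombe's quartet using only the Python standard library. The helper names (`pearson_r`, `summarize`) are made up for this example; the numbers are the published Anscombe (1973) values.

```python
from math import sqrt
from statistics import mean, variance

# Datasets I and II of Anscombe's quartet; both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    ss_a = sum((ai - ma) ** 2 for ai in a)
    ss_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(ss_a * ss_b)


def summarize(xs, ys):
    """Mean and sample variance of y, plus the x-y correlation, rounded to 2 decimals."""
    return {
        "mean_y": round(mean(ys), 2),
        "var_y": round(variance(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }


for name, ys in (("I", y1), ("II", y2)):
    print(name, summarize(x, ys))
```

The two summaries come out nearly identical, yet dataset I is a noisy line and dataset II is a smooth curve. Only a plot reveals the difference.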
Slide 10
Slide 10 text
The Datasaurus Dozen
AutoDesk Research: https://www.autodeskresearch.com/publications/samestats
R-package: https://github.com/stephlocke/datasauRus
Slide 11
Slide 11 text
Be sure to plot your data!
Slide 12
Slide 12 text
Think twice before using it
source: http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html
Correlation describes the
strength of the linear
relationship between two
variables.
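
A minimal sketch of the Pearson correlation coefficient, to make the word "linear" concrete. The data are invented for illustration: one perfectly linear relationship and one perfectly determined but non-linear (parabolic) relationship.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    ss_a = sum((ai - ma) ** 2 for ai in a)
    ss_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(ss_a * ss_b)


x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * xi + 1 for xi in x]   # straight line
y_parabola = [xi ** 2 for xi in x]    # y depends entirely on x, but not linearly

print("linear:  ", pearson_r(x, y_linear))
print("parabola:", pearson_r(x, y_parabola))
```

The parabola has a correlation of zero even though y is fully determined by x, so a low correlation does not mean "no relationship", only "no linear relationship".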
Slide 15
Slide 15 text
What can we say about this chart?
Credits: http://www2.stat.duke.edu/~mc301/ARTSCI101_Su16/post/slides/w2_d2_smoking_research.pdf
Slide 16
Slide 16 text
ICE CREAM SALES --(CAUSE)--> SHARK ATTACKS?
Slide 17
Slide 17 text
ICE CREAM SALES --(CAUSE: X)--> SHARK ATTACKS?
SUMMER?
Slide 18
Slide 18 text
observer
Slide 19
Slide 19 text
umbrella => rain
Slide 20
Slide 20 text
Where is the rain???
Slide 21
Slide 21 text
Correlation doesn’t imply causation
Slide 22
Slide 22 text
Causation vs Correlation
• Causality indicates that one event is the
result of the occurrence of the other event.
• Correlation between two things can be
caused by a third factor (confounder) that
affects both of them.
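
The confounder idea can be sketched with a small simulation. Everything here is invented for illustration: temperature stands in for "summer" and drives both simulated series, which have no direct effect on each other. The coefficients and noise levels are arbitrary assumptions.

```python
import random
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    ss_a = sum((ai - ma) ** 2 for ai in a)
    ss_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(ss_a * ss_b)


def residuals(y, x):
    """What is left of y after removing a least-squares straight-line fit on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
temperature = [rng.uniform(0, 30) for _ in range(2000)]
# Both series depend on temperature only, never on each other.
ice_cream_sales = [2.0 * t + rng.gauss(0, 3) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0, 3) for t in temperature]

raw_r = pearson_r(ice_cream_sales, shark_attacks)
adjusted_r = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)
print(f"raw correlation:                  {raw_r:.2f}")
print(f"after accounting for temperature: {adjusted_r:.2f}")
```

The raw correlation is strong, but once the shared dependence on temperature is removed, the remaining correlation is close to zero.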
Slide 23
Slide 23 text
Is there any time where correlation
implies causation?
The gold standard for establishing cause and
effect is a controlled trial (aka A/B test).
Slide 24
Slide 24 text
3. A/B Testing
Slide 25
Slide 25 text
A/B Testing
Online experiments are used to test a new
design, a machine learning model, or any
new feature.
Slide 26
Slide 26 text
A/B Testing - Hypothesis Tests
A hypothesis test is a way to decide whether
the data strongly support one point of view
or another.
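
One way to make this concrete is a permutation test, sketched below. This is one of several possible hypothesis tests, chosen here because it needs few assumptions; the function name and default parameters are made up for the example.

```python
import random


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose absolute difference in
    means is at least as large as the observed one (with a +1 correction
    so the p-value is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend group labels carry no information
        perm_a, perm_b = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
group_b = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print("p-value:", permutation_test(group_a, group_b))
```

A small p-value says the observed difference would be rare if group membership did not matter, which is the sense in which the data "strongly support" one view over the other.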
Slide 27
Slide 27 text
How do you set up an
experiment?
Slide 28
Slide 28 text
DEFINE THE GOAL
AND
FORM THE HYPOTHESIS
Slide 29
Slide 29 text
DEFINE THE GOAL
AND
FORM THE HYPOTHESIS
From stats:
hypothesis tests
significance level
Slide 30
Slide 30 text
IDENTIFY THE CONTROL
AND
THE TREATMENT GROUP
Slide 31
Slide 31 text
IDENTIFY KEY METRICS
AND
DESIRED IMPROVEMENT
From stats:
effect size
Slide 32
Slide 32 text
DETERMINE THE FRACTION
IN BOTH GROUPS
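
A common way to split traffic between the two groups, sketched here as an assumption rather than something the slides prescribe, is to hash a stable user identifier. The function name, the experiment label, and the 50/50 default are all hypothetical.

```python
import hashlib


def assign_group(user_id, treatment_fraction=0.5, experiment="example-experiment"):
    """Deterministically map a user to 'control' or 'treatment'.

    The same user always lands in the same group, and the experiment name
    is mixed in so different experiments get independent splits.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-ish value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


groups = [assign_group(uid) for uid in range(10_000)]
share = groups.count("treatment") / len(groups)
print(f"share in treatment: {share:.3f}")
```

Hashing instead of drawing a fresh random number on every visit keeps each user's experience consistent for the duration of the test.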
Slide 33
Slide 33 text
RUN THE TEST FOR A
CERTAIN AMOUNT OF TIME
From stats:
sample size
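
How long the test runs is usually driven by the sample size needed to detect the desired improvement. Below is a rough sketch using the standard normal-approximation formula for comparing two proportions. The baseline rate, target rate, significance level, and power are assumed values for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control               # effect size as a difference in rates
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed scenario: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(0.10, 0.12)
print("users needed per group:", n)
```

Dividing the required sample size by the expected daily traffic per group gives a first estimate of the test duration. Smaller effects need many more users, since the effect size enters the formula squared.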
Slide 34
Slide 34 text
ANALYZE THE RESULTS
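
When the key metric is a conversion rate, one standard way to analyze the results is a two-proportion z-test. The sketch below assumes that setting; the counts are invented, and other metrics would call for other tests.

```python
from math import erfc, sqrt


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z statistic, p-value), using the pooled standard error.
    """
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical outcome: control converts 200 of 2000, treatment 250 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value is then compared with the significance level chosen when the hypothesis was formed, before looking at the data.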
Slide 35
Slide 35 text
4. Statistical Models
Slide 36
Slide 36 text
The response variable is the one whose values we
are trying to model using other variables
(the explanatory variables).
In any given model:
• response variable (Y)
• explanatory variables (X1, ..., Xn)
Slide 37
Slide 37 text
Examples of models
Time Series
Linear Regression
Non-Linear Regression
Slide 38
Slide 38 text
Use Case: Improve Sales of a
product
• Let’s say we were hired to provide advice on
how to improve sales of a particular product.
• Our goal is to develop a model that can be
used to predict sales based on the three
media budgets (TV, radio, and newspaper).
Example extracted from the book "An Introduction to Statistical Learning with Applications in R"
Slide 39
Slide 39 text
The data consists of the sales of the product in 200 different
markets, along with advertising budgets for the product in each
of those markets for three different media: TV, radio, and
newspaper.
Slide 40
Slide 40 text
output variable: sales (in thousands of units)
input variables: advertising budgets (in thousands of dollars)
Sales of a particular product are modeled as a function of the advertising budgets.
Slide 41
Slide 41 text
Suppose we are asked to suggest a marketing plan for
next year that will result in high product sales.
WHAT INFORMATION WOULD BE USEFUL TO
PROVIDE?
Slide 42
Slide 42 text
1. Is there a relationship between
advertising budget and sales?
Our first goal should be to determine whether
the data provide evidence of an association
between advertising spend and sales.
Slide 43
Slide 43 text
2. How strong is the relationship
between advertising budget and
sales?
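
Two usual measures of the strength of a fitted relationship are R² (share of variance explained) and the residual standard error (typical size of a prediction miss). The sketch below computes both for a one-predictor fit. The six data points are made up for illustration and are not the Advertising data from the book.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - slope * mx, slope


def strength_of_fit(x, y):
    """Return (R squared, residual standard error) for a straight-line fit."""
    intercept, slope = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    r_squared = 1 - rss / tss
    rse = sqrt(rss / (len(x) - 2))  # two parameters were estimated
    return r_squared, rse


# Hypothetical TV budgets (thousands of dollars) and sales (thousands of units).
tv = [50, 100, 150, 200, 250, 300]
sales = [7.1, 9.8, 12.2, 15.1, 17.0, 20.3]
r_squared, rse = strength_of_fit(tv, sales)
print(f"R^2 = {r_squared:.3f}, RSE = {rse:.2f}")
```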
Slide 44
Slide 44 text
3. Which media contribute to sales?
Do all three media contribute to sales,
or do just one or two?
Slide 45
Slide 45 text
4. How accurately can we estimate the effect
of each medium on sales?
For every dollar spent on advertising in a
particular medium, by what amount will sales
increase?
Slide 46
Slide 46 text
5. How accurately can we predict future
sales?
For any given level of advertising, what is our prediction
for sales, and how accurate is this
prediction?
Slide 47
Slide 47 text
6. Is the relationship linear?
If the relationship between advertising spend in the various
media and sales is approximately a straight line, then linear
regression is an appropriate tool.
If not, then it may still be possible to transform the predictor
or the response so that linear regression can be used.
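
As a sketch of the transformation idea, suppose the response grows exponentially with the predictor. Taking the logarithm of the response makes the relationship a straight line again. The curve and its constants (2 and 0.3) are assumptions chosen for the example.

```python
from math import exp, log


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - slope * mx, slope


x = list(range(1, 11))
y = [2.0 * exp(0.3 * xi) for xi in x]  # curved: y = 2 * e^(0.3 x)

# log(y) = log(2) + 0.3 * x is linear in x, so ordinary regression applies.
log_y = [log(yi) for yi in y]
intercept, slope = fit_line(x, log_y)

print(f"slope on log scale:  {slope:.3f}")
print(f"back-transformed multiplier: {exp(intercept):.3f}")
```

Fitting on the log scale recovers the growth rate as the slope and the multiplier as the exponentiated intercept.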
Slide 48
Slide 48 text
We could answer all those questions by
setting up a multiple linear regression:
$$\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper} + \epsilon$$
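
A multiple linear regression like this one can be fit by ordinary least squares. The sketch below solves the normal equations in plain Python on simulated data. The 200 markets, the coefficient values, and the noise level are invented for the simulation and are not the Advertising dataset from the book. In practice a statistics library would be used instead.

```python
import random


def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bp] for y on the predictor rows."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(1)
# Simulated budgets per market: (TV, radio, newspaper), in thousands of dollars.
markets = [
    (rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100)) for _ in range(200)
]
true_beta = [3.0, 0.045, 0.19, 0.0]  # assumed values; newspaper has no effect here
sales = [
    true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], row)) + rng.gauss(0, 1.5)
    for row in markets
]

beta_hat = fit_ols(markets, sales)
for name, value in zip(("intercept", "TV", "radio", "newspaper"), beta_hat):
    print(f"{name:>10}: {value:.3f}")
```

Each estimated coefficient is read as the expected change in sales for a one-unit increase in that budget while the other two are held fixed.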
Slide 49
Slide 49 text
Why can’t we throw all these in a black box
algorithm?
It doesn’t matter what technique you choose,
the most important skill is critical thinking.
Slide 54
Slide 54 text
Resources
Slide 56
Slide 56 text
Worldwide organization that
promotes gender diversity in
the R community via meetups
and mentorship in a friendly
and safe environment
Slide 57
Slide 57 text
Our mission
More women/non-binary
• coders
• developers
• speakers
• leaders
More gender minorities
developing R packages and
being part of the R community.
What do you get?
1) Starter-Kit
▪ Tech Infrastructure
▪ Tips on how to organize events
▪ Code of Conduct (English/Spanish)
2) @rladies.org email
3) Organizer Slack channel
4) Shared training material
5) Financial support to cover
meetup registration/renewal
fees!
Slide 63
Slide 63 text
And there is more!
YOU WILL:
- be part of an incredible family
- learn a lot (not only R)!
- have unlimited support
- meet other R-Ladies
+ MUCH MORE!
Slide 64
Slide 64 text
COME JOIN US!
Slide 65
Slide 65 text
Make sure to
schedule your
1:1 Session
https://ibmcommunityday.bemyapp.com/#/mentors
You can also reach me via:
twitter: @gdequeiroz
linkedin: http://bit.ly/linkedin-gdq