Statistics for Data Science: what you should know and why

Talk at IBM Community Day: Data Science, Online, 2018-07-24

- https://ibmdatascienceday.bemyapp.com/talks

Gabriela de Queiroz

July 24, 2018

Transcript

  1. Statistics for Data Science: what
    you should know and why

    —

    Gabriela de Queiroz

    Senior Developer Advocate @ IBM
    Founder of R-Ladies
    http://codait.org
    http://rladies.org

  2. Agenda
    • Know your data
    • Correlation (and Causation)
    • A/B test
    • Statistical Models
    • Probability
    Bonus: R-Ladies

  3. Where are you coming from?

  4. Lonely Statistician
    Lonely Data Scientist

  5. TOP 5 STATISTICAL
    CONCEPTS

  6. 1. Know your data

  7. Some ways to know your data

  8. Summary Statistics

  9. Anscombe's quartet
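
The point of Anscombe's quartet can be checked numerically. The sketch below is an illustration added to this transcript, not part of the talk: it computes the usual summary statistics for the four published Anscombe datasets using only the Python standard library. The helper name `pearson` and the `summary` dictionary are choices made for this example.

```python
from math import sqrt
from statistics import mean, variance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ssx * ssy)


# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Collect the same summary statistics for every dataset.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }
    s = summary[name]
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f}")
```

The four printed rows agree to about two decimal places, although the scatter plots of the four datasets look nothing alike. That is the reason to plot the data before trusting a summary table.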

  10. The Datasaurus Dozen
    AutoDesk Research: https://www.autodeskresearch.com/publications/samestats
    R-package: https://github.com/stephlocke/datasauRus

  11. Be sure to plot your data!

  12. Think twice before using it
    source: http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html

  13. 2. Correlation*
    ρ = -1
    ρ = +1
    * Pearson correlation

  14. Correlation describes the
    strength of the linear
    relationship between two
    variables.
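
The word "linear" in that definition matters. As a minimal sketch (added here for illustration, with made-up data and a hand-written `pearson` helper), the Pearson coefficient reaches +1 or -1 for exact straight-line relationships, but can be zero for a relationship that is perfectly deterministic yet not linear:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ssx * ssy)


xs = list(range(-5, 6))  # -5, -4, ..., 5 (symmetric around zero)

r_up = pearson(xs, [2 * x + 1 for x in xs])     # increasing straight line
r_down = pearson(xs, [-3 * x + 4 for x in xs])  # decreasing straight line
r_curve = pearson(xs, [x ** 2 for x in xs])     # exact parabola, not linear

print(f"y = 2x + 1  -> r = {r_up:+.2f}")
print(f"y = -3x + 4 -> r = {r_down:+.2f}")
print(f"y = x^2     -> r = {r_curve:+.2f}")
```

In the third case y is fully determined by x, yet the coefficient is zero because the relationship has no linear trend over this symmetric range.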

  15. What can we say about this chart?
    Credits: http://www2.stat.duke.edu/~mc301/ARTSCI101_Su16/post/slides/w2_d2_smoking_research.pdf

  16. ICE CREAM SALES --CAUSE--> SHARK ATTACKS?

  17. ICE CREAM SALES --CAUSE (crossed out)--> SHARK ATTACKS?
    SUMMER?

  18. umbrella => rain

  19. Where is the rain???

  20. Correlation doesn’t imply causation

  21. Causation vs Correlation
    • Causality indicates that one event is the
    result of the occurrence of the other event.
    • Correlation between two things can be
    caused by a third factor (confounder) that
    affects both of them.
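
The confounder idea can be illustrated with a small simulation. Everything below is synthetic and assumed for the sake of the example: temperature stands in for "summer", and the coefficients and noise levels are arbitrary. Ice cream sales and shark attacks each depend on temperature but not on each other. The last step uses the standard first-order partial correlation formula to ask what remains once temperature is held fixed.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ssx * ssy)


random.seed(42)  # fixed seed so the illustration is reproducible
n = 2000

# Synthetic confounder: daily temperature (arbitrary units and range).
temperature = [random.uniform(0, 30) for _ in range(n)]

# Both outcomes depend on temperature only; neither depends on the other.
ice_cream = [10 * t + random.gauss(0, 30) for t in temperature]
sharks = [0.3 * t + random.gauss(0, 2) for t in temperature]

r_is = pearson(ice_cream, sharks)
r_it = pearson(ice_cream, temperature)
r_st = pearson(sharks, temperature)

# Partial correlation of ice cream and sharks, controlling for temperature.
partial = (r_is - r_it * r_st) / sqrt((1 - r_it ** 2) * (1 - r_st ** 2))

print(f"raw correlation (ice cream, sharks):     {r_is:.2f}")
print(f"partial correlation given temperature:   {partial:.2f}")
```

The raw correlation is strong even though there is no causal link in the simulated data, and it largely disappears once the confounder is accounted for.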

  22. Is there any time where correlation
    implies causation?
    The gold standard for establishing cause and
    effect is a controlled trial (aka A/B test).

  23. 3. A/B Testing

  24. A/B Testing
    Online experiments are used to test a new
    design, a machine learning model, or any
    new feature.

  25. A/B Testing - Hypothesis Tests
    A hypothesis test is a way to decide whether
    the data strongly support one point of view
    or another.
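
One common way to run such a test on A/B conversion data is a two-proportion z-test. The talk does not say which test it has in mind, so treat the following as one possible sketch. The visitor and conversion counts are hypothetical, and the function name `two_proportion_z_test` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Null hypothesis: both groups share the same underlying rate.
    Returns the z statistic and the two-sided p-value.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control converts 200 of 5000, treatment 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)

alpha = 0.05  # significance level, chosen before looking at the data
reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

With these made-up numbers the p-value falls below the chosen significance level, so the data would count as evidence against "no difference". The normal approximation used here assumes reasonably large groups.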

  26. How do you set up an
    experiment?

  27. DEFINE THE GOAL
    AND
    FORM THE HYPOTHESIS

  28. DEFINE THE GOAL
    AND
    FORM THE HYPOTHESIS
    From stats:
    hypothesis tests
    significance level

  29. IDENTIFY THE CONTROL
    AND
    THE TREATMENT GROUP

  30. IDENTIFY KEY METRICS
    AND
    DESIRED IMPROVEMENT
    From stats:
    effect size

  31. DETERMINE THE FRACTION
    IN BOTH GROUPS

  32. RUN THE TEST FOR A
    CERTAIN AMOUNT OF TIME
    From stats:
    sample size
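
How long a test has to run depends on how many observations each group needs. A standard approximation for comparing two proportions ties the sample size to the significance level, the desired power, and the effect size. The sketch below assumes a two-sided test, equal group sizes, and hypothetical conversion rates. The function name and defaults are choices for this example, not something stated in the talk.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate observations needed per group to detect the given
    difference in proportions with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    effect = p_treatment - p_control               # effect size (absolute lift)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


# Hypothetical baseline rate of 4%, with two candidate improvements.
n_small_lift = sample_size_per_group(0.04, 0.05)  # detect 4% -> 5%
n_large_lift = sample_size_per_group(0.04, 0.06)  # detect 4% -> 6%

print(f"per-group sample size for 4% -> 5%: {n_small_lift}")
print(f"per-group sample size for 4% -> 6%: {n_large_lift}")
```

Halving the effect to be detected roughly quadruples the required sample, which is why the desired improvement needs to be fixed before deciding how long to run the experiment.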

  33. ANALYZE THE RESULTS

  34. 4. Statistical Models

  35. The response variable is the one whose values we
    are trying to model using other variables
    (the explanatory variables).
    In any given model:
    • response variable (Y)
    • explanatory variables (X1, ..., Xn)

  36. Examples of models
    Time Series
    Linear Regression
    Non-Linear Regression

  37. Use Case: Improve Sales of a
    product
    • Let’s say we were hired to provide advice on
    how to improve sales of a particular product.
    • Our goal is to develop a model that can be
    used to predict sales based on the advertising
    budgets for three media.
    Example extracted from the book "An Introduction to Statistical Learning with Applications in R"

  38. The data consists of the sales of the product in 200 different
    markets, along with advertising budgets for the product in each
    of those markets for three different media: TV, radio, and
    newspaper.

  39. output variable: sales (in thousands of units)
    input variables: advertising budgets (in thousands of dollars)
    The sales of a particular product are a function of advertising budgets.

  40. Suppose we are asked to suggest a marketing plan for
    next year that will result in high product sales.
    WHAT INFORMATION WOULD BE USEFUL TO
    PROVIDE?

  41. 1. Is there a relationship between
    advertising budget and sales?
    Our first goal should be to determine whether
    the data provide evidence of an association
    between advertising spend and sales.

  42. 2. How strong is the relationship
    between advertising budget and
    sales?

  43. 3. Which media contribute to sales?
    Do all three media contribute to sales,
    or do just one or two?

  44. 4. How accurately can we estimate the effect
    of each medium on sales?
    For every dollar spent on advertising in a
    particular medium, by what amount will sales
    increase?

  45. 5. How accurately can we predict future
    sales?
    For any given level of advertising, what is our prediction
    for sales, and what is the accuracy of this
    prediction?

  46. 6. Is the relationship linear?
    If the relationship between advertising spend in the various
    media and sales is approximately a straight line, then linear
    regression is an appropriate tool.
    If not, then it may still be possible to transform the predictor
    or the response so that linear regression can be used.

  47. We could answer all those questions by
    setting up a multiple linear regression:
    sales = β0 + β1·TV + β2·radio + β3·newspaper + ε
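
As a rough sketch of what fitting such a model involves, the code below estimates the intercept and the three media coefficients by ordinary least squares, solving the normal equations with the Python standard library only. The data are synthetic: 200 simulated markets with arbitrarily chosen "true" coefficients and noise. It does not use the book's Advertising dataset, and the numbers are not estimates from it. The helper names `solve` and `fit_ols` are made up for this example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bp] obtained from
    the normal equations (X'X) beta = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


random.seed(0)
# Arbitrary illustrative coefficients: intercept, TV, radio, newspaper.
TRUE_BETA = [3.0, 0.05, 0.2, 0.0]

# Synthetic budgets for 200 markets (ranges are assumptions).
markets = [(random.uniform(0, 300), random.uniform(0, 50),
            random.uniform(0, 100)) for _ in range(200)]
sales = [TRUE_BETA[0] + TRUE_BETA[1] * tv + TRUE_BETA[2] * radio
         + TRUE_BETA[3] * paper + random.gauss(0, 1.5)
         for tv, radio, paper in markets]

beta_hat = fit_ols(markets, sales)
for label, est in zip(["intercept", "TV", "radio", "newspaper"], beta_hat):
    print(f"{label:>9}: {est:.3f}")
```

Each fitted coefficient reads as the expected change in sales for a one-unit increase in that budget with the other two held fixed. In practice a statistics package would also report standard errors and p-values for each coefficient, which this sketch leaves out.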

  48. Why can’t we throw all these in a black box
    algorithm?

  49. INTERPRETABILITY

  50. 5. Probability

  51. • Naive Bayes
    • Logistic Regression
    • k-NN
    • Latent Dirichlet Allocation (LDA)
    • Decision Trees
    • Association Rules (ex: Basket Analysis)
    • …

  52. Whatever technique you choose,
    the most important skill is critical thinking.

  53. Worldwide organization that
    promotes gender diversity in
    the R community via meetups
    and mentorship in a friendly
    and safe environment

  54. Our mission
    More women/non-binary

    • coders

    • developers

    • speakers

    • leaders

    More gender minorities
    developing R packages and
    being part of the R community.

  55. http://bit.ly/rladiesgroups

  56. How can I
    start my own
    chapter?
    #rcatladies

  57. We’ll send
    everything
    you'll need!
    #rdogladies

  58. What do you get?
    1) Starter-Kit
    ▪ Tech Infrastructure

    ▪ Tips on how to organize events

    ▪ Code of Conduct (English/Spanish)

    2) @rladies.org email
    3) Organizer slack channel
    4) Shared training material
    5) Financial support to cover
    meetup registration/renewal
    fees!

  59. And there is more!
    YOU WILL:
    - be part of an incredible family

    - learn a lot (not only R)!

    - have unlimited support

    - meet other R-Ladies

    + MUCH MORE!

  60. COME JOIN US!

  61. Make sure to
    schedule your
    1:1 Session
    https://ibmcommunityday.bemyapp.com/#/mentors
    You can also reach me via:
    twitter: @gdequeiroz
    linkedin: http://bit.ly/linkedin-gdq

  62. THANK YOU!
    @RLadiesGlobal
    www.rladies.org
    @gdequeiroz
    www.k-roz.com

  63. Additional Resources
    • R-Ladies: www.rladies.org
    • Call for code: https://developer.ibm.com/callforcode/
    • Intro to Statistics with R - DataCamp
    • R4DS: http://r4ds.had.co.nz/
    • Think Stats - Probability and Statistics for Programmers:
    http://greenteapress.com/thinkstats/
    • Statistical Learning online class: https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning-self-paced

  64. Continue the conversation and join:
    https://community.ibm.com/datascience
