Statistics for Data Science: what you should know and why

Talk at IBM Community Day: Data Science (online), 2018-07-24

- https://ibmdatascienceday.bemyapp.com/talks

Gabriela de Queiroz

July 24, 2018

Transcript

  1. Statistics for Data Science: what you should know and why

    Gabriela de Queiroz
    Senior Developer Advocate @ IBM (http://codait.org), Founder of R-Ladies (http://rladies.org)
  2. Agenda • Know your data • Correlation (and Causation) • A/B test • Statistical Models • Probability • Bonus: R-Ladies
  3. Where are you coming from?

  4. Lonely Statistician Lonely Data Scientist

  5. TOP 5 STATISTICAL CONCEPTS

  6. 1. Know your data

  7. Some ways to know your data

  8. Summary Statistics

  9. Anscombe's quartet

  10. The Datasaurus Dozen AutoDesk Research: https://www.autodeskresearch.com/publications/samestats R-package: https://github.com/stephlocke/datasauRus

  11. Be sure to plot your data!
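
A minimal sketch of why summary statistics alone can mislead, using the published values of Anscombe's quartet. The helper names `pearson` and `summarize` are illustrative and not from the talk. All four datasets give nearly identical means, variances, and correlations, even though their scatter plots look completely different.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def summarize(xs, ys):
    """Collect the summary statistics that the quartet has in common."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows are almost indistinguishable, so the differences between the datasets only show up once the points are plotted.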

  12. Think twice before using it source: http://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html

  13. 2. Correlation* (ρ = -1, ρ = +1) * Pearson correlation
  14. Correlation describes the strength of the linear relationship between two variables.
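
As a rough illustration of the definition above, the sketch below computes the Pearson coefficient by hand on toy data (the numbers are made up for this example). A perfect increasing line gives ρ = +1, a perfect decreasing line gives ρ = -1, and a perfectly predictable but non-linear relationship (y = x² on symmetric x) gives ρ = 0, because the coefficient only measures linear association.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_up = pearson(xs, [2 * x + 1 for x in xs])      # perfect increasing line
r_down = pearson(xs, [-3 * x + 10 for x in xs])  # perfect decreasing line

# A deterministic but non-linear relationship on x values symmetric around 0.
xs_sym = [-2, -1, 0, 1, 2]
r_curve = pearson(xs_sym, [x ** 2 for x in xs_sym])

print("increasing line:", round(r_up, 3))
print("decreasing line:", round(r_down, 3))
print("parabola:", round(r_curve, 3))
```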
  15. What can we say about this chart? Credits: http://www2.stat.duke.edu/~mc301/ARTSCI101_Su16/post/slides/w2_d2_smoking_research.pdf

  16. ICE CREAM SALES SHARK ATTACKS? CAUSE

  17. ICE CREAM SALES SHARK ATTACKS? CAUSE X SUMMER?

  18. observer

  19. umbrella => rain

  20. Where is the rain???

  21. Correlation doesn’t imply causation

  22. Causation vs Correlation • Causality indicates that one event is the result of the occurrence of the other event. • Correlation between two things can be caused by a third factor (confounder) that affects both of them.
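
The confounder idea can be sketched with a small simulation. Everything here is an assumption made for illustration: daily temperature stands in for "summer", and the coefficients and noise levels are invented. Ice cream sales and shark attacks each depend on temperature but not on each other, yet they end up strongly correlated. The partial correlation, which controls for temperature, is close to zero.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: temperature drives both series independently.
temperature = [rng.uniform(0, 30) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_ice_shark = pearson(ice_cream, shark_attacks)
r_ice_temp = pearson(ice_cream, temperature)
r_shark_temp = pearson(shark_attacks, temperature)

# Partial correlation of ice cream and shark attacks, controlling for temperature.
partial = (r_ice_shark - r_ice_temp * r_shark_temp) / sqrt(
    (1 - r_ice_temp ** 2) * (1 - r_shark_temp ** 2)
)

print("raw correlation:", round(r_ice_shark, 2))
print("controlling for temperature:", round(partial, 2))
```

In real data the confounder is rarely known or measured this cleanly, which is why the talk points to controlled experiments next.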
  23. Is there ever a case where correlation implies causation? The gold standard for establishing cause and effect is a randomized controlled trial (aka A/B test).
  24. 3. A/B Testing

  25. A/B Testing: Online experiments are used to test a new design, a machine learning model, or any new feature.
  26. A/B Testing - Hypothesis Tests: A hypothesis test is a way to decide whether the data strongly support one point of view or another.
  27. How do you set up an experiment?

  28. DEFINE THE GOAL AND FORM THE HYPOTHESIS

  29. DEFINE THE GOAL AND FORM THE HYPOTHESIS From stats: hypothesis tests, significance level

  30. IDENTIFY THE CONTROL AND THE TREATMENT GROUP

  31. IDENTIFY KEY METRICS AND DESIRED IMPROVEMENT From stats: effect size

  32. DETERMINE THE FRACTION IN BOTH GROUPS

  33. RUN THE TEST FOR A CERTAIN AMOUNT OF TIME From stats: sample size
  34. ANALYZE THE RESULTS
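
The experiment steps above can be sketched for one common case, a conversion-rate test. This is only one possible analysis, not necessarily the one used in the talk. The function names, the 10% and 12% conversion rates, the 80% power target, and the visitor counts are all assumptions for illustration. The first function turns significance level, effect size, and power into a sample size per group. The second analyzes the result with a two-sided, two-proportion z-test.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_control -> p_treatment."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = STD_NORMAL.inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control  # desired improvement (effect size)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p_value


# Planning: hypothetical baseline of 10%, hoping to reach 12%.
n_needed = sample_size_per_group(0.10, 0.12)
print("users needed per group:", n_needed)

# Analysis: hypothetical outcome with 4000 users in each group.
alpha = 0.05
z, p_value = two_proportion_z_test(conv_a=400, n_a=4000, conv_b=480, n_b=4000)
print("z =", round(z, 2), "p-value =", round(p_value, 4))
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```

Fixing the sample size before the test starts, rather than stopping as soon as the p-value dips below the threshold, keeps the stated significance level meaningful.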

  35. 4. Statistical Models

  36. The response is the variable we are trying to model using other variables (the explanatory variables). In any given model: • response variable (Y) • explanatory variables (X1, ..., Xn)
  37. Examples of models: Time Series • Linear Regression • Non-Linear Regression

  38. Use Case: Improve Sales of a product • Let’s say we were hired to provide advice on how to improve sales of a particular product. • Our goal is to develop a model that can be used to predict sales based on three media budgets (TV, radio, and newspaper). Example extracted from the book "An Introduction to Statistical Learning with Applications in R"
  39. The data consists of the sales of the product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
  40. Output variable: sales (in thousands of units). Input variables: advertising budgets (in thousands of dollars). The sales of a particular product are modeled as a function of the advertising budgets.
  41. Suppose we are asked to suggest a marketing plan for next year that will result in high product sales. WHAT INFORMATION WOULD BE USEFUL TO PROVIDE?
  42. 1. Is there a relationship between advertising budget and sales? Our first goal should be to determine whether the data provide evidence of an association between advertising spend and sales.
  43. 2. How strong is the relationship between advertising budget and sales?
  44. 3. Which media contribute to sales? Do all three media contribute to sales, or do just one or two?
  45. 4. How accurately can we estimate the effect of each medium on sales? For every dollar spent on advertising in a particular medium, by what amount will sales increase?
  46. 5. How accurately can we predict future sales? For any given level of advertising, what is our prediction for sales, and what is the accuracy of this prediction?
  47. 6. Is the relationship linear? If the relationship between advertising spend in the various media and sales is approximately a straight line, then linear regression is an appropriate tool. If not, then it may still be possible to transform the predictor or the response so that linear regression can be used.
  48. We could answer all those questions by setting up a multiple linear regression: $\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper} + \epsilon$
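
A minimal sketch of fitting such a model by ordinary least squares, written in plain Python so it runs without extra packages. The data below is synthetic and is not the Advertising dataset from the book. The "true" coefficients in `TRUE_BETA`, the budget ranges, and the noise level are made-up assumptions, with the newspaper effect set to zero on purpose. The helper names are also illustrative.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, row):
    """Predicted response for one row of explanatory variables."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def r_squared(beta, rows, y):
    """Share of the variance in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(beta, row)) ** 2 for row, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(1)
TRUE_BETA = [3.0, 0.05, 0.2, 0.0]  # hypothetical: intercept, TV, radio, newspaper

# 200 synthetic markets with (TV, radio, newspaper) budgets.
markets = [(rng.uniform(0, 300), rng.uniform(0, 50), rng.uniform(0, 100))
           for _ in range(200)]
sales = [predict(TRUE_BETA, m) + rng.gauss(0, 1.0) for m in markets]

beta_hat = fit_ols(markets, sales)
r2 = r_squared(beta_hat, markets, sales)
for name, b in zip(["intercept", "TV", "radio", "newspaper"], beta_hat):
    print(name, round(b, 3))
print("R^2:", round(r2, 3))
```

Each fitted coefficient reads as the expected change in sales for a one-unit increase in that budget with the other budgets held fixed, and R² gives one answer to how strong the relationship is. In practice a library routine would also report standard errors and confidence intervals, which this sketch omits.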
  49. Why can’t we just throw all of this into a black-box algorithm?
  50. INTERPRETABILITY

  51. 5. Probability

  52. • Naive Bayes • Logistic Regression • k-NN • Latent Dirichlet Allocation (LDA) • Decision Trees • Association Rules (ex: Basket Analysis) • …
  53. No matter which technique you choose, the most important skill is critical thinking.
  54. Resources

  55. None
  56. Worldwide organization that promotes gender diversity in the R community via meetups and mentorship in a friendly and safe environment
  57. Our mission: More women/non-binary • coders • developers • speakers • leaders. More gender minorities developing R packages and being part of the R community.
  58. http://bit.ly/rladiesgroups

  59. How can I start my own chapter? #rcatladies

  60. Send an email to info@rladies.org

  61. We’ll send everything you'll need! #rdogladies

  62. What do you get? 1) Starter-Kit ▪ Tech Infrastructure ▪ Tips on how to organize events ▪ Code of Conduct (English/Spanish) 2) @rladies.org email 3) Organizer Slack channel 4) Shared training material 5) Financial support to cover meetup registration/renewal fees!
  63. And there is more! YOU WILL: - be part of an incredible family - learn a lot (not only R)! - have unlimited support - meet other R-Ladies + MUCH MORE!
  64. COME JOIN US!

  65. Make sure to schedule your 1:1 Session https://ibmcommunityday.bemyapp.com/#/mentors You can also reach me via: twitter: @gdequeiroz linkedin: http://bit.ly/linkedin-gdq
  66. THANK YOU! @RLadiesGlobal www.rladies.org @gdequeiroz www.k-roz.com

  67. Additional Resources • R-Ladies: www.rladies.org • Call for Code: https://developer.ibm.com/callforcode/ • Intro to Statistics with R - DataCamp • R4DS: http://r4ds.had.co.nz/ • Think Stats - Probability and Statistics for Programmers: http://greenteapress.com/thinkstats/ • Statistical Learning online class: https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning-self-paced
  68. Continue the conversation and join: https://community.ibm.com/datascience