Introduction to Statistics with Python

barrachri

April 16, 2016

Transcript

  1. INTRODUCTION TO STATISTICS WITH PYTHON @CHRISTIANBARRA - PYCON7

  2. MY NAME IS CHRISTIAN I’M STUDYING STATISTICS @UNIPD HELLO !!!

  3. THE STORY OF THIS TALK: 3 DAYS BEFORE THE CONFERENCE

    VALERIO: CHRISTIAN, WE HAVE A FREE SLOT AND WE NEED A TALK
    CHRISTIAN: I CAN’T IN 3 DAYS…
    VALERIO: YOU MUST.
  4. CONTENT
    1. What is STATISTICS?
    2. Variable types
    3. Univariate distribution
    4. Frequencies
    5. M^3 (Mean, Median, Mode)
    6. Variance and Standard Deviation
    7. Multivariate distribution
    8. Covariance and Correlation
  5. 1. WHAT IS STATISTICS ?

  6. “… THE BRANCH OF SCIENCE OR MATHEMATICS CONCERNED WITH THE ANALYSIS AND INTERPRETATION OF NUMERICAL DATA AND APPROPRIATE WAYS OF GATHERING SUCH DATA.” (Oxford English Dictionary)

  7. “STATISTICS IS THE SCIENCE OF LEARNING FROM DATA, AND OF MEASURING, CONTROLLING, AND COMMUNICATING UNCERTAINTY; AND IT THEREBY PROVIDES THE NAVIGATION ESSENTIAL FOR CONTROLLING THE COURSE OF SCIENTIFIC AND SOCIETAL ADVANCES” (American Statistical Association)

  8. “THE BEST THING ABOUT BEING A STATISTICIAN IS THAT YOU GET TO PLAY IN EVERYONE ELSE'S BACKYARD.” (John Tukey, Bell Labs, Princeton University)

  9. “THERE ARE THREE KINDS OF LIES: LIES, DAMNED LIES, AND STATISTICS.” (Mark Twain)
  10. 2. VARIABLES

  11. 4 KINDS OF VARIABLES
    • QUANTITATIVE VARIABLES
      • CONTINUOUS
      • DISCRETE
    • CATEGORICAL VARIABLES
      • ORDINAL
      • NOMINAL
  12. OUR RAW DATA

  13. GRADES AT UNIVERSITY, FROM 1 TO 30.

  14. QUANTITATIVE… AND DISCRETE

  15. IS THE DISTANCE BETWEEN 17 AND 18 THE SAME AS THE DISTANCE BETWEEN 27 AND 28?

  16. THE TYPE OF A VARIABLE IS SOMETIMES NOT STRICTLY DETERMINED BY THE VALUES IT TAKES
  17. ANOTHER TYPICAL ERROR…

  18. ON A SCALE FROM 1 TO 7, HOW MUCH DID YOU ENJOY THE CONFERENCE?
  19. AFTER THE SURVEY….

  20. ON AVERAGE PEOPLE ENJOYED THE CONFERENCE 4.5

  21. None
  22. DON’T ABUSE YOUR VARIABLES
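
Slides 18 to 20 average a 1-to-7 rating, which treats an ordinal scale as if it were quantitative. A minimal sketch of one way to keep the ordinal nature explicit in pandas follows; the answers are made up for illustration and are not data from the talk.

```python
import pandas as pd

# Hypothetical survey answers on a 1-7 scale (invented for this sketch).
answers = pd.Series([2, 4, 4, 5, 5, 5, 6, 7, 7])

# Declare the scale as an ordered categorical: the order is meaningful,
# the distance between two adjacent answers is not.
scale = pd.Categorical(answers, categories=list(range(1, 8)), ordered=True)

# Frequency of every category, listed in scale order (unused answers show 0).
frequencies = pd.Series(scale).value_counts().sort_index()

# Median and mode only rely on ordering and counting, so they are safer
# summaries for an ordinal variable than the arithmetic mean.
median_answer = answers.median()
mode_answer = answers.mode().iloc[0]

print(frequencies)
print("median:", median_answer, "mode:", mode_answer)
```

Reporting the frequency table together with the median avoids claiming that the gap between 4 and 5 means the same as the gap between 6 and 7.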

  23. 4. FREQUENCIES

  24. DIFFERENT TYPES OF FREQUENCY
    • ABSOLUTE FREQUENCY (ni): number of OBSERVATIONAL UNITS that take the i-th value of the VARIABLE
    • ABSOLUTE CUMULATIVE FREQUENCY (Ni): Ni = N(i-1) + ni
    • RELATIVE FREQUENCY (fi): ni divided by the total number of observations (N)
    • RELATIVE CUMULATIVE FREQUENCY (Fi): Fi = F(i-1) + fi
    • % FREQUENCY: fi * 100
    • % CUMULATIVE FREQUENCY: Fi * 100
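
The six frequencies defined on this slide can be sketched in a few lines of pandas. The stock values below are hypothetical, and the short column names are chosen only for this example.

```python
import pandas as pd

# Hypothetical raw data: the stock level observed for 8 products.
stock = pd.Series([1, 2, 2, 3, 3, 3, 4, 4])

# Absolute frequency: how many observations take each distinct value.
table = pd.DataFrame({"ni": stock.value_counts().sort_index()})

# Cumulative and relative versions follow directly from ni.
table["Ni"] = table["ni"].cumsum()
table["fi"] = table["ni"] / table["ni"].sum()
table["Fi"] = table["fi"].cumsum()
table["%"] = table["fi"] * 100
table["% cumulative"] = table["Fi"] * 100

print(table)
```

Sorting the index before calling `cumsum()` matters: cumulative frequencies only make sense when the values are in increasing order.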
  25. 3. UNIVARIATE DISTRIBUTION

  26. WE WORK WITH JUST 1 VARIABLE

  27. None
  28. 3 MAIN CONCEPTS
    • OBSERVATIONAL UNITS: entities whose characteristics we measure or observe (ALIAS ROWS)
    • VARIABLE: feature, characteristic of the OBSERVATIONAL UNITS (ALIAS COLUMNS)
    • FREQUENCY: number of OBSERVATIONAL UNITS with the same value of a VARIABLE
  29. import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

    univariate = pd.DataFrame(df["Product (X1)"].value_counts())
    univariate.columns = ["Absolute Frequency (ni)"]
    univariate
  30. None
  31. FREQUENCY TABLE

  32. 5. MEAN, MEDIAN AND MODE

  33. THERE ARE DIFFERENT TYPES OF MEAN

  34. ARITHMETIC MEAN (MOST USED)

  35. df.mean(numeric_only=True)
    Price (X3)     28.051205
    Margin (X5)    15.525602
    Stock (X6)     12.293333
    dtype: float64
  36. WHY IS THE MEAN SO IMPORTANT ?

  37. FOR THIS PROPERTY

  38. MODE: VALUE THAT APPEARS MOST OFTEN (HIGHEST FREQUENCY)

  39. None
  40. df["Product (X1)"].mode()
    0    Socks
    dtype: object

  41. MEDIAN: ALSO CALLED 50TH PERCENTILE

  42. THE PROPERTY OF THE MEDIAN
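
The formulas shown on slides 37 and 42 are not part of this transcript. Two standard properties that are usually stated at this point are that the arithmetic mean minimizes the sum of squared deviations and the median minimizes the sum of absolute deviations. Whether these are the properties the speaker had in mind is an assumption. The sketch below checks them numerically on made-up data.

```python
import numpy as np

# Hypothetical data with one large value to separate mean and median.
data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

# Candidate "centers" on a fine grid from 0 to 12 (step 0.01).
candidates = np.linspace(0.0, 12.0, 1201)

# For every candidate c, total squared and total absolute deviation.
deviations = data[:, None] - candidates[None, :]
squared_loss = (deviations ** 2).sum(axis=0)
absolute_loss = np.abs(deviations).sum(axis=0)

best_for_squares = candidates[squared_loss.argmin()]
best_for_absolute = candidates[absolute_loss.argmin()]

print("mean:", data.mean(), "minimizer of squared loss:", best_for_squares)
print("median:", np.median(data), "minimizer of absolute loss:", best_for_absolute)

# A related property of the mean: deviations from it sum to zero.
print("sum of deviations from the mean:", (data - data.mean()).sum())
```

The single large value pulls the mean well above the median, which is why the two summaries can tell different stories about the same variable.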

  43. None
  44. YOU NEED A VARIABLE THAT YOU CAN “ORDER”

  45. AND WE CAN’T ORDER PRODUCTS

  46. df.median(numeric_only=True)
    Price (X3)     22.652655
    Margin (X5)    12.826328
    Stock (X6)     12.000000
    dtype: float64
  47. None
  48. univariate_stocks = pd.DataFrame(df["Stock (X6)"].value_counts())
    univariate_stocks = univariate_stocks.sort_index()
    univariate_stocks.columns = ["Absolute Frequency (ni)"]
    univariate_stocks["Relative Frequency (fi)"] = (
        univariate_stocks["Absolute Frequency (ni)"]
        / univariate_stocks["Absolute Frequency (ni)"].sum()
    )
    univariate_stocks["Relative Cumulative Frequency (Fi)"] = univariate_stocks["Relative Frequency (fi)"].cumsum()
    univariate_stocks
  49. 6. VARIANCE AND STANDARD DEVIATION

  50. WE CALL THEM MEASURES OF DISPERSION

  51. MEAN AND VARIANCE ARE PROBABLY THE MOST IMPORTANT CONCEPTS IN

    STATISTICS
  52. AS MY PROFESSOR SAID… VARIANCE IS YOUR EMPLOYER

  53. HELLO BOSS !

  54. BUT IT IS A STRANGE CONCEPT… THE SQUARE OF SOMETHING

  55. STANDARD DEVIATION

  56. NOW WE HAVE A KIND OF DISTANCE

  57. THE DISTANCE, ON AVERAGE, FROM THE MEAN
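
As a rough illustration of slides 54 to 57, the variance is the average squared deviation from the mean, and its square root, the standard deviation, is back in the units of the data. The prices below are hypothetical. Note that pandas divides by n - 1 by default (`ddof=1`), while the plain average of squared deviations divides by n.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, chosen so the arithmetic is easy to follow.
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

mean = prices.mean()

# Population variance: mean of the squared deviations (divide by n).
population_variance = ((prices - mean) ** 2).mean()
population_std = np.sqrt(population_variance)

# pandas defaults to the sample versions (divide by n - 1).
sample_variance = prices.var()
sample_std = prices.std()

print("mean:", mean)
print("population variance / std:", population_variance, population_std)
print("sample variance / std:", sample_variance, sample_std)
```

Passing `ddof=0` to `var()` or `std()` reproduces the population versions.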

  58. YOU CAN USE STD ALSO TO SAY ROMANTIC THINGS TO

    YOUR PARTNER
  59. LIKE YOU ARE 3 STD FROM THE MEAN (NERDY WAY

    TO SAY YOU ARE UNIQUE)
  60. NORMAL DISTRIBUTION
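
To make the "3 STD FROM THE MEAN" remark concrete, here is a small sketch, assuming the variable is normally distributed, of how much of the distribution lies within k standard deviations of the mean. The helper names `z_score` and `share_within` are invented for this example.

```python
import math


def z_score(value: float, mean: float, std: float) -> float:
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / std


def share_within(k: float) -> float:
    """Probability that a normal variable falls within k std of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} std: {share_within(k):.4f}")

# Share of a normal population that is at least 3 std away from the mean.
print(f"beyond 3 std: {1 - share_within(3):.4f}")
```

Under the normality assumption only about 0.27% of observations are 3 or more standard deviations away from the mean, which is what makes the compliment work.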

  61. 7. BIVARIATE DISTRIBUTION

  62. WE WORK WITH 2 VARIABLES

  63. OUR VARIABLES

  64. BIVARIATE DISTRIBUTION

  65. NOW WE CAN CONSIDER A BIVARIATE LIKE A UNIVARIATE DISTRIBUTION

  66. CONDITIONED DISTRIBUTION

  67. STOCKS (X6) | X3 = 18.95…

  68. PRICES (X3) | X6 = 4

  69. BIVARIATE DISTRIBUTION

  70. FOR EACH CONDITIONED DISTRIBUTION WE CAN CALCULATE MEAN AND VARIANCE

  71. AT THE END WE HAVE 13 CONDITIONED MEANS/VARIANCES AND 2

    MARGINAL MEANS/VARIANCES
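
Conditioned and marginal means and variances can be sketched with a pandas `groupby`. The table below is a toy example, not the dataset used in the talk, so the number of conditioned distributions differs from the 13 mentioned on the slide.

```python
import pandas as pd

# Hypothetical bivariate data: one row per observational unit.
toy_bivariate = pd.DataFrame({
    "price": [10.0, 10.0, 20.0, 20.0, 20.0, 30.0, 30.0],
    "stock": [4, 6, 2, 4, 6, 1, 3],
})

# Marginal mean and variance: each variable on its own.
marginal = toy_bivariate.agg(["mean", "var"])

# Conditioned distribution of stock given price: one mean and one
# variance for every distinct price (pandas uses the n - 1 variance).
stock_given_price = toy_bivariate.groupby("price")["stock"].agg(["mean", "var"])

print(marginal)
print(stock_given_price)
```

Swapping the roles of the two columns gives the distributions of price conditioned on each stock level.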
  72. None
  73. 8. COVARIANCE AND CORRELATION

  74. COVARIANCE

  75. bivariate.mean()
    bivariate
    bivariate.cov()

  76. None
  77. None
  78. NOT SO USEFUL.

  79. CORRELATION

  80. IT’S A COEFFICIENT OF LINEAR CORRELATION. IT GOES FROM -1 TO 1.
  81. bivariate.corr()
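
A short sketch of what `cov()` and `corr()` compute, using a toy table rather than the talk's data. It also shows one reason the covariance alone is hard to interpret: it changes with the units of measurement, while the correlation does not.

```python
import pandas as pd

# Hypothetical paired observations.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Sample covariance: average product of deviations (divide by n - 1).
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
covariance = (dx * dy).sum() / (len(toy) - 1)

# Pearson correlation: covariance rescaled by both standard deviations.
correlation = covariance / (toy["x"].std() * toy["y"].std())

print("manual:", covariance, correlation)
print("pandas:", toy.cov().loc["x", "y"], toy.corr().loc["x", "y"])

# Changing the unit of x (for example euros to cents) multiplies the
# covariance by 100 but leaves the correlation unchanged.
rescaled = toy.assign(x=toy["x"] * 100)
print("rescaled:", rescaled.cov().loc["x", "y"], rescaled.corr().loc["x", "y"])
```

Because the correlation is unit-free and bounded between -1 and 1, it is the easier of the two numbers to compare across pairs of variables.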

  82. QUESTIONS ?