Introduction to Statistics with Python

barrachri

April 16, 2016

Transcript

  1. INTRODUCTION TO
    STATISTICS WITH PYTHON
    @CHRISTIANBARRA - PYCON7

  2. MY NAME IS
    CHRISTIAN
    I’M STUDYING
    STATISTICS
    @UNIPD
    HELLO !!!

  3. THE STORY OF THIS TALK:
    3 DAYS BEFORE THE CONFERENCE
    VALERIO: CHRISTIAN, WE HAVE A FREE
    SLOT AND WE NEED A TALK
    CHRISTIAN: I CAN’T IN 3 DAYS…
    VALERIO: YOU MUST.

  4. CONTENT
    1. What is STATISTICS ?
    2. Variable types
    3. Univariate distribution
    4. Frequencies
    5. M^3 (Mean, Median, Mode)
    6. Variance and Standard Deviation
    7. Multivariate distribution
    8. Covariance and Correlation

  5. 1. WHAT IS
    STATISTICS ?

  6. — Oxford English Dictionary
    …. THE BRANCH OF SCIENCE OR
    MATHEMATICS CONCERNED WITH THE
    ANALYSIS AND INTERPRETATION OF
    NUMERICAL DATA AND APPROPRIATE WAYS
    OF GATHERING SUCH DATA.


  7. — American Statistical Association
    STATISTICS IS THE SCIENCE OF LEARNING FROM DATA,
    AND OF MEASURING, CONTROLLING, AND
    COMMUNICATING UNCERTAINTY; AND IT THEREBY
    PROVIDES THE NAVIGATION ESSENTIAL FOR
    CONTROLLING THE COURSE OF SCIENTIFIC AND
    SOCIETAL ADVANCES


  8. — John Tukey, Bell Labs, Princeton University
    THE BEST THING ABOUT BEING A
    STATISTICIAN IS THAT YOU GET TO
    PLAY IN EVERYONE ELSE'S
    BACKYARD.


  9. — Mark Twain
    THERE ARE THREE KINDS OF
    LIES: LIES, DAMNED LIES,
    AND STATISTICS.


  10. 2. VARIABLES

  11. 4 KINDS OF VARIABLES
    • QUANTITATIVE VARIABLES
      • CONTINUOUS
      • DISCRETE
    • CATEGORICAL VARIABLES
      • ORDINAL
      • NOMINAL
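
As a minimal sketch of how the four kinds of variables can be represented in pandas: the column names and values below are invented for illustration and are not taken from the talk's dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "height_cm": [170.5, 182.0, 165.2],   # quantitative, continuous
    "siblings": [0, 2, 1],                # quantitative, discrete
    # categorical, ordinal: the categories have a meaningful order
    "shirt_size": pd.Categorical(
        ["M", "L", "S"], categories=["S", "M", "L"], ordered=True
    ),
    # categorical, nominal: labels without any order
    "city": pd.Categorical(["Padova", "Firenze", "Padova"]),
})

# Ordered categories support min/max and comparisons; nominal ones do not.
smallest = toy["shirt_size"].min()
size_is_ordered = toy["shirt_size"].cat.ordered
city_is_ordered = toy["city"].cat.ordered

print(toy.dtypes)
print(smallest, size_is_ordered, city_is_ordered)
```

Declaring the order explicitly is a design choice: it lets pandas reject operations that only make sense for ordered or numeric data.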

  12. OUR RAW DATA

  13. GRADES AT UNIVERSITY
    FROM 1 TO 30.

  14. QUANTITATIVE…
    AND DISCRETE

  15. IS THE DISTANCE BETWEEN
    17 AND 18 THE SAME AS
    BETWEEN 27 AND 28 ?

  16. THE TYPE OF A VARIABLE
    IS SOMETIMES NOT STRICTLY
    DETERMINED BY THE VALUES
    IT TAKES

  17. ANOTHER
    TYPICAL ERROR…

  18. FROM 1 TO 7
    HOW MUCH DO YOU
    ENJOY THE CONFERENCE ?

  19. AFTER THE
    SURVEY….

  20. ON AVERAGE PEOPLE
    ENJOYED THE CONFERENCE
    4.5

  22. DON’T ABUSE
    YOUR VARIABLES
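
The survey example can be sketched in code. The answers below are hypothetical. The point is that a 1-to-7 enjoyment score is ordinal, so the arithmetic mean treats the labels as equally spaced measurements, while counts, the mode and the median only rely on order.

```python
import pandas as pd

# Hypothetical answers to "from 1 to 7, how much do you enjoy the conference?"
answers = pd.Series([1, 1, 2, 4, 6, 7, 7, 7, 7])

# Treating the labels as numbers gives a value that nobody actually chose.
mean_score = answers.mean()

# Summaries that only use counts and order.
counts = answers.value_counts().sort_index()   # frequency of each answer
mode_score = counts.idxmax()                   # most frequent answer
median_score = answers.median()                # middle answer when sorted

print(counts)
print(round(mean_score, 2), mode_score, median_score)
```

Here the mean falls between 4 and 5 even though most respondents answered either very low or very high.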

  23. 4. FREQUENCIES

  24. DIFFERENT TYPES OF FREQUENCY
    • ABSOLUTE FREQUENCY (n_i): number of “OBSERVATIONAL UNITS” that
    take the i-th value of the variable
    • ABSOLUTE CUMULATIVE FREQUENCY (N_i): N_i = N_(i-1) + n_i
    • RELATIVE FREQUENCY (f_i): absolute frequency n_i divided by the
    total number of observations (N), that is f_i = n_i / N
    • RELATIVE CUMULATIVE FREQUENCY (F_i): F_i = F_(i-1) + f_i
    • % FREQUENCY: f_i * 100
    • % CUMULATIVE FREQUENCY: F_i * 100
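
A minimal sketch of the six frequencies listed above, built with pandas on a made-up product column (the values and the short column names `ni`, `Ni`, `fi`, `Fi` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical product column, invented for illustration.
products = pd.Series(["Socks", "Shirt", "Socks", "Hat", "Socks", "Shirt"])

# Absolute frequency: how many observational units take each value.
freq = products.value_counts().sort_index().to_frame("ni")
freq["Ni"] = freq["ni"].cumsum()             # absolute cumulative frequency
freq["fi"] = freq["ni"] / freq["ni"].sum()   # relative frequency
freq["Fi"] = freq["fi"].cumsum()             # relative cumulative frequency
freq["% fi"] = freq["fi"] * 100              # % frequency
freq["% Fi"] = freq["Fi"] * 100              # % cumulative frequency

print(freq)
```

The last row of `Ni` equals the number of observations and the last row of `Fi` equals 1, which is a quick sanity check on any frequency table.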

  25. 3. UNIVARIATE
    DISTRIBUTION

  26. WE WORK WITH
    JUST 1 VARIABLE

  28. 3 MAIN CONCEPTS
    • OBSERVATIONAL UNITS: entities whose characteristics we
    measure or observe (A.K.A. ROWS)
    • VARIABLE: feature, characteristic of the OBSERVATIONAL UNITS
    (A.K.A. COLUMNS)
    • FREQUENCY: Number of OBSERVATIONAL UNITS with the same
    value of a VARIABLE

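  29. import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    # Jupyter/IPython magic: render plots inline (remove in a plain script)
    %matplotlib inline
    # df is the DataFrame used throughout the deck; its loading code is not shown
    # Count how many rows carry each product name
    univariate = pd.DataFrame(df["Product (X1)"].value_counts())
    univariate.columns = ["Absolute Frequency (ni)"]
    univariate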

  31. FREQUENCY TABLE

  32. 5. MEAN, MEDIAN
    AND MODE

  33. THERE ARE
    DIFFERENT TYPES
    OF MEAN

  34. ARITHMETIC MEAN (MOST USED)
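
The formula shown on this slide is not part of the transcript. For reference, the standard definition of the arithmetic mean is given below, first on the raw observations and then on a frequency table. The symbols are assumptions chosen to match the frequency slide: $N$ observations $x_1, \dots, x_N$, and $k$ distinct values $v_1, \dots, v_k$ with absolute frequencies $n_i$ and relative frequencies $f_i = n_i / N$.

```latex
\bar{x}
  = \frac{1}{N} \sum_{j=1}^{N} x_j
  = \frac{1}{N} \sum_{i=1}^{k} v_i \, n_i
  = \sum_{i=1}^{k} v_i \, f_i
```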

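  35. # numeric_only=True skips text columns such as "Product (X1)",
    # which recent pandas versions no longer drop automatically
    df.mean(numeric_only=True)
    Price (X3)     28.051205
    Margin (X5)    15.525602
    Stock (X6)     12.293333
    dtype: float64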

  36. WHY IS THE
    MEAN SO
    IMPORTANT ?

  37. FOR THIS PROPERTY
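
The property shown on this slide did not survive in the transcript, so the following is only an assumption about what it may have been. Two standard properties of the arithmetic mean are checked on toy numbers: the deviations from the mean sum to zero, and the mean is the value that minimises the sum of squared deviations.

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 10.0])   # toy values, invented for illustration
mean = x.mean()

# Property 1: deviations from the mean cancel out.
sum_of_deviations = np.sum(x - mean)


def sum_of_squares(c):
    """Sum of squared deviations of x from a candidate centre c."""
    return np.sum((x - c) ** 2)


# Property 2: among many candidate centres, the mean gives the smallest
# sum of squared deviations (checked numerically on a fine grid).
candidates = np.linspace(x.min(), x.max(), 801)
best = candidates[np.argmin([sum_of_squares(c) for c in candidates])]

print(mean, sum_of_deviations, best)
```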

  38. MODE:
    VALUE THAT
    APPEARS MOST OFTEN
    (HIGHEST FREQUENCY)
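
A short sketch of the mode on made-up data. `Series.mode()` returns a Series rather than a single value because several values can share the highest frequency.

```python
import pandas as pd

# Hypothetical product names; "Socks" and "Shirt" are tied on purpose.
products = pd.Series(["Socks", "Shirt", "Socks", "Hat", "Shirt"])

counts = products.value_counts()   # absolute frequency of each product
modes = products.mode()            # every value with the highest frequency

print(counts)
print(list(modes))
```

With a single winner the result has one row, labelled `0`.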

  40. df["Product (X1)"].mode()
    0    Socks
    dtype: object

  41. MEDIAN:
    ALSO CALLED
    50TH PERCENTILE
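
To illustrate the equivalence, the sketch below computes the median and the 50th percentile of a made-up stock column with NumPy and shows how a single extreme value moves the mean but not the median.

```python
import numpy as np

stock = np.array([4, 8, 12, 12, 15, 20, 30])   # toy values

median = np.median(stock)
p50 = np.percentile(stock, 50)   # 50th percentile, same value as the median

# Replace the largest value with an extreme one: the mean moves, the median does not.
with_outlier = np.array([4, 8, 12, 12, 15, 20, 3000])
median_outlier = np.median(with_outlier)
mean_shift = with_outlier.mean() - stock.mean()

print(median, p50, median_outlier, mean_shift)
```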

  42. THE PROPERTY OF THE MEDIAN

  44. YOU NEED A
    VARIABLE THAT
    YOU CAN “ORDER”

  45. AND WE CAN’T
    ORDER
    PRODUCTS

  46. # numeric_only=True skips text columns such as "Product (X1)"
    df.median(numeric_only=True)
    Price (X3)     22.652655
    Margin (X5)    12.826328
    Stock (X6)     12.000000
    dtype: float64

  48. # Frequency table for the stock variable, ordered by stock value
    univariate_stocks = pd.DataFrame(df["Stock (X6)"].value_counts())
    univariate_stocks = univariate_stocks.sort_index()
    univariate_stocks.columns = ["Absolute Frequency (ni)"]
    # Parentheses let each assignment span several lines
    univariate_stocks["Relative Frequency (fi)"] = (
        univariate_stocks["Absolute Frequency (ni)"]
        / univariate_stocks["Absolute Frequency (ni)"].sum()
    )
    univariate_stocks["Relative Cumulative Frequency (Fi)"] = (
        univariate_stocks["Relative Frequency (fi)"].cumsum()
    )
    univariate_stocks
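
One use of the relative cumulative frequency is reading off the median: it is the first value whose cumulative frequency reaches 0.5. The sketch below uses made-up stock values. When the cumulative frequency hits exactly 0.5, conventions differ (pandas averages the two middle values), so treat this as an approximation in that case.

```python
import pandas as pd

stock = pd.Series([4, 8, 12, 12, 15, 20, 30])   # toy values

fi = stock.value_counts(normalize=True).sort_index()   # relative frequency
Fi = fi.cumsum()                                       # relative cumulative frequency

# First stock value at which at least half of the observations are covered.
median_from_Fi = Fi[Fi >= 0.5].index[0]

print(Fi)
print(median_from_Fi, stock.median())
```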

  49. 6. VARIANCE AND
    STANDARD DEVIATION

  50. WE CALL THEM
    MEASURES OF
    DISPERSION

  51. MEAN AND VARIANCE ARE
    PROBABLY THE MOST IMPORTANT
    CONCEPTS IN STATISTICS

  52. AS MY PROFESSOR SAID…
    VARIANCE IS YOUR
    EMPLOYER

  53. HELLO BOSS !

  54. BUT IT IS A STRANGE
    CONCEPT…
    THE SQUARE OF SOMETHING

  55. STANDARD DEVIATION

  56. NOW WE HAVE A
    KIND OF DISTANCE

  57. THE DISTANCE,
    ON AVERAGE,
    FROM THE MEAN
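
A minimal sketch of both measures on toy numbers. The variance is the mean of the squared deviations from the mean, so its unit is squared. Taking the square root gives the standard deviation, which is back in the unit of the data. Note one detail that often causes confusion: pandas divides by N - 1 by default (`ddof=1`), while NumPy divides by N (`ddof=0`).

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # toy values, mean 5

# Population variance and standard deviation, computed by hand.
pop_var = ((x - x.mean()) ** 2).mean()
pop_std = np.sqrt(pop_var)

# Library defaults differ in the divisor.
pandas_var = x.var()                # divides by N - 1 (sample variance)
numpy_var = np.var(x.to_numpy())    # divides by N (population variance)

print(pop_var, pop_std, pandas_var, numpy_var)
```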

  58. YOU CAN ALSO USE STD
    TO SAY ROMANTIC
    THINGS TO YOUR PARTNER

  59. LIKE “YOU ARE 3 STD
    FROM THE MEAN”
    (A NERDY WAY TO SAY
    YOU ARE UNIQUE)

  60. NORMAL DISTRIBUTION
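
For a normally distributed variable, the share of values within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below evaluates it with the standard library only (the helper name `share_within` is made up for this example) and shows why being 3 standard deviations from the mean is rare.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Share of values at least 3 standard deviations away from the mean.
share_beyond_3 = 1 - share_within(3)
```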

  61. 7. BIVARIATE
    DISTRIBUTION

  62. WE WORK WITH
    2 VARIABLES

  63. OUR VARIABLES

  64. BIVARIATE DISTRIBUTION

  65. NOW WE CAN TREAT
    A BIVARIATE DISTRIBUTION LIKE
    A UNIVARIATE DISTRIBUTION

  66. CONDITIONAL
    DISTRIBUTION

  67. STOCKS (X6) |
    X3 = 18.95…

  68. PRICES (X3) | X6
    = 4
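
The two slides above fix the value of one variable and look at the distribution of the other. A sketch with `pd.crosstab` on invented data follows. The column names follow the deck, but the values are hypothetical.

```python
import pandas as pd

# Toy price/stock pairs, invented for illustration.
toy = pd.DataFrame({
    "Price (X3)": [18.95, 18.95, 25.00, 25.00, 18.95, 30.00],
    "Stock (X6)": [4, 12, 4, 4, 12, 20],
})

# Bivariate (joint) table of absolute frequencies: rows are prices, columns are stocks.
joint = pd.crosstab(toy["Price (X3)"], toy["Stock (X6)"])

# Distribution of the stock given Price = 18.95: one row, rescaled to sum to 1.
stock_given_price = joint.loc[18.95] / joint.loc[18.95].sum()

# Distribution of the price given Stock = 4: one column, rescaled to sum to 1.
price_given_stock = joint[4] / joint[4].sum()

print(joint)
print(stock_given_price)
print(price_given_stock)
```

Each row or column of the joint table, once rescaled, is an ordinary univariate distribution.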

  69. BIVARIATE DISTRIBUTION

  70. FOR EACH CONDITIONAL
    DISTRIBUTION WE CAN
    CALCULATE
    MEAN AND VARIANCE

  71. AT THE END WE HAVE
    13 CONDITIONAL MEANS/VARIANCES
    AND 2 MARGINAL MEANS/VARIANCES
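
A sketch of how such means and variances could be computed with `groupby`, again on invented data (so the number of groups differs from the 13 in the talk). It also checks a standard fact: the marginal mean is the average of the group means, weighted by group size.

```python
import pandas as pd

# Toy price/stock pairs, invented for illustration.
toy = pd.DataFrame({
    "Price (X3)": [18.95, 18.95, 25.00, 25.00, 18.95, 30.00],
    "Stock (X6)": [4, 12, 4, 4, 12, 20],
})

# One mean and one variance of the price for every stock level.
groups = toy.groupby("Stock (X6)")["Price (X3)"]
cond_mean = groups.mean()
cond_var = groups.var(ddof=0)   # population variance inside each group

# Marginal mean of the price, ignoring the stock.
marginal_mean = toy["Price (X3)"].mean()

# Weighted average of the group means reproduces the marginal mean.
weights = groups.size() / len(toy)
recombined_mean = (cond_mean * weights).sum()

print(cond_mean)
print(cond_var)
print(marginal_mean, recombined_mean)
```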

  73. 8. COVARIANCE
    AND CORRELATION

  74. COVARIANCE

  75. bivariate.mean()
    bivariate
    bivariate.cov()
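
The `bivariate` DataFrame itself is not included in the transcript, so the sketch below uses invented numbers. It computes the sample covariance by hand, compares it with `DataFrame.cov()`, and shows that the value depends on the units of measurement, which makes its size hard to interpret.

```python
import pandas as pd

# Toy data, invented for illustration: higher prices go with lower stock.
toy = pd.DataFrame({
    "Price (X3)": [10.0, 20.0, 30.0, 40.0],
    "Stock (X6)": [8.0, 6.0, 5.0, 1.0],
})
x, y = toy["Price (X3)"], toy["Stock (X6)"]

# Sample covariance: average product of deviations, divided by N - 1.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(toy) - 1)
cov_pandas = toy.cov().loc["Price (X3)", "Stock (X6)"]

# Expressing the price in cents multiplies the covariance by 100.
cov_in_cents = (x * 100).cov(y)

print(cov_manual, cov_pandas, cov_in_cents)
```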

  78. NOT SO USEFUL.

  79. CORRELATION

  80. IT’S A COEFFICIENT OF
    LINEAR CORRELATION.
    IT GOES FROM -1 TO 1.

  81. bivariate.corr()
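
As a sketch on invented numbers, the Pearson correlation that `corr()` computes by default is the covariance divided by the product of the two standard deviations. That rescaling keeps it between -1 and 1 and makes it independent of the units.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Price (X3)": [10.0, 20.0, 30.0, 40.0],
    "Stock (X6)": [8.0, 6.0, 5.0, 1.0],
})
x, y = toy["Price (X3)"], toy["Stock (X6)"]

# Correlation = covariance / (std of x * std of y).
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = toy.corr().loc["Price (X3)", "Stock (X6)"]

# Changing the unit of the price (euros to cents) leaves the correlation unchanged.
r_in_cents = (x * 100).corr(y)

print(r_manual, r_pandas, r_in_cents)
```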

  82. QUESTIONS ?
