barrachri
April 16, 2016

# Introduction to Statistics with Python

## Transcript

1. INTRODUCTION TO
STATISTICS WITH PYTHON
@CHRISTIANBARRA - PYCON7

2. MY NAME IS
CHRISTIAN
I’M STUDYING
STATISTICS
@UNIPD
HELLO !!!

3. THE STORY OF THIS TALK:
3 DAYS BEFORE THE CONFERENCE
VALERIO: CHRISTIAN, WE HAVE A FREE
SLOT AND WE NEED A TALK
CHRISTIAN: I CAN’T IN 3 DAYS…
VALERIO: YOU MUST.

4. CONTENT
1. What is STATISTICS ?
2. Variable types
3. Univariate distribution
4. Frequencies
5. M^3 (Mean, Median, Mode)
6. Variance and Standard
Deviation
7. Multivariate distribution
8. Covariance and Correlation

5. 1. WHAT IS
STATISTICS ?

6. — Oxford English Dictionary
…. THE BRANCH OF SCIENCE OR
MATHEMATICS CONCERNED WITH THE
ANALYSIS AND INTERPRETATION OF
NUMERICAL DATA AND APPROPRIATE WAYS
OF GATHERING SUCH DATA.

7. — American Statistical Association
STATISTICS IS THE SCIENCE OF LEARNING FROM DATA,
AND OF MEASURING, CONTROLLING, AND
COMMUNICATING UNCERTAINTY; AND IT THEREBY
PROVIDES THE NAVIGATION ESSENTIAL FOR
CONTROLLING THE COURSE OF SCIENTIFIC AND
SOCIETAL ADVANCES.

8. — John Tukey, Bell Labs, Princeton University
THE BEST THING ABOUT BEING A
STATISTICIAN IS THAT YOU GET TO
PLAY IN EVERYONE ELSE'S
BACKYARD.

9. — Mark Twain
THERE ARE THREE KINDS OF
LIES: LIES, DAMNED LIES,
AND STATISTICS.

10. 2. VARIABLES

11. 4 KINDS OF VARIABLES
• QUANTITATIVE VARIABLES
• CONTINUOUS
• DISCRETE
• CATEGORICAL VARIABLES
• ORDINAL
• NOMINAL
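The four kinds of variables can be told apart in pandas through column dtypes. The sketch below is only an illustration: the `toy` table, its column names, and its values are made up and are not the talk's dataset.

```python
import pandas as pd

# Hypothetical toy data, one column per kind of variable
toy = pd.DataFrame({
    "price": [19.9, 5.5, 42.0],   # quantitative, continuous
    "stock": [12, 4, 30],         # quantitative, discrete
    # categorical, ordinal: the categories have a meaningful order
    "size": pd.Categorical(["M", "S", "L"],
                           categories=["S", "M", "L"], ordered=True),
    # categorical, nominal: labels only, no order
    "product": pd.Categorical(["Socks", "Hat", "Scarf"]),
})

print(toy.dtypes)

# Ordered categories support comparisons and min/max
largest_size = toy["size"].max()
larger_than_s = (toy["size"] > "S").tolist()

# Nominal categories carry no order at all
product_is_ordered = toy["product"].cat.ordered

print(largest_size, larger_than_s, product_is_ordered)
```

Declaring the order explicitly keeps pandas from sorting sizes alphabetically, which would put "L" before "M".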

12. OUR RAW DATA

13. FROM 1 TO 30.

14. QUANTITATIVE…
AND DISCRETE

15. IS THE DISTANCE BETWEEN
17 AND 18 THE SAME AS
BETWEEN 27 AND 28 ?

16. THE TYPE OF A VARIABLE
IS SOMETIMES NOT STRICTLY
RELATED TO THE VALUES
IT TAKES

17. ANOTHER
TYPICAL ERROR…

18. FROM 1 TO 7
HOW MUCH DO YOU
ENJOY THE CONFERENCE ?

19. AFTER THE
SURVEY….

20. ON AVERAGE PEOPLE
ENJOYED THE CONFERENCE
4.5
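One reading of these slides is that a 1 to 7 rating is ordinal, so its arithmetic mean assumes equal spacing between the labels. The sketch below uses made-up answers (not the conference survey) that happen to average 4.5, and compares that mean with two summaries that rely only on order or counts.

```python
import statistics
from collections import Counter

# Made-up answers on a 1-7 scale, for illustration only
answers = [1, 1, 1, 2, 5, 7, 7, 7, 7, 7]

# The mean treats the labels as equally spaced numbers
mean_score = statistics.mean(answers)

# median_low only needs the order and always returns an actual answer
median_score = statistics.median_low(answers)

# The mode only needs counts
mode_score, mode_count = Counter(answers).most_common(1)[0]

print(mean_score, median_score, mode_score, mode_count)
```

In this toy sample nobody answered 4, and half of the respondents answered 7, so the mean of 4.5 describes no one in particular.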

21. DON’T RAPE

22. 4. FREQUENCIES

23. DIFFERENT TYPES OF FREQUENCY
• ABSOLUTE FREQUENCY (ni): number of OBSERVATIONAL UNITS that take
the i-th value of the VARIABLE
• ABSOLUTE CUMULATIVE FREQUENCY (Ni): N_i = N_{i-1} + n_i
• RELATIVE FREQUENCY (fi): absolute frequency divided by the total number of
observations (N): f_i = n_i / N
• RELATIVE CUMULATIVE FREQUENCY (Fi): F_i = F_{i-1} + f_i
• % FREQUENCY: f_i * 100
• % CUMULATIVE FREQUENCY: F_i * 100
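The six frequencies listed above can be built column by column with pandas. This is a minimal sketch on a made-up `stock` series; the short column names (`ni`, `Ni`, `fi`, `Fi`) are chosen here for brevity.

```python
import pandas as pd

# Made-up stock levels, for illustration only
stock = pd.Series([4, 12, 12, 7, 4, 12, 7, 12])

# Absolute frequency: how many observations take each value
table = pd.DataFrame({"ni": stock.value_counts().sort_index()})

# Absolute cumulative frequency: running total of ni
table["Ni"] = table["ni"].cumsum()

# Relative frequency: ni divided by the total number of observations N
table["fi"] = table["ni"] / table["ni"].sum()

# Relative cumulative frequency: running total of fi
table["Fi"] = table["fi"].cumsum()

# Percentage versions
table["% fi"] = table["fi"] * 100
table["% Fi"] = table["Fi"] * 100

print(table)
```

Sorting by index before taking cumulative sums matters: `value_counts()` orders rows by frequency by default, which would make the cumulative columns meaningless.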

24. 3. UNIVARIATE
DISTRIBUTION

25. WE WORK WITH
JUST 1 VARIABLE

26. 3 MAIN CONCEPTS
• OBSERVATIONAL UNITS: entities whose characteristics we
measure or observe (ALIAS ROWS)
• VARIABLE: feature, characteristic of the OBSERVATIONAL UNITS
(ALIAS COLUMNS)
• FREQUENCY: Number of OBSERVATIONAL UNITS with the same
value of a VARIABLE

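27. import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# IPython/Jupyter magic: draw plots inside the notebook
%matplotlib inline
# df holds the raw data (slide 12), one row per observational unit
# Count how many rows share each product
univariate = pd.DataFrame(df["Product (X1)"].value_counts())
univariate.columns = ["Absolute Frequency (ni)"]
univariate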

28. FREQUENCY TABLE

29. 5. MEAN, MEDIAN
AND MODE

30. THERE ARE
DIFFERENT TYPES
OF MEAN

31. ARITHMETIC MEAN (MOST USED)
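The transcript does not say which other means the slide had in mind. As an assumption, the sketch below compares the three classical ones available in Python's standard `statistics` module, on made-up values.

```python
import statistics

values = [2.0, 4.0, 8.0]  # made-up values, for illustration only

# Arithmetic mean: sum of the values divided by their count
arithmetic = statistics.mean(values)

# Geometric mean: n-th root of the product (suits growth rates)
geometric = statistics.geometric_mean(values)

# Harmonic mean: count divided by the sum of reciprocals (suits rates)
harmonic = statistics.harmonic_mean(values)

print(arithmetic, geometric, harmonic)
```

For positive values the three always satisfy harmonic ≤ geometric ≤ arithmetic, with equality only when all values are identical.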

32. df.mean(numeric_only=True)
Price (X3) 28.051205
Margin (X5) 15.525602
Stock (X6) 12.293333
dtype: float64

33. WHY IS THE
MEAN SO
IMPORTANT ?

34. FOR THIS PROPERTY

35. MODE:
VALUE THAT
APPEARS MOST OFTEN
(HIGHEST FREQUENCY)

36. df["Product (X1)"].mode()
0 Socks
dtype: object

37. MEDIAN:
ALSO CALLED
50TH PERCENTILE

38. THE PROPERTY OF THE MEDIAN
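The formulas behind the "property" slides are not in the transcript. Assuming they are the usual textbook ones, the mean is the value that minimizes the sum of squared deviations (and its deviations sum to zero), while the median minimizes the sum of absolute deviations. The sketch below checks both numerically on made-up data by brute force over an integer grid.

```python
import statistics

data = [1, 2, 3, 4, 100]  # made-up values with one outlier

def sum_squared(center):
    """Sum of squared deviations of the data from `center`."""
    return sum((x - center) ** 2 for x in data)

def sum_absolute(center):
    """Sum of absolute deviations of the data from `center`."""
    return sum(abs(x - center) for x in data)

mean = statistics.mean(data)
median = statistics.median(data)

# Brute-force search for the best center among the integers 0..100
candidates = range(0, 101)
best_squared = min(candidates, key=sum_squared)
best_absolute = min(candidates, key=sum_absolute)

# Deviations from the mean cancel out exactly
deviations_sum = sum(x - mean for x in data)

print(mean, best_squared, median, best_absolute, deviations_sum)
```

The outlier pulls the mean far from most of the data, while the median stays at the middle observation.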

39. YOU NEED A
VARIABLE THAT
YOU CAN “ORDER”

40. AND WE CAN’T
ORDER
PRODUCTS

41. df.median(numeric_only=True)
Price (X3) 22.652655
Margin (X5) 12.826328
Stock (X6) 12.000000
dtype: float64

42. univariate_stocks = pd.DataFrame(df["Stock (X6)"].value_counts())
# Sort by stock level so that cumulative sums follow the variable's order
univariate_stocks = univariate_stocks.sort_index()
univariate_stocks.columns = ["Absolute Frequency (ni)"]
# fi = ni / N
univariate_stocks["Relative Frequency (fi)"] = (
    univariate_stocks["Absolute Frequency (ni)"]
    / univariate_stocks["Absolute Frequency (ni)"].sum()
)
# Fi = running total of fi
univariate_stocks["Relative Cumulative Frequency (Fi)"] = (
    univariate_stocks["Relative Frequency (fi)"].cumsum()
)
univariate_stocks
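A table like this lets the median be read off directly: it is the first value whose relative cumulative frequency reaches 0.5. The sketch below shows this on a made-up `stock` series rather than the talk's data.

```python
import pandas as pd

# Made-up stock levels (odd count, so the median is an observed value)
stock = pd.Series([4, 12, 12, 7, 4, 12, 7, 12, 20])

# Absolute frequencies, ordered by stock level
freq = stock.value_counts().sort_index()

# Relative cumulative frequency Fi
Fi = (freq / freq.sum()).cumsum()

# First stock level at which at least half of the observations are covered
median_from_table = Fi[Fi >= 0.5].index[0]

print(Fi)
print(median_from_table, stock.median())
```

With an even number of observations and a cumulative frequency of exactly 0.5, `Series.median()` averages the two middle values, so the two results can differ in that case.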

43. 6. VARIANCE AND
STANDARD DEVIATION

44. WE CALL THEM
MEASURES OF
DISPERSION

45. MEAN AND VARIANCE ARE
PROBABLY THE MOST IMPORTANT
CONCEPTS IN STATISTICS

46. AS MY PROFESSOR SAID…
VARIANCE IS YOUR
EMPLOYER

47. HELLO BOSS !

48. BUT IT IS A STRANGE
CONCEPT…
THE SQUARE OF SOMETHING

49. STANDARD DEVIATION

50. NOW WE HAVE A
KIND OF DISTANCE

51. THE DISTANCE,
ON AVERAGE,
FROM THE MEAN
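Variance is the average squared deviation from the mean, so it is expressed in squared units; taking the square root brings it back to the units of the data. A minimal sketch with made-up numbers follows.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, for illustration only

mean = statistics.mean(data)

# Population variance: average of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, in the data's own units
std = math.sqrt(variance)

# Sample variance divides by n - 1 instead of n
sample_variance = statistics.variance(data)

print(mean, variance, std, sample_variance)
```

Note that pandas' `.var()` and `.std()` divide by n - 1 by default (`ddof=1`), while NumPy's `np.var` and `np.std` divide by n (`ddof=0`), so the two libraries give different numbers on the same data unless `ddof` is set explicitly.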

52. YOU CAN ALSO USE STD
TO SAY SOMETHING ROMANTIC

53. LIKE: YOU ARE 3 STD
FROM THE MEAN
(A NERDY WAY TO SAY
YOU ARE UNIQUE)

54. NORMAL DISTRIBUTION
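Under a normal distribution, being 3 standard deviations from the mean is indeed rare. The sketch below uses the standard library's `NormalDist` to compute how much of the distribution lies within 1, 2, and 3 standard deviations; the helper names `z_score` and `share_within` are introduced here for illustration.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

def z_score(x, mean, std):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std

def share_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Share of the distribution that is at least 3 std away from the mean
share_beyond_3 = 1 - share_within(3)
print(round(share_beyond_3, 4))
```

Roughly 68%, 95%, and 99.7% of a normal distribution fall within 1, 2, and 3 standard deviations, which leaves about 0.3% beyond 3.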

55. 7. BIVARIATE
DISTRIBUTION

56. WE WORK WITH
2 VARIABLES

57. OUR VARIABLES

58. BIVARIATE DISTRIBUTION

59. NOW WE CAN TREAT
A BIVARIATE DISTRIBUTION LIKE A
UNIVARIATE DISTRIBUTION

60. CONDITIONAL
DISTRIBUTION

61. STOCKS (X6) |
X3 = 18.95…

62. PRICES (X3) | X6
= 4
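A bivariate frequency table and its conditional distributions can be sketched with `pd.crosstab` and a row filter. The data below is made up: the column names mirror the talk's, but the prices and stock levels are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names come from the talk
df = pd.DataFrame({
    "Price (X3)": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    "Stock (X6)": [4, 4, 12, 4, 12, 12],
})

# Joint absolute frequencies; the "All" row and column hold the marginals
joint = pd.crosstab(df["Price (X3)"], df["Stock (X6)"], margins=True)
print(joint)

# Stock (X6) | X3 = 10.0: keep only the rows with that price,
# then build a univariate relative frequency table
stock_given_price = (
    df.loc[df["Price (X3)"] == 10.0, "Stock (X6)"]
    .value_counts(normalize=True)
    .sort_index()
)

# Price (X3) | X6 = 4: same idea, conditioning on the other variable
price_given_stock = (
    df.loc[df["Stock (X6)"] == 4, "Price (X3)"]
    .value_counts(normalize=True)
    .sort_index()
)

print(stock_given_price)
print(price_given_stock)
```

Once one variable is fixed, what remains is an ordinary univariate distribution, so the earlier frequency tools apply unchanged.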

63. BIVARIATE DISTRIBUTION

64. FOR EACH CONDITIONAL
DISTRIBUTION WE CAN
CALCULATE
MEAN AND VARIANCE

65. AT THE END WE HAVE
13 CONDITIONAL MEANS/VARIANCES
AND 2 MARGINAL MEANS/VARIANCES
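Conditional means and variances are one `groupby` away in pandas. The sketch below uses made-up data with the talk's column names and also checks that the conditional means, weighted by group size, recover the marginal mean.

```python
import pandas as pd

# Made-up rows; only the column names come from the talk
df = pd.DataFrame({
    "Price (X3)": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    "Stock (X6)": [4, 4, 12, 4, 12, 12],
})

# Mean and (sample) variance of the price within each stock level
conditional = df.groupby("Stock (X6)")["Price (X3)"].agg(["mean", "var"])
print(conditional)

# Marginal mean and variance of the price, ignoring the stock level
marginal_mean = df["Price (X3)"].mean()
marginal_var = df["Price (X3)"].var()

# Weighting each conditional mean by its group size gives the marginal mean
sizes = df.groupby("Stock (X6)").size()
weighted_mean = (conditional["mean"] * sizes).sum() / sizes.sum()

print(marginal_mean, weighted_mean)
```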

66. 8. COVARIANCE
AND CORRELATION

67. COVARIANCE

68. bivariate.mean()
bivariate
bivariate.cov()

69. NOT SO USEFUL.
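One likely reason covariance is hard to interpret on its own is that its size depends on the units of measurement. The sketch below computes the sample covariance by hand on made-up data, dividing by n - 1 as pandas' `.cov()` does by default, and shows what happens when prices are expressed in cents.

```python
# Made-up paired observations, for illustration only
price = [10.0, 20.0, 30.0, 40.0]
stock = [12.0, 9.0, 7.0, 2.0]

def covariance(x, y):
    """Sample covariance: sum of products of deviations over n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)

cov_units = covariance(price, stock)

# Same data, prices expressed in cents instead of whole units
price_cents = [p * 100 for p in price]
cov_cents = covariance(price_cents, stock)

print(cov_units, cov_cents)
```

The sign says the two variables move in opposite directions here, but the magnitude grows by a factor of 100 with the change of units, so it says little about how strong the relationship is.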

70. CORRELATION

71. IT’S A COEFFICIENT OF
LINEAR CORRELATION.
IT GOES FROM -1 TO 1

72. bivariate.corr()
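By default `DataFrame.corr()` computes the Pearson coefficient, which is the covariance divided by the product of the two standard deviations. The sketch below rebuilds it by hand on made-up data and checks that it does not change with the units.

```python
import math

# Made-up paired observations, for illustration only
price = [10.0, 20.0, 30.0, 40.0]
stock = [12.0, 9.0, 7.0, 2.0]

def covariance(x, y):
    """Sample covariance: sum of products of deviations over n - 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    products = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    std_x = math.sqrt(covariance(x, x))  # covariance(x, x) is the variance
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)

r = correlation(price, stock)

# Changing the units of one variable leaves the coefficient unchanged
r_cents = correlation([p * 100 for p in price], stock)

print(round(r, 4), round(r_cents, 4))
```

A value near -1 or 1 indicates a nearly linear relationship; a value near 0 only rules out a linear one, not every kind of dependence.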

73. QUESTIONS ?