
# Exploratory Data Analysis Part 2 - Correlation and Association

This is a part of the Exploratory Data Analysis Workshop. In this episode, we'll talk about the correlation and association between variables and introduce some effective methods to investigate such relationships.

* Correlation / Association
* Kruskal-Wallis Test
* Linear Regression

[Exploratory](https://exploratory.io)

January 27, 2021

## Transcript

1. ### EXPLORATORY Online Seminar #32: Exploratory Data Analysis Part 2 - Correlation & Association
2. ### Speaker: Kan Nishida, CEO/co-founder, Exploratory

In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams to build various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust

4. ### Data Science Workflow

Diagram labels: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis.
5. ### Exploratory: Modern & Simple UI

Diagram labels: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis.
6. ### EXPLORATORY Online Seminar #32: Exploratory Data Analysis Part 2 - Correlation & Association

8. ### “I skate to where the puck is going to be, not where it has been.” - Wayne Gretzky
9. ### Prediction

Predict how many customers we will have, e.g. Customers will be 1,000 by the end of this year.
10. ### “I control where the puck is going to be, not the puck controls where I should be.” - Kan Nishida
11. ### Control

e.g. We want to grow Customers to 1,000. What can we do to make that happen?
12. ### Hypothesis

Prediction: If the weather is warm, we will have more customers. Control: If we offer a 10% discount, we will have more customers.
13. ### Test Hypotheses by Experimenting and Collecting Data

• Hypothesis Test • A/B Test. Diagram labels: Hypotheses, Data, Test. Example hypothesis: a discount brings in more customers.
14. ### Test Hypotheses by Experimenting and Collecting Data

• Hypothesis Test • A/B Test. Diagram labels: Hypotheses, Data, Test, Confirmatory Analysis. Example hypothesis: a discount brings in more customers.

18. ### John Tukey built the method and published a book called ‘Exploratory Data Analysis’ in the 1970s.
19. ### Exploratory Data Analysis

Build hypotheses by exploring data. Diagram labels: Data, EDA, Hypotheses, Data, Test.
20. ### Exploratory Data Analysis (EDA)

An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
21. ### Exploratory Data Analysis (EDA)

An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Prediction and Control.
22. ### Two Principal Questions for EDA

• How does each variable vary? • How are the variables associated (or correlated) with one another?


33. ### Variance is a good starting point for building hypotheses of association or causal relationship.

If there is variance, we can ask “What makes the variance?” and start investigating further. Diagram labels: ?, Income.

35. ### Association and Correlation

A relationship where changes in one variable happen together with changes in another variable, following a certain rule.
36. ### Association vs. Correlation

Association: any type of relationship between two variables. Correlation: a certain type of (usually linear) association between two variables.
37. ### Association

The variance of Monthly Income differs among countries. Chart: Monthly Income (0 to 5,000) by Country (US, UK, Japan).
38. ### Correlation

The bigger the Age is, the bigger the Monthly Income is. Chart: Monthly Income by Age.
39. ### Correlation

Scale from -1 to 1: -1 is a Strong Negative Correlation, 0 is No Correlation, and 1 is a Strong Positive Correlation.
40. ### If we can find a correlation between Monthly Income and Working Years…

Chart: Monthly Income (\$1,000 to \$20,000) by Working Years (0 to 30).
41. ### If Working Years is 20 years, Monthly Income would be around \$15,000.

Chart: Monthly Income (\$1,000 to \$20,000) by Working Years (0 to 30), with \$15,000 marked.
42. ### If we can find strong correlations, it becomes easier to explain how Monthly Income changes and to predict what Monthly Income will be.

Diagram labels: ?, Income.
43. ### Warning! Correlation is not equal to Causation.

Causation is a special type of Correlation. If we can confirm that a given Correlation is Causation, then we can control the outcome.
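
The -1 to 1 correlation scale described in these slides can be sketched in a few lines of Python. This is a minimal illustration of the Pearson correlation coefficient, not the method Exploratory uses internally; the function name `pearson_correlation` and the data values are made up for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)


# Hypothetical data, for illustration only.
working_years = [1, 5, 10, 15, 20, 25, 30]
monthly_income = [2000, 4500, 7000, 11000, 15000, 16500, 20000]

# Income rises with working years, so the value is close to +1.
r_positive = pearson_correlation(working_years, monthly_income)

# Flipping the sign of one variable flips the sign of the correlation.
r_negative = pearson_correlation(working_years, [-v for v in monthly_income])

print(round(r_positive, 3), round(r_negative, 3))
```

A value near 1 or -1 says only that the two variables move together in a roughly linear way; it says nothing about which one, if either, causes the other.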


48. ### For numerical variables, it divides the values into 10 sections and shows the mean and the 95% confidence interval for each section.

52. ### We don’t really know the True mean.

True Mean: ? Mean of this data: \$5,200
53. ### We don’t know the True mean of the Monthly Income.

We might be missing some employees, some employees' data may not be accurate, the timing may not be right, there may be some bias, etc. But we do know the mean of the data in our hands. Chart labels: 5,200 (Mean of this data), 5,000 (True Mean).
54. ### 95% Confidence Interval

Can we have a range around this mean value so that we can say that the True mean should reside somewhere within this range? Chart labels: 5,200, 5,000.
55. ### If we take many samples from the population data and calculate the 95% confidence interval of each mean…
56. ### 95% of all the confidence intervals should include the True mean.

Chart labels: True Mean, Sample.
57. ### We happen to be looking at one of the means and its 95% confidence interval.

Chart labels: Sample, True Mean.
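
The idea that about 95% of the intervals should contain the True mean can be checked with a small simulation. The sketch below assumes a normal approximation (z of about 1.96) for the interval, which may differ from what Exploratory computes; the population parameters (`TRUE_MEAN`, a standard deviation of 1,500, samples of 100) are invented for the example.

```python
import random
import statistics


def mean_confidence_interval(values, level=0.95):
    """Return (low, high) for the mean, using a normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(values) / n ** 0.5
    # About 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * standard_error, mean + z * standard_error


random.seed(0)  # fixed seed so the run is repeatable
TRUE_MEAN = 5000  # hypothetical population mean; unknown in practice
trials = 1000
covered = 0
for _ in range(trials):
    # Draw one sample of 100 hypothetical monthly incomes.
    sample = [random.gauss(TRUE_MEAN, 1500) for _ in range(100)]
    low, high = mean_confidence_interval(sample)
    if low <= TRUE_MEAN <= high:
        covered += 1

# Share of intervals that contain the true mean; expected to be near 0.95.
coverage = covered / trials
print(coverage)
```

In a real analysis we only ever see one of these samples, which is why the interval is read as a plausible range for the True mean rather than a guarantee.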

61. ### Correlation

Scale from -1 to 1: -1 is a Strong Negative Correlation, 0 is No Correlation, and 1 is a Strong Positive Correlation.
62. ### Select ‘Correlation’ to sort the variables based on the correlation values.

64. ### Monthly Income and Job Role are correlated to some degree.

Chart labels: Monthly Income, Job Level.

67. ### Kruskal-Wallis Test

This significance test is based on the Kruskal-Wallis test, which is used to test whether the differences in the target variable among the groups are significant, with no assumption of normality in the target variable.
68. ### Various Types of Hypothesis Test

| | 2 Groups | More than 2 Groups |
|---|---|---|
| With Assumption | t Test | ANOVA Test |
| No Assumption | Wilcoxon Test | Kruskal-Wallis Test |

Assumption: the distribution of the original data is a Normal Distribution.
69. ### P-Value

It indicates the probability of observing the relationship between the Monthly Income and the Job Level by chance, assuming that there is no relationship between them. It shows less than 0.0001, which means that the chance of observing the relationship we see here under that assumption is very small. This suggests that the assumption of no relationship does not hold. Therefore, we can conclude that Monthly Income and Job Level do have some degree of relationship.
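
To make the test and its P-value more concrete, here is a minimal sketch of the Kruskal-Wallis H statistic in plain Python. It leaves out the tie correction, and the P-value shortcut in it is valid only for three groups (2 degrees of freedom); in practice a library routine such as `scipy.stats.kruskal` would normally be used. The income values per Job Level are invented for this example.

```python
import math


def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic (no tie correction)."""
    pooled = sorted(value for group in groups for value in group)
    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    # Sum over groups of (rank sum squared / group size).
    weighted = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


# Hypothetical Monthly Income values for three Job Levels.
level_1 = [2500, 2800, 3000, 3200, 2900]
level_2 = [5200, 5600, 5000, 6100, 5800]
level_3 = [9800, 10500, 9900, 11000, 10200]

h = kruskal_wallis_h([level_1, level_2, level_3])

# Under "no relationship", H roughly follows a chi-squared distribution with
# (number of groups - 1) degrees of freedom. For exactly 2 degrees of freedom
# the upper-tail probability has the closed form exp(-H / 2).
p_value = math.exp(-h / 2)

print(h, p_value)
```

A small P-value here plays the same role as on the slide: if Job Level had no relationship with Monthly Income, rank differences this large would rarely appear by chance.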

71. ### The relationship between Monthly Income and Job Level is visualized with a Scatter plot chart.

74. ### A Boxplot is better for visualizing the relationship between Numerical and Categorical variables.

77. ### You can create a Linear Regression model to investigate the relationship between Monthly Income and Job Level further.
78. ### The first thing you see is a tab called ‘Prediction’, where you can see the predicted values of Monthly Income based on the Job Level.
79. ### The gray line shows the mean (average) at each numeric value of Job Level (e.g. 1, 2, 3…), and the blue line shows the values predicted by the Linear Regression model.

81. ### Chart: Monthly Income (5,000 to 25,000) by Working Years (0 to 40).
82. ### We want to find a simple pattern that can explain both the given data and the data we don’t have at hand.
83. ### Draw a line that minimizes the distance between the actual values and the line.
84. ### Monthly Income = 500 * Working Years + 5000
85. ### Slope: 500

Monthly Income = 500 * Working Years + 5000
86. ### Y Intercept: 5000

Monthly Income = 500 * Working Years + 5000
87. ### Model

The Linear Regression algorithm finds these parameters based on the given data and builds a model. Monthly Income = 500 * Working Years + 5000
88. ### Simple Linear Regression

The target value (y) can be estimated from a single variable (x): y = a * x + b. Monthly Income = 500 * Working Years + 5000
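
How the slope and the Y intercept are found can be sketched with ordinary least squares in plain Python. This is only an illustration of the idea, not Exploratory's implementation; the helper name `fit_simple_linear_regression` is made up, and the data points are generated to lie exactly on the slide's example line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data placed exactly on Monthly Income = 500 * Working Years + 5000.
working_years = [0, 10, 20, 30, 40]
monthly_income = [500 * years + 5000 for years in working_years]

slope, intercept = fit_simple_linear_regression(working_years, monthly_income)

# Estimate the Monthly Income for 20 Working Years from the fitted model.
predicted_at_20 = slope * 20 + intercept

print(slope, intercept, predicted_at_20)
```

With real data the points would scatter around the line, and the same calculation would return the slope and intercept that keep the squared vertical distances as small as possible.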

90. ### The model with Job Level and Working Years is now showing the R Squared as 0.90.
91. ### R Squared

• How well does the model perform compared to a null model? • How much of the variance of the target variable can the given variable(s) explain? • It can be between 0 and 1, 1 being the highest. • Square of the Correlation (between the actual and the predicted values)
92. ### Let's take a look at High and Low scenarios.

When R Squared is high… When R Squared is low…

94. ### Chart: Monthly Income (50K to 85K) by Working Years, with the Mean and Prediction lines.
95. ### Let’s look at this data.

Chart: Monthly Income (50K to 85K) by Working Years, with the Mean and Prediction lines.
96. ### Chart: Monthly Income (50K to 85K) by Working Years, with the Mean and Prediction lines.
97. ### The variance that is explained by this model, and the variance that is not explained by this model.

Chart: Monthly Income (50K to 85K) by Working Years, with the Actual value, the Mean, and the Prediction.
98. ### Majority of the variance is explained by this model.

Chart: Monthly Income (50K to 85K) by Working Years, with the Actual value, the Mean, and the Prediction.

100. ### Chart: Monthly Income (0 to 120K) by Working Years (25 to 45), with the Mean and Prediction lines.
102. ### Chart: Monthly Income (0 to 120K) by Working Years (25 to 45), with the Actual value, the Mean, and the Prediction.
103. ### Majority of the variance is not explained by this model.

Chart: Monthly Income (0 to 120K) by Working Years (25 to 45), with the Actual value, the Mean, and the Prediction.
104. ### The variance that is explained by this model is very small.

Chart: Monthly Income (0 to 120K) by Working Years (25 to 45), with the Actual value, the Mean, and the Prediction.
105. ### R Squared is the ratio of the variability of the target variable values that is explained by the model.

When R Squared is high… When R Squared is low…
106. ### The model with Job Level and Working Years is now showing the R Squared as 0.90.
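
The High and Low scenarios can be reproduced with a short calculation. The sketch below uses the common definition R Squared = 1 - (unexplained variation / total variation), which matches the "ratio of explained variability" wording on the slides; the function name and all values (in thousands) are invented for illustration.

```python
def r_squared(actual, predicted):
    """Return the share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Total variation: distance of each actual value from the mean (null model).
    total = sum((y - mean_actual) ** 2 for y in actual)
    # Unexplained variation: distance of each actual value from its prediction.
    unexplained = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - unexplained / total


# Hypothetical Monthly Income values, in thousands.
actual = [52, 58, 65, 71, 79, 84]

# High scenario: the predictions stay close to the actual values.
close_predictions = [51, 59, 64, 72, 78, 85]
r2_high = r_squared(actual, close_predictions)

# Null model: always predict the mean, so nothing is explained.
mean_only = [sum(actual) / len(actual)] * len(actual)
r2_null = r_squared(actual, mean_only)

print(round(r2_high, 3), r2_null)
```

Read this way, an R Squared of 0.90 means the model's predictions remove about 90% of the variation that would remain if we always predicted the mean.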

108. ### Create a Linear Regression model to investigate the relationship between Monthly Income and Total Working Years further.

111. ### The coefficient (slope) with the 95% confidence interval is visualized with an Error Bar chart.

114. ### “Shallow men believe in luck or in circumstance. Strong men believe in cause and effect.” - Ralph Waldo Emerson

119. ### We now know that Monthly Income and Job Level are correlated.
120. ### The higher the Job Level is, the higher the Monthly Income is.
121. ### We also know that Monthly Income and Working Years are correlated.
122. ### The higher the Working Years is, the higher the Monthly Income is.
123. ### Are Job Level and Working Years correlated?

Diagram labels: Job Level, Monthly Income, Working Years, ?.

126. ### Job Level and Working Years are correlated.

Diagram labels: Job Level, Monthly Income, Working Years, Correlation.
127. ### Does Working Years have an impact on changes in Monthly Income?
128. ### Maybe the change in Job Level is the one that has an effect on the change in Monthly Income. If so…
129. ### When the Job Level changes, the Monthly Income and the Working Years change.

Maybe we happen to be looking at the results (Monthly Income and Working Years) that are caused by the Job Level.

131. ### Does Job Level have an impact on changes in Monthly Income?
132. ### Maybe the change in Working Years is the one that makes the change in Monthly Income. If so…
133. ### When the Working Years changes, the Monthly Income and the Job Level change.

Maybe we happen to be looking at the results (Monthly Income and Job Level) that are caused by the Working Years.
134. ### How can we know the effect of only Job Level or of only Working Years on the changes in Monthly Income?
135. ### We can hold one of the two variables constant and see if we still see any change in the Monthly Income.
136. ### We can make the Job Level constant and allow only Working Years to change, and see how much the Monthly Income changes.

Diagram: Working Years 10 → 11, Job Level 1 → 1 (constant), effect on Monthly Income?
137. ### We can make the Working Years constant and allow only Job Level to change, and see how much the Monthly Income changes.

Diagram: Job Level 1 → 2, Working Years 10 → 10 (constant), effect on Monthly Income?

139. ### Simple Linear Regression

y = a * x + b. A one-point increase in x is expected to change y by a.
140. ### Simple Linear Regression

Monthly Income = 500 * Working Years + 5000. A one-year increase in Working Years is expected to bring a \$500 increase in Monthly Income.
141. ### Multiple Linear Regression

y = a1 * x1 + a2 * x2 + b. A one-point increase in x1 is expected to change y by a1, when all other variables stay the same.
142. ### Multiple Linear Regression

Monthly Income = 500 * Working Years + 600 * Job Level + 5000. A one-year increase in Working Years is expected to bring a \$500 increase in Monthly Income, if Job Level stays the same.
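
The difference between the simple and the multiple model can be seen by fitting both on the same data. The sketch below solves the least-squares normal equations in plain Python; it is an illustration under made-up data that follows the slide's example formula exactly, and the helper names (`solve_linear_system`, `fit_linear_regression`) are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def fit_linear_regression(predictor_rows, ys):
    """Return [a1, ..., ak, b] minimizing the squared error of y = sum(ai * xi) + b."""
    # Append a constant 1 to each row so the last coefficient is the intercept.
    design = [list(row) + [1.0] for row in predictor_rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (Working Years, Job Level) pairs; Job Level tends to rise with years.
people = [(1, 1), (3, 1), (5, 2), (8, 2), (10, 3), (15, 3), (20, 4), (25, 5)]
income = [500 * years + 600 * level + 5000 for years, level in people]

# Multiple regression: the effect of each variable with the other held constant.
a_years, a_level, b = fit_linear_regression(people, income)

# Simple regression on Working Years alone: its slope also absorbs the part of
# the Job Level effect that moves together with Working Years.
naive_slope, naive_intercept = fit_linear_regression(
    [(years,) for years, _ in people], income
)

print(round(a_years, 2), round(a_level, 2), round(b, 2), round(naive_slope, 2))
```

Because Job Level and Working Years are correlated in this made-up data, the slope from the simple model comes out larger than 500, while the multiple model recovers 500 and 600 separately.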

144. ### You can extend this Linear Regression model by adding other columns (predictor variables).


147. ### The model with the Job Level and the Working Years is now showing the R Squared as 0.90.

90% of the variance of the Monthly Income can be explained by the changes in the Job Level and the Working Years.
148. ### The relationship can be formulated as: Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
149. ### The relationship can be formulated as: Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
150. ### The coefficients and the P values are visualized under the Coefficient tab.
151. ### A one-year increase in Working Years is expected to bring a \$46 increase in Monthly Income, if Job Level stays the same.

Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
152. ### A one-level increase in Job Level is expected to bring a \$3,788 increase in Monthly Income, if Working Years stays the same.

Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
153. ### The prediction line (the blue line) is drawn under the assumption that the other variables stay constant.
154. ### A 1-year increase in Working Years is expected to bring a \$46 increase in Monthly Income, if the Job Level stays constant.

Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
155. ### A 1-level increase in Job Level is expected to bring a \$3,788 increase in Monthly Income, if the Working Years stays constant.

Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
156. ### We can see that the Job Level variable is much more important than the Working Years variable when it comes to predicting the Monthly Income.
157. ### Both Job Level and Working Years have some degree of effect on Monthly Income.

Diagram labels: Job Level, Monthly Income, Working Years.
158. ### And the Job Level has a much bigger effect on the Monthly Income.

Diagram labels: Job Level, Monthly Income, Working Years.
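
The "hold one variable constant" reading of the coefficients can be verified directly from the fitted formula on the slides. The sketch below only evaluates that formula; the function name `predict_monthly_income` and the starting point (10 Working Years, Job Level 2) are chosen for illustration.

```python
def predict_monthly_income(working_years, job_level):
    """Evaluate the formula from the slides for one employee."""
    return 46 * working_years + 3788 * job_level - 1835


# Hypothetical starting point: 10 Working Years at Job Level 2.
base = predict_monthly_income(10, 2)

# One more Working Year, Job Level held constant.
effect_of_one_year = predict_monthly_income(11, 2) - base

# One more Job Level, Working Years held constant.
effect_of_one_level = predict_monthly_income(10, 3) - base

print(base, effect_of_one_year, effect_of_one_level)
```

Whatever starting point is chosen, the two differences stay at 46 and 3,788, because in a linear model each coefficient is the expected change for a one-unit step in its own variable.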

164. ### We can see how each of the variables has an effect on the Monthly Income when the other variables stay constant.
165. ### The effect of Age on Monthly Income is not as significant as we might expect.
166. ### The changes in the Job Level and the Job Role seem to have the two biggest impacts on the Monthly Income.
167. ### With the Summary View and its Correlation Mode, we can quickly investigate the relationship among the variables.
168. ### We have looked at the case where the subject of our interest was a numerical variable.

170. ### In the next session, we are going to explore how to investigate and understand the relationship when the target variable is Logical.

• Confidence Interval with Error Bar • Hypothesis Test with Chi-Squared • Logistic Regression
171. ### EXPLORATORY Online Seminar #33: Exploratory Data Analysis Part 3 - Relationship with Logical Variable

2/10/2021 (Wed) 11AM PT