Exploratory Data Analysis Part 2 - Correlation and Association

This is a part of the Exploratory Data Analysis Workshop. In this episode, we'll talk about the correlation and association between variables and introduce some effective methods to investigate such relationships.

* Correlation / Association
* Kruskal-Wallis Test
* Linear Regression

Exploratory (https://exploratory.io)

Kan Nishida

January 27, 2021

Transcript

  1. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory

    Summary: In Spring 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. Data Science Workflow

    Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication. (Diagram label: Data Analysis.)
  3. Exploratory: Modern & Simple UI

    Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides). (Diagram label: Data Analysis.)
  4. “I skate to where the puck is going to be, not where it has been.” - Wayne Gretzky
  5. Prediction

    Predict how many customers we will have, e.g. customers will be 1000 by the end of this year.
  6. “I control where the puck is going to be, not the puck controls where I should be.” - Kan Nishida
  7. Control

    e.g. We want to grow customers to 1000. What can we do to make that happen?
  8. Hypothesis

    If the weather is warm, we will have more customers. If we offer a 10% discount, we will have more customers. (Slide labels: Prediction, Control.)
  9. Test Hypotheses by Experimenting and Collecting Data

    • Hypothesis Test • A/B Test. (Diagram: Hypotheses, Data, Test. Example hypothesis: a discount brings in more customers.)
  10. Test Hypotheses by Experimenting and Collecting Data: Confirmatory Analysis

    • Hypothesis Test • A/B Test. (Diagram: Hypotheses, Data, Test. Example hypothesis: a discount brings in more customers.)
  11. John Tukey developed the method and published a book

    called ‘Exploratory Data Analysis’ in the 1970s.
  12. Exploratory Data Analysis (EDA)

    An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  13. Exploratory Data Analysis (EDA)

    An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Prediction and Control.
  14. Two Principal Questions for EDA

    • How do the values of each variable vary? • How are the variables associated (or correlated) with one another?
  15. Variance is a good starting point for building hypotheses

    of association or causal relationship. If there is variance, we can ask “What makes the variance?” and start investigating further. (Diagram: ? and Income.)
  16. Association and Correlation

    A relationship where changes in one variable happen together with changes in another variable according to a certain rule.
  17. Association: Any type of relationship between two variables.

    Correlation: A certain type of (usually linear) association between two variables.
  18. Association

    Monthly Income variances are different among countries. (Chart: Monthly Income by Country for US, UK, and Japan.)
  19. Correlation

    The higher the Age, the higher the Monthly Income. (Chart: Monthly Income vs. Age.)
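
To make the idea of a (linear) correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The `age` and `monthly_income` values are made up for illustration and are not the workshop data; the function name `pearson_correlation` is likewise only an illustrative choice.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Made-up illustration data (assumption, not from the workshop dataset).
age = [25, 30, 35, 40, 45, 50]
monthly_income = [2500, 3100, 3900, 4200, 5100, 5600]

r = pearson_correlation(age, monthly_income)
print(f"correlation between Age and Monthly Income: {r:.3f}")
```

A value close to 1 means that Monthly Income tends to rise with Age along a straight line; a value close to 0 means there is no linear relationship, although some other kind of association may still exist.
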
  20. If we can find a correlation

    between Monthly Income and Working Years… (Chart: Monthly Income, $1,000 to $20,000, vs. Working Years, 0 to 30.)
  21. If Working Years is 20 years, Monthly Income would be around $15,000.

    (Chart: Monthly Income vs. Working Years, with $15,000 marked.)
  22. If we can find strong correlations, it

    makes it easier to explain how Monthly Income changes and to predict what Monthly Income will be. (Diagram: ? and Income.)
  23. Warning! Correlation is not equal to Causation.

    Causation is a special type of Correlation. If we can confirm that a given Correlation is Causation, then we can control the outcome.
  24. For numerical variables, it divides the values into 10 sections and shows

    the mean and the 95% confidence interval for each section.
  25. We don’t know the true mean of the Monthly Income.

    We might be missing some employees, some employees’ data may not be accurate, the timing may not be right, there may be some bias, etc. But we know the mean of the data in our hands. (Chart labels: True Mean, Mean of this data; values shown: 5,200 and 5,000.)
  26. 95% Confidence Interval

    Can we put a range around this mean value so that we can say the true mean should lie somewhere within that range?
  27. If we take many samples from the population data

    and calculate the 95% confidence interval of each mean…
  28. We happen to be looking at just one of those

    means and its 95% confidence interval. (Chart labels: Sample, True Mean.)
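
The "many samples" thought experiment can be simulated. The sketch below is only an illustration under stated assumptions: the population is a hypothetical normal distribution with mean 5,000 and standard deviation 1,500, and the interval uses the normal approximation (mean ± 1.96 × standard error), which may differ from how any particular tool computes it.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) using the normal approximation."""
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, mean - z * standard_error, mean + z * standard_error


random.seed(42)          # fixed seed so the run is repeatable
TRUE_MEAN = 5000         # hypothetical true mean of Monthly Income
TRUE_SD = 1500           # hypothetical spread of Monthly Income
N_SAMPLES = 1000         # how many times we repeat the sampling
SAMPLE_SIZE = 100        # employees per sample

hits = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    _, lower, upper = mean_confidence_interval(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / N_SAMPLES
print(f"{coverage:.1%} of the intervals contain the true mean")
```

Roughly 95% of the simulated intervals should contain the true mean. In practice we only ever see one of these intervals, which is why we say the true mean is likely, not certain, to lie within it.
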
  31. Kruskal-Wallis Test

    This significance test is based on the Kruskal-Wallis test, which tests whether the differences in the target variable's values among groups are significant, with no assumption of normality in the target variable.
  32. Various Types of Hypothesis Test

    With Assumption: t Test (2 groups), ANOVA Test (more than 2 groups). No Assumption: Wilcoxon Test (2 groups), Kruskal-Wallis Test (more than 2 groups). Assumption: The distribution of the original data is a Normal Distribution.
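
As a rough illustration of how the Kruskal-Wallis test works on ranks rather than raw values, here is a minimal hand-written sketch. Everything in it is an assumption made for illustration: the three groups of incomes are made up, the function `kruskal_wallis_h` omits the tie correction, and the P-value uses the chi-squared approximation, which for exactly three groups (2 degrees of freedom) reduces to exp(-H / 2). With groups this small the approximation is coarse, so a statistics library would be the better choice for real work.

```python
import math


def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic (without tie correction)."""
    # Pool all values, remembering which group each one came from.
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        # Tied values share the average of the ranks they occupy.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    weighted = sum(rs ** 2 / len(group) for rs, group in zip(rank_sums, groups))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Made-up Monthly Income values for three Job Levels (assumption).
level_1 = [2100, 2500, 2900, 3200, 2700]
level_2 = [5100, 5600, 4800, 6100, 5300]
level_3 = [9800, 10500, 9100, 11200, 10100]

h = kruskal_wallis_h(level_1, level_2, level_3)
# Chi-squared survival function with 2 degrees of freedom (three groups).
p_value = math.exp(-h / 2)
print(f"H = {h:.2f}, approximate P-value = {p_value:.4f}")
```

Because only the ranks enter the calculation, extreme incomes do not distort the result, which is why no normality assumption is needed.
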
  33. P-Value

    It indicates the probability of observing a relationship at least as strong as the one we see between the Monthly Income and the Job Level purely by chance, assuming that there is no relationship between them. It shows less than 0.0001, which means that the chance of observing the relationship we see here under that assumption is very small. This means that the assumption of no relationship doesn’t seem to hold. Therefore, we can conclude that Monthly Income and Job Level indeed have some degree of relationship.
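
The logic of a P-value ("how often would chance alone produce a relationship this strong?") can be sketched with a permutation test. Note that this is a swapped-in technique for illustration, not the Kruskal-Wallis test described above: it shuffles the incomes to break any real link with Job Level and counts how often the shuffled data looks as strongly related as the original. The data, the seed, and the helper name `correlation` are all assumptions.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Made-up data (assumption): 12 employees across 4 Job Levels.
job_level = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
monthly_income = [2400, 2900, 2600, 5200, 5900, 5500,
                  9800, 9100, 10400, 13500, 14200, 12900]

random.seed(0)  # fixed seed so the run is repeatable
observed = abs(correlation(job_level, monthly_income))

n_permutations = 2000
at_least_as_strong = 0
shuffled = monthly_income[:]
for _ in range(n_permutations):
    # Shuffling simulates "no relationship between the two variables".
    random.shuffle(shuffled)
    if abs(correlation(job_level, shuffled)) >= observed:
        at_least_as_strong += 1

# Add one to both counts so the estimate is never exactly zero.
p_value = (at_least_as_strong + 1) / (n_permutations + 1)
print(f"observed |correlation| = {observed:.3f}, P-value = {p_value:.4f}")
```

A very small P-value here says the same thing as on the slide: if there were truly no relationship, data like ours would almost never show up.
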
  34. The relationship between Monthly Income and Job Level is

    visualized with a Scatter plot chart.
  35. You can create a Linear Regression model to investigate the

    relationship between Monthly Income and Job Level further.
  36. The first thing you see is a tab called ‘Prediction’

    where you can see the predicted values of Monthly Income based on the Job Level.
  37. The gray line shows the mean (average) at each numeric

    value of Job Level (e.g. 1, 2, 3…) and the blue line shows the values predicted by the Linear Regression model.
  38. We want to find a simple pattern that can explain both

    the given data and the data we don’t have at hand.
  39. Draw a line that minimizes the distance between the

    actual values and the line.
  40. Monthly Income = 500 * Working Years + 5000
  41. Y Intercept: 5000

    Monthly Income = 500 * Working Years + 5000
  42. The Linear Regression algorithm finds these parameters from the given

    data and builds a model. Model: Monthly Income = 500 * Working Years + 5000
  43. Simple Linear Regression

    The target value (y) can be estimated from a single variable (x): y = a * x + b. Example: Monthly Income = 500 * Working Years + 5000
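
Here is a minimal sketch of how the two parameters (slope a and intercept b) can be found with ordinary least squares. The data points are made up: they were generated from the slide's example line, 500 * Working Years + 5000, with a little noise added, so the fitted values should land close to 500 and 5000. The function name `fit_simple_linear_regression` is an illustrative choice.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data (assumption): the slide's example line plus small noise.
working_years = [0, 5, 10, 15, 20, 25, 30]
monthly_income = [5200, 7350, 10100, 12250, 15150, 17400, 20050]

slope, intercept = fit_simple_linear_regression(working_years, monthly_income)
print(f"Monthly Income = {slope:.1f} * Working Years + {intercept:.1f}")

# Use the model to estimate the income at 20 working years.
predicted_at_20 = slope * 20 + intercept
print(f"predicted Monthly Income at 20 years: {predicted_at_20:.0f}")
```

The prediction at 20 years comes out near $15,000, in line with the earlier chart.
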
  44. The model with Job Level and Working Years is now

    showing the R Squared as 0.90.
  45. R Squared

    • How well does the model perform compared to a null model? • How much of the variance of the target variable can the given variable(s) explain? • It can be between 0 and 1, 1 being the highest. • It is the square of the correlation between the actual and the predicted values.
  46. Let's take a look at the high and low scenarios:

    when R Squared is high… and when R Squared is low…
  47. (Chart: Monthly Income vs. Working Years, with Mean and Prediction lines.)
  48. Let’s look at this data. (Chart: Monthly Income vs. Working Years, with Mean and Prediction lines.)
  49. (Chart: Monthly Income vs. Working Years, with Mean and Prediction lines.)
  50. The variance that is explained by this model, and the variance that is not explained by this model.

    (Chart: Monthly Income vs. Working Years, with Mean, Prediction, and Actual values.)
  51. The majority of the variance is explained by this model.

    (Chart: Monthly Income vs. Working Years, with Mean, Prediction, and Actual values.)
  52. The majority of the variance is not explained by this model.

    (Chart: Mean, Prediction, and Actual values plotted against x.)
  53. The variance that is explained by this model is very small.

    (Chart: Mean, Prediction, and Actual values plotted against x.)
  54. R Squared is the proportion of the variability in the target

    variable that is explained by the model. (Slide labels: When R Squared is high… When R Squared is low…)
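
The high and low scenarios can be reproduced with a small sketch. It computes R Squared as 1 minus the unexplained variation divided by the total variation around the mean (the "null model"). Both income lists are made up for illustration, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """1 - (variation left unexplained / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    total = sum((y - mean_actual) ** 2 for y in actual)
    unexplained = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - unexplained / total


def r_squared_of_fit(xs, ys):
    """Fit a line to the data and report how much variation it explains."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(ys, [slope * x + intercept for x in xs])


working_years = [0, 5, 10, 15, 20, 25, 30]
# Made-up incomes that follow Working Years closely (assumption).
income_high_fit = [5200, 7350, 10100, 12250, 15150, 17400, 20050]
# Made-up incomes that barely depend on Working Years (assumption).
income_low_fit = [14000, 6000, 16000, 7000, 15000, 9000, 13000]

r2_high = r_squared_of_fit(working_years, income_high_fit)
r2_low = r_squared_of_fit(working_years, income_low_fit)
print(f"high scenario R Squared: {r2_high:.3f}")
print(f"low scenario R Squared:  {r2_low:.3f}")
```

In the first case the prediction line leaves almost no variation unexplained, so R Squared is close to 1. In the second case the line is hardly better than simply predicting the mean, so R Squared is close to 0.
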
  55. The model with Job Level and Working Years is now

    showing the R Squared as 0.90.
  56. Create a Linear Regression model to investigate the relationship between

    Monthly Income and Total Working Years further.
  57. “Shallow men believe in luck or in circumstance. Strong men

    believe in cause and effect.” - Ralph Waldo Emerson
  58. We now know that Monthly

    Income and Job Level are correlated.
  59. We also know that Monthly

    Income and Working Years are correlated.
  60. Does Working Years have an impact on changes in

    Monthly Income? (Diagram: Job Level, Working Years, Monthly Income.)
  61. Maybe the change in Job Level is the one

    that has an effect on the change in Monthly Income. If so…
  62. When the Job Level changes, both the Monthly Income and

    the Working Years change. Maybe we happen to be looking at two results (Monthly Income and Working Years) that are both caused by the Job Level.
  63. Does Job Level have an impact on changes in Monthly Income?
  64. Maybe the change in Working Years is the one

    that makes the change in Monthly Income. If so…
  65. When the Working Years changes, both the Monthly Income and

    the Job Level change. Maybe we happen to be looking at two results (Monthly Income and Job Level) that are both caused by the Working Years.
  66. How can we isolate the effect of only Job Level

    or of only Working Years on the changes in Monthly Income?
  67. We can hold one of the two variables constant and

    see if we still see any change in the Monthly Income.
  68. We can hold the Job Level constant (1 → 1) and

    allow only Working Years to change (10 → 11), and see how much the Monthly Income changes.
  69. We can hold the Working Years constant (10 → 10) and

    allow only Job Level to change (1 → 2), and see how much the Monthly Income changes.
  70. Simple Linear Regression: y = a * x + b

    A one-point increase in x is expected to change y by a.
  71. Simple Linear Regression: Monthly Income = 500 * Working Years + 5000

    A one-year increase in Working Years is expected to increase Monthly Income by $500.
  72. Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b

    A one-point increase in x1 is expected to change y by a1, when all other variables stay the same.
  73. Multiple Linear Regression: Monthly Income = 500 * Working Years + 600 * Job Level + 5000

    A one-year increase in Working Years is expected to increase Monthly Income by $500, if Job Level stays the same.
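
To show how a multiple linear regression separates the two effects, the following sketch fits y = a1 * x1 + a2 * x2 + b by solving the least-squares normal equations in plain Python. It is only an illustration under stated assumptions: the rows are made up, the incomes are generated exactly from the slide's example formula (no noise), and the helper names are hypothetical. A real analysis would normally rely on a statistics library.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple_linear_regression(rows, ys):
    """Return least-squares coefficients, one per predictor, intercept last."""
    design = [list(row) + [1.0] for row in rows]  # trailing 1.0 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up (Working Years, Job Level) pairs (assumption).
rows = [(1, 1), (3, 1), (6, 2), (8, 2), (10, 3),
        (15, 3), (12, 4), (20, 4), (18, 5), (25, 5)]
# Incomes generated from the slide's example formula, without noise.
incomes = [500 * years + 600 * level + 5000 for years, level in rows]

a1, a2, b = fit_multiple_linear_regression(rows, incomes)
print(f"Monthly Income = {a1:.0f} * Working Years + {a2:.0f} * Job Level + {b:.0f}")
```

Even though Working Years and Job Level rise together in this data, the fit recovers a separate coefficient for each, which is what allows us to read one variable's effect while the other stays the same.
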
  74. The model with the Job Level and the Working Years

    is now showing the R Squared as 0.90. That is, 90% of the variance of the Monthly Income can be explained by the changes in the Job Level and the Working Years.
  75. The relationship can be formulated as:

    Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
  77. A one-year increase in Working Years is expected to increase

    Monthly Income by $46, if Job Level stays the same. Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
  78. A one-level increase in Job Level is expected to increase

    Monthly Income by $3,788, if Working Years stays the same. Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835
  79. The prediction line (the blue line) is drawn under the

    assumption that the other variables stay constant.
  80. Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835

    A 1-year increase in Working Years is expected to increase Monthly Income by $46, if the Job Level stays constant.
  81. Monthly Income = 46 * Working Years + 3,788 * Job Level - 1,835

    A 1-level increase in Job Level is expected to increase Monthly Income by $3,788, if the Working Years stays constant.
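
The "hold one variable constant" reading can be checked with simple arithmetic. This sketch plugs values into the formula quoted on the slides; the function name `predict_monthly_income` is hypothetical, and the starting point (10 Working Years, Job Level 1) mirrors the earlier "10 → 11" and "1 → 2" examples.

```python
def predict_monthly_income(working_years, job_level):
    """Apply the fitted formula quoted on the slides."""
    return 46 * working_years + 3788 * job_level - 1835


baseline = predict_monthly_income(working_years=10, job_level=1)

# Change only Working Years (10 -> 11), Job Level stays at 1.
one_more_year = predict_monthly_income(working_years=11, job_level=1)

# Change only Job Level (1 -> 2), Working Years stays at 10.
one_more_level = predict_monthly_income(working_years=10, job_level=2)

print(f"baseline prediction: {baseline}")
print(f"effect of one more year:  {one_more_year - baseline}")
print(f"effect of one more level: {one_more_level - baseline}")
```

The two differences are exactly the coefficients, 46 and 3,788, which is what "when all other variables stay the same" means in practice.
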
  82. We can see that the Job Level variable is much

    more important than the Working Years variable when it comes to predicting the monthly income.
  83. Both Job Level and Working Years have some degree of effect on Monthly Income.
  84. And the Job Level has a much bigger effect on

    the Monthly Income.
  85. We can see how each of the variables has an

    effect on the Monthly Income when the other variables stay constant.
  86. The changes in the Job Level and the Job Role

    seem to have the two biggest impacts on the Monthly Income.
  87. With the Summary View and its Correlation Mode, we can

    quickly investigate the relationship among the variables.
  88. We have looked at the case where the subject of

    our interest was a numerical variable.
  89. In the next session, we are going to explore

    how to investigate and understand the relationship when the target variable is Logical. • Confidence Interval with Error Bar • Hypothesis Test with Chi-Squared • Logistic Regression