
Exploratory Data Analysis Part 3 - What Makes the Difference Between the Two

Kan Nishida
February 17, 2021

This is a part of the Exploratory Data Analysis Workshop. In this episode, we'll get into how we can find what makes a difference between two groups so that, for example, we can answer questions like "what makes a difference between those who sign up and those who don't."

- Confidence Interval
- Error Bar
- Chi-Square Test
- Logistic Regression

Exploratory Online Seminar: https://exploratory.io/online-seminar

The UI tool used in this video: Exploratory (https://exploratory.io)

Transcript

  1. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: In Spring 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams to build various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. Data Science Workflow. [Diagram labels: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication, Data Analysis]
  3. Exploratory: Modern & Simple UI. [Diagram labels: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides), Data Analysis]
  4. “I skate to where the puck is going to be, not where it has been.” - Wayne Gretzky
  5. Prediction: Predict how many customers we will have, e.g. customers will be 1000 by the end of this year.
  6. “I control where the puck is going to be, not the puck controls where I should be.” - Kan Nishida
  7. Control: e.g. we want to grow customers to 1000. What can we do to make that happen?
  8. Hypothesis. Prediction: If the weather is warm, we will have more customers. Control: If we offer a 10% discount, we will have more customers.
  9. Test Hypotheses by Experimenting and Collecting Data: Hypothesis Test, A/B Test. [Diagram: Hypotheses, Data, Test] Example hypothesis: a discount brings in more customers.
  10. Confirmatory Analysis. Test Hypotheses by Experimenting and Collecting Data: Hypothesis Test, A/B Test. [Diagram: Hypotheses, Data, Test] Example hypothesis: a discount brings in more customers.
  11. John Tukey built the method and published a book called ‘Exploratory Data Analysis’ in the 1970s.
  12. Exploratory Data Analysis (EDA): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  13. Exploratory Data Analysis (EDA): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Prediction and Control.
  14. Two Principal Questions for EDA: How do the variables vary? How are the variables associated (or correlated) with one another?
  15. Association and Correlation: A relationship where changes in one variable happen together with changes in another variable according to a certain rule.
  16. Association: Any type of relationship between two variables. Correlation: A certain type of (usually linear) association between two variables.
  17. Association. [Chart: Monthly Income (0 to 5000) by Country (US, UK, Japan)] Monthly Income variances are different among countries.
  18. Correlation. [Chart: Monthly Income by Age] The greater the Age, the greater the Monthly Income.
  19. [Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30)] If we can find a correlation between Monthly Income and Working Years…
  20. [Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30)] If Working Years is 20 years, Monthly Income would be around $15,000.
  21. Income? If we can find strong correlations, it becomes easier to explain how Monthly Income changes and to predict what Monthly Income will be.
  22. Warning! Correlation is not equal to Causation. Causation is a special type of Correlation. If we can confirm that a given Correlation is Causation, then we can control the outcome.
  23. Logical: Did this employee quit? TRUE or FALSE. In R, there is a special data type called Logical, which can have only TRUE or FALSE. In data analysis, we typically convert this type of column to the Logical data type.
  24. The str_logical function converts a given column to the Logical data type and maps the values to TRUE/FALSE heuristically.
  25. ‘Yes’ becomes TRUE and ‘No’ becomes FALSE. You can see the percentage of each value. It is ordered as TRUE, then FALSE.
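The Yes/No to TRUE/FALSE conversion described above can be sketched in Python as follows. This is only an illustration: the slides do not say which values str_logical recognizes, so the value sets and the `to_logical` helper are assumptions.

```python
# Hypothetical value sets: the slides do not list what str_logical recognizes.
TRUE_VALUES = {"yes", "y", "true", "t", "1"}
FALSE_VALUES = {"no", "n", "false", "f", "0"}


def to_logical(value):
    """Map a raw value to True/False, or None when it is not recognized."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None


attrition_raw = ["Yes", "No", "No", "yes", "No"]  # made-up sample values
attrition = [to_logical(v) for v in attrition_raw]

# Share of each value, reported as TRUE first, then FALSE.
for flag in (True, False):
    share = attrition.count(flag) / len(attrition)
    print(flag, f"{share:.0%}")
```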
  26. The Error Bar charts show the ratio of Attrition (TRUE) along with the 95% Confidence Interval for each category.
  27. For example, we can see that employees with ‘Travel Frequently’ have a higher ratio (25%) of Attrition (TRUE) compared to others.
  28. For numerical variables, the chart divides the values into 10 buckets and shows the ratio and the 95% confidence interval for each bucket.
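As a rough sketch of the bucketing step, the Python below splits a numeric column into 10 buckets and computes the ratio of TRUE in each. The slides do not say how the bucket boundaries are chosen, so equal-width buckets are an assumption here, and the function names and data are made up.

```python
def bucket_index(value, low, high, n_buckets=10):
    """Return the 0-based equal-width bucket that value falls into."""
    if high == low:
        return 0
    position = (value - low) * n_buckets / (high - low)
    return min(int(position), n_buckets - 1)  # the maximum goes in the last bucket


def ratio_by_bucket(values, flags, n_buckets=10):
    """Ratio of TRUE flags per bucket; None for empty buckets."""
    low, high = min(values), max(values)
    counts = [0] * n_buckets
    trues = [0] * n_buckets
    for value, flag in zip(values, flags):
        i = bucket_index(value, low, high, n_buckets)
        counts[i] += 1
        trues[i] += 1 if flag else 0
    return [t / c if c else None for t, c in zip(trues, counts)]


# Made-up working years and attrition flags.
working_years = [0, 3, 5, 12, 20, 40]
left_company = [True, True, False, False, False, False]
ratios = ratio_by_bucket(working_years, left_company)
print(ratios)
```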
  29. [Diagram labels: 33%, 31%, True Ratio, Ratio of this data] We don’t know the True ratio of Attrition. We might be missing some employees, some employees’ data might not be accurate, the timing of collecting the data might not be right, etc. But we do know the ratio in the data we have in our hands.
  30. 95% Confidence Interval: Can we set a range around this ratio value so that we can say that the True ratio should reside somewhere within this range? [Diagram labels: 33%, 31%]
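One common way to put such a range around a ratio is the normal-approximation (Wald) interval, sketched below in Python. The slides do not say which method the tool uses, so this choice is an assumption, and the counts are made up.

```python
import math


def ratio_confidence_interval(true_count, total, z=1.96):
    """Normal-approximation (Wald) interval for a ratio; z=1.96 gives about 95%."""
    ratio = true_count / total
    margin = z * math.sqrt(ratio * (1 - ratio) / total)
    return max(0.0, ratio - margin), min(1.0, ratio + margin)


# Made-up counts: 31 of 100 rows are TRUE.
low, high = ratio_confidence_interval(31, 100)
print(f"ratio = 31%, 95% CI = {low:.1%} to {high:.1%}")

# The same 31% measured on ten times as many rows gives a narrower range.
low_big, high_big = ratio_confidence_interval(310, 1000)
print(f"ratio = 31%, 95% CI = {low_big:.1%} to {high_big:.1%}")
```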
  31. If we take many samples from the population data and calculate the 95% confidence interval of each ratio…
  32. We happen to be looking at just one of those ratios and its 95% confidence interval. [Diagram labels: True Ratio, Sample]
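The repeated-sampling idea can be checked with a small simulation. The Python sketch below draws many samples from a hypothetical population with a known true ratio and counts how often the interval contains it. The true ratio, sample size, and the Wald interval are all assumptions made for illustration.

```python
import math
import random


def wald_interval(true_count, total, z=1.96):
    """Normal-approximation interval for a ratio."""
    ratio = true_count / total
    margin = z * math.sqrt(ratio * (1 - ratio) / total)
    return ratio - margin, ratio + margin


def coverage(true_ratio, sample_size, n_samples, seed=0):
    """Fraction of simulated samples whose interval contains true_ratio."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        true_count = sum(rng.random() < true_ratio for _ in range(sample_size))
        low, high = wald_interval(true_count, sample_size)
        hits += low <= true_ratio <= high
    return hits / n_samples


# Hypothetical population: the true ratio is 33%, and each sample has 500 rows.
share_covering = coverage(true_ratio=0.33, sample_size=500, n_samples=2000)
print(f"{share_covering:.1%} of the intervals contain the true ratio")
```

With these settings, the share should land close to 95%, which is what the "95%" in the name refers to.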
  33. For example, we don’t know the True ratio of Attrition for ‘Travel Frequently’, but it should be somewhere between 19.82% and 30%.
  34. The confidence interval tends to get wider when the data size (in this case, the number of employees) is smaller. There are only 63 employees out of 1,470 in total.
  35. There seems to be a relationship between Attrition and Gender. It looks like Male employees tend to have a higher ratio of Attrition than Female employees.
  36. [Diagram: Attrition, Gender] There seems to be a relationship between Attrition and Gender. But is that relationship really worth our attention?
  37. Various Types of Hypothesis Test. With Assumption: t Test (2 groups), ANOVA Test (more than 2 groups). No Assumption: Wilcoxon Test (2 groups), Kruskal-Wallis Test (more than 2 groups). Assumption: The distribution of the original data is a Normal Distribution.
  38. Is that difference significant? Difference in Mean (Numerical): t Test, Kruskal-Wallis Test, Confidence Interval of Mean. Difference in Ratio (Categorical / Logical), e.g. Attrition by Gender: Is that difference significant?
  39. Is that difference significant? Difference in Mean (Numerical): t Test, Kruskal-Wallis Test, Confidence Interval of Mean. Difference in Ratio (Categorical / Logical): Confidence Interval of Ratio.
  40. Is that difference significant? Difference in Mean (Numerical): t Test, Kruskal-Wallis Test, Confidence Interval of Mean. Difference in Ratio (Categorical / Logical): Chi-Square Test, Confidence Interval of Ratio.
  41. Chi-Square Test: The Chi-Square Test tests whether a given pair of categorical variables is independent or not. Put another way, it lets us evaluate whether the relationship between two categorical variables is significant, that is, worth our attention.
  42. [Chart: Attrition, TRUE vs. FALSE counts (0 to 300) for Female and Male]
  43. Is this difference in the ratio between Female and Male worth our attention? Is it significant? [Chart: ratio of Attrition TRUE vs. FALSE (0% to 100%) for Female and Male]
  44. [Chart: ratio of Attrition (0% to 100%) for Female and Male] If there is no relationship between Attrition and Gender, then the ratio of Attrition for Female and the one for Male should be the same.
  45. If we assume that there is no difference between Female and Male in terms of Attrition, is the difference we are seeing in this data big enough to be considered ‘significant’? Or can this difference be explained as a marginal difference, or just as variation we could see with any sample of data?
  46. [Charts: Actual Data vs. Theoretical Data, ratio of Attrition (0% to 100%) for Female and Male] Is the difference in the actual data big enough compared to the one we would expect when there is no difference?
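The "Theoretical Data" side of that comparison can be sketched in Python: under the assumption of no relationship, the expected count for each cell is row total times column total divided by the grand total. The counts below are made up and are not the numbers behind the slides; `expected_counts` is a hypothetical helper name.

```python
def expected_counts(observed):
    """Counts we would expect if rows and columns were independent.

    observed maps row label -> {column label -> count}.
    """
    row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
    col_totals = {}
    for cols in observed.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / grand_total for c in col_totals}
        for r in observed
    }


# Made-up counts, not the numbers behind the slides.
observed = {
    "Female": {"TRUE": 30, "FALSE": 170},
    "Male": {"TRUE": 60, "FALSE": 240},
}
expected = expected_counts(observed)
for gender in observed:
    actual = observed[gender]["TRUE"] / sum(observed[gender].values())
    theoretical = expected[gender]["TRUE"] / sum(expected[gender].values())
    print(f"{gender}: actual {actual:.0%}, theoretical {theoretical:.0%}")
```

In the theoretical table both groups share the same ratio, which is exactly what "no relationship" means here.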
  47. P-Value: It indicates the probability of observing, by chance, a difference in the ratio of Attrition between Male and Female at least as large as the one in this data, assuming that there is no relationship between the Attrition and Gender variables. The P-Value here is 26%, which means that the difference we see here is not so surprising under that assumption. So the data does not contradict the assumption, and we can keep accepting it.
  48. If we assume there is no relationship between Attrition and Gender, the chance of observing the difference we see here would be about 26%. We don’t have enough evidence to conclude that there is a difference between Male and Female.
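For a 2x2 table such as Attrition by Gender, the Chi-Square statistic has 1 degree of freedom, and its P-Value can be computed with the standard library alone. The sketch below assumes no continuity correction, and the statistic 1.27 is hypothetical, picked only because it gives a P-Value near the 26% quoted above. The slides do not report the statistic itself.

```python
import math


def chi_square_p_value_df1(statistic):
    """P(chi-square with 1 degree of freedom >= statistic)."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical statistic, not taken from the slides.
print(f"{chi_square_p_value_df1(1.27):.2f}")

# A statistic of about 3.84 corresponds to the common 5% threshold.
print(f"{chi_square_p_value_df1(3.84):.2f}")
```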
  49. If we assume there is no relationship between Attrition and Job Role, the chance of observing the difference we see here would be close to 0%. The data contradicts the assumption, so we can reject the assumption and conclude that there is a relationship between Attrition and Job Role.
  50. The Chi-Square Test is not just a statistical test (hypothesis test). It can also be used to find what is special about the relationship between two variables, here Attrition and Job Role. In other words, it tells us which combinations of values from the two variables we should pay more attention to.
  51. Select ‘Create Chi-Square Test’ from the column header menu to create a Chi-Square Test under the Analytics view.
  52. Under the ‘Contribution (%)’ tab, we can see which combinations of values from the two variables contribute more to the Chi-Square value.
  53. The Chi-Square value is an indicator of how different what we observe here is from what we would expect if there were no relationship between the two variables (Job Role and Attrition). This means that the more a combination contributes to the Chi-Square value, the more it is worth our attention.
  54. The combination of TRUE (Attrition) and Sales Rep. (Job Role) contributes about 33.37% of the Chi-Square value.
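The contribution idea can be sketched in Python: each cell adds (observed - expected)² / expected to the Chi-Square value, and its contribution is that term divided by the total. The counts below are made up, so the percentages will not match the 33.37% above, and `chi_square_contributions` is a hypothetical helper name.

```python
def chi_square_contributions(observed):
    """Return (chi-square value, {(row, column): share of that value})."""
    row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
    col_totals = {}
    for cols in observed.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())

    cells = {}
    for r, cols in observed.items():
        for c, n in cols.items():
            expected = row_totals[r] * col_totals[c] / grand_total
            cells[(r, c)] = (n - expected) ** 2 / expected
    chi_square = sum(cells.values())
    return chi_square, {cell: v / chi_square for cell, v in cells.items()}


# Made-up counts for three job roles, not the numbers behind the slides.
observed = {
    "Sales Rep.": {"TRUE": 30, "FALSE": 50},
    "Manager": {"TRUE": 5, "FALSE": 95},
    "Research Scientist": {"TRUE": 45, "FALSE": 275},
}
chi_square, shares = chi_square_contributions(observed)
for cell, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(cell, f"{share:.1%}")
```

In this toy table the (Sales Rep., TRUE) cell comes out on top because its observed count is far above what independence would predict.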
  55. Under the Ratio tab, we can confirm that the ratio of TRUE (Attrition) for Sales Rep. is much higher than for any other Job Role.
  56. AUC (Area Under the Curve): A metric that indicates how well a given model can separate TRUE data from FALSE data. It can be used to see the strength of the association between two given variables. * Under the Correlation Mode, this AUC is based on a logistic regression model that is built for each variable.
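One way to read AUC is as the probability that a randomly chosen TRUE row gets a higher predicted score than a randomly chosen FALSE row. The Python sketch below computes it that way by comparing every TRUE/FALSE pair. The scores and labels are made up, and this is not necessarily how the tool computes it internally.

```python
def auc(scores, labels):
    """Probability that a random TRUE row scores higher than a random FALSE row.

    Ties count as half. This pairwise form equals the area under the ROC curve.
    """
    true_scores = [s for s, y in zip(scores, labels) if y]
    false_scores = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))


# Made-up predicted probabilities and outcomes.
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
labels = [True, True, False, True, False, False]
print(auc(scores, labels))
```

A value of 0.5 means the scores do not separate the two groups at all, and 1.0 means they separate them perfectly.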
  57. All the variables are sorted by AUC values. This means that we see the variables that have a stronger relationship with the Attrition variable first.
  58. You can create a Logistic Regression model to investigate the relationship between Attrition and Job Role further.
  59. The first thing you see is a tab called ‘Prediction’, where you can see the predicted probabilities of Attrition for each Job Role.
  60. It shows the probability of Attrition for each Job Role. Gray indicates the actual ratio of Attrition with the 95% confidence interval, and the blue dots indicate the predicted probability of Attrition being TRUE.
  61. The gray line indicates the actual ratio of Attrition being TRUE with the 95% confidence interval along Working Years, and the blue line indicates the predicted probability of Attrition being TRUE.
  62. [Diagram: Job Role, Working Years, Attrition] Both Job Role and Working Years have a relationship with Attrition.
  63. [Diagram: Job Role, Working Years, Attrition] Do Job Role and Working Years have a relationship with each other?
  64. [Diagram: Job Role, Working Years, Attrition] Job Role and Working Years have some degree of relationship (association).
  65. [Diagram: Job Role, Working Years, Attrition] Or, is Job Role the one having an effect on Attrition?
  66. [Diagram: Job Role, Working Years, Attrition] Or, do both Working Years and Job Role have an effect on Attrition?
  67. [Diagram: Job Role, Working Years, Attrition] How can we estimate an independent (or isolated) effect for each variable?
  68. [Diagram: Job Role held constant; Working Years changes from 10 to 11; effect on Attrition?] We can allow only one of the variables to change its value while holding the other variable constant.
  69. [Diagram: Working Years held constant at 10; Job Role changes from Sales Rep. to Executive; effect on Attrition?] We can allow only one of the variables to change its value while holding the other variable constant.
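The "hold one variable constant" idea can be illustrated with a toy logistic model in Python. The coefficients below are hypothetical, chosen only for illustration, and are not fitted to the data behind the slides. The point is just that the model lets us change one input while the other stays fixed.

```python
import math

# Hypothetical coefficients for illustration only; not fitted to the slides' data.
INTERCEPT = -1.5
JOB_ROLE_EFFECT = {"Executive": 0.0, "Sales Rep.": 1.2}  # log-odds relative to Executive
WORKING_YEARS_EFFECT = -0.08  # change in log-odds per additional working year


def attrition_probability(job_role, working_years):
    """Predicted probability of Attrition being TRUE under the toy model."""
    log_odds = (
        INTERCEPT
        + JOB_ROLE_EFFECT[job_role]
        + WORKING_YEARS_EFFECT * working_years
    )
    return 1 / (1 + math.exp(-log_odds))


# Change Working Years from 10 to 11 while holding Job Role constant.
years_effect = attrition_probability("Sales Rep.", 11) - attrition_probability("Sales Rep.", 10)

# Change Job Role while holding Working Years constant at 10.
role_effect = attrition_probability("Sales Rep.", 10) - attrition_probability("Executive", 10)

print(f"one more working year: {years_effect:+.3f}")
print(f"Sales Rep. instead of Executive: {role_effect:+.3f}")
```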
  70. Select both Job Role and Working Years, and select ‘Create Logistic Regression’ from the column header menu.
  71. It creates a new Logistic Regression model with 2 predictor variables, Job Role and Working Years.
  72. When we add Working Years to the model, the predicted probability of Attrition for Sales Rep. becomes lower than the actual ratio. This is because the effect of Working Years is separated out from the effect of being a Sales Rep.
  73. Being a Sales Rep. would increase the probability of leaving the company. But having fewer working years would also increase the probability.
  74. The average Working Years for Sales Rep. is 5 years, which is much shorter than for other Job Roles.
  75. The ratio of employees with less than 8 working years is much higher for Sales Rep. compared to other Job Roles.
  76. If Working Years stays constant, the probability of Attrition being TRUE for Sales Rep. is lower than the actual ratio, but it is still higher than for other Job Roles.
  77. The same goes for Working Years. When Job Role stays constant, the slope of the probability curve becomes more moderate. This is the effect of Working Years without the effect of Job Role.
  78. Now we know that both Job Role and Working Years have some degree of relationship with Attrition. Which of the 2 variables has the stronger relationship with Attrition? In other words, which of the 2 variables has more effect on Attrition?
  79. Under the Importance tab, we can see which variables have more effect on Attrition. It looks like Job Role has a lot more effect on Attrition.
  80. [Diagram: Job Role, Working Years, Attrition] Both Working Years and Job Role have effects on Attrition, but Job Role has more effect on Attrition.