Exploratory: Data Viz Workshop Part 3 - Visualizing Variance & Correlation

This is Part 3 of the Data Visualization Workshop. In this seminar, we focus on how to visualize variance and correlation effectively.

* Understanding Variance with Histogram, Density, Boxplot, Violin charts
* Understanding Correlation with Scatter, Boxplot charts

Kan Nishida

May 13, 2020

Transcript

  1. 2.

    Speaker: Kan Nishida, CEO/co-founder, Exploratory (@KanAugust)

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. 3.

    The Third Wave

    Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. 4.

    Data Science Workflow

    [Diagram labels: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]
  4. 5.

    Exploratory: Modern & Simple UI

    [Diagram labels: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]
  5. 7.

    Data Visualization Workshop

    • Part 1 - Basics: Visualizing Summarized Data
    • Part 2 - Visualizing Time Series Data
    • Part 3 - Visualizing Variance & Correlation
    • Part 4 - Visualizing Uncertainty
    • Part 5 - Data Wrangling for Data Visualization
  6. 8.

    Agenda

    • 2 Principal Questions for EDA (Exploratory Data Analysis)
    • Visualize Variance
      • with Histogram
      • with Density Plot
      • with Bar Chart
    • Visualize Association and Correlation
      • with Boxplot
      • with Violin Plot
      • with Scatter Plot
      • with Stacked Bar Chart
  8. 10.

    EDA (Exploratory Data Analysis)

    An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  9. 12.

    "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."

    - John Tukey
  10. 13.

    2 Principal Questions for EDA

    • How do the variables vary?
    • How are the variables associated or correlated with one another?
  11. 22.

    Once the data is imported, the Summary view automatically generates

    a chart for each column along with metrics to describe the data.
  12. 23.

    Each row represents one of the 1,470 employees. There are 27 variables describing each employee.
  13. 25.

    Questions

    • How does Monthly Income vary?
    • What makes it vary?
    • What is correlated with Monthly Income, and how?
  14. 26.

    We can see that the average Monthly Income is $6,503, and it ranges from $1,009 to $19,999. That’s a huge gap!
  16. 29.

    Questions for Variance

    • What are the typical values?
    • Are there any outliers compared to the general trend in the variance?
    • How is the data distributed?
    • Are there any patterns you can spot in the variance?
  17. 36.

    Numerical

    [Figure: number line with ticks at 0, 10, 20, 30, 40, 50 and values 11, 22, 45] Continuous and ordinal relationship among values.
  18. 39.

    A histogram splits numerical values into a set of ‘bins’ of equal range and shows the size (the number of rows) of each ‘bin’.
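
The binning described above can be sketched in a few lines of Python. This is only an illustration of equal-range bins and per-bin row counts, not how Exploratory computes its histogram; the function name `histogram_counts` and the income values are made up for the example.

```python
def histogram_counts(values, num_bins):
    """Split values into num_bins equal-width bins and count the rows per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything lands in a single bin.
        return [lo, hi], [len(values)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Made-up monthly incomes for illustration only (not the workshop data).
incomes = [1000, 1500, 2200, 2500, 3100, 4800, 5200, 9000, 15000, 19000]
edges, counts = histogram_counts(incomes, 4)

for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:>8.0f} - {right:>8.0f}: {'#' * n} ({n})")
```

Raising `num_bins` narrows each bin, which is the same effect as increasing the number of bars in the chart.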
  19. 40.

    Questions

    1. How does Monthly Income vary?
    2. Are there any differences in the Monthly Income variance between Male and Female? If so, how?
    3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?
  21. 44.

    Increase the number of bars to 100 to see if any new patterns can be spotted.
  26. 53.

    The Monthly Income range for Manager seems to be higher, while the ranges for Sales Rep and Research Scientist are lower.
  27. 54.

    But it’s hard to see the differences among the Job Roles because their histograms are drawn on top of each other.
  28. 56.

    Density Plot

    • Draws a smooth curve to visualize the distribution of data.
    • The height shows the estimated data density at any given point.
    • The total area under each curve is 1.
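
One common way to draw such a smooth curve is a Gaussian kernel density estimate. The sketch below assumes that technique (the slides do not say which estimator Exploratory uses), with made-up values and a hand-picked bandwidth, and checks numerically that the area under the curve is close to 1.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a density function built from one Gaussian bump per value."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Made-up values; the bandwidth is picked by hand for illustration.
incomes = [2000, 2500, 3000, 5000, 10000]
density = gaussian_kde(incomes, bandwidth=1000.0)

# Approximate the area under the curve with the trapezoid rule
# over a range wide enough to cover the tails of every bump.
step = 50.0
xs = [-5000.0 + i * step for i in range(int(25000 / step) + 1)]
area = sum((density(a) + density(b)) / 2.0 * step for a, b in zip(xs, xs[1:]))
print(f"area under the curve: {area:.4f}")
```

Because every curve integrates to 1, density plots for groups of different sizes can be compared by shape rather than by row count.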
  30. 58.

    Create a Density Plot by selecting the ‘Density Plot’ chart type and assigning the ‘Monthly Income’ column to X-Axis.
  32. 64.

    Categorical

    Example values: California, Texas, New York, Florida, Oregon

    • No continuous relationship
    • Limited set of values
    • Ordinal relationship is NOT necessary
  33. 66.

    Questions

    1. How does Job Role vary?
    2. Are there any differences in the Job Role variance between Male and Female? If so, how?
  35. 68.

    Create a bar chart by selecting the ‘Bar’ chart type and assigning the ‘Job Role’ column to X Axis.
  37. 73.

    Some Job Roles, such as Sales Executive and Research Scientist, have many more males than females, while there isn’t much difference for Manufacturing Director.
  40. 76.

    Association and Correlation

    A relationship in which changes in one variable happen together with changes in another variable according to a certain rule.
  41. 77.

    Association: Any type of relationship between two variables.

    Correlation: A certain type of (usually linear) association between two variables.
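
The distinction can be made concrete with the Pearson correlation coefficient, which measures only the linear part of a relationship. The sketch below uses made-up Age and Monthly Income values: a straight-line pattern scores close to 1, while a U-shaped pattern is clearly an association yet scores close to 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (not the workshop data).
ages = [25, 30, 35, 40, 45, 50]
linear_income = [2000, 3000, 4000, 5000, 6000, 7000]  # straight line
curved_income = [7000, 4000, 2000, 2000, 4000, 7000]  # U-shape

r_linear = pearson_r(ages, linear_income)
r_curved = pearson_r(ages, curved_income)
print(f"linear pattern:   r = {r_linear:.3f}")
print(f"U-shaped pattern: r = {r_curved:.3f}")
```

This is why a chart is worth drawing before relying on a single coefficient: the U-shape is obvious to the eye but invisible to `pearson_r`.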
  42. 78.

    Association

    [Chart: Monthly Income (0 to 5000) by Country (US, UK, Japan)] Monthly Income variances differ among countries.
  43. 79.

    Correlation

    [Chart: Monthly Income vs. Age] The higher the Age, the higher the Monthly Income.
  44. 85.

    [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)] If we can find a correlation between Monthly Income and Working Years…
  45. 86.

    [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)] If Working Years is 20, Monthly Income would be around $15,000.
  46. 87.

    [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30), annotated with Correlation and Variance] Correlation reduces the uncertainty caused by Variance.
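
One way to see this numerically is to fit a straight line and compare the variance of Monthly Income before and after accounting for Working Years. The sketch below assumes an ordinary least-squares line and uses made-up values; the helper names `fit_line` and `variance` are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


def variance(values):
    """Population variance: mean squared distance from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Made-up values for illustration only (not the workshop data).
working_years = [0, 5, 10, 15, 20, 25, 30]
monthly_income = [1500, 4000, 8500, 11000, 15500, 17000, 20000]

slope, intercept = fit_line(working_years, monthly_income)
residuals = [
    y - (slope * x + intercept) for x, y in zip(working_years, monthly_income)
]

total_variance = variance(monthly_income)   # uncertainty without Working Years
leftover_variance = variance(residuals)     # uncertainty once it is known
predicted_at_20 = slope * 20 + intercept

print(f"variance of income:       {total_variance:,.0f}")
print(f"variance around the line: {leftover_variance:,.0f}")
print(f"predicted income at 20 years: {predicted_at_20:,.0f}")
```

The stronger the correlation, the smaller the leftover variance, and the narrower the range of plausible incomes for a given number of working years.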
  47. 88.

    [Chart: Monthly Income by Country (US, UK, Japan), annotated with Association and Variance] Association reduces the uncertainty caused by Variance.
  48. 89.

    If we can find strong correlations, it makes it easier

    to explain how Monthly Income changes and to predict what Monthly Income will be.
  49. 90.

    Correlation (or Association) is not equal to Causation. Causation is

    a special type of Correlation. If we can confirm a given Correlation is Causation, then we can control the outcome.
  50. 95.

    Combination of Data Types

    • Category vs. Numerical
    • Numerical vs. Numerical
    • Category vs. Category
  53. 100.

    Boxplot

    • Displays the distribution of numerical values by Category
    • Y Axis represents the range of values, X Axis represents each Category
  55. 103.

    • 3Q (3rd Quartile / 75th Percentile)
    • 2Q (2nd Quartile / 50th Percentile / Median)
    • 1Q (1st Quartile / 25th Percentile)
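
The three quartiles that define the box can be computed per category with Python's standard `statistics.quantiles`. This is a sketch on made-up incomes; tools interpolate quartiles in slightly different ways, so the exact 1Q and 3Q values may not match what a given charting tool displays.

```python
import statistics

# Made-up monthly incomes per job role (illustrative only).
income_by_role = {
    "Manager": [15000, 16500, 17000, 18000, 19500],
    "Sales Rep": [1500, 2000, 2500, 2800, 3500],
}

# For each role, quantiles(n=4) returns the three cut points [1Q, 2Q, 3Q].
summary = {
    role: statistics.quantiles(incomes, n=4)
    for role, incomes in income_by_role.items()
}

for role, (q1, q2, q3) in summary.items():
    print(f"{role}: 1Q={q1:,.0f}  median={q2:,.0f}  3Q={q3:,.0f}  IQR={q3 - q1:,.0f}")
```

When the box of one category sits entirely above the box of another, as it does for these toy values, the category is strongly associated with the numerical variable.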
  57. 107.

    Questions

    1. How is the variance of Monthly Income associated with Job Role?
    2. How is the variance of Monthly Income associated with Gender?
    3. Are there any differences in the Monthly Income variation between Male and Female in each Job Role?
  60. 115.

    [Chart: Numeric vs. Numeric] Each data point (row) is positioned at the intersection of two numeric variables.
  61. 117.

    Questions

    1. How is the variance of Monthly Income correlated with the variance of Age?
    2. How is the variance of Monthly Income correlated with the variance of Total Working Years?
    3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?
  67. 127.

    For example, do Sales Executive and Human Resources have the same kind of correlation between Monthly Income and Total Working Years?
  68. 128.

    Assign the ‘Job Role’ column to Repeat By to create

    the scatter chart for each of the job roles.
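
The numerical counterpart of repeating the chart per job role is to split the rows by role and compute the correlation within each group. The sketch below does that with made-up rows; the values are chosen so that the two roles behave differently and say nothing about the real dataset.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up rows: (job role, total working years, monthly income).
rows = [
    ("Sales Executive", 2, 4000), ("Sales Executive", 8, 6500),
    ("Sales Executive", 15, 9000), ("Sales Executive", 22, 12500),
    ("Human Resources", 3, 3000), ("Human Resources", 9, 3200),
    ("Human Resources", 16, 2900), ("Human Resources", 24, 3100),
]

# Group the two numeric columns by job role.
groups = defaultdict(lambda: ([], []))
for role, years, income in rows:
    groups[role][0].append(years)
    groups[role][1].append(income)

r_by_role = {role: pearson_r(xs, ys) for role, (xs, ys) in groups.items()}
for role, r in r_by_role.items():
    print(f"{role}: r = {r:.2f}")
```

A pooled correlation over all rows can hide such differences, which is the reason for looking at each group separately.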
  70. 135.

    We can see that there is a huge jump in Monthly Income after the 20th year of working in this company.
  71. 136.

    Also, up to the 16th year, the income increases as the working years increase. But after the 20th year, there is no obvious correlation between the working years and the income.
  72. 137.

    We can assign the ‘Gender’ column to Color to see

    if there is any gender gap in terms of the relationship between the income and the working years.
  73. 138.

    We can see that Female seems to have a higher income range than Male up to the 16th year, but after that Male seems to have a higher income than Female.
  74. 139.

    Combination of Data Types

    • Numerical vs. Categorical
    • Numerical vs. Numerical
    • Categorical vs. Categorical
  75. 141.

    Category vs. Category

    Calculate the size (number of rows) for each pair and/or calculate the ratio against the total size.
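
Counting rows per pair of categories is a cross-tabulation. A minimal sketch with Python's `collections.Counter`, using made-up (Job Role, Attrition) pairs rather than the workshop data:

```python
from collections import Counter

# Made-up rows: one (job role, attrition) pair per employee.
rows = [
    ("Manager", "Yes"), ("Manager", "No"), ("Manager", "No"), ("Manager", "No"),
    ("Sales Rep", "Yes"), ("Sales Rep", "Yes"), ("Sales Rep", "No"), ("Sales Rep", "No"),
]

# Size (number of rows) for each pair of categories.
pair_counts = Counter(rows)

# Ratio of each pair against the total number of rows.
total = len(rows)
ratio_of_total = {pair: n / total for pair, n in pair_counts.items()}

for (role, attrition), n in sorted(pair_counts.items()):
    share = ratio_of_total[(role, attrition)]
    print(f"{role:10s} {attrition:3s} {n} rows ({share:.1%} of total)")
```

The counts are what a stacked bar chart draws as bar segments; the ratios matter when the categories have very different sizes.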
  76. 142.

    Questions

    1. How is Job Role associated with Education?
    2. How is Job Role associated with Attrition?
    3. How is Job Role associated with Monthly Income?
  80. 147.

    • Logical is a special case of Categorical.
    • It can have only two unique values, TRUE or FALSE.
  81. 151.

    Visualize the relationship

    1. between Job Role and Education.
    2. between Job Role and Attrition.
    3. between Job Role and Monthly Income.
  83. 155.

    Assign the ‘Monthly Income’ column to Color, which will automatically

    categorize the values of the Monthly Income into 5 groups.
  84. 156.

    Set the Y-Axis to Ratio (%) by using the ‘% of’ Window Calculation. Select ‘Window Calculation’ from the Y Axis menu.
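
The two steps above, grouping a numeric column into 5 buckets and then showing each bucket as a percentage within its bar, can be sketched as follows. The slides do not state how Exploratory chooses the 5 groups or how the ‘% of’ calculation is defined, so this example assumes equal-width groups and a percentage within each Job Role; the helper `equal_width_group` and the rows are made up.

```python
from collections import Counter, defaultdict


def equal_width_group(value, lo, hi, num_groups=5):
    """Return a group number from 1 to num_groups using equal-width ranges."""
    width = (hi - lo) / num_groups
    # The maximum value is kept in the last group.
    return min(int((value - lo) / width), num_groups - 1) + 1


# Made-up rows: (job role, monthly income).
rows = [
    ("Manager", 19000), ("Manager", 16000), ("Manager", 11000), ("Manager", 18500),
    ("Sales Rep", 1500), ("Sales Rep", 2500), ("Sales Rep", 5000), ("Sales Rep", 1000),
]
incomes = [income for _, income in rows]
lo, hi = min(incomes), max(incomes)

# Count rows per (job role, income group).
counts = defaultdict(Counter)
for role, income in rows:
    counts[role][equal_width_group(income, lo, hi)] += 1

# Convert the counts into a percentage within each job role.
percent = {
    role: {group: 100.0 * n / sum(c.values()) for group, n in sorted(c.items())}
    for role, c in counts.items()
}

for role, shares in percent.items():
    print(role, shares)
```

With percentages, every bar has the same height (100%), so the income mix of a small job role can be compared directly with that of a large one.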
  86. 161.

    Data Visualization Workshop

    • Part 1 - Basics: Visualizing Summarized Data
    • Part 2 - Visualizing Time Series Data
    • Part 3 - Visualizing Variance & Correlation
    • Part 4 - Visualizing Uncertainty - 6/10 (Wed)
    • Part 5 - Data Wrangling for Data Visualization