Exploratory: Data Viz Workshop Part 3 - Visualizing Variance & Correlation

This is Part 3 of the Data Visualization Workshop. In this seminar, we focus on how to visualize variance and correlation effectively.

* Understanding Variance with Histogram, Density, Boxplot, Violin charts
* Understanding Correlation with Scatter, Boxplot charts

Kan Nishida

May 13, 2020

Transcript

  1. EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance &

    Correlation
  2. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  3. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  4. Data Science Workflow [Diagram labels: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication, Data Analysis.]
  5. Exploratory: Modern & Simple UI [Diagram labels: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides), Data Analysis.]
  6. EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance &

    Correlation
  7. Data Visualization Workshop • Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization
  8. Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance: with Histogram, with Density Plot, with Bar Chart • Visualize Association and Correlation: with Boxplot, with Violin Plot, with Scatter Plot, with Stack Bar Chart
  10. EDA (Exploratory Data Analysis): an exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.
  11. EDA (Exploratory Data Analysis) [Diagram labels: Questions, Answers, Data, Hypotheses.]

  12. "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." (John Tukey)
  13. 2 Principal Questions for EDA • How do the variables vary? • How are the variables associated or correlated with one another?
  14. Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

  15. Sample Data

  16. Employee Data

  18. Import Data: 1. Open Data Catalog 2. Find ‘HR Employee’ Data 3. Import the Data
  19. Select ‘Data Catalog’ from the Data Frame menu.

  20. Type ‘employee’ to search for the ‘HR Employee Attrition’ data.

  21. Click the Save button to save the data.

  22. Once the data is imported, the Summary view automatically generates a chart for each column along with metrics to describe the data.
  23. Each row represents one of the 1,470 employees. There are 27 variables describing each employee.
  24. We’ll focus on Monthly Income.

  25. Questions • How does Monthly Income vary? • What makes it vary? • What is correlated with Monthly Income, and how?
  26. We can see that the average Monthly Income is $6,503, and it ranges from $1,009 to $19,999. That’s a huge gap!
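
The workshop reads these metrics off the Summary view in the Exploratory UI. As a rough illustration of what they mean, here is a minimal Python sketch that computes the same kind of summary. The income values and the `summarize` helper are made up for the example and are not the HR Employee Attrition data.

```python
from statistics import mean

# Hypothetical monthly incomes; NOT the workshop's HR Employee Attrition data.
monthly_income = [1009, 2500, 4200, 6500, 19999]


def summarize(values):
    """Return the mean, minimum, maximum, and range of a list of numbers."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(monthly_income)
print(summary)
```

The gap between `min` and `max` is the "huge gap" the slide points at. The rest of the workshop is about looking inside that gap.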
  28. [Figure: the variance of Monthly Income, ranging from $1,009 to $19,999, with the average at $6,503.]

  29. Questions for Variance • What are the typical values? • Are there any outliers compared to the general trend in the variance? • How is the data distributed? • Are there any patterns you can spot in the variance?
  30. Charts to visualize the variance: Histogram, Density Plot, Bar Chart.
  31. How do we pick which chart to use?

  32. It depends on the data type.

  33. Continuous Data vs. Categorical Data

  35. • Numerical - Numeric, Integer, Double • Date/Time - Date, POSIXct
  36. Numerical: continuous and ordinal relationship among values. [Figure: a number line from 0 to 50 with example values 11, 22, and 45.]
  37. Visualize the Variance of Numerical Variables: Histogram, Density Plot, Bar Chart
  38. Histogram, Density Plot, Bar Chart

  39. A histogram splits numerical values into a set of ‘bins’ with equal range, and it shows the size (or number of rows) for each ‘bin’.
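
The binning idea can be sketched in a few lines of Python. This is only an illustration of equal-width binning, not how Exploratory implements its histogram. The function name `histogram_counts` and the sample incomes are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, n_bins):
    """Split values into n_bins equal-width bins and count the rows per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value sits on the right edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical incomes, for illustration only.
incomes = [1000, 1500, 2000, 2500, 6000, 6500, 11000, 19000, 20000]
print(histogram_counts(incomes, 4))
```

Raising `n_bins` corresponds to increasing the number of bars in the chart: the bins get narrower, and finer patterns can appear.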
  40. Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?
  42. Create a histogram by assigning the ‘Monthly Income’ column to X-Axis.
  43. Increase the number of bars by setting it to 50.

  44. Increase the number of bars to 100 to see if any new patterns can be spotted.
  45. There seem to be a few different groups.

  48. What makes the Monthly Income vary? [Figure: the average is marked at $6,503.]

  50. Assign the ‘Gender’ column to Color.

  51. There is no clear difference between Female and Male.

  53. The Monthly Income range for Manager seems to be higher, while the ranges for Sales Rep and Research Scientist are lower.
  54. But it’s hard to see the differences among the Job Roles because they are drawn on top of each other.
  55. Histogram, Density Plot, Bar Chart

  56. A density plot: • Draws a smooth curve to visualize the distribution of data. • The height shows the estimated data density at any given point. • The total area under each curve is 1.
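
To make the "area under the curve is 1" point concrete, here is a small Python sketch of a Gaussian kernel density estimate. The choice of a Gaussian kernel, the `bandwidth` value, and the sample data are all assumptions for illustration. The slides do not say which estimator Exploratory uses.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical sample values (for example, income in thousands).
samples = [2.0, 2.5, 3.0, 6.0, 6.5, 12.0]
density = gaussian_kde(samples, bandwidth=1.0)

# Approximate the area under the curve with a Riemann sum from -10 to 30.
step = 0.01
area = sum(density(-10 + i * step) * step for i in range(int(40 / step)))
print(round(area, 3))
```

Because every curve is normalized to the same total area, groups of very different sizes can be compared on one chart, which is part of why the density plot reads more easily than overlapping histograms.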
  58. Create a Density Plot by selecting the ‘Density Plot’ chart type and assigning the ‘Monthly Income’ column to X-Axis.
  59. Assign the ‘Job Role’ column to Color to create a line for each Job Role.
  60. It’s easier to see the differences among Job Roles than with the Histogram.
  61. Density Plot vs. Histogram: the same data variance is visualized in different ways.
  62. Density Plot: estimated distribution based on the actual values. Histogram: actual distribution.
  63. Continuous Data vs. Categorical Data

  64. Categorical (e.g., California, Texas, New York, Florida, Oregon) • No continuous relationship • Limited set of values • Ordinal relationship is NOT necessary
  65. Histogram, Density Plot, Bar Chart

  66. Questions 1. How does Job Role vary? 2. Are there any differences in the Job Role variance between Male and Female? If so, how?
  68. Create a bar chart by selecting the ‘Bar’ chart type and assigning the ‘Job Role’ column to X Axis.
  69. Sort the bars based on the Y Axis values.
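
For a categorical column, the bar chart is essentially a count per category, sorted by size. A minimal Python sketch of that calculation follows. The job role values are invented for the example.

```python
from collections import Counter

# Hypothetical Job Role column, for illustration only.
job_roles = [
    "Sales Executive",
    "Manager",
    "Sales Executive",
    "Research Scientist",
    "Sales Executive",
    "Research Scientist",
]

# most_common() returns (category, count) pairs sorted by count, largest first.
sorted_counts = Counter(job_roles).most_common()
for role, count in sorted_counts:
    print(f"{role}: {count}")
```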

  71. Assign the ‘Gender’ column to Color.

  72. Switch the bar chart to side-by-side mode.

  73. Some Job Roles, such as Sales Executive and Research Scientist, have a lot more males than females, while there isn’t much difference for Manufacturing Director.
  76. Association and Correlation: a relationship where changes in one variable happen together with changes in another variable according to a certain rule.
  77. Association: any type of relationship between two variables. Correlation: a certain type of (usually linear) association between two variables.
  78. Association: Monthly Income variances are different among countries. [Figure: Monthly Income (0 to 5000) by Country (US, UK, Japan).]
  79. Correlation: the bigger the Age is, the bigger the Monthly Income is. [Figure: Monthly Income vs. Age.]
  80. Correlation ranges from -1 (Strong Negative Correlation) through 0 (No Correlation) to 1 (Strong Positive Correlation). [Figure: a scale marked at -1, -0.5, 0, 0.5, and 1.]
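
The -1 to 1 scale is the range of a correlation coefficient. As a sketch, assuming the Pearson coefficient is the measure in question (the slides do not name one), the calculation looks like this in Python. The `pearson` helper and the data are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: income rising with age, and income falling with age.
age = [25, 30, 35, 40, 45]
rising_income = [2000, 3000, 4000, 5000, 6000]
falling_income = [6000, 5000, 4000, 3000, 2000]

print(pearson(age, rising_income))   # close to 1: strong positive correlation
print(pearson(age, falling_income))  # close to -1: strong negative correlation
```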
  81. Why are Association & Correlation important?

  82. [Figure: the variance of Monthly Income from $1,000 to $20,000, with the Average (Mean) marked.]

  83. [Figure: the variance of Monthly Income from $1,000 to $20,000.]

  84. Variance and Uncertainty: how much would the income be in this company? [Figure: Monthly Income from $1,000 to $20,000.]
  85. If we can find a correlation between Monthly Income and Working Years… [Figure: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30).]
  86. If Working Years is 20 years, Monthly Income would be around $15,000. [Figure: Monthly Income vs. Working Years, with $15,000 marked at 20 Working Years.]
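
One way to picture "reading off" an income from working years is a straight-line fit. The sketch below uses ordinary least squares on made-up points that happen to lie on a line. The `fit_line` helper and the numbers are assumptions chosen to echo the slide's example, not values from the dataset.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (working years, monthly income) observations.
working_years = [0, 10, 30]
monthly_income = [1000, 8000, 22000]

slope, intercept = fit_line(working_years, monthly_income)
predicted = slope * 20 + intercept  # estimate for 20 working years
print(round(predicted))
```

With real data the points scatter around the line, so the estimate is a narrower range rather than a single exact number.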
  87. Correlation reduces the uncertainty caused by Variance. [Figure: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30), with $15,000 marked.]
  88. Association reduces the uncertainty caused by Variance. [Figure: Monthly Income by country (US, UK, Japan).]
  89. If we can find strong correlations, it makes it easier

    to explain how Monthly Income changes and to predict what Monthly Income will be.
  90. Correlation (or Association) is not equal to Causation. Causation is

    a special type of Correlation. If we can confirm a given Correlation is Causation, then we can control the outcome.
  91. But, Causation is not a topic here. We’ll focus primarily

    on Association and Correlation.
  92. Visualizing Association and Correlation

  93. How do we choose the right charts?

  94. It depends on the combination of data types.

  95. Combination of Data Types • Category vs. Numerical • Numerical vs. Numerical • Category vs. Category
  97. Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

  99. Boxplot

  100. Boxplot • Displays the distribution of numerical values by Category

    • Y Axis represents range of values, X Axis represents each Category
  102. Separate the data into 4 groups (quartiles) so that each group has an equal size.
  103. 3Q (3rd Quartile / 75th Percentile) • 2Q (2nd Quartile / 50th Percentile / Median) • 1Q (1st Quartile / 25th Percentile)
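
The quartile values behind a boxplot can be computed with Python's standard `statistics.quantiles`. This is a sketch on invented incomes. Tools differ in how they interpolate quartiles, so the `method="inclusive"` choice here is an assumption and may not match Exploratory's numbers exactly.

```python
from statistics import quantiles

# Hypothetical monthly incomes, for illustration only.
incomes = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = quantiles(incomes, n=4, method="inclusive")

print("Min:", min(incomes))
print("1Q (25th percentile):", q1)
print("2Q (median):", q2)
print("3Q (75th percentile):", q3)
print("Max:", max(incomes))
```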
  105. [Figure: a boxplot labeled 3Q, Median, and 1Q.]

  106. [Figure: a boxplot labeled Max, 3Q, Median, 1Q, and Min.]

  107. Questions 1. How is the variance of Monthly Income associated with Job Role? 2. How is the variance of Monthly Income associated with Gender? 3. Are there any differences in the Monthly Income variation between Male and Female in each Job Role?
  109. Monthly Income (Numeric) vs. Job Role (Category)

  110. Monthly Income (Numeric) vs. Gender (Category)

  111. Job Role (Category) vs. Monthly Income (Numeric) by Gender (Category)

  113. Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

  114. Scatter Plot

  115. Each data point (row) is positioned at the intersection of two numeric variables. [Figure: a scatter plot with a Numeric variable on each axis.]
  117. Questions 1. How is the variance of Monthly Income correlated with the variance of Age? 2. How is the variance of Monthly Income correlated with the variance of Total Working Years? 3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?
  119. Monthly Income (Numeric) vs. Age (Numeric)

  122. Check the strength of correlation.

  124. Monthly Income (Numeric) vs. Total Working Years (Numeric)

  125. Check the strength of correlation.

  127. For example, do Sales Executive and Human Resources have the same kind of correlation between Monthly Income and Total Working Years?
  128. Assign the ‘Job Role’ column to Repeat By to create

    the scatter chart for each of the job roles.
  129. The strength of correlation varies among Job Roles.

  130. From Numerical vs. Numerical to Numerical vs. Categorical: we can categorize numerical values.
  132. Switch the chart type to Boxplot.

  133. Click on the green text to change the number of

    categories.
  134. Type 10 to split the Total Working Years into 10

    groups.
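
Categorizing a numeric column and then summarizing another column per group can be sketched as follows. Equal-width grouping is an assumption here, since the slides do not state how Exploratory forms the groups. The `bucket_medians` helper and the data are hypothetical, and the example uses 2 groups instead of 10 to stay small.

```python
from collections import defaultdict
from statistics import median


def bucket_medians(xs, ys, n_groups):
    """Group ys by equal-width buckets of xs and return the median y per bucket."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_groups  # assumes hi > lo
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        # Clamp the maximum x into the last bucket.
        idx = min(int((x - lo) / width), n_groups - 1)
        groups[idx].append(y)
    return {idx: median(vals) for idx, vals in sorted(groups.items())}


# Hypothetical (total working years, monthly income) rows.
years = [0, 1, 2, 5, 6, 10]
income = [1000, 1200, 1400, 3000, 3200, 6000]

medians = bucket_medians(years, income, n_groups=2)
print(medians)
```

A boxplot per bucket shows the full distribution rather than only the median, but the grouping step is the same.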
  135. We can see that there is a huge jump in Monthly Income after the 20th year of working in this company.
  136. Also, up to the 16th year, the income increases as the working years increase. But after the 20th year, there is no obvious correlation between the working years and the income.
  137. We can assign the ‘Gender’ column to Color to see if there is any gender gap in the relationship between the income and the working years.
  138. We can see that Female seems to have a higher income range than Male up to the 16th year, but after that Male seems to have a higher income than Female.
  139. Combination of Data Types • Numerical vs. Categorical • Numerical vs. Numerical • Categorical vs. Categorical
  140. Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

  141. Category vs. Category: calculate the size (number of rows) for each pair and/or the ratio against the total size.
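
Counting rows per pair of categories, and the ratio of each pair against the total, can be sketched like this in Python. The rows are invented for the example.

```python
from collections import Counter

# Hypothetical (Job Role, Education) rows, for illustration only.
rows = [
    ("Manager", "Bachelor"),
    ("Manager", "Master"),
    ("Sales Rep", "Bachelor"),
    ("Sales Rep", "Bachelor"),
]

# Size (number of rows) for each pair of categories.
pair_counts = Counter(rows)

# Ratio of each pair against the total number of rows.
total = len(rows)
pair_ratio = {pair: count / total for pair, count in pair_counts.items()}

for pair, count in pair_counts.items():
    print(pair, count, f"{pair_ratio[pair]:.0%}")
```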
  142. Questions 1. How is Job Role associated with Education? 2. How is Job Role associated with Attrition? 3. How is Job Role associated with Monthly Income?
  144. Education (Category) vs. Job Role (Category)

  147. • Logical is a special case of Categorical. • It

    can have only two unique values, TRUE or FALSE.
  148. Categorical (e.g., California, Texas, New York, Florida, Oregon) • No continuous relationship • No ordinal relationship is necessary
  149. Logical: either TRUE or FALSE. Example: Is she Japanese? TRUE or FALSE.

  150. Job Role (Category) vs. Attrition (Logical)

  151. Visualize the relationship 1. between Job Role and Education. 2. between Job Role and Attrition. 3. between Job Role and Monthly Income.
  153. From Numerical to Categorical: we can categorize numerical values.

  154. Select a bar chart and assign the ‘Job Role’ column

    to X Axis.
  155. Assign the ‘Monthly Income’ column to Color, which will automatically

    categorize the values of the Monthly Income into 5 groups.
  156. Change the Y-Axis to a ratio (%) by using the ‘% of’ Window Calculation. Select ‘Window Calculation’ from the Y Axis menu.
  157. Select ‘% of’ for the Calculation Type.
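
A stacked bar with a percentage Y axis shows, for each bar, what share each color group holds. The sketch below assumes the percentage is taken within each Job Role (each bar sums to 100%). That reading of the ‘% of’ calculation is an assumption, and the `percent_within_group` helper and the income band labels are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (Job Role, income band) rows, for illustration only.
rows = [
    ("Manager", "High"),
    ("Manager", "High"),
    ("Manager", "Low"),
    ("Sales Rep", "Low"),
    ("Sales Rep", "Low"),
    ("Sales Rep", "High"),
    ("Sales Rep", "Low"),
]


def percent_within_group(rows):
    """Return, per role, the percentage of rows that fall in each band."""
    role_totals = Counter(role for role, _ in rows)
    pair_counts = Counter(rows)
    result = defaultdict(dict)
    for (role, band), count in pair_counts.items():
        result[role][band] = 100.0 * count / role_totals[role]
    return dict(result)


percentages = percent_within_group(rows)
print(percentages)
```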

  158. Manager and Research Director have a higher ratio of higher-income people. Surprised? ;)
  159. Q & A

  160. Next Seminar

  161. Data Visualization Workshop • Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty - 6/10 (Wed) • Part 5 - Data Wrangling for Data Visualization
  162. Information • Email: kan@exploratory.io • Website: https://exploratory.io • Twitter: @KanAugust • Training: https://exploratory.io/training

  163. EXPLORATORY