Data Visualization Workshop Part 4 - Visualizing Uncertainty

This is part of the Data Visualization Workshop series. In this seminar, we focus on how to visualize uncertainty with the Error Bar chart.

* Introduction to Error Bar chart
* Introduction to Confidence Interval

Kan Nishida

June 17, 2020

Transcript

  1. EXPLORATORY Seminar #29 Viz Workshop - Part4 Visualizing Uncertainty

  2. Speaker: Kan Nishida (@KanAugust), CEO/co-founder of Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

  3. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

  4. Data Science Workflow. [Diagram labels: Questions; Communication; Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning); Data Analysis]

  5. Exploratory: Modern & Simple UI. [Diagram labels: Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning); Data Analysis]

  6. EXPLORATORY Seminar #29 Viz Workshop - Part4 Visualizing Uncertainty

  7. I have delivered an awesome presentation.

  8. I asked the audience to rate it. [Rating scale: 1 (Very Bad) to 5 (Very Good)]

  9. [Chart: ratings 1 to 5] Average score: 3.6

  10. I have done another awesome presentation the next day.

  11. [Chart: ratings 1 to 5] Average score: 3.4

  12. I have done another awesome presentation the next day again!

  13. [Chart: ratings 1 to 5] Average score: 3.3

  14. 3.3 3.4 3.6 3.5 Which is my real score?

  15. • The numbers vary.
      • The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.

  16. Values: 26, 24, 32, 28, 20, 30, 22. Mean: 26

  17. Values: 26, 24, 32, 28, 20, 30, 22. Mean: 26, Median: 26

  18. Values: 26, 24, 60, 28, 20, 30, 22. Mean: 30, Median: 26
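
The mean and median figures on the three slides above can be checked with a short Python sketch. The variable names are only illustrative; the two lists are the values shown on the slides, where the second list replaces 32 with the extreme value 60.

```python
from statistics import mean, median

# Values from the slides; the second list swaps 32 for the extreme value 60.
values = [26, 24, 32, 28, 20, 30, 22]
values_with_extreme = [26, 24, 60, 28, 20, 30, 22]

summary = {
    "original": (mean(values), median(values)),
    "with extreme value": (mean(values_with_extreme), median(values_with_extreme)),
}

for label, (m, med) in summary.items():
    print(f"{label}: mean={m}, median={med}")
```

A single extreme value moves the mean from 26 to 30 while the median stays at 26, which is the sensitivity the slides point out.
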
  19. I have delivered the same awesome presentation to audiences of different sizes.

  20. [Chart: ratings 1 to 5, count axis 0 to 14] Average score: 3.4

  21. [Chart: ratings 1 to 5, count axis 0 to 14] Average score: 3.5

  22. [Chart: ratings 1 to 5, count axis 0 to 14] Average score: 3.6

  23. 3.3 3.4 3.6 3.5 Which is my average score? I would take the number from the biggest crowd more seriously, because outliers have less impact on the average when the group is large.

  24. • The numbers vary.
      • The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.
      • Intuitively speaking, the bigger the data size, the more trust we want to give the result.

  25. Average Score: ? Ideally, I want to give the presentation to as large an audience as possible and get survey results from all of them.

  26. This time: 3.6. But I’ve got only a small group… Average Score: ?

  27. Sample: Mean = 3.6. Population: True Mean = ?

  28. • We have no way of knowing the ‘True mean’ for the entire potential audience (the Population), because most of them didn’t join the seminar, for whatever reason. (It’s impossible!)
      • We know that the mean score of this group (the Sample) is 3.6.
      • Most likely, this ‘sample mean’ differs from the ‘True mean’, but can we define a range around 3.6 and assume that the ‘True mean’ lies within it? If so, what would the range be?

  29. • We have no way of knowing the ‘True mean’ weight of all Americans.
      • We know that the mean weight of a given sample is 84kg.
      • Most likely, this ‘sample mean’ differs from the ‘True mean’, but can we define a range around 84kg and assume that the ‘True mean’ lies within it? If so, what would the range be?
      Confidence Interval!

  30. [Number line from 3.3 to 3.7 marking the Sample Mean and the True Mean]

  31. [Number line from 3.3 to 3.7 marking the Sample Mean, the True Mean, and the 95% Confidence Interval]

  32. What is 95% Confidence Interval?

  33. 95% Confidence Interval = $\text{Mean} \pm 1.96 \times \text{Standard Deviation} \times \frac{1}{\sqrt{n}}$
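
As a minimal sketch of the formula on this slide, the function below computes Mean ± 1.96 × Standard Deviation / √n in Python. The function name and the ten ratings are made up for illustration (chosen so the mean is 3.6); they are not the seminar's survey data. The sketch uses the sample standard deviation and the 1.96 multiplier from the slide; for small samples a t-distribution multiplier is a common alternative.

```python
import math
from statistics import mean, stdev

def confidence_interval_95(values):
    """Return (lower, upper) as mean ± 1.96 * sample SD / sqrt(n)."""
    n = len(values)
    m = mean(values)
    margin = 1.96 * stdev(values) / math.sqrt(n)
    return m - margin, m + margin

# Hypothetical ratings on a 1 to 5 scale (illustration only); their mean is 3.6.
ratings = [4, 3, 5, 2, 4, 4, 3, 5, 3, 3]

low, high = confidence_interval_95(ratings)
print(f"mean={mean(ratings):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only ten ratings the interval is wide; the same spread of ratings from a larger audience would give a narrower range, because the margin shrinks with √n.
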
  34. Take many samples and calculate the 95% Confidence Interval for each sample.

  35. [Chart: Sample Means and 95% Confidence Intervals for many Samples]

  36. 95% of these confidence intervals should include the true mean of the population. [Chart: the True Mean drawn across the Sample intervals]
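
The idea of taking many samples can be tried out with a small simulation. This is a sketch under stated assumptions, not the deck's own method: the population is assumed to be normally distributed, and the true mean, standard deviation, sample size, number of trials, and function name are all arbitrary choices for illustration.

```python
import math
import random
from statistics import mean, stdev

def coverage_rate(true_mean=3.5, true_sd=1.0, sample_size=50, trials=2000, seed=0):
    """Share of samples whose 95% confidence interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        m = mean(sample)
        margin = 1.96 * stdev(sample) / math.sqrt(sample_size)
        if m - margin <= true_mean <= m + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.1%}")
```

The share should land close to 95%. Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across many samples.
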
  37. We happen to be looking at just one of the samples, with its mean and its confidence interval. [Chart: the True Mean and one highlighted Sample]

  38. (no text)

  39. Exercise

  40. Sample Data

  41. Employee Data

  42. Employee Data

  43. Import Data
      1. Open Data Catalog
      2. Find ‘HR Employee’ Data
      3. Import the Data

  44. Select ‘Data Catalog’ from the Data Frame menu.

  45. Type ‘employee’ to search ‘HR Employee Attrition’ data.

  46. Click the Save button to save the data.

  47. Once the data is imported, the Summary view automatically generates a chart for each column along with metrics to describe the data.

  48. Each row represents one of the 1,470 employees. There are 27 variables describing each employee.

  49. Exercise
      1. Compare the average (mean) Monthly Income between Male and Female.
      2. Compare it for each Job Role and find whether there is a disparity between Male and Female in any Job Role.

  50. We’ll focus on Monthly Income.

  51. Create an Error Bar chart, assign Gender to X-Axis and Monthly Income to Y-Axis. This will create the chart comparing the mean of Monthly Income by Gender.
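
Outside the Exploratory UI, the numbers behind such a chart can be approximated with a short Python sketch: group the rows, then compute the mean and a 95% range per group. How Exploratory computes its error bars internally is not stated in the slides, so the Mean ± 1.96 × SD / √n formula from earlier in the deck is assumed here. The function name, the column names `Gender` and `MonthlyIncome`, and the rows are hypothetical and are not taken from the HR Employee data.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def mean_with_ci_by_group(rows, group_key, value_key, z=1.96):
    """Return {group: (mean, lower, upper)} using mean ± z * SD / sqrt(n)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    result = {}
    for name, values in groups.items():
        m = mean(values)
        margin = z * stdev(values) / math.sqrt(len(values))
        result[name] = (m, m - margin, m + margin)
    return result

# Made-up rows for illustration only; NOT values from the HR Employee data.
rows = [
    {"Gender": "Female", "MonthlyIncome": 5200},
    {"Gender": "Female", "MonthlyIncome": 6100},
    {"Gender": "Female", "MonthlyIncome": 4800},
    {"Gender": "Female", "MonthlyIncome": 7000},
    {"Gender": "Male", "MonthlyIncome": 5000},
    {"Gender": "Male", "MonthlyIncome": 6500},
    {"Gender": "Male", "MonthlyIncome": 4300},
    {"Gender": "Male", "MonthlyIncome": 7400},
]

bars = mean_with_ci_by_group(rows, "Gender", "MonthlyIncome")
for gender, (m, low, high) in bars.items():
    print(f"{gender}: mean={m:.0f}, 95% CI=({low:.0f}, {high:.0f})")
```

Each tuple corresponds to one error bar: the marker sits at the mean and the bar spans the lower and upper bounds.
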
  52. Switch the Marker type to ‘Circle’ to focus on the mean and the range.

  53. Exercise 1
      1. Compare the average (mean) Monthly Income between Male and Female.
      2. Compare it for each Job Role and find whether there is a disparity between Male and Female in any Job Role.

  54. Assign the Job Role column to Repeat By.

  55. Only the Research Director has a notable difference between Male and Female.

  56. How about Categorical or Logical data?

  57. (no text)

  58. Example: We observe how many men and women are in this organization by counting them outside the office.

  59. 5PM: Female: 0 (0%), Male: 1 (100%)

  60. 5:30PM: Female: 2 (67%), Male: 1 (33%)

  61. 6PM: Female: 2 (40%), Male: 3 (60%)

  62. 6:30PM: Female: 7 (35%), Male: 13 (65%)

  63. 7PM: Female: 20 (40%), Male: 30 (60%)

  64. Even with Categorical data, the variance (here, the ratio of male to female) and the sample size are important factors when considering differences among the categories.
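
To see how the sample size changes the uncertainty of a ratio, the sketch below computes a 95% interval for the female ratio at each observation time in the example. The slides do not say which interval formula Exploratory uses for ratios; the Wilson score interval is assumed here because it stays sensible for very small counts such as 0 out of 1. The function name is illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion, clamped to [0, 1]."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# (time, number of women counted, total people counted) from the slides
observations = [
    ("5PM", 0, 1),
    ("5:30PM", 2, 3),
    ("6PM", 2, 5),
    ("6:30PM", 7, 20),
    ("7PM", 20, 50),
]

widths = []
for time, female, total in observations:
    low, high = wilson_interval(female, total)
    widths.append(high - low)
    print(f"{time}: female ratio={female / total:.0%}, 95% CI=({low:.0%}, {high:.0%})")
```

The interval narrows as more people are counted, so the 40% observed at 7PM deserves more trust than the 40% observed at 6PM.
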
  65. Exercise 2
      1. Compare the ratio of Male and Female.
      2. Compare it among the Job Roles and find if there are any different patterns.

  66. (no text)

  67. Create an Error Bar chart, assign Gender to X-Axis and keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratio of Female and Male.

  68. Some Job Roles have notable differences in terms of the ratio, but some don’t.

  69. Exercise 3: Find if there are any differences in the ratio of Attrition among the Job Roles.

  70. (no text)

  71. This is a somewhat tricky subject, so let’s take a step-by-step approach.

  72. First, let’s see the ratio of employees in each Job Role.

  73. (no text)

  74. Create an Error Bar chart, assign Job Role to X-Axis and keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratios among the Job Roles.

  75. This Error Bar is visualizing the ratio of employees by the Job Role.

              Sales Executive  Research Scientist  Manager  Sales Rep
      All     326              292                 102      83
      Ratio   22.18%           19.86%              6.94%    5.65%

  76. How can we compare the ratios of the employees who left the company among the Job Roles? Attrition = Whether a given employee left (True) or not (False).

  77. TRUE = Those who left the company.

              Sales Executive  Research Scientist  Manager  Sales Rep
      All     326              292                 102      83
      TRUE    57               47                  5        33
      FALSE   269              245                 97       50

  78. We want to visualize the ratio of those who left the company.

              Sales Executive  Research Scientist  Manager  Sales Rep
      All     326              292                 102      83
      TRUE    57               47                  5        33
      Ratio   40%              33%                 3.5%     23%
      FALSE   269              245                 97       50

  79. Select the Attrition for Y-Axis and select ‘Number of True’ as the calculation.

  80. The original question: Find if there are any differences in the ratio of Attrition among the Job Roles. We want the Attrition Rate, not the Number of Attrition.

  81. Attrition rate should be calculated within each Job Role.

              Sales Executive  Research Scientist  Manager  Sales Rep
      TRUE    57               47                  5        33
      FALSE   269              245                 97       50

  82. Attrition Rate for each Job Role:

                      Sales Executive  Research Scientist  Manager  Sales Rep
      TRUE            57               47                  5        33
      FALSE           269              245                 97       50
      Attrition Rate  17.48%           16.1%               4.9%     39.76%
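
The Attrition Rates above can be reproduced from the TRUE and FALSE counts with a few lines of Python. The 95% range added to each rate uses the normal approximation for a proportion, p ± 1.96 × √(p(1 − p)/n); that formula is an assumption for illustration, since the slides do not state how Exploratory draws the error bars for ratios.

```python
import math

# (left the company, stayed) per Job Role, taken from the slides
counts = {
    "Sales Executive": (57, 269),
    "Research Scientist": (47, 245),
    "Manager": (5, 97),
    "Sales Rep": (33, 50),
}

rates = {}
for role, (left, stayed) in counts.items():
    total = left + stayed
    rate = left / total  # ratio within the Job Role, not across all employees
    margin = 1.96 * math.sqrt(rate * (1 - rate) / total)
    rates[role] = (rate, rate - margin, rate + margin)
    print(f"{role}: {rate:.2%} (95% CI {rate - margin:.2%} to {rate + margin:.2%})")
```

Dividing by the number of employees in each Job Role, rather than by the total number who left, is what turns the Number of Attrition into the Attrition Rate.
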
  83. Switch the ‘Ratio by’ from ‘All’ to ‘X-Axis’.

  84. (no text)

  85. Now, we are comparing the Attrition Rates among the Job Roles.

  86. There seem to be 3 groups based on the Attrition Rates.

  87. The Attrition Rate for Sales Representative is significantly higher than the others.

  88. The Attrition Rates for these 4 Job Roles seem to be in the same range. There is not much difference among them. However, they are significantly different from the other 5 Job Roles.

  89. Conclusion
      • The numbers vary.
      • The average (mean) is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.
      • When comparing categorical values we can use the ratio, but the ratio can also vary, especially when the sample size is small.

  90. Conclusion
      • To compare the means or the ratios, we should take into account the variance in the data and the size of the data.
      • The Confidence Interval is a useful tool that gives us context around the mean and the ratio.
      • It helps us compare the means and the ratios and decide whether there are any differences that should be investigated further.

  91. If you want to compare the means or the ratios with confidence intervals, the Error Bar chart is your friend!

  92. Next Seminar

  93. Data Visualization Workshop
      • Part 1 - Basics: Visualizing Summarized Data
      • Part 2 - Visualizing Time Series Data
      • Part 3 - Visualizing Variance & Correlation
      • Part 4 - Visualizing Uncertainty
      • Part 5 - Data Wrangling for Data Visualization - 7/1 (Wed)

  94. Q & A

  95. Information: Email: kan@exploratory.io; Website: https://exploratory.io; Twitter: @KanAugust; Training: https://exploratory.io/training

  96. EXPLORATORY