Data Visualization Workshop Part 4 - Visualizing Uncertainty

EXPLORATORY Seminar #29 Viz Workshop - Part4 Visualizing Uncertainty

Speaker: Kan Nishida, CEO/co-founder, Exploratory (@KanAugust)
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

The Third Wave
Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Data Science Workflow
[Diagram labels: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]

Exploratory: Modern & Simple UI
[Diagram labels: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]


I have delivered an awesome presentation.

I asked the audience to rate it on a scale from 1 (Very Bad) to 5 (Very Good).

[Chart: distribution of scores, 1 to 5]
Average score: 3.6

I gave another awesome presentation the next day.

[Chart: distribution of scores, 1 to 5]
Average score: 3.4

I gave yet another awesome presentation the day after that!

[Chart: distribution of scores, 1 to 5]
Average score: 3.3

[Number line: 3.3, 3.4, 3.5, 3.6]
Which is my real score?

• The numbers vary.
• The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.

Values: 26, 24, 32, 28, 20, 30, 22
Mean: 26

Values: 26, 24, 32, 28, 20, 30, 22
Mean: 26, Median: 26

Values: 26, 24, 60, 28, 20, 30, 22
Mean: 30, Median: 26
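
The three slides above can be reproduced with a short sketch using Python's standard `statistics` module. The variable names are only illustrative; the values are the ones shown on the slides.

```python
from statistics import mean, median

# Values from the slides: the second list replaces 32 with the extreme value 60.
scores = [26, 24, 32, 28, 20, 30, 22]
with_outlier = [26, 24, 60, 28, 20, 30, 22]

# Without the extreme value, the mean and the median agree (both are 26).
print("mean:", mean(scores), "median:", median(scores))

# With the extreme value, the mean moves to 30 while the median stays at 26.
print("mean:", mean(with_outlier), "median:", median(with_outlier))
```

A single extreme value among seven shifts the mean by 4, while the median does not move at all.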

I delivered the same awesome presentation to audiences of different sizes.

[Chart: distribution of scores, 1 to 5]
Average score: 3.4

[Chart: distribution of scores, 1 to 5]

[Number line: 3.3, 3.4, 3.5, 3.6]
Which is my average score? I would take the number from the biggest crowd more seriously, because outliers have less impact on the average when the group is large.

• The numbers vary.
• The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.
• Intuitively speaking, the bigger the data size is, the more trust we want to give the result.

Average Score: ?
Ideally, I want to give the presentation to as many people as possible and get the survey results from all of them.

This time: 3.6
But I’ve got only a small group…
Average Score: ?

Sample: Mean = 3.6
Population: True Mean = ?

• We have no way of knowing the ‘True mean’ of all the potential audience (Population), because they didn’t join the seminar for whatever reason. (It’s impossible!)
• We know that the mean score of this group (Sample) is 3.6.
• Most likely, this ‘sample mean’ is different from the ‘True mean’. But can we put a range around 3.6 and assume that the ‘True mean’ resides within that range? If so, what would the range be?

• We have no way of knowing the ‘True mean’ weight of all Americans.
• We know that the mean weight of a given sample is 84kg.
• Most likely, this ‘sample mean’ is different from the ‘True mean’. But can we put a range around 84kg and assume that the ‘True mean’ resides within that range? If so, what would the range be?

Confidence Interval!

[Number line from 3.3 to 3.7, marking the Sample Mean and the True Mean]

[Number line from 3.3 to 3.7, marking the Sample Mean, the True Mean, and the 95% Confidence Interval]

What is a 95% Confidence Interval?

$$\text{95\% Confidence Interval} = \text{Mean} \pm 1.96 \times \text{Standard Deviation} \times \frac{1}{\sqrt{n}}$$
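
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the way Exploratory computes its intervals: it assumes that n is the sample size, that the standard deviation is the sample standard deviation, and that the ratings below (which are made up for this example) stand in for survey answers. The function name `confidence_interval_95` is hypothetical.

```python
import math
from statistics import mean, stdev

def confidence_interval_95(values):
    """Return (low, high) as mean +/- 1.96 * sd / sqrt(n)."""
    n = len(values)
    m = mean(values)
    # Assumption: stdev() is the sample standard deviation (n - 1 denominator).
    margin = 1.96 * stdev(values) / math.sqrt(n)
    return m - margin, m + margin

# Made-up ratings on the 1 to 5 scale; their mean happens to be 3.6.
ratings = [5, 4, 4, 3, 2, 4, 3, 5, 3, 3]
low, high = confidence_interval_95(ratings)
print(f"mean {mean(ratings):.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Because the margin is divided by the square root of n, quadrupling the audience roughly halves the width of the interval. The multiplier 1.96 comes from the normal distribution; for very small samples a slightly larger multiplier from the t-distribution is commonly used instead.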

Take many samples and calculate the 95% Confidence Interval for each group.

[Figure: the mean and 95% Confidence Interval of each sample]

95% of these confidence intervals should include the true mean of the population.

We happen to be looking at just one of these samples, with its mean and its confidence interval.
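
The idea of taking many samples can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is assumed to be normally distributed with a true mean of 3.5 and a standard deviation of 1.0, each sample has 30 values, and the function name `coverage` is hypothetical.

```python
import math
import random
from statistics import mean, stdev

def coverage(true_mean=3.5, sd=1.0, n=30, trials=1000, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from the assumed population.
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = mean(sample)
        margin = 1.96 * stdev(sample) / math.sqrt(n)
        if m - margin <= true_mean <= m + margin:
            hits += 1
    return hits / trials

rate = coverage()
print(f"{rate:.1%} of the intervals contain the true mean")
```

The printed share should land close to 95%. Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across many samples.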

Exercise

Sample Data

Employee Data

Import Data
1. Open Data Catalog
2. Find ‘HR Employee’ Data
3. Import the Data

Select ‘Data Catalog’ from the Data Frame menu.

Type ‘employee’ to search for the ‘HR Employee Attrition’ data.

Click the Save button to save the data.

Once the data is imported, the Summary view automatically generates
a chart for each column along with metrics to describe the data.

Each row represents one of the 1,470 employees. There are 27 variables that describe each employee.

Exercise 1
1. Compare the average (mean) Monthly Income between Male and Female.
2. Compare it for each Job Role and find if there is disparity between Male and Female for any Job Roles.

We’ll focus on Monthly Income.

Create an Error Bar chart, assign Gender to X-Axis and Monthly Income to Y-Axis. This will create the chart comparing the mean of Monthly Income by Gender.

Switch the Marker type to ‘Circle’ to focus on the mean and the range.
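
What a chart of this kind summarizes can be sketched outside Exploratory as a mean and a 95% confidence interval per group. This is a rough illustration, not Exploratory's implementation: the rows below are made up and are not taken from the HR Employee data, the function name `mean_ci_by_group` is hypothetical, and the interval is assumed to be mean ± 1.96 × standard deviation / √n.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

# Made-up (gender, monthly income) rows, for illustration only.
rows = [
    ("Female", 5200), ("Female", 6100), ("Female", 4800), ("Female", 7000),
    ("Male", 5000), ("Male", 6400), ("Male", 4700), ("Male", 6900),
]

def mean_ci_by_group(rows):
    """Map each group to (mean, low, high) using the 1.96 * sd / sqrt(n) margin."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    summary = {}
    for key, values in groups.items():
        m = mean(values)
        margin = 1.96 * stdev(values) / math.sqrt(len(values))
        summary[key] = (m, m - margin, m + margin)
    return summary

summary = mean_ci_by_group(rows)
for gender, (m, low, high) in summary.items():
    print(f"{gender}: mean {m:.0f}, 95% CI {low:.0f} to {high:.0f}")
```

Each group's circle corresponds to the mean, and the bar corresponds to the low and high ends of the interval.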

Exercise 1
1. Compare the average (mean) Monthly Income between Male and Female.
2. Compare it for each Job Role and find if there is disparity between Male and Female for any Job Roles.

Assign the Job Role column to Repeat By.

Only the Research Director has a notable difference between Male and Female.

How about Categorical or Logical data?

Example
We observe how many men and women are in this organization by counting them outside the office.

5PM: Female: 0 (0%), Male: 1 (100%)

5:30PM: Female: 2 (67%), Male: 1 (33%)

6PM: Female: 2 (40%), Male: 3 (60%)

6:30PM: Female: 7 (35%), Male: 13 (65%)

7PM: Female: 20 (40%), Male: 30 (60%)

Even with categorical data, the variance (here, the ratio of male to female) and the sample size are important factors when considering the differences among the categories.
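
The counting example can be turned into a small sketch that shows how the uncertainty around a ratio shrinks as the count grows. It uses the counts from the slides and a normal-approximation interval, p ± 1.96 × √(p(1 − p)/n). That formula is an assumption for illustration; Exploratory may compute its interval for ratios differently. The function name `proportion_ci_95` is hypothetical.

```python
import math

def proportion_ci_95(successes, n):
    """Return (p, low, high) using the normal approximation, clipped to [0, 1]."""
    p = successes / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# (time, female count, male count) as shown on the slides.
counts = [("5PM", 0, 1), ("5:30PM", 2, 1), ("6PM", 2, 3),
          ("6:30PM", 7, 13), ("7PM", 20, 30)]

for time, female, male in counts:
    p, low, high = proportion_ci_95(female, female + male)
    print(f"{time:>7}: female ratio {p:.0%}, 95% CI {low:.0%} to {high:.0%}")
```

At 6PM and at 7PM the ratio is the same 40%, but the interval at 7PM is much narrower because it rests on 50 people instead of 5. The approximation is not reliable for tiny counts: at 5PM it collapses to a zero-width interval, which reflects a limitation of the formula rather than real certainty.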

Exercise 2
1. Compare the ratio of Male and Female.
2. Compare it among the Job Roles and find if there are any different patterns.

Create an Error Bar chart, assign Gender to X-Axis and
keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratio of Female and Male.

Some Job Roles have notable differences in terms of the
ratio, but some don’t.

Exercise 3
Find if there are any differences in the ratio of Attrition among the Job Roles.

This is a somewhat tricky subject, so let’s take a step-by-step approach.

First, let’s see how the employees are distributed among the Job Roles.

Create an Error Bar chart, assign Job Role to X-Axis
and keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratios among the Job Roles.

         Sales Executive   Research Scientist   Manager   Sales Rep
All      326               292                  102       83
Ratio    22.18%            19.86%               6.94%     5.65%

This Error Bar chart visualizes the ratio of employees by Job Role.

How can we compare the ratios of the employees who left the company among the Job Roles?
Attrition = Whether a given employee left (True) or not (False).

         Sales Executive   Research Scientist   Manager   Sales Rep
All      326               292                  102       83
TRUE     57                47                   5         33
FALSE    269               245                  97        50

TRUE = Those who left the company.

         Sales Executive   Research Scientist   Manager   Sales Rep
All      326               292                  102       83
TRUE     57                47                   5         33
Ratio    40%               33%                  3.5%      23%
FALSE    269               245                  97        50

We want to visualize the ratio of those who left the company.

Select the Attrition for Y-Axis and select ‘Number of True’
as the calculation.

The original question: Find if there are any differences in the ratio of Attrition among the Job Roles.
We want the Attrition Rate, not the Number of Attrition.

         Sales Executive   Research Scientist   Manager   Sales Rep
TRUE     57                47                   5         33
FALSE    269               245                  97        50

Attrition rate should be calculated within each Job Role.

                 Sales Executive   Research Scientist   Manager   Sales Rep
TRUE             57                47                   5         33
FALSE            269               245                  97        50
Attrition Rate   17.48%            16.1%                4.9%      39.76%

Switch the ‘Ratio by’ from ‘All’ to ‘X-Axis’.
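
The difference between the two ways of computing the ratio can be shown with the counts from the tables above. This is a small sketch of the arithmetic only; the variable names are illustrative and it does not reproduce how Exploratory implements ‘Ratio by’.

```python
roles = ["Sales Executive", "Research Scientist", "Manager", "Sales Rep"]
left = [57, 47, 5, 33]        # Attrition = TRUE
stayed = [269, 245, 97, 50]   # Attrition = FALSE

total_left = sum(left)

# Share of all leavers (across these four roles) that each role accounts for.
share_of_leavers = {r: t / total_left for r, t in zip(roles, left)}

# Attrition rate calculated within each role: leavers / everyone in the role.
attrition_rate = {r: t / (t + f) for r, t, f in zip(roles, left, stayed)}

for r in roles:
    print(f"{r:<20} share of leavers {share_of_leavers[r]:6.1%}   "
          f"attrition rate {attrition_rate[r]:6.2%}")
```

Sales Executive accounts for the largest share of the leavers simply because it is the largest group, while its attrition rate within the role is about 17%. Sales Rep has a smaller share of the leavers but by far the highest rate within the role.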

Now, we are comparing the Attrition Rates among the Job
Roles.

There seem to be 3 groups based on the Attrition Rates.

The Attrition Rate for Sales Representative is significantly higher than the others.

The Attrition Rates for these 4 Job Roles seem to be in the same range. There is not much difference in these Job Roles. However, they are significantly different from the other 5 Job Roles.
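
The grouping read off the chart can be approximated numerically for the four Job Roles whose counts appear in the earlier tables. The sketch below assumes a normal-approximation interval for a ratio, p ± 1.96 × √(p(1 − p)/n), which may differ from what Exploratory draws; the helper names `proportion_ci_95` and `overlaps` are hypothetical.

```python
import math

def proportion_ci_95(successes, n):
    """Return (low, high) for a ratio using the normal approximation."""
    p = successes / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# (number who left, number of employees) per Job Role, from the tables.
counts = {
    "Sales Executive": (57, 326),
    "Research Scientist": (47, 292),
    "Manager": (5, 102),
    "Sales Rep": (33, 83),
}
intervals = {role: proportion_ci_95(k, n) for role, (k, n) in counts.items()}

for role, (low, high) in intervals.items():
    print(f"{role:<20} 95% CI {low:.1%} to {high:.1%}")
```

Under these assumptions, the intervals for Sales Executive and Research Scientist overlap, while the interval for Sales Rep sits above both and the interval for Manager sits below both. Checking for overlap is a rough visual heuristic for spotting differences worth investigating, not a formal statistical test.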

Conclusion
• The numbers vary.
• The average (mean) is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.
• When comparing categorical values we can use the ratio, but the ratio can also vary, especially when the sample size is small.

Conclusion
• To compare the means or the ratios, we should take into account the variance in the data and the size of the data.
• The Confidence Interval is a useful tool that gives us context around the mean and the ratio.
• It helps us compare the means and the ratios and conclude whether there are any differences that should be investigated further.

If you want to compare the means or the ratios with confidence intervals, the Error Bar chart is your friend!

Next Seminar

Data Visualization Workshop
• Part 1 - Basics: Visualizing Summarized Data
• Part 2 - Visualizing Time Series Data
• Part 3 - Visualizing Variance & Correlation
• Part 4 - Visualizing Uncertainty
• Part 5 - Data Wrangling for Data Visualization - 7/1 (Wed)

Information
Email: [email protected]
Website: https://exploratory.io
Twitter: @KanAugust
Training: https://exploratory.io/training
