Slide 1

Slide 1 text

EXPLORATORY Seminar #29 Viz Workshop - Part 4: Visualizing Uncertainty

Slide 2

Slide 2 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 3

Slide 3 text

The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 4

Slide 4 text

Data Science Workflow [diagram labels]: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis

Slide 5

Slide 5 text

Exploratory: Modern & Simple UI [diagram labels]: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis

Slide 6

Slide 6 text

EXPLORATORY Seminar #29 Viz Workshop - Part 4: Visualizing Uncertainty

Slide 7

Slide 7 text

I have delivered an awesome presentation.

Slide 8

Slide 8 text

I asked the audience to rate it on a scale from 1 (Very Bad) to 5 (Very Good).

Slide 9

Slide 9 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.6

Slide 10

Slide 10 text

I delivered another awesome presentation the next day.

Slide 11

Slide 11 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.4

Slide 12

Slide 12 text

I delivered yet another awesome presentation the day after that!

Slide 13

Slide 13 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.3

Slide 14

Slide 14 text

3.3? 3.4? 3.5? 3.6? Which is my real score?

Slide 15

Slide 15 text

• The numbers vary. • The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small.

Slide 16

Slide 16 text

Values: 26, 24, 32, 28, 20, 30, 22. Mean: 26

Slide 17

Slide 17 text

Values: 26, 24, 32, 28, 20, 30, 22. Mean: 26. Median: 26

Slide 18

Slide 18 text

Values: 26, 24, 60, 28, 20, 30, 22. Mean: 30. Median: 26
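The mean and median on the slides above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the seminar material.

```python
from statistics import mean, median

# Values taken from the slides above.
values = [26, 24, 32, 28, 20, 30, 22]
# Same values, but with 32 replaced by the extreme value 60.
with_outlier = [26, 24, 60, 28, 20, 30, 22]

print("mean:", mean(values), "median:", median(values))
print("mean:", mean(with_outlier), "median:", median(with_outlier))
```

A single extreme value moves the mean from 26 to 30, while the median stays at 26.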

Slide 19

Slide 19 text

I have delivered the same awesome presentation to audiences of different sizes.

Slide 20

Slide 20 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.4

Slide 21

Slide 21 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.5

Slide 22

Slide 22 text

[Bar chart: number of responses per rating, 1 to 5] Average score: 3.6

Slide 23

Slide 23 text

3.3? 3.4? 3.5? 3.6? Which is my average score? I would take the number from the biggest crowd more seriously, because outliers have less impact on the average when the audience is large.

Slide 24

Slide 24 text

• The numbers vary. • The average is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small. • Intuitively speaking, the bigger the data size, the more we want to trust the result.

Slide 25

Slide 25 text

Average Score: ? Ideally, I want to give the presentation to as large an audience as possible and get survey results from all of them.

Slide 26

Slide 26 text

Average Score: ? This time: 3.6. But I've got only a small group…

Slide 27

Slide 27 text

Sample: Mean: 3.6. Population: True Mean: ?

Slide 28

Slide 28 text

• We have no way of knowing the ‘True mean’ for the entire potential audience (Population), because not all of them attended the seminar, for whatever reason. (It’s impossible!) • We know that the mean score of this group (Sample) is 3.6. • Most likely, this ‘sample mean’ is different from the ‘True mean’, but can we define a range around 3.6 that we can assume contains the ‘True mean’? If so, what would that range be?

Slide 29

Slide 29 text

• We have no way of knowing the ‘True mean’ weight of all Americans. • We know that the mean weight of a given sample is 84kg. • Most likely, this ‘sample mean’ is different from the ‘True mean’, but can we define a range around 84kg that we can assume contains the ‘True mean’? If so, what would that range be? Answer: the Confidence Interval!

Slide 30

Slide 30 text

[Number line from 3.3 to 3.7 marking the Sample Mean and the True Mean]

Slide 31

Slide 31 text

[Number line from 3.3 to 3.7 marking the Sample Mean, the True Mean, and the 95% Confidence Interval]

Slide 32

Slide 32 text

What is a 95% Confidence Interval?

Slide 33

Slide 33 text

$$\text{95\% Confidence Interval} = \text{Mean} \pm 1.96 \times \text{Standard Deviation} \times \frac{1}{\sqrt{n}}$$
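As a rough illustration, the formula above can be written in a few lines of Python. This is a sketch under stated assumptions: the ratings are made-up values (chosen so that their mean is 3.6) and are not data from the seminar, the function name is hypothetical, and the sample standard deviation (dividing by n - 1) is assumed because the slide does not say which standard deviation is meant.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval_95(values):
    """Return (lower, upper) for mean ± 1.96 * standard deviation / sqrt(n)."""
    n = len(values)
    m = mean(values)
    # stdev() is the sample standard deviation (n - 1 in the denominator);
    # this choice is an assumption, the slide does not specify it.
    margin = 1.96 * stdev(values) / sqrt(n)
    return m - margin, m + margin

# Hypothetical ratings (1 to 5) from ten attendees; not data from the seminar.
ratings = [3, 4, 5, 2, 4, 3, 4, 5, 3, 3]

lower, upper = confidence_interval_95(ratings)
print(f"mean = {mean(ratings):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The factor 1.96 comes from the normal distribution. For a sample this small, a slightly larger multiplier from the t-distribution is commonly used instead, which would widen the interval a little.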

Slide 34

Slide 34 text

Take many samples and calculate the 95% Confidence Interval for each group.

Slide 35

Slide 35 text

[Figure: many samples, each shown with its mean and 95% Confidence Interval]

Slide 36

Slide 36 text

About 95% of these confidence intervals should include the true mean of the population.
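The idea that about 95% of such intervals include the true mean can be tried out with a small simulation. The sketch below assumes a normally distributed population with a known true mean; the population parameters, sample size, number of trials, and function names are all illustrative choices, not values from the seminar.

```python
import random
from math import sqrt
from statistics import mean, stdev

def ci95(sample):
    """95% confidence interval: mean ± 1.96 * standard deviation / sqrt(n)."""
    m = mean(sample)
    margin = 1.96 * stdev(sample) / sqrt(len(sample))
    return m - margin, m + margin

def coverage(true_mean=3.5, sd=1.0, sample_size=50, trials=2000, seed=0):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample from the assumed (hypothetical) population.
        sample = [rng.gauss(true_mean, sd) for _ in range(sample_size)]
        lower, upper = ci95(sample)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / trials

rate = coverage()
print(f"{rate:.1%} of the intervals contain the true mean")
```

In practice we only ever see one of these samples, so we cannot tell whether our particular interval is among those that contain the true mean.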

Slide 37

Slide 37 text

We happen to be looking at just one of these samples, with its mean and its confidence interval.

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Exercise

Slide 40

Slide 40 text

Sample Data

Slide 41

Slide 41 text

Employee Data

Slide 42

Slide 42 text

Employee Data

Slide 43

Slide 43 text

Import Data: 1. Open the Data Catalog. 2. Find the ‘HR Employee’ data. 3. Import the data.

Slide 44

Slide 44 text

Select ‘Data Catalog’ from the Data Frame menu.

Slide 45

Slide 45 text

Type ‘employee’ to search for the ‘HR Employee Attrition’ data.

Slide 46

Slide 46 text

Click the Save button to save the data.

Slide 47

Slide 47 text

Once the data is imported, the Summary view automatically generates a chart for each column along with metrics to describe the data.

Slide 48

Slide 48 text

Each of the 1,470 rows represents one employee. There are 27 variables describing each employee.

Slide 49

Slide 49 text

Exercise: 1. Compare the average (mean) Monthly Income between Male and Female. 2. Compare it for each Job Role and find out whether there is a disparity between Male and Female in any Job Role.

Slide 50

Slide 50 text

We’ll focus on Monthly Income.

Slide 51

Slide 51 text

Create an Error Bar chart, then assign Gender to the X-Axis and Monthly Income to the Y-Axis. This creates a chart comparing the mean Monthly Income by Gender.

Slide 52

Slide 52 text

Switch the Marker type to ‘Circle’ to focus on the mean and the range.

Slide 53

Slide 53 text

Exercise 1: 1. Compare the average (mean) Monthly Income between Male and Female. 2. Compare it for each Job Role and find out whether there is a disparity between Male and Female in any Job Role.

Slide 54

Slide 54 text

Assign the Job Role column to Repeat By.

Slide 55

Slide 55 text

Only the Research Director role shows a notable difference between Male and Female.

Slide 56

Slide 56 text

How about Categorical or Logical data?

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Example: We estimate how many men and women are in this organization by counting them outside the office.

Slide 59

Slide 59 text

5PM: Female: 0 (0%), Male: 1 (100%)

Slide 60

Slide 60 text

5:30PM: Female: 2 (67%), Male: 1 (33%)

Slide 61

Slide 61 text

6PM: Female: 2 (40%), Male: 3 (60%)

Slide 62

Slide 62 text

6:30PM: Female: 7 (35%), Male: 13 (65%)

Slide 63

Slide 63 text

7PM: Female: 20 (40%), Male: 30 (60%)

Slide 64

Slide 64 text

Even with Categorical data, the variance (here, the ratio of male to female) and the sample size are important factors when considering differences among the categories.
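To make the effect of sample size concrete, the sketch below puts a 95% interval around the female ratio at each observation time from the example. It uses the Wilson score interval, which behaves reasonably even for very small counts. That choice is an assumption made for this illustration; the slides do not say how Exploratory computes the interval for a ratio, and the function name is hypothetical.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# (time, number of women counted, total people counted) from the example above.
observations = [
    ("5PM", 0, 1),
    ("5:30PM", 2, 3),
    ("6PM", 2, 5),
    ("6:30PM", 7, 20),
    ("7PM", 20, 50),
]

widths = []
for time, female, total in observations:
    lower, upper = wilson_interval(female, total)
    widths.append(upper - lower)
    print(f"{time:>6}: female ratio {female / total:.0%}, "
          f"95% interval width {upper - lower:.2f}")
```

The interval narrows steadily as more people are counted, which matches the intuition that the 7PM ratio deserves more trust than the 5PM one.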

Slide 65

Slide 65 text

Exercise 2: 1. Compare the ratio of Male and Female. 2. Compare it among the Job Roles and find out whether there are any different patterns.

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Create an Error Bar chart, assign Gender to X-Axis and keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratio of Female and Male.

Slide 68

Slide 68 text

Some Job Roles have notable differences in terms of the ratio, but some don’t.

Slide 69

Slide 69 text

Exercise 3: Find out whether there are any differences in the ratio of Attrition among the Job Roles.

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

This is a somewhat tricky subject, so let’s take a step-by-step approach.

Slide 72

Slide 72 text

First, let’s look at the ratio of employees across the Job Roles.

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

Create an Error Bar chart, assign Job Role to X-Axis and keep ‘Number of Rows’ for the Y-Axis. Then, switch the Calculation Type to ‘Ratio (%)’. This will create the chart comparing the ratios among the Job Roles.

Slide 75

Slide 75 text

This Error Bar chart visualizes the ratio of employees by Job Role.

Job Role              All    Ratio
Sales Executive       326    22.18%
Research Scientist    292    19.86%
Manager               102     6.94%
Sales Rep              83     5.65%

Slide 76

Slide 76 text

How can we compare the ratios of employees who left the company among the Job Roles? Attrition = whether a given employee left (True) or not (False).

Slide 77

Slide 77 text

TRUE = those who left the company.

Job Role              All    TRUE    FALSE
Sales Executive       326      57      269
Research Scientist    292      47      245
Manager               102       5       97
Sales Rep              83      33       50

Slide 78

Slide 78 text

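We want to visualize the ratio of those who left the company. Here, Ratio is each Job Role's share of the 142 employees who left across these four Job Roles.

Job Role              All    TRUE    Ratio    FALSE
Sales Executive       326      57     40%       269
Research Scientist    292      47     33%       245
Manager               102       5     3.5%       97
Sales Rep              83      33     23%        50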

Slide 79

Slide 79 text

Select Attrition for the Y-Axis and select ‘Number of True’ as the calculation.

Slide 80

Slide 80 text

The original question: Find out whether there are any differences in the ratio of Attrition among the Job Roles. We want the Attrition Rate, not the Number of Attrition.

Slide 81

Slide 81 text

The Attrition Rate should be calculated within each Job Role.

Job Role              TRUE    FALSE
Sales Executive         57      269
Research Scientist      47      245
Manager                  5       97
Sales Rep               33       50

Slide 82

Slide 82 text

Job Role              TRUE    FALSE    Attrition Rate
Sales Executive         57      269    17.48%
Research Scientist      47      245    16.10%
Manager                  5       97     4.90%
Sales Rep               33       50    39.76%
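The two ways of calculating a ratio can be compared in a short Python sketch, using only the counts shown on the slides for these four Job Roles. The variable names are illustrative.

```python
# Counts from the slides: employees who left (TRUE) and who stayed (FALSE).
left = {"Sales Executive": 57, "Research Scientist": 47, "Manager": 5, "Sales Rep": 33}
stayed = {"Sales Executive": 269, "Research Scientist": 245, "Manager": 97, "Sales Rep": 50}

# Ratio against everyone who left across these four roles.
total_left = sum(left.values())
share_of_leavers = {role: left[role] / total_left for role in left}

# Attrition rate: ratio calculated within each Job Role.
attrition_rate = {role: left[role] / (left[role] + stayed[role]) for role in left}

for role in left:
    print(f"{role:<20} share of leavers {share_of_leavers[role]:6.2%}   "
          f"attrition rate {attrition_rate[role]:6.2%}")
```

Sales Executive has the largest share of leavers simply because it is the largest group, while Sales Rep has by far the highest attrition rate within its own role.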

Slide 83

Slide 83 text

Switch the ‘Ratio by’ from ‘All’ to ‘X-Axis’.

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

Now, we are comparing the Attrition Rates among the Job Roles.

Slide 86

Slide 86 text

There seem to be three groups based on the Attrition Rates.

Slide 87

Slide 87 text

The Attrition Rate for Sales Representative is significantly higher than the others.
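One way to see why the Sales Representative rate stands out is to put a 95% interval around each attrition rate and check whether the intervals overlap. The sketch below uses the counts shown earlier for four of the Job Roles and the normal approximation for a proportion. Both the interval method and the function name are assumptions for illustration, since the slides do not state how the error bars are computed.

```python
from math import sqrt

def proportion_ci95(successes, n):
    """Normal-approximation 95% interval: p ± 1.96 * sqrt(p * (1 - p) / n)."""
    p = successes / n
    margin = 1.96 * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# (employees who left, all employees) for four Job Roles from the slides.
counts = {
    "Sales Executive": (57, 326),
    "Research Scientist": (47, 292),
    "Manager": (5, 102),
    "Sales Rep": (33, 83),
}

intervals = {role: proportion_ci95(k, n) for role, (k, n) in counts.items()}
for role, (lower, upper) in intervals.items():
    print(f"{role:<20} [{lower:.1%}, {upper:.1%}]")

# Does the Sales Rep interval sit entirely above every other interval?
sales_rep_lower = intervals["Sales Rep"][0]
others_upper = max(upper for role, (_, upper) in intervals.items() if role != "Sales Rep")
clearly_higher = sales_rep_lower > others_upper
print("Sales Rep clearly higher:", clearly_higher)
```

Non-overlapping intervals are a quick visual heuristic for a difference worth investigating, not a formal statistical test.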

Slide 88

Slide 88 text

The Attrition Rates for these 4 Job Roles seem to be in the same range, with little difference among them. However, they are significantly different from the other 5 Job Roles.

Slide 89

Slide 89 text

Conclusion: • The numbers vary. • The average (mean) is sensitive: it can be influenced significantly by extreme values, especially when the sample size is small. • When comparing categorical values we can use the ratio, but the ratio can also vary, especially when the sample size is small.

Slide 90

Slide 90 text

Conclusion: • To compare means or ratios, we should take into account the variance in the data and the size of the data. • The Confidence Interval is a useful tool that gives us context around the mean and the ratio. • It helps us compare means and ratios and decide whether there are any differences that should be investigated further.

Slide 91

Slide 91 text

If you want to compare means or ratios with confidence intervals, the Error Bar chart is your friend!

Slide 92

Slide 92 text

Next Seminar

Slide 93

Slide 93 text

Data Visualization Workshop: • Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization - 7/1 (Wed)

Slide 94

Slide 94 text

Q & A

Slide 95

Slide 95 text

Information: • Email: [email protected] • Website: https://exploratory.io • Twitter: @KanAugust • Training: https://exploratory.io/training

Slide 96

Slide 96 text

EXPLORATORY