Slide 1

Slide 1 text

EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance & Correlation

Slide 2

Slide 2 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 3

Slide 3 text

The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 4

Slide 4 text

Data Science Workflow: Questions • Communication • Data Access • Data Wrangling • Visualization • Analytics (Statistics / Machine Learning) • Data Analysis

Slide 5

Slide 5 text

Exploratory: Modern & Simple UI • Questions • Communication (Dashboard, Note, Slides) • Data Access • Data Wrangling • Visualization • Analytics (Statistics / Machine Learning) • Data Analysis

Slide 6

Slide 6 text

EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance & Correlation

Slide 7

Slide 7 text

• Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop

Slide 8

Slide 8 text

Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance • with Histogram • with Density Plot • with Bar Chart • Visualize Association and Correlation • with Boxplot • with Violin Plot • with Scatter Plot • with Stack Bar Chart

Slide 9

Slide 9 text

Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance • with Histogram • with Density Plot • with Bar Chart • Visualize Association and Correlation • with Boxplot • with Violin Plot • with Scatter Plot • with Stack Bar Chart

Slide 10

Slide 10 text

EDA (Exploratory Data Analysis): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.

Slide 11

Slide 11 text

EDA (Exploratory Data Analysis) [Diagram: Questions, Answers, Data, Hypotheses]

Slide 12

Slide 12 text

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. - John Tukey

Slide 13

Slide 13 text

2 Principal Questions for EDA • How do the variables vary? • How are the variables associated or correlated with one another?

Slide 14

Slide 14 text

Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

Slide 15

Slide 15 text

Sample Data

Slide 16

Slide 16 text

Employee Data

Slide 17

Slide 17 text

Employee Data

Slide 18

Slide 18 text

Import Data: 1. Open Data Catalog 2. Find ‘HR Employee’ Data 3. Import the Data

Slide 19

Slide 19 text

Select ‘Data Catalog’ from the Data Frame menu.

Slide 20

Slide 20 text

Type ‘employee’ to search for the ‘HR Employee Attrition’ data.

Slide 21

Slide 21 text

Click the Save button to save the data.

Slide 22

Slide 22 text

Once the data is imported, the Summary view automatically generates a chart for each column along with metrics to describe the data.

Slide 23

Slide 23 text

Each row represents one of the 1,470 employees. There are 27 variables describing each employee.

Slide 24

Slide 24 text

We’ll focus on Monthly Income.

Slide 25

Slide 25 text

Questions • How does Monthly Income vary? • What makes it vary? • What is correlated with Monthly Income, and how?

Slide 26

Slide 26 text

We can see that the average Monthly Income is $6,503, and it ranges from $1,009 to $19,999. That’s a huge gap!
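The metrics in the Summary view (average, minimum, maximum) can be reproduced outside the tool. The following is a minimal sketch, assuming a small made-up list of incomes rather than the workshop's HR Employee Attrition data; the `summarize` helper is a hypothetical name used only for illustration.

```python
import statistics

# Hypothetical sample of monthly incomes (NOT the workshop's HR data).
monthly_income = [1009, 2500, 3200, 4100, 5200, 6800, 9500, 19999]


def summarize(values):
    """Return the mean, minimum, maximum, and range of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(monthly_income)
print(summary)
```

The gap between the minimum and the maximum is the first hint of how widely the values vary; the charts that follow show how the values are spread inside that range.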

Slide 27

Slide 27 text

Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance • with Histogram • with Density Plot • with Bar Chart • Visualize Association and Correlation • with Boxplot • with Violin Plot • with Scatter Plot • with Stack Bar Chart

Slide 28

Slide 28 text

Variance of Monthly Income: minimum $1,009, average $6,503, maximum $19,999.

Slide 29

Slide 29 text

Questions for Variance • What are the typical values? • Are there any outliers compared to the general trend in the variance? • How is the data distributed? • Are there any patterns you can spot in the variance?

Slide 30

Slide 30 text

Charts to visualize the variance: Histogram, Density Plot, Bar Chart.

Slide 31

Slide 31 text

How do we pick which chart to use?

Slide 32

Slide 32 text

It depends on the data type.

Slide 33

Slide 33 text

Continuous Data vs. Categorical Data

Slide 34

Slide 34 text

Continuous Data vs. Categorical Data

Slide 35

Slide 35 text

• Numerical - Numeric, Integer, Double • Date/Time - Date, POSIXct

Slide 36

Slide 36 text

Numerical: there is a continuous and ordinal relationship among values. [Number line from 0 to 50 with example values 11, 22, and 45]

Slide 37

Slide 37 text

Visualize the Variance of Numerical Variables: Histogram, Density Plot, Bar Chart

Slide 38

Slide 38 text

Histogram Density Plot Bar Chart

Slide 39

Slide 39 text

A histogram splits numerical values into a set of ‘bins’ of equal range and shows the size (number of rows) of each bin.
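The binning step can be sketched in a few lines. This is an illustrative example under stated assumptions: the `histogram` function and the sample values are hypothetical, and the rule that the maximum value falls into the last bin is a common convention, not necessarily what Exploratory does internally.

```python
def histogram(values, num_bins):
    """Split values into num_bins equal-width bins and count rows per bin."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("values must span a non-zero range")
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Assumption: the maximum value is placed in the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    # Bin edges: num_bins + 1 boundaries from low to high.
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical monthly incomes, for illustration only.
incomes = [1000, 1500, 2000, 2500, 3000, 6000, 6500, 11000]
edges, counts = histogram(incomes, 5)
print(edges)
print(counts)
```

Raising `num_bins` makes each bin narrower, which is why increasing the number of bars can reveal patterns that a coarse histogram hides.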

Slide 40

Slide 40 text

Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?

Slide 41

Slide 41 text

Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?

Slide 42

Slide 42 text

Create a histogram by assigning the ‘Monthly Income’ column to X-Axis.

Slide 43

Slide 43 text

Increase the number of bars to 50.

Slide 44

Slide 44 text

Increase the number of bars to 100 to see if any new patterns can be spotted.

Slide 45

Slide 45 text

There seem to be a few different groups.

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Average: $6,503. What makes Monthly Income vary?

Slide 49

Slide 49 text

Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?

Slide 50

Slide 50 text

Assign the ‘Gender’ column to Color.

Slide 51

Slide 51 text

There is no clear difference between Female and Male.

Slide 52

Slide 52 text

Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?

Slide 53

Slide 53 text

The Monthly Income range for Manager seems to be higher, while the ranges for Sales Rep and Research Scientist are lower.

Slide 54

Slide 54 text

But it’s hard to see the differences among the Job Roles because they are drawn on top of each other.

Slide 55

Slide 55 text

Histogram Density Plot Bar Chart

Slide 56

Slide 56 text

Density Plot • Draws a smooth curve to visualize the distribution of data. • The height shows the estimated data density at any given point. • The total area under each curve equals 1.
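One common way to draw such a smooth curve is a Gaussian kernel density estimate. The sketch below is an assumption about the technique, not a description of how Exploratory computes its Density Plot; the `gaussian_kde` helper, the sample incomes, and the rule-of-thumb bandwidth are all illustrative choices. It also checks numerically that the area under the curve is close to 1.

```python
import math
import statistics


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at x with Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical monthly incomes, for illustration only.
incomes = [2000, 2500, 3000, 4500, 5000, 6500, 9000, 15000]

# Assumption: a normal-reference rule-of-thumb bandwidth.
bandwidth = 1.06 * statistics.stdev(incomes) * len(incomes) ** (-1 / 5)
density = gaussian_kde(incomes, bandwidth)

# Approximate the area under the curve with the trapezoid rule.
low = min(incomes) - 5 * bandwidth
high = max(incomes) + 5 * bandwidth
steps = 2000
dx = (high - low) / steps
xs = [low + i * dx for i in range(steps + 1)]
area = sum((density(xs[i]) + density(xs[i + 1])) / 2 * dx for i in range(steps))
print(round(area, 3))
```

Because every curve has the same total area, curves for groups of very different sizes (for example, different Job Roles) can be compared by shape alone.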

Slide 57

Slide 57 text

Questions 1. How does Monthly Income vary? 2. Are there any differences in the Monthly Income variance between Male and Female? If so, how? 3. Are there any differences in the Monthly Income variance among Job Roles? If so, how?

Slide 58

Slide 58 text

Create a Density Plot by selecting the ‘Density Plot’ chart type and assigning the ‘Monthly Income’ column to X-Axis.

Slide 59

Slide 59 text

Assign the ‘Job Role’ column to Color to create a line for each Job Role type.

Slide 60

Slide 60 text

It’s easier to see the differences among Job Roles than with the Histogram.

Slide 61

Slide 61 text

Density Plot vs. Histogram: the same data variance is visualized in different ways.

Slide 62

Slide 62 text

Density Plot: estimated distribution based on the actual values. Histogram: actual distribution.

Slide 63

Slide 63 text

Continuous Data vs. Categorical Data

Slide 64

Slide 64 text

Categorical (e.g. California, Texas, New York, Florida, Oregon) • No continuous relationship • Limited set of values • Ordinal relationship is NOT necessary

Slide 65

Slide 65 text

Histogram Density Plot Bar Chart

Slide 66

Slide 66 text

Questions 1. How does Job Role vary? 2. Are there any differences in the Job Role variance between Male and Female? If so, how?

Slide 67

Slide 67 text

Questions 1. How does Job Role vary? 2. Are there any differences in the Job Role variance between Male and Female? If so, how?

Slide 68

Slide 68 text

Create a bar chart by selecting the ‘Bar’ chart type and assigning the ‘Job Role’ column to X Axis.

Slide 69

Slide 69 text

Sort the bars based on the Y Axis values.
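For a categorical column, the bar heights are simply the number of rows per category, and sorting by the Y Axis means ordering the categories by that count. A minimal sketch, assuming a short made-up list of job roles rather than the real dataset:

```python
from collections import Counter

# Hypothetical 'Job Role' values, for illustration only.
job_roles = [
    "Sales Executive",
    "Manager",
    "Sales Executive",
    "Research Scientist",
    "Sales Executive",
    "Research Scientist",
]

# Count the rows per category, then sort from the largest count down.
counts = Counter(job_roles)
sorted_counts = counts.most_common()

for role, size in sorted_counts:
    print(f"{role}: {size}")
```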

Slide 70

Slide 70 text

Questions 1. How does Job Role vary? 2. Are there any differences in the Job Role variance between Male and Female? If so, how?

Slide 71

Slide 71 text

Assign the ‘Gender’ column to Color.

Slide 72

Slide 72 text

Switch the bar chart to side-by-side mode.

Slide 73

Slide 73 text

Some Job Roles, such as Sales Executive and Research Scientist, have many more males than females, while there isn’t much difference for Manufacturing Director.

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance • with Histogram • with Density Plot • with Bar Chart • Visualize Association and Correlation • with Boxplot • with Violin Plot • with Scatter Plot • with Stack Bar Chart

Slide 76

Slide 76 text

Association and Correlation: a relationship in which changes in one variable happen together with changes in another variable according to a certain rule.

Slide 77

Slide 77 text

Association: any type of relationship between two variables. Correlation: a certain type of (usually linear) association between two variables.

Slide 78

Slide 78 text

Association: Monthly Income variances differ among countries. [Chart: Monthly Income (0 to 5,000) by Country (US, UK, Japan)]

Slide 79

Slide 79 text

Correlation: the higher the Age, the higher the Monthly Income. [Chart: Monthly Income vs. Age]

Slide 80

Slide 80 text

Correlation ranges from -1 (strong negative correlation) through 0 (no correlation) to 1 (strong positive correlation).
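The -1 to 1 scale is typically the Pearson correlation coefficient; the slides do not name the exact measure, so treat that as an assumption. A minimal sketch with made-up ages and incomes, using a hypothetical `pearson` helper:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, for illustration only.
ages = [25, 30, 35, 40, 45]
rising_income = [2000, 3000, 4000, 5000, 6000]
falling_income = [6000, 5000, 4000, 3000, 2000]

print(pearson(ages, rising_income))   # perfectly increasing line
print(pearson(ages, falling_income))  # perfectly decreasing line
```

Real data rarely sits exactly on a line, so values usually fall somewhere between these two extremes.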

Slide 81

Slide 81 text

Why are Association & Correlation important?

Slide 82

Slide 82 text

Variance: Monthly Income spreads from $1,000 to $20,000 around the Average (Mean).

Slide 83

Slide 83 text

Variance: Monthly Income spreads from $1,000 to $20,000.

Slide 84

Slide 84 text

Variance creates Uncertainty: with Monthly Income spread from $1,000 to $20,000, how much would the income be in this company?

Slide 85

Slide 85 text

If we can find a correlation between Monthly Income and Working Years… [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)]

Slide 86

Slide 86 text

If Working Years is 20 years, Monthly Income would be around $15,000. [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)]

Slide 87

Slide 87 text

Correlation reduces the uncertainty caused by Variance. [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)]
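One way to see this reduction in uncertainty is to fit a straight line and compare the spread of Monthly Income before and after accounting for Working Years. The sketch below uses ordinary least squares on made-up numbers; the `fit_line` helper and the data are hypothetical and are not taken from the workshop's dataset.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, for illustration only.
working_years = [1, 5, 10, 15, 20, 25, 30]
monthly_income = [1500, 3800, 7600, 11000, 15200, 17500, 19800]

slope, intercept = fit_line(working_years, monthly_income)
predicted_at_20 = slope * 20 + intercept

# Spread of income on its own vs. spread left over around the fitted line.
total_spread = statistics.pstdev(monthly_income)
residuals = [y - (slope * x + intercept) for x, y in zip(working_years, monthly_income)]
residual_spread = statistics.pstdev(residuals)

print(round(predicted_at_20), round(total_spread), round(residual_spread))
```

The stronger the correlation, the smaller the leftover spread becomes relative to the total spread, which is what makes the prediction for a given number of Working Years more certain.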

Slide 88

Slide 88 text

Association reduces the uncertainty caused by Variance. [Chart: Monthly Income by Country (US, UK, Japan)]

Slide 89

Slide 89 text

If we can find strong correlations, it makes it easier to explain how Monthly Income changes and to predict what Monthly Income will be.

Slide 90

Slide 90 text

Correlation (or Association) is not equal to Causation. Causation is a special type of Correlation. If we can confirm a given Correlation is Causation, then we can control the outcome.

Slide 91

Slide 91 text

But, Causation is not a topic here. We’ll focus primarily on Association and Correlation.

Slide 92

Slide 92 text

Visualizing Association and Correlation

Slide 93

Slide 93 text

How do we choose the right charts?

Slide 94

Slide 94 text

It depends on the combination of data types.

Slide 95

Slide 95 text

Combination of Data Types • Category vs. Numerical • Numerical vs. Numerical • Category vs. Category

Slide 96

Slide 96 text

Combination of Data Types • Category vs. Numerical • Numerical vs. Numerical • Category vs. Category

Slide 97

Slide 97 text

Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

Slide 98

Slide 98 text

Agenda • 2 Principal Questions for EDA (Exploratory Data Analysis) • Visualize Variance • with Histogram • with Density Plot • with Bar Chart • Visualize Association and Correlation • with Boxplot • with Violin Plot • with Scatter Plot • with Stack Bar Chart

Slide 99

Slide 99 text

Boxplot

Slide 100

Slide 100 text

Boxplot • Displays the distribution of numerical values by Category • Y Axis represents range of values, X Axis represents each Category

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

Separate the values into 4 groups (quartiles) so that each group has an equal size.

Slide 103

Slide 103 text

3Q (3rd Quartile / 75th Percentile) • 2Q (2nd Quartile / 50th Percentile / Median) • 1Q (1st Quartile / 25th Percentile)

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

3Q Median 1Q

Slide 106

Slide 106 text

3Q Median 1Q Max Min
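The five values a boxplot marks can be computed directly. This is a minimal sketch with made-up incomes; `five_number_summary` is a hypothetical helper, and it relies on the default quartile method of Python's `statistics.quantiles`, which may differ slightly from the method Exploratory uses.

```python
import statistics


def five_number_summary(values):
    """Return min, 1Q, median, 3Q, and max for a list of numbers."""
    # n=4 yields the three cut points that split the data into quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical monthly incomes, for illustration only.
incomes = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
box = five_number_summary(incomes)
print(box)
```

The box spans 1Q to 3Q, so it always covers the middle half of the rows; a tall box therefore signals a wide spread among typical values.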

Slide 107

Slide 107 text

Questions 1. How is the variance of Monthly Income associated with Job Role? 2. How is the variance of Monthly Income associated with Gender? 3. Is there any difference in the Monthly Income variation between Male and Female in each Job Role?

Slide 108

Slide 108 text

Questions 1. How is the variance of Monthly Income associated with Job Role? 2. How is the variance of Monthly Income associated with Gender? 3. Is there any difference in the Monthly Income variation between Male and Female in each Job Role?

Slide 109

Slide 109 text

Monthly Income (Numeric) vs. Job Role (Category)

Slide 110

Slide 110 text

Monthly Income (Numeric) vs. Gender (Category)

Slide 111

Slide 111 text

Job Role (Category) vs. Monthly Income (Numeric) by Gender (Category)

Slide 112

Slide 112 text

Combination of Data Types • Category vs. Numerical • Numerical vs. Numerical • Category vs. Category

Slide 113

Slide 113 text

Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

Slide 114

Slide 114 text

Scatter Plot

Slide 115

Slide 115 text

Each data point (row) is positioned at the intersection of two numeric variables. [Chart: Numeric vs. Numeric]

Slide 116

Slide 116 text

Correlation ranges from -1 (strong negative correlation) through 0 (no correlation) to 1 (strong positive correlation).

Slide 117

Slide 117 text

Questions 1. How is the variance of Monthly Income correlated with the variance of Age? 2. How is the variance of Monthly Income correlated with the variance of Total Working Years? 3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?

Slide 118

Slide 118 text

Questions 1. How is the variance of Monthly Income correlated with the variance of Age? 2. How is the variance of Monthly Income correlated with the variance of Total Working Years? 3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?

Slide 119

Slide 119 text

Monthly Income (Numeric) vs. Age (Numeric)

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

Check the strength of correlation.

Slide 123

Slide 123 text

Questions 1. How is the variance of Monthly Income correlated with the variance of Age? 2. How is the variance of Monthly Income correlated with the variance of Total Working Years? 3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?

Slide 124

Slide 124 text

Monthly Income (Numeric) vs. Total Working Years (Numeric)

Slide 125

Slide 125 text

Check the strength of correlation.

Slide 126

Slide 126 text

Questions 1. How is the variance of Monthly Income correlated with the variance of Age? 2. How is the variance of Monthly Income correlated with the variance of Total Working Years? 3. Are there any differences in the correlation between Monthly Income and Total Working Years among Job Roles?

Slide 127

Slide 127 text

For example, do Sales Executive and Human Resources have the same kind of correlation between Monthly Income and Total Working Years?

Slide 128

Slide 128 text

Assign the ‘Job Role’ column to Repeat By to create the scatter chart for each of the job roles.

Slide 129

Slide 129 text

The strength of correlation varies among Job Roles.
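Repeating the scatter chart per Job Role corresponds to computing the correlation separately within each group. A minimal sketch, assuming Pearson correlation and a handful of made-up rows; the `pearson` helper and the numbers are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rows: (job role, total working years, monthly income).
rows = [
    ("Sales Executive", 2, 4000),
    ("Sales Executive", 6, 6000),
    ("Sales Executive", 10, 8000),
    ("Human Resources", 2, 3000),
    ("Human Resources", 6, 2500),
    ("Human Resources", 10, 3500),
]

# Collect the two numeric columns separately for each job role.
groups = defaultdict(lambda: ([], []))
for role, years, income in rows:
    groups[role][0].append(years)
    groups[role][1].append(income)

correlation_by_role = {
    role: pearson(years, incomes) for role, (years, incomes) in groups.items()
}
for role, r in correlation_by_role.items():
    print(f"{role}: {r:.2f}")
```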

Slide 130

Slide 130 text

Numerical vs. Numerical can become Numerical vs. Categorical: we can categorize numerical values.

Slide 131

Slide 131 text

No content

Slide 132

Slide 132 text

Switch the chart type to Boxplot.

Slide 133

Slide 133 text

Click on the green text to change the number of categories.

Slide 134

Slide 134 text

Type 10 to split the Total Working Years into 10 groups.
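Categorizing a numeric column and then summarizing another column per category can be sketched as follows. The slides do not say how the groups are formed, so equal-width groups are an assumption here; `assign_groups` and the sample values are hypothetical.

```python
import statistics
from collections import defaultdict


def assign_groups(values, num_groups):
    """Map each value to an equal-width group index from 0 to num_groups - 1."""
    low, high = min(values), max(values)
    width = (high - low) / num_groups
    # Assumption: the maximum value is placed in the last group.
    return [min(int((v - low) / width), num_groups - 1) for v in values]


# Hypothetical data, for illustration only.
total_working_years = [0, 3, 8, 12, 17, 21, 26, 33, 40]
monthly_income = [1500, 2000, 4000, 5000, 6500, 15000, 16000, 17000, 19000]

groups = assign_groups(total_working_years, 10)

# Gather incomes per group, then take the median of each group.
income_by_group = defaultdict(list)
for group, income in zip(groups, monthly_income):
    income_by_group[group].append(income)

median_by_group = {
    group: statistics.median(incomes)
    for group, incomes in sorted(income_by_group.items())
}
print(groups)
print(median_by_group)
```

Each group then gets its own box in the Boxplot, so a jump between neighboring groups shows up as a sudden shift in the boxes.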

Slide 135

Slide 135 text

We can see that there is a huge jump in Monthly Income after the 20th year of working in this company.

Slide 136

Slide 136 text

Also, up to the 16th year, the income increases as the working years increase. But after the 20th year, there is no obvious correlation between the working years and the income.

Slide 137

Slide 137 text

We can assign the ‘Gender’ column to Color to see if there is any gender gap in terms of the relationship between the income and the working years.

Slide 138

Slide 138 text

We can see that Female seems to have a higher income range than Male up to the 16th year, but after that Male seems to have a higher income than Female.

Slide 139

Slide 139 text

Combination of Data Types • Numerical vs. Categorical • Numerical vs. Numerical • Categorical vs. Categorical

Slide 140

Slide 140 text

Scatter Plot Boxplot Violin Plot Histogram Stack Bar Density Plot

Slide 141

Slide 141 text

Category vs. Category: calculate the size (number of rows) for each pair and/or the ratio against the total size.
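Counting pairs and turning the counts into ratios can be sketched as below. The rows are made up, and the choice to take the ratio within each Job Role (rather than against the grand total) is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical rows: (job role, attrition), for illustration only.
rows = [
    ("Manager", False),
    ("Manager", False),
    ("Manager", False),
    ("Manager", True),
    ("Sales Rep", False),
    ("Sales Rep", True),
]

# Size (number of rows) for each pair of categories.
pair_counts = Counter(rows)

# Total size per job role, used as the denominator of the ratio.
role_totals = Counter(role for role, _ in rows)

# Assumption: ratio of each pair against the total of its job role.
ratio_within_role = {
    pair: count / role_totals[pair[0]] for pair, count in pair_counts.items()
}

for (role, attrition), ratio in ratio_within_role.items():
    print(f"{role}, attrition={attrition}: {ratio:.0%}")
```

Using ratios instead of raw counts makes roles of different sizes comparable, which is the purpose of a 100% stacked bar.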

Slide 142

Slide 142 text

Questions 1. How is Job Role associated with Education? 2. How is Job Role associated with Attrition? 3. How is Job Role associated with Monthly Income?

Slide 143

Slide 143 text

Questions 1. How is Job Role associated with Education? 2. How is Job Role associated with Attrition? 3. How is Job Role associated with Monthly Income?

Slide 144

Slide 144 text

Education (Category) vs. Job Role (Category)

Slide 145

Slide 145 text

Questions 1. How is Job Role associated with Education? 2. How is Job Role associated with Attrition? 3. How is Job Role associated with Monthly Income?

Slide 146

Slide 146 text

No content

Slide 147

Slide 147 text

• Logical is a special case of Categorical. • It can have only two unique values, TRUE or FALSE.

Slide 148

Slide 148 text

Categorical (e.g. California, Texas, New York, Florida, Oregon) • No continuous relationship • No ordinal relationship is necessary

Slide 149

Slide 149 text

Logical: either TRUE or FALSE (e.g. Is she Japanese?)

Slide 150

Slide 150 text

Job Role (Category) vs. Attrition (Logical)

Slide 151

Slide 151 text

Visualize the relationship 1. between Job Role and Education. 2. between Job Role and Attrition. 3. between Job Role and Monthly Income.

Slide 152

Slide 152 text

No content

Slide 153

Slide 153 text

Numerical to Categorical: we can categorize numerical values.

Slide 154

Slide 154 text

Select a bar chart and assign the ‘Job Role’ column to X Axis.

Slide 155

Slide 155 text

Assign the ‘Monthly Income’ column to Color, which will automatically categorize the values of the Monthly Income into 5 groups.

Slide 156

Slide 156 text

Make the Y Axis show the ratio (%) by using the ‘% of’ Window Calculation. Select ‘Window Calculation’ from the Y Axis menu.

Slide 157

Slide 157 text

Select ‘% of’ for the Calculation Type.

Slide 158

Slide 158 text

Manager and Research Director have a higher ratio of higher-income people. Surprise? ;)

Slide 159

Slide 159 text

Q & A

Slide 160

Slide 160 text

Next Seminar

Slide 161

Slide 161 text

• Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty - 6/10 (Wed) • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop

Slide 162

Slide 162 text

Information Email [email protected] Website https://exploratory.io Twitter @KanAugust Training https://exploratory.io/training

Slide 163

Slide 163 text

EXPLORATORY