Slide 1

Slide 1 text

EXPLORATORY Online Seminar #31: Exploratory Data Analysis Part 1 - Understanding Variance

Slide 2

Slide 2 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory

Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 3

Slide 3 text

Mission: Democratize Data Science

Slide 4

Slide 4 text

The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 5

Slide 5 text

Democratization of Data Science

First Wave (1976): Algorithms: Proprietary; Experience: UI & Programming; Theme: Monetization; Users: Statisticians
Second Wave (2000): Algorithms: Open Source; Experience: Programming; Theme: Commoditization; Users: Programmers
Third Wave (2016): Algorithms: Open Source; Experience: UI & Automation; Theme: Democratization; Users: Business Users
Tools: [not captured in the text]

Slide 6

Slide 6 text

[Diagram: Data Science Workflow. Labels: Questions, Communication, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]

Slide 7

Slide 7 text

[Diagram: Exploratory: Modern & Simple UI. Labels: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Data Analysis]

Slide 8

Slide 8 text

EXPLORATORY Online Seminar #31: Exploratory Data Analysis Part 1 - Understanding Variance

Slide 9

Slide 9 text

Wayne Gretzky

Slide 10

Slide 10 text

“I skate to where the puck is going to be, not where it has been.” - Wayne Gretzky

Slide 11

Slide 11 text

Exploratory Data Analysis (EDA) is not just about the ‘playfulness’ that the name ‘Exploratory’ might suggest. It is more about a rigorous and iterative method of analyzing the data at hand.

Slide 12

Slide 12 text

Why Data Analysis?

Slide 13

Slide 13 text

Goal: Grow Business

Slide 14

Slide 14 text

Goal: Grow Business. Problem: Increase Number of Customers.

Slide 15

Slide 15 text

Goal: Grow Business. Problem: Increase Customers. Quantify: Number of Customers.

Slide 16

Slide 16 text

Prediction: Predict how many customers we will have, e.g. we will have 1,000 customers by the end of this year.

Slide 17

Slide 17 text

Control: e.g. we want to grow Customers to 1,000. What can we do to make that happen?

Slide 18

Slide 18 text

In order to control the future, you need to decide whether to • Provide a discount • Hire more sales reps • Invest in marketing / advertising

Slide 19

Slide 19 text

Hypothesis. Prediction: What can we monitor to predict a particular outcome better? Control: What causes a particular outcome?

Slide 20

Slide 20 text

Hypothesis. Prediction: If the weather is warm, we will have more customers. Control: If we offer a 10% discount, we will have more customers.

Slide 21

Slide 21 text

Test Hypotheses by Experimenting and Collecting Data • Hypothesis Test • A/B Test. Example hypothesis: a discount brings in more customers. [Diagram: Hypotheses, Data, Test]

Slide 22

Slide 22 text

Confirmatory Analysis: Test Hypotheses by Experimenting and Collecting Data • Hypothesis Test • A/B Test. Example hypothesis: a discount brings in more customers. [Diagram: Hypotheses, Data, Test]

Slide 23

Slide 23 text

How can we build hypotheses? [Diagram: Hypotheses, Data, Test]

Slide 24

Slide 24 text

Intuition! [Diagram: Hypotheses, Data, Test]

Slide 25

Slide 25 text

How about Data? Build hypotheses based on data. [Diagram: Data, Hypotheses, Data, Test]

Slide 26

Slide 26 text

John Tukey developed the method and published a book called ‘Exploratory Data Analysis’ in the 1970s.

Slide 27

Slide 27 text

Exploratory Data Analysis (EDA): Build hypotheses by exploring data. [Diagram: Data, EDA, Hypotheses, Data, Test]

Slide 28

Slide 28 text

Exploratory Data Analysis (EDA): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.

Slide 29

Slide 29 text

Exploratory Data Analysis (EDA): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.

Slide 30

Slide 30 text

Employee Data

Slide 31

Slide 31 text

Employee Data

Slide 32

Slide 32 text

We want to understand how salary is decided.

Slide 33

Slide 33 text

Exploratory Data Analysis (EDA): An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Prediction and Control.

Slide 34

Slide 34 text

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. - John Tukey

Slide 35

Slide 35 text

Two Principal Questions for EDA • How do the variables vary? • How are the variables associated (or correlated) with one another?

Slide 36

Slide 36 text

Employee Salary: Average $6,503

Slide 37

Slide 37 text

Data varies…

Slide 38

Slide 38 text

Variance is an opportunity for Data Analysis.

Slide 39

Slide 39 text

If there is no variance…

Slide 40

Slide 40 text

Variance is a good starting point for building hypotheses about association or causal relationships. If there is variance, we can ask “What causes the variance?” and start investigating further. [Diagram: ? → Income]
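
As a minimal sketch of how variance can be quantified, the Python snippet below computes the mean, range, variance, and standard deviation of a short salary list. The values are the fifteen sample salaries that appear in this deck's histogram example; the variable names are illustrative assumptions, and the population variance (`pvariance`) is only one reasonable choice of measure.

```python
from statistics import mean, pvariance, pstdev

# Sample salaries taken from the deck's histogram example (illustrative use).
salaries = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500,
            5300, 3400, 4200, 5200, 5800, 10000, 7800]

average = mean(salaries)                        # central value
lowest, highest = min(salaries), max(salaries)  # range of the data
variance = pvariance(salaries)                  # mean squared distance from the average
spread = pstdev(salaries)                       # square root of the variance, in salary units

print(f"mean={average:.0f} range={lowest}-{highest} stdev={spread:.0f}")
```

If every salary were identical, the variance would be zero and there would be nothing left to explain, which is the point of the "no variance" slide.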

Slide 41

Slide 41 text

Association and Correlation: A relationship in which changes in one variable happen together with changes in another variable, following a certain rule.

Slide 42

Slide 42 text

Association: Any type of relationship between two variables. Correlation: A certain type of (usually linear) association between two variables.

Slide 43

Slide 43 text

Association: Monthly Income variances are different among countries. [Chart: Monthly Income (0 to 5,000) by Country (US, UK, Japan)]

Slide 44

Slide 44 text

Correlation: The higher the Age, the higher the Monthly Income. [Chart: Monthly Income vs. Age]

Slide 45

Slide 45 text

Correlation scale: -1 (Strong Negative Correlation), -0.5, 0 (No Correlation), 0.5, 1 (Strong Positive Correlation)
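
To make the -1 to 1 scale concrete, here is a small sketch that computes the Pearson correlation coefficient in plain Python. The deck does not say which coefficient its chart uses, so Pearson is an assumption here, and the age and income lists are invented purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
age = [25, 30, 35, 40, 45, 50]
income_up = [2000, 3000, 4000, 5000, 6000, 7000]     # rises steadily with age
income_down = [7000, 6000, 5000, 4000, 3000, 2000]   # falls steadily with age
income_mixed = [5000, 2000, 6000, 2000, 6000, 3000]  # no clear pattern

r_up = pearson_r(age, income_up)        # close to 1: strong positive correlation
r_down = pearson_r(age, income_down)    # close to -1: strong negative correlation
r_mixed = pearson_r(age, income_mixed)  # close to 0: little or no correlation

print(round(r_up, 2), round(r_down, 2), round(r_mixed, 2))
```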

Slide 46

Slide 46 text

Why Are Association & Correlation Important?

Slide 47

Slide 47 text

Variance and Average (Mean). [Chart: values ranging from $1,000 to $20,000]

Slide 48

Slide 48 text

Variance. [Chart: Monthly Income, $1,000 to $20,000]

Slide 49

Slide 49 text

How much would the income be in this company? [Chart: Variance of Monthly Income, $1,000 to $20,000]

Slide 50

Slide 50 text

Uncertainty: How much would the income be in this company? [Chart: Variance of Monthly Income, $1,000 to $20,000]

Slide 51

Slide 51 text

If we can find a correlation between Monthly Income and Working Years… [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)]

Slide 52

Slide 52 text

If Working Years is 20 years, Monthly Income would be around $15,000. [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30)]

Slide 53

Slide 53 text

Correlation reduces Uncertainty caused by Variance. [Chart: Monthly Income ($1,000 to $20,000) vs. Working Years (0 to 30), with $15,000 marked]
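
One way to see how a correlation reduces uncertainty is to fit a straight line and compare the spread of the raw incomes with the spread left over after the line has made its prediction. The sketch below assumes a simple least-squares line; the deck does not name a fitting method, and the working-years and income values are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical data, invented for illustration only.
working_years = [1, 5, 10, 15, 20, 25, 30]
monthly_income = [1500, 4000, 8000, 11000, 15000, 17500, 20000]

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(working_years, monthly_income)
predicted = [intercept + slope * x for x in working_years]
residuals = [y - p for y, p in zip(monthly_income, predicted)]

total_var = pvariance(monthly_income)  # uncertainty without knowing Working Years
residual_var = pvariance(residuals)    # uncertainty left once Working Years is known

print(f"income at 20 years is roughly {intercept + slope * 20:.0f}")
print(f"share of variance left unexplained: {residual_var / total_var:.3f}")
```

The smaller the unexplained share, the narrower the range of plausible incomes for a given number of working years.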

Slide 54

Slide 54 text

Association: Reduce the Uncertainty caused by Variance. [Chart: Monthly Income ($1,000 to $20,000) by Country (US, UK, Japan)]

Slide 55

Slide 55 text

If we can find strong correlations, it becomes easier to explain how Monthly Income changes and to predict what Monthly Income will be. [Diagram: ? → Income]

Slide 56

Slide 56 text

Warning! Correlation is not equal to Causation. Causation is a special type of Correlation. If we can confirm that a given Correlation is Causation, then we can control the outcome.

Slide 57

Slide 57 text

Two Principal Questions for EDA • How do the variables vary? • How are the variables associated (or correlated) with one another?

Slide 58

Slide 58 text

“Since the aim of exploratory data analysis is to learn what seems to be, it should be no surprise that pictures play a vital role in doing it well. There is nothing better than a picture for making you think of questions you had forgotten to ask (even mentally).” - John Tukey

Slide 59

Slide 59 text

Visualizing Variance with Charts

Slide 60

Slide 60 text

Histogram, Density Plot, Bar Chart

Slide 61

Slide 61 text

Visualize Variance with Bar Chart: Job Role

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Visualize Variance with Bar Chart: Education Level

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

Visualize Variance with Bar Chart: Age

Slide 67

Slide 67 text

Bar Chart

Slide 68

Slide 68 text

Bar Chart

Slide 69

Slide 69 text

Bar Chart

Slide 70

Slide 70 text

Bar Chart

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

Numerical Data vs. Categorical Data

Slide 73

Slide 73 text

Categorical (e.g. California, Texas, New York, Florida, Oregon) • No continuous relationship • Limited set of values • An ordinal relationship is NOT required

Slide 74

Slide 74 text

Numerical (e.g. 11, 22, 45 on a scale from 0 to 50): Continuous and ordinal relationship among values.

Slide 75

Slide 75 text

Bar Chart

Slide 76

Slide 76 text

Histogram

Slide 77

Slide 77 text

Values: 1,000, 1,500, 3,000, 6,500, 7,100, 2,200, 3,800, 4,500, 5,300, 3,400, 4,200, 5,200, 5,800, 10,000, 7,800 [Axis: 1,000 to 10,000]

Slide 78

Slide 78 text

[Histogram: Number of Rows by Salary bucket: 0 - 2,000, 2,001 - 4,000, 4,001 - 6,000, 6,001 - 8,000, 8,001 - 10,000]

Slide 79

Slide 79 text

Divide the values into a set number of buckets and show how many rows (or employees) fall in each bucket as the height of its bar. [Histogram: Number of Rows by Salary bucket: 0 - 2,000, 2,001 - 4,000, 4,001 - 6,000, 6,001 - 8,000, 8,001 - 10,000]
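
The bucketing step can be sketched in a few lines of Python. The salary values and the five bucket ranges come from the slides; the variable names and the text-based bars are illustrative assumptions, not how Exploratory draws its histogram.

```python
# Salary values and bucket ranges as shown on the slides.
salaries = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500,
            5300, 3400, 4200, 5200, 5800, 10000, 7800]
bucket_width = 2000
labels = ["0 - 2,000", "2,001 - 4,000", "4,001 - 6,000",
          "6,001 - 8,000", "8,001 - 10,000"]

counts = [0] * len(labels)
for value in salaries:
    # Each bucket includes its upper edge, so 2,000 falls in the first bucket
    # and 2,001 in the second.
    index = max(0, (value - 1) // bucket_width)
    counts[index] += 1

# The bar height is the number of rows (employees) in each bucket.
for label, count in zip(labels, counts):
    print(f"{label:>14} | {'#' * count} ({count})")
```

Changing `bucket_width` (and the labels to match) is the equivalent of increasing or decreasing the number of bars.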

Slide 80

Slide 80 text

Visualizing Variance with Histogram: Monthly Income

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

Increase the number of bars.

Slide 83

Slide 83 text

Increase to 100 bars.

Slide 84

Slide 84 text

There seem to be a few different groups.

Slide 85

Slide 85 text

What creates the different groups?

Slide 86

Slide 86 text

Assign ‘Gender’ to Color.

Slide 87

Slide 87 text

There is no clear difference between Female and Male.

Slide 88

Slide 88 text

Assign ‘Job Role’ to Color.

Slide 89

Slide 89 text

The Monthly Income range for Managers seems to be higher, while the ranges for Sales Rep & Research Scientist are lower.

Slide 90

Slide 90 text

The colors are drawn on top of each other, so it is hard to see the difference…

Slide 91

Slide 91 text

Density Plot

Slide 92

Slide 92 text

Density Plot • Draws a smooth curve to visualize the distribution of data. • The height shows the estimated data density at any given point.

Slide 93

Slide 93 text

Each dot represents an employee and is positioned according to that employee's Monthly Income value. [Axis: Monthly Income, 0 to 10,000]

Slide 94

Slide 94 text

Assuming that the data varies, let’s draw a normal distribution around each data point (employee). [Axis: 0 to 10,000]

Slide 95

Slide 95 text

And add up all the values of the normal distributions. [Axis: 0 to 10,000]
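
The two steps above (one normal distribution per data point, then adding them up) are the idea behind a Gaussian kernel density estimate. Here is a rough sketch in plain Python. The bandwidth of 800 and the grid spacing of 50 are arbitrary assumptions, the income values reuse the deck's sample salary list, and dividing the sum by the number of points is an added step so that the area under the curve comes out as 1.

```python
import math

def normal_pdf(x, center, sigma):
    """Height at x of a normal distribution centered on one data point."""
    z = (x - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def density(x, points, bandwidth):
    """Add up the normal curves of all points, then average them."""
    return sum(normal_pdf(x, p, bandwidth) for p in points) / len(points)

# Sample values reused from the deck's histogram example.
incomes = [1000, 1500, 2200, 3000, 3400, 3800, 4200, 4500,
           5200, 5300, 5800, 6500, 7100, 7800, 10000]
bandwidth = 800  # assumed width of each normal curve
step = 50        # assumed spacing of the evaluation grid

grid = range(-5000, 16001, step)
curve = [density(x, incomes, bandwidth) for x in grid]

# Approximate the area under the curve with a simple Riemann sum.
area = sum(curve) * step
peak_x = grid[curve.index(max(curve))]
print(f"area under curve is about {area:.3f}; highest density near {peak_x}")
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, much like changing the number of bars in a histogram.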

Slide 96

Slide 96 text

We’re going to switch to the Density Plot chart.

Slide 97

Slide 97 text

It’s easier to see the differences among Job Roles than with the Histogram.

Slide 98

Slide 98 text

The area under each curve is 1, and the area over any given range of a curve shows the ratio of the data that falls in that range.

Slide 99

Slide 99 text

The same data variance is visualized in different ways. Density Plot: Ratio. Histogram: Counts.

Slide 100

Slide 100 text

Understanding Variance with Summary View

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

‘Age’ ranges from 18 to 60, and the height of the bars in the histogram shows that many employees are between 26 and 40 years old.

Slide 103

Slide 103 text

The ‘Attrition’ column shows that 237 employees have already quit, which is about 16% of all employees in this data.

Slide 104

Slide 104 text

The ‘Job’ column shows that there are 9 job roles and that ‘Sales Executive’ has the most employees, with 326.

Slide 105

Slide 105 text

No content

Slide 106

Slide 106 text

No content

Slide 107

Slide 107 text

Highlight Mode: Highlight helps you understand where a particular set of data that you are interested in sits and how it is distributed.

Slide 108

Slide 108 text

How is ‘Age’ distributed for ‘Sales Rep’ employees?

Slide 109

Slide 109 text

Click the ‘Highlight’ button.

Slide 110

Slide 110 text

Create a condition with the ‘equal to’ operator and select the ‘Sales Representative’ value.

Slide 111

Slide 111 text

The light blue portion in each bar shows the data distribution for ‘Sales Rep’. Looking at the first column, ‘Age’, many of them appear to be in the younger age buckets.

Slide 112

Slide 112 text

By looking at the metrics, we can see that the average (mean) age of the Sales Reps is 30 and that their ages range from 18 to 53.

Slide 113

Slide 113 text

When we look at the Department column, we can see that they are all in the ‘Sales’ department.

Slide 114

Slide 114 text

But only 18.61% of the people in the Sales department are Sales Reps, so there must be other job roles in the Sales department.

Slide 115

Slide 115 text

The Attrition column shows that 33 Sales Reps have left the company and 50 Sales Reps are still with the company.

Slide 116

Slide 116 text

There doesn’t seem to be much difference between TRUE and FALSE in terms of the number of Sales Representatives. But given that there are far fewer TRUE employees than FALSE employees overall, Sales Reps might make up a larger percentage of the TRUE group.

Slide 117

Slide 117 text

Click on the ‘Ratio’ button.

Slide 118

Slide 118 text

Out of all employees who have left the company, 13.92% are Sales Reps. That’s a high percentage!
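
The percentage can be checked with simple arithmetic on the counts quoted on the earlier slides (237 employees who left in total, 33 Sales Reps who left, 50 Sales Reps who stayed). The variable names are illustrative, and the last figure, the attrition rate among Sales Reps themselves, is derived here from those counts rather than stated on the slides.

```python
# Counts quoted on the Summary View slides.
employees_left = 237    # Attrition = TRUE, all job roles
sales_rep_left = 33     # Sales Reps with Attrition = TRUE
sales_rep_stayed = 50   # Sales Reps with Attrition = FALSE

# Share of all leavers who are Sales Reps (what the 'Ratio' view shows).
share_of_leavers = sales_rep_left / employees_left

# Derived for comparison: attrition rate within the Sales Rep group.
sales_rep_total = sales_rep_left + sales_rep_stayed
sales_rep_attrition = sales_rep_left / sales_rep_total

print(f"Sales Reps among leavers: {share_of_leavers:.2%}")
print(f"Attrition among Sales Reps: {sales_rep_attrition:.2%}")
```

Comparing the second figure with the overall attrition rate of about 16% shows why the Sales Rep group stands out.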

Slide 119

Slide 119 text

How is the data for the employees with high Income distributed?

Slide 120

Slide 120 text

Create a condition to pick the high-income employees.

Slide 121

Slide 121 text

Those who make more than $5,000 tend to be older, to have FALSE in Attrition, to have a higher Education level, to be in the Sales department, and to have a higher Job Level.

Slide 122

Slide 122 text

With the Summary View and its Highlight Mode, we can quickly understand the variance of each variable.

Slide 123

Slide 123 text

Two Principal Questions for EDA • How do the variables vary? • How are the variables associated (or correlated) with one another?

Slide 124

Slide 124 text

In the next session, we are going to explore how to investigate and understand the relationships (Correlation and Association) between the variables.

Slide 125

Slide 125 text

EXPLORATORY Online Seminar #32: Exploratory Data Analysis Part 2 - Correlation & Association. 1/27/2021 (Wed) 11AM PT

Slide 126

Slide 126 text

No content

Slide 127

Slide 127 text

Information. Email: [email protected]. Website: https://exploratory.io. Twitter: @ExploratoryData. Seminar: https://exploratory.io/online-seminar

Slide 128

Slide 128 text

Q & A

Slide 129

Slide 129 text

EXPLORATORY