Exploratory Data Analysis Part 1 - Understanding Variance

EXPLORATORY Online Seminar #31
Exploratory Data Analysis Part 1: Understanding Variance

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory
Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Mission: Democratize Data Science

Data Science is not just for Engineers and Statisticians.
Exploratory makes it possible for Everyone to do Data Science.

The Third Wave: Democratization of Data Science

              First Wave          Second Wave       Third Wave
Year          1976                2000              2016
Algorithms    Proprietary         Open Source       Open Source
Experience    UI & Programming    Programming       UI & Automation
Tools         (not captured in this transcript)
Theme         Monetization        Commoditization   Democratization
Users         Statisticians       Programmers       Business Users

Data Science Workflow: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
[Diagram label: Data Analysis]

Exploratory: Modern & Simple UI

“I skate to where the puck is going to be, not where it has been.” - Wayne Gretzky

Exploratory Data Analysis (EDA) is not just about the ‘playfulness’ that might come off from the name ‘Exploratory’. It is more about the method of rigorous and iterative analysis of the data at your hand.

Why Data Analysis?

Goal: Grow Business
Problem: Increase Number of Customers
Quantify: Number of Customers

Prediction: Predict how many customers we will have.
e.g. Customers will be 1,000 by the end of this year.

Control: What can we do to make that happen?
e.g. We want to grow Customers to 1,000.

In order to control the future, you want to make decisions about whether to:
• Provide a Discount
• Hire more Sales Reps
• Invest in Marketing / Advertising

Hypothesis
Prediction: What can we monitor to predict a particular outcome better?
e.g. If the weather is warm, we will have more customers.
Control: What causes a particular outcome?
e.g. If we offer a 10% discount, we will have more customers.

Confirmatory Analysis: Test Hypotheses by Experimenting and Collecting Data
• Hypothesis Test
• A/B Test
Example hypothesis to test: a discount brings in more customers.
[Diagram: Hypotheses, Data, Test]
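
As a rough illustration of the A/B test idea, the sketch below compares the share of visitors who became customers with and without a discount, using a two-proportion z-test. The deck does not say which test it has in mind, so the choice of test, the function name `two_proportion_z_test`, and all counts are assumptions made for this example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that the discount changes nothing.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts (not from the deck): group A saw no discount,
# group B was offered a 10% discount.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value would suggest that the difference between the two groups is unlikely to be chance alone, which is the confirmatory step that follows once a hypothesis has been built.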

How can we build Hypotheses?
[Diagram: Hypotheses, Data, Test]

Intuition!

How about Data? Build Hypotheses based on Data.

John Tukey developed the method and published a book called ‘Exploratory Data Analysis’ in the 1970s.

Exploratory Data Analysis (EDA): Build hypotheses by exploring data.

Exploratory Data Analysis (EDA)
An exploratory and iterative process of asking many questions and finding answers from data in order to build better hypotheses for Explanation, Prediction, and Control.

Employee Data

Want to understand how the salary is decided.

Exploratory Data Analysis (EDA): ask many questions and find answers from data in order to build better hypotheses for Prediction and Control.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” - John Tukey

Two Principal Questions for EDA
• How do the variables vary?
• How are the variables associated (or correlated) with one another?

Employee Salary: Average $6,503

Data varies…

The variance is an opportunity for Data Analysis.

If there is no variance…

Variance is a good starting point for building hypotheses of association or causal relationship. If there is variance, we can ask “What makes the variance?” and start investigating further.
[Diagram: ? → Income]
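
To make "variance" concrete, here is a minimal sketch that summarizes the spread of a small sample with Python's standard `statistics` module. It reuses the fifteen illustrative salary values that appear on this deck's histogram slide, not the employee dataset behind the $6,503 average, and the variable names are chosen for this example.

```python
import statistics

# The fifteen illustrative salary values from the deck's histogram slide.
salaries = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500, 5300, 3400,
            4200, 5200, 5800, 10000, 7800]

mean_salary = statistics.mean(salaries)
variance = statistics.pvariance(salaries)  # population variance
std_dev = statistics.pstdev(salaries)      # population standard deviation
salary_range = (min(salaries), max(salaries))

print(f"mean     = {mean_salary:,.2f}")
print(f"variance = {variance:,.2f}")
print(f"std dev  = {std_dev:,.2f}")
print(f"range    = {salary_range[0]:,} to {salary_range[1]:,}")

# If every employee earned the same amount there would be no variance,
# and nothing left to explain.
no_variance = statistics.pvariance([5000] * 15)
print(f"variance when all salaries are equal = {no_variance}")
```

The average alone hides the fact that salaries run from 1,000 to 10,000. The spread is what prompts the question “What makes the variance?”.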

Association and Correlation
A relationship where changes in one variable happen together with changes in another variable according to a certain rule.

Association: Any type of relationship between two variables.
Correlation: A certain type of (usually linear) association between two variables.

Association
[Chart: Monthly Income by Country (US, UK, Japan)]
Monthly Income variances are different among countries.

Correlation
[Chart: Monthly Income by Age]
The higher the Age, the higher the Monthly Income.

Correlation
[Scale: -1 (Strong Negative Correlation), -0.5, 0 (No Correlation), 0.5, 1 (Strong Positive Correlation)]
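
The -1 to 1 scale above is commonly measured with the Pearson correlation coefficient. The deck does not name a specific coefficient, so treating it as Pearson's is an assumption, and the ages and incomes below are hypothetical values for illustration only.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Hypothetical data: monthly income tends to rise with age.
ages = [25, 30, 35, 40, 45, 50]
incomes = [2500, 3200, 4100, 4800, 6000, 6900]

r = pearson_correlation(ages, incomes)
print(f"correlation between age and income: {r:.3f}")
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.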

Why are Association & Correlation Important?

Variance
[Chart: Monthly Income spread from $1,000 to $20,000, with the Average (Mean) marked]

How much would the income be in this company?

Variance means Uncertainty.

If we can find a correlation between Monthly Income and Working Years…
[Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30)]

If Working Years is 20, Monthly Income would be around $15,000.

Correlation reduces the Uncertainty caused by Variance.
[Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30), with $15,000 marked]
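
One way to see how a correlation reduces uncertainty is to compare the spread of incomes on their own with the spread that remains after accounting for working years. The sketch below fits a least-squares line for that purpose. The data points are hypothetical and the straight-line model is an assumption for this example; the deck does not specify either.

```python
import statistics

# Hypothetical (working years, monthly income) pairs, for illustration only.
working_years = [0, 5, 10, 15, 20, 25, 30]
monthly_income = [1000, 5200, 7600, 11500, 15300, 17000, 20000]

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(working_years, monthly_income)

# What is left unexplained once working years are taken into account.
residuals = [y - (intercept + slope * x)
             for x, y in zip(working_years, monthly_income)]

spread_without_years = statistics.pstdev(monthly_income)
spread_given_years = statistics.pstdev(residuals)
predicted_at_20 = intercept + slope * 20

print(f"spread of income alone:        {spread_without_years:,.0f}")
print(f"spread once years are known:   {spread_given_years:,.0f}")
print(f"predicted income at 20 years:  {predicted_at_20:,.0f}")
```

The remaining spread is much smaller than the original one, so knowing the working years narrows the range of plausible incomes.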

Association also reduces the Uncertainty caused by Variance.
[Chart: Monthly Income ($1,000 to $20,000) by Country (US, UK, Japan)]

If we can find strong correlations, it becomes easier to explain how Monthly Income changes and to predict what Monthly Income will be.
[Diagram: ? → Income]

Warning!
Correlation is not equal to Causation. Causation is a special type of Correlation. If we can confirm that a given Correlation is Causation, then we can control the outcome.

“Since the aim of exploratory data analysis is to learn
what seems to be, it should be no surprise that pictures play a vital role in doing it well. There is nothing better for making you think of questions you had forgotten to ask (even mentally),” - John Tukey

Visualizing Variance with Charts

Histogram, Density Plot, Bar Chart

Visualize Variance with Bar Chart: Job Role

Visualize Variance with Bar Chart: Education Level

Visualize Variance with Bar Chart: Age

Numerical Data vs. Categorical Data

Categorical (e.g. California, Texas, New York, Florida, Oregon)
• No continuous relationship
• Limited set of values
• Ordinal relationship is NOT necessary

Numerical (e.g. 11, 22, 45 on a scale from 0 to 50)
• Continuous and ordinal relationship among values

Bar Chart

Histogram

Salary values (from 1,000 to 10,000):
1,000 1,500 3,000 6,500 7,100 2,200 3,800 4,500 5,300 3,400 4,200 5,200 5,800 10,000 7,800

Salary buckets (x-axis): 0 - 2,000, 2,001 - 4,000, 4,001 - 6,000, 6,001 - 8,000, 8,001 - 10,000
Number of Rows (y-axis)
Divide the values into a set number of buckets and show how many rows (or employees) fall in each bucket as the height of the bar.
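
The bucketing step can be sketched in a few lines. This example counts the fifteen salary values listed above into the five buckets shown on the slide; the helper name `bucket_index` and the text rendering are choices made for this illustration, not how Exploratory draws its histograms.

```python
# The fifteen salary values from the slide.
salaries = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500, 5300, 3400,
            4200, 5200, 5800, 10000, 7800]

BUCKET_WIDTH = 2000
NUM_BUCKETS = 5

def bucket_index(value):
    """Index of the bucket holding value; upper edges are inclusive,
    matching the slide's 0 - 2,000, 2,001 - 4,000, ... labels."""
    if value <= 0:
        return 0
    return min((value - 1) // BUCKET_WIDTH, NUM_BUCKETS - 1)

counts = [0] * NUM_BUCKETS
for salary in salaries:
    counts[bucket_index(salary)] += 1

# Render each bucket as a text bar whose length is the number of rows.
for i, count in enumerate(counts):
    low = 0 if i == 0 else i * BUCKET_WIDTH + 1
    high = (i + 1) * BUCKET_WIDTH
    print(f"{low:>6,} - {high:>6,} | {'#' * count} ({count})")

# counts is [2, 4, 5, 3, 1]
```

Most values fall in the middle buckets, which is the kind of shape a histogram makes visible at a glance.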

Visualizing Variance with Histogram: Monthly Income

Increase the Number of Bars.

Increase to 100 Bars.

There seem to be a few different groups.

What makes the different groups?

Assign ‘Gender’ to Color.

There is no clear difference between Female and Male.

Assign ‘Job Role’ to Color.

The Manager’s Monthly Income range seems to be higher, while Sales Rep & Research Scientist are lower.

The colors are stacked on top of each other, so it is hard to see the difference…

Density Plot

Density Plot
• Draws a smooth curve to visualize the distribution of data.
• The height shows the estimated data density at any given point.

Each dot represents an employee, positioned by Monthly Income value.
[Axis: Monthly Income, 0 to 10,000]

Assuming that the data varies, let’s draw a normal distribution around each data point (employee).

Then, add up the values of all the normal distributions.

We’re going to switch to the Density Plot chart.

It’s easier to see the differences among Job Roles compared to the Histogram.

The area under each curve is 1, and the area over any range shows the ratio of data that falls in that range.
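
The construction described above (a normal curve around each data point, then summed) is known as kernel density estimation. Here is a minimal sketch using only the standard library. The bandwidth of 1,000 is an arbitrary assumption, the data reuses the fifteen salary values from the histogram slide, and dividing the sum by the number of points is what keeps the area under the curve at 1. Charting tools typically pick the bandwidth automatically.

```python
import math

# The fifteen salary values from the histogram slide.
salaries = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500, 5300, 3400,
            4200, 5200, 5800, 10000, 7800]

BANDWIDTH = 1000  # assumed width (standard deviation) of each normal curve

def normal_pdf(x, mean, std_dev):
    """Height of a normal distribution centered on mean at position x."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2 * math.pi))

def density(x, data, bandwidth):
    """Sum the normal curves drawn around every data point, then divide by
    the number of points so the total area under the curve is 1."""
    return sum(normal_pdf(x, value, bandwidth) for value in data) / len(data)

# Print the estimated density at a few income levels.
for x in range(0, 12001, 2000):
    print(f"{x:>6,}: {density(x, salaries, BANDWIDTH):.6f}")
```

Because each curve is rescaled to a total area of 1, groups of different sizes can be compared by shape, which a histogram of raw counts does not allow.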

Density Plot (Ratio) vs. Histogram (Counts): the same data variance is visualized in different ways.

Understanding Variance with Summary View

‘Age’ ranges from 18 to 60, and the height of the histogram bars shows that many employees are between 26 and 40 years old.

The ‘Attrition’ column shows that 237 employees have already quit, which is about 16% of all employees in this data.

The ‘Job’ column shows that there are 9 job roles, and ‘Sales Executive’ has the most employees with 326.

Highlight Mode
Highlight Mode helps you see where a particular set of data you are interested in sits and how it is distributed.

How is ‘Age’ distributed for ‘Sales Rep’ employees?

Click the ‘Highlight’ button.

Create a condition with the ‘equal to’ operator and select the ‘Sales Representative’ value.

The light blue portion in each bar shows the data distribution for ‘Sales Rep’. Looking at the first column, ‘Age’, many of them are in the younger age buckets.

By looking at the metrics, we can see that the average (mean) age of the Sales Reps is 30 years old and that it ranges from 18 to 53.

When we look at the Department column we can see
that they are all in ‘Sales’ department.

But only 18.61% of the people in the Sales department are Sales Reps. There must be other job roles in the Sales department.

The Attrition column shows that 33 Sales Reps have left the company and 50 Sales Reps are still with the company.

There doesn’t seem to be much difference between TRUE and FALSE in terms of the number of Sales Representatives. But given that there are far fewer TRUE employees than FALSE employees in general, Sales Reps might make up a larger share of the employees who left than of those who stayed.

Click on the ‘Ratio’ button.

Out of all employees who have left the company, 13.92% are Sales Reps. That’s a high percentage!
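
The ratio can be checked by hand from the counts quoted in this walkthrough (33 Sales Reps left, 50 stayed, 237 employees left in total). The sketch below does the arithmetic; the variable names are chosen for this example, and the second figure, the attrition rate within Sales Reps, is derived here from those counts rather than quoted from the deck.

```python
# Counts quoted in the walkthrough.
sales_reps_left = 33
sales_reps_stayed = 50
all_employees_left = 237

# Share of all leavers who were Sales Reps (the figure the Ratio view shows).
share_of_leavers = sales_reps_left / all_employees_left

# Attrition rate within the Sales Rep role itself.
sales_rep_attrition_rate = sales_reps_left / (sales_reps_left + sales_reps_stayed)

print(f"Sales Reps among leavers:  {share_of_leavers:.2%}")
print(f"Sales Rep attrition rate:  {sales_rep_attrition_rate:.2%}")
```

Both ratios point the same way: the raw counts of 33 and 50 look similar, but relative to group sizes Sales Reps leave far more often than the roughly 16% seen across all employees.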

How is the data for the employees with high Income
distributed?

Create a condition to pick the high income employees.

Those who make more than $5,000 tend to be older, have Attrition of FALSE, have higher Education, be in the Sales Department, and have a higher Job Level.

With the Summary View and its Highlight Mode, we can quickly understand the variance of each variable.

In the next session, we will explore how to investigate and understand the relationships (Correlation and Association) between the variables.

EXPLORATORY Online Seminar #32 Exploratory Data Analysis Part 2 Correlation
& Association 1/27/2021 (Wed) 11AM PT

Information
Email: [email protected]
Website: https://exploratory.io
Twitter: @ExploratoryData
Seminar: https://exploratory.io/online-seminar

Q & A
