EXPLORATORY
Online Seminar #31
Exploratory Data Analysis Part 1
Understanding Variance
Slide 2
Slide 2 text
Kan Nishida
CEO/co-founder
Exploratory
Summary
In Spring 2016, launched Exploratory, Inc. to democratize
Data Science.
Prior to Exploratory, Kan was a director of product
development at Oracle leading teams to build various Data
Science products in areas including Machine Learning, BI,
Data Visualization, Mobile Analytics, Big Data, etc.
While at Oracle, Kan also provided training and consulting
services to help organizations transform with data.
@KanAugust
Speaker
Slide 3
Slide 3 text
Mission
Democratize Data Science
Slide 4
Slide 4 text
4
Data Science is not just for Engineers and Statisticians.
Exploratory makes it possible for Everyone to do Data Science.
The Third Wave
Slide 5
Slide 5 text
First Wave Second Wave Third Wave
Proprietary Open Source
UI & Programming Programming
2016
2000
1976
Monetization Commoditization Democratization
Statisticians Programmers
Democratization of Data Science
Algorithms
Experience
Tools
Open Source
UI & Automation
Business Users
Theme
Users
Slide 6
Slide 6 text
6
Questions Communication
Data Access
Data Wrangling
Visualization
Analytics
(Statistics / Machine
Learning)
Data
Analysis
Data Science Workflow
Slide 7
Slide 7 text
7
Questions Communication
(Dashboard, Note, Slides)
Data Access
Data Wrangling
Visualization
Analytics
(Statistics / Machine
Learning)
Data
Analysis
Exploratory: Modern & Simple UI
Slide 8
Slide 8 text
EXPLORATORY
Online Seminar #31
Exploratory Data Analysis Part 1
Understanding Variance
Slide 9
Slide 9 text
9
Wayne Gretzky
Slide 10
Slide 10 text
10
- Wayne Gretzky
“I skate to where the puck is going to be, not where it has
been.”
Slide 11
Slide 11 text
11
Exploratory Data Analysis (EDA) is not just about the
‘playfulness’ that the name ‘Exploratory’ might suggest.
It is more about rigorous, iterative analysis of the
data at hand.
Slide 12
Slide 12 text
Why Data Analysis?
12
Slide 13
Slide 13 text
13
Goal
Grow Business
Slide 14
Slide 14 text
14
Goal
Grow Business Increase Number of Customers
Problem
Slide 15
Slide 15 text
15
Quantify
Increase Customers Number of Customers
Problem
Goal
Grow Business
Slide 16
Slide 16 text
16
Predict how many customers we will have.
Prediction
e.g. Customers will be 1000 by end of this year.
Slide 17
Slide 17 text
17
e.g. We want to grow Customers to 1000.
What can we do to make that happen?
Control
Slide 18
Slide 18 text
18
In order to control the future, you want to make decisions about
• Provide Discount
• Hire more Sales Reps.
• Invest in Marketing / Advertising
Slide 19
Slide 19 text
19
Prediction Control
Hypothesis
What can we
monitor to predict a
particular outcome
better?
What causes
a particular outcome?
Slide 20
Slide 20 text
20
Hypothesis
If the weather is
warm, we will have
more customers.
If we offer a 10%
discount, we will
have more customers.
Prediction Control
Slide 21
Slide 21 text
21
Test Hypotheses
by Experimenting and Collecting Data
• Hypothesis Test
• A/B Test
Hypotheses Data Test
A discount increases the number of customers
Slide 22
Slide 22 text
22
Test Hypotheses
by Experimenting and Collecting Data
• Hypothesis Test
• A/B Test
Hypotheses Data Test
A discount increases the number of customers
Confirmatory Analysis
Slide 23
Slide 23 text
23
Hypotheses Data Test
How can we build hypotheses?
Slide 24
Slide 24 text
24
Intuition!
Hypotheses Data Test
Slide 25
Slide 25 text
25
Build Hypothesis based on Data
Hypotheses Data Test
Data
How about Data?
Slide 26
Slide 26 text
26
John Tukey developed the method and
published a book called ‘Exploratory
Data Analysis’ in the 1970s.
Slide 27
Slide 27 text
27
Build hypotheses by exploring data.
EDA Hypotheses Data Test
Data
Exploratory Data Analysis
Slide 28
Slide 28 text
An exploratory and iterative process of asking many questions
and finding answers from data in order to build better hypotheses
for Explanation, Prediction, and Control.
28
Exploratory Data Analysis (EDA)
Slide 29
Slide 29 text
An exploratory and iterative process of asking many questions
and finding answers from data in order to build better
hypotheses for Explanation, Prediction, and Control.
29
Exploratory Data Analysis (EDA)
Slide 30
Slide 30 text
Employee Data
Slide 31
Slide 31 text
Employee Data
Slide 32
Slide 32 text
Want to understand how the salary is decided.
Slide 33
Slide 33 text
An exploratory and iterative process of asking many
questions and finding answers from data in order to build
better hypotheses for Prediction and Control.
33
Exploratory Data Analysis (EDA)
Slide 34
Slide 34 text
34
Far better an approximate
answer to the right question,
which is often vague, than an
exact answer to the wrong
question, which can always
be made precise.
— John Tukey
Slide 35
Slide 35 text
35
• How do the variables vary?
• How are the variables associated (or correlated) with
one another?
Two Principal Questions for EDA
Slide 36
Slide 36 text
36
Employee Salary
$6,503
Average
Slide 37
Slide 37 text
37
Data varies…
Slide 38
Slide 38 text
The variance is an opportunity for
Data Analysis.
38
Slide 39
Slide 39 text
39
If there is no variance…
Slide 40
Slide 40 text
40
Variance is a good starting point for building hypotheses about association or
causal relationships.
If there is variance, we can ask “What causes the variance?”
and start investigating further.
? → Income
Slide 41
Slide 41 text
41
A relationship in which changes in one variable
go together with changes in another variable
according to some rule.
Association and Correlation
Slide 42
Slide 42 text
42
Association: Any type of relationship between two variables.
Correlation: A certain type of (usually linear) association between two variables.
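The distinction can be illustrated with a small sketch. The numbers below are made up for illustration, and `pearson` is a hand-written helper, not part of any tool mentioned in these slides. A U-shaped relationship is clearly an association, yet its linear (Pearson) correlation is zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = list(range(-5, 6))             # hypothetical values, symmetric around 0
linear = [2 * x + 1 for x in xs]    # straight-line relationship
u_shaped = [x * x for x in xs]      # a clear rule, but not a linear one

r_linear = pearson(xs, linear)      # perfectly linear, so r is 1
r_u_shaped = pearson(xs, u_shaped)  # associated, yet r is 0
print(r_linear, r_u_shaped)
```

A correlation near zero therefore does not rule out an association; it only says the relationship is not well described by a straight line.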
Slide 43
Slide 43 text
43
Association
Monthly Income variances differ among countries.
[Chart: Monthly Income (0 to 5,000) by Country: US, UK, Japan]
Slide 44
Slide 44 text
44
Correlation
The higher the Age, the higher the Monthly Income.
[Chart: Monthly Income by Age]
49
Variance
How much would the income be at this company?
[Chart: Monthly Income ranging from $1,000 to $20,000]
Slide 50
Slide 50 text
50
Uncertainty
Variance
How much would the income be at this company?
[Chart: Monthly Income ranging from $1,000 to $20,000]
Slide 51
Slide 51 text
51
If we can find a correlation between Monthly Income and Working Years…
[Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30)]
Slide 52
Slide 52 text
52
If Working Years is 20 years, Monthly Income would be around $15,000.
[Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30), with $15,000 marked at 20 years]
Slide 53
Slide 53 text
53
Correlation reduces Uncertainty caused by Variance.
[Chart: Monthly Income ($1,000 to $20,000) by Working Years (0 to 30), labeled Variance and Correlation, with $15,000 marked at 20 years]
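A minimal sketch of this idea, assuming a simple least-squares line as the way to use the correlation. The (working years, monthly income) pairs are hypothetical and only loosely follow the chart's ranges. The variance of Monthly Income on its own is compared with the variance left over once Working Years is taken into account.

```python
from statistics import mean, pvariance

# Hypothetical data, made up for illustration only.
years = [1, 3, 5, 8, 10, 12, 15, 20, 25, 30]
income = [1500, 2800, 4200, 6100, 7900, 9000, 11500, 15000, 17800, 20000]

# Fit a least-squares line: income ≈ intercept + slope * years.
mean_x, mean_y = mean(years), mean(income)
covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, income))
x_spread_sum = sum((x - mean_x) ** 2 for x in years)
slope = covariance_sum / x_spread_sum
intercept = mean_y - slope * mean_x

# What the line cannot explain for each employee.
residuals = [y - (intercept + slope * x) for x, y in zip(years, income)]

total_variance = pvariance(income)        # uncertainty without Working Years
residual_variance = pvariance(residuals)  # uncertainty left after using it
predicted_at_20 = intercept + slope * 20  # estimate for 20 Working Years

print(total_variance, residual_variance, predicted_at_20)
```

The stronger the correlation, the smaller the residual variance is relative to the total variance, which is what "reducing uncertainty" means here.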
Slide 54
Slide 54 text
54
Association
Reduce Uncertainty
[Chart: Monthly Income ($1,000 to $20,000) by country (US, UK, Japan), labeled Variance]
Slide 55
Slide 55 text
55
? → Income
If we can find strong correlations, it makes it easier to
explain how Monthly Income changes
and to predict what Monthly Income will be.
Slide 56
Slide 56 text
Correlation is not equal to Causation.
Causation is a special type of Correlation.
If we can confirm a given Correlation is Causation,
then we can control the outcome.
Warning!
Slide 57
Slide 57 text
57
• How do the variables vary?
• How are the variables associated (or correlated) with one
another?
Two Principal Questions for EDA
Slide 58
Slide 58 text
“Since the aim of exploratory data analysis is to learn what
seems to be, it should be no surprise that pictures play a vital
role in doing it well.
There is nothing better for making you think of questions you
had forgotten to ask (even mentally),”
- John Tukey
Slide 59
Slide 59 text
59
Visualizing Variance with Charts
Slide 60
Slide 60 text
60
Histogram Density Plot Bar Chart
Slide 61
Slide 61 text
Visualize Variance with Bar Chart
61
Job Role
Slide 62
Slide 62 text
62
Slide 63
Slide 63 text
63
Slide 64
Slide 64 text
Visualize Variance with Bar Chart
64
Education Level
Slide 65
Slide 65 text
65
Slide 66
Slide 66 text
Visualize Variance with Bar Chart
66
Age
Slide 67
Slide 67 text
Bar Chart
67
Slide 68
Slide 68 text
Bar Chart
68
Slide 69
Slide 69 text
Bar Chart
69
Slide 70
Slide 70 text
Bar Chart
70
Slide 71
Slide 71 text
No content
Slide 72
Slide 72 text
Numerical Data vs. Categorical Data
72
Slide 73
Slide 73 text
Categorical
California
Texas
New York
Florida
Oregon
• No continuous relationship
• Limited Set of Values
• An ordinal relationship is NOT required
Slide 74
Slide 74 text
Numerical
0 10 20 30 40 50
11 22 45
Continuous and Ordinal relationship among values.
Salary
0 - 2,000 2,001 - 4,000 4,001 - 6,000 6,001 - 8,000 8,001 - 10,000
Number of
Rows
78
Slide 79
Slide 79 text
79
0 - 2,000 2,001 - 4,000 4,001 - 6,000 6,001 - 8,000 8,001 - 10,000
Divide the values into a set number of buckets and show the number of rows (or employees)
in each bucket as the bar height.
Salary
Number of
Rows
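The bucketing step can be sketched in a few lines. This is a minimal illustration with made-up salaries; `histogram_counts` is a hypothetical helper, and it uses equal-width buckets that include their lower edge, which differs slightly from the "2,001 - 4,000" style labels on the chart.

```python
def histogram_counts(values, low, high, n_buckets):
    """Count how many values fall into each of n_buckets equal-width buckets."""
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the plotted range
        # A value equal to `high` goes into the last bucket.
        index = min(int((v - low) / width), n_buckets - 1)
        counts[index] += 1
    return counts

# Hypothetical salaries, made up for illustration.
salaries = [1200, 1800, 2500, 3100, 3900, 4200, 4800, 5500, 6100, 9500]

counts = histogram_counts(salaries, 0, 10_000, 5)   # 5 bars, 2,000 wide
finer = histogram_counts(salaries, 0, 10_000, 10)   # more bars, 1,000 wide
print(counts)
print(finer)
```

Each count becomes the height of one bar. Increasing the number of buckets shows finer detail from the same data.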
Slide 80
Slide 80 text
Visualizing Variance with Histogram
80
Monthly Income
Slide 81
Slide 81 text
81
Slide 82
Slide 82 text
82
Increase Number of Bars
Slide 83
Slide 83 text
83
Increase to 100 Bars.
Slide 84
Slide 84 text
84
There seem to be a few different groups.
Slide 85
Slide 85 text
What makes the different groups?
85
Slide 86
Slide 86 text
86
Assign ‘Gender’ to Color.
Slide 87
Slide 87 text
87
There is no clear difference between Female and Male.
Slide 88
Slide 88 text
88
Assign ‘Job Role’ to Color.
Slide 89
Slide 89 text
89
Manager’s Monthly Income range seems to be higher while
Sales Rep & Research Scientist are lower.
Slide 90
Slide 90 text
The colors are stacked on top of each other.
Hard to see the difference…
90
Slide 91
Slide 91 text
Density Plot
91
Slide 92
Slide 92 text
92
Density Plot
• Draws a smooth curve to
visualize the distribution of
data.
• The height shows an
estimated data density of
any given point.
Slide 93
Slide 93 text
Each dot represents an employee, positioned by its Monthly Income value.
0 5,000 10,000
93
Monthly Income
Slide 94
Slide 94 text
Assuming that the data varies, let’s draw a normal distribution around
each data point (employee).
0 5,000 10,000
94
Slide 95
Slide 95 text
Then add up the values of all the normal distributions.
0 5,000 10,000
95
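The two steps above (a normal curve around each point, then summing them) can be sketched as follows. This is a minimal illustration, not the product's implementation: the incomes and the `bandwidth` (the width of each normal curve) are assumptions, and the sum is divided by the number of points so that the total area under the curve is 1.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of a normal curve centered at mu with spread sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def density(x, points, bandwidth):
    """Sum the normal curves around every data point, scaled to total area 1."""
    return sum(normal_pdf(x, p, bandwidth) for p in points) / len(points)

# Hypothetical monthly incomes and an assumed bandwidth.
incomes = [2000, 2500, 2700, 3000, 6000, 6500, 9000]
bandwidth = 500

# Approximate the area under the curve with a simple Riemann sum.
step = 50
area = sum(density(x, incomes, bandwidth) * step for x in range(-5000, 17001, step))

dense_spot = density(2700, incomes, bandwidth)   # many employees nearby
sparse_spot = density(4500, incomes, bandwidth)  # few employees nearby
print(area, dense_spot, sparse_spot)
```

The curve is high where many data points sit close together and low where there are few. A larger bandwidth gives a smoother curve.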
Slide 96
Slide 96 text
96
We’re going to switch to the Density Plot chart.
Slide 97
Slide 97 text
97
It’s easier to see the differences among Job Roles compared to Histogram.
Slide 98
Slide 98 text
98
The area under each curve is 1, and the area over any given range shows the
ratio of the data that falls in that range.
Slide 99
Slide 99 text
Density Plot Histogram
Same data variance is visualized in different ways.
Ratio Counts
Slide 100
Slide 100 text
Understanding Variance
with Summary View
Slide 101
Slide 101 text
101
Slide 102
Slide 102 text
‘Age’ ranges from 18 to 60, and the height of the histogram bars shows that
many employees are between 26 and 40 years old.
Slide 103
Slide 103 text
The ‘Attrition’ column shows that 237 employees have
already quit, which is about 16% of all employees in this data.
Slide 104
Slide 104 text
The ‘Job’ column shows that there are 9 job roles and that ‘Sales Executive’
has the most employees, with 326.
Slide 105
Slide 105 text
No content
Slide 106
Slide 106 text
No content
Slide 107
Slide 107 text
Highlight Mode
The Highlight helps you understand where a particular set of data that
you are interested in is and how it is distributed.
Slide 108
Slide 108 text
How is ‘Age’ distributed for ‘Sales Rep’ employees?
Slide 109
Slide 109 text
Click the ‘Highlight’ button.
Slide 110
Slide 110 text
Create a condition with the ‘equal to’ operator and select the ‘Sales Representative’
value.
Slide 111
Slide 111 text
The light blue portion in each bar shows the data distribution for ‘Sales Rep’.
Looking at the first column, ‘Age’,
many of them appear to be in
the younger age buckets.
Slide 112
Slide 112 text
By looking at the metrics, we can see that
the average (mean) age of the Sales Reps is
30 years old and that it ranges from 18 to 53.
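Outside the UI, the same kind of highlight can be approximated by filtering rows on a condition and summarizing the subset. The sketch below uses a handful of made-up rows; the field names `job_role` and `age` are assumptions, not the actual column names in the data.

```python
from statistics import mean

# Hypothetical rows, made up for illustration.
employees = [
    {"job_role": "Sales Representative", "age": 24},
    {"job_role": "Sales Executive", "age": 41},
    {"job_role": "Sales Representative", "age": 31},
    {"job_role": "Manager", "age": 52},
    {"job_role": "Sales Representative", "age": 29},
]

# The 'equal to' condition: keep only the Sales Representative rows.
highlighted = [e for e in employees if e["job_role"] == "Sales Representative"]
ages = [e["age"] for e in highlighted]

summary = {
    "count": len(ages),
    "mean": mean(ages),
    "min": min(ages),
    "max": max(ages),
}
share = len(highlighted) / len(employees)  # highlighted rows as a share of all rows
print(summary, share)
```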
Slide 113
Slide 113 text
When we look at the Department column we can see that they are all
in ‘Sales’ department.
Slide 114
Slide 114 text
But only 18.61% of the people in the Sales department are Sales Reps.
There must be other job roles in the Sales department.
Slide 115
Slide 115 text
The Attrition column shows that 33 Sales Reps have left the company
and 50 Sales Reps are still with the company.
Slide 116
Slide 116 text
There doesn’t seem to be much difference between TRUE and FALSE
in terms of the number of Sales Representatives.
But given that there are far fewer TRUE employees than FALSE
employees in general, Sales Reps might make up a larger
percentage of the employees who left.
Slide 117
Slide 117 text
Click on the ‘Ratio’ button.
Slide 118
Slide 118 text
Out of all employees who have left the company, 13.92% are
Sales Reps. That’s a high percentage!
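The ratio can be checked by hand from the counts quoted on the earlier slides (33 Sales Reps left, 50 stayed, 237 employees left in total). The sketch below also derives the attrition rate among Sales Reps themselves, which the slides do not state. It is plain arithmetic on those counts, and the variable names are illustrative.

```python
sales_reps_left = 33     # Sales Reps with Attrition TRUE
sales_reps_stayed = 50   # Sales Reps with Attrition FALSE
all_left = 237           # all employees with Attrition TRUE

# Share of Sales Reps among everyone who left (what the Ratio view shows).
share_of_leavers = sales_reps_left / all_left

# Derived figure: share of Sales Reps who left, out of all Sales Reps.
attrition_among_reps = sales_reps_left / (sales_reps_left + sales_reps_stayed)

print(f"{share_of_leavers:.2%}")      # 13.92%
print(f"{attrition_among_reps:.2%}")  # 39.76%
```

For comparison, the attrition rate across all employees is about 16%.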
Slide 119
Slide 119 text
How is the data for the employees with high Income distributed?
Slide 120
Slide 120 text
Create a condition to pick the high income employees.
Slide 121
Slide 121 text
Those who make more than $5,000 tend to be older, to have Attrition FALSE, higher
Education, and a higher Job Level, and to be in the Sales department.
Slide 122
Slide 122 text
With the Summary View and its Highlight Mode, we can quickly
understand how each variable varies.
Slide 123
Slide 123 text
123
• How do the variables vary?
• How are the variables associated (or correlated) with
one another?
Two Principal Questions for EDA
Slide 124
Slide 124 text
In the next session, we are going to explore how to investigate
and understand the relationships (Correlation and Association)
between the variables.
Slide 125
Slide 125 text
EXPLORATORY
Online Seminar #32
Exploratory Data Analysis Part 2
Correlation & Association
1/27/2021 (Wed) 11AM PT
Slide 126
Slide 126 text
No content
Slide 127
Slide 127 text
Information
Email
[email protected]
Website
https://exploratory.io
Twitter
@ExploratoryData
Seminar
https://exploratory.io/online-seminar