EXPLORATORY
Data Visualization Workshop
Part 1 - Visualizing Summarized Data
Slide 2
Slide 2 text
2
EXPLORATORY
Slide 3
Slide 3 text
Kan Nishida
CEO/co-founder
Exploratory
Summary
At the beginning of 2016, Kan launched Exploratory, Inc. to
democratize Data Science.
Prior to Exploratory, Kan was a director of product
development at Oracle leading teams for building various
Data Science products in areas including Machine
Learning, BI, Data Visualization, Mobile Analytics, Big Data,
etc.
While at Oracle, Kan also provided training and consulting
services to help organizations transform with data.
@KanAugust
Speaker
Slide 4
Slide 4 text
Data Science is not just for Engineers and Statisticians.
Exploratory makes it possible for Everyone to do Data Science.
The Third Wave
Slide 5
Slide 5 text
Questions
Data Access
Data Wrangling
Visualization
Analytics (Statistics / Machine Learning)
Communication
Data Analysis
Data Science Workflow
Slide 6
Slide 6 text
Questions
Data Access
Data Wrangling
Visualization
Analytics (Statistics / Machine Learning)
Communication (Dashboard, Note, Slides)
Data Analysis
Exploratory - Modern & Simple UI
Slide 8
Slide 8 text
• Part 1 - Visualizing Summarized Data
• Part 2 - Visualizing Time Series Data
• Part 3 - Visualizing Distribution & Correlation
• Part 4 - Visualizing Uncertainty
• Part 5 - Data Wrangling for Data Visualization
Exploratory - Data Visualization Workshop
Slide 9
Slide 9 text
Part 1 - Visualizing Summarized Data
1. Choosing the Right Chart based on Data Type
2. Introduction to Basic Chart Features
3. Introduction to Scatter with Aggregate
4. Introduction to Map
9
Slide 10
Slide 10 text
Before we begin…
10
Slide 11
Slide 11 text
Create a Project
11
Slide 12
Slide 12 text
First, create a project. That's where you will import all your data.
12
Slide 13
Slide 13 text
Create a new project
13
Slide 14
Slide 14 text
Type the project name and click the ‘Create’ button!
14
Slide 15
Slide 15 text
It opens a new project. Let’s begin!
15
Slide 16
Slide 16 text
Import Data
16
Slide 17
Slide 17 text
Sample Data
Airbnb New York Data
17
Slide 18
Slide 18 text
1. Open Airbnb New York Data Page
2. Download the Data
3. Import the Data
Import Data
Slide 19
Slide 19 text
19
Select ‘Data Catalog’ from the Data Frame menu.
Slide 20
Slide 20 text
20
Type ‘airbnb’ to search for the ‘Airbnb Listing Data for New York City’ data.
Slide 21
Slide 21 text
21
Click the Import button to import the data.
Slide 22
Slide 22 text
22
Click the Save button to save the data.
Slide 23
Slide 23 text
Data is imported into Exploratory.
23
Slide 24
Slide 24 text
Part 1 - Visualizing Summarized Data
1. Choosing the Right Chart based on Data Type
2. Introduction to Basic Chart Features
3. Introduction to Scatter with Aggregate
4. Introduction to Map
24
Slide 25
Slide 25 text
Different types of charts and Summary Statistics are shown
for each column, depending on its data type.
25
Slide 26
Slide 26 text
A. Numeric - Numerical
B. Character - Categorical
C. Date / POSIXct - Date/Time
D. Logical - Logical
E. Factor - Ordinal
Data Type
26
Slide 27
Slide 27 text
Data Type: A. Numeric - Numerical
27
30
There are more columns that need to be converted to Numeric data type.
Slide 31
Slide 31 text
You can change the data type for multiple columns at once.
Select multiple columns by holding the Command key (Mac) or
the Control key (Windows) while clicking.
Slide 32
Slide 32 text
Select ‘Change Data Type’ and ‘Convert to Numeric’ from
the column header menu.
Slide 33
Slide 33 text
The selected columns are listed and ‘Convert Data Type’ is
selected as the Calculation Type in the dialog.
Simply click the ‘Run’ button.
Slide 34
Slide 34 text
The selected columns are now shown as Numeric data type.
Slide 35
Slide 35 text
Note that these operations are recorded on the right-hand side as Data Wrangling
Steps. We’ll go into more detail about data wrangling in Part 5.
Slide 36
Slide 36 text
36
Let’s create the first chart to answer the following questions.
• What is the price range from the lowest to the highest?
• What are the common prices for the majority of the listings?
• How much is the most expensive listing?
Slide 37
Slide 37 text
37
Click the ‘Create Chart’ button on the ‘price’ column to create a chart.
Slide 38
Slide 38 text
38
A histogram chart is created automatically.
Slide 39
Slide 39 text
39
Histogram
A histogram divides the numerical values into a set of ranges (e.g. from 10 to 100)
and shows the number of rows in each range as the height of a bar.
[Figure: histogram of price. X-axis: price, in ranges 0 - 2,000, 2,001 - 4,000, 4,001 - 6,000, 6,001 - 8,000, 8,001 - 10,000. Y-axis: Number of Rows.]
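The histogram idea above can be sketched in a few lines of Python. This is only an illustration of the binning logic, not how Exploratory computes it: the function name `histogram_counts`, the bin width, and the sample prices are all made up for this example, and each bin here covers "lower edge up to, but not including, the next edge".

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Returns a dict that maps the lower edge of each bin to its row count.
    Each bin covers lower <= value < lower + bin_width.
    """
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Hypothetical prices for illustration, not taken from the Airbnb data.
prices = [45, 80, 120, 150, 199, 250, 990, 2100, 9500]
bins = histogram_counts(prices, bin_width=2000)

for lower, rows in bins.items():
    print(f"{lower} to {lower + 2000}: {rows} rows")
```

With a wide bin, almost every row lands in the first bar, which is the same effect we see on the price chart.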
Slide 43
Slide 43 text
43
Most of the data (50,321 rows) falls within the 0 to 1,000 range.
Slide 44
Slide 44 text
44
There are a few listings with extremely high prices.
These extreme values are called ‘Outliers’. We’ll
go into more detail about outliers in
‘Part 3 - Visualizing Distribution & Correlation’.
Slide 45
Slide 45 text
45
Let’s remove the outliers by unchecking ‘Include Outliers’.
Slide 46
Slide 46 text
46
It looks like most of the prices are actually less than $300.
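The slides do not say which rule the ‘Include Outliers’ option uses. As a rough sketch, one common convention is to drop values that are more than 1.5 times the interquartile range (IQR) away from the quartiles. The function name `drop_outliers`, the 1.5 factor, and the sample prices are assumptions for this example.

```python
from statistics import quantiles


def drop_outliers(values, k=1.5):
    """Keep only the values inside [Q1 - k*IQR, Q3 + k*IQR].

    This is the common IQR rule, used here as an assumption; the tool's
    own outlier rule is not stated in the slides.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if lower <= value <= upper]


# Hypothetical prices with one extreme value.
prices = [50, 60, 70, 80, 90, 100, 110, 120, 5000]
kept = drop_outliers(prices)
print(kept)
```

Once the extreme value is gone, the remaining prices spread across the chart instead of piling into a single bar.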
Slide 47
Slide 47 text
47
We can change the ‘Number of Bars’ to find the data distribution patterns at
a more granular level.
Slide 48
Slide 48 text
48
Sometimes, Histogram doesn’t work well…
Slide 49
Slide 49 text
49
Is this column Numerical or Categorical?
Slide 50
Slide 50 text
50
It’s up to you!
Whatever works better for you.
Slide 51
Slide 51 text
51
Let’s create another chart to answer the following questions.
• What is the range of the ‘accommodate’ values?
• What are the most common ‘accommodate’ values?
Slide 52
Slide 52 text
52
Assign the ‘accommodate’ column to X-Axis to see how many listings we
have for each value.
Slide 53
Slide 53 text
53
We can bring back the outliers to see all the accommodate types.
Now we can see the highest number is 25.
Slide 54
Slide 54 text
54
The accommodate column is integer data with only 25 unique values,
so it can be treated as a categorical column!
Slide 55
Slide 55 text
55
A Bar chart works better to visualize the data distribution for Categorical
columns.
Slide 56
Slide 56 text
56
The Bar chart automatically categorizes the numerical values when you assign a
numerical column to the X-Axis.
Slide 57
Slide 57 text
57
This time, we have only a handful of values for the accommodate column, so we
want to see all the numbers as they are. We can select ‘As Number’ to show all the
numerical values on the X-Axis.
Slide 58
Slide 58 text
58
We can see how many listings there are for each of the accommodate values.
Most of the listings accommodate fewer than 4 people, and 2 people is the most common type.
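Treating a numeric column as categorical simply means counting the rows for each distinct value instead of binning them. A minimal sketch, with made-up `accommodates` values rather than the real data:

```python
from collections import Counter

# Hypothetical values of the accommodate column, one per listing.
accommodates = [2, 2, 1, 4, 2, 3, 2, 6, 4, 1]

# One bar per distinct value ("As Number"), with the row count as its height.
counts = Counter(accommodates)
for value in sorted(counts):
    print(f"{value} people: {counts[value]} rows")

# The tallest bar is the most common value.
most_common_value, rows = counts.most_common(1)[0]
print(f"Most common: {most_common_value} people ({rows} rows)")
```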
Slide 59
Slide 59 text
59
And sometimes, Bar Chart doesn’t work well…
Slide 60
Slide 60 text
60
Let’s assign the ‘price’ column to X-Axis.
Slide 61
Slide 61 text
61
When you assign a numerical column with a lot of unique values, the bar chart creates a bar for each of
the unique values. This makes it harder to recognize any patterns in the data.
Slide 62
Slide 62 text
62
We can zoom in by dragging the mouse pointer.
Slide 63
Slide 63 text
63
We can see that each unique price point has its own bar, but this gives us
too much noise instead of a visual pattern we can easily recognize.
Slide 64
Slide 64 text
64
With a Histogram, we can better see the overall pattern of how the data is distributed.
Slide 65
Slide 65 text
Choosing a Right Chart based on Data Type: B. Character - Categorical
65
Slide 66
Slide 66 text
Categorical
California
Texas
New York
Florida
Oregon
No continuous relationship
No ordinal relationship is necessary
Slide 67
Slide 67 text
67
Let’s create another chart to answer the following questions.
• What are the neighborhoods with the most listings?
• How many listings do they have, and how many more compared to
other neighborhoods?
Slide 68
Slide 68 text
68
Click the ‘Create Chart’ button on the ‘neighborhood’ column.
Slide 69
Slide 69 text
69
We can see the number of listings by neighborhood, but there are too many neighborhoods to read.
Slide 70
Slide 70 text
70
You can zoom into a particular section by dragging the mouse pointer.
Slide 71
Slide 71 text
71
Bedford-Stuyvesant and Williamsburg have the most listings on Airbnb.
Slide 72
Slide 72 text
Choosing a Right Chart based on Data Type: C. Date / POSIXct - Date/Time
72
Slide 73
Slide 73 text
73
There is a ‘Date’ data type column, ‘host_since’.
Slide 74
Slide 74 text
74
Let’s create another chart to answer the following questions.
• Have more listings been added recently?
• Has the number been increasing or decreasing?
• Do more properties get listed on Airbnb at a particular time of year?
Slide 75
Slide 75 text
75
Click on the chart button on the host_since column.
Slide 76
Slide 76 text
76
We can see how many listings were added to Airbnb each year.
Slide 77
Slide 77 text
77
When the column is a Date (or POSIXct) column, you can control how the data is
aggregated, and the chart respects the date order and intervals.
Date order / interval are respected.
Control the data aggregation level.
Slide 78
Slide 78 text
78
Select ‘Month’ under the Round menu. This will aggregate the data by Year/Month.
Slide 79
Slide 79 text
79
Select ‘Month’ under the Extract menu. This will aggregate the data by Month,
ignoring Year.
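The difference between ‘Round’ and ‘Extract’ can be sketched like this. The `host_since` dates below are made up, and the variable names are only for illustration:

```python
from collections import Counter
from datetime import date

# Hypothetical host_since dates.
host_since = [
    date(2014, 6, 3),
    date(2014, 6, 20),
    date(2015, 6, 1),
    date(2015, 12, 24),
]

# "Round" to Month: keep the year, so each Year/Month is its own group.
by_year_month = Counter(d.replace(day=1) for d in host_since)

# "Extract" Month: drop the year, so June 2014 and June 2015 share a group.
by_month = Counter(d.month for d in host_since)

print(by_year_month)
print(by_month)
```

Rounding keeps the time line intact, so it suits trend questions. Extracting folds all years together, so it suits seasonality questions.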
Slide 80
Slide 80 text
80
It looks like more properties get listed in summer (May, June, July).
Slide 81
Slide 81 text
81
By the way…
Slide 82
Slide 82 text
Position Length Angle Slope Size
Shape Volume Color Intensity Color Hue
Visual Cue
82
Slide 83
Slide 83 text
Position Length Angle Slope Size
Shape Volume Color Intensity Color Hue
Visual Cue
83
Easier to Recognize Harder to Recognize
Slide 84
Slide 84 text
84
VS
Length Slope
Slide 85
Slide 85 text
85
Length
VS
Slope
Slide 86
Slide 86 text
Length
86
With a Bar chart, we compare the lengths of the bars. It’s easier to recognize how much longer
(higher) a given bar is than the other bars.
Slide 87
Slide 87 text
87
Slope
With a Line chart, it’s easier to understand whether the trend is going upward or
downward and how steep the slope is.
Slide 88
Slide 88 text
88
Let’s try the Line chart by changing the chart type to ‘Line’.
Slide 89
Slide 89 text
89
The number of new listings increased from 2010 to 2014, then decreased until 2018.
Slide 90
Slide 90 text
90
There is a lot more to explore in visualizing time series data!
We’ll go into more detail in the next workshop, ‘Part 2 -
Visualizing Time Series Data’!
Slide 91
Slide 91 text
Agenda
• Choosing the Right Chart based on Data Type
• Introduction to Basic Chart Features
• Introduction to Scatter with Aggregate
• Introduction to Map
91
Slide 92
Slide 92 text
A. Limit Values
B. Summarize Functions
C. Reference Line
D. Horizontal Bar Chart
E. Show Values on Plot
F. Group by Color
G. Create ‘Other’ Group
H. Edit Display Name
I. Calculate ‘% of Total’ with Window Calculation
Basic Chart Features
Slide 93
Slide 93 text
Basic Features of Chart: A. Limit Values
Slide 94
Slide 94 text
94
Create a new chart.
Slide 95
Slide 95 text
95
Select a Bar chart.
Slide 96
Slide 96 text
96
Select the ‘neighborhood’ column for X-Axis and select ‘Y1 Axis’ for Sort By to show
the number of rows (or listings) for each bar and sort them from the highest to the lowest.
Slide 97
Slide 97 text
97
There are too many neighborhoods to show. How about showing only the top 30
neighborhoods based on the number of listings?
Slide 98
Slide 98 text
98
Select ‘Limit’ from the X-Axis menu.
Slide 99
Slide 99 text
99
Select ‘Top’ for the Type, and set 30 for the Number of Results.
Slide 100
Slide 100 text
100
Now, only the top 30 neighborhoods are shown in the bar chart.
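The ‘Top N’ limit amounts to counting rows per group, sorting from highest to lowest, and keeping the first N. A small sketch with made-up rows (the helper name `top_n_groups` is hypothetical):

```python
from collections import Counter


def top_n_groups(labels, n):
    """Return the n most frequent labels with their row counts, highest first."""
    return Counter(labels).most_common(n)


# Hypothetical neighborhood column, one entry per listing.
neighborhoods = (
    ["Williamsburg"] * 4 + ["Harlem"] * 3 + ["Flushing"] * 2 + ["Astoria"]
)

top2 = top_n_groups(neighborhoods, 2)
print(top2)
```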
Slide 101
Slide 101 text
Basic Features of Chart: B. Summarize Functions
Slide 102
Slide 102 text
102
Let’s see which neighborhoods are more expensive or
cheaper by using the average rental price.
Slide 103
Slide 103 text
103
Assign the ‘price’ column to Y-Axis and select ‘Mean (Average)’ as the aggregate
function.
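Choosing ‘Mean (Average)’ as the aggregate function means that each bar shows the average of the price values in its group. A minimal sketch of that calculation, using made-up rows and a hypothetical helper `mean_by_group`:

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows):
    """Average the values for each group key in (key, value) rows."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical (neighborhood, price) rows.
rows = [
    ("Williamsburg", 150),
    ("Williamsburg", 250),
    ("Harlem", 90),
    ("Harlem", 110),
    ("Flushing", 60),
]

avg_price = mean_by_group(rows)
print(avg_price)
```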
Slide 104
Slide 104 text
Basic Features of Chart: C. Reference Line
Slide 105
Slide 105 text
105
Which neighborhoods are more expensive (or cheaper) compared
to the overall average?
Slide 106
Slide 106 text
106
We can draw a reference line as a visual aid.
Slide 107
Slide 107 text
107
Select ‘Reference Line’ from the Y-Axis menu.
Slide 108
Slide 108 text
108
Select ‘Mean (Average)’ for the Reference Line Type.
Slide 109
Slide 109 text
109
Select ‘Light Red’ for the Color.
Slide 110
Slide 110 text
110
All of the top 30 neighborhoods have higher prices than the overall average.
Wait, what?
Slide 111
Slide 111 text
111
We have limited the neighborhoods to the top 30. Let’s show all the neighborhoods.
Slide 112
Slide 112 text
112
Remove the Limit setting to show all the neighborhoods and see the
neighborhoods that are cheaper than the overall average.
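Why did every bar sit above the line while the limit was on? A sketch of the logic, assuming the reference line is the average over all rows (the slides only call it the "overall average") and using made-up rows:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, price) rows.
rows = [
    ("Williamsburg", 150),
    ("Williamsburg", 250),
    ("Harlem", 90),
    ("Harlem", 110),
    ("Flushing", 60),
]

# Reference line: the average over every row (an assumption, see above).
overall_mean = mean(price for _, price in rows)

# Bars: the average price per neighborhood.
groups = defaultdict(list)
for name, price in rows:
    groups[name].append(price)
group_means = {name: mean(prices) for name, prices in groups.items()}

above = sorted(name for name, m in group_means.items() if m > overall_mean)
below = sorted(name for name, m in group_means.items() if m < overall_mean)

# Keeping only the most expensive group hides everything below the line.
top1 = max(group_means, key=group_means.get)

print(overall_mean, above, below, top1)
```

The cheaper neighborhoods only show up once the limit is removed, because the limit kept just the highest bars.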
Slide 113
Slide 113 text
Basic Features of Chart: D. Horizontal Bar Chart
Slide 114
Slide 114 text
114
Click ‘Horizontal’ for the Orientation to show the bars horizontally.
This makes it easier to read the neighborhood names.
Slide 115
Slide 115 text
Basic Features of Chart: E. Show Values on Plot
Slide 116
Slide 116 text
116
Select ‘Above’ for ‘Show Value on Plot’ in the Property settings.
Slide 117
Slide 117 text
117
The numbers (number of listings) are shown inside the chart.
Slide 118
Slide 118 text
118
We can change the font size for the values on the chart.
Slide 119
Slide 119 text
Basic Features of Chart: F. Group by Color
Slide 120
Slide 120 text
120
Let’s see what types of properties are more common for each
neighborhood.
Also, let’s find out whether there are common patterns or differences among
the neighborhoods.
Slide 121
Slide 121 text
121
Let’s remove the reference line for now.
Slide 122
Slide 122 text
122
Select ‘None’ for the Reference Line Type.
Slide 123
Slide 123 text
123
Assign ‘Number of Rows’ to the Y-Axis.
Slide 124
Slide 124 text
124
Assign the ‘property_type’ column to Color By.
Slide 125
Slide 125 text
125
It looks like the Apartment type is the most common for most of the neighborhoods.
Some neighborhoods, like Flushing, seem to have a lower ratio of the Apartment type.
Slide 126
Slide 126 text
Basic Features of Chart: G. Create ‘Other’ Group
Slide 127
Slide 127 text
127
Notice that there is an ‘Others’ entry in the legend.
Slide 128
Slide 128 text
128
When you assign a column with more than 20 unique values, the chart automatically creates
an ‘Others’ group. It keeps the 20 most frequent categories and combines everything
else into the ‘Others’ group.
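The ‘Others’ grouping can be sketched as "keep the N most frequent categories and relabel the rest". The helper name `lump_others` and the sample property types are made up for this example, and how the tool breaks ties at the cutoff is not stated in the slides.

```python
from collections import Counter


def lump_others(labels, keep, other_label="Others"):
    """Keep the `keep` most frequent labels and relabel the rest."""
    top = {label for label, _ in Counter(labels).most_common(keep)}
    return [label if label in top else other_label for label in labels]


# Hypothetical property_type column, one entry per listing.
property_types = (
    ["Apartment"] * 5 + ["House"] * 3 + ["Loft"] * 2 + ["Boat", "Tent"]
)

lumped = lump_others(property_types, keep=2)
lumped_counts = dict(Counter(lumped))
print(lumped_counts)
```

The total number of rows stays the same; only the number of colors in the legend shrinks.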
Slide 129
Slide 129 text
129
What if we want to show only the top 10 property types and group
everything else into ‘Others’?
Slide 130
Slide 130 text
130
Click the green text ‘Frequency 20 (35)’ to open the ‘Other Group’ setting dialog.
Slide 131
Slide 131 text
131
Set the number to 10 to keep the top 10 categories and group everything else into ‘Others’.
Slide 132
Slide 132 text
132
Slide 133
Slide 133 text
Basic Features of Chart: H. Edit Display Name
Slide 134
Slide 134 text
134
We want to edit the names inside the legend.
Also, we want to combine some categories by editing the names.
For example:
• Combine ‘Boutique hotel’ and ‘Hotel’ into ‘Hotel’.
• Combine ‘Condominium’ and ‘Apartment’ into ‘Apartment’.
Slide 135
Slide 135 text
135
Slide 136
Slide 136 text
136
Select ‘Edit Display Name’ from the Color menu.
Slide 137
Slide 137 text
137
Enter the display names for the ones you want to change.
Boutique hotel -> Hotel
Condominium -> Apartment
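Giving two categories the same display name merges them into one group. A minimal sketch of that effect, where the helper `apply_display_names` and the sample rows are made up for illustration:

```python
from collections import Counter

# Display names from the slide: original value -> name shown in the legend.
display_names = {"Boutique hotel": "Hotel", "Condominium": "Apartment"}


def apply_display_names(labels, mapping):
    """Replace each label with its display name; unmapped labels stay as-is."""
    return [mapping.get(label, label) for label in labels]


# Hypothetical property_type values.
property_types = ["Apartment", "Condominium", "Hotel", "Boutique hotel", "House"]

renamed = apply_display_names(property_types, display_names)
renamed_counts = dict(Counter(renamed))
print(renamed_counts)
```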
Slide 138
Slide 138 text
138
Slide 139
Slide 139 text
Basic Features of Chart: I. Calculate ‘% of Total’ with Window Calculation
Slide 140
Slide 140 text
140
Are there any differences among the neighborhoods based on the property types?
We want to show the ratio of each property type for each neighborhood.
Slide 141
Slide 141 text
141
Select ‘Window Calculation’ from the Y-Axis menu.
Slide 142
Slide 142 text
142
Select ‘% of’ for the Calculation Type and keep ‘Sum (Total)’ as the Summarize
option.
Slide 143
Slide 143 text
143
Now we can see the ratio of each property type for each neighborhood.
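The ‘% of’ calculation here divides each color segment by the total of its own bar, so every neighborhood adds up to 100%. A sketch with made-up rows and a hypothetical helper `percent_of_total`:

```python
from collections import Counter, defaultdict


def percent_of_total(pairs):
    """For (group, category) rows, return each category's share of its group."""
    counts = Counter(pairs)
    totals = Counter(group for group, _ in pairs)
    shares = defaultdict(dict)
    for (group, category), rows in counts.items():
        shares[group][category] = 100 * rows / totals[group]
    return dict(shares)


# Hypothetical (neighborhood, property_type) rows.
pairs = (
    [("Flushing", "House")] * 3
    + [("Flushing", "Apartment")]
    + [("Harlem", "Apartment")] * 4
    + [("Harlem", "House")]
)

shares = percent_of_total(pairs)
print(shares)
```

Because each bar is scaled to 100%, neighborhoods with few listings can be compared with large ones on equal terms.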
Slide 144
Slide 144 text
144
Overall, the apartment type is the most frequent type across the neighborhoods.
Some neighborhoods, like East Flatbush and Flushing, have a higher ratio of the house property
type.
Slide 145
Slide 145 text
Agenda
• Choosing the Right Chart based on Data Type
• Introduction to Basic Chart Features
• Introduction to Scatter with Aggregate (Bubble chart)
• Introduction to Map
145
Slide 146
Slide 146 text
146
Some neighborhoods are more expensive than the others.
Slide 147
Slide 147 text
147
The ‘review_scores_rating’ column has an average review score for each property.
Slide 148
Slide 148 text
148
Slide 149
Slide 149 text
149
Some neighborhoods have higher review scores than the others.
Slide 150
Slide 150 text
150
We want to compare the neighborhoods by using these 2 numerical
columns and answer the following questions.
• Which neighborhoods are more expensive and with higher review
scores?
• Also, which neighborhoods are more expensive but with lower scores?
153
Assign the ‘review_scores_rating’ column to X-Axis, then select the ‘Mean (Average)’
summarize function.
Slide 154
Slide 154 text
154
Assign the ‘price’ column to Y-Axis, then select ‘Mean (Average)’ summarize function.
Slide 155
Slide 155 text
155
There is a circle that shows the overall average of the review score and the overall
average of the price.
Slide 156
Slide 156 text
156
Assign the ‘neighborhood’ column to ‘Group By’ so that we can break down the
circle into neighborhoods.
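Scatter with Aggregate first shows one circle at the overall averages, then one circle per group once ‘Group By’ is set. A sketch of both steps, with made-up listings and a hypothetical helper `mean_point`:

```python
from collections import defaultdict
from statistics import mean


def mean_point(points):
    """Average (score, price) pairs into a single (mean score, mean price)."""
    return mean(s for s, _ in points), mean(p for _, p in points)


# Hypothetical (neighborhood, review score, price) rows.
listings = [
    ("Williamsburg", 95, 150),
    ("Williamsburg", 91, 250),
    ("Harlem", 90, 100),
    ("Midtown", 88, 300),
]

# Without Group By: a single circle at the overall averages.
overall = mean_point([(score, price) for _, score, price in listings])

# With Group By neighborhood: one circle per neighborhood.
groups = defaultdict(list)
for name, score, price in listings:
    groups[name].append((score, price))
circles = {name: mean_point(points) for name, points in groups.items()}

print(overall)
print(circles)
```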
Slide 157
Slide 157 text
157
This neighborhood has the
lowest score and is cheaper.
This neighborhood is the most expensive
but the score is not that great.
Slide 158
Slide 158 text
158
How many properties are there in those neighborhoods?
Slide 159
Slide 159 text
159
Assign ‘neighborhood’ to ‘Group By’ so that we can break down the circle into
a number of groups.
Slide 160
Slide 160 text
160
Let’s create 4 regions based on the price and the review score.
Low Price / High Review Score
High Price / High Review Score
High Price / Low Review Score
Low Price / Low Review Score
Slide 161
Slide 161 text
161
Add a reference line to the X-Axis to show the overall average score.
Slide 162
Slide 162 text
162
Select ‘Mean (Average)’ for the Reference Line Type.
Slide 163
Slide 163 text
163
Add a reference line to the Y-Axis to show the overall average price.
Slide 164
Slide 164 text
164
Select ‘Mean (Average)’ for the Reference Line Type.
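The two reference lines split the chart into the 4 regions. The same split can be sketched as a small function. The averages and the neighborhoods ‘A’ to ‘D’ below are hypothetical placeholders, not values from the data.

```python
def region(score, price, mean_score, mean_price):
    """Name the region of a circle relative to the two reference lines."""
    price_label = "High Price" if price > mean_price else "Low Price"
    score_label = "High Review Score" if score > mean_score else "Low Review Score"
    return f"{price_label} / {score_label}"


# Hypothetical overall averages (the positions of the two reference lines).
MEAN_SCORE, MEAN_PRICE = 90.5, 180

# Hypothetical circles: neighborhood -> (mean review score, mean price).
circles = {"A": (95, 250), "B": (88, 300), "C": (94, 90), "D": (85, 80)}

regions = {
    name: region(score, price, MEAN_SCORE, MEAN_PRICE)
    for name, (score, price) in circles.items()
}
for name, label in regions.items():
    print(name, label)
```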
Slide 165
Slide 165 text
165
Slide 166
Slide 166 text
166
Assign the ‘neighborhood_group’ column to Color By.
Slide 167
Slide 167 text
167
It looks like the neighborhoods in Manhattan (Green) tend to be more
expensive but have lower review scores.
Slide 168
Slide 168 text
168
When we zoom in, we can see that pattern more clearly.
Slide 169
Slide 169 text
Agenda
• Choosing the Right Chart based on Data Type
• Introduction to Basic Chart Features
• Introduction to Scatter with Aggregate
• Introduction to Map
169
Slide 170
Slide 170 text
170
• Which neighborhoods have more properties?
• Which neighborhoods are more expensive?
These questions can be answered by using Bar and Bubble charts.
But can we see the geographical patterns at the same time?
Map!!
Slide 171
Slide 171 text
171
Select the ‘Map - Long/Lat’ type.
If you have columns named ‘longitude’ and ‘latitude’, they are
assigned automatically.
Slide 172
Slide 172 text
172
Each dot represents a row, which in this case is a listing on Airbnb.
Slide 173
Slide 173 text
173
In order to compare the neighborhoods, we want to aggregate the dots
by neighborhood.
Slide 174
Slide 174 text
174
Assign the ‘neighborhood’ column to the Group By.
Slide 175
Slide 175 text
175
We can see that each dot now represents a neighborhood.
Now, we want to see how many listings each neighborhood has.
Slide 176
Slide 176 text
176
Assign ‘Number of Rows’ to the Size to visualize the numbers by using the circle size.
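Grouping the map by neighborhood turns many dots into one circle per neighborhood, sized by its number of rows. The slides do not say where the tool places each grouped circle. The sketch below assumes the average latitude and longitude of the group, and uses made-up coordinates and a hypothetical helper `aggregate_for_map`.

```python
from collections import defaultdict
from statistics import mean


def aggregate_for_map(listings):
    """Collapse (name, lat, lon) rows into one circle per name.

    Position is the average latitude/longitude (an assumption) and
    "rows" is the number of listings, used for the circle size.
    """
    groups = defaultdict(list)
    for name, lat, lon in listings:
        groups[name].append((lat, lon))
    return {
        name: {
            "latitude": mean(lat for lat, _ in points),
            "longitude": mean(lon for _, lon in points),
            "rows": len(points),
        }
        for name, points in groups.items()
    }


# Hypothetical (neighborhood, latitude, longitude) rows.
listings = [
    ("Williamsburg", 40.70, -73.96),
    ("Williamsburg", 40.72, -73.94),
    ("Harlem", 40.81, -73.95),
]

map_circles = aggregate_for_map(listings)
print(map_circles)
```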
Slide 177
Slide 177 text
177
We can see that, for example, Williamsburg has a lot of listings (3,936).
Slide 178
Slide 178 text
178
How about the price? Which neighborhoods are more expensive or cheaper?
Slide 179
Slide 179 text
179
Assign the ‘price’ column to Color By.
Slide 180
Slide 180 text
180
It is harder to see the differences with the current color scheme.
Slide 181
Slide 181 text
181
Select ‘Color Setting’ from the Color By menu.
Slide 182
Slide 182 text
182
You can select a different color palette from the menu and also set a higher value for
Opacity to make the color less transparent.
Slide 183
Slide 183 text
183
It’s become easier to spot the neighborhoods with higher average rental prices.
Slide 184
Slide 184 text
184
You can also change the background to ‘Dark’ so that it’s even easier to see all the
circles and spot the ones with higher price values.
Slide 185
Slide 185 text
Q & A
Slide 186
Slide 186 text
Next Seminar
Slide 187
Slide 187 text
EXPLORATORY
Data Visualization Workshop
Part 2 - Visualizing Time Series Data
Slide 188
Slide 188 text
• Part 1 - Basics: Visualizing Summarized Data
• Part 2 - Visualizing Time Series Data
• Part 3 - Visualizing Distribution & Correlation
• Part 4 - Visualizing Uncertainty
• Part 5 - Data Wrangling for Data Visualization
Data Visualization Workshop
Slide 189
Slide 189 text
Information
Email
[email protected]
Website
https://exploratory.io
Twitter
@KanAugust
Training
https://exploratory.io/training