Slide 1

Slide 1 text

EXPLORATORY Data Visualization Workshop Part 1 - Visualizing Summarized Data

Slide 2

Slide 2 text

2 EXPLORATORY

Slide 3

Slide 3 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder of Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 4

Slide 4 text

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science. The Third Wave

Slide 5

Slide 5 text

Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis Data Science Workflow

Slide 6

Slide 6 text

Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis Exploratory: Modern & Simple UI

Slide 7

Slide 7 text

EXPLORATORY Data Visualization Workshop Part 1 - Visualizing Summarized Data

Slide 8

Slide 8 text

• Part 1 - Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Distribution & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Exploratory - Data Visualization Workshop

Slide 9

Slide 9 text

Part 1 - Visualizing Summarized Data 1. Choosing the Right Chart based on Data Type 2. Introduction to Basic Chart Features 3. Introduction to Scatter with Aggregate 4. Introduction to Map 9

Slide 10

Slide 10 text

Before we begin… 10

Slide 11

Slide 11 text

Create a Project 11

Slide 12

Slide 12 text

You want to create a project first. That's where you will import all your data. 12

Slide 13

Slide 13 text

Create a new project 13

Slide 14

Slide 14 text

Type the project name and click the ‘Create’ button! 14

Slide 15

Slide 15 text

It opens a new project. Let’s begin! 15

Slide 16

Slide 16 text

Import Data 16

Slide 17

Slide 17 text

Sample Data Airbnb New York Data 17

Slide 18

Slide 18 text

1. Open Airbnb New York Data Page 2. Download the Data 3. Import the Data Import Data

Slide 19

Slide 19 text

19 Select ‘Data Catalog’ from the Data Frame menu.

Slide 20

Slide 20 text

20 Type ‘airbnb’ to search for the ‘Airbnb Listing Data for New York City’ data.

Slide 21

Slide 21 text

21 Click the Import button to import the data.

Slide 22

Slide 22 text

22 Click the Save button to save the data.

Slide 23

Slide 23 text

Data is imported into Exploratory. 23

Slide 24

Slide 24 text

Part 1 - Visualizing Summarized Data 1. Choosing the Right Chart based on Data Type 2. Introduction to Basic Chart Features 3. Introduction to Scatter with Aggregate 4. Introduction to Map 24

Slide 25

Slide 25 text

A different type of chart and different summary statistics are shown for each column, depending on its data type. 25

Slide 26

Slide 26 text

A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time D. Logical - Logical E. Factor - Ordinal Data Type 26

Slide 27

Slide 27 text

A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time Data Type 27

Slide 28

Slide 28 text

28

Slide 29

Slide 29 text

Numerical: values sit on a number line (e.g., 11, 22, and 45 on an axis from 0 to 50). Continuous and ordinal relationship.

Slide 30

Slide 30 text

30 There are more columns that need to be converted to the Numeric data type.

Slide 31

Slide 31 text

You can change the data type for multiple columns at once. Select multiple columns by holding the Command key (Mac) or the Control key (Windows).

Slide 32

Slide 32 text

Select ‘Change Data Type’ and ‘Convert to Numeric’ from the column header menu.

Slide 33

Slide 33 text

The selected columns are listed, and ‘Convert Data Type’ is selected as the Calculation Type in the dialog. Simply click the ‘Run’ button.

Slide 34

Slide 34 text

The selected columns are now shown as Numeric data type.
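
For readers who want to see what such a conversion does to the values, here is a minimal sketch in plain Python. It is only an illustration, not how Exploratory implements the step; the sample rows and the ‘bedrooms’ column are hypothetical, and turning non-numeric text into a missing value (None) is an assumption.

```python
# Hypothetical rows as they might look right after import: numbers stored as text.
rows = [
    {"price": "150", "bedrooms": "1", "name": "Cozy loft"},
    {"price": "89.5", "bedrooms": "2", "name": "Sunny room"},
    {"price": "N/A", "bedrooms": "", "name": "Studio"},
]


def to_numeric(value):
    """Return the value as a float, or None when the text is not a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Convert several columns at once, like selecting multiple columns in the UI.
numeric_columns = ["price", "bedrooms"]
converted = [
    {key: to_numeric(val) if key in numeric_columns else val
     for key, val in row.items()}
    for row in rows
]

print(converted[0])
```

Columns that are not in `numeric_columns` (here, `name`) are left untouched.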

Slide 35

Slide 35 text

Note that these operations are recorded on the right-hand side as the Data Wrangling Steps. We’ll go into more detail about data wrangling in Part 5.

Slide 36

Slide 36 text

36 Let’s create the first chart to answer the following questions. • What is the price range, from the lowest to the highest? • What are the common prices for the majority of the listings? • How much is the most expensive listing?

Slide 37

Slide 37 text

37 Click the ‘Create Chart’ button on the ‘price’ column to create a chart.

Slide 38

Slide 38 text

38 A histogram chart is created automatically.

Slide 39

Slide 39 text

39 Histogram: Divides the numerical values into a set of ranges (e.g., from 10 to 100) and shows the number of rows in each range as the height of a bar.

Slide 40

Slide 40 text

40 1,000 1,500 2,200 2,500 3,000 6,500 7,100 2,200 3,800 4,500 2,200 5,300 3,400 4,200 5,200 5,800 8,100 9,000 7,800

Slide 41

Slide 41 text

41 1,000 1,500 3,000 6,500 7,100 2,200 3,800 4,500 5,300 3,400 4,200 5,200 5,800 8,100 7,800

Slide 42

Slide 42 text

42 Histogram of price. X-axis: price ranges (0 - 2,000; 2,001 - 4,000; 4,001 - 6,000; 6,001 - 8,000; 8,001 - 10,000). Y-axis: Number of Rows.
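
The binning idea can be sketched in a few lines of Python, using the 15 sample values and the five price ranges shown on the slides above. This is an illustration of the concept only; the helper name `bin_index` is made up for this sketch.

```python
# The 15 sample values from the slide above.
values = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500,
          5300, 3400, 4200, 5200, 5800, 8100, 7800]

BIN_WIDTH = 2000  # each range spans 2,000, e.g. 2,001 - 4,000
NUM_BINS = 5      # 0 - 2,000 up to 8,001 - 10,000


def bin_index(value):
    """Map a value to its range: 0 for 0 - 2,000, 1 for 2,001 - 4,000, and so on."""
    return max(0, (value - 1) // BIN_WIDTH)


# Count the rows that fall into each range.
counts = [0] * NUM_BINS
for v in values:
    counts[bin_index(v)] += 1

# Draw one text bar per range; the bar length is the number of rows.
for i, n in enumerate(counts):
    low = 0 if i == 0 else i * BIN_WIDTH + 1
    high = (i + 1) * BIN_WIDTH
    print(f"{low:>6,} - {high:>6,}: {'#' * n} ({n})")
```

The tallest bar is the 4,001 - 6,000 range, which holds 5 of the 15 values.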

Slide 43

Slide 43 text

43 Most of the data (50,321 rows) falls within the 0 to 1,000 range.

Slide 44

Slide 44 text

44 A few listings have extremely high prices. These extreme values are called ‘Outliers’. We’ll go into more detail about outliers in ‘Part 3 - Visualizing Distribution & Correlation’.

Slide 45

Slide 45 text

45 Let’s remove the outliers by unchecking ‘Include Outliers’.
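
The slides do not say which rule is used to detect outliers. As a rough sketch, one common convention is the 1.5 x IQR rule, shown here in Python. The rule, the `remove_outliers` helper, and the sample prices are all assumptions made for illustration.

```python
import statistics

# Hypothetical prices with one extreme value.
prices = [50, 60, 75, 80, 90, 100, 120, 150, 200, 5000]


def remove_outliers(data, k=1.5):
    """Keep only the values within [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if low <= x <= high]


filtered = remove_outliers(prices)
print(max(prices), max(filtered))
```

With the extreme value gone, a histogram of `filtered` spends its bars on the range where most of the data actually sits.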

Slide 46

Slide 46 text

46 It looks like most of the prices are actually less than $300.

Slide 47

Slide 47 text

47 We can change the ‘Number of Bars’ to find the data distribution patterns at a more granular level.

Slide 48

Slide 48 text

48 Sometimes, Histogram doesn’t work well…

Slide 49

Slide 49 text

49 Is this column Numerical or Categorical?

Slide 50

Slide 50 text

50 It’s up to you! Whatever works better for you.

Slide 51

Slide 51 text

51 Let’s create another chart to answer the following questions. • What is the range of the accommodate type? • What are the most common accommodate types?

Slide 52

Slide 52 text

52 Assign the ‘accommodate’ column to X-Axis to see how many listings we have for each type.

Slide 53

Slide 53 text

53 We can bring back the outliers to see all the accommodate types. Now we can see the highest number is 25.

Slide 54

Slide 54 text

54 The accommodate column holds integer data, so it has at most 25 unique values. It can be treated as a categorical column!

Slide 55

Slide 55 text

55 A Bar chart works better for visualizing the data distribution of Categorical columns.

Slide 56

Slide 56 text

56 The Bar chart automatically categorizes the numerical values when you assign a numerical column to X Axis.

Slide 57

Slide 57 text

57 This time, we have only a handful of values for the accommodate column, so we want to see all the numbers as they are. We can select ‘As Number’ to show all the numerical values on the X-Axis.

Slide 58

Slide 58 text

58 We can see how many listings there are for each of the accommodate values. Most of the listings accommodate fewer than 4 people, and 2 people is the most common type.
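
Treating a small set of integers as categories simply means counting the rows for each distinct value. A minimal Python sketch, with hypothetical values for the accommodate column:

```python
from collections import Counter

# Hypothetical values of the 'accommodate' column, one per listing.
accommodate = [2, 2, 1, 4, 2, 3, 2, 6, 1, 4, 2, 16]

# One bar per distinct value; the bar height is the number of rows.
counts = Counter(accommodate)

# Show the bars in numeric order, similar in spirit to 'As Number'.
for value in sorted(counts):
    print(f"{value:>2}: {'#' * counts[value]}")

# The most common accommodate value and its number of rows.
most_common_value, rows = counts.most_common(1)[0]
print(most_common_value, rows)
```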

Slide 59

Slide 59 text

59 And sometimes, Bar Chart doesn’t work well…

Slide 60

Slide 60 text

60 Let’s assign the ‘price’ column to X-Axis.

Slide 61

Slide 61 text

61 When you assign a numerical column with a lot of unique values, the bar chart creates a bar for each of the unique values. This makes it harder to recognize any patterns in the data.

Slide 62

Slide 62 text

62 We can zoom in by dragging the mouse pointer.

Slide 63

Slide 63 text

63 We can see that each unique price point has its own bar, but this produces too much noise rather than visual patterns you can easily recognize.

Slide 64

Slide 64 text

64 With a Histogram, we can better see the overall pattern of how the data is distributed.

Slide 65

Slide 65 text

A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time Choosing the Right Chart based on Data Type 65

Slide 66

Slide 66 text

Categorical (e.g., California, Texas, New York, Florida, Oregon): No continuous relationship. No ordinal relationship is necessary.

Slide 67

Slide 67 text

67 Let’s create another chart to answer the following questions. • Which neighborhoods have the most listings? • How many listings do they have, and how many more compared to other neighborhoods?

Slide 68

Slide 68 text

68 Click the ‘Create Chart’ button on the neighborhood column.

Slide 69

Slide 69 text

69 We can see the number of listings for each neighborhood, but there are too many neighborhoods.

Slide 70

Slide 70 text

70 You can zoom into a particular section by dragging the mouse pointer.

Slide 71

Slide 71 text

71 Bedford-Stuyvesant and Williamsburg have the most listings on Airbnb.

Slide 72

Slide 72 text

A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time Choosing the Right Chart based on Data Type 72

Slide 73

Slide 73 text

73 There is a ‘Date’ data type column, ‘host_since’.

Slide 74

Slide 74 text

74 Let’s create another chart to answer the following questions. • Have more listings been added recently? • Has the number been increasing or decreasing? • Do more properties get listed on Airbnb at a particular time of year?

Slide 75

Slide 75 text

75 Click on the chart button on the host_since column.

Slide 76

Slide 76 text

76 We can see how many listings were added to Airbnb each year.

Slide 77

Slide 77 text

77 When the column is a Date (or POSIXct) column, you can control how the data is aggregated, and the chart respects the date order and intervals.

Slide 78

Slide 78 text

78 Select ‘Month’ under the Round menu. This will aggregate the data by Year/Month.

Slide 79

Slide 79 text

79 Select ‘Month’ under the Extract menu. This will aggregate the data by Month, ignoring Year.
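
The difference between the two menus can be sketched in Python. Rounding to the month keeps the year, while extracting the month drops it. The dates are hypothetical, and representing a rounded month as the first day of that month is an assumption of this sketch.

```python
from collections import Counter
from datetime import date

# Hypothetical 'host_since' dates.
host_since = [date(2014, 6, 3), date(2014, 6, 21), date(2015, 6, 9),
              date(2015, 7, 1), date(2016, 1, 15)]

# Round to month: keep the year, so June 2014 and June 2015 stay separate.
by_year_month = Counter(d.replace(day=1) for d in host_since)

# Extract month: drop the year, so every June is counted together.
by_month = Counter(d.month for d in host_since)

print(by_year_month[date(2014, 6, 1)])  # rows in June 2014 only
print(by_month[6])                      # rows in any June
```

The rounded version is what you want for a trend over time; the extracted version is what you want for seasonality.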

Slide 80

Slide 80 text

80 It looks like more properties get listed in summer (May, June, July).

Slide 81

Slide 81 text

81 By the way…

Slide 82

Slide 82 text

Position Length Angle Slope Size Shape Volume Color Intensity Color Hue Visual Cue 82

Slide 83

Slide 83 text

Position Length Angle Slope Size Shape Volume Color Intensity Color Hue Visual Cue 83 Easier to Recognize Harder to Recognize

Slide 84

Slide 84 text

84 Length VS Slope

Slide 85

Slide 85 text

85 Length VS Slope

Slide 86

Slide 86 text

86 Length: We are comparing the lengths of the bars. It is easier to recognize how much longer (higher) a given bar is than the other bars.

Slide 87

Slide 87 text

87 Slope: With a Line chart, it’s easier to understand whether the trend is going upward or downward and how steep the slope is.

Slide 88

Slide 88 text

88 Let’s try the Line chart by changing the chart type to ‘Line’.

Slide 89

Slide 89 text

89 The number of new listings increased from 2010 to 2014, then decreased until 2018.

Slide 90

Slide 90 text

90 There is a lot more to explore in visualizing time series data! We’ll go into more detail in the next workshop, ‘Part 2 - Visualizing Time Series Data’!

Slide 91

Slide 91 text

Agenda • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate • Introduction to Map 91

Slide 92

Slide 92 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Chart Features

Slide 93

Slide 93 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 94

Slide 94 text

94 Create a new chart.

Slide 95

Slide 95 text

95 Select a Bar chart.

Slide 96

Slide 96 text

96 Select the ‘neighborhood’ column for X-Axis and select ‘Y1 Axis’ for Sort By. Each bar shows the number of rows (or listings), and the bars are sorted from the highest to the lowest.

Slide 97

Slide 97 text

97 There are too many neighborhoods to show. How about showing only the top 30 neighborhoods based on the number of listings?

Slide 98

Slide 98 text

98 Select ‘Limit’ from the X-Axis menu.

Slide 99

Slide 99 text

99 Select ‘Top’ for the Type, and set 30 for the Number of Results.

Slide 100

Slide 100 text

100 Now, only the top 30 neighborhoods are shown in the bar chart.
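
Conceptually, a ‘Top’ limit counts the rows for each category, sorts the categories from the highest to the lowest, and keeps the first N. A minimal Python sketch with hypothetical rows and a made-up helper name `top_n`:

```python
from collections import Counter

# Hypothetical 'neighborhood' values, one per listing.
neighborhoods = ["Williamsburg", "Harlem", "Williamsburg", "Bushwick",
                 "Harlem", "Williamsburg", "Chelsea"]


def top_n(values, n):
    """Return the n most frequent values with their row counts, highest first."""
    return Counter(values).most_common(n)


# Keep only the top 2 here; the workshop keeps the top 30.
top = top_n(neighborhoods, 2)
print(top)
```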

Slide 101

Slide 101 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 102

Slide 102 text

102 Let’s see which neighborhoods are more expensive or cheaper by using the average rental price.

Slide 103

Slide 103 text

103 Assign the ‘price’ column to Y-Axis and select ‘Mean (Average)’ as the aggregate function.
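
Behind the chart, this is a group-and-summarize step: the rows are grouped by neighborhood and the mean price is computed for each group. A minimal Python sketch with hypothetical rows and a made-up helper name `mean_by_group`:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, price) rows.
listings = [("Williamsburg", 150), ("Williamsburg", 250),
            ("Harlem", 100), ("Harlem", 80), ("Chelsea", 300)]


def mean_by_group(rows):
    """Group the rows by their first field and average the second field."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: mean(vals) for key, vals in groups.items()}


avg_price = mean_by_group(listings)
print(avg_price)
```

Swapping `mean` for another function such as `max` or `statistics.median` corresponds to choosing a different summarize function.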

Slide 104

Slide 104 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 105

Slide 105 text

105 Which neighborhoods are more expensive (or cheaper) compared to the overall average?

Slide 106

Slide 106 text

106 We can draw a reference line as a visual aid.

Slide 107

Slide 107 text

107 Select ‘Reference Line’ from the Y-Axis menu.

Slide 108

Slide 108 text

108 Select ‘Mean (Average)’ for the Reference Line Type.

Slide 109

Slide 109 text

109 Select ‘Light Red’ for the Color.

Slide 110

Slide 110 text

110 All of the top 30 neighborhoods have higher prices than the overall average. Wait, what?

Slide 111

Slide 111 text

111 We have limited the neighborhoods to the top 30. Let’s show all the neighborhoods.

Slide 112

Slide 112 text

112 Remove the Limit setting to show all the neighborhoods and see the neighborhoods that are cheaper than the overall average.
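
The comparison the reference line supports can be sketched in Python. This sketch assumes the reference line is the mean over all rows, which the slides do not state explicitly; the rows are hypothetical.

```python
from statistics import mean

# Hypothetical (neighborhood, price) rows.
listings = [("Williamsburg", 150), ("Williamsburg", 250),
            ("Harlem", 100), ("Harlem", 80), ("Chelsea", 300)]

# Assumed reference line: the mean price over all rows.
overall = mean(price for _, price in listings)

# Collect the prices for each neighborhood.
group_prices = {}
for name, price in listings:
    group_prices.setdefault(name, []).append(price)

# Label each neighborhood by where its mean price sits relative to the line
# ("below" also covers a mean exactly equal to the overall mean).
position = {name: ("above" if mean(p) > overall else "below")
            for name, p in group_prices.items()}

print(overall, position)
```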

Slide 113

Slide 113 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 114

Slide 114 text

114 Click ‘Horizontal’ for the Orientation to show the bar chart horizontally. This makes it easier to read the neighborhood names.

Slide 115

Slide 115 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 116

Slide 116 text

116 Select ‘Above’ for the Show Value on Plot inside the Property.

Slide 117

Slide 117 text

117 The numbers (number of listings) are shown inside the chart.

Slide 118

Slide 118 text

118 We can change the font size for the values on the chart.

Slide 119

Slide 119 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 120

Slide 120 text

120 Let’s see what types of properties are more common in each neighborhood. Also, let’s find out whether there are common patterns or differences among the neighborhoods.

Slide 121

Slide 121 text

121 Let’s remove the reference line for now.

Slide 122

Slide 122 text

122 Select ‘None’ for the Reference Line Type.

Slide 123

Slide 123 text

123 Set ‘Number of Rows’ to Y-Axis.

Slide 124

Slide 124 text

124 Assign the ‘property_type’ column to Color By.

Slide 125

Slide 125 text

125 It looks like the Apartment type is the most common in most of the neighborhoods. Some neighborhoods, like Flushing, seem to have a lower ratio of the Apartment type.

Slide 126

Slide 126 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 127

Slide 127 text

127 Notice that there is an ‘Others’ entry in the legend.

Slide 128

Slide 128 text

128 When you assign a column with more than 20 unique values, an ‘Others’ group is created automatically. It keeps the 20 most frequent categories and combines everything else into the ‘Others’ group.

Slide 129

Slide 129 text

129 What if we want to show only the top 10 property types and group everything else as ‘Others’?

Slide 130

Slide 130 text

130 Click the green text ‘Frequency 20 (35)’ to open the ‘Other Group’ setting dialog.

Slide 131

Slide 131 text

131 Set 10 to keep the top 10 categories and combine everything else into the ‘Others’ group.
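
The ‘Others’ grouping can be sketched in Python: keep the most frequent categories and relabel the rest. The helper name `lump_others` and the sample property types are hypothetical, and how ties at the cutoff are broken is an assumption of this sketch.

```python
from collections import Counter


def lump_others(values, keep, other_label="Others"):
    """Keep the `keep` most frequent categories and relabel the rest."""
    top = {value for value, _ in Counter(values).most_common(keep)}
    return [v if v in top else other_label for v in values]


# Hypothetical 'property_type' values, one per listing.
property_types = (["Apartment"] * 5 + ["House"] * 3 + ["Loft"] * 2
                  + ["Boat", "Tent"])

# Keep the top 3 here; the workshop keeps the top 10.
lumped = lump_others(property_types, keep=3)
print(Counter(lumped))
```

The total number of rows does not change; only the labels of the rare categories do.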

Slide 132

Slide 132 text

132

Slide 133

Slide 133 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 134

Slide 134 text

134 We want to edit the names in the legend. Also, we want to combine some categories by editing the names. For example: • Combine ‘Boutique hotel’ and ‘Hotel’ into ‘Hotel’. • Combine ‘Condominium’ and ‘Apartment’ into ‘Apartment’.

Slide 135

Slide 135 text

135

Slide 136

Slide 136 text

136 Select ‘Edit Display Name’ from the Color menu.

Slide 137

Slide 137 text

137 Enter the display names for the ones you want to change. Boutique hotel -> Hotel Condominium -> Apartment
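
Giving two categories the same display name effectively merges them. A minimal Python sketch of that idea, using the two renames from this slide and hypothetical rows:

```python
from collections import Counter

# The display names from the slide: original name -> display name.
display_names = {"Boutique hotel": "Hotel", "Condominium": "Apartment"}

# Hypothetical 'property_type' values, one per listing.
property_types = ["Apartment", "Condominium", "Hotel",
                  "Boutique hotel", "House", "Apartment"]

# Names without an entry in the mapping are kept as they are.
renamed = [display_names.get(p, p) for p in property_types]

counts = Counter(renamed)
print(counts)
```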

Slide 138

Slide 138 text

138

Slide 139

Slide 139 text

A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation Basic Features of Chart

Slide 140

Slide 140 text

140 Are there any differences among the neighborhoods based on the property types? We want to show the ratio of each property type for each neighborhood.

Slide 141

Slide 141 text

141 Select ‘Window Calculation’ from the Y Axis menu.

Slide 142

Slide 142 text

142 Select ‘% of’ for the Calculation Type and keep ‘Sum (Total)’ as the Summarize option.

Slide 143

Slide 143 text

143 Now we can see the ratio of each property type for each neighborhood.
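
The ‘% of’ calculation can be sketched in Python: count the rows for each neighborhood and property type pair, then divide by the neighborhood total. The rows are hypothetical, and computing the total within each neighborhood (each bar) is an assumption based on the description above.

```python
from collections import Counter

# Hypothetical (neighborhood, property_type) rows.
rows = [("Flushing", "House"), ("Flushing", "House"), ("Flushing", "Apartment"),
        ("Harlem", "Apartment"), ("Harlem", "Apartment"),
        ("Harlem", "Apartment"), ("Harlem", "House")]

pair_counts = Counter(rows)                      # rows per (neighborhood, type)
totals = Counter(name for name, _ in rows)       # rows per neighborhood

# Share of each property type within its neighborhood, in percent.
percent = {(name, ptype): 100 * count / totals[name]
           for (name, ptype), count in pair_counts.items()}

for key, value in sorted(percent.items()):
    print(key, round(value, 1))
```

Because every bar now adds up to 100%, neighborhoods with very different numbers of listings can be compared directly.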

Slide 144

Slide 144 text

144 Overall, the Apartment type is the most frequent type in all the neighborhoods. Some neighborhoods, like East Flatbush and Flushing, have a higher ratio of the House property type.

Slide 145

Slide 145 text

Agenda • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate (Bubble chart) • Introduction to Map 145

Slide 146

Slide 146 text

146 Some neighborhoods are more expensive than the others.

Slide 147

Slide 147 text

147 The ‘review_scores_rating’ column has an average review score for each property.

Slide 148

Slide 148 text

148

Slide 149

Slide 149 text

149 Some neighborhoods have higher review scores than the others.

Slide 150

Slide 150 text

150 We want to compare the neighborhoods by using these 2 numerical columns and answer the following questions. • Which neighborhoods are more expensive and with higher review scores? • Also, which neighborhoods are more expensive but with lower scores?

Slide 151

Slide 151 text

151 Create a new chart.

Slide 152

Slide 152 text

152 Select ‘Scatter (With Aggregation)’ chart type.

Slide 153

Slide 153 text

153 Assign the ‘review_scores_rating’ column to X-Axis, then select the ‘Mean (Average)’ summarize function.

Slide 154

Slide 154 text

154 Assign the ‘price’ column to Y-Axis, then select ‘Mean (Average)’ summarize function.

Slide 155

Slide 155 text

155 There is a circle that shows the overall average of the review score and the overall average of the price.

Slide 156

Slide 156 text

156 Assign the ‘neighborhood’ column to ‘Group By’ so that we can break down the circle into neighborhoods.
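
What the chart computes at this point can be sketched in Python: one point per neighborhood, positioned at the mean review score (X) and the mean price (Y). The rows and the helper name `aggregate_points` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, review_scores_rating, price) rows.
rows = [("Harlem", 90, 100), ("Harlem", 94, 80),
        ("Chelsea", 88, 300), ("Chelsea", 92, 260)]


def aggregate_points(rows):
    """Return one (mean score, mean price) point per neighborhood."""
    groups = defaultdict(list)
    for name, score, price in rows:
        groups[name].append((score, price))
    return {name: (mean(s for s, _ in pts), mean(p for _, p in pts))
            for name, pts in groups.items()}


points = aggregate_points(rows)
print(points)
```

Without a Group By, all rows fall into a single group, which is why the chart first showed only one circle.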

Slide 157

Slide 157 text

157 This neighborhood has the lowest score and is cheaper. This neighborhood is the most expensive but the score is not that great.

Slide 158

Slide 158 text

158 How many properties are there in those neighborhoods?

Slide 159

Slide 159 text

159 Assign ‘neighborhood’ to ‘Group By’ so that we can break down the circle into a number of groups.

Slide 160

Slide 160 text

160 Let’s create 4 regions based on the price and the review score. Low Price / High Review Score High Price / High Review Score High Price / Low Review Score Low Price / Low Review Score

Slide 161

Slide 161 text

161 Add a reference line for X Axis to show the overall average score.

Slide 162

Slide 162 text

162 Select ‘Mean (Average)’ for the Reference Line Type.

Slide 163

Slide 163 text

163 Add a reference line for Y Axis to show the overall average price.

Slide 164

Slide 164 text

164 Select ‘Mean (Average)’ for the Reference Line Type.
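
The two reference lines split the chart into the four regions listed earlier. A minimal Python sketch of that classification follows. The points are hypothetical, and the averages here are taken over the aggregated points for simplicity, which may differ from how the chart computes its reference lines.

```python
from statistics import mean

# Hypothetical aggregated points: neighborhood -> (mean score, mean price).
points = {"Harlem": (96, 90), "Chelsea": (90, 280), "Midtown": (95, 300)}

# Assumed reference lines: the averages of the points above.
avg_score = mean(score for score, _ in points.values())
avg_price = mean(price for _, price in points.values())


def quadrant(score, price):
    """Name the region a point falls into relative to the two reference lines."""
    price_label = "High Price" if price > avg_price else "Low Price"
    score_label = "High Review Score" if score > avg_score else "Low Review Score"
    return f"{price_label} / {score_label}"


regions = {name: quadrant(score, price)
           for name, (score, price) in points.items()}
print(regions)
```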

Slide 165

Slide 165 text

165

Slide 166

Slide 166 text

166 Assign the ‘neighborhood_group’ column to Color By.

Slide 167

Slide 167 text

167 It looks like the neighborhoods in Manhattan (Green) tend to be more expensive but have lower review scores.

Slide 168

Slide 168 text

168 When we zoom in, we can see the pattern more clearly.

Slide 169

Slide 169 text

Agenda • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate • Introduction to Map 169

Slide 170

Slide 170 text

170 • Which neighborhoods have more properties? • Which neighborhoods are more expensive? These questions can be answered by using Bar and Bubble charts. But can we see the geographical patterns at the same time? Map!!

Slide 171

Slide 171 text

171 Select the ‘Map - Long/Lat’ type. If you have columns named ‘longitude’ and ‘latitude’, they are automatically assigned accordingly.

Slide 172

Slide 172 text

172 Each dot represents a row, which in this case is a listing on Airbnb.

Slide 173

Slide 173 text

173 In order to compare the neighborhoods, we want to aggregate the dots by neighborhood.

Slide 174

Slide 174 text

174 Assign the ‘neighborhood’ column to the Group By.

Slide 175

Slide 175 text

175 Now each dot represents a neighborhood. Next, we want to see how many listings each neighborhood has.

Slide 176

Slide 176 text

176 Assign ‘Number of Rows’ to the Size to visualize the numbers by using the circle size.
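
The aggregation behind the map can be sketched in Python: one circle per neighborhood, sized by the number of rows. Placing each circle at the mean latitude and longitude of its listings is an assumption of this sketch, not something the slides state; the rows and the helper name `aggregate_for_map` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, latitude, longitude) rows.
rows = [("Williamsburg", 40.71, -73.95), ("Williamsburg", 40.72, -73.96),
        ("Harlem", 40.81, -73.94)]


def aggregate_for_map(rows):
    """Return one circle per neighborhood: a center position and a size."""
    groups = defaultdict(list)
    for name, lat, lon in rows:
        groups[name].append((lat, lon))
    return {
        name: {
            "latitude": mean(lat for lat, _ in pts),    # assumed center
            "longitude": mean(lon for _, lon in pts),   # assumed center
            "size": len(pts),                           # Number of Rows
        }
        for name, pts in groups.items()
    }


circles = aggregate_for_map(rows)
print(circles["Williamsburg"]["size"])
```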

Slide 177

Slide 177 text

177 We can see that, for example, Williamsburg has a lot of listings (3,936).

Slide 178

Slide 178 text

178 How about the price? Which neighborhoods are more expensive or cheaper?

Slide 179

Slide 179 text

179 Assign the ‘price’ column to Color By.

Slide 180

Slide 180 text

180 It is harder to see the differences with the current color scheme.

Slide 181

Slide 181 text

181 Select ‘Color Setting’ from the Color By menu.

Slide 182

Slide 182 text

182 You can select a different color palette from the menu and also set a higher value for the Opacity to make the color less transparent.

Slide 183

Slide 183 text

183 It is now easier to spot the neighborhoods with a higher average rental price.

Slide 184

Slide 184 text

184 You can also change the background to ‘Dark’ so that it’s even easier to see all the circles and spot the ones with higher price values.

Slide 185

Slide 185 text

Q & A

Slide 186

Slide 186 text

Next Seminar

Slide 187

Slide 187 text

EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series Data

Slide 188

Slide 188 text

• Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Distribution & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop

Slide 189

Slide 189 text

Information Email [email protected] Website https://exploratory.io Twitter @KanAugust Training https://exploratory.io/training

Slide 190

Slide 190 text

190 EXPLORATORY