Data Visualization Workshop Part 1 - Visualizing Summarized Data

I'm starting the Data Visualization workshop by introducing the basics of how to use the charts to visualize data effectively.

* Choosing the right charts based on the data type
* Basic Chart Features in Exploratory
* Introduction to Bubble (Scatter) Chart
* Introduction to Map

Kan Nishida

April 29, 2020

Transcript

  1. EXPLORATORY Data Visualization Workshop Part 1 - Visualizing Summarized Data

  2. EXPLORATORY
  3. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: At the beginning of 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  4. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  5. Data Science Workflow (diagram labels): Questions; Communication; Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning); Data Analysis
  6. Exploratory: Modern & Simple UI (diagram labels): Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning); Data Analysis
  7. EXPLORATORY Data Visualization Workshop Part 1 - Visualizing Summarized Data

  8. Exploratory - Data Visualization Workshop: • Part 1 - Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Distribution & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization
  9. Part 1 - Visualizing Summarized Data: 1. Choosing the Right Chart based on Data Type 2. Introduction to Basic Chart Features 3. Introduction to Scatter with Aggregate 4. Introduction to Map
  10. Before we begin…
  11. Create a Project
  12. You want to create a project first. That's where you will import all your data.
  13. Create a new project
  14. Type the project name and click the ‘Create’ button!
  15. It opens a new project. Let’s begin!

  16. Import Data
  17. Sample Data: Airbnb New York Data
  18. Import Data: 1. Open the Airbnb New York Data page 2. Download the data 3. Import the data
  19. Select ‘Data Catalog’ from the Data Frame menu.
  20. Type ‘airbnb’ to search for the ‘Airbnb Listing Data for New York City’ data.
  21. Click the Import button to import the data.
  22. Click the Save button to save the data.
  23. Data is imported into Exploratory.

  24. Part 1 - Visualizing Summarized Data: 1. Choosing the Right Chart based on Data Type 2. Introduction to Basic Chart Features 3. Introduction to Scatter with Aggregate 4. Introduction to Map
  25. A different type of chart and different Summary Statistics are shown for each column, depending on its data type.
  26. Data Type: A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time D. Logical - Logical E. Factor - Ordinal
  27. Data Type: A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time
  28. 28

  29. Numerical (number line from 0 to 50 with the values 11, 22, and 45 marked): Continuous, Ordinal Relationship
  30. There are more columns that need to be converted to the Numeric data type.
  31. You can change the data type for multiple columns at once. Select multiple columns by holding the Command key (Mac) or the Control key (Windows).
  32. Select ‘Change Data Type’ and ‘Convert to Numeric’ from the column header menu.
  33. The selected columns are listed in the dialog, and ‘Convert Data Type’ is selected as the Calculation Type. Simply click the ‘Run’ button.
  34. The selected columns are now shown as Numeric data type.
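For readers who want to see what converting several columns to numeric amounts to outside the Exploratory UI, here is a minimal Python sketch. It is not Exploratory's implementation. The rows, the `cleaning_fee` column, and the `to_numeric` helper are hypothetical, and the assumption is that the text values look like currency strings.

```python
def to_numeric(value):
    """Convert text such as '$1,200.00' to a float; return None if it cannot be parsed."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical rows; the real Airbnb data has many more columns.
rows = [
    {"price": "$150.00", "cleaning_fee": "$25.00"},
    {"price": "$1,200.00", "cleaning_fee": ""},
]

# Convert several columns in one pass, like selecting multiple columns at once.
columns = ["price", "cleaning_fee"]
converted = [{**row, **{c: to_numeric(row[c]) for c in columns}} for row in rows]

for row in converted:
    print(row)
```

Values that cannot be parsed become `None` here, which is a design choice of this sketch rather than something the slides specify.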

  35. Note that these operations are recorded on the right-hand side as Data Wrangling Steps. We’ll go into more detail about data wrangling in Part 5.
  36. Let’s create the first chart to answer the following questions. • What is the price range from the lowest to the highest? • What are the common prices for the majority of the listings? • How much is the most expensive listing?
  37. Click the ‘Create Chart’ button on the ‘price’ column to create a chart.
  38. A histogram chart is created automatically.
  39. Histogram: Divides the numerical values into a set of range groups (e.g. from 10 to 100) and shows the data size (number of rows) for each group as the height of a bar.
  40. Example values: 1,000 1,500 2,200 2,500 3,000 6,500 7,100 2,200 3,800 4,500 2,200 5,300 3,400 4,200 5,200 5,800 8,100 9,000 7,800
  41. Example values: 1,000 1,500 3,000 6,500 7,100 2,200 3,800 4,500 5,300 3,400 4,200 5,200 5,800 8,100 7,800
  42. Histogram of price (X-Axis: price ranges, Y-Axis: Number of Rows): 0 - 2,000; 2,001 - 4,000; 4,001 - 6,000; 6,001 - 8,000; 8,001 - 10,000
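The binning idea behind the histogram can be sketched in a few lines of Python. This is only an illustration, using the example values from slide 41 and the price ranges from slide 42. The `histogram_counts` helper is a made-up name, and Exploratory chooses its own ranges automatically.

```python
# Example values from slide 41 and price ranges from slide 42.
prices = [1000, 1500, 3000, 6500, 7100, 2200, 3800, 4500,
          5300, 3400, 4200, 5200, 5800, 8100, 7800]
bins = [(0, 2000), (2001, 4000), (4001, 6000), (6001, 8000), (8001, 10000)]


def histogram_counts(values, bins):
    """Count how many values fall into each inclusive (low, high) range."""
    counts = {b: 0 for b in bins}
    for v in values:
        for low, high in bins:
            if low <= v <= high:
                counts[(low, high)] += 1
                break
    return counts


counts = histogram_counts(prices, bins)
for (low, high), n in counts.items():
    # The bar height is the number of rows in the range.
    print(f"{low:>6,} - {high:>6,}: {'#' * n} ({n})")
```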
  43. Most of the data (50,321 rows) falls within the 0 to 1,000 range.
  44. A few listings have extremely high prices. These extreme values are called ‘Outliers’. We’ll go into more detail about outliers in ‘Part 3 - Visualizing Distribution & Correlation’.
  45. Let’s remove the outliers by unchecking ‘Include Outliers’.
  46. It looks like most of the prices are actually less than $300.
  47. We can change the ‘Number of Bars’ to find the data distribution patterns at a more granular level.
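The slides do not say which rule Exploratory uses when ‘Include Outliers’ is unchecked. As one common approach, the sketch below drops values outside 1.5 times the interquartile range (IQR). The rule, the `drop_iqr_outliers` helper, and the prices are all assumptions made for illustration.

```python
from statistics import quantiles


def drop_iqr_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR].

    This is an assumed rule for illustration, not Exploratory's documented one.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical nightly prices with one extreme value.
prices = [50, 60, 75, 80, 90, 100, 120, 150, 200, 10000]
kept = drop_iqr_outliers(prices)
print(kept)
```

With the extreme value gone, a histogram of the remaining prices spreads the bulk of the data across the bars instead of squeezing it into the first one.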
  48. Sometimes, a Histogram doesn’t work well…
  49. Is this column Numerical or Categorical?
  50. It’s up to you! Whatever works better for you.
  51. Let’s create another chart to answer the following questions. • What is the range of the accommodate types? • What are the most common accommodate types?
  52. Assign the ‘accommodate’ column to X-Axis to see how many listings we have for each type.
  53. We can bring back the outliers to see all the accommodate types. Now we can see the highest number is 25.
  54. The accommodate column is integer data, which means that we have only 25 unique values. It can be treated as a categorical column!
  55. A Bar chart works better for visualizing the data distribution of Categorical columns.
  56. The Bar chart automatically categorizes the numerical values when you assign a numerical column to X-Axis.
  57. This time, we have only a handful of values for the accommodate column, so we want to see all the numbers as they are. We can select ‘As Number’ to show all the numerical values on X-Axis.
  58. We can see how many listings there are for each of the accommodate values. Most of the listings accommodate fewer than 4 people, and 2 people is the most common type.
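Treating a numeric column as categorical simply means counting the rows for each unique value, which is what the bar chart shows. A minimal Python sketch follows, with made-up `accommodate` values.

```python
from collections import Counter

# Hypothetical 'accommodate' values, one per listing.
accommodate = [2, 2, 4, 1, 2, 3, 4, 2, 6, 1]

# One bar per unique value; the bar height is the number of rows.
rows_per_value = Counter(accommodate)
for value in sorted(rows_per_value):
    print(value, "#" * rows_per_value[value])

most_common_value, most_common_count = rows_per_value.most_common(1)[0]
print("most common:", most_common_value, "with", most_common_count, "rows")
```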
  59. And sometimes, a Bar chart doesn’t work well…
  60. Let’s assign the ‘price’ column to X-Axis.
  61. When you assign a numerical column with a lot of unique values, the bar chart creates a bar for each of the unique values. This makes it harder to recognize any patterns in the data.
  62. We can zoom in by dragging the mouse pointer.
  63. We can see that each unique price point has its own bar, but this gives you too much noise instead of visual patterns you can easily recognize.
  64. With a Histogram, we can better see the overall pattern of how the data is distributed.
  65. Choosing the Right Chart based on Data Type: A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time
  66. Categorical (e.g. California, Texas, New York, Florida, Oregon): No continuous relationship. No ordinal relationship is necessary.
  67. Let’s create another chart to answer the following questions. • What are the neighborhoods with the most listings? • How many listings do they have, and how many more compared to other neighborhoods?
  68. Click the ‘Create Chart’ button on the neighborhood column.
  69. We can see the number of listings by neighborhood, but there are too many neighborhoods.
  70. You can zoom into a particular section by dragging the mouse pointer.
  71. Bedford-Stuyvesant and Williamsburg have the most listings on Airbnb.
  72. Choosing the Right Chart based on Data Type: A. Numeric - Numerical B. Character - Categorical C. Date / POSIXct - Date/Time
  73. There is a ‘Date’ data type column, ‘host_since’.
  74. Let’s create another chart to answer the following questions. • Have more listings been added recently? • Has the number been increasing or decreasing? • Do more properties get listed on Airbnb at a particular time of year?
  75. Click the chart button on the host_since column.
  76. We can see how many listings were added to Airbnb each year.
  77. When the column is a Date (or POSIXct) column, you control the level at which the data is aggregated, and the chart respects the date order and intervals.
  78. Select ‘Month’ under the Round menu. This will aggregate the data by Year/Month.
  79. Select ‘Month’ under the Extract menu. This will aggregate the data by Month, ignoring Year.
  80. It looks like more properties get listed in summer (May, June, July).
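The difference between rounding a date to the month and extracting the month can be sketched in Python as follows. The `host_since` dates are made up, and this only illustrates the two aggregation levels rather than how Exploratory computes them.

```python
from collections import Counter
from datetime import date

# Hypothetical 'host_since' dates.
host_since = [
    date(2012, 5, 14), date(2012, 5, 30),
    date(2013, 5, 2), date(2013, 6, 21),
    date(2014, 1, 9),
]

# 'Round' to Month: keep the year and month, reset the day to the 1st.
by_year_month = Counter(d.replace(day=1) for d in host_since)

# 'Extract' Month: keep only the month number, ignoring the year.
by_month = Counter(d.month for d in host_since)

print(sorted(by_year_month.items()))
print(sorted(by_month.items()))
```

Rounding keeps May 2012 and May 2013 apart, so it shows the trend over time. Extracting merges them into a single ‘May’ group, which is what reveals seasonality.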
  81. By the way…
  82. Visual Cue: Position, Length, Angle, Slope, Size, Shape, Volume, Color Intensity, Color Hue
  83. Visual Cue, ranked on the slide from ‘Easier to Recognize’ to ‘Harder to Recognize’: Position, Length, Angle, Slope, Size, Shape, Volume, Color Intensity, Color Hue
  84. Length vs. Slope
  85. Length vs. Slope
  86. Length: We are comparing the lengths of the bars. It is easier to recognize how much longer (higher) a given bar is than the other bars.
  87. Slope: With a Line chart, it’s easier to understand whether the trend is going upward or downward and how steep the slope is.
  88. Let’s try the Line chart by changing the chart type to ‘Line’.
  89. The number of new listings increased from 2010 to 2014, then decreased until 2018.
  90. There is a lot more to explore in visualizing time series data! We’ll go into more detail in the next workshop, ‘Part 2 - Visualizing Time Series Data’!
  91. Agenda: • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate • Introduction to Map
  92. Basic Chart Features: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  93. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  94. Create a new chart.
  95. Select a Bar chart.
  96. Select the ‘neighborhood’ column for X-Axis and select ‘Y1 Axis’ for Sort By to show the number of rows (or listings) for each bar and sort the bars from the highest to the lowest.
  97. There are too many neighborhoods to show. How about showing only the top 30 neighborhoods based on the number of listings?
  98. Select ‘Limit’ from the X-Axis menu.
  99. Select ‘Top’ for the Type, and set 30 for the Number of Results.
  100. Now, only the top 30 neighborhoods are shown in the bar chart.
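Sorting by the number of rows and limiting to the top N, as in the steps above, can be sketched in Python like this. The neighborhood names come from the slides, but the counts are made up and N is 2 to keep the example short.

```python
from collections import Counter

# Hypothetical 'neighborhood' values, one per listing.
neighborhoods = (
    ["Williamsburg"] * 4
    + ["Bedford-Stuyvesant"] * 3
    + ["Flushing"] * 2
    + ["East Flatbush"]
)

# Count rows per neighborhood, sort from highest to lowest, keep the top N.
top_n = Counter(neighborhoods).most_common(2)
for name, n_rows in top_n:
    print(f"{name:<20} {n_rows}")
```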
  101. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation
  102. Let’s see which neighborhoods are more expensive or cheaper by using the average rental price.
  103. Assign the ‘price’ column to Y-Axis and select ‘Mean (Average)’ as the aggregate function.
  104. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation
  105. Which neighborhoods are more expensive (or cheaper) compared to the overall average?
  106. We can draw a reference line as a visual aid.
  107. Select ‘Reference Line’ from the Y-Axis menu.
  108. Select ‘Mean (Average)’ for the Reference Line Type.
  109. Select ‘Light Red’ for the Color.
  110. All of the top 30 neighborhoods have higher prices than the overall average. Wait, what?
  111. We have limited the neighborhoods to the top 30. Let’s show all the neighborhoods.
  112. Remove the Limit setting to show all the neighborhoods and see the neighborhoods that are cheaper than the overall average.
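The ‘Wait, what?’ moment is easy to reproduce in Python. The sketch below computes the average price per neighborhood and an overall average as the reference value, then shows that limiting to the top groups hides every neighborhood below the reference. The prices are made up, and it assumes the reference line is the mean over all rows, which the slides do not spell out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, price) rows.
listings = [
    ("Williamsburg", 200), ("Williamsburg", 180),
    ("Bedford-Stuyvesant", 130), ("Bedford-Stuyvesant", 120),
    ("Flushing", 70), ("Flushing", 90),
    ("East Flatbush", 60), ("East Flatbush", 80),
]

prices_by_neighborhood = defaultdict(list)
for neighborhood, price in listings:
    prices_by_neighborhood[neighborhood].append(price)

# Bar heights: mean price per neighborhood.
mean_price = {n: mean(p) for n, p in prices_by_neighborhood.items()}

# Reference line: assumed here to be the mean over all rows.
overall_mean = mean(price for _, price in listings)

# Limiting to the top 2 by mean price leaves only bars above the line.
top_2 = sorted(mean_price, key=mean_price.get, reverse=True)[:2]
below = [n for n, m in mean_price.items() if m < overall_mean]

print("overall mean:", overall_mean)
print("top 2:", top_2)
print("below the reference line:", below)
```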
  113. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Adjust Font Size G. Group by Color H. Create ‘Other’ Group I. Edit Display Name J. Calculate ‘% of Total’ with Window Calculation
  114. Click ‘Horizontal’ for the Orientation to lay out the bar chart horizontally. This makes it easier to read the neighborhood names.
  115. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  116. Select ‘Above’ for ‘Show Value on Plot’ inside the Property.
  117. The numbers (number of listings) are shown inside the chart.
  118. We can change the font size for the values on the chart.
  119. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  120. Let’s see what types of properties are more common in each neighborhood. Also, let’s find out whether there are common patterns or differences among the neighborhoods.
  121. Let’s remove the reference line for now.
  122. Select ‘None’ for the Reference Line Type.
  123. Set ‘Number of Rows’ to Y-Axis.
  124. Select the ‘property_type’ column for Color By.
  125. It looks like the Apartment type is the most common in most of the neighborhoods. Some neighborhoods, like Flushing, seem to have a lower ratio of apartments.
  126. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  127. Notice that there is an ‘Others’ entry inside the legend.
  128. When you assign a column with more than 20 unique values, it automatically creates an ‘Others’ group. It keeps the 20 most frequent categories and combines everything else into the ‘Others’ group.
  129. What if we want to show only the top 10 property types and put everything else into ‘Others’?
  130. Click the green text ‘Frequency 20 (35)’ to open the ‘Other Group’ setting dialog.
  131. Set 10 to keep the top 10 categories and put everything else into the ‘Others’ group.
  132. 132
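The ‘Others’ grouping described above (keep the most frequent categories, combine the rest) can be sketched in Python like this. The `lump_others` helper is a hypothetical name, the property types are made-up sample data, and how Exploratory breaks ties between equally frequent categories is not covered by the slides.

```python
from collections import Counter


def lump_others(values, keep, other_label="Others"):
    """Keep the `keep` most frequent categories and relabel the rest."""
    top = {v for v, _ in Counter(values).most_common(keep)}
    return [v if v in top else other_label for v in values]


# Hypothetical 'property_type' values, one per listing.
property_types = (
    ["Apartment"] * 5 + ["House"] * 3 + ["Loft"] * 2
    + ["Condominium", "Hotel"]
)

lumped = lump_others(property_types, keep=3)
print(Counter(lumped))
```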

  133. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  134. We want to edit the names inside the legend. Also, we want to combine some categories by editing the names. For example: • Show ‘Boutique hotel’ and ‘Hotel’ as ‘Hotel’. • Show ‘Condominium’ and ‘Apartment’ as ‘Apartment’.
  135. 135
  136. Select ‘Edit Display Name’ from the Color menu.
  137. Enter the display names for the ones you want to change: Boutique hotel -> Hotel; Condominium -> Apartment
  138. 138
  139. Basic Features of Chart: A. Limit Values B. Summarize Functions C. Reference Line D. Horizontal Bar Chart E. Show Values on Plot F. Group by Color G. Create ‘Other’ Group H. Edit Display Name I. Calculate ‘% of Total’ with Window Calculation
  140. Are there any differences among the neighborhoods based on the property types? We want to show the ratio of each property type for each neighborhood.
  141. Select ‘Window Calculation’ from the Y-Axis menu.
  142. Select ‘% of’ for the Calculation Type and keep ‘Sum (Total)’ as the Summarize option.
  143. Now we can see the ratio of each property type for each neighborhood.
  144. Overall, the apartment type is the most frequent type in all the neighborhoods. Some neighborhoods, like East Flatbush and Flushing, have a higher ratio of the house property type.
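The ‘% of Total’ calculation divides each colored segment by the total of its bar. Here is a minimal Python sketch with made-up rows. It assumes the total is taken per neighborhood, which matches the per-neighborhood ratios described above.

```python
from collections import Counter

# Hypothetical (neighborhood, property_type) rows.
rows = [
    ("Williamsburg", "Apartment"), ("Williamsburg", "Apartment"),
    ("Williamsburg", "Apartment"), ("Williamsburg", "House"),
    ("Flushing", "Apartment"), ("Flushing", "House"),
]

# Number of rows per (neighborhood, property_type) and per neighborhood.
segment_counts = Counter(rows)
bar_totals = Counter(neighborhood for neighborhood, _ in rows)

# Each segment as a percentage of its own bar's total.
pct_of_total = {
    (n, t): 100 * count / bar_totals[n]
    for (n, t), count in segment_counts.items()
}

for (n, t), pct in pct_of_total.items():
    print(f"{n:<14} {t:<10} {pct:.0f}%")
```

Because every bar now adds up to 100%, neighborhoods with very different listing counts can be compared on the mix of property types alone.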
  145. Agenda: • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate (Bubble chart) • Introduction to Map
  146. Some neighborhoods are more expensive than the others.
  147. The ‘review_scores_rating’ column has an average review score for each property.
  148. 148
  149. Some neighborhoods have higher review scores than the others.
  150. We want to compare the neighborhoods by using these 2 numerical columns and answer the following questions. • Which neighborhoods are more expensive and have higher review scores? • Also, which neighborhoods are more expensive but have lower scores?
  151. Create a new chart.
  152. Select the ‘Scatter (With Aggregation)’ chart type.
  153. Assign the ‘review_scores_rating’ column to X-Axis, then select the ‘Mean (Average)’ summarize function.
  154. Assign the ‘price’ column to Y-Axis, then select the ‘Mean (Average)’ summarize function.
  155. There is a circle that shows the overall average of the review score and the overall average of the price.
  156. Assign the ‘neighborhood’ column to ‘Group By’ so that we can break down the circle into neighborhoods.
  157. Callouts: one neighborhood has the lowest score and is cheaper. Another neighborhood is the most expensive, but its score is not that great.
  158. How many properties are there in those neighborhoods?
  159. Assign ‘neighborhood’ to ‘Group By’ so that we can break down the circle into a number of groups.
  160. Let’s create 4 regions based on the price and the review score: • Low Price / High Review Score • High Price / High Review Score • High Price / Low Review Score • Low Price / Low Review Score
  161. Add a reference line for X-Axis to show the overall average score.
  162. Select ‘Mean (Average)’ for the Reference Line Type.
  163. Add a reference line for Y-Axis to show the overall average price.
  164. Select ‘Mean (Average)’ for the Reference Line Type.
  165. 165
  166. Assign the ‘neighborhood_group’ column to Color By.
  167. It looks like the neighborhoods in Manhattan (Green) tend to be more expensive but have lower review scores.
  168. When we zoom in, we can see that pattern more clearly.
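The four regions created by the two reference lines can be expressed as a small classification. The Python sketch below uses made-up scores and prices for neighborhood names taken from the slides, and it assumes both reference lines are means over all rows. It illustrates the idea and does not reproduce the real chart.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (neighborhood, review score, price) rows.
listings = [
    ("Bedford-Stuyvesant", 90, 300), ("Bedford-Stuyvesant", 88, 280),
    ("Williamsburg", 96, 180), ("Williamsburg", 94, 160),
    ("Flushing", 95, 80), ("Flushing", 97, 70),
    ("East Flatbush", 88, 60), ("East Flatbush", 90, 70),
]

# Reference lines: assumed to be the overall means across all rows.
overall_score = mean(score for _, score, _ in listings)
overall_price = mean(price for _, _, price in listings)

by_neighborhood = defaultdict(list)
for neighborhood, score, price in listings:
    by_neighborhood[neighborhood].append((score, price))


def region(avg_score, avg_price):
    """Name the region a circle falls into relative to the two reference lines."""
    price_label = "High Price" if avg_price > overall_price else "Low Price"
    score_label = "High Review Score" if avg_score > overall_score else "Low Review Score"
    return f"{price_label} / {score_label}"


regions = {}
for neighborhood, pairs in by_neighborhood.items():
    avg_score = mean(score for score, _ in pairs)
    avg_price = mean(price for _, price in pairs)
    regions[neighborhood] = region(avg_score, avg_price)

for neighborhood, label in regions.items():
    print(f"{neighborhood:<20} {label}")
```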
  169. Agenda: • Choosing the Right Chart based on Data Type • Introduction to Basic Chart Features • Introduction to Scatter with Aggregate • Introduction to Map
  170. • Which neighborhoods have more properties? • Which neighborhoods are more expensive? These questions can be answered by using Bar and Bubble charts. But can we see the geographical patterns at the same time? Map!!
  171. Select the ‘Map - Long/Lat’ type. If your data has columns named ‘longitude’ and ‘latitude’, they are assigned automatically.
  172. Each dot represents a row, which in this case is a listing on Airbnb.
  173. In order to compare the neighborhoods, we want to aggregate the dots by neighborhood.
  174. Assign the ‘neighborhood’ column to Group By.
  175. Now each dot represents a neighborhood. Next, we want to see how many listings each neighborhood has.
  176. Assign ‘Number of Rows’ to Size to visualize the numbers by using the circle size.
  177. We can see that, for example, Williamsburg has a lot of listings (3,936).
  178. How about the price? Which neighborhoods are more expensive or cheaper?
  179. Assign the ‘price’ column to Color By.
  180. It is hard to see the differences with the current color scheme.
  181. Select ‘Color Setting’ from the Color By menu.
  182. You can select a different color palette from the menu and also set a higher value for the Opacity to make the color less transparent.
  183. It’s become easier to spot the neighborhoods with a higher average rental price.
  184. You can also change the background to ‘Dark’ so that it’s even easier to see all the circles and spot the ones with higher price values.
  185. Q & A
  186. Next Seminar
  187. EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series Data
  188. Data Visualization Workshop: • Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Distribution & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization
  189. Information: Email: kan@exploratory.io; Website: https://exploratory.io; Twitter: @KanAugust; Training: https://exploratory.io/training
  190. EXPLORATORY