How to Use Exploratory

How to Use Exploratory

19fc8f6113c5c3d86e6176362ff29479?s=128

Kan Nishida

January 08, 2020
Tweet

Transcript

  1. How to use Exploratory guide 1

  2. 1. Create a Project 2. Import Data 3. Quick Insights

    with Summary View 4. Convert Data Type (Character, Numeric, Date, etc.) 5. Create Charts - Limit Values, Grouping, Trend line, etc.) 6. Create Calculations 7. Filter Data 8. Edit, Move, Disable, & Delete Steps 9. Introduction to Chart Pinning 10. Introduction to Branch 11. Summarize (Aggregate) Data 12. Data Reproducibility 13. Create Dashboard and Publish 2
  3. Create a new Project 3

  4. You want to create a project first. That's where you

    will import all your data. 4
  5. Create a new project 5

  6. Type the project name and click the ‘Create’ button! 6

  7. It opens a new project. Let’s begin! 7

  8. Import Data 8

  9. Sample Data Airbnb New York Data 9

  10. 1. Open Airbnb New York Data Page 2. Download the

    Data 3. Import the Data Import Data
  11. Click this URL link to open the Airbnb data page.

    11 1. Open the Data Page
  12. 2. Download the Data 12

  13. Select ‘File Data’ from Data Frame menu 13 3. Import

    the Data
  14. Select ‘Text File (CSV, delimited)’ 14

  15. Select the downloaded CSV file. 15

  16. Click ‘Save’ button 16

  17. Data is imported into Exploratory. 17

  18. Summary view shows a data distribution chart and summary statistics

    for each column. 18
  19. Table view shows the data in a table format. 19

  20. 20 You can check how many columns and rows in

    the data.
  21. Quick Insights with Summary View 21

  22. A different type of charts and Summary Statistics are shown

    for each column depending on the data type. 22
  23. • numeric • character • Date • POSIXct • logical

    • Factor Data Type There are many data types that are supported in Exploratory, but these 6 types are the most common and good enough for most cases. 23
  24. A histogram chart is used to show the distribution of

    data. Numeric values are grouped into a set of the bars that have equal range and each bar shows the number of rows within each of the ranges. Numeric 24
  25. Underneath the histogram chart, you can see a series of

    summary statistics such as Average, Median, etc. 25 Numeric - Summary Statistics
  26. You can also see how many NA rows there are

    and its percentage for each column. In this case, ‘review_scores_rating’ column has 11,162 missing values and that is 22.06% of the total rows. 26 NA (Not Available/ Missing Values)
  27. A horizontal bar chart is used to see the top

    categorical values along with the number of rows information. 27 Character
  28. A horizontal bar chart is used to indicate how many

    rows there are for TRUE and FALSE values. 28 Logical (TRUE / FALSE)
  29. A histogram chart is used to show the distribution of

    data for the Date and the POSIXct columns. You can also find the date range by looking at the Min and Max of the data. 29 Date / POSIXct (Date & Time)
  30. Factor data type is similar to Character data type except

    that it can have the ‘order’ information. The horizontal bar chart shows the number of the rows for each value. The bars are sorted in a way that is defined inside the column. 30 Factor
  31. Convert Data Type

  32. Some columns look like Numeric, but registered as Character data

    type.
  33. You can change the data type quickly.

  34. Select ‘Change Data Type’ and ‘Convert to Numeric’ from the

    column header menu.
  35. It will open ‘Mutate (Create Calculation) dialog with the ‘parse_number’

    function pre-populated inside the Calculation Editor. Simply, click ‘Run’ button.
  36. The ‘price’ column is converted to ‘numeric’ data type and

    it now shows a histogram chart with the summary statistics.
  37. Convert Data Type Multiple Columns Together

  38. There are more columns that need to be converted to

    Numeric data type.
  39. You can change the data type for multiple columns at

    once. Select multiple columns by using Command key (Mac) or Control key (Windows).
  40. Select ‘Change Data Type’ and ‘Convert to Numeric’ from the

    column header menu.
  41. The selected columns are listed and ‘Convert Data Type’ is

    selected in the Calculation Type in the dialog. Simply, click ‘Run’ button.
  42. The selected columns are now registered as Numeric data type.

  43. Notice that these operations are recorded at the right hand

    side as the Data Wrangling Steps. We’ll cover this in more details later.
  44. Create the 1st Chart 44

  45. 1.Create a Bar Chart 2.Sort the Bars 3.Limit X-Axis Values

    4.Group with Color Create the 1st Chart
  46. 46 Let’s create a chart to answer this question. “Which

    neighborhoods have more places on Airbnb?”
  47. Click ‘+’ button to create a new chart. 47

  48. 48 Select a ‘Bar’ as the chart type.

  49. 49 Assign ‘Neighborhood’ column to X-Axis. Keep ‘Number of Rows’

    for Y-Axis.
  50. 50 You can sort the bars from the highest to

    the lowest.
  51. There are too many bars (neighborhoods) to show. 51

  52. You can limit the X-Axis values.

  53. Select ‘Top’ for the limit type and set 50 so

    that only the top 50 neighborhoods with the most ‘Number of Rows’ will be shown.
  54. You can assign ‘neighborhood_group’ column to ‘Color’ to indicate which

    New York borough each neighborhood belongs to.
  55. Visualize Time Series Data 55

  56. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color Visualize Time Series Data
  57. “How many places have been added to Airbnb over the

    years?” 57 Let’s create a chart to answer this question.
  58. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  59. Create a new chart.

  60. Select Line chart, assign ‘host_since’ column to X-Axis, and keep

    ‘Number of Rows’ for Y-Axis.
  61. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  62. Let’s change the date aggregation level from Year to Month.

  63. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  64. The numbers are going ups and downs. In order to

    make it easier to see the overall trend, let’s show the ‘trend line’.
  65. This time, let’s use ‘Polynomial (Loess)’ as the trend line

    type.
  66. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  67. What if we want to know how many places have

    been added to Airbnb accumulatively, instead of how many places were added at each given month?
  68. You can use ‘Window Calculation’ feature to calculate ‘Cumulative Sum’

    of the ‘Number of Rows’.
  69. Select ‘Cumulative’ for the Calculation Type.

  70. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  71. You can assign ‘neighborhood_group’ column to Color to see how

    the places have been added to Airbnb in each of the 5 boroughs accumulatively.
  72. Create Calculations 72

  73. There is a column called ‘availability_365’, which indicates how many

    days are available (not booked) per year.
  74. Let’s calculate a ‘availability rate’, which can be calculated like

    the following. availability_365 / 365 74
  75. Select ‘Mutate (Create Calculation)’ from the column header menu of

    ‘availability_365’. 75
  76. Type the calculation in the Calculation Editor. 76

  77. 77 Select ‘Create New Column’ and type ‘availability_rate’ as the

    new column name.
  78. A new column is created as ‘availability_rate’. 78

  79. 79 There are too many decimal digits. Let’s round the

    values and keep only two decimal digits.
  80. 80 Select ‘Work with Numeric Function’ and ‘round’ function from

    the column header menu.
  81. 81 This opens ‘Mutate (Create Calculation)’ dialog with the ‘round’

    function pre-populated.
  82. 82 Set 2 for the ‘digits’ parameter to keep 2

    decimal digits.
  83. 83 The values in ‘availability_rate’ column is rounded with 2

    decimal digits.
  84. Note that these ‘Mutate (Create Calculation)’ operations are recorded at

    the right hand side. 84
  85. You can go to Summary view to check the distribution

    of ‘availability_rate’ values. 85
  86. Filter Data 86

  87. Let’s say you want to filter the data to keep

    only the places in Manhattan and Brooklyn. 87
  88. Select ‘Filter’ and ‘Is in (multiple values)’ from the column

    header menu. 88
  89. Select ‘Manhattan’ and ‘Brooklyn’. 89

  90. 90 The ‘neighborhood_group’ column shows only the 2 neighborhood groups.

  91. 91 Note that this ‘Filter’ operation is recorded at the

    right hand side.
  92. Update, Move, Disable, & Delete Steps 92

  93. All the data wrangling operations are recorded in the ‘Steps’

    pane at the right hand side. 93
  94. Each operation becomes a ‘Step’. Step 94

  95. Each step has its own data snapshot behind the scene.

    Data Snapshot
  96. You can move between the steps to see the data

    from each step.
  97. Move Steps 97

  98. Let’s say you want to move the step 5 (Filter)

    to right before the step 4 (Mutate). 98
  99. Drag the step 5 and drop it right before the

    step 4. 99
  100. The Filter step has become the step 4. 100

  101. Tokens in the Step 101

  102. Each box inside the step is called ‘Token’. Token 102

  103. Same type of the operations are added into the same

    step. Here, the 2 calculations have been added to the same ‘Mutate’ step. 103
  104. 104 You can click the token to open the dialog

    to update the configuration.
  105. In this case, it will open ‘Mutate (Create Calculation)’ dialog.

    105
  106. You can delete each token. 106

  107. You can change the order of the tokens by drag-and-drop

    if needed. 107
  108. Introduction to Chart Pinning 108

  109. Use ‘Pin’ to fix a chart to a particular data

    wrangling step. 109
  110. Let’s say we want to visualize the availability rates by

    neighborhood. 110
  111. Open the chart you have created previously. 111

  112. 112 We want to assign the ‘availability_rate’ column, which we

    have created at the previous step to Y-Axis. But, the column is not there in the dropdown!
  113. The ‘availability_rate’ column is created at the step 5. 113

  114. 114 But the chart is ‘Pinned’ to the step 3.

  115. This means that this chart always references the step 3

    data. 115
  116. Even when you select other steps, this chart still references

    the step 3 data. 116
  117. In order to use the step 5 data, you need

    to move the ‘Pin’ to the step 5.
  118. Move the Pin icon to the step 5 by Drag-and-Drop.

    118
  119. Now that this chart is ‘Pinned’ to the step 5,

    it will always references the data at the step 5 regardless of which step you select at the right hand side. 119
  120. The data at the step 5 has only Manhattan and

    Brooklyn, hence the chart shows only the two boroughs with two colors (Blue & Orange). 120
  121. Now you can see ‘availability_rate’ column inside the Y-Axis dropdown.

    121
  122. 122 Select ‘availability_rate’ column and select ‘Average (Mean)’ as the

    calculation.
  123. We are looking at the average availability rate for places

    in Manhattan and Brooklyn. 123
  124. 124 What if we want to see all the neighborhoods,

    not just for Manhattan and Brooklyn?
  125. Delete the Filter step! But, you might want to use

    it later… 125
  126. Instead of deleting, you can disable the step temporally. 126

  127. Now you can compare the average availability rates of the

    top 50 neighborhoods from all 5 boroughs. 127
  128. You can click ‘Enable Step’ icon to enable the step.

    But for this tutorial, we’ll continue with this step being disabled. 128
  129. Looks there are a lot of places in Staten Island

    (Purple) with high availability rates. For example, ‘Fort Wadsworth’ is 100% available, that’s crazy!
  130. Click the bar and click ‘Show Detail’.

  131. There is only one place in this neighborhood! And this

    place happens to have 1 (100%) for the availability rate.
  132. Instead of ‘Top 50’ neighborhoods, we want to show only

    the neighborhood with at least 100 places listed. 132
  133. Click the green text to open ‘Limit Axis Value’ setting

    dialog.
  134. Select ‘Condition’ for Type, select ‘Number of Rows’ for Based

    on, ‘greater than’ for Operator, and type 100 for Value.
  135. The neighborhoods in Queens (Red) tend to be more available

    while the neighborhoods with lower availability rates are in Brooklyn or Manhattan.
  136. Introduction to Branch 136

  137. Sometimes, you might want to create different versions of data

    from the same data. For example, you might want to create a data frame to aggregate the data by city or property_type while you want to keep the original data to be not aggregated. Creating different data frames that are separated from one another will create a maintenance nightmare. Instead, you can use Branch feature to ‘branch off’ from the original data frame and create multiple data frames that share the original data frame.
  138. Create Branch Main Data Frame Branch Data Frame Data Import

    Convert Data Type Create Calculations Filter Aggregate by City
  139. Import Excel Data Convert Data Type Create Calculations Filter Branch

    Data Frame 1 Branch Data Frame 2 Aggregate by City Top 10 Cities Aggregate by Host Clustering with K-Means You can create multiple branches from any steps. Main Data Frame
  140. ϝΠϯσʔλϑϨʔϜ ϒϥϯν1 ϒϥϯν2 Changes in Main Data Frame will propagate

    to only the related branches automatically. Import Excel Data Convert Data Type Create Calculations Filter Aggregate by City Top 10 Cities Aggregate by Host Clustering with K-Means
  141. Create a Branch Data Frame 141

  142. Click the Branch icon to create a new Branch off

    from the step 5. 142
  143. Enter the name for a new Branch data frame. 143

  144. The newly created ‘Branch’ data frame is added underneath the

    original data frame. 144
  145. 145 You can see that the branch data frame is

    branched off from the step 5 of the main data frame.
  146. 146 What if you want this branch data frame to

    branch off from the step 3 instead of step 5?
  147. Go back to the main data frame and click ‘Branch’

    button
  148. Grab the Branch token and Drag to the step 3.

  149. Drop the token to the step 3. 149

  150. The Branch data frame is now branched from the step

    3. 150
  151. Go back to the branch data frame. 151

  152. Summarize (Aggregate) Data 152

  153. Currently, each place listed on Airbnb is presented as each

    row.
  154. Let’s say we want to summarize the data by property_type

    so that each row represents each property_type. Summarize
  155. Select ‘Summarize (Aggregate)’ from the column header menu.

  156. Assign columns and select aggregation functions.

  157. Data is summarized and the operation is recorded as ‘Summarize’

    at the right hand side.
  158. Data Reproducibility 158

  159. What if the original data file has been updated? 159

  160. There are 2 scenarios. • The data is updated and

    saved as the same file. • The data is updated but saved as a separate file. 160
  161. There are 2 scenarios. • The data is updated and

    saved as the same file. • The data is updated but saved as a separate file. 161
  162. Click ‘Re-Import’ button. 162

  163. This will re-read the data from the same file and

    run all the data wrangling steps automatically. 163
  164. There are 2 scenarios. • The data is updated and

    saved as the same file. • The data is updated but saved as a separate file. 164
  165. Open Data Import dialog by clicking on the token inside

    the Step 1. 165
  166. 166 Click ‘Change File’ button and select the newly updated

    data file and click ‘Apply’ button.
  167. Data will be imported and all the data wrangling steps

    will be applied automatically. 167
  168. The Branch data frame gets updated, too! 168

  169. How to Create Dashboard 169

  170. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 170
  171. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 171
  172. Select ‘Dashboard’ from the Reports menu. 172

  173. Type a name for the Dashboard.

  174. It will open Dashboard Editor in a separate window 174

  175. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 175
  176. It will open the Dashboard Editor in a separate window

    176
  177. Select a data frame you want to add a chart

    from. 177
  178. Select a chart and click the OK button. 178

  179. None
  180. Add another chart. This time, let’s add the Line chart.

  181. 181 Select the line chart and click ‘OK’ button.

  182. Now we have two charts being added.

  183. Click ‘Run’ button to preview the output.

  184. 184

  185. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 185
  186. You can use Numbers to show metrics inside Dashboard. 186

  187. Go back to the data frame and add a new

    chart to create the Numbers. 187
  188. Select ‘Number’ as Chart Type. 188

  189. Select ‘Rename’ from the chart tab menu to change the

    name.
  190. Type the name as ‘Number of Places’.

  191. Copy the Number to create another Number.

  192. Select ‘review_scores_rating’ column for the Value and set Average (Mean)

    as the calculation. 192
  193. Change the name to ‘Average Review Score’.

  194. Go back to the Dashboard Edit UI to add the

    2 numbers.
  195. Click ‘Add Chart’ button.

  196. Select ‘Number of Places’ and click ‘OK’ button.

  197. Select ‘Number of Places’ and click ‘OK’ button.

  198. Move the 2 Numbers to the top by drag-and-drop.

  199. None
  200. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 200
  201. Click ‘Publish’ button to publish the Dashboard to Exploratory Cloud.

    201
  202. Publish in Private mode or Public mode? 202

  203. 203 Private Mode Only the ones you explicitly share with

    can open your dashboard. Public Mode Anybody can open your dashboard.
  204. Click ‘Publish’ button to to publish. 204

  205. 205 Once it’s published, an unique URL is assigned to

    the Dashboard. Click ‘Open in Browser’ link to open the Dashboard in the web browser.
  206. You can see the published Dashboard in the web browser.

    206
  207. You can share it with others by clicking on ‘Share’

    button. 207
  208. Share with Invite 208 Type the email and click ‘Share’

    button. This will send an invite email.
  209. 209 Share with Invite The person who is invited can

    log in with her/his Exploratory account and open the Dashboard. If the person doesn’t have an Exploratory account then she/he can create it for FREE. The viewers can continue to view any contents at Exploratory Cloud as long as they are invited to view.
  210. 210 Share with URL You can also share your Dashboard

    with URL. This allows anyone with the URL to open the Dashboard without logging into Exploratory Cloud.
  211. Schedule Dashboard 211 You can schedule the dashboard to keep

    the data always up-to-date by querying against the data sources and applying all the data wrangling steps automatically. Note that you can schedule only the ones with remote data sources that can be accessed by Exploratory Cloud.