Upgrade to Pro — share decks privately, control downloads, hide ads and more …

How to Use Exploratory

How to Use Exploratory

This is a walk through of 'how to use Exploratory' covering the following topics.

- Working with Chart, Time Series, Pin
- Data Wrangling Introduction
- How to Use the Step
- Branch Data Frame
- Creating Dashboard and Note

I have collected the features that are useful and would make your data analysis work much more efficient and productive.

19fc8f6113c5c3d86e6176362ff29479?s=128

Kan Nishida
PRO

April 08, 2020
Tweet

Transcript

  1. How to Use Exploratory Exploratory Seminar #27

  2. 2 EXPLORATORY

  3. Kan Nishida CEO/co-founder Exploratory Summary Beginning of 2016, launched Exploratory,

    Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle leading teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Speaker
  4. Data Science is not just for Engineers and Statisticians. Exploratory

    makes it possible for Everyone to do Data Science. The Third Wave
  5. First Wave Second Wave Third Wave Proprietary Open Source UI

    & Programming Programming 2016 2000 1976 Monetization Commoditization Democratization Statisticians Data Scientists Democratization of Data Science Algorithms Experience Tools Open Source UI & Automation Business Users Theme Users
  6. Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics /

    Machine Learning) Data Analysis Data Science Workflow
  7. Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization

    Analytics (Statistics / Machine Learning) Data Analysis ExploratoryɹModern & Simple UI
  8. How to Use Exploratory Exploratory Seminar #27

  9. 1. Create a Project 2. Import Data 3. Quick Insights

    with Summary View 4. Convert Data Type (Character, Numeric, Date, etc.) 5. Create Charts - Limit Values, Grouping, Trend line, etc.) 6. Create Calculations 7. Filter Data 8. Edit, Move, Disable, & Delete Steps 9. Introduction to Chart Pinning 10. Introduction to Branch 11. Summarize (Aggregate) Data 12. Data Reproducibility 13. Create Dashboard and Publish 9
  10. Create a Project 10

  11. You want to create a project first. That's where you

    will import all your data. 11
  12. Create a new project 12

  13. Type the project name and click the ‘Create’ button! 13

  14. It opens a new project. Let’s begin! 14

  15. Import Data 15

  16. Sample Data Airbnb New York Data 16

  17. 1. Open Airbnb New York Data Page 2. Download the

    Data 3. Import the Data Import Data
  18. Click this URL link to open the Airbnb data page.

    18 1. Open the Data Page
  19. 2. Download the Data 19

  20. Select ‘File Data’ from Data Frame menu 20 3. Import

    the Data
  21. Select ‘Text File (CSV, delimited)’ 21

  22. Select the downloaded CSV file. 22

  23. Click ‘Save’ button 23

  24. Data is imported into Exploratory. 24

  25. Summary view shows a data distribution chart and summary statistics

    for each column. 25
  26. Table view shows the data in a table format. 26

  27. 27 You can check how many columns and rows in

    the data.
  28. Quick Insights with Summary View 28

  29. A different type of charts and Summary Statistics are shown

    for each column depending on the data type. 29
  30. • numeric • character • Date • POSIXct • logical

    • Factor Data Type There are many data types that are supported in Exploratory, but these 6 types are the most common and good enough for most cases. 30
  31. A histogram chart is used to show the distribution of

    data. Numeric values are grouped into a set of the bars that have equal range and each bar shows the number of rows within each of the ranges. Numeric 31
  32. Underneath the histogram chart, you can see a series of

    summary statistics such as Average, Median, etc. 32 Numeric - Summary Statistics
  33. You can also see how many NA rows there are

    and its percentage for each column. In this case, ‘review_scores_rating’ column has 11,162 missing values and that is 22.06% of the total rows. 33 NA (Not Available/ Missing Values)
  34. A horizontal bar chart is used to see the top

    categorical values along with the number of rows information. 34 Character
  35. A horizontal bar chart is used to indicate how many

    rows there are for TRUE and FALSE values. 35 Logical (TRUE / FALSE)
  36. A histogram chart is used to show the distribution of

    data for the Date and the POSIXct columns. You can also find the date range by looking at the Min and Max of the data. 36 Date / POSIXct (Date & Time)
  37. Factor data type is similar to Character data type except

    that it can have the ‘order’ information. The horizontal bar chart shows the number of the rows for each value. The bars are sorted in a way that is defined inside the column. 37 Factor
  38. Convert Data Type

  39. Some columns look like Numeric, but registered as Character data

    type.
  40. You can change the data type quickly.

  41. Select ‘Change Data Type’ and ‘Convert to Numeric’ from the

    column header menu.
  42. It will open ‘Mutate (Create Calculation) dialog with the ‘parse_number’

    function pre-populated inside the Calculation Editor. Simply, click ‘Run’ button.
  43. The ‘price’ column is converted to ‘numeric’ data type and

    it now shows a histogram chart with the summary statistics.
  44. Convert Data Type Multiple Columns Together

  45. There are more columns that need to be converted to

    Numeric data type.
  46. You can change the data type for multiple columns at

    once. Select multiple columns by using Command key (Mac) or Control key (Windows).
  47. Select ‘Change Data Type’ and ‘Convert to Numeric’ from the

    column header menu.
  48. The selected columns are listed and ‘Convert Data Type’ is

    selected in the Calculation Type in the dialog. Simply, click ‘Run’ button.
  49. The selected columns are now registered as Numeric data type.

  50. Notice that these operations are recorded at the right hand

    side as the Data Wrangling Steps. We’ll cover this in more details later.
  51. Create the 1st Chart 51

  52. 1.Create a Bar Chart 2.Sort the Bars 3.Limit X-Axis Values

    4.Group with Color Create the 1st Chart
  53. 53 Let’s create a chart to answer this question. “Which

    neighborhoods have more places on Airbnb?”
  54. Click ‘+’ button to create a new chart. 54

  55. 55 Select a ‘Bar’ as the chart type.

  56. 56 Assign ‘Neighborhood’ column to X-Axis. Keep ‘Number of Rows’

    for Y-Axis.
  57. 57 You can sort the bars from the highest to

    the lowest.
  58. There are too many bars (neighborhoods) to show. 58

  59. You can limit the X-Axis values.

  60. Select ‘Top’ for the limit type and set 50 so

    that only the top 50 neighborhoods with the most ‘Number of Rows’ will be shown.
  61. You can assign ‘neighborhood_group’ column to ‘Color’ to indicate which

    New York borough each neighborhood belongs to.
  62. Visualize Time Series Data 62

  63. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color Visualize Time Series Data
  64. “How many places have been added to Airbnb over the

    years?” 64 Let’s create a chart to answer this question.
  65. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  66. Create a new chart.

  67. Select Line chart, assign ‘host_since’ column to X-Axis, and keep

    ‘Number of Rows’ for Y-Axis.
  68. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  69. Let’s change the date aggregation level from Year to Month.

  70. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  71. The numbers are going ups and downs. In order to

    make it easier to see the overall trend, let’s show the ‘trend line’.
  72. This time, let’s use ‘Polynomial (Loess)’ as the trend line

    type.
  73. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  74. What if we want to know how many places have

    been added to Airbnb accumulatively, instead of how many places were added at each given month?
  75. You can use ‘Window Calculation’ feature to calculate ‘Cumulative Sum’

    of the ‘Number of Rows’.
  76. Select ‘Cumulative’ for the Calculation Type.

  77. 1. Create a Line Chart 2. Change Date Aggregation Level

    3. Show a Trend Line 4. Show a Cumulative Sum with Window Calculation 5. Create Groups with Color
  78. You can assign ‘neighborhood_group’ column to Color to see how

    the places have been added to Airbnb in each of the 5 boroughs accumulatively.
  79. Create Calculations 79

  80. There is a column called ‘availability_365’, which indicates how many

    days are available (not booked) per year.
  81. Let’s calculate a ‘availability rate’, which can be calculated like

    the following. availability_365 / 365 81
  82. Select ‘Mutate (Create Calculation)’ from the column header menu of

    ‘availability_365’. 82
  83. Type the calculation in the Calculation Editor. 83

  84. 84 Select ‘Create New Column’ and type ‘availability_rate’ as the

    new column name.
  85. A new column is created as ‘availability_rate’. 85

  86. 86 There are too many decimal digits. Let’s round the

    values and keep only two decimal digits.
  87. 87 Select ‘Work with Numeric Function’ and ‘round’ function from

    the column header menu.
  88. 88 This opens ‘Mutate (Create Calculation)’ dialog with the ‘round’

    function pre-populated.
  89. 89 Set 2 for the ‘digits’ parameter to keep 2

    decimal digits.
  90. 90 The values in ‘availability_rate’ column is rounded with 2

    decimal digits.
  91. Note that these ‘Mutate (Create Calculation)’ operations are recorded at

    the right hand side. 91
  92. You can go to Summary view to check the distribution

    of ‘availability_rate’ values. 92
  93. Filter Data 93

  94. Let’s say you want to filter the data to keep

    only the places in Manhattan and Brooklyn. 94
  95. Select ‘Filter’ and ‘Is in (multiple values)’ from the column

    header menu. 95
  96. Select ‘Manhattan’ and ‘Brooklyn’. 96

  97. 97 The ‘neighborhood_group’ column shows only the 2 neighborhood groups.

  98. 98 Note that this ‘Filter’ operation is recorded at the

    right hand side.
  99. Update, Move, Disable, & Delete Steps 99

  100. All the data wrangling operations are recorded in the ‘Steps’

    pane at the right hand side. 100
  101. Each operation becomes a ‘Step’. Step 101

  102. Each step has its own data snapshot behind the scene.

    Data Snapshot
  103. Tokens in the Step 103

  104. Each box inside the step is called ‘Token’. Token 104

  105. Same type of the operations are added into the same

    step. Here, the 2 calculations have been added to the same ‘Mutate’ step. 105
  106. 106 You can click the token to open the dialog

    to update the configuration.
  107. In this case, it will open ‘Mutate (Create Calculation)’ dialog.

    107
  108. You can delete each token. 108

  109. You can change the order of the tokens by drag-and-drop

    if needed. 109
  110. Move Steps 110

  111. You can move between the steps to see the data

    from each step.
  112. Let’s say you want to move the step 5 (Filter)

    to right before the step 4 (Mutate). 112
  113. Drag the step 5 and drop it right before the

    step 4. 113
  114. The Filter step has become the step 4. 114

  115. Split/Combine Multiple Steps 115

  116. You can split the tokens into separate steps.

  117. Select ‘Split Step’ from the Step menu.

  118. Each token is now in its own step. This is

    useful when you want to see how the data change by each operation.
  119. You can also combine multiple steps into one step as

    long as they are the same type of operation (e.g. Create Calculation)
  120. Select multiple steps with Command (Mac) or Control (Windows) key,

    and click on the ‘Combine’ icon.
  121. Now the multiple steps are combined together as a single

    step.
  122. Introduction to Chart Pinning 122

  123. Use ‘Pin’ to fix a chart to a particular data

    wrangling step. 123
  124. Let’s say we want to visualize the availability rates by

    neighborhood. 124
  125. Open the chart you have created previously. 125

  126. 126 We want to assign the ‘availability_rate’ column, which we

    have created at the previous step to Y-Axis. But, the column is not there in the dropdown!
  127. The ‘availability_rate’ column is created at the step 5. 127

  128. 128 But the chart is ‘Pinned’ to the step 3.

  129. This means that this chart always references the step 3

    data. 129
  130. Even when you select other steps, this chart still references

    the step 3 data. 130
  131. In order to use the step 5 data, you need

    to move the ‘Pin’ to the step 5.
  132. Move the Pin icon to the step 5 by Drag-and-Drop.

    132
  133. Now that this chart is ‘Pinned’ to the step 5,

    it will always references the data at the step 5 regardless of which step you select at the right hand side. 133
  134. The data at the step 5 has only Manhattan and

    Brooklyn, hence the chart shows only the two boroughs with two colors (Blue & Orange). 134
  135. Now you can see ‘availability_rate’ column inside the Y-Axis dropdown.

    135
  136. 136 Select ‘availability_rate’ column and select ‘Average (Mean)’ as the

    calculation.
  137. We are looking at the average availability rate for places

    in Manhattan and Brooklyn. 137
  138. 138 What if we want to see all the neighborhoods,

    not just for Manhattan and Brooklyn?
  139. Delete the Filter step! But, you might want to use

    it later… 139
  140. Instead of deleting, you can disable the step temporally. 140

  141. Now you can compare the average availability rates of the

    top 50 neighborhoods from all 5 boroughs. 141
  142. You can click ‘Enable Step’ icon to enable the step.

    But for this tutorial, we’ll continue with this step being disabled. 142
  143. Looks there are a lot of places in Staten Island

    (Purple) with high availability rates. For example, ‘Fort Wadsworth’ is 100% available, that’s crazy!
  144. Click the bar and click ‘Show Detail’.

  145. There is only one place in this neighborhood! And this

    place happens to have 1 (100%) for the availability rate.
  146. Instead of ‘Top 50’ neighborhoods, we want to show only

    the neighborhood with at least 100 places listed. 146
  147. Click the green text to open ‘Limit Axis Value’ setting

    dialog.
  148. Select ‘Condition’ for Type, select ‘Number of Rows’ for Based

    on, ‘greater than’ for Operator, and type 100 for Value.
  149. The neighborhoods in Queens (Red) tend to be more available

    while the neighborhoods with lower availability rates are in Brooklyn or Manhattan.
  150. None
  151. None
  152. Introduction to Branch 152

  153. Sometimes, you might want to create different versions of data

    from the same data. For example, you might want to create a data frame to aggregate the data by city or property_type while you want to keep the original data to be not aggregated. Creating different data frames that are separated from one another will create a maintenance nightmare. Instead, you can use Branch feature to ‘branch off’ from the original data frame and create multiple data frames that share the original data frame.
  154. Create Branch Main Data Frame Branch Data Frame Data Import

    Convert Data Type Create Calculations Filter Aggregate by City
  155. Import Excel Data Convert Data Type Create Calculations Filter Branch

    Data Frame 1 Branch Data Frame 2 Aggregate by City Top 10 Cities Aggregate by Host Clustering with K-Means You can create multiple branches from any steps. Main Data Frame
  156. Changes in Main Data Frame will propagate to only the

    related branches automatically. Import Excel Data Convert Data Type Create Calculations Filter Aggregate by City Top 10 Cities Aggregate by Host Clustering with K-Means Branch Data Frame 1 Branch Data Frame 2 Main Data Frame
  157. Create a Branch Data Frame 157

  158. Click the Branch icon to create a new Branch off

    from the step 5. 158
  159. Enter the name for a new Branch data frame. 159

  160. The newly created ‘Branch’ data frame is added underneath the

    original data frame. 160
  161. 161 You can see that the branch data frame is

    branched off from the step 5 of the main data frame.
  162. 162 What if you want this branch data frame to

    branch off from the step 3 instead of step 5?
  163. Go back to the main data frame and click ‘Branch’

    button
  164. Grab the Branch token and Drag to the step 3.

  165. Drop the token to the step 3. 165

  166. The Branch data frame is now branched from the step

    3. 166
  167. Go back to the branch data frame. 167

  168. Summarize (Aggregate) Data 168

  169. Currently, each place listed on Airbnb is presented as each

    row.
  170. Let’s say we want to summarize the data by property_type

    so that each row represents each property_type. Summarize
  171. Select ‘Summarize (Aggregate)’ from the column header menu.

  172. Assign columns and select aggregation functions.

  173. Data is summarized and the operation is recorded as ‘Summarize’

    at the right hand side.
  174. Data Reproducibility 174

  175. What if the original data file has been updated? 175

  176. • The data is updated and saved as the same

    file. • The data is updated but saved as a separate file. 176 2 scenarios
  177. 2 scenarios • The data is updated and saved as

    the same file. • The data is updated but saved as a separate file. 177
  178. Click ‘Re-Import’ button. 178

  179. This will re-read the data from the same file and

    run all the data wrangling steps automatically. 179
  180. • The data is updated and saved as the same

    file. • The data is updated but saved as a separate file. 180 2 scenarios
  181. Open Data Import dialog by clicking on the token inside

    the Step 1. 181
  182. 182 Click ‘Change File’ button and select the newly updated

    data file and click ‘Apply’ button.
  183. Data will be imported and all the data wrangling steps

    will be applied automatically. 183
  184. The Branch data frame gets updated, too! 184

  185. How to Create Dashboard 185

  186. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 186
  187. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 187
  188. Select ‘Dashboard’ from the Reports menu. 188

  189. Type a name for the Dashboard.

  190. It will open Dashboard Editor in a separate window 190

  191. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 191
  192. It will open the Dashboard Editor in a separate window

    192
  193. Select a data frame you want to add a chart

    from. 193
  194. Select a chart and click the OK button. 194

  195. None
  196. Add another chart. This time, let’s add the Line chart.

  197. 197 Select the line chart and click ‘OK’ button.

  198. Now we have two charts being added.

  199. Click ‘Run’ button to preview the output.

  200. 200

  201. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 201
  202. You can use Numbers to show metrics inside Dashboard. 202

  203. Go back to the data frame and add a new

    chart to create the Numbers. 203
  204. Select ‘Number’ as Chart Type. 204

  205. Select ‘Rename’ from the chart tab menu to change the

    name.
  206. Type the name as ‘Number of Places’.

  207. Copy the Number to create another Number.

  208. Select ‘review_scores_rating’ column for the Value and set Average (Mean)

    as the calculation. 208
  209. Change the name to ‘Average Review Score’.

  210. Go back to the Dashboard Edit UI to add the

    2 numbers.
  211. Click ‘Add Chart’ button.

  212. Select ‘Number of Places’ and click ‘OK’ button.

  213. Select ‘Number of Places’ and click ‘OK’ button.

  214. Move the 2 Numbers to the top by drag-and-drop.

  215. None
  216. Dashboard • Create a new Dashboard • Add Charts and

    Analytics • Add Numbers • Publish & Share 216
  217. Click ‘Publish’ button to publish the Dashboard to Exploratory Cloud.

    217
  218. Publish in Private mode or Public mode? 218

  219. 219 Private Mode Only the ones you explicitly share with

    can open your dashboard. Public Mode Anybody can open your dashboard.
  220. Click ‘Publish’ button to to publish. 220

  221. 221 Once it’s published, an unique URL is assigned to

    the Dashboard. Click ‘Open in Browser’ link to open the Dashboard in the web browser.
  222. You can see the published Dashboard in the web browser.

    222
  223. You can share it with others by clicking on ‘Share’

    button. 223
  224. Share with Invite 224 Type the email and click ‘Share’

    button. This will send an invite email.
  225. 225 Share with Invite The person who is invited can

    log in with her/his Exploratory account and open the Dashboard. If the person doesn’t have an Exploratory account then she/he can create it for FREE. The viewers can continue to view any contents at Exploratory Cloud as long as they are invited to view.
  226. 226 Share with URL You can also share your Dashboard

    with URL. This allows anyone with the URL to open the Dashboard without logging into Exploratory Cloud.
  227. Schedule Dashboard 227 You can schedule the dashboard to keep

    the data always up-to-date by querying against the data sources and applying all the data wrangling steps automatically. Note that you can schedule only the ones with remote data sources that can be accessed by Exploratory Cloud.
  228. Q & A

  229. Information Email kan@exploratory.io Website https://exploratory.io Twitter @KanAugust Training https://exploratory.io/training

  230. EXPLORATORY