Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Exploratory: Data Viz Workshop Part 2 - Visualizing Time Series Data

Exploratory: Data Viz Workshop Part 2 - Visualizing Time Series Data

This is a part of the Data Visualization Workshop. In this seminar, we'll focus on how to visualize Time Series data effectively.

* Aggregating at Different Levels
* Difference, % of Difference, Moving Average
* Multiple Charts with Repeat By
* Highlight
* Relative Time Filter - Last 6 months, YTD, etc.
* Trend line

19fc8f6113c5c3d86e6176362ff29479?s=128

Kan Nishida
PRO

May 07, 2020
Tweet

Transcript

  1. EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series

    Data
  2. 2 EXPLORATORY

  3. data-introductio Kan Nishida CEO/co-founder Exploratory Summary Beginning of 2016, launched

    Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle leading teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Speaker
  4. 4 Data Science is not just for Engineers and Statisticians.

    Exploratory makes it possible for Everyone to do Data Science. The Third Wave
  5. 5 Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics

    / Machine Learning) Data Analysis Data Science Workflow
  6. 6 Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling

    Visualization Analytics (Statistics / Machine Learning) Data Analysis ExploratoryɹModern & Simple UI
  7. EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series

    Data
  8. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 8
  9. Sample Data

  10. Airbnb New York Data 10

  11. Airbnb New York Data 11

  12. 1. Open Data Catalog 2. Find ‘Airbnb New York’ Data

    3. Import the Data Import Data
  13. 13 Select ‘Data Catalog’ from the Data Frame menu.

  14. 14 Type ‘airbnb’ to search ‘Airbnb Listing Data for New

    York City’ data.
  15. 15 Click the Import button to import the data.

  16. 16 Click the Save button to save the data.

  17. Data is imported into Exploratory. 17

  18. • How is the trend of newly added properties at

    Airbnb over the last few years? • Has it been increasing or decreasing? Questions
  19. 19 Create a new chart.

  20. 20 Assign the ‘host_since’ column to X-Axis.

  21. 21 ‘Number of Rows’ is selected for Y-Axis by default.

    So, we are looking at the yearly trend of the newly added properties.
  22. 22 The date is ‘rounded’ by Year by default, but

    you can change this.
  23. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) Difference, % of Difference, Moving Average • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Trend Line - Line, Smooth 23
  24. 24 • Round • Extract Date Aggregation Level

  25. 25 What is ‘Round’? Round with Year

  26. 26 The ‘round’ function makes any dates within a given

    year to be the 1st day of the year.
  27. 27 Any dates in 2019 become 2019-01-01. Any dates in

    2020 become 2020-01-01.
  28. 28 What is ‘Round’? Round with Month

  29. 29 The ‘round’ with ‘Month’ makes any dates to be

    the 1st day of the month.
  30. 30 What is ‘Round’? Round with Week

  31. 31 The ‘round’ with ‘Week’ makes any dates to be

    the 1st day of the week.
  32. 32 The number of newly added properties are increasing till

    2015, then the trend went downward till 2018.
  33. 33 Round with Month We are looking at the monthly

    trend.
  34. 34 This 2014-07-01 includes all the dates within the same

    month (e.g. 2014-07-12).
  35. 35 We can see micro trends that we couldn’t see

    at the yearly level.
  36. 36 Round with Week It starts getting a bit too

    noisy, harder to get an overall trend. But, we can see some extreme values in some weeks.
  37. 37 The goal is to understand the trend including global

    and local. It’s up to you which of the aggregation levels works for you.
  38. 38 • Round • Extract Date Aggregation Level

  39. 39 Extract - Month Extract only the month part of

    date data, truncating the year and the day parts.
  40. 40 Overall, the number of the newly added properties increases

    towards to Summer, then drops afterwards.
  41. Position Length Angle Slope Size Shape Volume Color Intensity Color

    Hue Visual Cue 41 Easier to Recognize Harder to Recognize
  42. 42 Length VS Slope

  43. 43 Slope With Line chart, it’s easier to understand if

    the trend is going upward or downward and how steep the slope is.
  44. Length 44 With Bar chart, we are comparing the lengths

    of the bars. Easier to recognize how much a given bar is longer (higher) than the other bars.
  45. 45 The months can be considered as Categorical, then Bar

    chart might make it easier to spot a different trend.
  46. 46 For example, May, Jun, and July are the highest

    months, and the numbers are relatively low from January to April.
  47. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 47
  48. 48 Change the chart type back to Line chart. Set

    the Date aggregation level to Round - Month.
  49. 49 All the data points shown here are the numbers

    of the newly added properties at any given time. We can’t see how many properties are listed at Airbnb in total by a given point of time.
  50. 50 • How is the trend of the number of

    properties at Airbnb over the years? • Is the trend steadily increasing? • Are there any change points of the trend? Questions:
  51. 51 We want to see how many properties have been

    added by any given time and see how the ‘growth’ speed increases or decreases. We can use ‘Cumulative Sum’ (or Running Total) function which can be found under ‘Window Calculation’ menu.
  52. 52 Select ‘Window Calculation’ from Y-Axis menu.

  53. 53 Select ‘Cumulative’ for the Calculation Type and keep ‘Sum’

    for the Summarize function as is.
  54. 54 It looks that it increased very rapidly especially from

    2013 to 2016, but then it slowed down for a bit after that.
  55. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 55
  56. 56 • Is the trend same among different regions? •

    Which regions did contribute the rapid growth? Questions:
  57. 57 Assign the ‘neighborhood_group’ column to Color to break down

    the line into multiple lines.
  58. 58 Manhattan (Green) and Brooklyn (Orange) are the 2 regions

    that contributed the overall trend.
  59. 59 Notice that these 2 regions have similar trends, but

    they started diverging after 2019. 2019
  60. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 60
  61. 61 Assign the ‘neighborhood’ column to Color.

  62. 62 When you assign a categorical column with more than

    20 unique values, it automatically creates ‘Others’ group.
  63. 63 You can hide the ‘Others’ group so that you

    can see the trend for the top 20 neighborhoods better.
  64. 64 Also, you can adjust how many neighborhoods you want

    to keep. Let’s type 10 for the ‘Number of Most Frequents’ properties to keep only the top 10.
  65. 65 We can see that Bedford-Stuyvesant (Blue) and Williamsburg (Light

    Blue) are the significant top 2. The number of properties started picking up as early as 2011.
  66. 66 Another interesting thing is that Bedford-Stuyvesant (Blue) catching up

    with Williamsburg (Light Blue). The growth speed for Williamsburg seems to have settled after 2016.
  67. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 67
  68. 68 Is the regional trend we saw with Neighborhood Group

    same among all the property types (e.g. Apartment, House, etc.)?
  69. 69 Maybe, some neighborhoods are growing rapidly in some particular

    property types?
  70. 70 Assign the ‘property_type’ column to Repeat By.

  71. 71 This creates ‘Others’ group since the ‘property_type’ column has

    35 unique values.
  72. 72 Let’s keep only the 10 most frequent property types.

  73. 73 We can see the difference among the neighborhoods for

    Apartment, but can’t for other property types. This is because the data range for Apartment type is much bigger than the others but all the charts are using the same scale that is optimized for the Apartment.
  74. 74 We can adjust it to have a different Y

    Axis scale that is optimized for each chart. Click ‘Layout’ from the Repeat By menu.
  75. 75 Uncheck ‘Sync Y Axis Among Charts’ property so that

    each chart will have its own Y Axis scale.
  76. 76 Also, change the ‘Number of Charts per Row’ to

    5 so that we will have two rows layout (each row has 5 charts).
  77. 77 We can see some new insights, for example, many

    house type properties have been added for Brooklyn and Queens.
  78. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 78
  79. 79 There is a ‘neighborhood’ column, which is one lower

    level of geography compared to ‘neighborhood_group’. Let’s assign this column to Color.
  80. 80 As we have seen before, it creates ‘Others’ group

    automatically since this column has more than 20 unique values. Now, here is a problem. We can see the trend almost only for this ‘Others’ group, not so much for the top 20 neighborhoods.
  81. 81 Let’s remove the ‘Others’ group from the chart by

    unchecking ‘Show Others’ property inside the ‘Others Group Setting’ dialog.
  82. 82 Now, let’s say we are interested in the trends

    particularly for Williamsburg and Bedford-Stuyvesant in comparison to others. We can highlight these two neighborhood lines to make them standing out.
  83. 83 Select ‘Highlight’ from the Color menu.

  84. 84 Check ‘Enable Highlight’ and select Williamsburg and Bedford-Stuyvesant and

    assign different colors.
  85. 85 We can see that these two neighborhoods share a

    similar trend in the apartment property type but they are different in others such as Condominium, House, Loft, etc.
  86. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 86
  87. 87 Historical Stock Price Data for Tech Giants Let’s switch

    to
  88. Let’s switch to Stock Price data to demonstrate the next

    few topics better.
  89. 89 Select ‘Extension Data’ from the data frame menu.

  90. 90 Click ‘Import Data’ button for Historic Stock Prices data.

  91. 91 If you don’t see it, you can install it

    under ‘Add New’ view.
  92. 92 Enter the stock symbols for the following 5 companies.

    Symbol Company Name GOOG Google AMZN Amazon FB Facebook AAPL Apple NFLX NETFLIX
  93. 93 Enter ‘2010-01-01’ for the Date From to get the

    data since the date through yesterday, and click ‘Save (or Update)’ button.
  94. 94 The stock price data has been imported!

  95. 95 Each row represents each trading day for each company.

  96. 96 We’ll use the ‘adjusted’ as the stock price data.

  97. Adjusted Price?

  98. 98 Many companies split their stock to multiples in the

    lifetime of the stock. For example, 1 stock becomes 2 stocks. 1 stock:1000 1stock:500 1stock:500 After the Split
  99. 99 Date Close 2020-3-1 20,000 2020-3-2 22,000 2020-3-3 11,000 2020-3-4

    13,000 If we use the closing price, it looks a huge decrease on the day of the split. Split
  100. 100 Date Close Adjusted 2020-3-1 20,000 10,000 2020-3-2 22,000 11,000

    2020-3-3 12,000 12,000 2020-3-4 13,000 13,000 In order to make it easier to compare the prices before and after the split, they adjust the prices of the dates before the split. Split
  101. How is the trend for these tech companies’ stock prices

    over the years? Questions:
  102. Create a Line chart by assigning the date column to

    X Axis with ‘Round to Week’ option, and assigning the adjusted column to Y-Axis with ‘Mean (Average)’ calculation.
  103. Select the symbol column to Color to break down the

    line for each company.
  104. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 104
  105. Sometimes, you want to filter for only the last N

    years.
  106. 106 Many ways to filter the date data. For example…

    • Use an exact date. (e.g. later than 2018-01-01) • Use an exact Year. (e.g. later than 2018, equal to 2020) • Use Summarize function (e.g. equal to the Max(Last) year of this data.) • Use Relative date (e.g. last 3 years, last 3 years excluding this year, this year)
  107. 107 Exact Date - Date: Later than or equal to

    2018-01-01
  108. 108 Exact Date - Year: Later than or equal to

    2019
  109. 109 Summarize Function - Max: Later than or equal to

    “the last year of this data”
  110. 110 Relative Date - Last N Years: Last 1 year

    from today
  111. 111 Relative Date - Last N Years Excluding This Year:

    Last 1 year excluding this year
  112. This time, let’s filter for the last 3 years up

    to today.
  113. Click the ‘Filter’ button.

  114. • Column: date • Filter Operator: <relative dates> • Date

    Type: Year • How to Set Value: Last N Years • Value: 3 Create a filter condition of keeping the last 3 years of data.
  115. The data is now filtered for the last 3 years.

  116. Looks Amazon and Google are the ‘highest’ stocks, others not

    so much.
  117. Is that an appropriate insight?

  118. 118 Do we really care the stock price of the

    day? We care how much has the stock price increased or decreased. It’s not so much about the absolute stock prices. We care about the relative values such as 30% increase. Then, what would be the baseline we should compare against?
  119. 119 • Compared to Previous Period (e.g. Yesterday, Last Year)

    • Compared to the bottom • Compared to the beginning of this year • Compared to the date I have invested Baseline
  120. 120 Compared to the first date in the data -

    How much increased? Difference What is the dollars increased? % Difference What is the percentage increased?
  121. Select ‘Window Calculation’ from the Y-Axis menu.

  122. Select ‘Difference From’ for the Calculation Type.

  123. Select ‘First Value’ for the Difference From. This is the

    baseline value for each stock.
  124. It looks that Amazon has grown a lot. But, we

    need to keep in mind that Amazon’s stock prices are moving around between $1,000 and $2,000, which is much higher range compared to the others.
  125. When you have a stock price of $200, and increased

    $200, that’s 100% increase, you’ve just doubled your investment! But when you have a stock price of $2,000, and it increased $200, that’s just 10% increase. That’s still great, you didn’t lose the money! ;) But it’s not as great as the previous scenario.
  126. Ultimately, what you care about is the percentage increase, not

    the dollar increase. This is when we want to switch to the ‘% Difference’.
  127. Select the ‘% Difference From’ for the Calculation Type. Keep

    the ‘First Value’ as the baseline.
  128. When you look at the growth rate, Amazon and Netflix

    are moving quite similar. Also, notice that Apple has grown quite a lot.
  129. Interestingly enough, Apple, Google, and Facebook were on a similar

    trajectory up to the summer of 2019, then after that Apple has outperformed the other two.
  130. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 130
  131. 131 When the values are up and downs it’s harder

    to see the trend. We often use Moving Average to smooth the curve.
  132. Moving Average EBZTNPWJOHBWFSBHF EBZTNPWJOHBWFSBHF

  133. Let’s try with Netflix stock data. First, we’ll create a

    line chart for Netflix stock price data.
  134. 134 Create a new chart and select Line chart.

  135. 135 Select ‘date’ column and set the aggregation level to

    Day. Select the ‘adjusted’ column to Y Axis and set the function to Mean.
  136. 136 Create a filter with a condition of ‘symbol’ equals

    to ‘NFLX’.
  137. 137 Add another filter to keep only the last 3

    years of data.
  138. 138

  139. 139 Now, let’s add Moving Average line.

  140. 140 Add the ‘adjusted’ column to the 2nd Y-Axis.

  141. 141 Select ‘Window Calculation’ from the menu.

  142. 142 Select ‘Moving Calculation’ for the Calculation Type, and type

    25 for the window Size.
  143. 143 Let’s add one more Moving Average line. This time,

    75 days moving average.
  144. 144 Add the ‘adjusted’ column to the 3rd Y-Axis.

  145. 145 Select ‘Moving Calculation’ for the Calculation Type, and type

    75 for the window Size.
  146. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 146
  147. 147 Drawing the moving average is one way to show

    the overall trend. There are other ways to do.
  148. 148 Trend line: Linear Regression Line

  149. 149 Linear Regression is an algorithm to draw a line

    where the distance between all the data points and the line would be minimal.
  150. 150 Create a chart that shows the monthly trend of

    the stock price (adjusted) for each of the companies.
  151. 151 Select ‘Trend Line’ from the Y-Axis menu.

  152. 152 Select ‘Linear Regression’ for the Type.

  153. 153 The trend line for Amazon looks very steep, this

    is because the prices have increased dramatically. But the trend line doesn’t seem to be capturing the actual trend.
  154. 154 There are other types of trend lines, which can

    capture more local movements.
  155. 155 Trend line: LOESS

  156. 156 Loess algorithm tries to fit a line by minimizing

    the distance between the da points and the curve in multiple local sections.
  157. 157 Linear Regression Loess

  158. 158 Select ‘Polynomial (Loess)’ for the Trend Line Type.

  159. 159 Now the trend lines are curves not straight lines,

    and they are more fitting to the actual data.
  160. One More Thing…

  161. 161 Adjusting Y Axis Scale for Each Company

  162. 162 We can see the trends (and movements) for Amazon

    (AMZN) and Google (GOOG) clearly, but not so much for the others.
  163. 163 This is because the scales for the stock price

    ranges are very different among the companies.
  164. 164 We can use Repeat By to separate it to

    multiple charts so that each company to have its own chart. This will allow us to have a different Y Axis scale that is optimized for each company.
  165. 165 Assign the ‘symbol’ column to Repeat By to separate

    the original chart to multiple charts so that each company has its own chart.
  166. 166 Select the ‘Layout’ from the Repeat By menu.

  167. 167 Inside the Layout Setting dialog, uncheck ‘Sync Y Axis

    Among Charts’.
  168. 168 Now we can see all the stock price trends

    clearly. Note that each chart has its own Y Axis scale.
  169. Q & A

  170. Next Seminar

  171. EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance &

    Correlation
  172. • Part 1 - Basics: Visualizing Summarized Data • Part

    2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop
  173. Information Email kan@exploratory.io Website https://exploratory.io Twitter @KanAugust Training https://exploratory.io/training

  174. EXPLORATORY 174