Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Exploratory: Data Viz Workshop Part 2 - Visuali...

Exploratory: Data Viz Workshop Part 2 - Visualizing Time Series Data

This is a part of the Data Visualization Workshop. In this seminar, we'll focus on how to visualize Time Series data effectively.

* Aggregating at Different Levels
* Difference, % of Difference, Moving Average
* Multiple Charts with Repeat By
* Highlight
* Relative Time Filter - Last 6 months, YTD, etc.
* Trend line

Kan Nishida

May 07, 2020
Tweet

More Decks by Kan Nishida

Other Decks in Technology

Transcript

  1. data-introductio Kan Nishida CEO/co-founder Exploratory Summary Beginning of 2016, launched

    Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle leading teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Speaker
  2. 4 Data Science is not just for Engineers and Statisticians.

    Exploratory makes it possible for Everyone to do Data Science. The Third Wave
  3. 5 Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics

    / Machine Learning) Data Analysis Data Science Workflow
  4. 6 Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling

    Visualization Analytics (Statistics / Machine Learning) Data Analysis ExploratoryɹModern & Simple UI
  5. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 8
  6. • How is the trend of newly added properties at

    Airbnb over the last few years? • Has it been increasing or decreasing? Questions
  7. 21 ‘Number of Rows’ is selected for Y-Axis by default.

    So, we are looking at the yearly trend of the newly added properties.
  8. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) Difference, % of Difference, Moving Average • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Trend Line - Line, Smooth 23
  9. 32 The number of newly added properties are increasing till

    2015, then the trend went downward till 2018.
  10. 36 Round with Week It starts getting a bit too

    noisy, harder to get an overall trend. But, we can see some extreme values in some weeks.
  11. 37 The goal is to understand the trend including global

    and local. It’s up to you which of the aggregation levels works for you.
  12. 39 Extract - Month Extract only the month part of

    date data, truncating the year and the day parts.
  13. 40 Overall, the number of the newly added properties increases

    towards to Summer, then drops afterwards.
  14. Position Length Angle Slope Size Shape Volume Color Intensity Color

    Hue Visual Cue 41 Easier to Recognize Harder to Recognize
  15. 43 Slope With Line chart, it’s easier to understand if

    the trend is going upward or downward and how steep the slope is.
  16. Length 44 With Bar chart, we are comparing the lengths

    of the bars. Easier to recognize how much a given bar is longer (higher) than the other bars.
  17. 45 The months can be considered as Categorical, then Bar

    chart might make it easier to spot a different trend.
  18. 46 For example, May, Jun, and July are the highest

    months, and the numbers are relatively low from January to April.
  19. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 47
  20. 48 Change the chart type back to Line chart. Set

    the Date aggregation level to Round - Month.
  21. 49 All the data points shown here are the numbers

    of the newly added properties at any given time. We can’t see how many properties are listed at Airbnb in total by a given point of time.
  22. 50 • How is the trend of the number of

    properties at Airbnb over the years? • Is the trend steadily increasing? • Are there any change points of the trend? Questions:
  23. 51 We want to see how many properties have been

    added by any given time and see how the ‘growth’ speed increases or decreases. We can use ‘Cumulative Sum’ (or Running Total) function which can be found under ‘Window Calculation’ menu.
  24. 54 It looks that it increased very rapidly especially from

    2013 to 2016, but then it slowed down for a bit after that.
  25. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 55
  26. 56 • Is the trend same among different regions? •

    Which regions did contribute the rapid growth? Questions:
  27. 59 Notice that these 2 regions have similar trends, but

    they started diverging after 2019. 2019
  28. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 60
  29. 62 When you assign a categorical column with more than

    20 unique values, it automatically creates ‘Others’ group.
  30. 63 You can hide the ‘Others’ group so that you

    can see the trend for the top 20 neighborhoods better.
  31. 64 Also, you can adjust how many neighborhoods you want

    to keep. Let’s type 10 for the ‘Number of Most Frequents’ properties to keep only the top 10.
  32. 65 We can see that Bedford-Stuyvesant (Blue) and Williamsburg (Light

    Blue) are the significant top 2. The number of properties started picking up as early as 2011.
  33. 66 Another interesting thing is that Bedford-Stuyvesant (Blue) catching up

    with Williamsburg (Light Blue). The growth speed for Williamsburg seems to have settled after 2016.
  34. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 67
  35. 68 Is the regional trend we saw with Neighborhood Group

    same among all the property types (e.g. Apartment, House, etc.)?
  36. 73 We can see the difference among the neighborhoods for

    Apartment, but can’t for other property types. This is because the data range for Apartment type is much bigger than the others but all the charts are using the same scale that is optimized for the Apartment.
  37. 74 We can adjust it to have a different Y

    Axis scale that is optimized for each chart. Click ‘Layout’ from the Repeat By menu.
  38. 75 Uncheck ‘Sync Y Axis Among Charts’ property so that

    each chart will have its own Y Axis scale.
  39. 76 Also, change the ‘Number of Charts per Row’ to

    5 so that we will have two rows layout (each row has 5 charts).
  40. 77 We can see some new insights, for example, many

    house type properties have been added for Brooklyn and Queens.
  41. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 78
  42. 79 There is a ‘neighborhood’ column, which is one lower

    level of geography compared to ‘neighborhood_group’. Let’s assign this column to Color.
  43. 80 As we have seen before, it creates ‘Others’ group

    automatically since this column has more than 20 unique values. Now, here is a problem. We can see the trend almost only for this ‘Others’ group, not so much for the top 20 neighborhoods.
  44. 81 Let’s remove the ‘Others’ group from the chart by

    unchecking ‘Show Others’ property inside the ‘Others Group Setting’ dialog.
  45. 82 Now, let’s say we are interested in the trends

    particularly for Williamsburg and Bedford-Stuyvesant in comparison to others. We can highlight these two neighborhood lines to make them standing out.
  46. 85 We can see that these two neighborhoods share a

    similar trend in the apartment property type but they are different in others such as Condominium, House, Loft, etc.
  47. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 86
  48. 92 Enter the stock symbols for the following 5 companies.

    Symbol Company Name GOOG Google AMZN Amazon FB Facebook AAPL Apple NFLX NETFLIX
  49. 93 Enter ‘2010-01-01’ for the Date From to get the

    data since the date through yesterday, and click ‘Save (or Update)’ button.
  50. 98 Many companies split their stock to multiples in the

    lifetime of the stock. For example, 1 stock becomes 2 stocks. 1 stock:1000 1stock:500 1stock:500 After the Split
  51. 99 Date Close 2020-3-1 20,000 2020-3-2 22,000 2020-3-3 11,000 2020-3-4

    13,000 If we use the closing price, it looks a huge decrease on the day of the split. Split
  52. 100 Date Close Adjusted 2020-3-1 20,000 10,000 2020-3-2 22,000 11,000

    2020-3-3 12,000 12,000 2020-3-4 13,000 13,000 In order to make it easier to compare the prices before and after the split, they adjust the prices of the dates before the split. Split
  53. Create a Line chart by assigning the date column to

    X Axis with ‘Round to Week’ option, and assigning the adjusted column to Y-Axis with ‘Mean (Average)’ calculation.
  54. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 104
  55. 106 Many ways to filter the date data. For example…

    • Use an exact date. (e.g. later than 2018-01-01) • Use an exact Year. (e.g. later than 2018, equal to 2020) • Use Summarize function (e.g. equal to the Max(Last) year of this data.) • Use Relative date (e.g. last 3 years, last 3 years excluding this year, this year)
  56. 109 Summarize Function - Max: Later than or equal to

    “the last year of this data”
  57. • Column: date • Filter Operator: <relative dates> • Date

    Type: Year • How to Set Value: Last N Years • Value: 3 Create a filter condition of keeping the last 3 years of data.
  58. 118 Do we really care the stock price of the

    day? We care how much has the stock price increased or decreased. It’s not so much about the absolute stock prices. We care about the relative values such as 30% increase. Then, what would be the baseline we should compare against?
  59. 119 • Compared to Previous Period (e.g. Yesterday, Last Year)

    • Compared to the bottom • Compared to the beginning of this year • Compared to the date I have invested Baseline
  60. 120 Compared to the first date in the data -

    How much increased? Difference What is the dollars increased? % Difference What is the percentage increased?
  61. It looks that Amazon has grown a lot. But, we

    need to keep in mind that Amazon’s stock prices are moving around between $1,000 and $2,000, which is much higher range compared to the others.
  62. When you have a stock price of $200, and increased

    $200, that’s 100% increase, you’ve just doubled your investment! But when you have a stock price of $2,000, and it increased $200, that’s just 10% increase. That’s still great, you didn’t lose the money! ;) But it’s not as great as the previous scenario.
  63. Ultimately, what you care about is the percentage increase, not

    the dollar increase. This is when we want to switch to the ‘% Difference’.
  64. When you look at the growth rate, Amazon and Netflix

    are moving quite similar. Also, notice that Apple has grown quite a lot.
  65. Interestingly enough, Apple, Google, and Facebook were on a similar

    trajectory up to the summer of 2019, then after that Apple has outperformed the other two.
  66. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 130
  67. 131 When the values are up and downs it’s harder

    to see the trend. We often use Moving Average to smooth the curve.
  68. Let’s try with Netflix stock data. First, we’ll create a

    line chart for Netflix stock price data.
  69. 135 Select ‘date’ column and set the aggregation level to

    Day. Select the ‘adjusted’ column to Y Axis and set the function to Mean.
  70. 138

  71. Agenda • Date Aggregation Level • Window Calculation - Cumulative

    Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 146
  72. 147 Drawing the moving average is one way to show

    the overall trend. There are other ways to do.
  73. 149 Linear Regression is an algorithm to draw a line

    where the distance between all the data points and the line would be minimal.
  74. 150 Create a chart that shows the monthly trend of

    the stock price (adjusted) for each of the companies.
  75. 153 The trend line for Amazon looks very steep, this

    is because the prices have increased dramatically. But the trend line doesn’t seem to be capturing the actual trend.
  76. 156 Loess algorithm tries to fit a line by minimizing

    the distance between the da points and the curve in multiple local sections.
  77. 159 Now the trend lines are curves not straight lines,

    and they are more fitting to the actual data.
  78. 162 We can see the trends (and movements) for Amazon

    (AMZN) and Google (GOOG) clearly, but not so much for the others.
  79. 163 This is because the scales for the stock price

    ranges are very different among the companies.
  80. 164 We can use Repeat By to separate it to

    multiple charts so that each company to have its own chart. This will allow us to have a different Y Axis scale that is optimized for each company.
  81. 165 Assign the ‘symbol’ column to Repeat By to separate

    the original chart to multiple charts so that each company has its own chart.
  82. 168 Now we can see all the stock price trends

    clearly. Note that each chart has its own Y Axis scale.
  83. • Part 1 - Basics: Visualizing Summarized Data • Part

    2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop