Slide 1

Slide 1 text

EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series Data

Slide 2

Slide 2 text

2 EXPLORATORY

Slide 3

Slide 3 text

data-introductio Kan Nishida CEO/co-founder Exploratory Summary Beginning of 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle leading teams for building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Speaker

Slide 4

Slide 4 text

4 Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science. The Third Wave

Slide 5

Slide 5 text

5 Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis Data Science Workflow

Slide 6

Slide 6 text

6 Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis ExploratoryɹModern & Simple UI

Slide 7

Slide 7 text

EXPLORATORY Data Visualization Workshop Part 2 - Visualizing Time Series Data

Slide 8

Slide 8 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 8

Slide 9

Slide 9 text

Sample Data

Slide 10

Slide 10 text

Airbnb New York Data 10

Slide 11

Slide 11 text

Airbnb New York Data 11

Slide 12

Slide 12 text

1. Open Data Catalog 2. Find ‘Airbnb New York’ Data 3. Import the Data Import Data

Slide 13

Slide 13 text

13 Select ‘Data Catalog’ from the Data Frame menu.

Slide 14

Slide 14 text

14 Type ‘airbnb’ to search ‘Airbnb Listing Data for New York City’ data.

Slide 15

Slide 15 text

15 Click the Import button to import the data.

Slide 16

Slide 16 text

16 Click the Save button to save the data.

Slide 17

Slide 17 text

Data is imported into Exploratory. 17

Slide 18

Slide 18 text

• How is the trend of newly added properties at Airbnb over the last few years? • Has it been increasing or decreasing? Questions

Slide 19

Slide 19 text

19 Create a new chart.

Slide 20

Slide 20 text

20 Assign the ‘host_since’ column to X-Axis.

Slide 21

Slide 21 text

21 ‘Number of Rows’ is selected for Y-Axis by default. So, we are looking at the yearly trend of the newly added properties.

Slide 22

Slide 22 text

22 The date is ‘rounded’ by Year by default, but you can change this.

Slide 23

Slide 23 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) Difference, % of Difference, Moving Average • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Trend Line - Line, Smooth 23

Slide 24

Slide 24 text

24 • Round • Extract Date Aggregation Level

Slide 25

Slide 25 text

25 What is ‘Round’? Round with Year

Slide 26

Slide 26 text

26 The ‘round’ function makes any dates within a given year to be the 1st day of the year.

Slide 27

Slide 27 text

27 Any dates in 2019 become 2019-01-01. Any dates in 2020 become 2020-01-01.

Slide 28

Slide 28 text

28 What is ‘Round’? Round with Month

Slide 29

Slide 29 text

29 The ‘round’ with ‘Month’ makes any dates to be the 1st day of the month.

Slide 30

Slide 30 text

30 What is ‘Round’? Round with Week

Slide 31

Slide 31 text

31 The ‘round’ with ‘Week’ makes any dates to be the 1st day of the week.

Slide 32

Slide 32 text

32 The number of newly added properties are increasing till 2015, then the trend went downward till 2018.

Slide 33

Slide 33 text

33 Round with Month We are looking at the monthly trend.

Slide 34

Slide 34 text

34 This 2014-07-01 includes all the dates within the same month (e.g. 2014-07-12).

Slide 35

Slide 35 text

35 We can see micro trends that we couldn’t see at the yearly level.

Slide 36

Slide 36 text

36 Round with Week It starts getting a bit too noisy, harder to get an overall trend. But, we can see some extreme values in some weeks.

Slide 37

Slide 37 text

37 The goal is to understand the trend including global and local. It’s up to you which of the aggregation levels works for you.

Slide 38

Slide 38 text

38 • Round • Extract Date Aggregation Level

Slide 39

Slide 39 text

39 Extract - Month Extract only the month part of date data, truncating the year and the day parts.

Slide 40

Slide 40 text

40 Overall, the number of the newly added properties increases towards to Summer, then drops afterwards.

Slide 41

Slide 41 text

Position Length Angle Slope Size Shape Volume Color Intensity Color Hue Visual Cue 41 Easier to Recognize Harder to Recognize

Slide 42

Slide 42 text

42 Length VS Slope

Slide 43

Slide 43 text

43 Slope With Line chart, it’s easier to understand if the trend is going upward or downward and how steep the slope is.

Slide 44

Slide 44 text

Length 44 With Bar chart, we are comparing the lengths of the bars. Easier to recognize how much a given bar is longer (higher) than the other bars.

Slide 45

Slide 45 text

45 The months can be considered as Categorical, then Bar chart might make it easier to spot a different trend.

Slide 46

Slide 46 text

46 For example, May, Jun, and July are the highest months, and the numbers are relatively low from January to April.

Slide 47

Slide 47 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 47

Slide 48

Slide 48 text

48 Change the chart type back to Line chart. Set the Date aggregation level to Round - Month.

Slide 49

Slide 49 text

49 All the data points shown here are the numbers of the newly added properties at any given time. We can’t see how many properties are listed at Airbnb in total by a given point of time.

Slide 50

Slide 50 text

50 • How is the trend of the number of properties at Airbnb over the years? • Is the trend steadily increasing? • Are there any change points of the trend? Questions:

Slide 51

Slide 51 text

51 We want to see how many properties have been added by any given time and see how the ‘growth’ speed increases or decreases. We can use ‘Cumulative Sum’ (or Running Total) function which can be found under ‘Window Calculation’ menu.

Slide 52

Slide 52 text

52 Select ‘Window Calculation’ from Y-Axis menu.

Slide 53

Slide 53 text

53 Select ‘Cumulative’ for the Calculation Type and keep ‘Sum’ for the Summarize function as is.

Slide 54

Slide 54 text

54 It looks that it increased very rapidly especially from 2013 to 2016, but then it slowed down for a bit after that.

Slide 55

Slide 55 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 55

Slide 56

Slide 56 text

56 • Is the trend same among different regions? • Which regions did contribute the rapid growth? Questions:

Slide 57

Slide 57 text

57 Assign the ‘neighborhood_group’ column to Color to break down the line into multiple lines.

Slide 58

Slide 58 text

58 Manhattan (Green) and Brooklyn (Orange) are the 2 regions that contributed the overall trend.

Slide 59

Slide 59 text

59 Notice that these 2 regions have similar trends, but they started diverging after 2019. 2019

Slide 60

Slide 60 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 60

Slide 61

Slide 61 text

61 Assign the ‘neighborhood’ column to Color.

Slide 62

Slide 62 text

62 When you assign a categorical column with more than 20 unique values, it automatically creates ‘Others’ group.

Slide 63

Slide 63 text

63 You can hide the ‘Others’ group so that you can see the trend for the top 20 neighborhoods better.

Slide 64

Slide 64 text

64 Also, you can adjust how many neighborhoods you want to keep. Let’s type 10 for the ‘Number of Most Frequents’ properties to keep only the top 10.

Slide 65

Slide 65 text

65 We can see that Bedford-Stuyvesant (Blue) and Williamsburg (Light Blue) are the significant top 2. The number of properties started picking up as early as 2011.

Slide 66

Slide 66 text

66 Another interesting thing is that Bedford-Stuyvesant (Blue) catching up with Williamsburg (Light Blue). The growth speed for Williamsburg seems to have settled after 2016.

Slide 67

Slide 67 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 67

Slide 68

Slide 68 text

68 Is the regional trend we saw with Neighborhood Group same among all the property types (e.g. Apartment, House, etc.)?

Slide 69

Slide 69 text

69 Maybe, some neighborhoods are growing rapidly in some particular property types?

Slide 70

Slide 70 text

70 Assign the ‘property_type’ column to Repeat By.

Slide 71

Slide 71 text

71 This creates ‘Others’ group since the ‘property_type’ column has 35 unique values.

Slide 72

Slide 72 text

72 Let’s keep only the 10 most frequent property types.

Slide 73

Slide 73 text

73 We can see the difference among the neighborhoods for Apartment, but can’t for other property types. This is because the data range for Apartment type is much bigger than the others but all the charts are using the same scale that is optimized for the Apartment.

Slide 74

Slide 74 text

74 We can adjust it to have a different Y Axis scale that is optimized for each chart. Click ‘Layout’ from the Repeat By menu.

Slide 75

Slide 75 text

75 Uncheck ‘Sync Y Axis Among Charts’ property so that each chart will have its own Y Axis scale.

Slide 76

Slide 76 text

76 Also, change the ‘Number of Charts per Row’ to 5 so that we will have two rows layout (each row has 5 charts).

Slide 77

Slide 77 text

77 We can see some new insights, for example, many house type properties have been added for Brooklyn and Queens.

Slide 78

Slide 78 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 78

Slide 79

Slide 79 text

79 There is a ‘neighborhood’ column, which is one lower level of geography compared to ‘neighborhood_group’. Let’s assign this column to Color.

Slide 80

Slide 80 text

80 As we have seen before, it creates ‘Others’ group automatically since this column has more than 20 unique values. Now, here is a problem. We can see the trend almost only for this ‘Others’ group, not so much for the top 20 neighborhoods.

Slide 81

Slide 81 text

81 Let’s remove the ‘Others’ group from the chart by unchecking ‘Show Others’ property inside the ‘Others Group Setting’ dialog.

Slide 82

Slide 82 text

82 Now, let’s say we are interested in the trends particularly for Williamsburg and Bedford-Stuyvesant in comparison to others. We can highlight these two neighborhood lines to make them standing out.

Slide 83

Slide 83 text

83 Select ‘Highlight’ from the Color menu.

Slide 84

Slide 84 text

84 Check ‘Enable Highlight’ and select Williamsburg and Bedford-Stuyvesant and assign different colors.

Slide 85

Slide 85 text

85 We can see that these two neighborhoods share a similar trend in the apartment property type but they are different in others such as Condominium, House, Loft, etc.

Slide 86

Slide 86 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 86

Slide 87

Slide 87 text

87 Historical Stock Price Data for Tech Giants Let’s switch to

Slide 88

Slide 88 text

Let’s switch to Stock Price data to demonstrate the next few topics better.

Slide 89

Slide 89 text

89 Select ‘Extension Data’ from the data frame menu.

Slide 90

Slide 90 text

90 Click ‘Import Data’ button for Historic Stock Prices data.

Slide 91

Slide 91 text

91 If you don’t see it, you can install it under ‘Add New’ view.

Slide 92

Slide 92 text

92 Enter the stock symbols for the following 5 companies. Symbol Company Name GOOG Google AMZN Amazon FB Facebook AAPL Apple NFLX NETFLIX

Slide 93

Slide 93 text

93 Enter ‘2010-01-01’ for the Date From to get the data since the date through yesterday, and click ‘Save (or Update)’ button.

Slide 94

Slide 94 text

94 The stock price data has been imported!

Slide 95

Slide 95 text

95 Each row represents each trading day for each company.

Slide 96

Slide 96 text

96 We’ll use the ‘adjusted’ as the stock price data.

Slide 97

Slide 97 text

Adjusted Price?

Slide 98

Slide 98 text

98 Many companies split their stock to multiples in the lifetime of the stock. For example, 1 stock becomes 2 stocks. 1 stock:1000 1stock:500 1stock:500 After the Split

Slide 99

Slide 99 text

99 Date Close 2020-3-1 20,000 2020-3-2 22,000 2020-3-3 11,000 2020-3-4 13,000 If we use the closing price, it looks a huge decrease on the day of the split. Split

Slide 100

Slide 100 text

100 Date Close Adjusted 2020-3-1 20,000 10,000 2020-3-2 22,000 11,000 2020-3-3 12,000 12,000 2020-3-4 13,000 13,000 In order to make it easier to compare the prices before and after the split, they adjust the prices of the dates before the split. Split

Slide 101

Slide 101 text

How is the trend for these tech companies’ stock prices over the years? Questions:

Slide 102

Slide 102 text

Create a Line chart by assigning the date column to X Axis with ‘Round to Week’ option, and assigning the adjusted column to Y-Axis with ‘Mean (Average)’ calculation.

Slide 103

Slide 103 text

Select the symbol column to Color to break down the line for each company.

Slide 104

Slide 104 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 104

Slide 105

Slide 105 text

Sometimes, you want to filter for only the last N years.

Slide 106

Slide 106 text

106 Many ways to filter the date data. For example… • Use an exact date. (e.g. later than 2018-01-01) • Use an exact Year. (e.g. later than 2018, equal to 2020) • Use Summarize function (e.g. equal to the Max(Last) year of this data.) • Use Relative date (e.g. last 3 years, last 3 years excluding this year, this year)

Slide 107

Slide 107 text

107 Exact Date - Date: Later than or equal to 2018-01-01

Slide 108

Slide 108 text

108 Exact Date - Year: Later than or equal to 2019

Slide 109

Slide 109 text

109 Summarize Function - Max: Later than or equal to “the last year of this data”

Slide 110

Slide 110 text

110 Relative Date - Last N Years: Last 1 year from today

Slide 111

Slide 111 text

111 Relative Date - Last N Years Excluding This Year: Last 1 year excluding this year

Slide 112

Slide 112 text

This time, let’s filter for the last 3 years up to today.

Slide 113

Slide 113 text

Click the ‘Filter’ button.

Slide 114

Slide 114 text

• Column: date • Filter Operator: • Date Type: Year • How to Set Value: Last N Years • Value: 3 Create a filter condition of keeping the last 3 years of data.

Slide 115

Slide 115 text

The data is now filtered for the last 3 years.

Slide 116

Slide 116 text

Looks Amazon and Google are the ‘highest’ stocks, others not so much.

Slide 117

Slide 117 text

Is that an appropriate insight?

Slide 118

Slide 118 text

118 Do we really care the stock price of the day? We care how much has the stock price increased or decreased. It’s not so much about the absolute stock prices. We care about the relative values such as 30% increase. Then, what would be the baseline we should compare against?

Slide 119

Slide 119 text

119 • Compared to Previous Period (e.g. Yesterday, Last Year) • Compared to the bottom • Compared to the beginning of this year • Compared to the date I have invested Baseline

Slide 120

Slide 120 text

120 Compared to the first date in the data - How much increased? Difference What is the dollars increased? % Difference What is the percentage increased?

Slide 121

Slide 121 text

Select ‘Window Calculation’ from the Y-Axis menu.

Slide 122

Slide 122 text

Select ‘Difference From’ for the Calculation Type.

Slide 123

Slide 123 text

Select ‘First Value’ for the Difference From. This is the baseline value for each stock.

Slide 124

Slide 124 text

It looks that Amazon has grown a lot. But, we need to keep in mind that Amazon’s stock prices are moving around between $1,000 and $2,000, which is much higher range compared to the others.

Slide 125

Slide 125 text

When you have a stock price of $200, and increased $200, that’s 100% increase, you’ve just doubled your investment! But when you have a stock price of $2,000, and it increased $200, that’s just 10% increase. That’s still great, you didn’t lose the money! ;) But it’s not as great as the previous scenario.

Slide 126

Slide 126 text

Ultimately, what you care about is the percentage increase, not the dollar increase. This is when we want to switch to the ‘% Difference’.

Slide 127

Slide 127 text

Select the ‘% Difference From’ for the Calculation Type. Keep the ‘First Value’ as the baseline.

Slide 128

Slide 128 text

When you look at the growth rate, Amazon and Netflix are moving quite similar. Also, notice that Apple has grown quite a lot.

Slide 129

Slide 129 text

Interestingly enough, Apple, Google, and Facebook were on a similar trajectory up to the summer of 2019, then after that Apple has outperformed the other two.

Slide 130

Slide 130 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 130

Slide 131

Slide 131 text

131 When the values are up and downs it’s harder to see the trend. We often use Moving Average to smooth the curve.

Slide 132

Slide 132 text

Moving Average EBZTNPWJOHBWFSBHF EBZTNPWJOHBWFSBHF

Slide 133

Slide 133 text

Let’s try with Netflix stock data. First, we’ll create a line chart for Netflix stock price data.

Slide 134

Slide 134 text

134 Create a new chart and select Line chart.

Slide 135

Slide 135 text

135 Select ‘date’ column and set the aggregation level to Day. Select the ‘adjusted’ column to Y Axis and set the function to Mean.

Slide 136

Slide 136 text

136 Create a filter with a condition of ‘symbol’ equals to ‘NFLX’.

Slide 137

Slide 137 text

137 Add another filter to keep only the last 3 years of data.

Slide 138

Slide 138 text

138

Slide 139

Slide 139 text

139 Now, let’s add Moving Average line.

Slide 140

Slide 140 text

140 Add the ‘adjusted’ column to the 2nd Y-Axis.

Slide 141

Slide 141 text

141 Select ‘Window Calculation’ from the menu.

Slide 142

Slide 142 text

142 Select ‘Moving Calculation’ for the Calculation Type, and type 25 for the window Size.

Slide 143

Slide 143 text

143 Let’s add one more Moving Average line. This time, 75 days moving average.

Slide 144

Slide 144 text

144 Add the ‘adjusted’ column to the 3rd Y-Axis.

Slide 145

Slide 145 text

145 Select ‘Moving Calculation’ for the Calculation Type, and type 75 for the window Size.

Slide 146

Slide 146 text

Agenda • Date Aggregation Level • Window Calculation - Cumulative Sum (Running Total) • Group by Color • ‘Others’ Group • Repeat By for Multiple Charts • Highlight • Date / Time Sensitive Filtering • Window Calculation - Difference, % of Difference, Moving Average • Trend Line - Line, Smooth 146

Slide 147

Slide 147 text

147 Drawing the moving average is one way to show the overall trend. There are other ways to do.

Slide 148

Slide 148 text

148 Trend line: Linear Regression Line

Slide 149

Slide 149 text

149 Linear Regression is an algorithm to draw a line where the distance between all the data points and the line would be minimal.

Slide 150

Slide 150 text

150 Create a chart that shows the monthly trend of the stock price (adjusted) for each of the companies.

Slide 151

Slide 151 text

151 Select ‘Trend Line’ from the Y-Axis menu.

Slide 152

Slide 152 text

152 Select ‘Linear Regression’ for the Type.

Slide 153

Slide 153 text

153 The trend line for Amazon looks very steep, this is because the prices have increased dramatically. But the trend line doesn’t seem to be capturing the actual trend.

Slide 154

Slide 154 text

154 There are other types of trend lines, which can capture more local movements.

Slide 155

Slide 155 text

155 Trend line: LOESS

Slide 156

Slide 156 text

156 Loess algorithm tries to fit a line by minimizing the distance between the da points and the curve in multiple local sections.

Slide 157

Slide 157 text

157 Linear Regression Loess

Slide 158

Slide 158 text

158 Select ‘Polynomial (Loess)’ for the Trend Line Type.

Slide 159

Slide 159 text

159 Now the trend lines are curves not straight lines, and they are more fitting to the actual data.

Slide 160

Slide 160 text

One More Thing…

Slide 161

Slide 161 text

161 Adjusting Y Axis Scale for Each Company

Slide 162

Slide 162 text

162 We can see the trends (and movements) for Amazon (AMZN) and Google (GOOG) clearly, but not so much for the others.

Slide 163

Slide 163 text

163 This is because the scales for the stock price ranges are very different among the companies.

Slide 164

Slide 164 text

164 We can use Repeat By to separate it to multiple charts so that each company to have its own chart. This will allow us to have a different Y Axis scale that is optimized for each company.

Slide 165

Slide 165 text

165 Assign the ‘symbol’ column to Repeat By to separate the original chart to multiple charts so that each company has its own chart.

Slide 166

Slide 166 text

166 Select the ‘Layout’ from the Repeat By menu.

Slide 167

Slide 167 text

167 Inside the Layout Setting dialog, uncheck ‘Sync Y Axis Among Charts’.

Slide 168

Slide 168 text

168 Now we can see all the stock price trends clearly. Note that each chart has its own Y Axis scale.

Slide 169

Slide 169 text

Q & A

Slide 170

Slide 170 text

Next Seminar

Slide 171

Slide 171 text

EXPLORATORY Data Visualization Workshop Part 3 - Visualizing Variance & Correlation

Slide 172

Slide 172 text

• Part 1 - Basics: Visualizing Summarized Data • Part 2 - Visualizing Time Series Data • Part 3 - Visualizing Variance & Correlation • Part 4 - Visualizing Uncertainty • Part 5 - Data Wrangling for Data Visualization Data Visualization Workshop

Slide 173

Slide 173 text

Information Email [email protected] Website https://exploratory.io Twitter @KanAugust Training https://exploratory.io/training

Slide 174

Slide 174 text

EXPLORATORY 174