Slide 1

Slide 1 text

Exploratory Seminar #27 Data Science X

Slide 2

Slide 2 text

2 EXPLORATORY

Slide 3

Slide 3 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 4

Slide 4 text

The Third Wave
Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 5

Slide 5 text

Democratization of Data Science
First Wave (1976): Theme: Monetization; Users: Statisticians
Second Wave (2000): Theme: Commoditization; Users: Data Scientists
Third Wave (2016): Theme: Democratization; Users: Business Users
Also compared per wave: Algorithms, Experience, Tools (Proprietary, Open Source, Programming, UI & Programming, UI & Automation)

Slide 6

Slide 6 text

Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis Data Science Workflow

Slide 7

Slide 7 text

Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis Exploratory: Modern & Simple UI

Slide 8

Slide 8 text

Exploratory Seminar #27 Data Science X

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Google Analytics Data is a Treasure Trove
• The pre-built dashboards on the Google Analytics page are optimized for general purposes; they are not designed to answer the questions you need to answer for your business.
• By downloading the data, visualizing it from various perspectives, wrangling it flexibly, and applying various analytics methods quickly, you can gain deeper insights to answer your own questions.

Slide 12

Slide 12 text

Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis with EXPLORATORY

Slide 13

Slide 13 text

Data Access

Slide 14

Slide 14 text

Import Google Analytics Data
1. Select View
2. Select Period
3. Select Dimensions & Measures
4. Select Segment (Optional)
5. Import

Slide 15

Slide 15 text

Import Data

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Select Account, Property, and View.

Slide 18

Slide 18 text

Select Period.

Slide 19

Slide 19 text

Select Segment

Slide 20

Slide 20 text

Segment

Slide 21

Slide 21 text

Create Segments in Google Analytics

Slide 22

Slide 22 text

Example Segments
• Users who have visited a product page.
• Users who have converted.
• Sessions that came from Google Search (Organic)
• Sessions that came from mobile devices

Slide 23

Slide 23 text

Dimensions & Metrics

Slide 24

Slide 24 text

Select Dimensions & Metrics

Slide 25

Slide 25 text

Dimension Metric

Slide 26

Slide 26 text

Dimensions: The attributes of what you are interested in measuring, such as Landing Page, Country, Device Type, Source, etc.
Metrics: Quantitative measurements of what you are interested in, such as Number of Sessions, Page Views, Bounce Rates, Conversion Rates, etc.

Slide 27

Slide 27 text

Scope

Slide 28

Slide 28 text

Scope
• Scope is the level at which Google Analytics collects data. Google Analytics collects data at several levels and summarizes the data by level.
• Each dimension and measure belongs to a particular level of scope.
• This means that when you mix dimensions and measures that belong to different levels, you will get inaccurate data.

Slide 29

Slide 29 text

[Figure: sequences of page hits: Top Page, Page A, Page D, Page A, Page D, Add to Cart, Purchase, Confirm Page, Top Page, Page A, Page B]

Slide 30

Slide 30 text

User: Each web browser is used as a proxy for a user. This level of data is measured across sessions.
Session: Collected per visit. Covers the activities from the time a given user comes to the site until they exit.
Hit: Collected per action. The data about each action, such as a click.

Slide 31

Slide 31 text

A session can end without a clear exit
• If the same page stays open with no activity for 30 minutes, the session is considered to have ended.
• When the day changes at midnight, the session ends.
• Even within the same 30 minutes and the same web browser, if you revisit the same site from a different source (e.g. Google Search vs. Facebook), a new session starts.
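The three session-ending rules above can be sketched in a few lines of Python. This is a simplified illustration, not how Google Analytics is implemented: the function name `assign_sessions`, the `(timestamp, source)` input shape, and the sample hits are all assumptions made for the example.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # inactivity limit from the slide


def assign_sessions(hits):
    """Give each (timestamp, source) hit of one browser a session number.

    `hits` must already be sorted by timestamp.
    """
    session_ids = []
    current = 0
    prev_time = prev_source = None
    for ts, source in hits:
        if prev_time is None:
            current = 1                              # very first hit
        elif (ts - prev_time > SESSION_TIMEOUT       # 30 minutes of inactivity
              or ts.date() != prev_time.date()       # the day changed
              or source != prev_source):             # arrived from a new source
            current += 1
        session_ids.append(current)
        prev_time, prev_source = ts, source
    return session_ids


# Hypothetical hits from a single browser.
hits = [
    (datetime(2018, 3, 1, 23, 40), "google"),
    (datetime(2018, 3, 1, 23, 50), "google"),    # same session
    (datetime(2018, 3, 2, 0, 5), "google"),      # past midnight: new session
    (datetime(2018, 3, 2, 0, 10), "facebook"),   # new source: new session
    (datetime(2018, 3, 2, 0, 50), "facebook"),   # 40 minute gap: new session
]
session_ids = assign_sessions(hits)
```

With these sample hits, the five hits fall into four sessions, which is why session-level metrics and hit-level metrics count different things.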

Slide 32

Slide 32 text

Select Dimensions & Measures

Slide 33

Slide 33 text

Show only the selected Dimensions & Measures

Slide 34

Slide 34 text

Click the Run button to preview the data.

Slide 35

Slide 35 text

Click the Save button to import the data.

Slide 36

Slide 36 text

It will show the summary information once the data is imported.

Slide 37

Slide 37 text

Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis with EXPLORATORY

Slide 38

Slide 38 text

The Common Challenges for Visualizing GA Data • Want to aggregate data by specific date and time levels. • Want to group the data into multiple groups and compare them. • Too many unique values for dimensions. • Want to visualize the variation and the uncertainty of data rather than the summary values.

Slide 39

Slide 39 text

The Common Challenges for Visualizing GA Data • Want to aggregate data by specific date and time levels. • Want to group the data into multiple groups and compare them. • Too many unique values for dimensions. • Want to visualize the variation and the uncertainty of data rather than the summary values.

Slide 40

Slide 40 text

Round by Day

Slide 41

Slide 41 text

Round by Week

Slide 42

Slide 42 text

Round by Month
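Outside the UI, rounding dates to a coarser level before aggregating can be sketched like this. It is a minimal example with made-up session counts; treating Monday as the start of the week is an assumption, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def floor_week(d):
    """Round a date down to the Monday of its week (assumed week start)."""
    return d - timedelta(days=d.weekday())


def floor_month(d):
    """Round a date down to the first day of its month."""
    return d.replace(day=1)


def aggregate(rows, floor):
    """Sum the values of (date, value) rows after rounding each date."""
    totals = Counter()
    for d, value in rows:
        totals[floor(d)] += value
    return dict(totals)


# Hypothetical daily session counts.
daily = [
    (date(2018, 1, 30), 120),
    (date(2018, 1, 31), 80),
    (date(2018, 2, 1), 100),
    (date(2018, 2, 5), 50),
]
by_week = aggregate(daily, floor_week)
by_month = aggregate(daily, floor_month)
```

The same four rows produce two weekly totals and two monthly totals, but the rows are grouped differently at each level.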

Slide 43

Slide 43 text

What is this drop!?!?

Slide 44

Slide 44 text

This month has only 10 days and is not complete.

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Remove ‘This Month’ Data
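As a rough sketch of the same filter in code: keep only the rows dated before the first day of the current month, so that an incomplete month does not show up as a drop. The function name and sample rows are hypothetical.

```python
from datetime import date


def drop_current_month(rows, today):
    """Keep only (date, value) rows dated before the start of today's month."""
    month_start = today.replace(day=1)
    return [(d, v) for d, v in rows if d < month_start]


# Hypothetical daily rows; "today" is the 10th, so March is incomplete.
rows = [
    (date(2018, 2, 27), 300),
    (date(2018, 2, 28), 310),
    (date(2018, 3, 1), 290),
    (date(2018, 3, 10), 305),
]
complete = drop_current_month(rows, today=date(2018, 3, 10))
```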

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Extract Month

Slide 49

Slide 49 text

Extract Day of Week
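Extracting a date part differs from rounding: it discards the year, so all Mondays (or all Marches) end up in the same group. A minimal sketch with made-up data, where `extract_parts` is a hypothetical helper:

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def extract_parts(d):
    """Return the month number and the day-of-week label of a date."""
    return {"month": d.month, "day_of_week": WEEKDAYS[d.weekday()]}


# Hypothetical daily session counts.
rows = [
    (date(2018, 3, 5), 100),
    (date(2018, 3, 10), 40),
    (date(2018, 3, 12), 120),
    (date(2018, 4, 7), 60),
]

sessions_by_weekday = defaultdict(int)
for d, sessions in rows:
    sessions_by_weekday[extract_parts(d)["day_of_week"]] += sessions
```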

Slide 50

Slide 50 text

Different Level for X-Axis, Color, and Repeat By

Slide 51

Slide 51 text

The Common Challenges for Visualizing GA Data • Want to aggregate data by specific date and time levels. • Want to group the data into multiple groups and compare them. • Too many unique values for dimensions. • Want to visualize the variation and the uncertainty of data rather than the summary values.

Slide 52

Slide 52 text

Break Down with Repeat By

Slide 53

Slide 53 text

The Common Challenges for Visualizing GA Data • Want to aggregate data by specific date and time levels. • Want to group the data into multiple groups and compare them. • Too many unique values for dimensions. • Want to visualize the variation and the uncertainty of data rather than the summary values.

Slide 54

Slide 54 text

Too many unique values for dimensions.

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

A few options to address
• Limit Values - Top 10
• Limit Values - Condition
• Create ‘Other’ Group
• Highlight
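The first three options can be sketched in plain Python as follows. The country counts are invented, and the function names (`top_n`, `more_than`, `lump_other`) are hypothetical; they only illustrate the idea behind each option.

```python
from collections import Counter

# Hypothetical session counts per country.
sessions = Counter({
    "United States": 5200, "Japan": 3100, "India": 1400,
    "France": 900, "Brazil": 450, "Chile": 50,
})


def top_n(counts, n):
    """Limit Values - Top N: keep the n largest categories."""
    return dict(counts.most_common(n))


def more_than(counts, threshold):
    """Limit Values - Condition: keep categories above a threshold."""
    return {k: v for k, v in counts.items() if v > threshold}


def lump_other(counts, n, label="Other"):
    """Create 'Other' Group: keep the top n and sum the rest into one group."""
    kept = top_n(counts, n)
    rest = sum(counts.values()) - sum(kept.values())
    if rest:
        kept[label] = rest
    return kept


top2 = top_n(sessions, 2)
busy = more_than(sessions, 1000)
lumped = lump_other(sessions, 3)
```

Unlike the first two options, the 'Other' group keeps the overall total intact, which matters when the chart shows ratios.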

Slide 57

Slide 57 text

A few options to address
• Limit Values - Top 10
• Limit Values - Condition
• Create ‘Other’ Group
• Highlight

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Limit to Top 30 Countries

Slide 60

Slide 60 text

A few options to address
• Limit Values - Top 10
• Limit Values - Condition
• Create ‘Other’ Group
• Highlight

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Limit values with more than 1,000 sessions.

Slide 65

Slide 65 text

A few options to address
• Limit Values - Top 10
• Limit Values - Condition
• Create ‘Other’ Group
• Highlight

Slide 66

Slide 66 text

Too Many Values (Lines)

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Limit to only the United States and Japan

Slide 70

Slide 70 text

A few options to address
• Limit Values - Top 10
• Limit Values - Condition
• Create ‘Other’ Group
• Highlight

Slide 71

Slide 71 text

Sometimes, having many lines is not a bad thing if you can highlight the ones that matter.

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

Highlight

Slide 74

Slide 74 text

The Common Challenges for Visualizing GA Data • Want to aggregate data by specific date and time levels. • Want to group the data into multiple groups and compare them. • Too many unique values for dimensions. • Want to visualize the variation and the uncertainty of data rather than the summary values.

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis with EXPLORATORY

Slide 82

Slide 82 text

• Text Data Wrangling • Joining • Merging

Slide 83

Slide 83 text

Text data almost always needs Clean Up!

Slide 84

Slide 84 text

• Extract Text • Remove Text • Replace Text • Convert Text - lower case / UPPER CASE / Title Case

Slide 85

Slide 85 text

Text Wrangling - Extract

Slide 86

Slide 86 text

Text Wrangling - Remove

Slide 87

Slide 87 text

Example Visualize Search Parameter Text

Slide 88

Slide 88 text

Landing Page Path

Slide 89

Slide 89 text

If you visualize the path as is…

Slide 90

Slide 90 text

Extract Parameter Value

Slide 91

Slide 91 text

Extract Parameter Value

Slide 92

Slide 92 text

Extract Parameter Value
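For reference, pulling a parameter value out of a landing page path can be sketched with Python's standard `urllib.parse` module. The parameter name `q` and the sample path are assumptions for illustration; the actual parameter in your data may differ.

```python
from urllib.parse import parse_qs, urlsplit


def extract_param(path, name):
    """Return the first value of query parameter `name`, or None if absent."""
    values = parse_qs(urlsplit(path).query).get(name)
    return values[0] if values else None


# Hypothetical landing page path with a search parameter.
path = "/search/?q=%22data+wrangling%22&page=2"
query_text = extract_param(path, "q")
missing = extract_param("/about/", "q")
```

Note that the extracted value is URL-decoded but can still carry double quotes and inconsistent casing, so it usually needs further cleanup.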

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

Clean up the Text!
- Handle multiple tags
- Convert to Title Case
- Remove Text
- Remove double quotes
- Remove Leading/Trailing Spaces
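The cleanup steps listed above can be chained as in the following sketch. The comma separator between tags and the `tag:` prefix being removed are assumptions for the example, as is the function name `clean_tags`.

```python
def clean_tags(raw, drop_words=("tag:",)):
    """Split a raw parameter value into a list of cleaned tags."""
    tags = []
    for part in raw.split(","):            # handle multiple tags
        text = part.replace('"', "")       # remove double quotes
        for word in drop_words:            # remove unwanted text
            text = text.replace(word, "")
        text = text.strip().title()        # trim spaces, convert to Title Case
        if text:                           # skip empty leftovers
            tags.append(text)
    return tags


cleaned = clean_tags('"data wrangling", tag:VISUALIZATION ,  ')
```

After this step, values that differed only by quotes, casing, or stray spaces collapse into the same tag, which is what makes a word cloud readable.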

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

Visualize it with Word Cloud

Slide 101

Slide 101 text

• Text Data Wrangling • Joining • Merging

Slide 102

Slide 102 text

Engagement

Slide 103

Slide 103 text

Goal • Improve the Conversion Rate • Reduce the Churn Rate

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

Lagging Indicator
• Metrics to Confirm
• It’s too late when you find out
Leading Indicator
• Metrics to Predict
• You can take actions when you find out

Slide 106

Slide 106 text

Lagging Indicator
• Conversion Rate
• Churn Rate
Leading Indicator
• Engagement

Slide 107

Slide 107 text

Goal: Want to improve the engagement Challenge: How can we measure the engagement?

Slide 108

Slide 108 text

DAU

Slide 109

Slide 109 text

MAU

Slide 110

Slide 110 text

Activity ≠ Engagement
• Active User metrics (DAU, MAU, etc.) are not good indicators of engagement.
• As the business grows, they tend to grow regardless of whether users are more engaged or not.
• Sales and Marketing can improve the activity, but not necessarily the engagement.

Slide 111

Slide 111 text

An Engagement Metric: DAU / MAU
• Measures how often the same users visit the site or use the service.
• It was popularized by Facebook, which used it to grow its user base at an early stage.

Slide 112

Slide 112 text

DAU / MAU

Slide 113

Slide 113 text

How can we do this with Google Analytics data?

Slide 114

Slide 114 text

We can use ‘1 Day Active Users (User)’ and ‘30 Day Active Users (User)’.

Slide 115

Slide 115 text

And calculate inside Exploratory: 1 Day Active Users / 30 Day Active Users

Slide 116

Slide 116 text

However…, there is one problem.

Slide 117

Slide 117 text

Due to the scope limitation, we can’t get both metrics in the same query, because they are aggregated at different levels (1 day vs. 30 days).

Slide 118

Slide 118 text

So, we can get the two metrics separately. Steps: Get DAU, Get MAU

Slide 119

Slide 119 text

Then, join them together later. Steps: Get DAU, Get MAU, Join

Slide 120

Slide 120 text

Join
Steps: Get DAU, Get MAU, Join

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

Steps: Get DAU, Get MAU, Join, Calculate DAU / MAU

Slide 123

Slide 123 text

Create Calculation
Steps: Get DAU, Get MAU, Join, Calculate DAU / MAU
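The join-then-calculate steps can be sketched as below, assuming the two queries return one value per date. The numbers are invented, an inner join on date is an assumption (the slides do not name the join type), and `join_and_ratio` is a hypothetical helper.

```python
from datetime import date

# Hypothetical results of the two separate queries, keyed by date.
dau = {date(2018, 3, 1): 120, date(2018, 3, 2): 150, date(2018, 3, 3): 90}
mau = {date(2018, 3, 1): 1000, date(2018, 3, 2): 1000}


def join_and_ratio(dau, mau):
    """Inner-join the two metrics on date, then compute DAU / MAU."""
    ratios = {}
    for day in sorted(dau.keys() & mau.keys()):   # dates present in both
        if mau[day]:                              # skip zero to avoid dividing by 0
            ratios[day] = dau[day] / mau[day]
    return ratios


engagement = join_and_ratio(dau, mau)
```

Dates missing from either query (March 3 here) drop out of the result, so it is worth checking that both imports cover the same period.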

Slide 124

Slide 124 text

DAU / MAU

Slide 125

Slide 125 text

Visualize the Engagement Trend

Slide 126

Slide 126 text

Add a ‘Smooth Curve’ trendline with LOESS method.

Slide 127

Slide 127 text

• Text Data Wrangling • Joining • Merging

Slide 128

Slide 128 text

Import Data by specifying the Segment

Slide 129

Slide 129 text

Data frame 1:
Name          Page View
Bootcamp      15
Bootcamp      100
Bootcamp      20
Data frame 2:
Name          Page View
Not Bootcamp  20
Not Bootcamp  95
Not Bootcamp  30
Merge result:
Name          Page View
Bootcamp      15
Bootcamp      100
Bootcamp      20
Not Bootcamp  20
Not Bootcamp  95
Not Bootcamp  30

Slide 130

Slide 130 text

Merge two data frames.
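Merging here means stacking the rows of two data frames that share the same columns. A minimal sketch, representing each data frame as a list of dictionaries with the Name and Page View columns from the example; `merge_rows` is a hypothetical helper, not an Exploratory function.

```python
def merge_rows(*frames):
    """Stack data frames (lists of dict rows) that share the same columns."""
    columns = None
    merged = []
    for frame in frames:
        for row in frame:
            if columns is None:
                columns = set(row)            # remember the first row's columns
            elif set(row) != columns:
                raise ValueError(f"column mismatch: {sorted(row)}")
            merged.append(dict(row))          # copy so inputs stay untouched
    return merged


bootcamp = [
    {"Name": "Bootcamp", "Page View": 15},
    {"Name": "Bootcamp", "Page View": 100},
    {"Name": "Bootcamp", "Page View": 20},
]
not_bootcamp = [
    {"Name": "Not Bootcamp", "Page View": 20},
    {"Name": "Not Bootcamp", "Page View": 95},
    {"Name": "Not Bootcamp", "Page View": 30},
]
merged = merge_rows(bootcamp, not_bootcamp)
```

The Name column is what lets you compare the two segments afterwards, for example by assigning it to Color in a chart.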

Slide 131

Slide 131 text

Data Reproducibility

Slide 132

Slide 132 text

Data gets updated daily (or hourly). Can we automate the data wrangling steps?

Slide 133

Slide 133 text

Click the Re-Import button to download the latest data from GA.

Slide 134

Slide 134 text

Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Analysis with EXPLORATORY

Slide 135

Slide 135 text

Analytics
• Statistical Modeling
• Prediction with Machine Learning Models
• Time Series Forecasting
• Causal Impact Analysis
• Clustering

Slide 136

Slide 136 text

Analytics
• Statistical Modeling
• Prediction with Machine Learning Models
• Time Series Forecasting
• Causal Impact Analysis
• Clustering

Slide 137

Slide 137 text

Time Series Forecasting with Prophet

Slide 138

Slide 138 text

Challenge
• We want to find out how to prepare our web service infrastructure in order to keep the current performance level.
• Options include the number of servers, whether to move to higher-spec machines, and whether to create regional (Japan, France, etc.) servers.
• Preparing more servers and higher-spec machines will increase the cost.
• We want to minimize the cost of adding more servers, but at the same time it will damage our business if we are not ready for higher demand.

Slide 139

Slide 139 text

Challenge
• We want to forecast the page accesses for the next few months.
• Based on the forecast, we can plan to add more servers or reduce the number of servers.
• If we can forecast by region, then we can allocate an adequate number of machines to the areas where we expect higher demand.

Slide 140

Slide 140 text

Visualizing Page View Trend

Slide 141

Slide 141 text

Where will our page views be over the next few months?

Slide 142

Slide 142 text

Time Series Forecasting with Prophet

Slide 143

Slide 143 text

Prophet
• A ‘curve fitting’ algorithm for building time series forecasting models.
• Designed for ease of use without expert knowledge of time series forecasting or statistics.
• Built by Data Scientists (Sean J. Taylor & co.) at Facebook and open sourced. (https://facebook.github.io/prophet)
Sean J. Taylor @seanjtaylor

Slide 144

Slide 144 text

Prophet - Additive Model
Build a model by finding the best smooth line, which can be represented as the sum of the following components.
• Overall growth trend
• Seasonality - Yearly, Weekly, Daily, etc.
• Holiday effects - X’mas, New Year, July 4th, etc.
• External Predictors
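To make the "sum of components" idea concrete, here is a toy additive model with only two components: a straight-line trend and a weekly seasonality. This is not Prophet and does not use its API. Prophet fits a more flexible trend with change points and smooth seasonality curves, while this sketch swaps in a least-squares line and per-weekday averages. All names and data below are made up for illustration.

```python
def fit_additive(values, period=7):
    """Fit y[t] ~ a + b*t + seasonal[t % period] in two simple passes."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Pass 1: least-squares straight line as the growth trend.
    b = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
         / sum((t - t_mean) ** 2 for t in range(n)))
    a = y_mean - b * t_mean
    # Pass 2: seasonality = average leftover for each position in the period.
    residuals = [y - (a + b * t) for t, y in enumerate(values)]
    seasonal = [sum(residuals[k::period]) / len(residuals[k::period])
                for k in range(period)]
    return a, b, seasonal


def forecast(model, start, steps, period=7):
    """Forecast by adding the trend and the seasonal component together."""
    a, b, seasonal = model
    return [a + b * t + seasonal[t % period]
            for t in range(start, start + steps)]


# Synthetic daily page views: linear growth plus a weekly pattern.
weekly_pattern = [4, -1, -3, 0, -3, -1, 4]
history = [100 + 2 * t + weekly_pattern[t % 7] for t in range(28)]

model = fit_additive(history)
next_two_days = forecast(model, start=len(history), steps=2)
```

Fitting the trend first and the seasonality second is only approximate in general; it recovers this synthetic series cleanly because the weekly pattern was chosen to be symmetric. It is meant to show how the components add up, not to replace a proper forecasting model.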

Slide 145

Slide 145 text

Weekly Sales Trend

Slide 146

Slide 146 text

Build a Forecasting Model with Prophet

Slide 147

Slide 147 text

Assign Order Date to Date/Time and Sales to Value.

Slide 148

Slide 148 text

The blue line is the actual data (Sales), and the orange line is the forecasted data.

Slide 149

Slide 149 text

The last area is the forecasted period where there is no actual data.

Slide 150

Slide 150 text

The default forecasting period is 10 units; in this case, that is 10 weeks.

Slide 151

Slide 151 text

Set the Forecasting Period to 52 weeks.

Slide 152

Slide 152 text

Under the Trend tab, you can see the overall trend that is used by the model. The blue line is the actual (Sales) data, and the green line is the trend.

Slide 153

Slide 153 text

The vertical light green bars are Change Points where the trend changes.

Slide 154

Slide 154 text

Under the Yearly tab you can see the Yearly Seasonality.

Slide 155

Slide 155 text

Every year, sales don’t pick up until June, and then they go down in July.

Slide 156

Slide 156 text

Google Analytics with Prophet
• Get the Daily Page View data from Google Analytics.
• Run a time series forecasting model with Prophet under the Analytics view.

Slide 157

Slide 157 text

No content

Slide 158

Slide 158 text

Forecasting for the next 6 months

Slide 159

Slide 159 text

No content

Slide 160

Slide 160 text

Forecasting for the next 12 months

Slide 161

Slide 161 text

No content

Slide 162

Slide 162 text

Yearly Seasonality

Slide 163

Slide 163 text

Trend

Slide 164

Slide 164 text

with Repeat By

Slide 165

Slide 165 text

with Repeat By

Slide 166

Slide 166 text

Q & A

Slide 167

Slide 167 text

Information
Email: [email protected]
Website: https://exploratory.io
Twitter: @KanAugust
Training: https://exploratory.io/training

Slide 168

Slide 168 text

EXPLORATORY