Exploratory: An Introduction to Cohort Analysis

EXPLORATORY

Instructor: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)

Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Mission: Make Data Science Available for Everyone

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

The Third Wave

[Figure: Democratization of Data Science, shown as three waves (figure labels: Algorithms, Experience, Tools).
• First Wave (1976): Proprietary, Programming. Theme: Monetization. Users: Statisticians.
• Second Wave (2000): Open Source, Programming. Theme: Commoditization. Users: Data Scientists.
• Third Wave (2016): Open Source, UI & Automation. Theme: Democratization. Users: Business Users.]

[Figure: Exploratory Data Analysis cycle: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides).]

Exploratory Seminar: Cohort Analysis

Agenda
• Introduction to Survival Analysis and Survival Curve
• Introduction to Cohort Analysis
• How to Understand the Insights

It’s not IF, it’s WHEN.

Certain types of events will eventually happen.

Events
• Cancer Patients Die
• Cars Break
• Employees Quit
• Customers Churn

• We are not interested in whether the event will happen or not.
• We are interested in how long it takes for the event to happen.

Time to Event
• Time to patients dying
• Time to cars breaking
• Time to employees quitting
• Time to users churning

Survival Analysis
• An analysis of how variables are associated with ‘time to an event’.
• Not an analysis of the probability of an event occurring.

Survival Analysis is NOT about the probability of whether an event happens.

Survival Curve

Example: Customers for a Startup

[Figure: timelines from Jan to Apr for four customers (Jessica, Tien, Nancy, Victor); two of them are marked Cancelled.]

[Figure: the same four customers shown by month of tenure, from 1st Month to 4th Month.]

[Figure: survival rates for Jessica, Tien, Nancy, and Victor, built up month by month.]

Month        Survival Rate of Each Period    Survival Rate through the period
1st Month    1                               1
2nd Month    0.75                            1*0.75 = 0.75
3rd Month    0.66                            0.75*0.66 = 0.5
4th Month    1                               0.5*1 = 0.5
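The running multiplication in this example can be sketched in a few lines of Python. This is only an illustration: the variable names are made up, and the fractions assume the monthly rates come from 4, 3, and 2 remaining customers, so that the 0.66 on the slide is 2/3 rounded.

```python
from fractions import Fraction
from itertools import accumulate
from operator import mul

# Survival rate of each month, 1st to 4th.
# Assumed customer counts behind the slide values: 4/4, 3/4, 2/3, 2/2.
period_rates = [Fraction(4, 4), Fraction(3, 4), Fraction(2, 3), Fraction(2, 2)]

# Survival rate through each month is the running product of the monthly rates.
through_rates = list(accumulate(period_rates, mul))

for month, (rate, through) in enumerate(zip(period_rates, through_rates), start=1):
    print(f"month {month}: period rate {float(rate):.2f}, "
          f"through rate {float(through):.2f}")
```

Using exact fractions avoids the small rounding gap you get when multiplying 0.75 by the rounded value 0.66.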

In Reality…

Customers sign up at different times. [Figure: customer timelines from Jan to Apr.]

Align all customers based on how many months they have been in. [Figure: timelines aligned from 1st Month to 4th Month.]

We don’t know if the recently joined customers will quit or not.

Label them as Censored!

Survival Rate of Each Period (1st to 4th Month): 1, 0.75, 0.66, 1

Survival Rate through the period (1st to 4th Month): 1, 1*0.75 = 0.75, 0.75*0.66 = 0.5, 0.5*1 = 0.5

Kaplan-Meier Algorithm

Kaplan-Meier Estimator
• Estimate the probability of a customer canceling in a given time period by dividing the number of customers who canceled in that period by the total number of customers remaining in that period.
• By cumulatively multiplying the survival probabilities of each period (one minus the cancellation probability), we can calculate the probability of survival from the beginning through that point in time.
• Instead of the ratio of actual survivors, we can estimate the survival curve as a probability.
• Censored observations are not counted as ‘canceled’; we treat them properly by reducing the number of remaining customers.
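The steps above can be sketched as a small Python function. This is a minimal illustration and not the code Exploratory runs: the function name `kaplan_meier` and the tuple layout are assumptions, and it follows the usual convention that a customer censored at time t still counts as at risk at time t.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return rows of (time, n_risk, n_event, n_censor, estimate).

    durations: time each customer was observed (e.g. months since signup)
    events:    True if the customer canceled, False if censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    censor_counts = Counter(t for t, e in zip(durations, events) if not e)

    n_risk = len(durations)   # customers still being observed
    survival = 1.0            # survival probability through the current time
    rows = []
    for t in sorted(set(durations)):
        n_event = event_counts[t]
        n_censor = censor_counts[t]
        # Multiply by the share of at-risk customers who did not cancel at t.
        survival *= (n_risk - n_event) / n_risk
        rows.append((t, n_risk, n_event, n_censor, survival))
        # Canceled and censored customers both leave the at-risk group.
        n_risk -= n_event + n_censor
    return rows

# Hypothetical data: six customers, two cancellations, four censored.
durations = [2, 3, 4, 4, 1, 2]
events = [True, True, False, False, False, False]
rows = kaplan_meier(durations, events)
for row in rows:
    print(row)
```

Censored customers never lower the estimate directly; they only shrink the denominator used for later periods.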

Let’s try!

User Access Data

Input data for Survival Analysis
• Each row represents one observation (e.g. Customer).
• Each row has at least two pieces of information:
  • Start Date and End Date for Life Time
  • Event Status: e.g. Canceled, Quit, Died, etc.
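As a rough sketch of what this input looks like, the Python below turns one row into a duration and an event flag. The function name, the `observed_until` cutoff, and the sample dates are all assumptions made for illustration; a customer with no end date is treated as still active on the cutoff date.

```python
from datetime import date

def to_duration_and_event(start_date, end_date, canceled, observed_until):
    """Convert one observation into (days observed, event happened?)."""
    # Customers without an end date are observed up to the cutoff date.
    last_seen = end_date if end_date is not None else observed_until
    days = (last_seen - start_date).days
    return days, bool(canceled)

cutoff = date(2016, 4, 30)  # hypothetical last day covered by the data

# Hypothetical rows: (start date, end date, canceled?)
customers = [
    (date(2016, 1, 1), date(2016, 1, 31), True),   # canceled after 30 days
    (date(2016, 4, 1), None, False),               # still active: censored
]
prepared = [to_duration_and_event(s, e, c, cutoff) for s, e, c in customers]
print(prepared)
```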

Each row represents one observation

Start and End Date for Life Time

Event Status

Run Survival Analysis

Select Survival Analysis, assign the columns as below, and Run!

The “Data” tab shows more detailed raw data.

Survival Analysis Data
• time - (e.g. days since sign up) In survival data, one row represents one point in time.
• n_risk - Number of observations to which the event hasn’t happened yet at a given time.
• n_event - Number of observations to which the event happened at a given time.
• n_censor - Number of observations censored at a given time.
• estimate - Survival rate through this point in time. These are the values visualized as the Survival Curve.
• std_error - Standard error of the estimate.
• conf_low - Lower limit of the confidence interval for the estimate.
• conf_high - Upper limit of the confidence interval for the estimate.

Survival Rates through the day

         n_risk    n_event    n_censor    estimate
Day 1    10        1          1           0.9
Day 2    8         2          2           0.675
Day 3    4         1          1           0.50625
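To check how the columns relate, the sketch below recomputes the estimate column and the n_risk column from the event and censor counts in the table. The list names are illustrative; the arithmetic is the Kaplan-Meier step: multiply the running survival rate by (n_risk - n_event) / n_risk on each day.

```python
# Counts copied from the table: Day 1, Day 2, Day 3.
n_risk = [10, 8, 4]
n_event = [1, 2, 1]
n_censor = [1, 2, 1]

# Survival rate through each day: running product of the daily survival rates.
estimates = []
survival = 1.0
for at_risk, events in zip(n_risk, n_event):
    survival *= (at_risk - events) / at_risk
    estimates.append(survival)

# Each day's n_risk is the previous n_risk minus the previous day's
# events and censored observations.
derived_n_risk = [n_risk[0]]
for events, censored in zip(n_event[:-1], n_censor[:-1]):
    derived_n_risk.append(derived_n_risk[-1] - events - censored)

for day, (at_risk, estimate) in enumerate(zip(derived_n_risk, estimates), start=1):
    print(f"Day {day}: n_risk={at_risk}, estimate={estimate:.5f}")
```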

Cohort Analysis

Cohort Analysis
• The term originated in epidemiology.
• It is an analysis of the Survival Rates of multiple groups.
• Each group is called a Cohort.

Cohort Analysis

Let’s try comparing the survival rates of two groups, Mac
vs. Windows.

Assign os column to Color By and click Run!

• Under the Summary tab, you can find the result of a statistical test of whether the difference in OS type has an effect on survival.
• In this case, the P Value is 28%. This means that, if OS type truly made no difference, the probability of seeing this much or more difference between OS types by chance is 28%.

P Value - Statistical Test
• The probability of getting a given value (in this case, the difference between the two curves) if we assume that there is no real difference between the two.
• When you keep measuring things, you will most likely get slightly different numbers each time.
• Is the difference we are seeing something that can happen by chance, or is it a pattern that can be observed consistently?
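One common test for comparing two survival curves is the log-rank test. The slides do not say which test the tool runs, so the sketch below is an assumption for illustration only; the function name and the sample data are made up. It compares observed and expected event counts in one group at each event time and converts the resulting chi-square statistic (1 degree of freedom) into a P value.

```python
import math

def logrank_p_value(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns the P value."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Customers still at risk in each group just before time t.
        n_a = sum(1 for d in durations_a if d >= t)
        n_b = sum(1 for d in durations_b if d >= t)
        # Events happening exactly at time t.
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected events in A if groups are identical
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 1.0  # no information to tell the groups apart
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))

# Hypothetical data: identical groups versus clearly different groups.
same = logrank_p_value([1, 2, 3, 4], [True] * 4, [1, 2, 3, 4], [True] * 4)
different = logrank_p_value([1, 2, 3, 4, 5], [True] * 5,
                            [6, 7, 8, 9, 10], [True] * 5)
print(same, different)
```

A large P value, as with the identical groups here, means the observed difference is the kind that chance alone could easily produce.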

Null Hypothesis: All cohorts are identical. OS doesn’t make a difference.

Hypothesis
P-Value is 28%. If we set the threshold of P-Value to be 5%, we can NOT reject the Null Hypothesis.

Conclusion
We can NOT reject the Null Hypothesis: the data does not show that OS makes a difference in the survival curves.

With the confidence intervals, we can see that the two survival curves differ up to month 3, but become very close after that.

Joined Month as Cohort

Cohort Analysis with Joined Month
• Group the users based on when they joined the service, then compare the survival curves among the groups.
• Use ‘Joined Month’ as Cohort.
• Are the survival curves getting better or worse with cohorts of recent customers?

Data Preparation
• We can’t assign the ‘start_date’ column to Color By because it’s already been assigned to Start Date.
• Copy the ‘start_date’ column!

Data Preparation

Create join_date column by copying start_date column.

Can’t find join_date!

Analytics is ‘Pinned’ to the previous step.

Move the Pin by drag-and-drop!

1. Assign join_date to “Color By”.
2. Select Floor to Month to create the groups at the Month level.
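What flooring to the month does can be sketched in Python: every join date is replaced by the first day of its month, and customers sharing that value form one cohort. The helper names and sample dates below are hypothetical and are not part of the tool.

```python
from collections import defaultdict
from datetime import date

def floor_to_month(d):
    """Replace a date with the first day of its month."""
    return d.replace(day=1)

def group_by_join_month(join_dates):
    """Map each month to the row indexes of customers who joined in it."""
    cohorts = defaultdict(list)
    for index, joined in enumerate(join_dates):
        cohorts[floor_to_month(joined)].append(index)
    return dict(cohorts)

# Hypothetical join_date values.
join_dates = [date(2016, 8, 3), date(2016, 9, 17), date(2016, 9, 30), date(2016, 10, 1)]
cohorts = group_by_join_month(join_dates)
for month, members in sorted(cohorts.items()):
    print(month.strftime("%Y-%m"), members)
```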

Survival Curves for Join Month cohorts.

The survival curves get worse for cohorts up to September 2016, but it looks like they are getting better after that.

Check the P-Value to see if the differences among the cohorts are significant.

By assigning the ‘os’ column to Repeat By, we get two survival curve charts.

Check the P-Value to see if the differences among the cohorts are significant for each OS type.

What’s Next?

Can we know what makes the differences in the survival curves?

Cox Regression: Which variables make the survival curve steeper or less steep?

Contact
Email: [email protected]
Home Page: https://exploratory.io
Twitter: @KanAugust
Online Seminar: https://exploratory.io/online-seminar
