Exploratory: An Introduction to Cohort Analysis

Cohort Analysis is one of the most critical analyses for SaaS / Subscription businesses. It helps you understand how your customers churn from (or stay with) your service as time goes by.

And if you want to do it right, you want to use the Survival Curve algorithm, a.k.a. Kaplan-Meier. This technique has long been used in other areas such as employee attrition, machine maintenance, and patient treatment, and it works great for today’s data-savvy SaaS businesses.

Kan will introduce what the Survival Curve is and how you can use it in Exploratory.

Kan Nishida

July 10, 2019

Transcript

  1. EXPLORATORY

  2. Instructor: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust). Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  3. Mission: Make Data Science Available for Everyone

  4. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  5. Democratization of Data Science. First Wave (1976): theme Monetization, users Statisticians. Second Wave (2000): theme Commoditization, users Data Scientists. Third Wave (2016): theme Democratization, users Business Users. Remaining table cells (rows Algorithms, Experience, Tools), as extracted: Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation.
  6. Exploratory Data Analysis: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning).
  7. Exploratory Seminar: Cohort Analysis

  8. Agenda: • Introduction to Survival Analysis and Survival Curve • Introduction to Cohort Analysis • How to Understand the Insights
  9. It’s not IF, it’s WHEN.

  10. A certain type of events will eventually happen.

  11. Events: • Cancer Patients Die • Cars Break • Employees Quit • Customers Churn
  12. • Not interested in whether the event will happen or not. • Interested in how long it takes for the event to happen.
  13. Time to Event: • Time to patients dying • Time to cars breaking • Time to employees quitting • Time to users churning
  14. Survival Analysis: • Analysis of how variables are associated with ‘time to an event’. • Not about the probability of an event occurring.
  15. Survival Analysis: NOT about the probability of whether an event happens.
  16. Survival Curve

  17. Example: Customers for a Startup

  18. [Chart: customers Jessica, Tien, Nancy, and Victor over Jan, Feb, Mar, and Apr; two of them cancelled.]

  19. [Chart: Jessica, Tien, Nancy, and Victor over the 1st, 2nd, 3rd, and 4th month.]
  20. Survival Rate of Each Month: 1st month 1, 2nd month 0.75, 3rd month 0.66, 4th month 1.
  21. Survival Rate of Each Period: 1, 0.75, 0.66, 1. Survival Rate through the period: 1, then 0.75 (1*0.75).
  22. Survival Rate of Each Period: 1, 0.75, 0.66, 1. Survival Rate through the period: 1, 0.75 (1*0.75), then 0.5 (0.75*0.66).
  23. Survival Rate of Each Period: 1, 0.75, 0.66, 1. Survival Rate through the period: 1, 0.75 (1*0.75), 0.5 (0.75*0.66), then 0.5 (0.5*1).
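The running product on these slides can be reproduced with a few lines of Python. This is only a sketch: the monthly counts below are inferred from the slide's rates (four customers, one leaving in the 2nd month and one in the 3rd), and the variable names are made up for illustration.

```python
from fractions import Fraction
from itertools import accumulate
from operator import mul

# Assumed counts that reproduce the slide's rates of 1, 0.75, 0.66, 1:
# customers present at the start of each month, and how many of them
# were still customers at the end of that month.
at_start = [4, 4, 3, 2]
survived = [4, 3, 2, 2]

# Survival rate of each period = survivors / customers at the start.
period_rates = [Fraction(s, n) for s, n in zip(survived, at_start)]

# Survival rate through each period = running product of the period rates.
through_rates = list(accumulate(period_rates, mul))

for month, (rate, through) in enumerate(zip(period_rates, through_rates), start=1):
    print(f"Month {month}: period {float(rate):.2f}, through {float(through):.2f}")
```

Exact fractions are used so that 2/3 is not rounded to 0.66 before multiplying, which is why the third value comes out as exactly one half.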
  24. In Reality…

  25. Customers sign up at different times. [Chart: timeline over Jan, Feb, Mar, and Apr.]

  26. Align all customers based on how many months they have been with the service. [Chart: 1st, 2nd, 3rd, and 4th month.]
  27. We don’t know whether the recently joined customers will quit or not. [Chart: the months not yet observed are marked with ‘?’.]
  28. Label them as Censored!
  29. Survival Rate of Each Month: 1st month 1, 2nd month 0.75, 3rd month 0.66, 4th month 1.

  30. Survival Rate of Each Period: 1, 0.75, 0.66, 1. Survival Rate through the period: 1, 0.75 (1*0.75), 0.5 (0.75*0.66), 0.5 (0.5*1).
  31. Kaplan-Meier Algorithm

  32. Kaplan-Meier Estimator: • Estimate the probability of a customer canceling in a given time period by dividing the number of customers who canceled in that period by the number of customers still remaining in that period. • By cumulatively multiplying the survival rates (1 minus these probabilities), we can calculate the probability of survival from the beginning through that point in time. • Instead of the ratio of actual survivors, we estimate the survival curve as a probability. • For censored observations, instead of counting them as ‘canceled’, we treat them properly by reducing the number of remaining customers.
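The bullets above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation Exploratory uses; the function name, the tuple layout, and the sample data are all assumptions made up for this example.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return a list of (time, n_risk, n_event, n_censor, estimate) tuples.

    durations: how long each customer was observed (e.g. months since signup)
    events:    True if the customer canceled at that time, False if censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    censor_counts = Counter(t for t, e in zip(durations, events) if not e)

    n_risk = len(durations)   # customers still remaining
    survival = 1.0            # survival rate through the current time
    rows = []
    for t in sorted(set(durations)):
        n_event = event_counts[t]
        n_censor = censor_counts[t]
        # Multiply by this period's survival rate: 1 - canceled / remaining.
        survival *= 1 - n_event / n_risk
        rows.append((t, n_risk, n_event, n_censor, survival))
        # Canceled and censored customers both leave the remaining pool,
        # but only cancellations lowered the survival estimate above.
        n_risk -= n_event + n_censor
    return rows

# Hypothetical data: six customers, two cancellations, four censored.
durations = [1, 2, 2, 3, 4, 4]
events = [False, True, False, True, False, False]
rows = kaplan_meier(durations, events)
for row in rows:
    print(row)
```

Note how the censored customer at time 1 never lowers the estimate; it only shrinks the denominator used for later periods.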
  33. Let’s try!

  34. Data

  35. User Access Data

  36. Input data for Survival Analysis: • Each row represents one observation (e.g. a customer). • Each row has at least two pieces of information: • Start Date and End Date for the lifetime • Event Status: e.g. Canceled, Quit, Died, etc.
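As a rough sketch of what such input rows boil down to, the snippet below turns each row into a lifetime in days plus an event flag. The column names (`start_date`, `end_date`, `status`) and the sample dates are assumptions for illustration, not the names used in the seminar's data.

```python
from datetime import date

# Hypothetical customer rows; for a customer who has not canceled,
# end_date is assumed to be the last date they were observed.
customers = [
    {"start_date": date(2016, 1, 5), "end_date": date(2016, 3, 5), "status": "Canceled"},
    {"start_date": date(2016, 2, 1), "end_date": date(2016, 4, 30), "status": "Active"},
]

def to_survival_row(row):
    """Return (lifetime_in_days, event_happened) for one customer row."""
    lifetime_days = (row["end_date"] - row["start_date"]).days
    event_happened = row["status"] == "Canceled"  # any other status is censored
    return lifetime_days, event_happened

survival_rows = [to_survival_row(row) for row in customers]
print(survival_rows)
```

The second customer contributes 89 days of observed lifetime without counting as a cancellation, which is the censoring idea from the earlier slides.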
  37. Each row represents one observation

  38. Start and End Date for Life Time

  39. Event Status

  40. Run Survival Analysis

  41. Select Survival Analysis, assign the columns as below, and Run!
  42. The “Data” tab shows more detailed raw data.

  43. Survival Analysis Data: In survival data, one row represents one point in time. • time - Time since the start (e.g. days since sign-up). • n_risk - Number of observations to which the event hasn’t happened yet at a given time. • n_event - Number of observations to which the event happened at a given time. • n_censor - Number of observations censored at a given time. • estimate - Survival rate through this point in time. These are the values visualized as the Survival Curve. • std_error - Standard error of the estimate. • conf_low - Lower limit of the confidence interval for the estimate. • conf_high - Upper limit of the confidence interval for the estimate.
  44. [Table, empty: rows Day 1, Day 2, Day 3; columns n_risk, n_event, n_censor, estimate.]
  45. n_risk: Day 1 = 10, Day 2 = 8, Day 3 = 4.
  46. n_risk and n_event: Day 1 = 10 and 1, Day 2 = 8 and 2, Day 3 = 4 and 1.
  47. n_risk, n_event, and n_censor: Day 1 = 10, 1, 1; Day 2 = 8, 2, 2; Day 3 = 4, 1, 1.
  48. Survival Rates through the day (n_risk, n_event, n_censor, estimate): Day 1 = 10, 1, 1, 0.9; Day 2 = 8, 2, 2, 0.675; Day 3 = 4, 1, 1, 0.50625.
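As a worked check of the estimate column, assuming each day's estimate is the previous estimate multiplied by (1 - n_event / n_risk), as the Kaplan-Meier slides describe. The symbols $n_i$, $d_i$, and $c_i$ stand for n_risk, n_event, and n_censor on day $i$ and are introduced here only for illustration.

```latex
\begin{aligned}
\hat{S}(1) &= 1 - \tfrac{1}{10} = 0.9 \\
\hat{S}(2) &= 0.9 \times \left(1 - \tfrac{2}{8}\right) = 0.675 \\
\hat{S}(3) &= 0.675 \times \left(1 - \tfrac{1}{4}\right) = 0.50625 \\
n_{i+1} &= n_i - d_i - c_i \quad (10 - 1 - 1 = 8,\; 8 - 2 - 2 = 4)
\end{aligned}
```

The last line shows why n_risk drops from 10 to 8 to 4: both canceled and censored observations leave the pool, but only cancellations reduce the estimate.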
  49. Cohort Analysis

  50. Cohort Analysis: • The term originated in epidemiology. • Analysis of the survival rates of multiple groups. • Each group is called a cohort.
  51. Cohort Analysis

  52. Let’s try comparing the survival rates of two groups, Mac vs. Windows.
  53. Assign the os column to Color By and click Run!

  54. • The Summary tab shows the result of a statistical test of how sure we can be that the difference in OS type has an effect on survival. • In this case, the P Value is 28%. This means that, if OS type made no real difference, the probability of getting this much or more difference between the OS types by chance would be 28%.
  55. P Value - Statistical Test: • The probability of getting a value at least as extreme as the one observed (in this case, the difference between the two curves) if we assume that there is no real difference between the two. • When you keep measuring things, you will most likely get slightly different numbers each time. • Is the difference we are seeing something that can happen by chance, or is it a pattern that can be observed consistently?
  56. Null Hypothesis: All cohorts are identical. OS doesn’t make a difference.
  57. Null Hypothesis: All cohorts are identical. OS doesn’t make a difference. P-Value is 28%.
  58. Null Hypothesis: All cohorts are identical. OS doesn’t make a difference. P-Value is 28%. Suppose we set the P-Value threshold to 5%.
  59. Null Hypothesis: All cohorts are identical. OS doesn’t make a difference. P-Value is 28%. Suppose we set the P-Value threshold to 5%. We can NOT Reject the Null Hypothesis.
  60. Null Hypothesis: All cohorts are identical. OS doesn’t make a difference. We can NOT Reject the Null Hypothesis. Conclusion: We can’t say that OS makes a difference in the survival curves.
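The slides do not say which statistical test produces this P-Value. A common choice for comparing two survival curves is the log-rank test, so the sketch below assumes that test; it is written from the textbook formula, not taken from Exploratory, and the function name and sample data are hypothetical.

```python
import math

def logrank_p_value(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value)."""
    # Distinct times at which at least one event (cancellation) happened.
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        # Under the null hypothesis, group A's expected share of the
        # events at time t is proportional to its share of those at risk.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical lifetimes in months; True means the customer canceled.
mac_durations, mac_events = [1, 2, 3, 4, 5], [True] * 5
win_durations, win_events = [10, 11, 12, 13, 14], [True] * 5
chi_square, p_value = logrank_p_value(
    mac_durations, mac_events, win_durations, win_events
)
print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.4f}")
```

With a 5% threshold, a p-value below 0.05 would lead us to reject the null hypothesis that the groups are identical; a value such as the 28% on the slides would not.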
  61. With the confidence intervals, we can see that the two survival curves differ up to month 3, but become very close after that.
  62. Joined Month as Cohort

  63. Cohort Analysis with Joined Month: • Group the users based on when they joined the service, then compare the survival curves among the groups. • Use ‘Joined Month’ as the cohort. • Are the survival curves getting better or worse for cohorts of more recent customers?
  64. Data Preparation: • We can’t assign the ‘start_date’ column to Color By because it’s already been assigned to Start Date. • Copy the ‘start_date’ column!
  65. Data Preparation

  66. Create a join_date column by copying the start_date column.

  68. Can’t find join_date!

  69. Analytics is ‘Pinned’ to the previous step.

  70. Move the Pin by drag-and-drop!

  71. 1. Assign join_date to “Color By”. 2. Select Floor to Month to create the groups at the month level.
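Outside the UI, flooring a date to its month amounts to replacing the day with 1, so that every customer who joined in the same month gets the same cohort key. A minimal sketch, with a hypothetical helper name and made-up dates:

```python
from collections import defaultdict
from datetime import date

def floor_to_month(d):
    """Map any date to the first day of its month."""
    return d.replace(day=1)

# Hypothetical join dates for three customers.
join_dates = [date(2016, 8, 3), date(2016, 8, 27), date(2016, 9, 14)]

# Group customers into monthly cohorts keyed by the floored date.
cohorts = defaultdict(list)
for joined in join_dates:
    cohorts[floor_to_month(joined)].append(joined)

for month, members in sorted(cohorts.items()):
    print(month.strftime("%Y-%m"), len(members))
```

Each cohort's members can then be fed to a survival curve calculation separately, one curve per join month.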
  72. Survival Curves for Join Month cohorts.

  73. The survival curves are going down until September 2016, but it looks like they are getting better after that.
  74. Check the P-Value to see if the differences among the cohorts are significant.
  75. By assigning the ‘os’ column to Repeat By, we get two survival curve charts.
  76. Check the P-Value to see if the differences among the cohorts are significant for each OS type.
  77. What’s Next?

  78. Can we know what makes the differences in the survival curves?
  79. Cox Regression: What variables make the survival curve steeper or less steep?
  82. Q & A

  83. Contact: Email kan@exploratory.io, Home Page https://exploratory.io, Twitter @KanAugust, Online Seminar https://exploratory.io/online-seminar