Slide 1

Slide 1 text

“Survival” Analysis of Web Users
Dell Zhang
DCSIS, Birkbeck, University of London

Slide 2

Slide 2 text

Outline
• What Is It
• Why Is It Useful
• Case Study
  – The Departure Dynamics of Wikipedia Editors

Slide 3

Slide 3 text

What Is It

Slide 4

Slide 4 text

Time-To-Event Data
• Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data
  – The outcome variable of interest is time until an event occurs.
    • death, disease, failure
    • recovery, marriage
  – It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.

Slide 5

Slide 5 text

[Figure: axes X and Y]
How to build a probabilistic model of Y?

Slide 6

Slide 6 text

[Figure: axes X and Y]
How to build a probabilistic model of Y?
How to build a probabilistic model of Y given X?

Slide 7

Slide 7 text

[Figure: axes X and Y]
How to build a probabilistic model of Y?
How to build a probabilistic model of Y given X?

Slide 8

Slide 8 text

Censoring
• A key problem in survival analysis
  – It occurs when we have some information about individual survival time, but we don’t know the survival time exactly.

Slide 9

Slide 9 text


Slide 10

Slide 10 text

[Figure: axes X and Y]
Options:
1) Wait for those patients to die?
2) Discard the censored data?
3) Use the censored data as if they were not censored?
4) ……
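To see why options 2 and 3 are problematic, here is a minimal sketch on made-up data. The lifetimes and the study cut-off are illustrative assumptions, not data from the talk. Every true lifetime beyond the cut-off is right-censored at the cut-off, and both shortcuts underestimate the mean survival time.

```python
# Hypothetical true survival times (in months); in practice these are unknown.
true_times = [2, 4, 6, 8, 10, 12, 20, 30]
study_end = 10  # assumed cut-off: anything longer is right-censored

observed = [min(t, study_end) for t in true_times]
event = [t <= study_end for t in true_times]  # True = death was observed


def mean(values):
    return sum(values) / len(values)


true_mean = mean(true_times)
# Option 2: discard the censored observations.
discard_mean = mean([t for t, e in zip(observed, event) if e])
# Option 3: treat censored times as if they were observed deaths.
naive_mean = mean(observed)

print(true_mean, discard_mean, naive_mean)  # 11.5 6.0 7.5
```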

Slide 11

Slide 11 text

Goals
• Survival Analysis attempts to answer questions such as
  – What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die?
  – Can multiple causes of death be taken into account?
  – How do particular circumstances or characteristics increase or decrease the odds of survival?

Slide 12

Slide 12 text

• Censoring of data
• Comparing groups
  – (1 treatment vs. 2 placebo)
• Confounding or Interaction factors
  – Log WBC

Slide 13

Slide 13 text

Why Is It Useful
for Online Marketing etc.

Slide 14

Slide 14 text

The Data Are There
• Events meaningful to online marketing
  – Time to Clicking the Ad
  – Informational: Time to Finding the Wanted Info
  – Transactional: Time to Buying the Product
  – Social: Time to Joining/Leaving the Community
  – ……
Time Matters!

Slide 15

Slide 15 text

Evidence-Based Marketing
• Let’s work as (real) doctors
  – Users = Patients
  – Advertisement (Marketing) = Treatment
Survival Analysis brings the time dimension back to the centre stage.

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text


Slide 18

Slide 18 text


Slide 19

Slide 19 text

Predict whether a new question asked on Stack Overflow will be closed when

Slide 20

Slide 20 text

Case Study
The Departure Dynamics of Wikipedia Editors

Slide 21

Slide 21 text

About 90,000 regularly active volunteer editors around the world

Slide 22

Slide 22 text


Slide 23

Slide 23 text

Departure Dynamics
• Who is likely to “die”?
• How soon will they “die”?
• Why do they “die”?
“live” = stay in the editors’ community = keep editing
“die” = leave the editors’ community = stop editing (for 5 months)
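The "live"/"die" definition above can be turned into survival-analysis labels. The following is a minimal sketch, not the talk's actual procedure: the function name, the use of first and last edit dates, and the approximation of "5 months" as 150 days are all assumptions. An editor whose last edit is too recent to have passed the inactivity threshold is treated as right-censored.

```python
from datetime import date, timedelta

# Assumption: "5 months" of inactivity is approximated as 150 days.
INACTIVITY = timedelta(days=150)


def label_editor(first_edit, last_edit, observation_end):
    """Return (lifetime_days, died) for one editor.

    died is True if the editor had been silent for at least INACTIVITY
    by the end of the observation window; otherwise the lifetime is
    right-censored (the editor may still be "alive").
    """
    lifetime_days = (last_edit - first_edit).days
    died = (observation_end - last_edit) >= INACTIVITY
    return lifetime_days, died


# Hypothetical editors, observed until 2010-09-01.
print(label_editor(date(2009, 1, 1), date(2009, 3, 1), date(2010, 9, 1)))  # (59, True)
print(label_editor(date(2010, 1, 1), date(2010, 8, 1), date(2010, 9, 1)))  # (212, False)
```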

Slide 24

Slide 24 text

Who is likely to “die”? (WikiChallenge)

Slide 25

Slide 25 text


Slide 26

Slide 26 text

[Figure with date labels: 2010-09-01, 2010-09-01, 2011-02-01, 2010-04-01, 2001-01-01, 2001-06-01]

Slide 27

Slide 27 text


Slide 28

Slide 28 text

Behavioural Dynamics Features
[Figure: axis in months, labelled “Exponential Steps”]
Web Search (SIGIR-2009), Social Tagging (WWW-2009), Language Modelling (ICTIR-2009)
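One plausible reading of "exponential steps" is that an editor's activity is counted over look-back windows whose lengths grow exponentially, so recent behaviour is described at a finer resolution than older behaviour. The sketch below illustrates that idea only. The function name and the window lengths (doubling from 1/16 of a month) are assumptions, not the feature set used in the talk.

```python
def dynamics_features(edit_ages_months, n_steps=6, shortest=1 / 16):
    """Count edits inside look-back windows of exponentially growing length.

    edit_ages_months: how long ago (in months) each edit was made.
    Returns one count per window: shortest, 2*shortest, 4*shortest, ...
    """
    windows = [shortest * 2 ** k for k in range(n_steps)]
    return [sum(1 for age in edit_ages_months if age <= w) for w in windows]


# Hypothetical editor with seven edits, the oldest three months ago.
ages = [0.01, 0.05, 0.1, 0.3, 0.9, 1.5, 3.0]
print(dynamics_features(ages))  # [2, 3, 3, 4, 5, 6]
```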

Slide 29

Slide 29 text


Slide 30

Slide 30 text


Slide 31

Slide 31 text


Slide 32

Slide 32 text

Gradient Boosted Trees (GBT)
[Image © 2008-2012 ~maniraptora]

Slide 33

Slide 33 text

Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably attributable to
  – its ability to capture the complex nonlinear relationship between the target variable and the features,
  – its insensitivity to different feature value ranges as well as outliers, and
  – its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b).
• GBT vs RF
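To make "shrinkage" and "subsampling" concrete, here is a toy sketch of gradient boosting for squared loss, using one-split regression stumps on a single feature. It is not the model used in the WikiChallenge entry. All names and the toy data are assumptions, and a real system would use a library implementation with deeper trees and many features.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all x values identical: predict the mean residual
        m = sum(residuals) / len(residuals)
        return xs[0], m, m
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, n_trees=200, shrinkage=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)   # subsampling: random subset per tree
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        # shrinkage: add only a fraction of each new tree's prediction
        preds = [p + shrinkage * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, shrinkage, stumps


def predict_gbt(model, x):
    base, shrinkage, stumps = model
    return base + shrinkage * sum(predict_stump(s, x) for s in stumps)


# Toy nonlinear target (hypothetical data).
xs = [float(x) for x in range(20)]
ys = [(x - 10) ** 2 for x in xs]
model = fit_gbt(xs, ys)
mse = sum((predict_gbt(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)
```

Each tree is fitted to the current residuals, which are the negative gradient of the squared loss. The small step size and the random subsets are the two regularisation mechanisms named on the slide.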

Slide 34

Slide 34 text


Slide 35

Slide 35 text


Slide 36

Slide 36 text


Slide 37

Slide 37 text


Slide 38

Slide 38 text

Final Result
• The 2nd best valid algorithm in the WikiChallenge
  – RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution
  – Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features
  – WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
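For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) compares predictions and true values on a log(1 + x) scale, so large edit counts do not dominate the score. A minimal sketch follows; the function name and example numbers are illustrative.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predicted vs. actual numbers of future edits.
print(rmsle([0, 3, 10], [0, 5, 8]))
```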

Slide 39

Slide 39 text

How soon will they “die”?

Slide 40

Slide 40 text

The evolution of Wikipedia editors' community.
[Figure: birth & death; 110,000 random samples; time axis from January 2001]

Slide 41

Slide 41 text

The evolution of Wikipedia editors' community.
[Figure: active editors; 110,000 random samples; time axis from January 2001]

Slide 42

Slide 42 text

Survival Function
What is the fraction of a population which will survive past a certain time?
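In standard textbook notation (the symbols T, F and S are not taken from the slide), with T the random time until the event and F its cumulative distribution function, the survival function is:

```latex
S(t) = \Pr(T > t) = 1 - F(t), \qquad S(0) = 1, \quad \lim_{t \to \infty} S(t) = 0 .
```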

Slide 43

Slide 43 text

The histogram of Wikipedia editors' lifetime.
[Figure labels: Customary Editors, Occasional Editors]

Slide 44

Slide 44 text

Kaplan-Meier Estimator
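The Kaplan-Meier (product-limit) estimator handles right-censored data by multiplying, at each observed event time, the fraction of at-risk individuals who survive it: S(t) is the product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of deaths at t_i and n_i the number still at risk. The sketch below is a plain-Python illustration with made-up data, not the code behind the talk's plots.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the survival function.

    durations: observed time for each individual
    events:    True/1 if death was observed, False/0 if right-censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all individuals sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored both leave the risk set
    return curve


# Hypothetical data: five individuals, two of them censored.
for t, s in kaplan_meier([3, 5, 5, 8, 10], [1, 0, 1, 1, 0]):
    print(t, round(s, 3))
```

Censored individuals contribute to the risk set up to their censoring time and are then dropped. This is how the estimator uses partial information instead of discarding it.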

Slide 45

Slide 45 text


Slide 46

Slide 46 text

The empirical survival function.

Slide 47

Slide 47 text

Normal Distribution Probability Plot

Slide 48

Slide 48 text

Extreme Value Distribution Probability Plot

Slide 49

Slide 49 text

Rayleigh Distribution Probability Plot

Slide 50

Slide 50 text

Exponential Distribution Probability Plot

Slide 51

Slide 51 text

Lognormal Distribution Probability Plot

Slide 52

Slide 52 text

Weibull Distribution Probability Plot
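A probability plot checks a candidate distribution by transforming the data so that a good fit appears as a straight line. For the Weibull case, assuming the usual parametrisation with shape k and scale λ (symbols not taken from the slides), the transformation is:

```latex
S(t) = \exp\!\left[-\left(\tfrac{t}{\lambda}\right)^{k}\right]
\quad\Longrightarrow\quad
\ln\bigl(-\ln S(t)\bigr) = k \ln t - k \ln \lambda .
```

Plotting ln(-ln S(t)) from the empirical survival function against ln t should therefore give roughly a straight line with slope k if the lifetimes are Weibull distributed.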

Slide 53

Slide 53 text

The survival function.

Slide 54

Slide 54 text

Weibull distribution
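As a small illustration of working with a fitted Weibull model, the sketch below evaluates its survival function and median lifetime. The shape and scale values are hypothetical placeholders, not the parameters estimated in the talk.

```python
import math


def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape) for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(scale, shape):
    """Time at which half the population has 'died': scale * ln(2) ** (1 / shape)."""
    return scale * math.log(2) ** (1 / shape)


# Hypothetical parameters (scale in days).
scale, shape = 100.0, 0.8
median = weibull_median(scale, shape)
print(median, weibull_survival(median, scale, shape))
```

A shape parameter below 1 corresponds to a hazard that decreases over time, and a shape above 1 to one that increases.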

Slide 55

Slide 55 text

Expected Future Lifetime
median lifetime: 53 days
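In standard notation (assumed here, not shown on the slide), the expected future lifetime of an individual who has already survived to time t_0 follows from the survival function S:

```latex
\mathbb{E}\left[\,T - t_0 \mid T > t_0\,\right]
  = \frac{1}{S(t_0)} \int_{t_0}^{\infty} S(t)\, dt .
```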

Slide 56

Slide 56 text

Hazard Function
The instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
Of those that survive, at what rate will they die?
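Written out in standard notation (the symbols are assumed, with f the density of T and S the survival function), the hazard function and its link to the survival function are:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \ln S(t),
\qquad
S(t) = \exp\!\left(-\int_{0}^{t} h(u)\, du\right).
```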

Slide 57

Slide 57 text

Bathtub Curve
http://en.wikipedia.org/wiki/Bathtub_curve

Slide 58

Slide 58 text

The hazard function.

Slide 59

Slide 59 text

The hazard function.

Slide 60

Slide 60 text

Conclusions
• For customary Wikipedia editors,
  – the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days);
  – there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases;
  – more active editors tend to stay active in editing for a longer time.

Slide 61

Slide 61 text

Why do they “die”?

Slide 62

Slide 62 text

Covariates
Last Edit

Slide 63

Slide 63 text


Slide 64

Slide 64 text


Slide 65

Slide 65 text

Cox Proportional Hazards Model
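In its standard form (notation assumed, with p covariates X_1, ..., X_p), the Cox model writes the hazard as an unspecified baseline hazard h_0(t) scaled by an exponential function of the covariates:

```latex
h(t \mid X) = h_0(t)\, \exp\!\left(\sum_{i=1}^{p} \beta_i X_i\right).
```

The covariates act multiplicatively on the hazard and do not depend on t, which is the "proportional hazards" assumption.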

Slide 66

Slide 66 text

Semi-Parametric
• The semi-parametric property of the Cox model => its popularity
  – The baseline hazard is unspecified
  – Robust: it will closely approximate the correct parametric model
  – Using a minimum of assumptions

Slide 67

Slide 67 text

Cox PH vs. Logistic

Slide 68

Slide 68 text

Maximum Likelihood Estimation
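For the Cox model, the coefficients are usually estimated by maximising a partial likelihood, in which the baseline hazard cancels out. A standard textbook form, assuming no tied event times, is given below. Here δ_i = 1 marks an observed (uncensored) event, x_i is the covariate vector of individual i, and R(t_i) is the set of individuals still at risk at time t_i. This notation is assumed and not taken from the slide.

```latex
L(\beta) = \prod_{i:\, \delta_i = 1}
  \frac{\exp(\beta^{\top} x_i)}{\sum_{j \in R(t_i)} \exp(\beta^{\top} x_j)} .
```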

Slide 69

Slide 69 text

Cox Proportional Hazards Model

Covariate              β        se      z         p
X1: namespace==Main    -0.1095  0.0172  -6.3664   0.1935e-9
X2: log(1+cur_size)    -0.0688  0.0036  -19.2474  0.0000e-9

Slide 70

Slide 70 text

Hazard Ratio
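In a Cox model, the hazard ratio for a one-unit increase in a covariate is exp(β). As a worked example, the sketch below applies this to the coefficients and standard errors reported on the Cox Proportional Hazards Model slide. The approximate 95% Wald interval (z = 1.96) is an added illustration and is not a result quoted in the talk.

```python
import math

# (beta, standard error) as reported on the Cox model slide.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}


def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) with an approximate 95% Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


for name, (beta, se) in coefficients.items():
    hr, lower, upper = hazard_ratio(beta, se)
    print(f"{name}: HR={hr:.4f} (approx. 95% CI {lower:.4f} to {upper:.4f})")
```

Both hazard ratios are below 1 (about 0.90 and 0.93), so under this model each covariate is associated with a lower hazard of becoming inactive.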

Slide 71

Slide 71 text

Adjusted Survival Curves

Slide 72

Slide 72 text


Slide 73

Slide 73 text

Next Step

Slide 74

Slide 74 text

Cartoon: Ron Hipschman
Data: David Hand

Slide 75

Slide 75 text

Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
  – He was struck by lightning 7 times
    • 1942 (lost big-toe nail)
    • 1969 (lost eyebrows)
    • 1970 (left shoulder seared)
    • 1972 (hair set on fire)
    • 1973 (hair set on fire & legs seared)
    • 1976 (ankle injured)
    • 1977 (chest & stomach burned)
  – He committed suicide in September 1983.

Slide 76

Slide 76 text

A Lot More To Do
• Multiple Occurrences of “Death”
  – Recurrent Event Survival Analysis (e.g., based on Counting Process)
• Multiple Types of “Death”
  – Competing Risks Survival Analysis

Slide 77

Slide 77 text

Software Tools
• R
  – The ‘survival’ package
• Matlab
  – The ‘statistics’ toolbox
• Python
  – The ‘statsmodels’ module?

Slide 78

Slide 78 text

References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct 2011. http://goo.gl/s2Dex

Slide 79

Slide 79 text

?

Slide 80

Slide 80 text
