that have to do with analysing the timing of events (duration data) and its relationship with covariates.

Traditionally, medical researchers and actuaries have dealt with the type of questions addressed by Survival Analysis:
• What is the expected lifetime of patients given treatment A?
• What is the life expectancy in London?

It is also widely used in engineering to look at failure in mechanical systems (reliability theory).

survival analysis
• What is the time between signup and first order/use?
• When will a customer churn?
• How long before a tweet has no effect?
• When will a store run out of product?
• How long before a person replaces his/her phone?
• How long is the career of a football player?
• What is the expected lifetime of a Stark?

motivation

At lyst: Prediction of Out of Stock products.
set of techniques that can be used as part of a standard data science pipeline - acquiring/cleaning/storing/modelling
• Expand on techniques to evaluate survival functions via random forests and generalised additive models.
• Keep advocating the use of Python or R according to the use case in hand.
• Show it explicitly by applying it to a real dataset.

this talk
an observation period.

Death Event: Time associated with the end of the observation period either by an event, end of the study or withdrawal from it.

Censorship: We don't always see the death event occur. The current time, or other events, censor us from seeing the death event.
as a function of the survival function is called the hazard function.

$$\lambda(t) = \frac{p(t)}{S(t)}$$

The instantaneous hazard rate describes the probability per unit time that the event time T falls in the interval $[t, t + \delta t)$, given that the individual is alive at t:

$$\lambda(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T < t + \delta t \mid T > t)}{\delta t}$$

When the hazard is constant, survival time is described by the exponential distribution. Integrating the hazard function from 0 to t yields the cumulative hazard function:

$$\Lambda(t) = \int_0^t \lambda(x)\,dx = -\log S(t)$$

hazard function
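The relations between density, survival function, hazard and cumulative hazard can be checked numerically. The following is a minimal sketch, assuming an exponential lifetime with an illustrative rate of 0.25; the function names and the trapezoidal integration are choices made here for illustration, not part of the talk.

```python
import math


def survival(t, rate):
    """S(t) for an exponential lifetime with constant hazard `rate`."""
    return math.exp(-rate * t)


def density(t, rate):
    """p(t), the probability density of the event time."""
    return rate * math.exp(-rate * t)


def hazard(t, rate):
    """Hazard as density divided by survival: p(t) / S(t)."""
    return density(t, rate) / survival(t, rate)


def cumulative_hazard(t, rate, steps=1000):
    """Trapezoidal approximation of the integral of the hazard from 0 to t."""
    dx = t / steps
    total = 0.0
    for k in range(steps):
        left, right = k * dx, (k + 1) * dx
        total += 0.5 * (hazard(left, rate) + hazard(right, rate)) * dx
    return total


rate = 0.25  # illustrative value only, not taken from the data in the talk
# The hazard is the same at every t for the exponential distribution.
print(hazard(1.0, rate), hazard(7.0, rate))
# The integrated hazard agrees with -log S(t).
print(cumulative_hazard(4.0, rate), -math.log(survival(4.0, rate)))
```

Both printed pairs should agree up to floating-point error, which is the constant-hazard property and the identity between the cumulative hazard and the negative log of the survival function.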
number of observations from the random event whose distribution we are trying to estimate.

Kaplan-Meier Estimator

It is a non-parametric maximum likelihood estimate of the survival function. It is a step function with jumps at observed event times.

$$\hat{S}(t) = \prod_{i=1}^{j-1} \left(1 - \frac{s_i}{r_i}\right), \qquad t_{j-1} \le t < t_j$$

where $r_i$ is the number of individuals at risk just before $t_i$ and $s_i$ is the number of events observed at $t_i$.

Nelson-Aalen Estimator

We can also give a non-parametric estimate of the cumulative hazard function:

$$\hat{H}(t) = \sum_{i=1}^{j-1} \frac{s_i}{r_i}, \qquad t_{j-1} \le t < t_j$$
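Both estimators can be computed in a single pass over the sorted event times. The sketch below is a plain-Python illustration, not the lifelines implementation; the function name and the toy data are assumptions made here. It uses r_i for the number at risk just before each time and s_i for the number of events at that time.

```python
from collections import Counter


def kaplan_meier_nelson_aalen(durations, observed):
    """Return (t_i, S(t_i), H(t_i)) at each distinct observed event time.

    durations: time until the event or until censoring, one per individual
    observed:  1 if the event was seen, 0 if the individual was censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    leaving = Counter(durations)  # everyone leaves the risk set at their own time
    n_at_risk = len(durations)
    survival, cum_hazard = 1.0, 0.0
    rows = []
    for t in sorted(leaving):
        s_i = events.get(t, 0)  # events observed at t
        r_i = n_at_risk         # individuals at risk just before t
        if s_i:
            survival *= 1.0 - s_i / r_i  # Kaplan-Meier product term
            cum_hazard += s_i / r_i      # Nelson-Aalen sum term
            rows.append((t, survival, cum_hazard))
        n_at_risk -= leaving[t]  # drop both events and censored cases
    return rows


# Toy data, invented for illustration: five individuals, two censored.
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
for t, s, h in kaplan_meier_nelson_aalen(durations, observed):
    print(t, round(s, 3), round(h, 3))
```

Censored individuals never contribute a jump, but they do count in the risk set until their censoring time, which is why the denominators shrink even when no event occurs.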
person is still ‘in academia’. Online repositories & Portals Many papers can be found as electronic preprints and in particular, in many fields of mathematics and physics, all scientific papers are self-archived. There are also open access digital libraries that combine literature databases with author metadata.
date_end = sys.argv[2]
df = get_data(
    arxiv="physics:hep-th",
    date_start=date_start,
    date_end=date_end
)
filename = 'arxiv data/arxiv_' + str(date_start) + "_" + str(date_end) + ".csv"
df.to_csv(filename)

Authors that have published a pre-print in the hep-th arXiv (High Energy Physics). There is an API available. Subset to all authors whose first paper was between Jan 2000 - Jan 2015 and that have more than 2 papers.

arxiv_data.head(2)
has an API. Small print: Beware author names.

INSPIREURL = "http://inspirehep.net/search?"

def retrieve_affiliation(name):
    results = {}
    query = 'find a ' + name  # name is of the form surname, first name
    inspireoptions = dict(
        action_search="Search",
        rg=500,        # number of results to return in one page
        of="recjson",  # format of results
        ln="en",       # language
        p=query,       # search string
        jrec=0,        # record number to start at
    )
    ot = 'authors,title,creation_date'
    inspireoptions['ot'] = ot
    url = INSPIREURL + urlencode(inspireoptions) + '&sf=earliestdate&so=a'
    try:
        f = urlopen(url)
        data = f.read()
        my_json_data = eval(data)
variations of name for institution, multiple affiliations by paper and correction of PhD Institution based on number of papers.

def definite_affs(full_listings):
    """
    Function to parse affiliations per author and determine PhD affiliation
    and number of papers during PhD. Fuzzy match parameter needs to be fixed.
    """

['Caltech', 'Cambridge U., DAMTP', 'Hamburg U.', 'Hamburg U., Inst. Theor. Phys. II', "King's Coll. London", "King's Coll. London (main)", "King's Coll. London, Dept. Math", 'Santa Barbara, KITP']

Out[2]: (['Cambridge U., DAMTP', 'Caltech', 'Hamburg U., Inst. Theor. Phys. II', 'Santa Barbara, KITP', "King's Coll. London, Dept. Math"], 'Cambridge U., DAMTP', 4)
to extract timestamps for first and last paper, calculate durations and determine if observation is censored.

def first_and_last_papers(d):
    """
    Function to determine timestamps for first and last papers.
    """

def timestamp_parse(full_listings):
    """
    Function to reformat timestamps of all papers and evaluate total survival time.
    Returns:
    - Date of first paper
    - Date of last paper
    - Survival time in days
    - Average publication time
    """
estimated average publication time, solo papers and effective papers.

results_ss = results[
    (results['date_first_paper'] > '2000-01-01') &
    (results['date_first_paper'] < '2015-01-01')
].copy()
of cross-listings. Also, arXiv has issues with names (m tsulaia vs mirian tsulaia are considered different) and even though the INSPIRE API tries to deal with names, results are sometimes faulty, particularly when surnames are formed by more than two words.

Cross-listings: In order to deal with authors who publish pre-prints across listings, we demanded that at least one paper was solely submitted to [hep-th]. This left us with a total of 5228 authors, and 770 of these had only one paper to their name.

Outliers: There were still some authors with ‘crazy’ numbers (authors who participated in any of the big experimental collaborations - ATLAS, CMS, etc.).
Average number of coauthors is less than 6 (95th percentile)
• Less than 190 effective papers
• We were able to determine their PhD affiliation
• They have more than one paper
• Their average publication time is less than 5 years
• The number of papers is less than 185
• Their number of coauthors in their first paper is less than 20 and the number of papers in their PhD is less than 32

We were left with a total of 4244 authors.
coauthors in first paper, average pub time, papers during PhD.
• Gender (via genderize.io)
• Flag for independent researcher vs collaborative.
• Censoring flag (0 if date of last paper is after Jan 2015 and 1 if date of last paper is before Jan 2015).
• Top University - Generated list of top 20 universities in sample and created flag if author's PhD was from one of these.

We were left with a total of 4244 authors. 3220 could be assigned gender and 760 of these were flagged as having done their PhD in a top university.

['Cambridge U.', 'Tokyo U.', 'Kyoto U.', 'Durham U.', 'SISSA Trieste', 'Harvard U.', 'Munich Max Planck Inst.', 'Oxford U.', 'Sao Paulo U.', 'Princeton U.', 'Osaka U.', 'Potsdam Max Planck Inst.', 'Madrid Autonoma U.', 'Munich U.', 'Rome U.', 'Vienna Tech. U.', 'UC Santa Barbara', 'Rio de Janeiro CBPF', 'Humboldt U.', 'Sao Paulo IFT']
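The duration and censoring flag described above can be derived from the dates of an author's first and last paper. This is a minimal sketch, not the code used in the talk: the function name is hypothetical, and the cut-off is assumed to be 1 January 2015 because the slides only say "Jan 2015".

```python
from datetime import date

# Assumed exact cut-off day; the slides only state "Jan 2015".
CUTOFF = date(2015, 1, 1)


def survival_record(first_paper, last_paper, cutoff=CUTOFF):
    """Return (duration in days, flag) for one author.

    The flag follows the convention on the slide: 1 if the last paper is
    before the cut-off (the 'death' was observed), 0 if the author is still
    publishing after the cut-off (the observation is censored).
    """
    duration_days = (last_paper - first_paper).days
    flag = 1 if last_paper < cutoff else 0
    return duration_days, flag


# Two invented authors, for illustration only.
print(survival_record(date(2003, 5, 1), date(2009, 5, 1)))
print(survival_record(date(2010, 1, 1), date(2015, 6, 1)))
```

With this convention the flag can be passed directly as the event column of a survival model, since a value of 1 marks an observed event.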
male, 24% undetermined.
• Average no. of papers during PhD was 5.2. However, it has been decreasing in the last 5 years towards 4 (3.07 in 2014).
• Figure for female stands at 5.0 and for male at 5.4. They have an equivalent number of coauthors on their first paper.
• If we demand probability of gender matching to be over 80%, difference is wider (4.8 vs 5.4) and sample size decreases from 433 to 349.
• No. of coauthors has increased in the last 10 years. In 2001, the average no. of authors was 2.5. In 2013 it was 2.9.
• We have 2087 observations of a death event (49%).
• 2% difference between deaths from individuals who did PhD in top university vs others.
on top of Pandas, includes non-parametric estimators, regression & various utility functions.

Survival Analysis in R

survival, KMsurv, OIsurv, randomForestSRC

So many… https://cran.r-project.org/web/views/Survival.html
the form of the hazard function) and one or more covariates. Most common are linear-like models for the log hazard:

$$h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})$$

These have been superseded by the Cox Model, which leaves the baseline hazard function $\alpha(t) = \log h_0(t)$ unspecified:

$$h_i(t) = h_0(t) \exp(\beta_1 x_{i1} + \cdots + \beta_k x_{ik})$$

This is an estimate of the hazard function at time t given a baseline hazard that's modified by a set of covariates. To estimate the model, Cox introduced a method called ‘partial likelihood’ to make the problem tractable.

Proportional Hazards (Cox Model)

regression
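Two consequences of the Cox model are easy to show in code: the baseline hazard cancels in any ratio of hazards, and the partial likelihood depends only on the coefficients. The sketch below is an illustration under simplifying assumptions (no tied event times, plain Python rather than lifelines); the function names are hypothetical.

```python
import math


def hazard_ratio(beta, x_a, x_b):
    """h_a(t) / h_b(t) under the Cox model; the baseline h0(t) cancels out."""
    return math.exp(sum(b * (xa - xb) for b, xa, xb in zip(beta, x_a, x_b)))


def log_partial_likelihood(beta, durations, observed, covariates):
    """Cox log partial likelihood, assuming there are no tied event times.

    For each observed event, compare the individual's risk score with the
    summed scores of everyone still at risk at that time.
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(durations, observed)):
        if not e_i:
            continue  # censored individuals only appear in risk sets
        risk_set = [math.exp(scores[j])
                    for j, t_j in enumerate(durations) if t_j >= t_i]
        total += scores[i] - math.log(sum(risk_set))
    return total


# A coefficient of log(2) on a binary covariate doubles the hazard.
print(hazard_ratio([math.log(2)], [1], [0]))
# Invented toy data with one binary covariate.
print(log_partial_likelihood([0.0], [1, 2, 3], [1, 1, 1], [[0], [1], [0]]))
```

Fitting the model amounts to maximising this quantity over the coefficients; no value of the baseline hazard is ever needed, which is what makes the problem tractable.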
used, one can use an additive hazards model, as even if effects of covariates are proportional, the coefficient of proportionality can change over time. Aalen's Additive model is defined as follows:

$$\lambda(t) = Y(t) \cdot \alpha(t)$$

where the matrix $Y(t)$ is constructed using the risk information and the vector $\alpha(t)$ contains the important regression information: a baseline function and the influence of the respective covariates.

Additive Hazards Model

regression
return_type='dataframe')
X['T'] = dat['delta']
X['E'] = dat['observed']
del X['Intercept']

from lifelines import CoxPHFitter

cx = CoxPHFitter(normalize=False)
cx.fit(X, duration_col='T', event_col='E', show_progress=True, include_likelihood=True)

Baseline is given by individuals who had 2-4 papers during their PhD with 1-2 authors in their first paper.
being the baseline, we find that an individual starting in 2010 has 2X hazard rate.

Adding gender: With female being the baseline, we find that a male has 0.91X hazard rate.

Building a model with gender, coauthors bucket, pub_average bucket & university (for top 20):
• Baseline of Cambridge U, Female, no. of authors in first paper 0-2, coauthors bucket (2,3.4] and no. of papers during PhD (2,4].
• Harvard, Osaka same hazard; Oxford, SISSA Trieste, Sao Paulo less, and Vienna, Durham, Max Planck more hazard.
• Individuals with 3.3,4 coauthors have 0.7X hazard rate w.r.t. baseline.
• The more co-authors in first paper, the lower the hazard rate w.r.t. baseline.
• The more papers during PhD, the lower the hazard rate w.r.t. baseline.
to evaluate our model, which is a generalisation of the area under the ROC curve (AUC), so it measures how well the model discriminates between different responses. Consider all possible pairs of patients, at least one of whom has died.

RIP Joffrey
• Predicted Survival Time Jon > Predicted Survival Time Joffrey: Concordant with outcome
• Predicted Survival Time Dany < Predicted Survival Time Joffrey: Anti-concordant with outcome
an average concordance index of 0.69. NB: 0.5 is the expected result for random predictions and 1.0 is perfect concordance. A good model has concordance scores larger than 0.6.
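The pairwise comparison behind the concordance index can be written out directly. This is a simplified sketch, not the lifelines routine: the function name is hypothetical, pairs with equal observed times are skipped for brevity, and tied predictions count as half.

```python
def concordance_index(durations, observed, predicted):
    """Fraction of comparable pairs whose predicted ordering matches the outcome.

    A pair is comparable when the individual with the shorter observed time
    actually experienced the event; otherwise the true ordering is unknown.
    """
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        for j in range(i + 1, n):
            if durations[i] == durations[j]:
                continue  # ties in observed time are skipped in this sketch
            first, second = (i, j) if durations[i] < durations[j] else (j, i)
            if not observed[first]:
                continue  # shorter time is censored, so the pair is unusable
            comparable += 1
            if predicted[first] < predicted[second]:
                concordant += 1.0  # model ranks the earlier death as shorter
            elif predicted[first] == predicted[second]:
                concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy data: the third individual is censored.
durations = [1, 2, 3, 4]
observed = [1, 1, 0, 1]
print(concordance_index(durations, observed, [1, 2, 3, 4]))
print(concordance_index(durations, observed, [4, 3, 2, 1]))
```

A model that ranks every comparable pair correctly scores 1.0, one that ranks every pair the wrong way round scores 0.0, and constant predictions score 0.5.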
categorical covariates with many levels, lifelines in Python & standard libraries in R are too slow and/or are unable to realistically train the model (can't determine partial likelihood of Cox Model). We can, however, use generalised additive models to estimate survival probabilities. A Poisson GAM models the quantities that are treated semi-parametrically in the Cox model (e.g. the baseline hazard) with functions estimated by penalised regression.

Argyropoulos http://bit.ly/1TsZWkv
were developed by Hastie and Tibshirani in 1986, where the prediction is captured by a sum of smooth functions:

$$g(E(Y)) = \alpha + s_1(x_1) + \cdots + s_k(x_k)$$

where $g$ is a link function that links the expected value to the predictor variables $x_1, x_2, \cdots, x_k$, and $s_1(x_1), s_2(x_2), \cdots, s_k(x_k)$ are smooth non-parametric functions (fully fixed by the underlying data - splines).

http://multithreaded.stitchfix.com/blog/2015/07/30/gam/

Survival probabilities

$$\log h(t_{i,j}) = \lambda(t_{i,j}) + x \cdot \beta$$

The log-baseline hazard is estimated along with other quantities (e.g. the log hazard ratios) by:
duration data. In Random Forests, randomisation is introduced by randomly drawing a bootstrapped sample of data to grow the tree and then growing it by splitting nodes on randomly selected predictors.

Bonus points:
• Only three parameters to set: number of randomly selected predictors, number of trees to grow and splitting rule.
• No assumptions. Unnecessary to demand proportional hazards, etc.

Random Survival Forests are implemented through the library randomForestSRC. Training a forest yields an ensemble estimate for the cumulative hazard function.

Random Survival Forests
http://bit.ly/1SSWYd7
year 9, for an individual there is a 50% chance he/she will still be in academia; this is roughly equivalent to a PhD and 1-2 postdocs.

Lots of papers during PhD are a good indicator of survival
Publish or die!

Be a good networker
Many collaborations = many papers + opportunities

Choosing the right place for a PhD helps
Facilitate creation of networks (?)

Data consistent with trends of women in STEM
various fields

Good Python Library with estimators & regressors
Thanks Cam Davidson-Pilon!

More stuff available in R
Above + GAMs, Random Survival Forests

Other approaches - Bayesian
Tutorial on the subject yesterday!