Work Log 08/09

•  Recurrent Analysis Introduction
•  Python(, R) Tutorial Lab
•  Future Plan
2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University
Slides by Liang Bo Wang

recurrentR Progress
•  R Package “recurrentR” Progress
•  Proposed Recurrent Model
•  Introduction to Survival / Recurrent Analysis

Introduction to Survival Analysis
•  Outcome variable
    –  Time until an event occurs
    –  Can be both discrete and continuous
        •  Whether an event happens: {0, 1}
        •  Time for an event to happen: 1.5 weeks, 4 months, …
        •  Ex. failure (death)
    –  Recurrent (e.g. marriage): more than one event per observation
•  Need a way to characterize
excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd

Survival Property - Censorship
•  Study ends – no event
•  Lost to follow-up
•  Withdraws
excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd

Censorship Assumption
•  Relation Between Event and Censorship
    –  Independent
    –  Random
        •  Random censoring implies independent censoring, but the converse does not hold
    –  Non-informative
        •  The distribution of the time to event T gives no information about the distribution of the time to censorship C
        •  T, C: random variables
excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd

Censorship Assumption (cont’d)
•  Many of the analytic techniques rely on the assumption of independent censoring
    –  Kaplan–Meier survival estimation
    –  Log rank test
    –  Cox model
•  But this is not always the case in the real world
•  If a person finds out they are negative during an AIDS cohort follow-up study,
    –  they may be more likely to quit the study
    –  That’s dependent and informative censoring
    –  Healthy people tend to be censored
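As a rough illustration of the first technique listed above, here is a minimal Kaplan–Meier sketch in plain Python. The function name `kaplan_meier` and the toy data are hypothetical and not from the slides. The estimate is only meaningful under the independent censoring assumption discussed above.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- observed time for each subject (event time or censoring time)
    events -- 1 if the event was observed at that time, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    surv, curve = 1.0, []
    for t in sorted(deaths):
        # Subjects still under observation just before t (not yet failed or censored).
        at_risk = sum(1 for u in times if u >= t)
        surv *= 1.0 - deaths[t] / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data: 6 subjects, times in weeks; 0 marks a censored subject.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects never trigger a drop in the curve, but they stay in the risk set until their censoring time. That is why informative censoring, where dropping out is related to the outcome, biases the estimate.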

The Informative Censorship Model

[Slide: pages 41–42 of an article in Biometrics, March 2010 (running head "Recurrent Event Data Analysis"), showing Section 3, "Estimation Procedure": 3.1 "Estimation of β" with Theorem 1, and 3.2 "Estimation of Λ0 and γ" with Lemma 1. Two formulas from the excerpt:]
•  β is estimated by maximizing the pairwise pseudolikelihood, which does not involve the nonparametric component $\lambda_0$:
   $\prod_{i<k} \prod_{j=1}^{m_i} \prod_{l=1}^{m_k} \left[ \frac{1}{1 + \exp\{\rho_{ik}(t_{ij}, t_{kl})\beta\}} \right]^{\delta_{ijkl}}$, where $\rho_{ik}(t, u) = X_i(u) + X_k(t) - X_i(t) - X_k(u)$ and $\delta_{ijkl} = 1$ if $(t_{ij}, t_{kl})$ is a comparable pair, and 0 otherwise
•  $F(t) = \Lambda_0(t) / \Lambda_0(\tau)$ is estimated by a weighted product-limit estimator:
   $\hat{F}(t; \hat{\beta}) = \prod_{s_{(l)} > t} \left( 1 - \frac{d_{(l)}(\hat{\beta})}{R_{(l)}(\hat{\beta})} \right)$, where $d_{(l)}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} I(t_{ij} = s_{(l)}) \exp\{-X_i(t_{ij})\beta\}$ and $R_{(l)}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} I(t_{ij} \le s_{(l)} \le Y_i) \exp\{-X_i(t_{ij})\beta\}$
Lots of MATH and PROOF!!
GOAL! R Package for Use!

Model the Event Occurrence by Inhomogeneous Poisson Random Process
$X \sim \mathrm{Poisson}(\lambda)$ if $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k = 0, 1, \ldots$

•  Count the events in a time interval [a, b]
•  Intensity of the event: $\lambda$
    –  Also the parameter of the Poisson distribution: $E(X) = \lambda$, $Var(X) = \lambda$
•  The number of events that occur in the time interval [a, b], $N_a^b$, is
    –  a random variable
•  A sequence of events whose counts follow a Poisson distribution is called a Poisson process
    –  Homogeneous: constant $\lambda$, so $N_a^b \sim \mathrm{Poisson}((b - a)\lambda)$
    –  Inhomogeneous: $\lambda(t)$, a function of time t, so $N_a^b \sim \mathrm{Poisson}\left(\int_a^b \lambda(u)\,du\right)$
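One common way to generate events from an inhomogeneous Poisson process is thinning: draw candidates from a homogeneous process with a rate that bounds the intensity, then keep each candidate with probability λ(t) / bound. The sketch below is illustrative only and is not taken from the recurrentR package. The intensity `exp(sin(t))`, the window [0, 2π], and all function names are assumptions made for the example.

```python
import math
import random


def intensity(t):
    # Assumed intensity, chosen only for illustration; it is bounded above by e.
    return math.exp(math.sin(t))


def simulate_events(intensity, lam_max, a, b, rng):
    """Simulate event times on [a, b] by thinning.

    Requires 0 <= intensity(t) <= lam_max for every t in [a, b].
    """
    events, t = [], a
    while True:
        t += rng.expovariate(lam_max)  # next candidate from a homogeneous process
        if t > b:
            return events
        if rng.random() * lam_max <= intensity(t):  # keep with prob. intensity(t) / lam_max
            events.append(t)


def integrate(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h


rng = random.Random(2013)
a, b = 0.0, 2 * math.pi
counts = [len(simulate_events(intensity, math.e, a, b, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
expected = integrate(intensity, a, b)  # integral of the intensity over [a, b]
print(mean_count, var_count, expected)
```

If the simulation is right, the event count on [a, b] is Poisson with parameter equal to the integral of the intensity, so both the sample mean and the sample variance of `counts` should be close to `expected`.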

Progress
•  For simulation, generate data given an intensity (lambda) function
    –  Validate that the simulation follows the desired random process
•  Use our model to characterize the sample and obtain an estimated intensity function
    –  Verify that the estimate closely fits the true intensity function
    –  Obtain the variance or the confidence interval by bootstrap
•  Solve the model equation (done by my partner)

quantile-quantile plot
[Figure: quantile-quantile plot for a function of t involving $e^{\sin(t)}$]
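A quantile-quantile check can be sketched as follows. Given the number of events, the event times of an inhomogeneous Poisson process on [0, T0] are i.i.d. with distribution function Λ(t) / Λ(T0), so the sorted event times can be compared with the quantiles of that distribution. Everything here is an assumption for illustration: the intensity `exp(sin(t))`, the window `T0 = 2π`, and the `sample`, which is drawn by rejection and only stands in for pooled event times from a simulation.

```python
import math
import random

T0 = 2 * math.pi  # assumed end of the observation window


def intensity(t):
    # Assumed intensity, for illustration only; bounded above by e.
    return math.exp(math.sin(t))


def cumulative(t, steps=400):
    """Lambda(t): midpoint-rule integral of the intensity over [0, t]."""
    h = t / steps
    return sum(intensity((k + 0.5) * h) for k in range(steps)) * h


TOTAL = cumulative(T0)


def theoretical_quantile(p):
    """Solve Lambda(t) / Lambda(T0) = p for t by bisection."""
    lo, hi = 0.0, T0
    for _ in range(40):
        mid = (lo + hi) / 2
        if cumulative(mid) / TOTAL < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_event_time(rng):
    """Draw one time with density intensity(t) / Lambda(T0) by rejection sampling."""
    while True:
        t = rng.uniform(0.0, T0)
        if rng.random() * math.e <= intensity(t):
            return t


rng = random.Random(0)
sample = sorted(sample_event_time(rng) for _ in range(5000))
probs = [k / 10 for k in range(1, 10)]
qq_pairs = [(theoretical_quantile(p), sample[int(p * len(sample))]) for p in probs]
max_gap = max(abs(theory - observed) for theory, observed in qq_pairs)
for theory, observed in qq_pairs:
    print(round(theory, 3), round(observed, 3))
```

Plotting `qq_pairs` gives the Q-Q plot. Points close to the diagonal suggest that the event times follow the intended process.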

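$f_i(t) = \frac{\lambda_i(t)}{\Lambda_i(T_0)} = \cdots = \frac{\lambda_0(t)}{\Lambda_0(T_0)}$
$F(t) = \int_0^t f(u)\,du$
$\Lambda(t) = \int_0^t \lambda(u)\,du$
$\hat{F}(t) = \prod_{s_{(l)} > t} \left( 1 - \frac{d_{(l)}}{N_{(l)}} \right)$
$\hat{\Lambda}(T_0) = \frac{1}{n} \sum_{i=1}^{n} m_i \hat{F}^{-1}(Y_i)$
[Figure: two panels, each labeled with a function of t involving $e^{\sin(t)}$]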
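The two estimators on this slide, the product-limit estimate of F and the estimate of Λ(T0), can be sketched in plain Python. This is an illustrative reading, not the recurrentR implementation. It assumes that d_(l) is the number of events at the l-th distinct event time s_(l), that N_(l) is the number of events with t_ij ≤ s_(l) ≤ Y_i, and that F̂⁻¹(Y_i) means the reciprocal 1 / F̂(Y_i). The function names and the toy data are hypothetical.

```python
def estimate_F(subjects):
    """Product-limit estimate of F for right-truncated event times.

    subjects -- list of (Y_i, [t_i1, ..., t_im_i]) with every t_ij <= Y_i
    Returns a function F_hat(t).
    """
    pairs = [(t, y) for y, ts in subjects for t in ts]
    s = sorted({t for t, _ in pairs})  # distinct event times s_(l)
    d = [sum(1 for t, _ in pairs if t == s_l) for s_l in s]  # events at s_(l)
    n_risk = [sum(1 for t, y in pairs if t <= s_l <= y) for s_l in s]  # risk set size

    def F_hat(t):
        prod = 1.0
        for s_l, d_l, n_l in zip(s, d, n_risk):
            if s_l > t:
                prod *= 1.0 - d_l / n_l
        return prod

    return F_hat


def estimate_total(subjects, F_hat):
    """Lambda_hat(T0): average of m_i / F_hat(Y_i); subjects without events add 0."""
    terms = [len(ts) / F_hat(y) if ts else 0.0 for y, ts in subjects]
    return sum(terms) / len(subjects)


# Hypothetical toy data: every subject is followed up to Y_i = 10.
subjects = [(10, [2, 5]), (10, [3]), (10, [5, 7, 9])]
F_hat = estimate_F(subjects)
total = estimate_total(subjects, F_hat)
print(F_hat(5), total)
```

With a common follow-up time there is no truncation to correct for, so `F_hat` reduces to the empirical distribution of the pooled event times and `total` to the mean number of events per subject, which makes a convenient sanity check. With very small risk sets `F_hat(Y_i)` can be zero; real code would need to handle that case.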

Estimate the Variance by Bootstrap
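The bootstrap idea can be sketched as follows: resample subjects with replacement, recompute the estimate on each resample, and use the spread of those estimates as the variance, or their percentiles as a confidence interval. The estimator here is the mean event count per subject, used only as a stand-in for the model's estimator. The data and function names are hypothetical.

```python
import random
import statistics


def bootstrap(subjects, estimator, n_boot, rng):
    """Re-estimate on n_boot resamples of the subjects, drawn with replacement."""
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(subjects) for _ in subjects]
        estimates.append(estimator(resample))
    return estimates


def mean_count(sample):
    # Stand-in estimator: mean number of events per subject.
    return sum(sample) / len(sample)


# Hypothetical data: number of recurrent events observed for each of 10 subjects.
event_counts = [2, 1, 3, 0, 4, 2, 1, 3, 2, 2]

rng = random.Random(2013)
estimates = sorted(bootstrap(event_counts, mean_count, 2000, rng))
boot_variance = statistics.variance(estimates)
# Percentile interval: 2.5th and 97.5th percentiles of the bootstrap estimates.
ci_95 = (estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates)) - 1])
print(boot_variance, ci_95)
```

Resampling whole subjects, rather than individual events, keeps each subject's events together, which matters when events within a subject are correlated.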

Keep Working

Future Plan
•  Integrated Platform Design Plan
•  UI Mockup

UI Mockup

Integrated Platform Design Plan
•  A bioinformatics analysis workflow:
    –  Heavy processing and analysis (Tophat, Cufflinks, …)
    –  Results are small in size
    –  Interpretation
•  Which parts are we going to provide?
•  Possible Solution
    –  Combine with Galaxy
    –  Focus on one part first
•  TBD

Python(, R) Tutorial for Lab
•  Plan & Schedule

Python, R Tutorial for Lab
•  Current Plan
    –  Each session takes 40 min to 1 hour in total, divided into parts A and B
    –  Each part contains a 10 min talk and 10-20 min of practice
•  Covers
    –  Basic Unix commands
    –  Python language + practical usage (main focus)
    –  R, depending on feedback and progress
[Diagram: A (Talk, Practice), B (Talk, Practice)]

Schedule
•  Twice a week, or as the attendees prefer
•  Starts next week … I hope
•  Slides will be published
    –  hope they can be passed on to successors in this lab
    –  some examples:
        •  http://ccwang002.github.io/python-tutorial-slides/
        •  http://ccwang002.github.io/ggplot2-tutorial/

source: http://elements.eaglegenomics.com/, CC2.0 BY-NC-SA

Infographic: The World’s Technology Hubs
source: MIT Tech Reviews

Regions around the world are competing to be the next centers of technological innovation. In a global map, we grade eight of them. (Infographic: The World’s Technology Hubs, source: MIT Tech Reviews)
