Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Work Log 08/09

Liang Bo Wang
August 08, 2013
54

Work Log 08/09

Liang Bo Wang

August 08, 2013
Tweet

Transcript

  1. Work Log 08/09 Recurrent Analysis Introduction Python(, R) Tutorial Lab

    Future Plan 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  2. recurrentR Progress R Package “recurrentR” Progress Proposed Recurrent Model Introduction

    to Survival / Recurrent Analysis 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  3. Introduction to Survival Analysis •  Outcome variable –  Time until

    an event occurs –  Be both discrete and continuous •  Whether an event happens [0, 1] •  Time for an event to happen 1.5 weeks, 4 months, … •  Ex. failure (death) –  Recurrent: (e.g. marriage) >1 events per obs •  Need a way to characterize 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd
  4. Survival Property - Censorship •  Study ends – no event

    •  Lost to follow-up •  Withdraws 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd
  5. Censorship Assumption •  Relation Between Event and Censorship –  Independent

    –  Random •  Random implies independency, but not the case reversely –  Non-informative •  Distribution of time-event T gives no information about distribution of time- censorship C •  T, C: Random variables 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  excerpt from David G. Kleinbaum, Survival Analysis: A Self-Learning Text, 3rd
  6. Censorship Assumption (cont’d) •  Many of the analytic techniques rely

    on the assumption of independent censoring –  Kaplan–Meier survival estimation –  Log rank test –  Cox model •  But this is not the case in real world •  If a person find oneself negative during a AIDS cohort trace study, –  one may be more likely to quit the study –  That’s a dependent and informative censoring –  Healthy person tends to be censored 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  7. The Informative Censorship Model 2013.09.01 Bioinformatics and Biostatistics Core, NTU

    Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  8. 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine,

    National Taiwan University Slides by Liang Bo Wang  42 Biometrics, March 2010 The score function derived from the logged pairwise pseudo- likelihood can be expressed as i< k I(t Yik , u Yik ) × − exp{ρik (t, u) β} 1 + exp{ρik (t, u) β} ρik (t, u) dNi (t) dNk (u), where Yik = Yi ∧ Yk . Recall that Di and Dk denote the ob- served data of the ith and the kth subject. Define the function h(Di , Dk ; β) = I(t Yik , u Yik ) × − exp{ρik (t, u) β} 1 + exp{ρik (t, u) β} ρik (t, u) dNi (t) dNk (u), and denote the score function by Sp (β) = 1 (n 2 ) i< k h(Di , Dk ; β). It is easy to see that h is permutation symmetric in its ar- guments and Sp is a U-statistic with the kernel h(·, ·). If β is the true parameter value, it can be shown that the score function Sp (β) = 0. Applying the projection method devel- oped by Hoeffding (1948) and under the assumption that X(t) is bounded by M, we can show that score function √ nSp (β) converges to a normal distribution with mean 0 and variance– covariance V 1 = 4E{h(D1, D2 ; β) h(D1, D3 ; β) }. We then study the asymptotic properties of ˆ β using delta method. The large sample properties of ˆ β are stated in Theorem 1, with proofs given in the Appendix. Theorem 1: Assume that X(t) is bounded by M and E[N(τ)] < ∞. Let ˆ β be the solution of Sp (β) = 0. Then ˆ β is a consistent estimator of β. Furthermore, √ n(ˆ β − β) → N(0, V ), where V = V −1 2 V 1 V −1 2 with V 1 = 4 E{h(D1 , D2 ; β) h(D1 , D3 ; β) } and V 2 = −E{∂h(D1 , D2 ; β)/∂β}. Note that the variance covariance matrix V can be esti- mated by ˆ V −1 2 ˆ V 1 ˆ V −1 2 , where ˆ V 1 = 4 n n i=1 1 n −1 2 i< j < k h(Di , Dj ; ˆ β)h(Di , Dk ; ˆ β) , and ˆ V 2 = − 1 (n 2 ) i< k ∂h(Di , Dk ; ˆ β) ∂β . Also note that based on the score function Sp , a test statistic for testing the hypothesis β = 0 can be formulated based on 1 (n 2 ) i< k Ni (Yik )Nk (Yik ) Yi k 0 {Xi (u) − Xk (u)} × dNi (u) Ni (u) − dNk (u) Nk (u) . Interestingly, this test statistic for the effects of time- dependent covariates does not require information about time-independent covariates, and hence is useful for check- ing proportionality assumption in the usual proportional rate model. 3.2 Estimation of Λ0 and γ By definition πi (t) can be considered as the distribution function of a biased sample from the distribution F (t) = Λ0 (t)/Λ0 (τ), where the observations are sampled with a prob- ability proportional to exp{Xi (t) β}. It is easy to see that, under the assumption (1), the conditional likelihood (2) is computationally equivalent to the likelihood of a set of inde- pendent random variables, where the data are a biased sample from distribution function F(t) with sampling weight propor- tional to exp{Xi (t) β} and are right truncated by Yi . Thus event times in the risk set are observed with different sam- pling probabilities, where the probabilities are proportional to exp{Xi (t) β}. If β is known, the probability structure of the risk set can be recovered by using the inverse probability weighting technique. Following the spirit of the Breslow esti- mator of the cumulative hazard in the Cox model, we modify the truncation product-limit estimator (Wang et al., 1986) as follows to estimate F by assigning each event in the risk set a weight that is proportional to the inverse of the sampling weight function: sl > t 1 − dl (β) Rl (β) , where dl (β) = 1 n n i=1 m i j =1 I(tij = sl ) exp{−Xi (tij ) β} and Rl (β) = 1 n n i=1 m i j =1 I(tij sl Yi ) exp{−Xi (tij ) β}. Note that β = 0 implies that the assigned weight is a unit weight and the proposed estimator reduces to the product-limit esti- mator that maximizes the nonparametric likelihood function for truncated data. By replacing β with ˆ β, we estimate F with ˆ F (t; ˆ β) = sl > t 1 − dl (ˆ β) Rl (ˆ β) . The large sample properties of ˆ F (t) are stated in Lemma 1, with proofs given in the Appendix. Lemma 1: Assume that (a) Λ0 (τ) > 0, (b) P r(Y > τ, Z > 0) > 0, and (c) G(u) = E{ZI(Y u) exp(W γ)} is a con- tinuous function for u ∈ [0, τ]. For inf{y : Λ0 (y) > 0} < t τ, √ n{ ˆ F (t) − F (t)} converges weakly to a normal distribution with mean 0 and variance 4F (t)2 E{κ(D1 , D2 ; t, β) κ(D1 , D3 ; t, β)}, where κ is defined in the Appendix. To estimate the regression parameters of time-independent covariates, we note that, conditioning on {Zi , Yi , W i , Xi (Yi )}, the expected value of mi is given by E{mi |Zi , Yi , W i , Xi (Yi )} = Zi Λ0 (τ) exp{W i γ} I(Yi u) exp{Xi (u) β} dF (u). Thus, following the assumption E[Zi | W i , Xi (τ)] = 1 and by double expectation, we have E        mi Yi 0 exp{Xi (u) β} dF (u) | W i        = Λ0 (τ) exp(W i γ). Recurrent Event Data Analysis 41 Assumption (1) implies that the occurrence of recurrent events follows a proportional intensity model, where the unobserved frailty Zi inflates/deflates the intensity. Because of the memoryless property of a Poisson process, conditional on Zi , the rate function equals the intensity function of the re- current event process. Under (1) and E[Zi | W i , Xi (τ)] = 1, the rate function of event occurrence at time t in a random population is given by λ0 (t) exp{Xi (t) β + W i γ}. Thus (1) implies the proportional rate model for recurrent event data studied by Lin et al. (2000) and many others. The proposed model also reduces to the semiparametric model studied in Wang et al. (2001) in the absence of time-dependent covari- ate Xi (t), and is in line with the model for case series data studied in Farrington and Whitaker (2006) in the absence of time-independent covariate Wi . 3. Estimation Procedure 3.1 Estimation of β We denote by mi the number of recurrent events that occurred before time Yi and ti1, . . . , tim i the observed event times for subject i. For ease of notation, we use mi and tij , i = 1, 2, . . . , n, j = 1, 2, . . . , mi , to denote either random variables or realized values. Let Di denote the observed data of the ith sub- ject, that is, Di = {(Yi , W i , Xi (Yi ), mi , (ti1, . . . , tim i )}. As- sume that {(Zi , Xi (·), W i , Yi , Ni (·)); i = 1, . . . , n} are independent and identically distributed (i.i.d.), so that the {D1 , . . . , Dn } are also i.i.d. For estimation of the regression parameter β, an initial at- tempt might apply the conditional technique used in Wang et al. (2001) to eliminate nuisance parameters from the likeli- hood. Under (1) the event times (ti1, . . . , tim i ) of the ith sub- ject conditional on (Zi , Yi , mi , W i , X(Yi )) are order statistics of a set of i.i.d. random variables with the density function Zi λ0 (t) exp{Xi (t) β + W i γ} Yi 0 Zi λ0 (u) exp{Xi (t) β + W i γ}du , 0 t Yi . Note that both the unobserved frailty Zi and the time- independent covariates Wi are eliminated from the condi- tional density function. The conditional likelihood based on all subjects is proportional to n i=1 m i j =1 λ0 (t) exp{Xi (tij ) β} Yi 0 λ0 (u) exp{Xi (u) β}du = n i=1 m i j =1 dπi (tij ) πi (Yi ) , (2) where dπi (t) is the shape function of the recurrent event pro- cess given by dπi (t) = λ0 (t) exp{Xi (t) β} τ 0 λ0 (u) exp{Xi (u) β}du I(0 t τ), and πi (t) = t 0 dπi (u). Note that πi defines a proper distribu- tion function with πi (τ) = 1. When the recurrent event model only includes time- independent covariates, the conditional density function fur- ther reduces to λ0 (t)/Λ0 (Yi ), which is in the form of a trun- cated density function. It is easy to see that for this special case the conditional likelihood depends only on λ0 and is com- putationally equivalent to the nonparametric likelihood of in- dependently right-truncated data. Hence the reduced condi- tional likelihood is maximized by the product-limit estimator for independently right-truncated data (Wang, Jewell, and Tsai, 1986). In the presence of time-varying covariates, how- ever, the conditional likelihood (2) involves both the paramet- ric component Xi (t) β and the nonparametric component λ0 . Maximizing the conditional likelihood function is a challenge because the integral in the denominator of the conditional likelihood does not have a closed form with λ0 unspecified. Motivated by Liang and Qin (2000) and Kalbfleisch (1978), we propose an alternative estimation procedure for β that does not involve the nonparametric component λ0 , and hence has the advantage of computational convenience. Because (2) is computationally equivalent to the semipara- metric likelihood of a set of independently right-truncated random variables, we can reformulate the problem as esti- mating the regression parameter β using the data {tij , i = 1, . . . , n, j = 1, . . . , mi }, where tij is an observed event time with the distribution function πi and is subject to inde- pendent right truncation Yi . The pairwise pseudolikelihood method considered by Liang and Qin (2000), however, can- not be applied directly to truncated data: event times are not necessarily comparable because they are subject to different truncation times. The observation of tij is subject to the con- straint tij Yi , hence any two event times tij and tkl , i = k, are comparable if tij belongs to the observation interval of tkl and tkl belongs to the observation interval of tij . These con- straints amount to tij Yi ∧ Yk and tkl Yi ∧ Yk , where ∧ denotes minimum. For any two event times tij and tkl , let δijkl = 1 if (tij , tkl ) is a comparable pair, and 0 otherwise. We condition on hav- ing observed the values {tij , tkl } for a given pair, but without knowing the order. We refer to this as conditioning on the order statistics of (tij , tkl ). The conditional distribution is de- generate at the observed values if (tij , tkl ) are not comparable. By conditioning on the order statistics of (tij , tkl ) and δijkl = 1, the pairwise pseudolikelihood of (tij , tkl ), i < k, is given by dπi (tij )dπk (tk l ) dπi (tij )dπk (tk l ) + dπi (tk l )dπk (tij ) = exp[{Xi (tij ) + Xk (tk l )} β] exp[{Xi (tij ) + Xk (tk l )} β + exp[{Xi (tk l ) + Xk (tij )} β] = 1 1 + exp{ρik (tij , tk l ) β} , (3) where ρik (t, u) = Xi (u) + Xk (t) − Xi (t) − Xk (u). Interest- ingly, the pairwise pseudolikelihood depends on the regression parameter β but not the nonparametric component λ0 . Hence β can be estimated by maximizing the pairwise pseudolikeli- hood i< k m i j =1 m k l=1 1 1 + exp{ρik (tij , tk l ) β} δi j k l . Lots of MATH and PROOF!! GOAL! R Package for Use!
  9. Model the Event Occurrence by Inhomogeneous Poisson Random Process 2013.09.01

    Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  X Poisson( ) if P(X = k) = ke k! for k = 0, 1, . . .
  10. •  Count the event in a time interval [a, b]

    •  Intensity of the event: –  Also the parameter for the Poisson distribution •  In the time interval, # events occurred in [a, b] be –  a random variable –  •  The sequence of events forming Poisson distribution is called a Poisson Process –  Homogeneous: constant –  Inhomogeneous: , a function of time t 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  E(X) = , V ar(X) = a b N (t) a b N Poisson((b a) ) a b N Poisson( b a (u)du)
  11. Progress •  For simulation, generate data given a intensity (lambda)

    function –  Validate if the simulation follows the desired random process •  Use our model to characterize the sample, obtain a sampled intensity function –  Verify if they are closely fitted –  Obtain the variance or the confident interval by bootstrap •  Solve the model equation (done by my partner) 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  12. quantile-quantile plot 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of

    Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  (t) = esin(t) 1
  13. 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine,

    National Taiwan University Slides by Liang Bo Wang  fi(t) = i(t) i(T0) = · · · = 0(t) 0(T0) F(t) = t 0 f(u)du (t) = t 0 (u)du ˆ F(t) = s(l)>t (1 d(l) N(l) ) ˆ(T0) = 1 n n i=1 mi ˆ F 1(Yi) (t) = esin(t) 1 (t) = esin(t) 1
  14. Estimate the Variance by Bootstrap 2013.09.01 Bioinformatics and Biostatistics Core,

    NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  15. Keep Working 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of

    Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  16. Future Plan Integrated Platform Design Plan UI Mockup 2013.09.01 Bioinformatics

    and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  17. UI Mockup 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of

    Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  18. Integrated Platform Design Plan •  A bio info analysis workflow:

    –  Heavy process and analysis (Tophat, Cufflinks, …) –  Results in small size –  Interpretation •  Which parts are we going to provide? •  Possible Solution –  Combine with Galaxy –  Focus on one part first •  TBD 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  19. Python(, R) Tutorial for Lab Plan & Schedule 2013.09.01 Bioinformatics

    and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  20. Python, R Tutorial for Lab •  Current Plan –  Each

    time total 40 min – 1 hour, divided into A / B parts –  One part contains 10 min talk and 10-20 min practice •  Covers –  Basic Unix command –  Python Language + Practical Usage (mainly focus here) –  R depends on the feedback and progress 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  A B Talk Practice Talk Practice
  21. Schedule •  2 times a week or up to the

    attendees •  start in next week … I hope •  slides will be published –  hope they can be passed to successors of this lab –  some examples like •  http://ccwang002.github.io/python-tutorial-slides/ •  http://ccwang002.github.io/ggplot2-tutorial/ 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang 
  22. 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine,

    National Taiwan University Slides by Liang Bo Wang  source: http://elements.eaglegenomics.com/, CC2.0 BY-NC-SA
  23. 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine,

    National Taiwan University Slides by Liang Bo Wang  source: http://elements.eaglegenomics.com/, CC2.0 BY-NC-SA
  24. Infographic: The World’s Technology Hubs 2013.09.01 Bioinformatics and Biostatistics Core,

    NTU Center of Genomic Medicine, National Taiwan University Slides by Liang Bo Wang  source: MIT Tech Reviews
  25. 2013.09.01 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine,

    National Taiwan University Slides by Liang Bo Wang  source: MIT Tech Reviews Regions around the world are competing to be the next centers of technological innovation. In a global map, we grade eight of them. Infographic: The World’s Technology Hubs