Unlocking a national adult cardiac surgery audit registry with R

Graeme Hickey
July 11, 2013

Presented at useR! 2013 (The R User Conference), University of Castilla-La Mancha, Albacete, Spain


  1. Unlocking a national adult cardiac surgery audit registry with R

     GL Hickey (1,2,3), SW Grant (2,3) & B Bridgewater (1,2,3)
     (1) Northwest Institute of BioHealth Informatics, University of Manchester
     (2) University Hospital of South Manchester
     (3) National Institute of Cardiovascular Outcomes Research, UCL
     The R User Conference 2013, University of Castilla-La Mancha, Albacete, Spain

  3. Bristol Inquiry

     Contributory factors that led to the failings included:
     1. Inadequate collection of data
     2. Inadequate monitoring of data
  4. National Adult Cardiac Surgery Audit registry

     • Up to 166 clinical variables collected on each patient: administrative, demographics, comorbidities, operative factors, outcomes
     • 15 years of data
     • 465,000 records
     • 44 hospitals + >400 consultant surgeons
  5. Flow of data

     [Diagram labels: HOSPITALS; DATABASE; CLEANING; ANALYSES; NICOR; NIBHI; AUDIT & GOVERNANCE TOOLS; CLINICAL RESEARCHERS; RESEARCH; NATIONAL DEATH REGISTER*]
     * Ability to link with many other national registries
     [Report cover shown: The Society for Cardiothoracic Surgery in Great Britain & Ireland, Sixth National Adult Cardiac Surgical Database Report 2008, "Demonstrating quality". Prepared by Ben Bridgewater PhD FRCS and Bruce Keogh KBE DSc MD FRCS FRCP on behalf of the Society for Cardiothoracic Surgery in Great Britain & Ireland; Robin Kinsman BSc PhD and Peter Walton MA MB BChir MBA, Dendrite Clinical Systems]

  7. Cleaning the registry in R

     [Diagram labels: DATA EXTRACT; VARIABLE 1, VARIABLE 2, VARIABLE 3, …; EXCLUDE RECORDS; ADD VALUE; CLEANED DATA]
     • Script per each variable
     • Some dependencies
     • Exclude records: e.g. duplicates
     • Scripts to add: risk scores; combined variables; 'resolve' conflicting variables
     • Rapidly reproducible
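The "script per variable, with some dependencies" layout above might be sketched along these lines. This is only an illustration: the column names, the cleaning rules, and the `cleaners` list are all hypothetical, since the slides do not show the registry's actual scripts.

```r
# Toy extract; every column name and value here is invented for illustration.
extract <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  age    = c(64, 71, 71, 999, 58),
  gender = c("M", "female", "female", "F", "m"),
  stringsAsFactors = FALSE
)

# One cleaning function per variable. Each takes and returns the whole data
# frame, so a later step can depend on an earlier one.
clean_age <- function(d) {
  d$age[d$age < 18 | d$age > 110] <- NA   # implausible ages become missing
  d
}
clean_gender <- function(d) {
  g <- toupper(substr(d$gender, 1, 1))    # "female", "F", "m" -> "F"/"M"
  d$gender <- factor(g, levels = c("F", "M"))
  d
}
drop_duplicates <- function(d) d[!duplicated(d$id), , drop = FALSE]
add_elderly_flag <- function(d) {
  d$elderly <- !is.na(d$age) & d$age >= 70  # depends on clean_age
  d
}

# The list order encodes the dependencies:
# clean variables, exclude records, then add value.
cleaners <- list(clean_age, clean_gender, drop_duplicates, add_elderly_flag)
cleaned  <- Reduce(function(d, f) f(d), cleaners, extract)
print(cleaned)
```

Keeping each step as a function of the whole data frame means the full chain can be re-run on every new extract, which is one way to get the "rapidly reproducible" property.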
  8. > with(SCTS, table(X4.04.Discharge.Destination, X4.05.Status.at.Discharge))

                                                    X4.05.Status.at.Discharge
      X4.04.Discharge.Destination                         0. Alive 1. Dead
                                                      828    48296    2453
      . Another dept within the trust                   0       57       0
      0                                                 1        1       0
      0. Not applicable - patient deceased              0        0       1
      1 Home                                            0     4104       0
      1. Home                                         674   370763     374
      2 Convalescence                                   0       63       0
      2. Convalescence                                  8     7347       4
      2. Convalescence (Non acute Hospital)             2     2164       0
      3 Other hospital                                  0        1       0
      3 Other Hospital                                  0      151       0
      3 Other Hospital - wd 6                           0        1       0
      3 Other Hospital wd 2                             0        1       0
      3 Other ward                                      0        1       0
      3. Other Acute hospital                           1     7680       1
      3. Other hospital                               115    22935      37
      4 Patient deceased                                0        0     173
      4. Not applicable - patient deceased             51      412   13286
      4. Patient Deceased                               0        0      19
      5                                                 0        7       0
      5. Transferred to different Consultant - NGH      0       42       0
      7                                                 0        2       0
      8                                                 0       38       4
      9                                               114     3820     518
      Second op                                         0        2       6

     Annotations: Illegal options; Transcriptional discrepancies; Missing data; Conflicts
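One way the problems annotated on this table could be handled is sketched below: collapse the transcriptional variants onto a small set of categories with a regular expression, treat illegal codes as missing, and flag records where destination and status disagree. The mapping rule (trust the leading digit) and the statuses attached to each label are assumptions for illustration, not the registry's actual rules.

```r
# A few raw labels taken from the table above, paired with invented statuses.
dest   <- c("1 Home", "1. Home", "2. Convalescence (Non acute Hospital)",
            "3 Other Hospital - wd 6", "4. Patient Deceased", "9", "",
            "4. Not applicable - patient deceased")
status <- c("0. Alive", "1. Dead", "0. Alive", "0. Alive", "1. Dead",
            "0. Alive", "", "0. Alive")

# Assumed rule: the leading digit carries the intended category.
# Anything that does not start with a digit becomes NA.
code <- suppressWarnings(as.integer(sub("^ *([0-9]).*$", "\\1", dest)))

# Hypothetical set of legal options; codes outside it (e.g. 9) map to NA.
legal <- c("1" = "Home", "2" = "Convalescence", "3" = "Other hospital",
           "4" = "Patient deceased")
dest_clean <- unname(legal[as.character(code)])

# Conflict: destination says deceased but status says alive, or vice versa.
dead_by_dest   <- dest_clean == "Patient deceased"
dead_by_status <- status == "1. Dead"
conflict <- !is.na(dead_by_dest) & status != "" &
  dead_by_dest != dead_by_status

print(data.frame(dest, status, dest_clean, conflict))
print(table(dest_clean, status, useNA = "ifany"))
```

Flagging conflicts rather than silently fixing them leaves the decision on how to resolve each one with a clinical expert.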
  9. Cleaning the registry in R

     • Errors are difficult to find and not all can be resolved
     • Excluding all imperfect data is not an option
     • Balance between a 'research ready' dataset and robust audit capability
     • Needs to be reproducible
     • Without being cleaned, it is locked to clinicians & researchers
  10. Warning: cleaning clinical registries without experts is dangerous*

      * Applies to analysing healthcare data also
      [Image equation: (image) + (image) = DATA]

  12. Publication of named healthcare provider outcomes

      [Figure: mortality rate (0% to 15%) against number of procedures (0 to 800), one point per healthcare provider; panels: Crude and RAMR; legend lists healthcare providers by numeric identifier]
      http://www.scts.org/patients/
  13. Publication of named healthcare provider outcomes

      FILTER DATA: subset
      RISK ADJUSTMENT: glm, glmer {lme4}, mfp {mfp}, predict, auc {pROC}
      CLASSIFICATION & PRESENTATION: ggplot {ggplot2}, write.csv
      AGGREGATION: summaryBy {doBy}, merge, arrange {plyr}
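The stages named on this slide could be chained roughly as follows. This is a sketch on simulated data: the covariates, the coefficients, and the RAMR definition used here (observed / expected × overall mortality) are assumptions, and base `aggregate` stands in for `summaryBy {doBy}` so that the example has no package dependencies.

```r
set.seed(1)
n <- 5000
d <- data.frame(
  hospital  = sample(paste0("H", 1:10), n, replace = TRUE),
  age       = round(rnorm(n, 67, 10)),
  emergency = rbinom(n, 1, 0.1),
  year      = sample(2008:2011, n, replace = TRUE)
)
# Simulated outcome; the coefficients are invented.
d$died <- rbinom(n, 1, plogis(-7 + 0.05 * d$age + 1.2 * d$emergency))

# FILTER DATA
recent <- subset(d, year >= 2009)

# RISK ADJUSTMENT: logistic model, then each patient's predicted risk
fit <- glm(died ~ age + emergency, family = binomial, data = recent)
recent$expected <- predict(fit, type = "response")

# AGGREGATION: procedures, observed deaths and expected deaths per hospital
recent$n <- 1
agg <- aggregate(cbind(n, died, expected) ~ hospital, data = recent, FUN = sum)

# Crude rate and (assumed definition) risk-adjusted mortality rate
agg$crude <- agg$died / agg$n
agg$ramr  <- agg$died / agg$expected * mean(recent$died)

# PRESENTATION: ordered table; write.csv(agg, "outcomes.csv") would export it
agg <- agg[order(agg$ramr), ]
print(agg, digits = 3)
```

A hierarchical model via `glmer {lme4}` would replace the `glm` call if hospital-level random effects were wanted; that variant is not shown here.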
  14. Exploratory analyses

      http://www.scts.org/DynamicCharts/
      summaryBy {doBy} + gvisMotionChart {googleVis}

  15. Monitoring medical devices

      • Currently does not happen in UK
      • Data: 200 valve types entered 13,000 ways (free text)
      • But R is good with regular expressions
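A minimal sketch of how regular expressions can collapse free-text valve entries onto canonical types. The example strings and the `patterns` dictionary are invented; the registry's real entries and rules are not shown in the slides.

```r
# Invented free-text entries standing in for the registry's valve field.
valve_text <- c("Carpentier Edwards Perimount 23mm", "CE perimount",
                "PERIMOUNT magna", "St Jude mech 25", "st. jude mechanical",
                "SJM Regent", "unknown")

# Hypothetical dictionary: canonical name -> regular expression.
patterns <- c(
  "Perimount" = "peri[- ]?mount",
  "St Jude"   = "st\\.? ?jude|\\bsjm\\b"
)

# Assign the first matching canonical name; unmatched entries stay NA.
classify <- function(x, patterns) {
  out <- rep(NA_character_, length(x))
  for (nm in names(patterns)) {
    hit <- is.na(out) &
      grepl(patterns[[nm]], x, ignore.case = TRUE, perl = TRUE)
    out[hit] <- nm
  }
  out
}

valve <- classify(valve_text, patterns)
print(table(valve, useNA = "ifany"))
```

Entries left as `NA` are the ones to review by hand and, where a new spelling turns up, to cover with an extended pattern.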

  17. Evidence based medicine

      Octogenarians having Mitral Valve Surgery ± CABG ± TV repair over 10-year window
      Mean 4 patients per unit / year
      [Figure: Kaplan-Meier curve, "All octogenarians having MV surgery"; survival probability (0.0 to 1.0) against time from procedure (0 to 10 years)]
      No. at risk, years 0 to 10: 1415, 991, 779, 559, 398, 276, 180, 114, 64, 23, 6
      survfit + Surv {survival}; kmplot {by Tatsuki Koyama}
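The `survfit` + `Surv` combination named on this slide can be tried on simulated data as below. The hazard rate, censoring scheme, and sample size are invented, and the `kmplot` function credited on the slide is not reproduced here.

```r
library(survival)  # recommended package, shipped with standard R installs

set.seed(42)
n <- 300
time_to_death <- rexp(n, rate = 0.15)    # invented hazard, in years
censor_time   <- runif(n, 0, 10)         # follow-up ends within 10 years
time  <- pmin(time_to_death, censor_time)
event <- as.integer(time_to_death <= censor_time)  # 1 = death observed

# Kaplan-Meier estimate for the whole cohort
fit <- survfit(Surv(time, event) ~ 1)

# Numbers at risk and survival at yearly intervals, as under a KM plot
s <- summary(fit, times = 0:10, extend = TRUE)
print(data.frame(year = s$time, n.risk = s$n.risk,
                 survival = round(s$surv, 2)))
```

Calling `plot(fit, xlab = "Time from procedure (years)", ylab = "Survival probability")` would draw the curve with its confidence band.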
  18. Contemporary statistical methodology for retrospective data

      [Figure: back-to-back distributions of the propensity score (probability of receiving a mechanical valve, 0.0 to 0.9) for mechanical and biological valves; panels: Unmatched and Matched]
      matchit {MatchIt}
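To show the idea behind the unmatched and matched panels without depending on an installed package, the sketch below estimates a propensity score with `glm` and then does a simple greedy 1:1 nearest-neighbour match in base R. This is a simplified stand-in, not what `matchit {MatchIt}` does internally; the data, the caliper of 0.05, and the single covariate are all assumptions.

```r
set.seed(7)
n <- 400
age <- rnorm(n, 65, 10)
# Assumed mechanism: younger patients more often receive a mechanical valve.
mechanical <- rbinom(n, 1, plogis(6 - 0.1 * age))
d <- data.frame(age, mechanical)

# Propensity score: estimated probability of receiving a mechanical valve
d$ps <- fitted(glm(mechanical ~ age, family = binomial, data = d))

# Greedy 1:1 nearest-neighbour matching on the score, without replacement
treated  <- which(d$mechanical == 1)
controls <- which(d$mechanical == 0)
caliper  <- 0.05                      # maximum allowed score difference
pairs <- list()
for (i in treated) {
  if (length(controls) == 0) break
  gap <- abs(d$ps[controls] - d$ps[i])
  j <- which.min(gap)
  if (gap[j] <= caliper) {
    pairs[[length(pairs) + 1]] <- c(treated = i, control = controls[j])
    controls <- controls[-j]          # each control is used at most once
  }
}
m <- do.call(rbind, pairs)
matched <- d[c(m[, "treated"], m[, "control"]), ]

# Crude standardised mean difference in age, before and after matching
smd <- function(x, g) (mean(x[g == 1]) - mean(x[g == 0])) / sd(x)
cat("Matched pairs:", nrow(m), "\n")
cat("Age SMD before:", round(smd(d$age, d$mechanical), 2),
    " after:", round(smd(matched$age, matched$mechanical), 2), "\n")
```

A shrinking standardised mean difference after matching is the numerical counterpart of the two score distributions overlapping in the matched panel.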
  19. Risk prediction: status quo

      [Figure: mortality proportion (2% to 10%) against date of surgery (2002 to 2010); series: Observed and Expected; lines: Actual, Overall average, Trend]
      Annotations: Ratio = 0.37; Ratio = 0.73
  20. Risk prediction: with R

      [Paper shown: McCormick TH, Raftery AE, Madigan D, Burd RS. Dynamic Logistic Regression and Dynamic Model Averaging for Binary Classification. Biometrics 68, 23–30, March 2012. DOI: 10.1111/j.1541-0420.2011.01645.x. From the summary: "We propose an online binary classification procedure for cases when there is uncertainty about the model to use and parameters within a model change over time."]
      [Figure: intercept coefficient estimate (−6.00 to −5.25) with 95% CI against time (2002 to 2010); methods compared: No update; Rolling 24-month window (12-months); Rolling 24-month window (1-month); Piecewise recalibration (12-months); Piecewise recalibration (24-months); Dynamic logistic regression]
      logistic.dma {dma}
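Two of the update strategies compared in this figure, "no update" and a rolling 24-month window refitted every 12 months, can be imitated in base R as below. The data are simulated with an assumed steady fall in baseline mortality, and the sketch does not use `logistic.dma {dma}`, so it shows only the simpler comparators, not dynamic logistic regression itself.

```r
set.seed(3)
month <- rep(1:96, each = 200)        # 8 years, 200 procedures per month
risk  <- rnorm(length(month))         # stand-in for a risk score (logit scale)
# Assumed drift: the true intercept falls steadily over time.
true_intercept <- -3 - 0.01 * month
died <- rbinom(length(month), 1, plogis(true_intercept + risk))
d <- data.frame(month, risk, died)

# "No update": fit once on the first 24 months and never refit.
frozen <- glm(died ~ risk, family = binomial, data = subset(d, month <= 24))

# Rolling 24-month window, refitted every 12 months.
starts <- seq(1, 96 - 23, by = 12)
rolling <- t(sapply(starts, function(s) {
  w <- subset(d, month >= s & month < s + 24)
  f <- glm(died ~ risk, family = binomial, data = w)
  c(window_start = s, intercept = unname(coef(f)[1]))
}))
print(round(rolling, 2))

# Observed / expected deaths in the final year under the frozen model
last <- subset(d, month > 84)
oe <- sum(last$died) / sum(predict(frozen, newdata = last, type = "response"))
cat("O/E in final year, no update:", round(oe, 2), "\n")
```

With drift like this, the frozen model over-predicts deaths in later years (observed / expected below 1), while the rolling intercept follows the decline.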

  22. Conclusions

      • We need to unlock healthcare registries to:
        § Monitor quality & avoid a repeat of Bristol
        § Revalidation of professional credentials
        § Facilitate patient choice
        § Develop & validate evidence based medicine
        § Increase in demand
      • We can do it all in R!
  23. Comments & suggestions

      graeme.hickey@manchester.ac.uk

      Acknowledgements
      • Funded by Heart Research UK [Grant Number RG2583]
      • Dr Norman Stein, North West e-Health