Slide 1

Slide 1 text

UNDERSTANDING MISSING DATA USING CART AND BRT MODELS
ASC-IMS 2014
Authors: Nicholas Tierney, Dr. Jegar Pitchforth, Prof. Kerrie Mengersen
Special Thanks: Dr. Fiona Harden, Dr. Maurice Harden

Slide 2

Slide 2 text

OUTLINE
• Aims of research
• Motivating example
• Methods
• Detect structure in missing data
• CART
• BRT
• Results
• Discussion + Conclusion
• Current Research
• Future Methods

Slide 3

Slide 3 text

AIMS OF MY RESEARCH
DATA:
• BMI
• Lung Function
• Cholesterol
• Age
• Gender
• Blood and Urine
AIMS:
• Identify risk factors
• Identify risk groups

Slide 4

Slide 4 text

THE DATA (63% MISSING)
[Figure: variables shown: r_code, ex_per_week, waist, bsl, pulse1, height, weight, rpt_visit, k10, etoh, ess, LAeq, crs, hdl, chol, fev1_perc, fvc_perc, fvc_pred, fev1_pred, fev1_fvc, fev1, fvc, bmi, age, code, smok, dias, sys, bhl, conc, sex, seg_s, seg_p, co, miss_perc, date, type, uin, site]

Slide 5

Slide 5 text

HAVING THIS MUCH MISSING DATA IS A PROBLEM.
Validity?
Deletion methods = too much omission → bias?
Imputation?
[Figure: same variables as on Slide 4]
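The concern about deletion methods can be sketched with a toy example. The data frame below is invented for illustration (only the column names echo the slides); it shows how listwise deletion with base R's `na.omit()` discards every row that has at least one missing value, even when most of that row was observed.

```r
# Toy data; column names echo the slides, values are invented for illustration.
toy <- data.frame(
  bmi  = c(22, NA, 23, 25, NA),
  chol = c(1.4, 1.2, NA, 1.6, NA),
  age  = c(19, 18, 30, 45, 52)
)

# Listwise deletion keeps only rows with no missing values at all.
complete <- na.omit(toy)
n_dropped <- nrow(toy) - nrow(complete)

cat("Rows kept:", nrow(complete), "of", nrow(toy), "\n")
cat("Rows dropped:", n_dropped, "\n")
```

If the dropped rows differ systematically from the kept ones, any estimate computed on `complete` may be biased, which is the question the slide raises.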

Slide 6

Slide 6 text

KNOWN AND UNKNOWN MISSINGNESS
[Figure: same variables as on Slide 4]

Slide 7

Slide 7 text

KNOWN AND UNKNOWN MISSINGNESS
UIN     AGE   BMI   N-Test
1-ABC   21    22    -
1-ABC   21    24    -
1-ABC   21    23    -
2-HJK   45    25    8
2-HJK   46    26    9

Slide 8

Slide 8 text

AIM
Detect missingness structure in a dataset.
DV = Proportion of missingness
IV = All other variables
CART and BRT – Novel approach!
• CART & BRT can handle many variables
• CART provides interpretability
• BRT provides robustness
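A minimal sketch of how the dependent variable could be built, assuming "proportion of missingness" means the share of missing values in each row. The toy data frame and its values are invented; `miss_perc` is the column name used elsewhere in the slides.

```r
# Toy data frame; the column names and values are illustrative only.
toy <- data.frame(
  age  = c(21, 21, NA, 45),
  bmi  = c(22, NA, NA, 25),
  fev1 = c(3.1, NA, NA, 2.8)
)

# Proportion of missing values in each row: this plays the role of the DV.
# is.na() gives a logical matrix, and rowMeans() averages TRUE/FALSE per row.
toy$miss_perc <- rowMeans(is.na(toy))

print(toy)
```

The remaining columns would then serve as the independent variables in a model of `miss_perc`.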

Slide 9

Slide 9 text

CART
James et al. (2013)

Slide 10

Slide 10 text

RESULTS - CART
## Run the model
library(rpart)
cart.small <- rpart(miss_perc ~ X1…Xp,
                    data = data,
                    na.action = na.rpart,
                    method = "anova")
## Plot the model
library(rpart.plot)  # provides prp()
prp(cart.small, extra = 1, type = 4, prefix = "Prop. Miss = ")
[Figure: CART tree. Split labels: type = 1; type = 2; rpt_visi = 1; 2,3,4,5,6; 3,4,5,6; 2,3,4,5,6,7,8. Node values: Prop. Miss = 0.67 (n=7915); 0.26 (n=1504); 0.77 (n=6411); 0.64 (n=1349); 0.37 (n=65); 0.65 (n=1284); 0.8 (n=5062)]
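A runnable sketch of the same kind of model on simulated data, since the slide's formula elides the predictors. The data, the effect sizes, and the choice of `type` and `rpt_visit` as the only predictors are assumptions made for the demo, not the study's data or results. The tree is printed as text rather than plotted, so only `rpart` is needed.

```r
library(rpart)

set.seed(1)
n <- 200

# Simulated predictors (illustrative only).
sim <- data.frame(
  type      = factor(sample(1:2, n, replace = TRUE)),
  rpt_visit = sample(1:6, n, replace = TRUE)
)

# Simulated proportion missing: low for type 1, higher for type 2,
# and highest for type 2 after the first visit (assumed pattern).
base <- ifelse(sim$type == 1, 0.25,
               ifelse(sim$rpt_visit == 1, 0.65, 0.80))
sim$miss_perc <- base + runif(n, -0.05, 0.05)

# Regression tree ("anova") with proportion missing as the response.
fit <- rpart(miss_perc ~ type + rpt_visit,
             data = sim,
             na.action = na.rpart,
             method = "anova")

print(fit)
```

Each printed node reports its split rule, the number of rows, and the mean of `miss_perc`, which corresponds to the "Prop. Miss" labels in the plotted tree.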

Slide 11

Slide 11 text

CART: LIMITATIONS
Difficulty modelling linear relationships.
Sensitive to small changes in the data.
Lower predictive accuracy than other methods.

Slide 12

Slide 12 text

BRT

Slide 13

Slide 13 text

RESULTS - BRT
tree.tc5.lr01 <- gbm.step(data = data,
                          tree.complexity = 5,
                          learning.rate = 0.01,
                          bag.fraction = 0.5)
Variable    Value
BMI         26.3
FEV1        20.3
FVC         15.6
FVCPred     11.3
FEV1Pred    9.5
Type        4.2
FEV1%       2
Sys         1.7
Smok        1.7
FVCperc     1.0
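A hedged sketch of a boosted regression tree fit on simulated data. It calls `gbm()` from the gbm package directly instead of `gbm.step()`, so the number of trees is fixed at an arbitrary 500 rather than chosen by cross-validation. It assumes that `tree.complexity` and `learning.rate` correspond to gbm's `interaction.depth` and `shrinkage`. The data and the link between BMI and missingness are invented for illustration.

```r
library(gbm)

set.seed(42)
n <- 500

# Simulated predictors (illustrative only).
sim <- data.frame(
  bmi  = rnorm(n, mean = 26, sd = 4),
  fev1 = rnorm(n, mean = 3.5, sd = 0.6),
  age  = round(runif(n, 18, 65))
)

# Assumed pattern: rows with more extreme BMI have more missing data.
raw <- 0.3 + 0.04 * abs(sim$bmi - 26) + rnorm(n, sd = 0.03)
sim$miss_perc <- pmin(1, pmax(0, raw))

brt <- gbm(miss_perc ~ bmi + fev1 + age,
           data = sim,
           distribution = "gaussian",
           n.trees = 500,          # fixed here; an arbitrary choice
           interaction.depth = 5,  # assumed analogue of tree.complexity
           shrinkage = 0.01,       # assumed analogue of learning.rate
           bag.fraction = 0.5)

# Relative influence of each predictor, scaled to sum to 100.
influence <- summary(brt, plotit = FALSE)
print(influence)
```

The `rel.inf` column ranks predictors by their contribution to the fitted model, which is one way a ranked list of variables with values like the one on the slide can be produced.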

Slide 14

Slide 14 text

RESULTS: BRT

Slide 15

Slide 15 text

DISCUSSION: DECISION TREES
Reveal important known and unknown missingness structure
CART: Influence of Type and repeated visit on missingness
BRT: Influence of extreme values

Slide 16

Slide 16 text

CONCLUSION
CART and BRT models have been helpful for our data.
They will help us continue working towards our aims:
• Identify risk factors, groups, and individuals.

Slide 17

Slide 17 text

CURRENT WORK: DEALING WITH MISSING DATA
[Figure: CART tree. Split labels: type = 1; type = 2; rpt_visi = 1; 2,3,4,5,6; 3,4,5,6; 2,3,4,5,6,7,8. Node values: Prop. Miss = 0.67 (n=7915); 0.26 (n=1504); 0.77 (n=6411); 0.64 (n=1349); 0.37 (n=65); 0.65 (n=1284); 0.8 (n=5062)]
Subsetting based upon CART model.
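One possible way to subset data by a fitted CART model is to group rows by the terminal node (leaf) they fall into. The sketch below is an assumption about how such subsetting could be done, not the authors' procedure. It uses the `where` component of an `rpart` fit on simulated data.

```r
library(rpart)

set.seed(1)
n <- 200

# Simulated data (illustrative only): type 2 rows have more missing data.
sim <- data.frame(
  type      = factor(sample(1:2, n, replace = TRUE)),
  rpt_visit = sample(1:6, n, replace = TRUE)
)
sim$miss_perc <- ifelse(sim$type == 2, 0.75, 0.25) + runif(n, -0.05, 0.05)

fit <- rpart(miss_perc ~ type + rpt_visit, data = sim, method = "anova")

# fit$where gives, for each row of the data, the row of fit$frame
# that holds the leaf the observation ended up in.
subsets <- split(sim, fit$where)

# Size and mean proportion missing of each leaf-based subset.
sapply(subsets, function(d) c(n = nrow(d), mean_miss = mean(d$miss_perc)))
```

Each element of `subsets` is a data frame of rows sharing a leaf, so later analyses can be run separately on groups with similar missingness.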

Slide 18

Slide 18 text

CURRENT WORK: PCA
[Figure: Scree plot with proportion of variance explained. x-axis: components in order of extraction (1 to 8); y-axis: eigenvalue. Proportion of variance explained by component: 29.74, 20.66, 14.43, 10.99, 9.9, 7.98, 4.34, 1.96]
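The quantities behind a scree plot can be sketched with base R's `prcomp()`. The eight simulated variables below are invented for illustration and are unrelated to the study's data. Eigenvalues are the squared component standard deviations, and the proportion of variance explained is each eigenvalue's share of their total.

```r
set.seed(3)
n <- 300

# Eight simulated variables; the first four share a common factor.
common <- rnorm(n)
sim <- as.data.frame(matrix(rnorm(n * 8), nrow = n, ncol = 8))
sim[, 1:4] <- sim[, 1:4] + common

# PCA on standardised variables.
pca <- prcomp(sim, scale. = TRUE)

eigenvalues <- pca$sdev^2
prop_var <- 100 * eigenvalues / sum(eigenvalues)

scree <- data.frame(
  component  = seq_along(eigenvalues),
  eigenvalue = round(eigenvalues, 2),
  prop_var   = round(prop_var, 2)
)
print(scree)
```

Plotting `eigenvalue` against `component` gives the scree plot, and `prop_var` supplies the percentage labels.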

Slide 19

Slide 19 text


Slide 20

Slide 20 text

FUTURE WORK
Explore cluster membership over time
Predict the Pr(High Risk) using MLM
Inform our industry to help prevent illness early.

Slide 21

Slide 21 text

REFERENCES
1. Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychol Methods 2002;7:147–77. doi:10.1037//1082-989X.7.2.147
2. Brick J, Kalton G. Handling missing data in survey research. Stat Methods Med Res 1996;5:215–38. doi:10.1177/096228029600500302
3. Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol 2009;60:549–76. doi:10.1146/annurev.psych.58.110405.085530
4. Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. J Am Stat Assoc 1988;83:1198–202.
5. Rubin DB. Inference and missing data. Biometrika 1976;63:581–92. http://biomet.oxfordjournals.org/
6. Howell D. Statistical Methods for Psychology. Cengage Learning 2012.
7. Breiman L, Friedman J, Stone CJ, et al. Classification and regression trees. CRC press 1984.
8. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer 2009. http://www.springerlink.com/index/D7X7KX6772HQ2135.pdf (accessed 5 May 2014).
9. James G, Witten D, Hastie T, et al. An introduction to statistical learning. Springer 2013. http://link.springer.com/content/pdf/10.1007/978-1-4614-7138-7.pdf (accessed 5 May 2014).
10. Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. J Anim Ecol 2008;77:802–13. doi:10.1111/j.1365-2656.2008.01390.x

Slide 22

Slide 22 text

REFERENCES
11. Therneau TM, Atkinson EJ. An introduction to recursive partitioning using the RPART routines. 1997.
12. Friedman JH, Meulman JJ. Multiple additive regression trees with application in epidemiology. Stat Med 2003;22:1365–81. doi:10.1002/sim.1501
13. Breiman L. Comment-Statistical Modeling: The Two Cultures. Stat Sci 2001;16:199–231. http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Statistical+Modeling+:+The+Two+Cultures#2 (accessed 30 Dec 2013).
14. R Core Team. R: A Language and Environment for Statistical Computing. 2013. http://www.r-project.org/
15. RStudio. RStudio: Integrated development environment for R.
16. Therneau T, Atkinson B, Ripley B. rpart: Recursive Partitioning. 2013. http://cran.r-project.org/package=rpart
17. Ridgeway G. gbm: Generalized Boosted Regression Models. 2013. http://cran.r-project.org/package=gbm
18. Honaker J, King G, Blackwell M. AMELIA II: A Program for Missing Data. 2013;:1–54.

Slide 23

Slide 23 text

ACKNOWLEDGEMENTS
Thanks to:
Dr. Nicole White; Dr. Jegar Pitchforth; Prof. Kerrie Mengersen
Dr. Fiona Harden; Dr. Maurice Harden
Australian Postgraduate Award (APA)
ATN Industrial Doctoral Training Centre (IDTC)
Australian Research Council

Slide 24

Slide 24 text

QUESTIONS?

Slide 25

Slide 25 text

EXTRA SLIDES

Slide 26

Slide 26 text

EXPLORATORY ANALYSES
ID   FEV1%   BMI         AGE         CHOL
1    85%     25          19          1.4
2    90%     23          18          -missing-
3    89%     -missing-   -missing-   -missing-
FEV1% = β0 + β1 BMI + β2 AGE + β3 CHOL + ε
FEV1% = β0 + β1 BMI + β2 AGE + ε
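One reading of the two models above is that the set of usable rows depends on which predictors are included. The sketch below uses simulated data with invented coefficients, so it does not reproduce any study result. It fits both models with `lm()`, which by default drops any row with a missing value in a variable the model uses.

```r
set.seed(11)
n <- 50

# Simulated predictors and response (illustrative only).
sim <- data.frame(
  bmi  = rnorm(n, mean = 25, sd = 3),
  age  = round(runif(n, 18, 60)),
  chol = rnorm(n, mean = 1.4, sd = 0.2)
)
sim$fev1_perc <- 95 - 0.3 * (sim$bmi - 25) - 0.1 * (sim$age - 18) +
  rnorm(n, sd = 3)

# Make cholesterol missing for the first 20 rows.
sim$chol[1:20] <- NA

# Model with CHOL: rows with missing chol are silently dropped.
fit_full <- lm(fev1_perc ~ bmi + age + chol, data = sim)

# Model without CHOL: all rows are usable.
fit_reduced <- lm(fev1_perc ~ bmi + age, data = sim)

c(full = nobs(fit_full), reduced = nobs(fit_reduced))
```

The two fits are therefore estimated on different subsets of the data, which matters when comparing them.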

Slide 27

Slide 27 text

UNKNOWN MISSINGNESS
UIN   AGE   BMI   FEV1%
25A   -     22    75%
25A   21    24    79%
25A   21    23    83%

Slide 28

Slide 28 text

INFER MISSINGNESS STRUCTURE
UIN      BMI   FEV1%   FVC%
2-HIJ    -     75%     80%
2-HIJ    -     70%     60%
3-LMNO   -     75%     65%

UIN      BMI   FEV1%   FVC%
4-QRS    20    90%     92%
4-QRS    21    91%     95%
5-XYZ    24    85%     85%

Labels: p, P’

Slide 29

Slide 29 text

INFER MISSINGNESS STRUCTURE
GENDER   PRESENCE                          ABSENCE
BMI      Observed Count (Expected Count)   Observed Count (Expected Count)
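The observed-versus-expected layout above matches a chi-square test of independence between a variable and the presence or absence of BMI. The sketch below assumes that is the intended test. The data are simulated so that BMI is missing more often for one gender, and the probabilities are invented.

```r
set.seed(7)
n <- 400

# Simulated gender and BMI; BMI is missing more often for "M" (assumed).
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
p_missing <- ifelse(gender == "M", 0.6, 0.2)
bmi <- ifelse(runif(n) < p_missing, NA, rnorm(n, mean = 26, sd = 4))

# Presence/absence indicator for BMI.
bmi_status <- factor(ifelse(is.na(bmi), "absence", "presence"),
                     levels = c("presence", "absence"))

# Observed counts, then the test (which also returns expected counts).
tab <- table(gender, bmi_status)
test <- chisq.test(tab)

print(tab)                      # observed counts
print(round(test$expected, 1))  # expected counts under independence
print(test$p.value)
```

A small p-value suggests that whether BMI is recorded depends on gender. Repeating the test across variables would give a table of "variables affected" for each presence/absence indicator.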

Slide 30

Slide 30 text

RESULTS
Presence / Absence: Variables Affected
• BMI: Date, Age, SYS, DIAS, HDL, CRS, BHL, Missing%, FEV1/FVC, FEV1%, Site, Type, SEG (P), Code, SEG (S), Repeat Visit, Smoking, Sex
• FEV1, FVC, FEV1 / FVC: As for BMI, plus Exercise per week
• Concentration: UIN, Date, Missing%, Site, Type, SEG (P), SEG (S)