Understanding Missing Data: The Use of Classification and Regression Trees, and Boosted Regression Trees

UNDERSTANDING MISSING DATA USING CART AND BRT MODELS
ASC-IMS 2014
Authors: Nicholas Tierney, Dr. Jegar Pitchforth, Prof. Kerrie Mengersen
Special Thanks: Dr. Fiona Harden, Dr. Maurice Harden

OUTLINE
•  Aims of research
•  Motivating example
•  Methods
•  Detect structure in missing data
•  CART
•  BRT
•  Results
•  Discussion + Conclusion
•  Current Research
•  Future Methods

AIMS OF MY RESEARCH
DATA:
•  BMI
•  Lung Function
•  Cholesterol
•  Age
•  Gender
•  Blood and Urine

AIMS:
•  Identify risk factors
•  Identify risk groups

THE DATA (63% MISSING)
[Figure: plot of the data by variable. Variables shown: r_code, ex_per_week, waist, bsl, pulse1, height, weight, rpt_visit, k10, etoh, ess, LAeq, crs, hdl, chol, fev1_perc, fvc_perc, fvc_pred, fev1_pred, fev1_fvc, fev1, fvc, bmi, age, code, smok, dias, sys, bhl, conc, sex, seg_s, seg_p, co, miss_perc, date, type, uin, site]

HAVING THIS MUCH MISSING DATA IS A PROBLEM.
Validity?
Deletion methods = too much omission → bias?
Imputation?
[Figure: the same plot of the data by variable, r_code through site]

KNOWN AND UNKNOWN MISSINGNESS
[Figure: the same plot of the data by variable, r_code through site]

KNOWN AND UNKNOWN MISSINGNESS
UIN      AGE   BMI   N-Test
1-ABC    21    22    -
1-ABC    21    24    -
1-ABC    21    23    -
2-HJK    45    25    8
2-HJK    46    26    9

AIM
Detect missingness structure in a dataset.
DV = Proportion of missingness
IV = All other variables
CART and BRT: Novel approach!
•  CART & BRT can handle many variables
•  CART provides interpretability
•  BRT provides robustness
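The dependent variable named on this slide, the proportion of missing values per record, can be derived directly from a data frame. The following is a minimal sketch in base R; the data frame `toy` and its values are invented for illustration and are not the study data.

```r
# Toy data frame (hypothetical values, not the study data); NA marks a missing value
toy <- data.frame(
  age  = c(21, 21, NA, 45),
  bmi  = c(22, NA, NA, 25),
  fev1 = c(NA, NA, NA, 3.1)
)

# Proportion of missing values in each column, computed before adding the response
col_miss <- colMeans(is.na(toy))

# Proportion of missing values in each row: the response (DV) for the tree models
toy$miss_perc <- rowMeans(is.na(toy[, c("age", "bmi", "fev1")]))

print(toy)
print(col_miss)
```

Computing the row-wise proportion over the original columns only keeps the new `miss_perc` column from counting towards its own denominator.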

CART
[Figure from James et al. (2013)]

RESULTS - CART
    ## Run the model
    library(rpart)       # rpart(), na.rpart
    library(rpart.plot)  # prp()
    # "." uses all other variables in `data` as predictors (X1…Xp on the slide)
    cart.small <- rpart(miss_perc ~ .,
                        data = data,
                        na.action = na.rpart,
                        method = "anova")

    ## Plot the model
    prp(cart.small, extra = 1, type = 4, prefix = "Prop. Miss = ")

[Figure: fitted CART tree. Branch labels: type = 1; type = 2; rpt_visi = 1; 2,3,4,5,6; 3,4,5,6; 2,3,4,5,6,7,8. Node values: Prop. Miss = 0.67 (n = 7915); 0.26 (n = 1504); 0.77 (n = 6411); 0.64 (n = 1349); 0.37 (n = 65); 0.65 (n = 1284); 0.8 (n = 5062)]
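As a self-contained illustration of the kind of rpart() call shown on this slide, the sketch below fits a regression tree to simulated records in which missingness depends mainly on a `type` variable. The data frame `sim`, its column `noise`, and all values are hypothetical; only the general pattern (a proportion-of-missingness response modelled with method = "anova") follows the slide.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
n <- 500
# Simulated records (hypothetical): missingness depends mainly on `type`
sim <- data.frame(
  type      = factor(sample(1:2, n, replace = TRUE)),
  rpt_visit = factor(sample(1:4, n, replace = TRUE)),
  noise     = rnorm(n)
)
sim$miss_perc <- ifelse(sim$type == "1", 0.25, 0.75) + rnorm(n, sd = 0.05)

# Regression tree (method = "anova") for the proportion of missingness
fit <- rpart(miss_perc ~ ., data = sim, method = "anova")

# The first row of fit$frame is the root node; `var` names its split variable
root_split <- as.character(fit$frame$var[1])
print(root_split)
print(fit)
```

Because the simulated response is driven almost entirely by `type`, the tree's first split should recover that variable, which is the sense in which a tree can reveal missingness structure.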

CART: LIMITATIONS
Difficulty in modelling linear relationships.
Sensitive to small changes in the data.
Does not have the same level of predictive accuracy as other methods.

BRT

RESULTS - BRT
    library(dismo)  # provides gbm.step(), which builds on the gbm package
    # Call as shown on the slide. gbm.step() also needs gbm.x (predictor columns)
    # and gbm.y (response column), which the slide does not show.
    tree.tc5.lr01 <- gbm.step(data = data,
                              tree.complexity = 5,
                              learning.rate = 0.01,
                              bag.fraction = 0.5)

Variable    Value
BMI         26.3
FEV1        20.3
FVC         15.6
FVCPred     11.3
FEV1Pred     9.5
Type         4.2
FEV1%        2
Sys          1.7
Smok         1.7
FVCperc      1.0

RESULTS: BRT

DISCUSSION: DECISION TREES
Reveal important known and unknown missingness structure
CART: Influence of Type and repeated visit on missingness
BRT: Influence of extreme values

CONCLUSION
CART and BRT models have been helpful for understanding our data.
They will help us continue towards our aims:
•  Identify risk factors, groups, and individuals.

CURRENT WORK: DEALING WITH MISSING DATA
[Figure: CART tree splitting on type and rpt_visi. Node values: Prop. Miss = 0.67 (n = 7915); 0.26 (n = 1504); 0.77 (n = 6411); 0.64 (n = 1349); 0.37 (n = 65); 0.65 (n = 1284); 0.8 (n = 5062)]
Subsetting based upon CART model.

CURRENT WORK: PCA
[Figure: Scree plot with proportion of variance explained. x-axis: components in order of extraction (1 to 8); y-axis: eigenvalue. Proportion of variance explained (%): 29.74, 20.66, 14.43, 10.99, 9.9, 7.98, 4.34, 1.96]
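The quantities a scree plot displays (eigenvalues and the proportion of variance each component explains) can be computed with base R's prcomp(). This is a sketch on simulated data, not a reproduction of the analysis behind the slide; the matrix `x` and its dimensions are assumptions chosen only so that there are eight components.

```r
set.seed(42)
# Hypothetical complete numeric data: 100 rows, 8 variables
x <- matrix(rnorm(100 * 8), ncol = 8)
x[, 2] <- x[, 1] + rnorm(100, sd = 0.5)  # induce some correlation

# PCA on centred, scaled variables (equivalent to using the correlation matrix)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2                            # one eigenvalue per component
prop_var    <- 100 * eigenvalues / sum(eigenvalues)  # % of variance explained

print(round(eigenvalues, 2))
print(round(prop_var, 2))
```

Plotting `eigenvalues` against the component index with `plot(eigenvalues, type = "b")` would give the scree plot itself.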

FUTURE WORK
Explore cluster membership over time
Predict the Pr(High Risk) using MLM
Inform our industry to help prevent illness early.

REFERENCES
1  Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychol Methods 2002;7:147–77. doi:10.1037//1082-989X.7.2.147
2  Brick J, Kalton G. Handling missing data in survey research. Stat Methods Med Res 1996;5:215–38. doi:10.1177/096228029600500302
3  Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol 2009;60:549–76. doi:10.1146/annurev.psych.58.110405.085530
4  Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. J Am Stat Assoc 1988;83:1198–202.
5  Rubin DB. Inference and missing data. Biometrika 1976;63:581–92. http://biomet.oxfordjournals.org/
6  Howell D. Statistical Methods for Psychology. Cengage Learning 2012.
7  Breiman L, Friedman J, Stone CJ, et al. Classification and regression trees. CRC press 1984.
8  Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer 2009. http://www.springerlink.com/index/D7X7KX6772HQ2135.pdf (accessed 5 May 2014).
9  James G, Witten D, Hastie T, et al. An introduction to statistical learning. Springer 2013. http://link.springer.com/content/pdf/10.1007/978-1-4614-7138-7.pdf (accessed 5 May 2014).
10  Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. J Anim Ecol 2008;77:802–13. doi:10.1111/j.1365-2656.2008.01390.x

REFERENCES
11  Therneau TM, Atkinson EJ. An introduction to recursive partitioning using the RPART routines. 1997.
12  Friedman JH, Meulman JJ. Multiple additive regression trees with application in epidemiology. Stat Med 2003;22:1365–81. doi:10.1002/sim.1501
13  Breiman L. Comment-Statistical Modeling: The Two Cultures. Stat Sci 2001;16:199–231. http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Statistical+Modeling+:+The+Two+Cultures#2 (accessed 30 Dec 2013).
14  R Core Team. R: A Language and Environment for Statistical Computing. 2013. http://www.r-project.org/
15  RStudio. RStudio: Integrated development environment for R.
16  Therneau T, Atkinson B, Ripley B. rpart: Recursive Partitioning. 2013. http://cran.r-project.org/package=rpart
17  Ridgeway G. gbm: Generalized Boosted Regression Models. 2013. http://cran.r-project.org/package=gbm
18  Honaker J, King G, Blackwell M. AMELIA II: A Program for Missing Data. 2013:1–54.

ACKNOWLEDGEMENTS
Thanks to:
Dr. Nicole White; Dr. Jegar Pitchforth; Prof. Kerrie Mengersen
Dr. Fiona Harden; Dr. Maurice Harden

Australian Postgraduate Award (APA),
ATN Industrial Doctoral Training Centre (IDTC)
Australian Research Council.

QUESTIONS?

EXTRA SLIDES

EXPLORATORY ANALYSES
ID   FEV1 %   BMI         AGE         CHOL
1    85%      25          19          1.4
2    90%      23          18          -missing-
3    89%      -missing-   -missing-   -missing-

FEV1% = β0 + β1 BMI + β2 AGE + β3 CHOL + ε
FEV1% = β0 + β1 BMI + β2 AGE + ε
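The two regression equations on this slide use different sets of complete records, because a row is dropped whenever any variable in the model is missing. The sketch below shows this with lm() in base R; the data frame `d` and every value in it are hypothetical and chosen only to make the row counts easy to follow.

```r
# Hypothetical records; NA marks a missing value
d <- data.frame(
  fev1_perc = c(85, 90, 89, 78, 92, 81, 88, 95, 83, 87),
  bmi       = c(25, 23, NA, 30, 22, 28, 24, 21, 27, 26),
  age       = c(19, 18, NA, 40, 25, 35, 30, 22, 45, 33),
  chol      = c(1.4, NA, NA, 2.1, NA, 1.9, NA, 1.2, 2.0, 1.6)
)

# By default, lm() omits any row with a missing value in a model variable
full    <- lm(fev1_perc ~ bmi + age + chol, data = d)
reduced <- lm(fev1_perc ~ bmi + age, data = d)

n_full    <- nobs(full)     # rows with bmi, age and chol all observed
n_reduced <- nobs(reduced)  # rows with bmi and age observed
print(c(full = n_full, reduced = n_reduced))
```

In this toy example, adding CHOL to the model costs three further records, which is the kind of omission that deletion methods produce at scale.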

UNKNOWN MISSINGNESS
UIN   AGE   BMI   FEV1%
25A   -     22    75%
25A   21    24    79%
25A   21    23    83%

INFER MISSINGNESS STRUCTURE
UIN      BMI   FEV1%   FVC%
2-HIJ    -     75%     80%
2-HIJ    -     70%     60%
3-LMNO   -     75%     65%

UIN      BMI   FEV1%   FVC%
4-QRS    20    90%     92%
4-QRS    21    91%     95%
5-XYZ    24    85%     85%

p    P’

INFER MISSINGNESS STRUCTURE
GENDER   PRESENCE                          ABSENCE
BMI      Observed Count (Expected Count)   Observed Count (Expected Count)
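A table of observed and expected counts like the one on this slide can be produced by cross-tabulating a presence/absence indicator against a grouping variable and running a chi-squared test. The sketch below uses base R; the vectors `gender` and `bmi` are hypothetical, and the choice of an uncorrected Pearson test is an assumption, since the slide does not say which test was used.

```r
# Hypothetical records: gender, and BMI with some values missing (NA)
gender <- c("F", "F", "F", "F", "M", "M", "M", "M", "M", "M")
bmi    <- c(22, NA, 24, NA, 25, 26, NA, 23, 27, 21)

# Presence/absence indicator for BMI, cross-tabulated against gender
bmi_status <- ifelse(is.na(bmi), "absence", "presence")
observed   <- table(gender, bmi_status)

# Pearson chi-squared test; the toy counts are small, so the warning about
# the chi-squared approximation is suppressed here
chi <- suppressWarnings(chisq.test(observed, correct = FALSE))

print(observed)      # observed counts
print(chi$expected)  # expected counts under independence
```

Large gaps between observed and expected counts suggest that the presence or absence of the variable is related to the grouping variable.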

RESULTS
Presence / Absence: BMI
Variables Affected: Date, Age, SYS, DIAS, HDL, CRS, BHL, Missing%, FEV1/FVC, FEV1%, Site, Type, SEG (P), Code, SEG (S), Repeat Visit, Smoking, Sex

Presence / Absence: FEV1, FVC, FEV1 / FVC
Variables Affected: As for BMI, + Exercise per week

Presence / Absence: Concentration
Variables Affected: UIN, Date, Missing%, Site, Type, SEG (P), SEG (S)
