Automatically Extracting Population-Level Cause-of-death Information from Free-text Death Certificates
Presentation to the NSW Epidemiology Special Interest Group on our projects that automatically extract population-level cause-of-death information from free-text death certificates.
Koopman
Australian e-Health Research Centre
• National e-Health research group in Australia
• Joint venture between CSIRO and Qld Health
• Currently 60-70 staff, students and visiting researchers
Research Areas
• Health Data Semantics: clinical language processing, clinical search, clinical terminology
• Health Services: Mobile/Tele Health, forecasting
• Biomedical Informatics: medical imaging, biostatistics
Overview
• Death certificates: a valuable source of cause-of-death information
• The challenge: extracting accurate statistics from death certificates
• The approach: natural language processing and machine learning
• The evaluation: 10 years of NSW death certificates
  • Disease surveillance: Diabetes, Flu, HIV & Pneumonia
  • Cancer statistics
Death Certificates
http://en.wikipedia.org/wiki/Al_Capone
Death certificates are a valuable source of mortality statistics. They enable:
• Surveillance and warnings of increases in disease activity
• Support for the development and monitoring of prevention or response strategies
The Challenge
• Extracting accurate, quantitative data from death certificates authored in unstructured free-text
• Ambiguity of natural language
• Variety in expressing the same meaning
  • stomach cancer vs. gastric carcinoma
  • AIDS, HIV, Human immunodeficiency virus
• Errors; e.g., misspellings
• Volume of death certificates
Two Choices:
1. Get people to structure their data so computers can understand it; or
2. Get computers to better understand people’s natural language.
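To make the "variety" and "errors" challenges concrete, here is a minimal sketch of mapping different surface forms, including misspellings, onto one canonical concept. The `SYNONYMS` table and the `normalise` helper are hypothetical illustrations built from the examples on this slide; they are not the system described in this talk, which would more plausibly rely on a clinical terminology.

```python
import difflib

# Hypothetical mini-lexicon: surface form -> canonical concept.
# Entries come from the slide's examples; a real system would use a
# clinical terminology rather than a hand-made dictionary.
SYNONYMS = {
    "stomach cancer": "gastric carcinoma",
    "gastric carcinoma": "gastric carcinoma",
    "aids": "hiv",
    "hiv": "hiv",
    "human immunodeficiency virus": "hiv",
}


def normalise(phrase, cutoff=0.8):
    """Map a free-text phrase to a canonical concept, tolerating misspellings."""
    key = " ".join(phrase.lower().split())  # lowercase, collapse whitespace
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to the most similar known surface form, if any is close enough.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


print(normalise("Stomach Cancer"))       # exact match after lowercasing
print(normalise("gastric carcinomma"))   # misspelling resolved by similarity
print(normalise("atrial fibrillation"))  # not in the lexicon
```

A lookup like this does not scale to the full variety of clinical language, which is one motivation for learning from labelled examples instead.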
Machine Learning for Disease Classification
1. Extract natural language features from death certificates: terms, phrases and medical concepts.
2. Train a supervised model (Support Vector Machine) to recognise different diseases based on the natural language features.
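The two steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the system from the talk: the slides do not say which SVM implementation, kernel or feature set was used, so the sketch assumes a linear SVM trained by sub-gradient descent on the hinge loss, with unigrams and bigrams standing in for "terms and phrases". Medical-concept features are left out, and `TOY_TRAINING_SET` is made-up data where +1 marks a cancer-related line.

```python
def features(text):
    """Unigram and bigram features from a certificate's free text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens + bigrams)


def train_linear_svm(examples, epochs=20, lr=0.1, reg=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss.

    `examples` is a list of (text, label) pairs with labels +1 or -1.
    The intercept term is omitted to keep the sketch short.
    """
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            margin = label * sum(weights.get(f, 0.0) for f in feats)
            for f in weights:  # regularisation shrinks every weight
                weights[f] *= 1.0 - lr * reg
            if margin < 1.0:  # inside the margin or misclassified
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + lr * label
    return weights


def predict(weights, text):
    """Return +1 if the weighted feature sum is positive, otherwise -1."""
    score = sum(weights.get(f, 0.0) for f in features(text))
    return 1 if score > 0 else -1


# Made-up toy data (assumption): +1 = cancer-related, -1 = not cancer-related.
TOY_TRAINING_SET = [
    ("gastric carcinoma with liver metastases", 1),
    ("carcinoma of the lung", 1),
    ("metastatic breast cancer", 1),
    ("hypoxic brain injury", -1),
    ("atrial fibrillation", -1),
    ("acute myocardial infarction", -1),
]

model = train_linear_svm(TOY_TRAINING_SET)
print(predict(model, "lung carcinoma"))
print(predict(model, "hypoxic brain injury after atrial fibrillation"))
```

In practice one classifier of this kind would be trained per disease or ICD code, on many thousands of labelled certificates rather than six lines.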
System Workflow
Death certificate (real-time feed) → Feature Extraction → Support Vector Machines Classification → Cause(s) of death (ICD codes)
Example certificate text:
A) HYPOXIC BRAIN INJURY
B) GASTRIC CARCINOMA WITH ...
C) ATRIAL FIBRILLATION
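As a rough sketch of how a certificate might move through this workflow, the code below splits the lettered cause lines and passes each one to a pluggable classifier. The names `split_certificate`, `process` and `keyword_classifier` are hypothetical, and the keyword lookup is only a stand-in for the trained SVM stage. The two ICD-10 categories shown (C16, malignant neoplasm of stomach; I48, atrial fibrillation and flutter) are illustrative and should be checked against the official classification.

```python
import re


def split_certificate(text):
    """Split 'A) ... B) ... C) ...' into a {label: cause} dictionary."""
    # With a capturing group, re.split returns
    # [prefix, label, body, label, body, ...].
    parts = re.split(r"\b([A-Z])\)\s*", text)
    return {label: body.strip() for label, body in zip(parts[1::2], parts[2::2])}


def process(text, classify):
    """Run every cause line through `classify`, which returns a list of codes."""
    return {label: classify(cause) for label, cause in split_certificate(text).items()}


# Stand-in for the SVM stage (assumption): a keyword lookup, not a trained model.
KEYWORD_CODES = {"GASTRIC CARCINOMA": "C16", "ATRIAL FIBRILLATION": "I48"}


def keyword_classifier(cause):
    return [code for keyword, code in KEYWORD_CODES.items() if keyword in cause.upper()]


EXAMPLE = "A) HYPOXIC BRAIN INJURY B) GASTRIC CARCINOMA WITH ... C) ATRIAL FIBRILLATION"
print(split_certificate(EXAMPLE))
print(process(EXAMPLE, keyword_classifier))
```

Keeping the classifier as a plain callable means the keyword stand-in can be swapped for a trained model without touching the parsing step.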
1. Disease Surveillance
• Project with NSW Ministry of Health
• Aim: Extract cause-of-death statistics for Diseases of Interest: Influenza, HIV, Pneumonia and Diabetes.
• Data: ~7 years of NSW death certificates; ~340,142 certificates
• Tasks:
  1. Identify whether a certificate contains a Disease of Interest
  2. Identify the specific ICD-10 code pertaining to the Disease of Interest
    • e.g., Viral pneumonia vs. Bacterial pneumonia
    • Non-insulin-dependent vs. Insulin-dependent diabetes
• Empirical evaluation on ‘unseen’ set of labelled certificates.
  • Precision, Recall (Sensitivity) and F-measure
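For reference, the three evaluation measures named above can be computed as follows for a single Disease of Interest. The `gold` and `predicted` lists are made-up labels for illustration only; they are not results from the project.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall (sensitivity) and F-measure for binary labels.

    `gold` and `predicted` are equal-length sequences of truthy/falsy values,
    where a truthy value means the certificate mentions the disease.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)      # correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)  # flagged but not labelled
    fn = sum(1 for g, p in pairs if g and not p)  # labelled but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: 8 certificates, 4 of which truly mention the disease.
gold = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 1]
print(precision_recall_f1(gold, predicted))  # (0.75, 0.75, 0.75)
```

Precision answers "of the certificates flagged, how many were right?", while recall answers "of the certificates that should have been flagged, how many were found?".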
2. Cancer Classification for Cancer Registries
• Project with Cancer Institute NSW
• Aim: Extract cause-of-death statistics for different types of cancer
• Data: 10 years of NSW death certificates; ~447,336 certificates
• Tasks:
  1. Classify cancer as the underlying cause of death
  2. Classify ~80 different cancer classes, from very common (Lung) to very rare (Placenta)
• Empirical evaluation on ‘unseen’ set of labelled certificates.
Automatic Classification of Diseases | Bevan Koopman
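With around 80 classes ranging from very common to very rare, a single overall score can hide poor performance on the rare cancers. The sketch below, with hypothetical function names and made-up labels, computes a per-class F-measure and its unweighted (macro) average. The slides do not state which averaging the project reported.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """F-measure for every class label seen in `gold` or `predicted`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative
    scores = {}
    for label in set(gold) | set(predicted):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denominator if denominator else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class scores, so rare classes count equally."""
    return sum(scores.values()) / len(scores)


# Made-up example: the rare class is always missed.
gold = ["lung", "lung", "lung", "lung", "placenta"]
predicted = ["lung", "lung", "lung", "lung", "lung"]
scores = per_class_f1(gold, predicted)
print(scores["lung"], scores["placenta"], macro_f1(scores))
```

Here 4 of 5 certificates are classified correctly, yet the macro average is dragged down because the rare class scores zero.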
Queensland Cancer Control Analysis
• Developed a system in collaboration with Queensland Cancer Control Analysis Team (QCCAT)
• Real-time classification of pathology reports:
  • Identify notifiable cancers
  • Identify specific characteristics of the cancer
  • Produce structured report of cancer cases
Radiology Search
• In collaboration with Princess Alexandra Hospital, Brisbane
• Customised Radiology Search Engine for ~2 million radiology reports
• Summary statistics to aid research-related activities
Conclusions
• Death certificates may provide a valuable insight into population cause-of-death information.
• Need specific methods to overcome challenges of natural language.
• Clinical Natural Language Processing and Machine Learning.
• General approaches applied to different diseases and clinical reports.
• Interested to hear more about YOUR problems managing clinical natural language (and how we might help).