How to Build a Clinical Diagnostic Model in Python

Slides for my talk at PyCon US 2019 (https://us.pycon.org/2019/schedule/presentation/173/)

Jill Cates

May 02, 2019

Transcript

  1. How to Build a Clinical Diagnostic Model in Python
    Jill Cates, Data Scientist at BioSymetrics
    May 3rd, 2019, PyCon US, Cleveland
  2. What is a Clinical Diagnosis?
    “process of determining the disease or condition that explains a person’s symptoms or signs” [1]
    Patient Data → Expert knowledge/judgment of the physician → Clinical Diagnosis
  3. What is a Clinical Diagnosis?
    “process of determining the disease or condition that explains a person’s symptoms or signs” [1]
    Patient Data → Predictive Model → Clinical Diagnosis
    • Triage low risk vs. high risk patients
    • Identify early onset illness
    • Reduce the risk of misdiagnosis
    • Save doctors’ time = save $$$
  4. What is Sepsis?
    “life-threatening condition that arises when the body's response to infection causes injury to its own tissues and organs” [2]
  5. What is Sepsis?
    • Affects more than 30 million people worldwide each year
    • Leading cause of death in the Intensive Care Unit (ICU) [3]
    • Responsible for 1 out of 3 hospital deaths [3]
    • Costs $20.3 billion each year in U.S. hospitals [4]
    • For every hour that passes before treatment begins, a patient’s risk of death from sepsis increases by 4-8% [5]
  6. Objective
    Build a model that predicts a patient’s likelihood of getting sepsis in the Intensive Care Unit (ICU)
  7. End-to-End Pipeline (adapted from R.J. Urbanowicz et al. 2018)
    Input: Raw data
    Stages: Data Preparation, Modeling, Post-processing
    Steps shown: Data Cleaning, Feature Engineering, Feature Selection, Standardization, Harmonize EMR Data, Train-Test Partitioning (up/downsampling), Model Selection, Hyperparameter Tuning, Evaluation, Generate Predictions, Assess Generalizability, Interpretation, External Replication, Deployment
  8. Electronic Medical Record (EMR)
    Disparate Data Types:
    • Past medical history
    • Vital signs
    • Laboratory results
    • Electrocardiographs (ECG)
    • Electroencephalographs (EEG)
    • 2D/3D images (X-ray, CT, MRI)
    • Histology images
    • Genomic data
    Images shown: chest X-ray, ECG, genomic data, histology image
  9. Data Harmonization
    EMR 1, EMR 2, EMR 3 → OHDSI OMOP Common Data Model
    • Extract data from source
    • Transform data to common data format
    • Load into a centralized database
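The extract, transform, load steps on this slide can be sketched in a few lines. This is a minimal illustration, not the OMOP schema: the two source layouts, the `vitals` table, and every field name below are hypothetical, and an in-memory SQLite database stands in for the centralized database.

```python
import sqlite3

# Extract: hypothetical exports from two EMR systems with different
# field names and units (none of this is the real OMOP schema).
emr_1 = [{"patient_id": "A1", "temp_f": 100.4, "hr": 92}]
emr_2 = [{"pid": "B7", "temperature_c": 38.5, "heart_rate": 110}]


def from_emr_1(row):
    """Transform an EMR 1 row: rename fields, convert Fahrenheit to Celsius."""
    return {
        "patient_id": row["patient_id"],
        "source": "emr_1",
        "temperature_c": round((row["temp_f"] - 32) * 5 / 9, 1),
        "heart_rate": row["hr"],
    }


def from_emr_2(row):
    """Transform an EMR 2 row: only field names differ."""
    return {
        "patient_id": row["pid"],
        "source": "emr_2",
        "temperature_c": row["temperature_c"],
        "heart_rate": row["heart_rate"],
    }


def load(rows, conn):
    """Load harmonized rows into one shared table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals "
        "(patient_id TEXT, source TEXT, temperature_c REAL, heart_rate INTEGER)"
    )
    conn.executemany(
        "INSERT INTO vitals VALUES "
        "(:patient_id, :source, :temperature_c, :heart_rate)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
harmonized = [from_emr_1(r) for r in emr_1] + [from_emr_2(r) for r in emr_2]
load(harmonized, conn)
rows = conn.execute(
    "SELECT patient_id, temperature_c FROM vitals ORDER BY patient_id"
).fetchall()
```

Keeping one small transform function per source makes it easy to add a third EMR without touching the load step.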
  10. Data Cleaning: Handling Missing Values
    Handling Missing Data:
    • Case deletion
    • Direct imputation
      • Statistical imputation: measures of central tendency (mean, median, mode), regression, multiple imputation
      • Machine learning based imputation: k Nearest Neighbours, multi-layer perceptron, neural network imputation (recurrent and auto-associative)
    • Model-based imputation: Maximum Likelihood with Expectation Maximization algorithm, Gaussian Mixture Models
    • Machine learning methods: ensemble methods, SVM, gradient boosting
    Adapted from García Laencina P.J et al. Pattern Classification with Missing Data: A Review. Neural Comput Applied. 2009. 9(1): 1–12
  11. Data Cleaning: 3 Mechanisms of “Missingness”
    1. Missing Completely at Random (MCAR)
      • Nurse forgot to record a patient’s blood pressure
    2. Missing at Random (MAR)
      • Younger patients with no risk of cardiovascular disease tend to have blood pressure missing, and will tend to have lower blood pressure than older patients with cardiovascular disease
    3. Missing Not at Random (MNAR)
      • Blood pressure is missing in patients who haven’t sought treatment, and those patients are more likely to have higher blood pressure
    Examples adapted from Bhaskaran, K. et al. What is the difference between missing completely at random and missing at random? IJE 2014. 43(4): 1336-1339
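A small simulation can show why the mechanism matters. The sketch below uses made-up numbers (blood pressure drawn from a normal distribution with mean 120 and standard deviation 15, a 30% MCAR loss rate, and an 80% loss rate for readings above 140 under MNAR); none of these values come from the talk. MAR is left out because it needs a second observed variable such as age.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated "true" systolic blood pressure for 10,000 patients (illustrative values).
true_bp = [random.gauss(120, 15) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost,
# regardless of the patient or the value.
mcar_observed = [bp for bp in true_bp if random.random() > 0.3]

# MNAR: the chance of being missing depends on the unobserved value itself.
# Here readings above 140 are dropped 80% of the time.
mnar_observed = [
    bp for bp in true_bp if not (bp > 140 and random.random() < 0.8)
]

# Bias of the observed mean relative to the true mean.
mcar_bias = mean(mcar_observed) - mean(true_bp)
mnar_bias = mean(mnar_observed) - mean(true_bp)
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is pulled downward because high readings are preferentially lost. That is why simple deletion or mean imputation is only defensible under MCAR.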
  12. Data Cleaning: Sophisticated Imputation Methods
    >>> import fancyimpute
    • KNN (k Nearest Neighbours)
    • IterativeSVD (a type of matrix factorization)
    • IterativeImputer (multiple imputation*)
    >>> import statsmodels
    • BayesGaussMI (Gaussian model to impute multivariate data)
    • MICE (Multiple Imputation through Chained Equations)
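To make the idea behind KNN imputation concrete, here is a pure-Python sketch. It is not the fancyimpute implementation: the distance function, the choice of `k`, and the toy vitals table are assumptions for illustration, and in practice the features should be standardized first and a library imputer used instead.

```python
import math


def _distance(a, b):
    """Root-mean-square difference over the columns both rows have observed."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest rows."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows that observed column j and share
            # at least one observed column with this row.
            donors = [
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical columns: heart rate, systolic blood pressure, temperature (Celsius).
vitals = [
    [80, 120, 37.0],
    [82, 118, 36.8],
    [120, 90, 39.1],
    [118, None, 39.0],  # blood pressure was not recorded
]
filled = knn_impute(vitals, k=2)
```

The last patient's missing blood pressure is filled with the average of the two most similar patients, so the result depends strongly on how distance is defined and scaled.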
  13. Data Cleaning: Cleaning up inconsistencies in medical terms
    NS vs. Normal Saline vs. “Syringe NS” vs. 0.9% sodium chloride
  14. Data Cleaning: Cleaning up inconsistencies in medical terms
    Drug Naming Standards, using morphine as the example:
    • RxNorm: 7052
    • National Drug Code (NDC): 71335-0239
    Open source APIs: RxNav
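One simple way to clean up free-text drug names is a lookup table from known variants to a single canonical label. The sketch below is an assumption-laden illustration: the `SYNONYMS` table and the canonical label for saline are hypothetical, and only the morphine RxNorm identifier (7052) is taken from the slide. A production pipeline would more likely resolve names through a service such as RxNav.

```python
import re

# Hypothetical synonym table built from the variants shown on the slides.
SYNONYMS = {
    "ns": "sodium chloride 0.9%",
    "normal saline": "sodium chloride 0.9%",
    "syringe ns": "sodium chloride 0.9%",
    "0.9% sodium chloride": "sodium chloride 0.9%",
    "morphine": "morphine",
}

# RxNorm identifier for morphine as shown on the slide.
RXNORM = {"morphine": "7052"}


def normalize_drug(raw):
    """Map a free-text drug name to its canonical label, or None if unknown."""
    # Trim whitespace and surrounding straight or curly double quotes.
    cleaned = raw.strip().strip('"\u201c\u201d').strip()
    # Collapse repeated whitespace and ignore case.
    key = re.sub(r"\s+", " ", cleaned).lower()
    return SYNONYMS.get(key)


saline = normalize_drug("  Normal   Saline ")
quoted = normalize_drug("\u201cSyringe NS\u201d")
morphine_code = RXNORM.get(normalize_drug("Morphine"))
```

Returning `None` for unknown names keeps unmapped entries visible so they can be reviewed rather than silently guessed.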
  15. Data Cleaning: Cleaning up inconsistencies in medical terms
    • Heart attack vs. Myocardial infarction
    • Bruise vs. Contusion
    • Lou Gehrig’s Disease vs. Amyotrophic Lateral Sclerosis (ALS)
    • Stroke vs. Cerebrovascular Accident (CVA)
    • “Patient was febrile” vs. “Patient showed symptoms of a fever”
    Standardize into Common Unique Identifiers (CUI) using the Unified Medical Language System (UMLS)
  16. Feature Engineering: Extracting properties from drug names
    Medical Subject Headings (MeSH): descriptors used for indexing journal articles in PubMed’s database
    • morphine: 'mesh_terms' : [ 'analgesic', 'opioid', 'narcotics' ]
    • amoxicillin: 'mesh_terms' : [ 'antibacterial agents', 'beta-Lactamase Inhibitors' ]
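Once each drug has a list of MeSH terms, one possible way to turn them into model features is a multi-hot vector per patient. This is a sketch under assumptions: the lookup dictionary simply copies the two examples from the slide, and the `multi_hot` helper is hypothetical rather than something shown in the talk.

```python
# MeSH terms per drug, copied from the slide's two examples.
drug_mesh_terms = {
    "morphine": ["analgesic", "opioid", "narcotics"],
    "amoxicillin": ["antibacterial agents", "beta-Lactamase Inhibitors"],
}


def multi_hot(drugs, mesh_lookup):
    """Return (vocabulary, vector) with a 1 for each MeSH term of the given drugs."""
    # Fixed, sorted vocabulary so every patient gets the same column order.
    vocabulary = sorted({t for terms in mesh_lookup.values() for t in terms})
    active = {t for d in drugs for t in mesh_lookup.get(d, [])}
    return vocabulary, [1 if t in active else 0 for t in vocabulary]


vocabulary, morphine_vector = multi_hot(["morphine"], drug_mesh_terms)
_, both_vector = multi_hot(["morphine", "amoxicillin"], drug_mesh_terms)
```

Encoding properties instead of raw drug names lets two patients on different opioids share a feature.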
  17. Feature Engineering: Extracting properties from drug names
    Anatomical Therapeutic Chemical (ATC): classifies active ingredients of drugs according to the organ or system on which they act
    morphine: {'ATC': 'N02AA'}
    • N: Nervous system
    • N02: Analgesics
    • N02A: Opioids
    • N02AA: Natural opium alkaloids
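Because ATC codes are hierarchical, each prefix of the code is itself a usable feature. The sketch below splits a code into its levels using the standard ATC prefix lengths (1, 3, 4, 5, and 7 characters); the `atc_levels` helper name is hypothetical.

```python
# Character lengths of the five ATC levels, from anatomical group
# down to the individual chemical substance.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Return every hierarchy prefix available for an ATC code."""
    code = code.strip().upper()
    return [code[:n] for n in ATC_LEVEL_LENGTHS if len(code) >= n]


# The code from the slide: nervous system > analgesics > opioids
# > natural opium alkaloids.
morphine_levels = atc_levels("N02AA")
```

Each prefix can become its own categorical feature, so a model can learn at the "opioid" level even when individual drugs are rare in the data.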
  18. Feature Engineering: Categorizing diagnoses
    • International Statistical Classification of Diseases and Related Health Problems (ICD)
    • Classification system for clinical diagnosis
    • 13,000 ICD-9 codes, 68,000 ICD-10 codes
    • Reported by physician in patient record
    • Used for billing purposes
  19. Feature Engineering
    17 Charlson Comorbidities: Myocardial infarction; Rheumatic disease; Renal disease; Congestive heart failure; Peptic ulcer disease; Any malignancy; Peripheral vascular disease; Mild liver disease; Moderate/severe liver disease; Cerebrovascular disease; Diabetes without complication; Metastatic solid tumor; Dementia; Diabetes with complication; AIDS/HIV; Chronic pulmonary disease; Hemiplegia/paraplegia
    30 Elixhauser Comorbidities: Congestive heart failure; Diabetes, uncomplicated; Rheumatoid arthritis; Cardiac arrhythmias; Diabetes, complicated; Coagulopathy; Valvular disease; Hypothyroidism; Obesity; Pulmonary circulation disorders; Renal failure; Weight loss; Peripheral vascular disorders; Liver disease; Fluid and electrolyte disorders; Hypertension, uncomplicated; Peptic ulcer disease; Blood loss anemia; Hypertension, complicated; AIDS/HIV; Deficiency anemia; Paralysis; Lymphoma; Drug abuse; Other neurological disorders; Metastatic cancer; Psychoses; Chronic pulmonary disease; Solid tumor without metastasis; Depression
  20. Feature Engineering: Charlson Comorbidity Mapping
    Moderate/Severe Liver Disease (ICD-9 456.0-456.2, 572.2-572.8):
    • Hepatic encephalopathy
    • Portal hypertension
    • Hepatorenal syndrome
    • Other sequelae of chronic liver disease
    • Esophageal varices with bleeding
    • Esophageal varices without mention of bleeding
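The range mapping on this slide can be sketched as a lookup from ICD-9 codes to a binary comorbidity flag. Assumptions in this sketch: codes are stored with their decimal point, a five-digit sub-code such as 456.21 counts as part of 456.2, and only the one comorbidity from the slide is included. The `charlson_flags` helper is hypothetical.

```python
from decimal import Decimal, InvalidOperation

# ICD-9 ranges for moderate/severe liver disease, as listed on the slide.
CHARLSON_RANGES = {
    "moderate_severe_liver_disease": [("456.0", "456.2"), ("572.2", "572.8")],
}


def _in_range(code, low, high):
    """True if the code, truncated to the precision of the bounds, is in range."""
    try:
        # Truncation makes sub-codes such as 456.21 match the 456.2 bound.
        value = Decimal(code.strip()[: len(low)])
    except InvalidOperation:
        return False  # non-numeric codes (for example V or E codes) never match
    return Decimal(low) <= value <= Decimal(high)


def charlson_flags(icd9_codes, ranges=CHARLSON_RANGES):
    """Return {comorbidity: 0 or 1} for one patient's list of ICD-9 codes."""
    return {
        name: int(
            any(_in_range(code, low, high) for code in icd9_codes for low, high in spans)
        )
        for name, spans in ranges.items()
    }


patient_flags = charlson_flags(["572.3", "038.9"])
```

Collapsing thousands of ICD codes into a handful of comorbidity flags gives a far denser feature than one-hot encoding every code.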
  21. Feature Engineering: Clinical notes (Source: PhysioNet)
    • Admission note
    • Daily progress note
    • Radiology note
    • Consult note
    • Discharge summary
  22. Feature Engineering: Identifying symptoms using a Named Entity Recognition (NER) Model
    Example term: cough
    Possible categories: symptom, anatomy, treatment, condition, hospital dept, medication
  23. Feature Engineering: Identifying symptoms using a Named Entity Recognition (NER) Model
    Beware of ambiguous terms:
    • Parkinson’s - patient name or disease?
    • Lou Gehrig’s (also known as ALS) - patient name or disease?
    • Dermatome - area of skin or surgical instrument used for skin grafting?
    • Pelvis - part of the kidney or hip bone?
  24. Defining the Target Variable, Approach 3: Severity Score
    • Sequential Organ Failure Assessment (SOFA)
    • Systemic Inflammatory Response Syndrome (SIRS)
    • Logistic Organ Dysfunction System (LODS)
    Based on: vitals, blood test results, urine test results
    “a mortality prediction score that is based on the degree of dysfunction of 6 organ systems”
  25. Defining the Target Variable, Approach 3: Severity Score
    Sepsis is defined by “an acute change in total SOFA score ≥2 points consequent to infection.” Singer et al. [6]
    Window around suspicion of infection*: from 48 hours prior to 24 hours post
    If the SOFA score increases by 2 points or more within this window, we will assume that the patient has sepsis
    *suspicion of infection = ordering a culture lab draw AND prescribing antibiotics
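The labeling rule on this slide can be sketched as follows. Several details are assumptions the slide does not specify: the suspicion time is taken as the earlier of the culture draw and the antibiotic order, an "increase" means any score that is at least 2 points above the lowest earlier score inside the window, and SOFA scores are assumed to arrive as already computed (timestamp, score) pairs. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def suspicion_time(culture_time, antibiotic_time):
    """Suspicion of infection needs BOTH a culture draw and an antibiotic order."""
    if culture_time is None or antibiotic_time is None:
        return None
    # Assumption: anchor the window at the earlier of the two events.
    return min(culture_time, antibiotic_time)


def has_sepsis(sofa_scores, suspected_at,
               before=timedelta(hours=48), after=timedelta(hours=24)):
    """Label a stay as septic if SOFA rises by >= 2 points inside the window.

    sofa_scores is a list of (timestamp, score) pairs.
    """
    if suspected_at is None:
        return False
    window = sorted(
        (t, s) for t, s in sofa_scores
        if suspected_at - before <= t <= suspected_at + after
    )
    lowest_so_far = None
    for _, score in window:
        # Compare each score with the lowest score seen earlier in the window.
        if lowest_so_far is not None and score - lowest_so_far >= 2:
            return True
        lowest_so_far = score if lowest_so_far is None else min(lowest_so_far, score)
    return False


t0 = datetime(2019, 5, 3, 12, 0)
suspected = suspicion_time(t0, t0 + timedelta(hours=3))
scores = [
    (t0 - timedelta(hours=36), 1),
    (t0 - timedelta(hours=12), 2),
    (t0 + timedelta(hours=6), 4),
]
label = has_sepsis(scores, suspected)
```

Comparing against the lowest earlier score means a falling SOFA score never triggers the label, which matches the slide's wording of an increase.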
  26. Model Selection
    Binary Classification: Did the patient get sepsis during their hospital stay? (no_sepsis = 0, has_sepsis = 1)
    • Random Forest Classifier
    • Logistic regression with L1/L2 regularization
    • Support Vector Machine
    Time-series Classification: When did the patient get sepsis during their hospital stay?
    • Fixed/random effects regression
    • Multivariate Long Short Term Memory (LSTM) Model
    • Bayesian networks
    [Figure: health score plotted against time, with regions labeled “no sepsis” and “has sepsis”]
  27. Model Selection: No Free Lunch Theorem
    “all optimization problem strategies perform equally well when averaged over all possible problems”
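  28. Evaluation Metrics
    • Area Under the Receiver Operating Characteristic Curve (AUROC/AUC), plotted as TPR against FPR
    • $\mathrm{precision} = \frac{TP}{TP + FP}$
    • $\mathrm{recall} = \frac{TP}{TP + FN}$
    • $F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
    • $\mathrm{accuracy} = \frac{TP + TN}{N_{\mathrm{total}}}$
    • $\mathrm{logloss} = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$
    • $\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}$
    • $\mathrm{MAE} = \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{N}$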
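The metrics on this slide are short enough to compute by hand, which is a useful check on what a library reports. The sketch below uses only the standard library; the function names are hypothetical, log loss is averaged over samples (the slide shows the per-sample term), and a zero denominator is treated as a score of 0.0 by assumption.

```python
import math


def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy from binary labels (1 = has sepsis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean of -(y log p + (1 - y) log(1 - p)); p is clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

For a rare outcome such as sepsis, accuracy alone can look good for a model that never predicts the positive class, which is why precision and recall are listed alongside it.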