Jill Cates - How to Build a Clinical Diagnostic Model in Python

Slide 1

Slide 1 text

How to Build a Clinical Diagnostic Model in Python
Jill Cates, Data Scientist at BioSymetrics
May 3rd, 2019, PyCon US, Cleveland

Slide 2

Slide 2 text

What is a Clinical Diagnosis?
“process of determining the disease or condition that explains a person’s symptoms or signs” [1]
Patient Data + Expert knowledge/judgment of the physician → Clinical Diagnosis

Slide 3

Slide 3 text

What is a Clinical Diagnosis?
“process of determining the disease or condition that explains a person’s symptoms or signs” [1]
Patient Data → Predictive Model → Clinical Diagnosis
• Triage low risk vs. high risk patients
• Identify early onset illness
• Reduce the risk of misdiagnosis
• Save doctors’ time = save $$$

Slide 4

Slide 4 text

Case Study Sepsis

Slide 5

Slide 5 text

What is Sepsis? “life-threatening condition that arises when the body's response to infection causes injury to its own tissues and organs” [2]

Slide 6

Slide 6 text

What is Sepsis?
• Affects more than 30 million people worldwide each year
• Leading cause of death in the Intensive Care Unit (ICU) [3]
• Responsible for 1 out of 3 hospital deaths [3]
• Costs $20.3 billion each year in U.S. hospitals [4]
• For every hour that passes before treatment begins, a patient’s risk of death from sepsis increases by 4-8% [5]

Slide 7

Slide 7 text

Objective
Build a model that predicts a patient’s likelihood of getting sepsis in the Intensive Care Unit (ICU)

Slide 8

Slide 8 text

End-to-End Pipeline (adapted from R.J. Urbanowicz et al. 2018)
Stages: Raw data → Data Preparation → Modeling → Post-processing
Steps shown on the slide:
• Harmonize EMR Data
• Data Cleaning
• Feature Engineering
• Feature Selection
• Standardization
• Train-Test Partitioning (up/downsampling)
• Model Selection
• Hyperparameter Tuning
• Evaluation
• Generate Predictions
• Assess Generalizability
• Interpretation
• External Replication
• Deployment
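
The train-test partitioning step with upsampling can be sketched roughly as follows. This is a minimal standard-library illustration, not the pipeline used in the talk: the toy cohort, the `stay_id` and `sepsis` field names, and both helper functions are hypothetical. Upsampling is applied to the training split only, so that duplicated rows cannot leak into the test set.

```python
import random
from collections import Counter


def stratified_split(rows, label_key, test_fraction=0.25, seed=0):
    """Split rows into train/test while keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def upsample_minority(train, label_key, seed=0):
    """Resample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(row[label_key] for row in train)
    target = max(counts.values())
    balanced = list(train)
    for label, count in counts.items():
        pool = [row for row in train if row[label_key] == label]
        balanced.extend(rng.choices(pool, k=target - count))
    return balanced


# Hypothetical toy cohort: 10 septic ICU stays and 90 non-septic stays.
cohort = [{"stay_id": i, "sepsis": int(i < 10)} for i in range(100)]
train, test = stratified_split(cohort, "sepsis")
train_balanced = upsample_minority(train, "sepsis")
print(Counter(row["sepsis"] for row in train_balanced))
```

Downsampling would work the same way in reverse: sample the majority class without replacement down to the minority count.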

Slide 9

Slide 9 text

Electronic Medical Record (EMR)

Slide 10

Slide 10 text

Electronic Medical Record (EMR)
Disparate Data Types:
• Past medical history
• Vital signs
• Laboratory results
• Electrocardiographs (ECG)
• Electroencephalographs (EEG)
• 2D/3D images (X-ray, CT, MRI)
• Histology images
• Genomic data
[Images: chest X-ray, ECG, genomic data, histology image]

Slide 11

Slide 11 text

Data Harmonization
EMR 1, EMR 2, EMR 3 → OHDSI OMOP Common Data Model
• Extract data from source
• Transform data to common data format
• Load into a centralized database
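
The extract-transform-load idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the source names, the local field names, and the target field names are hypothetical and are not actual OMOP Common Data Model tables or columns.

```python
# Hypothetical per-source mappings from local field names to one shared schema.
FIELD_MAPS = {
    "emr_1": {"pt_id": "patient_id", "hr": "heart_rate", "temp_f": "temperature_c"},
    "emr_2": {"PatientID": "patient_id", "HeartRate": "heart_rate", "TempC": "temperature_c"},
}


def fahrenheit_to_celsius(value):
    return (value - 32) * 5 / 9


# Unit conversions needed for specific (source, common field) pairs.
UNIT_CONVERTERS = {("emr_1", "temperature_c"): fahrenheit_to_celsius}


def transform(source, record):
    """Rename fields to the shared schema and convert units where needed."""
    harmonized = {"source": source}
    for local_name, value in record.items():
        common_name = FIELD_MAPS[source].get(local_name)
        if common_name is None:
            continue  # field has no counterpart in the shared schema
        convert = UNIT_CONVERTERS.get((source, common_name))
        harmonized[common_name] = convert(value) if convert else value
    return harmonized


# "Load" step: collect harmonized rows in one place (a list stands in for a database).
central_store = [
    transform("emr_1", {"pt_id": "A1", "hr": 88, "temp_f": 98.6, "bed": 4}),
    transform("emr_2", {"PatientID": "B7", "HeartRate": 101, "TempC": 38.4}),
]
print(central_store)
```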

Slide 12

Slide 12 text

Data Cleaning
1. Handling missing values
2. Standardizing inconsistencies in medical terms

Slide 13

Slide 13 text

Data Cleaning Handling Missing Values

Slide 14

Slide 14 text

Data Cleaning
Handling Missing Values
Handling Missing Data (adapted from García Laencina P.J et al. Pattern Classification with Missing Data: A Review. Neural Comput Applied. 2009. 9(1): 1–12):
• Case deletion
• Direct imputation
  • Statistical imputation: measures of central tendency (mean, median, mode), regression, multiple imputation
  • Machine learning based imputation: k Nearest Neighbours, multi-layer perceptron, neural network imputation (recurrent and auto-associative)
• Model-based imputation: Maximum Likelihood with Expectation Maximization algorithm, Gaussian Mixture Models
• Machine learning methods: ensemble methods, SVM, gradient boosting

Slide 15

Slide 15 text

Data Cleaning
3 Mechanisms of “Missingness”
1. Missing Completely at Random (MCAR)
• Nurse forgot to record a patient’s blood pressure
2. Missing at Random (MAR)
• Younger patients with no risk of cardiovascular disease tend to have blood pressure missing, and will tend to have lower blood pressure than older patients with cardiovascular disease
3. Missing Not at Random (MNAR)
• Blood pressure is missing in patients who haven’t sought treatment, and those patients are more likely to have higher blood pressure
Examples adapted from Bhaskaran, K. et al. What is the difference between missing completely at random and missing at random? IJE 2014. 43(4): 1336-1339
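
Why the mechanism matters can be seen in a small simulation. This is a sketch with made-up numbers (a normal distribution with mean 120 and SD 15, a 30% MCAR loss rate, an 80% loss rate above 130 for MNAR); none of these values come from the talk. Under MCAR the observed mean stays close to the true mean, while under MNAR the observed readings are biased low because high readings are the ones that go missing.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical "true" systolic blood pressure for 10,000 patients.
true_bp = [rng.gauss(120, 15) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost.
mcar_observed = [bp for bp in true_bp if rng.random() >= 0.30]

# MNAR: readings above 130 go unrecorded 80% of the time,
# so missingness depends on the unobserved value itself.
mnar_observed = [bp for bp in true_bp if not (bp > 130 and rng.random() < 0.80)]

full_mean = mean(true_bp)
mcar_mean = mean(mcar_observed)
mnar_mean = mean(mnar_observed)
print(f"true {full_mean:.1f}, MCAR {mcar_mean:.1f}, MNAR {mnar_mean:.1f}")
```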

Slide 16

Slide 16 text

Data Cleaning Complete Case Method Drop observations with NaNs
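
A minimal pandas sketch of the complete case method, using a hypothetical vitals table (the column names and values are made up). It also reports how much data is discarded, which is the main cost of this approach.

```python
import numpy as np
import pandas as pd

# Hypothetical vitals table with gaps.
vitals = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "heart_rate": [88.0, np.nan, 102.0, 95.0],
    "systolic_bp": [120.0, 135.0, np.nan, 110.0],
})

# Keep only rows in which every column is observed.
complete_cases = vitals.dropna()

# Fraction of rows lost by dropping incomplete observations.
fraction_dropped = 1 - len(complete_cases) / len(vitals)
print(complete_cases)
print(f"dropped {fraction_dropped:.0%} of rows")
```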

Slide 17

Slide 17 text

Data Cleaning
Last Value Carried Forward
Impute each missing value with the last observed value
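
A minimal pandas sketch of last value carried forward, assuming a hypothetical long-format table with one row per patient per hour. The forward fill is done within each patient so that one patient's reading is never carried into the next patient's record; a patient whose first reading is missing stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly heart-rate readings for two patients.
vitals = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "hour": [0, 1, 2, 0, 1],
    "heart_rate": [88.0, np.nan, np.nan, np.nan, 97.0],
})

# Sort by time within patient, then forward-fill per patient.
vitals = vitals.sort_values(["patient_id", "hour"])
vitals["heart_rate_locf"] = vitals.groupby("patient_id")["heart_rate"].ffill()
print(vitals)
```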

Slide 18

Slide 18 text

Data Cleaning
Sophisticated Imputation Methods
>>> import fancyimpute
• KNN (k Nearest Neighbours)
• IterativeSVD (a type of matrix factorization)
• IterativeImputer (multiple imputation*)
>>> import statsmodels
• BayesGaussMI (Gaussian model to impute multivariate data)
• MICE (Multiple Imputation through Chained Equations)
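
To show the idea behind k-nearest-neighbour imputation without depending on a particular library version, here is a from-scratch sketch. It is not the fancyimpute or statsmodels implementation; the function name, the distance definition, and the example rows are assumptions made for illustration. In practice features would be scaled first so that no single column dominates the distance.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature among the k nearest rows.

    Distance is the root mean squared difference over features observed in both rows.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j was observed.
            donors = [
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            donors.sort(key=lambda pair: pair[0])
            nearest = [donor_value for _, donor_value in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical rows: (heart_rate, systolic_bp, temperature_c)
rows = [
    (80.0, 120.0, 36.8),
    (82.0, 122.0, None),
    (120.0, 90.0, 39.1),
    (118.0, 92.0, 39.3),
    (81.0, 121.0, 36.9),
]
filled = knn_impute(rows, k=2)
print(filled[1])
```

The second row borrows its temperature from the two rows with the most similar heart rate and blood pressure, rather than from the overall column mean.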

Slide 19

Slide 19 text

Data Cleaning
Cleaning up inconsistencies in medical terms
NS vs. Normal Saline vs. “Syringe NS” vs. 0.9% sodium chloride

Slide 20

Slide 20 text

Data Cleaning Cleaning up inconsistencies in medical terms Morphine vs. Morphine S vs. Morphine Sulfate

Slide 21

Slide 21 text

Data Cleaning
Cleaning up inconsistencies in medical terms
Drug Naming Standards (example: Morphine)
• RxNorm: 7052
• National Drug Code (NDC): 71335-0239
Open-source APIs: RxNav
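
A minimal sketch of normalizing free-text drug names to one canonical name before attaching a standard identifier. The synonym table is hand-written for illustration and groups the variants the way the slides do; only the RxNorm ID 7052 for morphine comes from the slide. A real pipeline would look names up through a service such as RxNav rather than a hard-coded dictionary.

```python
import re

# Hypothetical synonym table built from the variants shown on the slides.
SYNONYMS = {
    "morphine": "morphine",
    "morphine s": "morphine",
    "morphine sulfate": "morphine",
    "ns": "normal saline",
    "normal saline": "normal saline",
    "syringe ns": "normal saline",
    "0.9% sodium chloride": "normal saline",
}

# Identifier taken from the slide; other drugs are left unmapped on purpose.
RXNORM_IDS = {"morphine": "7052"}


def normalize_drug(raw_name):
    """Return the canonical drug name, or None if the variant is unknown."""
    key = raw_name.strip().strip("\"'“”").lower()
    key = re.sub(r"\s+", " ", key)  # collapse repeated whitespace
    return SYNONYMS.get(key)


for raw in ["Morphine  Sulfate", "“Syringe NS”", "Morphine S", "aspirin"]:
    name = normalize_drug(raw)
    print(raw, "->", name, RXNORM_IDS.get(name))
```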

Slide 22

Slide 22 text

Data Cleaning
Cleaning up inconsistencies in medical terms
• Heart attack vs. Myocardial infarction
• Bruise vs. Contusion
• Lou Gehrig’s Disease vs. Amyotrophic Lateral Sclerosis (ALS)
• Stroke vs. Cerebrovascular Accident (CVA)
• “Patient was febrile” vs. “Patient showed symptoms of a fever”
Standardize into Concept Unique Identifiers (CUI) from the Unified Medical Language System (UMLS)

Slide 23

Slide 23 text

Feature Engineering
Extracting properties from drug names
Medical Subject Headings (MeSH): descriptors used for indexing journal articles in PubMed’s database
• morphine: 'mesh_terms': ['analgesic', 'opioid', 'narcotics']
• amoxicillin: 'mesh_terms': ['antibacterial agents', 'beta-Lactamase Inhibitors']

Slide 24

Slide 24 text

Feature Engineering
Extracting properties from drug names
Anatomical Therapeutic Chemical (ATC): classifies active ingredients of drugs according to the organ or system on which they act
morphine: {'ATC': 'N02AA'}
• N: Nervous system
• N02: Analgesics
• N02A: Opioids
• N02AA: Natural opium alkaloids
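
Because each ATC level is a prefix of the next, the hierarchy can be turned into features by slicing the code. A minimal sketch follows; the function name is hypothetical, and the prefix lengths (1, 3, 4, 5, 7 characters) follow the standard ATC level structure. The seven-character example N02AA01 is the substance-level code for morphine and is not shown on the slide.

```python
def atc_levels(code):
    """Split an ATC code into its hierarchical prefixes, broadest first."""
    # Prefix lengths for: anatomical group, therapeutic subgroup,
    # pharmacological subgroup, chemical subgroup, chemical substance.
    cut_points = (1, 3, 4, 5, 7)
    return [code[:n] for n in cut_points if len(code) >= n]


print(atc_levels("N02AA"))    # code from the slide
print(atc_levels("N02AA01"))  # substance-level code, an added example
```

Each prefix can then be one-hot encoded, so that two drugs sharing `N02A` (opioids) look similar to the model even when their full codes differ.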

Slide 25

Slide 25 text

Feature Engineering
Categorizing diagnoses
• International Statistical Classification of Diseases and Related Health Problems (ICD)
• Classification system for clinical diagnosis
• 13,000 ICD-9 codes, 68,000 ICD-10 codes
• Reported by physician in patient record
• Used for billing purposes

Slide 26

Slide 26 text

Feature Engineering
17 Charlson Comorbidities:
• Myocardial infarction
• Congestive heart failure
• Peripheral vascular disease
• Cerebrovascular disease
• Dementia
• Chronic pulmonary disease
• Rheumatic disease
• Peptic ulcer disease
• Mild liver disease
• Diabetes without complication
• Diabetes with complication
• Hemiplegia/paraplegia
• Renal disease
• Any malignancy
• Moderate/severe liver disease
• Metastatic solid tumor
• AIDS/HIV
30 Elixhauser Comorbidities:
• Congestive heart failure
• Cardiac arrhythmias
• Valvular disease
• Pulmonary circulation disorders
• Peripheral vascular disorders
• Hypertension, uncomplicated
• Hypertension, complicated
• Paralysis
• Other neurological disorders
• Chronic pulmonary disease
• Diabetes, uncomplicated
• Diabetes, complicated
• Hypothyroidism
• Renal failure
• Liver disease
• Peptic ulcer disease
• AIDS/HIV
• Lymphoma
• Metastatic cancer
• Solid tumor without metastasis
• Rheumatoid arthritis
• Coagulopathy
• Obesity
• Weight loss
• Fluid and electrolyte disorders
• Blood loss anemia
• Deficiency anemia
• Drug abuse
• Psychoses
• Depression

Slide 27

Slide 27 text

Feature Engineering
Charlson Comorbidity Mapping
Moderate/Severe Liver Disease (456.0-456.2, 572.2-572.8):
• Esophageal varices with bleeding
• Esophageal varices without mention of bleeding
• Hepatic encephalopathy
• Portal hypertension
• Hepatorenal syndrome
• Other sequelae of chronic liver disease
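
The mapping from diagnosis codes to a comorbidity flag can be sketched as a range check. This assumes the ranges on the slide are ICD-9 codes written with a decimal point, and that a longer code such as 456.21 belongs to its 456.2 parent; the function names are hypothetical, and only the one comorbidity shown on the slide is covered.

```python
# Ranges from the slide, stored as (category, first decimal digit) pairs.
MODERATE_SEVERE_LIVER_RANGES = [
    ((456, 0), (456, 2)),
    ((572, 2), (572, 8)),
]


def to_key(code):
    """Reduce a dotted code to (category, first decimal digit), e.g. '456.21' -> (456, 2)."""
    category, _, decimals = code.strip().partition(".")
    if not category.isdigit():
        return None  # V and E codes fall outside the numeric ranges used here
    return int(category), int(decimals[:1] or 0)


def has_moderate_severe_liver_disease(codes):
    """True if any of the patient's codes falls inside the slide's ranges."""
    keys = [to_key(code) for code in codes]
    return any(
        key is not None and low <= key <= high
        for key in keys
        for low, high in MODERATE_SEVERE_LIVER_RANGES
    )


print(has_moderate_severe_liver_disease(["401.9", "572.3"]))   # portal hypertension range
print(has_moderate_severe_liver_disease(["571.2", "V45.81"]))  # no matching code
```

Repeating this check for each comorbidity group turns thousands of distinct diagnosis codes into a short vector of binary features.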

Slide 28

Slide 28 text

Feature Engineering Generating new features from imaging data

Slide 29

Slide 29 text

Feature Engineering
Clinical notes (Source: PhysioNet)
• Admission note
• Daily progress note
• Radiology note
• Consult note
• Discharge summary

Slide 30

Slide 30 text

Feature Engineering
Identifying symptoms using a Named Entity Recognition (NER) Model
Example: “cough” is tagged as a symptom
Possible categories: symptom, anatomy, treatment, condition, hospital dept, medication
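
To make the input and output of entity tagging concrete, here is a dictionary-lookup (gazetteer) tagger. It is a much simpler stand-in for a trained NER model, not the model from the talk: it cannot use context, so it cannot resolve ambiguous terms. The term list and the example note are made up; the category names follow the slide.

```python
import re

# Hypothetical gazetteer mapping lower-cased terms to the slide's categories.
GAZETTEER = {
    "cough": "symptom",
    "fever": "symptom",
    "chest": "anatomy",
    "pneumonia": "condition",
    "morphine": "medication",
}


def tag_entities(note):
    """Return (text, category, start offset) for every known term in the note."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", note):
        category = GAZETTEER.get(match.group().lower())
        if category:
            entities.append((match.group(), category, match.start()))
    return entities


note = "Patient presents with cough and fever; started on morphine."
entities = tag_entities(note)
print(entities)
```

The tagged spans can then be counted per category or per term to build features from each clinical note.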

Slide 31

Slide 31 text

Feature Engineering
Identifying symptoms using a Named Entity Recognition (NER) Model
Beware of ambiguous terms:
• Parkinson’s - patient name or disease?
• Lou Gehrig (Lou Gehrig’s disease is also known as ALS) - patient name or disease?
• Dermatome - area of skin or surgical instrument used for skin grafting?
• Pelvis - part of the kidney or hip bone?

Slide 32

Slide 32 text

Defining the Target Variable
Approach 1: ICD codes (explicit diagnosis by the physician)

Slide 33

Slide 33 text

Defining the Target Variable
Approach 2: Clinical Notes
Assertion classification: Present? Absent? Speculation?

Slide 34

Slide 34 text

Defining the Target Variable
Approach 3: Severity Score
• Sequential Organ Failure Assessment (SOFA)
• Systemic Inflammatory Response Syndrome (SIRS)
• Logistic Organ Dysfunction System (LODS)
Inputs: vitals, blood test results, urine test results
Definition shown on the slide: a mortality prediction score that is based on the degree of dysfunction of 6 organ systems

Slide 35

Slide 35 text

Defining the Target Variable
Approach 3: Severity Score
Sepsis is defined by “an acute change in total SOFA score ≥2 points consequent to infection.” Singer et al. [6]
Window: from 48 hours prior to suspicion of infection* to 24 hours post
If the SOFA score increases by 2 points or more within this window, we will assume that the patient has sepsis
*suspicion of infection = ordering a culture lab draw AND prescribing antibiotics
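
The labeling rule can be sketched as a function over a patient's SOFA time series. This is one possible reading of the rule, stated as an assumption: "increases by 2 points or greater within this window" is interpreted as any score in the window being at least 2 points above the lowest earlier score in the same window. The function name and example data are hypothetical, and the suspicion-of-infection time is taken as a given input.

```python
from datetime import datetime, timedelta


def has_sepsis(sofa_series, suspicion_time, hours_before=48, hours_after=24, min_increase=2):
    """sofa_series is a list of (timestamp, total SOFA score) pairs."""
    window_start = suspicion_time - timedelta(hours=hours_before)
    window_end = suspicion_time + timedelta(hours=hours_after)
    in_window = sorted(
        (time, score) for time, score in sofa_series
        if window_start <= time <= window_end
    )
    lowest_so_far = None
    for _, score in in_window:
        # Compare each score with the lowest score seen earlier in the window.
        if lowest_so_far is not None and score - lowest_so_far >= min_increase:
            return True
        lowest_so_far = score if lowest_so_far is None else min(lowest_so_far, score)
    return False


# Hypothetical patient: cultures drawn and antibiotics ordered at noon.
suspicion = datetime(2019, 5, 3, 12, 0)
scores = [
    (suspicion - timedelta(hours=60), 1),  # outside the window, ignored
    (suspicion - timedelta(hours=40), 2),
    (suspicion - timedelta(hours=10), 3),
    (suspicion + timedelta(hours=5), 4),
]
print(has_sepsis(scores, suspicion))
```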

Slide 36

Slide 36 text

Model Selection
Binary Classification: Did the patient get sepsis during their hospital stay? (no_sepsis = 0, has_sepsis = 1)
• Random Forest Classifier
• Logistic regression with L1/L2 regularization
• Support Vector Machine
Time-series Classification: When did the patient get sepsis during their hospital stay?
• Fixed/random effects regression
• Multivariate Long Short Term Memory (LSTM) Model
• Bayesian networks
[Figure: health score plotted over time, with "no sepsis" and "has sepsis" regions]

Slide 37

Slide 37 text

Model Selection
No Free Lunch Theorem
“all optimization problem strategies perform equally well when averaged over all possible problems”

Slide 38

Slide 38 text

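Evaluation Metrics
$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|$
$\mathrm{precision} = \frac{TP}{TP + FP}$
$\mathrm{recall} = \frac{TP}{TP + FN}$
$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
$\mathrm{accuracy} = \frac{TP + TN}{N_{\mathrm{total}}}$
$\mathrm{logloss} = -\left(y \log(p) + (1 - y)\log(1 - p)\right)$
Area Under the Receiver Operating Curve (AUROC/AUC): area under the plot of TPR against FPR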
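
The metrics above can be computed directly from their definitions. This is a plain-Python sketch for illustration (function names and example data are made up); it averages the per-sample log loss over all samples and computes AUROC as the fraction of positive/negative pairs that the score ranks correctly, which equals the area under the ROC curve. It does not guard against empty classes or probabilities of exactly 0 or 1.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / len(y_true),
    }


def log_loss(y_true, y_prob):
    """Mean of the per-sample log loss."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p)) for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


def auroc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def rmse(y_true, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_hat)) / len(y_true))


def mae(y_true, y_hat):
    return sum(abs(a - b) for a, b in zip(y_true, y_hat)) / len(y_true)


# Hypothetical labels and predicted probabilities for six patients.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [int(p >= 0.5) for p in y_prob]
metrics = classification_metrics(y_true, y_pred)
print(metrics, log_loss(y_true, y_prob), auroc(y_true, y_prob))
```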

Slide 39

Slide 39 text

Caveats
Demographic Bias

Slide 40

Slide 40 text

Caveats
Defining the Ground Truth

Slide 41

Slide 41 text

Caveats
Adversarial Attacks
[Figure: adversarial input. Source: Attacking Machine Learning with Adversarial Examples (OpenAI)]

Slide 42

Slide 42 text

Caveats Adversarial Attacks

Slide 43

Slide 43 text

Caveats Collaboration, Validation, and Buy-in from Clinicians

Slide 44

Slide 44 text

Thank you! Jill Cates Data Scientist at BioSymetrics github: @topspinj [email protected]