How to Build a Clinical Diagnostic Model in Python

Jill Cates, Data Scientist at BioSymetrics
May 3rd, 2019, PyCon US, Cleveland

What is a Clinical Diagnosis?
“process of determining the disease or condition that explains a person’s symptoms or signs” [1]
Patient Data + Expert knowledge/judgment of the physician → Clinical Diagnosis

What is a Clinical Diagnosis?
“process of determining the disease or condition that explains a person’s symptoms or signs” [1]
Patient Data → Predictive Model → Clinical Diagnosis
• Triage low risk vs. high risk patients
• Identify early onset illness
• Reduce the risk of misdiagnosis
• Save doctors’ time = save $$$

Case Study: Sepsis

What is Sepsis?
“life-threatening condition that arises when the body's response to infection causes injury to its own tissues and organs” [2]

What is Sepsis?
• Affects more than 30 million people worldwide each year
• Leading cause of death in the Intensive Care Unit (ICU) [3]
• Responsible for 1 out of 3 hospital deaths [3]
• Costs $20.3 billion each year in U.S. hospitals [4]
• Every hour that passes before treatment begins, a patient’s risk of death from sepsis increases by 4-8% [5]

Objective
Build a model that predicts a patient’s likelihood of getting sepsis in the Intensive Care Unit (ICU)

End-to-End Pipeline
Phases: Data Preparation, Modeling, Post-processing
Steps: Raw data, Harmonize EMR Data, Data Cleaning, Feature Engineering, Feature Selection, Standardization, Train-Test Partitioning (up/downsampling), Model Selection, Hyperparameter Tuning, Evaluation, Generate Predictions, Assess Generalizability, Interpretation, External Replication, Deployment
Adapted from R.J. Urbanowicz et al. 2018

Electronic Medical Record (EMR)

Electronic Medical Record (EMR)
Disparate Data Types:
• Past medical history
• Vital signs
• Laboratory results
• Electrocardiograms (ECG)
• Electroencephalograms (EEG)
• 2D/3D images (X-ray, CT, MRI)
• Histology images
• Genomic data
[Figure panels: Chest X-ray, ECG, Genomic Data, Histology Image]

Data Harmonization
EMR 1, EMR 2, EMR 3 → OHDSI OMOP Common Data Model
• Extract data from source
• Transform data to common data format
• Load into a centralized database
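
The extract, transform, and load steps can be sketched with the standard library alone. The two source layouts, the field names, and the `vitals` table below are hypothetical stand-ins chosen for illustration, not the OMOP schema; a real harmonization would target the OMOP tables and vocabularies.

```python
import sqlite3

# Hypothetical extracts from two EMR systems that name and encode
# the same measurement differently (assumed for illustration).
emr_1 = [{"pid": 1, "hr": 88}, {"pid": 2, "hr": 104}]
emr_2 = [{"patient": "A-7", "heart_rate_bpm": "97"}]

def transform_emr_1(row):
    # Map EMR 1 field names onto the common format.
    return {"source": "emr_1", "patient_id": str(row["pid"]),
            "heart_rate": float(row["hr"])}

def transform_emr_2(row):
    # EMR 2 stores the value as text, so cast it to a number.
    return {"source": "emr_2", "patient_id": row["patient"],
            "heart_rate": float(row["heart_rate_bpm"])}

def load(rows):
    # Load harmonized rows into one centralized (here in-memory) database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vitals (source TEXT, patient_id TEXT, heart_rate REAL)")
    conn.executemany(
        "INSERT INTO vitals VALUES (:source, :patient_id, :heart_rate)", rows)
    conn.commit()
    return conn

harmonized = ([transform_emr_1(r) for r in emr_1]
              + [transform_emr_2(r) for r in emr_2])
conn = load(harmonized)
row_count = conn.execute("SELECT COUNT(*) FROM vitals").fetchone()[0]
```

Keeping one small transform function per source makes it easy to add a third EMR without touching the load step.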

Data Cleaning
1. Handling missing values
2. Standardizing inconsistencies in medical terms

Data Cleaning: Handling Missing Values

Data Cleaning: Handling Missing Values
Handling Missing Data:
• Case deletion
• Direct imputation
  • Statistical imputation: measures of central tendency (mean, median, mode), regression, multiple imputation
  • Machine learning based imputation: k Nearest Neighbours, multi-layer perceptron, neural network imputation (recurrent and auto-associative)
• Model-based imputation: Maximum Likelihood with Expectation Maximization algorithm, Gaussian Mixture Models
• Machine learning methods: ensemble methods, SVM, gradient boosting
Adapted from García Laencina P.J et al. Pattern Classification with Missing Data: A Review. Neural Comput Applied. 2009. 9(1): 1–12

Data Cleaning: 3 Mechanisms of “Missingness”
1. Missing Completely at Random (MCAR)
• Nurse forgot to record a patient’s blood pressure
2. Missing at Random (MAR)
• Younger patients with no risk of cardiovascular disease tend to have blood pressure missing, and will tend to have lower blood pressure than older patients with cardiovascular disease
3. Missing Not at Random (MNAR)
• Blood pressure is missing in patients who haven’t sought treatment, and those patients are more likely to have higher blood pressure
Examples adapted from Bhaskaran, K. et al. What is the difference between missing completely at random and missing at random? IJE 2014. 43(4): 1336-1339

Data Cleaning: Complete Case Method
Drop observations with NaNs
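
A minimal sketch of the complete case method on a list of records, assuming missing values are encoded as `None` or `float('nan')`. The patient records are made up for illustration. With pandas, `DataFrame.dropna()` does the same job on a DataFrame.

```python
import math

def is_missing(value):
    # Treat None and NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_cases(records):
    # Keep only the observations in which every field is present.
    return [r for r in records
            if not any(is_missing(v) for v in r.values())]

records = [
    {"patient_id": 1, "heart_rate": 88.0, "systolic_bp": 120.0},
    {"patient_id": 2, "heart_rate": float("nan"), "systolic_bp": 135.0},
    {"patient_id": 3, "heart_rate": 97.0, "systolic_bp": None},
]
kept = complete_cases(records)
```

The approach is simple, but it discards every partially observed patient, which can shrink the dataset quickly when many fields are sparsely recorded.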

Data Cleaning: Last Value Carried Forward
Impute each missing value with the last observed value
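
A minimal sketch of carrying the last value forward over one patient's time-ordered measurements, assuming missing entries are `None` or NaN. The heart rate series is made up. With pandas, `Series.ffill()` offers the same behaviour.

```python
import math

def carry_last_value_forward(values):
    # Replace each missing entry (None or NaN) with the most recent
    # observed value. Leading gaps stay missing (None) because there
    # is nothing to carry forward yet.
    filled, last = [], None
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        if not missing:
            last = v
        filled.append(last)
    return filled

heart_rate = [None, 88.0, float("nan"), None, 97.0, None]
filled = carry_last_value_forward(heart_rate)
```

The values must be sorted by time and belong to a single patient; otherwise one patient's measurement would leak into another's record.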

Data Cleaning: Sophisticated Imputation Methods
>>> import fancyimpute
• KNN (k Nearest Neighbours)
• IterativeSVD (a type of matrix factorization)
• IterativeImputer (multiple imputation*)
>>> import statsmodels
• BayesGaussMI (Gaussian model to impute multivariate data)
• MICE (Multiple Imputation through Chained Equations)

Data Cleaning: Cleaning up inconsistencies in medical terms
NS vs. Normal Saline vs. “Syringe NS” vs. 0.9% sodium chloride

Data Cleaning: Cleaning up inconsistencies in medical terms
Morphine vs. Morphine S vs. Morphine Sulfate

Data Cleaning: Cleaning up inconsistencies in medical terms
Drug Naming Standards (example: Morphine)
• RxNorm: 7052
• National Drug Code (NDC): 71335-0239
Open source APIs: RxNav
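
One way to picture the normalization step is a lookup from free-text variants to a canonical name and then to a standard identifier. The synonym table below is hand-written and hypothetical; in practice the mapping would come from a terminology service such as RxNav rather than a hard-coded dictionary. The only identifier used is the RxNorm code for morphine quoted on the slide.

```python
import re

# Hand-written synonym table (an assumption for illustration).
SYNONYMS = {
    "morphine": "morphine",
    "morphine s": "morphine",
    "morphine sulfate": "morphine",
    "ns": "0.9% sodium chloride",
    "normal saline": "0.9% sodium chloride",
    "syringe ns": "0.9% sodium chloride",
    "0.9% sodium chloride": "0.9% sodium chloride",
}
# RxNorm identifier for morphine, as given on the slide.
RXNORM = {"morphine": "7052"}

def normalize_drug_name(raw):
    # Trim, strip surrounding quotes, lowercase, collapse whitespace,
    # then look up the canonical name (None when the term is unknown).
    cleaned = raw.strip().strip("\"'“”").lower()
    key = re.sub(r"\s+", " ", cleaned)
    return SYNONYMS.get(key)

raw_terms = ["Morphine S", "  morphine   sulfate", "“Syringe NS”", "aspirin"]
names = [normalize_drug_name(t) for t in raw_terms]
codes = [RXNORM.get(n) for n in names]
```

Returning `None` for unknown terms, instead of guessing, makes unmapped drug names easy to count and review.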

Data Cleaning: Cleaning up inconsistencies in medical terms
• Heart attack vs. Myocardial infarction
• Bruise vs. Contusion
• Lou Gehrig’s Disease vs. Amyotrophic Lateral Sclerosis (ALS)
• Stroke vs. Cerebrovascular Accident (CVA)
• “Patient was febrile” vs. “Patient showed symptoms of a fever”
Standardize into Concept Unique Identifiers (CUI) from the Unified Medical Language System (UMLS)

Feature Engineering: Extracting properties from drug names
Medical Subject Headings (MeSH): descriptors used for indexing journal articles in PubMed’s database
• morphine: 'mesh_terms': ['analgesic', 'opioid', 'narcotics']
• amoxicillin: 'mesh_terms': ['antibacterial agents', 'beta-Lactamase Inhibitors']

Feature Engineering: Extracting properties from drug names
Anatomical Therapeutic Chemical (ATC): classifies active ingredients of drugs according to the organ or system on which they act
morphine: {'ATC': 'N02AA'}
• N: Nervous system
• N02: Analgesics
• N02A: Opioids
• N02AA: Natural opium alkaloids
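
Because each prefix of an ATC code names a broader group, a single code can be expanded into several categorical features. A minimal sketch, where the feature names (`atc_level_1`, ...) are hypothetical:

```python
def atc_levels(code):
    # ATC codes are hierarchical: 1 letter, 2 digits, 1 letter, 1 letter,
    # and (for a full 7-character code) 2 more digits.
    cuts = [1, 3, 4, 5, 7]
    return [code[:n] for n in cuts if len(code) >= n]

levels = atc_levels("N02AA")
# One feature per hierarchy level, so drugs can be grouped coarsely
# (nervous system) or finely (natural opium alkaloids).
features = {f"atc_level_{i}": prefix
            for i, prefix in enumerate(levels, start=1)}
```

Coarser levels give the model fewer, denser categories, which can help when individual drugs appear only a handful of times.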

Feature Engineering: Categorizing diagnoses
• International Statistical Classification of Diseases and Related Health Problems (ICD)
• Classification system for clinical diagnosis
• 13,000 ICD-9 codes, 68,000 ICD-10 codes
• Reported by physician in patient record
• Used for billing purposes

Feature Engineering
17 Charlson Comorbidities: Myocardial infarction; Congestive heart failure; Peripheral vascular disease; Cerebrovascular disease; Dementia; Chronic pulmonary disease; Rheumatic disease; Peptic ulcer disease; Mild liver disease; Diabetes without complication; Diabetes with complication; Hemiplegia/paraplegia; Renal disease; Any malignancy; Moderate/severe liver disease; Metastatic solid tumor; AIDS/HIV
30 Elixhauser Comorbidities: Congestive heart failure; Cardiac arrhythmias; Valvular disease; Pulmonary circulation disorders; Peripheral vascular disorders; Hypertension, uncomplicated; Hypertension, complicated; Paralysis; Other neurological disorders; Chronic pulmonary disease; Diabetes, uncomplicated; Diabetes, complicated; Hypothyroidism; Renal failure; Liver disease; Peptic ulcer disease; AIDS/HIV; Lymphoma; Metastatic cancer; Solid tumor without metastasis; Rheumatoid arthritis; Coagulopathy; Obesity; Weight loss; Fluid and electrolyte disorders; Blood loss anemia; Deficiency anemia; Drug abuse; Psychoses; Depression

Feature Engineering: Charlson Comorbidity Mapping
Moderate/Severe Liver Disease (ICD-9 456.0-456.2, 572.2-572.8):
• Esophageal varices with bleeding
• Esophageal varices without mention of bleeding
• Hepatic encephalopathy
• Portal hypertension
• Hepatorenal syndrome
• Other sequelae of chronic liver disease
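
The mapping from ICD-9 codes to a comorbidity flag can be sketched as a range check. This is an illustration under stated assumptions: codes are numeric strings such as "572.3" (no V or E codes), and only the first decimal digit is compared, so "456.20" is treated like "456.2". The two ranges are the ones quoted on the slide; the function names are hypothetical.

```python
# ICD-9 ranges for moderate/severe liver disease, as quoted on the slide.
MODERATE_SEVERE_LIVER_DISEASE = [("456.0", "456.2"), ("572.2", "572.8")]

def icd9_key(code):
    # Compare on (category, first decimal digit). Further digits are
    # ignored (an assumption); numeric codes only.
    category, _, detail = code.strip().partition(".")
    return int(category), int(detail[:1] or 0)

def in_ranges(code, ranges):
    key = icd9_key(code)
    return any(icd9_key(lo) <= key <= icd9_key(hi) for lo, hi in ranges)

def has_moderate_severe_liver_disease(patient_codes):
    # One binary comorbidity feature from a patient's list of ICD-9 codes.
    return int(any(in_ranges(c, MODERATE_SEVERE_LIVER_DISEASE)
                   for c in patient_codes))

flag_a = has_moderate_severe_liver_disease(["428.0", "572.3"])
flag_b = has_moderate_severe_liver_disease(["456.8", "250.00"])
```

Collapsing thousands of diagnosis codes into a few comorbidity flags like this one gives the model a compact, clinically meaningful summary of a patient's history.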

Feature Engineering: Generating new features from imaging data

Feature Engineering: Clinical notes
Source: PhysioNet
• Admission note
• Daily progress note
• Radiology note
• Consult note
• Discharge summary

Feature Engineering: Identifying symptoms using a Named Entity Recognition (NER) Model
Example entity: cough
Possible categories: symptom, anatomy, treatment, condition, hospital dept, medication

Feature Engineering: Identifying symptoms using a Named Entity Recognition (NER) Model
Beware of ambiguous terms:
• Parkinson’s - patient name or disease?
• Lou Gehrig (also known as ALS) - patient name or disease?
• Dermatome - area of skin or surgical instrument used for skin grafting?
• Pelvis - part of the kidney or hip bone?

Defining the Target Variable
Approach 1: ICD codes (explicit diagnosis by the physician)

Defining the Target Variable
Approach 2: Clinical Notes
Assertion classification: Present? Absent? Speculation?
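
To make the three assertion classes concrete, here is a toy rule-based sketch. It is not the method used in the talk; the cue lists are tiny, hand-picked assumptions, and a production system would rely on a trained model or a much larger, validated lexicon.

```python
import re

# Tiny, hand-picked cue lists (assumptions for illustration only).
NEGATION_CUES = ["no evidence of", "denies", "negative for", "without"]
SPECULATION_CUES = ["possible", "suspected", "cannot rule out", "may have"]

def classify_assertion(sentence, concept):
    # Label a concept mention as 'present', 'absent', or 'speculation'
    # based on cue phrases that appear before it in the sentence.
    text = sentence.lower()
    position = text.find(concept.lower())
    if position == -1:
        return None  # concept not mentioned
    before = text[:position]

    def has_cue(cues):
        return any(re.search(r"\b" + re.escape(c) + r"\b", before)
                   for c in cues)

    if has_cue(NEGATION_CUES):
        return "absent"
    if has_cue(SPECULATION_CUES):
        return "speculation"
    return "present"

label_1 = classify_assertion("Patient developed sepsis on day 3.", "sepsis")
label_2 = classify_assertion("No evidence of sepsis at this time.", "sepsis")
label_3 = classify_assertion("Possible sepsis, antibiotics started.", "sepsis")
```

Only mentions labelled "present" would count toward a positive target; treating negated or speculative mentions as positives would add label noise.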

Defining the Target Variable
Approach 3: Severity Score
• Sequential Organ Failure Assessment (SOFA)
• Systemic Inflammatory Response Syndrome (SIRS)
• Logistic Organ Dysfunction System (LODS)
Inputs: vitals, blood test results, urine test results
Output: a mortality prediction score that is based on the degree of dysfunction of 6 organ systems

Defining the Target Variable
Approach 3: Severity Score
Sepsis is defined by “an acute change in total SOFA score ≥2 points consequent to infection.” Singer et al. [6]
Window: from 48 hours prior to suspicion of infection* to 24 hours post
If SOFA score increases by 2 points or greater within this window, we will assume that the patient has sepsis
*suspicion of infection = ordering a culture lab draw AND prescribing antibiotics
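
The labelling rule can be sketched as a function over a patient's timestamped SOFA scores. The function name and data layout are hypothetical, and one detail is an assumption: an "increase" is measured from the lowest earlier score inside the window to any later score inside the window.

```python
from datetime import datetime, timedelta

def sepsis_label(sofa_scores, suspicion_time,
                 hours_before=48, hours_after=24, min_increase=2):
    # sofa_scores: list of (timestamp, total SOFA score) pairs.
    # Returns 1 if the score rises by at least `min_increase` points
    # between an earlier and a later measurement inside the window
    # around the suspicion-of-infection time, else 0.
    start = suspicion_time - timedelta(hours=hours_before)
    end = suspicion_time + timedelta(hours=hours_after)
    in_window = sorted((t, s) for t, s in sofa_scores if start <= t <= end)
    lowest_so_far = None
    for _, score in in_window:
        if lowest_so_far is not None and score - lowest_so_far >= min_increase:
            return 1
        lowest_so_far = (score if lowest_so_far is None
                         else min(lowest_so_far, score))
    return 0

suspicion = datetime(2019, 5, 3, 12, 0)
rising = [(suspicion - timedelta(hours=40), 1),
          (suspicion - timedelta(hours=10), 2),
          (suspicion + timedelta(hours=5), 4)]
late_rise = [(suspicion - timedelta(hours=30), 3),
             (suspicion - timedelta(hours=5), 2),
             (suspicion + timedelta(hours=10), 3),
             (suspicion + timedelta(hours=30), 8)]  # 8 falls outside the window
label_rising = sepsis_label(rising, suspicion)
label_late = sepsis_label(late_rise, suspicion)
```

In the second example the jump to 8 happens 30 hours after suspicion of infection, outside the 24-hour limit, so the patient is labelled 0.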

Model Selection
Binary Classification: Did the patient get sepsis during their hospital stay? (no_sepsis = 0, has_sepsis = 1)
• Random Forest Classifier
• Logistic regression with L1/L2 regularization
• Support Vector Machine
Time-series Classification: When did the patient get sepsis during their hospital stay?
• Fixed/random effects regression
• Multivariate Long Short Term Memory (LSTM) Model
• Bayesian networks
[Figure: health score over time, no sepsis vs. has sepsis]

Model Selection: No Free Lunch Theorem
“all optimization problem strategies perform equally well when averaged over all possible problems”

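Evaluation Metrics
$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$
$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$
$\mathrm{precision} = \frac{TP}{TP + FP}$
$\mathrm{recall} = \frac{TP}{TP + FN}$
$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
$\mathrm{accuracy} = \frac{TP + TN}{N_{\mathrm{total}}}$
$\mathrm{logloss} = -\left(y \log(p) + (1 - y)\log(1 - p)\right)$
Area Under the Receiver Operating Characteristic Curve (AUROC/AUC)
[Figure: ROC curve, TPR vs. FPR]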
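
The metrics on this slide can be written out in plain Python, as a sketch for checking the definitions by hand. Two details are assumptions: ratios with a zero denominator are reported as 0.0, and log loss is averaged over samples with probabilities clipped away from 0 and 1. In practice, scikit-learn's `sklearn.metrics` module provides tested implementations.

```python
import math

def confusion_counts(y_true, y_pred):
    # Count true/false positives and negatives for 0/1 labels.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

def log_loss(y_true, p_pred, eps=1e-15):
    # Mean of the per-sample log loss; probabilities are clipped so
    # that log(0) never occurs.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def rmse(y_true, y_hat):
    squared = sum((y - yh) ** 2 for y, yh in zip(y_true, y_hat))
    return math.sqrt(squared / len(y_true))

def mae(y_true, y_hat):
    return sum(abs(y - yh) for y, yh in zip(y_true, y_hat)) / len(y_true)

metrics = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                                 [1, 0, 0, 1, 0, 1, 1, 0])
```

For a rare outcome such as sepsis, accuracy alone can look high for a model that never predicts the positive class, which is why precision and recall are reported alongside it.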

Caveats: Demographic Bias

Caveats: Defining the Ground Truth

Caveats: Adversarial Attacks
Source: Attacking Machine Learning with Adversarial Examples (OpenAI)
[Figure: adversarial input]

Caveats: Adversarial Attacks

Caveats: Collaboration, Validation, and Buy-in from Clinicians

Thank you!
Jill Cates, Data Scientist at BioSymetrics
github: @topspinj
