Slide 1

Slide 1 text

Building a Binary Classification Model to Predict Hospital Readmission
TMLS Workshop
Jill Cates
BioSymetrics Team
November 20, 2019

Slide 2

Slide 2 text

• BioSymetrics has built a biomedical ML framework (Augusta) designed to shift time from data pre-processing and integration to model building and interrogation, using familiar toolsets within Python.
• Pre-processes diverse, raw medical data types (e.g. images, chemical structures, tabular data) and constructs workflows designed to reduce bias and maximize efficiency.
• Applications: early-stage drug discovery, disease diagnostics, patient outcome prediction

Slide 3

Slide 3 text

Quick Survey
• Experience with Python?
• Experience with pandas?
• Experience with scikit-learn?
• Familiarity with machine learning concepts?

Slide 4

Slide 4 text

Agenda
• Role of machine learning in healthcare
• Challenges of machine learning in healthcare
• Tutorial
  - Part 1: Data Exploration and Pre-processing
  - Part 2: Machine Learning Introduction

Slide 5

Slide 5 text

The Role of Machine Learning in Healthcare
• Helping diagnose patients
• Detecting early stages of diseases
• Developing personalized treatments
• Automated triaging of patients

Slide 6

Slide 6 text

The Role of Machine Learning in Healthcare
• Processing clinical notes: AWS Comprehend Medical
• Diagnosing diabetic eye disease: Google AI
• Assessing patient-reported symptoms: Babylon Health
• Developing targeted cancer treatments using precision medicine: Microsoft

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Why is Machine Learning for Healthcare So Hard?
1. High-stakes industry
2. Regulations
3. Resistant to change
4. Messy, unstructured data

Slide 9

Slide 9 text

Electronic Medical Records

Slide 10

Slide 10 text

Electronic Medical Records
Disparate Data Types:
• Past medical history
• Vital signs
• Laboratory results
• Electrocardiographs (ECG)
• Electroencephalographs (EEG)
• 2D/3D images (X-ray, CT, MRI)
• Histology images
• Genomic data
[Figure panels: Chest X-ray, ECG, Genomic Data, Histology Image]

Slide 11

Slide 11 text

Tutorial https://github.com/topspinj/diabetes-ml-workshop

Slide 12

Slide 12 text

Data Science Tools seaborn

Slide 13

Slide 13 text

Our Dataset

Slide 14

Slide 14 text

column              description
encounter_id        Unique identifier of an encounter
patient_nbr         Unique identifier of a patient
race                Race of patient
gender              Gender of patient
age                 Age of patient, grouped in 10-year intervals
weight              Weight of patient in pounds
admission_type_id   Identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available
time_in_hospital    Length of hospital stay (number of days between admission and discharge)
medical_specialty   Specialty of the attending physician
num_lab_procedures  Number of lab tests performed during the encounter
num_procedures      Number of procedures (other than lab tests) performed during the encounter
num_medications     Number of distinct generic medication names administered during the encounter
number_outpatient   Number of outpatient visits of the patient in the year preceding the encounter
number_emergency    Number of emergency visits of the patient in the year preceding the encounter
number_inpatient    Number of inpatient visits of the patient in the year preceding the encounter
number_diagnoses    Number of diagnoses entered to the system (most likely in the form of ICD codes)
max_glu_serum       Result from glucose serum test. "None" if test was not taken.
A1Cresult           Results from A1C test, which reflects a patient's average blood glucose levels over the past 3 months.
*medications        24 columns for medications. Describes whether the drug was prescribed or if there was a change in dosage.
diabetesMed         Indicates whether the patient was prescribed any diabetic medication.
readmitted          Indicates whether the patient was readmitted to the hospital.

Slide 15

Slide 15 text

• insert code cell above: CTRL/CMD M A
• insert code cell below: CTRL/CMD M B
• convert to text cell: CTRL/CMD M M
• convert to code cell: CTRL/CMD M Y

Slide 16

Slide 16 text

Supervised Learning
age  gender  HR   sys_bp  has_sepsis
60   M       95   145     0
15   F       62   112     1
89   M       110  138     0
12   M       57   97      1
predictors (X): age, gender, HR, sys_bp
label (y): has_sepsis

Slide 17

Slide 17 text

Supervised Learning
X:
age  gender  HR   sys_bp
60   M       95   145
15   F       62   112
89   M       110  138
12   M       57   97
y:
has_sepsis
0
1
0
1
[Diagram: Machine Learning Model linking X and y]

Slide 18

Slide 18 text

Supervised Learning
age  gender  HR   sys_bp  has_sepsis
60   M       95   145     0
15   F       62   112     1
89   M       110  138     0
12   M       57   97      1
predictors (X): age, gender, HR, sys_bp
label (y): has_sepsis
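The split into predictors (X) and label (y) shown on the slides can be sketched in pandas as follows. This is a minimal illustration that assumes the four example rows from the slide are typed in by hand; the variable names `df`, `X`, and `y` are choices made here, not taken from the workshop notebook.

```python
import pandas as pd

# The four example patients from the slide, entered by hand for illustration.
df = pd.DataFrame({
    "age": [60, 15, 89, 12],
    "gender": ["M", "F", "M", "M"],
    "HR": [95, 62, 110, 57],
    "sys_bp": [145, 112, 138, 97],
    "has_sepsis": [0, 1, 0, 1],
})

# Predictors (X): every column except the label.
X = df.drop(columns=["has_sepsis"])

# Label (y): the single column the model should learn to predict.
y = df["has_sepsis"]

print(X.shape)       # rows and columns in the predictor table
print(y.tolist())    # label values, one per patient
```

Keeping the label out of X matters: if `has_sepsis` were left among the predictors, a model could simply read the answer off the input.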

Slide 19

Slide 19 text

Supervised Learning Regression Classification

Slide 20

Slide 20 text

Supervised Learning: Regression vs. Classification
Example label: has_sepsis = 0, 1, 0, 1 (two classes, so this is binary classification)

Slide 21

Slide 21 text

Supervised Learning: Regression vs. Classification
Example label: health_status = normal, viral, bacterial, viral (more than two classes, so this is multiclass classification)

Slide 22

Slide 22 text

Supervised Learning: Regression vs. Classification
Example label: n_days = 10, 2, 21, 25 (a numeric quantity, so this is regression)

Slide 23

Slide 23 text

Binary Classification
Possible Models:
• Decision Trees
• Random Forest Classifier
• Support Vector Machine
• K-Neighbours
• Logistic Regression
• AdaBoost Classifier
[Diagram: Patient, classified as Readmitted or Not Readmitted]

Slide 24

Slide 24 text

Random Forest Classification
A forest of decision trees
Tree predictions: Readmitted, Readmitted, Not readmitted
Final prediction: READMITTED (prob = 0.667)
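The arithmetic behind the slide's final prediction can be sketched as a simple majority vote: two of three trees say "readmitted", so the forest predicts that class with a vote share of 2/3. The helper `forest_vote` below is a hypothetical name used only for this illustration. It mirrors the simplified picture on the slide; scikit-learn's `RandomForestClassifier` averages each tree's predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the majority class and the share of trees that voted for it."""
    counts = Counter(tree_predictions)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(tree_predictions)

# The three tree outputs shown on the slide.
votes = ["readmitted", "readmitted", "not readmitted"]
label, prob = forest_vote(votes)

print(label, round(prob, 3))  # readmitted 0.667
```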

Slide 25

Slide 25 text

Precision and Recall
                         Predicted: readmitted   Predicted: not readmitted
Actual: readmitted       True positive (TP)      False negative (FN)
Actual: not readmitted   False positive (FP)     True negative (TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
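To make the two formulas concrete, here is a small sketch that counts TP, FP, and FN from lists of actual and predicted labels. The function name `precision_recall` and the example labels are made up for illustration (1 = readmitted, 0 = not readmitted). In practice, `precision_score` and `recall_score` from `sklearn.metrics` compute the same quantities.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when there are no positive predictions or labels.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 1 = readmitted, 0 = not readmitted.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), recall)  # 0.667 0.5
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3 and recall is 2/4. The model is right about most of the patients it flags, but it misses half of the patients who were actually readmitted.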

Slide 26

Slide 26 text

Hyperparameter Tuning
Model hyperparameters:
• Configuration that is external to the model
• Set to a pre-determined value before model training
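As a rough illustration of "set before training", the sketch below fixes two hyperparameters of a scikit-learn random forest in the constructor and then fits it. The synthetic data from `make_classification` and the specific values (10 trees, depth 3) are assumptions chosen for this example, not settings from the workshop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used here only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters are chosen by us and passed in before any training happens.
model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
hyperparams = model.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Training learns the trees themselves; it does not change the hyperparameters.
model.fit(X, y)
n_trees = len(model.estimators_)
importances = model.feature_importances_
print(n_trees, len(importances))
```

The number of trees and their maximum depth stay exactly as configured, while the fitted trees and the feature importances are what the model learns from the data.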

Slide 27

Slide 27 text

What is a Hyperparameter?
Example: clinical trials
• goal: maximize drug effectiveness
• active ingredients
• concentrations
• Did it cure the patient?

Slide 28

Slide 28 text

What is a Hyperparameter?
Example: drug discovery
Compound  Cdk4/D    Cdk2/A
0174413   0.210 μM  0.012 μM
0204661   0.092 μM  0.002 μM
0205783   0.145 μM  5.010 μM

Slide 29

Slide 29 text

What is a Hyperparameter?
Example: drug discovery
Compound  Cdk4/D    Cdk2/A
0174413   0.210 μM  0.012 μM
0204661   0.092 μM  0.002 μM
0205783   0.145 μM  5.010 μM
[Figure labels: Toxic, Therapeutic]
Location of nitrogen = hyperparameter

Slide 30

Slide 30 text

Grid Search
Provide a discrete set of hyperparameter values
Search Space
• n_estimators = [5, 10, 50]
• max_depth = [3, 5]
Models
1) n_estimators=5, max_depth=3
2) n_estimators=5, max_depth=5
3) n_estimators=10, max_depth=3
4) n_estimators=10, max_depth=5
5) n_estimators=50, max_depth=3
6) n_estimators=50, max_depth=5
[Figure: grid of max_depth (3, 5) against n_estimators (5, 10, 50)]
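The six candidate models listed above are simply every combination of the two value lists. A minimal sketch of that enumeration, using only the standard library, is shown below. The helper name `expand_grid` is made up for this example; scikit-learn's `GridSearchCV` performs the same expansion internally and additionally trains and scores each candidate with cross-validation.

```python
from itertools import product

# Search space from the slide.
search_space = {"n_estimators": [5, 10, 50], "max_depth": [3, 5]}

def expand_grid(space):
    """Return one dict per combination of hyperparameter values."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

models = expand_grid(search_space)
for i, params in enumerate(models, start=1):
    print(i, params)

print(len(models))  # 6
```

The count grows multiplicatively: 3 values for `n_estimators` times 2 values for `max_depth` gives 6 models, and every added hyperparameter multiplies that number again.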

Slide 31

Slide 31 text

Random Search • Based on assumption that not all hyperparameters are equally important • Works by sampling hyperparamater values from a distribution

Slide 32

Slide 32 text

Random Search
[Figure panels: Grid Search, Random Search]

Slide 33

Slide 33 text

Sequential Model-Based Optimization

Slide 34

Slide 34 text

Wrap-Up
What we covered:
• Data cleaning
• Data visualization
• Feature selection
• Supervised learning
• Random forest classification
• Hyperparameter tuning
• Model evaluation