
TMLS Workshop Slides

Jill Cates
November 20, 2019



Transcript

  1. Building a Binary Classification Model to Predict Hospital Readmission

    TMLS Workshop
    Jill Cates
    BioSymetrics Team
    November 20, 2019
  2. • BioSymetrics has built a biomedical ML framework (Augusta) designed to shift time from data pre-processing and integration to model building and interrogation, using familiar toolsets within Python.
    • Pre-processes diverse, raw medical data types (e.g. images, chemical structures, tabular data) and constructs workflows designed to reduce bias and maximize efficiency
    • Applications: early-stage drug discovery, disease diagnostics, patient outcome prediction
  3. Quick Survey

    • Experience with Python?
    • Experience with pandas?
    • Experience with scikit-learn?
    • Familiarity with machine learning concepts?
  4. Agenda

    • Role of machine learning in healthcare
    • Challenges of machine learning in healthcare
    • Tutorial
      - Part 1: Data Exploration and Pre-processing
      - Part 2: Machine Learning Introduction
  5. The Role of Machine Learning in Healthcare

    • Helping diagnose patients
    • Detecting early stages of diseases
    • Developing personalized treatments
    • Automated triaging of patients
  6. The Role of Machine Learning in Healthcare

    • Processing clinical notes: AWS Comprehend Medical
    • Diagnosing diabetic eye disease: Google AI
    • Assessing patient-reported symptoms: Babylon Health
    • Developing targeted cancer treatments using precision medicine: Microsoft
  7. Why is Machine Learning for Healthcare So Hard?

    1. High-stakes industry
    2. Regulations
    3. Resistant to change
    4. Messy, unstructured data
  8. Disparate Data Types:

    • Past medical history
    • Vital signs
    • Laboratory results
    • Electrocardiographs (ECG)
    • Electroencephalographs (EEG)
    • 2D/3D images (X-ray, CT, MRI)
    • Histology images
    • Genomic data

    [Figure panels: Chest X-ray, ECG, Genomic Data, Histology Image, Electronic Medical Records]
  9. column: description

    • encounter_id: Unique identifier of an encounter
    • patient_nbr: Unique identifier of a patient
    • race: Race of patient
    • gender: Gender of patient
    • age: Age of patient, grouped in 10-year intervals
    • weight: Weight of patient in pounds
    • admission_type_id: Identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available
    • time_in_hospital: Length of hospital stay (number of days between admission and discharge)
    • medical_specialty: Specialty of the attending physician
    • num_lab_procedures: Number of lab tests performed during the encounter
    • num_procedures: Number of procedures (other than lab tests) performed during the encounter
    • num_medications: Number of distinct generic medication names administered during the encounter
    • number_outpatient: Number of outpatient visits of the patient in the year preceding the encounter
    • number_emergency: Number of emergency visits of the patient in the year preceding the encounter
    • number_inpatient: Number of inpatient visits of the patient in the year preceding the encounter
    • number_diagnoses: Number of diagnoses entered to the system (most likely in the form of ICD codes)
    • max_glu_serum: Result from glucose serum test. "None" if test was not taken.
    • A1Cresult: Results from A1C test, which reflects a patient's average blood glucose levels over the past 3 months.
    • *medications: 24 columns for medications. Describes whether the drug was prescribed or if there was a change in dosage.
    • diabetesMed: Indicates whether the patient was prescribed any diabetic medication.
    • readmitted: Indicates whether the patient was readmitted to the hospital.
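The column descriptions above suggest two simple pre-processing steps: flagging whether the glucose serum test was taken, and turning `readmitted` into a binary target. The following is a minimal pandas sketch on made-up rows. The label values `"NO"`, `"<30"`, `">30"` and the test results `">200"` and `"Norm"` are assumptions for illustration, not taken from the slides.

```python
import pandas as pd

# Toy rows using column names from the data dictionary.
# All values, including the readmitted labels, are made up for illustration.
df = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "time_in_hospital": [3, 7, 1, 12],
    "max_glu_serum": ["None", ">200", "None", "Norm"],
    "readmitted": ["NO", "<30", ">30", "NO"],
})

# "None" means the glucose serum test was not taken, so record that as a flag.
df["glu_tested"] = df["max_glu_serum"] != "None"

# Collapse the (assumed) readmission labels into a binary target:
# 1 = readmitted, 0 = not readmitted.
df["readmitted_binary"] = (df["readmitted"] != "NO").astype(int)

print(df[["max_glu_serum", "glu_tested", "readmitted", "readmitted_binary"]])
```

Keeping "test not taken" as its own flag, rather than treating it as a missing value, preserves the information that no test was ordered.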
  10. • insert code cell above: CTRL/CMD M A
    • insert code cell below: CTRL/CMD M B
    • convert to text cell: CTRL/CMD M M
    • convert to code cell: CTRL/CMD M Y
  11. Supervised Learning

    age  gender  HR   sys_bp  has_sepsis
    60   M       95   145     0
    15   F       62   112     1
    89   M       110  138     0
    12   M       57   97      1

    predictors (X): age, gender, HR, sys_bp
    label (y): has_sepsis
  12. Supervised Learning

    X:
    age  gender  HR   sys_bp
    60   M       95   145
    15   F       62   112
    89   M       110  138
    12   M       57   97

    Machine Learning Model

    y:
    has_sepsis
    0
    1
    0
    1
  13. Supervised Learning

    age  gender  HR   sys_bp  has_sepsis
    60   M       95   145     0
    15   F       62   112     1
    89   M       110  138     0
    12   M       57   97      1

    predictors (X): age, gender, HR, sys_bp
    label (y): has_sepsis
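The split into predictors (X) and label (y) on these slides can be sketched in pandas using the same four rows. The one-hot encoding step is an added assumption: scikit-learn models expect numeric inputs, so the `gender` column has to be encoded somehow, and `pd.get_dummies` is one common choice.

```python
import pandas as pd

# The four example patients from the slide.
df = pd.DataFrame({
    "age": [60, 15, 89, 12],
    "gender": ["M", "F", "M", "M"],
    "HR": [95, 62, 110, 57],
    "sys_bp": [145, 112, 138, 97],
    "has_sepsis": [0, 1, 0, 1],
})

# Predictors (X) are every column except the label; the label (y) is has_sepsis.
X = df.drop(columns="has_sepsis")
y = df["has_sepsis"]

# Encode the categorical gender column as numeric indicator columns.
X_encoded = pd.get_dummies(X, columns=["gender"])

print(X_encoded.columns.tolist())
print(y.tolist())
```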
  14. Binary Classification

    Possible Models:
    • Decision Trees
    • Random Forest Classifier
    • Support Vector Machine
    • K-Neighbours
    • Logistic Regression
    • AdaBoost Classifier

    [Figure: Patient classified as Readmitted or Not Readmitted]
  15. Random Forest Classification

    A forest of decision trees

    Tree predictions: Readmitted, Readmitted, Not readmitted
    Final prediction: READMITTED (prob = 0.667)
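The probability on this slide is the share of trees voting "readmitted": 2 of 3. The sketch below reproduces that arithmetic and then fits a three-tree forest with scikit-learn on synthetic data (the dataset and all parameter values are assumptions for illustration). Note that scikit-learn's `predict_proba` averages each tree's predicted class probabilities, which is close to, but not identical to, counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# The slide's arithmetic: 2 of 3 trees vote "readmitted" (encoded as 1).
tree_votes = np.array([1, 1, 0])
prob_readmitted = tree_votes.mean()
print(round(prob_readmitted, 3))  # 0.667

# The same idea with scikit-learn, on synthetic data (not the workshop dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

# One row of [P(class 0), P(class 1)], averaged over the 3 trees.
proba = forest.predict_proba(X[:1])
print(proba)
```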
  16. Precision and Recall

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)

    [Figure: confusion matrix of Actual vs. Predicted classes (readmitted, not readmitted), with cells True positive, False positive, True negative, False negative]
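As a worked example of the two formulas, the sketch below computes precision and recall from a confusion matrix and checks them against scikit-learn's built-in scorers. The eight labels are made up for illustration, with 1 standing for "readmitted".

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = readmitted, 0 = not readmitted.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 2 / (2 + 1)
recall = tp / (tp + fn)     # 2 / (2 + 2)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")

# The library scorers give the same values.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-9
assert abs(recall - recall_score(y_true, y_pred)) < 1e-9
```

Here precision is 2/3 (one of the three predicted readmissions was wrong) and recall is 1/2 (two of the four actual readmissions were missed).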
  17. Hyperparameter Tuning

    Model hyperparameters:
    • Configuration that is external to the model
    • Set to a pre-determined value before model training
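To make the distinction concrete, here is a small scikit-learn sketch, assuming a random forest and synthetic data (neither is prescribed by the slide). Hyperparameters such as `n_estimators` and `max_depth` are passed in before training, while attributes ending in an underscore, such as `estimators_`, only exist after `fit` has learned them from the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters: chosen by us, before the model sees any data.
model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
params = model.get_params()
print(params["n_estimators"], params["max_depth"])  # 10 3

# Learned attributes do not exist until the model is trained.
fitted_before = hasattr(model, "estimators_")

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model.fit(X, y)

# After training, the forest holds its 10 fitted trees.
fitted_after = hasattr(model, "estimators_")
print(fitted_before, fitted_after, len(model.estimators_))
```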
  18. What is a Hyperparameter?

    Example: clinical trials
    goal: maximize drug effectiveness
    active ingredients
    concentrations
    Did it cure the patient?
  19. What is a Hyperparameter?

    Example: drug discovery
    • 0174413: Cdk4/D: 0.210 μM, Cdk2/A: 0.012 μM
    • 0204661: Cdk4/D: 0.092 μM, Cdk2/A: 0.002 μM
    • 0205783: Cdk4/D: 0.145 μM, Cdk2/A: 5.010 μM
  20. What is a Hyperparameter?

    Example: drug discovery
    • 0174413: Cdk4/D: 0.210 μM, Cdk2/A: 0.012 μM
    • 0204661: Cdk4/D: 0.092 μM, Cdk2/A: 0.002 μM
    • 0205783: Cdk4/D: 0.145 μM, Cdk2/A: 5.010 μM

    Figure labels: Toxic, Therapeutic
    Location of nitrogen = hyperparameter
  21. Grid Search

    Provide a discrete set of hyperparameter values.

    Search Space
    • n_estimators = [5, 10, 50]
    • max_depth = [3, 5]

    Models
    1) n_estimators=5, max_depth=3
    2) n_estimators=5, max_depth=5
    3) n_estimators=10, max_depth=3
    4) n_estimators=10, max_depth=5
    5) n_estimators=50, max_depth=3
    6) n_estimators=50, max_depth=5

    [Figure: grid of max_depth against n_estimators]
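The six models listed on this slide are the Cartesian product of the two value lists. The sketch below enumerates them in the slide's order and then runs the same grid through scikit-learn's `GridSearchCV`. The synthetic dataset, `cv=3`, and the default accuracy scoring are assumptions made to keep the example small.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [5, 10, 50], "max_depth": [3, 5]}

# Enumerate every combination, in the same order as the slide.
models = [
    {"n_estimators": n, "max_depth": d}
    for n, d in product(param_grid["n_estimators"], param_grid["max_depth"])
]
for i, m in enumerate(models, start=1):
    print(i, m)

# Grid search trains and cross-validates one model per combination.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

The cost grows multiplicatively: adding a third hyperparameter with four candidate values would raise the count from 6 to 24 models, each fitted once per cross-validation fold.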
  22. Random Search

    • Based on the assumption that not all hyperparameters are equally important
    • Works by sampling hyperparameter values from a distribution
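Sampling from a distribution can be sketched with scikit-learn's `RandomizedSearchCV` and `scipy.stats.randint`. The ranges below mirror the grid search slide but are otherwise an assumption, as are the synthetic dataset and the choice of `n_iter=4`.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Distributions instead of fixed lists; randint(low, high) excludes high.
param_distributions = {
    "n_estimators": randint(5, 51),  # integers from 5 to 50
    "max_depth": randint(3, 6),      # integers from 3 to 5
}

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n_iter sets how many random combinations are sampled and evaluated.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)

sampled = search.cv_results_["params"]
print(sampled)
print(search.best_params_)
```

Unlike grid search, the number of models trained is fixed by `n_iter` rather than by the size of the search space.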
  23. Wrap-Up

    What we covered:
    • Data cleaning
    • Data visualization
    • Feature selection
    • Supervised learning
    • Random forest classification
    • Hyperparameter tuning
    • Model evaluation