Slide 1

Slide 1 text

A Brief Introduction to Hyperparameter Optimization
Jill Cates, Data Scientist @ Shopify
Toronto Womxn in Data Science
March 2, 2020

Slide 2

Slide 2 text

Why is hyperparameter tuning important?

Slide 3

Slide 3 text

A Typical ML Pipeline
Pre-processing, Modeling, Post-processing, Hyperparameter optimization
(Figure: R.J. Urbanowicz et al. 2018)

Slide 4

Slide 4 text

Case Study Sepsis Prediction

Slide 5

Slide 5 text

What is sepsis?
• A “life-threatening condition that arises when the body's response to infection causes injury to its own tissues and organs” [1]
• 750,000 patients are diagnosed with severe sepsis in the United States each year, with a 30% mortality rate [2]
• Sepsis costs $20.3 billion each year ($55.6 million per day) in U.S. hospitals [3]
• For every hour that passes before treatment begins, a patient's risk of death from sepsis increases by 4-8% [4]

Slide 6

Slide 6 text

Build a model that predicts a patient’s likelihood of getting sepsis Proposal

Slide 7

Slide 7 text

An Overview of Our Pipeline
EMR data:
• Demographics (age, gender, ethnicity)
• Past medical history
• Blood test results
• Microbiology results
• Imaging (MRI, US, CT)
Feature Engineering & Feature Selection: create new features, select best features
Modeling: model selection, hyperparameter tuning, evaluation
Predict sepsis

Slide 8

Slide 8 text

Our Data
50,000 hospital admissions and 40,000 patients

Data: Description
• Admissions information: diagnosis upon admission, time of admission/discharge
• Patient demographics: age, gender, religion, marital status
• Prescriptions: which drugs were they prescribed and when?
• Unit transfers: did they move from the medical ward to ICU?
• Vital signs: heart rate, blood pressure, respiratory rate, SpO2
• Lab results: blood tests, urine tests
• Diagnoses: ICD-10 codes
• Chest X-ray images: DICOM format

Slide 9

Slide 9 text

Clean up inconsistencies in medical terms • Aspirin vs. ASA (acetylsalicylic acid) • NS (normal saline) vs. 0.9% sodium chloride Unified Medical Language System Data Pre-Processing

Slide 10

Slide 10 text

Data Pre-Processing
Generate features from clinical notes using topic modelling
• Patient records (e.g. “Mr. John Smith, 78 y.o.”) are modelled with Latent Dirichlet Allocation (LDA)
• Treat each topic as a feature

Slide 11

Slide 11 text

Data Pre-Processing Generate new features from imaging data • identify lung opacities in X-ray image

Slide 12

Slide 12 text

How do we identify sepsis in a patient? * International Statistical Classification of Diseases and Related Health Problems (ICD), 10th revision, developed by the World Health Organization (WHO) * ICD codes are listed for billing patients at end of stay Creating a Sepsis Score

Slide 13

Slide 13 text

How do we identify sepsis in a patient? Severity scores based on lab results and vitals: • SOFA: Sequential Organ Failure Assessment [6] • SIRS: Systemic Inflammatory Response Syndrome [7] • LODS: Logistic Organ Dysfunction System [8] Creating a Sepsis Score

Slide 14

Slide 14 text

Creating a Sepsis Score
How do we identify sepsis in a patient?
Assertion classification: Present? Absent? Speculation?

Slide 15

Slide 15 text

Creating a Sepsis Score
SOFA: Sequential Organ Failure Assessment
• A mortality prediction score based on the degree of dysfunction of six organ systems (Jones et al. 2010. Crit Care Med.)
• Inputs: vitals, blood test results, urine test results
• Sepsis = acute change in total SOFA score ≥ 2 points upon initial infection [9]
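The labelling rule above (an acute change in total SOFA score of at least 2 points) can be sketched in a few lines of Python. This is only an illustration: the function names, the `threshold` parameter, and the example scores are hypothetical and do not come from the talk, and computing the total SOFA score from vitals and lab results is assumed to happen elsewhere.

```python
def sofa_delta(baseline_sofa, current_sofa):
    """Change in total SOFA score relative to the patient's baseline."""
    return current_sofa - baseline_sofa


def label_sepsis(baseline_sofa, current_sofa, threshold=2):
    """Return 1 (sepsis) if total SOFA rose by at least `threshold` points, else 0."""
    return int(sofa_delta(baseline_sofa, current_sofa) >= threshold)


# Hypothetical admissions: (admission_id, baseline SOFA, SOFA at suspected infection)
admissions = [(1001, 1, 2), (1002, 0, 3), (1003, 4, 4), (1004, 2, 4)]

# Map each admission to a binary sepsis label
labels = {adm_id: label_sepsis(base, cur) for adm_id, base, cur in admissions}
print(labels)
```

The resulting 0/1 labels are the kind of target column a binary classifier would be trained on.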

Slide 16

Slide 16 text

Picking a Model
Random Forest Classifier: a forest of decision trees
A binary classification problem:

admission_id  sepsis
1001          0
1002          1
1003          0
1004          1

Output: a value between 0 and 1 that represents the patient’s likelihood of sepsis
Example: three trees vote Sepsis, Sepsis, No sepsis. Final prediction: SEPSIS (prob=0.667)
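The prob=0.667 in the example is simply the fraction of trees that voted for sepsis. A minimal sketch of that vote, assuming each tree casts a hard 0/1 vote (the function names and the 0.5 decision threshold are illustrative assumptions; scikit-learn's RandomForestClassifier averages per-tree class probabilities rather than hard votes):

```python
def forest_probability(tree_votes):
    """Fraction of trees voting for the positive class (1 = sepsis)."""
    return sum(tree_votes) / len(tree_votes)


def forest_predict(tree_votes, threshold=0.5):
    """Turn the vote fraction into a label; the 0.5 threshold is an assumption."""
    prob = forest_probability(tree_votes)
    label = "SEPSIS" if prob >= threshold else "NO SEPSIS"
    return label, prob


# Three trees: Sepsis, Sepsis, No sepsis
votes = [1, 1, 0]
label, prob = forest_predict(votes)
print(f"Final prediction: {label} prob={prob:.3f}")
```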

Slide 17

Slide 17 text

No Free Lunch Theorem “all optimization problem strategies perform equally well when averaged over all possible problems” Free Lunch

Slide 18

Slide 18 text

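Evaluating Model Quality
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}$$
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Area Under the Receiver Operating Characteristic Curve (AUROC)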
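As a worked example of the metrics on this slide, the sketch below computes RMSE, precision, recall, and F1 with the standard library only. The labels and predictions are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the talk.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up sepsis labels and predictions: 3 TP, 1 FP, 1 FN
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In a sepsis setting a false negative (a missed case) and a false positive carry very different costs, which is why precision and recall are reported separately rather than relying on a single number.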

Slide 19

Slide 19 text

Hyperparameter Tuning

Slide 20

Slide 20 text

What is a hyperparameter?
Model hyperparameters:
• Configuration that is external to the model
• Set to a pre-determined value before model training

Slide 21

Slide 21 text

Example: clinical trials goal: maximize drug effectiveness active ingredients concentrations Did it cure the patient? What is a hyperparameter?

Slide 22

Slide 22 text

0174413 Cdk4/D: 0.210 μM Cdk2/A: 0.012 μM 0204661 Cdk4/D: 0.092 μM Cdk2/A: 0.002 μM 0205783 Cdk4/D: 0.145 μM Cdk2/A: 5.010 μM Example: drug discovery What is a hyperparameter?

Slide 23

Slide 23 text

0174413 Cdk4/D: 0.210 μM Cdk2/A: 0.012 μM 0204661 Cdk4/D: 0.092 μM Cdk2/A: 0.002 μM 0205783 Cdk4/D: 0.145 μM Cdk2/A: 5.010 μM Toxic Therapeutic Example: drug discovery What is a hyperparameter?

Slide 24

Slide 24 text

What is a hyperparameter?

Model: Hyperparameters
• Random Forest Classifier: number of decision trees, max tree depth
• Singular Value Decomposition: number of latent factors
• Support Vector Machine: regularization (C), tolerance threshold (ε)
• Gradient descent: learning rate, regularization (λ)
• K-means clustering: number of clusters (K)

Slide 25

Slide 25 text

https://playground.tensorflow.org What is a hyperparameter?

Slide 26

Slide 26 text

https://playground.tensorflow.org What is a hyperparameter?

Slide 27

Slide 27 text

Random Forest Classifier • Number of decision trees (n_estimators) • Maximum tree depth (max_depth) Our Hyperparameters

Slide 28

Slide 28 text

Sampling Techniques
1. Grad Student Descent
2. Grid Search
3. Random Search
4. Informed Search

Slide 29

Slide 29 text

“Grad Student” Descent
a.k.a. tinkering until you get decent results

Slide 30

Slide 30 text

Grid Search
Provide a discrete set of hyperparameter values

Search Space
sklearn.ensemble.RandomForestClassifier()
• n_estimators = [5, 10, 50]
• max_depth = [3, 5]

Models
1) n_estimators=5, max_depth=3
2) n_estimators=5, max_depth=5
3) n_estimators=10, max_depth=3
4) n_estimators=10, max_depth=5
5) n_estimators=50, max_depth=3
6) n_estimators=50, max_depth=5

(Figure: grid of max_depth against n_estimators)
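The six models listed above are the Cartesian product of the two value lists. A minimal sketch of that enumeration using only the standard library (the `grid` helper is a hypothetical name; no model is trained here, only the candidate configurations are generated):

```python
from itertools import product

# Search space copied from the slide
search_space = {"n_estimators": [5, 10, 50], "max_depth": [3, 5]}


def grid(search_space):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(search_space)
    value_lists = (search_space[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


models = grid(search_space)
for i, params in enumerate(models, start=1):
    print(f"{i}) n_estimators={params['n_estimators']}, max_depth={params['max_depth']}")
```

In practice each configuration would be trained and scored, typically with cross-validation; scikit-learn's `GridSearchCV` wraps that loop. Note that the number of models grows multiplicatively with every hyperparameter added to the grid.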

Slide 31

Slide 31 text

Random Search
“for most data sets only a few of the hyper-parameters really matter…”
“…different hyper-parameters are important on different data sets”
• Based on the assumption that not all hyperparameters are equally important
• Works by sampling hyperparameter values from a distribution
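To make "sampling hyperparameter values from a distribution" concrete, here is a small sketch that draws random forest configurations from uniform integer ranges. The ranges (5 to 50 trees, depth 3 to 5), the function names, and the fixed seed are assumptions chosen for illustration; they are not from the talk.

```python
import random


def sample_configuration(rng):
    """Draw one configuration; both ranges are inclusive and purely illustrative."""
    return {
        "n_estimators": rng.randint(5, 50),
        "max_depth": rng.randint(3, 5),
    }


def random_search(n_iter, seed=0):
    """Sample `n_iter` configurations with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [sample_configuration(rng) for _ in range(n_iter)]


configs = random_search(n_iter=6)
for params in configs:
    print(params)
```

Unlike the grid, the budget (`n_iter`) is set independently of how many hyperparameters are searched, and each trial can try a new value of the hyperparameters that matter. scikit-learn's `RandomizedSearchCV` provides the same idea together with cross-validated scoring.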

Slide 32

Slide 32 text

Random Search
A visual explanation of why random search can be better
(Figure: Grid Search vs. Random Search)

Slide 33

Slide 33 text

Informed Search
Sequential Model-Based Optimization
• Uses past evaluation results to choose the next hyperparameter values to evaluate
• Models P(metric|hyperparameters)

Slide 34

Slide 34 text

Informed Search
Sequential Model-Based Optimization
Uses past evaluation results to choose the next hyperparameter values to evaluate

Python Packages:
• scikit-optimize (skopt): works well with scikit-learn models
• hyperopt: based on the Tree-structured Parzen Estimator
• SMAC3: developed by the AutoML.org research group
• Metric Optimization Engine (MOE): uses Gaussian processes
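The loop behind sequential model-based optimization can be sketched without any of these packages. The toy below is not how skopt, hyperopt, SMAC3, or MOE work internally: the surrogate is a simple inverse-distance-weighted average of past scores, the acquisition rule adds an exploration bonus, and `objective` is a made-up function standing in for an expensive model evaluation. All names, ranges, and constants are assumptions chosen for illustration.

```python
import math

# Hypothetical search space: (n_estimators, max_depth) pairs
CANDIDATES = [(n, d) for n in range(10, 101, 10) for d in range(2, 11)]


def objective(params):
    """Toy stand-in for an expensive model evaluation (higher is better)."""
    n_estimators, max_depth = params
    return -((n_estimators - 60) / 100) ** 2 - ((max_depth - 6) / 10) ** 2


def distance(a, b):
    """Euclidean distance after scaling both hyperparameters to roughly [0, 1]."""
    return math.hypot((a[0] - b[0]) / 100, (a[1] - b[1]) / 10)


def surrogate(params, history):
    """Cheap score estimate: inverse-distance-weighted mean of past scores."""
    weights = [1.0 / (distance(params, p) + 1e-9) for p, _ in history]
    return sum(w * s for w, (_, s) in zip(weights, history)) / sum(weights)


def acquisition(params, history, kappa=0.5):
    """Predicted score plus a bonus for being far from anything already tried."""
    nearest = min(distance(params, p) for p, _ in history)
    return surrogate(params, history) + kappa * nearest


def informed_search(budget=15):
    # Seed the history with two arbitrary starting points.
    history = [(p, objective(p)) for p in (CANDIDATES[0], CANDIDATES[-1])]
    while len(history) < budget:
        tried = {p for p, _ in history}
        untried = [p for p in CANDIDATES if p not in tried]
        # Use past results (via the surrogate) to pick the next point to evaluate.
        next_params = max(untried, key=lambda p: acquisition(p, history))
        history.append((next_params, objective(next_params)))
    return history


history = informed_search()
best_params, best_score = max(history, key=lambda item: item[1])
print(best_params, round(best_score, 4))
```

The structure is the part that carries over to real libraries: fit a cheap model of the metric given the hyperparameters, use it to choose the next configuration, evaluate, and repeat. Production tools replace the crude surrogate with Gaussian processes, tree-structured Parzen estimators, or random forests.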

Slide 35

Slide 35 text

Which sampling technique is best?

Slide 36

Slide 36 text

No Free Lunch Theorem “all optimization problem strategies perform equally well when averaged over all possible problems” Free Lunch

Slide 37

Slide 37 text

Overfitting
When it’s too good to be true…
The Bias-Variance Trade-off
• Learning from noise vs. signal
• Model is tightly bound to the training set
How to Detect It
• High performance on training set
• Poor performance on test set

Slide 38

Slide 38 text

•Consider an ensemble model •Regularization •Cross-validation •Occam’s Razor How to Prevent Overfitting
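Of the items above, cross-validation is the most mechanical, so a short sketch may help. The helper below splits sample indices into k folds so that every sample is used for validation exactly once. The name `k_fold_indices` is hypothetical, and this version does not shuffle or stratify, both of which would usually be wanted for imbalanced data such as sepsis labels.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Averaging the validation score over the folds gives a less optimistic estimate than the training score, which is what makes a gap between the two visible during hyperparameter tuning.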

Slide 39

Slide 39 text

Pick the model with fewer assumptions! Occam’s Razor

Slide 40

Slide 40 text

Biased dataset “Fluctuating hormones and differences between male and female study subjects could all complicate the design of the study” Defining the “ground truth” Selecting the appropriate evaluation metric False positives vs. False negatives A Word of Caution

Slide 41

Slide 41 text

Jill Cates twitter: @JillACates github: @topspinj [email protected] Free Lunch Thank you!

Slide 42

Slide 42 text

References
1) Sepsis article. Wikipedia.
2) Stevenson EK et al. Two decades of mortality trends among patients with severe sepsis: a comparative meta-analysis. Crit Care Med. 2014;42:625.
3) Cost H et al. In: Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. MD: Agency for Healthcare Research and Quality, USA, 2006.
4) Angus DC et al. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Crit Care Med. 2001;1303-10.
5) Martin GS et al. The Epidemiology of Sepsis in the United States from 1979 through 2000. N Engl J Med. 2003;348:1546-1554.