Slide 1

Slide 1 text

On the Evaluation of Binary Classifiers
Robin Chauhan
https://twitter.com/robinc
Image credit: Alex Borland, “Man With Metal Detector”

Slide 2

Slide 2 text

Model -> Predictions -> Threshold -> Decisions -> Cost/Benefit
Model -> Predictions -> Probabilities -> Threshold -> Decisions -> Benefit
Calibration Curves
Decision Curve Analysis
Classification Evaluation: Accuracy, F1, MCC
Threshold Selection: ROC, TOC
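As a minimal sketch of the second pipeline (Probabilities -> Threshold -> Decisions), assuming a model that already outputs one probability per case, the snippet below turns probabilities into 0/1 decisions and tallies the resulting confusion-matrix counts. The helper names `decide` and `confusion_counts` and the toy data are hypothetical.

```python
from typing import List, Tuple


def decide(probs: List[float], threshold: float) -> List[int]:
    """Turn predicted probabilities into 0/1 decisions: positive if p >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for labels coded as 1 = positive, 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical true labels and model probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

for threshold in (0.5, 0.7):
    decisions = decide(probs, threshold)
    print(threshold, decisions, confusion_counts(y_true, decisions))
```

On this toy data, raising the threshold from 0.5 to 0.7 removes one false positive and adds one false negative: the model is unchanged, only the decisions move.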

Slide 3

Slide 3 text

Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

Slide 4

Slide 4 text

Patient view
What the patient cares about after getting a test result: how to interpret a single + / - result?
Healthcare provider view
Aggregate performance
Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
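The two views condition on different things: the provider view conditions on the true class (TPR, TNR), the patient view on the test result (PPV, NPV). A small sketch with hypothetical screening numbers (1,000 people, 10 of whom have the condition) shows how far apart they can be; the helper name `rates` is illustrative only.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Provider-view and patient-view rates from confusion-matrix counts."""
    return {
        # Provider view: conditioned on the true class.
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        # Patient view: conditioned on the test result.
        "PPV": tp / (tp + fp),  # precision
        "NPV": tn / (tn + fn),
    }


# Hypothetical screening: the test catches 9 of the 10 true cases
# and wrongly flags 99 of the 990 healthy people.
r = rates(tp=9, fp=99, fn=1, tn=891)
print({name: round(value, 3) for name, value in r.items()})
```

Sensitivity and specificity are both 90%, yet only 9 of the 108 positive results (about 8%) are true positives, which is the number a patient holding a + result actually cares about.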

Slide 5

Slide 5 text

Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

Slide 6

Slide 6 text

Choosing Thresholds
Precision Recall Curve: Imbalanced classes
ROC Curve: Balanced classes
* Don't translate well to more balanced cases, or to cases where negatives are rare
Axis labels: Recall / Sensitivity / Power / TPR; Precision / PPV: Positive Predictive Value
Wikipedia: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
“Precision-Recall AUC vs ROC AUC for class imbalance problems”, discussion at https://www.kaggle.com/general/7517
“The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets”, Saito et al 2015 https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
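A minimal, dependency-free sketch of the ROC vs PR contrast on a hypothetical imbalanced ranking (2 positives among 22 items). ROC AUC is computed as the fraction of positive/negative pairs ranked correctly; the area under the PR curve is summarized as average precision, assuming no tied scores. In practice scikit-learn's `roc_auc_score` and `average_precision_score` compute these; the helper names here are illustrative.

```python
def roc_auc(y, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 1/2)."""
    pos = [s for label, s in zip(y, scores) if label == 1]
    neg = [s for label, s in zip(y, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y, scores):
    """Average precision: mean of the precision at the rank of each positive.

    Assumes no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / sum(y)


# Hypothetical imbalanced ranking, listed best score first.
y = [0, 0, 1, 0, 0, 1] + [0] * 16
scores = [1.0 - i / len(y) for i in range(len(y))]  # strictly decreasing, no ties

print(f"ROC AUC = {roc_auc(y, scores):.3f}")
print(f"Average precision = {average_precision(y, scores):.3f}")
```

ROC AUC comes out at 0.85 because most of the 20 negatives rank below both positives, while average precision is only 0.33 because precision is 1/3 at each positive's rank. With few positives, the PR summary is the one that exposes the false alarms a user actually has to wade through.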

Slide 7

Slide 7 text

Total Operating Characteristic (TOC) Curve
● More information than ROC
● Provides the equivalent of the full contingency table at each threshold
https://en.wikipedia.org/wiki/Total_operating_characteristic
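A minimal sketch, assuming the usual TOC axes (hits on the vertical axis, hits + false alarms on the horizontal axis). Because the curve keeps raw counts instead of rates, the class totals are retained and every cell of the contingency table can be read back at each threshold. The function name `toc_points` and the toy data are hypothetical.

```python
def toc_points(y, scores):
    """TOC coordinates plus the full contingency table at each threshold.

    Thresholds are the distinct scores, highest first; a case is flagged
    when its score is >= the threshold.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for label, s in zip(y, scores) if s >= t and label == 1)
        fp = sum(1 for label, s in zip(y, scores) if s >= t and label == 0)
        points.append({
            "threshold": t,
            "x": tp + fp,  # hits + false alarms
            "y": tp,       # hits
            # The class totals are kept, so the remaining cells are recoverable.
            "TP": tp, "FP": fp, "FN": n_pos - tp, "TN": n_neg - fp,
        })
    return points


# Hypothetical toy data.
y = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

for p in toc_points(y, scores):
    print(p)
```

An ROC curve built from the same data plots only TP/(TP+FN) against FP/(FP+TN), so the class sizes (4 and 4 here) cannot be read off the curve itself.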

Slide 8

Slide 8 text

Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

Slide 9

Slide 9 text

Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Matthews correlation coefficient = √(TPR×TNR×PPV×NPV) − √(FNR×FPR×FOR×FDR)
But: FOR = 1-NPV, FDR = 1-PPV, FPR = 1-TNR, FNR = 1-TPR
so MCC = √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
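As a numerical check of the identity above, a minimal sketch computes MCC from the rate form on the slide and from the familiar count form (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The helper names and confusion matrices are hypothetical, and it assumes no row or column of the confusion matrix is empty, since several rates would otherwise be 0/0.

```python
import math


def mcc_from_rates(tp, fp, fn, tn):
    """Rate form: sqrt(TPR*TNR*PPV*NPV) - sqrt(FNR*FPR*FOR*FDR)."""
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    # FNR = 1-TPR, FPR = 1-TNR, FDR = 1-PPV, FOR = 1-NPV
    return (math.sqrt(tpr * tnr * ppv * npv)
            - math.sqrt((1 - tpr) * (1 - tnr) * (1 - ppv) * (1 - npv)))


def mcc_from_counts(tp, fp, fn, tn):
    """Count form: (TP*TN - FP*FN) / sqrt of the product of the four margins."""
    return (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


# Hypothetical confusion matrices, each given as (TP, FP, FN, TN).
examples = [(3, 1, 1, 3), (9, 99, 1, 891), (40, 10, 20, 30)]
for counts in examples:
    print(counts, round(mcc_from_rates(*counts), 6), round(mcc_from_counts(*counts), 6))
```

The two forms agree because each square root factors cleanly: √(TPR×TNR×PPV×NPV) = TP×TN / √(product of margins), and likewise for the FN×FP term.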

Slide 10

Slide 10 text

Matthews correlation coefficient = √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
= √“goodness?” [0, 1] − √“badness?” [0, 1]
=> [-1, +1]
Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
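The range argument can be checked at the extremes with a small sketch. The helper `goodness_badness` and the confusion matrices, each given as (TP, FP, FN, TN), are hypothetical: a perfect classifier has goodness 1 and badness 0, a perfectly inverted one the reverse, and a coin-flip classifier has equal goodness and badness, giving MCC = 0.

```python
import math


def goodness_badness(tp, fp, fn, tn):
    """Return (goodness, badness, MCC) using the two square-root terms from the slide."""
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    goodness = math.sqrt(tpr * tnr * ppv * npv)
    badness = math.sqrt((1 - tpr) * (1 - tnr) * (1 - ppv) * (1 - npv))
    return goodness, badness, goodness - badness


cases = {
    "perfect": (50, 0, 0, 50),
    "coin flip": (25, 25, 25, 25),
    "inverted": (0, 50, 50, 0),
}
for name, counts in cases.items():
    print(name, goodness_badness(*counts))
```

This prints (1.0, 0.0, 1.0), (0.25, 0.25, 0.0) and (0.0, 1.0, -1.0) for the three cases.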

Slide 11

Slide 11 text

Matthews correlation coefficient = √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
● “Healthcare provider view”
  ○ TPR / Recall / Sensitivity / Power
    ■ “What proportion of the Positives did we correctly detect?”
  ○ TNR / Specificity / Selectivity
    ■ “What proportion of the Negatives did we correctly detect?”
● “Patient view”
  ○ PPV
    ■ “When I get a positive prediction, what’s the chance it’s a true positive?”
  ○ NPV
    ■ “When I get a negative prediction, what’s the chance it’s a true negative?”
● Symmetry: Positive and Negative treated identically (unlike F1)
MCC == Pearson correlation coefficient (between the 0/1 true and predicted labels) == Phi coefficient “ϕ” or “r_ϕ”
“The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation”, Chicco et al 2020 https://pubmed.ncbi.nlm.nih.gov/31898477/
MCC introduced in 1975: B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, Oct. 1975
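Two claims on this slide can be checked with a short sketch. Swapping which class is called positive leaves MCC unchanged but can move F1 drastically, and MCC equals the Pearson correlation of the 0/1 true and predicted labels. The helpers and the lopsided confusion matrix (TP=90, FP=5, FN=5, TN=0) are hypothetical, chosen to exaggerate the effect.

```python
import math


def mcc(tp, fp, fn, tn):
    return (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)  # TN does not appear at all


def swap_classes(tp, fp, fn, tn):
    """Relabel positives as negatives: TP<->TN and FP<->FN."""
    return tn, fn, fp, tp


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


counts = (90, 5, 5, 0)  # TP, FP, FN, TN
# Expand the counts into label vectors (1 = positive) so Pearson's r can be computed.
y_true = [1] * 90 + [0] * 5 + [1] * 5
y_pred = [1] * 90 + [1] * 5 + [0] * 5

print("F1:", f1(*counts), "swapped:", f1(*swap_classes(*counts)))
print("MCC:", mcc(*counts), "swapped:", mcc(*swap_classes(*counts)))
print("Pearson r of labels:", pearson(y_true, y_pred))
```

F1 reports about 0.95 for the original labelling and 0 after swapping, while MCC is about -0.05 either way, flagging that the classifier never correctly identifies a single negative.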

Slide 12

Slide 12 text

Threshold Tuning
https://www.scikit-yb.org/en/latest/api/classifier/threshold.html
“Improving visual communication of discriminative accuracy for predictive models: the probability threshold plot”, Johnston et al 2020
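A minimal threshold sweep, similar in spirit to the precision, recall, F1 and queue-rate curves drawn by Yellowbrick's `DiscriminationThreshold` visualizer, but dependency-free. Queue rate is taken here as the share of cases flagged positive. The helper name `sweep`, the threshold grid, and the toy data are hypothetical.

```python
def sweep(y, scores, thresholds):
    """Precision, recall, F1 and queue rate (share of cases flagged) at each threshold."""
    rows = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for label, p in zip(y, pred) if label == 1 and p == 1)
        fp = sum(1 for label, p in zip(y, pred) if label == 0 and p == 1)
        fn = sum(1 for label, p in zip(y, pred) if label == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        queue_rate = (tp + fp) / len(y)
        rows.append((t, precision, recall, f1, queue_rate))
    return rows


# Hypothetical toy data and threshold grid.
y = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]
rows = sweep(y, scores, [0.1, 0.3, 0.5, 0.7, 0.9])

for t, p, r, f, q in rows:
    print(f"t={t:.1f} precision={p:.2f} recall={r:.2f} F1={f:.2f} queue={q:.2f}")

best = max(rows, key=lambda row: row[3])  # threshold with the highest F1
print("best threshold by F1:", best[0])
```

Maximizing F1 is only one objective: it weighs false positives and false negatives equally, so a cost-weighted objective may well pick a different threshold.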

Slide 13

Slide 13 text

Decision Curve Analysis, Vickers et al
● Cost / Benefit of False Positive vs False Negative
● Optimize for Net Benefit
● “The threshold reflects the cost/benefit.” – Dr. Singh
● Vary the cost/benefit to see the benefit of various models over different risk ranges
● Estimate cost/benefit by asking doctors:
  ○ “How many patients would you be willing to test, to find 1 true positive?”
  ○ Measure of cost of test vs benefit of finding cases
“A simple, step-by-step guide to interpreting decision curve analysis”, Vickers et al 2019 https://diagnprognres.biomedcentral.com/track/pdf/10.1186/s41512-019-0064-7.pdf
Image via Dr. Karandeep Singh https://twitter.com/kdpsinghlab/status/14346962807191183
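Net benefit in decision curve analysis is TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability; a model is compared against “treat all” and “treat none” (net benefit 0). The sketch below uses hypothetical data and helper names, and assumes the reading that a clinician willing to test N patients to find 1 true positive accepts N − 1 false positives per true positive, so pt/(1 − pt) = 1/(N − 1), i.e. pt = 1/N.

```python
def threshold_from_number_willing_to_test(n_test):
    """Assumed mapping: tolerating N-1 false positives per true positive
    gives pt / (1 - pt) = 1 / (N - 1), i.e. pt = 1 / N."""
    return 1 / n_test


def net_benefit(y, risk, pt):
    """Net benefit of intervening on everyone whose predicted risk is >= pt."""
    n = len(y)
    tp = sum(1 for label, r in zip(y, risk) if r >= pt and label == 1)
    fp = sum(1 for label, r in zip(y, risk) if r >= pt and label == 0)
    return tp / n - (fp / n) * (pt / (1 - pt))


def net_benefit_treat_all(y, pt):
    """Net benefit of intervening on everyone, regardless of the model."""
    prevalence = sum(y) / len(y)
    return prevalence - (1 - prevalence) * (pt / (1 - pt))


# Hypothetical outcomes and predicted risks.
y = [1, 0, 1, 1, 0, 0, 1, 0]
risk = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

for n_test in (2, 4):
    pt = threshold_from_number_willing_to_test(n_test)
    print(f"N={n_test} pt={pt:.2f} model={net_benefit(y, risk, pt):.3f} "
          f"treat all={net_benefit_treat_all(y, pt):.3f} treat none=0.000")
```

In this toy example the model beats both default strategies at both thresholds; a full decision curve repeats the comparison over the whole range of plausible pt values.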

Slide 14

Slide 14 text

Calibration Curves https://scikit-learn.org/stable/modules/calibration.html
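A calibration (reliability) curve bins the predicted probabilities and compares, in each bin, the mean predicted probability with the observed fraction of positives; a well-calibrated model lands near the diagonal. scikit-learn provides this as `sklearn.calibration.calibration_curve`. The dependency-free sketch below uses equal-width bins, with a hypothetical helper name and toy data.

```python
def calibration_bins(y, probs, n_bins=5):
    """Per non-empty equal-width bin: (mean predicted prob, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((label, p))
    out = []
    for b in bins:
        if b:
            mean_pred = sum(p for _, p in b) / len(b)
            frac_pos = sum(label for label, _ in b) / len(b)
            out.append((mean_pred, frac_pos, len(b)))
    return out


# Hypothetical outcomes and predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
probs = [0.1, 0.15, 0.3, 0.35, 0.5, 0.55, 0.65, 0.75, 0.85, 0.95]

for mean_pred, frac_pos, count in calibration_bins(y, probs):
    print(f"predicted={mean_pred:.3f} observed={frac_pos:.2f} n={count}")
```

With only two cases per bin the observed rates are very noisy; real calibration curves need far more data per bin before the gap from the diagonal means much.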

Slide 15

Slide 15 text

Thank you! Questions?