
# On the Evaluation of Binary Classifiers

A brief tour of some aspects of evaluation for binary classifiers.

We look at Matthews Correlation Coefficient and compare its construction to some other popular metrics.

We look at the threshold selection problem.

And we also touch on Decision Curve Analysis.

January 17, 2022

## Transcript

1. On the Evaluation of
Binary Classifiers
Robin Chauhan
Image credit: Alex Borland, “Man With Metal Detector”

2. Model -> Predictions -> Threshold -> Decisions -> Cost/Benefit
Model -> Predictions -> Probabilities -> Threshold -> Decisions -> Benefit
Calibration Curves
Decision Curve Analysis
Classification Evaluation: Accuracy, F1, MCC
Threshold Selection: ROC, TOC,
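
The pipeline on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the probabilities, labels and the 0.5 threshold are made-up values, and `decide` and `confusion_counts` are hypothetical helper names.

```python
def decide(probabilities, threshold):
    """Turn predicted probabilities into hard 0/1 decisions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


# Illustrative (made-up) model outputs and ground truth.
probs = [0.92, 0.80, 0.65, 0.40, 0.35, 0.10]
labels = [1, 1, 0, 1, 0, 0]

decisions = decide(probs, threshold=0.5)
tp, fp, tn, fn = confusion_counts(labels, decisions)
print(decisions, (tp, fp, tn, fn))
```

Every metric discussed later in the deck is a function of these four counts, which is why the choice of threshold matters so much.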

3. Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

4. Patient View
What the patient cares about after
getting a test result:
How to interpret single + / - ?
Healthcare provider view
Aggregate performance
Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
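
A small sketch of the two views, assuming the usual definitions of the four rates. The counts are invented for illustration: the same test (90% sensitivity, 90% specificity) is applied to a population with 50% prevalence and one with 1% prevalence. `rates` is a hypothetical helper name.

```python
def rates(tp, fp, tn, fn):
    """Provider-view (TPR, TNR) and patient-view (PPV, NPV) rates from counts."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "PPV": tp / (tp + fp),  # precision
        "NPV": tn / (tn + fn),
    }


# Made-up populations of 1000 people each, same test characteristics.
common = rates(tp=450, fp=50, tn=450, fn=50)  # 50% prevalence
rare = rates(tp=9, fp=99, tn=891, fn=1)       # 1% prevalence

for name, r in (("common", common), ("rare", rare)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

The provider-view rates are identical in both populations, while the chance that a single positive result is a true positive drops sharply when the condition is rare.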

5. Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

6. Recall / Sensitivity / Power / TPR
Precision / PPV: Positive Predictive Value
Choosing Thresholds
Precision Recall Curve: Imbalanced classes
ROC Curve: Balanced classes
* don't translate well to more balanced cases, or cases where negatives are rare
“Precision-Recall AUC vs ROC AUC for class imbalance problems” Discussion at https://www.kaggle.com/general/7517
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
“The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets”, Saito et al 2015
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
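
To make the imbalanced-classes point concrete, here is a minimal pure-Python sketch that computes the ROC coordinates (TPR, FPR) and the precision at each candidate threshold. The labels and scores are made up (2 positives, 8 negatives), and `curve_points` is a hypothetical helper, not a library function.

```python
def curve_points(y_true, scores):
    """For each distinct score used as a threshold, return
    (threshold, TPR, FPR, precision)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        # tp + fp >= 1 here because thr is itself one of the scores.
        points.append((thr, tp / pos, fp / neg, tp / (tp + fp)))
    return points


# Made-up imbalanced example: 2 positives, 8 negatives.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

points = curve_points(labels, scores)
for thr, tpr, fpr, precision in points:
    print(f"thr={thr:.2f}  TPR={tpr:.2f}  FPR={fpr:.3f}  precision={precision:.2f}")
```

At the 0.6 threshold the false positive rate is still only 0.25, which looks fine on a ROC plot, yet half of the positive predictions are wrong. The precision-recall view exposes that when positives are rare.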

7. Total Operating Characteristic (TOC) Curve
● Provides the equivalent of the full contingency table at each threshold
https://en.wikipedia.org/wiki/Total_operating_characteristic
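
A rough sketch of how TOC points can be computed, and why each point is equivalent to a full contingency table. A TOC point plots hits against hits plus false alarms. Given the total number of observations and the number of actual positives, the other cells follow by subtraction. The data is made up, and `toc_points` and `contingency_from_toc` are hypothetical helper names.

```python
def toc_points(y_true, scores):
    """TOC points (hits + false alarms, hits), one per distinct threshold."""
    points = [(0, 0)]
    for thr in sorted(set(scores), reverse=True):
        hits = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        false_alarms = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((hits + false_alarms, hits))
    return points


def contingency_from_toc(point, n_total, n_positive):
    """Recover (hits, false alarms, misses, correct rejections) from one point."""
    x, hits = point
    false_alarms = x - hits
    misses = n_positive - hits
    correct_rejections = n_total - n_positive - false_alarms
    return hits, false_alarms, misses, correct_rejections


# Illustrative (made-up) scores and labels.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.92, 0.80, 0.65, 0.40, 0.35, 0.10]

points = toc_points(labels, scores)
table = contingency_from_toc(points[3], n_total=len(labels), n_positive=sum(labels))
print(points)
print(table)
```

A ROC point, by contrast, only carries two ratios, so the absolute counts cannot be read back from it without extra information.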

8. Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

9. Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Matthews correlation coefficient
= √(TPR×TNR×PPV×NPV) − √(FNR×FPR×FOR×FDR)
But:
FOR = 1-NPV
FDR = 1-PPV
FPR = 1-TNR
FNR = 1-TPR
= √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
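
For readers who know MCC from its confusion-matrix form, the following derivation sketches why the rate-based expression on this slide is the same quantity. It assumes the standard definitions of the eight rates in terms of TP, FP, TN and FN. The symbol D is introduced here only as shorthand for the product of the four marginal totals.

```latex
\begin{align*}
D &= (TP+FP)(TP+FN)(TN+FP)(TN+FN) \\
\sqrt{\mathrm{TPR}\cdot\mathrm{TNR}\cdot\mathrm{PPV}\cdot\mathrm{NPV}}
  &= \sqrt{\frac{TP}{TP+FN}\cdot\frac{TN}{TN+FP}\cdot\frac{TP}{TP+FP}\cdot\frac{TN}{TN+FN}}
   = \frac{TP\cdot TN}{\sqrt{D}} \\
\sqrt{\mathrm{FNR}\cdot\mathrm{FPR}\cdot\mathrm{FOR}\cdot\mathrm{FDR}}
  &= \sqrt{\frac{FN}{TP+FN}\cdot\frac{FP}{TN+FP}\cdot\frac{FN}{TN+FN}\cdot\frac{FP}{TP+FP}}
   = \frac{FP\cdot FN}{\sqrt{D}} \\
\mathrm{MCC} &= \frac{TP\cdot TN - FP\cdot FN}{\sqrt{D}}
\end{align*}
```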

10. Matthews correlation coefficient
= √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
= √“goodness?” [0-1] - √“badness?” [0-1]
=> [-1,+1]

Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
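
A quick numerical check of the "goodness minus badness" reading, as a sketch with made-up counts. It compares the rate-based form from the slide with the familiar count-based form, and shows the two extremes of the range. Both helper names are hypothetical, and the code assumes every row and column total of the confusion matrix is non-zero.

```python
from math import sqrt


def mcc_from_counts(tp, fp, tn, fn):
    """Count-based MCC: (TP*TN - FP*FN) / sqrt(product of marginals)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def mcc_from_rates(tp, fp, tn, fn):
    """Rate-based MCC: sqrt(TPR*TNR*PPV*NPV) - sqrt(FNR*FPR*FDR*FOR)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    goodness = sqrt(tpr * tnr * ppv * npv)
    badness = sqrt((1 - tpr) * (1 - tnr) * (1 - ppv) * (1 - npv))
    return goodness - badness


a = mcc_from_counts(90, 30, 70, 10)
b = mcc_from_rates(90, 30, 70, 10)
perfect = mcc_from_rates(50, 0, 50, 0)    # no errors at all
inverted = mcc_from_rates(0, 50, 0, 50)   # every prediction wrong
print(a, b, perfect, inverted)
```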

11. Matthews correlation coefficient
= √(TPR×TNR×PPV×NPV) − √((1-TPR)×(1-TNR)×(1-PPV)×(1-NPV))
● “Healthcare provider view”
○ TPR / Recall / Sensitivity / Power
■ “What proportion of the Positives did we correctly detect?”
○ TNR / Specificity / Selectivity
■ “What proportion of the Negatives did we correctly detect?”
● “Patient view”
○ PPV
■ “When I get a positive prediction, what’s the chance it’s a true positive?”
○ NPV
■ “When I get a negative prediction, what’s the chance it’s a true negative?”
● Symmetry: Positive and Negative treated identically (unlike F1)
MCC == Pearson correlation coefficient == Phi Coefficient “ϕ” or “r_ϕ”

"The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation", Chicco et al 2020 https://pubmed.ncbi.nlm.nih.gov/31898477/
MCC introduced in 1975:
B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Struct., vol. 405, no. 2, pp. 442–451, Oct. 1975
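
The symmetry point can be illustrated with a short sketch. The counts below are invented, and swapping which class is called "positive" exchanges TP with TN and FP with FN. `f1` and `mcc` are hypothetical helper names using the standard count-based definitions.

```python
from math import sqrt


def f1(tp, fp, tn, fn):
    """F1 score; note that TN does not appear in the formula."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, tn, fn):
    """Count-based Matthews correlation coefficient."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# Made-up, heavily imbalanced confusion matrix.
counts = dict(tp=90, fp=5, tn=1, fn=4)
# Relabel the classes: old negatives become the positive class.
swapped = dict(tp=counts["tn"], fp=counts["fn"], tn=counts["tp"], fn=counts["fp"])

print("F1 :", round(f1(**counts), 3), "->", round(f1(**swapped), 3))
print("MCC:", round(mcc(**counts), 3), "->", round(mcc(**swapped), 3))
```

F1 changes dramatically under the relabeling because it ignores true negatives, while MCC is unchanged.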

12. Threshold Tuning
https://www.scikit-yb.org/en/latest/api/classifier/threshold.html
“Improving visual communication of discriminative accuracy for predictive models: the probability threshold plot”, Johnston et al 2020
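
One simple way to approach threshold tuning is to sweep candidate thresholds and look at a metric at each one. The sketch below does this with F1 on made-up data. It is not the method of the linked tool or paper, and `f1_at` and `sweep` are hypothetical helpers. In practice the sweep would normally be run on held-out validation data.

```python
def f1_at(y_true, scores, thr):
    """F1 score when predicting positive for every score >= thr."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < thr and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep(y_true, scores, thresholds):
    """Return a list of (threshold, F1) pairs."""
    return [(thr, f1_at(y_true, scores, thr)) for thr in thresholds]


# Illustrative (made-up) labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.92, 0.80, 0.65, 0.40, 0.35, 0.10]

results = sweep(labels, probs, [k / 10 for k in range(1, 10)])
best_thr, best_f1 = max(results, key=lambda pair: pair[1])
for thr, score in results:
    print(f"thr={thr:.1f}  F1={score:.3f}")
print("best:", best_thr, round(best_f1, 3))
```

Swapping F1 for another metric, or for an explicit cost/benefit function, only requires changing `f1_at`.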

13. Decision Curve Analysis, Vickers et al
● Cost / Benefit of False Positive vs False Negative
● Optimize for Net Benefit
● “The threshold reflects the cost/benefit.” – Dr. Singh
● Vary the cost/benefit to see the benefit of various models over different risk ranges
● Estimate cost/benefit by asking doctors:
○ “How many patients would you be willing to test, to find 1 true positive?”
○ Measure of cost of test vs benefit of finding cases
A simple, step-by-step guide to interpreting decision curve analysis, Vickers et al 2019
https://diagnprognres.biomedcentral.com/track/pdf/10.1186/s41512-019-0064-7.pdf
Image via Dr. Karandeep Singh https://twitter.com/kdpsinghlab/status/14346962807191183
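
A minimal sketch of the net benefit calculation used in decision curve analysis, assuming the usual formula: net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The data is made up and the helper names are hypothetical. As a rough guide, a clinician willing to test 5 patients to find 1 true positive corresponds to pt of about 1/5.

```python
def net_benefit(y_true, probs, pt):
    """Net benefit of treating everyone whose predicted risk is >= pt."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= pt and t == 0)
    return tp / n - (fp / n) * (pt / (1 - pt))


def net_benefit_treat_all(y_true, pt):
    """Net benefit of the 'treat everyone' strategy at threshold pt."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * (pt / (1 - pt))


# Illustrative (made-up) outcomes and predicted risks.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.92, 0.80, 0.65, 0.40, 0.35, 0.10]

pt = 0.2
model_nb = net_benefit(labels, probs, pt)
treat_all_nb = net_benefit_treat_all(labels, pt)
treat_none_nb = 0.0  # treating no one has zero net benefit by definition
print(round(model_nb, 4), round(treat_all_nb, 4), treat_none_nb)
```

A decision curve repeats this calculation over a range of pt values and plots the model against the treat-all and treat-none strategies.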

14. Calibration Curves
https://scikit-learn.org/stable/modules/calibration.html
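
The idea behind a calibration curve can be sketched without any library: bin the predicted probabilities, then compare the mean prediction in each bin with the observed fraction of positives. The helper `binned_calibration` and the data are illustrative assumptions. The linked scikit-learn page documents that library's own `calibration_curve` function, which this sketch does not reproduce exactly.

```python
def binned_calibration(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed positive fraction)
    for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((t, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(t for t, _ in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve


# Made-up example: ten cases predicted at 0.1 (one positive) and
# ten cases predicted at 0.9 (nine positives).
probs = [0.1] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]

curve = binned_calibration(labels, probs)
for mean_pred, frac_pos in curve:
    print(f"predicted={mean_pred:.2f}  observed={frac_pos:.2f}")
```

For a well-calibrated model the points lie close to the diagonal, as they do in this contrived example.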

15. Thank you!
Questions?