Machine Learning Model Evaluation Metrics #AnacondaCON

Slides from my talk at AnacondaCON:
"Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you’ll ultimately use. Those coming to ML from software development are often self-taught, but practice exercises and competitions generally dictate the evaluation metric. In a real-world scenario, how do you choose an appropriate metric? This talk will explore the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision."

Links from the slides:
- Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Cort J. Willmott, Kenji Matsuura, 2005: https://www.int-res.com/abstracts/cr/v30/n1/p79-82/
- Root mean square error (RMSE) or mean absolute error (MAE)?, Tianfeng Chai, R. R. Draxler, 2009: https://www.researchgate.net/publication/262980567_Root_mean_square_error_RMSE_or_mean_absolute_error_MAE

Recommended resources:
- scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
- http://wiki.fast.ai/
- Tip 8 from "Ten quick tips for machine learning in computational biology": https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0155-3
- “Macro- and micro-averaged evaluation measures” by Vincent Van Asch: https://pdfs.semanticscholar.org/1d10/6a2730801b6210a67f7622e4d192bb309303.pdf

MKhalusova

April 04, 2019

Transcript

1. What’s an evaluation metric? A way to quantify the performance of a machine learning model. Evaluation metric ≠ loss function.
2. Supervised learning metrics.
   • Classification: classification accuracy, precision, recall, F1 score, ROC/AUC, precision/recall AUC, Matthews correlation coefficient, log loss, …
   • Regression: R², MAE, MSE, RMSE, RMSLE, …
3. Classification accuracy: 96% accuracy.
   • Is it a good model?
   • What errors is the model making?
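One way to see why a headline accuracy number can mislead is to compare it against a trivial baseline; a minimal sketch using scikit-learn's DummyClassifier on a hypothetical 96/4 class split (the data is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 96% of samples belong to class 0.
y = np.array([0] * 96 + [1] * 4)
X = np.zeros((100, 1))  # features don't matter for a dummy baseline

# A model that always predicts the majority class...
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# ...already reaches 96% accuracy without learning anything.
print(accuracy_score(y, baseline.predict(X)))  # 0.96
```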
4. Confusion matrix.
   • Not a metric
   • Helps to gain insight into the type of errors a model is making
   • Helps to understand some other metrics
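In scikit-learn, the confusion matrix is a single call; a minimal sketch with illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true labels and predictions (binary: 0 / 1)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```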
5. Precision, Recall, F1 score.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 126     | FP = 13      |
   | True: 1 | FN = 24      | TP = 60      |

   $$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
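Plugging the counts from this confusion matrix into the formulas (a quick hand check, no fitted model involved):

```python
# Counts from the slide's confusion matrix
TN, FP, FN, TP = 126, 13, 24, 60

precision = TP / (TP + FP)               # 60 / 73   ≈ 0.822
recall    = TP / (TP + FN)               # 60 / 84   ≈ 0.714
f1        = 2 * TP / (2 * TP + FP + FN)  # 120 / 157 ≈ 0.764

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```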
6. Precision or Recall? What do you care about?
   • Minimizing False Positives -> Precision
   • Minimizing False Negatives -> Recall
7. Matthews correlation coefficient.

   $$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

   Takes into account all four confusion matrix categories. Another way to summarize the confusion matrix.
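scikit-learn exposes this as matthews_corrcoef; a small sketch with illustrative labels:

```python
from sklearn.metrics import matthews_corrcoef

# Illustrative labels; MCC ranges from -1 to +1,
# where +1 is perfect prediction and 0 is no better than chance.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

print(matthews_corrcoef(y_true, y_pred))
```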
8. MCC vs F1 score. Data: 100 samples, 95 positive, 5 negative. Model: DummyClassifier.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 0       | FP = 5       |
   | True: 1 | FN = 0       | TP = 95      |

   $$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{2 \cdot 95}{2 \cdot 95 + 5} = \frac{190}{195} = 0.974$$

   $$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} = \frac{95 \cdot 0 - 5 \cdot 0}{\sqrt{100 \cdot 95 \cdot 5 \cdot 0}} = \text{undefined}$$
9. MCC vs F1 score.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 1       | FP = 4       |
   | True: 1 | FN = 5       | TP = 90      |

   F1 score = 0.952, MCC = 0.135

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 90      | FP = 5       |
   | True: 1 | FN = 4       | TP = 1       |

   F1 score = 0.182, MCC = 0.135

   The F1 score is sensitive to which class is positive and which is negative. MCC isn’t.
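The asymmetry is easy to reproduce: compute both metrics, then swap which class counts as positive. A sketch built from the first confusion matrix above:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Reconstruct labels matching TN=1, FP=4, FN=5, TP=90 from the slide
y_true = np.array([0] * 5 + [1] * 95)
y_pred = np.array([0] * 1 + [1] * 4 + [0] * 5 + [1] * 90)

print(f1_score(y_true, y_pred), matthews_corrcoef(y_true, y_pred))
# ~0.952 and ~0.135

# Relabel: treat the old negative class as positive
print(f1_score(1 - y_true, 1 - y_pred), matthews_corrcoef(1 - y_true, 1 - y_pred))
# ~0.182 and ~0.135 — F1 changes, MCC doesn't
```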
10. AUC (Area Under Curve).

   $$\text{True Positive Rate} = \frac{TP}{TP + FN} \qquad \text{False Positive Rate} = \frac{FP}{FP + TN}$$
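ROC AUC is computed from predicted scores rather than hard labels; a minimal sketch (the scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]
# Predicted probabilities of class 1, not hard 0/1 predictions
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Area under the ROC curve (TPR vs FPR across all thresholds);
# 1.0 = perfect ranking, 0.5 = random ordering
print(roc_auc_score(y_true, y_scores))
```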
11. Log loss.
   • Takes into account the uncertainty of model predictions
   • Larger penalty for confident false predictions

   $$-\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)$$

   where: n = number of observations, yᵢ = the true label (0 or 1) of observation i in the binary case, pᵢ = the model’s predicted probability that observation i is 1.
12. Log loss intuition.

   | True label | Predicted prob. of class 1 | Log loss       |
   |------------|----------------------------|----------------|
   | 1          | 0.9                        | 0.105360515657 |
   | 1          | 0.55                       | 0.597837000755 |
   | 1          | 0.10                       | 2.302585092994 |
   | 0          | 0.95                       | 2.995732273553 |
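These values can be reproduced directly; a small sketch of the per-observation penalty, using a hypothetical helper (for a true label of 1 it reduces to -log p):

```python
import numpy as np

def single_log_loss(y, p):
    """Log loss contribution of one observation with true label y
    and predicted probability p of class 1."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

for y, p in [(1, 0.9), (1, 0.55), (1, 0.10), (0, 0.95)]:
    print(y, p, single_log_loss(y, p))
# 0.105..., 0.597..., 2.302..., 2.995... — matching the table above
```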
13. Precision, Recall, F1 Score (multiclass example).

   | Label | Predicted |
   |-------|-----------|
   | cat   | cat       |
   | cat   | cat       |
   | cat   | cat       |
   | cat   | cat       |
   | dog   | dog       |
   | dog   | dog       |
   | dog   | cat       |
   | bird  | dog       |
   | bird  | bird      |
14. Precision: micro, macro, weighted (same labels and predictions as above).
15. Precision: micro, macro, weighted.

   | Class | TP | FP | Precision | N samples |
   |-------|----|----|-----------|-----------|
   | bird  | 1  | 0  | 1         | 2         |
   | cat   | 4  | 1  | 0.8       | 4         |
   | dog   | 2  | 1  | 0.6666    | 3         |
   | TOTAL | 7  | 2  |           | 9         |

   $$\text{macro pr} = \frac{1}{3}(1 + 0.8 + 0.6666) = 0.8222 \qquad \text{micro pr} = \frac{7}{7 + 2} = 0.7777 \qquad \text{weighted pr} = \frac{1 \cdot 2 + 0.8 \cdot 4 + 0.6666 \cdot 3}{2 + 4 + 3} = 0.8$$
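scikit-learn's precision_score reproduces all three averages via its average parameter; a sketch using the cat/dog/bird predictions from the slides above:

```python
from sklearn.metrics import precision_score

y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "cat", "cat", "cat", "dog", "dog", "cat", "dog", "bird"]

for avg in ("macro", "micro", "weighted"):
    print(avg, precision_score(y_true, y_pred, average=avg))
# macro ≈ 0.822, micro ≈ 0.778, weighted = 0.8
```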
16. Micro, macro, weighted.
   • Micro-averaged: all samples contribute equally to the average
   • Macro-averaged: all classes contribute equally to the average
   • Weighted-averaged: each class’s contribution to the average is weighted by its size
17. R Squared (coefficient of determination).
   • Indicates how well the model predictions approximate the true values
   • 1 = perfect fit vs 0 = a DummyRegressor predicting the average
18. R Squared (coefficient of determination).

   $$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

   where y = actual values, ŷ = predicted values, ȳ = mean of the actual values.

   R squared has an intuitive scale and doesn’t depend on the units of y. R squared gives you no information about the prediction error.
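A minimal sketch with r2_score, including the baseline case the slide mentions (the values are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.6])

print(r2_score(y_true, y_pred))  # close to 1: good fit

# Predicting the mean of y for every sample gives R^2 = 0,
# which is what a DummyRegressor(strategy="mean") would score.
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0
```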
19. MSE: mean squared error.

   $$MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
20. RMSE: root mean squared error.

   $$RMSE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
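Both follow from mean_squared_error; a small sketch (RMSE computed with an explicit square root to stay version-agnostic, same made-up values as above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.6])

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as y
mae  = mean_absolute_error(y_true, y_pred)

print(mse, rmse, mae)
```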
21. What do MAE and RMSE have in common?
   • Range: 0 -> ∞
   • MAE and RMSE have the same units as the y values
   • Indifferent to the direction of the errors
   • The lower the metric value, the better
22. MAE vs RMSE: what’s different?
   • RMSE gives a relatively high weight to large errors
   • MAE is more robust to outliers
   • RMSE is differentiable
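One outlier makes the difference visible; a toy comparison (the residuals are chosen purely for illustration):

```python
import numpy as np

# Four small residuals and one large outlier
errors = np.array([1.0, 1.0, 1.0, 1.0, 10.0])

mae  = np.mean(np.abs(errors))        # 2.8   — grows linearly with the outlier
rmse = np.sqrt(np.mean(errors ** 2))  # ≈4.56 — dominated by the squared outlier

print(mae, rmse)
```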
23. MAE vs RMSE.
   • Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Cort J. Willmott, Kenji Matsuura, 2005
   • Root mean square error (RMSE) or mean absolute error (MAE)?, Tianfeng Chai, R. R. Draxler, 2009
   • Neither metric is robust on a small test set (<100)
24. RMSLE: root mean squared logarithmic error.

   $$RMSLE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log_e(y_i + 1) - \log_e(\hat{y}_i + 1)\right)^2}$$

   • Similar to RMSE, but uses the natural logarithm of (y + 1) instead of y
   • +1 because the log of 0 is not defined
   • Shows relative error
   • Penalizes under-predicted estimates more than over-predicted ones
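scikit-learn ships the squared version as mean_squared_log_error; taking the square root gives RMSLE. A sketch showing the asymmetric penalty (the numbers are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 100.0])

# Under-prediction by 50 vs over-prediction by 50
under = np.array([50.0, 50.0])
over  = np.array([150.0, 150.0])

print(np.sqrt(mean_squared_log_error(y_true, under)))  # ≈0.683
print(np.sqrt(mean_squared_log_error(y_true, over)))   # ≈0.402 — smaller penalty
```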
25. Takeaways. There’s no “one-size-fits-all” evaluation metric. Get to know your data. Keep in mind the business objective of your ML problem.
26. Thank you. Recommended resources:
   - scikit-learn User Guide
   - http://wiki.fast.ai/
   - "Root mean square error (RMSE) or mean absolute error (MAE)?" by Tianfeng Chai, R. R. Draxler
   - Tip 8 from "Ten quick tips for machine learning in computational biology"
   - “Macro- and micro-averaged evaluation measures” by Vincent Van Asch
   @mariakhalusova