Machine Learning Model Evaluation Metrics #AnacondaCON

Slides from my talk at AnacondaCON:
"Choosing the right evaluation metric for your machine learning project is crucial, as it decides which model you’ll ultimately use. Those coming to ML from software development are often self-taught, but practice exercises and competitions generally dictate the evaluation metric. In a real-world scenario, how do you choose an appropriate metric? This talk will explore the important evaluation metrics used in regression and classification tasks, their pros and cons, and how to make a smart decision."

Links from the slides:
- Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Cort J. Willmott, Kenji Matsuura, 2005: https://www.int-res.com/abstracts/cr/v30/n1/p79-82/
- Root mean square error (RMSE) or mean absolute error (MAE)?, Tianfeng Chai, R. R. Draxler, 2009: https://www.researchgate.net/publication/262980567_Root_mean_square_error_RMSE_or_mean_absolute_error_MAE

Recommended resources:
- scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
- http://wiki.fast.ai/
- Tip 8 from "Ten quick tips for machine learning in computational biology": https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0155-3
- “Macro- and micro-averaged evaluation measures” by Vincent Van Asch: https://pdfs.semanticscholar.org/1d10/6a2730801b6210a67f7622e4d192bb309303.pdf

MKhalusova

April 04, 2019

Transcript

1. What’s an evaluation metric? A way to quantify the performance of a machine learning model. Evaluation metric ≠ loss function.
2. Supervised learning metrics.
   • Classification: classification accuracy, precision, recall, F1 score, ROC/AUC, precision/recall AUC, Matthews correlation coefficient, log loss, …
   • Regression: R², MAE, MSE, RMSE, RMSLE, …
3. Classification accuracy: 96% accuracy.
   • Is it a good model?
   • What errors is the model making?
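One way to see why a headline accuracy number can mislead is to compare it against a trivial baseline; a minimal sketch using scikit-learn's DummyClassifier on a hypothetical 96/4 class split (the data is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 96% of samples belong to class 0.
y = np.array([0] * 96 + [1] * 4)
X = np.zeros((100, 1))  # features don't matter for a dummy baseline

# A model that always predicts the majority class...
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# ...already reaches 96% accuracy without learning anything.
print(accuracy_score(y, baseline.predict(X)))  # 0.96
```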
4. Confusion matrix.
   • Not a metric
   • Helps to gain insight into the type of errors a model is making
   • Helps to understand some other metrics
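In scikit-learn, the confusion matrix is a single call; a minimal sketch with illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true labels and predictions (binary: 0 / 1)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```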
5. Precision, Recall, F1 score.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 126     | FP = 13      |
   | True: 1 | FN = 24      | TP = 60      |

   $$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
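Plugging the counts from this confusion matrix into the formulas (a quick hand check, no fitted model involved):

```python
# Counts from the slide's confusion matrix
TN, FP, FN, TP = 126, 13, 24, 60

precision = TP / (TP + FP)               # 60 / 73   ≈ 0.822
recall    = TP / (TP + FN)               # 60 / 84   ≈ 0.714
f1        = 2 * TP / (2 * TP + FP + FN)  # 120 / 157 ≈ 0.764

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```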
6. Precision or Recall? What do you care about?
   • Minimizing False Positives -> Precision
   • Minimizing False Negatives -> Recall
7. Matthews correlation coefficient.

   $$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

   Takes into account all four confusion matrix categories. Another way to summarize the confusion matrix.
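scikit-learn exposes this as matthews_corrcoef; a small sketch with illustrative labels:

```python
from sklearn.metrics import matthews_corrcoef

# Illustrative labels; MCC ranges from -1 to +1,
# where +1 is perfect prediction and 0 is no better than chance.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1]

print(matthews_corrcoef(y_true, y_pred))
```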
8. MCC vs F1 score. Data: 100 samples, 95 positive, 5 negative. Model: DummyClassifier.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 0       | FP = 5       |
   | True: 1 | FN = 0       | TP = 95      |

   $$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{2 \cdot 95}{2 \cdot 95 + 5} = \frac{190}{195} = 0.974$$

   $$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} = \frac{95 \cdot 0 - 5 \cdot 0}{\sqrt{100 \cdot 95 \cdot 5 \cdot 0}} = \text{undefined}$$
9. MCC vs F1 score.

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 1       | FP = 4       |
   | True: 1 | FN = 5       | TP = 90      |

   F1 score = 0.952, MCC = 0.135

   |         | Predicted: 0 | Predicted: 1 |
   |---------|--------------|--------------|
   | True: 0 | TN = 90      | FP = 5       |
   | True: 1 | FN = 4       | TP = 1       |

   F1 score = 0.182, MCC = 0.135

   The F1 score is sensitive to which class is positive and which is negative. MCC isn’t.
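The asymmetry is easy to reproduce: compute both metrics, then swap which class counts as positive. A sketch built from the first confusion matrix above:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Reconstruct labels matching TN=1, FP=4, FN=5, TP=90 from the slide
y_true = np.array([0] * 5 + [1] * 95)
y_pred = np.array([0] * 1 + [1] * 4 + [0] * 5 + [1] * 90)

print(f1_score(y_true, y_pred), matthews_corrcoef(y_true, y_pred))
# ~0.952 and ~0.135

# Relabel: treat the old negative class as positive
print(f1_score(1 - y_true, 1 - y_pred), matthews_corrcoef(1 - y_true, 1 - y_pred))
# ~0.182 and ~0.135 — F1 changes, MCC doesn't
```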
10. AUC (Area Under Curve).

   $$\text{True Positive Rate} = \frac{TP}{TP + FN} \qquad \text{False Positive Rate} = \frac{FP}{FP + TN}$$
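ROC AUC is computed from predicted scores rather than hard labels; a minimal sketch (the scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]
# Predicted probabilities of class 1, not hard 0/1 predictions
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Area under the ROC curve (TPR vs FPR across all thresholds);
# 1.0 = perfect ranking, 0.5 = random ordering
print(roc_auc_score(y_true, y_scores))
```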
11. Log loss.
   • Takes into account the uncertainty of model predictions
   • Larger penalty for confident false predictions

   $$-\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)$$

   where: n = number of observations, yᵢ = the true label (0 or 1) of observation i in the binary case, pᵢ = the model’s predicted probability that observation i is 1.
12. Log loss intuition.

   | True label | Predicted prob. of class 1 | Log loss       |
   |------------|----------------------------|----------------|
   | 1          | 0.9                        | 0.105360515657 |
   | 1          | 0.55                       | 0.597837000755 |
   | 1          | 0.10                       | 2.302585092994 |
   | 0          | 0.95                       | 2.995732273553 |
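These values can be reproduced directly; a small sketch of the per-observation penalty, using a hypothetical helper (for a true label of 1 it reduces to -log p):

```python
import numpy as np

def single_log_loss(y, p):
    """Log loss contribution of one observation with true label y
    and predicted probability p of class 1."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

for y, p in [(1, 0.9), (1, 0.55), (1, 0.10), (0, 0.95)]:
    print(y, p, single_log_loss(y, p))
# 0.105..., 0.597..., 2.302..., 2.995... — matching the table above
```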
13. Precision, Recall, F1 Score (multiclass example).

   | Label | Predicted |
   |-------|-----------|
   | cat   | cat       |
   | cat   | cat       |
   | cat   | cat       |
   | cat   | cat       |
   | dog   | dog       |
   | dog   | dog       |
   | dog   | cat       |
   | bird  | dog       |
   | bird  | bird      |
14. Precision: micro, macro, weighted (same labels and predictions as above).
15. Precision: micro, macro, weighted.

   | Class | TP | FP | Precision | N samples |
   |-------|----|----|-----------|-----------|
   | bird  | 1  | 0  | 1         | 2         |
   | cat   | 4  | 1  | 0.8       | 4         |
   | dog   | 2  | 1  | 0.6666    | 3         |
   | TOTAL | 7  | 2  |           | 9         |

   $$\text{macro pr} = \frac{1}{3}(1 + 0.8 + 0.6666) = 0.8222 \qquad \text{micro pr} = \frac{7}{7 + 2} = 0.7777 \qquad \text{weighted pr} = \frac{1 \cdot 2 + 0.8 \cdot 4 + 0.6666 \cdot 3}{2 + 4 + 3} = 0.8$$
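scikit-learn's precision_score reproduces all three averages via its average parameter; a sketch using the cat/dog/bird predictions from the slides above:

```python
from sklearn.metrics import precision_score

y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "cat", "cat", "cat", "dog", "dog", "cat", "dog", "bird"]

for avg in ("macro", "micro", "weighted"):
    print(avg, precision_score(y_true, y_pred, average=avg))
# macro ≈ 0.822, micro ≈ 0.778, weighted = 0.8
```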
16. Micro, macro, weighted.
   • Micro-averaged: all samples contribute equally to the average
   • Macro-averaged: all classes contribute equally to the average
   • Weighted-averaged: each class’s contribution to the average is weighted by its size
17. R Squared (coefficient of determination).
   • Indicates how well the model predictions approximate the true values
   • 1 = perfect fit vs 0 = a DummyRegressor predicting the average
18. R Squared (coefficient of determination).

   $$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

   where y = actual values, ŷ = predicted values, ȳ = mean of the actual values.

   R squared has an intuitive scale and doesn’t depend on the units of y. R squared gives you no information about the prediction error.
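A minimal sketch with r2_score, including the baseline case the slide mentions (the values are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.6])

print(r2_score(y_true, y_pred))  # close to 1: good fit

# Predicting the mean of y for every sample gives R^2 = 0,
# which is what a DummyRegressor(strategy="mean") would score.
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0
```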
19. MSE: mean squared error.

   $$MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
20. RMSE: root mean squared error.

   $$RMSE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
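Both follow from mean_squared_error; a small sketch (RMSE computed with an explicit square root to stay version-agnostic, same made-up values as above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.6])

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as y
mae  = mean_absolute_error(y_true, y_pred)

print(mse, rmse, mae)
```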
21. What do MAE and RMSE have in common?
   • Range: 0 -> ∞
   • MAE and RMSE have the same units as the y values
   • Indifferent to the direction of the errors
   • The lower the metric value, the better
22. MAE vs RMSE: what’s different?
   • RMSE gives a relatively high weight to large errors
   • MAE is more robust to outliers
   • RMSE is differentiable
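One outlier makes the difference visible; a toy comparison (the residuals are chosen purely for illustration):

```python
import numpy as np

# Four small residuals and one large outlier
errors = np.array([1.0, 1.0, 1.0, 1.0, 10.0])

mae  = np.mean(np.abs(errors))        # 2.8   — grows linearly with the outlier
rmse = np.sqrt(np.mean(errors ** 2))  # ≈4.56 — dominated by the squared outlier

print(mae, rmse)
```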
23. MAE vs RMSE.
   • Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance, Cort J. Willmott, Kenji Matsuura, 2005
   • Root mean square error (RMSE) or mean absolute error (MAE)?, Tianfeng Chai, R. R. Draxler, 2009
   • Neither metric is robust on a small test set (<100)
24. RMSLE: root mean squared logarithmic error.

   $$RMSLE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log_e(y_i + 1) - \log_e(\hat{y}_i + 1)\right)^2}$$

   • Similar to RMSE, but uses the natural logarithm of (y + 1) instead of y
   • +1 because the log of 0 is not defined
   • Shows relative error
   • Penalizes under-predicted estimates more than over-predicted ones
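scikit-learn ships the squared version as mean_squared_log_error; taking the square root gives RMSLE. A sketch showing the asymmetric penalty (the numbers are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 100.0])

# Under-prediction by 50 vs over-prediction by 50
under = np.array([50.0, 50.0])
over  = np.array([150.0, 150.0])

print(np.sqrt(mean_squared_log_error(y_true, under)))  # ≈0.683
print(np.sqrt(mean_squared_log_error(y_true, over)))   # ≈0.402 — smaller penalty
```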
25. Takeaways. There’s no “one-size-fits-all” evaluation metric. Get to know your data. Keep in mind the business objective of your ML problem.
26. Thank you. Recommended resources:
   - scikit-learn User Guide
   - http://wiki.fast.ai/
   - "Root mean square error (RMSE) or mean absolute error (MAE)?" by Tianfeng Chai, R. R. Draxler
   - Tip 8 from "Ten quick tips for machine learning in computational biology"
   - “Macro- and micro-averaged evaluation measures” by Vincent Van Asch
   @mariakhalusova