
Evaluating Machine Learning Models - PyData Global 2022

Valerio Maggio
December 02, 2022


It seems we have avoided the worst signs of the reproducibility crisis when applying machine learning in science, thanks to better education for reviewers, easier access to tools, and a better understanding of zero-knowledge models.

However, there is much more potential for ML in science. The real world comes with many pitfalls: the application of machine learning is very promising, but the verification of scientific results is complex. Nevertheless, many open-source contributors in the field have worked hard to develop practices and resources that ease this process.

We discuss pitfalls and solutions in model evaluation, where the choice of appropriate metrics and adequate splits of the data is important. We discuss benchmarks, testing, and machine learning reproducibility, where we go into detail on pipelines. Pipelines are a great showcase for avoiding the main reproducibility pitfalls, as well as a tool to bridge the gap between ML experts and domain scientists. Interaction with domain scientists, involving existing knowledge, and communication are a constant undercurrent in producing trustworthy, validated, and reliable machine learning solutions.

Overall, this workshop relies on existing high-quality resources such as the Turing Way, more applied tutorials like Jesper Dramsch’s EuroSciPy tutorial on ML reproducibility, and professional tools like the Ersilia Hub, and it uses real-world examples from different scientific disciplines, e.g. weather and biomedicine.

In this workshop, we present a series of talks from invited speakers who are experts in applying data science and machine learning to real-world problems. Each talk is followed by an interactive session that turns the theory into practical examples the participants can directly apply to improve their own research. Finally, we close with a discussion that invites active participation and engagement with the speakers as a group.


Transcript

  1. Evaluating Machine
    Learning Models
    @leriomaggio
    [email protected]
    Valerio Maggio, PhD
    Slides available at: bit.ly/evaluate-ml-models-pydata


  2. Recap: Machine Learning Overview
     [Diagram relating the components: Domain, objects, features, Data, Model, Output, Task]
     Adapted from: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  3. Recap: Machine Learning Overview
     [Same diagram, now including: Learning algorithm, Training Data, Learning problem]
     Adapted from: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  4. Recap: Machine Learning Overview
     [Same diagram as above]
     Supervised Learning: {(Xi, yi), i = 1, ..., N}
     Unsupervised Learning: {Xi, i = 1, ..., N}
     Adapted from: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  5. Learning Objectives
     Aim #1: Provide a description of the basic components that are required to carry out a Machine Learning Experiment (see next slide)
     Basic components ≠ Recipes

  6. Learning Objectives
     Aim #1: Provide a description of the basic components that are required to carry out a Machine Learning Experiment (see next slide)
     Basic components ≠ Recipes
     Aim #2: Give you some appreciation of the importance of choosing measurements that are appropriate for your particular experiment
     e.g. (just) Accuracy may not be the right metric to use!

  7. Machine Learning Experiment
     ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D)

  8. Machine Learning Experiment
     ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D)
     Common examples of RQs are:
     How does model m perform on data from domain D?
     Much harder: how would m (also) perform on data from D2 (≠ D)?

  9. Machine Learning Experiment
     ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D)
     Common examples of RQs are:
     How does model m perform on data from domain D?
     Much harder: how would m (also) perform on data from D2 (≠ D)?
     Which of these models m1, m2, ..., mk from A has the best performance on data from D?
     Which of these learning algorithms gives the best model on data from D?

  10. Machine Learning Experiment
     In order to set up our experimental framework we need to investigate:
     What to measure?
     How to measure it?

  11. Machine Learning Experiment
     In order to set up our experimental framework we need to investigate:
     What to measure?
     How to measure it?
     How to interpret the results? (next step)
     In other words: how robust and reliable are the results?

  12. WHAT to measure?

  13. (Binary) Classification Problem
     Without any loss of generality, let's consider a Binary Classification Problem
     (we're still in Supervised Learning territory)
     [Confusion Matrix: True Class on the rows, Predicted Class on the columns;
      cells: TP (Positive predicted Positive), FN (Positive predicted Negative),
             FP (Negative predicted Positive), TN (Negative predicted Negative)]

  14. (builds on slide 13, adding:)
     true positive (TP): Positive samples correctly predicted as Positive

  15. (adds:)
     false negative (FN): Positive samples wrongly predicted as Negative

  16. (adds:)
     condition positive (P): # of real positive cases in the data; P = TP + FN

  17. (adds:)
     true negative (TN): Negative samples correctly predicted as Negative

  18. (adds:)
     false positive (FP): Negative samples wrongly predicted as Positive

  19. (adds:)
     condition negative (N): # of real negative cases in the data; N = FP + TN

  20. (adds:)
     T = P + N = TP + TN + FP + FN

  21. (adds:)
     Portion of Positive: Pos = P / T

  22. (adds:)
     Portion of Negative: Neg = N / T = 1 - Pos
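     These counts and portions are easy to compute in code. A minimal sketch (not part of the original slides), using scikit-learn's confusion_matrix on made-up 0/1 labels:

       import numpy as np
       from sklearn.metrics import confusion_matrix

       # Hypothetical ground-truth labels and predictions, for illustration only
       y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
       y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

       # With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP
       tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

       P, N = tp + fn, fp + tn      # condition positive / condition negative
       T = P + N                    # total number of samples
       pos, neg = P / T, N / T      # portion of positive / portion of negative
       print(f"TP={tp} FN={fn} FP={fp} TN={tn}  P={P} N={N} T={T}  Pos={pos:.2f} Neg={neg:.2f}")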

  23. Classification Metrics
     Without any loss of generality, let's consider a Binary Classification Problem
     (we're still in Supervised Learning territory)
     [Confusion Matrix as in slide 13]
     (Main) PRIMARY metrics: P = TP + FN; N = FP + TN; T = P + N; Pos = P / T; Neg = N / T = 1 - Pos

  24. (adds:)
     TPR = TP / P    (True-Positive Rate, Sensitivity, RECALL)

  25. (adds:)
     TNR = TN / N    (True-Negative Rate, Specificity, NEGATIVE RECALL)

  26. (adds:)
     PREC = TP / (TP + FP)    (Confidence, PRECISION)

  27. (adds, as a (Popular) SECONDARY metric:)
     F1 = 2 * (PREC * TPR) / (PREC + TPR)    (F1 Score; memo: harmonic mean of Precision & Recall)

  28. (adds:)
     ACC = (TP + TN) / (P + N) = Pos * TPR + (1 - Pos) * TNR    (ACCURACY)
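     For reference, the same primary and secondary metrics are available directly in scikit-learn; a small self-contained sketch (the labels below are made up):

       import numpy as np
       from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

       y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])   # hypothetical 0/1 labels
       y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

       print("Recall (TPR):      ", recall_score(y_true, y_pred))
       print("Specificity (TNR): ", recall_score(y_true, y_pred, pos_label=0))  # recall of the negative class
       print("Precision:         ", precision_score(y_true, y_pred))
       print("F1:                ", f1_score(y_true, y_pred))
       print("Accuracy:          ", accuracy_score(y_true, y_pred))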

  29. Matthews Correlation Coefficient (MCC)
     Let's introduce the last metric we are going to explore today
     (still derived from the confusion matrix)
     [Confusion Matrix as in slide 13]
     MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

  30. (adds:) The Good

  31. (adds:) The Good, The Bad

  32. (adds:) The Good, The Bad, The Ugly 🙃
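     scikit-learn ships this metric as matthews_corrcoef; a quick sketch on made-up labels:

       import numpy as np
       from sklearn.metrics import matthews_corrcoef

       y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])   # hypothetical labels
       y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

       # MCC ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction)
       print("MCC:", matthews_corrcoef(y_true, y_pred))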

  33. Is Accuracy a Good Idea?
     ACC = Pos * TPR + (1 - Pos) * TNR
     We use some data for evaluation as representative of any future data.
     Nonetheless, the model may need to operate in a different operating context,
     e.g. a different class distribution!

  34. (adds:)
     We could treat ACC on future data as a random variable and take its expectation
     (assuming a uniform probability distribution over the portion of positives):
     E[ACC] = E[Pos] * TPR + E[1 - Pos] * TNR = TPR/2 + TNR/2 = AVG-REC [1]
     [1]: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  35. (adds:)
     [On the other hand] If we choose ACC as the evaluation measure, we are making the implicit
     assumption that the class distribution in the test data is representative of the operating context.

  36. Is Accuracy a Good Idea? Examples
     Model m1 on D (T=100, P=80, N=20): TP=60, FN=20, FP=0, TN=20
     TPR = 0.75; TNR = 1.00; ACC = 0.80; AVG-REC = 0.88
     Model m2 on D (T=100, P=80, N=20): TP=75, FN=5, FP=10, TN=10
     TPR = 0.94; TNR = 0.50; ACC = 0.85; AVG-REC = 0.72
     [1]: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  37. (adds:) Mmm… not really
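     The slide's numbers can be reproduced with a few lines of arithmetic; a sketch using the counts above:

       # Confusion-matrix counts for the two models on D (from the slide)
       models = {
           "m1": dict(tp=60, fn=20, fp=0,  tn=20),
           "m2": dict(tp=75, fn=5,  fp=10, tn=10),
       }

       for name, c in models.items():
           P, N = c["tp"] + c["fn"], c["fp"] + c["tn"]
           tpr, tnr = c["tp"] / P, c["tn"] / N
           acc = (c["tp"] + c["tn"]) / (P + N)
           avg_rec = (tpr + tnr) / 2      # expected accuracy under a uniform prior on Pos
           print(f"{name}: TPR={tpr:.2f} TNR={tnr:.2f} ACC={acc:.2f} AVG-REC={avg_rec:.2f}")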

  38. Is F-Measure (F1) a Good Idea?
     RECALL: TPR = TP / P;  PRECISION: PREC = TP / (TP + FP);
     F1 Score (harmonic mean): F1 = 2 * (PREC * TPR) / (PREC + TPR)
     Model m2 on D (T=100, P=80, N=20): TP=75, FN=5, FP=10, TN=10
     PREC = 75 / 85 = 0.88; TPR = 75 / 80 = 0.94; F1 = 0.91; ACC = 0.85
     [1]: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

  39. (adds:)
     Model m2 on D2 (T=1000, P=80, N=920): TP=75, FN=5, FP=10, TN=910
     PREC = 75 / 85 = 0.88; TPR = 75 / 80 = 0.94; F1 = 0.91; ACC = 0.99

  40. (adds:)
     F1 is to be preferred in domains where negatives abound (and are not the relevant class)

  41. Is F-Measure (F1) a Good Idea?
     F1 Score (harmonic mean): F1 = 2 * (PREC * TPR) / (PREC + TPR)
     MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
     Model m2 on D (T=100, P=100, N=0): TP=95, FN=5, FP=0, TN=0
     PREC = 95 / 95 = 1.00; TPR = 95 / 100 = 0.95; F1 = 0.974; ACC = 0.95; MCC = UNDEFINED

  42. (adds:)
     Model m2 on D (T=100, P=95, N=5): TP=90, FN=5, FP=4, TN=1
     PREC = 90 / 94 = 0.96; TPR = 90 / 95 = 0.95; F1 = 0.952; ACC = 0.91; MCC = 0.14

  43. (adds:)
     MCC is to be preferred in general (when predictions on all classes count!)

  44. (adds, for comparison:)
     ACC = (TP + TN) / (P + N)    (ACCURACY)
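     A short sketch that recomputes these two corner cases from their confusion-matrix counts (plain arithmetic, so the undefined MCC stays visible; note that scikit-learn's matthews_corrcoef would instead return 0.0 in the degenerate case):

       import math

       def metrics(tp, fn, fp, tn):
           """PREC, TPR, F1, ACC, MCC from confusion-matrix counts."""
           prec = tp / (tp + fp)
           tpr = tp / (tp + fn)
           f1 = 2 * prec * tpr / (prec + tpr)
           acc = (tp + tn) / (tp + tn + fp + fn)
           denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
           mcc = (tp * tn - fp * fn) / denom if denom > 0 else float("nan")  # undefined when a margin is 0
           return prec, tpr, f1, acc, mcc

       # Counts from slides 41-42 (Model m2 on the two variants of D)
       print(metrics(tp=95, fn=5, fp=0, tn=0))   # F1 ~ 0.974, ACC = 0.95, MCC undefined (nan)
       print(metrics(tp=90, fn=5, fp=4, tn=1))   # F1 ~ 0.952, ACC = 0.91, MCC ~ 0.14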

  45. Take-away Lessons
     Be aware that not all metrics are the same, so choose consciously
     e.g. choose F1 where negatives abound (and are NOT relevant for the task)
     e.g. choose MCC when predictions on all classes count!
     [Practical] Don't just record ACC;
     instead keep track of the main Primary Metrics, so that (other) Secondary metrics can be derived

  46. HOW to measure?

  47. Use the “Data”, Luke
     Evaluating Supervised Learning models might appear straightforward:
     (1) train the model;
     (2) calculate how well it performs using some appropriate metric (e.g. accuracy, squared error)
     [Diagram: Training Data -> Learning algorithm -> Results]

  48. (adds:) FLAWED!
     Our goal is to evaluate how well the model does on data it has never seen before (out-of-sample error).
     Evaluating on the training data gives an overly optimistic estimate (a.k.a. in-sample error).
     Ch. 7.4 Optimism of the Training Error Rate

  49. Train-Test Partitions
     Hold-out Evaluation: Dataset -> Training Set (75%, TRAIN) + Test Set (25%, EVALUATE)
     Hold-out => this data must be put off to the side, to be used only for evaluating performance

  50. Train-Test Partition (code)
     Hold-out Evaluation: Dataset -> Training Set (75%) + Test Set (25%)

  51. (same slide content, continued)

  52. (adds:)
     Weak point: performance is highly dependent on the samples selected for the test partition
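     The code on these slides is not captured in the transcript; what follows is a minimal hold-out sketch along the same lines, using scikit-learn's train_test_split (the Iris dataset and the logistic-regression model are illustrative stand-ins, not necessarily what the slide used):

       from sklearn.datasets import load_iris
       from sklearn.linear_model import LogisticRegression
       from sklearn.metrics import matthews_corrcoef
       from sklearn.model_selection import train_test_split

       X, y = load_iris(return_X_y=True)

       # Hold out 25% of the samples; they are used only for the final evaluation
       X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size=0.25, random_state=42, stratify=y
       )

       model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
       print("MCC on the held-out test set:", matthews_corrcoef(y_test, model.predict(X_test)))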

  53. Introducing Cross-Validation
     Idea: we could generate several test partitions, and use them to assess the model.
     More systematically, what we could do instead is:

  54. (adds:)
     1. Randomly split D into k (~equally-sized) partitions P1, P2, P3, ..., Pk-1, Pk, called folds
     [Diagram: the Dataset divided into k folds]

  55. (adds:)
     2. (In turn, k times)
     2.a fit the model on the other k-1 partitions (combined);
     2.b evaluate the prediction error on the remaining partition
     [Diagram: k copies of the fold layout; in each, one fold is the Test fold and the rest are Training folds]

  56. (adds:)
     This yields one fitted model per fold: m1, m2, m3, ..., mk

  57. (adds:) K-fold Cross-Validation:
     CV(A, D) = (1 / K) * Σ_{i=1..K} Ai,   where Ai = metric(mi, Pi)
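     A compact sketch of the scheme with scikit-learn (the dataset, model, and scoring choice are illustrative assumptions):

       from sklearn.datasets import load_breast_cancer
       from sklearn.linear_model import LogisticRegression
       from sklearn.model_selection import KFold, cross_val_score

       X, y = load_breast_cancer(return_X_y=True)

       # Each fold is held out once while the model is refit from scratch on the other k-1 folds
       cv = KFold(n_splits=5, shuffle=True, random_state=0)
       scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="balanced_accuracy")

       print("per-fold scores:", scores.round(3))
       print(f"CV(A, D) = {scores.mean():.3f}")   # the average of the per-fold metric values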

  58. Cross-Validation: Tips and Rules
     CV for Learning Algorithm A on Dataset D:  CV(A, D) = (1 / K) * Σ_{i=1..K} Ai,  Ai = metric(mi, Pi)
     REMEMBER: the deal with the test partition is always the same!
     Test folds must remain UNSEEN by the model during training

  59. (adds:)
     K can be (~) any number in [1, N]
     k=5 (Breiman and Spector, 1992); k=10 (Kohavi, 1995); K = N -> LOO (Leave-One-Out)

  60. (adds:)
     Cross-Validation can be repeated, changing the random seed
     (although this increasingly violates the IID assumption)

  61. (adds:)
     Cross-Validation can be stratified,
     i.e. maintaining ~the same class distribution in the training and test folds;
     e.g. for imbalanced datasets and/or if we expect the learning algorithm to be sensitive to class distribution

  62. (adds:) bit.ly/sklearn-model-selection
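     Stratified and repeated variants are ready-made in scikit-learn; a brief sketch (again with an illustrative dataset and model):

       from sklearn.datasets import load_breast_cancer
       from sklearn.linear_model import LogisticRegression
       from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

       X, y = load_breast_cancer(return_X_y=True)

       # Stratified folds keep ~the same class distribution in every fold;
       # repeating with different seeds yields more estimates (while stretching the IID assumption)
       cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
       scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv, scoring="balanced_accuracy")
       print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} fits")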

  63. CV for Model Selection?
     A common mistake is to use cross-validation (i.e. its test folds) to do model selection (a.k.a. hyper-parameter selection).
     This is methodologically wrong, as parameter tuning should be part of the training, so the test data shouldn't be used at all!

  64. (adds:)
     A methodologically sound option is to perform what's referred to as "Internal Cross-Validation":
     Dataset -> Training Set + Test Set; the Training Set is further split into Training and Validation parts;
     CV model selection happens there, then the selected m* is retrained on the whole Training Set
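     A sketch of internal cross-validation with scikit-learn's GridSearchCV, where tuning only ever sees the training set and the held-out test set is scored once at the end (dataset, model, and grid are illustrative):

       from sklearn.datasets import load_breast_cancer
       from sklearn.model_selection import GridSearchCV, train_test_split
       from sklearn.pipeline import make_pipeline
       from sklearn.preprocessing import StandardScaler
       from sklearn.svm import SVC

       X, y = load_breast_cancer(return_X_y=True)
       X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size=0.25, random_state=0, stratify=y
       )

       # Hyper-parameters are selected by CV on the training set only
       search = GridSearchCV(
           make_pipeline(StandardScaler(), SVC()),
           param_grid={"svc__C": [0.1, 1, 10]},
           cv=5,
           scoring="balanced_accuracy",
       )
       search.fit(X_train, y_train)    # refit=True (default): m* is retrained on the whole training set
       print("selected parameters:", search.best_params_)
       print("test-set score:", search.score(X_test, y_test))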

  65. No Free Lunch Theorem
     In 1996 David Wolpert demonstrated that if you make absolutely no assumption about the data,
     then there is no reason to prefer one model over any other.
     This is called the No Free Lunch (NFL) theorem.
     For some datasets the best model is a linear model, while for other datasets it is a neural network.

  66. (adds:)
     There is no model that is a priori guaranteed to work better (hence the name of the theorem).
     The only way is to make some reasonable assumptions about the data and evaluate only a few reasonable models.

  67. (adds:) CV provides a robust framework to do so!

  68. https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial


  69. Inflated Cross-Validation?

  70. Inflated Cross-Validation?

  71. (adds:) Whoa!
     Using features which have no connection with the class labels, we managed to predict the correct
     class in about 60% of cases, 10% better than random guessing! Can you spot where we cheated?

  72. (adds:) Sampling bias (or selection bias)
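     The code behind this cheat is not captured in the transcript; one common way to produce exactly this effect is to select features on the full dataset before cross-validating. A sketch under that assumption:

       import numpy as np
       from sklearn.feature_selection import SelectKBest, f_classif
       from sklearn.linear_model import LogisticRegression
       from sklearn.model_selection import cross_val_score
       from sklearn.pipeline import make_pipeline

       rng = np.random.default_rng(0)
       X = rng.normal(size=(50, 5000))          # features with no connection to the labels
       y = rng.integers(0, 2, size=50)

       # WRONG: selecting features on the full dataset leaks label information into every CV fold
       X_cheat = SelectKBest(f_classif, k=20).fit_transform(X, y)
       print("leaky CV accuracy: ", cross_val_score(LogisticRegression(max_iter=1000), X_cheat, y, cv=5).mean())

       # RIGHT: selection happens inside the pipeline, i.e. on each training fold only
       honest = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
       print("honest CV accuracy:", cross_val_score(honest, X, y, cv=5).mean())

     With random labels the honest estimate hovers around 0.5, while the leaky one can easily look well above chance.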

  73. Does Cross-Validation Really Work?
     CV for Learning Algorithm A on Dataset D:  CV(A, D) = (1 / K) * Σ_{i=1..K} Ai,  Ai = metric(mi, Pi)
     Ch. 7.12 Conditional or Expected Test Error? empirically demonstrates that K-fold CV provides
     reasonable estimates of the expected test error Err
     (whereas it's not that straightforward for the conditional error Err_T on a given training set T)
     Ch. 7.10.3 Does Cross-Validation Really Work?

  74. (adds a corner case:)
     A dataset with N = 20 samples in two equal-sized classes, and p = 500 quantitative features
     that are independent of the class labels;
     the true error rate of any classifier is 50%.
     Fitting to the entire training set: if we do 5-fold cross-validation, this same predictor should
     split any 4/5ths and 1/5th of the data well too, and hence its cross-validation error will be small
     (much less than 50%). Thus CV does not give an accurate estimate of error.

  75. Does Cross-Validation Really Work?
     (Ch. 7.10.3) The argument has ignored the fact that in cross-validation, the model must be
     completely retrained for each fold.
     [Plot: performance differs across folds; the average error is 0.5, as it should be (i.e. random guessing)]
     Take-away: the random-labels trick can be a useful sanity check for your CV pipeline
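     scikit-learn wraps the random-labels check as permutation_test_score; a sketch (illustrative dataset and model):

       from sklearn.datasets import load_breast_cancer
       from sklearn.linear_model import LogisticRegression
       from sklearn.model_selection import permutation_test_score

       X, y = load_breast_cancer(return_X_y=True)

       # The whole CV procedure is rerun on permuted labels: those scores should sit near chance level
       score, perm_scores, p_value = permutation_test_score(
           LogisticRegression(max_iter=5000), X, y, cv=5, n_permutations=30, random_state=0
       )
       print(f"real CV score: {score:.3f}")
       print(f"mean score on random labels: {perm_scores.mean():.3f} (p = {p_value:.3f})")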

  76. References and Further Reading
     [Article] Why every statistician should know about cross-validation
     (https://robjhyndman.com/hyndsight/crossvalidation/)
     [Paper] A survey of cross-validation procedures for model selection
     DOI: 10.1214/09-SS054
     [Article] IID Violation and Robust Standard Errors
     (https://stat-analysis.netlify.app/the-iid-violation-and-robust-standard-errors.html)
     [Article] Non i.i.d. Data and Cross Validation
     (https://inria.github.io/scikit-learn-mooc/python_scripts/cross_validation_time.html)

  77. Thank you very much for your
    kind attention
    @leriomaggio
    [email protected]
    Valerio Maggio
    Slides available at: bit.ly/evaluate-ml-models-pydata
