Evaluating Machine
Learning Models
@leriomaggio
[email protected]
Valerio Maggio, PhD
Slides available at: bit.ly/evaluate-ml-models-pydata

Recap: Machine Learning Overview
[Diagram] Objects from the Domain are described by features to form the Data; a Learning
algorithm, given the Training Data (the Learning problem), produces a Model that maps
Data to the Output required by the Task.
Supervised Learning: Training Data = {(Xi, yi), i = 1, …, N}
Unsupervised Learning: Training Data = {Xi, i = 1, …, N}
Adapted from: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012

Learning Objectives
Aim #1: Provide a description of the basic components that are
required to carry out a Machine Learning Experiment (see next slide)
Basic components ≠ Recipes
Aim #2: Give you some appreciation of the importance of choosing
measurements that are appropriate for your particular experiment
e.g. (just) Accuracy may not be the right metric to use!

Machine Learning Experiment
ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D)
Common examples of RQs are:
How does model m perform on data from domain D?
Much harder: how would m (also) perform on data from D2 (≠ D)?
Which of the models m1, m2, …, mk from A has the best performance on data from D?
Which of these learning algorithms gives the best model on data from D?

Machine Learning Experiment
In order to set up our experimental framework we need to investigate:
What to measure?
How to measure it?
How to interpret the results? (next step)
In other words: how robust and reliable are the results?

WHAT to measure?

(Binary) Classification Problem
Without any loss of generality, let's consider a Binary Classification Problem
(we're still in Supervised Learning territory).

Confusion Matrix (rows: True Class; columns: Predicted Class):

                          Predicted Positive      Predicted Negative
True Class: Positive      True Positive (TP)      False Negative (FN)
True Class: Negative      False Positive (FP)     True Negative (TN)

true positive (TP): Positive samples correctly predicted as Positive
false negative (FN): Positive samples wrongly predicted as Negative
condition positive (P): # of real positive cases in the data, P = TP + FN
true negative (TN): Negative samples correctly predicted as Negative
false positive (FP): Negative samples wrongly predicted as Positive
condition negative (N): # of real negative cases in the data, N = FP + TN
Total: T = P + N = TP + TN + FP + FN
Portion of Positives: Pos = P / T
Portion of Negatives: Neg = N / T = 1 - Pos

Classification Metrics
Without any loss of generality, let's consider a Binary Classification Problem
(we're still in Supervised Learning territory).
From the Confusion Matrix: P = TP + FN; N = FP + TN; T = P + N;
Pos = P / T; Neg = N / T = 1 - Pos.

(Main) PRIMARY Metrics
True-Positive Rate, Sensitivity, RECALL: TPR = TP / P
True-Negative Rate, Specificity, NEGATIVE RECALL: TNR = TN / N
Confidence, PRECISION: PREC = TP / (TP + FP)

(Popular) SECONDARY Metrics
F1 Score: F1 = 2 * (PREC * TPR) / (PREC + TPR)   (memo: the Harmonic Mean of Prec. & Rec.)
ACCURACY: ACC = (TP + TN) / (P + N) = Pos*TPR + (1 - Pos)*TNR
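
A minimal sketch of computing these metrics with scikit-learn (the example arrays are
illustrative assumptions, not from the slides):

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])  # illustrative labels
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # illustrative predictions

# For binary labels (0, 1), ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

p, n = tp + fn, fp + tn                 # condition positive / negative
tpr = tp / p                            # recall / sensitivity
tnr = tn / n                            # specificity / negative recall
prec = tp / (tp + fp)                   # precision / confidence
f1 = 2 * (prec * tpr) / (prec + tpr)    # harmonic mean of PREC and TPR
acc = (tp + tn) / (p + n)

# The same values via the built-in helpers
assert np.isclose(prec, precision_score(y_true, y_pred))
assert np.isclose(tpr, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
assert np.isclose(acc, accuracy_score(y_true, y_pred))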

Matthews Correlation Coefficient (MCC)
Let's introduce the last metric we will explore today
(still derived from the confusion matrix):

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

The Good, The Bad, and The Ugly 🙃
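
A short sketch (same illustrative arrays as above) computing MCC by hand and via
scikit-learn's matthews_corrcoef:

import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

num = tp * tn - fp * fn
den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(num / den)                          # ~0.583 for these arrays
print(matthews_corrcoef(y_true, y_pred))  # same value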

Is Accuracy a Good Idea?
ACC = Pos*TPR + (1 - Pos)*TNR
We use some data for evaluation as representative of any future data.
Nonetheless, the model may need to operate in a different operating context,
e.g. a different class distribution!
We could treat ACC on future data as a random variable, and take its expectation
(assuming a uniform probability distribution over the portion of positives):
E[ACC] = E[Pos]*TPR + E[1 - Pos]*TNR = TPR/2 + TNR/2 = AVG-REC [1]
[On the other hand] If we choose ACC as the evaluation measure, we make the implicit
assumption that the class distribution in the test data is representative of the operating context.

Examples:
Model m1 on D (T = 100, P = 80, N = 20): TP = 60, FN = 20, FP = 0, TN = 20
TPR = 0.75; TNR = 1.00; ACC = 0.80; AVG-REC = 0.88
Model m2 on D (T = 100, P = 80, N = 20): TP = 75, FN = 5, FP = 10, TN = 10
TPR = 0.94; TNR = 0.50; ACC = 0.85; AVG-REC = 0.72
Mmm… not really.
[1]: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012
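
A sketch using the m1/m2 counts from this slide; for binary problems AVG-REC is what
scikit-learn exposes as sklearn.metrics.balanced_accuracy_score:

def avg_rec(tp, fn, fp, tn):
    # Expected accuracy under a uniform prior over the portion of positives
    return 0.5 * tp / (tp + fn) + 0.5 * tn / (tn + fp)

print(avg_rec(60, 20, 0, 20))   # m1: 0.875  (reported as 0.88)
print(avg_rec(75, 5, 10, 10))   # m2: 0.719  (reported as 0.72)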

Is F-Measure (F1) a Good Idea?
RECALL: TPR = TP / P
PRECISION: PREC = TP / (TP + FP)
F1 Score (Harmonic Mean): F1 = 2 * (PREC * TPR) / (PREC + TPR)

Model m2 on D (T = 100, P = 80, N = 20): TP = 75, FN = 5, FP = 10, TN = 10
PREC = 75 / 85 = 0.88; TPR = 75 / 80 = 0.94
F1 = 0.91; ACC = 0.85
Model m2 on D2 (T = 1000, P = 80, N = 920): TP = 75, FN = 5, FP = 10, TN = 910
PREC = 75 / 85 = 0.88; TPR = 75 / 80 = 0.94
F1 = 0.91; ACC = 0.99

F1 is to be preferred in domains where negatives abound (and are not the relevant class).
[1]: “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, P. Flach, 2012
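
A small sketch (counts from the two matrices above) making the point explicit: F1 never
looks at TN, so adding 900 true negatives leaves it unchanged while accuracy jumps:

def f1_and_acc(tp, fn, fp, tn):
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    acc = (tp + tn) / (tp + fn + fp + tn)
    return f1, acc

print(f1_and_acc(75, 5, 10, 10))    # on D : F1 ~0.91, ACC 0.85
print(f1_and_acc(75, 5, 10, 910))   # on D2: F1 ~0.91, ACC ~0.99 (only TN changed)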

Is F-Measure (F1) a Good Idea?
F1 Score (Harmonic Mean): F1 = 2 * (PREC * TPR) / (PREC + TPR)
ACCURACY: ACC = (TP + TN) / (P + N)
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

Model m2 on D (T = 100, P = 100, N = 0): TP = 95, FN = 5, FP = 0, TN = 0
PREC = 95 / 95 = 1.00; TPR = 95 / 100 = 0.95
F1 = 0.974; ACC = 0.95; MCC = UNDEFINED
Model m2 on D (T = 100, P = 95, N = 5): TP = 90, FN = 5, FP = 4, TN = 1
PREC = 90 / 94 = 0.96; TPR = 90 / 95 = 0.95
F1 = 0.952; ACC = 0.91; MCC = 0.14

MCC is to be preferred in general (when predictions on all classes count!).
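
A sketch reproducing the comparison (counts from the two matrices above); MCC is left
undefined (NaN) whenever a margin of the confusion matrix is zero:

import math

def f1_acc_mcc(tp, fn, fp, tn):
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    acc = (tp + tn) / (tp + fn + fp + tn)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den > 0 else float("nan")
    return f1, acc, mcc

print(f1_acc_mcc(95, 5, 0, 0))   # F1 ~0.97, ACC 0.95, MCC nan (undefined)
print(f1_acc_mcc(90, 5, 4, 1))   # F1 ~0.95, ACC 0.91, MCC ~0.14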

Take-away Lessons
Be aware that not all metrics are the same, so choose consciously:
e.g. choose F1 where negatives abound (and are NOT relevant for the task);
e.g. choose MCC when predictions on all classes count!
[Practical] Don't just record ACC: keep track of the main Primary Metrics,
so that (other) Secondary Metrics can be derived.

HOW to measure?

Use the “Data”, Luke
Evaluating Supervised Learning models might appear straightforward:
(1) train the model;
(2) calculate how well it performs using some appropriate metric (e.g. accuracy, squared error).
[Diagram: Training Data -> Learning algorithm -> Results]
FLAWED: this yields an overly optimistic estimate (a.k.a. the in-sample error)!
Our goal is to evaluate how well the model does on data it has never seen before
(the out-of-sample error).
See Ch. 7.4 “Optimism of the Training Error Rate”, The Elements of Statistical Learning.
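
A minimal sketch (toy data and model are illustrative assumptions) of why evaluating on
the training data is flawed:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample accuracy:    ", model.score(X_train, y_train))  # typically 1.0 (fully grown tree)
print("out-of-sample accuracy:", model.score(X_test, y_test))    # noticeably lower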

Train-Test Partitions
Hold-out Evaluation: split the Dataset into a Training Set (e.g. 75%), used to TRAIN,
and a Test Set (e.g. 25%), used to EVALUATE.
Hold-out => this data must be set aside, to be used only for evaluating performance.
Weakness: performance is highly dependent on the particular samples selected
for the test partition.
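
The original deck showed the hold-out split as code here; a minimal scikit-learn
equivalent (dataset and model are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the samples; stratify=y preserves the class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # TRAIN
print("hold-out accuracy:", model.score(X_test, y_test))          # EVALUATE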

Introducing Cross-Validation
Idea: We could generate several test partitions, and use them all to assess the model.
More systematically, what we do instead is K-fold Cross-Validation:
1. Randomly split D into (~equally-sized) K partitions P1, …, PK, called folds.
2. In turn, K times:
2.a fit the model on the other K-1 partitions (combined), obtaining models m1, …, mK;
2.b evaluate the prediction error of mi on the held-out fold Pi.
[Diagram: in each of the K rounds, one fold is the Test fold and the remaining
K-1 folds are the Training folds.]

CV(A, D) = (1/K) * Σ_{i=1}^{K} Â_i, where Â_i = metric(m_i, P_i)
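
A sketch of the same K-fold procedure with scikit-learn (model and dataset are
illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # random, ~equal folds
scores = cross_val_score(model, X, y, cv=cv)           # one Â_i per fold
print(scores)
print(scores.mean())                                   # CV(A, D)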

Cross-Validation: Tips and Rules
CV for Learning Algorithm A on Dataset D:
CV(A, D) = (1/K) * Σ_{i=1}^{K} Â_i, where Â_i = metric(m_i, P_i)
REMEMBER: the deal with the test partition is always the same!
Test folds must remain UNSEEN by the model during training.
K can be (almost) any number in [2, N]:
K=5 (Breiman and Spector, 1992); K=10 (Kohavi, 1995);
K=N -> LOO (Leave-One-Out).
Cross-Validation can be Repeated:
by changing the random seed;
although this increasingly violates the IID assumption.
Cross-Validation can be Stratified:
i.e. maintain ~the same class distribution in the training and test folds;
e.g. for imbalanced datasets, and/or if we expect the learning algorithm to be
sensitive to the class distribution (see the sketch below).
bit.ly/sklearn-model-selection
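
A sketch of stratified and repeated CV (parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep ~the same class distribution in every fold;
# repeating with different seeds yields more (but correlated) estimates
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores.mean(), scores.std())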

CV for Model Selection?
A common mistake is to use cross-validation (on the test folds) to do model selection
(a.k.a. hyper-parameter selection).
This is methodologically wrong, as parameter tuning should be part of the training
(so test data shouldn't be used at all!).
A methodologically sound option is to perform what's referred to as “Internal Cross-Validation”:
Dataset -> Training Set + Test Set; Training Set -> Training Set + Validation.
CV model selection on the Training Set, then retrain on the whole Training Set with
the selected model m* (see the sketch below).
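
A sketch of Internal Cross-Validation via GridSearchCV (the estimator and the parameter
grid are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Hyper-parameters are selected with CV on the TRAINING set only;
# refit=True (the default) retrains m* on the whole training set
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The untouched test set is used exactly once, for the final evaluation
print(search.best_params_, search.score(X_test, y_test))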

No Free Lunch Theorem
In 1996 David Wolpert demonstrated that if you make absolutely
no assumptions about the data, then there is no reason to prefer
one model over any other.
This is called the No Free Lunch (NFL) theorem.
For some datasets the best model is a linear model, while for
other datasets it is a neural network.
There is no model that is a priori guaranteed to work better
(hence the name of the theorem).
The only way is to make some reasonable assumptions
about the data and evaluate only a few reasonable models.
CV provides a robust framework to do so!

https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial

Inflated Cross-Validation?
[The original slides showed a live notebook demo here.]
Using features which have no connection with the class labels, we managed to predict
the correct class in about 60% of cases, 10% better than random guessing!
Can you spot where we cheated? Whoa!
Sampling Bias (or Selection Bias)
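
One classic way this happens (cf. “The Wrong and Right Way to Do Cross-validation”,
ESL Ch. 7.10.2) is selecting features on the full dataset before cross-validating;
a sketch with illustrative settings:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))   # features with no connection to the labels
y = rng.integers(0, 2, size=50)   # so the true error rate is 50%

# WRONG: the selector sees ALL labels before CV -> inflated scores
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
print(cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean())  # well above 0.5

# RIGHT: selection is refit inside each training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())                      # ~0.5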

Does Cross-Validation Really Work?
CV for Learning Algorithm A on Dataset D:
CV(A, D) = (1/K) * Σ_{i=1}^{K} Â_i, where Â_i = metric(m_i, P_i)
Ch. 7.12 “Conditional or Expected Test Error?” empirically demonstrates that K-fold CV
provides reasonable estimates of the expected test error Err (whereas it's not that
straightforward for the conditional error Err_T on a given training set T).

Corner case (Ch. 7.10.3 “Does Cross-Validation Really Work?”):
Consider a dataset with N = 20 samples in two equal-sized classes, and p = 500
quantitative features that are independent of the class labels:
the true error rate of any classifier is 50%.
The (flawed) argument: fitting to the entire training set, we can find a predictor that
separates the classes well; if we do 5-fold cross-validation, this same predictor should
split any 4/5ths and 1/5th of the data well too, and hence its cross-validation error
will be small (much less than 50%). Thus CV does not give an accurate estimate of error.
The argument ignores the fact that in cross-validation, the model must be completely
retrained for each fold: doing so, performance differs from fold to fold and the
average error is 0.5, as it should be (i.e. random guessing).
Take-away: the Random-Labels trick can be a useful sanity check for your CV pipeline
(see the sketch below).
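
A sketch of the Random-Labels check (illustrative setup); with shuffled labels a sound
CV pipeline should score at chance level:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
y_random = np.random.default_rng(0).permutation(y)  # break any feature-label link

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y_random, cv=5)
print(scores.mean())  # ~majority-class rate, far below the score on the real labels

scikit-learn also provides sklearn.model_selection.permutation_test_score,
which automates this check.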

References and Further Readings
[Article] “Why every statistician should know about cross-validation”, Rob J. Hyndman
(https://robjhyndman.com/hyndsight/crossvalidation/)
[Paper] “A survey of cross-validation procedures for model selection”
(DOI: 10.1214/09-SS054)
[Article] “The IID Violation and Robust Standard Errors”
(https://stat-analysis.netlify.app/the-iid-violation-and-robust-standard-errors.html)
[Article] “Non i.i.d. Data and Cross-Validation”
(https://inria.github.io/scikit-learn-mooc/python_scripts/cross_validation_time.html)

Thank you very much for your
kind attention
@leriomaggio
[email protected]
Valerio Maggio
Slides available at: bit.ly/evaluate-ml-models-pydata