Slide 1

Slide 1 text

Evaluating Machine Learning Models @leriomaggio [email protected] Valerio Maggio, PhD Slides available at: bit.ly/evaluate-ml-models-pydata

Slide 2

Slide 2 text

features Data Domain Model objects Output Task Recap: Machine Learning Overview Adapted from: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012

Slide 3

Slide 3 text

features Data Domain Model Learning algorithm objects Output Training Data Learning problem Task Recap: Machine Learning Overview Adapted from: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012

Slide 4

Slide 4 text

features Data Domain Model Learning algorithm objects Output Training Data Learning problem Task Recap: Machine Learning Overview Adapted from: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012. Supervised Learning: {(Xi, yi), i = 1, …, N}; Unsupervised Learning: {Xi, i = 1, …, N}

Slide 5

Slide 5 text

Aim #1: Provide a description of the basic components that are required to carry out a Machine Learning Experiment (see next slide) Basic components ≠ Recipes Learning Objectives

Slide 6

Slide 6 text

Aim #1: Provide a description of the basic components that are required to carry out a Machine Learning Experiment (see next slide) Basic components ≠ Recipes Aim #2: Give you some appreciation of the importance of choosing measurements that are appropriate for your particular experiment e.g. (just) Accuracy may not be the right metric to use! Learning Objectives

Slide 7

Slide 7 text

ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D) Machine Learning Experiment

Slide 8

Slide 8 text

ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D) Common Examples of RQs are: How does model m perform on data from domain D? Much harder: how would m (also) perform on data from D2 (≠ D)? Machine Learning Experiment

Slide 9

Slide 9 text

ML Experiment: Research Question (RQ); Learning Algorithm (A, m); Dataset[s] (D) Common Examples of RQs are: How does model m perform on data from domain D? Much harder: how would m (also) perform on data from D2 (≠ D)? Which of these models m1, m2, … mk from A has the best performance on data from D? Which of these learning algorithms gives the best model on data from D? Machine Learning Experiment

Slide 10

Slide 10 text

What to measure? How to measure it? Machine Learning Experiment In order to set up our experimental framework we need to investigate:

Slide 11

Slide 11 text

What to measure? How to measure it? How to interpret the results? (next step) i.e. how robust and reliable are the results? Machine Learning Experiment In order to set up our experimental framework we need to investigate:

Slide 12

Slide 12 text

WHAT to measure?

Slide 13

Slide 13 text

(Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix (In clockwise order…)

Slide 14

Slide 14 text

true positive (TP): Positive samples correctly predicted as Positive (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix (In clockwise order…)

Slide 15

Slide 15 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix (In clockwise order…)

Slide 16

Slide 16 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] P = TP + FN (In clockwise order…)

Slide 17

Slide 17 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] P = TP + FN (In clockwise order…)

Slide 18

Slide 18 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative false positive (FP): Negative samples wrongly predicted as Positive (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] P = TP + FN (In clockwise order…)

Slide 19

Slide 19 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative false positive (FP): Negative samples wrongly predicted as Positive condition negative (N): # real negative cases in the data (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] N[egative] N = FP + TN P = TP + FN (In clockwise order…)

Slide 20

Slide 20 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative false positive (FP): Negative samples wrongly predicted as Positive condition negative (N): # real negative cases in the data (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N = TP + TN + FP + FN (In clockwise order…)

Slide 21

Slide 21 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative false positive (FP): Negative samples wrongly predicted as Positive condition negative (N): # real negative cases in the data (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N = TP + TN + FP + FN (In clockwise order…) Portion of Positives Pos = P / T

Slide 22

Slide 22 text

true positive (TP): Positive samples correctly predicted as Positive false negative (FN): Positive samples wrongly predicted as Negative condition positive (P): # of real positive cases in the data true negative (TN): Negative samples correctly predicted as Negative false positive (FP): Negative samples wrongly predicted as Positive condition negative (N): # real negative cases in the data (Binary) Classification Problem Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N = TP + TN + FP + FN (In clockwise order…) Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos

Slide 23

Slide 23 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos (Main) PRIMARY Metrics

Slide 24

Slide 24 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) TPR = TP / P True-Positive Rate, Sensitivity, 
 RECALL P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos (Main) PRIMARY Metrics

Slide 25

Slide 25 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) TPR = TP / P True-Positive Rate, Sensitivity, 
 RECALL True-Negative Rate, Specificity, NEGATIVE RECALL TNR = TN / N P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos (Main) PRIMARY Metrics

Slide 26

Slide 26 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) TPR = TP / P True-Positive Rate, Sensitivity, 
 RECALL True-Negative Rate, Specificity, NEGATIVE RECALL TNR = TN / N Confidence, PRECISION PREC = TP / (TP + FP) P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos (Main) PRIMARY Metrics

Slide 27

Slide 27 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) TPR = TP / P True-Positive Rate, Sensitivity, 
 RECALL True-Negative Rate, Specificity, NEGATIVE RECALL TNR = TN / N Confidence, PRECISION PREC = TP / (TP + FP) F1 Score F1 = 2 * (PREC * TPR) / (PREC + TPR) Memo: Harmonic Mean of Prec. & Rec. P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos (Popular) SECONDARY Metrics (Main) PRIMARY Metrics

Slide 28

Slide 28 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Classification Metrics Without any loss of generality, let’s consider a Binary Classification Problem 
 (we’re still in the Supervised learning territory) TPR = TP / P True-Positive Rate, Sensitivity, 
 RECALL True-Negative Rate, Specificity, NEGATIVE RECALL TNR = TN / N Confidence, PRECISION PREC = TP / (TP + FP) F1 Score F1 = 2 * (PREC * TPR) / (PREC + TPR) Memo: Harmonic Mean of Prec. & Rec. P[ositive] N[egative] N = FP + TN P = TP + FN T = P + N Portion of Positives Pos = P / T Portion of Negatives Neg = N / T = 1 - Pos ACC = (TP + TN) / (P + N) = POS*TPR + (1 - POS)*TNR ACCURACY (Popular) SECONDARY Metrics (Main) PRIMARY Metrics
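Not part of the deck: a minimal sketch of how the primary metrics above (and the secondary ones derived from them) can be computed in Python. The toy labels and the use of scikit-learn's sklearn.metrics helpers are my own illustrative choices.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])   # toy ground truth
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])   # toy predictions

# Primary metrics, straight from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
P, N = tp + fn, fp + tn
tpr = tp / P                # Recall / Sensitivity
tnr = tn / N                # Specificity / Negative Recall
prec = tp / (tp + fp)       # Precision / Confidence

# Secondary metrics, derived from the primary ones
f1 = 2 * (prec * tpr) / (prec + tpr)
acc = (tp + tn) / (P + N)

print(tpr, tnr, prec, f1, acc)
# Same values via the ready-made helpers
print(f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))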

Slide 29

Slide 29 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Matthews Correlation Coefficient (MCC) Let’s introduce the last metric we will explore today 
 (still derived from the confusion matrix) P[ositive] N[egative] Matthews Correlation Coefficient MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]

Slide 30

Slide 30 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Matthews Correlation Coefficient (MCC) Let’s introduce the last metric we will explore today 
 (still derived from the confusion matrix) P[ositive] N[egative] Matthews Correlation Coefficient MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)] The Good

Slide 31

Slide 31 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Matthews Correlation Coefficient (MCC) Let’s introduce the last metric we will explore today 
 (still derived from the confusion matrix) P[ositive] N[egative] Matthews Correlation Coefficient MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)] The Good The Bad

Slide 32

Slide 32 text

True Positive 
 TP True Negative 
 TN False Negative 
 FN False Positive 
 FP True Class Predicted Class Positive Negative Positive Negative Confusion Matrix Matthews Correlation Coefficient (MCC) Let’s introduce the last metric we will explore today 
 (still derived from the confusion matrix) P[ositive] N[egative] Matthews Correlation Coefficient MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)] The Good The Bad The Ugly 🙃
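Not on the slide: a small sketch comparing MCC computed from the formula above with scikit-learn's built-in matthews_corrcoef, on the same toy labels as before (an illustrative choice, not the speaker's code).

import math
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(numerator / denominator)            # MCC from the formula above
print(matthews_corrcoef(y_true, y_pred))  # same value from scikit-learn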

Slide 33

Slide 33 text

We use some data for evaluation as representative of any future data. Nonetheless, the model may need to operate in a different operating context, e.g. a different class distribution! Is Accuracy a Good Idea? ACC = POS*TPR + (1 - POS)*TNR

Slide 34

Slide 34 text

We use some data for evaluation as representative of any future data. Nonetheless, the model may need to operate in a different operating context, e.g. a different class distribution! We could treat ACC on future data as a random variable, and take its expectation 
 (assuming a uniform prob. distribution over the portion of positives) E[ACC] = E[POS]*TPR + E[1 - POS]*TNR = TPR/2 + TNR/2 = AVG-REC[1] Is Accuracy a Good Idea? ACC = POS*TPR + (1 - POS)*TNR [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012

Slide 35

Slide 35 text

We use some data for evaluation as representative of any future data. Nonetheless, the model may need to operate in a different operating context, e.g. a different class distribution! We could treat ACC on future data as a random variable, and take its expectation 
 (assuming a uniform prob. distribution over the portion of positives) E[ACC] = E[POS]*TPR + E[1 - POS]*TNR = TPR/2 + TNR/2 = AVG-REC[1] [On the other hand] If we choose ACC as the evaluation measure, we are making the implicit assumption that the class distribution in the test data is representative of the operating context Is Accuracy a Good Idea? ACC = POS*TPR + (1 - POS)*TNR [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012

Slide 36

Slide 36 text

TPR = 0.75; TNR = 1.00 ACC = 0.8 AVG-REC = 0.88 Is Accuracy a Good Idea? 60 20 20 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Examples Model m1 on D 75 10 5 10 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Model m2 on D TPR = 0.94; TNR = 0.5 ACC = 0.85 AVG-REC = 0.72 [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012

Slide 37

Slide 37 text

TPR = 0.75; TNR = 1.00 ACC = 0.8 AVG-REC = 0.88 Is Accuracy a Good Idea? 60 20 20 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Examples Model m1 on D 75 10 5 10 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Model m2 on D TPR = 0.94; TNR = 0.5 ACC = 0.85 AVG-REC = 0.72 [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012 Mmm… not really
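A quick numeric check of the two confusion matrices above (not part of the slides): accuracy prefers m2, while AVG-REC, the expected accuracy when the class distribution is unknown, prefers m1.

def summarise(tp, fn, fp, tn):
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    avg_rec = (tpr + tnr) / 2
    return tpr, tnr, acc, avg_rec

print("m1 on D:", summarise(tp=60, fn=20, fp=0, tn=20))   # TPR=0.75, TNR=1.00, ACC=0.80, AVG-REC≈0.88
print("m2 on D:", summarise(tp=75, fn=5, fp=10, tn=10))   # TPR≈0.94, TNR=0.50, ACC=0.85, AVG-REC≈0.72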

Slide 38

Slide 38 text

Is F-Measure (F1) a Good Idea? [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012 TPR = TP / P RECALL PRECISION PREC = TP / (TP + FP) F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 75 10 5 10 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Model m2 on D PREC = 75 / 85 = 0.88; 
 TPR = 75 / 80 = 0.94 F1 = 0.91 ACC = 0.85

Slide 39

Slide 39 text

Is F-Measure (F1) a Good Idea? [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012 TPR = TP / P RECALL PRECISION PREC = TP / (TP + FP) F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 75 10 5 10 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Model m2 on D PREC = 75 / 85 = 0.88; 
 TPR = 75 / 80 = 0.94 F1 = 0.91 ACC = 0.85 75 910 5 10 True Class Predicted Class Positive Negative Positive Negative T=1000 P=80 N=920 Model m2 on D2 PREC = 75 / 85 = 0.88; 
 TPR = 75 / 80 = 0.94 F1 = 0.91 ACC = 0.99

Slide 40

Slide 40 text

Is F-Measure (F1) a Good Idea? [1]: “Machine Learning. The art and science of Algorithms that make sense of Data”, P. Flach 2012 TPR = TP / P RECALL PRECISION PREC = TP / (TP + FP) F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 75 10 5 10 True Class Predicted Class Positive Negative Positive Negative T=100 P=80 N=20 Model m2 on D PREC = 75 / 85 = 0.88; 
 TPR = 75 / 80 = 0.94 F1 = 0.91 ACC = 0.85 75 910 5 10 True Class Predicted Class Positive Negative Positive Negative T=1000 P=80 N=920 Model m2 on D2 PREC = 75 / 85 = 0.88; 
 TPR = 75 / 80 = 0.94 F1 = 0.91 ACC = 0.99 F1 to be preferred in domains where negatives abound 
 (and are not the relevant class)
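A quick check of the D vs. D2 comparison above (not part of the slides): F1 ignores true negatives, so it is identical on both datasets, while accuracy is inflated by the 900 extra negatives in D2.

def prec_rec_f1_acc(tp, fn, fp, tn):
    prec, tpr = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * (prec * tpr) / (prec + tpr)
    acc = (tp + tn) / (tp + fn + fp + tn)
    return prec, tpr, f1, acc

print("m2 on D :", prec_rec_f1_acc(tp=75, fn=5, fp=10, tn=10))    # F1≈0.91, ACC=0.85
print("m2 on D2:", prec_rec_f1_acc(tp=75, fn=5, fp=10, tn=910))   # F1≈0.91, ACC≈0.99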

Slide 41

Slide 41 text

Is F-Measure (F1) a Good Idea? F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 95 0 5 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=100 N=0 Model m2 on D PREC = 95 / 95 = 1.00; 
 TPR = 95 / 100 = 0.95 F1 = 0.974 ACC = 0.95 MCC = UNDEFINED MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]

Slide 42

Slide 42 text

Is F-Measure (F1) a Good Idea? F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 95 0 5 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=100 N=0 Model m2 on D PREC = 95 / 95 = 1.00; 
 TPR = 95 / 100 = 0.95 F1 = 0.974 ACC = 0.95 MCC = UNDEFINED 90 4 5 1 True Class Predicted Class Positive Negative Positive Negative T=100 P=95 N=5 Model m2 on D PREC = 90 / 91 = 0.98; 
 TPR = 90 / 95 = 0.95 F1 = 0.952 ACC = 0.91 MCC = 0.14 MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]

Slide 43

Slide 43 text

Is F-Measure (F1) a Good Idea? F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 95 0 5 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=100 N=0 Model m2 on D PREC = 95 / 95 = 1.00; 
 TPR = 95 / 100 = 0.95 F1 = 0.974 ACC = 0.95 MCC = UNDEFINED 90 4 5 1 True Class Predicted Class Positive Negative Positive Negative T=100 P=95 N=5 Model m2 on D PREC = 90 / 91 = 0.98; 
 TPR = 90 / 95 = 0.95 F1 = 0.952 ACC = 0.91 MCC = 0.14 MCC to be preferred in general (when predictions on all classes count!) MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]

Slide 44

Slide 44 text

Is F-Measure (F1) a Good Idea? F1 Score (Harmonic Mean) F1 = 2 * (PREC * TPR) / (PREC + TPR) 95 0 5 0 True Class Predicted Class Positive Negative Positive Negative T=100 P=100 N=0 Model m2 on D PREC = 95 / 95 = 1.00; 
 TPR = 95 / 100 = 0.95 F1 = 0.974 ACC = 0.95 MCC = UNDEFINED 90 4 5 1 True Class Predicted Class Positive Negative Positive Negative T=100 P=95 N=5 Model m2 on D PREC = 90 / 91 = 0.98; 
 TPR = 90 / 95 = 0.95 F1 = 0.952 ACC = 0.91 MCC = 0.14 MCC to be preferred in general (when predictions on all classes count!) ACC = (TP + TN) / (P + N) ACCURACY MCC = [(TP * TN) - (FP * FN)] / sqrt[(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)]
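A sketch (not the speaker's code) of the degenerate "95 0 5 0" case above: the test data contains no Negative samples (N = 0), F1 and ACC look excellent, but MCC's denominator collapses to zero, so MCC is undefined (scikit-learn's matthews_corrcoef conventionally returns 0 in such cases).

tp, tn, fn, fp = 95, 0, 5, 0                 # the "95 0 5 0" matrix above (P=100, N=0)
prec, tpr = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * (prec * tpr) / (prec + tpr)
acc = (tp + tn) / (tp + tn + fp + fn)
den = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)

print(f"F1 = {f1:.3f}, ACC = {acc:.2f}")     # F1 ≈ 0.974, ACC = 0.95
print("MCC denominator =", den)              # 0 -> MCC undefined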

Slide 45

Slide 45 text

Be aware that not all metrics are the same, so choose consciously e.g. Choose F1 where negatives abound (and are NOT relevant for the task) e.g. Choose MCC when predictions on all classes count! [Practical] Don’t just record ACC; instead keep track of the main Primary Metrics, so (other) Secondary metrics can be derived Take away Lessons

Slide 46

Slide 46 text

HOW to measure?

Slide 47

Slide 47 text

Use the “Data”, Luke Evaluating Supervised Learning models might appear straightforward: 
 (1) train the model; 
 (2) calculate how well it performs using some appropriate metric (e.g. accuracy, squared error) Learning algorithm Training Data Results

Slide 48

Slide 48 text

Use the “Data”, Luke Evaluating Supervised Learning models might appear straightforward: 
 (1) train the model; 
 (2) calculate how well it performs using some appropriate metric (e.g. accuracy, squared error) Learning algorithm Training Data Results FLAWED Our goal is to evaluate how well the model does on data 
 it has never seen before (out-of-sample error) Overly optimistic estimate! (a.k.a. in-sample error) Ch 7.4 Optimism of the Training Error rate

Slide 49

Slide 49 text

Train-Test Partitions Hold-out Evaluation Dataset Training Set Test Set 75% 25% TRAIN EVALUATE Hold-out => This data must be put off to the side, to be used only for evaluating performance

Slide 50

Slide 50 text

Train-Test Partition (code) Hold-out Evaluation Dataset Training Set Test Set 75% 25%
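The code shown on this slide is not in the transcript; a minimal hold-out split along the lines described (75% train / 25% test) might look like the sketch below. The dataset (load_breast_cancer), the classifier and the random_state are illustrative assumptions, not taken from the slides.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data; the test partition is used only for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("Test MCC:", matthews_corrcoef(y_test, model.predict(X_test)))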

Slide 51

Slide 51 text

Train-Test Partition (code) Hold-out Evaluation Dataset Training Set Test Set 75% 25%

Slide 52

Slide 52 text

Train-Test Partition (code) Hold-out Evaluation Dataset Training Set Test Set 75% 25% Weak: performance highly dependent on the selected samples in the test partition

Slide 53

Slide 53 text

Idea: We could generate several test partitions, and use them to assess the model. More systematically, what we could do instead is: Introducing Cross-Validation

Slide 54

Slide 54 text

Idea: We could generate several test partitions, and use them to assess the model. More systematically, what we could do instead is: Introducing Cross-Validation Dataset Pk P1 P2 P3 Pk-1 … 1. Randomly Split D in (~equally-sized) k Partitions (P) - called folds

Slide 55

Slide 55 text

Idea: We could generate several test partitions, and use them to assess the model. More systematically, what we could do instead is: Introducing Cross-Validation K-fold 
 Cross-Validation Dataset Pk P1 P2 P3 Pk-1 … Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 … Test Training Legend 1. Randomly Split D in (~equally-sized) k Partitions (P) - called folds 2. (In turn, k-times) 
 2.a fit the model on k-1 Partitions (combined); 2.b evaluate the prediction error on the remaining Pk

Slide 56

Slide 56 text

Idea: We could generate several test partitions, and use them to assess the model. More systematically, what we could do instead is: Introducing Cross-Validation K-fold 
 Cross-Validation Dataset Pk P1 P2 P3 Pk-1 … Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 … Test Training Legend 1. Randomly Split D in (~equally-sized) k Partitions (P) - called folds 2. (In turn, k-times) 
 2.a fit the model on k-1 Partitions (combined); 2.b evaluate the prediction error on the remaining Pk m1 m2 m3 … mk

Slide 57

Slide 57 text

Idea: We could generate several test partitions, and use them to assess the model. More systematically, what we could do instead is: Introducing Cross-Validation K-fold 
 Cross-Validation Dataset Pk P1 P2 P3 Pk-1 … Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 Pk P1 P2 P3 Pk-1 … Test Training Legend 1. Randomly Split D in (~equally-sized) k Partitions (P) - called folds 2. (In turn, k-times) 
 2.a fit the model on k-1 Partitions (combined); 2.b evaluate the prediction error on the remaining Pk m1 m2 m3 … mk CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)
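A sketch of the CV(A, D) estimate above: K models, one score per held-out fold, averaged at the end. The dataset, classifier and accuracy metric are illustrative assumptions (none of them is specified on the slide); cross_val_score runs the same loop in one call.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
algorithm = LogisticRegression(max_iter=5000)   # the learning algorithm A

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_i = algorithm.fit(X[train_idx], y[train_idx])        # fit m_i on the k-1 training folds
    scores.append(m_i.score(X[test_idx], y[test_idx]))     # metric(m_i, P_i) on the held-out fold
print("CV(A, D) =", np.mean(scores))

# The same loop in one call
print(cross_val_score(algorithm, X, y, cv=5, scoring="accuracy").mean())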

Slide 58

Slide 58 text

REMEMBER: the deal with the test partition is always the same! Test folds must remain UNSEEN to the model during training Cross-Validation: Tips and Rules CV for Learning Algorithm A on Dataset D CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)

Slide 59

Slide 59 text

REMEMBER: the deal with the test partition is always the same! Test folds must remain UNSEEN to the model during training K can be (~) any number in [1, N] K=5 (Breiman and Spector, 1992); K=10 (Kohavi, 1995); 
 K = N —> LOO (Leave-One-Out) Cross-Validation: Tips and Rules CV for Learning Algorithm A on Dataset D CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)

Slide 60

Slide 60 text

REMEMBER: the deal with the test partition is always the same! Test folds must remain UNSEEN to the model during training K can be (~) any number in [1, N] K=5 (Breiman and Spector, 1992); K=10 (Kohavi, 1995); 
 K = N —> LOO (Leave-One-Out) Cross-Validation could be Repeated, changing the random seed (although increasingly violating the IID assumption) Cross-Validation: Tips and Rules CV for Learning Algorithm A on Dataset D CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)

Slide 61

Slide 61 text

REMEMBER: the deal with the test partition is always the same! Test folds must remain UNSEEN to the model during training K can be (~) any number in [1, N] K=5 (Breiman and Spector, 1992); K=10 (Kohavi, 1995); 
 K = N —> LOO (Leave-One-Out) Cross-Validation could be Repeated, changing the random seed (although increasingly violating the IID assumption) Cross-Validation can be Stratified, i.e. maintain ~ the same class distribution among training and testing folds, e.g. for Imbalanced Datasets and/or if we expect the learning algorithm to be sensitive to the class distribution Cross-Validation: Tips and Rules CV for Learning Algorithm A on Dataset D CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)

Slide 62

Slide 62 text

REMEMBER: the deal with the test partition is always the same! Test folds must remain UNSEEN to the model during training K can be (~) any number in [1, N] K=5 (Breiman and Spector, 1992); K=10 (Kohavi, 1995); 
 K = N —> LOO (Leave-One-Out) Cross-Validation could be Repeated, changing the random seed (although increasingly violating the IID assumption) Cross-Validation can be Stratified, i.e. maintain ~ the same class distribution among training and testing folds, e.g. for Imbalanced Datasets and/or if we expect the learning algorithm to be sensitive to the class distribution Cross-Validation: Tips and Rules bit.ly/sklearn-model-selection CV for Learning Algorithm A on Dataset D CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi)
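The variants listed above are all available in sklearn.model_selection (see the bit.ly link); a small sketch, with the dataset and classifier being illustrative choices of mine:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (LeaveOneOut, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

cv_strategies = {
    "5-fold, stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "5-fold, stratified, repeated 10x": RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
    "Leave-One-Out (K = N)": LeaveOneOut(),
}
for name, cv in cv_strategies.items():
    print(name, cross_val_score(clf, X, y, cv=cv).mean())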

Slide 63

Slide 63 text

A common mistake is to use cross-validation to do model selection (a.k.a. Hyper-parameter selection) This is methodologically wrong, as param-tuning should be part of the training (so test data shouldn’t be used at all!) CV for Model Selection?

Slide 64

Slide 64 text

A common mistake is to use cross-validation to do model selection (a.k.a. Hyper-parameter selection) This is methodologically wrong, as param-tuning should be part of the training (so test data shouldn’t be used at all!) A methodologically sound option is to perform what’s referred to as “Internal Cross Validation” CV for Model Selection? Dataset Training Set Test Set Training Set Validation CV Model selection + Retrain on whole Training set with m*
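Not in the transcript: one common way to implement the "Internal Cross Validation" scheme sketched above with scikit-learn. GridSearchCV, the SVC classifier and the parameter grid are illustrative assumptions; the key point is that hyper-parameter tuning only ever sees the training set, and the test set is used exactly once at the end.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=5,                        # internal CV: splits the training set only
)
search.fit(X_train, y_train)     # model selection + refit of m* on the whole training set
print("Selected hyper-parameters:", search.best_params_)
print("Held-out test score:", search.score(X_test, y_test))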

Slide 65

Slide 65 text

In 1996 David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. No Free Lunch Theorem

Slide 66

Slide 66 text

In 1996 David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way is to make some reasonable assumptions about the data and evaluate only a few reasonable models. No Free Lunch Theorem

Slide 67

Slide 67 text

In 1996 David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way is to make some reasonable assumptions about the data and evaluate only a few reasonable models. CV provides a robust framework to do so! No Free Lunch Theorem

Slide 68

Slide 68 text

https://github.com/JesperDramsch/ml-for-science-reproducibility-tutorial

Slide 69

Slide 69 text

Inflated Cross-Validation?

Slide 70

Slide 70 text

Inflated Cross-Validation?

Slide 71

Slide 71 text

Inflated Cross-Validation? Using features which have no connection with class labels, we managed to predict the correct class in about 60% of cases, 10% better than random guessing! Can you spot where we cheated? Whoa!

Slide 72

Slide 72 text

Inflated Cross-Validation? Using features which have no connection with class labels, we managed to predict the correct class in about 60% of cases, 10% better than random guessing! Can you spot where we cheated? Whoa! Sampling Bias 
 (or selection Bias)

Slide 73

Slide 73 text

Does Cross-Validation Really Work? CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi) CV for Learning Algorithm A on Dataset D Ch 7.12 Conditional or Expected Test Error? Empirically demonstrates that K-fold CV provides reasonable estimates of the expected Test error Err 
 (whereas it’s not that straightforward for the Conditional Error ErrT on a given training set T) Ch 7.10.3 Does Cross-Validation Really Work?

Slide 74

Slide 74 text

Dataset with N = 20 samples in two equal-sized classes, and p = 500 quantitative features that are independent of the class labels. The true error rate of any classifier is 50%. Fitting a predictor to the entire training set (e.g. choosing the single feature that best fits the labels), then if we do 5-fold cross-validation, this same predictor should split any 4/5ths and 1/5th of the data well too, and hence its cross-validation error will be small (much less than 50%). Thus CV does not give an accurate estimate of error. Does Cross-Validation Really Work? CV(A, D) = (1/K) Σ_{i=1..K} Âi, with Âi = metric(mi, Pi) CV for Learning Algorithm A on Dataset D Ch 7.12 Conditional or Expected Test Error? Empirically demonstrates that K-fold CV provides reasonable estimates of the expected Test error Err 
 (whereas it’s not that straightforward for the Conditional Error ErrT on a given training set T) Corner Case Ch 7.10.3 Does Cross-Validation Really Work?
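A sketch reproducing the corner case described above (my own code, not the notebook behind the "Inflated Cross-Validation" slides): with 500 noise features and 20 samples, selecting the most label-correlated features on the full dataset before cross-validating leaks information and inflates the CV score; refitting the selection inside each training fold (via a Pipeline) brings it back towards chance.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))    # features independent of the labels
y = np.repeat([0, 1], 10)         # two equal-sized classes -> true error rate is 50%

clf = KNeighborsClassifier(n_neighbors=1)

# WRONG: feature selection has already seen the test folds -> optimistic estimate
X_cheat = SelectKBest(f_classif, k=10).fit_transform(X, y)
print("selection outside CV:", cross_val_score(clf, X_cheat, y, cv=5).mean())

# RIGHT: selection is re-fitted on each training fold only -> close to 0.5
honest = make_pipeline(SelectKBest(f_classif, k=10), clf)
print("selection inside CV :", cross_val_score(honest, X, y, cv=5).mean())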

Slide 75

Slide 75 text

Does Cross-Validation Really Work? Ch 7.10.3 Does Cross-Validation Really Work? The argument has ignored the fact that in cross-validation, the model must be completely retrained for each fold The Random Labels trick can be a useful sanity check for your CV pipeline Different Performance Avg. Error = 0.5 as it should be! 
 (i.e. Random Guessing) Take Aways
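A sketch of the "Random Labels" sanity check mentioned above (assumptions: the same synthetic noise data as before and a leak-free Pipeline; scikit-learn also ships permutation_test_score for a ready-made version): shuffle the labels, re-run the whole CV pipeline, and the average score should drop to chance level; if it stays high, something in the pipeline is leaking.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 500))
y = np.repeat([0, 1], 10)

pipeline = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier(n_neighbors=1))

# Re-run the full CV pipeline on 20 random shufflings of the labels
scores = [cross_val_score(pipeline, X, rng.permutation(y), cv=5).mean() for _ in range(20)]
print("avg. accuracy on shuffled labels:", np.mean(scores))   # ~0.5, i.e. random guessing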

Slide 76

Slide 76 text

[Article] Why every statistician should know about cross-validation (https://robjhyndman.com/hyndsight/crossvalidation/) [Paper] A survey of cross-validation procedures for model selection, DOI: 10.1214/09-SS054 [Article] IID Violation and Robust Standard Errors (https://stat-analysis.netlify.app/the-iid-violation-and-robust-standard-errors.html) [Article] Non i.i.d. Data and Cross-Validation (https://inria.github.io/scikit-learn-mooc/python_scripts/cross_validation_time.html) References and Further Readings

Slide 77

Slide 77 text

Thank you very much for your kind attention @leriomaggio [email protected] Valerio Maggio Slides available at: bit.ly/evaluate-ml-models-pydata