Machine Learning and Performance Evaluation @ DataPhilly 2016

Every day, in scientific research and business applications, we rely on statistics and machine learning as support tools for predictive modeling. To model uncertainty, predict trends, and recognize patterns that may occur in the future, we have developed a vast library of tools for decision making. In other words, we have learned to take advantage of computers to replicate the real world, making intuitive decisions more quantitative, labeling unlabeled data, predicting trends, and ultimately trying to predict the future. Now, whether we are applying predictive modeling techniques to research or business problems, we want to make "good" predictions!

With modern machine learning libraries, choosing a machine learning algorithm and fitting a model to our training data has never been simpler. However, making sure that the model generalizes well to unseen data is still up to us, the machine learning practitioners and researchers. In this talk, we will discuss the two most important components of the various estimators of generalization performance: bias and variance. We will look at how to make the best use of the data at hand through proper (re)sampling, and how to pick appropriate performance metrics. Then, we will compare various techniques for algorithm selection and model selection to find the right tool and approach for the task at hand. In the context of the "bias-variance trade-off," we will go over potential weaknesses of common modeling techniques, and we will learn how to take uncertainty into account to build a predictive model that performs well on unseen data.

Sebastian Raschka

December 01, 2016

Transcript

  1. Wednesday, November 30, 2016, 6:30 PM to 7:00 PM. 441 N 5th St, Suite 301, Philadelphia, PA. Machine Learning and Performance Evaluation. Sebastian Raschka. DATAPHILLY
  2. Estimating the Performance of Predictive Models Why bother?

  3. ①  Generalization Performance ②  Model Selection ③  Algorithm Selection

  4. target y

  5. target y variance bias

  6. $\text{Bias} = E[\hat{\theta}] - \theta$ and $\text{Variance} = E[(\hat{\theta} - E[\hat{\theta}])^2]$, where $\hat{\theta}$ is the estimated value and $E[\hat{\theta}]$ its expected value. [Figure: quadrants contrasting Low Variance (Precise) vs. High Variance (Not Precise) and Low Bias (Accurate) vs. High Bias (Not Accurate); figure by Sebastian Raschka.]
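
As an illustrative aside, the two definitions above can be checked with a small simulation; the toy setup below (estimating the mean of a normal distribution from repeated small samples) is an assumption for demonstration, not an example from the slides.

```python
import numpy as np

# Toy simulation of estimator bias and variance (illustrative setup,
# not from the slides): estimate the mean of a normal distribution
# from many repeated small samples.
rng = np.random.RandomState(123)
true_theta = 5.0

estimates = []
for _ in range(10000):
    sample = rng.normal(loc=true_theta, scale=2.0, size=20)
    estimates.append(sample.mean())  # theta_hat for this sample

estimates = np.array(estimates)
bias = estimates.mean() - true_theta                     # E[theta_hat] - theta
variance = np.mean((estimates - estimates.mean()) ** 2)  # E[(theta_hat - E[theta_hat])^2]
print(f"bias: {bias:.4f}  variance: {variance:.4f}")
```
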
  7. Performance Estimates – Absolute vs Relative

  8. ①  Generalization Performance ②  Model Selection ③  Algorithm Selection

  9. Sources of Bias and Variance TRAIN TRAIN TEST

  10. TRAIN TEST

  11. TRAIN TEST Pessimistic Bias

  12. * SoftMax Classifier on a small MNIST subset

  13. TRAIN TEST

  14. TRAIN TEST Pessimistic Bias

  15. TRAIN TEST Pessimistic Bias Variance

  16. [Figure: 70% Train / 30% Test splits for n=1000 and n=100; Samples 1, 2, and 3 from the real-world distribution illustrate resampling.]
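
A minimal sketch of such a 70/30 holdout split, assuming scikit-learn and the Iris data (the dataset choice here is only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 70/30 holdout split; stratify keeps the class proportions of y
# roughly equal in the training and test subsets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```
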
  17. * 3-NN on Iris dataset

  18. * 3-NN on Iris dataset
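
A rough way to reproduce the spirit of these plots is to repeat the holdout split with different random seeds and watch the accuracy estimate fluctuate; the number of repetitions and the seeds below are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Repeat the 70/30 holdout split with different random seeds to see how
# much the accuracy estimate of a 3-NN classifier on Iris varies.
X, y = load_iris(return_X_y=True)
accuracies = []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

print(f"accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```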

  19. K-fold cross-validation. [Figure: K = 5 iterations (folds); in each iteration a different fold serves as the validation fold and the remaining folds as the training folds, yielding Performance$_1$ through Performance$_5$.] The overall estimate is $\text{Performance} = \frac{1}{5}\sum_{i=1}^{5}\text{Performance}_i$.
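
A minimal k-fold cross-validation sketch with K = 5, reusing the 3-NN/Iris example from the earlier slides via scikit-learn's cross_val_score:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation: each fold serves once as the validation fold,
# and the final estimate is the average of the five fold performances.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)
print("mean performance:", scores.mean())
```
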
  20. [Figure: within each of the K iterations, the learning algorithm with given hyperparameter values is fit to the training-fold data and labels to produce a model; the model's predictions on the validation-fold data are compared with the validation-fold labels to obtain that fold's performance, and the K fold performances are averaged as on the previous slide, $\text{Performance} = \frac{1}{5}\sum_{i=1}^{5}\text{Performance}_i$.]
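
Spelling the per-fold procedure out by hand makes the diagram concrete; the candidate values of k for the k-NN model below are assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# For each candidate hyperparameter value, fit on the training folds and
# score on the held-out validation fold, then average over the K folds.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

for k in (1, 3, 5, 7):
    fold_scores = []
    for train_idx, valid_idx in cv.split(X, y):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[valid_idx], y[valid_idx]))
    print(f"k={k}: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```
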
  21. [Figure: logistic regression model. Inputs $x_1, \ldots, x_m$ with weights $w_0, w_1, \ldots, w_m$ feed a net input function (weighted sum $\Sigma$), followed by the logistic (sigmoid) function and a quantizer that outputs the predicted class label $\hat{y}$; the logistic cost compares predictions with the true class label $y$ and updates the model parameters $w$ over a number of iterations, with $\lambda$ as the L2-regularization strength.]
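
A hedged sketch of an L2-regularized logistic classifier, using scikit-learn instead of the from-scratch model pictured on the slide; note that scikit-learn parameterizes the penalty as C = 1/lambda:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# L2-regularized logistic regression; scikit-learn expresses the
# regularization strength as C = 1/lambda, so a larger lambda means a
# stronger penalty on the weights.
X, y = load_iris(return_X_y=True)
for lam in (0.001, 0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
    mean_acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"lambda={lam:g}: mean CV accuracy = {mean_acc:.3f}")
```
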
  22. The law of parsimony: the 1-standard-error method

  23. The law of parsimony: the 1-standard-error method
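
A rough sketch of the 1-standard-error method, reusing the regularization-strength example from above (the candidate lambda values and the 10-fold setup are assumptions): among all settings whose mean cross-validation score lies within one standard error of the best mean, prefer the most parsimonious, i.e. most strongly regularized, model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 1-standard-error rule: among all lambda values whose mean CV score is
# within one standard error of the best mean, pick the largest lambda
# (the most parsimonious, most strongly regularized model).
X, y = load_iris(return_X_y=True)
stats = {}
for lam in (0.001, 0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10)
    stats[lam] = (scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores)))

best_mean, best_se = max(stats.values(), key=lambda t: t[0])
eligible = [lam for lam, (mean, _) in stats.items() if mean >= best_mean - best_se]
print("lambda chosen by the 1-SE rule:", max(eligible))
```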

  24. None
  25. K-fold for model selection, step by step. [Figure, step 1: split the data and labels into training data/labels and test data/labels.]

  26. [Figure, step 2: evaluate the learning algorithm with each set of candidate hyperparameter values on the training data and labels (k-fold cross-validation), recording a performance estimate for each.]

  27. [Figure, step 3: refit the learning algorithm with the best hyperparameter values on the complete training data and labels to obtain a model.]

  28. [Figure, step 4: use that model to predict the test data and compare the predictions with the test labels to estimate the generalization performance.]

  29. [Figure, step 5: fit the final model with the best hyperparameter values on all of the data and labels.]
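
The five steps above can be sketched with GridSearchCV; the dataset, estimator, and hyperparameter grid are placeholders rather than the ones used in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1) split the data and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# 2) + 3) k-fold cross-validation over the hyperparameter grid on the
#         training data; refit=True retrains the best setting on the
#         complete training set afterwards
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7]},
                    cv=5, refit=True)
grid.fit(X_train, y_train)

# 4) estimate generalization performance once, on the untouched test set
print("best hyperparameters:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))

# 5) fit the final model with the best hyperparameters on all data and labels
final_model = KNeighborsClassifier(**grid.best_params_).fit(X, y)
```

Because the test set is touched exactly once, in step 4, the reported accuracy is not biased by the hyperparameter search.
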
  30. Nested cross-validation for algorithm selection. [Figure: an outer loop of 5 folds, each holding out an outer validation fold; within each outer training fold, an inner 2-fold loop (e.g. $\frac{1}{2}\sum_{j=1}^{2}\text{Performance}_{5,j}$) selects the best algorithm and model; the outer-fold performances are then averaged into the overall estimate, $\text{Performance} = \frac{1}{10}\sum_{i=1}^{10}\text{Performance}_i$.]
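
A compact nested-cross-validation sketch in the spirit of this figure, with 5 outer and 2 inner folds; the two candidate algorithms and their grids are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Nested cross-validation: the inner loop (cv=2) tunes hyperparameters,
# the outer loop (cv=5) estimates generalization performance, so each
# algorithm is compared on data its tuning never touched.
X, y = load_iris(return_X_y=True)

candidates = {
    "k-NN": GridSearchCV(KNeighborsClassifier(),
                         {"n_neighbors": [1, 3, 5, 7]}, cv=2),
    "decision tree": GridSearchCV(DecisionTreeClassifier(random_state=1),
                                  {"max_depth": [1, 2, 3, 4, None]}, cv=2),
}

for name, inner_search in candidates.items():
    outer_scores = cross_val_score(inner_search, X, y, cv=5)
    print(f"{name}: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Because hyperparameter tuning happens strictly inside each outer training fold, the outer average estimates the performance of the whole tuning procedure, not of a single already-tuned model.
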
  31. Beyond Performance Metrics. Ideal features are: discriminatory, salient, invariant.
  32. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ”Why

    Should I Trust You?”: Explaining the Predictions of Any Classifier. In Knowledge Discovery and Data Mining (KDD).
  33. THANK YOU!

  34. https://github.com/rasbt mail@sebastianraschka.com http://sebastianraschka.com @rasbt