Reproducibility and Selection Bias in Learning: when just cross validation is not enough

Reproducibility and Selection Bias in Machine Learning: When just Cross Validation is not enough!
Valerio Maggio @leriomaggio

Reproducibility Selection Bias

Reproducibility (the ability to recompute results) and replicability (i.e. the chances that other experimenters will achieve consistent results) are among the main beliefs of the scientific method.

1. Recompute

The scientific paper is obsolete

2. Replicate

Replicability Pillars: Random Seeds!
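A minimal sketch of what fixing the random seeds can look like in practice. The helper name `set_seeds` and the seed value are assumptions made for illustration; deep learning frameworks have their own seeding calls, which are not shown here.

```python
import random

import numpy as np

SEED = 42  # hypothetical value; any fixed integer works


def set_seeds(seed):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


set_seeds(SEED)
first_run = (random.random(), np.random.rand(3).tolist())

set_seeds(SEED)
second_run = (random.random(), np.random.rand(3).tolist())

# Same seed, same sequence of "random" numbers.
print(first_run == second_run)
```

Re-seeding before each experiment is what makes two runs comparable; without it the second draw would continue the generator's sequence and differ from the first.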

Reproducibility & Machine Learning?

Reproducibility as a measure of Confidence for Machine Learning Models

Check Supervised Learning Cross Validation

Supervised Learning
Training: input data → features + labels → machine learning algorithm → predictive model
Prediction: new input data → features → predictive model → output label

Building ML Models: keep the error as low as possible. The two major sources of error are bias and variance. If we manage to reduce both, we can build more accurate models.

bit.ly/ml-data-junk

How do we diagnose bias and variance? What actions should we take in each case?

Bias-Variance Generalisation

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, train_size=0.3, random_state=42)

Bias-Variance
feature(s) X → f → target y
f is almost completely unknown

Bias-Variance
feature(s) X → f’ → target y
The model f’ tries to estimate f

Bias-Variance
[Diagram: the model f’ is fitted on several training sets (TS), each producing its own estimates (Y)]

Variance is the amount by which the model's estimates vary as we change the training data


Bias reflects the amount of assumptions we make about the model

Bias-Variance Trade-off
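The trade-off can be made concrete with a small simulation: fit the same model on many training sets, then measure how far the average estimate is from f (bias) and how much the estimates vary across training sets (variance). Everything here is an assumption made for illustration: the "unknown" f is taken to be a sine curve, the models are polynomials of degree 1 and 7, and the helper name `bias_variance` is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)


def f(x):
    """The 'unknown' target function (assumed here, for the simulation)."""
    return np.sin(2 * np.pi * x)


x_test = np.linspace(0.1, 0.9, 50)


def bias_variance(degree, n_sets=200, n_points=30, noise=0.3):
    """Estimate squared bias and variance of a polynomial model f'."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        # Draw a fresh training set TS_i and fit a new model on it.
        x_tr = rng.uniform(0.0, 1.0, n_points)
        y_tr = f(x_tr) + rng.normal(0.0, noise, n_points)
        model = Polynomial.fit(x_tr, y_tr, degree, domain=[0, 1])
        preds[i] = model(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


simple_bias, simple_var = bias_variance(degree=1)
flexible_bias, flexible_var = bias_variance(degree=7)
print("degree 1: bias^2 =", simple_bias, "variance =", simple_var)
print("degree 7: bias^2 =", flexible_bias, "variance =", flexible_var)
```

The straight line makes a strong assumption about f and so carries more bias; the degree-7 polynomial assumes less but reacts more to each training set, so its variance is higher.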

Learning Curve

from sklearn.model_selection import learning_curve
# train_sizes must be array-like, one entry per point of the curve
train_sizes, train_scores, validation_scores = learning_curve(
    estimator, X, y, train_sizes=[0.8], cv=5, scoring='accuracy')

Cross Validation

from sklearn.model_selection import KFold

from sklearn.model_selection import ShuffleSplit

from sklearn.model_selection import StratifiedKFold
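To compare the three splitters side by side, the sketch below counts how many minority-class samples end up in each test fold. The toy data (16 samples of class 0, 4 of class 1) and the number of splits are assumptions chosen to make the difference visible.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold

# Toy, imbalanced data: 16 samples of class 0 and 4 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

splitters = {
    "KFold": KFold(n_splits=4, shuffle=True, random_state=42),
    "ShuffleSplit": ShuffleSplit(n_splits=4, test_size=0.25, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
}

positives_per_fold = {}
for name, cv in splitters.items():
    # Number of class-1 samples in each test fold.
    positives_per_fold[name] = [
        int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)
    ]
    print(name, positives_per_fold[name])
```

KFold partitions the data, so every sample is tested exactly once; ShuffleSplit draws independent random test sets, so samples can repeat or never be tested; StratifiedKFold also partitions, but keeps the class proportions in every fold.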

Replicable CV
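One way to make cross validation replicable is to shuffle with a fixed `random_state`, so the folds are identical every time the experiment is rerun. A minimal sketch, with made-up data and a hypothetical helper `fold_indices`:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((30, 3))
y = np.array([0, 1, 2] * 10)


def fold_indices(seed):
    """Return the test indices of each fold for a given seed."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(X, y)]


same_seed = fold_indices(0) == fold_indices(0)
different_seed = fold_indices(0) == fold_indices(1)
print("same seed, same folds:", same_seed)
print("different seed, same folds:", different_seed)
```

Reporting the seed together with the scores lets other experimenters rebuild exactly the same partitions.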

Selection Bias
Selection bias is the selection of data in such a way that proper randomisation is not achieved: the sample obtained is not representative of the population. If selection bias is not considered, the conclusions are not accurate.

Nested CV
5 x 3 CV - Hyper Parameter Search
10 x 5 CV - Selection Bias
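A possible sketch of this scheme with scikit-learn, reading "5 x 3" as 5 repetitions of 3-fold CV for the inner hyper-parameter search and "10 x 5" as 10 repetitions of 5-fold CV for the outer performance estimate. That reading, the iris data, the SVC model and the grid over `C` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 5 x 3 CV, used only to choose hyper-parameters.
inner_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)
# Outer loop: 10 x 5 CV, used only to estimate performance.
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)

search = GridSearchCV(
    SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv
)

# cross_val_score clones `search` for every outer split, so the
# hyper-parameters are tuned without ever seeing the outer test fold.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(len(outer_scores), outer_scores.mean())
```

Because the outer test folds never take part in the search, the averaged outer score is not inflated by the choice of hyper-parameters.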

Nested CV todo list
✓ Initialise and set random seeds
✓ Be sure to create NEW models at each run
✓ Calculate the Error (metric) at each run
✓ Average and Get the Confidence Intervals
✓ Save your models and checkpoints
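The checklist above can be sketched as a single loop. For brevity this shows one level of CV only; the dataset, the model, the temporary output directory and the normal-approximation confidence interval are assumptions, and with so few runs the interval is only a rough indication.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

SEED = 42
np.random.seed(SEED)  # initialise and set random seeds

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
out_dir = Path(tempfile.mkdtemp())  # hypothetical checkpoint location

scores = []
for run, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    model = LogisticRegression(max_iter=1000)  # a NEW model at each run
    model.fit(X[train_idx], y[train_idx])
    # Calculate the metric at each run.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    # Save the fitted model of this run.
    with open(out_dir / f"model_run{run}.pkl", "wb") as fh:
        pickle.dump(model, fh)

# Average and get an approximate 95% confidence interval.
scores = np.array(scores)
mean = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
ci = (mean - half_width, mean + half_width)
print(f"accuracy = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Creating the model inside the loop matters: reusing one fitted object across runs can carry state from one fold into the next.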

Boilerplate Code
• Model independent
• Easily tuneable to be used w/ ML | DL
• Feature Normalisation??
• Feature Selection??
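Feature normalisation and feature selection are where selection bias typically creeps in: if they are fitted on the whole dataset before cross validation, the test folds have already influenced the model. The sketch below illustrates this on pure noise, where no honest model should beat chance. The data, the choice of `SelectKBest` with k=20 and the classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = np.array([0, 1] * 50)         # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Biased: features are picked using ALL labels, test folds included.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# Unbiased: scaling and selection are re-fitted on each training fold only.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
unbiased_acc = cross_val_score(pipeline, X, y, cv=cv).mean()

print("selection outside CV:", biased_acc)
print("selection inside CV: ", unbiased_acc)
```

Wrapping the preprocessing steps in a pipeline keeps them inside the CV loop, which is one way boilerplate code can stay model independent while avoiding this leak.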

Reproducible Learn Introducing leriomaggio/reproducible-learn

Settings

DL Settings

DAP Test

Template Method

Runners

TODO List
• Improve Parallelisation
• Integrate DB Analytics
• Framework-agnostic for DL (Backend)
• Integrate Multiple Metrics

PyData Challenge AI for Precision Medicine http://contest.pycon.it

Valerio Maggio @leriomaggio
Thanks a lot for your kind attention
