
Reproducibility and Selection Bias in Machine Learning

_Reproducibility_, the ability to recompute results, and _replicability_, the chance that other experimenters will achieve a consistent result[1], are among the cornerstones of the scientific method.

Surprisingly, these two aspects are often underestimated, or not even considered, when setting up scientific experimental pipelines. One of the main threats to replicability is selection bias, i.e. the error made in choosing the individuals or groups that take part in a study[2]. Selection bias comes in different flavours: the selection of the population of samples in the dataset (sample bias); the selection of the features used by the learning models, which is particularly critical in high-dimensional settings; and the selection of the hyper-parameters that perform best on specific dataset(s). If not properly accounted for, selection bias can strongly affect the validity of the derived conclusions, as well as the reliability of the learning model.

In this talk I will provide a solid introduction to the topics of reproducibility and selection bias, with examples taken from biomedical research, where reliability is paramount.

From a more technological perspective, the scientific Python ecosystem to date still lacks tools to consolidate experimental pipelines in research that can be used together with machine and deep learning frameworks (e.g. `sklearn` and `keras`).

In this talk, I will introduce `reproducible-lern`, a new Python framework for reproducible research, to be used for machine and deep learning.

During the talk, the main features of the framework will be presented, along with several examples, technical insights and implementation choices to be discussed with the audience.

The talk is intended for intermediate PyData researchers and practitioners. Basic prior knowledge of the main Machine Learning concepts is assumed for the first part of the talk; good proficiency with the Python language and with the scientific Python libraries (e.g. numpy, sklearn) is required for the second part.

---

[1] Jeffrey T. Leek and Roger D. Peng, "Reproducible research can still be wrong: Adopting a prevention approach".

[2] Dictionary of Cancer Terms, "selection bias".

Valerio Maggio

October 25, 2018

Transcript

  1. Reproducibility & Selection Bias in Machine Learning. When only Cross Validation is not enough! Valerio Maggio, Researcher @ Fondazione Bruno Kessler, Trento, Italy. @leriomaggio, valeriomaggio@gmail
  2. Building Blocks of the Scientific Method: (a) reproducibility, i.e. the ability to recompute results; (b) replicability, i.e. the chance that other experimenters will achieve a consistent result.

  3. Reproducible vs Replicable. Source: "Barriers to Reproducible Research (and how to overcome them)", doi: https://dx.doi.org/10.6084/m9.figshare.7140050, by @kirstie_j
  4. Reproducible, Replicable, Robust, Generalisable. Source: "Barriers to Reproducible Research (and how to overcome them)", doi: https://dx.doi.org/10.6084/m9.figshare.7140050, by @kirstie_j
  5. Steps to Reproducibility: create a public repo; freeze the development with a git tag; share the virtual env (`pip install -r requirements`, `conda env create -f environment.yml`).
  6. Steps to Reproducibility (cont.): as above, plus share the Python version (`conda create -n env python=3.6`, or pyenv + virtualenv).
  7. Steps to Reproducibility (cont.): as above, plus create a container with the whole environment.
  8. Steps to Reproducibility (cont.): as above, plus share the CUDA version (`nvcc --version`) and the cuDNN version (`cat <path to>/cudnn.h | grep CUDNN_MAJOR -A 2`). A sketch of recording the versions in use alongside the results follows below.
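The sketch below is an illustration only (the file name and the chosen libraries are assumptions, not something from the talk): it dumps an environment fingerprint next to the experiment outputs, so that a result can later be traced back to the exact interpreter and library versions.

```python
# Minimal sketch: dump an "environment fingerprint" next to the experiment
# outputs, so results can be matched to the exact interpreter/library versions.
import json
import platform
import sys

import numpy as np
import sklearn


def environment_fingerprint():
    """Collect interpreter, OS and library versions used for an experiment."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "scikit-learn": sklearn.__version__,
    }


if __name__ == "__main__":
    with open("environment_fingerprint.json", "w") as fp:
        json.dump(environment_fingerprint(), fp, indent=2)
```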
  9. Machine Learning for dummies (a.k.a. ML explained to computer scientists). Note: I *am* a computer scientist!
  10. Machine Learning for dummies (cont.): Machine Learning = Matrix Multiplication.
  11. Machine Learning for dummies (cont.): Machine Learning = Matrix Multiplication + Random Number Generation.
  12. Machine Learning for dummies (cont.): Deep Learning = (Matrix Multiplication + Random Number Generation) repeated t times, with t ≅ 2k.
  13. Randomness in ML: `from sklearn.ensemble import RandomForestClassifier`; `RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)`.
  14. Randomness in ML (cont.): also `from sklearn.linear_model import LogisticRegression`; `LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, ...)`.
  15. Randomness in ML (cont.): also `from sklearn.svm import SVC`; `SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)`.
  16. Randomness in ML (cont.): the splitters too, all from `sklearn.model_selection`: `train_test_split(*arrays, test_size=.25, random_state=None, shuffle=True, stratify=None)`; `ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)`; `KFold(n_splits=3, shuffle=False, random_state=None)` and `StratifiedKFold(n_splits=3, shuffle=False, random_state=None)` (n_splits=5 in 0.22). Every signature defaults to `random_state=None`; a sketch of pinning it follows below.
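The minimal sketch below pins the seed in both the splitter and the estimator so that the split and the fitted model are identical across runs; the seed value and the dataset are arbitrary choices for illustration.

```python
# Sketch: fix random_state in both the splitter and the estimator so the
# whole experiment is repeatable (illustrative seed and dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary, but fixed and reported

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=SEED)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # same number on every run with the same SEED
```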
  17. Randomness in DL: `from keras.layers.core import Dense`; `Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)`.
  18. Randomness in DL (cont.): the default initialiser is itself seeded: `from keras.initializers import glorot_uniform`; `glorot_uniform(seed=None)`. A sketch of pinning the relevant seeds follows below.
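The sketch below pins the RNGs that feed those initialisers; it assumes standalone Keras on a TensorFlow 1.x backend (the setup current at the time of the talk), which is what the `tf.set_random_seed` call belongs to.

```python
# Sketch: pin the RNGs that feed Keras weight initialisers such as
# glorot_uniform (assumes standalone Keras on a TensorFlow 1.x backend).
import random

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary, but fixed and reported
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy RNG, used by Keras initialisers
tf.set_random_seed(SEED)   # graph-level seed (TensorFlow 1.x API)

from keras.initializers import glorot_uniform
from keras.layers import Dense
from keras.models import Sequential

model = Sequential([
    Dense(32, activation="relu", input_shape=(10,),
          kernel_initializer=glorot_uniform(seed=SEED)),
    Dense(1, activation="sigmoid"),
])
```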
  19. cuDNN Conv fix. From the TensorFlow configuration (`ConfigProto`): the execution of an individual op (for some op types) can be parallelized on a pool of `intra_op_parallelism_threads` (0 means the system picks an appropriate number), here `intra_op_parallelism_threads = 2`; nodes that perform blocking operations are enqueued on a pool of `inter_op_parallelism_threads` available in each process (0 means the system picks an appropriate number; note that the first Session created in the process sets the number of threads for all future sessions unless `use_per_session_threads` is true or `session_inter_op_thread_pool` is configured), here `inter_op_parallelism_threads = 5`. A sketch of setting these through the session config follows below.
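Those two fields belong to TensorFlow's `ConfigProto`; the sketch below sets them through the TF 1.x session configuration. The thread counts are illustrative; 1 is chosen here to also reduce non-determinism from thread scheduling.

```python
# Sketch: control intra-/inter-op thread pools via the TF 1.x session config
# (illustrative values; 0 would let the system pick an appropriate number).
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=1,  # threads used inside a single op
    inter_op_parallelism_threads=1,  # threads used across independent ops
)
session = tf.Session(config=config)

# With standalone Keras, register this session as the backend's default one.
from keras import backend as K
K.set_session(session)
```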
  20. Building ML Models: keep the error as low as possible. Two major sources of error are bias and variance; if we manage to reduce these two, we can build more accurate models.
  21. Selection Bias is the selection of data in such a way that proper randomisation is not achieved, so the sample obtained is not representative of the population. Selection bias not considered => conclusions not accurate.
  22. Nested CV: 5 x 3 CV for the Hyper Parameter Search; 10 x 5 CV against Selection Bias. A generic sketch of nested cross-validation follows below.
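The sketch below is a generic nested cross-validation loop in scikit-learn, not the exact 5 x 3 / 10 x 5 scheme of the slide: the inner CV tunes the hyper-parameters, the outer CV estimates the generalisation error on folds never used for tuning. Model, grid and fold counts are illustrative.

```python
# Sketch: nested cross-validation. The inner CV picks hyper-parameters, the
# outer CV scores the resulting procedure, so tuning never sees the test folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

SEED = 42
X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

scores = cross_val_score(search, X, y, cv=outer_cv)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```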
  23. Nested CV todo list: ✓ initialise and set random seeds; ✓ be sure to create NEW models at each run; ✓ calculate the error (metric) at each run; ✓ average and get the confidence intervals; ✓ save your models and checkpoints. A sketch of this loop follows below.
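The sketch below is one way that checklist could translate into code: each run gets its own seed and a brand-new model, the metric is recorded per run, and the runs are summarised with a mean and a confidence interval. Model, metric and number of runs are illustrative.

```python
# Sketch of the checklist: fresh seed and fresh model per run, per-run metric,
# then mean and a normal-approximation 95% confidence interval over the runs.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(10):                                   # 10 independent runs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)  # NEW model
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))
    # joblib.dump(model, "model_run_%d.joblib" % seed)   # save checkpoints if needed

scores = np.asarray(scores)
ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print("accuracy: %.3f +/- %.3f (95%% CI)" % (scores.mean(), ci95))
```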
  24. Boilerplate Code: model independent; easily tuneable to be used with ML | DL. Feature Normalisation?? Feature Selection?? See the Pipeline sketch below.
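One common way to address those question marks is to keep feature normalisation and feature selection inside a scikit-learn `Pipeline`, so both are re-fit on each training fold and cannot leak information from the evaluation data. The sketch below is generic, not the API of `reproducible-lern` itself.

```python
# Sketch: normalisation and feature selection live inside the cross-validated
# pipeline, so they are fit on the training folds only (no selection-bias leak).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # feature normalisation
    ("select", SelectKBest(f_classif, k=10)),     # feature selection
    ("clf", LogisticRegression(solver="liblinear", random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```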
  25. TODO List: improve parallelisation; integrate DB analytics; framework-agnostic DL backends; integrate multiple metrics.