
Reproducibility and Selection Bias in Machine Learning

_Reproducibility_, the ability to recompute results, and _replicability_, the chance that other experimenters will achieve a consistent result[1], are among the cornerstones of the scientific method.

Surprisingly, these two aspects are often underestimated, or not even considered, when setting up scientific experimental pipelines. One of the main threats to replicability is selection bias, i.e. the error made in choosing the individuals or groups that take part in a study[2]. Selection bias comes in different flavours: the selection of the population of samples in the dataset (sample bias); the selection of the features used by the learning models, which is particularly critical in high-dimensional settings; and the selection of the hyper-parameters that perform best on specific dataset(s). If not properly accounted for, selection bias can strongly affect the validity of the derived conclusions, as well as the reliability of the learning model.

In this talk I will provide a solid introduction to the topics of reproducibility and selection bias, with examples taken from biomedical research, where reliability is paramount.

From a more technological perspective, the scientific Python ecosystem to date still lacks tools to consolidate experimental pipelines in research that can be used together with machine and deep learning frameworks (e.g. `sklearn` and `keras`).

In this talk, I will introduce `reproducible-lern`, a new Python framework for reproducible research, to be used for machine and deep learning.

During the talk, the main features of the framework will be presented, along with several examples, technical insights and implementation choices to be discussed with the audience.

The talk is intended for intermediate PyData researchers and practitioners. Basic prior knowledge of the main Machine Learning concepts is assumed for the first part of the talk; good proficiency with the Python language and with the scientific Python libraries (e.g. numpy, sklearn) is required for the second part.

---

[1] Jeffrey T. Leek and Roger D. Peng, "Reproducible research can still be wrong: Adopting a prevention approach".

[2] Dictionary of Cancer Terms, "selection bias".

Valerio Maggio

October 25, 2018

Transcript

  1. Reproducibility & Selection Bias in Machine Learning. When only Cross Validation is not enough! Valerio Maggio, Researcher @ Fondazione Bruno Kessler, Trento, Italy. @leriomaggio, valeriomaggio@gmail
  2. Building Blocks of the Scientific Method: (a) reproducibility, i.e. the ability to recompute results; (b) replicability, i.e. the chance that other experimenters will achieve a consistent result.

  3. Reproducible vs Replicable. Source: "Barriers to Reproducible Research (and how to overcome them)", doi: https://dx.doi.org/10.6084/m9.figshare.7140050, by @kirstie_j
  4. Reproducible, Replicable, Robust, Generalisable. Source: "Barriers to Reproducible Research (and how to overcome them)", doi: https://dx.doi.org/10.6084/m9.figshare.7140050, by @kirstie_j
  5. Steps to Reproducibility: create a public repo; freeze the development with a git tag; share the virtual env (`pip install -r requirements`, `conda env create -f environment.yml`).
  6. Steps to Reproducibility (cont.): as above, plus share the Python version (`conda create -n env python=3.6`, or pyenv + virtualenv).
  7. Steps to Reproducibility (cont.): as above, plus create a container with the whole environment.
  8. Steps to Reproducibility (cont.): as above, plus share the CUDA version (`nvcc --version`) and the cuDNN version (`cat <path to>/cudnn.h | grep CUDNN_MAJOR -A 2`). A sketch of recording the versions in use alongside the results follows below.
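The sketch below is an illustration only (the file name and the chosen libraries are assumptions, not something from the talk): it dumps an environment fingerprint next to the experiment outputs, so that a result can later be traced back to the exact interpreter and library versions.

```python
# Minimal sketch: dump an "environment fingerprint" next to the experiment
# outputs, so results can be matched to the exact interpreter/library versions.
import json
import platform
import sys

import numpy as np
import sklearn


def environment_fingerprint():
    """Collect interpreter, OS and library versions used for an experiment."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "scikit-learn": sklearn.__version__,
    }


if __name__ == "__main__":
    with open("environment_fingerprint.json", "w") as fp:
        json.dump(environment_fingerprint(), fp, indent=2)
```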
  9. Machine Learning for dummies (a.k.a. ML explained to computer scientists). Note: I *am* a computer scientist!
  10. Machine Learning for dummies (cont.): Machine Learning = Matrix Multiplication.
  11. Machine Learning for dummies (cont.): Machine Learning = Matrix Multiplication + Random Number Generation.
  12. Machine Learning for dummies (cont.): Deep Learning = (Matrix Multiplication + Random Number Generation) repeated t times, with t ≅ 2k.
  13. Randomness in ML: `from sklearn.ensemble import RandomForestClassifier`; `RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)`.
  14. Randomness in ML (cont.): also `from sklearn.linear_model import LogisticRegression`; `LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, ...)`.
  15. Randomness in ML (cont.): also `from sklearn.svm import SVC`; `SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)`.
  16. Randomness in ML (cont.): the splitters too, all from `sklearn.model_selection`: `train_test_split(*arrays, test_size=.25, random_state=None, shuffle=True, stratify=None)`; `ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)`; `KFold(n_splits=3, shuffle=False, random_state=None)` and `StratifiedKFold(n_splits=3, shuffle=False, random_state=None)` (n_splits=5 in 0.22). Every signature defaults to `random_state=None`; a sketch of pinning it follows below.
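The minimal sketch below pins the seed in both the splitter and the estimator so that the split and the fitted model are identical across runs; the seed value and the dataset are arbitrary choices for illustration.

```python
# Sketch: fix random_state in both the splitter and the estimator so the
# whole experiment is repeatable (illustrative seed and dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary, but fixed and reported

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=SEED)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # same number on every run with the same SEED
```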
  17. Randomness in DL: `from keras.layers.core import Dense`; `Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)`.
  18. Randomness in DL (cont.): the default initialiser is itself seeded: `from keras.initializers import glorot_uniform`; `glorot_uniform(seed=None)`. A sketch of pinning the relevant seeds follows below.
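The sketch below pins the RNGs that feed those initialisers; it assumes standalone Keras on a TensorFlow 1.x backend (the setup current at the time of the talk), which is what the `tf.set_random_seed` call belongs to.

```python
# Sketch: pin the RNGs that feed Keras weight initialisers such as
# glorot_uniform (assumes standalone Keras on a TensorFlow 1.x backend).
import random

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary, but fixed and reported
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy RNG, used by Keras initialisers
tf.set_random_seed(SEED)   # graph-level seed (TensorFlow 1.x API)

from keras.initializers import glorot_uniform
from keras.layers import Dense
from keras.models import Sequential

model = Sequential([
    Dense(32, activation="relu", input_shape=(10,),
          kernel_initializer=glorot_uniform(seed=SEED)),
    Dense(1, activation="sigmoid"),
])
```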
  19. cuDNN Conv fix. From the TensorFlow configuration (`ConfigProto`): the execution of an individual op (for some op types) can be parallelized on a pool of `intra_op_parallelism_threads` (0 means the system picks an appropriate number), here `intra_op_parallelism_threads = 2`; nodes that perform blocking operations are enqueued on a pool of `inter_op_parallelism_threads` available in each process (0 means the system picks an appropriate number; note that the first Session created in the process sets the number of threads for all future sessions unless `use_per_session_threads` is true or `session_inter_op_thread_pool` is configured), here `inter_op_parallelism_threads = 5`. A sketch of setting these through the session config follows below.
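Those two fields belong to TensorFlow's `ConfigProto`; the sketch below sets them through the TF 1.x session configuration. The thread counts are illustrative; 1 is chosen here to also reduce non-determinism from thread scheduling.

```python
# Sketch: control intra-/inter-op thread pools via the TF 1.x session config
# (illustrative values; 0 would let the system pick an appropriate number).
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=1,  # threads used inside a single op
    inter_op_parallelism_threads=1,  # threads used across independent ops
)
session = tf.Session(config=config)

# With standalone Keras, register this session as the backend's default one.
from keras import backend as K
K.set_session(session)
```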
  20. Building ML Models: keep the error as low as possible. Two major sources of error are bias and variance; if we manage to reduce these two, we can build more accurate models.
  21. Selection Bias is the selection of data in such a way that proper randomisation is not achieved, so the sample obtained is not representative of the population. Selection bias not considered => conclusions not accurate.
  22. Nested CV: 5 x 3 CV for the Hyper Parameter Search; 10 x 5 CV against Selection Bias. A generic sketch of nested cross-validation follows below.
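The sketch below is a generic nested cross-validation loop in scikit-learn, not the exact 5 x 3 / 10 x 5 scheme of the slide: the inner CV tunes the hyper-parameters, the outer CV estimates the generalisation error on folds never used for tuning. Model, grid and fold counts are illustrative.

```python
# Sketch: nested cross-validation. The inner CV picks hyper-parameters, the
# outer CV scores the resulting procedure, so tuning never sees the test folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

SEED = 42
X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

scores = cross_val_score(search, X, y, cv=outer_cv)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```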
  23. Nested CV todo list: ✓ initialise and set random seeds; ✓ be sure to create NEW models at each run; ✓ calculate the error (metric) at each run; ✓ average and get the confidence intervals; ✓ save your models and checkpoints. A sketch of this loop follows below.
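The sketch below is one way that checklist could translate into code: each run gets its own seed and a brand-new model, the metric is recorded per run, and the runs are summarised with a mean and a confidence interval. Model, metric and number of runs are illustrative.

```python
# Sketch of the checklist: fresh seed and fresh model per run, per-run metric,
# then mean and a normal-approximation 95% confidence interval over the runs.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(10):                                   # 10 independent runs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)  # NEW model
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))
    # joblib.dump(model, "model_run_%d.joblib" % seed)   # save checkpoints if needed

scores = np.asarray(scores)
ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print("accuracy: %.3f +/- %.3f (95%% CI)" % (scores.mean(), ci95))
```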
  24. Boilerplate Code: model independent; easily tuneable to be used with ML | DL. Feature Normalisation?? Feature Selection?? See the Pipeline sketch below.
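One common way to address those question marks is to keep feature normalisation and feature selection inside a scikit-learn `Pipeline`, so both are re-fit on each training fold and cannot leak information from the evaluation data. The sketch below is generic, not the API of `reproducible-lern` itself.

```python
# Sketch: normalisation and feature selection live inside the cross-validated
# pipeline, so they are fit on the training folds only (no selection-bias leak).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # feature normalisation
    ("select", SelectKBest(f_classif, k=10)),     # feature selection
    ("clf", LogisticRegression(solver="liblinear", random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```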
  25. TODO List: improve parallelisation; integrate DB analytics; framework-agnostic DL backends; integrate multiple metrics.