
Composing Testable and Robust Machine Learning Pipelines

Holger Peters
October 14, 2015


This talk is about Scikit-Learn's Transformer and Estimator interfaces. It explains how machine learning models can be formulated using these interfaces and how this makes them easier to test. The overall aim is to build ML models from composable building blocks.


Transcript

  1. Composing Testable and Robust Machine Learning Pipelines
     Holger Peters, Data Scientist and Software Developer
     @data_hope
     Budapest BI Forum 2015
     http://www.holger-peters.de
     Slides: https://speakerdeck.com/holgerpeters
  2. Supervised Machine Learning: The Problem
     Given a history of data: historic features X, historic target y.
     Estimate for the target: prediction ŷ based on new features X'.
     Scikit-Learn: • Open Source • ML algorithms • "Plumbing" code • Python
  3. Supervised Machine Learning

              Feature 1   Feature 2   Target
     1            8           1          4
     2           11           1          1
     3           17           5          4
     4           18           4          6
     ...
     34123       21           7          ?
     34124       25           0          ?
     34125       15           4          ?
     34126       15           1          ?

     The feature columns form X; the entries marked ? are the target values to be estimated, ŷ.
  4. Supervised Machine Learning
     Training/Fit: given a history of data, historic features X and historic target y.
     Estimation/Predict: prediction ŷ for the target, based on new features X'.
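     (A minimal sketch of the fit/predict contract described on this slide; the LinearRegression estimator and the toy data are illustrative, not from the talk.)

     import numpy as np
     from sklearn.linear_model import LinearRegression

     X = np.array([[1.0], [2.0], [3.0]])   # historic features X
     y = np.array([2.0, 4.0, 6.0])         # historic target y

     model = LinearRegression().fit(X, y)  # Training/Fit
     X_new = np.array([[4.0]])             # new features X'
     y_hat = model.predict(X_new)          # Estimation/Predict -> ŷ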
  5. Substructure of ML Model
     Data Cleanup → Feature building → ML algorithm.
     Data cleanup and feature building are preprocessing for the estimator; only the ML algorithm itself is the "predictive" part. Most steps involve transforming the feature matrix X.
  6. Structuring Predictive Models
     Data Cleanup and Feature building become Transformer 1, Transformer 2, Transformer 3, ..., Transformer n: preprocessing for estimators (StandardScaler, Imputer, LabelEncoder, etc.).
     The ML algorithm becomes the Estimator (Support Vector Machine, Gradient Boosting, Random Forest, etc.).
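     (A hedged sketch of this structure, chaining transformers by hand before the estimator; the particular steps and data are illustrative, using the 2015-era scikit-learn imports to match the talk's vintage.)

     from sklearn.datasets import load_digits
     from sklearn.cross_validation import train_test_split
     from sklearn.preprocessing import Imputer, StandardScaler
     from sklearn.svm import LinearSVC

     digits = load_digits()
     X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

     imputer, scaler, svc = Imputer(), StandardScaler(), LinearSVC()

     Xt = imputer.fit_transform(X_train)   # Transformer 1
     Xt = scaler.fit_transform(Xt)         # Transformer 2
     svc.fit(Xt, y_train)                  # Estimator

     # Prediction replays the fitted transformations with transform():
     y_pred = svc.predict(scaler.transform(imputer.transform(X_test)))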
  7. Pipelines: Sequential Transformations
     Transformer 1 → Transformer 2 → ... → Transformer n → Estimator.
     We compose transformers and estimators with a pipeline. A pipeline is itself an estimator (a meta-estimator), an instance of the composite pattern. Transformers and the estimator can be tested independently.
  8. Example

     pipe = Pipeline([('pca', PCA(n_components=20)),
                      ('scaler', StandardScaler()),
                      ('svc', LinearSVC())])
     pipe.fit(X_train, y_train)
     y_pred = pipe.predict(X_test)
     score = mean_absolute_error(y_test, y_pred)
  9. Intermediate Summary
     Assemble models using Transformers and Estimators. Write preprocessing using Transformers. Small building blocks make testing easier. Decoupling of ML algorithm, preprocessing, and meta-logic.
  10. Multi-classification Problem
      Problem: Our algorithms are all about binary classification.
      Approach: Turn several binary classifications into a multi-classification.
      Example: Recognise Written Digits.
      [Figure: grids of handwritten digit images illustrating the pairwise and one-vs-rest class splits]
      45 trainings with one-vs-one (one per pair of classes, 10·9/2 = 45); 10 trainings with one-vs-rest (one per class).
  11. Multi-classification Problem
      Problem: Our algorithms are all about binary classification.
      Approach: Turn several binary classifications into a multi-classification.
      Example: Recognise Written Digits.
  12.
      from sklearn.pipeline import Pipeline
      from sklearn.cross_validation import train_test_split
      from sklearn.decomposition import PCA
      from sklearn.svm import LinearSVC, SVC
      from sklearn.preprocessing import StandardScaler
      from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
      from sklearn.datasets import load_digits

      digits = load_digits()
      X, y = digits.data, digits.target
      X_train, X_test, y_train, y_test = train_test_split(X, y)

      pipe = Pipeline([('pca', PCA(n_components=20)),
                       ('scaler', StandardScaler()),
                       ('svc', LinearSVC())])

      # The same pipeline is wrapped in two different multiclass meta-estimators.
      one_vs_one = OneVsOneClassifier(pipe).fit(X_train, y_train)
      score_one_vs_one = one_vs_one.score(X_test, y_test)

      one_vs_rest = OneVsRestClassifier(pipe).fit(X_train, y_train)
      score_one_vs_rest = one_vs_rest.score(X_test, y_test)

      # Score of one_vs_one: 0.97
      # Score of one_vs_rest: 0.94
  13. Learnings
      Use Meta-Estimators to build upon other models. Write logic that works with any estimator/transformer. Be able to exchange the inner model as needed.
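      (A minimal sketch of "logic that works with any estimator"; the helper name is hypothetical, not from the talk. Because every scikit-learn estimator exposes the same fit/score interface, meta-logic can be written once and the inner model swapped freely.)

      def fit_and_score(estimator, X_train, y_train, X_test, y_test):
          # Works for any scikit-learn estimator, including pipelines and
          # meta-estimators, because they all share fit() and score().
          estimator.fit(X_train, y_train)
          return estimator.score(X_test, y_test)

      # fit_and_score(OneVsOneClassifier(pipe), X_train, y_train, X_test, y_test)
      # fit_and_score(OneVsRestClassifier(pipe), X_train, y_train, X_test, y_test)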
  14. A Rough Reimplementation of StandardScaler

      import numpy as np
      from sklearn.base import BaseEstimator, TransformerMixin

      class Scaler(BaseEstimator, TransformerMixin):
          def fit(self, X, y=None):
              # Learn per-column mean and standard deviation from training data.
              self.mean_ = np.mean(X, axis=0)
              self.std_ = np.std(X, axis=0)
              return self

          def transform(self, X):
              X = np.asarray(X, dtype=float).copy()
              X -= self.mean_
              X /= self.std_   # note: divides by zero for a constant column
              return X

      def test_scaler_noop():
          X = np.c_[[-1, 1]]
          s = Scaler()
          Xt = s.fit_transform(X)
          assert Xt is not X
          np.testing.assert_allclose(Xt, np.c_[[-1, 1]])
          np.testing.assert_allclose(np.mean(Xt, axis=0), 0., atol=1e-10)
          np.testing.assert_allclose(np.std(Xt, axis=0), 1., atol=1e-10)

      def test_scaler_simple():
          X = np.c_[np.arange(10.), np.arange(10.)]
          s = Scaler()
          Xt = s.fit_transform(X)
          assert Xt is not X
          np.testing.assert_allclose(np.mean(Xt, axis=0), 0., atol=1e-10)
          np.testing.assert_allclose(np.std(Xt, axis=0), 1., atol=1e-10)

      def test_scaler_with_data_where_one_column_is_of_constant_value():
          X = np.c_[np.ones(10), np.arange(10.)]
          s = Scaler()
          Xt = s.fit_transform(X)
          assert Xt is not X
          np.testing.assert_allclose(np.mean(Xt, axis=0), 0., atol=1e-10)
          np.testing.assert_allclose(np.std(Xt, axis=0), 1., atol=1e-10)
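      (The third test exposes a robustness bug in the rough Scaler: a constant column has std_ == 0, so transform divides by zero and produces NaNs, and the unit-standard-deviation assertion cannot hold for that column; this is the kind of issue small, focused tests surface. A hedged sketch of a more robust transform, which, like scikit-learn's StandardScaler, treats zero scales as 1 so constant columns end up merely centred; the SafeScaler name is hypothetical.)

      class SafeScaler(Scaler):
          def transform(self, X):
              X = np.asarray(X, dtype=float).copy()
              # Guard zero standard deviations to avoid division by zero.
              std = np.where(self.std_ == 0.0, 1.0, self.std_)
              return (X - self.mean_) / std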
  15. Final Advice
      Use composable Transformers and Estimators. Small building blocks make testing easier, and tested data science makes data science easier. A transformer should do one thing (and one thing only); decouple what can be independent.