
Testable ML Data Science

Talk at EuroPython 2015 on why and how to use Scikit-Learn's Transformer and Estimator interface.

Holger Peters

July 21, 2015

Transcript

  1. Testable ML Data Science
     How to make numeric code testable using Scikit-Learn's interfaces.
     EuroPython 2015, Bilbao
     Holger Peters | @data_hope
     http://flic.kr/p/71Ewbs (CC)
  2. Supervised Machine Learning

     features                                                    target
     season  holiday  weekday  weather  temperature  windspeed | counts
        4       0        5      clear      0.60        0.10    |  8156
        4       0        6      rain       0.55        0.26    |  7965
        4       0        0      clear      0.44        0.14    |  3510
        4       1        1      rain       0.38        0.18    |  5478

     known data for estimation
  3. Training a Machine Learning Model

     est.fit(X, y) → est     train estimator using X, y
     est.predict(X) → y      estimate target from feature data

     We train the model using features & known target values; we estimate
     the target based on feature values only.
  4. Training a model

     from sklearn.svm import SVR
     from sklearn.model_selection import train_test_split
     from sklearn.metrics import mean_absolute_error

     X_train, X_test, y_train, y_test = train_test_split(X, y)
     est = SVR()
     est.fit(X_train, y_train)
     y_pred = est.predict(X_test)
     mean_absolute_error(y_test, y_pred)  # 2.34

     Use a simple metric to obtain the prediction quality.
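The slide assumes X and y are already loaded. A minimal runnable sketch of the same workflow; the synthetic regression data and the fixed random seed are my additions, standing in for the bike-sharing dataset from the talk:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 200 samples, 3 features, a noisy linear target.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
est = SVR()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)

# One number summarising prediction quality on held-out data.
mae = mean_absolute_error(y_test, y_pred)
```

Keeping the score computation on a held-out test set, as here, is what makes the metric an honest estimate rather than a measurement of memorisation.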
  5. Preparing Data For The Estimator

     … because usually it is not enough to just feed data into the
     estimator. http://flic.kr/p/b8fA22 (CC license)
  6. Preparing data for the estimator

     x -= np.mean(x, axis=0)
     x /= np.std(x, axis=0)

     For some estimators, we need to scale each feature to mean µ=0 and
     standard deviation σ=1. But what about missing values? Suppose we
     have NaN values in x.
  7. Test

     x = array([[ 15.7,  32. ],
                [  nan,  18. ],
                [  9.8,  31. ]])
     x -= np.mean(x, axis=0)
     x /= np.std(x, axis=0)

     Results in:
     array([[ nan,  0.78406256],
            [ nan, -1.41131261],
            [ nan,  0.62725005]])
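The all-NaN first column appears because NumPy's plain mean and std propagate NaN: a single missing value poisons the whole column's statistics. A tiny check, using the numbers from the slide:

```python
import numpy as np

x = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# np.mean propagates NaN: any NaN in a column makes that column's mean NaN,
# while the clean second column averages to (32 + 18 + 31) / 3 = 27.
col_means = np.mean(x, axis=0)
```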
  8. Preparing data for the estimator

     mean = np.nanmean(x, axis=0)
     std = np.nanstd(x, axis=0)
     x -= mean
     x /= std
     x = np.nan_to_num(x)

     Let's add treatment for missing values (encoded as NaN).
  9. How does this transform our data?

     array([[ 15.7,  32. ],
            [  nan,  18. ],
            [  9.8,  31. ]])

     Results in:
     array([[ 1. ,  0.78406256],
            [ 0. , -1.41131261],
            [-1. ,  0.62725005]])

     This works for the fit, but how do we prepare for the predict?
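The slide's numbers can be verified directly; this sketch repeats the nanmean/nanstd/nan_to_num steps on the same array:

```python
import numpy as np

x = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# nanmean/nanstd ignore missing entries; nan_to_num then maps the remaining
# NaN to 0, which after scaling is exactly the column mean.
x = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
x = np.nan_to_num(x)
```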
  10. How can we do this for both fit and predict? Transformers!
      http://flic.kr/p/rLvEXy (CC license)
  11. Transformers

      est.fit(X, y) → est             train transformer using X, y
      est.transform(X) → X'           transform feature matrix
      est.fit_transform(X, y) → X'    train & transform

      Transformers are all about the feature matrix.
  12. Let us rephrase our preprocessing as a transformer

      from sklearn.base import BaseEstimator, TransformerMixin

      class NaNGuessingScaler(BaseEstimator, TransformerMixin):
          def fit(self, X, y=None):
              self.mean_ = np.nanmean(X, axis=0)
              self.std_ = np.nanstd(X, axis=0)
              return self

          def transform(self, X, y=None):
              Xt = X - self.mean_
              Xt /= self.std_
              return np.nan_to_num(Xt)
  13. Testing our transformer

      def test_nan_guessing_scaler():
          X = np.array([[ 15.7,  32. ],
                        [np.nan, 18. ],
                        [  9.8,  31. ]])
          scaler = NaNGuessingScaler()
          Xt = scaler.fit_transform(X)
          np.testing.assert_allclose(np.std(Xt, axis=0), [1., 1.])  # fails!

      The standard deviation is determined before missing values are
      replaced with the mean. Conclusion: rather have one transformer
      impute missing values and a second transformer do the scaling.
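The failing assertion can be reproduced without the class: the imputed 0 shrinks the observed spread of column 0 to sqrt(2/3) ≈ 0.816 rather than 1. A numpy-only sketch of the same steps:

```python
import numpy as np

X = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# Same steps as NaNGuessingScaler: scale with NaN-aware statistics,
# then replace the remaining NaN with 0 (the scaled column mean).
Xt = np.nan_to_num((X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0))

# Column 0 is now [1, 0, -1]: its std is sqrt(2/3), not the expected 1,
# because the imputed value was not part of the std computation.
std_after = np.std(Xt, axis=0)
```

This is exactly why the slide argues for the single-responsibility split: impute first, then scale the completed data, so the scaler's statistics describe what the estimator actually sees.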
  14. Composition using Pipelines: Chained transformations

      >>> pipeline = make_pipeline(t1, t2)
      Pipeline(…)

      fit:      X, y → T1.fit_transform → Xt → T2.fit_transform
      predict:  X    → T1.transform     → Xt → T2.transform
  15. Scikit-Learn offers transformers for data preparation

      StandardScaler: scale to µ=0, σ=1
      Imputer: replace missing values, e.g. by the column mean

       2  24    4  1          2  24  4.33  1
      11  11  NaN  1    →    11  11  4.33  1
       8  11   -2  0          8  11   -2   0
       8  11   11  1          8  11   11   1
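The 4.33 in the slide is simply the mean of the non-missing entries of that column: (4 − 2 + 11) / 3 ≈ 4.33. A numpy sketch of mean imputation by hand (in recent scikit-learn versions, Imputer was replaced by sklearn.impute.SimpleImputer, which does the same by default):

```python
import numpy as np

X = np.array([[ 2., 24.,     4., 1.],
              [11., 11., np.nan, 1.],
              [ 8., 11.,    -2., 0.],
              [ 8., 11.,    11., 1.]])

# Mean imputation: replace each NaN with its column's NaN-aware mean.
col_mean = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_mean, X)
# The NaN at row 1, column 2 becomes (4 - 2 + 11) / 3 ≈ 4.33
```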
  16. from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import Imputer, StandardScaler

      pipeline = make_pipeline(Imputer(), StandardScaler())
      X = np.c_[[15.7, 32.], [np.nan, 18.], [9.8, 31.]].T
      Xt = pipeline.fit_transform(X)
      np.testing.assert_allclose(np.std(Xt, axis=0), [1., 1.])  # success
  17. Combining transformers and the ML estimator

      from sklearn.svm import SVR
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import Imputer, StandardScaler
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_absolute_error

      X_train, X_test, y_train, y_test = train_test_split(X, y)
      est = make_pipeline(Imputer(), StandardScaler(), SVR())
      est.fit(X_train, y_train)
      y_pred = est.predict(X_test)
      mean_absolute_error(y_test, y_pred)  # 2.34
  18. Conclusions

      • Use fit/predict and fit/transform interfaces.
      • Apply the single-responsibility principle to estimators.
      • Think in small, testable units.
      • Have your complete trained model in a serialisable object.

      Outlook

      • Composition of transformers for feature construction (FeatureUnion)
      • Composition of estimators (GridSearchCV, etc.)
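The two outlook items can be sketched together. This is my own illustrative example, not from the talk: the step names, the toy data, and the parameter grid are assumptions, and recent scikit-learn versions replaced the talk-era Imputer with sklearn.impute.SimpleImputer:

```python
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer  # successor of the talk-era Imputer
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# FeatureUnion: run several transformers on X and concatenate their outputs,
# composing feature construction out of small, individually testable units.
features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=1)),
])
est = make_pipeline(SimpleImputer(), features, SVR())

# GridSearchCV treats the whole pipeline as one estimator and tunes nested
# parameters via the step__param naming convention.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X.sum(axis=1)
search = GridSearchCV(est, {"svr__C": [0.1, 1.0]}, cv=3)
search.fit(X, y)
```

Because the tuned object is the entire pipeline, search.best_estimator_ is the complete trained model in one serialisable object, exactly as the conclusions advocate.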