
Testable ML Data Science

Talk at EuroPython 2015 on why and how to use Scikit-Learn's Transformer and Estimator interface.

Holger Peters

July 21, 2015

Transcript

  1. Testable ML Data Science
     How to make numeric code testable using Scikit-Learn's interfaces.
     EuroPython 2015, Bilbao
     Holger Peters | @data_hope
     http://flic.kr/p/71Ewbs (CC)
  2. Supervised Machine Learning

     features                                                    target
     season  holiday  weekday  weather  temperature  windspeed | counts
        4       0        5      clear      0.60        0.10    |  8156
        4       0        6      rain       0.55        0.26    |  7965
        4       0        0      clear      0.44        0.14    |  3510
        4       1        1      rain       0.38        0.18    |  5478

     known data for estimation
  3. Training a Machine Learning Model

     est.fit(X, y) → est     train estimator using X, y
     est.predict(X) → y      estimate target from feature data

     We train the model using features & known target values; we estimate
     the target based on feature values only.
  4. Training a model

     from sklearn.svm import SVR
     from sklearn.model_selection import train_test_split
     from sklearn.metrics import mean_absolute_error

     X_train, X_test, y_train, y_test = train_test_split(X, y)
     est = SVR()
     est.fit(X_train, y_train)
     y_pred = est.predict(X_test)
     mean_absolute_error(y_test, y_pred)  # 2.34

     Use a simple metric to obtain the prediction quality.
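The slide assumes X and y are already loaded. A minimal runnable sketch of the same workflow; the synthetic regression data and the fixed random seed are my additions, standing in for the bike-sharing dataset from the talk:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 200 samples, 3 features, a noisy linear target.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
est = SVR()
est.fit(X_train, y_train)
y_pred = est.predict(X_test)

# One number summarising prediction quality on held-out data.
mae = mean_absolute_error(y_test, y_pred)
```

Keeping the score computation on a held-out test set, as here, is what makes the metric an honest estimate rather than a measurement of memorisation.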
  5. Preparing Data For The Estimator

     … because usually it is not enough to just feed data into the
     estimator. http://flic.kr/p/b8fA22 (CC license)
  6. Preparing data for the estimator

     x -= np.mean(x, axis=0)
     x /= np.std(x, axis=0)

     For some estimators, we need to scale each feature to mean µ=0 and
     standard deviation σ=1. But what about missing values? Suppose we
     have NaN values in x.
  7. Test

     x = array([[ 15.7,  32. ],
                [  nan,  18. ],
                [  9.8,  31. ]])
     x -= np.mean(x, axis=0)
     x /= np.std(x, axis=0)

     Results in:
     array([[ nan,  0.78406256],
            [ nan, -1.41131261],
            [ nan,  0.62725005]])
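The all-NaN first column appears because NumPy's plain mean and std propagate NaN: a single missing value poisons the whole column's statistics. A tiny check, using the numbers from the slide:

```python
import numpy as np

x = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# np.mean propagates NaN: any NaN in a column makes that column's mean NaN,
# while the clean second column averages to (32 + 18 + 31) / 3 = 27.
col_means = np.mean(x, axis=0)
```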
  8. Preparing data for the estimator

     mean = np.nanmean(x, axis=0)
     std = np.nanstd(x, axis=0)
     x -= mean
     x /= std
     x = np.nan_to_num(x)

     Let's add treatment for missing values (encoded as NaN).
  9. How does this transform our data?

     array([[ 15.7,  32. ],
            [  nan,  18. ],
            [  9.8,  31. ]])

     Results in:
     array([[ 1. ,  0.78406256],
            [ 0. , -1.41131261],
            [-1. ,  0.62725005]])

     This works for the fit, but how do we prepare for the predict?
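The slide's numbers can be verified directly; this sketch repeats the nanmean/nanstd/nan_to_num steps on the same array:

```python
import numpy as np

x = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# nanmean/nanstd ignore missing entries; nan_to_num then maps the remaining
# NaN to 0, which after scaling is exactly the column mean.
x = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
x = np.nan_to_num(x)
```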
  10. How can we do this for both fit and predict? Transformers!
      http://flic.kr/p/rLvEXy (CC license)
  11. Transformers

      est.fit(X, y) → est             train transformer using X, y
      est.transform(X) → X'           transform feature matrix
      est.fit_transform(X, y) → X'    train & transform

      Transformers are all about the feature matrix.
  12. Let us rephrase our preprocessing as a transformer

      from sklearn.base import BaseEstimator, TransformerMixin

      class NaNGuessingScaler(BaseEstimator, TransformerMixin):
          def fit(self, X, y=None):
              self.mean_ = np.nanmean(X, axis=0)
              self.std_ = np.nanstd(X, axis=0)
              return self

          def transform(self, X, y=None):
              Xt = X - self.mean_
              Xt /= self.std_
              return np.nan_to_num(Xt)
  13. Testing our transformer

      def test_nan_guessing_scaler():
          X = np.array([[ 15.7,  32. ],
                        [np.nan, 18. ],
                        [  9.8,  31. ]])
          scaler = NaNGuessingScaler()
          Xt = scaler.fit_transform(X)
          np.testing.assert_allclose(np.std(Xt, axis=0), [1., 1.])  # fails!

      The standard deviation is determined before missing values are
      replaced with the mean. Conclusion: rather have one transformer
      impute missing values and a second transformer do the scaling.
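The failing assertion can be reproduced without the class: the imputed 0 shrinks the observed spread of column 0 to sqrt(2/3) ≈ 0.816 rather than 1. A numpy-only sketch of the same steps:

```python
import numpy as np

X = np.array([[15.7, 32.0],
              [np.nan, 18.0],
              [9.8, 31.0]])

# Same steps as NaNGuessingScaler: scale with NaN-aware statistics,
# then replace the remaining NaN with 0 (the scaled column mean).
Xt = np.nan_to_num((X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0))

# Column 0 is now [1, 0, -1]: its std is sqrt(2/3), not the expected 1,
# because the imputed value was not part of the std computation.
std_after = np.std(Xt, axis=0)
```

This is exactly why the slide argues for the single-responsibility split: impute first, then scale the completed data, so the scaler's statistics describe what the estimator actually sees.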
  14. Composition using Pipelines: Chained transformations

      >>> pipeline = make_pipeline(t1, t2)
      Pipeline(…)

      fit:      X, y → T1.fit_transform → Xt → T2.fit_transform
      predict:  X    → T1.transform     → Xt → T2.transform
  15. Scikit-Learn offers transformers for data preparation

      StandardScaler: scale to µ=0, σ=1
      Imputer: replace missing values, e.g. by the column mean

       2  24    4  1          2  24  4.33  1
      11  11  NaN  1    →    11  11  4.33  1
       8  11   -2  0          8  11   -2   0
       8  11   11  1          8  11   11   1
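The 4.33 in the slide is simply the mean of the non-missing entries of that column: (4 − 2 + 11) / 3 ≈ 4.33. A numpy sketch of mean imputation by hand (in recent scikit-learn versions, Imputer was replaced by sklearn.impute.SimpleImputer, which does the same by default):

```python
import numpy as np

X = np.array([[ 2., 24.,     4., 1.],
              [11., 11., np.nan, 1.],
              [ 8., 11.,    -2., 0.],
              [ 8., 11.,    11., 1.]])

# Mean imputation: replace each NaN with its column's NaN-aware mean.
col_mean = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_mean, X)
# The NaN at row 1, column 2 becomes (4 - 2 + 11) / 3 ≈ 4.33
```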
  16. from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import Imputer, StandardScaler

      pipeline = make_pipeline(Imputer(), StandardScaler())
      X = np.c_[[15.7, 32.], [np.nan, 18.], [9.8, 31.]].T
      Xt = pipeline.fit_transform(X)
      np.testing.assert_allclose(np.std(Xt, axis=0), [1., 1.])  # success
  17. Combining transformers and the ML estimator

      from sklearn.svm import SVR
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import Imputer, StandardScaler
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_absolute_error

      X_train, X_test, y_train, y_test = train_test_split(X, y)
      est = make_pipeline(Imputer(), StandardScaler(), SVR())
      est.fit(X_train, y_train)
      y_pred = est.predict(X_test)
      mean_absolute_error(y_test, y_pred)  # 2.34
  18. Conclusions

      • Use fit/predict and fit/transform interfaces.
      • Apply the single-responsibility principle to estimators.
      • Think in small, testable units.
      • Have your complete trained model in a serialisable object.

      Outlook

      • Composition of transformers for feature construction (FeatureUnion)
      • Composition of estimators (GridSearchCV, etc.)
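The two outlook items can be sketched together. This is my own illustrative example, not from the talk: the step names, the toy data, and the parameter grid are assumptions, and recent scikit-learn versions replaced the talk-era Imputer with sklearn.impute.SimpleImputer:

```python
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer  # successor of the talk-era Imputer
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# FeatureUnion: run several transformers on X and concatenate their outputs,
# composing feature construction out of small, individually testable units.
features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=1)),
])
est = make_pipeline(SimpleImputer(), features, SVR())

# GridSearchCV treats the whole pipeline as one estimator and tunes nested
# parameters via the step__param naming convention.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X.sum(axis=1)
search = GridSearchCV(est, {"svr__C": [0.1, 1.0]}, cv=3)
search.fit(X, y)
```

Because the tuned object is the entire pipeline, search.best_estimator_ is the complete trained model in one serialisable object, exactly as the conclusions advocate.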