
Maintainable Code for Machine Learning Engineering

Kevin Lemagnen
September 21, 2019

Transcript

  1. What do we need next?
     • Convert this code to production-ready software
     • It needs to be easy to maintain and test
     • Things change quickly and often in Data Science
  2. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
  3. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
  4. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
  5. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
        ◦ Can you run all the same steps easily on any new data, without repeating code?
  6. Autopsy of preprocessing code
     1. Load some data
     2. Drop some features
     3. “Clean” some features
     4. “Engineer” some features from original ones
     5. Apply One-Hot Encoding on categorical features
     6. Scale/transform some features
  7. Autopsy of preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if the transformation needs to remember a state? (one-hot, scaler, etc…)
  8. Example - One Hot Encoding
     Training set: we apply One Hot Encoding

     ID  Country
     1   UK
     2   Italy
     3   China

     ID  Country_UK  Country_Italy  Country_China
     1   1           0              0
     2   0           1              0
     3   0           0              1
  9. Example - One Hot Encoding
     Test set: if we apply One Hot Encoding again

     ID   Country
     117  France
     118  UK

     ID   Country_France  Country_UK
     117  1               0
     118  0               1

     Columns mismatch with the training set: [ UK, Italy, China ]
     Instead we need to memorise the transformation and apply the same one
  10. Example - One Hot Encoding
     Test set: if we apply One Hot Encoding again, the columns mismatch with the training set …
     Instead we need to memorise the transformation and apply the same one:

     ID   Country
     117  France
     118  UK

     ID   Country_UK  Country_Italy  Country_China
     117  0           0              0
     118  1           0              0
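     A minimal sketch of this fit-once, transform-everywhere idea using scikit-learn's OneHotEncoder (the single "Country" column and the data are illustrative, not taken from the deck):

     import pandas as pd
     from sklearn.preprocessing import OneHotEncoder

     train = pd.DataFrame({"Country": ["UK", "Italy", "China"]})
     test = pd.DataFrame({"Country": ["France", "UK"]})

     # Fit on the training data only: the encoder memorises the categories it saw
     encoder = OneHotEncoder(handle_unknown="ignore")
     encoder.fit(train)

     # transform() always produces the same columns (China, Italy, UK, sorted alphabetically);
     # an unseen category such as "France" becomes a row of zeros
     print(encoder.transform(test).toarray())
     # [[0. 0. 0.]
     #  [0. 0. 1.]]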
  11. Autopsy of preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if the transformation needs to remember a state? (one-hot, scaler, etc…) Create a reusable class!
     Scikit-learn transformers are a way to facilitate this and keep it consistent.
  12. Common Transformers
     In sklearn.preprocessing:
     • LabelEncoder
     • OneHotEncoder
     • StandardScaler
     All have a .fit() method to train the transformation on some data, and a .transform() to apply the same transformation on any new data.
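     A small sketch of that shared contract with StandardScaler (the data is made up for illustration):

     import numpy as np
     from sklearn.preprocessing import StandardScaler

     X_train = np.array([[1.0], [2.0], [3.0]])
     X_test = np.array([[4.0]])

     scaler = StandardScaler()
     scaler.fit(X_train)               # learns mean and std from the training data
     print(scaler.transform(X_train))  # standardised with the training statistics
     print(scaler.transform(X_test))   # the same statistics are reused on new data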
  13. Common Transformers
     • PCA also works as a transformer (.fit to learn a transformation, .transform to apply it on new data)
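     The same pattern, sketched with PCA (random data, purely illustrative):

     import numpy as np
     from sklearn.decomposition import PCA

     X_train = np.random.RandomState(0).rand(100, 5)
     X_new = np.random.RandomState(1).rand(10, 5)

     pca = PCA(n_components=2)
     pca.fit(X_train)                 # learns the projection from the training data
     X_new_2d = pca.transform(X_new)  # applies the same projection to new data
     print(X_new_2d.shape)            # (10, 2)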
  14. Build your own transformer
     • Need to extend both TransformerMixin and BaseEstimator
     • Implement a .fit with what needs to be saved from the training data
     • Implement a .transform which applies the actual transformation
     • If you need options, add parameters to an __init__
  15. Build your own transformer

     import numpy as np
     from sklearn.base import BaseEstimator, TransformerMixin

     class CustomScaler(TransformerMixin, BaseEstimator):
         def fit(self, X, y=None):
             self.median = np.median(X, axis=0)
             # interquartile range = 75th percentile - 25th percentile
             self.interquartile_range = (np.percentile(X, 75, axis=0)
                                         - np.percentile(X, 25, axis=0))
             return self

         def transform(self, X, y=None):
             Xt = (X - self.median) / self.interquartile_range
             return Xt
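     Once defined, it behaves like any built-in transformer; a quick sketch with made-up data:

     import numpy as np

     X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]])
     X_test = np.array([[2.5, 25.0]])

     scaler = CustomScaler()
     scaler.fit(X_train)              # learns median and IQR from the training data
     print(scaler.transform(X_test))  # applies the same statistics to new data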
  16. Column Transformer
     • The transformers seen before take a column or a set of columns and generate new features
     • ColumnTransformer is a higher-level transformer used to apply different transformations to all your columns
     Example: apply PCA on numerical features, One Hot Encoding on categorical ones.
  17. Column Transformer

     num_cols = ["age", "salary"]
     cat_cols = ["country", "gender"]

     # List of transformations: ("name", Transformer, list_of_columns)
     preprocessor = ColumnTransformer([
         ("pca", PCA(), num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])
  18. Column Transformer
     Once you’ve created a column transformer object, you can apply the same transformations everywhere:
     • preprocessor.fit(X_train)
     • preprocessor.transform(X_train)
     • preprocessor.transform(X_test)
     • preprocessor.transform(X_new)
     It keeps in memory all the transformations applied to all columns.
  19. Pipeline
     • We’ve seen how to apply transformations in parallel
     • Sometimes you need to apply multiple sequential transformations to build new features
     Example:
     1. Scale data
     2. Apply PCA
  20. Pipeline

     # Define pipeline
     pipeline = Pipeline([
         ("scaler", StandardScaler()),
         ("pca", PCA()),
     ])

     # Calling .fit will “learn” both steps
     pipeline.fit(data)

     # Calling .transform will “apply” both steps sequentially
     pipeline.transform(data)
  21. Pipeline
     The resulting Pipeline object is a transformer: it has .fit and .transform.
     So ... it can be used together with other transformers such as ColumnTransformer.
  22. Pipeline

     # Define pipeline for numerical features
     num_pipeline = Pipeline([
         ("scaler", StandardScaler()),
         ("pca", PCA()),
     ])

     # Use ColumnTransformer to apply different transformations to all columns
     preprocessor = ColumnTransformer([
         ("numerical", num_pipeline, num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])
  23. Pipeline (again)
     Transformers, ColumnTransformer and Pipeline allow you to build:
     • Complex preprocessing code
     • In a modular way
       ◦ Each transformer == a step/an action
     We’re only missing the predictive model part … Or are we?
  24. Pipeline (again)
     • Scikit-learn Estimators have a .fit() to train the model and a .predict() to generate predictions.
     • Pipeline supports the last step being an Estimator object
     • When calling the pipeline:
       ◦ .fit(): fits all transformers and trains the predictive model
       ◦ .predict(): calls .transform() on the transformers and then .predict() on the model
  25. Pipeline (again)

     # Preprocessor object -- to transform data
     preprocessor = ColumnTransformer([
         ("numerical", num_pipeline, num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])

     # Final pipeline contains the preprocessor and the algorithm
     pipeline = Pipeline([
         ("preprocessor", preprocessor),
         ("model", DecisionTreeClassifier()),
     ])
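     A sketch of how this full pipeline would then be used (X_train, y_train and X_test stand for your own data):

     # Fits every transformer on the training data, then trains the classifier
     pipeline.fit(X_train, y_train)

     # Applies the memorised transformations to new data, then predicts
     predictions = pipeline.predict(X_test)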
  26. Pipeline (again)
     • Your model isn’t just an algorithm but preprocessing steps + algorithm
     • Can be used directly in GridSearch, Cross Validation, etc… (see the sketch below)
       ◦ Avoids data leakage (all steps run on the right folds only)
     • Can apply the same pipeline to new data
     • More modular -> easier to test and maintain
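     A minimal sketch of a cross-validated hyperparameter search over the whole pipeline, using the step names defined above (the parameter grid values are illustrative):

     from sklearn.model_selection import GridSearchCV, cross_val_score

     # Parameters are addressed as "<step_name>__<param>", so preprocessing
     # and model hyperparameters can be tuned together, leakage-free
     param_grid = {
         "preprocessor__numerical__pca__n_components": [2, 5],
         "model__max_depth": [3, 5, 10],
     }

     search = GridSearchCV(pipeline, param_grid, cv=5)
     search.fit(X_train, y_train)   # each fold re-fits all transformers
     print(search.best_params_)

     # Plain cross-validation works the same way
     print(cross_val_score(pipeline, X_train, y_train, cv=5))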
  27. What’s next?
     • Hyperparameters can be defined in a config file
       ◦ Includes params of the estimator AND the preprocessing
     • We can train the model and dump the whole object to disk
     • Once the model is trained, you can efficiently load and use it in production
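     One possible sketch of such a config file, assuming a config.py module and the pipeline defined earlier (the file layout, names and values are hypothetical, not the project's actual code):

     # config.py
     PARAMS = {
         "preprocessor__numerical__pca__n_components": 5,
         "model__max_depth": 5,
         "model__min_samples_leaf": 10,
     }

     # run.py -- apply the config to the pipeline before training
     from config import PARAMS

     pipeline.set_params(**PARAMS)
     pipeline.fit(X_train, y_train)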
  28. Pickle & Joblib In Python, there are two main libraries

    to “serialise” objects: • Pickle • Joblib Those work with any Python object and are mostly equivalent. Joblib tends to be more efficient with larger arrays.
  29. Joblib

     import joblib

     # Save model
     joblib.dump(model_object, 'model.joblib')

     # Load model
     new_model = joblib.load('model.joblib')
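     Dumping the whole pipeline rather than only the estimator means the loaded object works end-to-end; a sketch, assuming the pipeline from earlier and some new raw data X_new:

     # Persist the full pipeline: preprocessing + model in one object
     joblib.dump(pipeline, 'model.joblib')

     # Later, in production: load once and predict on raw, untransformed data
     loaded_pipeline = joblib.load('model.joblib')
     predictions = loaded_pipeline.predict(X_new)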
  30. Pickle & Joblib - Limitations Pickle and joblib are really

    useful to save trained models, but keep in mind: • It does not save dependencies. ◦ To load your model, you need to import the code it relies on ▪ external libraries, your own classes definition, etc... • Data is not saved, so if you need to retrain the model make sure you keep a snapshot.
  31. Give it a spin...
     • Cross-validate the model with `python run.py crossval`
       ◦ This will use the parameters defined in config.py
     • Train the model with `python run.py train`
       ◦ This uses the parameters defined in config.py
     • Test the model with `python run.py test`
  32. Further...
     • Make sure you always load the data with the same dtypes
     • Have a requirements.txt with the versions of the libraries you use
     • When updating code, check the crossval score to confirm it improved performance
     • With modular code, you can more easily write tests! (and you should -- see the sketch below)
     • With all that, updating your models should be less stressful.
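     For instance, a small hypothetical pytest-style test for the CustomScaler defined earlier (a sketch, not part of the original deck):

     import numpy as np

     def test_custom_scaler_centres_on_median():
         X_train = np.array([[1.0], [2.0], [3.0], [100.0]])

         scaler = CustomScaler()
         scaler.fit(X_train)

         # The median of the training data should be mapped to zero
         median_row = np.array([[np.median(X_train)]])
         assert np.allclose(scaler.transform(median_row), 0.0)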