
Maintainable Code for Machine Learning Engineering

Kevin Lemagnen
September 21, 2019

Transcript

  1. What do we need next?
     • Convert this code to production-ready software
     • It needs to be easy to maintain and test
     • Things change quickly and often in Data Science
  2. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
  3. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
  4. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
  5. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to update parameters after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
        ◦ Can you run all the same steps easily on any new data, without repeating code?
  6. Autopsy of preprocessing code
     1. Load some data
     2. Drop some features
     3. “Clean” some features
     4. “Engineer” some features from original ones
     5. Apply One-Hot Encoding on categorical features
     6. Scale/transform some features
  7. Autopsy of preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if the transformation needs to remember a state? (one-hot, scaler, etc…)
  8. Example - One Hot Encoding
     Training set: we apply One Hot Encoding

     ID  Country
     1   UK
     2   Italy
     3   China

     ID  Country_UK  Country_Italy  Country_China
     1   1           0              0
     2   0           1              0
     3   0           0              1
  9. Example - One Hot Encoding
     Test set: if we apply One Hot Encoding again

     ID   Country
     117  France
     118  UK

     ID   Country_France  Country_UK
     117  1               0
     118  0               1

     Columns mismatch with the training set: [ UK, Italy, China ]
     Instead we need to memorise the transformation and apply the same one
  10. Example - One Hot Encoding
     Test set: if we apply One Hot Encoding again, the columns mismatch with the training set …
     Instead we need to memorise the transformation and apply the same one:

     ID   Country
     117  France
     118  UK

     ID   Country_UK  Country_Italy  Country_China
     117  0           0              0
     118  1           0              0
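     A minimal sketch of this fit-once, transform-everywhere idea using scikit-learn's OneHotEncoder (the single "Country" column and the data are illustrative, not taken from the deck):

     import pandas as pd
     from sklearn.preprocessing import OneHotEncoder

     train = pd.DataFrame({"Country": ["UK", "Italy", "China"]})
     test = pd.DataFrame({"Country": ["France", "UK"]})

     # Fit on the training data only: the encoder memorises the categories it saw
     encoder = OneHotEncoder(handle_unknown="ignore")
     encoder.fit(train)

     # transform() always produces the same columns (China, Italy, UK, sorted alphabetically);
     # an unseen category such as "France" becomes a row of zeros
     print(encoder.transform(test).toarray())
     # [[0. 0. 0.]
     #  [0. 0. 1.]]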
  11. Autopsy of preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if the transformation needs to remember a state? (one-hot, scaler, etc…) Create a reusable class!
     Scikit-learn transformers are a way to facilitate this and keep it consistent.
  12. Common Transformers
     In sklearn.preprocessing:
     • LabelEncoder
     • OneHotEncoder
     • StandardScaler
     All have a .fit() method to train the transformation on some data, and a .transform() to apply the same transformation on any new data.
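     A small sketch of that shared contract with StandardScaler (the data is made up for illustration):

     import numpy as np
     from sklearn.preprocessing import StandardScaler

     X_train = np.array([[1.0], [2.0], [3.0]])
     X_test = np.array([[4.0]])

     scaler = StandardScaler()
     scaler.fit(X_train)               # learns mean and std from the training data
     print(scaler.transform(X_train))  # standardised with the training statistics
     print(scaler.transform(X_test))   # the same statistics are reused on new data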
  13. Common Transformers
     • PCA also works as a transformer (.fit to learn a transformation, .transform to apply it on new data)
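     The same pattern, sketched with PCA (random data, purely illustrative):

     import numpy as np
     from sklearn.decomposition import PCA

     X_train = np.random.RandomState(0).rand(100, 5)
     X_new = np.random.RandomState(1).rand(10, 5)

     pca = PCA(n_components=2)
     pca.fit(X_train)                 # learns the projection from the training data
     X_new_2d = pca.transform(X_new)  # applies the same projection to new data
     print(X_new_2d.shape)            # (10, 2)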
  14. Build your own transformer
     • Need to extend both TransformerMixin and BaseEstimator
     • Implement a .fit with what needs to be saved from the training data
     • Implement a .transform which applies the actual transformation
     • If you need options, add parameters to an __init__
  15. Build your own transformer

     import numpy as np
     from sklearn.base import BaseEstimator, TransformerMixin

     class CustomScaler(TransformerMixin, BaseEstimator):
         def fit(self, X, y=None):
             self.median = np.median(X, axis=0)
             # interquartile range = 75th percentile - 25th percentile
             self.interquartile_range = (np.percentile(X, 75, axis=0)
                                         - np.percentile(X, 25, axis=0))
             return self

         def transform(self, X, y=None):
             Xt = (X - self.median) / self.interquartile_range
             return Xt
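     Once defined, it behaves like any built-in transformer; a quick sketch with made-up data:

     import numpy as np

     X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0]])
     X_test = np.array([[2.5, 25.0]])

     scaler = CustomScaler()
     scaler.fit(X_train)              # learns median and IQR from the training data
     print(scaler.transform(X_test))  # applies the same statistics to new data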
  16. Column Transformer
     • The transformers seen before take a column or a set of columns and generate new features
     • ColumnTransformer is a higher-level transformer used to apply different transformations to all your columns
     Example: apply PCA on numerical features, One Hot Encoding on categorical ones.
  17. Column Transformer

     num_cols = ["age", "salary"]
     cat_cols = ["country", "gender"]

     # List of transformations: ("name", Transformer, list_of_columns)
     preprocessor = ColumnTransformer([
         ("pca", PCA(), num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])
  18. Column Transformer
     Once you’ve created a column transformer object, you can apply the same transformations everywhere:
     • preprocessor.fit(X_train)
     • preprocessor.transform(X_train)
     • preprocessor.transform(X_test)
     • preprocessor.transform(X_new)
     It keeps in memory all the transformations applied to all columns.
  19. Pipeline
     • We’ve seen how to apply transformations in parallel
     • Sometimes you need to apply multiple sequential transformations to build new features
     Example:
     1. Scale data
     2. Apply PCA
  20. Pipeline

     # Define pipeline
     pipeline = Pipeline([
         ("scaler", StandardScaler()),
         ("pca", PCA()),
     ])

     # Calling .fit will “learn” both steps
     pipeline.fit(data)

     # Calling .transform will “apply” both steps sequentially
     pipeline.transform(data)
  21. Pipeline
     The resulting Pipeline object is a transformer: it has .fit and .transform.
     So ... it can be used together with other transformers such as ColumnTransformer.
  22. Pipeline

     # Define pipeline for numerical features
     num_pipeline = Pipeline([
         ("scaler", StandardScaler()),
         ("pca", PCA()),
     ])

     # Use ColumnTransformer to apply different transformations to all columns
     preprocessor = ColumnTransformer([
         ("numerical", num_pipeline, num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])
  23. Pipeline (again)
     Transformers, ColumnTransformer and Pipeline allow you to build:
     • Complex preprocessing code
     • In a modular way
       ◦ Each transformer == a step/an action
     We’re only missing the predictive model part … Or are we?
  24. Pipeline (again)
     • Scikit-learn Estimators have a .fit() to train the model and a .predict() to generate predictions.
     • Pipeline supports the last step being an Estimator object
     • When calling the pipeline:
       ◦ .fit(): fits all transformers and trains the predictive model
       ◦ .predict(): calls .transform() on the transformers and then .predict() on the model
  25. Pipeline (again)

     # Preprocessor object -- to transform data
     preprocessor = ColumnTransformer([
         ("numerical", num_pipeline, num_cols),
         ("one_hot", OneHotEncoder(), cat_cols),
     ])

     # Final pipeline contains the preprocessor and the algorithm
     pipeline = Pipeline([
         ("preprocessor", preprocessor),
         ("model", DecisionTreeClassifier()),
     ])
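     A sketch of how this full pipeline would then be used (X_train, y_train and X_test stand for your own data):

     # Fits every transformer on the training data, then trains the classifier
     pipeline.fit(X_train, y_train)

     # Applies the memorised transformations to new data, then predicts
     predictions = pipeline.predict(X_test)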
  26. Pipeline (again)
     • Your model isn’t just an algorithm but preprocessing steps + algorithm
     • Can be used directly in GridSearch, Cross Validation, etc… (see the sketch below)
       ◦ Avoids data leakage (all steps run on the right folds only)
     • Can apply the same pipeline to new data
     • More modular -> easier to test and maintain
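     A minimal sketch of a cross-validated hyperparameter search over the whole pipeline, using the step names defined above (the parameter grid values are illustrative):

     from sklearn.model_selection import GridSearchCV, cross_val_score

     # Parameters are addressed as "<step_name>__<param>", so preprocessing
     # and model hyperparameters can be tuned together, leakage-free
     param_grid = {
         "preprocessor__numerical__pca__n_components": [2, 5],
         "model__max_depth": [3, 5, 10],
     }

     search = GridSearchCV(pipeline, param_grid, cv=5)
     search.fit(X_train, y_train)   # each fold re-fits all transformers
     print(search.best_params_)

     # Plain cross-validation works the same way
     print(cross_val_score(pipeline, X_train, y_train, cv=5))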
  27. What’s next?
     • Hyperparameters can be defined in a config file
       ◦ Includes params of the estimator AND the preprocessing
     • We can train the model and dump the whole object to disk
     • Once the model is trained, you can efficiently load and use it in production
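     One possible sketch of such a config file, assuming a config.py module and the pipeline defined earlier (the file layout, names and values are hypothetical, not the project's actual code):

     # config.py
     PARAMS = {
         "preprocessor__numerical__pca__n_components": 5,
         "model__max_depth": 5,
         "model__min_samples_leaf": 10,
     }

     # run.py -- apply the config to the pipeline before training
     from config import PARAMS

     pipeline.set_params(**PARAMS)
     pipeline.fit(X_train, y_train)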
  28. Pickle & Joblib In Python, there are two main libraries

    to “serialise” objects: • Pickle • Joblib Those work with any Python object and are mostly equivalent. Joblib tends to be more efficient with larger arrays.
  29. Joblib

     import joblib

     # Save model
     joblib.dump(model_object, 'model.joblib')

     # Load model
     new_model = joblib.load('model.joblib')
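     Dumping the whole pipeline rather than only the estimator means the loaded object works end-to-end; a sketch, assuming the pipeline from earlier and some new raw data X_new:

     # Persist the full pipeline: preprocessing + model in one object
     joblib.dump(pipeline, 'model.joblib')

     # Later, in production: load once and predict on raw, untransformed data
     loaded_pipeline = joblib.load('model.joblib')
     predictions = loaded_pipeline.predict(X_new)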
  30. Pickle & Joblib - Limitations Pickle and joblib are really

    useful to save trained models, but keep in mind: • It does not save dependencies. ◦ To load your model, you need to import the code it relies on ▪ external libraries, your own classes definition, etc... • Data is not saved, so if you need to retrain the model make sure you keep a snapshot.
  31. Give it a spin...
     • Cross-validate the model with `python run.py crossval`
       ◦ This will use the parameters defined in config.py
     • Train the model with `python run.py train`
       ◦ This uses the parameters defined in config.py
     • Test the model with `python run.py test`
  32. Further...
     • Make sure you always load the data with the same dtypes
     • Have a requirements.txt with the versions of the libraries you use
     • When updating code, check the crossval score to confirm it improved performance
     • With modular code, you can more easily write tests! (and you should -- see the sketch below)
     • With all that, updating your models should be less stressful.
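     For instance, a small hypothetical pytest-style test for the CustomScaler defined earlier (a sketch, not part of the original deck):

     import numpy as np

     def test_custom_scaler_centres_on_median():
         X_train = np.array([[1.0], [2.0], [3.0], [100.0]])

         scaler = CustomScaler()
         scaler.fit(X_train)

         # The median of the training data should be mapped to zero
         median_row = np.array([[np.median(X_train)]])
         assert np.allclose(scaler.transform(median_row), 0.0)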