
Maintainable Code for Data Science

Kevin Lemagnen

July 12, 2019



Transcript

  1. Single Responsibility Principle
     “Every module, class, or function should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class.”
     • Each class/function should do one thing for our model
     • In Data Science we have extra constraints
  2. Pieces of a Data Science Project
     1. Process data / engineer new features
     2. Tune hyperparameters
     3. Train model
     4. Generate predictions on new data
  3. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
     3. Train model
     4. Generate predictions on new data
  4. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
     4. Generate predictions on new data
  5. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
  6. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
        ◦ Can you run all the same steps easily on any new data, without repeating code?
  7. Autopsy of common preprocessing code
     1. Load some data
     2. Split data into train/test sets
     3. Drop some features
     4. “Clean” some features
     5. “Engineer” some features from the original ones
     6. Apply One-Hot Encoding to categorical features
     7. Scale/transform some features
  8. Autopsy of common preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if a transformation needs to remember a state (one-hot encoding, scaler, etc.)?
  9. Example - One Hot Encoding
     Training set: we apply One Hot Encoding

     ID  Country
     1   UK
     2   Italy
     3   China

     becomes:

     ID  Country_UK  Country_Italy  Country_China
     1   1           0              0
     2   0           1              0
     3   0           0              1
  10. Example - One Hot Encoding
      Test set: if we apply One Hot Encoding again

      ID   Country
      117  France
      118  UK

      becomes:

      ID   Country_France  Country_UK
      117  1               0
      118  0               1

      Columns mismatch with the training set: [UK, Italy, China]
      Instead, we need to memorise the transformation and apply the same one.
  11. Example - One Hot Encoding
      Test set: the columns mismatch the training set, so instead we memorise the transformation learnt on the training set and apply the same one

      ID   Country
      117  France
      118  UK

      becomes (same columns as the training set):

      ID   Country_UK  Country_Italy  Country_China
      117  0           0              0
      118  1           0              0
  12. Autopsy of common preprocessing code
      How do you preprocess test/new data? Create a reusable function!
      What if a transformation needs to remember a state (one-hot encoding, scaler, etc.)? Create a reusable class!
      Scikit-learn transformers are a way to facilitate this and keep it consistent.
  13. Common Transformers
      In sklearn.preprocessing:
      • LabelEncoder
      • OneHotEncoder
      • StandardScaler
      All have a .fit() method to train the transformation on some data, and a .transform() method to apply the same transformation on any new data.
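      For instance, a minimal sketch of this fit/transform pattern applied to the Country example from the earlier slides (the data here is made up for illustration):

          import pandas as pd
          from sklearn.preprocessing import OneHotEncoder

          train = pd.DataFrame({"Country": ["UK", "Italy", "China"]})
          test = pd.DataFrame({"Country": ["France", "UK"]})

          # Learn the categories on the training set only
          encoder = OneHotEncoder(handle_unknown="ignore")
          encoder.fit(train[["Country"]])

          # The test set gets the same three columns learnt from the training set;
          # the unseen category "France" becomes a row of zeros
          print(encoder.transform(test[["Country"]]).toarray())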
  14. Common Transformers
      • PCA also works as a transformer (.fit to learn a transformation, .transform to apply it on new data)
      • FeatureUnion allows you to apply multiple transformers to one column to generate multiple features
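      A minimal sketch of FeatureUnion (the transformer choices and data shapes here are only illustrative):

          import numpy as np
          from sklearn.pipeline import FeatureUnion
          from sklearn.decomposition import PCA
          from sklearn.preprocessing import StandardScaler

          # Apply two transformers to the same input and concatenate their outputs
          union = FeatureUnion([
              ("pca", PCA(n_components=2)),
              ("scaled", StandardScaler()),
          ])

          X = np.random.rand(100, 5)
          Xt = union.fit_transform(X)  # 2 PCA components + 5 scaled columns = 7 features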
  15. Build your own transformer
      • Need to extend both TransformerMixin and BaseEstimator
      • Implement a .fit() with what needs to be saved from the training data
      • Implement a .transform() which applies the actual transformation
      • If you need options, add parameters to an __init__
  16. Build your own transformer

      import numpy as np
      from sklearn.base import BaseEstimator, TransformerMixin

      class CustomScaler(TransformerMixin, BaseEstimator):
          def fit(self, X, y=None):
              # Remember the statistics learnt from the training data
              self.median = np.median(X, axis=0)
              # Interquartile range: Q3 - Q1
              self.interquartile_range = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
              return self

          def transform(self, X, y=None):
              # Apply the same robust scaling to any new data
              Xt = (X - self.median) / self.interquartile_range
              return Xt
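      It can then be used like any built-in transformer (X_train and X_test are assumed to be numeric arrays):

          scaler = CustomScaler()
          scaler.fit(X_train)                    # learns median and IQR from the training data
          X_train_t = scaler.transform(X_train)
          X_test_t = scaler.transform(X_test)    # the same statistics are reused on the test set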
  17. Column Transformer
      • The transformers seen before take a column or a set of columns and generate new features
      • ColumnTransformer is a higher-level transformer used to apply different transformations on all your columns
      Example: apply PCA on numerical features, One Hot Encoding on categorical ones.
  18. Column Transformer

      from sklearn.compose import ColumnTransformer
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import OneHotEncoder

      num_cols = ["age", "salary"]
      cat_cols = ["country", "gender"]

      # List of transformations: ("name", Transformer, list_of_columns)
      preprocessor = ColumnTransformer([
          ("pca", PCA(), num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])
  19. Column Transformer
      Once you’ve created a ColumnTransformer object, you can apply the same transformations everywhere:
      • preprocessor.fit(X_train)
      • preprocessor.transform(X_train)
      • preprocessor.transform(X_test)
      • preprocessor.transform(X_new)
      It keeps in memory all unique transformations applied to all columns.
  20. Pipeline
      • We’ve seen how to apply transformations in parallel
      • Sometimes you need to apply multiple sequential transformations to build new features
      Example: 1. Scale data  2. Apply PCA
  21. Pipeline

      from sklearn.pipeline import Pipeline

      # Define pipeline
      pipeline = Pipeline([
          ("scaler", StandardScaler()),
          ("pca", PCA()),
      ])

      # Calling .fit will “learn” both steps
      pipeline.fit(data)

      # Calling .transform will “apply” both steps sequentially
      pipeline.transform(data)
  22. Pipeline
      The resulting Pipeline object is itself a transformer: it has .fit and .transform.
      So ... it can be used together with other transformers, such as ColumnTransformer.
  23. Pipeline

      # Define a pipeline for numerical features
      num_pipeline = Pipeline([
          ("scaler", StandardScaler()),
          ("pca", PCA()),
      ])

      # Use ColumnTransformer to apply different transformations to all columns
      preprocessor = ColumnTransformer([
          ("numerical", num_pipeline, num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])
  24. Pipeline
      Transformers together with ColumnTransformer and Pipeline allow you to build:
      • Complex preprocessing code
      • In a modular way
        ◦ Each transformer == a step/an action
      We’re only missing the predictive model part … Or are we?
  25. Pipeline
      • Scikit-learn predictive models have a .fit() to train the model and a .predict() to generate predictions
      • Pipeline supports the last step being a predictive model object
      • When calling the pipeline:
        ◦ .fit() fits all transformers and trains the predictive model
        ◦ .predict() calls .transform() on the transformers and then .predict() on the model
  26. Pipeline

      from sklearn.tree import DecisionTreeClassifier

      # Preprocessor object -- to transform the data
      preprocessor = ColumnTransformer([
          ("numerical", num_pipeline, num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])

      # Final pipeline contains the preprocessor and the algorithm
      pipeline = Pipeline([
          ("preprocessor", preprocessor),
          ("model", DecisionTreeClassifier()),
      ])
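      The full pipeline can then be trained and used like a single model (X_train, y_train and X_new are assumed to be defined):

          pipeline.fit(X_train, y_train)         # fits all transformers, then trains the classifier
          predictions = pipeline.predict(X_new)  # applies the same preprocessing, then predicts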
  27. Pipeline
      • Your model isn’t just an algorithm but preprocessing steps + algorithm
      • Can be used directly in GridSearch, Cross Validation, etc. (see the sketch below)
        ◦ Avoids data leakage
      • Can apply the same pipeline to new data
      • More modular -> easier to test and maintain
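      As a sketch, tuning the pipeline from the previous slide with GridSearchCV; nested parameters are reached with double underscores, and the grid values here are only illustrative:

          from sklearn.model_selection import GridSearchCV

          # <step>__<nested step>__<param> addresses parameters inside the pipeline
          param_grid = {
              "preprocessor__numerical__pca__n_components": [2, 5],
              "model__max_depth": [3, 5, 10],
          }

          search = GridSearchCV(pipeline, param_grid, cv=5)
          search.fit(X_train, y_train)   # preprocessing is re-fit inside each CV fold, so no leakage
          best_pipeline = search.best_estimator_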
  28. Model Persistence
      • Training a model can take a while
      • It is useful to be able to save it pre-trained to disk
      • Once the model is trained, you can efficiently load and use it in production
  29. Pickle & Joblib In Python, there are two main libraries

    to “serialise” objects: • Pickle • Joblib Those work with any Python object and are mostly equivalent. Joblib tends to be more efficient with larger arrays.
  30. Pickle

      import pickle

      # Open a file in write binary mode and save the model
      with open("model.pickle", "wb") as f:
          pickle.dump(model_object, f)

      # Open our pickled model
      with open("model.pickle", "rb") as f:
          new_model = pickle.load(f)
  31. Joblib

      import joblib

      # Save model
      joblib.dump(model_object, "model.joblib")

      # Load model
      new_model = joblib.load("model.joblib")
  32. Pickle & Joblib - Limitations Pickle and joblib are really

    useful to save trained models, but keep in mind: • It does not save dependencies. ◦ To load your model, you need to import the code it relies on ▪ external libraries, your own classes definition, etc... • Data is not saved, so if you need to retrain the model make sure you keep a snapshot.
  33. Pickle & Joblib - Limitations When using pickle and joblib,

    keep in mind: • Dependencies/versions are important. ◦ If a model is saved with a version of a library, make sure you load with the same version • You can pickle any python object, hence do not open a pickle that you do not trust ◦ It could contain anything
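      One lightweight way to guard against version mismatches (this pattern is not from the talk, just an illustration) is to store the library version next to the model:

          import joblib
          import sklearn

          # Save the scikit-learn version alongside the trained model
          joblib.dump({"model": model_object, "sklearn_version": sklearn.__version__}, "model.joblib")

          # At load time, check that the versions match before using the model
          bundle = joblib.load("model.joblib")
          if bundle["sklearn_version"] != sklearn.__version__:
              raise RuntimeError("Model was trained with a different scikit-learn version")
          new_model = bundle["model"]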
  34. Try the code...
      • Tune the model with `python run.py tune`
        ◦ This will use the parameters defined in config.py
      • Train the model with `python run.py train`
        ◦ This uses the parameters defined in config.py
      • Test the model with `python run.py test`
  35. Further...
      • Make sure you always load the data with the same dtypes (see the sketch below)
      • Have a requirements.txt with the versions of the libraries you use
      • If making incremental changes, make sure you check the tuning CV score
        ◦ Not the one on the test set
      • With modular code, you can more easily write tests! (and you should)
      • With all that, updating your models should be less stressful.
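      For the first point, one common approach is to pin the dtypes explicitly when loading (the file name and columns here are purely illustrative):

          import pandas as pd

          # Every load of the data uses exactly the same dtypes
          dtypes = {"age": "int64", "salary": "float64", "country": "category", "gender": "category"}
          data = pd.read_csv("data.csv", dtype=dtypes)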