
Maintainable Code for Data Science

Kevin Lemagnen

July 12, 2019



Transcript

  1. Single Responsibility Principle
     “Every module, class, or function should have responsibility over a single part of the functionality provided by the software, and that responsibility should be entirely encapsulated by the class.”
     • Each class/function should do one thing for our model
     • In Data Science we have extra constraints
  2. Pieces of a Data Science Project
     1. Process data / engineer new features
     2. Tune hyperparameters
     3. Train model
     4. Generate predictions on new data
  3. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
     3. Train model
     4. Generate predictions on new data
  4. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
     4. Generate predictions on new data
  5. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
  6. Pieces of a Data Science Project
     1. Process data / engineer new features
        ◦ How easy is it to modify / remove / add steps?
     2. Tune hyperparameters
        ◦ How easy is it to re-tune your model after the data has changed?
     3. Train model
        ◦ How easy is it to re-train your model after the data has changed?
     4. Generate predictions on new data
        ◦ Can you run all the same steps easily on any new data, without repeating code?
  7. Autopsy of common preprocessing code
     1. Load some data
     2. Split data into train/test sets
     3. Drop some features
     4. “Clean” some features
     5. “Engineer” some features from the original ones
     6. Apply One-Hot Encoding to categorical features
     7. Scale/transform some features
  8. Autopsy of common preprocessing code
     How do you preprocess test/new data? Create a reusable function!
     What if a transformation needs to remember a state (one-hot encoding, scaler, etc.)?
  9. Example - One Hot Encoding
     Training set: we apply One Hot Encoding

     ID  Country
     1   UK
     2   Italy
     3   China

     becomes:

     ID  Country_UK  Country_Italy  Country_China
     1   1           0              0
     2   0           1              0
     3   0           0              1
  10. Example - One Hot Encoding
      Test set: if we apply One Hot Encoding again

      ID   Country
      117  France
      118  UK

      becomes:

      ID   Country_France  Country_UK
      117  1               0
      118  0               1

      Columns mismatch with the training set: [UK, Italy, China]
      Instead, we need to memorise the transformation and apply the same one.
  11. Example - One Hot Encoding
      Test set: the columns mismatch the training set, so instead we memorise the transformation learnt on the training set and apply the same one

      ID   Country
      117  France
      118  UK

      becomes (same columns as the training set):

      ID   Country_UK  Country_Italy  Country_China
      117  0           0              0
      118  1           0              0
  12. Autopsy of common preprocessing code
      How do you preprocess test/new data? Create a reusable function!
      What if a transformation needs to remember a state (one-hot encoding, scaler, etc.)? Create a reusable class!
      Scikit-learn transformers are a way to facilitate this and keep it consistent.
  13. Common Transformers
      In sklearn.preprocessing:
      • LabelEncoder
      • OneHotEncoder
      • StandardScaler
      All have a .fit() method to train the transformation on some data, and a .transform() method to apply the same transformation on any new data.
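      For instance, a minimal sketch of this fit/transform pattern applied to the Country example from the earlier slides (the data here is made up for illustration):

          import pandas as pd
          from sklearn.preprocessing import OneHotEncoder

          train = pd.DataFrame({"Country": ["UK", "Italy", "China"]})
          test = pd.DataFrame({"Country": ["France", "UK"]})

          # Learn the categories on the training set only
          encoder = OneHotEncoder(handle_unknown="ignore")
          encoder.fit(train[["Country"]])

          # The test set gets the same three columns learnt from the training set;
          # the unseen category "France" becomes a row of zeros
          print(encoder.transform(test[["Country"]]).toarray())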
  14. Common Transformers
      • PCA also works as a transformer (.fit to learn a transformation, .transform to apply it on new data)
      • FeatureUnion allows you to apply multiple transformers to one column to generate multiple features
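      A minimal sketch of FeatureUnion (the transformer choices and data shapes here are only illustrative):

          import numpy as np
          from sklearn.pipeline import FeatureUnion
          from sklearn.decomposition import PCA
          from sklearn.preprocessing import StandardScaler

          # Apply two transformers to the same input and concatenate their outputs
          union = FeatureUnion([
              ("pca", PCA(n_components=2)),
              ("scaled", StandardScaler()),
          ])

          X = np.random.rand(100, 5)
          Xt = union.fit_transform(X)  # 2 PCA components + 5 scaled columns = 7 features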
  15. Build your own transformer
      • Need to extend both TransformerMixin and BaseEstimator
      • Implement a .fit() with what needs to be saved from the training data
      • Implement a .transform() which applies the actual transformation
      • If you need options, add parameters to an __init__
  16. Build your own transformer

      import numpy as np
      from sklearn.base import BaseEstimator, TransformerMixin

      class CustomScaler(TransformerMixin, BaseEstimator):
          def fit(self, X, y=None):
              # Remember the statistics learnt from the training data
              self.median = np.median(X, axis=0)
              # Interquartile range: Q3 - Q1
              self.interquartile_range = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
              return self

          def transform(self, X, y=None):
              # Apply the same robust scaling to any new data
              Xt = (X - self.median) / self.interquartile_range
              return Xt
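      It can then be used like any built-in transformer (X_train and X_test are assumed to be numeric arrays):

          scaler = CustomScaler()
          scaler.fit(X_train)                    # learns median and IQR from the training data
          X_train_t = scaler.transform(X_train)
          X_test_t = scaler.transform(X_test)    # the same statistics are reused on the test set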
  17. Column Transformer
      • The transformers seen before take a column or a set of columns and generate new features
      • ColumnTransformer is a higher-level transformer used to apply different transformations on all your columns
      Example: apply PCA on numerical features, One Hot Encoding on categorical ones.
  18. Column Transformer

      from sklearn.compose import ColumnTransformer
      from sklearn.decomposition import PCA
      from sklearn.preprocessing import OneHotEncoder

      num_cols = ["age", "salary"]
      cat_cols = ["country", "gender"]

      # List of transformations: ("name", Transformer, list_of_columns)
      preprocessor = ColumnTransformer([
          ("pca", PCA(), num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])
  19. Column Transformer
      Once you’ve created a ColumnTransformer object, you can apply the same transformations everywhere:
      • preprocessor.fit(X_train)
      • preprocessor.transform(X_train)
      • preprocessor.transform(X_test)
      • preprocessor.transform(X_new)
      It keeps in memory all unique transformations applied to all columns.
  20. Pipeline
      • We’ve seen how to apply transformations in parallel
      • Sometimes you need to apply multiple sequential transformations to build new features
      Example: 1. Scale data  2. Apply PCA
  21. Pipeline

      from sklearn.pipeline import Pipeline

      # Define pipeline
      pipeline = Pipeline([
          ("scaler", StandardScaler()),
          ("pca", PCA()),
      ])

      # Calling .fit will “learn” both steps
      pipeline.fit(data)

      # Calling .transform will “apply” both steps sequentially
      pipeline.transform(data)
  22. Pipeline
      The resulting Pipeline object is itself a transformer: it has .fit and .transform.
      So ... it can be used together with other transformers, such as ColumnTransformer.
  23. Pipeline

      # Define a pipeline for numerical features
      num_pipeline = Pipeline([
          ("scaler", StandardScaler()),
          ("pca", PCA()),
      ])

      # Use ColumnTransformer to apply different transformations to all columns
      preprocessor = ColumnTransformer([
          ("numerical", num_pipeline, num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])
  24. Pipeline
      Transformers together with ColumnTransformer and Pipeline allow you to build:
      • Complex preprocessing code
      • In a modular way
        ◦ Each transformer == a step/an action
      We’re only missing the predictive model part … Or are we?
  25. Pipeline
      • Scikit-learn predictive models have a .fit() to train the model and a .predict() to generate predictions
      • Pipeline supports the last step being a predictive model object
      • When calling the pipeline:
        ◦ .fit() fits all transformers and trains the predictive model
        ◦ .predict() calls .transform() on the transformers and then .predict() on the model
  26. Pipeline

      from sklearn.tree import DecisionTreeClassifier

      # Preprocessor object -- to transform the data
      preprocessor = ColumnTransformer([
          ("numerical", num_pipeline, num_cols),
          ("one_hot", OneHotEncoder(), cat_cols),
      ])

      # Final pipeline contains the preprocessor and the algorithm
      pipeline = Pipeline([
          ("preprocessor", preprocessor),
          ("model", DecisionTreeClassifier()),
      ])
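      The full pipeline can then be trained and used like a single model (X_train, y_train and X_new are assumed to be defined):

          pipeline.fit(X_train, y_train)         # fits all transformers, then trains the classifier
          predictions = pipeline.predict(X_new)  # applies the same preprocessing, then predicts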
  27. Pipeline
      • Your model isn’t just an algorithm but preprocessing steps + algorithm
      • Can be used directly in GridSearch, Cross Validation, etc. (see the sketch below)
        ◦ Avoids data leakage
      • Can apply the same pipeline to new data
      • More modular -> easier to test and maintain
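      As a sketch, tuning the pipeline from the previous slide with GridSearchCV; nested parameters are reached with double underscores, and the grid values here are only illustrative:

          from sklearn.model_selection import GridSearchCV

          # <step>__<nested step>__<param> addresses parameters inside the pipeline
          param_grid = {
              "preprocessor__numerical__pca__n_components": [2, 5],
              "model__max_depth": [3, 5, 10],
          }

          search = GridSearchCV(pipeline, param_grid, cv=5)
          search.fit(X_train, y_train)   # preprocessing is re-fit inside each CV fold, so no leakage
          best_pipeline = search.best_estimator_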
  28. Model Persistence
      • Training a model can take a while
      • It is useful to be able to save it pre-trained to disk
      • Once the model is trained, you can efficiently load and use it in production
  29. Pickle & Joblib In Python, there are two main libraries

    to “serialise” objects: • Pickle • Joblib Those work with any Python object and are mostly equivalent. Joblib tends to be more efficient with larger arrays.
  30. Pickle

      import pickle

      # Open a file in write binary mode and save the model
      with open("model.pickle", "wb") as f:
          pickle.dump(model_object, f)

      # Open our pickled model
      with open("model.pickle", "rb") as f:
          new_model = pickle.load(f)
  31. Joblib

      import joblib

      # Save model
      joblib.dump(model_object, "model.joblib")

      # Load model
      new_model = joblib.load("model.joblib")
  32. Pickle & Joblib - Limitations Pickle and joblib are really

    useful to save trained models, but keep in mind: • It does not save dependencies. ◦ To load your model, you need to import the code it relies on ▪ external libraries, your own classes definition, etc... • Data is not saved, so if you need to retrain the model make sure you keep a snapshot.
  33. Pickle & Joblib - Limitations When using pickle and joblib,

    keep in mind: • Dependencies/versions are important. ◦ If a model is saved with a version of a library, make sure you load with the same version • You can pickle any python object, hence do not open a pickle that you do not trust ◦ It could contain anything
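      One lightweight way to guard against version mismatches (this pattern is not from the talk, just an illustration) is to store the library version next to the model:

          import joblib
          import sklearn

          # Save the scikit-learn version alongside the trained model
          joblib.dump({"model": model_object, "sklearn_version": sklearn.__version__}, "model.joblib")

          # At load time, check that the versions match before using the model
          bundle = joblib.load("model.joblib")
          if bundle["sklearn_version"] != sklearn.__version__:
              raise RuntimeError("Model was trained with a different scikit-learn version")
          new_model = bundle["model"]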
  34. Try the code...
      • Tune the model with `python run.py tune`
        ◦ This will use the parameters defined in config.py
      • Train the model with `python run.py train`
        ◦ This uses the parameters defined in config.py
      • Test the model with `python run.py test`
  35. Further...
      • Make sure you always load the data with the same dtypes (see the sketch below)
      • Have a requirements.txt with the versions of the libraries you use
      • If making incremental changes, make sure you check the tuning CV score
        ◦ Not the one on the test set
      • With modular code, you can more easily write tests! (and you should)
      • With all that, updating your models should be less stressful.
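      For the first point, one common approach is to pin the dtypes explicitly when loading (the file name and columns here are purely illustrative):

          import pandas as pd

          # Every load of the data uses exactly the same dtypes
          dtypes = {"age": "int64", "salary": "float64", "country": "category", "gender": "category"}
          data = pd.read_csv("data.csv", dtype=dtypes)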