I made a model, now what?

Talk for PyData Atlanta, Oct 2017.

Will McGinnis

October 24, 2017

Transcript

  1. Whoami
     • Will McGinnis
     • Chief Scientist at Predikto
     • Maintainer: scikit-learn-contrib/categorical-encoding
     • Github.com/wdm0006
     • Willmcginnis.com
     • Predikto.com
  2. So you have a model
     • You used Python, right?
     • You used scikit-learn, probably.
     • Use pipelines.
     • Push ETL and feature engineering into the pipeline syntax (see the sketch below).
     • Validate, validate, validate.
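A minimal sketch of what "push it into the pipeline" looks like; the placeholder data and the choice of estimator here are illustrative, not from the talk:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

    # Preprocessing and the model live in one object, so they ship together.
    pipe = Pipeline([
        ("scale", StandardScaler()),   # standardization stays inside the artifact
        ("clf", LogisticRegression()), # illustrative estimator
    ])

    # Validate, validate, validate: score the whole pipeline, not just the model.
    scores = cross_val_score(pipe, X, y, cv=5)
    pipe.fit(X, y)                     # the fitted pipeline is what gets packaged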
  3. Know your organization
     • Who owns production?
     • Who owns the models?
     • Who owns the architecture?
     • Understand who is responsible for what, and make sure the architecture reflects that organizational structure.
  4. Packaging a trained model
     • Push as much as possible into a scikit-learn pipeline. This is the item to be packaged. Remember type coercion, standardization, feature engineering, dimensionality reduction, probability calibration, class balancing, category encoding, etc., etc., etc.
     • Use joblib to serialize it (see the sketch below).
     • The data scientist controls this artifact. Document and define inputs and outputs. Define expected ranges of inputs and outputs where possible.
     • Never let someone else send you a serialized model and just run it; this is internal only. Arbitrary code _will_ be executed in your runtime environment.
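A minimal sketch of the joblib step, assuming the fitted `pipe` from the earlier slide and an illustrative filename:

    import joblib

    # The fitted pipeline is the deployable artifact.
    joblib.dump(pipe, "model_v1.joblib")

    # Later, inside your own serving process only (loading a pickle executes
    # arbitrary code, so never load artifacts you did not produce yourself):
    restored = joblib.load("model_v1.joblib")
    predictions = restored.predict(X)   # X: data matching the documented inputs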
  5. Deployment Mechanisms
     • Transactional Scoring
       – Think: "Is this request fraudulent?" "What's in this picture?"
       – All of the data you need to issue a prediction is held by whoever is asking for it (so it's a transaction: data traded for a prediction).
     • Recurring Forecasting
       – Think: "What's the weather going to be tomorrow?"
       – Some of the data you need to issue a prediction is held by the requester, but you need a bunch of other stuff beyond the model itself (e.g. historical data).
     • Many, many others
  6. Data persistence
     • There are a few things we need to store:
       – Trained models (big binary blobs)
       – Version history of models
       – Data required for issuing predictions
       – Data about predictions we've issued
       – Metrics
       – Logs
       – Audit trail
     • Is there an existing database? Or do you need your own?
     • PostgreSQL will store pickled models without too much trouble, but be careful about upgrading scipy/numpy (see the sketch below).
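A hedged sketch of storing the serialized model in PostgreSQL; the table schema, connection string, and model name are assumptions for illustration only:

    import io

    import joblib
    import psycopg2

    # Serialize the fitted pipeline to bytes instead of to a file.
    buf = io.BytesIO()
    joblib.dump(pipe, buf)

    conn = psycopg2.connect("dbname=modelstore")   # hypothetical database
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO models (name, version, created_at, blob) "
            "VALUES (%s, %s, now(), %s)",
            ("churn_model", 1, psycopg2.Binary(buf.getvalue())),
        )
    # Unpickling is tied to the scipy/numpy versions used at training time,
    # so record those versions alongside the blob.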
  7. Shove it all in an API - Transactional
     1. User POSTs some data to an endpoint
     2. API has a model in memory backed by cache
     3. Data is formatted and run through pipeline
     4. Output sent back to user
     5. Async: write data and prediction to data store
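A minimal sketch of that transactional flow using Flask; the framework choice, endpoint, field names, and filename are all assumptions, not from the talk:

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model_v1.joblib")    # 2. model held in memory

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()                           # 1. user POSTs data
        row = [[payload["feature_a"], payload["feature_b"]]]   # 3. format for the pipeline
        prediction = model.predict(row)[0]
        # 5. in a real service, queue the payload + prediction here for an
        #    asynchronous write to the data store
        return jsonify({"prediction": float(prediction)})      # 4. output back to user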
  8. Run it as a batch job - Recurring
     1. Every night at midnight, get some new batch of data via whatever
     2. Process the data in your data store (Hadoop? Spark? Luigi? SQL?)
     3. Pull the most recent pickled model from the model store
     4. Use the data to issue a collection of new forecasts
     5. Write forecasts, metadata, and metrics to the data store
     6. Serve predictions the rest of the day via a REST API
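A rough sketch of the nightly batch, with pandas/SQLAlchemy standing in for "your data store" and hypothetical table and path names:

    from datetime import date

    import joblib
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql:///warehouse")   # hypothetical database

    # 1-2. Get and process the new batch of data.
    batch = pd.read_sql("SELECT * FROM features_daily WHERE ds = CURRENT_DATE", engine)

    # 3. Pull the most recent pickled model from the model store.
    model = joblib.load("/models/latest.joblib")

    # 4. Use the data to issue a collection of new forecasts.
    forecasts = batch.assign(
        forecast=model.predict(batch.drop(columns=["ds"])),
        run_date=date.today(),
    )

    # 5. Write forecasts and metadata back to the data store; a read-only
    #    REST API serves them for the rest of the day (step 6).
    forecasts.to_sql("forecasts", engine, if_exists="append", index=False)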
  9. Extra concerns
     • Model versioning
     • Data versioning?
     • How to know when model performance is degrading
     • How to know when predictions are trash (because of bad data, code, whatever)
     • How to know when input data is trash (it will be, eventually; see the sanity-check sketch below)
     • Non-ML performance (throughput, hardware requirements, uptime, etc.)
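One way to notice that input data has gone bad: check incoming rows against the documented input ranges from the packaging slide. The feature names and ranges below are hypothetical:

    import math

    # Hypothetical documented ranges for the model's inputs.
    EXPECTED_RANGES = {"feature_a": (0.0, 100.0), "feature_b": (-1.0, 1.0)}

    def input_looks_sane(row):
        """Return False if any feature is missing, non-finite, or out of range."""
        for name, (lo, hi) in EXPECTED_RANGES.items():
            value = row.get(name)
            if value is None or not math.isfinite(value) or not lo <= value <= hi:
                return False
        return True

    # Alert (rather than silently predict) when the share of failing rows spikes.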