the models? • Who owns the architecture? • Understand who is responsible for what, and make sure the architecture reflects that organizational structure.
absolutely possible into a scikit-learn pipeline. This is the item to be packaged. Remember type coercion, standardization, feature engineering, dimensionality reduction, probability calibration, class balancing, category encoding, etc., etc., etc. • Use joblib to serialize it. • The data scientist controls this artifact. Document and define inputs and outputs. Define expected ranges of inputs/outputs where possible. • Never let someone else send you a serialized model and just run it; this is internal only. Arbitrary code _will_ be executed in your runtime environment.
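A minimal sketch of that packaging: all preprocessing and the model live in one scikit-learn `Pipeline`, and the whole thing is serialized with joblib so the preprocessing travels with the model. The steps and data here are illustrative, not from the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One pipeline = one artifact: preprocessing + model together.
pipeline = Pipeline([
    ("scale", StandardScaler()),      # standardization
    ("clf", LogisticRegression()),    # the actual model
])

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])
pipeline.fit(X, y)

# Serialize the *entire* pipeline with joblib, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored artifact issues identical predictions.
assert list(restored.predict(X)) == list(pipeline.predict(X))
```

Because the pipeline is one object, the consumer never has to re-implement the feature engineering — but per the warning above, only ever load artifacts you produced yourself.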
fraudulent?” “what’s in this picture?” All of the data you need to issue a prediction is held by whoever is asking for it (so it’s a transaction: trade data for a prediction). • Recurring Forecasting Think: “what’s the weather going to be tomorrow?” Some of the data you need to issue a prediction is held by the requester, but you need a bunch of other stuff beyond just the model (e.g., historical data). • Many, many others
to store: Trained models (big binary blobs) • Version history of models • Data required for issuing predictions • Data about predictions we’ve issued • Metrics • Logs • Audit trail • Is there an existing database? Or do you need your own? • PostgreSQL will store pickled models without too much trouble, but be careful about upgrading scipy/numpy.
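A sketch of a versioned model store: pickled models go into a table as binary blobs alongside a version number. Stdlib `sqlite3` stands in for PostgreSQL here so the example runs anywhere; in PostgreSQL the blob column would be `bytea` and the SQL is essentially the same. Table and column names are illustrative.

```python
import pickle
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE models (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        version INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        blob BLOB NOT NULL,          -- bytea in PostgreSQL
        UNIQUE (name, version)
    )
""")

# Stand-in for a fitted pipeline; in practice this is joblib/pickle output.
model = {"coef": [0.3, -1.2], "intercept": 0.5}
conn.execute(
    "INSERT INTO models (name, version, created_at, blob) VALUES (?, ?, ?, ?)",
    ("churn", 1, datetime.now(timezone.utc).isoformat(), pickle.dumps(model)),
)

# Pull the most recent version and unpickle it. Beware: the scipy/numpy
# versions at load time must match the ones used at dump time.
row = conn.execute(
    "SELECT blob FROM models WHERE name = ? ORDER BY version DESC LIMIT 1",
    ("churn",),
).fetchone()
loaded = pickle.loads(row[0])
assert loaded == model
```

The `UNIQUE (name, version)` constraint is what gives you the version history for free: every retrain inserts a new row rather than overwriting the old one.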
POSTs some data to an endpoint 2. API has a model in memory backed by cache 3. Data is formatted and run through pipeline 4. Output sent back to user 5. Async: write data and prediction to data store
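The five steps above can be sketched framework-agnostically: a handler formats the payload, runs it through the in-memory pipeline, returns the prediction, and hands the (data, prediction) pair to a background queue for the async write. All names here are illustrative stand-ins, not a real API.

```python
import queue

# 5. Async write: a background writer thread would drain this queue
#    into the datastore so the request never blocks on storage.
write_queue = queue.Queue()

class DummyPipeline:
    """Stand-in for the joblib-loaded scikit-learn pipeline (step 2)."""
    def predict(self, rows):
        return [sum(row) for row in rows]

pipeline = DummyPipeline()   # held in memory, backed by a cache

def handle_predict(payload):
    # 3. Format the POSTed data and run it through the pipeline
    row = [float(payload["x1"]), float(payload["x2"])]
    prediction = pipeline.predict([row])[0]
    # 5. Queue the input and prediction for the async datastore write
    write_queue.put({"input": row, "prediction": prediction})
    # 4. Output sent back to the user
    return {"prediction": prediction}

resp = handle_predict({"x1": "1.5", "x2": "2.0"})
```

In a real deployment `handle_predict` is the body of a Flask/FastAPI route; the structure is identical.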
night at midnight, get some new batch of data via whatever 2. Process the data in your data store (Hadoop? Spark? Luigi? SQL?) 3. Pull the most recent pickled model from the model store 4. Use the data to issue a collection of new forecasts 5. Write forecasts, metadata, and metrics to the datastore 6. Serve predictions the rest of the day via REST API
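The nightly cycle can be sketched as a single function, with each step stubbed out where the real system (Spark job, model store, datastore) would plug in. Everything named here is a hypothetical stand-in.

```python
def run_nightly_job(raw_batch, model_store, datastore):
    # 2. Process the raw batch into model-ready features
    processed = [{"features": [r["a"], r["b"]]} for r in raw_batch]
    # 3. Pull the most recent model from the model store
    model = model_store["latest"]
    # 4. Issue a collection of new forecasts
    forecasts = [model(p["features"]) for p in processed]
    # 5. Write forecasts (plus metadata/metrics in practice) to the datastore
    datastore["forecasts"] = forecasts
    # 6. The REST API serves these precomputed forecasts all day
    return forecasts

# Toy stand-ins: the "model" just averages its features.
model_store = {"latest": lambda feats: sum(feats) / len(feats)}
datastore = {}
out = run_nightly_job([{"a": 2.0, "b": 4.0}], model_store, datastore)
```

The key property of this pattern: the expensive work happens once per night, and daytime requests are simple lookups against the precomputed forecasts.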
to know when model performance is degrading • How to know when predictions are trash (because of bad data, code, whatever) • How to know when input data is trash (it will be eventually) • Non-ML performance (throughput, hardware requirements, uptime, etc.)
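Two of those checks are cheap to sketch: flag inputs that fall outside the documented expected ranges (catching trash data), and flag the model when accuracy over a recent window drops below a threshold (catching degradation). The ranges and threshold here are illustrative.

```python
# Documented expected ranges for each input field (hypothetical names).
EXPECTED_RANGES = {"age": (0, 120), "amount": (0.0, 1e6)}

def input_is_trash(record):
    """True if any field falls outside its documented range."""
    return any(
        not (lo <= record[field] <= hi)
        for field, (lo, hi) in EXPECTED_RANGES.items()
    )

def performance_degraded(predictions, actuals, threshold=0.8):
    """True if accuracy over the recent window drops below threshold."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(predictions) < threshold

assert input_is_trash({"age": -5, "amount": 10.0})        # bad input caught
assert not input_is_trash({"age": 30, "amount": 10.0})    # good input passes
assert performance_degraded([1, 1, 0, 0], [1, 0, 1, 1])   # 25% accuracy: alarm
```

Wire checks like these into the async write path from the serving flow, and alert when either fires — degradation is only visible if you are actually looking.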