Jessica Lundin - Snakes on a Hyperplane: Python Machine Learning in Production

Companies with an artificial-intelligence plan have a differentiating strategy in the intelligence economy; however, implementing robust machine-learning in production is nontrivial, often requiring a close collaboration between data scientists and developers, and retooling the production stack and workflows to develop and maintain accurate models.

Machine learning in production involves model application, handling missing data, data artifacts, and data outside of the training calibration. A rigorous evaluation framework draws upon logging to determine characteristics of model coverage, model performance, auditing, and run-time performance. Model coverage is the number of times the model produced sensible output relative to the number of times it was called. Model coverage is reduced if the model does not converge or model criteria are not met. Model performance is evaluated with a suite of metrics (accuracy, AUC, FPR, TPR, RMSE, MAPE, etc.), which assist in determining the most appropriate model to use in the production scenario and the validity of the model training. Regularly performing manual audits for spot checks is important for debugging and ensuring the model passes sanity checks. Run-time performance covers run times and profiling of model pieces, ensuring performance is within specified requirements and refactoring otherwise.
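
As a rough illustration of the coverage and error metrics described above, here is a minimal sketch in plain Python. The function names and the logged values are hypothetical, and treating `None` or a non-finite number as "no sensible output" is an assumption, not a definition taken from the talk.

```python
import math

def model_coverage(outputs):
    """Fraction of calls that produced a usable (non-None, finite) output."""
    if not outputs:
        return 0.0
    usable = [o for o in outputs if o is not None and math.isfinite(o)]
    return len(usable) / len(outputs)

def rmse(y_true, y_pred):
    """Root-mean-square error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical logged outputs: None marks a call where the model did not converge,
# NaN marks a call whose output failed a sanity criterion.
logged = [10.0, None, 12.5, float("nan"), 9.0]
coverage = model_coverage(logged)  # 3 usable outputs out of 5 calls
```

Computing coverage from the same log records that feed the performance metrics keeps the two numbers comparable: a model with a low error but low coverage is only accurate on the calls where it answered at all.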

In the AI renaissance, where ML is a critical piece of intelligent products, seamlessly integrating model evaluation into workflows is an important component of making robust products and building a satisfying customer experience. Python is a great language to build intelligent products with its abundance of ML libraries and wrappers contributed as open-source software in addition to rich full-stack capabilities.


PyCon 2017

May 21, 2017



  1. Snakes on a Hyperplane: Python Machine Learning in Production

    Jessica Lundin, Machine Learning Manager, Microsoft Research @_JessicaLundin https://notebooks.azure.com/LundinMachine http://stackoverflow.com/questions/9480605/what-is-the-relation-between-the-number-of-support-vectors-and-training-data-and
  2. What is machine learning?

    “machine learning explores the study and construction of algorithms that can learn from and make predictions on data” Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning. 30: 271–274.
  3. What is machine learning?

    “machine learning explores the study and construction of algorithms that can learn from and make predictions on data” https://www.kaggle.com/c/data-science-bowl-2017/data
    Cat video classification; Handwritten digit identification; Lung cancer detection
  4. Machine learning in production: practical tips

    Preproduction data: Fit model, Measure Preproduction Baseline Results
    Production data: Apply model, Measure Production Results + Compare to preproduction
  5. Synthetic data: fit a binary classifier

    Preproduction Accuracy = 95%

        ## Fit a Logistic Regression model
        from sklearn.linear_model import LogisticRegressionCV
        clf = LogisticRegressionCV()
        clf.fit(X, y)
        ## measure the accuracy
        clf.score(X, y)
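
The slide's snippet scores the classifier on the same `X, y` it was fit on. A minimal sketch of a held-out baseline follows, assuming scikit-learn is installed; the synthetic dataset from `make_classification` is a stand-in and not the data used in the talk, and the split ratio is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preproduction data (an assumption, not the talk's dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out a quarter of the data so the baseline is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegressionCV(max_iter=1000)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)  # accuracy on the rows used for fitting
test_accuracy = clf.score(X_test, y_test)     # accuracy on held-out rows
```

A held-out score is usually the more useful preproduction baseline to compare production results against, since a training-set score can overstate how the model behaves on new data.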
  6. Unknown production distribution: retrain with non-linear algorithms

    Support Vector Machine (SVM), Kernel = Radial Basis Function: Accuracy = 93%
    Random Forest: Accuracy = 100%
  7. Unknown production distribution: feature engineering to linearize features

    [Plot panels, axes X1 and X2: Original features; Modified features] Accuracy = 90%
  8. Model performance: unknown production distribution

    Techniques for suspected distribution differences between preproduction and production:
    - Visualization (histograms, pairplots)
    - Clustering
    - Kullback-Leibler (KL) divergence
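
One way the KL-divergence check might look for a single numeric feature is sketched below, assuming NumPy is available. The histogram binning, the pseudo-count smoothing, and the simulated samples are all illustrative assumptions; the talk does not specify an estimator.

```python
import numpy as np

def kl_divergence(p_samples, q_samples, bins=20, pseudo_count=0.5):
    """Estimate KL(P || Q) for one feature from two samples via shared-bin histograms."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p_counts, edges = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q_counts, _ = np.histogram(q_samples, bins=edges)
    # Add a pseudo-count so empty bins do not produce log(0) or division by zero.
    p = (p_counts + pseudo_count) / (p_counts + pseudo_count).sum()
    q = (q_counts + pseudo_count) / (q_counts + pseudo_count).sum()
    return float(np.sum(p * np.log(p / q)))

# Simulated feature values: one production sample matches preproduction, one is shifted.
rng = np.random.default_rng(0)
preprod = rng.normal(0.0, 1.0, size=5000)
prod_same = rng.normal(0.0, 1.0, size=5000)
prod_shifted = rng.normal(1.5, 1.0, size=5000)

kl_same = kl_divergence(preprod, prod_same)
kl_shifted = kl_divergence(preprod, prod_shifted)
```

The absolute value depends on the bin count and smoothing, so in practice it is the comparison against a baseline (for example, the divergence between two preproduction samples) that signals a shift.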
  9. Model performance: unbalanced problems

    Model predicts single class 0 for all observations
    Accuracy: 0.98 = (Σ TP + Σ TN)/Σ total population
    Precision: 0.0 = Σ TP/Σ prediction positive
    Recall: 0.0 = Σ TP/Σ condition positive
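
The arithmetic behind these numbers can be reproduced with a small sketch in plain Python. The 2% positive rate and the sample size are assumptions chosen to match the slide's accuracy of 0.98, and returning 0.0 for a 0/0 precision is a reporting convention rather than a mathematical result.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

# Assumed data: 1000 observations, 2% positive; the model predicts class 0 everywhere.
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = safe_ratio(tp + tn, len(y_true))  # (TP + TN) / total population
precision = safe_ratio(tp, tp + fp)          # TP / prediction positive (0/0 here)
recall = safe_ratio(tp, tp + fn)             # TP / condition positive
```

With no predicted positives, accuracy simply equals the majority-class share, which is why it says little about a rare-class problem.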
  10. Model performance: unbalanced problems

    Accuracy: 1
    Precision: 1 = Σ TP/Σ prediction positive
    Recall: 1 = Σ TP/Σ condition positive
  11. Model performance: unbalanced problems

    Techniques for unbalanced problems
    Cost-sensitive classification:
    - Rare-class upsampling with replacement
    - Importance weighting
    - Boosting
    Treat it as an anomaly detection problem (one-class SVM)
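
Two of these techniques, importance weighting and rare-class upsampling with replacement, could be sketched with scikit-learn as follows. The dataset, the 95/5 class split, and the choice of `LogisticRegression` are illustrative assumptions; the talk does not show its own implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.utils import resample

# Assumed synthetic data with roughly 5% of observations in the rare class (label 1).
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=0
)

# Importance weighting: class_weight="balanced" weights classes inversely to frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Rare-class upsampling with replacement, up to the majority-class count.
X_rare, y_rare = X[y == 1], y[y == 1]
n_major = int((y == 0).sum())
X_up, y_up = resample(X_rare, y_rare, replace=True, n_samples=n_major, random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
upsampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Recall on the original (unbalanced) data for each approach.
recall_weighted = recall_score(y, weighted.predict(X))
recall_upsampled = recall_score(y, upsampled.predict(X))
```

In a real workflow the upsampling would be applied only to the training split, so that duplicated rare-class rows do not leak into the evaluation data.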
  12. Machine learning in production: practical tips

    Logging:
    - Timestamp, Instance ids
    - Model run time
    - Model results, performance metrics
    - Model convergence errors
    Auditing:
    - Manual process of digging into logs and data to resolve unexpected behavior
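
A minimal sketch of a prediction wrapper that records the fields listed on this slide, using only the Python standard library. The logger name, the record keys, and the `logged_predict` helper are hypothetical; the stand-in "model" is just an average, and its failure on empty input plays the role of a convergence error.

```python
import json
import logging
import time
from datetime import datetime, timezone

logger = logging.getLogger("model_audit")  # hypothetical logger name

def logged_predict(predict_fn, instance_id, features):
    """Call predict_fn and emit one JSON log record per call."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "instance_id": instance_id,
    }
    start = time.perf_counter()
    try:
        record["result"] = predict_fn(features)
        record["error"] = None
    except Exception as exc:  # e.g. a convergence failure raised by the model
        record["result"] = None
        record["error"] = repr(exc)
    record["run_time_s"] = time.perf_counter() - start
    logger.info(json.dumps(record))
    return record

# Stand-in model: the mean of the features; it raises on an empty feature list.
ok = logged_predict(lambda x: sum(x) / len(x), "id-001", [1.0, 2.0, 3.0])
failed = logged_predict(lambda x: sum(x) / len(x), "id-002", [])
```

Writing one structured record per call, including failed calls, is what makes later auditing and coverage calculations possible from the logs alone.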
  13. Machine Learning Resources

    General Resources:
    - Introduction to Machine Learning, Coursera, by Andrew Ng https://www.coursera.org/learn/machine-learning
    - The Elements of Statistical Learning (free pdf download) by Hastie, Tibshirani, Friedman http://statweb.stanford.edu/~tibs/ElemStatLearn/
    - Kaggle Tutorials https://www.kaggle.com/wiki/Tutorials
    ML in Python:
    - Scikit Learn http://scikit-learn.org/
    - Caffe, TensorFlow, CNTK, Theano, Keras (packages all on github)
    - Rpy2: Python’s R wrapper
  14. Microsoft Python resources

    - Azure SDK - https://azure.microsoft.com/en-us/develop/python/
    - Intro to Python Programming - https://mva.microsoft.com/en-us/training-courses/introduction-to-programming-with-python-8360
    - Python tools for Visual Studio - https://microsoft.github.io/PTVS/
    - Cognitive Toolkit (CNTK) - https://www.microsoft.com/en-us/research/product/cognitive-toolkit/
  15. Thanks!

    Health ML team is hiring Data Scientists! Come work at Microsoft Research https://careers.microsoft.com/
    ML/Data Scientist: 1030519
    Developer: 1048462, 1032009, 1031571, 1031704, 1026221
    @_JessicaLundin