
Open standards for machine learning model deployment presented to Milwaukee Code Camp

Intro to machine learning and PMML, PFA, ONNX

Svetlana Levitan

November 16, 2019

Transcript

  1. Open standards for machine learning model deployment

    Svetlana Levitan, PhD
    Developer Advocate and PMML Release Manager
    Center for Open Data and AI Technologies (CODAIT), IBM Cognitive Applications
    @SvetaLevitan
  2. Who is Svetlana Levitan?

    • Originally from Moscow, Russia
    • PhD in Applied Mathematics and MS in Computer Science from University of Maryland, College Park
    • Software Engineer for SPSS Analytic components (2000-2018)
    • Working on PMML since 2001, ONNX recently
    • IBM acquired SPSS in 2009
    • Developer Advocate with IBM Center for Open Data and AI Technologies (since June 2018)
    • Meetup organizer: Big Data Developers, Open Source Analytics
    • Two daughters love programming: IIT and Niles North
  3. Agenda

    • Intro to Machine Learning
    • Deployment challenges
    • PMML Internals
    • PMML in Python and R
    • PMML in IBM products
    • PFA
    • ONNX
  4. Machine Learning (ML) is popular

    • Data Scientists are in high demand
    • Data Science = ML + Business Domain Knowledge
    • ML is everywhere!
    • Actuarial Science = first Data Science?
    • https://wacamlds.podia.com/

    CODAIT/Cognitive Applications/ November 25, 2019 / © 2019 IBM Corporation
  5. Examples of ML around us

    • Weather forecast
    • Chat bots, Alexa, Siri
    • Identifying fraud in banks, credit cards
    • Online shopping recommendations
    • Pattern recognition, spam filters
    • Computer vision and self-driving cars
    • Watson playing Jeopardy in 2011
  6. Frequently used terms

    • Machine learning is "learning" from data, or generalizing from examples. The computer finds certain trends in data without being explicitly programmed.
    • Structured data: highly organized information that can be stored in row database structures.
    • Lots of data is unstructured, e.g. free text, speech, video.
    • Columns are fields or variables, rows are cases.
    • A field can be categorical (nominal, ordinal) or continuous. Examples:
        ordinal field: age_group: baby, toddler, child, teenager, adult
        nominal field: car_make: Ford, Chevy, Toyota, Honda, Tesla
        continuous field: age (any value from 0 to ~120)
  7. Typical Stages in Machine Learning

    Collect Data → Analyze and Clean Data → Transform data → Build a Model → Deploy the model → Monitor and update as needed

    (C) 2019 IBM Corp
  8. Data Sources

    • Relational Databases
    • Data warehouses, data lakes
    • Web logs
    • Medical or business records
    • Streaming data
    • IoT data: sensors, cameras, etc.
    Diagram from Intel
  9. Some typical data transformations

    One Hot Encoding: turn a categorical field into K dummy variables. Image from Kaggle.com
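
As a rough illustration of one-hot encoding, the sketch below turns a categorical field into K dummy columns using only the Python standard library. The function name `one_hot_encode` and the sample `car_make` values are assumptions made for this example, not something taken from the slides.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    categories = sorted(set(values))  # K distinct categories, in a fixed order
    index = {category: position for position, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # K dummy variables, all zero...
        row[index[value]] = 1         # ...except the one for this value
        rows.append(row)
    return categories, rows


# Hypothetical sample data for a nominal field
car_make = ["Ford", "Tesla", "Honda", "Ford"]
categories, encoded = one_hot_encode(car_make)
print(categories)
print(encoded)
```

In practice a library routine would usually do this job; the point here is only the shape of the result: one column per category, one row per case.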
  10. Some typical data transformations

    Binning, or discretization: turn a continuous field into a categorical one

        Temperature   TempBin   Label
        < 0           1         Very cold
        0 to 60       2         Cold
        60 to 80      3         Warm
        > 80          4         Hot
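
The binning table can be sketched in a few lines of Python. The slide does not say which bin a boundary value such as 60 belongs to, so the choice below (upper bounds inclusive, so 60 is "Cold" and 80 is "Warm") is an assumption, as are the names `temp_bin` and `BIN_LABELS`.

```python
# Labels taken from the slide's table
BIN_LABELS = {1: "Very cold", 2: "Cold", 3: "Warm", 4: "Hot"}


def temp_bin(temperature):
    """Map a continuous temperature to a bin number (1 to 4)."""
    if temperature < 0:
        return 1
    if temperature <= 60:   # assumption: 60 itself falls in the "Cold" bin
        return 2
    if temperature <= 80:   # the table says "> 80" is Hot, so 80 stays "Warm"
        return 3
    return 4


readings = [-5, 0, 45, 72, 95]
labels = [BIN_LABELS[temp_bin(t)] for t in readings]
print(labels)
```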
  11. Some typical data transformations

    Linear transformation: normalization or standardization
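
A minimal sketch of the two linear transformations, assuming "normalization" means min-max scaling to [0, 1] and "standardization" means subtracting the mean and dividing by the population standard deviation. Both definitions are common, but the slide does not spell them out, and the sample `ages` are made up.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical sample; must not be constant
normalized = min_max_normalize(ages)
standardized = standardize(ages)
print(normalized)
print(standardized)
```

Both functions divide by a spread measure, so a constant field would need special handling before either is applied.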
  12. Some typical data transformations

    Category mapping

        Village     County
        Skokie      Cook
        Lisle       DuPage
        Deerfield   Lake
        Evanston    Cook
        Oak Park    Cook
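
Category mapping is essentially a lookup table. The sketch below uses the village/county pairs from the slide; the fallback value `"Unknown"` for unmapped villages and the name `map_county` are assumptions for illustration.

```python
# Mapping table copied from the slide
VILLAGE_TO_COUNTY = {
    "Skokie": "Cook",
    "Lisle": "DuPage",
    "Deerfield": "Lake",
    "Evanston": "Cook",
    "Oak Park": "Cook",
}


def map_county(village, default="Unknown"):
    """Map one categorical field (village) to another (county)."""
    # The default for unseen villages is an assumption, not from the slide.
    return VILLAGE_TO_COUNTY.get(village, default)


counties = [map_county(v) for v in ["Skokie", "Lisle", "Naperville"]]
print(counties)
```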
  13. Typical Stages in Machine Learning

    Collect Data → Analyze and Clean Data → Transform data → Build a Model → Deploy the model → Monitor and update as needed

    (C) 2018 IBM Corp
  14. Clustering Models – unsupervised learning

    • Many distance or similarity measures
    • Many algorithms
  15. Linear Regression – supervised learning, continuous target

    • Categorical predictors → dummies
    • More advanced features:
        • Interaction and nonlinear terms
        • Model selection
        • Regularization
  16. Logistic regression – supervised learning with a binary target

    • Categorical predictors → dummies
    • More advanced features: interaction and nonlinear terms, model selection, regularization
    • More complicated kinds of regression
  17. Decision Trees – supervised learning with a continuous or categorical target

    • Many algorithms
    • Easily explainable
    • Continuous and categorical predictors
    • Can handle missing data
  18. Model ensembles

    • Useful for distributed data or for improving accuracy
    • Bagging, boosting
    • Random forest, XGBoost, LightGBM
    Diagram from Quora
  19. The Elementary Perceptron

    • 1958: Frank Rosenblatt. A machine, first implemented in software on IBM 704, later in hardware
    • 1969: book by Minsky and Papert → AI winter
    • Then MLP = multilayer perceptron

    IBM Cloud and Cognitive Software/November 25, 2019 / © 2019 IBM Corporation
  20. Training Neural Networks with Backpropagation

    • Initialize weights with small random values
    • Apply inputs, compute predictions, propagate error back and update weights
    • Gradient descent methods: Adagrad, Adam, …
    • Online or mini-batch or batch
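
The training loop described above can be sketched for the simplest possible case: a single sigmoid neuron trained online with plain gradient descent. With only one neuron, "propagating the error back" is a single chain-rule step, so this is a simplification of backpropagation rather than a full multilayer implementation. The log-loss objective, the AND-function training data, the learning rate, and all names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=2000, learning_rate=0.5, seed=0):
    """Online gradient descent on log-loss for one sigmoid neuron."""
    rng = random.Random(seed)
    # Step 1: initialize weights with small random values
    weights = [rng.uniform(-0.1, 0.1) for _ in range(len(samples[0][0]))]
    bias = rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        for inputs, target in samples:                  # online: one case at a time
            prediction = predict(weights, bias, inputs)  # Step 2: forward pass
            error = prediction - target                  # d(log-loss)/d(pre-activation)
            # Step 3: push the error back to each weight and update
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training data: the logical AND function
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_neuron(and_samples)
predictions = [round(predict(weights, bias, x)) for x, _ in and_samples]
print(predictions)
```

Mini-batch or batch training would accumulate the gradient over several cases before each update; optimizers such as Adagrad or Adam change how the step size is chosen, not the overall loop.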
  21. Backpropagation

    Labeled training data: Coat, Sneaker, T-shirt, Sneaker, Pullover
    Output: Pullover, Coat, Coat, Sneaker, T-shirt
    Errors: ❌ ❌ ❌
    Fashion-MNIST dataset by Zalando Research, on GitHub <https://github.com/zalandoresearch/fashion-mnist> (MIT License).
  22. Convolutional layer

    https://medium.com/datadriveninvestor/convolutional-neural-networks-3b241a5da51e
  23. Max-Pooling layer

    Image from Wikipedia
  24. Deep learning model = NN with many layers

    Image from Medium
  25. Model evaluation

    • Split data into training, testing, validation
    • Always check model quality on new data!
    • Some quality measures:
        • R squared and adjusted R squared
        • Mean Absolute Error
        • RMSE
        • Accuracy
        • Precision and recall
        • AUC, BIC, AIC …
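
A few of the quality measures listed above are simple enough to compute by hand. The sketch below covers accuracy, precision and recall for a binary target, plus Mean Absolute Error and RMSE for a continuous one. The sample labels and values are made up, and treating `1` as the positive class is an assumption.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp),  # of predicted positives, how many were right
        "recall": tp / (tp + fn),     # of actual positives, how many were found
    }


def regression_errors(actual, predicted):
    """Mean Absolute Error and Root Mean Squared Error."""
    diffs = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical held-out test data
metrics = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
mae, rmse = regression_errors([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(metrics, mae, rmse)
```

These numbers are only meaningful when computed on data the model did not see during training, which is the point of the train/test/validation split.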
  26. Tools for ML

    Commercial and Open Source
  27. Building some models in Python and R

    R:

        > library(rpart)
        > data(iris)
        # Build a linear regression model predicting Sepal length
        > irisLR <- lm(Sepal.Length ~ ., iris)
        # Build a decision tree (C&RT) model predicting Species
        > irisTree <- rpart(Species ~ ., iris)

    Python:

        from sklearn import datasets, tree
        from sklearn.linear_model import LogisticRegression
        iris = datasets.load_iris()
        # Example tree model
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(iris.data, iris.target)
        # Example logistic regression model
        clr = LogisticRegression().fit(iris.data, iris.target)
  28. Building some models in Watson Studio

    Easy graphic interface (Modeler flow), collaboration and deployment tools. Sign up for free account: https://ibm.biz/BdzwwC
  29. Resources for learning ML

    • https://www.kaggle.com/learn/overview
    • https://www.coursera.org/learn/machine-learning
    • Watson Studio: sign up for IBM Cloud: https://ibm.biz/BdzwwC
    • https://developer.ibm.com/articles/cc-cognitive-neural-networks-deep-dive/
    • https://cognitiveclass.ai/
    • Meetup.com
    • Codait.org
    • @SvetaLevitan
  30. Typical Stages in Machine Learning

    Collect Data → Analyze and Clean Data → Transform Data → Build a Model → Deploy the model → Monitor and update as needed
  31. Model Deployment Challenges

    Teams
        • Data Scientists and statisticians
        • Application developers and IT
    Environments
        • OS and File Systems
        • Databases, desktop, cloud
    Languages
        • Python or R, various packages, C++ or Java or Scala, dependencies and versions
    Data Preparation
        • Aggregation and joins
        • Normalization, Category Encoding, Binning, Missing value replacement
  32. DMG to the rescue!

    Data Mining Group: dmg.org
    Predictive Model Markup Language (PMML)
    • An Open Standard for XML Representation
    • Over 30 vendors and organizations
    • PMML 4.4 Release manager: Svetlana Levitan
  33. Brief History of PMML versions

    • 0.7 in 1997: First
    • 1.1 in 2000: Six models
    • 2.0 in 2001: Transformations, Naïve Bayes, Sequence
    • 3.0 in 2004: Functions, Output, Composition, SVM, Ruleset
    • 4.0 in 2009: Ensembles, Cox, Time Series, Model Explanation
    • 4.4 in 2019: More Time Series, Anomaly Detection
  34. Transformations

    • NormContinuous: piece-wise linear transform
    • NormDiscrete: map a categorical field to a set of dummy fields
    • Discretize: binning
    • MapValues: map one or more categorical fields into another categorical one
    • Functions: built-in and user-defined
    • Other transformations
  35. PMML 4.4 Models

    o Anomaly Detection (new)
    o Association Rules Model
    o Clustering Model
    o General Regression
    o Naïve Bayes
    o Nearest Neighbor Model
    o Neural Network
    o Regression
    o Tree Model
    o Mining Model: composition or ensemble (or both) of models
    o Baseline Model
    o Bayesian Network
    o Gaussian Process
    o Ruleset
    o Scorecard
    o Sequence Model
    o Support Vector Machine
    o Time Series
  36. Contents of a PMML Model

    ❖ Mining Schema: target and predictors, importance, missing value treatment, invalid value treatment, outlier treatment
    ❖ Output: what to report, post-processing
    ❖ Model Stats: description of input data
    ❖ Model Explanation: model diagnostics, useful for visualization
    ❖ Targets: target category info and prior probabilities
    ❖ Local Transformations: predictor transformations local to the model
    ❖ …<Specific model contents>…
    ❖ Model Verification: expected results for some cases

    August 16, 2019 / © 2019 IBM Corporation
  37. Example PMML - Neural Network hidden layer and outputs

    Callouts on the example: Hidden layer neuron; Output Layer Neurons; Connecting target to the neurons
  38. Example PMML for a Tree Model

        <Node id="0">
          <True/>
          <Node id="1" score="Iris-setosa" recordCount="50.0">
            <SimplePredicate field="petal_length" operator="lessOrEqual" value="2.6"/>
            <ScoreDistribution value="Iris-setosa" recordCount="50.0"/>
            <ScoreDistribution value="Iris-versicolor" recordCount="0.0"/>
            <ScoreDistribution value="Iris-virginica" recordCount="0.0"/>
          </Node>
          <Node id="2">
            <SimplePredicate field="petal_length" operator="greaterThan" value="2.6"/>
            <Node id="3" score="Iris-versicolor" recordCount="40.0">
              <SimplePredicate field="petal_length" operator="lessOrEqual" value="4.75"/>
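
To make the tree fragment more concrete, here is a rough sketch of how a scoring engine might walk such nodes. The XML below is a hypothetical, completed version of the slide's fragment: the closing tags and the Iris-virginica leaf (`id="4"`) are assumptions added so that it parses, and the namespace, `ScoreDistribution` elements, and the rest of a real PMML file are left out. The helper names are also made up; a real PMML engine handles many more predicate types and missing-value strategies.

```python
import xml.etree.ElementTree as ET

# Hypothetical completion of the slide's fragment (node 4 and closing tags are assumed)
TREE_XML = """
<Node id="0">
  <True/>
  <Node id="1" score="Iris-setosa">
    <SimplePredicate field="petal_length" operator="lessOrEqual" value="2.6"/>
  </Node>
  <Node id="2">
    <SimplePredicate field="petal_length" operator="greaterThan" value="2.6"/>
    <Node id="3" score="Iris-versicolor">
      <SimplePredicate field="petal_length" operator="lessOrEqual" value="4.75"/>
    </Node>
    <Node id="4" score="Iris-virginica">
      <SimplePredicate field="petal_length" operator="greaterThan" value="4.75"/>
    </Node>
  </Node>
</Node>
"""

# Only the two operators that appear in the fragment are supported here
OPERATORS = {
    "lessOrEqual": lambda a, b: a <= b,
    "greaterThan": lambda a, b: a > b,
}


def node_matches(node, record):
    """Evaluate a node's predicate (<True/> or <SimplePredicate>) on one record."""
    if node.find("True") is not None:
        return True
    predicate = node.find("SimplePredicate")
    compare = OPERATORS[predicate.get("operator")]
    return compare(record[predicate.get("field")], float(predicate.get("value")))


def score(node, record):
    """Descend into the first matching child; return the score of the last node reached."""
    for child in node.findall("Node"):
        if node_matches(child, record):
            return score(child, record)
    return node.get("score")


root = ET.fromstring(TREE_XML)
results = [score(root, {"petal_length": x}) for x in (1.4, 4.0, 5.5)]
print(results)
```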
  39. PMML Powered

    From http://dmg.org/pmml/products.html: Alpine Data, Angoss, BigML, Equifax, Experian, FICO, Fiserv, Frontline Solvers, GDS Link, IBM (Includes SPSS), JPMML, KNIME, KXEN, Liga Data, Microsoft, MicroStrategy, NG Data, Open Data, Opera, Pega, Pervasive Data Rush, Predixion Software, Rapid I, R, Salford Systems (Minitab), SAND, SAS, Software AG (incl. Zementis), Spark, Sparkling Logic, Teradata, TIBCO, WEKA
  40. Agenda

    • Challenges • PMML Internals • PMML in Python and R • PMML in IBM products • PFA • ONNX
  41. PMML in Python

    The JPMML package is created and maintained by Villu Ruusmann in Estonia.
    From https://stackoverflow.com/questions/33221331/export-python-scikit-learn-models-into-pmml

        pip install git+https://github.com/jpmml/sklearn2pmml.git

    Example of how to export a classifier tree to PMML. First grow the tree:

        # example tree & viz from http://scikit-learn.org/stable/modules/tree.html
        from sklearn import datasets, tree
        iris = datasets.load_iris()
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(iris.data, iris.target)

    SkLearn2PMML conversion takes 2 arguments: an estimator (our clf) and a mapper for preprocessing. Our mapper is pretty basic, since there are no transformations.

        from sklearn_pandas import DataFrameMapper
        default_mapper = DataFrameMapper([(i, None) for i in iris.feature_names + ['Species']])
        from sklearn2pmml import sklearn2pmml
        sklearn2pmml(estimator=clf, mapper=default_mapper, pmml="IrisClassificationTree.pmml")

    Note: this call follows the older sklearn2pmml API used in the Stack Overflow answer; more recent releases expect the estimator to be wrapped in a PMMLPipeline instead.
  42. PMML in R

    • R packages “pmml” and “pmmlTransformations”: https://cran.r-project.org/package=pmml
    • Supports a number of R models: ada, amap, arules, caret, clue, data.table, gbm, glmnet, neighbr, nnet, rpart, randomForest, kernlab, e1071, testthat, survival, xgboost, knitr
    • Maintained by Dmitriy Bolotov and others from Software AG
    • JPMML also has a package that augments “pmml” and provides PMML export for additional R models

    Build and save a decision tree (C&RT) model predicting Species class:

        > library(rpart); library(pmml)
        > irisTree <- rpart( Species~., iris )
        > saveXML( pmml( irisTree ), "IrisTree.xml" )
  43. Agenda

    • Challenges • PMML Internals • PMML in Python and R • PMML in IBM products • PFA • ONNX
  44. IBM SPSS Statistics

    • 1968: Statistical Package for Social Sciences
    • Acquired by IBM in 2009
    • Release 25 in August 2017, 26 in Spring 2019
    • Subscription option
    • Integration with Python and R
  45. IBM SPSS Statistics

    Transformation PMML from:
        • ADP (Automatic Data Preparation)
        • TMS Begin/TMS End
    Model PMML from:
        • COXREG, CSCOXREG
        • CSGLM, CSLOGISTIC, CSORDINAL
        • GENLIN, Logistic regression, NOMREG
        • GENLINMIXED
        • LINEAR, KNN
        • MLP, RBF neural networks
        • NAÏVE BAYES
        • REGRESSION
        • TREE, TSMODEL
        • TWOSTEP CLUSTER

    IBM SPSS Modeler
        • Apriori, CARMA, Association Rules
        • C5, CART, Chaid decision trees
        • Cox regression
        • GENLIN
        • Decision List
        • K-Means Cluster
        • KNN
        • LINEAR, Regression
        • Logistic Regression
        • MLP and RBF
        • NOMREG
        • Random Trees
        • Regression
        • Two Step Cluster
  46. Watson Studio (formerly Data Science Experience)

    • PMML export is possible in Jupyter notebooks, Modeler flows, R Studio.
    • PMML scoring can be done in Flows, notebooks, Watson Machine Learning.

    CODAIT/Cognitive Applications/ September 20, 2019 / © 2019 IBM Corporation
  47. An example of practical application (from Software AG)

    • Monitoring sensor data from paint-spraying robots
    • Anomaly detection model in PMML
    • Sound an alarm when something starts going bad
    • Easy to update the model
    Image from Flickr on Tesla manufacturing
  48. Benefits of PMML

    • Allows seamless deployment and model exchange
    • Transparency: human- and machine-readable
    • Fosters best practices in model building and deployment
  49. Agenda

    • Challenges • PMML Internals • PMML in Python and R • PMML in IBM products • PFA • ONNX
  50. Portable Format for Analytics - PFA

    • PMML is great, except when a model or feature is not supported; PFA is meant to overcome this
    • JSON format, Avro schemas for data types
    • A mini functional math language + schema specification
    • Info: dmg.org/pfa
    • Jim Pivarski
  51. PFA details

    • PFA consists of:
        • JSON serialization format
        • Avro schemas for data types
        • Encodes functions (actions) that are applied to inputs to create outputs with a set of built-in functions and language constructs (e.g. control-flow, conditionals)
        • Built-in functions and common models
    • Type and function system means PFA can be fully & statically verified on load and run by any compliant execution engine
    • Portability across languages, frameworks, run times and versions
  52. A Simple Example of PFA (copied from Nick Pentreath’s presentation)

    • Example: multi-class logistic regression
    • Specify input and output types using Avro schemas
    • Specify the action to perform (typically on input)
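
The logistic regression example from the slide is not reproduced in this transcript, so the sketch below uses a much smaller PFA document instead: one that adds 100 to each numeric input. It shows the three parts named above (input type, output type, action) as a Python dictionary serialized to JSON. The `run_toy` evaluator is purely hypothetical and understands only this one document shape; it is not a PFA engine such as Hadrian.

```python
import json

# Minimal PFA document: Avro types for input and output, plus one action
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}


def run_toy(document, value):
    """Toy evaluator for this single document shape; NOT a general PFA engine."""
    result = None
    for expression in document["action"]:
        function_name, arguments = next(iter(expression.items()))
        if function_name != "+":
            raise NotImplementedError(function_name)
        # The string "input" refers to the current input; numbers are literals
        operands = [value if argument == "input" else argument for argument in arguments]
        result = float(sum(operands))
    return result


pfa_json = json.dumps(pfa_document, indent=2)
print(pfa_json)
print(run_toy(pfa_document, 1.0))
```

Because the whole model is plain JSON with declared types, a compliant engine can check it on load before scoring any data, which is the portability argument made on the previous slide.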
  53. Known Support for PFA

    • Hadrian (PFA export and scoring engine) from Open Data Group (Chicago, IL)
    • Aardpfark (PFA export in SparkML) by Nick Pentreath, IBM CODAIT, South Africa
    • Woken (PFA export and validation) by Ludovic Claude, CHUV, Lausanne, Switzerland
    There was a lot of interest in PFA. Many opportunities for open source contributions.
  54. Use of PMML and PFA in medical applications

    Ludovic Claude, CHUV Lausanne, Switzerland
    Human Brain Project
  55. Agenda

    • Challenges • PMML Internals • PMML in Python and R • PMML in IBM products • PFA • ONNX
  56. ONNX: Open Neural Network eXchange

    • Since Sep. 2017
    • Protobuf
    • Covers DL and traditional ML
    • Active work by many companies
  57. ONNX Background

    ▪ Initial goal: make it easier to exchange trained models between DL frameworks.
    ▪ The ONNX GitHub organization has 20 repos; onnx is the core. Others are tutorials, model zoo, importers and exporters for frameworks.
    ▪ onnx/onnx currently has 12 releases, 112 contributors, 5771 stars.
    ▪ Core is in C++ with Python API and tools.
    ▪ Supported frameworks: Caffe2, Chainer, Cognitive Toolkit (CNTK), Core ML, MXNet, PyTorch, PaddlePaddle; TF in progress
  58. ONNX use pattern

    Diagram: Frontend (models in different frameworks, Training) → Export → ONNX IR Spec (.onnx) → Import → Backend (models in different frameworks, Inference, Run)
    Tools: Netron visualizer, Net Drawer visualizer, Checker, Shape Inferencer, Graph Optimizer, Opset Version Converter
  59. ONNX governance: under LF AI now

    Steering Committee of 5
    Working groups:
        • Edge
        • Pipelines
        • Training
        • Testing and compliance
    SIGs:
        • Infra
        • Operators
        • Converters
        • Model Zoo
  60. Conclusions

    • Model deployment is an important part of the ML lifecycle
    • DMG works on open standards for model deployment
    • PMML eases deployment for supported models and data prep
    • PFA is an emerging standard that needs work
    • ONNX is becoming a de facto standard for Deep Learning, needs work!
  61. Links and resources

    • @SvetaLevitan
    • PMML: dmg.org/pmml
    • PFA: dmg.org/pfa
    • ONNX: onnx.ai
    • CODAIT: codait.org
    • SPSS: https://www.ibm.com/analytics/spss-statistics-software
    • Watson Studio: https://www.ibm.com/cloud/watson-studio
    • Sign up for free IBM Cloud account: https://ibm.biz/BdzwwC
    • Join Meetup groups: Big Data Developers, Chicago ML