
Svetlana Levitan
May 30, 2019

Open standards for machine learning model deployment

Predictive model deployment is the part of the machine learning process where practical results are achieved: the model is used to generate predictions on new data (known as scoring). Deployment used to present big difficulties, as models were typically built in one environment and needed to be deployed in a different one. Often they had to be re-implemented in a new programming language, which was slow and error-prone.

Predictive Model Markup Language (PMML) and Portable Format for Analytics (PFA) were developed by the Data Mining Group (DMG), which originated in Chicago. PMML has been around for more than 20 years and is widely used. PFA is an emerging standard that is attracting a lot of interest. The Open Neural Network eXchange (ONNX) format was recently developed by Facebook and Microsoft as a way to exchange deep learning (DL) models between different DL frameworks, and is now experiencing explosive growth. Attendees will get a good understanding of predictive model deployment challenges and approaches.


Transcript

  1. Open standards for machine learning model deployment

    Svetlana Levitan, PhD
    Developer Advocate and PMML Release Manager
    Center for Open Data and AI Technologies (CODAIT)
    IBM Cognitive Applications
  2. Who is Svetlana Levitan?

    Originally from Moscow, Russia
    PhD in Applied Mathematics and MS in Computer Science from University of Maryland, College Park
    Software Engineer for SPSS Analytic components (2000-2018)
    Developer Advocate with IBM Center for Open Data and AI Technologies (since June)
    Married, two daughters who love programming
  3. https://callforcode.org

    The world answered the call in 2018, and we made a difference…
    …in 2019, we have a chance to change the world together…
    CODAIT/Cognitive Applications/ May 30, 2019 / © 2019 IBM Corporation
  4. 1500 Drones Are Ready To Fly: IBM Developer Drone Challenge

    The challenge runs from May 13 through June 16; those 18 years of age or older residing in the US and Canada are eligible to enter
    Enter at https://developer.ibm.com/contest
    Once per week during the 5 weeks of the contest, a random drawing will be held to determine the winners (watch for the drawing on Twitch)
    Winners will receive a DJI Tello programmable drone, an IBM Developer T-shirt, and an IBM Developer laptop sticker
    #IBMDroneDrop
  5. Agenda

    • Deployment challenges
    • PMML Internals
    • PMML in Python and R
    • PMML in IBM products
    • PFA
    • ONNX
  6. Some ML models

    Clustering
    Linear regression
    Logistic regression
    Decision tree
    Neural network (Multi-Layer perceptron)
    Deep learning
    (C) 2018 IBM Corp
  7. The Iris dataset

    ▪ Data set from the UCI ML data repository
    ▪ 3 classes of iris flower: Setosa, Versicolor, Virginica, 50 cases each
    ▪ Four continuous attributes: sepal length, sepal width, petal length, petal width
  8. Typical Stages in Machine Learning

    Collect Data
    Analyze and Clean Data
    Transform Data
    Build a Model
    Deploy the model
    Monitor and update as needed
  9. Typical Stages in Machine Learning

    Collect Data
    Analyze and Clean Data
    Transform Data
    Build a Model
    Deploy the model
    Monitor and update as needed
  10. Model Deployment Challenges

    • Teams: data scientists and statisticians; application developers and IT
    • Environments: OS and file systems; databases, desktop, cloud
    • Languages: Python or R, various packages; C++ or Java or Scala; dependencies and versions
    • Data Preparation: aggregation and joins; normalization, category encoding, binning, missing value replacement
  11. DMG to the rescue!

    Data Mining Group, dmg.org
    The Data Mining Group is a consortium managed by the Center for Computational Science Research, Inc., an Illinois-based 501(c)(3) not-for-profit corporation
    Founded in the late 1990s by Professor Robert Grossman
  12. What is PMML?

    Predictive Model Markup Language
    • An Open Standard for XML Representation
    • Developed by DMG
    • Over 30 vendors and organizations
    • PMML 4.4 Release manager: Svetlana Levitan
    dmg.org/pmml
  13. Brief History of PMML versions

    0.7 in 1997: First
    1.1 in 2000: Six models
    2.0 in 2001: Transformations, Naïve Bayes, Sequence
    3.0 in 2004: Functions, Output, Composition, SVM, Ruleset
    4.0 in 2009: Ensembles, Cox, Time Series, Model Explanation
    4.4 in 2018: More Time Series, BN, Gaussian Process
  14. PMML under the hood

    • Header: application name and version; timestamp and copyright
    • Data Dictionary: field names and labels, values; data type and measurement level
    • Transformation Dictionary: Define Function; Derived Fields
    • Model(s): Mining Schema; specific model contents
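The four top-level parts listed on this slide can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustrative skeleton: the application name, copyright string, and field choices are hypothetical, and the result has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

def build_pmml_skeleton():
    """Assemble Header, DataDictionary, TransformationDictionary and one model."""
    pmml = ET.Element("PMML", {"version": "4.4",
                               "xmlns": "http://www.dmg.org/PMML-4_4"})

    # Header: application name/version, timestamp, copyright (values are made up)
    header = ET.SubElement(pmml, "Header", {"copyright": "(hypothetical) Example Corp"})
    ET.SubElement(header, "Application", {"name": "demo-exporter", "version": "0.1"})
    ET.SubElement(header, "Timestamp").text = "2019-05-30T00:00:00"

    # Data Dictionary: field names, data type and measurement level (optype)
    dd = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": "2"})
    ET.SubElement(dd, "DataField", {"name": "petal_length",
                                    "optype": "continuous", "dataType": "double"})
    ET.SubElement(dd, "DataField", {"name": "species",
                                    "optype": "categorical", "dataType": "string"})

    # Transformation Dictionary: left empty in this sketch
    ET.SubElement(pmml, "TransformationDictionary")

    # Model: Mining Schema first, then the model-specific contents
    model = ET.SubElement(pmml, "TreeModel", {"functionName": "classification"})
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", {"name": "petal_length"})
    ET.SubElement(schema, "MiningField", {"name": "species", "usageType": "target"})
    root = ET.SubElement(model, "Node", {"score": "Iris-setosa"})
    ET.SubElement(root, "True")
    return pmml

pmml = build_pmml_skeleton()
print(ET.tostring(pmml, encoding="unicode"))
```

A real exporter (for example the R `pmml` package or JPMML) fills in far more detail; the point here is only the ordering of the sections.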
  15. Transformations

    • NormContinuous: piece-wise linear transform
    • NormDiscrete: map a categorical field to a set of dummy fields
    • Discretize: binning
    • MapValues: map one or more categorical fields into another categorical one
    • Functions: built-in and user-defined
    • Other transformations
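The first four transformations can be sketched in a few lines of plain Python. The function names, argument shapes, and the choice to extrapolate along the outer segments are assumptions of this sketch, not a PMML scoring engine.

```python
from bisect import bisect_right

def norm_continuous(x, knots):
    """Piece-wise linear map through sorted (orig, norm) knots.
    Values outside the knots are extrapolated along the first/last segment."""
    xs = [k[0] for k in knots]
    i = min(max(bisect_right(xs, x) - 1, 0), len(knots) - 2)
    (x0, y0), (x1, y1) = knots[i], knots[i + 1]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def norm_discrete(field, value, categories):
    """Map a categorical value to a set of 0/1 dummy fields."""
    return {f"{field}_{c}": 1.0 if value == c else 0.0 for c in categories}

def discretize(x, bins, default=None):
    """Binning: bins are (low, high, label) with low <= x < high."""
    for low, high, label in bins:
        if low <= x < high:
            return label
    return default

def map_values(keys, table, default=None):
    """Map a tuple of one or more categorical values to another category."""
    return table.get(tuple(keys), default)

print(norm_continuous(6.1, [(4.3, 0.0), (7.9, 1.0)]))
print(norm_discrete("species", "setosa", ["setosa", "versicolor", "virginica"]))
print(discretize(5.0, [(0.0, 2.6, "short"), (2.6, 4.75, "medium"), (4.75, 10.0, "long")]))
print(map_values(["setosa"], {("setosa",): "A", ("versicolor",): "B"}))
```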
  16. PMML Models

    o Association Rules Model
    o Clustering Model
    o General Regression
    o Naïve Bayes
    o Nearest Neighbor Model
    o Neural Network
    o Regression
    o Tree Model
    o Mining Model: composition or ensemble (or both) of models
    o Baseline Model
    o Bayesian Network
    o Gaussian Process
    o Ruleset
    o Scorecard
    o Sequence Model
    o Support Vector Machine
    o Time Series
  17. Contents of a PMML Model

    ❖ Mining Schema: target and predictors, importance, missing value treatment, invalid value treatment, outlier treatment
    ❖ Output: what to report, post-processing
    ❖ Model Stats: description of input data
    ❖ Model Explanation: model diagnostics, useful for visualization
    ❖ Targets: target category info and prior probabilities
    ❖ Local Transformations: predictor transformations local to the model
    ❖ …<Specific model contents>…
    ❖ Model Verification: expected results for some cases
    August 16, 2018 / © 2018 IBM Corporation
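The Mining Schema treatments mentioned above (missing, invalid, and outlier values) can be illustrated with a small function. The option names mirror PMML's `invalidValueTreatment` and `outliers` attribute values, but the function itself, its signature, and its order of checks are a simplified assumption rather than the behaviour of any particular scoring engine.

```python
def prepare_field(value, *, valid_values=None, missing_replacement=None,
                  invalid_treatment="returnInvalid", outliers="asIs",
                  low=None, high=None):
    """Apply invalid-value, missing-value and outlier treatment to one input."""
    # Invalid value treatment: the value is outside the declared valid set
    if value is not None and valid_values is not None and value not in valid_values:
        if invalid_treatment == "returnInvalid":
            raise ValueError(f"invalid value: {value!r}")
        if invalid_treatment == "asMissing":
            value = None
        # "asIs": the value passes through unchanged

    # Missing value treatment: substitute the replacement (may itself be None)
    if value is None:
        return missing_replacement

    # Outlier treatment for numeric values, only when both bounds are given
    if low is not None and high is not None:
        if outliers == "asExtremeValues":
            return min(max(value, low), high)
        if outliers == "asMissingValues" and not (low <= value <= high):
            return missing_replacement
    return value

print(prepare_field(None, missing_replacement=3.7))
print(prepare_field(9.0, outliers="asExtremeValues", low=1.0, high=6.9))
```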
  18. Specific contents of some models

    • Clustering: ClusteringField, ComparisonMeasure, Cluster
    • Neural network: NeuralInputs, NeuralLayer, Neuron, NeuralOutputs, Con
    • Regression: RegressionTable; NumericPredictor, CategoricalPredictor
    • Tree Model: Node, Predicates, ScoreDistribution
  19. Example PMML - Neural Network hidden layer and outputs

    [Annotated PMML listing; callouts: Hidden layer neuron; Output Layer Neurons; Connecting target to the neurons]
  20. Example PMML for a Tree Model

        <Node id="0">
          <True/>
          <Node id="1" score="Iris-setosa" recordCount="50.0">
            <SimplePredicate field="petal_length" operator="lessOrEqual" value="2.6"/>
            <ScoreDistribution value="Iris-setosa" recordCount="50.0"/>
            <ScoreDistribution value="Iris-versicolor" recordCount="0.0"/>
            <ScoreDistribution value="Iris-virginica" recordCount="0.0"/>
          </Node>
          <Node id="2">
            <SimplePredicate field="petal_length" operator="greaterThan" value="2.6"/>
            <Node id="3" score="Iris-versicolor" recordCount="40.0">
              <SimplePredicate field="petal_length" operator="lessOrEqual" value="4.75"/>
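To show how such a tree is scored, here is a minimal sketch that walks `Node` elements and evaluates `SimplePredicate` conditions. The slide's fragment is truncated, so the closing tags and the `Iris-virginica` leaf (node 4) below are hypothetical completions added only to make the example well-formed; the traversal also ignores PMML options such as missing value strategies.

```python
import operator
import xml.etree.ElementTree as ET

TREE_XML = """
<Node id="0">
  <True/>
  <Node id="1" score="Iris-setosa" recordCount="50.0">
    <SimplePredicate field="petal_length" operator="lessOrEqual" value="2.6"/>
  </Node>
  <Node id="2">
    <SimplePredicate field="petal_length" operator="greaterThan" value="2.6"/>
    <Node id="3" score="Iris-versicolor" recordCount="40.0">
      <SimplePredicate field="petal_length" operator="lessOrEqual" value="4.75"/>
    </Node>
    <!-- hypothetical leaf, not on the slide -->
    <Node id="4" score="Iris-virginica">
      <SimplePredicate field="petal_length" operator="greaterThan" value="4.75"/>
    </Node>
  </Node>
</Node>
"""

OPS = {"lessOrEqual": operator.le, "lessThan": operator.lt,
       "greaterOrEqual": operator.ge, "greaterThan": operator.gt}

def matches(node, record):
    """A node matches if it has <True/> or its SimplePredicate holds."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None:
        return False
    return OPS[pred.get("operator")](record[pred.get("field")], float(pred.get("value")))

def score(node, record):
    """Descend into the first matching child; fall back to this node's score."""
    if not matches(node, record):
        return None
    for child in node.findall("Node"):
        result = score(child, record)
        if result is not None:
            return result
    return node.get("score")

root = ET.fromstring(TREE_XML)
for length in (1.4, 4.0, 5.5):
    print(length, score(root, {"petal_length": length}))
```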
  21. PMML Powered

    From http://dmg.org/pmml/products.html:
    Alpine Data, Angoss, BigML, Equifax, Experian, FICO, Fiserv, Frontline Solvers, GDS Link, IBM (Includes SPSS), JPMML, KNIME, KXEN, Liga Data, Microsoft, MicroStrategy, NG Data, Open Data, Opera, Pega, Pervasive Data Rush, Predixion Software, Rapid I, R, Salford Systems (Minitab), SAND, SAS, Software AG (incl. Zementis), Spark, Sparkling Logic, Teradata, TIBCO, WEKA
  22. Agenda

    • Challenges
    • PMML Internals
    • PMML in Python and R
    • PMML in IBM products
    • PFA
    • ONNX
  23. PMML in Python

    The JPMML package is created and maintained by Villu Ruusmann.
    From https://stackoverflow.com/questions/33221331/export-python-scikit-learn-models-into-pmml

        pip install git+https://github.com/jpmml/sklearn2pmml.git

    Example of how to export a classifier tree to PMML. First grow the tree:

        # example tree & viz from http://scikit-learn.org/stable/modules/tree.html
        from sklearn import datasets, tree
        iris = datasets.load_iris()
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(iris.data, iris.target)

    SkLearn2PMML conversion takes 2 arguments: an estimator (our clf) and a mapper for preprocessing. Our mapper is pretty basic, since there are no transformations.

        from sklearn_pandas import DataFrameMapper
        default_mapper = DataFrameMapper([(i, None) for i in iris.feature_names + ['Species']])
        from sklearn2pmml import sklearn2pmml
        # Call form as given in the quoted answer; newer sklearn2pmml releases take a PMMLPipeline instead.
        sklearn2pmml(estimator=clf, mapper=default_mapper, pmml="IrisClassificationTree.pmml")
  24. PMML in R

    R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.
    R packages “pmml” and “pmmlTransformations”: https://cran.r-project.org/package=pmml
    Depend on the XML package
    Supports a number of R models: ada, amap, arules, gbm, glmnet, neighbr, nnet, rpart, randomForest, kernlab, e1071, testthat, survival, xgboost, knitr, rmarkdown
    Maintained by Dmitriy Bolotov and Tridivesh Jena from Software AG
    JPMML also has a package that augments “pmml” and provides PMML export for additional R models
  25. Create PMML in R (using R Studio)

        > library(XML)
        > library(pmml)
        > library(rpart)   # needed for rpart() below
        > data(iris)

    Build and save a linear regression model predicting Sepal length:

        > irisLR <- lm(Sepal.Length ~ ., iris)
        > saveXML(pmml(irisLR), "IrisLR.xml")

    Build and save a decision tree (C&RT) model predicting Species class:

        > irisTree <- rpart(Species ~ ., iris)
        > saveXML(pmml(irisTree), "IrisTree.xml")
  26. Agenda

    • Challenges
    • PMML Internals
    • PMML in Python and R
    • PMML in IBM products
    • PFA
    • ONNX
  27. IBM SPSS Statistics

    1968: Statistical Package for Social Sciences
    Acquired by IBM in 2009
    Release 25 in August 2017
    Subscription option
    Integration with Python and R
  28. IBM SPSS Statistics

    Transformation PMML from: ADP (Automatic Data Preparation), TMS Begin/TMS End
    Model PMML from: COXREG, CSCOXREG, CSGLM, CSLOGISTIC, CSORDINAL, GENLIN, Logistic regression, NOMREG, GENLINMIXED, LINEAR, KNN, MLP, RBF neural networks, NAÏVE BAYES, REGRESSION, TREE, TSMODEL, TWOSTEP CLUSTER

    IBM SPSS Modeler

    Apriori, CARMA, Association Rules; C5, CART, Chaid decision trees; Cox regression; GENLIN; Decision List; K-Means Cluster; KNN; LINEAR, Regression; Logistic Regression; MLP and RBF; NOMREG; Random Trees; Regression; Two Step Cluster
  29. Watson Studio (formerly Data Science Experience)

    PMML export is enabled in Jupyter notebooks, and is also possible in R Studio.
    PMML scoring can be done in Flows, notebooks, Watson Machine Learning
  30. Using SPSS Two Step Cluster model in Python notebook

    Create a Jupyter notebook with Spark service
    Load the data into Cloud Object Storage, then into the notebook as a Spark data frame, specifying which fields are numeric-valued

        from spss.ml.clustering.twostep import TwoStep
        from spss.ml.clustering.twostep import TwoStepModel

        cluster = TwoStep().\
            setInputFieldList(["sepal_length", "sepal_width", "petal_length", "petal_width"]).\
            setDistMeasure("LOGLIKELIHOOD").\
            setFeatureSelection(False).\
            setAutoClustering(True)
        clusterModel = cluster.fit(df_data_1)
        cePMML = clusterModel.toPMML()
  31. Benefits of PMML

    Allows seamless deployment and model exchange
    Transparency: human- and machine-readable
    Fosters best practices in model building and deployment
  32. Agenda

    • Challenges
    • PMML Internals
    • PMML in Python and R
    • PMML in IBM products
    • PFA
    • ONNX
  33. Portable Format for Analytics - PFA

    PMML is great, except when a model or feature is not supported
    PFA aims to overcome this
    JSON format, AVRO schemas for data types
    A mini functional math language + schema specification
    Info: dmg.org/pfa
    Jim Pivarski
  34. PFA details

    • PFA consists of:
      • JSON serialization format
      • AVRO schemas for data types
      • Encodes functions (actions) that are applied to inputs to create outputs with a set of built-in functions and language constructs (e.g. control-flow, conditionals)
      • Built-in functions and common models
    • Type and function system means PFA can be fully & statically verified on load and run by any compliant execution engine
    • Portability across languages, frameworks, run times and versions
  35. A Simple Example of PFA (copied from Nick Pentreath’s presentation)

    • Example – multi-class logistic regression
    • Specify input and output types using Avro schemas
    • Specify the action to perform (typically on input)
  36. Managing State in PFA (copied from Nick Pentreath’s presentation)

    • Data storage specified by cells
      • A cell is a named value acting as a global variable
      • Typically used to store state (such as model coefficients, vocabulary mappings, etc.)
      • Types specified with Avro schemas
      • Cell values are mutable within an action, but immutable between action executions of a given PFA document
    • Persistent storage specified by pools
      • Closer in concept to a database
      • Pool values are mutable across action executions
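The shape of a PFA document (input and output types, cells, and an action) can be sketched with a small JSON document and a toy evaluator. The document below is a made-up scale-and-shift example, much simpler than the logistic regression on the slide, and the evaluator handles only numeric literals, `input`, cell reads, and three arithmetic functions. It is an illustration only, not a compliant PFA engine such as Hadrian.

```python
import json

# A tiny PFA-style document: Avro type names for input/output,
# two cells holding model state, and one action expression.
PFA_DOC = json.loads("""
{
  "input": "double",
  "output": "double",
  "cells": {
    "scale":  {"type": "double", "init": 2.0},
    "offset": {"type": "double", "init": 100.0}
  },
  "action": [
    {"+": [{"*": ["input", {"cell": "scale"}]}, {"cell": "offset"}]}
  ]
}
""")

# Toy subset of built-in functions (assumption: binary arithmetic only)
BUILTINS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def evaluate(expr, doc, datum):
    """Evaluate one expression against the current input datum."""
    if isinstance(expr, (int, float)):
        return expr                       # numeric literal
    if expr == "input":
        return datum                      # the current input value
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()
        if name == "cell":
            return doc["cells"][args]["init"]   # read a cell's value
        return BUILTINS[name](*(evaluate(a, doc, datum) for a in args))
    raise ValueError(f"unsupported expression: {expr!r}")

def run_action(doc, datum):
    """Run every expression in the action; the last result is the output."""
    result = None
    for expr in doc["action"]:
        result = evaluate(expr, doc, datum)
    return result

print(run_action(PFA_DOC, 3.0))
```

Because the whole model is data (JSON plus type declarations), any engine that implements the same functions can score it, which is the portability argument made on the earlier slide.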
  37. Known Support for PFA

    Hadrian (PFA export and scoring engine) from Open Data Group (Chicago, IL)
    Aardpfark (PFA export in SparkML) by Nick Pentreath, IBM CODAIT, South Africa
    Woken (PFA export and validation) by Ludovic Claude, CHUV, Lausanne, Switzerland
    There is a lot of interest in PFA! If you want to help, let me know.
  38. What about deep learning models?

    Current PMML NN model would be too verbose.
    How to represent convolutional or recurrent networks? Tensors?
    So many DL frameworks… Need an interchange format.
    A draft proposal in PMML, to be presented at a conference in Anaheim, CA in August
    And there is ONNX!
  39. ONNX: Open Neural Network eXchange

    Since 2017
    Protobuf
    Covers DL and traditional ML
    Active work by many companies
  40. ONNX Background and factoids

    ▪ Started Sept 2017 by Microsoft and Facebook
    ▪ Initial goal is to make it easier for data analysts to exchange trained models between different machine learning frameworks.
    ▪ ONNX github has 20 repos. onnx is the core. Others are tutorials, model zoo, importers and exporters for frameworks.
    ▪ onnx/onnx currently has 12 releases, 112 contributors, 5771 stars.
    ▪ Core is in C++ with Python API and tools.
    ▪ Supported frameworks: Caffe2, Chainer, Cognitive Toolkit (CNTK), Core ML, MXNet, PyTorch, PaddlePaddle
  41. What is ONNX (onnx/onnx)

    ▪ Open ecosystem for interchangeable AI models
    ▪ ONNX is designed to be an open format and specification, empowering developers to freely select the framework/tool that works best for their project, at any stage of development.
    ▪ Key feature is to generically describe the model graph, which serves as an Intermediate Representation (IR) that captures the specific intent of the developer's source code.
    ▪ The ONNX models (xxx.onnx) are binary protobuf files which contain the network structure and parameters of the model.
  42. ONNX use pattern

    [Diagram labels: ONNX IR Spec (.onnx); Frontend: models in different frameworks; Backend: models in different frameworks; Tools: Netron visualizer, Net Drawer visualizer, Checker, Shape Inferencer, Graph Optimizer, Opset Version Converter; Training; Inference; Export; Import; Run]
  43. ONNX components

    ▪ The ONNX open spec comprises the following components:
      ▪ A definition of an extensible computation graph model.
      ▪ Definitions of standard data types.
      ▪ Definitions of built-in operators.
    ▪ ONNX does not pre-suppose or imply any particular method of runtime implementation. ONNX specifies the portable, serialized format of a computation graph (xxx.onnx).
    ▪ ONNX defines a standard set of operators that all implementations MUST support (https://github.com/onnx/onnx/blob/master/onnx/onnx-operators.proto, https://github.com/onnx/onnx/blob/master/docs/Operators.md).
    ▪ An implementation MAY extend ONNX by adding operators expressing semantics beyond the standard set (need to investigate how?).
  44. ONNX IR Spec – computation graph, used to describe executable graphs that can be executed directly by a framework, runtime, or engine

    • Model: OpSetIDs; graph (Graph)
    • Graph: name; node (Node[]); inputs (ValueInfo[]); initializer (Tensor[]); outputs (ValueInfo[])
    • Node: op_type (string); input (string[]); output (string[]); name (string); attribute (Attribute[])
    • Tensor: data_type; dims (int64[]); xxx_data (xxx[])
    • Attribute: name; type; value (xxx or xxx[])
    • ValueInfo: name; type
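The graph structure on this slide can be mirrored with plain Python dataclasses to show how nodes are wired together by value names. These classes are illustrative stand-ins, not the protobuf classes of the `onnx` package; the `name` field on `Tensor` and the `undefined_inputs` helper are assumptions added for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ValueInfo:
    name: str
    type: str

@dataclass
class Tensor:
    name: str              # assumption: initializers are looked up by name
    data_type: str
    dims: List[int]
    data: List[float]

@dataclass
class Node:
    op_type: str
    input: List[str]
    output: List[str]
    name: str = ""

@dataclass
class Graph:
    name: str
    node: List[Node] = field(default_factory=list)
    inputs: List[ValueInfo] = field(default_factory=list)
    initializer: List[Tensor] = field(default_factory=list)
    outputs: List[ValueInfo] = field(default_factory=list)

def undefined_inputs(graph):
    """Return node inputs not produced by a graph input, an initializer,
    or an earlier node (nodes are assumed to be listed in execution order)."""
    known = {v.name for v in graph.inputs} | {t.name for t in graph.initializer}
    missing = []
    for node in graph.node:
        missing += [n for n in node.input if n not in known]
        known.update(node.output)
    return missing

# Y = X * W + B, expressed as two nodes connected through the value "XW"
linear = Graph(
    name="linear",
    node=[Node("MatMul", ["X", "W"], ["XW"]), Node("Add", ["XW", "B"], ["Y"])],
    inputs=[ValueInfo("X", "float[1,4]")],
    initializer=[Tensor("W", "float", [4, 3], [0.0] * 12),
                 Tensor("B", "float", [3], [0.0] * 3)],
    outputs=[ValueInfo("Y", "float[1,3]")],
)
print(undefined_inputs(linear))
```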
  45. ONNX IR Spec – operator set, used to describe a set of operators that are available in a given environment

    • OperatorSet: opset_version; functions (Function[]); operator (Operator[])
    • Operator: op_type (string); status
    • Function: name (string); status; input (string[]); output (string[]); node (Node[]); attribute (string[])
  46. Open standards advocacy at IBM

    Nick Pentreath presented on PFA and ONNX at many conferences
    Svetlana presented at many conferences, Meetups, Northwestern U.
    Poster on PMML/PFA/ONNX at Applied Machine Learning Days in January 2019
  47. Conclusions

    Model deployment is an important part of the ML lifecycle
    Data Mining Group works on open standards for model deployment
    PMML eases deployment for supported models and data prep
    PFA is an emerging standard that needs your help
    ONNX is becoming a de-facto standard for Deep Learning
  48. Links and resources

    Twitter: @SvetaLevitan
    Drone challenge: https://developer.ibm.com/contest
    PMML: dmg.org/pmml
    PFA: dmg.org/pfa
    ONNX: onnx.ai
    CODAIT: codait.org
    SPSS: https://www.ibm.com/analytics/spss-statistics-software
    Watson Studio: https://www.ibm.com/cloud/watson-studio
    Sign up for free IBM Cloud account: https://ibm.biz/BdzrCz