
PMML with R and Java


Brief summary: Thomas will introduce PMML, an open standard for the definition of predictive models. He will demonstrate how to build a predictive model in R and export it to PMML, and how to score data in a Java application using the exported PMML model.

Thomas Darimont

September 24, 2014


Transcript

  1. PMML with R and Java
     Thomas Darimont
     Data Science Meetup Luxembourg, 24th Sep 2014
  2. Predictive Model Lifecycle: Traditional way …
     Scientist
     •  Uses statistical tool
     •  Defines / trains model
     •  R, Python
     •  Writes model specification
     Model Specification V1
     Engineer
     •  Implements Spec
     •  Writes custom code
     •  C++, C#, Java
     •  Deploys model (code)
     Source: Own representation based on “Representing Predictive Solutions with PMML”, by Alex Guazzelli https://www.youtube.com/watch?v=QBpguVZRVPo
  3. Problems
     •  Model definition not machine readable
     •  Model needs to be implemented by hand
     •  Changes in the model documents have to be propagated – by hand
     •  Time consuming (weeks, months, years)
     •  Prone to errors
     •  Implementation ≠ Specification
     Solution?
  4. PMML
     •  Predictive Model Markup Language
     •  Open Standard
     •  Maintained by Data Mining Group (DMG)
     •  XML based DSL for predictive models
     •  First Version (1999) – Current Version 4.2.1
     Goal: “Bridge the Gap between Data Scientists and Engineers”
  5. Anatomy of PMML Model
     •  Pre Processing
     •  Predictive Model
        – Algorithm description(s)
        – Parameterization -> trained model
     •  Post Processing
        – Transform model output
        – Thresholds / Business rules
     Source: PMML in Action, 2nd Edition, 2012, p. 7.
  6. PMML General Structure
     Header
     •  Version / Timestamp
     •  Model development environment information
     Data Dictionary
     •  Definition of variable types
     •  Handling of valid, invalid and missing values
     Data Transformations
     •  Pre-processing: Normalization, mapping and discretization
     •  Built-in and user defined functions
     Model 1..*
     •  Mining Schema
     •  Targets
     •  Outputs
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.
  7. PMML Model Structure
     Mining Schema
     •  Definition of usage type
     •  Outlier and missing value treatment / replacement
     Targets
     •  Prior probability and default value
     Outputs
     •  List of computed output fields
     •  Post-processing
     (Parameters)
     •  Definition of model specific parameters
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.
  8. PMML example
     R model: irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)
     Labelled parts of the PMML document: Header, Data Dictionary, Model, Model Parameters, Output
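To make the labelled parts on this slide concrete, here is a minimal Java sketch that reads a hand-written, heavily trimmed PMML regression document with the JDK's XML parser and scores one input. It does not use JPMML. The class name `PmmlRegressionSketch`, the omission of the PMML namespace, and the intercept and coefficient values are assumptions made for illustration; they are not the output of an actual R export.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PmmlRegressionSketch {

    // Trimmed PMML fragment with Header, DataDictionary and one model.
    // The numbers are illustrative placeholders, not a real lm() fit.
    static final String PMML =
        "<PMML version=\"4.2\">"
      + "<Header description=\"Linear regression sketch\"/>"
      + "<DataDictionary numberOfFields=\"2\">"
      + "<DataField name=\"Petal.Width\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"Petal.Length\" optype=\"continuous\" dataType=\"double\"/>"
      + "</DataDictionary>"
      + "<RegressionModel functionName=\"regression\">"
      + "<MiningSchema>"
      + "<MiningField name=\"Petal.Width\" usageType=\"predicted\"/>"
      + "<MiningField name=\"Petal.Length\" usageType=\"active\"/>"
      + "</MiningSchema>"
      + "<RegressionTable intercept=\"-0.36\">"
      + "<NumericPredictor name=\"Petal.Length\" exponent=\"1\" coefficient=\"0.42\"/>"
      + "</RegressionTable>"
      + "</RegressionModel>"
      + "</PMML>";

    // Computes intercept + sum(coefficient * value^exponent) for one input field.
    static double score(String pmml, String field, double value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            double result = Double.parseDouble(table.getAttribute("intercept"));
            NodeList predictors = table.getElementsByTagName("NumericPredictor");
            for (int i = 0; i < predictors.getLength(); i++) {
                Element p = (Element) predictors.item(i);
                if (p.getAttribute("name").equals(field)) {
                    double exponent = p.hasAttribute("exponent")
                        ? Double.parseDouble(p.getAttribute("exponent")) : 1.0;
                    double coefficient = Double.parseDouble(p.getAttribute("coefficient"));
                    result += coefficient * Math.pow(value, exponent);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate PMML sketch", e);
        }
    }

    public static void main(String[] args) {
        // Predicted Petal.Width for a Petal.Length of 4.0
        System.out.println(score(PMML, "Petal.Length", 4.0));
    }
}
```

A production scorer would also honour the Data Dictionary and Mining Schema (types, missing and invalid values), which this sketch ignores.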
  9. PMML Supported Models
     •  15 model types
     •  Association Rules
     •  Baseline Models
     •  Cluster Models
     •  (General) Regression
     •  k-Nearest Neighbors
     •  Naive Bayes
     •  Neural Network
     •  Ruleset
     •  Scorecard
     •  Sequences
     •  Text Models
     •  Time Series
     •  Trees
     •  Vector Machine
     •  … roll your own: Ensemble Models -> Use provided building blocks
  10. PMML Transformations
     •  Normalization: map values to numbers, the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
     •  Discretization: map continuous values to discrete values.
     •  Value Mapping: map discrete values to other discrete values.
     •  Functions: derive a value by applying a function to one or more parameters.
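The first three transformation kinds can be sketched in plain Java. This is an illustration of the ideas, not the PMML reference semantics: the class and method names are hypothetical, `normContinuous` assumes strictly increasing `orig` values and extrapolates from the nearest segment for out-of-range inputs, and `discretize` assumes half-open bins `[lower, upper)`.

```java
import java.util.Map;

final class PmmlTransformSketch {

    // Normalization of a continuous input: piecewise-linear interpolation
    // between (orig[i], norm[i]) pairs. Needs at least two pairs.
    static double normContinuous(double x, double[] orig, double[] norm) {
        int i = 0;
        // Advance to the segment containing x; stop at the last segment so
        // that values beyond the range are extrapolated from it.
        while (i < orig.length - 2 && x > orig[i + 1]) {
            i++;
        }
        double slope = (norm[i + 1] - norm[i]) / (orig[i + 1] - orig[i]);
        return norm[i] + (x - orig[i]) * slope;
    }

    // Discretization: return the label of the first bin [lower, upper)
    // containing x, or the default label if no bin matches.
    static String discretize(double x, double[] lower, double[] upper,
                             String[] labels, String defaultLabel) {
        for (int i = 0; i < labels.length; i++) {
            if (x >= lower[i] && x < upper[i]) {
                return labels[i];
            }
        }
        return defaultLabel;
    }

    // Value mapping: translate one discrete value into another.
    static String mapValue(String value, Map<String, String> table, String defaultValue) {
        return table.getOrDefault(value, defaultValue);
    }

    public static void main(String[] args) {
        System.out.println(normContinuous(5.0, new double[] {0, 10}, new double[] {0, 1}));
        System.out.println(discretize(3.0, new double[] {0, 5}, new double[] {5, 10},
                                      new String[] {"low", "high"}, "other"));
        System.out.println(mapValue("m", Map.of("m", "male", "f", "female"), "unknown"));
    }
}
```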
  11. PMML Functions
     •  Custom functions for common transformations
     •  Building blocks
     Category: Functions
     Arithmetic: +, -, * and /
     Math: log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
     Stats: min, max, sum, avg, median, product
     Logic: if, and, or, not, equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual, isMissing, isNotMissing, isIn, isNotIn
     String: uppercase, lowercase, substring, trimBlanks, concat, replace, matches
     Format: formatNumber, formatDatetime
     Date/Time: dateDaysSinceYear, dateSecondsSinceYear, dateSecondsSinceMidnight
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 63.
  12. PMML Multiple Models
     •  Several ways for combining multiple models in one PMML file
        – Model Segmentation
        – Model Ensemble
        – Model Chaining
        – Model Composition
     •  Custom extensions for referencing external model files
  13. Model Segmentation
     [Diagram, one PMML file] Raw input -> Input Validation (Outliers, Missing Values, Invalid Values) -> Data Pre-Processing -> Predicate based Model selection (E.g.: SelectFirst; X = 1, X = 2) -> Model 1, Model 2, …, Model n -> Prediction
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 190.
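Predicate based selection with SelectFirst can be sketched as follows: segments are checked in order and the first one whose predicate holds produces the score. The class `SegmentationSketch`, the single numeric input, and the two toy models are hypothetical; returning an empty `OptionalDouble` when nothing matches is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalDouble;
import java.util.function.DoublePredicate;
import java.util.function.DoubleUnaryOperator;

final class SegmentationSketch {

    // One segment: a predicate on the input plus the model used when it holds.
    static final class Segment {
        final DoublePredicate predicate;
        final DoubleUnaryOperator model;

        Segment(DoublePredicate predicate, DoubleUnaryOperator model) {
            this.predicate = predicate;
            this.model = model;
        }
    }

    // SelectFirst: score with the first segment whose predicate is true.
    static OptionalDouble selectFirst(List<Segment> segments, double x) {
        for (Segment s : segments) {
            if (s.predicate.test(x)) {
                return OptionalDouble.of(s.model.applyAsDouble(x));
            }
        }
        return OptionalDouble.empty(); // no segment matched
    }

    // Toy segments mirroring the "X = 1" and "X = 2" predicates on the slide.
    static List<Segment> demoSegments() {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(x -> x == 1.0, x -> 2.0 * x)); // "Model 1"
        segments.add(new Segment(x -> x == 2.0, x -> x + 1.0)); // "Model 2"
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(selectFirst(demoSegments(), 1.0));
        System.out.println(selectFirst(demoSegments(), 2.0));
        System.out.println(selectFirst(demoSegments(), 5.0));
    }
}
```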
  14. Model Ensemble
     [Diagram, one PMML file] Raw input -> Input Validation -> Data Pre-Processing -> Model 1, Model 2, …, Model n (Scores from all models are computed) -> Voting (Majority Voting, Weighted Voting, Weighted Average, etc.) -> Prediction
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 193.
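Two of the combination rules named on this slide, majority voting for class labels and weighted average for numeric scores, can be sketched in a few lines of Java. The class and method names are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EnsembleSketch {

    // Majority voting: the label predicted by the most models wins.
    // On a tie, the label that appeared first in the list is kept.
    static String majorityVote(List<String> predictions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : predictions) {
            counts.merge(p, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best; // null if the list was empty
    }

    // Weighted average: sum(weight_i * score_i) / sum(weight_i).
    static double weightedAverage(double[] scores, double[] weights) {
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += weights[i] * scores[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(List.of("setosa", "virginica", "setosa")));
        System.out.println(weightedAverage(new double[] {1, 2, 3}, new double[] {1, 1, 2}));
    }
}
```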
  15. Model Chaining
     [Diagram, one PMML file] Raw input -> Input Validation -> Data Pre-Processing -> Model 1 -> Model 2 -> … -> Model n -> Prediction
     Output scores from earlier models are used by subsequent models
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 195.
  16. Model Composition
     [Diagram, one PMML file] Raw input -> Input Validation -> Data Pre-Processing -> Main Model with predicate based model selection among Model 2, …, Model n -> Prediction
     Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 196.
  17. Model Verification
     •  “Scoring matching test”
     •  “Regression tests for models”
     •  VerificationFields
        – Asserts, range checks for results
     •  InlineTable
        – Input + expected output
        – Include already scored data
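The idea of shipping already scored records with the model and replaying them in the target environment can be sketched like this. The class `VerificationSketch`, the single-input record layout, and the toy model are hypothetical. The `precision` and `zeroThreshold` parameters loosely follow the attributes of the same names on PMML verification fields, but the exact comparison rule used here (relative tolerance, absolute tolerance near zero) is this sketch's assumption.

```java
import java.util.function.DoubleUnaryOperator;

final class VerificationSketch {

    // Relative comparison, falling back to an absolute check when the
    // expected value is (close to) zero.
    static boolean matches(double expected, double actual,
                           double precision, double zeroThreshold) {
        if (Math.abs(expected) <= zeroThreshold) {
            return Math.abs(actual) <= zeroThreshold;
        }
        return Math.abs(actual - expected) <= precision * Math.abs(expected);
    }

    // Replays records of the form {input, expectedOutput} against the model
    // and returns how many of them fall outside the tolerance.
    static int countFailures(DoubleUnaryOperator model, double[][] records,
                             double precision, double zeroThreshold) {
        int failures = 0;
        for (double[] record : records) {
            double actual = model.applyAsDouble(record[0]);
            if (!matches(record[1], actual, precision, zeroThreshold)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator model = x -> -0.36 + 0.42 * x; // toy model
        double[][] scored = { {1.0, 0.06}, {4.0, 1.32} };  // input, expected
        System.out.println(countFailures(model, scored, 1e-6, 1e-16));
    }
}
```

Running such a check at deployment time is what makes the verification records behave like regression tests for the model.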
  18. Model Deployment with PMML
     Model Building
     •  Statistics Tool
     •  Data Mining Tool
     •  …
     Export Model -> Deploy Model
     Model Scoring
     •  Analytics Application
  19. Example Application
     Real-time Analytics in Stream Processing
     Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.
  20. Example Application cont.
     Real-time Analytics in Stream Processing: “Prediction of short-term energy consumption in a SmartGrid”
     •  Sensor Data: W / kWh every s; 40 houses, 325 households, 2125 plugs
     •  Components shown: PMML, R, Madlib, Spring XD, analytic-PMML, Spring Batch, HTML5 / JS, D3, Spring Boot, Redis, Postgresql, HDFS, EC2 Cluster, Rabbit MQ
     Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.
  21. PMML Tools
     •  R / Rattle
     •  RapidMiner
     •  KNIME
     •  Various PMML Tools from Zementis
        – Transformation Generator
        – Generic Operation Generator
     •  Py2PMML
        – Can transform models learned with scikit-learn to PMML
     •  SPSS
     •  SAS
     •  Statistica
     •  …
  22. PMML Industry Support
     Digest of analytic software vendors with PMML support
     •  http://www.dmg.org/products.html
     •  IBM
     •  Microsoft
     •  Google
     •  Oracle
     •  EMC
     •  Pivotal
     •  SAS
     •  Pentaho
     •  Teradata
  23. PMML Resources
     •  PMML in Action 2nd Edition Book
     •  https://support.zementis.com/entries/22119057-Top-10-PMML-Resources
     •  http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
     •  https://www.ibm.com/developerworks/opensource/library/ba-ind-PMML1/
     •  http://zementis.com/knowledge-base-resources/white-papers/
     •  YouTube talk: Representing Predictive Solutions with PMML https://www.youtube.com/watch?v=QBpguVZRVPo
  24. PMML Summary
     •  Open
     •  Mature
     •  Extensible
     •  Standard
     •  Broad industry support
     “PMML is the Lingua Franca for sharing Predictive Model Solutions”
     Source: Dr. Alex Guazzelli, Representing Predictive Solutions with PMML, youtube, 2012
  25. PMML with R
     •  Packages
        – pmml / 10 years
           •  Export model to PMML
        – pmmlTransformations / 1.5 years
           •  WrapData wraps dataframe in a SmartObject (SO)
           •  Transformations applied to SO are saved in PMML
     •  Support for: ksvm, nnet, rpart, lm & glm, arules, kmeans and hclust, randomForest
  26. PMML with Java
     •  JPMML https://github.com/jpmml/jpmml
        – Java based dual licensed AGPL V3 “Umbrella” Project
        – Reference implementation of PMML standard
        – Backed by http://openscoring.io/
        – Supports latest PMML Version >= 3.0
        – 12 out of 15 model types supported (no Time Series)
     •  jpmml-evaluator sub-project
        – API for scoring / evaluation
     •  jpmml-model sub-project
        – JAXB model derived from PMML XSD
        – Import / Export / Model generation
     •  Some integration projects
        – Hive, PostgreSQL, pig
        – Planned: Apache Storm and Apache Spark
     DEMO