Slide 1

Slide 1 text

PMML with R and Java
Thomas Darimont
Data Science Meetup Luxembourg
24th Sep 2014

Slide 2

Slide 2 text

Predictive Model Lifecycle
Traditional way …

Scientist
• Uses statistical tool
• Defines / trains model
• R, Python
• Writes model specification

Model Specification V1

Engineer
• Implements Spec
• Writes custom code
• C++, C#, Java
• Deploys model (code)

Source: Own representation based on “Representing Predictive Solutions with PMML”, by Alex Guazzelli https://www.youtube.com/watch?v=QBpguVZRVPo

Slide 3

Slide 3 text

Problems
• Model definition not machine readable
• Model needs to be implemented by hand
• Changes in the model documents have to be propagated – by hand
• Time consuming (weeks, months, years)
• Prone to errors
• Implementation ≠ Specification

Solution?

Slide 4

Slide 4 text

PMML
• Predictive Model Markup Language
• Open Standard
• Maintained by Data Mining Group (DMG)
• XML based DSL for predictive models
• First Version (1999) – Current Version 4.2.1

Goal: “Bridge the Gap between Data Scientists and Engineers”

Slide 5

Slide 5 text

Anatomy of PMML Model
• Pre Processing
• Predictive Model
  – Algorithm description(s)
  – Parameterization
  → trained model
• Post Processing
  – Transform model output
  – Thresholds / Business rules

Source: PMML in Action, 2nd Edition, 2012, p. 7.

Slide 6

Slide 6 text

PMML General Structure

Header
• Version / Timestamp
• Model development environment information

Data Dictionary
• Definition of variable types
• Handling of valid, invalid and missing values

Data Transformations
• Pre-processing: Normalization, mapping and discretization
• Built-in and user defined functions

Model 1..*
• Mining Schema
• Targets
• Outputs

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.

Slide 7

Slide 7 text

PMML Model Structure

Mining Schema
• Definition of usage type
• Outlier and missing value treatment / replacement

Targets
• Prior probability and default value

Outputs
• List of computed output fields
• Post-processing

(Parameters)
• Definition of model specific parameters

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 24.

Slide 8

Slide 8 text

PMML example

irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)

[Figure: the PMML document generated for this model, annotated with its parts: Header, Data Dictionary, Model, Model Parameters, Output]
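To make the annotated parts concrete, here is a minimal sketch in Java that reads a hand-written, heavily trimmed PMML regression document and scores one input using only the JDK's XML parser. The class name, the trimmed document and the coefficient values are assumptions chosen for illustration; they are not the output of the R call above, and a real deployment would rather use a PMML scoring library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class PmmlRegressionSketch {

    // Trimmed PMML document (no Header, no namespace). The intercept and
    // coefficient are illustrative placeholders, NOT the fitted iris values.
    static final String PMML =
          "<PMML version=\"4.2\">"
        + "<DataDictionary>"
        + "<DataField name=\"Petal.Width\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"Petal.Length\" optype=\"continuous\" dataType=\"double\"/>"
        + "</DataDictionary>"
        + "<RegressionModel functionName=\"regression\">"
        + "<MiningSchema>"
        + "<MiningField name=\"Petal.Width\" usageType=\"predicted\"/>"
        + "<MiningField name=\"Petal.Length\" usageType=\"active\"/>"
        + "</MiningSchema>"
        + "<RegressionTable intercept=\"-0.5\">"
        + "<NumericPredictor name=\"Petal.Length\" coefficient=\"0.25\"/>"
        + "</RegressionTable>"
        + "</RegressionModel>"
        + "</PMML>";

    // Scores a single-predictor linear regression: intercept + coefficient * x.
    static double predict(String pmml, double petalLength) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            Element predictor = (Element) doc.getElementsByTagName("NumericPredictor").item(0);
            double intercept = Double.parseDouble(table.getAttribute("intercept"));
            double coefficient = Double.parseDouble(predictor.getAttribute("coefficient"));
            return intercept + coefficient * petalLength;
        } catch (Exception e) {
            throw new IllegalStateException("could not read PMML document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(predict(PMML, 4.0));
    }
}
```

The point of the sketch is that the model parameters live in the document rather than in hand-written code, so a retrained model only needs a new file.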

Slide 9

Slide 9 text

PMML Supported Models
• 15 model types
• Association Rules
• Baseline Models
• Cluster Models
• (General) Regression
• k-Nearest Neighbors
• Naive Bayes
• Neural Network
• Ruleset
• Scorecard
• Sequences
• Text Models
• Time Series
• Trees
• Vector Machine
• … roll your own: Ensemble Models -> Use provided building blocks

Slide 10

Slide 10 text

PMML Transformations
• Normalization
  map values to numbers, the input can be continuous (element NormContinuous) or discrete (element NormDiscrete).
• Discretization
  map continuous values to discrete values.
• Value Mapping
  map discrete values to other discrete values.
• Functions
  derive a value by applying a function to one or more parameters.
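The four transformation kinds above can be sketched as plain Java functions. This is only an illustration of what each transformation computes, not how a PMML engine implements it; the class name, bin limits and lookup table are hypothetical.

```java
import java.util.Map;

final class PmmlTransformSketch {

    // NormContinuous with two reference points (orig1 -> norm1, orig2 -> norm2):
    // linear interpolation between the two points.
    static double normContinuous(double x, double orig1, double norm1,
                                 double orig2, double norm2) {
        return norm1 + (x - orig1) / (orig2 - orig1) * (norm2 - norm1);
    }

    // NormDiscrete: 1.0 if the input equals the given category, else 0.0.
    static double normDiscrete(String value, String category) {
        return category.equals(value) ? 1.0 : 0.0;
    }

    // Discretization: continuous value -> bin label (bin limits are made up).
    static String discretize(double x) {
        if (x < 2.0) return "short";
        if (x < 5.0) return "medium";
        return "long";
    }

    // Value mapping: discrete value -> other discrete value via a lookup table.
    static String mapValue(Map<String, String> table, String value, String defaultValue) {
        return table.getOrDefault(value, defaultValue);
    }

    public static void main(String[] args) {
        System.out.println(normContinuous(4.0, 1.0, 0.0, 7.0, 1.0));
        System.out.println(normDiscrete("setosa", "setosa"));
        System.out.println(discretize(4.0));
        System.out.println(mapValue(Map.of("m", "male", "f", "female"), "f", "unknown"));
    }
}
```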

Slide 11

Slide 11 text

PMML Functions
• Custom functions for common transformations
• Building blocks

Category     Functions
Arithmetic   +, -, * and /
Math         log10, ln, sqrt, abs, exp, pow, threshold, floor, ceil, round
Stats        min, max, sum, avg, median, product
Logic        if, and, or, not, equal, notEqual, lessThan, lessOrEqual, greaterThan, greaterOrEqual, isMissing, isNotMissing, isIn, isNotIn
String       uppercase, lowercase, substring, trimBlanks, concat, replace, matches
Format       formatNumber, formatDatetime
Date/Time    dateDaysSinceYear, dateSecondsSinceYear, dateSecondsSinceMidnight

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 63.

Slide 12

Slide 12 text

PMML Multiple Models
• Several ways for combining multiple models in one PMML file
  – Model Segmentation
  – Model Ensemble
  – Model Chaining
  – Model Composition
• Custom extensions for referencing external model files

Slide 13

Slide 13 text

Model Segmentation

[Diagram, all inside one PMML file: Raw input → Input Validation (Outliers, Missing Values, Invalid Values) → Data Pre-Processing → Predictive Model → Prediction. Inside the predictive model, a predicate based model selection (e.g. SelectFirst, with predicates such as X = 1, X = 2) picks one of Model 1, Model 2, …, Model n.]

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 190.
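Predicate based selection in the SelectFirst style can be sketched as follows: each segment pairs a predicate with a model, and the first segment whose predicate holds produces the prediction. The class, the field names "X" and "v", and the two toy models are assumptions for illustration; returning an empty result when no predicate matches is a choice made for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

final class SegmentationSketch {

    // One segment: a predicate on the input record plus the model it guards.
    static final class Segment {
        final Predicate<Map<String, Double>> predicate;
        final ToDoubleFunction<Map<String, Double>> model;

        Segment(Predicate<Map<String, Double>> predicate,
                ToDoubleFunction<Map<String, Double>> model) {
            this.predicate = predicate;
            this.model = model;
        }
    }

    // Evaluates segments in order and scores with the first matching one.
    static OptionalDouble selectFirst(List<Segment> segments, Map<String, Double> input) {
        for (Segment s : segments) {
            if (s.predicate.test(input)) {
                return OptionalDouble.of(s.model.applyAsDouble(input));
            }
        }
        return OptionalDouble.empty(); // no predicate matched
    }

    // Two hypothetical segments mirroring the X = 1 / X = 2 branches.
    static List<Segment> demoSegments() {
        return List.of(
            new Segment(in -> in.get("X") == 1.0, in -> in.get("v") * 2.0),
            new Segment(in -> in.get("X") == 2.0, in -> in.get("v") + 10.0));
    }

    public static void main(String[] args) {
        System.out.println(selectFirst(demoSegments(), Map.of("X", 2.0, "v", 3.0)));
    }
}
```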

Slide 14

Slide 14 text

Model Ensemble

[Diagram, all inside one PMML file: Raw input → Input Validation → Data Pre-Processing → Model 1, Model 2, …, Model n (scores from all models are computed) → Voting (Majority Voting, Weighted Voting, Weighted Average, etc.) → Prediction.]

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 193.
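Two of the combination methods named above, majority voting and weighted average, can be sketched in a few lines of Java. The class name and the tie-breaking rule (the label seen first wins) are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EnsembleSketch {

    // Majority voting over class labels; on a tie the label seen first wins.
    // Returns null for an empty list.
    static String majorityVote(List<String> predictions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : predictions) {
            counts.merge(p, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Weighted average of numeric scores: sum(w_i * s_i) / sum(w_i).
    static double weightedAverage(double[] scores, double[] weights) {
        double weightedSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += weights[i] * scores[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(List.of("yes", "no", "yes")));
        System.out.println(weightedAverage(new double[] {1.0, 2.0, 4.0},
                                           new double[] {1.0, 1.0, 2.0}));
    }
}
```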

Slide 15

Slide 15 text

Model Chaining

[Diagram, all inside one PMML file: Raw input → Input Validation → Data Pre-Processing → Model 1 → Model 2 → … → Model n → Prediction. Output scores from earlier models are used by subsequent models.]

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 195.
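Chaining can be sketched as a loop in which every model writes its score into the set of available fields, so that later models can read it alongside the original inputs. The class name, field names and the two toy models are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

final class ChainSketch {

    // One step of the chain: a model plus the field name its score is stored under.
    static final class Step {
        final String outputField;
        final ToDoubleFunction<Map<String, Double>> model;

        Step(String outputField, ToDoubleFunction<Map<String, Double>> model) {
            this.outputField = outputField;
            this.model = model;
        }
    }

    // Runs the steps in order; the score of the last step is the prediction.
    // Returns NaN if the chain is empty.
    static double runChain(List<Step> steps, Map<String, Double> input) {
        Map<String, Double> fields = new HashMap<>(input);
        double last = Double.NaN;
        for (Step step : steps) {
            last = step.model.applyAsDouble(fields);
            fields.put(step.outputField, last); // visible to subsequent models
        }
        return last;
    }

    // Hypothetical two-step chain: the second model uses the first model's score.
    static List<Step> demoChain() {
        return List.of(
            new Step("risk", f -> f.get("x") * 2.0),
            new Step("final", f -> f.get("risk") + f.get("x")));
    }

    public static void main(String[] args) {
        System.out.println(runChain(demoChain(), Map.of("x", 3.0)));
    }
}
```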

Slide 16

Slide 16 text

Model Composition

[Diagram, all inside one PMML file: Raw input → Input Validation → Data Pre-Processing → Main Model → Prediction. The main model uses a predicate based model selection among Model 2, …, Model n.]

Source: Own representation based on PMML in Action, 2nd Edition, 2012, p. 196.

Slide 17

Slide 17 text

Model Verification
• “Scoring matching test”
• “Regression tests for models”
• VerificationFields
  – Asserts, range checks for results
• InlineTable
  – Input + expected output
  – Include already scored data
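The idea of shipping already scored data with the model can be sketched as a small regression test: each row holds an input and the expected output, and the consuming system checks that its own scoring reproduces the expected values. The class name, the toy model and the use of an absolute tolerance are simplifying assumptions of this sketch.

```java
import java.util.function.DoubleUnaryOperator;

final class VerificationSketch {

    // Each row is {input, expectedOutput}. Returns how many rows the model
    // fails to reproduce within the given absolute tolerance.
    static int countMismatches(DoubleUnaryOperator model, double[][] rows, double tolerance) {
        int mismatches = 0;
        for (double[] row : rows) {
            double actual = model.applyAsDouble(row[0]);
            if (Math.abs(actual - row[1]) > tolerance) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Hypothetical deployed model and verification rows.
        DoubleUnaryOperator model = x -> -0.5 + 0.25 * x;
        double[][] rows = {
            {4.0, 0.5},
            {8.0, 1.5},
            {2.0, 0.1}  // deliberately wrong expected value
        };
        System.out.println(countMismatches(model, rows, 1e-6));
    }
}
```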

Slide 18

Slide 18 text

Model Deployment with PMML

Model Building
• Statistics Tool
• Data Mining Tool
• …

→ Export Model → Deploy Model →

Model Scoring
• Analytics Application

Slide 19

Slide 19 text

Example Application
Real-time Analytics in Stream Processing

Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.

Slide 20

Slide 20 text

Example Application cont.
Real-time Analytics in Stream Processing

“Prediction of short-term energy consumption in a SmartGrid”

Sensor Data
• W / kWh every second
• 40 houses
• 325 households
• 2125 plugs

Components shown in the architecture diagram:
Rabbit MQ, Spring XD, analytic-PMML, PMML, R, Madlib, Spring Batch, Spring Boot, HTML5 / JS, D3, Redis, Postgresql, HDFS, EC2 Cluster

Source: Own representation based on Fundamentals of Stream Processing, Cambridge Press, 2014, p. 390.

Slide 21

Slide 21 text

PMML Tools
• R / Rattle
• RapidMiner
• KNIME
• Various PMML Tools from Zementis
  – Transformation Generator
  – Generic Operation Generator
• Py2PMML
  – Can transform models learned with scikit-learn to PMML
• SPSS
• SAS
• Statistica
• …

Slide 22

Slide 22 text

PMML Industry Support
Digest of analytic software vendors with PMML support
• http://www.dmg.org/products.html
• IBM
• Microsoft
• Google
• Oracle
• EMC
• Pivotal
• SAS
• Pentaho
• Teradata

Slide 23

Slide 23 text

PMML Resources
• PMML in Action 2nd Edition Book
• https://support.zementis.com/entries/22119057-Top-10-PMML-Resources
• http://journal.r-project.org/archive/2009-1/RJournal_2009-1_Guazzelli+et+al.pdf
• https://www.ibm.com/developerworks/opensource/library/ba-ind-PMML1/
• http://zementis.com/knowledge-base-resources/white-papers/
• YouTube talk: Representing Predictive Solutions with PMML
  https://www.youtube.com/watch?v=QBpguVZRVPo

Slide 24

Slide 24 text

PMML Summary
• Open
• Mature
• Extensible
• Standard
• Broad industry support

“PMML is the Lingua Franca for sharing Predictive Model Solutions”

Source: Dr. Alex Guazzelli, Representing Predictive Solutions with PMML, YouTube, 2012

Slide 25

Slide 25 text

PMML with R
• Packages
  – pmml / 10 years
    • Export model to PMML
  – pmmlTransformations / 1.5 years
    • WrapData wraps dataframe in a SmartObject (SO)
    • Transformations applied to SO are saved in PMML
• Support for:
  ksvm, nnet, rpart, lm & glm, arules, kmeans and hclust, randomForest

Slide 26

Slide 26 text

PMML with R
• Hello World

DEMO

Slide 27

Slide 27 text

PMML example

irisModel <- lm(Petal.Width ~ Petal.Length, data=iris)

Slide 28

Slide 28 text

PMML with Java
• JPMML https://github.com/jpmml/jpmml
  – Java based dual licensed AGPL V3 “Umbrella” Project
  – Reference implementation of PMML standard
  – Backed by http://openscoring.io/
  – Supports latest PMML Version >= 3.0
  – 12 out of 15 model types supported (No Time Series ☹)
• jpmml-evaluator sub-project
  – API for scoring / evaluation
• jpmml-model sub-project
  – JAXB model derived from PMML XSD
  – Import / Export / Model generation
• Some integration projects
  – Hive, PostgreSQL, pig
  – Planned: Apache Storm and Apache Spark

DEMO

Slide 29

Slide 29 text

Questions