Operationalizing Data Science with Apache Spark

Presentation at Big Data Day LA

Lawrence Spracklen

August 05, 2017

Transcript

  1. Operationalizing Data Science with Apache Spark. Lawrence Spracklen, VPE, Alpine Data

  2. 2 Alpine Data

  3. 3 Operationalization
      •  What happens after the models are created?
      •  How does the business benefit from the insights?
      •  Operationalization is frequently the weak link
         –  Operationalizing PowerPoint?
         –  Hand-rolled scoring flows?

  4. 4 Effective Data Science: Define (business leader), Transform, Model, Deploy, Act (employees and customers); iterate to refine the model as required

  5. 5 ML Pipelines: train and score phases (sketch below)
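      As a minimal sketch (not from the slides), the PySpark snippet below fits a small
      pipeline on training data and then applies the fitted pipeline to score new records;
      the column names and toy data are assumptions for illustration.

      from pyspark.sql import SparkSession
      from pyspark.ml import Pipeline
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.classification import LogisticRegression

      spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

      # Toy training and scoring data (assumed column names)
      train_df = spark.createDataFrame(
          [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0)], ["f1", "f2", "label"])
      score_df = spark.createDataFrame([(0.2, 0.9)], ["f1", "f2"])

      # Train: fit every stage of the pipeline on historical data
      pipeline = Pipeline(stages=[
          VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
          LogisticRegression(featuresCol="features", labelCol="label")])
      model = pipeline.fit(train_df)

      # Score: apply the fitted pipeline to new data
      model.transform(score_df).show()
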

  6. 6 Barriers to Model Ops
      •  Scoring is often performed on a different data platform than training
         –  Framework-specific persistence formats
      •  Complex data preprocessing requirements
         –  Data cleansing and feature engineering
      •  Batch training versus real-time/stream scoring
      •  How frequently are models updated?
      •  How is performance monitored?

  7. 7 Heterogeneous Big Data: data in, models out

  8. 8 Streaming Example: split historical data into train and test sets, train and test models offline, then predict on the live stream

  9. 9 One format to rule them all?

  10. 10 PMML
      •  XML-based predictive model interchange format
         –  Created in 1998
         –  Version 4.3 just released
      •  Good for specifying many common model types
      •  Limited support for complex data preprocessing
         –  Can require companion scripts/code
      •  Broad PMML export support
      •  Limited import support

  11. 11 Turn-key model updates? Conditionally push the model to the scoring engine

  12. 12 Complex scoring flows
      •  Pre-processing workflow: transformations required before the model (ETL, feature engineering, etc.)
      •  Trained model: the ML model itself
      •  "Push to Scoring Engine": one-click deployment to a scoring engine, with the entire flow encapsulated in the model output

  13. 13 Composite model: .pmml document plus helper scripts (.py, .java, .cpp)

  14. 14 PFA
      •  Portable Format for Analytics is the JSON-based successor to PMML
         –  Version 0.84 available
      •  Significant flexibility in encapsulating complex data pre- and post-processing

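      To make the format concrete, here is a minimal sketch (not from the slides) of a
      complete PFA document, written as a Python dict and executed with the open-source
      Titus engine; the document simply adds 100 to a double input.

      from titus.genpy import PFAEngine

      # A complete, if trivial, PFA document: double in, double out, add 100
      pfa_doc = {
          "input": "double",
          "output": "double",
          "action": [{"+": ["input", 100]}]
      }

      engine, = PFAEngine.fromJson(pfa_doc)   # Titus accepts a dict or a JSON string
      print(engine.action(2.0))               # prints 102.0
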
  15. 15 PFA Support
      •  It is not only the model operators that need to export PFA
      •  Process the entire pipeline, from raw data input to final model output
         –  Synthesize a PFA doc to represent the flow
      •  PFA is capable of representing many key operations
         –  Much richer than PMML

  16. 16 PFA Scoring
      •  Open-source PFA implementations available
         –  Apache licensed
         –  Python and Java APIs (e.g. Titus and Hadrian)
      •  Very simple to build your own scoring engines

  17. 17 Simple Example

      import json
      import sys
      from titus.genpy import PFAEngine

      # Leverage the PFA doc specified on the command line and create the PFA engine
      pfa_model = sys.argv[1]
      engine, = PFAEngine.fromJson(json.load(open(pfa_model)))

      # Invoke any initialization functions
      engine.begin()

      # Score example input
      input = {"Sepal_length": "1.0", "Sepal_width": "1.0",
               "Petal_length": "1.0", "Petal_width": "1.0"}
      results = engine.action(input)
      print results

  18. 18 RESTful Example

      import json
      import tornado.escape
      import tornado.web
      from titus.genpy import PFAEngine

      # (r"/demo/score/([a-zA-Z0-9_]+)", scoreModel),
      class scoreModel(tornado.web.RequestHandler):
          # Score model: load the PFA doc named in the URL and apply it to the request body
          def post(self, id):
              engine, = PFAEngine.fromJson(json.load(open('models/%s.pfa' % (id))))
              dd = tornado.escape.json_decode(self.request.body)
              self.write(str(engine.action(dd)))

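      The handler above assumes it is registered with a Tornado application, as the
      commented-out route hints. A minimal sketch of that wiring (the port is an
      assumption) might look like:

      import tornado.ioloop
      import tornado.web

      # scoreModel is the RequestHandler defined above
      application = tornado.web.Application([
          (r"/demo/score/([a-zA-Z0-9_]+)", scoreModel),
      ])

      if __name__ == "__main__":
          application.listen(8888)        # port chosen for illustration
          tornado.ioloop.IOLoop.current().start()
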
  19. 19 Kafka Example

      import json
      import sys
      from kafka import KafkaConsumer
      from kafka import KafkaProducer
      from titus.genpy import PFAEngine

      pfa_model = sys.argv[1]
      engine, = PFAEngine.fromJson(json.load(open(pfa_model)))

      server = sys.argv[2]
      kafka_topic_to_score = sys.argv[3]
      kafka_topic_to_emit = sys.argv[4]

      consumer = KafkaConsumer(kafka_topic_to_score, bootstrap_servers=server)
      producer = KafkaProducer(bootstrap_servers=server)

      for msg in consumer:
          score = engine.action(json.loads(msg.value))
          producer.send(kafka_topic_to_emit, str(score))

  20. 20 PySpark PFA Example: scoring with mapPartitions (sketch below)
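      Slide 20's code is not reproduced in the transcript. The following is a minimal
      sketch of what a mapPartitions-based PFA scoring job might look like; input_rdd
      (an RDD of record dicts) and the "model.pfa" path are assumptions, not from the slides.

      import json
      from titus.genpy import PFAEngine

      def make_scorer(pfa_doc):
          def score_partition(rows):
              # Build the engine once per partition, on the executor
              engine, = PFAEngine.fromJson(pfa_doc)
              engine.begin()
              for row in rows:
                  yield engine.action(row)
          return score_partition

      pfa_doc = json.load(open("model.pfa"))                      # assumed PFA document path
      scored_rdd = input_rdd.mapPartitions(make_scorer(pfa_doc))  # input_rdd assumed to exist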

  21. 21 Spark PMML Support

      val clusters = KMeans.train(parsedData, numClusters, numIterations)
      clusters.toPMML("/tmp/kmeans.xml")

  22. 22 Spark PFA Support
      •  Staged solution
         1.  Enable *.toPFA() for MLlib models
         2.  Enable ML_pipeline.toPFA() for entire pipelines
      •  Easy to add support for #1
         –  Parity with PMML

  23. 23 Conclusions
      •  Operationalization of data science findings is often overlooked
      •  Need cross-platform and cross-framework interoperability
      •  Need easy model deployment to ensure maximum impact
      •  PFA makes it much simpler to deploy complex scoring flows
      •  OSS PFA scoring engines are available and easily integrated with Spark
      •  Working to enable PFA model export from SparkML

  24. Questions? lawrence@alpinenow.com