
Operationalizing Data Science with Apache Spark

Presentation at Big Data Day LA


Lawrence Spracklen

August 05, 2017


  1. Operationalizing Data Science with Apache Spark Alpine Data Lawrence Spracklen

  2. Alpine Data

  3. Operationalization
     •  What happens after the models are created?
     •  How does the business benefit from the insights?
     •  Operationalization is frequently the weak link
        –  Operationalizing PowerPoint?
        –  Hand-rolled scoring flows?
  4. Effective Data Science (cycle diagram): Define, Model, Transform, Deploy,
     Act; connecting the business leader with employees and customers.
     Iterate to refine the model as required.
  5. ML Pipelines: Train → Score

  6. Barriers to Model Ops
     •  Scoring is often performed on a different data platform than training
        –  Framework-specific persistence formats
     •  Complex data preprocessing requirements
        –  Data cleansing and feature engineering
     •  Batch training versus real-time/stream scoring
     •  How frequently are models updated?
     •  How is performance monitored?
  7. Heterogeneous Big Data: data in, models out (diagram of many models
     produced from heterogeneous data sources)
  8. Streaming Example (diagram): historical data is split to train and test
     models; the resulting model then predicts on the live stream.
  9. One format to rule them all?

  10. PMML
      •  XML-based predictive model interchange format
         –  Created in 1998
         –  Version 4.3 just released
      •  Good for specifying many common model types
      •  Limited support for complex data preprocessing
         –  Can require companion scripts/code
      •  Broad PMML export support
      •  Limited import support
  11. Turn-key model updates? Conditionally push the model to the scoring engine.

  12. Complex scoring flows
      “Push to Scoring Engine”
         -  One-click deployment to a scoring engine
         -  Entire flow encapsulated in the model output
      Pre-Processing Workflow
         -  Transformations required before the model
         -  ETL, feature engineering, etc.
      Trained model
         -  ML model
  13. Helper scripts (.py, .pmml, .java, .cpp) bundled into a composite model

  14. PFA
      •  Portable Format for Analytics is the JSON-based successor to PMML
         –  Version 0.84 available
      •  Significant flexibility in encapsulating complex data pre- and
         post-processing
  15. PFA Support
      •  Not only model operators need to export PFA
      •  Process the entire pipeline from raw data input to final model output
         –  Synthesize a PFA doc to represent the flow
      •  PFA is capable of representing many key operations
         –  Much richer than PMML
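To make that concrete, here is a minimal PFA document in the style of the published PFA examples (an illustrative sketch, not from the deck): a scalar scoring engine whose single action adds 100 to each input. Loading it with Python's json module shows that a PFA document is plain JSON:

```python
import json

# A minimal PFA document: double in, double out, one scoring action
# that adds 100 to each input record.
# (Illustrative sketch; field names follow the PFA specification.)
pfa_doc = """
{
  "input": "double",
  "output": "double",
  "action": [ {"+": ["input", 100]} ]
}
"""

doc = json.loads(pfa_doc)
print(doc["input"], "->", doc["output"])  # double -> double
```

An engine such as Titus loads a document like this via PFAEngine.fromJson; richer pipelines attach persistent state and user functions (the spec's `cells` and `fcns` fields) plus pre- and post-processing expressions under the same JSON structure.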
  16. PFA Scoring
      •  Open-source PFA implementations available
         –  Apache licensed
         –  Python and Java APIs
            •  e.g. Titus and Hadrian
      •  Very simple to build your own scoring engines
  17. Simple Example

      ```python
      # Load the PFA Python module
      import json
      import sys

      from titus.genpy import PFAEngine

      # Leverage the PFA doc specified on the command line
      pfa_model = sys.argv[1]

      # Create the PFA engine
      engine, = PFAEngine.fromJson(json.load(open(pfa_model)))

      # Invoke any initialization functions
      engine.begin()

      # Score example input
      input = {"Sepal_length": "1.0", "Sepal_width": "1.0",
               "Petal_length": "1.0", "Petal_width": "1.0"}
      results = engine.action(input)
      print results
      ```
  18. RESTful Example

      ```python
      import json

      import tornado.escape
      import tornado.web
      from titus.genpy import PFAEngine

      # Route: (r"/demo/score/([a-zA-Z0-9_]+)", scoreModel),
      class scoreModel(tornado.web.RequestHandler):
          # Score against the requested model
          def post(self, id):
              engine, = PFAEngine.fromJson(json.load(open('models/%s.pfa' % (id))))
              dd = tornado.escape.json_decode(self.request.body)
              self.write(str(engine.action(dd)))
      ```
  19. Kafka Example

      ```python
      import json
      import sys

      from kafka import KafkaConsumer
      from kafka import KafkaProducer
      from titus.genpy import PFAEngine

      pfa_model = sys.argv[1]
      engine, = PFAEngine.fromJson(json.load(open(pfa_model)))

      server = sys.argv[2]
      kafka_topic_to_score = sys.argv[3]
      kafka_topic_to_emit = sys.argv[4]

      consumer = KafkaConsumer(kafka_topic_to_score, bootstrap_servers=server)
      producer = KafkaProducer(bootstrap_servers=server)

      # Score each consumed message and emit the result
      for msg in consumer:
          score = engine.action(json.loads(msg.value))
          producer.send(kafka_topic_to_emit, str(score))
      ```
  20. PySpark PFA Example (mapPartitions)
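The mapPartitions pattern on this slide can be sketched as follows. This is a minimal, self-contained sketch: `make_scorer` and `pfa_json` are hypothetical names, and the Titus PFAEngine call is left as a comment (replaced by a trivial stand-in action) so the code runs without Spark or Titus. The point of the pattern is that the engine is constructed once per partition, not once per record:

```python
def make_scorer(pfa_json):
    """Return a generator function suitable for rdd.mapPartitions()."""
    def score_partition(records):
        # Real flow (assuming titus, as on the earlier slides):
        #   from titus.genpy import PFAEngine
        #   engine, = PFAEngine.fromJson(json.loads(pfa_json))
        #   action = engine.action
        # Stand-in action so this sketch is self-contained:
        action = lambda rec: rec["x"] + 100.0
        for rec in records:          # one engine, many records
            yield action(rec)
    return score_partition

# Spark usage (not executed here):
#   scored_rdd = input_rdd.mapPartitions(make_scorer(pfa_json))
scores = list(make_scorer("{}")(iter([{"x": 1.0}, {"x": 2.5}])))
print(scores)  # [101.0, 102.5]
```

Building the engine inside the partition function keeps the PFA document's parse cost per task rather than per record, and avoids shipping a live engine object from the driver to the executors.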

  21. Spark PMML Support

      ```scala
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
      ```

  22. Spark PFA Support
      •  Staged solution
         1.  Enable *.toPFA() for MLlib models
         2.  Enable ML_pipeline.toPFA() for entire pipelines
      •  Easy to add support for #1
         –  Parity with PMML
  23. Conclusions
      •  Operationalization of data science findings is often overlooked
      •  Need cross-platform and cross-framework interoperability
      •  Need easy model deployment to ensure maximum impact
      •  PFA makes it much simpler to deploy complex scoring flows
      •  OSS PFA scoring engines are available and easily integrated with Spark
      •  Working to enable PFA model export from SparkML
  24. Questions? lawrence@alpinenow.com