
Distributed ML/DL with Ignite ML module using Spark as a data source

The current implementation of ML algorithms in Spark has several disadvantages: the costly conversion from standard Spark SQL types to ML-specific types, the limited adaptation of some algorithms to distributed computing, and the relatively slow pace at which new algorithms are added to the library.

Also, Spark ML does not natively support online learning for all algorithms, nor stacking, boosting, or a number of approximate ML algorithms that give a significant speedup in many cases. Apache Ignite can work closely with Apache Spark thanks to its excellent Ignite RDD/Ignite DataFrame implementation (see https://ignite.apache.org/use-cases/spark/shared-memory-layer.html).
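
For example, data written to Ignite can be read back into Spark as a DataFrame through the same data source. A minimal Java sketch, assuming a running Ignite node configured via default-config.xml and a "passengers" table created earlier through the Ignite data source (the option names mirror the write example shown later in the deck):

     // Sketch: read an Ignite SQL table into Spark as a DataFrame.
     // Assumes a running Ignite cluster and an existing "passengers" table.
     Dataset<Row> passengers = spark.read()
         .format("ignite")                        // Ignite Spark data source
         .option("config", "default-config.xml")  // Ignite configuration file
         .option("table", "passengers")           // SQL table stored in Ignite
         .load();

     passengers.show();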

Apache Ignite also ships the Ignite ML module, which includes many distributed ML algorithms, an NLP package (to be available in the next release, 2.8), a set of approximate ML algorithms, and simple integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the TF contrib package). In addition, every algorithm supports model updating, which enables online learning not only for KMeans and LinReg.

We suggest using the Apache Ignite ML module to speed up your ML training, with Spark + Ignite as a backend for distributed TensorFlow calculations. You will see live demos of ML pipeline building with the Apache Ignite ML module, Apache Spark, Apache Kafka, TensorFlow, and more.

Alexey Zinoviev

April 23, 2019

Transcript

  1. Zinovyev Alexey, Apache Ignite: Distributed ML/DL with Ignite ML module using Spark as a data source
  2. Bio
     • Java developer
     • Distributed ML enthusiast
     • Apache Spark user
     • Apache Ignite Committer
     • Happy father and husband
  3. Spark ML as an answer
     • It supports classic ML algorithms
     • Algorithms are distributed by nature
     • Wide support of different data sources and sinks
     • Easy building of Pipelines
     • Model evaluation and hyper-parameter tuning support
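
For comparison, building such a pipeline in Spark ML takes only a few lines. A minimal Java sketch (illustrative; the Titanic-style column names and the trainDf dataset are assumptions, not from the deck):

     // Sketch: a two-stage Spark ML pipeline (feature assembly + logistic regression).
     VectorAssembler assembler = new VectorAssembler()
         .setInputCols(new String[]{"pclass", "sibsp", "parch"})
         .setOutputCol("features");

     LogisticRegression lr = new LogisticRegression()
         .setLabelCol("survived")
         .setFeaturesCol("features");

     Pipeline pipeline = new Pipeline()
         .setStages(new PipelineStage[]{assembler, lr});

     PipelineModel model = pipeline.fit(trainDf);  // trainDf: Dataset<Row> with training data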
  4. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
  5. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
  6. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
     • A lot of data transformation overhead on the way from a data source to ML types
  7. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
     • A lot of data transformation overhead on the way from a data source to ML types
     • Hard integration with TensorFlow/Caffe
  8. What is bad with Spark ML?
     • Only a part of the algorithms use sparse matrices
  9. What is bad with Spark ML?
     • Only a part of the algorithms use sparse matrices
     • Several unfinished approaches to model inference/model serving
  10. What is bad with Spark ML?
      • Only a part of the algorithms use sparse matrices
      • Several unfinished approaches to model inference/model serving
      • It doesn't support AutoML algorithms
  11. What is bad with Spark ML?
      • Only a part of the algorithms use sparse matrices
      • Several unfinished approaches to model inference/model serving
      • It doesn't support AutoML algorithms
      • It doesn't support ML operators in Spark SQL
      • ML algorithms internally use MLlib on RDDs
  12. The main problem with Spark ML: you grow old before your PR gets merged
  13. Via .write.format('ignite')

      bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
        $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
        $IGNITE_HOME/libs/*.jar,
        $IGNITE_HOME/libs/ignite-indexing/*.jar

      Dataset<Row> employees = spark.read().format("json").load("filename.json");

      employees.filter("age is not null")
          .drop("weight")
          .write().format("ignite")
          .option("config", "default-config.xml")
          .option("table", "employees")
          .mode("overwrite")
          .save();
  14. Via .write.format('ignite')

      bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
        $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
        $IGNITE_HOME/libs/*.jar,
        $IGNITE_HOME/libs/ignite-indexing/*.jar

      Dataset<Row> passengers = spark.read().format("json").load("filename.json");

      passengers.filter("age is not null")
          .drop("fare")
          .write().format("ignite")
          .option("config", "default-config.xml")
          .option("table", "passengers")
          .mode("overwrite")
          .save();
  15. Implement the CacheStore interface

      public class SparkCacheStore implements CacheStore<Integer, Object[]>, Serializable {
          private SparkSession spark;
          private Dataset<Row> ds;
          private static IgniteBiInClosure<Integer, Object[]> staticClo;

          {
              spark = SparkSession....getOrCreate();
              ds = spark.read()....csv("data-file");
              ds = ds.withColumn("index", functions.monotonically_increasing_id());
          }
      }
  16. Algorithms: Classification
      • Logistic Regression
      • SVM
      • KNN
      • ANN
      • Decision trees
      • Random Forest
  17. Algorithms: Regression
      • KNN regression
      • Linear regression
      • Decision tree regression
      • Random forest regression
      • Gradient-boosted tree regression
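
Regression training follows the same fit() pattern as the classification examples on the next slides. A minimal sketch, assuming the LinearRegressionLSQRTrainer from Ignite ML and the same cache/vectorizer setup as the Titanic examples below:

      // Sketch: distributed linear regression with the LSQR-based trainer (assumed API).
      LinearRegressionLSQRTrainer trainer = new LinearRegressionLSQRTrainer();

      // dataCache and vectorizer are prepared exactly as in the Titanic examples below.
      LinearRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);

      // Predict on a single (illustrative) feature vector.
      double prediction = mdl.predict(VectorUtils.of(1.0, 2.0));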
  18. Fill the cache

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  19. Build Labeled Vectors

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  20. Define the trainer

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  21. Train the model

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  22. Evaluate the model

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  23. Preprocessing

      Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

      Preprocessor minMaxScalerPr = new MinMaxScalerTrainer()
          .fit(ignite, dataCache, imputingPr);

      Preprocessor normalizationPr = new NormalizationTrainer()
          .withP(1)
          .fit(ignite, dataCache, minMaxScalerPr);

      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);
      double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());
  24. Pipeline

      Pipeline pipeline = new Pipeline()
          .addVectorizer(vectorizer)
          .addPreprocessingTrainer(new ImputerTrainer())
          .addPreprocessingTrainer(new MinMaxScalerTrainer())
          .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

      CrossValidation scoreCalculator = new CrossValidation();

      ParamGrid paramGrid = new ParamGrid()
          .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
          .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});
  25. Pipeline

      BinaryClassificationMetrics metrics = new BinaryClassificationMetrics()
          .withNegativeClsLb(0.0)
          .withPositiveClsLb(1.0)
          .withMetric(BinaryClassificationMetricValues::accuracy);

      CrossValidationResult crossValidationRes = scoreCalculator.score(
          pipeline, metrics, ignite, dataCache, 3, paramGrid);

      crossValidationRes.getScoringBoard().forEach((hyperParams, score) ->
          System.out.println("Score " + Arrays.toString(score) + " for params " + hyperParams));
  26. ML Ensemble Model Averaging
      • Ensemble as a mean value of predictions
      • Majority-based ensemble
      • Ensemble as a weighted sum of predictions
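
In Ignite ML these aggregation strategies are plugged into a bagged trainer as predictions aggregators. A sketch, assuming the TrainerTransformers.makeBagged(...) API (base trainer, ensemble size, subsample ratio, feature vector size, feature-subspace dimension, aggregator) and the aggregator classes MeanValuePredictionsAggregator, OnMajorityPredictionsAggregator, and WeightedPredictionsAggregator; check your Ignite version for the exact signatures:

      // Sketch: a majority-vote ensemble of decision trees (signatures assumed, see above).
      DecisionTreeClassificationTrainer baseTrainer = new DecisionTreeClassificationTrainer(5, 0);

      DatasetTrainer<ModelsComposition, Double> baggedTrainer = TrainerTransformers.makeBagged(
          baseTrainer,
          10,                                      // ensemble size
          0.6,                                     // subsample ratio per learner
          4,                                       // feature vector size
          3,                                       // features per subspace
          new OnMajorityPredictionsAggregator());  // swap in the mean-value or weighted-sum
                                                   // aggregator for the other two strategies

      ModelsComposition ensembleMdl = baggedTrainer.fit(ignite, dataCache, vectorizer);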
  27. Stacking in code

      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);
      LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

      StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
          .addTrainerWithDoubleOutput(trainer)
          .addTrainerWithDoubleOutput(trainer1)
          .fit(ignite, dataCache, normalizationPreprocessor);
  28. Online Machine Learning

      KNNClassificationTrainer trainer = new KNNClassificationTrainer();

      KNNClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer)
          .withK(3)
          .withDistanceMeasure(new EuclideanDistance())
          .withStrategy(NNStrategy.WEIGHTED);

      KNNClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  29. TensorFlow on Apache Ignite
      • Ignite Dataset
      • IGFS Plugin
      • Distributed Training
      • More info here
  30. TensorFlow on Apache Ignite
      • Ignite Dataset
      • IGFS Plugin
      • Distributed Training
      • More info here

      >>> import tensorflow as tf
      >>> from tensorflow.contrib.ignite import IgniteDataset
      >>>
      >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
      >>> iterator = dataset.make_one_shot_iterator()
      >>> next_obj = iterator.get_next()
      >>>
      >>> with tf.Session() as sess:
      ...     for _ in range(3):
      ...         print(sess.run(next_obj))
      {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
      {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
      {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  31. Distributed Training

      dataset = IgniteDataset("IMAGES")
      gradients = []

      # Compute gradients locally on every worker node.
      for i in range(5):
          with tf.device("/job:WORKER/task:%d" % i):
              device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
              device_next_obj = device_iterator.get_next()
              gradient = compute_gradient(device_next_obj)
              gradients.append(gradient)

      # Aggregate them on the master node.
      result_gradient = tf.reduce_sum(gradients)

      with tf.Session("grpc://localhost:10000") as sess:
          print(sess.run(result_gradient))
  32. Model Inference
      • PMML via JPMML
      • XGBoost model parser
      • Spark model parser
      • MLeap runtime usage
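
For example, a pre-trained XGBoost model can be served through the same reader/builder inference API as the MLeap example on the next slide. A sketch, assuming the XGModelParser class from the ignite-ml-xgboost module; the model path is hypothetical:

      // Sketch: parse a pre-trained XGBoost model and serve it on the cluster
      // (XGModelParser is assumed; the model path is illustrative).
      XGModelParser parser = new XGModelParser();

      ModelReader reader = new FileSystemModelReader("/path/to/xgboost-model.txt");

      // Distributed inference: 8 service instances, at most 2 per node (as in the MLeap slide).
      AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

      Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);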
  33. MLeap

      MLeapModelParser parser = new MLeapModelParser();

      ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

      AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

      Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

      Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
  34. Spark Model Parser

      val passengers = TitanicUtils.readPassengers(spark)

      val assembler = new VectorAssembler()
        .setInputCols(Array("pclass", "sibsp", "parch"))
        .setOutputCol("features")

      val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
        .select("features", "survived")

      val trainer = new GBTClassifier()
        .setMaxIter(10)
        .setLabelCol("survived")
        .setFeaturesCol("features")
        .setMaxDepth(7)

      val model = trainer.fit(output)

      model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")
  35. Spark Model Parser

      dataCache = TitanicUtils.readPassengers(ignite);

      final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

      ModelsComposition mdl = SparkModelParser.parse(
          SPARK_MDL_PATH,
          SupportedSparkModels.GRADIENT_BOOSTED_TREES
      );

      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  36. Roadmap for Ignite 3.0
      • NLP support
      • Spark Pipeline inference support
      • DL4j integration
      • More approximate ML algorithms to speed up training
  37. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
  38. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
  39. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
      • New features and capabilities of distributed ML could be a reason to give Ignite ML a try
  40. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
      • New features and capabilities of distributed ML could be a reason to give Ignite ML a try
      • You can load Spark models into Ignite and update them via the online-learning mechanism
  41. It's very easy to add a new feature
      • Write me: [email protected]
      • Create a ticket here
      • Prepare a PR
      • Assign me as a reviewer