Distributed ML/DL with Ignite ML module using Spark as a data source

The current implementation of ML algorithms in Spark has several disadvantages: the costly transition from standard Spark SQL types to ML-specific types, algorithms that are poorly adapted to distributed computing, and the slow pace at which new algorithms are added to the library.

In addition, Spark ML does not natively support online learning for all algorithms, nor stacking, boosting, or a number of approximate ML algorithms that give a significant speedup in many cases. Apache Ignite can work closely with Apache Spark thanks to its excellent Ignite RDD / Ignite DataFrame implementation (see
https://ignite.apache.org/use-cases/spark/shared-memory-layer.html).

Apache Ignite also has the Ignite ML module, which includes many distributed ML algorithms, an NLP package (to be available in the next release, 2.8), a set of approximate ML algorithms, and simple integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the tf.contrib package). In addition, every algorithm supports model updating, which makes online learning possible for more than just KMeans and LinReg.
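
To illustrate what "model updating" means, here is a minimal plain-Java sketch of a model whose state can absorb new batches without revisiting earlier data. It is not the Ignite ML API: the class OnlineCentroidModel and its methods are hypothetical names, and a nearest-centroid classifier is used only because its update step is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Nearest-centroid classifier whose state can be updated with new batches. */
class OnlineCentroidModel {
    // Per-class running sums and counts are enough to rebuild every centroid.
    private final Map<Integer, double[]> sums = new HashMap<>();
    private final Map<Integer, Integer> counts = new HashMap<>();

    /** Folds a batch into the model without revisiting earlier data. */
    OnlineCentroidModel update(double[][] features, int[] labels) {
        for (int i = 0; i < features.length; i++) {
            double[] x = features[i];
            double[] sum = sums.computeIfAbsent(labels[i], k -> new double[x.length]);
            for (int j = 0; j < x.length; j++)
                sum[j] += x[j];
            counts.merge(labels[i], 1, Integer::sum);
        }
        return this;
    }

    /** Returns the centroid (per-feature mean) of the given class. */
    double[] centroid(int label) {
        double[] sum = sums.get(label);
        int n = counts.get(label);
        double[] c = new double[sum.length];
        for (int j = 0; j < sum.length; j++)
            c[j] = sum[j] / n;
        return c;
    }

    /** Predicts the class with the closest centroid (squared Euclidean distance). */
    int predict(double[] x) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int label : sums.keySet()) {
            double[] c = centroid(label);
            double dist = 0;
            for (int j = 0; j < x.length; j++)
                dist += (x[j] - c[j]) * (x[j] - c[j]);
            if (dist < bestDist) {
                bestDist = dist;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Initial training on the first batch.
        OnlineCentroidModel mdl = new OnlineCentroidModel()
            .update(new double[][] {{0, 0}, {10, 10}}, new int[] {0, 1});
        // A later batch moves the class-0 centroid from (0, 0) to (1, 1).
        mdl.update(new double[][] {{2, 2}}, new int[] {0});
        System.out.println(mdl.centroid(0)[0]);
        System.out.println(mdl.predict(new double[] {4, 4}));
    }
}
```

The point of the sketch is that updating with a second batch gives the same centroids as training once on both batches together, which is the property an online-learning API needs.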

We suggest using the Apache Ignite ML module to speed up your ML training and using Spark + Ignite as a backend for distributed TensorFlow calculations. You will see live demos of building ML pipelines with the Apache Ignite ML module, Apache Spark, Apache Kafka, TensorFlow and more.

Alexey Zinoviev

April 23, 2019

Transcript

  1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

  2. Zinovyev Alexey, Apache Ignite
    Distributed ML/DL with Ignite ML module using Spark as a data source
    #UnifiedAnalytics #SparkAISummit
  3. Bio
    • Java developer
    • Distributed ML enthusiast
    • Apache Spark user
    • Apache Ignite Committer
    • Happy father and husband
  4. ML/DL Most Popular Frameworks
  5. Training on PBs with scikit-learn
  6. Spark ML as an answer
    • It supports classic ML algorithms
    • Algorithms are distributed by nature
    • Wide support of different data sources and sinks
    • Easy building of Pipelines
    • Model evaluation and hyper-parameter tuning support
  10. What is bad with Spark ML?
    • It doesn't support model ensembles such as stacking, boosting, bagging
    • It doesn't support online learning for all algorithms
    • A lot of data transformation/overhead from data source to ML types
    • Hard integration with TensorFlow/Caffe
  14. What is bad with Spark ML?
    • Some algorithms use sparse matrices
    • Several unfinished approaches to model inference/model serving
    • It doesn't support AutoML algorithms
    • It doesn't support ML operators in Spark SQL
    • ML algorithms internally use MLlib on RDDs
  15. The main problem with Spark ML
    You grow old before your PR is merged
  16. What is Apache Ignite?
  17. Make distributed learning with Ignite
  18. Spark Cluster as data-source
  19. Via .write.format('ignite')
        bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,$IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar
  20. Via .write.format('ignite')
        Dataset<Row> employees = spark.read().format("json").load("filename.json");

        employees.filter("age is not null")
            .drop("weight")
            .write().format("ignite")
            .option("config", "default-config.xml")
            .option("table", "employees")
            .mode("overwrite")
            .save();
  21. Via .write.format('ignite')
        Dataset<Row> passengers = spark.read().format("json").load("filename.json");

        passengers.filter("age is not null")
            .drop("fare")
            .write().format("ignite")
            .option("config", "default-config.xml")
            .option("table", "passengers")
            .mode("overwrite")
            .save();
  22. Implement CacheStore interface
        public class SparkCacheStore implements CacheStore<Integer, Object[]>, Serializable {
            private SparkSession spark;
            private Dataset<Row> ds;
            private static IgniteBiInClosure<Integer, Object[]> staticClo;

            {
                spark = SparkSession ....getOrCreate();
                ds = spark.read()....csv("data-file");
                ds = ds.withColumn("index", functions.monotonically_increasing_id());
            }
  23. Partition-Based Dataset
  24. Algorithms: Classification
    • Logistic Regression
    • SVM
    • KNN
    • ANN
    • Decision trees
    • Random Forest
  25. Algorithms: Regression
    • KNN Regression
    • Linear Regression
    • Decision tree regression
    • Random forest regression
    • Gradient-boosted tree regression
  26. Multilayer Perceptron Neural Network
  27. Build the model
    (diagram: Partition-Based Dataset)
  28. Fill the cache
        IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

        Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

        DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

        DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

        double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  29. Build Labeled Vectors
        Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
  30. Define the trainer
        DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
  31. Train the model
        DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
  32. Evaluate the model
        double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  33. Preprocessors: Normalization
  34. Preprocessors: Scaling
  35. Preprocessors: One-Hot Encoding
  36. Preprocessing
        Preprocessor imputingPr = new ImputerTrainer()
            .fit(ignite, dataCache, vectorizer);

        Preprocessor minMaxScalerPr = new MinMaxScalerTrainer()
            .fit(ignite, dataCache, imputingPr);

        Preprocessor normalizationPr = new NormalizationTrainer()
            .withP(1)
            .fit(ignite, dataCache, minMaxScalerPr);

        DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

        DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);

        double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());
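
To make the preprocessing chain on slide 36 more concrete, here is a minimal plain-Java sketch of what min-max scaling followed by p-norm normalization with p = 1 does to a small matrix. It does not use the Ignite API: PreprocessingSketch and its methods are hypothetical names, and the treatment of constant columns and all-zero rows is an assumption made for the example.

```java
import java.util.Arrays;

class PreprocessingSketch {
    /** Rescales every column to [0, 1] using that column's min and max. */
    static double[][] minMaxScale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows)
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        double[][] out = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++)
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                // Assumption: a constant column is mapped to 0.
                out[i][j] = range == 0 ? 0 : (rows[i][j] - min[j]) / range;
            }
        return out;
    }

    /** Divides every row by its p-norm so that the row's p-norm becomes 1. */
    static double[][] normalize(double[][] rows, int p) {
        double[][] out = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            double norm = 0;
            for (double v : rows[i])
                norm += Math.pow(Math.abs(v), p);
            norm = Math.pow(norm, 1.0 / p);
            out[i] = new double[rows[i].length];
            for (int j = 0; j < rows[i].length; j++)
                // Assumption: an all-zero row is left as zeros.
                out[i][j] = norm == 0 ? 0 : rows[i][j] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 10}, {2, 30}, {3, 20}};
        double[][] scaled = minMaxScale(data);
        double[][] normalized = normalize(scaled, 1);
        System.out.println(Arrays.deepToString(scaled));
        System.out.println(Arrays.deepToString(normalized));
    }
}
```

As on the slide, each step consumes the output of the previous one, so the order of the chain matters.
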
  37. Model Evaluation with K-fold CV
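
K-fold cross-validation splits the data into k folds and, in each of k rounds, holds one fold out for scoring while training on the remaining folds. The following is a minimal plain-Java sketch of that index bookkeeping, assuming a simple round-robin assignment of row indices to folds. Ignite ML may well partition the data differently, and KFoldSketch with its methods is a hypothetical name used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class KFoldSketch {
    /** Assigns sample indices 0..n-1 to k folds in round-robin order. */
    static List<List<Integer>> split(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++)
            folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++)
            folds.get(i % k).add(i);
        return folds;
    }

    /** Returns the training indices for a round: everything outside the held-out fold. */
    static List<Integer> trainIndices(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++)
            if (f != heldOut)
                train.addAll(folds.get(f));
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = split(7, 3);
        // One round per fold: score on the held-out fold, train on the rest.
        for (int round = 0; round < folds.size(); round++)
            System.out.println("test=" + folds.get(round) + " train=" + trainIndices(folds, round));
    }
}
```

The final score is usually the average of the per-round scores, which is less sensitive to one lucky or unlucky split than a single train/test partition.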

  38. Pipeline
  39. Pipeline
        Pipeline pipeline = new Pipeline()
            .addVectorizer(vectorizer)
            .addPreprocessingTrainer(new ImputerTrainer())
            .addPreprocessingTrainer(new MinMaxScalerTrainer())
            .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

        CrossValidation scoreCalculator = new CrossValidation();

        ParamGrid paramGrid = new ParamGrid()
            .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
            .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});
  40. Pipeline
        BinaryClassificationMetrics metrics = new BinaryClassificationMetrics()
            .withNegativeClsLb(0.0)
            .withPositiveClsLb(1.0)
            .withMetric(BinaryClassificationMetricValues::accuracy);

        CrossValidationResult crossValidationRes = scoreCalculator.score(
            pipeline, metrics, ignite, dataCache, 3, paramGrid);

        crossValidationRes.getScoringBoard().forEach((hyperParams, score) ->
            System.out.println("Score " + Arrays.toString(score) + " for params " + hyperParams));
  41. ML Ensemble Model Averaging
    • Ensemble as a mean value of predictions
    • Majority-based ensemble
    • Ensemble as a weighted sum of predictions
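
The three aggregation strategies listed on slide 41 can be sketched in a few lines of plain Java. This is an illustration of the idea only, not the Ignite ML implementation: EnsembleSketch and its method names are hypothetical, and the tie-breaking rule in the majority vote is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class EnsembleSketch {
    /** Ensemble as the mean value of the base models' predictions. */
    static double mean(double[] predictions) {
        double sum = 0;
        for (double p : predictions)
            sum += p;
        return sum / predictions.length;
    }

    /** Majority-based ensemble: the most frequent predicted label wins. */
    static double majorityVote(double[] predictions) {
        Map<Double, Integer> votes = new HashMap<>();
        double best = predictions[0];
        for (double p : predictions) {
            int count = votes.merge(p, 1, Integer::sum);
            // Strict comparison: on a tie, the label that reached the count first is kept.
            if (count > votes.get(best))
                best = p;
        }
        return best;
    }

    /** Ensemble as a weighted sum of predictions. */
    static double weightedSum(double[] predictions, double[] weights) {
        double sum = 0;
        for (int i = 0; i < predictions.length; i++)
            sum += weights[i] * predictions[i];
        return sum;
    }

    public static void main(String[] args) {
        // Predictions of three base models for one sample.
        double[] predictions = {1.0, 0.0, 1.0};
        System.out.println(mean(predictions));
        System.out.println(majorityVote(predictions));
        System.out.println(weightedSum(predictions, new double[] {0.5, 0.3, 0.2}));
    }
}
```

Mean and weighted sum suit regression or probability outputs, while majority voting suits class labels.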
  42. Stacking
    (diagram: Partition-Based Dataset, Result)
  43. Stacking in code
        DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

        DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);

        LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

        StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
            .addTrainerWithDoubleOutput(trainer)
            .addTrainerWithDoubleOutput(trainer1)
            .fit(ignite, dataCache, normalizationPreprocessor);
  44. Online Machine Learning
    (diagram: Partition-Based Dataset)
  45. Online Machine Learning
        KNNClassificationTrainer trainer = new KNNClassificationTrainer();

        KNNClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer)
            .withK(3)
            .withDistanceMeasure(new EuclideanDistance())
            .withStrategy(NNStrategy.WEIGHTED);

        KNNClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  47. TensorFlow on Apache Ignite
    • Ignite Dataset
    • IGFS Plugin
    • Distributed Training
    • More info here
        >>> import tensorflow as tf
        >>> from tensorflow.contrib.ignite import IgniteDataset
        >>>
        >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
        >>> iterator = dataset.make_one_shot_iterator()
        >>> next_obj = iterator.get_next()
        >>>
        >>> with tf.Session() as sess:
        ...     for _ in range(3):
        ...         print(sess.run(next_obj))
        {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
        {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
        {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  48. Distributed Training
        dataset = IgniteDataset("IMAGES")
        gradients = []

        # Compute gradients locally on every worker node.
        for i in range(5):
            with tf.device("/job:WORKER/task:%d" % i):
                device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
                device_next_obj = device_iterator.get_next()
                gradient = compute_gradient(device_next_obj)
                gradients.append(gradient)

        # Aggregate them on master node.
        result_gradient = tf.reduce_sum(gradients)

        with tf.Session("grpc://localhost:10000") as sess:
            print(sess.run(result_gradient))
  49. TF Distributed Training
  50. Model Inference
    • PMML via JPMML
    • XGBoost model parser
    • Spark model parser
    • MLeap runtime usage
  51. MLeap
        MLeapModelParser parser = new MLeapModelParser();

        ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

        AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

        Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

        Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
  52. Spark ML Model Parser
  53. Spark Model Parser
        val passengers = TitanicUtils.readPassengers(spark)

        val assembler = new VectorAssembler()
          .setInputCols(Array("pclass", "sibsp", "parch"))
          .setOutputCol("features")

        val output = assembler
          .transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
          .select("features", "survived")

        val trainer = new GBTClassifier()
          .setMaxIter(10)
          .setLabelCol("survived")
          .setFeaturesCol("features")
          .setMaxDepth(7)

        val model = trainer.fit(output)

        model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")
  54. Spark Model Parser
        dataCache = TitanicUtils.readPassengers(ignite);

        final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

        ModelsComposition mdl = SparkModelParser.parse(
            SPARK_MDL_PATH,
            SupportedSparkModels.GRADIENT_BOOSTED_TREES
        );

        double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  55. It could be your application☺
  56. Roadmap for Ignite 3.0
    • NLP support
    • Spark Pipeline Inference Support
    • DL4j integration
    • More approximate ML algorithms to speed up training
  60. Conclusion
    • Apache Spark and Apache Ignite can work together in the ML/DL area
    • Apache Ignite ML is a son of Apache Spark ML (we learnt a lot from the Spark ML algorithm implementations)
    • New features and capabilities of distributed ML could be a reason to try Ignite ML
    • You could load Spark models into Ignite and update them via the online-learning mechanism
  61. It's very easy to add a new feature
    • Write to me at zaleslaw.sin@gmail.com
    • Create a ticket here
    • Prepare a PR
    • Assign me as a reviewer