Slide 1

Slide 1 text

WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

Slide 2

Slide 2 text

Alexey Zinovyev, Apache Ignite: Distributed ML/DL with the Ignite ML module, using Spark as a data source

Slide 3

Slide 3 text

Bio • Java developer • Distributed ML enthusiast • Apache Spark user • Apache Ignite Committer • Happy father and husband

Slide 4

Slide 4 text

ML/DL Most Popular Frameworks

Slide 5

Slide 5 text

Training on PBs with scikit-learn

Slide 6

Slide 6 text

Spark ML as an answer • It supports classic ML algorithms • Algorithms are distributed by nature • Wide support of different data sources and sinks • Easy building of Pipelines • Model evaluation and hyper-parameter tuning support
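As a quick illustration of the pipeline style those bullets refer to, here is a minimal Spark ML sketch in Java; the column names and the trainDf dataset are hypothetical placeholders, not from the talk:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;

// Assemble raw columns into the single vector column Spark ML expects.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"age", "fare"})
    .setOutputCol("features");

// A classic algorithm, distributed by nature.
LogisticRegression lr = new LogisticRegression()
    .setLabelCol("survived")
    .setFeaturesCol("features");

// Chain the stages and fit them as one unit.
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{assembler, lr});
PipelineModel model = pipeline.fit(trainDf); // trainDf is a Dataset<Row>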

Slide 7

Slide 7 text

What is bad with Spark ML? • It doesn’t support model ensembles such as stacking, boosting, and bagging

Slide 8

Slide 8 text

What is bad with Spark ML? • It doesn’t support model ensembles such as stacking, boosting, and bagging • It doesn’t support online learning for all algorithms

Slide 9

Slide 9 text

What is bad with Spark ML? • It doesn’t support model ensembles such as stacking, boosting, and bagging • It doesn’t support online learning for all algorithms • A lot of data transformation overhead on the way from a data source to ML types

Slide 10

Slide 10 text

What is bad with Spark ML? • It doesn’t support model ensembles such as stacking, boosting, and bagging • It doesn’t support online learning for all algorithms • A lot of data transformation overhead on the way from a data source to ML types • Hard integration with TensorFlow/Caffe

Slide 11

Slide 11 text

What is bad with Spark ML? • Only a part of the algorithms support sparse matrices

Slide 12

Slide 12 text

What is bad with Spark ML? • Only a part of the algorithms support sparse matrices • Several unfinished approaches to model inference/model serving

Slide 13

Slide 13 text

What is bad with Spark ML? • Only a part of the algorithms support sparse matrices • Several unfinished approaches to model inference/model serving • It doesn’t support AutoML algorithms

Slide 14

Slide 14 text

What is bad with Spark ML? • Only a part of the algorithms support sparse matrices • Several unfinished approaches to model inference/model serving • It doesn’t support AutoML algorithms • It doesn’t support ML operators in Spark SQL • ML algorithms internally use MLlib on RDDs

Slide 15

Slide 15 text

The main problem with Spark ML: you grow old before your PR is merged

Slide 16

Slide 16 text

What is Apache Ignite?

Slide 17

Slide 17 text

Distributed learning with Ignite
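All the training snippets below assume a running Ignite node and a cache that holds the data. A minimal bootstrap sketch; the config path and cache name are hypothetical:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

// Start (or connect to) an Ignite node from a Spring XML config.
Ignite ignite = Ignition.start("default-config.xml");

// Training data lives in a distributed key-value cache.
IgniteCache<Integer, Object[]> dataCache = ignite.getOrCreateCache("ML_DATA");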

Slide 18

Slide 18 text

Spark cluster as a data source

Slide 19

Slide 19 text

Via .write.format('ignite')

bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
            $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
            $IGNITE_HOME/libs/*.jar,
            $IGNITE_HOME/libs/ignite-indexing/*.jar

Slide 20

Slide 20 text

Via .write.format('ignite')

bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
            $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
            $IGNITE_HOME/libs/*.jar,
            $IGNITE_HOME/libs/ignite-indexing/*.jar

Dataset<Row> employees = spark.read().format("json").load("filename.json");

employees.filter("age is not null")
    .drop("weight")
    .write().format("ignite")
    .option("config", "default-config.xml")
    .option("table", "employees")
    .mode("overwrite")
    .save();

Slide 21

Slide 21 text

Via .write.format('ignite')

bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
            $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
            $IGNITE_HOME/libs/*.jar,
            $IGNITE_HOME/libs/ignite-indexing/*.jar

Dataset<Row> passengers = spark.read().format("json").load("filename.json");

passengers.filter("age is not null")
    .drop("fare")
    .write().format("ignite")
    .option("config", "default-config.xml")
    .option("table", "passengers")
    .mode("overwrite")
    .save();
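The reverse direction works through the same data source: the table just written can be loaded back into Spark. A minimal sketch, assuming the "passengers" table and config file from the snippet above:

Dataset<Row> fromIgnite = spark.read()
    .format("ignite")
    .option("config", "default-config.xml")
    .option("table", "passengers")
    .load();

fromIgnite.printSchema(); // columns come back from the Ignite SQL table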

Slide 22

Slide 22 text

Implement the CacheStore interface

public class SparkCacheStore implements CacheStore, Serializable {

    private SparkSession spark;
    private Dataset ds;
    private static IgniteBiInClosure staticClo;

    {
        spark = SparkSession....getOrCreate();
        ds = spark.read()....csv("data-file");
        ds = ds.withColumn("index", functions.monotonically_increasing_id());
    }

    // CacheStore methods (loadCache, write, delete, ...) are elided on the slide.
}
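The slide shows only the fields and the initializer. A hypothetical sketch of the elided loadCache method, illustrating why staticClo is static (so code running outside the instance can still reach the closure); the key type and the generated "index" column are assumptions:

@Override
public void loadCache(IgniteBiInClosure<Long, Row> clo, Object... args) {
    staticClo = clo; // stash the closure so helper code can push entries into Ignite

    // Stream every Spark row into the cache, keyed by the generated index column.
    ds.toLocalIterator().forEachRemaining(row ->
        staticClo.apply(row.getAs("index"), row));
}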

Slide 23

Slide 23 text

Partition-Based Dataset

Slide 24

Slide 24 text

Algorithms: Classification • Logistic Regression • SVM • KNN • ANN • Decision trees • Random Forest

Slide 25

Slide 25 text

Algorithms: Regression • KNN Regression • Linear Regression • Decision tree regression • Random forest regression • Gradient-boosted tree regression

Slide 26

Slide 26 text

Multilayer Perceptron Neural Network
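The deck shows no MLP code; here is a minimal sketch of Ignite's MLPTrainer for a two-feature binary task, reusing the ignite/dataCache/vectorizer names from the following slides. The architecture and all hyper-parameters are illustrative, not from the talk:

// Two inputs -> hidden layer of 10 ReLU neurons -> one sigmoid output.
MLPArchitecture arch = new MLPArchitecture(2)
    .withAddedLayer(10, true, Activators.RELU)
    .withAddedLayer(1, false, Activators.SIGMOID);

MLPTrainer<SimpleGDParameterUpdate> trainer = new MLPTrainer<>(
    arch,
    LossFunctions.MSE,
    new UpdatesStrategy<>(
        new SimpleGDUpdateCalculator(0.1),  // plain gradient descent step
        SimpleGDParameterUpdate.SUM_LOCAL,  // combine updates within a node
        SimpleGDParameterUpdate.AVG),       // average updates across nodes
    3000,  // max iterations
    4,     // batch size
    50,    // local iterations
    123L); // seed

MultilayerPerceptron mdl = trainer.fit(ignite, dataCache, vectorizer);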

Slide 27

Slide 27 text

Build the model (diagram: partition-based dataset)

Slide 28

Slide 28 text

Fill the cache

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 29

Slide 29 text

Build Labeled Vectors

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 30

Slide 30 text

Define the trainer

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 31

Slide 31 text

Train the model

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 32

Slide 32 text

Evaluate the model

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
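Once evaluated, the same model can score new observations directly; a small usage sketch (the feature values are invented):

// DecisionTreeNode implements IgniteModel<Vector, Double>.
double survived = mdl.predict(VectorUtils.of(1.0, 30.0, 7.25));
System.out.println("Predicted class: " + survived);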

Slide 33

Slide 33 text

Preprocessors: Normalization

Slide 34

Slide 34 text

Preprocessors: Scaling

Slide 35

Slide 35 text

Preprocessors: One-Hot Encoding
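No code is shown for one-hot encoding; a minimal sketch using Ignite's EncoderTrainer, assuming the cache and vectorizer from the neighboring slides and a categorical feature at index 1 (the index is illustrative):

Preprocessor oneHotPr = new EncoderTrainer()
    .withEncoderType(EncoderType.ONE_HOT_ENCODER)
    .withEncodedFeature(1) // which feature to expand into dummy columns
    .fit(ignite, dataCache, vectorizer);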

Slide 36

Slide 36 text

Preprocessing

Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

Preprocessor minMaxScalerPr = new MinMaxScalerTrainer()
    .fit(ignite, dataCache, imputingPr);

Preprocessor normalizationPr = new NormalizationTrainer()
    .withP(1)
    .fit(ignite, dataCache, minMaxScalerPr);

DecisionTreeClassificationTrainer trainer =
    new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);

double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());

Slide 37

Slide 37 text

Model Evaluation with K-fold CV
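Cross-validation can also be run directly on a trainer, without the Pipeline shown on the next slides. A sketch reusing the trainer, cache, and vectorizer from the earlier slides; the exact score(...) overload may differ between Ignite versions, so treat this as an assumption:

CrossValidation scoreCalculator = new CrossValidation();

// 4-fold cross-validated accuracy for a plain trainer.
double[] scores = scoreCalculator.score(
    trainer, new Accuracy<>(), ignite, dataCache, vectorizer, 4);

System.out.println(Arrays.toString(scores));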

Slide 38

Slide 38 text

Pipeline

Slide 39

Slide 39 text

Pipeline

Pipeline pipeline = new Pipeline().addVectorizer(vectorizer)
    .addPreprocessingTrainer(new ImputerTrainer())
    .addPreprocessingTrainer(new MinMaxScalerTrainer())
    .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

CrossValidation scoreCalculator = new CrossValidation();

ParamGrid paramGrid = new ParamGrid()
    .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
    .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});

Slide 40

Slide 40 text

Pipeline

BinaryClassificationMetrics metrics = new BinaryClassificationMetrics()
    .withNegativeClsLb(0.0)
    .withPositiveClsLb(1.0)
    .withMetric(BinaryClassificationMetricValues::accuracy);

CrossValidationResult crossValidationRes = scoreCalculator.score(
    pipeline, metrics, ignite, dataCache, 3, paramGrid);

crossValidationRes.getScoringBoard().forEach((hyperParams, score) ->
    System.out.println("Score " + Arrays.toString(score)
        + " for params " + hyperParams));

Slide 41

Slide 41 text

ML Ensemble Model Averaging • Ensemble as a mean value of predictions • Majority-based ensemble • Ensemble as a weighted sum of predictions
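Ignite wires these aggregators in through TrainerTransformers. A hedged sketch of bagging with majority voting, following the Ignite examples of that era; ensemble size, subsample ratio, and feature-space sizes are illustrative:

// Wrap a base trainer into an ensemble of 10 models, each trained
// on ~60% of the data and a random 3 of 4 features.
BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(
    new DecisionTreeClassificationTrainer(5, 0),
    10,   // ensemble size
    0.6,  // subsample ratio per model
    4,    // feature vector size
    3,    // features per subspace
    new OnMajorityPredictionsAggregator());

BaggedModel mdl = baggedTrainer.fit(ignite, dataCache, vectorizer);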

Slide 42

Slide 42 text

Stacking (diagram: partition-based dataset to result)

Slide 43

Slide 43 text

Stacking in code

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);

LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
    .addTrainerWithDoubleOutput(trainer)
    .addTrainerWithDoubleOutput(trainer1)
    .fit(ignite, dataCache, normalizationPreprocessor);

Slide 44

Slide 44 text

Online Machine Learning (diagram: partition-based dataset)

Slide 45

Slide 45 text

Online Machine Learning

KNNClassificationTrainer trainer = new KNNClassificationTrainer();

KNNClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer)
    .withK(3)
    .withDistanceMeasure(new EuclideanDistance())
    .withStrategy(NNStrategy.WEIGHTED);

KNNClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);

Slide 46

Slide 46 text

TensorFlow on Apache Ignite • Ignite Dataset • IGFS Plugin • Distributed Training • More info here

Slide 47

Slide 47 text

TensorFlow on Apache Ignite • Ignite Dataset • IGFS Plugin • Distributed Training • More info here

>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
>>>     for _ in range(3):
>>>         print(sess.run(next_obj))

{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Slide 48

Slide 48 text

Distributed Training

dataset = IgniteDataset("IMAGES")
gradients = []

# Compute gradients locally on every worker node.
# (compute_gradient is user code, not shown on the slide.)
for i in range(5):
    with tf.device("/job:WORKER/task:%d" % i):
        device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
        device_next_obj = device_iterator.get_next()
        gradient = compute_gradient(device_next_obj)
        gradients.append(gradient)

# Aggregate them on master node.
result_gradient = tf.reduce_sum(gradients)

with tf.Session("grpc://localhost:10000") as sess:
    print(sess.run(result_gradient))

Slide 49

Slide 49 text

TF Distributed Training

Slide 50

Slide 50 text

Model Inference • PMML via JPMML • XGBoost model parser • Spark model parser • MLeap runtime usage

Slide 51

Slide 51 text

MLeap

MLeapModelParser parser = new MLeapModelParser();

ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

Model<Vector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
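Because the distributed builder is asynchronous, predict returns a Future; the caller blocks on it to read the value. A small usage sketch:

// get() blocks until the distributed inference completes;
// InterruptedException/ExecutionException must be handled by the caller.
double value = prediction.get();
System.out.println("Prediction: " + value);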

Slide 52

Slide 52 text

Spark ML Model Parser

Slide 53

Slide 53 text

Spark Model Parser

val passengers = TitanicUtils.readPassengers(spark)

val assembler = new VectorAssembler()
  .setInputCols(Array("pclass", "sibsp", "parch"))
  .setOutputCol("features")

val output = assembler
  .transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
  .select("features", "survived")

val trainer = new GBTClassifier()
  .setMaxIter(10)
  .setLabelCol("survived")
  .setFeaturesCol("features")
  .setMaxDepth(7)

val model = trainer.fit(output)

model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")

Slide 54

Slide 54 text

Spark Model Parser

dataCache = TitanicUtils.readPassengers(ignite);

final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

ModelsComposition mdl = SparkModelParser.parse(
    SPARK_MDL_PATH,
    SupportedSparkModels.GRADIENT_BOOSTED_TREES);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 55

Slide 55 text

It could be your application ☺

Slide 56

Slide 56 text

Roadmap for Ignite 3.0 • NLP support • Spark Pipeline Inference Support • DL4J integration • More approximate ML algorithms to speed up training

Slide 57

Slide 57 text

Conclusion • Apache Spark and Apache Ignite can work together in the ML/DL area

Slide 58

Slide 58 text

Conclusion • Apache Spark and Apache Ignite can work together in the ML/DL area • Apache Ignite ML is a son of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)

Slide 59

Slide 59 text

Conclusion • Apache Spark and Apache Ignite can work together in the ML/DL area • Apache Ignite ML is a son of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations) • The new features and capabilities of distributed ML could be a reason to try Ignite ML

Slide 60

Slide 60 text

Conclusion • Apache Spark and Apache Ignite can work together in the ML/DL area • Apache Ignite ML is a son of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations) • The new features and capabilities of distributed ML could be a reason to try Ignite ML • You can load Spark models into Ignite and update them via the online-learning mechanism

Slide 61

Slide 61 text

It’s very easy to add a new feature • Write me at [email protected] • Create a ticket here • Prepare a PR • Assign me as a reviewer

Slide 62

Slide 62 text

DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT