
Distributed ML/DL with Ignite ML module using Spark as a data source

The current implementation of ML algorithms in Spark has several disadvantages: the costly conversion from standard Spark SQL types to ML-specific types, the limited adaptation of some algorithms to distributed computing, and the relatively slow pace at which new algorithms are added to the library.

Also, Spark ML does not natively support online learning for all algorithms, nor stacking, boosting, or a number of approximate ML algorithms that give a significant speedup in many cases. Apache Ignite can work closely with Apache Spark thanks to its excellent Ignite RDD/Ignite DataFrame implementation (see https://ignite.apache.org/use-cases/spark/shared-memory-layer.html).
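
For example, data written to Ignite can be read back into Spark as a DataFrame through the same data source. A minimal Java sketch, assuming a running Ignite node configured via default-config.xml and a "passengers" table created earlier through the Ignite data source (the option names mirror the write example shown later in the deck):

     // Sketch: read an Ignite SQL table into Spark as a DataFrame.
     // Assumes a running Ignite cluster and an existing "passengers" table.
     Dataset<Row> passengers = spark.read()
         .format("ignite")                        // Ignite Spark data source
         .option("config", "default-config.xml")  // Ignite configuration file
         .option("table", "passengers")           // SQL table stored in Ignite
         .load();

     passengers.show();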

Apache Ignite also ships the Ignite ML module, which includes many distributed ML algorithms, an NLP package (to be available in the next release, 2.8), a set of approximate ML algorithms, and simple integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the TF contrib package). In addition, every algorithm supports model updating, which enables online learning not only for KMeans and LinReg.

We suggest using the Apache Ignite ML module to speed up your ML training, with Spark + Ignite as a backend for distributed TensorFlow calculations. You will see live demos of ML pipeline building with the Apache Ignite ML module, Apache Spark, Apache Kafka, TensorFlow, and more.

Alexey Zinoviev

April 23, 2019

Transcript

  1. Zinovyev Alexey, Apache Ignite: Distributed ML/DL with Ignite ML module using Spark as a data source
  2. Bio
     • Java developer
     • Distributed ML enthusiast
     • Apache Spark user
     • Apache Ignite Committer
     • Happy father and husband
  3. Spark ML as an answer
     • It supports classic ML algorithms
     • Algorithms are distributed by nature
     • Wide support of different data sources and sinks
     • Easy building of Pipelines
     • Model evaluation and hyper-parameter tuning support
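
For comparison, building such a pipeline in Spark ML takes only a few lines. A minimal Java sketch (illustrative; the Titanic-style column names and the trainDf dataset are assumptions, not from the deck):

     // Sketch: a two-stage Spark ML pipeline (feature assembly + logistic regression).
     VectorAssembler assembler = new VectorAssembler()
         .setInputCols(new String[]{"pclass", "sibsp", "parch"})
         .setOutputCol("features");

     LogisticRegression lr = new LogisticRegression()
         .setLabelCol("survived")
         .setFeaturesCol("features");

     Pipeline pipeline = new Pipeline()
         .setStages(new PipelineStage[]{assembler, lr});

     PipelineModel model = pipeline.fit(trainDf);  // trainDf: Dataset<Row> with training data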
  4. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
  5. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
  6. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
     • A lot of data transformation overhead on the way from a data source to ML types
  7. What is bad with Spark ML?
     • It doesn't support model ensembles such as stacking, boosting, and bagging
     • It doesn't support online learning for all algorithms
     • A lot of data transformation overhead on the way from a data source to ML types
     • Hard integration with TensorFlow/Caffe
  8. What is bad with Spark ML?
     • Only a part of the algorithms use sparse matrices
  9. What is bad with Spark ML?
     • Only a part of the algorithms use sparse matrices
     • Several unfinished approaches to model inference/model serving
  10. What is bad with Spark ML?
      • Only a part of the algorithms use sparse matrices
      • Several unfinished approaches to model inference/model serving
      • It doesn't support AutoML algorithms
  11. What is bad with Spark ML?
      • Only a part of the algorithms use sparse matrices
      • Several unfinished approaches to model inference/model serving
      • It doesn't support AutoML algorithms
      • It doesn't support ML operators in Spark SQL
      • ML algorithms internally use MLlib on RDDs
  12. The main problem with Spark ML: you grow old before your PR gets merged
  13. Via .write.format('ignite')

      bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
        $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
        $IGNITE_HOME/libs/*.jar,
        $IGNITE_HOME/libs/ignite-indexing/*.jar

      Dataset<Row> employees = spark.read().format("json").load("filename.json");

      employees.filter("age is not null")
          .drop("weight")
          .write().format("ignite")
          .option("config", "default-config.xml")
          .option("table", "employees")
          .mode("overwrite")
          .save();
  14. Via .write.format('ignite')

      bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
        $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
        $IGNITE_HOME/libs/*.jar,
        $IGNITE_HOME/libs/ignite-indexing/*.jar

      Dataset<Row> passengers = spark.read().format("json").load("filename.json");

      passengers.filter("age is not null")
          .drop("fare")
          .write().format("ignite")
          .option("config", "default-config.xml")
          .option("table", "passengers")
          .mode("overwrite")
          .save();
  15. Implement the CacheStore interface

      public class SparkCacheStore implements CacheStore<Integer, Object[]>, Serializable {
          private SparkSession spark;
          private Dataset<Row> ds;
          private static IgniteBiInClosure<Integer, Object[]> staticClo;

          {
              spark = SparkSession....getOrCreate();
              ds = spark.read()....csv("data-file");
              ds = ds.withColumn("index", functions.monotonically_increasing_id());
          }
      }
  16. Algorithms: Classification
      • Logistic Regression
      • SVM
      • KNN
      • ANN
      • Decision trees
      • Random Forest
  17. Algorithms: Regression
      • KNN regression
      • Linear regression
      • Decision tree regression
      • Random forest regression
      • Gradient-boosted tree regression
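
Regression training follows the same fit() pattern as the classification examples on the next slides. A minimal sketch, assuming the LinearRegressionLSQRTrainer from Ignite ML and the same cache/vectorizer setup as the Titanic examples below:

      // Sketch: distributed linear regression with the LSQR-based trainer (assumed API).
      LinearRegressionLSQRTrainer trainer = new LinearRegressionLSQRTrainer();

      // dataCache and vectorizer are prepared exactly as in the Titanic examples below.
      LinearRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);

      // Predict on a single (illustrative) feature vector.
      double prediction = mdl.predict(VectorUtils.of(1.0, 2.0));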
  18. Fill the cache

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  19. Build Labeled Vectors

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  20. Define the trainer

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  21. Train the model

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  22. Evaluate the model

      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  23. Preprocessing

      Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

      Preprocessor minMaxScalerPr = new MinMaxScalerTrainer()
          .fit(ignite, dataCache, imputingPr);

      Preprocessor normalizationPr = new NormalizationTrainer()
          .withP(1)
          .fit(ignite, dataCache, minMaxScalerPr);

      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);
      double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());
  24. Pipeline

      Pipeline pipeline = new Pipeline()
          .addVectorizer(vectorizer)
          .addPreprocessingTrainer(new ImputerTrainer())
          .addPreprocessingTrainer(new MinMaxScalerTrainer())
          .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

      CrossValidation scoreCalculator = new CrossValidation();

      ParamGrid paramGrid = new ParamGrid()
          .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
          .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});
  25. Pipeline

      BinaryClassificationMetrics metrics = new BinaryClassificationMetrics()
          .withNegativeClsLb(0.0)
          .withPositiveClsLb(1.0)
          .withMetric(BinaryClassificationMetricValues::accuracy);

      CrossValidationResult crossValidationRes = scoreCalculator.score(
          pipeline, metrics, ignite, dataCache, 3, paramGrid);

      crossValidationRes.getScoringBoard().forEach((hyperParams, score) ->
          System.out.println("Score " + Arrays.toString(score) + " for params " + hyperParams));
  26. ML Ensemble Model Averaging
      • Ensemble as a mean value of predictions
      • Majority-based ensemble
      • Ensemble as a weighted sum of predictions
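
In Ignite ML these aggregation strategies are plugged into a bagged trainer as predictions aggregators. A sketch, assuming the TrainerTransformers.makeBagged(...) API (base trainer, ensemble size, subsample ratio, feature vector size, feature-subspace dimension, aggregator) and the aggregator classes MeanValuePredictionsAggregator, OnMajorityPredictionsAggregator, and WeightedPredictionsAggregator; check your Ignite version for the exact signatures:

      // Sketch: a majority-vote ensemble of decision trees (signatures assumed, see above).
      DecisionTreeClassificationTrainer baseTrainer = new DecisionTreeClassificationTrainer(5, 0);

      DatasetTrainer<ModelsComposition, Double> baggedTrainer = TrainerTransformers.makeBagged(
          baseTrainer,
          10,                                      // ensemble size
          0.6,                                     // subsample ratio per learner
          4,                                       // feature vector size
          3,                                       // features per subspace
          new OnMajorityPredictionsAggregator());  // swap in the mean-value or weighted-sum
                                                   // aggregator for the other two strategies

      ModelsComposition ensembleMdl = baggedTrainer.fit(ignite, dataCache, vectorizer);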
  27. Stacking in code

      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);
      LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

      StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
          .addTrainerWithDoubleOutput(trainer)
          .addTrainerWithDoubleOutput(trainer1)
          .fit(ignite, dataCache, normalizationPreprocessor);
  28. Online Machine Learning

      KNNClassificationTrainer trainer = new KNNClassificationTrainer();

      KNNClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer)
          .withK(3)
          .withDistanceMeasure(new EuclideanDistance())
          .withStrategy(NNStrategy.WEIGHTED);

      KNNClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  29. TensorFlow on Apache Ignite
      • Ignite Dataset
      • IGFS Plugin
      • Distributed Training
      • More info here
  30. TensorFlow on Apache Ignite
      • Ignite Dataset
      • IGFS Plugin
      • Distributed Training
      • More info here

      >>> import tensorflow as tf
      >>> from tensorflow.contrib.ignite import IgniteDataset
      >>>
      >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
      >>> iterator = dataset.make_one_shot_iterator()
      >>> next_obj = iterator.get_next()
      >>>
      >>> with tf.Session() as sess:
      ...     for _ in range(3):
      ...         print(sess.run(next_obj))
      {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
      {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
      {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  31. Distributed Training

      dataset = IgniteDataset("IMAGES")
      gradients = []

      # Compute gradients locally on every worker node.
      for i in range(5):
          with tf.device("/job:WORKER/task:%d" % i):
              device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
              device_next_obj = device_iterator.get_next()
              gradient = compute_gradient(device_next_obj)
              gradients.append(gradient)

      # Aggregate them on the master node.
      result_gradient = tf.reduce_sum(gradients)

      with tf.Session("grpc://localhost:10000") as sess:
          print(sess.run(result_gradient))
  32. Model Inference
      • PMML via JPMML
      • XGBoost model parser
      • Spark model parser
      • MLeap runtime usage
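
For example, a pre-trained XGBoost model can be served through the same reader/builder inference API as the MLeap example on the next slide. A sketch, assuming the XGModelParser class from the ignite-ml-xgboost module; the model path is hypothetical:

      // Sketch: parse a pre-trained XGBoost model and serve it on the cluster
      // (XGModelParser is assumed; the model path is illustrative).
      XGModelParser parser = new XGModelParser();

      ModelReader reader = new FileSystemModelReader("/path/to/xgboost-model.txt");

      // Distributed inference: 8 service instances, at most 2 per node (as in the MLeap slide).
      AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

      Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);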
  33. MLeap

      MLeapModelParser parser = new MLeapModelParser();

      ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

      AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

      Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

      Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
  34. Spark Model Parser

      val passengers = TitanicUtils.readPassengers(spark)

      val assembler = new VectorAssembler()
        .setInputCols(Array("pclass", "sibsp", "parch"))
        .setOutputCol("features")

      val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
        .select("features", "survived")

      val trainer = new GBTClassifier()
        .setMaxIter(10)
        .setLabelCol("survived")
        .setFeaturesCol("features")
        .setMaxDepth(7)

      val model = trainer.fit(output)

      model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")
  35. Spark Model Parser

      dataCache = TitanicUtils.readPassengers(ignite);

      final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

      ModelsComposition mdl = SparkModelParser.parse(
          SPARK_MDL_PATH,
          SupportedSparkModels.GRADIENT_BOOSTED_TREES
      );

      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  36. Roadmap for Ignite 3.0
      • NLP support
      • Spark Pipeline inference support
      • DL4j integration
      • More approximate ML algorithms to speed up training
  37. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
  38. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
  39. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
      • New features and capabilities of distributed ML could be a reason to give Ignite ML a try
  40. Conclusion
      • Apache Spark and Apache Ignite could work together in the ML/DL area
      • Apache Ignite ML is a descendant of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
      • New features and capabilities of distributed ML could be a reason to give Ignite ML a try
      • You can load Spark models into Ignite and update them via the online-learning mechanism
  41. It's very easy to add a new feature
      • Write me: [email protected]
      • Create a ticket here
      • Prepare a PR
      • Assign me as a reviewer