Ensembles of ML algorithms and Distributed Online Machine Learning with Apache Ignite

Apache Ignite currently includes an ML module with many distributed ML algorithms, a set of approximate ML algorithms, and easy integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the tf.contrib package). In addition, every algorithm supports model updating, which enables online learning for more than just KMeans and LinReg, unlike Apache Spark.

We suggest using the Apache Ignite ML module to speed up your ML training and using Ignite as a backend for distributed TensorFlow calculations.

This talk also highlights issues in implementing distributed machine learning algorithms.

Alexey Zinoviev

October 25, 2019

Transcript

  1. Ensembles of ML algorithms and
    Distributed Online Machine Learning with
    Apache Ignite
    Alexey Zinoviev, Java/BigData Trainer,
    Apache Ignite Committer

  2. Bio
    ● Java developer
    ● Distributed ML enthusiast
    ● Apache Ignite Committer
    ● Apache Spark user
    ● Happy father and husband
    ● https://github.com/zaleslaw

  3. What is Apache Ignite?

  4. What is Machine Learning?

  6. ML Task in math form (briefly)

  7. ML Task in math form (by Vorontsov)
    X is the set of objects, Y the set of answers, and f: X → Y the unknown target function.
    Given a training sample of objects with known answers y_i = f(x_i),
    find a decision function a: X → Y that approximates f.

  8. Model example [Linear Regression]

  9. Model example [Linear Regression]: Loss Function

  10. Model example [Decision Tree]

  11. Finding the best model

  12. Distributed ML

  13. ML Pipeline
    Raw Data

  14. ML Pipeline
    Raw Data → Preprocessing → Vectors

  15. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model

  16. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model
    → Hyperparameter Tuning

  17. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model
    → Hyperparameter Tuning → Evaluation → Deploy

  18. What can be distributed in a typical ML Pipeline
    Step                   | Apache Spark                | Apache Ignite
    -----------------------+-----------------------------+-------------------------
    Dataset                | distributed                 | distributed
    Preprocessing          | distributed                 | distributed
    Training               | distributed                 | distributed
    Prediction             | distributed                 | distributed
    Evaluation             | distributed                 | distributed (since 2.8)
    Hyper-parameter tuning | parallel                    | parallel (since 2.8)
    Online Learning        | distributed in 3 algorithms | distributed
    Ensembles              | for RF*                     | distributed/parallel

  19. Distributed Data Structures

  20. Partition-based dataset
    Built on top of the source data: per partition, the Upstream Cache (durable) and the
    Dataset Context stored in the Context Cache (durable) back an on-heap Dataset Data
    block (recoverable), plus a stateless on-heap Learning Env.

    Dataset dataset = ... // Partition-based dataset, internal API
    dataset.compute((env, ctx, data) -> map(...), (r1, r2) -> reduce(...));

    // Each partition holds its own feature matrix and label vector:
    double[][] x = ...
    double[] y = ...

  21. ML Algorithms

  22. Classification algorithms
    ● Logistic Regression
    ● SVM
    ● KNN
    ● ANN
    ● Decision trees
    ● Random Forest

  23. Regression algorithms
    ● KNN Regression
    ● Linear Regression
    ● Decision tree regression
    ● Random forest regression
    ● Gradient-boosted tree regression

  24. Multilayer Perceptron Neural Network

  25. Train the model on Ignite data
    Partition-Based Dataset

  26. Fill the cache, build labeled vectors, define the trainer, train and evaluate the model
    // Fill the cache with the Titanic dataset
    IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

    // Build labeled vectors: features from columns 0, 5, 6; the label from column 1
    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    // Define the trainer: a decision tree with max depth 5 and min impurity decrease 0
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    // Train the model
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    // Evaluate the model
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  31. Preprocessors

  32. Normalize vector v to L2 norm
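    For reference: L2 normalization divides a vector by its Euclidean norm:
      v' = v / ||v||_2,   where ||v||_2 = sqrt(v_1^2 + ... + v_n^2)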

  33. Standard Scaling
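    For reference: standard scaling centers each feature by its mean and divides by its
    standard deviation:
      x'_j = (x_j - mean_j) / std_j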

  34. One-Hot Encoding
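    For example, with three categories {1st, 2nd, 3rd} for a class feature, the value
    2nd is encoded as the vector [0, 1, 0].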

  35. Preprocessing + Training + Evaluation
    Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);
    Preprocessor minMaxScalerPr = new MinMaxScalerTrainer().fit(ignite, dataCache, imputingPr);
    Preprocessor normalizationPr = new NormalizationTrainer().fit(ignite, dataCache, minMaxScalerPr);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);
    double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());

  36. Linear Regression via LSQR

  37. Linear Regression with MR approach
    The Golub-Kahan-Lanczos bidiagonalization procedure is the core of the LSQR
    linear regression trainer.
    A: feature matrix, u: label vector, v: result.

  38. Linear Regression with MR approach
    The feature matrix A and the label vector u are split across partitions (Part 1 .. Part 4);
    each step of the Golub-Kahan-Lanczos bidiagonalization procedure runs as a MapReduce
    over those partitions to produce the result v.

  39. SGD

  40. Linear Regression Model

  41. Target function for Linear Regression

  42. Loss Function
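    In standard form, the linear regression model is y^(x) = w · x + b, and the loss to
    minimize is the mean squared error over the training sample:
      L(w, b) = (1/n) · sum_{i=1..n} (w · x_i + b - y_i)^2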

  43. Distributed Gradient

  44. Distributed Gradient
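    The gradient distributes because the loss is a sum over training examples: each
    partition computes a partial sum locally, and the partials are then reduced:
      grad L(w) = sum over partitions p of ( sum over i in p of grad l_i(w) )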

  45. SGD Pseudocode
    def SGD(X, Y, Loss, GradLoss, W0, s):
        W = W0
        lastLoss = Double.Inf
        for i = 0 .. maxIterations:
            W = W - s * GradLoss(W, X, Y)
            currentLoss = Loss(Model(W), X, Y)
            if abs(currentLoss - lastLoss) > eps:
                lastLoss = currentLoss
            else:
                break
        return Model(W)

  46. What can be distributed?
    In the pseudocode above, the expensive calls are GradLoss(W, X, Y) and
    Loss(Model(W), X, Y). Both are sums over the training data, so both can run as a map
    over partitions followed by a reduce; the rest of the loop is cheap driver-side
    bookkeeping.
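    A minimal sketch of one distributed gradient step on top of the partition-based
    dataset API from slide 20; computeLocalGradient and sum are hypothetical helpers,
    not Ignite API:

    // Map: each partition computes the gradient over its local rows (x, y).
    // Reduce: partial gradients from all partitions are summed.
    Vector grad = dataset.compute(
        (env, ctx, data) -> computeLocalGradient(w, data),  // hypothetical local step
        (g1, g2) -> sum(g1, g2));                           // hypothetical vector sum
    // The driver then applies the update from the pseudocode: w = w - s * grad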

  47. ML Workhorse: SGD
    SGD sits at the core of LogReg, Neural Networks, SVM, and Linear Regression.

  48. Model Evaluation

  49. Model Evaluation with K-fold cross validation
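    The data is split into k folds; the model is trained on k-1 of them, evaluated on the
    held-out fold, and the k metric values are averaged.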

  50. Pipeline API and ParamGrid
    Pipeline pipeline = new Pipeline().addVectorizer(vectorizer)
        .addPreprocessingTrainer(new ImputerTrainer())
        .addPreprocessingTrainer(new MinMaxScalerTrainer())
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

    ParamGrid paramGrid = new ParamGrid()
        .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
        .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
        .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});

  51. Genetic Algorithm Flow

  52. Cross-Validation and Hyper-parameter tuning
    CrossValidation cv = new CrossValidation<>();
    cv.withIgnite(ignite).withUpstreamCache(dataCache).withPipeline(pipeline)
        .withMetric(MetricName.ACCURACY).withAmountOfFolds(3)
        .withParamGrid(paramGrid);

    CrossValidationResult cvRes = cv.tuneHyperParameters();
    System.out.println(cvRes.getBest("maxDeep"));
    System.out.println(cvRes.getBest("minImpurityDecrease"));

  53. Ensembles in distributed mode

  54. Machine Learning Ensemble Model Averaging
    ● Ensemble as a mean value of predictions
    ● Majority-based ensemble
    ● Ensemble as a weighted sum of predictions
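    In the weighted-sum case the ensemble prediction is a(x) = sum_i w_i · a_i(x);
    the mean value is the special case w_i = 1/n.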

  55. Bagging
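    Ignite can wrap any trainer into a bagged ensemble. A minimal sketch, assuming the
    TrainerTransformers.makeBagged API from the Ignite ML docs (base trainer, ensemble
    size, subsample ratio, feature vector size, feature subspace dimension, prediction
    aggregator); check the exact signature against your Ignite version:

    DecisionTreeClassificationTrainer tree = new DecisionTreeClassificationTrainer(5, 0);
    // 10 trees, each on a ~60% subsample seeing 3 of 4 features, majority vote
    BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(
        tree, 10, 0.6, 4, 3, new OnMajorityPredictionsAggregator());
    BaggedModel mdl = baggedTrainer.fit(ignite, dataCache, vectorizer);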

  56. Boosting
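    Gradient boosting is available through the boosted-tree trainers. A sketch, assuming
    the GDBBinaryClassifierOnTreesTrainer constructor (gradient step size, iteration
    count, max tree depth, min impurity decrease); verify against your Ignite version:

    // 300 boosting iterations of depth-4 trees with gradient step size 1.0
    GDBTrainer gdbTrainer = new GDBBinaryClassifierOnTreesTrainer(1.0, 300, 4, 0.0);
    ModelsComposition mdl = gdbTrainer.fit(ignite, dataCache, vectorizer);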

  57. Stacking
    Partition-Based Dataset
    Result

  61. Stacking
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);
    LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

    StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
        .addTrainerWithDoubleOutput(trainer)
        .addTrainerWithDoubleOutput(trainer1)
        .fit(ignite, dataCache, normalizationPreprocessor);

  62. Online Learning

  63. Online Learning
    Partition-Based Dataset

  64. Update LogReg model with new data
    LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer()
        .withMaxIterations(100000)
        .withLocIterations(100)
        .withBatchSize(10)
        .withSeed(123L);

    LogisticRegressionModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
    LogisticRegressionModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
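    Because update() takes an existing model plus a cache of new data, a streaming loop
    is straightforward. A sketch; hasMoreBatches() and nextBatchCache() are hypothetical
    helpers standing in for your ingestion logic:

    LogisticRegressionModel mdl = trainer.fit(ignite, dataCache1, vectorizer);
    while (hasMoreBatches()) {                    // hypothetical stream condition
        IgniteCache newBatch = nextBatchCache();  // hypothetical: newly arrived rows
        mdl = trainer.update(mdl, ignite, newBatch, vectorizer);
    }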

  65. TensorFlow integration

  66. TensorFlow on Apache Ignite
    ● Ignite Dataset
    ● IGFS Plugin
    ● Distributed Training
    ● More info here

  67. TensorFlow on Apache Ignite
    import tensorflow as tf
    from tensorflow.contrib.ignite import IgniteDataset

    dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    iterator = dataset.make_one_shot_iterator()
    next_obj = iterator.get_next()

    with tf.Session() as sess:
        for _ in range(3):
            print(sess.run(next_obj))

    >>> {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    >>> {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    >>> {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

  68. Distributed Training
    dataset = IgniteDataset("IMAGES")

    # Compute gradients locally on every worker node.
    gradients = []
    for i in range(5):
        with tf.device("/job:WORKER/task:%d" % i):
            device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
            device_next_obj = device_iterator.get_next()
            gradient = compute_gradient(device_next_obj)
            gradients.append(gradient)

    # Aggregate them on the master node.
    result_gradient = tf.reduce_sum(gradients)

    with tf.Session("grpc://localhost:10000") as sess:
        print(sess.run(result_gradient))

  69. TF Distributed Training

  70. Model inference

  71. Model Inference
    ● PMML via JPMML
    ● XGBoost model parser
    ● Spark model parser
    ● MLeap runtime usage
    ● H2O model parser

  72. MLeap
    MLeapModelParser parser = new MLeapModelParser();
    ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

    AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);
    Model mdl = mdlBuilder.build(reader, parser);
    Future prediction = mdl.predict(VectorUtils.of(22.0, 100.0));

  73. Spark ML Model Parser

  74. Train GBT model in Spark and export to Ignite
    val passengers = TitanicUtils.readPassengers(spark)

    val assembler = new VectorAssembler()
        .setInputCols(Array("pclass", "sibsp", "parch"))
        .setOutputCol("features")

    val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
        .select("features", "survived")

    val trainer = new GBTClassifier()
        .setMaxIter(10)
        .setLabelCol("survived")
        .setFeaturesCol("features")
        .setMaxDepth(7)

    val model = trainer.fit(output)
    model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")

  75. Load & evaluate the Spark model
    dataCache = TitanicUtils.readPassengers(ignite);
    final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
    ModelsComposition mdl = SparkModelParser.parse(
    SPARK_MDL_PATH,
    SupportedSparkModels.GRADIENT_BOOSTED_TREES
    );
    double prediction = mdl.predict(new LabeledVector<>(...));
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  76. It could be your application ☺

  77. How to contribute?

  78. Apache Ignite Community
    ● > 200 contributors in total
    ● 10 ML authors
    ● Blog posts
    ● Ignite Documentation
    ● ML Documentation

  79. Roadmap for Ignite 3.0
    ● NLP (TF-IDF, Word2Vec)
    ● More integration with TF, H2O
    ● Clustering: LDA, Bisecting K-Means
    ● Statistical package
    ● ... and a lot of tasks for beginners :)

  80. Follow me
    E-mail: [email protected]
    Twitter: @zaleslaw
    GitHub: zaleslaw

  81. DEMO
