Slide 1

Slide 1 text

Ensembles of ML Algorithms and Distributed Online Machine Learning with Apache Ignite
Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Committer

Slide 2

Slide 2 text

Bio
● Java developer
● Distributed ML enthusiast
● Apache Ignite Committer
● Apache Spark user
● Happy father and husband
● https://github.com/zaleslaw

Slide 3

Slide 3 text

What is Apache Ignite?

Slide 4

Slide 4 text

What is Machine Learning?

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

ML Task in math form (in short)

Slide 7

Slide 7 text

ML Task in math form (by Vorontsov): X is the set of objects, Y is the set of answers, and f: X → Y is the unknown target function. Given a training sample of objects with known answers, find a decision function a: X → Y that approximates f.
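
Written out as a formula (the sample notation is my addition; the slide shows it only as an image):

X^{\ell} = \{(x_i, y_i)\}_{i=1}^{\ell}, \quad y_i = f(x_i); \qquad \text{find } a: X \to Y \text{ such that } a \approx f \text{ on all of } X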

Slide 8

Slide 8 text

Model example [Linear Regression]

Slide 9

Slide 9 text

Model example [Linear Regression] Loss Function
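
The formulas on this slide are images; for the record, a linear regression model and the squared-error loss it is trained with are typically written as:

\hat{y}(x) = w^{\top} x + b, \qquad L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( w^{\top} x_i + b - y_i \right)^2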

Slide 10

Slide 10 text

Model example [Decision Tree]

Slide 11

Slide 11 text

Finding the best model

Slide 12

Slide 12 text

Distributed ML

Slide 13

Slide 13 text

ML Pipeline: Raw Data

Slide 14

Slide 14 text

ML Pipeline: Raw Data → Preprocessing → Vectors

Slide 15

Slide 15 text

ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model

Slide 16

Slide 16 text

ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model → Hyper-parameter Tuning

Slide 17

Slide 17 text

ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model → Hyper-parameter Tuning → Evaluation → Deploy

Slide 18

Slide 18 text

What can be distributed in a typical ML Pipeline

Step                    | Apache Spark                | Apache Ignite
Dataset                 | distributed                 | distributed
Preprocessing           | distributed                 | distributed
Training                | distributed                 | distributed
Prediction              | distributed                 | distributed
Evaluation              | distributed                 | distributed (since 2.8)
Hyper-parameter tuning  | parallel                    | parallel (since 2.8)
Online Learning         | distributed in 3 algorithms | distributed
Ensembles               | for RF*                     | distributed/parallel

Slide 19

Slide 19 text

Distributed Data Structures

Slide 20

Slide 20 text

Partition-based dataset

[Diagram: partition-based dataset structures. The durable upstream cache (source data) is split into partitions; for each partition the dataset keeps a context in a durable, recoverable on-heap context cache, and data (e.g. double[][] x, double[] y) in a stateless on-heap learning environment, recoverable from the upstream cache.]

Dataset dataset = … // Partition-based dataset, internal API

dataset.compute((env, ctx, data) -> map(...), (r1, r2) -> reduce(...))

Slide 21

Slide 21 text

ML Algorithms

Slide 22

Slide 22 text

Classification algorithms
● Logistic Regression
● SVM
● KNN
● ANN
● Decision trees
● Random Forest

Slide 23

Slide 23 text

Regression algorithms
● KNN Regression
● Linear Regression
● Decision tree regression
● Random forest regression
● Gradient-boosted tree regression

Slide 24

Slide 24 text

Multilayer Perceptron Neural Network
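
The network is shown as a picture; in formulas (my notation for a two-layer perceptron, purely as an illustration), the model computes:

\hat{y} = \sigma_2\left( W_2 \, \sigma_1\left( W_1 x + b_1 \right) + b_2 \right)

where W_1, W_2 are weight matrices, b_1, b_2 are bias vectors, and σ_1, σ_2 are activation functions.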

Slide 25

Slide 25 text

Train the model on Ignite data: Partition-Based Dataset

Slide 26

Slide 26 text

Fill the cache

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 27

Slide 27 text

Build Labeled Vectors

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 28

Slide 28 text

Define the trainer

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 29

Slide 29 text

Train the model

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 30

Slide 30 text

Evaluate the model

IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 31

Slide 31 text

Preprocessors

Slide 32

Slide 32 text

Normalize vector v to L2 norm
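
The formula itself is an image on the slide; L2 normalization rescales v to unit Euclidean length:

v' = \frac{v}{\lVert v \rVert_2}, \qquad \lVert v \rVert_2 = \sqrt{\sum_i v_i^2}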

Slide 33

Slide 33 text

Standard Scaling
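
The formula on the slide is an image; standard scaling centers each feature and divides by its standard deviation:

z_j = \frac{x_j - \mu_j}{\sigma_j}

where μ_j and σ_j are the mean and standard deviation of feature j over the training set.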

Slide 34

Slide 34 text

One-Hot Encoding
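
A quick illustration (my example, not from the slide): a categorical feature with values {cat, dog, fish} expands into three binary features:

\text{cat} \to (1, 0, 0), \quad \text{dog} \to (0, 1, 0), \quad \text{fish} \to (0, 0, 1)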

Slide 35

Slide 35 text

Preprocessing + Training + Evaluation

Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

Preprocessor minMaxScalerPr = new MinMaxScalerTrainer().fit(ignite, dataCache, imputingPr);

Preprocessor normalizationPr = new NormalizationTrainer().fit(ignite, dataCache, minMaxScalerPr);

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);

double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());

Slide 36

Slide 36 text

Linear Regression via LSQR

Slide 37

Slide 37 text

Linear Regression with MR approach

[Diagram: the Golub-Kahan-Lanczos bidiagonalization procedure, the core of the LSQR linear regression trainer, operating on the feature matrix A, the label vector u, and the result vector v.]

Slide 38

Slide 38 text

Linear Regression with MR approach

[Diagram: the same Golub-Kahan-Lanczos bidiagonalization procedure, with A, u, and v split into partitions 1-4; each step of the bidiagonalization runs as a MapReduce over the partitions.]
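
A sketch of why this map-reduces well (my summary; notation follows the slide): LSQR solves the least-squares problem

\min_{v} \lVert A v - u \rVert_2

and each Golub-Kahan-Lanczos step needs only the products A v and A^{\top} u. Both are sums of per-row-block contributions, so every partition computes its partial product (map) and the partials are summed (reduce).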

Slide 39

Slide 39 text

SGD

Slide 40

Slide 40 text

Linear Regression Model

Slide 41

Slide 41 text

Target function for Linear Regression

Slide 42

Slide 42 text

Loss Function
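
The formulas on these slides are images; for linear regression with squared loss, the objective and its gradient (the quantity the next slides distribute) are:

L(w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^{\top} x_i - y_i \right)^2, \qquad \nabla L(w) = \frac{2}{n} \sum_{i=1}^{n} \left( w^{\top} x_i - y_i \right) x_i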

Slide 43

Slide 43 text

Distributed Gradient

Slide 44

Slide 44 text

Distributed Gradient
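
The idea behind both "Distributed Gradient" slides: the gradient is a sum over training samples, so it decomposes over partitions; each node computes its partial gradient locally (map), and the partials are summed to make the update (reduce):

\nabla L(w) = \sum_{p \,\in\, \text{partitions}} \nabla L_p(w)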

Slide 45

Slide 45 text

SGD Pseudocode

def SGD(X, Y, Loss, GradLoss, W0, s):
    W = W0
    lastLoss = Double.Inf
    for i = 0 .. maxIterations:
        W = W - s * GradLoss(W, X, Y)
        currentLoss = Loss(Model(W), X, Y)
        if abs(currentLoss - lastLoss) > eps:
            lastLoss = currentLoss
        else:
            break
    return Model(W)

Slide 46

Slide 46 text

What can be distributed?

def SGD(X, Y, Loss, GradLoss, W0, s):
    W = W0
    lastLoss = Double.Inf
    for i = 0 .. maxIterations:
        W = W - s * GradLoss(W, X, Y)
        currentLoss = Loss(Model(W), X, Y)
        if abs(currentLoss - lastLoss) > eps:
            lastLoss = currentLoss
        else:
            break
    return Model(W)

Slide 47

Slide 47 text

SGD: the workhorse of ML, powering LogReg, Neural Networks, SVM, and Linear Regression

Slide 48

Slide 48 text

Model Evaluation

Slide 49

Slide 49 text

Model Evaluation with K-fold cross validation
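
Spelled out (the standard definition; the slide shows it as a picture): split the data into K folds, train on K−1 folds, validate on the held-out fold, repeat K times, and average:

\text{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \text{metric}\left( a_{-k},\, F_k \right)

where a_{-k} is the model trained on all folds except F_k.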

Slide 50

Slide 50 text

Pipeline API and ParamGrid

Pipeline pipeline = new Pipeline()
    .addVectorizer(vectorizer)
    .addPreprocessingTrainer(new ImputerTrainer())
    .addPreprocessingTrainer(new MinMaxScalerTrainer())
    .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

ParamGrid paramGrid = new ParamGrid()
    .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
    .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
    .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});

Slide 51

Slide 51 text

Genetic Algorithm Flow

Slide 52

Slide 52 text

Cross-Validation and Hyper-parameter tuning

CrossValidation cv = new CrossValidation<>();

cv.withIgnite(ignite)
  .withUpstreamCache(dataCache)
  .withPipeline(pipeline)
  .withMetric(MetricName.ACCURACY)
  .withAmountOfFolds(3)
  .withParamGrid(paramGrid);

CrossValidationResult cvRes = cv.tuneHyperParameters();

System.out.println(cvRes.getBest("maxDeep"));
System.out.println(cvRes.getBest("minImpurityDecrease"));

Slide 53

Slide 53 text

Ensembles in distributed mode

Slide 54

Slide 54 text

Machine Learning Ensemble Model Averaging
● Ensemble as a mean value of predictions
● Majority-based ensemble
● Ensemble as a weighted sum of predictions
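
In formulas (a sketch; f_1, …, f_k denote the base models):

\text{mean: } \hat{y} = \frac{1}{k} \sum_{i=1}^{k} f_i(x) \qquad \text{majority: } \hat{y} = \operatorname{mode}\{ f_1(x), \dots, f_k(x) \} \qquad \text{weighted: } \hat{y} = \sum_{i=1}^{k} w_i f_i(x), \; \sum_i w_i = 1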

Slide 55

Slide 55 text

Bagging

Slide 56

Slide 56 text

Boosting

Slide 57

Slide 57 text

Stacking: Partition-Based Dataset

Slide 58

Slide 58 text

Stacking: Partition-Based Dataset

Slide 59

Slide 59 text

Stacking: Partition-Based Dataset

Slide 60

Slide 60 text

Stacking: Partition-Based Dataset → Result

Slide 61

Slide 61 text

Stacking

DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);

LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
    .addTrainerWithDoubleOutput(trainer)
    .addTrainerWithDoubleOutput(trainer1)
    .fit(ignite, dataCache, normalizationPreprocessor);

Slide 62

Slide 62 text

Online Learning

Slide 63

Slide 63 text

Online Learning: Partition-Based Dataset

Slide 64

Slide 64 text

Update LogReg model with new data

LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer()
    .withMaxIterations(100000)
    .withLocIterations(100)
    .withBatchSize(10)
    .withSeed(123L);

LogisticRegressionModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);

LogisticRegressionModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);

Slide 65

Slide 65 text

TensorFlow integration

Slide 66

Slide 66 text

TensorFlow on Apache Ignite
● Ignite Dataset
● IGFS Plugin
● Distributed Training
● More info here

Slide 67

Slide 67 text

TensorFlow on Apache Ignite

import tensorflow as tf
from tensorflow.contrib.ignite import IgniteDataset

dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
iterator = dataset.make_one_shot_iterator()
next_obj = iterator.get_next()

with tf.Session() as sess:
    for _ in range(3):
        print(sess.run(next_obj))

>>> {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
>>> {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
>>> {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Slide 68

Slide 68 text

Distributed Training

dataset = IgniteDataset("IMAGES")
gradients = []

# Compute gradients locally on every worker node.
for i in range(5):
    with tf.device("/job:WORKER/task:%d" % i):
        device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
        device_next_obj = device_iterator.get_next()
        gradient = compute_gradient(device_next_obj)
        gradients.append(gradient)

# Aggregate them on master node.
result_gradient = tf.reduce_sum(gradients)

with tf.Session("grpc://localhost:10000") as sess:
    print(sess.run(result_gradient))

Slide 69

Slide 69 text

TF Distributed Training

Slide 70

Slide 70 text

Model inference

Slide 71

Slide 71 text

Model Inference
● PMML via JPMML
● XGBoost model parser
● Spark model parser
● MLeap runtime usage
● H2O model parser

Slide 72

Slide 72 text

MLeap

MLeapModelParser parser = new MLeapModelParser();

ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

Model<Vector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));

Slide 73

Slide 73 text

Spark ML Model Parser

Slide 74

Slide 74 text

Train GBT model in Spark and export to Ignite

val passengers = TitanicUtils.readPassengers(spark)

val assembler = new VectorAssembler()
    .setInputCols(Array("pclass", "sibsp", "parch"))
    .setOutputCol("features")

val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
    .select("features", "survived")

val trainer = new GBTClassifier()
    .setMaxIter(10)
    .setLabelCol("survived")
    .setFeaturesCol("features")
    .setMaxDepth(7)

val model = trainer.fit(output)

model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")

Slide 75

Slide 75 text

Load & evaluate the Spark model

dataCache = TitanicUtils.readPassengers(ignite);

final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

ModelsComposition mdl = SparkModelParser.parse(
    SPARK_MDL_PATH,
    SupportedSparkModels.GRADIENT_BOOSTED_TREES
);

double prediction = mdl.predict(new LabeledVector<>(...));

double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

Slide 76

Slide 76 text

It could be your application ☺

Slide 77

Slide 77 text

How to contribute?

Slide 78

Slide 78 text

Apache Ignite Community
● > 200 contributors in total
● 10 ML authors
● Blog posts
● Ignite Documentation
● ML Documentation

Slide 79

Slide 79 text

Roadmap for Ignite 3.0
● NLP (TF-IDF, Word2Vec)
● More integration with TF, H2O
● Clustering: LDA, Bisecting K-Means
● Statistical package
● … a lot of tasks for beginners :)

Slide 80

Slide 80 text

Follow me
● E-mail: [email protected]
● Twitter: @zaleslaw
● GitHub: zaleslaw

Slide 81

Slide 81 text

DEMO