Ensembles of ML algorithms and Distributed Online Machine Learning with Apache Ignite

Apache Ignite currently ships an ML module that includes many distributed ML algorithms, a set of approximate ML algorithms, and easy integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the TF.contrib package). In addition, every algorithm supports model updating, which enables online learning for more than just KMeans and LinReg, unlike Apache Spark.

We suggest using the Apache Ignite ML module to speed up your ML training, and using Ignite as a backend for distributed TensorFlow calculations.

This talk also highlights issues in implementing distributed machine learning algorithms.
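For example, online learning via the model-update API looks roughly like this (a minimal sketch; the KMeansTrainer settings and the cache names firstBatchCache/newBatchCache are placeholders, not from the deck):

    // Sketch: fit on an initial batch, then fold new data into the existing
    // model instead of retraining from scratch.
    KMeansTrainer trainer = new KMeansTrainer().withAmountOfClusters(4);
    KMeansModel mdl = trainer.fit(ignite, firstBatchCache, vectorizer);
    mdl = trainer.update(mdl, ignite, newBatchCache, vectorizer);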

Alexey Zinoviev

October 25, 2019

Transcript

  1. Ensembles of ML algorithms and Distributed Online Machine Learning with Apache Ignite
     Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Committer
  2. Bio
     • Java developer
     • Distributed ML enthusiast
     • Apache Ignite Committer
     • Apache Spark user
     • Happy father and husband
     • https://github.com/zaleslaw
  3. ML Task in math form (by Vorontsov)
     X is a set of objects, Y is a set of answers, and f: X → Y is the target function.
     Given a training sample {x_1, …, x_l} with known answers y_i = f(x_i),
     find a decision function a: X → Y that approximates f.
  4. What can be distributed in a typical ML Pipeline

     Step                   | Apache Spark                | Apache Ignite
     -----------------------|-----------------------------|--------------------------
     Dataset                | distributed                 | distributed
     Preprocessing          | distributed                 | distributed
     Training               | distributed                 | distributed
     Prediction             | distributed                 | distributed
     Evaluation             | distributed                 | distributed (since 2.8)
     Hyper-parameter tuning | parallel                    | parallel (since 2.8)
     Online Learning        | distributed in 3 algorithms | distributed
     Ensembles              | for RF*                     | distributed/parallel
  5. Partition-based dataset

     Diagram: Source Data feeds a durable Upstream Cache and a durable, recoverable
     Context Cache; each partition pairs its on-heap Data (the feature matrix
     double[][] x and labels double[] y) with a stateless on-heap Learning Env.

     Dataset dataset = … // Partition based dataset, internal API

     dataset.compute((env, ctx, data) -> map(...), (r1, r2) -> reduce(...))
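As an illustration of how this compute call can be used (our sketch against the internal API shown above; the getLabels() accessor and the null-safe reduce are assumptions):

     // Sketch: a map-reduce over the partition-based dataset, counting training rows.
     // Lambda shapes follow the compute((env, ctx, data) -> ..., (r1, r2) -> ...)
     // call on the slide; data.getLabels() is a hypothetical accessor.
     long totalRows = dataset.compute(
         (env, ctx, data) -> (long) data.getLabels().length,       // map: local partition size
         (r1, r2) -> r1 == null ? r2 : r2 == null ? r1 : r1 + r2   // reduce: sum partials
     );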
  6. Regression algorithms
     • KNN Regression
     • Linear Regression
     • Decision tree regression
     • Random forest regression
     • Gradient-boosted tree regression
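Each of these follows the same trainer/fit pattern; a minimal sketch for the LSQR-backed linear regression (assuming the Titanic dataCache and vectorizer set up on the next slides):

     // Sketch: distributed linear regression; dataCache and vectorizer as on slides 7-11.
     LinearRegressionLSQRTrainer trainer = new LinearRegressionLSQRTrainer();
     LinearRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);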
  7. Fill the cache
  8. Build Labeled Vectors
  9. Define the trainer
  10. Train the model
  11. Evaluate the model

     Slides 7-11 step through the same snippet:

     IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

     Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

     DecisionTreeClassificationTrainer trainer =
         new DecisionTreeClassificationTrainer(5, 0);

     DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

     double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
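Beyond bulk evaluation, the trained model scores single rows too (a sketch; the feature values are made up):

     // Sketch: single-row inference with the trained decision tree.
     Vector observation = VectorUtils.of(1.0, 25.0, 100.0); // hypothetical feature values
     double prediction = mdl.predict(observation);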
  12. Preprocessing + Training + Evaluation

     Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

     Preprocessor minMaxScalerPr = new MinMaxScalerTrainer()
         .fit(ignite, dataCache, imputingPr);

     Preprocessor normalizationPr = new NormalizationTrainer()
         .fit(ignite, dataCache, minMaxScalerPr);

     DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

     DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);

     double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());
  13. Linear Regression with MR approach
     The Golub-Kahan-Lanczos bidiagonalization procedure is the core of the LSQR
     linear regression trainer (A: feature matrix, u: label vector, v: result).

  14. Linear Regression with MR approach
     Diagram: A, u, and v are split across partitions (Part 1 … Part 4); each
     bidiagonalization step runs as a MapReduce over the partitions.
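For reference, the recurrences LSQR iterates are the standard Golub-Kahan-Lanczos ones (written from the standard algorithm, not taken from the deck):

     \beta_{k+1} u_{k+1} = A v_k - \alpha_k u_k
     \alpha_{k+1} v_{k+1} = A^{\top} u_{k+1} - \beta_{k+1} v_k

where \alpha_k, \beta_k normalize u_k and v_k to unit length. The dataset enters only through the products A v_k and A^{\top} u_{k+1}; with A partitioned by rows, each map computes its partial product and the reduce concatenates (for A v) or sums (for A^{\top} u) the partials, which is exactly the MapReduce split on slide 14.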
  15. SGD

  16. SGD Pseudocode

     def SGD(X, Y, Loss, GradLoss, W0, s):
         W = W0
         lastLoss = Double.Inf
         for i = 0 .. maxIterations:
             W = W - s * GradLoss(W, X, Y)
             currentLoss = Loss(Model(W), X, Y)
             if abs(currentLoss - lastLoss) > eps:
                 lastLoss = currentLoss
             else:
                 break
         return Model(W)
  17. What can be distributed?
     The same pseudocode: the distributable steps are the gradient computation
     GradLoss(W, X, Y) and the loss evaluation over X and Y.
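Why this distributes: for an additive loss, the full gradient is a sum of per-partition gradients, so each node computes a partial on its local data and a reduce step adds them (a standard identity, not from the deck):

     \nabla L(W) = \sum_{p=1}^{P} \sum_{(x_i, y_i) \in \text{partition}_p} \nabla \ell(W; x_i, y_i)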
  18. Pipeline API and ParamGrid

     Pipeline pipeline = new Pipeline()
         .addVectorizer(vectorizer)
         .addPreprocessingTrainer(new ImputerTrainer())
         .addPreprocessingTrainer(new MinMaxScalerTrainer())
         .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

     ParamGrid paramGrid = new ParamGrid()
         .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
         .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
         .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});
  19. Cross-Validation and Hyper-parameter tuning

     CrossValidation<DecisionTreeNode, Integer, Vector> cv = new CrossValidation<>();

     cv.withIgnite(ignite).withUpstreamCache(dataCache).withPipeline(pipeline)
       .withMetric(MetricName.ACCURACY).withAmountOfFolds(3)
       .withParamGrid(paramGrid);

     CrossValidationResult cvRes = cv.tuneHyperParameters();

     System.out.println(cvRes.getBest("maxDeep"));
     System.out.println(cvRes.getBest("minImpurityDecrease"));
  20. Machine Learning Ensemble Model Averaging
     • Ensemble as a mean value of predictions
     • Majority-based ensemble
     • Ensemble as a weighted sum of predictions
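As an illustration of the three strategies (plain Java, not the Ignite API; class and method names are ours):

     // Illustrative: the three averaging strategies over an ensemble's
     // raw double predictions for one input.
     public final class Aggregators {
         /** Ensemble as a mean value of predictions. */
         static double mean(double[] preds) {
             return java.util.Arrays.stream(preds).average().orElse(0.0);
         }

         /** Majority vote over binary 0/1 predictions. */
         static double majority(double[] preds) {
             long ones = java.util.Arrays.stream(preds).filter(p -> p == 1.0).count();
             return 2 * ones >= preds.length ? 1.0 : 0.0;
         }

         /** Ensemble as a weighted sum of predictions. */
         static double weightedSum(double[] preds, double[] weights) {
             double s = 0.0;
             for (int i = 0; i < preds.length; i++)
                 s += weights[i] * preds[i];
             return s;
         }
     }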
  21. Stacking

     DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
     DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);

     LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

     StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
         .addTrainerWithDoubleOutput(trainer)
         .addTrainerWithDoubleOutput(trainer1)
         .fit(ignite, dataCache, normalizationPreprocessor);
  22. Update LogReg model with new data

     LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer()
         .withMaxIterations(100000)
         .withLocIterations(100)
         .withBatchSize(10)
         .withSeed(123L);

     LogisticRegressionModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);

     LogisticRegressionModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  23. TensorFlow on Apache Ignite
     • Ignite Dataset
     • IGFS Plugin
     • Distributed Training
     • More info here
  24. TensorFlow on Apache Ignite

     import tensorflow as tf
     from tensorflow.contrib.ignite import IgniteDataset

     dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
     iterator = dataset.make_one_shot_iterator()
     next_obj = iterator.get_next()

     with tf.Session() as sess:
         for _ in range(3):
             print(sess.run(next_obj))

     >>> {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
     >>> {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
     >>> {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  25. Distributed Training

     dataset = IgniteDataset("IMAGES")
     gradients = []

     # Compute gradients locally on every worker node.
     for i in range(5):
         with tf.device("/job:WORKER/task:%d" % i):
             device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
             device_next_obj = device_iterator.get_next()
             gradient = compute_gradient(device_next_obj)
             gradients.append(gradient)

     # Aggregate them on master node.
     result_gradient = tf.reduce_sum(gradients)

     with tf.Session("grpc://localhost:10000") as sess:
         print(sess.run(result_gradient))
  26. Model Inference
     • PMML via JPMML
     • XGBoost model parser
     • Spark model parser
     • MLeap runtime usage
     • H2O model parser
  27. MLeap

     MLeapModelParser parser = new MLeapModelParser();

     ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

     AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

     Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

     Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
  28. Train GBT model in Spark and export to Ignite

     val passengers = TitanicUtils.readPassengers(spark)

     val assembler = new VectorAssembler()
         .setInputCols(Array("pclass", "sibsp", "parch"))
         .setOutputCol("features")

     val output = assembler
         .transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
         .select("features", "survived")

     val trainer = new GBTClassifier()
         .setMaxIter(10)
         .setLabelCol("survived")
         .setFeaturesCol("features")
         .setMaxDepth(7)

     val model = trainer.fit(output)

     model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")
  29. Load & evaluate the Spark model

     dataCache = TitanicUtils.readPassengers(ignite);

     final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

     ModelsComposition mdl = SparkModelParser.parse(
         SPARK_MDL_PATH,
         SupportedSparkModels.GRADIENT_BOOSTED_TREES
     );

     double prediction = mdl.predict(new LabeledVector<>(...));

     double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  30. Apache Ignite Community
     > 200 contributors in total, 10 ML authors
     Blog posts, Ignite Documentation, ML Documentation
  31. Roadmap for Ignite 3.0
     • NLP (TF-IDF, Word2Vec)
     • More integration with TF, H2O
     • Clustering: LDA, Bisecting K-Means
     • Statistical package
     • … a lot of tasks for beginners :)