Ensembles of ML algorithms and Distributed Online Machine Learning with Apache Ignite

Currently, the Apache Ignite ML module includes many distributed ML algorithms, a set of approximate ML algorithms, and easy integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the tf.contrib package). In addition, every algorithm supports model updating, which enables online learning for all of them, not only for KMeans and LinReg as in Apache Spark.
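
As a taste of that updating API, here is a minimal sketch (the trainer choice and cache names are illustrative; the update() pattern follows the example shown on slide 64 of the deck):

    // Illustrative sketch: an Ignite ML trainer can refine an existing model
    // on new data via update(), instead of retraining from scratch.
    KMeansTrainer trainer = new KMeansTrainer().withAmountOfClusters(2);

    // Initial fit on the first batch of data.
    KMeansModel mdl = trainer.fit(ignite, firstBatchCache, vectorizer);

    // Online update on a batch that arrived later.
    mdl = trainer.update(mdl, ignite, nextBatchCache, vectorizer);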

We suggest using the Apache Ignite ML module to speed up your ML training, and using Ignite as a backend for distributed TensorFlow computations.

The talk also highlights the challenges of implementing distributed machine learning algorithms.

Alexey Zinoviev

October 25, 2019

Transcript

  1. Ensembles of ML algorithms and Distributed Online Machine Learning with

    Apache Ignite Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Committer
  2. Bio

    • Java developer
    • Distributed ML enthusiast
    • Apache Ignite Committer
    • Apache Spark user
    • Happy father and husband
    • https://github.com/zaleslaw
  3. What is Apache Ignite?

  4. What is Machine Learning?

  5. None
  6. ML Task in math form (briefly)

  7. ML Task in math form (by Vorontsov)

    X is the set of objects, Y is the set of answers, and f: X → Y is the
    unknown target function. Given a training sample of objects with known
    answers, find a decision function approximating f.
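
    The formula itself is an image on the slide; a standard statement of this
    setup, assuming Vorontsov's usual notation, is

        f : X \to Y \quad \text{(unknown target function)}

        X^{\ell} = \{(x_i, y_i)\}_{i=1}^{\ell}, \quad y_i = f(x_i) \quad \text{(training sample)}

        \text{find a decision function } a : X \to Y \text{ approximating } f \text{ on all of } X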
  8. Model example [Linear Regression]

  9. Model example [Linear Regression] Loss Function

  10. Model example [Decision Tree]

  11. Finding the best model

  12. Distributed ML

  13. ML Pipeline: Raw Data
  14. ML Pipeline: Raw Data → Preprocessing → Vectors
  15. ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model
  16. ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model → Hyper-parameter Tuning
  17. ML Pipeline: Raw Data → Preprocessing → Vectors → Training → Model → Hyper-parameter Tuning → Evaluation → Deploy
  18. What can be distributed in typical ML Pipeline

    Step                   | Apache Spark                | Apache Ignite
    -----------------------|-----------------------------|--------------------------
    Dataset                | distributed                 | distributed
    Preprocessing          | distributed                 | distributed
    Training               | distributed                 | distributed
    Prediction             | distributed                 | distributed
    Evaluation             | distributed                 | distributed (since 2.8)
    Hyper-parameter tuning | parallel                    | parallel (since 2.8)
    Online Learning        | distributed in 3 algorithms | distributed
    Ensembles              | for RF*                     | distributed/parallel
  19. Distributed Data Structures

  20. Partition-based dataset

    [Diagram: the partition-based dataset structures built on top of the
    source data. Each partition of the upstream cache (durable) is paired
    with a Context (durable, kept in the context cache, recoverable) and
    stateless on-heap Data such as double[][] x, double[] y, plus an on-heap
    learning environment.]

    Dataset dataset = … // Partition-based dataset, internal API
    dataset.compute((env, ctx, data) -> map(...), (r1, r2) -> reduce(...))
  21. ML Algorithms

  22. Classification algorithms

    • Logistic Regression
    • SVM
    • KNN
    • ANN
    • Decision trees
    • Random Forest
  23. Regression algorithms

    • KNN Regression
    • Linear Regression
    • Decision tree regression
    • Random forest regression
    • Gradient-boosted tree regression
  24. Multilayer Perceptron Neural Network

  25. Train the model on Ignite data Partition-Based Dataset

  26. Fill the cache

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  27. Build Labeled Vectors

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  28. Define the trainer

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  29. Train the model

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  30. Evaluate the model

    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  31. Preprocessors

  32. Normalize vector v to L2 norm
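
    The formula here is an image on the slide; the L2 normalization it names is

        v' = \frac{v}{\lVert v \rVert_2}, \qquad \lVert v \rVert_2 = \sqrt{\sum_{i=1}^{n} v_i^2}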

  33. Standard Scaling
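
    Likewise shown as an image; standard scaling rescales each feature to zero
    mean and unit variance:

        x'_j = \frac{x_j - \mu_j}{\sigma_j}

    where \mu_j and \sigma_j are the mean and standard deviation of feature j
    over the dataset.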

  34. One-Hot Encoding
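
    Also an image on the slide; one-hot encoding replaces a categorical value
    with an indicator vector, e.g. for the category set {red, green, blue}:

        \text{red} \mapsto (1, 0, 0), \quad \text{green} \mapsto (0, 1, 0), \quad \text{blue} \mapsto (0, 0, 1)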

  35. Preprocessing + Training + Evaluation

    Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);

    Preprocessor minMaxScalerPr = new MinMaxScalerTrainer().fit(ignite, dataCache, imputingPr);

    Preprocessor normalizationPr = new NormalizationTrainer().fit(ignite, dataCache, minMaxScalerPr);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);

    double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());
  36. Linear Regression via LSQR

  37. Linear Regression with MR approach

    The Golub-Kahan-Lanczos bidiagonalization procedure is the core of the
    LSQR linear regression trainer (A: feature matrix, u: label vector,
    v: result).
  38. Linear Regression with MR approach

    The same procedure with the feature matrix A split into partitions
    (Part 1 .. Part 4): each iteration of the Golub-Kahan-Lanczos
    bidiagonalization runs as a MapReduce over the partitions
    (A: feature matrix, u: label vector, v: result).
  39. SGD

  40. Linear Regression Model

  41. Target function for Linear Regression

  42. Loss Function
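
    Slides 40-42 show these formulas as images; a standard reconstruction of
    the linear regression model and its MSE loss is

        a(x) = \langle w, x \rangle + w_0

        L(w) = \frac{1}{\ell} \sum_{i=1}^{\ell} \big( a(x_i) - y_i \big)^2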

  43. Distributed Gradient

  44. Distributed Gradient
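
    Both distributed-gradient slides are diagrams; the property they rely on
    is that a sum-decomposable loss has a sum-decomposable gradient, so every
    partition computes its local sum independently (map) and the partial sums
    are added together (reduce):

        \nabla L(w) = \sum_{p \in \text{partitions}} \sum_{i \in p} \nabla \ell(w; x_i, y_i)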

  45. SGD Pseudocode

    def SGD(X, Y, Loss, GradLoss, W0, s):
        W = W0
        lastLoss = Double.Inf
        for i = 0 .. maxIterations:
            W = W - s * GradLoss(W, X, Y)
            currentLoss = Loss(Model(W), X, Y)
            if abs(currentLoss - lastLoss) > eps:
                lastLoss = currentLoss
            else:
                break
        return Model(W)
  46. What can be distributed?

    def SGD(X, Y, Loss, GradLoss, W0, s):
        W = W0
        lastLoss = Double.Inf
        for i = 0 .. maxIterations:
            W = W - s * GradLoss(W, X, Y)
            currentLoss = Loss(Model(W), X, Y)
            if abs(currentLoss - lastLoss) > eps:
                lastLoss = currentLoss
            else:
                break
        return Model(W)
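
    In this loop, the GradLoss(W, X, Y) and Loss(Model(W), X, Y) calls are the
    distributable parts: per-partition sums are computed locally and then
    combined, matching the distributed-gradient scheme on slides 43-44, while
    the weight update itself stays on the master.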
  47. ML Workhorse: SGD powers LogReg, Neural Networks, SVM, and Linear Regression

  48. Model Evaluation

  49. Model Evaluation with K-fold cross validation
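
    The slide is a diagram; the k-fold estimate it illustrates trains k
    models, each with one fold F_i held out for validation, and averages the
    metric:

        \mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} Q(a_i, F_i)

    where a_i is the model trained on all folds except F_i.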

  50. Pipeline API and ParamGrid

    Pipeline pipeline = new Pipeline()
        .addVectorizer(vectorizer)
        .addPreprocessingTrainer(new ImputerTrainer())
        .addPreprocessingTrainer(new MinMaxScalerTrainer())
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

    ParamGrid paramGrid = new ParamGrid()
        .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
        .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
        .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});
  51. Genetic Algorithm Flow

  52. Cross-Validation and Hyper-parameter tuning

    CrossValidation<DecisionTreeNode, Integer, Vector> cv = new CrossValidation<>();

    cv.withIgnite(ignite).withUpstreamCache(dataCache).withPipeline(pipeline)
      .withMetric(MetricName.ACCURACY).withAmountOfFolds(3)
      .withParamGrid(paramGrid);

    CrossValidationResult cvRes = cv.tuneHyperParameters();

    System.out.println(cvRes.getBest("maxDeep"));
    System.out.println(cvRes.getBest("minImpurityDecrease"));
  53. Ensembles in distributed mode

  54. Machine Learning Ensemble Model Averaging

    • Ensemble as a mean value of predictions
    • Majority-based ensemble
    • Ensemble as a weighted sum of predictions
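
    A minimal plain-Java sketch of these three strategies (the class and
    method names are illustrative, not part of the Ignite API):

    import java.util.Arrays;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class EnsembleAveraging {

        // Ensemble as a mean value of predictions (regression-style averaging).
        static double mean(double[] predictions) {
            return Arrays.stream(predictions).average().orElse(Double.NaN);
        }

        // Majority-based ensemble: the most frequent predicted label wins.
        static double majority(double[] labels) {
            return Arrays.stream(labels).boxed()
                .collect(Collectors.groupingBy(l -> l, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(Double.NaN);
        }

        // Ensemble as a weighted sum of predictions.
        static double weightedSum(double[] predictions, double[] weights) {
            double sum = 0.0;
            for (int i = 0; i < predictions.length; i++)
                sum += weights[i] * predictions[i];
            return sum;
        }
    }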
  55. Bagging

  56. Boosting

  57. Stacking Partition-Based Dataset
  58. Stacking Partition-Based Dataset
  59. Stacking Partition-Based Dataset
  60. Stacking Partition-Based Dataset Result

  61. Stacking

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);

    LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

    StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
        .addTrainerWithDoubleOutput(trainer)
        .addTrainerWithDoubleOutput(trainer1)
        .fit(ignite, dataCache, normalizationPreprocessor);
  62. Online Learning

  63. Online Learning Partition-Based Dataset

  64. Update LogReg model with new data

    LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer()
        .withMaxIterations(100000)
        .withLocIterations(100)
        .withBatchSize(10)
        .withSeed(123L);

    LogisticRegressionModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);

    LogisticRegressionModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  65. TensorFlow integration

  66. TensorFlow on Apache Ignite

    • Ignite Dataset
    • IGFS Plugin
    • Distributed Training
    • More info here
  67. TensorFlow on Apache Ignite

    import tensorflow as tf
    from tensorflow.contrib.ignite import IgniteDataset

    dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    iterator = dataset.make_one_shot_iterator()
    next_obj = iterator.get_next()

    with tf.Session() as sess:
        for _ in range(3):
            print(sess.run(next_obj))

    >>> {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    >>> {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    >>> {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  68. Distributed Training

    dataset = IgniteDataset("IMAGES")
    gradients = []

    # Compute gradients locally on every worker node.
    for i in range(5):
        with tf.device("/job:WORKER/task:%d" % i):
            device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
            device_next_obj = device_iterator.get_next()
            gradient = compute_gradient(device_next_obj)
            gradients.append(gradient)

    # Aggregate them on the master node.
    result_gradient = tf.reduce_sum(gradients)

    with tf.Session("grpc://localhost:10000") as sess:
        print(sess.run(result_gradient))
  69. TF Distributed Training

  70. Model inference

  71. Model Inference

    • PMML via JPMML
    • XGBoost model parser
    • Spark model parser
    • MLeap runtime usage
    • H2O model parser
  72. MLeap

    MLeapModelParser parser = new MLeapModelParser();

    ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

    AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);

    Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);

    Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));
  73. Spark ML Model Parser

  74. Train GBT model in Spark and export to Ignite

    val passengers = TitanicUtils.readPassengers(spark)

    val assembler = new VectorAssembler()
      .setInputCols(Array("pclass", "sibsp", "parch"))
      .setOutputCol("features")

    val output = assembler
      .transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
      .select("features", "survived")

    val trainer = new GBTClassifier()
      .setMaxIter(10)
      .setLabelCol("survived")
      .setFeaturesCol("features")
      .setMaxDepth(7)

    val model = trainer.fit(output)

    model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")
  75. Load & evaluate the Spark model

    dataCache = TitanicUtils.readPassengers(ignite);

    final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    ModelsComposition mdl = SparkModelParser.parse(
        SPARK_MDL_PATH,
        SupportedSparkModels.GRADIENT_BOOSTED_TREES
    );

    double prediction = mdl.predict(new LabeledVector<>(...));

    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  76. It could be your application☺

  77. How to contribute?

  78. Apache Ignite Community

    • > 200 contributors in total
    • 10 ML authors
    • Blog posts
    • Ignite Documentation
    • ML Documentation
  79. Roadmap for Ignite 3.0

    • NLP (TF-IDF, Word2Vec)
    • More integration with TF, H2O
    • Clustering: LDA, Bisecting K-Means
    • Statistical package
    • … a lot of tasks for beginners :)
  80. Follow me: e-mail zaleslaw.sin@gmail.com, Twitter @zaleslaw, GitHub zaleslaw

  81. DEMO