Ensembles of ML algorithms and Distributed Online Machine Learning with Apache Ignite

Apache Ignite currently includes an ML module with many distributed ML algorithms, a set of approximate ML algorithms, and easy integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the tf.contrib package). In addition, every algorithm supports model updating, which enables online learning for more than just KMeans and LinReg, unlike Apache Spark.

We suggest using the Apache Ignite ML module to speed up your ML training and using Ignite as a backend for distributed TensorFlow calculations.

This talk also highlights issues in implementing distributed machine learning algorithms.

Alexey Zinoviev

October 25, 2019

Transcript

  1. Ensembles of ML algorithms and
    Distributed Online Machine Learning with
    Apache Ignite
    Alexey Zinoviev, Java/BigData Trainer,
    Apache Ignite Committer

  2. Bio
    ● Java developer
    ● Distributed ML enthusiast
    ● Apache Ignite Committer
    ● Apache Spark user
    ● Happy father and husband
    ● https://github.com/zaleslaw

  3. What is Apache Ignite?

  4. What is Machine Learning?

  6. ML Task in math form (briefly)

  7. ML Task in math form (by Vorontsov)
    X is the set of objects, Y the set of answers, and f: X → Y the unknown target function.
    Given a training sample of objects with known answers y_i = f(x_i),
    find a decision function a: X → Y that approximates f.

  8. Model example [Linear Regression]

  9. Model example [Linear Regression]: Loss Function

  10. Model example [Decision Tree]

  11. Finding the best model

  12. Distributed ML

  13. ML Pipeline
    Raw Data

  14. ML Pipeline
    Raw Data → Preprocessing → Vectors

  15. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model

  16. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model
    → Hyperparameter Tuning

  17. ML Pipeline
    Raw Data → Preprocessing → Vectors → Training → Model
    → Hyperparameter Tuning → Evaluation → Deploy

  18. What can be distributed in a typical ML Pipeline
    Step                   | Apache Spark                | Apache Ignite
    -----------------------+-----------------------------+-------------------------
    Dataset                | distributed                 | distributed
    Preprocessing          | distributed                 | distributed
    Training               | distributed                 | distributed
    Prediction             | distributed                 | distributed
    Evaluation             | distributed                 | distributed (since 2.8)
    Hyper-parameter tuning | parallel                    | parallel (since 2.8)
    Online Learning        | distributed in 3 algorithms | distributed
    Ensembles              | for RF*                     | distributed/parallel

  19. Distributed Data Structures

  20. Partition-based dataset
    Built on top of the source data: per partition, the Upstream Cache (durable) and the
    Dataset Context stored in the Context Cache (durable) back an on-heap Dataset Data
    block (recoverable), plus a stateless on-heap Learning Env.

    Dataset dataset = ... // Partition-based dataset, internal API
    dataset.compute((env, ctx, data) -> map(...), (r1, r2) -> reduce(...));

    // Each partition holds its own feature matrix and label vector:
    double[][] x = ...
    double[] y = ...

  21. ML Algorithms

  22. Classification algorithms
    ● Logistic Regression
    ● SVM
    ● KNN
    ● ANN
    ● Decision trees
    ● Random Forest

  23. Regression algorithms
    ● KNN Regression
    ● Linear Regression
    ● Decision tree regression
    ● Random forest regression
    ● Gradient-boosted tree regression

  24. Multilayer Perceptron Neural Network

  25. Train the model on Ignite data
    Partition-Based Dataset

  26. Fill the cache, build labeled vectors, define the trainer, train and evaluate the model
    // Fill the cache with the Titanic dataset
    IgniteCache dataCache = TitanicUtils.readPassengers(ignite);

    // Build labeled vectors: features from columns 0, 5, 6; the label from column 1
    Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);

    // Define the trainer: a decision tree with max depth 5 and min impurity decrease 0
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);

    // Train the model
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);

    // Evaluate the model
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  31. Preprocessors

  32. Normalize vector v to L2 norm
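    For reference: L2 normalization divides a vector by its Euclidean norm:
      v' = v / ||v||_2,   where ||v||_2 = sqrt(v_1^2 + ... + v_n^2)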

  33. Standard Scaling
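    For reference: standard scaling centers each feature by its mean and divides by its
    standard deviation:
      x'_j = (x_j - mean_j) / std_j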

  34. One-Hot Encoding
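    For example, with three categories {1st, 2nd, 3rd} for a class feature, the value
    2nd is encoded as the vector [0, 1, 0].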

  35. Preprocessing + Training + Evaluation
    Preprocessor imputingPr = new ImputerTrainer().fit(ignite, dataCache, vectorizer);
    Preprocessor minMaxScalerPr = new MinMaxScalerTrainer().fit(ignite, dataCache, imputingPr);
    Preprocessor normalizationPr = new NormalizationTrainer().fit(ignite, dataCache, minMaxScalerPr);

    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);
    double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());

  36. Linear Regression via LSQR

  37. Linear Regression with MR approach
    The Golub-Kahan-Lanczos bidiagonalization procedure is the core of the LSQR
    linear regression trainer.
    A: feature matrix, u: label vector, v: result.

  38. Linear Regression with MR approach
    The feature matrix A and the label vector u are split across partitions (Part 1 .. Part 4);
    each step of the Golub-Kahan-Lanczos bidiagonalization procedure runs as a MapReduce
    over those partitions to produce the result v.

  39. SGD

  40. Linear Regression Model

  41. Target function for Linear Regression

  42. Loss Function
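    In standard form, the linear regression model is y^(x) = w · x + b, and the loss to
    minimize is the mean squared error over the training sample:
      L(w, b) = (1/n) · sum_{i=1..n} (w · x_i + b - y_i)^2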

  43. Distributed Gradient

  44. Distributed Gradient
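    The gradient distributes because the loss is a sum over training examples: each
    partition computes a partial sum locally, and the partials are then reduced:
      grad L(w) = sum over partitions p of ( sum over i in p of grad l_i(w) )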

  45. SGD Pseudocode
    def SGD(X, Y, Loss, GradLoss, W0, s):
        W = W0
        lastLoss = Double.Inf
        for i = 0 .. maxIterations:
            W = W - s * GradLoss(W, X, Y)
            currentLoss = Loss(Model(W), X, Y)
            if abs(currentLoss - lastLoss) > eps:
                lastLoss = currentLoss
            else:
                break
        return Model(W)

  46. What can be distributed?
    In the pseudocode above, the expensive calls are GradLoss(W, X, Y) and
    Loss(Model(W), X, Y). Both are sums over the training data, so both can run as a map
    over partitions followed by a reduce; the rest of the loop is cheap driver-side
    bookkeeping.
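    A minimal sketch of one distributed gradient step on top of the partition-based
    dataset API from slide 20; computeLocalGradient and sum are hypothetical helpers,
    not Ignite API:

    // Map: each partition computes the gradient over its local rows (x, y).
    // Reduce: partial gradients from all partitions are summed.
    Vector grad = dataset.compute(
        (env, ctx, data) -> computeLocalGradient(w, data),  // hypothetical local step
        (g1, g2) -> sum(g1, g2));                           // hypothetical vector sum
    // The driver then applies the update from the pseudocode: w = w - s * grad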

  47. ML Workhorse: SGD
    SGD sits at the core of LogReg, Neural Networks, SVM, and Linear Regression.

  48. Model Evaluation

  49. Model Evaluation with K-fold cross validation
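    The data is split into k folds; the model is trained on k-1 of them, evaluated on the
    held-out fold, and the k metric values are averaged.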

  50. Pipeline API and ParamGrid
    Pipeline pipeline = new Pipeline().addVectorizer(vectorizer)
        .addPreprocessingTrainer(new ImputerTrainer())
        .addPreprocessingTrainer(new MinMaxScalerTrainer())
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0));

    ParamGrid paramGrid = new ParamGrid()
        .withParameterSearchStrategy(new EvolutionOptimizationStrategy())
        .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
        .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});

  51. Genetic Algorithm Flow

  52. Cross-Validation and Hyper-parameter tuning
    CrossValidation cv = new CrossValidation<>();
    cv.withIgnite(ignite).withUpstreamCache(dataCache).withPipeline(pipeline)
        .withMetric(MetricName.ACCURACY).withAmountOfFolds(3)
        .withParamGrid(paramGrid);

    CrossValidationResult cvRes = cv.tuneHyperParameters();
    System.out.println(cvRes.getBest("maxDeep"));
    System.out.println(cvRes.getBest("minImpurityDecrease"));

  53. Ensembles in distributed mode

  54. Machine Learning Ensemble Model Averaging
    ● Ensemble as a mean value of predictions
    ● Majority-based ensemble
    ● Ensemble as a weighted sum of predictions
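    In the weighted-sum case the ensemble prediction is a(x) = sum_i w_i · a_i(x);
    the mean value is the special case w_i = 1/n.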

  55. Bagging
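    Ignite can wrap any trainer into a bagged ensemble. A minimal sketch, assuming the
    TrainerTransformers.makeBagged API from the Ignite ML docs (base trainer, ensemble
    size, subsample ratio, feature vector size, feature subspace dimension, prediction
    aggregator); check the exact signature against your Ignite version:

    DecisionTreeClassificationTrainer tree = new DecisionTreeClassificationTrainer(5, 0);
    // 10 trees, each on a ~60% subsample seeing 3 of 4 features, majority vote
    BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(
        tree, 10, 0.6, 4, 3, new OnMajorityPredictionsAggregator());
    BaggedModel mdl = baggedTrainer.fit(ignite, dataCache, vectorizer);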

  56. Boosting
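    Gradient boosting is available through the boosted-tree trainers. A sketch, assuming
    the GDBBinaryClassifierOnTreesTrainer constructor (gradient step size, iteration
    count, max tree depth, min impurity decrease); verify against your Ignite version:

    // 300 boosting iterations of depth-4 trees with gradient step size 1.0
    GDBTrainer gdbTrainer = new GDBBinaryClassifierOnTreesTrainer(1.0, 300, 4, 0.0);
    ModelsComposition mdl = gdbTrainer.fit(ignite, dataCache, vectorizer);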

  57. Stacking
    Partition-Based Dataset
    Result

  61. Stacking
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);
    LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();

    StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
        .addTrainerWithDoubleOutput(trainer)
        .addTrainerWithDoubleOutput(trainer1)
        .fit(ignite, dataCache, normalizationPreprocessor);

  62. Online Learning

  63. Online Learning
    Partition-Based Dataset

  64. Update LogReg model with new data
    LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer()
        .withMaxIterations(100000)
        .withLocIterations(100)
        .withBatchSize(10)
        .withSeed(123L);

    LogisticRegressionModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
    LogisticRegressionModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
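    Because update() takes an existing model plus a cache of new data, a streaming loop
    is straightforward. A sketch; hasMoreBatches() and nextBatchCache() are hypothetical
    helpers standing in for your ingestion logic:

    LogisticRegressionModel mdl = trainer.fit(ignite, dataCache1, vectorizer);
    while (hasMoreBatches()) {                    // hypothetical stream condition
        IgniteCache newBatch = nextBatchCache();  // hypothetical: newly arrived rows
        mdl = trainer.update(mdl, ignite, newBatch, vectorizer);
    }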

  65. TensorFlow integration

  66. TensorFlow on Apache Ignite
    ● Ignite Dataset
    ● IGFS Plugin
    ● Distributed Training
    ● More info here

  67. TensorFlow on Apache Ignite
    import tensorflow as tf
    from tensorflow.contrib.ignite import IgniteDataset

    dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    iterator = dataset.make_one_shot_iterator()
    next_obj = iterator.get_next()

    with tf.Session() as sess:
        for _ in range(3):
            print(sess.run(next_obj))

    >>> {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    >>> {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    >>> {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

  68. Distributed Training
    dataset = IgniteDataset("IMAGES")

    # Compute gradients locally on every worker node.
    gradients = []
    for i in range(5):
        with tf.device("/job:WORKER/task:%d" % i):
            device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
            device_next_obj = device_iterator.get_next()
            gradient = compute_gradient(device_next_obj)
            gradients.append(gradient)

    # Aggregate them on the master node.
    result_gradient = tf.reduce_sum(gradients)

    with tf.Session("grpc://localhost:10000") as sess:
        print(sess.run(result_gradient))

  69. TF Distributed Training

  70. Model inference

  71. Model Inference
    ● PMML via JPMML
    ● XGBoost model parser
    ● Spark model parser
    ● MLeap runtime usage
    ● H2O model parser

  72. MLeap
    MLeapModelParser parser = new MLeapModelParser();
    ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());

    AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);
    Model mdl = mdlBuilder.build(reader, parser);
    Future prediction = mdl.predict(VectorUtils.of(22.0, 100.0));

  73. Spark ML Model Parser

  74. Train GBT model in Spark and export to Ignite
    val passengers = TitanicUtils.readPassengers(spark)

    val assembler = new VectorAssembler()
        .setInputCols(Array("pclass", "sibsp", "parch"))
        .setOutputCol("features")

    val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
        .select("features", "survived")

    val trainer = new GBTClassifier()
        .setMaxIter(10)
        .setLabelCol("survived")
        .setFeaturesCol("features")
        .setMaxDepth(7)

    val model = trainer.fit(output)
    model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")

  75. Load & evaluate the Spark model
    dataCache = TitanicUtils.readPassengers(ignite);
    final Vectorizer vectorizer = new DummyVectorizer(0, 5, 6).labeled(1);
    ModelsComposition mdl = SparkModelParser.parse(
    SPARK_MDL_PATH,
    SupportedSparkModels.GRADIENT_BOOSTED_TREES
    );
    double prediction = mdl.predict(new LabeledVector<>(...));
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  76. It could be your application ☺

  77. How to contribute?

  78. Apache Ignite Community
    ● > 200 contributors in total
    ● 10 ML authors
    ● Blog posts
    ● Ignite Documentation
    ● ML Documentation

  79. Roadmap for Ignite 3.0
    ● NLP (TF-IDF, Word2Vec)
    ● More integration with TF, H2O
    ● Clustering: LDA, Bisecting K-Means
    ● Statistical package
    ● ... and a lot of tasks for beginners :)

  80. Follow me
    E-mail: [email protected]
    Twitter: @zaleslaw
    GitHub: zaleslaw

  81. DEMO
