Distributed ML/DL with Ignite ML module using Spark as a data source

The current implementation of ML algorithms in Spark has several disadvantages: the transition from standard Spark SQL types to ML-specific types, algorithms that are poorly adapted to distributed computing, and the relatively slow pace at which new algorithms are added to the library.

Also, Spark ML does not natively support online learning for all algorithms, nor does it support stacking, boosting, or the range of approximate ML algorithms that give a significant speedup in many cases. Apache Ignite can work closely with Apache Spark thanks to the excellent Ignite RDD/Ignite DataFrame implementation (see
https://ignite.apache.org/use-cases/spark/shared-memory-layer.html).

Apache Ignite also has the Ignite ML module, which includes many distributed ML algorithms, an NLP package (to be available in the next release, 2.8), a set of approximate ML algorithms, and simple integration with TensorFlow via the TensorFlow Ignite Dataset (currently part of the TF.contrib package). In addition, every algorithm supports model updating, which makes online learning possible for more than just KMeans and LinReg.

We suggest using the Apache Ignite ML module to speed up your ML training and using Spark + Ignite as a backend for distributed TensorFlow calculations. You will see live demos of ML pipeline building with the Apache Ignite ML module, Apache Spark, Apache Kafka, TensorFlow, and more.

Alexey Zinoviev

April 23, 2019

Transcript

  1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

  2. Zinovyev Alexey, Apache Ignite
    Distributed ML/DL with Ignite ML module using Spark as a data source
    #UnifiedAnalytics #SparkAISummit

  3. Bio
    • Java developer
    • Distributed ML enthusiast
    • Apache Spark user
    • Apache Ignite Committer
    • Happy father and husband

  4. ML/DL Most Popular Frameworks

  5. Training on PBs with scikit-learn

  6. Spark ML as an answer
    • It supports classic ML algorithms
    • Algorithms are distributed by nature
    • Wide support of different data sources and sinks
    • Easy building of Pipelines
    • Model evaluation and hyper-parameter tuning support

  10. What is bad with Spark ML?
    • It doesn’t support model ensembles such as stacking, boosting, and bagging
    • It doesn’t support online learning for all algorithms
    • A lot of data transformation overhead on the way from data source types to ML types
    • Integration with TensorFlow/Caffe is hard

  14. What is bad with Spark ML?
    • Some algorithms use sparse matrices
    • Several unfinished approaches to model inference/model serving
    • It doesn’t support AutoML algorithms
    • It doesn’t support ML operators in Spark SQL
    • ML algorithms internally use MLlib on RDDs

  15. The main problem with Spark ML
    You grow old before your PR is merged

  16. What is Apache Ignite?

  17. Make distributed learning with Ignite

  18. Spark Cluster as data-source

  19-21. Via .write.format('ignite')
    bin/pyspark --jars $IGNITE_HOME/libs/ignite-spring/*.jar,
    $IGNITE_HOME/libs/optional/ignite-spark/ignite-*.jar,
    $IGNITE_HOME/libs/*.jar,
    $IGNITE_HOME/libs/ignite-indexing/*.jar

    // Read the source data with Spark, then save the cleaned rows as an Ignite table.
    Dataset<Row> passengers = spark.read().format("json").load("filename.json");
    passengers.filter("age is not null")
        .drop("fare")
        .write().format("ignite")
        .option("config", "default-config.xml")
        .option("table", "passengers")
        .mode("overwrite")
        .save();

  22. Implement CacheStore interface
    public class SparkCacheStore implements CacheStore, Serializable {
        private SparkSession spark;
        private Dataset ds;
        private static IgniteBiInClosure staticClo;

        // Instance initializer: open a Spark session and load the source file.
        {
            spark = SparkSession ....getOrCreate();
            ds = spark.read()....csv("data-file");
            // Add a unique, monotonically increasing id column named "index".
            ds = ds.withColumn("index", functions.monotonically_increasing_id());
        }
        // The rest of the class is not shown on the slide.

  23. Partitioned-Based Dataset

  24. Algorithms: Classification
    • Logistic Regression
    • SVM
    • KNN
    • ANN
    • Decision trees
    • Random Forest

  25. Algorithms: Regression
    • KNN Regression
    • Linear Regression
    • Decision tree regression
    • Random forest regression
    • Gradient-boosted tree regression

  26. Multilayer Perceptron Neural Network

  27. Build the model
    Partitioned-Based Dataset

  28-32. Fill the cache, build labeled vectors, define the trainer, train the model, evaluate the model
    // Fill the cache
    IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
    // Build labeled vectors: columns 0, 5 and 6 are the features, column 1 is the label
    Vectorizer<Integer, Vector, Integer, Double> vectorizer = new DummyVectorizer<Integer>(0, 5, 6).labeled(1);
    // Define the trainer (arguments: max depth, min impurity decrease)
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    // Train the model
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
    // Evaluate the model
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  33. Preprocessors: Normalization
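The slide itself carries only a diagram. As a rough illustration, and not Ignite ML's actual implementation, p-norm normalization of a single feature vector can be sketched in plain Java as follows. The class `NormalizationSketch` and its `normalize` method are hypothetical names introduced here; the parameter `p` is assumed to play the same role as the `withP` setting used in the preprocessing slide of this deck.

```java
/** Hypothetical helper, not part of Ignite ML: divides a vector by its p-norm. */
final class NormalizationSketch {
    static double[] normalize(double[] features, double p) {
        // p-norm: (sum of |x_i|^p)^(1/p)
        double sum = 0.0;
        for (double v : features)
            sum += Math.pow(Math.abs(v), p);
        double norm = Math.pow(sum, 1.0 / p);

        double[] res = new double[features.length];
        for (int i = 0; i < features.length; i++)
            res[i] = norm == 0.0 ? 0.0 : features[i] / norm; // an all-zero vector stays all-zero
        return res;
    }

    public static void main(String[] args) {
        // With p = 1 the absolute values of the result sum to 1.
        System.out.println(java.util.Arrays.toString(normalize(new double[] {1.0, 3.0}, 1.0)));
    }
}
```

With `p = 1` every row is rescaled so that its absolute values sum to one; with `p = 2` it is rescaled to unit Euclidean length.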

  34. Preprocessors: Scaling
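Min-max scaling maps each feature column onto the [0, 1] range using that column's minimum and maximum. The sketch below shows the idea on a small in-memory array; it is a single-node illustration under that assumption, not the distributed `MinMaxScalerTrainer` from Ignite ML, and the class name `MinMaxScalingSketch` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Ignite ML: column-wise min-max scaling. */
final class MinMaxScalingSketch {
    static double[][] scale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: collect the minimum and maximum of every column.
        for (double[] row : rows)
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }

        // Second pass: rescale every value to (x - min) / (max - min).
        double[][] res = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++)
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                res[i][j] = range == 0.0 ? 0.0 : (rows[i][j] - min[j]) / range; // constant column maps to 0
            }
        return res;
    }

    public static void main(String[] args) {
        double[][] scaled = scale(new double[][] {{1.0, 10.0}, {3.0, 20.0}, {5.0, 30.0}});
        System.out.println(Arrays.deepToString(scaled));
    }
}
```

In a distributed setting the first pass would be computed per partition and then reduced, which is why the statistics are gathered separately from the rescaling step.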

  35. Preprocessors: One-Hot Encoding
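One-hot encoding replaces a categorical value with a vector that has a single 1 in the position assigned to that category. A minimal sketch, assuming categories are indexed in order of first appearance; the class `OneHotSketch` and the sample values are illustrative only and are not taken from Ignite ML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Ignite ML: one-hot encodes a categorical column. */
final class OneHotSketch {
    static double[][] encode(String[] values) {
        // Assign every distinct category an index in order of first appearance.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String v : values)
            index.putIfAbsent(v, index.size());

        // Each row gets a 1.0 in the column of its category and 0.0 elsewhere.
        double[][] res = new double[values.length][index.size()];
        for (int i = 0; i < values.length; i++)
            res[i][index.get(values[i])] = 1.0;
        return res;
    }

    public static void main(String[] args) {
        double[][] encoded = encode(new String[] {"S", "C", "S", "Q"});
        System.out.println(java.util.Arrays.deepToString(encoded));
    }
}
```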

  36. Preprocessing
    // Each preprocessor is fitted on top of the previous one, forming a chain.
    Preprocessor<Integer, Vector> imputingPr = new ImputerTrainer<Integer, Vector>()
        .fit(ignite, dataCache, vectorizer);
    Preprocessor<Integer, Vector> minMaxScalerPr = new MinMaxScalerTrainer<Integer, Vector>()
        .fit(ignite, dataCache, imputingPr);
    Preprocessor<Integer, Vector> normalizationPr = new NormalizationTrainer<Integer, Vector>()
        .withP(1)
        .fit(ignite, dataCache, minMaxScalerPr);
    // The trainer and the evaluator consume the last preprocessor in the chain.
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeNode mdl = trainer.fit(ignite, dataCache, normalizationPr);
    double accuracy = Evaluator.evaluate(dataCache, mdl, normalizationPr, new Accuracy<>());

  37. Model Evaluation with K-fold CV
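In K-fold cross-validation the data is split into k folds; each fold is held out once for validation while the model is trained on the remaining k - 1 folds, and the k scores are then compared or averaged. The sketch below shows only the splitting and scoring loop. `KFoldSketch`, `folds`, and `crossValidate` are hypothetical names, and the round-robin fold assignment is an assumption made for simplicity rather than a description of how Ignite ML partitions data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper, not part of Ignite ML: K-fold splitting and scoring loop. */
final class KFoldSketch {
    /** Assigns row i to fold (i % k) and returns the row indices of every fold. */
    static List<List<Integer>> folds(int rowCount, int k) {
        List<List<Integer>> res = new ArrayList<>();
        for (int f = 0; f < k; f++)
            res.add(new ArrayList<>());
        for (int i = 0; i < rowCount; i++)
            res.get(i % k).add(i);
        return res;
    }

    /** Calls trainAndScore(trainRows, validationRows) once per fold and collects the scores. */
    static double[] crossValidate(int rowCount, int k,
        ToDoubleBiFunction<List<Integer>, List<Integer>> trainAndScore) {
        List<List<Integer>> folds = folds(rowCount, k);
        double[] scores = new double[k];
        for (int held = 0; held < k; held++) {
            // Training rows are all folds except the held-out one.
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < k; f++)
                if (f != held)
                    train.addAll(folds.get(f));
            scores[held] = trainAndScore.applyAsDouble(train, folds.get(held));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Dummy scorer that only reports the training-set size of each fold.
        double[] scores = crossValidate(6, 3, (train, test) -> (double) train.size());
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

The caller supplies the actual training and metric computation through `trainAndScore`, so the same loop works for any trainer and any metric.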

  38. Pipeline

  39. Pipeline
    Pipeline pipeline = new Pipeline().addVectorizer(vectorizer)
        .addPreprocessingTrainer(new ImputerTrainer())
        .addPreprocessingTrainer(new MinMaxScalerTrainer())
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0));
    CrossValidation scoreCalculator = new CrossValidation();
    // Hyper-parameter grid to search during cross-validation.
    ParamGrid paramGrid = new ParamGrid()
        .addHyperParam("maxDeep", new Double[]{1.0, 2.0, 3.0, 4.0, 5.0, 10.0})
        .addHyperParam("minImpurityDecrease", new Double[]{0.0, 0.25, 0.5});

  40. Pipeline
    BinaryClassificationMetrics metrics = new BinaryClassificationMetrics()
        .withNegativeClsLb(0.0)
        .withPositiveClsLb(1.0)
        .withMetric(BinaryClassificationMetricValues::accuracy);
    // Score the pipeline with 3-fold cross-validation over the parameter grid.
    CrossValidationResult crossValidationRes = scoreCalculator.score(
        pipeline, metrics, ignite, dataCache, 3, paramGrid);
    crossValidationRes.getScoringBoard().forEach((hyperParams, score)
        -> System.out.println("Score " + Arrays.toString(score) + " for params " + hyperParams));

  41. ML Ensemble Model Averaging
    • Ensemble as a mean value of predictions
    • Majority-based ensemble
    • Ensemble as a weighted sum of predictions
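The three aggregation strategies listed on this slide can be sketched in plain Java as below. This is an illustration of the general idea, not the Ignite ML ensemble API; `EnsembleSketch` and its method names are hypothetical, and ties in the majority vote are resolved in favor of the label that reaches the top count first, which is an arbitrary choice made here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Ignite ML: combines the predictions of several models. */
final class EnsembleSketch {
    /** Ensemble as a mean value of predictions. */
    static double mean(double[] predictions) {
        double sum = 0.0;
        for (double p : predictions)
            sum += p;
        return sum / predictions.length;
    }

    /** Majority-based ensemble: returns the most frequently predicted label. */
    static double majority(double[] predictions) {
        Map<Double, Integer> votes = new HashMap<>();
        double best = Double.NaN;
        int bestCnt = 0;
        for (double p : predictions) {
            int cnt = votes.merge(p, 1, Integer::sum);
            if (cnt > bestCnt) {
                bestCnt = cnt;
                best = p;
            }
        }
        return best;
    }

    /** Ensemble as a weighted sum of predictions; weights[i] belongs to predictions[i]. */
    static double weightedSum(double[] predictions, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < predictions.length; i++)
            sum += weights[i] * predictions[i];
        return sum;
    }

    public static void main(String[] args) {
        double[] predictions = {1.0, 0.0, 1.0};
        System.out.println(mean(predictions));
        System.out.println(majority(predictions));
        System.out.println(weightedSum(predictions, new double[] {0.5, 0.3, 0.2}));
    }
}
```

The mean and the weighted sum suit regression or probability outputs, while the majority vote suits class labels.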

  42. Stacking
    Partitioned-Based Dataset
    Result

  43. Stacking in code
    // Two decision trees of different depth act as base models.
    DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
    DecisionTreeClassificationTrainer trainer1 = new DecisionTreeClassificationTrainer(3, 0);
    // Logistic regression aggregates the outputs of the base models.
    LogisticRegressionSGDTrainer aggregator = new LogisticRegressionSGDTrainer();
    StackedModel mdl = new StackedVectorDatasetTrainer<>(aggregator)
        .addTrainerWithDoubleOutput(trainer)
        .addTrainerWithDoubleOutput(trainer1)
        .fit(ignite, dataCache, normalizationPreprocessor);

  44. Online Machine Learning
    Partitioned-Based Dataset

  45. Online Machine Learning
    KNNClassificationTrainer trainer = new KNNClassificationTrainer();
    // Train the initial model on the first cache.
    KNNClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer)
        .withK(3)
        .withDistanceMeasure(new EuclideanDistance())
        .withStrategy(NNStrategy.WEIGHTED);
    // Update the existing model with the second cache instead of retraining from scratch.
    KNNClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);

  47. TensorFlow on Apache Ignite
    • Ignite Dataset
    • IGFS Plugin
    • Distributed Training
    • More info here
    >>> import tensorflow as tf
    >>> from tensorflow.contrib.ignite import IgniteDataset
    >>>
    >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    >>> iterator = dataset.make_one_shot_iterator()
    >>> next_obj = iterator.get_next()
    >>>
    >>> with tf.Session() as sess:
    ...     for _ in range(3):
    ...         print(sess.run(next_obj))
    ...
    {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

  48. Distributed Training
    import tensorflow as tf
    from tensorflow.contrib.ignite import IgniteDataset

    dataset = IgniteDataset("IMAGES")

    # Compute gradients locally on every worker node.
    # compute_gradient is a placeholder that the slide does not define.
    gradients = []
    for i in range(5):
        with tf.device("/job:WORKER/task:%d" % i):
            device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
            device_next_obj = device_iterator.get_next()
            gradient = compute_gradient(device_next_obj)
            gradients.append(gradient)

    # Aggregate them on master node.
    result_gradient = tf.reduce_sum(gradients)

    with tf.Session("grpc://localhost:10000") as sess:
        print(sess.run(result_gradient))

  49. TF Distributed Training

  50. Model Inference
    • PMML via JPMML
    • XGBoost model parser
    • Spark model parser
    • MLeap runtime usage

  51. MLeap
    MLeapModelParser parser = new MLeapModelParser();
    ModelReader reader = new FileSystemModelReader(mdlRsrc.getPath());
    AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 8, 2);
    // Build the distributed inference model; predictions are returned asynchronously.
    Model<Vector, Future<Double>> mdl = mdlBuilder.build(reader, parser);
    Future<Double> prediction = mdl.predict(VectorUtils.of(22.0, 100.0));

  52. Spark ML Model Parser

  53. Spark Model Parser
    val passengers = TitanicUtils.readPassengers(spark)
    val assembler = new VectorAssembler()
      .setInputCols(Array("pclass", "sibsp", "parch"))
      .setOutputCol("features")
    // Drop rows with missing feature values, then keep only the features and the label.
    val output = assembler.transform(passengers.na.drop(Array("pclass", "sibsp", "parch")))
      .select("features", "survived")
    val trainer = new GBTClassifier()
      .setMaxIter(10)
      .setLabelCol("survived")
      .setFeaturesCol("features")
      .setMaxDepth(7)
    val model = trainer.fit(output)
    // Save the trained model so that it can be parsed on the Ignite side.
    model.write.overwrite().save("/home/zaleslaw/models/titanic/gbt")

  54. Spark Model Parser
    dataCache = TitanicUtils.readPassengers(ignite);
    final Vectorizer<Integer, Vector, Integer, Double> vectorizer = new DummyVectorizer<Integer>(0, 5, 6).labeled(1);
    // Load the gradient-boosted trees model saved by Spark ML.
    ModelsComposition mdl = SparkModelParser.parse(
        SPARK_MDL_PATH,
        SupportedSparkModels.GRADIENT_BOOSTED_TREES
    );
    double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());

  55. It could be your application☺

  56. Roadmap for Ignite 3.0
    • NLP support
    • Spark Pipeline Inference Support
    • DL4j integration
    • More approximate ML algorithms to speed up training

  60. Conclusion
    • Apache Spark and Apache Ignite can work together in the ML/DL area
    • Apache Ignite ML is a son of Apache Spark ML (we learned a lot from the Spark ML algorithm implementations)
    • New features and capabilities for distributed ML could be a reason to try Ignite ML
    • You can load Spark models into Ignite and update them via the online-learning mechanism

  61. It’s very easy to add a new feature
    • Write to me: [email protected]
    • Create a ticket here
    • Prepare a PR
    • Assign me as a reviewer

  62. DON’T FORGET TO RATE
    AND REVIEW THE SESSIONS
    SEARCH SPARK + AI SUMMIT
