
Nuances of Machine Learning with Ignite ML

Alexey Zinoviev

October 15, 2018

Transcript

  1. Nuances of Machine
    Learning with Ignite ML
    Alexey Zinoviev, Java/BigData Trainer,
    Apache Ignite Contributor


  2. E-mail : [email protected]
    Twitter : @zaleslaw @BigDataRussia
    vk.com/big_data_russia Big Data Russia
    + Telegram @bigdatarussia
    vk.com/java_jvm Java & JVM langs
    + Telegram @javajvmlangs
    Follow me


  3. What is Apache Ignite?


  4. Release 2.7


  5. SQL
    ACID (β)
    MVCC (β)
    Spring Data 2.0
    New features in 2.5 - 2.7


  6. SQL
    ACID (β)
    MVCC (β)
    Spring Data 2.0
    New features in 2.5 - 2.7
    Thin clients
    Java
    .NET
    C++
    + PHP
    + Python
    + Node.js


  7. SQL
    ACID (β)
    MVCC (β)
    Spring Data 2.0
    New features in 2.5 - 2.7
    Thin clients
    Java
    .NET
    C++
    + PHP
    + Python
    + Node.js
    Topology
    Star topology with
    Zookeeper


  8. Integration


  9. Read from Ignite to DataFrame
    Write from DataFrame to Ignite
    IgniteSparkSession + IgniteCatalog
    SQL Optimization
    Spark integration
    spark
        .read
        .format(FORMAT_IGNITE)
        .option(OPTION_CONFIG_FILE, TEST_CONFIG_FILE)
        .option(OPTION_TABLE, "person")
        .load()
    tbl
        .write
        .format(FORMAT_IGNITE)
        .option(OPTION_CONFIG_FILE, CFG_PATH)
        .option(OPTION_TABLE, tableName)
        .option(OPTION_CRT_TBL_PRIMARY_KEY_FIELDS, pk)
        .save()


  10. Spark integration


  11. Kafka Streamer
    Flink Streamer
    JMS Streamer
    Storm Streamer
    Flume Streamer
    ...
    Streaming integrations
    // Key and value types are assumed to be String here.
    KafkaStreamer<String, String> ks = new KafkaStreamer<>();
    IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");
    ks.setIgnite(ignite);
    ks.setStreamer(stmr);
    ks.setTopic(someKafkaTopic);
    ks.setConsumerConfig(kafkaConsumerConfig);
    ks.setSingleTupleExtractor(strExtractor);
    ks.start();
    ks.stop();
    stmr.close();


  12. Kafka & Flink integrations


  13. Hibernate L2 Cache


  14. ML Frameworks


  15. ML/DL Most Popular Frameworks


  16. Could it be trained directly on Ignite data?


  17. Could it be trained directly on Ignite data?


  18. Ignite ML Overview


  19. Classification
    Regression
    Clustering
    Neural Networks
    Multiclass and multilabel
    algorithms
    scikit-learn
    Preprocessing
    NLP
    Dimensionality reduction
    Pipelines
    Imputation of missing
    values
    Model selection and
    evaluation
    Model persistence
    Ensemble methods
    Tuning the
    hyper-parameters


  20. ML module in 2.4 - 2.6 releases


  21. ML module in 2.7 release


  22. Abstraction layer on top of Ignite
    storage and computation
    Partition-Based Dataset


  23. Abstraction layer on top of Ignite
    storage and computation
    MapReduce using Compute Grid
    Partition-Based Dataset


  24. Abstraction layer on top of Ignite
    storage and computation
    MapReduce using Compute Grid
    Partition data (can be recovered
    from another node)
    Partition-Based Dataset


  25. Abstraction layer on top of Ignite
    storage and computation
    MapReduce using Compute Grid
    Partition data (can be recovered
    from another node)
    Partition context (ML algorithms
    are iterative and require context)
    Partition-Based Dataset
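
The partition-based dataset idea on this slide can be sketched in plain Java without any Ignite dependency. Every name below (`PartitionedDatasetSketch`, `Partition`, `compute`, `mean`) is hypothetical and is not the Ignite ML API. The sketch only shows how per-partition data and a small per-partition context can be combined through a map step and a reduce step, and it assumes a non-empty list of non-empty partitions.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

/** Hypothetical illustration only; this is not the Ignite ML API. */
final class PartitionedDatasetSketch {
    /** One partition: its data plus a small mutable context kept between iterations. */
    static final class Partition {
        final double[] data;   // partition data (in Ignite it could be rebuilt from the cache)
        int timesVisited = 0;  // partition context (state an iterative algorithm keeps)

        Partition(double[] data) { this.data = data; }
    }

    /** Maps every partition to a partial result, then reduces the partial results into one. */
    static <R> R compute(List<Partition> parts,
                         BiFunction<double[], Partition, R> map,
                         BinaryOperator<R> reduce) {
        R acc = null;
        for (Partition p : parts) {
            p.timesVisited++;                  // update the context
            R partial = map.apply(p.data, p);  // local step; would run on the owning node
            acc = (acc == null) ? partial : reduce.apply(acc, partial);
        }
        return acc;
    }

    /** Example: global mean built from per-partition {sum, count} pairs. */
    static double mean(List<Partition> parts) {
        double[] sumCnt = compute(parts,
            (data, ctx) -> {
                double s = 0;
                for (double v : data) s += v;
                return new double[] {s, data.length};
            },
            (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return sumCnt[0] / sumCnt[1];
    }
}
```

In this sketch the loop over partitions is sequential; in a real cluster the map step would run where each partition lives and only the small partial results would travel over the network.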


  26. Recovery after Node Failure
    P = Partition
    C = Partition Context
    D = Partition Data
    D* = Local ETL


  27. ML Algorithms


  28. Logistic Regression
    SVM
    KNN
    ANN
    Decision trees
    Random Forest
    Classification algorithms


  29. KNN Regression
    Linear Regression
    Decision tree regression
    Random forest regression
    Gradient-boosted tree
    regression
    Regression algorithms


  30. Multilayer Perceptron Neural Network


  31. Preprocessors


  32. Normalization


  33. Scaling


  34. One-Hot Encoding


  35. How to get the best model?


  36. Pipeline API
    Test-Train Split
    Parameter Grid
    Binary Evaluator
    Binary Classification Metrics
    Tuning Hyperparameters
    Model Evaluation with K-fold cross validation
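
As a rough illustration of the K-fold cross-validation item above, the following plain-Java sketch splits row indices into k folds. The class and method names are hypothetical helpers, not the Ignite ML API, and the round-robin assignment assumes the rows are already in random order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Ignite ML. */
final class KFoldSketch {
    /** Returns k folds; fold i holds the row indices used for validation in round i. */
    static List<List<Integer>> folds(int rows, int k) {
        if (k < 2 || k > rows) throw new IllegalArgumentException("need 2 <= k <= rows");
        List<List<Integer>> res = new ArrayList<>();
        for (int i = 0; i < k; i++) res.add(new ArrayList<>());
        // Round-robin assignment: row r goes to fold r % k.
        for (int row = 0; row < rows; row++) res.get(row % k).add(row);
        return res;
    }

    /** Training indices for a round are all rows outside that round's validation fold. */
    static List<Integer> trainIndices(List<List<Integer>> folds, int round) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++)
            if (i != round) train.addAll(folds.get(i));
        return train;
    }
}
```

Each of the k rounds trains on `trainIndices(folds, round)` and evaluates on `folds.get(round)`; the k metric values are then averaged.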


  37. Ensemble as a Mean value of
    predictions
    Majority-based Ensemble
    Ensemble as a weighted sum of
    predictions
    Machine Learning Ensemble Model Averaging
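
The three aggregation strategies listed on this slide can be sketched as follows. This is a minimal plain-Java illustration with hypothetical names (`EnsembleSketch` and its methods), not the Ignite ML API; it assumes the individual model predictions have already been collected into arrays.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of three aggregation strategies; not the Ignite ML API. */
final class EnsembleSketch {
    /** Ensemble output as the mean of the individual predictions. */
    static double mean(double[] preds) {
        double s = 0;
        for (double p : preds) s += p;
        return s / preds.length;
    }

    /** Ensemble output as a weighted sum of the individual predictions. */
    static double weightedSum(double[] preds, double[] weights) {
        if (preds.length != weights.length) throw new IllegalArgumentException("length mismatch");
        double s = 0;
        for (int i = 0; i < preds.length; i++) s += weights[i] * preds[i];
        return s;
    }

    /** Majority vote over class labels; ties resolve to the smallest label. */
    static int majority(int[] labels) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int l : labels) counts.merge(l, 1, Integer::sum);
        int best = labels[0];
        int bestCnt = 0;
        // TreeMap iterates in ascending label order, so a strict ">" keeps the smallest label on ties.
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
            if (e.getValue() > bestCnt) { best = e.getKey(); bestCnt = e.getValue(); }
        return best;
    }
}
```

Mean and weighted sum suit regression outputs or class probabilities, while the majority vote suits hard class labels.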


  38. Superpower Features


  39. Online mini-batch learning


  40. TensorFlow on Apache Ignite
    Ignite Dataset
    IGFS Plugin
    Distributed Training


  41. TensorFlow on Apache Ignite
    Ignite Dataset
    IGFS Plugin
    Distributed Training
    >>> import tensorflow as tf
    >>> from tensorflow.contrib.ignite import IgniteDataset
    >>>
    >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    >>> iterator = dataset.make_one_shot_iterator()
    >>> next_obj = iterator.get_next()
    >>>
    >>> with tf.Session() as sess:
    ...     for _ in range(3):
    ...         print(sess.run(next_obj))
    ...
    {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}


  42. ISSUES


  43. Data Parallelism


  44. Amdahl's law for Distributed Programming
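
For reference, Amdahl's law bounds the speedup of a job whose parallelizable fraction is p when it runs on n workers: speedup = 1 / ((1 - p) + p / n). The small Java helper below is a hypothetical illustration of that formula, not something taken from the slides.

```java
/** Hypothetical helper that evaluates Amdahl's law. */
final class AmdahlSketch {
    /**
     * @param parallelFraction share of the work that can run in parallel, in [0, 1]
     * @param workers          number of workers (nodes or threads), at least 1
     * @return upper bound on the speedup relative to a single worker
     */
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0 || parallelFraction > 1 || workers < 1)
            throw new IllegalArgumentException("bad arguments");
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }
}
```

With 90% of the work parallelizable, the bound stays below 10 regardless of cluster size, which is why the sequential reduce step of a distributed ML algorithm matters so much.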


  45. Find or prepare something locally
    Repeat it a few times
    (locIterations)
    Reduce results
    Make next step
    (globalIterations++)
    Check convergence
    The iterative-convergent nature of ML programs
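
The loop described on this slide can be sketched on a toy problem. The plain-Java example below is a hypothetical illustration, not Ignite ML code: each partition runs `locIterations` gradient steps locally, the local results are reduced by a size-weighted average, and the global loop stops once the estimate changes by less than `eps`. The objective (mean squared distance to all values, minimized by the global mean) is chosen only because its answer is easy to verify, and the sketch assumes every partition is non-empty.

```java
import java.util.List;

/** Hypothetical toy example of an iterative-convergent computation over partitions. */
final class IterativeConvergentSketch {
    static double fit(List<double[]> partitions, int locIterations,
                      int maxGlobalIterations, double learningRate, double eps) {
        double x = 0.0;  // current global estimate
        for (int globalIterations = 0; globalIterations < maxGlobalIterations; globalIterations++) {
            double weighted = 0.0;
            long total = 0;
            for (double[] part : partitions) {
                // Find or prepare something locally, repeated locIterations times.
                double local = x;
                for (int i = 0; i < locIterations; i++) {
                    double grad = 0.0;
                    for (double v : part) grad += 2 * (local - v);
                    local -= learningRate * grad / part.length;
                }
                // Reduce results: accumulate a size-weighted average of the local estimates.
                weighted += local * part.length;
                total += part.length;
            }
            double next = weighted / total;
            // Check convergence before making the next global step.
            boolean converged = Math.abs(next - x) < eps;
            x = next;
            if (converged) break;
        }
        return x;
    }
}
```

Raising `locIterations` cuts the number of global rounds, and therefore the number of reduce steps that cross the network, at the cost of more local work per round.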


  46. Let n be the sample size and p the number of features.
    The complexity of ML algorithms
    Algorithm      | Training complexity               | Prediction complexity
    Naive Bayes    | O(n*p)                            | O(p)
    kNN            | O(1)                              | O(n*p)
    ANN            | O(n*p) + KMeans complexity        | O(p)
    Decision Tree  | O(n^2*p)                          | O(p)
    Random Forest  | O(n^2*p * number of trees)        | O(p * number of trees)
    SVM            | O(n^2*p + n^3)                    | O(number of support vectors * p)
    Multi-SVM      | O(O(SVM) * number of classes)     | O(O(SVM) * number of classes * O(sort(classes)))
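
The kNN row is the least intuitive one, so here is a minimal brute-force sketch of why training is cheap and prediction is O(n*p). The class is hypothetical and uses k = 1 for brevity; it is not the Ignite ML implementation.

```java
/** Hypothetical brute-force 1-nearest-neighbour classifier. */
final class KnnSketch {
    private final double[][] x;  // n rows, p features each
    private final int[] y;       // n labels

    /** "Training" only keeps references to the data, hence the constant cost. */
    KnnSketch(double[][] x, int[] y) {
        this.x = x;
        this.y = y;
    }

    /** Prediction scans all n rows and all p features of each row: O(n*p). */
    int predict(double[] q) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < x.length; i++) {
            double d = 0;
            for (int j = 0; j < q.length; j++) {
                double diff = x[i][j] - q[j];
                d += diff * diff;  // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return y[best];
    }
}
```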


  47. Use > or avoid them?
    How to make a good enough Java API?


  48. Use > or avoid them?
    Use Stream as a return type?
    How to make a good enough Java API?


  49. How to make a good enough Java API?
    Use > or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite


  50. How to make a good enough Java API?
    Use > or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions


  51. How to make a good enough Java API?
    Use > or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere


  52. How to make a good enough Java API?
    Use > or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere
    Method chaining vs setters


  53. How to make a good enough Java API?
    Use > or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere
    Method chaining vs setters
    Chained method naming for hyperparameter
    tuning


  54. How to contribute?


  55. More than 180 contributors in total
    8 contributors to ML module
    VK Group
    Blog posts
    Ignite Documentation
    ML Documentation
    Apache Ignite Community


  56. NLP (TF-IDF, Word2Vec)
    More integration with TF
    Clustering: LDA, Bisecting K-Means
    Naive Bayes and Statistical package
    Dimensionality reduction
    … a lot of tasks for beginners:)
    Roadmap for Ignite 3.0


  57. DEMO


  58. E-mail : [email protected]
    Twitter : @zaleslaw @BigDataRussia
    vk.com/big_data_russia Big Data Russia
    + Telegram @bigdatarussia
    vk.com/java_jvm Java & JVM langs
    + Telegram @javajvmlangs
    Follow me
