Nuances of Machine Learning with Ignite ML

Alexey Zinoviev

October 15, 2018

Transcript

  1. Follow me

    E-mail: [email protected]
    Twitter: @zaleslaw, @BigDataRussia
    vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia
    vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs
  2. New features in 2.5 - 2.7

    SQL ACID (β)
    MVCC (β)
    Spring Data 2.0
    Thin clients: Java, .NET, C++ + PHP + Python + Node.js
  3. New features in 2.5 - 2.7

    SQL ACID (β)
    MVCC (β)
    Spring Data 2.0
    Thin clients: Java, .NET, C++ + PHP + Python + Node.js
    Topology: star topology with ZooKeeper
  4. Spark integration

    Read from Ignite to DataFrame
    Write from DataFrame to Ignite
    IgniteSparkSession + IgniteCatalog
    SQL Optimization

        // Read from Ignite to DataFrame
        spark.read
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, TEST_CONFIG_FILE)
          .option(OPTION_TABLE, "person")
          .load()

        // Write from DataFrame to Ignite
        tbl.write
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, CFG_PATH)
          .option(OPTION_TABLE, tableName)
          .option(OPTION_CRT_TBL_PRIMARY_KEY_FIELDS, pk)
          .save()
  5. Streaming integrations

    Kafka Streamer
    Flink Streamer
    JMS Streamer
    Storm Streamer
    Flume Streamer
    ...

        KafkaStreamer<String, String, String> ks = new KafkaStreamer<>();
        IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");

        // Wire the Kafka streamer to the Ignite node and its data streamer.
        ks.setIgnite(ignite);
        ks.setStreamer(stmr);
        ks.setTopic(someKafkaTopic);
        ks.setConsumerConfig(kafkaConsumerConfig);
        ks.setSingleTupleExtractor(strExtractor);

        ks.start();

        // Stop consuming and release the data streamer when done.
        ks.stop();
        stmr.close();
  6. scikit-learn

    Classification
    Regression
    Clustering
    Neural Networks
    Multiclass and multilabel algorithms
    Preprocessing
    NLP
    Dimensionality reduction
    Pipelines
    Imputation of missing values
    Model selection and evaluation
    Model persistence
    Ensemble methods
    Tuning the hyper-parameters
  7. Partition-Based Dataset

    Abstraction layer on top of Ignite storage and computation
    MapReduce using Compute Grid
  8. Partition-Based Dataset

    Abstraction layer on top of Ignite storage and computation
    MapReduce using Compute Grid
    Partition data (can be recovered from another node)
  9. Partition-Based Dataset

    Abstraction layer on top of Ignite storage and computation
    MapReduce using Compute Grid
    Partition data (can be recovered from another node)
    Partition context (ML algorithms are iterative and require context)
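
    The partition data, partition context and MapReduce ideas from the slides above can be sketched in a few lines. This is a toy, single-process illustration and not the Ignite ML API: the class `PartitionedDataset`, its `compute` method and the round-robin split are all hypothetical names and assumptions made for the example.

```python
from functools import reduce


class PartitionedDataset:
    """Toy stand-in for a partition-based dataset (hypothetical, not Ignite).

    Each partition holds its data plus a small mutable context that
    survives between iterations.
    """

    def __init__(self, rows, num_partitions):
        # Assumption: a simple round-robin split; a real grid would use
        # its own partitioning scheme.
        self.partitions = [
            {"data": rows[i::num_partitions], "context": {}}
            for i in range(num_partitions)
        ]

    def compute(self, map_fn, reduce_fn):
        # Map: run map_fn locally on every partition's (data, context).
        partials = [map_fn(p["data"], p["context"]) for p in self.partitions]
        # Reduce: fold the per-partition results into a single value.
        return reduce(reduce_fn, partials)


def local_sum_and_count(data, context):
    # Iterative algorithms keep state between passes in the context;
    # here it only counts how often the partition was visited.
    context["visits"] = context.get("visits", 0) + 1
    return (sum(data), len(data))


dataset = PartitionedDataset([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], num_partitions=3)
total, count = dataset.compute(
    local_sum_and_count,
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / count
```

    Only the small per-partition results travel to the reducer, which is the point of keeping the data and the context next to each other.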
  10. Recovery after Node Failure

    P = Partition
    C = Partition Context
    D = Partition Data
    D* = Local ETL
  11. Regression algorithms

    KNN Regression
    Linear Regression
    Decision tree regression
    Random forest regression
    Gradient-boosted tree regression
  12. Model Evaluation with K-fold cross validation

    Pipeline API
    Test-Train Split
    Parameter Grid
    Binary Evaluator
    Binary Classification Metrics
    Tuning Hyperparameters
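
    How a parameter grid and K-fold cross validation fit together can be shown with a small, library-free sketch. Everything here is an assumption made for illustration: the 1-D kNN classifier, the toy data, accuracy as the metric, and the names `k_fold_indices`, `cross_val_accuracy` and `param_grid`. It does not reproduce the Ignite ML evaluator classes.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def knn_predict(train_x, train_y, x, k_neighbors):
    # Majority vote among the k nearest 1-D neighbours (labels are 0 or 1).
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:k_neighbors])
    return 1 if 2 * votes > k_neighbors else 0


def cross_val_accuracy(xs, ys, folds, k_neighbors):
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), folds):
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        hits = sum(
            knn_predict(tx, ty, xs[i], k_neighbors) == ys[i] for i in test_idx
        )
        scores.append(hits / len(test_idx))
    # The score of one parameter set is the mean over all folds.
    return sum(scores) / len(scores)


# Toy data: label is 1 when x > 5; interleaved so every fold has both classes.
xs = [0.0, 10.0, 1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

param_grid = {"k_neighbors": [1, 3, 5]}
results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    results[params["k_neighbors"]] = cross_val_accuracy(xs, ys, folds=5, **params)

best_k = max(results, key=results.get)
```

    Each parameter combination is trained and scored once per fold, so the cost grows as (grid size) x (number of folds).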
  13. Machine Learning Ensemble Model Averaging

    Ensemble as a mean value of predictions
    Majority-based ensemble
    Ensemble as a weighted sum of predictions
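
    The three averaging strategies named on this slide reduce to very small functions. The sketch below assumes each base model has already produced a prediction; the function names are hypothetical and are not taken from Ignite ML.

```python
from collections import Counter


def mean_ensemble(predictions):
    # Ensemble as the mean value of the base models' numeric predictions.
    return sum(predictions) / len(predictions)


def weighted_ensemble(predictions, weights):
    # Ensemble as a weighted sum; weights are assumed to be chosen by the
    # caller (for example, summing to 1).
    return sum(w * p for w, p in zip(weights, predictions))


def majority_ensemble(labels):
    # Majority-based ensemble: the most frequent predicted label wins.
    return Counter(labels).most_common(1)[0][0]


regression_outputs = [1.0, 2.0, 3.0]
averaged = mean_ensemble(regression_outputs)
weighted = weighted_ensemble(regression_outputs, [0.5, 0.25, 0.25])
voted = majority_ensemble(["cat", "dog", "cat"])
```

    The mean and weighted sum suit regression outputs, while majority voting suits class labels.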
  14. TensorFlow on Apache Ignite

    Ignite Dataset
    IGFS Plugin
    Distributed Training

        >>> import tensorflow as tf
        >>> from tensorflow.contrib.ignite import IgniteDataset
        >>>
        >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
        >>> iterator = dataset.make_one_shot_iterator()
        >>> next_obj = iterator.get_next()
        >>>
        >>> with tf.Session() as sess:
        >>>     for _ in range(3):
        >>>         print(sess.run(next_obj))
        {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
        {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
        {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  15. The iterative-convergent nature of ML programs

    Find or prepare something locally
    Repeat it a few times (locIterations)
    Reduce results
    Make next step (globalIterations++)
    Check convergence
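
    The loop described on this slide (local work, a few local repetitions, reduce, next global step, convergence check) can be sketched as follows. The problem chosen here, estimating a mean by gradient descent over in-memory partitions, is an assumption made only to keep the example short; the names `local_step` and `train` are hypothetical.

```python
def local_step(data, w, lr, loc_iterations):
    # "Find or prepare something locally" and "repeat it a few times":
    # gradient descent on 0.5 * mean((w - x)^2) over one partition.
    for _ in range(loc_iterations):
        grad = sum(w - x for x in data) / len(data)
        w -= lr * grad
    return w


def train(partitions, lr=0.5, loc_iterations=3,
          max_global_iterations=100, tol=1e-9):
    w = 0.0
    global_iterations = 0
    total = sum(len(p) for p in partitions)
    while global_iterations < max_global_iterations:
        # Local phase on every partition, starting from the shared value.
        local_results = [local_step(p, w, lr, loc_iterations) for p in partitions]
        # "Reduce results": size-weighted average of the local estimates.
        new_w = sum(len(p) * r for p, r in zip(partitions, local_results)) / total
        # "Make next step".
        global_iterations += 1
        # "Check convergence".
        converged = abs(new_w - w) < tol
        w = new_w
        if converged:
            break
    return w, global_iterations


estimate, steps = train([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
```

    More local iterations mean fewer reduce rounds, at the price of each partition drifting further toward its own local optimum between synchronizations.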
  16. The complexity of ML algorithms

    Assume n is the sample size and p is the number of features.

    Algorithm      | Training complexity             | Prediction complexity
    ---------------+---------------------------------+-------------------------------------------------
    Naive Bayes    | O(n*p)                          | O(p)
    kNN            | O(1)                            | O(n*p)
    ANN            | O(n*p) + KMeans Complexity      | O(p)
    Decision Tree  | O(n^2*p)                        | O(p)
    Random Forest  | O(n^2*p*amount of trees)        | O(p*amount of trees)
    SVM            | O(n^2*p + n^3)                  | O(amount of sup.vec * p)
    Multi - SVM    | O(O(SVM) * amount of classes)   | O(O(SVM) * amount of classes * O(sort(classes)))
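
    The kNN row (O(1) training, O(n*p) prediction) is easy to see in code. The sketch below is a hypothetical lazy classifier, not an Ignite ML class; the `feature_ops` counter is added only to make the n*p cost visible, and selecting the k nearest rows adds a further O(n log k) that the table does not count.

```python
import heapq
from collections import Counter


class LazyKnn:
    def __init__(self):
        self.feature_ops = 0  # counts per-feature distance operations

    def fit(self, rows, labels):
        # "Training" only stores references: no pass over the data.
        self.rows = rows
        self.labels = labels
        return self

    def predict(self, query, k=3):
        scored = []
        for row, label in zip(self.rows, self.labels):  # n rows
            dist = 0.0
            for a, b in zip(row, query):                # p features
                dist += (a - b) ** 2
                self.feature_ops += 1
            scored.append((dist, label))
        # Pick the k smallest distances, then take a majority vote.
        nearest = heapq.nsmallest(k, scored, key=lambda t: t[0])
        return Counter(label for _, label in nearest).most_common(1)[0][0]


model = LazyKnn().fit([(0, 0), (0, 1), (5, 5), (6, 5)], [0, 0, 1, 1])
prediction = model.predict((5, 6), k=3)
```

    With n = 4 rows and p = 2 features, a single prediction touches every feature of every stored row.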
  17. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
  18. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
  19. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
  20. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere
  21. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere
    Method chaining vs setters
  22. How to make a good enough Java API?

    Use <?> or avoid them?
    Use Stream as a return type?
    Be coherent with other parts of Ignite
    Naming conventions
    Lambda everywhere
    Method chaining vs setters
    Chain method naming for hyperparameter tuning
  23. Apache Ignite Community

    > 180 contributors in total
    8 contributors to the ML module
    VK Group
    Blog posts
    Ignite Documentation
    ML Documentation
  24. Roadmap for Ignite 3.0

    NLP (TF-IDF, Word2Vec)
    More integration with TF
    Clustering: LDA, Bisecting K-Means
    Naive Bayes and Statistical package
    Dimensionality reduction
    … a lot of tasks for beginners :)