
Nuances of Machine Learning with Ignite ML


Alexey Zinoviev

October 15, 2018

Transcript

  1. Nuances of Machine Learning with Ignite ML. Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Contributor
  2. Follow me. E-mail: zaleslaw.sin@gmail.com. Twitter: @zaleslaw, @BigDataRussia. Big Data Russia: vk.com/big_data_russia + Telegram @bigdatarussia. Java & JVM langs: vk.com/java_jvm + Telegram @javajvmlangs
  3. What is Apache Ignite?

  4. Release 2.7

  5. New features in 2.5 - 2.7: SQL ACID (β), MVCC (β), Spring Data 2.0
  6. New features in 2.5 - 2.7: SQL ACID (β), MVCC (β), Spring Data 2.0. Thin clients: Java, .NET, C++ + PHP + Python + Node.js
  7. New features in 2.5 - 2.7: SQL ACID (β), MVCC (β), Spring Data 2.0. Thin clients: Java, .NET, C++ + PHP + Python + Node.js. Topology: Star topology with Zookeeper
  8. Integration

  9. Spark integration: Read from Ignite to DataFrame; Write from DataFrame to Ignite; IgniteSparkSession + IgniteCatalog; SQL Optimization

        spark.read
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, TEST_CONFIG_FILE)
          .option(OPTION_TABLE, "person")
          .load()

        tbl.write
          .format(FORMAT_IGNITE)
          .option(OPTION_CONFIG_FILE, CFG_PATH)
          .option(OPTION_TABLE, tableName)
          .option(OPTION_CRT_TBL_PRIMARY_KEY_FIELDS, pk)
          .save()

  10. Spark integration

  11. Streaming integrations: Kafka Streamer, Flink Streamer, JMS Streamer, Storm Streamer, Flume Streamer, ...

        KafkaStreamer<String, String, String> ks = new KafkaStreamer<>();
        // Data streamer that loads entries into the "myCache" cache
        IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");
        ks.setIgnite(ignite);
        ks.setStreamer(stmr);
        ks.setTopic(someKafkaTopic);
        ks.setConsumerConfig(kafkaConsumerConfig);
        // Converts each Kafka message into a single cache key/value pair
        ks.setSingleTupleExtractor(strExtractor);
        ks.start();
        ks.stop();
        stmr.close();

  12. Kafka & Flink integrations

  13. Hibernate L2 Cache

  14. ML Frameworks

  15. ML/DL Most Popular Frameworks

  16. Could it be trained directly on Ignite data?

  17. Could it be trained directly on Ignite data?

  18. Ignite ML Overview

  19. scikit-learn: Classification, Regression, Clustering, Neural Networks, Multiclass and multilabel algorithms, Preprocessing, NLP, Dimensionality reduction, Pipelines, Imputation of missing values, Model selection and evaluation, Model persistence, Ensemble methods, Tuning the hyper-parameters
  20. ML module in 2.4 - 2.6 releases

  21. ML module in 2.7 release

  22. Partition-Based Dataset: abstraction layer on top of Ignite storage and computation
  23. Partition-Based Dataset: abstraction layer on top of Ignite storage and computation; MapReduce using Compute Grid
  24. Partition-Based Dataset: abstraction layer on top of Ignite storage and computation; MapReduce using Compute Grid; partition data (can be recovered from another node)
  25. Partition-Based Dataset: abstraction layer on top of Ignite storage and computation; MapReduce using Compute Grid; partition data (can be recovered from another node); partition context (ML algorithms are iterative and require context)
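
The partition data / partition context split on slides 22-25 can be illustrated without a cluster. The sketch below is plain Java and is not the Ignite ML API: the `Partition` class, its `data` and `context` fields, and the `compute` helper are hypothetical names chosen for this illustration. Each partition carries recoverable data plus a small mutable context, a map step runs once per partition, and a reduce step merges the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class PartitionedDatasetSketch {
    /** Hypothetical partition: read-only data plus a mutable per-partition context. */
    static final class Partition {
        final double[] data;   // partition data: could be reloaded from the upstream store
        final int[] context;   // partition context: here, a counter of completed passes

        Partition(double[] data) {
            this.data = data;
            this.context = new int[] {0};
        }
    }

    /** Maps every partition to a partial result, then reduces the partial results into one. */
    static <R> R compute(List<Partition> partitions,
                         BiFunction<double[], int[], R> map,
                         BinaryOperator<R> reduce,
                         R identity) {
        R result = identity;
        for (Partition p : partitions) {
            // In a cluster this call would run on the node that owns the partition.
            R partial = map.apply(p.data, p.context);
            result = reduce.apply(result, partial);
        }
        return result;
    }

    /** Global mean computed from per-partition (sum, count) pairs. */
    static double mean(List<Partition> partitions) {
        double[] sumAndCount = compute(
            partitions,
            (data, ctx) -> {
                ctx[0]++; // record one more pass over this partition
                double sum = 0;
                for (double v : data) sum += v;
                return new double[] {sum, data.length};
            },
            (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]},
            new double[] {0, 0});
        return sumAndCount[0] / sumAndCount[1];
    }

    public static void main(String[] args) {
        List<Partition> partitions = new ArrayList<>();
        partitions.add(new Partition(new double[] {1, 2, 3}));
        partitions.add(new Partition(new double[] {4, 5}));
        System.out.println(mean(partitions));
    }
}
```

Only the small (sum, count) pairs travel to the reducer, which is the point of the design: the bulk data stays where it is stored and the context survives between iterations.
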
  26. Recovery after Node Failure. P = Partition, C = Partition Context, D = Partition Data, D* = Local ETL
  27. ML Algorithms

  28. Classification algorithms: Logistic Regression, SVM, KNN, ANN, Decision trees, Random Forest
  29. Regression algorithms: KNN Regression, Linear Regression, Decision tree regression, Random forest regression, Gradient-boosted tree regression
  30. Multilayer Perceptron Neural Network

  31. Preprocessors

  32. Normalization

  33. Scaling

  34. One-Hot Encoding
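
As a reminder of what the three preprocessors named on slides 32-34 compute, here is a minimal plain-Java sketch. It does not use the Ignite ML preprocessing classes; the class and method names are made up for this example, and L2 normalization and min-max scaling are assumed as the concrete variants.

```java
import java.util.Arrays;

public class PreprocessingSketch {
    /** Normalization: divide a feature vector by its L2 norm so that it has unit length. */
    static double[] normalizeL2(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = norm == 0 ? 0 : v[i] / norm;
        return out;
    }

    /** Scaling: map one feature column linearly onto [0, 1] using its min and max. */
    static double[] minMaxScale(double[] column) {
        double min = Arrays.stream(column).min().orElse(0);
        double max = Arrays.stream(column).max().orElse(0);
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++)
            out[i] = max == min ? 0 : (column[i] - min) / (max - min);
        return out;
    }

    /** One-hot encoding: a category index becomes a 0/1 vector with a single 1. */
    static double[] oneHot(int categoryIdx, int categories) {
        double[] out = new double[categories];
        out[categoryIdx] = 1;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalizeL2(new double[] {3, 4})));
        System.out.println(Arrays.toString(minMaxScale(new double[] {10, 20, 30})));
        System.out.println(Arrays.toString(oneHot(1, 3)));
    }
}
```

Note the different axes: normalization works on one row (a feature vector), while scaling works on one column and therefore needs a pass over the whole dataset to find the minimum and maximum.
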

  35. How to get the best model?

  36. Pipeline API, Test-Train Split, Parameter Grid, Binary Evaluator, Binary Classification Metrics, Tuning Hyperparameters, Model Evaluation with K-fold cross validation
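
K-fold cross validation, mentioned on slide 36, can be sketched in a few lines. This is plain Java under stated assumptions, not the Ignite ML cross-validation API: the `Scorer` interface and both method names are hypothetical, and rows are assigned to folds round-robin rather than at random.

```java
import java.util.ArrayList;
import java.util.List;

public class KFoldSketch {
    /** Hypothetical callback: train on one index set, return a score on the other. */
    interface Scorer {
        double score(List<Integer> trainRows, List<Integer> testRows);
    }

    /** Splits row indices 0..rows-1 into k folds, assigning rows round-robin. */
    static List<List<Integer>> folds(int rows, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < rows; i++) folds.get(i % k).add(i);
        return folds;
    }

    /** Holds out each fold once, trains on the remaining folds, and averages the scores. */
    static double crossValidate(int rows, int k, Scorer scorer) {
        List<List<Integer>> folds = folds(rows, k);
        double total = 0;
        for (int f = 0; f < k; f++) {
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < k; g++)
                if (g != f) train.addAll(folds.get(g));
            total += scorer.score(train, folds.get(f));
        }
        return total / k;
    }

    public static void main(String[] args) {
        // Toy scorer: the fraction of rows held out, just to show the call shape.
        double avg = crossValidate(10, 5,
            (train, test) -> (double) test.size() / (train.size() + test.size()));
        System.out.println(avg);
    }
}
```

Hyperparameter tuning over a parameter grid would call `crossValidate` once per parameter combination and keep the combination with the best average score.
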
  37. Machine Learning Ensemble Model Averaging: ensemble as a mean value of predictions; majority-based ensemble; ensemble as a weighted sum of predictions
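
The three averaging strategies from slide 37 reduce to short functions over the predictions of the individual models. The sketch below is plain Java with hypothetical names, not the Ignite ML ensemble classes; ties in the majority vote are broken in favor of the smaller label, which is an arbitrary choice made here.

```java
import java.util.HashMap;
import java.util.Map;

public class EnsembleSketch {
    /** Ensemble as the mean value of the individual model predictions. */
    static double mean(double[] predictions) {
        double sum = 0;
        for (double p : predictions) sum += p;
        return sum / predictions.length;
    }

    /** Ensemble as a weighted sum of the individual model predictions. */
    static double weightedSum(double[] predictions, double[] weights) {
        double sum = 0;
        for (int i = 0; i < predictions.length; i++) sum += weights[i] * predictions[i];
        return sum;
    }

    /** Majority-based ensemble: the most frequent class label wins (ties: smaller label). */
    static int majorityVote(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = labels[0];
        for (int label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            int bestCount = counts.get(best);
            if (count > bestCount || (count == bestCount && label < best)) best = label;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1, 2, 3}));
        System.out.println(weightedSum(new double[] {1, 2}, new double[] {0.25, 0.75}));
        System.out.println(majorityVote(new int[] {1, 0, 1}));
    }
}
```

Mean and weighted sum suit regression outputs or class probabilities, while majority voting suits hard class labels.
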
  38. Superpower Features

  39. Online mini-batch learning

  40. TensorFlow on Apache Ignite: Ignite Dataset, IGFS Plugin, Distributed Training
  41. TensorFlow on Apache Ignite: Ignite Dataset, IGFS Plugin, Distributed Training

        >>> import tensorflow as tf
        >>> from tensorflow.contrib.ignite import IgniteDataset
        >>>
        >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
        >>> iterator = dataset.make_one_shot_iterator()
        >>> next_obj = iterator.get_next()
        >>>
        >>> with tf.Session() as sess:
        ...     for _ in range(3):
        ...         print(sess.run(next_obj))
        {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
        {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
        {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

  42. ISSUES

  43. Data Parallelism

  44. Amdahl's law for Distributed Programming
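
For reference, Amdahl's law in its usual form. The symbols are chosen here and do not come from the slide: f is the fraction of the work that can be parallelized, N is the number of nodes, and S(N) is the resulting speedup.

```latex
S(N) = \frac{1}{(1 - f) + \dfrac{f}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - f}
```

For example, with f = 0.9 and N = 10 the speedup is 1 / (0.1 + 0.09) ≈ 5.26, and no number of nodes can push it past 10. In a distributed ML setting the serial part typically includes the reduce step and the convergence check.
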

  45. The iterative-convergent nature of ML programs: find or prepare something locally; repeat it a few times (locIterations); reduce results; make the next step (globalIterations++); check convergence
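
The loop on slide 45 can be sketched as follows. This is a toy plain-Java example, not Ignite ML code: it fits a single number w that minimizes the squared distance to all data points, and the learning rate, the averaging reducer, and the method names are assumptions made for the illustration. The names `locIterations` and `globalIterations` follow the slide.

```java
public class IterativeConvergentSketch {
    /** Local phase: a few gradient steps on one partition, starting from the shared model w. */
    static double localTrain(double[] partition, double w, int locIterations, double learningRate) {
        for (int i = 0; i < locIterations; i++) {
            double grad = 0;
            for (double x : partition) grad += 2 * (w - x); // derivative of (w - x)^2 by w
            w -= learningRate * grad / partition.length;
        }
        return w;
    }

    /** Global loop: local training on every partition, reduce by averaging, convergence check. */
    static double train(double[][] partitions, int locIterations, int maxGlobalIterations, double eps) {
        double w = 0;
        for (int globalIterations = 0; globalIterations < maxGlobalIterations; globalIterations++) {
            double sum = 0;
            for (double[] partition : partitions)          // would run in parallel on a cluster
                sum += localTrain(partition, w, locIterations, 0.1);
            double next = sum / partitions.length;         // reduce results
            boolean converged = Math.abs(next - w) < eps;  // check convergence
            w = next;                                      // make the next step
            if (converged) break;
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] partitions = {{1, 3}, {5, 7}};
        System.out.println(train(partitions, 3, 100, 1e-9));
    }
}
```

With the two equally sized partitions above the result approaches the overall mean, 4. Raising `locIterations` lowers the number of reduce rounds (and network traffic) at the cost of letting the local models drift further apart between synchronizations.
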
  46. The complexity of ML algorithms. Assume n is the sample size and p is the number of features.

        Algorithm      | Training complexity            | Prediction complexity
        Naive Bayes    | O(n*p)                         | O(p)
        kNN            | O(1)                           | O(n*p)
        ANN            | O(n*p) + KMeans complexity     | O(p)
        Decision Tree  | O(n^2*p)                       | O(p)
        Random Forest  | O(n^2*p * amount of trees)     | O(p * amount of trees)
        SVM            | O(n^2*p + n^3)                 | O(amount of sup. vec. * p)
        Multi-SVM      | O(O(SVM) * amount of classes)  | O(O(SVM) * amount of classes * O(sort(classes)))

  47. How to make a good enough Java API? Use <?> or avoid them?
  48. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type?
  49. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type? Be coherent with other parts of Ignite.
  50. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type? Be coherent with other parts of Ignite. Naming conventions.
  51. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type? Be coherent with other parts of Ignite. Naming conventions. Lambda everywhere.
  52. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type? Be coherent with other parts of Ignite. Naming conventions. Lambda everywhere. Method chaining vs setters.
  53. How to make a good enough Java API? Use <?> or avoid them? Use Stream as a return type? Be coherent with other parts of Ignite. Naming conventions. Lambda everywhere. Method chaining vs setters. Chain method naming for hyperparameter tuning.
  54. How to contribute?

  55. Apache Ignite Community: more than 180 contributors in total; 8 contributors to the ML module; VK Group; Blog posts; Ignite Documentation; ML Documentation
  56. Roadmap for Ignite 3.0: NLP (TF-IDF, Word2Vec); More integration with TF; Clustering: LDA, Bisecting K-Means; Naive Bayes and Statistical package; Dimensionality reduction; … a lot of tasks for beginners :)
  57. DEMO

  58. Follow me. E-mail: zaleslaw.sin@gmail.com. Twitter: @zaleslaw, @BigDataRussia. Big Data Russia: vk.com/big_data_russia + Telegram @bigdatarussia. Java & JVM langs: vk.com/java_jvm + Telegram @javajvmlangs