Nuances of Machine Learning with Ignite ML

Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Contributor

Follow me
E-mail: [email protected]
Twitter: @zaleslaw, @BigDataRussia
Big Data Russia: vk.com/big_data_russia + Telegram @bigdatarussia
Java & JVM langs: vk.com/java_jvm + Telegram @javajvmlangs

What is Apache Ignite?

Release 2.7

New features in 2.5 - 2.7
SQL ACID (β)
MVCC (β)
Spring Data 2.0
Thin clients: Java, .NET, C++ + PHP + Python + Node.js
Topology: star topology with ZooKeeper

Integration

Spark integration
Read from Ignite to DataFrame
Write from DataFrame to Ignite
IgniteSparkSession + IgniteCatalog
SQL Optimization

    import org.apache.ignite.spark.IgniteDataFrameSettings._

    // Read an Ignite SQL table into a DataFrame
    spark.read
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, TEST_CONFIG_FILE)
      .option(OPTION_TABLE, "person")
      .load()

    // Write a DataFrame to an Ignite table, creating it with the given primary key
    tbl.write
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, CFG_PATH)
      .option(OPTION_TABLE, tableName)
      .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, pk)
      .save()

Spark integration

Streaming integrations
Kafka Streamer
Flink Streamer
JMS Streamer
Storm Streamer
Flume Streamer
...

    KafkaStreamer<String, String, String> ks = new KafkaStreamer<>();
    IgniteDataStreamer<String, String> stmr = ignite.dataStreamer("myCache");

    // Wire the Kafka consumer to the Ignite data streamer
    ks.setIgnite(ignite);
    ks.setStreamer(stmr);
    ks.setTopic(someKafkaTopic);
    ks.setConsumerConfig(kafkaConsumerConfig);
    ks.setSingleTupleExtractor(strExtractor);

    ks.start();
    // Records are streamed into the cache while the streamer is running
    ks.stop();
    stmr.close();

Kafka & Flink integrations

Hibernate L2 Cache

ML Frameworks

ML/DL Most Popular Frameworks

Could it be trained directly on Ignite data?

Ignite ML Overview

scikit-learn
Classification
Regression
Clustering
Neural Networks
Multiclass and multilabel algorithms
Preprocessing
NLP
Dimensionality reduction
Pipelines
Imputation of missing values
Model selection and evaluation
Model persistence
Ensemble methods
Tuning the hyper-parameters

ML module in 2.4 - 2.6 releases

ML module in 2.7 release

Partition-Based Dataset
Abstraction layer on top of Ignite storage and computation
MapReduce using Compute Grid
Partition data (can be recovered from another node)
Partition context (ML algorithms are iterative and require context)
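The idea on this slide can be illustrated with a small plain-Java sketch. It is not the Ignite ML API: the class `PartitionedDatasetSketch`, the nested `Partition` type and the `compute` method are hypothetical names chosen for illustration, and everything runs in one JVM rather than on a Compute Grid. It only shows the shape of the abstraction: per-partition data, a per-partition context that survives between iterations, a map step executed next to the data, and a reduce step that merges the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Plain-Java illustration only; these types are hypothetical, not the Ignite ML API. */
final class PartitionedDatasetSketch {
    /** One partition: recoverable data plus a context kept between iterations. */
    static final class Partition {
        final double[] data;     // partition data (could be rebuilt from the upstream storage)
        int iterationsSeen = 0;  // partition context (state an iterative algorithm needs)

        Partition(double[] data) { this.data = data; }
    }

    private final List<Partition> partitions = new ArrayList<>();

    PartitionedDatasetSketch(double[][] chunks) {
        for (double[] chunk : chunks) partitions.add(new Partition(chunk));
    }

    /** Map step runs once per partition; reduce step merges the partial results. */
    <R> R compute(Function<Partition, R> map, BinaryOperator<R> reduce) {
        R acc = null;
        for (Partition p : partitions) {
            R local = map.apply(p);
            p.iterationsSeen++; // update the context after each pass
            acc = (acc == null) ? local : reduce.apply(acc, local);
        }
        return acc;
    }

    /** Example: global mean computed from per-partition (sum, count) pairs. */
    static double mean(PartitionedDatasetSketch ds) {
        double[] sumCnt = ds.compute(p -> {
            double s = 0;
            for (double v : p.data) s += v;
            return new double[] {s, p.data.length};
        }, (a, b) -> new double[] {a[0] + b[0], a[1] + b[1]});
        return sumCnt[0] / sumCnt[1];
    }
}
```

Only the small (sum, count) pairs cross partition boundaries here, which is the point of running the map step where the data lives.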

Recovery after Node Failure
P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL

ML Algorithms

Classification algorithms
Logistic Regression
SVM
KNN
ANN
Decision trees
Random Forest

Regression algorithms
KNN Regression
Linear Regression
Decision tree regression
Random forest regression
Gradient-boosted tree regression

Multilayer Perceptron Neural Network

Preprocessors

Normalization
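As a rough illustration of what a normalization preprocessor does, here is a minimal plain-Java sketch, assuming that normalization means dividing each feature vector by its p-norm. The class and method names are hypothetical and this is not the Ignite ML preprocessor itself.

```java
/** Hypothetical helper for illustration; not the Ignite ML API. */
final class NormalizationSketch {
    /** Returns a copy of the vector divided by its p-norm (p = 2 gives unit Euclidean length). */
    static double[] normalize(double[] v, int p) {
        double sum = 0;
        for (double x : v) sum += Math.pow(Math.abs(x), p);
        double norm = Math.pow(sum, 1.0 / p);

        double[] out = new double[v.length];
        if (norm == 0) return out; // a zero vector stays zero
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }
}
```

Because each row is processed independently, this step needs no communication between partitions.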

Scaling
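A minimal sketch of scaling, assuming the min-max variant that maps every feature column to [0, 1]. The names are hypothetical, and the code works on an in-memory array rather than on an Ignite cache.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not the Ignite ML API. */
final class MinMaxScalingSketch {
    /** Rescales every column of rows to [0, 1]; a constant column maps to 0. */
    static double[][] scale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: per-column minimum and maximum.
        for (double[] r : rows)
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], r[j]);
                max[j] = Math.max(max[j], r[j]);
            }

        // Second pass: rescale every value with the collected statistics.
        double[][] out = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++)
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                out[i][j] = range == 0 ? 0 : (rows[i][j] - min[j]) / range;
            }
        return out;
    }
}
```

Unlike per-row normalization, the first pass needs global statistics, so in a distributed setting it would be a reduce over all partitions before any row can be transformed.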

One-Hot Encoding
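One-hot encoding can be sketched in a few lines of plain Java. The class name and the choice to index categories in order of first appearance are assumptions made for this example, not a description of the Ignite ML encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the Ignite ML API. */
final class OneHotSketch {
    /** Encodes each categorical value as a 0/1 vector with a single 1 at its category index. */
    static double[][] encode(String[] values) {
        // Assign indices in order of first appearance.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String v : values) index.putIfAbsent(v, index.size());

        double[][] out = new double[values.length][index.size()];
        for (int i = 0; i < values.length; i++) out[i][index.get(values[i])] = 1.0;
        return out;
    }
}
```

For the input "red", "green", "red" the categories get indices 0 and 1, so the rows become (1, 0), (0, 1), (1, 0).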

How to get the best model?

Pipeline API
Test-Train Split
Parameter Grid
Binary Evaluator
Binary Classification Metrics
Tuning Hyperparameters
Model Evaluation with K-fold cross validation
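K-fold cross validation, mentioned above, can be sketched as follows. This is a plain-Java illustration with hypothetical names; the fold assignment by `i % k` and the `trainAndScore` callback are choices made for the example, not the Ignite ML cross-validation API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper for illustration; not the Ignite ML API. */
final class KFoldSketch {
    /** Returns k folds of row indices; fold f holds the indices i with i % k == f. */
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);
        return folds;
    }

    /** Averages a score over k rounds; each round holds one fold out for testing. */
    static double crossValidate(int n, int k,
        ToDoubleBiFunction<List<Integer>, List<Integer>> trainAndScore) {
        List<List<Integer>> folds = folds(n, k);
        double total = 0;
        for (int f = 0; f < k; f++) {
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < k; g++)
                if (g != f) train.addAll(folds.get(g));
            total += trainAndScore.applyAsDouble(train, folds.get(f));
        }
        return total / k;
    }
}
```

A hyperparameter search over a parameter grid would call `crossValidate` once per parameter combination and keep the combination with the best average score.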

Machine Learning Ensemble Model Averaging
Ensemble as a mean value of predictions
Majority-based ensemble
Ensemble as a weighted sum of predictions
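The three aggregation strategies named on this slide can be written down directly. The sketch below uses hypothetical names and takes the individual model predictions as plain arrays; breaking ties in the majority vote toward the smallest label is an assumption of this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not the Ignite ML API. */
final class EnsembleSketch {
    /** Ensemble as the mean value of the individual predictions. */
    static double mean(double[] predictions) {
        double sum = 0;
        for (double p : predictions) sum += p;
        return sum / predictions.length;
    }

    /** Ensemble as a weighted sum of the individual predictions. */
    static double weightedSum(double[] predictions, double[] weights) {
        double sum = 0;
        for (int i = 0; i < predictions.length; i++) sum += weights[i] * predictions[i];
        return sum;
    }

    /** Majority-based ensemble: the most frequent label wins (ties go to the smallest label). */
    static int majority(int[] votes) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int v : votes) counts.merge(v, 1, Integer::sum);

        int best = votes[0];
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        return best;
    }
}
```

Mean and weighted sum fit regression models, while the majority vote fits classifiers that output discrete labels.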

Superpower Features

Online mini-batch learning

TensorFlow on Apache Ignite
Ignite Dataset
IGFS Plugin
Distributed Training

    >>> import tensorflow as tf
    >>> from tensorflow.contrib.ignite import IgniteDataset
    >>>
    >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    >>> iterator = dataset.make_one_shot_iterator()
    >>> next_obj = iterator.get_next()
    >>>
    >>> with tf.Session() as sess:
    ...     for _ in range(3):
    ...         print(sess.run(next_obj))
    {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

ISSUES

Data Parallelism

Amdahl's law for Distributed Programming
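For reference, Amdahl's law in its usual form is given below. The symbols are introduced here and do not come from the slide: f is the fraction of the work that can run in parallel, N is the number of workers, and S(N) is the resulting speedup.

```latex
S(N) = \frac{1}{(1 - f) + \frac{f}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - f}
```

For example, with f = 0.9 and N = 10 the speedup is 1 / (0.1 + 0.09) ≈ 5.26, and no number of nodes can push it beyond 10. In distributed training the serial part typically includes the reduce step and the data transfer between nodes.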

The iterative-convergent nature of ML programs
Find or prepare something locally
Repeat it a few times (locIterations)
Reduce results
Make next step (globalIterations++)
Check convergence
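The loop described above can be sketched on a toy problem: estimating the value w that minimizes the squared distance to all samples (the mean), with gradient steps done locally per partition and the local models averaged afterwards. This is a single-JVM illustration with hypothetical names, and it assumes equally sized partitions so that averaging the local models is unbiased; it is not how any particular Ignite ML trainer is implemented.

```java
/** Hypothetical illustration of the local-iterations / reduce / global-iterations loop. */
final class IterativeConvergentSketch {
    static double fit(double[][] partitions, double lr, int locIterations,
        int maxGlobalIterations, double eps) {
        double w = 0;
        for (int globalIterations = 0; globalIterations < maxGlobalIterations; globalIterations++) {
            double sum = 0;
            // "Map": each partition refines its own copy of the model locally.
            for (double[] part : partitions) {
                double local = w;
                for (int it = 0; it < locIterations; it++) {
                    double grad = 0;
                    for (double x : part) grad += local - x;
                    local -= lr * grad / part.length;
                }
                sum += local;
            }
            // "Reduce": average the local models into the next global model.
            double next = sum / partitions.length;

            // Check convergence before starting the next global iteration.
            boolean converged = Math.abs(next - w) < eps;
            w = next;
            if (converged) break;
        }
        return w;
    }
}
```

Raising `locIterations` reduces the number of reduce steps (and network round trips) at the price of more local work per step, which is the trade-off the two counters on the slide express.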

The complexity of ML algorithms
Assume n is the sample size and p is the number of features.

Algorithm      Training complexity            Prediction complexity
Naive Bayes    O(n*p)                         O(p)
kNN            O(1)                           O(n*p)
ANN            O(n*p) + KMeans complexity     O(p)
Decision Tree  O(n^2*p)                       O(p)
Random Forest  O(n^2*p*amount of trees)       O(p*amount of trees)
SVM            O(n^2*p + n^3)                 O(amount of sup.vec * p)
Multi-SVM      O(O(SVM) * amount of classes)  O(O(SVM) * amount of classes * O(sort(classes)))

How to make a good enough Java API?
Use <?> or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite
Naming conventions
Lambda everywhere
Method chaining vs setters
Chain method naming for hyperparameter tuning
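To make the "method chaining vs setters" question concrete, here is a small side-by-side sketch. Both trainer classes and their parameters are hypothetical and exist only to contrast the two styles; the `withXxx` prefix is one possible naming convention for chained hyperparameter methods.

```java
/** Hypothetical classes contrasting two API styles; not taken from Ignite ML. */
final class TrainerApiSketch {
    /** Setter style: void setters, configured statement by statement. */
    static final class SetterTrainer {
        private int maxIterations = 100;
        private double learningRate = 0.1;

        void setMaxIterations(int v) { maxIterations = v; }
        void setLearningRate(double v) { learningRate = v; }

        String describe() {
            return "maxIterations=" + maxIterations + ", learningRate=" + learningRate;
        }
    }

    /** Chaining style: every withXxx method returns this, so calls can be chained. */
    static final class FluentTrainer {
        private int maxIterations = 100;
        private double learningRate = 0.1;

        FluentTrainer withMaxIterations(int v) { maxIterations = v; return this; }
        FluentTrainer withLearningRate(double v) { learningRate = v; return this; }

        String describe() {
            return "maxIterations=" + maxIterations + ", learningRate=" + learningRate;
        }
    }
}
```

With the chained style a fully configured trainer is a single expression, for example `new TrainerApiSketch.FluentTrainer().withMaxIterations(10).withLearningRate(0.5)`, which is convenient when a parameter grid has to build many configurations programmatically.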

How to contribute?

Apache Ignite Community
More than 180 contributors in total
8 contributors to the ML module
VK Group
Blog posts
Ignite Documentation
ML Documentation

Roadmap for Ignite 3.0
NLP (TF-IDF, Word2Vec)
More integration with TF
Clustering: LDA, Bisecting K-Means
Naive Bayes and Statistical package
Dimensionality reduction
… a lot of tasks for beginners :)
