Slide 1

Slide 1 text

Nuances of Machine Learning with Ignite ML Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Contributor

Slide 2

Slide 2 text

Follow me
E-mail: [email protected]
Twitter: @zaleslaw, @BigDataRussia
vk.com/big_data_russia (Big Data Russia) + Telegram @bigdatarussia
vk.com/java_jvm (Java & JVM langs) + Telegram @javajvmlangs

Slide 3

Slide 3 text

What is Apache Ignite?

Slide 4

Slide 4 text

Release 2.7

Slide 5

Slide 5 text

New features in 2.5 - 2.7
SQL
ACID (β)
MVCC (β)
Spring Data 2.0

Slide 6

Slide 6 text

New features in 2.5 - 2.7
SQL
ACID (β)
MVCC (β)
Spring Data 2.0
Thin clients: Java, .NET, C++, + PHP, + Python, + Node.js

Slide 7

Slide 7 text

New features in 2.5 - 2.7
SQL
ACID (β)
MVCC (β)
Spring Data 2.0
Thin clients: Java, .NET, C++, + PHP, + Python, + Node.js
Topology: Star topology with Zookeeper

Slide 8

Slide 8 text

Integration

Slide 9

Slide 9 text

Spark integration
Read from Ignite to DataFrame
Write from DataFrame to Ignite
IgniteSparkSession + IgniteCatalog
SQL Optimization

    spark.read
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, TEST_CONFIG_FILE)
      .option(OPTION_TABLE, "person")
      .load()

    tbl.write
      .format(FORMAT_IGNITE)
      .option(OPTION_CONFIG_FILE, CFG_PATH)
      .option(OPTION_TABLE, tableName)
      .option(OPTION_CRT_TBL_PRIMARY_KEY_FIELDS, pk)
      .save()

Slide 10

Slide 10 text

Spark integration

Slide 11

Slide 11 text

Streaming integrations
Kafka Streamer
Flink Streamer
JMS Streamer
Storm Streamer
Flume Streamer
...

    KafkaStreamer ks = new KafkaStreamer<>();
    IgniteDataStreamer stmr = ignite.dataStreamer("myCache");

    ks.setIgnite(ignite);
    ks.setStreamer(stmr);
    ks.setTopic(someKafkaTopic);
    ks.setConsumerConfig(kafkaConsumerConfig);
    ks.setSingleTupleExtractor(strExtractor);

    ks.start();
    ks.stop();
    stmr.close();

Slide 12

Slide 12 text

Kafka & Flink integrations

Slide 13

Slide 13 text

Hibernate L2 Cache

Slide 14

Slide 14 text

ML Frameworks

Slide 15

Slide 15 text

ML/DL Most Popular Frameworks

Slide 16

Slide 16 text

Could it be trained directly on Ignite data?

Slide 17

Slide 17 text

Could it be trained directly on Ignite data?

Slide 18

Slide 18 text

Ignite ML Overview

Slide 19

Slide 19 text

scikit-learn
Classification
Regression
Clustering
Neural Networks
Multiclass and multilabel algorithms
Preprocessing
NLP
Dimensionality reduction
Pipelines
Imputation of missing values
Model selection and evaluation
Model persistence
Ensemble methods
Tuning the hyper-parameters

Slide 20

Slide 20 text

ML module in 2.4 - 2.6 releases

Slide 21

Slide 21 text

ML module in 2.7 release

Slide 22

Slide 22 text

Partition-Based Dataset
Abstraction layer on top of Ignite storage and computation

Slide 23

Slide 23 text

Partition-Based Dataset
Abstraction layer on top of Ignite storage and computation
MapReduce using Compute Grid

Slide 24

Slide 24 text

Partition-Based Dataset
Abstraction layer on top of Ignite storage and computation
MapReduce using Compute Grid
Partition data (can be recovered from another node)

Slide 25

Slide 25 text

Partition-Based Dataset
Abstraction layer on top of Ignite storage and computation
MapReduce using Compute Grid
Partition data (can be recovered from another node)
Partition context (ML algorithms are iterative and require context)
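
The idea of a dataset split into partitions, each holding recoverable data plus a small context, with computations mapped over partitions and then reduced, can be sketched in plain Java. This is only an illustration under stated assumptions: `PartitionSketch`, `Partition`, `compute`, and `sum` are hypothetical names, not the Ignite ML API, and everything runs in one JVM rather than on a Compute Grid.

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch, NOT the Ignite ML API.
final class PartitionSketch {
    /** One partition: data (D) that could be rebuilt from upstream, plus context (C). */
    static final class Partition {
        final double[] data;   // partition data
        int timesVisited;      // partition context, kept between iterations

        Partition(double[] data) {
            this.data = data;
        }
    }

    /** Maps a function over every partition, then reduces the partial results. */
    static double compute(List<Partition> partitions,
                          ToDoubleFunction<Partition> map,
                          DoubleBinaryOperator reduce,
                          double identity) {
        double acc = identity;
        for (Partition p : partitions) {
            p.timesVisited++;                                   // update the context
            acc = reduce.applyAsDouble(acc, map.applyAsDouble(p));
        }
        return acc;
    }

    /** Example: a global sum computed as a reduction of per-partition sums. */
    static double sum(List<Partition> partitions) {
        return compute(partitions, p -> {
            double s = 0.0;
            for (double v : p.data) {
                s += v;
            }
            return s;
        }, Double::sum, 0.0);
    }
}
```

The context field here only counts visits; in a real iterative algorithm it would hold whatever state has to survive between iterations.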

Slide 26

Slide 26 text

Recovery after Node Failure
P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL

Slide 27

Slide 27 text

ML Algorithms

Slide 28

Slide 28 text

Classification algorithms
Logistic Regression
SVM
KNN
ANN
Decision trees
Random Forest

Slide 29

Slide 29 text

Regression algorithms
KNN Regression
Linear Regression
Decision tree regression
Random forest regression
Gradient-boosted tree regression

Slide 30

Slide 30 text

Multilayer Perceptron Neural Network

Slide 31

Slide 31 text

Preprocessors

Slide 32

Slide 32 text

Normalization
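
As a rough illustration of what a normalization preprocessor does, the sketch below divides a feature vector by its L2 norm so that the result has unit length. The choice of the L2 norm is an assumption (other p-norms are possible), and `NormalizationSketch.l2Normalize` is a hypothetical helper, not the Ignite ML API.

```java
// Hypothetical helper, NOT the Ignite ML API. Assumes L2 normalization.
final class NormalizationSketch {
    /** Returns a copy of the vector divided by its L2 norm; zero vectors are copied unchanged. */
    static double[] l2Normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (norm == 0.0) ? v[i] : v[i] / norm;
        }
        return out;
    }
}
```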

Slide 33

Slide 33 text

Scaling
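
A minimal sketch of scaling, assuming min-max scaling is meant: every feature column is mapped linearly onto [0, 1] using the minimum and maximum seen in the data. `MinMaxScalingSketch.scale` is a hypothetical helper written for illustration, not the Ignite ML API, and it expects a non-empty rectangular array.

```java
import java.util.Arrays;

// Hypothetical helper, NOT the Ignite ML API. Assumes min-max scaling.
final class MinMaxScalingSketch {
    /** Scales each column of a non-empty rectangular matrix to the range [0, 1]. */
    static double[][] scale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: collect per-column minimum and maximum.
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: rescale; constant columns are mapped to 0.
        double[][] out = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                out[i][j] = (range == 0.0) ? 0.0 : (rows[i][j] - min[j]) / range;
            }
        }
        return out;
    }
}
```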

Slide 34

Slide 34 text

One-Hot Encoding
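
One-hot encoding replaces a categorical value with a vector that has a single 1 in the position assigned to that category. The sketch below assigns positions in order of first appearance; `OneHotSketch.encode` is a hypothetical helper for illustration, not the Ignite ML API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT the Ignite ML API.
final class OneHotSketch {
    /** Encodes each value as a 0/1 vector; categories are indexed in order of first appearance. */
    static double[][] encode(String[] values) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String v : values) {
            index.putIfAbsent(v, index.size());
        }
        double[][] out = new double[values.length][index.size()];
        for (int i = 0; i < values.length; i++) {
            out[i][index.get(values[i])] = 1.0;
        }
        return out;
    }
}
```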

Slide 35

Slide 35 text

How to get the best model?

Slide 36

Slide 36 text

Model Evaluation with K-fold cross validation
Pipeline API
Test-Train Split
Parameter Grid
Binary Evaluator
Binary Classification Metrics
Tuning Hyperparameters
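
K-fold cross validation can be illustrated with a small index-splitting sketch: the rows are divided into k folds, and each fold is used once as the test set while the remaining rows form the training set. `KFoldSketch.split` is a hypothetical helper, not the Ignite ML API, and it uses a simple `i % k` assignment instead of shuffling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT the Ignite ML API.
final class KFoldSketch {
    /**
     * Returns k splits of the row indices 0..n-1. Each split is
     * {trainIndices, testIndices}; row i is in the test set of fold i % k.
     */
    static List<int[][]> split(int n, int k) {
        List<int[][]> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == f) {
                    test.add(i);
                } else {
                    train.add(i);
                }
            }
            folds.add(new int[][] {toArray(train), toArray(test)});
        }
        return folds;
    }

    private static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

A model would then be trained k times, once per split, and the k test scores averaged.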

Slide 37

Slide 37 text

Machine Learning Ensemble Model Averaging
Ensemble as a Mean value of predictions
Majority-based Ensemble
Ensemble as a weighted sum of predictions
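
The three ways of combining model outputs named above can be sketched as plain functions over the predictions of the individual models. `EnsembleSketch` and its method names are hypothetical, not the Ignite ML API; ties in the majority vote are resolved in favour of the smallest label, which is an arbitrary choice made for determinism.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, NOT the Ignite ML API.
final class EnsembleSketch {
    /** Ensemble as the mean value of the individual predictions. */
    static double mean(double[] predictions) {
        double sum = 0.0;
        for (double p : predictions) {
            sum += p;
        }
        return sum / predictions.length;
    }

    /** Ensemble as a weighted sum of the individual predictions. */
    static double weightedSum(double[] predictions, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            sum += weights[i] * predictions[i];
        }
        return sum;
    }

    /** Majority-based ensemble: the most frequent label wins (smallest label on ties). */
    static int majorityVote(int[] labels) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int best = labels[0];
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```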

Slide 38

Slide 38 text

Superpower Features

Slide 39

Slide 39 text

Online mini-batch learning
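
A minimal sketch of the idea, assuming a one-dimensional linear model y = w * x trained with gradient descent on the mean squared error: each arriving mini-batch refines the current weight instead of retraining from scratch. `OnlineLearningSketch.update` and its parameters are hypothetical and do not reproduce the Ignite ML trainer API.

```java
// Hypothetical helper, NOT the Ignite ML API.
final class OnlineLearningSketch {
    /** One gradient step for the model y = w * x on a single mini-batch (mean squared error). */
    static double update(double w, double[] xs, double[] ys, double learningRate) {
        double grad = 0.0;
        for (int i = 0; i < xs.length; i++) {
            grad += 2.0 * (w * xs[i] - ys[i]) * xs[i];
        }
        return w - learningRate * grad / xs.length;
    }
}
```

Calling `update` once per incoming batch, always passing in the previous weight, keeps the model current as new data arrives.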

Slide 40

Slide 40 text

TensorFlow on Apache Ignite
Ignite Dataset
IGFS Plugin
Distributed Training

Slide 41

Slide 41 text

TensorFlow on Apache Ignite
Ignite Dataset
IGFS Plugin
Distributed Training

    >>> import tensorflow as tf
    >>> from tensorflow.contrib.ignite import IgniteDataset
    >>>
    >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
    >>> iterator = dataset.make_one_shot_iterator()
    >>> next_obj = iterator.get_next()
    >>>
    >>> with tf.Session() as sess:
    ...     for _ in range(3):
    ...         print(sess.run(next_obj))
    {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
    {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
    {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Slide 42

Slide 42 text

ISSUES

Slide 43

Slide 43 text

Data Parallelism

Slide 44

Slide 44 text

Amdahl's law for Distributed Programming
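
Amdahl's law bounds the speedup of a job when only a fraction p of it can run in parallel on n workers: S(n) = 1 / ((1 - p) + p / n). The sketch below just evaluates that formula; `AmdahlSketch.speedup` is a hypothetical helper name.

```java
// Hypothetical helper that evaluates Amdahl's law.
final class AmdahlSketch {
    /**
     * @param parallelFraction share of the work that can be parallelized, in [0, 1]
     * @param workers          number of workers (nodes, threads), at least 1
     * @return theoretical speedup compared with a single worker
     */
    static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }
}
```

Even with an unlimited number of workers the speedup cannot exceed 1 / (1 - p), so the serial part of a distributed training step dominates quickly.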

Slide 45

Slide 45 text

The iterative-convergent nature of ML programs
Find or prepare something locally
Repeat it a few times (locIterations)
Reduce results
Make next step (globalIterations++)
Check convergence
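
The loop described above can be sketched on a toy problem: finding the value w that minimizes the squared distance to all data points, with the data split into partitions. Each partition runs a few local gradient steps (locIterations), the local results are reduced by simple averaging, and the global loop stops once the change is below a threshold. `IterativeSketch.fit` and its parameters are hypothetical; this is not how any particular Ignite ML trainer is implemented.

```java
// Hypothetical sketch of an iterative-convergent training loop, NOT the Ignite ML API.
final class IterativeSketch {
    static double fit(double[][] partitions, int locIterations,
                      int maxGlobalIterations, double learningRate, double eps) {
        double w = 0.0;
        for (int globalIterations = 0; globalIterations < maxGlobalIterations; globalIterations++) {
            double sumOfLocals = 0.0;
            for (double[] part : partitions) {
                // Find or prepare something locally, repeated locIterations times.
                double local = w;
                for (int it = 0; it < locIterations; it++) {
                    double grad = 0.0;
                    for (double x : part) {
                        grad += 2.0 * (local - x);
                    }
                    local -= learningRate * grad / part.length;
                }
                sumOfLocals += local;
            }
            // Reduce results: a plain average of the local models.
            double next = sumOfLocals / partitions.length;
            // Check convergence before making the next global step.
            if (Math.abs(next - w) < eps) {
                return next;
            }
            w = next;
        }
        return w;
    }
}
```

With equally sized partitions the averaged result approaches the global mean of the data; unequal partitions would need a weighted reduce step.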

Slide 46

Slide 46 text

The complexity of ML algorithms
Assume n is the sample size and p is the number of features.

Algorithm      | Training complexity             | Prediction complexity
Naive Bayes    | O(n*p)                          | O(p)
kNN            | O(1)                            | O(n*p)
ANN            | O(n*p) + KMeans complexity      | O(p)
Decision Tree  | O(n^2*p)                        | O(p)
Random Forest  | O(n^2*p * number of trees)      | O(p * number of trees)
SVM            | O(n^2*p + n^3)                  | O(number of support vectors * p)
Multi-SVM      | O(O(SVM) * number of classes)   | O(O(SVM) * number of classes * O(sort(classes)))

Slide 47

Slide 47 text

How to make a good enough Java API?
Use or avoid them?

Slide 48

Slide 48 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?

Slide 49

Slide 49 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite

Slide 50

Slide 50 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite
Naming conventions

Slide 51

Slide 51 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite
Naming conventions
Lambda everywhere

Slide 52

Slide 52 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite
Naming conventions
Lambda everywhere
Method chaining vs setters

Slide 53

Slide 53 text

How to make a good enough Java API?
Use or avoid them?
Use Stream as a return type?
Be coherent with other parts of Ignite
Naming conventions
Lambda everywhere
Method chaining vs setters
Chain method naming for hyperparameter tuning

Slide 54

Slide 54 text

How to contribute?

Slide 55

Slide 55 text

Apache Ignite Community
More than 180 contributors in total
8 contributors to the ML module
VK Group
Blog posts
Ignite Documentation
ML Documentation

Slide 56

Slide 56 text

Roadmap for Ignite 3.0
NLP (TF-IDF, Word2Vec)
More integration with TF
Clustering: LDA, Bisecting K-Means
Naive Bayes and Statistical package
Dimensionality reduction
… a lot of tasks for beginners :)

Slide 57

Slide 57 text

DEMO
