Slide 1

Slide 1 text

Not all ML algorithms go to distributed heaven Alexey Zinoviev, Java/BigData Trainer, Apache Ignite Committer

Slide 2

Slide 2 text

E-mail : [email protected] Twitter : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs Follow me

Slide 3

Slide 3 text

What is Machine Learning?

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

ML Task in math form (shortly)

Slide 6

Slide 6 text

ML Task in math form (by Vorontsov): X is the set of objects, Y is the set of answers, f: X → Y is the target function. Given a training sample with known answers, find a decision function that approximates f.
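Since the formula on this slide is an image, here is a LaTeX reconstruction of the standard formulation; the notation X^l and a(x) is assumed rather than copied from the slide:

\[
X^\ell = \{(x_1, y_1), \dots, (x_\ell, y_\ell)\}, \qquad y_i = f(x_i),
\]
\[
\text{find a decision function } a: X \to Y \text{ that approximates } f \text{ on the whole of } X.
\]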

Slide 7

Slide 7 text

Model example [Linear Regression]

Slide 8

Slide 8 text

Model example [Linear Regression] Loss Function

Slide 9

Slide 9 text

Model example [Decision Tree]

Slide 10

Slide 10 text

Finding the best model

Slide 11

Slide 11 text

Distributed ML

Slide 12

Slide 12 text

scikit-learn covers: Classification, Regression, Clustering, Neural Networks, Multiclass and multilabel algorithms, Preprocessing, NLP, Dimensionality reduction, Pipelines, Imputation of missing values, Model selection and evaluation, Model persistence, Ensemble methods, Tuning the hyper-parameters

Slide 13

Slide 13 text

Take scikit-learn and distribute it!!!

Slide 14

Slide 14 text

Take scikit-learn and distribute it!!!

Slide 15

Slide 15 text

Take scikit-learn and distribute it!!!

Slide 16

Slide 16 text

Main issues of the standard implementations
● They are designed by scientists and described in papers
● Pseudo-code from papers is copied and adapted in Python libs
● Typically, the algorithm is implemented with one global while loop
● Usually, it uses simple data structures like multi-dimensional arrays
● These data structures live in shared memory on one computer
● A lot of algorithms have O(n^3) computational complexity or higher
● As a result, all these algorithms can only be used effectively for 10^2-10^5 observations

Slide 17

Slide 17 text

Distributed Pipeline

Slide 18

Slide 18 text

ML Pipeline Raw Data

Slide 19

Slide 19 text

ML Pipeline Raw Data Preprocessing Vectors

Slide 20

Slide 20 text

ML Pipeline Raw Data Preprocessing Vectors Training Model

Slide 21

Slide 21 text

ML Pipeline Raw Data Preprocessing Vectors Training Model Hyper-parameter Tuning

Slide 22

Slide 22 text

ML Pipeline Raw Data Preprocessing Vectors Training Model Hyper-parameter Tuning Deploy Evaluation

Slide 23

Slide 23 text

What can be distributed in a typical ML Pipeline
● Data primitives (datasets, RDDs, dataframes, etc.)
● Preprocessing
● Training
● Cross-validation and other hyper-parameter tuning techniques
● Prediction (if you need massive prediction, for example)
● Ensembles (like training the trees of a Random Forest)

Slide 24

Slide 24 text

What can be distributed in a typical ML Pipeline

Step                   | Apache Spark                | Apache Ignite
Dataset                | distributed                 | distributed
Preprocessing          | distributed                 | distributed
Training               | distributed                 | distributed
Prediction             | distributed                 | distributed
Evaluation             | distributed                 | distributed (since 2.8)
Hyper-parameter tuning | parallel                    | parallel (since 2.8)
Online Learning        | distributed in 3 algorithms | distributed
Ensembles              | for RF*                     | distributed/parallel

Slide 25

Slide 25 text

Distributed Data Structures

Slide 26

Slide 26 text

The main problem with classic ML algorithms: they are designed to learn from a single data set

Slide 27

Slide 27 text

What can be distributed in a typical ML Pipeline (see the sketch below)
● Horizontal fragmentation, wherein subsets of instances are stored at different sites (distributed by rows)
● Vertical fragmentation, wherein subsets of attributes of instances are stored at different sites (distributed by columns)
● Cell fragmentation, a mix of the two above (distributed by row and column ranges)
● Improvement with data collocation based on some hypothesis (a geographic factor, for example)
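A toy Java sketch of the first two fragmentation schemes, splitting one in-memory matrix by rows and by columns; the matrix, the number of sites and all names are illustrative, not part of any library:

import java.util.Arrays;

public class Fragmentation {
    public static void main(String[] args) {
        double[][] data = {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9},
            {10, 11, 12}
        };
        int sites = 2;

        // Horizontal fragmentation: each site gets a contiguous range of rows.
        int rowsPerSite = data.length / sites;
        for (int s = 0; s < sites; s++) {
            double[][] shard = Arrays.copyOfRange(data, s * rowsPerSite, (s + 1) * rowsPerSite);
            System.out.println("site " + s + " rows: " + Arrays.deepToString(shard));
        }

        // Vertical fragmentation: each site gets a contiguous range of columns.
        int cols = data[0].length;
        int colsPerSite = (cols + sites - 1) / sites;
        for (int s = 0; s < sites; s++) {
            int from = s * colsPerSite;
            int to = Math.min(cols, from + colsPerSite);
            double[][] shard = new double[data.length][];
            for (int r = 0; r < data.length; r++)
                shard[r] = Arrays.copyOfRange(data[r], from, to);
            System.out.println("site " + s + " cols: " + Arrays.deepToString(shard));
        }
    }
}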

Slide 28

Slide 28 text

Popular Matrix Representations

Slide 29

Slide 29 text

How to multiply distributed matrices?

Slide 30

Slide 30 text

How to multiply distributed matrices? (a block-multiplication sketch follows below)
● Rows * columns (deliver columns to rows in the shuffle phase)
● Block * block (Cannon's algorithm)
● SUMMA: Scalable Universal Matrix Multiplication Algorithm
● Dimension Independent Matrix Square using MapReduce (DIMSUM) (Spark PR)
● OverSketch: Approximate Matrix Multiplication for the Cloud
● Polar Coded Distributed Matrix Multiplication
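A minimal Java sketch of the block * block idea only: each block of the result is a sum of products of blocks of A and B, so in a distributed setting the per-block products can be computed on different nodes and summed in a reduce step. Real algorithms such as Cannon's or SUMMA organize how the blocks are shipped; this local illustration just shows the decomposition, and all names are illustrative.

import java.util.Arrays;

public class BlockMultiply {
    // Multiply two dense square matrices split into bs x bs blocks
    // (assumes the dimension is a multiple of the block size).
    static double[][] multiplyByBlocks(double[][] a, double[][] b, int bs) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int bi = 0; bi < n; bi += bs)
            for (int bj = 0; bj < n; bj += bs)
                for (int bk = 0; bk < n; bk += bs)
                    // C[bi..][bj..] += A[bi..][bk..] * B[bk..][bj..]
                    // Distributed view: each (bi, bj, bk) block product is an independent task;
                    // the += over bk is the reduce step for the (bi, bj) result block.
                    for (int i = bi; i < bi + bs; i++)
                        for (int j = bj; j < bj + bs; j++)
                            for (int k = bk; k < bk + bs; k++)
                                c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiplyByBlocks(a, b, 1)));
        // [[19.0, 22.0], [43.0, 50.0]]
    }
}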

Slide 31

Slide 31 text

Block multiplication

Slide 32

Slide 32 text

Block multiplication with Cannon

Slide 33

Slide 33 text

Popular Matrix Representations

Slide 34

Slide 34 text

Reasons to avoid distributed algebra
1. A lot of different Matrix/Vector formats
2. Bad performance results for SGD-based algorithms
3. A lot of data is shuffled with sparse block distributed matrices
4. It does not extend to algorithms that are not based on linear algebra
5. The illusion that a lot of algorithms could be easily adapted (like DBSCAN)
6. It pushes you toward direct methods instead of dual ones

Slide 35

Slide 35 text

Partition-based dataset [diagram "Partition Based Dataset Structures": Source Data feeds a durable Upstream Cache and a durable Context Cache; an on-heap, stateless, recoverable Learning Env holds the partition data and context, e.g. double[][] x and double[] y]

Dataset dataset = … // Partition based dataset, internal API
dataset.compute(
    (env, ctx, data) -> map(...), // map phase
    (r1, r2) -> reduce(...)       // reduce phase
)

Slide 36

Slide 36 text

Preprocessors

Slide 37

Slide 37 text

Normalize vector v to L2 norm

Slide 38

Slide 38 text

Distributed Vector Normalization (see the sketch below)
1. Define p (the vector norm to use)
2. Normalize each vector locally on each partition in the Map phase
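A minimal Java sketch of the Map phase: every partition normalizes its own vectors to unit L2 norm independently, so no data has to be shuffled. The "partition" here is just a local list of arrays, not a real Ignite partition, and all names are illustrative.

import java.util.Arrays;
import java.util.List;

public class L2Normalization {
    // Map phase: normalize each vector of the local partition in place.
    static void normalizePartition(List<double[]> partition) {
        for (double[] v : partition) {
            double norm = 0;
            for (double x : v) norm += x * x;
            norm = Math.sqrt(norm);
            if (norm == 0) continue;                 // leave zero vectors untouched
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }
    }

    public static void main(String[] args) {
        List<double[]> partition = Arrays.asList(new double[]{3, 4}, new double[]{0, 5});
        normalizePartition(partition);
        partition.forEach(v -> System.out.println(Arrays.toString(v)));
        // [0.6, 0.8] and [0.0, 1.0]
    }
}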

Slide 39

Slide 39 text

Standard Scaling

Slide 40

Slide 40 text

Distributed Standard Scaling (see the sketch below)
1. Collect the standard scaling statistics (mean, variance)
    ○ one Map-Reduce step to collect
2. Scale each row using the statistics (or the produced model)
    ○ one Map step to transform
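A sketch of the two steps in plain Java: one map-reduce pass collects per-feature count, sum and sum of squares, and a second map pass scales every row with the merged statistics. The class and method names and the two-partition setup are illustrative, not an Ignite API.

import java.util.Arrays;
import java.util.List;

public class DistributedStandardScaler {
    static class Stats {
        long count;
        double[] sum, sumSq;
        Stats(int features) { sum = new double[features]; sumSq = new double[features]; }

        // Map phase: accumulate local statistics over one partition.
        static Stats of(List<double[]> partition, int features) {
            Stats s = new Stats(features);
            for (double[] row : partition) {
                s.count++;
                for (int j = 0; j < features; j++) {
                    s.sum[j] += row[j];
                    s.sumSq[j] += row[j] * row[j];
                }
            }
            return s;
        }

        // Reduce phase: merge the statistics of two partitions.
        Stats merge(Stats o) {
            count += o.count;
            for (int j = 0; j < sum.length; j++) { sum[j] += o.sum[j]; sumSq[j] += o.sumSq[j]; }
            return this;
        }
    }

    public static void main(String[] args) {
        List<double[]> p1 = Arrays.asList(new double[]{1, 10}, new double[]{2, 20});
        List<double[]> p2 = Arrays.asList(new double[]{3, 30}, new double[]{4, 40});
        Stats s = Stats.of(p1, 2).merge(Stats.of(p2, 2));

        // Second Map step: scale every row of every partition with the global mean and std.
        for (List<double[]> partition : List.of(p1, p2))
            for (double[] row : partition) {
                for (int j = 0; j < row.length; j++) {
                    double mean = s.sum[j] / s.count;
                    double std = Math.sqrt(s.sumSq[j] / s.count - mean * mean);
                    row[j] = (row[j] - mean) / std;
                }
                System.out.println(Arrays.toString(row));
            }
    }
}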

Slide 41

Slide 41 text

One-Hot Encoding

Slide 42

Slide 42 text

Distributed Encoding (see the sketch below)
1. Collect the encoding statistics (category frequencies)
    ○ one Map-Reduce step to collect
2. Transform each row using the statistics (or the produced model)
    ○ one Map step to transform
    ○ NOTE: it adds k-1 new columns for each categorical feature, where k is the number of categories
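A sketch of the same two steps for one categorical column in plain Java: the map-reduce pass collects and merges category frequencies, and the transform step expands each value into 0/1 columns. For simplicity the sketch emits one column per category; dropping one column, as the slide describes for the Ignite encoder, works the same way. All names are illustrative.

import java.util.*;
import java.util.stream.Collectors;

public class DistributedOneHot {
    public static void main(String[] args) {
        // Two "partitions" of a single categorical feature.
        List<String> p1 = List.of("red", "green", "red");
        List<String> p2 = List.of("blue", "green");

        // Map phase: collect local category frequencies; reduce phase: merge the maps.
        Map<String, Long> freq = new TreeMap<>();
        for (List<String> part : List.of(p1, p2))
            part.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
                .forEach((k, v) -> freq.merge(k, v, Long::sum));

        // Build a stable category -> column index dictionary from the merged statistics.
        List<String> categories = new ArrayList<>(freq.keySet());
        System.out.println("categories: " + categories + ", frequencies: " + freq);

        // Second Map step: encode every row of every partition locally.
        for (List<String> part : List.of(p1, p2))
            for (String value : part) {
                int[] encoded = new int[categories.size()];
                encoded[categories.indexOf(value)] = 1;
                System.out.println(value + " -> " + Arrays.toString(encoded));
            }
    }
}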

Slide 43

Slide 43 text

ML Algorithms

Slide 44

Slide 44 text

Classification algorithms
● Logistic Regression
● SVM
● KNN
● ANN
● Decision trees
● Random Forest

Slide 45

Slide 45 text

Regression algorithms
● KNN Regression
● Linear Regression
● Decision tree regression
● Random forest regression
● Gradient-boosted tree regression

Slide 46

Slide 46 text

Distributed approaches to designing an ML algorithm
1. Data-Parallelism: the data is partitioned and distributed onto the different workers. Each worker typically updates all parameters based on its share of the data
2. Model-Parallelism: each worker has access to the entire dataset but only updates a subset of the parameters at a time
3. A combination of the two above

Slide 47

Slide 47 text

The iterative-convergent nature of ML programs
1. Find or prepare something locally
2. Repeat it a few times (locIterations++)
3. Reduce the results
4. Make the next global step (globalIterations++)
5. Check convergence

Slide 48

Slide 48 text

In short, distributed ML training can be implemented as ... an iterative MapReduce algorithm, in memory or on disk

Slide 49

Slide 49 text

Potential acceleration points in iterative MapReduce
1. Reduce the number of global iterations
2. Reduce the time of one global iteration
3. Reduce the size of the shuffled data pushed through the network

Slide 50

Slide 50 text

Let's say that an ML algorithm is badly distributed if... the amount of shuffled data depends on the initial dataset size

Slide 51

Slide 51 text

ML algorithms that are easy to scale
1. Linear Regression via SGD
2. Linear Regression via LSQR
3. K-Means
4. Linear SVM
5. KNN
6. Logistic Regression

Slide 52

Slide 52 text

They are not designed for the distributed world
1. PCA (matrix calculations)
2. DBSCAN
3. Topic Modeling (text analysis)
4. Non-linear kernels for SVM

Slide 53

Slide 53 text

Linear Regression via LSQR

Slide 54

Slide 54 text

Linear Regression with the MapReduce approach [diagram: the Golub-Kahan-Lanczos bidiagonalization procedure, the core of the LSQR linear regression trainer; A is the feature matrix, u the label vector, v the result]

Slide 55

Slide 55 text

Linear Regression with the MapReduce approach [diagram: the same Golub-Kahan-Lanczos bidiagonalization procedure with the feature matrix A and label vector u split into Part 1..Part 4; the matrix-vector products inside each iteration are computed as MapReduce steps, producing the result v]

Slide 56

Slide 56 text

Clustering (K-Means)

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Distributed K-Means (First version)
1. Fix k
2. Initialize k centers
3. Cluster the points locally on each partition (local K-Means)
4. Push to the reducer { centroid, number of points, cluster diameter }
5. Join the clusters on the reducer

Slide 59

Slide 59 text

Distributed K-Means (Second version, sketched below)
1. Fix k & initialize k cluster centers
2. Spread them among the cluster nodes
3. Calculate distances locally on every node
4. Form statistics for every cluster center on every node
5. Merge the statistics on the Reducer
6. Recalculate the k cluster centers and repeat steps 3-6 until convergence
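A single global iteration of the second version in plain Java: each "node" produces per-centroid sums and counts for its partition, the reducer merges them element-wise, and the centers are recomputed. Partitions are just local lists here; everything is an illustrative stand-in, not an Ignite API.

import java.util.Arrays;
import java.util.List;

public class DistributedKMeansStep {
    // Map phase: for one partition, accumulate the sum of points and the count per nearest center.
    static double[][] localStats(List<double[]> partition, double[][] centers) {
        int k = centers.length, dim = centers[0].length;
        double[][] stats = new double[k][dim + 1];   // per center: [sum_0 .. sum_{dim-1}, count]
        for (double[] p : partition) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++) d += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            for (int j = 0; j < dim; j++) stats[best][j] += p[j];
            stats[best][dim]++;
        }
        return stats;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        List<List<double[]>> partitions = List.of(
            List.of(new double[]{1, 1}, new double[]{9, 9}),
            List.of(new double[]{0, 2}, new double[]{11, 10}));

        // Reduce phase: merge the per-partition statistics element-wise.
        double[][] merged = new double[centers.length][centers[0].length + 1];
        for (List<double[]> part : partitions) {
            double[][] local = localStats(part, centers);
            for (int c = 0; c < merged.length; c++)
                for (int j = 0; j < merged[c].length; j++) merged[c][j] += local[c][j];
        }

        // Recalculate the centers from the merged sums and counts; repeat until convergence.
        for (int c = 0; c < centers.length; c++)
            for (int j = 0; j < centers[c].length; j++)
                centers[c][j] = merged[c][j] / merged[c][centers[c].length];
        System.out.println(Arrays.deepToString(centers));   // [[0.5, 1.5], [10.0, 9.5]]
    }
}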

Slide 60

Slide 60 text

SGD

Slide 61

Slide 61 text

Linear Regression Model

Slide 62

Slide 62 text

Target function for Linear Regression

Slide 63

Slide 63 text

Loss Function
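Slides 61-65 show the formulas as images; the usual formulation they refer to is reconstructed below in standard notation (the intercept is folded into w), and the 2/n factor matches the times(2.0).divide(datasetSize) normalization in the Ignite code later on:

\[
\hat{y}(x) = w^\top x, \qquad
L(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl(w^\top x_i - y_i\bigr)^2, \qquad
\nabla L(w) = \frac{2}{n}\sum_{i=1}^{n} x_i\bigl(w^\top x_i - y_i\bigr).
\]

The sum over i splits by partitions: each node computes its partial sum of the gradient (map phase) and the partial sums are added up (reduce phase).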

Slide 64

Slide 64 text

Distributed Gradient

Slide 65

Slide 65 text

Distributed Gradient

Slide 66

Slide 66 text

SGD Pseudocode

def SGD(X, Y, Loss, GradLoss, W0, s):
    W = W0
    lastLoss = Double.Inf
    for i = 0 .. maxIterations:
        W = W - s * GradLoss(W, X, Y)
        currentLoss = Loss(Model(W), X, Y)
        if abs(currentLoss - lastLoss) > eps:
            lastLoss = currentLoss
        else:
            break
    return Model(W)

Slide 67

Slide 67 text

What can be distributed?

def SGD(X, Y, Loss, GradLoss, W0, s):
    W = W0
    lastLoss = Double.Inf
    for i = 0 .. maxIterations:
        W = W - s * GradLoss(W, X, Y)        # gradient over the whole dataset: distributable as map-reduce over partitions
        currentLoss = Loss(Model(W), X, Y)   # loss over the whole dataset: also distributable
        if abs(currentLoss - lastLoss) > eps:
            lastLoss = currentLoss
        else:
            break
    return Model(W)

Slide 68

Slide 68 text

Distributed Gradient

Slide 69

Slide 69 text

Naive Apache Ignite implementation

try (Dataset dataset = … ) {
    int datasetSize = sizeOf(dataset);
    double error = computeMSE(model, dataset);
    int i = 0;
    while (error > minError && i < maxIterations) {
        Vector grad = dataset.compute(
            data -> computeLocalGrad(model, data), // map phase
            (left, right) -> left.plus(right)      // reduce phase
        );
        grad = grad.times(2.0).divide(datasetSize); // normalize part of grad
        Vector newWeights = model.weights().minus(grad.times(gradStep)); // add anti-gradient
        model.setWeights(newWeights);
        error = computeMSE(model, dataset);
        i++;
    }
}

Slide 70

Slide 70 text

Distributed Gradient

Slide 71

Slide 71 text

SVM

Slide 72

Slide 72 text

Linear SVM

Slide 73

Slide 73 text

Linear SVM

Slide 74

Slide 74 text

Linear SVM

Slide 75

Slide 75 text

Linear SVM via SGD

Slide 76

Slide 76 text

Linear SVM via SGD

Slide 77

Slide 77 text

Dual Problem for SVM

Slide 78

Slide 78 text

Coordinate Descent

Slide 79

Slide 79 text

Communication-efficient Distributed Dual Coordinate Ascent Algorithm (CoCoA)

Slide 80

Slide 80 text

How often should we sync the w vector?

Slide 81

Slide 81 text

How often should we sync the w vector?

Slide 82

Slide 82 text

How often should we sync the w vector?

Slide 83

Slide 83 text

Main idea
1. Spread the data among partitions
2. Initialize the dual variables and the initial weights
3. Associate vectors with the corresponding dual variables
    ○ Run the local stochastic dual coordinate ascent (SDCA) method on each partition
    ○ Update the dual variables
4. Update the global weights and repeat

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

SVM Pain

Slide 87

Slide 87 text

Kernel Trick

Slide 88

Slide 88 text

Kernel Trick

Slide 89

Slide 89 text

The main problem with SVM No distributed SVM with any kernel except linear

Slide 90

Slide 90 text

KNN

Slide 91

Slide 91 text

kNN (k-nearest neighbor)

Slide 92

Slide 92 text

Distributed kNN (First version)
1. Compute the cross (Cartesian) product between the data we wish to classify and our training data
2. Ship the pairs evenly across all of our machines
3. Compute the distance between each pair of points locally
4. Reduce, for each data point we wish to classify, to the K smallest distances, which we then use to predict

Slide 93

Slide 93 text

Distributed kNN (Second version, sketched below)
1. Spread the data across N machines
2. For each point to predict, find the k nearest neighbours on each node (k * N candidates in total)
3. Collect the k * N candidates on the Reducer and re-select the k closest neighbours
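A sketch of the second version for one query point in plain Java: each partition returns its k best candidates, the reducer merges the k * N candidates, keeps the global top k and votes. The data layout, class names and the majority vote are illustrative.

import java.util.*;
import java.util.stream.Collectors;

public class DistributedKnn {
    record Neighbor(double distance, String label) {}

    // Map phase: the k nearest neighbours of the query inside one partition.
    static List<Neighbor> localTopK(List<double[]> points, List<String> labels, double[] query, int k) {
        List<Neighbor> all = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            double d = 0;
            for (int j = 0; j < query.length; j++) d += Math.pow(points.get(i)[j] - query[j], 2);
            all.add(new Neighbor(Math.sqrt(d), labels.get(i)));
        }
        return all.stream().sorted(Comparator.comparingDouble(Neighbor::distance)).limit(k).toList();
    }

    public static void main(String[] args) {
        double[] query = {0.9, 1.0};
        int k = 3;

        // Two partitions of training data produce k candidates each.
        List<Neighbor> candidates = new ArrayList<>();
        candidates.addAll(localTopK(List.of(new double[]{1, 1}, new double[]{5, 5}), List.of("A", "B"), query, k));
        candidates.addAll(localTopK(List.of(new double[]{0, 1}, new double[]{6, 6}), List.of("A", "B"), query, k));

        // Reduce phase: re-select the k closest among the k * N candidates and take a majority vote.
        String prediction = candidates.stream()
            .sorted(Comparator.comparingDouble(Neighbor::distance)).limit(k)
            .collect(Collectors.groupingBy(Neighbor::label, Collectors.counting()))
            .entrySet().stream().max(Map.Entry.comparingByValue()).orElseThrow().getKey();
        System.out.println(prediction);   // A
    }
}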

Slide 94

Slide 94 text

The main problem with kNN No real training phase

Slide 95

Slide 95 text

Approximate Nearest Neighbours
1. Spread the training data across N machines
2. Find a limited set of candidates S that represents all the training data, using some procedure A
3. Spread the test data across M machines together with the S candidates
4. Classify locally with a local kNN based on the S candidates

Slide 96

Slide 96 text

Model Evaluation

Slide 97

Slide 97 text

Model Evaluation with K-fold cross validation

Slide 98

Slide 98 text

K-fold Cross-Validation (see the sketch below)
● It can generate K training-and-evaluation tasks that run in parallel
● Results can be merged on one node or in a distributed data primitive
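A minimal sketch of generating the K train/evaluate tasks in parallel with plain Java streams; the trainer and scorer are stand-ins for whatever the pipeline actually uses, and all names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.IntStream;

public class KFoldCrossValidation {
    // Evaluate one train/score function with K folds, one independent task per fold, all in parallel.
    static double crossValidate(List<double[]> data, int k,
                                BiFunction<List<double[]>, List<double[]>, Double> trainAndScore) {
        return IntStream.range(0, k).parallel().mapToDouble(fold -> {
            List<double[]> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < data.size(); i++)
                (i % k == fold ? test : train).add(data.get(i));
            return trainAndScore.apply(train, test);   // one training + evaluation task
        }).average().orElse(Double.NaN);               // merge the K scores on one node
    }

    public static void main(String[] args) {
        List<double[]> data = List.of(new double[]{1}, new double[]{2}, new double[]{3},
                                      new double[]{4}, new double[]{5}, new double[]{6});
        // Dummy "model": the score is the size of the training fold, just to show the plumbing.
        double score = crossValidate(data, 3, (train, test) -> (double) train.size());
        System.out.println(score);   // 4.0
    }
}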

Slide 99

Slide 99 text

Ensembles in distributed mode

Slide 100

Slide 100 text

Empirical rule The computational cost of training several classifiers on subsets of data is lower than training one classifier on the whole data set

Slide 101

Slide 101 text

Machine Learning Ensemble Model Averaging (see the sketch below)
● Ensemble as a mean value of predictions
● Majority-based ensemble
● Ensemble as a weighted sum of predictions
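The three averaging strategies in a few lines of Java; the "models" are plain functions and the voting threshold is made up, just to show how the predictions are combined.

import java.util.List;
import java.util.function.DoubleUnaryOperator;

public class EnsembleAveraging {
    public static void main(String[] args) {
        List<DoubleUnaryOperator> models = List.of(x -> x + 1, x -> 2 * x, x -> x * x);
        double[] weights = {0.5, 0.3, 0.2};
        double x = 3.0;

        // Mean value of predictions.
        double mean = models.stream().mapToDouble(m -> m.applyAsDouble(x)).average().orElseThrow();

        // Weighted sum of predictions.
        double weighted = 0;
        for (int i = 0; i < models.size(); i++) weighted += weights[i] * models.get(i).applyAsDouble(x);

        // Majority vote (for classifiers): here each model votes 1 if its prediction exceeds a threshold.
        long votesForOne = models.stream().filter(m -> m.applyAsDouble(x) > 5).count();
        int majority = votesForOne * 2 > models.size() ? 1 : 0;

        System.out.println(mean + " " + weighted + " " + majority);   // 6.333... 5.6 1
    }
}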

Slide 102

Slide 102 text

Random Forest

Slide 103

Slide 103 text

Distributed Random Forest

Slide 104

Slide 104 text

Distributed Random Forest on Histograms

Slide 105

Slide 105 text

Bagging

Slide 106

Slide 106 text

Boosting

Slide 107

Slide 107 text

Stacking

Slide 108

Slide 108 text

How to contribute?

Slide 109

Slide 109 text

Apache Ignite Community: > 200 contributors in total, 8 contributors to the ML module. Resources: VK Group, blog posts, Ignite Documentation, ML Documentation.

Slide 110

Slide 110 text

Roadmap for Ignite 3.0
● NLP (TF-IDF, Word2Vec)
● More integration with TF
● Clustering: LDA, Bisecting K-Means
● Naive Bayes and a statistical package
● Dimensionality reduction
● … a lot of tasks for beginners :)

Slide 111

Slide 111 text

The complexity of ML algorithms (assume n is the sample size and p the number of features)

Algorithm       | Training complexity           | Prediction complexity
Naive Bayes     | O(n*p)                        | O(p)
kNN             | O(1)                          | O(n*p)
ANN             | O(n*p) + K-Means complexity   | O(p)
Decision Tree   | O(n^2*p)                      | O(p)
Random Forest   | O(n^2*p*number of trees)      | O(p*number of trees)
SVM             | O(n^2*p + n^3)                | O(number of support vectors * p)
Multi-class SVM | O(O(SVM) * number of classes) | O(O(SVM) * number of classes * O(sort(classes)))

Slide 112

Slide 112 text

Papers and links
1. A survey of methods for distributed machine learning
2. Strategies and Principles of Distributed Machine Learning on Big Data
3. Distributed k-means algorithm
4. MapReduce Algorithms for k-means Clustering
5. An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication
6. Communication-Efficient Distributed Dual Coordinate Ascent
7. Distributed K-Nearest Neighbors

Slide 113

Slide 113 text

E-mail : [email protected] Twitter : @zaleslaw @BigDataRussia vk.com/big_data_russia Big Data Russia + Telegram @bigdatarussia vk.com/java_jvm Java & JVM langs + Telegram @javajvmlangs Follow me