Slide 1

SCALABLE MACHINE LEARNING with SPARK
Sam Bessalah - @samklr
Software Engineer. These days: data pipelines plumber.

Slide 2

No content

Slide 3

No content

Slide 4

No content

Slide 5

Why SPARK for Machine Learning?

Slide 6

It’s less about ML itself, and more about having an agile data science environment.

Slide 7

• Dataflow system materialized as immutable distributed collections with in-memory data, suited to iterative and complex workloads
• Its own ML library (MLlib), and more…
• Built in Scala, fully interoperable with Java; can reuse most of the work done on the JVM
• Fluent and intuitive API, thanks to Scala's functional style
• Comes with a REPL, like R and Python, for quick feedback and interactive testing (see the sketch below)
• Not just for ML: also fits data pipelines, integration, data access, and ETL
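
A minimal sketch of that interactive, in-memory style from the Spark shell, where sc is the shell's built-in SparkContext (the log path is hypothetical):

val lines = sc.textFile("hdfs://.../app.log").cache() // keep the RDD in memory
val errors = lines.filter(_.contains("ERROR"))
errors.count() // first action reads from disk and populates the cache
errors.count() // later actions hit memory; this reuse is what makes iterative workloads fast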

Slide 8

No content

Slide 9

No content

Slide 10

No content

Slide 11

MATLAB Example
- Integrated environment for ML development
- Data access and processing tools
- Leverages and extends the LAPACK functionality
But it doesn't scale to distributed environments.
[Diagram: the MATLAB stack (Matlab interface over LAPACK) on a single node]

Slide 12

[Diagram: the single-node MATLAB stack (Matlab interface over LAPACK) next to the MLBASE stack on SPARK]

Slide 13

MLlib: low-level ML algorithms implemented as Spark's standard ML library.
[Diagram: MLlib layered over SPARK in the MLBASE stack, mirroring the MATLAB stack]

Slide 14

MLlib: low-level ML algorithms in SPARK.
MLI: API / platform for feature extraction, data pre-processing and algorithm consumption. Aims to be platform independent.
[Diagram: MLI layered over MLlib and SPARK in the MLBASE stack]

Slide 15

Spark: in-memory, fast cluster computing system.
MLlib: low-level ML algorithms in Spark.
MLI: API / platform for feature extraction, data pre-processing and algorithm consumption. Aims to be platform independent.
ML Optimizer: automates model selection by solving a search problem over feature extractors.
[Diagram: the full MLBASE stack: ML Optimizer over MLI, MLlib and SPARK]

Slide 16

MLlib
• Core ML algorithms implemented using the Spark programming model
• So far contains algorithms for:
  - Regression: ridge, lasso
  - Classification: support vector machines, logistic regression
  - RecSys: matrix factorisation with ALS
  - Clustering: k-means
  - Optimisation: stochastic gradient descent
• More being contributed…
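
For instance, clustering with k-means from the shell (a sketch, assuming the Spark 1.0 MLlib API; the CSV path and layout are hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// one comma-separated coordinate vector per line
val points = sc.textFile("hdfs://.../points.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 5, 20) // k = 5 clusters, at most 20 iterations
model.clusterCenters.foreach(println)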

Slide 17

MLI: ML developer API
• Aims to shield ML developers from runtime implementations.
• High-level abstractions and operators to build models compatible with parallel data processing tools.
• Linear algebra: MLMatrix, MLRow, MLVector…
  - Linear algebra on local partitions
  - Sparse and dense matrix support
• Table computations: MLTable
  - Similar to an R/pandas DataFrame or a NumPy array
  - Flexibility when loading / processing data
  - Common interface for feature extraction

Slide 18

ML Optimizer
• Aimed at non-ML developers
• Specify ML tasks declaratively
• Have the system do all the heavy lifting using MLI and MLlib:

var X = load("local_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

Not available yet; currently under development.

Slide 19

No content

Slide 20

Example Workflow: Document Classification with MLI

Slide 21

Typical data analysis workflow: load raw data (Spark / MLI).

Slide 22

Typical data analysis workflow: load raw data (Spark / MLI) → data exploration / feature extraction (Spark / MLI).

Slide 23

Typical data analysis workflow: load raw data (Spark / MLI) → data exploration / feature extraction (Spark / MLI) → learn.

Slide 24

Typical data analysis workflow: load raw data (Spark / MLI) → data exploration / feature extraction (Spark / MLI) → learn → evaluate model.

Slide 25

Typical data analysis workflow: load raw data (Spark / MLI) → data exploration / feature extraction (Spark / MLI) → learn → evaluate model (iterating with test / feature engineering) → deploy application.

Slide 26

• But MLI is still under development: good for some small stuff, but not quite there yet.
• Not a full-blown machine learning library yet.

Slide 27

MLlib
• Core machine learning algorithms in the Spark standard library.
• Primarily written in Scala, but usable from PySpark via PythonMLLibAPI.
• So far contains algorithms for:
  - Regression: ridge, lasso, linear
  - Classification: support vector machines, logistic regression, naive Bayes, decision trees
  - Linear algebra: DistributedMatrix, RowMatrix, etc.
  - Recommenders: alternating least squares (SVD++ in GraphX)
  - Clustering: k-means
  - Optimisation: stochastic gradient descent
  - …
• More being contributed… look at the Spark JIRA and the Spark 1.0 release.
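
As a taste, training a classifier (a sketch, assuming the Spark 1.0 MLlib API; the input path and its "label,f1,f2,..." layout are hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.textFile("hdfs://.../training.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

val model = LogisticRegressionWithSGD.train(training, 100) // 100 SGD iterations
model.predict(Vectors.dense(0.5, 1.2)) // predicted label for a new point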

Slide 28

No content

Slide 29

MLlib Example
• Example from Sean R. Owen of Cloudera: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
• Stack Overflow tag suggestions
• Build a model that can suggest new tags for questions based on their existing tags, using the alternating least squares recommender algorithm
• Questions are “users” and tags are “items”

Slide 30

Interactive data processing: results from the REPL.

Slide 31

Transform XML data into a distributed collection of (questionID, tag) tuples.
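
A sketch of that step, assuming the Stack Overflow dump format (one <row .../> element per line, with Id and Tags attributes, where the parsed tags look like "<scala><apache-spark>"):

import scala.xml.XML

val postsXml = sc.textFile("hdfs://.../Posts.xml") // hypothetical path

val postIdTags = postsXml
  .map(_.trim)
  .filter(_.startsWith("<row"))
  .flatMap { line =>
    val row = XML.loadString(line)
    val questionId = (row \ "@Id").text.toInt
    (row \ "@Tags").text // e.g. "<scala><apache-spark>"
      .split("><")
      .map(_.replaceAll("[<>]", ""))
      .filter(_.nonEmpty)
      .map(tag => (questionId, tag))
  }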

Slide 32

Transform strings into numeric values through hashing (the hashing itself is not in MLlib), and get (Int, Int) tuples.
Convert those tuples into Rating(userId, productId, rating) to be fed to ALS.
We get a factored matrix model, from which we can predict recommendations.
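
Roughly, following the blog post (a sketch; postIdTags comes from the previous step, and the hash function is illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

def hash(s: String): Int = s.hashCode & 0x7FFFFFFF // non-negative item id

val ratings = postIdTags.map { case (questionId, tag) =>
  Rating(questionId, hash(tag), 1.0) // a question "rates" each of its tags once
}.cache()

val model = ALS.trainImplicit(ratings, 40, 10) // rank-40 factorization, 10 iterations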

Slide 33

No content

Slide 34

No content

Slide 35

And to call it, pick any question with at least four tags, like “How to make substring-matching query work fast on a large table?” and get its ID from the URL. Here, that’s 7122697:
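
Scoring candidate tags for that question might look like this (a sketch, reusing postIdTags, hash and model from the steps above):

import org.apache.spark.SparkContext._ // pair-RDD operations such as keys and join

val questionId = 7122697

val tagsByHash = postIdTags.map(_._2).distinct().map(t => (hash(t), t)).cache()

val suggestions = model
  .predict(tagsByHash.keys.map(h => (questionId, h))) // score every known tag
  .map(r => (r.product, r.rating))
  .join(tagsByHash) // recover tag names from their hashes
  .map { case (_, (score, tag)) => (score, tag) }
  .top(5) // the five highest-scoring tags

suggestions.foreach(println)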

Slide 36

- MLlib is still in its infancy.
- It remains a set of algorithms that requires some extra work to get value from.
- Many people are contributing, and a lot more are trying to build on top of it.
- Look for the Spark 1.0 release.
- A new version of MLI should join Spark soon too.

Slide 37

Spark Bindings for MAHOUT
• Scala DSL and algebraic optimizer for matrix operations in Mahout.
• SPARK is being adopted as a second backend for Mahout (even Oryx).
• Provides algorithms so far for:
  - Mahout's Distributed Row Matrices (DRM)
  - ALS and co-occurrence (coming soon)
  - SSVD

Slide 38

No content

Slide 39

Co-occurrence analysis in Mahout on Spark:

val A = drmFromHDFS(…)

// compute co-occurrences
val C = A.t %*% A

// find anomalous co-occurrences:
// compute & broadcast the number of interactions per item
val numInteractions = drmBroadcast(A.colSums)

// create the indicator matrix
val I = C.mapBlock() { case (keys, block) =>
  val indicatorBlock = sparse(block.nrow, block.ncol)
  for (row <- block)
    indicatorBlock(row.index, ::) = ComputeLLR(row, numInteractions)
  keys -> indicatorBlock
}

Slide 40

One stack to rule them all!

Slide 41

A lot more than MLbase:
• GraphX: graph processing engine for Spark. Built on GraphLab's ideas, and provides some machine learning functions such as PageRank, ALS, SVD++, and triangle counting (see the sketch below).
• SparkR and PySpark: leverage R and Python data analysis tools on SPARK.
• Spark Streaming: stream processing on Spark, a great abstraction for stream mining and analytics (think Algebird).
• Spark SQL and Shark: to build machine learning workflows.
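
A quick taste of GraphX (a sketch; the edge-list path is hypothetical):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")
val ranks = graph.pageRank(0.0001).vertices // run PageRank to a 1e-4 tolerance
ranks.map(_.swap).top(5).foreach(println) // the five highest-ranked vertices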

Slide 42

http://speakerdeck.com/samklr/

Slide 43

No content