
Scalable Machine Learning with SPARK

MLbase, MLlib, MLI

Sam Bessalah

April 09, 2014
Transcript

  1. SCALABLE MACHINE LEARNING
    with SPARK
    Sam Bessalah - @samklr
    Software Engineer
    These days : Data Pipelines Plumber


  5. Why SPARK for Machine Learning?


  6. It’s less about ML itself than about having an agile data science
    environment.


  7. • A dataflow system materialised by immutable distributed collections
    with in-memory data, suited for iterative and complex workloads
    • Ships its own ML library (MLlib) and more …
    • Built in Scala and fully interoperable with Java, so it can reuse
    most of the work done on the JVM
    • Fluent and intuitive API, thanks to Scala’s functional style
    • Comes with a REPL, like R and Python, for quick feedback and
    interactive testing
    • Not just for ML: also a fit for data pipelines, integration, data access
    and ETL
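That fluent, functional style can be sketched with plain Scala collections, which the RDD API deliberately mirrors. This is a toy word count with made-up data, not actual Spark code; no cluster is involved:

```scala
// Toy word count in the fluent, functional style the slide describes.
// Plain Scala collections stand in for Spark RDDs here; the RDD API
// exposes the same map/flatMap/group shape.
val lines = Seq("spark makes ml scalable", "ml on spark is fast")

val wordCounts = lines
  .flatMap(_.split("\\s+"))     // tokenize each line
  .map(word => (word, 1))       // emit (word, 1) pairs
  .groupBy(_._1)                // plays the role of reduceByKey in Spark
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```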



  11. MATLAB Example
    - Integrated environment for ML development
    - Data access and processing tools
    - Leverages and extends the LAPACK functionality
    But it doesn’t scale to distributed environments.
    (Diagram: single-node stack, the MATLAB interface on top of LAPACK)


  12. (Diagram: the single-node MATLAB interface over LAPACK, set against
    MLBASE over SPARK)


  13. MLlib : low-level ML algorithms implemented as Spark’s
    standard ML library
    (Diagram: MLlib added to the MLBASE/SPARK stack, next to the
    MATLAB/LAPACK stack)


  14. MLlib : low-level ML algorithms in SPARK.
    MLI : API / platform for feature extraction, data pre-processing and
    algorithm consumption. Aims to be platform independent.
    (Diagram: MLI layered above MLlib in the MLBASE/SPARK stack)


  15. Spark : in-memory, fast cluster computing system
    MLlib : low-level ML algorithms in Spark.
    MLI : API / platform for feature extraction, data pre-processing and
    algorithm consumption. Aims to be platform independent.
    ML Optimizer : automates model selection by solving a search problem
    over feature extractors
    (Diagram: the full MLBASE stack, with the ML Optimizer over MLI over
    MLlib over SPARK)


  16. MLlib
    • Core ML algorithms implemented using the Spark
    programming model
    • So far contains algorithms for :
    - Regression : Ridge, Lasso
    - Classification : Support Vector Machines, Logistic Regression
    - RecSys : Matrix Factorisation with ALS
    - Clustering : K-means
    - Optimisation : Stochastic Gradient Descent
    More being contributed …
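To illustrate the optimisation entry in that list, here is a minimal single-machine sketch of stochastic gradient descent for least-squares regression in plain Scala; MLlib applies the same kind of update over distributed partitions. The data and learning rate are invented for the example:

```scala
// Minimal single-machine stochastic gradient descent for least-squares
// regression. The data follows y = 2x, so the weight should converge
// towards 2.0.
val data = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) // (x, y) pairs

var w = 0.0                        // single weight, no intercept
val lr = 0.05                      // learning rate
for (_ <- 1 to 200; (x, y) <- data) {
  val grad = (w * x - y) * x       // gradient of 0.5 * (w*x - y)^2
  w -= lr * grad
}
// w now approximates the true slope 2.0
```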


  17. MLI : ML developer API
    • Aims to shield ML developers from runtime implementations.
    • High-level abstractions and operators to build models compatible with parallel
    data processing tools.
    • Linear algebra : MLMatrix, MLRow, MLVector …
    - Linear algebra on local partitions
    - Sparse and dense matrix support
    • Table computations : MLTable
    - Similar to an R/pandas (Python) DataFrame or a NumPy array
    - Flexibility when loading / processing data
    - Common interface for feature extraction

  18. ML Optimizer
    • Aimed at non-ML developers
    • Specify ML tasks declaratively
    • Have the system do all the heavy lifting using MLI and MLlib.
    var X = load("local_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)
    Not available yet; currently under development.



  20. Example Workflow : Document Classification with MLI


  21. Typical data analysis workflow (built up step by step on slides 21 to 25):
    Load raw data (Spark / MLI) → Data exploration / feature extraction
    (Spark / MLI) → Learn → Evaluate model → Test / feature engineering →
    Deploy application



  26. • But MLI is still under development.
    Good for some small things, but not quite there yet.
    Not a full-blown machine learning library yet.


  27. MLlib
    • Core machine learning algorithms in the Spark stdlib.
    • Primarily written in Scala, but usable from PySpark via PythonMLLibAPI
    • So far contains algorithms for :
    - Regression : Ridge, Lasso, Linear
    - Classification : Support Vector Machines, Logistic Regression,
    Naive Bayes, Decision Trees
    - Linear Algebra : DistributedMatrix, RowMatrix, etc.
    - Recommenders : Alternating Least Squares (SVD++ in GraphX)
    - Clustering : K-means
    - Optimisation : Stochastic Gradient Descent
    - …
    More being contributed … Look at the Spark JIRA and the Spark 1.0 release.
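The clustering entry can be illustrated with a minimal plain-Scala sketch of k-means (Lloyd's algorithm) on one-dimensional toy data; MLlib's version distributes the assignment step across partitions. The points and initial centers are invented for the example:

```scala
// Minimal plain-Scala k-means (Lloyd's algorithm) on 1-D toy data:
// two obvious clusters around 1.0 and 9.0.
val points = Seq(1.0, 1.2, 0.8, 9.0, 9.5, 8.5)
var centers = Seq(0.0, 5.0)        // invented initial centers

for (_ <- 1 to 10) {
  // assign each point to its nearest center, then recompute each center
  // as the mean of its assigned points
  val clusters = points.groupBy(p => centers.minBy(c => math.abs(p - c)))
  centers = centers.map(c =>
    clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
}
// centers converge near 1.0 and 9.0
```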



  29. MLlib Example
    • Example from Sean R. Owen of Cloudera :
    http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
    • Stack Overflow tag suggestions
    • Build a model that can suggest new tags for questions
    based on their existing tags, using the alternating least
    squares recommender algorithm
    • Questions are “users” and tags are “items”


  30. Interactive data processing
    Result from the REPL


  31. Transform XML data into a distributed collection of
    (questionID, tag) tuples.


  32. Transform strings into numeric values through hashing, and get tuples.
    Convert those tuples into Rating(userId, productId, rating) to be fed to ALS.
    We get a factored matrix model, from which we can predict recommendations
    (the prediction step is not in MLlib).
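The hashing step above can be sketched in plain Scala: map string question IDs and tags to stable non-negative integers, then build the (user, product, rating) triples of the shape ALS expects. The `stableId` helper and the data are invented for this example:

```scala
// Sketch of the hashing step: turn string IDs into stable non-negative
// integers so they can serve as ALS user/product IDs.
def stableId(s: String): Int = s.hashCode & 0x7fffffff // drop the sign bit

val pairs = Seq(("7122697", "sql"), ("7122697", "performance"))
val ratings = pairs.map { case (question, tag) =>
  (stableId(question), stableId(tag), 1.0) // implicit "rating" of 1.0
}
```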



  35. And to call it, pick any question with at least four tags, like “How to make
    substring-matching query work fast on a large table?” and get its ID from
    the URL. Here, that’s 7122697:


  36. - MLlib is still in its infancy.
    - It remains a set of algorithms that requires some extra
    work to get value from.
    - Many people are contributing, and a lot more are trying
    to build on top of it.
    - Look for the Spark 1.0 release.
    - A new version of MLI should join Spark soon too.


  37. Spark Bindings for MAHOUT
    • Scala DSL and algebraic optimizer for matrix
    operations in Mahout.
    • SPARK is being adopted as a second backend
    for Mahout (even Oryx)
    • Provides so far algorithms for :
    - Mahout’s Distributed Row Matrices (DRM)
    - ALS and co-occurrence (coming soon)
    - SSVD



  39. Co-occurrence analysis in Mahout on Spark:

    val A = drmFromHDFS(…)

    // Compute co-occurrences
    val C = A.t %*% A

    // Find anomalous co-occurrences:
    // compute & broadcast the number of interactions per item
    val numInteractions = drmBroadcast(A.colSums)

    // Create the indicator matrix
    val I = C.mapBlock() {
      case (keys, block) =>
        val indicatorBlock = sparse(block.nrow, block.ncol)
        for (row <- block)
          indicatorBlock(row.index, ::) = ComputeLLR(row, numInteractions)
        keys -> indicatorBlock
    }
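The `A.t %*% A` co-occurrence computation in that snippet can be sketched at toy scale in plain Scala, counting how often two items occur in the same user's interaction list. The interaction data is invented for the example:

```scala
// Toy-scale version of the A.t %*% A co-occurrence computation:
// count how often each pair of items appears for the same user.
val interactions = Seq("u1" -> "a", "u1" -> "b",
                       "u2" -> "a", "u2" -> "b", "u2" -> "c")

val itemsByUser = interactions.groupBy(_._1)
  .map { case (u, ps) => (u, ps.map(_._2)) }

val cooccur = itemsByUser.values
  .flatMap(items => for (i <- items; j <- items if i != j) yield (i, j))
  .groupBy(identity)
  .map { case (pair, occs) => (pair, occs.size) }
```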


  40. One stack to rule them all!


  41. A lot more than MLbase
    GraphX : graph processing engine for Spark. Built on the
    GraphLab ideas, and provides some machine learning functions
    like PageRank, ALS, SVD++ and TriangleCount.
    SparkR and PySpark : leverage R and Python data analysis
    tools on top of SPARK.
    Spark Streaming : stream processing on Spark, a great
    abstraction for stream mining and analytics (think Algebird).
    Spark SQL and SHARK : to build machine learning workflows.


  42. http://speakerdeck.com/samklr/
