
Scalable Machine Learning with SPARK

MLBASE, MLlib, MLI

Sam Bessalah

April 09, 2014

Transcript

  1. SCALABLE MACHINE LEARNING with SPARK
     Sam Bessalah - @samklr
     Software Engineer. These days: Data Pipelines Plumber.
  2. • A dataflow system materialized as immutable distributed collections
       with in-memory data, suited for iterative and complex workloads.
     • Has its own ML library (MLlib), and more.
     • Built in Scala, fully interoperable with Java: it can reuse most of
       the work done on the JVM.
     • Fluent and intuitive API, thanks to Scala's functional style.
     • Comes with a REPL, like R and Python, for quick feedback and
       interactive testing (see the snippet below).
     • Not just for ML: also a fit for data pipelines, integration, data
       access and ETL.
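A minimal sketch of what that fluent API looks like inside the spark-shell REPL; the input path is a placeholder, and `sc` is the SparkContext the shell provides.

     val lines  = sc.textFile("hdfs:///path/to/logs")        // placeholder path
     val errors = lines.filter(_.contains("ERROR")).cache()  // kept in memory for reuse
     println(errors.count())                                 // the action triggers computation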
  3. MATLAB Example
     - Integrated environment for ML development.
     - Data access and processing tools.
     - Leverages and extends the LAPACK functionality.
     But it doesn't scale to distributed environments.
     [Diagram: MATLAB interface on top of single-node LAPACK]
  4. MLlib: low-level ML algorithms implemented as Spark's standard ML
     library.
     [Diagram: MLlib on Spark in the MLBASE stack, mirroring the MATLAB
     interface on single-node LAPACK]
  5. MLlib: low-level ML algorithms in Spark.
     MLI: an API / platform for feature extraction, data pre-processing and
     algorithm consumption. Aims to be platform independent.
     [Diagram: MLI added above MLlib in the MLBASE stack, mirroring the
     MATLAB / LAPACK stack]
  6. Spark: in-memory, fast cluster computing system.
     MLlib: low-level ML algorithms in Spark.
     MLI: an API / platform for feature extraction, data pre-processing and
     algorithm consumption. Aims to be platform independent.
     ML Optimizer: automates model selection by solving a search problem
     over feature extractors.
     [Diagram: the MLBASE stack - ML Optimizer, MLI, MLlib on Spark]
  7. MLlib
     • Core ML algorithms implemented using the Spark programming model.
     • So far contains algorithms for:
       - Regression: Ridge, Lasso
       - Classification: Support Vector Machines, Logistic Regression
       - RecSys: Matrix Factorisation with ALS
       - Clustering: K-means
       - Optimisation: Stochastic Gradient Descent
     More being contributed. (A usage sketch follows.)
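As a hedged illustration of the list above, this is roughly how one of these algorithms, logistic regression trained with SGD, is invoked through the Spark 1.0-era MLlib API; the file path, CSV layout and iteration count are illustrative.

     import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
     import org.apache.spark.mllib.linalg.Vectors
     import org.apache.spark.mllib.regression.LabeledPoint

     // Each line: label,feature1,feature2,... (illustrative layout)
     val training = sc.textFile("data/train.csv").map { line =>
       val parts = line.split(',').map(_.toDouble)
       LabeledPoint(parts.head, Vectors.dense(parts.tail))
     }
     val model = LogisticRegressionWithSGD.train(training, 100) // 100 SGD iterations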
  8. MLI: ML developer API
     • Aims to shield ML developers from runtime implementations.
     • High-level abstractions and operators to build models compatible
       with parallel data processing tools.
     • Linear algebra: MLMatrix, MLRow, MLVector ...
       - Linear algebra on local partitions
       - Sparse and dense matrix support
     • Table computations: MLTable
       - Similar to an R / pandas DataFrame or a NumPy array
       - Flexibility when loading / processing data
       - Common interface for feature extraction
     (A hypothetical sketch follows.)
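Public MLI examples are still scarce, so the following is a purely hypothetical sketch of the MLTable-style workflow the slide describes; `loadCsv`, `select` and `toMatrix` are illustrative names, not confirmed MLI signatures.

     // Hypothetical sketch only: the method names below illustrate the
     // intended shape of the API, not actual MLI signatures.
     val table = MLTable.loadCsv(sc, "data/raw.csv") // R/pandas-like table
     val feats = table.select(2 to 10)               // keep the feature columns
     val X     = feats.toMatrix                      // MLMatrix for local linear algebra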
  9. ML Optimizer
     • Aimed at non-ML developers.
     • Specify ML tasks declaratively.
     • Have the system do all the heavy lifting using MLI and MLlib.

       var X = load("local_file", 2 to 10)
       var y = load("text_file", 1)
       var (fn-model, summary) = doClassify(X, y)

     Not available yet; currently under development.
  10. Typical data analysis workflow:
      Load Raw Data (Spark/MLI) -> Data Exploration / Feature Extraction
      (Spark/MLI) -> Learn
  11. Typical data analysis workflow, continued:
      Load Raw Data -> Data Exploration / Feature Extraction -> Learn ->
      Evaluate Model
  12. Typical data analysis workflow, complete:
      Load Raw Data -> Data Exploration / Feature Extraction -> Learn ->
      Evaluate Model -> Test / Feature Eng. -> Deploy Application
  13. But MLI is still under development. Good for some small stuff, but
      not quite there yet. Not a full-blown machine learning library yet.
  14. MLlib
      • Core machine learning algorithms in the Spark stdlib.
      • Primarily written in Scala, but usable from PySpark via
        PythonMLLibAPI.
      • So far contains algorithms for:
        - Regression: Ridge, Lasso, Linear
        - Classification: Support Vector Machines, Logistic Regression,
          Naive Bayes, Decision Trees
        - Linear Algebra: DistributedMatrix, RowMatrix, etc.
        - Recommenders: Alternating Least Squares (SVD++ in GraphX)
        - Clustering: K-means
        - Optimisation: Stochastic Gradient Descent
        - ... more being contributed. Look at the Spark JIRA and the
          Spark 1.0 release.
      (A K-means sketch follows.)
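A second hedged example from the list above: MLlib's K-means clustering in the Spark 1.0-era API; the input path, k and iteration count are illustrative.

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // One space-separated point per line (illustrative layout)
      val points = sc.textFile("data/points.txt")
        .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      val model = KMeans.train(points, 2, 20) // k = 2, maxIterations = 20
      model.clusterCenters.foreach(println)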
  15. MLlib Example
      • Example from Sean R. Owen of Cloudera:
        http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
      • Stack Overflow tag suggestions.
      • Build a model that can suggest new tags for questions based on
        their existing tags, using the alternating least squares
        recommender algorithm.
      • Questions are "users" and tags are "items".
  16. Transform the tag strings into numeric values through hashing, and
      get (question, tag) tuples. Convert those tuples into
      Rating(userId, productId, rating) objects to be fed to ALS. We get a
      factored matrix model, from which we can predict recommendations.
      (The hashing step is not in MLlib itself.)
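A hedged reconstruction of that step, after Sean Owen's post: `tagPairs` is an assumed RDD of (numeric question id, tag string) pairs, and the ALS parameters are illustrative. Since a tag is either present or absent, the implicit-feedback variant of ALS is the natural fit.

      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      // Hash each tag string to an Int id; this quick trick is the part
      // that is not in MLlib itself.
      val ratings = tagPairs.map { case (questionId, tag) =>
        Rating(questionId, tag.hashCode, 1.0)
      }
      // rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0 (illustrative)
      val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)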
  17. And to call it, pick any question with at least four tags, like
      "How to make substring-matching query work fast on a large table?",
      and get its ID from the URL. Here, that's 7122697.
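A hedged sketch of that call: `model` is the MatrixFactorizationModel from the previous snippet, and `allTagIds` is an assumed RDD of the hashed tag ids.

      val questionId = 7122697 // the id lifted from the question URL
      val candidates = allTagIds.map(tagId => (questionId, tagId))
      val suggested  = model.predict(candidates) // RDD[Rating]
      // Show the five highest-scoring (most likely) tags
      suggested.collect().sortBy(-_.rating).take(5).foreach(println)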
  18. - MLlib is still in its infancy.
      - It remains a set of algorithms that requires some extra work to
        get value from.
      - Many people are contributing, and a lot more are trying to build
        on top of it.
      - Look for the Spark 1.0 release.
      - A new version of MLI should join Spark soon too.
  19. Spark Bindings for MAHOUT
      • Scala DSL and algebraic optimizer for matrix operations in Mahout.
      • Spark is being adopted as a second backend for Mahout (even Oryx).
      • Provides so far:
        - Mahout's Distributed Row Matrices (DRM)
        - ALS and co-occurrence analysis (coming soon)
        - SSVD
  20. Co-occurrence analysis in Mahout on Spark:

      val A = drmFromHDFS(…)

      // Compute co-occurrences
      val C = A.t %*% A

      // Find anomalous co-occurrences:
      // compute & broadcast the number of interactions per item
      val numInteractions = drmBroadcast(A.colSums)

      // create the indicator matrix
      val I = C.mapBlock() { case (keys, block) =>
        val indicatorBlock = sparse(block.nrow, block.ncol)
        for (row <- block)
          indicatorBlock(row.index, ::) = ComputeLLR(row, numInteractions)
        keys -> indicatorBlock
      }
  21. A lot more than MLbase
      • GraphX: graph processing engine for Spark. Built on ideas from
        GraphLab, and provides some machine learning functions such as
        PageRank, ALS, SVD++ and TriangleCount (sketch below).
      • SparkR and PySpark: leverage R and Python data analysis tools on
        Spark.
      • Spark Streaming: stream processing on Spark, and a great
        abstraction for stream mining and analytics (think Algebird).
      • SparkSQL and Shark: to build machine learning workflows.
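For instance, a hedged sketch of GraphX's built-in PageRank with the Spark 1.0-era API; the edge-list path and convergence tolerance are illustrative.

      import org.apache.spark.graphx.GraphLoader

      // One "srcId dstId" pair per line (illustrative edge-list file)
      val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")
      val ranks = graph.pageRank(0.0001).vertices // (vertexId, rank) pairs
      ranks.take(5).foreach(println)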