
Scalable Machine Learning with SPARK

MLbase, MLlib, MLI

Sam Bessalah

April 09, 2014

Transcript

  1. SCALABLE MACHINE LEARNING with SPARK
     Sam Bessalah - @samklr
     Software Engineer. These days: Data Pipelines Plumber
  2. None
  3. None
  4. None
  5. Why SPARK for Machine Learning?

  6. It’s less about ML itself than about having an agile data science environment.
  7. • Dataflow system materialized by immutable distributed collections with in-memory data, suited for iterative and complex workloads
     • Its own ML library (MLlib), and more …
     • Built in Scala, fully interoperable with Java: can reuse most of the work done on the JVM
     • Fluent and intuitive API, thanks to Scala’s functional style
     • Comes with a REPL, like R and Python, for quick feedback and interactive testing
     • Not just for ML: also fits data pipelines, integration, data access and ETL
  8. None
  9. None
  10. None
  11. The MATLAB example:
      - Integrated environment for ML development
      - Data access and processing tools
      - Leverages and extends the LAPACK functionality
      But it doesn’t scale to distributed environments.
      (Diagram: Matlab Interface on top of LAPACK, single node.)
  12. (Diagram: the analogous stack: MLbase on top of Spark, beside the single-node Matlab Interface on LAPACK.)

  13. MLlib: low-level ML algorithms implemented as Spark’s standard ML library. (Diagram: MLlib added inside the MLbase/Spark stack.)
  14. MLlib: low-level ML algorithms in Spark. MLI: API / platform for feature extraction, data pre-processing and algorithm consumption; aims to be platform independent. (Diagram: MLI layered above MLlib in the MLbase/Spark stack.)
  15. The MLbase stack:
      - Spark: in-memory, fast cluster computing system
      - MLlib: low-level ML algorithms in Spark
      - MLI: API / platform for feature extraction, data pre-processing and algorithm consumption; aims to be platform independent
      - ML Optimizer: automates model selection by solving a search problem over feature extractors
      (Diagram: ML Optimizer over MLI over MLlib over Spark.)
  16. MLlib
      • Core ML algorithms implemented using the Spark programming model
      • So far contains algorithms for:
        - Regression: Ridge, Lasso
        - Classification: Support Vector Machines, Logistic Regression
        - RecSys: Matrix Factorization with ALS
        - Clustering: K-means
        - Optimization: Stochastic Gradient Descent
      More being contributed …
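Several of the algorithms above (Logistic Regression, SVMs, Lasso) are trained with stochastic gradient descent. As a minimal single-machine sketch of that training loop, here is plain-Python logistic regression; this is illustrative only, not MLlib's distributed implementation:

```python
import math
import random

def train_logistic_sgd(data, num_features, num_iters=100, step=0.1):
    """data: list of (label, features) pairs with label in {0, 1}.
    Single-machine sketch of the SGD loop MLlib distributes over an RDD."""
    w = [0.0] * num_features
    for _ in range(num_iters):
        random.shuffle(data)
        for label, x in data:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-margin))   # sigmoid
            grad_scale = pred - label                # d(log loss)/d(margin)
            for j in range(num_features):
                w[j] -= step * grad_scale * x[j]
    return w

# Tiny sanity check on separable data; x[0] is a bias feature
random.seed(0)
data = [(0, [1.0, -2.0]), (0, [1.0, -1.0]), (1, [1.0, 1.0]), (1, [1.0, 2.0])]
w = train_logistic_sgd(data, 2)
```

After training, the weight on the informative feature should be positive, so positive examples get margins above zero.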
  17. MLI: the ML developer API
      • Aims to shield ML developers from runtime implementations.
      • High-level abstractions and operators to build models compatible with parallel data processing tools.
      • Linear algebra: MLMatrix, MLRow, MLVector …
        - Linear algebra on local partitions
        - Sparse and dense matrix support
      • Table computations: MLTable
        - Similar to an R/pandas DataFrame or a NumPy array
        - Flexibility when loading / processing data
        - Common interface for feature extraction
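To make the MLTable idea concrete, here is a toy plain-Python analogue. The class and method names (`MLTable`, `map_rows`, `select`, `to_matrix`) are illustrative stand-ins, not the actual MLI API: a row-oriented table with one common interface for transformation and for handing a matrix to the learning step.

```python
class MLTable:
    """Toy analogue of MLI's MLTable idea: rows of heterogeneous values
    behind a uniform interface for feature extraction.
    All names here are illustrative, not the real MLI API."""

    def __init__(self, rows):
        self.rows = list(rows)

    def map_rows(self, fn):
        # Apply a per-row transformation (e.g. a feature extractor)
        return MLTable(fn(r) for r in self.rows)

    def select(self, *indices):
        # Keep only some columns, like projecting a DataFrame
        return MLTable(tuple(r[i] for i in indices) for r in self.rows)

    def to_matrix(self):
        # Dense local matrix (list of lists) for the learning step
        return [list(map(float, r)) for r in self.rows]

t = MLTable([("spam", 1, 0.5), ("ham", 0, 0.1)])
X = t.select(1, 2).to_matrix()
```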
  18. ML Optimizer
      • Aimed at non-ML developers
      • Specify ML tasks declaratively
      • Have the system do all the heavy lifting using MLI and MLlib

        var X = load("local_file", 2 to 10)
        var Y = load("text_file", 1)
        var (fn-model, summary) = doClassify(X, y)

      Not available yet; currently under development.
  19. None
  20. Example Worklow : Document Classification with MLI

  21. Typical data analysis workflow: Load Raw Data (Spark / MLI)

  22. Typical data analysis workflow: Load Raw Data → Data Exploration / Feature Extraction (Spark / MLI)

  23. Typical data analysis workflow: Load Raw Data → Data Exploration / Feature Extraction → Learn

  24. Typical data analysis workflow: Load Raw Data → Data Exploration / Feature Extraction → Learn → Evaluate Model

  25. Typical data analysis workflow: Load Raw Data → Data Exploration / Feature Extraction → Learn → Evaluate Model → Test / Feature Eng. → Deploy Application
  26. But MLI is still under development: good for some small stuff, but not quite there yet. Not a full-blown machine learning library yet.
  27. MLlib
      • Core machine learning algorithms in the Spark stdlib.
      • Primarily written in Scala, but can be used from PySpark via PythonMLLibAPI
      • So far contains algorithms for:
        - Regression: Ridge, Lasso, Linear
        - Classification: Support Vector Machines, Logistic Regression, Naive Bayes, Decision Trees
        - Linear algebra: DistributedMatrix, RowMatrix, etc.
        - Recommenders: Alternating Least Squares (SVD++ in GraphX)
        - Clustering: K-means
        - Optimization: Stochastic Gradient Descent
        - …
      More being contributed … Look at the Spark JIRA, and watch for the Spark 1.0 release.
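Of the algorithms listed, K-means is the simplest to sketch end to end. Below is a minimal single-machine version (Lloyd's algorithm, naive initialization) in plain Python; MLlib's implementation distributes the assignment step across the cluster and uses smarter initialization, so this is only a conceptual sketch:

```python
def kmeans(points, k, num_iters=10):
    """Lloyd's algorithm over lists of floats: alternate between
    assigning points to the nearest center and recomputing centers."""
    centers = points[:k]  # naive init: first k points
    for _ in range(num_iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):  # update step
            if members:
                centers[c] = [sum(xs) / len(members)
                              for xs in zip(*members)]
    return centers

# Two well-separated 1-D groups
pts = [[0.0], [0.5], [10.0], [10.5]]
centers = kmeans(pts, 2)
```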
  28. None
  29. MLlib example
      • Example from Sean R. Owen of Cloudera: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
      • Stack Overflow tag suggestions
      • Build a model that can suggest new tags for questions based on their existing tags, using the alternating least squares recommender algorithm
      • Questions are “users” and tags are “items”
  30. Interactive data processing: results from the REPL

  31. Transform XML data into a distributed collection of (questionID, tag)

    tuples.
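That transformation can be sketched locally with Python's stdlib XML parser, assuming the Stack Overflow dump format in which each question is a <row> element carrying an Id and a packed Tags attribute. The sample data below is made up for illustration (only the ID 7122697 comes from the talk); the talk itself does this with Spark over the full dump:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the Stack Overflow posts dump format
SAMPLE = """<posts>
  <row Id="7122697" Tags="&lt;sql&gt;&lt;query-optimization&gt;" />
  <row Id="123" Tags="&lt;scala&gt;" />
</posts>"""

def question_tag_tuples(xml_text):
    """Yield (questionID, tag) tuples, the same shape the slide
    builds as a distributed collection with Spark."""
    for row in ET.fromstring(xml_text).iter("row"):
        qid = int(row.attrib["Id"])
        tags = row.attrib.get("Tags", "")
        # Tags come packed as "<tag1><tag2>"; split them apart
        for tag in tags.strip("<>").split("><"):
            if tag:
                yield (qid, tag)

pairs = list(question_tag_tuples(SAMPLE))
```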
  32. Transform strings into numeric values through hashing, and get tuples. Convert those tuples into Rating(userId, productId, rating) to be fed to ALS. We get a factored matrix model, from which we can predict recommendations. (The hashing step is not in MLlib.)
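A local sketch of that hashing step in plain Python. The helper names `string_hash` and `to_ratings` are hypothetical, and `Rating` here is a stand-in namedtuple mirroring the shape of MLlib's `Rating(user, product, rating)` triple; a stable checksum is used instead of Python's `hash()`, which is randomized between runs:

```python
import zlib
from collections import namedtuple

# Stand-in for MLlib's Rating(user, product, rating) triple
Rating = namedtuple("Rating", ["user", "product", "rating"])

def string_hash(s, buckets=2 ** 20):
    """Deterministic non-negative integer id for a string
    (the hashing trick; hypothetical helper, not part of MLlib)."""
    return zlib.crc32(s.encode("utf-8")) % buckets

def to_ratings(question_tag_pairs):
    # Question = "user", tag = "item", with an implicit rating of 1.0
    return [Rating(string_hash(q), string_hash(t), 1.0)
            for q, t in question_tag_pairs]

ratings = to_ratings([("7122697", "sql"), ("7122697", "performance")])
```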
  33. None
  34. None
  35. And to call it, pick any question with at least

    four tags, like “How to make substring-matching query work fast on a large table?” and get its ID from the URL. Here, that’s 7122697:
  36. - MLlib is still in its infancy.
      - It remains a set of algorithms that requires some extra work to get value from.
      - Many people are contributing, and a lot more are trying to build on top of it.
      - Look for the Spark 1.0 release.
      - A new version of MLI should join Spark soon too.
  37. Spark Bindings for Mahout
      • Scala DSL and algebraic optimizer for matrix operations in Mahout
      • Spark is being adopted as a second backend for Mahout (even Oryx)
      • Provides so far:
        - Mahout’s Distributed Row Matrices (DRM)
        - ALS and co-occurrence (coming soon)
        - SSVD
  38. None
  39. Co-occurrence analysis in Mahout on Spark:

      val A = drmFromHDFS(…)

      // Compute co-occurrences
      val C = A.t %*% A

      // Find anomalous co-occurrences:
      // compute & broadcast the number of interactions per item
      val numInteractions = drmBroadcast(A.colSums)

      // create the indicator matrix
      val I = C.mapBlock() { case (keys, block) =>
        val indicatorBlock = sparse(block.nrow, block.ncol)
        for (row <- block)
          indicatorBlock(row.index, ::) = ComputeLLR(row, numInteractions)
        keys -> indicatorBlock
      }
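The ComputeLLR step refers to Dunning's log-likelihood ratio test, which Mahout uses to score how anomalous a co-occurrence is. A plain-Python version of the standard formula over the 2x2 contingency table of counts (the function names here are ours, not Mahout's API):

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy used in Dunning's LLR
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = both items seen together, k12/k21 = one item only,
    k22 = neither item."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

strong = llr(10, 0, 0, 10)       # items always appear together
independent = llr(5, 5, 5, 5)    # no association at all
```

Independent counts score (near) zero, while strong co-occurrence scores high, which is what makes LLR useful as an "anomalous co-occurrence" indicator.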
  40. One stack to rule them all !

  41. A lot more than MLbase:
      - GraphX: graph processing engine for Spark. Built on the GraphLab ideas, it provides some machine learning functions such as PageRank, ALS, SVD++ and TriangleCount.
      - SparkR and PySpark: leverage R and Python data analysis tools on Spark.
      - Spark Streaming: stream processing on Spark, a great abstraction for stream mining and analytics (think Algebird).
      - Spark SQL and Shark: to build machine learning workflows.
  42. http://speakerdeck.com/samklr/

  43. None