
Scalable Machine Learning with SPARK

MLbase, MLlib, MLI

Sam Bessalah

April 09, 2014
Transcript

  1. SCALABLE MACHINE LEARNING
    with SPARK
    Sam Bessalah - @samklr
    Software Engineer
    These days : Data Pipelines Plumber


  5. Why SPARK for Machine Learning?


  6. It’s less about ML itself than about having an agile data science
    environment.


  7. • A dataflow system materialised by immutable distributed collections
    with in-memory data, suited for iterative and complex workloads
    • Ships its own ML library (MLlib) and more …
    • Built in Scala and fully interoperable with Java, so it can reuse
    most of the work done on the JVM
    • Fluent and intuitive API, thanks to Scala’s functional style
    • Comes with a REPL, like R and Python, for quick feedback and
    interactive testing
    • Not just for ML: also a fit for data pipelines, integration, data access
    and ETL
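That fluent, functional style can be sketched with plain Scala collections, which the RDD API deliberately mirrors. This is a toy word count with made-up data, not actual Spark code; no cluster is involved:

```scala
// Toy word count in the fluent, functional style the slide describes.
// Plain Scala collections stand in for Spark RDDs here; the RDD API
// exposes the same map/flatMap/group shape.
val lines = Seq("spark makes ml scalable", "ml on spark is fast")

val wordCounts = lines
  .flatMap(_.split("\\s+"))     // tokenize each line
  .map(word => (word, 1))       // emit (word, 1) pairs
  .groupBy(_._1)                // plays the role of reduceByKey in Spark
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```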



  11. MATLAB Example
    - Integrated environment for ML development
    - Data access and processing tools
    - Leverages and extends the LAPACK functionality
    But it doesn’t scale to distributed environments.
    (Diagram: single-node stack, the MATLAB interface on top of LAPACK)


  12. (Diagram: the single-node MATLAB interface over LAPACK, set against
    MLBASE over SPARK)


  13. MLlib : low-level ML algorithms implemented as Spark’s
    standard ML library
    (Diagram: MLlib added to the MLBASE/SPARK stack, next to the
    MATLAB/LAPACK stack)


  14. MLlib : low-level ML algorithms in SPARK.
    MLI : API / platform for feature extraction, data pre-processing and
    algorithm consumption. Aims to be platform independent.
    (Diagram: MLI layered above MLlib in the MLBASE/SPARK stack)


  15. Spark : in-memory, fast cluster computing system
    MLlib : low-level ML algorithms in Spark.
    MLI : API / platform for feature extraction, data pre-processing and
    algorithm consumption. Aims to be platform independent.
    ML Optimizer : automates model selection by solving a search problem
    over feature extractors
    (Diagram: the full MLBASE stack, with the ML Optimizer over MLI over
    MLlib over SPARK)


  16. MLlib
    • Core ML algorithms implemented using the Spark
    programming model
    • So far contains algorithms for :
    - Regression : Ridge, Lasso
    - Classification : Support Vector Machines, Logistic Regression
    - RecSys : Matrix Factorisation with ALS
    - Clustering : K-means
    - Optimisation : Stochastic Gradient Descent
    More being contributed …
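To illustrate the optimisation entry in that list, here is a minimal single-machine sketch of stochastic gradient descent for least-squares regression in plain Scala; MLlib applies the same kind of update over distributed partitions. The data and learning rate are invented for the example:

```scala
// Minimal single-machine stochastic gradient descent for least-squares
// regression. The data follows y = 2x, so the weight should converge
// towards 2.0.
val data = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) // (x, y) pairs

var w = 0.0                        // single weight, no intercept
val lr = 0.05                      // learning rate
for (_ <- 1 to 200; (x, y) <- data) {
  val grad = (w * x - y) * x       // gradient of 0.5 * (w*x - y)^2
  w -= lr * grad
}
// w now approximates the true slope 2.0
```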


  17. MLI : ML developer API
    • Aims to shield ML developers from runtime implementations.
    • High-level abstractions and operators to build models compatible with parallel
    data processing tools.
    • Linear algebra : MLMatrix, MLRow, MLVector …
    - Linear algebra on local partitions
    - Sparse and dense matrix support
    • Table computations : MLTable
    - Similar to an R/pandas (Python) DataFrame or a NumPy array
    - Flexibility when loading / processing data
    - Common interface for feature extraction

  18. ML Optimizer
    • Aimed at non-ML developers
    • Specify ML tasks declaratively
    • Have the system do all the heavy lifting using MLI and MLlib.
    var X = load("local_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)
    Not available yet; currently under development.



  20. Example Workflow : Document Classification with MLI


  21. Typical data analysis workflow (built up step by step on slides 21 to 25):
    Load raw data (Spark / MLI) → Data exploration / feature extraction
    (Spark / MLI) → Learn → Evaluate model → Test / feature engineering →
    Deploy application



  26. • But MLI is still under development.
    Good for some small things, but not quite there yet.
    Not a full-blown machine learning library yet.


  27. MLlib
    • Core machine learning algorithms in the Spark stdlib.
    • Primarily written in Scala, but usable from PySpark via PythonMLLibAPI
    • So far contains algorithms for :
    - Regression : Ridge, Lasso, Linear
    - Classification : Support Vector Machines, Logistic Regression,
    Naive Bayes, Decision Trees
    - Linear Algebra : DistributedMatrix, RowMatrix, etc.
    - Recommenders : Alternating Least Squares (SVD++ in GraphX)
    - Clustering : K-means
    - Optimisation : Stochastic Gradient Descent
    - …
    More being contributed … Look at the Spark JIRA and the Spark 1.0 release.
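The clustering entry can be illustrated with a minimal plain-Scala sketch of k-means (Lloyd's algorithm) on one-dimensional toy data; MLlib's version distributes the assignment step across partitions. The points and initial centers are invented for the example:

```scala
// Minimal plain-Scala k-means (Lloyd's algorithm) on 1-D toy data:
// two obvious clusters around 1.0 and 9.0.
val points = Seq(1.0, 1.2, 0.8, 9.0, 9.5, 8.5)
var centers = Seq(0.0, 5.0)        // invented initial centers

for (_ <- 1 to 10) {
  // assign each point to its nearest center, then recompute each center
  // as the mean of its assigned points
  val clusters = points.groupBy(p => centers.minBy(c => math.abs(p - c)))
  centers = centers.map(c =>
    clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
}
// centers converge near 1.0 and 9.0
```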



  29. MLlib Example
    • Example from Sean R. Owen of Cloudera :
    http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
    • Stack Overflow tag suggestions
    • Build a model that can suggest new tags for questions
    based on their existing tags, using the alternating least
    squares recommender algorithm
    • Questions are “users” and tags are “items”


  30. Interactive data processing
    Result from the REPL


  31. Transform XML data into a distributed collection of
    (questionID, tag) tuples.


  32. Transform strings into numeric values through hashing, and get tuples.
    Convert those tuples into Rating(userId, productId, rating) to be fed to ALS.
    We get a factored matrix model, from which we can predict recommendations
    (the prediction step is not in MLlib).
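The hashing step above can be sketched in plain Scala: map string question IDs and tags to stable non-negative integers, then build the (user, product, rating) triples of the shape ALS expects. The `stableId` helper and the data are invented for this example:

```scala
// Sketch of the hashing step: turn string IDs into stable non-negative
// integers so they can serve as ALS user/product IDs.
def stableId(s: String): Int = s.hashCode & 0x7fffffff // drop the sign bit

val pairs = Seq(("7122697", "sql"), ("7122697", "performance"))
val ratings = pairs.map { case (question, tag) =>
  (stableId(question), stableId(tag), 1.0) // implicit "rating" of 1.0
}
```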



  35. And to call it, pick any question with at least four tags, like “How to make
    substring-matching query work fast on a large table?” and get its ID from
    the URL. Here, that’s 7122697:


  36. - MLlib is still in its infancy.
    - It remains a set of algorithms that requires some extra
    work to get value from.
    - Many people are contributing, and a lot more are trying
    to build on top of it.
    - Look for the Spark 1.0 release.
    - A new version of MLI should join Spark soon too.


  37. Spark Bindings for MAHOUT
    • Scala DSL and algebraic optimizer for matrix
    operations in Mahout.
    • SPARK is being adopted as a second backend
    for Mahout (even Oryx)
    • Provides so far algorithms for :
    - Mahout’s Distributed Row Matrices (DRM)
    - ALS and co-occurrence (coming soon)
    - SSVD



  39. Co-occurrence analysis in Mahout on Spark:

    val A = drmFromHDFS(…)

    // Compute co-occurrences
    val C = A.t %*% A

    // Find anomalous co-occurrences:
    // compute & broadcast the number of interactions per item
    val numInteractions = drmBroadcast(A.colSums)

    // Create the indicator matrix
    val I = C.mapBlock() {
      case (keys, block) =>
        val indicatorBlock = sparse(block.nrow, block.ncol)
        for (row <- block)
          indicatorBlock(row.index, ::) = ComputeLLR(row, numInteractions)
        keys -> indicatorBlock
    }
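The `A.t %*% A` co-occurrence computation in that snippet can be sketched at toy scale in plain Scala, counting how often two items occur in the same user's interaction list. The interaction data is invented for the example:

```scala
// Toy-scale version of the A.t %*% A co-occurrence computation:
// count how often each pair of items appears for the same user.
val interactions = Seq("u1" -> "a", "u1" -> "b",
                       "u2" -> "a", "u2" -> "b", "u2" -> "c")

val itemsByUser = interactions.groupBy(_._1)
  .map { case (u, ps) => (u, ps.map(_._2)) }

val cooccur = itemsByUser.values
  .flatMap(items => for (i <- items; j <- items if i != j) yield (i, j))
  .groupBy(identity)
  .map { case (pair, occs) => (pair, occs.size) }
```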


  40. One stack to rule them all!


  41. A lot more than MLbase
    GraphX : graph processing engine for Spark. Built on the
    GraphLab ideas, and provides some machine learning functions
    like PageRank, ALS, SVD++ and TriangleCount.
    SparkR and PySpark : leverage R and Python data analysis
    tools on top of SPARK.
    Spark Streaming : stream processing on Spark, a great
    abstraction for stream mining and analytics (think Algebird).
    Spark SQL and SHARK : to build machine learning workflows.


  42. http://speakerdeck.com/samklr/
