
Strata NY 2014 Spark Camp MLlib talk

jkbradley
October 15, 2014

From the Spark Camp at Strata NY 2014, this is the talk on MLlib, Spark's Machine Learning library. It gives a quick overview of the project, upcoming improvements, and links to more resources.


Transcript

  1. About Me
     •  Ph.D. in ML from Carnegie Mellon U.
     •  Postdoc at UC Berkeley
     •  Working at Databricks on MLlib
  2. History of MLlib: Now (Spark 1.1)
     Collaborative Filtering for Recommendation
     •  Alternating Least Squares
     Prediction
     •  Lasso
     •  Ridge Regression
     •  Logistic Regression
     •  Decision Trees
     •  Naïve Bayes
     •  Support Vector Machines
     Clustering
     •  K-Means
     Optimization
     •  Gradient descent
     •  L-BFGS
     Many Utilities
     •  Random data generation
     •  Linear algebra
     •  Feature transformations
     •  Statistics: testing, correlation
     •  Evaluation metrics
  3. History of MLlib: Beginnings
     •  UC Berkeley AMPLab project
     •  Shipped in Sep. 2013 with Spark 0.8
     Currently
     •  82 contributors from many organizations
  4. Benefits of MLlib
     •  Part of Spark
     •  Integrated data analysis workflow
     •  Free performance gains
     [Diagram: the Apache Spark stack, with SparkSQL, Spark Streaming, MLlib, and GraphX on a common Spark core]
  5. Benefits of MLlib
     •  Part of Spark
     •  Integrated data analysis workflow
     •  Free performance gains
     •  Scalable
     •  Python, Scala, Java APIs
     •  Broad coverage of applications & algorithms
     •  Rapid improvements in speed & robustness
  6. Clustering with K-Means
     •  Smart initialization
     •  Limited communication (# clusters << # instances)
     •  Data distributed by instance (point/row)
  7. K-Means: Scala

     import org.apache.spark.mllib.clustering.KMeans
     import org.apache.spark.mllib.linalg.Vectors

     // Load and parse data.
     val data = sc.textFile("kmeans_data.txt")
     val parsedData = data.map { x =>
       Vectors.dense(x.split(' ').map(_.toDouble))
     }.cache()

     // Cluster data into 5 classes using KMeans.
     val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

     // Evaluate clustering error.
     val cost = clusters.computeCost(parsedData)
     println("Sum of squared errors = " + cost)
  8. K-Means: Python

     from numpy import array
     from pyspark.mllib.clustering import KMeans

     # Load and parse data.
     data = sc.textFile("kmeans_data.txt")
     parsedData = data.map(
         lambda line: array([float(x) for x in line.split(' ')])).cache()

     # Cluster data into 5 classes using KMeans.
     clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

     # Evaluate clustering error.
     def error(point):
         # Squared distance from the point to its cluster center.
         center = clusters.centers[clusters.predict(point)]
         return sum([x**2 for x in (point - center)])

     cost = parsedData.map(lambda point: error(point)) \
                      .reduce(lambda x, y: x + y)
     print("Sum of squared errors = " + str(cost))
  9. Recommendation
     Goal: Recommend movies to users
     Challenges:
     •  Defining similarity
     •  Dimensionality: 25M users, 100K movies
     •  Sparsity
     Approach: Collaborative filtering
  10. Recommendation
     Solution: Assume ratings are determined by a small number of factors.
     [Diagram: ratings matrix ≈ user-factor matrix x movie-factor matrix]
     25M users, 100K movies → 2.5 trillion ratings
     With 10 factors/user → 250M parameters
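    To make the factor model concrete: each user and each movie gets a short factor vector, and a rating is predicted as their dot product. A minimal Scala sketch (not from the deck; the vectors are invented for illustration):

      // One hypothetical user and movie, each described by 10 latent factors.
      val userFactors  = Array(0.9, -0.2, 0.5, 0.1, 0.0, 0.3, -0.4, 0.7, 0.2, -0.1)
      val movieFactors = Array(0.8,  0.1, 0.4, 0.0, 0.2, 0.5, -0.3, 0.6, 0.1,  0.0)

      // Predicted rating = dot product of the two factor vectors.
      val predictedRating = (userFactors, movieFactors).zipped.map(_ * _).sum

    Storing one such 10-dimensional vector per user and per movie costs about (25M + 100K) x 10 ≈ 251M parameters, versus 2.5 trillion entries in the full ratings matrix.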
  11. Recommendation with Alternating Least Squares (ALS)
     [Diagram: ratings matrix = user factors x movie factors, with unknown entries marked "?"]
     Algorithm: Alternating update of user/movie factors
  12. Recommendation with Alternating Least Squares (ALS)
     Algorithm: Alternating update of user/movie factors
     •  Can update factors in parallel
     •  Must be careful about communication
  13. Recommendation with Alternating Least Squares (ALS)

     import org.apache.spark.mllib.recommendation.{ALS, Rating}

     // Load and parse the data.
     val data = sc.textFile("mllib/data/als/test.data")
     val ratings = data.map(_.split(',') match {
       case Array(user, item, rate) =>
         Rating(user.toInt, item.toInt, rate.toDouble)
     })

     // Build the recommendation model using ALS.
     val model = ALS.train(ratings, rank = 10, iterations = 20, lambda = 0.01)

     // Evaluate the model on rating data.
     val usersProducts = ratings.map {
       case Rating(user, product, rate) => (user, product)
     }
     val predictions = model.predict(usersProducts)
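    The slide stops at computing predictions; a natural next step, adapted here from the standard MLlib ALS example (not on the slide), joins the predictions with the true ratings and reports mean squared error:

      // Key both predictions and true ratings by (user, product), join,
      // then average the squared differences. Continues from `ratings`
      // and `predictions` on the slide above.
      val predictionsKV = predictions.map {
        case Rating(user, product, rate) => ((user, product), rate)
      }
      val ratesAndPreds = ratings.map {
        case Rating(user, product, rate) => ((user, product), rate)
      }.join(predictionsKV)
      val MSE = ratesAndPreds.map {
        case (_, (actual, predicted)) => math.pow(actual - predicted, 2)
      }.mean()
      println("Mean Squared Error = " + MSE)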
  14. ALS: Today’s ML Exercise
     •  Load 1M/10M ratings from MovieLens
     •  Specify YOUR ratings on examples
     •  Split examples into training/validation
     •  Fit a model (Python or Scala)
     •  Improve model via parameter tuning (see the sketch after this list)
     •  Get YOUR recommendations
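    A rough sketch of the split-and-tune steps above (illustrative only, not the exercise's actual code; `validationMSE` is a helper defined here, and the parameter grids are arbitrary):

      import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

      // Randomly split the ratings 80/20 into training and validation sets.
      val Array(training, validation) = ratings.randomSplit(Array(0.8, 0.2))

      // Helper: mean squared error of a model on the held-out ratings.
      def validationMSE(model: MatrixFactorizationModel): Double = {
        val predicted = model.predict(validation.map(r => (r.user, r.product))).map {
          case Rating(u, p, r) => ((u, p), r)
        }
        val actual = validation.map(r => ((r.user, r.product), r.rating))
        actual.join(predicted).map { case (_, (a, p)) => math.pow(a - p, 2) }.mean()
      }

      // Tiny grid search over rank and regularization; keep the best settings.
      val (bestMSE, bestRank, bestLambda) = (for {
        rank   <- Seq(8, 12)
        lambda <- Seq(0.01, 0.1)
      } yield (validationMSE(ALS.train(training, rank, 20, lambda)), rank, lambda)
      ).minBy(_._1)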
  15. Performance
     Steady performance gains: ~3X speedups on average
     [Chart: Speedup (Spark 1.0 vs. 1.1) for ALS, Decision Trees, K-Means, Logistic Regression, Ridge Regression]
  16. Algorithms
     Definitely in Spark 1.2
     •  Random Forests: ensembles of Decision Trees
     Under development
     •  Boosting
     •  Topic modeling
     •  (many others)
     Too many contributors to list!
  17. ML Pipelines
     Typical ML workflow is complex.
     [Diagram: Training Data → Feature Extraction → Model Training → Model Testing, with Test Data passing through the same Feature Extraction, and Other Training Data feeding Other Model Training]
  18. ML Pipelines
     Typical ML workflow is complex.
     Pipelines under development
     •  Easy workflow construction
     •  Standardized interface for model tuning
     •  Testing & failing early
     Collaboration with UC Berkeley AMPLab
  19. Datasets
     Further integration with SparkSQL
     ML pipelines require Datasets
     •  Handle many data types (features)
     •  Keep metadata about features
     •  Select subsets of features for different parts of pipeline
     •  Join groups of features
     ML Dataset = SchemaRDD
     Under development
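    To make "ML Dataset = SchemaRDD" concrete, here is a minimal sketch using the Spark 1.1 SQL API; the case class and column names are invented for illustration, and the pipeline integration itself was still under development at the time:

      import org.apache.spark.sql.SQLContext

      // Hypothetical record: a label plus two named feature columns.
      case class FeatureRow(label: Double, age: Double, income: Double)

      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

      // An RDD of case classes becomes a SchemaRDD with named, typed columns.
      val dataset = sc.parallelize(Seq(
        FeatureRow(1.0, 35.0, 60000.0),
        FeatureRow(0.0, 22.0, 28000.0)))
      dataset.registerTempTable("dataset")

      // Selecting a subset of features for one pipeline stage is just a query.
      val labelAndAge = sqlContext.sql("SELECT label, age FROM dataset")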
  20. Diving into MLlib
     Today’s exercise: basic workflow
     User Guide (Spark website)
     •  Available algorithms
     •  Algorithm descriptions, details, tuning
     Spark codebase: examples/
     •  Executable examples
     •  Can extend to your application
  21. Resources
     MLlib Programming Guide
     http://spark.apache.org/docs/latest/mllib-guide.html
     Databricks site: Spark training info
     http://databricks.com/spark-training
     Spark user lists & community
     https://spark.apache.org/community.html