Apache Spark & MLlib

Apache Spark is an emerging cluster computing platform that allows data processing programs to run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark also has a built-in machine learning library, MLlib, that implements many common supervised and unsupervised machine learning algorithms. In this talk we will discuss how Spark has improved cluster computing and data processing, along with an overview of the available MLlib algorithms. After getting familiar with the basics, we will explore how you can create a product recommendation engine for eCommerce using Collaborative Filtering and the Alternating Least Squares algorithm.

Addam Hardy

October 21, 2016

Transcript

  1. 1 TWEET

    @addamh: Tacos are the best . Tom Brady is the best. So, Tom Brady is .
  2. 1 TWEET × ~200 BYTES / TWEET

    × 6000 TWEETS / SECOND = 12 MB / SECOND

    @addamh: Tacos are the best . Tom Brady is the best. So, Tom Brady is .
  3. 1 TWEET × ~200 BYTES / TWEET

    × 6000 TWEETS / SECOND = 12 MB / SECOND
    × 24 HOURS = 1 TB / DAY

    @addamh: Tacos are the best . Tom Brady is the best. So, Tom Brady is .
  4. 1 TWEET × ~200 BYTES / TWEET

    × 6000 TWEETS / SECOND = 12 MB / SECOND
    × 24 HOURS = 1 TB / DAY
    × 30 DAYS / MONTH = 30 TB / MONTH

    @addamh: Tacos are the best . Tom Brady is the best. So, Tom Brady is .
  5. [Diagram: a DATA ARRAY becomes an RDD, which is split into RDD partitions. Each partition is held in RAM on one of the hosts: Host 01, Host 02, ... Host n.]
  6. MLlib is a machine learning toolbox that is tightly integrated into Spark and has an RDD API. This allows MLlib’s algorithms to run on a distributed Spark cluster.
  7. MLlib is a machine learning toolbox that is tightly integrated into Spark and has an RDD API. This allows MLlib’s algorithms to run on a distributed Spark cluster.

    Classification: Logistic Regression, Naive Bayes
    Decision Trees: Random Forests, Gradient-Boosted Trees
    Clustering: K-means
    Topic Modeling: latent Dirichlet allocation
    Recommendation: Alternating Least Squares
    Regression: Linear Regression, Isotonic Regression
  8. [Same content as slide 7.]
  9. ?

  10. ? ? ? 2 3 5 2 4 3 2 4 5 5 1 4 5

  11. Our data as a dense matrix:

    ω = 3 5 ? 5 2 5 1 4 ? 4 2 3 5 2 ? 4
  12. LOW-RANK MATRIX FACTORIZATION IS A NON-CONVEX PROBLEM. THIS MAKES FINDING THE GLOBAL MINIMUM VERY COSTLY. THE ALTERNATING LEAST SQUARES ALGORITHM ALLOWS US TO CONVERT IT TO A CONVEX PROBLEM.
  13. MovieLens Data

    movieId,title,genres
    1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
    2,Jumanji (1995),Adventure|Children|Fantasy
    3,Grumpier Old Men (1995),Comedy|Romance
    4,Waiting to Exhale (1995),Comedy|Drama|Romance
    5,Father of the Bride Part II (1995),Comedy
    6,Heat (1995),Action|Crime|Thriller
    7,Sabrina (1995),Comedy|Romance
    8,Tom and Huck (1995),Adventure|Children
    9,Sudden Death (1995),Action
    10,GoldenEye (1995),Action|Adventure|Thriller
    11,"American President, The (1995)",Comedy|Drama|Romance
    12,Dracula: Dead and Loving It (1995),Comedy|Horror
    13,Balto (1995),Adventure|Animation|Children
  14. Get IDs of movies just rated

    Remove just rated movies from ratings RDD
    Make predictions for unrated movies from full dataset
  15. YOU DON’T HAVE TO BE A DATA SCIENTIST TO START GETTING VALUE FROM THE TOOLS.
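
The data-volume arithmetic on slides 2 to 4 can be checked with a few lines of Python. This is only a sketch: decimal units (1 MB = 10**6 bytes) and the helper names are assumptions, not something the slides state. With the slides' inputs of 6000 tweets per second and about 200 bytes per tweet, the product comes to 1.2 MB per second; the 12 MB / second, 1 TB / day and 30 TB / month figures shown on the slides match an average of roughly 2000 bytes per tweet instead. Which of the two sizes the speaker intended is not recoverable from the transcript.

```python
# Figures from slides 2-4. Decimal units (1 MB = 10**6 bytes) are an assumption.
TWEETS_PER_SECOND = 6000
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30


def throughput(bytes_per_tweet, tweets_per_second=TWEETS_PER_SECOND):
    """Return (MB per second, TB per day, TB per month) for an average tweet size."""
    bytes_per_second = bytes_per_tweet * tweets_per_second
    mb_per_second = bytes_per_second / 10**6
    tb_per_day = bytes_per_second * SECONDS_PER_DAY / 10**12
    tb_per_month = tb_per_day * DAYS_PER_MONTH
    return mb_per_second, tb_per_day, tb_per_month


def bytes_per_tweet_for(mb_per_second, tweets_per_second=TWEETS_PER_SECOND):
    """Invert the first step: the tweet size implied by a given MB-per-second rate."""
    return mb_per_second * 10**6 / tweets_per_second


if __name__ == "__main__":
    # The slides' stated tweet size (~200 bytes).
    print(throughput(200))
    # The tweet size implied by the slides' 12 MB / second figure.
    print(bytes_per_tweet_for(12))
    print(throughput(2000))
```

Either way the point of the slides holds: a single tweet is tiny, but the stream adds up to terabytes per month, which is the scale that motivates a cluster platform such as Spark.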
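
Slide 12's claim can be made concrete with a small sketch of Alternating Least Squares in plain Python. It is not MLlib's implementation, only an illustration of the idea: once the item factors are held fixed, solving for each user's factors is an ordinary regularized least-squares problem, which is convex, and the same holds with the roles swapped. The ratings below reuse the numbers from slide 11 under the assumption that they form a 4x4 matrix read row by row, with `None` standing in for `?`. The names `als`, `update`, `solve`, `rank`, `lam` and `iters` are hypothetical.

```python
import random

# Slide 11's values, ASSUMING a 4x4 row-major layout; None marks an unknown rating.
RATINGS = [
    [3, 5, None, 5],
    [2, 5, 1, 4],
    [None, 4, 2, 3],
    [5, 2, None, 4],
]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def update(fixed, observations, rank, lam):
    """With one side fixed, each vector on the other side is a ridge regression."""
    out = []
    for obs in observations:  # obs: list of (index into fixed, rating)
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, rating in obs:
            f = fixed[idx]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        out.append(solve(a, b))
    return out


def loss(R, X, Y, lam):
    """Squared error on observed ratings plus L2 regularization."""
    err = sum((r - dot(X[u], Y[i])) ** 2
              for u, row in enumerate(R) for i, r in enumerate(row) if r is not None)
    reg = lam * (sum(v * v for x in X for v in x) + sum(v * v for y in Y for v in y))
    return err + reg


def als(R, rank=2, lam=0.1, iters=20, seed=0):
    rng = random.Random(seed)
    n_users, n_items = len(R), len(R[0])
    by_user = [[(i, r) for i, r in enumerate(row) if r is not None] for row in R]
    by_item = [[(u, R[u][i]) for u in range(n_users) if R[u][i] is not None]
               for i in range(n_items)]
    Y = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    X, history = [], []
    for _ in range(iters):
        X = update(Y, by_user, rank, lam)  # fix item factors, solve for users
        Y = update(X, by_item, rank, lam)  # fix user factors, solve for items
        history.append(loss(R, X, Y, lam))
    return X, Y, history


def predict(X, Y, user, item):
    return dot(X[user], Y[item])


if __name__ == "__main__":
    X, Y, history = als(RATINGS)
    print("loss after first and last sweep:", history[0], history[-1])
    print("predicted rating for user 0, item 2:", predict(X, Y, 0, 2))
```

Because each half-step exactly minimizes the same objective over one block of variables, the recorded loss never increases from one sweep to the next. The result is still only a local solution of the original non-convex problem.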
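
The recommendation flow on slide 14 can be sketched without a cluster, using a few rows of the MovieLens sample from slide 13. Plain Python collections stand in for the RDDs here, and `predict` is a hypothetical callable taking the place of a trained ALS model; neither `load_movies` nor `recommend` comes from the talk. The sketch filters the movie catalogue rather than the ratings RDD, which is a simplification.

```python
import csv
import io

# Three rows copied from the MovieLens sample on slide 13.
MOVIES_CSV = '''movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
11,"American President, The (1995)",Comedy|Drama|Romance
'''


def load_movies(text):
    """Parse movies.csv text into {movieId: title}; csv handles the quoted title."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["movieId"]): row["title"] for row in reader}


def recommend(movies, my_ratings, predict, top_n=2):
    """Return the top_n (title, score) pairs among movies the user has not rated.

    my_ratings is a list of (movieId, rating) pairs the user just entered.
    predict is a HYPOTHETICAL stand-in for a trained model: movieId -> score.
    """
    # Step 1: get IDs of movies just rated.
    rated_ids = {movie_id for movie_id, _ in my_ratings}
    # Step 2: remove the just-rated movies from the candidates.
    unrated = [movie_id for movie_id in movies if movie_id not in rated_ids]
    # Step 3: make predictions for the unrated movies and rank them.
    scored = sorted(((predict(movie_id), movie_id) for movie_id in unrated), reverse=True)
    return [(movies[movie_id], score) for score, movie_id in scored[:top_n]]


if __name__ == "__main__":
    movies = load_movies(MOVIES_CSV)
    # Made-up scores for illustration only; a real model would supply these.
    toy_scores = {2: 3.5, 11: 4.2}
    for title, score in recommend(movies, [(1, 5.0)], toy_scores.get):
        print(title, score)
```

In a Spark job the same three steps would typically be expressed as transformations on RDDs, with the scoring step delegated to the trained model.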