Apache Spark: Unified Platform for Big Data

Reynold Xin

July 21, 2014

Transcript

  1. “Simple things simple, complex things possible.” - Alan Kay
  2. Apache Spark: Unified Platform for Big Data
    Reynold Xin, Databricks
    [ASCII-art drawing: the tower at Berkeley campus]
  3. Founded last year by creators of Apache Spark
    Building a cloud platform to drastically simplify Big Data
  4. “Simple things simple, complex things possible.” - Alan Kay
  5. Big Data Space Today
    Zoo of tools to learn, deploy, connect, and maintain.
    “Simple things complex, complex things impossible!”
  6. Why a Platform Matters
    Good for developers: one system to learn
    Good for users: take apps anywhere
    Good for distributors: more applications
  7. Apache Spark
    “The lingua franca for Big Data” - Eric Baldeschwieler
    1. Unified processing platform
    2. Rich standard libraries
    3. Availability
    4. Scalability
  8. 1. Unified Processing Platform for Big Data
    Workloads: Batch, Interactive, Streaming
    Storage systems and runtimes: Hadoop, Cassandra, Mesos, … …, Cloud Providers, …
    Uniform API for diverse workloads over diverse storage systems and runtimes
  9. 2. Standard Library for Big Data
    Big Data apps lack libraries of common algorithms.
    Spark’s generality + support for multiple languages make it suitable to offer this.
    Libraries: Core, SQL, ML, graph, …
    Languages: Python, Scala, Java
  10. Spark MLlib
    One of the most actively developed libraries.
    • classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree
    • regression: generalized linear models (GLMs), regression tree
    • collaborative filtering: alternating least squares (ALS)
    • clustering: k-means
    • decomposition: singular value decomposition (SVD), principal component analysis (PCA)
    • statistics: summary statistics
    • evaluation: binary classification
    • optimization: gradient descent, L-BFGS
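
To make the "clustering: k-means" entry on slide 10 concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not MLlib's implementation and does not use any Spark API; the function names `kmeans` and `squared_distance`, the fixed seed, and the iteration cap are all assumptions chosen for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; return the centers.

    Hypothetical helper for illustration only, not part of any library.
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                new_centers.append(mean)
            else:
                # Keep the old center if no point was assigned to it.
                new_centers.append(centers[i])
        if new_centers == centers:
            break  # Converged: assignments can no longer change.
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(kmeans(data, k=2)))
```

A distributed version would split the assignment step across partitions and combine per-cluster sums, which is the kind of workload a shared library on top of a general engine is meant to cover.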
  11. Spark MLlib 1.1 (Aug/Sep 2014)
    improvements
    • standardized interfaces: dataset, algorithm, params, and model
    • multi-model training
    • multiclass support for classification tree
    • Java/Python APIs for decision trees
    • SVD via Lanczos
    • standardized text format for training data
    new
    • statistics: stratified sampling, linear/rank correlation, hypothesis testing
    • non-negative matrix factorization (NMF)
    • preprocessing: tf-idf
    • evaluation: multiclass metrics
    • online model updates with streaming
    • and your contribution!
  12. 4. Fault-tolerance & Scalability
    Cluster size: runs in production on clusters with 1000+ nodes
    Data size: processes data many times the size of memory
    - petabyte sort on lots of machines
    - or handle tons of data on a few machines (sorting 5TB compressed/node; file system breaking before Spark did)
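
The "data many times the size of memory" point on slide 12 rests on a general technique: sort chunks that fit in memory, spill each sorted run to disk, then stream-merge the runs. The sketch below shows that idea on one machine in plain Python. It is not Spark's sort implementation; `external_sort`, `chunk_size`, and the one-integer-per-line run format are assumptions made for illustration.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(numbers, chunk_size):
    """Yield the integers from `numbers` in ascending order.

    At most `chunk_size` values are held in memory while building runs;
    the merge phase keeps one value per run in memory.
    Hypothetical helper for illustration only.
    """
    run_paths = []
    iterator = iter(numbers)
    try:
        # Phase 1: sort fixed-size chunks and spill each one to a temp file.
        while True:
            chunk = sorted(islice(iterator, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)

        # Phase 2: lazily merge the sorted runs with a k-way heap merge.
        files = [open(path) for path in run_paths]
        try:
            yield from heapq.merge(*(map(int, f) for f in files))
        finally:
            for f in files:
                f.close()
    finally:
        # Always clean up the spilled runs.
        for path in run_paths:
            os.remove(path)


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)))
```

On a cluster the same two phases are spread over many machines, with the additional cost of moving data between nodes before the merge.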
  13. Spark is the most active open-source Big Data project by
    - active contributors (300+)
    - commits
    - lines of code changed