An Introduction to Apache Spark

Amir Sedighi
November 20, 2014


"Apache Spark is a cluster computing engine. It abstracts away the underlying distributed storage and cluster management aspects, making it possible to plug in a lot of specialized storage and cluster management tools. Spark support HDFS, Cassandra, local storage, S3, even tradtional database for the storage layer. Spark can work with cluster management tools like YARN, Mesos. It also has its own standalone mode for cluster management purpose." - https://rahulkavale.github.io


Transcript

  1. What is Spark?
     • Spark is a fast and general engine for large-scale data processing.
  2. History
     • Developed in 2009 at UC Berkeley AMPLab.
     • Open sourced in 2010.
     • Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations.
  3. Spark is Insanely Fast
     • Using RAM, Spark runs programs up to 100x faster than Hadoop MapReduce.
     • Using disk, Spark runs programs up to 10x faster than Hadoop MapReduce.
  4. Ease of Use
     • Spark offers over 80 high-level operators that make it easy to build parallel apps.
     • Scala and Python shells let you use it interactively.
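
     To give a feel for that interactive style, here is a minimal word-count sketch for the Scala shell (bin/spark-shell), where sc is the shell's ready-made SparkContext; the input path input.txt is hypothetical.

       val lines = sc.textFile("input.txt")      // hypothetical local text file
       val counts = lines
         .flatMap(_.split(" "))                  // transformation: split lines into words
         .map(word => (word, 1))                 // transformation: pair each word with 1
         .reduceByKey(_ + _)                     // transformation: sum the counts per word
       counts.take(10).foreach(println)          // action: bring a sample back to the driver
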
  5. Apache Spark Core
     • Spark Core is the general engine for the Spark platform.
       – In-memory computing capabilities deliver speed.
       – A general execution model supports a wide variety of use cases.
       – Ease of development – native APIs in Java, Scala, Python (+ SQL, Clojure, R).
  6. Spark Streaming
     • Makes it easy to build scalable, fault-tolerant streaming applications.
     • Recovers both lost work and operator state, without any extra code on your part.
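
     A minimal streaming word-count sketch in Scala, assuming an existing SparkContext sc and text arriving on a local socket (for example one opened with nc -lk 9999); the host, port, and batch interval are illustrative.

       import org.apache.spark.streaming.{Seconds, StreamingContext}

       val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
       val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical socket source
       lines.flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .print()                                         // print each batch's counts
       ssc.start()                                           // start receiving and processing
       ssc.awaitTermination()                                // block until the stream is stopped
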
  7. MLlib
     • MLlib is Spark's scalable machine learning library.
     • MLlib works with any Hadoop data source, such as HDFS, HBase, and local files.
  8. MLlib
     • Algorithms:
       – linear SVM and logistic regression
       – classification and regression tree
       – k-means clustering
       – recommendation via alternating least squares
       – singular value decomposition
       – linear regression with L1- and L2-regularization
       – multinomial naive Bayes
       – basic statistics
       – feature transformations
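
     To make one of these concrete, a minimal k-means sketch against MLlib's Scala API, assuming an existing SparkContext sc and a hypothetical input file with one space-separated numeric vector per line; k and the iteration count are arbitrary.

       import org.apache.spark.mllib.clustering.KMeans
       import org.apache.spark.mllib.linalg.Vectors

       val points = sc.textFile("kmeans_data.txt")           // hypothetical input path
         .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
         .cache()                                            // iterative algorithm: keep in memory
       val model = KMeans.train(points, k = 2, maxIterations = 20)
       model.clusterCenters.foreach(println)                 // the learned centroids
       println(s"cost: ${model.computeCost(points)}")        // within-cluster sum of squares
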
  9. GraphX
     • GraphX is Spark's API for graphs and graph-parallel computation.
     • Works with both graphs and collections.
  10. GraphX
      • Algorithms:
        – PageRank
        – Connected components
        – Label propagation
        – SVD++
        – Strongly connected components
        – Triangle count
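
      A minimal PageRank sketch with GraphX's Scala API, assuming an existing SparkContext sc and a hypothetical edge-list file (one "srcId dstId" pair per line, the format GraphLoader.edgeListFile expects).

        import org.apache.spark.graphx.GraphLoader

        val graph = GraphLoader.edgeListFile(sc, "followers.txt")  // hypothetical edge list
        val ranks = graph.pageRank(tol = 0.0001).vertices          // run PageRank to convergence
        ranks.sortBy(_._2, ascending = false)                      // highest-ranked vertices first
             .take(5)
             .foreach(println)
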
  11. Spark Runs Everywhere
      • Spark runs on Hadoop, Mesos, standalone, or in the cloud.
      • Spark accesses diverse data sources including HDFS, Cassandra, HBase, and S3.
  12. Spark Programming Model
      • At a high level, every Spark application consists of a driver program that runs the user's main function.
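
      A minimal driver-program skeleton in Scala; the object name, app name, and local master are illustrative, not part of the deck.

        import org.apache.spark.{SparkConf, SparkContext}

        object MyApp {                                          // hypothetical application name
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
            val sc = new SparkContext(conf)                     // connects the driver to a cluster
            // ... parallel operations on RDDs go here ...
            sc.stop()
          }
        }
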
  13. Spark Programming Model
      • The main abstraction Spark provides is a resilient distributed dataset (RDD):
        – a collection of elements partitioned across the nodes of the cluster,
        – which can be operated on in parallel,
        – and which automatically recovers from node failures.
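
      A small sketch of creating an RDD from a driver-side collection, assuming an existing SparkContext sc; the data and partition count are arbitrary.

        val data = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)  // explicit partition count
        println(data.partitions.length)                               // 4: elements spread over 4 partitions
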
  14. Spark Programming Model
      • RDD operations:
        – Transformations: create a new dataset from an existing one. Example: map()
        – Actions: return a value to the driver program after running a computation on the dataset. Example: reduce()
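
      A minimal sketch of the two kinds of operations, assuming an existing SparkContext sc: the map() is lazy, and nothing runs until the reduce() action is called.

        val nums = sc.parallelize(1 to 100)
        val squares = nums.map(n => n * n)   // transformation: lazily defines a new RDD
        val total = squares.reduce(_ + _)    // action: runs the job, returns a value to the driver
        println(total)                       // 338350
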
  15. Spark Programming Model
      • Another abstraction is shared variables:
        – Broadcast variables, which can be used to cache a value in memory on all nodes.
        – Accumulators, variables that tasks can only add to, used to implement counters and sums.
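
      A small sketch showing both kinds, assuming an existing SparkContext sc; the data and lookup table are made up, and this uses the Spark 1.x accumulator API current when this deck was given.

        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))     // read-only copy cached on every node
        val misses = sc.accumulator(0)                         // counter that tasks can only add to

        sc.parallelize(Seq("a", "b", "x")).foreach { key =>
          if (!lookup.value.contains(key)) misses += 1         // tasks add; only the driver reads
        }
        println(s"keys missing from lookup: ${misses.value}")  // 1
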
  16. Resources
      • http://spark.apache.org
      • Intro to Apache Spark by Paco Nathan
      • http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
      • ZYMR