An Introduction to Apache Spark

Amir Sedighi
November 20, 2014


"Apache Spark is a cluster computing engine. It abstracts away the underlying distributed storage and cluster management aspects, making it possible to plug in a lot of specialized storage and cluster management tools. Spark support HDFS, Cassandra, local storage, S3, even tradtional database for the storage layer. Spark can work with cluster management tools like YARN, Mesos. It also has its own standalone mode for cluster management purpose." - https://rahulkavale.github.io


Transcript

  1. What is Spark?
     • Spark is a fast and general engine for large-scale data processing.
  2. History
     • Developed in 2009 at UC Berkeley AMPLab.
     • Open sourced in 2010.
     • Spark has become one of the largest big-data projects, with more than 400 contributors from 50+ organizations.
  3. Spark is Insanely Fast
     • Using RAM, Spark runs programs up to 100x faster than Hadoop MapReduce.
     • Using disk, Spark runs programs up to 10x faster than Hadoop MapReduce.
  4. Ease of Use
     • Spark offers over 80 high-level operators that make it easy to build parallel apps.
     • Scala and Python shells let you use it interactively.
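
     To give a feel for that interactive style, here is a minimal word-count sketch for the Scala shell (bin/spark-shell), where sc is the shell's ready-made SparkContext; the input path input.txt is hypothetical.

       val lines = sc.textFile("input.txt")      // hypothetical local text file
       val counts = lines
         .flatMap(_.split(" "))                  // transformation: split lines into words
         .map(word => (word, 1))                 // transformation: pair each word with 1
         .reduceByKey(_ + _)                     // transformation: sum the counts per word
       counts.take(10).foreach(println)          // action: bring a sample back to the driver
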
  5. Apache Spark Core
     • Spark Core is the general engine for the Spark platform.
       – In-memory computing capabilities deliver speed.
       – A general execution model supports a wide variety of use cases.
       – Ease of development – native APIs in Java, Scala, Python (+ SQL, Clojure, R).
  6. Spark Streaming
     • Makes it easy to build scalable, fault-tolerant streaming applications.
     • Recovers both lost work and operator state, without any extra code on your part.
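
     A minimal streaming word-count sketch in Scala, assuming an existing SparkContext sc and text arriving on a local socket (for example one opened with nc -lk 9999); the host, port, and batch interval are illustrative.

       import org.apache.spark.streaming.{Seconds, StreamingContext}

       val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
       val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical socket source
       lines.flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .print()                                         // print each batch's counts
       ssc.start()                                           // start receiving and processing
       ssc.awaitTermination()                                // block until the stream is stopped
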
  7. MLlib
     • MLlib is Spark's scalable machine learning library.
     • MLlib works with any Hadoop data source, such as HDFS, HBase, and local files.
  8. MLlib
     • Algorithms:
       – linear SVM and logistic regression
       – classification and regression tree
       – k-means clustering
       – recommendation via alternating least squares
       – singular value decomposition
       – linear regression with L1- and L2-regularization
       – multinomial naive Bayes
       – basic statistics
       – feature transformations
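
     To make one of these concrete, a minimal k-means sketch against MLlib's Scala API, assuming an existing SparkContext sc and a hypothetical input file with one space-separated numeric vector per line; k and the iteration count are arbitrary.

       import org.apache.spark.mllib.clustering.KMeans
       import org.apache.spark.mllib.linalg.Vectors

       val points = sc.textFile("kmeans_data.txt")           // hypothetical input path
         .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
         .cache()                                            // iterative algorithm: keep in memory
       val model = KMeans.train(points, k = 2, maxIterations = 20)
       model.clusterCenters.foreach(println)                 // the learned centroids
       println(s"cost: ${model.computeCost(points)}")        // within-cluster sum of squares
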
  9. GraphX
     • GraphX is Spark's API for graphs and graph-parallel computation.
     • Works with both graphs and collections.
  10. GraphX
      • Algorithms:
        – PageRank
        – Connected components
        – Label propagation
        – SVD++
        – Strongly connected components
        – Triangle count
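
      A minimal PageRank sketch with GraphX's Scala API, assuming an existing SparkContext sc and a hypothetical edge-list file (one "srcId dstId" pair per line, the format GraphLoader.edgeListFile expects).

        import org.apache.spark.graphx.GraphLoader

        val graph = GraphLoader.edgeListFile(sc, "followers.txt")  // hypothetical edge list
        val ranks = graph.pageRank(tol = 0.0001).vertices          // run PageRank to convergence
        ranks.sortBy(_._2, ascending = false)                      // highest-ranked vertices first
             .take(5)
             .foreach(println)
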
  11. Spark Runs Everywhere
      • Spark runs on Hadoop, Mesos, standalone, or in the cloud.
      • Spark accesses diverse data sources including HDFS, Cassandra, HBase, and S3.
  12. Spark Programming Model
      • At a high level, every Spark application consists of a driver program that runs the user's main function.
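
      A minimal driver-program skeleton in Scala; the object name, app name, and local master are illustrative, not part of the deck.

        import org.apache.spark.{SparkConf, SparkContext}

        object MyApp {                                          // hypothetical application name
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
            val sc = new SparkContext(conf)                     // connects the driver to a cluster
            // ... parallel operations on RDDs go here ...
            sc.stop()
          }
        }
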
  13. Spark Programming Model
      • The main abstraction Spark provides is a resilient distributed dataset (RDD):
        – a collection of elements partitioned across the nodes of the cluster,
        – which can be operated on in parallel,
        – and which automatically recovers from node failures.
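
      A small sketch of creating an RDD from a driver-side collection, assuming an existing SparkContext sc; the data and partition count are arbitrary.

        val data = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)  // explicit partition count
        println(data.partitions.length)                               // 4: elements spread over 4 partitions
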
  14. Spark Programming Model
      • RDD operations:
        – Transformations: create a new dataset from an existing one. Example: map()
        – Actions: return a value to the driver program after running a computation on the dataset. Example: reduce()
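
      A minimal sketch of the two kinds of operations, assuming an existing SparkContext sc: the map() is lazy, and nothing runs until the reduce() action is called.

        val nums = sc.parallelize(1 to 100)
        val squares = nums.map(n => n * n)   // transformation: lazily defines a new RDD
        val total = squares.reduce(_ + _)    // action: runs the job, returns a value to the driver
        println(total)                       // 338350
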
  15. Spark Programming Model
      • Another abstraction is shared variables:
        – Broadcast variables, which can be used to cache a value in memory on all nodes.
        – Accumulators, variables that tasks can only add to, used to implement counters and sums.
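
      A small sketch showing both kinds, assuming an existing SparkContext sc; the data and lookup table are made up, and this uses the Spark 1.x accumulator API current when this deck was given.

        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))     // read-only copy cached on every node
        val misses = sc.accumulator(0)                         // counter that tasks can only add to

        sc.parallelize(Seq("a", "b", "x")).foreach { key =>
          if (!lookup.value.contains(key)) misses += 1         // tasks add; only the driver reads
        }
        println(s"keys missing from lookup: ${misses.value}")  // 1
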
  16. Resources
      • http://spark.apache.org
      • Intro to Apache Spark by Paco Nathan
      • http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
      • ZYMR