
Introduction to Apache Spark

Rahul Kavale
December 06, 2014


A talk I gave at a Big Data meetup introducing Apache Spark, comparing it with the MapReduce model, and more.


Transcript

  1. Some properties of “Big Data”
     1. Big data is inherently immutable, meaning it is not supposed to be updated once generated.
     2. The operations are mostly coarse grained when it comes to writes.
     3. Commodity hardware makes more sense for storage and computation of such enormous data, hence the data is distributed across a cluster of many such machines, and as we know, the distributed nature makes the programming complicated.
  2. Brush up of Hadoop concepts
     Distributed storage => HDFS
     Cluster manager => YARN
     Fault tolerance => achieved via replication
     Job scheduling => Scheduler in YARN
     Programming model => Mapper, Reducer
  3. MapReduce pain points
     1. Meant for batch jobs, hence usually has considerable latency
     2. Limits the programming model to Map and Reduce phases
     3. Non-trivial to test
     4. A real-life solution might result in a complex workflow
     5. Not suitable for iterative processing
  4. Immutability and the MapReduce model
     1. The MapReduce model fails to exploit the immutable nature of the data.
     2. The intermediate results are persisted to disk, causing a lot of IO and a serious performance hit.
  5. Wouldn’t it be very nice if we could have
     1. A programmer-friendly programming model
     2. Low latency
     3. A unified ecosystem
     4. Fault tolerance and other typical distributed system properties
     5. Easily testable code
     6. Of course, open source :)
  6. What is Apache Spark
     1. A cluster computing engine
     2. Abstracts the storage and cluster management aspects away from the computations
     3. Aims to unify otherwise spread-out interfaces to data
     4. Provides interfaces in Scala, Python, and Java
  7. Why should you care about Apache Spark
     1. Abstracts the underlying storage and cluster management, so you can plug in whichever you need
     2. Easy programming model
     3. Of course, highly performant compared to traditional MapReduce and its cousins
  8. 4. Recently set a new petabyte sort record
     5. Offers in-memory caching of data, resulting in a further performance boost
     6. Applications like graph processing (GraphX), streaming (Spark Streaming), machine learning (MLlib), and SQL (Spark SQL) are very easy to build and highly interoperable
     7. Data exploration via the Spark shell
  9. Word Count example

     // 'spark' here is the SparkContext
     val file = spark.textFile("input path")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("destination path")
  10. RDD
      1. RDD stands for Resilient Distributed Dataset.
      2. It forms the basic abstraction on which the Spark programming model works.
  11. 1. An RDD can be thought of as a distributed collection. The programming interface almost makes the distributed nature of the underlying data transparent.
      2. An RDD can be created via:
         a. parallelizing a collection,
         b. transforming an existing RDD by applying a transformation function,
         c. reading from a persistent data store like HDFS
         (see the sketch below for all three).
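
      A minimal sketch of the three creation paths, assuming a Spark shell or driver where the SparkContext is available as sc (the HDFS path is hypothetical):

        // a. parallelizing an in-memory collection
        val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))
        // b. transforming an existing RDD with a transformation function
        val doubled = fromCollection.map(_ * 2)
        // c. reading from a persistent store such as HDFS
        val fromStore = sc.textFile("hdfs://namenode/path/to/input")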
  12. RDD is immutable
      This is a very important point: even HDFS is a write-once, read-many-times/append-only store, which makes it effectively immutable, but the MapReduce model makes it impossible to exploit this fact for improving performance.
  13. RDD is lazily evaluated
      An RDD has two types of operations on it:
      a. Transformations: these just build up a DAG of transformations to be applied to the RDD and do not really evaluate anything
      b. Actions: these actually evaluate the DAG of transformations, giving us back the result
      (A small sketch of this laziness follows.)
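
      A small sketch of the laziness, assuming a SparkContext sc: the map below only records work in the DAG, and nothing runs until the count action is called.

        val numbers = sc.parallelize(1 to 1000)
        val squares = numbers.map(n => n * n) // transformation: recorded in the DAG, not evaluated yet
        val total   = squares.count()         // action: triggers evaluation of the DAG and returns 1000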
  14. RDD operations: Transformations
      map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
      filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
      flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
      sample(fraction : Float) : RDD[T] ⇒ RDD[T] (deterministic sampling)
      union() : (RDD[T], RDD[T]) ⇒ RDD[T]
      join() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
      groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
      reduceByKey(f : (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
      partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
      (A usage sketch follows this list.)
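
      A few of these transformations in use, as a sketch over made-up pair RDDs (assuming a SparkContext sc):

        val scores  = sc.parallelize(Seq(("alice", 10), ("bob", 5), ("alice", 7)))
        val ages    = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
        val nonZero = scores.filter { case (_, v) => v > 0 } // keep positive scores
        val totals  = nonZero.reduceByKey(_ + _)             // ("alice", 17), ("bob", 5)
        val joined  = totals.join(ages)                      // ("alice", (17, 30)), ("bob", (5, 25))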
  15. Actions
      count() : RDD[T] ⇒ Long
      collect() : RDD[T] ⇒ Seq[T]
      reduce(f : (T, T) ⇒ T) : RDD[T] ⇒ T
      lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (on hash/range partitioned RDDs)
      save(path : String) : outputs the RDD to a storage system, e.g., HDFS
      (A usage sketch follows this list.)
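
      The corresponding actions in use, continuing the sketch above (saveAsTextFile is the concrete API behind the save listed here; lookup needs a pair RDD):

        val howMany    = totals.count()                 // number of (key, value) pairs
        val inDriver   = totals.collect()               // pulls the pairs back to the driver
        val grandTotal = totals.values.reduce(_ + _)    // sum of all values
        val alice      = totals.lookup("alice")         // Seq(17)
        totals.saveAsTextFile("output path")            // write out to a storage system such as HDFS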
  16. Reduced IO
      1. No disk IO between phases, since the phases themselves are pipelined
      2. No network IO involved unless a shuffle is required
  17. No mandatory shuffle
      1. Programs are not bounded by map and reduce phases
      2. No mandatory shuffle and sort required
  18. In-memory caching of data
      1. Optional in-memory caching
      2. The DAG engine can apply certain optimizations, since when an action is called it knows all the transformations that have to be applied
      (A caching sketch follows.)
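
      A minimal caching sketch, assuming a SparkContext sc and an iterative job that reuses the same data (the path is hypothetical). cache() marks the RDD to be kept in memory after its first evaluation, so later actions skip re-reading and re-parsing the input.

        val points = sc.textFile("hdfs://namenode/path/to/points")
                       .map(line => line.split(",").map(_.toDouble))
                       .cache()                  // keep the parsed records in memory once computed
        for (i <- 1 to 10) {
          val iterationSum = points.map(p => p.sum).reduce(_ + _) // each pass reuses the cached RDD
        }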