Slide 1

Introduction to Apache Spark
Rahul Kavale (@yphalcombinator), Ashutosh Raina (@ashutoshraina)

Slide 2

Some properties of “Big Data”
1. Big data is inherently immutable; it is not supposed to be updated once generated.
2. Write operations are mostly coarse-grained.
3. Commodity hardware makes more sense for storing and processing such enormous data, so the data is distributed across a cluster of many machines, and that distributed nature makes the programming complicated.

Slide 3

Brushing up on Hadoop concepts
Distributed storage => HDFS
Cluster manager => YARN
Fault tolerance => achieved via replication
Job scheduling => Scheduler in YARN
Programming model => Mapper, Reducer

Slide 4

HDFS architecture diagram: http://hadoop.apache.org/docs/r1.2.1/images/hdfsarchitecture.gif

Slide 5

MapReduce and Hadoop overview: http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop

Slide 6

MapReduce pain points
1. Meant for batch jobs, so it usually has considerable latency
2. Limits the programming model to Map and Reduce phases
3. Non-trivial to test
4. A real-life solution often ends up as a complex workflow
5. Not suitable for iterative processing

Slide 7

Immutability and the MapReduce model
1. The MapReduce model fails to exploit the immutable nature of the data.
2. Intermediate results are persisted to disk, causing a lot of IO and a serious performance hit.

Slide 8

A few wrappers over MapReduce, and a lot of others...

Slide 9

Wouldn’t it be very nice if we could have
1. A programmer-friendly programming model
2. Low latency
3. A unified ecosystem
4. Fault tolerance and other typical distributed-system properties
5. Easily testable code
6. Of course, open source :)

Slide 10

Let me introduce you… Apache Spark: lightning-fast cluster computing

Slide 11

What is Apache Spark?
1. A cluster computing engine
2. Abstracts the storage and cluster-management aspects away from computations
3. Aims to unify otherwise spread-out interfaces to data
4. Provides interfaces in Scala, Python, and Java

Slide 12

What is Apache Spark? The Spark stack: https://spark.apache.org/images/spark-stack.png

Slide 13

Where does it fit in the existing Big Data ecosystem? http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html

Slide 14

Why should you care about Apache Spark?
1. Abstracts the underlying storage and cluster management; you can plug them in as per your need
2. Easy programming model
3. Of course, much more performant than traditional MapReduce and its cousins

Slide 15

4. Recently set a new petabyte sort record
5. Offers in-memory caching of data, resulting in a further performance boost
6. Applications like graph processing (GraphX), streaming (Spark Streaming), machine learning (MLlib), and SQL (Spark SQL) are very easy to build and highly interoperable
7. Data exploration via the Spark shell

Slide 16

Programming model for Apache Spark

Slide 17

Word Count example
val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("destination path")
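
The snippet above assumes an already-created SparkContext named spark. As a rough, self-contained sketch (not from the slides), the same job as a standalone Scala program might look like this; the application name, master URL, and paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] is just for experimentation; on a real cluster the master comes from spark-submit
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val file = sc.textFile("input path")                // RDD[String], one element per line
    val counts = file.flatMap(line => line.split(" "))  // RDD[String], one element per word
      .map(word => (word, 1))                           // RDD[(String, Int)]
      .reduceByKey(_ + _)                               // sum the 1s for each distinct word
    counts.saveAsTextFile("destination path")

    sc.stop()
  }
}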

Slide 18

Comparing the example with MapReduce
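
For contrast, here is a rough sketch (not from the slides) of the map and reduce halves of the same word count written against the classic Hadoop MapReduce API, in Scala to match the rest of the deck; the job/driver setup is omitted:

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map phase: emit (word, 1) for every token in the input line
class TokenizeMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    for (token <- value.toString.split(" ")) {
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reduce phase: sum the counts emitted for each word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

Even without the driver, configuration, and output-format boilerplate, the two fixed phases take noticeably more ceremony than the single Spark expression above.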

Slide 19

Spark Shell Demo
1. SparkContext
2. RDD
3. RDD operations
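
Roughly what such a spark-shell session looks like (a sketch, not the actual demo; inside the shell a SparkContext is already available as sc, and the file path is a placeholder):

// spark-shell pre-creates the SparkContext as sc
val lines = sc.textFile("some/local/file.txt")  // an RDD[String], one element per line
val words = lines.flatMap(_.split(" "))         // a transformation: nothing runs yet
words.count()                                   // an action: triggers the actual computation
words.take(5)                                   // another action: the first five words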

Slide 20

RDD
1. RDD stands for Resilient Distributed Dataset.
2. It forms the basic abstraction on which the Spark programming model works.

Slide 21

1. Can be thought of as a distributed collection. The programming interface almost makes the distributed nature of the underlying data transparent.
2. An RDD can be created by (see the sketch below):
a. parallelizing a collection
b. transforming an existing RDD by applying a transformation function
c. reading from a persistent data store like HDFS
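
A minimal sketch of all three, assuming an existing SparkContext named sc and a placeholder HDFS path:

val fromCollection = sc.parallelize(1 to 1000)      // (a) parallelize a local collection
val transformed = fromCollection.map(_ * 2)         // (b) transform an existing RDD
val fromStorage = sc.textFile("hdfs:///some/path")  // (c) read from a persistent store such as HDFS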

Slide 22

RDDs are immutable
This is a very important point: even HDFS is a write-once, read-many/append-only store, which makes the data effectively immutable, but the MapReduce model makes it impossible to exploit this fact for improving performance.

Slide 23

RDDs are lazily evaluated
An RDD supports two types of operations:
a. Transformations: merely build up a DAG of transformations to be applied to the RDD, without evaluating anything
b. Actions: actually evaluate the DAG of transformations, giving us back the result
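
A small sketch of this laziness, assuming an existing SparkContext named sc:

val nums = sc.parallelize(1 to 1000000)
val doubled = nums.map(_ * 2)            // transformation: only recorded in the DAG
val evens = doubled.filter(_ % 4 == 0)   // still nothing has been computed
val total = evens.count()                // action: the whole DAG is evaluated here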

Slide 24

RDD operations
Transformations
map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
sample(fraction : Float) : RDD[T] ⇒ RDD[T] (Deterministic sampling)
union() : (RDD[T], RDD[T]) ⇒ RDD[T]
join() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f : (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Slide 25

Actions
count() : RDD[T] ⇒ Long
collect() : RDD[T] ⇒ Seq[T]
reduce(f : (T, T) ⇒ T) : RDD[T] ⇒ T
lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (On hash/range partitioned RDDs)
save(path : String) : Outputs the RDD to a storage system, e.g., HDFS
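
A few of these operations in combination; a sketch assuming an existing SparkContext named sc, with made-up sample data:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val others = sc.parallelize(Seq(("a", "x"), ("b", "y")))

val summed = pairs.reduceByKey(_ + _)   // transformation: RDD[(String, Int)]
val joined = summed.join(others)        // transformation: RDD[(String, (Int, String))]
val asArray = joined.collect()          // action: brings the results to the driver
val howMany = joined.count()            // action: number of elements in the RDD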

Slide 26

Job Execution

Slide 27

Fault tolerance via lineage
MappedRDD → FilteredRDD → FlatMappedRDD → MappedRDD → HadoopRDD
(each arrow points to the parent RDD it was derived from; a lost partition can be recomputed by replaying the transformations from the source data)
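
You can inspect such a lineage yourself: calling toDebugString on an RDD prints the chain of parent RDDs. A sketch assuming an existing SparkContext named sc (the concrete RDD class names above are from early Spark releases; newer versions mostly report MapPartitionsRDD):

val result = sc.textFile("hdfs:///some/path")  // backed by a HadoopRDD
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
println(result.toDebugString)                  // prints the lineage, one parent RDD per line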

Slide 28

Testing
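
The slides only name the topic, but one common approach is to run a local SparkContext inside an ordinary unit test. A minimal sketch assuming ScalaTest's FunSuite is on the test classpath (newer ScalaTest versions call it AnyFunSuite):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite {
  test("reduceByKey sums the counts per word") {
    // local[2] runs Spark inside the test JVM, no cluster needed
    val conf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(Seq("a b", "b c"))
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") == 1)
      assert(counts("b") == 2)
    } finally {
      sc.stop()
    }
  }
}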

Slide 29

Why is Spark more performant than MapReduce?

Slide 30

Reduced IO
1. No disk IO between phases, since the phases themselves are pipelined
2. No network IO involved unless a shuffle is required

Slide 31

No mandatory shuffle
1. Programs are not bound by map and reduce phases
2. No mandatory shuffle-and-sort step is required

Slide 32

In-memory caching of data
1. Optional in-memory caching
2. The DAG engine can apply certain optimizations, since by the time an action is called it knows all the transformations that have to be applied
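
A sketch of the optional caching, assuming an existing SparkContext named sc and a placeholder path; cache() marks the RDD to be kept in memory once an action has computed it:

val logs = sc.textFile("hdfs:///some/logs")
val errors = logs.filter(_.contains("ERROR")).cache()  // mark for in-memory caching

errors.count()                                  // first action: computes and caches the filtered data
errors.filter(_.contains("timeout")).count()    // reuses the cached RDD instead of rereading the file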

Slide 33

Components of Spark: Spark SQL, Spark Streaming, MLlib, GraphX, SparkR

Slide 34

SparkR Demo

Slide 35

Questions?

Slide 36

Thank You! Happy Coding!

Slide 37

References
1. https://spark.apache.org/docs/latest/
2. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
3. http://hadoop.apache.org/