Slide 1

MapReduce to Spark - Why and How
Rahul Kavale (@yphalcombinator)

Slide 2

MapReduce - what does it mean?
1. The Map and Reduce collection operations
2. The programming model in Hadoop
3. The MapReduce implementation in Hadoop

Slide 3

Map and Reduce as traditional collection operations

1. Ruby
(1..5).map { |x| x * 2 }  #=> [2, 4, 6, 8, 10]
(1..5).reduce(:+)         #=> 15

2. Python
map(lambda x: x**2, range(5))
reduce(lambda x, y: x + y, range(5))

3. Scala
(1 to 5).map(number => number * 2)
(1 to 5).reduce((a, b) => a + b)

4. Clojure
(map inc [1 2 3 4 5])
(reduce + [1 2 3 4 5])

5. Java (by hand, pre-Java 8)
List<Integer> tempList = new ArrayList<>();
for (Integer elem : elements) {
    tempList.add(elem * 2);
}
int result = 0;
for (Integer elem : elements) {
    result += elem;
}

Slide 4

• The collection is in memory
• What if the collection does not fit in memory?
• We divide and conquer!
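To make the divide-and-conquer idea concrete, here is a minimal single-machine Scala sketch (the collection, chunk size, and variable names are illustrative, not from the slides): reduce each chunk independently, then combine the partial results.

// split the data into fixed-size chunks...
val elements = (1L to 100000L).toList
val chunks = elements.grouped(10000)
// ...reduce each chunk independently (in a real system, each chunk
// would go to a different machine)...
val partialSums = chunks.map(chunk => chunk.sum)
// ...then combine the partial results into the final answer
val total = partialSums.sum  // 5000050000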

Slide 5

Challenges of distributing the work
• breaking a large problem into smaller tasks
• distributing each sub-task to a different machine
• feeding workers their data locally
• sharing partial results among workers
• hardware faults?

Slide 6

Hadoop's MapReduce implementation
• abstracts away many low-level concerns
• handles the distributed-computing aspects
• handles hardware faults with replication
• stores data in distributed storage (HDFS)
• sharing results between tasks is done by the framework
• splits the problem into smaller ones

Slide 7

The MapReduce programming model
• two phases: Map and Reduce
• an implicit shuffle-and-sort phase, responsible for sharing results across sub-tasks
• MapReduce requires imposing a key-value structure on arbitrary datasets

Slide 8

Map Phase - Mapper

map: (k1, v1) → [(k2, v2)]

https://developer.yahoo.com/hadoop/tutorial/module4.html#basics

Slide 9

// type parameters are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>; word-count-style
// types are shown here as an assumption, since the slide left them out
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit zero or more (k2, v2) pairs via context.write(...)
    }
}

Slide 10

Reducer (Reduce phase)

reduce: (k2, [v2]) → [(k3, v3)]

https://developer.yahoo.com/hadoop/tutorial/module4.html#basics

Slide 11

// type parameters match the mapper's output: <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // combine all values for this key and emit the result via context.write(...)
    }
}

Slide 12

Diagram: http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop

Slide 13

MapReduce pain points
• only Map and Reduce phases
• results in complex workflows
• non-trivial to test
• considerable latency
• not suitable for iterative processing

Slide 14

A few wrappers over MapReduce: Apache Crunch, and lots of others...

Slide 15

What these wrappers don't offer: performance

Slide 16

Wouldn't it be very nice if we could have
• low latency
• a programmer-friendly programming model
• a unified ecosystem
• fault tolerance and other typical distributed-system properties
• easily testable code
• and of course, open source :)

Slide 17

Let me introduce you... Apache Spark: lightning-fast cluster computing

Slide 18

What is Apache Spark?
• a cluster-computing engine
• abstracts away the storage and cluster management
• unified interfaces to data
• APIs in Scala, Python, Java, R*

Slide 19

Where does it fit in the existing big data ecosystem?

http://www.kdnuggets.com/2014/06/yarn-all-rage-hadoop-summit.html

Slide 20

Spark performance metrics
• petabyte sort record: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Slide 21

Programming model for Apache Spark

Slide 22

Word count example

val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")
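The example assumes `spark` is a SparkContext (this deck predates SparkSession). A minimal setup sketch; the app name and master URL here are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" runs Spark on all local cores; on a real cluster this
// would be the cluster manager's URL instead
val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
val spark = new SparkContext(conf)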

Slide 23

Comparing the example with MapReduce

Slide 24

RDD
• RDD stands for Resilient Distributed Dataset
• the basic abstraction in Spark
• evaluated lazily

Slide 25

• the equivalent of a distributed collection
• the interface makes the distributed nature of the underlying data transparent
• an RDD is immutable
• can be created via:
  • parallelising a collection,
  • transforming an existing RDD by applying a transformation function,
  • reading from a persistent data store like HDFS (a sketch of all three follows below)
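A minimal sketch of the three creation paths, reusing the `spark` SparkContext from the word count example (the HDFS path is illustrative):

// 1. parallelising an in-memory collection
val fromCollection = spark.parallelize(1 to 100)
// 2. transforming an existing RDD (each transformation yields a new, immutable RDD)
val fromTransformation = fromCollection.map(n => n * 2)
// 3. reading from a persistent data store such as HDFS
val fromStorage = spark.textFile("hdfs:///path/to/input")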

Slide 26

An RDD has two types of operations:
• Transformations: build up a DAG of transformations to be applied to the RDD, without evaluating anything. e.g. map, flatMap, filter
• Actions: evaluate the DAG of transformations. e.g. count, saveAsTextFile (see the sketch below)
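A small sketch of this laziness (the values and names are illustrative):

val numbers = spark.parallelize(1 to 10)
val doubled = numbers.map(n => n * 2)    // transformation: nothing runs yet
val large = doubled.filter(n => n > 5)   // transformation: only extends the DAG
val howMany = large.count()              // action: triggers evaluation of the DAG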

Slide 27

Word count example

val file = spark.textFile("input path")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey((a, b) => a + b)
counts.saveAsTextFile("destination path")

Slide 28

Why is Spark more performant than MapReduce?

Slide 29

Reduced IO
• no disk IO between phases, since the phases themselves are pipelined
• no network IO involved unless a shuffle is required (see the sketch below)
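A sketch of where IO actually happens in the word count pipeline (the empty-word filter is an illustrative addition): narrow transformations like flatMap, filter, and map are pipelined within one stage, and only reduceByKey forces a shuffle.

val lines = spark.textFile("input path")
// flatMap and filter are pipelined in a single stage: records flow through
// both functions without touching disk or the network in between
val words = lines.flatMap(line => line.split(" ")).filter(word => word.nonEmpty)
// reduceByKey introduces a stage boundary: a shuffle moves data across the
// network so that all values for a given key land on the same machine
val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)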

Slide 30

No mandatory shuffle
• programs are not bounded by map and reduce phases
• no mandatory shuffle-and-sort phase is required (see the sketch below)
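For contrast, a complete job with no shuffle at all (the log-filtering example is an illustrative assumption): every operation is a narrow transformation, so Spark runs the whole job as a single stage, whereas MapReduce would still force a shuffle between its map and reduce phases.

val errorCount = spark.textFile("input path")
  .filter(line => line.contains("ERROR"))  // narrow transformation: no data movement
  .map(line => line.trim)                  // narrow transformation: pipelined with the filter
  .count()                                 // action: the whole job runs as one stage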

Slide 31

In-memory caching of data
• optional in-memory caching
• the DAG engine can apply certain optimisations, because when an action is called it knows the full set of transformations to be applied (a caching sketch follows below)
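A small caching sketch (the path is illustrative): cache() marks an RDD to be kept in memory once it has been computed, so subsequent actions reuse it instead of re-reading the input.

val data = spark.textFile("input path").filter(line => line.nonEmpty).cache()
data.count()  // first action: reads from storage and populates the cache
data.count()  // second action: served entirely from the in-memory cache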

Slide 32

Questions?

Slide 33

Thank You!