
MapReduce to Spark - Why and How

In this talk, I compare MapReduce with Spark, starting with a brief introduction to what MapReduce is and building up to programming with distributed collections in Spark.

Rahul Kavale

April 02, 2015

Transcript

  1. Map Reduce - what does it mean?
     1. The Map and Reduce collection operations
     2. The programming model in Hadoop
     3. The MapReduce implementation in Hadoop
  2. Map and Reduce as traditional collection operations
     1. Ruby
        (1..5).map { |x| x * 2 }   #=> [2, 4, 6, 8, 10]
        (1..5).reduce(:+)          #=> 15
     2. Python
        map(lambda x: x**2, range(5))
        reduce(lambda x, y: x + y, range(5))
     3. Scala
        (1 to 5).map(number => number * 2)
        (1 to 5).reduce((a, b) => a + b)
     4. Clojure
        (map inc [1 2 3 4 5])
        (reduce + [1 2 3 4 5])
     5. Java
        // given a List<Integer> elements:
        List<Integer> doubled = new ArrayList<>();
        for (Integer elem : elements) { doubled.add(elem * 2); }
        int result = 0;
        for (Integer elem : elements) { result += elem; }
  3. • The collection is in memory
     • What if the collection does not fit in memory?
     • We divide and conquer!
  4. Challenges in distributing the work
     • breaking a large problem into smaller tasks
     • distributing each subtask to a different machine
     • feeding each worker its data locally
     • sharing partial results among workers
     • handling hardware faults
  5. Hadoop MapReduce implementation
     • Abstracts away many low-level concerns
     • Handles the distributed-computing aspects
     • Handles hardware faults via replication
     • Stores data in distributed storage
     • Shares results between tasks via the framework
     • Splits the problem into smaller ones
  6. MapReduce programming model
     • Two phases: Map and Reduce
     • An implicit shuffle-and-sort phase, responsible for sharing results across subtasks
     • MapReduce requires imposing a key-and-value structure on arbitrary datasets
  7. Map phase - Mapper
     map: (k1, v1) → [(k2, v2)]
     https://developer.yahoo.com/hadoop/tutorial/module4.html#basics
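     The reduce phase has an analogous shape. As a rough sketch (the type
     aliases below are my own notation, not from the slides), the two phases
     can be written as function types:

        // Conceptual types of the two phases (a sketch; the framework groups
        // the mapper's [(k2, v2)] output by k2 before reduce is invoked).
        type MapFn[K1, V1, K2, V2]    = (K1, V1) => Seq[(K2, V2)]
        type ReduceFn[K2, V2, K3, V3] = (K2, Iterable[V2]) => Seq[(K3, V3)]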
  8. public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
       public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         // Body elided on the slide; a word-count mapper would, for example,
         // tokenize `value` and call context.write(new Text(token), new IntWritable(1)).
       }
     }
  9. public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         // Body elided on the slide; a word-count reducer would, for example,
         // sum the values and call context.write(key, new IntWritable(sum)).
       }
     }
  10. MapReduce pain points
      • Only Map and Reduce phases
      • Results in complex workflows
      • Non-trivial to test
      • Considerable latency
      • Not suitable for iterative processing
  11. Wouldn’t it be very nice if we could have
      • Low latency
      • A programmer-friendly programming model
      • A unified ecosystem
      • Fault tolerance and other typical distributed-system properties
      • Easily testable code
      • Of course, open source :)
  12. What is Apache Spark?
      • A cluster-computing engine
      • Abstracts away the storage and cluster management
      • Unified interfaces to data
      • APIs in Scala, Python, Java, R*
  13. Word count example
      val file = spark.textFile("input path")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey((a, b) => a + b)
      counts.saveAsTextFile("destination path")
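      Reading the same pipeline with the intermediate RDD types spelled out may
      help (a sketch; the intermediate names are mine, and `spark` is assumed
      to be a SparkContext):

         val file   = spark.textFile("input path")           // RDD[String]: one element per line
         val words  = file.flatMap(line => line.split(" "))  // RDD[String]: one element per word
         val pairs  = words.map(word => (word, 1))           // RDD[(String, Int)]: (word, 1) pairs
         val counts = pairs.reduceByKey((a, b) => a + b)     // RDD[(String, Int)]: count per word
         counts.saveAsTextFile("destination path")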
  14. RDD
      • RDD stands for Resilient Distributed Dataset
      • The basic abstraction in Spark
      • Evaluated lazily
  15. • The equivalent of a distributed collection
      • The interface makes the distributed nature of the underlying data transparent
      • An RDD is immutable
      • Can be created by (sketched below):
        • parallelising a collection
        • transforming an existing RDD by applying a transformation function
        • reading from a persistent data store like HDFS
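      A minimal sketch of the three creation routes (`sc` is assumed to be an
      existing SparkContext, and the HDFS path is made up):

         val fromCollection = sc.parallelize(1 to 5)           // parallelising a local collection
         val transformed    = fromCollection.map(_ * 2)        // transforming an existing RDD
         val fromStorage    = sc.textFile("hdfs:///some/path") // reading from a persistent store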
  16. An RDD has two types of operations
      • Transformations: build up a DAG of transformations to be applied to the RDD,
        but do not evaluate anything, e.g. map, flatMap, filter
      • Actions: evaluate the DAG of transformations, e.g. count, saveAsTextFile
      (see the sketch below)
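      A minimal sketch of this laziness (again assuming a SparkContext `sc`);
      nothing executes until the action is called:

         val numbers = sc.parallelize(1 to 100)
         val evens   = numbers.filter(_ % 2 == 0) // transformation: only extends the DAG
         val doubled = evens.map(_ * 2)           // transformation: still nothing has run
         val total   = doubled.count()            // action: evaluates the whole DAG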
  17. Word count example (revisited)
      val file = spark.textFile("input path")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey((a, b) => a + b)
      counts.saveAsTextFile("destination path")
  18. Reduced IO
      • No disk IO between phases, since the phases themselves are pipelined
      • No network IO involved unless a shuffle is required
  19. No mandatory shuffle
      • Programs are not bounded by map and reduce phases
      • No mandatory shuffle-and-sort step (see the sketch below)
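      A sketch of the distinction (the pipeline is my own example, assuming a
      SparkContext `sc`): consecutive narrow transformations are pipelined, and
      only the reduceByKey step forces a shuffle:

         val lines  = sc.textFile("input path")
         val narrow = lines.map(_.toUpperCase).filter(_.nonEmpty) // pipelined: no shuffle, no disk IO
         val wide   = lines.flatMap(_.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)                    // only this step requires a shuffle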
  20. In-memory caching of data
      • Optional in-memory caching (sketched below)
      • The DAG engine can apply certain optimisations: when an action is called,
        it knows all the transformations that have to be applied
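      A minimal caching sketch (the file name and loop are my own example,
      assuming a SparkContext `sc`); iterative algorithms benefit because the
      data is read and parsed only once:

         val points = sc.textFile("points.txt")                  // "points.txt" is a made-up path
                        .map(_.split(",").map(_.toDouble))
         points.cache()             // mark for in-memory caching; materialised on the first action
         val n = points.count()     // first action: reads from storage and populates the cache
         (1 to 10).foreach { _ =>
           points.map(_.sum).sum()  // later actions are served from memory
         }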