
MapReduce to Spark - Why and How

In this talk, I compare MapReduce with Spark, starting with a brief introduction to what MapReduce is and building up to programming with distributed collections in Spark.

Rahul Kavale

April 02, 2015

Transcript

  1. Map Reduce - what does it mean?
     1. The Map and Reduce collection operations
     2. The programming model in Hadoop
     3. The MapReduce implementation in Hadoop
  2. Map and Reduce as traditional collection operations
     1. Ruby
        (1..5).map { |x| x * 2 }   #=> [2, 4, 6, 8, 10]
        (1..5).reduce(:+)          #=> 15
     2. Python
        map(lambda x: x**2, range(5))
        reduce(lambda x, y: x + y, range(5))
     3. Scala
        (1 to 5).map(number => number * 2)
        (1 to 5).reduce((a, b) => a + b)
     4. Clojure
        (map inc [1 2 3 4 5])
        (reduce + [1 2 3 4 5])
     5. Java
        // given a List<Integer> elements:
        List<Integer> doubled = new ArrayList<>();
        for (Integer elem : elements) { doubled.add(elem * 2); }
        int result = 0;
        for (Integer elem : elements) { result += elem; }
  3. • The collection is in memory
     • What if the collection does not fit in memory?
     • We divide and conquer!
  4. Challenges in distributing the work
     • breaking a large problem into smaller tasks
     • distributing each subtask to a different machine
     • feeding each worker its data locally
     • sharing partial results among workers
     • handling hardware faults
  5. Hadoop MapReduce implementation
     • Abstracts away many low-level concerns
     • Handles the distributed-computing aspects
     • Handles hardware faults via replication
     • Stores data in distributed storage
     • Shares results between tasks via the framework
     • Splits the problem into smaller ones
  6. MapReduce programming model
     • Two phases: Map and Reduce
     • An implicit shuffle-and-sort phase, responsible for sharing results across subtasks
     • MapReduce requires imposing a key-and-value structure on arbitrary datasets
  7. Map phase - Mapper
     map: (k1, v1) → [(k2, v2)]
     https://developer.yahoo.com/hadoop/tutorial/module4.html#basics
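     The reduce phase has an analogous shape. As a rough sketch (the type
     aliases below are my own notation, not from the slides), the two phases
     can be written as function types:

        // Conceptual types of the two phases (a sketch; the framework groups
        // the mapper's [(k2, v2)] output by k2 before reduce is invoked).
        type MapFn[K1, V1, K2, V2]    = (K1, V1) => Seq[(K2, V2)]
        type ReduceFn[K2, V2, K3, V3] = (K2, Iterable[V2]) => Seq[(K3, V3)]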
  8. public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
       public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         // Body elided on the slide; a word-count mapper would, for example,
         // tokenize `value` and call context.write(new Text(token), new IntWritable(1)).
       }
     }
  9. public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         // Body elided on the slide; a word-count reducer would, for example,
         // sum the values and call context.write(key, new IntWritable(sum)).
       }
     }
  10. MapReduce pain points
      • Only Map and Reduce phases
      • Results in complex workflows
      • Non-trivial to test
      • Considerable latency
      • Not suitable for iterative processing
  11. Wouldn’t it be very nice if we could have
      • Low latency
      • A programmer-friendly programming model
      • A unified ecosystem
      • Fault tolerance and other typical distributed-system properties
      • Easily testable code
      • Of course, open source :)
  12. What is Apache Spark?
      • A cluster-computing engine
      • Abstracts away the storage and cluster management
      • Unified interfaces to data
      • APIs in Scala, Python, Java, R*
  13. Word count example
      val file = spark.textFile("input path")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey((a, b) => a + b)
      counts.saveAsTextFile("destination path")
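      Reading the same pipeline with the intermediate RDD types spelled out may
      help (a sketch; the intermediate names are mine, and `spark` is assumed
      to be a SparkContext):

         val file   = spark.textFile("input path")           // RDD[String]: one element per line
         val words  = file.flatMap(line => line.split(" "))  // RDD[String]: one element per word
         val pairs  = words.map(word => (word, 1))           // RDD[(String, Int)]: (word, 1) pairs
         val counts = pairs.reduceByKey((a, b) => a + b)     // RDD[(String, Int)]: count per word
         counts.saveAsTextFile("destination path")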
  14. RDD
      • RDD stands for Resilient Distributed Dataset
      • The basic abstraction in Spark
      • Evaluated lazily
  15. • The equivalent of a distributed collection
      • The interface makes the distributed nature of the underlying data transparent
      • An RDD is immutable
      • Can be created by (sketched below):
        • parallelising a collection
        • transforming an existing RDD by applying a transformation function
        • reading from a persistent data store like HDFS
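      A minimal sketch of the three creation routes (`sc` is assumed to be an
      existing SparkContext, and the HDFS path is made up):

         val fromCollection = sc.parallelize(1 to 5)           // parallelising a local collection
         val transformed    = fromCollection.map(_ * 2)        // transforming an existing RDD
         val fromStorage    = sc.textFile("hdfs:///some/path") // reading from a persistent store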
  16. An RDD has two types of operations
      • Transformations: build up a DAG of transformations to be applied to the RDD,
        but do not evaluate anything, e.g. map, flatMap, filter
      • Actions: evaluate the DAG of transformations, e.g. count, saveAsTextFile
      (see the sketch below)
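      A minimal sketch of this laziness (again assuming a SparkContext `sc`);
      nothing executes until the action is called:

         val numbers = sc.parallelize(1 to 100)
         val evens   = numbers.filter(_ % 2 == 0) // transformation: only extends the DAG
         val doubled = evens.map(_ * 2)           // transformation: still nothing has run
         val total   = doubled.count()            // action: evaluates the whole DAG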
  17. Word count example (revisited)
      val file = spark.textFile("input path")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey((a, b) => a + b)
      counts.saveAsTextFile("destination path")
  18. Reduced IO
      • No disk IO between phases, since the phases themselves are pipelined
      • No network IO involved unless a shuffle is required
  19. No mandatory shuffle
      • Programs are not bounded by map and reduce phases
      • No mandatory shuffle-and-sort step (see the sketch below)
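      A sketch of the distinction (the pipeline is my own example, assuming a
      SparkContext `sc`): consecutive narrow transformations are pipelined, and
      only the reduceByKey step forces a shuffle:

         val lines  = sc.textFile("input path")
         val narrow = lines.map(_.toUpperCase).filter(_.nonEmpty) // pipelined: no shuffle, no disk IO
         val wide   = lines.flatMap(_.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)                    // only this step requires a shuffle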
  20. In-memory caching of data
      • Optional in-memory caching (sketched below)
      • The DAG engine can apply certain optimisations: when an action is called,
        it knows all the transformations that have to be applied
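      A minimal caching sketch (the file name and loop are my own example,
      assuming a SparkContext `sc`); iterative algorithms benefit because the
      data is read and parsed only once:

         val points = sc.textFile("points.txt")                  // "points.txt" is a made-up path
                        .map(_.split(",").map(_.toDouble))
         points.cache()             // mark for in-memory caching; materialised on the first action
         val n = points.count()     // first action: reads from storage and populates the cache
         (1 to 10).foreach { _ =>
           points.map(_.sum).sum()  // later actions are served from memory
         }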