In this talk, I compare MapReduce with Spark, starting with a brief introduction to MapReduce and leading up to programming with distributed collections in Spark.
• Split the problem into smaller tasks
• Distribute each subtask to a different machine
• Feed workers the data locally
• Share partial results among workers (a single-machine sketch of this pattern follows)
• What about hardware faults?
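As a single-machine analogy of this divide-and-combine pattern (not from the talk), here is a minimal Scala sketch that splits a sum into chunks, runs each chunk on its own thread via Futures, and combines the partial results. Real frameworks do the same across machines, and additionally have to survive hardware faults.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DivideAndCombine {
  def main(args: Array[String]): Unit = {
    val data = (1L to 1000000L).toVector

    // Split the problem into smaller tasks: one chunk per "worker"
    val chunks = data.grouped(250000).toVector

    // Distribute each subtask: here threads stand in for machines
    val partials: Vector[Future[Long]] = chunks.map(chunk => Future(chunk.sum))

    // Share the partial results and combine them into the final answer
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(s"total = $total") // 500000500000
  }
}
```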
The framework takes care of the low-level concerns:
• Handles the distributed computing aspects
• Handles hardware faults with replication
• Stores data in distributed storage
• Shares results between tasks
• Splits the problem into smaller ones
• Implicit shuffle-and-sort phase is responsible for sharing results across subtasks
• MapReduce requires imposing a key/value structure on arbitrary datasets (illustrated below)
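To make the key/value requirement concrete, here is a plain-Scala sketch of a word count in MapReduce style. The mapper/shuffle/reducer names are illustrative, and the shuffle is written out explicitly here where a real MapReduce framework performs it implicitly between the map and reduce phases.

```scala
object MiniMapReduce {
  // Map phase: impose a (key, value) structure on arbitrary input lines
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle-and-sort phase: group all values by key
  // (done implicitly by the framework in real MapReduce)
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Reduce phase: combine the values for each key
  def reducer(key: String, values: Seq[Int]): (String, Int) =
    (key, values.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark and mapreduce", "spark and scala")
    val counts = shuffle(lines.flatMap(mapper)).map((reducer _).tupled)
    println(counts) // e.g. Map(spark -> 2, and -> 2, mapreduce -> 1, scala -> 1)
  }
}
```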
• Low latency
• Programmer-friendly programming model
• Unified ecosystem
• Fault tolerance and other typical distributed-system properties
• Easily testable code (see the sketch after this list)
• And of course, open source :)
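As an illustration of the "easily testable" point, a minimal sketch of a word count run against a local, in-process Spark master, so no cluster is needed; the app name and data are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalSparkExample {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads,
    // which makes the code easy to exercise in a unit test
    val conf = new SparkConf().setAppName("word-count-test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(Seq("spark and mapreduce", "spark and scala"))
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("spark") == 2) // assertion plays the role of a test
    } finally {
      sc.stop()
    }
  }
}
```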
• RDDs make the distributed nature of the underlying data transparent
• An RDD is immutable
• An RDD can be created by (see the sketch after this list):
  • parallelising a collection
  • transforming an existing RDD by applying a transformation function
  • reading from a persistent data store like HDFS
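A short sketch of the three creation routes, assuming a local SparkContext; the HDFS path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1. Parallelising an in-memory collection
    val numbers = sc.parallelize(1 to 100)

    // 2. Transforming an existing RDD: returns a *new* immutable RDD;
    //    the original is never modified
    val squares = numbers.map(n => n * n)

    // 3. Reading from a persistent store such as HDFS (placeholder path)
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    println(squares.sum()) // 338350.0
    sc.stop()
  }
}
```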
• Transformations build up a DAG of operations to be applied on the RDD but do not evaluate anything, e.g. map, flatMap, filter
• Actions evaluate the DAG of transformations and produce a result, e.g. count, saveAsTextFile (see the sketch below)
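A minimal sketch of the distinction, assuming a local SparkContext: the filter and map calls below only extend the DAG, and nothing is computed until the count action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "mapreduce", "scala", "spark"))

    // Transformations: only build up the DAG, nothing is computed yet
    val longWords = words.filter(_.length > 5).map(_.toUpperCase)

    // Action: evaluates the whole DAG and returns a result
    println(longWords.count()) // triggers the computation; prints 1

    // saveAsTextFile is another action (placeholder path)
    // longWords.saveAsTextFile("hdfs://namenode:8020/out/long-words")

    sc.stop()
  }
}
```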