immutable, meaning it is not supposed to be updated once generated. 2. Write operations are mostly coarse grained. 3. Commodity hardware makes more sense for storing and processing such enormous data; hence the data is distributed across a cluster of many such machines, and, as we know, this distributed nature makes the programming complicated.
usually having considerable latency 2. Limits the programming model to Map and Reduce phases 3. Non-trivial to test 4. A real-life solution might result in a complex workflow 5. Not suitable for iterative processing
Programmer-friendly programming model 2. Low latency 3. Unified ecosystem 4. Fault tolerance and other typical distributed-system properties 5. Easily testable code 6. Of course, open source :)
the storage and cluster-management aspects from computation 3. Aims to unify otherwise spread-out interfaces to data 4. Provides interfaces in Scala, Python, and Java
storage and cluster management: you can plug them in as per your need 2. Easy programming model 3. Of course, much better performance compared to traditional MapReduce and its cousins
in-memory caching of data, resulting in a further performance boost (see the sketch below) 6. Applications like graph processing (via GraphX), streaming (Spark Streaming), machine learning (MLlib), and SQL (Spark SQL) are very easy to build and highly interoperable 7. Data exploration via the Spark shell (spark-shell)
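To make the programming model and the in-memory caching concrete, here is a minimal Scala sketch using the RDD API. It assumes a `SparkContext` named `sc`, as provided by spark-shell; the HDFS path is made up for illustration.

```scala
val lines = sc.textFile("hdfs:///data/input.txt")    // hypothetical path
val words = lines.flatMap(_.split("\\s+")).cache()   // mark this RDD for in-memory caching

println(words.count())                               // first action: reads from HDFS and fills the cache

val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)
println(wordCounts.take(10).mkString("\n"))          // reuses the cached `words`, no re-read from HDFS
```

The second job over `words` is served from memory, which is where much of the speedup over disk-bound MapReduce chains comes from.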
interface almost makes the distributed nature of the underlying data transparent. 2. It can be created by: a. parallelizing a collection, b. transforming an existing RDD by applying a transformation function, or c. reading from a persistent data store like HDFS (see the sketch below).
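A minimal Scala sketch of those three creation paths, again assuming a `SparkContext` named `sc`; the collection contents and the HDFS path are hypothetical.

```scala
// a. Parallelizing an in-memory collection
val numbers = sc.parallelize(1 to 1000)

// b. Transforming an existing RDD
val squares = numbers.map(n => n * n)

// c. Reading from a persistent data store such as HDFS
val logLines = sc.textFile("hdfs:///logs/2015/*.log")   // hypothetical path
```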
even HDFS is a write-once, read-many-times/append-only store, making it immutable, but the MapReduce model makes it impossible to exploit this fact for improving performance.
on it: a. Transformations just build up a DAG of transformations to be applied to the RDD, without actually evaluating anything. b. Actions actually evaluate the DAG of transformations, giving us back the result (see the sketch below).
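A small Scala sketch of this lazy evaluation, assuming a `SparkContext` named `sc`; the data is made up for illustration.

```scala
// Transformations only extend the DAG; nothing is computed yet.
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: no work done
val doubled = evens.map(_ * 2)             // transformation: still no work done

// An action finally evaluates the DAG and returns a result to the driver.
val total = doubled.reduce(_ + _)          // action: triggers the actual computation
println(total)
```

Until `reduce` is called, Spark has only recorded the lineage of `doubled`; the filter and map run only when the action demands a result.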