
Introduction to Apache Spark

Rahul Kavale
December 06, 2014


A talk I gave at a Big Data meetup introducing Apache Spark, comparing it with the MapReduce model, and more.


Transcript

  1. Some properties of “Big Data”
     1. Big data is inherently immutable, meaning it is not supposed to be updated once generated.
     2. The operations are mostly coarse grained when it comes to writes.
     3. Commodity hardware makes more sense for storage and computation of such enormous data, hence the data is distributed across a cluster of many such machines, and as we know, the distributed nature makes the programming complicated.
  2. Brush up of Hadoop concepts
     Distributed storage => HDFS
     Cluster manager => YARN
     Fault tolerance => achieved via replication
     Job scheduling => Scheduler in YARN
     Programming model => Mapper, Reducer
  3. MapReduce pain points
     1. Meant for batch jobs, hence usually has considerable latency
     2. Limits the programming model to Map and Reduce phases
     3. Non-trivial to test
     4. A real-life solution might result in a complex workflow
     5. Not suitable for iterative processing
  4. Immutability and the MapReduce model
     1. The MapReduce model fails to exploit the immutable nature of the data.
     2. The intermediate results are persisted to disk, causing a lot of IO and a serious performance hit.
  5. Wouldn’t it be very nice if we could have
     1. A programmer-friendly programming model
     2. Low latency
     3. A unified ecosystem
     4. Fault tolerance and other typical distributed system properties
     5. Easily testable code
     6. Of course, open source :)
  6. What is Apache Spark
     1. A cluster computing engine
     2. Abstracts the storage and cluster management aspects away from the computations
     3. Aims to unify otherwise spread-out interfaces to data
     4. Provides interfaces in Scala, Python, and Java
  7. Why should you care about Apache Spark
     1. Abstracts the underlying storage and cluster management, so you can plug in whichever you need
     2. Easy programming model
     3. Of course, highly performant compared to traditional MapReduce and its cousins
  8. 4. Recently set a new petabyte sort record
     5. Offers in-memory caching of data, resulting in a further performance boost
     6. Applications like graph processing (GraphX), streaming (Spark Streaming), machine learning (MLlib), and SQL (Spark SQL) are very easy to build and highly interoperable
     7. Data exploration via the Spark shell
  9. Word Count example

     // 'spark' here is the SparkContext
     val file = spark.textFile("input path")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("destination path")
  10. RDD
      1. RDD stands for Resilient Distributed Dataset.
      2. It forms the basic abstraction on which the Spark programming model works.
  11. 1. An RDD can be thought of as a distributed collection. The programming interface almost makes the distributed nature of the underlying data transparent.
      2. An RDD can be created via:
         a. parallelizing a collection,
         b. transforming an existing RDD by applying a transformation function,
         c. reading from a persistent data store like HDFS
         (see the sketch below for all three).
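
      A minimal sketch of the three creation paths, assuming a Spark shell or driver where the SparkContext is available as sc (the HDFS path is hypothetical):

        // a. parallelizing an in-memory collection
        val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))
        // b. transforming an existing RDD with a transformation function
        val doubled = fromCollection.map(_ * 2)
        // c. reading from a persistent store such as HDFS
        val fromStore = sc.textFile("hdfs://namenode/path/to/input")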
  12. RDD is immutable
      This is a very important point: even HDFS is a write-once, read-many-times/append-only store, which makes it effectively immutable, but the MapReduce model makes it impossible to exploit this fact for improving performance.
  13. RDD is lazily evaluated
      An RDD has two types of operations on it:
      a. Transformations: these just build up a DAG of transformations to be applied to the RDD and do not really evaluate anything
      b. Actions: these actually evaluate the DAG of transformations, giving us back the result
      (A small sketch of this laziness follows.)
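
      A small sketch of the laziness, assuming a SparkContext sc: the map below only records work in the DAG, and nothing runs until the count action is called.

        val numbers = sc.parallelize(1 to 1000)
        val squares = numbers.map(n => n * n) // transformation: recorded in the DAG, not evaluated yet
        val total   = squares.count()         // action: triggers evaluation of the DAG and returns 1000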
  14. RDD operations: Transformations
      map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
      filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
      flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
      sample(fraction : Float) : RDD[T] ⇒ RDD[T] (deterministic sampling)
      union() : (RDD[T], RDD[T]) ⇒ RDD[T]
      join() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
      groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
      reduceByKey(f : (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
      partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
      (A usage sketch follows this list.)
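
      A few of these transformations in use, as a sketch over made-up pair RDDs (assuming a SparkContext sc):

        val scores  = sc.parallelize(Seq(("alice", 10), ("bob", 5), ("alice", 7)))
        val ages    = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
        val nonZero = scores.filter { case (_, v) => v > 0 } // keep positive scores
        val totals  = nonZero.reduceByKey(_ + _)             // ("alice", 17), ("bob", 5)
        val joined  = totals.join(ages)                      // ("alice", (17, 30)), ("bob", (5, 25))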
  15. Actions
      count() : RDD[T] ⇒ Long
      collect() : RDD[T] ⇒ Seq[T]
      reduce(f : (T, T) ⇒ T) : RDD[T] ⇒ T
      lookup(k : K) : RDD[(K, V)] ⇒ Seq[V] (on hash/range partitioned RDDs)
      save(path : String) : outputs the RDD to a storage system, e.g., HDFS
      (A usage sketch follows this list.)
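
      The corresponding actions in use, continuing the sketch above (saveAsTextFile is the concrete API behind the save listed here; lookup needs a pair RDD):

        val howMany    = totals.count()                 // number of (key, value) pairs
        val inDriver   = totals.collect()               // pulls the pairs back to the driver
        val grandTotal = totals.values.reduce(_ + _)    // sum of all values
        val alice      = totals.lookup("alice")         // Seq(17)
        totals.saveAsTextFile("output path")            // write out to a storage system such as HDFS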
  16. Reduced IO
      1. No disk IO between phases, since the phases themselves are pipelined
      2. No network IO involved unless a shuffle is required
  17. No mandatory shuffle
      1. Programs are not bounded by map and reduce phases
      2. No mandatory shuffle and sort required
  18. In-memory caching of data
      1. Optional in-memory caching
      2. The DAG engine can apply certain optimizations, since when an action is called it knows all the transformations that have to be applied
      (A caching sketch follows.)
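
      A minimal caching sketch, assuming a SparkContext sc and an iterative job that reuses the same data (the path is hypothetical). cache() marks the RDD to be kept in memory after its first evaluation, so later actions skip re-reading and re-parsing the input.

        val points = sc.textFile("hdfs://namenode/path/to/points")
                       .map(line => line.split(",").map(_.toDouble))
                       .cache()                  // keep the parsed records in memory once computed
        for (i <- 1 to 10) {
          val iterationSum = points.map(p => p.sum).reduce(_ + _) // each pass reuses the cached RDD
        }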