Introduction to Apache Spark

Short introduction to data processing using Apache Spark, with the ubiquitous word count example. Most of the presentation was live coding, which is unfortunately not part of the slides.

Erik Rozendaal

May 08, 2014

Transcript

  1. SPARK
     • Apache project originally from UC Berkeley
     • A better tool to work with Big Data (but also useful for not-so-big data)
     • Interactive Data Analysis
     • Iterative Algorithms, Machine Learning, Graph
     • “Real-Time” Stream Processing
  2. OVERVIEW
     • Integrates with Hadoop HDFS and YARN
     • But also runs in local and standalone mode
     • Scala, Java, and Python APIs
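
     As a rough sketch of what local mode looks like in practice (the "local[2]" master URL and the file path are assumptions for illustration, not part of the original slides), a SparkContext can be pointed at in-process worker threads instead of a cluster:

       import org.apache.spark.SparkContext

       object LocalModeExample {
         def main(args: Array[String]): Unit = {
           // "local[2]" runs Spark in-process with two worker threads; on a cluster
           // this would be a YARN, Mesos, or standalone master URL instead.
           val sc = new SparkContext("local[2]", "local-mode-example")

           // A local file is read the same way an HDFS path would be.
           val lines = sc.textFile("data/sample.txt")   // hypothetical path
           println(s"line count = ${lines.count()}")

           sc.stop()
         }
       }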
  3. RDD[A]
     • Resilient Distributed Dataset
     • A bit like a list, just bigger
     • RDDs consist of multiple partitions
     • Partitions are distributed across the cluster, cached in memory, spilled to disk, automatically recovered on failure, etc.
     • Partitions must fit in RAM (but only N at a time)
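
     A small sketch of the partition idea (the data and the partition count of 8 are made up for illustration): an RDD can be built from a local collection with an explicit number of partitions, and work is executed one partition at a time per worker:

       // Assumes an existing SparkContext `sc`, as in the example above.
       val numbers = sc.parallelize(1 to 1000, numSlices = 8)   // 8 partitions, chosen arbitrarily

       println(numbers.partitions.length)                       // => 8

       // Each partition is processed independently, so only the partitions
       // currently being computed need to fit in memory at once.
       val partitionSums = numbers.mapPartitions(it => Iterator(it.sum))
       println(partitionSums.collect().toList)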
  4. RDD API
     • API intentionally very similar to the standard Scala collections (Array[A], List[A], etc.)
     • map, flatMap, filter, groupBy, count, foreach, union, zip, ...
     • Additional operations like (outer) join, co-group, top-N, reduceByKey, pipe, ...
     • Transformations are evaluated lazily; they only run when required by actions
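
     To illustrate the laziness point (the input data here is made up), transformations such as map and reduceByKey only describe the computation, and nothing is executed until an action such as count or collect is called:

       // Assumes an existing SparkContext `sc`.
       val words = sc.parallelize(Seq("spark", "hadoop", "rdd", "spark"))

       // Transformations: these return new RDDs immediately, without touching the data.
       val pairs   = words.map(word => (word, 1))
       val counted = pairs.reduceByKey(_ + _)

       // Actions: only now is the whole chain actually computed.
       println(counted.count())            // number of distinct words: 3
       counted.collect().foreach(println)  // (spark,2), (hadoop,1), (rdd,1) in some order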
  5. SPARK WORD COUNT

     object WordCount {
       def main(args: Array[String]) {
         val spark = new SparkContext(args(0), "wordcount")
         val lines = spark.textFile(args(1))
         val words = lines
           .flatMap(line => line.split("\\W+"))
           .filter(word => word.nonEmpty)
         val counts = words
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile(args(2))
       }
     }
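
     How the arguments are used is worth spelling out (the master URL and paths below are assumptions for illustration): args(0) is the master, args(1) the input path, and args(2) the output directory, so a local run could look like this:

       // Run the job in-process against local files; on a real cluster the first
       // argument would be the cluster's master URL and the paths would point at HDFS.
       WordCount.main(Array("local[2]", "data/input.txt", "data/wordcount-output"))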
  6. CACHING
     • Whenever the data of an RDD is needed, the RDD is computed
     • RDDs that are expensive to re-compute can be cached, persisted, and/or replicated
     • rdd.persist(MEMORY_AND_DISK_SER_2)
     • cache is an alias for persist(MEMORY_ONLY)
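
     A short sketch of the persist call spelled out in full (the input path is hypothetical and the RDD stands in for something expensive to recompute):

       import org.apache.spark.storage.StorageLevel

       // Assumes an existing SparkContext `sc`.
       val words = sc.textFile("data/big.txt")   // hypothetical input
         .flatMap(_.split("\\W+"))
         .filter(_.nonEmpty)

       // Keep the partitions serialized in memory, spill to disk when they don't fit,
       // and replicate each partition to a second node.
       words.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

       words.count()   // first action: computes the RDD and populates the cache
       words.count()   // second action: served from the persisted copies

       // cache() on another RDD is shorthand for persist(StorageLevel.MEMORY_ONLY).
       val lines = sc.textFile("data/big.txt").cache()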
  7. SPARK WORD COUNT

     object WordCount {
       def main(args: Array[String]) {
         val spark = new SparkContext(args(0), "wordcount")
         val lines = spark.textFile(args(1))
         val words = lines
           .flatMap(line => line.split("\\W+"))
           .filter(word => word.nonEmpty)
         val counts = words
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile(args(2))
       }
     }

     public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
       public void map(LongWritable key, Text value, Context context)
           throws IOException, InterruptedException {
         String line = value.toString();
         for (String word : line.split("\\W+")) {
           if (!word.isEmpty()) {
             context.write(new Text(word), new IntWritable(1));
           }
         }
       }
     }

     public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         context.write(key, new IntWritable(sum));
       }
     }

     public class WordCount {
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "wordcount");
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         job.waitForCompletion(true);
       }
     }
  8. VS HADOOP MAP/REDUCE
     • Like Hadoop Map/Reduce, Spark is built on top of Hadoop HDFS
     • The Spark code base is much smaller (and written in Scala)
     • Many active contributors
     • Much easier to program
     • Hadoop Map/Reduce should be considered legacy
     • ... this is both good and bad!
  9. PITFALLS
     • Distributed systems are (inherently) complex
     • Spark has an easy-to-use, semantically clean API
     • Operationally, the complexity is still there
     • Requires tuning and hand-holding!
     • Spark relies on reflection-based serialization (Java / Kryo)
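
     A hedged sketch of the serialization point (the app name and master URL are assumptions for illustration; spark.serializer and spark.kryo.registrator are standard Spark configuration keys): switching from the default Java serialization to Kryo is a configuration change:

       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf()
         .setAppName("kryo-example")
         .setMaster("local[2]")   // assumption: run locally for illustration
         // Switch from the default Java serialization to Kryo for data shipped between nodes.
         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         // Application classes can additionally be registered with Kryo
         // (via spark.kryo.registrator) to avoid writing full class names into every record.

       val sc = new SparkContext(conf)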