Slide 1

INTRODUCTION TO APACHE SPARK

Erik Rozendaal

Slide 2

SPARK

• Apache project originally from UC Berkeley
• A better tool to work with Big Data (but also useful for not-so-big data)
• Interactive Data Analysis
• Iterative Algorithms, Machine Learning, Graph
• “Real-Time” Stream Processing

Slide 3

OVERVIEW

• Integrates with Hadoop HDFS and YARN
• But also runs in local and standalone mode (see the sketch below)
• Scala, Java, Python APIs
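For development you can run the same code entirely in-process. Here is a minimal sketch of local mode; the master URL, app name, and data are illustrative, not from the slides:

  import org.apache.spark.{SparkConf, SparkContext}

  object LocalModeExample {
    def main(args: Array[String]) {
      // "local[*]" runs Spark in-process, one worker thread per CPU core;
      // swapping in a cluster master URL is the only change needed to scale out.
      val conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("local-mode-example")  // illustrative name
      val sc = new SparkContext(conf)

      val numbers = sc.parallelize(1 to 100)
      println(numbers.sum())  // 5050.0

      sc.stop()
    }
  }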

Slide 4

RDD[A]

• Resilient, Distributed Dataset
• A bit like a list, just bigger
• RDDs consist of multiple partitions
• Partitions are distributed across the cluster, cached in memory, spilled to disk, automatically recovered on failure, etc.
• Partitions must fit in RAM (but just N-at-a-time; see the sketch below)
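A minimal sketch of the partition model, assuming an existing SparkContext named sc (the element and partition counts are illustrative):

  // Explicitly split a local collection into 8 partitions.
  val data = sc.parallelize(1 to 1000000, 8)
  println(data.partitions.length)  // 8

  // Transformations apply partition by partition; only the partitions
  // currently being processed must be resident in memory at the same time.
  val doubled = data.map(_ * 2)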

Slide 5

RDD API

• API intentionally very similar to the standard Scala collections (Array[A], List[A], etc.)
• map, flatMap, filter, groupBy, count, foreach, union, zip, ...
• Additional operations like (outer) join, co-group, top-N, reduceByKey, pipe, ...
• Transformations are lazily evaluated: they only run when required by actions (see the sketch below)
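A minimal sketch of the transformation/action split, assuming an existing SparkContext named sc (the file path and parsing are illustrative):

  val lines  = sc.textFile("hdfs:///data/access.log")        // nothing read yet
  val errors = lines.filter(line => line.contains("ERROR"))  // transformation: lazy
  val pairs  = errors.map(line => (line.split(" ")(0), 1))   // transformation: lazy
  val counts = pairs.reduceByKey(_ + _)                      // transformation: lazy

  counts.take(10).foreach(println)  // action: triggers the whole pipeline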

Slide 6

DEMO

Slide 7

SPARK WORD COUNT

object WordCount {
  def main(args: Array[String]) {
    // args(0) = master URL, args(1) = input path, args(2) = output path
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}
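As a usage sketch (the master URL and paths below are hypothetical, not from the slides):

  // Run locally with 2 threads, counting words in input.txt and
  // writing part files under the counts-output/ directory:
  WordCount.main(Array("local[2]", "input.txt", "counts-output"))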

Slide 8

CACHING

• Whenever the data of an RDD is needed, the RDD is (re)computed
• RDDs that are expensive to re-compute can be cached, persisted, and/or replicated (see the sketch below)
• rdd.persist(MEMORY_AND_DISK_SER_2)
• cache is an alias for persist(MEMORY_ONLY)
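A minimal sketch of persisting an expensive RDD, assuming an existing SparkContext named sc (the path and parsing are illustrative):

  import org.apache.spark.storage.StorageLevel

  val parsed = sc.textFile("hdfs:///data/events.log")
    .map(line => line.split("\t"))
    .persist(StorageLevel.MEMORY_AND_DISK_SER_2)  // serialized, spills to disk, 2 replicas

  parsed.count()  // first action computes the partitions and populates the cache
  parsed.count()  // second action is served from the cache

  parsed.unpersist()  // release the cached partitions when no longer needed

Note that an RDD's storage level can only be assigned once, so choose it before the first action runs.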

Slide 9

DEMO

Slide 10

VS HADOOP MAP/REDUCE

No contest

Slide 11

SPARK WORD COUNT

object WordCount {
  def main(args: Array[String]) {
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}

versus the equivalent Hadoop Map/Reduce version:

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Slide 12

VS HADOOP MAP/REDUCE

• Like Hadoop Map/Reduce, it is built on top of Hadoop HDFS
• The Spark code base is much smaller (and written in Scala)
• Many active contributors
• Much easier to program
• Hadoop Map/Reduce should be considered legacy
• ... this is both good and bad!

Slide 13

DEMO

Slide 14

PITFALLS

• Distributed systems are (inherently) complex
• Spark has an easy-to-use, semantically clean API
• Operationally, the complexity is still there
• Requires tuning and hand-holding!
• Spark relies on reflection-based serialization (Java / Kryo; see the sketch below)
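A minimal sketch of opting in to Kryo (the app name is illustrative; by default Spark serializes shuffled and serialized-cached data with Java serialization):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("kryo-example")
    // Kryo is faster and more compact than Java serialization, but both
    // rely on reflection, so non-serializable fields fail only at runtime.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  val sc = new SparkContext(conf)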

Slide 15

Slide 16

THANK YOU!