Slide 1

INTRODUCTION TO APACHE SPARK

Erik Rozendaal

Slide 2

SPARK

• Apache project originally from UC Berkeley
• A better tool to work with Big Data (but also useful for not-so-big data)
• Interactive Data Analysis
• Iterative Algorithms, Machine Learning, Graph
• “Real-Time” Stream Processing

Slide 3

OVERVIEW

• Integrates with Hadoop HDFS and YARN
• But also runs in local and standalone mode (see the sketch below)
• Scala, Java, Python APIs
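For development you can run the same code entirely in-process. Here is a minimal sketch of local mode; the master URL, app name, and data are illustrative, not from the slides:

  import org.apache.spark.{SparkConf, SparkContext}

  object LocalModeExample {
    def main(args: Array[String]) {
      // "local[*]" runs Spark in-process, one worker thread per CPU core;
      // swapping in a cluster master URL is the only change needed to scale out.
      val conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("local-mode-example")  // illustrative name
      val sc = new SparkContext(conf)

      val numbers = sc.parallelize(1 to 100)
      println(numbers.sum())  // 5050.0

      sc.stop()
    }
  }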

Slide 4

RDD[A]

• Resilient, Distributed Dataset
• A bit like a list, just bigger
• RDDs consist of multiple partitions
• Partitions are distributed across the cluster, cached in memory, spilled to disk, automatically recovered on failure, etc.
• Partitions must fit in RAM (but just N-at-a-time; see the sketch below)
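A minimal sketch of the partition model, assuming an existing SparkContext named sc (the element and partition counts are illustrative):

  // Explicitly split a local collection into 8 partitions.
  val data = sc.parallelize(1 to 1000000, 8)
  println(data.partitions.length)  // 8

  // Transformations apply partition by partition; only the partitions
  // currently being processed must be resident in memory at the same time.
  val doubled = data.map(_ * 2)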

Slide 5

RDD API

• API intentionally very similar to the standard Scala collections (Array[A], List[A], etc.)
• map, flatMap, filter, groupBy, count, foreach, union, zip, ...
• Additional operations like (outer) join, co-group, top-N, reduceByKey, pipe, ...
• Transformations are lazily evaluated: they only run when required by actions (see the sketch below)
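A minimal sketch of the transformation/action split, assuming an existing SparkContext named sc (the file path and parsing are illustrative):

  val lines  = sc.textFile("hdfs:///data/access.log")        // nothing read yet
  val errors = lines.filter(line => line.contains("ERROR"))  // transformation: lazy
  val pairs  = errors.map(line => (line.split(" ")(0), 1))   // transformation: lazy
  val counts = pairs.reduceByKey(_ + _)                      // transformation: lazy

  counts.take(10).foreach(println)  // action: triggers the whole pipeline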

Slide 6

DEMO

Slide 7

SPARK WORD COUNT

object WordCount {
  def main(args: Array[String]) {
    // args(0) = master URL, args(1) = input path, args(2) = output path
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}
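As a usage sketch (the master URL and paths below are hypothetical, not from the slides):

  // Run locally with 2 threads, counting words in input.txt and
  // writing part files under the counts-output/ directory:
  WordCount.main(Array("local[2]", "input.txt", "counts-output"))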

Slide 8

CACHING

• Whenever the data of an RDD is needed, the RDD is (re)computed
• RDDs that are expensive to re-compute can be cached, persisted, and/or replicated (see the sketch below)
• rdd.persist(MEMORY_AND_DISK_SER_2)
• cache is an alias for persist(MEMORY_ONLY)
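A minimal sketch of persisting an expensive RDD, assuming an existing SparkContext named sc (the path and parsing are illustrative):

  import org.apache.spark.storage.StorageLevel

  val parsed = sc.textFile("hdfs:///data/events.log")
    .map(line => line.split("\t"))
    .persist(StorageLevel.MEMORY_AND_DISK_SER_2)  // serialized, spills to disk, 2 replicas

  parsed.count()  // first action computes the partitions and populates the cache
  parsed.count()  // second action is served from the cache

  parsed.unpersist()  // release the cached partitions when no longer needed

Note that an RDD's storage level can only be assigned once, so choose it before the first action runs.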

Slide 9

DEMO

Slide 10

VS HADOOP MAP/REDUCE

No contest

Slide 11

SPARK WORD COUNT

object WordCount {
  def main(args: Array[String]) {
    val spark = new SparkContext(args(0), "wordcount")
    val lines = spark.textFile(args(1))
    val words = lines
      .flatMap(line => line.split("\\W+"))
      .filter(word => word.nonEmpty)
    val counts = words
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(2))
  }
}

versus the equivalent Hadoop Map/Reduce version:

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String word : line.split("\\W+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Slide 12

VS HADOOP MAP/REDUCE

• Like Hadoop Map/Reduce, it is built on top of Hadoop HDFS
• The Spark code base is much smaller (and written in Scala)
• Many active contributors
• Much easier to program
• Hadoop Map/Reduce should be considered legacy
• ... this is both good and bad!

Slide 13

DEMO

Slide 14

PITFALLS

• Distributed systems are (inherently) complex
• Spark has an easy-to-use, semantically clean API
• Operationally, the complexity is still there
• Requires tuning and hand-holding!
• Spark relies on reflection-based serialization (Java / Kryo; see the sketch below)
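A minimal sketch of opting in to Kryo (the app name is illustrative; by default Spark serializes shuffled and serialized-cached data with Java serialization):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("kryo-example")
    // Kryo is faster and more compact than Java serialization, but both
    // rely on reflection, so non-serializable fields fail only at runtime.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  val sc = new SparkContext(conf)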

Slide 15

Slide 16

THANK YOU!