Introduction to Apache Spark

Short introduction to data processing using Apache Spark, with the ubiquitous word count example. Most of the presentation was live coding, which is unfortunately not part of the slides.

Erik Rozendaal

May 08, 2014

Transcript

  1. SPARK
     • Apache project originally from UC Berkeley
     • A better tool to work with Big Data (but also useful for not-so-big data)
     • Interactive Data Analysis
     • Iterative Algorithms, Machine Learning, Graph
     • “Real-Time” Stream Processing
  2. OVERVIEW
     • Integrates with Hadoop HDFS and YARN
     • But also runs in local and standalone mode
     • Scala, Java, and Python APIs
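
     As a rough sketch of what local mode looks like in practice (the "local[2]" master URL and the file path are assumptions for illustration, not part of the original slides), a SparkContext can be pointed at in-process worker threads instead of a cluster:

       import org.apache.spark.SparkContext

       object LocalModeExample {
         def main(args: Array[String]): Unit = {
           // "local[2]" runs Spark in-process with two worker threads; on a cluster
           // this would be a YARN, Mesos, or standalone master URL instead.
           val sc = new SparkContext("local[2]", "local-mode-example")

           // A local file is read the same way an HDFS path would be.
           val lines = sc.textFile("data/sample.txt")   // hypothetical path
           println(s"line count = ${lines.count()}")

           sc.stop()
         }
       }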
  3. RDD[A]
     • Resilient Distributed Dataset
     • A bit like a list, just bigger
     • RDDs consist of multiple partitions
     • Partitions are distributed across the cluster, cached in memory, spilled to disk, automatically recovered on failure, etc.
     • Partitions must fit in RAM (but only N at a time)
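
     A small sketch of the partition idea (the data and the partition count of 8 are made up for illustration): an RDD can be built from a local collection with an explicit number of partitions, and work is executed one partition at a time per worker:

       // Assumes an existing SparkContext `sc`, as in the example above.
       val numbers = sc.parallelize(1 to 1000, numSlices = 8)   // 8 partitions, chosen arbitrarily

       println(numbers.partitions.length)                       // => 8

       // Each partition is processed independently, so only the partitions
       // currently being computed need to fit in memory at once.
       val partitionSums = numbers.mapPartitions(it => Iterator(it.sum))
       println(partitionSums.collect().toList)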
  4. RDD API
     • API intentionally very similar to the standard Scala collections (Array[A], List[A], etc.)
     • map, flatMap, filter, groupBy, count, foreach, union, zip, ...
     • Additional operations like (outer) join, co-group, top-N, reduceByKey, pipe, ...
     • Transformations are evaluated lazily; they only run when required by actions
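
     To illustrate the laziness point (the input data here is made up), transformations such as map and reduceByKey only describe the computation, and nothing is executed until an action such as count or collect is called:

       // Assumes an existing SparkContext `sc`.
       val words = sc.parallelize(Seq("spark", "hadoop", "rdd", "spark"))

       // Transformations: these return new RDDs immediately, without touching the data.
       val pairs   = words.map(word => (word, 1))
       val counted = pairs.reduceByKey(_ + _)

       // Actions: only now is the whole chain actually computed.
       println(counted.count())            // number of distinct words: 3
       counted.collect().foreach(println)  // (spark,2), (hadoop,1), (rdd,1) in some order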
  5. SPARK WORD COUNT

     object WordCount {
       def main(args: Array[String]) {
         val spark = new SparkContext(args(0), "wordcount")
         val lines = spark.textFile(args(1))
         val words = lines
           .flatMap(line => line.split("\\W+"))
           .filter(word => word.nonEmpty)
         val counts = words
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile(args(2))
       }
     }
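
     How the arguments are used is worth spelling out (the master URL and paths below are assumptions for illustration): args(0) is the master, args(1) the input path, and args(2) the output directory, so a local run could look like this:

       // Run the job in-process against local files; on a real cluster the first
       // argument would be the cluster's master URL and the paths would point at HDFS.
       WordCount.main(Array("local[2]", "data/input.txt", "data/wordcount-output"))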
  6. CACHING
     • Whenever the data of an RDD is needed, the RDD is computed
     • RDDs that are expensive to re-compute can be cached, persisted, and/or replicated
     • rdd.persist(MEMORY_AND_DISK_SER_2)
     • cache is an alias for persist(MEMORY_ONLY)
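
     A short sketch of the persist call spelled out in full (the input path is hypothetical and the RDD stands in for something expensive to recompute):

       import org.apache.spark.storage.StorageLevel

       // Assumes an existing SparkContext `sc`.
       val words = sc.textFile("data/big.txt")   // hypothetical input
         .flatMap(_.split("\\W+"))
         .filter(_.nonEmpty)

       // Keep the partitions serialized in memory, spill to disk when they don't fit,
       // and replicate each partition to a second node.
       words.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

       words.count()   // first action: computes the RDD and populates the cache
       words.count()   // second action: served from the persisted copies

       // cache() on another RDD is shorthand for persist(StorageLevel.MEMORY_ONLY).
       val lines = sc.textFile("data/big.txt").cache()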
  7. SPARK WORD COUNT

     object WordCount {
       def main(args: Array[String]) {
         val spark = new SparkContext(args(0), "wordcount")
         val lines = spark.textFile(args(1))
         val words = lines
           .flatMap(line => line.split("\\W+"))
           .filter(word => word.nonEmpty)
         val counts = words
           .map(word => (word, 1))
           .reduceByKey(_ + _)
         counts.saveAsTextFile(args(2))
       }
     }

     public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
       public void map(LongWritable key, Text value, Context context)
           throws IOException, InterruptedException {
         String line = value.toString();
         for (String word : line.split("\\W+")) {
           if (!word.isEmpty()) {
             context.write(new Text(word), new IntWritable(1));
           }
         }
       }
     }

     public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         context.write(key, new IntWritable(sum));
       }
     }

     public class WordCount {
       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "wordcount");
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         job.waitForCompletion(true);
       }
     }
  8. VS HADOOP MAP/REDUCE
     • Like Hadoop Map/Reduce, Spark is built on top of Hadoop HDFS
     • The Spark code base is much smaller (and written in Scala)
     • Many active contributors
     • Much easier to program
     • Hadoop Map/Reduce should be considered legacy
     • ... this is both good and bad!
  9. PITFALLS
     • Distributed systems are (inherently) complex
     • Spark has an easy-to-use, semantically clean API
     • Operationally, the complexity is still there
     • Requires tuning and hand-holding!
     • Spark relies on reflection-based serialization (Java / Kryo)
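
     A hedged sketch of the serialization point (the app name and master URL are assumptions for illustration; spark.serializer and spark.kryo.registrator are standard Spark configuration keys): switching from the default Java serialization to Kryo is a configuration change:

       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf()
         .setAppName("kryo-example")
         .setMaster("local[2]")   // assumption: run locally for illustration
         // Switch from the default Java serialization to Kryo for data shipped between nodes.
         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         // Application classes can additionally be registered with Kryo
         // (via spark.kryo.registrator) to avoid writing full class names into every record.

       val sc = new SparkContext(conf)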