Apache Spark—the light at the end of the tunnel?

Talk at 6th Data Science Day in Berlin, 2014-05-08

Michael Hausenblas

May 08, 2014

Transcript

  1. Apache Spark—the light at the end of the tunnel?
     Michael Hausenblas, Chief Data Engineer EMEA, MapR Technologies
     6th Data Science Day, Berlin, 2014-05-07
  2. Apache Spark
     • Originally developed in 2009 in UC Berkeley’s AMP Lab
     • Fully open sourced in 2010
     • Top-level Apache Project as of 2014
     https://spark.apache.org/
  3. Most Active Open Source Project in Big Data
     [Bar chart: project contributors in the past year, comparing Spark with Giraph, Storm, and Tez; axis 0–140]
  4. Easy & Fast Big Data
     Easy to develop: • Rich APIs in Java, Scala, Python • Interactive shell
     Fast to run: • General execution graphs • In-memory storage
  5. Easy & Fast Big Data
     Easy to develop (2-5× less code): • Rich APIs in Java, Scala, Python • Interactive shell
     Fast to run (up to 10× faster on disk, 100× in memory): • General execution graphs • In-memory storage
     https://amplab.cs.berkeley.edu/benchmark/
  6. Easy: Get Started Immediately • Multi-language support • Interactive shell
     Python:
       lines = sc.textFile(...)
       lines.filter(lambda s: "ERROR" in s).count()
     Scala:
       val lines = sc.textFile(...)
       lines.filter(x => x.contains("ERROR")).count()
     Java:
       JavaRDD<String> lines = sc.textFile(...);
       lines.filter(new Function<String, Boolean>() {
         public Boolean call(String s) {
           return s.contains("ERROR");
         }
       }).count();
  7. Easy: Clean API
     Resilient Distributed Datasets (RDDs)
     • Collections of objects spread across a cluster, stored in RAM or on disk
     • Built through parallel transformations
     • Automatically rebuilt on failure
     Operations
     • Transformations (e.g. map, filter, groupBy)
     • Actions (e.g. count, collect, save)
     Write programs in terms of transformations on distributed datasets (see the sketch below).
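     A minimal sketch of the transformation/action split described above, assuming an existing SparkContext `sc`; the "hdfs://..." path is a placeholder, as on the slides:
       // Transformations are lazy; actions trigger execution.
       val lines  = sc.textFile("hdfs://...")             // base RDD
       val errors = lines.filter(_.contains("ERROR"))     // transformation: nothing runs yet
       errors.cache()                                     // keep the filtered partitions in memory
       val total  = errors.count()                        // action: materializes (and caches) the RDD
       val sample = errors.take(5)                        // second action: reuses the cached partitions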
  8. Easy: Expressive API
     map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
     (A short example combining a few of these operators follows.)
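     A hedged sketch exercising a few of the listed operators (map, reduceByKey, join, collect), again assuming an existing SparkContext `sc`; the data is made up:
       val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                       ("about.html", "3.4.5.6"),
                                       ("index.html", "1.3.3.1")))
       val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                          ("about.html", "About")))

       val counts = visits.map { case (url, _) => (url, 1) }   // map to (key, 1) pairs
                          .reduceByKey(_ + _)                  // per-URL visit counts
       val joined = counts.join(pageNames)                     // ("index.html", (2, "Home")), ...
       joined.collect().foreach(println)                       // action: bring results to the driver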
  9. Easy: Example – Word Count
     Hadoop MapReduce:
       public static class WordCountMapClass extends MapReduceBase
           implements Mapper<LongWritable, Text, Text, IntWritable> {

         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(LongWritable key, Text value,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
           String line = value.toString();
           StringTokenizer itr = new StringTokenizer(line);
           while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             output.collect(word, one);
           }
         }
       }

       public static class WordCountReduce extends MapReduceBase
           implements Reducer<Text, IntWritable, Text, IntWritable> {

         public void reduce(Text key, Iterator<IntWritable> values,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
           int sum = 0;
           while (values.hasNext()) {
             sum += values.next().get();
           }
           output.collect(key, new IntWritable(sum));
         }
       }
     Spark:
       val spark = new SparkContext(master, appName, [sparkHome], [jars])
       val file = spark.textFile("hdfs://...")
       val counts = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs://...")
  10. Easy: Example – Word Count
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  11. Easy: Works Well With Hadoop
      Data Compatibility
      • Access your existing Hadoop data (HDFS, HBase, S3, etc.)
      • Use the same data formats
      • Adheres to data locality for efficient processing
      Deployment Models
      • “Standalone” deployment
      • YARN-based deployment
      • Mesos-based deployment
      • Deploy on an existing Hadoop cluster or side-by-side
      (A sketch of how the deployment choice appears in code follows.)
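     As referenced above, a hedged sketch of how the deployment model surfaces in application code: it is essentially the master URL handed to the SparkContext. Host names and ports below are placeholders.
       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf().setAppName("deployment-demo")
         // .setMaster("local[4]")               // local testing
         // .setMaster("spark://master:7077")    // "standalone" Spark cluster
         // .setMaster("mesos://master:5050")    // Mesos-based deployment
         .setMaster("yarn-client")               // YARN-based deployment (client mode)
       val sc = new SparkContext(conf)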
  12. Example: Logistic Regression
      data = spark.textFile(...).map(readPoint).cache()
      w = numpy.random.rand(D)
      for i in range(iterations):
          gradient = data.map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
                         .reduce(lambda x, y: x + y)
          w -= gradient
      print "Final w: %s" % w
  13. Fast: Logistic Regression Performance
      [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 110 s per iteration; Spark: 80 s for the first iteration, 1 s for further iterations]
  14. Fast: Using RAM, Operator Graphs
      In-memory caching • Data partitions read from RAM instead of disk
      Operator graphs • Scheduling optimizations • Fault tolerance
      [Diagram: an operator DAG (map, join, filter, groupBy) over RDDs A–F, split into Stages 1–3, with cached partitions marked]
  15. Easy: Fault Recovery
      RDDs track lineage information that can be used to efficiently recompute lost data
      msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                     .map(lambda s: s.split("\t")[2])
      [Lineage diagram: HDFS file → filtered RDD (filter, func = startswith(...)) → mapped RDD (map, func = split(...))]
  16. Easy: Unified Platform
      Spark (general execution engine) with Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top
      Continued innovation bringing new functionality, e.g.:
      • BlinkDB (approximate queries)
      • SparkR (R wrapper for Spark)
      • Tachyon (off-heap RDD caching)
  17. [Stack diagram: Spark (general execution engine) with Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top]
  18. Hive Compatibility
      Interfaces to access data and code in the Hive ecosystem:
      • Support for writing queries in HQL
      • Catalog that interfaces with the Hive MetaStore
      • Tablescan operator that uses Hive SerDes
      • Wrappers for Hive UDFs, UDAFs, UDTFs
      (A small HiveContext sketch follows.)
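     A hedged sketch of the Hive integration, assuming an existing SparkContext `sc` and the HiveContext of the early Spark SQL releases, which exposed HiveQL through an `hql` method; the `users` table is hypothetical:
       import org.apache.spark.sql.hive.HiveContext

       val hiveCtx = new HiveContext(sc)
       // Query a table registered in the Hive MetaStore using HiveQL,
       // reading the data through its Hive SerDe.
       val adults = hiveCtx.hql("SELECT name, age FROM users WHERE age >= 18")
       adults.collect().foreach(println)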
  19. Parquet Support
      Native support for reading data stored in Parquet:
      • Columnar storage avoids reading unneeded data
      • Currently only supports flat structures (nested data on the short-term roadmap)
      • RDDs can be written to Parquet files, preserving the schema
      http://parquet.io/
      (A short read/write sketch follows.)
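     A hedged read/write sketch against the early Spark SQL Parquet API, assuming an existing SparkContext `sc`; the case class, paths, and data are made up:
       import org.apache.spark.sql.SQLContext

       case class Person(name: String, age: Int)

       val sqlCtx = new SQLContext(sc)
       import sqlCtx.createSchemaRDD          // implicit conversion RDD[Product] -> SchemaRDD

       val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 44)))
       people.saveAsParquetFile("hdfs://.../people.parquet")   // schema is preserved in the files

       val loaded = sqlCtx.parquetFile("hdfs://.../people.parquet")
       loaded.registerAsTable("people")
       sqlCtx.sql("SELECT name FROM people WHERE age > 40").collect().foreach(println)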
  20. Mixing SQL and Machine Learning
      val trainingDataTable = sql("""
        SELECT e.action, u.age, u.latitude, u.longitude
        FROM Users u
        JOIN Events e ON u.userId = e.userId""")
      // Since `sql` returns an RDD, the results can easily be used in MLlib
      val trainingData = trainingDataTable.map { row =>
        val features = Array[Double](row(1), row(2), row(3))
        LabeledPoint(row(0), features)
      }
      val model = new LogisticRegressionWithSGD().run(trainingData)
  21. Relationship to Shark
      Borrows:
      • Hive data loading code / in-memory columnar representation
      • Hardened Spark execution engine
      Adds:
      • RDD-aware optimizer / query planner
      • Execution engine
      • Language interfaces
      Catalyst/Spark SQL is a nearly-from-scratch rewrite that leverages the best parts of Shark
  22. Spark Streaming
      Run a streaming computation as a series of very small, deterministic batch jobs
      • Chop up the live stream into batches of ½ second or more and leverage RDDs for micro-batch processing
      • Use the same familiar Spark APIs to process streams
      • Combine your batch and online processing in a single system
      • Guarantee exactly-once semantics
      [Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
      (A minimal streaming word-count sketch follows.)
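     As referenced above, a minimal word-count sketch of the micro-batch model; host and port are placeholders:
       import org.apache.spark.SparkConf
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

       val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
       val ssc  = new StreamingContext(conf, Seconds(1))       // 1-second micro-batches

       val lines  = ssc.socketTextStream("localhost", 9999)    // live text stream over TCP
       val counts = lines.flatMap(_.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)                   // same RDD-style API, applied per batch
       counts.print()

       ssc.start()
       ssc.awaitTermination()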
  23. Window-based Transformations
      val tweets = ssc.twitterStream()
      val hashTags = tweets.flatMap(status => getTags(status))
      val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
      [Diagram: sliding window operation over a DStream of data, showing window length and sliding interval]
  24. MLlib – Machine Learning library
      Classification: Logistic Regression, Linear SVM (+L1, L2), Decision Trees, Naive Bayes
      Regression: Linear Regression (+Lasso, Ridge)
      Collaborative Filtering: Alternating Least Squares
      Clustering / Exploration: K-Means, SVD
      Optimization Primitives: SGD, Parallel Gradient
      Interoperability: Scala, Java, PySpark (0.9)
      (A small training sketch follows.)
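     A hedged training sketch against the 0.9-era MLlib API (where feature vectors are plain Array[Double]), assuming an existing SparkContext `sc`; the data is made up:
       import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
       import org.apache.spark.mllib.regression.LabeledPoint

       val training = sc.parallelize(Seq(
         LabeledPoint(1.0, Array(2.0, 1.5)),
         LabeledPoint(0.0, Array(0.5, 0.2))
       )).cache()                                                  // iterative SGD benefits from caching

       val model = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations
       val label = model.predict(Array(1.0, 1.0))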
  25. The GraphX Unified Approach
      Enabling users to easily and efficiently express the entire graph analytics pipeline
      • New API: blurs the distinction between tables and graphs
      • New system: combines data-parallel and graph-parallel systems
      (A tiny GraphX sketch follows.)
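     A hedged GraphX sketch, assuming an existing SparkContext `sc`: the vertex and edge collections are the "table" view, PageRank is the graph-parallel computation, and the final join goes back to a table-style result; the data is made up:
       import org.apache.spark.graphx.{Edge, Graph}

       val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
       val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

       val graph = Graph(users, follows)                    // build the property graph
       val ranks = graph.pageRank(0.001).vertices           // graph-parallel: PageRank per vertex
       ranks.join(users).collect().foreach(println)         // data-parallel: join ranks back to names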
  26. Interactive Exploratory Analytics
      • Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets
      • Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models
      • Execute arbitrary code using a fully-functional interactive programming environment
      • Connect external tools via SQL drivers
  27. Machine Learning
      • Improve performance of iterative algorithms by caching frequently accessed datasets
      • Develop programs that are easy to reason about, using a fully-capable functional programming style
      • Refine algorithms using the interactive REPL
      • Use carefully curated algorithms out of the box with MLlib
  28. Power Real-time Dashboards
      • Use Spark Streaming to perform low-latency window-based aggregations
      • Combine offline models with streaming data for online clustering and classification within the dashboard
      • Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities in exploring dashboard data
  29. Faster ETL
      • Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more
      • Use Spark SQL to perform ETL using a familiar SQL interface
      • Easily port Pig scripts to Spark’s API
      • Run existing Hive queries directly on Spark SQL or Shark
  30. Lambda Architecture
      [Diagram: new data is streamed to both the batch layer and the speed layer. The batch layer holds the immutable master data and precomputes batch views (batch recompute); the speed layer processes the stream and increments real-time views; the serving layer merges batch views and real-time views to answer queries]