The Spark Ecosystem
Fast and Expressive Big Data Analytics in Scala

Matei Zaharia and Reynold Xin
University of California, Berkeley
www.spark-project.org
What is Spark?

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)
Often 5× less code
Project History

Spark started in 2009, open sourced 2010

In use at Intel, Yahoo!, Adobe, Quantifind, Conviva, Ooyala, Bizo and others

17 companies now contributing code
A Growing Stack

Part of the Berkeley Data Analytics Stack (BDAS) project to build an open source next-gen analytics system

Built on Spark:
» Shark: SQL
» Spark Streaming: real-time
» GraphX: graph
» MLbase: machine learning
» …
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark

See tomorrow for a talk on Streaming!
Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing

All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce

[Figure: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS; each ad-hoc query (query 1-3) re-reads the input from HDFS to produce its result]

Slow due to replication, serialization, and disk IO
Data Sharing in Spark

[Figure: after one-time processing of the input, iterations (iter. 1, iter. 2, …) and queries (query 1-3) share data through distributed memory]

10-100× faster than network and disk
Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure

Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count            // action
messages.filter(_.contains("bar")).count
...

[Figure: the driver sends tasks to workers; each worker reads one block of the file, then answers later queries from its in-memory cache, returning results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
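To inspect such a lineage interactively, the RDD API offers toDebugString; a minimal sketch from the shell, assuming sc is the shell's SparkContext (as in later Apache Spark shells) and leaving the path elided as above:

val messages = sc.textFile("hdfs://...")   // HadoopRDD
  .filter(_.contains("error"))             // FilteredRDD
  .map(_.split('\t')(2))                   // MappedRDD

// Prints the chain of parent RDDs; this is exactly the chain Spark
// re-runs to recompute a lost partition
println(messages.toDebugString)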
Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter plot of + and – points, with a random initial line converging to the target separator]
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load once, reuse every iteration
var w = Vector.random(D)                               // random initial separating plane

for (i <- 1 to ITERATIONS) {
  // each pass computes the gradient in parallel over the cached points
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

[Chart: running time (s) vs number of iterations (1-30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
Demo
Supported Operators

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
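A quick sketch exercising a few of these from the shell (the tiny inline dataset is made up for illustration; sc is the shell's SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val names = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))

pairs.map(_._2).reduce(_ + _)        // 6
pairs.filter(_._2 > 1).count()       // 2
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.join(names).collect()          // Array((a,(1,alpha)), (a,(3,alpha)), (b,(2,beta)))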
Other Engine Features

General operator graphs (e.g. map-reduce-reduce)

Hash-based reduces (faster than Hadoop's sort)

Controlled data partitioning to lower communication (see the sketch below)

PageRank performance, iteration time (s):
» Hadoop: 171
» Basic Spark: 72
» Spark + Controlled Partitioning: 23
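Controlled partitioning is what drives the PageRank numbers above: partition the link table once, and every iteration's join finds links and ranks already co-located. A sketch along the lines of the well-known Spark PageRank example (the path, names, and the org.apache.spark import path of later releases are illustrative assumptions):

import org.apache.spark.HashPartitioner

// Hash-partition the links once and cache them; ranks inherits the
// same partitioner via mapValues, so the join below never re-shuffles
val links = sc.textFile("hdfs://.../links")
  .map { line => val p = line.split('\t'); (p(0), p(1)) }
  .groupByKey()
  .partitionBy(new HashPartitioner(100))
  .cache()

var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.take(5).foreach(println)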
Spark Community

» 700+ meetup members
» 30+ external contributors
» 17 companies contributing
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark
Graphs are Essential to Data Mining

» Identify influential people and information
» Find communities
» Target ads and products
» Model complex data dependencies
Specialized Graph Systems

[Figure: logos of specialized graph systems, e.g. Pregel]
Specialized Graph Systems

[Figure: example graph over vertices A-F]

1. APIs to capture complex dependencies, i.e. graph parallelism vs data parallelism
2. Exploit graph structure to reduce communication and computation
How is GraphX different from ____?

Answer: Simplicity
Simplicity

Integration with Spark: no disparate system
» ETL (extract, transform, load)
» Consumption of graph output
» Fault-tolerance
» Use the Scala REPL for interactive graph mining

Programmability: leveraging the Scala/Spark API
» Implemented the GraphLab / Pregel APIs in 20 lines of code
» PageRank in 5 lines of code
Resilient Distributed Graphs

An extension of Spark RDDs
» Immutable, partitioned set of vertices and edges
» Constructed using RDD[Edge] and RDD[Vertex]

Additional set of primitives (3 functions) for graph computations
» Able to express most graph algorithms (PageRank, Shortest Path, Connected Components, ALS, …)
» Implemented GraphLab / Pregel in 20 lines of code
val vertices = spark.textFile("hdfs://path/pages.csv")
val edges = spark.textFile("hdfs://path/to/links.csv")
                 .map(line => new Edge(line.split('\t')))
val g = new Graph(vertices, edges).cache

println(g.vertices.count)
println(g.edges.count)

val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")

val ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)

[Figure: the GraphX stack: a Resilient Distributed Graph at the base, the Pregel and GraphLab APIs above it, and algorithms (PageRank, Shortest Path, Connected Components, ALS) on top]
Early Performance

Benefits from Spark's:
» In-memory caching
» Hash-based operators
» Controlled data partitioning

PageRank on 16 nodes, iteration time (s): Hadoop 1340 vs GraphX 165

Alpha coming in June / July!
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark
What is Shark?

Columnar SQL analytics engine for Spark
» Supports both SQL and complex analytics
» Up to 100× faster than Apache Hive

Compatible with Apache Hive
» HiveQL, UDF/UDAF, SerDes, Scripts
» Runs on existing Hive warehouses

In use at Yahoo! for fast in-memory OLAP

Spark Integration

Unified system for SQL, graph processing, machine learning

All share the same set of workers and caches:

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("""SELECT * FROM user u
                       JOIN comment c ON c.uid=u.uid""")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}
val trainedVector = logRegress(features.cache())
Teaser: Spark Streaming

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

Come see our talk tomorrow at 2:30!
Getting Started

Visit www.spark-project.org for
» Video tutorials
» Online exercises (EC2)
» Docs and API guides

Easy to run in local mode, standalone clusters, Apache Mesos, YARN or EC2 (see the sketch below)
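A minimal local-mode word count, as a sketch (the object and input names are illustrative; import paths follow later Apache Spark releases, while releases of this era used the bare spark package):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicits for pair-RDD operators

object WordCount {
  def main(args: Array[String]) {
    // "local[*]" runs Spark inside this JVM on all available cores
    val sc = new SparkContext("local[*]", "WordCount")
    val counts = sc.textFile("words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}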
Training camp at Berkeley in August
Conclusion

Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing

Spark is a fast, unified platform for these apps

Look for our training camp at Berkeley this August!

spark-project.org
Backup Slides
Behavior with Not Enough RAM

Iteration time (s) by % of working set in memory:
» Cache disabled: 68.8
» 25%: 58.1
» 50%: 40.7
» 75%: 29.7
» Fully cached: 11.5
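This graceful degradation comes from Spark's storage levels: cached partitions that no longer fit are recomputed from lineage or spilled. A sketch (data stands in for any RDD; the StorageLevel import path follows later Apache Spark releases):

import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be set once; these are alternatives.

// cache() = MEMORY_ONLY: partitions that don't fit are dropped and
// recomputed from lineage the next time they are needed
data.cache()

// MEMORY_AND_DISK: spill partitions that don't fit to local disk
// data.persist(StorageLevel.MEMORY_AND_DISK)

// MEMORY_ONLY_SER: store partitions serialized to fit more in RAM
// data.persist(StorageLevel.MEMORY_ONLY_SER)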
Fault Tolerance

RDDs track lineage information to rebuild on failure:

file.map(rec => (rec.`type`, 1))
    .reduceByKey(_ + _)
    .filter { case (recordType, count) => count > 10 }

[Figure: lineage graph: input file → map → reduce → filter]