GraphX at Spark User Meetup

Slide 1

Slide 1 text

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Reynold Xin, Joseph Gonzalez
AMPLab, UC Berkeley
Spark User Meetup
July 2, 2013

Slide 2

Slide 2 text

Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies

Slide 3

Slide 3 text

Specialized Graph Systems
Pregel

Slide 4

Slide 4 text

Specialized Graph Systems
1. APIs to capture complex graph dependencies
[Figure: example graph with vertices A, B, C, D, E, F]

Slide 5

Slide 5 text

Specialized Graph Systems
Runtime (in minutes, counting 34.8 billion triangles)
GraphLab: 1.5
Hadoop: 423

Slide 6

Slide 6 text

Limitations of Specialized Systems
No support for ETL & post-processing
Not interactive
Require a separate runtime that must be maintained in both engineering and operations
Data-parallel systems excel at these!

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Vertices

Slide 9

Slide 9 text

Edges

Slide 10

Slide 10 text

username   password          location
rxin       ******            Berkeley
jegonzal   iJustRaised$6.7m  Pittsburgh
franklin   ******            Shanghai
istoica    ******            Piedmont

user1      user2      relationship
rxin       jegonzal   friend
franklin   rxin       advisor
istoica    franklin   coworker

Slide 11

Slide 11 text

How can we combine the advances in data-parallel analytics and graph-parallel analytics?

Slide 12

Slide 12 text

How can we combine the advances in data-parallel analytics and graph-parallel analytics?

Slide 13

Slide 13 text

Why the Unification?
No disparate systems to maintain / learn
Leverage data-parallel engines for task scheduling, distribution and dispatch
Reuse existing engines for ETL and consumption of graph computation output

Slide 14

Slide 14 text

Enable Joining Tables and Graphs
Simplify ETL and graph-structured analytics
[Figure labels: User Data, Product Ratings, Friend]

Slide 15

Slide 15 text

The Research Challenge
Expressive graph computation primitives implementable on existing data-parallel engines
  » Relational databases (with UDFs and UDAFs)
  » MapReduce-like frameworks, e.g. Hadoop, Spark
Leveraging advanced properties and engine extensions to make these primitives fast
  » An optimizer for choosing execution strategies
  » Controlled data partitioning
  » New index-based access methods and operators

Slide 16

Slide 16 text

Graph Primitives
A graph consists of:
  » A table of vertices (vid Int, attr1 Int, attr2 …)
  » A table of edges (srcId Int, dstId Int, attr1 Int, attr2 String …)
Relational operators:
  » E.g. filtering vertices and edges, joining vertices with tables
Graph computation primitives:
  » aggregateNeighbors(mapUdf, reduceUdaf)
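The "graph as two tables" view above can be sketched with plain Scala collections. This is a minimal illustration, not the GraphX API: the names `Vertex`, `Edge`, `Graph`, `filterVertices`, and `joinVertices` are hypothetical, the numeric ids and the score table are made up, and dropping edges that lose an endpoint is a design choice assumed here rather than something the slides specify.

```scala
// Illustrative sketch only; all names are hypothetical, not the GraphX API.
object GraphTables {
  // One row per vertex: an id plus attributes (here just a location).
  final case class Vertex(vid: Int, location: String)
  // One row per edge: source id, destination id, plus attributes.
  final case class Edge(srcId: Int, dstId: Int, relationship: String)

  final case class Graph(vertices: Seq[Vertex], edges: Seq[Edge]) {
    // Relational filter on the vertex table. Assumption: edges whose
    // endpoints were filtered out are dropped as well.
    def filterVertices(p: Vertex => Boolean): Graph = {
      val kept = vertices.filter(p)
      val ids = kept.map(_.vid).toSet
      Graph(kept, edges.filter(e => ids(e.srcId) && ids(e.dstId)))
    }

    // Inner join of the vertex table with an external table keyed by vertex id.
    def joinVertices[A](table: Map[Int, A]): Seq[(Vertex, A)] =
      vertices.flatMap(v => table.get(v.vid).map(a => (v, a)))
  }

  // Locations and relationships taken from the example tables in the deck;
  // the numeric ids are assumed.
  val sample: Graph = Graph(
    Seq(Vertex(1, "Berkeley"), Vertex(2, "Pittsburgh"),
        Vertex(3, "Shanghai"), Vertex(4, "Piedmont")),
    Seq(Edge(1, 2, "friend"), Edge(3, 1, "advisor"), Edge(4, 3, "coworker"))
  )

  def main(args: Array[String]): Unit = {
    val g = sample.filterVertices(_.location != "Pittsburgh")
    println(g.vertices)
    println(g.edges)
    // Hypothetical external table of per-vertex scores.
    println(sample.joinVertices(Map(1 -> 0.9, 3 -> 0.4)))
  }
}
```

Keeping vertices and edges as ordinary collections is what lets filters and joins apply to them directly; a data-parallel engine would play the role of `Seq` here.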

Slide 17

Slide 17 text

aggregateNeighbors
[Figure: graph with vertices A, B, C, D, E, F; labels: map, reduce]

Slide 18

Slide 18 text

aggregateNeighbors
[Figure: graph with vertices A, B, C, D, E, F; labels: map(B), map(C), map(D), map(E), map(F)]

Slide 19

Slide 19 text

aggregateNeighbors
[Figure: graph with vertices A, B, C, D, E, F; labels: map(B), map(C), map(D), map(E), map(F)]

Slide 20

Slide 20 text

aggregateNeighbors
[Figure: graph with vertices A, B, C, D, E, F; labels: map(B), map(C), map(D), map(E), map(F), reduce]

Slide 21

Slide 21 text

Example: Vertex Degree
map: 1
reduce: sum
[Figure: graph with vertices A, B, C, D, E, F]

Slide 22

Slide 22 text

Example: Vertex Degree
[Figure: graph with vertices A, B, C, D, E, F; each of the five map outputs is 1]

Slide 23

Slide 23 text

Example: Vertex Degree
[Figure: graph with vertices A, B, C, D, E, F; sum: 5]
A: 5, B: 1, C: 1, D: 2, E: 3, F: 2
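The degree example (map: 1, reduce: sum) can be reproduced in a few lines of plain Scala. The edge list below is an assumption: the slides show only the resulting counts, and these seven undirected edges are one graph consistent with them. The object name `VertexDegree` is hypothetical, and `groupMapReduce` requires Scala 2.13 or later.

```scala
// Illustrative sketch; VertexDegree is a hypothetical name.
object VertexDegree {
  // Assumed undirected edge list, chosen to match the degree counts on the slide.
  val edges: Seq[(String, String)] = Seq(
    "A" -> "B", "A" -> "C", "A" -> "D", "A" -> "E", "A" -> "F",
    "D" -> "E", "E" -> "F"
  )

  // map: every edge contributes 1 to each of its endpoints.
  // reduce: sum the contributions per vertex.
  val degrees: Map[String, Int] =
    edges
      .flatMap { case (u, v) => Seq(u -> 1, v -> 1) }
      .groupMapReduce(_._1)(_._2)(_ + _)

  def main(args: Array[String]): Unit =
    degrees.toSeq.sortBy(_._1).foreach { case (v, d) => println(s"$v: $d") }
}
```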

Slide 24

Slide 24 text

Logical Query Plan
aggregateNeighbors(mapUdf, reduceUdaf)
  » CREATE VIEW aggResults AS
    SELECT m.dstId, reduceUdaf(m.msg)
    FROM (SELECT e.dstId, mapUdf(v.attr1, v.attr2, …, e.attr1, …) AS msg
          FROM vertices v RIGHT OUTER JOIN edges e ON v.vid = e.srcId) m
    GROUP BY m.dstId
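The logical plan above (join edges with their source vertices, apply the map function, group by destination, reduce) can be mirrored step by step on in-memory collections. This is a sketch under assumptions, not GraphX code: `Edge`, the curried signature of `aggregateNeighbors`, and the sample data are all hypothetical, and the `Option` passed to `mapUdf` stands in for the NULL side of the right outer join.

```scala
// Illustrative sketch; names and signature are hypothetical, not the GraphX API.
object AggregateNeighborsPlan {
  final case class Edge[E](srcId: Int, dstId: Int, attr: E)

  def aggregateNeighbors[V, E, M](
      vertices: Map[Int, V],
      edges: Seq[Edge[E]]
  )(mapUdf: (Option[V], E) => M)(reduceUdaf: (M, M) => M): Map[Int, M] =
    edges
      // RIGHT OUTER JOIN on the source id: every edge survives, and the
      // vertex side is None when the source vertex is missing.
      .map(e => e.dstId -> mapUdf(vertices.get(e.srcId), e.attr))
      // GROUP BY the destination id, folding each group with reduceUdaf.
      .groupMapReduce(_._1)(_._2)(reduceUdaf)

  // Made-up data: vertex attribute is a value, edge attribute is a weight.
  val sampleVertices: Map[Int, Double] = Map(1 -> 10.0, 2 -> 20.0, 3 -> 30.0)
  val sampleEdges: Seq[Edge[Double]] =
    Seq(Edge(1, 3, 0.5), Edge(2, 3, 0.25), Edge(3, 1, 1.0))

  // Weighted sum of in-neighbor values for each destination vertex.
  val result: Map[Int, Double] =
    aggregateNeighbors(sampleVertices, sampleEdges)(
      (v, w) => v.getOrElse(0.0) * w)(_ + _)

  def main(args: Array[String]): Unit = println(result)
}
```

Vertices with no incoming edges do not appear in the result, which matches a GROUP BY over the edge table.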

Slide 25

Slide 25 text

GraphX operators
Relational operators on edges and vertices:
  » Filter, join, projection…
Graph computation operator:
  » aggregateNeighbors

Slide 26

Slide 26 text

We can express both Pregel and GraphLab using aggregateNeighbors in 20 lines of code!
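To give a feel for how a Pregel-style loop can sit on top of a single neighbor-aggregation primitive, here is a small sketch on plain Scala collections. It is not the authors' 20-line implementation: `aggregateNeighbors`, `pregel`, `vprog`, and the stopping rule (stop when no vertex state changes, or after `maxIter` supersteps) are assumptions made for illustration. The example computes connected components by propagating the minimum vertex id.

```scala
// Illustrative sketch; not the GraphX implementation. All names are hypothetical.
object PregelSketch {
  type VertexId = Int

  // Messages flow along each directed edge from source to destination.
  // Assumption: every edge endpoint is present in `state`.
  def aggregateNeighbors[V, M](
      state: Map[VertexId, V],
      edges: Seq[(VertexId, VertexId)]
  )(mapUdf: V => M)(reduceUdaf: (M, M) => M): Map[VertexId, M] =
    edges
      .map { case (src, dst) => dst -> mapUdf(state(src)) }
      .groupMapReduce(_._1)(_._2)(reduceUdaf)

  // One superstep = gather messages, then apply the vertex program.
  // Runs at least one and at most maxIter supersteps.
  @annotation.tailrec
  def pregel[V, M](
      state: Map[VertexId, V],
      edges: Seq[(VertexId, VertexId)],
      maxIter: Int
  )(mapUdf: V => M, reduceUdaf: (M, M) => M, vprog: (V, M) => V): Map[VertexId, V] = {
    val msgs = aggregateNeighbors(state, edges)(mapUdf)(reduceUdaf)
    val next = state.map { case (id, v) =>
      id -> msgs.get(id).fold(v)(m => vprog(v, m))
    }
    if (maxIter <= 1 || next == state) next
    else pregel(next, edges, maxIter - 1)(mapUdf, reduceUdaf, vprog)
  }

  // Made-up undirected graph: {1, 2, 3}, {4, 5}, and the isolated vertex 6.
  val undirected: Seq[(VertexId, VertexId)] = Seq(1 -> 2, 2 -> 3, 4 -> 5)
  val edges: Seq[(VertexId, VertexId)] = undirected ++ undirected.map(_.swap)
  val init: Map[VertexId, Int] = (1 to 6).map(id => id -> id).toMap

  // Each vertex repeatedly adopts the smallest label seen among its neighbors.
  val components: Map[VertexId, Int] =
    pregel[Int, Int](init, edges, maxIter = 10)(
      label => label,
      (a, b) => math.min(a, b),
      (old, msg) => math.min(old, msg)
    )

  def main(args: Array[String]): Unit =
    components.toSeq.sortBy(_._1).foreach(println)
}
```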

Slide 27

Slide 27 text

Performance Optimizations
Replicate & co-partition vertices with edges
  » GraphLab (PowerGraph) style vertex-cut partitioning
  » Minimize communication by avoiding edge data movement in JOINs
In-memory hash index for vertices for fast joins
Optimizer for choosing execution strategies
  » E.g. if mapUdf does not need edge data, we can rewrite the query to delay the join
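The vertex-cut idea above can be illustrated with a toy partitioner: each edge is assigned to exactly one partition, and a vertex is replicated to every partition that holds one of its edges, so a vertex-edge join can run locally without moving edge data. This is a sketch under stated assumptions, not GraphX's or PowerGraph's actual placement strategy; the hash in `partitionOf` and the names `replicas` and `replicationFactor` are hypothetical.

```scala
// Illustrative sketch; the hashing scheme and all names are assumptions.
object VertexCutSketch {
  type VertexId = Int
  final case class Edge(srcId: VertexId, dstId: VertexId)

  // Each edge lives in exactly one partition, picked by a simple hash of
  // its endpoints (an illustrative choice, not a tuned strategy).
  def partitionOf(e: Edge, numParts: Int): Int =
    math.floorMod(31 * e.srcId + e.dstId, numParts)

  // A vertex is replicated to every partition that holds one of its edges.
  def replicas(edges: Seq[Edge], numParts: Int): Map[VertexId, Set[Int]] =
    edges
      .flatMap { e =>
        val p = partitionOf(e, numParts)
        Seq(e.srcId -> p, e.dstId -> p)
      }
      .groupMap(_._1)(_._2)
      .map { case (v, ps) => v -> ps.toSet }

  // Average number of copies per vertex; lower means less vertex data to ship.
  def replicationFactor(edges: Seq[Edge], numParts: Int): Double = {
    val r = replicas(edges, numParts)
    r.values.map(_.size).sum.toDouble / r.size
  }

  // Made-up sample graph.
  val sampleEdges: Seq[Edge] =
    Seq(Edge(1, 2), Edge(1, 3), Edge(1, 4), Edge(1, 5), Edge(1, 6), Edge(4, 5), Edge(5, 6))

  def main(args: Array[String]): Unit = {
    println(replicas(sampleEdges, 3))
    println(replicationFactor(sampleEdges, 3))
  }
}
```

High-degree vertices end up with the most replicas, which is the trade-off a vertex-cut accepts in exchange for never splitting or moving an edge.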

Slide 28

Slide 28 text

Current Implementation
[Figure: layered stack, listed here from bottom to top]
Spark (relational operators)
GraphX
Pregel (20), GraphLab (20)
PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40)

Slide 29

Slide 29 text

Demo

Slide 30

Slide 30 text

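// Load the vertex table as raw tab-separated lines
val vertices = spark.textFile("hdfs://path/pages.csv")
// Load the edge table and parse each line into an Edge
val edges = spark.textFile("hdfs://path/to/links.csv")
  .map(line => new Edge(line.split('\t')))
val g = new Graph(vertices, edges).cache
println(g.vertices.count)
println(g.edges.count)
// Relational filter: keep vertices whose third field is "Berkeley"
val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")
// Graph computation on the filtered graph
val ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)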

Slide 31

Slide 31 text

val ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)

Slide 32

Slide 32 text

Early Performance
Runtime (in seconds, PageRank for 10 iterations)
GraphLab: 22
GraphX: 165
Hadoop: 1340

Slide 33

Slide 33 text

GraphX
1. Graph-parallel primitives implementable in data-parallel (or relational) engines.
2. Currently slower than GraphLab, but
  » No need for specialized systems
  » Easier ETL, and easier consumption of output
  » Interactive graph data mining
3. Future work will bring performance closer to specialized engines.

Slide 34

Slide 34 text

Berkeley Data Analytics Stack
[Figure: stack diagram with the following components]
Spark Streaming, Shark SQL, BlinkDB, GraphX, MLBase
Spark
Mesos / YARN Resource Manager
HDFS / Hadoop Storage