GraphX at Spark User Meetup

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Reynold Xin, Joseph Gonzalez
AMPLab, UC Berkeley
Spark User Meetup, July 2, 2013

Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies

Pregel Specialized Graph Systems

Specialized Graph Systems
1. APIs to capture complex graph dependencies
[Figure: example graph with vertices A to F]

Specialized Graph Systems
[Chart: runtime in minutes, counting 34.8 billion triangles. GraphLab: 1.5; Hadoop: 423]

Limitations of Specialized Systems
No support for ETL and post-processing
Not interactive
Requires a separate runtime to maintain, both in engineering and operations
Data-parallel systems excel at these!

Vertices

Edges

username  password          location
rxin      ******            Berkeley
jegonzal  iJustRaised$6.7m  Pittsburgh
franklin  ******            Shanghai
istoica   ******            Piedmont

user1     user2     relationship
rxin      jegonzal  friend
franklin  rxin      advisor
istoica   franklin  coworker

How can we combine the advances in data-parallel analytics and graph-parallel analytics?

Why the Unification?
No disparate systems to maintain / learn
Leverage data-parallel engines for task scheduling, distribution and dispatch
Reuse existing engines for ETL and consumption of graph computation output

Enable Joining Tables and Graphs
Simplify ETL and graph-structured analytics
[Figure labels: User Data, Product Ratings, Friend]

The Research Challenge
Expressive graph computation primitives implementable on existing data-parallel engines
» Relational databases (with UDFs and UDAFs)
» MapReduce-like frameworks, e.g. Hadoop, Spark
Leveraging advanced properties and engine extensions to make these primitives fast
» An optimizer for choosing execution strategies
» Controlled data partitioning
» New index-based access methods and operators

Graph Primitives
A graph consists of:
» A table of vertices (vid Int, attr1 Int, attr2 …)
» A table of edges (srcId Int, dstId Int, attr1 Int, attr2 String …)
Relational operators:
» E.g. filter vertices, edges, joining vertices with tables
Graph computation primitives:
» aggregateNeighbors(mapUdf, reduceUdaf)
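To make the "graph as two tables" idea concrete, here is a minimal sketch on plain Scala collections rather than on Spark. The row types, the object name `GraphTables`, and the integer ids are assumptions made for illustration (the talk's example tables key users by username only); the users and relationships are the ones from the example tables earlier in the deck.

```scala
object GraphTables {
  // Hypothetical row types; the integer ids are invented for this sketch.
  case class Vertex(vid: Int, username: String, location: String)
  case class Edge(srcId: Int, dstId: Int, relationship: String)

  // Vertex table
  val vertices: Seq[Vertex] = Seq(
    Vertex(1, "rxin", "Berkeley"),
    Vertex(2, "jegonzal", "Pittsburgh"),
    Vertex(3, "franklin", "Shanghai"),
    Vertex(4, "istoica", "Piedmont"))

  // Edge table
  val edges: Seq[Edge] = Seq(
    Edge(1, 2, "friend"),
    Edge(3, 1, "advisor"),
    Edge(4, 3, "coworker"))

  // Relational operator: filter the vertex table.
  def verticesIn(location: String): Seq[Vertex] =
    vertices.filter(_.location == location)

  // Relational operator: join the edge table with the vertex table
  // on both endpoints to recover usernames.
  def namedEdges: Seq[(String, String, String)] = {
    val nameOf: Map[Int, String] = vertices.map(v => v.vid -> v.username).toMap
    edges.collect {
      case e if nameOf.contains(e.srcId) && nameOf.contains(e.dstId) =>
        (nameOf(e.srcId), nameOf(e.dstId), e.relationship)
    }
  }
}
```

Because both tables are ordinary collections here, the filter and the join are just the usual relational operations; nothing graph-specific is needed until a neighborhood computation comes in.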

aggregateNeighbors
[Figure: graph with vertices A to F. map is applied to each neighbor of A, giving map(B), map(C), map(D), map(E), map(F); reduce then combines the results at A]

Example: Vertex Degree
map: 1
reduce: sum
[Figure: each of A's five neighbors contributes 1; the sum at A is 5]
Result: A: 5, B: 1, C: 1, D: 2, E: 3, F: 2
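The degree example can be sketched on local Scala collections. This is not the GraphX implementation: the signature of `aggregateNeighbors` below is a simplified assumption (the UDF sees only vertex ids, not attributes), and the edge list is inferred from the degree counts on the slide, since the slide itself only shows the picture.

```scala
object AggregateNeighborsSketch {
  type VertexId = String
  case class Edge(src: VertexId, dst: VertexId)

  // Simplified, hypothetical signature: mapUdf runs once per (vertex, neighbor)
  // pair, and reduceUdf combines the mapped values of each vertex.
  // Edges are treated as undirected.
  def aggregateNeighbors[A](edges: Seq[Edge])(
      mapUdf: (VertexId, VertexId) => A)(
      reduceUdf: (A, A) => A): Map[VertexId, A] = {
    // Emit every edge in both directions so each endpoint sees the other.
    val pairs = edges.flatMap(e => Seq(e.src -> e.dst, e.dst -> e.src))
    pairs.groupBy(_._1).map { case (v, ps) =>
      v -> ps.map { case (self, nbr) => mapUdf(self, nbr) }.reduce(reduceUdf)
    }
  }

  // Assumed edge list, chosen to be consistent with the degrees on the slide.
  val edges: Seq[Edge] = Seq(
    Edge("A", "B"), Edge("A", "C"), Edge("A", "D"), Edge("A", "E"),
    Edge("A", "F"), Edge("D", "E"), Edge("E", "F"))

  // map: 1, reduce: sum
  val degrees: Map[VertexId, Int] =
    aggregateNeighbors(edges)((v, nbr) => 1)(_ + _)
}
```

With this edge list, `degrees` matches the counts on the slide (A: 5, B: 1, C: 1, D: 2, E: 3, F: 2).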

Logical Query Plan
aggregateNeighbors(mapUdf, reduceUdaf)
» CREATE VIEW aggResults AS
  SELECT reduceUdaf(*) FROM
    (SELECT mapUdf(v.attr1, v.attr2, …, e.attr1, …)
     FROM vertices v RIGHT OUTER JOIN edges e ON v.id = e.srcId)
  GROUP BY e.dstId
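The same plan can be mirrored step by step on local Scala collections: join, map, group, reduce. This is a sketch under assumptions, not the GraphX code: the row types carry a single integer attribute, and the right outer join is modelled by handing `mapUdf` an `Option[Vertex]` so that an edge whose source vertex is missing still produces a row.

```scala
object QueryPlanSketch {
  // Hypothetical row types with one attribute each.
  case class Vertex(id: Int, attr: Int)
  case class Edge(srcId: Int, dstId: Int, attr: Int)

  def aggregateNeighbors[A](vertices: Seq[Vertex], edges: Seq[Edge])(
      mapUdf: (Option[Vertex], Edge) => A)(
      reduceUdaf: (A, A) => A): Map[Int, A] = {
    // Index the vertex table by id so the join is a hash lookup.
    val byId: Map[Int, Vertex] = vertices.map(v => v.id -> v).toMap
    // vertices v RIGHT OUTER JOIN edges e ON v.id = e.srcId, then mapUdf:
    // every edge is kept; the vertex side is None when there is no match.
    val mapped: Seq[(Int, A)] =
      edges.map(e => e.dstId -> mapUdf(byId.get(e.srcId), e))
    // GROUP BY e.dstId, then reduceUdaf over each group.
    mapped.groupBy(_._1).map { case (dstId, rows) =>
      dstId -> rows.map(_._2).reduce(reduceUdaf)
    }
  }

  // Example: for each destination, sum (source attribute * edge attribute).
  val vertices = Seq(Vertex(1, 10), Vertex(2, 20), Vertex(3, 30))
  val edges = Seq(Edge(1, 3, 2), Edge(2, 3, 1), Edge(3, 1, 1))
  val weightedInSum: Map[Int, Int] =
    aggregateNeighbors(vertices, edges)(
      (v, e) => v.map(_.attr).getOrElse(0) * e.attr)(_ + _)
}
```

Here vertex 3 receives 10 * 2 + 20 * 1 = 40 and vertex 1 receives 30 * 1 = 30; vertex 2 has no incoming edge and therefore no row in the result.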

GraphX operators
Relational operators on edges and vertices:
» Filter, join, projection…
Graph computation operator:
» aggregateNeighbors

We can express both Pregel and GraphLab using aggregateNeighbors in
20 lines of code!
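As a rough illustration of how a Pregel-style loop can sit on top of a neighbor aggregation, here is a local-collections sketch. It is not the GraphX code the slide refers to: the `pregel` signature is an assumption, every vertex sends to all neighbors on every superstep (there is no vote-to-halt or active set), and the loop simply stops when no vertex value changes. The example program computes connected components by propagating the minimum vertex id.

```scala
object PregelSketch {
  case class Edge(src: Int, dst: Int)

  // Simplified aggregateNeighbors over undirected edges: each vertex receives
  // mapUdf(neighbor value) from every neighbor, combined with reduceUdf.
  // Every edge endpoint must have an entry in attrs.
  def aggregateNeighbors[V, M](attrs: Map[Int, V], edges: Seq[Edge])(
      mapUdf: V => M)(reduceUdf: (M, M) => M): Map[Int, M] = {
    val msgs = edges.flatMap(e =>
      Seq(e.dst -> mapUdf(attrs(e.src)), e.src -> mapUdf(attrs(e.dst))))
    msgs.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).reduce(reduceUdf) }
  }

  // Hypothetical Pregel-style driver: gather messages, apply the vertex
  // program, and repeat until a fixed point or maxIter supersteps.
  def pregel[V, M](init: Map[Int, V], edges: Seq[Edge], maxIter: Int)(
      send: V => M)(merge: (M, M) => M)(vprog: (V, M) => V): Map[Int, V] = {
    var attrs = init
    var changed = true
    var i = 0
    while (changed && i < maxIter) {
      val inbox = aggregateNeighbors(attrs, edges)(send)(merge)
      // Vertices without messages keep their current value.
      val next = attrs.map { case (v, a) =>
        v -> inbox.get(v).map(m => vprog(a, m)).getOrElse(a)
      }
      changed = next != attrs
      attrs = next
      i += 1
    }
    attrs
  }

  // Connected components: each vertex ends up labelled with the smallest
  // vertex id in its component.
  def connectedComponents(vertexIds: Seq[Int], edges: Seq[Edge]): Map[Int, Int] =
    pregel[Int, Int](vertexIds.map(v => v -> v).toMap, edges, maxIter = 50)(
      label => label)((a, b) => math.min(a, b))((own, msg) => math.min(own, msg))
}
```

For vertices 1 to 5 with edges 1-2, 2-3 and 4-5, the labels converge to 1 for vertices 1, 2, 3 and to 4 for vertices 4, 5.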

Performance Optimizations
Replicate & co-partition vertices with edges
» GraphLab (PowerGraph) style vertex-cut partitioning
» Minimize communication by avoiding edge data movement in JOINs
In-memory hash index for vertices for fast joins
Optimizer for choosing execution strategies
» E.g. if mapUdf does not need edge data, we can rewrite the query to delay the join
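A vertex cut places each edge on exactly one partition and replicates a vertex onto every partition that holds one of its edges. The sketch below shows that idea on local collections; the placement rule in `partitionOf` is a hypothetical hash chosen for illustration, not the strategy GraphX or PowerGraph actually uses.

```scala
object VertexCutSketch {
  case class Edge(src: Int, dst: Int)

  // Hypothetical placement rule: hash the endpoint pair to one partition.
  def partitionOf(e: Edge, numParts: Int): Int =
    Math.floorMod(e.src * 31 + e.dst, numParts)

  // Each edge is stored once, on the partition chosen above.
  def partitionEdges(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(e => partitionOf(e, numParts))

  // A vertex needs a replica on every partition that holds one of its edges,
  // so joins between vertex and edge data can stay local to the partition.
  def replicas(edges: Seq[Edge], numParts: Int): Map[Int, Set[Int]] =
    edges
      .flatMap { e =>
        val p = partitionOf(e, numParts)
        Seq(e.src -> p, e.dst -> p)
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

For a star with edges 0-1, 0-2, 0-3 split over three partitions, the hub vertex 0 is replicated on all three partitions while each leaf lives on only one, which is the situation a vertex cut is designed to handle for high-degree vertices.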

Current Implementation
[Figure: stack diagram. Labels: PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40); Pregel (20), GraphLab (20); GraphX; Spark (relational operators)]

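// Load the vertex and edge tables from HDFS
val vertices = spark.textFile("hdfs://path/pages.csv")
val edges = spark.textFile("hdfs://path/to/links.csv")
  .map(line => new Edge(line.split('\t')))
// Build the graph and keep it in memory
val g = new Graph(vertices, edges).cache
println(g.vertices.count)
println(g.edges.count)
// Keep only vertices whose third tab-separated field is "Berkeley"
val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")
// Run 10 iterations of PageRank on the filtered graph
val ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)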


Early Performance
[Chart: runtime in seconds, PageRank for 10 iterations. GraphLab: 22; GraphX: 165; Hadoop: 1340]

GraphX
1. Graph-parallel primitives implementable in data-parallel (or relational) engines.
2. Currently slower than GraphLab, but
» No need for specialized systems
» Easier ETL, and easier consumption of output
» Interactive graph data mining
3. Future work will bring performance closer to specialized engines.

Berkeley Data Analytics Stack
[Figure: stack diagram. Labels: Spark, Shark, BlinkDB, SQL, HDFS / Hadoop Storage, Mesos / YARN Resource Manager, Spark Streaming, GraphX, MLBase]

Backup slides

Vertex Cut Partitioning
[Figure: graph with vertices A to F, split across Partition 1, Partition 2 and Partition 3]
