GraphX: Graph Parallelism Made Simple @ AMPLab Retreat

GraphX: Graph-Parallelism Made Simple
Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica
AMPLab Retreat, May 20, 2013

Graphs are Essential to Data Mining and Machine Learning
» Identify influential people and information
» Find communities
» Target ads and products
» Model complex data dependencies

Specialized Graph Systems
» Pregel

Specialized Graph Systems
1.  APIs to capture complex graph dependencies
[Figure: example graph with vertices A, B, C, D, E, F]

Specialized Graph Systems
[Bar chart: runtime in minutes, counting 34.8 billion triangles]
GraphLab: 1.5
Hadoop: 423

How can data-parallel engines support graph computations efficiently?

How can data-parallel engines support graph computations efficiently?
[Stack diagram: Spark Streaming, Shark SQL, GraphX, and MLBase on top of Spark; Mesos Resource Manager; HDFS / Hadoop Storage]

How can data-parallel engines support graph computations efficiently?
1.  The right interface for expressing graph computations
2.  Efficient implementation of that interface

Remainder of the Talk
1.  Resilient Distributed Graphs (RDGs)
2.  Phase One Implementation
3.  Phase Two Implementation (Future Work)

Resilient Distributed Graphs
An extension of Spark RDDs
» Immutable, partitioned set of vertices and edges
» Can be constructed using RDD[Edge] and RDD[Vertex]
Tight integration with Spark
» Use data-parallel engine (Spark) to do efficient ETL
» Consume results of graph computations in Spark

Resilient Distributed Graphs
def vertices: RDD[Vertex]
def edges: RDD[Edge]
def edgesWithVertices: RDD[EdgeWithVertices]
def mapVertices(mapFunc): Graph
def mapEdges(mapFunc): Graph
def filterVertices(predicate): Graph
def filterEdges(predicate): Graph
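To make the shape of this interface concrete, here is a minimal in-memory sketch in plain Python. It is not the GraphX implementation: the tuple layouts for vertices and edges, the snake_case method names, and the callback signatures are assumptions made for illustration, and tuples stand in for partitioned RDDs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# Assumed layouts (not given on the slide):
Vertex = Tuple[str, Any]       # (vertex id, vertex data)
Edge = Tuple[str, str, Any]    # (source id, target id, edge data)


@dataclass(frozen=True)
class Graph:
    """Immutable graph: every operation returns a new Graph."""
    vertices: Tuple[Vertex, ...]
    edges: Tuple[Edge, ...]

    def edges_with_vertices(self):
        # Join each edge with the data of both endpoints.
        data = dict(self.vertices)
        return [((s, data[s]), (t, data[t]), attr) for s, t, attr in self.edges]

    def map_vertices(self, f: Callable[[str, Any], Any]) -> "Graph":
        return Graph(tuple((vid, f(vid, d)) for vid, d in self.vertices), self.edges)

    def map_edges(self, f: Callable[[str, str, Any], Any]) -> "Graph":
        return Graph(self.vertices, tuple((s, t, f(s, t, a)) for s, t, a in self.edges))

    def filter_edges(self, pred: Callable[[str, str, Any], bool]) -> "Graph":
        return Graph(self.vertices, tuple(e for e in self.edges if pred(*e)))


g = Graph(vertices=(("A", 1.0), ("B", 2.0), ("C", 3.0)),
          edges=(("A", "B", 10), ("A", "C", 20)))
doubled = g.map_vertices(lambda vid, d: d * 2)             # g itself is unchanged
heavy = g.filter_edges(lambda src, dst, attr: attr > 10)   # keeps only A -> C
```

filterVertices is left out on purpose: the slide does not say what happens to edges whose endpoints are removed, so any choice here would be a guess.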

Resilient Distributed Graphs
Two graph computation primitives:
1.  aggregateNeighbors: RDD[(VID, Value)]
2.  updateVertices: Graph

aggregateNeighbors
[Figure: example graph with vertices A to F. A map function is applied to each neighbor of A, giving map(B), map(C), map(D), map(E), map(F); the mapped values are then combined with reduce.]
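The map-then-reduce pattern of aggregateNeighbors can be sketched on a small in-memory graph as follows. This is a model of the semantics, not the Spark code: the function name, the argument order, and the choice to treat edges as undirected are assumptions.

```python
from functools import reduce


def aggregate_neighbors(vertices, edges, map_func, reduce_func):
    """Model of aggregateNeighbors on a small in-memory graph.

    vertices: dict of vertex id -> vertex data
    edges: list of (u, v) pairs, treated as undirected (an assumption)
    Returns a dict of vertex id -> reduced value; vertices with no
    neighbors are absent from the result.
    """
    mapped = {}
    for u, v in edges:
        # Map phase: each endpoint sees the other endpoint as a neighbor.
        mapped.setdefault(u, []).append(map_func(v, vertices[v]))
        mapped.setdefault(v, []).append(map_func(u, vertices[u]))
    # Reduce phase: combine the mapped values per vertex.
    return {vid: reduce(reduce_func, values) for vid, values in mapped.items()}


vertices = {"A": 3, "B": 7, "C": 1}
edges = [("A", "B"), ("A", "C")]
largest_neighbor = aggregate_neighbors(
    vertices, edges, map_func=lambda vid, data: data, reduce_func=max
)
print(largest_neighbor)  # {'A': 7, 'B': 3, 'C': 3}
```

The result is keyed by vertex id, which mirrors the RDD[(VID, Value)] return type listed for this primitive.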

Example: Vertex Degree
» map: 1
» reduce: sum
[Figure: example graph with vertices A to F. Each of A's five neighbors contributes 1; sum: 5.]
Result: A: 5, B: 1, C: 1, D: 2, E: 3, F: 2
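The degree computation can be reproduced with a few lines of Python. The edge list below is an assumption: the transcript does not preserve the figure, so the edges were chosen to be one set that is consistent with the degrees on the slide (A: 5, B: 1, C: 1, D: 2, E: 3, F: 2).

```python
from functools import reduce
from operator import add

# Assumed edge set, chosen to match the degrees shown on the slide.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
         ("D", "E"), ("E", "F")]


def aggregate_neighbors(edges, map_func, reduce_func):
    """Map over each vertex's neighbors, then reduce per vertex."""
    mapped = {}
    for u, v in edges:  # edges treated as undirected (assumption)
        mapped.setdefault(u, []).append(map_func(v))
        mapped.setdefault(v, []).append(map_func(u))
    return {vid: reduce(reduce_func, vals) for vid, vals in sorted(mapped.items())}


# map: 1 for every neighbor; reduce: sum
degrees = aggregate_neighbors(edges, map_func=lambda neighbor: 1, reduce_func=add)
print(degrees)  # {'A': 5, 'B': 1, 'C': 1, 'D': 2, 'E': 3, 'F': 2}
```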

updateVertices
Takes a set of update “messages” and applies them to the vertices using a user-specified function.

Example: updateVertices
[Figure: example graph with vertices A to F]
Messages: A: 5, B: 1, C: 1, D: 2, E: 3, F: 2
Values written to the vertices: 5, 1, 1, 2, 3, 2
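A small Python model of updateVertices, using the degree messages from the example. The function signature and the rule that a vertex without a message keeps its old data are assumptions; the slides do not specify either.

```python
def update_vertices(vertices, messages, update_func):
    """Return a new vertex dict with update_func applied where a message exists.

    Vertices that received no message keep their data (an assumption).
    The input dict is not modified, mirroring the immutable graph model.
    """
    return {
        vid: update_func(vid, data, messages[vid]) if vid in messages else data
        for vid, data in vertices.items()
    }


vertices = {vid: None for vid in "ABCDEF"}   # no vertex data yet
messages = {"A": 5, "B": 1, "C": 1, "D": 2, "E": 3, "F": 2}

# User-specified function: store the message (here, the degree) as vertex data.
updated = update_vertices(vertices, messages, lambda vid, old, msg: msg)
print(updated)  # {'A': 5, 'B': 1, 'C': 1, 'D': 2, 'E': 3, 'F': 2}
```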

Resilient Distributed Graphs
RDD-like primitives:
» map, reduce, filter…
Graph computation primitives:
» aggregateNeighbors: RDD[(VID, Value)]
» updateVertices: Graph
Surprisingly expressive and general:
» Implemented GraphLab and Pregel API in 20 lines of code
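To illustrate why two primitives are enough for a Pregel-style API, here is a simplified superstep loop in Python built only from models of aggregateNeighbors and updateVertices. It is a sketch under assumptions, not the 20-line implementation mentioned on the slide: edges are undirected, there is no vote-to-halt, and the loop stops when a superstep changes nothing. The names pregel_like, send, combine, and apply are hypothetical.

```python
from functools import reduce


def aggregate_neighbors(vertices, edges, map_func, reduce_func):
    mapped = {}
    for u, v in edges:  # edges treated as undirected (assumption)
        mapped.setdefault(u, []).append(map_func(v, vertices[v]))
        mapped.setdefault(v, []).append(map_func(u, vertices[u]))
    return {vid: reduce(reduce_func, vals) for vid, vals in mapped.items()}


def update_vertices(vertices, messages, update_func):
    # Vertices without a message keep their data (assumption).
    return {vid: update_func(data, messages[vid]) if vid in messages else data
            for vid, data in vertices.items()}


def pregel_like(vertices, edges, send, combine, apply, max_supersteps=50):
    """Simplified superstep loop: gather messages, apply them, repeat."""
    for _ in range(max_supersteps):
        messages = aggregate_neighbors(vertices, edges, send, combine)
        updated = update_vertices(vertices, messages, apply)
        if updated == vertices:   # fixed point reached
            break
        vertices = updated
    return vertices


# Connected components by minimum-label propagation.
labels = {vid: vid for vid in "ABCDEF"}
edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = pregel_like(labels, edges,
                         send=lambda vid, label: label,  # neighbor sends its label
                         combine=min,                    # keep the smallest message
                         apply=min)                      # min(own label, message)
print(components)
```

Each vertex ends up labeled with the smallest vertex id in its component, and the isolated vertex F keeps its own label.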

GraphX Phase 1
Implemented RDG abstraction on Spark
» Using existing Spark operators (map, reduce, filter, join)
» Minimal network communication (eq. GraphLab)
Higher level APIs:
» Pregel (20 lines of code)
» GraphLab/PowerGraph (25)
Algorithms:
» PageRank (5), Connected components (10), Shortest path (10), Alternating Least Squares (40)
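As an illustration of how an algorithm such as PageRank decomposes into an aggregate step followed by a vertex update, here is a plain-Python sketch. It is not the 5-line GraphX version: the function name, the damping factor of 0.85, the initial rank of 1.0, and the un-normalized update rule are assumptions, and vertices without out-edges simply leak rank.

```python
from collections import Counter


def pagerank(edges, num_iters=10, damping=0.85):
    """PageRank over directed (src, dst) edges, un-normalized variant."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = Counter(src for src, _ in edges)
    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # Aggregate step: map = rank / out-degree of each in-neighbor, reduce = sum.
        contrib = {}
        for src, dst in edges:
            contrib[dst] = contrib.get(dst, 0.0) + rank[src] / out_degree[src]
        # Update step: apply the summed contributions to every vertex.
        rank = {v: (1 - damping) + damping * contrib.get(v, 0.0) for v in vertices}
    return rank


cycle = pagerank([("A", "B"), ("B", "C"), ("C", "A")])
with_source = pagerank([("A", "B"), ("B", "A"), ("C", "A")])
print(cycle)
print(with_source)
```

In the 3-cycle every vertex keeps rank 1.0, and in the second graph vertex C, which has no incoming edges, settles at 1 - damping.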

PageRank Performance
[Bar chart: runtime in seconds, PageRank for 10 iterations]
GraphLab: 22
GraphX: 165
Hadoop: 1340
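The relative gaps implied by the chart can be worked out directly from the three runtimes. The snippet below only does that arithmetic on the numbers shown; the variable names are illustrative.

```python
# Runtimes in seconds for 10 PageRank iterations, as read from the chart.
runtimes_s = {"GraphLab": 22, "GraphX": 165, "Hadoop": 1340}

graphx_vs_graphlab = runtimes_s["GraphX"] / runtimes_s["GraphLab"]
hadoop_vs_graphx = runtimes_s["Hadoop"] / runtimes_s["GraphX"]

print(f"GraphX takes {graphx_vs_graphlab:.1f}x as long as GraphLab")  # 7.5x
print(f"Hadoop takes {hadoop_vs_graphx:.1f}x as long as GraphX")      # 8.1x
```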

Phase 2: Improved Perf
Introduce 2 new primitives in Spark
» “Mutable” RDD, i.e. update-in-place for iterative computations
» Pre-built hash-indexes for record lookups

GraphX
1.  Graph-parallel primitives implementable in data-parallel engines (Spark).
2.  Currently slower than GraphLab, but
» No need for specialized systems
» Easier ETL, and easier consumption of output
» Interactive graph data mining
3.  Phase 2 (with small additions to Spark) will bring performance closer to specialized engines.

Resilient Distributed Graphs
1.  Graph-parallel primitives implementable in data-parallel engines (Spark) and MPP databases.
2.  Phase 1
» map, reduce, filter, join = select, group by, where, join
» Implementable in MPP databases using only UDFs!
3.  Phase 2
» Materialized views
» New hash-indexes and access methods for optimized update-in-place
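The correspondence between the Spark operators and relational operators can be tried out with Python's built-in sqlite3 module. SQLite is not an MPP database and no UDFs are used here, so this only illustrates the select / group by / join mapping. The table names, column names, and the edge list (chosen to match the vertex-degree example) are assumptions.

```python
import sqlite3

# Assumed edge set, consistent with the degrees in the vertex-degree example.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"),
         ("D", "E"), ("E", "F")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertices (vid TEXT PRIMARY KEY, data INTEGER)")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO vertices VALUES (?, 0)", [(v,) for v in "ABCDEF"])
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

# aggregateNeighbors as select (map: 1) plus group by (reduce: sum).
conn.execute("""
    CREATE TABLE messages AS
    SELECT vid, SUM(one) AS degree
    FROM (SELECT src AS vid, 1 AS one FROM edges
          UNION ALL
          SELECT dst AS vid, 1 AS one FROM edges)
    GROUP BY vid
""")

# updateVertices as a join that produces a new vertex table.
conn.execute("""
    CREATE TABLE updated_vertices AS
    SELECT v.vid AS vid, m.degree AS data
    FROM vertices AS v JOIN messages AS m ON v.vid = m.vid
""")

result = conn.execute(
    "SELECT vid, data FROM updated_vertices ORDER BY vid"
).fetchall()
print(result)  # [('A', 5), ('B', 1), ('C', 1), ('D', 2), ('E', 3), ('F', 2)]
```

Writing the update as a new table rather than an in-place UPDATE matches the immutable Phase 1 model; optimized update-in-place is what Phase 2 lists as future work.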

Resilient Distributed Graphs
1.  Expressive graph-parallel primitives
» Pregel, GraphLab (20 lines of code)
» PageRank (5 lines of code)
2.  Existing data-parallel engines and MPP databases can support these primitives without any modifications
3.  Can significantly improve performance using a few new operators and access methods (come to the next retreat!)
