GraphX at Spark User Meetup

Reynold Xin

July 02, 2013

Transcript

  1. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Reynold Xin, Joseph Gonzalez. AMPLab, UC Berkeley. Spark User Meetup, July 2, 2013.
  2. Graphs are Essential to Data Mining and Machine Learning

    » Identify influential people and information
    » Find communities
    » Understand people’s shared interests
    » Model complex data dependencies
  3. Specialized Graph Systems

    1. APIs to capture complex graph dependencies
    [Figure: example graph with vertices A, B, C, D, E, F]
  4. Specialized Graph Systems

    [Bar chart: runtime in minutes, counting 34.8 billion triangles. GraphLab: 1.5; Hadoop: 423.]
  5. Limitations of Specialized Systems

    » No support for ETL & post-processing
    » Not interactive
    » Requires a separate runtime to maintain, both in engineering and operations
    Data-parallel systems excel at these!
  6. [Tables: users and relationships]

    username   password          location
    rxin       ******            Berkeley
    jegonzal   iJustRaised$6.7m  Pittsburgh
    franklin   ******            Shanghai
    istoica    ******            Piedmont

    user1      user2      relationship
    rxin       jegonzal   friend
    franklin   rxin       advisor
    istoica    franklin   coworker
  7. Why the Unification?

    » No disparate systems to maintain / learn
    » Leverage data-parallel engines for task scheduling, distribution and dispatch
    » Reuse existing engines for ETL and consumption of graph computation output
  8. Enable Joining Tables and Graphs

    Simplify ETL and graph-structured analytics
    [Figure labels: User Data, Product Ratings, Friend]
  9. The Research Challenge

    Expressive graph computation primitives implementable on existing data-parallel engines
    » Relational databases (with UDFs and UDAFs)
    » MapReduce-like frameworks, e.g. Hadoop, Spark
    Leveraging advanced properties and engine extensions to make these primitives fast
    » An optimizer for choosing execution strategies
    » Controlled data partitioning
    » New index-based access methods and operators
  10. Graph Primitives

    A graph consists of:
    » A table of vertices (vid Int, attr1 Int, attr2 …)
    » A table of edges (srcId Int, dstId Int, attr1 Int, attr2 String …)
    Relational operators:
    » E.g. filtering vertices and edges, joining vertices with tables
    Graph computation primitives:
    » aggregateNeighbors(mapUdf, reduceUdaf)
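The "graph as two tables" idea and the relational operators listed on this slide can be illustrated with ordinary Scala collections. This is only a sketch, not GraphX code: the row types, field names and integer vertex ids are assumptions made for illustration, while the usernames, locations and relationships reuse the example tables from earlier in the deck.

```scala
// Hypothetical row types for the two tables (not a GraphX API).
case class VertexRow(vid: Int, name: String, location: String)
case class EdgeRow(srcId: Int, dstId: Int, relationship: String)

// Vertex table; the vid values are made up for this sketch.
val vertices = Seq(
  VertexRow(1, "rxin", "Berkeley"),
  VertexRow(2, "jegonzal", "Pittsburgh"),
  VertexRow(3, "franklin", "Shanghai"),
  VertexRow(4, "istoica", "Piedmont")
)

// Edge table referencing vertices by id.
val edges = Seq(
  EdgeRow(1, 2, "friend"),
  EdgeRow(3, 1, "advisor"),
  EdgeRow(4, 3, "coworker")
)

// Relational filter on the vertex table.
val berkeley = vertices.filter(_.location == "Berkeley")

// Relational filter on the edge table.
val advisorEdges = edges.filter(_.relationship == "advisor")

// Join edges with the vertex table on srcId = vid and dstId = vid.
val nameById: Map[Int, String] = vertices.map(v => v.vid -> v.name).toMap
val namedEdges: Seq[(String, String, String)] = edges.flatMap { e =>
  for {
    src <- nameById.get(e.srcId)
    dst <- nameById.get(e.dstId)
  } yield (src, dst, e.relationship)
}

namedEdges.foreach(println)
```

Because both tables are plain collections here, the same filter and join calls work on either one, which is the property the slide relies on when it treats a graph as relational data.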
  11. Example: Vertex Degree

    [Figure: example graph with vertices A to F, annotated "sum: 5"]
    A: 5, B: 1, C: 1, D: 2, E: 3, F: 2
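The degree computation can be sketched as a map step followed by a per-vertex sum. The transcript does not preserve the edges of the example graph, so the edge list below is a hypothetical one chosen only because it reproduces the degrees shown on the slide.

```scala
// Hypothetical undirected edge list; the slide's actual edges are not in the
// transcript, so these are picked to match the degrees it reports.
val edges: Seq[(String, String)] = Seq(
  "A" -> "B", "A" -> "C", "A" -> "D", "A" -> "E", "A" -> "F",
  "D" -> "E", "E" -> "F"
)

// Map: every edge contributes 1 to each of its two endpoints.
val contributions: Seq[(String, Int)] =
  edges.flatMap { case (u, v) => Seq(u -> 1, v -> 1) }

// Reduce: sum the contributions per vertex.
val degrees: Map[String, Int] =
  contributions.groupBy(_._1).map { case (v, ones) => v -> ones.map(_._2).sum }

degrees.toSeq.sortBy(_._1).foreach { case (v, d) => println(s"$v: $d") }
```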
  12. Logical Query Plan

    aggregateNeighbors(mapUdf, reduceUdaf)
    » CREATE VIEW aggResults AS
        SELECT reduceUdaf(*)
        FROM (SELECT mapUdf(v.attr1, v.attr2, …, e.attr1, …)
              FROM vertices v RIGHT OUTER JOIN edges e ON v.id = e.srcId)
        GROUP BY e.dstId
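The join, map, group-by and reduce steps of this plan can be mirrored on in-memory collections. The sketch below is an assumption-laden illustration rather than the GraphX implementation: the `Vertex` and `Edge` case classes and the exact signature of `aggregateNeighbors` are hypothetical (the slide only gives the name and its two arguments), and an inner join stands in for the slide's RIGHT OUTER JOIN to keep the code short.

```scala
// Hypothetical table row types.
case class Vertex[VD](id: Int, attr: VD)
case class Edge[ED](srcId: Int, dstId: Int, attr: ED)

// Hypothetical signature: join, apply mapUdf, group by dstId, fold with reduceUdaf.
def aggregateNeighbors[VD, ED, A](
    vertices: Seq[Vertex[VD]],
    edges: Seq[Edge[ED]]
)(mapUdf: (Vertex[VD], Edge[ED]) => A)(reduceUdaf: (A, A) => A): Map[Int, A] = {
  val byId: Map[Int, Vertex[VD]] = vertices.map(v => v.id -> v).toMap
  // JOIN edges with vertices on v.id = e.srcId, then apply mapUdf to each joined row.
  // Edges whose source vertex is missing are dropped (inner join, a simplification).
  val mapped: Seq[(Int, A)] =
    edges.flatMap(e => byId.get(e.srcId).map(v => e.dstId -> mapUdf(v, e)))
  // GROUP BY e.dstId, then combine each group with reduceUdaf.
  mapped.groupBy(_._1).map { case (dst, rows) => dst -> rows.map(_._2).reduce(reduceUdaf) }
}

// Usage: weighted sum of in-neighbor attributes for each destination vertex.
val vs = Seq(Vertex(1, 1.0), Vertex(2, 2.0), Vertex(3, 3.0))
val es = Seq(Edge(1, 3, 0.5), Edge(2, 3, 1.0), Edge(3, 1, 2.0))
val result: Map[Int, Double] =
  aggregateNeighbors(vs, es)((v, e) => v.attr * e.attr)(_ + _)

println(result)
```

Vertices with no incoming edges (vertex 2 here) do not appear in the result, just as they would produce no group in the SQL view.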
  13. GraphX operators

    Relational operators on edges and vertices:
    » Filter, join, projection…
    Graph computation operator:
    » aggregateNeighbors
  14. Performance Optimizations

    Replicate & co-partition vertices with edges
    » GraphLab (PowerGraph) style vertex-cut partitioning
    » Minimize communication by avoiding edge data movement in JOINs
    In-memory hash index for vertices for fast joins
    Optimizer for choosing execution strategies
    » E.g. if mapUdf does not need edge data, we can rewrite the query to delay the join
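To make the vertex-cut idea concrete, here is a minimal sketch in plain Scala: edges, not vertices, are assigned to partitions, and a vertex is replicated to every partition that holds one of its edges. The edge list, the partition count and the `partitionOf` rule are all hypothetical; the modular rule is a deliberately simple stand-in and is not claimed to be the placement strategy used by GraphX or PowerGraph.

```scala
// Hypothetical edge list in which vertex 1 is a high-degree hub.
val edges: Seq[(Int, Int)] = Seq((1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (4, 5))
val numPartitions = 2

// Hypothetical placement rule: each EDGE is assigned to exactly one partition.
def partitionOf(src: Int, dst: Int): Int = (src + dst) % numPartitions

// Edge table split by partition; no edge is stored twice.
val edgePartitions: Map[Int, Seq[(Int, Int)]] =
  edges.groupBy { case (s, d) => partitionOf(s, d) }

// A vertex is replicated to every partition that holds one of its edges,
// so the join between vertex and edge data can run locally in each partition.
val replicas: Map[Int, Set[Int]] =
  edges
    .flatMap { case (s, d) =>
      val p = partitionOf(s, d)
      Seq(s -> p, d -> p)
    }
    .groupBy(_._1)
    .map { case (v, ps) => v -> ps.map(_._2).toSet }

replicas.toSeq.sortBy(_._1).foreach { case (v, ps) =>
  println(s"vertex $v replicated to partitions ${ps.toSeq.sorted.mkString(", ")}")
}
```

In this sketch the hub's edges are spread across both partitions, and only the hub's vertex record is copied, which is how a vertex cut avoids moving edge data during joins.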
  15. Current Implementation

    [Diagram: Spark (relational operators); GraphX; Pregel (20); GraphLab (20); PageRank (5); Connected Comp. (10); Shortest Path (10); ALS (40)]
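  16. [Code: example GraphX program]

    // Load the vertex and edge tables from HDFS.
    val vertices = spark.textFile("hdfs://path/pages.csv")
    val edges = spark.textFile("hdfs://path/to/links.csv")
      .map(line => new Edge(line.split('\t')))
    // Build the graph and keep it in memory.
    val g = new Graph(vertices, edges).cache
    println(g.vertices.count)
    println(g.edges.count)
    // Keep only vertices whose third tab-separated field is "Berkeley".
    val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")
    // Run 10 iterations of PageRank on the filtered graph.
    val ranks = Analytics.pageRank(g1, numIter = 10)
    println(ranks.vertices.sum)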
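What a call such as `pageRank(..., numIter = 10)` computes can be sketched without Spark, using one map step and one per-destination sum in each iteration. This is a simplified stand-in, not the `Analytics.pageRank` implementation: the edge list is hypothetical, the update rule (rank = 0.15 + 0.85 × sum of incoming contributions) and the `resetProb` parameter are assumptions, and every vertex is assumed to have at least one outgoing edge.

```scala
// Hypothetical directed edge list (src -> dst): a 3-cycle plus one extra link.
val edges: Seq[(Int, Int)] = Seq((1, 2), (2, 3), (3, 1), (1, 3))
val vertexIds: Seq[Int] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
val outDegree: Map[Int, Int] = edges.groupBy(_._1).map { case (s, es) => s -> es.size }

// Assumes every vertex has at least one outgoing edge.
def pageRank(numIter: Int, resetProb: Double = 0.15): Map[Int, Double] = {
  var ranks: Map[Int, Double] = vertexIds.map(_ -> 1.0).toMap
  for (_ <- 1 to numIter) {
    // Map: each edge sends rank(src) / outDegree(src) to its destination.
    val contribs = edges.map { case (s, d) => d -> ranks(s) / outDegree(s) }
    // Reduce: sum the contributions arriving at each destination.
    val sums = contribs.groupBy(_._1).map { case (d, cs) => d -> cs.map(_._2).sum }
    // Update every vertex from the aggregated contributions.
    ranks = vertexIds.map { v =>
      v -> (resetProb + (1 - resetProb) * sums.getOrElse(v, 0.0))
    }.toMap
  }
  ranks
}

val ranks = pageRank(numIter = 10)
println(ranks.values.sum)
```

Each iteration has the same shape as the neighbor aggregation described earlier in the deck: a per-edge map followed by a reduction keyed on the destination vertex.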
  17. Early Performance

    [Bar chart: runtime in seconds, PageRank for 10 iterations. GraphLab: 22; GraphX: 165; Hadoop: 1340.]
  18. GraphX

    1. Graph-parallel primitives implementable in data-parallel (or relational) engines.
    2. Currently slower than GraphLab, but:
       » No need for specialized systems
       » Easier ETL, and easier consumption of output
       » Interactive graph data mining
    3. Future work will bring performance closer to specialized engines.
  19. Berkeley Data Analytics Stack

    [Diagram: Mesos / YARN (resource manager); HDFS / Hadoop (storage); Spark; Shark; BlinkDB; SQL; Spark Streaming; GraphX; MLBase]