Graphs are everywhere!

Distributed graph computing with Spark GraphX

Andrea Iacono

November 28, 2015

Transcript

  1. MILAN 20/21.11.2015 - Andrea Iacono
    Agenda:
    • Graph definitions and usages
    • GraphX introduction
    • Pregel
    • Code example
    The main focus will be the programming model.
    The code is available at: https://github.com/andreaiacono/TalkGraphX
  2. A graph is a set of vertices and the edges that connect them.
    Graphs are used for modeling very different domains.
    [Figure: a graph with one edge and one vertex labeled]
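
The definition above maps directly onto a small data structure. This is a minimal sketch in plain Scala, not the GraphX representation: the names `Edge`, `SimpleGraph` and `outNeighbours` are hypothetical, and the three cities and distances are borrowed from the sample data shown later in the talk.

```scala
// Hypothetical names, chosen for illustration only (not GraphX types).
final case class Edge(src: Long, dst: Long, weight: Double)

final case class SimpleGraph(vertices: Map[Long, String], edges: Seq[Edge]) {
  // Ids of the vertices reachable by following one outgoing edge from `id`.
  def outNeighbours(id: Long): Seq[Long] =
    edges.filter(_.src == id).map(_.dst)
}

object GraphExample {
  // A three-vertex excerpt of the talk's sample data.
  val graph: SimpleGraph = SimpleGraph(
    vertices = Map(1L -> "Washington", 2L -> "Baltimore", 3L -> "Detroit"),
    edges = Seq(Edge(1L, 2L, 27.0), Edge(1L, 3L, 91.0), Edge(2L, 3L, 35.0))
  )
}
```

With this excerpt, `GraphExample.graph.outNeighbours(1L)` yields the ids 2 and 3, while vertex 3 has no outgoing edges.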
  3. What's wrong with MapReduce?
    Every MapReduce run reads its input data from disk (e.g. HDFS), computes the results, and then writes them back to disk. Since most graph algorithms are iterative, the whole dataset must be read from and written to disk at every iteration.
    It's better to use a distributed dataflow framework.
  4. GraphX is a graph processing system built on top of Apache Spark
    “Graph processing systems represent graph structured data as a property graph, which associates user-defined properties with each vertex and edge.”
    “The Spark storage abstraction called Resilient Distributed Datasets (RDDs) enables applications to keep data in memory, which is essential for iterative graph algorithms.”
    “RDDs permit user-defined data partitioning, and the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement. This is essential for encoding partitioned graphs.”
    Excerpt from "GraphX: Graph Processing in a Distributed Dataflow Framework": https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf
  5. Graph Databases
    • Storage
    • Query Language
    • Transactions
    • Examples:
      • Neo4j
      • OrientDB
      • Titan
    Graph Processing Systems
    • APIs for traversing and processing
    • Better performance (in-memory data)
    • Examples:
      • GraphX
      • Giraph
      • GraphLab
  6. Pregel is a computational model designed by Google (https://kowshik.github.io/JPregel/pregel_paper.pdf)
    It consists of a sequence of supersteps that runs until termination. In each superstep, every vertex can:
    • modify its state or that of its outgoing edges
    • receive the messages sent to it during the previous superstep
    • send messages to its neighbours (they will be received in the next superstep)
    • vote to halt
    When a node votes to halt, it becomes inactive; if it receives a message in a later superstep, the framework wakes it up by setting its state back to active.
    The computation stops when all the nodes have voted to halt; alternatively, a maximum number of iterations can be set.
    Edges don't have any computation.
    When writing algorithms, you have to think like a vertex.
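
The superstep loop described above can be sketched on a single machine. The example below is a simplified illustration in plain Scala, not Google's or GraphX's implementation: it propagates the maximum value through the graph, and the names `PregelMaxValue`, `run`, `send` and `maxSupersteps` are hypothetical. A vertex whose value does not change sends nothing, which plays the role of voting to halt; a vertex becomes active again only when a message arrives.

```scala
object PregelMaxValue {
  // Hypothetical single-machine sketch: vertex ids and values are Ints, and
  // `neighbours` maps each vertex to the vertices it can send messages to.
  def run(
      initial: Map[Int, Int],
      neighbours: Map[Int, Seq[Int]],
      maxSupersteps: Int = 100
  ): Map[Int, Int] = {
    // Deliver the current value of each sender to all of its neighbours,
    // grouped by receiving vertex.
    def send(senders: Iterable[Int], values: Map[Int, Int]): Map[Int, Seq[Int]] =
      senders.toSeq
        .flatMap(v => neighbours.getOrElse(v, Seq.empty[Int]).map(n => n -> values(v)))
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2) }

    var values = initial
    // Superstep 0: every vertex is active and sends its value.
    var inbox = send(initial.keys, values)
    var superstep = 1
    // Stop when no messages are in flight (all vertices have halted)
    // or when the superstep limit is reached.
    while (inbox.nonEmpty && superstep < maxSupersteps) {
      // A vertex that receives messages wakes up and keeps the largest value seen.
      val updated = inbox.collect {
        case (v, msgs) if msgs.max > values(v) => v -> msgs.max
      }
      values = values ++ updated
      // Only vertices whose value changed send messages; the others vote to halt.
      inbox = send(updated.keys, values)
      superstep += 1
    }
    values
  }
}
```

On a chain of four vertices holding 3, 6, 2 and 1, with each vertex linked to its neighbours in both directions, every vertex ends up with the value 6.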
  7. GraphX implementation of Pregel
    GraphX uses three functions for implementing Pregel:
    • vprog: the vertex program, run on each vertex that receives an incoming message; it computes the new vertex value
    • sendMsg: the function used for sending messages to other vertices
    • mergeMsg: a function that takes two incoming messages and merges them into a single message
    Unlike Google's Pregel, the GraphX implementation of Pregel:
    • leaves message construction out of the vertex program, which allows a more efficient distributed execution
    • permits access to the attributes of both vertices of an edge while building the messages
    • constrains message sending to the graph structure (messages go only to neighbours)
  8. When to use GraphX
    GraphX is well suited for algorithms that:
    • respect the neighborhood structure
    GraphX is NOT well suited for algorithms that:
    • need interaction between distant vertices
    • change the structure of the graph
  9. Algorithms out of the box (as of Spark v1.5.1):
    - Connected Components
    - Label Propagation
    - PageRank
    - SVD++
    - Shortest Paths
    - Strongly Connected Components
    - Triangle Count
  10. Dijkstra's Algorithm – Step 0
    Unvisited nodes: • Baltimore • Detroit • Chicago • NewYork • Philadelphia
  11. Dijkstra's Algorithm – Step 1
    Unvisited nodes: • Baltimore • Detroit • Chicago • NewYork • Philadelphia
  12. Dijkstra's Algorithm – Step 2
    Unvisited nodes: • Baltimore • Detroit • Chicago • NewYork • Philadelphia
  13. Dijkstra's Algorithm – Step 3
    Unvisited nodes: • Baltimore • Detroit • Chicago • NewYork • Philadelphia
  14. Dijkstra's Algorithm – Step 4
    Unvisited nodes: • Detroit • Chicago • NewYork • Philadelphia
  15. Dijkstra's Algorithm – Step 5
    Unvisited nodes: • Detroit • Chicago • NewYork • Philadelphia
  16. Dijkstra's Algorithm – Step 6
    Unvisited nodes: • Detroit • Chicago • NewYork • Philadelphia
  17. Dijkstra's Algorithm – Step 7
    Unvisited nodes: • Chicago • NewYork • Philadelphia
  18. Dijkstra's Algorithm – Step 8
    Unvisited nodes: • Chicago • NewYork • Philadelphia
  19. Dijkstra's Algorithm – Step 9
    Unvisited nodes: • Chicago • NewYork • Philadelphia
  20. Dijkstra's Algorithm – Step 10
    Unvisited nodes: • Chicago • Philadelphia
  21. Shortest path data sample
    Edges (source id, destination id, distance):
      1 2 27
      1 3 91
      2 3 35
      2 5 67
      3 4 48
      3 5 14
      5 4 29
      5 6 15
    Vertices (id, city name):
      1 Washington
      2 Baltimore
      3 Detroit
      4 Chicago
      5 NewYork
      6 Philadelphia
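
For reference, the sequential algorithm stepped through in the previous slides can be run on this sample data. The sketch below is plain Scala with no Spark dependency; it assumes each edge line reads as (source id, destination id, distance) and treats edges as directed. The names `DijkstraSample` and `shortestDistances` are hypothetical.

```scala
import scala.collection.mutable

object DijkstraSample {
  // Edges from the slide, read as (source id, destination id, distance).
  val edges: Seq[(Long, Long, Double)] = Seq(
    (1L, 2L, 27.0), (1L, 3L, 91.0), (2L, 3L, 35.0), (2L, 5L, 67.0),
    (3L, 4L, 48.0), (3L, 5L, 14.0), (5L, 4L, 29.0), (5L, 6L, 15.0)
  )

  // Returns the shortest known distance from `source` to every reachable vertex.
  def shortestDistances(source: Long): Map[Long, Double] = {
    val outgoing = edges.groupBy(_._1)
    val dist = mutable.Map(source -> 0.0)
    val visited = mutable.Set.empty[Long]
    // PriorityQueue is a max-heap, so the ordering is reversed to pop the
    // closest unvisited vertex first.
    val queue = mutable.PriorityQueue((0.0, source))(
      Ordering.by[(Double, Long), Double](_._1).reverse
    )
    while (queue.nonEmpty) {
      val (d, u) = queue.dequeue()
      if (visited.add(u)) { // skip stale entries of already visited vertices
        for ((_, v, w) <- outgoing.getOrElse(u, Nil)) {
          if (d + w < dist.getOrElse(v, Double.PositiveInfinity)) {
            dist(v) = d + w
            queue.enqueue((d + w, v))
          }
        }
      }
    }
    dist.toMap
  }
}
```

Starting from Washington (id 1), this gives Baltimore 27, Detroit 62, NewYork 76, Philadelphia 91 and Chicago 105, which matches the order in which the nodes leave the unvisited list in the step-by-step slides.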
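  22. Shortest path code sample (1/2)
    import org.apache.spark.graphx.{Graph, VertexId}

    // `graph`, VertexAttribute and City come from the talk's code;
    // their definitions are not shown on the slides.
    val sourceCityId: VertexId = 1L

    // The source city starts at distance 0 with itself as path;
    // every other city starts at infinite distance with an empty path.
    val initialGraph: Graph[VertexAttribute, Double] = graph.mapVertices(
      (vertexId, cityName) =>
        if (vertexId == sourceCityId) {
          VertexAttribute(cityName, 0.0, List[City](new City(cityName, sourceCityId)))
        } else {
          VertexAttribute(cityName, Double.PositiveInfinity, List[City]())
        }
    )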
  23. Shortest path code sample (2/2)
    import org.apache.spark.graphx.EdgeDirection

    initialGraph.pregel(
      initialMsg = VertexAttribute("", Double.PositiveInfinity, List[City]()),
      maxIterations = Int.MaxValue,
      activeDirection = EdgeDirection.Out // the direction of edges on which to run `sendMsg`
    )(
      // vprog (returns the new vertex attribute for this vertex)
      (vertexId, currentVertexAttr, newVertexAttr) =>
        if (currentVertexAttr.distance <= newVertexAttr.distance) currentVertexAttr
        else newVertexAttr,

      // sendMsg (sends a new VertexAttribute to the neighbours)
      edgeTriplet =>
        if (edgeTriplet.srcAttr.distance < (edgeTriplet.dstAttr.distance - edgeTriplet.attr)) {
          Iterator((
            edgeTriplet.dstId,
            new VertexAttribute(
              edgeTriplet.dstAttr.cityName,
              edgeTriplet.srcAttr.distance + edgeTriplet.attr,
              edgeTriplet.srcAttr.path :+ new City(edgeTriplet.dstAttr.cityName, edgeTriplet.dstId)
            )
          ))
        } else {
          Iterator.empty
        },

      // mergeMsg (collapses all the incoming messages for a vertex, two at a time, into one)
      (attribute1, attribute2) =>
        if (attribute1.distance < attribute2.distance) attribute1 else attribute2
    )
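
To see how the three functions cooperate without a Spark cluster, the same logic can be simulated in plain Scala on the sample data. This is a simplified sketch, not GraphX itself: `Attr` and `Triplet` are hypothetical stand-ins for `VertexAttribute` and GraphX's `EdgeTriplet`, the path stores city names instead of `City` objects, and the initial-message round is skipped because a message with infinite distance leaves every vertex unchanged.

```scala
object PregelShortestPaths {
  // Hypothetical stand-ins for the slide's VertexAttribute and GraphX's EdgeTriplet.
  final case class Attr(cityName: String, distance: Double, path: List[String])
  final case class Triplet(srcAttr: Attr, dstId: Long, dstAttr: Attr, attr: Double)

  val cities: Map[Long, String] = Map(
    1L -> "Washington", 2L -> "Baltimore", 3L -> "Detroit",
    4L -> "Chicago", 5L -> "NewYork", 6L -> "Philadelphia"
  )
  // Read as (source id, destination id, distance).
  val edges: Seq[(Long, Long, Double)] = Seq(
    (1L, 2L, 27.0), (1L, 3L, 91.0), (2L, 3L, 35.0), (2L, 5L, 67.0),
    (3L, 4L, 48.0), (3L, 5L, 14.0), (5L, 4L, 29.0), (5L, 6L, 15.0)
  )

  // vprog: keep whichever attribute has the smaller distance.
  def vprog(current: Attr, incoming: Attr): Attr =
    if (current.distance <= incoming.distance) current else incoming

  // sendMsg: offer the destination a shorter route through the source, if any.
  def sendMsg(t: Triplet): Iterator[(Long, Attr)] =
    if (t.srcAttr.distance < t.dstAttr.distance - t.attr)
      Iterator((t.dstId, Attr(t.dstAttr.cityName, t.srcAttr.distance + t.attr,
        t.srcAttr.path :+ t.dstAttr.cityName)))
    else Iterator.empty

  // mergeMsg: of two messages for the same vertex, keep the shorter route.
  def mergeMsg(a: Attr, b: Attr): Attr = if (a.distance < b.distance) a else b

  def run(sourceId: Long): Map[Long, Attr] = {
    var attrs: Map[Long, Attr] = cities.map { case (id, name) =>
      id -> (if (id == sourceId) Attr(name, 0.0, List(name))
             else Attr(name, Double.PositiveInfinity, Nil))
    }
    var active: Set[Long] = attrs.keySet // every vertex is active in the first superstep
    while (active.nonEmpty) {
      // Run sendMsg on the outgoing edges of active vertices, then merge per destination.
      val inbox: Map[Long, Attr] = edges
        .filter { case (src, _, _) => active(src) }
        .flatMap { case (src, dst, w) => sendMsg(Triplet(attrs(src), dst, attrs(dst), w)) }
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).reduce(mergeMsg) }
      // Run vprog on each vertex that received a message; only those are active next.
      attrs = attrs ++ inbox.map { case (id, msg) => id -> vprog(attrs(id), msg) }
      active = inbox.keySet
    }
    attrs
  }
}
```

`PregelShortestPaths.run(1L)` converges to the same distances as the sequential Dijkstra walk-through (for example 91 for Philadelphia, reached through Baltimore, Detroit and NewYork), but each superstep updates many vertices at once instead of visiting one node at a time.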
  24. Leave your feedback on Joind.in!
    https://m.joind.in/event/codemotion-milan-2015