Graph Processing using Apache Flink

Meetup talk on processing graph structures with Apache Flink.

Kristian Kottke

November 06, 2017

Transcript

  1. ©iteratec Whoami

    Kristian Kottke
    › Senior Software Engineer at iteratec
    Interests
    › Software Architecture
    › Big Data Technologies
    github.com/kkottke
    xing.to/kkottke
    02.11.2017 | Graph Processing using Apache Flink
  2. Overview

    › Use Case
    › Apache Flink
    › Gelly
    › Demo
    › Alternative Processing Models
    › Wrap Up
  3. Overview

    › Data Ingestion
    › Graph Analytics
    › Visualization/Persistence
  4. Graph Analytics

    Problem Definition
    › Determine Relations
    › Assign unique Identifier (RingId)
    Challenges
    › High, implicit Connectivity
    › Transitive Relations
    › Relations based on different Properties
    › -> Iterations
  5. Apache Flink

    › Platform for distributed stream and batch processing
    › Data Distribution
    › Communication
    › Fault Tolerance
    › Memory Management
    › Parallelizing and Optimization
    › Native Iteration Support
    › Fork of a research project
    › 03/2014: Apache Incubator
    › 12/2014: Apache Top-Level
  6. Apache Flink

    Stack diagram, top to bottom:
    › Libraries: FlinkML, Gelly, Table & SQL, CEP
    › API: DataStream API, DataSet API
    › Runtime
    › Deployment: Local, Cluster, Cloud
    › Storage: Files, HDFS, S3, JDBC, Kafka, ...
    Following: https://ci.apache.org/projects/flink/flink-docs-release-1.2/
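  7. Apache Flink Structure

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    // Execution Environment
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Load/Create Data (readTextFile yields one String per line;
    // readCsvFile would return a CsvReader that still needs its field types)
    DataSet<String> text = env.readTextFile("inputFile.csv");
    // Transformations (LineSplitter: user-defined FlatMapFunction, not shown on the slide)
    DataSet<Tuple2<String, Integer>> counts = text.flatMap(new LineSplitter())
        .groupBy(0)
        .sum(1);
    // Store Data
    counts.writeAsCsv("outputFile.csv");
    // Trigger Execution
    env.execute();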
  8. Apache Flink Execution

    Diagram: one JobManager, two TaskManagers, four TaskSlots.
  9. Pseudo Algorithm

    › Determine Relations
    › Assign unique Identifier (RingId)
    Algorithm
    › Create Graph
    › Generate potential RingId
    › Send RingId to all Neighbors
    › Receive smaller RingId
    › Update current RingId
    › Send new RingId to all Neighbors
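
The algorithm steps on this slide can be sketched in plain Java without any Flink dependency. This is only an illustration under stated assumptions: the class `RingIdPropagation`, the adjacency-map representation, the choice of the vertex's own id as its potential RingId, and the sample graph in `demo()` are all hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java sketch of the pseudo algorithm; it does not use Flink. */
final class RingIdPropagation {

    /**
     * Assumes an undirected graph: every vertex is a key of the map and
     * each edge is listed for both of its endpoints.
     */
    static Map<String, String> assignRingIds(Map<String, List<String>> neighbors) {
        // Generate potential RingId: here simply the vertex's own id (assumption).
        Map<String, String> ringIds = new TreeMap<>();
        for (String vertex : neighbors.keySet()) {
            ringIds.put(vertex, vertex);
        }

        // Vertices that still have to announce their RingId; initially all.
        Set<String> senders = new TreeSet<>(ringIds.keySet());
        while (!senders.isEmpty()) {
            // Send RingId to all neighbors.
            Map<String, List<String>> inbox = new HashMap<>();
            for (String sender : senders) {
                for (String receiver : neighbors.get(sender)) {
                    inbox.computeIfAbsent(receiver, k -> new ArrayList<>())
                         .add(ringIds.get(sender));
                }
            }
            // Receive smaller RingId and update the current RingId.
            Set<String> changed = new TreeSet<>();
            for (Map.Entry<String, List<String>> entry : inbox.entrySet()) {
                String vertex = entry.getKey();
                String smallest = Collections.min(entry.getValue());
                if (smallest.compareTo(ringIds.get(vertex)) < 0) {
                    ringIds.put(vertex, smallest);
                    changed.add(vertex);
                }
            }
            // Only vertices with a new RingId send again in the next round.
            senders = changed;
        }
        return ringIds;
    }

    /** Hypothetical sample data: a-b-c form one group, d-e another. */
    static Map<String, String> demo() {
        Map<String, List<String>> neighbors = new HashMap<>();
        neighbors.put("a", Arrays.asList("b"));
        neighbors.put("b", Arrays.asList("a", "c"));
        neighbors.put("c", Arrays.asList("b"));
        neighbors.put("d", Arrays.asList("e"));
        neighbors.put("e", Arrays.asList("d"));
        return assignRingIds(neighbors);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a=a, b=a, c=a, d=d, e=d}
    }
}
```

The loop stops once no vertex receives a smaller RingId, so every vertex in a connected group ends up with the smallest id of that group.
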
  10. Apache Flink Delta Iterations

    Diagram labels:
    › Working Set (Edges)
    › Solution Set (Vertices)
    › Step Function: Join, Group, Join, Filter
    › Update Delta (applied to the Solution Set)
    › Iteration Result
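
To make the Join, Group, Join, Filter sequence of the step function more concrete, here is a minimal plain-Java sketch that mimics the dataflow with streams. It does not use Flink's delta iteration API, and it rests on assumptions: the class `DeltaIterationSketch`, the `Edge` helper, the decision to keep changed vertices in the workset, and the sample data are hypothetical. Every edge target is assumed to be present in the initial solution set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Plain-Java sketch of a delta iteration; it only mimics the dataflow. */
final class DeltaIterationSketch {

    /** Directed edge; an undirected graph lists both directions. */
    static final class Edge {
        final String source;
        final String target;

        Edge(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    static Map<String, String> run(Map<String, String> initial, List<Edge> edges,
                                   int maxIterations) {
        Map<String, String> solutionSet = new TreeMap<>(initial); // vertex -> RingId
        Map<String, String> workset = new TreeMap<>(initial);     // entries changed last round

        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            Map<String, String> changed = workset;
            // Join changed vertices with their outgoing edges, then
            // group by target vertex and keep the smallest candidate RingId.
            Map<String, String> candidates = edges.stream()
                .filter(e -> changed.containsKey(e.source))
                .collect(Collectors.toMap(
                    e -> e.target,
                    e -> changed.get(e.source),
                    (x, y) -> x.compareTo(y) <= 0 ? x : y));
            // Join with the solution set and filter: only improvements survive.
            Map<String, String> delta = candidates.entrySet().stream()
                .filter(c -> c.getValue().compareTo(solutionSet.get(c.getKey())) < 0)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
            solutionSet.putAll(delta); // update the solution set with the delta
            workset = delta;           // the delta becomes the next workset
        }
        return solutionSet;
    }

    /** Hypothetical sample data: a-b-c connected, d isolated. */
    static Map<String, String> demo() {
        Map<String, String> initial = new TreeMap<>();
        for (String vertex : Arrays.asList("a", "b", "c", "d")) {
            initial.put(vertex, vertex);
        }
        List<Edge> edges = Arrays.asList(
            new Edge("a", "b"), new Edge("b", "a"),
            new Edge("b", "c"), new Edge("c", "b"));
        return run(initial, edges, 100);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a=a, b=a, c=a, d=d}
    }
}
```

The point of the delta form is that each round only touches entries that changed in the previous round, so the workset shrinks as the solution converges.
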
  11. Gelly

    › High-Level Graph API
    › Graph Representation
    › Graph Transformations
    › Graph Mutations
    › Neighborhood Methods (Access/Processing)
    › Iterative Graph Processing
      › Vertex-Centric
      › Scatter-Gather
      › Gather-Sum-Apply
  12. Gelly Vertex-Centric Iterations

    Diagram: in each superstep (Superstep 1, Superstep 2, ...) the vertices read their inbox and write to their outbox; supersteps are separated by a barrier.
    Following: https://ci.apache.org/projects/flink/flink-docs-release-1.2/
  13. Gelly Vertex-Centric Iterations: Code

    // create graph
    Graph graph = Graph.fromTupleDataSet(vertices, edges, env);
    // run iterations: ComputeFunction and MessageCombiner are abstract in Gelly,
    // so the two arguments stand for user-defined subclasses;
    // 100 is the maximum number of supersteps
    Graph resultGraph = graph.runVertexCentricIteration(
        new ComputeFunction(), new MessageCombiner(), 100);
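  14. Gelly Vertex-Centric Iterations: Code

    // Method of a ComputeFunction<K, String, EV, String> subclass. The type
    // parameters are missing in the transcript; K and EV stand for the
    // graph's key and edge value types.
    public void compute(Vertex<K, String> vertex, MessageIterator<String> messages) {
        String ringId = vertex.getValue();
        if (getSuperstepNumber() == 1) {
            // first superstep: announce the initial RingId
            sendMessageToAllNeighbors(ringId);
        } else {
            // keep the smallest RingId received
            for (String message : messages) {
                if (message.compareTo(ringId) < 0) {
                    ringId = message;
                }
            }
            // propagate only if the RingId actually changed
            if (!vertex.getValue().equals(ringId)) {
                setNewVertexValue(ringId);
                sendMessageToAllNeighbors(ringId);
            }
        }
    }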
  15. Gelly Scatter-Gather Iterations

    Diagram: within one superstep, a Scatter phase fills the outboxes and, after a barrier, a Gather phase reads the inboxes.
  16. Gelly Scatter-Gather Iterations: Code

    // scatter function (K stands for the graph's key type, missing in the transcript)
    public void sendMessages(Vertex<K, String> vertex) {
        sendMessageToAllNeighbors(vertex.getValue());
    }
  17. Gelly Scatter-Gather Iterations: Code

    // gather function (K stands for the graph's key type, missing in the transcript)
    public void updateVertex(Vertex<K, String> vertex, MessageIterator<String> messages) {
        String ringId = vertex.getValue();
        // keep the smallest RingId received
        for (String message : messages) {
            if (message.compareTo(ringId) < 0) {
                ringId = message;
            }
        }
        // update the vertex only if the RingId changed
        if (!vertex.getValue().equals(ringId)) {
            setNewVertexValue(ringId);
        }
    }
  18. Gelly Gather-Sum-Apply Iterations

    Diagram: one superstep consists of the phases Gather, Sum and Apply.
  19. Gelly Gather-Sum-Apply Iterations: Code

    // gather function (EV stands for the graph's edge value type, missing in the transcript)
    public String gather(Neighbor<String, EV> neighbor) {
        return neighbor.getNeighborValue();
    }
  20. Gelly Gather-Sum-Apply Iterations: Code

    // sum function: returns the smaller of the two RingIds
    public String sum(String ringId1, String ringId2) {
        return ringId1.compareTo(ringId2) < 0 ? ringId1 : ringId2;
    }
  21. Gelly Gather-Sum-Apply Iterations: Code

    // apply function
    public void apply(String newRingId, String currentRingId) {
        if (newRingId.compareTo(currentRingId) < 0) {
            setResult(newRingId);
        }
    }
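
How the gather, sum and apply functions cooperate in one superstep can be sketched in plain Java. This is not Gelly code: the class `GsaSketch`, the map-based graph, and the sample data are hypothetical, and `apply` returns the next value here instead of calling `setResult` as the slide's version does. Every neighbor is assumed to have an entry in the value map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Plain-Java sketch of the three GSA phases; it does not use Gelly. */
final class GsaSketch {

    // gather: what a vertex collects from one neighbor (its current RingId)
    static String gather(String neighborRingId) {
        return neighborRingId;
    }

    // sum: pairwise reduction; taking the minimum is associative and commutative
    static String sum(String ringId1, String ringId2) {
        return ringId1.compareTo(ringId2) < 0 ? ringId1 : ringId2;
    }

    // apply: the vertex's next RingId, given the reduced value
    static String apply(String newRingId, String currentRingId) {
        return newRingId.compareTo(currentRingId) < 0 ? newRingId : currentRingId;
    }

    /** One superstep over all vertices, reading only the previous values. */
    static Map<String, String> superstep(Map<String, String> ringIds,
                                         Map<String, List<String>> neighbors) {
        Map<String, String> next = new TreeMap<>(ringIds);
        for (Map.Entry<String, String> vertex : ringIds.entrySet()) {
            Optional<String> reduced = neighbors
                .getOrDefault(vertex.getKey(), Collections.emptyList())
                .stream()
                .map(n -> gather(ringIds.get(n)))
                .reduce(GsaSketch::sum);
            if (reduced.isPresent()) {
                next.put(vertex.getKey(), apply(reduced.get(), vertex.getValue()));
            }
        }
        return next;
    }

    /** Repeats supersteps until nothing changes or maxIterations is reached. */
    static Map<String, String> run(Map<String, String> ringIds,
                                   Map<String, List<String>> neighbors,
                                   int maxIterations) {
        Map<String, String> current = new TreeMap<>(ringIds);
        for (int i = 0; i < maxIterations; i++) {
            Map<String, String> next = superstep(current, neighbors);
            if (next.equals(current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    /** Hypothetical sample data: a-b-c connected, d isolated. */
    static Map<String, String> demo() {
        Map<String, String> ringIds = new TreeMap<>();
        for (String vertex : Arrays.asList("a", "b", "c", "d")) {
            ringIds.put(vertex, vertex);
        }
        Map<String, List<String>> neighbors = new HashMap<>();
        neighbors.put("a", Arrays.asList("b"));
        neighbors.put("b", Arrays.asList("a", "c"));
        neighbors.put("c", Arrays.asList("b"));
        return run(ringIds, neighbors, 100);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a=a, b=a, c=a, d=d}
    }
}
```

Because `sum` is associative and commutative, the gathered values can be reduced in any order and in partial groups, which is what allows the reduction to be distributed.
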
  22. Gelly Iteration Abstractions

    Vertex-Centric
    › Most generic model (computation & messaging)
    Scatter-Gather & Gather-Sum-Apply
    › Separated iteration phases
    › Maintainability
    › Performance
    › No concurrent access to inbox and outbox
    › Scatter (Scatter-Gather): parallelized over vertices
    › Gather (Gather-Sum-Apply): parallelized over edges
  23. Appendix: Iteration Abstractions

    Iteration Model  | Update Function           | Update Logic               | Communication Scope | Communication Logic
    Vertex-Centric   | arbitrary                 | arbitrary                  | any vertex          | arbitrary
    Scatter-Gather   | arbitrary                 | based on received messages | any vertex          | based on vertex state
    Gather-Sum-Apply | associative & commutative | based on reduced message   | neighborhood        | based on vertex state

    Following: https://ci.apache.org/projects/flink/flink-docs-release-1.2/
  24. Gelly Library Methods

    › Community Detection
    › Label Propagation
    › PageRank
    › Single Source Shortest Path
    › Triangle Listing
    › ...
    › Connected Components

    // create graph
    Graph graph = Graph.fromTupleDataSet(vertices, edges, env);
    // run iterations (100 is the maximum number of iterations)
    DataSet<Vertex> result = graph.run(new GSAConnectedComponents(100));
  25. Wrap Up

    › Data (often) have high and complex connectivity
    › Relationships are (at least) as important as the single records
    › Graph Processing
      › Intuitiveness
      › Speed
      › Scalability
    › Processing vs. Storing
    › Alternatives
      › Giraph
      › Spark GraphX