Slide 1

Slide 1 text

PROCESSING LARGE-SCALE GRAPHS WITH GOOGLE™ PREGEL
MICHAEL HACKSTEIN
FRONT END AND GRAPH SPECIALIST, ARANGODB

Slide 2

Slide 2 text

Processing large-scale graphs with Google™ Pregel
Michael Hackstein, @mchacki
November 17th
www.arangodb.com

Slide 3

Slide 3 text

Michael Hackstein
- ArangoDB Core Team
  - Web Frontend
  - Graph visualisation
  - Graph features
- Host of cologne.js
- Master’s Degree (spec. Databases and Information Systems)

Slide 6

Slide 6 text

Graph Algorithms
- Pattern matching
  - Search through the entire graph
  - Identify similar components
  ⇒ Touch all vertices and their neighbourhoods
- Traversals
  - Define a specific start point
  - Iteratively explore the graph
  ⇒ History of steps is known
- Global measurements
  - Compute one value for the graph, based on all its vertices or edges
  - Compute one value for each vertex or edge
  ⇒ Often require a global view of the graph

Slide 7

Slide 7 text

Pregel
- A framework to query distributed, directed graphs
- Known as “Map-Reduce” for graphs
  - Uses the same phases
  - Has several iterations
- Aims at:
  - Operating all servers at full capacity
  - Reducing network traffic
- Good at calculations touching all vertices
- Bad at calculations touching a very small number of vertices

Slide 8

Slide 8 text

Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7

Slide 9

Slide 9 text

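Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7
Messages: 2, 3, 4, 4, 5, 6, 7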

Slide 11

Slide 11 text

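Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5, 7: 6
Messages: 1, 2, 2, 3, 5, 5, 6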

Slide 13

Slide 13 text

Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 1, 3: 2, 4: 2, 5: 5, 6: 5, 7: 5
Messages: 1, 1, 2, 2

Slide 15

Slide 15 text

Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 5
Messages: 1, 1

Slide 17

Slide 17 text

Example – Connected Components
[Figure: graph of seven vertices; legend: active, inactive, forward message, backward message]
Vertex values (vertex: value): 1: 1, 2: 1, 3: 1, 4: 1, 5: 5, 6: 5, 7: 5
Messages: none
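
The example above can be read as minimum-label propagation: every vertex starts with its own id as value, sends it along its outbound edges (forward messages), and a receiver that already holds a smaller value answers the sender (backward message). A minimal sketch of that idea in plain JavaScript follows. It is not the ArangoDB implementation; the function name `connectedComponents` and the edge list are assumptions, since the edges of the figure cannot be recovered from the slide text.

```javascript
// Sketch of min-label propagation in a Pregel-like, round-based style.
// vertexIds: array of numeric ids; edges: array of [from, to] pairs (hypothetical input format).
function connectedComponents(vertexIds, edges) {
  const value = new Map(vertexIds.map((id) => [id, id]));
  // At first a vertex only knows its outbound neighbours.
  const neighbours = new Map(vertexIds.map((id) => [id, new Set()]));
  for (const [from, to] of edges) neighbours.get(from).add(to);

  // Superstep 0: every vertex sends its own value along its outbound edges.
  let messages = edges.map(([from, to]) => ({ from, to, value: from }));

  while (messages.length > 0) {
    // Deliver the messages of the previous round, grouped by receiver.
    const inbox = new Map();
    for (const m of messages) {
      if (!inbox.has(m.to)) inbox.set(m.to, []);
      inbox.get(m.to).push(m);
    }
    messages = [];
    for (const [id, received] of inbox) {
      const old = value.get(id);
      const min = Math.min(old, ...received.map((m) => m.value));
      // Remember the senders so that backward messages become possible.
      for (const m of received) neighbours.get(id).add(m.from);
      if (min < old) {
        // The value improved: tell every neighbour that does not know it yet.
        value.set(id, min);
        const alreadyKnow = new Set(
          received.filter((m) => m.value === min).map((m) => m.from)
        );
        for (const n of neighbours.get(id)) {
          if (!alreadyKnow.has(n)) messages.push({ from: id, to: n, value: min });
        }
      } else {
        // Nothing improved: answer every sender that holds a larger value.
        for (const m of received) {
          if (m.value > old) messages.push({ from: id, to: m.from, value: old });
        }
      }
    }
  }
  return value;
}

// Assumed edge list for a seven-vertex graph with two components.
const labels = connectedComponents(
  [1, 2, 3, 4, 5, 6, 7],
  [[2, 1], [3, 2], [4, 3], [4, 2], [5, 6], [6, 7], [7, 5]]
);
console.log([...labels.entries()]);
```

With these assumed edges, vertices 1 to 4 end up with value 1 and vertices 5 to 7 with value 5, which agrees with the final state shown in the example.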

Slide 18

Slide 18 text

Pregel – Sequence
[Figure not preserved in this transcript]

Slide 23

Slide 23 text

Worker ≙ Map
- “Map” a user-defined algorithm over all vertices
- Output: a set of messages to other vertices
- Available parameters:
  - The current vertex and its outbound edges
  - All incoming messages
  - Global values
- Allowed modifications on the vertex:
  - Attach a result to this vertex and its outgoing edges
  - Delete the vertex and its outgoing edges
  - Deactivate the vertex
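
To make the worker's inputs and outputs concrete, here is a small sketch of a worker that computes the in-degree of a vertex in two rounds. Everything about its shape is an assumption chosen for illustration: the plain-object vertex with `outEdges`, `result` and `active` fields, the `global.step` counter, and the convention of returning the outgoing messages as an array.

```javascript
// Hypothetical worker: determines how many edges point at each vertex.
// vertex:   { id, outEdges: [target ids], result, active }  (assumed shape)
// messages: array of incoming message payloads for this vertex
// global:   { step }  (values shared by all vertices)
function inDegreeWorker(vertex, messages, global) {
  const outgoing = [];
  if (global.step === 0) {
    // First round: notify every outbound neighbour that an edge points at it.
    for (const target of vertex.outEdges) {
      outgoing.push({ to: target, data: 1 });
    }
  } else {
    // Later rounds: add up the notifications and attach the result.
    vertex.result = messages.reduce((sum, data) => sum + data, 0);
    // Deactivate the vertex, it has nothing left to do.
    vertex.active = false;
  }
  return outgoing;
}

// One vertex, called once per round with hand-made inputs.
const vertexB = { id: "b", outEdges: ["c"], result: null, active: true };
const firstRound = inDegreeWorker(vertexB, [], { step: 0 });
const secondRound = inDegreeWorker(vertexB, [1, 1], { step: 1 });
console.log(firstRound, secondRound, vertexB);
```

The worker never touches other vertices directly; all it can do is change its own vertex and emit messages, which is what lets the framework run it on every server in parallel.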

Slide 24

Slide 24 text

Combine ≙ Reduce
- “Reduce” all generated messages
- Output: an aggregated message for each vertex
- Executed on the sender as well as the receiver
- Available parameters:
  - One new message for a vertex
  - The stored aggregate for this vertex
- Typical combiners are SUM, MIN or MAX
- Reduces network traffic
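
A combiner is just a two-argument function that merges a new message into the stored aggregate. The sketch below shows the three typical combiners and how folding messages per target vertex shrinks the number of messages that have to cross the network. The helper `combineMessages` and the `{ to, data }` message shape are assumptions for illustration.

```javascript
// The three typical combiners, each taking (newMessage, storedAggregate).
const combiners = {
  SUM: (message, aggregate) => message + aggregate,
  MIN: (message, aggregate) => Math.min(message, aggregate),
  MAX: (message, aggregate) => Math.max(message, aggregate),
};

// Folds a list of { to, data } messages into one aggregate per target vertex.
function combineMessages(messages, combiner) {
  const aggregated = new Map();
  for (const { to, data } of messages) {
    aggregated.set(to, aggregated.has(to) ? combiner(data, aggregated.get(to)) : data);
  }
  return aggregated;
}

// Five messages for two target vertices collapse into two aggregates.
const outgoing = [
  { to: "a", data: 4 },
  { to: "b", data: 1 },
  { to: "a", data: 7 },
  { to: "a", data: 2 },
  { to: "b", data: 5 },
];
const summed = combineMessages(outgoing, combiners.SUM);
const smallest = combineMessages(outgoing, combiners.MIN);
console.log([...summed.entries()], [...smallest.entries()]);
```

Because SUM, MIN and MAX are associative and commutative, the same function can run on the sender before shipping and again on the receiver when aggregates from several servers arrive.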

Slide 25

Slide 25 text

Activity ≙ Termination
- Execute several rounds of Map/Reduce
- Count active vertices and messages
- Start the next round if one of the following is true:
  - At least one vertex is active
  - At least one message was sent
- Terminate if no vertex is active and no messages were sent
- Store all non-deleted vertices and edges as the resulting graph
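
The termination rule can be sketched as a small single-process driver loop. This is only an illustration, not ArangoDB's runner: `runSupersteps`, the vertex shape and the `maxSteps` safety limit are assumptions, and the rule that an incoming message reactivates its receiver is taken from the original Pregel model rather than from the slide.

```javascript
// Runs worker rounds until no vertex is active and no message was sent.
// worker(vertex, messages, global) returns an array of { to, data } (assumed convention).
function runSupersteps(vertices, worker, combiner, maxSteps) {
  let inbox = new Map(); // vertex id -> aggregated message for this round
  let step = 0;
  while (step < maxSteps) {
    const outbox = new Map();
    for (const vertex of vertices) {
      const incoming = inbox.has(vertex.id) ? [inbox.get(vertex.id)] : [];
      // Assumption from the Pregel model: a message reactivates its receiver.
      if (incoming.length > 0) vertex.active = true;
      if (!vertex.active) continue;
      for (const { to, data } of worker(vertex, incoming, { step })) {
        // Combine on the fly so that at most one message per target is kept.
        outbox.set(to, outbox.has(to) ? combiner(data, outbox.get(to)) : data);
      }
    }
    step += 1;
    const activeCount = vertices.filter((v) => v.active).length;
    if (activeCount === 0 && outbox.size === 0) break; // termination condition
    inbox = outbox;
  }
  return step;
}

// Hypothetical worker: spreads the largest value through the graph.
function maxWorker(vertex, messages, global) {
  const best = Math.max(vertex.value, ...messages);
  const changed = best > vertex.value;
  vertex.value = best;
  vertex.active = false; // sleep until a new message arrives
  if (global.step === 0 || changed) {
    return vertex.outEdges.map((to) => ({ to, data: best }));
  }
  return [];
}

const ring = [
  { id: 1, value: 3, outEdges: [2], active: true },
  { id: 2, value: 6, outEdges: [3], active: true },
  { id: 3, value: 2, outEdges: [1], active: true },
];
const steps = runSupersteps(ring, maxWorker, Math.max, 100);
console.log(steps, ring.map((v) => v.value));
```

In this sketch every vertex deactivates itself after each round, so the run continues only while messages are in flight and stops once the maximum has reached every vertex.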

Slide 26

Slide 26 text

Pregel at ArangoDB
- Started as a side project in free hack time
- Experimental on an operational database
- Implemented as an alternative to traversals
- Makes use of the flexibility of JavaScript:
  - No strict type system
  - No pre-compilation, on-the-fly queries
  - Native JSON documents
  - Really fast development

Slide 27

Slide 27 text

Pagerank for Giraph

    public class SimplePageRankComputation extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
      public static final int MAX_SUPERSTEPS = 30;

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex, Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() >= 1) {
          double sum = 0;
          for (DoubleWritable message : messages) {
            sum += message.get();
          }
          DoubleWritable vertexValue = new DoubleWritable((0.15f / getTotalNumVertices()) + 0.85f * sum);
          vertex.setValue(vertexValue);
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
          long edges = vertex.getNumEdges();
          sendMessageToAllEdges(vertex, new DoubleWritable(vertex.getValue().get() / edges));
        } else {
          vertex.voteToHalt();
        }
      }

      public static class SimplePageRankWorkerContext extends WorkerContext {
        @Override
        public void preApplication() throws InstantiationException, IllegalAccessException { }
        @Override
        public void postApplication() { }
        @Override
        public void preSuperstep() { }
        @Override
        public void postSuperstep() { }
      }

      public static class SimplePageRankMasterCompute extends DefaultMasterCompute {
        @Override
        public void initialize() throws InstantiationException, IllegalAccessException {
        }
      }
      public static class SimplePageRankVertexReader extends GeneratedVertexReader<LongWritable, DoubleWritable, FloatWritable> {
        @Override
        public boolean nextVertex() {
          return totalRecords > recordsRead;
        }

        @Override
        public Vertex<LongWritable, DoubleWritable, FloatWritable> getCurrentVertex() throws IOException {
          Vertex<LongWritable, DoubleWritable, FloatWritable> vertex = getConf().createVertex();
          LongWritable vertexId = new LongWritable(
              (inputSplit.getSplitIndex() * totalRecords) + recordsRead);
          DoubleWritable vertexValue = new DoubleWritable(vertexId.get() * 10d);
          long targetVertexId = (vertexId.get() + 1) % (inputSplit.getNumSplits() * totalRecords);
          float edgeValue = vertexId.get() * 100f;
          List<Edge<LongWritable, FloatWritable>> edges = Lists.newLinkedList();
          edges.add(EdgeFactory.create(new LongWritable(targetVertexId), new FloatWritable(edgeValue)));
          vertex.initialize(vertexId, vertexValue, edges);
          ++recordsRead;
          return vertex;
        }
      }

      public static class SimplePageRankVertexInputFormat extends GeneratedVertexInputFormat<LongWritable, DoubleWritable, FloatWritable> {
        @Override
        public VertexReader<LongWritable, DoubleWritable, FloatWritable> createVertexReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
          return new SimplePageRankVertexReader();
        }
      }

      public static class SimplePageRankVertexOutputFormat extends TextVertexOutputFormat<LongWritable, DoubleWritable, FloatWritable> {
        @Override
        public TextVertexWriter createVertexWriter(TaskAttemptContext context) throws IOException, InterruptedException {
          return new SimplePageRankVertexWriter();
        }

        public class SimplePageRankVertexWriter extends TextVertexWriter {
          @Override
          public void writeVertex(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex) throws IOException, InterruptedException {
            getRecordWriter().write(new Text(vertex.getId().toString()), new Text(vertex.getValue().toString()));
          }
        }
      }
    }

Slide 28

Slide 28 text

Pagerank for TinkerPop3

    public class PageRankVertexProgram implements VertexProgram<Double> {
      private MessageType.Local messageType = MessageType.Local.of(() -> GraphTraversal.of().outE());
      public static final String PAGE_RANK = Graph.Key.hide("gremlin.pageRank");
      public static final String EDGE_COUNT = Graph.Key.hide("gremlin.edgeCount");
      private static final String VERTEX_COUNT = "gremlin.pageRankVertexProgram.vertexCount";
      private static final String ALPHA = "gremlin.pageRankVertexProgram.alpha";
      private static final String TOTAL_ITERATIONS = "gremlin.pageRankVertexProgram.totalIterations";
      private static final String INCIDENT_TRAVERSAL = "gremlin.pageRankVertexProgram.incidentTraversal";
      private double vertexCountAsDouble = 1;
      private double alpha = 0.85d;
      private int totalIterations = 30;
      private static final Set<String> COMPUTE_KEYS = new HashSet<>(Arrays.asList(PAGE_RANK, EDGE_COUNT));

      private PageRankVertexProgram() {}

      @Override
      public void loadState(final Configuration configuration) {
        this.vertexCountAsDouble = configuration.getDouble(VERTEX_COUNT, 1.0d);
        this.alpha = configuration.getDouble(ALPHA, 0.85d);
        this.totalIterations = configuration.getInt(TOTAL_ITERATIONS, 30);
        try {
          if (configuration.containsKey(INCIDENT_TRAVERSAL)) {
            final SSupplier traversalSupplier = VertexProgramHelper.deserialize(configuration, INCIDENT_TRAVERSAL);
            VertexProgramHelper.verifyReversibility(traversalSupplier.get());
            this.messageType = MessageType.Local.of((SSupplier) traversalSupplier);
          }
        } catch (final Exception e) {
          throw new IllegalStateException(e.getMessage(), e);
        }
      }

      @Override
      public void storeState(final Configuration configuration) {
        configuration.setProperty(GraphComputer.VERTEX_PROGRAM, PageRankVertexProgram.class.getName());
        configuration.setProperty(VERTEX_COUNT, this.vertexCountAsDouble);
        configuration.setProperty(ALPHA, this.alpha);
        configuration.setProperty(TOTAL_ITERATIONS, this.totalIterations);
        try {
          VertexProgramHelper.serialize(this.messageType.getIncidentTraversal(), configuration, INCIDENT_TRAVERSAL);
        } catch (final Exception e) {
          throw new IllegalStateException(e.getMessage(), e);
        }
      }

      @Override
      public Set<String> getElementComputeKeys() {
        return COMPUTE_KEYS;
      }

      @Override
      public void setup(final Memory memory) {

      }

      @Override
      public void execute(final Vertex vertex, Messenger<Double> messenger, final Memory memory) {
        if (memory.isInitialIteration()) {
          double initialPageRank = 1.0d / this.vertexCountAsDouble;
          double edgeCount = Double.valueOf((Long) this.messageType.edges(vertex).count().next());
          vertex.singleProperty(PAGE_RANK, initialPageRank);
          vertex.singleProperty(EDGE_COUNT, edgeCount);
          messenger.sendMessage(this.messageType, initialPageRank / edgeCount);
        } else {
          double newPageRank = StreamFactory.stream(messenger.receiveMessages(this.messageType)).reduce(0.0d, (a, b) -> a + b);
          newPageRank = (this.alpha * newPageRank) + ((1.0d - this.alpha) / this.vertexCountAsDouble);
          vertex.singleProperty(PAGE_RANK, newPageRank);
          messenger.sendMessage(this.messageType, newPageRank / vertex.<Double>property(EDGE_COUNT).orElse(0.0d));
        }
      }

      @Override
      public boolean terminate(final Memory memory) {
        return memory.getIteration() >= this.totalIterations;
      }
    }

Slide 29

Slide 29 text

Pagerank for ArangoDB

    var pageRank = function (vertex, message, global) {
      var total, rank, edgeCount, send, edge, alpha, sum;
      total = global.vertexCount;
      edgeCount = vertex._outEdges.length;
      alpha = global.alpha;
      sum = 0;
      if (global.step > 0) {
        // Sum up the rank shares received in this round.
        while (message.hasNext()) {
          sum += message.next().data;
        }
        rank = alpha * sum + (1 - alpha) / total;
      } else {
        // First round: every vertex starts with the same rank.
        rank = 1 / total;
      }
      vertex._setResult(rank);
      if (global.step < global.MAX_STEPS) {
        // Distribute the rank evenly over all outbound edges.
        send = rank / edgeCount;
        while (vertex._outEdges.hasNext()) {
          edge = vertex._outEdges.next();
          message.sendTo(edge._getTarget(), send);
        }
      } else {
        vertex._deactivate();
      }
    };

    // Combiner: add up all rank shares addressed to the same vertex.
    var combiner = function (message, oldMessage) {
      return message + oldMessage;
    };

    var Runner = require("org/arangodb/pregelRunner").Runner;
    var runner = new Runner();
    runner.setWorker(pageRank);
    runner.setCombiner(combiner);
    runner.start("myGraph");
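
The update rule used by the worker, rank = alpha * sum + (1 - alpha) / total, can also be tried without a database. The following standalone sketch applies it to a tiny graph held in a plain object. The function name `pageRankSketch`, the three-vertex sample graph and the adjacency-object format are assumptions; alpha = 0.85 and 30 rounds are the values that appear in the Giraph and TinkerPop3 slides.

```javascript
// outEdges: object mapping each vertex id to the ids it points at (assumed format).
function pageRankSketch(outEdges, alpha, maxSteps) {
  const ids = Object.keys(outEdges);
  const total = ids.length;
  const rank = {};
  // Round 0: every vertex starts with the same rank.
  for (const id of ids) rank[id] = 1 / total;

  for (let step = 1; step <= maxSteps; step += 1) {
    // "Messages": each vertex sends rank / edgeCount along every outbound edge,
    // and the SUM combiner adds them up per target.
    const sum = {};
    for (const id of ids) sum[id] = 0;
    for (const id of ids) {
      const share = rank[id] / outEdges[id].length;
      for (const target of outEdges[id]) sum[target] += share;
    }
    // Same update rule as in the worker.
    for (const id of ids) rank[id] = alpha * sum[id] + (1 - alpha) / total;
  }
  return rank;
}

// Sample graph: a -> b, a -> c, b -> c, c -> a.
const ranks = pageRankSketch({ a: ["b", "c"], b: ["c"], c: ["a"] }, 0.85, 30);
console.log(ranks);
```

Because every vertex in the sample graph has at least one outbound edge, the ranks keep summing to 1 in each round; vertex c, which receives shares from both a and b, ends up ranked above b.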

Slide 30

Slide 30 text

Thank you
- Further questions?
  - Follow me on Twitter/GitHub: @mchacki
  - Write me a mail: [email protected]
- Follow @arangodb on Twitter
- Join our Google group: https://groups.google.com/forum/#!forum/arangodb
- Visit our blog: https://www.arangodb.com/blog
- Slides available at https://www.slideshare.net/arangodb

Slide 31

Slide 31 text

17TH ~ 18th NOV 2014 MADRID (SPAIN)