Slide 1

Slide 1 text

Large-scale graph processing with Apache Giraph
André Kelpe, @fs111, http://kel.pe

Slide 2

Slide 2 text

graphs 101

Slide 3

Slide 3 text

vertices and edges

Slide 4

Slide 4 text

simple graph (figure: vertices labeled v1 to v10, connected by edges)
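
A minimal sketch of how a graph of vertices and directed edges could be held in memory as an adjacency list. The class name `SimpleGraph` and the three example edges are made up for illustration; they are not read off the figure on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical adjacency-list graph; all names are illustrative only. */
class SimpleGraph {
    // vertex id -> ids of the vertices its outgoing edges point to
    private final Map<String, List<String>> outEdges =
            new LinkedHashMap<String, List<String>>();

    /** Registers a vertex if it is not known yet. */
    void addVertex(String id) {
        if (!outEdges.containsKey(id)) {
            outEdges.put(id, new ArrayList<String>());
        }
    }

    /** Adds a directed edge, creating both endpoints if needed. */
    void addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        outEdges.get(from).add(to);
    }

    /** Targets of the outgoing edges of a vertex (empty for unknown ids). */
    List<String> neighbors(String id) {
        List<String> targets = outEdges.get(id);
        return targets == null ? Collections.<String>emptyList() : targets;
    }

    int vertexCount() {
        return outEdges.size();
    }

    int edgeCount() {
        int n = 0;
        for (List<String> targets : outEdges.values()) {
            n += targets.size();
        }
        return n;
    }

    public static void main(String[] args) {
        SimpleGraph g = new SimpleGraph();
        g.addEdge("v1", "v2");
        g.addEdge("v1", "v3");
        g.addEdge("v2", "v3");
        System.out.println(g.vertexCount() + " vertices, " + g.edgeCount() + " edges");
    }
}
```

An adjacency list keeps each vertex together with its outgoing edges, which is also the natural unit for vertex-centric processing.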

Slide 5

Slide 5 text

graphs are everywhere: road networks, the WWW, social graphs, etc.

Slide 6

Slide 6 text

graphs can be huge

Slide 7

Slide 7 text

Google knows!

Slide 8

Slide 8 text

Pregel

Slide 9

Slide 9 text

Pregel, by Google: describes a graph processing approach based on BSP (Bulk Synchronous Parallel)

Slide 10

Slide 10 text

pro-tip: search for "pregel_paper.pdf" on github ;-)

Slide 11

Slide 11 text

Properties of Pregel: batch-oriented, scalable, fault-tolerant processing of graphs

Slide 12

Slide 12 text

It is not a graph database. It is a processing framework.

Slide 13

Slide 13 text

BSP: vertex-centric processing in so-called supersteps

Slide 14

Slide 14 text

BSP: vertices send messages to each other

Slide 15

Slide 15 text

BSP: synchronization points between supersteps

Slide 16

Slide 16 text

execution of superstep S: each vertex processes the messages generated in S-1, sends messages to be processed in S+1, and decides whether to vote to halt.
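
The superstep rule can be sketched as a single-threaded toy loop. This is not Giraph or Pregel code: the class `BspMaxValueSketch`, the array-based graph encoding, and the choice of "propagate the largest value" as the vertex program are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Single-JVM toy simulation of BSP supersteps; hypothetical, not Giraph code. */
class BspMaxValueSketch {

    /**
     * Propagates the largest vertex value through the graph.
     * values[v] is the value of vertex v, outEdges[v] lists its target vertices.
     * Returns the number of supersteps in which at least one vertex ran.
     */
    static int run(int[] values, int[][] outEdges) {
        int n = values.length;
        List<List<Integer>> inbox = emptyBoxes(n);      // messages generated in S-1
        boolean[] halted = new boolean[n];
        int superstep = 0;

        while (true) {
            List<List<Integer>> outbox = emptyBoxes(n); // to be processed in S+1
            boolean anyActive = false;

            for (int v = 0; v < n; v++) {
                List<Integer> msgs = inbox.get(v);
                if (halted[v] && msgs.isEmpty()) {
                    continue;                 // halted and no message to wake it up
                }
                anyActive = true;

                // In superstep 0 every vertex announces its own value.
                boolean changed = (superstep == 0);
                for (int m : msgs) {
                    if (m > values[v]) {
                        values[v] = m;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int target : outEdges[v]) {
                        outbox.get(target).add(values[v]);
                    }
                }
                halted[v] = true;             // vote to halt
            }

            if (!anyActive) {
                return superstep;             // all halted, no messages in flight
            }
            inbox = outbox;                   // synchronization point between supersteps
            superstep++;
        }
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<List<Integer>>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<Integer>());
        }
        return boxes;
    }
}
```

Swapping `inbox` and `outbox` only after every vertex has run plays the role of the barrier: a message sent in superstep S can never be seen before S+1. A real system distributes the vertices over many workers and has to coordinate that barrier across machines.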

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

apache giraph

Slide 19

Slide 19 text

giraph: a loose implementation of the Pregel ideas on top of Hadoop M/R, originally from Yahoo

Slide 20

Slide 20 text

apache giraph http://incubator.apache.org/giraph/

Slide 21

Slide 21 text

giraph: avoids the overhead of the classic M/R process but reuses the existing infrastructure

Slide 22

Slide 22 text

giraph: simple map jobs in a master/worker setup; coordination via ZooKeeper; messaging via its own RPC protocol; in-memory processing; custom input and output formats.

Slide 23

Slide 23 text

current status: version 0.1 released; compatible with a multitude of Hadoop versions (we use CDH3 at work); still lots of things to do, join the fun!

Slide 24

Slide 24 text

the APIs

Slide 25

Slide 25 text

Vertex-API

    /**
     * @param <I> vertex id
     * @param <V> vertex data
     * @param <E> edge data
     * @param <M> message data
     */
    class BasicVertex<I, V, E, M>

        void compute(Iterator<M> msgIterator);
        void sendMsg(I id, M msg);
        void voteToHalt();

Slide 26

Slide 26 text

Shortest path example
https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example

Slide 27

Slide 27 text

simple graph (figure: vertices labeled v1 to v10, connected by edges)

Slide 28

Slide 28 text

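    // Generic type parameters were lost in extraction; Iterator<DoubleWritable>
    // follows from the message type, FloatWritable edge values are assumed.

    /** True if this vertex is the configured source of the search. */
    private boolean isSource() {
        return (getVertexId().get() ==
                getContext().getConfiguration().getLong(SOURCE_ID, SOURCE_ID_DEFAULT));
    }

    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
        // Superstep 0: every vertex starts out "infinitely" far away.
        if (getSuperstep() == 0) {
            setVertexValue(new DoubleWritable(Double.MAX_VALUE));
        }

        // Smallest distance offered by any incoming message (0 for the source).
        double minDist = isSource() ? 0d : Double.MAX_VALUE;
        while (msgIterator.hasNext()) {
            minDist = Math.min(minDist, msgIterator.next().get());
        }

        // Found a shorter path: remember it and tell all neighbours.
        if (minDist < getVertexValue().get()) {
            setVertexValue(new DoubleWritable(minDist));
            for (Edge<LongWritable, FloatWritable> edge : getOutEdgeMap().values()) {
                sendMsg(edge.getDestVertexId(),
                        new DoubleWritable(minDist + edge.getEdgeValue().get()));
            }
        }

        // Sleep until a new message arrives.
        voteToHalt();
    }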
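
To see how this compute logic converges, here is a plain-Java walk-through that runs the same per-vertex steps in a single-threaded superstep loop, without Giraph or Hadoop. The class `ShortestPathsSketch`, its nested `Edge` type, and the integer vertex ids are hypothetical stand-ins, not part of the Giraph API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy re-enactment of the shortest-paths vertex program; hypothetical, not Giraph code. */
class ShortestPathsSketch {

    /** Hypothetical weighted, directed edge between integer vertex ids. */
    static final class Edge {
        final int from;
        final int dest;
        final double weight;

        Edge(int from, int dest, double weight) {
            this.from = from;
            this.dest = dest;
            this.weight = weight;
        }
    }

    /** Returns the distance from source to every vertex 0..n-1 (MAX_VALUE if unreachable). */
    static double[] run(int n, Edge[] edges, int source) {
        List<List<Edge>> out = new ArrayList<List<Edge>>();
        for (int v = 0; v < n; v++) {
            out.add(new ArrayList<Edge>());
        }
        for (Edge e : edges) {
            out.get(e.from).add(e);
        }

        double[] dist = new double[n];
        Arrays.fill(dist, Double.MAX_VALUE);       // superstep 0 initialisation
        List<List<Double>> inbox = emptyBoxes(n);
        boolean firstSuperstep = true;

        while (firstSuperstep || hasMessages(inbox)) {
            List<List<Double>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {
                if (!firstSuperstep && inbox.get(v).isEmpty()) {
                    continue;                      // halted vertex, nothing woke it up
                }
                double minDist = (v == source) ? 0d : Double.MAX_VALUE;
                for (double m : inbox.get(v)) {
                    minDist = Math.min(minDist, m);
                }
                if (minDist < dist[v]) {
                    dist[v] = minDist;
                    for (Edge e : out.get(v)) {
                        outbox.get(e.dest).add(minDist + e.weight);
                    }
                }
                // Implicit vote to halt: the vertex only runs again on new messages.
            }
            inbox = outbox;                        // barrier between supersteps
            firstSuperstep = false;
        }
        return dist;
    }

    private static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<List<Double>>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<Double>());
        }
        return boxes;
    }

    private static boolean hasMessages(List<List<Double>> boxes) {
        for (List<Double> box : boxes) {
            if (!box.isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

The loop ends once a superstep produces no messages, which corresponds to all vertices having voted to halt with nothing left in flight.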

Slide 29

Slide 29 text

GiraphJob

    // Configure the job: vertex class plus input and output formats.
    GiraphJob job = new GiraphJob(getConf(), getClass().getName());
    job.setVertexClass(SimpleShortestPathsVertex.class);
    job.setVertexInputFormatClass(SimpleShortestPathsVertexInputFormat.class);
    job.setVertexOutputFormatClass(SimpleShortestPathsVertexOutputFormat.class);

    // Input and output locations on the Hadoop file system.
    FileInputFormat.addInputPath(job, new Path("/foo/bar/baz"));
    FileOutputFormat.setOutputPath(job, new Path("/foo/bar/quux"));

    // Id of the source vertex, taken from the command line.
    job.getConfiguration().setLong(SimpleShortestPathsVertex.SOURCE_ID,
            Long.parseLong(argArray[2]));

    // Minimum and maximum number of workers, and the percentage that must respond.
    job.setWorkerConfiguration(minWorkers, maxWorkers, 100.0f);

Slide 30

Slide 30 text

see also
http://incubator.apache.org/giraph/
https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example
http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

Slide 31

Slide 31 text

Thanks! Questions?