Large-scale graph processing with Apache Giraph

André Kelpe

May 23, 2012

Transcript

  1. Large scale graph
    processing with
    apache giraph
    André Kelpe
    @fs111
    http://kel.pe

  2. graphs 101

  3. vertices and edges

  4. simple graph
    (figure: a graph whose vertices are labeled v1 to v10)

  5. graphs are everywhere
    road networks, the WWW, social
    graphs, etc.

  6. graphs can be huge

  7. google knows!

  8. Pregel

  9. Pregel by Google
    Describes a graph processing
    approach based on BSP
    (Bulk Synchronous Parallel)

  10. pro-tip: search for
    "pregel_paper.pdf"
    on GitHub ;-)

  11. Properties of Pregel
    batch-oriented, scalable,
    fault tolerant processing of
    graphs

  12. It is not a graph database
    It is a processing framework

  13. BSP
    vertex centric processing
    in so-called supersteps

  14. BSP
    vertices send messages to
    each other

  15. BSP
    synchronization points
    between supersteps

  16. execution of superstep S
    Each vertex processes the messages
    generated in S-1, sends messages to be
    processed in S+1, and decides whether
    to vote to halt.
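
The superstep cycle described on the BSP slides can be traced in a single process. The following is a minimal sketch, not Giraph or Pregel code: the class name, the method names and the max-value propagation task are assumptions chosen for illustration. Every vertex votes to halt after each superstep and is only woken up again by an incoming message; the run ends once no vertex is active.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Single-process sketch of BSP supersteps. Hypothetical names, not Giraph code. */
class BspMaxValueSketch {

    /**
     * Spreads the largest vertex value along the edges of the graph.
     * neighbors[v] lists the vertices that v sends messages to.
     * Updates values in place and returns the number of supersteps executed.
     */
    static int run(int[] values, int[][] neighbors) {
        int n = values.length;
        // inbox.get(v): messages sent to v during the previous superstep (S-1)
        List<List<Integer>> inbox = emptyBoxes(n);
        int superstep = 0;
        while (true) {
            // outbox.get(v): messages v will see in the next superstep (S+1)
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                List<Integer> messages = inbox.get(v);
                // Every vertex halts after computing, so it only runs in
                // superstep 0 or when a message wakes it up again.
                if (superstep > 0 && messages.isEmpty()) {
                    continue;
                }
                anyActive = true;
                boolean changed = (superstep == 0);
                for (int m : messages) {
                    if (m > values[v]) {
                        values[v] = m;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int target : neighbors[v]) {
                        outbox.get(target).add(values[v]);
                    }
                }
            }
            if (!anyActive) {
                return superstep; // all vertices halted, no messages in flight
            }
            inbox = outbox; // synchronization point between supersteps
            superstep++;
        }
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    public static void main(String[] args) {
        int[] values = {3, 6, 2, 1};
        int[][] neighbors = {{1}, {0, 2}, {1, 3}, {2}}; // a chain 0 - 1 - 2 - 3
        int supersteps = run(values, neighbors);
        System.out.println(Arrays.toString(values) + " after " + supersteps + " supersteps");
    }
}
```

Swapping the inbox for the outbox at the end of the loop plays the role of the synchronization point: no vertex sees a message before the superstep after the one in which it was sent.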

  18. apache
    giraph

  19. giraph
    A loose implementation of the
    Pregel ideas on top of
    Hadoop M/R, originally from
    Yahoo

  20. apache giraph
    http://incubator.apache.org/giraph/

  21. giraph
    avoid overhead of classic M/R
    process but reuse existing
    infrastructure

  22. giraph
    simple map jobs in a master/worker setup.
    coordination via ZooKeeper.
    messaging via its own RPC protocol.
    in-memory processing.
    custom input and output formats.

  23. current status
    version 0.1 released
    compatible with a multitude of hadoop
    versions (we use CDH3 at work)
    still lots of things to do, join the fun!

  24. the APIs

  25. Vertex-API
    /**
     * @param <I> vertex id
     * @param <V> vertex data
     * @param <E> edge data
     * @param <M> message data
     */
    class BasicVertex<I extends WritableComparable,
                      V extends Writable,
                      E extends Writable,
                      M extends Writable> {
      // abridged: only the signatures are shown
      void compute(Iterator<M> msgIterator);
      void sendMsg(I id, M msg);
      void voteToHalt();
    }

  26. Shortest path example
    https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example

  27. simple graph
    (figure: a graph whose vertices are labeled v1 to v10)

  28. private boolean isSource() {
      return (getVertexId().get() ==
          getContext().getConfiguration().getLong(SOURCE_ID,
              SOURCE_ID_DEFAULT));
    }

    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
      // superstep 0: every vertex starts at "infinite" distance
      if (getSuperstep() == 0) {
        setVertexValue(new DoubleWritable(Double.MAX_VALUE));
      }
      // smallest distance offered by any incoming message
      double minDist = isSource() ? 0d : Double.MAX_VALUE;
      while (msgIterator.hasNext()) {
        minDist = Math.min(minDist, msgIterator.next().get());
      }
      // on improvement, store it and offer new distances to the neighbours
      if (minDist < getVertexValue().get()) {
        setVertexValue(new DoubleWritable(minDist));
        for (Edge<LongWritable, FloatWritable> edge :
            getOutEdgeMap().values()) {
          sendMsg(edge.getDestVertexId(),
              new DoubleWritable(minDist +
                  edge.getEdgeValue().get()));
        }
      }
      // sleep until a new message arrives
      voteToHalt();
    }
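
To follow the compute logic above without a Hadoop cluster, here is a single-process sketch of the same idea in plain Java. It is an illustration under assumptions, not Giraph's API: the class `ShortestPathsSketch`, its `Edge` stand-in, the `sampleGraph()` helper and the edge weights are all hypothetical. Only the per-vertex logic (take the minimum of the incoming messages, update on improvement, notify neighbours, halt) follows the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Single-process sketch of vertex-centric shortest paths. Not Giraph code. */
class ShortestPathsSketch {

    /** Directed, weighted edge; a hypothetical stand-in for Giraph's Edge type. */
    static final class Edge {
        final int target;
        final double weight;

        Edge(int target, double weight) {
            this.target = target;
            this.weight = weight;
        }
    }

    /** Returns the distance from source to every vertex (Double.MAX_VALUE if unreachable). */
    static double[] run(List<List<Edge>> outEdges, int source) {
        int n = outEdges.size();
        double[] dist = new double[n];
        Arrays.fill(dist, Double.MAX_VALUE); // the superstep 0 initialisation
        List<List<Double>> inbox = emptyBoxes(n);
        for (int superstep = 0; ; superstep++) {
            List<List<Double>> outbox = emptyBoxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                List<Double> messages = inbox.get(v);
                // A halted vertex only runs again when it has received a message.
                if (superstep > 0 && messages.isEmpty()) {
                    continue;
                }
                anyActive = true;
                double minDist = (v == source) ? 0d : Double.MAX_VALUE;
                for (double m : messages) {
                    minDist = Math.min(minDist, m);
                }
                if (minDist < dist[v]) {
                    dist[v] = minDist;
                    for (Edge e : outEdges.get(v)) {
                        outbox.get(e.target).add(minDist + e.weight);
                    }
                }
                // Falling through to the next vertex corresponds to voteToHalt().
            }
            if (!anyActive) {
                return dist;
            }
            inbox = outbox; // barrier between supersteps
        }
    }

    private static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    /** Hypothetical sample graph: 0->1 (1.0), 0->2 (4.0), 1->2 (2.0), 2->3 (1.0). */
    static List<List<Edge>> sampleGraph() {
        List<List<Edge>> g = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            g.add(new ArrayList<>());
        }
        g.get(0).add(new Edge(1, 1.0));
        g.get(0).add(new Edge(2, 4.0));
        g.get(1).add(new Edge(2, 2.0));
        g.get(2).add(new Edge(3, 1.0));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(sampleGraph(), 0)));
    }
}
```

In the sample graph, vertex 2 first learns the distance 4.0 over the direct edge and improves it to 3.0 one superstep later via vertex 1, which is why a vertex has to be reactivated by later messages after it has voted to halt.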

  29. GiraphJob
    GiraphJob job = new GiraphJob(getConf(), getClass().getName());
    job.setVertexClass(SimpleShortestPathsVertex.class);
    job.setVertexInputFormatClass(SimpleShortestPathsVertexInputFormat.class);
    job.setVertexOutputFormatClass(
        SimpleShortestPathsVertexOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/foo/bar/baz"));
    FileOutputFormat.setOutputPath(job, new Path("/foo/bar/quux"));
    // id of the vertex the distances are measured from
    job.getConfiguration().setLong(SimpleShortestPathsVertex.SOURCE_ID,
        Long.parseLong(argArray[2]));
    job.setWorkerConfiguration(minWorkers, maxWorkers, 100.0f);

  30. see also
    http://incubator.apache.org/giraph/
    https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example
    http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

  31. Thanks!
    Questions?
