Taming Graph Dynamics at Scale

Talk by Felix Cuadrado (Queen Mary University of London) at the @ds_ldn Data Science London meetup

Data Science London

February 04, 2015

Transcript

  1. Taming Graph Dynamics at Scale

    Felix Cuadrado (@felixcuadrado), Queen Mary University of London
    Joint work with Luis Vaquero, Dyonisios Logothetis, Claudio Martella
    Data Science London Meetup, 10th December 2014
  2. Pregel’s node-centric processing model

    Superstep cycle (diagram): BSP sync barrier; compute and send messages; receive messages; BSP sync barrier.
    Systems timeline (diagram): GraphX [JPDC 94] [SIGMOD 10] [VLDB 12] [OSDI 14]
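The superstep cycle on this slide can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Pregel or Giraph code: the class name `BspSketch`, the method `propagateMax`, and the choice of max-value propagation as the vertex program are all hypothetical, picked to keep the example short. The sync barrier is modelled by swapping the outbox into the inbox at the end of each superstep.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-process sketch of a vertex-centric BSP loop.
class BspSketch {
    /** Propagates the maximum value through the graph; stops when no messages are sent. */
    static long[] propagateMax(long[] initial, int[][] neighbours) {
        int n = initial.length;
        long[] value = initial.clone();
        List<List<Long>> inbox = new ArrayList<>();
        for (int v = 0; v < n; v++) inbox.add(new ArrayList<>());

        boolean anyMessage = true;
        for (int superstep = 0; anyMessage; superstep++) {
            List<List<Long>> outbox = new ArrayList<>();
            for (int v = 0; v < n; v++) outbox.add(new ArrayList<>());
            anyMessage = false;

            for (int v = 0; v < n; v++) {
                // Compute: fold the messages received in the previous superstep.
                boolean changed = superstep == 0;
                for (long m : inbox.get(v)) {
                    if (m > value[v]) {
                        value[v] = m;
                        changed = true;
                    }
                }
                // Send: only vertices whose value changed message their neighbours.
                if (changed) {
                    for (int u : neighbours[v]) {
                        outbox.get(u).add(value[v]);
                        anyMessage = true;
                    }
                }
            }
            // Sync barrier: messages become visible only in the next superstep.
            inbox = outbox;
        }
        return value;
    }
}
```

On a three-vertex path with initial values 3, 6 and 2, every vertex ends up holding 6 after a few supersteps. A real system distributes the inner vertex loop across workers; the barrier is what lets each worker assume all messages of the previous superstep have arrived.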
  3. Example: PageRank in Apache Giraph

    public void compute(
        Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
        Iterable<DoubleWritable> messages) throws IOException {
      // From the second superstep on, fold incoming rank contributions.
      if (getSuperstep() >= 1) {
        double sum = 0;
        for (DoubleWritable message : messages) {
          sum += message.get();
        }
        DoubleWritable vertexValue =
            new DoubleWritable((0.15f / getTotalNumVertices()) + 0.85f * sum);
        vertex.setValue(vertexValue);
      }
      // Share the current rank equally among outgoing edges, or stop.
      if (getSuperstep() < MAX_SUPERSTEPS) {
        long edges = vertex.getNumEdges();
        sendMessageToAllEdges(vertex,
            new DoubleWritable(vertex.getValue().get() / edges));
      } else {
        vertex.voteToHalt();
      }
    }
  4. Real-world graphs are dynamic

    • Slowly growing
      • Long-life structures
      • Friendship graphs
      • The Web graph
      • Road network
    • Rapidly evolving
      • Streams of events/messages
      • Calls, interactions, mobile proximity
      • Relationships decay/become stale over time
  5. Dynamic graph processing model

    Diagram labels: BSP sync barrier; update graph; BSP sync barrier; compute and send messages; receive messages.
  6. Partitioning dynamic graphs

    Strategies compared: hash partitioning; deterministic greedy stream partitioning [KDD12].
    [Plot: ratio of cuts (0.4 to 0.8) against time elapsed in days (0 to 30).]
    Dataset: interaction graph from 1 month of mobile calls data (CDR); inactive links expire in 1 week.
    Partition quality: ratio of edges cut between partitions.
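The partition-quality metric on this slide, the ratio of edges cut between partitions, can be computed for hash partitioning as sketched below. The class and method names (`CutRatioSketch`, `hashPartition`, `cutRatio`) are hypothetical, and the modulo-of-hash assignment is an assumption about what "hash partitioning" means here; the deck does not show the actual implementation.

```java
// Hypothetical sketch: hash partitioning and the ratio-of-cuts quality metric.
class CutRatioSketch {
    /** Hash partitioning: a vertex's partition depends only on its id. */
    static int hashPartition(long vertexId, int numPartitions) {
        return Math.floorMod(Long.hashCode(vertexId), numPartitions);
    }

    /** Fraction of edges whose two endpoints land in different partitions. */
    static double cutRatio(long[][] edges, int numPartitions) {
        if (edges.length == 0) {
            return 0.0;
        }
        int cut = 0;
        for (long[] edge : edges) {
            int source = hashPartition(edge[0], numPartitions);
            int target = hashPartition(edge[1], numPartitions);
            if (source != target) {
                cut++;
            }
        }
        return (double) cut / edges.length;
    }
}
```

A lower ratio means fewer messages cross partition boundaries during a superstep. Because the hash assignment ignores graph structure, its ratio is insensitive to where edges are added or expired, which makes it a natural baseline for the comparison in the plot.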
  7. Adapting mobile call graphs

    [Plot "Maximum clique performance": throughput (queries per hour, 0 to 10) per week (wk1 to wk4); series: hash, adaptive.]
    [Plot "Quality of partitioning": ratio of cuts (0.4 to 0.8) against time elapsed in days (0 to 30); series: hash partitioning, adaptive partitioning.]
    Dataset: 1 month of mobile calls; 21 million unique nodes; 7% addition and 4% deletion each week; sliding window of one week.
    Algorithm: maximum clique computation.
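The one-week sliding window can be pictured as an edge store that remembers the latest interaction per link and drops links that have gone quiet. The sketch below is an illustration only: `SlidingWindowGraph` and its methods are hypothetical names, edges are treated as undirected, and timestamps are assumed to be milliseconds. The system described in the talk may handle expiry quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of link expiry over a sliding time window.
class SlidingWindowGraph {
    private final long windowMillis;
    // Edge key "low-high" (smaller id first) -> timestamp of the latest interaction.
    private final Map<String, Long> lastSeen = new HashMap<>();

    SlidingWindowGraph(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Adds the edge if it is new, otherwise refreshes its last-active timestamp. */
    void recordInteraction(long a, long b, long timestamp) {
        String key = Math.min(a, b) + "-" + Math.max(a, b);
        lastSeen.merge(key, timestamp, Long::max);
    }

    /** Removes edges inactive for longer than the window; returns how many were dropped. */
    int expire(long now) {
        int before = lastSeen.size();
        lastSeen.values().removeIf(ts -> now - ts > windowMillis);
        return before - lastSeen.size();
    }

    int edgeCount() {
        return lastSeen.size();
    }
}
```

Calling `expire` once per graph-update phase keeps the working graph limited to recent activity, which is the source of the weekly additions and deletions quoted on the slide.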
  8. Adapting real-time social graphs

    [Plot: average tweets per second (0 to 50) and superstep time in seconds (0 to 5) against time in hours (0 to 24); series: tweets per second, hash superstep time, adaptive superstep time.]
    Dataset: one week of tweets published from London in 2012.
    Algorithm: TunkRank (user influence metric).
  9. Lessons learnt

    • Initial partitioning is not that important with dynamic graphs
    • Adaptive partitioning / repartitioning might be needed
    • >50% performance improvement on dynamic graphs
    • Partitioning overhead should be considered
    • Smarter partition strategies might not be practical
    • Migrations / repartitions might not be worth it
    • BSP aids system optimisations
    • Message aggregation (migration & computation)
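One common form of message aggregation in BSP systems is a combiner that merges messages bound for the same vertex before they cross the network. Whether this is exactly the optimisation the last lesson refers to is an assumption; the sketch below only illustrates the general idea for a sum-based algorithm such as PageRank. `SumCombinerSketch` and `combine` are hypothetical names and not part of any library API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: combine outgoing messages per target vertex by summing them.
class SumCombinerSketch {
    /** Collapses (targetVertexId, value) messages into one summed value per target. */
    static Map<Long, Double> combine(List<Map.Entry<Long, Double>> outgoing) {
        Map<Long, Double> combined = new HashMap<>();
        for (Map.Entry<Long, Double> message : outgoing) {
            combined.merge(message.getKey(), message.getValue(), Double::sum);
        }
        return combined;
    }
}
```

This only works because the receiving vertex needs the sum of its messages and not each one individually. The barrier between supersteps gives the sender a well-defined point at which all messages for a superstep are known and can be combined.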