
Towards Fast and Scalable Graph Pattern Mining

Anand Iyer

July 09, 2018

Transcript

  1. Towards Fast and Scalable Graph Pattern Mining

    Anand Iyer⋆, Zaoxing Liu⬩, Xin Jin⬩, Shivaram Venkataraman▴, Vladimir Braverman⬩, Ion Stoica⋆
    ⋆ UC Berkeley, ⬩ Johns Hopkins University, ▴ Microsoft Research / University of Wisconsin
    HotCloud, July 09, 2018
  5. Graph Analytics: State-of-the-Art

    Processing algorithms (PageRank, Connected Components): compute properties of the underlying graph
    § Easy to implement
    § Massively parallelizable
    § Can handle large graphs

    Mining algorithms (Motifs, Cliques): discover structural patterns in the underlying graph
    § Efficient custom algorithms
    § Exponential intermediate data
    § Limited to small graphs

    Challenging to mine patterns in large graphs
  7. Graph Analytics: Processing vs Mining

    [Chart: computation time vs. number of edges (log scale)]
    PageRank: 1 trillion edges, 140 s
    Motifs with size = 3: ~1 billion edges, 11 hours (Arabesque, SOSP’15)

    Can graph pattern mining be made both fast and scalable?
  8. Leverage approximation for graph pattern mining

    Many mining tasks ask for the number of occurrences and do not need exact answers
  13. Approximate Analytics

    General approach: apply the algorithm on subset(s) (a sample) of the input data

    graph (vertices 0 to 4) → edge sampling (p = 0.5) → triangle counting: $e = 1$ → result: $e \cdot 2 = 2$
    Answer: 10

    Applying an exact algorithm on sampled graph(s) is not the right approach for pattern mining
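
The edge-sampling example above can be reproduced with a minimal sketch. It assumes the pictured graph is the complete graph on vertices 0 to 4 (its 10 edges match the edge stream shown later in the deck); the helper names `count_triangles` and `edge_sample` and the seed are illustrative choices, not part of the talk.

```python
import itertools
import random


def count_triangles(edges):
    """Exact triangle count: test every vertex triple against the edge set."""
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edges for v in e})
    return sum(
        1
        for a, b, c in itertools.combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )


def edge_sample(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]


# Assumption: the slide's graph is the complete graph on 5 vertices (10 edges).
K5 = list(itertools.combinations(range(5), 2))

exact = count_triangles(K5)          # ground truth on the full graph
rng = random.Random(0)               # arbitrary seed, only for repeatability
sampled = edge_sample(K5, 0.5, rng)  # edge sampling with p = 0.5
e = count_triangles(sampled)         # exact algorithm on the sample
naive = e * 2                        # scale by 1/p, as on the slide
```

Under independent edge sampling a triangle survives only if all three of its edges are kept, which happens with probability p^3. Scaling the sampled count by 1/p therefore tends to land far from the exact count of 10.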
  30. Approximation by Sampling Patterns

    graph: vertices 0 to 4
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)

    neighborhood sampling with estimators E0, E1, E2, E3 (r = 4)
    E0: $p = \frac{1}{10} \cdot \frac{1}{4}$, $e_0 = 40$
    $e_1 = 0$, $e_2 = 0$, $e_3 = 0$
    result: $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$

    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013
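
The estimator on this slide can be sketched as follows. This is a simplified reading of neighborhood sampling for triangles, not the authors' implementation: it assumes a simple graph (no duplicate edges or self-loops), and it replays the stream once per estimator for clarity, whereas a streaming version would update all r estimators in a single pass. The function names are hypothetical.

```python
import itertools
import random


def neighborhood_estimate(stream, rng):
    """One neighborhood-sampling estimator for the triangle count."""
    e1 = e2 = None   # sampled first and second edge
    c = 0            # number of edges adjacent to e1 that arrived after e1
    closed = False   # True once the wedge (e1, e2) has been closed
    m = 0            # number of edges seen so far
    for m, edge in enumerate(stream, start=1):
        if rng.random() < 1.0 / m:
            # Reservoir step: e1 stays uniform over all edges seen so far.
            e1, e2, c, closed = edge, None, 0, False
        elif set(edge) & set(e1):
            c += 1
            if rng.random() < 1.0 / c:
                # e2 stays uniform over the later neighbors of e1.
                e2, closed = edge, False
            elif e2 is not None and set(edge) == set(e1) ^ set(e2):
                # The edge joins the two non-shared endpoints of e1 and e2.
                closed = True
    # A triangle is sampled with probability 1 / (m * c), so a hit is
    # weighted by m * c to keep the estimate unbiased.
    return m * c if closed else 0


def estimate_triangles(stream, r, rng):
    """Average of r independent estimators."""
    return sum(neighborhood_estimate(stream, rng) for _ in range(r)) / r


# The slide's edge stream: (0,1), (0,2), ..., (3,4).
K5_STREAM = list(itertools.combinations(range(5), 2))
approx = estimate_triangles(K5_STREAM, r=20000, rng=random.Random(0))
```

With m = 10 edges and c = 4 later neighbors, a successful estimator reports 10 · 4 = 40, which matches $e_0$ on the slide. Averaging over more estimators tightens the result around the true count.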
  32. Potential Benefits

    § 16 node Apache Spark cluster
    § Two graphs: LiveJournal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%

    3-Motif
    System               Graph       |V|     |E|     Time
    Ours (5%), 16 x 8    Twitter     41.7M   1.47B   4m
    Arabesque, 20 x 32   Instagram   180M    0.9B    10h45m

    3-Motif
    System               Graph   |V|    |E|     Time
    Ours (5%), 16 x 8    LiveJ   4.8M   68.9M   11.5s
    Arabesque, 16 x 8    LiveJ   4.8M   68.9M   299.2s
  35. Challenges Building a General Purpose Approximate Graph Mining System

    • General Patterns
    • Distributed Settings
    • Error Estimation
    • Handling Updates
  37. Challenge #1: General Patterns

    Problem: Neighborhood sampling is for triangle counting

    Break down neighborhood sampling into two phases:
    • Sampling phase
    • Closing phase

    [Figure: neighborhood sampling on the graph with vertices 0 to 4, estimators E0 to E3 (r = 4); E0: $p = \frac{1}{10} \cdot \frac{1}{4}$, $e_0 = 40$; $e_1 = e_2 = e_3 = 0$; result $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$]

    Can we restrict the implementation using a simple API?
    How can we analyze programs written using the API?
  44. Challenge #2: Distributed Setting

    Problem: Neighborhood sampling is for a single machine

    graph → map: w (= 3) workers
    subgraph 0: partial count $c_0$ (using r estimators)
    subgraph 1: partial count $c_1$ (using r estimators)
    subgraph 2: partial count $c_2$ (using r estimators)
    → reduce: $\sum_{i=0}^{w-1} c_i$, combined with $f(w)$

    How do we compute f(w) for any pattern?
    How does f(w) affect error?
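
One way to picture the map/reduce flow above is the sketch below, written for triangles only. Everything specific in it is an assumption rather than something stated in the talk: edges are assigned to workers uniformly and independently, exact counting stands in for the r estimators on each worker, and the correction `f_triangle(w) = w ** 2` is derived from that partitioning scheme (all three edges of a triangle land on the same worker with probability w · (1/w)^3 = 1/w^2).

```python
import itertools
import random


def count_triangles(edges):
    """Exact triangle count; a stand-in for a worker's r estimators."""
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edges for v in e})
    return sum(
        1
        for a, b, c in itertools.combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )


def partition_edges(edges, w, rng):
    """Assumed scheme: send each edge to one of w workers uniformly at random."""
    parts = [[] for _ in range(w)]
    for e in edges:
        parts[rng.randrange(w)].append(e)
    return parts


def f_triangle(w):
    """Assumed correction for triangles under the partitioning above."""
    return w ** 2


def distributed_estimate(edges, w, rng):
    partials = [count_triangles(p) for p in partition_edges(edges, w, rng)]  # map
    return f_triangle(w) * sum(partials)                                     # reduce


K5 = list(itertools.combinations(range(5), 2))
rng = random.Random(0)  # arbitrary seed
trials = 5000
mean_estimate = sum(distributed_estimate(K5, 3, rng) for _ in range(trials)) / trials
```

A single run with w = 3 can be far from the true count, because only triangles that fall entirely inside one subgraph are seen. This is one way to see why the slide asks how f(w) affects error.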
  45. Challenge #3: Building Error-Latency Profile

    Problem: Given a time / error bound, how many estimators should we use?

    Need to build two profiles:
    • Time vs #estimators
    • Error vs #estimators

    Naïve approach:
    • Exhaustively run every possible point (infeasible)
  46. Building Estimators vs Time Profile

    [Plot, Twitter graph: runtime (min), ticks 1 to 3, vs. number of estimators, ticks 0.5M to 2M]

    Time complexity linear in number of estimators
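
Because runtime grows linearly with the number of estimators, a few probe runs are enough to fit the time profile and invert it for a time budget. The sketch below uses ordinary least squares; the probe measurements are made-up placeholders, not numbers from the talk, and `fit_line` and `max_estimators` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_estimators(slope, intercept, time_budget):
    """Largest estimator count whose predicted runtime fits the budget."""
    if slope <= 0:
        raise ValueError("runtime must increase with the number of estimators")
    return max(0, int((time_budget - intercept) / slope))


# Hypothetical probe runs: (number of estimators, runtime in minutes).
probes = [(250_000, 0.75), (500_000, 1.0), (1_000_000, 1.5)]
slope, intercept = fit_line(*zip(*probes))
budget_estimators = max_estimators(slope, intercept, time_budget=2.0)
```

The same shortcut does not carry over to the error profile, which the deck describes as non-linear in the number of estimators.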
  48. Building Estimators vs Error Profile

    [Plot, Twitter graph: error rate (%), ticks 0 to 40, vs. number of estimators, ticks 50k to 2.1M]

    Error complexity non-linear in number of estimators

    Leverage techniques like experiment design/Bayesian optimization?
    How do we avoid the need to know the ground truth?
  49. Challenge #4: Updates

    Problem: Graphs and queries can be updated/refined

    Several systems challenges:
    • Incremental pattern mining
    • Can the error-latency profiles be updated?
    • Caching
    • Re-use results
    • Pre-computation
  50. Conclusion

    § Approximation is a promising solution for pattern mining
    § Significant benefits, and can handle much larger graphs…
    § … but cannot output all instances of the pattern
    § Several challenges in realizing it
      § How to extend the technique to general patterns?
      § How to do approximate pattern mining in a distributed setting?
      § How do we estimate the error?
      § How do we handle updates?

    http://www.cs.berkeley.edu/~api
    [email protected]