Towards Fast and Scalable Graph Pattern Mining

Anand Iyer

July 09, 2018

Transcript

  1. Towards Fast and Scalable Graph Pattern Mining

    Anand Iyer⋆, Zaoxing Liu⬩, Xin Jin⬩, Shivaram Venkataraman▴, Vladimir Braverman⬩, Ion Stoica⋆
    ⋆ UC Berkeley, ⬩ Johns Hopkins University, ▴ Microsoft Research / University of Wisconsin
    HotCloud, July 09, 2018
  2. Graphs popular in big data analytics

  15. Graph Analytics: State-of-the-Art

    Processing Algorithms (PageRank, Connected Components)
    Computes properties of the underlying graph
    § Easy to implement
    § Massively parallelizable
    § Can handle large graphs
    Mining Algorithms (Motifs, Cliques)
    Discovers structural patterns in the underlying graph
    § Efficient custom algorithms
    § Exponential intermediate data
    § Limited to small graphs
    Challenging to mine patterns in large graphs
  20. Graph Analytics: Processing vs Mining

    [Plot: computation time against number of edges, log scale]
    PageRank: 1 trillion edges, 140 s
    Motifs with size = 3: ~1 billion edges, 11 hours (Arabesque, SOSP’15)
    Can graph pattern mining be made both fast and scalable?
  22. Leverage approximation for graph pattern mining

    Many mining tasks ask for the number of occurrences and do not need exact answers
  29. Approximate Analytics

    General approach: Apply algorithm on subset(s) (sample) of the input data
    Example: graph on vertices 0, 1, 2, 3, 4; edge sampling (p=0.5); triangle counting on the sampled graph gives e = 1
    result: $e \cdot 2 = 2$
    Answer: 10
    Applying an exact algorithm on sampled graph(s) is not the right approach for pattern mining
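The gap between the sampled result and the true answer can be reproduced with a short sketch. It assumes the example graph is the complete graph on vertices 0 to 4 (the ten-edge stream used in the neighborhood sampling slides). The helper names and the particular five-edge sample are hypothetical and chosen only to match e = 1.

```python
from fractions import Fraction
from itertools import combinations


def count_triangles(edges):
    """Exact triangle count of an undirected graph given as vertex pairs."""
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edge_set for v in e})
    return sum(
        1
        for a, b, c in combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )


def expected_sampled_triangles(edges):
    """Average triangle count over every edge subset (edge sampling, p = 1/2)."""
    m = len(edges)
    total = 0
    for mask in range(1 << m):
        subset = [edges[k] for k in range(m) if (mask >> k) & 1]
        total += count_triangles(subset)
    return Fraction(total, 1 << m)


# Complete graph on vertices 0..4: 10 edges, 10 triangles.
k5 = list(combinations(range(5), 2))

# One hypothetical outcome of edge sampling with p = 0.5 (5 of 10 edges kept).
sample = [(0, 1), (0, 2), (1, 2), (0, 3), (3, 4)]
e = count_triangles(sample)  # only triangle (0, 1, 2) survives
naive = e * 2                # scale the sampled count by 1/p

print(count_triangles(k5), e, naive)   # 10 1 2
print(expected_sampled_triangles(k5))  # 5/4
```

A triangle survives only when all three of its edges are kept, which happens with probability p^3. On average 1.25 triangles remain in the sample, so scaling the sampled count by 1/p cannot recover the true count of 10.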
  46. Approximation by Sampling Patterns

    graph: vertices 0, 1, 2, 3, 4
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    neighborhood sampling with r = 4 estimators E0, E1, E2, E3
    E0: $p = \frac{1}{10} \cdot \frac{1}{4}$, $e_0 = 40$
    E1, E2, E3: $e_1 = 0$, $e_2 = 0$, $e_3 = 0$
    result: $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013
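The numbers in this example can be checked with a small sketch of one neighborhood sampling estimator. It follows the scheme described by Pavan et al.: pick a first edge uniformly from the stream, pick a second edge uniformly among the later edges adjacent to it, then wait for the closing edge. The function names are hypothetical, and the choice of (1,2) then (1,3) for E0 is an assumption that happens to give the probability 1/10 · 1/4 shown on the slide.

```python
from fractions import Fraction
from itertools import combinations


def adjacent(e, f):
    """Two distinct edges are adjacent when they share exactly one vertex."""
    return len(set(e) & set(f)) == 1


def later_neighbors(stream, i):
    """Positions after i whose edge is adjacent to stream[i]."""
    return [k for k in range(i + 1, len(stream)) if adjacent(stream[i], stream[k])]


def estimator_value(stream, i, j):
    """Output of one estimator that sampled stream[i], then stream[j]."""
    candidates = later_neighbors(stream, i)
    if j not in candidates:
        raise ValueError("second edge must be a later neighbor of the first")
    closing = set(stream[i]) ^ set(stream[j])  # the edge that closes the wedge
    closed = any(set(stream[k]) == closing for k in range(j + 1, len(stream)))
    # Inverse of the sampling probability (1/m) * (1/c) when a triangle is found.
    return len(stream) * len(candidates) if closed else 0


def expected_value(stream):
    """Exact expectation of a single estimator over all of its random choices."""
    m = len(stream)
    total = Fraction(0)
    for i in range(m):
        candidates = later_neighbors(stream, i)
        for j in candidates:
            total += Fraction(estimator_value(stream, i, j), m * len(candidates))
    return total


stream = list(combinations(range(5), 2))  # the slide's edge stream, in order

e0 = estimator_value(stream, 4, 5)            # (1,2) then (1,3); (2,3) closes it
others = [estimator_value(stream, 8, 9)] * 3  # (2,4) then (3,4): never closed
print(e0, sum([e0] + others) / 4, expected_value(stream))  # 40 10.0 10
```

Each triangle is found by exactly one pair of choices (its earliest edge, then its second-earliest edge), and that pair contributes the inverse of its own probability, which is why the expectation equals the true count.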
  50. Potential Benefits

    § 16 node Apache Spark cluster
    § Two graphs: Live Journal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%

    3-Motif
    System             Graph      |V|    |E|    Time
    Ours (5%), 16 x 8  Twitter    41.7M  1.47B  4m
    Arabesque, 20x32   Instagram  180M   0.9B   10h45m

    3-Motif
    System             Graph  |V|   |E|    Time
    Ours (5%), 16 x 8  LiveJ  4.8M  68.9M  11.5s
    Arabesque, 16 x 8  LiveJ  4.8M  68.9M  299.2s
  55. Challenges Building a General Purpose Approximate Graph Mining System

    General Patterns
    Distributed Settings
    Error Estimation
    Handling Updates
  57. Challenge #1: General Patterns

    Problem: Neighborhood sampling is for triangle counting
    Break down neighborhood sampling into two phases:
    • Sampling phase
    • Closing phase
    [Figure: the neighborhood sampling example with r = 4 estimators E0 to E3, $p = \frac{1}{10} \cdot \frac{1}{4}$, $e_0 = 40$, $e_1 = e_2 = e_3 = 0$, result $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$]
    Can we restrict the implementation using a simple API?
    How can we analyze programs written using the API?
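One way to picture the two-phase split is the sketch below. It is not the authors' API: `Pattern`, `awaited`, `run_estimator` and `estimate` are hypothetical names, and the sketch covers only patterns whose sampling phase picks two adjacent edges. The 3-chain is counted here as any pair of adjacent edges, whether or not a third edge closes it, which is also an assumption.

```python
import random
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Set, Tuple

Edge = Tuple[int, int]


@dataclass
class Pattern:
    """Hypothetical description of a pattern sampled as two adjacent edges."""
    name: str
    # Closing phase: given the two sampled edges, which edges must still arrive?
    awaited: Callable[[Edge, Edge], Set[frozenset]]


TRIANGLE = Pattern("triangle", lambda r1, r2: {frozenset(set(r1) ^ set(r2))})
CHAIN = Pattern("3-chain", lambda r1, r2: set())  # nothing left to close


def run_estimator(stream: List[Edge], pattern: Pattern, rng: random.Random) -> int:
    """One pass over the stream with constant state for a single estimator."""
    r1 = r2 = None
    c = 0            # edges adjacent to r1 seen after r1
    pending = set()  # edges the closing phase is still waiting for
    for i, e in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:
            # Sampling phase, first edge: reservoir-sample over the whole stream.
            r1, r2, c, pending = e, None, 0, set()
        elif len(set(e) & set(r1)) == 1:
            c += 1
            if rng.random() < 1.0 / c:
                # Sampling phase, second edge: reservoir-sample over r1's neighbors.
                r2 = e
                pending = pattern.awaited(r1, r2)
            else:
                pending.discard(frozenset(e))  # closing phase
        else:
            pending.discard(frozenset(e))      # closing phase
    found = r2 is not None and not pending
    return len(stream) * c if found else 0


def estimate(stream: List[Edge], pattern: Pattern, r: int, seed: int = 0) -> float:
    """Average of r independent estimators."""
    rng = random.Random(seed)
    return sum(run_estimator(stream, pattern, rng) for _ in range(r)) / r


stream = list(combinations(range(5), 2))
print(estimate(stream, TRIANGLE, 20000))  # close to the 10 triangles of the graph
print(estimate(stream, CHAIN, 20000))     # close to its 30 pairs of adjacent edges
```

Only the `awaited` function differs between the two patterns, which is the kind of restriction that would let a system analyze a pattern program rather than treat it as opaque code.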
  66. Challenge #2: Distributed Setting

    Problem: Neighborhood sampling is for a single machine
    map: the graph is split across w (=3) workers
    subgraph 0: partial count $c_0$ (using r estimators)
    subgraph 1: partial count $c_1$ (using r estimators)
    subgraph 2: partial count $c_2$ (using r estimators)
    reduce: $\sum_{i=0}^{w-1} c_i$, adjusted by $f(w)$
    How do we compute f(w) for any pattern?
    How does f(w) affect error?
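To see why the summed partial counts need a correction, the sketch below works through one specific case. It assumes edges are assigned to workers uniformly at random, and it counts triangles exactly on each worker instead of running r estimators. Under those assumptions a triangle stays intact on a single worker with probability 1/w², so f(w) = w² is one candidate correction for triangles. That choice is an illustration, not a result stated in the talk, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def count_triangles(edges):
    """Exact triangle count of an undirected graph given as vertex pairs."""
    edge_set = {frozenset(e) for e in edges}
    vertices = sorted({v for e in edge_set for v in e})
    return sum(
        1
        for a, b, c in combinations(vertices, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )


def expected_partial_sum(edges, w):
    """Exact mean of sum_i c_i over every assignment of edges to w workers."""
    m = len(edges)
    total = 0
    for assignment in product(range(w), repeat=m):
        for worker in range(w):
            part = [e for e, a in zip(edges, assignment) if a == worker]
            total += count_triangles(part)
    return Fraction(total, w ** m)


k4 = list(combinations(range(4), 2))  # 6 edges, 4 triangles
for w in (2, 3):
    mean_sum = expected_partial_sum(k4, w)
    print(w, mean_sum, mean_sum * w ** 2)
# 2 1 4
# 3 4/9 4
```

Multiplying by w² restores the expected count, but it also scales the variance of the partial counts, which is one reason the effect of f(w) on error needs its own analysis.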
  67. Challenge #3: Building Error-Latency Profile

    Problem: Given a time / error bound, how many estimators should we use?
    Need to build two profiles:
    • Time vs #estimators
    • Error vs #estimators
    Naïve approach:
    • Exhaustively run every possible point (infeasible)
  68. Building Estimators vs Time Profile

    [Plot, Twitter graph: runtime (min) against number of estimators, 0.5M to 2M]
    Time complexity linear in number of estimators
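Because runtime grows linearly with the number of estimators, a time profile can in principle be fitted from a handful of runs and then inverted for a time budget. The sketch below shows that idea with an ordinary least-squares line. The three measurements are made-up illustration values, not numbers read off the plot, and the function names are hypothetical.

```python
import math


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimators_for_budget(points, budget_min):
    """Largest estimator count whose predicted runtime fits the budget."""
    slope, intercept = fit_line(points)
    return math.floor((budget_min - intercept) / slope)


# Hypothetical (number of estimators, runtime in minutes) measurements.
runs = [(500_000, 1.0), (1_000_000, 1.5), (2_000_000, 2.5)]

slope, intercept = fit_line(runs)
print(slope, intercept)                  # about 1e-06 min per estimator, 0.5 min fixed
print(estimators_for_budget(runs, 2.0))  # about 1.5 million estimators
```

The same shortcut does not carry over to the error profile, since error is not linear in the number of estimators.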
  70. Building Estimators vs Error Profile

    [Plot, Twitter graph: error rate (%) against number of estimators, 50k to 2.1M]
    Error complexity non-linear in number of estimators
    Leverage techniques like experiment design/Bayesian optimization?
    How do we avoid the need to know the ground truth?
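The non-linear shape is what a standard averaging argument predicts. The derivation below is a sketch under assumptions that are not from the slides: the r estimators are independent and unbiased, $\sigma^2$ denotes their common variance, $T$ the true count, and $\varepsilon$ the target relative error.

```latex
\[
\hat{T} = \frac{1}{r} \sum_{i=0}^{r-1} e_i,
\qquad
\mathbb{E}[\hat{T}] = T,
\qquad
\operatorname{Var}[\hat{T}] = \frac{\sigma^2}{r}
\]
\[
\Pr\bigl[\, |\hat{T} - T| \ge \varepsilon T \,\bigr]
\;\le\; \frac{\sigma^2}{r \, \varepsilon^2 \, T^2}
\qquad \text{(Chebyshev's inequality)}
\]
```

For a fixed failure probability the achievable relative error shrinks like $1/\sqrt{r}$, so halving the error takes roughly four times as many estimators. Both $\sigma^2$ and $T$ depend on the graph and are unknown in advance, which is why the profile is hard to build without the ground truth.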
  71. Challenge #4: Updates

    Problem: Graphs and queries can be updated/refined
    Several systems challenges:
    • Incremental pattern mining
    • Can the error-latency profiles be updated?
    • Caching
    • Re-use results
    • Pre-computation
  72. Conclusion

    § Approximation is a promising solution for pattern mining
    § Significant benefits, and can handle much larger graphs…
    § … but cannot output all instances of the pattern
    § Several challenges in realizing it
    § How to extend the technique to general patterns?
    § How to do approximate pattern mining in a distributed setting?
    § How do we estimate the error?
    § How do we handle updates?
    http://www.cs.berkeley.edu/~api api@cs.berkeley.edu