Towards Fast and Scalable Graph Pattern Mining

Anand Iyer

July 09, 2018

Transcript

  1. Towards Fast and Scalable
    Graph Pattern Mining
    Anand Iyer ⋆, Zaoxing Liu ⬩, Xin Jin⬩,
    Shivaram Venkataraman▴, Vladimir Braverman⬩, Ion Stoica ⋆
    ⋆ UC Berkeley ⬩ Johns Hopkins University ▴ Microsoft Research / University of Wisconsin
    HotCloud, July 09, 2018

  2. Graphs popular in big data analytics

  3. Graphs popular in big data analytics

  4. Graphs popular in big data analytics

  5. (no text on this slide)

  6. Graph Analytics

  7. Graph Analytics
    Processing Algorithms

  8. Graph Analytics
    Processing Algorithms
    PageRank
    Connected Components

  9. Graph Analytics
    Processing Algorithms: PageRank, Connected Components
    Mining Algorithms

  10. Graph Analytics
    Processing Algorithms: PageRank, Connected Components
    Mining Algorithms: Motifs, Cliques

  11. Graph Analytics
    Processing Algorithms: PageRank, Connected Components
    Mining Algorithms: Motifs, Cliques

  12. Graph Analytics: State-of-the-Art
    Processing Algorithms: PageRank, Connected Components
    Computes properties of the underlying graph
    Mining Algorithms: Motifs, Cliques
    Discovers structural patterns in the underlying graph

  13. Graph Analytics: State-of-the-Art
    Processing Algorithms: PageRank, Connected Components
    § Easy to implement
    § Massively parallelizable
    § Can handle large graphs
    Computes properties of the underlying graph
    Mining Algorithms: Motifs, Cliques
    Discovers structural patterns in the underlying graph

  14. Graph Analytics: State-of-the-Art
    Processing Algorithms: PageRank, Connected Components
    § Easy to implement
    § Massively parallelizable
    § Can handle large graphs
    Computes properties of the underlying graph
    Mining Algorithms: Motifs, Cliques
    § Efficient custom algorithms
    § Exponential intermediate data
    § Limited to small graphs
    Discovers structural patterns in the underlying graph

  15. Graph Analytics: State-of-the-Art
    Processing Algorithms: PageRank, Connected Components
    § Easy to implement
    § Massively parallelizable
    § Can handle large graphs
    Computes properties of the underlying graph
    Mining Algorithms: Motifs, Cliques
    § Efficient custom algorithms
    § Exponential intermediate data
    § Limited to small graphs
    Discovers structural patterns in the underlying graph
    Challenging to mine patterns in large graphs

  16. Graph Analytics: Processing vs Mining
    [Plot: # Edges vs. Computation Time, log scale]

  17. Graph Analytics: Processing vs Mining
    [Plot: # Edges vs. Computation Time, log scale]
    PageRank: 1 trillion edges, 140 s

  18. Graph Analytics: Processing vs Mining
    [Plot: # Edges vs. Computation Time, log scale]
    PageRank: 1 trillion edges, 140 s

  19. Graph Analytics: Processing vs Mining
    [Plot: # Edges vs. Computation Time, log scale]
    PageRank: 1 trillion edges, 140 s
    Motifs with size = 3: ~1 billion edges, 11 hours, Arabesque (SOSP’15)

  20. Graph Analytics: Processing vs Mining
    [Plot: # Edges vs. Computation Time, log scale]
    PageRank: 1 trillion edges, 140 s
    Motifs with size = 3: ~1 billion edges, 11 hours, Arabesque (SOSP’15)
    Can graph pattern mining be made both fast and scalable?

  21. Many mining tasks ask for the number of
    occurrences and do not need exact answers

  22. Leverage approximation for graph pattern
    mining
    Many mining tasks ask for the number of
    occurrences and do not need exact answers

  23. General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics

  24. General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    0
    1 4
    2 3
    graph

  25. General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    0
    1 4
    2 3
    edge sampling
    (p=0.5)
    graph
    0
    1 4
    2 3

  26. General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    0
    1 4
    2 3
    edge sampling
    (p=0.5)
    graph
    e = 1
    0
    1 4
    2 3
    triangle
    counting

  27. General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    [Figure: graph on vertices 0, 1, 2, 3, 4 → edge sampling (p=0.5) → sampled graph → triangle counting (e = 1) → result: e · 2 = 2]

  28. Answer: 10
    General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    [Figure: graph on vertices 0, 1, 2, 3, 4 → edge sampling (p=0.5) → sampled graph → triangle counting (e = 1) → result: e · 2 = 2]

  29. Answer: 10
    General approach: Apply algorithm on
    subset(s) (sample) of the input data
    Approximate Analytics
    [Figure: graph on vertices 0, 1, 2, 3, 4 → edge sampling (p=0.5) → sampled graph → triangle counting (e = 1) → result: e · 2 = 2]
    Applying exact algorithm on sampled graph(s)
    not the right approach for pattern mining
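
The mismatch on this slide (estimate 2, answer 10) can be reproduced in a few lines. This is a minimal sketch, assuming the graph is the complete graph on vertices 0 to 4 (which matches the 10-edge stream used elsewhere in the deck). The particular 5-edge sample and the helper name `count_triangles` are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count; each triangle is found once per edge, hence // 3."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# Assumed input: complete graph on 5 vertices (10 edges, 10 triangles).
graph = list(combinations(range(5), 2))

# Hypothetical outcome of edge sampling with p = 0.5: 5 of the 10 edges survive.
p = 0.5
sample = [(0, 1), (0, 2), (1, 2), (1, 4), (3, 4)]

exact = count_triangles(graph)   # true number of triangles
e = count_triangles(sample)      # triangles left in the sampled graph
scaled = e * (1 / p)             # naive scaling by the sampling rate, as on the slide
```

A triangle survives edge sampling only if all three of its edges are kept, so the count in the sample does not scale back to the true count by the per-edge factor 1/p.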

  30. Approximation by Sampling Patterns
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  31. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  32. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  33. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  34. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  35. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  36. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  37. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  38. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  39. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  40. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  41. Approximation by Sampling Patterns
    E0
    0
    1 4
    2 3
    graph
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  42. Approximation by Sampling Patterns
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    E0: P = 1/10 · 1/4
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  43. Approximation by Sampling Patterns
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    E0: P = 1/10 · 1/4, e_0 = 40
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  44. Approximation by Sampling Patterns
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    neighborhood sampling: estimator E0
    E0: P = 1/10 · 1/4, e_0 = 40
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  45. Approximation by Sampling Patterns
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    neighborhood sampling: estimators E0, E1, E2, E3
    E0: P = 1/10 · 1/4, e_0 = 40
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

  46. Approximation by Sampling Patterns
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    neighborhood sampling: estimators E0, E1, E2, E3
    E0: P = 1/10 · 1/4, e_0 = 40
    e_1 = 0, e_2 = 0, e_3 = 0
    estimator (r=4) result: (1/r) · Σ_{i=0}^{r-1} e_i = 10
    edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013
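
The numbers on this slide can be checked with a small script. This is a minimal sketch of one neighborhood-sampling estimator as the slide presents it: pick a first edge, pick a second edge among the later edges adjacent to it, then wait for the closing edge. The function names are hypothetical, and instead of drawing random samples in one pass over the stream the sketch fixes the two picks by index and enumerates all of them to get the exact expectation.

```python
from fractions import Fraction

# Edge stream from the slide (complete graph on vertices 0..4).
stream = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
          (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

def adjacent(e, f):
    """Two distinct edges are adjacent if they share exactly one endpoint."""
    return len(set(e) & set(f)) == 1

def later_neighbors(stream, i):
    """Indices of edges after position i that are adjacent to stream[i]."""
    return [j for j in range(i + 1, len(stream)) if adjacent(stream[i], stream[j])]

def neighborhood_estimate(stream, i, j):
    """Estimate of an estimator whose first edge is stream[i] and second is stream[j]."""
    m = len(stream)
    c = len(later_neighbors(stream, i))
    closing = set(stream[i]) ^ set(stream[j])      # the edge that would close the triangle
    closed = any(set(f) == closing for f in stream[j + 1:])
    return m * c if closed else 0                  # 1 / (1/m * 1/c) if a triangle was sampled

def expected_estimate(stream):
    """Exact expectation over all (first edge, second edge) choices."""
    m = len(stream)
    total = Fraction(0)
    for i in range(m):
        nbrs = later_neighbors(stream, i)
        for j in nbrs:
            total += Fraction(1, m) * Fraction(1, len(nbrs)) * neighborhood_estimate(stream, i, j)
    return total

# One choice that reproduces the slide: first edge (0,3), second edge (0,4), closed by (3,4).
e0 = neighborhood_estimate(stream, 2, 3)
average = (e0 + 0 + 0 + 0) / 4                     # r = 4 estimators, three of which found nothing
```

With these picks `e0` is 10 · 4 = 40, the average over the four estimators is 10, and `expected_estimate(stream)` also evaluates to 10, the true triangle count.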

  47. Potential Benefits
    § 16 node Apache Spark cluster
    § Two graphs: LiveJournal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%

  48. Potential Benefits
    § 16 node Apache Spark cluster
    § Two graphs: LiveJournal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
    Arabesque   20 x 32   Instagram  180M    0.9B    10h45m
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
    Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

  49. Potential Benefits
    § 16 node Apache Spark cluster
    § Two graphs: LiveJournal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
    Arabesque   20 x 32   Instagram  180M    0.9B    10h45m
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
    Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

  50. Potential Benefits
    § 16 node Apache Spark cluster
    § Two graphs: LiveJournal (68.9M edges), Twitter (1.47B edges)
    § Count 3-Motifs (2 patterns: triangle, 3-chain)
    § Set error to 5%
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
    Arabesque   20 x 32   Instagram  180M    0.9B    10h45m
    3-Motif     System    Graph      |V|     |E|     Time
    Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
    Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

  51. Challenges
    Building a General
    Purpose Approximate
    Graph Mining System

  52. Challenges
    Building a General
    Purpose Approximate
    Graph Mining System
    General Patterns

  53. Challenges
    Building a General
    Purpose Approximate
    Graph Mining System
    General Patterns
    Distributed Settings

  54. Challenges
    Building a General
    Purpose Approximate
    Graph Mining System
    General Patterns
    Distributed Settings
    Error Estimation

  55. Challenges
    Building a General
    Purpose Approximate
    Graph Mining System
    General Patterns
    Distributed Settings
    Error Estimation
    Handling Updates

  56. Challenge #1: General Patterns
    Problem: Neighborhood sampling is for triangle counting
    Break down neighborhood sampling into two phases:
    • Sampling phase
    • Closing phase
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    neighborhood sampling: estimators E0, E1, E2, E3
    E0: P = 1/10 · 1/4, e_0 = 40
    e_1 = 0, e_2 = 0, e_3 = 0
    estimator (r=4) result: (1/r) · Σ_{i=0}^{r-1} e_i = 10

  57. Challenge #1: General Patterns
    Problem: Neighborhood sampling is for triangle counting
    Break down neighborhood sampling into two phases:
    • Sampling phase
    • Closing phase
    [Figure: graph on vertices 0, 1, 2, 3, 4]
    neighborhood sampling: estimators E0, E1, E2, E3
    E0: P = 1/10 · 1/4, e_0 = 40
    e_1 = 0, e_2 = 0, e_3 = 0
    estimator (r=4) result: (1/r) · Σ_{i=0}^{r-1} e_i = 10
    Can we restrict the implementation using a simple API?
    How can we analyze programs written using the API?

  58. Challenge #2: Distributed Setting
    Problem: Neighborhood sampling is for a single machine

    View Slide

  59. Challenge #2: Distributed Setting
    graph
    Problem: Neighborhood sampling is for a single machine

  60. Challenge #2: Distributed Setting
    graph
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    Problem: Neighborhood sampling is for a single machine

  61. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    Problem: Neighborhood sampling is for a single machine

  62. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    Problem: Neighborhood sampling is for a single machine

  63. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    Σ_{i=0}^{w-1} c_i
    Problem: Neighborhood sampling is for a single machine

  64. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    reduce: Σ_{i=0}^{w-1} c_i
    Problem: Neighborhood sampling is for a single machine

  65. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    reduce: Σ_{i=0}^{w-1} c_i
    f(w)
    Problem: Neighborhood sampling is for a single machine

  66. Challenge #2: Distributed Setting
    graph
    map: w(=3) workers
    subgraph 0: partial count c_0 (using r estimators)
    subgraph 1: partial count c_1 (using r estimators)
    subgraph 2: partial count c_2 (using r estimators)
    reduce: Σ_{i=0}^{w-1} c_i
    f(w)
    Problem: Neighborhood sampling is for a single machine
    How do we compute f(w) for any pattern?
    How does f(w) affect error?
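
One way to see what the correction f(w) has to do is to work through a toy case. The sketch below is not the system from the talk. It assumes vertices are assigned to the w workers uniformly at random, that each worker sees only edges with both endpoints in its partition, and that exact counting stands in for the r estimators per worker. Under those assumptions a triangle lands entirely in one partition with probability 1/w², so f(w) = w² is the illustrative choice for triangles; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def count_triangles(edges):
    """Exact triangle count; each triangle is found once per edge, hence // 3."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def map_reduce_estimate(edges, assignment, w, f):
    """map: count per subgraph; reduce: sum the partial counts and scale by f(w)."""
    parts = [[] for _ in range(w)]
    for u, v in edges:
        if assignment[u] == assignment[v]:         # edge lies inside one partition
            parts[assignment[u]].append((u, v))
    partial = [count_triangles(p) for p in parts]  # partial counts c_0 .. c_{w-1}
    return f(w) * sum(partial)

def average_over_partitions(edges, n, w, f):
    """Exact mean of the estimate over all w**n vertex-to-worker assignments."""
    total = Fraction(0)
    for assignment in product(range(w), repeat=n):
        total += map_reduce_estimate(edges, assignment, w, f)
    return total / w ** n

graph = list(combinations(range(5), 2))            # complete graph on 5 vertices, 10 triangles
triangle_correction = lambda w: w ** 2             # assumed f(w) for 3-vertex patterns
mean_w2 = average_over_partitions(graph, 5, 2, triangle_correction)
mean_w3 = average_over_partitions(graph, 5, 3, triangle_correction)
```

Both means equal the true count of 10, which shows the scaling is unbiased under these assumptions. It says nothing about the variance, which is the second question the slide raises.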

  67. Challenge #3: Building Error-Latency Profile
    Problem: Given a time / error bound, how many
    estimators should we use?
    Need to build two profiles:
    • Time vs #estimators
    • Error vs #estimators
    Naïve approach:
    • Exhaustively run every possible point (infeasible)

  68. Building Estimators vs Time Profile
    [Plot: Runtime (min), ticks 1 to 3, vs. No. of Estimators, ticks 0.5M to 2M; Twitter Graph]
    Time complexity linear in number of estimators
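
Because runtime grows linearly with the number of estimators, a few profiling runs are enough to fit the time profile and invert it for a given time budget. The sketch below uses ordinary least squares in plain Python. The three measurements are hypothetical placeholders, not values read from the plot, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimators_for_budget(slope, intercept, minutes):
    """Invert the fitted line: how many estimators fit in the time budget."""
    return (minutes - intercept) / slope

# Hypothetical profiling runs: (number of estimators, runtime in minutes).
runs = [(500_000, 1.0), (1_000_000, 1.5), (2_000_000, 2.5)]
slope, intercept = fit_line(*zip(*runs))
budget = estimators_for_budget(slope, intercept, 2.0)
```

The error profile cannot be handled this way, since the next slides show it is non-linear in the number of estimators.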

  69. Building Estimators vs Error Profile
    [Plot: Error Rate (%), ticks 0 to 40, vs. No. of Estimators, ticks 50k to 2.1M; Twitter Graph]
    Error complexity non-linear in number of estimators

  70. Building Estimators vs Error Profile
    [Plot: Error Rate (%), ticks 0 to 40, vs. No. of Estimators, ticks 50k to 2.1M; Twitter Graph]
    Error complexity non-linear in number of estimators
    Leverage techniques like experiment design/Bayesian optimization?
    How do we avoid the need to know the ground truth?

  71. Challenge #4: Updates
    Problem: Graphs and queries can be updated/refined
    Several systems challenges:
    • Incremental pattern mining
    • Can the error-latency profiles be updated?
    • Caching
    • Re-use results
    • Pre-computation

  72. Conclusion
    § Approximation is a promising solution for pattern mining
    § Significant benefits, and can handle much larger graphs…
    § … but cannot output all instances of the pattern
    § Several challenges in realizing it
    § How to extend the technique to general patterns?
    § How to do approximate pattern mining in a distributed setting?
    § How do we estimate the error?
    § How do we handle updates?
    http://www.cs.berkeley.edu/~api
