
Bridging the GAP: Towards Approximate Graph Analytics


Anand Iyer

June 10, 2018

  1. Bridging the GAP: Towards
    Approximate Graph Analytics
    Anand Iyer ⋆, Aurojit Panda ▪, Shivaram Venkataraman⬩,
    Mosharaf Chowdhury▴, Aditya Akella⬩, Scott Shenker ⋆, Ion Stoica ⋆
    ⋆ UC Berkeley ▪ NYU ⬩ University of Wisconsin ▴ University of Michigan
    June 10, 2018


  2. Graphs popular in big data analytics


  8. Can benefit from timely analysis
    Often do not require exact answers

  9. Cellular Network Analytics


  10. Financial Network Analytics
    Image courtesy: Neo4J


  12. Graph Analytics
    Takes several minutes to produce exact answers


  13. Can we leverage approximation for
    graph analytics?


  20. Apply query on samples of the input data
    Approximate Analytics
    What is the average buffering ratio in the table?

    ID  City      Buff Ratio
    1   NYC       0.78
    2   NYC       0.13
    3   Berkeley  0.25
    4   NYC       0.19
    5   NYC       0.11
    6   Berkeley  0.09
    7   NYC       0.18
    8   NYC       0.15
    9   Berkeley  0.13
    10  Berkeley  0.49
    11  NYC       0.19
    12  Berkeley  0.10

    Uniform sample:
    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/4
    6   Berkeley  0.25        1/4
    8   NYC       0.19        1/4

    Exact answer: 0.2325
    Estimate from sample: 0.19 +/- 0.05


  22. Apply query on samples of the input data
    Approximate Analytics
    What is the average buffering ratio in the table? (same 12-row table)

    Uniform sample:
    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/2
    3   Berkeley  0.25        1/2
    5   NYC       0.11        1/2
    6   Berkeley  0.09        1/2
    8   NYC       0.15        1/2
    12  Berkeley  0.10        1/2

    Exact answer: 0.2325
    Estimate (1/4 sample): 0.19 +/- 0.05
    Estimate (1/2 sample): 0.22 +/- 0.02
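The sampling estimate on these slides can be reproduced in a few lines. This is a minimal sketch, not code from the talk: the function name `approx_mean` is mine, and the error bar is computed as a rough 95% confidence interval (1.96 × standard error of the sample mean), which is one standard way to get the "+/-" shown on the slides.

```python
import random
import statistics

# Buffering ratios from the slide's 12-row table
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def approx_mean(data, rate, seed=42):
    """Estimate the mean from a uniform sample, with a rough 95%
    interval (1.96 x standard error of the sample mean)."""
    rng = random.Random(seed)
    sample = [x for x in data if rng.random() < rate]
    est = statistics.mean(sample)
    err = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5
    return est, err

print(statistics.mean(ratios))   # exact answer: 0.2325
print(approx_mean(ratios, 0.5))  # (estimate, error) from a ~1/2 sample
```

A larger sampling rate shrinks the interval, which matches the slides' move from 0.19 +/- 0.05 at rate 1/4 to 0.22 +/- 0.02 at rate 1/2.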


  28. Can we use the same idea on graphs?
    Approximate Analytics on Graphs
    graph (vertices 0, 1, 2, 3, 4) → edge sampling (p = 0.5) → sampled graph
    Triangle counting on the sampled graph finds e = 1 triangle
    Result: e × (1/p)³ = 1 × 2³ = 8
    Exact answer: 10
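A minimal sketch of the estimator this slide illustrates, under the assumption that it uses the standard scaling rule: count triangles on an edge-sampled graph and divide by p³, since a triangle survives sampling only if all three of its edges are kept. Function names are mine, not from the talk.

```python
import itertools
import random

def count_triangles(edges):
    """Exact triangle count via common-neighbour intersection."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # each triangle is seen once per edge, hence the division by 3
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def approx_triangles(edges, p, seed=0):
    """Keep each edge with probability p, count triangles on the
    sample, and scale by 1/p^3 to undo the sampling."""
    rng = random.Random(seed)
    sample = [e for e in edges if rng.random() < p]
    return count_triangles(sample) / p ** 3

# K5: every pair of 5 vertices is an edge -> C(5,3) = 10 triangles
edges = list(itertools.combinations(range(5), 2))
print(count_triangles(edges))        # 10
print(approx_triangles(edges, 0.5))  # unbiased estimate, high variance on tiny graphs
```

The estimator is unbiased but noisy on a 5-vertex graph, which is exactly the slide's point: the sampled run reports 8 triangles against an exact answer of 10.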


  31. Challenge: Non-linear relation between sample
    size and runtime / error
    Approximate Analytics on Graphs
    How to sample graphs?
    What is the right sample size?
    How to compute the error for a given (iterative) graph query?


  32. Our Proposal: GAP
    Run algorithm A within T sec → Result, Error
    Components: Graph Algorithms, Sparsification Selector, Models, Sparsifier



  37. Sampling for Graph Approximation
    § Sparsification extensively studied in graph theory
    § Idea: approximate the graph using a sparse, much smaller graph
    § Many proposed sparsifiers are computationally intensive
    § Or not amenable to distributed implementation
    § Build on Spielman & Teng’s work*
    § A simple degree-based sparsifier: keep the edge between
      vertices a and b with probability
      (d_AVG × s) / min(d_a^out, d_b^in)
    § Exploring other sparsification techniques
    * Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs”
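A sketch of the degree-based sparsifier described above, under my reading of the keep-probability formula (d_AVG is the average degree, s the sparsification parameter, d_a^out and d_b^in the out- and in-degrees, with the probability clamped at 1). The helper names are illustrative, not from GAP.

```python
import random

def degrees(edges):
    """Out- and in-degree maps for a directed edge list."""
    out_deg, in_deg = {}, {}
    for a, b in edges:
        out_deg[a] = out_deg.get(a, 0) + 1
        in_deg[b] = in_deg.get(b, 0) + 1
    return out_deg, in_deg

def sparsify(edges, s, seed=0):
    """Keep edge (a, b) with probability
    min(1, d_avg * s / min(out_deg[a], in_deg[b])).
    Smaller s keeps fewer edges; high-degree endpoints are
    sampled more aggressively."""
    rng = random.Random(seed)
    out_deg, in_deg = degrees(edges)
    vertices = {v for e in edges for v in e}
    d_avg = len(edges) / len(vertices)
    kept = []
    for a, b in edges:
        p = min(1.0, d_avg * s / min(out_deg[a], in_deg[b]))
        if rng.random() < p:
            kept.append((a, b))
    return kept

edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
print(sparsify(edges, s=0.5))  # a sampled subset of the edge list
```

Biasing by endpoint degree keeps low-degree edges (which carry more structural information per edge) while thinning out dense neighbourhoods, and the per-edge decision makes the scheme embarrassingly parallel, hence its fit for a distributed setting.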


  39. Estimating the Error / Latency
    What is the error / speedup due to sparsification?
    Many approaches in the approximate processing literature:
    • Exhaustively run every possible point
    • Theoretical closed-form solutions
    • Experiment design / Bayesian techniques
    None applicable for graph approximation


  45. Building a model for s
    Use machine learning to build a model.
    Learn the relation between s and error / latency
    “The most important determinant of graph workload
    characteristics is typically the input graph and surprisingly not
    the implementation or even the graph kernel.” — Beamer et al.
    In distributed graph processing, communication (shuffles)
    dominates execution time.


  48. Building a model for s
    Use machine learning to build a model.
    Learn the relation between s and error / latency
    Model Builder: input features → Model (Random Forests)
    Learn H: (s, a, g) => e/l
    (s: sparsification parameter, a: algorithm, g: graph, e: error, l: latency)
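The shape of the model builder can be sketched as follows. The slide names Random Forests; this dependency-free sketch substitutes a 1-nearest-neighbour lookup over the feature vector so it runs with the standard library alone, and every feature name and number below is a made-up illustration, not GAP's actual feature set.

```python
# Toy stand-in for the model builder: learn H: (s, a, g) -> (error, latency)
# from offline benchmark runs.

def build_model(training_runs):
    """training_runs: list of ((s, algo, graph_feature), (error, latency))."""
    def predict(s, algo, graph_feature):
        # nearest neighbour among runs of the same algorithm
        candidates = [(feat, out) for feat, out in training_runs if feat[1] == algo]
        _, out = min(candidates,
                     key=lambda fo: abs(fo[0][0] - s) + abs(fo[0][2] - graph_feature))
        return out
    return predict

# Offline benchmark runs: sparsification parameter s, algorithm, average
# degree of the benchmark graph, and measured (error %, latency sec).
runs = [
    ((0.9, "pagerank", 20.0), (1.0, 9.0)),
    ((0.5, "pagerank", 20.0), (4.0, 5.0)),
    ((0.1, "pagerank", 20.0), (30.0, 2.0)),
]
H = build_model(runs)
print(H(0.45, "pagerank", 20.0))  # nearest run is s=0.5 -> (4.0, 5.0)
```

Given such an H, answering "run A within T sec" reduces to searching over s for the largest value whose predicted latency fits the budget, then reporting the predicted error alongside the result.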


  57. Building a model for s
    Model Builder: benchmark graphs/queries (e.g., Graph500) → Models
    Model Mapper: input graph → Model

  58. Preliminary Feasibility Evaluation
    § Implemented the sparsifier on Apache Spark
    § Not limited to it
    § Evaluated on a few real-world graphs
    § Largest: UK web graph, 3.73B edges
    § Goal: check if our assumptions hold


  60. Preliminary Feasibility Evaluation
    [Plots: speedup vs. sparsification parameter (0.9 down to 0.1),
    one panel for AstroPh and Facebook, one for Epinions and Wikivote]
    Performance trends similar for graphs that are similar



  63. Preliminary Feasibility Evaluation
    [Plot: speedup (1x–9x) and error (0–70%) vs. sparsification
    parameter (0.9 down to 0.1)]
    Bigger benefits achievable in large graphs

  64. Ongoing/Future Work
    § Deep Learning
      § Build better models.
    § Better sparsifiers
      § Can we cherry-pick sparsifiers?
    § Programming Language techniques
      § Can we synthesize approximate versions of an exact
        graph-parallel program?

  65. Conclusion
    § Approximate graph analytics is challenging
      § Unlike approximate query processing, no direct relation
        between graph size and latency/error.
    § Our proposal, GAP:
      § Uses sparsification theory to reduce the input to graph algorithms,
        and ML to learn the relation between the input and latency/error.
      § Initial results are encouraging.
    http://www.cs.berkeley.edu/~api
    [email protected]