Bridging the GAP: Towards Approximate Graph Analytics

Anand Iyer

June 10, 2018

Transcript

  1. 1.

    Bridging the GAP: Towards Approximate Graph Analytics

    Anand Iyer ⋆, Aurojit Panda ▪, Shivaram Venkataraman ⬩, Mosharaf Chowdhury ▴, Aditya Akella ⬩, Scott Shenker ⋆, Ion Stoica ⋆

    ⋆ UC Berkeley, ▪ NYU, ⬩ University of Wisconsin, ▴ University of Michigan

    June 10, 2018
  2. 6.
  3. 15.

    Approximate Analytics: apply query on samples of the input data

    ID  City      Buff Ratio
    1   NYC       0.78
    2   NYC       0.13
    3   Berkeley  0.25
    4   NYC       0.19
    5   NYC       0.11
    6   Berkeley  0.09
    7   NYC       0.18
    8   NYC       0.15
    9   Berkeley  0.13
    10  Berkeley  0.49
    11  NYC       0.19
    12  Berkeley  0.10
  5. 17.

    Approximate Analytics: apply query on samples of the input data

    What is the average buffering ratio in the table (IDs 1 to 12, as on the previous slide)?

    Exact answer: 0.2325
  8. 20.

    Approximate Analytics: apply query on samples of the input data

    What is the average buffering ratio in the table?

    Uniform Sample:

    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/4
    6   Berkeley  0.25        1/4
    8   NYC       0.19        1/4

    Estimate from the sample: 0.19 +/- 0.05 (exact: 0.2325)
  10. 22.

    Approximate Analytics: apply query on samples of the input data

    What is the average buffering ratio in the table?

    Uniform Sample:

    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/2
    3   Berkeley  0.25        1/2
    5   NYC       0.11        1/2
    6   Berkeley  0.09        1/2
    8   NYC       0.15        1/2
    12  Berkeley  0.10        1/2

    Estimates: 0.19 +/- 0.05 (rate 1/4), 0.22 +/- 0.02 (rate 1/2); exact: 0.2325
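The sampling idea on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the method behind the numbers on the slides: the function names are made up for the example, and the error bar uses a normal-approximation 95% interval, which is an assumption (the slides do not say how their +/- values were computed).

```python
import math
import random
import statistics

# Buffering ratios from the table on the slides (IDs 1 to 12).
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]


def exact_mean(values):
    """Exact average over the full table."""
    return sum(values) / len(values)


def sampled_mean(values, rate, seed=0):
    """Estimate the average from a uniform sample without replacement.

    Returns (estimate, half_width). half_width is a rough 95% interval
    from the normal approximation; this choice is an assumption.
    """
    rng = random.Random(seed)
    k = max(2, round(len(values) * rate))  # need at least 2 rows for a stdev
    sample = rng.sample(values, k)
    estimate = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(k)
    return estimate, 1.96 * std_err


if __name__ == "__main__":
    print("exact:", round(exact_mean(BUFF_RATIO), 4))
    for rate in (0.25, 0.5):
        est, hw = sampled_mean(BUFF_RATIO, rate)
        print(f"rate {rate}: {est:.3f} +/- {hw:.3f}")
```

With a larger sampling rate the sample holds more rows, so the interval tends to narrow, which is the trend the slides illustrate.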
  11. 25.

    Approximate Analytics on Graphs: can we use the same idea on graphs?

    [Figure: a graph on vertices 0 to 4, and the same graph after edge sampling (p = 0.5)]
  14. 28.

    Approximate Analytics on Graphs: can we use the same idea on graphs?

    [Figure: a graph on vertices 0 to 4, and the same graph after edge sampling (p = 0.5)]

    Triangle counting on the sample: e = 1, scaled result = 2

    Answer on the full graph: 10
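To make the triangle-counting example concrete, here is a small Python sketch. It assumes the five-vertex graph on the slide is the complete graph on vertices 0 to 4 (which is consistent with the stated answer of 10, but the figure itself is not recoverable from the transcript). The estimator shown divides the sampled count by p^3, a common choice for independent edge sampling because each triangle survives with probability p^3; it is not necessarily the scaling used on the slide.

```python
import itertools
import random


def count_triangles(edges):
    """Count triangles in an undirected graph.

    `edges` must be a list of unique (u, v) pairs, one per undirected edge.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        # Each common neighbour of u and v closes one triangle on this edge.
        count += len(adj[u] & adj[v])
    return count // 3  # every triangle is seen once per each of its 3 edges


def sample_edges(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]


def estimate_triangles(edges, p, rng):
    """Scale the sampled triangle count by 1 / p^3."""
    return count_triangles(sample_edges(edges, p, rng)) / p ** 3


# Assumption: the slide's graph is the complete graph on vertices 0..4.
K5 = list(itertools.combinations(range(5), 2))

if __name__ == "__main__":
    rng = random.Random(0)
    print("exact:", count_triangles(K5))
    print("one estimate at p=0.5:", estimate_triangles(K5, 0.5, rng))
```

On a graph this small a single sample can be far from the true count, which is the difficulty the slide points at: the error does not shrink in a simple, predictable way with the sampling rate.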
  15. 30.

    Challenge: Non-linear relation between sample size and runtime / error

    Approximate Analytics on Graphs
  16. 31.

    Challenge: Non-linear relation between sample size and runtime / error

    Approximate Analytics on Graphs How to sample graphs? What is the right sample size? How to compute the error for a given (iterative) graph query?
  17. 32.

    Our Proposal: GAP

    Input: run A within T sec. Output: result, error.

    [Diagram components: Graph Algorithms, Sparsification Selector, Models, Sparsifier]
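One way to read the "run A within T sec" request is as a selection problem over the sparsification parameter s. The sketch below is only an illustration under stated assumptions: `select_s`, `toy_latency`, and `CANDIDATES` are hypothetical names, the latency model is a made-up linear function rather than a learned model, and the rule "pick the largest feasible s" assumes that a larger s keeps more edges and therefore tends to give lower error.

```python
def select_s(predict_latency, budget_sec, candidates):
    """Return the largest candidate s whose predicted latency fits the budget.

    predict_latency: callable mapping s to predicted seconds (hypothetical
    stand-in for a learned model). Returns None if no candidate fits.
    """
    feasible = [s for s in candidates if predict_latency(s) <= budget_sec]
    return max(feasible) if feasible else None


def toy_latency(s):
    """Made-up latency model for illustration: 2 s fixed cost plus 10 s per unit of s."""
    return 2.0 + 10.0 * s


# Candidate sparsification parameters 0.1, 0.2, ..., 0.9.
CANDIDATES = [i / 10 for i in range(1, 10)]

if __name__ == "__main__":
    print(select_s(toy_latency, 7.5, CANDIDATES))
```

An error model could be plugged in the same way to report the predicted error alongside the chosen s.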
  22. 37.

    Sampling for Graph Approximation

    § Sparsification extensively studied in graph theory
    § Idea: approximate the graph using a sparse, much smaller graph
    § Many computationally intensive
    § Not amenable to distributed implementation
    § Build on Spielman & Teng’s work*
    § Keep the edge between vertex a and b with probability $d_{AVG} \times s / \min(d^{o}_{a}, d^{i}_{b})$
    § Exploring other sparsification techniques

    Paper excerpt shown on the slide (truncated at both ends): "[...] While several proposals on the type of sparsifier exist, many of them are either computationally intensive, or are not amenable to distributed implementation (which is the focus of our work). As an initial solution, we developed a simple sparsifier adapted from the work of Spielman and Teng [31] that is based on vertex degree. The sparsifier uses the following probability to decide to keep an edge between vertex a and b: $d_{AVG} \times s / \min(d^{o}_{a}, d^{i}_{b})$ (1) [...]"

    * Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs”
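The degree-based keep probability on this slide can be sketched as follows. Several details are assumptions made for the example, since the slide text is truncated: d^o_a is read as the out-degree of a, d^i_b as the in-degree of b, d_AVG as |E| / |V|, and the probability is clamped to 1. The function name `sparsify` is hypothetical, and this runs on a single machine rather than on Apache Spark.

```python
import random
from collections import Counter


def sparsify(edges, s, seed=0):
    """Keep each directed edge (a, b) with probability
    min(1, d_avg * s / min(out_deg[a], in_deg[b])).

    Assumptions: d_avg = |E| / |V|, and the probability is clamped to 1.
    """
    if not edges:
        return []
    rng = random.Random(seed)
    out_deg = Counter(a for a, _ in edges)
    in_deg = Counter(b for _, b in edges)
    vertices = set(out_deg) | set(in_deg)
    d_avg = len(edges) / len(vertices)
    kept = []
    for a, b in edges:
        # out_deg[a] and in_deg[b] are both >= 1 because (a, b) is an edge.
        p = min(1.0, d_avg * s / min(out_deg[a], in_deg[b]))
        if rng.random() < p:
            kept.append((a, b))
    return kept


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 0), (0, 2), (3, 0), (4, 0)]
    for s in (0.1, 0.5, 0.9):
        print(s, len(sparsify(edges, s)), "of", len(edges), "edges kept")
```

Under this reading, edges whose endpoints have low degree are more likely to be kept, and s scales how aggressively the graph is thinned.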
  24. 39.

    Estimating the Error / Latency

    What is the error / speedup due to sparsification?

    Many approaches in approximate processing literature:
    • Exhaustively run every possible point
    • Theoretical closed-bound solutions
    • Experiment design / Bayesian techniques

    None applicable for graph approximation
  27. 44.

    Building a model for s

    Use machine learning to build a model. Learn the relation between s and error / latency.

    [Diagram components: Input features, Model Builder, Model]
  28. 45.

    Building a model for s

    Use machine learning to build a model. Learn the relation between s and error / latency.

    [Diagram components: Input features, Model Builder, Model]

    “The most important determinant of graph workload characteristics is typically the input graph and surprisingly not the implementation or even the graph kernel.” Beamer et al.

    In distributed graph processing, communication (shuffles) dominates execution time.
  31. 48.

    Building a model for s

    Use machine learning to build a model. Learn the relation between s and error / latency.

    Learn H: (s, a, g) => e/l

    Random Forests

    [Diagram components: Input features, Model Builder, Model]
  34. 54.

    Building a model for s

    [Diagram components: Benchmark Graphs/Queries (e.g., Graph500), Model Builder, Models, Input Graph (vertices 0 to 4), Model Mapper]
  37. 57.

    Building a model for s

    [Diagram components: Benchmark Graphs/Queries (e.g., Graph500), Model Builder, Models, Input Graph (vertices 0 to 4), Model Mapper, Model]
  38. 58.

    Preliminary Feasibility Evaluation

    § Implemented sparsifier on Apache Spark
      § Not limited to it
    § Evaluated on a few real-world graphs
      § Largest: UK, 3.73B edges
    § Goal: check if our assumptions hold
  40. 60.

    Preliminary Feasibility Evaluation

    [Figure: two plots of Speedup (y-axis, 0 to 2.5) against Sparsification Parameter (x-axis, 0.9 down to 0.1). Left panel: AstroPh and Facebook. Right panel: Epinions and Wikivote.]

    Performance trends similar for graphs that are similar
  41. 61.

    Preliminary Feasibility Evaluation
  43. 63.

    Preliminary Feasibility Evaluation

    [Figure: Speedup (axis 1 to 9) and Error in % (axis 0 to 70) against Sparsification Parameter (0.9 down to 0.1)]

    Bigger benefits achievable in large graphs
  44. 64.

    Ongoing/Future Work

    § Deep Learning
      § Build better models.
    § Better sparsifiers
      § Can we cherry pick sparsifiers?
    § Programming Language techniques
      § Can we synthesize approximate versions of an exact graph-parallel program?
  45. 65.

    Conclusion

    § Approximate graph analytics is challenging
      § Unlike approximate query processing, there is no direct relation between graph size and latency/error.
    § Our proposal, GAP:
      § Uses sparsification theory to reduce the input to graph algorithms, and ML to learn the relation between the input and latency/error.
    § Initial results are encouraging.

    http://www.cs.berkeley.edu/~api
    api@cs.berkeley.edu