
Bridging the GAP: Towards Approximate Graph Analytics


Anand Iyer

June 10, 2018

  1. Bridging the GAP: Towards
    Approximate Graph Analytics
    Anand Iyer ⋆, Aurojit Panda ▪, Shivaram Venkataraman⬩,
    Mosharaf Chowdhury▴, Aditya Akella⬩, Scott Shenker ⋆, Ion Stoica ⋆
    ⋆ UC Berkeley ▪ NYU ⬩ University of Wisconsin ▴ University of Michigan
    June 10, 2018


  2. Graphs popular in big data analytics


  8. Can benefit from timely analysis
    Often do not require exact answers

  9. Cellular Network Analytics


  10. Financial Network Analytics
    Image courtesy: Neo4J


  12. Graph Analytics
    Takes several minutes to produce exact answers


  13. Can we leverage approximation for
    graph analytics?


  20. Apply query on samples of the input data
    Approximate Analytics
    What is the average buffering ratio in the table?

    ID  City      Buff Ratio
    1   NYC       0.78
    2   NYC       0.13
    3   Berkeley  0.25
    4   NYC       0.19
    5   NYC       0.11
    6   Berkeley  0.09
    7   NYC       0.18
    8   NYC       0.15
    9   Berkeley  0.13
    10  Berkeley  0.49
    11  NYC       0.19
    12  Berkeley  0.10

    Uniform sample:
    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/4
    6   Berkeley  0.25        1/4
    8   NYC       0.19        1/4

    Exact answer: 0.2325
    Estimate from sample: 0.19 +/- 0.05


  22. Apply query on samples of the input data
    Approximate Analytics
    What is the average buffering ratio in the table? (same 12-row table)

    Uniform sample:
    ID  City      Buff Ratio  Sampling Rate
    2   NYC       0.13        1/2
    3   Berkeley  0.25        1/2
    5   NYC       0.11        1/2
    6   Berkeley  0.09        1/2
    8   NYC       0.15        1/2
    12  Berkeley  0.10        1/2

    Exact answer: 0.2325
    Estimate (1/4 sample): 0.19 +/- 0.05
    Estimate (1/2 sample): 0.22 +/- 0.02
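The sampling estimate on these slides can be reproduced in a few lines. This is a minimal sketch, not code from the talk: the function name `approx_mean` is mine, and the error bar is computed as a rough 95% confidence interval (1.96 × standard error of the sample mean), which is one standard way to get the "+/-" shown on the slides.

```python
import random
import statistics

# Buffering ratios from the slide's 12-row table
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def approx_mean(data, rate, seed=42):
    """Estimate the mean from a uniform sample, with a rough 95%
    interval (1.96 x standard error of the sample mean)."""
    rng = random.Random(seed)
    sample = [x for x in data if rng.random() < rate]
    est = statistics.mean(sample)
    err = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5
    return est, err

print(statistics.mean(ratios))   # exact answer: 0.2325
print(approx_mean(ratios, 0.5))  # (estimate, error) from a ~1/2 sample
```

A larger sampling rate shrinks the interval, which matches the slides' move from 0.19 +/- 0.05 at rate 1/4 to 0.22 +/- 0.02 at rate 1/2.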


  28. Can we use the same idea on graphs?
    Approximate Analytics on Graphs
    graph (vertices 0, 1, 2, 3, 4) → edge sampling (p = 0.5) → sampled graph
    Triangle counting on the sampled graph finds e = 1 triangle
    Result: e × (1/p)³ = 1 × 2³ = 8
    Exact answer: 10
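A minimal sketch of the estimator this slide illustrates, under the assumption that it uses the standard scaling rule: count triangles on an edge-sampled graph and divide by p³, since a triangle survives sampling only if all three of its edges are kept. Function names are mine, not from the talk.

```python
import itertools
import random

def count_triangles(edges):
    """Exact triangle count via common-neighbour intersection."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # each triangle is seen once per edge, hence the division by 3
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def approx_triangles(edges, p, seed=0):
    """Keep each edge with probability p, count triangles on the
    sample, and scale by 1/p^3 to undo the sampling."""
    rng = random.Random(seed)
    sample = [e for e in edges if rng.random() < p]
    return count_triangles(sample) / p ** 3

# K5: every pair of 5 vertices is an edge -> C(5,3) = 10 triangles
edges = list(itertools.combinations(range(5), 2))
print(count_triangles(edges))        # 10
print(approx_triangles(edges, 0.5))  # unbiased estimate, high variance on tiny graphs
```

The estimator is unbiased but noisy on a 5-vertex graph, which is exactly the slide's point: the sampled run reports 8 triangles against an exact answer of 10.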


  31. Challenge: Non-linear relation between sample
    size and runtime / error
    Approximate Analytics on Graphs
    How to sample graphs?
    What is the right sample size?
    How to compute the error for a given (iterative) graph query?


  32. Our Proposal: GAP
    Run algorithm A within T sec → Result, Error
    Components: Graph Algorithms, Sparsification Selector, Models, Sparsifier



  37. Sampling for Graph Approximation
    § Sparsification extensively studied in graph theory
    § Idea: approximate the graph using a sparse, much smaller graph
    § Many proposed sparsifiers are computationally intensive
    § Or not amenable to distributed implementation
    § Build on Spielman & Teng’s work*
    § A simple degree-based sparsifier: keep the edge between
      vertices a and b with probability
      (d_AVG × s) / min(d_a^out, d_b^in)
    § Exploring other sparsification techniques
    * Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs”
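A sketch of the degree-based sparsifier described above, under my reading of the keep-probability formula (d_AVG is the average degree, s the sparsification parameter, d_a^out and d_b^in the out- and in-degrees, with the probability clamped at 1). The helper names are illustrative, not from GAP.

```python
import random

def degrees(edges):
    """Out- and in-degree maps for a directed edge list."""
    out_deg, in_deg = {}, {}
    for a, b in edges:
        out_deg[a] = out_deg.get(a, 0) + 1
        in_deg[b] = in_deg.get(b, 0) + 1
    return out_deg, in_deg

def sparsify(edges, s, seed=0):
    """Keep edge (a, b) with probability
    min(1, d_avg * s / min(out_deg[a], in_deg[b])).
    Smaller s keeps fewer edges; high-degree endpoints are
    sampled more aggressively."""
    rng = random.Random(seed)
    out_deg, in_deg = degrees(edges)
    vertices = {v for e in edges for v in e}
    d_avg = len(edges) / len(vertices)
    kept = []
    for a, b in edges:
        p = min(1.0, d_avg * s / min(out_deg[a], in_deg[b]))
        if rng.random() < p:
            kept.append((a, b))
    return kept

edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
print(sparsify(edges, s=0.5))  # a sampled subset of the edge list
```

Biasing by endpoint degree keeps low-degree edges (which carry more structural information per edge) while thinning out dense neighbourhoods, and the per-edge decision makes the scheme embarrassingly parallel, hence its fit for a distributed setting.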


  39. Estimating the Error / Latency
    What is the error / speedup due to sparsification?
    Many approaches in the approximate processing literature:
    • Exhaustively run every possible point
    • Theoretical closed-form solutions
    • Experiment design / Bayesian techniques
    None applicable for graph approximation


  45. Building a model for s
    Use machine learning to build a model.
    Learn the relation between s and error / latency
    “The most important determinant of graph workload
    characteristics is typically the input graph and surprisingly not
    the implementation or even the graph kernel.” — Beamer et al.
    In distributed graph processing, communication (shuffles)
    dominates execution time.


  48. Building a model for s
    Use machine learning to build a model.
    Learn the relation between s and error / latency
    Model Builder: input features → Model (Random Forests)
    Learn H: (s, a, g) => e/l
    (s: sparsification parameter, a: algorithm, g: graph, e: error, l: latency)
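The shape of the model builder can be sketched as follows. The slide names Random Forests; this dependency-free sketch substitutes a 1-nearest-neighbour lookup over the feature vector so it runs with the standard library alone, and every feature name and number below is a made-up illustration, not GAP's actual feature set.

```python
# Toy stand-in for the model builder: learn H: (s, a, g) -> (error, latency)
# from offline benchmark runs.

def build_model(training_runs):
    """training_runs: list of ((s, algo, graph_feature), (error, latency))."""
    def predict(s, algo, graph_feature):
        # nearest neighbour among runs of the same algorithm
        candidates = [(feat, out) for feat, out in training_runs if feat[1] == algo]
        _, out = min(candidates,
                     key=lambda fo: abs(fo[0][0] - s) + abs(fo[0][2] - graph_feature))
        return out
    return predict

# Offline benchmark runs: sparsification parameter s, algorithm, average
# degree of the benchmark graph, and measured (error %, latency sec).
runs = [
    ((0.9, "pagerank", 20.0), (1.0, 9.0)),
    ((0.5, "pagerank", 20.0), (4.0, 5.0)),
    ((0.1, "pagerank", 20.0), (30.0, 2.0)),
]
H = build_model(runs)
print(H(0.45, "pagerank", 20.0))  # nearest run is s=0.5 -> (4.0, 5.0)
```

Given such an H, answering "run A within T sec" reduces to searching over s for the largest value whose predicted latency fits the budget, then reporting the predicted error alongside the result.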


  57. Building a model for s
    Model Builder: benchmark graphs/queries (e.g., Graph500) → Models
    Model Mapper: input graph → Model

  58. Preliminary Feasibility Evaluation
    § Implemented the sparsifier on Apache Spark
    § Not limited to it
    § Evaluated on a few real-world graphs
    § Largest: UK web graph, 3.73B edges
    § Goal: check if our assumptions hold


  60. Preliminary Feasibility Evaluation
    [Plots: speedup vs. sparsification parameter (0.9 down to 0.1),
    one panel for AstroPh and Facebook, one for Epinions and Wikivote]
    Performance trends similar for graphs that are similar



  63. Preliminary Feasibility Evaluation
    [Plot: speedup (1x–9x) and error (0–70%) vs. sparsification
    parameter (0.9 down to 0.1)]
    Bigger benefits achievable in large graphs

  64. Ongoing/Future Work
    § Deep Learning
      § Build better models.
    § Better sparsifiers
      § Can we cherry-pick sparsifiers?
    § Programming Language techniques
      § Can we synthesize approximate versions of an exact
        graph-parallel program?

  65. Conclusion
    § Approximate graph analytics is challenging
      § Unlike approximate query processing, no direct relation
        between graph size and latency/error.
    § Our proposal, GAP:
      § Uses sparsification theory to reduce the input to graph algorithms,
        and ML to learn the relation between the input and latency/error.
      § Initial results are encouraging.
    http://www.cs.berkeley.edu/~api
    [email protected]