Slide 1

Slide 1 text

Bridging the GAP: Towards Approximate Graph Analytics
Anand Iyer⋆, Aurojit Panda▪, Shivaram Venkataraman⬩, Mosharaf Chowdhury▴, Aditya Akella⬩, Scott Shenker⋆, Ion Stoica⋆
⋆UC Berkeley ▪NYU ⬩University of Wisconsin ▴University of Michigan
June 10, 2018

Slide 2

Slide 2 text

Graphs popular in big data analytics

Slide 3

Slide 3 text

Graphs popular in big data analytics

Slide 4

Slide 4 text

Graphs popular in big data analytics

Slide 5

Slide 5 text

Graphs popular in big data analytics

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Can benefit from timely analysis

Slide 8

Slide 8 text

Can benefit from timely analysis Often do not require exact answers

Slide 9

Slide 9 text

Cellular Network Analytics

Slide 10

Slide 10 text

Financial Network Analytics Image courtesy: Neo4J

Slide 11

Slide 11 text

Graph Analytics

Slide 12

Slide 12 text

Graph Analytics Takes several minutes to produce exact answers

Slide 13

Slide 13 text

Can we leverage approximation for graph analytics?

Slide 14

Slide 14 text

Approximate Analytics
Apply query on samples of the input data

Slide 15

Slide 15 text

Approximate Analytics
Apply query on samples of the input data

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

Slide 16

Slide 16 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Slide 17

Slide 17 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?
Exact answer: 0.2325

Slide 18

Slide 18 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Slide 19

Slide 19 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Sample estimate: 0.19 (exact: 0.2325)

Slide 20

Slide 20 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Sample estimate: 0.19 +/- 0.05 (exact: 0.2325)

Slide 21

Slide 21 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.11        1/2
6   Berkeley  0.09        1/2
8   NYC       0.15        1/2
12  Berkeley  0.10        1/2

Slide 22

Slide 22 text

Approximate Analytics
Apply query on samples of the input data
[Buffering-ratio table as on Slide 15]
What is the average buffering ratio in the table?

Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.11        1/2
6   Berkeley  0.09        1/2
8   NYC       0.15        1/2
12  Berkeley  0.10        1/2

Estimates: 0.19 +/- 0.05 (rate 1/4), 0.22 +/- 0.02 (rate 1/2); exact: 0.2325
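The uniform-sampling estimate built up over these slides can be sketched in plain Python. The 1.96·√(var/n) term below is a standard normal-approximation confidence bound; whether the slides' ± values were computed exactly this way is an assumption.

```python
import math
import random

# Full table of buffering ratios from the slide.
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def approximate_mean(data, rate, seed=0):
    """Estimate the mean from a uniform Bernoulli sample, with a ~95%
    normal-approximation error bound (an assumed error model)."""
    random.seed(seed)
    sample = [x for x in data if random.random() < rate]
    if not sample:
        return None, None
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / max(n - 1, 1)
    err = 1.96 * math.sqrt(var / n)   # central-limit-theorem bound
    return mean, err

exact = sum(ratios) / len(ratios)        # 0.2325, as on the slide
est, err = approximate_mean(ratios, 0.5)
```

The key trade-off the slides illustrate: a larger sampling rate tightens the error bound but processes more data.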

Slide 23

Slide 23 text

Can we use the same idea on graphs? Approximate Analytics on Graphs

Slide 24

Slide 24 text

Can we use the same idea on graphs? Approximate Analytics on Graphs [Diagram: input graph with vertices 0–4]

Slide 25

Slide 25 text

Can we use the same idea on graphs? Approximate Analytics on Graphs [Diagram: input graph (vertices 0–4) and its edge-sampled version, p = 0.5]

Slide 26

Slide 26 text

Can we use the same idea on graphs? Approximate Analytics on Graphs [Diagram: input graph (vertices 0–4) and its edge-sampled version, p = 0.5] Triangle counting on the sample finds e = 1 triangle

Slide 27

Slide 27 text

Can we use the same idea on graphs? Approximate Analytics on Graphs [Diagram: input graph (vertices 0–4) and its edge-sampled version, p = 0.5] Triangle counting on the sample finds e = 1 triangle; scaled result: e / p = 2

Slide 28

Slide 28 text

Can we use the same idea on graphs? Approximate Analytics on Graphs [Diagram: input graph (vertices 0–4) and its edge-sampled version, p = 0.5] Triangle counting on the sample finds e = 1 triangle; scaled result: e / p = 2. True answer: 10
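The triangle-counting example can be sketched as follows. A triangle survives edge sampling only if all three of its edges survive (probability p³), so the standard unbiased estimator divides the sampled count by p³; this is illustrative and not necessarily the slides' exact computation. The graph is taken to be complete on 5 vertices, consistent with the slides' true answer of 10 triangles.

```python
import random
from itertools import combinations

def count_triangles(edges, n):
    """Exact triangle count over vertices 0..n-1 via adjacency sets."""
    adj = [set() for _ in range(n)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return sum(1 for u, v, w in combinations(range(n), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])

def approx_triangles(edges, n, p, seed=1):
    """Edge-sample with probability p, then scale the sampled count
    by 1/p^3 (each triangle survives with probability p^3)."""
    random.seed(seed)
    sampled = [e for e in edges if random.random() < p]
    return count_triangles(sampled, n) / p ** 3

# Complete graph on 5 vertices: C(5,3) = 10 triangles.
k5 = list(combinations(range(5), 2))
exact = count_triangles(k5, 5)
```

Even with the right scaling, the variance of such estimators on small samples is large, which is exactly the difficulty the next slides raise.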

Slide 29

Slide 29 text

Challenge: Non-linear relation between sample size and runtime / error Approximate Analytics on Graphs

Slide 30

Slide 30 text

Challenge: Non-linear relation between sample size and runtime / error Approximate Analytics on Graphs [Plot: runtime vs. sample size, illustrating the non-linear relationship]

Slide 31

Slide 31 text

Challenge: Non-linear relation between sample size and runtime / error Approximate Analytics on Graphs How to sample graphs? What is the right sample size? How to compute the error for a given (iterative) graph query?

Slide 32

Slide 32 text

Our Proposal: GAP [System diagram: query "Run A within T sec" → Sparsification Selector (consults Models) → Sparsifier → Graph Algorithms → Result, Error]

Slide 33

Slide 33 text

Our Proposal: GAP [System diagram: query "Run A within T sec" → Sparsification Selector (consults Models) → Sparsifier → Graph Algorithms → Result, Error]

Slide 34

Slide 34 text

Our Proposal: GAP [System diagram: query "Run A within T sec" → Sparsification Selector (consults Models) → Sparsifier → Graph Algorithms → Result, Error]

Slide 35

Slide 35 text

Our Proposal: GAP [System diagram: query "Run A within T sec" → Sparsification Selector (consults Models) → Sparsifier → Graph Algorithms → Result, Error]

Slide 36

Slide 36 text

* Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs” Sampling for Graph Approximation § Sparsification extensively studied in graph theory § Idea: approximate the graph using a sparse, much smaller graph that preserves crucial properties of the input § Many proposed sparsifiers are computationally intensive § Or not amenable to distributed implementation (the focus of our work) § Initial solution: a simple sparsifier adapted from Spielman & Teng’s work*, based on vertex degree § Keep the edge between vertices a and b with probability (d_AVG × s) / min(d_a^out, d_b^in)

Slide 37

Slide 37 text

* Daniel A. Spielman and Shang-Hua Teng. “Spectral Sparsification of Graphs” Sampling for Graph Approximation § Sparsification extensively studied in graph theory § Idea: approximate the graph using a sparse, much smaller graph that preserves crucial properties of the input § Many proposed sparsifiers are computationally intensive § Or not amenable to distributed implementation (the focus of our work) § Initial solution: a simple sparsifier adapted from Spielman & Teng’s work*, based on vertex degree § Keep the edge between vertices a and b with probability (d_AVG × s) / min(d_a^out, d_b^in) § Exploring other sparsification techniques
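The degree-based rule above can be sketched as a single-machine helper: keep directed edge (a, b) with probability min(1, d_AVG · s / min(d_a^out, d_b^in)). This is an illustrative sketch, not the paper's distributed implementation, and the helper name is hypothetical.

```python
import random
from collections import Counter

def degree_sparsify(edges, s, seed=0):
    """Keep directed edge (a, b) with probability
    min(1, d_avg * s / min(out_deg(a), in_deg(b))).
    Smaller s -> sparser output graph."""
    out_deg = Counter(a for a, b in edges)
    in_deg = Counter(b for a, b in edges)
    vertices = set(out_deg) | set(in_deg)
    d_avg = len(edges) / max(len(vertices), 1)   # average out-degree
    random.seed(seed)
    return [(a, b) for a, b in edges
            if random.random() < min(1.0, d_avg * s / min(out_deg[a], in_deg[b]))]
```

Because the keep-probability is inversely proportional to the smaller endpoint degree, edges incident on low-degree vertices are preferentially retained, which helps preserve graph connectivity as s shrinks.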

Slide 38

Slide 38 text

Estimating the Error / Latency What is the error / speedup due to sparsification? Many approaches in the approximate processing literature: • Exhaustively run every possible point • Theoretical closed-form solutions • Experiment design / Bayesian techniques

Slide 39

Slide 39 text

Estimating the Error / Latency What is the error / speedup due to sparsification? Many approaches in the approximate processing literature: • Exhaustively run every possible point • Theoretical closed-form solutions • Experiment design / Bayesian techniques None applicable for graph approximation

Slide 40

Slide 40 text

Building a model for s Use machine learning to build a model.

Slide 41

Slide 41 text

Building a model for s Model Builder Use machine learning to build a model.

Slide 42

Slide 42 text

Building a model for s Model Builder Use machine learning to build a model. Learn the relation between s and error / latency

Slide 43

Slide 43 text

Building a model for s Model Builder Input features Use machine learning to build a model. Learn the relation between s and error / latency

Slide 44

Slide 44 text

Building a model for s Model Builder Input features Model Use machine learning to build a model. Learn the relation between s and error / latency

Slide 45

Slide 45 text

Building a model for s Model Builder Input features Model Use machine learning to build a model. Learn the relation between s and error / latency “The most important determinant of graph workload characteristics is typically the input graph and surprisingly not the implementation or even the graph kernel.” Beamer et al. In distributed graph processing, communication (shuffles) dominates execution time.

Slide 46

Slide 46 text

Building a model for s Model Builder Input features Model Use machine learning to build a model. Learn the relation between s and error / latency

Slide 47

Slide 47 text

Building a model for s Model Builder Learn H: (s, a, g) => e/l Input features Model Use machine learning to build a model. Learn the relation between s and error / latency

Slide 48

Slide 48 text

Building a model for s Model Builder Learn H: (s, a, g) => e/l Input features Model Random Forests Use machine learning to build a model. Learn the relation between s and error / latency
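Learning H: (s, a, g) => e/l with random forests could look like the sketch below, using scikit-learn as one possible implementation (the slides do not name a library). The feature encoding and training values are hypothetical placeholders; real features would come from profiling runs.

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: each row is (sparsification parameter s,
# algorithm id a, a graph feature g such as average degree); the target
# is the error observed at that configuration.
X = [[0.9, 0, 16.0], [0.7, 0, 16.0], [0.5, 0, 16.0], [0.3, 0, 16.0],
     [0.9, 1, 2.5],  [0.7, 1, 2.5],  [0.5, 1, 2.5],  [0.3, 1, 2.5]]
y = [0.02, 0.08, 0.20, 0.45, 0.01, 0.05, 0.12, 0.30]

# H maps (s, a, g) to a predicted error; an analogous model could
# predict latency instead.
H = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
predicted_error = H.predict([[0.6, 0, 16.0]])[0]
```

The selector can then invert this model: scan candidate values of s and pick the largest speedup whose predicted error stays within the user's bound.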

Slide 49

Slide 49 text

Building a model for s Model Builder

Slide 50

Slide 50 text

Building a model for s Model Builder Input Graph

Slide 51

Slide 51 text

Building a model for s Model Builder Models Input Graph

Slide 52

Slide 52 text

Building a model for s Model Builder Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)

Slide 53

Slide 53 text

Building a model for s Model Builder Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)

Slide 54

Slide 54 text

Building a model for s Model Builder Model Mapper Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)

Slide 55

Slide 55 text

Building a model for s Model Builder Model Mapper Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)

Slide 56

Slide 56 text

Building a model for s Model Builder Model Mapper Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)

Slide 57

Slide 57 text

Building a model for s Model Builder Model Mapper Model Models Input Graph Benchmark Graphs/Queries (e.g., Graph500)
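One way to read the Model Mapper step on these slides: models are trained offline on benchmark graphs (e.g., Graph500), and at query time the input graph is matched to the model whose graph signature is closest. The benchmark names and feature vectors below are hypothetical illustrations.

```python
import math

# Hypothetical (average degree, clustering coefficient) signatures for
# models trained offline on benchmark graphs.
benchmark_models = {
    "graph500-scale20": (16.0, 0.11),
    "road-network":     (2.5,  0.03),
    "social-small":     (40.0, 0.25),
}

def map_to_model(features):
    """Return the benchmark model whose signature is nearest
    (Euclidean distance) to the input graph's features."""
    return min(benchmark_models,
               key=lambda name: math.dist(features, benchmark_models[name]))
```

This avoids profiling every new input graph from scratch, relying on the observation quoted earlier that the input graph, not the kernel, dominates workload characteristics.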

Slide 58

Slide 58 text

Preliminary Feasibility Evaluation § Implemented the sparsifier on Apache Spark § Not limited to it § Evaluated on a few real-world graphs § Largest: UK web graph (3.73B edges) § Goal: check if our assumptions hold

Slide 59

Slide 59 text

Preliminary Feasibility Evaluation [Plots: speedup (0–2.5x) vs. sparsification parameter (0.9 down to 0.1), for AstroPh and Facebook, and for Epinions and Wikivote]

Slide 60

Slide 60 text

Preliminary Feasibility Evaluation [Plots: speedup (0–2.5x) vs. sparsification parameter (0.9 down to 0.1), for AstroPh and Facebook, and for Epinions and Wikivote] Performance trends similar for graphs that are similar

Slide 61

Slide 61 text

Preliminary Feasibility Evaluation [Plot: axis labels not recoverable]

Slide 62

Slide 62 text

Preliminary Feasibility Evaluation [Plot: speedup (1–9x, left axis) and error % (0–70, right axis) vs. sparsification parameter (0.9 down to 0.1)]

Slide 63

Slide 63 text

Preliminary Feasibility Evaluation [Plot: speedup (1–9x, left axis) and error % (0–70, right axis) vs. sparsification parameter (0.9 down to 0.1)] Bigger benefits achievable in large graphs

Slide 64

Slide 64 text

Ongoing/Future Work § Deep Learning § Build better models § Better sparsifiers § Can we cherry-pick sparsifiers? § Programming Language techniques § Can we synthesize approximate versions of an exact graph-parallel program?

Slide 65

Slide 65 text

Conclusion § Approximate graph analytics is challenging § Unlike approximate query processing, there is no direct relation between graph size and latency/error § Our proposal, GAP: § Uses sparsification theory to reduce the input to graph algorithms, and ML to learn the relation between the input and latency/error § Initial results are encouraging http://www.cs.berkeley.edu/~api [email protected]