Slide 1

Slide 1 text

Towards Fast and Scalable Graph Pattern Mining
Anand Iyer ⋆, Zaoxing Liu ⬩, Xin Jin ⬩, Shivaram Venkataraman ▴, Vladimir Braverman ⬩, Ion Stoica ⋆
⋆ UC Berkeley ⬩ Johns Hopkins University ▴ Microsoft Research / University of Wisconsin
HotCloud, July 09, 2018

Slide 2

Slide 2 text

Graphs popular in big data analytics

Slide 3

Slide 3 text

Graphs popular in big data analytics

Slide 4

Slide 4 text

Graphs popular in big data analytics

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Graph Analytics

Slide 7

Slide 7 text

Graph Analytics
Processing Algorithms

Slide 8

Slide 8 text

Graph Analytics
Processing Algorithms: PageRank, Connected Components

Slide 9

Slide 9 text

Graph Analytics
Processing Algorithms: PageRank, Connected Components
Mining Algorithms

Slide 10

Slide 10 text

Graph Analytics
Processing Algorithms: PageRank, Connected Components
Mining Algorithms: Motifs, Cliques

Slide 11

Slide 11 text

Graph Analytics
Processing Algorithms: PageRank, Connected Components
Mining Algorithms: Motifs, Cliques

Slide 12

Slide 12 text

Graph Analytics: State-of-the-Art
Processing Algorithms (PageRank, Connected Components)
Computes properties of the underlying graph
Mining Algorithms (Motifs, Cliques)
Discovers structural patterns in the underlying graph

Slide 13

Slide 13 text

Graph Analytics: State-of-the-Art
Processing Algorithms (PageRank, Connected Components)
Computes properties of the underlying graph
§ Easy to implement
§ Massively parallelizable
§ Can handle large graphs
Mining Algorithms (Motifs, Cliques)
Discovers structural patterns in the underlying graph

Slide 14

Slide 14 text

Graph Analytics: State-of-the-Art
Processing Algorithms (PageRank, Connected Components)
Computes properties of the underlying graph
§ Easy to implement
§ Massively parallelizable
§ Can handle large graphs
Mining Algorithms (Motifs, Cliques)
Discovers structural patterns in the underlying graph
§ Efficient custom algorithms
§ Exponential intermediate data
§ Limited to small graphs

Slide 15

Slide 15 text

Graph Analytics: State-of-the-Art
Processing Algorithms (PageRank, Connected Components)
Computes properties of the underlying graph
§ Easy to implement
§ Massively parallelizable
§ Can handle large graphs
Mining Algorithms (Motifs, Cliques)
Discovers structural patterns in the underlying graph
§ Efficient custom algorithms
§ Exponential intermediate data
§ Limited to small graphs
Challenging to mine patterns in large graphs

Slide 16

Slide 16 text

Graph Analytics: Processing vs Mining
[Chart axes: # Edges, Computation Time (log scale)]

Slide 17

Slide 17 text

Graph Analytics: Processing vs Mining
[Chart axes: # Edges, Computation Time (log scale)]
PageRank: 1 trillion edges, 140 s

Slide 18

Slide 18 text

Graph Analytics: Processing vs Mining
[Chart axes: # Edges, Computation Time (log scale)]
PageRank: 1 trillion edges, 140 s

Slide 19

Slide 19 text

Graph Analytics: Processing vs Mining
[Chart axes: # Edges, Computation Time (log scale)]
PageRank: 1 trillion edges, 140 s
Motifs with size = 3, Arabesque (SOSP’15): ~1 billion edges, 11 hours

Slide 20

Slide 20 text

Graph Analytics: Processing vs Mining
[Chart axes: # Edges, Computation Time (log scale)]
PageRank: 1 trillion edges, 140 s
Motifs with size = 3, Arabesque (SOSP’15): ~1 billion edges, 11 hours
Can graph pattern mining be made both fast and scalable?

Slide 21

Slide 21 text

Many mining tasks ask for the number of occurrences and do not need exact answers

Slide 22

Slide 22 text

Many mining tasks ask for the number of occurrences and do not need exact answers
Leverage approximation for graph pattern mining

Slide 23

Slide 23 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data

Slide 24

Slide 24 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4]

Slide 25

Slide 25 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4, and the same graph after edge sampling (p=0.5)]

Slide 26

Slide 26 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4, and the same graph after edge sampling (p=0.5)]
triangle counting on the sampled graph: e = 1

Slide 27

Slide 27 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4, and the same graph after edge sampling (p=0.5)]
triangle counting on the sampled graph: e = 1
result: $e \cdot 2 = 2$

Slide 28

Slide 28 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4, and the same graph after edge sampling (p=0.5)]
triangle counting on the sampled graph: e = 1
result: $e \cdot 2 = 2$
Answer: 10

Slide 29

Slide 29 text

Approximate Analytics
General approach: Apply algorithm on subset(s) (sample) of the input data
[Figure: example graph with nodes 0 to 4, and the same graph after edge sampling (p=0.5)]
triangle counting on the sampled graph: e = 1
result: $e \cdot 2 = 2$
Answer: 10
Applying exact algorithm on sampled graph(s) not the right approach for pattern mining
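The gap between the sampled result and the true answer can be reproduced with a minimal Python sketch. It assumes the five-node example graph is complete, which is what a true count of 10 triangles implies; the function names `count_triangles` and `sample_edges` are illustrative and not from the talk. No rescaling of the sampled count is applied here.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count: test every vertex triple against the edge set."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edges for v in e})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )

def sample_edges(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]

# Assumption: the example graph is the complete graph on nodes 0..4.
edges = list(combinations(range(5), 2))
exact = count_triangles(edges)

# Triangle counts found in 1000 independently sampled subgraphs (p = 0.5).
rng = random.Random(0)
sampled_counts = [
    count_triangles(sample_edges(edges, 0.5, rng)) for _ in range(1000)
]
```

A triangle survives sampling only if all three of its edges are kept, which happens with probability 0.5³ = 0.125 here, so the raw counts on sampled graphs are small and vary widely from one sample to the next.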

Slide 30

Slide 30 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 31

Slide 31 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 32

Slide 32 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 33

Slide 33 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 34

Slide 34 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 35

Slide 35 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 36

Slide 36 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 37

Slide 37 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 38

Slide 38 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 39

Slide 39 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 40

Slide 40 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 41

Slide 41 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 42

Slide 42 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
$p = \frac{1}{10} \cdot \frac{1}{4}$
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 43

Slide 43 text

Approximation by Sampling Patterns
[Figure: example graph with nodes 0 to 4, annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 44

Slide 44 text

Approximation by Sampling Patterns
[Figure: neighborhood sampling on the example graph (nodes 0 to 4), annotated for estimator E0]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 45

Slide 45 text

Approximation by Sampling Patterns
[Figure: neighborhood sampling on the example graph (nodes 0 to 4) with estimators E0, E1, E2, E3]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013

Slide 46

Slide 46 text

Approximation by Sampling Patterns
[Figure: neighborhood sampling on the example graph (nodes 0 to 4) with estimators E0, E1, E2, E3 (r=4)]
edge stream: (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$, $e_1 = 0$, $e_2 = 0$, $e_3 = 0$
result: $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$
Pavan et al. Counting and sampling triangles from a graph stream, VLDB 2013
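The following Python sketch shows one way to implement a neighborhood sampling estimator over the edge stream above, following the single-pass scheme described by Pavan et al. as I understand it. The function and variable names (`neighborhood_sample`, `estimate_triangles`, `r1`, `r2`, `c`) are illustrative choices, and the run uses far more estimators than the r=4 on the slide so that the average is stable.

```python
import random

def neighborhood_sample(stream, rng):
    """One estimator: returns m * c if it sampled a triangle, else 0."""
    r1 = r2 = None   # first sampled edge, second sampled (adjacent) edge
    c = 0            # number of edges adjacent to r1 that arrived after r1
    closed = False   # True once the edge closing the wedge (r1, r2) is seen
    for i, edge in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:
            # Reservoir step: edge i becomes r1 with probability 1/i.
            r1, r2, c, closed = edge, None, 0, False
        elif set(edge) & set(r1):
            c += 1
            if rng.random() < 1.0 / c:
                # Reservoir step over the neighborhood of r1 (always taken when c == 1).
                r2, closed = edge, False
            elif set(r1) ^ set(r2) == set(edge):
                # The edge joins the two non-shared endpoints of the wedge.
                closed = True
    return len(stream) * c if closed else 0

def estimate_triangles(stream, r, rng):
    """Average of r independent estimators."""
    return sum(neighborhood_sample(stream, rng) for _ in range(r)) / r

stream = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
          (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
estimate = estimate_triangles(stream, r=20000, rng=random.Random(0))
```

Each estimator that finds a triangle reports m · c, the inverse of the probability of sampling that triangle (m = 10 edges, c neighbors of the first edge). This matches the slide's example, where a sampling probability of 1/10 · 1/4 yields an estimate of 40.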

Slide 47

Slide 47 text

Potential Benefits
§ 16 node Apache Spark cluster
§ Two graphs: Live Journal (68.9M edges), Twitter (1.47B edges)
§ Count 3-Motifs (2 patterns: triangle, 3-chain)
§ Set error to 5%

Slide 48

Slide 48 text

Potential Benefits
§ 16 node Apache Spark cluster
§ Two graphs: Live Journal (68.9M edges), Twitter (1.47B edges)
§ Count 3-Motifs (2 patterns: triangle, 3-chain)
§ Set error to 5%

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
Arabesque   20 x 32   Instagram  180M    0.9B    10h45m

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

Slide 49

Slide 49 text

Potential Benefits
§ 16 node Apache Spark cluster
§ Two graphs: Live Journal (68.9M edges), Twitter (1.47B edges)
§ Count 3-Motifs (2 patterns: triangle, 3-chain)
§ Set error to 5%

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
Arabesque   20 x 32   Instagram  180M    0.9B    10h45m

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

Slide 50

Slide 50 text

Potential Benefits
§ 16 node Apache Spark cluster
§ Two graphs: Live Journal (68.9M edges), Twitter (1.47B edges)
§ Count 3-Motifs (2 patterns: triangle, 3-chain)
§ Set error to 5%

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    Twitter    41.7M   1.47B   4m
Arabesque   20 x 32   Instagram  180M    0.9B    10h45m

3-Motif     System    Graph      |V|     |E|     Time
Ours (5%)   16 x 8    LiveJ      4.8M    68.9M   11.5s
Arabesque   16 x 8    LiveJ      4.8M    68.9M   299.2s

Slide 51

Slide 51 text

Challenges
Building a General Purpose Approximate Graph Mining System

Slide 52

Slide 52 text

Challenges
Building a General Purpose Approximate Graph Mining System
• General Patterns

Slide 53

Slide 53 text

Challenges
Building a General Purpose Approximate Graph Mining System
• General Patterns
• Distributed Settings

Slide 54

Slide 54 text

Challenges
Building a General Purpose Approximate Graph Mining System
• General Patterns
• Distributed Settings
• Error Estimation

Slide 55

Slide 55 text

Challenges
Building a General Purpose Approximate Graph Mining System
• General Patterns
• Distributed Settings
• Error Estimation
• Handling Updates

Slide 56

Slide 56 text

Challenge #1: General Patterns
Problem: Neighborhood sampling is for triangle counting
Break down neighborhood sampling into two phases:
• Sampling phase
• Closing phase
[Figure: neighborhood sampling on the example graph (nodes 0 to 4) with estimators E0, E1, E2, E3 (r=4)]
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$, $e_1 = 0$, $e_2 = 0$, $e_3 = 0$
result: $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$

Slide 57

Slide 57 text

Challenge #1: General Patterns
Problem: Neighborhood sampling is for triangle counting
Break down neighborhood sampling into two phases:
• Sampling phase
• Closing phase
[Figure: neighborhood sampling on the example graph (nodes 0 to 4) with estimators E0, E1, E2, E3 (r=4)]
$p = \frac{1}{10} \cdot \frac{1}{4}$
$e_0 = 40$, $e_1 = 0$, $e_2 = 0$, $e_3 = 0$
result: $\frac{1}{r} \sum_{i=0}^{r-1} e_i = 10$
Can we restrict the implementation using a simple API?
How can we analyze programs written using the API?

Slide 58

Slide 58 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine

Slide 59

Slide 59 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph]

Slide 60

Slide 60 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)

Slide 61

Slide 61 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)

Slide 62

Slide 62 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)

Slide 63

Slide 63 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)
$\sum_{i=0}^{w-1} c_i$

Slide 64

Slide 64 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)
reduce: $\sum_{i=0}^{w-1} c_i$

Slide 65

Slide 65 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)
reduce: $\sum_{i=0}^{w-1} c_i$
$f(w)$

Slide 66

Slide 66 text

Challenge #2: Distributed Setting
Problem: Neighborhood sampling is for a single machine
[Figure: graph split into subgraphs]
map: w (=3) workers
• subgraph 0: partial count $c_0$ (using r estimators)
• subgraph 1: partial count $c_1$ (using r estimators)
• subgraph 2: partial count $c_2$ (using r estimators)
reduce: $\sum_{i=0}^{w-1} c_i$
$f(w)$
How do we compute f(w) for any pattern?
How does f(w) affect error?
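To see why the summed partial counts need some function of w applied to them, here is a minimal Python sketch of the map and reduce steps. It makes two simplifying assumptions that are not from the talk: each edge is assigned to one of the w workers uniformly at random, and each worker counts triangles exactly rather than with r estimators. The names `partition_edges` and `distributed_count` are hypothetical.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count: test every vertex triple against the edge set."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edges for v in e})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )

def partition_edges(edges, w, rng):
    """Assumed scheme: send each edge to one of w workers uniformly at random."""
    parts = [[] for _ in range(w)]
    for e in edges:
        parts[rng.randrange(w)].append(e)
    return parts

def distributed_count(edges, w, rng):
    """Map: count per subgraph. Reduce: sum the partial counts c_0..c_{w-1}."""
    partial = [count_triangles(p) for p in partition_edges(edges, w, rng)]
    return sum(partial)

edges = list(combinations(range(5), 2))   # complete graph on nodes 0..4
single = distributed_count(edges, 1, random.Random(0))
split = distributed_count(edges, 3, random.Random(0))
```

With one worker the sum equals the true count, but with several workers any triangle whose edges land on different workers is never seen. Under the uniform assignment assumed here, a triangle stays intact with probability 1/w², which hints at why the correction depends on both w and the pattern being mined.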

Slide 67

Slide 67 text

Challenge #3: Building Error-Latency Profile
Problem: Given a time / error bound, how many estimators should we use?
Need to build two profiles:
• Time vs #estimators
• Error vs #estimators
Naïve approach:
• Exhaustively run every possible point (infeasible)

Slide 68

Slide 68 text

Building Estimators vs Time Profile
[Plot: Runtime (min) vs. No. of Estimators, Twitter Graph]
Time complexity linear in number of estimators
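If runtime is linear in the number of estimators, a handful of profiling runs is enough to fit a line and invert it for a time budget. The Python sketch below illustrates that idea only; the profiling points are made up for the example and are not measurements from the talk, and `fit_line` and `max_estimators` are hypothetical helper names.

```python
def fit_line(points):
    """Least-squares fit of runtime = intercept + slope * estimators."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def max_estimators(points, time_budget):
    """Largest estimator count whose predicted runtime fits the budget.

    Assumes the fitted slope is positive (more estimators cost more time).
    """
    intercept, slope = fit_line(points)
    return max(0, int((time_budget - intercept) / slope))

# Made-up profiling points (number of estimators, runtime in seconds).
profile = [(100_000, 12.0), (200_000, 22.0), (400_000, 42.0)]
budget_estimators = max_estimators(profile, time_budget=60.0)
```

The same inversion does not carry over directly to the error profile, since error is not linear in the number of estimators.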

Slide 69

Slide 69 text

Building Estimators vs Error Profile
[Plot: Error Rate (%) vs. No. of Estimators, Twitter Graph]
Error complexity non-linear in number of estimators
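One plausible reason for the non-linear shape is the usual behavior of an average of independent samples. As a sketch, assume the r estimators are independent and identically distributed with variance $\sigma^2$; here $e_i$ is the value returned by estimator $i$, and $\hat{T}$ and $\sigma$ are symbols introduced for this note, not taken from the slides.

```latex
\hat{T} = \frac{1}{r} \sum_{i=0}^{r-1} e_i,
\qquad
\operatorname{Var}(\hat{T}) = \frac{\sigma^2}{r},
\qquad
\operatorname{sd}(\hat{T}) = \frac{\sigma}{\sqrt{r}}
```

Under that assumption, halving the standard error takes roughly four times as many estimators, so the error curve flattens as r grows while the runtime keeps rising linearly.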

Slide 70

Slide 70 text

Building Estimators vs Error Profile
[Plot: Error Rate (%) vs. No. of Estimators, Twitter Graph]
Error complexity non-linear in number of estimators
Leverage techniques like experiment design/Bayesian optimization?
How do we avoid the need to know the ground truth?

Slide 71

Slide 71 text

Challenge #4: Updates
Problem: Graphs and queries can be updated/refined
Several systems challenges:
• Incremental pattern mining
• Can the error-latency profiles be updated?
• Caching
• Re-use results
• Pre-computation

Slide 72

Slide 72 text

Conclusion
§ Approximation is a promising solution for pattern mining
§ Significant benefits, and can handle much larger graphs…
§ … but cannot output all instances of the pattern
§ Several challenges in realizing it
§ How to extend the technique to general patterns?
§ How to do approximate pattern mining in a distributed setting?
§ How do we estimate the error?
§ How do we handle updates?
http://www.cs.berkeley.edu/~api