GraphGen: Extracting and Analyzing Hidden Graphs from Relational Databases

Department of Computer Science, University of Maryland

Outline
• Graph Analytics / Querying
• GraphGen
• Technical Challenges
• Experimental Results

Graph Analytics / Querying
Graph datasets can provide value in many domains:
• Social Networks
• Email Networks
• Protein Interaction Networks
• Stock Trading Networks
There are many different ways to deal with graph data:
• Graph Databases (neo4j, orientDB, RDF stores)
• Distributed Batch Analytics systems (Giraph, GraphX, GraphLab)
• In-Memory systems (Ligra, Green-Marl, X-Stream)
• Many research prototypes / custom indexes

But first... where is your data stored?
• Datasets are typically stored in an RDBMS or key-value store under some sort of schema.
• Graph analytics needs graph-structured data (i.e., lists of nodes and edges).
• We must first extract the graph!

Example: TPC-H

Orders
order_key  customer_key
o1         c1
o2         c2
o3         c3

LineItem
order_key  part_key
o1         p1
o1         p2
o2         p1
o2         p3
o3         p1
o3         p2
o3         p2

Customer
c_key  name
c_1    John
c_2    Jane

Orders ⋈ LineItem on order_key: which customer bought which product?
c_key  p_key
c1     p1
c1     p2
c3     p2
c4     p1
c6     p1

Self-join on p_key: which customers bought the same item?
cust1  cust2
c1     c6
c1     c3
c1     c4
c6     c4

[Figure: resulting graph over customers c1, c3, c4, c6]
Edge weights based on geographical distance

Many other graphs of potential interest:
• Suppliers that sell a common item
• Employees working under the same manager
• Parts that were ordered together
• Bipartite graph between Part and Supplier
• ...
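The "customers who bought the same item" graph can be extracted with a hand-written SQL self-join. The sketch below loads the Orders and LineItem rows shown above into an in-memory SQLite database and runs one such query. The `customer_key < customer_key` filter (undirected edges, no self-loops) and the use of SQLite are assumptions made for illustration, not part of the slides.

```python
import sqlite3

# In-memory database holding the Orders / LineItem rows from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders(order_key TEXT, customer_key TEXT);
    CREATE TABLE LineItem(order_key TEXT, part_key TEXT);
""")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("o1", "c1"), ("o2", "c2"), ("o3", "c3")],
)
conn.executemany(
    "INSERT INTO LineItem VALUES (?, ?)",
    [("o1", "p1"), ("o1", "p2"), ("o2", "p1"), ("o2", "p3"),
     ("o3", "p1"), ("o3", "p2"), ("o3", "p2")],
)

# LineItem is joined with itself on part_key, a non-key attribute.
# DISTINCT collapses pairs that share more than one part.
CO_PURCHASE_SQL = """
    SELECT DISTINCT a.customer_key, b.customer_key
    FROM Orders a
    JOIN LineItem la ON la.order_key = a.order_key
    JOIN LineItem lb ON lb.part_key = la.part_key
    JOIN Orders b ON b.order_key = lb.order_key
    WHERE a.customer_key < b.customer_key  -- assumption: undirected, no self-loops
    ORDER BY 1, 2
"""
edges = conn.execute(CO_PURCHASE_SQL).fetchall()
print(edges)
```

A query of this shape has to be rewritten for every new graph of interest and re-run whenever the database changes.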

But first... where is your data stored?
1. Extracted graphs are often orders of magnitude larger than the original database
   ◦ Homogeneous graphs (over the same set of entities) invariably require at least one self-join on a non-key attribute
2. User needs to write custom SQL queries for ETL
   ◦ Can be unintuitive → time-consuming
   ◦ Must be repeated every time the database is updated
3. Large selectivity estimation errors due to complex joins
   ◦ May result in bad query plans

Outline
• Graph Analytics / State of the Art
• GraphGen
• Technical Challenges
• Experimental Results

RDBMS-based Graph Systems vs GraphGen DECLARATIVE

Graph Analysis with GraphGen
[Figure: architecture with components Relational Database, GraphGen Java Library, GraphGen Front End, Graph Analysis Program, Giraph / Other Graph Libraries; input: Declarative Graph Definition Query D]
Declarative Graph Definition Query:
    Nodes(ID, Name) :- Customer(ID, Name).
    Edges(ID1, ID2) :- Orders(o_key1, ID1), LineItem(o_key1, part_key),
                       Orders(o_key2, ID2), LineItem(o_key2, part_key).

Graph Analysis with GraphGen
[Figure: GraphGen architecture, highlighting Vertex Centric and Direct Graph Access for the Graph Analysis Program; output: Analysis / Query Results]

Graph Analysis with GraphGen
[Figure: GraphGen architecture; output: Graph Snippet for Visualization]

GraphGen Visual Front End [VLDB 2015 Demo]
• User explores the schema and specifies graphs to be extracted
• User can visually explore 1-hop neighborhoods
• View simple statistics about the graph

Graph Analysis with GraphGen
[Figure: GraphGen architecture; output: Serialized Graph File]

Key Challenges
• Extracted graph may be larger than the input tables!
  ◦ Graph may not fit in memory
• Expensive and cumbersome to extract
Solution: instead, extract a Condensed Representation which:
  ◦ Is at most the size of the input tables, and usually much smaller
  ◦ Supports many different APIs over this representation

Constructing the Condensed Representation
Query to extract co-author graph:
    Nodes(ID, Name) :- Author(ID, Name).
    Edges(ID1, ID2) :- AuthorPub(ID1, PubID), AuthorPub(ID2, PubID).

Author
aId  name
a1   name1
a2   name2
a3   name3
a4   name4
a5   name5
a6   name6
a7   name7
a8   name8

Publication
pId  title
p1   title1
p2   title2
p3   title3

AuthorPub
aid  pid
a1   p1
a2   p1
a3   p1
a4   p1
a6   p1
a1   p2
a4   p2
a5   p2
a2   p3
a3   p3
a5   p3
a6   p3
a7   p3
a8   p3

[Figure: Condensed Graph, with "real" nodes a1..a8 connected through virtual nodes p1, p2, p3; Expanded Graph, with direct edges among a1..a8]
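To make the size difference concrete, the sketch below builds both forms from the AuthorPub rows above and counts edges. It treats edges as undirected and ignores self-loops, which is an assumption made to keep the counting simple; the variable names are illustrative only.

```python
from itertools import combinations

# AuthorPub rows from the slide, grouped by publication (the virtual nodes).
author_pub = {
    "p1": ["a1", "a2", "a3", "a4", "a6"],
    "p2": ["a1", "a4", "a5"],
    "p3": ["a2", "a3", "a5", "a6", "a7", "a8"],
}

# Condensed form: one real-to-virtual edge per AuthorPub row.
condensed_edges = [(a, p) for p, authors in author_pub.items() for a in authors]

# Expanded form: one edge per distinct pair of co-authors.
expanded_edges = set()
for authors in author_pub.values():
    expanded_edges.update(combinations(sorted(authors), 2))

print(len(condensed_edges), "condensed edges")
print(len(expanded_edges), "expanded edges")
```

The condensed form grows linearly with the size of AuthorPub, while the expanded form grows quadratically with the number of authors per publication.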

Condensed Representation Construction Algorithm
1. Translate the Nodes statements (to SQL) and execute them.
2. Edges statements (acyclic and aggregation-free) are split at each join:
       Edges(ID1, ID2) :- R_1(ID1, a_1), R_2(a_1, a_2), ..., R_n(a_{n-1}, ID2)
3. For each join between R_i and R_{i+1}, retrieve the number of distinct values d for the join condition attribute(s).
4. Every join where |R_i| |R_{i+1}| / d > 2 (|R_i| + |R_{i+1}|) is marked as large-output.
5. Create virtual nodes for every large-output join. Execute the rest of the joins in-database.
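The large-output test in step 4 is a single inequality. A minimal sketch, with a hypothetical function name and the table sizes taken from the small examples in this deck:

```python
def is_large_output(size_left: int, size_right: int, distinct_join_values: int) -> bool:
    """Apply the step-4 test: |R_i| * |R_{i+1}| / d > 2 * (|R_i| + |R_{i+1}|)."""
    if distinct_join_values <= 0:
        raise ValueError("d must be a positive number of distinct join values")
    estimated_output = size_left * size_right / distinct_join_values
    return estimated_output > 2 * (size_left + size_right)


# AuthorPub self-join on PubID: 14 rows on each side, 3 distinct publications.
# 14 * 14 / 3 is about 65.3, which exceeds 2 * (14 + 14) = 56.
author_self_join = is_large_output(14, 14, 3)

# Orders (3 rows) joined with LineItem (7 rows) on order_key, 3 distinct keys.
# 3 * 7 / 3 = 7, which does not exceed 2 * (3 + 7) = 20.
orders_lineitem_join = is_large_output(3, 7, 3)

print(author_self_join, orders_lineitem_join)
```

Under this test the self-join would get virtual nodes, while the key/foreign-key join would be executed in the database.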

Constructing the Condensed Representation
    Nodes(ID, Name) :- Customer(ID, Name).
    Edges(ID1, ID2) :- Orders(o_key1, ID1), LineItem(o_key1, part_key),
                       Orders(o_key2, ID2), LineItem(o_key2, part_key).

Orders
order_key  cust_key
o1         c1
o2         c2
o3         c3

LineItem
order_key  part_key
o1         p1
o1         p2
o2         p1
o2         p3

[Figure: join chain Orders, LineItem, LineItem, Orders; label "Non-high output"; layers c1 c2 c3 / o1 o2 / p1 p2 / o1 o2 / c1 c2 c3, condensed to c1 c2 c3 / p1 p2 / c1 c2 c3]

Duplicate Edges Problem
• There can be multiple paths between a pair of nodes
  ◦ Assuming no aggregates on edges
• Most graph algorithms cannot handle this
  ◦ Some (e.g., connected components) are tolerant
• Solution: we provide a suite of pre-processing techniques to deal with duplication.
  ◦ Override getNeighbors() → run any algorithm over this
[Figure: real nodes a1, a2, a3 connected through virtual nodes p1, p2]

1. C-DUP (Condensed, Duplicated)
• For every getNeighbors():
  ◦ Do a DFS from the node to find neighbors
  ◦ Cache the neighbor-list (if memory available)
• Most storage-efficient solution
• Great for tasks/queries that touch a small portion of the graph
[Figure: real nodes a1, a2, a3 connected through virtual nodes p1, p2; the cache entry for a3 goes from {} to {a1, a2, a3}]
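A minimal sketch of C-DUP neighbor lookup, assuming the condensed graph is stored as adjacency lists plus a set of virtual-node IDs. The class and method names (`CDupGraph`, `get_neighbors`) are hypothetical and are not the GraphGen Java API. Duplicates are removed on the fly by collecting neighbors into a set.

```python
from typing import Dict, List, Set


class CDupGraph:
    """Condensed graph with duplicate paths; neighbors are found by DFS."""

    def __init__(self, out_edges: Dict[str, List[str]], virtual: Set[str]):
        self.out_edges = out_edges      # adjacency lists for real and virtual nodes
        self.virtual = virtual          # IDs of virtual nodes
        self._cache: Dict[str, Set[str]] = {}

    def get_neighbors(self, node: str) -> Set[str]:
        if node in self._cache:
            return self._cache[node]
        neighbors: Set[str] = set()
        seen_virtual: Set[str] = set()
        stack = list(self.out_edges.get(node, []))
        while stack:
            current = stack.pop()
            if current in self.virtual:
                # Keep descending through virtual nodes, each at most once.
                if current not in seen_virtual:
                    seen_virtual.add(current)
                    stack.extend(self.out_edges.get(current, []))
            else:
                # A real node ends the path; the set drops duplicate paths.
                neighbors.add(current)
        self._cache[node] = neighbors
        return neighbors


# a3 reaches a2 and itself through both p1 and p2 (duplicate paths).
graph = CDupGraph(
    out_edges={
        "a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p1", "p2"],
        "p1": ["a1", "a2", "a3"], "p2": ["a2", "a3"],
    },
    virtual={"p1", "p2"},
)
neighbors_a3 = graph.get_neighbors("a3")
print(sorted(neighbors_a3))
```

As in the slide's cache entry, a3 appears among its own neighbors here because the self-join that defines the edges does not exclude ID1 = ID2.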

2. DEDUP-1 (Condensed, De-Duplicated)
• Run a de-duplication algorithm on the C-DUP representation
  ◦ Removes duplication → single path from a node to each neighbor
• Most portable solution (easily loaded into any graph system)
• Equivalent to the result of biclique compression
[Figure: real nodes a1..a4 connected through virtual nodes p1, p2, before and after de-duplication]

Connection to Biclique Compression
• Biclique Comp. Problem: partition edges into the minimum set of non-overlapping bipartite cliques (NP-Complete) [Feder, Motwani ‘94].
• DEDUP-1 Problem: given a condensed graph, remove edges until there is only 1 path between each pair of neighbors.
  ◦ Goal: minimize the output graph size
• Both problems require finding overlaps between edge lists. Same complexity.
• GraphGen:
  ◦ We avoid materializing the full graph
Figure from A Scalable Pattern Mining Approach to Web Graph Compression with Communities [WSDM’08]

DEDUP-1: Algorithms
Duplicate edges set: C_i: {a2, a3}
Options:
- Remove a2 from p1 or p2?
- Remove a3 from p1 or p2?
[Figure: real nodes a1..a4 connected through virtual nodes p1, p2; the processed set grows from {} to {p1}]

DEDUP-1: Algorithms
1. Naive 1 (Virtual-Nodes-First): choose which real node to remove randomly.
2. Naive 2 (Real-Nodes-First): same, but remove all duplication for a real node u before moving on.
3. Greedy 1 (Virtual-Nodes-First): heuristic: compute the “global” benefit/cost of disconnecting real node u from virtual node p_n.
   ◦ Benefit: the “amount” by which we have decreased duplication
   ◦ Cost: the number of direct edges to be added
4. Greedy 2 (Real-Nodes-First): heuristic: compute the benefit based on the reduction in edges in the result from adding a virtual node p1 and connecting u to it.
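The sketch below is a simplified virtual-nodes-first pass, written only to illustrate the shape of the problem; it is not one of the four algorithms listed above. Assumptions: a single layer of virtual nodes, `sources[p]` holds the real nodes with an edge into virtual node p, `targets[p]` holds the real nodes p points to, and a source is always disconnected from the later virtual node (a deterministic choice instead of a random one). When a source is disconnected, neighbors it can no longer reach are restored as direct edges.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def dedup_virtual_nodes_first(
    sources: Dict[str, List[str]], targets: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], List[Edge]]:
    """Return (kept source lists, direct edges) with one path per neighbor pair."""
    covered: Set[Edge] = set()           # (u, v) pairs that already have a path
    kept: Dict[str, List[str]] = {}
    direct_edges: List[Edge] = []
    for p in sources:                    # process virtual nodes one at a time
        kept[p] = []
        for u in sources[p]:
            pairs = [(u, v) for v in targets[p]]
            if any(pair in covered for pair in pairs):
                # Keeping u -> p would duplicate a path: disconnect u from p
                # and add direct edges for the neighbors that would be lost.
                for pair in pairs:
                    if pair not in covered:
                        direct_edges.append(pair)
                        covered.add(pair)
            else:
                kept[p].append(u)
                covered.update(pairs)
    return kept, direct_edges


def expand(
    sources: Dict[str, List[str]],
    targets: Dict[str, List[str]],
    direct_edges: Iterable[Edge] = (),
) -> List[Edge]:
    """List every (u, v) path, keeping duplicates."""
    paths = list(direct_edges)
    for p in sources:
        paths.extend((u, v) for u in sources[p] for v in targets[p])
    return paths


# Two virtual nodes that overlap on {a2, a3}.
members = {"p1": ["a1", "a2", "a3"], "p2": ["a2", "a3", "a4"]}
kept, direct = dedup_virtual_nodes_first(members, members)
print(kept, direct)
```

On this input a2 and a3 are disconnected from p2 and receive direct edges to a4, so every neighbor pair is reachable by exactly one path.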

3. DEDUP-2 (Condensed, De-Duplicated)
• Only works for single-layered symmetric graphs.
  ◦ Uses undirected edges between virtual nodes
  ◦ Can lead to substantially more compact deduplicated condensed graphs (10x or higher compared to DEDUP-1).
[Figure: C-DUP (24 Edges) vs DEDUP2 (22 Edges); node labels include u1, u2, u3, a to f, V1, W1, W2, W3]
Please see the full paper for details.

4. Avoiding Duplicates with BITMAPs
• Use bitmaps at virtual nodes to avoid duplicate paths
• Iterators use these bitmaps to return neighbors without duplication
• Optimization Problem:
  ◦ Let O(V_n) be the set of real nodes connected to virtual node V_n
  ◦ Given a real node u and its virtual nodes {V_1, V_2, …, V_n}, find the smallest subset of {O(V_1), O(V_2), ..., O(V_n)} that covers their union.
• Algorithm based on standard greedy set cover
[Figures: bitmaps stored at virtual nodes y1, y2 (first example) and x1, x2, y1, y2 (second example) over real nodes a1, a2, a3]
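A minimal sketch of the standard greedy set-cover heuristic applied to this problem. The function name and the returned structure (for each chosen virtual node, the neighbors it is responsible for returning, i.e. the bits that would be set for u) are assumptions for illustration; the slides do not specify how the bitmaps are laid out.

```python
from typing import Dict, Set


def assign_targets(neighbor_sets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Greedy set cover over {O(V_1), ..., O(V_n)} for one real node u.

    Returns, for each chosen virtual node, the neighbors it newly covers.
    Virtual nodes that are not chosen return nothing for u.
    """
    uncovered: Set[str] = set().union(*neighbor_sets.values())
    assignment: Dict[str, Set[str]] = {}
    while uncovered:
        # Pick the virtual node covering the most still-uncovered neighbors.
        best = max(neighbor_sets, key=lambda v: len(neighbor_sets[v] & uncovered))
        assignment[best] = neighbor_sets[best] & uncovered
        uncovered -= neighbor_sets[best]
    return assignment


# Hypothetical O(V) sets for the virtual nodes of one real node u.
o_sets = {
    "V1": {"a1", "a2", "a3"},
    "V2": {"a2", "a3"},
    "V3": {"a3", "a4"},
}
assignment = assign_targets(o_sets)
print(assignment)
```

Here V2 is never chosen, so an iterator for u can skip it entirely, and V3 only returns a4.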

Compactness VMiner: Graph compression using Bi-Cliques Sparse Graphs

Compactness VMiner: Graph compression using Bi-Cliques Dense Graphs

Larger Datasets

Memory Footprint
         CDUP       BMP-DEDUP   FULL GRAPH
Syn-1    1.421 GB   2.737 GB    >64 GB
Syn-2    1.613 GB   2.258 GB    19.798 GB
Syn-3    1.276 GB   1.493 GB    1.2 GB
Syn-4    9.9 GB     13.042 GB   >64 GB
TPC-H    0.023 GB   0.049 GB    7.398 GB

Time to run Breadth First Search
         CDUP     BMP-DEDUP   FULL GRAPH
Syn-1    382 s    284 s       DNF
Syn-2    129 s    111 s       85 s
Syn-3    0.01 s   0.02 s      0.01 s
Syn-4    1.3 s    0.12 s      DNF
TPC-H    86 s     8.5 s       16 s

Analyzing >64GB graph using 2.7GB of memory

Integration with Apache Giraph
• Implemented vertex-centric programs over Giraph using our condensed representations as input.
• Proved to be non-trivial: virtual nodes simply forwarding messages doesn’t always work.
  ◦ E.g., PageRank requires the degree of a node to run (not directly available in the condensed representation)
• We can also do message aggregation at the virtual nodes.
  ◦ This leads to significant speedups in many cases.
    ▪ Due to smaller number of passed messages
[Table: time in seconds, memory in MB]
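The effect of aggregating at a virtual node can be shown with a message count. The sketch below is plain Python, not Giraph code; it assumes a single virtual node connecting every source to every target, a sum combiner, and a condensed graph with no duplicate paths. All names are illustrative.

```python
from typing import Dict, List, Tuple


def send_expanded(
    values: Dict[str, float], sources: List[str], targets: List[str]
) -> Tuple[Dict[str, float], int]:
    """Expanded graph: every source sends its value to every target."""
    inbox = {v: 0.0 for v in targets}
    messages = 0
    for u in sources:
        for v in targets:
            inbox[v] += values[u]
            messages += 1
    return inbox, messages


def send_via_virtual(
    values: Dict[str, float], sources: List[str], targets: List[str]
) -> Tuple[Dict[str, float], int]:
    """Condensed graph: the virtual node sums once, then forwards the sum."""
    partial_sum = 0.0
    messages = 0
    for u in sources:                 # one message per source into the virtual node
        partial_sum += values[u]
        messages += 1
    inbox = {}
    for v in targets:                 # one aggregated message per target
        inbox[v] = partial_sum
        messages += 1
    return inbox, messages


nodes = ["a1", "a2", "a3", "a4"]
values = {"a1": 1.0, "a2": 2.0, "a3": 3.0, "a4": 4.0}
expanded_inbox, expanded_count = send_expanded(values, nodes, nodes)
condensed_inbox, condensed_count = send_via_virtual(values, nodes, nodes)
print(expanded_count, condensed_count)
```

Both variants deliver the same sums, but the condensed one passes |sources| + |targets| messages instead of |sources| × |targets|.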

Conclusions
• Need for support of graph analytics over RDBMSs
• We propose a layer that sits over RDBMSs and optimizes the extraction and analysis processes required.
• GraphGen provides a declarative DSL and a suite of in-memory condensed representations and APIs over those.

Future Work
• Many computational challenges that we are just beginning to explore
  ◦ Edge-properties and aggregates over them
  ◦ Integration with other high-level graph analytics frameworks
• Extending the DSL towards:
  ◦ Integrating declarative graph analysis with our DSL
  ◦ Pushing computation to the database
  ◦ Efficient extraction of collections of graphs at a time

* GRADES’17 on Friday @ 15:00!

Thank you! Questions?

Deck by Konstantinos Xirogiannopoulos