GraphGen: Extracting and Analyzing Hidden Graphs from Relational Databases

Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics can provide tremendous value in many application domains. However, graphs are not the primary representation choice for storing most data today, and in order to have access to these analyses, users are forced to manually extract data from their data stores, construct the requisite graphs, and then load them into some graph engine in order to execute their graph analysis task. Moreover, in many cases (especially when the graphs are dense), these graphs can be significantly larger than the initial input stored in the database, making it infeasible to construct or analyze such graphs in memory. In this paper we address both of these challenges by building a system that enables users to declaratively specify graph extraction tasks over a relational database schema and then execute graph algorithms on the extracted graphs. We propose a declarative domain specific language for this purpose, and pair it up with a novel condensed, in-memory representation that significantly reduces the memory footprint of these graphs, permitting analysis of larger-than-memory graphs. We present a general algorithm for creating such a condensed representation for a large class of graph extraction queries against arbitrary schemas. We observe that the condensed representation suffers from a duplication issue, that results in inaccuracies for most graph algorithms. We then present a suite of in-memory representations that handle this duplication in different ways and allow trading off the memory required and the computational cost for executing different graph algorithms. We also introduce several novel deduplication algorithms for removing this duplication in the graph, which are of independent interest for graph compression, and provide a comprehensive experimental evaluation over several real-world and synthetic datasets illustrating these trade-offs.

More Decks by Konstantinos Xirogiannopoulos

Transcript

  1. Graph Analytics / Querying

    Graph datasets can provide value in many domains:
    • Social Networks
    • Email Networks
    • Protein Interaction Networks
    • Stock Trading Networks

    There are many different ways to deal with graph data:
    • Graph Databases (neo4j, orientDB, RDF stores)
    • Distributed Batch Analytics systems (Giraph, GraphX, GraphLab)
    • In-Memory systems (Ligra, Green-Marl, X-Stream)
    • Many research prototypes / custom indexes
  2. But first... where is your data stored?

    • Typically, datasets are in an RDBMS or key-value store under some sort of schema.
    • Graph analytics needs graph-structured data (i.e., lists of nodes and edges).
    • We must first extract the graph!
  3. Example: TPC-H

    Orders(order_key, customer_key): (o1, c1), (o2, c2), (o3, c3)
    LineItem(order_key, part_key): (o1, p1), (o1, p2), (o2, p1), (o2, p3), (o3, p1), (o3, p2), (o3, p2)
    Customer(c_key, name): (c_1, John), (c_2, Jane)

    Join Orders and LineItem on order_key: which customer bought which product?
    Result (c_key, p_key): (c1, p1), (c1, p2), (c3, p2), (c4, p1), (c6, p1)

    Self-join on p_key: which customers bought the same item?
    Edges (cust1, cust2): (c1, c6), (c1, c3), (c1, c4), (c6, c4)

    Edge weights based on geographical distance.
  4. Example: TPC-H

    Same tables and co-purchase graph as the previous slide (Orders joined with LineItem on order_key, then a self-join on p_key to find which customers bought the same item).

    Many other graphs of potential interest:
    • Suppliers that sell a common item
    • Employees working under the same manager
    • Parts that were ordered together
    • Bipartite graph between Part and Supplier
    • ...
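The "which customers bought the same item?" graph on these slides comes from joining Orders with LineItem and then self-joining the result on the part key. The following is a minimal sketch of that manual extraction using Python's built-in `sqlite3` module. The table layout follows the slide, but the function name `co_purchase_edges` and the choice of SQLite are assumptions made for illustration, not part of GraphGen.

```python
import sqlite3


def co_purchase_edges(orders, lineitems):
    """Return sorted (cust1, cust2) pairs of customers who ordered a common part.

    orders:    iterable of (order_key, customer_key) rows
    lineitems: iterable of (order_key, part_key) rows
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_key TEXT, customer_key TEXT)")
    conn.execute("CREATE TABLE lineitem (order_key TEXT, part_key TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    conn.executemany("INSERT INTO lineitem VALUES (?, ?)", lineitems)
    rows = conn.execute(
        """
        SELECT DISTINCT o1.customer_key AS cust1, o2.customer_key AS cust2
        FROM orders o1
        JOIN lineitem l1 ON o1.order_key = l1.order_key
        JOIN lineitem l2 ON l1.part_key = l2.part_key   -- self-join on a non-key
        JOIN orders o2 ON l2.order_key = o2.order_key
        WHERE o1.customer_key < o2.customer_key         -- one row per undirected edge
        ORDER BY cust1, cust2
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c3")]
    lineitems = [("o1", "p1"), ("o1", "p2"), ("o2", "p1"), ("o2", "p3"),
                 ("o3", "p1"), ("o3", "p2"), ("o3", "p2")]
    print(co_purchase_edges(orders, lineitems))
```

The self-join on `part_key` is the step that can blow up the output: every part bought by k customers contributes on the order of k² candidate pairs before `DISTINCT` is applied.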
  5. But first... where is your data stored?

    1. Extracted graphs are often orders of magnitude larger than the original database
       ◦ Homogeneous graphs (over the same set of entities) invariably require at least one self-join on a non-key
    2. User needs to write custom SQL queries for ETL
       ◦ Can be unintuitive → time consuming
       ◦ Must be repeated every time the database is updated
    3. Large selectivity estimation errors due to complex joins
       ◦ May result in bad query plans
  7. Outline

    • Graph Analytics / State of the Art
    • GraphGen
    • Technical Challenges
    • Experimental Results
  8. Graph Analysis with GraphGen

    Diagram labels: Relational Database, GraphGen Java Library, GraphGen Front End, Giraph / Other Graph Libraries, Graph Analysis Program, Declarative Graph Definition Query D.

    Declarative Graph Definition Query:
      Nodes(ID, Name) :- Customer(ID, Name).
      Edges(ID1, ID2) :- Orders(o_key1, ID1), LineItem(o_key1, part_key),
                         Orders(o_key2, ID2), LineItem(o_key2, part_key).
  9. Graph Analysis with GraphGen

    Diagram labels: Relational Database, GraphGen Java Library, GraphGen Front End, Giraph / Other Graph Libraries, Graph Analysis Program, Declarative Graph Definition Query D, Vertex Centric, Direct Graph Access, Analysis / Query Results.
  10. Graph Analysis with GraphGen

    Diagram labels: Relational Database, GraphGen Java Library, GraphGen Front End, Giraph / Other Graph Libraries, Declarative Graph Definition Query D, Graph Snippet for Visualization.
  11. GraphGen Visual Front End [VLDB 2015 Demo]

    • User can visually explore 1-hop neighborhoods
    • View simple statistics about the graph
    • User explores schema and specifies graphs to be extracted
  12. Graph Analysis with GraphGen

    Diagram labels: Relational Database, GraphGen Java Library, GraphGen Front End, Giraph / Other Graph Libraries, Declarative Graph Definition Query D, Serialized Graph File.
  13. Outline

    • Graph Analytics / State of the Art
    • GraphGen
    • Technical Challenges
    • Experimental Results
  14. Key Challenges

    • Extracted graph may be larger than input tables!
      ◦ Graph may not fit in memory
    • Expensive and cumbersome to extract

    Solution: Instead, extract a Condensed Representation which:
      ◦ Is at most the size of the input tables, and usually much smaller
      ◦ Supports many different APIs over this representation
  15. Constructing the Condensed Representation

    Query to extract co-author graph:
      Nodes(ID, Name) :- Author(ID, Name).
      Edges(ID1, ID2) :- AuthorPub(ID1, PubID), AuthorPub(ID2, PubID).

    Author(aId, name): (a1, name1), (a2, name2), (a3, name3), (a4, name4), (a5, name5), (a6, name6), (a7, name7), (a8, name8)
    Publication(pId, title): (p1, title1), (p2, title2), (p3, title3)
    AuthorPub(aid, pid): (a1, p1), (a2, p1), (a3, p1), (a4, p1), (a6, p1), (a1, p2), (a4, p2), (a5, p2), (a2, p3), (a3, p3), (a5, p3), (a6, p3), (a7, p3), (a8, p3)

    Condensed Graph: “real” nodes a1 to a8 connected through virtual nodes p1, p2, p3.
    Expanded Graph: direct edges among the real nodes a1 to a8.
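To make the size difference between the condensed and expanded co-author graphs concrete, here is a small sketch that builds both from AuthorPub rows. The function names are hypothetical, and the rows in the usage example are the AuthorPub tuples as best they can be read from the flattened slide table, so treat them as illustrative.

```python
from collections import defaultdict
from itertools import combinations


def build_condensed(author_pub):
    """Map each virtual node (publication) to the real nodes (authors) attached to it."""
    members = defaultdict(set)
    for author, pub in author_pub:
        members[pub].add(author)
    return dict(members)


def condensed_edge_count(members):
    """One real-to-virtual edge per distinct (author, publication) pair."""
    return sum(len(authors) for authors in members.values())


def expanded_edges(members):
    """Distinct undirected co-author edges implied by the condensed graph."""
    edges = set()
    for authors in members.values():
        edges.update(combinations(sorted(authors), 2))
    return edges


if __name__ == "__main__":
    author_pub = [
        ("a1", "p1"), ("a2", "p1"), ("a3", "p1"), ("a4", "p1"), ("a6", "p1"),
        ("a1", "p2"), ("a4", "p2"), ("a5", "p2"),
        ("a2", "p3"), ("a3", "p3"), ("a5", "p3"), ("a6", "p3"), ("a7", "p3"), ("a8", "p3"),
    ]
    members = build_condensed(author_pub)
    print(condensed_edge_count(members), len(expanded_edges(members)))
```

The condensed form grows linearly with the AuthorPub table, while the expanded form grows quadratically in the number of authors per publication.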
  16. Condensed Representation Construction Algorithm

    1. Translate the Nodes statements (to SQL) and execute them.
    2. Edges statements are split at each join (acyclic and aggregation-free):
       $\mathrm{Edges}(ID_1, ID_2) \;{:}{-}\; R_1(ID_1, a_1),\ R_2(a_1, a_2),\ \ldots,\ R_n(a_{n-1}, ID_2)$
    3. For each join between $R_i$ and $R_{i+1}$, retrieve the number of distinct values $d$ for the join condition attribute(s).
    4. Every join where $|R_i|\,|R_{i+1}|/d > 2\,(|R_i| + |R_{i+1}|)$ is marked as large-output.
    5. Create virtual nodes for every large-output join. Execute the rest of the joins in-database.
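The large-output test in step 4 is a simple arithmetic check. A minimal sketch follows; the function and parameter names are hypothetical, and only the inequality itself comes from the slide.

```python
def is_large_output(size_left, size_right, distinct_join_values):
    """Apply the slide's rule: |R_i| * |R_{i+1}| / d > 2 * (|R_i| + |R_{i+1}|).

    size_left, size_right: row counts of the two relations being joined
    distinct_join_values:  number of distinct values d of the join attribute(s)
    """
    if distinct_join_values <= 0:
        raise ValueError("distinct_join_values must be positive")
    estimated_output = size_left * size_right / distinct_join_values
    return estimated_output > 2 * (size_left + size_right)


if __name__ == "__main__":
    # Few distinct join values: the estimated output dwarfs the inputs.
    print(is_large_output(1000, 1000, 10))
    # Join on a key-like attribute: the estimated output stays small.
    print(is_large_output(1000, 1000, 1000))
```

Joins that pass the test are the ones replaced by virtual nodes in step 5; the rest can be executed in the database as usual.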
  17. Constructing the Condensed Representation

    Nodes(ID, Name) :- Customer(ID, Name).
    Edges(ID1, ID2) :- Orders(o_key1, ID1), LineItem(o_key1, part_key),
                       Orders(o_key2, ID2), LineItem(o_key2, part_key).

    Join chain: Orders, Lineitem, Lineitem, Orders
    Orders(order_key, cust_key): (o1, c1), (o2, c2), (o3, c3)
    LineItem(order_key, part_key): (o1, p1), (o1, p2), (o2, p1), (o2, p3)

    Diagram: node layers c1 c2 c3 / o1 o2 / p1 p2 / o1 o2 / c1 c2 c3, with joins labeled “Non-high output”, followed by node layers c1 c2 c3 / p1 p2 / c1 c2 c3.
  18. Duplicate Edges Problem

    • There are multiple paths between pairs of nodes
      ◦ Assuming no aggregates on edges
    • Most graph algorithms cannot handle this
      ◦ Some (e.g., connected components) are tolerant
    • Solution: We provide a suite of pre-processing techniques to deal with duplication.
      ◦ Override getNeighbors() → Run any algorithm over this

    Diagram: real nodes a1, a2, a3 connected through virtual nodes p1 and p2.
  19. 1. C-DUP (Condensed, Duplicated)

    • For every getNeighbors():
      ◦ Do a DFS from the node to find neighbors
      ◦ Cache the neighbor-list (if memory available)
    • Most storage-efficient solution
    • Great for tasks/queries that touch a small portion of the graph

    Diagram: real nodes a1, a2, a3 and virtual nodes p1, p2; the cache entry goes from a3: {} to a3: {a1, a2, a3}.
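The C-DUP lookup described above can be sketched as a depth-first traversal that passes through virtual nodes and collects real nodes into a set, so a neighbor reachable over several paths is reported once. The class and method names below are hypothetical and not GraphGen's API. The sketch also assumes a node is not reported as its own neighbor, which the slide's cache example does not make clear.

```python
class CDupGraph:
    """Condensed graph with duplicate paths; neighbors are resolved lazily."""

    def __init__(self, successors, virtual_nodes):
        self.successors = successors          # node -> iterable of successor nodes
        self.virtual = set(virtual_nodes)     # ids of virtual nodes
        self._cache = {}                      # real node -> resolved neighbor set

    def get_neighbors(self, node):
        if node in self._cache:
            return self._cache[node]
        found = set()
        seen_virtual = set()
        stack = list(self.successors.get(node, ()))
        while stack:                          # iterative DFS
            current = stack.pop()
            if current in self.virtual:
                if current not in seen_virtual:
                    seen_virtual.add(current)
                    stack.extend(self.successors.get(current, ()))
            elif current != node:             # assumption: skip self-loops
                found.add(current)            # the set removes duplicate paths
        self._cache[node] = found
        return found


if __name__ == "__main__":
    graph = CDupGraph(
        successors={
            "a1": ["p1", "p2"], "a2": ["p1", "p2"], "a3": ["p1"],
            "p1": ["a1", "a2", "a3"], "p2": ["a1", "a2"],
        },
        virtual_nodes=["p1", "p2"],
    )
    print(sorted(graph.get_neighbors("a1")))
```

A real implementation would bound the cache by available memory, as the slide notes; this sketch caches unconditionally.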
  20. 2. DEDUP-1 (Condensed, De-Duplicated)

    • Run a de-duplication algorithm on the C-DUP representation
      ◦ Removes duplication → Single path from a node to each neighbor
    • Most portable solution (easily loaded into any graph system)
    • Equivalent to result of biclique-compression

    Diagram: real nodes a1 to a4 with virtual nodes p1 and p2, before and after de-duplication.
  21. Connection to Biclique Compression

    • DEDUP-1 Problem: Given a condensed graph, remove edges until there is only 1 path between each pair of neighbors.
      ◦ Goal: minimize the output graph size
    • Biclique Comp. Problem: Partition edges into the minimum set of non-overlapping bipartite cliques (NP-Complete) [Feder, Motwani ‘94].
    • Both problems require finding overlaps between edge lists. Same complexity
    • GraphGen:
      ◦ We avoid materializing the full graph

    Figure from A Scalable Pattern Mining Approach to Web Graph Compression with Communities [WSDM’08]
  22. DEDUP-1: Algorithms

    Duplicate edges set: $C_i$: {a2, a3}
    Options:
    - Remove a2 from p1 or p2?
    - Remove a3 from p1 or p2?

    Diagram: real nodes a1 to a4 with virtual nodes p1 and p2; processed: {} becomes processed: {p1}.
  23. DEDUP-1: Algorithms

    1. Naive 1 (Virtual-Nodes-First): Choose which real node to remove randomly.
    2. Naive 2 (Real-Nodes-First): Same, but remove all duplication for a real node u before moving on.
    3. Greedy 1 (Virtual-Nodes-First): Heuristic: compute the “global” benefit/cost of disconnecting real node u from virtual node $p_n$.
    4. Greedy 2 (Real-Nodes-First): Heuristic: compute the benefit based on the reduction in edges in the result from adding a virtual node p1 and connecting u to it.

    Slide annotations on benefit and cost: the “amount” by which we have decreased duplication; the number of direct edges to be added.
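As a rough illustration of the virtual-nodes-first idea, the sketch below de-duplicates a single-layer, symmetric condensed graph. This is not the paper's algorithm: it picks the real node to disconnect deterministically with `min()` where the slide says the naive variant chooses randomly, and the compensation step (adding a direct edge whenever a pair would otherwise lose its only path) is an assumption about how coverage is preserved. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def naive_dedup(members):
    """members: virtual node -> set of real nodes. Returns (new_members, direct_edges)."""
    members = {p: set(m) for p, m in members.items()}   # work on a copy
    direct = set()                                      # frozenset({u, w}) direct edges
    processed = []
    for p in sorted(members):
        for q in processed:
            overlap = members[p] & members[q]
            while len(overlap) >= 2:                    # some pair is covered twice
                u = min(overlap)                        # slide: chosen randomly
                members[p].discard(u)
                for w in members[p]:
                    # Keep the pair reachable if no virtual node covers it any more.
                    if not any(u in m and w in m for m in members.values()):
                        direct.add(frozenset((u, w)))
                overlap = members[p] & members[q]
        processed.append(p)
    return members, direct


def pair_counts(members, direct=()):
    """Count how many paths (virtual or direct) connect each unordered pair."""
    counts = Counter()
    for m in members.values():
        counts.update(frozenset(pair) for pair in combinations(sorted(m), 2))
    counts.update(direct)
    return counts


if __name__ == "__main__":
    original = {"p1": {"a1", "a2", "a3"}, "p2": {"a2", "a3", "a4"}}
    deduped, direct = naive_dedup(original)
    print(deduped, direct)
```

Comparing `pair_counts` before and after the call shows the intended effect: the same set of neighbor pairs, each now reachable over exactly one path.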
  24. 3. DEDUP-2 (Condensed, De-Duplicated)

    • Only works for single-layered symmetric graphs.
      ◦ Uses undirected edges between virtual nodes
      ◦ Can lead to substantially more compact deduplicated condensed graphs (10x or higher compared to DEDUP-1).

    Diagram: the same example graph (real nodes u1, u2, u3 and a to f) shown as C-DUP (24 Edges) and as DEDUP2 (22 Edges), with virtual nodes labeled V and W1, W2, W3.

    Please see full paper
  25. 4. Avoiding Duplicates with BITMAPs

    • Optimization Problem:
      ◦ Let $O(V_n)$ be the set of real nodes connected to virtual node $V_n$
      ◦ Given a real node $u$ and its virtual nodes $\{V_1, V_2, \ldots, V_n\}$, find the smallest subset of $\{O(V_1), O(V_2), \ldots, O(V_n)\}$ that covers their union.
      ◦ Algorithm based on standard greedy set cover
    • Use bitmaps at virtual nodes to avoid duplicate paths
    • Iterators use these bitmaps to return neighbors without duplication

    Diagram: bitmaps stored at virtual nodes y1 and y2, indexed by real nodes a1, a2, a3.
  26. 4. Avoiding Duplicates with BITMAPs

    Same text as the previous slide, with a larger example diagram: bitmaps stored at virtual nodes x1, x2, y1, y2, indexed by real nodes a1, a2, a3.
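The slides say the bitmap assignment builds on standard greedy set cover. The sketch below shows only that textbook greedy step, not GraphGen's bitmap construction: given the neighbor sets O(V_1), ..., O(V_n) of a real node's virtual nodes, it repeatedly picks the set covering the most still-uncovered real nodes. The function name and input layout are assumptions. Greedy set cover is an approximation, so the result is small but not guaranteed to be the smallest cover.

```python
def greedy_cover(neighbor_sets):
    """neighbor_sets: virtual node id -> set of real nodes it connects to, O(V).

    Returns the virtual node ids chosen, in pick order, whose sets cover the union.
    """
    uncovered = set().union(*neighbor_sets.values())
    chosen = []
    while uncovered:
        # Pick the virtual node that covers the most still-uncovered real nodes.
        best = max(neighbor_sets, key=lambda v: len(neighbor_sets[v] & uncovered))
        chosen.append(best)
        uncovered -= neighbor_sets[best]
    return chosen


if __name__ == "__main__":
    sets = {
        "V1": {"a1", "a2", "a3"},
        "V2": {"a2", "a3"},
        "V3": {"a3", "a4"},
    }
    print(greedy_cover(sets))
```

In this toy input V2 is never picked, since everything it reaches is already reachable through V1. The bitmaps described on the slides play a related role by marking which real nodes each virtual node should return during iteration.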
  27. Outline

    • Graph Analytics / State of the Art
    • GraphGen
    • Technical Challenges
    • Experimental Results
  28. Larger Datasets

    Memory Footprint

    Dataset   CDUP       BMP-DEDUP   FULL GRAPH
    Syn-1     1.421 GB   2.737 GB    >64 GB
    Syn-2     1.613 GB   2.258 GB    19.798 GB
    Syn-3     1.276 GB   1.493 GB    1.2 GB
    Syn-4     9.9 GB     13.042 GB   >64 GB
    TPC-H     .023 GB    .049 GB     7.398 GB

    Time to run Breadth First Search

    Dataset   CDUP       BMP-DEDUP   FULL GRAPH
    Syn-1     382 s      284 s       DNF
    Syn-2     129 s      111 s       85 s
    Syn-3     0.01 s     0.02 s      0.01 s
    Syn-4     1.3 s      0.12 s      DNF
    TPC-H     86 s       8.5 s       16 s

    Analyzing >64GB graph using 2.7GB of memory
  29. Integration with Apache Giraph

    • Implemented vertex-centric programs over Giraph using our condensed representations as input.
    • Proved to be non-trivial
      ◦ E.g. PageRank requires the degree of a node to run (not directly available in the condensed representation)
    • Message aggregation at the virtual nodes.
      ◦ This leads to significant speedups in many cases.
        ▪ Due to smaller number of passed messages
  30. Integration with Apache Giraph

    • Implemented vertex-centric programs over Giraph using our condensed representations as input.
    • Having virtual nodes simply forward messages doesn’t always work.
      ◦ E.g. PageRank requires the degree of a node to run (which isn’t directly available)
    • We can also do message aggregation at the virtual nodes.
      ◦ This leads to significant speedups in many cases.
        ▪ Due to smaller number of passed messages

    Results table units: time in seconds, memory in MB
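To illustrate why aggregating at virtual nodes can cut the number of messages, the sketch below computes, for every real node, the sum of its neighbors' values in one round. It is plain Python rather than Giraph code, the names are hypothetical, and it assumes a single-layer condensed graph with no duplicate paths (each pair of real nodes shares at most one virtual node). Each real node sends one message per virtual node it belongs to, and each virtual node sends one aggregated message back per member.

```python
def neighbor_sums_via_virtual(members, values):
    """members: virtual node -> set of real nodes; values: real node -> number.

    Returns (sums, messages): per-node sum of neighbor values and the number of
    messages exchanged when aggregation happens at the virtual nodes.
    """
    messages = 0
    totals = {}
    for virtual, reals in members.items():
        totals[virtual] = sum(values[r] for r in reals)
        messages += len(reals)                  # real -> virtual, one each
    sums = {r: 0 for r in values}
    for virtual, reals in members.items():
        for r in reals:
            # The aggregate includes r's own value, so subtract it back out.
            sums[r] += totals[virtual] - values[r]
            messages += 1                       # virtual -> real, one each
    return sums, messages


def direct_message_count(members):
    """Messages needed if every real node messaged each neighbor directly."""
    return sum(len(reals) * (len(reals) - 1) for reals in members.values())


if __name__ == "__main__":
    members = {"p1": {"a1", "a2", "a3", "a4"}}
    values = {"a1": 1, "a2": 2, "a3": 3, "a4": 4}
    print(neighbor_sums_via_virtual(members, values), direct_message_count(members))
```

For a virtual node with k members, the aggregated scheme uses 2k messages against k(k-1) for direct messaging, so the gap widens as the virtual nodes get larger.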
  31. Conclusions

    • Need for support of graph analytics over RDBMSs
    • We propose a layer that sits over RDBMSs that optimizes the extraction and analysis processes required.
    • GraphGen provides a declarative DSL and a suite of in-memory condensed representations and APIs over those.

    Future Work

    • Many computational challenges that we are just beginning to explore
      ◦ Edge-properties and aggregates over them
      ◦ Integration with other high-level graph analytics frameworks
    • Extending the DSL towards:
      ◦ Integrating declarative graph analysis with our DSL
      ◦ Pushing computation to the database
      ◦ Efficient extraction of collections of graphs at a time

    * GRADES’17 on Friday @ 15:00!

    Thank you! Questions?