GraphGen: Adaptive Graph Processing using Relational Databases

GraphGen: Adaptive Graph Processing using Relational Databases Department of Computer
Science University of Maryland

Graph Analytics / Querying Graph datasets can provide value in
many domains Social Networks Email Networks Protein Interaction Networks Stock Trading Networks Many different types of ways to manage graph data • Graph Databases (neo4j, orientDB, RDF stores) • Distributed Batch Analytics systems (Giraph, GraphX, GraphLab) • In-Memory systems (Ligra, Green-Marl, X-Stream) • Many research prototypes / custom indexes.

RDBMS-based Graph Systems vs GraphGen DECLARATIVE

Example: TPC-H order_key customer_key Orders o1 c1 o2 c2 o3
c3 order_key part_key LineItem o1 p1 o1 p2 o2 p1 o2 p3 o3 p1 o3 p2 o3 p2 c_key name Customer c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1 Orders LineItem On order_key Which customer bought which product? On p_key Which customers bought the same item? c1 c4 cust1 cust2 c1 c6 c1 c3 c1 c4 c6 c4 c3 c6 c_1 John c_2 Jane

Example: TPC-H order_key customer_key Orders o1 c1 o2 c2 o3
c3 order_ke y part_ke y LineItem o1 p1 o1 p2 o2 p1 o2 p3 o3 p1 o3 p2 o3 p2 c_ke y name Customer c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1 Orders LineItem On order_key On p_key Which customers bought the same item? c1 c4 cust1 cust2 c1 c6 c1 c3 c1 c4 c6 c4 c3 c6 c_1 John c_2 Jane Many other graphs of potential interest: • Suppliers that sell a common item • Employees working under the same manager • Parts that were ordered together • Bipartite graph between Part and Supplier • ... Which customer bought which product?

GraphGen Backend Relational DBMS Java Program Graph Definition Queries DSL
Parser + Optimizer In-Memory Engine SQL Queries Graph Analysis Queries Results Direct Graph Access Vertex- Centric Directly over Graph GraphGen

• Definition of a GraphView over the database ◦ User
specifies how to construct the Nodes and Edges GraphGenDL - Definition Language CREATE GRAPHVIEW CoAuthors AS Nodes(ID, name) :- Author(ID, name). Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub). • Definition of a collection of graphs (Multi-Graph View) over the database ◦ Can enable many optimizations Edge Property: number of publications CREATE GRAPHVIEW AuthorEgoNetworks(X) WHERE Author(X) AS Nodes(X, name) :- Author(X, name). Nodes(ID, name) :- AuthorPub(X,pub), AuthorPub(ID,pub), Author(ID, name). Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub). Extract all ego-graphs

• Specifying Graph Queries over GraphViews • Support for subgraph
pattern matching languages like SPARQL, Cypher, PGQL etc. • Datalog is a natural fit for expressing recursive computation over the Edges VIEW USING GRAPHVIEW CoAuthors Triangle(X, Y, Z) :- Nodes(X, _, “ML” ),Nodes(Y, _, “DB”), Nodes(Z, _, “AL” ),Edges(X, Y),Edges(Y, Z),Edges(X, Z). GraphGenQL - Query Language Find triangles of authors whose areas follow: “ML” -> “DB” -> “AL”

Parser + Optimizer In-Memory Engine SQL Queries Graph Analysis Queries Results Direct Graph Access Vertex- Centric Directly over Graph GraphGen

Parser + Optimizer In-Memory Engine SQL Queries Graph Analysis Queries Results Direct Graph Access Vertex- Centric Directly over Graph GraphGen • Goal: We want to adapt the execution based on the query/analysis. • What are some of the challenges here??

1. Where to execute Queries/ Tasks • Depends on workload,
rate of updates, rate of queries… Dataset In-memory ETL MySQL PosgreSQL Small 0.001 s 2.05 s 0.8 s 0.1 s Large 0.015 s 17.52 s 4.26 0.704 s Triangle Pattern Matching • Key Challenge: Develop accurate cost models, tools, techniques. Decide what to compute where • Other issues: Large-output joins [SIGMOD ‘17], and selectivity estimation errors associated with them.

2. Query Rewriting • Assume the execution is to be
pushed to the database • Many different ways to construct equivalent SQL queries • Auto-generated SQL can be verbose → Challenging to optimize With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query) DISTINCT DISTINCT 1) With vs VIEW 2) Duplicate Elimination (DISTINCT) • The costly duplicate removal might even be unnecessary if the query / analysis doesn’t care about them!

2. Query Rewriting • Assume the execution is to be
pushed to the database • Many different ways to construct equivalent SQL queries • Auto-generated SQL can be verbose → Challenging to optimize With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query) DISTINCT DISTINCT 1) With vs VIEW 2) Duplicate Elimination (DISTINCT) • The costly duplicate removal might even be unnecessary if the query / analysis doesn’t care about them! Time for query to finish in seconds

3. Optimizing Multi-Graph Views Key Challenge: Develop a systematic approach
to optimizing the extraction of and execution against such multi-graph views. E.g. Ego-Graph Analysis • Naive: Generate a separate SQL query for each distinct graph. • Result-Tagging: We can extract all graphs with a single query! Please see full paper • Ego Graph Analysis, Graph snapshot analysis • Ability to refer to each graph independently → significant savings • Opportunity: Overlap computation and storage over collections of graphs

Result-Tagging aid1 a1 a1 a6 a1 aid2 a2 a5 a6
a7 a7 a8 a5 a3 a3 a4 a2 a3 tag a1 a1 a6 a1 a7 a5 a3 a2 Tagged Edges Table e1.aid2 = e2.aid1 aid1 a1 a1 a6 a1 aid2 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 tags[] [a1] [a1] [a1,a6] [a1] [a6,a7] [a5,a1] [a2,a3,a5] [a1,a2] Tag Aggregation Find the edges 1-hop away for the source (tag) & Union the result with the initial Tagged Edges table aid1 a2 a5 a6 aid2 a3 a3 a7 a7 a8 a3 a4 a3 a4 tag a1 a1 a1 a6 a5 a2 a1 a1 a6 a1 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 a1 a1 a6 a1 a7 a5 a3 a2 Tags show which ego-graphs involve the edge

Take Aways • Need for a unified framework for extraction
and analysis of graphs stored implicitly in a structured data store. • We propose declarative a Datalog-based DSL for specifying: ◦ GraphViews over relational schemas ◦ Declarative Graph queries • Expose a series of APIs for defining complex graph analytics over GraphViews There is a variety of challenges & opportunities here in terms of: • Deciding where to execute graph queries • Handling large-output joins and inaccuracies of query optimizers • Rewriting SQL queries pushed to the database • Optimizing across collections of graphs (Multi-Graph Views) Thank you! Questions?

GraphGen: Adaptive Graph Processing using Relat...

GraphGen: Adaptive Graph Processing using Relational Databases

Konstantinos Xirogiannopoulos

More Decks by Konstantinos Xirogiannopoulos

Other Decks in Research

Featured

Transcript

GraphGen: Adaptive Graph Processing using Relational Databases Department of Computer

Graph Analytics / Querying Graph datasets can provide value in

RDBMS-based Graph Systems vs GraphGen DECLARATIVE

Example: TPC-H order_key customer_key Orders o1 c1 o2 c2 o3

Example: TPC-H order_key customer_key Orders o1 c1 o2 c2 o3

GraphGen Backend Relational DBMS Java Program Graph Definition Queries DSL

• Definition of a GraphView over the database ◦ User

• Specifying Graph Queries over GraphViews • Support for subgraph

GraphGen Backend Relational DBMS Java Program Graph Definition Queries DSL

GraphGen Backend Relational DBMS Java Program Graph Definition Queries DSL

1. Where to execute Queries/ Tasks • Depends on workload,

2. Query Rewriting • Assume the execution is to be

2. Query Rewriting • Assume the execution is to be

3. Optimizing Multi-Graph Views Key Challenge: Develop a systematic approach

Result-Tagging aid1 a1 a1 a6 a1 aid2 a2 a5 a6

Take Aways • Need for a unified framework for extraction