GraphGen: Adaptive Graph Processing using Relational Databases

Graph querying and analytics are becoming an increasingly important component of the arsenal of tools for extracting different kinds of insights from data. Despite an immense amount of work on those topics, graphs are largely still handled in an ad hoc manner, in part because most data continues to reside in relational-like data management systems, and because graph analytics/querying typically forms a small portion of the overall analysis pipelines. In this paper we describe an end-to-end graph analysis framework, called GraphGen, that sits atop an RDBMS, and supports graph querying/analytics through: (a) defining graphs as transformations over underlying relational datasets (as Graph-Views) and (b) specifying queries or analytics on those graphs using either a high-level language or Java programs against a simple graph API. Although conceptually simple, GraphGen acts as an abstraction/independence layer that opens up many opportunities for adaptively optimizing graph analysis workflows, since the system can decide where to execute tasks on a per-task basis (in database or outside), how much of the graph to materialize in memory, and what types of in-memory representations to use (especially critical when the graphs are larger than the input datasets, as is often the case). At the same time, by providing the ability to write arbitrary programs against the graphs, GraphGen removes a major expressivity limitation of many existing graph analysis systems, which only support limited programming frameworks. We describe the GraphGen DSL, loosely based on Datalog, that includes both graph specification and in-line analysis capabilities. We then discuss many optimization challenges in building GraphGen that we are currently working to address.

More Decks by Konstantinos Xirogiannopoulos

Transcript

  1. Graph Analytics / Querying

    Graph datasets can provide value in many domains:
    • Social Networks
    • Email Networks
    • Protein Interaction Networks
    • Stock Trading Networks

    Many different ways to manage graph data:
    • Graph Databases (Neo4j, OrientDB, RDF stores)
    • Distributed Batch Analytics systems (Giraph, GraphX, GraphLab)
    • In-Memory systems (Ligra, Green-Marl, X-Stream)
    • Many research prototypes / custom indexes
  2. Example: TPC-H

    Orders(order_key, customer_key): (o1, c1), (o2, c2), (o3, c3)
    LineItem(order_key, part_key): (o1, p1), (o1, p2), (o2, p1), (o2, p3), (o3, p1), (o3, p2), (o3, p2)
    Customer(c_key, name): (c_1, John), (c_2, Jane)

    Join Orders and LineItem on order_key: Which customer bought which product?
    Result(c_key, p_key): (c1, p1), (c1, p2), (c3, p2), (c4, p1), (c6, p1)

    Join the result with itself on p_key: Which customers bought the same item?
    Result(cust1, cust2): (c1, c6), (c1, c3), (c1, c4), (c6, c4)

    [Graph drawing over nodes c1, c3, c4, c6]
  3. Example: TPC-H

    Orders(order_key, customer_key): (o1, c1), (o2, c2), (o3, c3)
    LineItem(order_key, part_key): (o1, p1), (o1, p2), (o2, p1), (o2, p3), (o3, p1), (o3, p2), (o3, p2)
    Customer(c_key, name): (c_1, John), (c_2, Jane)

    Join Orders and LineItem on order_key: Which customer bought which product?
    Result(c_key, p_key): (c1, p1), (c1, p2), (c3, p2), (c4, p1), (c6, p1)

    Join the result with itself on p_key: Which customers bought the same item?
    Result(cust1, cust2): (c1, c6), (c1, c3), (c1, c4), (c6, c4)

    [Graph drawing over nodes c1, c3, c4, c6]

    Many other graphs of potential interest:
    • Suppliers that sell a common item
    • Employees working under the same manager
    • Parts that were ordered together
    • Bipartite graph between Part and Supplier
    • ...
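The two joins on this slide can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not GraphGen's own code: the function name `co_purchase_edges` and the `Bought` CTE are assumptions made for the example. The input rows are the Orders and LineItem rows shown on the slide; the slide's result tables appear to come from a different illustrative instance, so the output here follows from the input rows rather than matching those tables.

```python
import sqlite3

# Orders and LineItem rows as listed on the slide.
ORDERS = [("o1", "c1"), ("o2", "c2"), ("o3", "c3")]
LINE_ITEMS = [
    ("o1", "p1"), ("o1", "p2"),
    ("o2", "p1"), ("o2", "p3"),
    ("o3", "p1"), ("o3", "p2"), ("o3", "p2"),
]


def co_purchase_edges(orders, line_items):
    """Return sorted (cust1, cust2) pairs of customers who ordered a common part."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Orders (order_key TEXT, customer_key TEXT)")
    con.execute("CREATE TABLE LineItem (order_key TEXT, part_key TEXT)")
    con.executemany("INSERT INTO Orders VALUES (?, ?)", orders)
    con.executemany("INSERT INTO LineItem VALUES (?, ?)", line_items)
    rows = con.execute(
        """
        -- Step 1: which customer bought which product (join on order_key).
        WITH Bought AS (
            SELECT DISTINCT o.customer_key AS c_key, l.part_key AS p_key
            FROM Orders o JOIN LineItem l ON o.order_key = l.order_key
        )
        -- Step 2: which customers bought the same item (self-join on p_key).
        -- The c_key comparison keeps one orientation of each pair.
        SELECT DISTINCT b1.c_key, b2.c_key
        FROM Bought b1 JOIN Bought b2 ON b1.p_key = b2.p_key
        WHERE b1.c_key < b2.c_key
        ORDER BY 1, 2
        """
    ).fetchall()
    con.close()
    return rows


EDGES = co_purchase_edges(ORDERS, LINE_ITEMS)
print(EDGES)
```

Each returned pair is one undirected edge of the "bought the same item" graph; the customers themselves are the nodes.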
  4. GraphGen

    [Architecture diagram of GraphGen. Labels: GraphGen Backend, Relational DBMS, Java Program, Graph Definition, Queries, DSL, Parser + Optimizer, In-Memory Engine, SQL Queries, Graph Analysis Queries, Results, Direct Graph Access, Vertex-Centric, Directly over Graph]
  5. GraphGenDL - Definition Language

    • Definition of a GraphView over the database
      ◦ User specifies how to construct the Nodes and Edges

        CREATE GRAPHVIEW CoAuthors AS
        Nodes(ID, name) :- Author(ID, name).
        Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

        Edge Property: number of publications

    • Definition of a collection of graphs (Multi-Graph View) over the database
      ◦ Can enable many optimizations

        CREATE GRAPHVIEW AuthorEgoNetworks(X) WHERE Author(X) AS
        Nodes(X, name) :- Author(X, name).
        Nodes(ID, name) :- AuthorPub(X, pub), AuthorPub(ID, pub), Author(ID, name).
        Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

        Extract all ego-graphs
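To make the CoAuthors definition concrete, here is a small pure-Python sketch of what the Nodes and Edges rules could evaluate to. It is an illustration under assumptions, not GraphGen's implementation: the function name `co_authors_view` and the sample rows are hypothetical, and the sketch drops self-pairs (ID1 = ID2), which the Edges rule as written would also match.

```python
from collections import defaultdict

# Hypothetical sample relations: Author(ID, name) and AuthorPub(ID, pub).
AUTHOR = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
AUTHOR_PUB = [(1, "p1"), (2, "p1"), (1, "p2"), (2, "p2"), (3, "p2")]


def co_authors_view(author, author_pub):
    """Evaluate the CoAuthors rules: return (nodes, edges).

    nodes maps ID -> name; edges maps (ID1, ID2) -> wt, where wt counts the
    publications shared by ID1 and ID2 (the $COUNT(pub) edge property).
    """
    # Nodes(ID, name) :- Author(ID, name).
    nodes = dict(author)

    # Group authors by publication so each shared pub is visited once.
    authors_of = defaultdict(set)
    for aid, pub in author_pub:
        authors_of[pub].add(aid)

    # Edges(ID1, ID2, wt) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).
    # Assumption: self-pairs are skipped; both directions are kept.
    edges = defaultdict(int)
    for aids in authors_of.values():
        for a in aids:
            for b in aids:
                if a != b:
                    edges[(a, b)] += 1
    return nodes, dict(edges)


NODES, EDGES = co_authors_view(AUTHOR, AUTHOR_PUB)
print(NODES)
print(sorted(EDGES.items()))
```

Authors 1 and 2 share two publications in the sample data, so their edge weight is 2 in both directions.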
  6. GraphGenQL - Query Language

    • Specifying Graph Queries over GraphViews
    • Support for subgraph pattern matching languages like SPARQL, Cypher, PGQL etc.
    • Datalog is a natural fit for expressing recursive computation over the Edges view

    Find triangles of authors whose areas follow "ML" -> "DB" -> "AL":

        USING GRAPHVIEW CoAuthors
        Triangle(X, Y, Z) :- Nodes(X, _, "ML"), Nodes(Y, _, "DB"), Nodes(Z, _, "AL"),
                             Edges(X, Y), Edges(Y, Z), Edges(X, Z).
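The Triangle rule can be read as a nested search over the Nodes and Edges relations. The sketch below evaluates it in plain Python as an illustration only; the function name `triangles`, the node and edge encodings, and the sample data are assumptions, and GraphGen's actual evaluation strategy is not shown in the deck.

```python
# Hypothetical sample graph: node ID -> (name, area), plus directed edges.
NODES = {
    "a": ("Ann", "ML"),
    "b": ("Bob", "DB"),
    "c": ("Cy", "AL"),
    "d": ("Di", "AL"),
}
EDGES = {("a", "b"), ("b", "c"), ("a", "c"), ("b", "d")}


def triangles(nodes, edges, areas=("ML", "DB", "AL")):
    """Return sorted (X, Y, Z) triples matching the Triangle rule.

    X, Y, Z must have the three given areas in order, and the edges
    (X, Y), (Y, Z), and (X, Z) must all be present.
    """
    ax, ay, az = areas
    # Pre-filter candidates per area, mirroring the three Nodes(...) atoms.
    by_area = {
        area: [n for n, (_, node_area) in nodes.items() if node_area == area]
        for area in areas
    }
    return sorted(
        (x, y, z)
        for x in by_area[ax]
        for y in by_area[ay] if (x, y) in edges
        for z in by_area[az] if (y, z) in edges and (x, z) in edges
    )


print(triangles(NODES, EDGES))
```

In the sample graph, node "d" is excluded because the edge ("a", "d") is missing, so only one triangle matches.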
  7. GraphGen

    [Architecture diagram of GraphGen. Labels: GraphGen Backend, Relational DBMS, Java Program, Graph Definition, Queries, DSL, Parser + Optimizer, In-Memory Engine, SQL Queries, Graph Analysis Queries, Results, Direct Graph Access, Vertex-Centric, Directly over Graph]
  8. GraphGen

    [Architecture diagram of GraphGen. Labels: GraphGen Backend, Relational DBMS, Java Program, Graph Definition, Queries, DSL, Parser + Optimizer, In-Memory Engine, SQL Queries, Graph Analysis Queries, Results, Direct Graph Access, Vertex-Centric, Directly over Graph]

    • Goal: We want to adapt the execution based on the query/analysis.
    • What are some of the challenges here?
  9. 1. Where to execute Queries / Tasks

    • Depends on workload, rate of updates, rate of queries…

    Triangle Pattern Matching:

    Dataset   In-memory   ETL       MySQL    PostgreSQL
    Small     0.001 s     2.05 s    0.8 s    0.1 s
    Large     0.015 s     17.52 s   4.26 s   0.704 s

    • Key Challenge: Develop accurate cost models, tools, techniques. Decide what to compute where.
    • Other issues: Large-output joins [SIGMOD ’17], and selectivity estimation errors associated with them.
  10. 2. Query Rewriting

    • Assume the execution is to be pushed to the database
    • Many different ways to construct equivalent SQL queries
    • Auto-generated SQL can be verbose → Challenging to optimize

    1) WITH vs VIEW

        With Nodes as (...)
        With Edges as (...)
        (SQL for answering query)

        Create View Edges as (...)
        Create View Nodes as (...)
        (SQL for answering query)

    2) Duplicate Elimination (DISTINCT)
    • The costly duplicate removal might even be unnecessary if the query / analysis doesn’t care about duplicates!
  11. 2. Query Rewriting

    • Assume the execution is to be pushed to the database
    • Many different ways to construct equivalent SQL queries
    • Auto-generated SQL can be verbose → Challenging to optimize

    1) WITH vs VIEW

        With Nodes as (...)
        With Edges as (...)
        (SQL for answering query)

        Create View Edges as (...)
        Create View Nodes as (...)
        (SQL for answering query)

    2) Duplicate Elimination (DISTINCT)
    • The costly duplicate removal might even be unnecessary if the query / analysis doesn’t care about duplicates!

    [Chart: time for query to finish, in seconds]
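The rewriting choices named on this slide can be tried out on a toy table with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `AuthorPub(aid, pub)` table, the edge definition in `EDGE_SQL`, the view name `EdgesView`, and the function `run_variants` are all made up for the example, and the deck does not show the SQL GraphGen actually generates. The sketch checks only that the variants agree on results; it says nothing about their relative running times.

```python
import sqlite3

# Hypothetical edge definition: co-authorship pairs from AuthorPub(aid, pub).
EDGE_SQL = """
SELECT {distinct} a1.aid AS src, a2.aid AS dst
FROM AuthorPub a1 JOIN AuthorPub a2 ON a1.pub = a2.pub
WHERE a1.aid <> a2.aid
"""

AUTHOR_PUB = [("a1", "p1"), ("a2", "p1"), ("a1", "p2"), ("a2", "p2")]


def run_variants(author_pub):
    """Return edge rows from (WITH + DISTINCT, VIEW + DISTINCT, WITH without DISTINCT)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE AuthorPub (aid TEXT, pub TEXT)")
    con.executemany("INSERT INTO AuthorPub VALUES (?, ?)", author_pub)

    dedup = EDGE_SQL.format(distinct="DISTINCT")
    raw = EDGE_SQL.format(distinct="")

    # Variant 1: the edge definition is inlined as a WITH clause.
    with_rows = con.execute(
        f"WITH Edges AS ({dedup}) SELECT src, dst FROM Edges ORDER BY src, dst"
    ).fetchall()

    # Variant 2: the same definition is registered as a view first.
    con.execute(f"CREATE VIEW EdgesView AS {dedup}")
    view_rows = con.execute(
        "SELECT src, dst FROM EdgesView ORDER BY src, dst"
    ).fetchall()

    # Variant 3: duplicate elimination is skipped entirely.
    raw_rows = con.execute(
        f"WITH Edges AS ({raw}) SELECT src, dst FROM Edges ORDER BY src, dst"
    ).fetchall()

    con.close()
    return with_rows, view_rows, raw_rows


WITH_ROWS, VIEW_ROWS, RAW_ROWS = run_variants(AUTHOR_PUB)
print(WITH_ROWS, VIEW_ROWS, RAW_ROWS)
```

Without DISTINCT each edge appears once per shared publication, yet the set of edges is unchanged, which is the case where an analysis that ignores multiplicity could skip the duplicate removal.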
  12. 3. Optimizing Multi-Graph Views

    Key Challenge: Develop a systematic approach to optimizing the extraction of and execution against such multi-graph views.

    E.g. Ego-Graph Analysis
    • Naive: Generate a separate SQL query for each distinct graph.
    • Result-Tagging: We can extract all graphs with a single query!

    Please see the full paper.

    • Ego Graph Analysis, Graph snapshot analysis
    • Ability to refer to each graph independently → significant savings
    • Opportunity: Overlap computation and storage over collections of graphs
  13. Result-Tagging

    [Figure: worked example over author IDs a1 … a8; individual table cells are not recoverable from the transcript]

    • Tagged Edges Table: columns aid1, aid2, tag
    • Find the edges 1-hop away for the source (tag) & union the result with the initial Tagged Edges table
    • Join condition: e1.aid2 = e2.aid1
    • Tag Aggregation: columns aid1, aid2, tags[]; the tags[] values shown are [a1], [a1], [a1,a6], [a1], [a6,a7], [a5,a1], [a2,a3,a5], [a1,a2]
    • Tags show which ego-graphs involve the edge
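The steps named on this slide (tag each edge with its source, pull in edges one hop away, union, then aggregate tags per edge) can be sketched as follows with Python's built-in `sqlite3` module. This is a reading of the slide under assumptions, not the SQL GraphGen generates: the function name `tag_edges`, the CTE names, the choice of the edge source as the initial tag, and the sample edges are all hypothetical. Tag aggregation is done in Python to keep the output order deterministic.

```python
import sqlite3

# Hypothetical directed edge list over author IDs.
EDGES = [("a1", "a2"), ("a1", "a5"), ("a2", "a3"), ("a5", "a3"), ("a6", "a7")]


def tag_edges(edges):
    """Return {(aid1, aid2): sorted list of ego tags whose graph involves the edge}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Edges (aid1 TEXT, aid2 TEXT)")
    con.executemany("INSERT INTO Edges VALUES (?, ?)", edges)
    rows = con.execute(
        """
        -- Initial Tagged Edges table: every edge is tagged with its source.
        WITH Tagged AS (
            SELECT aid1, aid2, aid1 AS tag FROM Edges
        ),
        -- Edges one hop away from the tagged source inherit its tag
        -- (join on e1.aid2 = e2.aid1); UNION merges them with the initial
        -- table and removes repeated (edge, tag) rows.
        Expanded AS (
            SELECT aid1, aid2, tag FROM Tagged
            UNION
            SELECT e2.aid1, e2.aid2, e1.tag
            FROM Tagged e1 JOIN Edges e2 ON e1.aid2 = e2.aid1
        )
        SELECT aid1, aid2, tag FROM Expanded ORDER BY aid1, aid2, tag
        """
    ).fetchall()
    con.close()

    # Tag aggregation: collect all tags of each edge into one list.
    tags = {}
    for aid1, aid2, tag in rows:
        tags.setdefault((aid1, aid2), []).append(tag)
    return tags


TAGS = tag_edges(EDGES)
for edge, edge_tags in sorted(TAGS.items()):
    print(edge, edge_tags)
```

A single query covers every ego at once, and an edge shared by several ego-graphs, such as ("a2", "a3") here, is stored once with multiple tags instead of once per graph.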
  14. Take Aways

    • Need for a unified framework for extraction and analysis of graphs stored implicitly in a structured data store.
    • We propose a declarative, Datalog-based DSL for specifying:
      ◦ GraphViews over relational schemas
      ◦ Declarative Graph queries
    • Expose a series of APIs for defining complex graph analytics over GraphViews

    There is a variety of challenges & opportunities here in terms of:
    • Deciding where to execute graph queries
    • Handling large-output joins and inaccuracies of query optimizers
    • Rewriting SQL queries pushed to the database
    • Optimizing across collections of graphs (Multi-Graph Views)

    Thank you! Questions?