PuppyGraph - IT Press Tour #62 June 2025

Zero ETL Graph Analysis With PuppyGraph

The Pains of Graph DBs
• High cost
• Data ingestion (difficult schema update)
• Scalability
• Performance

There are users who see value in graph data analysis but don’t want another data stack. However, no such option exists in the market today.

Meet PuppyGraph
The only graph analytic engine that enables users to query one or more relational data stores as a unified graph model.
• No separate graph database required, with zero ETL.
• Scalable with PBs of data & billions of nodes.
• Deploy to query in 10 mins.
• Complex queries in seconds.

Supported Data Sources *New data sources can be supported in
2-4 weeks.

Supported Query Languages Client Libraries

Supported Integrations Visualization Data Movement

PuppyGraph Architecture
[Architecture diagram. Data sources: MySQL/PostgreSQL, Apache Iceberg, Delta Lake, Apache Hudi, more to come. Multi-model query language: Apache Gremlin client with Gremlin server, openCypher with Cypher server. PuppyGraph query engine: logical plan, physical plan, four execution nodes. Labels: "Strategic Open Source Partners"; "Any data source anywhere, cross-cloud and region analytics."]

Before PuppyGraph (traditional graph architecture)
[Diagram: SQL sources, NoSQL sources, and a data lake are loaded into a graph database through separate ETL pipelines; graph queries run against the graph database.]
• Complex ETL build + maintenance
• Bloated architecture + high TCO
• Data/query latency

With PuppyGraph (architecture with graph query engine)
[Diagram: graph queries run against the SQL sources, NoSQL source, and data lake through connectors.]
• No ETL from source to graph
• Simplified architecture + lower TCO
• 10-hop neighbor queries in seconds

Example Architecture
[Diagram: a unified SQL query engine and a unified graph query engine side by side over the same storage. Layer labels: Storage, Query Engine, Query Language (SQL, Gremlin, Cypher), BI & Viz. Product label: Databricks SQL.]
Single copy of data: query it in both SQL & Graph

PuppyGraph Deployment
[Deployment diagram. Client/BI (NeoDash shown) connects to PuppyGraph, which consists of a leader node and compute nodes 1 to N. PuppyGraph reads from the data sources: Data Source A (Table 1, Table 2, …, Table X) and Data Source B (Table 1, Table 2, …, Table Y).]

PuppyGraph Query Planning & Execution
[Diagram. Leader node: Cypher queries arrive at the openCypher server and Gremlin queries at the Gremlin server; the query optimizer turns the logical query plan into a physical query plan, which is split into query plan parts. Compute nodes 1 to N: each has a query executor that runs one query plan part.]

PuppyGraph Cache Illustration
[Diagram. The leader node sends a query plan part to each compute node. Each compute node has a cache manager, a data scanner, a memory cache, and a disk cache. Rows of Table 1 shown: (id=1, attr=foo), (id=3, attr=foo), (id=2, attr=bar). Compute Node 1: memory cache holds id=1; disk cache holds id=1 and id=3. Compute Node 2: memory cache holds id=2; disk cache holds id=2 and id=4.]

PuppyGraph’s Technical Advantages
• Zero ETL: PuppyGraph enables you to query your data as a graph by directly connecting to your data warehouses and lakes. This eliminates the need to build and maintain the time-consuming ETL pipelines required by a traditional graph database setup. Deploy to query in 10 minutes.
• Column-based data file format: A graph database usually has row-based or key-value storage, which is good at adding, updating, and removing vertices and edges but not at running complex analytical queries. PuppyGraph doesn’t have its own storage. Instead, it’s the only graph solution that leverages column-based storage to speed up complex queries at scale.
• Optimized query execution: On top of column-based storage, PuppyGraph offers massively parallel processing and vectorized evaluation that keep computation fast even without efficient indexing and caching. Performance can be improved further by leveraging PuppyGraph’s internal (in-memory on the PuppyGraph compute nodes) and external (the user’s existing storage) indexing and caching technologies. While a relational data warehouse’s query engine is optimized for SQL queries, PuppyGraph is optimized for graph queries.
• Distributed design: PuppyGraph’s compute engine is distributed, so adding machines improves performance. The distributed design allows PuppyGraph to handle huge data volumes and complex queries such as 10-hop neighbor queries with ease.

Key Use Cases

PuppyGraph Customers & Users

GraphRAG + LLM

Customer GraphRAG Use Case
• Customer: one of Europe's biggest backers of business-to-business software companies
• Chose PuppyGraph over Neo4j because the team did not want to build data pipelines
• Use case: social graph for rolodex access; look for the shortest path for any founders/investors in its network
• Tech Stack: PuppyGraph, BigQuery, LangChain
[Figure: a simple knowledge graph example]

The Challenges of Adopting LLMs in Enterprises
• It is not easy to query your private data with OpenAI’s GPT models:
◦ GPT models are not trained on the private data
◦ Enterprises also don’t want to share any private data with OpenAI
• ChatGPT can hallucinate when answering data-oriented questions:
◦ It provides wrong answers
◦ It gives a long block of text that doesn’t really answer the question

ChatGPT Hallucination Examples with IMDB Data
[Screenshots: answers with no GraphRAG vs. with PuppyGraph GraphRAG]

What is Graph RAG?
Graph RAG = RAG x Knowledge Graph
• Graph RAG builds on the concept of RAG by leveraging knowledge graphs (KGs).
• Graph RAG integrates the structured data from KGs into the LLM’s processing, providing a more nuanced and informed basis for the model’s responses.
[Figure: a simple knowledge graph example]
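To make the idea concrete, here is a minimal sketch of the retrieval half of Graph RAG, assuming a toy in-memory knowledge graph stored as triples. The function names `retrieve_facts` and `build_prompt` and the prompt wording are hypothetical, and no LLM is called; a real setup would send the resulting prompt to the model.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# Illustrative data only; a real deployment would query the graph engine instead.
TRIPLES = [
    ("Tom Hanks", "acted_in", "Forrest Gump"),
    ("Robert Zemeckis", "directed", "Forrest Gump"),
    ("Tom Hanks", "acted_in", "Cast Away"),
]


def retrieve_facts(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question, triples):
    """Prepend the retrieved graph facts to the question as grounding context."""
    facts = retrieve_facts(question, triples)
    context = "\n".join(f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("Who directed Forrest Gump?", TRIPLES)
print(prompt)
```

The design point is that the model only sees facts connected to the entities in the question, which is what keeps the answer grounded in the graph rather than in the model's training data.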

Build Your Own AI Chatbot With Private Data & LLM
Tech Stack:
• LLM Model: OpenAI’s GPT-3.5
• Dataset: IMDB
• Data Storage: Apache Iceberg
• Knowledge Graph: PuppyGraph

IMDB Dataset Table Schema

title_basics        type
  tconst            string
  titleType         string
  primaryTitle      string
  originalTitle     string
  isAdult           boolean
  startYear         Int64
  endYear           Int64
  runtimeMinutes    Int64
  genres            string

title_principals    type
  id                string
  tconst            string
  ordering          Int64
  nconst            string
  category          string
  job               string
  characters        string

name_basics         type
  nconst            string
  primaryName       string
  birthYear         Int64
  deathYear         Int64
  primaryProfession string
  knownForTitles    string

IMDB Dataset Graph Schema
The IMDB knowledge graph has two types of vertices:
1. `person`. This type of vertex can be a director, producer, or actor/actress. It has 3 major attributes: `primaryName`, `birthYear` and `deathYear`.
2. `title`. This type of vertex can be a movie, TV episode, TV movie, etc. It has 3 major attributes: `titleType`, `primaryTitle` and `startYear`.
The IMDB knowledge graph has one type of edge: `cast_and_crew`. Note that this edge is directed, pointing from `title` vertices to `person` vertices.
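The mapping from the three tables to this graph schema can be sketched in plain Python. This is a conceptual illustration only: the sample rows and the `build_graph` helper are hypothetical, and it is not PuppyGraph's schema configuration format. It also materializes the graph in memory purely for clarity, whereas the deck describes querying the tables in place.

```python
# Hypothetical sample rows shaped like the three IMDB tables.
title_basics = [
    {"tconst": "tt1", "titleType": "movie",
     "primaryTitle": "Example Movie", "startYear": 2001},
]
name_basics = [
    {"nconst": "nm1", "primaryName": "Example Actor",
     "birthYear": 1960, "deathYear": None},
]
title_principals = [
    {"id": "p1", "tconst": "tt1", "ordering": 1,
     "nconst": "nm1", "category": "actor"},
]


def build_graph(titles, names, principals):
    """Map table rows to `title`/`person` vertices and `cast_and_crew` edges."""
    vertices = {}
    for row in titles:
        vertices[row["tconst"]] = {
            "label": "title",
            "titleType": row["titleType"],
            "primaryTitle": row["primaryTitle"],
            "startYear": row["startYear"],
        }
    for row in names:
        vertices[row["nconst"]] = {
            "label": "person",
            "primaryName": row["primaryName"],
            "birthYear": row["birthYear"],
            "deathYear": row["deathYear"],
        }
    # Each title_principals row becomes one directed edge: title -> person.
    edges = [
        {"label": "cast_and_crew", "from": row["tconst"], "to": row["nconst"]}
        for row in principals
    ]
    return vertices, edges


vertices, edges = build_graph(title_basics, name_basics, title_principals)
print(vertices["tt1"]["label"], "->", vertices["nm1"]["label"])
```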

Knowledge Graph

The Queries Running On Backend
Graph Queries on Knowledge Graph
1. g.V().has('person', 'primaryName', 'Tom Hanks').in().has('titleType', 'movie').order().by('startYear').range(8, 9).values('primaryTitle')
2. g.V().has('person', 'primaryName', 'Jackie Chan').in().out().groupCount().by('primaryName').order(local).by(values, desc).next()

Fraud Detection

Case Study #1
Confidential*: one of the largest crypto trading platforms in the world
Pain points:
• A multi-year manual offline system used by internal & paying customers for fraud detection
• Data set is billions of nodes & TBs of metadata, and it can’t achieve real-time beyond 3+ hops
• It couldn’t support large requests & batches timed out
After adopting PuppyGraph:
• Released a new online, automated system powered by PuppyGraph (in production)
• Achieved 5-hop paths between A and B in 3 seconds across a few hundred million edges
• POC with PuppyGraph in < 1 day and shipped to production in < 6 months

Customer Testimonial
Eric Sun, Data Platform - Sr. Manager
Excerpt from Data+AI Summit 2024
"PuppyGraph is a very interesting graph query engine. It doesn't require us to load or ETL any data into a specialized or proprietary database storage layer for graphs. We can simply query everything directly on our data lake, whether it’s Delta, Iceberg, or just plain Parquet files. PuppyGraph can integrate this data into a graph model and another distributed computation engine to render all the results. We use it in conjunction with Unity Catalog to unlock all our transactional and crypto data already on our Delta Lake. PuppyGraph then queries this data directly to perform all sorts of graph-based exploration and aggregation. This capability is so powerful, and our users really enjoy this level of flexibility."

Case Study #2
Confidential*: one of the leading financial technology companies
Need: build a real-time alerting system to identify fraudulent accounts for money transactions & pull relevant account data for human review
Pain points:
• Explored a Spark-based system, but it failed to meet the real-time requirement
• The Spark-based system “hard-coded” a lot of query logic and lacked flexibility, as human reviewers need a large variety of criteria for effective investigation and exploration
After adopting PuppyGraph:
• Achieved real-time alerting in product by delivering 100-200 ms for quick alerting queries and single-digit seconds for extremely complex queries that pull review info for human investigation
• Scaled up to 3-5 TBs of data, with a plan to increase the data to 100TB by the end of next year by gradually adding more data sources

P2P Fraud Detection Graph Analysis Demo
• Business use case: Investigate a real anonymized data sample from a peer-to-peer (P2P) payment platform, identify fraud patterns, resolve high-risk fraud communities, and apply recommendation methods. Recreated from Neo4j’s example.
• We will identify new fraud risks that went undetected with non-graph methods, increasing the number of flagged users by 87.5%.
• Graph Schema: credit cards, devices, and IP addresses.
• Each user node has an indicator variable for money transfer fraud (named MoneyTransferFraud) that is 1 for known fraud and 0 otherwise. This indicator is determined by a combination of credit card chargeback events and manual review.
• Analysis/queries we’ll run:
◦ Query confirmed financial fraud users
◦ Show a relational pattern where one user transfers money to another user who shares the same credit card
◦ Group accounts with transfer records and shared credit cards using the Weakly Connected Components (WCC) algorithm
◦ Find out if there are confirmed fraudulent users within a specific group
◦ Query users within the specific group: if the group contains confirmed fraudulent users, the other users in the group may be fraudulent too
• Tech Stack: PuppyGraph, Apache Iceberg, Docker
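The grouping step can be sketched with a small union-find implementation of weakly connected components. The user ids, links, and fraud labels below are hypothetical, and this is a generic WCC sketch rather than the engine's implementation. Each link stands for "a transfer between two users who share a credit card".

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring edge direction (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Hypothetical data: links join users with a transfer and a shared card.
users = ["u1", "u2", "u3", "u4", "u5"]
links = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]
known_fraud = {"u2"}  # users whose MoneyTransferFraud indicator is 1

components = weakly_connected_components(users, links)

# Users who share a component with a confirmed fraudster become candidates.
at_risk = set()
for component in components:
    if component & known_fraud:
        at_risk |= component - known_fraud

print(sorted(at_risk))
```

Flagging the rest of a component that contains a confirmed fraudster is the mechanism by which the number of flagged users grows beyond the originally labeled set.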

Cyber Security

Cybersecurity Customer Case Study
Confidential*: an ISTARI Collective member and a cybersecurity leader
Use case: Create a knowledge graph to show complete visibility from diverse sources like SIEM, vulnerability scanners, EDR, VPN, IAM, IGA, ITSM, CMDB & cloud sources.
Pain points:
• Unable to analyze more than the past 7 days of data with the SQL-based solution
• Couldn’t support large requests & batches timed out
• Need to achieve real-time for certain queries while balancing the cost of infrastructure

Cybersecurity Customer Case Study
Results:
• Prevalent AI chose PuppyGraph over Apache Druid because of its scalability, performance & cost advantages
• PuppyGraph helped Prevalent AI increase the handled data volume by 30x
• Able to query the last 7 days of data in < 3 seconds and the last 30 days of data in < 10 seconds
Tech Stack:
• PuppyGraph + Apache Iceberg
Prevalent AI write-up on the knowledge graph for crisis readiness

Wiz-Like Cloud Security Graph Use Case
“Where are all the Log4J libraries in my environment?”
Traditional Ways
✖ Too difficult to answer w/ traditional security approaches
✖ Finding Log4j libraries & identifying those exposed or with high permissions across environments is a major challenge
✖ SQL-based solutions struggle with complex interconnections, requiring inefficient and hard-to-manage queries

Wiz Cloud Security Graph Overview
Wiz helps customers uncover the toxic combinations of risk factors that represent critical risks through a graph model. Wiz scans the entire tech stack w/o agents and stores a graph of relevant security metadata. The Wiz Risk Engines traverse the graph & weave together interconnected risk factors in seconds. Wiz’s security graph makes the interconnections visualizable & queryable for users.
Through a simple graph query, customers can identify critical risks such as:
• Resources open to the AllUser predefined group
• Publicly exposed containers with high Kubernetes privileges
• Externally exposed and unpatched VM instances with cleartext SSH private keys
“The world is a graph, not a table. It’s time our tooling reflected this.” Ami Luttwak, CTO, Wiz
Wiz CTO’s blog on their Cloud Security Graph

Build a Wiz-Like Cloud Security Graph with PuppyGraph
Instead of adopting a specialized graph storage system that requires a complex ETL process, use PuppyGraph, a graph query engine that sits on top of existing relational data stores with zero ETL. Deploy PuppyGraph in as little as 10 minutes and create graph schemas across all tables in one or more of your relational data stores. PuppyGraph supports the most popular graph query languages, Cypher & Gremlin, so you can easily query complex relationships such as “resources open to the AllUser predefined group” or “publicly exposed containers with high Kubernetes privileges”.
Link to cloud security graph demo recording

Network Topology Demo
Business use case: Investigate malfunctioning server nodes and load balancers
• Query and visualize the paths from a specific server node to all downstream load balancers and server nodes
• Query and visualize the paths from a failed server node to all downstream load balancers
Tech Stack: PuppyGraph, Apache Iceberg, Docker
Link to the network topology demo recording

CI/CD Artifact Dependency Demo
Business use case: An Artifact Dependency is a Dependency of type Artifact. Create a graph of all dependencies of an artifact for faster troubleshooting
• Query all direct and indirect dependencies of an artifact
• Query which artifacts directly or indirectly depend on a certain artifact
• Query all failing build records and dependencies related to a certain build
Tech Stack: PuppyGraph, Apache Iceberg, Docker
Link to the CI/CD demo recording
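The first two queries are transitive traversals in opposite directions over the same dependency edges. A minimal Python sketch, assuming a hypothetical set of artifacts (`DEPENDS_ON`) and helper names that are not part of the demo:

```python
# Hypothetical artifacts: each key depends directly on the listed artifacts.
DEPENDS_ON = {
    "app": ["lib-http", "lib-json"],
    "lib-http": ["lib-core"],
    "lib-json": ["lib-core"],
    "lib-core": [],
}


def all_dependencies(artifact, graph):
    """Direct and indirect dependencies of an artifact (iterative DFS)."""
    seen, stack = set(), list(graph.get(artifact, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def all_dependents(artifact, graph):
    """Artifacts that directly or indirectly depend on the given artifact."""
    reverse = {}
    for src, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(src)
    # Same traversal, run over the reversed edges.
    return all_dependencies(artifact, reverse)


print(sorted(all_dependencies("app", DEPENDS_ON)))
print(sorted(all_dependents("lib-core", DEPENDS_ON)))
```

In a graph query language the same two questions are expressed as variable-length path traversals along outgoing and incoming edges respectively.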

Workload & Error Rate In A Big Call Graph Demo
Business use case: Analyze and visualize high CPU utilization across components within a large call graph
• Query historical records and related components with a CPU load ratio greater than 0.9
• Query historical records of components with CPU usage exceeding 90%, as well as the corresponding component and invocation info
• Rank the importance of each component using the PageRank algorithm
Tech Stack: PuppyGraph, Apache Iceberg, Docker
Link to the workload demo recording
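The ranking step can be illustrated with a textbook power-iteration PageRank over a small call graph. The component names in `CALLS`, the damping factor of 0.85, and the fixed iteration count are assumptions for this sketch; it is not the engine's implementation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each node to the nodes it calls."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no outgoing calls spreads its rank evenly over all nodes.
            targets = graph[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical call graph: caller -> callees.
CALLS = {
    "gateway": ["auth", "orders"],
    "auth": ["db"],
    "orders": ["db"],
    "db": [],
}

ranks = pagerank(CALLS)
for component, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{component}: {score:.3f}")
```

Components that many call paths converge on, such as the shared database in this toy graph, end up with the highest scores, which is why PageRank is a reasonable proxy for component importance.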

VPC Flow Log Analytics
Business use case: AWS VPC Flow Logs play a crucial role by capturing detailed traffic data, which is pivotal for identifying security threats, optimizing network performance, and ensuring regulatory compliance. While SQL remains a common tool for data analysis, graph-based analytics offer a superior alternative for managing VPC Flow Log data, excelling in rapidly identifying relationships between data points such as IP address connections.
Link to the Upsolver tutorial blog

PuppyGraph Benchmarks

Performance Benchmark vs. Neo4j
3-Hop Neighbor Query Benchmark on Twitter Data: Neo4j vs. PuppyGraph
• Twitter data set: 50 million nodes + 2 billion edges
• PuppyGraph is 20-70x faster than Neo4j in the 3-hop query benchmark when the degree is high.
*We could only run the 3-hop query because Neo4j crashed on higher-degree queries, like the 10-hop neighbor query.

Multi-Hops Benchmarks
https://bit.ly/pg-multi-hops-benchmark

Datasets:
Dataset          # of Vertices   # of Edges
Graph500(sf22)   2,396,657       64,155,735
Graph500(sf25)   17,062,472      523,602,831
Twitter          52,579,678      1,963,263,508

Cluster:
• Leader node: t3.xlarge
• Compute nodes: m6i.4x
• Number of compute nodes: 4

Results: PuppyGraph can answer a 10-hop neighbor query across half a billion edges in 2.26 seconds on a four-machine cluster.
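For readers unfamiliar with the term, a k-hop neighbor query collects every vertex reachable from a start vertex within k edges. The sketch below is a single-machine breadth-first version on a hypothetical chain graph; the benchmark's exact query definition and the engine's distributed execution are not shown here.

```python
def k_hop_neighbors(adjacency, start, k):
    """Vertices reachable from start within k hops, excluding start itself."""
    visited = {start}
    frontier = {start}
    for _ in range(k):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        if not next_frontier:
            break  # nothing new to expand
        frontier = next_frontier
    return visited - {start}


# Hypothetical chain graph: 0 -> 1 -> 2 -> ... -> 12
CHAIN = {i: [i + 1] for i in range(12)}

print(len(k_hop_neighbors(CHAIN, 0, 3)))
print(len(k_hop_neighbors(CHAIN, 0, 10)))
```

On real graphs the frontier can grow by the average degree at every hop, which is why deep multi-hop queries are a common stress test for graph engines.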

Cypher Query Benchmarks vs. Neo4j
https://bit.ly/pg-cypher-query-benchmark

Results (in seconds):
Query Name                    Puppy + Iceberg (1GB data)   Neo4j (1GB data)   Puppy + Iceberg (100GB data)   Neo4j (100GB data)
return-vertex-of-some-label   0.228                        0.159              0.223                          0.320
node-label-count              0.297                        0.505              2.365                          38.395
relationship-type-count       1.146                        2.427              6.807                          360.522
multi-id-relationship         0.293                        0.315              0.249                          19.281

Hardware: r6i.4xlarge (AWS, 16 vCPUs, 128GB Memory)
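The relative speedups implied by the 100GB columns of the Cypher benchmark results can be derived with a few lines of Python. The timings are copied from the results above; the `speedups` helper is just illustrative arithmetic (Neo4j time divided by PuppyGraph time).

```python
# Timings in seconds from the 100GB columns: (Puppy + Iceberg, Neo4j).
TIMINGS_100GB = {
    "return-vertex-of-some-label": (0.223, 0.320),
    "node-label-count": (2.365, 38.395),
    "relationship-type-count": (6.807, 360.522),
    "multi-id-relationship": (0.249, 19.281),
}


def speedups(timings):
    """Ratio of Neo4j time to PuppyGraph time for each query."""
    return {name: neo4j / puppy for name, (puppy, neo4j) in timings.items()}


for name, ratio in speedups(TIMINGS_100GB).items():
    print(f"{name}: {ratio:.1f}x")
```

The ratios vary widely by query type, so a single headline multiplier should be read alongside the per-query numbers.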

Deployment Options of PuppyGraph
Docker: Download and install via Docker. Can be run on any Linux box, in any cloud.
www.puppygraph.com
• Can be deployed on-prem or on any major cloud: GCP, AWS, Azure
• Available as an AWS AMI and on GCP Marketplace
• Our customers use k8s to deploy a cluster of PuppyGraph and use DataDog to monitor the status of the cluster and scale in/scale out.
• Users can use PuppyGraph’s Java/Go/Python client to set up alerts to the ops team based on graph outcomes

Table 1: Diverse Types of GraphRAG
Graph RAG Types          Uses Graph Query   Retrieves Graph Content   Adaptive Reasoning   Example
Query-based GraphRAG     ✓                  –                         –                    Cypher or Gremlin queries
Content-based GraphRAG   ✓                  ✓                         –                    Nodes, triplets, paths, subgraphs
Agentic GraphRAG         ✓                  ✓                         ✓                    PuppyGraphAgent

Table 2: Traditional RAG vs. GraphRAG
Features                                  Traditional RAG   GraphRAG
Understands Connections 🔗                – Limited         ✓ Comprehensive
Handles Complex Questions 🧠              – Basic           ✓ Advanced
Keeps Information Focused 🎯              – Overwhelming    ✓ Clear & Focused
Works with Your Existing Data Easily ⚙    – Needs ETL       ✓ No ETL with PuppyGraph
Smart Decision-Making 🤖                  – Limited         ✓ Yes, with PuppyGraphAgent

PuppyGraph Supports Comprehensive Machine Learning Algorithms
All the algorithms can be customized, including by adding filters.
• FastRP
• GraphSAGE
• Node2Vec
• HashGNN
• DeepWalk
• Graph Convolutional Networks (GCNs)
• Graph Attention Networks (GATs)
• Graph Neural Networks (GNNs)

PuppyGraph Supports Large Graph Algorithms
All the algorithms can be customized, including by adding filters.
• Centrality: Article Rank, Betweenness Centrality, CELF, Closeness Centrality, Degree Centrality, Eigenvector Centrality, Page Rank, Harmonic Centrality, HITS
• Community detection: Conductance metric, K-Core Decomposition, K-1 Coloring, K-Means Clustering, Label Propagation, Leiden, Local Clustering Coefficient, Louvain, Modularity metric, Modularity Optimization, Strongly Connected Components, Triangle Count, Weakly Connected Components, Approximate Maximum k-cut, Speaker-Listener Label Propagation
• Path finding: Delta-Stepping Single-Source Shortest Path, Dijkstra Source-Target Shortest Path, Dijkstra Single-Source Shortest Path, A* Shortest Path, Yen’s Shortest Path, Breadth First Search, Depth First Search, Random Walk, Bellman-Ford Single-Source Shortest Path, Minimum Weight Spanning Tree, Minimum Directed Steiner Tree, Minimum Weight k-Spanning Tree, All Pairs Shortest Path, Longest Path for DAG
• Similarity: Node Similarity, K-Nearest Neighbors
• DAG algorithms: Topological Sort, Longest Path
• Topological link prediction: Adamic Adar, Common Neighbors, Preferential Attachment, Resource Allocation, Same Community, Total Neighbors
• Pregel API

Patient Journey Graph Demo
• Business use case: This demo queries synthetic data that mimics the patient journey typically needed by healthcare or insurance organizations.
• Analysis/queries we run:
◦ Query the information of the first 10 patients
◦ Query the complete path from a specific patient to their related admissions, diagnoses, and the corresponding ICD diagnosis details
◦ Query all patients diagnosed with "Aphasia" and then find their other diagnosis records
◦ Query the top 10 most common diagnoses among patients
◦ Modify a specific patient's data through Trino, then read the updated data using PuppyGraph
• Tech Stack: PuppyGraph, Trino, Apache Iceberg, Docker
Link to the demo recording

Clinical Knowledge Graph Demo
• Business use case: This demo queries the public Drug Central data using PuppyGraph. Drug Central provides information on active ingredients, chemical entities, pharmaceutical products, drug mode of action, indications, and pharmacologic action.
• We downloaded a single copy of the Drug Central data (in Postgres format) and purposefully stored different tables in two different data sources to show that PuppyGraph can query one or more SQL data stores: Apache Iceberg (e.g., bioactivity, target) with 24 tables, and PostgreSQL (e.g., drugs, mechanisms of action and human action targets) with 41 tables.
• Analysis/queries we run:
◦ List drugs approved by the FDA, limiting the number of returned results
◦ Get drugs with mechanisms of action for human targets, returning 50 results. This query uses both data sources (Iceberg and Postgres).
• Tech Stack: PuppyGraph, Apache Iceberg, PostgreSQL, Docker
Link to the voiceover demo recording

PuppyGraph Feature Demo Videos
• 100-second PuppyGraph Intro Video
• Create graph schema using PuppyGraph UI
• PuppyGraph dashboard demo
• Instant schema change feature demo
Book a meeting with the PuppyGraph Team: https://bit.ly/Meet-PuppyGraph

PuppyGraph x Databricks Resources
• PuppyGraph x Databricks Demo Video (with voice over)
• Blog: Integrating Unity Catalog with PuppyGraph for Real-time Graph Analysis
• Blog: GraphRAG with Databricks and PuppyGraph
• Blog: Databricks Knowledge Graph: Everything You Need To Know
Databricks CTO gave PuppyGraph a shoutout on LinkedIn

Stay Connected!
Weimo Liu, CEO & Co-Founder at PuppyGraph: https://www.linkedin.com/in/weimoliu/
Danfeng Xu, CTO & Co-Founder at PuppyGraph: https://www.linkedin.com/in/xudanfeng/
@PuppyGraph, @PuppyQuery, www.puppygraph.com
