Graph Databases - A gentle introduction

Slides supporting an introductory talk on Neo4j & NoSQL. Most slides are duplicated from existing Neo4j presentations.

Awesome Incremented

August 08, 2015

Transcript

  1. Key Value Stores
     • Most based on Dynamo: Amazon’s Highly Available Key-Value Store
     • Data Model:
       • Global key-value mapping
       • Big, scalable HashMap
     • Highly fault tolerant (typically)
     • Examples: Redis, Riak, Voldemort
  2. Key Value Stores: Pros and Cons
     • Pros:
       • Simple data model
       • Scalable
     • Cons:
       • You create your own “foreign keys” (sketched below)
       • Poor for complex data
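
A minimal Python sketch of the key-value model described above, using a plain dict and made-up keys (user:1, order:42) rather than any real store's API; the hand-rolled “foreign keys” are exactly the Cons bullet.

      # A plain dict standing in for a key-value store: one global key -> value mapping.
      # Keys and values are invented for illustration.
      store = {}

      # Values are opaque to the store; structure lives entirely in the application.
      store["user:1"] = {"name": "Alice", "order_ids": ["order:42"]}   # hand-rolled "foreign keys"
      store["order:42"] = {"item": "book", "user_id": "user:1"}

      # Lookups by key are cheap...
      order = store["order:42"]

      # ...but following a reference is the application's job, one lookup at a time.
      buyer = store[order["user_id"]]
      print(buyer["name"])  # -> Alice
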
  3. Column Family
     • Most based on BigTable: Google’s Distributed Storage System for Structured Data
     • Data Model:
       • A big table, with column families
       • MapReduce for querying/processing
     • Examples: HBase, HyperTable, Cassandra, (SAP Hana)
  4. Column Family: Pros and Cons
     • Pros:
       • Supports semi-structured data (sketched below)
       • Naturally indexed (columns)
       • Scalable
     • Cons:
       • Poor for interconnected data
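
A rough Python sketch of the BigTable-style model, with invented row keys and column families rather than a real HBase or Cassandra API: each row key maps to column families, and each family holds a sparse, per-row set of columns.

      # Nested dicts standing in for a column-family table:
      # row key -> column family -> column name -> value.  All names are illustrative.
      table = {
          "user:1": {
              "profile": {"name": "Alice", "city": "Oslo"},
              "jobs":    {"job:2001": "barista", "job:2007": "developer"},  # sparse columns
          },
          "user:2": {
              "profile": {"name": "Bob"},           # different rows can have different columns
              "jobs":    {"job:2010": "designer"},
          },
      }

      # Reading one column is a direct lookup by row key, family and column.
      print(table["user:1"]["profile"]["city"])     # -> Oslo

      # Anything beyond key/column access (e.g. "who worked in 2007?") means scanning rows,
      # which is why these systems lean on MapReduce-style processing.
      hits = [row for row, fams in table.items()
              if any(col.endswith("2007") for col in fams.get("jobs", {}))]
      print(hits)                                   # -> ['user:1']
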
  5. Document Databases
     • Data Model:
       • A collection of documents
       • A document is a key-value collection
       • Index-centric, lots of map-reduce
     • Examples: CouchDB, MongoDB, SOLR, ElasticSearch
  6. Document Databases: Pros and Cons
     • Pros:
       • Simple, powerful data model (sketched below)
       • Scalable
     • Cons:
       • Poor for interconnected data
       • Query model limited to keys and indexes
       • Map-reduce for larger queries
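
A small Python sketch of the document model and its index-centric querying; the documents, fields and the hand-built secondary index are illustrative, not any CouchDB or MongoDB API.

      # A "collection" of documents keyed by id; each document is a free-form key-value structure.
      people = {
          "p1": {"name": "Alice", "jobs": ["barista", "developer"], "city": "Oslo"},
          "p2": {"name": "Bob", "jobs": ["designer"]},        # schema can vary per document
      }

      # Queries lean on indexes: here a hand-built secondary index on "city".
      by_city = {}
      for doc_id, doc in people.items():
          by_city.setdefault(doc.get("city"), []).append(doc_id)

      print(by_city.get("Oslo"))    # -> ['p1']  (index lookup)

      # Anything the indexes don't cover becomes a scan (or, at scale, a map-reduce job),
      # and relationships between documents are still just ids you dereference yourself.
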
  7. Graph Databases
     • Data Model: Nodes and Relationships (sketched below)
     • Examples: Neo4j, OrientDB, InfiniteGraph, AllegroGraph
  8. Graph Databases: Pros and Cons
     • Pros:
       • Powerful data model, as general as an RDBMS
       • Connected data is locally indexed
       • Easy to query
     • Cons:
       • Sharding (lots of people are working on this)
       • Scales UP reasonably well
       • Requires rewiring your brain
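
A minimal Python sketch of the nodes-and-relationships data model; the labels, relationship types and properties are invented for illustration, and this is not Neo4j's storage format.

      # Nodes carry a label and properties; relationships are typed, directed,
      # and can carry properties too.  All names and values are illustrative.
      nodes = {
          1: {"label": "Person", "props": {"name": "Alice"}},
          2: {"label": "Person", "props": {"name": "Bob"}},
          3: {"label": "Movie",  "props": {"title": "The Matrix"}},
      }

      relationships = [
          {"start": 1, "end": 2, "type": "KNOWS",   "props": {"since": 2010}},
          {"start": 1, "end": 3, "type": "WATCHED", "props": {}},
      ]

      # "Easy to query": follow relationships directly instead of joining tables.
      alice_knows = [nodes[r["end"]]["props"]["name"]
                     for r in relationships
                     if r["start"] == 1 and r["type"] == "KNOWS"]
      print(alice_knows)  # -> ['Bob']
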
  9. Trend 1: Data Size
     (Chart: data volume growing steeply year over year, 2007 through a projected 2011.)
  10. Data is getting bigger: “Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, Google (2010)
  11. Data is more connected:
      • Text (content)
      • HyperText (added pointers)
      • RSS (joined those pointers)
      • Blogs (added pingbacks)
      • Tagging (grouped related data)
      • RDF (described connected data)
      • GGG (content + pointers + relationships + descriptions)
  12. Trend 3: Semi-structured information
      • Individualisation of content:
        • 1970’s salary lists: all elements have exactly one job
        • 2000’s salary lists: we need many job columns! (sketched below)
      • Store more data about each entity
      • Trend accelerated by the decentralization of content generation
      • Age of participation (“web 2.0”)
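
A tiny Python illustration of the shift described in slide 12, with made-up records: a fixed one-job-per-row list versus semi-structured entities whose fields vary.

      # 1970s-style fixed schema: every row has exactly the same columns, one job each.
      salary_list_1970s = [
          ("Alice", "clerk",  21000),
          ("Bob",   "typist", 19000),
      ]

      # 2000s-style semi-structured records: each entity carries whatever fields it needs.
      salary_list_2000s = [
          {"name": "Alice", "jobs": ["clerk", "analyst", "consultant"], "salary": 52000},
          {"name": "Bob",   "jobs": ["typist"], "salary": 31000, "blog": "https://example.org/bob"},
      ]
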
  13. Trend 4: Architecture
      • 2000’s: SOA (diagram: several Application + DB pairs)
      • RESTful, hypermedia, composite apps
  14. What is a Graph?
      • An abstract representation of a set of objects where some pairs are connected by links.
      • Object (Vertex, Node); Link (Edge, Arc, Relationship)
  15. Different Kinds of Graphs
      • Undirected Graph
      • Directed Graph
      • Pseudo Graph
      • Multi Graph
      • Hyper Graph
  16. What is a Graph Database?
      • A database with an explicit graph structure
      • Each node knows its adjacent nodes
      • As the number of nodes increases, the cost of a local step (or hop) remains the same
      • Plus an index for lookups
  17. What is a Graph Database?
      • “A graph database... is an online database management system with CRUD methods that expose a graph data model” [1]
      • Two important properties:
        • Native graph storage engine: written from the ground up to manage graph data
        • Native graph processing, including index-free adjacency to facilitate traversals (sketched below)
      [1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013, p. 5. ISBN-10: 1449356265
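
A bare-bones Python sketch of index-free adjacency as slides 16 and 17 describe it (hypothetical data, not Neo4j's actual engine): an index is used once to find the starting node, and from there each node holds direct references to its neighbours, so a hop costs the same however large the graph grows.

      # Node records hold properties plus direct references to neighbouring node ids,
      # so following a relationship never consults a global index.  Data is illustrative only.
      nodes = {
          101: {"name": "Alice", "neighbours": [102, 103]},
          102: {"name": "Bob",   "neighbours": [101]},
          103: {"name": "Carol", "neighbours": [101, 104]},
          104: {"name": "Dave",  "neighbours": [103]},
      }

      # A separate index is used only to find the *starting* node of a traversal.
      index_by_name = {n["name"]: node_id for node_id, n in nodes.items()}

      start = index_by_name["Alice"]                      # one index lookup to enter the graph
      friends = nodes[start]["neighbours"]                # each hop is a constant-cost pointer chase
      fofs = {f2 for f in friends for f2 in nodes[f]["neighbours"] if f2 != start}
      print([nodes[i]["name"] for i in fofs])             # -> ['Dave']
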
  18. Graph Databases are Designed to:
      1. Store interconnected data
      2. Make it easy to make sense of that data
      3. Enable extreme-performance operations for:
         • Discovery of connected data patterns
         • Relatedness queries at depth > 1
         • Relatedness queries of arbitrary length
      4. Make it easy to evolve the database
  19. The Problem
      • All JOINs are executed every time you query (traverse) the relationship
      • Executing a JOIN means searching for a key in another table
      • With indices, executing a JOIN means looking up a key
      • B-Tree index: O(log(n)) (contrasted with a direct hop in the sketch below)
      • More entries => more lookups => slower JOINs
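
A back-of-envelope Python sketch of the contrast, with illustrative data structures standing in for a real RDBMS index and a graph store: resolving a relationship through a sorted key index costs a binary search, O(log(n)) and growing with table size, while index-free adjacency resolves it with a direct reference.

      import bisect

      # Relational-style: a relationship is resolved by searching a sorted key index, O(log(n)).
      friend_rows = sorted([(1, 2), (1, 3), (2, 3), (3, 4)])   # (person_id, friend_id), illustrative
      keys = [row[0] for row in friend_rows]

      def friends_via_index(person_id):
          i = bisect.bisect_left(keys, person_id)              # binary search: cost grows with table size
          out = []
          while i < len(keys) and keys[i] == person_id:
              out.append(friend_rows[i][1])
              i += 1
          return out

      # Graph-style: the node already holds references to its neighbours, so a hop is O(1)
      # regardless of how many nodes exist elsewhere in the graph.
      adjacency = {1: [2, 3], 2: [3], 3: [4], 4: []}

      def friends_via_adjacency(person_id):
          return adjacency[person_id]

      print(friends_via_index(1), friends_via_adjacency(1))    # -> [2, 3] [2, 3]
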
  20. Connected Query Performance
      • Query Response Time* = f(graph density, graph size, query degree)
        • Graph density: avg # relationships per node
        • Graph size: total # nodes in the graph
        • Query degree: # of hops in one’s query
      • RDBMS: exponential slowdown as each factor increases
      • Neo4j:
        • Performance remains constant as graph size increases
        • Performance slowdown is linear or better as density & degree increase
  21. Connected Query Performance: RDBMS vs. Native Graph Database
      (Chart: response time vs. connectedness of the data set. The RDBMS curve covers degree < 3, sizes in the thousands, and < 3 hops; the Neo4j curve covers degree in the thousands+, sizes in the billions+, and tens to hundreds of hops.)
  22. Social Network “path exists” Performance
      • Experiment:
        • ~1k persons
        • Average 50 friends per person
        • pathExists(a,b) limited to depth 4 (see the BFS sketch after slide 24)
      • Result: Relational database, 1,000 persons: 2,000 ms
  23. Social Network “path exists” Performance (same experiment, continued)
      • Relational database, 1,000 persons: 2,000 ms
      • Neo4j, 1,000 persons: 2 ms
  24. Social Network “path exists” Performance (same experiment, continued)
      • Relational database, 1,000 persons: 2,000 ms
      • Neo4j, 1,000 persons: 2 ms
      • Neo4j, 1,000,000 persons: 2 ms
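
A small Python sketch of what a depth-limited pathExists(a,b) check amounts to on the graph side, using a randomly generated toy network rather than the benchmark's actual data or Neo4j internals: a breadth-first search that stops after 4 hops only ever touches the neighbourhood of a, regardless of total graph size.

      import random
      from collections import deque

      def path_exists(adjacency, a, b, max_depth=4):
          """Breadth-first search from a, giving up beyond max_depth hops."""
          frontier, seen = deque([(a, 0)]), {a}
          while frontier:
              node, depth = frontier.popleft()
              if node == b:
                  return True
              if depth == max_depth:
                  continue
              for neighbour in adjacency.get(node, ()):
                  if neighbour not in seen:
                      seen.add(neighbour)
                      frontier.append((neighbour, depth + 1))
          return False

      # Toy social network roughly matching the experiment: ~1k persons, ~50 friends each.
      random.seed(0)
      people = range(1000)
      adjacency = {p: random.sample([q for q in people if q != p], 50) for p in people}

      print(path_exists(adjacency, 0, 999))   # only the 4-hop neighbourhood of node 0 is explored
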
  25. Cypher: a pattern-matching query language (like SQL for graphs)
        // get node 0
        START a=node(0) RETURN a
        // traverse from node 1
        START a=node(1) MATCH (a)-->(b) RETURN b
        // return friends of friends
        START a=node(1) MATCH (a)--()--(c) RETURN c
  26. Takeaways
      • Trends (data, architecture)
      • NoSQL: the big four
      • How to pick
      • Why aggregates suck with connected data
        • Slicing through data is limited (needs lots of map & reduce)
      • Connected query performance
        • IP needs “hopping” through data anyway (k > 3)