Graph Databases - A gentle introduction

Slide 1

Slide 1 text

Graph Databases: A gentle introduction
Dev.Talk, August 2015
Marcel Körtgen

Slide 2

Slide 2 text

Four NOSQL Categories

Slide 3

Slide 3 text

Key Value Stores
• Most based on Dynamo: Amazon Highly Available Key-Value Store
• Data Model:
  • Global key-value mapping
  • Big scalable HashMap
  • Highly fault tolerant (typically)
• Examples:
  • Redis, Riak, Voldemort

Slide 4

Slide 4 text

Key Value Stores: Pros and Cons
• Pros:
  • Simple data model
  • Scalable
• Cons:
  • Create your own “foreign keys”
  • Poor for complex data
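The “create your own foreign keys” point can be illustrated with a minimal sketch in which a plain Python dict stands in for a key-value store. The key naming scheme (`user:1`, `order:7`) and the helper `orders_for` are hypothetical and do not correspond to any product's API.

```python
# A plain dict stands in for a key-value store: one global key -> value mapping.
store = {
    "user:1": {"name": "Alice", "order_keys": ["order:7", "order:9"]},
    "order:7": {"item": "book", "user_key": "user:1"},
    "order:9": {"item": "lamp", "user_key": "user:1"},
}


def orders_for(store, user_key):
    """Follow the hand-rolled 'foreign keys' kept inside the user's value."""
    user = store[user_key]
    # The store knows nothing about these references; the application
    # has to resolve each one with a separate lookup.
    return [store[key] for key in user["order_keys"]]
```

The references live inside opaque values, so keeping them consistent is entirely the application's job.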

Slide 5

Slide 5 text

Column Family
• Most based on BigTable: Google’s Distributed Storage System for Structured Data
• Data Model:
  • A big table, with column families
  • Map Reduce for querying/processing
• Examples:
  • HBase, HyperTable, Cassandra, (SAP Hana)

Slide 6

Slide 6 text

Column Family: Pros and Cons
• Pros:
  • Supports Semi-Structured Data
  • Naturally Indexed (columns)
  • Scalable
• Cons:
  • Poor for interconnected data

Slide 7

Slide 7 text

Document Databases
• Data Model:
  • A collection of documents
  • A document is a key value collection
  • Index-centric, lots of map-reduce
• Examples:
  • CouchDB, MongoDB, SOLR, ElasticSearch

Slide 8

Slide 8 text

Document Databases: Pros and Cons
• Pros:
  • Simple, powerful data model
  • Scalable
• Cons:
  • Poor for interconnected data
  • Query model limited to keys and indexes
  • Map reduce for larger queries
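As a rough illustration of a query model built on keys and indexes, the sketch below keeps a collection of documents in a dict and builds a secondary index over one field. The document ids, field names and the helper `build_index` are assumptions made for the example, not the API of any listed database.

```python
from collections import defaultdict

# A collection of documents; each document is itself a key-value collection.
documents = {
    "doc1": {"type": "talk", "title": "Graph Databases", "year": 2015},
    "doc2": {"type": "talk", "title": "Key-Value Stores", "year": 2014},
    "doc3": {"type": "book", "title": "Graph Databases", "year": 2013},
}


def build_index(docs, field_name):
    """Map each value of field_name to the ids of the documents holding it."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        if field_name in doc:
            index[doc[field_name]].append(doc_id)
    return index


# Lookups by title are now cheap; anything not indexed needs a full scan.
by_title = build_index(documents, "title")
```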

Slide 9

Slide 9 text

Graph Databases
• Data Model:
  • Nodes and Relationships
• Examples:
  • Neo4j, OrientDB, InfiniteGraph, AllegroGraph

Slide 10

Slide 10 text

Graph Databases: Pros and Cons
• Pros:
  • Powerful data model, as general as RDBMS
  • Connected data locally indexed
  • Easy to query
• Cons:
  • Sharding (lots of people working on this)
  • Scales UP reasonably well
  • Requires rewiring your brain

Slide 11

Slide 11 text

Why NOSQL now? Driving trends

Slide 12

Slide 12 text

Trend 1: Data Size
[Chart: data size by year, 2007 to 2011?, on a scale from 0 to 3000]

Slide 13

Slide 13 text

Data is getting bigger: “Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, Google (2010)

Slide 14

Slide 14 text

Trend 2: Connectedness
[Chart: information connectivity, from Text Documents through Hypertext, Feeds, Blogs, Wikis, UGC, Tagging, Folksonomies, RDFa and Ontologies to GGG]

Slide 15

Slide 15 text

Data is more connected:
• Text (content)
• HyperText (added pointers)
• RSS (joined those pointers)
• Blogs (added pingbacks)
• Tagging (grouped related data)
• RDF (described connected data)
• GGG, the Giant Global Graph (content + pointers + relationships + descriptions)

Slide 16

Slide 16 text

Trend 3: Semi-structured information
• Individualisation of content
  • 1970’s salary lists, all elements exactly one job
  • 2000’s salary lists, we need many job columns!
• Store more data about each entity
• Trend accelerated by the decentralization of content generation
  • Age of participation (“web 2.0”)

Slide 17

Slide 17 text

Trend 4: Architecture
1980’s: Single Application
[Diagram: one Application, one DB]

Slide 18

Slide 18 text

Trend 4: Architecture
1990’s: Integration Database Antipattern
[Diagram: three Applications, one DB]

Slide 19

Slide 19 text

Trend 4: Architecture
2000’s: SOA
[Diagram: three Applications, each with its own DB]
RESTful, hypermedia, composite apps

Slide 20

Slide 20 text

Trend 4: Architecture 2010’s: Microservices, DevOps, Containerization, 12-factor app, Hexagonal Architectures

Slide 21

Slide 21 text

Graph Buzz!

Slide 22

Slide 22 text

Early Adopters of Graph Tech

Slide 23

Slide 23 text

What is a Graph?

Slide 24

Slide 24 text

What is a Graph?
• An abstract representation of a set of objects where some pairs are connected by links.
  • Object (Vertex, Node)
  • Link (Edge, Arc, Relationship)

Slide 25

Slide 25 text

Different Kinds of Graphs
• Undirected Graph
• Directed Graph
• Pseudo Graph
• Multi Graph
• Hyper Graph

Slide 26

Slide 26 text

More Kinds of Graphs
• Weighted Graph
• Labeled Graph
• Property Graph
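A property graph can be pictured as nodes and relationships that each carry key-value properties, with labels on nodes and a type on every relationship. The sketch below is one possible in-memory shape; the class and field names are hypothetical and not taken from any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)         # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # arbitrary key-value pairs


@dataclass
class Relationship:
    start: int          # id of the node the relationship leaves
    end: int            # id of the node the relationship enters
    rel_type: str       # every relationship has exactly one type
    properties: dict = field(default_factory=dict)


alice = Node(1, {"Person"}, {"name": "Alice"})
bob = Node(2, {"Person"}, {"name": "Bob"})
# A weighted graph is the special case where the property is a numeric weight.
knows = Relationship(alice.node_id, bob.node_id, "KNOWS", {"since": 2010})
```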

Slide 27

Slide 27 text

What is a Graph Database?
• A database with an explicit graph structure
• Each node knows its adjacent nodes
• As the number of nodes increases, the cost of a local step (or hop) remains the same
• Plus an Index for lookups
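The idea that each node knows its adjacent nodes, with an index used only for lookups, can be sketched in a few lines. This is a simplified in-memory model; `GraphNode`, `add_node` and the sample names are assumptions for illustration, not how any particular database stores data.

```python
class GraphNode:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references to adjacent nodes

    def connect(self, other):
        """Link two nodes so that each one knows the other."""
        self.neighbors.append(other)
        other.neighbors.append(self)


# The index is used only to find a starting node by name.
index = {}


def add_node(name):
    node = GraphNode(name)
    index[name] = node
    return node


alice, bob, carol = add_node("Alice"), add_node("Bob"), add_node("Carol")
alice.connect(bob)
bob.connect(carol)

start = index["Bob"]  # one index lookup to find the entry point
# Each hop follows stored references; its cost depends on the number of
# neighbors of this node, not on the total number of nodes.
names = sorted(n.name for n in start.neighbors)
```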

Slide 28

Slide 28 text

What is a Graph Database
“A graph database... is an online database management system with CRUD methods that expose a graph data model” [1]
• Two important properties:
  • Native graph storage engine: written from the ground up to manage graph data
  • Native graph processing, including index-free adjacency to facilitate traversals
[1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013. p. 5. ISBN-10: 1449356265

Slide 29

Slide 29 text

Graph Databases are Designed to:
1. Store inter-connected data
2. Make it easy to make sense of that data
3. Enable extreme-performance operations for:
  • Discovery of connected data patterns
  • Relatedness queries > depth 1
  • Relatedness queries of arbitrary length
4. Make it easy to evolve the database

Slide 30

Slide 30 text

Relational Tables

Slide 31

Slide 31 text

Join this way…

Slide 32

Slide 32 text

The Problem
• All JOINs are executed every time you query (traverse) the relationship
• Executing a JOIN means searching for a key in another table
• With indices, executing a JOIN means looking up a key
• B-Tree index: O(log(n))
• More entries => more lookups => slower JOINs
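The difference between an index lookup per JOIN and following a stored reference can be sketched as follows. A sorted list searched with `bisect` stands in for a B-Tree index here; that substitution, the table size and the names `person_ids`, `index_lookup` and `records` are assumptions chosen for the example.

```python
import bisect

# Sorted key column of a table, standing in for a B-Tree index.
person_ids = list(range(0, 1_000_000, 2))  # 500,000 indexed keys


def index_lookup(sorted_keys, key):
    """Binary search: O(log n) comparisons, growing with the table size."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return None


# Graph-style alternative: the record already holds a reference to its
# neighbor, so following it does not depend on how many records exist.
records = {"alice": {"friend": None}, "bob": {"friend": None}}
records["alice"]["friend"] = records["bob"]
```

Every additional hop in a relational query repeats the `index_lookup` step, while the reference-based version only dereferences what is already stored on the record.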

Slide 33

Slide 33 text

Connected Query Performance

Slide 34

Slide 34 text

Connected Query Performance
Query Response Time* = f(graph density, graph size, query degree)
• Graph Density (avg # rel’s / node)
• Graph Size (total # nodes in the graph)
• Query Degree (# of hops in one’s query)
RDBMS:
  >> Exponential slowdown as each factor increases
Neo4j:
  >> Performance remains constant as graph size increases
  >> Performance slowdown is linear or better as density & degree increase

Slide 35

Slide 35 text

Connected Query Performance: RDBMS vs. Native Graph Database
[Chart: Response Time against Connectedness of Data Set, with curves for Neo4j and RDBMS. Low end: Degree < 3, Size: Thousands, # Hops < 3. High end: Degree: Thousands+, Size: Billions+, # Hops: Tens to Hundreds.]

Slide 36

Slide 36 text

Social Network “path exists” Performance
• Experiment:
  • ~1k persons
  • Average 50 friends per person
  • pathExists(a,b) limited to depth 4

                      # persons   query time
Relational database   1000        2000ms
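A depth-limited `pathExists(a,b)` check like the one in the experiment can be written as a breadth-first search over an adjacency map. This is a minimal sketch for illustration, not the code used in the benchmark; the function name `path_exists`, the `friends` mapping and the tiny sample network are assumptions.

```python
from collections import deque


def path_exists(friends, a, b, max_depth=4):
    """Breadth-first search from a; True if b is reachable within max_depth hops."""
    if a == b:
        return True
    visited = {a}
    queue = deque([(a, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in friends.get(person, ()):
            if friend == b:
                return True
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return False


# Tiny hypothetical network: a chain 0 - 1 - 2 - 3 - 4 - 5
friends = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
```

With this chain, `path_exists(friends, 0, 4)` succeeds because person 4 is four hops away, while person 5 lies beyond the depth limit.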

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Cypher
Pattern Matching Query Language (like SQL for graphs)

    // get node 0
    MATCH (a) WHERE id(a) = 0 RETURN a

    // traverse from node 1
    MATCH (a)-->(b) WHERE id(a) = 1 RETURN b

    // return friends of friends
    MATCH (a)--()--(c) WHERE id(a) = 1 RETURN c
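For readers new to pattern matching, the `(a)-->(b)` and `(a)--()--(c)` patterns correspond roughly to the set operations below on a list of directed relationships. This is only an analogy in Python, not how Neo4j evaluates Cypher; the node ids and helper names are hypothetical, and the friends-of-friends helper assumes at most one relationship per pair of nodes.

```python
# Directed relationships as (start, end) pairs between hypothetical node ids.
relationships = [(1, 2), (1, 3), (2, 4), (5, 3)]


def outgoing(node):
    """Roughly (a)-->(b): nodes reached by following a relationship forward."""
    return {end for start, end in relationships if start == node}


def neighbors(node):
    """Roughly (a)--(b): adjacent nodes, ignoring direction."""
    incoming = {start for start, end in relationships if end == node}
    return outgoing(node) | incoming


def friends_of_friends(node):
    """Roughly (a)--()--(c): two hops in any direction, excluding the start."""
    return {c for b in neighbors(node) for c in neighbors(b) if c != node}
```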

Slide 40

Slide 40 text

Demo Time
• Start up Neo4j
• Browse to http://localhost:7474

Slide 41

Slide 41 text

Use cases
• NLP
• Big Data
• Fraud Detection

Slide 42

Slide 42 text

Takeaways
• Trends (Data, Architecture)
• NoSQL: The big four
• How to pick
• Why Aggregates suck with connected data
• Slicing through data limited (needs lots of map&reduce)
• Connected Query Performance
• IP needs “hopping” through data anyway (k > 3)

Slide 43

Slide 43 text

Thank You
Time for Questions!