Graph Databases - A gentle introduction

Slides supporting an introductory talk on Neo4j & NoSQL. Most slides are duplicated from existing Neo4j presentations.

Awesome Incremented

August 08, 2015

Transcript

  1. Key Value Stores
    • Most based on Dynamo: Amazon’s Highly Available Key-Value Store
    • Data model:
      • Global key-value mapping
      • Big scalable HashMap
      • Highly fault tolerant (typically)
    • Examples:
      • Redis, Riak, Voldemort
  2. Key Value Stores: Pros and Cons
    • Pros:
      • Simple data model
      • Scalable
    • Cons:
      • Create your own “foreign keys”
      • Poor for complex data
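
The “create your own foreign keys” drawback on slide 2 can be illustrated with a plain Python dict standing in for a key-value store. This is only a sketch: the key naming scheme (`user:1`, `order:10`) and the `orders_for` helper are hypothetical and do not correspond to any particular store's API.

```python
# A plain dict stands in for a key-value store (an assumption for illustration).
store = {
    "user:1": {"name": "Alice", "order_keys": ["order:10", "order:11"]},
    "order:10": {"item": "book"},
    "order:11": {"item": "pen"},
}


def orders_for(store, user_key):
    """Follow the hand-maintained 'foreign keys' from a user to its orders."""
    user = store[user_key]
    # The store knows nothing about these references; the application
    # has to resolve each one with a separate lookup.
    return [store[key] for key in user["order_keys"]]


items = [order["item"] for order in orders_for(store, "user:1")]
print(items)
```

Keeping `order_keys` consistent when orders are added or deleted is entirely the application's job, which is the cost the slide points at.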
  3. Column Family
    • Most based on BigTable: Google’s Distributed Storage System for Structured Data
    • Data model:
      • A big table, with column families
      • Map Reduce for querying/processing
    • Examples:
      • HBase, HyperTable, Cassandra, (SAP Hana)
  4. Column Family: Pros and Cons
    • Pros:
      • Supports semi-structured data
      • Naturally indexed (columns)
      • Scalable
    • Cons:
      • Poor for interconnected data
  5. Document Databases
    • Data model:
      • A collection of documents
      • A document is a key value collection
      • Index-centric, lots of map-reduce
    • Examples:
      • CouchDB, MongoDB, SOLR, ElasticSearch
  6. Document Databases: Pros and Cons
    • Pros:
      • Simple, powerful data model
      • Scalable
    • Cons:
      • Poor for interconnected data
      • Query model limited to keys and indexes
      • Map reduce for larger queries
  7. Graph Databases
    • Data model:
      • Nodes and Relationships
    • Examples:
      • Neo4j, OrientDB, InfiniteGraph, AllegroGraph
  8. Graph Databases: Pros and Cons
    • Pros:
      • Powerful data model, as general as RDBMS
      • Connected data locally indexed
      • Easy to query
    • Cons:
      • Sharding (lots of people working on this)
      • Scales UP reasonably well
      • Requires rewiring your brain
  9. Trend 1: Data Size
    [Chart: data size by year, 2007 to 2011 (2011 projected); vertical axis from 0 to 3000]
  10. Data is getting bigger:
    “Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, Google (2010)
  11. Data is more connected:
    • Text (content)
    • HyperText (added pointers)
    • RSS (joined those pointers)
    • Blogs (added pingbacks)
    • Tagging (grouped related data)
    • RDF (described connected data)
    • GGG (content + pointers + relationships + descriptions)
  12. Trend 3: Semi-structured information
    • Individualisation of content
      • 1970’s salary lists, all elements exactly one job
      • 2000’s salary lists, we need many job columns!
    • Store more data about each entity
    • Trend accelerated by the decentralization of content generation
      • Age of participation (“web 2.0”)
  13. Trend 4: Architecture
    • 2000’s: SOA
    • [Diagram: three DB/Application pairs]
    • RESTful, hypermedia, composite apps
  14. What is a Graph?
    • An abstract representation of a set of objects where some pairs are connected by links.
    • Object (Vertex, Node)
    • Link (Edge, Arc, Relationship)
  15. Different Kinds of Graphs
    • Undirected Graph
    • Directed Graph
    • Pseudo Graph
    • Multi Graph
    • Hyper Graph
  16. What is a Graph Database?
    • A database with an explicit graph structure
    • Each node knows its adjacent nodes
    • As the number of nodes increases, the cost of a local step (or hop) remains the same
    • Plus an Index for lookups
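
Slide 16's idea that each node knows its adjacent nodes, with an index used only for lookups, can be sketched in a few lines of Python. The `Node` class and the `index` dict are hypothetical names chosen for illustration; this is not how Neo4j stores data internally.

```python
class Node:
    """A graph node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []

    def connect(self, other):
        # Store the relationship on both endpoints (undirected, for simplicity).
        self.neighbors.append(other)
        other.neighbors.append(self)


# The index is used once, to find a starting node by name.
index = {name: Node(name) for name in ("alice", "bob", "carol")}
index["alice"].connect(index["bob"])
index["bob"].connect(index["carol"])

start = index["alice"]                       # index lookup
one_hop = [n.name for n in start.neighbors]  # local step: follow references
two_hops = [m.name for n in start.neighbors
            for m in n.neighbors if m is not start]
print(one_hop, two_hops)
```

Each hop only touches the neighbour list of the current node, so its cost depends on that node's degree rather than on how many nodes the whole graph contains.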
  17. What is a Graph Database
    “A graph database... is an online database management system with CRUD methods that expose a graph data model” [1]
    • Two important properties:
      • Native graph storage engine: written from the ground up to manage graph data
      • Native graph processing, including index-free adjacency to facilitate traversals
    [1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013. p. 5. ISBN-10: 1449356265
  18. Graph Databases are Designed to:
    1. Store inter-connected data
    2. Make it easy to make sense of that data
    3. Enable extreme-performance operations for:
      • Discovery of connected data patterns
      • Relatedness queries > depth 1
      • Relatedness queries of arbitrary length
    4. Make it easy to evolve the database
  19. The Problem
    • All JOINs are executed every time you query (traverse) the relationship
    • Executing a JOIN means searching for a key in another table
    • With indices, executing a JOIN means looking up a key
    • B-Tree index: O(log(n))
    • More entries => more lookups => slower JOINs
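
The O(log(n)) index cost from slide 19 can be made concrete by counting the probes of a binary search over sorted keys. This is a simplification: a binary search over a sorted sequence stands in for a B-Tree (real B-Trees have a much higher fan-out, but the growth is still logarithmic), and `lookup_steps` is a hypothetical helper written for this sketch.

```python
def lookup_steps(sorted_keys, key):
    """Count the probes a binary search needs to locate key."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps


# range objects support len() and indexing, so no large list is built.
small = lookup_steps(range(1_000), 777)
large = lookup_steps(range(1_000_000), 777_777)
print(small, large)
```

Every JOIN pays this per-key cost again, and the cost grows with the table, whereas following a stored reference to a neighbouring node does not depend on the total number of entries.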
  20. Connected Query Performance
    Query Response Time* = f(graph density, graph size, query degree)
    • Graph Density (avg # rel’s / node)
    • Graph Size (total # nodes in the graph)
    • Query Degree (# of hops in one’s query)
    RDBMS:
      >> Exponential slowdown as each factor increases
    Neo4j:
      >> Performance remains constant as graph size increases
      >> Performance slowdown is linear or better as density & degree increase
  21. Connected Query Performance: RDBMS vs. Native Graph Database
    [Chart: response time against connectedness of data set, with one curve each for Neo4j and RDBMS]
    • Low end of the axis: Degree: < 3, Size: Thousands, # Hops: < 3
    • High end of the axis: Degree: Thousands+, Size: Billions+, # Hops: Tens to Hundreds
  22. Social Network “path exists” Performance
    • Experiment:
      • ~1k persons
      • Average 50 friends per person
      • pathExists(a,b) limited to depth 4
    Database               # persons   query time
    Relational database    1000        2000ms
  23. Social Network “path exists” Performance
    • Experiment:
      • ~1k persons
      • Average 50 friends per person
      • pathExists(a,b) limited to depth 4
    Database               # persons   query time
    Relational database    1000        2000ms
    Neo4j                  1000        2ms
  24. Social Network “path exists” Performance
    • Experiment:
      • ~1k persons
      • Average 50 friends per person
      • pathExists(a,b) limited to depth 4
    Database               # persons   query time
    Relational database    1000        2000ms
    Neo4j                  1000        2ms
    Neo4j                  1000000     2ms
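
The `pathExists(a,b)` query limited to depth 4 can be approximated with a depth-limited breadth-first search. This is a sketch under assumptions: the slides do not show how the experiment was implemented, and the `path_exists` function and the tiny `friends` graph are made up for illustration.

```python
from collections import deque


def path_exists(graph, start, goal, max_depth=4):
    """Breadth-first search that gives up after max_depth hops."""
    if start == goal:
        return True
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in graph.get(node, ()):
            if friend == goal:
                return True
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return False


# A chain a-b-c-d-e-f: e is 4 hops from a, f is 5 hops away.
friends = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
print(path_exists(friends, "a", "e"), path_exists(friends, "a", "f"))
```

The search only visits nodes within four hops of the start, so the work depends on the local neighbourhood rather than on the total number of persons in the graph.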
  25. Cypher: Pattern Matching Query Language (like SQL for graphs)
    // get node 0
    start a=(0) return a
    // traverse from node 1
    start a=(1) match (a)-->(b) return b
    // return friends of friends
    start a=(1) match (a)--()--(c) return c
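
For readers unfamiliar with pattern matching, the friends-of-friends pattern `(a)--()--(c)` from slide 25 corresponds roughly to two nested neighbour loops. The sketch below is plain Python over an adjacency dict, not Cypher; `friends_of_friends` and the sample data are hypothetical, and excluding the start node from the result is a choice made here for readability.

```python
def friends_of_friends(graph, a):
    """Collect every node c reachable from a in exactly two hops."""
    result = set()
    for b in graph.get(a, ()):      # first hop:  (a)--()
        for c in graph.get(b, ()):  # second hop: ()--(c)
            if c != a:              # skip stepping straight back to a
                result.add(c)
    return result


friends = {1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3], 5: [3]}
fof = friends_of_friends(friends, 1)
print(sorted(fof))
```

The query language expresses the same traversal declaratively as a pattern, leaving the loops to the database.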
  26. Takeaways
    • Trends (Data, Architecture)
    • NoSQL: The big four
    • How to pick
    • Why aggregates suck with connected data
    • Slicing through data limited (needs lots of map & reduce)
    • Connected Query Performance
    • IP needs “hopping” through data anyway (k > 3)