Graph Databases - A gentle introduction

Slides supporting an introductory talk on Neo4j & NoSQL. Most slides are duplicated from existing Neo4j presentations.

Awesome Incremented

August 08, 2015

Transcript

  1. Key Value Stores
     • Most based on Dynamo: Amazon’s Highly Available Key-Value Store
     • Data Model:
       • Global key-value mapping
       • Big, scalable HashMap
     • Highly fault tolerant (typically)
     • Examples: Redis, Riak, Voldemort
  2. Key Value Stores: Pros and Cons
     • Pros:
       • Simple data model
       • Scalable
     • Cons:
       • You create your own “foreign keys” (sketched below)
       • Poor for complex data
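
A minimal Python sketch of the key-value model described above, using a plain dict and made-up keys (user:1, order:42) rather than any real store's API; the hand-rolled “foreign keys” are exactly the Cons bullet.

      # A plain dict standing in for a key-value store: one global key -> value mapping.
      # Keys and values are invented for illustration.
      store = {}

      # Values are opaque to the store; structure lives entirely in the application.
      store["user:1"] = {"name": "Alice", "order_ids": ["order:42"]}   # hand-rolled "foreign keys"
      store["order:42"] = {"item": "book", "user_id": "user:1"}

      # Lookups by key are cheap...
      order = store["order:42"]

      # ...but following a reference is the application's job, one lookup at a time.
      buyer = store[order["user_id"]]
      print(buyer["name"])  # -> Alice
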
  3. Column Family
     • Most based on BigTable: Google’s Distributed Storage System for Structured Data
     • Data Model:
       • A big table, with column families
       • MapReduce for querying/processing
     • Examples: HBase, HyperTable, Cassandra, (SAP Hana)
  4. Column Family: Pros and Cons
     • Pros:
       • Supports semi-structured data (sketched below)
       • Naturally indexed (columns)
       • Scalable
     • Cons:
       • Poor for interconnected data
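
A rough Python sketch of the BigTable-style model, with invented row keys and column families rather than a real HBase or Cassandra API: each row key maps to column families, and each family holds a sparse, per-row set of columns.

      # Nested dicts standing in for a column-family table:
      # row key -> column family -> column name -> value.  All names are illustrative.
      table = {
          "user:1": {
              "profile": {"name": "Alice", "city": "Oslo"},
              "jobs":    {"job:2001": "barista", "job:2007": "developer"},  # sparse columns
          },
          "user:2": {
              "profile": {"name": "Bob"},           # different rows can have different columns
              "jobs":    {"job:2010": "designer"},
          },
      }

      # Reading one column is a direct lookup by row key, family and column.
      print(table["user:1"]["profile"]["city"])     # -> Oslo

      # Anything beyond key/column access (e.g. "who worked in 2007?") means scanning rows,
      # which is why these systems lean on MapReduce-style processing.
      hits = [row for row, fams in table.items()
              if any(col.endswith("2007") for col in fams.get("jobs", {}))]
      print(hits)                                   # -> ['user:1']
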
  5. Document Databases
     • Data Model:
       • A collection of documents
       • A document is a key-value collection
       • Index-centric, lots of map-reduce
     • Examples: CouchDB, MongoDB, SOLR, ElasticSearch
  6. Document Databases: Pros and Cons
     • Pros:
       • Simple, powerful data model (sketched below)
       • Scalable
     • Cons:
       • Poor for interconnected data
       • Query model limited to keys and indexes
       • Map-reduce for larger queries
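
A small Python sketch of the document model and its index-centric querying; the documents, fields and the hand-built secondary index are illustrative, not any CouchDB or MongoDB API.

      # A "collection" of documents keyed by id; each document is a free-form key-value structure.
      people = {
          "p1": {"name": "Alice", "jobs": ["barista", "developer"], "city": "Oslo"},
          "p2": {"name": "Bob", "jobs": ["designer"]},        # schema can vary per document
      }

      # Queries lean on indexes: here a hand-built secondary index on "city".
      by_city = {}
      for doc_id, doc in people.items():
          by_city.setdefault(doc.get("city"), []).append(doc_id)

      print(by_city.get("Oslo"))    # -> ['p1']  (index lookup)

      # Anything the indexes don't cover becomes a scan (or, at scale, a map-reduce job),
      # and relationships between documents are still just ids you dereference yourself.
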
  7. Graph Databases
     • Data Model: Nodes and Relationships (sketched below)
     • Examples: Neo4j, OrientDB, InfiniteGraph, AllegroGraph
  8. Graph Databases: Pros and Cons
     • Pros:
       • Powerful data model, as general as an RDBMS
       • Connected data is locally indexed
       • Easy to query
     • Cons:
       • Sharding (lots of people are working on this)
       • Scales UP reasonably well
       • Requires rewiring your brain
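
A minimal Python sketch of the nodes-and-relationships data model; the labels, relationship types and properties are invented for illustration, and this is not Neo4j's storage format.

      # Nodes carry a label and properties; relationships are typed, directed,
      # and can carry properties too.  All names and values are illustrative.
      nodes = {
          1: {"label": "Person", "props": {"name": "Alice"}},
          2: {"label": "Person", "props": {"name": "Bob"}},
          3: {"label": "Movie",  "props": {"title": "The Matrix"}},
      }

      relationships = [
          {"start": 1, "end": 2, "type": "KNOWS",   "props": {"since": 2010}},
          {"start": 1, "end": 3, "type": "WATCHED", "props": {}},
      ]

      # "Easy to query": follow relationships directly instead of joining tables.
      alice_knows = [nodes[r["end"]]["props"]["name"]
                     for r in relationships
                     if r["start"] == 1 and r["type"] == "KNOWS"]
      print(alice_knows)  # -> ['Bob']
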
  9. Trend 1: Data Size
     (Chart: data volume growing steeply year over year, 2007 through a projected 2011.)
  10. Data is getting bigger: “Every 2 days we create as much information as we did up to 2003” – Eric Schmidt, Google (2010)
  11. Data is more connected:
      • Text (content)
      • HyperText (added pointers)
      • RSS (joined those pointers)
      • Blogs (added pingbacks)
      • Tagging (grouped related data)
      • RDF (described connected data)
      • GGG (content + pointers + relationships + descriptions)
  12. Trend 3: Semi-structured information
      • Individualisation of content:
        • 1970’s salary lists: all elements have exactly one job
        • 2000’s salary lists: we need many job columns! (sketched below)
      • Store more data about each entity
      • Trend accelerated by the decentralization of content generation
      • Age of participation (“web 2.0”)
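
A tiny Python illustration of the shift described in slide 12, with made-up records: a fixed one-job-per-row list versus semi-structured entities whose fields vary.

      # 1970s-style fixed schema: every row has exactly the same columns, one job each.
      salary_list_1970s = [
          ("Alice", "clerk",  21000),
          ("Bob",   "typist", 19000),
      ]

      # 2000s-style semi-structured records: each entity carries whatever fields it needs.
      salary_list_2000s = [
          {"name": "Alice", "jobs": ["clerk", "analyst", "consultant"], "salary": 52000},
          {"name": "Bob",   "jobs": ["typist"], "salary": 31000, "blog": "https://example.org/bob"},
      ]
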
  13. Trend 4: Architecture
      • 2000’s: SOA (diagram: several Application + DB pairs)
      • RESTful, hypermedia, composite apps
  14. What is a Graph?
      • An abstract representation of a set of objects where some pairs are connected by links.
      • Object (Vertex, Node); Link (Edge, Arc, Relationship)
  15. Different Kinds of Graphs
      • Undirected Graph
      • Directed Graph
      • Pseudo Graph
      • Multi Graph
      • Hyper Graph
  16. What is a Graph Database?
      • A database with an explicit graph structure
      • Each node knows its adjacent nodes
      • As the number of nodes increases, the cost of a local step (or hop) remains the same
      • Plus an index for lookups
  17. What is a Graph Database?
      • “A graph database... is an online database management system with CRUD methods that expose a graph data model” [1]
      • Two important properties:
        • Native graph storage engine: written from the ground up to manage graph data
        • Native graph processing, including index-free adjacency to facilitate traversals (sketched below)
      [1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013, p. 5. ISBN-10: 1449356265
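
A bare-bones Python sketch of index-free adjacency as slides 16 and 17 describe it (hypothetical data, not Neo4j's actual engine): an index is used once to find the starting node, and from there each node holds direct references to its neighbours, so a hop costs the same however large the graph grows.

      # Node records hold properties plus direct references to neighbouring node ids,
      # so following a relationship never consults a global index.  Data is illustrative only.
      nodes = {
          101: {"name": "Alice", "neighbours": [102, 103]},
          102: {"name": "Bob",   "neighbours": [101]},
          103: {"name": "Carol", "neighbours": [101, 104]},
          104: {"name": "Dave",  "neighbours": [103]},
      }

      # A separate index is used only to find the *starting* node of a traversal.
      index_by_name = {n["name"]: node_id for node_id, n in nodes.items()}

      start = index_by_name["Alice"]                      # one index lookup to enter the graph
      friends = nodes[start]["neighbours"]                # each hop is a constant-cost pointer chase
      fofs = {f2 for f in friends for f2 in nodes[f]["neighbours"] if f2 != start}
      print([nodes[i]["name"] for i in fofs])             # -> ['Dave']
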
  18. Graph Databases are Designed to:
      1. Store interconnected data
      2. Make it easy to make sense of that data
      3. Enable extreme-performance operations for:
         • Discovery of connected data patterns
         • Relatedness queries at depth > 1
         • Relatedness queries of arbitrary length
      4. Make it easy to evolve the database
  19. The Problem
      • All JOINs are executed every time you query (traverse) the relationship
      • Executing a JOIN means searching for a key in another table
      • With indices, executing a JOIN means looking up a key
      • B-Tree index: O(log(n)) (contrasted with a direct hop in the sketch below)
      • More entries => more lookups => slower JOINs
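
A back-of-envelope Python sketch of the contrast, with illustrative data structures standing in for a real RDBMS index and a graph store: resolving a relationship through a sorted key index costs a binary search, O(log(n)) and growing with table size, while index-free adjacency resolves it with a direct reference.

      import bisect

      # Relational-style: a relationship is resolved by searching a sorted key index, O(log(n)).
      friend_rows = sorted([(1, 2), (1, 3), (2, 3), (3, 4)])   # (person_id, friend_id), illustrative
      keys = [row[0] for row in friend_rows]

      def friends_via_index(person_id):
          i = bisect.bisect_left(keys, person_id)              # binary search: cost grows with table size
          out = []
          while i < len(keys) and keys[i] == person_id:
              out.append(friend_rows[i][1])
              i += 1
          return out

      # Graph-style: the node already holds references to its neighbours, so a hop is O(1)
      # regardless of how many nodes exist elsewhere in the graph.
      adjacency = {1: [2, 3], 2: [3], 3: [4], 4: []}

      def friends_via_adjacency(person_id):
          return adjacency[person_id]

      print(friends_via_index(1), friends_via_adjacency(1))    # -> [2, 3] [2, 3]
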
  20. Connected Query Performance
      • Query Response Time* = f(graph density, graph size, query degree)
        • Graph density: avg # relationships per node
        • Graph size: total # nodes in the graph
        • Query degree: # of hops in one’s query
      • RDBMS: exponential slowdown as each factor increases
      • Neo4j:
        • Performance remains constant as graph size increases
        • Performance slowdown is linear or better as density & degree increase
  21. Connected Query Performance: RDBMS vs. Native Graph Database
      (Chart: response time vs. connectedness of the data set. The RDBMS curve covers degree < 3, sizes in the thousands, and < 3 hops; the Neo4j curve covers degree in the thousands+, sizes in the billions+, and tens to hundreds of hops.)
  22. Social Network “path exists” Performance
      • Experiment:
        • ~1k persons
        • Average 50 friends per person
        • pathExists(a,b) limited to depth 4 (see the BFS sketch after slide 24)
      • Result: Relational database, 1,000 persons: 2,000 ms
  23. Social Network “path exists” Performance (same experiment, continued)
      • Relational database, 1,000 persons: 2,000 ms
      • Neo4j, 1,000 persons: 2 ms
  24. Social Network “path exists” Performance (same experiment, continued)
      • Relational database, 1,000 persons: 2,000 ms
      • Neo4j, 1,000 persons: 2 ms
      • Neo4j, 1,000,000 persons: 2 ms
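
A small Python sketch of what a depth-limited pathExists(a,b) check amounts to on the graph side, using a randomly generated toy network rather than the benchmark's actual data or Neo4j internals: a breadth-first search that stops after 4 hops only ever touches the neighbourhood of a, regardless of total graph size.

      import random
      from collections import deque

      def path_exists(adjacency, a, b, max_depth=4):
          """Breadth-first search from a, giving up beyond max_depth hops."""
          frontier, seen = deque([(a, 0)]), {a}
          while frontier:
              node, depth = frontier.popleft()
              if node == b:
                  return True
              if depth == max_depth:
                  continue
              for neighbour in adjacency.get(node, ()):
                  if neighbour not in seen:
                      seen.add(neighbour)
                      frontier.append((neighbour, depth + 1))
          return False

      # Toy social network roughly matching the experiment: ~1k persons, ~50 friends each.
      random.seed(0)
      people = range(1000)
      adjacency = {p: random.sample([q for q in people if q != p], 50) for p in people}

      print(path_exists(adjacency, 0, 999))   # only the 4-hop neighbourhood of node 0 is explored
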
  25. Cypher: a pattern-matching query language (like SQL for graphs)
        // get node 0
        START a=node(0) RETURN a
        // traverse from node 1
        START a=node(1) MATCH (a)-->(b) RETURN b
        // return friends of friends
        START a=node(1) MATCH (a)--()--(c) RETURN c
  26. Takeaways
      • Trends (data, architecture)
      • NoSQL: the big four
      • How to pick
      • Why aggregates suck with connected data
        • Slicing through data is limited (needs lots of map & reduce)
      • Connected query performance
        • IP needs “hopping” through data anyway (k > 3)