Neo4j Magic Adventures

When the domain of your data is clearly a graph, why shove it into a relational model? Specialized graph databases like Neo4j have demonstrated that it's easier to "think in graphs" while working with your data. But is Neo4j fast enough for use cases that demand tight performance?

Dmitrijs Vrublevskis

November 10, 2014

Transcript

  1. Briefing
     1. Evaluate Neo4j capabilities
        • Import test dataset
        • Implement and run test cases
        • Measure everything
     2. Compare Neo4j with existing solution
  2. • Native graph storage
     • Property graph database
     • Schema-less
     • Powerful Query Language
     • Clustering
     • Hot backups
  3. Network graph
     • Looks like a tree
     • Node has unique id
     • Node has type
     • Node can contain structures
     • … or structure sequences
     • … or structure with sequence of structures
  4. Environment
     • 8 GB / 4 CPUs
     • 80 GB SSD Disk
     • CentOS 6.5 x64
  5. Day 1 - 7
     • Environment setup… Done
     • Dataset import… Done
     • Documentation read… Done
  6. Mission impossible
     • Take existing data model
     • Take existing test case
     • Make it as fast as possible
  7. Test case
     Read the whole graph under a specified root node with one request.
     Node count in graph: 97254 (structures & sequences not counted)
  8. REST API
     • Node & Relationship endpoints
     • Transactional endpoint
     • Traversal endpoint
     • Batch operations endpoint
     • Other
  9. Cypher
     “Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of the graph store.”
  10. • ASCII art
      • Keywords like WHERE and ORDER BY are inspired by SQL
      • Focuses on the clarity of expressing what to retrieve from a graph
      • Collection semantics have been borrowed from languages such as Haskell and Python
  11. Solution via Cypher query
      MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
      WHERE root.id = {rootNodeId}
      OPTIONAL MATCH (child)-[property:PROPERTY]->(child_property)
      RETURN *
  12. #!/bin/bash

      QUERY=bodies/subgraph.json

      curl -i -XPOST \
        -o output.log \
        --data "@$QUERY" \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        http://127.0.0.1:7474/db/data/transaction/commit
  13.                   N/A    REST/Default
      Received          ?      1225 MB
      Download speed    ?      1535 KB/s
      Time spent        ?      817 seconds
  14. [
        {
          "id": "100",
          "graph": { "nodes": [ {"id": "101"} ] }
        },
        {
          "id": "100",
          "graph": { "nodes": [ {"id": "102"} ] }
        }
      ]
      [Figure: graph with nodes 100, 101 and 102]
  15. Cypher thinks in paths, not graphs!
      • Unnecessary data duplication
      • Cypher doesn’t know about our data model
  16. Another solution
      MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
      WHERE root.id = {rootNodeId}
      OPTIONAL MATCH (child)-[r2:PROPERTY]->(child_property)
      RETURN collect(root) + collect(child) + collect(child_property) as nodes,
             collect(r2) as relationships
  17. #!/bin/bash

      QUERY=bodies/subgraph_optimised.json

      curl -i -XPOST \
        -o output.log \
        --data "@$QUERY" \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        http://127.0.0.1:7474/db/data/transaction/commit
  18.                   REST/Default    REST/Optimized
      Received          1225 MB         85.2 MB
      Download speed    1535 KB/s       1579 KB/s
      Time spent        817 seconds     55 seconds
  19. Conclusion
      • We need more control over the querying & serialisation process!
      • Maybe another serialisation format?
      • Another querying API?
  20. Day 10
      Morning standup
      - “We need to implement our own extension. Can we do it?”
      - “Yeah, definitely.”
  21. Unmanaged extension
      • The unmanaged extensions are a way of deploying arbitrary JAX-RS code into the Neo4j server.
  22. Plan
      1. Take fast serialisation library
      2. Take Neo4j Java API
      3. Implement our own endpoints
      4. …
      5. Profit!
  23. BSON
      • Obvious choice
      • Brought by MongoDB
      • Fast serialisation
      http://bsonspec.org/
      1. Lightweight
      2. Traversable
      3. Efficient (as they say)
  24. Jackson
      • Jackson used by Neo4j internally
      • It’s cool
      • Jackson has BSON plugin
      https://github.com/FasterXML/jackson
      https://github.com/michel-kraemer/bson4jackson
  25. // create mapper
      ObjectMapper mapper = new ObjectMapper(new BsonFactory());
      ByteArrayOutputStream baos = new ByteArrayOutputStream();

      // serialize data
      mapper.writeValue(baos, pojo);

      ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());

      // deserialize data
      mapper.readValue(bais, PojoClass.class);
  26. Streaming!
      // create bson factory
      BsonFactory factory = new BsonFactory();
      ByteArrayOutputStream baos = new ByteArrayOutputStream();

      // serialize data
      JsonGenerator gen = factory.createJsonGenerator(baos);
      gen.writeStartObject();
      gen.writeFieldName("name");
      gen.writeString(bob.getName());
      gen.writeEndObject(); // close the object explicitly before closing the generator
      gen.close();
  27. StreamingOutput stream = new StreamingOutput() {
          @Override
          public void write(OutputStream os) throws IOException {
              // Writer.write and Writer.flush throw IOException, so it must be declared
              Writer writer = new BufferedWriter(new OutputStreamWriter(os));
              writer.write("Hello World!");
              writer.flush();
          }
      };
      return Response.ok(stream).build();
  28. Day 11 - 12
      • Extension setup… Done
      • Cypher endpoint… Done
      • Documentation read… Done
  29. private final GraphDatabaseService db;

      try (Transaction tx = db.beginTx()) {
          ExecutionEngine engine = new ExecutionEngine(db);
          ExecutionResult result = engine.execute(query);

          Bson.serialize(output, result);
      }
  30. Iterable<Relationship> relationships =
          node.getRelationships(Relationships.PROPERTY, Direction.OUTGOING);

      for (Relationship relationship : relationships) {
          Node endNode = relationship.getEndNode();
          if (endNode.hasLabel(Labels.Structure)) {
              ...
          } else if (endNode.hasLabel(Labels.Sequence)) {
              ...
          }
      }
  31. #!/bin/bash

      QUERY=bodies/bson.json

      curl -i -XPOST \
        -o output.log \
        --data "@$QUERY" \
        -H "Content-Type: application/json" \
        http://127.0.0.1:7474/extension/bson/cypher
  32.                   REST/Optimized    Bson/Cypher
      Received          85.2 MB           78.3 MB
      Download speed    1579 KB/s         12.3 MB/s
      Time spent        55 seconds        6 seconds
  33. Traversal API
      “The Neo4j Traversal API is a callback based, lazily executed way of specifying desired movements through a graph in Java.”
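
To make "callback based, lazily executed" concrete, here is a minimal plain-Java sketch of the idea. It is not the Neo4j Traversal API: the class `LazyTraversalSketch`, the adjacency map and the `include` predicate are all hypothetical stand-ins, and the graph is assumed to be a tree (no cycle detection). A node's children are expanded only when the caller asks for the next element, and the predicate plays the role of the callback that decides what gets returned.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class LazyTraversalSketch {

    /**
     * Depth-first iterator over a tree given as an adjacency map.
     * Work happens only inside hasNext()/next(), never up front.
     */
    static Iterator<String> traverse(final Map<String, List<String>> children,
                                     final String root,
                                     final Predicate<String> include) {
        final Deque<String> stack = new ArrayDeque<>();
        stack.push(root);

        return new Iterator<String>() {
            private String pending; // next node to hand out, if already found

            @Override
            public boolean hasNext() {
                while (pending == null && !stack.isEmpty()) {
                    String node = stack.pop();
                    List<String> kids = children.getOrDefault(node, Collections.<String>emptyList());
                    // Push in reverse so the first child is visited first.
                    for (int i = kids.size() - 1; i >= 0; i--) {
                        stack.push(kids.get(i));
                    }
                    // The callback decides whether this node is part of the result.
                    if (include.test(node)) {
                        pending = node;
                    }
                }
                return pending != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String result = pending;
                pending = null;
                return result;
            }
        };
    }
}
```

Because nothing is materialised ahead of time, a serializer can write each node to the response stream as soon as the iterator yields it, instead of holding the whole subgraph in memory first.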
  34.                   Bson/Cypher    Bson/Traverse
      Received          78.3 MB        78.3 MB
      Download speed    12.3 MB/s      23.7 MB/s
      Time spent        6 seconds      3 seconds
  35. Output output = new Output(outputStream);

      // Setup
      Kryo kryo = new Kryo();
      kryo.setRegistrationRequired(true);
      kryo.register(HashMap.class);
      kryo.register(String[].class);
      kryo.register(NodeDAO.class);
      kryo.register(RelationshipDAO.class);

      // Serialize
      kryo.writeObject(output, object1);
      kryo.writeObject(output, object2);
      output.close();
  36. #!/bin/bash

      QUERY=bodies/kryo.json

      curl -i -XPOST \
        -o output.log \
        --data "@$QUERY" \
        -H "Content-Type: application/json" \
        http://127.0.0.1:7474/extension/kryo/cypher
  37.                   Bson/Cypher    Kryo/Cypher
      Received          78.3 MB        68.5 MB
      Download speed    12.3 MB/s      20.5 MB/s
      Time spent        6 seconds      3 seconds
  38. #!/bin/bash

      curl -i -XGET \
        -o output.log \
        -w "
      time_connect=%{time_connect}
      time_start_transfer=%{time_starttransfer}
      time_total=%{time_total}
      " \
        http://127.0.0.1:7474/extension/kryo/traverse
  39.                   Bson/Traverse    Kryo/Traverse
      Received          78.3 MB          68.5 MB
      Download speed    23.7 MB/s        47.2 MB/s
      Time spent        3 seconds        1.4 seconds
  40. import com.esotericsoftware.kryo.io.Output;

      // Before
      Output output = new Output(outputStream);

      // After
      Output output = new Output(new LZFOutputStream(outputStream));
  41. #!/bin/bash

      curl -i -XGET \
        -o output.log \
        -w "
      time_connect=%{time_connect}
      time_start_transfer=%{time_starttransfer}
      time_total=%{time_total}
      " \
        http://127.0.0.1:7474/extension/kryo/traverse
  42.                   Kryo/Traverse    Kryo/Traverse Compressed
      Received          68.5 MB          7.6 MB
      Download speed    47.2 MB/s        4299 KB/s
      Time spent        1.4 seconds      1.7 seconds
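
The compressed run sends far fewer bytes yet takes slightly longer, which suggests that on a local connection the CPU cost of compressing outweighs the saved transfer. The sketch below shows the same "wrap the output stream" pattern using only the JDK. It substitutes `GZIPOutputStream` for the `LZFOutputStream` used on the slides, so ratios and speed will differ; the class name `CompressedStreamSketch` and the sample payload are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class CompressedStreamSketch {

    /** Writes the payload, optionally through a compressing wrapper, and returns the bytes produced. */
    static int writtenBytes(byte[] payload, boolean compress) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The serializer only sees an OutputStream, so compression is a one-line wrap.
        try (OutputStream out = compress ? new GZIPOutputStream(sink) : sink) {
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    /** Repetitive, made-up payload that stands in for serialized nodes. */
    static byte[] samplePayload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("{\"type\":\"Object\",\"id\":").append(i).append("}");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload();
        System.out.println("raw bytes:        " + writtenBytes(payload, false));
        System.out.println("compressed bytes: " + writtenBytes(payload, true));
    }
}
```

Whether the wrap pays off depends on the link: over a slow network the smaller payload is likely to win, while on localhost, as measured above, it did not.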
  43. Important
      • Neo4j makes heavy use of the java.nio package. Native I/O will result in memory being allocated outside the normal Java heap.
      • Neo4j will require all of the heap memory of the JVM plus the memory to be used for memory mapping to be available as physical memory.
  44. File buffer cache
      • The file buffer cache is sometimes called low level cache or file system cache.
      • It uses the operating system memory mapping features when possible.
      • Neo4j uses multiple file buffer caches, one for each different storage file.
  45. Store file                           Record size   Contents
      neostore.nodestore.db                15 B          Nodes
      neostore.relationshipstore.db        34 B          Relationships
      neostore.propertystore.db            41 B          Properties for nodes and relationships
      neostore.propertystore.db.strings    128 B         Values of string properties
      neostore.propertystore.db.arrays     128 B         Values of array properties

      Strings and arrays are stored in one or more 120 B chunks, with 8 B of record overhead.
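
Fixed record sizes make it possible to estimate how large a store file is, and so how much mapped memory it would take to hold it entirely. The sketch below multiplies a record count by the record sizes listed above. The node count is the 97254 from the test case slide; since that figure excludes structures and sequences, the real node store is presumably larger. The class and method names are hypothetical.

```java
class StoreSizeSketch {

    // Record sizes in bytes, as listed in the table above.
    static final int NODE_RECORD_BYTES = 15;
    static final int RELATIONSHIP_RECORD_BYTES = 34;
    static final int PROPERTY_RECORD_BYTES = 41;

    /** Estimated store file size: number of records times the fixed record size. */
    static long storeBytes(long records, int recordBytes) {
        return records * recordBytes;
    }

    public static void main(String[] args) {
        long nodes = 97_254L; // node count from the test case slide
        long bytes = storeBytes(nodes, NODE_RECORD_BYTES);
        System.out.println("node store: " + bytes + " bytes");
    }
}
```

For the test subgraph this comes to 97254 × 15 = 1458810 bytes, about 1.4 MB of node records. Relationship and property counts are not given in the slides, so those stores cannot be estimated the same way here.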
  46. # Default values for the low-level graph engine
      neostore.nodestore.db.mapped_memory=25M
      neostore.relationshipstore.db.mapped_memory=50M
      neostore.propertystore.db.mapped_memory=90M
      neostore.propertystore.db.strings.mapped_memory=130M
      neostore.propertystore.db.arrays.mapped_memory=130M

      # Tuned
      neostore.nodestore.db.mapped_memory=150M
      neostore.relationshipstore.db.mapped_memory=400M
      neostore.propertystore.db.mapped_memory=600M
      neostore.propertystore.db.strings.mapped_memory=1450M
      neostore.propertystore.db.arrays.mapped_memory=400M
  47. 150 MB + 400 MB + 600 MB + 1450 MB + 400 MB = 3000 MB

      # Tuned
      neostore.nodestore.db.mapped_memory=150M
      neostore.relationshipstore.db.mapped_memory=400M
      neostore.propertystore.db.mapped_memory=600M
      neostore.propertystore.db.strings.mapped_memory=1450M
      neostore.propertystore.db.arrays.mapped_memory=400M

      Available memory
      • 3 GB - File buffers
      • 3 GB - Java heap
      • 2 GB - OS
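
Because the heap and the mapped memory both have to fit in physical RAM, it can help to total the `mapped_memory` settings mechanically rather than by hand. The sketch below parses the tuned values shown above with `java.util.Properties` and sums them. It only understands the `<number>M` form used on the slides, and the class name `MappedMemoryBudgetSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class MappedMemoryBudgetSketch {

    /** The tuned settings from the slide, as they would appear in the properties file. */
    static final String TUNED = String.join("\n",
            "neostore.nodestore.db.mapped_memory=150M",
            "neostore.relationshipstore.db.mapped_memory=400M",
            "neostore.propertystore.db.mapped_memory=600M",
            "neostore.propertystore.db.strings.mapped_memory=1450M",
            "neostore.propertystore.db.arrays.mapped_memory=400M");

    /** Sums every *.mapped_memory entry, in megabytes. */
    static long totalMappedMegabytes(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long total = 0;
        for (String key : props.stringPropertyNames()) {
            if (!key.endsWith(".mapped_memory")) {
                continue;
            }
            String value = props.getProperty(key).trim();
            if (!value.endsWith("M")) {
                // Assumption: only the megabyte suffix is supported in this sketch.
                throw new IllegalArgumentException("unsupported size: " + value);
            }
            total += Long.parseLong(value.substring(0, value.length() - 1));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("mapped memory total: " + totalMappedMegabytes(TUNED) + " MB");
    }
}
```

For the tuned settings the total is 3000 MB, which matches the 3 GB file buffer share of the 8 GB machine alongside the 3 GB heap and 2 GB left for the OS.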
  48.                   Kryo/Traverse    Kryo/Traverse Tuned
      Received          68.5 MB          68.5 MB
      Download speed    47.2 MB/s        49.6 MB/s
      Time spent        1.4 seconds      1.37 seconds
  49. Object cache
      • The object cache is sometimes called high level cache.
      • It caches the Neo4j data in a form optimized for fast traversal.
      • Two different categories
        • Reference caches
        • High-Performance Cache
  50. Object cache
      • Nodes and relationships are added to the object cache as soon as they are accessed (lazily).
      • Reading from this cache may be 5 to 10 times faster than reading from the file buffer cache.
  51. Object cache type
      • None
      • Soft (Community default)
      • Weak
      • Strong
      • HPC (Enterprise default)
  52. HPC
      • Assigned a certain maximum amount of space on the JVM heap
      • Purges objects whenever it grows bigger than that
      • GC pauses can be better controlled
  53.                   Kryo/Traverse Tuned    Kryo/Traverse HPC
      Received          68.5 MB                68.5 MB
      Download speed    49.6 MB/s              64.7 MB/s
      Time spent        1.37 seconds           1.049 seconds
  54.                   Kryo/Traverse Virtual    Kryo/Traverse Physical
      Received          68.5 MB                  68.5 MB
      Download speed    64.7 MB/s                68.5 MB/s
      Time spent        1.049 seconds            0.991 seconds
  55. Other results
      • Get 30000 (huge) nodes by field: 2.3 seconds
      • Create nodes: >15000 nodes/second
      • ~2500 concurrent requests on virtual hardware