Neo4j Magic Adventures

When the domain of your data is clearly a graph, why shove it into a relational model? Specialized graph databases like Neo4j have demonstrated that it's easier to "think in graphs", while working with your data. But is Neo4j fast enough for use cases where tight performance is needed?

Dmitrijs Vrublevskis

November 10, 2014

Transcript

  1. Neo4j Magic Adventures Dmitry Vrublevsky @ Neueda dmitry@vrublevsky.me

  2. Day 0

  3. Briefing
    1. Evaluate Neo4j capabilities
      • Import test dataset
      • Implement and run test cases
      • Measure everything
    2. Compare Neo4j with existing solution
  4. Neo4j “Neo4j – the World’s Leading Graph Database”

  5. • Native graph storage
    • Property graph database
    • Schema-less
    • Powerful Query Language
    • Clustering
    • Hot backups
  6. Network graph
    • Looks like a tree
    • Node has unique id
    • Node has type
    • Node can contain structures
    • … or structure sequences
    • … or structure with sequence of structures
  7. None
  8. • Node (Object) • Structure • Sequence

  9. Network graph
    ~ 8,970,000 nodes
    ~ 11,000,000 relationships
    ~ 33,140,000 properties
    ~ 5.7 GB
  10. Environment
    • 8 GB / 4 CPUs
    • 80 GB SSD Disk
    • CentOS 6.5 x64
  11. Day 1 - 7
    • Environment setup… Done
    • Dataset import… Done
    • Documentation read… Done
  12. Mission impossible
    • Take existing data model
    • Take existing test case
    • Make it as fast as possible
  13. Test case
    Read the whole graph under a specified root node with one request.
    Node count in graph: 97,254 (structures & sequences not counted)
  14. Neo4j toolbox

  15. REST API
    • Node & Relationship endpoints
    • Transactional endpoint
    • Traversal endpoint
    • Batch operations endpoint
    • Other
  16. Cypher
    “Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of the graph store.”
  17. • ASCII art
    • Keywords like WHERE and ORDER BY are inspired by SQL
    • Focuses on the clarity of expressing what to retrieve from a graph
    • Collection semantics have been borrowed from languages such as Haskell and Python
  18.
    MATCH (john {name: 'John'})-[:friend]->(friends)
    MATCH (friends)-[:friend]->(fof)
    RETURN john, fof

    ( … ) - node
    [ … ] - relationship
    ()-[]-() - path
  19. Day 8 Cypher? REST API?

  20. Solution via Cypher query
    MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
    WHERE root.id = {rootNodeId}
    OPTIONAL MATCH (child)-[property:PROPERTY]->(child_property)
    RETURN *
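  21.
    #!/bin/bash

    # Request body containing the Cypher statement
    QUERY=bodies/subgraph.json

    # POST the query to the transactional endpoint and save the response
    curl -i -XPOST \
      -o output.log \
      --data "@$QUERY" \
      -H "Accept: application/json" \
      -H "Content-Type: application/json" \
      http://127.0.0.1:7474/db/data/transaction/commit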
  22.
    Metric          N/A   REST/Default
    Received        ?     1225 MB
    Download speed  ?     1535 KB/s
    Time spent      ?     817 seconds
  23. Observations
    • Response streamed
    • Large response size
    • Long request time
  24. Day 9 Requirements arrived!

  25. Test case total time: 2 seconds
    Currently we have 817 seconds
  26. None
  27. Problems?

  28.
    [
      {
        "id": "100",
        "graph": { "nodes": [ {"id": "101"} ] }
      },
      {
        "id": "100",
        "graph": { "nodes": [ {"id": "102"} ] }
      }
    ]
    (Diagram: nodes 100, 102, 101)
  29. Cypher thinks in paths, not graphs!
    • Unnecessary data duplication
    • Cypher doesn’t know about our data model
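The duplication in the JSON above can be made concrete with a small stand-alone sketch. It is plain Java, not Neo4j code: `PathRows` and its methods are hypothetical names, and each "row" is just a list of node ids standing in for one result row that repeats the root node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PathRows {
    // Builds one row per (root, child) pair, the way a path-oriented
    // result repeats the root node in every row.
    static List<List<String>> rowsFor(String rootId, List<String> childIds) {
        List<List<String>> rows = new ArrayList<>();
        for (String childId : childIds) {
            rows.add(Arrays.asList(rootId, childId));
        }
        return rows;
    }

    // Counts every node id that would be serialised, duplicates included.
    static int serialisedNodeCount(List<List<String>> rows) {
        int count = 0;
        for (List<String> row : rows) {
            count += row.size();
        }
        return count;
    }

    // Collects each node id once, preserving first-seen order.
    static Set<String> distinctNodes(List<List<String>> rows) {
        Set<String> ids = new LinkedHashSet<>();
        for (List<String> row : rows) {
            ids.addAll(row);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<List<String>> rows = rowsFor("100", Arrays.asList("101", "102"));
        System.out.println("serialised node entries: " + serialisedNodeCount(rows));
        System.out.println("distinct nodes:          " + distinctNodes(rows));
    }
}
```

With two children the root is written twice; with tens of thousands of children the repeated root (and every repeated intermediate node on longer paths) is what inflates the response.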
  30. Another solution
    MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
    WHERE root.id = {rootNodeId}
    OPTIONAL MATCH (child)-[r2:PROPERTY]->(child_property)
    RETURN collect(root) + collect(child) + collect(child_property) as nodes, collect(r2) as relationships
  31.
    #!/bin/bash

    # Request body containing the optimised Cypher statement
    QUERY=bodies/subgraph_optimised.json

    curl -i -XPOST \
      -o output.log \
      --data "@$QUERY" \
      -H "Accept: application/json" \
      -H "Content-Type: application/json" \
      http://127.0.0.1:7474/db/data/transaction/commit
  32.
    Metric          REST/Default   REST/Optimized
    Received        1225 MB        85.2 MB
    Download speed  1535 KB/s      1579 KB/s
    Time spent      817 seconds    55 seconds
  33. None
  34. Conclusion
    • We need more control over the querying & serialisation process!
    • Maybe another serialisation format?
    • Another querying API?
  35. Day 10 Morning standup
    - “We need to implement our own extension. Can we do it?”
    - “Yeah, definitely.”
  36. Unmanaged extension
    • The unmanaged extensions are a way of deploying arbitrary JAX-RS code into the Neo4j server.
  37. Plan
    1. Take fast serialisation library
    2. Take Neo4j Java API
    3. Implement our own endpoints
    4. …
    5. Profit!
  38. BSON
    • Obvious choice
    • Brought by MongoDB
    • Fast serialisation
    http://bsonspec.org/
    1. Lightweight
    2. Traversable
    3. Efficient (as they say)
  39. Jackson
    • Jackson is used by Neo4j internally
    • It’s cool
    • Jackson has a BSON plugin
    https://github.com/FasterXML/jackson
    https://github.com/michel-kraemer/bson4jackson
  40.
    import com.fasterxml.jackson.databind.ObjectMapper;
    import de.undercouch.bson4jackson.BsonFactory;

    // Create a mapper that reads and writes BSON instead of JSON
    ObjectMapper mapper = new ObjectMapper(new BsonFactory());

    // Serialize data
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    mapper.writeValue(baos, pojo);

    // Deserialize data
    ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
    mapper.readValue(bais, PojoClass.class);

  41.
    // Create BSON factory
    BsonFactory factory = new BsonFactory();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();

    // Serialize data field by field (streaming!)
    JsonGenerator gen = factory.createJsonGenerator(baos);
    gen.writeStartObject();
    gen.writeFieldName("name");
    gen.writeString(bob.getName());
    gen.writeEndObject(); // close the object opened above
    gen.close();
  42. JAX-RS Streaming • Because why not? http://docs.oracle.com/javaee/6/api/javax/ws/rs/core/StreamingOutput.html

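  43.
    // The container calls write() with the response body stream,
    // so data is sent as it is produced instead of being buffered first
    StreamingOutput stream = new StreamingOutput() {
        @Override
        public void write(OutputStream os) throws IOException {
            Writer writer = new BufferedWriter(
                new OutputStreamWriter(os)
            );
            writer.write("Hello World!");
            writer.flush();
        }
    };
    return Response.ok(stream).build();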
  44. Day 11 - 12
    • Extension setup… Done
    • Cypher endpoint… Done
    • Documentation read… Done
  45. Dependencies
    <dependency>
      <groupId>org.neo4j</groupId>
      <artifactId>neo4j</artifactId>
      <version>${neo4j.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>javax.ws.rs</groupId>
      <artifactId>javax.ws.rs-api</artifactId>
      <version>2.0</version>
      <scope>provided</scope>
    </dependency>
  46.
    private final GraphDatabaseService db;

    // Execute the Cypher query inside a transaction and serialise the result
    try (Transaction tx = db.beginTx()) {
        ExecutionEngine engine = new ExecutionEngine(db);
        ExecutionResult result = engine.execute(query);

        Bson.serialize(output, result);
    }
  47. Day 13 Another day, another experiment

  48. New Cypher solution
    MATCH (root:Object)-[:CHILDREN*]->(child:Object)
    WHERE root.id = {rootNodeId}
    RETURN child
    (Properties autoloaded during serialisation!)
  49.
    // Follow outgoing PROPERTY relationships of a node and branch on the end node's label
    Iterable<Relationship> relationships =
        node.getRelationships(Relationships.PROPERTY, Direction.OUTGOING);

    for (Relationship relationship : relationships) {
        Node endNode = relationship.getEndNode();
        if (endNode.hasLabel(Labels.Structure)) {
            ...
        } else if (endNode.hasLabel(Labels.Sequence)) {
            ...
        }
    }
  50.
    #!/bin/bash

    QUERY=bodies/bson.json

    # POST the query to the custom BSON extension endpoint
    curl -i -XPOST \
      -o output.log \
      --data "@$QUERY" \
      -H "Content-Type: application/json" \
      http://127.0.0.1:7474/extension/bson/cypher
  51.
    Metric          REST/Optimized   Bson/Cypher
    Received        85.2 MB          78.3 MB
    Download speed  1579 KB/s        12.3 MB/s
    Time spent      55 seconds       6 seconds
  52. Traverse API
    “The Neo4j Traversal API is a callback based, lazily executed way of specifying desired movements through a graph in Java.”
  53.
    // Breadth-first walk along outgoing CHILDREN relationships, returning every visited node
    ResourceIterable<Node> nodes = db
        .traversalDescription()
        .breadthFirst()
        .relationships(Relationships.CHILDREN, Direction.OUTGOING)
        .evaluator(Evaluators.all())
        .traverse(rootNode)
        .nodes();
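To show what a breadth-first walk over outgoing child relationships amounts to, here is a minimal stand-alone sketch in plain Java. It does not use the Neo4j Traversal API: the adjacency map, the node ids and the `BreadthFirstSketch` class are hypothetical stand-ins, and the real API evaluates lazily, which this eager version does not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BreadthFirstSketch {
    // Visits the root and everything reachable through outgoing child
    // edges, level by level; each node is returned exactly once.
    static List<String> traverse(Map<String, List<String>> children, String root) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(root);
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            visited.add(node);
            List<String> next = children.getOrDefault(node, Collections.<String>emptyList());
            for (String child : next) {
                // Set.add returns false for nodes that were already queued
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> children = new HashMap<>();
        children.put("100", Arrays.asList("101", "102"));
        children.put("101", Arrays.asList("103"));
        System.out.println(traverse(children, "100"));
    }
}
```

Unlike the path-oriented Cypher result, a walk like this hands each node to the serialiser once, which is the property the custom endpoint relies on.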
  54.
    #!/bin/bash

    curl -i -XGET \
      -o output.log \
      http://127.0.0.1:7474/extension/bson/traverse

  55.
    Metric          Bson/Cypher   Bson/Traverse
    Received        78.3 MB       78.3 MB
    Download speed  12.3 MB/s     23.7 MB/s
    Time spent      6 seconds     3 seconds
  56. Day 14 Can we do better?

  57. “Kryo is a fast and efficient object graph serialization framework for Java.”
  58.
    Output output = new Output(outputStream);

    // Setup
    Kryo kryo = new Kryo();
    kryo.setRegistrationRequired(true);
    kryo.register(HashMap.class);
    kryo.register(String[].class);
    kryo.register(NodeDAO.class);
    kryo.register(RelationshipDAO.class);

    // Serialize
    kryo.writeObject(output, object1);
    kryo.writeObject(output, object2);
    output.close();
  59.
    #!/bin/bash

    QUERY=bodies/kryo.json

    # POST the query to the custom Kryo extension endpoint
    curl -i -XPOST \
      -o output.log \
      --data "@$QUERY" \
      -H "Content-Type: application/json" \
      http://127.0.0.1:7474/extension/kryo/cypher
  60.
    Metric          Bson/Cypher   Kryo/Cypher
    Received        78.3 MB       68.5 MB
    Download speed  12.3 MB/s     20.5 MB/s
    Time spent      6 seconds     3 seconds
  61.
    #!/bin/bash

    # -w prints curl's own timing variables after the transfer completes
    curl -i -XGET \
      -o output.log \
      -w "
    time_connect=%{time_connect}
    time_start_transfer=%{time_starttransfer}
    time_total=%{time_total}
    " \
      http://127.0.0.1:7474/extension/kryo/traverse
  62.
    Metric          Bson/Traverse   Kryo/Traverse
    Received        78.3 MB         68.5 MB
    Download speed  23.7 MB/s       47.2 MB/s
    Time spent      3 seconds       1.4 seconds
  63. Compression?

  64. LZF • Optimized for speed • Streaming https://github.com/ning/compress

  65. https://github.com/ning/jvm-compressor-benchmark/wiki

  66.
    import com.esotericsoftware.kryo.io.Output;
    import com.ning.compress.lzf.LZFOutputStream;

    // Before
    Output output = new Output(outputStream);

    // After: compress the serialised bytes on the way out
    Output output = new Output(
        new LZFOutputStream(outputStream)
    );
  67.
    #!/bin/bash

    curl -i -XGET \
      -o output.log \
      -w "
    time_connect=%{time_connect}
    time_start_transfer=%{time_starttransfer}
    time_total=%{time_total}
    " \
      http://127.0.0.1:7474/extension/kryo/traverse
  68.
    Metric          Kryo/Traverse   Kryo/Traverse Compressed
    Received        68.5 MB         7.6 MB
    Download speed  47.2 MB/s       4299 KB/s
    Time spent      1.4 seconds     1.7 seconds
  69. Short conclusion Compression useful (mostly) for large (huge) responses.
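The measurements above show the trade-off: the compressed response is far smaller (7.6 MB instead of 68.5 MB) but took slightly longer over the local interface (1.7 seconds instead of 1.4). The sketch below illustrates only the "wrap the output stream" idea. It deliberately swaps in the JDK's `GZIPOutputStream` for LZF so that it runs without extra libraries, and the payload and class name are hypothetical, so the ratio it prints says nothing about the talk's data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressedStreamSketch {
    // Writes the payload through a compressing wrapper, the same shape as
    // wrapping the serialiser's OutputStream (GZIP here, not LZF).
    static byte[] compress(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(payload);
        }
        return sink.toByteArray();
    }

    // Reads the compressed bytes back to verify nothing was lost.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        }
        return sink.toByteArray();
    }

    // Builds a repetitive, made-up payload resembling many similar nodes.
    static byte[] samplePayload(int nodes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes; i++) {
            sb.append("{\"id\":\"").append(i).append("\",\"type\":\"Object\"}");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = samplePayload(10_000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Whether the saved bytes pay for the extra CPU time depends on the link: on a fast local connection the transfer was already cheap, which fits the conclusion that compression mostly helps large responses.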

  70. None
  71. Day 15 Configurations

  72. Important
    • Neo4j makes heavy use of the java.nio package. Native I/O will result in memory being allocated outside the normal Java heap.
    • Neo4j will require all of the heap memory of the JVM plus the memory to be used for memory mapping to be available as physical memory.
  73. File buffer cache
    • The file buffer cache is sometimes called low level cache or file system cache.
    • It uses the operating system memory mapping features when possible.
    • Neo4j uses multiple file buffer caches, one for each different storage file.
  74. None
  75.
    Store file                          Record size   Contents
    neostore.nodestore.db               15 B          Nodes
    neostore.relationshipstore.db       34 B          Relationships
    neostore.propertystore.db           41 B          Properties for nodes and relationships
    neostore.propertystore.db.strings   128 B         Values of string properties
    neostore.propertystore.db.arrays    128 B         Values of array properties

    Strings and arrays are stored in one or more 120 B chunks, with 8 B record overhead.
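The record sizes in this table give a rough way to estimate how large the fixed-size store files are for the test dataset (about 8,970,000 nodes, 11,000,000 relationships and 33,140,000 properties, as stated earlier). The sketch below is only an illustration: the class and method names are made up, decimal megabytes are used, and it assumes one property record per property, which is an assumption and may overestimate the property store. String and array stores are left out because their size depends on the values.

```java
final class StoreSizeEstimate {
    // Record sizes in bytes, taken from the table above.
    static final long NODE_RECORD_BYTES = 15;
    static final long RELATIONSHIP_RECORD_BYTES = 34;
    static final long PROPERTY_RECORD_BYTES = 41;

    // Size of a store that holds recordCount fixed-size records.
    static long storeBytes(long recordCount, long recordSizeBytes) {
        return recordCount * recordSizeBytes;
    }

    // Converts bytes to decimal megabytes (1 MB = 1,000,000 bytes).
    static double toMegabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long nodes = 8_970_000L;
        long relationships = 11_000_000L;
        long properties = 33_140_000L; // assumes one record per property

        System.out.printf("nodestore:         %.1f MB%n",
                toMegabytes(storeBytes(nodes, NODE_RECORD_BYTES)));
        System.out.printf("relationshipstore: %.1f MB%n",
                toMegabytes(storeBytes(relationships, RELATIONSHIP_RECORD_BYTES)));
        System.out.printf("propertystore:     %.1f MB (upper-bound guess)%n",
                toMegabytes(storeBytes(properties, PROPERTY_RECORD_BYTES)));
    }
}
```

The node and relationship estimates come out at roughly 135 MB and 374 MB, so a mapped-memory setting slightly above those figures would let both stores be mapped in full.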
  76.
    # Default values for the low-level graph engine
    neostore.nodestore.db.mapped_memory=25M
    neostore.relationshipstore.db.mapped_memory=50M
    neostore.propertystore.db.mapped_memory=90M
    neostore.propertystore.db.strings.mapped_memory=130M
    neostore.propertystore.db.arrays.mapped_memory=130M

    # Tuned
    neostore.nodestore.db.mapped_memory=150M
    neostore.relationshipstore.db.mapped_memory=400M
    neostore.propertystore.db.mapped_memory=600M
    neostore.propertystore.db.strings.mapped_memory=1450M
    neostore.propertystore.db.arrays.mapped_memory=400M
  77. 150 MB + 400 MB + 600 MB + 1450 MB + 400 MB = 3000 MB

    # Tuned
    neostore.nodestore.db.mapped_memory=150M
    neostore.relationshipstore.db.mapped_memory=400M
    neostore.propertystore.db.mapped_memory=600M
    neostore.propertystore.db.strings.mapped_memory=1450M
    neostore.propertystore.db.arrays.mapped_memory=400M

    Available memory
    • 3 GB - File buffers
    • 3 GB - Java heap
    • 2 GB - OS
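The budget on this slide can be checked mechanically: sum the mapped-memory settings, then add the heap and the share left for the operating system. The following sketch does that in plain Java. The `MappedMemoryBudget` class is hypothetical, only the `M` suffix used on the slides is parsed, and 1 GB is treated as 1000 MB to match the slide's rounding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MappedMemoryBudget {
    // Parses values such as "150M" into megabytes; other suffixes are rejected.
    static long parseMegabytes(String value) {
        String v = value.trim();
        if (!v.endsWith("M")) {
            throw new IllegalArgumentException("expected an M suffix: " + value);
        }
        return Long.parseLong(v.substring(0, v.length() - 1));
    }

    // Adds up all mapped_memory settings, in megabytes.
    static long totalMegabytes(Map<String, String> settings) {
        long total = 0;
        for (String value : settings.values()) {
            total += parseMegabytes(value);
        }
        return total;
    }

    // The tuned values from the slide.
    static Map<String, String> tunedSettings() {
        Map<String, String> s = new LinkedHashMap<>();
        s.put("neostore.nodestore.db.mapped_memory", "150M");
        s.put("neostore.relationshipstore.db.mapped_memory", "400M");
        s.put("neostore.propertystore.db.mapped_memory", "600M");
        s.put("neostore.propertystore.db.strings.mapped_memory", "1450M");
        s.put("neostore.propertystore.db.arrays.mapped_memory", "400M");
        return s;
    }

    public static void main(String[] args) {
        long fileBuffersMb = totalMegabytes(tunedSettings());
        long heapMb = 3 * 1000; // Java heap share from the slide
        long osMb = 2 * 1000;   // left for the operating system
        long machineMb = 8 * 1000; // the 8 GB test environment
        long plannedMb = fileBuffersMb + heapMb + osMb;
        System.out.println("file buffers: " + fileBuffersMb + " MB");
        System.out.println("planned total: " + plannedMb + " MB of " + machineMb + " MB");
    }
}
```

The point of the check is the constraint stated earlier: heap plus mapped memory must fit in physical memory, with something left over for the OS.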
  78.
    Metric          Kryo/Traverse   Kryo/Traverse Tuned
    Received        68.5 MB         68.5 MB
    Download speed  47.2 MB/s       49.6 MB/s
    Time spent      1.4 seconds     1.37 seconds
  79. Object cache
    • The object cache is sometimes called high level cache.
    • It caches the Neo4j data in a form optimized for fast traversal.
    • Two different categories
      • Reference caches
      • High-Performance Cache
  80. Object cache
    • Nodes and relationships are added to the object cache as soon as they are accessed (lazily).
    • Reading from this cache may be 5 to 10 times faster than reading from the file buffer cache.
  81. Object cache type
    • None
    • Soft (Community default)
    • Weak
    • Strong
    • HPC (Enterprise default)
  82. Day 16 Upgrade to Enterprise

  83. Neo4j Enterprise
    • Advanced Monitoring
    • Backups
    • Cluster
    • HPC
  84. HPC
    • Assigned a certain maximum amount of space on the JVM heap
    • Purges objects whenever it grows bigger than that
    • GC pauses can be better controlled
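HPC is described here as a cache with a fixed ceiling that purges entries once it grows past it. As a loose analogy only, a size-capped map in plain Java behaves similarly. This is not Neo4j's implementation: `BoundedCache` is a hypothetical name, the cap is counted in entries rather than heap bytes, and least-recently-used eviction is an assumption chosen for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: reading an entry marks it as recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // drops the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("node-100", "root");
        cache.put("node-101", "child");
        cache.get("node-100");          // touch the root so it stays
        cache.put("node-102", "child"); // exceeds the cap, evicts node-101
        System.out.println(cache.keySet());
    }
}
```

Because the ceiling is fixed up front, the cache never grows until the garbage collector has to reclaim it, which is the intuition behind the slide's point about GC pauses.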
  85.
    Metric          Kryo/Traverse Tuned   Kryo/Traverse HPC
    Received        68.5 MB               68.5 MB
    Download speed  49.6 MB/s             64.7 MB/s
    Time spent      1.37 seconds          1.049 seconds
  86. 1.049 seconds 817 seconds

  87. None
  88. Physical server • 240 GB / 32 CPUs • CentOS

    6.5 x64
  89.
    Metric          Kryo/Traverse Virtual   Kryo/Traverse Physical
    Received        68.5 MB                 68.5 MB
    Download speed  64.7 MB/s               68.5 MB/s
    Time spent      1.049 seconds           0.991 seconds
  90. Other results
    • Get 30000 (huge) nodes by field: 2.3 seconds
    • Create nodes: >15000 nodes/second
    • ~2500 concurrent requests on virtual hardware
  91. None