Neo4j Magic Adventures

Neo4j Magic Adventures Dmitry Vrublevsky @ Neueda

Briefing
1. Evaluate Neo4j capabilities
   • Import test dataset
   • Implement and run test cases
   • Measure everything
2. Compare Neo4j with existing solution

Neo4j “Neo4j – the World’s Leading Graph Database”

• Native graph storage • Property graph database • Schema-less
• Powerful Query Language • Clustering • Hot backups

Network graph • Looks like a tree • Node has
unique id • Node has type • Node can contain structures • … or structure sequences • … or structure with sequence of structures

• Node (Object) • Structure • Sequence

Network graph
• ~8,970,000 nodes
• ~11,000,000 relationships
• ~33,140,000 properties
• ~5.7 GB

Environment • 8 GB / 4 CPUs • 80 GB
SSD Disk • CentOS 6.5 x64

Day 1 - 7
• Environment setup… Done
• Dataset import… Done
• Documentation read… Done

Mission impossible • Take existing data model • Take existing
test case • Make it as fast as possible

Test case: read the whole graph under a specified root node with
one request. Node count in graph: 97,254 (structures and sequences not counted)

Neo4j toolbox

REST API • Node & Relationship endpoints • Transactional endpoint
• Traversal endpoint • Batch operations endpoint • Other

Cypher “Cypher is a declarative graph query language that allows
for expressive and efficient querying and updating of the graph store.”

• ASCII art • Keywords like WHERE and ORDER BY
are inspired by SQL • Focuses on the clarity of expressing what to retrieve from a graph • Collection semantics have been borrowed from languages such as Haskell and Python

MATCH (john {name: 'John'})-[:friend]->(friends)
MATCH (friends)-[:friend]->(fof)
RETURN john, fof

( … ) - node
[ … ] - relationship
()-[]-() - path

Day 8 Cypher? REST API?

Solution via Cypher query

MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
WHERE root.id = {rootNodeId}
OPTIONAL MATCH (child)-[property:PROPERTY]->(child_property)
RETURN *

#!/bin/bash

QUERY=bodies/subgraph.json

curl -i -XPOST \
  -o output.log \
  --data "@$QUERY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  http://127.0.0.1:7474/db/data/transaction/commit

REST/Default (first run, nothing to compare against yet)
• Received: 1225 MB
• Download speed: 1535 KB/s
• Time spent: 817 seconds

Observations • Response streamed • Large response size • Long
request time

Day 9 Requirements arrived!

Required total time for the test case: 2 seconds. Currently we have 817
seconds.

Problems?

[
  {
    "id": "100",
    "graph": { "nodes": [ {"id": "101"} ] }
  },
  {
    "id": "100",
    "graph": { "nodes": [ {"id": "102"} ] }
  }
]

(Diagram: nodes 100, 101 and 102)

Cypher thinks in paths, not graphs! • Unnecessary data duplication
• Cypher doesn’t know about our data model
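
The duplication above comes from every result row repeating the nodes of its path. A minimal sketch in plain Java of merging such rows into one set of distinct node ids, assuming each row is just a list of ids as in the JSON sample; `PathRowMerger` and `distinctNodeIds` are hypothetical names for illustration, not part of Neo4j.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: merges path-shaped result rows into one set of node ids.
final class PathRowMerger {

    // Each inner list holds the node ids of one result row (one path).
    static Set<String> distinctNodeIds(List<List<String>> rows) {
        Set<String> ids = new LinkedHashSet<>(); // keeps first-seen order
        for (List<String> row : rows) {
            ids.addAll(row);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two paths starting at the same root: 100 -> 101 and 100 -> 102.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("100", "101"),
                Arrays.asList("100", "102"));
        // The root id is repeated in both rows but kept only once in the result.
        System.out.println(distinctNodeIds(rows));
    }
}
```

With tens of thousands of paths sharing the same ancestors, sending each node once instead of once per path is what shrinks the response.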

Another solution

MATCH (root:Object)-[r1:CHILDREN*]->(child:Object)
WHERE root.id = {rootNodeId}
OPTIONAL MATCH (child)-[r2:PROPERTY]->(child_property)
RETURN collect(root) + collect(child) + collect(child_property) as nodes,
       collect(r2) as relationships

#!/bin/bash

QUERY=bodies/subgraph_optimised.json

curl -i -XPOST \
  -o output.log \
  --data "@$QUERY" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  http://127.0.0.1:7474/db/data/transaction/commit

REST/Default vs REST/Optimized
• Received: 1225 MB vs 85.2 MB
• Download speed: 1535 KB/s vs 1579 KB/s
• Time spent: 817 seconds vs 55 seconds

Conclusion • We need more control over the querying and serialisation
process! • Maybe another serialisation format? • Another querying API?

Day 10 Morning standup - “We need to implement our
own extension. Can we do it?” - “Yeah, definitely.”

Unmanaged extension • The unmanaged extensions are a way of
deploying arbitrary JAX-RS code into the Neo4j server.

Plan
1. Take a fast serialisation library
2. Take the Neo4j Java API
3. Implement our own endpoints
4. …
5. Profit!

BSON
• Obvious choice
• Brought by MongoDB
• Fast serialisation

http://bsonspec.org/
1. Lightweight
2. Traversable
3. Efficient (as they say)
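
To make the "lightweight" and "traversable" points concrete, here is a minimal hand-rolled sketch of the BSON byte layout for a document with a single string field, following the layout published at bsonspec.org (length-prefixed document, typed elements, NUL terminators). `TinyBson` is a hypothetical illustration only; the deck itself uses the bson4jackson library rather than anything like this.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-rolled encoder for a BSON document with one string field.
final class TinyBson {

    static byte[] encodeStringField(String name, String value) {
        byte[] key = name.getBytes(StandardCharsets.UTF_8);
        byte[] val = value.getBytes(StandardCharsets.UTF_8);

        // int32 size + type byte + key + NUL + int32 string length + value + NUL + terminator
        int total = 4 + 1 + key.length + 1 + 4 + val.length + 1 + 1;

        ByteBuffer buf = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(total);            // document size, including these four bytes
        buf.put((byte) 0x02);         // element type 0x02 = UTF-8 string
        buf.put(key).put((byte) 0);   // field name as a NUL-terminated string
        buf.putInt(val.length + 1);   // string length, counting the trailing NUL
        buf.put(val).put((byte) 0);   // string bytes plus NUL
        buf.put((byte) 0);            // end of document
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] doc = encodeStringField("hello", "world");
        System.out.println(doc.length + " bytes");
    }
}
```

Because every document and string carries its length up front, a reader can skip over fields it does not need, which is the "traversable" property.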

Jackson • Jackson used by Neo4j internally • It’s cool
• Jackson has BSON plugin https://github.com/FasterXML/jackson https://github.com/michel-kraemer/bson4jackson

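import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import com.fasterxml.jackson.databind.ObjectMapper;
import de.undercouch.bson4jackson.BsonFactory;

// create mapper
ObjectMapper mapper = new ObjectMapper(new BsonFactory());

// serialize data
ByteArrayOutputStream baos = new ByteArrayOutputStream();
mapper.writeValue(baos, pojo);

// deserialize data
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
mapper.readValue(bais, PojoClass.class);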

// create bson factory
BsonFactory factory = new BsonFactory();
ByteArrayOutputStream baos = new ByteArrayOutputStream();

// serialize data
JsonGenerator gen = factory.createJsonGenerator(baos);
gen.writeStartObject();
gen.writeFieldName("name");
gen.writeString(bob.getName());
gen.close();

Streaming!

JAX-RS Streaming • Because why not? http://docs.oracle.com/javaee/6/api/javax/ws/rs/core/StreamingOutput.html

StreamingOutput stream = new StreamingOutput() {
    @Override
    public void write(OutputStream os) throws IOException {
        // write directly to the response stream instead of buffering the whole body
        Writer writer = new BufferedWriter(new OutputStreamWriter(os));
        writer.write("Hello World!");
        writer.flush();
    }
};

return Response.ok(stream).build();

Day 11 - 12
• Extension setup… Done
• Cypher endpoint… Done
• Documentation read… Done

Dependencies

<dependency>
  <groupId>org.neo4j</groupId>
  <artifactId>neo4j</artifactId>
  <version>${neo4j.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>javax.ws.rs</groupId>
  <artifactId>javax.ws.rs-api</artifactId>
  <version>2.0</version>
  <scope>provided</scope>
</dependency>

private final GraphDatabaseService db;

try (Transaction tx = db.beginTx()) {
    ExecutionEngine engine = new ExecutionEngine(db);
    ExecutionResult result = engine.execute(query);

    Bson.serialize(output, result);
}

Day 13 Another day, another experiment

New Cypher solution

MATCH (root:Object)-[:CHILDREN*]->(child:Object)
WHERE root.id = {rootNodeId}
RETURN child

(Properties autoloaded during serialisation!)

Iterable<Relationship> relationships =
    node.getRelationships(Relationships.PROPERTY, Direction.OUTGOING);

for (Relationship relationship : relationships) {
    Node endNode = relationship.getEndNode();

    if (endNode.hasLabel(Labels.Structure)) {
        ...
    } else if (endNode.hasLabel(Labels.Sequence)) {
        ...
    }
}

#!/bin/bash

QUERY=bodies/bson.json

curl -i -XPOST \
  -o output.log \
  --data "@$QUERY" \
  -H "Content-Type: application/json" \
  http://127.0.0.1:7474/extension/bson/cypher

REST/Optimized vs Bson/Cypher
• Received: 85.2 MB vs 78.3 MB
• Download speed: 1579 KB/s vs 12.3 MB/s
• Time spent: 55 seconds vs 6 seconds

Traverse API “The Neo4j Traversal API is a callback based,
lazily executed way of specifying desired movements through a graph in Java.”

ResourceIterable<Node> nodes = db
    .traversalDescription()
    .breadthFirst()
    .relationships(Relationships.CHILDREN, Direction.OUTGOING)
    .evaluator(Evaluators.all())
    .traverse(rootNode)
    .nodes();
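
For readers without a Neo4j server at hand, the visiting order of a breadth-first walk over outgoing CHILDREN relationships can be sketched in plain Java. This is only an analogue under stated assumptions: an in-memory adjacency map stands in for the graph, and `BreadthFirstSketch` is a hypothetical name, not the Traversal API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java analogue of a breadth-first walk; the map plays the role of CHILDREN relationships.
final class BreadthFirstSketch {

    // Returns node ids in breadth-first order, starting with the root itself.
    static List<Long> traverse(Map<Long, List<Long>> children, long rootId) {
        List<Long> visited = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        Deque<Long> queue = new ArrayDeque<>();
        queue.add(rootId);
        seen.add(rootId);
        while (!queue.isEmpty()) {
            Long current = queue.poll();
            visited.add(current);
            for (Long child : children.getOrDefault(current, Collections.emptyList())) {
                if (seen.add(child)) { // enqueue each node only once
                    queue.add(child);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> children = new HashMap<>();
        children.put(1L, Arrays.asList(2L, 3L));
        children.put(2L, Arrays.asList(4L));
        System.out.println(traverse(children, 1L));
    }
}
```

Each node is produced once, level by level, which avoids the per-path duplication seen with the Cypher result rows.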

#!/bin/bash

curl -i -XGET \
  -o output.log \
  http://127.0.0.1:7474/extension/bson/traverse

Bson/Cypher vs Bson/Traverse
• Received: 78.3 MB vs 78.3 MB
• Download speed: 12.3 MB/s vs 23.7 MB/s
• Time spent: 6 seconds vs 3 seconds

Day 14 Can we do better?

“Kryo is a fast and efficient object graph serialization framework
for Java.”

Output output = new Output(outputStream);

// Setup
Kryo kryo = new Kryo();
kryo.setRegistrationRequired(true);
kryo.register(HashMap.class);
kryo.register(String[].class);
kryo.register(NodeDAO.class);
kryo.register(RelationshipDAO.class);

// Serialize
kryo.writeObject(output, object1);
kryo.writeObject(output, object2);
output.close();

#!/bin/bash

QUERY=bodies/kryo.json

curl -i -XPOST \
  -o output.log \
  --data "@$QUERY" \
  -H "Content-Type: application/json" \
  http://127.0.0.1:7474/extension/kryo/cypher

Bson/Cypher vs Kryo/Cypher
• Time spent: 6 seconds vs 3 seconds

#!/bin/bash

curl -i -XGET \
  -o output.log \
  -w "
  time_connect=%{time_connect}
  time_start_transfer=%{time_starttransfer}
  time_total=%{time_total}
  " \
  http://127.0.0.1:7474/extension/kryo/traverse

Bson/Traverse vs Kryo/Traverse
• Time spent: 3 seconds vs 1.4 seconds

Compression?

LZF • Optimized for speed • Streaming https://github.com/ning/compress

https://github.com/ning/jvm-compressor-benchmark/wiki

import com.esotericsoftware.kryo.io.Output;
import com.ning.compress.lzf.LZFOutputStream;

// Before
Output output = new Output(outputStream);

// After
Output output = new Output(new LZFOutputStream(outputStream));
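
The same wrap-the-stream idea can be tried with the JDK alone. The sketch below swaps LZF for `java.util.zip.GZIPOutputStream`, purely so it runs without extra dependencies; GZIP is a different algorithm with a different speed and ratio trade-off, and `CompressedStreamSketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// JDK-only stand-in for the LZF example: compression is added by wrapping the output stream.
final class CompressedStreamSketch {

    static byte[] compress(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(payload); // the writer does not need to know about compression
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1_000_000]; // all zeros, so highly compressible
        byte[] packed = compress(payload);
        System.out.println(payload.length + " bytes -> " + packed.length + " bytes");
    }
}
```

The serialiser code stays unchanged; only the stream handed to it differs, which is why the switch is a one-line change on the slide.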

#!/bin/bash

curl -i -XGET \
  -o output.log \
  -w "
  time_connect=%{time_connect}
  time_start_transfer=%{time_starttransfer}
  time_total=%{time_total}
  " \
  http://127.0.0.1:7474/extension/kryo/traverse

Kryo/Traverse vs Kryo/Traverse Compressed
• Received: 68.5 MB vs 7.6 MB
• Download speed: 47.2 MB/s vs 4299 KB/s
• Time spent: 1.4 seconds vs 1.7 seconds

Short conclusion: compression is useful mostly for very large responses.

Day 15 Configurations

Important • Neo4j makes heavy use of the java.nio package.
Native I/O will result in memory being allocated outside the normal Java heap. • Neo4j will require all of the heap memory of the JVM plus the memory to be used for memory mapping to be available as physical memory.

File buffer cache • The file buffer cache is sometimes
called low level cache or file system cache. • It uses the operating system memory mapping features when possible. • Neo4j uses multiple file buffer caches, one for each different storage file.

Store file                          Record size  Contents
neostore.nodestore.db               15 B         Nodes
neostore.relationshipstore.db       34 B         Relationships
neostore.propertystore.db           41 B         Properties for nodes and relationships
neostore.propertystore.db.strings   128 B        Values of string properties
neostore.propertystore.db.arrays    128 B        Values of array properties

Strings and arrays are stored in one or more 120 B chunks, with 8 B record overhead.
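
The record sizes give a rough feel for how large each store file is for this dataset. A minimal sketch of the arithmetic, using the counts from the "Network graph" slide and assuming one fixed-size record per node, relationship and property; that one-record-per-property assumption is mine and real stores may pack records differently. `StoreSizeEstimate` is a hypothetical name.

```java
// Back-of-the-envelope store sizes: record count multiplied by fixed record size.
final class StoreSizeEstimate {

    static final int NODE_RECORD_BYTES = 15;
    static final int RELATIONSHIP_RECORD_BYTES = 34;
    static final int PROPERTY_RECORD_BYTES = 41;

    static long storeBytes(long recordCount, int recordSizeBytes) {
        return recordCount * recordSizeBytes;
    }

    // Decimal megabytes (1 MB = 1,000,000 bytes), good enough for an estimate.
    static double toMegabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long nodes = 8_970_000L;          // approximate counts from the dataset slide
        long relationships = 11_000_000L;
        long properties = 33_140_000L;    // assumption: one record per property

        System.out.println("nodes:         " + toMegabytes(storeBytes(nodes, NODE_RECORD_BYTES)) + " MB");
        System.out.println("relationships: " + toMegabytes(storeBytes(relationships, RELATIONSHIP_RECORD_BYTES)) + " MB");
        System.out.println("properties:    " + toMegabytes(storeBytes(properties, PROPERTY_RECORD_BYTES)) + " MB");
    }
}
```

Estimates like these are one way to choose mapped-memory sizes; checking the actual file sizes on disk is the more reliable input.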

# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M

# Tuned
neostore.nodestore.db.mapped_memory=150M
neostore.relationshipstore.db.mapped_memory=400M
neostore.propertystore.db.mapped_memory=600M
neostore.propertystore.db.strings.mapped_memory=1450M
neostore.propertystore.db.arrays.mapped_memory=400M

150MB + 400MB + 600MB + 1450MB + 400MB = 3000MB

# Tuned
neostore.nodestore.db.mapped_memory=150M
neostore.relationshipstore.db.mapped_memory=400M
neostore.propertystore.db.mapped_memory=600M
neostore.propertystore.db.strings.mapped_memory=1450M
neostore.propertystore.db.arrays.mapped_memory=400M

Available memory
• 3 GB - File buffers
• 3 GB - Java heap
• 2 GB - OS
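
The total can be checked mechanically whenever the settings change. A small sketch that sums every `*.mapped_memory` entry in a properties-style config, assuming all values use the `<number>M` form shown on the slide; `MappedMemoryBudget` is a hypothetical helper, not a Neo4j tool.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sums the mapped-memory settings so the total can be compared with available RAM.
final class MappedMemoryBudget {

    // Assumes every "*.mapped_memory" value looks like "150M" (megabytes).
    static long totalMappedMegabytes(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long total = 0;
        for (String key : props.stringPropertyNames()) {
            if (!key.endsWith(".mapped_memory")) {
                continue;
            }
            String value = props.getProperty(key).trim();
            if (!value.endsWith("M")) {
                throw new IllegalArgumentException("unsupported unit in " + key + "=" + value);
            }
            total += Long.parseLong(value.substring(0, value.length() - 1));
        }
        return total;
    }

    public static void main(String[] args) {
        String tuned = String.join("\n",
                "neostore.nodestore.db.mapped_memory=150M",
                "neostore.relationshipstore.db.mapped_memory=400M",
                "neostore.propertystore.db.mapped_memory=600M",
                "neostore.propertystore.db.strings.mapped_memory=1450M",
                "neostore.propertystore.db.arrays.mapped_memory=400M");
        System.out.println(totalMappedMegabytes(tuned) + " MB mapped");
    }
}
```

On the 8 GB test machine, the file buffers plus the 3 GB heap leave about 2 GB for the operating system, matching the split on the slide.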

Kryo/Traverse vs Kryo/Traverse Tuned
• Time spent: 1.4 seconds vs 1.37 seconds

Object cache • The object cache is sometimes called high
level cache. • It caches the Neo4j data in a form optimized for fast traversal. • Two different categories • Reference caches • High-Performance Cache

Object cache • Nodes and relationships are added to the
object cache as soon as they are accessed (lazily). • Reading from this cache may be 5 to 10 times faster than reading from the file buffer cache.

Object cache type • None • Soft (Community default) •
Weak • Strong • HPC (Enterprise default)

Day 16 Upgrade to Enterprise

Neo4j Enterprise • Advanced Monitoring • Backups • Cluster •
HPC

HPC • Assigned a certain maximum amount of space on
the JVM heap • Purge objects whenever it grows bigger than that • GC-pauses can be better controlled
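
The "fixed budget, purge when exceeded" behaviour can be illustrated with a tiny bounded cache. This is only a sketch under loud assumptions: it limits the number of entries rather than heap bytes, and it evicts the least recently used entry, which may not be the policy the real HPC uses. `BoundedCacheSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded cache: once the limit is exceeded, the least recently used entry is purged.
final class BoundedCacheSketch<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCacheSketch<Long, String> cache = new BoundedCacheSketch<>(2);
        cache.put(1L, "node 1");
        cache.put(2L, "node 2");
        cache.get(1L);           // touch node 1 so node 2 becomes the eldest
        cache.put(3L, "node 3"); // exceeds the limit, node 2 is purged
        System.out.println(cache.keySet());
    }
}
```

Keeping the cache at a known size is what makes garbage-collection pauses easier to predict than with soft or weak references, where the JVM decides when entries disappear.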

Kryo/Traverse Tuned vs Kryo/Traverse HPC
• Time spent: 1.37 seconds vs 1.049 seconds

1.049 seconds 817 seconds
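
To put the two numbers side by side with the intermediate steps, the speedup of each run over the 817-second baseline can be computed as baseline time divided by run time. A small sketch using the timings reported in this deck; `SpeedupSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Speedup of each measured run relative to the first REST/Default run.
final class SpeedupSketch {

    // How many times faster a run is than the baseline.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        double baseline = 817.0; // REST/Default, in seconds

        Map<String, Double> runs = new LinkedHashMap<>();
        runs.put("REST/Optimized", 55.0);
        runs.put("Bson/Cypher", 6.0);
        runs.put("Bson/Traverse", 3.0);
        runs.put("Kryo/Traverse", 1.4);
        runs.put("Kryo/Traverse Tuned", 1.37);
        runs.put("Kryo/Traverse HPC", 1.049);

        for (Map.Entry<String, Double> run : runs.entrySet()) {
            System.out.printf("%-20s %7.1fx%n", run.getKey(), speedup(baseline, run.getValue()));
        }
    }
}
```

The final run is roughly 779 times faster than the first one and comes in under the 2-second requirement.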

Physical server • 240 GB / 32 CPUs • CentOS
6.5 x64

Kryo/Traverse Virtual vs Kryo/Traverse Physical
• Time spent: 1.049 seconds vs 0.991 seconds

Other results • Get 30000 (huge) nodes by field: 2.3
seconds • Create nodes: >15000 nodes/second • ~2500 concurrent requests on virtual hardware
