
The Spark of Neo4j

Apache Spark is a powerful distributed data processing framework whose API lets you build connectors that read from and write to any kind of data source. In this session we'll look at the challenges we faced while developing the official Neo4j Connector for Apache Spark, which enables easy bidirectional communication between the two tools, and we'll also learn how to combine Neo4j's potential with the distributed processing capabilities of Apache Spark.

Davide Fantuzzi

March 16, 2021

Transcript

  1. ABOUT LARUS
     • Founded in 2004. HQ: Venice. Offices: Pescara, Rome, Milan.
     • Global services, international projects.
     • Team of certified Data Engineers, Data Architects, Data Scientists and Big Data experts.
     • We help companies become insight-driven organizations.
     • Leader in the development of data-driven applications based on NoSQL & Event Streaming technologies.

  2. LARUS: OUR SPECIALTIES
     • Big Data Platform Design & Development (Java, Scala, Python, JavaScript)
     • Data Engineering
     • Graph Data Visualization
     • Data Science
     • Strategic Advisory for Data-Driven Transformation Projects
     • Machine Learning and AI based on graph technology

  3. AGENDA
     1. Spark & Neo4j
     2. Challenges
     3. Neo4j Connector for Apache Spark
     4. Demo

  4. WHAT IS APACHE SPARK?
     • Analytics engine for large-scale data processing
     • Cluster of worker nodes that partition operations and execute them in parallel
     • Supports SQL & Streaming
     • Oriented around DataFrames, which for our purposes are effectively tables
     • Polyglot

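To make the "DataFrames are effectively tables" point concrete, here is a minimal, self-contained Scala sketch; the session settings and sample data are illustrative and not taken from the deck:

   import org.apache.spark.sql.SparkSession

   // Start a local session; on a real cluster the master would point at the cluster manager
   val spark = SparkSession.builder()
     .appName("spark-dataframe-demo")
     .master("local[*]")
     .getOrCreate()

   import spark.implicits._

   // A DataFrame is a distributed table: named, typed columns over partitioned rows
   val people = Seq((1L, "Keanu Reeves"), (2L, "Tom Hanks")).toDF("id", "name")

   // Operations are expressed declaratively and executed in parallel across partitions
   people.filter($"name".startsWith("Keanu")).show()
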
  5. WHEN TO USE SPARK?
     • Very large datasets that have to be broken into pieces
     • Complex pipelines with many sources & transformations
     • Great for iterative algorithms (map, reduce, filter, sort)

  6. OUR GOAL
     • Avoid custom "hacky" solutions
     • Continuous development
     • Quick response to issues
     • Leverage the DataSource V2 APIs in order to be fully Spark compliant:
       ◦ Polyglot
       ◦ ETL
       ◦ Graph-Driven Machine Learning

  7. OUR SOLUTION
     • Deprecation of the old connector
     • Complete rewrite using the DataSource API V2
     • Official (enterprise) Neo4j support through Larus
     • Dedicated team
     • Open source
     • Comprehensive documentation

  8. DOCUMENTATION AND EXAMPLES
     • Lack of.
     • No official documentation on the DataSource V2 API
     • Examples on the web were superficial

  9. WHERE ARE WE NOW?
     • Spark 2.4 is supported
     • Spark 3.0 is pre-released, and an official release will happen in March
     • Spark 2.3 is on hold

  10. VERSION FRAGMENTATION
     • Spark 2.4 supports Scala 2.11 and Scala 2.12
     • Spark 3.0 supports Scala 2.12 and Scala 2.13, and removed support for Scala 2.11
     • (but Spark 3.0 for Scala 2.13 is not released yet)
     • We need to release a JAR for each combination, e.g. neo4j-connector-apache-spark_2.12_3.0-4.0.0.jar

  11. TABLES VS LABELS
     (:Person)-[:ACTED_IN]->(:Movie)
     We had to find a way to map graph entities into table columns:

     Person.id | Person.name
     1         | Keanu Reeves
     2         | Tom Hanks

     ACTED_IN.source | ACTED_IN.target
     1               | 3
     2               | 4

     Movie.id | Movie.title
     3        | The Matrix
     4        | Cloud Atlas

  12. TABLES VS LABELS
     Bi-directional mapping (READ and WRITE) allows reading from and writing to Neo4j, using the same Person, ACTED_IN, and Movie tables shown above.

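As a minimal round-trip sketch of that bi-directional mapping, reusing the labels option shown in the READ DATA / WRITE DATA slides that follow (connection options such as the Neo4j URL and credentials are omitted here, just as they are in the slides):

   import spark.implicits._

   // The Person table from the slide as a DataFrame
   val personDf = Seq((1L, "Keanu Reeves"), (2L, "Tom Hanks")).toDF("id", "name")

   // WRITE: each row becomes a :Person node whose properties are the row's columns
   personDf.write
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .save()

   // READ: the :Person nodes come back as a table, one column per property
   val readBack = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .load()
   readBack.show()
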
  13. SCHEMA VS. SCHEMALESS
     • Result flattening

     Schema: name String, age Long, location String

     Result:
     name       | age | location
     "John Doe" | 33  | "Milan"
     "Jane Doe" | 24  | "Rome"

  14. SCHEMA VS. SCHEMALESS
     • Result flattening
     • A property might not exist in every record

     Schema: name String, age Long, location String

     Result:
     name       | age | location
     "John Doe" | 33  | "Milan"
     "Jane Doe" | 24  | null

  15. SCHEMA VS. SCHEMALESS
     • Result flattening
     • A property might not exist in every record
     • A property might not have type consistency

     Schema: name String, age String*, location String
     * no matter the types involved

     Result:
     name       | age  | location
     "John Doe" | "33" | "Milan"
     "Jane Doe" | "24" | "Rome"

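A hedged sketch of how this surfaces on the Spark side: a single schema is inferred for the whole result, missing properties arrive as null, and when a property's type is inconsistent the column falls back to string. Casting it back is an ordinary Spark workaround, not something the connector does for you:

   import org.apache.spark.sql.functions.col

   val people = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .load()

   // Inspect the schema the connector inferred (e.g. name: string, age: string, location: string)
   people.printSchema()

   // If `age` came back as a string because of inconsistent property types,
   // cast it explicitly; values that cannot be parsed become null
   val cleaned = people.withColumn("age", col("age").cast("long"))
   cleaned.show()
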
  16. PARTITIONING
     • We use SKIP / LIMIT
     • Be aware of deadlocks when writing

     Creating 5 partitions from a dataset of 100 nodes:
     MATCH (p:Person) RETURN p SKIP 0 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 20 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 40 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 60 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 80 LIMIT 20

     https://neo4j.com/developer/spark/quickstart/#_partitioning

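As a sketch of how the partition count is chosen: the connector exposes a partitions option (the option name here follows the quickstart page linked above), and the reader then issues one SKIP/LIMIT query per partition, as in the example:

   // Ask the connector for 5 read partitions; each partition runs one of the
   // SKIP/LIMIT queries above and is processed by a separate Spark task
   val people = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .option("partitions", "5")
     .load()
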
  17. READ DATA
     • Labels

     Scala:
     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("labels", ":Person:Admin")
       .load()

     Python:
     df = spark.read.format("org.neo4j.spark.DataSource") \
       .option("labels", ":Person:Admin") \
       .load()

     R:
     df <- read.df(source="org.neo4j.spark.DataSource", labels=":Person:Admin")

  18. READ DATA
     • Labels
     • Relationship

     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("relationship", "ACTED_IN")
       .option("relationship.source.labels", "Person")
       .option("relationship.target.labels", "Movie")
       .load()

  19. READ DATA
     • Labels
     • Relationship
     • Query

     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("query", "MATCH (n:Person) RETURN n.name, n.age")
       .load()

  20. WRITE DATA
     • Labels

     val bandDf = Seq(
       (1, "Alex Lifeson"),
       (2, "Neil Peart"),
       (3, "Geddy Lee")
     ).toDF("id", "name")

     bandDf.write
       .format("org.neo4j.spark.DataSource")
       .option("labels", ":Person:Musician")
       .save()

  21. WRITE DATA
     • Labels
     • Relationship

     val musicDf = Seq(
       (12, "John Bonham", "Drums"),
       (19, "John Mayer", "Guitar"),
       (32, "John Scofield", "Guitar"),
       (15, "John Butler", "Guitar")
     ).toDF("experience", "name", "instrument")

     musicDf.write
       .format("org.neo4j.spark.DataSource")
       .option("relationship", "PLAYS")
       .option("relationship.save.strategy", "keys")
       .option("relationship.source.labels", ":Musician")
       .option("relationship.source.node.keys", "name:name")
       .option("relationship.target.labels", ":Instrument")
       .option("relationship.target.node.keys", "instrument:name")
       .save()

  22. WRITE DATA
     • Labels
     • Relationship
     • Query

     val theTeam = Seq(
       ("David", "Allen"),
       ("Andrea", "Santurbano"),
       ("Davide", "Fantuzzi")
     ).toDF("name", "lastname")

     theTeam.write
       .format("org.neo4j.spark.DataSource")
       .option(
         "query",
         "CREATE (n:Person) " +
         "SET n.fullName = event.name + event.lastname"
       )
       .save()

     This will generate a query like:
     UNWIND $events AS event
     CREATE (n:Person)
     SET n.fullName = event.name + event.lastname

  23. FEATURES
     • Push Down Filters

     val df = spark.read
       .format("org.neo4j.spark.DataSource")
       .option("labels", "Movie")
       .load()

     df.where("title LIKE 'Matrix%'").show()

  24. FEATURES
     • Push Down Filters
     • Push Down Columns

     val df = spark.read
       .format("org.neo4j.spark.DataSource")
       .option("labels", "Movie")
       .load()

     df.select("title").show()

  25. FEATURES
     • Push Down Filters
     • Push Down Columns
     • Official Neo4j Driver [link]
     • CypherDSL [link]

  26. FEATURES
     • Push Down Filters
     • Push Down Columns
     • Official Neo4j Driver [link]
     • CypherDSL [link]
     • GraphX / GraphFrames are not used

  27. COMMON USE CASES
     • Data Source Integration
       ◦ Connect any file format or database supported by Spark to Neo4j
     • Extraction, Transformation, and Load (ETL), bi-directionally (see the sketch below)
       ◦ Bulk insert for new databases
       ◦ Ongoing nightly jobs
     • Graph-driven Machine Learning
       ◦ Use Spark to bring Graph Data Science into existing pipelines

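A hedged end-to-end ETL sketch along these lines; the CSV path and column names are placeholders, and connection options are omitted as in the earlier slides:

   // Extract: read from any Spark-supported source, here a CSV file
   val rawPeople = spark.read
     .option("header", "true")
     .csv("/data/people.csv")

   // Transform: ordinary DataFrame operations
   val people = rawPeople
     .select("id", "name")
     .dropDuplicates("id")

   // Load: write the cleaned rows into Neo4j as :Person nodes
   people.write
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .save()
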
  28. USEFUL LINKS
     • GitHub: https://github.com/neo4j-contrib/neo4j-spark-connector
     • Documentation: https://neo4j.com/developer/spark/
     • Notebook for playing around with the connector: https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
     • Article on Towards Data Science: https://towardsdatascience.com/using-neo4j-with-pyspark-on-databricks-eb3d127f2245