
The Spark of Neo4j

Apache Spark is a powerful distributed data processing framework whose API lets you build connectors that read from and write to any kind of data source. In this session we'll look at the challenges we faced while developing the official Neo4j Connector for Apache Spark, which enables easy bidirectional communication between the two tools, and we'll also learn how to combine Neo4j's potential with the distributed processing capabilities of Apache Spark.

Davide Fantuzzi

March 16, 2021

Transcript

  1. ABOUT LARUS
     • Founded in 2004. HQ: Venice. Offices: Pescara, Rome, Milan.
     • Global services, international projects.
     • Team of certified Data Engineers, Data Architects, Data Scientists and Big Data experts.
     • We help companies become insight-driven organizations.
     • Leader in the development of data-driven applications based on NoSQL & Event Streaming technologies.

  2. LARUS: OUR SPECIALTIES
     • Big Data Platform Design & Development (Java, Scala, Python, JavaScript)
     • Data Engineering
     • Graph Data Visualization
     • Data Science
     • Strategic Advisory for Data-Driven Transformation Projects
     • Machine Learning and AI based on graph technology

  3. AGENDA
     1. Spark & Neo4j
     2. Challenges
     3. Neo4j Connector for Apache Spark
     4. Demo

  4. WHAT IS APACHE SPARK?
     • Analytics engine for large-scale data processing
     • Cluster of worker nodes that partition operations and execute them in parallel
     • Supports SQL & Streaming
     • Oriented around DataFrames, which for our purposes are effectively tables
     • Polyglot

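To make the "DataFrames are effectively tables" point concrete, here is a minimal, self-contained Scala sketch; the session settings and sample data are illustrative and not taken from the deck:

   import org.apache.spark.sql.SparkSession

   // Start a local session; on a real cluster the master would point at the cluster manager
   val spark = SparkSession.builder()
     .appName("spark-dataframe-demo")
     .master("local[*]")
     .getOrCreate()

   import spark.implicits._

   // A DataFrame is a distributed table: named, typed columns over partitioned rows
   val people = Seq((1L, "Keanu Reeves"), (2L, "Tom Hanks")).toDF("id", "name")

   // Operations are expressed declaratively and executed in parallel across partitions
   people.filter($"name".startsWith("Keanu")).show()
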
  5. WHEN TO USE SPARK?
     • Very large datasets that have to be broken into pieces
     • Complex pipelines with many sources & transformations
     • Great for iterative algorithms (map, reduce, filter, sort)

  6. OUR GOAL
     • Avoid custom "hacky" solutions
     • Continuous development
     • Quick response to issues
     • Leverage the DataSource V2 APIs in order to be fully Spark compliant:
       ◦ Polyglot
       ◦ ETL
       ◦ Graph-Driven Machine Learning

  7. OUR SOLUTION
     • Deprecation of the old connector
     • Complete rewrite using the DataSource API V2
     • Official (enterprise) Neo4j support through Larus
     • Dedicated team
     • Open source
     • Comprehensive documentation

  8. DOCUMENTATION AND EXAMPLES
     • Lack of.
     • No official documentation on the DataSource V2 API
     • Examples on the web were superficial

  9. WHERE ARE WE NOW?
     • Spark 2.4 is supported
     • Spark 3.0 is pre-released, and an official release will happen in March
     • Spark 2.3 is on hold

  10. VERSION FRAGMENTATION
     • Spark 2.4 supports Scala 2.11 and Scala 2.12
     • Spark 3.0 supports Scala 2.12 and Scala 2.13, and removed support for Scala 2.11
     • (but Spark 3.0 for Scala 2.13 is not released yet)
     • We need to release a JAR for each combination, e.g. neo4j-connector-apache-spark_2.12_3.0-4.0.0.jar

  11. TABLES VS LABELS
     (:Person)-[:ACTED_IN]->(:Movie)
     We had to find a way to map graph entities into table columns:

     Person.id | Person.name
     1         | Keanu Reeves
     2         | Tom Hanks

     ACTED_IN.source | ACTED_IN.target
     1               | 3
     2               | 4

     Movie.id | Movie.title
     3        | The Matrix
     4        | Cloud Atlas

  12. TABLES VS LABELS
     Bi-directional mapping (READ and WRITE) allows reading from and writing to Neo4j, using the same Person, ACTED_IN, and Movie tables shown above.

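As a minimal round-trip sketch of that bi-directional mapping, reusing the labels option shown in the READ DATA / WRITE DATA slides that follow (connection options such as the Neo4j URL and credentials are omitted here, just as they are in the slides):

   import spark.implicits._

   // The Person table from the slide as a DataFrame
   val personDf = Seq((1L, "Keanu Reeves"), (2L, "Tom Hanks")).toDF("id", "name")

   // WRITE: each row becomes a :Person node whose properties are the row's columns
   personDf.write
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .save()

   // READ: the :Person nodes come back as a table, one column per property
   val readBack = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .load()
   readBack.show()
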
  13. SCHEMA VS. SCHEMALESS
     • Result flattening

     Schema: name String, age Long, location String

     Result:
     name       | age | location
     "John Doe" | 33  | "Milan"
     "Jane Doe" | 24  | "Rome"

  14. SCHEMA VS. SCHEMALESS
     • Result flattening
     • A property might not exist in every record

     Schema: name String, age Long, location String

     Result:
     name       | age | location
     "John Doe" | 33  | "Milan"
     "Jane Doe" | 24  | null

  15. SCHEMA VS. SCHEMALESS
     • Result flattening
     • A property might not exist in every record
     • A property might not have type consistency

     Schema: name String, age String*, location String
     * no matter the types involved

     Result:
     name       | age  | location
     "John Doe" | "33" | "Milan"
     "Jane Doe" | "24" | "Rome"

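A hedged sketch of how this surfaces on the Spark side: a single schema is inferred for the whole result, missing properties arrive as null, and when a property's type is inconsistent the column falls back to string. Casting it back is an ordinary Spark workaround, not something the connector does for you:

   import org.apache.spark.sql.functions.col

   val people = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .load()

   // Inspect the schema the connector inferred (e.g. name: string, age: string, location: string)
   people.printSchema()

   // If `age` came back as a string because of inconsistent property types,
   // cast it explicitly; values that cannot be parsed become null
   val cleaned = people.withColumn("age", col("age").cast("long"))
   cleaned.show()
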
  16. PARTITIONING
     • We use SKIP / LIMIT
     • Be aware of deadlocks when writing

     Creating 5 partitions from a dataset of 100 nodes:
     MATCH (p:Person) RETURN p SKIP 0 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 20 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 40 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 60 LIMIT 20
     MATCH (p:Person) RETURN p SKIP 80 LIMIT 20

     https://neo4j.com/developer/spark/quickstart/#_partitioning

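As a sketch of how the partition count is chosen: the connector exposes a partitions option (the option name here follows the quickstart page linked above), and the reader then issues one SKIP/LIMIT query per partition, as in the example:

   // Ask the connector for 5 read partitions; each partition runs one of the
   // SKIP/LIMIT queries above and is processed by a separate Spark task
   val people = spark.read
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .option("partitions", "5")
     .load()
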
  17. READ DATA
     • Labels

     Scala:
     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("labels", ":Person:Admin")
       .load()

     Python:
     df = spark.read.format("org.neo4j.spark.DataSource") \
       .option("labels", ":Person:Admin") \
       .load()

     R:
     df <- read.df(source="org.neo4j.spark.DataSource", labels=":Person:Admin")

  18. READ DATA
     • Labels
     • Relationship

     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("relationship", "ACTED_IN")
       .option("relationship.source.labels", "Person")
       .option("relationship.target.labels", "Movie")
       .load()

  19. READ DATA
     • Labels
     • Relationship
     • Query

     val df = spark.read.format("org.neo4j.spark.DataSource")
       .option("query", "MATCH (n:Person) RETURN n.name, n.age")
       .load()

  20. WRITE DATA
     • Labels

     val bandDf = Seq(
       (1, "Alex Lifeson"),
       (2, "Neil Peart"),
       (3, "Geddy Lee")
     ).toDF("id", "name")

     bandDf.write
       .format("org.neo4j.spark.DataSource")
       .option("labels", ":Person:Musician")
       .save()

  21. WRITE DATA
     • Labels
     • Relationship

     val musicDf = Seq(
       (12, "John Bonham", "Drums"),
       (19, "John Mayer", "Guitar"),
       (32, "John Scofield", "Guitar"),
       (15, "John Butler", "Guitar")
     ).toDF("experience", "name", "instrument")

     musicDf.write
       .format("org.neo4j.spark.DataSource")
       .option("relationship", "PLAYS")
       .option("relationship.save.strategy", "keys")
       .option("relationship.source.labels", ":Musician")
       .option("relationship.source.node.keys", "name:name")
       .option("relationship.target.labels", ":Instrument")
       .option("relationship.target.node.keys", "instrument:name")
       .save()

  22. WRITE DATA
     • Labels
     • Relationship
     • Query

     val theTeam = Seq(
       ("David", "Allen"),
       ("Andrea", "Santurbano"),
       ("Davide", "Fantuzzi")
     ).toDF("name", "lastname")

     theTeam.write
       .format("org.neo4j.spark.DataSource")
       .option(
         "query",
         "CREATE (n:Person) " +
         "SET n.fullName = event.name + event.lastname"
       )
       .save()

     This will generate a query like:
     UNWIND $events AS event
     CREATE (n:Person)
     SET n.fullName = event.name + event.lastname

  23. FEATURES
     • Push Down Filters

     val df = spark.read
       .format("org.neo4j.spark.DataSource")
       .option("labels", "Movie")
       .load()

     df.where("title LIKE 'Matrix%'").show()

  24. FEATURES
     • Push Down Filters
     • Push Down Columns

     val df = spark.read
       .format("org.neo4j.spark.DataSource")
       .option("labels", "Movie")
       .load()

     df.select("title").show()

  25. FEATURES
     • Push Down Filters
     • Push Down Columns
     • Official Neo4j Driver [link]
     • CypherDSL [link]

  26. FEATURES
     • Push Down Filters
     • Push Down Columns
     • Official Neo4j Driver [link]
     • CypherDSL [link]
     • GraphX / GraphFrames are not used

  27. COMMON USE CASES
     • Data Source Integration
       ◦ Connect any file format or database supported by Spark to Neo4j
     • Extraction, Transformation, and Load (ETL), bi-directionally (see the sketch below)
       ◦ Bulk insert for new databases
       ◦ Ongoing nightly jobs
     • Graph-driven Machine Learning
       ◦ Use Spark to bring Graph Data Science into existing pipelines

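A hedged end-to-end ETL sketch along these lines; the CSV path and column names are placeholders, and connection options are omitted as in the earlier slides:

   // Extract: read from any Spark-supported source, here a CSV file
   val rawPeople = spark.read
     .option("header", "true")
     .csv("/data/people.csv")

   // Transform: ordinary DataFrame operations
   val people = rawPeople
     .select("id", "name")
     .dropDuplicates("id")

   // Load: write the cleaned rows into Neo4j as :Person nodes
   people.write
     .format("org.neo4j.spark.DataSource")
     .option("labels", ":Person")
     .save()
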
  28. USEFUL LINKS
     • GitHub: https://github.com/neo4j-contrib/neo4j-spark-connector
     • Documentation: https://neo4j.com/developer/spark/
     • Notebook for playing around with the connector: https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
     • Article on Towards Data Science: https://towardsdatascience.com/using-neo4j-with-pyspark-on-databricks-eb3d127f2245