
The Spark of Neo4j


Apache Spark is a powerful distributed data-processing framework whose API makes it possible to build connectors that read from and write to any type of data source. In this session we'll look at the challenges we faced while developing the official Neo4j Connector for Apache Spark, which enables easy bidirectional communication between these two tools, and we'll also learn how to combine Neo4j's potential with the distributed processing capability of Apache Spark.

Davide Fantuzzi

March 16, 2021

Transcript

  1. Davide Fantuzzi - Data Engineer The Spark of Neo4j

  2. WELCOME Davide Fantuzzi Data Engineer @ LARUS Business Automation

    @utnaf /in/davidefantuzzi/
  3. ABOUT LARUS Founded in 2004 HQ: Venice Offices: Pescara, Rome,

    Milan Global services International projects A team of certified Data Engineers, Data Architects, Data Scientists, and Big Data experts We help companies become insight-driven organizations Leader in the development of data-driven applications based on NoSQL & Event Streaming technologies.
  4. LARUS: OUR SPECIALTIES Big Data Platform Design & Development (Java,

    Scala, Python, Javascript) Data Engineering Graph Data Visualization Data Science Strategic Advisory for Data-Driven Transformation Projects Machine Learning and AI based on graph technology
  5. LARUS NEO4J

  6. LARUS: OUR PARTNERS

  7. 7 AGENDA 1. Spark & Neo4j 2. Challenges 3. Neo4j

    Connector for Apache Spark 4. Demo
  8. 8 SPARK & NEO4J

  9. WHAT IS APACHE SPARK? • Analytics engine for large-scale data

    processing • Cluster of worker nodes which partition operations and execute in parallel • Supports SQL & Streaming • Oriented around DataFrames which for our purposes are effectively tables • Is Polyglot
  10. WHEN TO USE SPARK? • Very large datasets that have

    to be broken into pieces • Complex pipelines with many sources & transformations • Great for iterative algorithms (map, reduce, filter, sort)
  11. OUR GOAL • Avoid custom "hacky" solutions • Continuous development

    • Quick response to issues • Leverage DataSource V2 APIs in order to be fully Spark Compliant ◦ Polyglot ◦ ETL ◦ Graph-Driven Machine Learning
  12. OUR SOLUTION • Deprecation of the old connector • Complete

    rewrite using DataSource API V2 • Official (enterprise) Neo4j support through Larus • Dedicated Team • Open-source • Comprehensive Documentation
  13. 13 CHALLENGES

  14. DOCUMENTATION AND EXAMPLES • Lack of. • No official documentation

    on the DataSource V2 API • Examples on the web were superficial
  15. BREAKING CHANGES Spark 2.3 → Spark 2.4 → Spark 3

    Breaking changes were introduced at each step
  16. WHICH VERSION TO START WITH? Spark 2.3 / Spark 2.4 / Spark

    3.0: we picked Spark 2.4
  17. None
  18. WHICH VERSION TO START WITH?

  19. WHERE ARE WE NOW? • Spark 2.4 is Supported •

    Spark 3.0 is pre-released and an official release will happen in March • Spark 2.3 is on hold
  20. VERSION FRAGMENTATION • Spark 2.4 supports Scala 2.11 and Scala

    2.12 • Spark 3.0 supports Scala 2.12 and Scala 2.13, and dropped support for Scala 2.11 • (but Spark 3.0 for Scala 2.13 is not released yet) • We need to release a JAR for each combination: neo4j-connector-apache-spark_2.12_3.0-4.0.0.jar
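As an illustrative plain-Python sketch (not connector code), the release matrix that version fragmentation forces can be enumerated mechanically, one JAR per supported (Scala, Spark) combination, following the naming pattern shown on the slide:

```python
# Illustrative sketch: one artifact name per (Scala version, Spark version)
# pair, following the naming pattern shown on the slide.
def artifact_names(connector_version, combos):
    """Build one JAR file name per (Scala version, Spark version) pair."""
    return [
        f"neo4j-connector-apache-spark_{scala}_{spark}-{connector_version}.jar"
        for scala, spark in combos
    ]

for name in artifact_names("4.0.0", [("2.11", "2.4"), ("2.12", "2.4"), ("2.12", "3.0")]):
    print(name)
```

Every new Scala or Spark release adds a row to this matrix, which is why the next slide reaches for Maven modules to manage the builds.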
  21. VERSION FRAGMENTATION Maven modules to the rescue!

  22. TABLES VS LABELS (:Person)-[:ACTED_IN]->(:Movie) We had to find

    a way to map graph entities into table columns:

    Person.id | Person.name
    1         | Keanu Reeves
    2         | Tom Hanks

    ACTED_IN.source | ACTED_IN.target
    1               | 3
    2               | 4

    Movie.id | Movie.title
    3        | The Matrix
    4        | Cloud Atlas
  23. TABLES VS LABELS Bi-directional mapping allows reading from and

    writing to Neo4j: the same Person, ACTED_IN, and Movie tables are used for both READ and WRITE
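The graph-to-table mapping on these slides can be sketched in plain Python (no Spark, and not the connector's actual code): each node label becomes a "table", and each relationship type becomes a table of source/target ids that can be joined back into pairs. The data and function name here are illustrative.

```python
# Node "tables": one list of dicts per label.
people = [{"id": 1, "name": "Keanu Reeves"}, {"id": 2, "name": "Tom Hanks"}]
movies = [{"id": 3, "title": "The Matrix"}, {"id": 4, "title": "Cloud Atlas"}]
# Relationship "table": source/target ids, like ACTED_IN on the slide.
acted_in = [{"source": 1, "target": 3}, {"source": 2, "target": 4}]

def join_acted_in(people, movies, acted_in):
    """Resolve the relationship rows back into (person, movie) pairs."""
    by_person = {p["id"]: p for p in people}
    by_movie = {m["id"]: m for m in movies}
    return [(by_person[r["source"]]["name"], by_movie[r["target"]]["title"])
            for r in acted_in]

print(join_acted_in(people, movies, acted_in))
# → [('Keanu Reeves', 'The Matrix'), ('Tom Hanks', 'Cloud Atlas')]
```

Because the mapping is just ids and columns in both directions, the same representation works for reads and writes.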
  24. SCHEMA VS. SCHEMALESS • Result flattening

    Schema: name String, age Long, location String

    Result:
    name       | age | location
    "John Doe" | 33  | "Milan"
    "Jane Doe" | 24  | "Rome"
  25. SCHEMA VS. SCHEMALESS • Result flattening • Property might not

    exist in every record

    Schema: name String, age Long, location String

    Result:
    name       | age | location
    "John Doe" | 33  | "Milan"
    "Jane Doe" | 24  | null
  26. SCHEMA VS. SCHEMALESS • Result flattening • Property might not

    exist in every record • Property might not have type consistency

    Schema: name String, age String*, location String

    Result:
    name       | age  | location
    "John Doe" | "33" | "Milan"
    "Jane Doe" | "24" | "Rome"

    * no matter the types involved
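The schema-inference problem on these three slides can be sketched in plain Python (illustrative only, not the connector's actual algorithm): sample the records, collect the set of types seen for each property, treat a missing property as null, and fall back to String whenever a property's types disagree.

```python
# Illustrative sketch of schema inference over schemaless records:
# collect the Python type names seen per property, ignoring nulls,
# and degrade to string whenever a property was seen with mixed types.
def infer_schema(records):
    types = {}
    for rec in records:
        for key, value in rec.items():
            if value is not None:  # missing/null values don't vote on the type
                types.setdefault(key, set()).add(type(value).__name__)
    return {k: (v.pop() if len(v) == 1 else "str") for k, v in types.items()}

records = [
    {"name": "John Doe", "age": 33, "location": "Milan"},
    {"name": "Jane Doe", "age": "24", "location": None},  # age stored as a string!
]
print(infer_schema(records))  # age collapses to 'str' because 33 and "24" disagree
```

This is the "no matter the types involved" footnote in miniature: one inconsistent record is enough to widen the whole column to String.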
  27. PARTITIONING • We use SKIP / LIMIT • Be aware

    of deadlocks when writing

    Creating 5 partitions from a 100-node dataset:

    MATCH (p:Person) RETURN p SKIP 0 LIMIT 20
    MATCH (p:Person) RETURN p SKIP 20 LIMIT 20
    MATCH (p:Person) RETURN p SKIP 40 LIMIT 20
    MATCH (p:Person) RETURN p SKIP 60 LIMIT 20
    MATCH (p:Person) RETURN p SKIP 80 LIMIT 20

    https://neo4j.com/developer/spark/quickstart/#_partitioning
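The SKIP / LIMIT scheme on this slide can be sketched in a few lines of plain Python (illustrative string-building; the real connector generates these queries internally): split the node count into fixed-size pages and emit one Cypher query per Spark partition.

```python
import math

# Illustrative sketch: one paged Cypher query per Spark partition,
# mirroring the SKIP/LIMIT example on the slide.
def partition_queries(total, partitions):
    """Split `total` rows into `partitions` pages of Cypher queries."""
    page = math.ceil(total / partitions)
    return [f"MATCH (p:Person) RETURN p SKIP {i * page} LIMIT {page}"
            for i in range(partitions)]

for q in partition_queries(100, 5):
    print(q)
# 5 queries: SKIP 0, 20, 40, 60, 80, each with LIMIT 20
```

Each partition runs its query independently, which is exactly why concurrent writes from many partitions can deadlock on the same nodes.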
  28. 28 NEO4J CONNECTOR FOR APACHE SPARK

  29. READ DATA • Labels

    Scala:
    val df = spark.read.format("org.neo4j.spark.DataSource")
      .option("labels", ":Person:Admin")
      .load()

    Python:
    df = spark.read.format("org.neo4j.spark.DataSource") \
      .option("labels", ":Person:Admin") \
      .load()

    R:
    df <- read.df(source = "org.neo4j.spark.DataSource", labels = ":Person:Admin")
  30. READ DATA • Labels • Relationship

    val df = spark.read.format("org.neo4j.spark.DataSource")
      .option("relationship", "ACTED_IN")
      .option("relationship.source.labels", "Person")
      .option("relationship.target.labels", "Movie")
      .load()
  31. READ DATA • Labels • Relationship • Query

    val df = spark.read.format("org.neo4j.spark.DataSource")
      .option("query", "MATCH (n:Person) RETURN n.name, n.age")
      .load()
  32. WRITE DATA • Labels

    val bandDf = Seq(
      (1, "Alex Lifeson"),
      (2, "Neil Peart"),
      (3, "Geddy Lee")
    ).toDF("id", "name")

    bandDf.write
      .format("org.neo4j.spark.DataSource")
      .option("labels", ":Person:Musician")
      .save()
  33. WRITE DATA • Labels • Relationship

    val musicDf = Seq(
      (12, "John Bonham", "Drums"),
      (19, "John Mayer", "Guitar"),
      (32, "John Scofield", "Guitar"),
      (15, "John Butler", "Guitar")
    ).toDF("experience", "name", "instrument")

    musicDf.write
      .format("org.neo4j.spark.DataSource")
      .option("relationship", "PLAYS")
      .option("relationship.save.strategy", "keys")
      .option("relationship.source.labels", ":Musician")
      .option("relationship.source.node.keys", "name:name")
      .option("relationship.target.labels", ":Instrument")
      .option("relationship.target.node.keys", "instrument:name")
      .save()
  34. WRITE DATA • Labels • Relationship • Query

    val theTeam = Seq(
      ("David", "Allen"),
      ("Andrea", "Santurbano"),
      ("Davide", "Fantuzzi")
    ).toDF("name", "lastname")

    theTeam.write
      .format("org.neo4j.spark.DataSource")
      .option(
        "query",
        "CREATE (n:Person) " +
        "SET n.fullName = event.name + event.lastname"
      )
      .save()

    This will generate a query like:

    UNWIND $events AS event
    CREATE (n:Person) SET n.fullName = event.name + event.lastname
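The query rewriting described on this slide can be sketched in one line of plain Python (illustrative string-building, not connector code): the DataFrame rows are sent as a $events parameter, and the user's Cypher is prefixed with an UNWIND so it runs once per row.

```python
# Illustrative sketch: the user's write query is wrapped so it executes
# once per incoming row, with each row bound to `event`.
def wrap_write_query(user_query):
    """Prefix a user Cypher query with the batched UNWIND clause."""
    return "UNWIND $events AS event " + user_query

q = wrap_write_query("CREATE (n:Person) SET n.fullName = event.name + event.lastname")
print(q)
```

Batching rows through a single UNWIND query is much cheaper than issuing one CREATE round-trip per row.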
  35. FEATURES • Push Down Filters

    val df = spark.read
      .format("org.neo4j.spark.DataSource")
      .option("labels", "Movie")
      .load()

    df.where("title LIKE 'Matrix%'").show()
  36. FEATURES • Push Down Filters • Push Down Columns

    val df = spark.read
      .format("org.neo4j.spark.DataSource")
      .option("labels", "Movie")
      .load()

    df.select("title").show()
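What filter and column pushdown buy you can be sketched in plain Python (illustrative only, not the connector's query builder): instead of fetching whole nodes and filtering in Spark, the predicate and the projection are folded into the Cypher sent to Neo4j.

```python
# Illustrative sketch: fold a Spark-side prefix filter and column projection
# into the Cypher query, so Neo4j returns only the matching data.
def pushdown_query(label, columns=None, prefix_filter=None):
    """Build a Cypher query with an optional prefix filter and projection."""
    match = f"MATCH (n:{label})"
    where = ""
    if prefix_filter:
        prop, prefix = prefix_filter
        where = f" WHERE n.{prop} STARTS WITH '{prefix}'"
    ret = ", ".join(f"n.{c}" for c in columns) if columns else "n"
    return f"{match}{where} RETURN {ret}"

print(pushdown_query("Movie", columns=["title"], prefix_filter=("title", "Matrix")))
# MATCH (n:Movie) WHERE n.title STARTS WITH 'Matrix' RETURN n.title
```

The difference is purely in how much data crosses the wire: without pushdown, every Movie node with every property would be shipped to Spark first.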
  37. FEATURES • Push Down Filters • Push Down Columns •

    Official Neo4j Driver [link]
  38. FEATURES • Push Down Filters • Push Down Columns •

    Official Neo4j Driver [link] • CypherDSL [link]
  39. FEATURES • Push Down Filters • Push Down Columns •

    Official Neo4j Driver [link] • CypherDSL [link] • GraphX / GraphFrames are not used
  40. COMMON USE CASES • Data Source Integration ◦ Connect any

    file format or database supported by Spark to Neo4j • Extraction, Transformation, and Load (ETL), bi-directionally ◦ Bulk insert for new databases ◦ Ongoing nightly jobs • Graph-driven Machine Learning ◦ Use Spark to bring Graph Data Science into existing pipelines
  41. 41 DEMO

  42. WHAT’S NEXT? • Streaming API • Adding R language tests

    • Full support for Spark 3
  43. USEFUL LINKS • GitHub https://github.com/neo4j-contrib/neo4j-spark-connector • Documentation https://neo4j.com/developer/spark/ • Notebook

    for playing around with the connector https://github.com/utnaf/neo4j-connector-apache-spark-notebooks • Article on Towards Data Science https://towardsdatascience.com/using-neo4j-with-pyspark-on-databricks-eb3d127f2245
  44. THANKS FOR YOUR ATTENTION Davide Fantuzzi Data Engineer @utnaf