Spark_Declarative_Pipelines_Devoxx_2026.pdf

Frank Munz

April 30, 2026

Transcript

  1. About me • I am Frank Munz • Principal @ Databricks: Data and AI • My past: large scale data & compute • Based in 🍻 ⛰ 🥨 • Built up AWS Tech Evangelism in Central Europe • SW architect, data scientist, published author etc.
  2. Spark SQL and DataFrame API
     # PYTHON: read multiple JSON files with multiple datasets per file
     cust = spark.read.json('/data/customers/')
     cust.filter(cust['car_make'] == "Audi").show()
     cust.createOrReplaceTempView("customers")
     %sql
     select * from customers where car_make = "Audi"
     -- PYTHON:
     -- spark.sql('select * from customers where car_make = "Audi"').show()
  3. Reading from Files vs Reading from Streams
     # read file(s)
     cust = spark.read.format("json").load('/data/customers/')
     # read from a messaging platform: kafka, kinesis, socket, ...
     sales = spark.readStream.format("kafka").load()
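     A slightly fuller sketch of the streaming side, assuming a Kafka source; the broker address and topic name are hypothetical placeholders, while the options themselves are standard Structured Streaming Kafka source options:
     sales = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
         .option("subscribe", "sales_events")               # hypothetical topic
         .option("startingOffsets", "earliest")
         .load())
     # Kafka delivers binary key/value columns; cast the payload to a string before parsing it
     sales_json = sales.selectExpr("CAST(value AS STRING) AS payload")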
  4. Data Engineering -> Data Pipelines
     [Architecture diagram: the Data Intelligence Platform, with data ingestion from data sources (cloud storage, message queues, databases, enterprise applications), data transformation, and orchestration running on serverless compute, unified governance, and reliable storage, powered by the Data Intelligence Engine; use cases include SQL analytics & BI, AI/ML apps, real-time apps, data sharing, data warehousing, machine learning, and real-time processing.]
  5. What if we want to run streams continuously?
     [Diagram: a streaming query continuously ingests into the raw_orders table, a second streaming query continuously reads it into the fact_orders table, and a batch query refreshes the customers table every 15 minutes.]
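     Wired up by hand in Structured Streaming, the two always-on queries would look roughly like this sketch (table names come from the slide; the source format, schema, and checkpoint paths are hypothetical):
     # continuously running ingestion query writing into raw_orders
     (spark.readStream
         .format("json")
         .schema("order_id STRING, amount DOUBLE, ts TIMESTAMP")   # hypothetical schema
         .load("/landing/orders/")                                  # hypothetical path
         .writeStream
         .option("checkpointLocation", "/chk/raw_orders")           # hypothetical path
         .toTable("raw_orders"))

     # second continuously running query reading raw_orders into fact_orders
     (spark.readStream.table("raw_orders")
         .writeStream
         .option("checkpointLocation", "/chk/fact_orders")          # hypothetical path
         .toTable("fact_orders"))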
  6. Problem: first pipeline run • raw_orders streaming query starts • fact_orders streaming query starts • fact_orders streaming query tries to read from the raw_orders table • raw_orders table doesn't exist yet -> ⚠ error ⚠
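     The race is easy to see when the queries are started by hand: the downstream readStream resolves its source table at start-up, so it only succeeds if the upstream query has already created raw_orders. A sketch of the manual ordering that a declarative pipeline makes unnecessary (the rate source is a stand-in and the checkpoint paths are hypothetical):
     import time

     upstream = (spark.readStream
         .format("rate").load()                              # stand-in source for the sketch
         .writeStream
         .option("checkpointLocation", "/chk/raw_orders")    # hypothetical path
         .toTable("raw_orders"))

     # wait until the upstream query has materialized the table before starting downstream
     while not spark.catalog.tableExists("raw_orders"):
         time.sleep(1)

     downstream = (spark.readStream.table("raw_orders")
         .writeStream
         .option("checkpointLocation", "/chk/fact_orders")   # hypothetical path
         .toTable("fact_orders"))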
  7. What if we make the whole pipeline declarative? Build plan -> execute it
  8. Streaming Table (ST) • What is it? A continuously updated, append-only table with exactly-once ingestion • Use it for: real-time, incremental data ingestion and low-latency streaming transformations
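     A minimal sketch of how such a streaming table can be declared in a pipeline's Python transformation file. It assumes the pyspark.pipelines module and its table decorator (the module alias and decorator name are assumptions based on the SDP Python API), the schema and source path are hypothetical, and spark refers to the session the pipeline runtime provides:
     from pyspark import pipelines as dp                        # assumed SDP module alias

     @dp.table(comment="Raw orders, ingested incrementally")    # assumed decorator
     def orders_raw():
         # a streaming read makes this a streaming table: append-only, exactly-once ingestion
         return (spark.readStream
             .format("json")
             .schema("order_id STRING, customer_id STRING, amount DOUBLE")  # hypothetical
             .load("/landing/orders/"))                                      # hypothetical path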
  9. Materialized View (MV) • What is it? A precomputed, persistent table based on a query, (possibly) incrementally computed • Use it for: efficient aggregation on static or periodically updated data
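     A matching sketch for a materialized view, under the same assumptions about the pyspark.pipelines decorator API (decorator name assumed; the upstream table and column names are hypothetical):
     from pyspark import pipelines as dp                        # assumed SDP module alias
     from pyspark.sql import functions as F

     @dp.materialized_view(comment="Revenue per customer")      # assumed decorator
     def customer_revenue():
         # batch read over an upstream table; the engine recomputes or incrementally updates the result
         return (spark.read.table("orders_clean")               # hypothetical upstream table
             .groupBy("customer_id")
             .agg(F.sum("amount").alias("total_revenue")))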
  10. Declarative Pipeline with STs and MVs: putting things together
      [Medallion diagram: Bronze layer with the customers_raw and orders_raw streaming tables, Silver layer with the customers_clean and orders_clean streaming tables, Gold layer with the customer_orders materialized view.]
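     How the layers might be wired together in one transformation file, again assuming the pyspark.pipelines decorators and that upstream datasets are referenced via spark.read.table / spark.readStream.table (all assumptions); the join key and column names are hypothetical:
     from pyspark import pipelines as dp               # assumed SDP module alias
     from pyspark.sql import functions as F

     @dp.table                                         # assumed decorator: silver streaming table
     def orders_clean():
         # incrementally reads the bronze streaming table and filters out bad records
         return spark.readStream.table("orders_raw").where(F.col("amount") > 0)

     @dp.materialized_view                             # assumed decorator: gold materialized view
     def customer_orders():
         # joins the two silver tables; the pipeline infers the dependency graph from these reads
         return (spark.read.table("orders_clean")
             .join(spark.read.table("customers_clean"), "customer_id")   # hypothetical join key
             .groupBy("customer_id")
             .agg(F.count("*").alias("order_count"),
                  F.sum("amount").alias("total_amount")))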
  11. ⚙ Spark Declarative Pipelines • Transformations in Python or SQL files • Initialize via spark-pipelines init --name orders-pl • Execute with spark-pipelines [dry-]run • SDP automatically handles pre-validation, dependencies, retries & parallelization • Triggered/batch and continuous/streaming mode
  12. Getting Started with Spark Declarative Pipelines: OSS + Lakeflow • Brand new: OSS and Lakeflow tutorials (real-time flight data) • Create a Databricks Free Edition account (free forever) and add the avionics SDP demo to your portfolio on LinkedIn
  13. Lakeflow Spark Declarative Pipelines • Simplified pipeline development: efficiently clean, transform and join data; build batch and streaming pipelines with a declarative approach • Reliable production infrastructure: automated pipeline configuration reduces the maintenance burden • Built on an open standard: fully compatible with the open source Spark Declarative Pipelines
  14. Genie Code: an autonomous AI agent for Data Science, Engineering and Analytics • Natural language prompts to create, debug, explain and document • Creates, manages and debugs data pipelines • Genie Code Agent skills for Claude Code • Can be extended with Skills.md and MCP servers
  15. Updated architecture, same interface • We extended the microbatch architecture by adding concurrent stages, streaming shuffle, and continuous data flow • Same interface, only a single line of change!
  16. Acknowledgement • Michael Armbrust & Sandy Ryza for some of the intro slides
  17. Convert Messy Sales Data to AI insights (databricks.com/demos) • Use Lakeflow Connect, Jobs and SDP to create a marketing solution with AI for a global food corporation • Video walkthrough with GitHub repo + DABs
  18. Get to know Genie Code • Genie Code data engineering video demo: use Genie Code to create a complete SDP pipeline from a prompt, with Auto Loader, JSON ingestion, and medallion architecture • Genie Code Step by Step Guide