Spark_Declarative_Pipelines_Devoxx_2026.pdf

Frank Munz

April 30, 2026

Transcript

  1. About me • I am Frank Munz • Principal @ Databricks: Data and AI • My past: large scale data & compute • Based in 🍻 ⛰ 🥨 • Built up AWS Tech Evangelism in Central Europe • SW architect, data scientist, published author etc.
  2. Spark SQL and DataFrame API
     # PYTHON: read multiple JSON files with multiple datasets per file
     cust = spark.read.json('/data/customers/')
     cust.filter(cust['car_make'] == "Audi").show()
     cust.createOrReplaceTempView("customers")
     %sql
     select * from customers where car_make = "Audi"
     -- PYTHON:
     -- spark.sql('select * from customers where car_make = "Audi"').show()
  3. Reading from Files vs Reading from Streams
     # read file(s)
     cust = spark.read.format("json").load('/data/customers/')
     # read from a messaging platform: kafka, kinesis, socket, ...
     sales = spark.readStream.format("kafka").load()
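     A slightly fuller sketch of the streaming side, assuming a Kafka source; the broker address and topic name are hypothetical placeholders, while the options themselves are standard Structured Streaming Kafka source options:
     sales = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
         .option("subscribe", "sales_events")               # hypothetical topic
         .option("startingOffsets", "earliest")
         .load())
     # Kafka delivers binary key/value columns; cast the payload to a string before parsing it
     sales_json = sales.selectExpr("CAST(value AS STRING) AS payload")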
  4. Data Engineering -> Data Pipelines
     [Architecture diagram: the Data Intelligence Platform, with data ingestion from data sources (cloud storage, message queues, databases, enterprise applications), data transformation, and orchestration running on serverless compute, unified governance, and reliable storage, powered by the Data Intelligence Engine; use cases include SQL analytics & BI, AI/ML apps, real-time apps, data sharing, data warehousing, machine learning, and real-time processing.]
  5. What if we want to run streams continuously?
     [Diagram: a streaming query continuously ingests into the raw_orders table, a second streaming query continuously reads it into the fact_orders table, and a batch query refreshes the customers table every 15 minutes.]
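     Wired up by hand in Structured Streaming, the two always-on queries would look roughly like this sketch (table names come from the slide; the source format, schema, and checkpoint paths are hypothetical):
     # continuously running ingestion query writing into raw_orders
     (spark.readStream
         .format("json")
         .schema("order_id STRING, amount DOUBLE, ts TIMESTAMP")   # hypothetical schema
         .load("/landing/orders/")                                  # hypothetical path
         .writeStream
         .option("checkpointLocation", "/chk/raw_orders")           # hypothetical path
         .toTable("raw_orders"))

     # second continuously running query reading raw_orders into fact_orders
     (spark.readStream.table("raw_orders")
         .writeStream
         .option("checkpointLocation", "/chk/fact_orders")          # hypothetical path
         .toTable("fact_orders"))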
  6. Problem: first pipeline run • raw_orders streaming query starts • fact_orders streaming query starts • fact_orders streaming query tries to read from the raw_orders table • raw_orders table doesn't exist yet -> ⚠ error ⚠
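     The race is easy to see when the queries are started by hand: the downstream readStream resolves its source table at start-up, so it only succeeds if the upstream query has already created raw_orders. A sketch of the manual ordering that a declarative pipeline makes unnecessary (the rate source is a stand-in and the checkpoint paths are hypothetical):
     import time

     upstream = (spark.readStream
         .format("rate").load()                              # stand-in source for the sketch
         .writeStream
         .option("checkpointLocation", "/chk/raw_orders")    # hypothetical path
         .toTable("raw_orders"))

     # wait until the upstream query has materialized the table before starting downstream
     while not spark.catalog.tableExists("raw_orders"):
         time.sleep(1)

     downstream = (spark.readStream.table("raw_orders")
         .writeStream
         .option("checkpointLocation", "/chk/fact_orders")   # hypothetical path
         .toTable("fact_orders"))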
  7. What if we make the whole pipeline declarative? Build plan -> execute it
  8. Streaming Table (ST) • What is it? A continuously updated, append-only table with exactly-once ingestion • Use it for: real-time, incremental data ingestion and low-latency streaming transformations
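     A minimal sketch of how such a streaming table can be declared in a pipeline's Python transformation file. It assumes the pyspark.pipelines module and its table decorator (the module alias and decorator name are assumptions based on the SDP Python API), the schema and source path are hypothetical, and spark refers to the session the pipeline runtime provides:
     from pyspark import pipelines as dp                        # assumed SDP module alias

     @dp.table(comment="Raw orders, ingested incrementally")    # assumed decorator
     def orders_raw():
         # a streaming read makes this a streaming table: append-only, exactly-once ingestion
         return (spark.readStream
             .format("json")
             .schema("order_id STRING, customer_id STRING, amount DOUBLE")  # hypothetical
             .load("/landing/orders/"))                                      # hypothetical path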
  9. Materialized View (MV) • What is it? A precomputed, persistent table based on a query, (possibly) incrementally computed • Use it for: efficient aggregation on static or periodically updated data
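     A matching sketch for a materialized view, under the same assumptions about the pyspark.pipelines decorator API (decorator name assumed; the upstream table and column names are hypothetical):
     from pyspark import pipelines as dp                        # assumed SDP module alias
     from pyspark.sql import functions as F

     @dp.materialized_view(comment="Revenue per customer")      # assumed decorator
     def customer_revenue():
         # batch read over an upstream table; the engine recomputes or incrementally updates the result
         return (spark.read.table("orders_clean")               # hypothetical upstream table
             .groupBy("customer_id")
             .agg(F.sum("amount").alias("total_revenue")))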
  10. Declarative Pipeline with STs and MVs: putting things together
      [Medallion diagram: Bronze layer with the customers_raw and orders_raw streaming tables, Silver layer with the customers_clean and orders_clean streaming tables, Gold layer with the customer_orders materialized view.]
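     How the layers might be wired together in one transformation file, again assuming the pyspark.pipelines decorators and that upstream datasets are referenced via spark.read.table / spark.readStream.table (all assumptions); the join key and column names are hypothetical:
     from pyspark import pipelines as dp               # assumed SDP module alias
     from pyspark.sql import functions as F

     @dp.table                                         # assumed decorator: silver streaming table
     def orders_clean():
         # incrementally reads the bronze streaming table and filters out bad records
         return spark.readStream.table("orders_raw").where(F.col("amount") > 0)

     @dp.materialized_view                             # assumed decorator: gold materialized view
     def customer_orders():
         # joins the two silver tables; the pipeline infers the dependency graph from these reads
         return (spark.read.table("orders_clean")
             .join(spark.read.table("customers_clean"), "customer_id")   # hypothetical join key
             .groupBy("customer_id")
             .agg(F.count("*").alias("order_count"),
                  F.sum("amount").alias("total_amount")))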
  11. ⚙ Spark Declarative Pipelines • Transformations in Python or SQL files • Initialize via spark-pipelines init --name orders-pl • Execute with spark-pipelines [dry-]run • SDP automatically handles pre-validation, dependencies, retries & parallelization • Triggered/batch and continuous/streaming mode
  12. Getting Started with Spark Declarative Pipelines: OSS + Lakeflow • Brand new: OSS and Lakeflow tutorials (real-time flight data) • Create a Databricks Free Edition account (free forever) and add the avionics SDP demo to your portfolio on LinkedIn
  13. Lakeflow Spark Declarative Pipelines • Simplified pipeline development: efficiently clean, transform and join data; build batch and streaming pipelines with a declarative approach • Reliable production infrastructure: automated pipeline configuration reduces the maintenance burden • Built on an open standard: fully compatible with the open source Spark Declarative Pipelines
  14. Genie Code: an autonomous AI agent for Data Science, Engineering and Analytics • Natural language prompts to create, debug, explain and document • Creates, manages and debugs data pipelines • Genie Code Agent skills for Claude Code • Can be extended with Skills.md and MCP servers
  15. Updated architecture, same interface • We extended the microbatch architecture by adding concurrent stages, streaming shuffle, and continuous data flow • Same interface, only a single line of change!
  16. Acknowledgement • Michael Armbrust & Sandy Ryza for some of the intro slides
  17. Convert Messy Sales Data to AI insights (databricks.com/demos) • Use Lakeflow Connect, Jobs and SDP to create a marketing solution with AI for a global food corporation • Video walkthrough with GitHub repo + DABs
  18. Get to know Genie Code • Genie Code data engineering video demo: use Genie Code to create a complete SDP pipeline from a prompt, with Auto Loader, JSON ingestion, and medallion architecture • Genie Code Step by Step Guide