Slide 1

Slide 1 text

©2022 Databricks Inc. — All rights reserved Streaming Data on the Lakehouse - a talk for everyone who ❤ data Devoxx Greece 2023 Frank Munz, Ph.D. May 2023

Slide 2

Slide 2 text

©2022 Databricks Inc. — All rights reserved About me • Principal @Databricks. Data, Analytics and AI products • All things large-scale data & compute • Based in Munich, 🍻 ⛰ 🥨 • Twitter: @frankmunz • Formerly AWS Tech Evangelist, SW architect, data scientist, published author, etc.

Slide 3

Slide 3 text

©2022 Databricks Inc. — All rights reserved 3 What do you love most about Apache Spark?

Slide 4

Slide 4 text

©2022 Databricks Inc. — All rights reserved Spark SQL and DataFrame API

# PYTHON: read multiple JSON files with multiple datasets per file
cust = spark.read.json('/data/customers/')
# cust.createOrReplaceTempView("customers")
cust.filter(cust['car_make'] == "Audi").show()

%sql
select * from customers where car_make = "Audi"

-- PYTHON equivalent:
-- spark.sql('select * from customers where car_make = "Audi"').show()

Slide 5

Slide 5 text

©2022 Databricks Inc. — All rights reserved Automatic Parallelization and Scale

Slide 6

Slide 6 text

©2022 Databricks Inc. — All rights reserved Reading from Files vs Reading from Streams

# read file(s)
cust = spark.read.format("json").load('/data/customers/')

# read from a messaging platform
sales = spark.readStream.format("kafka|kinesis|socket").load()

Slide 7

Slide 7 text

©2022 Databricks Inc. — All rights reserved Stream Processing is continuous and unbounded; Traditional Processing is one-off and bounded. [Diagram: Data Source → Processing, run once in traditional processing and continuously in stream processing]

Slide 8

Slide 8 text

©2022 Databricks Inc. — All rights reserved Technical Advantages
• A more intuitive way of capturing and processing continuous and unbounded data
• Lower latency for time-sensitive applications and use cases
• Better fault tolerance through checkpointing (see the sketch below)
• Better compute utilization and scalability through continuous and incremental processing
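To make the checkpointing and incremental-processing points concrete, here is a minimal sketch. The table names and paths are hypothetical, not from the talk: the checkpoint records how far the stream has progressed, so a restarted query resumes where it left off instead of reprocessing everything.

    # Minimal checkpointing sketch; "events", "events_clean" and the paths are made-up examples.
    from pyspark.sql.functions import col

    query = (spark.readStream
             .format("delta")
             .table("events")                                    # incremental source: only new data is read
             .filter(col("amount") > 0)                          # any standard DataFrame transformation
             .writeStream
             .format("delta")
             .option("checkpointLocation", "/chk/events_clean")  # progress survives restarts, giving fault tolerance
             .toTable("events_clean"))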

Slide 9

Slide 9 text

©2022 Databricks Inc. — All rights reserved Business Benefits
• BI and SQL Analytics: fresher and faster insights → quicker and better business decisions
• Data Engineering: sooner availability of cleaned data → more business use cases
• Data Science and ML: more frequent model updates and inference → better model efficacy
• Event-Driven Applications: faster customized response and action → better, differentiated customer experience

Slide 10

Slide 10 text

©2022 Databricks Inc. — All rights reserved 10 Streaming Misconceptions

Slide 11

Slide 11 text

©2022 Databricks Inc. — All rights reserved Misconception #1: Stream processing is only for low-latency use cases ✗

spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "1")
  .load(inputDir)
  .writeStream
  .trigger(Trigger.AvailableNow)
  .option("checkpointLocation", chkdir)
  .start()

Stream processing can be applied to use cases of any latency. "Batch" is a special case of streaming.

Slide 12

Slide 12 text

©2022 Databricks Inc. — All rights reserved Misconception #2: The lower the latency, the "better" ✗. Choose the right latency, accuracy, and cost tradeoff for each specific use case. [Diagram: Latency / Accuracy / Cost tradeoff triangle]

Slide 13

Slide 13 text

©2022 Databricks Inc. — All rights reserved 13 Streaming is about the programming paradigm

Slide 14

Slide 14 text

©2022 Databricks Inc. — All rights reserved 14 Spark Structured Streaming

Slide 15

Slide 15 text

©2022 Databricks Inc. — All rights reserved Structured Streaming: a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. + Project Lightspeed: predictable low latencies

Slide 16

Slide 16 text

©2022 Databricks Inc. — All rights reserved Structured Streaming
• Source: read from an initial offset position; keep tracking the offset position as processing makes progress
• Transformation: apply the same transformations using a standard DataFrame
• Sink: write to a target; keep updating the checkpoint as processing makes progress
• Trigger: controls when the next increment of work runs

Slide 17

Slide 17 text

©2022 Databricks Inc. — All rights reserved Streaming ETL (source, transformation, sink, config)

spark.readStream.format("kafka|kinesis|socket|...")                  # source
  .option(<>, <>)...                                                 # config
  .load()
  .select(col("value").cast("string").alias("jsonData"))             # transformation
  .select(from_json(col("jsonData"), jsonSchema).alias("payload"))
  .writeStream                                                       # sink
  .format("delta")
  .option("path", ...)
  .trigger(processingTime="30 seconds")
  .option("checkpointLocation", ...)
  .start()

Slide 18

Slide 18 text

©2022 Databricks Inc. — All rights reserved Trigger Types
• Default: process as soon as the previous batch has been processed
• Fixed interval: process at a user-specified time interval
• One-time: process all of the available data and then stop
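As a hedged sketch of how the three trigger types look in code (df is assumed to be an existing streaming DataFrame; the paths are placeholders):

    # Default: start the next micro-batch as soon as the previous one finishes.
    df.writeStream.format("delta") \
      .option("checkpointLocation", "/chk/default").start("/out/default")

    # Fixed interval: one micro-batch per user-specified interval.
    df.writeStream.trigger(processingTime="1 minute").format("delta") \
      .option("checkpointLocation", "/chk/interval").start("/out/interval")

    # One-time: process everything available, then stop (useful for scheduled "batch" runs).
    df.writeStream.trigger(once=True).format("delta") \
      .option("checkpointLocation", "/chk/once").start("/out/once")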

Slide 19

Slide 19 text

©2022 Databricks Inc. — All rights reserved Output Modes
• Append (default): only new rows added to the result table since the last trigger are output to the sink
• Complete: the whole result table is output to the sink after each trigger
• Update: only the rows updated in the result table since the last trigger are output to the sink
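A small sketch of the output modes; the aggregation and column names are illustrative, and the console sink is used only to make the behavior visible:

    # counts is a streaming aggregation over an assumed streaming DataFrame df
    counts = df.groupBy("car_make").count()

    # complete: the whole result table is rewritten to the sink after each trigger
    counts.writeStream.outputMode("complete").format("console").start()

    # update: only rows whose counts changed since the last trigger are written
    counts.writeStream.outputMode("update").format("console").start()

    # append (default): only new rows; valid for queries that never update earlier
    # results, e.g. a plain projection without aggregation
    df.select("car_make").writeStream.outputMode("append").format("console").start()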

Slide 20

Slide 20 text

©2022 Databricks Inc. — All rights reserved Structured Streaming Benefits
• High throughput: optimized for high throughput and low cost
• Rich connector ecosystem: streaming connectors ranging from message buses to object storage services
• Exactly-once semantics: fault tolerance and exactly-once semantics guarantee correctness
• Unified batch and streaming: a unified API makes development and maintenance simple

Slide 21

Slide 21 text

©2022 Databricks Inc. — All rights reserved 21 The Data Lakehouse

Slide 22

Slide 22 text

©2022 Databricks Inc. — All rights reserved Data Maturity Curve [Chart: Data + AI Maturity vs. Competitive Advantage, progressing from Reports, Clean Data, and Ad Hoc Queries ("What happened?", Data Warehouse for BI) through Data Exploration and Predictive Modeling to Prescriptive Analytics and Automated Decision Making ("What will happen?", Data Lake for AI)]

Slide 23

Slide 23 text

©2022 Databricks Inc. — All rights reserved Two disparate, incompatible data platforms [Diagram: Data Warehouse (structured tables; governance and security via table ACLs) serving BI and SQL analytics vs. Data Lake (unstructured files: logs, text, images, video; governance and security on files and blobs) serving data science & ML and data streaming, with subsets of data copied between them] Consequences: disjointed and duplicative data silos, incompatible security and governance models, incomplete support for use cases

Slide 24

Slide 24 text

©2022 Databricks Inc. — All rights reserved Databricks Lakehouse Platform
• One platform for Data Warehousing, Data Engineering, Data Science and ML, and Data Streaming
• All structured and unstructured data in the cloud data lake
• Unity Catalog: fine-grained governance for data and AI
• Delta Lake: data reliability and performance
• Simple: unify your data warehousing and AI use cases on a single platform
• Open: built on open source and open standards
• Multicloud: one consistent data platform across clouds

Slide 25

Slide 25 text

©2022 Databricks Inc. — All rights reserved Streaming on the Lakehouse [Diagram: sources such as DB change data feeds, clickstreams, machine & application logs, mobile & IoT data, and application events flow through streaming ingestion and streaming ETL on the Lakehouse Platform (cloud data lake, Delta Lake, Unity Catalog), powering event processing, event-driven applications, ML inference, live dashboards, near real-time queries, alerts, fraud prevention, dynamic UIs, dynamic ads, dynamic pricing, device control, game scene updates, etc.]

Slide 26

Slide 26 text

©2022 Databricks Inc. — All rights reserved Lakehouse Differentiations
• Unified batch and streaming: no overhead of learning, developing on, or maintaining two sets of APIs and data processing stacks
• Favorite tools: provide diverse users with their favorite tools to work with streaming data, enabling the broader organization to take advantage of streaming
• End-to-end streaming: has everything you need; no need to stitch together different streaming technology stacks or tune them to work together
• Optimal cost structure: easily configure the right latency-cost tradeoff for each of your streaming workloads

Slide 27

Slide 27 text

©2022 Databricks Inc. — All rights reserved 27 but what is the foundation of the Lakehouse Platform?

Slide 28

Slide 28 text

©2022 Databricks Inc. — All rights reserved Delta Lake is the foundation of the Lakehouse: one source of truth for all your data, in an open format
• Open-source format based on Parquet
• Adds quality, reliability, and performance to your existing data lakes
• Provides one common data management framework for batch & streaming, ETL, analytics & ML
✓ ACID Transactions ✓ Time Travel ✓ Schema Enforcement ✓ Identity Columns ✓ Advanced Indexing ✓ Caching ✓ Auto-tuning ✓ Python, SQL, R, Scala Support (all open source)

Slide 29

Slide 29 text

©2022 Databricks Inc. — All rights reserved Lakehouse: Delta Lake Publication Link to PDF

Slide 30

Slide 30 text

©2022 Databricks Inc. — All rights reserved Delta's expanding ecosystem of connectors (new and coming soon): 10x monthly download growth in just one year, >8 million monthly downloads

Slide 31

Slide 31 text

©2022 Databricks Inc. — All rights reserved Under the Hood: a write-ahead log with optimistic concurrency provides serializable ACID transactions

my_table/                   ← table name
  _delta_log/               ← delta log
    00000.json              ← delta for tx
    00001.json
    …
    00010.json
    00010.chkpt.parquet     ← checkpoint files
  date=2022-01-01/          ← optional partition
    File-1.parquet          ← data in Parquet format
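A quick way to see this structure, assuming a Databricks notebook and a hypothetical table path (the slide uses simplified file names; real commit files are zero-padded to 20 digits):

    # List the transaction log of a Delta table (path is an example).
    display(dbutils.fs.ls("/data/my_table/_delta_log"))

    # Each JSON commit file records the actions (add/remove file, metadata, protocol) of one transaction.
    spark.read.json("/data/my_table/_delta_log/00000000000000000000.json").printSchema()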

Slide 32

Slide 32 text

©2022 Databricks Inc. — All rights reserved Delta Lake Quickstart (https://github.com/delta-io/delta)

pyspark --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
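Once that shell is up, a minimal round trip (in the spirit of the Delta quickstart; the path is an example) looks like this:

    # Write a tiny Delta table and read it back.
    data = spark.range(0, 5)
    data.write.format("delta").save("/tmp/delta-table")
    spark.read.format("delta").load("/tmp/delta-table").show()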

Slide 33

Slide 33 text

©2022 Databricks Inc. — All rights reserved Spark: Convert to Delta

# convert to Delta format
cust = spark.read.json('/data/customers/')
cust.write.format("delta").saveAsTable(table_name)

Slide 34

Slide 34 text

©2022 Databricks Inc. — All rights reserved All of Delta Lake 2.0 is open: ACID transactions, scalable metadata, time travel, unified batch/streaming, schema evolution/enforcement, audit history, DML operations, OPTIMIZE compaction, OPTIMIZE ZORDER, change data feed, table restore, S3 multi-cluster writes, MERGE enhancements, stream enhancements, simplified LogStore, data skipping via column stats, multi-part checkpoint writes, generated columns, column mapping, generated column support with partitioning, identity columns, subqueries in deletes and updates, clones, Iceberg to Delta converter, fast metadata-only deletes (coming soon)

Slide 35

Slide 35 text

©2022 Databricks Inc. — All rights reserved Without Delta Lake: No UPDATE

Slide 36

Slide 36 text

©2022 Databricks Inc. — All rights reserved Delta Examples (in SQL for brevity)

-- Time travel
SELECT * FROM student TIMESTAMP AS OF "2022-05-28"

-- Clones
CREATE TABLE test_student SHALLOW CLONE student

-- Table history
DESCRIBE HISTORY student

-- Update/Upsert
UPDATE student SET lastname = "Miller" WHERE id = 2805

-- Compact files and co-locate data
OPTIMIZE student ZORDER BY age

-- Clean up stale files
VACUUM student RETAIN 12 HOURS
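The same operations are also available from the DataFrame API. For example, a hedged sketch of time travel in Python (the table name follows the SQL above, the path form is a placeholder; depending on the Delta/Spark version one of the two forms may be required):

    # Read the student table as of a timestamp or as of a version.
    df_ts = spark.read.option("timestampAsOf", "2022-05-28").table("student")
    df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("/path/to/student")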

Slide 37

Slide 37 text

©2022 Databricks Inc. — All rights reserved 37 … but how does this perform for DWH workloads?

Slide 38

Slide 38 text

©2022 Databricks Inc. — All rights reserved Databricks SQL
• Serverless: eliminate compute infrastructure management (instant, elastic compute; zero management; lower TCO)
• Photon: vectorized C++ execution engine with the Apache Spark API
• TPC-DS benchmark, 100 TB: https://dbricks.co/benchmark

Slide 39

Slide 39 text

©2022 Databricks Inc. — All rights reserved 39 … and streaming?

Slide 40

Slide 40 text

©2022 Databricks Inc. — All rights reserved Streaming: Built into the foundation. Delta tables can be streaming sources and sinks.

display(spark.readStream.format("delta").table("heart.bpm")
        .groupBy("bpm", window("time", "1 minute"))
        .avg("bpm")
        .orderBy("window", ascending=True))
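The snippet above shows Delta as a streaming source; the sink side is symmetric. A hedged sketch (the checkpoint path and target table name are examples, not from the talk):

    # Continuously copy a Delta table into another Delta table.
    (spark.readStream.format("delta").table("heart.bpm")
     .writeStream.format("delta")
     .option("checkpointLocation", "/chk/bpm_copy")
     .toTable("heart.bpm_copy"))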

Slide 41

Slide 41 text

©2022 Databricks Inc. — All rights reserved 41

Slide 42

Slide 42 text

©2022 Databricks Inc. — All rights reserved 42 How can you simplify Streaming Ingestion and Data Pipelines?

Slide 43

Slide 43 text

©2022 Databricks Inc. — All rights reserved Auto Loader
• Python & Scala (and SQL in Delta Live Tables!)
• Streaming data source with incremental loading
• Exactly-once ingestion
• Scales to large amounts of data
• Designed for structured, semi-structured and unstructured data
• Schema inference, enforcement with data rescue, and evolution

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/path/to/table"))
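A slightly fuller sketch that also exercises the schema handling mentioned above and writes to a Delta table; the schema location, checkpoint path, and table name are assumptions for illustration:

    (spark.readStream.format("cloudFiles")
     .option("cloudFiles.format", "json")
     .option("cloudFiles.schemaLocation", "/schemas/my_source")  # where the inferred schema is tracked and evolved
     .load("/path/to/table")
     .writeStream
     .option("checkpointLocation", "/chk/my_source")             # checkpointing gives exactly-once ingestion across restarts
     .toTable("bronze_events"))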

Slide 44

Slide 44 text

©2022 Databricks Inc. — All rights reserved “We ingest more than 5 petabytes per day with Auto Loader” 44

Slide 45

Slide 45 text

©2022 Databricks Inc. — All rights reserved Delta Live Tables: reliable, declarative, streaming data pipelines in SQL or Python

CREATE STREAMING LIVE TABLE raw_data
AS SELECT * FROM cloud_files("/raw_data", "json")

CREATE LIVE TABLE clean_data
AS SELECT … FROM LIVE.raw_data

• Accelerate ETL development: declare SQL or Python and DLT automatically orchestrates the DAG, handles retries and changing data
• Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
• Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
• Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API
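The same pipeline expressed in Python would look roughly like this. This is a sketch: the status column is a made-up example, and the dlt module is only available inside a Delta Live Tables pipeline.

    import dlt
    from pyspark.sql.functions import col

    @dlt.table
    def raw_data():
        # Auto Loader ("cloud_files" in SQL) is the cloudFiles source in Python
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/raw_data"))

    @dlt.table
    def clean_data():
        # read the raw_data live table as a stream and keep only usable rows
        return dlt.read_stream("raw_data").where(col("status").isNotNull())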

Slide 46

Slide 46 text

©2022 Databricks Inc. — All rights reserved COPY INTO
• SQL command
• Idempotent and incremental
• Great when the source directory contains ~ thousands of files
• Schema automatically inferred

COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')

Slide 47

Slide 47 text

©2022 Databricks Inc. — All rights reserved Simple Workflows: build sophisticated workflows inside your Databricks workspace with a few clicks, or connect to your favorite IDE.

Slide 48

Slide 48 text

©2022 Databricks Inc. — All rights reserved 48 10k View

Slide 49

Slide 49 text

©2022 Databricks Inc. — All rights reserved Lakehouse Platform: Streaming Options [Diagram: streaming paths into Delta Lake; messaging systems (Apache Kafka, Kinesis, …) via the Apache Kafka connector for Structured Streaming or a Delta Lake sink connector, and files/object stores (S3, ADLS, GCS, …) via Databricks Auto Loader; Spark Structured Streaming and Delta Live Tables (SQL/Python) process the data into Delta Lake for data consumers]
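Tying the pieces together, a hedged end-to-end sketch of one path through this diagram, Kafka into a Delta table via Structured Streaming (broker, topic, checkpoint path, and table name are placeholders):

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    (raw.selectExpr("CAST(value AS STRING) AS json")   # Kafka values arrive as bytes
     .writeStream.format("delta")
     .option("checkpointLocation", "/chk/kafka_events")
     .toTable("bronze.kafka_events"))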

Slide 50

Slide 50 text

©2022 Databricks Inc. — All rights reserved 50 It's demo time!

Slide 51

Slide 51 text

©2022 Databricks Inc. — All rights reserved 51 Demo 1

Slide 52

Slide 52 text

©2022 Databricks Inc. — All rights reserved Demo: Data Donation Project https://corona-datenspende.de/science/en/

Slide 53

Slide 53 text

©2022 Databricks Inc. — All rights reserved System Architecture

Slide 54

Slide 54 text

©2022 Databricks Inc. — All rights reserved DLT: Directly Ingest Streaming Data

Slide 55

Slide 55 text

©2022 Databricks Inc. — All rights reserved 55 Demo 2

Slide 56

Slide 56 text

©2022 Databricks Inc. — All rights reserved Delta Live Tables Twitter Sentiment Analysis

Slide 57

Slide 57 text

©2022 Databricks Inc. — All rights reserved Tweepy API: Streaming Twitter Feed

Slide 58

Slide 58 text

©2022 Databricks Inc. — All rights reserved Auto Loader: ingest streaming data in a Delta Live Tables pipeline

Slide 59

Slide 59 text

©2022 Databricks Inc. — All rights reserved Declarative, auto-scaling data pipelines in SQL (CTAS pattern: Create Table As Select …)

Slide 60

Slide 60 text

©2022 Databricks Inc. — All rights reserved Declarative, auto scaling Data Pipelines

Slide 61

Slide 61 text

©2022 Databricks Inc. — All rights reserved DWH / SQL Persona

Slide 62

Slide 62 text

©2022 Databricks Inc. — All rights reserved ML/DS: Hugging Face Transformer 62

Slide 63

Slide 63 text

©2022 Databricks Inc. — All rights reserved 63 Resources

Slide 64

Slide 64 text

©2022 Databricks Inc. — All rights reserved Demos on Databricks Github https://github.com/databricks/delta-live-tables-notebooks

Slide 65

Slide 65 text

©2022 Databricks Inc. — All rights reserved Databricks Blog: DLT with Apache Kafka https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html

Slide 66

Slide 66 text

©2022 Databricks Inc. — All rights reserved Resources • Spark on Databricks • Databricks Demo Hub • Databricks Blog • Databricks Community / Forum (join to ask your tech questions!) • Training & Certification

Slide 67

Slide 67 text

©2022 Databricks Inc. — All rights reserved Data + AI Summit: Sign up now

Slide 68

Slide 68 text

©2022 Databricks Inc. — All rights reserved 68 Thank You! https://speakerdeck.com/fmunz @frankmunz Try Databricks free