
Current_Streaming_Data_Lakehouse_Databricks.pdf

Frank Munz
October 06, 2022


Streaming Data into Your Lakehouse

Recent years have taught us that cheap, virtually unlimited, and highly available cloud object storage alone doesn't make a solid enterprise data platform. Too many data lakes failed to live up to expectations and degenerated into sad data swamps. With the Linux Foundation OSS project Delta Lake (https://github.com/delta-io), you can turn your data lake into the foundation of a data lakehouse that brings back ACID transactions, schema enforcement, upserts, efficient metadata handling, and time travel. In this session, we explore how a data lakehouse works with streaming, using Apache Kafka as an example. This talk is for data architects who are not afraid of some code and for data engineers who love open source and cloud services. Attendees of this talk will learn:

Lakehouse architecture 101, the honest tech bits
The data lakehouse and streaming data: what's there beyond Apache Spark Structured Streaming?
Why the lakehouse and Apache Kafka make a great couple and what concepts you should know to get them hitched with success
Streaming data with declarative data pipelines
In a live demo, I will show data ingestion, cleansing, and transformation based on a simulation of the Data Donation Project (DDP, https://corona-datenspende.de/science/en) built on the lakehouse with Apache Kafka, Apache Spark, and Delta Live Tables (a fully managed service). DDP is a scientific IoT experiment to determine COVID outbreaks in Germany by detecting elevated heart rates correlated to infections. Half a million volunteers have already decided to donate their heart rate data from their fitness trackers.

Dr. Frank Munz works on large-scale data and AI at Databricks. He authored three computer science books, built up technical evangelism for Amazon Web Services in Germany, Austria, and Switzerland, and once upon a time worked as a data scientist with a group that won a Nobel prize. Frank realized his dream to speak at top-notch conferences - such as Devoxx, KubeCon, ODSC, and JavaOne - on every continent (except Antarctica, because it is too cold there). He holds a Ph.D. summa cum laude in Computer Science from TU Munich and enjoys skiing in the Alps, tapas in Spain, and exploring secret beaches in Southeast Asia.


Transcript

  1. ©2022 Databricks Inc. — All rights reserved Streaming Data Into

    the Lakehouse Current.io 2022 1 Frank Munz, Ph.D. Oct 2022
  2. ©2022 Databricks Inc. — All rights reserved About me •

    @Databricks, Data and AI Products, formerly DevRel EMEA • All things large scale data & compute • Based in Munich, 🍻 ⛰ 🥨 󰎲 • Twitter: @frankmunz • Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc.
  3. ©2022 Databricks Inc. — All rights reserved 3 What do

    you love most about Apache Spark?
  4. ©2022 Databricks Inc. — All rights reserved Spark SQL and

     Dataframe API

     # PYTHON: read multiple JSON files with multiple datasets per file
     cust = spark.read.json('/data/customers/')
     cust.createOrReplaceTempView("customers")
     cust.filter(cust['car_make'] == "Audi").show()

     %sql
     select * from customers where car_make = "Audi"

     -- PYTHON equivalent:
     -- spark.sql('select * from customers where car_make = "Audi"').show()
  5. ©2022 Databricks Inc. — All rights reserved Automatic Parallelization and

    Scale
  6. ©2022 Databricks Inc. — All rights reserved Reading from Files

     vs Reading from Streams

     # read file(s)
     cust = spark.read.format("json").load('/data/customers/')

     # read from messaging platform
     sales = spark.readStream.format("kafka|kinesis|socket").load()
  7. ©2022 Databricks Inc. — All rights reserved Stream Processing is

    continuous and unbounded Stream Processing 7 Traditional Processing is one-off and bounded 1 Data Source 2 Processing Data Source Processing
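The contrast on this slide can be sketched in plain Python (an editorial illustration, not Spark code; the function names are made up): a bounded input can be consumed once for a final answer, while an unbounded input forces incremental state and continuously refined results.

```python
# Bounded vs. unbounded processing, sketched with plain Python
# (illustrative only -- not Spark code).
from typing import Iterable, Iterator

def batch_total(records: list) -> int:
    """Traditional processing: the input is bounded, so we can
    read it all and compute the answer once."""
    return sum(records)

def streaming_totals(records: Iterable) -> Iterator[int]:
    """Stream processing: the input is unbounded, so we keep
    incremental state and emit an updated result per record."""
    running = 0
    for r in records:
        running += r
        yield running  # the result is continuously refined, never "done"

print(batch_total([1, 2, 3]))             # one-off answer: 6
print(list(streaming_totals([1, 2, 3])))  # evolving answers: [1, 3, 6]
```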
  8. ©2022 Databricks Inc. — All rights reserved Technical Advantages A

    more intuitive way of capturing and processing continuous and unbounded data Lower latency for time sensitive applications and use cases Better fault-tolerance through checkpointing Better compute utilization and scalability through continuous and incremental processing 8
  9. ©2022 Databricks Inc. — All rights reserved Business Benefits 9

    BI and SQL Analytics Fresher and faster insights Quicker and better business decisions Data Engineering Sooner availability of cleaned data More business use cases Data Science and ML More frequent model update and inference Better model efficacy Event Driven Application Faster customized response and action Better and differentiated customer experience
  10. ©2022 Databricks Inc. — All rights reserved 10 Streaming Misconceptions

 11. ©2022 Databricks Inc. — All rights reserved Misconception #1 Stream

     processing is only for low latency use cases X

     spark.readStream
       .format("delta")
       .option("maxFilesPerTrigger", "1")
       .load(inputDir)
       .writeStream
       .trigger(Trigger.AvailableNow)
       .option("checkpointLocation", chkdir)
       .start()

     11

     Stream processing can be applied to use cases of any latency. “Batch” is a special case of streaming.
  12. ©2022 Databricks Inc. — All rights reserved Misconception #2 Latency

    Accuracy Cost Choose the right latency, accuracy, and cost tradeoff for each specific use case 12 The lower the latency, the "better" X
  13. ©2022 Databricks Inc. — All rights reserved 13 Streaming is

    about the programming paradigm
  14. ©2022 Databricks Inc. — All rights reserved 14 Spark Structured

    Streaming
  15. ©2022 Databricks Inc. — All rights reserved Structured Streaming 15

    A scalable and fault-tolerant stream processing engine built on the Spark SQL engine +Project Lightspeed: predictable low latencies
  16. ©2022 Databricks Inc. — All rights reserved Structured Streaming 16

    • Read from an initial offset position • Keep tracking offset position as processing makes progress Source • Apply the same transformations using a standard dataframe Transformation • Write to a target • Keep updating checkpoint as processing makes progress Sink Trigger
 17. ©2022 Databricks Inc. — All rights reserved source transformation sink

     config

     spark.readStream.format("kafka|kinesis|socket|...")
       .option(<>,<>)...
       .load()
       .select(cast("string").alias("jsonData"))
       .select(from_json($"jsonData", jsonSchema).alias("payload"))
       .writeStream
       .format("delta")
       .option("path", ...)
       .trigger("30 seconds")
       .option("checkpointLocation", ...)
       .start()

     Streaming ETL 17
  18. ©2022 Databricks Inc. — All rights reserved Trigger Types 18

    • Default: Process as soon as the previous batch has been processed • Fixed interval: Process at a user-specified time interval • One-time: Process all of the available data and then stop
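How the three trigger types group arriving data into micro-batches can be sketched in plain Python (an editorial simplification, not the Spark scheduler: `plan_batches`, the logical ticks, and the one-batch-per-arrival model for the default trigger are all assumptions for illustration).

```python
# How the three trigger types group incoming data into micro-batches,
# sketched in plain Python (illustrative only -- not the Spark scheduler).

def plan_batches(arrival_ticks, trigger, interval=10):
    """arrival_ticks: logical times at which records land.
    Returns {fire_time: [records]} for the chosen trigger type."""
    if trigger == "one-time":
        # process all of the available data in a single batch, then stop
        return {max(arrival_ticks): list(arrival_ticks)}
    if trigger == "fixed-interval":
        # fire at the next interval boundary after each record arrives
        batches = {}
        for a in arrival_ticks:
            fire_at = (a // interval + 1) * interval
            batches.setdefault(fire_at, []).append(a)
        return batches
    # default: fire as soon as possible (simplified here to one batch
    # per arrival, since the previous batch finishes instantly)
    return {a: [a] for a in arrival_ticks}

print(plan_batches([1, 2, 14], "fixed-interval"))  # {10: [1, 2], 20: [14]}
print(plan_batches([1, 2, 14], "one-time"))        # {14: [1, 2, 14]}
```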
  19. ©2022 Databricks Inc. — All rights reserved Output Modes 19

    • Append (Default): Only new rows added to the result table since the last trigger will be output to the sink • Complete: The whole result table will be output to the sink after each trigger • Update: Only the rows updated in the result table since the last trigger will be output to the sink
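What each mode emits per trigger is easiest to see on a small counting aggregation. The sketch below is plain Python (illustrative only; `emit` and the dict-based result table are assumptions, not Spark's internal representation).

```python
# What each output mode writes to the sink after one trigger, simulated
# for a streaming count aggregation (plain Python, illustrative only).

def emit(result_before: dict, result_after: dict, mode: str) -> dict:
    """Rows written to the sink for one trigger, per output mode."""
    if mode == "complete":
        # the whole result table, every trigger
        return dict(result_after)
    if mode == "update":
        # only rows that are new or whose value changed since last trigger
        return {k: v for k, v in result_after.items()
                if result_before.get(k) != v}
    if mode == "append":
        # only rows added since the last trigger; existing rows are
        # treated as final and must not change in this mode
        return {k: v for k, v in result_after.items()
                if k not in result_before}
    raise ValueError(mode)

before = {"a": 2, "b": 1}
after = {"a": 3, "b": 1, "c": 1}        # one more "a", first "c"
print(emit(before, after, "append"))    # {'c': 1}
print(emit(before, after, "update"))    # {'a': 3, 'c': 1}
print(emit(before, after, "complete"))  # {'a': 3, 'b': 1, 'c': 1}
```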
  20. ©2022 Databricks Inc. — All rights reserved Structured Streaming Benefits

    High Throughput Optimized for high throughput and low cost Rich Connector Ecosystem Streaming connectors ranging from message buses to object storage services Exactly Once Semantics Fault-tolerance and exactly once semantics guarantee correctness Unified Batch and Streaming Unified API makes development and maintenance simple 20
  21. ©2022 Databricks Inc. — All rights reserved 21 The Data

    Lakehouse
  22. ©2022 Databricks Inc. — All rights reserved Data Maturity Curve

    Data + AI Maturity Competitive Advantage Reports Clean Data Ad Hoc Queries Data Exploration Predictive Modeling Prescriptive Analytics Automated Decision Making Data Lake for AI Data Warehouse for BI Data Maturity Curve What will happen? What happened? 22
  23. ©2022 Databricks Inc. — All rights reserved Business Intelligence SQL

    Analytics Data Science & ML Data Streaming Two disparate, incompatible data platforms Structured tables Unstructured files: logs, text, images, video, Data Warehouse Data Lake Governance and Security Table ACLs Governance and Security Files and Blobs Copy subsets of data Disjointed and duplicative data silos Incompatible security and governance models Incomplete support for use cases 23
  24. ©2022 Databricks Inc. — All rights reserved 24 Databricks Lakehouse

    Platform Lakehouse Platform Data Warehousing Data Engineering Data Science and ML Data Streaming All structured and unstructured data Cloud Data Lake Unity Catalog Fine-grained governance for data and AI Delta Lake Data reliability and performance Simple Unify your data warehousing and AI use cases on a single platform Open Built on open source and open standards Multicloud One consistent data platform across clouds
  25. ©2022 Databricks Inc. — All rights reserved 25 Streaming on

    the Lakehouse Streaming on the Lakehouse Lakehouse Platform All structured and unstructured data Cloud Data Lake Unity Catalog Fine-grained governance for data and AI Delta Lake Data reliability and performance Streaming Ingesting Streaming ETL Event Processing Event Driven Application ML Inference Live Dashboard Near Real-time Query Alert Fraud Prevention Dynamic UI Dynamic Ads Dynamic Pricing Device Control Game Scene Update Etc. DB Change Data Feeds Clickstreams Machine & Application Logs Mobile & IoT Data Application Events
  26. ©2022 Databricks Inc. — All rights reserved Lakehouse Differentiations 26

    Unified Batch and Streaming No overhead of learning, developing on, or maintaining two sets of APIs and data processing stacks Favorite Tools Provide diverse users with their favorite tools to work with streaming data, enabling the broader organization to take advantage of streaming End-to-End Streaming Has everything you need, no need to stitch together different streaming technology stacks or tune them to work together Optimal Cost Structure Easily configure the right latency-cost tradeoff for each of your streaming workloads
  27. ©2022 Databricks Inc. — All rights reserved 27 but what

    is the foundation of the Lakehouse Platform?
 28. ©2022 Databricks Inc. — All rights reserved One source of

     truth for all your data Open format Delta Lake is the foundation of the Lakehouse • Open source format, based on Parquet • Adds quality, reliability, and performance to your existing data lakes • Provides one common data management framework for Batch & Streaming, ETL, Analytics & ML ✓ ACID Transactions ✓ Time Travel ✓ Schema Enforcement ✓ Identity Columns ✓ Advanced Indexing ✓ Caching ✓ Auto-tuning ✓ Python, SQL, R, Scala Support All Open Source
  29. ©2022 Databricks Inc. — All rights reserved Lakehouse: Delta Lake

    Publication Link to PDF
  30. ©2022 Databricks Inc. — All rights reserved Delta’s expanding ecosystem

    of connectors N ew ! Coming Soon!
 31. ©2022 Databricks Inc. — All rights reserved Under the Hood

     Write-ahead log with optimistic concurrency provides serializable ACID transactions:

     my_table/                  ← table name
       _delta_log/              ← delta log
         00000.json             ← delta for tx
         00001.json
         …
         00010.json
         00010.chkpt.parquet    ← checkpoint files
       date=2022-01-01/         ← optional partition
         File-1.parquet         ← data in parquet format
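The idea behind this transaction log can be illustrated in a few lines of plain Python (an editorial sketch of the concept, not the full Delta protocol): each commit lists add/remove file actions, replaying them in order yields the current set of live data files, and stopping the replay at an earlier version is exactly time travel.

```python
# Minimal sketch of how the _delta_log determines table state
# (illustrative only -- not the full Delta Lake protocol).

commits = [
    # 00000.json: initial write
    [{"add": {"path": "part-000.parquet"}},
     {"add": {"path": "part-001.parquet"}}],
    # 00001.json: an UPDATE rewrote part-001
    [{"remove": {"path": "part-001.parquet"}},
     {"add": {"path": "part-002.parquet"}}],
]

def table_files(log, version=None):
    """Replay the log up to `version` (time travel!) and return live files."""
    live = set()
    for v, actions in enumerate(log):
        if version is not None and v > version:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(table_files(commits))             # ['part-000.parquet', 'part-002.parquet']
print(table_files(commits, version=0))  # ['part-000.parquet', 'part-001.parquet']
```

Because readers always resolve state from the log rather than by listing files, a commit either becomes visible atomically or not at all, which is what makes the transactions serializable.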
 32. ©2022 Databricks Inc. — All rights reserved Delta Lake Quickstart

     pyspark --packages io.delta:delta-core_2.12:1.0.0 \
       --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
       --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

     https://github.com/delta-io/delta
 33. ©2022 Databricks Inc. — All rights reserved Spark: Convert to

     Delta 33

     # convert to Delta format
     cust = spark.read.json('/data/customers/')
     cust.write.format("delta").saveAsTable(table_name)
  34. ©2022 Databricks Inc. — All rights reserved All of Delta

    Lake 2.0 is open ACID Transactions Scalable Metadata Time Travel Open Source Unified Batch/Streaming Schema Evolution /Enforcement Audit History DML Operations OPTIMIZE Compaction OPTIMIZE ZORDER Change data feed Table Restore S3 Multi-cluster writes MERGE Enhancements Stream Enhancements Simplified Logstore Data Skipping via Column Stats Multi-part checkpoint writes Generated Columns Column Mapping Generated column support w/ partitioning Identity Columns Subqueries in deletes and updates Clones Iceberg to Delta converter Fast metadata only deletes Coming Soon!
  35. ©2022 Databricks Inc. — All rights reserved Without Delta Lake:

    No UPDATE
 36. ©2022 Databricks Inc. — All rights reserved Delta Examples (in

     SQL for brevity) 36

     -- Time travel
     SELECT * FROM student TIMESTAMP AS OF "2022-05-28"

     -- Clones
     CREATE TABLE test_student SHALLOW CLONE student

     -- Table history
     DESCRIBE HISTORY student

     -- Update/Upsert
     UPDATE student SET lastname = "Miller" WHERE id = 2805

     -- Compact files and co-locate by column
     OPTIMIZE student ZORDER BY age

     -- Remove stale data files no longer referenced by the table
     VACUUM student RETAIN 12 HOURS
  37. ©2022 Databricks Inc. — All rights reserved 37 … but

    how does this perform for DWH workloads?
  38. ©2022 Databricks Inc. — All rights reserved Databricks SQL Photon

    Serverless Eliminate compute infrastructure management Instant, Elastic Compute Zero Management Lower TCO Vectorized C++ execution engine with Apache Spark API https://dbricks.co/benchmark TPC-DS Benchmark 100 TB
  39. ©2022 Databricks Inc. — All rights reserved 39 … and

    streaming?
 40. ©2022 Databricks Inc. — All rights reserved Streaming: Built into

     the foundation 40

     Delta Tables can be streaming sources and sinks

     display(spark.readStream.format("delta").table("heart.bpm")
       .groupBy("bpm", window("time", "1 minute"))
       .avg("bpm")
       .orderBy("window", ascending=True))
  41. ©2022 Databricks Inc. — All rights reserved 41

  42. ©2022 Databricks Inc. — All rights reserved 42 How can

    you simplify Streaming Ingestion and Data Pipelines?
 43. ©2022 Databricks Inc. — All rights reserved Auto Loader •

     Python & Scala (and SQL in Delta Live Tables!) • Streaming data source with incremental loading • Exactly-once ingestion • Scales to large amounts of data • Designed for structured, semi-structured and unstructured data • Schema inference, enforcement with data rescue, and evolution

     df = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "json")
           .load("/path/to/table"))

     43
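The "enforcement with data rescue" idea can be sketched in plain Python (an editorial illustration, not the Auto Loader implementation; `SCHEMA`, `ingest`, and the field names are made up): fields matching the expected schema become columns, while anything unexpected lands in a `_rescued_data` column instead of failing the pipeline.

```python
# Sketch of schema enforcement with data rescue (plain Python,
# illustrative only -- not the Auto Loader implementation).
import json

SCHEMA = {"device_id", "bpm", "time"}  # hypothetical expected fields

def ingest(record: dict) -> dict:
    # keep fields that match the expected schema as regular columns
    row = {k: record[k] for k in SCHEMA if k in record}
    # route anything unexpected into _rescued_data instead of erroring out
    rescued = {k: v for k, v in record.items() if k not in SCHEMA}
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row

good = {"device_id": 1, "bpm": 72, "time": "2022-10-06T10:00:00"}
drifted = {"device_id": 2, "bpm": 95, "time": "2022-10-06T10:01:00",
           "firmware": "2.1"}  # schema drift: a new, unexpected field
print(ingest(good)["_rescued_data"])     # None
print(ingest(drifted)["_rescued_data"])  # {"firmware": "2.1"}
```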
  44. ©2022 Databricks Inc. — All rights reserved “We ingest more

    than 5 petabytes per day with Auto Loader” 44
 45. ©2022 Databricks Inc. — All rights reserved Delta Live Tables

     45

     CREATE STREAMING LIVE TABLE raw_data
     AS SELECT * FROM cloud_files("/raw_data", "json")

     CREATE LIVE TABLE clean_data
     AS SELECT … FROM LIVE.raw_data

     Reliable, declarative, streaming data pipelines in SQL or Python

     Accelerate ETL development: declare SQL or Python and DLT automatically orchestrates the DAG, handles retries and changing data
     Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
     Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
     Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API
 46. ©2022 Databricks Inc. — All rights reserved COPY INTO •

     SQL command • Idempotent and incremental • Great when source directory contains ~ thousands of files • Schema automatically inferred

     COPY INTO my_delta_table
     FROM 's3://my-bucket/path/to/csv_files'
     FILEFORMAT = CSV
     FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')

     46
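Why "idempotent and incremental" matters can be sketched in plain Python (an editorial illustration, not the COPY INTO implementation; `copy_into` and its bookkeeping are made up): files already loaded are tracked and skipped, so re-running the same command never duplicates data, and a re-run after new files arrive picks up only the new ones.

```python
# Sketch of idempotent, incremental file ingestion
# (plain Python, illustrative only -- not the COPY INTO implementation).

def copy_into(source_files, already_loaded: set):
    """Load only files not seen before; return the newly loaded ones."""
    new = [f for f in source_files if f not in already_loaded]
    already_loaded.update(new)
    return new

loaded = set()
print(copy_into(["a.csv", "b.csv"], loaded))           # ['a.csv', 'b.csv']
print(copy_into(["a.csv", "b.csv"], loaded))           # []  re-run: no duplicates
print(copy_into(["a.csv", "b.csv", "c.csv"], loaded))  # ['c.csv']  incremental
```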
  47. ©2022 Databricks Inc. — All rights reserved 47 10k View

  48. ©2022 Databricks Inc. — All rights reserved Lakehouse Platform: Streaming

    Options 48 Messaging Systems (Apache Kafka, Kinesis, …) Files / Object Stores (S3, ADLS, GCS, …) Apache Kafka Connector for SSStreaming Databricks Auto Loader Spark Structured Streaming Delta Live Tables (SQL/Python) Delta Lake Data Consumers Delta Lake Sink Connector
  49. ©2022 Databricks Inc. — All rights reserved 49 It's demo

    time!
  50. ©2022 Databricks Inc. — All rights reserved 50 Demo 1

  51. ©2022 Databricks Inc. — All rights reserved Delta Live Tables

    Twitter Sentiment Analysis
  52. ©2022 Databricks Inc. — All rights reserved Tweepy API: Streaming

    Twitter Feed
  53. ©2022 Databricks Inc. — All rights reserved Auto Loader: Streaming

    Data Ingestion Ingest Streaming Data in a Delta Live Table pipeline
  54. ©2022 Databricks Inc. — All rights reserved Declarative, auto scaling

    Data Pipelines in SQL CTAS Pattern: Create Table As Select …
  55. ©2022 Databricks Inc. — All rights reserved Declarative, auto scaling

    Data Pipelines
  56. ©2022 Databricks Inc. — All rights reserved DWH / SQL

    Persona
  57. ©2022 Databricks Inc. — All rights reserved ML/DS: Hugging Face

    Transformer 57
  58. ©2022 Databricks Inc. — All rights reserved 58 Demo 2

  59. ©2022 Databricks Inc. — All rights reserved Demo: Data Donation

    Project https://corona-datenspende.de/science/en/
  60. ©2022 Databricks Inc. — All rights reserved System Architecture

  61. ©2022 Databricks Inc. — All rights reserved 61 Resources

  62. ©2022 Databricks Inc. — All rights reserved Demos on Databricks

    Github https://github.com/databricks/delta-live-tables-notebooks
  63. ©2022 Databricks Inc. — All rights reserved Databricks Blog: DLT

    with Apache Kafka https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html
 64. ©2022 Databricks Inc. — All rights reserved Resources • The

     best platform to run your Spark workloads • Databricks Demo Hub • Databricks GitHub • Databricks Blog • Databricks Community / Forum (join to ask your tech questions!) • Training & Certification
  65. ©2022 Databricks Inc. — All rights reserved 65 Thank You!

    https://speakerdeck.com/fmunz @frankmunz