
Streaming Data on the Lakehouse (Devoxx Conference ATH)

Frank Munz
October 06, 2022


Streaming Data into Your Lakehouse

The last few years have taught us that cheap, virtually unlimited, and highly available cloud object storage doesn't make a solid enterprise data platform on its own. Too many data lakes failed to live up to expectations and degenerated into sad data swamps. With the Linux Foundation OSS project Delta Lake (https://github.com/delta-io), you can turn your data lake into the foundation of a data lakehouse that brings back ACID transactions, schema enforcement, upserts, efficient metadata handling, and time travel. In this session, we explore how a data lakehouse works with streaming, using Apache Kafka as an example. This talk is for data architects who are not afraid of some code and for data engineers who love open source and cloud services. Attendees of this talk will learn:

Lakehouse architecture 101, the honest tech bits
The data lakehouse and streaming data: what's there beyond Apache Spark Structured Streaming?
Why the lakehouse and Apache Kafka make a great couple and what concepts you should know to get them hitched with success
Streaming data with declarative data pipelines
In a live demo, I will show data ingestion, cleansing, and transformation based on a simulation of the Data Donation Project (DDP, https://corona-datenspende.de/science/en) built on the lakehouse with Apache Kafka, Apache Spark, and Delta Live Tables (a fully managed service). DDP is a scientific IoT experiment to determine COVID outbreaks in Germany by detecting elevated heart rates correlated to infections. Half a million volunteers have already decided to donate their heart rate data from their fitness trackers.

This presentation was delivered at Current.io 2022 and Devoxx Greece (Athens) 2023.

Dr. Frank Munz works on large-scale data and AI at Databricks. He authored three computer science books, built up technical evangelism for Amazon Web Services in Germany, Austria, and Switzerland, and once upon a time worked as a data scientist with a group that won a Nobel Prize. Frank realized his dream to speak at top-notch conferences, such as Devoxx, KubeCon, ODSC, and JavaOne, on every continent (except Antarctica, because it is too cold there). He holds a Ph.D. summa cum laude in Computer Science from TU Munich and enjoys skiing in the Alps, tapas in Spain, and exploring secret beaches in SE Asia.



Transcript

  1. ©2022 Databricks Inc. — All rights reserved
    Streaming Data
    on the Lakehouse
    - a talk for everyone who ❤ data
    Devoxx Greece 2023
    Frank Munz, Ph.D.
    May 2023


  2. ©2022 Databricks Inc. — All rights reserved
    About me
    • Principal @Databricks. Data, Analytics and AI products
    • All things large scale data & compute

    • Based in Munich 🍻 ⛰ 🥨
    • Twitter: @frankmunz
    • Formerly AWS Tech Evangelist, SW architect, data scientist,
      published author, etc.


  3. ©2022 Databricks Inc. — All rights reserved 3
    What do you love most about
    Apache Spark?


  4. ©2022 Databricks Inc. — All rights reserved
    Spark SQL and Dataframe API
    # PYTHON: read multiple JSON files with multiple datasets per file
    cust = spark.read.json('/data/customers/')
    # register as a temp view so the data can also be queried with SQL
    cust.createOrReplaceTempView("customers")
    cust.filter(cust['car_make'] == "Audi").show()

    %sql
    select * from customers where car_make = "Audi"

    -- PYTHON equivalent:
    -- spark.sql('select * from customers where car_make = "Audi"').show()


  5. ©2022 Databricks Inc. — All rights reserved
    Automatic Parallelization and Scale


  6. ©2022 Databricks Inc. — All rights reserved
    Reading from Files vs Reading from Streams
    # read file(s)
    cust = spark.read.format("json").path('/data/customers/')
    # read from messaging platform
    sales = spark.readStream.format("kafka|kinesis|socket").load()
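    For the messaging case, a concrete Kafka read needs a broker and topic; a minimal sketch
    (not from the slide) with placeholder server and topic names:

    # Read a Kafka topic as a streaming DataFrame
    sales = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "sales_events")
             .option("startingOffsets", "earliest")
             .load())
    # Kafka delivers key/value as binary; cast the value to a string for parsing
    sales_json = sales.selectExpr("CAST(value AS STRING) AS json")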


  7. ©2022 Databricks Inc. — All rights reserved
    Stream Processing
    Traditional Processing is one-off and bounded:
    Data Source → Processing
    Stream Processing is continuous and unbounded:
    Data Source → Processing (runs continuously)


  8. ©2022 Databricks Inc. — All rights reserved
    Technical Advantages
    • A more intuitive way of capturing and processing continuous and unbounded data
    • Lower latency for time-sensitive applications and use cases
    • Better fault tolerance through checkpointing
    • Better compute utilization and scalability through continuous and incremental processing


  9. ©2022 Databricks Inc. — All rights reserved
    Business Benefits
    • BI and SQL Analytics: fresher and faster insights → quicker and better business decisions
    • Data Engineering: sooner availability of cleaned data → more business use cases
    • Data Science and ML: more frequent model updates and inference → better model efficacy
    • Event-Driven Applications: faster customized response and action → better and differentiated customer experience


  10. ©2022 Databricks Inc. — All rights reserved 10
    Streaming
    Misconceptions


  11. ©2022 Databricks Inc. — All rights reserved
    Misconception #1
    ✗ Stream processing is only for low latency use cases
    Stream processing can be applied to use cases of any latency;
    "batch" is a special case of streaming:
    spark.readStream
      .format("delta")
      .option("maxFilesPerTrigger", "1")
      .load(inputDir)
      .writeStream
      .trigger(Trigger.AvailableNow)
      .option("checkpointLocation", chkdir)
      .start()


  12. ©2022 Databricks Inc. — All rights reserved
    Misconception #2
    ✗ The lower the latency, the "better"
    Choose the right latency, accuracy, and cost tradeoff for each specific use case.
    [Tradeoff triangle: Latency / Accuracy / Cost]


  13. ©2022 Databricks Inc. — All rights reserved 13
    Streaming is about the
    programming paradigm


  14. ©2022 Databricks Inc. — All rights reserved 14
    Spark Structured
    Streaming


  15. ©2022 Databricks Inc. — All rights reserved
    Structured Streaming
    15
    A scalable and
    fault-tolerant stream
    processing engine built
    on the Spark SQL engine
    +Project Lightspeed: predictable low latencies


  16. ©2022 Databricks Inc. — All rights reserved
    Structured Streaming
    • Source: read from an initial offset position and keep tracking the offset position as processing makes progress
    • Transformation: apply the same transformations using a standard dataframe
    • Sink: write to a target and keep updating the checkpoint as processing makes progress
    • Trigger: controls when the next micro-batch runs


  17. ©2022 Databricks Inc. — All rights reserved
    Streaming ETL
    spark.readStream.format("kafka|kinesis|socket|...")               # source
      .option(<>,<>)...                                               # config
      .load()
      .select(col("value").cast("string").alias("jsonData"))          # transformation
      .select(from_json(col("jsonData"), jsonSchema).alias("payload"))
      .writeStream                                                    # sink
      .format("delta")
      .option("path", ...)
      .trigger(processingTime="30 seconds")
      .option("checkpointLocation", ...)
      .start()


  18. ©2022 Databricks Inc. — All rights reserved
    Trigger Types
    18
    ● Default: Process as soon as the previous batch has been processed
    ● Fixed interval: Process at a user-specified time interval
    ● One-time: Process all of the available data and then stop
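    As a minimal sketch (not from the slide), the trigger type is set on writeStream;
    the DataFrame df and the paths below are placeholders, and newer Spark versions
    also offer trigger(availableNow=True):

    # Default: start the next micro-batch as soon as the previous one finishes
    df.writeStream.format("delta") \
        .option("checkpointLocation", "/chk/default").start("/out/default")

    # Fixed interval: start a micro-batch every 30 seconds
    df.writeStream.format("delta").trigger(processingTime="30 seconds") \
        .option("checkpointLocation", "/chk/interval").start("/out/interval")

    # One-time: process everything that is currently available, then stop
    df.writeStream.format("delta").trigger(once=True) \
        .option("checkpointLocation", "/chk/once").start("/out/once")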


  19. ©2022 Databricks Inc. — All rights reserved
    Output Modes
    19
    ● Append (Default): Only new rows added to the result
    table since the last trigger will be output to the sink
    ● Complete: The whole result table will be output to the
    sink after each trigger
    ● Update: Only the rows updated in the result table since
    the last trigger will be output to the sink
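    As a minimal sketch (not from the slide), the output mode is set on writeStream;
    the DataFrame events and the aggregation counts = events.groupBy("key").count()
    are hypothetical names:

    # Complete: emit the whole result table after each trigger
    counts.writeStream.outputMode("complete").format("console").start()

    # Update: emit only the rows that changed since the last trigger
    counts.writeStream.outputMode("update").format("console").start()

    # Append (the default): emit only new result rows; typical for queries
    # without aggregations, or aggregations with a watermark
    events.writeStream.outputMode("append").format("console").start()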


  20. ©2022 Databricks Inc. — All rights reserved
    Structured Streaming Benefits
    • High Throughput: optimized for high throughput and low cost
    • Rich Connector Ecosystem: streaming connectors ranging from message buses to object storage services
    • Exactly-Once Semantics: fault tolerance and exactly-once semantics guarantee correctness
    • Unified Batch and Streaming: a unified API makes development and maintenance simple


  21. ©2022 Databricks Inc. — All rights reserved 21
    The Data Lakehouse


  22. ©2022 Databricks Inc. — All rights reserved
    Data Maturity Curve
    Data + AI maturity vs. competitive advantage:
    Reports → Clean Data → Ad Hoc Queries → Data Exploration → Predictive Modeling → Prescriptive Analytics → Automated Decision Making
    The data warehouse for BI answers "What happened?"; the data lake for AI answers "What will happen?"


  23. ©2022 Databricks Inc. — All rights reserved
    Two disparate, incompatible data platforms
    • Data Warehouse: structured tables; governance and security via table ACLs; serves business intelligence and SQL analytics
    • Data Lake: unstructured files (logs, text, images, video); governance and security on files and blobs; serves data science & ML and data streaming
    Copying subsets of data between the two leads to disjointed and duplicative data silos, incompatible security and governance models, and incomplete support for use cases.


  24. ©2022 Databricks Inc. — All rights reserved 24
    Databricks Lakehouse Platform
    • Workloads: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming
    • Unity Catalog: fine-grained governance for data and AI
    • Delta Lake: data reliability and performance
    • Cloud Data Lake: all structured and unstructured data
    Simple: unify your data warehousing and AI use cases on a single platform
    Open: built on open source and open standards
    Multicloud: one consistent data platform across clouds


  25. ©2022 Databricks Inc. — All rights reserved 25
    Streaming on the Lakehouse
    Streaming sources such as DB change data feeds, clickstreams, machine & application logs, mobile & IoT data, and application events flow into the Lakehouse Platform (Delta Lake and Unity Catalog on a cloud data lake) for streaming ingestion, streaming ETL, event processing, and event-driven applications.
    Downstream uses include ML inference, live dashboards, near real-time queries, alerts, fraud prevention, dynamic UIs, dynamic ads, dynamic pricing, device control, game scene updates, etc.


  26. ©2022 Databricks Inc. — All rights reserved
    Lakehouse Differentiations
    • Unified Batch and Streaming: no overhead of learning, developing on, or maintaining two sets of APIs and data processing stacks
    • End-to-End Streaming: has everything you need; no need to stitch together different streaming technology stacks or tune them to work together
    • Favorite Tools: provide diverse users with their favorite tools to work with streaming data, enabling the broader organization to take advantage of streaming
    • Optimal Cost Structure: easily configure the right latency-cost tradeoff for each of your streaming workloads


  27. ©2022 Databricks Inc. — All rights reserved 27
    but what is the
    foundation of the
    Lakehouse Platform?


  28. ©2022 Databricks Inc. — All rights reserved
    One source of truth for all your data
    The open format Delta Lake is the foundation of the Lakehouse:
    • open source format based on Parquet
    • adds quality, reliability, and performance to your existing data lakes
    • provides one common data management framework for batch & streaming, ETL, analytics & ML
    ✓ ACID Transactions ✓ Time Travel ✓ Schema Enforcement ✓ Identity Columns
    ✓ Advanced Indexing ✓ Caching ✓ Auto-tuning ✓ Python, SQL, R, Scala Support
    All Open Source


  29. ©2022 Databricks Inc. — All rights reserved
    Lakehouse: Delta Lake Publication
    Link to PDF


  30. ©2022 Databricks Inc. — All rights reserved
    Delta's expanding ecosystem of connectors (new and coming-soon connectors)
    >8 million monthly downloads, with 10x monthly download growth in just one year


  31. ©2022 Databricks Inc. — All rights reserved
    Under the Hood
    A write-ahead log with optimistic concurrency provides serializable ACID transactions:
    my_table/                    ← table name
      _delta_log/                ← delta log
        00000.json               ← one JSON commit file per transaction
        00001.json
        …
        00010.json
        00010.chkpt.parquet      ← checkpoint file
      date=2022-01-01/           ← optional partition
        File-1.parquet           ← data in parquet format
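    As an illustration (not from the slide), each commit file is newline-delimited JSON
    and can be inspected directly; the table path below is a placeholder:

    # Peek at the Delta transaction log of a table
    import json, pathlib

    log_dir = pathlib.Path("/data/my_table/_delta_log")
    for commit in sorted(log_dir.glob("*.json")):
        print(f"--- {commit.name} ---")
        for line in commit.read_text().splitlines():
            action = json.loads(line)      # each line is one action
            print(list(action.keys()))     # e.g. ['commitInfo'], ['add'], ['remove']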


  32. ©2022 Databricks Inc. — All rights reserved
    Delta Lake Quickstart
    pyspark --packages io.delta:delta-core_2.12:1.0.0 \
    --conf "spark.sql.extensions= \
    io.delta.sql.DeltaSparkSessionExtension" \
    --conf "spark.sql.catalog.spark_catalog= \
    org.apache.spark.sql.delta.catalog.DeltaCatalog"
    https://github.com/delta-io/delta


  33. ©2022 Databricks Inc. — All rights reserved
    Spark: Convert to Delta
    33
    #convert to delta format
    cust = spark.read.json('/data/customers/')
    cust.write.format("delta").saveAsTable(table_name)


  34. ©2022 Databricks Inc. — All rights reserved
    All of Delta Lake 2.0 is open
    ACID Transactions, Scalable Metadata, Time Travel, Open Source, Unified Batch/Streaming,
    Schema Evolution/Enforcement, Audit History, DML Operations, OPTIMIZE Compaction,
    OPTIMIZE ZORDER, Change Data Feed, Table Restore, S3 Multi-cluster Writes,
    MERGE Enhancements, Stream Enhancements, Simplified Logstore, Data Skipping via Column Stats,
    Multi-part Checkpoint Writes, Generated Columns, Column Mapping,
    Generated Column Support with Partitioning, Identity Columns, Subqueries in Deletes and Updates,
    Clones, Iceberg to Delta Converter, Fast Metadata-only Deletes
    Coming Soon!


  35. ©2022 Databricks Inc. — All rights reserved
    Without Delta Lake: No UPDATE


  36. ©2022 Databricks Inc. — All rights reserved
    Delta Examples (in SQL for brevity)
    -- Time travel
    SELECT * FROM student TIMESTAMP AS OF "2022-05-28"
    -- Clones
    CREATE TABLE test_student SHALLOW CLONE student
    -- Table history
    DESCRIBE HISTORY student
    -- Update/Upsert
    UPDATE student SET lastname = "Miller" WHERE id = 2805
    -- Compact files and co-locate by column (Z-order)
    OPTIMIZE student ZORDER BY age
    -- Clean up files from old table versions
    VACUUM student RETAIN 12 HOURS


  37. ©2022 Databricks Inc. — All rights reserved 37
    … but how does this perform for
    DWH workloads?


  38. ©2022 Databricks Inc. — All rights reserved
    Databricks SQL Photon
    • Serverless: eliminates compute infrastructure management; instant, elastic compute; zero management; lower TCO
    • Photon: vectorized C++ execution engine with the Apache Spark API
    TPC-DS Benchmark 100 TB: https://dbricks.co/benchmark


  39. ©2022 Databricks Inc. — All rights reserved 39
    … and streaming?


  40. ©2022 Databricks Inc. — All rights reserved
    Streaming: Built into the foundation
    Delta tables can be streaming sources and sinks:
    display(spark.readStream.format("delta").table("heart.bpm")
      .groupBy("bpm", window("time", "1 minute"))
      .avg("bpm")
      .orderBy("window", ascending=True))
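    The sink side is symmetric. A minimal sketch (not from the slide), assuming a streaming
    DataFrame bpm_stream and placeholder checkpoint path and table name:

    (bpm_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/chk/bpm")
        .toTable("heart.bpm_silver"))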


  41. ©2022 Databricks Inc. — All rights reserved 41


  42. ©2022 Databricks Inc. — All rights reserved 42
    How can you simplify
    Streaming Ingestion and
    Data Pipelines?


  43. ©2022 Databricks Inc. — All rights reserved
    Auto Loader
    • Python & Scala (and SQL in Delta Live Tables!)
    • Streaming data source with incremental loading
    • Exactly once ingestion
    • Scales to large amounts of data
    • Designed for structured, semi-structured and unstructured data
    • Schema inference, enforcement with data rescue, and evolution
    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load("/path/to/table"))
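    To complete the picture, a minimal sketch (not from the slide) of landing the Auto Loader
    stream in a Delta table; the checkpoint path and table name are placeholders, and outside of
    Delta Live Tables you typically also set cloudFiles.schemaLocation for schema inference:

    (df.writeStream
        .format("delta")
        .option("checkpointLocation", "/path/to/checkpoint")
        .toTable("my_bronze_table"))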


  44. ©2022 Databricks Inc. — All rights reserved
    “We ingest more than
    5 petabytes per day
    with Auto Loader”
    44


  45. ©2022 Databricks Inc. — All rights reserved
    Delta Live Tables
    Reliable, declarative, streaming data pipelines in SQL or Python
    CREATE STREAMING LIVE TABLE raw_data
    AS SELECT *
    FROM cloud_files("/raw_data", "json")

    CREATE LIVE TABLE clean_data
    AS SELECT …
    FROM LIVE.raw_data

    • Accelerate ETL development: declare SQL or Python and DLT automatically orchestrates the DAG, handles retries, and adapts to changing data
    • Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
    • Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
    • Unify batch and streaming: get the simplicity of SQL with the freshness of streaming in one unified API
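    For the Python flavor, a minimal sketch of the same two-step pipeline using the dlt module
    available inside a Delta Live Tables pipeline; the path and the filter condition are
    placeholders, and spark is provided by the pipeline runtime:

    import dlt
    from pyspark.sql.functions import col

    @dlt.table
    def raw_data():
        # Incrementally ingest JSON files with Auto Loader
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/raw_data"))

    @dlt.table
    def clean_data():
        # Read the upstream live table and apply a simple cleansing step
        return dlt.read_stream("raw_data").where(col("value").isNotNull())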


  46. ©2022 Databricks Inc. — All rights reserved
    COPY INTO
    • SQL command
    • Idempotent and incremental
    • Great when source directory contains ~ thousands of files
    • Schema automatically inferred
    COPY INTO my_delta_table
    FROM 's3://my-bucket/path/to/csv_files'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header'='true','inferSchema'='true')
    46


  47. ©2022 Databricks Inc. — All rights reserved
    Build sophisticated workflows inside your Databricks workspace
    with a few clicks, or connect to your favorite IDE.
    Simple Workflows
    47


  48. ©2022 Databricks Inc. — All rights reserved 48
    10k View


  49. ©2022 Databricks Inc. — All rights reserved
    Lakehouse Platform: Streaming Options
    • Messaging systems (Apache Kafka, Kinesis, …) → Apache Kafka connector for Spark Structured Streaming
    • Files / object stores (S3, ADLS, GCS, …) → Databricks Auto Loader
    Both paths feed Spark Structured Streaming and Delta Live Tables (SQL/Python), which write to Delta Lake; a Delta Lake sink connector delivers data to downstream consumers.


  50. ©2022 Databricks Inc. — All rights reserved 50
    It's demo time!


  51. ©2022 Databricks Inc. — All rights reserved 51
    Demo 1


  52. ©2022 Databricks Inc. — All rights reserved
    Demo: Data Donation Project
    https://corona-datenspende.de/science/en/


  53. ©2022 Databricks Inc. — All rights reserved
    System Architecture


  54. ©2022 Databricks Inc. — All rights reserved
    DLT: Directly Ingest Streaming Data


  55. ©2022 Databricks Inc. — All rights reserved 55
    Demo 2


  56. ©2022 Databricks Inc. — All rights reserved
    Delta Live Tables
    Twitter Sentiment Analysis


  57. ©2022 Databricks Inc. — All rights reserved
    Tweepy API: Streaming Twitter Feed
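    The slide shows a notebook screenshot; as a rough sketch only, assuming Tweepy v4's
    StreamingClient, a bearer token, and a landing directory that Auto Loader later ingests
    (all names and paths are placeholders):

    import json, time, tweepy

    class TweetWriter(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            # Land each tweet as one JSON file for downstream ingestion
            with open(f"/landing/tweets/{tweet.id}.json", "w") as f:
                json.dump({"id": tweet.id, "text": tweet.text, "ts": time.time()}, f)

    client = TweetWriter("BEARER_TOKEN")
    client.add_rules(tweepy.StreamRule("databricks lang:en"))
    client.filter()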


  58. ©2022 Databricks Inc. — All rights reserved
    Auto Loader: Streaming Data Ingestion
    Ingest Streaming Data in a Delta Live Table pipeline


  59. ©2022 Databricks Inc. — All rights reserved
    Declarative, auto scaling Data Pipelines in SQL
    CTAS Pattern: Create Table As Select …


  60. ©2022 Databricks Inc. — All rights reserved
    Declarative, auto scaling Data Pipelines


  61. ©2022 Databricks Inc. — All rights reserved
    DWH / SQL Persona


  62. ©2022 Databricks Inc. — All rights reserved
    ML/DS: Hugging Face Transformer
    62


  63. ©2022 Databricks Inc. — All rights reserved 63
    Resources


  64. ©2022 Databricks Inc. — All rights reserved
    Demos on Databricks Github
    https://github.com/databricks/delta-live-tables-notebooks


  65. ©2022 Databricks Inc. — All rights reserved
    Databricks Blog: DLT with Apache Kafka
    https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html


  66. ©2022 Databricks Inc. — All rights reserved
    Resources
    • Spark on Databricks
    • Databricks Demo Hub
    • Databricks Blog
    • Databricks Community / Forum (join to ask your tech questions!)
    • Training & Certification


  67. ©2022 Databricks Inc. — All rights reserved
    Data + AI Summit: Sign up now


  68. ©2022 Databricks Inc. — All rights reserved 68
    Thank You!
    https://speakerdeck.com/fmunz
    @frankmunz
    Try
    Databricks
    free
