
The Serverless, Real-Time Data Lakehouse in Action

Frank Munz
September 28, 2023


Unlike just a few years ago, the lakehouse architecture is today an established data platform, embraced by all major cloud data companies such as AWS, Microsoft, Google, Oracle, Snowflake, and Databricks.

This session kicks off with a technical, no-nonsense introduction to the lakehouse concept, dives deep into the lakehouse architecture and recaps how a data lakehouse is built from the ground up with streaming as a first-class citizen.

Then we focus on serverless for streaming use cases. Serverless concepts are well-known from developers triggering hundreds of thousands of AWS Lambda functions at a negligible cost. However, the same concept becomes more interesting when looking at data platforms.

We have all heard the saying "it runs best on PowerPoint", so I decided to skip the slides and bring a serverless demo instead:

A hands-on, fun, and interactive serverless streaming use case example where we ingest live events from hundreds of mobile devices (don't miss out - bring your phone and be part of it!!). Based on this use case I will critically explore how much of a modern lakehouse is serverless and how we implemented that at Databricks (spoiler alert: serverless is everywhere from data pipelines, workflows, optimized Spark APIs, to ML).

TL;DR benefits for data practitioners:
- Recap the OSS foundation of the lakehouse architecture and understand its appeal.
- Understand the benefits of using a lakehouse for streaming, and what exists beyond Spark Structured Streaming.
- Meat of the talk: the serverless lakehouse. I give you the tech bits beyond the hype. How does a serverless lakehouse differ from other serverless offerings?
- Live, hands-on, interactive demo exploring serverless data engineering end-to-end. For each step we take a critical look, and I explain what it means for you, e.g., saving costs and removing operational overhead.


Transcript

  1. ©2022 Databricks Inc. — All rights reserved
    Standing on the Shoulders of
    Open-Source Giants
    The Real-Time,
    Serverless
    Lakehouse in Action
    Frank Munz, Principal TMM, Databricks / Current.io 2023
    @frankmunz


  2. Databricks Lakehouse Platform
     Data Warehousing, Data Engineering, Data Science and ML, Data Streaming
     All structured and unstructured data: Cloud Data Lake
     Unity Catalog: fine-grained governance for data and AI
     Delta Lake: data reliability and performance
     Simple: unify your data warehousing and AI use cases on a single platform
     Open: built on open source and open standards
     Multicloud: one consistent data platform across clouds


  3. Standing on
    the Shoulders of OSS
    What is new?


  4. Apache Spark


  5. Annual downloads
    > 1 Billion


  6. 3,600 contributors, 40,000 commits
    #1 in dev activity for 10 years


  7. Subsecond Latency - Project Lightspeed
    Performance Improvements
    • Micro-Batch Pipelining
    • Offset Management
    • Log Purging
    • Consistent Latency for Stateful Pipelines
    • State Rebalancing
    • Adaptive Query Execution
    Enhanced Functionality
    • Multiple Stateful Operators
    • Arbitrary Stateful Processing in Python
    • Drop Duplicates Within Watermark
    • Native support for Protobuf
    Improved Observability
    • Python Query Listener
    Connectors & Ecosystem
    • Enhanced Fanout (EFO)
    • Trigger.AvailableNow support for Amazon Kinesis
    • Google Pub/Sub Connector
    • Integrations with Unity Catalog
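     Of the functionality above, Drop Duplicates Within Watermark is the easiest to picture: duplicates are only tracked while their key is inside the watermark window, so deduplication state can be evicted instead of growing forever. A minimal pure-Python sketch of that semantics (a toy model, not Spark code; the event/timestamp shapes are made up):

```python
def dedup_within_watermark(events, delay):
    """Toy model of drop-duplicates-within-watermark semantics:
    a key is remembered only while its event time is at or above the
    watermark (max event time seen minus the allowed delay), so the
    dedup state stays bounded."""
    state = {}      # key -> event time when first seen
    watermark = 0   # advances as newer events arrive
    out = []
    for key, ts in events:
        watermark = max(watermark, ts - delay)
        # evict dedup state that has fallen behind the watermark
        state = {k: t for k, t in state.items() if t >= watermark}
        if key not in state:
            state[key] = ts
            out.append((key, ts))
    return out

# "a" at ts=2 is a duplicate within the window; "a" at ts=12 is not,
# because the first "a" was already evicted by the watermark.
print(dedup_within_watermark([("a", 1), ("a", 2), ("b", 10), ("a", 12)], delay=5))
```

In real Spark Structured Streaming this corresponds to deduplicating after `withWatermark()`, with the engine managing the state store.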


  8. Spark Connect GA in Apache Spark 3.4
     A modern data application: a thin client with the full power of Apache Spark.
     Applications, IDEs/notebooks, and programming-language SDKs use the Spark Connect Client API to reach an application gateway in front of Spark's (formerly monolithic) driver: analyzer, optimizer, scheduler, and the distributed execution engine.


  9. Spark Assistant
     New LLM-powered features, with prompt engineering by Spark experts.


  10. Delta.io supports streaming from the ground up


  11. Introducing Delta Kernel
     Implements the complete Delta data + metadata specification and unifies connector development.
     The Delta protocol and Delta Kernels back connectors across the Java, Python, and Rust ecosystems and Delta Sharing clients, including: kafka, flink, pulsar, beam, hive, trino, prestodb, glue, athena, emr, redshift, datahub, azure synapse, delta-spark, dlt (spark-r), pandas, aws-pandas-sdk, dask, ray, airbyte, duckdb, polars, arrow, Power BI, ballista, data fusion, Startree (pinot), plus C++, Excel, Golang, Java, R-Stats, Rust, and others.


  12. Delta Lake with UniForm: unifying the lakehouse formats. One copy of the data, stored as Parquet, with the metadata maintained alongside.


  13. Delta Sharing
     Lightning talk tomorrow at 3:30 PM, Meetup Hub


  14. An open standard for secure data sharing
     6,000+ active data consumers on Delta Sharing; 300+ PB per day of data shared with Delta Lake.
     Data provider (Delta Lake table) → Delta Sharing protocol → any compatible client (data consumer).
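     The client side of the protocol is deliberately simple: a recipient receives a small JSON profile (server endpoint plus bearer token) and addresses tables as share.schema.table. A hedged sketch of what a client assembles (placeholder values and a made-up endpoint; the open-source delta-sharing Python client wraps this behind calls like load_as_pandas):

```python
import json

def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' locator that
    Delta Sharing clients use to address a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"

# A recipient profile is a small JSON document: protocol version,
# server endpoint, and a bearer token (values here are placeholders).
profile = json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token>",
})

print(table_url("profile.share", "retail", "sales", "orders"))
# profile.share#retail.sales.orders
```

Because the protocol is an open REST standard, any client that can parse the profile and speak HTTP can consume the share, which is what makes the ecosystem on the next slide possible.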


  15. Delta Sharing Ecosystem
     3rd-party data vendors/clean rooms, open source clients, business intelligence/analytics, governance, SaaS/multi-cloud infrastructure, and hyperscalers (new: Carto).


  16. MLflow


  17. INTRODUCING: Model Serving and Monitoring
     Model Serving optimized for LLMs: Falcon-7B-Instruct, MPT-7B-Instruct, whisper-large-v2, stable-diffusion-2-1.


  18. INTRODUCING: MLflow AI Gateway
     Manage, govern, evaluate, and switch models easily.
     Multiple generative AI use cases across the organization (BI, pipelines, apps, users) go through the MLflow AI Gateway (credentials, caching, logging, rate limiting) to multiple generative AI models, with Model Serving and Monitoring.


  19. Demo Audience 1


  20. Let's do the math…
     This demo creates a sustained data rate of 43 million events / day.
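     The arithmetic behind "let's do the math": a sustained 43 million events per day works out to roughly 500 events per second.

```python
events_per_day = 43_000_000
seconds_per_day = 24 * 60 * 60           # 86,400 seconds in a day
rate = events_per_day / seconds_per_day  # sustained events per second
print(f"{rate:.0f} events/s")            # 498 events/s
```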


  21. Data Engineering
    on the Lakehouse


  22. Databricks Workflows
     Unified orchestration for data, analytics, and AI on the Lakehouse Platform (Delta Lake, Unity Catalog; BI & data warehousing, data streaming, data science & ML, data engineering).
     ● Simple authoring
     ● Actionable insights
     ● Proven reliability
     Example workflow: Sessions, Clicks, and Orders feed Join → Featurize → Aggregate → Analyze / Train.
     YipitData: Why we migrated from Airflow to Workflows


  23. Building Blocks of Databricks Workflows
     A unit of orchestration in Databricks Workflows is called a Job.
     Jobs consist of one or more Tasks: Databricks notebooks, Python scripts, Python wheels, SQL files/queries, Delta Live Tables pipelines, dbt, Java JAR files, Spark Submit, and DBSQL dashboards (Preview).
     Control flows can be established between Tasks: sequential, parallel, conditionals (Run If), and Jobs as a Task (modular).
     Jobs support different Triggers: manual, scheduled (cron), API, file arrival, Delta table update (Preview), and continuous/streaming (coming soon).
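     As a sketch, a Job with a small task DAG and a trigger could be described like this (a hypothetical spec in Python dict form, loosely modeled on the Jobs API; the field names are illustrative, not the exact API schema):

```python
# Hypothetical Job spec: two ingestion tasks fan in to a join task.
# Field names are illustrative, not the exact Jobs API schema.
job = {
    "name": "sessions_pipeline",
    "trigger": {"type": "file_arrival", "location": "s3://bucket/landing/"},
    "tasks": [
        {"key": "ingest_sessions", "notebook": "/pipelines/ingest_sessions"},
        {"key": "ingest_clicks", "notebook": "/pipelines/ingest_clicks"},
        {
            "key": "join",
            "notebook": "/pipelines/join",
            "depends_on": ["ingest_sessions", "ingest_clicks"],
        },
    ],
}

# Tasks with no dependencies can start in parallel; "join" waits for both.
roots = [t["key"] for t in job["tasks"] if not t.get("depends_on")]
print(roots)  # ['ingest_sessions', 'ingest_clicks']
```

The point of the building blocks above is exactly this shape: tasks, dependencies between them, and a trigger that decides when the whole DAG runs.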


  24. Serverless Workflows (Preview)
     Hands-off, auto-optimizing compute in Databricks' account.
     Problem: setting up, managing, and optimizing clusters is cumbersome and requires expert knowledge, wasting valuable time and resources.
     Benefit from Databricks' scale of compute and engineering expertise through serverless compute in Databricks' account:
     ● High efficiency: don't pay for idle; auto-optimized compute config
     ● Reliability: your critical workloads are shielded from cloud disruptions
     ● Faster startup: users don't have to wait, and critical data is always fresh
     ● Simplicity: every user can set up serverless


  25. What is Delta Live Tables?
     Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data.
     Modern software engineering for ETL processing: accelerate ETL development, automatically manage your infrastructure, have confidence in your data, simplify batch and streaming.
     https://databricks.com/product/delta-live-tables


  26. Reference Architecture
     Most use cases will use STs for ingestion and MVs for transformation.
     Bronze (cloud_files → CREATE STREAMING TABLE):
     ● Use a short retention period to avoid compliance risks and reduce costs
     ● Avoid complex transformations that could have bugs or drop important data
     ● Retain infinite history: easy to perform GDPR and other compliance tasks
     Silver/Gold (CREATE MATERIALIZED VIEW):
     ● Materialized views automatically handle complex joins/aggregations, and propagate updates and deletes
     ● Ad-hoc DML for GDPR / corrections


  27. Serverless Streaming optimizations (Preview)
     DLT Serverless also optimizes streaming TCO and latency! It dynamically optimizes compute and scheduling:
     • Pipelined execution of multiple microbatches
     • Dynamic tuning of batch sizes based on the amount of compute available
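     A toy illustration of the second point (pure Python with made-up numbers, not the actual DLT scheduler): pick a microbatch size proportional to the compute currently available, clamped to sane bounds.

```python
def batch_size(pending_events, available_cores,
               per_core_rate=1_000, min_size=100, max_size=50_000):
    """Toy model of dynamic batch sizing: scale the microbatch with the
    compute available, never exceeding the backlog or the configured
    bounds. All numbers are illustrative; the real scheduler is more
    sophisticated."""
    target = available_cores * per_core_rate
    return max(min_size, min(max_size, target, pending_events))

# Few cores: the batch shrinks so latency stays consistent.
print(batch_size(pending_events=8_000, available_cores=4))   # 4000
# Plenty of cores: the whole backlog fits in one batch.
print(batch_size(pending_events=8_000, available_cores=64))  # 8000
```

The design point is the same trade-off the slide names: smaller batches when compute is scarce keep latency predictable, while larger batches on ample compute reduce per-batch overhead and therefore TCO.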


  28. Demo Audience 2


  29. Delta Live Tables
    Link to blog


  30. Workflows or DLT?
     Often both: Workflows can orchestrate anything, including DLT.
     Use Workflows to run any task:
     ● At some schedule
     ● After other tasks have completed
     ● When a file arrives
     ● When another table is updated
     Use DLT for managing dataflow:
     ● Batch and streaming data transformations / quality
     ● Easy way to run Structured Streaming
     ● Creating/updating Delta tables


  31. The core abstractions of DLT
     You define datasets, and DLT automatically keeps them up to date.
     Streaming Tables: a Delta table with stream(s) writing to it. Used for:
     • Ingestion (files, message brokers)
     • Low-latency transformations
     • Huge scale
     Materialized Views: the result of a query, stored in a Delta table. Used for:
     • Transforming data
     • Building aggregate tables
     • Speeding up BI queries and reports
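     A pure-Python caricature of the two refresh semantics (an assumed simplification, not DLT code): a streaming table only processes rows that arrived since its last refresh, while a materialized view is the result of recomputing its query, so updates and deletes propagate.

```python
source = []           # rows arriving over time
streaming_table = []  # incrementally appended target
processed = 0         # how much of the source the stream has consumed

def refresh_streaming_table():
    """Streaming-table semantics: process only newly arrived rows."""
    global processed
    streaming_table.extend(source[processed:])
    processed = len(source)

def materialized_view():
    """Materialized-view semantics: recompute the query result
    (here: distinct sorted values), so changes propagate fully."""
    return sorted(set(streaming_table))

source += [3, 1, 2]
refresh_streaming_table()
source += [2, 4]
refresh_streaming_table()       # only [2, 4] are processed this time
print(streaming_table)          # [3, 1, 2, 2, 4]
print(materialized_view())      # [1, 2, 3, 4]
```

This is why the reference architecture pairs them: cheap append-only ingestion at the edge, and recomputed (deduplicated, joined, aggregated) results downstream.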


  32. Streaming does not always mean expensive
     Delta Live Tables lets you choose how often to update the results:
     ● Triggered manually: costs lowest; latency highest
     ● Triggered on a schedule using Databricks Jobs: costs depend on frequency; latency from 10 minutes to months
     ● Continuous: costs highest (for some workloads); latency from minutes to seconds


  33. Migrating from Apache Airflow to Databricks Workflows
     Challenge: a heavy burden on data engineers to create workflows for analysts, due to the high complexity of creating custom workflows with Airflow.
     Solution: migrated from Airflow to Databricks Workflows for a unified platform, giving analysts a simple way to own and manage their own workflows from data ingestion to downstream analytics.
     Impact: 60% lower database costs; 90% reduction in processing time.
     "If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows."
     —Hillevi Crognale, Engineering Manager, YipitData


  34. Delta Live Tables
    Link to blog


  35. New Databricks Demo Center
     databricks.com/demos
     Notebooks for this demo on GitHub; this demo on the Demo Center.


  36. Additional Resources
     Demos:
     ● Ingest from Kinesis: Earthquake detection with mobile phone sensors
     ● Ingest from Kafka: Corona detection with IoT fitness trackers
     ● Ingest with Auto Loader: Sentiment analysis for Twitter streams
     Blog: Why YipitData migrated from Airflow
     DLT: ETL with 1 billion rows for under $1
     Product tours (no account needed): Delta Live Tables & Workflows


  37. Technical Questions?
     Sign up for the Databricks Community!
     Ask your technical questions here: https://community.databricks.com/


  38. Thank You!
     @frankmunz
     Try Databricks free
