The Serverless, Real-Time Data Lakehouse in Action

Frank Munz
September 28, 2023

Unlike just a few years ago, today the lakehouse architecture is an established data platform embraced by all major cloud data companies such as AWS, Azure, Google, Oracle, Microsoft, Snowflake and Databricks.

This session kicks off with a technical, no-nonsense introduction to the lakehouse concept, dives deep into the lakehouse architecture and recaps how a data lakehouse is built from the ground up with streaming as a first-class citizen.

Then we focus on serverless for streaming use cases. Serverless concepts are well known to developers who trigger hundreds of thousands of AWS Lambda functions at negligible cost. However, the same concept becomes more interesting when applied to data platforms.

We have all heard the principle "It runs best on PowerPoint", so I decided to skip the slides here and bring a serverless demo instead:

A hands-on, fun, and interactive serverless streaming use case example where we ingest live events from hundreds of mobile devices (don't miss out - bring your phone and be part of it!!). Based on this use case I will critically explore how much of a modern lakehouse is serverless and how we implemented that at Databricks (spoiler alert: serverless is everywhere from data pipelines, workflows, optimized Spark APIs, to ML).

TL;DR benefits for the Data Practitioners:
- Recap the OSS foundation of the Lakehouse architecture and understand its appeal.
- Understand the benefits of leveraging a lakehouse for streaming and what's there beyond Spark Structured Streaming.
- Meat of the talk: The Serverless Lakehouse. I give you the tech bits beyond the hype. How does a serverless lakehouse differ from other serverless offers?
- Live, hands-on, interactive demo to explore serverless data engineering end-to-end. For each step we take a critical look and I explain what it means for you, e.g., saving costs and removing operational overhead.


  1. Standing on the Shoulders of Open-Source Giants: The Real-Time, Serverless Lakehouse in Action

    Frank Munz, Principal TMM, Databricks / Current.io 2023 @frankmunz. ©2022 Databricks Inc. — All rights reserved
  2. Databricks Lakehouse Platform

    Lakehouse Platform: Data Warehousing • Data Engineering • Data Science and ML • Data Streaming
    Unity Catalog: Fine-grained governance for data and AI
    Delta Lake: Data reliability and performance
    Cloud Data Lake: All structured and unstructured data
    Simple: Unify your data warehousing and AI use cases on a single platform
    Open: Built on open source and open standards
    Multicloud: One consistent data platform across clouds
  3. Subsecond Latency - Project Lightspeed

    Performance Improvements: Micro-Batch Pipelining • Offset Management • Log Purging • Consistent Latency for Stateful Pipelines • State Rebalancing • Adaptive Query Execution
    Enhanced Functionality: Multiple Stateful Operators • Arbitrary Stateful Processing in Python • Drop Duplicates Within Watermark • Native support for Protobuf
    Improved Observability: Python Query Listener
    Connectors & Ecosystem: Enhanced Fanout (EFO) • Trigger.AvailableNow support for Amazon Kinesis • Google Pub/Sub Connector • Integrations with Unity Catalog
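The "Drop Duplicates Within Watermark" item on this slide can be illustrated with a small pure-Python sketch. This is a simplified toy model, not Spark code: the class name, the 10-second delay, and the sample events are all assumptions made for illustration, and the real engine advances its watermark per micro-batch rather than per event. The idea it shows is that de-duplication state is only kept for a bounded time, so memory stays bounded at the price of missing duplicates that arrive far apart.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkDeduplicator:
    """Toy model of de-duplication bounded by a watermark (not Spark code)."""

    delay: int                      # watermark delay in seconds (assumed value)
    max_event_time: int = 0         # highest event time seen so far
    expiry_by_key: dict = field(default_factory=dict)

    def process(self, key: str, event_time: int) -> bool:
        """Return True if the event is emitted, False if it is dropped."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay

        # Forget keys whose retention window lies behind the watermark;
        # this is what keeps the state bounded.
        self.expiry_by_key = {
            k: expiry for k, expiry in self.expiry_by_key.items() if expiry > watermark
        }

        if event_time < watermark:
            return False            # too late: older than the watermark
        if key in self.expiry_by_key:
            return False            # duplicate seen within the window
        self.expiry_by_key[key] = event_time + self.delay
        return True


# Hypothetical (key, event_time) pairs.
events = [("a", 100), ("a", 105), ("b", 130), ("a", 125)]
dedup = WatermarkDeduplicator(delay=10)
results = [dedup.process(key, ts) for key, ts in events]
```

In this run the second "a" is dropped as a duplicate, while the last "a" is emitted again because its state had already expired. In PySpark 3.5 and later, the corresponding DataFrame method is `dropDuplicatesWithinWatermark`, used after `withWatermark`.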
  4. Spark Connect GA in Apache Spark 3.4

    Thin client, with full power of Apache Spark
    Applications • IDEs / Notebooks • Programming Languages / SDKs • Modern data application
    Spark Connect Client API • Application Gateway
    Spark’s Monolith Driver: Analyzer • Optimizer • Scheduler • Distributed Execution Engine
  5. Introducing Delta Kernel

    Implements the complete Delta Data + Metadata specification. Unifies connector development.
    Java Ecosystem • Python Ecosystem • Rust Ecosystem • Delta Sharing • Others • Delta Protocol • Delta Kernels
    aws- pandas-sdk • ray • airbyte • Power BI • pandas • dask • DuckDB • Startree (pinot) • beam • ballista • kafka • DataFusion • pulsar • flink • prestodb • hive • trino • glue • athena • emr • dlt (spark-r) • azure synapse • delta-spark • redshift • datahub • C++ • Excel • Golang • Java • Power BI • R-Stats • Rust • polars • arrow
  6. 6,000+ active data consumers on Delta Sharing; 300+ PB per day of data shared with Delta Lake

    An open standard for secure data sharing
    Data provider: Delta Lake table • Delta Sharing protocol • Data consumer: Any compatible client
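To make "any compatible client" more concrete, here is a minimal sketch of the two pieces a Delta Sharing consumer typically needs: a profile file with the endpoint and a bearer token, and a table address of the form `<profile>#<share>.<schema>.<table>`. The endpoint, token placeholder, share, schema, and table names below are hypothetical, and the helper functions are illustrative only, not part of any library.

```python
import json


def make_profile(endpoint: str, bearer_token: str) -> str:
    """Build the JSON text of a Delta Sharing profile file."""
    return json.dumps(
        {
            "shareCredentialsVersion": 1,
            "endpoint": endpoint,
            "bearerToken": bearer_token,
        },
        indent=2,
    )


def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Address a shared table as <profile>#<share>.<schema>.<table>."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Hypothetical values; a real provider hands out the endpoint and token.
profile_json = make_profile("https://sharing.example.com/delta-sharing/", "<token>")
url = table_url("config.share", "demo_share", "default", "events")
```

With the open-source `delta-sharing` Python connector installed and the profile saved to `config.share`, a URL like this is what gets passed to `delta_sharing.load_as_pandas`.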
  7. Delta Sharing Ecosystem

    3rd Party Data Vendors/Clean Room • Open Source Clients • Business Intelligence/Analytics • Governance • SaaS/Multi-Cloud Infrastructure • Hyperscalers
    NEW: Carto
  8. INTRODUCING: Model Serving optimized for LLMs

    Model Serving and Monitoring
    Falcon-7B-Instruct • whisper-large-v2 • stable-diffusion-2-1 • MPT-7B-Instruct
  9. INTRODUCING: MLflow AI Gateway. Manage, govern, evaluate, and switch models easily

    Multiple Generative AI use cases across the organization: BI • Pipelines • Apps • Users
    MLflow AI Gateway: Credentials • Caching • Logging • Rate limiting
    Multiple Generative AI Models • Model Serving and Monitoring
  10. Databricks Workflows

    Unified orchestration for data, analytics, and AI on the Lakehouse Platform
    • Simple authoring • Actionable insights • Proven reliability
    Lakehouse Platform: BI & Data Warehousing • Data Streaming • Data Science & ML • Data Engineering • Unity Catalog • Delta Lake
    Workflows: Sessions • Clicks • Orders • Join • Featurize • Aggregate • Analyze • Train
    YipitData: Why we migrated from Airflow to Workflows
  11. Building Blocks of Databricks Workflows

    A unit of orchestration in Databricks Workflows is called a Job.
    Jobs consist of one or more Tasks: Databricks Notebooks • Python Scripts • Python Wheels • SQL Files/Queries • Delta Live Tables Pipeline • dbt • Java JAR file • Spark Submit • DBSQL Dashboards (Preview)
    Control flows can be established between Tasks: Sequential • Parallel • Conditionals (Run If) • Jobs as a Task (Modular)
    Jobs support different Triggers: Manual Trigger • Scheduled (Cron) • API Trigger • File Arrival • Delta Table Update • Continuous (Streaming)
    Preview • Coming Soon
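The Job, Task, and control-flow vocabulary from this slide can be sketched as plain data. The example below is an assumption-laden illustration: the field names (`task_key`, `depends_on`, `notebook_task`, `schedule`) are modeled on the Databricks Jobs API as I understand it, the task names are borrowed from the Workflows diagram, and the wiring between tasks, the job name, and the notebook paths are invented. The standard-library `graphlib` module then derives one valid sequential run order from the declared dependencies.

```python
from graphlib import TopologicalSorter


def task(key, depends_on=()):
    """Build one task entry; the notebook path is a hypothetical placeholder."""
    return {
        "task_key": key,
        "depends_on": [{"task_key": d} for d in depends_on],
        "notebook_task": {"notebook_path": f"/Demo/{key}"},
    }


# A job is a set of tasks plus a trigger; here a cron schedule (hourly).
job = {
    "name": "clickstream-demo",  # hypothetical job name
    "schedule": {"quartz_cron_expression": "0 0 * * * ?", "timezone_id": "UTC"},
    "tasks": [
        task("sessions"),
        task("clicks"),
        task("orders"),
        task("join", ["sessions", "clicks", "orders"]),  # runs after all three
        task("featurize", ["join"]),
        task("aggregate", ["join"]),   # featurize and aggregate can run in parallel
        task("train", ["featurize"]),
        task("analyze", ["aggregate"]),
    ],
}


def run_order(job_spec):
    """Return one execution order that respects every depends_on edge."""
    graph = {
        t["task_key"]: {d["task_key"] for d in t["depends_on"]}
        for t in job_spec["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(job)
```

Tasks without a dependency path between them, such as `featurize` and `aggregate`, may run in parallel; the sorter only guarantees that every task appears after the tasks it depends on.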
  12. Serverless Workflows: Hands-off, auto-optimizing compute in Databricks’ account (PREVIEW)

    Problem: Setting up, managing, and optimising clusters is cumbersome and requires expert knowledge, wasting valuable time and resources.
    Benefit from Databricks’ scale of compute and engineering expertise through Serverless compute in Databricks’ account:
    • High efficiency: Don’t pay for idle, auto-optimize compute config
    • Reliability so your critical workloads are shielded from cloud disruptions
    • Faster startup: So users don’t have to wait and critical data is always fresh
    • Simplicity that enables every user to set up serverless
  13. What is Delta Live Tables? Modern software engineering for ETL processing

    Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data.
    • Accelerate ETL Development • Automatically manage your infrastructure • Have confidence in your data • Simplify batch and streaming
    https://databricks.com/product/delta-live-tables
  14. Reference Architecture

    Most use cases will use STs (streaming tables) for ingestion and MVs (materialized views) for transformation.
    Bronze: cloud_files, CREATE STREAMING TABLE. Use a short retention period to avoid compliance risks and reduce costs. Avoid complex transformations that could have bugs or drop important data. Retain infinite history. Easy to perform GDPR and other compliance tasks.
    Silver/Gold: CREATE MATERIALIZED VIEW. Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes.
    Ad-hoc DML for GDPR / Corrections
  15. Serverless Streaming optimizations (PREVIEW)

    DLT Serverless also optimizes streaming TCO and latency!
    DLT Serverless dynamically optimizes compute and scheduling:
    • Pipelined execution of multiple microbatches
    • Dynamic tuning of batch sizes based on the amount of compute available
  16. Workflows Or DLT? Often Both: Workflows can orchestrate anything, including DLT

    Use Workflows to run any task: • At some schedule • After other tasks have completed • When a file arrives • When another table is updated
    Use DLT for managing dataflow: • Batch and streaming data transformations / quality • Easy way to run Structured Streaming • Creating/updating delta tables
  17. The core abstractions of DLT

    You define datasets, and DLT automatically keeps them up to date.
    Streaming Tables: A delta table with stream(s) writing to it. Used for: • Ingestion (files, message brokers) • Low latency transformations • Huge scale
    Materialized View: The result of a query, stored in a delta table. Used for: • Transforming data • Building aggregate tables • Speeding up BI queries and reports
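The difference between the two abstractions can be shown with a toy model in plain Python. This is a conceptual sketch only and does not use the DLT API: the class names, the `events_per_device` query, and the sample records are all hypothetical. The streaming table appends each source record exactly once, while the materialized view stores the result of a query and is recomputed on refresh (real engines may refresh incrementally; the sketch always recomputes in full).

```python
from collections import Counter


class StreamingTable:
    """Append-only table that reads each source record exactly once."""

    def __init__(self):
        self.rows = []
        self._offset = 0  # position of the next unread source record

    def update(self, source):
        new_rows = source[self._offset:]
        self.rows.extend(new_rows)
        self._offset = len(source)
        return len(new_rows)  # number of records ingested by this update


class MaterializedView:
    """Stores the result of a query over a table."""

    def __init__(self, query):
        self.query = query
        self.result = None

    def refresh(self, rows):
        self.result = self.query(rows)  # full recompute in this toy model
        return self.result


def events_per_device(rows):
    return dict(Counter(row["device"] for row in rows))


# Hypothetical sensor events from mobile phones.
source = [{"device": "phone-1", "value": 0.2}, {"device": "phone-2", "value": 0.5}]

bronze = StreamingTable()
first_batch = bronze.update(source)       # ingests both records
source.append({"device": "phone-1", "value": 0.9})
second_batch = bronze.update(source)      # ingests only the new record

view = MaterializedView(events_per_device)
counts_before = view.refresh(bronze.rows)

# An ad-hoc correction (for example a deletion request) on the table is
# reflected in the view on its next refresh.
bronze.rows = [row for row in bronze.rows if row["device"] != "phone-2"]
counts_after = view.refresh(bronze.rows)
```

The second update only picks up the newly arrived record, which is what makes streaming tables a fit for ingestion, and the view reflects the deletion after its refresh, which mirrors the "propagate updates and deletes" behavior described for materialized views.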
  18. Streaming does not always mean expensive

    Delta Live Tables lets you choose how often to update the results.
    Triggered: Manually. Costs: lowest. Latency: highest.
    Triggered: On a schedule using Databricks Jobs. Costs: depends on frequency. Latency: 10 minutes to months.
    Continually. Costs: highest (for some workloads). Latency: minutes to seconds.
  19. Migrating from Apache Airflow to Databricks Workflows

    Challenge: Heavy burden on Data Engineers to create workflows for analysts due to the high complexity of creating custom workflows with Airflow.
    Solution: Migrated from Airflow to Databricks Workflows for a unified platform providing analysts a simple way to own and manage their own workflows from data ingestion to downstream analytics.
    Impact: 60% lower database costs • 90% reduction in processing time
    “If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows.” —Hillevi Crognale, Engineering Manager, YipitData
  20. New Databricks Demo Center

    databricks.com/demos
    • Notebooks for this demo on GitHub • This demo on Demo Center
  21. Additional Resources

    Demos: • Ingest from Kinesis: Earthquake detection with mobile phone sensors • Ingest from Kafka: Corona Detection with IoT Fitness Trackers • Ingest with Auto Loader: Sentiment Analysis for Twitter streams
    Blogs: • Why YipitData migrated from Airflow • DLT: ETL with 1 billion rows for under $1
    Product Tours without account: Delta Live Tables & Workflows
  22. Technical Questions?

    Sign up for the Databricks Community! Ask your technical questions here: https://community.databricks.com/