The Serverless, Real-Time Data Lakehouse in Action

©2022 Databricks Inc. — All rights reserved
Standing on the Shoulders of Open-Source Giants
The Real-Time, Serverless Lakehouse in Action
Frank Munz, Principal TMM, Databricks / Current.io 2023
@frankmunz

Databricks Lakehouse Platform
• Data Warehousing, Data Engineering, Data Science and ML, Data Streaming
• Unity Catalog: Fine-grained governance for data and AI
• Delta Lake: Data reliability and performance
• Cloud Data Lake: All structured and unstructured data
• Simple: Unify your data warehousing and AI use cases on a single platform
• Open: Built on open source and open standards
• Multicloud: One consistent data platform across clouds

Standing on the Shoulders of OSS: What is new?

Annual downloads > 1 Billion

3,600 contributors, 40,000 commits. #1 in dev activity for 10 years.

Subsecond Latency - Project Lightspeed
Performance Improvements
• Micro-Batch Pipelining
• Offset Management
• Log Purging
• Consistent Latency for Stateful Pipelines
• State Rebalancing
• Adaptive Query Execution
Enhanced Functionality
• Multiple Stateful Operators
• Arbitrary Stateful Processing in Python
• Drop Duplicates Within Watermark
• Native support for Protobuf
Improved Observability
• Python Query Listener
Connectors & Ecosystem
• Enhanced Fanout (EFO)
• Trigger.AvailableNow support for Amazon Kinesis
• Google Pub/Sub Connector
• Integrations with Unity Catalog

Spark Connect GA in Apache Spark 3.4
Thin client, with full power of Apache Spark
• Clients: Applications, IDEs / Notebooks, Programming Languages / SDKs, Modern data application
• Spark Connect Client API
• Spark’s Monolith Driver: Application Gateway, Analyzer, Optimizer, Scheduler, Distributed Execution Engine

Spark Assistant
• Prompt engineering by Spark experts
• New LLM-powered features

Delta.io supports streaming from the ground up

Introducing Delta Kernel
Implements the complete Delta Data + Metadata specification. Unifies connector development.
[Diagram labels]
• Core: Delta Protocol, Delta Kernels, Delta Sharing
• Ecosystems: Java Ecosystem, Python Ecosystem, Rust Ecosystem
• Languages and clients: C++, Excel, Golang, Java, Power BI, R-Stats, Rust, Others
• Projects and services: aws-pandas-sdk, ray, airbyte, Power BI, pandas, dask, DuckDB, StarTree (Pinot), beam, ballista, kafka, datafusion, pulsar, flink, prestodb, hive, trino, glue, athena, emr, dlt (spark-r), azure synapse, delta-spark, redshift, datahub, polars, arrow

Delta UniForm: Unifying the lakehouse formats
[Diagram labels: Delta Lake With UniForm, Metadata, Data, Parquet]

Delta Sharing
Lightning talk tomorrow at 3:30 PM, Meetup Hub

Delta Sharing: An open standard for secure data sharing
• 6,000+ active data consumers on Delta Sharing
• 300+ PB per day of data shared with Delta Lake
• Flow: Data provider (Delta Lake table), Delta Sharing protocol, Data consumer (any compatible client)

Delta Sharing Ecosystem
• 3rd Party Data Vendors/Clean Room
• Open Source Clients
• Business Intelligence/Analytics
• Governance
• SaaS/Multi-Cloud Infrastructure
• Hyperscalers
• Carto (NEW)

INTRODUCING: Model Serving optimized for LLMs
Model Serving and Monitoring
• Falcon-7B-Instruct
• whisper-large-v2
• stable-diffusion-2-1
• MPT-7B-Instruct

INTRODUCING: MLflow AI Gateway
Manage, govern, evaluate, and switch models easily
• Users: Multiple Generative AI use cases across the organization (BI, Pipelines, Apps)
• MLflow AI Gateway: Credentials, Caching, Logging, Rate limiting
• Multiple Generative AI Models; Model Serving and Monitoring

Let's do the math…
This demo creates a sustained data rate of 43 million events / day.
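
To put the daily figure in per-second terms, the conversion can be sketched as below. Only the 43 million events / day number comes from the slide; the variable names are illustrative, and the result is an average that assumes events arrive evenly over the day.

```python
# Convert the slide's daily event count into average per-second and per-minute rates.
EVENTS_PER_DAY = 43_000_000        # figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds
MINUTES_PER_DAY = 24 * 60          # 1,440 minutes

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_per_minute = EVENTS_PER_DAY / MINUTES_PER_DAY

print(f"{events_per_second:.1f} events/s")
print(f"{events_per_minute:,.0f} events/min")
```

In other words, 43 million events per day corresponds to a sustained average of roughly 500 events per second.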

Data Engineering on the Lakehouse

Databricks Workflows
Unified orchestration for data, analytics, and AI on the Lakehouse Platform
• Simple authoring
• Actionable insights
• Proven reliability
Lakehouse Platform: BI & Data Warehousing, Data Streaming, Data Science & ML, Data Engineering, on Unity Catalog and Delta Lake
Example workflow (diagram): Sessions, Clicks, Orders, Join, Featurize, Aggregate, Analyze, Train
YipitData: Why we migrated from Airflow to Workflows

Building Blocks of Databricks Workflows
A unit of orchestration in Databricks Workflows is called a Job.
Jobs consist of one or more Tasks:
• Databricks Notebooks
• Python Scripts
• Python Wheels
• SQL Files/Queries
• Delta Live Tables Pipeline
• dbt
• Java JAR file
• Spark Submit
• DBSQL Dashboards
Control flows can be established between Tasks:
• Sequential
• Parallel
• Conditionals (Run If)
• Jobs as a Task (Modular)
Jobs support different Triggers:
• Manual Trigger
• Scheduled (Cron)
• API Trigger
• File Arrival
• Delta Table Update
• Continuous (Streaming)
Some items on the slide are marked Preview or Coming Soon.
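
The building blocks above (a Job made of Tasks, with control flow between them and a trigger) can be sketched as plain data. This is a minimal illustration, not an official job definition: the field names (`task_key`, `depends_on`, `run_if`, `notebook_task`, `schedule`) follow the general shape of the Databricks Jobs API as I understand it and should be checked against its documentation. The task names come from the Workflows diagram earlier in the deck, while the dependency edges, notebook paths, job name, and cron expression are assumptions made for the example.

```python
import json
from graphlib import TopologicalSorter


def task(key, depends_on=(), run_if="ALL_SUCCESS"):
    """Build one task entry. The notebook path is a hypothetical placeholder."""
    return {
        "task_key": key,
        "depends_on": [{"task_key": upstream} for upstream in depends_on],
        "run_if": run_if,
        "notebook_task": {"notebook_path": f"/Workflows/demo/{key.lower()}"},
    }


# Hypothetical job: three parallel ingest tasks, a join, then two branches.
job = {
    "name": "clickstream_orders_demo",
    # Scheduled (Cron) trigger: assumed to run at the top of every hour.
    "schedule": {"quartz_cron_expression": "0 0 * * * ?", "timezone_id": "UTC"},
    "tasks": [
        task("Sessions"),
        task("Clicks"),
        task("Orders"),
        task("Join", depends_on=["Sessions", "Clicks", "Orders"]),
        task("Featurize", depends_on=["Join"]),
        task("Aggregate", depends_on=["Join"]),
        task("Train", depends_on=["Featurize"]),
        # Conditional (Run If): run once upstream has finished, even on failure.
        task("Analyze", depends_on=["Aggregate"], run_if="ALL_DONE"),
    ],
}


def execution_stages(tasks):
    """Group tasks into stages; tasks in the same stage could run in parallel."""
    graph = {t["task_key"]: {d["task_key"] for d in t["depends_on"]} for t in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(job["tasks"])
payload = json.dumps(job, indent=2)  # JSON body one might submit to a jobs endpoint

for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

The staging helper only shows how sequential and parallel control flow fall out of the `depends_on` edges; the actual scheduling is done by Workflows itself.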

Serverless Workflows (PREVIEW)
Hands-off, auto-optimizing compute in Databricks’ account
Problem: Setting up, managing, and optimising clusters is cumbersome and requires expert knowledge, wasting valuable time and resources.
Benefit from Databricks’ scale of compute and engineering expertise through Serverless compute in Databricks’ account:
• High efficiency: Don’t pay for idle, auto-optimize compute config
• Reliability so your critical workloads are shielded from cloud disruptions
• Faster startup: So users don’t have to wait and critical data is always fresh
• Simplicity that enables every user to set up serverless

What is Delta Live Tables?
Modern software engineering for ETL processing
Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data.
• Accelerate ETL Development
• Automatically manage your infrastructure
• Have confidence in your data
• Simplify batch and streaming
https://databricks.com/product/delta-live-tables

Reference Architecture
Most use cases will use streaming tables (STs) for ingestion and materialized views (MVs) for transformation
Source (cloud_files):
• Use a short retention period to avoid compliance risks and reduce costs
Bronze (CREATE STREAMING TABLE):
• Avoid complex transformations that could have bugs or drop important data
• Retain infinite history
• Easy to perform GDPR and other compliance tasks
• Ad-hoc DML for GDPR / Corrections
Silver/Gold (CREATE MATERIALIZED VIEW):
• Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes.

Serverless Streaming optimizations (PREVIEW)
DLT Serverless also optimizes streaming TCO and latency!
DLT Serverless dynamically optimizes compute and scheduling:
• Pipelined execution of multiple microbatches
• Dynamic tuning of batch sizes based on the amount of compute available

Workflows Or DLT? Often Both: Workflows can orchestrate anything, including DLT
Use DLT for managing dataflow
• Batch and streaming data transformations / quality
• Easy way to run Structured Streaming
• Creating/updating Delta tables
Use Workflows to run any task
• At some schedule
• After other tasks have completed
• When a file arrives
• When another table is updated

The core abstractions of DLT
You define datasets, and DLT automatically keeps them up to date
Streaming Tables: A Delta table with stream(s) writing to it. Used for:
• Ingestion (files, message brokers)
• Low latency transformations
• Huge scale
Materialized Views: The result of a query, stored in a Delta table. Used for:
• Transforming data
• Building aggregate tables
• Speeding up BI queries and reports
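
The difference between the two abstractions can be illustrated with a toy in-memory model. This is not the DLT API: the classes `StreamingTable` and `MaterializedView`, the `clicks_per_user` query, and the sample records are all hypothetical, and the view here is simply recomputed in full on each refresh. The sketch only shows the behaviour described on the slides: a streaming table ingests each new source record once, while a materialized view stores a query result that reflects later changes, including deletes, after a refresh.

```python
from collections import Counter


class StreamingTable:
    """Toy streaming table: appends each new source record exactly once."""

    def __init__(self):
        self.rows = []
        self._offset = 0  # number of source records already ingested

    def update(self, source):
        new_records = source[self._offset:]
        self.rows.extend(new_records)
        self._offset = len(source)
        return len(new_records)


class MaterializedView:
    """Toy materialized view: stores the result of a query over a table."""

    def __init__(self, query):
        self.query = query
        self.result = None

    def refresh(self, table):
        # Recomputed in full here for simplicity.
        self.result = self.query(table.rows)
        return self.result


def clicks_per_user(rows):
    """Hypothetical aggregate query: count records per user."""
    return dict(Counter(row["user"] for row in rows))


source = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
bronze = StreamingTable()
gold = MaterializedView(clicks_per_user)

first_batch = bronze.update(source)    # ingests the 3 initial records
gold.refresh(bronze)

source.append({"user": "b"})
second_batch = bronze.update(source)   # ingests only the 1 new record
gold.refresh(bronze)
counts_before_delete = dict(gold.result)

# Ad-hoc delete on the streaming table (for example a GDPR request for user "a");
# the next refresh of the view reflects the deletion.
bronze.rows = [row for row in bronze.rows if row["user"] != "a"]
gold.refresh(bronze)
counts_after_delete = dict(gold.result)

print(first_batch, second_batch, counts_before_delete, counts_after_delete)
```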

Streaming does not always mean expensive
Delta Live Tables lets you choose how often to update the results.
• Triggered manually. Costs: lowest. Latency: highest.
• Triggered on a schedule using Databricks Jobs. Costs: depends on frequency. Latency: 10 minutes to months.
• Continually. Costs: highest. Latency: minutes to seconds (for some workloads).

Migrating from Apache Airflow to Databricks Workflows
Challenge: Heavy burden on Data Engineers to create workflows for analysts due to the high complexity of creating custom workflows with Airflow.
Solution: Migrated from Airflow to Databricks Workflows for a unified platform providing analysts a simple way to own and manage their own workflows from data ingestion to downstream analytics.
Impact:
• 60% lower database costs
• 90% reduction in processing time
“If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows.”
Hillevi Crognale, Engineering Manager, YipitData

Additional Resources
Demos:
• Ingest from Kinesis: Earthquake detection with mobile phone sensors
• Ingest from Kafka: Corona Detection with IoT Fitness Trackers
• Ingest with Auto Loader: Sentiment Analysis for Twitter streams
Blogs:
• Why YipitData migrated from Airflow
• DLT: ETL with 1 billion rows for under $1
Product Tours without account:
• Delta Live Tables & Workflows
