Slide 1

Slide 1 text

©2022 Databricks Inc. — All rights reserved Standing on the Shoulders of Open-Source Giants The Real-Time, Serverless Lakehouse in Action Frank Munz, Principal TMM, Databricks / Current.io 2023 @frankmunz

Slide 2

Slide 2 text

©2022 Databricks Inc. — All rights reserved Databricks Lakehouse Platform: Data Warehousing, Data Engineering, Data Science and ML, and Data Streaming over all structured and unstructured data in a cloud data lake. Unity Catalog provides fine-grained governance for data and AI; Delta Lake provides data reliability and performance. Simple: unify your data warehousing and AI use cases on a single platform. Open: built on open source and open standards. Multicloud: one consistent data platform across clouds.

Slide 3

Slide 3 text

Standing on the Shoulders of OSS What is new?

Slide 4

Slide 4 text

©2021 Databricks Inc. — All rights reserved Apache Spark

Slide 5

Slide 5 text

Annual downloads > 1 Billion

Slide 6

Slide 6 text

3,600 contributors, 40,000 commits #1 in dev activity for 10 years

Slide 7

Slide 7 text

Subsecond Latency - Project Lightspeed
Performance Improvements: Micro-Batch Pipelining ● Offset Management ● Log Purging ● Consistent Latency for Stateful Pipelines ● State Rebalancing ● Adaptive Query Execution
Enhanced Functionality: Multiple Stateful Operators ● Arbitrary Stateful Processing in Python ● Drop Duplicates Within Watermark ● Native Support for Protobuf
Improved Observability: Python Query Listener
Connectors & Ecosystem: Enhanced Fanout (EFO) ● Trigger.AvailableNow support for Amazon Kinesis ● Google Pub/Sub Connector ● Integrations with Unity Catalog
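One of the Lightspeed items above, Drop Duplicates Within Watermark, is exposed in PySpark as dropDuplicatesWithinWatermark (Spark 3.5+). A minimal sketch of using it on a streaming source; the Kafka broker, topic name, and checkpoint path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lightspeed-sketch").getOrCreate()

# Hypothetical Kafka source; broker address and topic name are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Drop duplicate keys that arrive within a 10-minute watermark window.
deduped = (
    events
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicatesWithinWatermark(["key"])
)

# Console sink for illustration only; checkpoint path is a placeholder.
query = (
    deduped.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/dedup")
    .start()
)
```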

Slide 8

Slide 8 text

Spark Connect: GA in Apache Spark 3.4. Applications, IDEs / notebooks, and programming language SDKs become thin clients with the full power of Apache Spark: instead of embedding Spark's monolithic driver, a modern data application talks through the Spark Connect client API to an application gateway in front of the analyzer, optimizer, scheduler, and distributed execution engine.
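A minimal sketch of the thin-client model using the PySpark Spark Connect API (Spark 3.4+); the connection string below is a local placeholder, and Databricks clusters expose their own remote endpoint:

```python
from pyspark.sql import SparkSession

# Connect a thin client to a remote Spark Connect server; the endpoint is a
# placeholder (requires pyspark[connect] and a running Spark Connect server).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The DataFrame API is unchanged: the plan is built locally and executed remotely.
df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().orderBy("bucket").show()
```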

Slide 9

Slide 9 text

Spark Assistant: new LLM-powered features, with prompt engineering by Spark experts.

Slide 10

Slide 10 text

©2021 Databricks Inc. — All rights reserved Delta.io supports streaming from the ground up
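As a small illustration of streaming being built in, here is a hedged sketch that reads one Delta table as a stream and appends to another; the session config assumes the open source delta-spark package, and table names and the checkpoint path are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed (pip install delta-spark).
spark = (
    SparkSession.builder
    .appName("delta-streaming-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read an existing Delta table as a stream; table names are placeholders.
bronze = spark.readStream.format("delta").table("events_bronze")

# Continuously append filtered rows to a downstream Delta table.
query = (
    bronze.where("value IS NOT NULL")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events_silver")  # placeholder
    .outputMode("append")
    .toTable("events_silver")
)
```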

Slide 11

Slide 11 text

Introducing Delta Kernel: implements the complete Delta data + metadata specification and unifies connector development. Ecosystem diagram: the Delta protocol and Delta Kernels sit beneath connectors across the Java, Python, and Rust ecosystems and Delta Sharing clients, including Spark, Flink, Kafka, Pulsar, Beam, Trino, Presto, Hive, StarTree (Pinot), Glue, Athena, EMR, Azure Synapse, Redshift, DataHub, pandas, Dask, Ray, Airbyte, DuckDB, Polars, Arrow, DataFusion, Ballista, Power BI, Excel, Golang, Java, R, C++, and others.

Slide 12

Slide 12 text

Delta UniForm: unifying the lakehouse formats. A single copy of Parquet data files with Delta metadata; UniForm additionally generates the metadata other lakehouse formats need to read the same table.

Slide 13

Slide 13 text

©2021 Databricks Inc. — All rights reserved Delta Sharing: lightning talk tomorrow at 3:30 PM in the Meetup Hub

Slide 14

Slide 14 text

An open standard for secure data sharing: 6,000+ active data consumers on Delta Sharing, and 300+ PB of data shared with Delta Lake per day. A data provider exposes a Delta Lake table via the Delta Sharing protocol, and any compatible client can consume it.
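On the consumer side, the open source delta-sharing Python client reads shared tables with nothing but a profile file from the provider. A minimal sketch; the profile path and the share/schema/table names are placeholders:

```python
import delta_sharing

# Profile file supplied by the data provider (placeholder path).
profile = "/path/to/config.share"

# List everything the provider has shared with us.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas: "<profile>#<share>.<schema>.<table>".
df = delta_sharing.load_as_pandas(f"{profile}#my_share.my_schema.my_table")
print(df.head())
```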

Slide 15

Slide 15 text

Delta Sharing Ecosystem: 3rd-party data vendors / clean rooms, open source clients, business intelligence / analytics, governance, SaaS / multi-cloud infrastructure, and hyperscalers. New: Carto.

Slide 16

Slide 16 text

©2021 Databricks Inc. — All rights reserved MLflow

Slide 17

Slide 17 text

INTRODUCING: Model Serving and Monitoring, with Model Serving optimized for LLMs such as Falcon-7B-Instruct, MPT-7B-Instruct, whisper-large-v2, and stable-diffusion-2-1.
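A served model is queried over REST. A hedged sketch of calling a Databricks Model Serving endpoint; the workspace URL, endpoint name, and token are placeholders, and the exact JSON payload depends on the served model's signature:

```python
import requests

WORKSPACE_URL = "https://<workspace>.cloud.databricks.com"  # placeholder
ENDPOINT = "mpt-7b-instruct"                                 # placeholder name
TOKEN = "<personal-access-token>"                            # placeholder

# Score the endpoint; "inputs" is one of the MLflow scoring payload formats.
response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"inputs": ["Summarize Delta Lake in one sentence."]},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```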

Slide 18

Slide 18 text

INTRODUCING: MLflow AI Gateway (Model Serving and Monitoring). Manage, govern, evaluate, and switch models easily: users, BI, pipelines, and apps serving multiple generative AI use cases across the organization go through the MLflow AI Gateway, which handles credentials, caching, logging, and rate limiting in front of multiple generative AI models.
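A hedged sketch of querying a gateway route with the MLflow AI Gateway Python client (introduced around MLflow 2.5; the client API has since evolved). The gateway URI and route name are placeholders, and the routes themselves must be configured separately with provider credentials:

```python
from mlflow.gateway import set_gateway_uri, query

# Point the client at a running AI Gateway server (placeholder URI).
set_gateway_uri("http://localhost:5000")

# Query a pre-configured completions route; the gateway handles credentials,
# caching, logging, and rate limiting behind this one call.
response = query(
    route="completions",
    data={"prompt": "Explain the lakehouse architecture in one paragraph."},
)
print(response)
```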

Slide 19

Slide 19 text

©2021 Databricks Inc. — All rights reserved Demo Audience 1

Slide 20

Slide 20 text

Let's do the math… This demo creates a sustained data rate of 43 million events per day.
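A quick sanity check on that figure, assuming events arrive evenly over the day: 43,000,000 events ÷ 86,400 seconds ≈ 500 events per second sustained.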

Slide 21

Slide 21 text

Data Engineering on the Lakehouse

Slide 22

Slide 22 text

©2022 Databricks Inc. — All rights reserved Databricks Workflows: unified orchestration for data, analytics, and AI on the Lakehouse Platform (Unity Catalog, Delta Lake, BI & data warehousing, data streaming, data science & ML, data engineering).
● Simple authoring ● Actionable insights ● Proven reliability
Example workflow tasks: Sessions, Clicks, Orders, Join, Featurize, Aggregate, Analyze, Train. YipitData: why we migrated from Airflow to Workflows.

Slide 23

Slide 23 text

©2022 Databricks Inc. — All rights reserved Building Blocks of Databricks Workflows
A unit of orchestration in Databricks Workflows is called a Job. Jobs consist of one or more Tasks: Databricks notebooks, Python scripts, Python wheels, SQL files/queries, Delta Live Tables pipelines, dbt, Java JAR files, Spark Submit, and DBSQL dashboards (Preview).
Control flows can be established between Tasks: sequential, parallel, conditionals (Run If), and Jobs as a Task (modular).
Jobs support different triggers: manual, scheduled (cron), API, file arrival, Delta table update, and continuous (streaming); some of these triggers are in Preview or coming soon.
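A hedged sketch of defining such a Job programmatically with the Databricks SDK for Python: two notebook tasks with a dependency between them. The notebook paths and cluster ID are placeholders, and credentials are assumed to come from the environment:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads host/token from env vars or a config profile

created = w.jobs.create(
    name="demo-pipeline",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/demo/ingest"),
            existing_cluster_id="1234-567890-abcdefgh",  # placeholder cluster
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],  # sequential flow
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/demo/transform"),
            existing_cluster_id="1234-567890-abcdefgh",
        ),
    ],
)
print(f"Created job {created.job_id}")
```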

Slide 24

Slide 24 text

©2022 Databricks Inc. — All rights reserved Serverless Workflows (Preview): hands-off, auto-optimizing compute in Databricks' account.
Problem: setting up, managing, and optimizing clusters is cumbersome and requires expert knowledge, wasting valuable time and resources.
Benefit from Databricks' scale of compute and engineering expertise through serverless compute in Databricks' account:
● High efficiency: don't pay for idle, auto-optimize compute config
● Reliability, so your critical workloads are shielded from cloud disruptions
● Faster startup, so users don't have to wait and critical data is always fresh
● Simplicity that enables every user to set up serverless

Slide 25

Slide 25 text

©2022 Databricks Inc. — All rights reserved What is Delta Live Tables? Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically manages your infrastructure at scale so data analysts and engineers can spend less time on tooling and focus on getting value from data. Modern software engineering for ETL processing: accelerate ETL development, automatically manage your infrastructure, have confidence in your data, and simplify batch and streaming. https://databricks.com/product/delta-live-tables

Slide 26

Slide 26 text

©2022 Databricks Inc. — All rights reserved Reference Architecture: most use cases will use STs for ingestion and MVs for transformation (a code sketch follows below).
Bronze: cloud_files + CREATE STREAMING TABLE. Use a short retention period to avoid compliance risks and reduce costs; avoid complex transformations that could have bugs or drop important data; retaining infinite history makes GDPR and other compliance tasks easy to perform.
Silver/Gold: CREATE MATERIALIZED VIEW. Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes. Ad-hoc DML for GDPR / corrections.
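A hedged sketch of this bronze/silver pattern in DLT's Python API: a streaming table ingesting raw JSON with Auto Loader, and a downstream table defined over a batch read (which DLT maintains as a materialized view) with a data quality expectation. The source path and column names are placeholders, and `spark` is provided by the DLT runtime:

```python
import dlt
from pyspark.sql.functions import col

# Bronze: a streaming table ingesting raw files incrementally with Auto Loader.
@dlt.table(comment="Raw events ingested with Auto Loader")
def events_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo/raw/events/")  # placeholder path
    )

# Silver: a batch read, so DLT keeps it up to date as a materialized view;
# the expectation drops rows that fail the data quality rule.
@dlt.table(comment="Cleaned events, kept up to date by DLT")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def events_silver():
    return dlt.read("events_bronze").where(col("value") > 0)
```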

Slide 27

Slide 27 text

©2022 Databricks Inc. — All rights reserved Serverless Streaming Optimizations (Preview): DLT Serverless also optimizes streaming TCO and latency. DLT Serverless dynamically optimizes compute and scheduling: pipelined execution of multiple microbatches, and dynamic tuning of batch sizes based on the amount of compute available.

Slide 28

Slide 28 text

©2021 Databricks Inc. — All rights reserved Demo Audience 2

Slide 29

Slide 29 text

©2022 Databricks Inc. — All rights reserved 29 Delta Live Tables Link to blog

Slide 30

Slide 30 text

©2022 Databricks Inc. — All rights reserved Workflows or DLT? Often both: Workflows can orchestrate anything, including DLT.
Use Workflows to run any task: at some schedule, after other tasks have completed, when a file arrives, or when another table is updated.
Use DLT for managing dataflow: batch and streaming data transformations / quality, an easy way to run Structured Streaming, and creating/updating Delta tables.

Slide 31

Slide 31 text

©2022 Databricks Inc. — All rights reserved The core abstractions of DLT: you define datasets, and DLT automatically keeps them up to date.
Streaming Tables: a Delta table with stream(s) writing to it. Used for ingestion (files, message brokers), low-latency transformations, and huge scale.
Materialized Views: the result of a query, stored in a Delta table. Used for transforming data, building aggregate tables, and speeding up BI queries and reports.

Slide 32

Slide 32 text

©2022 Databricks Inc. — All rights reserved Streaming does not always mean expensive: Delta Live Tables lets you choose how often to update the results.
● Triggered manually: lowest costs, highest latency
● Triggered on a schedule using Databricks Jobs: costs depend on frequency, latency from 10 minutes to months
● Continuous: highest costs, latency from minutes to seconds (for some workloads)

Slide 33

Slide 33 text

©2022 Databricks Inc. — All rights reserved Migrating from Apache Airflow to Databricks Workflows
Challenge: heavy burden on data engineers to create workflows for analysts due to the high complexity of creating custom workflows with Airflow.
Solution: migrated from Airflow to Databricks Workflows for a unified platform providing analysts a simple way to own and manage their own workflows from data ingestion to downstream analytics.
Impact: 60% lower database costs, 90% reduction in processing time.
“If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows.” —Hillevi Crognale, Engineering Manager, YipitData

Slide 34

Slide 34 text

©2022 Databricks Inc. — All rights reserved 34 Delta Live Tables Link to blog

Slide 35

Slide 35 text

©2022 Databricks Inc. — All rights reserved 35 New Databricks Demo Center databricks.com/demos Notebooks for this demo on GitHub This demo on Demo Center

Slide 36

Slide 36 text

©2022 Databricks Inc. — All rights reserved Additional Resources
Demos: ● Ingest from Kinesis: earthquake detection with mobile phone sensors ● Ingest from Kafka: corona detection with IoT fitness trackers ● Ingest with Auto Loader: sentiment analysis for Twitter streams
Blogs: Why YipitData migrated from Airflow; DLT: ETL with 1 billion rows for under $1
Product tours without an account: Delta Live Tables & Workflows

Slide 37

Slide 37 text

©2022 Databricks Inc. — All rights reserved Technical Questions? Sign up for the Databricks Community and ask your technical questions here: https://community.databricks.com/

Slide 38

Slide 38 text

©2022 Databricks Inc. — All rights reserved 38 Thank You! @frankmunz Try Databricks free