Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Serverless Kafka and Spark in a Multi-Cloud Data Lakehouse Architecture
Kai Waehner Field CTO [email protected] linkedin.com/in/kaiwaehner @KaiWaehner confluent.io kai-waehner.de

Agenda • Data Analytics at Rest • Data Streaming in
Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Storage at Rest USER JAY SUE FRED CREDIT_SCORE 695 430
710 V1 V3 V2

Analytics at Rest SELECT * FROM DB_TABLE Active Query: Passive
Data: DB Table

Use Cases for Data at Rest • Reporting • Business
Intelligence • Data Engineering • Big Data Analytics • Machine Learning kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Apache Spark – The De Facto Standard for Big Data
at Rest kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe Big Data In Big Data Out Big Data Storage and Processing From Historical Data to Insights

Delta Lake Open-source storage framework and open format for data
analytics kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Real-time Data beats Slow Data. kai-waehner.de | @KaiWaehner | Serverless
Apache Kafka and Spark across the Globe

Real-time Data beats Slow Data. Transportation Real-time sensor diagnostics Driver-rider
match ETA updates Insurance Claim processing Fraud detection Omnichannel quote processing Retail Real-time inventory Real-time POS reporting Personalization Entertainment Real-time recommendations Personalized news feed In-app purchases kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Data at Rest Data in Motion SELECT * FROM DB_TABLE
CREATE TABLE T AS SELECT * FROM EVENT_STREAM Active Query: Passive Data: DB Table Active Data: Passive Query: Event Stream

Tables at Rest Streams in Motion USER JAY SUE FRED
CREDIT_SCORE 695 430 710 V1 V3 V2 PAYMENTS 42 18 65 ... USER JAY SUE FRED ...

Data Streaming = Data at Rest + Data in Motion
Payments Stream Credit Score Stream CREATE TABLE credit_scores AS SELECT user, updateScore(p.amount)…

Apache Kafka – The De Facto Standard for Data in
Motion Database CRM Sensors Mobile Customer 360 Real-time Alerting System Data warehouse Producers Consumers Streams of real time events Stream processing apps Connectors Connectors Stream processing apps Incident Alert Forecast Pricing Customer Order kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Data Lakehouse kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and
Spark across the Globe Building the Data Lakehouse Author: Bill Inmon Lakehouse is a logical view, not physical!

Lambda Architecture Option 1: Unified serving layer Data Source Real-Time
Layer (Data Processing in Motion) Batch Layer (Data Processing at Rest) Serving Layer Real-Time App (Data Processing in Motion) Batch App (Data Processing at Rest) ms min/hr kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Data Source Real-Time Layer (Data Processing in Motion) Batch Layer
(Data Processing at Rest) Real-time Query Mixed Query ms min/hr Speed View Batch View Batch Query Lambda Architecture Option 2: Separate serving layers kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Data Source Real-Time Layer (Data Processing in Motion) Real-Time App
(Data Processing in Motion) Storage Batch App (Data Processing at Rest) Storage ms min/hr Storage Kappa Architecture One pipeline for real-time and batch consumers kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Kappa @ Uber 24 kai-waehner.de | @KaiWaehner | Kappa vs.
Lambda Architecture

Confluent + Databricks Reference Architecture Kafka Connect On Premises or
any cloud Kafka Streams & ksqlDB - real-time stream processing and transformations Databricks Data Science Workspace Databricks Delta Lake Sink Connector for Confluent Cloud (AWS) Legacy Data Stores: Netezza, Teradata Oracle, Mainframes Databases IoT Data Streaming Analytics Sources Data Streaming Platform built on Kafka On Premises or any cloud Databricks BI Workspace kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Connected Car Infrastructure at Audi 27 • Real Time Data
Analysis • Swarm Intelligence • Collaboration with Partners • Predictive AI • … https://www.youtube.com/watch?v=yGLKi3TMJv8 kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Connected Car Infrastructure at Audi 28 https://www.youtube.com/watch?v=yGLKi3TMJv8 kai-waehner.de | @KaiWaehner
| Serverless Apache Kafka and Spark across the Globe

Kappa Architecture for a Lakehouse with Kafka and Spark MQTT
Proxy Spark Core Storage Spark SQL Reporting Kafka Cluster Kafka Connect Car Sensors Kafka Ecosystem Spark Ecosystem Other Components Kafka Streams All Data Critical Data Ingest Data Potential Detect Spark MLlib Model Training ksqlDB Model Deployment Preprocess Data Consume Data Deploy Analytic Model Mobile App BI Tool kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Machine Learning Model Training with Spark MLlib kai-waehner.de | @KaiWaehner
| Serverless Apache Kafka and Spark across the Globe https://dev.to/siddhantpatro/spark-mllib-for-big-data-and-machine-learning-330j

“CREATE STREAM AnomalyDetection AS SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine;“ User
Defined Function (UDF) Model Deployment with Apache Kafka, ksqlDB and Spark MLlib kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe MLlib

Stream Processing with Kafka or Spark? kai-waehner.de | @KaiWaehner |
Serverless Apache Kafka and Spark across the Globe Kafka Streams / ksqlDB Spark Streaming Component of the data streaming infrastructure Low latency Focus on 24/7 operations Lightweight, decoupled microservices Component of the data analytics infrastructure Strong integration with the rest of the Spark ecossytem Stream and batch Machine Learning “embedded”

Cloud-Native Deployment à Elastic Infrastructure and Faster Time-to-Market kai-waehner.de |
@KaiWaehner | Serverless Apache Kafka and Spark across the Globe

You Manage Provider Managed Self-managed IaaS Hosted Cloud Service Fully
Managed SaaS Scaling Scaling Scaling Load balancing Load balancing Load balancing Partition placement Partition placement Partition placement Logical Storage Logical Storage Logical Storage Broker settings Broker settings Broker settings Zookeeper Zookeeper Zookeeper Kafka patching Kafka patching Kafka patching JVM JVM JVM O/S O/S O/S VMs VMs VMs Servers Servers Servers Provider managed features Product ease of use Fully Managed Partially Managed Self-Managed What is a (truly) fully-managed SaaS? kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

AWS Cloud Outage hit Disney World Visitors… https://www.cnet.com/tech/services-and-software/disney-parks-were-already-facing-heat-from-fans-then-an-aws-outage-came-along/ kai-waehner.de |
@KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Disaster Recovery – RPO and RTO RPO = Recovery Point
Objective RTO = Recovery Time Objective kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Use Cases for Hybrid and Multi-Cloud Data Lakehouses • Disaster
Recovery and High Availability: Create a disaster recovery cluster, and fail over to it during an outage. • Global and Multi-Cloud Replication: Move and aggregate data across regions and clouds. • Data Sharing: Share data with other teams, lines-of-business, or organizations. • Data Migration: Migrate data and workloads from one cluster to another (like from legacy on-premise data warehouse to cloud-native data lakehouse). kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe Data Replication at Rest or in Motion?

Copyright 2021, Confluent, Inc. All rights reserved. This document may
not be reproduced in any manner without the express written permission of Confluent, Inc. Global Data Lakehouse across Edge and Hybrid Cloud Streaming Replication between Kafka Clusters Bridge to Databases, Data Lakes, Apps, APIs, SaaS Aggregation of Edge Deployments with Replication (Aggregation) Disaster Recovery Operations with Multi-Region Clusters for RPO=0 and RTO~0 Global Data Streaming with Replication and Cluster Linking kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

A data mesh for decentralized data products Data Product Independent
Data Products for Reporting, Analytics, Data Streaming kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe For instance: A KSQL microservice

Kai Waehner Field CTO [email protected] @KaiWaehner confluent.io kai-waehner.de linkedin.com/in/kaiwaehner Questions?
Feedback? Let’s connect!

Serverless Kafka and Spark in a Multi-Cloud Lak...

Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

More Decks by Kai Waehner

Other Decks in Programming

Featured

Transcript