Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Slide 1

Slide 1 text

Serverless Kafka and Spark in a Multi-Cloud Data Lakehouse Architecture
Kai Waehner, Field CTO
linkedin.com/in/kaiwaehner | @KaiWaehner | confluent.io | kai-waehner.de

Slide 2

Slide 2 text

Agenda
• Data Analytics at Rest
• Data Streaming in Motion
• Lakehouse: Data Streaming + Analytics
• A Lakehouse Example: Intelligent Connected Cars
• Cloud-Native vs. Serverless Infrastructure
• Central vs. Hybrid and Global Data Mesh
kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Storage at Rest
USER  CREDIT_SCORE
JAY   695
SUE   430
FRED  710
Versions: V1, V3, V2

Slide 5

Slide 5 text

Analytics at Rest
Active Query: SELECT * FROM DB_TABLE
Passive Data: DB Table

Slide 6

Slide 6 text

Use Cases for Data at Rest
• Reporting
• Business Intelligence
• Data Engineering
• Big Data Analytics
• Machine Learning

Slide 7

Slide 7 text

Apache Spark – The De Facto Standard for Big Data at Rest
Big Data In | Big Data Storage and Processing | Big Data Out
From Historical Data to Insights

Slide 8

Slide 8 text

Delta Lake
Open-source storage framework and open format for data analytics

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Real-time Data beats Slow Data.

Slide 11

Slide 11 text

Real-time Data beats Slow Data.
Transportation: Real-time sensor diagnostics, Driver-rider match, ETA updates
Insurance: Claim processing, Fraud detection, Omnichannel quote processing
Retail: Real-time inventory, Real-time POS reporting, Personalization
Entertainment: Real-time recommendations, Personalized news feed, In-app purchases

Slide 12

Slide 12 text

Data at Rest
Active Query: SELECT * FROM DB_TABLE
Passive Data: DB Table
Data in Motion
Active Data: Event Stream
Passive Query: CREATE TABLE T AS SELECT * FROM EVENT_STREAM
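The contrast on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `active_query`, `StandingQuery`, and the sample rows are hypothetical names and data, and the standing query merely appends events rather than reproducing real ksqlDB table semantics.

```python
from typing import Dict, List

# Data at rest: the data is passive; the query is active and runs once, on request.
db_table: List[Dict] = [
    {"user": "JAY", "credit_score": 695},
    {"user": "SUE", "credit_score": 430},
]


def active_query(table: List[Dict]) -> List[Dict]:
    """In the spirit of SELECT * FROM DB_TABLE: a one-off snapshot."""
    return list(table)


class StandingQuery:
    """Data in motion: the query is registered once and stays passive;
    each arriving event actively drives the result (hypothetical helper)."""

    def __init__(self) -> None:
        self.result: List[Dict] = []  # continuously maintained result "T"

    def on_event(self, event: Dict) -> None:
        self.result.append(event)


snapshot = active_query(db_table)
# A later change to the table does not reach the snapshot already taken.
db_table.append({"user": "FRED", "credit_score": 710})

standing = StandingQuery()
for event in ({"user": "FRED", "credit_score": 710},
              {"user": "JAY", "credit_score": 700}):
    standing.on_event(event)  # the result grows as events arrive
```

The snapshot stays at two rows until someone asks again, while the standing query's result is updated by every event without being re-run.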

Slide 13

Slide 13 text

Tables at Rest
USER  CREDIT_SCORE
JAY   695
SUE   430
FRED  710
Versions: V1, V3, V2
Streams in Motion
USER  PAYMENTS
JAY   42
SUE   18
FRED  65
...   ...

Slide 14

Slide 14 text

Data Streaming = Data at Rest + Data in Motion
Payments Stream
Credit Score Stream
CREATE TABLE credit_scores AS SELECT user, updateScore(p.amount)…
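A minimal Python sketch of how a payments stream can be folded into a credit score table. The slide elides what `updateScore` does, so the scoring rule below is a purely hypothetical stand-in, as is the helper name `materialize`; the sample values are taken from the earlier table and stream slides.

```python
from typing import Dict, Iterable, Tuple


def update_score(current: int, amount: int) -> int:
    # HYPOTHETICAL rule for illustration only; the real updateScore
    # logic is not shown in the slide.
    return current + amount // 10


def materialize(scores: Dict[str, int],
                payments: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Fold a stream of (user, amount) events into a table of scores."""
    table = dict(scores)  # data at rest: the current state per user
    for user, amount in payments:  # data in motion: one event at a time
        table[user] = update_score(table.get(user, 0), amount)
    return table


credit_scores = materialize(
    {"JAY": 695, "SUE": 430, "FRED": 710},
    [("JAY", 42), ("SUE", 18), ("FRED", 65)],
)
print(credit_scores)
```

The table is simply the latest state produced by applying every stream event in order, which is the sense in which the slide combines data at rest with data in motion.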

Slide 15

Slide 15 text

Apache Kafka – The De Facto Standard for Data in Motion
Producers: Database, CRM, Sensors, Mobile
Consumers: Customer 360, Real-time Alerting System, Data warehouse
Streams of real-time events
Stream processing apps
Connectors
Incident Alert Forecast Pricing Customer Order

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Data Lakehouse
Building the Data Lakehouse (Author: Bill Inmon)
Lakehouse is a logical view, not physical!

Slide 18

Slide 18 text

Lambda Architecture
Option 1: Unified serving layer
Components: Data Source; Real-Time Layer (Data Processing in Motion); Batch Layer (Data Processing at Rest); Serving Layer; Real-Time App (Data Processing in Motion); Batch App (Data Processing at Rest)
Latency: ms (real-time path), min/hr (batch path)

Slide 19

Slide 19 text

Lambda Architecture
Option 2: Separate serving layers
Components: Data Source; Real-Time Layer (Data Processing in Motion); Batch Layer (Data Processing at Rest); Speed View; Batch View
Queries: Real-time Query; Mixed Query; Batch Query
Latency: ms (real-time path), min/hr (batch path)
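The "mixed query" in this Lambda variant can be sketched as a merge of a batch view and a speed view. The snippet assumes a simple per-key count as the metric and a fixed batch cutoff; the names `batch_cutoff`, `batch_view`, `speed_view`, and `mixed_query` are hypothetical and not taken from any framework.

```python
from collections import Counter
from typing import List

# Events in arrival order; each entry is the key of one event (hypothetical data).
events: List[str] = ["car-1", "car-2", "car-1", "car-1", "car-2"]

# Assumption: the last batch run covered the first three events.
batch_cutoff = 3

# Batch layer: recomputed periodically (min/hr) over all data up to the cutoff.
batch_view = Counter(events[:batch_cutoff])

# Real-time layer: updated incrementally (ms) for everything after the cutoff.
speed_view = Counter(events[batch_cutoff:])


def mixed_query(key: str) -> int:
    """Combine both views to answer with complete, up-to-date counts."""
    return batch_view[key] + speed_view[key]


print(mixed_query("car-1"), mixed_query("car-2"))
```

Two code paths have to stay consistent for the merged answer to be correct, which is the operational cost the Kappa architecture tries to avoid.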

Slide 20

Slide 20 text

Kappa Architecture
One pipeline for real-time and batch consumers
Components: Data Source; Real-Time Layer (Data Processing in Motion); Real-Time App (Data Processing in Motion); Batch App (Data Processing at Rest); Storage (three instances)
Latency: ms (real-time consumers), min/hr (batch consumers)
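A toy illustration of the Kappa idea, assuming an in-memory append-only log in place of a real Kafka topic. `Log` and `Consumer` are hypothetical classes written for this sketch; they only mimic offset-based reading and are not the Kafka client API.

```python
from typing import List, Optional


class Log:
    """Append-only log standing in for a single topic partition."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset: int, max_records: Optional[int] = None) -> List[str]:
        end = None if max_records is None else offset + max_records
        return self._records[offset:end]


class Consumer:
    """Each consumer tracks its own offset, so it reads at its own pace."""

    def __init__(self, log: Log) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: Optional[int] = None) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
realtime = Consumer(log)  # polls after every event (ms latency)
batch = Consumer(log)     # polls once, later (min/hr latency)

realtime_seen: List[str] = []
for record in ["speed=80", "speed=95", "engine_temp=110"]:
    log.append(record)
    realtime_seen.extend(realtime.poll())

batch_seen = batch.poll()  # replays the same log from offset 0
```

Both consumers end up with the same records from the same pipeline; only the time at which they read differs.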

Slide 21

Slide 21 text

Kappa @ Uber
kai-waehner.de | @KaiWaehner | Kappa vs. Lambda Architecture

Slide 22

Slide 22 text

Confluent + Databricks Reference Architecture
Sources: Legacy Data Stores (Netezza, Teradata, Oracle, Mainframes); Databases; IoT Data Streaming Analytics
Data Streaming Platform built on Kafka (On Premises or any cloud)
Kafka Connect (On Premises or any cloud)
Kafka Streams & ksqlDB - real-time stream processing and transformations
Databricks Delta Lake Sink Connector for Confluent Cloud (AWS)
Databricks Data Science Workspace
Databricks BI Workspace

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Connected Car Infrastructure at Audi
• Real Time Data Analysis
• Swarm Intelligence
• Collaboration with Partners
• Predictive AI
• …
https://www.youtube.com/watch?v=yGLKi3TMJv8

Slide 25

Slide 25 text

Connected Car Infrastructure at Audi
https://www.youtube.com/watch?v=yGLKi3TMJv8

Slide 26

Slide 26 text

Kappa Architecture for a Lakehouse with Kafka and Spark
Legend: Kafka Ecosystem; Spark Ecosystem; Other Components
Components: Car Sensors; MQTT Proxy; Kafka Cluster; Kafka Connect; Kafka Streams; ksqlDB; Spark Core; Spark SQL; Spark MLlib; Storage; Mobile App; BI Tool
Flow labels: All Data; Critical Data; Ingest Data; Preprocess Data; Consume Data; Potential Detect; Model Training; Model Deployment; Deploy Analytic Model; Reporting

Slide 27

Slide 27 text

Machine Learning Model Training with Spark MLlib
https://dev.to/siddhantpatro/spark-mllib-for-big-data-and-machine-learning-330j

Slide 28

Slide 28 text

Model Deployment with Apache Kafka, ksqlDB and Spark MLlib
CREATE STREAM AnomalyDetection AS SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine;
detectAnomaly: User Defined Function (UDF)
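What the statement on this slide does can be simulated conceptually in Python: every event from `car_engine` is passed through a scoring function and emitted to a derived stream. This is not a ksqlDB UDF (those are implemented in Java), and the threshold rule inside `detect_anomaly` is a hypothetical stand-in for a trained model; the limit of 100.0 and the sample events are assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# ASSUMPTION: a fixed limit stands in for the trained model's decision logic.
HYPOTHETICAL_LIMIT = 100.0


def detect_anomaly(sensor_values: List[float]) -> bool:
    """Stand-in for the detectAnomaly UDF: flags any reading above the limit."""
    return any(value > HYPOTHETICAL_LIMIT for value in sensor_values)


def anomaly_detection(car_engine: Iterable[Dict]) -> Iterator[Tuple[str, bool]]:
    """Mirrors SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine:
    one output record per input event, processed as events arrive."""
    for event in car_engine:
        yield event["sensor_id"], detect_anomaly(event["sensor_values"])


# Hypothetical sample events.
car_engine_events = [
    {"sensor_id": "s1", "sensor_values": [70.0, 72.5]},
    {"sensor_id": "s2", "sensor_values": [71.0, 130.0]},
]

results = list(anomaly_detection(car_engine_events))
print(results)
```

Because scoring happens per event inside the stream processor, the model is applied without a separate serving call; in the architecture shown earlier, the model itself would be trained in the Spark ecosystem and then deployed into the stream.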

Slide 29

Slide 29 text

Stream Processing with Kafka or Spark?
Kafka Streams / ksqlDB:
• Component of the data streaming infrastructure
• Low latency
• Focus on 24/7 operations
• Lightweight, decoupled microservices
Spark Streaming:
• Component of the data analytics infrastructure
• Strong integration with the rest of the Spark ecosystem
• Stream and batch
• Machine Learning “embedded”

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Cloud-Native Deployment → Elastic Infrastructure and Faster Time-to-Market

Slide 32

Slide 32 text

What is a (truly) fully-managed SaaS?
Legend: You Manage | Provider Managed
Deployment options compared: Self-managed IaaS | Hosted Cloud Service | Fully Managed SaaS
Responsibilities listed for each option: Scaling; Load balancing; Partition placement; Logical Storage; Broker settings; Zookeeper; Kafka patching; JVM; O/S; VMs; Servers
Chart axes: Provider managed features; Product ease of use (Self-Managed, Partially Managed, Fully Managed)

Slide 33

Slide 33 text

Slide 34

Slide 34 text

AWS Cloud Outage hit Disney World Visitors…
https://www.cnet.com/tech/services-and-software/disney-parks-were-already-facing-heat-from-fans-then-an-aws-outage-came-along/

Slide 35

Slide 35 text

Disaster Recovery – RPO and RTO
RPO = Recovery Point Objective
RTO = Recovery Time Objective
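A small worked example may help make the two objectives concrete: RPO bounds how much recent data may be lost, expressed as a time window, and RTO bounds how long the service may be down. All timestamps and targets below are hypothetical, and `recovery_metrics` is an illustrative helper rather than part of any tool.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(last_replicated: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> Tuple[timedelta, timedelta]:
    """Return (data loss window, downtime) for one incident."""
    data_loss_window = outage_start - last_replicated  # compared against RPO
    downtime = service_restored - outage_start         # compared against RTO
    return data_loss_window, downtime


# Hypothetical targets and incident timeline.
rpo = timedelta(minutes=5)
rto = timedelta(minutes=30)

loss, downtime = recovery_metrics(
    last_replicated=datetime(2024, 1, 1, 10, 58),
    outage_start=datetime(2024, 1, 1, 11, 0),
    service_restored=datetime(2024, 1, 1, 11, 20),
)

meets_rpo = loss <= rpo      # 2 minutes of data lost vs. 5 allowed
meets_rto = downtime <= rto  # 20 minutes of downtime vs. 30 allowed
print(loss, downtime, meets_rpo, meets_rto)
```

With synchronous replication the data loss window shrinks toward zero, which corresponds to a target of RPO=0.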

Slide 36

Slide 36 text

Use Cases for Hybrid and Multi-Cloud Data Lakehouses
• Disaster Recovery and High Availability: Create a disaster recovery cluster, and fail over to it during an outage.
• Global and Multi-Cloud Replication: Move and aggregate data across regions and clouds.
• Data Sharing: Share data with other teams, lines of business, or organizations.
• Data Migration: Migrate data and workloads from one cluster to another (for example, from a legacy on-premises data warehouse to a cloud-native data lakehouse).
Data Replication at Rest or in Motion?

Slide 37

Slide 37 text

Global Data Lakehouse across Edge and Hybrid Cloud
• Streaming Replication between Kafka Clusters
• Bridge to Databases, Data Lakes, Apps, APIs, SaaS
• Aggregation of Edge Deployments with Replication (Aggregation)
• Disaster Recovery Operations with Multi-Region Clusters for RPO=0 and RTO~0
• Global Data Streaming with Replication and Cluster Linking
Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.

Slide 38

Slide 38 text

A data mesh for decentralized data products
Data Product
Independent Data Products for Reporting, Analytics, Data Streaming
For instance: A KSQL microservice

Slide 39

Slide 39 text

Questions? Feedback? Let’s connect!
Kai Waehner, Field CTO
@KaiWaehner | confluent.io | kai-waehner.de | linkedin.com/in/kaiwaehner