Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.

Unfortunately, scalability makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.

This session explores different options for building serverless Kafka and Spark multi-cloud architectures across regions and continents.

We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse. Real-world use cases show the joint value and explore the benefit of the "Delta Lake" integration.

Kai Waehner

June 29, 2022

Transcript

  1. Serverless Kafka and Spark in a Multi-Cloud Data Lakehouse Architecture

    Kai Waehner | Field CTO | linkedin.com/in/kaiwaehner | @KaiWaehner | confluent.io | kai-waehner.de
  2. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe
  3. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  4. Use Cases for Data at Rest

    • Reporting • Business Intelligence • Data Engineering • Big Data Analytics • Machine Learning
  5. Apache Spark – The De Facto Standard for Big Data at Rest

    Big Data In | Big Data Out | Big Data Storage and Processing | From Historical Data to Insights
  6. Delta Lake

    Open-source storage framework and open format for data analytics
  7. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  8. Real-time Data beats Slow Data.

    • Transportation: Real-time sensor diagnostics, Driver-rider match, ETA updates • Insurance: Claim processing, Fraud detection, Omnichannel quote processing • Retail: Real-time inventory, Real-time POS reporting, Personalization • Entertainment: Real-time recommendations, Personalized news feed, In-app purchases
  9. Data at Rest | Data in Motion

    Data at Rest: Active Query: SELECT * FROM DB_TABLE | Passive Data: DB Table. Data in Motion: Active Data: Event Stream | Passive Query: CREATE TABLE T AS SELECT * FROM EVENT_STREAM
  10. Tables at Rest | Streams in Motion

    Table: USER (JAY, SUE, FRED) | CREDIT_SCORE (695, 430, 710) | V1 V3 V2. Stream: USER (JAY, SUE, FRED, ...) | PAYMENTS (42, 18, 65, ...)
  11. Data Streaming = Data at Rest + Data in Motion

    Payments Stream Credit Score Stream CREATE TABLE credit_scores AS SELECT user, updateScore(p.amount)…
  12. Apache Kafka – The De Facto Standard for Data in Motion

    Database CRM Sensors Mobile Customer 360 Real-time Alerting System Data warehouse Producers Consumers Streams of real time events Stream processing apps Connectors Connectors Stream processing apps Incident Alert Forecast Pricing Customer Order
  13. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  14. Data Lakehouse

    Building the Data Lakehouse (Author: Bill Inmon) | Lakehouse is a logical view, not physical!
  15. Lambda Architecture Option 1: Unified serving layer

    Data Source | Real-Time Layer (Data Processing in Motion) | Batch Layer (Data Processing at Rest) | Serving Layer | Real-Time App (Data Processing in Motion) | Batch App (Data Processing at Rest) | ms | min/hr
  16. Lambda Architecture Option 2: Separate serving layers

    Data Source | Real-Time Layer (Data Processing in Motion) | Batch Layer (Data Processing at Rest) | Real-time Query | Mixed Query | ms | min/hr | Speed View | Batch View | Batch Query
  17. Kappa Architecture: One pipeline for real-time and batch consumers

    Data Source | Real-Time Layer (Data Processing in Motion) | Real-Time App (Data Processing in Motion) | Storage | Batch App (Data Processing at Rest) | Storage | ms | min/hr | Storage
  18. Confluent + Databricks Reference Architecture

    Kafka Connect On Premises or any cloud Kafka Streams & ksqlDB - real-time stream processing and transformations Databricks Data Science Workspace Databricks Delta Lake Sink Connector for Confluent Cloud (AWS) Legacy Data Stores: Netezza, Teradata, Oracle, Mainframes Databases IoT Data Streaming Analytics Sources Data Streaming Platform built on Kafka On Premises or any cloud Databricks BI Workspace
  19. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  20. Connected Car Infrastructure at Audi

    • Real Time Data Analysis • Swarm Intelligence • Collaboration with Partners • Predictive AI • … https://www.youtube.com/watch?v=yGLKi3TMJv8
  21. Kappa Architecture for a Lakehouse with Kafka and Spark

    MQTT Proxy Spark Core Storage Spark SQL Reporting Kafka Cluster Kafka Connect Car Sensors Kafka Ecosystem Spark Ecosystem Other Components Kafka Streams All Data Critical Data Ingest Data Potential Detect Spark MLlib Model Training ksqlDB Model Deployment Preprocess Data Consume Data Deploy Analytic Model Mobile App BI Tool
  22. Machine Learning Model Training with Spark MLlib

    https://dev.to/siddhantpatro/spark-mllib-for-big-data-and-machine-learning-330j
  23. Model Deployment with Apache Kafka, ksqlDB and Spark MLlib

    “CREATE STREAM AnomalyDetection AS SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine;” | User Defined Function (UDF) | MLlib
  24. Stream Processing with Kafka or Spark?

    Kafka Streams / ksqlDB: • Component of the data streaming infrastructure • Low latency • Focus on 24/7 operations • Lightweight, decoupled microservices. Spark Streaming: • Component of the data analytics infrastructure • Strong integration with the rest of the Spark ecosystem • Stream and batch • Machine Learning “embedded”
  25. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  26. Cloud-Native Deployment → Elastic Infrastructure and Faster Time-to-Market
  27. What is a (truly) fully-managed SaaS?

    Legend: You Manage | Provider Managed. Columns: Self-managed IaaS | Hosted Cloud Service | Fully Managed SaaS. Rows (listed for each column): Scaling, Load balancing, Partition placement, Logical Storage, Broker settings, Zookeeper, Kafka patching, JVM, O/S, VMs, Servers. Labels: Provider managed features | Product ease of use | Fully Managed | Partially Managed | Self-Managed
  28. Agenda

    • Data Analytics at Rest • Data Streaming in Motion • Lakehouse: Data Streaming + Analytics • A Lakehouse Example: Intelligent Connected Cars • Cloud-Native vs. Serverless Infrastructure • Central vs. Hybrid and Global Data Mesh
  29. Disaster Recovery – RPO and RTO

    RPO = Recovery Point Objective | RTO = Recovery Time Objective
  30. Use Cases for Hybrid and Multi-Cloud Data Lakehouses

    • Disaster Recovery and High Availability: Create a disaster recovery cluster, and fail over to it during an outage. • Global and Multi-Cloud Replication: Move and aggregate data across regions and clouds. • Data Sharing: Share data with other teams, lines-of-business, or organizations. • Data Migration: Migrate data and workloads from one cluster to another (like from legacy on-premise data warehouse to cloud-native data lakehouse). Data Replication at Rest or in Motion?
  31. Global Data Lakehouse across Edge and Hybrid Cloud

    Streaming Replication between Kafka Clusters | Bridge to Databases, Data Lakes, Apps, APIs, SaaS | Aggregation of Edge Deployments with Replication (Aggregation) | Disaster Recovery Operations with Multi-Region Clusters for RPO=0 and RTO~0 | Global Data Streaming with Replication and Cluster Linking | Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
  32. A data mesh for decentralized data products

    Data Product | Independent Data Products for Reporting, Analytics, Data Streaming | For instance: A KSQL microservice
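
Slide 11 shows a statement of the form `CREATE TABLE credit_scores AS SELECT user, updateScore(p.amount)…`, which continuously folds a payments stream into a table. The following is a minimal Python sketch of that idea, not the ksqlDB implementation. The slide does not show the body of `updateScore`, so the `update_score` rule, the `BASE_SCORE` constant, and the sample events below are all illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple

BASE_SCORE = 600  # hypothetical starting score, not taken from the slides


def update_score(current: int, amount: int) -> int:
    """Hypothetical stand-in for updateScore(); the real logic is not shown in the deck."""
    return current + amount // 10


def materialize(payments: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Fold a stream of (user, amount) events into a table keyed by user."""
    table: Dict[str, int] = {}
    for user, amount in payments:
        # Each event updates exactly one row; the table is the running state of the stream.
        table[user] = update_score(table.get(user, BASE_SCORE), amount)
    return table


# Assumed sample events for illustration only.
payments_stream = [("jay", 42), ("sue", 18), ("fred", 65), ("jay", 30)]
credit_scores = materialize(payments_stream)
print(credit_scores)
```

The point of the sketch is the relationship between the two views: the stream is the history of changes, and the table is what remains after applying them in order.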
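
Slide 17 describes the Kappa architecture as one pipeline for real-time and batch consumers. A rough Python sketch of that property follows: a single append-only log serves a consumer that reacts to each event as it arrives and a second consumer that replays the history later. The `EventLog` class, the `SPEED_LIMIT` threshold, and the car events are hypothetical and only stand in for a Kafka topic and its consumers.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


class EventLog:
    """Append-only log; a toy stand-in for a single-partition Kafka topic."""

    def __init__(self) -> None:
        self._events: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def append(self, event: Event) -> None:
        self._events.append(event)
        for handler in self._subscribers:  # real-time consumers see the event immediately
            handler(event)

    def read_from(self, offset: int) -> List[Event]:
        return list(self._events[offset:])  # batch consumers replay from any offset


SPEED_LIMIT = 130  # hypothetical alert threshold

log = EventLog()
alerts: List[Event] = []


def alert_on_speeding(event: Event) -> None:
    if event["speed"] > SPEED_LIMIT:
        alerts.append(event)


log.subscribe(alert_on_speeding)
for speed in (90, 140, 110):
    log.append({"car": "car-1", "speed": speed})

# The batch consumer reads the same log later, from the beginning.
history = log.read_from(0)
average_speed = sum(e["speed"] for e in history) / len(history)
print(len(alerts), round(average_speed, 2))
```

Both consumers read the same data from the same place, so there is no second pipeline to keep in sync, which is the contrast with the Lambda options on slides 15 and 16.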
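
Slide 23 deploys a model by calling a user-defined function inside a continuous query: `SELECT sensor_id, detectAnomaly(sensor_values) FROM car_engine`. The Python sketch below mimics only the shape of that query, a function applied to every event of a stream. The deck does not show the model behind `detectAnomaly`, so the deviation rule in `detect_anomaly`, the `MAX_DEVIATION` threshold, and the sample readings are assumptions made for illustration; in the deck the function would wrap a trained model.

```python
from typing import Any, Dict, Iterable, Iterator, List

MAX_DEVIATION = 20.0  # hypothetical threshold, not from the slides


def detect_anomaly(sensor_values: List[float]) -> bool:
    """Hypothetical stand-in for the model-backed UDF: flag readings far from the mean."""
    mean = sum(sensor_values) / len(sensor_values)
    return any(abs(value - mean) > MAX_DEVIATION for value in sensor_values)


def anomaly_detection(car_engine: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Apply the function to each event, as the continuous query does per record."""
    for event in car_engine:
        yield {
            "sensor_id": event["sensor_id"],
            "anomaly": detect_anomaly(event["sensor_values"]),
        }


# Assumed sample events for illustration only.
car_engine_stream = [
    {"sensor_id": "s1", "sensor_values": [70.0, 71.0, 69.5]},
    {"sensor_id": "s2", "sensor_values": [70.0, 120.0, 71.0]},
]
results = list(anomaly_detection(car_engine_stream))
print(results)
```

Because the scoring function is called per event, the model can be swapped without changing the query around it.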
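
Slide 29 names the two disaster recovery objectives, RPO and RTO. As a small worked example in Python: the RPO bounds how much recent data may be lost, expressed as a time window before the failure, and the RTO bounds how long the service may stay down. The `recovery_metrics` helper, the timestamps, and the target values are hypothetical and chosen only to make the arithmetic concrete.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    last_replicated: datetime,
    outage_start: datetime,
    service_restored: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for one outage."""
    data_loss_window = outage_start - last_replicated  # compared against the RPO
    downtime = service_restored - outage_start  # compared against the RTO
    return data_loss_window, downtime


# Hypothetical objectives and incident timestamps.
rpo_target = timedelta(minutes=1)
rto_target = timedelta(minutes=10)

data_loss, downtime = recovery_metrics(
    last_replicated=datetime(2022, 6, 29, 11, 59, 30),
    outage_start=datetime(2022, 6, 29, 12, 0, 0),
    service_restored=datetime(2022, 6, 29, 12, 5, 0),
)
meets_objectives = data_loss <= rpo_target and downtime <= rto_target
print(data_loss, downtime, meets_objectives)
```

In these terms, the "RPO=0 and RTO~0" mentioned on slide 31 means a data-loss window of zero and a downtime close to zero.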