Streaming Data into Your Lakehouse
The last few years have taught us that cheap, virtually unlimited, and highly available cloud object storage alone doesn't make a solid enterprise data platform. Too many data lakes fell short of expectations and degenerated into sad data swamps. With the Linux Foundation OSS project Delta Lake (https://github.com/delta-io), you can turn your data lake into the foundation of a data lakehouse that brings back ACID transactions, schema enforcement, upserts, efficient metadata handling, and time travel. In this session, we explore how a data lakehouse works with streaming, using Apache Kafka as an example. This talk is for data architects who are not afraid of some code and for data engineers who love open source and cloud services. Attendees of this talk will learn:
Lakehouse architecture 101, the honest tech bits
The data lakehouse and streaming data: what's there beyond Apache Spark Structured Streaming?
Why the lakehouse and Apache Kafka make a great couple and what concepts you should know to get them hitched with success
Streaming data with declarative data pipelines
In a live demo, I will show data ingestion, cleansing, and transformation based on a simulation of the Data Donation Project (DDP, https://corona-datenspende.de/science/en) built on the lakehouse with Apache Kafka, Apache Spark, and Delta Live Tables (a fully managed service). DDP is a scientific IoT experiment to detect COVID outbreaks in Germany by identifying elevated heart rates correlated with infections. Half a million volunteers have already decided to donate their heart rate data from their fitness trackers.
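To give a feel for the three demo stages (ingestion, cleansing, transformation), here is a minimal sketch in plain Python. It is not the code from the demo and does not use Apache Kafka, Apache Spark, or Delta Live Tables; the message shape, the field names `user_id` and `bpm`, the plausible range of 30 to 220, and the threshold of 90 are all assumptions made up for illustration.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical raw messages, shaped like JSON values one might read from a topic.
RAW_MESSAGES = [
    '{"user_id": "u1", "bpm": 62}',
    '{"user_id": "u1", "bpm": 64}',
    '{"user_id": "u2", "bpm": 95}',
    '{"user_id": "u2", "bpm": 99}',
    '{"user_id": "u3", "bpm": -5}',  # implausible reading, dropped in cleanse()
    'not valid json',                # malformed message, dropped in ingest()
]


def ingest(messages):
    """Parse JSON messages and skip any that are malformed."""
    records = []
    for msg in messages:
        try:
            records.append(json.loads(msg))
        except json.JSONDecodeError:
            continue
    return records


def cleanse(records, low=30, high=220):
    """Keep records with a user id and a heart rate in an assumed plausible range."""
    return [
        r for r in records
        if isinstance(r, dict)
        and r.get("user_id")
        and isinstance(r.get("bpm"), (int, float))
        and low <= r["bpm"] <= high
    ]


def transform(records, threshold=90):
    """Average the heart rate per user and flag averages above an assumed threshold."""
    by_user = defaultdict(list)
    for r in records:
        by_user[r["user_id"]].append(r["bpm"])
    return {
        user: {"avg_bpm": mean(values), "elevated": mean(values) > threshold}
        for user, values in by_user.items()
    }


result = transform(cleanse(ingest(RAW_MESSAGES)))
```

In a streaming lakehouse pipeline, each of these functions would roughly correspond to a table that is continuously updated from the one before it, rather than a one-off function call over a list.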
This presentation was delivered at Current.io 2022 and Devoxx ATH 2023.
Dr. Frank Munz works on large-scale data and AI at Databricks. He authored three computer science books, built up technical evangelism for Amazon Web Services in Germany, Austria, and Switzerland, and once upon a time worked as a data scientist with a group that won a Nobel prize. Frank realized his dream to speak at top-notch conferences - such as Devoxx, KubeCon, ODSC, and JavaOne - on every continent (except Antarctica because it is too cold there). He holds a Ph.D. summa cum laude in Computer Science from TU Munich. He enjoys skiing in the Alps, tapas in Spain, and exploring secret beaches in SE Asia.