Architect your Data for the 21st Century: Hasta la vista, Data Lake?

This presentation critically reviews the requirements and architectural decisions involved in building an enterprise-grade, open source, JVM-based, multi-cloud successor to most data lakes and data warehouses.

The new architecture is built on Apache Parquet, integrates seamlessly with Apache Spark, uses the popular Jupyter notebooks with the JVM-based Scala language, and provides reliable semantics: ACID transactions, schema evolution, SQL support, versioning, time travel (no joke!), and deep or shallow copies of data sets. All of this is based on the open source Delta Lake project (delta.io).

We will start from scratch with an empty directory and implement all the requirements above! Get ready for lots of code, few slides.
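To make "versioning" and "time travel" concrete before the walk-through, here is a minimal sketch in plain Python of the general idea: immutable data files plus an append-only commit log that records which files are added or removed in each version. It is a toy model under stated assumptions, not Delta Lake's implementation or API. The class `ToyVersionedTable`, the `_log` directory, and the JSON file layout are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


class ToyVersionedTable:
    """Toy model (hypothetical, not Delta Lake): data files plus an append-only commit log."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")  # assumed layout for this sketch
        os.makedirs(self.log_dir, exist_ok=True)

    def _commit_path(self, version):
        # Zero-padded names keep commit files in version order.
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        versions = [
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        ]
        return max(versions) if versions else -1

    def live_files(self, version=None):
        """Replay the log up to `version` to find the data files visible at that version."""
        if version is None:
            version = self.latest_version()
        live = []
        for v in range(version + 1):
            with open(self._commit_path(v)) as f:
                actions = json.load(f)
            live = [name for name in live if name not in actions["remove"]]
            live.extend(actions["add"])
        return live

    def commit(self, rows, overwrite=False):
        """Write rows as a new data file, then publish them with one new log entry."""
        version = self.latest_version() + 1
        data_file = f"part-{version:05d}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        actions = {
            "add": [data_file],
            "remove": self.live_files() if overwrite else [],
        }
        # Mode "x" raises FileExistsError if this version was already committed,
        # so a competing writer cannot silently overwrite a published version.
        with open(self._commit_path(version), "x") as f:
            json.dump(actions, f)
        return version

    def read(self, version=None):
        """Read the table as of `version` (default: the latest version)."""
        rows = []
        for name in self.live_files(version):
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


# Start from an empty directory, as in the talk.
table = ToyVersionedTable(tempfile.mkdtemp())
table.commit([{"id": 1}])                  # version 0: append
table.commit([{"id": 2}])                  # version 1: append
table.commit([{"id": 3}], overwrite=True)  # version 2: logical overwrite

print(table.read())           # latest version
print(table.read(version=1))  # "time travel" to version 1
```

Because an overwrite only records the old files as removed in the log and never deletes them, earlier versions stay readable. That is the property that makes time travel cheap in this model; a real system additionally needs cleanup of unreferenced files, schema handling, and retries after a failed commit, all of which this sketch leaves out.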

Frank Munz

June 09, 2021



  1. Big Data Processing
    • Supercomputers / COTS cluster: HPC with PVM/MPI
    • Hadoop cluster: map/reduce
    • Apache Spark: Python / Scala or SQL, streaming & batch, on-premises / multi-cloud
  2. Data Warehouse vs. Data Lake
    • Data Warehouse: great for BI; SQL only, expensive, poor for ML (future)
    • Data Lake: cheap, open formats; poor for BI (past), data swamps?
  3. Delta Lake solves the challenges with data lakes
    • Reliability & quality: ACID transactions
    • Performance & latency: advanced indexing & caching
    • Governance: fine-grained access control
  4. Delta Sharing
    • Data Provider: Delta Lake Table, Delta Sharing Server, access permissions
    • Delta Sharing Protocol
    • Data Recipient: … any sharing client
    • Video: Delta.io/sharing
  5. Hasta la vista, data lake?
    Delta Lake fixes the problems with a data lake and provides reliability, performance and governance. The Lakehouse is a unified platform for data, analytics & ML. Check it out: delta.io or databricks.com/try-databricks
  6. How to engage?
    • delta.io
    • delta-users Slack
    • delta-users Google Group
    • Delta Lake YouTube channel
    • Delta Lake GitHub Issues
    • Delta Lake RS
    • Bi-weekly meetings