This presentation critically reviews the requirements and architectural decisions behind building an enterprise-grade, open source, JVM-based, multi-cloud successor to most data lakes and data warehouses.
The new architecture is built on Apache Parquet, seamlessly integrated with Apache Spark, and uses the popular Jupyter notebooks with the JVM-based Scala language. It provides reliable semantics with ACID transactions, schema evolution, SQL support, versioning, time travel (no joke!), and deep or shallow copies of data sets - all based on the open source Delta Lake (delta.io).
We will start from scratch with an empty directory and implement all of the requirements above! Get ready for lots of code and few slides.