Architect your Data for the 21st Century: Hasta la vista, Data Lake?

This presentation critically reviews the requirements and architectural decisions for building an enterprise-grade, open source, JVM-based, multi-cloud successor to most data lakes and data warehouses.

The new architecture is built on Apache Parquet, integrates seamlessly with Apache Spark, uses the popular Jupyter notebooks with the JVM-based Scala language, and provides reliable semantics with ACID transactions, schema evolution, SQL support, versioning, time travel (no joke!), and deep or shallow copies of data sets, all based on the open source Delta Lake project (delta.io).

We will start from scratch with an empty directory and implement all the requirements above! Get ready for lots of code, few slides.
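
The versioning and time-travel requirements can be illustrated without Spark. The sketch below is a toy model only, not Delta Lake's implementation or API: it is loosely modeled on the idea of an ordered log of JSON commit files that record which data files were added or removed, so that any earlier version can be rebuilt by replaying the log up to that point. The function names (`commit`, `snapshot`, `read_rows`), the `_log` directory, and the `op` / `file` / `rows` fields are all hypothetical, and rows are stored inline in the log instead of in Parquet files to keep the example short.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def commit(table_dir: Path, actions: List[dict]) -> int:
    """Write the next numbered commit file and return its version."""
    log_dir = table_dir / "_log"          # hypothetical log directory name
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))
    # Mode "x" raises FileExistsError if the file already exists, so two
    # writers cannot both claim the same version number.
    with open(log_dir / f"{version:020d}.json", "x") as f:
        json.dump(actions, f)
    return version


def snapshot(table_dir: Path, version: Optional[int] = None) -> Dict[str, list]:
    """Replay commits 0..version and return the data files that are live."""
    files: Dict[str, list] = {}
    # Zero-padded names sort lexicographically in commit order.
    for path in sorted((table_dir / "_log").glob("*.json")):
        if version is not None and int(path.stem) > version:
            break
        for action in json.loads(path.read_text()):
            if action["op"] == "add":
                files[action["file"]] = action["rows"]
            elif action["op"] == "remove":
                files.pop(action["file"], None)
    return files


def read_rows(table_dir: Path, version: Optional[int] = None) -> list:
    """Return all rows visible at the given version (latest if None)."""
    live = snapshot(table_dir, version)
    return sorted(row for rows in live.values() for row in rows)


# Start from an empty directory, as in the talk.
table = Path(tempfile.mkdtemp())
v0 = commit(table, [{"op": "add", "file": "part-0.parquet", "rows": [1, 2]}])
v1 = commit(table, [{"op": "add", "file": "part-1.parquet", "rows": [3]}])
# An "update" swaps one file for another within a single commit.
v2 = commit(table, [
    {"op": "remove", "file": "part-0.parquet"},
    {"op": "add", "file": "part-2.parquet", "rows": [10, 20]},
])

latest = read_rows(table)               # rows after the update
as_of_v1 = read_rows(table, version=1)  # "time travel" to before the update
```

Because old commits are never rewritten, reading an earlier version is just a shorter replay of the same log. With the real library, the corresponding operations go through Spark's Delta data source instead of hand-written helpers like these.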

Frank Munz

June 09, 2021

Transcript

  1. Architect your Data for the 21st Century: Hasta la vista, data lake? Frank Munz @frankmunz
  2. None
  3. Big Data Processing & Tools

  4. Big Data Processing • Supercomputers / COTS cluster: HPC with PVM/MPI • Hadoop cluster: map/reduce • Apache Spark: Python / Scala or SQL, Streaming & Batch, on-premises / multi-cloud
  5. Apache Spark: Dataframes and SQL

  6. Apache Spark: Batch and Streaming

  7. Different APIs, Powered by the Same Engine: Scala DataFrame • Python DataFrame • pandas APIs • SQL Language
  8. Data in AI / Machine Learning

  9. Machine Learning: Classification twitter: teenybiscuit

  10. Machine Learning: Forecasting

  11. BI / Data Analytics

  12. Analytics

  13. Data Warehouse: great for BI; SQL only, expensive, poor for ML (future) • Data Lake: cheap, open formats; poor for BI (past), data swamps?
  14. None
  15. Data lakes become data swamps • RELIABILITY & QUALITY • PERFORMANCE & LATENCY • GOVERNANCE
  16. Delta Lake solves the challenges with data lakes • RELIABILITY & QUALITY: ACID transactions • PERFORMANCE & LATENCY: Advanced indexing & caching • GOVERNANCE: Fine-grained access control
  17. Lakehouse: Delta Lake Publication https://databricks.com/wp-content/uploads/2020/12/cidr_lakehouse.pdf

  18. Lakehouse adoption across industries

  19. Demo 1 Getting Started with Delta.io and CLI

  20. Demo 2 Jupyter Notebook: Delta Tables, UPDATEs, time travel etc.

  21. Delta Sharing • Delta Lake Table • Delta Sharing Server • Delta Sharing Protocol • Data Provider • Data Recipient • … • Any Sharing Client • Access permissions • delta.io/sharing • Video
  22. Delta Sharing from Pandas / Jupyter Notebook

  23. Demo 3: Delta Sharing

  24. Multicloud Databricks Workspace • Delta Live Tables • Unity Catalog • MLflow • AutoML • Feature Store
  25. Hasta la vista, data lake? Delta Lake fixes the problems with a data lake and provides reliability, performance and governance. The Lakehouse is a unified platform for Data, Analytics & ML. Check it out: delta.io or databricks.com/try-databricks
  26. How to engage? delta.io • delta-users Slack • delta-users Google Group • Delta Lake YouTube channel • Delta Lake GitHub Issues • Delta Lake RS • Bi-weekly meetings
  27. https://hackernoon.com/top-7-announcements-from-data-and-ai-summit-202-hd2735c0 Summary Posting Slice&DAIS 2021 EMEA Event https://www.meetup.com/Spark-Munich/events/278396185/

  28. @frankmunz https://fmunz.medium.com https://github.com/fmunz https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz