Architect your Data for the 21st Century: Hasta la vista, Data Lake?

This presentation critically reviews the requirements and architectural decisions to build an enterprise-grade, open source, JVM-based, multi-cloud successor for most data lakes and data warehouses.

The new architecture is built on Apache Parquet, integrates seamlessly with Apache Spark, uses the popular Jupyter notebooks with the JVM-based Scala language, and provides reliable semantics: ACID transactions, schema evolution, SQL support, versioning, time travel (no joke!), and deep or shallow copies of data sets. All of this is based on the open source project at delta.io.

We will start from scratch with an empty directory and implement all the requirements above! Get ready for lots of code, few slides.
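
As a rough illustration of how versioning and time travel can fall out of an append-only commit log, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake's code: the `ToyTable` class, the `_toy_log` directory name, and the row values are all made up for this example. The idea it borrows is that each commit is a numbered JSON file listing what was added and removed, that a reader rebuilds any version by replaying the log up to that number, and that a writer must create the next numbered file exclusively, so a stale writer fails instead of overwriting.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Toy versioned table backed by numbered JSON commits (not Delta Lake itself)."""

    def __init__(self, root):
        self.log_dir = Path(root) / "_toy_log"  # hypothetical directory name
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def commit(self, add=(), remove=(), expected_version=None):
        """Write the next commit file; fail if that version already exists."""
        if expected_version is None:
            expected_version = self.latest_version()
        path = self.log_dir / f"{expected_version + 1:020d}.json"
        # Mode "x" creates the file exclusively and raises FileExistsError
        # if another writer already committed this version.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return expected_version + 1

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live rows."""
        if version is None:
            version = self.latest_version()
        rows = set()
        for v in range(version + 1):
            actions = json.loads((self.log_dir / f"{v:020d}.json").read_text())
            rows -= set(actions["remove"])
            rows |= set(actions["add"])
        return rows


root = tempfile.mkdtemp()                      # start from an empty directory
table = ToyTable(root)
table.commit(add=["alice", "bob"])             # version 0
table.commit(add=["carol"], remove=["bob"])    # version 1

print(sorted(table.snapshot()))                # latest state
print(sorted(table.snapshot(version=0)))       # "time travel" back to version 0

try:
    table.commit(add=["dave"], expected_version=0)  # writer with a stale view
except FileExistsError:
    print("conflict: version 1 was already committed")
```

A real table format stores references to data files rather than rows and adds schema metadata and checkpoints, so this only conveys the shape of the mechanism.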

Frank Munz

June 09, 2021

Transcript

  1. Architect your Data
    for the 21st Century
    Hasta la vista, data lake?
    Frank Munz
    @frankmunz

  3. Big Data Processing & Tools

  4. Big Data Processing

    Supercomputers / COTS cluster
    HPC with PVM/MPI

    Hadoop cluster
    map/reduce

    Apache Spark
    Python / Scala or SQL
    Streaming & Batch
    on-premises / multi-cloud

  5. Apache Spark: Dataframes and SQL

  6. Apache Spark: Batch and Streaming

  7. Different APIs, Powered by the Same Engine
    Scala DataFrame
    Python DataFrame
    pandas APIs
    SQL Language

  8. Data in AI / Machine Learning

  9. Machine Learning: Classification
    twitter: teenybiscuit

  10. Machine Learning: Forecasting

  11. BI / Data Analytics

  12. Analytics

  13. Data Warehouse: great for BI; SQL only, expensive, poor for ML (future)
    Data Lake: cheap, open formats; poor for BI (past), data swamps?

  15. Data lakes become data swamps
    RELIABILITY & QUALITY
    PERFORMANCE & LATENCY
    GOVERNANCE

  16. Delta Lake solves the challenges with data lakes
    RELIABILITY & QUALITY: ACID transactions
    PERFORMANCE & LATENCY: Advanced indexing & caching
    GOVERNANCE: Fine-grained access control

  17. Lakehouse: Delta Lake Publication
    https://databricks.com/wp-content/uploads/2020/12/cidr_lakehouse.pdf

  18. Lakehouse adoption across industries

  19. Demo 1
    Getting Started with Delta.io and CLI

  20. Demo 2
    Jupyter Notebook:
    Delta Tables, UPDATEs, time travel etc.

  21. Delta Sharing
    Data Provider: Delta Lake Table, Delta Sharing Server (access permissions)
    Delta Sharing Protocol
    Data Recipient: Any Sharing Client
    delta.io/sharing
    Video

  22. Delta Sharing from Pandas / Jupyter Notebook

  23. Demo 3:
    Delta Sharing

  24. Multicloud Databricks Workspace
    ● Delta Live Tables
    ● Unity Catalog
    ● MLflow
    ● AutoML
    ● Feature Store

  25. Hasta la vista, data lake?
    Delta Lake fixes the problems with a data lake and
    provides reliability, performance and governance.
    The Lakehouse is a unified platform for
    Data, Analytics & ML
    Check it out: delta.io or databricks.com/try-databricks

  26. How to engage?
    delta.io
    delta-users Slack
    delta-users Google Group
    Delta Lake YouTube channel
    Delta Lake GitHub Issues
    Delta Lake RS Bi-weekly meetings

  27. Summary Posting:
    https://hackernoon.com/top-7-announcements-from-data-and-ai-summit-202-hd2735c0
    Slice&DAIS 2021 EMEA Event:
    https://www.meetup.com/Spark-Munich/events/278396185/

  28. @frankmunz
    https://fmunz.medium.com
    https://github.com/fmunz
    https://www.linkedin.com/in/frankmunz
    https://speakerdeck.com/fmunz

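
Slides 21 to 23 cover Delta Sharing from pandas and Jupyter. As a hedged sketch of what a recipient needs, the snippet below writes a profile file and builds the table address in the `<profile>#<share>.<schema>.<table>` form that the open source `delta-sharing` Python client accepts. The endpoint, token, share, schema, and table names are placeholders invented for this example; a real profile file is issued by the data provider. The actual load call is left as a comment because it needs the `delta-sharing` package and a reachable sharing server.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values only: a real profile comes from the data provider.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # hypothetical
    "bearerToken": "<token>",                                   # hypothetical
}

profile_path = Path(tempfile.mkdtemp()) / "config.share"
profile_path.write_text(json.dumps(profile, indent=2))


def table_url(profile_file, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address for a shared table."""
    return f"{profile_file}#{share}.{schema}.{table}"


# Share, schema, and table names are made up for illustration.
url = table_url(profile_path, "my_share", "default", "my_table")
print(url)

# With the delta-sharing package installed and a real profile, the table
# could then be read into a pandas DataFrame (not executed here):
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```

Keeping the credentials in a separate profile file means the same notebook code can point at a different provider by swapping the file.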