The Data Lakehouse

Frank Munz
August 30, 2022

Presentation slides from Data Natives 2022 Berlin, by Frank Munz (Databricks):

Simple. Open. Multicloud.
The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes.

This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science and machine learning. It’s built on open source and open standards to maximize flexibility. And its common approach to data management, security and governance helps you operate more efficiently and innovate faster.

Transcript

  1. ©2022 Databricks Inc. — All rights reserved
    The Data Lakehouse
    Data Natives Conference 2022
    Dr Frank Munz
    @frankmunz

  2. Databricks
    The Data + AI Company
    Global adoption
    Over 7000 customers, from F500 to unicorns
    Inventor and pioneer of the data lakehouse
    Gartner recognized leader in both
    ● Database Management Systems
    ● Data Science and Machine Learning Platforms
    Creator of highly successful OSS data projects: Delta Lake, Apache Spark, Delta Sharing, and MLflow
    Raised over $3B in investment
    4000+ employees across the globe

  3. Data, analytics, and AI enabled tech’s leaders to disrupt industries

  4. Most enterprises still struggle with data, analytics, and AI

  5. Realizing this requires two disparate, incompatible data platforms
    [Diagram: Data Maturity Curve. Axes: Data + AI Maturity, Competitive Advantage]
    Stages: Reports, Clean Data, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics, Automated Decision Making
    Data Lake for AI: What will happen?
    Data Warehouse for BI: What happened?
    Questions along the curve: What will happen? How should we respond? Automatically make the best decision

  6. [Diagram: a separate data warehouse and data lake compared with one platform]
    Data Warehouse: Business Intelligence, SQL Analytics; structured tables; Governance and Security: Table ACLs; highly reliable and efficient
    Data Lake: Data Science & ML, Data Streaming; structured and unstructured files; Governance and Security: Files and Blobs; all of the data and very adaptable
    Problems with two platforms: copy subsets of data; disjointed and duplicative data silos; incompatible security and governance models; incomplete support for use cases
    One platform: Business Intelligence, SQL Analytics, Data Science & ML, Data Streaming; structured tables and unstructured files; Governance and Security: Files and Blobs and Table ACLs
    There is no need to have two disparate platforms

  7. Databricks Lakehouse Platform
    Simple: Unify your data warehousing and AI use cases on a single platform
    Multicloud: One consistent data platform across clouds
    Open: Built on open source and open standards
    [Diagram: Lakehouse Platform layers]
    Data Warehousing, Data Engineering, Data Science and ML, Data Streaming
    Unity Catalog: Fine-grained governance for data and AI
    Delta Lake: Data reliability and performance
    Cloud Data Lake: All structured and unstructured data

  8. Databricks thrives within your modern data stack
    [Diagram: partner ecosystem around the Lakehouse Platform]
    Platform: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming; Unity Catalog; Delta Lake; Cloud Data Lake
    Partner categories: Data Governance, BI and Dashboards, Machine Learning, Data Science, Data Pipelines, Data Ingestion, Consulting & SI Partners

  9. Supporting enterprises in every industry
    Healthcare & Life Sciences
    Media & Entertainment
    Financial Services
    Public Sector
    Energy & Utilities
    Digital Native
    Manufacturing & Logistics
    Retail & CPG

  10. An open approach to bringing data management and governance to data lakes
    Better reliability with transactions
    48x faster data processing with indexing
    Data governance at scale with fine-grained access control lists
    [Diagram: Data Warehouse, Data Lake]

  11. All of Delta Lake 2.0 is open
    ACID Transactions
    Scalable Metadata
    Time Travel
    Open Source
    Unified Batch/Streaming
    Schema Evolution/Enforcement
    Audit History
    DML Operations
    OPTIMIZE Compaction
    OPTIMIZE ZORDER
    Change data feed
    Table Restore
    S3 Multi-cluster writes
    MERGE Enhancements
    Stream Enhancements
    Simplified Logstore
    Data Skipping via Column Stats
    Multi-part checkpoint writes
    Generated Columns
    Column Mapping
    Generated column support w/ partitioning
    Identity Columns
    Subqueries in deletes and updates
    Clones
    Iceberg to Delta converter
    Fast metadata only deletes
    Coming Soon!

  12. Databricks SQL
    Serverless: Eliminate compute infrastructure management
    Instant, Elastic Compute
    Zero Management
    Lower TCO
    Photon: Vectorized C++ exec engine, Apache Spark API

  13. Amgen
    Challenge: Amgen is relentlessly focused on invention and optimization, but disjointed data platforms prevented their departments from collaborating to uncover new avenues of revenue growth with machine learning
    Solution: With an open Databricks lakehouse, Amgen delivered almost 300 cross-functional analytics and machine learning projects using a wide variety of tools in the first year to improve drug delivery and patient outcomes
    Impact:
    $100M saved in clinical trial costs
    11% uplift in sales success with physicians
    $6.4M saved in infrastructure costs

  14. Challenge: Goldman Sachs wanted the Apple Card to reach as many customers as possible without significantly increasing risk, but their data architecture could not easily support the real-time machine learning required to make it happen
    Solution: Using Databricks, Goldman Sachs deployed a lakehouse that processes 30TB a day across a large portfolio of data providers to accurately predict constantly evolving lender risk profiles
    Impact:
    $50M in revenue from improved credit risk approval models
    $53M in revenue from better cross-selling promotions

  15. Demo Time!

  17. Delta Live Tables
    Cleanse and Transform Tweets

  18. Tweepy API: Streaming Twitter Feed

  19. Auto Loader: Streaming Data Ingestion
    Ingest Streaming Data with Automatic Schema Detection
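
The idea of automatic schema detection can be illustrated without Auto Loader itself. The following is a minimal sketch in plain Python, not the Auto Loader API: it scans JSON records and collects the type names seen for each field. The function name `infer_schema` and the sample records are hypothetical and exist only for this illustration.

```python
import json


def infer_schema(json_lines):
    """Collect, per field, the sorted list of Python type names seen in the records."""
    seen = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # A field may appear with different types in different records.
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical sample records; the second one introduces a new field.
sample = [
    '{"id": 1, "text": "hello", "lang": "en"}',
    '{"id": 2, "text": "hallo", "retweets": 3}',
]

schema = infer_schema(sample)
print(schema)
```

A field that shows up only in later records, such as `retweets` here, is simply added to the detected schema, which is the situation a streaming ingestion tool has to handle as new data arrives.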

  20. Declarative, auto scaling Data Pipelines in SQL
    CTAS Pattern: Create Table As Select …
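
The CTAS pattern named on this slide defines a table by the query that populates it. As a rough illustration only, the sketch below runs a `CREATE TABLE ... AS SELECT` statement against an in-memory SQLite database from Python. SQLite is a stand-in chosen so the example runs anywhere; it is not Delta Live Tables, and the table names `raw_tweets` and `clean_tweets` and all rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in (not Delta Live Tables).
conn = sqlite3.connect(":memory:")

# Hypothetical raw table of tweets; names and rows are made up for illustration.
conn.execute("CREATE TABLE raw_tweets (id INTEGER, lang TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO raw_tweets VALUES (?, ?, ?)",
    [
        (1, "en", "  Lakehouse demo!  "),
        (2, "de", "Hallo Berlin"),
        (3, "en", None),  # missing text, filtered out below
    ],
)

# CTAS: the cleaned table is declared as the result of a SELECT over the raw table.
conn.execute(
    """
    CREATE TABLE clean_tweets AS
    SELECT id, TRIM(text) AS text
    FROM raw_tweets
    WHERE lang = 'en' AND text IS NOT NULL
    """
)

clean_rows = conn.execute("SELECT id, text FROM clean_tweets ORDER BY id").fetchall()
print(clean_rows)
```

The point of the pattern is that the transformation is stated declaratively: the statement says what the cleaned table contains, and the engine decides how to build it.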

  21. Declarative, auto scaling Data Pipelines

  22. DWH / SQL Persona

  23. Hugging Face -> Sentiment Analysis
    (POS, NEG, NEU) + probability
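
The slide does not say which model the demo uses, so the sketch below covers only the last step such a classifier typically needs: turning raw class scores (logits) into a single label plus its probability. It is plain Python under stated assumptions: the label order in `LABELS`, the helper names, and the example logits are all hypothetical.

```python
import math

# Assumed label order for the three sentiment classes (hypothetical).
LABELS = ("NEG", "NEU", "POS")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_sentiment(logits, labels=LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for a single tweet.
label, probability = top_sentiment([-1.2, 0.3, 2.5])
print(label, round(probability, 3))
```

Storing both values, rather than the label alone, lets downstream queries filter on how confident the prediction is.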

  26. Built-in Orchestration for all Tasks

  27. Watch the live demo from Data + AI Summit
    Databricks.com / Watch Demos
    Demo recording
    Notebooks on GitHub
    Hot off the press: Kafka+DLT BLOG

  28. @frankmunz
    https://fmunz.medium.com
    https://www.linkedin.com/in/frankmunz
    https://speakerdeck.com/fmunz
    www.databricks.com/try-databricks