The Data Lakehouse

Presentation slides from Data Natives 2022, Berlin, by Frank Munz (Databricks):

Simple. Open. Multicloud.
The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes.

This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science and machine learning. It’s built on open source and open standards to maximize flexibility. Its common approach to data management, security and governance helps you operate more efficiently and innovate faster.

Frank Munz

August 30, 2022

Transcript

  1. ©2022 Databricks Inc. — All rights reserved The Data Lakehouse

    Data Natives Conference 2022 Dr Frank Munz @frankmunz
  2. Databricks: The Data + AI Company

    Global adoption: over 7000 customers, from F500 to unicorns. Inventor and pioneer of the data lakehouse. Gartner-recognized leader in both Database Management Systems and Data Science and Machine Learning Platforms. Creator of highly successful OSS data projects: Delta Lake, Apache Spark, Delta Sharing, and MLflow. Raised over $3B in investment. 4000+ employees across the globe.
  3. Data, analytics, and AI enabled tech’s leaders to disrupt industries
  4. Most enterprises still struggle with data, analytics, and AI
  5. Realizing this requires two disparate, incompatible data platforms

    Data Maturity Curve (Data + AI Maturity vs. Competitive Advantage). Stages: Reports, Clean Data, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics, Automated Decision Making. Questions along the curve: What happened? What will happen? How should we respond? Automatically make the best decision. Platforms: Data Warehouse for BI, Data Lake for AI.
  6. Business Intelligence, SQL Analytics, Data Science & ML, Data Streaming

    Data Lake: structured and unstructured files; governance and security on files and blobs. Data Warehouse: structured tables; governance and security through table ACLs. Problems: copy subsets of data; disjointed and duplicative data silos; incompatible security and governance models. Further diagram labels: highly reliable and efficient; all of the data and very adaptable; incomplete support for use cases. Combined view: Data Science & ML, Data Streaming, Business Intelligence and SQL Analytics over structured tables and unstructured files, with governance and security covering files and blobs and table ACLs. There is no need to have two disparate platforms.
  7. Databricks Lakehouse Platform

    Simple: unify your data warehousing and AI use cases on a single platform. Multicloud: one consistent data platform across clouds. Open: built on open source and open standards. Lakehouse Platform workloads: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming. Unity Catalog: fine-grained governance for data and AI. Delta Lake: data reliability and performance. Cloud Data Lake: all structured and unstructured data.
  8. Databricks thrives within your modern data stack

    Diagram labels: Data Governance, Data Warehousing, Data Engineering, Data Science and ML, Data Streaming, BI and Dashboards, Machine Learning, Data Science, Consulting & SI Partners, Data Pipelines, Data Ingestion, Unity Catalog, Delta Lake, Cloud Data Lake.
  9. Supporting enterprises in every industry

    Healthcare & Life Sciences, Media & Entertainment, Financial Services, Public Sector, Energy & Utilities, Digital Native, Manufacturing & Logistics, Retail & CPG.
  10. An open approach to bringing data management and governance to data lakes

    Better reliability with transactions. 48x faster data processing with indexing. Data governance at scale with fine-grained access control lists. Diagram labels: Data Warehouse, Data Lake.
  11. All of Delta Lake 2.0 is open

    ACID Transactions, Scalable Metadata, Time Travel, Open Source, Unified Batch/Streaming, Schema Evolution/Enforcement, Audit History, DML Operations, OPTIMIZE Compaction, OPTIMIZE ZORDER, Change data feed, Table Restore, S3 Multi-cluster writes, MERGE Enhancements, Stream Enhancements, Simplified Logstore, Data Skipping via Column Stats, Multi-part checkpoint writes, Generated Columns, Column Mapping, Generated column support w/ partitioning, Identity Columns, Subqueries in deletes and updates, Clones, Iceberg to Delta converter, Fast metadata only deletes. Coming Soon!
  12. Databricks SQL

    Serverless: eliminate compute infrastructure management; instant, elastic compute; zero management; lower TCO. Photon: vectorized C++ execution engine, Apache Spark API.
  13. Amgen

    Challenge: Amgen is relentlessly focused on invention and optimization, but disjointed data platforms prevented their departments from collaborating to uncover new avenues of revenue growth with machine learning. Solution: With an open Databricks lakehouse, Amgen delivered almost 300 cross-functional analytics and machine learning projects using a wide variety of tools in the first year to improve drug delivery and patient outcomes. Impact: $100M saved in clinical trial costs; 11% uplift in sales success with physicians; $6.4M saved in infrastructure costs.
  14. Goldman Sachs

    Challenge: Goldman Sachs wanted the Apple Card to reach as many customers as possible without significantly increasing risk, but their data architecture could not easily support the real-time machine learning required to make it happen. Solution: Using Databricks, Goldman Sachs deployed a lakehouse that processes 30TB a day across a large portfolio of data providers to accurately predict constantly evolving lender risk profiles. Impact: $50M in revenue from improved credit risk approval models; $53M in revenue from better cross-selling promotions.
  15. Demo Time!

  16. (no text on this slide)

  17. Delta Live Tables: Cleanse and Transform Tweets
  18. Tweepy API: Streaming Twitter Feed
  19. Auto Loader: Streaming Data Ingestion

    Ingest Streaming Data with Automatic Schema Detection
  20. Declarative, auto scaling Data Pipelines in SQL

    CTAS Pattern: Create Table As Select …
  21. Declarative, auto scaling Data Pipelines
  22. DWH / SQL Persona
  23. Hugging Face -> Sentiment Analysis (POS, NEG, NEU) + probability
  24. (no text on this slide)
  25. (no text on this slide)

  26. Built-in Orchestration for all Tasks
  27. Watch the live demo from Data + AI Summit

    Databricks.com / Watch Demos. Demo recording. Notebooks on GitHub. Hot off the press: Kafka+DLT BLOG.
  28. @frankmunz https://fmunz.medium.com https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz www.databricks.com/try-databricks
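
Slide 20 names the CTAS pattern (Create Table As Select) for declarative pipelines, and slide 17 applies it to cleansing tweets. The following is a minimal sketch of that pattern only. It uses Python's built-in sqlite3 module as a stand-in engine, not Delta Live Tables, and the table names `raw_tweets` and `cleansed_tweets`, the columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_cleansed_table(conn: sqlite3.Connection) -> list:
    """Derive cleansed_tweets from raw_tweets with one CTAS statement."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS cleansed_tweets;
        -- CTAS: the target table is defined entirely by the SELECT.
        CREATE TABLE cleansed_tweets AS
        SELECT id,
               TRIM(raw_tweets.text)  AS text,
               LOWER(raw_tweets.lang) AS lang
        FROM raw_tweets
        WHERE raw_tweets.text IS NOT NULL
          AND TRIM(raw_tweets.text) <> '';
        """
    )
    return conn.execute(
        "SELECT id, text, lang FROM cleansed_tweets ORDER BY id"
    ).fetchall()


def demo() -> list:
    """Load a few hypothetical raw tweets and return the cleansed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_tweets (id INTEGER, text TEXT, lang TEXT)")
    conn.executemany(
        "INSERT INTO raw_tweets VALUES (?, ?, ?)",
        [
            (1, "  Lakehouse talk was great  ", "EN"),
            (2, None, "en"),   # dropped: missing text
            (3, "   ", "de"),  # dropped: whitespace only
            (4, "Delta Lake 2.0", "EN"),
        ],
    )
    rows = build_cleansed_table(conn)
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The point of the pattern is that the cleansed table is described by what it should contain rather than by the steps that fill it. A pipeline engine would express the same shape in its own SQL dialect, which is not shown here.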