Architectures
And how to build a Lakehouse

About me
• 2001: ML Research
• 2014: DS Consultancy
• 2020: Field Engineering

Agenda
I. Simplifying Data+AI with Databricks Lakehouse
II. Customer stories

I’ll make it quick and punchy for a crisp conference end!
+ AI Company
• Global adoption: over 7000 customers, from F500 to unicorns
• Inventor and pioneer of the data lakehouse
• Gartner-recognized leader in both Database Management Systems and Data Science and Machine Learning Platforms
• Creator of highly successful OSS data projects: Delta Lake, Apache Spark, and MLflow
• Raised over $3B in investment
• 3000+ employees across the globe
to the right of the Data Maturity Curve
[Figure: Data Maturity Curve, from hindsight to foresight. Axes: Data + AI Maturity and Competitive Advantage.]
• Stages: Clean Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics, Automated Decision Making
• Questions answered along the curve: What happened? What will happen? How should we respond? Automatically make the best decision
Realizing this requires two disparate, incompatible data platforms
[Figure: Data Maturity Curve. Axes: Data + AI Maturity and Competitive Advantage. Stages: Clean Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics, Automated Decision Making.]
• Data Warehouse for BI: What happened?
• Data Lake for AI: What will happen?
Data Warehouse
• Data: structured tables
• Workloads: Business Intelligence, SQL Analytics
• Governance and Security: Table ACLs

Data Lake
• Data: unstructured files (logs, text, images, video)
• Workloads: Data Science & ML, Data Streaming
• Governance and Security: Files and Blobs

Subsets of data are copied between the two platforms.
Running two platforms results in:
• Disjointed and duplicative data silos
• Incompatible security and governance models
• Incomplete support for use cases
This is too complex and expensive, making it hard to achieve the full potential of Data, Analytics, and AI.
Lakehouse Platform
• Disjointed and duplicative data silos ➔ An open and reliable data platform to efficiently handle all data types
• Incomplete support for use cases ➔ All machine learning, SQL, BI, and streaming use cases
• Incompatible security and governance models ➔ One security and governance approach for all data assets on all clouds
your data warehousing and AI use cases on a single platform

Databricks Lakehouse Platform
• Workloads: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming
• Unity Catalog: fine-grained governance for data and AI
• Delta Lake: data reliability and performance
• Cloud Data Lake: all structured and unstructured data

• Open: built on open source and open standards
• Multicloud: one consistent data platform across clouds
Databricks thrives within your modern data stack
• Platform: Data Warehousing, Data Engineering, Data Science and ML, Data Streaming, on top of Unity Catalog, Delta Lake, and the Cloud Data Lake
• Partner categories: Data Ingestion, Data Pipelines, BI and Dashboards, Machine Learning, Data Science, Consulting & SI Partners
on Databricks
• Data orchestration through Databricks Workflows
• Delta Live Tables manages your full data pipelines
• Delta Lake simplifies data engineering with a curated data lake approach
science workloads on Databricks

Machine Learning
• Model registry, reproducibility, productionization
• Leverages Delta Lake for reproducibility
• AutoML for citizen data scientists

Data Science
• Collaborative notebooks and dashboards for interactive analysis
• Native support for Python, Java, R, Scala
• Delta Lake data natively supported
Databricks
• Great performance and concurrency for BI and SQL workloads on Delta Lake
• Native SQL interface for analysts
• Support for BI tools to directly query your most recent data in Delta Lake
Lakehouse Governance

Govern and manage all data assets
• Warehouse, Tables, Columns
• Data Lake, Files
• Machine Learning Models
• Dashboards and Notebooks

Capabilities
• Data lineage
• Attribute-based access control
• Security policies
• Table or column level tags
• Auditing
• Data sharing
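To make "attribute-based access control" and "table or column level tags" concrete, here is a minimal sketch in plain Python. It is not the Unity Catalog API. The `Column`, `User`, `can_read`, and `visible_columns` names, the `pii` tag, and the `pii_reader` attribute are all hypothetical, chosen only to show how a tag on a column and an attribute on a user can drive an access decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    tags: frozenset = frozenset()  # e.g. {"pii"}; hypothetical tag names


@dataclass(frozen=True)
class User:
    name: str
    attributes: frozenset = frozenset()  # e.g. {"pii_reader"}; hypothetical


# Assumed policy for illustration: each tag maps to the user attribute
# required to read a column carrying that tag.
REQUIRED_ATTRIBUTE_FOR_TAG = {"pii": "pii_reader"}


def can_read(user: User, column: Column) -> bool:
    """Allow access only if the user holds every attribute the column's tags require."""
    for tag in column.tags:
        required = REQUIRED_ATTRIBUTE_FOR_TAG.get(tag)
        if required is not None and required not in user.attributes:
            return False
    return True


def visible_columns(user: User, columns: list) -> list:
    """Return the names of the columns this user may read."""
    return [c.name for c in columns if can_read(user, c)]


if __name__ == "__main__":
    columns = [
        Column("sensor_id"),
        Column("operator_email", frozenset({"pii"})),
    ]
    analyst = User("analyst")
    steward = User("steward", frozenset({"pii_reader"}))
    print(visible_columns(analyst, columns))
    print(visible_columns(steward, columns))
```

The point of the design is that the policy is written once against tags, so newly tagged columns are covered without per-table grants.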
central to global Multi-cloud Enterprise Data Platform
➔ Center of Excellence central to way of working
➔ Instrumental to the Energy Transition
Energy Transition Campus Amsterdam
in the Sky

DATA PREPARATION
70 billion rows of sensor data ingested, enriched and prepared for data science (5 years from Europe’s largest refinery).

SQL
Real-time insights, monitoring and alerting with ad-hoc and scheduled SQL queries.

MODEL MANAGEMENT
Every model is tracked in MLflow, providing a record of how each was trained, and a model registry for easy selection and deployment.

MODEL TRAINING
160,000 models trained using Databricks optimised Spark runtime (one for each sensor).
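The "one model for each sensor" pattern can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline described on the slide. The slide does not say what kind of model was trained, so a mean and standard deviation baseline stands in for it. A plain dictionary stands in for model tracking, and no Spark or MLflow calls are used. The function names `train_per_sensor` and `is_anomaly` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def train_per_sensor(readings):
    """Group (sensor_id, value) pairs by sensor and fit one baseline per sensor.

    Returns a dict keyed by sensor_id; each entry records the fitted
    parameters plus how many rows the model was trained on.
    """
    by_sensor = defaultdict(list)
    for sensor_id, value in readings:
        by_sensor[sensor_id].append(value)

    registry = {}
    for sensor_id, values in by_sensor.items():
        registry[sensor_id] = {
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation
            "n_rows": len(values),
        }
    return registry


def is_anomaly(registry, sensor_id, value, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from its sensor's mean."""
    model = registry[sensor_id]
    if model["std"] == 0:
        # A constant training signal: any deviation at all is flagged.
        return value != model["mean"]
    return abs(value - model["mean"]) / model["std"] > threshold


if __name__ == "__main__":
    readings = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 10.0)]
    registry = train_per_sensor(readings)
    print(sorted(registry))
    print(is_anomaly(registry, "a", 10.0))
```

Because each sensor's model depends only on that sensor's rows, the training loop is embarrassingly parallel, which is what makes the pattern a natural fit for a distributed runtime.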
of Databricks over the years has broadened significantly. We started out using Databricks as a big data and AI platform but the scope has broadened. We have an entirely different class of citizen engineers and data scientists who are using it as a modern business intelligence tool to make smarter business decisions.”
Dan Jeavons - VP Computational Science & Digital Innovation at Shell
Lakehouse for Data Mesh and Detecting Financial Crime
“Databricks has provided one platform for our data and analytics teams to access and share data across ABN AMRO, delivering ML-based solutions that drive automation and insight throughout the company.”
Stefan Groot - Engineering Manager | AI | ML | BI at ABN AMRO
Comprehensive investment into your success

Your success
• Supported by 24/7/365 global, production operations at scale
• Solution Accelerators
• In-person and Virtual Training
• Co-located Professional Services