Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Generative AI for Data Platforms - Databricks Data Intelligence Platform

Generative AI for Data Platforms - Databricks Data Intelligence Platform

A classic data lakehouse is built on open-source table formats such as Delta.io, Iceberg, or Hudi and seamlessly integrates with big data platforms like Apache Spark and event buses like Apache Kafka or Amazon Kinesis. The popularity of the data lakehouse stems from its ability to combine the quality, speed, and simple SQL access of data warehouses with the cost-effectiveness, scalability, and support for unstructured data of data lakes.

With the advent of generative AI models and the potential of using techniques such as Retrieval-augmented generation (RAG) in combination with fine-tuning or pre-training custom LLMs, a new paradigm has emerged in 2023: AI-infused lakehouses. These platforms use generative AI for code generation, natural language queries, and semantic search, LLM callouts from SQL, enhancing governance and automating documentation.

How do lakehouses adapt to the integration of new AI capabilities?

This talk is for data architects who are not afraid of some code, for data engineers who love open source and cloud services, and for practitioners who enjoy a fun end-to-end demo. The Databricks Lakehouse is used for the demos.

Frank Munz

April 22, 2024
Tweet

More Decks by Frank Munz

Other Decks in Programming

Transcript

  1. ©2023 Databricks Inc. — All rights reserved 1 Frank Munz,

    Principal TM Engineer, Databricks / April 2024 Generative AI for Data Platforms Cutting to the Chase
  2. ©2022 Databricks Inc. — All rights reserved Hi, I am

    Frank! • Principal @Databricks. TMM for Data, Analytics and AI products • Large scale data & compute • Based in 🍻 ⛰ 🥨 󰎲 Munich • Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc. • @frankmunz / LindedIn
  3. ©2023 Databricks Inc. — All rights reserved 10,000+ global customers

    $1.5B+ in revenue $4B in investment Inventor of the lakehouse & Pioneer of generative AI Gartner-recognized Leader Database Management Systems + Data Science and Machine Learning Platforms The data and AI company Creator of
  4. ©2023 Databricks Inc. — All rights reserved Streaming Data •

    Small sized data • Continuously produced • Expectation -> processed in time • Programming paradigm ◦ Right-time vs real-time 5
  5. ©2023 Databricks Inc. — All rights reserved Streaming Data Think

    “right-time” instead of “real-time” 6 Manually Continually Scheduled Latency Cost
  6. ©2023 Databricks Inc. — All rights reserved TPC-DS Benchark from

    Barcelona HPC Center 2.2x faster with Photon than previous record for DWH 9
  7. ©2023 Databricks Inc. — All rights reserved 11 Project Lightspeed

    https://www.databricks.com/blog/project-lightspeed-update-advancing-apache-spark-structured-streaming
  8. ©2023 Databricks Inc. — All rights reserved SIMPLE and FAST

    EFFICIENT RELIABLE Serverless Compute for Data Platforms Serverless Compute Hands-off auto-optimized compute No knobs Fast startup For any practitioner Fully managed and versionless Paying only what you use Strong cost governance Secure by default Stable with smart fail-overs Storage Notebooks with Spark Pipelines AI Model hosting SQL DWH "Put your vendor T-shirts down" 14 multi-cloud
  9. ©2023 Databricks Inc. — All rights reserved Confidential and Proprietary

    Walk trough 17 • Single Page App (S3) • Kinesis Stream ◦ JSON Structure ◦ Kinesis Ingest with EFO • Delta Live Tables (ETL) • Spark Streaming Data Analytics ◦ Histogram streaming data ◦ Window-based aggregation • Databricks Workflows • Databricks SQL
  10. ©2024 Databricks Inc. — All rights reserved Databricks Data Intelligence

    Platform Use generative AI to understand the semantics of your data Data Intelligence Engine Open Data Lake (lake first approach: S3, ADLS, GCS) Databricks SQL Text-to-SQL Workflows optimized based on past runs Delta Live Tables Automated data qualility Mosaic AI Create, tune, and serve custom LLMs Unity Catalog Securely get insights in natural language Delta Lake with Delta UniForm Data layout is automatically optimized based on usage patterns
  11. ©2023 Databricks Inc. — All rights reserved Streaming ETL with

    Delta Live Tables Pipelines Python or SQL. STs for ingestion and MVs for transformation Bronze cloud_files CREATE STREAMING TABLE Use a short retention period to avoid compliance risks and reduce costs Avoid complex transformations that could have bugs or drop important data Retain infinite history Easy to perform GDPR and other compliance tasks CREATE MATERIALIZED VIEW Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes. Silver/Gold Ad-hoc DML for GDPR / Corrections
  12. ©2023 Databricks Inc. — All rights reserved 20 Delta Live

    Tables • Serverless Compute (zero compute settings) • Streaming Ingest from Message Buses with SQL read_kafka(), read_kinesis(), … • Incrementally computed Materialized Views Link to blog
  13. ©2023 Databricks Inc. — All rights reserved Building Blocks of

    Databricks Workflows 21 A unit of orchestration in Databricks Workflows is called a Job. Databricks Notebooks Python Scripts Python Wheels SQL Files/Queries Delta Live Tables Pipeline dbt Java JAR file Spark Submit Jobs consist of one or more Tasks Sequential Parallel Conditionals (Run If) Jobs-as-a-Task (Modular) Control flows can be established between Tasks. Jobs supports different Triggers DBSQL Dashboards Manual Trigger Scheduled (Cron) API Trigger File Arrival Triggers Table Triggers Continuous (Streaming) Preview Coming Soon
  14. ©2024 Databricks Inc. — All rights reserved Databricks Data Intelligence

    Platform Use generative AI to understand the semantics of your data Data Intelligence Engine Open Data Lake (lake first approach: S3, ADLS, GCS) Databricks SQL Text-to-SQL Workflows optimized based on past runs Delta Live Tables Automated data qualility Mosaic AI Create, tune, and serve custom LLMs Unity Catalog Securely get insights in natural language Delta Lake with Delta UniForm Data layout is automatically optimized based on usage patterns
  15. ©2023 Databricks Inc. — All rights reserved Confidential and Proprietary

    We’re infusing AI in our experiences AI-generated docs + semantic search in Catalog Explorer Databricks Assistant SQL to Dashboard Data Rooms (Project Genie)
  16. ©2023 Databricks Inc. — All rights reserved MosaicML Model Serving

    MosaicML Model Serving Vector Search MLflow AI Gateway Model Serving MLflow AI Gateway MLflow Evaluation MLflow Prompt Engg Generative AI Solutions Enable every architectural pattern Prompt Engineering and Chains Retrieval Augmented Generation (RAG) Fine-tuning Pre-training Unity Catalog | Lakehouse Monitoring Crafting specialized prompts to guide LLM behavior Combining an LLM with enterprise data Adapting a pre-trained LLM to specific data sets or domains Training an LLM from scratch Complexity / Compute-intensiveness
  17. Model Serving Custom Models Foundation Models APIs External Models Deploy

    any model as a REST API with Serverless compute, managed via MLflow. CPU and GPU. Integration with Feature Store and Vector Search. Govern external models and APIs. This provides the governance of MLflow Deployments for LLMs, plus the monitoring and payload logging of traditional Databricks Model Serving. Databricks curates top Foundation Models and provides them behind simple APIs. You can start experimentation immediately, without setting up serving yourself. Databricks Model Serving Unified UI, API & SDK for managing all types of AI Models
  18. ©2024 Databricks Inc. — All rights reserved Built-in governance with

    permissions and lineage Automatically synchronizes streaming source data with vector db. No separate data pipelines Vector DB Serverless vector database for RAG
  19. ©2024 Databricks Inc. — All rights reserved Finetuning Finetune your

    LLM on your data Serverless: no need to reserve or pick GPUs Pick the data from Unity Catalog or from Huggingface Maintain control and ownership of the model. It is your Intellectual Property.
  20. ©2024 Databricks Inc. — All rights reserved Mosaic AI Training

    Up to 7X faster and cheaper training of large AI Models Simplified, scalable, and cost-effective training of large AI models. Train or fine-tune your own generative AI model with your data in your secure environment. Full control of your model and privacy of your data. Your data, your model, built in your secure environment.
  21. ©2024 Databricks Inc. — All rights reserved Databricks Marketplace Share

    data sets with notebooks, and OSS / proprietary AI models Based on OSS Delta Sharing One click Instant Access 32
  22. ©2024 Databricks Inc. — All rights reserved Model architecture •

    Sparse Mixture-of-Experts (MoE) • 4 of 16 experts for a given input Model training • Pre-trained on 3072 NVIDIA H100s in 3 months. • on Databricks Data Intelligence Platform, Notebooks, Jobs, etc. The models • DBRX Base for fine-tuning • DBRX Instruct for RAG chains • 132B parameters • 32k token context length License and data • Open-source for commercial use • Pretrained on publicly available 12T tokens • Designed for enterprises Introducing DBRX’s details
  23. ©2024 Databricks Inc. — All rights reserved DBRX outperforms established

    open source models on language understanding (MMLU), Programming (HumanEval), and Math (GSM8K).
  24. ©2024 Databricks Inc. — All rights reserved DBRX outperforms GPT

    3.5 on language understanding (MMLU), Programming (HumanEval), and Math (GSM8K).
  25. ©2024 Databricks Inc. — All rights reserved How can I

    try DBRX? Hugging Face Spaces Databricks FM API AI Playground labs.perplexity you.com, poe.com
  26. ©2023 Databricks Inc. — All rights reserved Confidential and Proprietary

    39 Gen AI meets Data Platforms Data Intelligence Engine + Unified Governance -> Assistant, Intelligent Search, automated documentation, natural language queries and better scheduling, automated data quality
  27. ©2023 Databricks Inc. — All rights reserved Confidential and Proprietary

    40 Data Platform meets gen AI "There is no good model with bad data"
  28. ©2023 Databricks Inc. — All rights reserved 41 New Databricks

    Demo Center databricks.com/demos Todays demo
  29. ©2023 Databricks Inc. — All rights reserved 42 Thank You!

    @frankmunz Please rate this presentation!