Databricks Data Intelligence Platform (architecture overview):
• Analytics, orchestration, data warehousing, data science & AI: Mosaic AI, Delta Live Tables, Workflows, Databricks SQL
• Unified security, governance, and cataloging: Unity Catalog
• Unified data storage for reliability and sharing: Delta Lake
• Open data lake for all raw data (logs, text, audio, video, images)
The Data Intelligence Engine applies generative AI to understand the semantics of your data:
• Unity Catalog: securely gain insights in natural language
• Delta Lake: automatically optimizes data layout based on usage patterns
• Databricks SQL: Text-to-SQL
• Workflows: job cost optimization based on past runs
• Delta Live Tables: automated data quality
• Mosaic AI: create, tune, and serve custom LLMs
AutoML releases and roadmap: expanding support for problem types, models, feature types, and customization (a Python API sketch follows)
• Problem types: classification, regression, time-series forecasting
• Feature types: numeric, categorical, text, timestamp
• Tuning, models, and customization (recent releases and roadmap items): ARIMA, feature selection in the UI, configurable null-value imputation, configurable model selection, distributed training support
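The AutoML capabilities above are also exposed through a Python API. A minimal sketch of a classification run, assuming a hypothetical training table and label column:

    # Minimal AutoML sketch; table and column names are placeholders.
    from databricks import automl

    df = spark.table("ml.demo.training_data")   # hypothetical UC table
    summary = automl.classify(
        dataset=df,
        target_col="label",        # column to predict
        timeout_minutes=30,        # experiment budget
    )
    print(summary.best_trial.model_path)        # MLflow URI of the best model

automl.regress and automl.forecast follow the same pattern for regression and time-series forecasting.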
MLOps - What's new?

Unified governance for data and AI: a single governance solution for data and AI assets on the Lakehouse:
◦ Centralized access control
◦ Auditing
◦ Lineage
◦ Discovery
Feature Engineering in Unity Catalog
• Any Delta table in Unity Catalog that has been assigned a primary key (and optionally a timestamp key) can be used as a source of features to train and serve models.
• Feature tables can easily be shared across workspaces, with lineage recorded between them and other assets in the lakehouse. See the sketch below.
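As a sketch of what this enables, a training set can be assembled by looking up features from such a table by its primary key; the table, key, and label names below are hypothetical:

    # Sketch: build a training set from a UC feature table (names are placeholders).
    from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

    fe = FeatureEngineeringClient()

    training_set = fe.create_training_set(
        df=spark.table("ml.demo.labels"),      # DataFrame with keys and label
        feature_lookups=[
            FeatureLookup(
                table_name="ml.demo.customer_features",  # UC table with a primary key
                lookup_key="customer_id",
            )
        ],
        label="churned",
    )
    train_df = training_set.load_df()

Because the lookup is declared rather than hand-joined, the same feature definitions can be reused automatically at serving time.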
Models in Unity Catalog
• The full model lifecycle can be managed in Unity Catalog.
• Models can be shared across Databricks workspaces.
• Lineage can be traced across both data and models. A registration sketch follows.
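A minimal registration sketch, assuming a model already logged to an MLflow run (the run ID and names are placeholders):

    # Sketch: register a logged model into Unity Catalog (names are placeholders).
    import mlflow

    mlflow.set_registry_uri("databricks-uc")   # point the registry at Unity Catalog

    mlflow.register_model(
        model_uri="runs:/<run_id>/model",      # model logged in an MLflow run
        name="ml.demo.churn_model",            # three-level name: catalog.schema.model
    )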
Real-time ML model deployment
• Model Serving provides a production-ready, serverless solution that simplifies real-time ML model deployment.
• Deploy models as an API to integrate model predictions with applications or websites.
• Model Serving:
◦ Reduces operational costs
◦ Streamlines the ML lifecycle
◦ Lets data science teams focus on integrating production-grade real-time ML into their solutions
A deployment sketch follows.
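One way to create an endpoint programmatically is through the MLflow deployments client; a hedged sketch, where the endpoint name, model name, and sizing are assumptions:

    # Sketch: create a serving endpoint for a UC model (names are placeholders).
    from mlflow.deployments import get_deploy_client

    client = get_deploy_client("databricks")
    client.create_endpoint(
        name="churn-endpoint",
        config={
            "served_entities": [{
                "entity_name": "ml.demo.churn_model",   # UC-registered model
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,          # keeps idle cost low
            }]
        },
    )

Once the endpoint is ready, applications call it as a REST API with JSON payloads.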
MLOps benefits
• Lakehouse native: automatic feature/vector lookups, monitoring, and unified governance that automate deployment and reduce errors.
• Simplified deployment: deploy any model type on CPU or GPU; automated container builds and infrastructure management reduce maintenance costs and speed up deployment.
• Serverless: highly available and scalable serving with very low latency (p50 overhead latency <10 ms) and high query volumes (>25k QPS).
Discover data and AI assets to use, with centralized discovery of assets. Learn how your teammates trained models and what data they trained with; use lineage for audits or reproducibility.
With lineage and quality: perform impact analysis, quality tracking, reproducibility, and root cause analysis with UC, e.g. root cause analysis with Lakehouse Monitoring and lineage, and impact analysis with popularity.
Online evaluation: supports online evaluation strategies such as A/B testing and canary deployments through the ability to serve multiple models behind a single serving endpoint, as in the sketch below.
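For instance, an A/B test can be expressed as a traffic split across two versions of one model on a single endpoint. A sketch using the Databricks SDK; the endpoint name, versions, and percentages are illustrative:

    # Sketch: A/B test two model versions behind one endpoint (names illustrative).
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import (
        ServedEntityInput, TrafficConfig, Route,
    )

    w = WorkspaceClient()
    w.serving_endpoints.update_config(
        name="churn-endpoint",
        served_entities=[
            ServedEntityInput(entity_name="ml.demo.churn_model", entity_version="1",
                              workload_size="Small", scale_to_zero_enabled=True),
            ServedEntityInput(entity_name="ml.demo.churn_model", entity_version="2",
                              workload_size="Small", scale_to_zero_enabled=True),
        ],
        traffic_config=TrafficConfig(routes=[
            Route(served_model_name="churn_model-1", traffic_percentage=90),
            Route(served_model_name="churn_model-2", traffic_percentage=10),
        ]),
    )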
Models on Databricks Marketplace: an open marketplace for discovering and sharing AI assets, including models such as MPT, Llama, and Mistral.
• Search models on Marketplace.
• Easily access and govern AI models by combining Databricks Marketplace with Unity Catalog.
Model Serving: unified management of all models you need to serve
• Custom models (available now): deploy any model as a REST API with serverless compute, managed via MLflow; supports CPU and GPU; integrates with Feature Store and Vector Search.
• Foundation models (available now): Databricks curates top foundation models and provides them behind simple APIs, so you can start experimenting immediately without setting up serving yourself.
• External models (available now): govern external models and APIs; provides the governance of MLflow AI Gateway plus the monitoring and payload logging of traditional Databricks Model Serving.
Simplify deployment of ML projects on Databricks with CI/CD
• Automates the creation of infrastructure for an ML project
• Includes:
◦ ML pipelines for model training, deployment, and inference, deployed using Databricks Asset Bundles
◦ Feature tables
◦ CI/CD (GitHub and Azure DevOps supported)
• Uses software development best practices and is flexible for customization
• Roadmap: Monitoring (Q1 FY25), Model Serving (Q1 FY25)
Documentation: AWS, Azure
Access and serve LLMs directly from Databricks SQL (generally available September 2024): tackle complex language tasks with native SQL functions.
• Common use cases include summarization, topic identification, entity extraction, and content creation.
• ai_query is available now, with more functions coming in Q1.
• Supports LLMs in Foundation Model APIs, External Models, and Custom Models.
• Also works with non-LLMs, e.g. classification/regression models. A usage sketch follows.
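A hedged example of calling ai_query over a table, invoked here through PySpark; the endpoint and table names are assumptions:

    # Sketch: batch summarization with ai_query (endpoint/table names assumed).
    result = spark.sql("""
        SELECT
          ticket_id,
          ai_query(
            'databricks-meta-llama-3-70b-instruct',   -- serving endpoint name
            CONCAT('Summarize this support ticket: ', body)
          ) AS summary
        FROM support.tickets
    """)
    result.show(truncate=False)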
Project Genie (gated Public Preview in Q1): lets users interact with data through LLM-powered Q&A.
• Ask questions in natural language and receive answers as text and visualizations.
• Curate dataset-specific experiences with custom instructions.
• Powered by Databricks SQL and DatabricksIQ.
Databricks Lakehouse Monitoring (generally available July 2024): out-of-the-box metrics on data and ML pipelines.
• Fully managed: no time wasted managing infrastructure, calculating metrics, or building dashboards from scratch.
• Frictionless: easy setup, out-of-the-box metrics, and generated dashboards.
• Unified: one solution for data and models, for a holistic understanding.
Challenges managing data:
• Reactive issue detection: pipelines execute, but data quality degrades; data engineers rely on feedback from data analysts and data scientists to identify deteriorating data quality in pipelines.
• Fragmented tooling: using different tools for data and model monitoring can fragment workflows and hinder teamwork among data teams.
• Difficult diagnoses: lacking a central monitoring service obscures data teams' full pipeline view, making it tough to pinpoint issues and assign responsibility.
Databricks Lakehouse Monitoring: a data platform with proactive issue management.
• Auto-generated reports: share quality updates organization-wide with auto-generated dashboards, and use ready-made metrics and analytics tools to explore issues in your data products.
• Unified monitoring: monitor the quality of all data products with a single tool, no matter the framework or platform used to build them; merge quality and business metrics in your lakehouse to gauge your data products' impact.
• Automated root cause analysis: catch data product issues before they reach consumers with cost-effective "insurance," and boost efficiency with smart automation in your data and AI pipelines, avoiding unnecessary retraining.
Monitor tables in your lakehouse (Bronze/Silver/Gold), with different out-of-the-box analysis metrics based on table type:
• Snapshot table monitor: columns
• Time series table monitor: timestamp, columns/features
• Inference table monitor (feature table fed by a Databricks batch scoring pipeline, a Databricks Model Serving endpoint, or ETL ingesting request logs from external serving or batch pipelines): timestamp, features, prediction column, label column, model ID
A configuration sketch for a time-series monitor follows.
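Following the Python API shown on the configuration slide later in this deck, setting up a time-series monitor might look like this sketch; dm.analysis.TimeSeries and its parameters are assumptions modeled on the InferenceLog example, and the table names are placeholders:

    # Sketch mirroring the deck's monitoring API; TimeSeries parameters are assumed.
    import databricks.data_monitoring as dm

    dm.create_or_update_monitor(
        table_name="ml.demo.daily_sales",        # placeholder time-series table
        analysis_type=dm.analysis.TimeSeries(
            timestamp_col="event_ts",            # required timestamp column
            granularities=["1 day"],             # aggregation windows for metrics
        ),
        output_schema_name="ml.monitoring",      # schema for metric output tables
    )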
Monitoring a table in the Lakehouse: how does it work? A monitor attached to a table feeds dashboards, DBSQL alerts, and webhooks, computing:
• Distributional statistics for inputs and outputs: minimum, maximum, standard deviation, quantiles, top occurring value, ...
• Model quality metrics (if labels are provided): classification (accuracy, F1, precision, recall); regression (MSE, RMSE, MAE, R2, ...)
• Anomaly detection and drift, both training-vs-scoring and scoring-vs-scoring: deltas/changes in nulls and counts, PSI, KS test, mean shift, total variation distance, L-inf distance, χ2 test, Wasserstein distance, ...
• Custom metrics, expressed as SQL expressions
A background service incrementally processes data in Unity Catalog tables:
• Calculates profile metrics, stored in a UC table
• Calculates drift metrics, stored in a UC table
• Supports custom metrics as SQL expressions
• Auto-generates a DBSQL dashboard to visualize metrics over time
Unified monitoring for tables and models: Lakehouse Monitoring (with AI support) runs as a Databricks serverless scheduled pipeline that profiles Unity Catalog tables (data, feature, and inference tables, the inference table representing the model for monitoring) and produces metric tables, dashboards, data drift tables, and DB SQL alerts consumable from Mosaic AI or BI tools. Configure in the Monitoring UI or via the Python API:

    import databricks.data_monitoring as dm

    dm.create_or_update_monitor(
        table_name=...,
        analysis_type=dm.analysis.InferenceLog(...),
        output_schema_name=...,
        ...
    )
    dm.refresh_metrics(...)
How it works, end to end:
• Prepare data (batch, streaming, real time) with ETL/streaming pipelines into data storage on Unity Catalog + Delta Lake
• Develop and evaluate AI assets: models, chains, agents (e.g. 🤗 pipelines, 🦜🔗 chains, plus prompts, credentials, and functions), packaged with features and indexes
• Serve data (features, indexes) and serve AI (APIs, BI/SQL)
• Monitor data and AI via logs and metrics
All with governance and lineage through Unity Catalog + Delta Lake.