Slide 1

Slide 1 text

DAIS 2021 Community Highlights | Data and AI Summit 2021 | Adi Polak, Matt Thomson, Frank Munz

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Agenda for Today ▪ Beginner-friendly DAIS highlights ▪ This intro :-) ▪ ML updates, Matt Thomson, Databricks ▪ Lakehouse updates, Frank Munz, Databricks ▪ You Might be Suffering From the Small Files Syndrome, Adi Polak, Microsoft

Slide 4

Slide 4 text

Databricks ML May 2021 announcements

Slide 5

Slide 5 text

Key ML Announcements 1. Databricks Machine Learning 2. Feature Store 3. AutoML 4. MLflow developments

Slide 6

Slide 6 text

Lakehouse: one platform to unify all your data, analytics, and AI workloads. BI & SQL, Real-time Data Applications, and Data Science & ML, on top of an Open Data Lake with Data Management & Governance.

Slide 7

Slide 7 text

Announcing: Databricks Machine Learning, a data-native and collaborative solution for the full ML lifecycle, built on the open data lakehouse foundation. It spans the Data Science Workspace, Data Prep, Data Versioning, Model Training, Model Tuning, Runtime and Environments, Monitoring, Batch Scoring, Online Serving, and MLOps / Governance.

Slide 8

Slide 8 text

Persona-based Navigation Purpose-built surfaces for data teams

Slide 9

Slide 9 text

ML Dashboard All ML related assets and resources in one place

Slide 10

Slide 10 text

Feature Store

Slide 11

Slide 11 text

Feature Store: the first feature store co-designed with a data and MLOps platform, built on the open data lakehouse foundation alongside the Data Science Workspace and AutoML, and serving features in batch (high throughput) and real time (low latency).

Slide 12

Slide 12 text

A day (or 6 months) in the life of an ML model: Raw Data → Featurization (joins, aggregates, transforms, etc.) → Training → Serving → Client, with features handed off as csv files. Pain points: no reuse of features, and online/offline skew.

Slide 13

Slide 13 text

Solving the problem with a Feature Store: Raw Data → Featurization (joins, aggregates, transforms, etc.) → Feature Store (Feature Registry + Feature Provider) → Training and Serving → Client. The feature provider serves both batch (high throughput) and online (low latency) access.
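
Below is a minimal sketch of what this looks like with the Databricks Feature Store client (the databricks.feature_store package); the exact signature has varied across releases, and the table, key, and DataFrame names here are hypothetical.

# Hypothetical sketch: register computed features once so both training
# (batch) and serving (online) read the same definitions.
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# features_df: a Spark DataFrame of joins/aggregates over raw data
# (computed in a Databricks notebook, where `spark` is predefined)
features_df = spark.table("raw_events").groupBy("customer_id").count()

fs.create_table(                  # records the table in the Feature Registry
    name="ml.churn_features",     # hypothetical feature table name
    primary_keys=["customer_id"],
    df=features_df,
    description="Churn features derived from raw events",
)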

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

AutoML

Slide 18

Slide 18 text

Databricks AutoML: a glass-box approach to AutoML that empowers data teams without taking away control, built on the same open data lakehouse foundation as the Feature Store and the Data Science Workspace.

Slide 19

Slide 19 text

What is AutoML? Automated machine learning (AutoML) is a fully automated model development solution that seeks to “democratize” machine learning. While the scope of the automation varies, AutoML technologies usually automate the ML process from dataset to selected model: select a dataset → automated data prep → automated feature engineering and selection → automated training and model selection → automated hyperparameter tuning.

Slide 20

Slide 20 text

AutoML solves two key pain points for data scientists. 1) Quickly verify the predictive power of a dataset: “Can this dataset be used to predict customer churn?” (a marketing team hands the data science team a dataset). 2) Get a baseline model to guide project direction: “What direction should I go in for this ML project, and what benchmark should I aim to beat?” (the data science team turns the dataset into a baseline model).

Slide 21

Slide 21 text

Problems with Existing AutoML Solutions: the opaque-box and production-cliff problems. 1. A “production cliff” exists where data scientists need to modify the returned “best” model using their domain expertise before deployment; the “best” model returned is often not good enough to deploy. 2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR), yet most AutoML solutions return “opaque box” models; data scientists must spend time and energy reverse-engineering these models so that they can modify and/or explain them. (Diagram: AutoML Configuration → “Opaque Box” AutoML Training → Returned Best Model → Production Cliff → Deployed Model.)

Slide 22

Slide 22 text

Databricks AutoML: a glass-box solution that empowers data teams without taking away control. UI and API to start AutoML training. Data exploration notebook: a generated notebook with feature summary statistics and distributions, to understand and debug data quality and preprocessing. Reproducible trial notebooks: generated notebooks with the source code for every model, so you can iterate further on models from AutoML, adding your expertise. MLflow experiment: an auto-created MLflow experiment to track models and metrics, with easy deployment to the Model Registry.

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

“Glass-Box” AutoML with an API (notebook source):

databricks.automl.classify(df, target_col='label', timeout_minutes=60)
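
For context, a slightly fuller sketch of the same call: automl.classify returns a summary object whose attributes (best_trial, model_path, metrics) may differ by runtime version, and df / 'label' are the hypothetical inputs from the slide.

from databricks import automl

# run AutoML classification for up to 60 minutes on the 'label' column
summary = automl.classify(df, target_col='label', timeout_minutes=60)

# every trial is a generated notebook; the summary points at the best one
print(summary.best_trial.model_path)   # MLflow URI of the best model
print(summary.best_trial.metrics)      # e.g. validation metrics for that run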

Slide 26

Slide 26 text

MLflow Developments

Slide 27

Slide 27 text

MLflow: An Open Source ML Platform. Components: Tracking (reproducible runs), Projects (packaging), Models (packaging & serving), and Model Registry (model management), covering the path from raw data and data prep through training to deployment and monitoring, for any language and any ML library, and for data engineers, ML engineers, and application developers alike.

Slide 28

Slide 28 text

What’s New in MLflow. Auto logging: mlflow.autolog(), mlflow.spark.autolog(), mlflow.pyspark.ml.autolog(), mlflow.sklearn.autolog(), mlflow.tensorflow.autolog(), mlflow.pytorch.autolog(), mlflow.gluon.autolog(), mlflow.keras.autolog(), mlflow.lightgbm.autolog(), mlflow.xgboost.autolog(), mlflow.fastai.autolog(), mlflow.catboost.autolog(). Tracking: mlflow.shap.log_explainer(), mlflow.shap.log_explanation(), mlflow.log_figure(), mlflow.log_image(), mlflow.log_dict(), mlflow.log_text(). Plus: mlflow-thin-client.
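
As a minimal sketch of autologging in practice (assuming mlflow and scikit-learn are installed): a single mlflow.autolog() call turns on the library-specific integrations listed above, so params, metrics, and the fitted model are tracked without explicit log statements.

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.autolog()  # enables sklearn/tensorflow/xgboost/... autologging

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    LogisticRegression(max_iter=200).fit(X, y)  # logged automatically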

Slide 29

Slide 29 text

What’s New in MLflow. Deployment backends: e.g., Google Cloud AI Platform.

Slide 30

Slide 30 text

PyCaret + MLflow

Slide 31

Slide 31 text

Lakehouse Intro

Slide 32

Slide 32 text

Lakehouse: one platform for every use case (Streaming Analytics, BI, Data Science, Machine Learning) on a data lake for all your data: structured, semi-structured, and unstructured. A structured transactional layer adds RELIABILITY & QUALITY (ACID transactions), PERFORMANCE & LATENCY (advanced indexing, caching, compaction), and GOVERNANCE (fine-grained access control).

Slide 33

Slide 33 text

Lakehouse adoption across industries

Slide 34

Slide 34 text

Single node Data Science meets Big Data

Slide 35

Slide 35 text

What is Koalas? An implementation of the pandas APIs over Spark: easily port existing data science code, making it execute at scale. Now ~3 million PyPI downloads per month.

pandas:

import pandas as pd
df = pd.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)

Koalas:

import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
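
Koalas was subsequently upstreamed into Apache Spark as pyspark.pandas (the name that appears on the 95 GB benchmark slide below); a sketch of the same code under that import, assuming Spark 3.2+ and a hypothetical input file:

import pyspark.pandas as ps   # Koalas, shipped inside Apache Spark

df = ps.read_csv("data.csv")  # hypothetical file with y and z columns
df['x'] = df.y * df.z         # pandas-style column arithmetic, run on Spark
df.describe()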

Slide 36

Slide 36 text

Single Node Performance Comparison - 31 GB (lower is better)

Slide 37

Slide 37 text

Performance Comparison - 95 GB: pandas vs. pyspark.pandas (lower is better)

Slide 38

Slide 38 text

Different APIs, Powered by the Same Engine: SQL Language, Scala DataFrame, Python DataFrame, and pandas APIs.

Slide 39

Slide 39 text

Apache Spark Development (see the Spark 3.1.1 release notes): ANSI SQL Compliance (ANSI Mode GA, Implicit Type Cast, Interval Type, Error Codes), Python (Koalas / pandas APIs, Pythonic Error Handling, Python Performance, Visualization and Plotting), More Streaming (Session Window, RocksDB State Store, Queryable State Store, State Store APIs), performance (Adaptive Optimization, Push-based Shuffle, Low Latency Scheduler, Compile Latency Reduction), plus Decorrelation Framework, Timestamp w/o Time Zone, Complex Type Support in Lateral Join, Scala 2.13 Beta, Java 17, Richer Input/Output, Parquet 1.12 (Column Index), and DML Metrics.

Slide 40

Slide 40 text

OSS Delta Lake 1.0

Slide 41

Slide 41 text

• Project Zen: more Pythonic, better usability
• Faster performance, including predicate pushdown and pruning
• ANSI SQL compliance for DDL/DML commands, including INSERT, MERGE, and EXPLAIN
• Spark 3.1 ships with Databricks Runtime 8.0
https://spark.apache.org/releases/spark-release-3-1-1.html
https://databricks.com/blog/2021/03/02/introducing-apache-spark-3-1.html
delta.io blog: Delta Lake 1.0.0 Released

Slide 42

Slide 42 text

Generated Columns. Problem: partitioning by date. Better solution: generated columns.

CREATE TABLE events (
  id bigint,
  eventTime timestamp,
  eventDate GENERATED ALWAYS AS (CAST(eventTime AS DATE))
)
USING delta
PARTITIONED BY (eventDate)

id | eventTime               | eventDate
1  | 2021-05-24 09:00:00.000 | 2021-05-24
.. | ...                     | ...
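
A hypothetical usage sketch (Python against a Spark session with the events table above): rows are appended with only id and eventTime, Delta computes the generated eventDate automatically, and filters on eventDate prune partitions.

from datetime import datetime

# append without supplying eventDate; Delta fills in the generated column
spark.createDataFrame(
    [(1, datetime(2021, 5, 24, 9, 0))], "id long, eventTime timestamp"
).write.format("delta").mode("append").saveAsTable("events")

# the partition filter only reads the 2021-05-24 partition
spark.sql("SELECT * FROM events WHERE eventDate = '2021-05-24'").show()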

Slide 43

Slide 43 text

Delta Everywhere: Standalone JVM

Slide 44

Slide 44 text

PyPI installs:

pip install delta-spark
Python APIs for using Delta Lake with Apache Spark, e.g. for unit testing

pip install deltalake
Delta Lake without Spark (Delta Rust client):

from deltalake import DeltaTable
dt = DeltaTable("reviews")
dt.version()
# 3
dt.files()
# ['part-00000-...-ff32ddab96d2-c000.snappy.parquet',
#  'part-00000-...-d46c948aa415-c000.snappy.parquet',
#  'part-00001-...-7eb62007a15c-c000.snappy.parquet']
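
A sketch of the delta-spark package in a local unit test, following the documented quickstart configuration (configure_spark_with_delta_pip); the path and table contents here are hypothetical.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-unit-test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.range(5).write.format("delta").save("/tmp/delta-test")  # write a table
spark.read.format("delta").load("/tmp/delta-test").show()     # read it back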

Slide 45

Slide 45 text

Delta Live Tables

Slide 46

Slide 46 text

Building the foundation of a Lakehouse with ETL: from the Data Lake (CSV, JSON, TXT, Kinesis, ...) through BRONZE (raw ingestion and history) and SILVER (filtered, cleaned, augmented) to GOLD (business-level aggregates), with data quality increasing at each stage, feeding BI & Reporting, Streaming Analytics, and Data Science & ML.

Slide 47

Slide 47 text

Delta Live Tables: Easily build data pipelines Declaratively build data pipelines with business logic and chain table dependencies Run in batch or streaming with structured or unstructured data Reuse ETL pipelines across environments https://docs.databricks.com/data-engineering/delta-live-tables/index.html

Slide 48

Slide 48 text

Treat your data as code: a single source of truth for more than just transformation logic.

CREATE LIVE TABLE clean_data (
  CONSTRAINT valid_timestamp EXPECT (timestamp > "…")
)
COMMENT "Customer data with timestamps cleaned up"
TBLPROPERTIES (
  "has_pii" = "true"
)
AS SELECT to_timestamp(ts) AS ts, * FROM LIVE.raw_data

Declarative quality expectations: just say what makes bad data bad and what to do with it. Documentation with transformation: helps ensure discovery information is recent. Governance built in: all information about processing is captured into a table for analysis / auditing.
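
The same pipeline can be sketched with the Delta Live Tables Python API (the dlt module, available inside a DLT pipeline); this mirrors the SQL above, with raw_data as the upstream live table, and keeps the slide's elided "…" threshold.

import dlt
from pyspark.sql.functions import to_timestamp

@dlt.table(
    comment="Customer data with timestamps cleaned up",
    table_properties={"has_pii": "true"},
)
@dlt.expect("valid_timestamp", 'timestamp > "…"')  # declarative expectation
def clean_data():
    # clean up the timestamp column from the upstream live table
    return dlt.read("raw_data").withColumn("ts", to_timestamp("ts"))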

Slide 49

Slide 49 text

Databricks Unity Catalog Simplified governance for data and AI

Slide 50

Slide 50 text

Data Lake Governance Today is Complex. Data is files on S3/ADLS/GCS (/dataset/pages/part-001, /dataset/pages/part-002, /dataset/users/uk/part-001, /dataset/users/uk/part-002, /dataset/users/us/part-001) with file-based permissions: user1 can read /pages/, user2 can read /users/, user3 can read /users/us/. What if we only want users to see some columns/rows within a table? What if we want to change the data layout? What if governance rules change? Metadata (e.g. the Hive Metastore) holds tables & views but can be out of sync with the data, and ML models and SQL databases each come with a different governance model.

Slide 51

Slide 51 text

Databricks Unity Catalog sits between users and the data (files on S3/ADLS/GCS), exposing tables, views, ML models, SQL databases, and Delta Shares, with an audit log. ● Fine-grained permissions on tables, fields, views ● ANSI SQL grants ● Uniform permission model for all data assets ● Across workspaces ● ODBC/JDBC / Delta Sharing

Slide 52

Slide 52 text

Using the Unity Catalog:

CREATE TABLE iot_events
GRANT SELECT ON iot_events TO engineers
GRANT SELECT(date, country) ON iot_events TO marketing

Slide 53

Slide 53 text

Attribute-Based Access Control (ABAC): set permissions on all columns tagged pii together.

CREATE ATTRIBUTE pii
ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone
...
GRANT SELECT ON DATABASE iot_data HAVING ATTRIBUTE NOT IN (pii) TO product_managers

Slide 54

Slide 54 text

Delta Sharing An Open Protocol for Secure Data Sharing

Slide 55

Slide 55 text

Delta Sharing: delta.io/sharing. A data provider exposes Delta Lake tables through a Delta Sharing server that enforces access permissions; data recipients connect over the Delta Sharing protocol (REST) with commercial or open source clients, e.g. the Python connector for pandas or Apache Spark: pip install delta-sharing

Slide 56

Slide 56 text

Delta Sharing Recipient: Pandas + Jupyter Notebook
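
A minimal recipient-side sketch with the open source Python connector (pip install delta-sharing); the profile file name and schema are hypothetical, while the retail share and sales table come from the next slide.

import delta_sharing

# profile file with the endpoint + bearer token issued by the data provider
profile = "config.share"

client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())  # discover shared tables

# load one shared table straight into a pandas DataFrame:
# "<profile>#<share>.<schema>.<table>"
df = delta_sharing.load_as_pandas(f"{profile}#retail.default.sales")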

Slide 57

Slide 57 text

Delta Sharing on Databricks: a secure Delta Sharing server integrated in the service easily manages shares with CREATE SHARE commands in SQL or REST APIs, backed by the Unity Catalog and an audit log.

CREATE SHARE retail
ALTER SHARE retail ADD TABLE sales
GRANT SELECT ON SHARE retail TO supplier1

Slide 58

Slide 58 text

How to engage? delta.io, the delta-users Slack, the delta-users Google Group, the Delta Lake YouTube channel, Delta Lake GitHub Issues, and the delta-rs bi-weekly meetings.

Slide 59

Slide 59 text

Databricks SQL

Slide 60

Slide 60 text

First-Class SQL Development Experience: enabling simple, quick ad-hoc exploratory analysis on the lake with SQL. Develop: query tabs, drafts & “pick up where you left off”, command history, contextual auto-complete. Troubleshoot: query progress, error highlighting, execution time breakdown. Collaborate: scheduled email delivery, edit permissions.

Slide 61

Slide 61 text

Large Query Performance: price/performance benchmark with the Barcelona Supercomputing Center (Nov 2020), 30 TB TPC-DS price/performance (lower is better).

Slide 62

Slide 62 text

Beyond large query performance: providing fast and predictable performance for all workloads, including many small files, small queries, mixed small/large workloads, and BI results retrieval.

Slide 63

Slide 63 text

What about many concurrent users on small data? 10 GB TPC-DS @ 32 Concurrent Streams (Queries/Hr) Higher is better

Slide 64

Slide 64 text

What about many concurrent users on small data? 10 GB TPC-DS @ 32 Concurrent Streams (Queries/Hr) Higher is better

Slide 65

Slide 65 text

What about badly laid out tables (small files)? In a “too many small files” scenario benchmark (# rows scanned/sec; higher is better), ~12x more rows are scanned within the same duration. Async & parallel IO: cold S3/ADLS/GCS remote reads fully saturate the available bandwidth with increased parallelism for better cold reads.

Slide 66

Slide 66 text

Summary: Advancing the Lakehouse. Delta Live Tables: reliable ETL made easy with Delta Lake (coming soon). Delta Sharing: the first open protocol for data sharing (available today). Unity Catalog: the first multi-cloud data catalog for the lakehouse (coming soon). Photon: the first high performance query engine for the lakehouse (public preview).

Slide 67

Slide 67 text

Data and AI Summit: https://dataaisummit.com/
Databricks YouTube channel: https://www.youtube.com/Databricks
Databricks blog: https://databricks.com/blog
@frankmunz