
Slice & DAIS - Summit Highlights from Data and AI Summit 2021 (formerly the Spark + AI Summit)

Slides for our beginner-friendly, level 250 community session with Data and AI Summit highlights and announcements.

ML updates by Matt Thomson, Databricks
Lakehouse updates by Frank Munz, Databricks
You Might be Suffering From the Small Files Syndrome, by Adi Polak, Microsoft

Frank Munz

June 18, 2021

Transcript

  1. Slice & DAIS 2021
    Community Highlights
    Data and AI Summit 2021
    Adi Polak
    Matt Thomson
    Frank Munz


  2. (image-only slide)

  3. Agenda for Today
    ▪ Beginner-friendly DAIS highlights
    ▪ Agenda
    ▪ This intro :-)
    ▪ ML updates, Matt Thomson, Databricks
    ▪ Lakehouse updates, Frank Munz, Databricks
    ▪ You Might be Suffering From the Small Files Syndrome, Adi Polak, Microsoft

  4. Databricks ML
    May 2021 announcements


  5. Key ML Announcements
    1. Databricks Machine Learning
    2. Feature Store
    3. AutoML
    4. MLflow developments


  6. Lakehouse
    One platform to unify all your data, analytics, and AI workloads:
    Workloads: BI & SQL, Real-time Data Applications, Data Science & ML
    Data Management & Governance
    Open Data Lake

  7. Announcing: Databricks Machine Learning
    A data-native and collaborative solution for the full ML lifecycle,
    built on the open data lakehouse foundation
    Data Science Workspace: Data Prep, Data Versioning, Model Training, Model Tuning,
    Runtime and Environments, Monitoring, Batch Scoring, Online Serving
    MLOps / Governance

  8. Persona-based Navigation
    Purpose-built surfaces for data teams


  9. ML Dashboard
    All ML-related assets and resources in one place


  10. Feature Store


  11. Feature Store
    The first Feature Store co-designed with a Data and MLOps Platform
    (Databricks ML capability diagram with Feature Store highlighted, providing
    batch (high throughput) and real-time (low latency) access)

  12. A day (or 6 months) in the life of an ML model
    Raw Data → Featurization (joins, aggregates, transforms, etc.) → Training → Serving Client
    Features are handed off as CSV files along the way
    Pain points: no reuse of features, online/offline skew

  13. Solving the Feature Store Problem
    Raw Data → Featurization (joins, aggregates, transforms, etc.) → Feature Store → Training / Serving Client
    The Feature Store pairs a Feature Registry with Feature Providers for
    batch (high throughput) and online (low latency) access
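
    On Databricks this workflow maps to the Feature Store Python client; a minimal sketch
    (assuming the databricks.feature_store package; method and argument names may differ by
    runtime version, and the table name and features_df are illustrative):

    from databricks.feature_store import FeatureStoreClient

    fs = FeatureStoreClient()

    # Register the featurized DataFrame (joins, aggregates, transforms) as a Delta-backed
    # feature table, keyed so training and serving read the exact same features.
    fs.create_table(
        name="ml.customer_features",      # illustrative table name
        primary_keys=["customer_id"],
        df=features_df,                   # featurization output computed upstream
        description="Reusable customer features",
    )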


  14. (image-only slide)

  15. (image-only slide)

  16. (image-only slide)

  17. AutoML


  18. Databricks AutoML
    A glass-box approach to AutoML that empowers data teams without taking away control
    (Databricks ML capability diagram with AutoML highlighted)

  19. What is AutoML?
    Automated machine learning (AutoML) is a fully automated model development solution seeking to
    “democratize” machine learning. While the scope of the automation varies, AutoML technologies usually
    automate the ML process from data to model selection:
    select a dataset → automated data prep → automated feature engineering and selection →
    automated training and model selection → automated hyperparameter tuning

  20. AutoML solves two key pain points for data scientists
    Quickly verify the predictive power of a dataset:
    “Can this dataset be used to predict customer churn?”
    (the Marketing Team hands a dataset to the Data Science Team)
    Get a baseline model to guide project direction:
    “What direction should I go in for this ML project and what benchmark should I aim to beat?”
    (the Data Science Team turns the dataset into a baseline model)

  21. Problems with Existing AutoML Solutions
    Opaque-box and production-cliff problems in AutoML
    Problem 1: A “production cliff” exists where data scientists need to modify the returned
    “best” model using their domain expertise before deployment.
    Pain point: the “best” model returned is often not good enough to deploy.
    Problem 2: Data scientists need to be able to explain how they trained a model for
    regulatory purposes (e.g., FDA, GDPR), and most AutoML solutions have “opaque-box” models.
    Pain point: data scientists must spend time and energy reverse engineering these
    opaque-box models so that they can modify and/or explain them.
    (Diagram: AutoML configuration → “opaque box” AutoML training → returned best model →
    production cliff → deployed model)

  22. Databricks AutoML
    A glass-box solution that empowers data teams without taking away control
    ▪ UI and API to start AutoML training
    ▪ Data exploration notebook: a generated notebook with feature summary statistics and
      distributions, to understand and debug data quality and preprocessing
    ▪ Reproducible trial notebooks: generated notebooks with source code for every model,
      to iterate further on models from AutoML, adding your expertise
    ▪ MLflow experiment: an auto-created MLflow experiment to track models and metrics,
      with easy deployment to the Model Registry

  23. (image-only slide)

  24. (image-only slide)

  25. “Glass-Box” AutoML with an API
    Notebook source:
    databricks.automl.classify(df, target_col='label', timeout_minutes=60)
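
    The returned summary object can also be inspected programmatically; a minimal sketch
    (assuming the databricks.automl runtime package; the summary attribute names are assumptions):

    import databricks.automl

    # Kick off a classification AutoML run on a Spark or pandas DataFrame.
    summary = databricks.automl.classify(df, target_col="label", timeout_minutes=60)

    # Every trial is a generated, reproducible notebook tracked in MLflow;
    # the best one is surfaced on the summary object (attribute names assumed).
    print(summary.best_trial.mlflow_run_id)
    print(summary.best_trial.notebook_url)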


  26. MLflow Developments


  27. MLflow: An Open Source ML Platform
    TRACKING (monitoring); PROJECTS (reproducible runs); MODEL REGISTRY (model management);
    MODELS (packaging & serving)
    Spans the workflow from raw data and data prep through training to deployment, used by data
    engineers, ML engineers, and application developers, with any language and any ML library.


  28. What’s New in MLflow
    AUTO LOGGING
    mlflow.autolog()
    mlflow.spark.autolog()
    mlflow.pyspark.ml.autolog()
    mlflow.sklearn.autolog()
    mlflow.tensorflow.autolog()
    mlflow.pytorch.autolog()
    mlflow.gluon.autolog()
    mlflow.keras.autolog()
    mlflow.lightgbm.autolog()
    mlflow.xgboost.autolog()
    mlflow.catboost.autolog()
    mlflow.fastai.autolog()
    TRACKING
    mlflow.shap.log_explainer()
    mlflow.shap.log_explanation()
    mlflow.log_figure()
    mlflow.log_image()
    mlflow.log_dict()
    mlflow.log_text()
    mlflow-thin-client
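
    A minimal sketch of universal autologging with scikit-learn; parameters, metrics, and the
    fitted model are captured without explicit log calls (dataset and model choice are illustrative):

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    # One call enables autologging for every supported library that gets used.
    mlflow.autolog()

    X, y = load_iris(return_X_y=True)
    with mlflow.start_run():
        RandomForestClassifier(n_estimators=50).fit(X, y)
        # Hyperparameters, training metrics, and the model artifact are logged automatically.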


  29. What’s New in MLflow
    DEPLOYMENT BACKENDS: Google Cloud AI Platform

  30. PyCaret + MLflow


  31. Lakehouse Intro


  32. Lakehouse
    One platform for every use case: Streaming Analytics, BI, Data Science, Machine Learning
    Data Lake for all your data: structured, semi-structured, and unstructured
    Structured transactional layer:
    RELIABILITY & QUALITY: ACID transactions
    PERFORMANCE & LATENCY: advanced indexing, caching, compaction
    GOVERNANCE: fine-grained access control

  33. Lakehouse adoption across industries


  34. Single-node Data Science meets Big Data

  35. What is Koalas?
    Implementation of pandas APIs over Spark
    Easily port existing data science code, making it execute at scale

    import databricks.koalas as ks
    df = ks.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)

    import pandas as pd
    df = pd.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)

    Now ~3 million PyPI downloads per month.
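
    Koalas is also being upstreamed into Apache Spark itself as pyspark.pandas (referenced on the
    next slides); a minimal sketch assuming Spark 3.2+ and the same `file` as above:

    # Same pandas-style code, imported from Spark itself (Spark 3.2+).
    import pyspark.pandas as ps

    df = ps.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()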


  36. Single-Node Performance Comparison, 31 GB (chart; lower is better)

  37. Performance Comparison, 95 GB: pandas vs. pyspark.pandas (chart; lower is better)

  38. Different APIs, Powered by the Same Engine
    SQL, Scala DataFrame, Python DataFrame, and pandas APIs all run on the same Spark engine

  39. Apache Spark Development (see the Spark 3.1.1 release notes)
    Themes: ANSI SQL compliance, Python, performance, streaming, and more. Highlights include:
    ANSI mode GA, error codes, implicit type cast, interval type, timestamp without time zone,
    complex type support in lateral join, Koalas (pandas APIs), Pythonic error handling,
    visualization and plotting, richer input/output, adaptive optimization, push-based shuffle,
    compile latency reduction, low-latency scheduler, Parquet 1.12 (column index), DML metrics,
    decorrelation framework, RocksDB state store, queryable state store, state store APIs,
    session window, Scala 2.13 beta, Java 17

  40. OSS Delta Lake 1.0


  41. • Project Zen: more Pythonic, better usability
    • Faster performance, including predicate pushdown and pruning
    • ANSI SQL compliance for DDL/DML commands including INSERT, MERGE, and EXPLAIN
    • Spark 3.1 comes with Databricks Runtime 8.0
    https://spark.apache.org/releases/spark-release-3-1-1.html
    https://databricks.com/blog/2021/03/02/introducing-apache-spark-3-1.html
    delta.io blog: Delta Lake 1.0.0 Released

  42. Generated Columns
    Problem: partitioning by date. Better solution: generated columns.

    CREATE TABLE events (
      id bigint,
      eventTime timestamp,
      eventDate GENERATED ALWAYS AS (CAST(eventTime AS DATE))
    )
    USING delta
    PARTITIONED BY (eventDate)

    id | eventTime               | eventDate
    1  | 2021-05-24 09:00:00.000 | 2021-05-24
    .. | ...                     | ...
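
    A minimal sketch of using the table (assuming a SparkSession `spark` with Delta Lake
    configured and the events table above): appending rows without eventDate lets Delta
    derive it from eventTime.

    from datetime import datetime
    from pyspark.sql import Row

    # Append a row without eventDate; Delta computes it as CAST(eventTime AS DATE).
    new_events = spark.createDataFrame([Row(id=1, eventTime=datetime(2021, 5, 24, 9, 0))])
    new_events.write.format("delta").mode("append").saveAsTable("events")
    spark.table("events").show()   # the eventDate column is populated automatically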


  43. Delta Everywhere
    Standalone JVM

  44. PyPI Install
    pip install delta-spark
    Python APIs for using Delta Lake with Apache Spark, e.g. for unit testing

    pip install deltalake
    Delta Rust client: Delta Lake without Spark

    from deltalake import DeltaTable
    dt = DeltaTable("reviews")
    dt.version()
    3
    dt.files()
    ['part-00000-...-ff32ddab96d2-c000.snappy.parquet',
     'part-00000-...-d46c948aa415-c000.snappy.parquet',
     'part-00001-...-7eb62007a15c-c000.snappy.parquet']
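
    A minimal sketch of pulling the same table into pandas without Spark (assuming the
    deltalake package's PyArrow integration; the table path matches the slide):

    from deltalake import DeltaTable

    dt = DeltaTable("reviews")
    # Materialize the current table snapshot as a pandas DataFrame via PyArrow.
    df = dt.to_pyarrow_table().to_pandas()
    print(df.head())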


  45. Delta Live Tables


  46. Building the foundation of a Lakehouse with ETL
    Data Lake sources (CSV, JSON, TXT, Kinesis, ...) →
    BRONZE (raw ingestion and history) →
    SILVER (filtered, cleaned, augmented) →
    GOLD (business-level aggregates) →
    BI & Reporting, Streaming Analytics, Data Science & ML
    Data quality increases from BRONZE to GOLD

  47. Delta Live Tables: Easily build data pipelines
    Declaratively build data pipelines with business logic and chain table dependencies
    Run in batch or streaming with structured or unstructured data
    Reuse ETL pipelines across environments
    https://docs.databricks.com/data-engineering/delta-live-tables/index.html

  48. Treat your data as code
    A single source of truth for more than just transformation logic.

    CREATE LIVE TABLE clean_data (
      CONSTRAINT valid_timestamp EXPECT (timestamp > "…")
    )
    COMMENT "Customer data with timestamps cleaned up"
    TBLPROPERTIES (
      "has_pii" = "true"
    )
    AS SELECT to_timestamp(ts) AS ts, *
    FROM LIVE.raw_data

    Declarative quality expectations: just say what makes bad data bad and what to do with it.
    Documentation with transformation: helps ensure discovery information is recent.
    Governance built in: all information about processing is captured into a table for analysis / auditing.
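
    For comparison, a minimal sketch of a similar step with the Delta Live Tables Python API
    (assuming a DLT pipeline environment that provides the dlt module; the expectation
    condition and column names are illustrative):

    import dlt
    from pyspark.sql.functions import to_timestamp, col

    @dlt.table(comment="Customer data with timestamps cleaned up",
               table_properties={"has_pii": "true"})
    @dlt.expect("valid_timestamp", "ts IS NOT NULL")   # illustrative expectation
    def clean_data():
        # Read the upstream live table and normalize the timestamp column.
        return dlt.read("raw_data").withColumn("ts", to_timestamp(col("ts")))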


  49. Databricks Unity Catalog
    Simplified governance for data and AI

  50. Data Lake Governance Today is Complex
    Data (files on S3/ADLS/GCS):
    /dataset/pages/part-001
    /dataset/pages/part-002
    /dataset/users/uk/part-001
    /dataset/users/uk/part-002
    /dataset/users/us/part-001
    File-based permissions for users:
    • user1 can read /pages/
    • user2 can read /users/
    • user3 can read /users/us/
    What if we only want users to see some columns/rows within a table?
    What if we want to change the data layout?
    What if governance rules change?
    Metadata (e.g. Hive Metastore): tables & views could be out of sync with the data!
    ML models: different governance model
    SQL databases: different governance model

  51. Databricks Unity Catalog
    Unity Catalog sits between users and the data (files on S3/ADLS/GCS such as
    /dataset/pages/part-001 … /dataset/users/us/part-001), exposing tables, views,
    ML models, SQL databases, and Delta Shares, with an audit log.
    ● Fine-grained permissions on tables, fields, views
    ● ANSI SQL grants
    ● Uniform permission model for all data assets
    ● Across workspaces
    ● ODBC/JDBC / Delta Sharing

  52. Using the Unity Catalog
    CREATE TABLE iot_events
    GRANT SELECT ON iot_events TO engineers
    GRANT SELECT(date, country) ON iot_events TO marketing


  53. Attribute-Based Access Control (ABAC)
    CREATE ATTRIBUTE pii
    ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
    ALTER TABLE users ADD ATTRIBUTE pii ON phone
    ...
    GRANT SELECT ON DATABASE iot_data
    HAVING ATTRIBUTE NOT IN (pii)
    TO product_managers
    Set permission on all columns tagged pii together

  54. Delta Sharing
    An Open Protocol for Secure Data Sharing


  55. Delta Sharing: delta.io/sharing
    Data Provider: Delta Lake table → Delta Sharing Server (commercial or open source),
    which enforces access permissions
    Delta Sharing Protocol (REST)
    Data Recipient: Python connector for pandas or Apache Spark
    pip install delta-sharing

  56. Delta Sharing Recipient: Pandas + Jupyter Notebook
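
    This slide shows a notebook screenshot; a minimal sketch of what such a recipient notebook
    does (assuming the delta-sharing Python connector; the profile path and table name are
    illustrative):

    import delta_sharing

    # Profile file downloaded from the data provider (illustrative path).
    profile = "/path/to/open-datasets.share"
    client = delta_sharing.SharingClient(profile)
    print(client.list_all_tables())

    # Load one shared table directly into pandas; the table URL is
    # "<profile>#<share>.<schema>.<table>" (names here are illustrative).
    df = delta_sharing.load_as_pandas(profile + "#retail.default.sales")
    print(df.head())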


  57. Delta Sharing on Databricks
    A secure Delta Sharing server integrated in our service easily manages shares with
    CREATE SHARE commands in SQL or REST APIs, backed by the Unity Catalog and an audit log.

    CREATE SHARE retail
    ALTER SHARE retail ADD TABLE sales
    GRANT SELECT ON SHARE retail TO supplier1

  58. How to engage?
    delta.io
    delta-users Slack
    delta-users Google Group
    Delta Lake YouTube channel
    Delta Lake GitHub Issues
    Delta Lake RS bi-weekly meetings

  59. Databricks SQL


  60. First-Class SQL Development Experience
    Enabling simple, quick ad-hoc exploratory analysis on the Lake with SQL
    Develop: query tabs, drafts & “pick up where you left off”, command history, contextual auto-complete
    Troubleshoot: query progress, error highlighting, execution time breakdown
    Collaborate: scheduled email delivery, edit permissions

  61. Large Query Performance
    Price/performance benchmark with Barcelona Supercomputing Center (Nov 2020)
    30 TB TPC-DS price/performance (chart; lower is better)

  62. Beyond large query performance
    Providing fast and predictable performance for all workloads:
    many small files, small queries, mixed small/large workloads, BI results retrieval

  63. What about many concurrent users on small data?
    10 GB TPC-DS @ 32 concurrent streams (queries/hr) (chart; higher is better)

  64. What about many concurrent users on small data?
    10 GB TPC-DS @ 32 concurrent streams (queries/hr) (chart; higher is better)

  65. What about badly laid out tables?
    “Too Many Small Files” scenario benchmark (# rows scanned/sec; higher is better):
    ~12x rows scanned within the same duration despite small files
    Async & parallel IO: cold S3/ADLS remote reads fully saturate S3/ADLS/GCS bandwidth,
    with increased parallelism for better cold reads

  66. Summary: Advancing the Lakehouse
    Delta Live Tables: reliable ETL made easy with Delta Lake
    Delta Sharing: the first open protocol for data sharing
    Unity Catalog: the first multi-cloud data catalog for the lakehouse
    Photon: the first high-performance query engine for the lakehouse
    (Availability at announcement ranged from available today and public preview to coming soon)

  67. Data and AI Summit: https://dataaisummit.com/
    Databricks YouTube channel: https://www.youtube.com/Databricks
    Databricks blog: https://databricks.com/blog
    @frankmunz