
Slice_DAIS_Español_ONLINE.pdf


Following the success of the first edition, we are repeating the Slice & DAIS event, this time in Spanish. Once more we will cover all the Delta Lake and AI news! This is a cross-promotion with the Barcelona Spark Meetup Group:
https://www.meetup.com/Spark-Barcelona/events/279277516/

We're doing it again!

Our "Slice and DAIS" event, which explained all the highlights of the Data and AI Summit (formerly the Apache Spark Summit) in a simple way for Data Scientists and Data Engineers, was a great success with more than 6000 viewers. So we have decided to do it again, but this time in Spanish!

Together with the Barcelona Spark meetup group, we will present what is new in Delta Lake and Machine Learning. We will also have a great contribution from the Spanish-speaking community.

When: June 13th, 2021
Time: 17h CEST
Where: Databricks Zoom
Level: L200/300 (on a scale from L100 "product flyer" to L400 "live coding")

The agenda is as follows:

Welcome and moderation, Paola Pardo, Spark Meetup Barcelona
Alejandro Rabadán, Databricks, what's new in Lakehouse, 30 mins
Carlos del Cacho, Databricks, what's new in Machine Learning, 20-30 mins

Frank Munz

July 14, 2021

Transcript

  1. View Slide

  2. Welcome!
    ● First edition of Slice & DAIS in Spanish!
    ● Data & AI Summit highlights
    ● Databricks + Spark Barcelona
    ● We are growing the community!

  3. ● We are back!!
    ● 7 years
    ● 2,900+ members
    ● A space to share knowledge and connect people :)

  4. Today's menu
    Welcome and introduction
    What's new in Lakehouse: Alejandro Rabadán, Databricks
    What's new in Machine Learning: Carlos del Cacho, Databricks
    Special menu: Lakehouse, Machine Learning

  5. Sharing is caring :)
    @databricks
    @sparkbarcelona
    #SliceAndDAIS
    slack-sparkbcn.herokuapp.com/
    Join and check out the news!

  6. Lakehouse Intro


  7. Lakehouse
    One platform for every use case: Streaming Analytics, BI, Data Science, Machine Learning
    Structured transactional layer
    Data Lake for all your data: Structured, Semi-Structured and Unstructured Data
    RELIABILITY & QUALITY: ACID transactions
    PERFORMANCE & LATENCY: Advanced indexing, caching, compaction
    GOVERNANCE: Fine-grained access control

  8. Lakehouse adoption across industries


  9. Single node Data Science meets Big Data

  10. What is Koalas?
    Implementation of pandas APIs over Spark
    Easily port existing data science code, making it execute at scale
    pandas:
        import pandas as pd
        df = pd.read_csv(file)
        df['x'] = df.y * df.z
        df.describe()
        df.plot.line(...)
    Koalas:
        import databricks.koalas as ks
        df = ks.read_csv(file)
        df['x'] = df.y * df.z
        df.describe()
        df.plot.line(...)
    Now ~ 3 million PyPI downloads per month.

  11. Single Node Performance Comparison - 31 GB
    Lower is better


  12. Performance Comparison - 95 GB
    Chart: pandas vs. pyspark.pandas (lower is better)

  13. Different APIs, Powered by the Same Engine
    Scala DataFrame
    Python DataFrame
    pandas APIs
    SQL Language

  14. Apache Spark Development (Link 3.1.1 RelNotes)
    Areas: ANSI SQL Compliance, Python, Performance, Streaming, More
    Decorrelation Framework
    Timestamp w/o Time Zone
    Adaptive Optimization
    Scala 2.13 Beta
    Error Code
    Implicit Type Cast
    Interval Type
    Complex Type Support in ORC
    Lateral Join
    Compile Latency Reduction
    JAVA 17
    Push-based Shuffle
    Session Window
    Visualization and Plotting
    RocksDB State Store
    Queryable State Store
    Pythonic Error Handling
    Richer Input/Output
    Koalas (pandas APIs)
    Parquet 1.12 (Column Index)
    State Store APIs
    DML Metrics
    ANSI Mode GA
    Low Latency Scheduler

  15. OSS Delta Lake 1.0


  16. • Project Zen: more Pythonic, better usability
    • Faster performance including predicate pushdown and pruning
    • ANSI SQL compliance for DDL/DML commands including INSERT, MERGE, and EXPLAIN
    • Spark 3.1 comes with Databricks Runtime 8.0
    https://spark.apache.org/releases/spark-release-3-1-1.html
    https://databricks.com/blog/2021/03/02/introducing-apache-spark-3-1.html
    delta.io blog: Delta Lake 1.0.0 Released

  17. Generated Columns
    Problem: Partition by date
    Better solution: generated columns
    CREATE TABLE events(
      id bigint,
      eventTime timestamp,
      eventDate date GENERATED ALWAYS AS (
        CAST(eventTime AS DATE)
      )
    )
    USING delta
    PARTITIONED BY (eventDate)
    id  eventTime                eventDate
    1   2021-05-24 09:00:00.000  2021-05-24
    ..  ...                      ...

  18. Delta Everywhere
    Standalone JVM

  19. PyPI Install
    pip install delta-spark
    Python APIs for using Delta Lake with Apache Spark, e.g. for unit testing
    Delta Rust client
    pip install deltalake
    # delta lake without spark
    from deltalake import DeltaTable
    dt = DeltaTable("reviews")
    dt.version()
    3
    dt.files()
    ['part-00000-...-ff32ddab96d2-c000.snappy.parquet',
    'part-00000-...-d46c948aa415-c000.snappy.parquet',
    'part-00001-...-7eb62007a15c-c000.snappy.parquet']

  20. Delta Live Tables


  21. Building the foundation of a Lakehouse with ETL
    Sources: Data Lake (CSV, JSON, TXT…), Kinesis
    BRONZE: Raw Ingestion and History
    SILVER: Filtered, Cleaned, Augmented
    GOLD: Business-level Aggregates
    QUALITY: BRONZE, SILVER, GOLD
    Consumers: BI & Reporting, Streaming Analytics, Data Science & ML
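
The bronze, silver, and gold stages can be sketched in plain Python. This only illustrates the layering idea; the record fields (`user`, `amount`) and the function names are made up for the example, and the sketch uses neither Delta Lake nor Spark.

```python
from collections import defaultdict

# Bronze: raw ingestion, kept exactly as received (including bad records).
bronze = [
    {"user": "ana", "amount": "10.5"},
    {"user": "ana", "amount": "4.5"},
    {"user": "luis", "amount": "not-a-number"},
    {"user": "", "amount": "3.0"},
]


def to_silver(rows):
    """Silver: filter and clean. Drop rows without a user or a numeric amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # unparseable amount: not promoted to silver
        if row["user"]:
            cleaned.append({"user": row["user"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Gold: business-level aggregate, here the total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified, so the history stays available if the cleaning rules change later.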


  22. Delta Live Tables: Easily build data pipelines
    Declaratively build data pipelines with business logic and chain table dependencies
    Run in batch or streaming with structured or unstructured data
    Reuse ETL pipelines across environments
    https://docs.databricks.com/data-engineering/delta-live-tables/index.html

  23. Treat your data as code
    A single source of truth for more than just transformation logic.
    CREATE LIVE TABLE clean_data(
      CONSTRAINT valid_timestamp EXPECT (timestamp > "…")
    )
    COMMENT "Customer data with timestamps cleaned up"
    TBLPROPERTIES (
      "has_pii" = "true"
    )
    AS SELECT to_timestamp(ts) AS ts, *
    FROM LIVE.raw_data
    Declarative Quality Expectations: Just say what makes bad data bad and what to do with it.
    Documentation with Transformation: Helps ensure discovery information is recent.
    Governance Built-In: All information about processing is captured into a table for analysis / auditing.
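
The idea of a declarative expectation can be sketched outside Databricks. The following plain-Python example is an emulation under stated assumptions, not the Delta Live Tables API: the `Expectation` class, the `on_violation` values, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Expectation:
    name: str
    predicate: Callable[[Row], bool]  # says what makes a row valid
    on_violation: str = "keep"        # "keep" (count only) or "drop"


def apply_expectation(rows: List[Row], exp: Expectation) -> Tuple[List[Row], Dict[str, int]]:
    """Return the surviving rows plus pass/fail counts for auditing."""
    if exp.on_violation not in ("keep", "drop"):
        raise ValueError(f"unknown on_violation: {exp.on_violation!r}")
    kept: List[Row] = []
    metrics = {"passed": 0, "failed": 0}
    for row in rows:
        if exp.predicate(row):
            metrics["passed"] += 1
            kept.append(row)
        else:
            metrics["failed"] += 1
            if exp.on_violation == "keep":
                kept.append(row)
    return kept, metrics


# Hypothetical cutoff; the slide elides the real value.
CUTOFF = datetime(2012, 1, 1)
valid_timestamp = Expectation(
    name="valid_timestamp",
    predicate=lambda row: row["ts"] > CUTOFF,
    on_violation="drop",
)

raw_data = [
    {"id": 1, "ts": datetime(2021, 5, 24, 9, 0)},
    {"id": 2, "ts": datetime(1970, 1, 1, 0, 0)},
]
clean_data, metrics = apply_expectation(raw_data, valid_timestamp)
```

The rule and the action on violation are data, not control flow, and the pass/fail counts are the kind of processing information that could be captured for auditing.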


  24. Databricks Unity Catalog
    Simplified governance for data and AI

  25. Data Lake Governance Today is Complex
    Data (files on S3/ADLS/GCS)
      /dataset/pages/part-001
      /dataset/pages/part-002
      /dataset/users/uk/part-001
      /dataset/users/uk/part-002
      /dataset/users/us/part-001
    Users: file-based permissions:
    • user1 can read /pages/
    • user2 can read /users/
    • user3 can read /users/us/
    What if we only want users to see some columns/rows within a table?
    What if we want to change data layout?
    What if governance rules change?
    Metadata (e.g. Hive Metastore): Tables & views. Could be out of sync with the data!
    ML Models: Different governance model
    SQL Databases: Different governance model

  26. Databricks Unity Catalog
    Data (files on S3/ADLS/GCS)
      /dataset/pages/part-001
      /dataset/pages/part-002
      /dataset/users/uk/part-001
      /dataset/users/uk/part-002
      /dataset/users/us/part-001
    Unity Catalog: table1, table2, view1, view2, view3, models
    ML Models, SQL Databases, Delta Shares
    Users
    Audit Log
    ● Fine-grained permissions on tables, fields, views
    ● ANSI SQL grants
    ● Uniform permission model for all data assets
    ● Across workspaces
    ● ODBC/JDBC / Delta Sharing

  27. Using the Unity Catalog
    CREATE TABLE iot_events
    GRANT SELECT ON iot_events TO engineers
    GRANT SELECT(date, country) ON iot_events TO marketing


  28. Attribute-Based Access Control (ABAC)
    CREATE ATTRIBUTE pii
    ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
    ALTER TABLE users ADD ATTRIBUTE pii ON phone
    ...
    GRANT SELECT ON DATABASE iot_data
    HAVING ATTRIBUTE NOT IN (pii)
    TO product_managers
    Set permission on all columns tagged pii together
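
The effect of an attribute-based grant can be modeled in a few lines of Python. This is an illustrative sketch, not Unity Catalog code: the `COLUMN_ATTRIBUTES` metadata and the `readable_columns` helper are hypothetical, and the non-pii column names for `users` are invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical catalog metadata: table -> column name -> set of attributes.
COLUMN_ATTRIBUTES: Dict[str, Dict[str, Set[str]]] = {
    "iot_events": {"date": set(), "country": set(), "email": {"pii"}},
    "users": {"name": set(), "phone": {"pii"}},
}


def readable_columns(table: str, excluded_attributes: Set[str]) -> List[str]:
    """Columns a grantee may read when the grant excludes the given attributes."""
    columns = COLUMN_ATTRIBUTES[table]
    return [
        name
        for name, attributes in columns.items()
        if not (attributes & excluded_attributes)
    ]


# Emulates: GRANT SELECT ... HAVING ATTRIBUTE NOT IN (pii) TO product_managers
product_manager_view = {
    table: readable_columns(table, {"pii"}) for table in COLUMN_ATTRIBUTES
}
```

Tagging a new column with `pii` changes what every such grantee can read without touching any grant, which is the appeal of setting permissions per attribute rather than per column.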


  29. Delta Sharing
    An Open Protocol for Secure Data Sharing


  30. Delta Sharing: delta.io/sharing
    Data Provider: Delta Lake Table, Delta Sharing Server, Access permissions
    Delta Sharing Protocol (REST)
    Data Recipient: Commercial or open source; Python connector for pandas or Apache Spark
    pip install delta-sharing
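
On the recipient side, the Python connector addresses a table as a profile file path followed by `#share.schema.table`. The sketch below assumes a profile file received from the data provider; the file name `config.share` and the share, schema, and table names are hypothetical, and `load_shared_table` needs the `delta-sharing` package and a reachable sharing server to actually run.

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address used by the connector."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Load a shared table into a pandas DataFrame (requires `pip install delta-sharing`)."""
    import delta_sharing  # imported lazily so table_url works without the package

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


# Hypothetical profile file and table coordinates, for illustration only.
url = table_url("config.share", "retail", "default", "sales")
```

The recipient needs no Spark cluster or Databricks account for this path; the profile file carries the endpoint and credentials issued by the provider.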


  31. Delta Sharing Recipient: Pandas + Jupyter Notebook

  32. Delta Sharing on Databricks
    Secure Delta Sharing server integrated in our service easily manages shares with CREATE SHARE commands in SQL or REST APIs.
    Data Provider: sales table, Unity Catalog, Audit log
    Delta Sharing Protocol
    Data Recipients
    CREATE SHARE retail
    ALTER SHARE retail ADD TABLE sales
    GRANT SELECT ON SHARE retail TO supplier1

  33. How to engage?
    delta.io
    delta-users Slack
    delta-users Google Group
    Delta Lake YouTube channel
    Delta Lake GitHub Issues
    Delta Lake RS
    Bi-weekly meetings

  34. Databricks SQL


  35. First-Class SQL Development Experience
    Enabling simple, quick ad-hoc exploratory analysis on the Lake with SQL
    Develop: Query tabs; Drafts & “pick up where you left off”; Command history; Contextual auto-complete
    Troubleshoot: Query progress; Error highlighting; Execution time breakdown
    Collaborate: Scheduled email delivery; Edit permissions

  36. Large Query Performance
    Price / Performance Benchmark with Barcelona Supercomputing Center (Nov 2020)
    30TB TPC-DS Price/Performance
    Lower is better


  37. Beyond large query performance
    Providing fast and predictable performance for all workloads
    Many small files
    Small queries
    BI Results Retrieval
    Mixed small/large

  38. What about many concurrent users on small data?
    10 GB TPC-DS @ 32 Concurrent Streams (Queries/Hr)
    Higher is better


  39. What about badly laid out tables?
    “Too Many Small Files” Scenario Benchmark (# rows scanned/sec)
    Higher is better
    Small files: ~12x rows scanned within the same duration
    Async & parallel IO: Cold S3/ADLS remote reads fully saturate S3/ADLS/GCS bandwidth with increased parallelism for better cold reads

  40. Summary: Advancing the Lakehouse
    Delta Live Tables: Reliable ETL made easy with Delta Lake
    Delta Sharing: The first open protocol for data sharing
    Unity Catalog: The first multi-cloud data catalog for the lakehouse
    Photon: The first high performance query engine for the lakehouse
    Available Today Coming Soon Coming Soon Public Preview

  41. Key ML Announcements
    1. Databricks Machine Learning
    2. Feature Store
    3. AutoML
    4. MLflow developments


  42. Persona-based Navigation
    Purpose-built surfaces for data teams


  43. ML Dashboard
    All ML related assets and resources in one place


  44. Announcing: Databricks Machine Learning
    A data-native and collaborative solution for the full ML lifecycle
    Open Data Lakehouse Foundation with MLOps / Governance
    Data Prep, Data Versioning, Model Training, Model Tuning, Runtime and Environments, Monitoring, Batch Scoring, Online Serving
    AutoML
    Data Science Workspace
    Feature Store: Batch (high throughput), Real time (low latency)

  45. Feature Store


  46. Customers are loyal, until they aren’t
    D2C Primary Focus for Media: Nearly 75% of US homes subscribe to Hulu, Netflix or Amazon
    10M Subscribers in 1 Day: Disney’s streaming service garnered 10M subscribers in 1 day
    890% Growth in Subscription boxes: Food, Beauty and Apparel growth over 3 year period.

  47. There might be synergies between use cases
    D2C Primary Focus for Media; 10M Subscribers in 1 Day
    Customer Churn: Data Science Team #1
    Customer Lifetime Value: Data Science Team #2
    Next Best Action: Data Science Team #3

  48. Raw Data
    Featurization
    Training
    Joins, Aggregates, Transforms, etc.
    csv
    csv
    Serving Client
    No reuse of Features
    Online / Offline Skew
    Teams working in silos reinvent the wheel


  49. Solving the Feature Store Problem
    Raw Data
    Featurization: Joins, Aggregates, Transforms, etc.
    Training
    Serving Client
    Feature Store: Feature Registry, Feature Provider
      Batch (high throughput)
      Online (low latency)
    - single source of truth for features
    - promotes feature discoverability
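
The "single source of truth" idea can be sketched with a tiny in-memory registry. This is a conceptual illustration only, not the Databricks Feature Store API: the `FeatureRegistry` class, its methods, and the feature names are hypothetical.

```python
from typing import Callable, Dict, List

Raw = Dict[str, float]


class FeatureRegistry:
    """Hypothetical in-memory registry: one definition per feature, shared by all teams."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[Raw], float]] = {}

    def register(self, name: str, fn: Callable[[Raw], float]) -> None:
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = fn

    def list_features(self) -> List[str]:
        """Discoverability: every registered feature is listed in one place."""
        return sorted(self._features)

    def compute(self, raw: Raw, names: List[str]) -> Dict[str, float]:
        """Online path: compute features for a single record."""
        return {name: self._features[name](raw) for name in names}

    def compute_batch(self, rows: List[Raw], names: List[str]) -> List[Dict[str, float]]:
        """Batch path: reuses exactly the same definitions as the online path."""
        return [self.compute(row, names) for row in rows]


registry = FeatureRegistry()
registry.register(
    "avg_order_value",
    lambda raw: raw["total_spend"] / raw["orders"] if raw["orders"] else 0.0,
)
registry.register("is_frequent_buyer", lambda raw: float(raw["orders"] >= 3))

customers = [
    {"total_spend": 120.0, "orders": 4.0},
    {"total_spend": 0.0, "orders": 0.0},
]
wanted = ["avg_order_value", "is_frequent_buyer"]
training_rows = registry.compute_batch(customers, wanted)
online_row = registry.compute(customers[0], wanted)
```

Because training and serving call the same registered function, the online and offline values for a record cannot diverge, which is the skew problem the previous slide describes.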


  50. AutoML


  51. What is AutoML?
    Automated machine learning (AutoML) is a fully-automated model development solution seeking to “democratize” machine learning. While the scope of the automation varies, AutoML technologies usually automate the ML process from data to model selection.
    • select a dataset
    • automated data prep
    • automated feature engineering and selection
    • automated training and model selection
    • automated hyperparameter tuning
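
The selection and tuning steps can be shown at toy scale. In this sketch the "model" is a one-parameter threshold rule, so there is no real fitting step; searching the candidate thresholds and choosing by validation accuracy stands in for automated model selection and hyperparameter tuning. The dataset and all names are hypothetical, and this is not how Databricks AutoML is implemented.

```python
from typing import Dict, List, Tuple

# Toy dataset of (feature, label) pairs; values are made up for illustration.
data: List[Tuple[float, int]] = [
    (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
    (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1),
]
train, valid = data[::2], data[1::2]


def accuracy(threshold: float, rows: List[Tuple[float, int]]) -> float:
    """Share of rows where the rule 'predict 1 if x >= threshold' is correct."""
    correct = sum(1 for x, y in rows if int(x >= threshold) == y)
    return correct / len(rows)


def auto_select(thresholds, train_rows, valid_rows):
    """Try every candidate, keep a record of each trial, return the best one."""
    trials: List[Dict[str, float]] = []
    for threshold in thresholds:
        trials.append({
            "threshold": threshold,
            "train_accuracy": accuracy(threshold, train_rows),
            "valid_accuracy": accuracy(threshold, valid_rows),
        })
    best = max(trials, key=lambda trial: trial["valid_accuracy"])
    return best, trials


best, trials = auto_select([0.05, 0.25, 0.5, 0.75, 0.95], train, valid)
```

Returning every trial alongside the winner keeps the search inspectable, so a data scientist can see why a candidate was chosen and continue from there.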


  52. AutoML solves two key pain points for data scientists
    Quickly Verify the Predictive Power of a Dataset
    “Can this dataset be used to predict customer churn?”
    Marketing Team, Data Science Team, Dataset
    Get a Baseline Model to Guide Project Direction
    “What direction should I go in for this ML project and what benchmark should I aim to beat?”
    Data Science Team, Dataset, Baseline Model

  53. Problems with Existing AutoML Solutions
    “Opaque Box”: AutoML Configuration, AutoML Training, Returned Best Model
    Production Cliff: Deployed Model
    Persona / Goal / Driving Analogy
    Citizen Data Scientist: No-Code / Full Automation
    Engineer: Low-Code / Augmentation
    ML Expert / Researcher: Code / Flexibility and Performance

  54. Databricks AutoML
    A glass-box solution that empowers data teams without taking away control
    UI and API to start AutoML training
    Data exploration notebook: Generated notebook with feature summary statistics and distributions. Understand and debug data quality and preprocessing.
    Reproducible trial notebooks: Generated notebooks with source code for every model. Iterate further on models from AutoML, adding your expertise.
    MLflow experiment: Auto-created MLflow Experiment to track models and metrics. Easily deploy to Model Registry.

  55. View Slide

  56. Configure
    Augment
    Train and Evaluate
    Databricks AutoML
    Deploy


  57. MLflow Developments


  58. MLflow: An Open Source ML Platform
    TRACKING: Monitoring
    PROJECTS: Reproducible runs
    MODELS: Packaging & serving
    MODEL REGISTRY: Model management
    Raw Data, Data Prep, Training, Deployment
    Data Engineer, ML Engineer, Application Developer
    Any Language
    Any ML Library

  59. What’s New in MLflow
    AUTO LOGGING
    mlflow.spark.autolog()
    mlflow.pyspark.ml.autolog()
    mlflow.sklearn.autolog()
    mlflow.tensorflow.autolog()
    mlflow.pytorch.autolog()
    mlflow.gluon.autolog()
    mlflow.keras.autolog()
    mlflow.lightgbm.autolog()
    mlflow.xgboost.autolog()
    mlflow.fastai.autolog()
    mlflow.catboost.autolog()
    mlflow.autolog()
    TRACKING
    mlflow.shap.log_explainer()
    mlflow.shap.log_explanation()
    mlflow.log_figure()
    mlflow.log_image()
    mlflow.log_dict()
    mlflow.log_text()
    mlflow-thin-client
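
A few of the calls listed on this slide can be combined as follows. This is a minimal sketch that assumes `pip install mlflow` and the default local tracking directory; the `run_summary` and `log_run` helpers, the artifact file names, and the parameter and metric values are hypothetical.

```python
def run_summary(params: dict, metrics: dict) -> dict:
    """Bundle parameters and metrics into one JSON-serializable dictionary."""
    return {"params": dict(params), "metrics": dict(metrics)}


def log_run(params: dict, metrics: dict) -> None:
    """Log a run with MLflow (requires `pip install mlflow`)."""
    import mlflow  # imported lazily so run_summary works without MLflow

    mlflow.autolog()  # enable automatic logging for supported ML libraries
    with mlflow.start_run():
        # Store the summary and a free-text note as run artifacts.
        mlflow.log_dict(run_summary(params, metrics), "summary.json")
        mlflow.log_text("baseline run", "notes.txt")


# Hypothetical values; call log_run(...) where MLflow is installed.
summary = run_summary({"max_depth": 3}, {"accuracy": 0.91})
```

`mlflow.autolog()` covers whatever supported library is used for training, while `log_dict` and `log_text` attach extra artifacts that autologging would not capture on its own.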


  60. What’s New in MLflow
    DEPLOYMENT BACKENDS: Google Cloud AI Platform

  61. PyCaret + MLflow


  62. Data and AI Summit: https://dataaisummit.com/
    Databricks YouTube channel: https://www.youtube.com/Databricks
    Databricks blog: https://databricks.com/blog
    @frankmunz