
Slice_DAIS_Español_ONLINE.pdf

Due to the huge success of the first edition, we will repeat the Slice & DAIS event, this time in Spanish. Once more we will cover all the Delta Lake and AI news! This is a cross-promotion with the Barcelona Spark Meetup group:
https://www.meetup.com/Spark-Barcelona/events/279277516/

We're doing it again!

Our "Slice and DAIS" event, which walked data scientists and data engineers through all the highlights of the Data and AI Summit (formerly the Apache Spark Summit) in an accessible way, was a great success with more than 6,000 viewers. So we have decided to do it again, but this time in Spanish!

Together with the Barcelona Spark meetup group, we will present the latest news on Delta Lake and Machine Learning. We will also have a strong contribution from the Spanish-speaking community.

When: June 13th, 2021
Time: 17h CEST
Where: Databricks Zoom
Level: L200/300 (on a scale from L100 "product flyer" to L400 "live coding")

The agenda is as follows:

Welcome and moderation: Paola Pardo, Spark Meetup Barcelona
Lakehouse news: Alejandro Rabadán, Databricks, 30 mins
Machine Learning news: Carlos del Cacho, Databricks, 20-30 mins


Frank Munz

July 14, 2021

Transcript

  1. None
  2. Welcome! • First edition of Slice & DAIS in Spanish! • Data & AI Summit highlights • Databricks + Spark Barcelona • We are growing the community!
  3. We're back!! • 7 years • 2,900+ members • A space to share knowledge and connect people :)
  4. Menu of the day: welcome and introduction • Lakehouse news (Alejandro Rabadán, Databricks) • Machine Learning news (Carlos del Cacho, Databricks). Today's special menu: Lakehouse and Machine Learning.
  5. Sharing is caring :) @databricks @sparkbarcelona #SliceAndDAIS slack-sparkbcn.herokuapp.com/ — join and check out the latest news!
  6. Lakehouse Intro

  7. The Lakehouse: a data lake for all your data (structured, semi-structured, and unstructured) and one platform for every use case (streaming analytics, BI, data science, machine learning), with a structured transactional layer on top providing RELIABILITY & QUALITY (ACID transactions), PERFORMANCE & LATENCY (advanced indexing, caching, compaction), and GOVERNANCE (fine-grained access control).
  8. Lakehouse adoption across industries

  9. Single node Data Science meets Big Data

  10. What is Koalas? An implementation of the pandas APIs over Spark: easily port existing data science code and make it execute at scale. The slide shows the same snippet twice, once with import pandas as pd and once with import databricks.koalas as ks — df = ks.read_csv(file); df['x'] = df.y * df.z; df.describe(); df.plot.line(...). Koalas now sees ~3 million PyPI downloads per month. A runnable sketch follows below.
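
     For reference, a minimal runnable sketch of the port shown above. It assumes the databricks-koalas package is installed and that a hypothetical data.csv with numeric columns y and z exists.

        # pandas vs. Koalas: the same code, different engine
        import pandas as pd
        import databricks.koalas as ks

        # pandas: runs on a single node
        pdf = pd.read_csv("data.csv")
        pdf["x"] = pdf.y * pdf.z
        print(pdf.describe())

        # Koalas: the same API, executed on Spark at scale
        kdf = ks.read_csv("data.csv")
        kdf["x"] = kdf.y * kdf.z
        print(kdf.describe())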
  11. Single Node Performance Comparison - 31 GB (lower is better).

  12. Performance Comparison - 95 GB: pandas vs. pyspark.pandas (lower is better).

  13. Different APIs, powered by the same engine: SQL language, Scala DataFrame, Python DataFrame, and the pandas APIs.
  14. Apache Spark development (link: Spark 3.1.1 release notes), grouped under ANSI SQL compliance, Python, performance, and more streaming: decorrelation framework, timestamp w/o time zone, adaptive optimization, Scala 2.13 beta, error codes, implicit type cast, interval type, complex type support in ORC, lateral join, compile latency reduction, Java 17, push-based shuffle, session window, visualization and plotting, RocksDB state store, queryable state store, Pythonic error handling, richer input/output, Koalas (pandas APIs), Parquet 1.12 (column index), state store APIs, DML metrics, ANSI mode GA, and a low latency scheduler.
  15. OSS Delta Lake 1.0

  16. • Project Zen: more Pythonic, better usability • Faster performance, including predicate pushdown and pruning • ANSI SQL compliance for DDL/DML commands including INSERT, MERGE, and EXPLAIN • Spark 3.1 comes with Databricks Runtime 8.0. Links: https://spark.apache.org/releases/spark-release-3-1-1.html, https://databricks.com/blog/2021/03/02/introducing-apache-spark-3-1.html, and the delta.io blog post "Delta Lake 1.0.0 Released".
  17. Generated columns. Problem: partitioning by date. Better solution: generated columns (a Python sketch follows below):

        CREATE TABLE events(
          id bigint,
          eventTime timestamp,
          eventDate GENERATED ALWAYS AS ( CAST(eventTime AS DATE) )
        )
        USING delta
        PARTITIONED BY (eventDate)

     Example row: id 1, eventTime 2021-05-24 09:00:00.000, eventDate 2021-05-24.
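
     The same table can also be defined from Python with the Delta Lake 1.0 builder API; a minimal sketch assuming a SparkSession named spark with Delta Lake enabled.

        from delta.tables import DeltaTable

        (DeltaTable.create(spark)
            .tableName("events")
            .addColumn("id", "BIGINT")
            .addColumn("eventTime", "TIMESTAMP")
            # eventDate is derived from eventTime on every write
            .addColumn("eventDate", "DATE",
                       generatedAlwaysAs="CAST(eventTime AS DATE)")
            .partitionedBy("eventDate")
            .execute())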
  18. Delta Everywhere — Standalone JVM

  19. pip install delta-spark: Python APIs for using Delta Lake with Apache Spark, e.g. for unit testing (a sketch follows below). pip install deltalake: Delta Lake without Spark, a PyPI install of the Delta Rust client:

        from deltalake import DeltaTable
        dt = DeltaTable("reviews")
        dt.version()   # 3
        dt.files()
        # ['part-00000-...-ff32ddab96d2-c000.snappy.parquet',
        #  'part-00000-...-d46c948aa415-c000.snappy.parquet',
        #  'part-00001-...-7eb62007a15c-c000.snappy.parquet']
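
     A minimal unit-testing sketch with the delta-spark package: spin up a local SparkSession with Delta Lake enabled and round-trip a tiny table. The /tmp path and the assertion are illustrative.

        import pyspark
        from delta import configure_spark_with_delta_pip

        builder = (
            pyspark.sql.SparkSession.builder.appName("delta-test")
            .master("local[2]")
            .config("spark.sql.extensions",
                    "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Write and read back a small Delta table, e.g. inside a pytest test
        spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-test")
        assert spark.read.format("delta").load("/tmp/delta-test").count() == 5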
  20. Delta Live Tables

  21. Building the foundation of a Lakehouse with ETL: from the data lake (CSV, JSON, TXT, Kinesis, ...) through BRONZE (raw ingestion and history) and SILVER (filtered, cleaned, augmented) to GOLD (business-level aggregates), with quality increasing at every hop, feeding BI & reporting, streaming analytics, and data science & ML.
  22. Delta Live Tables: easily build data pipelines. Declaratively build data pipelines with business logic and chained table dependencies; run in batch or streaming with structured or unstructured data; reuse ETL pipelines across environments. https://docs.databricks.com/data-engineering/delta-live-tables/index.html
  23. Treat your data as code: a single source of truth for more than just transformation logic (a Python sketch follows below).

        CREATE LIVE TABLE clean_data(
          CONSTRAINT valid_timestamp EXPECT (timestamp > "…")
        )
        COMMENT "Customer data with timestamps cleaned up"
        TBLPROPERTIES ( "has_pii" = "true" )
        AS SELECT to_timestamp(ts) AS ts, * FROM LIVE.raw_data

     Declarative quality expectations: just say what makes bad data bad and what to do with it. Documentation lives with the transformation, which helps keep discovery information current. Governance is built in: all information about processing is captured into a table for analysis and auditing.
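
     A rough Python counterpart to the SQL above, using the Delta Live Tables decorators. Note that the dlt module is only importable inside a DLT pipeline, and the expectation threshold here is illustrative.

        import dlt
        from pyspark.sql.functions import col, to_timestamp

        @dlt.table(
            comment="Customer data with timestamps cleaned up",
            table_properties={"has_pii": "true"},
        )
        @dlt.expect("valid_timestamp", "ts > '2020-01-01'")  # illustrative cutoff
        def clean_data():
            return dlt.read("raw_data").withColumn("ts", to_timestamp(col("ts")))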
  24. Databricks Unity Catalog Simplified governance for data and AI

  25. Data lake governance today is complex. The data is files on S3/ADLS/GCS (/dataset/pages/part-001, /dataset/pages/part-002, /dataset/users/uk/part-001, /dataset/users/uk/part-002, /dataset/users/us/part-001) with file-based permissions: user1 can read /pages/, user2 can read /users/, user3 can read /users/us/. What if we only want users to see some columns/rows within a table? What if we want to change the data layout? What if governance rules change? The metadata (e.g. a Hive Metastore with tables & views) can be out of sync with the data, and ML models and SQL databases each come with a different governance model.
  26. Databricks Unity Catalog sits between users and the data (files on S3/ADLS/GCS such as /dataset/pages/part-001 through /dataset/users/us/part-001), governing tables, views, ML models, SQL databases, and Delta Shares, with an audit log. • Fine-grained permissions on tables, fields, and views • ANSI SQL grants • A uniform permission model for all data assets • Across workspaces • ODBC/JDBC / Delta Sharing
  27. Using the Unity Catalog:

        CREATE TABLE iot_events
        GRANT SELECT ON iot_events TO engineers
        GRANT SELECT(date, country) ON iot_events TO marketing
  28. Attribute-Based Access Control (ABAC): set permissions on all columns tagged pii together.

        CREATE ATTRIBUTE pii
        ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
        ALTER TABLE users ADD ATTRIBUTE pii ON phone
        ...
        GRANT SELECT ON DATABASE iot_data
          HAVING ATTRIBUTE NOT IN (pii) TO product_managers
  29. Delta Sharing An Open Protocol for Secure Data Sharing

  30. Delta Sharing: delta.io/sharing. The data provider exposes a Delta Lake table through a Delta Sharing server that enforces access permissions; data recipients connect over the Delta Sharing protocol (REST) with commercial or open source clients, e.g. the Python connector for pandas or Apache Spark: pip install delta-sharing (a recipient sketch follows below).
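
     A recipient-side sketch using the open source connector. The profile.share credentials file comes from the provider, and the share/schema/table names are placeholders.

        import delta_sharing

        profile = "profile.share"
        client = delta_sharing.SharingClient(profile)
        print(client.list_all_tables())

        # Load one shared table into pandas (use load_as_spark on a cluster)
        df = delta_sharing.load_as_pandas(profile + "#my_share.my_schema.my_table")
        print(df.head())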
  31. Delta Sharing Recipient: Pandas + Jupyter Notebook

  32. Delta Sharing on Databricks: a secure Delta Sharing server integrated into the service manages shares with CREATE SHARE commands in SQL or via REST APIs, backed by the Unity Catalog and an audit log:

        CREATE SHARE retail
        ALTER SHARE retail ADD TABLE sales
        GRANT SELECT ON SHARE retail TO supplier1
  33. How to engage? delta.io • delta-users Slack • delta-users Google Group • Delta Lake YouTube channel • Delta Lake GitHub Issues • Delta Lake RS • bi-weekly meetings
  34. Databricks SQL

  35. A first-class SQL development experience, enabling simple, quick ad-hoc exploratory analysis on the lake with SQL. Develop: query tabs, drafts & "pick up where you left off", command history, contextual auto-complete. Troubleshoot: query progress, error highlighting, execution time breakdown. Collaborate: scheduled email delivery, edit permissions.
  36. Large query performance: a price/performance benchmark with the Barcelona Supercomputing Center (Nov 2020) on 30TB TPC-DS (price/performance, lower is better).
  37. Beyond large query performance: providing fast and predictable performance for all workloads — many small files, small queries, mixed small/large workloads, and BI results retrieval.
  38. What about many concurrent users on small data? 10 GB TPC-DS @ 32 concurrent streams (queries/hr, higher is better).
  39. What about badly laid out tables? In a "too many small files" scenario benchmark (# rows scanned/sec, higher is better), ~12x more rows are scanned within the same duration. Async & parallel IO: cold S3/ADLS remote reads fully saturate S3/ADLS/GCS bandwidth, with increased parallelism for better cold reads.
  40. Summary: advancing the Lakehouse. Delta Live Tables: reliable ETL made easy with Delta Lake. Delta Sharing: the first open protocol for data sharing. Unity Catalog: the first multi-cloud data catalog for the lakehouse. Photon: the first high performance query engine for the lakehouse. Availability shown on the slide: Available Today / Coming Soon / Coming Soon / Public Preview.
  41. Key ML announcements: 1. Databricks Machine Learning 2. Feature Store 3. AutoML 4. MLflow developments
  42. Persona-based Navigation Purpose-built surfaces for data teams

  43. ML Dashboard: all ML-related assets and resources in one place.
  44. Announcing: Databricks Machine Learning, a data-native and collaborative solution for the full ML lifecycle, built on the open data lakehouse foundation with MLOps and governance: data prep, data versioning, AutoML, the data science workspace, model training and tuning, runtimes and environments, the Feature Store, monitoring, batch scoring (high throughput), and online serving (real time, low latency).
  45. Feature Store

  46. Customers are loyal, until they aren't. Direct-to-consumer is a primary focus for media: nearly 75% of US homes subscribe to Hulu, Netflix, or Amazon; Disney's streaming service garnered 10M subscribers in 1 day; and subscription boxes grew 890% in food, beauty, and apparel over a 3-year period.
  47. There might be synergies between use cases: customer churn (data science team #1), customer lifetime value (data science team #2), and next best action (data science team #3) all serve the same D2C media focus.
  48. Without a feature store: raw data → featurization (joins, aggregates, transforms, etc., often handed off as CSV files) → training → serving client. The result: no reuse of features, online/offline skew, and teams working in silos reinventing the wheel.
  49. Solving the feature store problem: raw data → featurization (joins, aggregates, transforms, etc.) → Feature Store → training and serving clients. The Feature Store combines a feature registry with feature providers for batch (high throughput) and online (low latency) access; it is a single source of truth for features and promotes feature discoverability. A client sketch follows below.
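
     A sketch of registering features with the Feature Store client announced here; the method names follow the databricks.feature_store Python package and may differ across runtime versions, and all table/column names are illustrative.

        from databricks.feature_store import FeatureStoreClient

        fs = FeatureStoreClient()

        # Register an engineered-features DataFrame as the single source of truth
        fs.create_table(
            name="shop.customer_features",      # hypothetical feature table
            primary_keys=["customer_id"],
            df=customer_features_df,            # hypothetical features DataFrame
            description="Customer features shared across churn, CLV, and NBA teams",
        )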
  50. AutoML

  51. What is AutoML? Automated machine learning (AutoML) is a fully automated model development solution seeking to "democratize" machine learning. While the scope of the automation varies, AutoML technologies usually automate the ML process from data to model selection: select a dataset → automated data prep → automated feature engineering and selection → automated training and model selection → automated hyperparameter tuning.
  52. AutoML solves two key pain points for data scientists. First, quickly verify the predictive power of a dataset: "Can this dataset be used to predict customer churn?" (a marketing team hands a dataset to the data science team). Second, get a baseline model to guide project direction: "What direction should I go in for this ML project, and what benchmark should I aim to beat?"
  53. Problems with existing AutoML solutions: training is an "opaque box" — you hand in an AutoML configuration, get back a "best model", and then hit a production cliff at deployment. Personas span a spectrum (the slide uses a driving analogy): citizen data scientists and engineers want no-code full automation, ML experts and researchers want code-level flexibility and performance, with low-code augmentation in between.
  54. Databricks AutoML: a glass-box solution that empowers data teams without taking away control. UI and API to start AutoML training (a sketch follows below). Data exploration notebook: a generated notebook with feature summary statistics and distributions, to understand and debug data quality and preprocessing. Reproducible trial notebooks: generated notebooks with the source code for every model, so you can iterate further on AutoML's models, adding your expertise. MLflow experiment: an auto-created MLflow experiment tracks models and metrics, with easy deployment to the Model Registry.
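
     A sketch of that API, available as the databricks.automl module on Databricks ML runtimes; train_df and the churn target column are placeholders.

        from databricks import automl

        summary = automl.classify(train_df, target_col="churn", timeout_minutes=30)

        # Every trial is a generated notebook tracked in an MLflow experiment
        print(summary.best_trial.model_path)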
  55. None
  56. Databricks AutoML: Configure → Augment → Train and Evaluate → Deploy.

  57. MLflow Developments

  58. MLflow: an open source ML platform for any language and any ML library, spanning raw data, data prep, training, and deployment across data engineers, ML engineers, and application developers: TRACKING (reproducible runs), PROJECTS (packaging), MODELS (packaging & serving), MODEL REGISTRY (model management), plus monitoring.
  59. What's new in MLflow: autologging across TRACKING — mlflow.spark.autolog(), mlflow.pyspark.ml.autolog(), mlflow.sklearn.autolog(), mlflow.tensorflow.autolog(), mlflow.pytorch.autolog(), mlflow.gluon.autolog(), mlflow.keras.autolog(), mlflow.lightgbm.autolog(), mlflow.xgboost.autolog(), mlflow.fastai.autolog(), mlflow.catboost.autolog(), and the universal mlflow.autolog() — plus mlflow.shap.log_explainer(), mlflow.shap.log_explanation(), mlflow.log_figure(), mlflow.log_image(), mlflow.log_dict(), mlflow.log_text(), and the mlflow-thin-client. A sketch of the universal autologger follows below.
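
     A minimal sketch of the universal autologger: one call enables autologging for all supported libraries, demonstrated here with scikit-learn on synthetic data.

        import mlflow
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        mlflow.autolog()

        X, y = make_classification(n_samples=200, random_state=0)
        with mlflow.start_run():
            RandomForestClassifier(n_estimators=50).fit(X, y)
        # Parameters, metrics, and the fitted model are logged automatically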
  60. What's new in MLflow: deployment backends, e.g. Google Cloud AI Platform.

  61. PyCaret + MLflow

  62. Data and AI Summit: https://dataaisummit.com/ • Databricks YouTube channel: https://www.youtube.com/Databricks • Databricks blog: https://databricks.com/blog • @frankmunz