Sharing Huge Amounts of Live Data with Delta Sharing

Frank Munz
September 21, 2021

ODSC and Big Things Spain Conferences 2021, Dr. Frank Munz:

Abstract: "Data comes at us fast" is what they say. In fact, the last couple of years taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in a batch or streaming way. Despite these advances, data sharing has been severely limited because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern object stores.

Conferences have been filled for many years with sessions about how to architect applications and master the APIs of your services, but recent events have shown a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open data sets of genomic data shared publicly for the development of vaccines. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, where the data has to be secured.

In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.
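
To make the "open REST protocol" concrete, here is a minimal sketch of how a recipient might authenticate and ask a sharing server which shares it can see. It assumes the profile layout (`shareCredentialsVersion`, `endpoint`, `bearerToken`) and the `GET {endpoint}/shares` call described in the Delta Sharing protocol; the host name and token are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def load_profile(profile_json: str) -> dict:
    """Parse a sharing profile and check the fields this sketch relies on."""
    profile = json.loads(profile_json)
    for key in ("endpoint", "bearerToken"):
        if key not in profile:
            raise ValueError(f"profile is missing '{key}'")
    return profile


def build_list_shares_request(profile: dict) -> urllib.request.Request:
    """Build (but do not send) the GET request that lists available shares."""
    url = profile["endpoint"].rstrip("/") + "/shares"
    headers = {"Authorization": f"Bearer {profile['bearerToken']}"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Hypothetical profile; a real one is issued by the data provider.
EXAMPLE_PROFILE = json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "example-token",
})

request = build_list_shares_request(load_profile(EXAMPLE_PROFILE))
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a real server. The bearer token is the only credential the recipient needs, which is one reason the protocol is easy to implement in different products.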

It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: Data Providers and Recipients. The data provider decides what data to share and runs a sharing server. An open-sourced reference sharing service is available to get started with sharing Apache Parquet or Delta.io tables.
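
On the provider side, "deciding what data to share" comes down to mapping shares, schemas, and tables to locations in the object store. The sketch below renders such a mapping as YAML text. The key names (`version`, `shares`, `schemas`, `tables`, `location`, `host`, `port`, `endpoint`) follow the reference server's configuration template as an assumption that should be checked against the project documentation; the `SharedTable` class, the `render_server_config` helper, and the bucket path are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedTable:
    share: str
    schema: str
    name: str
    location: str  # object-store path of the table


def render_server_config(tables, host="localhost", port=8080,
                         endpoint="/delta-sharing") -> str:
    """Render a share -> schema -> table hierarchy as YAML text."""
    # Group tables by share, then by schema, keeping insertion order.
    shares = {}
    for t in tables:
        shares.setdefault(t.share, {}).setdefault(t.schema, []).append(t)

    lines = ["version: 1", "shares:"]
    for share, schemas in shares.items():
        lines.append(f'- name: "{share}"')
        lines.append("  schemas:")
        for schema, schema_tables in schemas.items():
            lines.append(f'  - name: "{schema}"')
            lines.append("    tables:")
            for t in schema_tables:
                lines.append(f'    - name: "{t.name}"')
                lines.append(f'      location: "{t.location}"')
    lines += [f'host: "{host}"', f"port: {port}", f'endpoint: "{endpoint}"']
    return "\n".join(lines) + "\n"


# Hypothetical share with a single table stored on S3.
config_text = render_server_config([
    SharedTable("demo_share", "default", "snps",
                "s3a://example-bucket/tables/snps"),
])
print(config_text)
```

The data itself never moves into the server: the configuration only points at table locations, which matches the "lake first" idea that data stays in the object store.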

Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters on the data (e.g., “country=ES”) to read a subset of the data.
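
The filter mentioned above can be sketched as a table query against the REST API. This example assumes the protocol's `POST {endpoint}/shares/{share}/schemas/{schema}/tables/{table}/query` call with a JSON body carrying `predicateHints` and an optional `limitHint`; the server address, token, and the share, schema, and table names are hypothetical. The request is constructed but not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_table_query_request(endpoint, bearer_token, share, schema, table,
                              predicate_hints=None, limit_hint=None):
    """Build (but do not send) a table query with optional filter hints."""
    path = "/shares/{}/schemas/{}/tables/{}/query".format(
        quote(share, safe=""), quote(schema, safe=""), quote(table, safe="")
    )
    body = {}
    if predicate_hints:
        body["predicateHints"] = list(predicate_hints)
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    return urllib.request.Request(
        endpoint.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Hypothetical server and table; the hint mirrors the "country=ES" example.
request = build_table_query_request(
    "https://sharing.example.com/delta-sharing/",
    "example-token",
    share="demo_share", schema="default", table="customers",
    predicate_hints=["country = 'ES'"],
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

As far as the protocol describes them, predicate hints are best effort: the server may still return files containing other rows, so a client should apply the filter again after loading. In practice most users would not build these requests by hand; the `delta-sharing` Python connector offers `delta_sharing.load_as_pandas`, which takes a `<profile-file>#<share>.<schema>.<table>` URL and returns a pandas DataFrame.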

Transcript

  1. ODSC & Big Things Conference 2021
    Sharing Huge Amounts
    of Live Data with Delta Sharing
    Dr. Frank Munz
    @frankmunz

  2. About me

    Staff Developer Advocate @Databricks

    Based in Munich 🍻 ⛰ 🥨

    All things large scale data & compute

    Twitter: @frankmunz

    Formerly AWS Tech Evangelist, SW architect, data
    scientist, published author etc.

  3. Example 1: Publicly Shared Data

  4. https://www.cidrap.umn.edu/news-perspective/2020/01/china-releases-genetic-data-new-coronavirus-now-deadly

  6. Example 2: Huge Amounts of Scientific Data

  7. Example 3: Live Data with Transactional Updates

  8. Delta Sharing

    Open Source: https://github.com/delta-io/delta-sharing

    Open format, vendor independent

    Multi-cloud

    Cloud object store bandwidth

    pandas, Apache Spark or commercial / BI clients

    DIY hosting, or cloud service

  9. Head to Head Comparison
    Vendor2Vendor (s)ftp S3 URLs OSS Delta Sharing
    Secure ✅ ✅ ✅ ✅
    Cheap ✅ ✅ ✅
    Vendor agnostic ✅ ✅
    Multi-cloud ✅ ✅
    Open Source ✅ ✅
    Table / Data Frame abstr. ✅ ✅
    Live data ✅ ✅
    Predicate Pushdown ✅ ✅
    Object Store Bandwidth ✅ ✅
    Zero compute cost ✅ ✅
    Scalability ✅ ✅

  10. Introducing Lakehouse
    Data Warehouse, Data Lake
    Streaming Analytics, BI, Data Science, Machine Learning
    Structured, Semi-Structured and Unstructured Data

  11. Lakehouse adoption across industries

  12. https://delta.io

  13. Delta Sharing

  14. Demo 1
    Reading Shared Data
    with Google Colab Client

  15. https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb

  16. Lessons Learned

    Start with Delta Sharing server hosted at Databricks

    It's cross-vendor and multi-cloud

    Clients are easy to build

    Abstraction level is data frames / tables, not files!

  17. Demo 2
    Hacking Frank's DNA - the OSS way!
    Delta Sharing server with Jupyter Notebook

  18. Delta Sharing

  19. Genotyping Frank as a TSV file
    Frank's eye color:
    rsXXXXXX AG
    Eye color: rs12913832 GG <-(Frank)

  20. config.yaml

  21. Receiver Notebook

  22. Reality isn't that easy...
    By National Institutes of Health - Public Domain
    http://commonfund.nih.gov/epigenomics/figure.aspx

  23. Don't do this at home!

  24. Lessons Learned

    Delta Sharing environment OSS only:
    pyspark, Jupyter, Delta Sharing server/image

    Install it on your laptop or EC2, …

    Lake-first approach: data stays on S3, ADLS2, ...

  25. Demo 3
    Open Source -> Production
    Delta Sharing from Databricks Notebook (SQL)
    + live updates

  26. Databricks Notebook: Create Share

  27. Databricks Notebook: Create Recipient

  28. Lessons Learned

    Operating servers -> abstracted as SQL

    Works with live data, transactionally safe

    Delta Sharing is built into the Databricks compute plane

    Part of Unity Catalog

  29. Conclusion & Links

  30. Conclusion Delta Sharing
    ● Platform-independent, OSS way of sharing massive amounts of data.
    ● Works with live data
    ● Clients can be built quickly:
    ○ pandas, Apache Spark, or Tableau and Power BI.
    ○ notebooks: Databricks, Amazon EMR and SageMaker, Google Colab...
    ● Pre-built reference implementation / Docker container.
    Databricks simplifies Data and AI
    -> built-in Delta Sharing server, no ops overhead, plain SQL

  31. Databricks EMEA Community

    Brand new Databricks Forum

    Databricks Beacons

    >10 new EMEA Data & AI meetups
    + global online Meetups
    We want to hear from you!
    new Data and AI meetups: Cape Town /
    Johannesburg / Dubai / Milano / Berlin /
    Moscow / St. Petersburg / Madrid /
    Amsterdam

  32. @frankmunz
    https://fmunz.medium.com
    https://github.com/fmunz
    https://www.linkedin.com/in/frankmunz
    https://speakerdeck.com/fmunz
