
Sharing Huge Amounts of Live Data with Delta Sharing

Frank Munz
September 21, 2021


ODSC and Big Things Spain Conferences 2021, Dr. Frank Munz:

Abstract: "Data comes at us fast," as they say. The last few years have taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data, both in batch and in streaming fashion. Despite these advances, data sharing has remained severely limited: existing solutions were tied to a single vendor, did not work for live data, came with serious security issues, and did not scale to the bandwidth of modern object stores.

For years, conferences have been filled with sessions on how to architect applications and master service APIs, but recent events have revealed a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open genomic data sets shared publicly for vaccine development. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, where the data must be secured.

In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.
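As a rough sketch of what that REST protocol looks like on the wire: a recipient authenticates with a bearer token and lists the shares it can access. The endpoint URL and token below are placeholders, not real values:

```shell
# Sketch of a Delta Sharing REST call (hypothetical endpoint and token).
ENDPOINT="https://sharing.example.com/delta-sharing"
TOKEN="dapi-placeholder"

# With a live sharing server, a recipient would list accessible shares:
#   curl -s -H "Authorization: Bearer $TOKEN" "$ENDPOINT/shares"
echo "$ENDPOINT/shares"
```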

It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: data providers and recipients. The data provider decides what data to share and runs a sharing server. An open-source reference sharing server is available to get started with sharing Apache Parquet or delta.io tables.

Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can apply filters (e.g., “country=ES”) to read a subset of it.
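To illustrate the client side, the snippet below builds the `<profile>#<share>.<schema>.<table>` URL that the open-source `delta-sharing` Python client expects; the profile path and the share, schema, and table names are invented for this sketch, and the actual load call (commented out) requires the package and a reachable sharing server:

```python
# Sketch of a pandas client for Delta Sharing (all names are illustrative).

def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' URL used by the client."""
    return f"{profile_path}#{share}.{schema}.{table}"

url = table_url("my.share", "demo_share", "default", "covid_cases")
print(url)  # my.share#demo_share.default.covid_cases

# With the delta-sharing package installed and a valid profile file,
# the latest version of the shared table loads as a pandas DataFrame:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```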


Transcript

  1. ODSC & Big Things Conference 2021 Sharing Huge Amounts of

    Live Data with Delta Sharing Dr. Frank Munz @frankmunz
  2. About me • Staff Developer Advocate @Databricks • Based in

    Munich, 🍻 ⛰ 🥨 󰎲 • All things large scale data & compute • Twitter: @frankmunz • Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc.
  3. Example 1: Publicly Shared Data

  4. https://www.cidrap.umn.edu/news-perspective/2020/01/china-releases-genetic-data-new-coronavirus-now-deadly

  5. None
  6. Example 2: Huge Amounts of Scientific Data

  7. Example 3: Live Data with Transactional Updates

  8. Delta Sharing • Open Source: https://github.com/delta-io/delta-sharing • Open format, vendor

    independent • Multi-cloud • Cloud object store bandwidth • pandas, Apache Spark or commercial / BI clients • DIY hosting, or cloud service
  9. Head-to-Head Comparison: Vendor2Vendor, (s)ftp, S3 URLs, and OSS Delta

    Sharing compared on Secure, Cheap, Vendor agnostic, Multi-cloud, Open Source, Table / Data Frame abstraction, Live data, Predicate Pushdown, Object Store Bandwidth, Zero compute cost, and Scalability — with OSS Delta Sharing checking every box.
  10. Introducing the Lakehouse: Data Warehouse + Data Lake, supporting Streaming

    Analytics, BI, Data Science, and Machine Learning on structured, semi-structured, and unstructured data
  11. Lakehouse adoption across industries

  12. https://delta.io

  13. Delta Sharing

  14. Demo 1 Reading Shared Data with Google Colab Client

  15. https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb

  16. Lessons Learned • Start with Delta Sharing server hosted at

    Databricks • It's cross-vendor and multi-cloud • Clients are easy to build • Abstraction level is data frames / tables, not files!
  17. Demo 2 Hacking Frank's DNA - the OSS way! Delta

    Sharing server with Jupyter Notebook
  18. Delta Sharing

  19. Genotyping Frank as a TSV file. Frank's eye color:

    rsXXXXXX AG; eye color: rs12913832 GG <- (Frank)
  20. config.yaml
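    A minimal sketch of what the reference server's config.yaml can look like; the share, schema, and table names and the S3 location are invented for illustration:

    ```yaml
    # Sketch of a delta-sharing reference server config (illustrative names).
    version: 1
    shares:
    - name: "dna_share"
      schemas:
      - name: "genomics"
        tables:
        - name: "frank_dna"
          location: "s3a://my-bucket/delta/frank_dna"
    host: "localhost"
    port: 8080
    endpoint: "/delta-sharing"
    ```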

  21. Receiver Notebook

  22. Reality isn't that easy... By National Institutes of Health -

    Public Domain http://commonfund.nih.gov/epigenomics/figure.aspx
  23. Don't do this at home!

  24. Lessons Learned • Delta Sharing environment OSS only: pyspark, Jupyter,

    Delta Sharing server/image • Install it on your laptop or EC2, … • Lake first approach: data stays on S3, ADLS2, ...
  25. Demo 3 Open Source -> Production Delta Sharing from Databricks

    Notebook (SQL) + live updates
  26. Databricks Notebook: Create Share

  27. Databricks Notebook: Create Recipient

  28. Lessons Learned • Operating servers -> abstracted as SQL •

    Works with live data, transactionally safe • Delta Sharing is built into the Databricks compute plane • Part of Unity Catalog
  29. Conclusion & Links

  30. Conclusion Delta Sharing • Platform-independent, OSS way of sharing massive

    amounts of data • Works with live data • Clients can be built quickly: ◦ pandas, Apache Spark, or Tableau and Power BI ◦ notebooks: Databricks, Amazon EMR and SageMaker, Google Colab... • Pre-built reference implementation / Docker container • Databricks simplifies Data and AI -> built-in Delta Sharing server, no ops overhead, plain SQL
  31. Databricks EMEA Community • Brand new Databricks Forum • Databricks

    Beacons • >10 new EMEA Data & AI meetups + global online meetups • We want to hear from you! New Data and AI meetups: Cape Town / Johannesburg / Dubai / Milano / Berlin / Moscow / St. Petersburg / Madrid / Amsterdam
  32. @frankmunz https://fmunz.medium.com https://github.com/fmunz https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz