
Sharing Huge Amounts of Live Data with Delta Sharing

Frank Munz
September 21, 2021

Abstract: "Data comes at us fast," as they say. In fact, the last couple of years taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in batch or streaming fashion. Despite these advances, data sharing has been severely limited, because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern object stores.

Conferences have been filled for many years with sessions about how to architect applications and master the APIs of your services, but recent events have shown a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open genomic data sets shared publicly for vaccine development. Still, many commercial use cases share news, financial, or geological data with a restricted audience, where the data has to be secured.

In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.
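To make the protocol concrete, here is a minimal sketch of listing shares over that REST API with plain Python. The endpoint URL and bearer token are placeholders, and the exact paths and response fields are taken from the open protocol specification as understood here, so treat it as an illustration rather than a definitive client:

    import requests

    # Placeholders: in practice both values come from a profile file issued by the provider.
    ENDPOINT = "https://sharing.example.com/delta-sharing"   # sharing server endpoint (assumed)
    TOKEN = "<bearer-token>"                                  # bearer token issued by the data provider

    # List the shares this recipient is allowed to see.
    # The open protocol exposes this as GET {endpoint}/shares with bearer authentication.
    resp = requests.get(
        f"{ENDPOINT}/shares",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

    for share in resp.json().get("items", []):
        print(share["name"])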

It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: data providers and recipients. The data provider decides what data to share and runs a sharing server. An open-source reference sharing server is available to get started with sharing Apache Parquet or Delta Lake (delta.io) tables.
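On the provider side, the reference server is configured with a config.yaml that lists shares, schemas, and tables (see slide 13 in the transcript); each recipient is then handed a small profile file containing the server endpoint and a bearer token. A minimal sketch of writing such a profile from Python, with a placeholder endpoint, token, and file name (the field names follow the open sharing protocol):

    import json

    # Profile a data provider hands to a recipient.
    # shareCredentialsVersion, endpoint and bearerToken are the fields defined by
    # the open sharing protocol; the values below are placeholders.
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": "https://sharing.example.com/delta-sharing",
        "bearerToken": "<bearer-token>",
    }

    # "frank.share" is a hypothetical file name used in the later examples.
    with open("frank.share", "w") as f:
        json.dump(profile, f, indent=2)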

Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters (e.g., “country=ES”) to read only a subset of the data.
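As a sketch of the recipient side, the snippet below uses the open-source delta-sharing Python connector to discover shared tables and to load one of them as a Spark DataFrame with a filter that can be pushed down. The profile file name and the share/schema/table coordinates are assumptions for illustration:

    import delta_sharing                      # pip install delta-sharing
    from pyspark.sql import SparkSession

    profile = "frank.share"                   # profile file from the provider (assumed name)

    # Discover what the provider has shared with this recipient.
    client = delta_sharing.SharingClient(profile)
    for table in client.list_all_tables():
        print(table.share, table.schema, table.name)

    # Load one shared table as a Spark DataFrame; the coordinates are hypothetical.
    # Requires the delta-sharing Spark connector on the classpath
    # (e.g. the io.delta:delta-sharing-spark package).
    spark = SparkSession.builder.appName("delta-sharing-demo").getOrCreate()
    url = f"{profile}#sales_share.default.orders"
    df = delta_sharing.load_as_spark(url)

    # Filters like this read only a subset of the shared data.
    df.filter("country = 'ES'").show()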


Transcript

  1. Sharing Huge Amounts of Live Data with Delta Sharing. Frank Munz @frankmunz

  2. About Frank • Staff Developer Advocate @Databricks • Based in Munich 🍻 ⛰ 🥨 • All things large-scale data & compute • Twitter: @frankmunz • Formerly AWS Tech Evangelist

  3. Databricks EMEA Community • Brand-new Databricks Forum • Databricks Beacons • >10 new EMEA Data & AI meetups + global online meetups • We want to hear from you! New Data and AI meetups: Cape Town / Johannesburg / Dubai / Milano / Berlin / Moscow / St. Petersburg / Madrid / Amsterdam

  4. Introducing Lakehouse: Data Warehouse + Data Lake • Streaming Analytics, BI, Data Science, Machine Learning • Structured, Semi-Structured and Unstructured Data

  5. Delta.io & Delta Sharing

  6. Lakehouse adoption across industries

  7. Delta Sharing • Open Source: https://github.com/delta-io/delta-sharing • Open format, vendor independent • Multi-cloud • Cloud object store bandwidth • pandas, Apache Spark or commercial / BI clients • DIY hosting, or use Databricks cloud service

  8. Delta Sharing

  9. Demo 1 Reading Shared Data with Google Colab Client

  10. Head-to-Head Comparison (columns: Vendor2Vendor, (s)ftp, S3 URLs, Delta Sharing): Secure ✅✅✅✅ • Cheap ✅✅✅ • Vendor agnostic ✅✅ • Multi-cloud ✅✅ • Open Source ✅✅ • Table / DataFrame abstraction ✅✅ • Live data ✅✅ • Predicate pushdown ✅✅ • Object store bandwidth ✅✅ • Zero compute cost ✅✅ • Scalability ✅✅

  11. Demo 2: Hacking Frank's DNA, the OSS way! Delta Sharing server with Jupyter Notebook

  12. Delta Sharing

  13. config.yaml

  14. Genotyping Frank as a TSV file. Frank's eye color: rsXXXXXX AG; eye color: rs12913832 GG <- (Frank) (see the pandas sketch after the transcript)

  15. Demo 3: Live updates! Delta Sharing from a Databricks Notebook (SQL) with a Google Colab receiver

  16. Databricks Notebook

  17. Lessons Learned • Operating servers -> SQL abstraction • Live data, transactionally safe • Delta Sharing is built into the Databricks compute plane • Part of Unity Catalog

  18. Conclusion & Links

  19. Conclusion • Platform-independent, OSS way of sharing massive amounts of live data • Works with live data • Clients can be built quickly: ◦ pandas, Apache Spark, or Tableau and Power BI ◦ notebooks: Databricks, Amazon EMR and SageMaker • Pre-built reference implementation / Docker container • Databricks simplifies Data and AI: built-in Delta Sharing server, no ops overhead, plain SQL

  20. How to engage? delta.io • delta-users Slack • delta-users Google Group • Delta Lake YouTube channel • Delta Lake GitHub Issues • Delta Lake RS Bi-weekly meetings

  21. Technical Questions? Databricks Community https://community.databricks.com

  22. @frankmunz https://fmunz.medium.com https://github.com/fmunz https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz
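For the genotyping demo on slides 11 to 14, the recipient-side lookup can be done with the pandas connector. A minimal sketch, assuming hypothetical share/schema/table coordinates and that the shared table has rsid and genotype columns:

    import delta_sharing                      # pip install delta-sharing

    # <profile-file>#<share>.<schema>.<table>; all names here are assumptions.
    table_url = "frank.share#dna_share.default.genotypes"

    # Reads the latest version of the shared table into a pandas DataFrame.
    df = delta_sharing.load_as_pandas(table_url)

    # rs12913832 is the eye-color SNP shown on slide 14; the column names are assumed.
    print(df.loc[df["rsid"] == "rs12913832", "genotype"])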