ODSC and Big Things Spain Conferences 2021, Dr. Frank Munz:
Abstract: "Data comes at us fast" is what they say. In fact, the last couple of years taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data in a batch or streaming way. Despite these advances, data sharing has been severely limited because sharing solutions were tied to a single vendor, did not work for live data, came with severe security issues, and did not scale to the bandwidth of modern object stores.
Conferences have been filled for years with sessions on how to architect applications and master the APIs of your services, but recent events have shown huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open data sets of genomic data shared publicly for the development of vaccines. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, where the data has to be secured.
In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.
It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: data providers and recipients. The data provider decides what data to share and runs a sharing server. An open-source reference sharing server is available to get you started sharing Apache Parquet or Delta.io tables.
Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can provide filters on the data (e.g., “country=ES”) to read a subset of it.
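As a minimal sketch of the recipient side (the profile file name and table coordinates below are illustrative; `delta_sharing.load_as_pandas` and the `<profile>#<share>.<schema>.<table>` addressing are the connector's documented entry points):

```python
# pip install delta-sharing  (the open-source Python connector)
import importlib.util
import os

# The provider hands recipients a small "profile" file containing the
# server endpoint and a bearer token; the file name here is illustrative.
profile = "open-datasets.share"

# Tables are addressed as <profile-file>#<share>.<schema>.<table>
table_url = f"{profile}#delta_sharing.default.covid_cases"

# Only attempt the network read if the connector and profile are present.
if os.path.exists(profile) and importlib.util.find_spec("delta_sharing"):
    import delta_sharing

    # Always returns the latest version of the shared table.
    df = delta_sharing.load_as_pandas(table_url)

    # Filter to a subset, e.g. country=ES; Spark clients can push such
    # predicates down so the server only serves the matching files.
    spain = df[df["country"] == "ES"]
    print(spain.head())
```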
ODSC & Big Things Conference 2021
Sharing Huge Amounts
of Live Data with Delta Sharing
Dr. Frank Munz
Staff Developer Advocate @Databricks
Based in Munich, 🍻 ⛰ 🥨
All things large scale data & compute
Formerly AWS Tech Evangelist, SW architect, data scientist, published author, etc.
Example 1: Publicly Shared Data
Example 2: Huge Amounts of Scientific Data
Example 3: Live Data with Transactional Updates
Open Source: https://github.com/delta-io/delta-sharing
Open format, vendor independent
Cloud object store bandwidth
pandas, Apache Spark or commercial / BI clients
DIY hosting, or cloud service
Head to Head Comparison
Feature                     Vendor2Vendor   (s)ftp   S3 URLs   OSS Delta Sharing
Secure                           ✅           ✅        ✅            ✅
Cheap                                         ✅        ✅            ✅
Vendor agnostic                               ✅                      ✅
Multi-cloud                                   ✅                      ✅
Open Source                                   ✅                      ✅
Table / Data Frame abstr.        ✅                                   ✅
Live data                        ✅                                   ✅
Predicate Pushdown               ✅                                   ✅
Object Store Bandwidth                                  ✅            ✅
Zero compute cost                                       ✅            ✅
Scalability                                             ✅            ✅
Data Warehouse Data Lake
Structured, Semi-Structured and Unstructured
Lakehouse adoption across industries
Reading Shared Data
with Google Colab Client
Start with Delta Sharing server hosted at Databricks
It's cross-vendor and multi-cloud
Clients are easy to build
Abstraction level is data frames / tables, not files!
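The Colab demo boils down to a few connector calls. A hedged sketch (the profile file name is illustrative; `SharingClient` and `list_all_tables` are the connector's documented discovery API):

```python
# In Colab: %pip install delta-sharing
import importlib.util
import os

# Profile file issued by the data provider; the name here is illustrative.
profile = "open-datasets.share"

# Only talk to the server if the connector and profile are present.
if os.path.exists(profile) and importlib.util.find_spec("delta_sharing"):
    import delta_sharing

    # The SharingClient speaks the open REST protocol to the sharing server.
    client = delta_sharing.SharingClient(profile)

    # Discover the shared tables -- the abstraction is tables, not files.
    for table in client.list_all_tables():
        url = f"{profile}#{table.share}.{table.schema}.{table.name}"
        df = delta_sharing.load_as_pandas(url)
        print(url, df.shape)
```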
Hacking Frank's DNA - the OSS way!
Delta Sharing server with Jupyter Notebook
Genotyping Frank as a TSV file
Frank's eye color:
Eye color: rs12913832 GG <- (Frank)
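The lookup itself is just a filter over the shared table. A stdlib-only sketch with a made-up excerpt of a genotyping TSV export (rsid, chromosome, position, genotype are the typical columns; the rows are illustrative):

```python
import csv
import io

# A tiny, made-up excerpt of a genotyping TSV export.
raw = """rsid\tchromosome\tposition\tgenotype
rs4477212\t1\t82154\tAA
rs12913832\t15\t28365618\tGG
"""

# rs12913832 is the SNP commonly used to predict eye color; the GG
# genotype is the one typically associated with blue eyes.
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
genotype = next(row["genotype"] for row in reader if row["rsid"] == "rs12913832")
print(genotype)  # GG
```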
Reality isn't that easy...
By National Institutes of Health - Public Domain
Don't do this at home!
Delta Sharing environment OSS only:
pyspark, Jupyter, Delta Sharing server/image
Install it on your laptop or EC2, …
Lake first approach: data stays on S3, ADLS2, ...
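For the OSS-only path, the reference server is driven by a YAML config that maps shares, schemas, and tables to object-store paths, so the data never leaves the lake. A sketch following the format documented in the delta-io/delta-sharing repository (share/table names, bucket, and token are illustrative):

```yaml
# delta-sharing-server-config.yaml -- schema follows the reference
# server's documentation; all names below are illustrative.
version: 1
shares:
- name: "genomics"
  schemas:
  - name: "default"
    tables:
    - name: "genotypes"
      location: "s3a://my-bucket/delta/genotypes"   # data stays on S3
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
authorization:
  bearerToken: "<token>"
```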
Open Source -> Production
Delta Sharing from Databricks Notebook (SQL)
+ live updates
Databricks Notebook: Create Share
Databricks Notebook: Create Recipient
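The "create share" and "create recipient" steps above reduce to a few SQL statements. A sketch using the Unity Catalog sharing DDL (catalog, share, and recipient names are illustrative):

```sql
-- Create a share and add a live table to it.
CREATE SHARE genomics_share COMMENT 'Genotype tables for partners';
ALTER SHARE genomics_share ADD TABLE main.genomics.genotypes;

-- Create a recipient; Databricks generates an activation link
-- through which the recipient downloads their credential file.
CREATE RECIPIENT partner_lab;

-- Grant the recipient read access to the share.
GRANT SELECT ON SHARE genomics_share TO RECIPIENT partner_lab;
```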
Operating servers -> abstracted as SQL
Works with live data, transactionally safe
Delta Sharing is built into the Databricks compute plane
Part of Unity Catalog
Conclusion & Links
Conclusion Delta Sharing
● Platform-independent, OSS way of sharing massive amounts of data.
● Works with live data
● Clients can be built quickly:
○ pandas, Apache Spark, or BI tools such as Tableau and Power BI.
○ notebooks: Databricks, Amazon EMR and Sagemaker, Google Colab...
● Pre-built reference implementation / Docker container.
Databricks simplifies Data and AI
-> built-in Delta Sharing server, no ops overhead, plain SQL
Databricks EMEA Community
Brand new Databricks Forum
>10 new EMEA Data & AI meetups
+ global online Meetups
We want to hear from you!
New Data and AI meetups: Cape Town / Johannesburg / Dubai / Milano / Berlin / Moscow / St. Petersburg / Madrid /