
Sharing Huge Amounts of Live Data with Delta Sharing

Frank Munz
September 21, 2021


ODSC and Big Things Spain Conferences 2021, Dr. Frank Munz:

Abstract: "Data comes at us fast," as they say. The last few years have taught us how to successfully cleanse, store, retrieve, process, and visualize large amounts of data, both in batch and in streaming fashion. Despite these advances, data sharing has remained severely limited: existing solutions were tied to a single vendor, did not work for live data, came with serious security issues, and did not scale to the bandwidth of modern object stores.

For years, conferences have been filled with sessions on how to architect applications and master service APIs, but recent events have revealed a huge business demand for sharing massive amounts of live data in the most direct, scalable way possible. One example is open genomic data sets shared publicly for vaccine development. Many commercial use cases, however, share news, financial, or geological data with a restricted audience, where the data must be secured.

In this session, dive deep into an open-source solution for sharing massive amounts of live data in a cheap, secure, and scalable way. Delta Sharing is an open-source project donated to the Linux Foundation. It uses an open REST protocol to secure the real-time exchange of large data sets, enabling secure data sharing across products for the first time.
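As a rough sketch of what that REST protocol looks like on the wire: a recipient authenticates with a bearer token and lists the shares it can access. The endpoint URL and token below are placeholders, not real values:

```shell
# Sketch of a Delta Sharing REST call (hypothetical endpoint and token).
ENDPOINT="https://sharing.example.com/delta-sharing"
TOKEN="dapi-placeholder"

# With a live sharing server, a recipient would list accessible shares:
#   curl -s -H "Authorization: Bearer $TOKEN" "$ENDPOINT/shares"
echo "$ENDPOINT/shares"
```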

It leverages modern cloud object stores, such as S3, ADLS, or GCS, to reliably transfer large data sets. There are two parties involved: data providers and recipients. The data provider decides what data to share and runs a sharing server. An open-source reference sharing server is available to get started with sharing Apache Parquet or delta.io tables.

Any client supporting pandas, Apache Spark™, Rust, or Python can connect to the sharing server. Clients always read the latest version of the data, and they can apply filters (e.g., “country=ES”) to read a subset of it.
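To illustrate the client side, the snippet below builds the `<profile>#<share>.<schema>.<table>` URL that the open-source `delta-sharing` Python client expects; the profile path and the share, schema, and table names are invented for this sketch, and the actual load call (commented out) requires the package and a reachable sharing server:

```python
# Sketch of a pandas client for Delta Sharing (all names are illustrative).

def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' URL used by the client."""
    return f"{profile_path}#{share}.{schema}.{table}"

url = table_url("my.share", "demo_share", "default", "covid_cases")
print(url)  # my.share#demo_share.default.covid_cases

# With the delta-sharing package installed and a valid profile file,
# the latest version of the shared table loads as a pandas DataFrame:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(url)
```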


Transcript

  1. ODSC & Big Things Conference 2021 Sharing Huge Amounts of

    Live Data with Delta Sharing Dr. Frank Munz @frankmunz
  2. About me • Staff Developer Advocate @Databricks • Based in

    Munich, 🍻 ⛰ 🥨 󰎲 • All things large scale data & compute • Twitter: @frankmunz • Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc.
  3. Example 1: Publicly Shared Data

  4. https://www.cidrap.umn.edu/news-perspective/2020/01/china-releases-genetic-data-new-coronavirus-now-deadly

  5. None
  6. Example 2: Huge Amounts of Scientific Data

  7. Example 3: Live Data with Transactional Updates

  8. Delta Sharing • Open Source: https://github.com/delta-io/delta-sharing • Open format, vendor

    independent • Multi-cloud • Cloud object store bandwidth • pandas, Apache Spark or commercial / BI clients • DIY hosting, or cloud service
  9. Head-to-Head Comparison: Vendor2Vendor, (s)ftp, S3 URLs, and OSS Delta

    Sharing compared on Secure, Cheap, Vendor agnostic, Multi-cloud, Open Source, Table / Data Frame abstraction, Live data, Predicate Pushdown, Object Store Bandwidth, Zero compute cost, and Scalability — with OSS Delta Sharing checking every box.
  10. Introducing the Lakehouse: Data Warehouse + Data Lake, supporting Streaming

    Analytics, BI, Data Science, and Machine Learning on structured, semi-structured, and unstructured data
  11. Lakehouse adoption across industries

  12. https://delta.io

  13. Delta Sharing

  14. Demo 1 Reading Shared Data with Google Colab Client

  15. https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb

  16. Lessons Learned • Start with Delta Sharing server hosted at

    Databricks • It's cross-vendor and multi-cloud • Clients are easy to build • Abstraction level is data frames / tables, not files!
  17. Demo 2 Hacking Frank's DNA - the OSS way! Delta

    Sharing server with Jupyter Notebook
  18. Delta Sharing

  19. Genotyping Frank as a TSV file. Frank's eye color:

    rsXXXXXX AG; eye color: rs12913832 GG <- (Frank)
  20. config.yaml
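    A minimal sketch of what the reference server's config.yaml can look like; the share, schema, and table names and the S3 location are invented for illustration:

    ```yaml
    # Sketch of a delta-sharing reference server config (illustrative names).
    version: 1
    shares:
    - name: "dna_share"
      schemas:
      - name: "genomics"
        tables:
        - name: "frank_dna"
          location: "s3a://my-bucket/delta/frank_dna"
    host: "localhost"
    port: 8080
    endpoint: "/delta-sharing"
    ```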

  21. Receiver Notebook

  22. Reality isn't that easy... By National Institutes of Health -

    Public Domain http://commonfund.nih.gov/epigenomics/figure.aspx
  23. Don't do this at home!

  24. Lessons Learned • Delta Sharing environment OSS only: pyspark, Jupyter,

    Delta Sharing server/image • Install it on your laptop or EC2, … • Lake first approach: data stays on S3, ADLS2, ...
  25. Demo 3 Open Source -> Production Delta Sharing from Databricks

    Notebook (SQL) + live updates
  26. Databricks Notebook: Create Share

  27. Databricks Notebook: Create Recipient

  28. Lessons Learned • Operating servers -> abstracted as SQL •

    Works with live data, transactionally safe • Delta Sharing is built into the Databricks compute plane • Part of Unity Catalog
  29. Conclusion & Links

  30. Conclusion Delta Sharing • Platform-independent, OSS way of sharing massive

    amounts of data • Works with live data • Clients can be built quickly: ◦ pandas, Apache Spark, or Tableau and Power BI ◦ notebooks: Databricks, Amazon EMR and SageMaker, Google Colab... • Pre-built reference implementation / Docker container • Databricks simplifies Data and AI -> built-in Delta Sharing server, no ops overhead, plain SQL
  31. Databricks EMEA Community • Brand new Databricks Forum • Databricks

    Beacons • >10 new EMEA Data & AI meetups + global online meetups • We want to hear from you! New Data and AI meetups: Cape Town / Johannesburg / Dubai / Milano / Berlin / Moscow / St. Petersburg / Madrid / Amsterdam
  32. @frankmunz https://fmunz.medium.com https://github.com/fmunz https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz