Slide 1

Slide 1 text

ODSC & Big Things Conference 2021 Sharing Huge Amounts of Live Data with Delta Sharing Dr. Frank Munz @frankmunz

Slide 2

Slide 2 text

About me ● Staff Developer Advocate @Databricks ● Based in Munich, 🍻 ⛰ 🥨 ● All things large scale data & compute ● Twitter: @frankmunz ● Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc.

Slide 3

Slide 3 text

Example 1: Publicly Shared Data

Slide 4

Slide 4 text

https://www.cidrap.umn.edu/news-perspective/2020/01/china-releases-genetic-data-new-coronavirus-now-deadly

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Example 2: Huge Amounts of Scientific Data

Slide 7

Slide 7 text

Example 3: Live Data with Transactional Updates

Slide 8

Slide 8 text

Delta Sharing ● Open Source: https://github.com/delta-io/delta-sharing ● Open format, vendor independent ● Multi-cloud ● Cloud object store bandwidth ● pandas, Apache Spark or commercial / BI clients ● DIY hosting, or cloud service
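
On the client side, a recipient needs two things: a profile file holding the server endpoint and a bearer token, and a table URL of the form `<profile>#<share>.<schema>.<table>`. The following is a minimal Python sketch of both. The endpoint, the token, and the share, schema, and table names are placeholders chosen for illustration, not values from the talk.

```python
import json
import tempfile
from pathlib import Path


def write_profile(directory, endpoint, bearer_token):
    """Write a Delta Sharing profile file and return its path."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    path = Path(directory) / "example.share"
    path.write_text(json.dumps(profile, indent=2))
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' URL the clients expect."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Placeholder values (assumptions for illustration only).
workdir = tempfile.mkdtemp()
profile_path = write_profile(
    workdir, "https://sharing.example.com/delta-sharing/", "<token>"
)
url = table_url(profile_path, "example_share", "default", "example_table")
print(url)
```

The same URL string can then be handed to the pandas or Apache Spark connector.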

Slide 9

Slide 9 text

Head to Head Comparison
Vendor2Vendor (s)ftp S3 URLs OSS Delta Sharing
Secure ✅ ✅ ✅ ✅
Cheap ✅ ✅ ✅
Vendor agnostic ✅ ✅
Multi-cloud ✅ ✅
Open Source ✅ ✅
Table / Data Frame abstr. ✅ ✅
Live data ✅ ✅
Predicate Pushdown ✅ ✅
Object Store Bandwidth ✅ ✅
Zero compute cost ✅ ✅
Scalability ✅ ✅
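
Predicate pushdown and object store bandwidth both follow from how a table read works in the open protocol: the client posts a query with optional hints, and the server answers with table metadata plus short-lived pre-signed URLs to the underlying Parquet files, which the client then downloads straight from the object store. The sketch below only builds such a request and does not send it. The endpoint, token, table names, and the example predicate are placeholders, and the field names should be checked against the protocol specification in the repository.

```python
import json
import urllib.request


def build_query_request(endpoint, token, share, schema, table,
                        predicate_hints=None, limit_hint=None):
    """Build (but do not send) a table query request for a sharing server."""
    body = {}
    if predicate_hints:
        # Hints let the server skip files that cannot match the filter.
        body["predicateHints"] = list(predicate_hints)
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    url = (f"{endpoint.rstrip('/')}/shares/{share}"
           f"/schemas/{schema}/tables/{table}/query")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Placeholder values (assumptions for illustration only).
request = build_query_request(
    "https://sharing.example.com/delta-sharing", "<token>",
    "example_share", "default", "example_table",
    predicate_hints=["date >= '2021-01-01'"], limit_hint=1000,
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Because the hints are best effort, a client would still apply the filter locally after downloading the files.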

Slide 10

Slide 10 text

Introducing Lakehouse Data Warehouse Data Lake Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured and Unstructured Data

Slide 11

Slide 11 text

Lakehouse adoption across industries

Slide 12

Slide 12 text

https://delta.io

Slide 13

Slide 13 text

Delta Sharing

Slide 14

Slide 14 text

Demo 1 Reading Shared Data with Google Colab Client

Slide 15

Slide 15 text

https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb
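
For readers without the notebook at hand, reading a shared table from a Colab-style notebook might look roughly like the sketch below. It assumes the `delta-sharing` Python connector is installed and uses its `load_as_pandas` function. The profile path and table name are hypothetical, and the `loader` parameter is an addition of this sketch so the function can be tried without network access.

```python
def preview_shared_table(table_url, limit=10, loader=None):
    """Return the first rows of a shared table as a pandas DataFrame.

    `loader` is a hook of this sketch for offline experiments; by default
    the delta-sharing Python connector does the actual read.
    """
    if loader is None:
        import delta_sharing  # pip install delta-sharing
        loader = delta_sharing.load_as_pandas
    return loader(table_url, limit=limit)


# Hypothetical profile path and table name; replace with real ones.
EXAMPLE_URL = "/content/example.share#example_share.default.example_table"

# In a notebook with network access:
# df = preview_shared_table(EXAMPLE_URL)
# df.head()
```

Passing a small `limit` keeps the preview cheap, since only a few rows need to be fetched.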

Slide 16

Slide 16 text

Lessons Learned ● Start with Delta Sharing server hosted at Databricks ● It's cross vendor and multi-cloud ● Clients are easy to build ● Abstraction level is data frames / tables, not files!

Slide 17

Slide 17 text

Demo 2 Hacking Frank's DNA - the OSS way! Delta Sharing server with Jupyter Notebook

Slide 18

Slide 18 text

Delta Sharing

Slide 19

Slide 19 text

Genotyping Frank as a TSV file Frank's eye color: rsXXXXXX AG Eye color: rs12913832 GG <-(Frank)
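
Looking up a single marker in such a TSV file takes only a few lines of Python. The sketch below uses a tiny in-memory sample instead of a real export. The two-column layout (`rsid`, `genotype`) is an assumption made for illustration, and the only data row is the `rs12913832 GG` entry shown on the slide.

```python
import csv
import io

# Tiny in-memory stand-in for a genotype export. The two-column layout
# is an assumption; real exports usually carry more columns.
SAMPLE_TSV = "rsid\tgenotype\nrs12913832\tGG\n"


def genotype_for(tsv_text, rsid):
    """Return the genotype recorded for `rsid`, or None if it is absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["rsid"] == rsid:
            return row["genotype"]
    return None


print(genotype_for(SAMPLE_TSV, "rs12913832"))
```

Once the file is written as a Delta table, the same lookup becomes a filter on a data frame.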

Slide 20

Slide 20 text

config.yaml
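
The slide shows the server configuration as an image. As a rough guide, the sketch below renders a minimal config in the shape used by the example file that ships with the open source reference server, as far as I recall it. The share, schema, and table names and the storage location are placeholders, and the keys should be verified against the template in the repository.

```python
def render_server_config(share, schema, table, location,
                         host="localhost", port=8080,
                         endpoint="/delta-sharing"):
    """Render a minimal reference-server config as YAML text."""
    return (
        "version: 1\n"
        "shares:\n"
        f'- name: "{share}"\n'
        "  schemas:\n"
        f'  - name: "{schema}"\n'
        "    tables:\n"
        f'    - name: "{table}"\n'
        f'      location: "{location}"\n'
        f'host: "{host}"\n'
        f"port: {port}\n"
        f'endpoint: "{endpoint}"\n'
    )


# Placeholder names and location (assumptions for illustration only).
config_text = render_server_config(
    "example_share", "default", "genotype",
    "s3a://<bucket-name>/<table-path>",
)
print(config_text)
```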

Slide 21

Slide 21 text

Receiver Notebook

Slide 22

Slide 22 text

Reality isn't that easy... By National Institutes of Health - Public Domain http://commonfund.nih.gov/epigenomics/figure.aspx

Slide 23

Slide 23 text

Don't do this at home!

Slide 24

Slide 24 text

Lessons Learned ● Delta Sharing environment OSS only: pyspark, Jupyter, Delta Sharing server/image ● Install it on your laptop or EC2, … ● Lake first approach: data stays on S3, ADLS2, ...
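
In the OSS-only setup, the pyspark side of the receiver can be sketched as below. It assumes the Spark session was started with the Delta Sharing Spark connector on its classpath, and the table URL in the usage comment is a placeholder.

```python
def read_shared_table(spark, table_url):
    """Read a shared table as a Spark DataFrame.

    Assumes the session has the delta-sharing-spark connector available.
    """
    return spark.read.format("deltaSharing").load(table_url)


# In a Jupyter notebook with pyspark installed:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("delta-sharing-demo").getOrCreate()
# df = read_shared_table(spark, "<profile-file>#<share>.<schema>.<table>")
# df.show()
```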

Slide 25

Slide 25 text

Demo 3 Open Source -> Production Delta Sharing from Databricks Notebook (SQL) + live updates

Slide 26

Slide 26 text

Databricks Notebook: Create Share
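
The notebook cell itself is shown as a screenshot. A hedged sketch of the statements involved, wrapped in Python so they can be passed to `spark.sql` in a Databricks notebook, is given below. The share and table names are hypothetical.

```python
def create_share_statements(share, tables):
    """Return the SQL statements that create a share and add tables to it."""
    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    statements += [f"ALTER SHARE {share} ADD TABLE {t}" for t in tables]
    return statements


# Hypothetical share and table names.
for sql in create_share_statements("example_share", ["main.default.genotype"]):
    print(sql)  # in a Databricks notebook: spark.sql(sql)
```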

Slide 27

Slide 27 text

Databricks Notebook: Create Recipient
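
Creating a recipient and granting it access to a share can be sketched the same way. The recipient and share names are hypothetical, and the statements are meant to be run with `spark.sql` in a Databricks notebook.

```python
def create_recipient_statements(recipient, share):
    """Return the SQL that creates a recipient and grants it read access."""
    return [
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


# Hypothetical recipient and share names.
for sql in create_recipient_statements("example_recipient", "example_share"):
    print(sql)  # in a Databricks notebook: spark.sql(sql)
```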

Slide 28

Slide 28 text

Lessons Learned ● Operating servers -> abstracted as SQL ● Works with live data, transactionally safe ● Delta Sharing is built into the Databricks compute plane ● Part of Unity Catalog

Slide 29

Slide 29 text

Conclusion & Links

Slide 30

Slide 30 text

Conclusion Delta Sharing ● Platform-independent, OSS way of sharing massive amounts of data. ● Works with live data ● Clients can be built quickly: ○ pandas, Apache Spark, or Tableau and Power BI. ○ notebooks: Databricks, Amazon EMR and SageMaker, Google Colab... ● Pre-built reference implementation / Docker container. Databricks simplifies Data and AI -> built-in Delta Sharing server, no ops overhead, plain SQL

Slide 31

Slide 31 text

Databricks EMEA Community ● Brand new Databricks Forum ● Databricks Beacons ● >10 new EMEA Data & AI meetups + global online Meetups We want to hear from you! New Data and AI meetups: Cape Town / Johannesburg / Dubai / Milano / Berlin / Moscow / St. Petersburg / Madrid / Amsterdam

Slide 32

Slide 32 text

@frankmunz https://fmunz.medium.com https://github.com/fmunz https://www.linkedin.com/in/frankmunz https://speakerdeck.com/fmunz