Slide 1

Slide 1 text

From Zero to Hero
Sharing Streaming Data with Open Source Delta Sharing
Frank Munz, Principal TMM, Databricks
@frankmunz

Slide 2

Slide 2 text

About me
▪ Principal TMM @ Databricks
▪ Based in Munich, 🍻 ⛰ 🥨
▪ ❤ all things large scale data & AI

Slide 3

Slide 3 text

©2021 Databricks Inc. — All rights reserved What’s the problem with Data Sharing?

Slide 4

Slide 4 text

Columns: Proprietary Vendor Solutions | SFTP | Cloud Object Store | Delta Sharing
Secure ✅ ✅ ✅ ✅
Cheap ✅ ✅ ✅
Vendor agnostic ✅ ✅
Multi-cloud ✅ ✅
Open Source ✅ ✅
Table / Data Frame abstraction ✅ ✅
Live data ✅ ✅
Predicate Pushdown ✅ ✅
Object Store Bandwidth ✅ ✅
Zero compute cost ✅ ✅
Scalability ✅ ✅

Slide 5

Slide 5 text

How does Delta Sharing Help?

Slide 6

Slide 6 text

The Open Approach To Sharing
▪ Fully open, without proprietary lock-in, using any computing platform
▪ Simple to share live data with other organizations
▪ Easily managed privacy, security, and compliance
▪ Additional flexibility and interoperability
▪ Less data movement and complexity
▪ Ability to unlock data with strong governance

Slide 7

Slide 7 text

Under the hood
▪ DATA PROVIDER: Delta Lake table (Parquet files in cloud storage), Delta Sharing Server
▪ DATA CONSUMER: Delta Sharing Client
▪ Activation link
▪ Request table
▪ Pre-signed short-lived URLs
▪ Temporary direct access to files (Parquet format) in the object store - AWS S3, GCP, ADLS …
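
The idea behind pre-signed short-lived URLs can be illustrated with a small, self-contained sketch. This is not how Delta Sharing or any cloud object store actually signs URLs (each provider has its own signing scheme); the secret key, host name, and helper names below are hypothetical and only show why a URL can grant temporary access without the consumer ever holding storage credentials.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret that only the provider side knows.
SECRET_KEY = b"provider-side-secret"


def _sign(path, expires):
    """Compute an HMAC over the file path and its expiry time."""
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(path, ttl_seconds, now=None):
    """Return a URL for `path` that stays valid for `ttl_seconds` seconds."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _sign(path, expires)})
    # objectstore.example is a placeholder host, not a real endpoint.
    return f"https://objectstore.example{path}?{query}"


def is_valid(url, now=None):
    """Accept the URL only if the signature matches and it has not expired."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(parsed.path, expires)
    return hmac.compare_digest(signature, expected) and now < expires
```

Because the expiry is part of the signed message, a consumer cannot extend the lifetime or point the URL at a different file without invalidating the signature.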

Slide 8

Slide 8 text

OSS: Run a Sharing Server
https://github.com/delta-io/delta-sharing

bin/delta-sharing-server -- --config server-config.yaml

OR

docker run -p : \
  … deltaio/delta-sharing-server:0.6.4 -- --config \
  /config/server-config.yaml
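
Both commands expect a server-config.yaml. The sketch below renders a minimal one from Python. The key names (version, shares, schemas, tables, location, host, port, endpoint) follow the example config in the delta-sharing repository as recalled here and should be verified against the repository for the server version in use; the helper name and the share, schema, and table values are hypothetical.

```python
def render_server_config(share, schema, table, location,
                         host="localhost", port=8080, endpoint="/delta-sharing"):
    """Render a minimal server-config.yaml body as plain text.

    Key names are an assumption based on the project's example config;
    check them against the repository before use.
    """
    lines = [
        "version: 1",
        "shares:",
        f'- name: "{share}"',
        "  schemas:",
        f'  - name: "{schema}"',
        "    tables:",
        f'    - name: "{table}"',
        f'      location: "{location}"',
        f'host: "{host}"',
        f"port: {port}",
        f'endpoint: "{endpoint}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder bucket and path; replace with the real table location.
    print(render_server_config("loan", "lending", "txs",
                               "s3a://<bucket-name>/<the-table-path>"))
```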

Slide 9

Slide 9 text

Databricks: Sharing Data from SQL
CREATE SHARE loan;
ALTER SHARE loan ADD TABLE demo.lending.txs;
CREATE RECIPIENT l_recipient;
GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;

Slide 10

Slide 10 text

Databricks UI: Create share
(1) create share
(2) add table

Slide 11

Slide 11 text

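Pandas Client
!pip install delta-sharing

import delta_sharing

# profile_f: path to the profile file received from the data provider
client = delta_sharing.SharingClient(profile_f)
# table URL format: <profile file>#<share>.<schema>.<table>
table = profile_f + "#share.schema.table"
data = delta_sharing.load_as_pandas(table)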
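
The profile file passed to the client is a small JSON document. As a minimal sketch, the helpers below write such a file and build the table URL used by load_as_pandas. The field names follow the Delta Sharing profile format (shareCredentialsVersion, endpoint, bearerToken); the helper names, endpoint, and token are hypothetical placeholders, and in practice the provider supplies the file via the activation link.

```python
import json


def write_profile(path, endpoint, bearer_token):
    """Write a Delta Sharing profile file and return its path."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,          # URL of the sharing server
        "bearerToken": bearer_token,   # credential issued to the recipient
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f, indent=2)
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' URL the client expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Usage (requires the delta-sharing package and a reachable server):
#   import delta_sharing
#   profile_f = write_profile("demo.share", "https://sharing.example/delta-sharing", "<token>")
#   data = delta_sharing.load_as_pandas(table_url(profile_f, "share", "schema", "table"))
```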

Slide 12

Slide 12 text

Streaming Support: Spark Structured Streaming
# client code
df = (spark.readStream
      .format("deltasharing")
      .option("readChangeFeed", "true")
      .option("startingTimestamp", "2021-04-21 05:45:46")
      .load("#..")
)
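
With readChangeFeed enabled, each row carries change metadata such as _change_type and _commit_version. The sketch below shows, in plain Python and without Spark, how a consumer might fold such rows into its own keyed copy of the table. The sample rows, the id key column, and the apply_changes helper are made up for illustration; a real pipeline would typically do this with a merge inside the streaming job.

```python
def apply_changes(state, changes, key="id"):
    """Apply change-feed rows to `state`, a dict keyed by the `key` column."""
    # Process commits in order; sorted() is stable, so rows within one
    # commit keep their original relative order.
    ordered = sorted(changes, key=lambda row: row["_commit_version"])
    for row in ordered:
        change_type = row["_change_type"]
        # Drop the metadata columns, keep only the table's own columns.
        data = {k: v for k, v in row.items() if not k.startswith("_")}
        if change_type in ("insert", "update_postimage"):
            state[data[key]] = data
        elif change_type == "delete":
            state.pop(data[key], None)
        # "update_preimage" rows hold the old values and need no action here.
    return state


if __name__ == "__main__":
    sample = [
        {"id": 1, "amount": 10, "_change_type": "insert", "_commit_version": 1},
        {"id": 1, "amount": 10, "_change_type": "update_preimage", "_commit_version": 2},
        {"id": 1, "amount": 12, "_change_type": "update_postimage", "_commit_version": 2},
    ]
    print(apply_changes({}, sample))
```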

Slide 13

Slide 13 text

Demo Delta Sharing

Slide 14

Slide 14 text

https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb

Slide 15

Slide 15 text

Why Delta Sharing rocks

Slide 16

Slide 16 text

Delta Sharing Ecosystem
▪ 3rd Party Data Vendors/Clean Room
▪ Open Source Clients
▪ Business Intelligence/Analytics
▪ Governance
▪ SaaS/Multi-Cloud Infrastructure
▪ Hyperscalers
▪ NEW: Carto

Slide 17

Slide 17 text

Adoption of Delta Sharing protocol takes aim at Snowflake

Oracle's adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-ins for data sharing and sort out issues such as security, version control and access management of data sets.

“With this open approach, customers can now securely share data with anyone using any application or service that supports the protocol,” the company said in a statement.

Oracle’s decision to adopt the protocol could be primarily due to its popularity and to counter Snowflake’s product offerings, analysts said.

Slide 18

Slide 18 text

Databricks Marketplace provides an open marketplace for data, analytics, and AI
▪ Open for Databricks & non-Databricks users
▪ Data sets, Notebooks, ML models and applications from top data & solution providers
▪ Public marketplace, private exchanges
Databricks Marketplace: Dashboards, ML Models, Data Files, Data Tables, Solution Accelerators, Notebooks

Slide 19

Slide 19 text

Databricks Clean Rooms
Secure environments to run computations on joint data
▪ Scalable: Scale to multiple collaborators and any data size
▪ Interoperable: Any data source with no replication
▪ Flexible: Your language and workload of choice
Diagram: Collaborator 1 … Collaborator N, each with existing tables shared through Delta Sharing; mutually approved jobs on Databricks trusted compute

Slide 20

Slide 20 text

Conclusion

Slide 21

Slide 21 text

Conclusion
Delta Sharing
● Platform-independent, multi-cloud, OSS for sharing massive amounts of live and streaming data
● Built into Databricks Accounts, Marketplace, Clean Rooms
● Clients can be:
  ○ OSS: pandas, Apache Spark
  ○ Enterprise BI: Tableau, Power BI
● Server
  ○ Pre-built reference implementation
  ○ OSS binary
  ○ OSS Docker container

Slide 22

Slide 22 text

Technical Questions?
Sign up for the Databricks Community!
Ask your technical questions here: https://community.databricks.com/

Slide 23

Slide 23 text

New Databricks Demo Center
databricks.com/demos
Delta Sharing demo on Databricks Demo Center