Sharing Streaming Data with Delta Sharing - Kafka Conference

Frank Munz
September 28, 2023

This lightning talk is an introduction to Delta Sharing, a Linux Foundation open source solution for sharing massive amounts of data in a cheap, secure, scalable and *streaming* way.

Homegrown data-sharing solutions based on SFTP or APIs aren’t scalable and saddle you with operational overhead. Off-the-shelf data-sharing solutions only work on specific sharing networks, which promotes vendor lock-in, and they can be costly. Others don't support streaming data.

Delta Sharing reliably accesses data at the bandwidth of modern cloud object stores, such as S3, ADLS, or GCS.
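The bandwidth claim follows from the flow shown on slide 5 of the transcript: the sharing server answers a table request with pre-signed, short-lived URLs, and the client then reads the Parquet files directly from the object store. The sketch below illustrates that exchange in Python. It assumes the table query endpoint path and the newline-delimited JSON response described in the open source Delta Sharing protocol document; the endpoint, token, helper names, and sample response are made-up placeholders, and the request is only built, never sent.

```python
import json
import urllib.request


def build_query_request(endpoint, token, share, schema, table, limit_hint=None):
    """Build (but do not send) a table query request for a sharing server.

    The path layout is an assumption based on the Delta Sharing protocol document.
    """
    url = f"{endpoint.rstrip('/')}/shares/{share}/schemas/{schema}/tables/{table}/query"
    body = {}
    if limit_hint is not None:
        # Optional hint asking the server to return only enough files for N rows.
        body["limitHint"] = limit_hint
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def presigned_urls(ndjson_text):
    """Collect the file URLs from a newline-delimited JSON query response."""
    urls = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "file" in record:
            urls.append(record["file"]["url"])
    return urls


# Hypothetical response: one protocol line, one metadata line, then one line per file.
SAMPLE_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {"id": "example-table-id", "format": {"provider": "parquet"}}}),
    json.dumps({"file": {"url": "https://bucket.example.com/part-0000.parquet?sig=abc", "id": "f0", "size": 1024}}),
    json.dumps({"file": {"url": "https://bucket.example.com/part-0001.parquet?sig=def", "id": "f1", "size": 2048}}),
])

request = build_query_request(
    "https://sharing.example.com/delta-sharing", "<token>",
    "share", "schema", "table", limit_hint=1000,
)
print(request.full_url)
print(presigned_urls(SAMPLE_RESPONSE))
```

Because the server only returns URLs, the data itself never passes through it; a real client would download each URL in parallel and decode the Parquet files locally.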

Any client supporting pandas, Apache Spark™, or Python, as well as commercial clients such as Power BI, can connect to the sharing server. Clients always read the latest version of the data, which can also be partitioned to limit the amount of data transferred. Databricks Marketplace and Databricks Clean Rooms use Delta Sharing, and Oracle, Dell, Cloudflare, Twilio, and many others have adopted the technology.
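To make the client side concrete, here is a minimal sketch of what a Python client needs before it can connect: a profile file with the server endpoint and a bearer token, and a table URL of the form `<profile>#<share>.<schema>.<table>` as used on slides 8 and 9. The profile field names follow the profile format of the open source connector as far as I know it; the endpoint, token, and helper names (`write_profile`, `table_url`, `load_shared_table`) are illustrative assumptions, not part of the talk.

```python
import json
import os
import tempfile


def write_profile(path, endpoint, bearer_token):
    """Write a sharing profile file; field names assumed from the OSS profile format."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f)
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address used by the clients."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url, limit=None):
    """Load a shared table into pandas; requires `pip install delta-sharing`."""
    import delta_sharing  # imported lazily so the helpers above work without it

    return delta_sharing.load_as_pandas(url, limit=limit)


# Placeholder endpoint and token; a real provider hands these out via an activation link.
profile_path = write_profile(
    os.path.join(tempfile.mkdtemp(), "config.share"),
    "https://sharing.example.com/delta-sharing",
    "<token>",
)
url = table_url(profile_path, "share", "schema", "table")
print(url)
# df = load_shared_table(url, limit=100)  # needs a reachable sharing server
```

The `limit` argument is one way to keep a first exploratory read small; it does not replace partitioning on the provider side.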

Learn what you need to know about data sharing in 2023 in this lightning talk.

Transcript

  1. From Zero to Hero: Sharing Streaming Data with Open Source Delta Sharing

    Frank Munz, Principal TMM, Databricks, @frankmunz
  2. About me

    ▪ Principal TMM @ Databricks
    ▪ Based in Munich, 🍻 ⛰ 🥨
    ▪ ❤ all things large scale data & AI
  3. Proprietary Vendor Solutions | SFTP | Cloud Object Store | Delta Sharing

    Secure: ✅ ✅ ✅ ✅
    Cheap: ✅ ✅ ✅
    Vendor agnostic: ✅ ✅
    Multi-cloud: ✅ ✅
    Open Source: ✅ ✅
    Table / Data Frame abstr.: ✅ ✅
    Live data: ✅ ✅
    Predicate Pushdown: ✅ ✅
    Object Store Bandwidth: ✅ ✅
    Zero compute cost: ✅ ✅
    Scalability: ✅ ✅
  4. The Open Approach To Sharing Fully open, without proprietary lock-in

    using any computing platforms Simple to share live data with other organizations Easily managed privacy, security, and compliance Additional flexibility and interoperability Less data movement and complexity Ability to unlock data with strong governance
  5. Under the hood

    DATA PROVIDER: Delta Lake (Parquet files in cloud storage), Delta Sharing Server
    DATA CONSUMER: Delta Sharing Client
    Activation link
    Request table
    Pre-signed short-lived URLs
    Temporary direct access to files (Parquet format) in the object store - AWS S3, GCP, ADLS …
  6. OSS: Run a Sharing Server

    https://github.com/delta-io/delta-sharing

    bin/delta-sharing-server -- --config server-config.yaml

    OR

    docker run -p <host-port>:<container-port> \
      … deltaio/delta-sharing-server:0.6.4 -- --config \
      /config/server-config.yaml
  7. Databricks: Sharing Data from SQL

    CREATE SHARE loan;
    ALTER SHARE loan ADD TABLE demo.lending.txs;
    CREATE RECIPIENT l_recipient;
    GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;
  8. Pandas Client

    !pip install delta-sharing

    import delta_sharing

    # profile_f is the path to the profile file received from the data provider
    client = delta_sharing.SharingClient(profile_f)
    table = profile_f + "#share.schema.table"
    data = delta_sharing.load_as_pandas(table)
  9. Streaming Support: Spark Structured Streaming

    # client code
    df = (spark.readStream
          .format("deltasharing")
          .option("readChangeFeed", "true")
          .option("startingTimestamp", "2021-04-21 05:45:46")
          .load("<profile>#<share>.<schema>.<table>")
    )
  10. Delta Sharing Ecosystem

    3rd Party Data Vendors/Clean Room
    Open Source Clients
    Business Intelligence/Analytics
    Governance
    SaaS/Multi-Cloud Infrastructure
    Hyperscalers
    Carto (NEW)
  11. Adoption of Delta Sharing protocol takes aim at Snowflake

    Oracle's adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-ins for data sharing and sort out issues such as security, version control and access management of data sets. “With this open approach, customers can now securely share data with anyone using any application or service that supports the protocol,” the company said in a statement. Oracle’s decision to adopt the protocol could be primarily due to its popularity and to counter Snowflake’s product offerings, analysts said.
  12. Databricks Marketplace

    Databricks Marketplace provides an open marketplace for data, analytics, and AI
    Open for Databricks & non-Databricks users
    Data sets, Notebooks, ML models and applications from top data & solution providers
    Public marketplace, private exchanges
    Dashboards, ML Models, Data Files, Data Tables, Solution Accelerators, Notebooks
  13. Databricks Clean Rooms

    Secure environments to run computations on joint data
    Collaborator 1 to Collaborator N: existing tables, each connected through Delta Sharing
    Mutually approved jobs on Databricks trusted compute
    Scalable: Scale to multiple collaborators and any data size
    Interoperable: Any data source with no replication
    Flexible: Your language and workload of choice
  14. Conclusion

    Delta Sharing
    • Platform-independent, multi-cloud, OSS for sharing massive amounts of live and streaming data.
    • Built into Databricks Accounts, Marketplace, Clean Rooms
    • Clients can be:
      ◦ OSS: pandas, Apache Spark
      ◦ Enterprise BI: Tableau, Power BI
    • Server
      ◦ Pre-built reference implementation
      ◦ OSS binary
      ◦ OSS Docker container
  15. ©2022 Databricks Inc. — All rights reserved Technical Questions? Sign-up

    for the Databricks Community! Ask your technical questions here: https://community.databricks.com/
  16. New Databricks Demo Center

    databricks.com/demos
    Delta Sharing demo on Databricks Demo Center
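
The streaming read from slide 9 can be wrapped in a small helper so the table address and starting timestamp become parameters. This is a sketch under assumptions: `read_shared_change_feed` is a hypothetical helper name, it expects a Spark session with the Delta Sharing Spark connector available, and the provider must have shared the table with change data feed enabled. `RecordingReader` and `FakeSession` are stand-ins invented here so the call chain can be checked without a cluster; they are not part of Spark.

```python
def read_shared_change_feed(spark, table_url, starting_timestamp):
    """Open a streaming read of a shared table's change feed (same options as slide 9)."""
    return (
        spark.readStream
        .format("deltasharing")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", starting_timestamp)
        .load(table_url)
    )


class RecordingReader:
    """Stand-in for spark.readStream that only records the calls made on it."""

    def __init__(self):
        self.source_format = None
        self.options = {}
        self.path = None

    def format(self, name):
        self.source_format = name
        return self

    def option(self, key, value):
        self.options[key] = value
        return self

    def load(self, path):
        self.path = path
        return self


class FakeSession:
    """Stand-in for a SparkSession exposing only the readStream attribute."""

    def __init__(self):
        self.readStream = RecordingReader()


# Dry run against the stand-in; with a real SparkSession this returns a streaming DataFrame.
plan = read_shared_change_feed(
    FakeSession(), "<profile>#<share>.<schema>.<table>", "2021-04-21 05:45:46"
)
print(plan.source_format, plan.options, plan.path)
```

With a real session, the returned DataFrame would still need a `writeStream` sink and a checkpoint location before any data flows.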