Sharing Streaming Data with Delta Sharing - Kafka Conference

Frank Munz
September 28, 2023

This lightning talk is an introduction to Delta Sharing, a Linux Foundation open source solution for sharing massive amounts of data in a cheap, secure, scalable and *streaming* way.

Homegrown data-sharing solutions based on SFTP or APIs aren’t scalable and saddle you with operational overhead. Off-the-shelf data-sharing solutions work only on specific sharing networks, promote vendor lock-in, and can be costly. Others don't support streaming data.

Delta Sharing reliably accesses data at the bandwidth of modern cloud object stores, such as S3, ADLS, or GCS.

Any client supporting pandas, Apache Spark™, or Python, as well as commercial clients such as Power BI, can connect to the sharing server. Clients always read the latest version of the data, which can also be partitioned to limit the amount of data transferred. Databricks Marketplace and Databricks Clean Rooms use Delta Sharing, and Oracle, Dell, Cloudflare, Twilio, and many others have adopted the technology.

Learn what you need to know about data sharing in 2023 in this lightning talk.

Transcript

  1. From Zero to Hero
    Sharing Streaming Data with Open Source Delta Sharing
    Frank Munz, Principal TMM, Databricks
    @frankmunz

  2. About me
    ▪ Principal TMM @ Databricks
    ▪ Based in Munich, 🍻 ⛰ 🥨
    ▪ ❤ all things large scale data & AI

  3. What’s the problem with Data Sharing?
    ©2021 Databricks Inc. — All rights reserved

  4. Proprietary Vendor Solutions | SFTP | Cloud Object Store | Delta Sharing
    Secure ✅ ✅ ✅ ✅
    Cheap ✅ ✅ ✅
    Vendor agnostic ✅ ✅
    Multi-cloud ✅ ✅
    Open Source ✅ ✅
    Table / Data Frame abstr. ✅ ✅
    Live data ✅ ✅
    Predicate Pushdown ✅ ✅
    Object Store Bandwidth ✅ ✅
    Zero compute cost ✅ ✅
    Scalability ✅ ✅

  5. How does Delta Sharing Help?

  6. The Open Approach To Sharing
    ▪ Fully open, without proprietary lock-in, using any computing platform
    ▪ Simple to share live data with other organizations
    ▪ Easily managed privacy, security, and compliance
    ▪ Additional flexibility and interoperability
    ▪ Less data movement and complexity
    ▪ Ability to unlock data with strong governance

  7. Under the hood
    DATA PROVIDER: Delta Lake (Parquet files in cloud storage), Delta Sharing Server
    DATA CONSUMER: Delta Sharing Client
    ▪ Activation link
    ▪ Request table
    ▪ Pre-signed short-lived URLs
    ▪ Temporary direct access to files (Parquet format) in the object store: AWS S3, GCS, ADLS
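
The idea behind a pre-signed, short-lived URL can be sketched in a few lines of Python. This is only an illustration of the concept, not how Delta Sharing or any cloud object store implements it: the secret key, the `objectstore.example.com` host, and the `presign` / `is_valid` helpers are all hypothetical. In practice the sharing server asks the object store for such URLs and the client simply downloads the Parquet files through them.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret held by the object store (assumption for this sketch).
SECRET_KEY = b"demo-secret"


def presign(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a URL granting read access to `path` until the expiry time."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_seconds)
    message = f"{path}:{expires}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "sig": signature})
    return f"https://objectstore.example.com{path}?{query}"


def is_valid(url: str, now: Optional[float] = None) -> bool:
    """Check the signature and the expiry, as the object store would."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    message = f"{parsed.path}:{expires}".encode()
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    return hmac.compare_digest(signature, expected) and current < expires


# The "server" hands out a URL that is valid for 60 seconds.
url = presign("/tables/txs/part-0000.parquet", ttl_seconds=60)
print(url, is_valid(url))
```

Because the signature covers both the file path and the expiry time, a consumer cannot extend the lifetime of a URL or reuse it for a different file.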

  8. OSS: Run a Sharing Server
    https://github.com/delta-io/delta-sharing
    bin/delta-sharing-server -- --config server-config.yaml
    OR
    docker run -p <host-port>:<container-port> \
    deltaio/delta-sharing-server:0.6.4 -- --config \
    /config/server-config.yaml

  9. Databricks: Sharing Data from SQL
    CREATE SHARE loan;
    ALTER SHARE loan ADD TABLE demo.lending.txs;
    CREATE RECIPIENT l_recipient;
    GRANT SELECT ON SHARE loan TO RECIPIENT l_recipient;

  10. Databricks UI: Create share
    (1) create share
    (2) add table

  11. Pandas Client
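    !pip install delta-sharing
    import delta_sharing
    # profile_f: path to the profile file received from the provider (placeholder path)
    profile_f = "config.share"
    client = delta_sharing.SharingClient(profile_f)
    # Table URL format: <profile-file>#<share>.<schema>.<table>
    table = profile_f + "#share.schema.table"
    data = delta_sharing.load_as_pandas(table)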
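
The profile file used by the client is a small JSON document holding the server endpoint and a bearer token. The following is a minimal sketch, using only the standard library, of what such a file looks like and how the table URL is assembled from it. The endpoint, the token, the `config.share` file name, and the helper names are placeholders, and the share/schema/table names (`loan`, `lending`, `txs`) are assumed from the SQL slide rather than confirmed by the talk.

```python
import json
import tempfile
from pathlib import Path


def write_profile(directory, endpoint, bearer_token):
    """Write a Delta Sharing profile file and return its path."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    path = Path(directory) / "config.share"
    path.write_text(json.dumps(profile, indent=2))
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile-file>#<share>.<schema>.<table>' URL the client expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder endpoint and token; a real provider sends these via the activation link.
    profile_path = write_profile(tmp, "https://sharing.example.com/delta-sharing/", "<token>")
    url = table_url(profile_path, "loan", "lending", "txs")
    print(profile_path.read_text())
    print(url)
```

A URL built this way is what `delta_sharing.load_as_pandas` takes as its argument.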

  12. Streaming Support: Spark Structured Streaming
    # client code
    df = (spark.readStream
    .format("deltasharing")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2021-04-21 05:45:46")
    .load("#..")
    )
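
To do something with the stream, the reader above still needs a sink. The sketch below wraps the slide's reader in a function and adds a console sink for inspection. It assumes a running Spark session with the Delta Sharing Spark connector available and a shared table that has change data feed enabled. The function names and the choice of the console sink are illustrative, not taken from the talk.

```python
def read_shared_changes(spark, table_url, starting_timestamp):
    """Open a streaming read over the change feed of a shared table."""
    return (
        spark.readStream
        .format("deltaSharing")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", starting_timestamp)
        .load(table_url)
    )


def print_changes(changes):
    """Start a query that prints each micro-batch of changes to the console."""
    return (
        changes.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )


# Usage, given an existing SparkSession `spark` and a table URL of the form
# "<profile-file>#<share>.<schema>.<table>":
#
#   changes = read_shared_changes(spark, table_url, "2021-04-21 05:45:46")
#   query = print_changes(changes)
#   query.awaitTermination()
```

In a real pipeline the console sink would typically be replaced by a durable sink with a checkpoint location, so the query can resume where it left off.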

  13. Demo
    Delta Sharing

  14. https://github.com/fmunz/bigdata-intro/blob/main/DeltaSharing_DatabricksReference.ipynb

  15. Why Delta Sharing rocks

  16. Delta Sharing Ecosystem
    ▪ 3rd Party Data Vendors/Clean Room
    ▪ Open Source Clients
    ▪ Business Intelligence/Analytics
    ▪ Governance
    ▪ SaaS/Multi-Cloud Infrastructure
    ▪ Hyperscalers
    NEW: Carto

  17. Adoption of Delta Sharing protocol takes aim at Snowflake
    Oracle's adoption of Databricks’ Delta Sharing protocol is a major part of the updates to its Autonomous Data
    Warehouse. The protocol was adopted, according to Oracle's Wheeler, to avoid vendor lock-ins for data sharing
    and sort out issues such as security, version control and access management of data sets.
    “With this open approach, customers can now securely share data with anyone using any application or service
    that supports the protocol,” the company said in a statement.
    Oracle’s decision to adopt the protocol could be primarily due to its popularity and to
    counter Snowflake’s product offerings, analysts said.

  18. Databricks Marketplace provides an open marketplace for data, analytics, and AI
    ▪ Open for Databricks & non-Databricks users
    ▪ Data sets, Notebooks, ML models and applications from top data & solution providers
    ▪ Public marketplace, private exchanges
    Databricks Marketplace: Dashboards, ML Models, Data Files, Data Tables, Solution Accelerators, Notebooks

  19. Databricks Clean Rooms
    Secure environments to run computations on joint data
    ▪ Scalable: Scale to multiple collaborators and any data size
    ▪ Interoperable: Any data source with no replication
    ▪ Flexible: Your language and workload of choice
    Collaborator 1 ... Collaborator N: Existing tables, connected via Delta Sharing
    Mutually approved jobs on Databricks trusted compute

  20. Conclusion Delta Sharing
    ● Platform-independent, multi-cloud, OSS for
    sharing massive amounts of live and streaming data.
    ● Built into Databricks Accounts, Marketplace, Clean Rooms
    ● Clients can be:
    ○ OSS: pandas, Apache Spark
    ○ Enterprise BI: Tableau, Power BI
    ● Server
    ○ Pre-built reference implementation
    ○ OSS binary
    ○ OSS Docker container

  21. Technical Questions?
    Sign up for the Databricks Community!
    Ask your technical questions here: https://community.databricks.com/

  22. New Databricks Demo Center
    databricks.com/demos
    Delta Sharing demo on Databricks Demo Center
