Tom Nicholas
February 12, 2025

FROST: Federated Registry of Scientific Things (@ Pangeo Showcase)

Talk slides from Pangeo Showcase February 12th 2025
https://discourse.pangeo.io/t/pangeo-showcase-frost-federated-registry-of-scientific-things-feb-12-2025/4861

Blog post mentioned: https://hackmd.io/@TomNicholas/H1KzoYrPJe

Abstract:

Context:
The easiest way to store and provide access to big scientific datasets is via ARCO data in S3-compatible cloud object storage. We now have scalable cloud-optimised formats that are version-controlled at rest in object storage (particularly Icechunk for arrays and Iceberg for tables). This is huge, as even dynamically-updated datasets can now be distributed via raw S3, with no other server needed. All the data providers who are paying attention are about to put their data in these formats, but then they will try to advertise the S3 URLs to the world via ad-hoc data catalogs.

Problem: Everyone’s catalogs are disconnected from everyone else’s.

This means:
- No cross-org discoverability (e.g. NASA catalog users won’t see NOAA datasets or vice versa).
- No cross-org tracking of updates (e.g. NOAA datasets derived directly from NASA datasets won’t automatically know if the NASA datasets have been updated upstream).
- Risk of “catalog wars” where platform services compete to make more and more comprehensive “meta-catalogs” which merely track (outdated) links to other orgs’ data.
- Risk that if one platform does win, everyone might feel locked in to it via the social network effect.

Solution: Federated catalog protocol with cross-org publish-subscribe model.
- Cross-org discoverability enabled via displaying the contents of the dataset entries being broadcast,
- Cross-org tracking of updates to datasets enabled the same way,
- No need to compete to make a better catalog, as anyone can easily consume and display the entire global catalog, including updates,
- Federated trust model allows proliferation of high-quality centralized services, whilst also guarding against platform lock-in.
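
A minimal sketch of the publish-subscribe idea, assuming a toy in-memory registry rather than any real protocol. The class names (`DatasetUpdate`, `Registry`), the field names, and the `s3://example-...` URLs are all hypothetical and chosen only for illustration. It shows how one broadcast stream could serve both discoverability (anyone can search what has been published) and update tracking (a downstream org subscribes to the upstream dataset it depends on).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DatasetUpdate:
    """One broadcast message: a dataset now points at a new commit."""
    publisher: str                      # hypothetical org identifier, e.g. "nasa"
    dataset: str                        # hypothetical dataset name
    url: str                            # S3-style storage URL
    commit: str                         # version identifier of the new state
    depends_on: tuple[str, ...] = ()    # names of upstream datasets


class Registry:
    """Toy in-memory stand-in for a federated pub-sub catalog."""

    def __init__(self) -> None:
        self.latest: dict[str, DatasetUpdate] = {}
        self._subscribers: dict[str, list[Callable[[DatasetUpdate], None]]] = defaultdict(list)

    def subscribe(self, dataset: str, callback: Callable[[DatasetUpdate], None]) -> None:
        """Register a callback to run whenever `dataset` is republished."""
        self._subscribers[dataset].append(callback)

    def publish(self, update: DatasetUpdate) -> None:
        """Record the newest state of a dataset and notify its subscribers."""
        self.latest[update.dataset] = update
        for callback in self._subscribers[update.dataset]:
            callback(update)

    def search(self, text: str) -> list[str]:
        """Naive substring search over every dataset name seen so far."""
        return sorted(name for name in self.latest if text in name)


registry = Registry()
registry.publish(DatasetUpdate("nasa", "nasa/sst", "s3://example-nasa/sst", "c1"))
registry.publish(
    DatasetUpdate("noaa", "noaa/sst-anomaly", "s3://example-noaa/anom", "a1",
                  depends_on=("nasa/sst",))
)

# The downstream org asks to hear about changes to its upstream dependency.
stale = []
registry.subscribe("nasa/sst", lambda u: stale.append(("noaa/sst-anomaly", u.commit)))
registry.publish(DatasetUpdate("nasa", "nasa/sst", "s3://example-nasa/sst", "c2"))

print(registry.search("sst"))   # ['nasa/sst', 'noaa/sst-anomaly']
print(stale)                    # [('noaa/sst-anomaly', 'c2')]
```

A real federated design would replace the single `Registry` object with many cooperating servers; the sketch only illustrates the message flow, not the trust model.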

How do we build it?
Not sure exactly, but the problem is analogous to creating federated alternatives to centralized social media (i.e. Bluesky/Mastodon vs Twitter). Perhaps we can piggyback on Bluesky’s ATproto or Mastodon’s ActivityPub?
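
To make the ATproto option a little more concrete, here is a rough sketch of what a catalog entry record and a consumer-side filter might look like. AT Protocol records are JSON objects whose `$type` field names their lexicon; the lexicon names `com.frost.dataset.entry` and `com.frost.dataset.update` come from the design-idea slide in the transcript. Everything else is an assumption: the record fields (`name`, `storageUrl`, `commitId`), the `did:example:...` identifiers, and the simplified event dictionaries, which stand in for real firehose messages and do not reproduce their wire format.

```python
from datetime import datetime, timezone

ENTRY_NSID = "com.frost.dataset.entry"     # lexicon name from the talk
UPDATE_NSID = "com.frost.dataset.update"   # lexicon name from the talk
FROST_PREFIX = "com.frost.dataset."


def make_entry_record(name: str, storage_url: str, commit_id: str) -> dict:
    """Build a record body; all fields except `$type` are hypothetical."""
    return {
        "$type": ENTRY_NSID,
        "name": name,
        "storageUrl": storage_url,
        "commitId": commit_id,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }


def frost_events(events):
    """Keep only events whose collection is in the FROST namespace."""
    for event in events:
        if event["collection"].startswith(FROST_PREFIX):
            yield event


# Simplified stand-ins for a stream of repository events.
stream = [
    {"repo": "did:example:alice", "collection": "app.bsky.feed.post",
     "record": {"$type": "app.bsky.feed.post", "text": "hello"}},
    {"repo": "did:example:nasa", "collection": ENTRY_NSID,
     "record": make_entry_record("sst", "s3://example-nasa/sst", "c1")},
]

for event in frost_events(stream):
    print(event["repo"], event["record"]["name"])   # did:example:nasa sst
```

The filter also illustrates the concern raised in the slides: a consumer that wants the whole catalog has to read the whole stream and discard almost all of it.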

Transcript

  1. What I will not talk about:
    - My blog post (https://hackmd.io/@TomNicholas/H1KzoYrPJe)
    - Instead I’m going to work in the opposite direction
    - From the stack today to the dream
  2. What I will talk about:
    - Version-controlled cloud-native data today
    - Connecting the catalogs
    - Use cases
    - Missing link: FROST
    - Design ideas
  3. Problem: Disconnected catalogs
    - No cross-org discoverability
    - No cross-org update tracking
    - Risk of “catalog wars”
    - Risk of network lock-in
    Solution: Connected catalogs
    - Cross-org discoverability
    - Cross-org update tracking
    - Search entire registry
    - Free to move while bringing dependencies
  4. Common platform architecture
    Diagram labels: Federated registry (FROST); Platform / search service (x3); Private data; Sensitive Data; Public data; Authentication portal (x2); Search; Pub-sub catalog; Access; Storage; User communities
  5. Use cases:
    - Public data catalog (e.g. NASA)
    Diagram labels: Federated registry (FROST); NASA-specific search service; NASA Public Data
  6. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    Diagram labels: Federated registry (FROST); Global search service; Public Data
  7. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    - Data marketplace (e.g. Source Cooperative)
    Diagram labels: Federated registry (FROST); Marketplace platform; For-sale data; Pay-for-access layer
  8. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    - Data marketplace (e.g. Source Cooperative)
    - Federated data analysis (e.g. lifebit)
    Diagram labels: Federated registry (FROST); Bioscience search service; Sensitive Data; HIPAA-compliant authentication
  9. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    - Data marketplace (e.g. Source Cooperative)
    - Federated data analysis (e.g. lifebit)
    - Data lake (e.g. Earthmover’s Arraylake)
    Diagram labels: Federated registry (FROST); Platform service; Commercial Data Lake; Customer-specific authentication
  10. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    - Data marketplace (e.g. Source Cooperative)
    - Federated data analysis (e.g. lifebit)
    - Data lake (e.g. Earthmover’s Arraylake)
    - Backup trawler (e.g. Wayback Machine)
    Diagram labels: Federated registry (FROST); Archiving service; Archived public data
  11. Use cases:
    - Public catalog (e.g. NASA)
    - Aggregated search engine (e.g. Google dataset search)
    - Data marketplace (e.g. Source Cooperative)
    - Federated data analysis (e.g. lifebit)
    - Data lake (e.g. Earthmover’s Arraylake)
    - Backup trawler (e.g. Wayback Machine)
    - Real-time operations (e.g. WatchDuty)
    Diagram labels: Federated registry (FROST); Alert service; Dynamically-updated public data
  12. Missing link: Decentralized pub-sub catalog network
    Requirements:
    - Defines data catalog entries, updates, and dependencies
    - Handles any version-controlled data at an S3-like URL
    - Fast, reliable update notifications
    - Federated trust model
    - Global search
  13. How could we build it?
    Design inspirations:
    - Pub-sub protocols (e.g. RSS)
    - Federated social networks (e.g. Bluesky, Mastodon)
    - Database replication (e.g. Raft algorithm)
    - Software package registries (e.g. npm)
  14. Design idea 2: Custom AT protocol lexicon
    - Use Bluesky’s AT protocol
    - Define a custom lexicon
      - com.frost.dataset.entry
      - com.frost.dataset.update
      - com.frost.dataset.dependency
    - Publish from your PDS
    - Consume via firehose
    - Display via feedgen
    - BUT consuming whole firehose forever is basically sync’ing a DB…
  15. Action: FROST working group??
    - Do we need a standard / protocol?
    - Does this already exist?
    - Would have to be a multi-organisation effort
    Please indicate interest in participation / following updates here: https://tinyurl.com/r76v43u6
  16. FAQ: Which storage formats?
    - Icechunk / Iceberg have nice properties:
      - version-controlled at rest, with a uniquely identifiable address/hash for each commit (e.g. Icechunk, git itself)
      - some idea of a diff, potentially one that's small enough to be sent over the network (e.g. git diff, Icechunk's ChangeSet)
      - bytes can be pulled out via HTTP range requests to a storage URL
  17. FAQ: But surely we should verify / police entries?
    - No. GitHub doesn’t, so why should this?
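
Returning to the storage-format FAQ (slide 16), two of the listed properties can be illustrated with a short sketch: an identifier derived from content, and pulling bytes out with an HTTP range request. The function names here are hypothetical, the SHA-256 digest is only a stand-in for whatever commit-addressing scheme a given format actually uses, and `apply_range` emulates a server locally so that no network access is needed.

```python
import hashlib


def content_id(payload: bytes) -> str:
    """Stand-in for a commit hash: an identifier derived from the bytes themselves."""
    return hashlib.sha256(payload).hexdigest()


def range_header(offset: int, length: int) -> dict[str, str]:
    """HTTP Range header for `length` bytes starting at `offset`.

    The end position in a byte range is inclusive, hence the `- 1`.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def apply_range(blob: bytes, header: dict[str, str]) -> bytes:
    """Emulate what a server would return for a single `bytes=start-end` range."""
    start, end = header["Range"].removeprefix("bytes=").split("-")
    return blob[int(start): int(end) + 1]


blob = bytes(range(256))                    # pretend this is an object in storage
header = range_header(10, 4)
print(header)                               # {'Range': 'bytes=10-13'}
print(list(apply_range(blob, header)))      # [10, 11, 12, 13]

# Identical bytes always map to the same identifier; changed bytes do not.
print(content_id(blob) == content_id(bytes(range(256))))   # True
print(content_id(blob) == content_id(blob + b"\x00"))      # False
```

With a real object store, the same header would be sent in a GET request to the storage URL, and only the requested bytes would travel over the network.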