PFHUB REIMPLEMENTATION FOR FAIR DATA COLLECTION

Daniel Wheeler

July 21, 2022

Transcript

  1. PFHUB REIMPLEMENTATION FOR FAIR DATA COLLECTION
     DANIEL WHEELER
     2022-05-11

  2. OVERVIEW
     - Using GitHub workflow (extreme FAIRness)
     - Examples from other communities
       - Nixpkgs
       - Conda-Forge
       - NIST Code Portal
     - Schema Tools
       - The CodeMeta Project
       - ASDF - Advanced Scientific Data Format
       - Boutiques
     - Python-PFHub (FAIRer than JS)
     - PFHub example submission using issue templates
  3. NIXPKGS
     - Nixpkgs is a large community
       - 100k packages
       - 5k issues
       - 3.1k PRs
       - 4k contributors
     - Completely GitHub based
     - Single repository
     - 100s of CI workflows
     - Takes submissions from 100s of users every day that require human interaction
  4. NIXPKGS
     - Example submission on Nixpkgs
     - Checks to ensure compliance
     - Automated assignment
  5. NIXPKGS
     - Example submission on Nixpkgs
     - Checks to ensure compliance
     - Automated assignment
     - Human interaction
  6. NIXPKGS
     - Example submission on Nixpkgs
     - Checks to ensure compliance
     - Automated assignment
     - Human interaction
     - Human CIs
     - Automated CIs
  7. ASIDE: USING NIX FOR DATA WORKFLOW
     - Nix used to orchestrate data workflows
     - Not just build workflows
     - Functional storage
     - Completely reproducible
  8. USING NIX IN PFHUB
     - PFHub uses Nix for all builds
     - Python-pfhub will also have Pip and Conda builds
     - Uses Cachix to cache builds
     - All CI builds with Nix
  9. CONDA-FORGE
     - Different model to Nixpkgs
     - 16k repositories!!!
     - Users merge (not admins) but all GitHub based
  10. NIST CODE PORTAL
      - Akin to PFHub
      - Jekyll frontend
      - CMS-free builds NIST's code.json daily
      - All GitHub Actions
  11. SCHEMA: CODEMETA PROJECT
      - Simple standard for code metadata (from science community)
      - Includes: 6 basic categories of data (software, discoverability, development, run-time, versions, other)
      - Plan to use this with PFHub
      - Metadata builder tools include web, cli, python
      - https://codemeta.github.io/codemeta-generator/
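As a rough illustration of what a CodeMeta record looks like, the sketch below builds a minimal `codemeta.json` document with the Python standard library. The keys are CodeMeta 2.0 terms; the function name `build_codemeta` and every value passed to it are hypothetical placeholders, not PFHub's actual metadata.

```python
import json


def build_codemeta(name, version, repo_url, given_name, family_name):
    """Return a minimal CodeMeta 2.0 record as a plain dict.

    Only a handful of CodeMeta terms are shown; real records usually
    carry many more (license, description, dependencies, and so on).
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "codeRepository": repo_url,
        "programmingLanguage": "Python",
        "author": [
            {
                "@type": "Person",
                "givenName": given_name,
                "familyName": family_name,
            }
        ],
    }


# Hypothetical values, for illustration only.
record = build_codemeta(
    name="example-phase-field-code",
    version="0.1.0",
    repo_url="https://example.org/example-phase-field-code.git",
    given_name="Ada",
    family_name="Example",
)

# Serialize to the text that would be saved as codemeta.json.
codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Because the record is plain JSON-LD, it can be written by hand, by the web generator linked on the slide, or by a script like this one.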
  12. SCHEMA: ASDF
      - Settled on YAML in 2015 for astronomical data (many other choices)
      - ASCII and binary data in same file
      - Includes simple editable data files
      - Supports compressed NumPy arrays
      - Number of readers available
      - No standard data model
      - See "ASDF: A new data format for astronomy", Greenfield et al.
  13. SCHEMA: BOUTIQUES
      - Not a workflow language
      - Formal command line description
      - Specify inputs and outputs
      - Boutiques output for "echo" command
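The "echo" descriptor shown on the slide did not survive in the transcript. The sketch below is a stand-in, not a reconstruction: a small Boutiques-style descriptor for `echo` written as a Python dict, plus a simplified, hypothetical `render_command` helper that substitutes input values into the command-line template. The descriptor keys follow the Boutiques schema, but the helper is not part of the Boutiques tooling and skips validation entirely.

```python
import shlex

# A minimal Boutiques-style descriptor for the "echo" command.
# The values are illustrative assumptions.
echo_descriptor = {
    "name": "echo",
    "description": "Print a string to standard output",
    "tool-version": "1.0",
    "schema-version": "0.5",
    "command-line": "echo [PARAM]",
    "inputs": [
        {
            "id": "param",
            "name": "Text to print",
            "type": "String",
            "value-key": "[PARAM]",
        }
    ],
}


def render_command(descriptor, invocation):
    """Fill each input's value-key in the command-line template.

    Simplified on purpose: no type checking, no flags, no output files.
    Missing inputs are replaced with an empty string.
    """
    command = descriptor["command-line"]
    for spec in descriptor["inputs"]:
        value = invocation.get(spec["id"])
        replacement = "" if value is None else shlex.quote(str(value))
        command = command.replace(spec["value-key"], replacement)
    return command.strip()


command = render_command(echo_descriptor, {"param": "hello world"})
print(command)
```

The point of the format is that the descriptor, not the script, carries the knowledge of how inputs map onto the command line, so any tool can read it and build the same invocation.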
  14. PYTHON-PFHUB
      - Python package that deals with all data transformations and aggregations
      - All data exported as Pandas dataframes
      - Easy for others to augment, develop, change
      - Everything in Python
      - Plotly has improved Python support
      - Everything working outside of website setting
  15. PYTHON-PFHUB

  16. PYTHON-PFHUB

  17. PFHUB SUBMISSIONS
      1. Started with simple YAML file
         - Fill out YAML file by hand
         - Submit pull-request
         - CIs + human checks in pull-request
      2. Next iteration included an upload form
         - Fill out sophisticated form
         - Submit and Staticman app submits pull-request
      3. Currently working on using GitHub issue template
         - Issue template is a simple form (not sophisticated)
         - On submission launches GitHub Action (parses form and submits pull-request)
      4. CLI tool?
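The "parses form" step of the issue-template approach can be sketched as follows. GitHub issue forms render each answer in the issue body as a `### Label` heading followed by the response, with `_No response_` for empty optional fields. The function name `parse_issue_form` and the field labels in the sample body are hypothetical; the actual PFHub template and Action may differ.

```python
import re


def parse_issue_form(body):
    """Parse a GitHub issue-form body into a {label: value} dict.

    Each field appears as a level-3 heading followed by the answer.
    Empty optional fields ("_No response_") are mapped to None.
    """
    # re.split with a capture group keeps the heading labels in the result:
    # [text before first heading, label1, value1, label2, value2, ...]
    parts = re.split(r"^### +(.+?)[ \t]*$", body, flags=re.MULTILINE)
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        value = value.strip()
        fields[label.strip()] = None if value == "_No response_" else value
    return fields


# Hypothetical issue body, as a submission form might produce it.
sample_body = (
    "### Benchmark\n\n"
    "1a.1\n\n"
    "### Code repository\n\n"
    "https://example.org/sim.git\n\n"
    "### Notes\n\n"
    "_No response_\n"
)

fields = parse_issue_form(sample_body)
print(fields)
```

A workflow triggered on issue creation could pass the issue body to a parser like this, write the result to a metadata file, and open the pull-request from there.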
  18. DEMO https://github.com/usnistgov/pfhub/issues/new/choose

  19. DISCUSSION
      - Upload mechanism (CLI?, GitHub issue templates?)
      - Schema
      - File type (ASDF?)