Slide 1

Slide 1 text

PFHUB REIMPLEMENTATION FOR FAIR DATA COLLECTION
DANIEL WHEELER
2022-05-11

Slide 2

Slide 2 text

OVERVIEW
- Using GitHub workflow (extreme FAIRness)
- Examples from other communities
  - Nixpkgs
  - Conda-Forge
  - NIST Code Portal
- Schema Tools
  - The CodeMeta Project
  - ASDF - Advanced Scientific Data Format
  - Boutiques
- Python-PFHub (FAIRer than JS)
- PFHub example submission using issue templates

Slide 3

Slide 3 text

NIXPKGS
- Nixpkgs is a large community
  - 100k packages
  - 5k issues
  - 3.1k PRs
  - 4k contributors
- Completely GitHub based
  - Single repository
  - 100s of CI workflows
- Takes submissions from 100s of users every day that require human interaction

Slide 4

Slide 4 text

NIXPKGS
- Example submission on Nixpkgs
- Checks to ensure compliance
- Automated assignment

Slide 5

Slide 5 text

NIXPKGS
- Example submission on Nixpkgs
- Checks to ensure compliance
- Automated assignment
- Human interaction

Slide 6

Slide 6 text

NIXPKGS
- Example submission on Nixpkgs
- Checks to ensure compliance
- Automated assignment
- Human interaction
- Human CIs
- Automated CIs

Slide 7

Slide 7 text

ASIDE: USING NIX FOR DATA WORKFLOW
- Nix used to orchestrate data workflows
- Not just build workflows
- Functional storage
- Completely reproducible

Slide 8

Slide 8 text

USING NIX IN PFHUB
- PFHub uses Nix for all builds
- Python-pfhub will also have Pip and Conda builds
- Uses Cachix to cache builds
- All CI builds with Nix

Slide 9

Slide 9 text

CONDA-FORGE
- Different model to Nixpkgs
- 16k repositories!!!
- Users merge (not admins)
- But all GitHub based

Slide 10

Slide 10 text

NIST CODE PORTAL
- Akin to PFHub
- Jekyll frontend
- CMS-free
- Builds NIST's code.json daily
- All GitHub Actions

Slide 11

Slide 11 text

SCHEMA: CODEMETA PROJECT
- Simple standard for code metadata (from science community)
- Includes: 6 basic categories of data (software, discoverability, development, run-time, versions, other)
- Plan to use this with PFHub
- Metadata builder tools include web, CLI, Python
- https://codemeta.github.io/codemeta-generator/
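
As a rough illustration of what a CodeMeta record looks like, the sketch below builds a minimal codemeta.json document with the Python standard library. The helper name build_codemeta and every metadata value (project name, repository URL, author) are hypothetical placeholders, not PFHub's actual record; only the JSON-LD context and the SoftwareSourceCode type come from CodeMeta 2.0.

```python
import json


def build_codemeta(name, repository, language, license_url, authors):
    """Return a minimal CodeMeta 2.0 record as a plain dictionary.

    Hypothetical helper for illustration; real projects usually start
    from the CodeMeta generator and fill in many more fields.
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repository,
        "programmingLanguage": language,
        "license": license_url,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# Placeholder values only (assumptions, not taken from the slides).
record = build_codemeta(
    name="example-phase-field-code",
    repository="https://example.org/example-phase-field-code",
    language="Python",
    license_url="https://spdx.org/licenses/MIT",
    authors=[("Jane", "Doe")],
)

# Serialize the record as it would appear in a codemeta.json file.
codemeta_text = json.dumps(record, indent=2)
```

Keeping the record as a plain dictionary makes it easy to merge with other submission metadata before it is written out.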

Slide 12

Slide 12 text

SCHEMA: ASDF
- Settled on YAML in 2015 for astronomical data (many other choices)
- ASCII and binary data in same file
- Include simple editable data files
- Supports compressed NumPy arrays
- Number of readers available
- No standard data model
- See "ASDF: A new data format for astronomy", Greenfield et al.

Slide 13

Slide 13 text

SCHEMA: BOUTIQUES
- Not a workflow language
- Formal command line description
- Specify inputs and outputs
- Boutiques output for "echo" command
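
To make the idea of a formal command line description concrete, here is a hand-written sketch of a descriptor for "echo" in the style of the Boutiques schema, together with a small function that fills in the command line template. The descriptor values are illustrative and are not the output shown on the slide; render_command is a hypothetical helper written for this example and is not part of the Boutiques tooling. The "echo" command writes no files, so the sketch declares inputs only.

```python
import shlex

# Illustrative descriptor in the style of the Boutiques schema.
# Each input maps a "value-key" placeholder in the command-line template.
echo_descriptor = {
    "name": "echo",
    "description": "Print a string to standard output",
    "tool-version": "1.0",
    "schema-version": "0.5",
    "command-line": "echo [TEXT]",
    "inputs": [
        {
            "id": "text",
            "name": "Text to print",
            "type": "String",
            "value-key": "[TEXT]",
            "optional": False,
        }
    ],
}


def render_command(descriptor, values):
    """Substitute input values into the descriptor's command-line template.

    Hypothetical helper: required inputs must be present, optional inputs
    that are missing are replaced by an empty string.
    """
    command = descriptor["command-line"]
    for spec in descriptor["inputs"]:
        key = spec["id"]
        if key in values:
            # Quote the value so spaces survive shell parsing.
            replacement = shlex.quote(str(values[key]))
        elif spec.get("optional", False):
            replacement = ""
        else:
            raise ValueError(f"missing required input: {key}")
        command = command.replace(spec["value-key"], replacement)
    return command.strip()


command = render_command(echo_descriptor, {"text": "hello world"})
```

Because the descriptor is plain data, the same description can drive validation, documentation, and execution without the tool itself being a workflow language.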

Slide 14

Slide 14 text

PYTHON-PFHUB
- Python package that deals with all data transformations and aggregations
- All data exported as Pandas dataframes
- Easy for others to augment, develop, change
- Everything in Python
- Plotly has improved Python support
- Everything working outside of website setting

Slide 15

Slide 15 text

PYTHON-PFHUB

Slide 16

Slide 16 text

PYTHON-PFHUB

Slide 17

Slide 17 text

PFHUB SUBMISSIONS
1. Started with simple YAML file
   - Fill out YAML file by hand
   - Submit pull-request
   - CIs + human checks in pull-request
2. Next iteration included an upload form
   - Fill out sophisticated form
   - Submit and Staticman app submits pull-request
3. Currently working on using GitHub issue template
   - Issue template is a simple form (not sophisticated)
   - On submission launches GitHub Action (parses form and submits pull-request)
4. CLI tool?
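
The form-parsing step in item 3 can be sketched as follows. GitHub renders a submitted issue form as a Markdown body in which each field is a "### Label" heading followed by its value, and an empty optional field appears as "_No response_". The function name parse_issue_form and the field labels in the sample body are hypothetical; this is a minimal sketch of the idea, not the code run by the PFHub GitHub Action.

```python
import re


def parse_issue_form(body):
    """Parse the Markdown body of a GitHub issue form into a dictionary.

    Keys are the field labels, values are the stripped field contents.
    Fields left empty by the submitter are mapped to None.
    """
    fields = {}
    # Split on "### " at the start of a line; the first chunk is any
    # text before the first heading and is ignored.
    sections = re.split(r"^### +", body, flags=re.MULTILINE)
    for section in sections[1:]:
        label, _, value = section.partition("\n")
        value = value.strip()
        fields[label.strip()] = None if value == "_No response_" else value
    return fields


# Sample body with made-up field labels (assumptions for illustration).
sample_body = (
    "### Benchmark ID\n\n1a.1\n\n"
    "### Contributor\n\nJane Doe\n\n"
    "### Notes\n\n_No response_\n"
)

submission = parse_issue_form(sample_body)
```

The resulting dictionary could then be written to a YAML or JSON file and committed on a new branch, which is the point at which the automated pull-request would be opened.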

Slide 18

Slide 18 text

DEMO
https://github.com/usnistgov/pfhub/issues/new/choose

Slide 19

Slide 19 text

DISCUSSION
- Upload mechanism (CLI?, GitHub issue templates?)
- Schema
- File type (ASDF?)
