Frictionless Data 101: How to document and validate datasets or create community standards with the Data Package standard

Talk at Living Data 2025 in Bogota, Colombia - October 24, 2025.

Abstract: https://livingdata2025.com/program.html?abstract=7007354

Peter Desmet

October 22, 2025

Transcript

  1. Frictionless Data 101: How to document and validate datasets or create community standards with the Data Package standard

    Peter Desmet, INBO, Belgium

    Photo by Peter Desmet
  2. How to describe data?

    - How do I describe
      - Column definitions?
      - Data types?
      - Constraints?
    - Where do I describe
      - In the CSV file?
      - In a README.txt?
    - Not ideal
      - My approach ≠ yours
      - Hard to find or (machine-)read
      - Naming things is hard

    A CSV file with bird+tag deployments:

        tag-id,animal-id,deploy-on-date,animal-mass,animal-sex
        586,H173481,2013-05-16 19:55:00.000,785.0,f
        610,L143451,2013-06-24 10:00:00.000,512.0,m
        623,L143457,2013-07-22 10:30:00.000,482.0,m
        630,L143467,2015-05-26 09:28:00.000,472.0,m
        6058,L143472,2016-05-02 12:58:00.000,571.0,m
        6059,L143473,2016-06-01 13:36:00.000,485.0,m
        630,H185298,2016-06-03 11:11:00.000,656.0,f

    Data from doi.org/10.5281/zenodo.10053583
  3. Use “Data Package”

    - Standard to describe datasets, data files and tabular data
    - Generic
    - Simple
    - Extensible
    - Open
    - Hosted by the Open Knowledge Foundation (OKFN)
    - Maintained by volunteers
    - Part of Frictionless Data project

    datapackage.org

    Image by OKFN
  4. Create a Data Package

    - Add a datapackage.json file
    - Serves as an access point
    - Contains descriptions of
      - Dataset (Data Package)
      - Files (Data Resource)
      - CSV properties (Table Dialect)
      - Tabular data (Table Schema)

    A dataset that is a Data Package:

        datapackage.json
        reference-data.csv
        gps-2013.csv.gz
        gps-2014.csv.gz
        gps-2015.csv.gz
        gps-2016.csv.gz
        gps-2017.csv.gz
        gps-2018.csv.gz
  5. Describe a dataset

    - resources: data files
    - id: dataset identifier
    - version: dataset version
    - licenses: dataset licence(s)
    - etc. (including your own properties)

    A Data Package:

        {
          "resources": [
            {
              "name": "reference-data",
              "path": "reference-data.csv"
            },
            {
              "name": "gps",
              "path": [
                "gps-2018.csv.gz",
                "gps-2019.csv.gz"
              ]
            }
          ],
          "id": "https://doi.org/10.5281/zenodo.6567022",
          "version": "v5",
          "licenses": [
            { "name": "CC0-1.0" }
          ]
        }

    Code by Peter Desmet

    datapackage.org/standard/data-package/
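As a minimal sketch of how such a descriptor could be produced programmatically, the snippet below builds the same properties as a Python dictionary and serializes them with the standard `json` module. The helper names `to_datapackage_json` and `resource_names` are hypothetical and not part of any Frictionless library; the values are copied from the slide.

```python
import json

# Descriptor values copied from the slide; the structure is a plain dict.
descriptor = {
    "resources": [
        {"name": "reference-data", "path": "reference-data.csv"},
        {"name": "gps", "path": ["gps-2018.csv.gz", "gps-2019.csv.gz"]},
    ],
    "id": "https://doi.org/10.5281/zenodo.6567022",
    "version": "v5",
    "licenses": [{"name": "CC0-1.0"}],
}


def to_datapackage_json(descriptor: dict) -> str:
    """Serialize a descriptor to JSON text (hypothetical helper)."""
    return json.dumps(descriptor, indent=2)


def resource_names(descriptor: dict) -> list:
    """List the names of all resources in a descriptor (hypothetical helper)."""
    return [resource["name"] for resource in descriptor["resources"]]


text = to_datapackage_json(descriptor)
print(text)
```

Writing `text` to a file named `datapackage.json` next to the data files would give the access point described on the previous slide. Generating the file with a JSON library, rather than by hand, avoids syntax slips such as trailing commas.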
  6. Describe a file

    - name: resource name
    - path: path/URL to file(s)
    - profile: type of data
    - format: standard extension
    - mediatype: type of file
    - encoding: file encoding
    - dialect: CSV dialect
    - schema: see next slide
    - etc.

    A Data Resource with a Table Dialect:

        {
          "name": "reference-data",
          "path": "reference-data.csv",
          "profile": "tabular-data-resource",
          "format": "csv",
          "mediatype": "text/csv",
          "encoding": "UTF-8",
          "dialect": {
            "header": true,
            "delimiter": ","
          },
          "schema": { ... }
        }

    Code by Peter Desmet

    datapackage.org/standard/data-resource/
    datapackage.org/standard/table-dialect/
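To illustrate what the dialect properties are for, here is a small sketch that reads CSV text according to the `header` and `delimiter` values from the resource above, using only Python's standard `csv` module. The function `read_rows` is a hypothetical helper, not an API of frictionless-py, and the sample text is the first two data rows from the CSV file shown earlier in the talk.

```python
import csv
import io

# Dialect copied from the Data Resource on the slide.
resource = {
    "name": "reference-data",
    "encoding": "UTF-8",
    "dialect": {"header": True, "delimiter": ","},
}

# First rows of the CSV file shown earlier in the talk.
SAMPLE = (
    "tag-id,animal-id,deploy-on-date,animal-mass,animal-sex\n"
    "586,H173481,2013-05-16 19:55:00.000,785.0,f\n"
    "610,L143451,2013-06-24 10:00:00.000,512.0,m\n"
)


def read_rows(text: str, dialect: dict) -> list:
    """Parse CSV text using the delimiter and header settings of a dialect."""
    reader = csv.reader(io.StringIO(text), delimiter=dialect.get("delimiter", ","))
    rows = list(reader)
    if dialect.get("header", True):
        header, body = rows[0], rows[1:]
        # Return one dict per data row, keyed by column name.
        return [dict(zip(header, row)) for row in body]
    return rows


rows = read_rows(SAMPLE, resource["dialect"])
print(rows[0]["animal-id"])
```

Because the parsing settings live in the descriptor, a consumer does not need to guess the delimiter or whether the first row is a header.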
  7. Describe tabular data

    - fields: all columns in a CSV file
    - primaryKey: field(s) to be considered as row identifier
    - missingValues: what values should be considered NULL
    - etc.

    A Table Schema:

        {
          "fields": [
            { "name": "tag-id" },
            { "name": "animal-id" },
            { "name": "deploy-on-date" },
            { "name": "animal-mass" },
            { "name": "animal-sex" }
          ],
          "primaryKey": ["animal-id", "tag-id"],
          "missingValues": ["", "NA"]
        }

    Code by Peter Desmet

    datapackage.org/standard/table-schema/
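The following sketch shows, under simplifying assumptions, what `primaryKey` and `missingValues` mean in practice. It uses two rows from the CSV file shown earlier (both with tag `630`) plus one clearly hypothetical row containing missing values. The helpers `apply_missing_values` and `duplicate_primary_keys` are illustrative names, not functions from a Frictionless library.

```python
from collections import Counter

# Schema copied from the slide.
schema = {
    "fields": [
        {"name": name}
        for name in ("tag-id", "animal-id", "deploy-on-date", "animal-mass", "animal-sex")
    ],
    "primaryKey": ["animal-id", "tag-id"],
    "missingValues": ["", "NA"],
}

rows = [
    # Two rows from the example CSV: the same tag was deployed on two animals.
    {"tag-id": "630", "animal-id": "L143467", "deploy-on-date": "2015-05-26 09:28:00.000",
     "animal-mass": "472.0", "animal-sex": "m"},
    {"tag-id": "630", "animal-id": "H185298", "deploy-on-date": "2016-06-03 11:11:00.000",
     "animal-mass": "656.0", "animal-sex": "f"},
    # Hypothetical row (not in the source data) to demonstrate missing values.
    {"tag-id": "9999", "animal-id": "X000000", "deploy-on-date": "",
     "animal-mass": "NA", "animal-sex": "m"},
]


def apply_missing_values(row: dict, schema: dict) -> dict:
    """Replace cells listed in missingValues with None."""
    missing = set(schema.get("missingValues", [""]))
    return {key: (None if value in missing else value) for key, value in row.items()}


def duplicate_primary_keys(rows: list, schema: dict) -> list:
    """Return primary key tuples that occur in more than one row."""
    key_fields = schema["primaryKey"]
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]


cleaned = [apply_missing_values(row, schema) for row in rows]
duplicates = duplicate_primary_keys(cleaned, schema)
print(duplicates)
```

With the composite key `["animal-id", "tag-id"]` no duplicates are found, whereas `tag-id` alone would not identify rows uniquely because tag `630` was reused.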
  8. Describe a field

    - name: column header in CSV
    - title: human-readable label
    - description: free text
    - example: single example
    - type: data type (controlled value)
    - format: how to parse data
    - constraints: validation requirements
    - etc.

    A field in a Table Schema:

        {
          "name": "deploy-on-date",
          "title": "deploy on timestamp",
          "description": "The timestamp when the tag deployment started. Data records recorded before this day and time are not associated with the animal related to the deployment. Values are typically defined by the data owner, and in some cases are created automatically during data import.",
          "example": "2008-08-30 18:00:00.000",
          "type": "datetime",
          "format": "%Y-%m-%d %H:%M:%S.%f",
          "constraints": {
            "required": false
          }
        }

    datapackage.org/standard/table-schema/
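The `format` shown for this field uses the same directives as Python's `datetime.strptime`, so a rough sketch of how a reader might apply `type`, `format` and the `required` constraint could look like this. The function `parse_datetime_field` is a hypothetical helper written for illustration and makes no claim to match how any Frictionless implementation parses values.

```python
from datetime import datetime

# Field properties copied from the slide (title and description omitted).
field = {
    "name": "deploy-on-date",
    "example": "2008-08-30 18:00:00.000",
    "type": "datetime",
    "format": "%Y-%m-%d %H:%M:%S.%f",
    "constraints": {"required": False},
}


def parse_datetime_field(value, field: dict):
    """Parse one cell of a datetime field, honouring the required constraint."""
    if value is None:
        if field.get("constraints", {}).get("required", False):
            raise ValueError(f"{field['name']} is required but missing")
        return None
    # Raises ValueError when the value does not match the declared format.
    return datetime.strptime(value, field["format"])


parsed = parse_datetime_field(field["example"], field)
print(parsed.isoformat())
```

A value in a different notation, such as `2015-05-26T09:28Z`, does not match this format and raises a `ValueError` in this sketch.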
  9. FAIR data

    - Adding a datapackage.json improves
      - Findability: rich metadata
      - Accessibility: one access point
      - Interoperability: standard properties
      - Reusability: community standards
    - My approach = yours 💚

    Image by Zenodo
  10. Software to create, write and read

    - Open Data Editor (desktop)
    - frictionless-py (Python)
    - frictionless-r (R)
    - dpkit (TypeScript)
    - etc.

    datapackage.org/overview/software/

    Image by OKFN
  11. Software to validate

    - Frictionless-py offers extensive validation
      - Metadata errors
      - Reference errors
      - Type errors
      - Constraint errors
      - etc.

    github.com/frictionlessdata/frictionless-py

    Command line validation with frictionless-py:

        frictionless validate datapackage.json

        ─── Dataset ───
        reference-data: INVALID
        gps: VALID

        ─── Tables ───
        reference-data
        Type error in the cell "2015-05-26T09:28Z" in row "5" and field "deploy-on-date" at position "3": type is "datetime/default"
        The cell "x" in row at position "6" and field "animal-sex" at position "5" does not conform to a constraint: constraint "enum" is "['f', 'm', 'u']"
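To make the two error kinds in the output above more concrete, here is a simplified sketch of a type check and an `enum` constraint check written with the Python standard library only. It is not how frictionless-py is implemented. The function `validate_rows`, the error labels, the datetime format and the convention that the header counts as row 1 are all assumptions made for illustration; the two offending cell values are taken from the output above.

```python
from datetime import datetime

# Simplified field definitions; the format is an assumption for this sketch.
FIELDS = [
    {"name": "deploy-on-date", "type": "datetime", "format": "%Y-%m-%d %H:%M:%S.%f"},
    {"name": "animal-sex", "type": "string", "constraints": {"enum": ["f", "m", "u"]}},
]

ROWS = [
    {"deploy-on-date": "2013-05-16 19:55:00.000", "animal-sex": "f"},
    {"deploy-on-date": "2013-06-24 10:00:00.000", "animal-sex": "m"},
    {"deploy-on-date": "2013-07-22 10:30:00.000", "animal-sex": "m"},
    {"deploy-on-date": "2015-05-26T09:28Z", "animal-sex": "m"},        # wrong notation
    {"deploy-on-date": "2016-05-02 12:58:00.000", "animal-sex": "x"},  # not in enum
]


def validate_rows(rows: list, fields: list) -> list:
    """Return (row number, field name, error label) tuples for invalid cells."""
    errors = []
    # Assumption: the header is row 1, so the first data row is row 2.
    for row_number, row in enumerate(rows, start=2):
        for field in fields:
            value = row[field["name"]]
            if field["type"] == "datetime":
                try:
                    datetime.strptime(value, field["format"])
                except ValueError:
                    errors.append((row_number, field["name"], "type-error"))
                    continue
            enum = field.get("constraints", {}).get("enum")
            if enum is not None and value not in enum:
                errors.append((row_number, field["name"], "constraint-error"))
    return errors


errors = validate_rows(ROWS, FIELDS)
for error in errors:
    print(error)
```

In practice the `frictionless validate` command shown on the slide covers these checks and many more, so a hand-written validator like this is only useful for understanding what the report means.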
  12. Community standards

    - Standardizing even further
      - Shared dataset requirements (profile)
      - Shared tabular data requirements (Table Schemas)
    - Build upon Data Package
      - Solid foundation
      - Free validation
      - Free software
      - Focus on what’s important

    camtrap-dp.tdwg.org/

    Image by GitHub, TDWG, Camtrap DP dev team
  13. Community standards

    - Biotracks (cell migration)
    - Camtrap DP (camera-trapping)
    - Darwin Core Data Package
    - European Seabirds at Sea
    - Fluctuations of Glacier database
    - GeoLocator DP (movement ecol.)
    - VPTS CSV (radar aeroecology)

    aloftdata.eu/vpts-csv/

    Image by Aloft
  14. Contribute

    - Participate in the Data Package Working Group
    - Help with open source software
    - Use Data Package for your datasets

    github.com/frictionlessdata/datapackage

    Image by GitHub, OKFN
  15. Thank you!

    [email protected]
    bit.ly/frictionless-101

    This work was funded by Research Foundation - Flanders (LifeWatch) & EU (B3).

    Photo by Peter Desmet