Defining dataset specifications to communicate data quality characteristics

Peter Desmet
December 06, 2016

Talk at the TDWG 2016 annual conference in Santa Clara de San Carlos, Costa Rica - December 6, 2016.

Recording: https://vimeo.com/showcase/4308386/video/196431347

The Darwin Core standard provides a list of community-ratified terms for sharing biodiversity information. Although some terms have strict definitions, most leave users some freedom in how to interpret them. This freedom has enabled a wide range of biodiversity data to be mapped to Darwin Core, but it complicates automated data aggregation and processing. One way to resolve this is through community-specific guidelines describing how data should be mapped, but few have been created or adopted. Moreover, these guidelines are intended for humans only.

Inspired by existing data validation specifications in other fields, we propose using a specification file that describes the constraints with which the data should comply. Its syntax is both human- and machine-readable, so it can be used to communicate expected data quality/conformity and to validate data automatically. The scope of the set of rules can be specific to a dataset, publisher or community, which allows both bottom-up and top-down adoption.

In this talk, we will present a prototype format for these specifications, in which the rules are defined at the level of individual terms and expressed as a YAML file. We also present prototype software to validate data against these specifications. We hope this will trigger a discussion on how to express data specifications and mapping guidelines.
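
To make the idea of term-level rules concrete, here is a minimal Python sketch of validating one record against a specification. It is not the prototype format or software presented in the talk: the rule names (`allowed`, `min`, `max`, `regex`), the `SPECIFICATION` contents and the `validate_record` function are all hypothetical, chosen only for illustration. The specification is written as a Python dictionary; the same nested structure could be stored in a YAML file.

```python
import re

# Hypothetical specification: one entry per Darwin Core term, each holding
# the constraints its values should satisfy. The rule names are illustrative
# assumptions, not the syntax of the prototype described in the talk.
SPECIFICATION = {
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
    "decimalLatitude": {"min": -90.0, "max": 90.0},
    "eventDate": {"regex": r"\d{4}-\d{2}-\d{2}"},
}


def validate_record(record, specification):
    """Return a list of human-readable violations for one record.

    Values are treated as strings, as they would be when read from a
    delimited text file. A missing term is treated as an empty string.
    """
    errors = []
    for term, rules in specification.items():
        value = record.get(term, "")
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{term}: {value!r} is not an allowed value")
        if "regex" in rules and not re.fullmatch(rules["regex"], value):
            errors.append(f"{term}: {value!r} does not match {rules['regex']}")
        if "min" in rules or "max" in rules:
            try:
                number = float(value)
            except ValueError:
                errors.append(f"{term}: {value!r} is not a number")
                continue
            if "min" in rules and number < rules["min"]:
                errors.append(f"{term}: {number} is below {rules['min']}")
            if "max" in rules and number > rules["max"]:
                errors.append(f"{term}: {number} is above {rules['max']}")
    return errors


record = {
    "basisOfRecord": "Observation",
    "decimalLatitude": "123.4",
    "eventDate": "06/12/2016",
}
for problem in validate_record(record, SPECIFICATION):
    print(problem)
```

Because the rules are plain data keyed by term, the same file can be read by a person as a statement of expected data quality and by a program as validation input, and it can be scoped to a single dataset, a publisher or a whole community.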

Transcript

  1. Thanks! @peterdesmet @stijnvanhoey @dimibro bit.ly/2h0cDLU

    Desmet P, Van Hoey S & Brosens D (2016) Defining dataset specifications to communicate data quality. http://bit.ly/2h0cDLU