Whip: Communicate and test what to expect from data

Talk at the TDWG 2018 annual conference in Dunedin, New Zealand - August 28, 2018.

The ability to communicate and assess the quality and fitness for use of data is crucial to ensure maximum utility and re-use. Data consumers have certain requirements for the data they seek and need to be able to check whether a dataset conforms to these requirements. Data publishers aim to provide data of the highest possible quality and need to be able to identify potential errors that can be addressed with the information at hand. The development and adoption of data publication guidelines is one approach to define and meet those requirements. However, the guideline that was used, the mapping decisions that were made, and the requirements a dataset is expected to meet are generally not communicated with the provided data. Moreover, these guidelines are typically intended for humans only.

In this talk, we will present 'whip': a proposed syntax for data specifications. With whip, one can define column-based constraints for tabular (tidy) data using a number of rules, e.g. how data is structured following Darwin Core, how a term uses controlled vocabulary values, or what the expected minimum and maximum values are. These rules are both human- and machine-readable: they communicate the specifications and allow them to be validated automatically in pipelines for data publication and quality assessment, such as Kurator. Whip can be formatted as a (YAML) text file that can be provided with the published data, communicating the specifications a dataset is expected to meet. The specifications can be scoped to a single dataset, but can also express the data quality and fitness for use expected by a publisher, consumer or community, allowing both bottom-up and top-down adoption. As such, these specifications are complementary to the core set of data quality tests currently under development by the TDWG Biodiversity Data Quality Task Group 2. Whip rules are currently generic, but more specific ones can be defined to address requirements for biodiversity information.
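
As a rough illustration of the column-based constraints described in the abstract, the sketch below writes a small specification as a Python dictionary (roughly what a parsed YAML file could look like) and checks rows against it. It is not pywhip and makes no claim about how pywhip works internally: the `check_rows` helper, the example field values and the error messages are assumptions made for this illustration. Only the rule names `allowed`, `min` and `max` are taken from the talk.

```python
# Hypothetical specification: field name -> rules.
# Rule names (allowed, min, max) come from the talk; the values are made up.
specifications = {
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
    "individualCount": {"min": 1, "max": 100},
}


def check_rows(rows, specifications):
    """Return (row_index, field, message) for every violated rule.

    Illustrative helper only, not part of pywhip.
    """
    errors = []
    for index, row in enumerate(rows):
        for field, rules in specifications.items():
            value = row.get(field, "")
            if "allowed" in rules and value not in rules["allowed"]:
                errors.append((index, field, "value not in allowed list"))
            if "min" in rules or "max" in rules:
                try:
                    number = float(value)
                except (TypeError, ValueError):
                    errors.append((index, field, "value is not a number"))
                    continue
                if "min" in rules and number < rules["min"]:
                    errors.append((index, field, "value below min"))
                if "max" in rules and number > rules["max"]:
                    errors.append((index, field, "value above max"))
    return errors


# Two example records, as they might be read from a CSV file (all strings).
rows = [
    {"basisOfRecord": "HumanObservation", "individualCount": "3"},
    {"basisOfRecord": "Observation", "individualCount": "250"},
]
report = check_rows(rows, specifications)
for row_index, field, message in report:
    print(f"row {row_index}: {field}: {message}")
```

Because the specification is plain data, the same dictionary could in principle be shipped with a dataset and re-checked by a publisher or a consumer, which is the idea the abstract calls testable specifications.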

https://doi.org/10.3897/biss.2.25317

Peter Desmet

August 28, 2018

Transcript

  1. Whip: Communicate and test what to expect from data. Stijn Van Hoey & Peter Desmet
  2. Expectations Data Users
  3. Expectations Data Users. Fit for my research? Fit for specific user community?
  4. We are a data publisher
  5. We care
  6. What to expect Data Publisher
  7. What to expect Data Publisher: Data quality, Standardization, Community recommendations, Dataset characteristics
  8. Expectations / What to expect. Data Publisher Users Expectations What to expect
  9. How to communicate expectations? Data Publisher Users Expectations What to expect
  10. How to test expectations? Data Publisher Users Expectations What to expect
  11. Whip
  12. Whip syntax
  13. Whip syntax
  14. Whip syntax: Field
  15. Whip syntax: Field, Specification
  16. Whip syntax: Comment, Field, Specification
  17. Whip syntax: Comment, Field, Specification
  18. Whip specifications: allowed, minlength / maxlength, stringformat, regex, min / max, numberformat, mindate / maxdate, dateformat
  19. Whip scope specifications: empty, delimitedvalues, if
  20. Using whip to document
  21. Pywhip: a whip implementation
  22. Pywhip
      import yaml
      from pywhip import whip_csv
      # load specifications
      with open("my_specifications.yml") as spec_file:
          specifications = yaml.safe_load(spec_file)
      # test specifications
      test = whip_csv("my_data.csv", specifications)
      # get report ("html" or "json")
      test.get_report("html")
  23. Pywhip
  24. Pywhip
  25. Pywhip
  26. Conclusion: Human and machine-readable syntax to express specifications for data. Not specific to Darwin Core (but we plan to use it for that). Can be adopted by users (expectations) and publishers (what to expect). Can be included with dataset as testable metadata. Pywhip: first implementation for testing whip specifications.
  27. github.com/inbo/whip github.com/inbo/pywhip bit.ly/pywhip_binder Thank you! Data Specifications
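
Slides 18 and 19 only list the rule names, so the sketch below shows one possible reading of a few of them (`regex`, `minlength`, `maxlength`, `empty`) for string values. The semantics are assumptions made for this illustration and may differ from the actual whip definitions: here a regex must match the whole value, and `empty: True` means an empty value is accepted and skips the other rules. The `check_value` helper and the example fields are hypothetical and not part of pywhip.

```python
import re

# Hypothetical rules for two fields, using rule names listed on the slides.
string_specifications = {
    "eventDate": {"regex": r"\d{4}-\d{2}-\d{2}", "empty": False},
    "countryCode": {"minlength": 2, "maxlength": 2, "empty": True},
}


def check_value(value, rules):
    """Return the names of the rules that `value` violates (illustrative)."""
    if value == "":
        # Assumption: `empty: True` accepts empty values and skips other rules.
        return [] if rules.get("empty", False) else ["empty"]
    violated = []
    # Assumption: the pattern has to match the whole value, not a substring.
    if "regex" in rules and re.fullmatch(rules["regex"], value) is None:
        violated.append("regex")
    if "minlength" in rules and len(value) < rules["minlength"]:
        violated.append("minlength")
    if "maxlength" in rules and len(value) > rules["maxlength"]:
        violated.append("maxlength")
    return violated


print(check_value("2018-08-28", string_specifications["eventDate"]))
print(check_value("28/08/2018", string_specifications["eventDate"]))
print(check_value("", string_specifications["countryCode"]))
print(check_value("NZL", string_specifications["countryCode"]))
```

Returning the violated rule names, rather than a single pass/fail flag, keeps the result usable both for a human-readable report and for automated checks.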