Good enough practices for Research Data Management

Tanja Milotić @oscibio
Empowering biodiversity research II, September 27th, 2022, Brussels

Would my colleague be able to take over my project
if I suddenly disappeared?

Data are hard to find & navigate

Digital storage media are fragile

Research data are undervalued & neglected

Temporary involvement of staff & students

Why should you even bother?
Help your future self
• Higher data quality, fewer mistakes
• Increased research efficiency
• Minimized risk of data loss, less frustration
• Saved time & money
• No duplicate data collection

Why should you even bother?
Motivation for the near future
• Required by funders & publishers
• Increased visibility (citations!)
• Easy data sharing
• New collaboration opportunities
“Research data management is part of good research practice”

How to start with research data management?
Research data management (RDM) concerns the organization of data, from its entry into the research cycle to the dissemination and archiving of valuable results.
Data management plan (DMP)

Research life cycle

Data life cycle

Research data?
Any information collected or created for the purpose of analysis to verify scientific claims
• Digital(ized) data: observations, measurements, pictures, audio, models, ...
• Physical data: samples (soil, water, tissue, ...), collections, ...

data organization

Folder structure
• Use a well-defined folder structure

|- README.md      <- The top-level README describing the general layout of the project
|- data           <- Research data
|  |- raw         <- The original, read-only acquired raw data
|  |- interim     <- Intermediate data that has been transformed
|  |- processed   <- Final data products, used in the report/paper/graphs
|  |_ external    <- Additional third-party data resources used (e.g. vector maps)
|- reports        <- Reported outcome of the analysis as LaTeX, Word, Markdown, ...
|  |_ figures     <- Generated graphics and figures to be used in reporting
|- src            <- Set of analysis scripts used in the analysis

“Keep raw data raw!” (Hart et al., 2016)
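A layout like the one above can be created by a small script so that every new project starts the same way. The following is only a minimal sketch: the function name `create_project_skeleton` and the placeholder README text are assumptions made for illustration, not part of the original slides.

```python
from pathlib import Path

# Directory names taken from the folder structure listed above.
PROJECT_DIRS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "data/external",
    "reports/figures",
    "src",
]


def create_project_skeleton(root):
    """Create the folder layout under `root` and return the directories."""
    root = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        directory = root / relative
        # exist_ok=True makes it safe to re-run on an existing project.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text (hypothetical); never overwrite an existing README.
        readme.write_text(
            "# Project title\n\nDescribe the general layout of the project.\n",
            encoding="utf-8",
        )
    return created
```

Calling `create_project_skeleton("my_project")` would create the folders below `my_project`. Making `data/raw` read-only is left to the file system or storage platform in use.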

File names
• Unique file names
• Comprehensible
• No special characters
• Use _ instead of spaces
• YYYY-MM-DD dates
• Include initials, project, location, variable, content
Example: 2022-09-27_EBR_RDM_practices.pdf
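The naming rules above can be applied automatically when files are written by a script. This is a rough sketch under stated assumptions: `build_file_name` is a hypothetical helper, and the set of allowed characters (letters, digits, underscore, hyphen) is one possible reading of "no special characters".

```python
import re
from datetime import date


def build_file_name(day, parts, extension):
    """Join an ISO date and descriptive parts into one file name."""
    cleaned = []
    for part in parts:
        # Replace runs of whitespace with a single underscore.
        part = re.sub(r"\s+", "_", part.strip())
        # Drop everything except letters, digits, underscores and hyphens.
        part = re.sub(r"[^A-Za-z0-9_-]", "", part)
        if part:
            cleaned.append(part)
    if not cleaned:
        raise ValueError("at least one non-empty name part is required")
    # date.isoformat() yields YYYY-MM-DD, which sorts chronologically.
    return f"{day.isoformat()}_{'_'.join(cleaned)}.{extension.lstrip('.')}"


name = build_file_name(date(2022, 9, 27), ["EBR", "RDM", "practices"], "pdf")
# name == "2022-09-27_EBR_RDM_practices.pdf"
```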

Keep track of changes
In general
• Back up changes ASAP
• Keep changes small
• Share changes frequently
Manually track changes
• Add a CHANGELOG.txt
• Copy entire project after large changes
• File naming conventions: file_v1, file_v2
Use a version control system (recommended)
• Git
• GitHub, Bitbucket, GitLab, ...

Data file formats
• Non-proprietary (open source) formats
• Easily reusable
• Commonly used
Some preferred file formats
• Tabular data: .csv (comma-separated values), HDF5, netCDF, RDF
• Text: .txt, html, xml, odt, rtf
• Still images: .tif, jpeg2000, png, pdf, gif, bmp, svg
“Love your data, and help others love it, too” (Goodman et al., 2014)

data quality

Data quality
• Standardize data collection
• Check data entry
• Edit, clean, verify and validate raw data
• Peer review
• Documentation
• Scripting
“Data should be structured for analysis” (Hart et al., 2016)

Tidy data
• 80% of data analysis time is spent on cleaning and preparing data
• Tidy data: structuring datasets to facilitate analysis
• Tidy data from the start of the project
(Wickham, 2014)

From messy to tidy
Make it a rectangle
• Only rows and columns, no additional structure
• One column for each type of information
• One row for each observation (data point)

Messy:
Plot  SpeciesA  SpeciesB
1     3         1
2     2         4

Tidy:
Plot  Species  Abundance
1     A        3
1     B        1
2     A        2
2     B        4

Source: Data Carpentry for Biologists
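The reshaping from the messy to the tidy table above can be scripted. The sketch below uses only plain Python; `wide_to_long` and its parameters are hypothetical names chosen for this illustration.

```python
def wide_to_long(rows, id_column, value_name, prefix):
    """Turn one column per species into one row per observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # "SpeciesA" -> "A": strip the shared prefix from the column name.
            species = column[len(prefix):] if column.startswith(prefix) else column
            tidy.append({id_column: row[id_column], "Species": species, value_name: value})
    return tidy


# The messy table from the slide: one abundance column per species.
messy = [
    {"Plot": 1, "SpeciesA": 3, "SpeciesB": 1},
    {"Plot": 2, "SpeciesA": 2, "SpeciesB": 4},
]
tidy = wide_to_long(messy, id_column="Plot", value_name="Abundance", prefix="Species")
for record in tidy:
    print(record["Plot"], record["Species"], record["Abundance"])
```

For larger tables, a data frame library is usually more convenient. In pandas, for example, `DataFrame.melt` performs the same kind of wide-to-long reshaping.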

From messy to tidy
One cell, one value
• Every cell contains one piece of information

Messy:
Mass
26g
0.2kg

Tidy:
Mass  Unit
26    g
0.2   kg

Source: Data Carpentry for Biologists
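Cells that mix a number and a unit, as in the messy example above, can be split with a regular expression. This is a minimal sketch: the pattern assumes a plain decimal number followed by a unit made of letters, which will not cover every notation found in real data.

```python
import re

# Assumed cell shape: optional spaces, a decimal number, then a letters-only unit.
VALUE_UNIT_PATTERN = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*$")


def split_value_and_unit(cell):
    """Split a cell such as '26g' into a numeric value and a unit string."""
    match = VALUE_UNIT_PATTERN.match(cell)
    if match is None:
        # Fail loudly instead of guessing, so odd cells get reviewed by hand.
        raise ValueError(f"cannot split {cell!r} into value and unit")
    return float(match.group(1)), match.group(2)


for cell in ["26g", "0.2kg"]:
    value, unit = split_value_and_unit(cell)
    print(value, unit)
```

A further step, not shown on the slide, would be converting all values to one unit so that the column can be analyzed directly.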

From messy to tidy
Don’t mess with the computer
• Don’t use visual markings (colors, italics, fonts, ...)
• Avoid spaces in names; use ‘_’ or CamelCase for multiple words
• Avoid special characters (*, @, ^, ...)

Messy:
Min temp
5
4.5
3.1*

Tidy:
min_temp  calibration_error
5         0
4.5       0
3.1       1

Source: Data Carpentry for Biologists
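Both fixes in the example above, a machine-friendly column name and a separate column for the flag, are easy to script. The helpers below are hypothetical and only sketch the idea. `clean_column_name` produces lowercase names with underscores, which is one of the two conventions the slide allows.

```python
import re


def clean_column_name(name):
    """Lowercase a header and join its words with underscores."""
    # Any run of characters other than letters and digits becomes one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name.strip())
    return name.strip("_").lower()


def split_flag(cell, flag="*"):
    """Separate a trailing flag character from the numeric value."""
    text = cell.strip()
    flagged = text.endswith(flag)
    if flagged:
        text = text[: -len(flag)]
    # The flag becomes its own 0/1 column instead of living inside the value.
    return float(text), int(flagged)


header = clean_column_name("Min temp")
rows = [split_flag(cell) for cell in ["5", "4.5", "3.1*"]]
```

Here `header` is `min_temp` and `rows` holds one `(min_temp, calibration_error)` pair per measurement, matching the tidy table.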

From messy to tidy
Be clear and consistent
• Use short, meaningful names
• Use consistent names, abbreviations, and capitalization
• Use good null values (blanks, NA, ...); do not use numbers (0, -999)
• Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns

Messy:
d             s     a
26/02/2022    dior  9
26/02/2022    disp  1
May 24, 2022  DIor  -999
May 24, 2022  DISP  Missing

Tidy:
Date        Species  Abundance
2022-02-26  dior     9
2022-02-26  disp     1
2022-05-24  dior     NA
2022-05-24  disp     NA
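The date and null-value cleanup in the example above might be scripted as follows. This is a sketch under assumptions: slash-separated dates are taken to be day-first, month names are parsed with `%B` (which depends on an English locale), and the list of null codes is only what appears in the example table.

```python
from datetime import datetime

# Assumed input formats; day-first for slash-separated dates.
DATE_FORMATS = ("%d/%m/%Y", "%B %d, %Y", "%Y-%m-%d")
# Codes treated as missing in this example (compared in lowercase).
NULL_CODES = {"", "-999", "missing", "na"}


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each assumed format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean_abundance(text):
    """Map ad hoc missing-value codes to a single NA marker."""
    value = text.strip()
    return "NA" if value.lower() in NULL_CODES else value


messy = [
    ("26/02/2022", "dior", "9"),
    ("May 24, 2022", "DIor", "-999"),
    ("May 24, 2022", "DISP", "Missing"),
]
tidy = [(to_iso_date(d), s.lower(), clean_abundance(a)) for d, s, a in messy]
```

Raising an error for unknown date formats is a deliberate choice here: silently guessing between day-first and month-first dates is a common source of corrupted data.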

documentation

Documentation & metadata
Why?
• Long-term usability
• Avoid misinterpretation
• Collaboration & staff changes
• Save your memory for other stuff…
@TDXDigLibrary

Project documentation
Add a README.txt to your project folder
• Context
• People
• Sponsor(s)
• Data collection methods
• File organization
• Known problems, limitations, gaps
• Licenses
• How to cite

Data documentation
README.txt for each dataset
• Variable names, labels, data type, description
• Explain codes & abbreviations
• Code & reason for missing values
• Code used for derived data
• File format
• Software
• Data standards

data preservation

Backup guidelines
• 3-2-1 rule: 3 copies, 2 different types of media, 1 offsite
• Apply a backup schedule
• Test file restores
• Do not use CDs or DVDs
• Use reliable backup media
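One way to test file restores is to compare checksums of the original files with those of a restored copy. The sketch below assumes both folders are reachable as local paths; `find_restore_problems` is a hypothetical helper, not a feature of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_restore_problems(original_dir, restored_dir):
    """List files that are missing or differ in the restored copy."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source) != sha256_of(restored):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

An empty list means every original file was found in the restored folder with identical content. Files that exist only in the restored folder are not reported by this sketch.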

Storage media

data publication

Open data
• Funders’ demands
• Publishers’ rules
• Innovation & valorization
• Higher citation rates
• More visibility for your work
• Collaboration
• Application of your findings
• Reduced costs
• Public access to your findings

How not to “publish” your data
• Personal website
• University website

FAIR data
Make your data
• Findable
• Accessible
• Interoperable
• Reusable
Findable
• Persistent identifiers (DOI)
• Metadata
• Naming conventions
• Keywords
• Versioning
Accessible
• Choice of datasets
• Data repository
• Software, documentation
• Access status
• Retrievable data
• Metadata access
Interoperable
• Standards
• Vocabulary
• Methodology
• References
Reusable
• Licensing
• Provenance
• Community standards

Where to deposit data?
• Depends on data type
• Domain-specific repositories
  ◦ GBIF for species occurrences
  ◦ Movebank for movement data
  ◦ GenBank for genomic data
  ◦ ...
• General repositories (Zenodo, dataDryad, ...)
• ORCID integration
• DOI for citing datasets

Thank you!
