Slide 1

Slide 1 text

Good enough practices for research data management
Tanja Milotić
@oscibio
Empowering biodiversity research II
September 27th, 2022, Brussels

Slide 2

Slide 2 text

Would my colleague be able to take over my project if I suddenly disappeared?

Slide 3

Slide 3 text

Data are hard to find & navigate

Slide 4

Slide 4 text

Digital storage media are fragile

Slide 5

Slide 5 text

Research data are undervalued & neglected

Slide 6

Slide 6 text

Temporary involvement of staff & students

Slide 7

Slide 7 text

Why should you even bother?
Help your future self
● higher data quality, fewer mistakes
● increased research efficiency
● minimized risk of data loss, less frustration
● saved time & money
● no duplicate data collection

Slide 8

Slide 8 text

Why should you even bother?
Motivation for the near future
● required by funders & publishers
● increase your visibility (citations!)
● easy data sharing
● new collaboration opportunities
“Research data management is part of good research practice”

Slide 9

Slide 9 text

How to start with research data management?
Research data management (RDM) concerns the organization of data, from its entry into the research cycle to the dissemination and archiving of valuable results.
Data management plan (DMP)

Slide 10

Slide 10 text

Research life cycle

Slide 11

Slide 11 text

Data life cycle

Slide 12

Slide 12 text

Data life cycle

Slide 13

Slide 13 text

Research data?
Any information collected or created for the purpose of analysis to verify scientific claims
Digital(ized) data: observations, measurements, pictures, audio, models,...
Physical data: samples (soil, water, tissue,...), collections,...

Slide 14

Slide 14 text

data organization

Slide 15

Slide 15 text

Folder structure
● Use a well-defined folder structure

|- README.md      <- The top-level README describing the general layout of the project
|- data           <- Research data
|  |- raw         <- The original, read-only acquired raw data
|  |- interim     <- Intermediate data that has been transformed
|  |- processed   <- Final data products, used in the report/paper/graphs
|  |_ external    <- Additional third-party data resources used (e.g. vector maps)
|- reports        <- Reported outcome of the analysis as LaTeX, Word, Markdown,...
|  |_ figures     <- Generated graphics and figures to be used in reporting
|- src            <- Set of analysis scripts used in the analysis

“Keep raw data raw!” (Hart et al, 2016)
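
The layout above could be created with a few lines of Python. This is only a minimal sketch: the folder names come from the slide, while the function name `create_project_skeleton` and the README placeholder text are assumptions made for illustration.

```python
from pathlib import Path

# Folder names taken from the slide's layout.
PROJECT_FOLDERS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "data/external",
    "reports/figures",
    "src",
]


def create_project_skeleton(root):
    """Create the folder layout under `root` (hypothetical helper).

    Existing folders and an existing README.md are left untouched.
    """
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text only; replace it with a real project description.
        readme.write_text(
            "# Project title\n\nDescribe the general layout of the project here.\n",
            encoding="utf-8",
        )
    return root
```

Calling `create_project_skeleton("my_project")` a second time is harmless, because nothing that already exists is overwritten.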

Slide 16

Slide 16 text

File names
● Unique file names
● Comprehensible
● No special characters
● Use _ instead of spaces
● YYYY-MM-DD dates
● Include initials, project, location, variable, content
Example: 2022-09-27_EBR_RDM_practices.pdf
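
One way to apply these naming rules consistently is to build file names with a small helper instead of typing them by hand. The sketch below is an illustration under assumptions: `build_file_name` is a hypothetical function, and "special characters" is read here as anything other than ASCII letters, digits, underscores and hyphens.

```python
import re
from datetime import date


def build_file_name(day, parts, extension):
    """Join an ISO date and descriptive parts with underscores (hypothetical helper)."""
    cleaned = []
    for part in parts:
        part = part.strip().replace(" ", "_")
        # Drop anything that is not a letter, digit, underscore or hyphen.
        part = re.sub(r"[^A-Za-z0-9_-]", "", part)
        if part:
            cleaned.append(part)
    return "_".join([day.isoformat(), *cleaned]) + "." + extension.lstrip(".")


print(build_file_name(date(2022, 9, 27), ["EBR", "RDM", "practices"], "pdf"))
# prints 2022-09-27_EBR_RDM_practices.pdf
```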

Slide 17

Slide 17 text

Keep track of changes
In general
● Back up changes ASAP
● Keep changes small
● Share changes frequently
Manually track changes
● Add a CHANGELOG.txt
● Copy entire project after large changes
● File naming conventions: file_v1, file_v2
Use a version control system (recommended)
● Git
● GitHub, Bitbucket, GitLab,...

Slide 18

Slide 18 text

Data file formats
● Non-proprietary (open source) formats
● Easily reusable
● Commonly used
Some preferred file formats
Tabular data: .csv (comma-separated values), HDF5, NetCDF, RDF
Text: .txt, html, xml, odt, rtf
Still images: .tif, jpeg2000, png, pdf, gif, bmp, svg
“Love your data, and help others love it, too” (Goodman et al, 2014)

Slide 19

Slide 19 text

data quality

Slide 20

Slide 20 text

Data quality
● Standardize data collection
● Check data entry
● Edit, clean, verify and validate raw data
● Peer review
● Documentation
● Scripting
“Data should be structured for analysis” (Hart et al, 2016)

Slide 21

Slide 21 text

Tidy data
● 80% of data analysis time is spent on cleaning and preparing data
● Tidy data: structuring datasets to facilitate analysis
● Tidy data from the start of the project
(Wickham, 2014)

Slide 22

Slide 22 text

From messy to tidy
Make it a rectangle
● Only rows and columns, no additional structure
● One column for each type of information
● One row for each observation (data point)
Data Carpentry for Biologists

Messy:
Plot  SpeciesA  SpeciesB
1     3         1
2     2         4

Slide 23

Slide 23 text

From messy to tidy
Make it a rectangle
● Only rows and columns, no additional structure
● One column for each type of information
● One row for each observation (data point)
Data Carpentry for Biologists

Messy:
Plot  SpeciesA  SpeciesB
1     3         1
2     2         4

Tidy:
Plot  Species  Abundance
1     A        3
1     B        1
2     A        2
2     B        4
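
The messy-to-tidy step above is a wide-to-long reshape. The following sketch does it with the Python standard library only; the function name `wide_to_long` and the CSV rendering of the slide's table are assumptions for illustration (data-frame libraries offer equivalent reshaping functions).

```python
import csv
import io

# The messy (wide) table from the slide, written out as CSV text.
MESSY = """Plot,SpeciesA,SpeciesB
1,3,1
2,2,4
"""


def wide_to_long(csv_text, id_column="Plot", prefix="Species"):
    """Turn one column per species into one row per observation."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        for column, value in record.items():
            if column == id_column:
                continue
            # "SpeciesA" -> "A": the species code was hidden in the header.
            species = column[len(prefix):] if column.startswith(prefix) else column
            rows.append({
                id_column: record[id_column],
                "Species": species,
                "Abundance": int(value),
            })
    return rows


for row in wide_to_long(MESSY):
    print(row)
```

Each printed row corresponds to one line of the tidy table: one plot, one species, one abundance.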

Slide 24

Slide 24 text

From messy to tidy
One cell, one value
● Every cell contains 1 piece of information
Data Carpentry for Biologists

Messy:
Mass
26g
0.2kg

Slide 25

Slide 25 text

From messy to tidy
One cell, one value
● Every cell contains 1 piece of information
Data Carpentry for Biologists

Messy:
Mass
26g
0.2kg

Tidy:
Mass  Unit
26    g
0.2   kg
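
Splitting a combined cell such as `26g` into a value and a unit can be scripted. This is a minimal sketch, assuming every cell is a plain decimal number followed directly by a unit made of letters; `split_mass` is a hypothetical helper name.

```python
import re

# A decimal number followed by a unit made of letters, e.g. "26g" or "0.2kg".
MASS_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*$")


def split_mass(cell):
    """Return (value, unit) for a cell that mixes both, e.g. '0.2kg'."""
    match = MASS_PATTERN.match(cell)
    if match is None:
        # Fail loudly instead of guessing: unexpected cells need a human look.
        raise ValueError(f"Cannot parse mass value: {cell!r}")
    return float(match.group(1)), match.group(2)


for cell in ["26g", "0.2kg"]:
    print(split_mass(cell))
```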

Slide 26

Slide 26 text

From messy to tidy
Don’t mess with the computer
● Don’t use visual markings (colors, italics, fonts,...)
● Avoid spaces in names, use ‘_’ or CamelCase for multiple words
● Avoid special characters (*, @, ^,...)
Data Carpentry for Biologists

Messy:
Min temp
5
4.5
3.1*

Slide 27

Slide 27 text

From messy to tidy
Don’t mess with the computer
● Don’t use visual markings (colors, italics, fonts,...)
● Avoid spaces in names, use ‘_’ or CamelCase for multiple words
● Avoid special characters (*, @, ^,...)
Data Carpentry for Biologists

Messy:
Min temp
5
4.5
3.1*

Tidy:
min_temp  calibration_error
5         0
4.5       0
3.1       1
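
The asterisk in the messy column carries information that belongs in its own column. The sketch below moves it there, following the slide's tidy table in reading `*` as a calibration error flag; the helper names `clean_column_name` and `split_flag` are hypothetical.

```python
def clean_column_name(name):
    """'Min temp' -> 'min_temp': lower case, underscores instead of spaces."""
    return name.strip().lower().replace(" ", "_")


def split_flag(cell, marker="*"):
    """Return (value, flag) where flag is 1 if the cell ended with the marker."""
    text = cell.strip()
    flagged = text.endswith(marker)
    if flagged:
        text = text[:-len(marker)]
    return float(text), int(flagged)


print(clean_column_name("Min temp"))
for cell in ["5", "4.5", "3.1*"]:
    print(split_flag(cell))
```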

Slide 28

Slide 28 text

From messy to tidy
Be clear and consistent
● Use short meaningful names.
● Use consistent names, abbreviations, and capitalizations
● Use good null values (blanks, NA,...). Do not use numbers (0, -999)
● Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns

Messy:
d             s     a
26/02/2022    dior  9
26/02/2022    disp  1
May 24, 2022  DIor  -999
May 24, 2022  DISP  Missing

Slide 29

Slide 29 text

From messy to tidy
Be clear and consistent
● Use short meaningful names.
● Use consistent names, abbreviations, and capitalizations
● Use good null values (blanks, NA,...). Do not use numbers (0, -999)
● Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns

Messy:
d             s     a
26/02/2022    dior  9
26/02/2022    disp  1
May 24, 2022  DIor  -999
May 24, 2022  DISP  Missing

Tidy:
Date        Species  Abundance
2022-02-26  dior     9
2022-02-26  disp     1
2022-05-24  dior     NA
2022-05-24  disp     NA
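
The cleanup shown in the two tables can be scripted so that it is repeatable. This sketch assumes only the two date formats and the null codes that appear on the slide, and it relies on English month names (Python's default locale); `to_iso_date` and `tidy_row` are hypothetical helper names.

```python
from datetime import datetime

# Only the two formats seen in the messy table: 26/02/2022 and May 24, 2022.
DATE_FORMATS = ("%d/%m/%Y", "%B %d, %Y")
# Values treated as "no data" (compared in lower case).
NULL_CODES = {"", "-999", "missing", "na"}


def to_iso_date(text):
    """Convert a date in one of the known formats to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unknown date format: {text!r}")


def tidy_row(d, s, a):
    """Clean one row; a missing abundance becomes None (written out as NA)."""
    abundance = None if a.strip().lower() in NULL_CODES else int(a)
    return {
        "Date": to_iso_date(d),
        "Species": s.strip().lower(),
        "Abundance": abundance,
    }


print(tidy_row("May 24, 2022", "DIor", "-999"))
```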

Slide 30

Slide 30 text

documentation

Slide 31

Slide 31 text

Documentation & metadata
Why?
● Long term usability
● Avoid misinterpretation
● Collaboration & staff changes
● Save your memory for other stuff…
@TDXDigLibrary

Slide 32

Slide 32 text

Project documentation
Add a README.txt to your project folder
● Context
● People
● Sponsor(s)
● Data collection methods
● File organization
● Known problems, limitations, gaps
● Licenses
● How to cite

Slide 33

Slide 33 text

Data documentation
README.txt for each dataset
● Variable names, labels, data type, description
● Explain codes & abbreviations
● Code & reason for missing values
● Code used for derived data
● File format
● Software
● Data standards
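
A first draft of the variable list for a dataset README can be generated from the data file itself, so that no column is forgotten. This is only a sketch under assumptions: the function names, the `|`-separated layout, the crude type guessing and the `TODO` placeholders are all illustrative, and the descriptions still have to be written by hand.

```python
import csv
import io


def guess_type(values):
    """Guess 'integer', 'number' or 'text' from the non-missing values."""
    present = [v for v in values if v not in ("", "NA")]
    if not present:
        return "unknown"
    for label, cast in (("integer", int), ("number", float)):
        try:
            for value in present:
                cast(value)
            return label
        except ValueError:
            continue
    return "text"


def data_dictionary(csv_text):
    """Return a README-ready variable list for a CSV dataset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    lines = ["Variable | Data type | Description"]
    for name in reader.fieldnames or []:
        column = [row[name] for row in rows]
        lines.append(f"{name} | {guess_type(column)} | TODO")
    return "\n".join(lines)


EXAMPLE = "Date,Species,Abundance\n2022-02-26,dior,9\n2022-05-24,disp,NA\n"
print(data_dictionary(EXAMPLE))
```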

Slide 34

Slide 34 text

data preservation

Slide 35

Slide 35 text

Backup guidelines
● 3 - 2 - 1 rule: 3 copies - 2 different types of media - 1 offsite
● Apply a backup schedule
● Test file restores
● Do not use CDs or DVDs
● Use reliable backup media
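
Testing a restore can be as simple as comparing checksums of the original files with those of the restored copy. The sketch below is one possible approach, not a prescribed tool: `sha256_of` and `verify_backup` are hypothetical helpers, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Checksum a file in chunks so that large files do not fill the memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, backup_dir):
    """Return relative paths of files that are missing or differ in the backup."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(relative.as_posix())
    return problems
```

An empty list from `verify_backup("project", "restored_project")` means every original file was found in the restored copy with identical content.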

Slide 36

Slide 36 text

Storage media

Slide 37

Slide 37 text

data publication

Slide 38

Slide 38 text

Open data
● funders’ demands
● publishers’ rules
● innovation & valorization
● higher citation rates
● more visibility for your work
● collaboration
● application of your findings
● reduced costs
● public access to your findings

Slide 39

Slide 39 text

How not to “publish” your data
● Personal website
● University website

Slide 40

Slide 40 text

FAIR data
Make your data
● Findable
● Accessible
● Interoperable
● Reusable
Findable
● Persistent identifiers (DOI)
● Metadata
● Naming conventions
● Keywords
● Versioning
Accessible
● Choice of datasets
● Data repository
● Software, documentation
● Access status
● Retrievable data
● Metadata access
Interoperable
● Standards
● Vocabulary
● Methodology
● References
Reusable
● Licensing
● Provenance
● Community standards

Slide 41

Slide 41 text

Where to deposit data?
● Depends on data type
● Domain specific repositories
  ○ GBIF for species occurrences
  ○ Movebank for movement data
  ○ GenBank for genomic data
  ○ ...
● General repositories (Zenodo, Dryad,...)
● ORCID integration
● DOI for citing datasets

Slide 42

Slide 42 text

Thank you!