
Good enough practices in data management

Tanja Milotic
February 28, 2018

Research data management in biodiversity research. Data organization, data quality, documentation, preservation and publication using the FAIR principles.

Transcript

  1. Why should you even bother? Help your future self:
     • higher data quality, fewer mistakes
     • increased research efficiency
     • minimized risk of data loss, less frustration
     • saved time & money
     • avoid collecting the same data twice
     RDM concerns the organization of data, from its entry into the research cycle to the dissemination and archiving of valuable results.
  2. Why should you even bother? Motivation for the near future:
     • required by funders & publishers
     • increase your visibility (citations!)
     • easy data sharing
     “RDM is part of good research practice”
  3. Survey results
     • Awareness of institutional research data policy
     • Personal data management efforts
     • Data documentation
     https://hackmd.io/s/ryiIwtYIG
  4. Research data? Any information collected/created for the purpose of analysis, to verify scientific claims.
     • Digital(ized) data: observations, measurements, pictures, audio, models,...
     • Physical data: samples (soil, water, tissue,...), collections,...
  5. Data file formats
     • Non-proprietary (open source) formats
     • Easily reusable
     • Commonly used
     • Structured folder system
     • Naming conventions
     Some preferred file formats (example below):
     • Tabular data: .csv (comma-separated values), HDF5, netCDF, RDF
     • Text: .txt, .html, .xml, .odt, .rtf
     • Still images: .tif, JPEG 2000, .png, .pdf, .gif, .bmp, .svg
     “Love your data, and help others love it, too” (Goodman et al., 2014)
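
     (A minimal Python sketch of the point above: writing records to an open .csv format rather than a proprietary spreadsheet. The file name and field names are invented for illustration.)

     import csv

     # Hypothetical occurrence records; field names are examples only.
     records = [
         {"date": "2018-02-28", "species": "Vanessa atalanta", "count": 3},
         {"date": "2018-02-28", "species": "Pieris rapae", "count": 1},
     ]

     # Write to a non-proprietary, tool-agnostic format (.csv).
     with open("occurrences.csv", "w", newline="") as f:
         writer = csv.DictWriter(f, fieldnames=["date", "species", "count"])
         writer.writeheader()
         writer.writerows(records)
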
  6. Folder structure
     Use a well-defined folder structure (scaffolding sketch below):
     |- README.md    <- The top-level README describing the general layout of the project
     |- data
     |  |- raw       <- The original, read-only acquired raw data
     |  |- interim   <- Intermediate data that has been transformed
     |  |- processed <- Final data products, used in the report/paper/graphs
     |  |_ external  <- Additional third-party data resources (e.g. vector maps)
     |- reports      <- Reported outcome of the analysis as LaTeX, Word, Markdown,...
     |  |_ figures   <- Generated graphics and figures to be used in reporting
     |- src          <- Set of analysis scripts used in the analysis
     “Keep raw data raw!” (Hart et al., 2016)
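
     (One way to adopt this layout consistently is to script it once. The Python sketch below creates the folders and a stub README; the project name "my_project" is a placeholder.)

     from pathlib import Path

     # Folder layout from the slide above.
     FOLDERS = ["data/raw", "data/interim", "data/processed",
                "data/external", "reports/figures", "src"]

     root = Path("my_project")  # placeholder project name
     for folder in FOLDERS:
         (root / folder).mkdir(parents=True, exist_ok=True)

     # Stub top-level README describing the layout.
     (root / "README.md").write_text("# my_project\n\nLayout: data/, reports/, src/.\n")
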
  7. File names
     • Unique file names
     • Comprehensible
     • No special characters
     • Use _ instead of spaces
     • YYYY-MM-DD dates
     • Include initials, project, location, variable, content
     Example: 2018-02-28_TM_SAFRED_RDM.pdf (helper sketch below)
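
     (A small illustrative helper applying these conventions; the function name and the choice of parts are assumptions, not a standard.)

     from datetime import date

     def make_filename(initials, project, content, ext, when=None):
         # ISO dates sort chronologically; underscores instead of spaces.
         when = when or date.today()
         return "_".join([when.isoformat(), initials, project, content]) + "." + ext

     print(make_filename("TM", "SAFRED", "RDM", "pdf", date(2018, 2, 28)))
     # -> 2018-02-28_TM_SAFRED_RDM.pdf
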
  8. Keep track of changes
     In general:
     • Backup changes ASAP
     • Keep changes small
     • Share changes frequently
     Manually track changes:
     • Add a CHANGELOG.txt (example below)
     • Copy the entire project after large changes
     • File naming conventions: file_v1, file_v2
     Use a version control system (recommended):
     • Git
     • GitHub, Bitbucket, GitLab,...
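
     (For the manual option, a sketch that appends a dated entry to CHANGELOG.txt after each meaningful change; the message text is an example.)

     from datetime import datetime

     entry = "corrected species codes in raw data import"  # example message
     with open("CHANGELOG.txt", "a") as log:
         log.write(f"{datetime.now():%Y-%m-%d %H:%M}  {entry}\n")
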
  9. Data quality
     • Standardize data collection
     • Check data entry
     • Edit, clean, verify and validate raw data
     • Peer review
     • Documentation
     • Scripting (validation sketch below)
     “Data should be structured for analysis” (Hart et al., 2016)
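
     (A minimal scripted-validation sketch, assuming a CSV like the one written in the section 5 example; the column names and rules are illustrative.)

     import csv
     from datetime import datetime

     errors = []
     with open("occurrences.csv", newline="") as f:
         for i, row in enumerate(csv.DictReader(f), start=2):  # header = line 1
             try:
                 datetime.strptime(row["date"], "%Y-%m-%d")  # enforce ISO dates
             except ValueError:
                 errors.append(f"line {i}: bad date {row['date']!r}")
             if not row["species"].strip():
                 errors.append(f"line {i}: missing species name")
             if not row["count"].isdigit():
                 errors.append(f"line {i}: count is not a whole number")

     print("\n".join(errors) or "all rows valid")
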
  10. Documentation: why?
     • Long-term usability
     • Avoid misinterpretation
     • Collaboration & staff changes
     • Save your memory for other stuff…
     @TDXDigLibrary
  11. Project documentation
     Add a README.txt to your project folder covering (skeleton below):
     • Context
     • People
     • Sponsor(s)
     • Data collection methods
     • File organization
     • Known problems, limitations, gaps
     • Licenses
     • How to cite
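
     (A hypothetical skeleton of such a README.txt; every entry is a placeholder to fill in.)

     PROJECT      : <project name>
     CONTEXT      : <research question, study area, time span>
     PEOPLE       : <names, roles, contact details>
     SPONSOR(S)   : <funders, grant numbers>
     METHODS      : <how the data were collected>
     FILES        : <folder layout and file naming, see sections 6-7>
     KNOWN ISSUES : <problems, limitations, gaps>
     LICENSE      : <e.g. CC-BY-4.0>
     CITATION     : <how to cite this project and its data>
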
  12. Data documentation
     Add a README.txt for each dataset describing (example entry below):
     • Variable names, labels, data type, description
     • Explanation of codes & abbreviations
     • Code & reason for missing values
     • Code used for derived data
     • File format
     • Software
     • Data standards
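
     (An illustrative data-dictionary entry for a single variable; the names, codes and reasons are invented.)

     VARIABLE : count
     LABEL    : number of individuals observed
     TYPE     : integer, >= 0
     MISSING  : -999 = site not visited; -888 = field record illegible
     DERIVED  : no (recorded directly in the field)
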
  13. Data preservation
     We all know someone (who knows someone) who has lost data…
     • Hardware theft or misplacement
     • Hardware failure
     • Hardware damage
     • Software faults
     • Power failure
     • Viruses or hacking
     • Human errors
     The remedy: backups.
  14. Time for a backup strategy!
     • Decide which files to back up
     • Size?
     • Who is responsible?
     • Frequency of full backups
     • Frequency of partial backups
     • Organization of backup files
     “Have a systematic backup scheme” (Hart et al., 2016)
  15. Backup guidelines
     • 3-2-1 rule: 3 copies, 2 different types of media, 1 offsite
     • Apply a backup schedule
     • Test file restores
     • Do not use CDs or DVDs
     • Use reliable backup media
  16. Storing or archiving? Both!
     Storing & backing up:
     • Short-term projects
     • While still active
     • >1 location: hard drives, servers, laptops,...
     • Easy to modify files, but also to delete or lose them
     Archiving & publishing:
     • Long-term monitoring projects
     • After finishing short-term projects
     • Deposited in a digital repository
     • Safeguarded & preserved
  17. Open data
     • Funders’ demands
     • Publishers’ rules
     • Innovation & valorization
     • Higher citation rates
     • More visibility for your work
     • Collaboration
     • Application of your findings
     • Reduced costs
     • Public access to your findings
  18. FAIR data
     Make your data Findable, Accessible, Interoperable and Reusable:
     • Findable: persistent identifiers (DOI), metadata (sketch below), naming conventions, keywords, versioning
     • Accessible: choice of datasets, data repository, software & documentation, access status, retrievable data, metadata access
     • Interoperable: standards, vocabulary, methodology, references
     • Reusable: licensing, provenance, community standards
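
     (A sketch of a minimal machine-readable metadata record supporting findability and reuse; the keys are loosely modeled on what general repositories ask for, and all values, including the ORCID, are placeholders.)

     import json

     metadata = {
         "title": "Species occurrences, example project",
         "creators": [{"name": "Surname, Given", "orcid": "0000-0000-0000-0000"}],
         "publication_date": "2018-02-28",
         "keywords": ["biodiversity", "occurrences", "research data management"],
         "license": "CC-BY-4.0",  # reusable: explicit licensing
         "version": "1.0",        # findable: versioning
     }
     with open("metadata.json", "w") as f:
         json.dump(metadata, f, indent=2)
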
  19. Where to deposit data?
     • Depends on data type
     • Domain-specific repositories:
       ◦ GBIF for species occurrences
       ◦ GenBank for genomic data
       ◦ ...
     • General repositories (Zenodo, dataDryad,...)
     • ORCID integration
     • DOI for citing datasets