
Good enough practices in data management

Tanja Milotic
February 28, 2018

Research data management in biodiversity research. Data organization, data quality, documentation, preservation and publication using the FAIR principles.

Transcript

  1. Why should you even bother? Help your future self:
     • higher data quality, fewer mistakes
     • increased research efficiency
     • minimized risk of data loss, less frustration
     • saved time & money
     • avoid collecting the same data twice
     RDM concerns the organization of data, from its entry into the research cycle to the dissemination and archiving of valuable results.
  2. Why should you even bother? Motivation for the near future:
     • required by funders & publishers
     • increase your visibility (citations!)
     • easy data sharing
     “RDM is part of good research practice”
  3. Survey results
     • Awareness of institutional research data policy
     • Personal data management efforts
     • Data documentation
     https://hackmd.io/s/ryiIwtYIG
  4. Research data? Any information collected/created for the purpose of analysis, to verify scientific claims.
     • Digital(ized) data: observations, measurements, pictures, audio, models,...
     • Physical data: samples (soil, water, tissue,...), collections,...
  5. Data file formats
     • Non-proprietary (open source) formats
     • Easily reusable
     • Commonly used
     • Structured folder system
     • Naming conventions
     Some preferred file formats (example below):
     • Tabular data: .csv (comma-separated values), HDF5, netCDF, RDF
     • Text: .txt, .html, .xml, .odt, .rtf
     • Still images: .tif, JPEG 2000, .png, .pdf, .gif, .bmp, .svg
     “Love your data, and help others love it, too” (Goodman et al., 2014)
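
     (A minimal Python sketch of the point above: writing records to an open .csv format rather than a proprietary spreadsheet. The file name and field names are invented for illustration.)

     import csv

     # Hypothetical occurrence records; field names are examples only.
     records = [
         {"date": "2018-02-28", "species": "Vanessa atalanta", "count": 3},
         {"date": "2018-02-28", "species": "Pieris rapae", "count": 1},
     ]

     # Write to a non-proprietary, tool-agnostic format (.csv).
     with open("occurrences.csv", "w", newline="") as f:
         writer = csv.DictWriter(f, fieldnames=["date", "species", "count"])
         writer.writeheader()
         writer.writerows(records)
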
  6. Folder structure
     Use a well-defined folder structure (scaffolding sketch below):
     |- README.md    <- The top-level README describing the general layout of the project
     |- data
     |  |- raw       <- The original, read-only acquired raw data
     |  |- interim   <- Intermediate data that has been transformed
     |  |- processed <- Final data products, used in the report/paper/graphs
     |  |_ external  <- Additional third-party data resources (e.g. vector maps)
     |- reports      <- Reported outcome of the analysis as LaTeX, Word, Markdown,...
     |  |_ figures   <- Generated graphics and figures to be used in reporting
     |- src          <- Set of analysis scripts used in the analysis
     “Keep raw data raw!” (Hart et al., 2016)
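
     (One way to adopt this layout consistently is to script it once. The Python sketch below creates the folders and a stub README; the project name "my_project" is a placeholder.)

     from pathlib import Path

     # Folder layout from the slide above.
     FOLDERS = ["data/raw", "data/interim", "data/processed",
                "data/external", "reports/figures", "src"]

     root = Path("my_project")  # placeholder project name
     for folder in FOLDERS:
         (root / folder).mkdir(parents=True, exist_ok=True)

     # Stub top-level README describing the layout.
     (root / "README.md").write_text("# my_project\n\nLayout: data/, reports/, src/.\n")
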
  7. File names
     • Unique file names
     • Comprehensible
     • No special characters
     • Use _ instead of spaces
     • YYYY-MM-DD dates
     • Include initials, project, location, variable, content
     Example: 2018-02-28_TM_SAFRED_RDM.pdf (helper sketch below)
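
     (A small illustrative helper applying these conventions; the function name and the choice of parts are assumptions, not a standard.)

     from datetime import date

     def make_filename(initials, project, content, ext, when=None):
         # ISO dates sort chronologically; underscores instead of spaces.
         when = when or date.today()
         return "_".join([when.isoformat(), initials, project, content]) + "." + ext

     print(make_filename("TM", "SAFRED", "RDM", "pdf", date(2018, 2, 28)))
     # -> 2018-02-28_TM_SAFRED_RDM.pdf
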
  8. Keep track of changes
     In general:
     • Backup changes ASAP
     • Keep changes small
     • Share changes frequently
     Manually track changes:
     • Add a CHANGELOG.txt (example below)
     • Copy the entire project after large changes
     • File naming conventions: file_v1, file_v2
     Use a version control system (recommended):
     • Git
     • GitHub, Bitbucket, GitLab,...
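
     (For the manual option, a sketch that appends a dated entry to CHANGELOG.txt after each meaningful change; the message text is an example.)

     from datetime import datetime

     entry = "corrected species codes in raw data import"  # example message
     with open("CHANGELOG.txt", "a") as log:
         log.write(f"{datetime.now():%Y-%m-%d %H:%M}  {entry}\n")
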
  9. Data quality
     • Standardize data collection
     • Check data entry
     • Edit, clean, verify and validate raw data
     • Peer review
     • Documentation
     • Scripting (validation sketch below)
     “Data should be structured for analysis” (Hart et al., 2016)
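
     (A minimal scripted-validation sketch, assuming a CSV like the one written in the section 5 example; the column names and rules are illustrative.)

     import csv
     from datetime import datetime

     errors = []
     with open("occurrences.csv", newline="") as f:
         for i, row in enumerate(csv.DictReader(f), start=2):  # header = line 1
             try:
                 datetime.strptime(row["date"], "%Y-%m-%d")  # enforce ISO dates
             except ValueError:
                 errors.append(f"line {i}: bad date {row['date']!r}")
             if not row["species"].strip():
                 errors.append(f"line {i}: missing species name")
             if not row["count"].isdigit():
                 errors.append(f"line {i}: count is not a whole number")

     print("\n".join(errors) or "all rows valid")
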
  10. Documentation: why?
     • Long-term usability
     • Avoid misinterpretation
     • Collaboration & staff changes
     • Save your memory for other stuff…
     @TDXDigLibrary
  11. Project documentation
     Add a README.txt to your project folder covering (skeleton below):
     • Context
     • People
     • Sponsor(s)
     • Data collection methods
     • File organization
     • Known problems, limitations, gaps
     • Licenses
     • How to cite
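
     (A hypothetical skeleton of such a README.txt; every entry is a placeholder to fill in.)

     PROJECT      : <project name>
     CONTEXT      : <research question, study area, time span>
     PEOPLE       : <names, roles, contact details>
     SPONSOR(S)   : <funders, grant numbers>
     METHODS      : <how the data were collected>
     FILES        : <folder layout and file naming, see sections 6-7>
     KNOWN ISSUES : <problems, limitations, gaps>
     LICENSE      : <e.g. CC-BY-4.0>
     CITATION     : <how to cite this project and its data>
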
  12. Data documentation
     Add a README.txt for each dataset describing (example entry below):
     • Variable names, labels, data type, description
     • Explanation of codes & abbreviations
     • Code & reason for missing values
     • Code used for derived data
     • File format
     • Software
     • Data standards
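
     (An illustrative data-dictionary entry for a single variable; the names, codes and reasons are invented.)

     VARIABLE : count
     LABEL    : number of individuals observed
     TYPE     : integer, >= 0
     MISSING  : -999 = site not visited; -888 = field record illegible
     DERIVED  : no (recorded directly in the field)
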
  13. Data preservation
     We all know someone (who knows someone) who has lost data…
     • Hardware theft or misplacement
     • Hardware failure
     • Hardware damage
     • Software faults
     • Power failure
     • Viruses or hacking
     • Human errors
     The remedy: backups.
  14. Time for a backup strategy!
     • Decide which files to back up
     • Size?
     • Who is responsible?
     • Frequency of full backups
     • Frequency of partial backups
     • Organization of backup files
     “Have a systematic backup scheme” (Hart et al., 2016)
  15. Backup guidelines
     • 3-2-1 rule: 3 copies, 2 different types of media, 1 offsite
     • Apply a backup schedule
     • Test file restores
     • Do not use CDs or DVDs
     • Use reliable backup media
  16. Storing or archiving? Both!
     Storing & backing up:
     • Short-term projects
     • While still active
     • >1 location: hard drives, servers, laptops,...
     • Easy to modify files, but also to delete or lose them
     Archiving & publishing:
     • Long-term monitoring projects
     • After finishing short-term projects
     • Deposited in a digital repository
     • Safeguarded & preserved
  17. Open data
     • Funders’ demands
     • Publishers’ rules
     • Innovation & valorization
     • Higher citation rates
     • More visibility for your work
     • Collaboration
     • Application of your findings
     • Reduced costs
     • Public access to your findings
  18. FAIR data
     Make your data Findable, Accessible, Interoperable and Reusable:
     • Findable: persistent identifiers (DOI), metadata (sketch below), naming conventions, keywords, versioning
     • Accessible: choice of datasets, data repository, software & documentation, access status, retrievable data, metadata access
     • Interoperable: standards, vocabulary, methodology, references
     • Reusable: licensing, provenance, community standards
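
     (A sketch of a minimal machine-readable metadata record supporting findability and reuse; the keys are loosely modeled on what general repositories ask for, and all values, including the ORCID, are placeholders.)

     import json

     metadata = {
         "title": "Species occurrences, example project",
         "creators": [{"name": "Surname, Given", "orcid": "0000-0000-0000-0000"}],
         "publication_date": "2018-02-28",
         "keywords": ["biodiversity", "occurrences", "research data management"],
         "license": "CC-BY-4.0",  # reusable: explicit licensing
         "version": "1.0",        # findable: versioning
     }
     with open("metadata.json", "w") as f:
         json.dump(metadata, f, indent=2)
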
  19. Where to deposit data?
     • Depends on data type
     • Domain-specific repositories:
       ◦ GBIF for species occurrences
       ◦ GenBank for genomic data
       ◦ ...
     • General repositories (Zenodo, dataDryad,...)
     • ORCID integration
     • DOI for citing datasets