Good enough practices for Research Data Management

Tanja Milotic
September 26, 2022

Transcript

  1. Good enough practices for
    research data management
    Tanja Milotić
    @oscibio
    Empowering biodiversity research II
    September 27th, 2022
    Brussels

  2. Would my colleague
    be able to take over my
    project if I suddenly
    disappeared?

  3. Data are hard
    to find & navigate

  4. Digital storage
    media are fragile

  5. Research data
    are undervalued &
    neglected

  6. Temporary
    involvement of
    staff & students

  7. Why should you even bother?
    Help your future self
    ● higher data quality, fewer mistakes
    ● increased research efficiency
    ● minimized risk of data loss, less frustration
    ● saved time & money
    ● no duplicate data collection

  8. Why should you even bother?
    Motivation for the near future
    ● required by funders & publishers
    ● increase your visibility (citations!)
    ● easy data sharing
    ● new collaboration opportunities
    “Research data
    management is part of
    good research practice”

  9. How to start with research data management?
    Research data management (RDM) concerns the organization of data, from its entry to the
    research cycle to the dissemination and archiving of valuable results
    Data management plan (DMP)

  10. Research life cycle

  11. Data life cycle

  12. Data life cycle

  13. Research data?
    Any information collected/created for the purpose of analysis to verify scientific claims
    Digital(ized) data: observations, measurements, pictures, audio, models,...
    Physical data: samples (soil, water, tissue,...), collections,...

  14. data organization

  15. Folder structure
    ● Use a well-defined folder structure
    |- README.md      <- top-level README describing the general layout of the project
    |- data           <- research data
    |  |- raw         <- original, read-only raw data as acquired
    |  |- interim     <- intermediate data that has been transformed
    |  |- processed   <- final data products, used in the report/paper/graphs
    |  |_ external    <- additional third-party data resources used (e.g. vector maps)
    |- reports        <- reported outcome of the analysis as LaTeX, Word, Markdown,...
    |  |_ figures     <- generated graphics and figures to be used in reporting
    |- src            <- analysis scripts used in the analysis
    “Keep raw data raw!”
    (Hart et al., 2016)
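
A folder layout like the one above can be created once per project with a small script. The following is a minimal sketch, not part of the original slides: the function name `create_project_skeleton` and the README placeholder text are assumptions made for illustration.

```python
from pathlib import Path

# Folder names follow the layout on the slide.
FOLDERS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "data/external",
    "reports/figures",
    "src",
]


def create_project_skeleton(root):
    """Create the folder layout and a placeholder README.md under root."""
    root = Path(root)
    for folder in FOLDERS:
        # parents=True also creates "data" and "reports"; exist_ok makes reruns safe.
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text (an assumption); replace with a real project description.
        readme.write_text(
            "# Project title\n\nDescribe the general layout of the project here.\n",
            encoding="utf-8",
        )
    return root
```

Calling `create_project_skeleton("my_project")` (a hypothetical project name) creates the folders without overwriting an existing README.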

  16. File names
    ● Unique file names
    ● Comprehensible
    ● No special characters
    ● Use _ instead of spaces
    ● YYYY-MM-DD dates
    ● Include initials, project, location, variable, content
    2022-09-27_EBR_RDM_practices.pdf
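
The naming rules above can be applied and checked automatically. This is a minimal sketch under stated assumptions: the helper names `build_file_name` and `follows_convention` and the exact regular expression are illustrative choices, not something prescribed by the slides.

```python
import re
from datetime import date

# Assumed convention: ISO date, then underscore-separated parts made of
# letters, digits or hyphens, then a file extension.
FILE_NAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(_[A-Za-z0-9-]+)+\.[A-Za-z0-9]+$")


def build_file_name(day, parts, extension):
    """Join a date and descriptive parts into a name without spaces or special characters."""
    cleaned = []
    for part in parts:
        # Replace every run of disallowed characters (spaces included) with "_".
        safe = re.sub(r"[^A-Za-z0-9-]+", "_", part.strip()).strip("_")
        if safe:
            cleaned.append(safe)
    return f"{day.isoformat()}_{'_'.join(cleaned)}.{extension.lstrip('.')}"


def follows_convention(name):
    """Return True if the file name matches the assumed convention."""
    return FILE_NAME_PATTERN.match(name) is not None
```

For example, `build_file_name(date(2022, 9, 27), ["EBR", "RDM", "practices"], "pdf")` reproduces the file name shown on the slide.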

  17. Keep track of changes
    In general
    ● Back up changes ASAP
    ● Keep changes small
    ● Share changes frequently
    Manually track changes
    ● Add a CHANGELOG.txt
    ● Copy entire project after large changes
    ● File naming conventions: file_v1, file_v2
    Version control system
    ● Git
    ● GitHub, Bitbucket, GitLab,...
    Use a version control system (recommended)

  18. Data file formats
    ● Non-proprietary (open) formats
    ● Easily reusable
    ● Commonly used
    Some preferred file formats
    Tabular data: .csv (comma-separated values), HDF5, NetCDF, RDF
    Text: .txt, html, xml, odt, rtf
    Still images: .tif, jpeg2000, png, pdf, gif, bmp, svg
    “Love your data, and
    help others love it, too”
    (Goodman et al., 2014)

  19. data quality

  20. Data quality
    ● Standardize data collection
    ● Check data entry
    ● Edit, clean, verify and validate raw data
    ● Peer review
    ● Documentation
    ● Scripting
    “Data should be
    structured for analysis”
    (Hart et al., 2016)

  21. Tidy data
    ● It is often said that 80% of data analysis is spent on cleaning and preparing data
    ● Tidy data: structuring datasets to facilitate analysis
    ● Keep data tidy from the start of the project
    (Wickham, 2014)

  23. From messy to tidy
    Make it a rectangle
    ● Only rows and columns, no additional structure
    ● One column for each type of information
    ● One row for each observation (data point)
    Messy:
    Plot  SpeciesA  SpeciesB
    1     3         1
    2     2         4
    Tidy:
    Plot  Species  Abundance
    1     A        3
    1     B        1
    2     A        2
    2     B        4
    (Data Carpentry for Biologists)
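
The reshaping from the messy to the tidy table can be scripted. The sketch below uses only the Python standard library; the function name `wide_to_long` and the column-to-label mapping are assumptions for illustration.

```python
def wide_to_long(rows, id_column, value_columns, name_column, value_name):
    """Turn one column per species into one row per observation.

    value_columns maps each wide column name to the label it gets in the
    new name_column (for example "SpeciesA" -> "A").
    """
    tidy = []
    for row in rows:
        for column, label in value_columns.items():
            tidy.append(
                {id_column: row[id_column], name_column: label, value_name: row[column]}
            )
    return tidy


# The messy table from the slide, one dictionary per row.
messy = [
    {"Plot": 1, "SpeciesA": 3, "SpeciesB": 1},
    {"Plot": 2, "SpeciesA": 2, "SpeciesB": 4},
]

tidy = wide_to_long(
    messy,
    id_column="Plot",
    value_columns={"SpeciesA": "A", "SpeciesB": "B"},
    name_column="Species",
    value_name="Abundance",
)
```

With pandas available, `DataFrame.melt` performs the same kind of wide-to-long reshaping.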

  25. From messy to tidy
    One cell, one value
    ● Every cell contains 1 piece of information
    Messy:
    Mass
    26g
    0.2kg
    Tidy:
    Mass  Unit
    26    g
    0.2   kg
    (Data Carpentry for Biologists)
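
Splitting a combined value such as `26g` into a number and a unit can be done with a regular expression. This is a minimal sketch: the pattern assumes a plain non-negative decimal number followed by a unit made of letters, and the function name `split_value_and_unit` is hypothetical.

```python
import re

# Assumed input shape: a non-negative decimal number, optional spaces, a unit of letters.
MASS_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*$")


def split_value_and_unit(text):
    """Return (value, unit) for strings such as "26g" or "0.2 kg"."""
    match = MASS_PATTERN.match(text)
    if match is None:
        # Fail loudly instead of guessing, so odd entries get checked by hand.
        raise ValueError(f"Cannot split {text!r} into value and unit")
    return float(match.group(1)), match.group(2)
```

Entries that do not match raise an error rather than being silently converted, which keeps data entry problems visible.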

  27. From messy to tidy
    Don’t mess with the computer
    ● Don’t use visual markings (colors, italics, fonts,...)
    ● Avoid spaces in names, use ‘_’ or CamelCase for multiple words
    ● Avoid special characters (*, @, ^,...)
    Messy:
    Min temp
    5
    4.5
    3.1*
    Tidy:
    min_temp  calibration_error
    5         0
    4.5       0
    3.1       1
    (Data Carpentry for Biologists)
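
A marker such as the asterisk in `3.1*` can be moved into its own column while cleaning. The sketch below assumes the marker always trails the number; the function name `split_flag` is hypothetical.

```python
def split_flag(text, flag="*"):
    """Return (value, flagged) where flagged is 1 if the marker was present, else 0."""
    text = text.strip()
    flagged = text.endswith(flag)
    # Remove the trailing marker before converting to a number.
    value = float(text.rstrip(flag))
    return value, int(flagged)


# The messy column from the slide.
raw_values = ["5", "4.5", "3.1*"]
rows = [split_flag(value) for value in raw_values]
```

Each tuple in `rows` holds the temperature and the flag, matching the `min_temp` and `calibration_error` columns of the tidy table.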

  29. From messy to tidy
    Be clear and consistent
    ● Use short meaningful names.
    ● Use consistent names, abbreviations, and capitalizations
    ● Use good null values (blanks, NA,...). Do not use numbers (0, -999)
    ● Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns
    Messy:
    d             s     a
    26/02/2022    dior  9
    26/02/2022    disp  1
    May 24, 2022  DIor  -999
    May 24, 2022  DISP  Missing
    Tidy:
    Date        Species  Abundance
    2022-02-26  dior     9
    2022-02-26  disp     1
    2022-05-24  dior     NA
    2022-05-24  disp     NA
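
The cleaning steps in this example (consistent dates, consistent codes, explicit missing values) can be sketched as follows. The accepted date formats, the list of missing-value markers, and the function names are assumptions for illustration; slash-separated dates are read as day/month/year, and month names are parsed as English (the default locale).

```python
from datetime import datetime

# Assumed input formats; extend the tuple if other notations occur in the data.
DATE_FORMATS = ("%d/%m/%Y", "%B %d, %Y", "%Y-%m-%d")
# Assumed markers that stand for a missing value (compared in lower case).
MISSING_MARKERS = {"", "-999", "missing", "na"}


def to_iso_date(text):
    """Convert a date written in one of DATE_FORMATS to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")


def clean_abundance(text):
    """Return the abundance as an integer, or None for a missing value."""
    if text.strip().lower() in MISSING_MARKERS:
        return None
    return int(text)


def tidy_row(d, s, a):
    """Build one tidy record with descriptive column names."""
    return {
        "Date": to_iso_date(d),
        "Species": s.strip().lower(),  # consistent capitalization of codes
        "Abundance": clean_abundance(a),
    }
```

`None` plays the role of `NA` here; when writing the result to a CSV file it can be written as an empty cell or as `NA`, as long as the choice is documented.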

  30. documentation

  31. Documentation & metadata
    Why?
    ● Long term usability
    ● Avoid misinterpretation
    ● Collaboration & staff changes
    ● Save your memory for other stuff…
    @TDXDigLibrary

  32. Project documentation
    Add a README.txt to your project folder
    ● Context
    ● People
    ● Sponsor(s)
    ● Data collection methods
    ● File organization
    ● Known problems, limitations, gaps
    ● Licenses
    ● How to cite

  33. Data documentation
    README.txt for each dataset
    ● Variable names, labels, data type, description
    ● Explain codes & abbreviations
    ● Code & reason for missing values
    ● Code used for derived data
    ● File format
    ● Software
    ● Data standards

  34. data preservation

  35. Backup guidelines
    ● 3-2-1 rule: 3 copies, 2 different types of media, 1 offsite
    ● Apply a backup schedule
    ● Test file restores
    ● Do not use CDs or DVDs
    ● Use reliable backup media
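
One way to test a file restore is to compare checksums of the original files with the restored copies. This is a minimal sketch, assuming both folders are reachable from the same machine; the function names `sha256_of` and `find_restore_problems` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_restore_problems(original_dir, restored_dir):
    """List relative paths that are missing from, or differ in, the restored folder."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every original file was restored with identical content. Files that exist only in the restored folder are not reported by this sketch.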

  36. Storage media

  37. data publication

  38. open data
    funders’ demands
    publishers’ rules
    innovation & valorization
    higher citation rates
    more visibility for your work
    collaboration
    application of your findings
    reduced costs
    public access to your findings

  39. How not to “publish” your data
    Personal website
    University website

  40. FAIR data
    Make your data
    ● Findable
    ● Accessible
    ● Interoperable
    ● Reusable
    Findable
    ● Persistent identifiers (DOI)
    ● Metadata
    ● Naming conventions
    ● Keywords
    ● Versioning
    Accessible
    ● Choice of datasets
    ● Data repository
    ● Software, documentation
    ● Access status
    ● Retrievable data
    ● Metadata access
    Interoperable
    ● Standards
    ● Vocabulary
    ● Methodology
    ● References
    Reusable
    ● Licensing
    ● Provenance
    ● Community standards

  41. Where to deposit data?
    ● Depends on data type
    ● Domain specific repositories
    ○ GBIF for species occurrences
    ○ Movebank for movement data
    ○ GenBank for genomic data
    ○ ...
    ● General repositories (Zenodo, Dryad,...)
    ● ORCID integration
    ● DOI for citing datasets

  42. Thank you!
