Organization of research data. How to bring order in the chaos?

Tanja Milotic
January 15, 2019

How to bring order in the chaos of research data? Practical tips and tricks for organizing data in data files, and structuring data files in folders.

Transcript

  1. Organization of research data. How to bring order in the chaos?
     Tanja Milotić (@milotict), January 15th, 2019, VVBAD, Brussels, @LifeWatchINBO
  2. Why should you even bother? Help your future self
     • higher data quality, fewer mistakes
     • increased research efficiency
     • minimized risk of data loss, fewer frustrations
     • saved time & money
     • avoid collecting duplicate data
  3. Why should you even bother? Motivation for the near future
     • bad data management leads to bad science!
     • required by funders & publishers
     • increase your visibility (citations!)
     • easy data sharing
     • increased reproducibility
     "Research data management is part of good research practice"
  4. Positive evolution!
     • Increased understanding of the usefulness of RDM
     • Application of RDM techniques
     • Needed for cooperation
     − Little knowledge
     − Little institutional support
     − Lack of time
  5. Existing datasets
     • Start of a research project: search for existing datasets
     • Check conditions for re-use
       ◦ Costs?
       ◦ Restrictions on re-use? (e.g., no commercial applications, no manipulations,...)
       ◦ Sharing results and newly compiled or processed datasets
       ◦ Citation of the data owner?
       ◦ Permission for re-use?
     • Check quality and usability of data
       ◦ Source?
       ◦ Data collection methods?
       ◦ Clear metadata?
       ◦ Data compatibility in the project?
  6. Citation of existing research data
     • Always cite existing data sources
     • According to scientific conventions of your field of research
     • Citation suggestion often in metadata
     • Reproducibility of results
       ◦ Subsets of data
       ◦ Version of data source
       ◦ Download date and location
       ◦ How data were acquired (download from website, personal communication with data owner)
       ◦ Data publisher, data collector
       ◦ DOI (digital object identifier)
  7. Collection of new data
     • Observational data
       ◦ Unique (location, time,...)
       ◦ Species occurrences, behaviour, climate, archeological excavations,...
     • Experimental data
       ◦ Repeatable
       ◦ Manipulation of environmental variables, psychological tests,...
     • Simulations
       ◦ Reproducible
       ◦ Models (populations, climate, economy,...)
     • Data processing
       ◦ Combination and manipulation of datasets
       ◦ Big data
     • Literature search
  8. Research data? Any information collected/created for the purpose of analysis to verify scientific claims
     • Digital(ized) data: observations, measurements, pictures, audio, models,...
     • Physical data: samples (soil, water, tissue,...), collections,...
  9. Organising data and files at different levels
     • Data file format
     • Organising data in files
     • Naming data files
     • Versions of data files
     • Organising (data) files in (sub)folders
  10. Data file formats
     • Non-proprietary (open source) formats
     • Commonly used (generic data types)
     • Adopted by the research community
     • Archive data in open formats
     • Specific (proprietary) software often allows export to open formats (see the sketch below)
     • Use a format compatible with computer processing
     • Proprietary file formats: README.txt file with software version, company
     "Love your data, and help others love it, too" (Goodman et al., 2014)
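As a concrete illustration of exporting a proprietary format to an open one, here is a minimal sketch using Python with pandas; it is not part of the original deck, the file names are hypothetical, and an Excel-capable engine such as openpyxl is assumed to be installed.

```python
# Hypothetical example: convert a proprietary spreadsheet to an open, archivable format.
import pandas as pd

# Read the proprietary file (requires an Excel engine such as openpyxl)...
df = pd.read_excel("field_measurements.xlsx", sheet_name=0)

# ...and archive it as plain CSV, readable by virtually any tool.
df.to_csv("field_measurements.csv", index=False)
```

The CSV copy, together with a README.txt noting the originating software and version, keeps the data usable long after the original software is gone.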
  11. Some preferred file formats
     Tabular data                     .csv
     Hierarchically structured data   .HDF5, .netcdf, .rdf
     Text                             .txt, .html, .xml, .odt, .rtf
     Still images                     .tif, .jpeg2000, .png, .pdf, .gif, .bmp, .svg
     Moving images                    .MOV, .MPEG, .AVI, .MXF
     Sounds                           .FLAC, .WAVE, .AIFF, .MP3, .MXF
     Containers                       .TAR, .GZIP, .ZIP
     Databases                        .XML, .CSV
     Geospatial                       .kml, .geojson, .geoTIFF, .netCDF
     Statistics                       .ASCII, .CSV
     Web archive                      .WARC
  12. Focus on re-use
     • Standard data formats
     • Readable by programming languages (R, Python,...)
     • Reduced data loss
     • Fewer mistakes during conversion
     • Keep data machine readable -> avoid:
       ◦ Data embedded in text files (pdf,...)
       ◦ Scanned images
       ◦ Tables from paper sources
  13. Organising data in files
     • Data should be structured for analysis
     • Save time later on!
     • Write code for humans, write data for computers
       ◦ Easy import
       ◦ Easy manipulation
       ◦ Higher reproducibility and re-use
     • Tidy data concept for tabular data
       ◦ Rows = observations
       ◦ Columns = variables
       ◦ Table = observational unit
  14. Untidy data

     Individual     Treatment A   Treatment B
     Individual 1   -             15
     Individual 2   2             233
     Individual 3   5.5           10

  15.-18. Untidy data: the same data in a transposed layout; the build-up highlights the individuals, the treatments and the results

                    Individual 1   Individual 2   Individual 3
     Treatment A    -              2              5.5
     Treatment B    15             233            10
  19.-22. Tidy data

     Individual     Treatment   Result
     Individual 1   A
     Individual 2   A           2
     Individual 3   A           5.5
     Individual 1   B           15
     Individual 2   B           233
     Individual 3   B           10

     • Each variable forms a column and contains a value
     • Each observation forms a row
     • Each cell contains a value and each column contains values of the same data type

  23. Tidy data: each observational unit forms a table

     Individual     Country   decimalLatitude   decimalLongitude
     Individual 1   BE        51.154621         3.454508
     Individual 2   NL        51.404667         4.383826
     Individual 3   BE        51.154621         3.454508
  24. Tidy data (see the reshaping sketch below)
     • Fixed variables (Individual, Treatment)
       ◦ Experimental design
       ◦ Known in advance
       ◦ Put on the left side of the table
     • Measured variables (Result)
       ◦ Measured during the study
       ◦ Put on the right side of the table

     Individual     Treatment   Result
     Individual 1   A
     Individual 2   A           2
     Individual 3   A           5.5
     Individual 1   B           15
     Individual 2   B           233
     Individual 3   B           10
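To make the tidy/untidy contrast concrete, here is a minimal sketch (Python with pandas assumed; not part of the original deck) that reshapes the untidy table from slide 14 into the tidy layout of slides 19-24.

```python
import pandas as pd

# The untidy, wide table from slide 14 ("-" is treated as a missing value).
untidy = pd.DataFrame({
    "Individual": ["Individual 1", "Individual 2", "Individual 3"],
    "Treatment A": [None, 2, 5.5],
    "Treatment B": [15, 233, 10],
})

# One row per observation: Individual, Treatment, Result.
tidy = untidy.melt(id_vars="Individual", var_name="Treatment", value_name="Result")
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "")

print(tidy)
# Six rows: (Individual 1, A, NaN), (Individual 2, A, 2.0), (Individual 3, A, 5.5),
#           (Individual 1, B, 15.0), (Individual 2, B, 233.0), (Individual 3, B, 10.0)
```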
  25. Formatting errors: not filling in zeros
     • What is the difference between a blank value (empty cell) and a 0 value?
     • To a computer:
       ◦ blank value = unmeasured variable
       ◦ 0 = a measured value of 0
     • 0 values matter in the analysis (see the sketch below)
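A small illustration of why this distinction matters, as a sketch in Python with pandas (the variable name is hypothetical):

```python
import pandas as pd

# Three sampling events: zero individuals found, not measured, three found.
counts = pd.Series([0, None, 3], name="bird_count")

print(counts.mean())             # 1.5 -> the blank (NaN) is skipped, not treated as 0
print(counts.fillna(0).mean())   # 1.0 -> only valid if the blank really means "measured, found none"
print(counts.isna().sum())       # 1   -> one unmeasured value
```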
  26. Software tools
     • Use code to perform repetitive data cleaning tasks (e.g., R, Python,...)
     • Exploratory (R)
     • Package tidyverse in R
     • Never manipulate the original file! (see the sketch below)
     • Tidy data are easier
       ◦ to manipulate (filter, reorder, transform, aggregate, sort,...)
       ◦ to visualize (graphs, charts,...)
       ◦ to use in (statistical) models
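A minimal sketch of this workflow in Python with pandas (file, column and folder names are hypothetical): the raw file is only read, and every derived product is written elsewhere.

```python
import pandas as pd

# Read the raw data (never overwritten)...
raw = pd.read_csv("data/raw/treatments.csv")

# ...clean and summarise a copy with code, so every step is repeatable...
summary = (
    raw
    .dropna(subset=["Result"])                       # drop unmeasured results
    .groupby("Treatment", as_index=False)["Result"]
    .mean()
)

# ...and write the result as a separate, derived product.
summary.to_csv("data/processed/treatment_means.csv", index=False)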
  27. File names
     • Unique file names
     • Comprehensible
     • No special characters
     • Use _ instead of spaces
     • Use YYYY-MM-DD for dates
     • Include initials, project, location, variable, content
     Example: 2018-02-28_TM_SAFRED_RDM.pdf (see the sketch below)
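A small sketch of such a naming convention as a Python helper (the function and its arguments are hypothetical, not part of the deck):

```python
from datetime import date

def make_filename(initials: str, project: str, content: str, ext: str) -> str:
    """Build a file name like 2018-02-28_TM_SAFRED_RDM.pdf."""
    stem = "_".join([date.today().isoformat(), initials, project, content])  # YYYY-MM-DD first
    return f"{stem}.{ext}"

print(make_filename("TM", "SAFRED", "RDM", "pdf"))
```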
  28. Keeping track of changes
     In general
     • Back up changes ASAP
     • Keep changes small
     • Share changes frequently
     Manually track changes
     • Add a CHANGELOG.txt
     • Copy the entire project after large changes
     • File naming conventions: file_v1, file_v2
     Version control system (recommended)
     • Git
     • GitHub, Bitbucket, GitLab,...
  29. Folder structure
     • Use a well-defined folder structure (see the sketch below)

     |- README.md      <- The top-level README describing the general layout of the project
     |- data           <- research data
     |  |- raw         <- The original, read-only acquired raw data
     |  |- interim     <- Intermediate data that has been transformed
     |  |- processed   <- Final data products, used in the report/paper/graphs
     |  |_ external    <- Additional third-party data resources (e.g. vector maps)
     |- reports        <- Reported outcome of the analysis as LaTeX, Word, markdown,...
     |  |_ figures     <- Generated graphics and figures to be used in reporting
     |- src            <- Set of analysis scripts used in the analysis

     "Keep raw data raw!" (Hart et al., 2016)
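As an illustration, the skeleton above can be created with a few lines of Python standard library code; this is a sketch, and the project name is hypothetical.

```python
from pathlib import Path

FOLDERS = [
    "data/raw", "data/interim", "data/processed", "data/external",
    "reports/figures",
    "src",
]

def create_project(root: str) -> None:
    """Create the folder skeleton from slide 29, plus an empty top-level README."""
    base = Path(root)
    for folder in FOLDERS:
        (base / folder).mkdir(parents=True, exist_ok=True)
    (base / "README.md").touch()

create_project("my_project")
```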
  30. Documentation & metadata: why?
     • Long-term usability
     • Avoid misinterpretation
     • Collaboration & staff changes
     • Save your memory for other stuff… @TDXDigLibrary
  31. The concept of metadata is not new... Sidereus nuncius (Galilei, 1610)
     • Data: position of Jupiter + moons (drawings)
     • Metadata: observation timing, weather, telescope properties
     • Text: methodology, analyses, conclusions
     Goodman et al. (2014)
  32. Project documentation: add a README.txt to your project folder
     • Context
     • People
     • Sponsor(s)
     • Data collection methods
     • File organization
     • Known problems, limitations, gaps
     • Licenses
     • How to cite
  33. Data documentation: a README.txt for each dataset
     • Descriptive metadata
       ◦ Author, title, abstract, date
       ◦ Context (location, methods,...)
     • Structural metadata
       ◦ Variable names, labels, data type, description
       ◦ Explain codes & abbreviations
       ◦ Code & reason for missing values
       ◦ Code used for derived data
  34. Data documentation: a README.txt for each dataset
     • Technical metadata
       ◦ File format
       ◦ Used software and hardware
       ◦ Encryption
       ◦ Version
       ◦ Metadata standards
     • Administrative metadata
       ◦ Licenses
       ◦ Citation
  35. References
     Goodman, Pepe, Blocker, Borgman, Cranmer, Crosas, Di Stefano, Gil, Groth, Hedstrom, Hogg, Kashyap, Mahabal, Siemiginowska, Slavkovic (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
     Hart, Barmby, LeBauer, Michonneau, Mount, Mulrooney, Poisot, Woo, Zimmerman, Hollister (2016) Ten Simple Rules for Digital Data Storage. PLoS Comput Biol 12(10): e1005097. doi:10.1371/journal.pcbi.1005097
     Lowndes, Best, Scarborough, Afflerbach, Frazier, O’Hara, Jiang, Halpern (2017) Our path to better science in less time using open data science tools. Nature Ecology & Evolution 1: 0160. doi:10.1038/s41559-017-0160
     White, Baldridge, Brym, Locey, McGlinn, Supp (2013) Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6(2): 1–10. doi:10.4033/iee.2013.6b.6.f