structure
• Internal, restricted use
• Not so well documented, sometimes hermetic
• All data fit in
• Local IDs are sufficient

DarwinCore
• Generic/universal terms
• Public access
• Well documented, understandable
• Data must follow standards
• Need for Global IDs

Advocate Open Data
Transform/clean Data
Publish Data
Darwin Core)
• No Unique IDs
• Missing integrity constraints
• Lots of typos (e.g. scientific names)
• Heterogeneous formats (e.g. dates, coordinates...)
• Special Coordinates System
• Redundancy and conflicting information
• Same field used for multiple purposes
• Variety of null values (‘’, ‘NA’, ‘null’...)
• Sensitive and/or untrusted data
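As an illustration of how typos in scientific names might be detected, here is a minimal Python sketch using the standard `difflib` module. The reference list `ACCEPTED_NAMES`, the function name `suggest_name` and the `cutoff` value are assumptions made for this example; a real workflow would check names against an authoritative checklist.

```python
from difflib import get_close_matches

# Hypothetical reference list of accepted names (illustration only).
ACCEPTED_NAMES = ["Araneus diadematus", "Pardosa amentata", "Tegenaria domestica"]

def suggest_name(raw, accepted=ACCEPTED_NAMES, cutoff=0.85):
    """Return (name, status), where status is 'exact', 'suggested' or 'unknown'."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace
    if cleaned in accepted:
        return cleaned, "exact"
    # Fuzzy lookup: keep only the single best candidate above the cutoff.
    matches = get_close_matches(cleaned, accepted, n=1, cutoff=cutoff)
    if matches:
        return matches[0], "suggested"
    return cleaned, "unknown"

for name in ["Araneus  diadematus", "Araneus diadematis", "Xyz abc"]:
    print(name, "->", suggest_name(name))
```

Suggested corrections are only candidates: names flagged as 'suggested' or 'unknown' are probably best reviewed with the data owners before anything is overwritten.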
collaborative effort of scientists and amateur spider enthusiasts. It gathers information on collection specimens and observations of spiders in Belgium from 1879 to the present.
… ........................................ Rename, restructure, use views
No Unique IDs ............................ Add UUIDs
Missing integrity constraints ............ Force integrity constraints
Lots of typos (e.g. scientific names) .... Detect typos and correct them
Heterogeneous formats .................... Be creative
Special Coordinates System ............... Convert to WGS84
Redundancy and conflicting information ... Ask data owners
Same field used for multiple purposes .... Split with regular expressions
Variety of null values (‘’, ‘NA’, ‘null’)  Unify to same null values
Sensitive and/or untrusted data .......... Blur sensitive data, do not publish untrusted data
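Several of the fixes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow actually used: the source field `specimens`, the set `NULL_VALUES`, the `COUNT_SEX` pattern and all function names are hypothetical. `occurrenceID`, `individualCount` and `sex` are Darwin Core terms.

```python
import re
import uuid

# Assumed spellings of "no value" found in the source data.
NULL_VALUES = {"", "na", "n/a", "null", "none"}

def unify_null(value):
    """Map the various null spellings to None; leave real values untouched."""
    if value is None or str(value).strip().lower() in NULL_VALUES:
        return None
    return value

# Hypothetical mixed-purpose field holding a count and a sex, e.g. "3 females".
COUNT_SEX = re.compile(r"^\s*(\d+)\s*(males?|females?)\s*$", re.IGNORECASE)

def split_count_sex(value):
    """Split a combined field into (count, sex); (None, None) if it does not match."""
    m = COUNT_SEX.match(value or "")
    if not m:
        return None, None
    return int(m.group(1)), m.group(2).lower().rstrip("s")

def blur(lat, lon, decimals=1):
    """Reduce coordinate precision by rounding (one simple way to blur)."""
    return round(lat, decimals), round(lon, decimals)

def clean_record(record):
    cleaned = {key: unify_null(value) for key, value in record.items()}
    # Generate the UUID once and store it: it must stay stable across republications.
    cleaned["occurrenceID"] = str(uuid.uuid4())
    count, sex = split_count_sex(cleaned.get("specimens"))
    cleaned["individualCount"], cleaned["sex"] = count, sex
    return cleaned

print(clean_record({"locality": "NA", "specimens": "2 males"}))
print(blur(50.8467, 4.3525))
```

Rounding is only one possible blurring strategy; how much precision to remove is a decision to take with the data owners.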
• This process is time consuming
• Involve data owners as early as possible
• Always keep the original/verbatim fields
• Document what you did
• Automate the process whenever possible
• Tools are faster and more reliable than humans
• Use your own palette of tools (DB, GIS, languages...)
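The advice to keep verbatim fields, document each step and automate can be combined in one small helper. The sketch below is illustrative only: the source field `date`, the list of date layouts and the function names are assumptions, and day-first dates are assumed for ambiguous values. `verbatimEventDate` and `eventDate` are Darwin Core terms.

```python
from datetime import datetime

# Assumed date layouts in the source data (day-first for ambiguous values).
LAYOUTS = ("%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y")

def parse_date(raw):
    """Try each assumed layout and return an ISO 8601 date string, or None."""
    for layout in LAYOUTS:
        try:
            return datetime.strptime(raw.strip(), layout).date().isoformat()
        except ValueError:
            continue
    return None

def transform(record, source, verbatim, target, func, log):
    """Apply func to record[source], keep the original value, and log the step."""
    out = dict(record)                  # never modify the input record
    out[verbatim] = record[source]      # the original value is preserved
    out[target] = func(record[source])
    log.append(f"{source} -> {target} via {func.__name__}")
    return out

log = []
rec = transform({"date": "03/07/1985"}, "date",
                "verbatimEventDate", "eventDate", parse_date, log)
print(rec)
print(log)
```

Because every change goes through one function, the log doubles as documentation of what was done, and the whole transformation can be re-run when the source data changes.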