DUW_01-Intro.pdf

Data Use Workshop 20-21 September 2018, Brussels Introduction By André
Heughebaert

Summary • Introduction • Open Data Principles • Biodiversity Standards
• Data Quality Principles • Discover Workshop Tools

Introduction

Before we start... Who? Why? How? What?

Objectives This workshop will increase your skills in open biodiversity
data and is a good opportunity to discover tools for data management and visualisation. You will learn how to combine your research data with data made freely available by thousands of museums/institutions/scientists worldwide. At the end of the workshop, participants will be familiar with the GBIF.org portal, open data principles, and some exciting data visualisation tools.

Practicalities • Wifi: network=Belspo-Guest, password=belspo1050 • Repository: https://drive.google.com/open?id=1OhAJiK6kmxOcssgcIKsAUBLdile558vf (see your email) • Timing: 09:00-17:00
• Coffee breaks: 11:00 & 15:30 (flexible) • Lunch break: 12:30-13:30 • Principles: interact, share your experience, practice and... enjoy

Agenda overview Day 1 - Thursday Day 2 - Friday
Introduction Data Handling with OpenRefine, R & QGIS Download data from GBIF.org Species Distribution Modeling Wrap-up & Conclusions

4 workgroups of 1 trainer + 3-4 participants André Dimi
Nicolas Max

Preparatory Survey - Biodiversity data • Occurrence and abundance data
of bees and other pollinators, and of flowering plants • Mainly data for biogeography, impact assessment, management plans and data analysis (e.g. for popularisation) • Lepidoptera of Belgium and Lepidoptera of Africa • Numbers of migrating amphibians as a function of meteorological conditions and distance to overwintering location • Primarily abundance data, plus occupancy data • Data about the distribution of Belgian species

Preparatory Survey - your difficulties/challenges • To access, map and
model the distribution of bees and plants with GBIF records (e.g. in Brussels) • Learning new programming languages, finding reliable dataset resources • Assembling the information from a multitude of sources, mainly literature • Finding the meta data for the years 2011-2018, for the days from January to April of each year, for a location as close as possible to 50.8155 N 4.4404 W • Not familiar with the techniques of data use • My school curriculum did not familiarize me with these tools, except QGIS.

Tools for this Data Use Workshop:

Open Data Principles

Open Data Principles • How open are your data? •
Open data in a nutshell • Global Biodiversity Information Facility • FAIR Principles • Creative Commons Licenses

How open are your data?
• Private: Home-made, often cryptic. Not very well organized. Not documented. (Unassessed quality.) My treasure!
• Circulated: Home-made. Not very well organized. Not documented. (Unassessed quality.) Emailed to contact(s).
• Shared: (Not fully documented.) (Unassessed quality.) With your colleagues.
• Open: (Unassessed quality.) Open Data.
See “Open/Shared/Closed: the world of data” from Open Data Institute

Open Data in a nutshell Capture Re-use sum Document &
Clean Discover Publish Data Life Cycle

Global Biodiversity Information Facility • 58 country participants, 37 organisations
• 1,264 publishing institutions • 40,000 datasets • 1 billion occurrence records • 160,000 user sessions/month • 130 billion records downloaded/month • 2 peer-reviewed articles/day on data re-use

Discover and download data through GBIF.org data portal.
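
Besides the portal, GBIF data can also be reached programmatically. As a minimal sketch, the snippet below only builds a request URL for the public GBIF occurrence search API; the helper name `occurrence_search_url` and the example species and country are illustrative assumptions, not part of the workshop material, and no network call is made.

```python
from urllib.parse import urlencode

# Public GBIF API endpoint for occurrence search (v1).
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrence_search_url(scientific_name, country=None, limit=20, offset=0):
    """Build a GBIF occurrence search URL (hypothetical helper, no request is sent)."""
    params = {"scientificName": scientific_name, "limit": limit, "offset": offset}
    if country:
        # ISO 3166-1 alpha-2 country code, e.g. "BE" for Belgium.
        params["country"] = country
    return f"{GBIF_SEARCH}?{urlencode(params)}"


url = occurrence_search_url("Formica picea", country="BE", limit=5)
print(url)
```

Pasting the printed URL into a browser should return a JSON page of matching records; for large extracts, the download function of the portal remains the better route.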

FAIR Principles “Good data management is not a goal in
itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.” Mark D. Wilkinson et al. In 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data.

Creative Commons Licenses Tell researchers what they can do with
your data

Biodiversity Standards

Biodiversity Standards • Everyday Standards • Darwin Core terms •
Darwin Core star schema • Type of data supported • Other Biodiversity Standards

Standards “Standardisation does not mean that we all wear the same
color and weave of cloth, eat standard sandwiches, or live in standard rooms with standard furnishing. Homes of infinite variety of design are built with a few types of bricks, and with lumber of standard sizes, and with water and heating pipes and fitting of standard dimensions.” W. Edwards Deming “Let’s Agree to Disagree.”

Everyday Standards Some examples of standards that you use often:
• Units of Measurement (Metric, Imperial) • Numeral Systems (Hindu-Arabic; Roman Numerals) • Alphabets • Languages • Emojis • Postal Addressing • Morse Code “The main purpose for standards is to create a framework to ease sharing. They should provide clarity and help communication.”

Everyday Standards e.g. Lat. & Long. • measurement - geographic
coordinates • format - degrees, minutes, seconds • numeric system - sexagesimal • numbers - Hindu-Arabic • language - English • alphabet - Latin • symbols - typography • font - Arial 13° 51' 3" S 171° 45' 5" W
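
The coordinate above combines several standards at once, and most tools expect it in decimal degrees instead. As a small sketch (the function name `dms_to_decimal` is an illustrative choice), the conversion divides minutes by 60 and seconds by 3600, and makes southern and western values negative:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# The example from the slide: 13° 51' 3" S 171° 45' 5" W
latitude = dms_to_decimal(13, 51, 3, "S")
longitude = dms_to_decimal(171, 45, 5, "W")
print(round(latitude, 6), round(longitude, 6))
```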

Darwin Core Standard “List of fields and their definitions, as
they relate to biodiversity data.” Discover all Darwin Core terms on TDWG website. institutionID collectionID datasetID institutionCode collectionCode datasetName ownerInstitutionCode basisOfRecord informationWithheld dataGeneralizations dynamicProperties occurrenceID catalogNumber recordNumber recordedBy individualCount organismQuantity organismQuantityType sex lifeStage reproductiveCondition behavior establishmentMeans occurrenceStatus preparations disposition associatedMedia associatedReferences associatedSequences associatedTaxa otherCatalogNumbers occurrenceRemarks organismID organismName organismScope associatedOccurrences associatedOrganisms previousIdentifications organismRemarks materialSampleID eventID parentEventID fieldNumber eventDate eventTime startDayOfYear endDayOfYear year month day verbatimEventDate habitat samplingProtocol sampleSizeValue sampleSizeUnit samplingEffort fieldNotes eventRemarks locationID higherGeographyID higherGeography continent waterBody islandGroup island country countryCode stateProvince county municipality locality verbatimLocality minimumElevationInMeters maximumElevationInMeters verbatimElevation minimumDepthInMeters maximumDepthInMeters verbatimDepth minimumDistanceAboveSurfaceInMeters maximumDistanceAboveSurfaceInMeters locationAccordingTo locationRemarks decimalLatitude decimalLongitude geodeticDatum coordinateUncertaintyInMeters coordinatePrecision pointRadiusSpatialFit verbatimCoordinates verbatimLatitude verbatimLongitude verbatimCoordinateSystem verbatimSRS footprintWKT footprintSRS footprintSpatialFit georeferencedBy georeferencedDate georeferenceProtocol georeferenceSources georeferenceVerificationStatus georeferenceRemarks geologicalContextID earliestEonOrLowestEonothem latestEonOrHighestEonothem earliestEraOrLowestErathem latestEraOrHighestErathem earliestPeriodOrLowestSystem latestPeriodOrHighestSystem earliestEpochOrLowestSeries latestEpochOrHighestSeries earliestAgeOrLowestStage latestAgeOrHighestStage 
lowestBiostratigraphicZone highestBiostratigraphicZone lithostratigraphicTerms group formation member bed identificationID identificationQualifier typeStatus identifiedBy dateIdentified identificationReferences identificationVerificationStatus identificationRemarks taxonID scientificNameID acceptedNameUsageID parentNameUsageID originalNameUsageID nameAccordingToID namePublishedInID taxonConceptID scientificName acceptedNameUsage parentNameUsage originalNameUsage nameAccordingTo namePublishedIn namePublishedInYear higherClassification kingdom phylum class order family genus subgenus specificEpithet infraspecificEpithet taxonRank verbatimTaxonRank scientificNameAuthorship vernacularName nomenclaturalCode taxonomicStatus nomenclaturalStatus taxonRemarks
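
In practice, using the standard often starts with renaming home-made column headers to Darwin Core terms. The sketch below assumes a hypothetical spreadsheet with the headers `Species`, `Lat`, `Long`, `Date` and `Collector`; only the target names on the right-hand side are real Darwin Core terms from the list above.

```python
# Hypothetical home-made headers mapped to Darwin Core terms.
COLUMN_TO_DWC = {
    "Species": "scientificName",
    "Lat": "decimalLatitude",
    "Long": "decimalLongitude",
    "Date": "eventDate",
    "Collector": "recordedBy",
}


def to_darwin_core(record):
    """Rename known columns; report the ones that still need a decision."""
    renamed, unmapped = {}, []
    for column, value in record.items():
        term = COLUMN_TO_DWC.get(column)
        if term is None:
            unmapped.append(column)
        else:
            renamed[term] = value
    return renamed, unmapped


row = {"Species": "Formica picea", "Lat": "50.5", "Long": "5.0", "Notes": "under a stone"}
dwc_row, leftovers = to_darwin_core(row)
print(dwc_row)
print("Still to map:", leftovers)
```

Reporting the unmapped columns instead of silently dropping them keeps the decision (for example, `Notes` could become `occurrenceRemarks`) with the data owner.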

Darwin Core term : organismID Identifier: http://rs.tdwg.org/dwc/terms/organismID Class: http://rs.tdwg.org/dwc/terms/Organism Definition:
An identifier for the Organism instance (as opposed to a particular digital record of the Organism). May be a globally unique identifier or an identifier specific to the data set. Comment: For discussion see http://terms.tdwg.org/wiki/dwc:organismID Details: organismID

Darwin Core term : locality Identifier: http://rs.tdwg.org/dwc/terms/locality Class: http://purl.org/dc/terms/Location Definition:
The specific description of the place. Less specific geographic information can be provided in other geographic terms (higherGeography, continent, country, stateProvince, county, municipality, waterBody, island, islandGroup). This term may contain information modified from the original to correct perceived errors or standardize the description. Comment: Example: "Bariloche, 25 km NNE via Ruta Nacional 40 (=Ruta 237)". For discussion see http://terms.tdwg.org/wiki/dwc:locality Details: locality

Darwin Core star schema • Central ‘Core’ Entity • +
0 or more extensions (that always relate to the Core entity) • + metadata (EML) DwCArchive (zip) Enough to describe some relations But not a fully relational model
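
The star schema can be pictured as one core table plus extension tables whose rows point back to a core record. The following sketch is an illustration only: the sample rows are invented, and the extension column is named `coreid` here as an assumption (in a real archive the linking column is declared in the archive descriptor rather than fixed by a header name).

```python
import csv
import io
from collections import defaultdict

# Invented sample data: an Occurrence core and a Measurements & Facts extension.
CORE_CSV = """occurrenceID,scientificName,eventDate
occ-1,Formica picea,2012-06-01
occ-2,Lasius platythorax,2012-06-02
"""

EXTENSION_CSV = """coreid,measurementType,measurementValue
occ-1,body length (mm),5.1
occ-1,head width (mm),1.2
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


core_rows = read_rows(CORE_CSV)
extension_rows = read_rows(EXTENSION_CSV)

# Group extension rows by the core record they refer to.
by_core_id = defaultdict(list)
for row in extension_rows:
    by_core_id[row["coreid"]].append(row)

# Each core record gets 0 or more extension rows.
linked = {row["occurrenceID"]: by_core_id.get(row["occurrenceID"], []) for row in core_rows}
for occurrence_id, measurements in linked.items():
    print(occurrence_id, len(measurements), "measurement(s)")
```

Extensions can only point to the core, never to each other, which is why the model describes some relations but is not fully relational.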

3 types of core entities supported

Occurrence Core 0 or more extensions: • Geographical • Media
• Measurements & Facts • etc

Taxon Core 0 or more extensions: • Description • Vernacular
• Occurrences • Literature • etc

Event Core 0 or more extensions: • Relevé • Measurements
& Facts • Occurrences • etc

Other Biodiversity Standards of concern ABCD A standard equivalent to
Darwin Core. Access to Biological Collections Data task group. 2007. Access to Biological Collection Data (ABCD), Version 2.06. Biodiversity Information Standards (TDWG) http://www.tdwg.org/standards/115 EML Ecological Metadata Language (EML) is a metadata standard developed for the earth, environmental and ecological sciences, and particularly for the ecology discipline. It is based on prior work done by the Ecological Society of America and associated efforts (Michener et al., 1997, Ecological Applications).

Data Quality Principles

Data Quality Principles • What is quality? • Fitness for
use • Correctness • Consistency • Data Cleaning

Quality is relative to the usage Theatrum Orbis Terrarum by
Abraham Ortelius (1527-1598). Image from the collections of the State Library of New South Wales.

Fitness for use “...data quality is related to use and
cannot be assessed independently of the user. In a database, the data have no actual quality or value (Dalcin 2004); they only have potential value that is realized when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers’ needs (English 1999).” Arthur Chapman

Fitness for use in the real world How well does
a thing do what it’s supposed to and what is that anyway? A shoemaker creates clogs for the purpose of covering a person’s feet.

Did he know that these folks would dance in them?
>Probably yes

Or that this gardener would use them as plant pots?
>Maybe not

Fitness for use in Biodiversity data Data quality is a relative concept that depends on the
use of these data. "The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data." Chrisman, 1991 The genus level will be sufficient to run predictive models of ecological niches, whereas someone studying a particular taxon will need really detailed occurrences with subspecies information.

Fitness for use in Biodiversity data Do you understand your
data and can you explain its purpose to someone else? 1. accessibility 2. accuracy 3. timeliness 4. completeness / comprehensiveness 5. consistency 6. relevancy 7. well documented [outside of your head] 8. easy to read and easy to interpret

Measures of Quality "All data include error – there is
no escaping it! It is knowing what the error is that is important, and knowing if the error is within acceptable limits for the purpose to which the data are to be put.” A. Chapman 2005 • Correctness (Accuracy) How close is the recorded value to the actual value? • Consistency (Precision) How often do you get it right?
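
The two measures can be told apart with a few repeated readings of the same point. In the sketch below, all numbers are invented for illustration: the distance between the mean reading and the true value stands for correctness (accuracy), and the spread of the readings stands for consistency (precision).

```python
from statistics import mean, stdev

# Invented example: four GPS readings of a point whose true latitude is known.
true_latitude = 50.8155
readings = [50.8161, 50.8159, 50.8160, 50.8162]

# Correctness (accuracy): how far is the average reading from the true value?
bias = mean(readings) - true_latitude

# Consistency (precision): how much do the readings scatter around their own mean?
spread = stdev(readings)

print(f"bias   = {bias:.5f} degrees")
print(f"spread = {spread:.5f} degrees")
```

Here the readings agree closely with each other yet sit away from the true value: consistent, but not correct.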

Correctness example 1 A dataset contains fossil specimens from the
Triassic period. The recorded taxon for a specimen is Thismia. Is Thismia a fossil bird?

Correctness example 1 A dataset contains fossil specimens from the
Triassic period. The recorded taxon for a specimen is Thismia. Is Thismia a fossil bird? > No! It’s a very rare plant from Illinois (US)

Correctness example 2 A botanical dataset contains specimens from Kalamazoo.
The zip code is 49007 and the collector is Richard Spruce. 1. Is 49007 the right zip code for Kalamazoo? 2. Did Richard Spruce collect in Michigan?

Correctness example 2 1. Is 49007 the right zip code
for Kalamazoo? > Yes!

Correctness example 2 2. Did Richard Spruce collect in Michigan?
> Maybe

Consistency example A botanical dataset has specimens collected by: Full
Name = Joseph Dalton Hooker Full Name = Hooker, J. Full Name = W. J. Hooker Full Name = Hook.f. Full Name = Hook. How many unique collectors are there?

Consistency example A botanical dataset has specimens collected by: Full
Name = Joseph Dalton Hooker Full Name = Hooker, J. Full Name = W. J. Hooker Full Name = Hook.f. Full Name = Hook. How many unique collectors are there? > 3 different collectors for 5 different names
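
One way to make such a column consistent is a lookup table from each variant to an agreed canonical name. The grouping below is only one possible reading and is an assumption: "Hook.f." and "Hook." are the standard botanical author abbreviations for Joseph Dalton Hooker and William Jackson Hooker, while "Hooker, J." is ambiguous and is deliberately left unresolved.

```python
# Assumed mapping of name variants to canonical collector names.
CANONICAL = {
    "Joseph Dalton Hooker": "Joseph Dalton Hooker",
    "Hook.f.": "Joseph Dalton Hooker",
    "W. J. Hooker": "William Jackson Hooker",
    "Hook.": "William Jackson Hooker",
}


def resolve(name):
    """Return the canonical collector, or flag the name as unresolved."""
    cleaned = name.strip()
    return CANONICAL.get(cleaned, f"unresolved: {cleaned}")


names = ["Joseph Dalton Hooker", "Hooker, J.", "W. J. Hooker", "Hook.f.", "Hook."]
collectors = {resolve(name) for name in names}
print(len(names), "names,", len(collectors), "distinct collectors")
for collector in sorted(collectors):
    print(" ", collector)
```

Flagging the ambiguous variant rather than guessing keeps the uncertainty visible for someone who can check the original labels.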

Data cleaning "All data include error – there is no
escaping it! It is knowing what the error is that is important, and knowing if the error is within acceptable limits for the purpose to which the data are to be put.” A. Chapman 2005 Data cleaning is the process of correcting (or removing) dirty data caused by contradictions, disparities, keying mistakes, missing bits, etc. It also includes validation of the changes made, and may require normalization.
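
A first cleaning pass usually flags suspicious records rather than changing them. As a hedged sketch, the function below (the name `coordinate_issues` and the chosen checks are illustrative assumptions) tests the Darwin Core fields decimalLatitude and decimalLongitude for missing, non-numeric, out-of-range and zero values:

```python
def coordinate_issues(record):
    """Return a list of problems found in a record's coordinates (empty if none)."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        return ["missing or non-numeric coordinates"]

    issues = []
    if not -90 <= lat <= 90:
        issues.append("latitude out of range")
    if not -180 <= lon <= 180:
        issues.append("longitude out of range")
    if lat == 0 and lon == 0:
        # 0,0 is a frequent placeholder for "unknown", not a real locality.
        issues.append("zero coordinates")
    return issues


records = [
    {"decimalLatitude": "50.85", "decimalLongitude": "4.35"},
    {"decimalLatitude": "4.35", "decimalLongitude": "508.5"},
    {"decimalLatitude": "", "decimalLongitude": "4.35"},
    {"decimalLatitude": "0", "decimalLongitude": "0"},
]
for record in records:
    print(record, "->", coordinate_issues(record) or "ok")
```

Whether a flagged record is corrected, removed or kept depends on the intended use, which is why the function only reports.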

Cleaning maximizes fitness for use
• Private: Home-made, often cryptic. Not very well organized. Not documented. (Unassessed quality.) My treasure!
• Circulated: Home-made. Not very well organized. Not documented. (Unassessed quality.) Emailed to contact(s).
• Shared: (Not fully documented.) (Unassessed quality.) With your colleagues.
• Open: (Unassessed quality.) Open Data.

References • Chapman, AD 2005. Principles of Data Quality. Global
Biodiversity Information Facility. https://doi.org/10.15468/doc.jrgg-a190 • Chapman, AD 2005. Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. Available online at http://www.gbif.org/document/80528

Discover Workshop Tools

Workshop Tools • Short presentation of the Workshop tools •
Demos of the tools • Other Tools and References • Exercise with prepared data

R/RStudio The R language is widely used among statisticians and
data miners for developing statistical software and data analysis. RStudio makes R easier to use. It includes a code editor, debugging & visualization tools. R Markdown provides an authoring framework for data science.

OpenRefine • Open source software • Web-based spreadsheet • Easy faceting (group by)
and filtering • Intuitive bulk editing • Easy programming • Undo/redo features

SQL • Structured Query Language • Most common way to deal with
structured data (entities + relations) • Supports unique and null values • Supports primary and foreign keys • Enforces integrity constraints • Heavily used for decades • We suggest DB Browser for SQLite

Quantum GIS • Open source software • Deals with spatial information • Visualises
points, polygons,... on maps • Supports many types of layers • Specialized GIS functions • Good connection with databases and web maps (WMS services)

More tools... • RegExp (e.g. from your text editor) •
Taxon names search/match/validate • LibreOffice (Open Source, CSV-friendly spreadsheet) • Gazetteers (locations to coordinates) • Exploratory (statistics) • Python (or other programming languages) • GBIF Data Validator: a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. • ... Make your own toolbox!
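
As an illustration of what a regular expression can do for biodiversity data, the sketch below splits a simple scientific name with authorship into its parts. The pattern is an assumption that only covers the plain "Genus epithet Author, year" shape; names with parenthesised authors, subspecies or several authors would need a richer pattern or a dedicated name parser.

```python
import re

# Assumed shape: "Genus epithet Author, 1234"
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+) (?P<epithet>[a-z]+) (?P<author>[^,]+), (?P<year>\d{4})$"
)


def parse_name(text):
    """Return the name parts as a dict, or None when the shape is not recognised."""
    match = NAME_PATTERN.match(text.strip())
    return match.groupdict() if match else None


print(parse_name("Formica picea Nylander, 1846"))
print(parse_name("Lasius platythorax Seifert, 1991"))
print(parse_name("formica picea"))
```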

Exercises (20’ each) on the Formidabel dataset

Exercise 1.A. RStudio 1. Open Formidabel.rmd (an R Markdown file that
contains R scripts) 2. Open Formidabel.csv (occurrence data) 3. Display first record 4. Calculate mean of longitude, latitude 5. Plot occurrence points on Belgium map
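
For participants who prefer Python, steps 2 to 4 can be sketched as follows. This is not the content of Formidabel.rmd: the three records are invented stand-ins for the real file, and the tab delimiter and column names are assumptions taken from the other exercises.

```python
import csv
import io
from statistics import mean

# Invented stand-in for the occurrence file (tab-delimited).
SAMPLE_TSV = (
    "species\tdecimalLatitude\tdecimalLongitude\n"
    "Formica picea Nylander, 1846\t50.5\t5.0\n"
    "Lasius platythorax Seifert, 1991\t51.0\t4.0\n"
    "Formica picea Nylander, 1846\t50.0\t4.5\n"
)

# With the real file, io.StringIO(...) would be replaced by open("Formidabel.csv", encoding="utf-8").
records = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))

# Step 3: display the first record.
first_record = records[0]
print(first_record)

# Step 4: mean of latitude and longitude.
mean_latitude = mean(float(r["decimalLatitude"]) for r in records)
mean_longitude = mean(float(r["decimalLongitude"]) for r in records)
print(mean_latitude, mean_longitude)
```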

Exercise 1.B. OpenRefine 1. Create new project 2. Import 2018Formidabel.csv
3. Add Text Facet on species 4. How many ‘Formica picea Nylander, 1846’? 5. Display ‘Lasius plathytorax Seifert, 1991’ 6. Edit as ‘Lasius platythorax Seifert, 1991’ 7. Rename year column as oYear 8. Add new column year based on oYear values (for editing) 9. Add Text Facet on year 10. Edit ‘2021’ values as ‘2012’ 11. Export to 20180827Formidabel.csv
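
The logic behind these OpenRefine steps can be sketched in a few lines of Python, which may help to see what a facet and a bulk edit actually do. The three rows are invented; only the species names and the year values come from the exercise.

```python
from collections import Counter

# Invented rows standing in for the imported file.
rows = [
    {"species": "Formica picea Nylander, 1846", "year": "2012"},
    {"species": "Lasius plathytorax Seifert, 1991", "year": "2021"},
    {"species": "Lasius platythorax Seifert, 1991", "year": "2012"},
]

# Steps 3-4: a text facet is a count of rows per distinct value.
facet_before = Counter(row["species"] for row in rows)

# Steps 6 and 10: bulk edits expressed as lookup tables.
SPECIES_FIXES = {"Lasius plathytorax Seifert, 1991": "Lasius platythorax Seifert, 1991"}
YEAR_FIXES = {"2021": "2012"}

cleaned = []
for row in rows:
    original_year = row["year"]
    cleaned.append({
        "species": SPECIES_FIXES.get(row["species"], row["species"]),
        "oYear": original_year,  # step 7: the original value is kept untouched
        "year": YEAR_FIXES.get(original_year, original_year),  # step 8: edited copy
    })

facet_after = Counter(row["species"] for row in cleaned)
print(facet_before)
print(facet_after)
```

Keeping `oYear` next to the edited `year` mirrors the exercise: the correction stays traceable.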

Exercise 1.C. SQL Browser 1. Create new database 2. Import
2018Formidabel.csv (tab-delimited, UTF-8) 3. Count() records per basisOfRecord 4. Count() records per species 5. Count() records per year 6. Calculate min(), max() of decimalLatitude/decimalLongitude 7. Create a speciesList view with family, genus, species, scientificName, count(*) 8. Export speciesList as CSV file
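
The queries of this exercise can be tried without the real file using Python's built-in sqlite3 module. In this sketch the table name `occurrence` and the three rows are assumptions made for illustration; the column names follow the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence (basisOfRecord TEXT, family TEXT, genus TEXT, "
    "species TEXT, scientificName TEXT, year INTEGER, "
    "decimalLatitude REAL, decimalLongitude REAL)"
)

# Invented rows standing in for the imported file.
rows = [
    ("PreservedSpecimen", "Formicidae", "Formica", "Formica picea", "Formica picea Nylander, 1846", 2012, 50.5, 5.0),
    ("PreservedSpecimen", "Formicidae", "Lasius", "Lasius platythorax", "Lasius platythorax Seifert, 1991", 2012, 51.0, 4.0),
    ("HumanObservation", "Formicidae", "Formica", "Formica picea", "Formica picea Nylander, 1846", 2013, 50.0, 4.5),
]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Step 3: records per basisOfRecord (steps 4 and 5 follow the same pattern).
per_basis = conn.execute(
    "SELECT basisOfRecord, COUNT(*) FROM occurrence "
    "GROUP BY basisOfRecord ORDER BY basisOfRecord"
).fetchall()

# Step 6: geographic extent.
bounds = conn.execute(
    "SELECT MIN(decimalLatitude), MAX(decimalLatitude), "
    "MIN(decimalLongitude), MAX(decimalLongitude) FROM occurrence"
).fetchone()

# Step 7: a view with one line per species and its record count.
conn.execute(
    "CREATE VIEW speciesList AS "
    "SELECT family, genus, species, scientificName, COUNT(*) AS n "
    "FROM occurrence GROUP BY family, genus, species, scientificName"
)
species_list = conn.execute("SELECT species, n FROM speciesList ORDER BY species").fetchall()

print(per_basis)
print(bounds)
print(species_list)
```

The same SELECT statements can be pasted into the "Execute SQL" tab of DB Browser for SQLite once the file is imported, adjusting the table name to the one chosen at import.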

Exercise 1.D. Quantum GIS 1. Create new project 2. Add
layer BEL3.shp 3. Alter properties to change color 4. Alter properties to display names 5. Add layer 2018Formidabel.csv 6. Alter properties to change symbol 7. Alter properties to display catalogNumber 8. Try to detect records not in Belgium 9. Add new calculated layer with records not in Belgium
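
Step 8 can be approximated outside a GIS with a bounding-box test. This is a rough sketch only: the box limits below are approximate values assumed for illustration, the two records are invented, and a point inside the box can still lie in a neighbouring country, so the polygon layer in QGIS remains the reference.

```python
# Approximate bounding box around Belgium (assumed values, in decimal degrees).
BELGIUM_BBOX = {"min_lat": 49.4, "max_lat": 51.6, "min_lon": 2.5, "max_lon": 6.5}


def outside_bbox(latitude, longitude, bbox=BELGIUM_BBOX):
    """True when the point falls outside the bounding box."""
    inside = (
        bbox["min_lat"] <= latitude <= bbox["max_lat"]
        and bbox["min_lon"] <= longitude <= bbox["max_lon"]
    )
    return not inside


# Invented records: one near Brussels, one near Paris.
records = [
    {"catalogNumber": "A-1", "decimalLatitude": 50.85, "decimalLongitude": 4.35},
    {"catalogNumber": "A-2", "decimalLatitude": 48.86, "decimalLongitude": 2.35},
]
suspects = [
    r["catalogNumber"]
    for r in records
    if outside_bbox(r["decimalLatitude"], r["decimalLongitude"])
]
print("Records to check:", suspects)
```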

Any questions? Icons by vectorpocket / Freepik
