
DUW_01-Intro.pdf

André Heughebaert
September 20, 2018


Transcript

  1. Summary
    • Introduction
    • Open Data Principles
    • Biodiversity Standards
    • Data Quality Principles
    • Discover Workshop Tools
  2. Objectives
    This workshop will increase your skills in open biodiversity data and is a good opportunity to discover tools for data management and visualisation. You will learn how to combine your research data with data made freely available by thousands of museums/institutions/scientists worldwide. At the end of the workshop, participants will be familiar with the GBIF.org portal, open data principles, and some exciting data visualisation tools.
  3. Practicalities
    Wifi network=Belspo-Guest, password=belspo1050
    Repository: https://drive.google.com/open?id=1OhAJiK6kmxOcssgcIKsAUBLdile558vf (see your email)
    Timing: 09:00-17:00, coffee breaks 11:00 & 15:30 (flexible), lunch break 12:30-13:30
    Principles: interact, share your experience, practice and... enjoy
  4. Agenda overview Day 1 - Thursday Day 2 - Friday

    Introduction Data Handling with OpenRefine, R & QGIS Download data from GBIF.org Species Distribution Modeling Wrap-up & Conclusions
  5. Preparatory Survey - Biodiversity data
    • Occurrence and abundance data of bees and other pollinators, and of flowering plants
    • Mainly data for biogeography, impact assessment, management plans and data analysis (e.g. for popularisation)
    • Lepidoptera of Belgium and Lepidoptera of Africa
    • Numbers of migrating amphibians as f(meteorological conditions, distance to overwintering location)
    • Primarily abundance data, and occupancy data
    • Data about the distribution of Belgian species
  6. Preparatory Survey - your difficulties/challenges
    • To access, map and model the distribution of bees and plants with GBIF records (e.g. in Brussels)
    • Learning new programming languages, finding reliable dataset resources
    • Assembling the information from a multitude of sources, mainly literature
    • Finding the metadata for the years 2011-2018, for the days from January to April of each year, for a location as close as possible to 50.8155 N 4.4404 W
    • Not familiar with the techniques of data use
    • My school curriculum did not familiarize me with these tools, except QGIS
  7. Open Data Principles
    • How open are your data?
    • Open data in a nutshell
    • Global Biodiversity Information Facility
    • FAIR Principles
    • Creative Commons Licenses
  8. How open are your data?
    Private: Home-made, often cryptic • Not very well organized • Not documented • (Unassessed quality) • My treasure!
    Circulated: Home-made • Not very well organized • Not documented • (Unassessed quality) • Emailed to contact(s)
    Shared: (Not fully documented) • (Unassessed quality) • With your colleagues
    Open: (Unassessed quality) • Open Data
    See “Open/Shared/Closed: the world of data” from Open Data Institute
  9. Open Data in a nutshell Capture Re-use sum Document &

    Clean Discover Publish Data Life Cycle
  10. Global Biodiversity Information Facility
    • 58 country participants, 37 organisations
    • 1,264 publishing institutions
    • 40,000 datasets
    • 1 billion occurrence records
    • 160,000 user sessions/month
    • 130 billion records downloaded/month
    • 2 peer-reviewed articles/day of data re-use
  11. FAIR Principles
    “Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.” Mark D. Wilkinson et al.
    In 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data.
  12. Biodiversity Standards
    • Everyday Standards
    • Darwin Core terms
    • Darwin Core star schema
    • Type of data supported
    • Other Biodiversity Standards
  13. “Standardisation does not mean that we all wear the same color and weave of cloth, eat standard sandwiches, or live in standard rooms with standard furnishing. Homes of infinite variety of design are built with a few types of bricks, and with lumber of standard sizes, and with water and heating pipes and fitting of standard dimensions.” W. Edwards Deming “Let’s Agree to Disagree.” Standards
  14. Everyday Standards
    Some examples of standards that you use often:
    • Units of Measurement (Metric, Imperial)
    • Numeral Systems (Hindu-Arabic; Roman Numerals)
    • Alphabets
    • Languages
    • Emojis
    • Postal Addressing
    • Morse Code
    “The main purpose for standards is to create a framework to ease sharing. They should provide clarity and help communication.”
  15. Everyday Standards, e.g. Lat. & Long.
    • measurement - geographic coordinates
    • format - degrees, minutes, seconds
    • numeric system - sexagesimal
    • numbers - Indo-Arabic
    • language - English
    • alphabet - Latin
    • symbols - typography
    • font - Arial
    13° 51' 3” S 171° 45' 5” W
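The sexagesimal example above can be converted to decimal degrees, the form used by the Darwin Core terms decimalLatitude and decimalLongitude. The following is a minimal Python sketch; the function name `dms_to_decimal` is made up for illustration, and it assumes the common convention that South and West are negative.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Assumed convention: South and West hemispheres are negative.
    return -value if hemisphere in ("S", "W") else value


# The coordinates shown on the slide: 13° 51' 3" S 171° 45' 5" W
latitude = dms_to_decimal(13, 51, 3, "S")
longitude = dms_to_decimal(171, 45, 5, "W")
print(round(latitude, 6), round(longitude, 6))
```

The arithmetic is 13 + 51/60 + 3/3600 for the latitude and 171 + 45/60 + 5/3600 for the longitude, both negated because of the S and W hemispheres.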
  16. Darwin Core Standard “List of fields and their definitions, as

    they relate to biodiversity data.” Discover all Darwin Core terms on TDWG website. institutionID collectionID datasetID institutionCode collectionCode datasetName ownerInstitutionCode basisOfRecord informationWithheld dataGeneralizations dynamicProperties occurrenceID catalogNumber recordNumber recordedBy individualCount organismQuantity organismQuantityType sex lifeStage reproductiveCondition behavior establishmentMeans occurrenceStatus preparations disposition associatedMedia associatedReferences associatedSequences associatedTaxa otherCatalogNumbers occurrenceRemarks organismID organismName organismScope associatedOccurrences associatedOrganisms previousIdentifications organismRemarks materialSampleID eventID parentEventID fieldNumber eventDate eventTime startDayOfYear endDayOfYear year month day verbatimEventDate habitat samplingProtocol sampleSizeValue sampleSizeUnit samplingEffort fieldNotes eventRemarks locationID higherGeographyID higherGeography continent waterBody islandGroup island country countryCode stateProvince county municipality locality verbatimLocality minimumElevationInMeters maximumElevationInMeters verbatimElevation minimumDepthInMeters maximumDepthInMeters verbatimDepth minimumDistanceAboveSurfaceInMeters maximumDistanceAboveSurfaceInMeters locationAccordingTo locationRemarks decimalLatitude decimalLongitude geodeticDatum coordinateUncertaintyInMeters coordinatePrecision pointRadiusSpatialFit verbatimCoordinates verbatimLatitude verbatimLongitude verbatimCoordinateSystem verbatimSRS footprintWKT footprintSRS footprintSpatialFit georeferencedBy georeferencedDate georeferenceProtocol georeferenceSources georeferenceVerificationStatus georeferenceRemarks geologicalContextID earliestEonOrLowestEonothem latestEonOrHighestEonothem earliestEraOrLowestErathem latestEraOrHighestErathem earliestPeriodOrLowestSystem latestPeriodOrHighestSystem earliestEpochOrLowestSeries latestEpochOrHighestSeries earliestAgeOrLowestStage latestAgeOrHighestStage 
lowestBiostratigraphicZone highestBiostratigraphicZone lithostratigraphicTerms group formation member bed identificationID identificationQualifier typeStatus identifiedBy dateIdentified identificationReferences identificationVerificationStatus identificationRemarks taxonID scientificNameID acceptedNameUsageID parentNameUsageID originalNameUsageID nameAccordingToID namePublishedInID taxonConceptID scientificName acceptedNameUsage parentNameUsage originalNameUsage nameAccordingTo namePublishedIn namePublishedInYear higherClassification kingdom phylum class order family genus subgenus specificEpithet infraspecificEpithet taxonRank verbatimTaxonRank scientificNameAuthorship vernacularName nomenclaturalCode taxonomicStatus nomenclaturalStatus taxonRemarks
  17. Darwin Core term: organismID
    Identifier: http://rs.tdwg.org/dwc/terms/organismID
    Class: http://rs.tdwg.org/dwc/terms/Organism
    Definition: An identifier for the Organism instance (as opposed to a particular digital record of the Organism). May be a globally unique identifier or an identifier specific to the data set.
    Comment: For discussion see http://terms.tdwg.org/wiki/dwc:organismID
    Details: organismID
  18. Darwin Core term: locality
    Identifier: http://rs.tdwg.org/dwc/terms/locality
    Class: http://purl.org/dc/terms/Location
    Definition: The specific description of the place. Less specific geographic information can be provided in other geographic terms (higherGeography, continent, country, stateProvince, county, municipality, waterBody, island, islandGroup). This term may contain information modified from the original to correct perceived errors or standardize the description.
    Comment: Example: "Bariloche, 25 km NNE via Ruta Nacional 40 (=Ruta 237)". For discussion see http://terms.tdwg.org/wiki/dwc:locality
    Details: locality
  19. Darwin Core star schema
    • Central ‘Core’ Entity
    • + 0 or more extensions (that always relate to the Core entity)
    • + metadata (EML)
    DwCArchive (zip)
    Enough to describe some relations, but not a fully relational model
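The star shape can be pictured as one core table plus extension rows that each point back to a core record. The sketch below is only an illustration in Python with made-up rows: the identifiers, the file names and the `coreid` and `identifier` column names are assumptions, not content from the workshop material.

```python
from collections import defaultdict

# Hypothetical core records (one row per occurrence).
core = [
    {"occurrenceID": "occ-1", "scientificName": "Formica picea Nylander, 1846"},
    {"occurrenceID": "occ-2", "scientificName": "Lasius platythorax Seifert, 1991"},
]

# Hypothetical extension rows; each one refers to exactly one core record.
extension = [
    {"coreid": "occ-1", "identifier": "photo-a.jpg"},
    {"coreid": "occ-1", "identifier": "photo-b.jpg"},
]

# Group extension rows by the core record they belong to.
rows_by_core = defaultdict(list)
for row in extension:
    rows_by_core[row["coreid"]].append(row)

# A core record may have zero, one or many extension rows.
for record in core:
    linked = rows_by_core.get(record["occurrenceID"], [])
    print(record["occurrenceID"], len(linked))
```

Extension rows only ever link to the core, never to each other, which is why the schema describes some relations but is not a fully relational model.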
  20. Other Biodiversity Standards of concern
    ABCD: A standard equivalent to DarwinCore. Access to Biological Collections Data task group. 2007. Access to Biological Collection Data (ABCD), Version 2.06. Biodiversity Information Standards (TDWG) http://www.tdwg.org/standards/115
    EML: A metadata standard developed for the earth, environmental and ecological sciences. Ecological Metadata Language (EML) is a metadata specification particularly developed for the ecology discipline. It is based on prior work done by the Ecological Society of America and associated efforts (Michener et al., 1997, Ecological Applications).
  21. Data Quality Principles
    • What is quality?
    • Fitness for use
    • Correctness
    • Consistency
    • Data Cleaning
  22. Quality is relative to usage
    Theatrum Orbis Terrarum by Abraham Ortelius (1527-1598). Image from the collections of the State Library of New South Wales.
  23. Fitness for use
    “...data quality is related to use and cannot be assessed independently of the user. In a database, the data have no actual quality or value (Dalcin 2004); they only have potential value that is realized when someone uses the data to do something useful. Information quality relates to its ability to satisfy its customers and to meet customers’ needs (English 1999).” Arthur Chapman
  24. Fitness for use in the real world
    How well does a thing do what it’s supposed to, and what is that anyway? A shoemaker creates clogs for the purpose of covering a person’s feet.
  25. Fitness for use in Biodiversity data
    Data quality is a relative concept that depends on the use of these data.
    "The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data." Chrisman, 1991
    The genus level will be sufficient to run predictive models of ecological niches, whereas someone studying a particular taxon will need really detailed occurrences with subspecies information.
  26. Fitness for use in Biodiversity data
    Do you understand your data and can you explain its purpose to someone else?
    1. accessibility
    2. accuracy
    3. timeliness
    4. completeness / comprehensiveness
    5. consistency
    6. relevancy
    7. well documented [outside of your head]
    8. easy to read and easy to interpret
  27. Measures of Quality
    “All data include error – there is no escaping it! It is knowing what the error is that is important, and knowing if the error is within acceptable limits for the purpose to which the data are to be put.” A. Chapman 2005
    • Correctness (Accuracy): How close is the recorded value to the actual value?
    • Consistency (Precision): How often do you get it right?
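One way to make the two measures concrete is to compare repeated readings of the same quantity against a known reference. The sketch below is an illustration in Python, not a method from the workshop: all numbers are made up, correctness is read as the offset of the average from the reference, and consistency is read as the spread of the readings.

```python
from statistics import mean, stdev

# Hypothetical reference value and repeated readings of the same latitude.
true_value = 50.8155
readings = [50.8160, 50.8158, 50.8161, 50.8159]

# Correctness (accuracy): how far the average reading is from the reference.
bias = mean(readings) - true_value

# Consistency (precision): how much the readings scatter around their own mean.
spread = stdev(readings)

print(f"bias={bias:.5f} spread={spread:.5f}")
```

Readings can be tightly clustered (small spread) and still wrong (large bias), so the two measures are worth reporting separately.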
  28. Correctness example 1
    A dataset contains fossil specimens from the Triassic period. The recorded taxon for a specimen is Thismia. Is Thismia a fossil bird?
  29. Correctness example 1
    A dataset contains fossil specimens from the Triassic period. The recorded taxon for a specimen is Thismia. Is Thismia a fossil bird?
    > No! It’s a very rare plant from Illinois (US)
  30. Correctness example 2
    A botanical dataset contains specimens from Kalamazoo. The zip code is 49007 and the collector is Richard Spruce.
    1. Is 49007 the right zip code for Kalamazoo?
    2. Did Richard Spruce collect in Michigan?
  31. Consistency example
    A botanical dataset has specimens collected by:
    Full Name = Joseph Dalton Hooker
    Full Name = Hooker, J.
    Full Name = W. J. Hooker
    Full Name = Hook.f.
    Full Name = Hook.
    How many unique collectors are there?
  32. Consistency example
    A botanical dataset has specimens collected by:
    Full Name = Joseph Dalton Hooker
    Full Name = Hooker, J.
    Full Name = W. J. Hooker
    Full Name = Hook.f.
    Full Name = Hook.
    How many unique collectors are there?
    > 3 different collectors for 5 different names
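Grouping name variants usually relies on a curated lookup table. The Python sketch below shows the idea with a hand-made mapping. The mapping is an assumption: it treats "Hook.f." and "Hook." as the standard botanical author abbreviations for Joseph Dalton Hooker and William Jackson Hooker, and leaves "Hooker, J." unresolved. The slide does not say which reading gives its count of three, so this is only one possible reading.

```python
from collections import defaultdict

recorded = ["Joseph Dalton Hooker", "Hooker, J.", "W. J. Hooker", "Hook.f.", "Hook."]

# Hypothetical hand-curated lookup from recorded spelling to canonical name.
canonical = {
    "Joseph Dalton Hooker": "Joseph Dalton Hooker",
    "Hook.f.": "Joseph Dalton Hooker",
    "W. J. Hooker": "William Jackson Hooker",
    "Hook.": "William Jackson Hooker",
}

# Names missing from the lookup are kept apart instead of being guessed.
groups = defaultdict(list)
for name in recorded:
    groups[canonical.get(name, "unresolved: " + name)].append(name)

for collector, variants in groups.items():
    print(collector, variants)
print(len(groups), "groups for", len(recorded), "recorded names")
```

Keeping ambiguous values in their own group makes the uncertainty visible rather than silently merging them.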
  33. Data cleaning
    “All data include error – there is no escaping it! It is knowing what the error is that is important, and knowing if the error is within acceptable limits for the purpose to which the data are to be put.” A. Chapman 2005
    Data cleaning is the process of correcting (or removing) dirty data caused by contradictions, disparities, keying mistakes, missing bits, etc. It also includes validation of the changes made, and may require normalization.
  34. Cleaning maximizes fitness for use
    Private: Home-made, often cryptic • Not very well organized • Not documented • (Unassessed quality) • My treasure!
    Circulated: Home-made • Not very well organized • Not documented • (Unassessed quality) • Emailed to contact(s)
    Shared: (Not fully documented) • (Unassessed quality) • With your colleagues
    Open: (Unassessed quality) • Open Data
  35. References
    • Chapman, AD 2005. Principles of Data Quality. Global Biodiversity Information Facility. https://doi.org/10.15468/doc.jrgg-a190
    • Chapman, AD 2005. Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. Available online at http://www.gbif.org/document/80528
  36. Workshop Tools
    • Short presentation of the Workshop tools
    • Demos of the tools
    • Other Tools and References
    • Exercise with prepared data
  37. R/RStudio
    The R language is widely used among statisticians and data miners for developing statistical software and data analysis. RStudio makes R easier to use. It includes a code editor, debugging & visualization tools. RMarkdown provides an authoring framework for data science.
  38. OpenRefine
    • Open source software
    • Web-based spreadsheet
    • Easy faceting (group by) and filtering
    • Intuitive bulk editing
    • Easy programming
    • Undo/redo features
  39. SQL
    • Structured Query Language
    • Most common way to deal with structured data (entities + relations)
    • Supports unique and null values
    • Supports primary and foreign keys
    • Enforces integrity constraints
    • Heavily used for decades
    • We suggest DB Browser for SQLite
  40. Quantum GIS
    • Open source software
    • Deals with spatial information
    • Visualises points, polygons,... on maps
    • Supports many types of layers
    • Specialized GIS functions
    • Good connection with databases and web maps (WMS services)
  41. More tools...
    • RegExp (e.g. from your text editor)
    • Taxon names search/match/validate
    • LibreOffice (Open Source, CSV friendly spreadsheet)
    • Gazetteers (locations to coordinates)
    • Exploratory (statistics)
    • Python (or other programming languages)
    • GBIF Data Validator: a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset.
    • ...
    Make your own toolbox!
  42. Exercise 1.A. RStudio
    1. Open Formidabel.rmd (an RMarkdown file that contains R scripts)
    2. Open Formidabel.csv (occurrence data)
    3. Display first record
    4. Calculate mean of longitude, latitude
    5. Plot occurrence points on Belgium map
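The exercise itself is done in R. For readers who want to see the logic of steps 3 and 4 outside RStudio, here is a minimal sketch in Python using only the standard library. The inline rows are made-up stand-ins for Formidabel.csv, and the column names decimalLatitude and decimalLongitude and the tab delimiter are assumptions about the file layout. Plotting (step 5) is left out.

```python
import csv
import io
from statistics import mean

# Made-up, tab-delimited stand-in for the workshop file.
sample = io.StringIO(
    "species\tdecimalLatitude\tdecimalLongitude\n"
    "Formica picea Nylander, 1846\t50.5\t5.0\n"
    "Lasius platythorax Seifert, 1991\t51.0\t4.0\n"
)

records = list(csv.DictReader(sample, delimiter="\t"))

# Step 3: display the first record.
first = records[0]
print(first)

# Step 4: mean of latitude and longitude (values are read as text, so convert).
mean_lat = mean(float(r["decimalLatitude"]) for r in records)
mean_lon = mean(float(r["decimalLongitude"]) for r in records)
print(mean_lat, mean_lon)
```

To run it on a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.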
  43. Exercise 1.B. OpenRefine
    1. Create new project
    2. Import 2018Formidabel.csv
    3. Add Text Facet on species
    4. How many ‘Formica picea Nylander, 1846’?
    5. Display ‘Lasius plathytorax Seifert, 1991’
    6. Edit as ‘Lasius platythorax Seifert, 1991’
    7. Rename year column as oYear
    8. Add new column year based on oYear values (for editing)
    9. Add Text Facet on year
    10. Edit ‘2021’ values as ‘2012’
    11. Export to 20180827Formidabel.csv
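The same cleaning logic can be written out in code, which helps show what a text facet and a bulk edit actually do. This is a Python sketch on made-up rows, not the OpenRefine workflow itself. A facet is treated as a count per distinct value, and the original year is kept in oYear before the working copy is edited.

```python
from collections import Counter

# Made-up rows standing in for 2018Formidabel.csv.
rows = [
    {"species": "Formica picea Nylander, 1846", "year": "2012"},
    {"species": "Lasius plathytorax Seifert, 1991", "year": "2021"},
    {"species": "Lasius platythorax Seifert, 1991", "year": "2011"},
]

# Text facet: number of rows per distinct species value.
facet_before = Counter(r["species"] for r in rows)

for r in rows:
    # Bulk edit: fix the misspelled name.
    if r["species"] == "Lasius plathytorax Seifert, 1991":
        r["species"] = "Lasius platythorax Seifert, 1991"
    # Keep the untouched value in oYear, then edit the working copy.
    r["oYear"] = r["year"]
    if r["year"] == "2021":
        r["year"] = "2012"

facet_after = Counter(r["species"] for r in rows)
print(facet_before)
print(facet_after)
```

Keeping oYear alongside the edited column means the original value can always be checked or restored.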
  44. Exercise 1.C. SQL Browser
    1. Create new database
    2. Import 2018Formidabel.csv (tab delimited, UTF-8)
    3. Count() records per basisOfRecord
    4. Count() records per species
    5. Count() records per year
    6. Calculate min(), max() of decimalLatitude/decimalLongitude
    7. Create a speciesList view with family, genus, species, scientificName, count(*)
    8. Export speciesList as CSV file
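The queries behind steps 3, 6 and 7 can be tried without the GUI through Python's built-in sqlite3 module. The sketch below uses an in-memory database with made-up rows. The table name `occurrence`, the column types and the alias `n` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence (family TEXT, genus TEXT, species TEXT, "
    "scientificName TEXT, basisOfRecord TEXT, year INTEGER, "
    "decimalLatitude REAL, decimalLongitude REAL)"
)
# Made-up rows standing in for the imported CSV file.
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Formicidae", "Formica", "Formica picea",
         "Formica picea Nylander, 1846", "HumanObservation", 2012, 50.5, 5.0),
        ("Formicidae", "Lasius", "Lasius platythorax",
         "Lasius platythorax Seifert, 1991", "PreservedSpecimen", 2011, 51.0, 4.0),
        ("Formicidae", "Lasius", "Lasius platythorax",
         "Lasius platythorax Seifert, 1991", "PreservedSpecimen", 2012, 50.8, 4.4),
    ],
)

# Step 3: count records per basisOfRecord.
per_basis = conn.execute(
    "SELECT basisOfRecord, COUNT(*) FROM occurrence "
    "GROUP BY basisOfRecord ORDER BY basisOfRecord"
).fetchall()

# Step 6: latitude range.
lat_range = conn.execute(
    "SELECT MIN(decimalLatitude), MAX(decimalLatitude) FROM occurrence"
).fetchone()

# Step 7: a view with one row per species and its record count.
conn.execute(
    "CREATE VIEW speciesList AS "
    "SELECT family, genus, species, scientificName, COUNT(*) AS n "
    "FROM occurrence GROUP BY family, genus, species, scientificName"
)
species_list = conn.execute("SELECT * FROM speciesList ORDER BY species").fetchall()

print(per_basis, lat_range, species_list)
```

The same SQL statements can be pasted into the Execute SQL tab of DB Browser for SQLite once the CSV has been imported, after adjusting the table name.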
  45. Exercise 1.D. Quantum GIS
    1. Create new project
    2. Add layer BEL3.shp
    3. Alter properties to change color
    4. Alter properties to display names
    5. Add layer 2018Formidabel.csv
    6. Alter properties to change symbol
    7. Alter properties to display catalogNumber
    8. Try to detect records not in Belgium
    9. Add new calculated layer with records not in Belgium
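A quick first pass for step 8 can be done before opening a GIS, using a bounding-box test. The Python sketch below is a rough illustration. The box limits are approximate values chosen here, not taken from BEL3.shp, and the records are made up. A rectangle is coarser than the polygon test QGIS can perform, so points inside the box may still lie outside Belgium.

```python
# Approximate bounding box around Belgium (assumed values, slightly generous).
BBOX = {"min_lat": 49.4, "max_lat": 51.6, "min_lon": 2.5, "max_lon": 6.5}

# Made-up records: one plausible, one with a sign error, one with swapped axes.
records = [
    {"catalogNumber": "A1", "decimalLatitude": 50.85, "decimalLongitude": 4.35},
    {"catalogNumber": "A2", "decimalLatitude": 50.85, "decimalLongitude": -4.35},
    {"catalogNumber": "A3", "decimalLatitude": 4.35, "decimalLongitude": 50.85},
]


def outside_bbox(record, box=BBOX):
    """Return True when the point falls outside the rectangle."""
    lat = record["decimalLatitude"]
    lon = record["decimalLongitude"]
    return not (
        box["min_lat"] <= lat <= box["max_lat"]
        and box["min_lon"] <= lon <= box["max_lon"]
    )


suspects = [r["catalogNumber"] for r in records if outside_bbox(r)]
print(suspects)
```

Sign errors and swapped latitude/longitude are typical causes of points landing far outside the expected area, and both show up in this kind of check.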