
Data Transformation

An introduction to data transformation from a project format to Darwin Core, using the AraBel use case.
Presented during the BIFA ToT workshop, Tagaytay, Philippines.

André Heughebaert

June 22, 2014

Transcript

  1. BIFA ToT workshop – 20-24 June 2016
    SESSION 02BIS: DATA TRANSFORMATION
    André Heughebaert
  2. FROM PROJECT DATA TO DARWIN CORE

    Project database:
    • Home-made structure
    • Internal, restricted use
    • Not so well documented, sometimes hermetic
    • All data fit in
    • Local IDs are sufficient

    Darwin Core:
    • Generic/universal terms
    • Public access
    • Well documented, understandable
    • Data must follow standards
    • Need for global IDs

    Advocate Open Data → Transform/clean Data → Publish Data
  3. FREQUENT DATA ISSUES
    • Specific, project-oriented data structure (not Darwin Core)
    • No unique IDs
    • Missing integrity constraints
    • Lots of typos (e.g. scientific names)
    • Heterogeneous formats (e.g. dates, coordinates...)
    • Special coordinate system
    • Redundancy and conflicting information
    • Same field used for multiple purposes
    • Variety of null values (‘’, ‘NA’, ‘null’...)
    • Sensitive and/or untrusted data
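Two of the issues listed above, the variety of null values and the heterogeneous date formats, lend themselves to a small script. The following is a minimal Python sketch, not the code used in the demo: the `NULL_TOKENS` set, the `DATE_FORMATS` tuple and both function names are assumptions chosen for illustration, and a real dataset will need its own lists.

```python
from datetime import datetime

# Assumed spellings of "no value"; extend this set for your own data.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "-"}

# Assumed date layouts, tried in order. Ambiguous day/month orders
# need an explicit project decision before being added here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def unify_null(value):
    """Return None for any known null spelling, else the stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_TOKENS else text


def to_iso_date(value):
    """Return an ISO 8601 date string, or None when null or unparseable."""
    text = unify_null(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for manual review
```

Values that match none of the assumed formats come back as `None` rather than being guessed, so they can be reviewed by hand.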
  4. OUR DEMO USE CASE
    AraBel (Arachnologia Belgica) is a collaborative effort of scientists and amateur spider enthusiasts. It gathers information on collection specimens and observations of spiders in Belgium from 1879 to the present.
  5. ARABEL IN A NUTSHELL
    • Started in 1976
    • 3 generations of experts involved
    • Science (not technology) driven
  6. ARABEL DATA ISSUES
    • Specific, project-oriented data structure (not Darwin Core)
    • No unique IDs
    • Missing integrity constraints
    • Lots of typos (e.g. scientific names)
    • Heterogeneous formats (e.g. dates, coordinates...)
    • Special coordinate system
    • Redundancy and conflicting information
    • Same field used for multiple purposes
    • Variety of null values (‘’, ‘NA’, ‘null’...)
    • Sensitive and/or untrusted data
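Typos in scientific names can often be flagged by fuzzy matching against a trusted name list. The sketch below uses Python's standard `difflib` module as one possible approach. It is an assumption, not the method used in the demo, and `REFERENCE_NAMES` is a hypothetical three-name list standing in for a real taxonomic checklist.

```python
import difflib

# Hypothetical reference list; in practice this would come from a
# taxonomic checklist, not from a hard-coded constant.
REFERENCE_NAMES = [
    "Araneus diadematus",
    "Pardosa amentata",
    "Tegenaria domestica",
]


def suggest_name(verbatim_name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Return the closest reference name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        verbatim_name.strip(), reference, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

A suggestion is only a candidate: the cutoff of 0.85 is an arbitrary starting point, and proposed corrections are best confirmed with the data owners.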
  7. DEMONSTRATION STEPS
    1. Access - Legacy database
    2. LibreOffice - Spreadsheet functions
    3. OpenRefine - Cleanup
    4. QGIS - Geographical visualisation
    5. Ruby - Invoking webservices
    6. SQLite - Preparing Darwin Core views
    7. IPT - Auto-mapping
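Step 6, preparing Darwin Core views in SQLite, can be sketched as follows. This is an illustration under stated assumptions: the `specimens` table, its column names and the sample row are all made up, and only the view's column names (`occurrenceID`, `basisOfRecord`, `scientificName`, `eventDate`, `decimalLatitude`, `decimalLongitude`) are actual Darwin Core terms. Python's built-in `sqlite3` module is used here only to make the SQL runnable.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical table and a Darwin Core view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE specimens (          -- hypothetical project table
            id_local   INTEGER PRIMARY KEY,
            uuid       TEXT NOT NULL UNIQUE,
            species    TEXT,
            date_found TEXT,
            lat        REAL,
            lon        REAL
        );

        -- Made-up sample row, for illustration only.
        INSERT INTO specimens VALUES
            (1, '00000000-0000-4000-8000-000000000001',
             'Araneus diadematus', '1976-06-12', 50.85, 4.35);

        -- The view renames project columns to Darwin Core terms
        -- without touching the original table.
        CREATE VIEW dwc_occurrence AS
        SELECT uuid                AS occurrenceID,
               'PreservedSpecimen' AS basisOfRecord,
               species             AS scientificName,
               date_found          AS eventDate,
               lat                 AS decimalLatitude,
               lon                 AS decimalLongitude
        FROM specimens;
        """
    )
    return conn


def view_columns(conn, view="dwc_occurrence"):
    """Return the column names exposed by the given view."""
    cursor = conn.execute(f"SELECT * FROM {view} LIMIT 0")
    return [col[0] for col in cursor.description]
```

Because the view's columns are named after Darwin Core terms, a publishing tool with auto-mapping, such as the IPT in step 7, should be able to match them by name.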
  8. ARABEL ISSUES AND SOLUTIONS
    • Specific, project-oriented data structure → Rename, restructure, use views
    • No unique IDs → Add UUIDs
    • Missing integrity constraints → Force integrity constraints
    • Lots of typos (e.g. scientific names) → Detect typos and correct them
    • Heterogeneous formats → Be creative
    • Special coordinate system → Convert to WGS84
    • Redundancy and conflicting information → Ask data owners
    • Same field used for multiple purposes → Split with regular expressions
    • Variety of null values (‘’, ‘NA’, ‘null’) → Unify to same null values
    • Sensitive and/or untrusted data → Blur sensitive data, do not publish untrusted data
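Three of the solutions above (adding UUIDs, splitting a multi-purpose field with regular expressions, and blurring sensitive coordinates) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the "leg. ...; det. ..." field layout, the made-up person names in the usage note, and the choice of rounding as the blurring method are not taken from the AraBel database.

```python
import re
import uuid

# Assumed layout of a combined field: "leg. <collector>; det. <identifier>".
LEG = re.compile(r"\bleg\.\s*([^;]+)", re.IGNORECASE)
DET = re.compile(r"\bdet\.\s*([^;]+)", re.IGNORECASE)


def add_occurrence_id(record):
    """Return a copy of the record with a random UUID if it has no occurrenceID."""
    out = dict(record)
    if not out.get("occurrenceID"):
        out["occurrenceID"] = str(uuid.uuid4())
    return out


def split_people(text):
    """Split a combined collector/identifier field into two Darwin Core terms."""
    leg = LEG.search(text or "")
    det = DET.search(text or "")
    return {
        "recordedBy": leg.group(1).strip() if leg else None,
        "identifiedBy": det.group(1).strip() if det else None,
    }


def blur_coordinates(lat, lon, decimals=1):
    """Reduce precision by rounding; one decimal degree is roughly 10 km."""
    return round(lat, decimals), round(lon, decimals)
```

For example, `split_people("leg. J. Smith; det. A. Jones")` separates the two made-up names into `recordedBy` and `identifiedBy`. Generated IDs should be stored once and reused, since calling `uuid.uuid4()` again yields a different value.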
  9. OPEN SOURCE TOOLS
    • LibreOffice - Spreadsheet
    • OpenRefine - Working with messy data
    • QGIS - Geographic Information System
    • Ruby - Programming language
    • SQLite - SQL database engine
  10. TAKE-HOME MESSAGES
    • Data transformation/cleaning is a necessity
    • This process is time-consuming
    • Involve data owners as early as possible
    • Always keep the original/verbatim fields
    • Document what you did
    • Automate the process whenever possible
    • Tools are faster and more reliable than humans
    • Use your own palette of tools (DB, GIS, languages...)
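The advice to keep verbatim fields, document each change and automate the process can be combined in one small helper. This is a hedged sketch only: the `clean_field` function, the `verbatim_` prefix and the log structure are hypothetical conventions, not something prescribed by the talk.

```python
def clean_field(record, field, cleaner, log):
    """Apply `cleaner` to one field, keeping the original and logging any change.

    `cleaner` is any function taking the original value and returning the
    cleaned one; it must cope with whatever values the field can hold.
    """
    original = record.get(field)
    cleaned = cleaner(original)
    out = dict(record)
    out["verbatim_" + field] = original  # never overwrite the source value
    out[field] = cleaned
    if cleaned != original:
        # One log entry per actual change: a simple record of what was done.
        log.append({"field": field, "before": original, "after": cleaned})
    return out
```

Running every cleaning step through one such function means the change log is produced as a by-product of the automation rather than written afterwards from memory.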