Turning three overlapping thesauri into a Global Agricultural Concept Scheme

SWIB14
December 03, 2014

Authors: Thomas Baker / Osma Suominen (Sungkyunkwan University, Korea / National Library of Finland, Finland)

Abstract:
AGROVOC Concept Scheme, CAB Thesaurus (CABT), and NAL Thesaurus (NALT) largely overlap in scope (agriculture and agricultural research). This duplication is inefficient for their maintainers and constitutes a barrier to searching across databases indexed with their terms. Common representation in SKOS makes mapping easier, but it would in principle be more efficient to merge the thesauri into one shared concept scheme to be jointly maintained by the three organizations. A feasibility study has defined a semi-automatic method for mapping among the three thesauri. Confirmed mappings will be used to coin new concepts, with new URIs, for a shared Global Agricultural Concept Scheme (GACS). One key challenge will be to balance the inclusion of diverse concept hierarchies from the source thesauri against a desire to converge on common semantics through editorial intervention. Partners who currently use their thesauri to automatically generate derivative products will need to balance the efficiencies of sharing a concept scheme against the control required for local production processes. GACS will be natively represented as SKOS XL, edited using VocBench software and published using the Skosmos platform (both open source software) under a Creative Commons license. The GACS project aspires to constitute a consortium open to other thesaurus maintainers. The first version of GACS will be available online in time for a presentation of lessons learned at SWIB 2014.

Transcript

  1. Turning three overlapping thesauri into a Global Agricultural Concept Scheme

    SWIB14, Bonn, 3 December 2014. Osma Suominen and Thomas Baker
  2. Outline

    1. Background
    2. Starting point: three thesauri
    3. Creating GACS
    4. Challenges
    5. Next steps and future of GACS
  3. Background

    • Food and Agriculture Organization of the UN
    • CABI (UK)
    • National Agricultural Library (US)

    Each organization maintains a thesaurus of terms and concepts related to agriculture: concepts like rice, ricefield aquaculture, and plant pests.
  4. Global Agricultural Concept Scheme (GACS)

    1. To improve the semantic interoperability of thesauri maintained by FAO, CABI, and NAL.
    2. To provide core concepts broadly supported across the three thesauri.
    3. To achieve efficiencies of scale by maintaining the core concepts in cooperation.
  5. Three Thesauri

  6. Separate thesauri, separate databases

    Create GACS as a glue linking them together
  7. AGROVOC, CAB Thesaurus, NAL Thesaurus

    140,000 concepts, >1.4M terms; 32,000 concepts, >1.2M terms; 53,000 concepts, >200k terms

    • English, Spanish, Portuguese, German, Czech, Persian, Polish, Hindi, French, Italian, Russian, Japanese, Hungarian, Chinese, Slovak, Thai, Lao, Turkish, Korean, Arabic, Telugu ...
    • English, Spanish, Portuguese, Dutch + many languages with lower coverage
    • English, Spanish

    All thesauri represented using SKOS
  8. Overlap estimate

    Obtained via automatic mappings created using AgreementMakerLight

  9. Long tail distribution (in AGRIS)

    10,000 concepts cover nearly 99% of occurrences in metadata
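
The coverage figure above is a cumulative share over usage counts sorted from most to least used. A minimal sketch of that calculation, assuming a hypothetical mapping from concept label to the number of times it occurs in metadata records; the function name and the counts are invented for illustration and are not AGRIS data.

```python
from collections import Counter


def coverage_of_top(counts, k):
    """Return the fraction of all occurrences covered by the k most-used concepts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sort usage counts in descending order and keep the k largest.
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / total


# Invented usage counts with a long tail (not real data).
usage = Counter({
    "rice": 900,
    "cattle": 60,
    "maize": 30,
    "paddy soils": 9,
    "ricefield aquaculture": 1,
})

share = coverage_of_top(usage, 2)
print(f"top 2 concepts cover {share:.1%} of occurrences")
```

Running the same function with k set to 10,000 over real per-concept usage counts would be one way to reproduce a figure like the one on the slide.
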
  10. Creating GACS

  11. Requirements and Wishes

    1. An integrated view and bridge of existing thesauri
    2. Reuses thesaurus development work, incl. translations
    3. Compatible with existing databases
    4. Based on RDF technologies: URIs, SKOS etc.
    5. Available as Linked Open Data

    Currently building GACS Beta, a proof-of-concept implementation attempting to fulfill most requirements
  12. Selection of top 10,000 concepts

    Each partner organization provided the 10,000 concepts most frequently used in their respective databases. These lists of concepts were modified as follows:
    • added all countries (from AGROVOC)
    • added organisms hierarchy all the way to the top
  13. Automated mappings

    Created using AgreementMakerLight software between the full thesauri, for completeness.
    AgreementMakerLight was top performer at OAEI 2014 ontology mapping competition!
  14. Human evaluation of mappings

    Created Google Docs spreadsheets using the lists of selected concepts and the auto-generated mappings. Three sheets with circa 10,700 rows each. Mappings manually evaluated by staff of partner organizations. Evaluated 60 to 150 rows/hour, total evaluation time over 300 hours so far. Currently projected to take 500-600 hours for GACS Beta.
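
The figures above can be cross-checked with simple arithmetic. The sketch below assumes, as a simplification not stated on the slide, that every row in all three sheets is evaluated exactly once at a constant rate; the names are illustrative.

```python
SHEETS = 3
ROWS_PER_SHEET = 10_700  # "circa 10,700 rows each"

total_rows = SHEETS * ROWS_PER_SHEET


def hours_needed(rows, rows_per_hour):
    """Hours required to evaluate `rows` rows at a constant rate."""
    return rows / rows_per_hour


slow = hours_needed(total_rows, 60)    # slowest reported rate
fast = hours_needed(total_rows, 150)   # fastest reported rate

print(f"{total_rows} rows: {fast:.0f} to {slow:.0f} hours")
```

Under these assumptions the estimate ranges from 214 hours at 150 rows/hour to 535 hours at 60 rows/hour, so the 500-600 hour projection corresponds to working near the slower end of the reported range.
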
  15. Forming GACS concepts by merging the source concepts and aggregating their information

    rice UF paddy UF paddy rice
    cereals UF feed cereals UF small grain cereals (grain)
    Oryza sativa UF Oryza glutinosa UF Oryza indica UF Oryza japonica UF Oryza sativa … (subsp, var etc.)
    Oryza UF Padia UF rice (plant)

    agrovoc:c_5435 cabt:82917 nalt:56271 exactMatch
    agrovoc:c_5438 cabt:82935 nalt:56277 exactMatch
    agrovoc:c_1474 cabt:26247 exactMatch
    agrovoc:c_6599 cabt:101613 nalt:56293 exactMatch

    (actually we use SKOS, not traditional thesaurus tags)
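
The merging step above can be illustrated with a small sketch that takes source concepts already confirmed as equivalent and aggregates their labels under a newly coined identifier. This is not the GACS project's actual code: the class names, the `srcA:`/`srcB:`/`srcC:` and `gacs-example:` identifiers, the labels, and the rule that the first source supplies the preferred label are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SourceConcept:
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)


@dataclass
class MergedConcept:
    uri: str
    pref_label: str
    alt_labels: list
    exact_matches: list


def merge_concepts(new_uri, sources):
    """Merge equivalent source concepts into one concept with a new URI."""
    if not sources:
        raise ValueError("need at least one source concept")
    # Assumption: the first source concept supplies the preferred label.
    pref = sources[0].pref_label
    seen = {pref.lower()}
    alts = []
    for src in sources:
        # Every other distinct label (case-insensitive) becomes an alternative label.
        for label in [src.pref_label, *src.alt_labels]:
            if label.lower() not in seen:
                seen.add(label.lower())
                alts.append(label)
    # Keep links back to the source concepts, in the spirit of exactMatch.
    return MergedConcept(new_uri, pref, alts, [s.uri for s in sources])


# Hypothetical source concepts with invented identifiers and labels.
sources = [
    SourceConcept("srcA:1", "rice", ["paddy"]),
    SourceConcept("srcB:2", "rice", ["paddy rice"]),
    SourceConcept("srcC:3", "Rice", ["paddy", "rice (grain)"]),
]

merged = merge_concepts("gacs-example:C1", sources)
print(merged.pref_label, merged.alt_labels, merged.exact_matches)
```

A real implementation would also have to handle labels per language and decide editorially which preferred label wins when the sources disagree.
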
  16. Size of GACS

    GACS Beta will have around 14,000 of the most used concepts
  17. Quality evaluation

    Using the qSKOS and Skosify tools that can find and correct problems in SKOS vocabularies [1], we can detect
    • missing, invalid or overlapping concept labels
    • anomalies in concept hierarchy, e.g. cycles
    • ...and many other kinds of problems.

    Many problems are expected due to merging of concepts within GACS, but most should be automatically corrected.

    [1] Osma Suominen and Christian Mader: Assessing and Improving the Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
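
To make two of the checks listed above concrete, here is a plain-Python sketch of detecting missing labels and hierarchy cycles. It does not use or reproduce the qSKOS or Skosify APIs; the function names and the dictionary-based representation of labels and broader links are assumptions for illustration.

```python
def find_missing_labels(labels):
    """Return concepts whose preferred label is absent or blank."""
    return sorted(c for c, label in labels.items() if not label or not label.strip())


def find_cycle(broader):
    """Return one cycle in the broader relation as a list of concepts, or None.

    `broader` maps each concept to a list of its broader concepts.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}

    def visit(node, path):
        state[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            status = state.get(parent, WHITE)
            if status == GREY:
                # Reached a concept already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if status == WHITE:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(broader):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Invented example data.
labels = {"c1": "rice", "c2": "", "c3": None, "c4": "  "}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
acyclic = {"rice": ["cereals"], "cereals": ["plant products"]}

print(find_missing_labels(labels))
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

The recursive traversal keeps the sketch short; a very deep hierarchy would call for an iterative version.
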
  18. Demo of GACS Alpha in Skosmos

  19. Lessons already learned

    • It is hard to sustain focus on mapping beyond circa five hours per day.
    • Mapping reveals issues with both the source and target thesauri: areas for improvement, or errors, fixable in collaboration.
    • Starting with the 10,000 most-used concepts shines a light on parts of thesauri that may long have lacked attention.
    • Starting small, with a core, avoids the potential stress of over-committing resources.
    • Mapping provides an incentive to adopt open-data technologies that can prove beneficial in other areas.
  20. Challenges

  21. Differences in modeling

    Q: Are taxonomic organism names (e.g. ‘Bos taurus’) different concepts than the common names (‘cattle’)?
    • sometimes there is no 1:1 match and/or context of use is different
    • the source thesauri all have different policies

    No final answer yet...
  22. Lumps

    Clusters of concepts mapped one-to-several, several-to-one, or in spirals
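
One way to surface such lumps is to treat confirmed mappings as edges in a graph, group concepts into connected components, and flag any component that contains more than one concept from the same source thesaurus. The sketch below is an illustration under assumptions, not the project's method: identifiers are assumed to have the form `prefix:local`, and the prefixes `a`, `b`, `c` and the mapping pairs are invented.

```python
from collections import defaultdict


def clusters_from_mappings(pairs):
    """Group concepts into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def is_lump(cluster):
    """True if the cluster holds more than one concept from the same source."""
    sources = [uri.split(":", 1)[0] for uri in cluster]
    return len(sources) != len(set(sources))


# Invented mapping pairs: the first three concepts map one-to-one,
# while a:2 and a:3 both map to b:2 (several-to-one).
pairs = [("a:1", "b:1"), ("b:1", "c:1"), ("a:2", "b:2"), ("a:3", "b:2")]

clusters = clusters_from_mappings(pairs)
lumps = [c for c in clusters if is_lump(c)]
print(lumps)
```

Clusters flagged this way would still need editorial review, since whether two same-source concepts should be merged or kept apart is a modeling decision.
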

  23. Next steps and future of GACS

  24. Additional mapping rounds

    Need to perform 2-3 more smaller mapping rounds in order to ensure that all necessary concepts have been fully mapped between all source thesauri
  25. GACS system infrastructure

  26. VocBench for editing

  27. Beyond GACS Beta?

    Q: Can GACS replace existing agricultural thesauri?
    • definitely not with GACS Beta due to smaller scope/size
    • a future GACS may be an alternative for some scenarios, but not all uses of existing thesauri because
      ◦ they cover areas beyond agriculture
      ◦ existing systems and processes (publication, automatic indexing…) depend on current thesauri

    In future, more partners are expected and the scope of GACS can be adjusted.
  28. Thank you

    Reports available on the FAO AIMS site: http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports

    These slides: http://tinyurl.com/swib14-gacs
    osma.suominen@helsinki.fi
    tom@tombaker.org