Turning three overlapping thesauri into a Global Agricultural Concept Scheme

SWIB14
December 03, 2014

Authors: Thomas Baker / Osma Suominen (Sungkyunkwan University, Korea / National Library of Finland, Finland)

Abstract:
AGROVOC Concept Scheme, CAB Thesaurus (CABT), and NAL Thesaurus (NALT) largely overlap in scope (agriculture and agricultural research). This duplication is inefficient for their maintainers and a barrier to searching across databases indexed with their terms. Common representation in SKOS makes mapping easier, but it would in principle be more efficient to merge the thesauri into one shared concept scheme, jointly maintained by the three organizations. A feasibility study has defined a semi-automatic method for mapping among the three thesauri. Confirmed mappings will be used to coin new concepts, with new URIs, for a shared Global Agricultural Concept Scheme (GACS). One key challenge will be to balance the inclusion of diverse concept hierarchies from the source thesauri against a desire to converge on common semantics through editorial intervention. Partners who currently use their thesauri to automatically generate derivative products will need to balance the efficiencies of sharing a concept scheme against the control required for local production processes. GACS will be natively represented as SKOS XL, edited using VocBench software and published using the Skosmos platform (both open source software) under a Creative Commons license. The GACS project aspires to constitute a consortium open to other thesaurus maintainers. The first version of GACS will be available online in time for a presentation of lessons learned at SWIB 2014.

Transcript

  1. Turning three overlapping thesauri into a Global Agricultural Concept Scheme
    SWIB14, Bonn, 3 December 2014
    Osma Suominen and Thomas Baker

  2. Outline
    1. Background
    2. Starting point: three thesauri
    3. Creating GACS
    4. Challenges
    5. Next steps and future of GACS

  3. Background
    ● Food and Agriculture Organization of the UN
    ● CABI (UK)
    ● National Agricultural Library (US)
    Each organization maintains a thesaurus of terms and concepts related to
    agriculture -- concepts like rice, ricefield aquaculture, and plant pests.

  4. Global Agricultural Concept Scheme (GACS)
    1. To improve the semantic interoperability of thesauri maintained by FAO, CABI, and NAL.
    2. To provide core concepts broadly supported across the three thesauri.
    3. To achieve efficiencies of scale by maintaining the core concepts in cooperation.

  5. Three Thesauri

  6. Separate thesauri, separate databases
    Create GACS as a glue linking them together

  7. AGROVOC, CAB Thesaurus, NAL Thesaurus
    140,000 concepts, >1.4M terms
    32,000 concepts, >1.2M terms
    53,000 concepts, >200k terms
    English, Spanish, Portuguese, German, Czech, Persian, Polish, Hindi, French, Italian, Russian, Japanese, Hungarian, Chinese, Slovak, Thai, Lao, Turkish, Korean, Arabic, Telugu ...
    English, Spanish, Portuguese, Dutch + many languages with lower coverage
    English, Spanish
    All thesauri represented using SKOS

  8. Overlap estimate
    Obtained via automatic mappings created using AgreementMakerLight

  9. Long tail distribution (in AGRIS)
    10,000 concepts cover nearly 99% of occurrences in metadata
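
A coverage figure of this kind can be computed from per-concept usage counts. The following is a minimal sketch, assuming the counts are available as a simple mapping from concept to number of occurrences; the function name and the toy numbers are illustrative and are not AGRIS data.

```python
from collections import Counter


def coverage_of_top(counts, k):
    """Share of all occurrences accounted for by the k most frequent concepts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / total


# Toy usage counts with a long tail (hypothetical numbers, not AGRIS data).
usage = Counter({"rice": 900, "cereals": 60, "Oryza": 30, "Padia": 7, "paddy soils": 3})

print(coverage_of_top(usage, 2))  # 0.96
```

With real indexing data, the same function evaluated at k = 10,000 would give the share of occurrences covered by the most-used concepts.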

  10. Creating GACS

  11. Requirements and Wishes
    1. An integrated view and bridge of existing thesauri
    2. Reuses thesaurus development work, incl. translations
    3. Compatible with existing databases
    4. Based on RDF technologies: URIs, SKOS etc.
    5. Available as Linked Open Data
    Currently building GACS Beta, a proof-of-concept
    implementation attempting to fulfill most requirements

  12. Selection of top 10,000 concepts
    Each partner organization provided the 10,000 concepts most frequently used in their respective databases.
    These lists of concepts were modified as follows:
    ● added all countries (from AGROVOC)
    ● added organisms hierarchy all the way to the top
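
The selection step can be sketched roughly as follows. This is not the project's actual tooling: the function name, the data structures, and the toy data are assumptions made for illustration. `organism_broader` is assumed to hold broader links for organism concepts only, and `always_include` stands in for the added list of countries.

```python
def select_core(usage, organism_broader, n, always_include=()):
    """Pick the n most-used concepts, add forced concepts, then add organism ancestors."""
    # Rank by descending usage; break ties alphabetically so the result is stable.
    ranked = sorted(usage, key=lambda c: (-usage[c], c))
    selected = set(ranked[:n]) | set(always_include)
    # Walk the organism hierarchy upward from every selected concept.
    stack = list(selected)
    while stack:
        concept = stack.pop()
        for parent in organism_broader.get(concept, ()):
            if parent not in selected:
                selected.add(parent)
                stack.append(parent)
    return selected


# Toy data (hypothetical counts and hierarchy, for illustration only).
usage = {"Oryza sativa": 50, "rice": 40, "Padia": 1}
organism_broader = {"Oryza sativa": ["Oryza"], "Oryza": ["Poaceae"], "Padia": ["Poaceae"]}
countries = ["Finland"]

core = select_core(usage, organism_broader, 2, countries)
print(sorted(core))
```

Because ancestors and countries are added on top of the frequency cut, the resulting list is larger than n, which is in line with the sheets of circa 10,700 rows mentioned later.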

  13. Automated mappings
    Created using AgreementMakerLight software between the full thesauri, for completeness
    AgreementMakerLight was top performer at OAEI 2014 ontology mapping competition!

  14. Human evaluation of mappings
    Created Google Docs spreadsheets using the lists of selected concepts and the auto-generated mappings. Three sheets with circa 10,700 rows each.
    Mappings manually evaluated by staff of partner organizations.
    Evaluated 60 to 150 rows/hour, total evaluation time over 300 hours so far.
    Currently projected to take 500-600 hours for GACS Beta.
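
As a rough cross-check of these figures, the arithmetic for a single pass over all rows can be worked out as below. The assumption that every row is evaluated exactly once is ours; the slides do not break the projection down.

```python
def evaluation_hours(total_rows, rows_per_hour):
    """Hours needed to evaluate every row once at a constant rate."""
    return total_rows / rows_per_hour


total_rows = 3 * 10_700  # three sheets with circa 10,700 rows each

print(total_rows)                         # 32100
print(evaluation_hours(total_rows, 150))  # 214.0 (fastest reported rate)
print(evaluation_hours(total_rows, 60))   # 535.0 (slowest reported rate)
```

Under that assumption a single pass takes between 214 and 535 hours, so the projected 500-600 hours sits near the slow end of the reported range.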

  15. Forming GACS concepts
    by merging the source concepts and aggregating their information
    rice
      UF paddy
      UF paddy rice
    cereals
      UF feed cereals
      UF small grain cereals (grain)
    Oryza sativa
      UF Oryza glutinosa
      UF Oryza indica
      UF Oryza japonica
      UF Oryza sativa … (subsp, var etc.)
    Oryza
      UF Padia
      UF rice (plant)
    exactMatch: agrovoc:c_5435, cabt:82917, nalt:56271
    exactMatch: agrovoc:c_5438, cabt:82935, nalt:56277
    exactMatch: agrovoc:c_1474, cabt:26247
    exactMatch: agrovoc:c_6599, cabt:101613, nalt:56293
    (actually we use SKOS, not traditional thesaurus tags)
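
One way to picture the merging step is to treat confirmed exactMatch mappings as edges and take each connected group of source concepts as one candidate GACS concept. The sketch below does this with a small union-find structure. It is an illustration under that assumption, not the project's actual procedure: the function name is hypothetical, the identifiers are taken from the slide, and label aggregation is left out.

```python
def merge_by_exact_match(mappings):
    """Group source concepts into clusters connected by exactMatch pairs."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening paths along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in mappings:
        parent[find(a)] = find(b)

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(members) for members in clusters.values())


# Pairwise mappings using identifiers shown on the slide.
mappings = [
    ("agrovoc:c_5435", "cabt:82917"),
    ("cabt:82917", "nalt:56271"),
    ("agrovoc:c_1474", "cabt:26247"),
]

for cluster in merge_by_exact_match(mappings):
    print(cluster)
```

Each printed cluster would then receive a new GACS URI, with the labels of its members collected under it.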

  16. Size of GACS
    GACS Beta will have around 14,000 of the most used concepts

  17. Quality evaluation
    Using the qSKOS and Skosify tools that can find and correct problems in SKOS
    vocabularies [1], we can detect
    ● missing, invalid or overlapping concept labels
    ● anomalies in concept hierarchy, e.g. cycles
    ● ...and many other kinds of problems.
    Many problems are expected due to merging of concepts within GACS, but
    most should be automatically corrected.
    [1] Osma Suominen and Christian Mader: Assessing and Improving the
    Quality of SKOS Vocabularies. JoDS, 3(1) 2014.
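
To make two of these checks concrete, here is a minimal sketch of cycle detection in a broader hierarchy and of detecting labels shared by several concepts. It does not use or reproduce the qSKOS or Skosify APIs; the function names, the plain-dictionary data model, and the toy identifiers are assumptions for illustration.

```python
def find_hierarchy_cycle(broader):
    """Return one cycle in the broader graph as a list of concepts, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node, path):
        state[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            seen = state.get(parent, WHITE)
            if seen == GREY:  # parent is already on the current path: a cycle
                return path[path.index(parent):] + [parent]
            if seen == WHITE:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(broader):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


def overlapping_labels(pref_labels):
    """Map each label used by more than one concept to those concepts."""
    by_label = {}
    for concept, label in pref_labels.items():
        by_label.setdefault(label.lower(), []).append(concept)
    return {label: sorted(cs) for label, cs in by_label.items() if len(cs) > 1}


# Toy inputs with hypothetical identifiers.
print(find_hierarchy_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
print(overlapping_labels({"x:1": "Rice", "y:2": "rice", "z:3": "cereals"}))
```

The recursive traversal is fine for small examples; a vocabulary with very deep hierarchies would call for an iterative version.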

  18. Demo of GACS Alpha in Skosmos

  19. Lessons already learned
    ● It is hard to sustain focus on mapping beyond circa five hours per day.
    ● Mapping reveals issues with both the source and target thesauri -- areas
    for improvement, or errors, fixable in collaboration.
    ● Starting with the 10,000 most-used concepts shines a light on parts of
    thesauri that may long have lacked attention.
    ● Starting small, with a core, avoids the potential stress of over-committing
    resources.
    ● Mapping provides an incentive to adopt open-data technologies that can
    prove beneficial in other areas.

  20. Differences in modeling
    Q: Are taxonomic organism names (e.g. ‘Bos taurus’)
    different concepts than the common names (‘cattle’)?
    ● sometimes there is no 1:1 match
    and/or context of use is different
    ● the source thesauri all have different policies
    No final answer yet...

  21. Lumps
    clusters of concepts mapped one-to-several, several-to-one, or in spirals
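
The slide does not define lumps formally. One possible working definition, assumed here, is a cluster of mapped concepts that contains more than one concept from the same source thesaurus. The sketch below flags such clusters; the function name and the identifiers are hypothetical, and the source thesaurus is assumed to be recoverable from the prefix before the colon.

```python
from collections import Counter


def find_lumps(clusters):
    """Return clusters holding more than one concept from the same source."""
    lumps = []
    for cluster in clusters:
        # Count members per source thesaurus, using the prefix before ':'.
        per_source = Counter(uri.split(":", 1)[0] for uri in cluster)
        if any(count > 1 for count in per_source.values()):
            lumps.append(sorted(cluster))
    return lumps


# Hypothetical clusters: the first is a clean 1:1:1 match, the second is a lump.
clusters = [
    ["agrovoc:x1", "cabt:y1", "nalt:z1"],
    ["agrovoc:x2", "cabt:y2", "cabt:y3"],
]

print(find_lumps(clusters))
```

Clusters flagged this way would need editorial review, since a single merged concept cannot be derived from them automatically.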

  22. Next steps and future of GACS

  23. Additional mapping rounds
    Need to perform 2-3 more smaller mapping rounds in order to ensure that all necessary concepts have been fully mapped between all source thesauri

  24. GACS system infrastructure

  25. VocBench for editing

  26. Beyond GACS Beta?
    Q: Can GACS replace existing agricultural thesauri?
    ● definitely not with GACS Beta due to smaller scope/size
    ● a future GACS may be an alternative for some
    scenarios, but not all uses of existing thesauri because
    ○ they cover areas beyond agriculture
    ○ existing systems and processes (publication,
    automatic indexing…) depend on current thesauri
    In future, more partners are expected and the scope of GACS can be adjusted.

  27. Thank you
    Reports available on the FAO AIMS site:
    http://aims.fao.org/community/agrovoc/blogs/phase-one-gacs-approved-read-reports
    These slides: http://tinyurl.com/swib14-gacs