Information schemas & datasets discovery: crossroads for cross-domain applications

In recent years, the research world has seen significant changes in the way standard workflows are conducted. These changes are reflected in the growth of dataset provision through different platforms and systems. As research becomes more interdisciplinary, providing datasets and other research objects in a homogeneous way becomes an arduous challenge. Ontologies and other information schemas have been used to address this homogeneity issue, better known as information integration, and the Libraries, Archives and Museums fields, as well as their digital equivalents, have shown noteworthy activity in this research area, on both the theoretical and the practical side. This talk will highlight the challenging areas in information integration and will provide an overview of some available solutions for cross-domain applications.

Presentation in the Simulation and Data Science Colloquium, Tuesday 26 June 2018 at The Cyprus Institute – Novel Technologies Building, 1st Floor Events Room, Athalassa Campus.


Giannis Tsakonas

June 26, 2018


  1. Information schemas & datasets discovery: crossroads for cross-domain applications • Giannis Tsakonas, Library & Information Center, University of Patras, Greece • SIMDAS Colloquium, June 26, 2018, Nicosia, Cyprus
  2. Context • Three superficially distinct domains: • Life Sciences • Climate Change • Cultural Heritage • "Yet that big-and-slow data set offers clues to understanding oceanic “dead zones” and factors contributing to climate change. The distribution of microorganisms’ shells can signal changes in ocean currents and species migration, and the presence of particular oxygen isotopes can reveal the rate at which carbon is reaching the ocean floor or how much water is locked up in land-based ice sheets at a given time." [Mattern, 2017]
  3. Why access datasets? • Access to datasets is required for the transition to a state of Open Science. • We assume that datasets will be reused by researchers, in or out of service environments, for many purposes. • We hope that datasets will be used by citizens and will educate them in scientific and reasoned thinking. • We aspire that datasets will stir innovation and entrepreneurship.
  4. The problem(s) • All three areas collect, organize and provide datasets. • We could define datasets as sets of recorded information (observations, experiments, measurements, markings, annotations). • These data have been used in certain stages of scientific processes. • The scientific process is more than data: it also involves theories, hypotheses, findings, interpretations, etc. • The first problem is the concreteness of the information we are managing. • The second problem is the heterogeneity of the data itself (types, versions, sizes, etc.). • The third problem is the organization of information.
  5. Challenges • Complex integration challenges, within and beyond a domain. • The challenge is to have information schemas that balance effectiveness of documentation against efficiency of ‘work’. • Lighter profiles of schemas exist to address the issue of complexity, e.g. CDWA Lite: “CDWA Lite records are intended for contribution to union catalogues and other repositories using the Open Archives Initiative (OAI) harvesting protocol”. • Clear containers for the physical and the digital asset.
  6. Integration through interoperability • Semantic interoperability: agreement on terms, i.e. a common understanding of the meanings of the concepts used. • Syntactic interoperability: agreement on structures, i.e. a common understanding of the way a record is built.
  7. Schemas of information • Vocabularies • Integration by terminological equivalence, between labels, between meanings, between languages, etc. • Metadata schemas • Integration by mapping metadata schemas, e.g. crosswalks between fields • Ontologies • Integration by ontology mediation
  8. Definition of metadata • Simple: Information about resources in a structured and organized format • Object-centered information schema • In the semantic web world, metadata classes are linked by their properties.
  9. What do metadata do? • Describe: descriptive information • Document: provenance, preservation, status information • Link: structural information
  10. Definition of ontologies • Simple: provide the representational machinery with which to instantiate domain models in knowledge bases, make queries to knowledge-based services, and represent the results of calling such services [Gruber 2007]
  11. What do ontologies do? • Ontologies are formal models that help us: • Understand a domain of knowledge (what, where, when, how…) • Structure a knowledge base to collate different instances (records of actors, events, places, topics…) • Infer a logical development (what has happened, what comes next…)
  12. Categories of ontologies • Domain Ontologies: represent knowledge of a domain or a discourse. • Metadata Ontologies: represent the semantics of vocabularies for the description of domain information. • Generic/Common Sense Ontologies: represent information based on common sense concepts, such as time, space, events, etc. • Representational Ontologies: represent concepts of high abstraction. • Task Ontologies: represent processes and methods.
  13. Ontologies as Knowledge Structures • As conceptual constructs, define the semantics of information in a coherent way and facilitate its processing. • An approach that: • Reflects the structure of a domain. • Highlights relationships. • Supports information conversion.
  14. Ontologies as Information Access Tools • Define the semantics of the real world and facilitate its connection to machine-accessible content, based on a commonly agreed terminology. • An approach that: • Links the conceptual to the physical world. • Emphasizes the terminological view, e.g. vocabularies and their agreement. • Supports information discovery.
  15. Crossroads for cross-domain problems • Cross-domain infrastructures face the problem of harmonization/integration. • A decision to be made is whether one should construct a new ontology/schema or use existing ones. • Aim for a balance between representation and efficiency of the new schema. • Answering the why and for whom will substantially define the scope and the choice.
  16. Problems • Conceptual issues between ontologies/schemas • Nonexistent concepts • Overlapping/Mismatched concepts • Terminological issues • Different terms for the same concept • Different concepts for the same term • Inadequacy of terms • Architecture issues • Event-centered approaches • Object-centered approaches
  17. Ontology mediation • Alignment: the correspondence between the elements of two or more ontologies. • Schema-based alignment • Instance-based alignment • Legacy ontologies remain intact; correspondences work on a middle layer. • Merging: the unification of two or more ontologies to create a new one. • All elements of legacy ontologies should be represented in the new one.
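
The schema-based alignment described on this slide can be sketched as follows. This is a minimal illustration, not a method from the talk: the two toy ontologies, their identifiers, and the 0.8 threshold are all assumptions, and string similarity of class labels stands in for the much richer matching that real alignment tools perform. The correspondences are kept in a separate list (the "middle layer"), so both input ontologies stay untouched.

```python
from difflib import SequenceMatcher

# Two toy ontologies (hypothetical): class identifier -> human-readable label.
ONTOLOGY_A = {"a:Person": "Person", "a:Place": "Place", "a:Event": "Event"}
ONTOLOGY_B = {"b:Actor": "Actor", "b:Places": "Places", "b:Event": "event"}


def label_similarity(left, right):
    """Case-insensitive string similarity between two labels, in [0, 1]."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def align(onto_a, onto_b, threshold=0.8):
    """Return correspondences (id_a, id_b, score) without modifying the inputs.

    For every class in onto_a, keep its best-scoring class in onto_b,
    provided the score reaches the (assumed) threshold.
    """
    correspondences = []
    for id_a, label_a in onto_a.items():
        best_id, best_score = None, 0.0
        for id_b, label_b in onto_b.items():
            score = label_similarity(label_a, label_b)
            if score > best_score:
                best_id, best_score = id_b, score
        if best_id is not None and best_score >= threshold:
            correspondences.append((id_a, best_id, round(best_score, 2)))
    return correspondences


# The alignment lives beside the ontologies, as a middle layer.
alignment = align(ONTOLOGY_A, ONTOLOGY_B)
for id_a, id_b, score in alignment:
    print(f"{id_a} <-> {id_b} (similarity {score})")
```

Note that a:Person and b:Actor are not matched even though a domain expert might consider them equivalent. Label similarity alone cannot resolve the "different terms for the same concept" problem, which is where instance-based alignment or manual mediation comes in.
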
  18. Metadata crosswalks • Correspondences of metadata fields between two or more standards. • The case of CARARE: • POLIS DTD, MIDAS, EDM, LIDO
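
A crosswalk of this kind can be pictured as a lookup table from source fields to target fields. The sketch below is only illustrative: the source field names are invented, the targets are Dublin Core element names, and nothing here reflects the actual CARARE mappings between POLIS DTD, MIDAS, EDM and LIDO.

```python
# Hypothetical crosswalk: the local field names on the left are invented for
# illustration; the targets on the right are Dublin Core element names.
CROSSWALK = {
    "object_name": "title",
    "maker": "creator",
    "production_date": "date",
    "site": "coverage",
}


def apply_crosswalk(record, crosswalk):
    """Return (mapped_record, unmapped_fields) for one source record."""
    mapped = {}
    unmapped = []
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # No correspondence exists: this information is lost in the target.
            unmapped.append(field)
        else:
            # Several source fields may map to one target, so collect lists.
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


source_record = {
    "object_name": "Bronze figurine",
    "maker": "Unknown workshop",
    "production_date": "ca. 500 BCE",
    "excavation_trench": "T4",
}

mapped, unmapped = apply_crosswalk(source_record, CROSSWALK)
print(mapped)
print(unmapped)
```

Returning the unmapped fields explicitly makes the cost of the crosswalk visible: anything without a counterpart in the target standard is dropped unless it is handled separately.
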
  19. Datasets discovery - what to model? • Resources • Domain-agnostic approaches • Serving: Discovery • Data Catalog Vocabulary (DCAT) • an RDF vocabulary to facilitate interoperability between data catalogs published on the Web. • Processes • Domain-governed approaches • Serving: Documentation • Ontology for Biomedical Investigations • Scientific Observation Model • Based on upper ontologies, such as Basic Formal Ontology and CIDOC CRM.
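
To give a feel for what a domain-agnostic, discovery-oriented description looks like, here is a minimal sketch of a DCAT dataset record serialized as JSON-LD using only the Python standard library. The class and property names (dcat:Dataset, dcat:Distribution, dcat:keyword, dcat:distribution, dcat:downloadURL, dcat:mediaType, dct:title) come from the DCAT and Dublin Core Terms vocabularies; the helper function, the example.org identifiers and the dataset itself are hypothetical.

```python
import json


def make_dcat_dataset(identifier, title, keywords, download_url, media_type_uri):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:keyword": list(keywords),
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": {"@id": media_type_uri},
            }
        ],
    }


# All identifiers below are placeholders, not real resources.
record = make_dcat_dataset(
    identifier="https://example.org/dataset/sediment-cores",
    title="Ocean sediment core measurements",
    keywords=["climate change", "oxygen isotopes"],
    download_url="https://example.org/files/sediment-cores.csv",
    media_type_uri="https://www.iana.org/assignments/media-types/text/csv",
)

print(json.dumps(record, indent=2))
```

The record says nothing about how the measurements were produced. That kind of process documentation is what the domain-governed ontologies on this slide are meant to capture.
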
  20. Issues to consider • Ontologies are intentionally designed structures, i.e. biased. One should avoid imposing one's own intentionality on other structures. New cross-domain models should be neutral. • Skills are required to construct a ‘meta-semantic’ model. Ontologies are already complex semantic constructs.
  21. Issues to consider • Conceptual exercises • Identification of correspondences • Representation of correspondences • Technical compatibility • Common description languages, e.g. OWL • Other agreements, e.g. standardized values for time, space, etc. • Legal compatibility • License agreement
  22. Conclusions • The issue of discovery is different from the issue of documentation. • Discovery requires compromise. Information will condense. • Documentation can continue in legacy schemas.
  23. Thank you for your attention.