A Library Data Management Platform Based on Linked Open Data

SWIB14
December 02, 2014

Presenters: Jens Mittelbach / Robert Glaß (SLUB Dresden, Germany / Avantgarde Labs)

Abstract:
The rise of resource discovery, the growing number of information channels, and the increasing complexity of technological infrastructure have created organizational and financial challenges for libraries. Library data has become more heterogeneous, and its sources have multiplied. Bibliographic and authority data, licence and business data, usage data from library catalogues and the global science community (bibliometric data), and open data from the web together constitute the graph that describes the resources managed by libraries. Consequently, there is an increasing need to integrate, normalize, and enrich existing library data sets and to assure data quality for production and presentation purposes. The Saxon State and University Library Dresden (SLUB) has chosen a new approach to data integration for libraries and other cultural heritage institutions. In an ERDF-funded project (German: EFRE), a scalable cloud-based data management platform called d:swarm has been implemented. Featuring an easy-to-use web-based modelling GUI, d:swarm allows heterogeneous data sources to be integrated and interlinked in a flexible property graph data store. As a middleware layer, it runs on top of existing library software infrastructures. Thus, existing library workflows that depend on a variety of software solutions can remain untouched, while data integration can be tailored to the needs of individual institutions. With d:swarm, feeding a library’s discovery front-end with high-quality normalized data or disseminating Linked Open Data becomes much easier. The project is published under an open source licence.

Transcript

  1. SLUB Dresden slub-dresden.de
    CC BY-SA 4.0
    Avantgarde Labs
    Robert Glaß
    25 November, 2014
    Jens Mittelbach | Robert Glaß
    A Library Data Management Platform
    Based on Linked Open Data

  2. D:SWARM
    A Library Data Management Platform Based on Linked Open Data
    • Back in Those Days
    • The Age of Discovery
    • Library Data Management
    • Qualify, Link and Free Your Data: D:SWARM
    • Live Demo

  3. Back in Those Days …
    3 December 2014
    Data Heterogeneity
    • Multiple individual data silos
        • ILS, document repositories, databases, …
    • Data saved in heterogeneous formats
        • MAB, MARC21, …
    • Each data silo gets processed individually
        • Multiple admin interfaces
        • Multiple search interfaces
        • Data unrelated to one another
    • Comprehensive view of resources almost impossible (for users and librarians)

  4. The Age of “Discovery”
    Data Normalization
    • More comprehensive view of resources for users, but no real discovery/exploration
    • Data gets normalized into one store but not integrated
    • Data available in record-oriented structures
        • External data (e.g. GND) has to be squeezed into the record
        • Metadata records are independent of each other
        • No explicit semantic quality of data

  5. Library Data Management
    What Libraries Actually Need
    • Get rid of data silos
        • Open formats for exchange
    • Lossless data integration instead of reductive normalization
    • Data integration with entity-level granularity
        • Get rid of pre-compiled data records
    • Focus on linking entities/objects
        • Graph structures creating the knowledge graph
    • Stick to the quality policy of libraries
        • Versioning and provenance of data

  6. What Should Library Data Actually Look Like?

  7. Whose Job Is Library Data Integration?
    • Data integration should be done by domain experts
        • Librarians, not IT staff (IT is always understaffed)
        • Programming skills should not be a requirement
        • Good user experience is a prerequisite for adoption
    • Example-driven modelling approach
    • Value created in the community should be reusable

  8. What Tools Do We Need?
    Our Approach: An Open Source Data Management Platform

  9. How Can Data Integration Be Done?

  10. Qualify, Link and Free Your Data: D:SWARM
    Who’s behind This Project?
    • Collaborative development team of SLUB Dresden and Avantgarde Labs GmbH
    • Started work in June 2013
    • Funded by the European Regional Development Fund (ERDF)

  11. Our Challenge: Existing Data Formats: MAB, MARC
    • "Selection of keywords"
    • Relevant MAB fields are 902x, 907x, 912x, 917x, 922x.
    • These fields have subfields a, b, c, … coded with further information (type of keyword, person, time, place, concept, ...)
    • For each field from 902x to 922x we have to check:
        • Is there one of these strings (800|801|820|830|845|850|860|870|880) in subfield "a"?
        • If so, is there one of these strings (c|g|k|p|s|t|z) in subfield "b"?
        • If so, the value in subfield "c" qualifies as a keyword
    • The keyword needs to be trimmed (which is the easiest part)
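
The keyword rule on this slide can be sketched in a few lines of Python. This is only an illustration, not d:swarm code. The record layout (a list of tag/subfield pairs), the function name `select_keywords`, and the reading of "is there one of these strings" as an exact match on the trimmed subfield value are all assumptions.

```python
# Field prefixes and codes taken from the slide.
KEYWORD_FIELDS = {"902", "907", "912", "917", "922"}
SUBFIELD_A_CODES = {"800", "801", "820", "830", "845", "850", "860", "870", "880"}
SUBFIELD_B_CODES = {"c", "g", "k", "p", "s", "t", "z"}


def select_keywords(record):
    """Return the trimmed keywords found in one record.

    Assumption: `record` is a list of (tag, subfields) pairs, where `tag` is
    the three-digit field number plus an optional indicator character and
    `subfields` maps a subfield code ("a", "b", "c", ...) to its value.
    """
    keywords = []
    for tag, subfields in record:
        if tag[:3] not in KEYWORD_FIELDS:
            continue
        if subfields.get("a", "").strip() not in SUBFIELD_A_CODES:
            continue
        if subfields.get("b", "").strip() not in SUBFIELD_B_CODES:
            continue
        value = subfields.get("c", "").strip()  # the "easiest part": trimming
        if value:
            keywords.append(value)
    return keywords


# Hypothetical sample record, invented for illustration.
record = [
    ("902 ", {"a": "800", "b": "s", "c": "  Linked Open Data "}),
    ("907 ", {"a": "999", "b": "s", "c": "ignored"}),
    ("331 ", {"a": "800", "b": "s", "c": "not a keyword field"}),
]
print(select_keywords(record))
```

Even this small rule needs three nested conditions, which illustrates why the deck argues for a modelling GUI that domain experts can use without programming.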

  12. Our Challenge: Existing Tools: Talend

  13. Our Challenge: Existing Tools: OpenRefine

  14. What Is D:SWARM?
    • Graphical web-based ETL modelling tool that serves to:
        • import data from heterogeneous sources with different formats
        • map input to output schemata and design transformation workflows
        • load transformed data into a property graph database
    • With additional functionality:
        • Exporting data models as RDF
        • Sharing mappings and transformation workflows
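
The import, map and load steps named on this slide can be sketched as follows. This is a toy illustration and not how d:swarm is implemented. The mapping structure, the column names `Titel` and `Jahr`, the `record:N` identifiers and the in-memory list standing in for the graph database are all assumptions.

```python
import csv
import io

# Hypothetical mapping: output attribute -> (input attribute, transformation functions).
MAPPING = {
    "title": ("Titel", [str.strip]),
    "year": ("Jahr", [str.strip, int]),
}


def transform(row, mapping):
    """Apply each mapping's chain of transformation functions to one input row."""
    result = {}
    for target, (source, functions) in mapping.items():
        value = row[source]
        for function in functions:
            value = function(value)
        result[target] = value
    return result


# 1. Import: a tiny semicolon-separated source standing in for a real export.
SOURCE = "Titel;Jahr\n Faust ; 1808\n"
rows = list(csv.DictReader(io.StringIO(SOURCE), delimiter=";"))

# 2. Map input attributes to the output schema and transform the values.
records = [transform(row, MAPPING) for row in rows]

# 3. Load: flatten into (subject, predicate, object) statements. A real setup
#    would write these to a graph database instead of a Python list.
statements = [
    (f"record:{number}", attribute, value)
    for number, record in enumerate(records, start=1)
    for attribute, value in record.items()
]
print(statements)
```

Keeping the mapping as data rather than code is what would make it possible to share mappings and transformation workflows between institutions.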

  15. How Does D:SWARM Work?
    • Modelling GUI and job repository
    • Execution environment
        • Operational data from heterogeneous data sources (ILS, OAI-PMH, CSV, …) gets processed according to the transformation logic defined in the modelling GUI
    • Admin centre
        • Scheduling & execution planning
        • Monitoring of the system (data ingest, processing, errors)

  16. Why a Property Graph?
    • Node (S) – Edge (P) – Node (O)
    • Extension of the RDF data model: each element can be endowed with additional information (key : value)
        • Version number
        • Provenance information
        • Type information
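
A minimal sketch of the idea, assuming nothing about d:swarm's actual data model: nodes and edges form subject, predicate, object statements as in RDF, and each element additionally carries a key/value dictionary. The class names, the `example.org` identifiers and the property keys are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    properties: dict = field(default_factory=dict)  # e.g. type information


@dataclass
class Edge:
    subject: Node
    predicate: str
    obj: Node
    properties: dict = field(default_factory=dict)  # e.g. version, provenance


def as_triple(edge):
    """Reduce a property-graph statement to a plain (S, P, O) triple.

    The extra key/value information on nodes and edges is dropped, which is
    exactly what a plain triple cannot express.
    """
    obj = edge.obj.properties.get("value", edge.obj.id)
    return (edge.subject.id, edge.predicate, obj)


book = Node("http://example.org/resource/1", {"type": "resource"})
title = Node("literal:1", {"type": "literal", "value": "Faust"})
statement = Edge(
    book,
    "http://purl.org/dc/terms/title",
    title,
    {"version": 2, "provenance": "http://example.org/datamodel/mabxml-import"},
)

print(as_triple(statement))
print(statement.properties)
```

Because version and provenance sit on the edge itself, two imports of the same statement can coexist and be told apart without changing the statement.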

  17. Intermediate Results as of November 2014
    • Modelling GUI in 2nd version
        • Available file importers: XML, CSV, MABXML
        • Simple schema editor & graphical schema mapper
        • Transformation workflow designer & filter (Metafacture)
    • Execution of mappings and transformations in the modelling GUI
    • Persistence in a graph database (Neo4j)
    • Exporters: Turtle, N-Quads, N3, …
    • Publication under an open source licence (Apache 2): https://github.com/dswarm
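
To show what one of the listed export formats looks like, here is a small hand-written N-Quads serializer. It is an illustrative sketch, not the d:swarm exporter. The function names and the `example.org` IRIs are assumptions, and only the most common literal escapes are handled.

```python
def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Quads literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_nquad(subject, predicate, obj, graph, obj_is_iri=False):
    """Serialize one statement as a single N-Quads line.

    The fourth element names the graph the statement belongs to, which is one
    way to keep statements from different sources apart in an export.
    """
    obj_term = f"<{obj}>" if obj_is_iri else f'"{escape_literal(obj)}"'
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


print(
    to_nquad(
        "http://example.org/resource/1",
        "http://purl.org/dc/terms/title",
        "Faust",
        "http://example.org/datamodel/1",
    )
)
```

In practice an RDF library would handle serialization, including the remaining escape rules and blank nodes.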

  18. Live Demo
    http://demo.dswarm.org

  19. Our Next Steps
    • Provision of URI templates for resource matching and linking
    • Scalable execution engine for production mode
    • Extension of transformation function set
    • Extension of importers
    • Implementation of an administration centre
    • Deduplication and FRBRization
    • Integration of SLUBsemantics Enrichment Service
    • Implementation of sharing features

  20. Your Next Steps
    • Follow us on twitter.com/dswarm, www.dswarm.org or github.com/dswarm
    • Try it out and get in contact with us
        • http://demo.dswarm.org
        • https://github.com/dswarm/dswarm-documentation/wiki
        • [email protected]
    • Help us prioritize our backlog
        • https://jira.slub-dresden.de/
    • Fork us on github.com/dswarm