From MARC silos to Linked Data silos? (SWIB16)

A talk about bibliographic data models given at the SWIB16 conference in Bonn, Germany, on November 30, 2016.

Video recording: https://www.youtube.com/watch?v=xp-PRtdF5_U

Abstract:

Many libraries are experimenting with publishing their metadata as Linked Data in order to open up bibliographic silos, usually based on MARC records, and make them more interoperable, accessible and understandable to developers who are not intimately familiar with library data. The libraries who have published Linked Data have all used different data models for structuring their bibliographic data. Some are using a FRBR-based model where Works, Expressions and Manifestations are represented separately. Others have chosen basic Dublin Core, dumbing down their data into a lowest common denominator format. The proliferation of data models limits the reusability of bibliographic data. In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. Data sets can be difficult to combine, for example when one data set is modelled around Works while another mixes Work-level metadata such as author and subject with Manifestation-level metadata such as publisher and physical form. Small modelling differences may be overcome by schema mappings, but it is not clear that interoperability has improved overall. We present a survey of published bibliographic Linked Data, the data models proposed for representing bibliographic data as RDF, and tools used for conversion from MARC. We also present efforts at the National Library of Finland to open up metadata, including the national bibliography Fennica, the national discography Viola and the article database Arto, as Linked Data while trying to learn from the examples of others.

Google Slides: https://tinyurl.com/linked-silos

Osma Suominen

November 30, 2016

Transcript

  1. From MARC silos to Linked Data silos? Osma Suominen and Nina Hyvönen. SWIB16, Bonn, November 30, 2016.
  2. [Diagram] “Family forest” of bibliographic data models, conversion tools, application profiles and data sets. Legend: non-RDF data model, RDF data model, conversion tool, application profile, data set; models range from flat/record-based (don’t have Works) to entity-based (have Works). Nodes include MARC, MODS, MODS RDF, marcmods2rdf, Dublin Core, DC-RDF, Catmandu, BIBO, FaBiO, BIBFRAME 1.0, BIBFRAME 2.0, FRBR, FRBRer, eFRBRoo, FRBRoo, FRBR Core, ALIADA, RDA Vocabulary, BNE ontology, MARiMbA, marc2bibframe, bibfra.me (Zepheira), pybibframe, LD4L ontology, LD4P ontology, LD4L, BNE, schema.org + bib.extensions, WorldCat, BNB AP, BNB, DNB AP, DNB, LibHub, BNF AP, BNF, Artium, Swissbib AP, Swissbib, Metafacture, DC-NDL AP, NDL.
  3. Libraryish vs. Webbish data models
    Libraryish:
    - used for producing and maintaining (meta)data
    - lossless conversion to/from legacy formats (MARC)
    - modelling of abstractions (records, authorities)
    - housekeeping metadata (status, timestamps)
    - favour self-contained modelling over reuse of other data models
    Webbish:
    - used for publishing data for others to reuse
    - interoperability with other (non-library) data models
    - modelling of Real World Objects (books, people, places, organizations...)
    - favour simplicity over exhaustive detail
    Example models for bibliographic and authority data: MODS RDF, BIBFRAME, RDA Vocabulary, LD4L ontology, LD4P ontology, MADS/RDF, Dublin Core RDF, schema.org + bib.extensions, Wikidata properties, SKOS, FOAF, BIBO, FaBiO.
  4. Reason 1: Different use cases require different kinds of data models. None of the existing models fits them all. But surely, for basic MARC records (e.g. a “regular” national library collection) a single model would be enough?
  5. Reason 2: Converting existing data (i.e. MARC) into a modern entity-based model is difficult, which prevents adoption of such data models in practice for real data. All FRBR-based models require “FRBRization”, which is difficult to get right. BIBFRAME is somewhat easier because of its more relaxed view of Works.
  6. Reason 3: Libraries want to control their data - including data models. Defining your own ontology, or a custom application profile, allows maximum control. Issues like localization and language- or culture-specific requirements (e.g. Japanese dual representation of titles as hiragana and katakana) are not always adequately addressed in the general models.
  7. Reason 4: Once you’ve chosen a data model, you’re likely to stick to it.
  8. Choosing an RDF data model for a bibliographic data set:
    1. Do you want to have Works, or just records?
    2. Is your use case Libraryish (maintaining) or Webbish (publishing)?
    For maintaining metadata as RDF, suitable data models (BIBFRAME, RDA Vocabulary etc.) are not yet mature. For publishing, we already have too many data models.
  9. Don’t create another data model, especially if it’s only for publishing. Help improve the existing ones! We need more efforts like LD4P that consider the production and maintenance of library data as modern, entity-based RDF instead of records. How could we share and reuse each other’s Works and other entities instead of all having to maintain our own?
  10. Will Google, or some other big player, sort this out for us? A big actor offering a compelling use case for publishing bibliographic LOD would make a big difference.
    • a global bibliographic knowledge base?
    • pushing all bibliographic data into Wikidata?
    • Search Engine Optimization (SEO) using schema.org?
    This is happening for scientific datasets: Google recently defined a schema for them within schema.org.
  11. Our bibliographic databases
    1. Fennica - national bibliography (1M records); Melinda union catalog (9M records)
    2. Arto - national article database (1.7M records)
    3. Viola - national discography (1M records)
    All are MARC record-based Voyager or Aleph systems. The Z39.50/SRU APIs were opened in September 2016.
  12. Not very Linked to start with
    • Only some of our bibliographic records are in WorldCat
      ◦ ...and we don’t know their OCLC numbers
    • Our bibliographic records don’t have explicit (ID) links to authority records
      ◦ ...but we’re working on it!
    • Only some of our person and corporate name authority records are in VIAF
      ◦ ...and we don’t know their VIAF IDs
    • Our name authorities are not in ISNI either
    • Our main subject headings (YSA) are linked via YSO to LCSH
  13. Targeting schema.org: Schema.org + bibliographic extensions allows surprisingly rich descriptions! Modelling of Works is possible, similar to BIBFRAME [1].
    [1] Godby, Carol Jean, and Denenberg, Ray. 2015. Common Ground: Exploring Compatibilities Between the Linked Data Models of the Library of Congress and OCLC. Dublin, Ohio: Library of Congress and OCLC Research. http://www.oclc.org/content/dam/research/publications/2015/oclcresearch-loc-linked-data-2015.pdf
  14. schema.org forces you to think about data from a web user’s point of view: “We have these 1M bibliographic records”
  15. schema.org forces you to think about data from a web user’s point of view: “We have these 1M bibliographic records” becomes “The National Library maintains this amazing collection of literary works! We have these editions of those works in our collection. They are available free of charge for reading/borrowing from our library building (Unioninkatu 36, 00170 Helsinki, Finland), which is open Mon-Fri 10-17, except Wed 10-20. The electronic versions are available online from these URLs.”
  16. Fennica using schema.org

    # The original English language work
    fennica:000215259work9 a schema:CreativeWork ;
        schema:about ysa:Y94527, ysa:Y96623, ysa:Y97136, ysa:Y97137, ysa:Y97575, ysa:Y99040,
            yso:p18360, yso:p19627, yso:p21034, yso:p2872, yso:p4403, yso:p9145 ;
        schema:author fennica:000215259person10 ;
        schema:inLanguage "en" ;
        schema:name "The illustrated A brief history of time" ;
        schema:workTranslation fennica:000215259 .

    # The Finnish translation (~expression in FRBR/RDA)
    fennica:000215259 a schema:CreativeWork ;
        schema:about ysa:Y94527, ysa:Y96623, ysa:Y97136, ysa:Y97137, ysa:Y97575, ysa:Y99040,
            yso:p18360, yso:p19627, yso:p21034, yso:p2872, yso:p4403, yso:p9145 ;
        schema:author fennica:000215259person10 ;
        schema:contributor fennica:000215259person11 ;
        schema:inLanguage "fi" ;
        schema:name "Ajan lyhyt historia" ;
        schema:translationOfWork fennica:000215259work9 ;
        schema:workExample fennica:000215259instance26 .

    # The manifestation (FRBR/RDA) / instance (BIBFRAME)
    fennica:000215259instance26 a schema:Book, schema:CreativeWork ;
        schema:author fennica:000215259person10 ;
        schema:contributor fennica:000215259person11 ;
        schema:datePublished "2000" ;
        schema:description "Lisäpainokset: 4. p. 2002. - 5. p. 2005." ;
        schema:exampleOfWork fennica:000215259 ;
        schema:isbn "9510248215", "9789510248218" ;
        schema:name "Ajan lyhyt historia" ;
        schema:numberOfPages "248, 6 s. :" ;
        schema:publisher [ a schema:Organization ; schema:name "WSOY" ] .

    # The original author
    fennica:000215259person10 a schema:Person ;
        schema:name "Hawking, Stephen" .

    # The translator
    fennica:000215259person11 a schema:Person ;
        schema:name "Varteva, Risto" .

    Special thanks to Richard Wallis for help with applying schema.org!
  17. Fennica RDF conversion pipeline (draft)
    Aleph bib dump (1M records, 2.5 GB)
    → split into 300 batches (max 10k records per batch) [txt, 1.5 min]
    → filter, convert to MARCXML using Catmandu, 240$l fix [mrcx, 11 min]
    → BIBFRAME conversion using marc2bibframe [rdf, 75 min]
    → schema.org conversion using SPARQL CONSTRUCT [nt, 35 min]
    → create work keys (SPARQL) [nt, 35 min]
    → create work mappings [nt, 2 min]
    → merge works using SPARQL → raw merged data [nt + hdt]
    → consolidate & cleanup works using SPARQL → RDF for publishing [nt + hdt, 30M triples, ~3 GB]
    Intermediate data volumes along the pipeline: 4 GB and 9 GB.
    Under construction: https://github.com/NatLibFi/bib-rdf-pipeline
    • batch process driven by a Makefile, which defines dependencies
      ◦ incremental updates: only changed batches are reprocessed
    • parallel execution on multiple CPU cores, single virtual machine
    • unit tested using Bats
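    A minimal sketch of the kind of SPARQL CONSTRUCT used in the schema.org conversion step. The bf: property names below are illustrative assumptions based on the BIBFRAME 1.0 vocabulary, not the actual queries in the bib-rdf-pipeline repository:

    PREFIX bf:     <http://bibframe.org/vocab/>
    PREFIX schema: <http://schema.org/>

    # Assumed mapping: a BIBFRAME 1.0 Instance and its Work become
    # schema.org resources linked via exampleOfWork/workExample.
    CONSTRUCT {
      ?instance a schema:Book, schema:CreativeWork ;
                schema:name ?title ;
                schema:exampleOfWork ?work .
      ?work a schema:CreativeWork ;
            schema:workExample ?instance .
    }
    WHERE {
      ?instance a bf:Instance ;
                bf:instanceOf ?work .
      OPTIONAL { ?instance bf:title ?title }   # property name is an assumption
    }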
  18. Current challenges
    1. problems caused by errors & omissions in MARC records
    2. extracting works: initial implementation needs fine-tuning
      ◦ the result will not be perfect; establishing a work registry would help
    3. dumbing down MARC to match schema.org expectations (see the sketch after this list)
      ◦ e.g. structured page counts: “vii, 89, 31 p.” -- schema.org only defines a numeric numberOfPages property
    4. linking internally - from strings to things
      ◦ subjects from YSA and YSO - already working
      ◦ using person and corporate name authorities
    5. linking externally
      ◦ linking name authorities to VIAF, ISNI, Wikidata...
      ◦ linking works to WorldCat Works?
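    As an illustration of challenge 3, a hedged sketch of how a structured extent statement could be dumbed down to the single numeric value schema.org expects. This is a hypothetical cleanup step, not necessarily how the pipeline actually handles it:

    PREFIX schema: <http://schema.org/>
    PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

    # Replace e.g. schema:numberOfPages "248, 6 s. :" with the first numeric
    # group (248) -- a lossy simplification of the structured page statement.
    DELETE { ?book schema:numberOfPages ?raw }
    INSERT { ?book schema:numberOfPages ?pages }
    WHERE {
      ?book a schema:Book ;
            schema:numberOfPages ?raw .
      FILTER(REGEX(STR(?raw), "[0-9]"))
      BIND(xsd:integer(REPLACE(STR(?raw), "^[^0-9]*([0-9]+).*$", "$1")) AS ?pages)
    }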
  19. Publishing as LOD (draft plan)
    • the data as HDT (<500 MB)
    • Linked Data Fragments server → LDF API
    • Fuseki with hdt-java → SPARQL
    • Elda? Custom app? → HTML+RDFa, REST API
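    Once a SPARQL endpoint (Fuseki with hdt-java) is in place, the published data could be queried roughly as below. The endpoint is not yet public, so this query is only an illustration against the schema.org data model shown on slide 16:

    PREFIX schema: <http://schema.org/>

    # Find Finnish-language works and their editions (instances),
    # following the schema:workExample links used in the Fennica data.
    SELECT ?work ?title ?instance ?year
    WHERE {
      ?work a schema:CreativeWork ;
            schema:inLanguage "fi" ;
            schema:name ?title ;
            schema:workExample ?instance .
      OPTIONAL { ?instance schema:datePublished ?year }
    }
    LIMIT 10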