BibBase Triplified: Linked Open Bibliography Data

Reynold Xin
September 06, 2010

Presented at LDTC 2010,


Transcript

  1. BibBase Triplified
     http://data.bibbase.org/
     Presented by: Reynold S. Xin (UC Berkeley)
     Joint work with: Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao, Renee J. Miller (University of Toronto); Christian Fritz (University of Southern California)
  2. Outline
     ¤ Goals and Status
     ¤ Duplicate detection
     ¤ Interlinking of data sources
     ¤ Additional features
     ¤ Conclusions and future work
  3. Goals
     http://www.bibbase.org
     ¤ Makes it easy for scientists to maintain publications pages
     ¤ Scientists maintain a BibTeX file; BibBase does the rest
        ¤ Publishes them in HTML
  4. Goals
     http://data.bibbase.org
     ¤ Makes it easy for scientists to maintain publications pages
     ¤ Scientists maintain a BibTeX file; BibBase does the rest
        ¤ Publishes them in HTML
        ¤ Publishes them in RDF
        ¤ Links entries to the Linked Open Data cloud
     ¤ With this incentive, scientists are helping us build a bibliographic database (think DBLP, but automated)
     ¤ Invaluable data set for benchmarking duplicate detection and semantic link discovery systems
  5. Some statistics
     ¤ “Beta” went online in June 2010
     ¤ As of yesterday (September 1, 2010)
        ¤ ~100 active users
        ¤ 4520 publications, 4883 authors, 502 journals, 1881 proceedings, 88 keywords
        ¤ 39201 author links, 2768 publication links, 30 keyword links
     ¤ Note that this is before we do any form of “marketing”
  6. Duplicate Detection
     ¤ Examples
        ¤ Authors: “Renee J. Miller” or “R. J. Miller” or “RJ Miller”
        ¤ Publication entries
        ¤ Journals & conferences: “VLDB” or “Very Large Data Base”
     ¤ Solutions
        ¤ Local detection (within a single BibTeX file)
        ¤ Global detection (across multiple files)
  7. Local Detection
     ¤ A set of predefined rules to identify duplicates.
        ¤ E.g. within a single file, it is highly likely that “Renee J Miller” is the same as “RJ Miller”.
     ¤ Users can append a suffix to a name to distinguish different people who share it (DBLP approach).
        ¤ E.g. “Min Wang” vs “Min Wang2”
  8. Global Detection
     ¤ Duplicate detection, also known as entity resolution, record linkage, or reference reconciliation, is a well-studied problem and an active research area. [Tutorial-VLDB’05, Tutorial-SIGMOD’06]
     ¤ We use existing declarative techniques [D.App.σ-SIGMOD’07] to detect duplicates across multiple files.
     ¤ Displays a disambiguation page on the HTML interface and an rdfs:seeAlso property on the RDF interface.
     ¤ Also enables users to provide feedback via BibTeX @string definitions, e.g. @string{vldb = "Very Large Data Base"}
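To give a feel for approximate matching across files, the sketch below scores name pairs with q-gram Jaccard similarity. This is a generic textbook measure used as a stand-in; it is not the declarative technique cited on the slide, and the threshold of 0.5 and the function names are assumptions made for illustration.

```python
def qgrams(text, q=3):
    """Return the set of character q-grams of text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=3):
    """Jaccard similarity of the q-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def candidate_duplicates(names, threshold=0.5):
    """Return (name, name, score) for every pair scoring at or above threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(first, second)
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Names as they might appear in different users' files.
for first, second, score in candidate_duplicates(
    ["Renee J. Miller", "Renee J Miller", "Min Wang"]
):
    print(f"{first!r} ~ {second!r}: {score:.2f}")
```

String similarity of this kind catches spelling variants but not abbreviations: "VLDB" and "Very Large Data Base" share almost no q-grams. That gap is one reason user feedback through `@string` definitions is useful.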
  9. Interlinking of Data Sources
     ¤ Leverages both offline dictionaries and online real-time URL verifications.
     ¤ Some external data sources
        ¤ DBLP
        ¤ DBpedia
        ¤ RKBExplorer
        ¤ Semantic Web Dogfood
        ¤ LOD foaf
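The combination of offline dictionaries and online URL verification could look roughly like the sketch below. Everything in it is assumed for illustration: the dictionary contents, the `example.org` URLs, and the `find_link` and `url_resolves` helpers are hypothetical, and the slides do not describe how BibBase builds candidate URLs for each external source.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen

# Hypothetical offline dictionary mapping normalized labels to known URIs.
OFFLINE_DICTIONARY = {
    "vldb": "http://example.org/venue/vldb",
}


def url_resolves(url, timeout=5):
    """Online check: issue a HEAD request and report whether it succeeds."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return 200 <= response.status < 400
    except (URLError, ValueError, OSError):
        # Unreachable host, HTTP error status, malformed URL, or timeout.
        return False


def find_link(label, candidate_url_for, verify=url_resolves):
    """Try the offline dictionary first, then a verified candidate URL."""
    key = label.strip().lower()
    if key in OFFLINE_DICTIONARY:
        return OFFLINE_DICTIONARY[key]
    candidate = candidate_url_for(label)
    if candidate and verify(candidate):
        return candidate
    return None


def guess_url(label):
    """Hypothetical candidate-URL builder for an imaginary external source."""
    return "http://example.org/resource/" + label.strip().replace(" ", "_")


# A stub verifier keeps this example runnable without network access.
print(find_link("VLDB", guess_url, verify=lambda url: False))
print(find_link("Semantic Web", guess_url, verify=lambda url: True))
print(find_link("Unknown Venue", guess_url, verify=lambda url: False))
```

Passing the verifier in as a parameter keeps the lookup logic testable offline; by default `find_link` uses `url_resolves`, which performs a real HTTP request.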
  10. Additional Features
     ¤ Storage and publication of provenance information
     ¤ Dynamic grouping of entities (by year, keyword, etc.)
     ¤ RSS feed for notification
     ¤ DBLP scraper to generate BibTeX files from DBLP records
     ¤ Statistics on usage
     ¤ Enhancement to existing MIT BibTeX ontology file
  11. Conclusion and Future Work
     ¤ BibBase
        ¤ Light-weight publication of bibliographic data
        ¤ Semantic web technologies as a result of complex triplification performed inside the system
        ¤ Invaluable data set
     ¤ Future Work
        ¤ More comprehensive duplicate detection
        ¤ Links to more external data sources
        ¤ Better engineering and service level agreement (99.99%?)
        ¤ Broader user base