Reference rot in Concordia University’s Spectrum Research Repository

InfoNexus
February 05, 2016

Presented by Kathleen Botter, Systems Librarian at Concordia University Library, at InfoNexus 2016 in Montreal, Canada. Visit http://info-nexus.org/ for more detail.


Transcript

  1. Reference rot analysis of PhD theses in Concordia University’s Spectrum Research Repository

    Kathleen Botter, Systems Librarian
  2. What is Reference Rot?

    Reference Rot = Link Rot + Content Drift
    - Link Rot: “Page Not Found”
    - Content Drift: change of site/page content over time
    - Documents can be immune, healthy, or infected
  3. Why do we care?

    “If I have seen further, it is by standing on the shoulders of giants.” - Isaac Newton
    - Reference rot represents the disappearing giant
    - What are we left standing on?
  4. Reference Rot in STM articles

    Scholarly Context Not Found: 1 in 5 articles suffers from reference rot
    - 3.5 million STM articles from 1997 - 2012
    - arXiv, Elsevier, PubMed Central
    - 1.8 million articles with open web references: 7 in 10 of these suffer from reference rot
    Klein M., Van de Sompel H., Sanderson R., Shankar H., Balakireva L., Zhou K., et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12), e115253. doi:10.1371/journal.pone.0115253
  5. Reference Rot in Law

    Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations
    - 3 Harvard law and policy related publications, ~1996-2012
    - 70% of links suffer reference rot
    - US Supreme Court opinions from CourtListener
    - 50% of links suffer reference rot
    Zittrain J., Albert K., & Lessig L. (2014). Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management 14(2), 88-99. doi:10.1017/S1472669614000255
  6. Research Questions

    - What % of links in Concordia’s ETDs exhibit characteristics of reference rot?
    - Do some disciplines exhibit more reference rot than others?
    - Is there a relation between the age of a thesis and reference rot?
  7. Spectrum Research Repository

    - Concordia University’s institutional repository: http://spectrum.concordia.ca
    - All PhD and Master’s theses submitted as ETDs since Spring 2011
    - PDF/A format
    - 605 PhD theses analyzed, of 654 total (S2011-S2015)
    - Loss: 30 embargoes/restrictions + 19 PDF conversion problems
  8. Extracting and Testing Links

    1. Obtain PDFs
    2. Convert PDFs to XML or TXT
    3. Use a regular expression on each thesis to find links
    4. Manual verification (and fixing) of links
    5. Use the cURL utility to get the HTTP status code for each link
    - Output = original URL, final/effective URL, status code
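Steps 3 and 5 can be sketched roughly as follows. This is a minimal illustration and not the workflow used in the study: the talk used cURL, while this sketch swaps in Python's standard `urllib`, and the pattern `URL_PATTERN`, the helper names, and the use of status `0` for "no response" are all assumptions made for the example.

```python
import re
import urllib.error
import urllib.request

# Hypothetical pattern: http(s) URLs and bare "www." links, up to the next
# whitespace, angle bracket, or double quote.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"]+", re.IGNORECASE)


def extract_links(text):
    """Return candidate links found in thesis text, minus trailing punctuation."""
    return [match.rstrip(".,;:)]") for match in URL_PATTERN.findall(text)]


def check_link(url, timeout=10):
    """Return (original URL, final/effective URL, status code) for one link.

    Redirects are followed, so the second element is where the request ended up.
    A status of 0 stands in for "no response at all" (an assumption of this sketch).
    """
    target = url if url.lower().startswith("http") else "http://" + url
    request = urllib.request.Request(target, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return url, response.geturl(), response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return url, err.url, err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return url, None, 0


sample = "See http://spectrum.concordia.ca, and www.example.org/page. Note: no link here"
links = extract_links(sample)
print(links)
```

As the later slides note, a pattern like this tends to over-match and under-match, so the manual verification in step 4 remains necessary.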
  9. Content Drift in Spectrum

    In progress: sampling of links with HTTP status code 200
    - Manually visit the link
    - Use the last-accessed statement in the reference, or the thesis publication date
    - Look for mementos close to this date
    - The final/effective URL can reveal that the link is a custom 404
    Example: http://www.gfkrt.com/imperia/md/content/rt-france/cp_gfk_march___de_la_bd_-_39eme___dition_de_la_fibd.pdf ends at http://www.gfk.com/404/ with status code 200
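The custom-404 check on this slide could be approximated with a simple heuristic on the final/effective URL. The sketch below is illustrative only: the marker list `ERROR_MARKERS` and the function name are hypothetical, and a heuristic like this will produce false positives (any path segment containing "404") and miss error pages with neutral URLs, so it complements manual checking rather than replacing it.

```python
from urllib.parse import urlparse

# Hypothetical markers that suggest an error page in the final URL's path.
ERROR_MARKERS = ("404", "notfound", "not-found")


def is_custom_404(effective_url, status_code):
    """Flag links that returned 200 but ended on a page that looks like an error page."""
    if status_code != 200:
        return False
    path = urlparse(effective_url).path.lower()
    segments = [segment for segment in path.split("/") if segment]
    return any(marker in segment for segment in segments for marker in ERROR_MARKERS)


# The example from the slide: status 200, but the effective URL is a 404 page.
flagged = is_custom_404("http://www.gfk.com/404/", 200)
print(flagged)
```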
  10. Challenges

    - Not all PDFs are created equal
    - Some can’t be converted
    - Blank spaces, ~, _, new lines in links
    - Inconsistent linking
    - http sometimes there, sometimes not
    - Ellipses (…) in long links!
    - Mistakes not so obvious
    - Missing .edu, //
  11. Surprises

    - Regular expression pulled out even more than we could have imagined
    - Sometimes that was just text with a colon
    - 2554 DOIs
    - ~1700 since 2014
  12. Preliminary Results

    Theses
    - 24.7% of theses are immune (150)
    - 25.6% of theses are healthy (155)
    - 49.6% of theses suffer reference rot (300)
    Links
    - 8046 of 10503 links are healthy
    - ~23.4% of links are afflicted by link rot
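The three thesis categories can be expressed as a small classification over each thesis's list of status codes. The slides do not spell out the exact rules, so the definitions below are assumptions: "immune" is taken to mean a thesis with no links, "healthy" one whose links all returned 200, and "infected" one with at least one link that did not. Only the counts (150, 155, 300) come from the slide.

```python
def classify_thesis(status_codes):
    """Classify one thesis from the HTTP status codes of its links (assumed rules)."""
    if not status_codes:
        return "immune"      # no links, so nothing can rot
    if all(code == 200 for code in status_codes):
        return "healthy"     # every link still resolves
    return "infected"        # at least one link failed


# Counts reported on the slide.
counts = {"immune": 150, "healthy": 155, "infected": 300}
total_theses = sum(counts.values())

for label, count in counts.items():
    print(f"{label}: {count} of {total_theses} = {100 * count / total_theses:.2f}%")
```

The counts sum to the 605 theses analyzed; the printed shares may differ from the slide in the last digit, since the slide's figures appear to be truncated in places rather than rounded.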
  13. Prelim. Results: Link Distribution

    HTTP Code Distribution
    - 0s - Empty response: 5%
    - 100s - Informational: 0%
    - 200s - Successful: 76%
    - 300s - Redirected: 1%
    - 400s - Client Error: 17%
    - 500s - Server Error: 1%
    - 900s - Request denied: 0%
  14. Prelim. Results: Link Rot / Year

    Year    % Theses w/ links   # Links   # Links status = 200   % ~ Healthy links
    2011    75.6%               1813      1248                   68.8%
    2012    72.5%               2494      1795                   72%
    2013    69.3%               2461      2017                   82%
    2014    67.5%               2724      2118                   77.7%
    2015    60.5%               1011      868                    85.9%
    Total   69.6%               10503     8046                   76.6%
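The "% ~ Healthy links" column is the share of links that returned status 200, and the totals can be re-derived from the per-year rows. A short sketch of that arithmetic, using only the figures from the table (the names `ROWS` and `healthy_pct` are just for this example):

```python
# (year, number of links, number of links with status 200), from the table above
ROWS = [
    (2011, 1813, 1248),
    (2012, 2494, 1795),
    (2013, 2461, 2017),
    (2014, 2724, 2118),
    (2015, 1011, 868),
]


def healthy_pct(links, ok):
    """Percentage of links with status 200, rounded to one decimal place."""
    return round(100 * ok / links, 1)


total_links = sum(links for _, links, _ in ROWS)
total_ok = sum(ok for _, _, ok in ROWS)

for year, links, ok in ROWS:
    print(year, links, ok, healthy_pct(links, ok))
print("Total", total_links, total_ok, healthy_pct(total_links, total_ok))
```

Individual rows may differ from the slide by 0.1, since some slide values appear to be truncated rather than rounded.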
  15. Prelim. Results: Discipline Distribution

    Distribution of Disciplines
    - STEM: 57%
    - Social Sciences: 18%
    - Humanities: 9%
    - Commerce: 7%
    - Fine Arts: 5%
    - Cross Disciplinary: 4%
  16. Prelim. Results: Link Rot / Discipline

    Discipline (# theses)     % Theses w/ links   # Links   # Links status = 200   % ~ Healthy links
    STEM (259)                66%                 2963      2245                   76%
    Social Sciences (81)      71%                 3273      2551                   78%
    Humanities (42)           75%                 1837      1366                   74%
    Commerce (33)             78.6%               173       122                    70.5%
    Fine Arts (21)            91%                 1570      1234                   78.6%
    Cross disciplinary (19)   76%                 687       528                    77%
  17. Mitigating Reference Rot

    - Mementos
    - Online archives, e.g. Internet Archive’s Wayback Machine
    - Perma.cc: law specific
    - Browser plugins that create mementos automatically
    - Citation style for online resources
    - Hiberlink solution for open web resources:
      <a href="http://hiberlink.org" data-versionurl="http://archive.today/CT6mt" data-versiondate="2014-08-12">
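The Hiberlink-style link above pairs the original URL with an archived snapshot and the date of that snapshot. A small helper that emits such an anchor might look like this; the function name and the link text are hypothetical, and only the three attributes shown on the slide are used.

```python
from html import escape


def robust_link(href, version_url, version_date, text):
    """Build an anchor carrying the original URL, a snapshot URL, and the snapshot date."""
    return (
        f'<a href="{escape(href, quote=True)}" '
        f'data-versionurl="{escape(version_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


anchor = robust_link(
    "http://hiberlink.org",
    "http://archive.today/CT6mt",
    "2014-08-12",
    "Hiberlink",  # link text chosen for illustration
)
print(anchor)
```

Escaping the attribute values keeps the markup valid when a cited URL contains characters such as `&` or quotes.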
  18. Institution / Library’s Role?

    - Require mementos as part of the document submission process?
    - Created, verified by author, librarian, other?
    - Who owns / guarantees the mementos’ existence?
    - Public archive
    - Institutional archive
    - Copyright issues