Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent? / tpdl2022

Jiro Kikkawa
September 15, 2022

These are our presentation slides for TPDL2022 (http://tpdl2022.dei.unipd.it/), Session 9: Research and CH Data, presented on 23 September 2022.
Authors: Jiro Kikkawa, Masao Takaku, and Fuyuki Yoshikane
Paper: https://doi.org/10.1007/978-3-031-16802-4_13
Preprint: https://doi.org/10.48550/arXiv.2207.12018
Abstract: Digital Object Identifiers (DOIs) are regarded as persistent; however, they are sometimes deleted. Deleted DOIs are an important issue not only for persistent access to scholarly content but also for bibliometrics, because they may cause problems in correctly identifying scholarly articles. However, little is known about how many DOIs are deleted and what causes the deletions. We identified deleted DOIs by comparing the datasets of all Crossref DOIs on two different dates, investigated the number of deleted DOIs in the scholarly content along with the corresponding document types, and analyzed the factors that cause deleted DOIs. Using the proposed method, 708,282 deleted DOIs were identified. The majority corresponded to individual scholarly articles such as journal articles, proceedings articles, and book chapters. There were cases of many DOIs assigned to the same content, e.g., retracted journal articles and abstracts of international conferences. We show the publishers and academic societies that account for the most deleted DOIs. In addition, the top cases of single scholarly content with a large number of deleted DOIs were revealed. The findings of this study are useful for citation analysis and altmetrics, as well as for avoiding deleted DOIs.

Transcript

  1. Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent?

    TPDL2022 - Session 9: Research and CH Data
    Jiro Kikkawa, Masao Takaku, Fuyuki Yoshikane
    {jiro, masao, fuyuki}@slis.tsukuba.ac.jp
    Slide: https://speakerdeck.com/corgies/tpdl2022
    Paper: https://doi.org/10.1007/978-3-031-16802-4_13
    Preprint: https://doi.org/10.48550/arXiv.2207.12018
  2. Background

    • Persistent access to scholarly articles
      – DOI is the best-known PID and an international standard
      – 130 million Crossref DOIs as of March 2022
    • Deleted DOIs
      – DOIs and their artifacts (e.g., identifiers, metadata, and redirected URIs) are sometimes deleted by content holders.
      – Deleted DOIs may cause problems for bibliometrics (e.g., citation indexes and altmetrics) in correctly identifying scholarly articles.
      – However, little is known about their quantity and causes. Thus, we developed a methodology to identify them and analyzed them.
  3. Purpose

    1. We provide the overall picture of deleted DOIs by clarifying the number of deleted DOIs and the characteristics of their content. We demonstrate the potential impact of deleted DOIs for citation analysis and altmetrics.
    2. We provide guidance for avoiding deleted DOIs and making the DOI system more stable.

    Definition of deleted DOIs
    We focus on the deletion of identifiers and metadata. We defined DOIs for which identifiers and associated metadata cannot be retrieved as “Deleted DOIs” in this study.
  4. Related work #1

    • Investigation of duplicated Crossref DOIs
      – Tkaczyk (2020)
    • Incorrect DOIs indexed by scholarly bibliographic databases
      – Franceschini et al. (2014), Zhu et al. (2018)
    • Crossref DOI statistics
      – Hendricks et al. (2020)
    • Analysis of persistence of Crossref DOIs
      – Klein and Balakireva (2020)
    • Analysis of the usage of DOI links in scholarly articles
      – Van de Sompel et al. (2016)
  5. Related work #2

    1. Duplicated Crossref DOIs
    2. Crossref DOIs that cause errors and fail to lead to the content
    3. Errors in the DOIs recorded in scholarly bibliographic databases

    • Previous studies were based on relatively small samples because no methodology for identifying duplicated or deleted DOIs at a large scale had been proposed.
    • We propose a methodology for identifying deleted DOIs and conduct a large-scale analysis to determine how many deleted DOIs exist and what factors cause them.
  6. Materials and Methods: Identifying deleted DOIs #1

    • We extracted the DOIs that existed at a point in time and did not exist later by comparing the dump files of Crossref DOIs on two different dates.
      → We used the datasets as of March 2017 and January 2021.
    • We then treated these DOIs as a candidate set of deleted DOIs.

    Figure: All Crossref DOIs as of March 2017, compared with all Crossref DOIs as of January 2021. The difference set of DOIs included in the dataset as of March 2017 is the candidate set of deleted DOIs.
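
The comparison described on this slide amounts to a set difference. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the DOIs are made-up placeholders, and the case normalization step is an assumption (the slides do not say how DOI names were normalized).

```python
from typing import Iterable, Set


def normalize_doi(doi: str) -> str:
    """Trim and lower-case a DOI name (assumed normalization step)."""
    return doi.strip().lower()


def candidate_deleted_dois(older: Iterable[str], newer: Iterable[str]) -> Set[str]:
    """Return DOIs present in the older snapshot but absent from the newer one."""
    newer_set = {normalize_doi(d) for d in newer}
    return {normalize_doi(d) for d in older} - newer_set


# Toy snapshots standing in for the March 2017 and January 2021 dumps.
snapshot_2017 = ["10.5555/a", "10.5555/B", "10.5555/c"]
snapshot_2021 = ["10.5555/a", "10.5555/c", "10.5555/d"]
print(sorted(candidate_deleted_dois(snapshot_2017, snapshot_2021)))
```

For dumps with well over 100 million DOIs, the same idea would likely need sorted files or an on-disk index rather than in-memory sets.
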
  7. Identifying deleted DOIs #2

    First, we classified DOIs into the following groups according to the result of “Which RA?”

    Difference set of DOIs included in the dataset as of March 2017 (n=711,198), by “Which RA?” result:
    • Crossref: Crossref DOIs
    • RA names other than Crossref: Non-Crossref DOIs
    • DOI does not exist: Non-existing DOIs

    Note: Items with a red box correspond to deleted DOIs.
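
The first classification step can be sketched as a small pure function. This is an illustration only: it assumes each “Which RA?” lookup has already been parsed into a dictionary carrying either an `RA` key or a `status` key, which is an assumption about the response shape rather than something stated in the slides. The function name and the "Unclassified" fallback are hypothetical.

```python
from typing import Mapping


def classify_by_ra(record: Mapping[str, str]) -> str:
    """Map one parsed "Which RA?" result to a group name from the slide."""
    if record.get("status") == "DOI does not exist":
        return "Non-existing DOIs"
    if record.get("RA") == "Crossref":
        return "Crossref DOIs"
    if "RA" in record:
        # Any registration agency name other than Crossref.
        return "Non-Crossref DOIs"
    # Anything else (e.g., an unexpected status) is left for manual review.
    return "Unclassified"


print(classify_by_ra({"DOI": "10.5555/a", "RA": "Crossref"}))
print(classify_by_ra({"DOI": "10.5555/b", "status": "DOI does not exist"}))
```
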
  8. Identifying deleted DOIs #3

    Next, we classified the Crossref DOIs into the following groups according to the results of each HTTP header for the DOI links.

    Difference set of DOIs included in the dataset as of March 2017 (n=711,198), by “Which RA?” result:
    • Crossref: Crossref DOIs, classified further by redirect:
      – DOIs with Redirects
      – DOIs without Redirects
      – Defunct DOIs: DOIs that redirected to the specific URI for the deleted content, https://www.crossref.org/_deleted-doi/
    • RA names other than Crossref: Non-Crossref DOIs
    • DOI does not exist: Non-existing DOIs

    Note: Items with a red box correspond to deleted DOIs.
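
The redirect-based grouping can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the redirect target of each DOI link (for example, the `Location` header) has already been collected, and the function name is hypothetical. Only the deleted-content URI comes from the slide.

```python
from typing import Optional

# URI that the slide gives as the redirect target for deleted content.
DELETED_DOI_URI = "https://www.crossref.org/_deleted-doi/"


def classify_by_redirect(location: Optional[str]) -> str:
    """Group a Crossref DOI by the redirect target observed for its DOI link."""
    if not location:
        return "DOIs without Redirects"
    # Compare without a trailing slash so both URI spellings match.
    if location.rstrip("/") == DELETED_DOI_URI.rstrip("/"):
        return "Defunct DOIs"
    return "DOIs with Redirects"


print(classify_by_redirect(None))
print(classify_by_redirect("https://www.crossref.org/_deleted-doi/"))
print(classify_by_redirect("https://example.org/article/1"))
```
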
  9. Identifying deleted DOIs #4

    Finally, the DOIs were classified into the following groups according to the results of the Crossref REST API.

    • DOIs with Deleted Description in Metadata
    • Alias DOIs
    • Other DOIs

    When multiple Crossref DOIs were assigned to the same content item, one was set as the Primary DOI, and the others were set as the Alias DOIs. The Crossref REST API returned the error “Resource not found” for the Alias DOIs.

    Groups from the earlier steps, starting from the difference set of DOIs included in the dataset as of March 2017 (n=711,198):
    • “Which RA?” result: Crossref DOIs (Crossref), Non-Crossref DOIs (RA names other than Crossref), Non-existing DOIs (DOI does not exist)
    • Redirect: DOIs with Redirects, DOIs without Redirects, Defunct DOIs

    Note: Items with a red box correspond to deleted DOIs.
  10. Results and Discussion: How many deleted DOIs exist?

    Table 1. Number of deleted DOIs in each group (n=708,282)

    | # | Group | Count | % |
    |---|-------|-------|---|
    | 1 | Non-existing DOIs | 240 | 0.03 |
    | 2 | DOIs without Redirects | 693 | 0.10 |
    | 3 | DOIs with Deleted Description on Metadata | 1,144 | 0.16 |
    | 4 | Defunct DOIs | 388 | 0.05 |
    | 5 | Alias DOIs | 667,869 | 94.29 |
    | 6 | Other DOIs | 37,948 | 5.36 |

    • We identified 708,282 deleted DOIs. Most of them were Alias DOIs, accounting for 94.29% of the total.
    • The majority of the deleted DOIs were caused by the deletion of multiple DOI assignments for the same content.
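
The totals and percentages in Table 1 can be rechecked with a few lines of arithmetic. The counts below are copied from the table; the variable names are illustrative.

```python
# Group counts as reported in Table 1.
counts = {
    "Non-existing DOIs": 240,
    "DOIs without Redirects": 693,
    "DOIs with Deleted Description on Metadata": 1_144,
    "Defunct DOIs": 388,
    "Alias DOIs": 667_869,
    "Other DOIs": 37_948,
}

total = sum(counts.values())
# Share of each group as a percentage, rounded to two decimals as in the table.
shares = {group: round(100 * n / total, 2) for group, n in counts.items()}

print(total)
for group, share in shares.items():
    print(f"{group}: {share:.2f}%")
```

The group counts sum to the reported n of 708,282, and the Alias DOIs share rounds to the reported 94.29%.
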
  11. Why are many DOIs assigned to the same content? #1

    Table 3-1. Primary DOIs with the largest numbers of associated Alias DOIs

    | # | Primary DOI | Title | Count | Volume (Issue), Page | Container title |
    |---|-------------|-------|-------|----------------------|-----------------|
    | 1 | 10.1016/s1876-6102(14)00454-8 | Volume Removed - Publisher's Disclaimer | 1,474 | 13, pp. 1-10380 | Energy Procedia |
    | 2 | 10.1016/s1876-6102(14)00453-6 | Volume Removed - Publisher's Disclaimer | 748 | 11, pp. 1-5156 | Energy Procedia |

    In both rows, the entire volume was retracted.

    • In these cases, many DOIs were assigned to the same content for retracted journal articles.
    • The retracted articles are distinct scholarly articles. The withdrawal information should be applied to each article's own DOI, rather than setting those DOIs as Alias DOIs of a single Primary DOI.
  12. Why are many DOIs assigned to the same content? #2

    Table 3-2. Primary DOIs with the largest numbers of associated Alias DOIs

    | # | Primary DOI | Title | Count | Volume (Issue), Page | Container title |
    |---|-------------|-------|-------|----------------------|-----------------|
    | 3 | 10.1016/j.fueleneab.2006.10.002 | Abstracts | 486 | 47 (6), pp. 384-446 | Fuel and Energy Abstracts |
    | 4 | 10.1016/s0735-1097(01)80004-8 | Hypertension, vascular disease, and prevention | 399 | 37 (2), pp. A220-A304 | Journal of the American College of Cardiology |

    • These cases involved abstracts of an international conference. DOIs were first registered for the abstracts of individual presentations, but the publisher appears to have changed its policy to register DOIs for sets of abstracts.
  13. What are the most common changes in the suffixes of deleted DOIs? #1

    Table 4. DOI name change patterns (n=667,869)

    | Pattern | Count | % |
    |---------|-------|---|
    | Only the suffix changed | 465,789 | 69.74 |
    | Both the prefix and the suffix changed | 156,650 | 23.46 |
    | Only the prefix changed | 45,430 | 6.80 |

    • Table 4 presents the DOI name change patterns for each pair of Alias and Primary DOIs.
    • The most common pattern was “only the suffix changed” (69.74%), indicating that many DOIs registered under the same prefix were deleted when multiple DOIs were registered to the same content by the same registrant.
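
The three patterns in Table 4 follow from comparing the prefix and suffix of each Alias DOI with those of its Primary DOI. A minimal sketch, assuming the DOI name is split at the first slash and compared case-insensitively (both are assumptions, and the function names are hypothetical). The example pair is the record shown on the dataset slide.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, _, suffix = doi.lower().partition("/")
    return prefix, suffix


def change_pattern(alias: str, primary: str) -> str:
    """Name the Table 4 pattern for one Alias DOI / Primary DOI pair."""
    alias_prefix, alias_suffix = split_doi(alias)
    primary_prefix, primary_suffix = split_doi(primary)
    prefix_changed = alias_prefix != primary_prefix
    suffix_changed = alias_suffix != primary_suffix
    if prefix_changed and suffix_changed:
        return "Both the prefix and the suffix changed"
    if suffix_changed:
        return "Only the suffix changed"
    if prefix_changed:
        return "Only the prefix changed"
    return "Unchanged"


print(change_pattern("10.1001/.387", "10.1001/archinte.166.4.387"))
```

Splitting at the first slash keeps any extra leading slash inside the suffix, so a double-slash DOI is counted as a suffix change.
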
  14. What are the most common changes in the suffixes of deleted DOIs? #2

    Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789). Red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.

    | # | Pattern of changes on the suffix | Count | % | Example |
    |---|----------------------------------|-------|---|---------|
    | 1 | Delete a slash once. | 49,169 | 10.56 | /s12445-012-0033-7 → s12445-012-0033-7 |
    | 2 | Add a hyphen four times. | 21,918 | 4.71 | 9781591401087.ch001 → 978-1-59140-108-7.ch001 |
    | 3 | Delete “2” twice, add “5” once, add “7” once, replace “8” with “9” once, and replace “6” with “3” once. | 18,607 | 3.99 | 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010 |

    • #1: The correction of “//” to “/” as a separator between the prefix and the suffix.
    • #2: The correction of the ISBN in the suffix by adding hyphens as separators.
    • #3: The replacement of the ISSN in the suffix with a new one.
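
One rough way to summarize such suffix changes is to compare character counts between the old and new suffix. The sketch below is an approximation that uses a multiset difference and ignores character positions. It is not the authors' method, it cannot tell a replacement from a deletion plus an addition, and the function name is hypothetical. The inputs are the examples from Table 5.

```python
from collections import Counter
from typing import Dict, Tuple


def char_changes(old: str, new: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (removed, added) character counts between two suffixes."""
    old_counts, new_counts = Counter(old), Counter(new)
    # Counter subtraction keeps only the positive differences.
    removed = dict(old_counts - new_counts)
    added = dict(new_counts - old_counts)
    return removed, added


print(char_changes("/s12445-012-0033-7", "s12445-012-0033-7"))
print(char_changes("9781591401087.ch001", "978-1-59140-108-7.ch001"))
print(char_changes("2214-8647_dnp_e1000010", "1574-9347_dnp_e1000010"))
```

For the first two examples this reports one removed slash and four added hyphens, which matches the pattern descriptions in the table.
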
  15. What are the most common changes in the suffixes of deleted DOIs? #3

    Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789). Red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.

    | # | Pattern of changes on the suffix | Count | % | Example |
    |---|----------------------------------|-------|---|---------|
    | 3 | Delete “2” twice, add “5” once, add “7” once, replace “8” with “9” once, and replace “6” with “3” once. | 18,607 | 3.99 | 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010 |

    • #3: The replacement of the ISSN in the suffix with a new one.
    • As a similar case to #3, we observed the replacement of the SICI (Serial Item and Contribution Identifier) in the suffix with a new one:
      (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.3.co;2-2 → (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.0.co;2-#
  16. Conclusion

    1. We identified 708,282 deleted DOIs that existed in March 2017 and did not exist in January 2021 using the proposed method.
    2. The cases where many DOIs were assigned to the same content were retracted papers in a specific volume of a journal or abstracts of international conference proceedings.
    3. We revealed the factors that caused a large number of deleted DOIs.
      – We must be careful not to set double slashes between the prefix and the suffix. When we apply other identifiers such as the ISBN, ISSN, and SICI to suffixes, we may need to format or update them owing to changes in the ISBN, ISSN, or SICI.
    4. The findings of this study are useful both for considering the problems caused by deleted DOIs in citation analysis and altmetrics and for assigning DOIs in a way that avoids deleted DOIs.
  17. Future work

    1. Expand the scope of the deleted DOIs and develop a methodology for identifying them
      – to identify deleted DOIs that existed before March 2017, and the deleted DOIs after March 2021
      – to cover other DOIs whose contents were not reachable
    2. Classify all the factors that caused deleted DOIs
      – We only focused on the most frequent cases in this study.
      – It is unclear whether the observed factors represented exceptional cases.
    3. Conduct a quantitative analysis of the effects of deleted DOIs on citation analysis and altmetrics
      – Are alias/primary DOIs included in the citation indexes and altmetrics or not?
      – We are examining the citing/cited references to deleted DOIs by using the OpenCitations’ Index of Crossref open DOI-to-DOI citations dataset.
  18. Dataset is available

    • Dataset of the deleted DOIs extracted from the difference set between Crossref DOIs as of March 2017 and January 2021. Zenodo (2022). https://doi.org/10.5281/zenodo.6841257

    Example record:

        {
          "doi": "10.1001/.387",
          "whichRA": "Crossref",
          "redirects": [
            "https://doi.org/10.1001/archinte.166.4.387",
            "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"
          ],
          "redirect_to_other_doi": [
            "10.1001/archinte.166.4.387"
          ],
          "timestamp": "2022-01-30T02:10:45Z",
          "label": "Alias DOIs"
        }
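
A record in this shape can be read with the standard `json` module. The sketch below parses the example record from the slide and pulls out the Alias DOI and its target. Reading the first `redirect_to_other_doi` entry as the Primary DOI is an assumption, the function name is hypothetical, and how the records are packaged in the Zenodo files is not shown on the slide.

```python
import json
from typing import Optional, Tuple

# The example record shown on the slide.
RECORD_JSON = """
{
  "doi": "10.1001/.387",
  "whichRA": "Crossref",
  "redirects": [
    "https://doi.org/10.1001/archinte.166.4.387",
    "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"
  ],
  "redirect_to_other_doi": ["10.1001/archinte.166.4.387"],
  "timestamp": "2022-01-30T02:10:45Z",
  "label": "Alias DOIs"
}
"""


def alias_pair(record: dict) -> Optional[Tuple[str, str]]:
    """Return (alias DOI, target DOI) for an Alias DOIs record, else None."""
    if record.get("label") != "Alias DOIs":
        return None
    targets = record.get("redirect_to_other_doi") or []
    if not targets:
        return None
    # Assumption: the first listed DOI is the Primary DOI.
    return record["doi"], targets[0]


record = json.loads(RECORD_JSON)
print(alias_pair(record))
```
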
  19. Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent?

    TPDL2022 - Session 9: Research and CH Data
    Jiro Kikkawa, Masao Takaku, Fuyuki Yoshikane
    {jiro, masao, fuyuki}@slis.tsukuba.ac.jp
    Slide: https://speakerdeck.com/corgies/tpdl2022
    Paper: https://doi.org/10.1007/978-3-031-16802-4_13
    Preprint: https://doi.org/10.48550/arXiv.2207.12018