Slide 1

Slide 1 text

Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent?
TPDL2022 - Session 9: Research and CH Data
Jiro Kikkawa, Masao Takaku, Fuyuki Yoshikane
{ jiro, masao, fuyuki } @slis.tsukuba.ac.jp
Slide: https://speakerdeck.com/corgies/tpdl2022
Paper: https://doi.org/10.1007/978-3-031-16802-4_13
Preprint: https://doi.org/10.48550/arXiv.2207.12018

Slide 2

Slide 2 text

Background
• Persistent access to scholarly articles
  – The DOI is the best-known PID and an international standard.
  – There were 130 million Crossref DOIs as of March 2022.
• Deleted DOIs
  – DOIs and their artifacts (e.g., identifiers, metadata, and redirected URIs) are sometimes deleted by content holders.
  – Deleted DOIs may prevent bibliometric services (e.g., citation indexes and altmetrics) from correctly identifying scholarly articles.
  – However, little is known about their quantity and causes.
Thus, we developed a methodology to identify deleted DOIs and analyzed them.

Slide 3

Slide 3 text

Purpose
1. We provide an overall picture of deleted DOIs by clarifying their number and the characteristics of their content. We demonstrate the potential impact of deleted DOIs on citation analysis and altmetrics.
2. We provide guidance for avoiding deleted DOIs and making the DOI system more stable.

Definition of deleted DOIs
We focus on the deletion of identifiers and metadata. In this study, we define “Deleted DOIs” as DOIs for which the identifiers and associated metadata cannot be retrieved.

Slide 4

Slide 4 text

Related work #1
• Investigation of duplicated Crossref DOIs
  – Tkaczyk (2020)
• Incorrect DOIs indexed by scholarly bibliographic databases
  – Franceschini et al. (2014), Zhu et al. (2018)
• Crossref DOI statistics
  – Hendricks et al. (2020)
• Analysis of persistence of Crossref DOIs
  – Klein and Balakireva (2020)
• Analysis of the usage of DOI links in scholarly articles
  – Van de Sompel et al. (2016)

Slide 5

Slide 5 text

Related work #2
Previous studies reported the following problems:
1. Duplicated Crossref DOIs
2. Crossref DOIs that cause errors and do not lead to the content
3. DOIs in the records of scholarly bibliographic databases that contain errors
• Previous studies were based on relatively small samples because no methodology for identifying duplicated or deleted DOIs at a large scale had been proposed.
• We propose a methodology for identifying deleted DOIs and conduct a large-scale analysis to determine how many deleted DOIs exist and what factors cause them.

Slide 6

Slide 6 text

Materials and Methods
Identifying deleted DOIs #1
• We extracted the DOIs that existed at one point in time but no longer existed later by comparing the dump files of Crossref DOIs on two different dates.
  → We used the datasets as of March 2017 and January 2021.
• We then treated these DOIs as a candidate set of deleted DOIs.
[Figure: All Crossref DOIs as of March 2017 compared with all Crossref DOIs as of January 2021. The difference set of DOIs included only in the March 2017 dataset is the candidate set of deleted DOIs.]
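The comparison of the two dumps amounts to a set difference. The following is a minimal sketch, assuming each dump has already been reduced to an iterable of DOI strings; the function name and the lowercasing step are illustrative assumptions, not details given on the slide.

```python
def candidate_deleted_dois(older_dois, newer_dois):
    """Return DOIs present in the older snapshot but absent from the newer one.

    DOI names are case-insensitive, so both sides are lowercased before
    comparison (an assumption of this sketch).
    """
    def normalize(doi):
        return doi.strip().lower()

    newer = {normalize(d) for d in newer_dois if d.strip()}
    older = {normalize(d) for d in older_dois if d.strip()}
    return older - newer


# Toy data standing in for the March 2017 and January 2021 dumps.
march_2017 = ["10.1000/A", "10.1000/b", "10.1000/c"]
january_2021 = ["10.1000/b", "10.1000/C"]
candidates = candidate_deleted_dois(march_2017, january_2021)
```

For dumps of more than 100 million DOIs, holding both sets in memory may not be practical; sorting both lists on disk and streaming through them would give the same result.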

Slide 7

Slide 7 text

Identifying deleted DOIs #2
First, we classified the DOIs in the difference set (DOIs included in the dataset as of March 2017, n=711,198) into the following groups according to the result of “Which RA?”:
• Crossref DOIs: the result is “Crossref”
• Non-Crossref DOIs: the result is an RA name other than Crossref
• Non-existing DOIs: the result is “DOI does not exist”
Note: On the slide, items with a red box correspond to deleted DOIs.
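The first classification step can be sketched as follows. This assumes the “Which RA?” service is queried at `https://doi.org/doiRA/<doi>` and answers with a JSON list whose entries carry either an `RA` field or a `status` field; that response shape and the helper names are assumptions of this sketch and should be checked against the service before use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_which_ra(doi):
    """Query the 'Which RA?' service (assumed endpoint) and parse the JSON."""
    with urlopen("https://doi.org/doiRA/" + quote(doi)) as response:
        return json.load(response)


def classify_by_ra(ra_response):
    """Map a parsed 'Which RA?' response to one of the three groups."""
    entry = ra_response[0] if ra_response else {}
    if entry.get("RA") == "Crossref":
        return "Crossref DOIs"
    if "RA" in entry:
        return "Non-Crossref DOIs"
    if entry.get("status") == "DOI does not exist":
        return "Non-existing DOIs"
    return "Unknown"  # Anything this sketch does not recognize.
```

Keeping the network call separate from the classification makes it possible to cache the raw responses and re-run the classification offline.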

Slide 8

Slide 8 text

Identifying deleted DOIs #3
Next, we classified the Crossref DOIs into the following groups according to the HTTP headers returned for the DOI links:
• DOIs with Redirects
• DOIs without Redirects
• Defunct DOIs: DOIs that redirected to the specific URI for deleted content, https://www.crossref.org/_deleted-doi/
[Figure: Difference set of DOIs included in the dataset as of March 2017 (n=711,198) → “Which RA?” → Crossref DOIs (Crossref), Non-Crossref DOIs (RA names other than Crossref), Non-existing DOIs (DOI does not exist); Crossref DOIs → Redirect → the three groups above.]
Note: On the slide, items with a red box correspond to deleted DOIs.
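A minimal sketch of this step, assuming the chain of `Location` header values for each DOI link has already been collected into a list (how the chain is gathered is left open here). The function name is hypothetical; only the deleted-content URI comes from the slide.

```python
DELETED_DOI_URI = "https://www.crossref.org/_deleted-doi/"


def classify_by_redirects(locations):
    """Classify a Crossref DOI from the redirect targets of its DOI link.

    `locations` is the ordered list of Location header values observed
    while following the link; an empty list means no redirect occurred.
    """
    if not locations:
        return "DOIs without Redirects"
    if any(uri.startswith(DELETED_DOI_URI) for uri in locations):
        return "Defunct DOIs"
    return "DOIs with Redirects"
```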

Slide 9

Slide 9 text

Identifying deleted DOIs #4
Finally, the DOIs were classified into the following groups according to the results of the Crossref REST API:
• DOIs with Deleted Description in Metadata
• Alias DOIs
• Other DOIs
When multiple Crossref DOIs were assigned to the same content item, one was set as the Primary DOI, and the others were set as Alias DOIs. The Crossref REST API returned the error “Resource not found” for the Alias DOIs.
[Figure: Difference set of DOIs included in the dataset as of March 2017 (n=711,198) → “Which RA?” → Crossref DOIs (Crossref), Non-Crossref DOIs (RA names other than Crossref), Non-existing DOIs (DOI does not exist); Crossref DOIs → Redirect → DOIs with Redirects, DOIs without Redirects, Defunct DOIs; Crossref REST API → Result → the three groups above.]
Note: On the slide, items with a red box correspond to deleted DOIs.
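One way to recognize an Alias DOI is sketched below. It combines two signals: the Crossref REST API answering “Resource not found” (assumed here to arrive as HTTP status 404) and a redirect chain that passes through a different DOI. Combining the two signals, the helper names, and the list of DOI resolver hosts are assumptions of this sketch rather than the authors' exact rule.

```python
from urllib.parse import unquote, urlparse

DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}  # Assumed resolver hosts.


def doi_from_uri(uri):
    """Return the DOI name embedded in a resolver URI, or None otherwise."""
    parsed = urlparse(uri)
    if parsed.netloc.lower() in DOI_RESOLVER_HOSTS:
        return unquote(parsed.path[1:]).lower()
    return None


def primary_doi_candidates(doi, rest_status, redirect_uris):
    """List other DOIs in the redirect chain when the REST API lookup failed.

    A non-empty result suggests that `doi` is an Alias DOI and that the
    returned names are candidates for its Primary DOI.
    """
    if rest_status != 404:
        return []
    targets = [doi_from_uri(uri) for uri in redirect_uris]
    return [t for t in targets if t is not None and t != doi.lower()]
```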

Slide 10

Slide 10 text

Results and Discussion
How many deleted DOIs exist?

Table 1. Number of deleted DOIs in each group (n=708,282)
#  Group                                        Count      %
1  Non-existing DOIs                              240   0.03
2  DOIs without Redirects                         693   0.10
3  DOIs with Deleted Description in Metadata    1,144   0.16
4  Defunct DOIs                                   388   0.05
5  Alias DOIs                                 667,869  94.29
6  Other DOIs                                  37,948   5.36

• We identified 708,282 deleted DOIs. Most of them were Alias DOIs, accounting for 94.29% of the total.
• Most deleted DOIs therefore resulted from resolving cases in which multiple DOIs had been assigned to the same content.
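The percentages in Table 1 can be reproduced from the counts. A short check, using only the numbers reported in the table:

```python
# Counts per group as reported in Table 1.
counts = {
    "Non-existing DOIs": 240,
    "DOIs without Redirects": 693,
    "DOIs with Deleted Description in Metadata": 1144,
    "Defunct DOIs": 388,
    "Alias DOIs": 667869,
    "Other DOIs": 37948,
}

total = sum(counts.values())
# Share of each group in percent, rounded to two decimals as in the table.
shares = {group: round(100 * n / total, 2) for group, n in counts.items()}

for group, share in shares.items():
    print(f"{group}: {counts[group]:,} ({share:.2f}%)")
```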

Slide 11

Slide 11 text

Why are many DOIs assigned to the same content? #1

Table 3-1. Primary DOIs with the largest numbers of associated Alias DOIs
#  Primary DOI                    Title                                    Count  Volume (Issue), Page  Container title
1  10.1016/s1876-6102(14)00454-8  Volume Removed - Publisher's Disclaimer  1,474  13, pp. 1-10380       Energy Procedia
   This entire volume was retracted.
2  10.1016/s1876-6102(14)00453-6  Volume Removed - Publisher's Disclaimer    748  11, pp. 1-5156        Energy Procedia
   This entire volume was retracted.

• In these cases, the DOIs of many retracted journal articles were all assigned to the same content.
• These are distinct scholarly articles. The withdrawal information should be attached to each article's own DOI rather than making those DOIs Alias DOIs of a single Primary DOI.

Slide 12

Slide 12 text

Why are many DOIs assigned to the same content? #2

Table 3-2. Primary DOIs with the largest numbers of associated Alias DOIs
#  Primary DOI                      Title                                           Count  Volume (Issue), Page   Container title
3  10.1016/j.fueleneab.2006.10.002  Abstracts                                         486  47 (6), pp. 384-446    Fuel and Energy Abstracts
4  10.1016/s0735-1097(01)80004-8    Hypertension, vascular disease, and prevention    399  37 (2), pp. A220-A304  Journal of the American College of Cardiology

• These cases involved abstracts of an international conference. DOIs were first registered for the individual presentation abstracts, but the publisher appears to have changed its policy and registered a DOI for a set of abstracts instead.

Slide 13

Slide 13 text

What are the most common changes in the suffixes of deleted DOIs? #1

Table 4. DOI name change patterns (n=667,869)
Pattern                                   Count      %
Only the suffix changed                 465,789  69.74
Both the prefix and the suffix changed  156,650  23.46
Only the prefix changed                  45,430   6.80

• Table 4 presents the DOI name change patterns for each pair of Alias and Primary DOIs.
• The most common pattern was “only the suffix changed” (69.74%), indicating that many DOIs registered under the same prefix were deleted when multiple DOIs were registered to the same content by the same registrant.
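The three patterns in Table 4 follow from comparing the prefix and suffix of each Alias/Primary pair. A minimal sketch, assuming the DOI name is split at its first slash and compared case-insensitively; the function names and the `10.9999` prefix in the usage lines are hypothetical.

```python
def split_doi(doi):
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, _, suffix = doi.partition("/")
    return prefix.lower(), suffix.lower()


def change_pattern(alias_doi, primary_doi):
    """Name the change pattern between an Alias DOI and its Primary DOI."""
    alias_prefix, alias_suffix = split_doi(alias_doi)
    primary_prefix, primary_suffix = split_doi(primary_doi)
    prefix_changed = alias_prefix != primary_prefix
    suffix_changed = alias_suffix != primary_suffix
    if prefix_changed and suffix_changed:
        return "Both the prefix and the suffix changed"
    if suffix_changed:
        return "Only the suffix changed"
    if prefix_changed:
        return "Only the prefix changed"
    return "No change"


# Splitting at the first slash keeps a doubled separator visible in the suffix.
example = change_pattern("10.9999//s12445-012-0033-7", "10.9999/s12445-012-0033-7")
```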

Slide 14

Slide 14 text

What are the most common changes in the suffixes of deleted DOIs? #2

Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789). On the slide, red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.
#  Pattern of changes in the suffix                                     Count      %
1  Delete a slash once.                                                49,169  10.56
   Example: /s12445-012-0033-7 → s12445-012-0033-7
   The correction of “//” to “/” as the separator between the prefix and the suffix.
2  Add a hyphen four times.                                            21,918   4.71
   Example: 9781591401087.ch001 → 978-1-59140-108-7.ch001
   The correction of the ISBN in the suffix by adding hyphens as separators.
3  Delete “2” twice, add “5” once, add “7” once, replace “8” with “9”
   once, and replace “6” with “3” once.                                18,607   3.99
   Example: 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010
   The replacement of the ISSN in the suffix with a new one.
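Character-level edits of this kind can be extracted with a sequence diff. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in; the slide does not say which diff algorithm was used, so the resulting edit scripts may differ in detail from those counted in Table 5.

```python
from difflib import SequenceMatcher


def suffix_edits(alias_suffix, primary_suffix):
    """List the non-matching diff operations between two DOI suffixes.

    Each item is (operation, removed_text, added_text), where operation is
    'delete', 'insert', or 'replace'.
    """
    matcher = SequenceMatcher(None, alias_suffix, primary_suffix, autojunk=False)
    return [
        (tag, alias_suffix[i1:i2], primary_suffix[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


slash_fix = suffix_edits("/s12445-012-0033-7", "s12445-012-0033-7")
isbn_fix = suffix_edits("9781591401087.ch001", "978-1-59140-108-7.ch001")
```

Grouping pairs by their edit list (for example with `collections.Counter` over `tuple(edits)`) would then give frequency counts per pattern.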

Slide 15

Slide 15 text

What are the most common changes in the suffixes of deleted DOIs? #3

Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789). On the slide, red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.
#  Pattern of changes in the suffix                                     Count      %
3  Delete “2” twice, add “5” once, add “7” once, replace “8” with “9”
   once, and replace “6” with “3” once.                                18,607   3.99
   Example: 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010
   The replacement of the ISSN in the suffix with a new one.

As a similar case to #3, we observed the replacement of the SICI (Serial Item and Contribution Identifier) in the suffix with a new one.
Example: (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.3.co;2-2 → (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.0.co;2-#

Slide 16

Slide 16 text

Conclusion
1. Using the proposed method, we identified 708,282 deleted DOIs that existed in March 2017 but not in January 2021.
2. The cases in which many DOIs were assigned to the same content were retracted papers in a specific volume of a journal and abstracts of international conference proceedings.
3. We revealed the factors that caused a large number of deleted DOIs.
  – We must be careful not to put a double slash between the prefix and the suffix.
  – When we embed other identifiers such as the ISBN, ISSN, or SICI in suffixes, we may later need to reformat or update the suffixes because of changes in those identifiers.
4. The findings of this study are useful both for considering the problems that deleted DOIs cause in citation analysis and altmetrics and for assigning DOIs in a way that avoids deletions.
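The guidance in item 3 could be turned into a simple check before registration. The following is a hypothetical sketch: the function name, the warning texts, and the regular expression for an unhyphenated ISBN-13 are all assumptions, and it covers only the two patterns that are easy to detect from the DOI string alone.

```python
import re

# Thirteen consecutive digits starting with 978 or 979 (assumed ISBN-13 shape).
UNHYPHENATED_ISBN13 = re.compile(r"97[89]\d{10}")


def suffix_warnings(doi):
    """Return warnings for suffix patterns associated with later deletions."""
    _, _, suffix = doi.partition("/")
    warnings = []
    if suffix.startswith("/"):
        warnings.append("double slash between prefix and suffix")
    if UNHYPHENATED_ISBN13.search(suffix):
        warnings.append("unhyphenated ISBN-13-like digits in suffix")
    return warnings
```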

Slide 17

Slide 17 text

Future work
1. Expand the scope of the deleted DOIs and develop a methodology for identifying them
  – to identify deleted DOIs that existed before March 2017, and the deleted DOIs after March 2021
  – to cover other DOIs whose contents were not reachable
2. Classify all the factors that caused deleted DOIs
  – We only focused on the most frequent cases in this study.
  – It is unclear whether the observed factors represented exceptional cases.
3. Conduct a quantitative analysis of the effects of deleted DOIs on citation analysis and altmetrics
  – Are alias/primary DOIs included in the citation indexes and altmetrics or not?
  – We are examining the citing/cited references to deleted DOIs by using the OpenCitations Index of Crossref open DOI-to-DOI citations dataset.

Slide 18

Slide 18 text

Dataset is available
• Dataset of the deleted DOIs extracted from the difference set between Crossref DOIs as of March 2017 and January 2021. Zenodo (2022). https://doi.org/10.5281/zenodo.6841257

Example record:
  {
    "doi": "10.1001/.387",
    "whichRA": "Crossref",
    "redirects": [
      "https://doi.org/10.1001/archinte.166.4.387",
      "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"
    ],
    "redirect_to_other_doi": [
      "10.1001/archinte.166.4.387"
    ],
    "timestamp": "2022-01-30T02:10:45Z",
    "label": "Alias DOIs"
  }
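Records of this shape can be summarized with a few lines of code. The sketch below assumes the records are available as individual JSON strings (for example, one per line); the actual file layout of the Zenodo dataset is not stated on the slide, and the function name is hypothetical. The field names are those of the example record.

```python
import json
from collections import Counter


def summarize_records(json_records):
    """Count labels and map each Alias DOI to its first redirect target DOI."""
    label_counts = Counter()
    alias_to_primary = {}
    for text in json_records:
        record = json.loads(text)
        label_counts[record["label"]] += 1
        targets = record.get("redirect_to_other_doi") or []
        if record["label"] == "Alias DOIs" and targets:
            alias_to_primary[record["doi"]] = targets[0]
    return label_counts, alias_to_primary


# The example record from the slide, reduced to the fields used here.
sample = json.dumps({
    "doi": "10.1001/.387",
    "redirect_to_other_doi": ["10.1001/archinte.166.4.387"],
    "label": "Alias DOIs",
})
label_counts, alias_to_primary = summarize_records([sample])
```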
