
Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent? / tpdl2022

Jiro Kikkawa
September 15, 2022


These are our presentation slides for TPDL2022 (http://tpdl2022.dei.unipd.it/), Session 9: Research and CH Data, presented on 23 September 2022.
Authors: Jiro Kikkawa, Masao Takaku, and Fuyuki Yoshikane
Paper: https://doi.org/10.1007/978-3-031-16802-4_13
Preprint: https://doi.org/10.48550/arXiv.2207.12018
Abstract: Digital Object Identifiers (DOIs) are regarded as persistent; however, they are sometimes deleted. Deleted DOIs are an important issue not only for persistent access to scholarly content but also for bibliometrics, because they may cause problems in correctly identifying scholarly articles. However, little is known about how many DOIs are deleted and what causes the deletions. We identified deleted DOIs by comparing the datasets of all Crossref DOIs on two different dates, investigated the number of deleted DOIs in the scholarly content along with the corresponding document types, and analyzed the factors that cause deleted DOIs. Using the proposed method, 708,282 deleted DOIs were identified. The majority corresponded to individual scholarly articles such as journal articles, proceedings articles, and book chapters. There were cases of many DOIs assigned to the same content, e.g., retracted journal articles and abstracts of international conferences. We show which publishers and academic societies account for the most deleted DOIs. In addition, the top cases of single scholarly content with a large number of deleted DOIs were revealed. The findings of this study are useful for citation analysis and altmetrics, as well as for avoiding deleted DOIs.


Transcript

  1. Analysis of the deletions of DOIs
    What factors undermine their persistence
    and to what extent?
    TPDL2022 - Session 9: Research and CH Data
    Jiro Kikkawa Masao Takaku Fuyuki Yoshikane
    { jiro, masao, fuyuki } @slis.tsukuba.ac.jp
    Slide https://speakerdeck.com/corgies/tpdl2022
    Paper https://doi.org/10.1007/978-3-031-16802-4_13
    Preprint https://doi.org/10.48550/arXiv.2207.12018


  2. Background
    • Persistent access to scholarly articles
    – DOI is the best-known PID and an international standard
    – 130 million Crossref DOIs as of March 2022
    • Deleted DOIs
    – DOIs and their artifacts (e.g., identifiers, metadata, and
    redirected URIs) are sometimes deleted by content holders.
    – Deleted DOIs may cause problems for bibliometrics (e.g., citation
    indexes and altmetrics) in correctly identifying scholarly articles.
    – However, little is known about their quantity and causes. Thus,
    we developed a methodology to identify them and analyzed them.


  3. Purpose
    1. We provide an overall picture of deleted DOIs by clarifying
    their number and the characteristics of their content. We
    demonstrate the potential impact of deleted DOIs on citation
    analysis and altmetrics.
    2. We provide guidance for avoiding deleted DOIs and
    making the DOI system more stable.
    Definition of deleted DOIs
    We focus on the deletion of identifiers and metadata.
    In this study, “deleted DOIs” are DOIs for which the identifier and
    its associated metadata cannot be retrieved.


  4. Related work #1
    • Investigation of duplicated Crossref DOIs
    – Tkaczyk (2020)
    • Incorrect DOIs indexed by scholarly bibliographic
    databases
    – Franceschini et al. (2014), Zhu et al. (2018)
    • Crossref DOI statistics
    – Hendricks et al. (2020)
    • Analysis of persistence of Crossref DOIs
    – Klein and Balakireva (2020)
    • Analysis of the usage of DOI links in scholarly
    articles
    – Van de Sompel et al. (2016)


  5. Related work #2
    1. Duplicated Crossref DOIs
    2. Crossref DOIs that cause errors and do not lead to the
    content
    3. Errors in the DOIs recorded in scholarly bibliographic
    databases
    • Previous studies were based on relatively small samples
    because no methodology for identifying duplicated or deleted
    DOIs at a large scale had been proposed.
    • We propose a methodology for identifying deleted DOIs
    and conduct a large-scale analysis to determine how
    many deleted DOIs exist and what factors cause them.


  6. Materials and Methods
    Identifying deleted DOIs #1
    • We extracted the DOIs that existed at one point in time but
    did not exist later by comparing the dump files of Crossref DOIs
    from two different dates.
    → We used the datasets as of March 2017 and January 2021.
    • We then treated these DOIs as a candidate set of deleted DOIs.
    [Figure: All Crossref DOIs as of March 2017, compared with all
    Crossref DOIs as of January 2021, yield the difference set of DOIs
    included in the dataset as of March 2017. This difference set is the
    candidate set of deleted DOIs.]
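
The comparison of the two dumps can be sketched as a plain set difference. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the DOIs under the `10.5555` prefix are made-up sample values, and lower-casing is applied only because DOI names are case-insensitive.

```python
from typing import Iterable, Set


def normalize_doi(doi: str) -> str:
    """DOI names are case-insensitive, so compare them in lower case."""
    return doi.strip().lower()


def candidate_deleted_dois(older: Iterable[str], newer: Iterable[str]) -> Set[str]:
    """Return DOIs present in the older dump but absent from the newer one."""
    older_set = {normalize_doi(d) for d in older}
    newer_set = {normalize_doi(d) for d in newer}
    return older_set - newer_set


# Tiny stand-ins for the March 2017 and January 2021 dumps (hypothetical data).
dois_2017 = ["10.1001/.387", "10.1001/archinte.166.4.387", "10.5555/EXAMPLE"]
dois_2021 = ["10.1001/archinte.166.4.387", "10.5555/example"]

candidates = candidate_deleted_dois(dois_2017, dois_2021)
print(sorted(candidates))
```

For dumps with more than a hundred million DOIs, the same idea would more likely be run on sorted files or in a database than on in-memory sets.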


  7. Identifying deleted DOIs #2
    Difference set of DOIs included in the dataset as of March 2017 (n=711,198)
    First, we classified DOIs into the following groups according to
    the result of “Which RA?”
    • Non-existing DOIs: the result is “DOI does not exist”
    • Crossref DOIs: the result is “Crossref”
    • Non-Crossref DOIs: the result is an RA name other than Crossref
    * Items with a red box on the slide correspond to deleted DOIs.
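
The grouping by “Which RA?” result can be sketched as a small pure function. This is an assumption-laden illustration: it presumes the lookup returns a JSON list whose first entry carries either an `"RA"` key or a `"status"` key, which should be checked against the live service, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def classify_by_ra(response: List[Dict[str, Any]]) -> str:
    """Map one parsed "Which RA?" response to a group name from the slide.

    Assumed response shapes (not confirmed by the slides):
      [{"DOI": "...", "RA": "Crossref"}]
      [{"DOI": "...", "status": "DOI does not exist"}]
    """
    entry = response[0]
    if entry.get("status") == "DOI does not exist":
        return "Non-existing DOIs"
    if entry.get("RA") == "Crossref":
        return "Crossref DOIs"
    if "RA" in entry:
        return "Non-Crossref DOIs"
    # Anything else (e.g. an unexpected status) is left for manual review.
    return "Unclassified"


print(classify_by_ra([{"DOI": "10.1001/.387", "RA": "Crossref"}]))
```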


  8. Identifying deleted DOIs #3
    Difference set of DOIs included in the dataset as of March 2017 (n=711,198)
    “Which RA?” groups: Non-existing DOIs (DOI does not exist), Crossref DOIs
    (Crossref), Non-Crossref DOIs (RA names other than Crossref)
    Next, we classified the Crossref DOIs into the following groups
    according to the HTTP headers returned for the DOI links.
    • DOIs with Redirects
    • Defunct DOIs: DOIs that redirected to the specific
    URI for the deleted content
    https://www.crossref.org/_deleted-doi/
    • DOIs without Redirects
    * Items with a red box on the slide correspond to deleted DOIs.
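
The redirect-based grouping can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the redirect targets of a DOI link have already been collected into a list of URIs, and the function name is hypothetical. Only the deleted-content URI comes from the slide.

```python
from typing import List

# URI that Crossref uses for deleted content, as given on the slide.
DELETED_DOI_URI = "https://www.crossref.org/_deleted-doi/"


def classify_by_redirects(redirects: List[str]) -> str:
    """Group a Crossref DOI by the redirect targets observed for its DOI link."""
    if not redirects:
        return "DOIs without Redirects"
    if any(uri.startswith(DELETED_DOI_URI) for uri in redirects):
        return "Defunct DOIs"
    return "DOIs with Redirects"


chain = [
    "https://doi.org/10.1001/archinte.166.4.387",
    "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387",
]
print(classify_by_redirects(chain))
```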


  9. Identifying deleted DOIs #4
    Difference set of DOIs included in the dataset as of March 2017 (n=711,198)
    “Which RA?” groups: Non-existing DOIs (DOI does not exist), Crossref DOIs
    (Crossref), Non-Crossref DOIs (RA names other than Crossref)
    Redirect groups: DOIs with Redirects, Defunct DOIs, DOIs without Redirects
    Finally, the DOIs were classified into the following groups according
    to the results of the Crossref REST API.
    • DOIs with Deleted Description in Metadata
    • Alias DOIs
    • Other DOIs
    When multiple Crossref DOIs were assigned to the same content item,
    one was set as the Primary DOI, and the others were set as the Alias DOIs.
    The Crossref REST API returned the error “Resource not found” for the Alias DOIs.
    * Items with a red box on the slide correspond to deleted DOIs.


  10. Results and Discussion
    How many deleted DOIs exist?
    Table 1. Number of deleted DOIs in each group (n=708,282)
    #  Group                                        Count      %
    1  Non-existing DOIs                              240   0.03
    2  DOIs without Redirects                         693   0.10
    3  DOIs with Deleted Description in Metadata    1,144   0.16
    4  Defunct DOIs                                   388   0.05
    5  Alias DOIs                                 667,869  94.29
    6  Other DOIs                                  37,948   5.36
    • We identified 708,282 deleted DOIs. Most of them were
    Alias DOIs, accounting for 94.29% of the total.
    • The majority of the deleted DOIs were caused by the
    deletion of multiple DOI assignments for the same content.

  11. Why are many DOIs assigned to the
    same content? #1
    Table 3-1. Primary DOIs with the largest numbers of associated Alias DOIs
    #1 Primary DOI: 10.1016/s1876-6102(14)00454-8
       Title: Volume Removed - Publisher's Disclaimer
       Count: 1,474
       Volume (Issue), Page: 13, pp. 1-10380
       Container title: Energy Procedia
       Note: This entire volume was retracted.
    #2 Primary DOI: 10.1016/s1876-6102(14)00453-6
       Title: Volume Removed - Publisher's Disclaimer
       Count: 748
       Volume (Issue), Page: 11, pp. 1-5156
       Container title: Energy Procedia
       Note: This entire volume was retracted.
    • In these cases, many DOIs were assigned to the same content
    because the journal articles were retracted.
    • The retracted articles are distinct scholarly articles. The withdrawal
    information should be applied to each article's own DOI, rather than
    making those DOIs aliases of a single primary DOI.


  12. Why are many DOIs assigned to the
    same content? #2
    Table 3-2. Primary DOIs with the largest numbers of associated Alias DOIs
    #3 Primary DOI: 10.1016/j.fueleneab.2006.10.002
       Title: Abstracts
       Count: 486
       Volume (Issue), Page: 47 (6), pp. 384-446
       Container title: Fuel and Energy Abstracts
    #4 Primary DOI: 10.1016/s0735-1097(01)80004-8
       Title: Hypertension, vascular disease, and prevention
       Count: 399
       Volume (Issue), Page: 37 (2), pp. A220-A304
       Container title: Journal of the American College of Cardiology
    • These cases involved abstracts of an international conference.
    DOIs were first registered for the abstracts of the individual
    presentations, but the publisher appears to have changed its policy
    and now registers DOIs for a set of abstracts.


  13. What are the most common changes
    in the suffixes of deleted DOIs? #1
    Table 4. DOI name change patterns (n=667,869)
    Pattern                                    Count      %
    Only the suffix changed                  465,789  69.74
    Both the prefix and the suffix changed   156,650  23.46
    Only the prefix changed                   45,430   6.80
    • Table 4 presents the DOI name change patterns for each pair of
    Alias and Primary DOIs.
    • The most common pattern was “only the suffix changed” (69.74%),
    indicating that many DOIs registered under the same prefix were
    deleted when multiple DOIs were registered to the same content
    by the same registrant.
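
The three patterns in Table 4 can be reproduced for a single Alias/Primary pair by splitting each DOI at its first slash. This is a minimal sketch with hypothetical function names; the sample pair is the record shown on the dataset slide of this deck.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, suffix = doi.split("/", 1)
    return prefix, suffix


def name_change_pattern(alias_doi: str, primary_doi: str) -> str:
    """Label how the DOI name differs between an Alias DOI and its Primary DOI."""
    alias_prefix, alias_suffix = split_doi(alias_doi)
    primary_prefix, primary_suffix = split_doi(primary_doi)
    prefix_changed = alias_prefix != primary_prefix
    suffix_changed = alias_suffix != primary_suffix
    if prefix_changed and suffix_changed:
        return "Both the prefix and the suffix changed"
    if suffix_changed:
        return "Only the suffix changed"
    if prefix_changed:
        return "Only the prefix changed"
    return "No change"


print(name_change_pattern("10.1001/.387", "10.1001/archinte.166.4.387"))
```

Splitting at the first slash keeps a doubled separator inside the suffix, so a name such as `10.5555//abc` (a made-up example) has the suffix `/abc`.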


  14. What are the most common changes
    in the suffixes of deleted DOIs? #2
    Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789).
    #1 Pattern of changes on the suffix: Delete a slash once.
       Count: 49,169 (10.56%)
       Example: /s12445-012-0033-7 → s12445-012-0033-7
       Explanation: The correction of “//” to “/” as a separator between the prefix and the suffix.
    #2 Pattern of changes on the suffix: Add a hyphen four times.
       Count: 21,918 (4.71%)
       Example: 9781591401087.ch001 → 978-1-59140-108-7.ch001
       Explanation: The correction of the ISBN in the suffix by adding hyphens as separators.
    #3 Pattern of changes on the suffix: Delete “2” twice, add “5” once, add “7” once,
       replace “8” with “9” once, and replace “6” with “3” once.
       Count: 18,607 (3.99%)
       Example: 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010
       Explanation: The replacement of the ISSN in the suffix with a new one.
    On the slide, red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.
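
Character-level patterns like those in Table 5 can be approximated with a generic sequence diff. The slides do not say which diff method the authors used, so this sketch swaps in Python's `difflib.SequenceMatcher` as a stand-in. The function name is hypothetical, and for more complex changes such as the ISSN replacement in #3 the operations it reports may be grouped differently from the table.

```python
from collections import Counter
from difflib import SequenceMatcher


def suffix_edit_pattern(old: str, new: str) -> Counter:
    """Count the delete/insert/replace operations that turn `old` into `new`."""
    ops: Counter = Counter()
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            ops[("delete", old[i1:i2])] += 1
        elif tag == "insert":
            ops[("insert", new[j1:j2])] += 1
        elif tag == "replace":
            ops[("replace", old[i1:i2], new[j1:j2])] += 1
    return ops


# Examples #1 and #2 from Table 5.
slash_fix = suffix_edit_pattern("/s12445-012-0033-7", "s12445-012-0033-7")
isbn_fix = suffix_edit_pattern("9781591401087.ch001", "978-1-59140-108-7.ch001")
print(slash_fix)
print(isbn_fix)
```

Counting identical operation signatures across all Alias/Primary pairs would then give a frequency ranking of change patterns.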


  15. What are the most common changes
    in the suffixes of deleted DOIs? #3
    Table 5. Three most frequent patterns for “only the suffix changed” (n=465,789).
    #3 Pattern of changes on the suffix: Delete “2” twice, add “5” once, add “7” once,
       replace “8” with “9” once, and replace “6” with “3” once.
       Count: 18,607 (3.99%)
       Example: 2214-8647_dnp_e1000010 → 1574-9347_dnp_e1000010
       Explanation: The replacement of the ISSN in the suffix with a new one.
    On the slide, red, blue, and purple text corresponds to deletion, addition, and replacement, respectively.
    As a similar case to #3, we observed the replacement of
    the SICI (Serial Item and Contribution Identifier) in the suffix with a new one.
    Example:
    (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.3.co;2-2
    (sici)1096-8628(19960102)61:1<21::aid-ajmg4>3.0.co;2-#


  16. Conclusion
    1. We identified 708,282 deleted DOIs that existed in March 2017
    and did not exist in January 2021 using the proposed method.
    2. The cases where many DOIs were assigned to the same content were
    retracted papers in a specific volume of a journal or
    abstracts of international conference proceedings.
    3. We revealed the factors that caused a large number of deleted DOIs.
    – We must be careful not to set double slashes between the prefix and the
    suffix. When we apply other identifiers such as the ISBN, ISSN, and SICI to
    suffixes, we may need to format or update them owing to changes in the
    ISBN, ISSN, or SICI.
    4. The findings of this study are useful for both considering the
    problems caused by deleted DOIs in citation analysis and altmetrics
    and assigning DOIs in a better way to avoid deleted DOIs.


  17. Future work
    1. Expand the scope of the deleted DOIs and develop a
    methodology for identifying them
    – to identify DOIs that were deleted before March 2017 and DOIs
    that were deleted after January 2021
    – to cover other DOIs whose contents were not reachable
    2. Classify all the factors that caused deleted DOIs
    – We only focused on the most frequent cases in this study.
    – It is unclear whether the observed factors represented exceptional cases.
    3. Conduct a quantitative analysis of the effects of deleted
    DOIs on citation analysis and altmetrics
    – Are alias/primary DOIs included in the citation indexes and altmetrics or not?
    – We are examining the citing/cited references to deleted DOIs by using
    the OpenCitations’ Index of Crossref open DOI-to-DOI citations dataset.


  18. Dataset is available
    • Dataset of the deleted DOIs extracted from the
    difference set between Crossref DOIs as of
    March 2017 and January 2021. Zenodo (2022).
    https://doi.org/10.5281/zenodo.6841257
    {
      "doi": "10.1001/.387",
      "whichRA": "Crossref",
      "redirects": [
        "https://doi.org/10.1001/archinte.166.4.387",
        "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"
      ],
      "redirect_to_other_doi": [
        "10.1001/archinte.166.4.387"
      ],
      "timestamp": "2022-01-30T02:10:45Z",
      "label": "Alias DOIs"
    }
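
A record of this shape can be read with the standard `json` module. The sketch below parses the sample record shown on this slide and pairs the Alias DOI with its target. Reading `redirect_to_other_doi` as the Primary DOI is an assumption based on the field name, the helper name is hypothetical, and the file layout of the Zenodo dataset is not described in the slides.

```python
import json
from typing import Any, Dict, Optional, Tuple

# The sample record from the slide, embedded as a string for illustration.
record_json = """
{
  "doi": "10.1001/.387",
  "whichRA": "Crossref",
  "redirects": [
    "https://doi.org/10.1001/archinte.166.4.387",
    "http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"
  ],
  "redirect_to_other_doi": ["10.1001/archinte.166.4.387"],
  "timestamp": "2022-01-30T02:10:45Z",
  "label": "Alias DOIs"
}
"""


def alias_to_primary(record: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (alias DOI, assumed primary DOI) for Alias DOI records, else None."""
    if record.get("label") != "Alias DOIs":
        return None
    targets = record.get("redirect_to_other_doi", [])
    if not targets:
        return None
    return record["doi"], targets[0]


record = json.loads(record_json)
print(alias_to_primary(record))
```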


  19. Analysis of the deletions of DOIs
    What factors undermine their persistence
    and to what extent?
    TPDL2022 - Session 9: Research and CH Data
    Jiro Kikkawa Masao Takaku Fuyuki Yoshikane
    { jiro, masao, fuyuki } @slis.tsukuba.ac.jp
    Slide https://speakerdeck.com/corgies/tpdl2022
    Paper https://doi.org/10.1007/978-3-031-16802-4_13
    Preprint https://doi.org/10.48550/arXiv.2207.12018
