Detection and Analysis of First Appearances of the Scholarly Bibliographic References on Wikipedia Articles / 20230718

Presentation slide at 2nd AP-iNext workshop Scholarly Communication & Scholarly Data Mining

Jiro Kikkawa

July 18, 2023

Transcript

  1. Detection and Analysis of First Appearances of the Scholarly Bibliographic References on Wikipedia Articles

    Jiro Kikkawa, University of Tsukuba, Japan
    2nd AP-iNext workshop Scholarly Communication & Scholarly Data Mining
  2. About Me: Jiro Kikkawa

    • Assistant Professor at the University of Tsukuba
      – Institute of Library, Information and Media Science
    • Ph.D. (Library Information Science)
      – received from the University of Tsukuba in March 2021
    • Research interests
      – Scholarly communication, Bibliometrics, and Digital library
      – I have been analyzing scholarly bibliographic references on Wikipedia since I was a graduate student.
    • For more details, please visit https://researchmap.jp/jir_o?lang=en
  3. Overview

    • I introduce my research project to identify and analyze scholarly bibliographic references on Wikipedia
      – based on the following two papers

    Paper 1: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Vol. 9, Article No. 85, pp. 1-11, 2022. https://doi.org/10.1038/s41597-022-01190-z (Data Descriptor, open access, www.nature.com/scientificdata)

    Abstract: Referencing scholarly documents as information sources on Wikipedia is important because it supports or improves the quality of Wikipedia content. Several studies have been conducted regarding scholarly references on Wikipedia; however, little is known of the editors and their edits contributing to add the scholarly references on Wikipedia. In this study, we develop a methodology to detect the oldest scholarly reference added to Wikipedia articles by which a certain paper is uniquely identifiable as the “first appearance of the scholarly reference.” We identified the first appearances of 923,894 scholarly references (611,119 unique DOIs) in 180,795 unique pages on English Wikipedia as of March 1, 2017 and stored them in the dataset. Moreover, we assessed the precision of the dataset, which was highly precise regardless of the research field. Finally, we demonstrate the potential of our dataset. This dataset is unique and attracts those who are interested in how the scholarly references on Wikipedia grew and which editors added them.

    Background & Summary: Along with the digitization of scholarly communication, numerous scholarly documents have been referenced and used on the Web. One of the changes arising from the development and dissemination of scholarly information infrastructures on the Web is the utilization of scholarly documents by various people and communities, including readers other than traditional ones such as researchers and specialists. As such an example, there are many references and accesses to scholarly documents via Wikipedia. In particular, according to Crossref, which assigns Digital Object Identifiers (DOIs) to scholarly documents massively, Wikipedia is one of the largest referrers of Crossref DOIs as of 2015 [1]. Wikipedia is a free online encyclopedia that anyone can edit, and it has been one of the most visited websites in the world. However, owing to its collaborative nature, much criticism and discussion have emerged since its start with regard to the accuracy and reliability of its contents. Three core content policies exist in Wikipedia: “verifiability,” “neutral point of view,” and “no original research.” Referencing scholarly documents as information sources on Wikipedia complements these policies, as these cited sources support or improve the quality of Wikipedia content. Several studies have been conducted regarding scholarly bibliographic references on Wikipedia; however, most of them have focused on the scholarly document itself [2–6]. The methodologies in previous studies used […]

    Paper 2: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Time Lag Analysis of Adding Scholarly References to English Wikipedia: How Rapidly Are They Added to and How Fresh Are They?", Proceedings of the 18th International Conference, iConference 2023, Lecture Notes in Computer Science (LNCS), Vol. 13972, pp. 425-438, 2023. https://doi.org/10.1007/978-3-031-28032-0_33

    Authors: Jiro Kikkawa, Masao Takaku, and Fuyuki Yoshikane, University of Tsukuba, Tsukuba, Ibaraki, Japan, {jiro,masao,fuyuki}@slis.tsukuba.ac.jp

    Abstract: Referencing scholarly documents as information sources on Wikipedia is important because they complement and improve the quality of Wikipedia content. However, little is known about them, such as how rapidly they are added and how fresh they are. To answer these questions, we conduct a time-series analysis of adding scholarly references to the English Wikipedia as of October 2021. Consequently, we detect no tendencies in Wikipedia articles created recently to refer to more fresh references because the time lag between publishing the scholarly articles and adding references of the corresponding paper to Wikipedia articles has remained generally constant over the years. In contrast, tendencies to decrease over time in the time lag between creating Wikipedia articles and adding the first scholarly references are observed. The percentage of cases where scholarly references were added simultaneously as Wikipedia articles are created is found to have increased over the years, particularly since 2007–2008. This trend can be seen as a response to the policy changes of the Wikipedia community at that time that was adopted by various editors, rather than depending on massive activities by a small number of editors.
  4. Background

    • Mass digitization of scholarly communication
      – Various kinds of communities and people, including readers other than traditional ones such as researchers and specialists, can utilize scholarly documents
      – Wikipedia offers numerous references and access to scholarly documents, and Wikipedia is one of the largest referrers of Crossref DOIs as of 2015
    • Wikipedia and scholarly bibliographic references
      – Wikipedia is a free online encyclopedia that anyone can edit, and one of the most visited websites in the world
      – Owing to its collaborative nature, much criticism and discussion have emerged since its start with regard to the accuracy and reliability of its contents
      – Scholarly bibliographic references on Wikipedia complement and improve the quality of Wikipedia content
  5. Background

    • Scholarly bibliographic references on Wikipedia complement and improve the quality of Wikipedia content

    Figure 1. Example of the scholarly reference on English Wikipedia.
    Source: Library and information science - Wikipedia https://en.wikipedia.org/wiki/Library_and_information_science

      Difficulties defining LIS
      "The question, 'What is library and information science?' does not elicit responses [...]
      Chua & Yang (2008) [10] studied papers published in Journal of the American Society for Information Science and Technology in the period 1988–1997 and found, among other things: "Top authors have grown in diversity from those being affiliated predominantly with library/information-related departments to include those from information systems management, information technology, business, and the humanities. […]"

      References
      1. Bates, M.J. and Maack, M.N. (eds.). (2010). Encyclopedia of Library and Information Sciences. Vol. 1–7. CRC Press, Boca Raton, USA. Also available as an electronic source.
      […]
      10. Chua, Alton Y.K.; Yang, Christopher C. (November 2008). "The shift towards multi-disciplinarity in information science". Journal of the American Society for Information Science and Technology. 59 (13): 2156–2170. doi:10.1002/asi.20929.

    • 1,474,375 scholarly references on English Wikipedia as of October 2021
    • Who added these references to Wikipedia, and when?
  6. Analysis of scholarly references on Wikipedia

    • Most previous studies have focused on the scholarly document itself, and little is known about the editors and their contributions to adding scholarly references to Wikipedia.

    Previous studies focused on the scholarly document itself:
    1. whether the scholarly articles published in high-impact factor journals tend to be more referenced on Wikipedia [Nielsen, 2007; Teplitskiy, 2016]
    2. whether the scholarly articles published in open access journals tend to be more referenced on Wikipedia [Teplitskiy, 2016; Lin and Fenner, 2014; Pooladian and Borrego, 2017]
    3. whether the references on Wikipedia are usable as a data source for research evaluations [Kousha and Thelwall, 2017]
    4. investigations regarding the characteristics of Wikipedia articles with scholarly references [Pooladian and Borrego, 2017]
    5. investigations regarding the references focused on specific identifiers (e.g., DOI, arXiv, ISSN, and ISBN) [Kikkawa, 2016; Kikkawa, 2020b; Halfaker and Taraborelli, 2019] or research fields [Thelwall, 2016; Pooladian and Borrego, 2017]

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Time Lag Analysis of Adding Scholarly References to English Wikipedia: How Rapidly Are They Added to and How Fresh Are They?", Proceedings of the 18th International Conference, iConference 2023, Lecture Notes in Computer Science (LNCS), Vol. 13972, pp. 425-438, 2023. https://doi.org/10.1007/978-3-031-28032-0_33
  7. Difficulties in detecting the first appearance #1

    • We define the term “first appearance of the scholarly reference” as
      - the oldest scholarly reference added to Wikipedia articles by which a certain paper is uniquely identifiable
    • We do not consider the role of each reference
      - For instance, references used as evidence for a certain part of the article content, those just mentioning the paper, and those listed in further readings are not distinguished.
    • If there are multiple references corresponding to the same paper on the same article, the oldest one is treated as the first appearance.
    • The most challenging part is that the scholarly reference at the time of its first appearance is often composed of insufficient or incomplete information, and more detailed information is added in later revisions.

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Vol. 9, Article No. 85, pp. 1-11, 2022. https://doi.org/10.1038/s41597-022-01190-z
  8. Difficulties in detecting the first appearance #2

    • We define the term “first appearance of the scholarly reference” as
      - the oldest scholarly reference added to Wikipedia articles by which a certain paper is uniquely identifiable
    • The first appearance in this case is A1, where an editor added the corresponding scholarly reference, including author name, published year, paper title, and journal name, to the article.
    • Then, another editor modified its format according to the citation template in A2, and the DOI was added in A3.
    • We need to detect the first appearance by matching paper titles for this case.

    Figure 2A. The first appearance of the target papers on the article “Fair trade” on English Wikipedia

    Article title: Fair trade
    Target paper: Reed, D. (2009). What do Corporations have to do with Fair Trade? Positive and Normative Analysis from a Value Chain Perspective. Journal of Business Ethics, 86, 3–26. https://doi.org/10.1007/s10551-008-9757-5

    Sample - (no revision timestamp):
      (not exist)
    Sample A1 (revision timestamp 2011-05-05 13:35:01 UTC):
      * Reed, D. (2009). What do Corporations have to do with Fair Trade? Positive and normative analysis from a value chain perspective. Journal of Business Ethics , 86:3-26, , p. 12)
    Sample A2 (revision timestamp 2016-06-26 09:48:41 UTC):
      <ref>{{cite journal | last1 = Reed | first1 = D | year = 2009 | title = What do Corporations have to do with Fair Trade? Positive and normative analysis from a value chain perspective | url = | journal = Journal of Business Ethics | volume = 86 | issue = | pages = 3–26 [12] }}</ref>
    Sample A3 (revision timestamp 2016-06-26 09:49:39 UTC):
      <ref> […] {{cite journal | last1 = Reed | first1 = D | year = 2009 | title = What do Corporations have to do with Fair Trade? Positive and normative analysis from a value chain perspective | url = | journal = Journal of Business Ethics | volume = 86 | issue = | pages = 3–26 [21] | doi=10.1007/s10551-008-9757-5}}</ref>

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Vol. 9, Article No. 85, pp. 1-11, 2022. https://doi.org/10.1038/s41597-022-01190-z
  9. Difficulties in detecting the first appearance #3

    • The first appearance in this case is B1, where an editor initially added just the URI with the PubMed ID (PMID) to this article.
    • Then, the paper title and author names were added, along with modification of the format according to the citation template, in B2.
    • Additional information including the DOI was added in B3.
    • We need to detect the first appearance by matching PubMed IDs for this case.

    Figure 2B. The first appearance of the target papers on the article “Solomon Islands” on English Wikipedia

    Article title: Solomon Islands
    Target paper: Norton, H. L., Friedlaender, J. S., Merriwether, D. A., Koki, G., Mgone, C. S., & Shriver, M. D. (2006). Skin and hair pigmentation variation in Island Melanesia. American Journal of Physical Anthropology, 130 (2), 254–268. https://doi.org/10.1002/ajpa.20343

    Sample - (no revision timestamp):
      (not exist)
    Sample B1 (revision timestamp 2014-11-19 19:36:09 UTC):
      <ref>http://www.ncbi.nlm.nih.gov/pubmed/16374866</ref>
    Sample B2 (revision timestamp 2014-11-19 22:23:48 UTC):
      <ref>{{cite web | url=http://www.ncbi.nlm.nih.gov/pubmed/16374866 | title=Skin and hair pigmentation variation in Island Melanesia. | author=Norton HL , et al. | publisher= | accessdate=19 November 2014}}</ref>
    Sample B3 (revision timestamp 2015-03-29 08:18:34 UTC):
      <ref>{{cite journal | last1=Norton HL | first1=et al | title=Skin and Hair Pigmentation Variation in Island Melanesia. | journal=MedLine | date=June 2006 | volume=130 | issue=2 | page=254 | accessdate=4 December 2014 | doi=10.1002/ajpa.20343}}</ref>

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Vol. 9, Article No. 85, pp. 1-11, 2022. https://doi.org/10.1038/s41597-022-01190-z
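The Solomon Islands case can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the regular expression, the function name `first_revision_with_pmid`, and the abbreviated revision texts are assumptions made for this example.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: a PMID inside a PubMed URL or a |pmid= template parameter.
PMID_PATTERN = re.compile(
    r"(?:ncbi\.nlm\.nih\.gov/pubmed/|\|\s*pmid\s*=\s*)(\d+)", re.IGNORECASE
)


def first_revision_with_pmid(
    revisions: List[Tuple[str, str]], pmid: str
) -> Optional[str]:
    """Return the timestamp of the oldest revision whose text contains the PMID."""
    # Timestamps in "YYYY-MM-DD hh:mm:ss" form sort chronologically as strings.
    for timestamp, text in sorted(revisions):
        if pmid in PMID_PATTERN.findall(text):
            return timestamp
    return None


# Abbreviated texts of samples B3, B2, and B1, deliberately given out of order.
revisions = [
    ("2015-03-29 08:18:34", "<ref>{{cite journal | doi=10.1002/ajpa.20343}}</ref>"),
    ("2014-11-19 22:23:48",
     "<ref>{{cite web | url=http://www.ncbi.nlm.nih.gov/pubmed/16374866 "
     "| title=Skin and hair pigmentation variation in Island Melanesia.}}</ref>"),
    ("2014-11-19 19:36:09",
     "<ref>http://www.ncbi.nlm.nih.gov/pubmed/16374866</ref>"),
]

first = first_revision_with_pmid(revisions, "16374866")
print(first)  # 2014-11-19 19:36:09, the timestamp of B1
```

Matching on the DOI alone would report B3 here, which is why the identifier set needs to include the PMID.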
  10. Proposed method to detect the first appearances #1

    Figure 3A. Data creation workflows. (1) Building the basic dataset
    [Diagram labels: Wikipedia Dump Data, DOI, Article ID, Crossref Metadata, Paper title, ISSN, Other Identifiers, Research Field, Basic Dataset]

    1. We extracted DOI links referenced in main namespace articles, along with their article IDs and article titles, on English Wikipedia by using Wikipedia dump files.
    2. We obtained Crossref metadata for each DOI via the Crossref REST API.
    3. We obtained other identifiers corresponding to each DOI, such as PubMed identifiers (PMID and PMCID), by using Entrez Programming Utilities, etc.
    4. We stored article IDs, article titles, DOIs, and other identifiers; Crossref metadata; and research fields for each reference as the basic dataset.

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Vol. 9, Article No. 85, pp. 1-11, 2022. https://doi.org/10.1038/s41597-022-01190-z
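Steps 1 and 2 above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the slides do not say how DOI links were parsed from the dump files, so the regular expression and the helper names `extract_dois` and `crossref_works_url` are hypothetical. The URL follows the public Crossref REST API route `/works/{DOI}`, and no request is actually sent.

```python
import re
from typing import List
from urllib.parse import quote

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix that
# ends at whitespace or at a wikitext delimiter (| } < ]).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s|}<\]]+")


def extract_dois(wikitext: str) -> List[str]:
    """Return the unique DOIs in the text, lowercased, in order of appearance."""
    seen: List[str] = []
    for match in DOI_PATTERN.findall(wikitext):
        doi = match.rstrip(".,;").lower()  # drop trailing sentence punctuation
        if doi not in seen:
            seen.append(doi)
    return seen


def crossref_works_url(doi: str) -> str:
    """Build the Crossref REST API URL for one DOI (no request is sent here)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


sample = (
    "<ref>{{cite journal | title=Skin and Hair Pigmentation Variation in "
    "Island Melanesia. | doi=10.1002/ajpa.20343}}</ref> "
    "See also https://doi.org/10.1007/s10551-008-9757-5."
)
dois = extract_dois(sample)
print(dois)
print(crossref_works_url(dois[0]))
```

DOIs are case-insensitive, so lowercasing before deduplication avoids counting the same paper twice.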
  11. Proposed method to detect the first appearances #2

    Figure 3B. Data creation workflows. (2) Building the first appearance dataset
    [Diagram labels: Basic Dataset, Article ID, Identifiers, Paper title, Wikipedia Dump Data, Revisions (Revision 1, Revision 2, ..., Revision n), First appearance of the paper on the article, First Appearances Dataset]

    1. We extracted all revision histories corresponding to article IDs in the basic dataset, together with article texts, by using Wikipedia dump files.
    2. We extracted identifiers and paper titles from the basic dataset, and detected the candidates of the first appearance for each scholarly reference as follows:
      A) One or more identifiers are included in the article text.
      B) Either the full title of the paper or the first 5 words of the title is included in the article text.
      C) The similarity score based on the edit distance between the two paper titles, one from the basic dataset and one from the extracted citation on the article, is equal to or lower than the given threshold.
    3. We selected the oldest revision among the candidates as the first appearance.
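The candidate conditions and the final selection can be sketched as below. This is a rough Python illustration rather than the published implementation. Several details are assumptions: that a revision is a candidate when any one of A, B, or C holds, case-insensitive matching, the `title=` pattern used to pull a cited title out of a citation template, the normalization of the edit distance by the longer title length, and the threshold value 0.2.

```python
import re
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Assumed way to read a cited title from a citation template.
TITLE_PARAM = re.compile(r"\|\s*title\s*=\s*([^|}]+)")


def is_candidate(text: str, identifiers: List[str], title: str,
                 threshold: float = 0.2) -> bool:
    """Return True if condition A, B, or C holds for one revision text."""
    lowered = text.lower()
    # A) one or more identifiers appear in the article text
    if any(identifier.lower() in lowered for identifier in identifiers):
        return True
    # B) the full title, or its first 5 words, appears in the article text
    title_l = title.lower()
    first_five = " ".join(title_l.split()[:5])
    if title_l and (title_l in lowered or first_five in lowered):
        return True
    # C) normalized edit distance between the two titles is within the threshold
    for cited in TITLE_PARAM.findall(lowered):
        cited = cited.strip()
        longest = max(len(cited), len(title_l))
        if longest and edit_distance(cited, title_l) / longest <= threshold:
            return True
    return False


def first_appearance(revisions: List[Tuple[str, str]], identifiers: List[str],
                     title: str) -> Optional[Tuple[str, str]]:
    """Return the oldest (timestamp, text) pair among the candidates, or None."""
    candidates = [r for r in revisions if is_candidate(r[1], identifiers, title)]
    return min(candidates, key=lambda r: r[0]) if candidates else None


title = ("What do Corporations have to do with Fair Trade? Positive and "
         "Normative Analysis from a Value Chain Perspective")
revisions = [
    ("2016-06-26 09:49:39", "{{cite journal | doi=10.1007/s10551-008-9757-5}}"),
    ("2011-05-05 13:35:01",
     "* Reed, D. (2009). What do Corporations have to do with Fair Trade? "
     "Positive and normative analysis from a value chain perspective."),
    ("2011-05-04 00:00:00", "Hypothetical earlier revision with no reference."),
]
result = first_appearance(revisions, ["10.1007/s10551-008-9757-5"], title)
print(result[0])  # 2011-05-05 13:35:01, matched by the paper title
```

With the Fair trade example, the DOI alone would point to the 2016 revision, whereas the title condition recovers the 2011 revision.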
  12. Dataset of first appearances on English Wikipedia articles as of 1 October 2021

    • By using the proposed method, we built and published the dataset of first appearances of scholarly bibliographic references on English Wikipedia as of 1 October 2021. We identified the first appearances of 1,474,375 scholarly references (1,010,834 unique DOIs) in 313,240 unique articles.
    • We evaluated the precision for detecting the first appearance, which was 93.3% as a whole and exceeded 90% in 20 out of 22 ESI research fields.
    • Please play with this dataset :D

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: “Dataset of first appearances of the scholarly bibliographic references on English Wikipedia articles as of 1 March 2017 and as of 1 October 2021”. Zenodo (2021). https://doi.org/10.5281/zenodo.5595573
  13. Example of analysis of this dataset #1

    Figure 4. Monthly plot of the time-series transitions for the total number of references added on English Wikipedia articles.
    [Plot: x-axis, months from 2001 to 2021; y-axis, number of references added, 0 to 70,000; series: User, Bot, IP; spikes labeled A, B, and C]

    • The spikes seen at A, B, and C in Figure 4 are caused by the activities of particular editors.
    • A and B: ProteinBoxBot, a bot editor that automatically adds scholarly references related to molecular and cellular biology at a large scale. https://en.wikipedia.org/wiki/User:ProteinBoxBot
    • C: Yeast2Hybrid, a human editor who, according to his profile page, holds a PhD and is a bioinformatician in France. https://en.wikipedia.org/wiki/User:Yeast2Hybrid
  14. Example of analysis of this dataset #2-1

    Time lag between the creation date of each Wikipedia article and the date of adding the first scholarly reference to the corresponding article

    [Diagram: a timeline from the date a certain Wikipedia article was created to the date the first scholarly reference was added to this article; the interval between them is the time lag]

    Example:
      Wikipedia article: Spyware
      Date the Wikipedia article was created: 2001-11-22 16:37:56 UTC
      Date the first scholarly reference was added to this article: 2016-08-06 16:05:57 UTC
      Time lag: 5370.98 days (464,052,481 seconds ≒ 14.7 years)

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Time Lag Analysis of Adding Scholarly References to English Wikipedia: How Rapidly Are They Added to and How Fresh Are They?", Proceedings of the 18th International Conference, iConference 2023, Lecture Notes in Computer Science (LNCS), Vol. 13972, pp. 425-438, 2023. https://doi.org/10.1007/978-3-031-28032-0_33
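The Spyware figures can be reproduced with a few lines of Python. This is a small worked example, not code from the paper. The function name `time_lag_seconds` is made up for illustration, and the conversion to years assumes 365.25 days per year.

```python
from datetime import datetime, timezone


def time_lag_seconds(created: str, first_ref_added: str) -> int:
    """Seconds between two UTC timestamps given as 'YYYY-MM-DD hh:mm:ss'."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(created, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(first_ref_added, fmt).replace(tzinfo=timezone.utc)
    return int((end - start).total_seconds())


seconds = time_lag_seconds("2001-11-22 16:37:56", "2016-08-06 16:05:57")
days = seconds / 86400
years = days / 365.25  # assumed average year length

print(seconds, round(days, 2), round(years, 1))  # 464052481 5370.98 14.7
```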
  15. Example of analysis of this dataset #2-2

    Time lag between the creation date of each Wikipedia article and the date of adding the first scholarly reference to the corresponding article

    Figure 5. Distribution of the time lag between creating the Wikipedia articles and adding the first scholarly references for every 2 years.
    [Stacked bar chart: bars grouped by the years in which the Wikipedia articles were created; scale from 0% to 100%]

    Time lag groups:
      A. 0 days and at the same time
      B. 0 days but not at the same time
      C. less than 1 month
      D. equal to or more than 1 month but less than 6 months
      E. equal to or more than 6 months but less than 1 year
      F. equal to or more than 1 year but less than 3 years
      G. equal to or more than 3 years but less than 5 years
      H. equal to or more than 5 years

    • Regarding the group of “0 days and at the same time,” the percentage increased significantly from 2005–2006 to 2007–2008 (from 9.05% to 36.00%).

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Time Lag Analysis of Adding Scholarly References to English Wikipedia: How Rapidly Are They Added to and How Fresh Are They?", Proceedings of the 18th International Conference, iConference 2023, Lecture Notes in Computer Science (LNCS), Vol. 13972, pp. 425-438, 2023. https://doi.org/10.1007/978-3-031-28032-0_33
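The grouping into A to H can be sketched as a simple binning function. The slides do not define the boundaries precisely, so this Python sketch rests on labeled assumptions: "at the same time" is taken to mean a lag of exactly 0 seconds, "0 days" a lag below 86,400 seconds, a month 30 days, and a year 365 days. The name `time_lag_group` is hypothetical.

```python
DAY = 86400          # seconds per day
MONTH = 30 * DAY     # assumed month length
YEAR = 365 * DAY     # assumed year length

# (exclusive upper bound in seconds, group label), checked in ascending order
UPPER_BOUNDS = [
    (DAY, "B"),        # 0 days but not at the same time
    (MONTH, "C"),      # less than 1 month
    (6 * MONTH, "D"),  # 1 month or more, less than 6 months
    (YEAR, "E"),       # 6 months or more, less than 1 year
    (3 * YEAR, "F"),   # 1 year or more, less than 3 years
    (5 * YEAR, "G"),   # 3 years or more, less than 5 years
]


def time_lag_group(lag_seconds: int) -> str:
    """Map a time lag in seconds to one of the groups A to H."""
    if lag_seconds < 0:
        raise ValueError("the reference cannot precede the article creation")
    if lag_seconds == 0:
        return "A"  # 0 days and at the same time
    for upper, label in UPPER_BOUNDS:
        if lag_seconds < upper:
            return label
    return "H"  # 5 years or more


print(time_lag_group(464_052_481))  # H, the Spyware example (about 14.7 years)
```

Counting the labels per two-year window of article creation dates would then give the percentages plotted in the stacked bars.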
  16. Example of analysis of this dataset #2-3

    Time lag between the creation date of each Wikipedia article and the date of adding the first scholarly reference to the corresponding article

    [Stacked bar chart: the time-lag distribution with groups A (0 days and at the same time) to H (equal to or more than 5 years); scale from 0% to 100%]

    • In 2005, a hoax stating that a certain journalist had been a suspect in the assassinations of the president of the USA was added to a Wikipedia article, which became a social problem.
      – Wikipedia Seigenthaler biography incident - Wikipedia https://en.wikipedia.org/wiki/Wikipedia_Seigenthaler_biography_incident
    • In 2006, Jimmy Wales declared that the Wikipedia community should shift its focus from the quantity to the quality of its contents.
    • The increase observed here could be seen as a response to this movement.

    Reference: Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki: "Time Lag Analysis of Adding Scholarly References to English Wikipedia: How Rapidly Are They Added to and How Fresh Are They?", Proceedings of the 18th International Conference, iConference 2023, Lecture Notes in Computer Science (LNCS), Vol. 13972, pp. 425-438, 2023. https://doi.org/10.1007/978-3-031-28032-0_33
  17. Future directions of this project

    • Achievements
      ✓ Building the methodology to detect first appearances of scholarly bibliographic references on Wikipedia articles with high precision
      ✓ The dataset of English Wikipedia as of October 2021
      ✓ Time lag analysis of adding scholarly references to English Wikipedia
    • Future work
      – Classify each reference based on its role, such as evidence for a certain part of the article content, just mentioning the paper, or being listed in further readings.
      – Support adding more scholarly references by building a recommendation system that shows related scholarly articles to Wikipedia editors.
      – Support detecting and updating obsolete or problematic references, such as references to retracted papers.