Jiro Kikkawa
September 19, 2024

Enhancing Identification of Scholarly Reference on YouTube: Method Development and Analysis of External Link Characteristics / TPDL2024

Our presentation slides for TPDL2024 (https://tpdl2024.nuk.si/), Session 5: Scholarly Communication and Systematic Reviews, on 27 September 2024.
Authors: Jiro Kikkawa, Masao Takaku, and Fuyuki Yoshikane
Paper: https://doi.org/10.1007/978-3-031-72437-4_19
Conference program: https://tpdl2024.nuk.si/program.html
Abstract: Scholarly communication through YouTube videos has been increasing. Although Altmetric provides the dataset on such references, its coverage is unclear, and it does not contain the original external links in each video. Further investigation is needed to understand the characteristics of scholarly references as external links in YouTube videos. To address this gap, we propose a method to identify scholarly references by searching for domain names and building a dataset by applying this method. Subsequently, we compare this dataset with the Altmetric dataset and analyze the external link characteristics. Using the proposed method and targeting six types of domain names, we identified approximately 480,000 references among 230,000 videos posted on 55,000 channels. Notably, over half of these references were not covered by the Altmetric dataset, resulting in a 150% increase in the number of references when combining the dataset constructed by the proposed method with the Altmetric dataset, compared to the Altmetric dataset alone. Regarding external links, PubMed and DOI links were prominent; however, a substantial number of direct links to publisher platforms were observed. Most channels and videos contained external links to a single platform, scattered across each platform. The method proposed in this study is helpful for identifying and analyzing scholarly references on YouTube. In addition, the findings on external link characteristics raise concerns about the long-term accessibility and fact-checking of information sources for YouTube video content.

Transcript

  1. Enhancing Identification of Scholarly Reference on YouTube: Method Development and Analysis of External Link Characteristics

    TPDL2024 - Session 5: Scholarly Communication and Systematic Reviews
    Jiro Kikkawa, Masao Takaku, Fuyuki Yoshikane
    Institute of Library, Information and Media Science, University of Tsukuba, Japan
  2. Background and Purpose #1

    Scholarly communication through online videos has been expanding
    • Through this communication, scholarly articles and the knowledge they contain become usable not only by academic researchers and domain experts but also by the general public.
    • YouTube, the largest online video platform, is used to share a wide variety of videos, including content related to this kind of scholarly knowledge.
  3. Background and Purpose #2

    • In bibliometrics, a few studies have conducted quantitative analyses of scholarly references on YouTube
      – The dataset used in these studies is provided by Altmetric.com (hereinafter, “Altmetric dataset”)
      – However, the coverage of the Altmetric dataset is unclear because it captures scholarly references only from videos uploaded by pre-curated YouTube channels
    • We propose a method for identifying scholarly references and build a dataset using this method
    • Furthermore, we compare this dataset with the Altmetric dataset and analyze the external link characteristics
  4. Approach by Altmetric.com (Altmetric dataset)

    [Figure: target YouTube channels (e.g., YouTube Channel X) → Videos A, B, C → links in each video's description text (https://doi.org/..., https://pubmed.ncbi.nlm.nih.gov/..., https://www.sciencedirect.com/science/...) → DOIs → dataset]
    • In the approach used to build the Altmetric dataset, the first step is to set a group of YouTube channels as targets.
    • As of the end of 2023, over 40,000 YouTube channels had been curated; however, scholarly references in videos uploaded by channels outside this group are not included in the dataset.
  5. Approach by Altmetric.com (Altmetric dataset)

    • Next, the URIs in the description text of each video are extracted and matched to DOIs and other identifiers, including PubMed identifiers (i.e., PMIDs and PMCIDs), based on both the URI strings and the meta tags on the landing pages
    [Figure: target YouTube channels → videos → links in each video's description text → DOIs → dataset]
  6. The proposed method (YA Domain Dataset)

    [Figure: target domain names → search by the query of each domain name on the YouTube Data API (e.g., “pubmed.ncbi.nlm.nih.gov”) → title and description text data of the matching videos (e.g., Video A and Video C of YouTube Channel X, containing https://doi.org/..., https://pubmed.ncbi.nlm.nih.gov/..., https://www.sciencedirect.com/science/...) → DOIs → dataset]
    • The first step is to set the target domain names, each of which matches a scholarly article platform or URIs that include specific identifiers. We targeted the following six domain names.

    #  Domain name              Platform
    1  ncbi.nlm.nih.gov         PubMed
    2  doi.org                  DOI Link
    3  sciencedirect.com        ScienceDirect
    4  onlinelibrary.wiley.com  Wiley Online Library
    5  link.springer.com        SpringerLink
    6  ieeexplore.ieee.org      IEEE Xplore
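
The six target domain names can be held in a small lookup table that maps a URI to its platform. The following is a minimal sketch, not the authors' implementation: the function name `platform_for` and the rule that subdomains also count as matches (so that pubmed.ncbi.nlm.nih.gov matches ncbi.nlm.nih.gov) are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Domain names and platform labels as listed on the slide.
TARGET_DOMAINS = {
    "ncbi.nlm.nih.gov": "PubMed",
    "doi.org": "DOI Link",
    "sciencedirect.com": "ScienceDirect",
    "onlinelibrary.wiley.com": "Wiley Online Library",
    "link.springer.com": "SpringerLink",
    "ieeexplore.ieee.org": "IEEE Xplore",
}


def platform_for(uri):
    """Return the platform label for a URI, or None if no target domain matches.

    Assumption: a host matches when it equals a target domain or is a
    subdomain of it (e.g. pubmed.ncbi.nlm.nih.gov for ncbi.nlm.nih.gov).
    """
    host = (urlparse(uri).hostname or "").lower()
    for domain, platform in TARGET_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

Matching on the parsed host rather than on the raw string avoids counting a URI that merely mentions a target domain somewhere in its path.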
  7. The proposed method (YA Domain Dataset)

    [Figure: target domain names → search by the query of each domain name on the YouTube Data API (e.g., “pubmed.ncbi.nlm.nih.gov”) → title and description text data of the matching videos → DOIs → dataset]
    • The second step is to execute a search using each domain name enclosed in double quotes as the query (e.g., “ncbi.nlm.nih.gov”)
    • Target domain names: ncbi.nlm.nih.gov (PubMed), doi.org (DOI Link), sciencedirect.com (ScienceDirect), onlinelibrary.wiley.com (Wiley Online Library), link.springer.com (SpringerLink), ieeexplore.ieee.org (IEEE Xplore)
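
A quoted-domain query of this kind could be sent to the `search.list` endpoint of the YouTube Data API v3. The sketch below only builds the request URL and does not call the network. The slides state just that the domain name is enclosed in double quotes and searched through the YouTube Data API, so the other parameter choices (`part=snippet`, `type=video`, `maxResults=50`), the function name `build_search_url`, and the placeholder API key are assumptions.

```python
from urllib.parse import urlencode

# search.list endpoint of the YouTube Data API v3.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(domain, api_key, page_token=None):
    """Build a search.list request URL for one quoted domain name.

    Assumed parameter choices (not given in the slides): snippet metadata,
    video results only, and the maximum page size of 50.
    """
    params = {
        "part": "snippet",
        "q": f'"{domain}"',  # domain name enclosed in double quotes
        "type": "video",
        "maxResults": 50,
        "key": api_key,
    }
    if page_token:
        # Token returned by the previous page of results, if any.
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a real credential.
url = build_search_url("ncbi.nlm.nih.gov", "YOUR_API_KEY")
```

Each of the six domain names would be passed through the same function, following `pageToken` values until the results are exhausted.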
  8. The proposed method (YA Domain Dataset)

    [Figure: target domain names → search by the query of each domain name on the YouTube Data API (e.g., “pubmed.ncbi.nlm.nih.gov”) → title and description text data of the matching videos → DOIs → dataset]
    • Third, the metadata for the videos found in the search results are retrieved. Additionally, URIs are extracted from the title and description text of each video
    • In the fourth step, we associate the URIs containing a target domain name with the corresponding DOIs based on the URI strings. This step is necessary for comparing references with the Altmetric dataset by DOI, because that dataset does not contain raw URLs
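
The third and fourth steps can be illustrated with two small helpers: one that pulls URIs out of a title or description, and one that reads a DOI directly from a URI string. This is a sketch under stated assumptions, not the authors' pipeline: the regular expressions and the names `extract_uris` and `doi_from_uri` are hypothetical, and URIs that carry no DOI in their string (for example, a PubMed URI with only a PMID) would need a separate identifier lookup that is not shown here.

```python
import re
from urllib.parse import unquote

# Assumed patterns: an http(s) URI up to the next whitespace or quote,
# and a DOI of the common form "10.<registrant>/<suffix>".
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+")
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_uris(text):
    """Return the URIs found in a title or description, in order."""
    # Strip trailing punctuation that commonly follows a link in prose.
    return [uri.rstrip(".,;:)]") for uri in URI_PATTERN.findall(text)]


def doi_from_uri(uri):
    """Return the DOI embedded in a URI string, or None if there is none."""
    match = DOI_PATTERN.search(unquote(uri))
    if match is None:
        return None
    # Drop any query string or fragment that follows the DOI.
    doi = re.split(r"[?#]", match.group(0), maxsplit=1)[0]
    # DOIs are case-insensitive, so normalise for comparison.
    return doi.lower()


description = (
    "Paper: https://doi.org/10.1007/978-3-031-72437-4_19. "
    "See also https://pubmed.ncbi.nlm.nih.gov/12345/"
)
uris = extract_uris(description)
dois = [doi_from_uri(uri) for uri in uris]
```

Lower-casing the DOI makes the later comparison between datasets independent of how a link was typed in the description.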
  9. Results and Discussion #1

    Basic statistics of the datasets

    #   Dataset    DOI type      Unique Channels  Unique Videos  Unique DOIs  Total References
    A1  YA Domain  Crossref DOI           55,712        235,095      261,027           479,111
    A2  YA Domain  Other DOIs              1,759          7,850        4,502             8,642
    B1  Altmetric  Crossref DOI           42,154        152,229      256,315           470,731
    B2  Altmetric  Other DOIs              3,484         10,532        9,526            13,742

    Most references in both datasets were scholarly articles assigned Crossref DOIs
    • The YA Domain dataset covers 480,000 references among 230,000 videos posted by 55,000 channels
    • The Altmetric dataset covers 470,000 references among 150,000 videos posted by 42,000 channels
  10. Results and Discussion #2-1

    Overlaps between the datasets
    • The difference set A1 - B1 shows that 64% of channels and 72% of videos in the YA Domain dataset are not covered by the Altmetric dataset.

                A1        B1        A1 - B1   A1 * B1          B1 - A1
    Channels    55,712    42,154    35,820    19,892           22,262
                100.00%   100.00%   64.29%    35.71% / 47.19%  52.81%
    Videos      235,095   152,229   170,292   64,803           87,426
                100.00%   100.00%   72.44%    27.56% / 42.57%  57.43%
    DOIs        261,027   256,315   130,313   130,714          125,601
                100.00%   100.00%   49.92%    50.08% / 51.00%  49.00%
    References  479,111   470,731   272,798   206,313          264,418
                100.00%   100.00%   56.94%    43.06% / 43.83%  56.17%

    Note. A1: Crossref DOIs in the YA Domain dataset; B1: Crossref DOIs in the Altmetric dataset. A1 * B1 is the overlap of the two. Percentages under A1 - B1 are relative to A1 and those under B1 - A1 are relative to B1; the two values under A1 * B1 are relative to A1 and B1, respectively.
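
The difference and overlap columns correspond to plain set operations over identifiers (channel IDs, video IDs, DOIs, or references). A minimal sketch follows; the function name `overlap_stats` and the toy channel IDs are hypothetical and are not taken from either dataset.

```python
def overlap_stats(a_items, b_items):
    """Compare two collections of identifiers, A and B.

    Returns the sizes of A - B, the overlap, and B - A, plus the same
    percentages the table reports (relative to A or to B).
    """
    a, b = set(a_items), set(b_items)
    only_a, both, only_b = a - b, a & b, b - a
    return {
        "A - B": len(only_a),
        "A * B": len(both),
        "B - A": len(only_b),
        "A - B (% of A)": round(100 * len(only_a) / len(a), 2),
        "A * B (% of A)": round(100 * len(both) / len(a), 2),
        "A * B (% of B)": round(100 * len(both) / len(b), 2),
        "B - A (% of B)": round(100 * len(only_b) / len(b), 2),
    }


# Toy, made-up channel IDs purely for illustration.
stats = overlap_stats({"c1", "c2", "c3"}, {"c3", "c4"})
```

Applied to the real identifier lists, the same function would yield one row of the table per entity type.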
  11. Results and Discussion #2-2

    Overlaps between the datasets
    • The difference set B1 - A1 shows that approximately one-half of the channels, videos, DOIs, and references covered by the Altmetric dataset are not included in the YA Domain dataset.
    • The number of references increased by 150% when combining the YA Domain dataset with the Altmetric dataset, compared to using the Altmetric dataset alone.
    [Table: same overlap table as on slide 10]
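
The size of the combined dataset can be recomputed from the reference counts in the overlap table. The sketch below is only an arithmetic cross-check on those counts; the variable and function names are hypothetical, and how the slide defines or rounds its 150% figure is not stated, so no such definition is assumed here.

```python
# Reference counts from the "References" row of the overlap table.
a1_total = 479_111   # YA Domain dataset (A1)
b1_total = 470_731   # Altmetric dataset (B1)
a1_only = 272_798    # A1 - B1
both = 206_313       # A1 * B1 (overlap)
b1_only = 264_418    # B1 - A1


def combined_references(only_a, shared, only_b):
    """Size of the union: every reference counted exactly once."""
    return only_a + shared + only_b


combined = combined_references(a1_only, both, b1_only)

# Combined size relative to the Altmetric dataset alone.
ratio_to_altmetric = combined / b1_total
```

The two consistency checks `a1_only + both == a1_total` and `both + b1_only == b1_total` hold for the table's counts, which confirms that the three columns partition the union.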
  12. Results and Discussion #3

    Number of channels, videos, DOIs, and references by platform

    Platform              Unique Channels   Unique Videos      Unique DOIs        Total References
    PubMed                18,760 (33.67%)   83,269 (35.42%)    111,948 (42.89%)   232,380 (48.50%)
    DOI Link              12,856 (23.08%)   72,739 (30.94%)    83,742 (32.08%)    113,730 (23.74%)
    ScienceDirect         14,640 (26.28%)   34,328 (14.60%)    30,259 (11.59%)    43,930 (9.17%)
    Wiley Online Library  12,840 (23.05%)   30,252 (12.87%)    23,598 (9.04%)     34,750 (7.25%)
    SpringerLink          11,023 (19.79%)   25,312 (10.77%)    17,403 (6.67%)     31,014 (6.47%)
    IEEE Xplore           4,240 (7.61%)     9,416 (4.01%)      7,313 (2.80%)      24,282 (5.07%)
    Overall               55,712 (100.00%)  235,095 (100.00%)  261,027 (100.00%)  479,111 (100.00%)

    • For example, the PubMed row shows that 18,760 channels, 83,269 videos, 111,948 DOIs, and 232,380 references correspond to Crossref DOIs associated with external links containing the domain name “ncbi.nlm.nih.gov”
    • PubMed has the highest percentage of channels, videos, DOIs, and references (33.67%, 35.42%, 42.89%, and 48.50%, respectively).
    • DOI links are the second largest by total references
    • Tens of thousands of direct external links exist for each of the other platforms
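
The percentages in the platform table are each platform's count divided by the overall count. The sketch below recomputes the reference shares and the ranking from the table's figures; the names `share` and `ranking` are hypothetical. It also shows that the per-platform channel counts add up to more than the overall number of channels, which is consistent with some channels linking to more than one platform.

```python
# Total references and unique channels per platform, from the table.
references = {
    "PubMed": 232_380,
    "DOI Link": 113_730,
    "ScienceDirect": 43_930,
    "Wiley Online Library": 34_750,
    "SpringerLink": 31_014,
    "IEEE Xplore": 24_282,
}
channels = {
    "PubMed": 18_760,
    "DOI Link": 12_856,
    "ScienceDirect": 14_640,
    "Wiley Online Library": 12_840,
    "SpringerLink": 11_023,
    "IEEE Xplore": 4_240,
}
overall_references = 479_111
overall_channels = 55_712


def share(count, overall):
    """Percentage of the overall count, rounded to two decimals."""
    return round(100 * count / overall, 2)


reference_shares = {p: share(n, overall_references) for p, n in references.items()}

# Platforms ordered by total references, largest first.
ranking = sorted(references, key=references.get, reverse=True)

# Per-platform channel counts overlap, so their sum exceeds the overall count.
channel_sum = sum(channels.values())
```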
  13. Conclusion

    1. We proposed a method for identifying scholarly references on YouTube using the domain names of scholarly publishing platforms or related services as queries
    2. Targeting six types of domain names, we identified 480,000 references corresponding to 260,000 unique Crossref DOIs among 230,000 videos posted on 55,000 channels
    3. We revealed that over half of these references were not covered by the Altmetric dataset.
    4. Regarding external links, PubMed and DOI links were prominent; however, a substantial number of direct links to publisher platforms were observed
    • The dataset from this study is available at https://doi.org/10.5281/zenodo.12801387