Slide 1

Slide 1 text

Enhancing Identification of Scholarly Reference on YouTube: Method Development and Analysis of External Link Characteristics
TPDL2024 - Session 5: Scholarly Communication and Systematic Reviews
Jiro Kikkawa, Masao Takaku, Fuyuki Yoshikane
Institute of Library, Information and Media Science, University of Tsukuba, Japan

Slide 2

Slide 2 text

Background and Purpose #1
Scholarly communication through online videos has been expanding.
• Through this communication, scholarly articles and the knowledge they contain are available not only to academic researchers and domain experts but also to the general public.
• YouTube, the largest online video platform, is used to share a wide variety of videos, including content related to this kind of scholarly knowledge.

Slide 3

Slide 3 text

Background and Purpose #2
• In bibliometrics, a few studies have conducted bibliometric and quantitative analyses of scholarly references on YouTube.
  – The dataset used in these studies is provided by Altmetric.com (hereinafter, the “Altmetric dataset”).
  – However, the coverage of the Altmetric dataset is unclear because it captures scholarly references only from videos uploaded by pre-curated YouTube channels.
• We propose a method for identifying scholarly references and build a dataset using this method.
• Furthermore, we compare this dataset with the Altmetric dataset and analyze the external link characteristics.

Slide 4

Slide 4 text

Approach by Altmetric.com (Altmetric dataset)
[Diagram: target YouTube channels (e.g., Channel X) → Videos A, B, C → URIs in each description text (https://doi.org/..., https://pubmed.ncbi.nlm.nih.gov/..., https://www.sciencedirect.com/science/...) → DOIs → dataset]
• The first step in building the Altmetric dataset is to set a group of YouTube channels as targets.
• As of the end of 2023, over 40,000 YouTube channels had been curated; however, scholarly references in videos uploaded by channels outside this group are not included in the dataset.

Slide 5

Slide 5 text

Approach by Altmetric.com (Altmetric dataset)
• Next, the URIs in the description text of each video are extracted and matched to DOIs and other identifiers, including PubMed identifiers (i.e., PMIDs and PMCIDs), based on both the URI strings and the meta tags on the landing pages.
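The matching described on this slide can be illustrated with a minimal sketch. It is not Altmetric.com's implementation: the URL patterns and the `citation_doi` meta tag name are assumptions chosen for illustration, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed URL patterns for PubMed and PubMed Central links.
PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)")
PMCID_RE = re.compile(r"/(PMC\d+)", re.IGNORECASE)


def ids_from_uri(uri):
    """Return the identifiers that can be read from the URI string alone."""
    found = {}
    if (m := PMID_RE.search(uri)):
        found["pmid"] = m.group(1)
    if (m := PMCID_RE.search(uri)):
        found["pmcid"] = m.group(1).upper()
    return found


class DoiMetaParser(HTMLParser):
    """Collect the content of <meta name="citation_doi"> (assumed tag name)."""

    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "citation_doi":
            self.doi = attributes.get("content")


def doi_from_landing_page(html):
    """Return the DOI declared in a landing page's meta tags, or None."""
    parser = DoiMetaParser()
    parser.feed(html)
    return parser.doi
```

In this sketch the URI string is tried first, and the landing page is only needed when the URI itself carries no identifier.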

Slide 6

Slide 6 text

The proposed method (YA Domain Dataset)
[Diagram: target domain names → search by each domain name as a query on the YouTube Data API (e.g., “pubmed.ncbi.nlm.nih.gov”) → title and description text of the videos found (e.g., Videos A and C of Channel X) → URIs → DOIs → dataset]
• The first step is to set the target domain names, each of which corresponds to a scholarly article platform or to URIs that include specific identifiers. We targeted the following six domain names.
#  Domain name              Platform
1  ncbi.nlm.nih.gov         PubMed
2  doi.org                  DOI Link
3  sciencedirect.com        ScienceDirect
4  onlinelibrary.wiley.com  Wiley Online Library
5  link.springer.com        SpringerLink
6  ieeexplore.ieee.org      IEEE Xplore

Slide 7

Slide 7 text

The proposed method (YA Domain Dataset)
• The second step is to execute a search on the YouTube Data API using each domain name enclosed in double quotes as the query (e.g., “ncbi.nlm.nih.gov”).
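As a rough illustration of this search step, the sketch below builds request URLs for the `search.list` endpoint of the YouTube Data API v3. The slides do not state which request parameters were used, so `part`, `type`, and `maxResults` are assumptions, and `build_search_url` is a hypothetical helper. No network request is made.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

# The six target domain names listed on the slides.
TARGET_DOMAINS = [
    "ncbi.nlm.nih.gov",
    "doi.org",
    "sciencedirect.com",
    "onlinelibrary.wiley.com",
    "link.springer.com",
    "ieeexplore.ieee.org",
]


def build_search_url(domain, api_key, page_token=None):
    """Build a search.list URL whose query is the domain name in double quotes."""
    params = {
        "part": "snippet",
        "q": f'"{domain}"',   # quoted so the domain is searched as a phrase
        "type": "video",      # assumption: restrict results to videos
        "maxResults": 50,     # assumption: largest page size the API allows
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token  # follow-up pages of the same query
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# One request URL per target domain; "YOUR_API_KEY" is a placeholder.
urls = [build_search_url(d, "YOUR_API_KEY") for d in TARGET_DOMAINS]
```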

Slide 8

Slide 8 text

The proposed method (YA Domain Dataset)
• In the third step, the metadata for the videos found in the search results are retrieved, and URIs are extracted from the title and description text of each video.
• In the fourth step, we associate the URIs containing a target domain name with the corresponding DOIs based on the URI strings. This step is necessary because the Altmetric dataset does not contain raw URLs, so references can be compared with it only through DOIs.
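The extraction and DOI association can be sketched as follows. This is only an approximation under stated assumptions: the regular expressions are illustrative (the DOI pattern follows the commonly used `10.<registrant>/<suffix>` shape), and the function names are hypothetical. URIs that do not embed a DOI, such as PubMed links, would need a separate identifier lookup that is not shown here.

```python
import re
from urllib.parse import unquote

# Assumed patterns; real-world text needs more careful boundary handling.
URI_RE = re.compile(r"https?://[^\s<>\"']+")
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def extract_uris(text):
    """Find URIs in a title or description, trimming trailing punctuation."""
    return [uri.rstrip(".,;") for uri in URI_RE.findall(text)]


def uri_to_doi(uri):
    """Return the DOI embedded in the URI string (lowercased), or None."""
    match = DOI_RE.search(unquote(uri))
    return match.group(0).rstrip(".,;").lower() if match else None


description = (
    "Paper: https://doi.org/10.1000/xyz123. "
    "Also https://pubmed.ncbi.nlm.nih.gov/12345678/"
)
for uri in extract_uris(description):
    print(uri, "->", uri_to_doi(uri))
```

A known limitation of this sketch is that path segments following a DOI in a publisher URL (for example a trailing `/abstract`) would be captured as part of the DOI.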

Slide 9

Slide 9 text

Results and Discussion #1
Basic statistics of the datasets
#   Dataset    DOI type      Unique Channels  Unique Videos  Unique DOIs  Total References
A1  YA Domain  Crossref DOI  55,712           235,095        261,027      479,111
A2  YA Domain  Other DOIs    1,759            7,850          4,502        8,642
B1  Altmetric  Crossref DOI  42,154           152,229        256,315      470,731
B2  Altmetric  Other DOIs    3,484            10,532         9,526        13,742
Most references in both datasets were to scholarly articles assigned Crossref DOIs.
• The YA Domain dataset covers about 480,000 references among 230,000 videos posted by 55,000 channels.
• The Altmetric dataset covers about 470,000 references among 150,000 videos posted by 42,000 channels.
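The four columns of the table can be derived from a list of reference records, as in the sketch below. The record layout is an assumption (one record per DOI mention in a video), since the slides do not define how references are counted, and the names `Reference` and `basic_stats` are hypothetical.

```python
from typing import NamedTuple


class Reference(NamedTuple):
    """One scholarly reference: a DOI mentioned by a video (assumed layout)."""
    channel_id: str
    video_id: str
    doi: str


def basic_stats(refs):
    """Count unique channels, videos, and DOIs, plus the total references."""
    return {
        "unique_channels": len({r.channel_id for r in refs}),
        "unique_videos": len({r.video_id for r in refs}),
        "unique_dois": len({r.doi for r in refs}),
        "total_references": len(refs),
    }


# Toy data, not taken from the dataset.
sample = [
    Reference("ch1", "v1", "10.1000/a"),
    Reference("ch1", "v1", "10.1000/b"),
    Reference("ch1", "v2", "10.1000/a"),
    Reference("ch2", "v3", "10.1000/c"),
]
print(basic_stats(sample))
```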

Slide 10

Slide 10 text

Results and Discussion #2-1
Overlaps between the datasets
The difference set A1 - B1 shows that 64% of channels and 72% of videos in the YA Domain dataset are not covered by the Altmetric dataset.
            A1                 B1                 A1 - B1           A1 * B1                    B1 - A1
Channels    55,712 (100.00%)   42,154 (100.00%)   35,820 (64.29%)   19,892 (35.71% / 47.19%)   22,262 (52.81%)
Videos      235,095 (100.00%)  152,229 (100.00%)  170,292 (72.44%)  64,803 (27.56% / 42.57%)   87,426 (57.43%)
DOIs        261,027 (100.00%)  256,315 (100.00%)  130,313 (49.92%)  130,714 (50.08% / 51.00%)  125,601 (49.00%)
References  479,111 (100.00%)  470,731 (100.00%)  272,798 (56.94%)  206,313 (43.06% / 43.83%)  264,418 (56.17%)
Note. A1: Crossref DOIs in the YA Domain dataset; B1: Crossref DOIs in the Altmetric dataset. A1 - B1 and B1 - A1 are difference sets, and A1 * B1 is the intersection. Percentages for A1 - B1 are relative to A1, those for B1 - A1 are relative to B1, and the two percentages for A1 * B1 are relative to A1 and B1, respectively.
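The overlap columns correspond to ordinary set operations. The sketch below uses toy identifiers rather than the actual data, and `overlap` is a hypothetical helper; it reports each part as a share of the dataset it is drawn from, which matches the percentages in the table.

```python
def overlap(a, b):
    """Compare two sets of identifiers (e.g., video IDs) from datasets A and B."""
    only_a, both, only_b = a - b, a & b, b - a

    def pct(part, whole):
        return round(100 * len(part) / len(whole), 2)

    return {
        "A - B": (len(only_a), pct(only_a, a)),            # share of A
        "A * B": (len(both), pct(both, a), pct(both, b)),  # share of A, then B
        "B - A": (len(only_b), pct(only_b, b)),            # share of B
    }


# Toy identifiers, not taken from the dataset.
ya_domain = {"v1", "v2", "v3", "v4"}
altmetric = {"v3", "v4", "v5"}
print(overlap(ya_domain, altmetric))
```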

Slide 11

Slide 11 text

Results and Discussion #2-2
Overlaps between the datasets
• The difference set B1 - A1 shows that approximately one-half of the channels, videos, DOIs, and references covered by the Altmetric dataset are not included in the YA Domain dataset.
• Combining the YA Domain dataset with the Altmetric dataset raises the number of references to about 158% of the Altmetric dataset alone (470,731 + 272,798 = 743,529 references), an increase of more than half.
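As a quick check of the combined size, the counts from the overlap table can be added directly: the union of the two datasets is B1 plus the references found only in A1. The variable names below are illustrative.

```python
# Reference counts taken from the overlap table.
b1_total = 470_731   # references in the Altmetric dataset (B1)
a1_only = 272_798    # references only in the YA Domain dataset (A1 - B1)

combined = b1_total + a1_only        # size of the union of A1 and B1
ratio = combined / b1_total          # combined size relative to B1 alone
increase_pct = 100 * a1_only / b1_total

print(f"combined references: {combined:,}")
print(f"relative to Altmetric alone: {ratio:.2f}x (+{increase_pct:.1f}%)")
```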

Slide 12

Slide 12 text

Results and Discussion #3
Number of channels, videos, DOIs, and references by platform
Platform              Unique Channels   Unique Videos      Unique DOIs        Total References
PubMed                18,760 (33.67%)   83,269 (35.42%)    111,948 (42.89%)   232,380 (48.50%)
DOI Link              12,856 (23.08%)   72,739 (30.94%)    83,742 (32.08%)    113,730 (23.74%)
ScienceDirect         14,640 (26.28%)   34,328 (14.60%)    30,259 (11.59%)    43,930 (9.17%)
Wiley Online Library  12,840 (23.05%)   30,252 (12.87%)    23,598 (9.04%)     34,750 (7.25%)
SpringerLink          11,023 (19.79%)   25,312 (10.77%)    17,403 (6.67%)     31,014 (6.47%)
IEEE Xplore           4,240 (7.61%)     9,416 (4.01%)      7,313 (2.80%)      24,282 (5.07%)
Overall               55,712 (100.00%)  235,095 (100.00%)  261,027 (100.00%)  479,111 (100.00%)
• The PubMed row shows that 18,760 channels, 83,269 videos, 111,948 DOIs, and 232,380 references correspond to Crossref DOIs associated with external links containing the domain name “ncbi.nlm.nih.gov”.
• PubMed has the highest percentage of channels, videos, DOIs, and references (33.67%, 35.42%, 42.89%, and 48.50%, respectively).
• DOI links are the second largest in terms of total references.
• Tens of thousands of direct external links exist for each of the other platforms.
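A per-platform breakdown like the one above relies on assigning each external link to a platform by its domain name. The sketch below shows one way to do this; matching on the parsed host name (including subdomains) is an assumption, as the slides only say that links "contain" the domain name, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Domain name to platform, as listed on the slides.
PLATFORMS = {
    "ncbi.nlm.nih.gov": "PubMed",
    "doi.org": "DOI Link",
    "sciencedirect.com": "ScienceDirect",
    "onlinelibrary.wiley.com": "Wiley Online Library",
    "link.springer.com": "SpringerLink",
    "ieeexplore.ieee.org": "IEEE Xplore",
}


def platform_of(uri):
    """Return the platform whose domain matches the URI's host, or None."""
    host = (urlparse(uri).hostname or "").lower()
    for domain, name in PLATFORMS.items():
        if host == domain or host.endswith("." + domain):
            return name
    return None


def references_by_platform(uris):
    """Count links per platform, ignoring links outside the six domains."""
    return Counter(p for uri in uris if (p := platform_of(uri)))


# Toy links, not taken from the dataset.
links = [
    "https://pubmed.ncbi.nlm.nih.gov/12345678/",
    "https://doi.org/10.1000/xyz123",
    "https://www.sciencedirect.com/science/article/pii/X",
    "https://example.org/about",
]
print(references_by_platform(links))
```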

Slide 13

Slide 13 text

Conclusion
1. We proposed a method for identifying scholarly references on YouTube using the domain names of scholarly publishing platforms or related services as queries.
2. Targeting six domain names, we identified 480,000 references corresponding to 260,000 unique Crossref DOIs among 230,000 videos posted on 55,000 channels.
3. We revealed that over half of these references were not covered by the Altmetric dataset.
4. Regarding external links, PubMed and DOI links were prominent; however, a substantial number of direct links to publisher platforms were also observed.
• The dataset from this study is available at https://doi.org/10.5281/zenodo.12801387