innovation For example…
  ⚫ plays a crucial role in bridging theory and practice to achieve a more direct societal impact
    ► Industry gains access to the latest academic knowledge
    ► Academia learns about real-world challenges
[Diagram: Academia and Industry]
◼ Team science is important
  ⚫ Because, in the modern world, science is too complex
Background
the field of AI is needed
  ⚫ for policy makers
  ⚫ for stakeholders to make their decisions
  Ex.) How much their country or company should spend on AI research
◼ In our research, we conduct a bibliometric analysis in the field of AI, focusing on:
  ⚫ Prominent conference papers
    ► Conferences are considered important in the field of AI
  ⚫ Papers co-authored by researchers from academia and industry
specific conferences
  ⚫ These two conferences are the most prominent conferences
◼ Identification of IAC papers by affiliations extracted from PDFs
  Why do we need to extract from PDFs?
  ⚫ Conference papers have missing information in Scopus or Web of Science
[Figure labels: Journals, Conference paper]
year by year?
2. Which research institutions are most active in IAC?
3. Does academia or industry more frequently lead IAC?
4. Are there content differences between IAC papers and other papers?
of papers from AAAI and IJCAI by scraping the official websites

Conference   Years       Total Papers
AAAI         2010–2023   12,517
IJCAI        2010–2023    8,032
Total                    20,549
Method
tool for extracting information from paper PDFs
GROBID XML output (paper information):
◼ Title
◼ Authors
◼ Affiliation
◼ Abstract
◼ etc.
Pipeline steps: Extraction, Normalization, Classification
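The extraction step can be illustrated with a minimal sketch. It assumes GROBID's TEI XML output and the standard TEI namespace. The sample document is a hand-written, hypothetical fragment (not real GROBID output), and `extract_paper_info` is an illustrative helper name, not part of GROBID.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author>
          <persName><forename type="first">Ada</forename><surname>Example</surname></persName>
          <affiliation><orgName type="institution">UC Berkeley</orgName></affiliation>
        </author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>A short abstract.</p></abstract></profileDesc>
  </teiHeader>
</TEI>"""


def extract_paper_info(tei_xml: str) -> dict:
    """Pull title, authors with affiliations, and abstract out of a TEI string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=NS)
    authors = []
    for author in root.iterfind(".//tei:analytic/tei:author", NS):
        # Join forename and surname elements in document order.
        name_parts = [el.text for el in author.iterfind("tei:persName/*", NS) if el.text]
        orgs = [el.text for el in author.iterfind("tei:affiliation/tei:orgName", NS) if el.text]
        authors.append({"name": " ".join(name_parts), "affiliations": orgs})
    abstract = " ".join(
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:abstract//tei:p", NS)
    )
    return {"title": title, "authors": authors, "abstract": abstract}
```

Calling `extract_paper_info(SAMPLE_TEI)` returns a dictionary with the title, one author entry, and the abstract text.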
Electrical Engineering and Computer Science Department, University of California, Berkeley
⚫ EECS, University of California, Berkely, California, US
⚫ UC Berkeley
Normalized to: University of California, Berkeley
◼ We normalized affiliation strings
  ⚫ Because the extracted affiliation strings appear in multiple notations
affiliation text → Named Entity Recognition → Match with ROR → Normalized institution name
Example: “University of California, Berkely. Berkeley, California, US” is split into
  ⚫ University of California, Berkely
  ⚫ Berkeley
  ⚫ California
  ⚫ US
What is ROR?
  ⚫ Research institutes database (Research Organization Registry)
  ⚫ Some institutions are not in the database...
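The registry-matching idea can be sketched as follows. The slides use named entity recognition and ROR matching (via S2AFF, per a later slide). This sketch swaps in plain string similarity from Python's `difflib` as a stand-in. `REGISTRY`, `match_institution`, `normalize`, and the cutoff value are hypothetical choices for illustration only.

```python
import difflib

# Hypothetical stand-in for the ROR registry (the real registry is far larger).
REGISTRY = [
    "University of California, Berkeley",
    "University of California, Los Angeles",
    "Carnegie Mellon University",
]


def match_institution(candidate, registry=REGISTRY, cutoff=0.9):
    """Return the closest registry name, or None if nothing is similar enough."""
    lowered = {name.lower(): name for name in registry}
    hits = difflib.get_close_matches(candidate.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def normalize(entities):
    """Return the first registry match among the extracted entities, if any."""
    for entity in entities:
        match = match_institution(entity)
        if match is not None:
            return match
    return None  # not in the registry: handled by the fallback classifiers


entities = ["University of California, Berkely", "Berkeley", "California", "US"]
print(normalize(entities))
```

A high cutoff keeps short fragments such as "US" or "California" from matching an institution name by accident, while still tolerating a misspelling like "Berkely".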
found in ROR, which could be normalized by S2AFF
  ⚫ Classified by “Organization type” in ROR
Organization type
  There are eight different types: Education, Funder, Archive, Company, and so on.
three methods agree, then use this unanimous classification
◼ For institutions that were not included in ROR, we used three automatic classification methods
  ⚫ Classification by labeled list and keywords
  ⚫ Classification by the domain names of URLs
  ⚫ Classification by LightGBM with word2vec features
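The agreement rule can be sketched in a few lines. This is a minimal sketch. The function name and the label strings "academia" and "industry" are illustrative assumptions, and a `None` result stands for the case that goes to manual annotation.

```python
from typing import Optional, Sequence


def unanimous_label(labels: Sequence[str]) -> Optional[str]:
    """Return the shared label if every classifier agrees, else None."""
    if labels and all(label == labels[0] for label in labels):
        return labels[0]
    return None  # disagreement: route the string to manual annotation


# One label per method: list/keyword, URL domain, LightGBM.
print(unanimous_label(["academia", "academia", "academia"]))
print(unanimous_label(["academia", "industry", "academia"]))
```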
Labeled list = University name list + Company name list [Abuwala et al. 23]
◼ Institutions not on the list are classified as academia if they contain any of the following keywords
  ⚫ Universi
  ⚫ Academ
  ⚫ School
  ⚫ Polytech
  ⚫ Department
  ⚫ Univ.
  ⚫ Dept.
  The keywords are chosen to accommodate a variety of European languages
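The list-and-keyword rule can be sketched as below. The keywords come from the slide. Case-insensitive matching, the two name sets passed in, and the `fallback` label for strings that match nothing are assumptions, since the slide does not state them.

```python
# Keywords from the slide, lowercased for case-insensitive matching (an assumption).
ACADEMIA_KEYWORDS = ("universi", "academ", "school", "polytech", "department", "univ.", "dept.")


def classify_by_list_and_keywords(name, universities, companies, fallback="industry"):
    """Classify an institution string by labeled lists first, then by keywords.

    `universities` and `companies` are hypothetical sets of lowercase names.
    `fallback` is an assumed label for strings that match neither list nor keyword.
    """
    key = name.strip().lower()
    if key in universities:
        return "academia"
    if key in companies:
        return "industry"
    if any(keyword in key for keyword in ACADEMIA_KEYWORDS):
        return "academia"
    return fallback


print(classify_by_list_and_keywords("Universität Hamburg", set(), set()))
```

Truncated stems such as "universi" match "University", "Universität", and "Universidad" alike.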
Classification by URLs of the top 3 search results
If a URL contains one of the following domain names: Academia; otherwise: Industry
  ⚫ “.edu”
  ⚫ “.ac”
  ⚫ “.gov”
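The domain rule can be sketched as follows, with the search step left out. Two assumptions are made that the slide does not spell out: a string counts as academia if any one of its top 3 URLs matches, and a match means that "edu", "ac", or "gov" appears as a whole dot-separated label of the host name (so a path such as "/academy" does not count).

```python
from urllib.parse import urlparse

# Domain labels from the slide (".edu", ".ac", ".gov"), without the leading dot.
ACADEMIC_LABELS = {"edu", "ac", "gov"}


def is_academic_url(url):
    """True if the host name contains one of the academic domain labels."""
    host = urlparse(url).hostname or ""
    return any(label in ACADEMIC_LABELS for label in host.split("."))


def classify_by_urls(top_urls):
    """Classify from search-result URLs (assumed rule: any match among the top 3)."""
    if any(is_academic_url(url) for url in top_urls[:3]):
        return "academia"
    return "industry"


print(classify_by_urls(["https://www.u-tokyo.ac.jp/en/"]))
```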
Dataset created from classification by ROR’s Organization type
  ⚫ Academic institution strings: 31,785
  ⚫ Industrial institution strings: 1,204
  ⚫ Train : validation : test = 8:1:1
◼ Feature of Research Institution String
  ⚫ word2vec (corpus: English Wikipedia)
◼ Classifier
  ⚫ LightGBM
◼ Result in ROC-AUC
  ⚫ Score: 0.9494 (suggests this is a strong classification model)
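The 8:1:1 split can be sketched with the standard library alone. Splitting each class separately (stratification) is an assumption made here because the two classes are very unequal in size. The slide does not say whether the split was stratified, and the function names and seed are illustrative.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle and split a list into train/validation/test at roughly 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def stratified_split(strings_by_label, seed=0):
    """Split each class separately so every subset keeps the class ratio (assumption)."""
    train, val, test = [], [], []
    for label, strings in strings_by_label.items():
        tr, va, te = split_8_1_1(strings, seed)
        train += [(s, label) for s in tr]
        val += [(s, label) for s in va]
        test += [(s, label) for s in te]
    return train, val, test


# Toy data standing in for the institution strings.
toy = {
    "academia": [f"univ-{i}" for i in range(100)],
    "industry": [f"corp-{i}" for i in range(10)],
}
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```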
annotators
◼ 1,842 institution strings were annotated
◼ Procedure
  ⚫ Visually inspect the affiliation in the original PDFs
  ⚫ Use web search engines (like Google or Bing) to identify them
◼ Manual classification of affiliation strings for which the three classifications do not agree
Details
“Carnegie Mellon University” is from the US

Rank  Institution                                      Papers
1     Zhejiang University                              321
2     Tsinghua University                              303
3     Peking University                                265
4     University of Science and Technology of China    234
5     University of Chinese Academy of Sciences        207
6     Shanghai Jiao Tong University                    179
7     Nanyang Technological University                 143
8     Chinese Academy of Sciences                      126
9     Beihang University                               124
10    Carnegie Mellon University                       119
Result
unlike Academia's ranking, Microsoft (US) is in third place

Rank  Institution                       Papers
1     Microsoft Research Asia (MSRA)    612
2     Alibaba Group                     544
3     Microsoft Research/ Microsoft     407
4     Tencent                           203
5     Meta                              98
6     Huawei Technologies               57
7     Google                            57
8     Baidu                             44
9     Jingdong                          39
10    Amazon                            21
⚫ MSRA is consistently ranked #1 or #2
⚫ Papers: 15-34 per year
◼ 2019: Emergence of Alibaba
  1. Alibaba Group (90)
  2. MSRA (80)
  3. Peking University (73)
◼ 2021: Peak of Alibaba
  1. Alibaba Group (137)
  2. MSRA (79)
  3. Zhejiang University (72)
◼ 2023: Return of MSRA
  1. MSRA (101)
  2. Zhejiang University (74)
  3. Alibaba Group (61)
Consistent presence of MSRA; rapid rise of Alibaba after 2019
papers and other papers?
[Diagram: Abstract → SciBERT [Beltagy+’19] → IAC / The Other]
If SciBERT can classify the papers, there are differences between IAC papers and other papers.
many

Method                                 Accuracy  Precision  Recall  F1-score
With negative sampling (Random)        0.49      0.49       0.49    0.49
With negative sampling (SciBERT)       0.61      0.62       0.61    0.60
Without negative sampling (Majority)   -         0.46       0.50    0.48
Without negative sampling (SciBERT)    -         0.58       0.51    0.49

Datasets
Type        Number of Papers
IAC         1,919
The other   18,353

Due to the imbalanced dataset, we used negative sampling
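The balancing step can be sketched as below. It assumes that "negative sampling" here means randomly drawing as many "other" papers as there are IAC papers. The slide does not state the sampling ratio or procedure, so the 1:1 ratio, the function name, and the seed are illustrative assumptions.

```python
import random


def negative_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random subset of negatives."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return list(positives), rng.sample(list(negatives), k)


# Integer IDs stand in for papers; the counts are those in the dataset table.
iac_ids = list(range(1919))
other_ids = list(range(18353))
pos, neg = negative_sample(iac_ids, other_ids)
print(len(pos), len(neg))
```

With both classes at the same size, a random guesser is expected to score about 0.5 on every metric, which matches the Random baseline row in the table.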
⚫ Overcome limitations of existing databases (Scopus or Web of Science)
◼ Key Discoveries
  ⚫ Growth phase identified (2017-2020)
  ⚫ China-led collaboration patterns
    ► Industry: MSRA & Alibaba Group
    ► Academia: Major Chinese universities
  ⚫ Academia-driven paper authorship
  ⚫ Minimal content differences between IAC papers and other papers
Conclusion