Kazuhiro Yamauchi
December 10, 2024

Investigating Industry–Academia Collaboration in Artificial Intelligence: PDF-Based Bibliometric Analysis from Leading Conferences

Slides presented at ICADL 2024.

Transcript

  1. Investigating Industry–Academia Collaboration in Artificial Intelligence: PDF-Based Bibliometric Analysis from Leading Conferences
    Kazuhiro Yamauchi, Marie Katsurai (Doshisha University)
    ICADL 2024
  2. Research Background
    ◼ Industry–academia collaboration (IAC) is a form of innovation
      ⚫ It plays a crucial role in bridging theory and practice to achieve a more direct societal impact
      ⚫ For example:
        ► Industry gains access to the latest academic knowledge
        ► Academia learns about real-world challenges
    ◼ Team science is important
      ⚫ Because science in the modern world has become too complex
    [Diagram labels: Academia, Industry]
    Section: Background
  3. Research Background
    AI is becoming increasingly important for both universities and companies
    From the AI Index Report: 250K papers, 60K patents
  4. Our Research Purpose
    ◼ A bibliometric analysis of IAC in the field of AI is needed
      ⚫ By policy makers
      ⚫ By stakeholders, to inform their decisions
        ► Example: how much their country or company should spend on AI research
    ◼ Our research is a bibliometric analysis in the field of AI, focusing on:
      ⚫ Prominent conference papers
        ► Conferences are considered important in the field of AI
      ⚫ Papers co-authored by researchers from academia and industry
  5. Our Research Purpose
    ◼ We chose AAAI and IJCAI as the target conferences
      ⚫ These two are the most prominent conferences
    ◼ We identify IAC papers from the affiliations extracted from the PDFs
    ◼ Why do we need to extract from PDFs?
      ⚫ Scopus and Web of Science have missing information for conference papers
    [Figure labels: Journals, Conference paper]
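A minimal sketch of how a paper could be flagged as IAC once every author affiliation has a sector label. It assumes a paper counts as IAC when it has at least one academia-affiliated and at least one industry-affiliated author; the function name, label strings, and paper IDs are hypothetical and not taken from the slides.

```python
def is_iac_paper(author_sectors):
    """Return True if the sector labels include both academia and industry.

    author_sectors: iterable of "academia" / "industry" labels, one per
    author affiliation (hypothetical label strings).
    """
    sectors = set(author_sectors)
    return "academia" in sectors and "industry" in sectors


# Made-up papers, listed only to exercise the function.
papers = {
    "paper_a": ["academia", "academia"],
    "paper_b": ["academia", "industry", "academia"],
    "paper_c": ["industry"],
}
iac_ids = [pid for pid, sectors in papers.items() if is_iac_paper(sectors)]
print(iac_ids)
```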
  6. Research Questions in the AI Field
    1. Are IAC papers increasing year by year?
    2. Which research institutions are most active in IAC?
    3. Does academia or industry more frequently lead IAC?
    4. Are there content differences between IAC papers and other papers?
  7. Methodology Overview
    Pipeline stages, in order: Paper, Information Extraction, Normalization, Classification
    ⚫ Information Extraction: title, authors, affiliations, abstract, etc.
    ⚫ Normalization: institution
    ⚫ Classification: academia or industry
    Section: Method
  8. Data Collection
    ◼ We collected PDFs of AAAI and IJCAI papers by scraping the official websites
    Conference   Years       Total Papers
    AAAI         2010–2023   12,517
    IJCAI        2010–2023    8,032
    Total                    20,549
  9. GROBID Implementation
    ◼ GROBID
      ⚫ A bibliographic information extraction tool
      ⚫ The best-performing tool for extracting information from paper PDFs
    ◼ GROBID outputs XML containing:
      ⚫ Title
      ⚫ Authors
      ⚫ Affiliation
      ⚫ Abstract
      ⚫ etc.
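A minimal sketch of reading the title, authors, affiliations, and abstract out of GROBID-style TEI XML with the Python standard library. The sample document is hand-written to imitate the TEI header layout and is not real GROBID output; the function name and the returned dictionary shape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample imitating a TEI header (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Paper</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author>
          <persName><forename type="first">Ada</forename><surname>Example</surname></persName>
          <affiliation key="aff0">
            <orgName type="department">EECS</orgName>
            <orgName type="institution">University of California, Berkeley</orgName>
          </affiliation>
        </author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>We study an example.</p></abstract></profileDesc>
  </teiHeader>
</TEI>"""


def parse_header(tei_xml):
    """Extract title, authors with affiliation strings, and abstract."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for author in root.iterfind(".//tei:analytic/tei:author", TEI_NS):
        forename = author.findtext(".//tei:forename", default="", namespaces=TEI_NS)
        surname = author.findtext(".//tei:surname", default="", namespaces=TEI_NS)
        # Join the orgName parts of each affiliation into one raw string.
        affiliations = [
            ", ".join(org.text.strip() for org in aff.iterfind("tei:orgName", TEI_NS) if org.text)
            for aff in author.iterfind("tei:affiliation", TEI_NS)
        ]
        authors.append({"name": f"{forename} {surname}".strip(), "affiliations": affiliations})
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    abstract = " ".join("".join(abstract_el.itertext()).split()) if abstract_el is not None else ""
    return {"title": title, "authors": authors, "abstract": abstract}


record = parse_header(SAMPLE_TEI)
print(record["title"], record["authors"])
```

Real GROBID output carries more fields (addresses, e-mail, multiple affiliations per author), so a production parser would need to handle missing or repeated elements more carefully than this sketch does.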
  10. Affiliation Normalization
    ◼ We normalized the affiliation strings
      ⚫ Because the extracted affiliation strings appear in multiple notations
    ◼ For example, the following strings are all normalized to "University of California, Berkeley":
      ⚫ Electrical Engineering and Computer Science Department, University of California, Berkeley
      ⚫ EECS, University of California, Berkely, California, US
      ⚫ UC Berkeley
  11. Affiliation Normalization by S2AFF
    ◼ Steps, in order: raw affiliation text, named entity recognition, match with ROR, normalized institution name
    ◼ Example: "University of California, Berkely. Berkeley, California, US" is split into:
      ⚫ University of California, Berkely
      ⚫ Berkeley
      ⚫ California
      ⚫ US
    ◼ What is ROR?
      ⚫ A database of research institutions (Research Organization Registry)
      ⚫ Some institutions are not in the database
  12. Institution Classification Process
    ◼ Institutions found in ROR (those that S2AFF could normalize)
      ⚫ Classified by the "Organization type" in ROR
      ⚫ There are eight organization types: Education, Funder, Archive, Company, and so on
  13. Institution Classification Process
    ◼ For institutions that were not included in ROR, we used three automatic classification methods
      ⚫ Classification by labeled list and keywords
      ⚫ Classification by domain names in URLs
      ⚫ Classification by LightGBM with word2vec features
    ◼ If all three methods agree, we use the unanimous classification
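The unanimity rule can be sketched as a small helper. The function name and the use of `None` to mark disagreement are assumptions for illustration; the slides only state that a label is accepted when all three methods agree.

```python
def combine_votes(labels):
    """Return the shared label if every classifier agrees, otherwise None.

    None marks a string that the automatic methods could not settle.
    """
    unique = set(labels)
    return unique.pop() if len(unique) == 1 else None


print(combine_votes(["academia", "academia", "academia"]))
print(combine_votes(["academia", "industry", "academia"]))
```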
  14. Labeled List + Keyword
    ◼ Labeled list = university name list + company name list [Abuwala et al. 23]
    ◼ Institutions not on the list are classified as academia if they contain any of the following keywords
      ⚫ Universi
      ⚫ Academ
      ⚫ School
      ⚫ Polytech
      ⚫ Department
      ⚫ Univ.
      ⚫ Dept.
    ◼ The truncated keywords accommodate a variety of European languages
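A minimal sketch of the list-plus-keyword rule. The two name sets are tiny placeholders standing in for the real labeled lists, and two details are assumptions not stated on the slide: matching is case-insensitive, and a string with no keyword match falls back to industry.

```python
ACADEMIA_KEYWORDS = ("universi", "academ", "school", "polytech",
                     "department", "univ.", "dept.")

# Placeholder entries; the real labeled lists are much larger.
UNIVERSITY_NAMES = {"zhejiang university"}
COMPANY_NAMES = {"alibaba group"}


def classify_by_list_and_keyword(institution):
    """Classify an institution string as "academia" or "industry"."""
    name = institution.strip().lower()
    # 1) Exact match against the labeled lists.
    if name in UNIVERSITY_NAMES:
        return "academia"
    if name in COMPANY_NAMES:
        return "industry"
    # 2) Substring match against the truncated academic keywords.
    if any(keyword in name for keyword in ACADEMIA_KEYWORDS):
        return "academia"
    # Assumption: anything without an academic keyword is treated as industry.
    return "industry"


for s in ["Università di Bologna", "Dept. of Computer Science", "Tencent"]:
    print(s, "->", classify_by_list_and_keyword(s))
```

Truncated stems such as "universi" match "University", "Università", and "Universidad" alike, which is what lets a short keyword list cover several languages.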
  15. Domain Names in URLs
    ◼ Classification by the URLs of the top 3 search results
    ◼ If a URL contains one of the following domain names, the institution is classified as academia; otherwise, as industry
      ⚫ ".edu"
      ⚫ ".ac"
      ⚫ ".gov"
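A minimal sketch of the domain-name rule, taking an already retrieved list of result URLs as input (the search step itself is not shown). It assumes that a single matching URL among the top 3 is enough to label the institution as academia, which the slide does not spell out; the function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse

ACADEMIC_DOMAIN_LABELS = {"edu", "ac", "gov"}


def is_academic_url(url):
    """True if the host contains ".edu", ".ac", or ".gov" as a domain label."""
    host = urlparse(url).hostname or ""
    # Skip the first label so only labels preceded by a dot are considered.
    return any(label in ACADEMIC_DOMAIN_LABELS for label in host.split(".")[1:])


def classify_by_urls(urls, top_k=3):
    """Label as academia if any of the top_k URLs has an academic domain."""
    return "academia" if any(is_academic_url(u) for u in urls[:top_k]) else "industry"


print(classify_by_urls(["https://www.u-tokyo.ac.jp/en/", "https://example.com/"]))
print(classify_by_urls(["https://www.example.com/edu-page"]))
```

Checking host labels rather than the raw URL string avoids false matches on paths such as `/edu-page`.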
  16. LightGBM with word2vec Features
    ◼ Dataset created from the classification by ROR's organization type
      ⚫ Academic institution strings: 31,785
      ⚫ Industrial institution strings: 1,204
      ⚫ Train : validation : test = 8 : 1 : 1
    ◼ Features of the research institution string
      ⚫ word2vec (corpus: English Wikipedia)
    ◼ Classifier
      ⚫ LightGBM
    ◼ Result
      ⚫ ROC-AUC score: 0.9494, which suggests a strong classification model
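A minimal sketch of an 8:1:1 split over the two classes, using only the standard library. Splitting each class separately (a stratified split) is an assumption here, chosen because the classes are heavily imbalanced; the slides do not say how the split was drawn. The item strings are placeholders, and the word2vec and LightGBM steps are not shown.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle and split a list into train/validation/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def stratified_split(items_by_label, seed=0):
    """Apply the 8:1:1 split per class so each part keeps the class ratio."""
    train, val, test = [], [], []
    for label, items in items_by_label.items():
        tr, va, te = split_8_1_1(items, seed)
        train += [(x, label) for x in tr]
        val += [(x, label) for x in va]
        test += [(x, label) for x in te]
    return train, val, test


# Placeholder strings with the class sizes reported on the slide.
data = {
    "academia": [f"academia_{i}" for i in range(31785)],
    "industry": [f"industry_{i}" for i in range(1204)],
}
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))
```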
  17. Manual Classification Details
    ◼ Affiliation strings on which the three classifications did not agree were classified manually
    ◼ Four annotators
    ◼ 1,842 institution strings were annotated
    ◼ Procedure
      ⚫ Visually inspect the affiliation in the original PDF
      ⚫ Use web search engines (such as Google or Bing) to identify the institution
  18. Results
    Results of the analysis of the papers shown below
    Conference   Years       Total Papers
    AAAI         2010–2023   12,517
    IJCAI        2010–2023    8,032
    Total                    20,549
  19. Overall and IAC Paper Statistics
    2010–2020: increasing trend
    2021–2023: slightly decreasing trend, but an increase in 2023
    [Chart panels: AAAI, IJCAI]
    Section: Result
  20. Leading Academic Institutions
    Eight of the top nine institutions are in China (Nanyang Technological University is in Singapore)
    Only Carnegie Mellon University is from the US
    Rank  Institution                                     Papers
    1     Zhejiang University                             321
    2     Tsinghua University                             303
    3     Peking University                               265
    4     University of Science and Technology of China   234
    5     University of Chinese Academy of Sciences       207
    6     Shanghai Jiao Tong University                   179
    7     Nanyang Technological University                143
    8     Chinese Academy of Sciences                     126
    9     Beihang University                              124
    10    Carnegie Mellon University                      119
  21. Leading Industry Institutions
    The top two institutions are based in China
    Unlike the academic ranking, Microsoft (US) is in third place
    Rank  Institution                       Papers
    1     Microsoft Research Asia (MSRA)    612
    2     Alibaba Group                     544
    3     Microsoft Research / Microsoft    407
    4     Tencent                           203
    5     Meta                               98
    6     Huawei Technologies                57
    7     Google                             57
    8     Baidu                              44
    9     Jingdong                           39
    10    Amazon                             21
  22. Key Institution Transition by Year
    ◼ Until 2017: MSRA dominance
      ⚫ MSRA is consistently ranked #1 or #2
      ⚫ Papers: 15–34 per year
    ◼ 2019: Emergence of Alibaba
      1. Alibaba Group (90)
      2. MSRA (80)
      3. Peking University (73)
    ◼ 2021: Peak of Alibaba
      1. Alibaba Group (137)
      2. MSRA (79)
      3. Zhejiang University (72)
    ◼ 2023: Return of MSRA
      1. MSRA (101)
      2. Zhejiang University (74)
      3. Alibaba Group (61)
    ◼ Summary: consistent presence of MSRA; rapid rise of Alibaba after 2019
  23. Collaboration Network Overview
    Institutions are shown if co-authored papers ≥ 6
    Edge weight: the number of co-authorships
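A minimal sketch of building a weighted co-authorship network from per-paper institution lists. It assumes the threshold of 6 is applied to each institution pair (edge), which is one possible reading of the slide; the function name and the toy data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def build_edges(papers, min_weight=6):
    """Count co-authored papers per institution pair and keep strong edges.

    papers: list of institution-name lists, one list per paper.
    Returns {(inst_a, inst_b): weight} for pairs with weight >= min_weight.
    """
    weights = Counter()
    for institutions in papers:
        # Each unordered pair of distinct institutions on a paper adds 1.
        for pair in combinations(sorted(set(institutions)), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Toy data: one pair co-authors 6 papers, another only 2.
toy_papers = [["Company A", "University X"]] * 6 + [["Company B", "University Y"]] * 2
edges = build_edges(toy_papers)
nodes = {name for pair in edges for name in pair}
print(edges, nodes)
```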
  24. Collaboration Network Details
    MSRA has strong relationships with Beihang University, Tsinghua University, and University of Science and Technology of China
  25. Collaboration Network Details
    Alibaba Group has a strong relationship with Zhejiang University (highest publication count among academic institutions)
  26. First Author Analysis
    ◼ Academia leads in the number of first authorships
    ◼ However, first authors from industry increased until 2021
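A minimal sketch of tallying first-author sectors per year for IAC papers. It assumes each paper record lists its authors' sector labels in author order, so the first entry belongs to the first author; the record layout and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def first_author_counts(papers):
    """Return {year: Counter({sector: count})} based on the first author."""
    counts = defaultdict(Counter)
    for paper in papers:
        first_sector = paper["author_sectors"][0]
        counts[paper["year"]][first_sector] += 1
    return counts


# Made-up records, listed only to exercise the function.
toy_papers = [
    {"year": 2020, "author_sectors": ["academia", "industry"]},
    {"year": 2020, "author_sectors": ["industry", "academia"]},
    {"year": 2021, "author_sectors": ["academia", "academia", "industry"]},
]
counts = first_author_counts(toy_papers)
print({year: dict(c) for year, c in counts.items()})
```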
  27. Content Analysis Methodology
    Are there differences in content between IAC papers and other papers?
    SciBERT [Beltagy+’19] classifies each abstract as "IAC" or "the other"
    If SciBERT can classify the papers, there are content differences between IAC papers and other papers
  28. Content Analysis Results
    There are some differences, but not many
    Dataset:
    Type        Number of Papers
    IAC          1,919
    The other   18,353
    Because the dataset is imbalanced, we used negative sampling
    Classification results:
    Method                                  Accuracy  Precision  Recall  F1-score
    With negative sampling (Random)         0.49      0.49       0.49    0.49
    With negative sampling (SciBERT)        0.61      0.62       0.61    0.60
    Without negative sampling (Majority)    -         0.46       0.50    0.48
    Without negative sampling (SciBERT)     -         0.58       0.51    0.49
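A minimal sketch of negative sampling for the imbalanced dataset: the majority class ("the other") is randomly down-sampled to the size of the IAC class. The 1:1 ratio is an assumption, as the slides do not state the sampling ratio; the integer IDs are placeholders for papers.

```python
import random


def negative_sample(positives, negatives, seed=0):
    """Down-sample negatives to match the number of positives (1:1 ratio)."""
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, k=len(positives))
    return positives, sampled_negatives


# Placeholder paper IDs with the class sizes reported on the slide.
iac_ids = list(range(1919))
other_ids = list(range(1919, 1919 + 18353))
pos, neg = negative_sample(iac_ids, other_ids)
print(len(pos), len(neg))
```

With balanced classes, a random guess scores about 0.5 on every metric, which gives a baseline to compare the classifier against.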
  29. Major Contribution
    ◼ Novel methodological approach
      ⚫ PDF-based bibliometric analysis
      ⚫ Overcomes limitations of existing databases (Scopus or Web of Science)
    ◼ Key discoveries
      ⚫ Growth phase identified (2017–2020)
      ⚫ China-led collaboration patterns
        ► Industry: MSRA and Alibaba Group
        ► Academia: major Chinese universities
      ⚫ Academia-driven paper authorship
      ⚫ Minimal content differences between IAC papers and the other papers
    Section: Conclusion