
PubMedBERT: Pretraining for Biomedical NLP


Naoto Usuyama

March 28, 2022


Transcript

1. EMR: 60-80% of the information is in unstructured text (Wolters Kluwer: Health Language Blog). Curation: 2-3 hours × 1.8M = 4-5 million expert hours.
2. Neural Language Model Pretraining: unlabeled text, e.g. "The 2 mutations that were only found in the neuroblastoma resistance screen (G1123S/D) are located in the glycine-rich loop, which is known to be crucial for ATP and ligand binding and are the first mutations described that induce resistance to TAE684, but not to PF02341066."
3. Masked Language Modeling (MLM): take the same unlabeled sentence, replace some of its tokens with [MASK], and train the model to predict the hidden tokens (a minimal inference sketch follows this item).
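To make the MLM objective concrete, here is a minimal fill-in-the-blank sketch with the released checkpoint. The HuggingFace model ID is an assumption based on the hub, and the input sentence is a shortened paraphrase of the slide's example.

```python
from transformers import pipeline

# Fill-mask with PubMedBERT (hub model ID assumed; see slide 5 for the variants).
fill = pipeline(
    "fill-mask",
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
)

# Shortened paraphrase of the slide's example, with one token masked.
text = ("The mutations are located in the glycine-rich loop, "
        "which is crucial for ATP and [MASK] binding.")
for pred in fill(text, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```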
4. Biomed-Specific Pretraining: Domain-Specific Vocabulary. With a general-domain vocabulary, biomedical terms fragment into word pieces: lymphoma → l, ##ym, ##ph, ##oma; acetyltransferase → ace, ##ty, ##lt, ##ran, ##sf, ##eras, ##e. A domain-specific vocabulary keeps such terms intact (see the comparison below). Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing", Special Issue on Computational Methods for Biomedical Natural Language Processing, ACM Transactions on Computing for Healthcare, 2021. (arXiv)
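The vocabulary effect is easy to see by tokenizing the slide's two terms with a general-domain and a domain-specific tokenizer. Hub model IDs are assumed, and the exact word-piece splits may differ slightly from the slide.

```python
from transformers import AutoTokenizer

# General-domain vs. biomedical WordPiece vocabularies (hub IDs assumed).
general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomed = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
)

for term in ["lymphoma", "acetyltransferase"]:
    print(term)
    print("  general-domain:", general.tokenize(term))
    print("  PubMedBERT:    ", biomed.tokenize(term))
```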
5. PubMedBERT on HuggingFace: PubMedBERT (abstracts) and PubMedBERT (abstracts + full text), downloaded more than 250,000 times per month (a loading sketch follows this item).
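A minimal sketch of loading the abstract-only variant for feature extraction; the model ID is assumed from the hub and the input sentence is illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Abstract-only variant; an "...-abstract-fulltext" ID is also on the hub.
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("G1123S/D mutations induce resistance to TAE684.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```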
6. Fine-Tuning caveats: hyperparameters (learning rate), layer re-initialization, layer-wise learning rate decay. Results can change substantially just by switching the random seed or GPU; when evaluating properly, average over multiple runs (a layer-wise decay sketch follows this item). Tinn et al., "Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing", in submission, 2021. (arXiv)
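Among the knobs listed above, layer-wise learning rate decay is the least obvious; below is a minimal sketch of one common implementation for a BERT-style model. The base_lr and decay values are illustrative, not taken from Tinn et al.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", num_labels=2
)

base_lr, decay = 2e-5, 0.9  # illustrative values, not from the paper
num_layers = model.config.num_hidden_layers  # 12 for the base model
groups = []
for name, param in model.named_parameters():
    if "encoder.layer." in name:
        # Transformer block index 0..11 maps to depth 1..12.
        depth = int(name.split("encoder.layer.")[1].split(".")[0]) + 1
    elif "embeddings" in name:
        depth = 0  # embeddings are decayed the most
    else:
        depth = num_layers + 1  # pooler and classifier head keep full base_lr
    groups.append({"params": [param],
                   "lr": base_lr * decay ** (num_layers + 1 - depth)})

optimizer = torch.optim.AdamW(groups)
```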
7. Why? Deliver useful information from massive unorganized data in a timely way: organizing PubMed papers, deriving Real World Evidence from electronic medical records, reducing manual effort and improving productivity. Pretraining exploits massive unlabeled data and lowers annotation costs, but it demands massive compute: training PubMedBERT took about one week on a DGX-2.
8. Biomedical vertical search: http://aka.ms/biomedsearch. Wang, Li, and Naumann et al., "Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature", KDD, 2021. (arXiv)
9. Assisted Curation. Wong et al., "Breaching the curation bottleneck with human-machine reading symbiosis", in submission, 2021.
10. Precision Oncology RWE. Preston and Wei et al., "Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision", 2022. (arXiv)