
PubMedBERT: Pretraining for Biomedical NLP


Naoto Usuyama

March 28, 2022


Transcript

1. EMR: 60-80% of the information is in unstructured text (Wolters Kluwer: Health Language Blog). Curation: 2-3 hours × 1.8M = 4-5 million expert hours.
2. Neural Language Model Pretraining: unlabeled text, e.g. "The 2 mutations that were only found in the neuroblastoma resistance screen (G1123S/D) are located in the glycine-rich loop, which is known to be crucial for ATP and ligand binding and are the first mutations described that induce resistance to TAE684, but not to PF02341066."
3. Masked Language Modeling (MLM): take the same unlabeled sentence, replace some of its tokens with [MASK], and train the model to predict the hidden tokens (a minimal inference sketch follows this item).
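To make the MLM objective concrete, here is a minimal fill-in-the-blank sketch with the released checkpoint. The HuggingFace model ID is an assumption based on the hub, and the input sentence is a shortened paraphrase of the slide's example.

```python
from transformers import pipeline

# Fill-mask with PubMedBERT (hub model ID assumed; see slide 5 for the variants).
fill = pipeline(
    "fill-mask",
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
)

# Shortened paraphrase of the slide's example, with one token masked.
text = ("The mutations are located in the glycine-rich loop, "
        "which is crucial for ATP and [MASK] binding.")
for pred in fill(text, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```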
4. Biomed-Specific Pretraining: Domain-Specific Vocabulary. With a general-domain vocabulary, biomedical terms fragment into word pieces: lymphoma → l, ##ym, ##ph, ##oma; acetyltransferase → ace, ##ty, ##lt, ##ran, ##sf, ##eras, ##e. A domain-specific vocabulary keeps such terms intact (see the comparison below). Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing", Special Issue on Computational Methods for Biomedical Natural Language Processing, ACM Transactions on Computing for Healthcare, 2021. (arXiv)
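The vocabulary effect is easy to see by tokenizing the slide's two terms with a general-domain and a domain-specific tokenizer. Hub model IDs are assumed, and the exact word-piece splits may differ slightly from the slide.

```python
from transformers import AutoTokenizer

# General-domain vs. biomedical WordPiece vocabularies (hub IDs assumed).
general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomed = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
)

for term in ["lymphoma", "acetyltransferase"]:
    print(term)
    print("  general-domain:", general.tokenize(term))
    print("  PubMedBERT:    ", biomed.tokenize(term))
```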
5. PubMedBERT on HuggingFace: PubMedBERT (abstracts) and PubMedBERT (abstracts + full text), downloaded more than 250,000 times per month (a loading sketch follows this item).
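A minimal sketch of loading the abstract-only variant for feature extraction; the model ID is assumed from the hub and the input sentence is illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Abstract-only variant; an "...-abstract-fulltext" ID is also on the hub.
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("G1123S/D mutations induce resistance to TAE684.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```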
6. Fine-Tuning caveats: hyperparameters (learning rate), layer re-initialization, layer-wise learning rate decay. Results can change substantially just by switching the random seed or GPU; when evaluating properly, average over multiple runs (a layer-wise decay sketch follows this item). Tinn et al., "Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing", in submission, 2021. (arXiv)
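Among the knobs listed above, layer-wise learning rate decay is the least obvious; below is a minimal sketch of one common implementation for a BERT-style model. The base_lr and decay values are illustrative, not taken from Tinn et al.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", num_labels=2
)

base_lr, decay = 2e-5, 0.9  # illustrative values, not from the paper
num_layers = model.config.num_hidden_layers  # 12 for the base model
groups = []
for name, param in model.named_parameters():
    if "encoder.layer." in name:
        # Transformer block index 0..11 maps to depth 1..12.
        depth = int(name.split("encoder.layer.")[1].split(".")[0]) + 1
    elif "embeddings" in name:
        depth = 0  # embeddings are decayed the most
    else:
        depth = num_layers + 1  # pooler and classifier head keep full base_lr
    groups.append({"params": [param],
                   "lr": base_lr * decay ** (num_layers + 1 - depth)})

optimizer = torch.optim.AdamW(groups)
```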
7. Why? Deliver useful information from massive unorganized data in a timely way: organizing PubMed papers, deriving Real World Evidence from electronic medical records, reducing manual effort and improving productivity. Pretraining exploits massive unlabeled data and lowers annotation costs, but it demands massive compute: training PubMedBERT took about one week on a DGX-2.
8. Biomedical vertical search: http://aka.ms/biomedsearch. Wang, Li, and Naumann et al., "Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature", KDD, 2021. (arXiv)
9. Assisted Curation. Wong et al., "Breaching the curation bottleneck with human-machine reading symbiosis", in submission, 2021.
10. Precision Oncology RWE. Preston and Wei et al., "Towards Structuring Real-World Data at Scale: Deep Learning for Extracting Key Oncology Information from Clinical Text with Patient-Level Supervision", 2022. (arXiv)