
Applying BERT to Medical Language Processing: BioBERT, ClinicalBERT, and Beyond


Yuta Nakamura

May 15, 2020



Transcript

  1. The road to BERT (overview)
     - 2014.9: Attention mechanism: a component used with RNN/LSTM models
     - 2017.6: Transformer: a model built from attention alone
     - 2018.10: BERT: improved to take bidirectional (left and right) context into account
     → The arrival of BERT transformed the NLP world; the expression "After BERT" is sometimes used
     https://github.com/thunlp/PLMpapers
  2. Where BERT stands
     - Deep learning methods applied to clinical records
     - 48 papers from January through April 2019:
       RNN/LSTM 50%, CNN 35%, methods using attention 18%
     Wu S et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc 27(3): 457-470, 2020
  3. Domain-specific BERT (1): BioBERT
     - A pretrained BERT further trained on biomedical papers (a loading sketch follows below)
     - BERT: BookCorpus + English Wikipedia
     - BioBERT: the above + PubMed abstracts ± PMC full texts
     Lee J et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4): 1234-1240, 2020
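As a concrete starting point, here is a minimal sketch of loading BioBERT with the Hugging Face transformers library. The checkpoint id dmis-lab/biobert-base-cased-v1.1 is this example's assumption; the slide itself names no repository.

```python
# Minimal sketch: load a BioBERT checkpoint with Hugging Face transformers.
# The checkpoint id is an assumption of this example, not named on the slide.
from transformers import AutoTokenizer, AutoModel

name = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Encode a PubMed-style sentence and take the [CLS] hidden state as a
# crude sentence representation.
inputs = tokenizer("EGFR mutations predict response to gefitinib.",
                   return_tensors="pt")
outputs = model(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)
```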
  4. Domain-specific BERT (2): ClinicalBERT
     - BERT pretrained on de-identified ICU clinical records (MIMIC-III) from the Beth Israel Deaconess Medical Center
     Huang K et al. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv preprint arXiv:1904.05342, 2019
  5. Domain-specific BERT (3): EhrBERT
     - Starts from BioBERT and is further trained on 500,000 to 1,000,000 clinical records from the UMass Memorial Medical Center (not publicly released)
     Li F et al. Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study. JMIR Med Inform 7(3): e14830, 2019
  6. What changed with BioBERT and ClinicalBERT
     - BioBERT × biomedical papers
     - Named entity recognition + disease name normalization (a token-classification sketch follows below)
       NCBI disease corpus (disease mention extraction + normalization): F1 score 0.572 (MetaMap) → 0.886 (SOTA) → 0.897
     - Question answering
       PubMedQA (for papers whose title is a question, infer the conclusion of the abstract from its remaining parts)
     Lee J et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4): 1234-1240, 2020
     Jin Q, Dhingra B, Liu Z, Cohen W, Lu X: PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): 2567-2577, 2019
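The extraction half of the NCBI disease task can be framed as token classification on top of BioBERT. A minimal sketch under that framing follows; my-biobert-ncbi-ner is a hypothetical fine-tuned checkpoint, not a model released by the cited authors.

```python
# Sketch: disease mention extraction as BERT token classification (BIO tags).
# "my-biobert-ncbi-ner" is a hypothetical fine-tuned checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "my-biobert-ncbi-ner"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

text = "The patient was diagnosed with cystic fibrosis."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, num_labels)
label_ids = logits.argmax(dim=-1)[0]       # one BIO tag id per subword
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, lab in zip(tokens, label_ids):
    print(tok, model.config.id2label[int(lab)])
```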
  7. What changed with BioBERT and ClinicalBERT
     - BioBERT × biomedical papers
     - Summarization + multimodality
       FigSum (collect the passages in the body text that describe a figure and summarize them)
     Saini N, Saha S, Bhattacharyya P, Tuteja H: Textual Entailment--Based Figure Summarization for Biomedical Articles. ACM Trans. Multimedia Comput. Commun. Appl. DOI: 10.1145/3357334, April 2020
  8. What changed with BioBERT and ClinicalBERT
     - ClinicalBERT × clinical records
     - Document classification (a classification sketch follows below)
       MIMIC-III ICU readmission prediction: AUC 0.67
     - Named entity recognition + relation extraction
       i2b2 2010 (relations among symptoms, tests, and treatments): F1 score 0.737 (2010) → 0.878
     Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, McDermott M: Publicly Available Clinical BERT Embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop: 72-78, 2019
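The readmission task reduces to sequence classification over a clinical note. The sketch below assumes a hypothetical fine-tuned checkpoint and a 30-day horizon (the horizon used in the ClinicalBERT paper cited on slide 4); it is an illustration, not the authors' code.

```python
# Sketch: readmission prediction as sequence classification over one note.
# "my-clinicalbert-readmission" is a hypothetical fine-tuned checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "my-clinicalbert-readmission"  # hypothetical checkpoint, 2 labels
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

note = "Admitted with CHF exacerbation; discharged on furosemide."
inputs = tokenizer(note, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print("P(readmission within 30 days) =", float(probs[0, 1]))
```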
  9. What changed with BioBERT and ClinicalBERT
     - ClinicalBERT × clinical records
     - Recognizing textual entailment (a sentence-pair sketch follows below)
       MedNLI (entailment between two medical statements): accuracy 0.735 (baseline) → 0.827
       Example: "Can this be called aphasia?" If the patient cannot speak: True
       Example: "Given six prior episodes of diabetic ketoacidosis and an HbA1c of 13.3%, can glycemic control be called good?" False
     Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, McDermott M: Publicly Available Clinical BERT Embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop: 72-78, 2019
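MedNLI is a sentence-pair task, so it maps onto BERT's [CLS] premise [SEP] hypothesis [SEP] input format. A minimal sketch follows; the checkpoint name and the label order are assumptions.

```python
# Sketch: MedNLI-style entailment as BERT sentence-pair classification.
# Checkpoint name and label order are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "my-clinicalbert-mednli"  # hypothetical checkpoint, 3 labels
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "The patient is unable to speak."
hypothesis = "The patient has aphasia."
# The pair is encoded as [CLS] premise [SEP] hypothesis [SEP].
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1)
labels = ["entailment", "neutral", "contradiction"]  # assumed order
print(labels[int(pred)])
```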
  10. BEHRT: applying BERT to non-linguistic medical data
     - A study that applies BERT to non-linguistic medical data
     - Trained on data from 1.6 million patients in CPRD, a UK clinical records database
     Li, Y., Rao, S., Solares, J.R.A. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y
  11. BEHRT: applying BERT to non-linguistic medical data
     - BERT: each sentence's token sequence is fed in, delimited by [SEP]; tokens are drawn from a vocabulary (vocabulary size ≈ 32,000 for BERT_BASE)
     - BEHRT: the diagnoses and the age at each visit are fed in, delimited by [SEP]; diagnoses are drawn from ICD-10 codes ("vocabulary size" 301)
     (Figure: input formats of BERT vs. BEHRT; a construction sketch follows below)
     Li, Y., Rao, S., Solares, J.R.A. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y
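To make the input format concrete, here is a small sketch of turning one patient's visit history into a BEHRT-style token sequence with a parallel age sequence. The diagnosis codes and helper names are illustrative, not taken from the paper's code.

```python
# Sketch: build a BEHRT-style input. Each visit contributes its diagnosis
# codes plus a [SEP] delimiter; a parallel sequence carries the visit age.
# Codes below are illustrative examples.
visits = [
    {"age": 52, "diagnoses": ["I10", "E11"]},  # hypertension, T2 diabetes
    {"age": 54, "diagnoses": ["I25"]},         # ischaemic heart disease
]

tokens, ages = ["[CLS]"], [visits[0]["age"]]
for visit in visits:
    for code in visit["diagnoses"]:
        tokens.append(code)
        ages.append(visit["age"])
    tokens.append("[SEP]")  # visit boundary, analogous to BERT's sentence [SEP]
    ages.append(visit["age"])

print(tokens)  # ['[CLS]', 'I10', 'E11', '[SEP]', 'I25', '[SEP]']
print(ages)    # [52, 52, 52, 52, 54, 54]
# Each diagnosis token is then mapped into the ~301-entry "vocabulary".
```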
  12. BEHRT: applying BERT to non-linguistic medical data
     - Pretraining: masked language model (a masking sketch follows below)
     - BERT: part of the input tokens are replaced with [MASK] → the model must restore them
     - BEHRT: part of the input diagnoses are replaced with [MASK] → the model must restore them
     Li, Y., Rao, S., Solares, J.R.A. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y
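A minimal sketch of producing the masked training examples this slide describes: randomly replace some diagnosis tokens with [MASK] and keep the originals as targets. The 15% rate is BERT's canonical masking rate, assumed here rather than taken from the BEHRT paper.

```python
# Sketch: build one masked-language-model training example for BEHRT-style
# input. Masking rate and helper names are assumptions of this example.
import random

def mask_diagnoses(tokens, mask_rate=0.15, special=("[CLS]", "[SEP]")):
    """Return (masked_tokens, targets): targets hold the original token at
    masked positions and None elsewhere."""
    masked, targets = [], []
    for tok in tokens:
        if tok not in special and random.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)  # training target: restore this diagnosis
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

tokens = ["[CLS]", "I10", "E11", "[SEP]", "I25", "[SEP]"]
print(mask_diagnoses(tokens))
```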
  13. BEHRT: applying BERT to non-linguistic medical data
     - Performance: predicts whether a patient will newly develop a disease with no prior history of it → achieves an AUC of 0.904 to 0.907
     Li, Y., Rao, S., Solares, J.R.A. et al. BEHRT: Transformer for Electronic Health Records. Sci Rep 10, 7155 (2020). https://doi.org/10.1038/s41598-020-62922-y
  14. TAPER: integrating linguistic and non-linguistic medical data
     - Uses the clinical record data in MIMIC-III
     (1) Linguistic data: clinical notes are summarized with BioBERT + a BiGRU
     (2) Non-linguistic data: ICD codes, drug codes, and procedure codes are vectorized with word2vec → transformer encoder
     (3) Patient attributes such as sex and age are vectorized
     → (1) + (2) + (3) yield a vector representing the admission itself (a fusion sketch follows below)
     S. Darabi, M. Kachuee, S. Fazeli and M. Sarrafzadeh, "TAPER: Time-Aware Patient EHR Representation," in IEEE Journal of Biomedical and Health Informatics, doi: 10.1109/JBHI.2020.2984931
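One plausible reading of the fusion step, sketched below: concatenate the three component vectors into a single admission representation. All dimensions are illustrative assumptions; the paper's actual architecture details may differ.

```python
# Sketch: fuse the three TAPER components into one admission vector by
# concatenation. All dimensions are illustrative assumptions.
import torch

text_vec = torch.randn(256)   # (1) note summary from BioBERT + BiGRU
code_vec = torch.randn(128)   # (2) ICD/drug/procedure codes via word2vec
                              #     -> transformer encoder
demo_vec = torch.tensor([1.0, 67.0])  # (3) e.g. sex flag, age

admission_vec = torch.cat([text_vec, code_vec, demo_vec])  # shape: (386,)
# Downstream heads (readmission, mortality, long stay) consume this vector.
```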
  15. TAPER: integrating linguistic and non-linguistic medical data
     - Uses the clinical record data in MIMIC-III
     - Outperformed ClinicalBERT on the tasks of retrospectively predicting ICU patients' readmission, mortality, and prolonged stay
     S. Darabi, M. Kachuee, S. Fazeli and M. Sarrafzadeh, "TAPER: Time-Aware Patient EHR Representation," in IEEE Journal of Biomedical and Health Informatics, doi: 10.1109/JBHI.2020.2984931
  16. Domain-specific BERT in Japanese
     - Released by the Medical AI Development course at the University of Tokyo
     - A BERT pretrained on roughly 100 million lines of Japanese clinical records
     - Word segmentation: MeCab + byte pair encoding (a tokenization sketch follows below)
     - Dictionaries: mecab-ipadic-neologd + the MANBYO disease name dictionary
     - Vocabulary size: 25,000
     https://ai-health.m.u-tokyo.ac.jp/uth-bert
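A minimal sketch of the first stage of the two-stage tokenization described here, using the fugashi MeCab wrapper. Installing MeCab with the mecab-ipadic-neologd and MANBYO dictionaries is assumed and not shown; the subword stage would then split each word against the 25,000-entry vocabulary.

```python
# Sketch: MeCab word segmentation via the fugashi wrapper. A suitable MeCab
# dictionary (e.g. mecab-ipadic-neologd plus the MANBYO disease dictionary)
# is assumed to be installed.
from fugashi import Tagger

tagger = Tagger()
text = "患者は糖尿病性ケトアシドーシスを発症した。"
words = [word.surface for word in tagger(text)]  # MeCab segmentation
print(words)
# A BPE model trained on clinical text would then split each word into
# subwords from the 25,000-entry vocabulary.
```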