
Association for Natural Language Processing 2021 (言語処理学会2021)


Shogo Ujiie

March 02, 2021



Transcript

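  1. Disease name normalization (Background)

    Disease name normalization is the process of linking a disease name to a concept in a dictionary.
    - Input: a sentence and the position of the disease name
    - Output: the concept of the disease name
    - A kind of entity linking
    - Used in routine work: adverse drug reaction reporting for pharmaceuticals

    Example text: "… Protein 43 proteinopathy is the disease of …"

    Dictionary:
    Disease name                   Concept
    TDP-43 proteinopathy           D057177
    TdP ventricular tachycardia    D016171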
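  2. Disease name normalization as pre-processing and post-processing (Background)

    Disease name normalization is a foundational technology for language processing of medical documents.
    - Pre-processing for downstream tasks
      - Input to relation extraction models [Xu et al.]
      - Indexing for information retrieval [Ujiie et al.]
    - Post-processing for organizing information after information extraction
      - Visualization [Aramaki et al.]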
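  3. Challenge: large-scale resources are a prerequisite (Existing methods | Problems)

    Previous studies require two major resources, both of which are hard to obtain for many languages and medical concepts.
    - A large number of (disease name, concept) pairs
      - Essential for acquiring general-purpose disease name representations
    - A large dictionary (that is, a large number of synonyms)
      - The dictionary serves as the linking target
      - Accuracy also improves when the dictionary is expanded [Lee et al.]
      - Some work uses the dictionary as training data [Liu et al.]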
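  4. Proposal: disease name normalization using an automatically constructed corpus (Proposed method | Overview)

    - Learn contextualized embeddings of disease names on an automatically constructed corpus
    - Normalize by nearest-neighbor search within the corpus
    → Even when a disease name is not in the dictionary, it is normalized by referring to similar usage examples, with the context as a clue

    Example sentences:
    "TDP-43 proteinopathy with … protein …"
    "… torsades de pointes in … heart rhythm …"
    "TdP is a specific … cardiac death …"
    Annotations: a type of arrhythmia; a collective term for ALS and related diseases; the full name of TdP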
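The nearest-neighbor step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `corpus_index` layout, and the toy 3-dimensional vectors (standing in for contextualized embeddings) are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def normalize_mention(query_vec, corpus_index):
    """Return the concept ID of the corpus mention most similar to the query.

    corpus_index: list of (embedding, concept_id) pairs (hypothetical layout).
    """
    _, best_concept = max(corpus_index, key=lambda item: cosine(query_vec, item[0]))
    return best_concept

# Toy vectors only; real embeddings would come from a trained encoder.
corpus_index = [
    ([0.9, 0.1, 0.0], "D057177"),  # e.g. a "TDP-43 proteinopathy ..." sentence
    ([0.1, 0.9, 0.2], "D016171"),  # e.g. a "torsades de pointes ..." sentence
    ([0.0, 0.8, 0.3], "D016171"),  # e.g. a "TdP ventricular tachycardia ..." sentence
]
predicted = normalize_mention([0.05, 0.85, 0.25], corpus_index)
```

A linear scan is enough for a toy index; a corpus-scale index would presumably need an approximate nearest-neighbor library, which the slides do not specify.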
  5. Building a corpus in which disease names are linked to concepts (Proposed method | Corpus)

    Disease names in medical documents are linked to concepts by dictionary matching.

    Dictionary:
    Disease name                   Concept
    TDP-43 proteinopathy           D057177
    TdP ventricular tachycardia    D016171

    Medical documents:
    "TDP-43 proteinopathy with drugs …"
    "TdP ventricular tachycardia are discussed ..."
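Dictionary matching of this kind could look like the sketch below. It assumes exact, case-sensitive string matching with a longest-name-first preference; the slides do not say which matching rules were actually used, and `build_linked_corpus` is a hypothetical name.

```python
import re

def build_linked_corpus(sentences, dictionary):
    """Link dictionary surface forms found in sentences to concept IDs.

    dictionary: maps a disease-name surface form to a concept ID.
    Returns a list of (sentence, start, end, concept_id) tuples.
    """
    # Longer names come first in the alternation so they win over shorter overlaps.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    linked = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            linked.append((sentence, match.start(), match.end(), dictionary[match.group(0)]))
    return linked

dictionary = {
    "TDP-43 proteinopathy": "D057177",
    "TdP ventricular tachycardia": "D016171",
}
sentences = [
    "TDP-43 proteinopathy with drugs",
    "TdP ventricular tachycardia are discussed",
    "no disease name in this sentence",
]
linked = build_linked_corpus(sentences, dictionary)
```

Matching is kept case-sensitive here because "TdP" and "TDP-43" differ only in case at the start, which is exactly the ambiguity the slides use as a running example.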
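  6. Contrastive learning using the constructed corpus (Proposed method | Training)

    - Train on positive and negative pairs within each mini-batch
    - To guarantee that positive pairs exist in a mini-batch, sample the mini-batch as follows:
      - Randomly sample 𝑚 concepts
      - For each concept, randomly sample sentences that contain that concept (a positive pair)

    Example mini-batch:
    "… TdP ventricular tachycardia are …"         D016171
    "… torsades de pointes in patients …"         D016171
    "TDP-43 proteinopathy is associated …"        D057177
    Sentences that share a concept form a positive pair; sentences with different concepts form a negative pair.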
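The sampling procedure on this slide might be sketched as below. It assumes two sentences are drawn per concept (the exact count is not legible in the transcript), and `sample_minibatch` and the `concept_to_sentences` mapping are hypothetical names.

```python
import random

def sample_minibatch(concept_to_sentences, m, rng=random):
    """Sample m concepts, then two sentences per concept (one positive pair each).

    Drawing two sentences per concept is an assumption made for this sketch.
    """
    # Only concepts with at least two sentences can supply a positive pair.
    eligible = [c for c, sents in concept_to_sentences.items() if len(sents) >= 2]
    batch = []
    for concept in rng.sample(eligible, m):
        for sentence in rng.sample(concept_to_sentences[concept], 2):
            batch.append((sentence, concept))
    return batch

concept_to_sentences = {
    "D016171": [
        "TdP ventricular tachycardia are discussed",
        "torsades de pointes in patients",
        "TdP is a specific arrhythmia",
    ],
    "D057177": [
        "TDP-43 proteinopathy is associated with",
        "TDP-43 proteinopathy with drugs",
    ],
    "D014178": ["only one sentence for this concept"],
}
batch = sample_minibatch(concept_to_sentences, m=2, rng=random.Random(0))
```

Within the returned batch, items that share a concept ID act as positives for each other and all remaining items act as negatives.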
  7. Loss function (Proposed method | Training)

    - Multi-Similarity loss [Wang et al., 2019] is used
    - It considers the relative similarities between pairs and weights the similarity of each pair
    - The similarity is the cosine similarity between the contextualized embeddings of disease names

    Paper excerpt:

    … Multi-Similarity loss (Wang et al., 2019), which is a metric-learning loss function that considers relative similarities between positive and negative pairs. Let us denote the set of entities in the mini-batch by $B$ and the set of positive and negative samples for the entity $x'_i \in B$ by $P_i$ and $N_i$. We define the cosine similarity of two entities $x'_i$ and $x'_j$ as $S_{i,j}$, resulting in a similarity matrix $S \in \mathbb{R}^{|B| \times |B|}$. Based on $P_i$, $N_i$, and $S$, the following training objectives are set:

    $$
    \mathcal{L}_{MS} = \frac{1}{|B|} \sum_{i=1}^{|B|} \left\{ \frac{1}{\alpha} \log\Big[ 1 + \sum_{k \in P_i} e^{-\alpha (S_{i,k} - \lambda)} \Big] + \frac{1}{\beta} \log\Big[ 1 + \sum_{k \in N_i} e^{\beta (S_{i,k} - \lambda)} \Big] \right\},
    $$

    where $\alpha$, $\beta$ are the temperature scales and $\lambda$ is the offset applied on $S$. For pair mining, we follow the original paper (Wang et al., 2019).

    Test Datasets & Evaluation Metric. We evaluated BIOCOM on three datasets for the biomedical entity normalization task: NCBI disease corpus (NCBID) (Doğan et al., 2014), BioCreative V Chemical Disease Relation (BC5CDR) (Li et al., 2016), and MedMentions (Mohan and Li, 2018). Following previous studies (D'Souza and Ng, 2015; Mondal et al., 2019), we used the accuracy as the evaluation metric. Given that BC5CDR and MedMentions contain mentions whose concepts are not in MEDIC, these were filtered out during the evaluation. We refer to these as "BC5CDR-d" and "MedMentions-d" respectively.

    Model Details. The contextual representation for each entity $x$ was obtained from PubMedBERT (Gu et al., 2020), which was trained on a large number of PubMed abstracts using …
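As a rough illustration of the Multi-Similarity objective, the sketch below computes the loss from a precomputed cosine-similarity matrix. It omits the pair-mining step that the excerpt defers to Wang et al. (2019), and the default values of `alpha`, `beta`, and `lam` are placeholders rather than settings taken from the slides.

```python
import math

def multi_similarity_loss(S, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-Similarity loss over a similarity matrix S, without pair mining.

    S[i][k] is the cosine similarity between items i and k; labels[i] is the
    concept ID of item i. The hyperparameter defaults are placeholders.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        # Positives share the anchor's concept (excluding the anchor itself).
        pos = [S[i][k] for k in range(n) if k != i and labels[k] == labels[i]]
        neg = [S[i][k] for k in range(n) if labels[k] != labels[i]]
        pos_term = math.log(1.0 + sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        neg_term = math.log(1.0 + sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / n

labels = ["D016171", "D016171", "D057177"]
# Same-concept items are close, different-concept items are far apart.
S_separated = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
# The reverse: same-concept items are far apart, different-concept items are close.
S_confused = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.9], [0.9, 0.9, 1.0]]
loss_separated = multi_similarity_loss(S_separated, labels)
loss_confused = multi_similarity_loss(S_confused, labels)
```

The loss is lower when positives are more similar than the offset and negatives are less similar than it, which is the behavior the two toy matrices are meant to contrast.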
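  8. Resources used (Experimental settings)

    - Raw text
      - All PubMed abstracts (approx. … GB)
    - Dictionary
      - MEDIC [Davis et al.], a disease name dictionary, is used
      - MEDIC lists … synonyms for … concepts
      - To run the experiment with a small dictionary, half of the synonyms of each concept are randomly sampled (… sentences per concept)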
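The dictionary downsizing described on this slide could be sketched as follows. Keeping at least one synonym per concept is an assumption of this sketch, `subsample_synonyms` is a hypothetical name, and the toy synonym lists are illustrative rather than actual MEDIC entries.

```python
import random

def subsample_synonyms(dictionary, rng=random):
    """Keep a random half of each concept's synonyms.

    At least one synonym is kept per concept (an assumption of this sketch).
    """
    small = {}
    for concept, synonyms in dictionary.items():
        keep = max(1, len(synonyms) // 2)
        small[concept] = rng.sample(synonyms, keep)
    return small

# Toy entries for illustration only.
toy_dictionary = {
    "D016171": ["torsades de pointes", "TdP", "TdP ventricular tachycardia", "torsade de pointes"],
    "D057177": ["TDP-43 proteinopathy", "TDP-43 proteinopathies"],
}
small_dictionary = subsample_synonyms(toy_dictionary, rng=random.Random(0))
```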
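  9. Datasets (Experimental settings)

    Three datasets annotated with disease names are used.
    - NCBI disease corpus (NCBID) [Doğan et al., 2014]
    - BioCreative V Chemical Disease Relation (BC5CDR) [Li et al., 2016]
    - MedMentions [Mohan and Li, 2018]
    For BC5CDR and MedMentions, entities other than disease names are excluded (BC5CDR-d, MedMentions-d).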
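  10. Experimental results (Results)

    Evaluated by accuracy on the three datasets (k…).

    Paper excerpt (cropped in the slide): the evaluation uses the NCBI disease corpus, … Chemical Disease Relation …; … is described in Section A.1. … as the dictionary, the same dictionary as the proposed method was used rather than the whole of UMLS.

    4.3 Results

    Table 1: Accuracy on each dataset. Because the linking targets of the disease names differ, the figures differ from those in previous work [5, 6].

    Method             NCBID     BC5CDR    MedMentions
    TF-IDF             0.4533    0.5993    0.5783
    PubMedBERT         0.3455    0.4216    0.4522
    SapBERT            0.5647    0.6523    0.6511
    Proposed method    0.6031    0.6896    0.7141

    Table 1 shows the experimental results. The proposed method achieves the highest accuracy on every dataset …

    Slide annotations: no fine-tuning; contrastive learning with the dictionary only.
    ✔ The proposed method achieves high accuracy on all datasets.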
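  11. Case analysis: an example where the proposed method was effective (Discussion)

    ✔ Normalized using the shared context (B-ALL) as a clue

    Input: "phenotype … I for B-ALL phenotype … IIIIV for T-ALL LDH the presence of translocation t …"
    Concept: D014178
    Nearest neighbor: "involved in genetic translocations characteristic of b-cell acute lymphoblastic leukemia (B-cell ALL)"
    Concept: D014178
    Annotation: a type of leukemia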
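  12. Case analysis: an example where the proposed method failed (1) (Discussion)

    ❌ Automatic recognition of the disease name failed → normalization error

    Input: "High occurrence of non-clear cell renal cell carcinoma"
    Nearest neighbor: "used to treat renal cell carcinoma, non-small cell lung cancer and colon cancer"
    Concept IDs shown: D002292, D002289, D002292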
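  13. Case analysis: an example where the proposed method failed (2) (Discussion)

    ❌ Disease names whose contexts are also similar → normalization error

    Input: "detected in only one proband with familial isolated hyperparathyroidism"
    Nearest neighbor: "rare cause of primary hyperparathyroidism (hpt)"
    Concept IDs shown: D002292, D002289
    Annotation: the concept differs depending on whether the adjective is present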