
言語処理学会2021 (Annual Meeting of the Association for Natural Language Processing, 2021)



Shogo Ujiie

March 02, 2021

Transcript

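  1. Disease Name Normalization by Contrastive Learning with Contextualized Embeddings
     ○Shogo Ujiie, Hayate Iso, Eiji Aramaki (Nara Institute of Science and Technology)
     Annual Meeting of the Association for Natural Language Processing, D1-4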

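  2. Disease name normalization [Background]
     Example text: "Protein 43 proteinopathy is the disease of …"
     Dictionary (disease name → concept):
       TDP-43 proteinopathy          D057177
       TdP ventricular tachycardia   D016171
     - The task of linking a disease name to a concept in a dictionary
       - Input: a sentence and the position of the disease name
       - Output: the concept of the disease name
     - A kind of entity linking
     - Used in routine work
       - Adverse drug reaction reporting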
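  3. Disease name normalization as pre-processing and post-processing [Background]
     Disease name normalization is a foundational technology for language processing of medical documents.
     - Pre-processing for downstream tasks
       - Input to relation extraction models [Xu+]
       - Indexing for information retrieval [Ujiie+]
     - Post-processing for organizing information after information extraction
       - Visualization [Aramaki+]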
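  4. Learning disease name representations for normalization [Existing methods | Overview]
     Embeddings of disease names are learned by metric learning, and a mention is normalized to its nearest dictionary entry [Sung+].
     Dictionary (disease name → concept):
       TDP-43 proteinopathy   D057177
       torsades de pointes    D016171
     Query mention: TdP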
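  6. Issue: large-scale resources are a prerequisite [Existing methods | Problems]
     The previous studies require two major resources, which are hard to obtain for many languages and medical concepts.
     - A large number of (disease name, concept) pairs
       - Essential for acquiring general-purpose disease name representations
     - A large dictionary (= a large number of synonyms)
       - The dictionary serves as the linking target
       - Expanding the dictionary also improves accuracy [Lee+]
       - Some work uses the dictionary as training data [Liu+]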
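  7. Goal: a normalization method that works well with low resources [Goal and motivation]
     Resources used in this study:
     - A (relatively) small dictionary
     - A large collection of medical documents
     Motivation: rather than using only disease names as existing studies do, we want to use the contextual information available in raw text, without annotation.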
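  8. Proposal: disease name normalization using an automatically built corpus [Proposed method | Overview]
     - Learn contextualized embeddings of disease names on an automatically built corpus
     - Normalize by nearest-neighbor search within the corpus
     → Even when a disease name is not in the dictionary, it is normalized by referring to similar usage examples, with the context as a clue
     Example sentences in the figure:
       "TDP-43 proteinopathy with … protein …"
       "… torsades de pointes in … heart rhythm …"
       "TdP is a specific … cardiac death …"
     Figure annotations: a type of arrhythmia; an umbrella term for ALS and related diseases; the full name of TdP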
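  9. Proposal: disease name normalization using an automatically built corpus [Proposed method | Overview]
     - Build a corpus in which disease names are linked to concepts
     - Learn contextualized embeddings of disease names by contrastive learning
     - Infer with the k-nearest neighbors method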

  10. Building a corpus in which disease names are linked to concepts [Proposed method | Corpus]
     Disease names in medical documents are linked to concepts by dictionary matching.
     Dictionary (disease name → concept):
       TDP-43 proteinopathy          D057177
       TdP ventricular tachycardia   D016171
     Medical documents:
       "TDP-43 proteinopathy with drugs …"
       "TdP ventricular tachycardia are discussed ..."
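The dictionary-matching step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `link_mentions`, the exact-string, case-sensitive matching, and the longest-name-first rule are all assumptions. Only the two dictionary entries come from the slide.

```python
import re
from typing import Dict, List, Tuple

# Toy dictionary copied from the slide: disease name -> concept ID.
DICTIONARY: Dict[str, str] = {
    "TDP-43 proteinopathy": "D057177",
    "TdP ventricular tachycardia": "D016171",
}


def link_mentions(sentence: str, dictionary: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, concept_id) for each dictionary name found in the sentence.

    Assumption: exact, case-sensitive string matching on word boundaries.
    """
    spans: List[Tuple[int, int, str]] = []
    # Try longer names first so that a long name wins over a shorter overlapping one.
    for name in sorted(dictionary, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, sentence):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), dictionary[name]))
    return sorted(spans)


if __name__ == "__main__":
    text = "TDP-43 proteinopathy with drugs"
    for start, end, concept in link_mentions(text, DICTIONARY):
        print(text[start:end], concept)
```

Each linked sentence can then be stored together with its concept ID, which yields the (sentence, concept) pairs used for training and for nearest-neighbor search.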
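  11. Learning contextualized embeddings of disease names by contrastive learning [Proposed method | Training]
     Contrastive learning: a metric-learning method that embeds positive pairs close together and negative pairs far apart.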

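  12. Contrastive learning on the built corpus [Proposed method | Training]
     - Train on the positive and negative pairs within a mini-batch
     - To guarantee that positive pairs exist within a mini-batch, mini-batches are sampled as follows:
       - Randomly draw 𝑚 concepts
       - For each concept, randomly draw sentences containing that concept (positive pairs)
     Mini-batch example:
       "… TdP ventricular tachycardia are …"     D016171
       "… torsades de pointes in patients …"     D016171
       "TDP-43 proteinopathy is associated …"    D057177
     Sentences with the same concept form a positive pair; sentences with different concepts form a negative pair.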
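The sampling procedure on this slide can be sketched as below. It is an illustration under stated assumptions: the function name `sample_minibatch`, the corpus layout (concept ID mapped to a list of sentences), and drawing exactly two sentences per concept are hypothetical choices; the slide only says that 𝑚 concepts are drawn and that sentences containing each concept are then drawn at random.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_minibatch(
    corpus: Dict[str, List[str]],
    m: int,
    per_concept: int = 2,  # assumption: two sentences per concept make one positive pair
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, str]]:
    """Draw m concepts at random, then per_concept sentences for each of them.

    Returns a list of (sentence, concept_id). Every concept in the batch appears
    per_concept times, so each one has at least one positive pair.
    Raises ValueError if fewer than m concepts have enough sentences.
    """
    rng = rng or random.Random()
    # Only concepts with enough sentences can contribute a positive pair.
    eligible = [c for c, sents in corpus.items() if len(sents) >= per_concept]
    batch: List[Tuple[str, str]] = []
    for concept in rng.sample(eligible, m):
        for sentence in rng.sample(corpus[concept], per_concept):
            batch.append((sentence, concept))
    return batch


if __name__ == "__main__":
    toy_corpus = {
        "D016171": ["TdP ventricular tachycardia are ...",
                    "torsades de pointes in patients ..."],
        "D057177": ["TDP-43 proteinopathy is associated ...",
                    "TDP-43 proteinopathy with drugs ..."],
    }
    for sentence, concept in sample_minibatch(toy_corpus, m=2, rng=random.Random(0)):
        print(concept, sentence)
```

Sentences in the batch that carry different concept IDs then serve as negative pairs for one another.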
  13. Loss function [Proposed method | Training]
     - Multi-Similarity loss [Wang+] is used
       - It takes the relative similarities between pairs into account and weights the similarity of each pair
     - The similarity is the cosine similarity between the contextualized embeddings of disease names

     Excerpt from the paper shown on the slide:

     … Multi-Similarity loss (Wang et al., 2019), which is a metric-learning loss function that considers relative similarities between positive and negative pairs. Let us denote the set of entities in the mini-batch by $B$ and the set of positive and negative samples for the entity $x'_i \in B$ by $P_i$ and $N_i$. We define the cosine similarity of two entities $x'_i$ and $x'_j$ as $S_{i,j}$, resulting in a similarity matrix $S \in \mathbb{R}^{|B| \times |B|}$. Based on $P_i$, $N_i$, and $S$, the following training objectives are set:

     \[
     \mathcal{L}_{MS} = \frac{1}{|B|} \sum_{i=1}^{|B|} \left\{ \frac{1}{\alpha} \log\Big[ 1 + \sum_{k \in P_i} e^{-\alpha (S_{ik} - \lambda)} \Big] + \frac{1}{\beta} \log\Big[ 1 + \sum_{k \in N_i} e^{\beta (S_{ik} - \lambda)} \Big] \right\},
     \]

     where $\alpha$, $\beta$ are the temperature scales and $\lambda$ is the offset applied on $S$. For pair mining, we follow the original paper (Wang et al., 2019).

     Test Datasets & Evaluation Metric. We evaluated BIOCOM on three datasets for the biomedical entity normalization task: NCBI disease corpus (NCBID) (Doğan et al., 2014), BioCreative V Chemical Disease Relation (BC5CDR) (Li et al., 2016), and MedMentions (Mohan and Li, 2018). Following previous studies (D’Souza and Ng, 2015; Mondal et al., 2019), we used the accuracy as the evaluation metric. Given that BC5CDR and MedMentions contain mentions whose concepts are not in MEDIC, these were filtered out during the evaluation. We refer to these as “BC5CDR-d” and “MedMentions-d” respectively.

     Model Details. The contextual representation for each entity x was obtained from PubMedBERT (Gu et al., 2020), which was trained on a large number of PubMed abstracts using
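To make the Multi-Similarity loss on this slide concrete, here is a small pure-Python sketch that evaluates it for a toy mini-batch. Several things are assumptions: the function names, the toy vectors, and the default values `alpha=2.0`, `beta=50.0`, `lam=0.5`, which are placeholders rather than the values used in the talk. The pair-mining step of Wang et al. (2019) is omitted, so this is a simplified variant and not the full training objective.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_similarity_loss(
    embeddings: List[Sequence[float]],
    concepts: List[str],
    alpha: float = 2.0,   # placeholder temperature for positive pairs
    beta: float = 50.0,   # placeholder temperature for negative pairs
    lam: float = 0.5,     # placeholder offset applied to the similarities
) -> float:
    """Multi-Similarity loss over one mini-batch, without pair mining."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # P_i: same concept as i (excluding i itself); N_i: different concept.
        pos = [cosine(embeddings[i], embeddings[k])
               for k in range(n) if k != i and concepts[k] == concepts[i]]
        neg = [cosine(embeddings[i], embeddings[k])
               for k in range(n) if concepts[k] != concepts[i]]
        pos_term = math.log(1.0 + sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        neg_term = math.log(1.0 + sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / n


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(multi_similarity_loss(vectors, ["D016171", "D016171", "D057177", "D057177"]))
```

The loss is small when embeddings with the same concept are similar and embeddings with different concepts are dissimilar, and it grows when the labels cut across the clusters.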
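  14. Inference with the k-nearest neighbors method [Proposed method | Inference]
     - Run a nearest-neighbor search over the built corpus
       - Neighbors are searched by cosine similarity
       - The mention is normalized to the concept that appears most often among the k nearest sentences
     → Matching that takes context into account by referring to similar usage examples
     Figure: an input mention, its k nearest neighbors, and their concepts (D016171, D057177)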
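The inference rule on this slide, a majority vote over the k most similar corpus sentences, can be sketched as follows. The function names and the toy two-dimensional vectors are assumptions for illustration; in the talk the vectors are contextualized embeddings of disease mentions, and a real system would use an efficient nearest-neighbor index rather than a full sort.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_normalize(
    query: Sequence[float],
    corpus_vectors: List[Sequence[float]],
    corpus_concepts: List[str],
    k: int,
) -> str:
    """Return the most frequent concept among the k corpus entries nearest to the query."""
    # Rank all corpus entries by cosine similarity to the query, highest first.
    ranked = sorted(
        zip(corpus_vectors, corpus_concepts),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(concept for _, concept in ranked[:k])
    # Ties in the vote resolve in favor of the concept seen first, i.e. the nearer one.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    concepts = ["D016171", "D016171", "D057177"]
    print(knn_normalize([1.0, 0.05], vectors, concepts, k=3))
```

Because the vote is taken over corpus sentences rather than dictionary entries, a mention whose surface form is missing from the dictionary can still be normalized if it occurs in a context similar to linked sentences.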
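  15. Resources used [Experimental setup]
     - Raw text
       - All PubMed abstracts (about … GB)
     - Dictionary
       - MEDIC [Davis+], a disease name dictionary, is used
       - MEDIC lists … synonyms for … concepts
       - To experiment with a small dictionary, half of the synonyms of each concept are sampled at random (… sentences per concept)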
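  16. Contextualized embeddings of disease names [Experimental setup]
     - PubMedBERT [Gu+] is used
     - The disease name is enclosed in "[ENT]" and "[/ENT]", and the output for "[ENT]" is used as the representation of the disease name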
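Marking the mention before encoding can be sketched with plain string handling. The helper name `mark_mention` and the single-space padding are assumptions, and the marker strings follow the reading "[ENT]" and "[/ENT]" of the slide. Registering the markers as special tokens in the tokenizer and reading the encoder output at the opening marker are not shown here.

```python
def mark_mention(sentence: str, start: int, end: int,
                 open_tok: str = "[ENT]", close_tok: str = "[/ENT]") -> str:
    """Wrap the mention sentence[start:end] in entity markers.

    Assumption: markers are separated from the mention by single spaces.
    """
    if not (0 <= start < end <= len(sentence)):
        raise ValueError("start/end must delimit a non-empty span inside the sentence")
    return (sentence[:start] + open_tok + " " + sentence[start:end]
            + " " + close_tok + sentence[end:])


if __name__ == "__main__":
    print(mark_mention("TdP is a specific arrhythmia", 0, 3))
```

The marked sentence is what would be passed to the encoder, so the same surface form receives different representations in different contexts.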

  17. Datasets [Experimental setup]
     Three datasets annotated with disease names are used:
     - NCBI disease corpus (NCBID) [Dogan+]
     - BioCreative V Chemical Disease Relation (BC5CDR) [Li+]
     - MedMentions [Mohan+]
     For BC5CDR and MedMentions, entities other than disease names are excluded (BC5CDR-d, MedMentions-d).
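  18. Experimental results [Results]
     Evaluated by accuracy on the three datasets (k = …).

     Excerpt from the paper shown on the slide (cropped): … as the dictionary, the same dictionary as for the proposed method was used rather than the whole of UMLS.

     4.3 Results
     Table 1: Accuracy on each dataset. Because the linking targets of the disease names differ, the figures differ from those of previous studies [5, 6].

       Method            NCBID    BC5CDR   MedMentions
       TF-IDF            0.4533   0.5993   0.5783
       PubMedBERT        0.3455   0.4216   0.4522
       SapBERT           0.5647   0.6523   0.6511
       Proposed method   0.6031   0.6896   0.7141

     Table 1 shows the experimental results. The proposed method achieves the highest accuracy on every dataset …

     Slide annotations: no fine-tuning; contrastive learning on the dictionary only
     ✔ The proposed method achieves high accuracy on all datasets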
  19. Effect of dictionary size on accuracy (MedMentions-d) [Discussion | Dictionary size]
     ✔ The smaller the dictionary, the more effective the proposed method

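  20. Case analysis: an example where the proposed method was effective [Discussion]
     ✔ Normalized using the shared context (B-ALL) as a clue
     Input: "phenotype … I for B-ALL, phenotype … III-IV for T-ALL, LDH, the presence of translocation t…"   D014178
     Nearest neighbor: "involved in genetic translocations characteristic of b-cell acute lymphoblastic leukemia B-cell ALL"   D014178
     Figure annotation: a type of leukemia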
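  21. Case analysis: an example where the proposed method failed (1) [Discussion]
     Input: "High occurrence of non-clear cell renal cell carcinoma"
     Nearest neighbor: "used to treat renal cell carcinoma, non-small cell lung cancer and colon cancer"
     Concept IDs shown: D002292, D002289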
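  22. Case analysis: an example where the proposed method failed (1) [Discussion]
     ❌ Automatic recognition of the disease name failed → normalization error
     Input: "High occurrence of non-clear cell renal cell carcinoma"
     Nearest neighbor: "used to treat renal cell carcinoma, non-small cell lung cancer and colon cancer"
     Concept IDs shown: D002292, D002289, D002292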
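  23. Case analysis: an example where the proposed method failed (2) [Discussion]
     ❌ Disease names whose contexts are also similar → normalization error
     Input: "detected in only one proband with familial isolated hyperparathyroidism"
     Nearest neighbor: "rare cause of primary hyperparathyroidism hpt"
     Concept IDs shown: D002292, D002289
     Figure annotation: the concept differs depending on whether the adjective is present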
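  24. Summary [Conclusion]
     - Proposed a disease name normalization method that uses a small dictionary and a large amount of raw text
       - Contrastive learning on an automatically built corpus
       - Normalization by nearest-neighbor search, referring to similar usage examples
     - Achieved high accuracy on the disease name normalization task
       - The smaller the dictionary, the more useful the proposed method
       - Confirmed that the proposed method makes effective use of "context"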
  25. Effect of the number of neighbors on accuracy [Appendix]
     ✔ Inference is robust to the number of neighbors

  26. Hyperparameters [Appendix]
     - Number of neighbors for the k-nearest neighbors method: …
     - Optimizer: Adam
     - Batch size: … sentences
     - 𝛼, 𝛽: …
     - 𝜆: …