Slide 1

Slide 1 text

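Disease Name Normalization by Contrastive Learning with Contextualized Embeddings
○ Shogo Ujiie, Hayate Iso, Eiji Aramaki
Nara Institute of Science and Technology
Annual Meeting of the Association for Natural Language Processing, D1-4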

Slide 2

Slide 2 text

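Disease name normalization
[Background]
• The task of linking a disease name to a concept in a dictionary
  • Input: a sentence and the position of the disease name
  • Output: the concept of the disease name
  • A kind of entity linking
• Used in routine work
  • Adverse drug reaction reports
Example sentence: "Protein 43 proteinopathy is the disease of …"
Dictionary (disease name → concept):
  TDP-43 proteinopathy → D057177
  TdP ventricular tachycardia → D016171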

Slide 3

Slide 3 text

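Disease name normalization as pre-processing and post-processing
[Background]
Disease name normalization is a foundational technology for language processing of medical documents.
• Pre-processing for downstream tasks
  • Input to relation extraction models [Xu+]
  • Indexing for information retrieval [Ujiie+]
• Post-processing for organizing information after information extraction
  • Visualization [Aramaki+]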

Slide 4

Slide 4 text

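Learning disease name representations for normalization
[Existing method | Overview]
Embeddings of disease names are learned by metric learning, and a name is normalized to its nearest dictionary entry [Sung+].
Dictionary (disease name → concept):
  TDP-43 proteinopathy → D057177
  torsades de pointes → D016171
Query: TdP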

Slide 6

Slide 6 text

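Challenge: large-scale resources are assumed
[Existing methods | Problems]
The prior work needs two major kinds of resources:
• A large number of (disease name, concept) pairs
  • Essential for acquiring general-purpose disease name representations
• A large dictionary (= a large number of synonyms)
  • The dictionary is the linking target
  • Expanding the dictionary also improves accuracy [Lee+]
  • Some work uses the dictionary as training data [Liu+]
Callout: both are hard to obtain for many languages and medical concepts.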

Slide 7

Slide 7 text

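Goal: develop a normalization method that works well with few resources
[Goal and motivation]
Resources used in this work:
• A (relatively) small dictionary
• A large collection of medical documents
Motivation: rather than using disease names alone, as existing work does, we want to use the contextual information available in raw text without any annotation.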

Slide 8

Slide 8 text

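Proposal: disease name normalization with an automatically constructed corpus
[Proposed method | Overview]
• Contextualized embeddings of disease names are trained on an automatically constructed corpus
• Normalization is done by nearest-neighbor search within the corpus
→ Even when a disease name is missing from the dictionary, it can be normalized by using its context as a clue and referring to similar usage examples
Example sentences:
  "TDP-43 proteinopathy with … protein … …"
  "torsades de pointes in … heart rhythm…"
  "TdP is a specific … cardiac death …"
Callouts: a type of arrhythmia; an umbrella term for ALS and related diseases; the formal name of TdP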

Slide 9

Slide 9 text

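Proposal: disease name normalization with an automatically constructed corpus
[Proposed method | Overview]
• Construct a corpus in which disease names are linked to concepts
• Learn contextualized embeddings of disease names by contrastive learning
• Infer with the k-nearest-neighbor method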

Slide 10

Slide 10 text

Constructing a corpus in which disease names are linked to concepts
[Proposed method | Corpus]
Disease names in medical documents are linked to concepts by dictionary matching.
Dictionary (disease name → concept):
  TDP-43 proteinopathy → D057177
  TdP ventricular tachycardia → D016171
Medical documents (matched sentences):
  "TDP-43 proteinopathy with drugs …"
  "TdP ventricular tachycardia are discussed ..."
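
The dictionary-matching step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_linked_corpus`, the exact-string regular-expression matching, and the longest-name-first ordering are all assumptions made for the example.

```python
import re


def build_linked_corpus(sentences, dictionary):
    """Link dictionary disease names found in sentences to concept IDs.

    `dictionary` maps a disease name (surface form) to a concept ID.
    Returns a list of (sentence, start, end, concept_id) tuples.
    """
    # Assumption: longer names are tried first so that a full name such as
    # "TdP ventricular tachycardia" is preferred over a shorter prefix.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    linked = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            linked.append(
                (sentence, match.start(), match.end(), dictionary[match.group(1)])
            )
    return linked


# Toy data taken from the slide.
dictionary = {
    "TDP-43 proteinopathy": "D057177",
    "TdP ventricular tachycardia": "D016171",
}
sentences = [
    "TDP-43 proteinopathy with drugs ...",
    "TdP ventricular tachycardia are discussed ...",
    "No disease name appears here.",
]
corpus = build_linked_corpus(sentences, dictionary)
for sentence, start, end, concept_id in corpus:
    print(concept_id, repr(sentence[start:end]))
```

Sentences without any dictionary match are simply left out of the linked corpus in this sketch.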

Slide 11

Slide 11 text

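Learning contextualized embeddings of disease names by contrastive learning
[Proposed method | Training]
Contrastive learning: a metric learning method that embeds positive pairs close together and negative pairs far apart.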

Slide 12

Slide 12 text

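Contrastive learning with the constructed corpus
[Proposed method | Training]
• Training uses the positive and negative pairs within each mini-batch
• To guarantee that positive pairs exist in a mini-batch, mini-batches are sampled as follows:
  • Sample 𝑚 concepts at random
  • For each concept, sample sentences containing that concept (a positive pair) at random
Mini-batch example:
  "… TdP ventricular tachycardia are …" → D016171
  "… torsades de pointes in patients …" → D016171
  "TDP-43 proteinopathy is associated …" → D057177
Figure labels: positive pair, negative pair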
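
The sampling procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the names `sample_minibatch` and `label_pairs` are hypothetical, and drawing exactly two sentences per concept (`per_concept=2`) is an assumption, since the slide does not state how many sentences are drawn.

```python
import random


def sample_minibatch(concept_to_sentences, m, per_concept=2, rng=random):
    """Sample m concepts, then `per_concept` sentences for each concept."""
    # Only concepts with enough sentences can supply a positive pair.
    eligible = [
        c for c, sents in concept_to_sentences.items() if len(sents) >= per_concept
    ]
    batch = []
    for concept in rng.sample(eligible, m):
        for sentence in rng.sample(concept_to_sentences[concept], per_concept):
            batch.append((sentence, concept))
    return batch


def label_pairs(batch):
    """Split all index pairs into positives (same concept) and negatives."""
    positives, negatives = [], []
    for i in range(len(batch)):
        for j in range(i + 1, len(batch)):
            same = batch[i][1] == batch[j][1]
            (positives if same else negatives).append((i, j))
    return positives, negatives


# Toy corpus built from the sentences shown on the slides.
corpus = {
    "D016171": [
        "... TdP ventricular tachycardia are ...",
        "... torsades de pointes in patients ...",
    ],
    "D057177": [
        "TDP-43 proteinopathy is associated ...",
        "TDP-43 proteinopathy with drugs ...",
    ],
}
batch = sample_minibatch(corpus, m=2, rng=random.Random(0))
positives, negatives = label_pairs(batch)
print(len(batch), len(positives), len(negatives))
```

Because every sampled concept contributes at least two sentences, each mini-batch is guaranteed to contain a positive pair, which is the property the slide asks for.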

Slide 13

Slide 13 text

Loss function
[Proposed method | Training]
• The Multi-Similarity loss [Wang+ 2019] is used
• It takes the relative similarity between pairs into account and weights each pair by its similarity
• The similarity is the cosine similarity between contextualized embeddings of disease names
Excerpt from the paper shown on the slide:
"… Similarity loss (Wang et al., 2019), which is a metric-learning loss function that considers relative similarities between positive and negative pairs. Let us denote the set of entities in the mini-batch by $B$ and the set of positive and negative samples for the entity $x'_i \in B$ by $P_i$ and $N_i$. We define the cosine similarity of two entities $x'_i$ and $x'_j$ as $S_{i,j}$, resulting in a similarity matrix $S \in \mathbb{R}^{|B| \times |B|}$. Based on $P_i$, $N_i$, and $S$, the following training objectives are set:
\[
\mathcal{L}_{MS} = \frac{1}{|B|} \sum_{i=1}^{|B|} \left\{ \frac{1}{\alpha} \log\Bigl[ 1 + \sum_{k \in P_i} e^{-\alpha (S_{ik} - \lambda)} \Bigr] + \frac{1}{\beta} \log\Bigl[ 1 + \sum_{k \in N_i} e^{\beta (S_{ik} - \lambda)} \Bigr] \right\},
\]
where $\alpha, \beta$ are the temperature scales and $\lambda$ is the offset applied on $S$. For pair mining, we follow the original paper (Wang et al., 2019).
Test Datasets & Evaluation Metric: We evaluated BIOCOM on three datasets for the biomedical entity normalization task: NCBI disease corpus (NCBID) (Doğan et al., 2014), BioCreative V Chemical Disease Relation (BC5CDR) (Li et al., 2016), and MedMentions (Mohan and Li, 2018). Following previous studies (D'Souza and Ng, 2015; Mondal et al., 2019), we used the accuracy as the evaluation metric. Given that BC5CDR and MedMentions contain mentions whose concepts are not in MEDIC, these were filtered out during the evaluation. We refer to these as "BC5CDR-d" and "MedMentions-d" respectively.
Model Details: The contextual representation for each entity x was obtained from PubMedBERT (Gu et al., 2020), which was trained on a large number of PubMed abstracts using …"
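
To make the loss concrete, here is a small pure-Python sketch of the Multi-Similarity objective over a cosine-similarity matrix. It is a sketch under assumptions, not the authors' code: the function names are hypothetical, the default values of `alpha`, `beta`, and `lam` are placeholders (the values used in the talk are not recoverable from this transcript), and the pair-mining step of Wang et al. (2019) is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-Similarity loss without pair mining (placeholder hyperparameters)."""
    n = len(labels)
    sim = [[cosine(embeddings[i], embeddings[k]) for k in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        # P_i: other items with the same concept; N_i: items with another concept.
        pos = [sim[i][k] for k in range(n) if k != i and labels[k] == labels[i]]
        neg = [sim[i][k] for k in range(n) if labels[k] != labels[i]]
        pos_term = math.log1p(sum(math.exp(-alpha * (s - lam)) for s in pos)) / alpha
        neg_term = math.log1p(sum(math.exp(beta * (s - lam)) for s in neg)) / beta
        total += pos_term + neg_term
    return total / n


# Toy 2-D embeddings: the first two vectors point in almost the same direction.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
consistent = multi_similarity_loss(vectors, ["D016171", "D016171", "D057177"])
inconsistent = multi_similarity_loss(vectors, ["D016171", "D057177", "D016171"])
print(consistent < inconsistent)
```

The loss is lower when vectors of the same concept are close and vectors of different concepts are far apart, which is the behavior the training objective rewards.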

Slide 14

Slide 14 text

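Inference with the k-nearest-neighbor method
[Proposed method | Inference]
• Nearest-neighbor search over the constructed corpus
  • Neighbors are searched by cosine similarity
  • The input is normalized to the most frequent concept among the k nearest sentences
→ Context-aware matching that refers to similar usage examples
Figure labels: input, k, D016171, D057177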
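
The inference step can be illustrated with a small k-nearest-neighbor vote. This is a minimal sketch: the function name `knn_normalize`, the toy 2-D vectors, and `k=3` are assumptions for illustration, and a real system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_normalize(query_vec, corpus_vecs, corpus_concepts, k=3):
    """Return the most frequent concept among the k most similar corpus entries."""
    sims = [cosine(query_vec, vec) for vec in corpus_vecs]
    # Indices of the k corpus entries with the highest cosine similarity.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = Counter(corpus_concepts[i] for i in top)
    # On a tie, Counter keeps first-seen order, so the nearer neighbor wins.
    return votes.most_common(1)[0][0]


# Toy corpus: two entries per concept, with made-up 2-D embeddings.
corpus_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
corpus_concepts = ["D016171", "D016171", "D057177", "D057177"]
prediction = knn_normalize([1.0, 0.0], corpus_vecs, corpus_concepts, k=3)
print(prediction)
```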

Slide 15

Slide 15 text

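Resources used
[Experimental setup]
• Raw text
  • All PubMed abstracts (about … GB)
• Dictionary
  • MEDIC, a disease name dictionary [Davis+]
  • MEDIC lists … synonyms for … concepts
  • To run the experiments with a small dictionary, half of the synonyms of each concept are sampled at random (… sentences per concept)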
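
The dictionary reduction described above (keeping a random half of each concept's synonyms) might look like the sketch below. The function name `halve_synonyms`, the rounding rule (floor, but at least one synonym per concept), and the toy synonym lists are assumptions; they are not taken from MEDIC or from the talk.

```python
import random


def halve_synonyms(concept_to_synonyms, seed=0):
    """Keep a random half of the synonyms of every concept."""
    rng = random.Random(seed)
    reduced = {}
    for concept, synonyms in concept_to_synonyms.items():
        # Assumption: round down, but never drop a concept entirely.
        keep = max(1, len(synonyms) // 2)
        reduced[concept] = sorted(rng.sample(synonyms, keep))
    return reduced


# Toy dictionary for illustration only; not actual MEDIC content.
full_dictionary = {
    "D016171": [
        "torsades de pointes",
        "TdP ventricular tachycardia",
        "TdP",
        "torsade de pointes",
    ],
    "D057177": ["TDP-43 proteinopathy"],
}
small_dictionary = halve_synonyms(full_dictionary)
print({concept: len(names) for concept, names in small_dictionary.items()})
```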

Slide 16

Slide 16 text

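Contextualized embeddings of disease names
[Experimental setup]
• PubMedBERT [Gu+ 2020] is used
• Each disease name is enclosed in "[ENT]" marker tokens, and the output at "[ENT]" is used as the disease name representation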
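
The marker-wrapping step can be sketched as plain string handling. This is only an illustration: the helper name `mark_mention` is hypothetical, and the spelling of the closing marker (`[/ENT]`) is an assumption, since the transcript shows the marker only as "[ENT]". Feeding the marked sentence to the encoder and reading the vector at the opening marker is not shown.

```python
def mark_mention(sentence, start, end, open_tok="[ENT]", close_tok="[/ENT]"):
    """Wrap the disease name at sentence[start:end] in marker tokens.

    The closing marker spelling is an assumption made for this sketch.
    """
    mention = sentence[start:end]
    return f"{sentence[:start]}{open_tok} {mention} {close_tok}{sentence[end:]}"


marked = mark_mention("TdP is a specific ...", 0, 3)
# Position of the opening marker among whitespace-separated tokens; the
# encoder output at this position would serve as the disease name vector.
marker_index = marked.split().index("[ENT]")
print(marked, marker_index)
```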

Slide 17

Slide 17 text

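Datasets
[Experimental setup]
Three datasets annotated with disease names are used:
• NCBI disease corpus (NCBID) [Dogan+ 2014]
• BioCreative V Chemical Disease Relation (BC5CDR) [Li+ 2016]
• MedMentions [Mohan+ 2018]
For BC5CDR and MedMentions, entities other than disease names are excluded (BC5CDR-d, MedMentions-d).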

Slide 18

Slide 18 text

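Experimental results
[Results]
Evaluated by accuracy on the three datasets (k = …)
Table 1: Accuracy on each dataset. Because the linking targets of the disease names differ, the values differ from those in prior work [5, 6].
Method            NCBID    BC5CDR   MedMentions
TF-IDF            0.4533   0.5993   0.5783
PubMedBERT        0.3455   0.4216   0.4522
SapBERT           0.5647   0.6523   0.6511
Proposed method   0.6031   0.6896   0.7141
Callouts: PubMedBERT is used without fine-tuning; SapBERT applies contrastive learning to the dictionary only.
Paper excerpt shown on the slide (partly cropped): "… as the dictionary, the same dictionary as the proposed method was used rather than the whole of UMLS." "Table 1 shows the experimental results. The proposed method achieves the highest accuracy on every dataset."
✔ The proposed method achieves high accuracy on all datasets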

Slide 19

Slide 19 text

Effect of dictionary size on accuracy (MedMentions-d)
[Discussion | Dictionary size]
✔ The smaller the dictionary, the more effective the proposed method

Slide 20

Slide 20 text

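Case analysis: an example where the proposed method was effective
[Discussion]
✔ Normalized by using the shared context (B-ALL) as a clue
Input: "phenotype … I for B-ALL, phenotype … IIIIV for T-ALL, LDH, the presence of translocation t…" → D014178
Nearest neighbor: "involved in genetic translocations characteristic of b cell acute lymphoblastic leukemia (B cell ALL)" → D014178
Callout: a type of leukemia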

Slide 21

Slide 21 text

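Case analysis: an example where the proposed method failed (1)
[Discussion]
Input: "High occurrence of non clear cell renal cell carcinoma"
Nearest neighbor: "used to treat renal cell carcinoma, non small cell lung cancer and colon cancer"
Concept IDs shown: D002292, D002289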

Slide 22

Slide 22 text

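Case analysis: an example where the proposed method failed (1)
[Discussion]
❌ Automatic recognition of the disease name failed → normalization error
Input: "High occurrence of non clear cell renal cell carcinoma"
Nearest neighbor: "used to treat renal cell carcinoma, non small cell lung cancer and colon cancer"
Concept IDs shown: D002292, D002289, D002292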

Slide 23

Slide 23 text

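Case analysis: an example where the proposed method failed (2)
[Discussion]
❌ Disease names whose contexts are also similar → normalization error
Input: "detected in only one proband with familial isolated hyperparathyroidism"
Nearest neighbor: "rare cause of primary hyperparathyroidism (hpt)"
Concept IDs shown: D002292, D002289
Callout: the concept differs depending on whether the adjective is present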

Slide 24

Slide 24 text

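Summary
[Conclusion]
• Proposed a disease name normalization method that uses a small dictionary and a large amount of raw text
  • Contrastive learning on an automatically constructed corpus
  • Normalization by nearest-neighbor search that refers to similar usage examples
• Achieved high accuracy on the disease name normalization task
  • The smaller the dictionary, the more useful the proposed method
  • Confirmed that the proposed method makes effective use of context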

Slide 25

Slide 25 text

Effect of the number of neighbors on accuracy
[Appendix]
✔ Inference is robust to the number of neighbors

Slide 26

Slide 26 text

Hyperparameters
[Appendix]
• Number of neighbors for the k-nearest-neighbor method: …
• Optimizer: Adam
• Batch size: … sentences
• 𝛼, 𝛽: …
• 𝜆: …