ROCLING 2021 - Nested Named Entity Recognition for Chinese Electronic Health Records with QA-based Sequence Labeling

Yu-Lun Chiang
September 28, 2021

This study presents a novel QA-based sequence labeling (QASL) approach that handles both flat and nested Named Entity Recognition (NER) on a Chinese Electronic Health Records (CEHR) dataset. For each named entity type, QASL asks a corresponding natural language question, with the questions processed in parallel, and then tags the named entities (NEs) of that type with the BIO scheme. Nested NEs are formed by overlapping the results for the different types. Compared with pure sequence-labeling (SL) approaches, QASL improves nested NER and obtains an F1-score of 90.70%, because each question carries significant prior knowledge about the specified entity type and the model can extract NEs of different types. Compared with a pure QA-based approach, QASL retains the SL property of extracting multiple NEs of the same type without knowing in advance how many NEs the passage contains. Experiments on our CEHR dataset show that QASL-based models outperform SL-based models by 6.12% to 7.14% in F1-score.
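
The overlap step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `decode_bio` and `overlap_types`, the token segmentation, and the tag sequences are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def decode_bio(tags: List[str]) -> List[Span]:
    """Collect every B/I run in a BIO tag sequence as a (start, end) span."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # tag == "I" extends the open span; a stray "I" with no open span is ignored
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def overlap_types(tags_by_type: Dict[str, List[str]]) -> Dict[Span, List[str]]:
    """Merge the per-question results; a span with several types is a nested NE."""
    merged: Dict[Span, List[str]] = {}
    for entity_type, tags in tags_by_type.items():
        for span in decode_bio(tags):
            merged.setdefault(span, []).append(entity_type)
    return merged


# Assumed segmentation of the first sentence of the example record (27 tokens).
tokens = ["病", "患", "於", "西", "元", "2019", "年", "10", "月", "5", "日", "至", "本",
          "院", "入", "院", "急", "診", ",", "於", "10", "月", "7", "日", "出", "院", "。"]
date_1 = ["O"] * 3 + ["B"] + ["I"] * 7 + ["O"] * 16
date_2 = ["O"] * 20 + ["B", "I", "I", "I"] + ["O"] * 3
tags_by_type = {
    "Admission Date": date_1,
    "Emergency Date": date_1,
    "Discharge Date": date_2,
}
merged = overlap_types(tags_by_type)
for (start, end), types in merged.items():
    print("".join(tokens[start:end]), types)
```

In this sketch only identical spans are grouped under one key. Partially overlapping spans of different types would appear as separate keys and would need an extra containment check.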

This paper was published at ROCLING 2021 (https://rocling2021.github.io).
The talk was presented on 2021/10/15-10/16.

Source code: https://github.com/allenyummy/EHR_NER

------------------
Contact information
- Gmail: [email protected]
- Github: allenyummy
- Webpage: https://allenyummy.github.io
- Linkedin: Yu-Lun Chiang (https://www.linkedin.com/in/ylchiang914/)
- Medium: Yu-Lun Chiang (https://allenyummy.medium.com)


Transcript

  1. Nested Named Entity Recognition for Chinese Electronic Health Records with QA-based Sequence Labeling
    Yu-Lun Chiang (1), Chih-Hao Lin (1), Cheng-Lung Sung (1) and Keh-Yih Su (2)
    (1) Data Intelligence R&D Division, CTBC Bank Co., Ltd.
    (2) Institute of Information Science, Academia Sinica
  2. Named Entity Recognition for Chinese Electronic Health Records
    Example record: 病患於西元2019年10月5日至本院入院急診,於10月7日出院。10月16日、10月21日至本院門診追蹤治療。
    Translation: The patient was admitted to the hospital and sent to the emergency room on Oct. 5, 2019. Then, he was discharged on Oct. 7. He went to the hospital for follow-up treatment on Oct. 16 and Oct. 21.
    Entity types: 入院日期 (Admission Date), 急診日期 (Emergency Date), 出院日期 (Discharge Date), 門診日期 (Outpatient Date)
  3. Stack-based approaches: extract entities from the outside in (or from the inside out), but suffer from error accumulation and layer disorientation.
    • Alex et al., 2007, multi-layer CRFs
    • Ju et al., 2018, stacked flat NER layer
    • Wang et al., 2020a, pyramid layer

    Graph-based approaches: use graphs to extract entities, but are hard to optimize.
    • Finkel and Manning, 2009, CRF with parse tree
    • Lu and Roth, 2015, hypergraph
    • Wang and Lu, 2018, neural segmental hypergraph
    • Katiyar and Cardie, 2018, LSTM with hypergraph
    • Luo and Zhao, 2020, bipartite flat graph network

    Region-based approaches: first identify candidate entity positions, then assign entity labels.
    • Xu et al., 2017, FOFE & FFNN
    • Fisher and Vlachos, 2019, merge and label
    • Xia et al., 2019, detect and classify
    • Zheng et al., 2019, get boundary and then classify
    • Wang et al., 2020b, head-tail detector and token tagger

    Machine Reading Comprehension (MRC) approaches: recast NLP problems in a question-answering framework.
    • Levy et al., 2017, MRC for relation extraction
    • Li et al., 2019, MRC for relation extraction
    • McCann et al., 2018, MRC for NLP Decathlon
    • Yin et al., 2020, MRC for sentiment analysis
    • Li et al., 2020, MRC for named entity recognition
    Supplemented with an extraction strategy:
    • Segal et al., 2019, multi-span extraction

    Sequence Labeling: NER is generally treated as a sequence labeling task, but earlier datasets contain only flat (single-meaning) entities.
    • Lafferty et al., 2001
    • Hammerton, 2003
    • Ratinov and Roth, 2009
    • Collobert et al., 2011
    • Huang et al., 2015
    • Ma and Hovy, 2016
    • Peters et al., 2018
    • Devlin et al., 2019
  4. Model Structure (QA-based Sequence Labeling, QASL)
    Input: [CLS] question [SEP] passage [SEP], where the passage is 病患於西元2019年10月5日至本院入院急診,於10月7日出院。10月16日、10月21日至本院門診追蹤治療。 (The patient was admitted to the hospital and sent to the emergency room on Oct. 5, 2019. Then, he was discharged on Oct. 7. He went to the hospital for follow-up treatment on Oct. 16 and Oct. 21.)
    Model: BERT Encoder, followed by a Feed Forward Neural Network (Bert-QA) or by a Bi-LSTM + CRF (Bert-QA + BiLSTM-CRF).
    Output: one BIO tag sequence over the passage for each question.
    (1) 入院日期 (Admission Date): O O O B I I I I I I I, then O for all remaining tokens (tags 西元2019年10月5日)
    (2) 急診日期 (Emergency Date): O O O B I I I I I I I, then O for all remaining tokens (tags 西元2019年10月5日)
    (3) 出院日期 (Discharge Date): B I I I on 10月7日, O elsewhere
    (4) 門診日期 (Outpatient Date): B I I I on 10月16日 and B I I I on 10月21日, O elsewhere
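
The input layout on this slide, one "[CLS] question [SEP] passage [SEP]" sequence per entity type, can be sketched in plain Python. The `QUESTIONS` mapping, the `build_input` helper, the character-level question tokens, and the shortened passage are assumptions for illustration only. A real pipeline would use a BERT tokenizer instead.

```python
from typing import Dict, List, Tuple

# Assumed question set: the Chinese name of each entity label is used as the question.
QUESTIONS: Dict[str, str] = {
    "Admission Date": "入院日期",
    "Emergency Date": "急診日期",
    "Discharge Date": "出院日期",
    "Outpatient Date": "門診日期",
}


def build_input(question: str, passage_tokens: List[str]) -> Tuple[List[str], List[int], int]:
    """Return (tokens, segment_ids, passage_offset) for one question-passage pair."""
    question_tokens = list(question)  # assumption: one token per Chinese character
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the passage and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    passage_offset = len(question_tokens) + 2  # index of the first passage token
    return tokens, segment_ids, passage_offset


# Shortened passage (於10月7日出院。) with an assumed segmentation.
passage = ["於", "10", "月", "7", "日", "出", "院", "。"]
batch = {label: build_input(question, passage) for label, question in QUESTIONS.items()}
tokens, segment_ids, offset = batch["Discharge Date"]
print(tokens)
print(segment_ids)
```

The returned offset marks where the passage starts, so BIO tags predicted for the question part can be dropped before decoding entities.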
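  5. Query Generation (Ask an appropriate question!)
    • Each question carries prior knowledge about the entity label, so using an appropriate question is crucial.
    • The annotation guidelines used to build the training dataset are the foundation for constructing questions.
    • This paper uses the Chinese name of each entity label directly as the question.
    • Searching the annotation guidelines for the best question incurs extra cost.
    • Questions constructed this way apply only to a specific dataset and do not generalize.
    (Li et al., 2020)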
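  6. Dataset and Results
                                         Flat NER dataset      Nested NER dataset
                                         (flat entities)       (flat and nested entities)
    Number of medical records            4,328                 7,907
    Average characters per record        70.43                 76.08
    Number of flat entities              21,616                43,577
    Number of nested entities            0                     6,978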
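  7. Conclusion
    • A single QASL-based model is enough to handle both flat-entity and nested-entity datasets.
    • Each question carries prior knowledge about the entity label, and introducing this knowledge helps the model learn more.
    • The data is augmented and the number of negative samples (samples with no answer) grows, which may also lead to data imbalance.
    • Multiple entities can be extracted without knowing in advance how many need to be extracted.
    • https://github.com/allenyummy/EHR_NER
    • https://speakerdeck.com/allenyummy/rocling-2021-nested-named-entity-recognition-for-chinese-electronic-health-records-with-qa-based-sequence-labeling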
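
The remark on negative samples can be illustrated with a small sketch that expands one annotated record into one training example per question. The helper `to_qasl_examples`, the annotation format, and the toy record are assumptions made here, not taken from the paper or its repository.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, label), end exclusive


def to_qasl_examples(num_tokens: int, annotations: List[Annotation],
                     labels: List[str]) -> Dict[str, List[str]]:
    """Create one BIO tag sequence per label; labels with no entity stay all-O."""
    examples = {label: ["O"] * num_tokens for label in labels}
    for start, end, label in annotations:
        examples[label][start] = "B"
        for i in range(start + 1, end):
            examples[label][i] = "I"
    return examples


labels = ["Admission Date", "Emergency Date", "Discharge Date", "Outpatient Date"]
# Toy record 於10月7日出院。 segmented into 8 tokens, with a single discharge date.
record_tokens = ["於", "10", "月", "7", "日", "出", "院", "。"]
annotations = [(1, 5, "Discharge Date")]

examples = to_qasl_examples(len(record_tokens), annotations, labels)
negatives = [label for label, tags in examples.items() if "B" not in tags]
print(len(examples), "examples,", len(negatives), "without an answer")
```

In this toy case one record becomes four question-passage examples, three of which contain no answer, which shows how the expansion can skew the data toward negative samples.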
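  8. Stay Tuned!
    江侑倫 (Yu-Lun Chiang)
    National Taiwan University, Department and Graduate Institute of Biomechatronics Engineering
    Academia Sinica, Institute of Information Science
    CTBC Bank, Data Intelligence R&D Division
    Email: [email protected]
    GitHub: allenyummy
    https://allenyummy.github.io
    https://allenyummy.medium.com
    https://medium.com/allenyummy-note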