
Character-to-Word Attention for Word Segmentation

Presentation slides for the invited paper session at the 27th Annual Meeting of the Association for Natural Language Processing (NLP2021), March 18, 2021.

shigashiyama

March 30, 2022

Transcript

  1. Character-to-Word Attention for Word Segmentation Shohei Higashiyama1,2, Masao Utiyama1, Eiichiro

    Sumita1, Masao Ideuchi1,2, Yoshiaki Oida3, Yohei Sakamoto4, Isaac Okada3, Yuji Matsumoto5 (1: NICT, 2: NAIST, 3: Fujitsu, 4: Ridgelinez, 5: RIKEN). NLP2021, Invited Paper. The paper is available at https://www.jstage.jst.go.jp/article/jnlp/27/3/27_499/_article/-char/en
  2. Character-to-Word Attention for Word Segmentation ◆We proposed a neural word

    segmenter that incorporates word information into character-level sequence labeling. • Unlike related work (Wang+ 2017; Yang+ 2019), our model learns the importance of all candidate words with an attention mechanism. ◆Our main contribution is a comprehensive analysis, showing that: • Our method achieved robust performance on out-of-domain data. • Learning appropriate attention weights contributed to accurate segmentation. [Figure: model overview. For each character xi of the input sentence x (彼 は 日 本 人), candidate words {wj} (e.g., 日本, 本, 本人) are looked up, their embeddings ewj are attended from the char context vector hi and aggregated into a word summary vector ai, which is used to predict the label sequence y (S S B E S → 彼|は|日本|人, "He is a Japanese person.").]
  3. Word Segmentation (WS) - Fundamental task to segment a character

    sequence into words for unsegmented languages. - Segmentation accuracy matters because segmented results are used by many downstream NLP applications. - WS faces an ambiguity problem and an unknown-word problem. - Typical formulation: character-level sequence labeling. Input sentence: 彼は日本人だ。 ("He is a Japanese person.") Output: 彼 | は | 日本 | 人 | だ | 。 with labels S S B E S S S (B: beginning of word, I: inside of word, E: end of word, S: single-character word). Ambiguity example: around 本, the words 日本 (Japan), 本 (book), and 本人 (the person) are all possible.
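To make the labeling formulation concrete, here is a minimal sketch (not the authors' code; the function name is hypothetical) that converts a gold segmentation into per-character BIES labels:

    def to_bies_labels(words):
        """Convert a segmented sentence (a list of words) into character-level
        BIES labels: B=beginning, I=inside, E=end, S=single-character word."""
        labels = []
        for word in words:
            if len(word) == 1:
                labels.append("S")
            else:
                labels.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
        return labels

    # Example from the slide: 彼|は|日本|人|だ|。 -> S S B E S S S
    print(to_bies_labels(["彼", "は", "日本", "人", "だ", "。"]))
    # ['S', 'S', 'B', 'E', 'S', 'S', 'S']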
  4. Proposed Model Architecture ◆Char-based model with char-to-word attention to learn

    the importance of candidate words. [Figure: architecture. Character embeddings of 彼 は 日 本 人 feed a BiLSTM that produces char context vectors hi; candidate words for each character (e.g., 本, 日本, 本人) are looked up in the word vocab, their word embeddings ewj are attended and aggregated into a word summary vector ai, and a CRF layer predicts the labels S S B E S.]
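As a hedged illustration of how candidate words could be collected for each character (an assumption about the exact procedure, not the released implementation), the following sketch returns all vocabulary words of length up to max_word_length that cover character position i:

    def candidate_words(sentence, i, vocab, max_word_length=4):
        """Vocabulary words (substrings of the sentence) with length
        <= max_word_length that contain the character at position i."""
        cands = []
        for start in range(max(0, i - max_word_length + 1), i + 1):
            for end in range(i + 1, min(len(sentence), start + max_word_length) + 1):
                word = sentence[start:end]
                if word in vocab:
                    cands.append(word)
        return cands

    vocab = {"本", "日本", "本人", "は日本", "日本人", "本人だ"}
    print(candidate_words("彼は日本人だ。", 3, vocab))   # candidates for 本 (index 3)
    # ['は日本', '日本', '日本人', '本', '本人', '本人だ']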
  5. Character-to-Word Attention

    For the input sentence 彼は日本人だ。, the candidate words for the character 本 looked up in the word vocab (max word length = 4) are 本, 日本, 本人, は日本, 日本人, and 本人だ. The attention weight of candidate word wj for character xi is computed from the char context vector hi and the word embedding e^w_j as α_ij = exp(h_i^T W e^w_j) / Σ_k exp(h_i^T W e^w_k). The weighted word embeddings are then aggregated into the word summary vector ai by either WAVG (weighted average) or WCON (weighted concat).
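A minimal numerical sketch of the attention formula above (NumPy, toy shapes; not the released model code). WAVG is shown; WCON would concatenate the weighted candidate vectors instead of averaging them:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def word_summary_wavg(h_i, E_w, W):
        """Char-to-word attention, WAVG variant.
        h_i : (d_c,)   context vector of character x_i from the BiLSTM
        E_w : (n, d_w) embeddings e^w_j of the n candidate words for x_i
        W   : (d_c, d_w) bilinear attention parameter
        alpha_ij = softmax_j(h_i^T W e^w_j);  a_i = sum_j alpha_ij * e^w_j
        """
        scores = E_w @ (W.T @ h_i)   # (n,) one score per candidate word
        alpha = softmax(scores)      # attention weights alpha_ij
        a_i = alpha @ E_w            # (d_w,) weighted average of word embeddings
        return a_i, alpha

    rng = np.random.default_rng(0)
    h_i, E_w, W = rng.normal(size=8), rng.normal(size=(6, 5)), rng.normal(size=(8, 5))
    a_i, alpha = word_summary_wavg(h_i, E_w, W)
    print(alpha.round(3), a_i.shape)   # six weights summing to 1, and a (5,)-vector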
  6. Construction of Word Vocabulary

    - The word vocabulary comprises words in the training set and auto-segmented words obtained by decoding unlabeled text with the baseline BiLSTM-CRF (min word freq = 5). - Word embeddings are pre-trained with Word2Vec on the segmented text. [Figure: the baseline is trained on the training set, decodes the unlabeled text, and the resulting segmented text is used to pre-train the word embeddings.]
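A minimal sketch of this vocabulary construction (function and variable names are assumptions; the slide does not state whether the frequency cut-off also applies to training words, so here it is applied only to the auto-segmented text):

    from collections import Counter

    def build_word_vocab(train_sentences, auto_segmented_sentences, min_freq=5):
        """train_sentences / auto_segmented_sentences: lists of word lists.
        Vocab = all training-set words + auto-segmented words seen >= min_freq times."""
        vocab = {w for sent in train_sentences for w in sent}
        counts = Counter(w for sent in auto_segmented_sentences for w in sent)
        vocab |= {w for w, c in counts.items() if c >= min_freq}
        return vocab

    # Word embeddings for the vocabulary are then pre-trained with Word2Vec
    # on the auto-segmented text (not shown here).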
  7. Experimental Datasets ◆Training/Evaluation datasets - Balanced Corpus of Contemporary Written

    Japanese (BCCWJ) - Japanese Dependency Corpus (JDC) - JUMAN Mixed Corpus (JMC; KTC+KWDLC+KNBC) ◆Unlabeled text - BCCWJ non-core data for pre-training word embeddings (5.9M sentences) 7
  8. Experimental Datasets ◆Training/Evaluation datasets - Balanced Corpus of Contemporary Written

    Japanese (BCCWJ) - Japanese Dependency Corpus (JDC) - JUMAN Mixed Corpus (JMC; KTC+KWDLC+KNBC) ◆Unlabeled text - BCCWJ non-core data for pre-training word embeddings (5.9M sentences) 8 All data was used as test sets.
  9. Experiments ◆Settings - Models were trained with SGD and early-stopping

    on dev sets. - Hyperparameters • num_BiLSTM_layers=2, num_BiLSTM_units=600, char/word_emb_dim=300, RNN_dropout_rate=0.4, min_word_freq=5, max_word_length=4, etc. ◆Evaluation 1. Comparison of baseline and proposed model variants 2. Comparison with existing methods on in/cross-domain datasets 3. Effect of semi-supervised learning 4. Effect of word frequency and length 5. Effect of attention for segmentation performance 6. Analysis of segmentation examples 7. Effect of additional word embeddings from target domains 9
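For concreteness, the listed hyperparameters could be collected into a configuration such as the following (a sketch; the key names are not those of the released code):

    config = {
        "num_bilstm_layers": 2,
        "num_bilstm_units": 600,
        "char_emb_dim": 300,
        "word_emb_dim": 300,
        "rnn_dropout_rate": 0.4,
        "min_word_freq": 5,
        "max_word_length": 4,
        "optimizer": "SGD",   # trained with early stopping on the dev sets
    }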
  10. Result 1: Comparison of Model Variants ◆Word-level F1 on development

    sets (mean of three runs) - Word-integrated models outperformed BASE by approximately 0.1–0.7 points. - Attention-enhanced models outperformed their non-attention counterparts in 7 of 8 cases. - WCON performed better than the other variants even with similar model size. [Table notes: BASE is the BiLSTM-CRF baseline; a brace in the table groups the attention-based variants; † significant at the 0.01 level over the baseline; ‡ significant at the 0.01 level over the variant without attention.]
  11. Result 1: Comparison of Model Variants ◆Word-level F1 on development

    sets (mean of three runs) - Word-integrated models outperformed BASE by approximately 0.1–0.7 points. - Attention-enhanced models outperformed their non-attention counterparts in 7 of 8 cases. - WCON performed better than the other variants even with similar model size. [Table notes: BASE is the BiLSTM-CRF baseline; † significant at the 0.01 level over the baseline; ‡ significant at the 0.01 level over the variant without attention; the highlighted variants have similar model size.]
  12. Result 2: Comparison with Existing Methods ◆Word-level F1 on test

    sets (mean of three runs) - WCON achieved better performance than existing methods, especially for domains with higher OOV rates.
  13. Result 2: Comparison with Existing Methods ◆Word-level F1 on test

    sets (mean of three runs) - WCON achieved better performance than existing methods, especially for domains with higher OOV rates. [Table annotation: Avg(target F1) − Avg(source F1).]
  14. Result 2: Comparison with Existing Methods ◆Word-level F1 on test

    sets (mean of three runs) - WCON achieved better performance than existing methods, especially for domains with higher OOV rates. [Table annotations: "decreased by 3.5 / 7.4 / 5.7 / 2.6–3.7" and "improved by 0.5 / 2.1 / 1.2 / 0.8–1.0".]
  15. Result 3: Effect of semi-supervised learning ◆Word-level F1 on test

    sets (mean of three runs) - BASE with self-training achieved small improvements but still underperformed the proposed method. - The proposed model with randomly initialized word embeddings performed substantially worse than the one with pre-trained embeddings.
  17. Result 4: Effect of word frequency and length ◆Word-level F1

    on test sets (mean of three runs) - The proposed model with a smaller frequency threshold and a larger length threshold achieved better performance, especially on target domains.

    Min word freq (word vocab):
      min_freq      Avg. of 4 src doms   Avg. of 7 tgt doms
      1             98.44                95.69
      5 (default)   98.44                95.60
      10            98.38                95.56

    Max word length:
      max_len       JDC news+   JDC patent   JMC Web   JMC dining
      1             98.12       94.21        97.01     93.33
      2             98.32       96.04        97.25     93.76
      3             98.39       96.53        97.49     93.98
      4 (default)   98.49       96.61        97.49     94.34
      5             98.51       96.58        97.57     94.48
      6             98.48       96.68        97.66     94.46
      Word rate for len(w) ≥ 5:  0.94        1.72      2.54      1.96
  18. Result 5: Effect of Attention for Segmentation

    Two analyses: (i) character-level segmentation/attention accuracy for each case of attention possibility, and (ii) character-level segmentation accuracy of trained models for different "correct attention probability" thresholds pt, where a draw p ~ Uniform(0,1) per character decides (p ≥ pt vs. p < pt) which attention distribution over the candidate words (e.g., 本, 日本, 本人, with weights such as 0.1 / 0.1 / 0.8) is fed to the model.
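As a small illustration of what "attention accuracy" can mean here (an assumed definition for this sketch, not necessarily the paper's exact metric), attention for a character counts as correct when the candidate word with the highest weight is the gold word covering that character:

    def attention_is_correct(alpha, candidates, gold_word):
        """alpha: attention weights over the candidate words of one character;
        candidates: the candidate word strings; gold_word: the gold word that
        covers this character in the reference segmentation."""
        best = max(range(len(candidates)), key=lambda j: alpha[j])
        return candidates[best] == gold_word

    # For the character 本 in 彼|は|日本|人|だ|。 the gold word is 日本.
    cands = ["は日本", "日本", "日本人", "本", "本人", "本人だ"]
    print(attention_is_correct([0.05, 0.60, 0.10, 0.10, 0.10, 0.05], cands, "日本"))  # True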
  20. Result 6: Segmentation Examples

    (a) Gold: 色 | 収差 | 係数 (color | aberration | coefficient); WCON: 色 | 収差 | 係数
    (b) Gold: 干し | しいたけ (dried | shiitake); WCON: 干し | しいたけ
    (c) Gold: ほうれん | 草 (spinach); WCON: ほう | れん草; WCON + gold attention: ほうれん | 草
    [Figure: character-to-word attention weights and segmentation results for each example; gold words are marked and the most-attended word for each character is shown in bold.]
  21. Conclusion - We proposed a neural word segmenter with an

    attention mechanism, which incorporates word information into a character-level sequence labeling framework. - Our analysis shows that: 1. The proposed method, WCON, achieved better performance than existing methods, especially for domains with higher OOV rates. 2. Learning appropriate attention weights contributed to accurate segmentation. - In future work, we will develop more robust methods for various kinds of text, including user-generated text. The code is available at https://github.com/shigashiyama/seikanlp
  22. References ⁃ Higashiyama, S., Utiyama, M., Sumita, E., Ideuchi, M.,

    Oida, Y., Sakamoto, Y., and Okada, I. (2019). “Incorporating Word Attention into Character-Based Word Segmentation.” NAACL-HLT, pp. 2699–2709. ⁃ Kitagawa, Y. and Komachi, M. (2018). “Long Short-Term Memory for Japanese Word Segmentation.” PACLIC, pp. 279–288. ⁃ Neubig, G., Nakata, Y., and Mori, S. (2011). “Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis.” ACL-HLT, pp. 529–533. ⁃ Wang, C. and Xu, B. (2017). “Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation.” IJCNLP, pp. 163–172. ⁃ Yang, J., Zhang, Y., and Liang, S. (2019). “Subword Encoding in Lattice LSTM for Chinese Word Segmentation.” NAACL-HLT, pp. 2720–2725.