Slide 1

Word Segmentation and Lexical Normalization for Unsegmented Languages. Doctoral Defense, December 16, 2021. Shohei Higashiyama, NLP Lab, Division of Information Science, NAIST

Slide 2

These slides are a slightly modified version of those used for the author’s doctoral defense at NAIST on December 16, 2021. The major contents were taken from the following papers.
• [Study 1] Higashiyama et al., “Incorporating Word Attention into Character-Based Word Segmentation”, NAACL-HLT, 2019. https://www.aclweb.org/anthology/N19-1276
• [Study 1] Higashiyama et al., “Character-to-Word Attention for Word Segmentation”, Journal of Natural Language Processing, 2020 (Paper Award). https://www.jstage.jst.go.jp/article/jnlp/27/3/27_499/_article/-char/en
• [Study 2] Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word Segmentation”, Journal of Natural Language Processing, 2020. https://www.jstage.jst.go.jp/article/jnlp/27/3/27_573/_article/-char/en
• [Study 3] Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021. https://www.aclweb.org/anthology/2021.naacl-main.438/
• [Study 4] Higashiyama et al., “A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021 (Best Paper Award). https://aclanthology.org/2021.wnut-1.9/
2

Slide 3

Overview ◆Research theme - Word Segmentation (WS) and Lexical Normalization (LN) for Unsegmented Languages ◆Studies in this dissertation [Study 1] Japanese/Chinese WS for general domains [Study 2] Japanese/Chinese WS for specialized domains [Study 3] Construction of a Japanese user-generated text (UGT) corpus for WS and LN [Study 4] Japanese WS and LN for UGT domains ◆Structure of this presentation - Background → Details of each study → Conclusion 3

Slide 4

Background (1/4) ◆Segmentation/Tokenization - The (almost always) necessary first step of NLP, which segments a sentence into tokens ◆Word - Human-understandable unit - Processing unit of traditional NLP - Mandatory unit for linguistic analysis (e.g., parsing and predicate-argument structure (PAS) analysis) - Useful as a feature or as an intermediate unit for subwords in application-oriented tasks (e.g., NER and MT) 4 Example: ニューラルネットワークによる自然言語処理 ‘Natural language processing based on neural networks’. Char: ニ,ュ,ー,ラ,ル,ネ,ッ,ト,ワ,ー,ク,に,よ,る,自,然,言,語,処,理; Subword: ニュー,ラル,ネット,ワーク,による,自然,言語,処理; Word: ニューラル,ネットワーク,に,よる,自然,言語,処理

Slide 5

Background (2/4) ◆Word Segmentation (WS) in unsegmented languages - Task to segment sentences into words using annotated data based on a segmentation standard - Nontrivial task because of the ambiguity problem - Segmentation accuracy degrades in domains w/o sufficient labeled data mainly due to the unknown word problem. Research issue 1 - How to achieve high accuracy in various text domains, including those w/o labeled data 5 彼は日本人だ 彼 | は | 日本 | 人 | だ ‘He is a Japanese.’ 日本 ‘Japan’ 本 ‘book’ 本人 ‘the person’ ?

Slide 6

Background (3/4) ◆Effective WS approaches for different domain types - [Study 1] General domains: Use of labeled data (and other resources) - [Study 2] Specialized domains: Use of general-domain labeled data and target-domain resources - [Study 3&4] User-generated text (UGT): Handling nonstandard words → Lexical normalization 6
Domain type | Example | Labeled data | Unlabeled data | Lexicon | Other characteristics
General dom. | News | ✓ | ✓ | ✓ | -
Specialized dom. | Scientific documents | ✕ | ✓ | △ | -
UGT dom. | Social media | ✕ | ✓ | △ | Nonstandard words
(✓: available, △: sometimes available, ✕: almost unavailable)

Slide 7

Background (4/4) ◆The frequent use of nonstandard words in UGT - Examples: オハヨー ohayoo ‘good morning’ (standard form おはよう), すっっげええ suggee ‘awesome’ (standard form すごい) - Achieving accurate WS and downstream processing is difficult; e.g., for the raw input 日本語まぢムズカシイ, two online translators (A, B) output “Japanese Majimu Zukashii” and “Japanese Majimuzukashii”, whereas for the normalized 日本 語 まじ 難しい/むずかしい they output “Japanese is really difficult” and “Japanese is difficult”. ◆Lexical Normalization (LN) - Task to transform nonstandard words into standard forms - Main problem: the lack of public labeled data for evaluating and training Japanese LN models Research issue 2 - How to train/evaluate WS and LN models for Japanese UGT under the low-resource situation 7

Slide 8

Contributions of This Dissertation (1/2) 1. How to achieve accurate WS in various text domains - We proposed effective approach for each of three domain types. ➢Our methods can be effective options to achieve accurate WS and downstream tasks in these domains. 8 General domains Specialized domains UGT domains [Study 1] Neural model combining character and word features [Study 2] Auxiliary prediction task based on unlabeled data and lexicon [Study 4] Joint prediction of WS and LN

Slide 9

Contributions of This Dissertation (1/2) 2. How to train/evaluate WS and LN models for Ja UGT - We constructed manually/automatically-annotated corpora. ➢Our evaluation corpus can be a useful benchmark to compare and analyze existing and future systems. ➢Our LN method can be a good baseline to develop more practical Japanese LN methods in future. 9 UGT domains [Study 3] Evaluation corpus annotation [Study 4] Pseudo-training data generation

Slide 10

Overview of Studies in This Dissertation ◆I focused on improving WS and LN accuracy for each domain type. [Diagram: corpus annotation for fair evaluation (Study 3) as a prerequisite; development of more accurate models for general, specialized, and UGT domains (Studies 1, 2, 4); development of faster models as a further performance dimension.] 10

Slide 11

Study 1: Word Segmentation for General Domains Higashiyama et al., “Incorporating Word Attention into Character-Based Word Segmentation”, NAACL-HLT, 2019 Higashiyama et al., “Character-to-Word Attention for Word Segmentation”, Journal of Natural Language Processing, 2020 (Paper Award)

Slide 12

Study 1: WS for General Domains ◆Goal: Achieve more accurate WS in general domains ◆Background - Limited effort has been devoted to leveraging complementary character and word information for neural WS. ◆Our contributions - Proposed a char-based model incorporating word information - Achieved performance better than or competitive with existing SOTA models on Japanese and Chinese datasets 12 [Comparison for テキストの分割: char-based models (テ|キ|ス|ト|の|分|割 → テキスト|の|分割) allow efficient prediction via first-order sequence labeling, while word-based models (テキスト の 分割) allow easy use of word-level info.]

Slide 13

[Study 1] Proposed Model Architecture ◆Char-based model with char-to-word attention to learn the importance of candidate words 13 [Architecture diagram: character embeddings of the input sentence (e.g., 彼は日本人) are looked up and encoded by a BiLSTM into char context vectors hi; candidate words from the word vocab (e.g., 本, 日本, 本人 for the character 本) are attended to via their word embeddings ew_j and aggregated into a word summary vector ai, which is fed together with hi into a further BiLSTM-CRF that outputs segmentation labels (S S B E S).]

Slide 14

[Study 1] Character-to-Word Attention 14 - For each character of the input sentence (e.g., 本 in 彼は日本人だ。), candidate words containing it are looked up from the word vocab (e.g., 本, 日本, 本人, は日本, 日本人, 本人だ; max word length = 4). - Attention weights over the candidate-word embeddings e^w_j are computed from the char context vector h_i as α_ij = exp(h_i^T W e^w_j) / Σ_k exp(h_i^T W e^w_k). - The weighted embeddings α_ij e^w_j are aggregated into a word summary vector a_i by either WAVG (weighted average) or WCON (weighted concat).
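The attention step can be illustrated with a short NumPy sketch (an illustrative reconstruction, not the authors' implementation; the bilinear parameter W, the dimension sizes, and the candidate-word lookup are assumptions). It scores each candidate word containing the current character by h_i^T W e^w_j, normalizes the scores with a softmax to obtain α_ij, and computes the WAVG summary vector as the weighted average of the word embeddings; the WCON variant instead concatenates the weighted embeddings by candidate position.

```python
import numpy as np

def char_to_word_attention(h_i, cand_word_embs, W):
    """Attention over the candidate words that contain one character.

    h_i:            (d_c,)     char context vector from the BiLSTM
    cand_word_embs: (J, d_w)   embeddings e^w_j of the J candidate words
    W:              (d_c, d_w) bilinear attention parameter (assumed shape)
    Returns attention weights alpha (J,) and the WAVG summary vector a_i (d_w,).
    """
    scores = cand_word_embs @ (W.T @ h_i)           # s_j = h_i^T W e^w_j
    scores -= scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over candidate words
    a_i = alpha @ cand_word_embs                    # weighted average (WAVG)
    return alpha, a_i

# Toy usage: 3 candidate words (e.g., 本, 日本, 本人) for one character.
rng = np.random.default_rng(0)
h_i = rng.normal(size=600)            # e.g., 2 x 300 BiLSTM units (illustrative)
cands = rng.normal(size=(3, 300))     # word embedding dim 300
W = rng.normal(size=(600, 300))
alpha, a_i = char_to_word_attention(h_i, cands, W)
print(alpha.shape, round(float(alpha.sum()), 6), a_i.shape)   # (3,) 1.0 (300,)
```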

Slide 15

[Study 1] Construction of Word Vocabulary 15 - The word vocabulary comprises words in the training set and words appearing in unlabeled text automatically segmented (decoded) by the trained BiLSTM-CRF baseline (min word freq = 5). - Word embeddings are pre-trained on the automatically segmented text with Word2Vec.
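A minimal sketch of this vocabulary construction, under assumptions not taken from the released code (the frequency threshold is applied to the automatically segmented text, words longer than the maximum candidate length are dropped, and sentences are given as lists of word strings):

```python
from collections import Counter

def build_word_vocab(train_sents, auto_segmented_sents, min_freq=5, max_len=4):
    """Word vocab = gold training words + frequent words from auto-segmented text.

    train_sents, auto_segmented_sents: lists of sentences, each a list of words.
    """
    vocab = {w for sent in train_sents for w in sent if len(w) <= max_len}
    auto_counts = Counter(w for sent in auto_segmented_sents for w in sent)
    vocab |= {w for w, c in auto_counts.items() if c >= min_freq and len(w) <= max_len}
    return vocab

# Toy usage: 新語 occurs only once in the auto-segmented text and is filtered out.
train = [["彼", "は", "日本", "人", "だ"]]
auto = [["日本", "語"]] * 5 + [["新語"]]
print(sorted(build_word_vocab(train, auto)))
```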

Slide 16

[Study 1] Experimental Datasets ◆Training/Test data - Chinese: 2 source domains - Japanese: 4 source domains and 7 target domains ◆Unlabeled text for pre-training word embeddings - Chinese: 48M sentences in Chinese Gigaword 5 - Japanese: 5.9M sentences in BCCWJ non-core data 16

Slide 17

[Study 1] Experimental Settings ◆Hyperparameters - num_BiLSTM_layers=2 or 3, num_BiLSTM_units=600, char/word_emb_dim=300, min_word_freq=5, max_word_length=4, etc. ◆Evaluation 1. Comparison of baseline and proposed model variants (and analysis on model size) 2. Comparison with existing methods on in-domain and cross-domain datasets 3. Effect of semi-supervised learning 4. Effect of word frequency and length 5. Effect of attention for segmentation performance 6. Effect of additional word embeddings from target domains 7. Analysis of segmentation examples 17

Slide 18

[Study 1] Exp 1. Comparison of Model Variants ◆F1 on development sets (mean of three runs) - Word-integrated models outperformed BASE by up to 1.0 points (significant in 20 of 24 cases). - Attention-based models outperformed their non-attention counterparts in 10 of 12 cases (significant in 4 cases). - WCON achieved the best performance, possibly because it preserves word length and char position information. 18 † significant at the 0.01 level over the baseline ‡ significant at the 0.01 level over the variant w/o attention

Slide 19

[Study 1] Exp 2. Comparison with Existing Methods ◆F1 on test sets (mean of three runs) - WCON achieved performance better than or competitive with existing methods. (More recent work has achieved further improvements on Chinese datasets.) 19

Slide 20

[Study 1] Exp 5. Effect of Attention on Segmentation 20 ◆Character-level accuracy on BCCWJ-dev (most frequent cases where both correct and incorrect candidate words exist for a character) - Segmentation label accuracy: 99.54% - Attention accuracy for proper words: 93.25% ◆In an analysis where the attention distribution was forced onto the correct candidate word (e.g., 日本) with probability pt and onto an incorrect one otherwise (p ~ Uniform(0,1); weight 0.8 on the chosen word), segmentation accuracy of the trained model increased for larger “correct attention probability” pt.

Slide 21

[Study 1] Conclusion - We proposed a neural word segmenter with attention, which incorporates word information into a character-level sequence labeling framework. - Our experiments showed that • the proposed method, WCON, achieved performance better than or competitive to existing methods, and • learning appropriate attention weights contributed to accurate segmentation. 21

Slide 22

Study 2: Word Segmentation for Specialized Domains Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word Segmentation”, Journal of Natural Language Processing, 2020

Slide 23

Study 2: WS for Specialized Domains ◆Goal - Improve WS performance for specialized domains where labeled data is unavailable ➢Our focus: how to use linguistic resources in the target domain ◆Our contributions - Proposed a WS method to learn signals of word occurrences from unlabeled sentences and a lexicon (in the target domain) - Our method improved performance for various Chinese and Japanese target domains. 23 Specialized domains: labeled data ✕, unlabeled data ✓, lexicon △ (✓: available, △: sometimes available, ✕: almost unavailable)

Slide 24

[Study 2] Cross-Domain WS with Linguistic Resources ◆Methods for Cross-Domain WS - Modeling lexical information: lexicon features; neural representation learning ((Liu+ 2019), (Gan+ 2019), Ours) - Modeling statistical information: generating pseudo-labeled data ((Neubig+ 2011), (Zhang+ 2018)) ◆Our Model - To overcome the limitation of lexicon features, we model lexical information via an auxiliary task for neural models. ➢Assumption: Source domain has labeled data ✓, unlabeled data ✓, lexicon ✓; Target domain has labeled data ✕, unlabeled data ✓, lexicon ✓ 24

Slide 25

[Study 2] Lexicon Feature - Binary features mark, for each character, whether it can be the beginning [B], inside [I], end [E], or the whole [S] of some lexicon word. - Example (source): 週末の外出自粛要請 ‘self-restraint request in the weekend’ (gold labels B E S B E B E B E) with lexicon {週,末,の,外,出,自,粛,要,請, 週末,外出,出自,自粛,要請, …} yields [B] 100111010, [I] 000000000, [E] 010011101, [S] 111111111. - Example (target): 長短期記憶ネットワーク ‘long short-term memory’ with lexicon {長,短,期,記,憶,長短,短期,記憶, ネット,ワーク,ネットワーク, …} yields [B] 11010100100, [I] 00000011110, [E] 01101001001, [S] 11111000000. - However, models cannot learn the relationship between feature values and segmentation labels in a target domain without labeled data. 25
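A small sketch of how such [B]/[I]/[E]/[S] match sequences can be generated (an illustration under the stated reading of the feature, not the dissertation's code): every substring up to the maximum word length is matched against the lexicon, and each character is marked 1 in the corresponding sequence if it begins, is inside, ends, or by itself forms a matched lexicon word. The same procedure can also produce the auxiliary labels used by the lexicon word prediction task on the next slide.

```python
def lexicon_match_features(sent, lexicon, max_len=4):
    """Return four 0/1 lists keyed by 'B', 'I', 'E', 'S', each of len(sent)."""
    n = len(sent)
    feats = {k: [0] * n for k in "BIES"}
    for i in range(n):
        for l in range(1, max_len + 1):
            if i + l > n or sent[i:i + l] not in lexicon:
                continue
            if l == 1:
                feats["S"][i] = 1              # single-char lexicon word
            else:
                feats["B"][i] = 1              # word begins at i
                feats["E"][i + l - 1] = 1      # word ends at i + l - 1
                for j in range(i + 1, i + l - 1):
                    feats["I"][j] = 1          # inside the word
    return feats

# The source-domain example from the slide.
lex = {"週", "末", "の", "外", "出", "自", "粛", "要", "請",
       "週末", "外出", "出自", "自粛", "要請"}
f = lexicon_match_features("週末の外出自粛要請", lex)
print("".join(map(str, f["B"])))   # 100111010
print("".join(map(str, f["E"])))   # 010011101
print("".join(map(str, f["S"])))   # 111111111
```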

Slide 26

[Study 2] Our Lexicon Word Prediction (LWP) - We introduce auxiliary tasks to predict whether each character corresponds to specific positions ([B], [I], [E], [S]) in lexicon words; these auxiliary labels are generated from the lexicon in the same way as the lexicon features above. - Because the auxiliary labels need no manual annotation, the model learns parameters also from target-domain unlabeled sentences (e.g., 長短期記憶ネットワーク with the target lexicon), in addition to source-domain labeled sentences (e.g., 週末の外出自粛要請, gold labels B E S B E B E B E). 26
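A condensed PyTorch sketch of the multi-task setup (a simplified stand-in, not the released model: layer sizes are illustrative, single linear heads stand in for the MLPs, and the loss weight lam is an assumed hyperparameter; batches are assumed to contain at least one labeled source sentence). Segmentation labels exist only for labeled source sentences, while the four binary LWP labels, generated from the lexicon as in the snippet above, exist for both source and target sentences, so unlabeled target text also updates the shared encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegWithLWP(nn.Module):
    def __init__(self, num_chars, emb_dim=300, hidden=300, num_seg_labels=4):
        super().__init__()
        self.emb = nn.Embedding(num_chars, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.seg_head = nn.Linear(2 * hidden, num_seg_labels)   # BIES segmentation
        self.lwp_head = nn.Linear(2 * hidden, 4)                # binary [B][I][E][S] match

    def forward(self, char_ids):
        h, _ = self.encoder(self.emb(char_ids))      # (batch, seq, 2 * hidden)
        return self.seg_head(h), self.lwp_head(h)

def training_loss(model, char_ids, seg_labels, lwp_labels, lam=1.0):
    """seg_labels: (B, T) long tensor, -100 where no gold segmentation exists
    (target-domain sentences); lwp_labels: (B, T, 4) float tensor in {0, 1}."""
    seg_logits, lwp_logits = model(char_ids)
    seg_loss = F.cross_entropy(seg_logits.transpose(1, 2), seg_labels,
                               ignore_index=-100)    # skipped for unlabeled chars
    lwp_loss = F.binary_cross_entropy_with_logits(lwp_logits, lwp_labels)
    return seg_loss + lam * lwp_loss                 # joint objective
```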

Slide 27

[Study 2] Methods and Experimental Data ◆Linguistic resources for training - Source domain labeled data - General and domain-specific unlabeled data - Lexicon: UniDic (JA) or Jieba (ZH) and semi-automatically constructed domain-specific lexicons (390K-570K source words & 0-134K target words) ◆Methods - Baselines: BiLSTM (BASE), BASE + self-training (ST), and BASE + lexicon feature (LF) - Proposed: BASE + MLPs for Segmentation and auxiliary LWP tasks 27 JNL: CS Journal; JPT, CPT: Patent; RCP: Recipe; C-ZX, P-ZX, FR, DL: Novel; DM: Medical

Slide 28

[Study 2] Experimental Settings ◆Hyperparameters - num_BiLSTM_layers=2, num_BiLSTM_units=600, char_emb_dim=300, num_MLP_units=300, min_word_len=1, max_word_len=4, etc. ◆Evaluation 1. In-domain results 2. Cross-domain results 3. Comparison with SOTA methods 4. Influence of the weight for the auxiliary loss 5. Results for non-adapted domains 6. Performance for unknown words 28

Slide 29

[Study 2] Exp 2. Cross-Domain Results ◆F1 on test sets (mean of three runs) - LWP-S (source) outperformed BASE and ST. - LWP-T (target) significantly outperformed the three baselines (+3.2 over BASE, +3.0 over ST, +1.2 over LF on average). - Results of LWP-O (oracle), which uses gold test words, indicate that higher-coverage lexicons would bring further improvements. 29 Japanese Chinese ★ significant at the 0.001 level over BASE † significant at the 0.001 level over ST ‡ significant at the 0.001 level over LF

Slide 30

[Study 2] Exp 3. Comparison with SOTA Methods ◆F1 on test sets - Our method achieved performance better than or competitive with SOTA methods on Japanese and Chinese datasets, including Higashiyama+’19 (our method in the first study). 30 Japanese Chinese

Slide 31

[Study 2] Exp 6. Performance for Unknown Words ◆Recall of the top 10 most frequent OOTV words - For out-of-training-vocabulary (OOTV) words in the test sets, our method achieved better recall for words in the lexicon (Ls ∪ Lt), but worse recall for words not in it. JPT (Patent) FR (Novel) 31 JA ZH

Slide 32

[Study 2] Conclusion - We proposed a cross-domain WS method to incorporate lexical knowledge via an auxiliary prediction task. - Our method achieved better performance for various target domains than the lexicon feature baseline and existing methods (while preventing performance degradation for source domains). 32

Slide 33

Study 3: Construction of a Japanese UGT corpus for WS and LN Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021

Slide 34

Study 3: UGT Corpus Construction ◆Background - The lack of a public evaluation corpus for Japanese WS and LN ◆Goal - Construct a public evaluation corpus for the development and fair comparison of Japanese WS and LN systems ◆Our contributions - Constructed a corpus of blog and Q&A forum text annotated with morphological and normalization information - Conducted a detailed evaluation of existing methods on UGT-specific problems 34 日本語まぢムズカシイ 日本語 まぢ ムズカシイ まじ 難しい/むずかしい ‘Japanese is really difficult.’

Slide 35

[Study 3] Corpus Construction Policies 1. Available and restorable - Use blog and Chiebukuro (Yahoo! Answers) sentences in the BCCWJ non-core data and publish the annotation information 2. Compatible with an existing segmentation standard - Follow NINJAL’s SUW (short unit word, 短単位) standard and extend the specification to cover nonstandard words 3. Enabling a detailed evaluation of UGT-specific phenomena - Organize frequently observed linguistic phenomena into several categories and annotate every token with a category 35

Slide 36

[Study 3] Example Sentence in Our Corpus 36 Raw sentence: イイ歌ですねェ ii uta desu nee ‘It’s a good song, isn’t it?’ Word boundaries: イイ | 歌 | です | ねェ. Part-of-speech: 形容詞 (adjective) | 名詞 (noun) | 助動詞 (auxiliary verb) | 助詞 (particle). Standard forms: 良い,よい,いい | - | - | ね. Categories: Char type variant | - | - | Sound change variant. Glosses: ii ‘good’ | uta ‘song’ | desu (copula) | nee (emphasis marker).

Slide 37

[Study 3] Corpus Details ◆Word categories - 11 categories were defined for non-general or nonstandard words that may often cause segmentation errors: 新語/スラング (neologism/slang), 固有名 (proper name), オノマトペ (onomatopoeia), 感動詞 (interjection), 方言 (dialect), 外国語 (foreign language), 顔文字/AA (emoticon/ASCII art), 異文字種 (character type variant), 代用表記 (alternative representation), 音変化 (sound change variant), 誤表記 (misspelling). Most of our categories overlap with (Kaji+ 2015)’s classification. ◆Corpus statistics 37

Slide 38

[Study 3] Experiments Using our corpus, we evaluated two existing systems trained only with an annotated corpus for WS and POS tagging. • MeCab (Kudo+ 2004) with UniDic v2.3.0 - A popular Japanese morphological analyzer based on CRFs • MeCab+ER (Expansion Rules) - Our MeCab-based implementation of (Sasano+ 2013)’s rule-based lattice expansion method, which dynamically adds nodes by human-crafted rules (figure cited from (Sasano+ 2014); example sentence: ‘It was delicious.’) 38

Slide 39

[Study 3] Experiments ◆Evaluation 1. Overall results 2. Results for each category 3. Analysis of segmentation results 4. Analysis of normalization results 39

Slide 40

[Study 3] Exp 1. Overall Performance 40 ◆Results - MeCab+ER achieved better Seg and POS performance than MeCab by 2.5-2.9 F1 points, but poor Norm recall.

Slide 41

[Study 3] Exp 2. Recall for Each Category 41 ◆Results - Both systems achieved high Seg and POS performance for general and standard words, but lower performance for UGT-characteristic words. - For Norm, MeCab+ER correctly normalized 30-40% of sound change variant (SCV) and alternative representation (AR) nonstandard words, but none of those in the other two categories.

Slide 42

[Study 3] Conclusion - We constructed a public Japanese UGT corpus annotated with morphological and normalization information. (https://github.com/shigashiyama/jlexnorm) - Experiments on the corpus demonstrated the limited performance of the existing systems for non-general and non-standard words. 42

Slide 43

Study 4: WS and LN for Japanese UGT Higashiyama et al., “A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021 (Best Paper Award)

Slide 44

Study 4: WS and LN for Japanese UGT ◆Goal - Develop a Japanese WS and LN model with better performance than existing systems, under the condition that labeled normalization data is unavailable ◆Our contributions - Proposed methods for generating pseudo-labeled data and a text editing-based method for Japanese WS, POS tagging, and LN - Achieved better normalization performance than an existing method 44

Slide 45

[Study 4] Background and Motivation ◆Frameworks for text generation ⚫ Text editing method for English lexical normalization (Chrupała 2014) ⚫ Encoder-Decoder model for Japanese sentence normalization (Ikeda+ 2016) ◆Our approach - Generate pseudo-labeled data for LN using lexical knowledge - Use a text editing-based model to learn efficiently from a small amount of (high-quality) training data 45

Slide 46

[Study 4] Task Formulation 46 ◆Formulation as multiple character-level sequence labeling tasks; e.g., for x = 日本語まぢムズカシー: ys (Seg) = B E S B E B I I I E; yp (POS) = Noun Noun Noun Adv Adv Adj Adj Adj Adj Adj; ye (SEdit) = KEEP KEEP KEEP KEEP REP(じ) KEEP KEEP KEEP KEEP REP(い); yc (CConv) = KEEP KEEP KEEP KEEP KEEP HIRA HIRA HIRA HIRA KEEP; together the normalization tags yield まぢ ⇒ まじ and ムズカシー ⇒ むずかしい. ◆Normalization tags for Japanese char sets - String edit operation (SEdit): {KEEP, DEL, INS_L(c), INS_R(c), REP(c)} (c: hiragana or katakana) - Character type conversion (CConv): {KEEP, HIRA, KATA, KANJI} ◆Kana-kanji conversion - A span tagged KANJI (e.g., あき in もうあきだ ‘It’s already autumn.’) is converted by a kana-kanji converter (n-gram LM) that selects among candidates such as 秋 ‘autumn’, 空き ‘vacancy’, 飽き ‘bored’.
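A sketch of how predicted tags are turned back into a normalized string (an illustrative decoder, not the released code; the tag encoding as Python tuples and the placeholder kana-kanji converter are assumptions). Kana type conversion uses the fixed Unicode offset between the hiragana and katakana blocks, and spans tagged KANJI are handed to an external converter.

```python
KATA_TO_HIRA = -0x60   # ord('あ') - ord('ア'): constant offset between kana blocks

def to_hira(c):
    return chr(ord(c) + KATA_TO_HIRA) if "ァ" <= c <= "ヶ" else c

def to_kata(c):
    return chr(ord(c) - KATA_TO_HIRA) if "ぁ" <= c <= "ゖ" else c

def apply_norm_tags(chars, sedit, cconv, kana_kanji=lambda s: s):
    """Apply per-character SEdit and CConv tags and return the normalized string.

    sedit tags: 'KEEP', 'DEL', ('INS_L', c), ('INS_R', c), ('REP', c)
    cconv tags: 'KEEP', 'HIRA', 'KATA', 'KANJI'  (KANJI spans go to kana_kanji())
    """
    out, kanji_buf = [], []

    def flush():
        if kanji_buf:
            out.append(kana_kanji("".join(kanji_buf)))
            kanji_buf.clear()

    for ch, e, c in zip(chars, sedit, cconv):
        if e == "KEEP":
            pieces = [ch]
        elif e == "DEL":
            pieces = []
        elif e[0] == "REP":
            pieces = [e[1]]
        elif e[0] == "INS_L":
            pieces = [e[1], ch]
        else:                          # ('INS_R', c)
            pieces = [ch, e[1]]
        for p in pieces:
            if c == "HIRA":
                flush()
                out.append(to_hira(p))
            elif c == "KATA":
                flush()
                out.append(to_kata(p))
            elif c == "KANJI":
                kanji_buf.append(p)    # buffered for kana-kanji conversion
            else:
                flush()
                out.append(p)
    flush()
    return "".join(out)

# Slide example: 日本語まぢムズカシー -> 日本語まじむずかしい
x = "日本語まぢムズカシー"
ye = ["KEEP"] * 4 + [("REP", "じ")] + ["KEEP"] * 4 + [("REP", "い")]
yc = ["KEEP"] * 5 + ["HIRA"] * 4 + ["KEEP"]
print(apply_norm_tags(x, ye, yc))
```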

Slide 47

[Study 4] Variant Pair Acquisition ◆Standard and nonstandard word variant pairs for pseudo-labeled data generation A) Dictionary-based: Extract variant pairs from UniDic with hierarchical lemma definition B) Rule-based: Apply hand-crafted rules to transform standard forms into nonstandard forms 47 … ⇒ 404K pairs ⇒ 47K pairs 6 out of 10 rules are similar to those in (Sasano+ 2013) and (Ikeda+ 2016).
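For illustration only, a tiny sketch of rule-based variant generation in the same spirit (these three rules are hypothetical examples, not the dissertation's actual ten rules; they mimic common UGT phenomena such as vowel prolongation and small-っ insertion):

```python
def rule_variants(standard):
    """Generate hypothetical nonstandard variants of a standard (hiragana) form.

    Illustrative rules only:
      1. word-final う -> prolonged sound mark ー    (おはよう -> おはよー)
      2. small っ inserted after the first character (すごい  -> すっごい)
      3. the whole word rewritten in katakana        (おはよう -> オハヨウ)
    """
    hira_to_kata = {chr(h): chr(h + 0x60) for h in range(0x3041, 0x3097)}
    variants = set()
    if standard.endswith("う"):
        variants.add(standard[:-1] + "ー")
    if len(standard) >= 2:
        variants.add(standard[0] + "っ" + standard[1:])
    variants.add("".join(hira_to_kata.get(c, c) for c in standard))
    variants.discard(standard)
    return sorted(variants)

print(rule_variants("おはよう"))   # ['おっはよう', 'おはよー', 'オハヨウ']
print(rule_variants("すごい"))     # ['すっごい', 'スゴイ']
```

Pairing each standard form with its generated variants yields (nonstandard, standard) pairs analogous to the rule-derived pairs above.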

Slide 48

[Study 4] Pseudo-labeled Data Generation ◆Input - (Auto-)segmented sentence x and a pair v of source (nonstandard) and target (standard) word variants (K=KEEP, H=HIRA, D=DEL, IR=INS_R) ⚫ Target-side distant supervision (DStgt): an actual sentence containing the nonstandard form is tagged so that it yields a synthetic target sentence; e.g., x = スゴく|気|に|なる “(I’m) very curious.”, v = (src スゴく, tgt すごく): ye = K K K K K K K, yc = H H K K K K K ⇒ すごく|気|に|なる. Pro: actual sentences can be used. ⚫ Source-side distant supervision (DSsrc): the nonstandard variant is substituted into an actual sentence containing the standard form to create a synthetic source sentence; e.g., x = ほんとう|に|心配 “(I’m) really worried.”, v = (src ほんっと, tgt ほんとう): synthetic source ほんっと|に|心配 with ye = K K D IR(う) K K K, yc = K K K K K K K ⇒ ほんとう|に|心配. Pro: any number of synthetic sentences can be generated. 48
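A rough sketch of the source-side case under simplifying assumptions (not the dissertation's generation script; the difflib alignment and the skip-on-failure policy are my own choices): the standard form in a segmented sentence is replaced by its nonstandard variant, and character-level SEdit tags that recover the original sentence are derived from the alignment. Character-type (CConv) tags are omitted for brevity, and pairs whose alignment does not fit the simple cases are skipped.

```python
from difflib import SequenceMatcher

def variant_edit_tags(src_word, tgt_word):
    """Derive per-character SEdit tags mapping src_word back to tgt_word.

    Handles only simple alignments (equal spans, same-length replacements,
    deletions, single-char insertions attached to the previous kept character);
    returns None otherwise so the caller can skip the pair.
    """
    tags = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, src_word, tgt_word).get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "delete":
            tags += ["DEL"] * (i2 - i1)
        elif op == "replace" and (i2 - i1) == (j2 - j1):
            tags += [("REP", c) for c in tgt_word[j1:j2]]
        elif op == "insert" and tags and tags[-1] == "KEEP" and (j2 - j1) == 1:
            tags[-1] = ("INS_R", tgt_word[j1])   # insert to the right of prev char
        else:
            return None
    return tags

def make_pseudo_example(segmented_sent, nonstd, std):
    """Substitute nonstd for std in a segmented sentence (list of words) and
    return (synthetic source characters, SEdit tags), or None on failure."""
    chars, tags = [], []
    for w in segmented_sent:
        src_w = nonstd if w == std else w
        w_tags = variant_edit_tags(src_w, w)
        if w_tags is None:
            return None
        chars += list(src_w)
        tags += w_tags
    return chars, tags

# Slide example: ほんとう|に|心配 with v = (ほんっと, ほんとう)
# -> synthetic source ほんっとに心配 with tags K K D INS_R(う) K K K
print(make_pseudo_example(["ほんとう", "に", "心配"], "ほんっと", "ほんとう"))
```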

Slide 49

[Study 4] Experimental Data ◆Pseudo-labeled data for training (and development) - Dict/Rule-derived variant pairs: Vd and Vr - BCCWJ: a mixed-domain corpus of news, blog, Q&A forum, etc.; Dt = core data with manual Seg&POS tags (57K sent.), Du = non-core data with auto Seg&POS tags (3.5M sent.) - Pseudo-labeled sets: At from DSsrc and Ad, Ar from DStgt (57K-173K (synthetic) sentences each), generated using the top np = 20K frequent pairs and at most ns = 10 sentences per pair ◆Test data: BQNC - 929 manually annotated sentences constructed in our third study 49

Slide 50

[Study 4] Experimental Settings ◆Our model - BiLSTM + task-specific softmax layers - Character embedding, pronunciation embedding, and nonstandard word lexicon binary features - Hyperparameter • num_BiLSTM_layers=2, num_BiLSTM_units=1,000, char_emb_d=200, pron_emb_d=30, etc. ◆Baseline methods - MeCab and MeCab+ER (Sasano+ 2013) ◆Evaluation 1. Main results 2. Effect of dataset size 3. Detailed results of normalization 4. Performance for known and unknown normalization instances 5. Error analysis 50

Slide 51

[Study 4] Exp 1. Main Results ◆Results - Our method (BiLSTM) achieved better Norm performance when trained on more types of pseudo-labeled data (At: DSsrc(Vdic), Ad: DStgt(Vdic), Ar: DStgt(Vrule)). - MeCab+ER achieved the best performance on Seg and POS. 51

Slide 52

[Study 4] Exp 5. Error Analysis 52 ◆Detailed normalization performance - Our method outperformed MeCab+ER for all categories. - Major errors by our model were mis-detection and invalid tag prediction. - Kanji conversion accuracy was 97% (67/70). Examples of TPs: ほんと (に少人数で) → ほんとう ‘actually’; すげぇ → すごい ‘great’; フツー (の話をして) → 普通 ‘ordinary’; そーゆー → そう|いう ‘such’; な~に (言ってんの) → なに ‘what’; まぁるい → まるい ‘round’. Examples of FPs: ガコンッ → ガコン ‘thud’; ゴホゴホ → ごほごホ ‘coughing sound’; はぁぁ → はああ ‘sighing sound’; おお~~ → 王 ‘king’; ケータイ → ケイタイ ‘cell phone’; ダルい → だるい ‘dull’.

Slide 53

[Study 4] Conclusion - We proposed a text editing-based method for Japanese WS, POS tagging, and LN. - We proposed effective generation methods of pseudo-labeled data for Japanese LN. - The proposed method outperformed an existing method on the joint segmentation and normalization task. 53

Slide 54

Summary and Future Directions

Slide 55

Summary of This Dissertation 1. How to achieve accurate WS in various text domains - We proposed approaches for three domain types, which can be effective options for achieving accurate WS and downstream tasks in these domains. 2. How to train/evaluate WS and LN models for Ja UGT - We constructed a public evaluation corpus, which can be a useful benchmark for comparing existing and future systems. - We proposed a joint WS&LN method trained on pseudo-labeled data, which can be a good baseline for developing more practical Japanese LN methods in the future. 55

Slide 56

Directions for Future Work ◆Model size and inference speed - Knowledge distillation is a promising approach to train a fast and lightweight student model from an accurate teacher model. ◆Investigation of optimal segmentation units - Optimal units and effective combinations of different units (char/subword/word) for downstream tasks remain to be explored. ◆Performance improvement on UGT processing - Incorporating knowledge from large pretrained LMs may be effective. ◆Evaluation on broader UGT domains and phenomena - Constructing evaluation data in various UGT domains would be beneficial for evaluating system performance on frequently occurring phenomena in other UGT domains, such as proper names and neologisms. 56

Slide 57

Directions for Future Work (Summary) 57 [Diagram: corpus annotation for fair evaluation (Study 3; future: broader-domain corpora), more accurate models for general, specialized, and UGT domains (Studies 1, 2, 4; future: further improvement), faster models, and exploration of optimal tokens, moving from word segmentation toward general tokenization.]