Exploring and Adapting Chinese GPT to Pinyin Input Method

Exploring and Adapting Chinese GPT to Pinyin Input Method (Tan et al., ACL 2022)
(Figures and tables are quoted from the above paper unless otherwise noted.)
Input Method Workshop 2022, paper introduction
Mamoru Komachi

The Chinese input task is pinyin → Chinese character conversion
• The input pinyin is already segmented into syllables (handling unsegmented input is apparently future work...)

Id  Context of Characters          Input Pinyin             Target            Pinyin Type
s1  我下周有时间，除了             li bai yi you dian shi   礼拜一有点事      Perfect
s2  我下周有时间，除了             l b y y d s              礼拜一有点事      Abbreviated
s3  (characters not recoverable)   l b y y d s              (not recoverable) Abbreviated

Table 2: Illustrative examples of the task of pinyin input method with perfect pinyin and abbreviated pinyin. In s3, the input pinyin "l b y y d s" is the abbreviation of "lao ban yong yuan di shen". The translations of s1 and s3 are "I am free next week except for the next Monday." and "Boss helps me overcome the obstacle. You are the greatest of all time.", respectively.

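Baseline using GPT
Chinese input with GPT
1. Encode the context (if any) character by character
2. The decoder predicts the next character, one character at a time
• For both perfect and abbreviated pinyin, only characters that match the pinyin are output candidates
Baseline models
• GPT (public): character-level Chinese GPT (Du, 2019), 12 layers, pretrained on CLUECorpusSmall (14GB) https://github.com/Morizeyao/GPT2-Chinese
• GPT (ours): GPT (public) further trained on a corpus crawled by the authors (800GB), using 32 Tesla V100 GPUs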
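
The candidate restriction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the lexicon, the scores, and the function names are all hypothetical, and a real system would apply the same filter to the GPT logits at each decoding step.

```python
# Toy lexicon (hypothetical, for illustration only): character -> pinyin syllable.
TOY_LEXICON = {"礼": "li", "里": "li", "老": "lao", "拜": "bai", "一": "yi"}


def candidates(typed, abbreviated, lexicon=TOY_LEXICON):
    """Return the characters allowed for one typed pinyin unit."""
    if abbreviated:
        # Abbreviated pinyin: only the first letter of the syllable is typed.
        return {ch for ch, py in lexicon.items() if py[0] == typed}
    # Perfect pinyin: the whole syllable must match.
    return {ch for ch, py in lexicon.items() if py == typed}


def constrained_best(scores, typed, abbreviated, lexicon=TOY_LEXICON):
    """Pick the highest-scoring character among those matching the pinyin."""
    allowed = candidates(typed, abbreviated, lexicon)
    return max(allowed, key=lambda ch: scores.get(ch, float("-inf")))


# Made-up language-model scores for the next character.
scores = {"老": 0.9, "礼": 0.5, "里": 0.1}
print(constrained_best(scores, "li", abbreviated=False))  # 礼
print(constrained_best(scores, "l", abbreviated=True))    # 老
```

With perfect pinyin the globally best-scoring character is excluded because its syllable does not match; with the abbreviation "l" it becomes a legal candidate again, which is why abbreviated input is the harder setting.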

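Findings of this work
1. With perfect pinyin, Chinese GPT with frozen parameters performs best
2. With abbreviated pinyin, the performance of Chinese GPT with frozen parameters drops considerably, and methods that take the pinyin into account improve it
3. Chinese-GPT-based models perform better the longer the context is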

Proposed method 1: PinyinGPT-Concat
The pinyin is concatenated to the context and fed to the model
• The positional encoding is adjusted

[Figure 1 (top): training of PinyinGPT-Concat, same example as s2 in Table 2. Input sequence: [CLS] 我下周有时间，除了 (context of Chinese characters) [SEP] l b y y d s (abbreviated pinyin) [SEP] 礼拜一有点事 (target Chinese characters) [EOS]. Position ids as far as they can be read from the figure: 0 for [CLS], 1 to 9 for the context, 10 for each [SEP], and 11 to 16 for both the pinyin and the target.]

Excerpt from the paper: [...] instance of [w_1, ..., w_n, [SEP], p_{n+1}, ..., p_{n+k}, [SEP], w_{n+1}, ..., w_{n+k}], the model is trained to minimize the following loss function, where w_{<n+j} stands for the characters before w_{n+j} and p = [p_{n+1}, ..., p_{n+k}].

\mathcal{L}_{\text{concat}} = -\sum_{j=1}^{k} \log p(w_{n+j} \mid w_{<n+j}, p)    (1)

PinyinGPT-Embed: The original GPT model includes a word embedding layer and a position embedding layer. In this model, we add a pinyin embedding layer. The basic idea is to provide the model with the pinyin of the character to be generated next. Specifically, the embedding of each character is the sum of the token embedding of [excerpt cut off in the slide]
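
The input layout of PinyinGPT-Concat can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, and the position scheme (the second [SEP] reuses the position of the first, and the target reuses the positions of the pinyin) is a reading of the position ids in Figure 1, not a confirmed description of the released code.

```python
def build_concat_input(context, pinyin, target=None):
    """Build tokens and position ids for a PinyinGPT-Concat style input.

    Assumed layout: [CLS] context [SEP] pinyin [SEP] target.
    """
    n = len(context)
    tokens = ["[CLS]"] + list(context) + ["[SEP]"] + list(pinyin) + ["[SEP]"]
    # [CLS] -> 0, context -> 1..n, first [SEP] -> n+1
    positions = list(range(n + 2))
    # pinyin -> n+2 .. n+1+k
    positions += [n + 2 + i for i in range(len(pinyin))]
    # Assumption: the second [SEP] shares the position of the first one.
    positions.append(n + 1)
    if target is not None:
        tokens += list(target)
        # Assumption: each target character shares the position of its pinyin.
        positions += [n + 2 + i for i in range(len(target))]
    return tokens, positions


tokens, positions = build_concat_input(
    "我下周有时间，除了", ["l", "b", "y", "y", "d", "s"], "礼拜一有点事"
)
for tok, pos in zip(tokens, positions):
    print(pos, tok)
```

Sharing positions between a pinyin letter and the character it stands for gives the model a direct alignment signal between the two segments, which is one plausible motivation for the adjusted positional encoding mentioned on the slide.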

Proposed method 2: PinyinGPT-Embed
An embedding for the pinyin is added
• Tokens without pinyin are treated as [unk]

[Figure 1 (bottom): training of PinyinGPT-Embed, same example as s2 in Table 2. Input sequence: [CLS] 我下周有时间，除了礼拜一有点事 [EOS]. Each position sums three rows: Word Embedding, Position Embedding (ids 0 to 15), and Pinyin Embedding (pinyin letters, with [unk] where there is no pinyin).]

Excerpt from the paper: [...] the standard GPT, as shown in Figure 1. The loss function is given as follows.

\mathcal{L}_{\text{embed}} = -\sum_{j=1}^{n+k} \log p(w_j \mid w_{<j}, p_{<j+1})    (2)

In the inference stage, we transform the input sequence to the same format.

3.3 Pinyin-Constrained Training
We describe training details in this subsection. In standard GPT, the loss function is computed over the whole vocabulary. However, this is suboptimal for pinyin input method because the major [excerpt cut off in the slide]
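
A minimal sketch of how the pinyin row could be built for PinyinGPT-Embed. It assumes, based on the phrase "the pinyin of the character to be generated next" and the term p_{<j+1} in Equation (2), that each input position carries the pinyin of the following token. The function names and the toy pinyin table are hypothetical.

```python
def next_char_pinyin(tokens, pinyin_of):
    """For each position j, return the pinyin of token j+1, or [unk] if none."""
    row = []
    for j in range(len(tokens)):
        nxt = tokens[j + 1] if j + 1 < len(tokens) else None
        # Punctuation, special tokens, and the end of the sequence have no pinyin.
        row.append(pinyin_of.get(nxt, "[unk]"))
    return row


def sum_embeddings(word_vec, pos_vec, pinyin_vec):
    """Input embedding of one position: element-wise sum of the three vectors."""
    return [w + p + q for w, p, q in zip(word_vec, pos_vec, pinyin_vec)]


# Hypothetical abbreviated-pinyin table for a few characters.
pinyin_of = {"我": "w", "下": "x", "周": "z"}
tokens = ["[CLS]", "我", "下", "，", "周"]
print(next_char_pinyin(tokens, pinyin_of))  # ['w', 'x', '[unk]', 'z', '[unk]']
print(sum_embeddings([1.0, 2.0], [0.5, 0.5], [0.25, 0.0]))  # [1.75, 2.5]
```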

Proposed method 3: pinyin-constrained training
• At inference time, characters that do not match the pinyin are never output, so training over the whole vocabulary is inconsistent with inference
→ During training as well, the probability is computed only over characters that share the same pinyin

Excerpt from the paper: [...] for pinyin input method because the major challenge in the inference stage is how to select the best one from characters pronounced with the same pinyin (as described in the end of Section 3.1). This leads to inconsistency between training and inference stages. Therefore, in the training stage, the probability of a character is calculated over characters pronounced with the same pinyin, which is formulated as follows.

p(w_i) = \frac{\exp(g(w_i))}{\sum_{w_j \in V_{p_i}} \exp(g(w_j))}    (3)

where V_{p_i} is the set of Chinese characters whose pinyin is p_i and g is the logit before the softmax layer.
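
Equation (3) restricts the softmax to the characters that share the target's pinyin. The sketch below computes that probability and the corresponding negative log-likelihood for one position. The logits and function names are made up for illustration; a real implementation would mask the logits of a tensor rather than use a dictionary.

```python
import math


def constrained_prob(logits, target, allowed):
    """Softmax probability of `target`, normalized only over `allowed` characters."""
    if target not in allowed:
        raise ValueError("target must share the pinyin of the allowed set")
    # Subtract the max logit for numerical stability.
    m = max(logits[ch] for ch in allowed)
    denom = sum(math.exp(logits[ch] - m) for ch in allowed)
    return math.exp(logits[target] - m) / denom


def pc_loss(logits, target, allowed):
    """Pinyin-constrained negative log-likelihood for one position."""
    return -math.log(constrained_prob(logits, target, allowed))


# Hypothetical logits g(w) over a three-character vocabulary.
logits = {"礼": 2.0, "里": 1.0, "老": 5.0}
allowed = {"礼", "里"}  # V_{p_i} for the pinyin "li"
print(constrained_prob(logits, "礼", allowed))
print(pc_loss(logits, "礼", allowed))
```

Note that the large logit of "老" does not affect the result, because "老" is outside V_{p_i}. Under a full-vocabulary softmax it would dominate the denominator, which is the training and inference mismatch the slide describes.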

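Experimental setup: corpora
PD Dataset (Yang et al., 2012)
• A dataset commonly used for the Chinese input task
• People's Daily from 1992 to 1998 (old)
• Perfect pinyin only; the context is the empty string
• 5.04M segments for training, 2,000 segments for testing
WD Dataset
• 15 categories extracted from a part (200GB) of WuDaoCorpora (Yuan et al., 2021)
• Perfect and abbreviated inputs are generated automatically; instances are split by context length
• Test only; 270K segments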
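
The automatic generation of abbreviated input can be illustrated with the Table 2 examples, where each syllable is reduced to its first letter. The helper below is a hypothetical sketch of that reduction only; how the authors derive the perfect pinyin from the text in the first place is not covered here.

```python
def abbreviate(perfect_pinyin):
    """Reduce space-separated pinyin syllables to their first letters."""
    return " ".join(syllable[0] for syllable in perfect_pinyin.split())


print(abbreviate("li bai yi you dian shi"))     # l b y y d s
print(abbreviate("lao ban yong yuan di shen"))  # l b y y d s
```

Both phrases collapse to the same abbreviation, which shows how much ambiguity the abbreviated setting adds.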

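Experimental setup: evaluation metric
Precision at top-K (P@K)
• The standard metric for the Chinese input task (Jia and Zhao, 2014; Zhang et al., 2017, 2019)
• Keystroke-based metrics (Jia and Zhao, 2013; Huang et al., 2015) are not used because they are complicated (odd, given that the paper targets abbreviated input...)
• Human evaluation is not used because it is time-consuming (this is understandable)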
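
P@K counts an instance as correct when the reference appears among the top K candidates. A minimal sketch, with a hypothetical function name and toy data:

```python
def precision_at_k(ranked_predictions, references, k):
    """Percentage of instances whose reference is within the top-k predictions."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(
        1
        for preds, ref in zip(ranked_predictions, references)
        if ref in preds[:k]
    )
    return 100.0 * hits / len(references)


# Toy ranked candidate lists (best first) and their references.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"], ["m", "n"]]
refs = ["a", "y", "s", "n"]
print(precision_at_k(ranked, refs, 1))  # 25.0
print(precision_at_k(ranked, refs, 2))  # 75.0
```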

Results: PD dataset + perfect pinyin
GPT outperformed the supervised methods
Compared methods
• Google IME
• On-OMWA (Zhang et al., arXiv 2017): a model that can learn new words automatically
• On-P2C (Zhang et al., ACL 2019): a neural Chinese input model that handles new words with an online dictionary

Model          P@1    P@5    P@10
Google IME     70.90  78.30  82.30
On-OMWA        64.40  72.90  77.90
On-P2C         71.30  80.50  81.30
GPT (public)   67.35  79.95  81.60
GPT (ours)     73.15  84.10  85.45

Table 3: Comparison with different methods over PD using perfect pinyin.

Results: WD dataset
Taking the pinyin into account improves accuracy on abbreviated input

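With abbreviated input, the pinyin concatenation model and the pinyin-constrained loss both contribute substantially to performance
Ablation
• GPT (ours) = baseline
• Pinyin Context = PinyinGPT-Concat is used for both training and inference
• PC-Loss = the pinyin-constrained loss is used during training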

GPT models predict better the longer the context is
• With long contexts (10+), the pinyin has little effect on P@1

GPT's output is grammatical but semantically odd
The proposed method's predictions are reasonable even when they differ from the reference

(The Chinese text of this figure was not recoverable; only the pinyin and the English glosses are shown.)

Case 1
  Pinyin: qing xiang yu kan hao
  Abbreviated: No
  Target (gloss): inclined to prefer
  Translation: The Oscar Organizing Committee inclined to prefer
  GPT (ours): 1. inclined to prefer  2. inclined to look at
  PinyinGPT-Concat: 1. inclined to prefer  2. tendency and optimism

Case 2
  Pinyin: q x y k h
  Abbreviated: Yes
  Target (gloss): inclined to prefer
  Translation: The Oscar Organizing Committee inclined to prefer
  GPT (ours): 1. one of its very  2. one of its luxury
  PinyinGPT-Concat: 1. inclined to prefer  2. inclined to fight against

Case 3
  Pinyin: j s d c b g
  Abbreviated: Yes
  Target (gloss): the host country of the contest
  Translation: And the Chinese team as the host country of this contest
  GPT (ours): 1. the host country of the finals  2. at the ringside of the finals
  PinyinGPT-Concat: 1. the host country of the finals  2. the host country of the contest

Figure 3: Case study for GPT (ours) and PinyinGPT-Concat in both perfect pinyin and abbreviated pinyin.

The proposed method consistently outperforms GPT in every domain
• Difficulty varies considerably by domain (culture vs. medical)

                    Games                 Culture               Sports
Model               P@1    P@5    P@10    P@1    P@5    P@10    P@1    P@5    P@10
GPT (ours)          24.04  32.78  34.23   21.86  29.33  30.94   28.54  37.13  38.69
PinyinGPT-Concat    25.78  38.26  41.89   22.10  33.33  36.72   29.81  43.56  46.95

                    Real Estate           Medical               Finance
Model               P@1    P@5    P@10    P@1    P@5    P@10    P@1    P@5    P@10
GPT (ours)          26.53  35.27  36.74   33.59  43.54  44.93   29.00  37.24  38.47
PinyinGPT-Concat    27.28  40.16  43.86   34.76  49.28  52.56   29.17  42.17  45.52

Table 6: Performance of six sample domains over WD using abbreviated pinyin.

A model that fits on a single V100 is about 30% faster, but there is a trade-off with accuracy
Settings
• The 6-layer GPT is retrained starting from the 12-layer GPT
• Each model occupies one V100
• The beam size is 16

Model                    Time (ms)  P@5
GPT (ours, 6L)           94         27.45
GPT (ours, 12L)          142        34.48
PinyinGPT-Concat (6L)    94         32.70
PinyinGPT-Concat (12L)   145        41.51

Table 7: Average inference time for one instance and the overall P@5 for the configuration of (4-9, 4-9).
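
The trade-off in Table 7 can be made concrete with a little arithmetic. The sketch below uses only the Table 7 values; the function name is hypothetical, and "time saved" is measured relative to the 12-layer model, which is one possible reading of "about 30% faster".

```python
# (time in ms, P@5) taken from Table 7.
TABLE7 = {
    "GPT (ours)": {"6L": (94, 27.45), "12L": (142, 34.48)},
    "PinyinGPT-Concat": {"6L": (94, 32.70), "12L": (145, 41.51)},
}


def tradeoff(six_layer, twelve_layer):
    """Return (percent of inference time saved, P@5 points lost) for 6L vs 12L."""
    t6, p6 = six_layer
    t12, p12 = twelve_layer
    time_saved_pct = 100.0 * (t12 - t6) / t12
    p5_drop = p12 - p6
    return time_saved_pct, p5_drop


for name, rows in TABLE7.items():
    saved, drop = tradeoff(rows["6L"], rows["12L"])
    print(f"{name}: {saved:.1f}% less time, {drop:.2f} P@5 points lower")
```

For both models the 6-layer variant saves roughly a third of the inference time and loses about 7 to 9 points of P@5.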

Appendix: speed and accuracy versus context length and target length

                                 Target 1-3                  Target 4-9                  Target 10+
Context  Model                   T    P@1    P@5    P@10     T    P@1    P@5    P@10     T    P@1    P@5    P@10
0-3      GPT (ours, 6L)          38   26.74  38.45  41.50    98   10.46  14.41  15.19    201  2.72   3.70   3.85
0-3      GPT (ours, 12L)         58   30.11  42.27  45.25    148  13.33  18.24  18.99    303  4.16   5.86   6.00
0-3      PinyinGPT-Concat (6L)   40   29.17  45.17  50.73    98   11.92  19.55  21.84    197  3.20   5.67   6.22
0-3      PinyinGPT-Concat (12L)  61   31.72  48.09  53.94    148  15.21  24.39  26.94    305  5.58   9.22   10.09
4-9      GPT (ours, 6L)          38   44.02  59.02  62.32    94   20.02  27.45  28.76    198  5.72   8.05   8.31
4-9      GPT (ours, 12L)         57   49.83  65.03  67.96    142  25.53  34.48  35.89    301  9.38   12.70  13.03
4-9      PinyinGPT-Concat (6L)   38   45.66  65.08  70.56    94   20.25  32.70  36.14    192  5.98   10.23  11.29
4-9      PinyinGPT-Concat (12L)  58   50.78  70.11  75.58    145  26.44  41.51  45.52    298  10.20  17.02  18.80
10+      GPT (ours, 6L)          42   54.38  69.94  72.92    99   28.81  38.98  40.41    198  10.32  14.18  14.64
10+      GPT (ours, 12L)         64   59.39  75.00  77.60    149  35.42  46.32  47.94    301  14.96  20.11  20.63
10+      PinyinGPT-Concat (6L)   43   53.91  73.21  78.14    98   27.21  42.36  46.45    198  9.15   15.49  17.05
10+      PinyinGPT-Concat (12L)  66   59.89  78.81  83.33    154  34.99  51.99  56.62    306  14.93  24.78  27.03

Table 9: Experiment results for different configurations over WD using abbreviated pinyin; each score is averaged over all the domains. The first column is the context length, while the first row is the target length. The field T is the average inference time in milliseconds.

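Related work on Chinese input
1. Statistical language models
• Pre-segmented input (Zheng et al., IJCAI 2011)
• Unsegmented pinyin (Chen and Lee, ACL 2000; Jia and Zhao, ACL 2014)
2. Statistical machine translation
• Perfect input (Yang et al., PACLIC 2012)
• Abbreviated input (Huang et al., IJCAI 2015)
3. Deep learning
• Attention + information retrieval (Huang et al., ACL 2018)
• LSTM + vocabulary adaptation (Zhang et al., ACL 2019)
• LSTM + context + pinyin (Huang and Zhao, EMNLP 2018)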

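Summary: trying GPT for Chinese input
• Applied GPT to a Chinese input method https://github.com/VisualJoyce/Transformers4IME/blob/master/README.en.md
• Created a new evaluation dataset for the Chinese input task (pretrained models are also released)
• Verified the effectiveness of Chinese GPT on the Chinese input task across multiple datasets and multiple domains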

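Impressions
It is good that the paper shows GPT can also be used for input methods
• The results are not that surprising
• The possible overlap between the data in WuDaoCorpora and the data used to train GPT (ours) is a concern (data leakage)
• Experiments in a predictive input setting would have been welcome (the pinyin constraint cannot be used there, but a language model should be good at prediction)
The assumptions about the input are lenient: it is already segmented, and the abbreviation uses only the first letter
• It is unclear whether the method really works in a practical setting
The experiments across various settings (domain, context length, target length) are thorough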
