EMNLP2015 Reading Group - Long Short-Term Memory Neural Networks for Chinese Word Segmentation

Long Short-Term Memory Neural Networks for Chinese Word Segmentation
Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, Xuanjing Huang
EMNLP2015 Reading Group
Masaki Rikitoku 2015/10/24

About Me
Masaki Rikitoku
• NLP/Data engineer
– Online Advertising
• NLP/Machine Learning
– Multilingual Morphological Analysis
– Text classification
• Big data processing
– In-memory aggregation engine for BI
– Big Data processing for global social game

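Overview
• Formulates Chinese word segmentation as sequence labeling with an LSTM-RNN
• The LSTM-RNN makes it possible to capture long-distance features effectively
• Also examines the effect of Dropout on the LSTM
• Achieves the current best accuracy on the PKU, MSRA, and CTB corpora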

Chinese word segmentation as sequence labeling
|冬天|, |能|穿|多少|穿|多少|;|
B E S S S B E S B E S
|夏天|, |能|穿|多|少|穿|多|少|。|
B E S S S S S S S S S
B: Begin E: End M: Middle S: Single
• Assign one of the tags {B, E, M, S} to each character
• Assign the tags with a sequence labeling method
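
As a small illustration of the tagging scheme above, the sketch below converts a segmented sentence into per-character {B, M, E, S} tags and back. The function names `words_to_tags` and `tags_to_words` are hypothetical and not taken from the slides or the paper.

```python
def words_to_tags(words):
    """Map a list of segmented words to one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from the characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends in B or M
        words.append(current)
    return words


words = ["冬天", ",", "能", "穿", "多少", "穿", "多少", ";"]
tags = words_to_tags(words)
print(" ".join(tags))  # B E S S S B E S B E S
print(tags_to_words("".join(words), tags) == words)  # True
```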

Chinese word segmentation as sequence labeling
A_{ij}: tag transition cost
y: tag output cost
※
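
The formula on this slide did not survive in the transcript. A minimal sketch of how the two costs could be combined follows. It assumes that the cost of a tag sequence is the sum of the tag output costs and the tag transition costs along the sequence, and it finds the minimum-cost sequence with the Viterbi algorithm. The additive form, the array layout, and the name `viterbi_min_cost` are assumptions made for illustration.

```python
import numpy as np

TAGS = ["B", "M", "E", "S"]


def viterbi_min_cost(A, y):
    """Return (minimum total cost, tag index path).

    A: (K, K) array, A[i, j] is the cost of moving from tag i to tag j.
    y: (T, K) array, y[t, j] is the cost of emitting tag j at character t.
    """
    T, K = y.shape
    cost = y[0].copy()                  # best cost of a path ending in each tag
    back = np.zeros((T, K), dtype=int)  # best previous tag for each (t, tag)
    for t in range(1, T):
        # total[i, j]: best path ending in i at t-1, then i -> j, then emit j
        total = cost[:, None] + A + y[t][None, :]
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0)
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):       # follow the back pointers
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return float(cost.min()), path


rng = np.random.default_rng(0)
A = rng.random((4, 4))   # made-up transition costs
y = rng.random((5, 4))   # made-up output costs for a 5-character sentence
best_cost, best_path = viterbi_min_cost(A, y)
print(best_cost, [TAGS[k] for k in best_path])
```

With learned costs, invalid transitions such as B followed by B would be expected to receive a high value in `A`, so the minimum-cost path avoids them.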

Neural Model for Chinese Word Segmentation
• 3-layer NN
• The tag output cost is computed by the NN
• Cannot capture long-distance features beyond the window size
=> RNN
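
To make the window limitation concrete, here is a minimal sketch of a window-based tag scorer: the embeddings of the characters inside a fixed window are concatenated and passed through one hidden layer. The layer sizes, the zero padding at the sentence edges, and the names `window_input` and `tag_scores` are assumptions for illustration, not details taken from the slides.

```python
import numpy as np


def window_input(emb, t, left, right):
    """Concatenate the embeddings of characters t-left .. t+right.

    Positions outside the sentence are filled with zero vectors.
    """
    T, d = emb.shape
    parts = []
    for i in range(t - left, t + right + 1):
        parts.append(emb[i] if 0 <= i < T else np.zeros(d))
    return np.concatenate(parts)


def tag_scores(x, W1, b1, W2, b2):
    """One tanh hidden layer followed by a linear layer: one value per tag."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2


rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))                 # 6 characters, 8-dim embeddings
x = window_input(emb, t=0, left=2, right=2)   # 5 positions * 8 dims = 40 values
W1, b1 = rng.normal(size=(16, x.size)), np.zeros(16)
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)
scores = tag_scores(x, W1, b1, W2, b2)        # one value per tag in {B, M, E, S}
print(x.shape, scores.shape)                  # (40,) (4,)
```

Nothing outside the window reaches `tag_scores`, which is the limitation the slide points out.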

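RNN (Recurrent Neural Network)
[Figure: NN and RNN diagrams, each with input x, hidden layer h, and output y]
Even an RNN has difficulty capturing long-distance features and past context
=> LSTM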

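LSTM (Long Short-Term Memory)
• The memory cell C stores past history, long-distance features, and so on
• The input and forget gates control the state of the cell
• Captures past and long-distance context better than an RNN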
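
The gating described above can be sketched as a single LSTM time step. This is a common textbook formulation, not necessarily the exact variant used in the paper. It also includes an output gate, which the slide does not list. The stacked weight layout and the name `lstm_step` are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step.

    W: (4*H, D+H) stacked weights for input gate, forget gate,
       output gate, and candidate values. b: (4*H,) stacked biases.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old cell to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate values
    c = f * c_prev + i * g       # memory cell update
    h = o * np.tanh(c)           # hidden state
    return h, c


rng = np.random.default_rng(0)
D, H = 8, 5                                  # made-up input and hidden sizes
W, b = rng.normal(size=(4 * H, D + H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(6, D)):            # run over a 6-character sentence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                      # (5,) (5,)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell carries its previous value forward unchanged, which is how information from distant characters can be preserved.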

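LSTM model for Chinese Word Segmentation
[Figure: model diagram with input X(t)]
• The {B, M, E, S} tag of each character is determined from the minimum-cost path => word segmentation
• The LSTM-RNN also captures long-distance features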

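Training
Model parameters to learn
• M: vector representations of the characters
• A: tag transition costs
• W_{??}: LSTM weight matrices
Objective function
• y is a label sequence, so this is structured learning
• L2 regularization
• Trained with SGD + AdaGrad
• Dropout is also used
Δ: structured margin loss, s: tag sequence score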
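
The objective formula itself is missing from the transcript. A minimal sketch of a structured margin loss follows, assuming the usual max-margin form: the best score of any tag sequence after adding a margin Δ, minus the score of the gold sequence. Here Δ is assumed to be `eta` times the number of wrongly tagged characters. The value of `eta`, the additive score, and all function names are assumptions. The L2 term and the AdaGrad update are left out.

```python
import numpy as np


def sequence_score(A, y, tags):
    """s: per-character tag scores plus tag transition scores along `tags`."""
    s = y[0, tags[0]]
    for t in range(1, len(tags)):
        s += A[tags[t - 1], tags[t]] + y[t, tags[t]]
    return float(s)


def max_score_path(A, y):
    """Viterbi search for the highest-scoring tag sequence."""
    T, K = y.shape
    score = y[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + A + y[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]


def structured_margin_loss(A, y, gold, eta=0.2):
    """max over y_hat of [s(y_hat) + delta(gold, y_hat)] - s(gold)."""
    augmented = y + eta                           # add the margin to every tag ...
    augmented[np.arange(len(gold)), gold] -= eta  # ... except the gold tag
    y_hat = max_score_path(A, augmented)          # loss-augmented decoding
    delta = eta * sum(g != p for g, p in zip(gold, y_hat))
    return sequence_score(A, y, y_hat) + delta - sequence_score(A, y, gold)


rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))   # made-up transition scores
y = rng.normal(size=(4, 4))   # made-up tag scores for a 4-character sentence
gold = [0, 2, 3, 3]           # B E S S as indices into (B, M, E, S)
print(structured_margin_loss(A, y, gold))
```

The loss is zero only when the gold sequence outscores every other sequence by at least its margin, which is why it suits a model whose output is a whole tag sequence.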

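Experiments
• Measured segmentation precision, recall, and F1 on the PKU, MSRA, and CTB6 corpora
• Also examined the dependence on parameters
– Hyper-parameter
– Dropout rate
– Context length
• Also evaluated the authors' own model variants LSTM-2 to LSTM-4
– The plain LSTM-1 gives the best results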
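
For reference, segmentation precision, recall, and F1 are usually computed over words, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below assumes that convention. The function names are hypothetical, and the official scoring scripts for these corpora may differ in details.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = ["夏天", ",", "能", "穿", "多", "少", "穿", "多", "少", "。"]
pred = ["夏天", ",", "能", "穿", "多少", "穿", "多少", "。"]
p, r, f1 = prf(gold, pred)
print(p, r, f1)  # 6 of 8 predicted words and 6 of 10 gold words match
```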

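Performances of LSTM-1 with the different context lengths and dropout rates on PKU test set.
Dependence of accuracy on dropout rate and context length on the PKU test set
• Dropout rate = 20% gives the best performance => not surprising
• Dropout only helps on the input layer; it has no effect on the LSTM layer
• Context length (0, 2): best performance with 0 preceding and 2 following characters
• An effect of the LSTM?
• The results are nearly the same, but it is odd that (0, 2) is the best => because the dimension of x is lower?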
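
As a reminder of what a dropout rate of 20% on the input layer means, here is a minimal sketch of inverted dropout applied to an input vector. The rescaling by `1 / (1 - rate)` is a common convention and an assumption here. The slides do not say which variant was used.

```python
import numpy as np


def dropout(x, rate, rng, train=True):
    """Zero each element with probability `rate` and rescale the rest."""
    if not train or rate == 0.0:
        return x                      # no dropout at test time
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)    # keeps the expected value unchanged


rng = np.random.default_rng(0)
x = np.ones(1000)                     # stand-in for a concatenated window input
out = dropout(x, 0.2, rng)
print((out == 0).mean())              # fraction of dropped inputs, close to 0.2
```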

Performances on three test sets with random initialized character embeddings.
• Character embeddings are randomly initialized
• The LSTM gives the best performance on all three corpora
• The other methods are:
– Zheng et al., 2013: 3-layer NN + sequence labeling
– Pei et al., 2014: Zheng et al., 2013 + max-margin training
– This method: Pei et al., 2014 + LSTM

Performances on three test sets with pre-trained and bigram character embeddings.
• Pre-training: character embeddings are trained with Word2Vec
• Bigram embedding: the average of the vectors of the two characters
• Pre-training + bigram gives the best performance
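
The bigram embedding described above can be sketched in a few lines. The dictionary `char_emb` and its random values are made up for illustration. In practice the character vectors would come from Word2Vec pre-training.

```python
import numpy as np


def bigram_embedding(char_emb, a, b):
    """Vector for the bigram `a b`: the mean of the two character vectors."""
    return (char_emb[a] + char_emb[b]) / 2.0


rng = np.random.default_rng(0)
char_emb = {ch: rng.normal(size=4) for ch in "夏天能穿多少"}  # hypothetical vectors
v = bigram_embedding(char_emb, "多", "少")
print(v.shape)  # (4,)
```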

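Comparison of our model with state-of-the-art methods on three test sets.
• Current best performance on the three corpora
• The accuracy of Zhang et al. 2013 relies on external information (unlabeled data, external knowledge)
• This model achieves the best performance by learning from the training corpus only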

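Summary
• Applies LSTM-RNN based sequence labeling to Chinese word segmentation
• Thanks to the LSTM, it reaches the current best performance even with a small window size
• The model can be applied to other problems as well
– NE
– Japanese morphological analysis?
– Etc.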

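Impressions
• I would like to apply it to NE and Japanese morphological analysis
• Processing speed is a concern. It is probably slower than dictionary-based minimum-cost methods, but I will implement it and find out
– It may even be slower than Kytea...
• I am interested in comparing LSTM with Gated RNN and with the model that uses a linear combination of hidden layers instead of an RNN, and in implementing them
– http://arxiv.org/abs/1510.02693
– Because implementing an LSTM looks painful...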

Chinese word segmentation |冬天 (winter)|,|能 (can)| 穿 (wear)| 多少 (amount)| 穿 (wear)
