
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal Dependency Parsing

Emory NLP

July 08, 2021

Transcript

  1. Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal Dependency Parsing

    Han He, Computer Science, Emory University, Atlanta, GA 30322, USA, [email protected]
    Jinho D. Choi, Computer Science, Emory University, Atlanta, GA 30322, USA, [email protected]
    Abstract (beginning): This paper presents our enhanced dependency parsing approach using transformer encoders, coupled with a simple yet powerful ensemble ...
  2. Enhanced Universal Dependency Parsing

    • Ellipsis
    • Conjoined subjects and objects
    • https://universaldependencies.org/u/overview/enhanced-syntax.html
  3. Preprocessing

    • Sentence split and tokenization with UDPipe (itssearch-engine -> its search - engine)
    • Remove multiword expressions (e.g., vámonos -> vámos nos)
    • Collapse empty nodes (a filtering sketch follows after this slide)

    From the paper (Section 2.1, Preprocessing): The data in the training and development sets are already sentence segmented and tokenized. For the test set, UDPipe is used to segment raw input into sentences, where each sentence gets split into a list of tokens (Straka and Straková, 2017). A custom script written by us is used to remove multiwords but retain their splits (e.g., remove vámonos but retain vámos nos), as well as to collapse empty nodes in the CoNLL-U format.

    From the paper (Section 2.2, Transformer Encoder): Our parsing models use contextualized embeddings ... it can be easily adapted to languages that may not have dedicated POS taggers, and drops the Bidirectional LSTM encoder while integrating the transformer encoder directly into the biaffine decoder to minimize the redundancy of multiple encoders for the generation of contextualized embeddings. Every token w_i in the input sentence is split into one or more sub-tokens by the transformer encoder. The contextualized embedding that corresponds to the first sub-token of w_i is treated as the embedding of w_i, say e_i, and fed into four types of multilayer perceptron (MLP) layers to extract features for w_i being a head (*-h) or a dependent (*-d) for the arc relations (arc-*) and the labels ...
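    As a rough illustration of the preprocessing step above, here is a minimal Python sketch (not the authors' actual script) that filters multiword-token ranges and empty-node lines out of a CoNLL-U file. Note that a faithful "collapse" of empty nodes would also rewrite the enhanced dependencies that pass through the removed node; this sketch only drops the lines.

    import re

    def simplify_conllu(lines):
        # Drop multiword-token ranges (IDs like "4-5", e.g. "vamonos")
        # and empty nodes (IDs like "5.1" introduced by ellipsis).
        # NOTE: this is a simplification; real collapsing also rewrites
        # enhanced dependencies that go through the removed empty node.
        kept = []
        for line in lines:
            if line.startswith("#") or not line.strip():
                kept.append(line)
                continue
            token_id = line.split("\t", 1)[0]
            if re.fullmatch(r"\d+-\d+", token_id):    # multiword token range
                continue
            if re.fullmatch(r"\d+\.\d+", token_id):   # empty node
                continue
            kept.append(line)
        return kept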
  4. Encoder

    • mBERT vs. language-specific transformers (a loading sketch follows after this slide)
    • ALBERT for English, RoBERTa for French
    • mBERT for all languages
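    A minimal sketch of how such an encoder could be plugged in with the Hugging Face transformers library, keeping the embedding of the first sub-token of each word as its representation (as described in Section 2.2). The checkpoint names are illustrative assumptions, not necessarily the exact ones used by the authors.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # mBERT by default; a language-specific checkpoint (e.g. "camembert-base"
    # for French) could be swapped in. Names here are assumptions for illustration.
    name = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoder = AutoModel.from_pretrained(name)

    words = ["its", "search", "engine", "works"]
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]    # (num_subtokens, dim)

    # keep only the embedding of the first sub-token of each word
    first = {}
    for i, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in first:
            first[word_id] = hidden[i]
    word_emb = torch.stack([first[i] for i in range(len(words))])  # (num_words, dim)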
  5. Decoder

    • Biaffine DTP and DGP (a scorer sketch follows after this slide)
    • Tree Parsing vs. Graph Parsing

    Table 1: Parsing results on the test sets for all languages; (b) reports the labeled attachment score on enhanced dependencies where labels are restricted to the UD relation (EULAS). For both (a) and (b), rows 2-4 show the results by the multilingual encoder and rows 5-7 show the results by the language-specific encoders if available.

    Table 2: Language-specific transformer encoders used to develop our models. The corpus column shows the corpus size used to pretrain each encoder (B: billion tokens, GB: gigabytes). BERT and RoBERTa adapt the base ...
    Lang  Encoder  Corpus  Provider
    AR    BERT     8.2 B   Hugging Face
    EN    ALBERT   16 GB   Hugging Face
    ET    BERT     N/A     TurkuNLP
    FR    RoBERTa  138 GB  Hugging Face
    FI    BERT     24 B    Hugging Face
    IT    BERT     13 GB   Hugging Face
    NL    BERT     N/A     Hugging Face
    PL    BERT     1.8 B   Hugging Face
    SV    BERT     3 B     Hugging Face
    BG    BERT     N/A     Hugging Face
    CS    BERT     N/A     Hugging Face
    SK    BERT     N/A     Hugging Face

    Figure 2: Percentages of tokens with multiple heads.
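    To make the "Biaffine DTP and DGP" bullet concrete, here is a hedged PyTorch sketch of a deep biaffine arc scorer in the spirit of Dozat & Manning (2016); the dimensions and initialization are assumptions, not the exact model configuration. DTP would feed the resulting score matrix to MST decoding, while DGP would apply a sigmoid per cell so a token can keep zero or more heads.

    import torch
    import torch.nn as nn

    class BiaffineArcScorer(nn.Module):
        # Scores S_arc[d, h] = H_d U H_h^T + [H_d ; H_h] v + b  (a sketch)
        def __init__(self, dim):
            super().__init__()
            self.U = nn.Parameter(torch.empty(dim, dim))
            self.v = nn.Parameter(torch.zeros(2 * dim))
            self.b = nn.Parameter(torch.zeros(1))
            nn.init.xavier_uniform_(self.U)

        def forward(self, H_d, H_h):
            # H_d, H_h: (n, dim) MLP features for dependents and heads
            n = H_d.size(0)
            bilinear = H_d @ self.U @ H_h.t()                       # (n, n)
            pair = torch.cat([H_d.unsqueeze(1).expand(n, n, -1),
                              H_h.unsqueeze(0).expand(n, n, -1)], dim=-1)
            return bilinear + pair @ self.v + self.b                # (n, n)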
  6. Ensemble

    • DTP (Tree) + DGP (Graph)

    From the paper (Section 2.4, Dependency Tree & Graph Parsing): The arc score matrix S^(arc) and the label score tensor S^(rel) generated by the bilinear and biaffine classifiers can be used for both dependency tree parsing (DTP) and graph parsing (DGP). For DTP, which takes only the primary dependencies to learn tree structures during training, the Chu-Liu-Edmonds Maximum Spanning Tree (MST) algorithm is applied to S^(arc) for the arc prediction, then the label with the largest score in S^(rel) corresponding to the arc is taken for the label prediction (A_DTP: the list of predicted arcs, L_DTP: the labels predicted for A_DTP, I: the indices of A_DTP in S^(rel)):

        A_DTP = MST(S^(arc))
        L_DTP = argmax(S^(rel)[I(A_DTP)])

    For DGP, which takes the primary as well as the secondary dependencies in the enhanced types to learn graph structures during training, the sigmoid function is applied to S^(arc) instead of the softmax function (Figure 1) so that zero to many heads can be predicted per node by measuring the pairwise losses. Then, the same logic can be used to predict the labels for those arcs:

        A_DGP = SIGMOID(S^(arc))
        L_DGP = argmax(S^(rel)[I(A_DGP)])

    Finding the maximum spanning DAG (MSDAG) from the output of the DGP model is NP-hard (Schluter, 2014). Thus, we design an ensemble approach that computes approximate MSDAGs using a greedy algorithm. Given the score matrices S^(arc)_DTP and S^(arc)_DGP from the DTP and DGP models respectively and the label score tensor S^(rel)_DGP from the DGP model, Algorithm 1 is applied to find the MSDAG:

    Algorithm 1: Ensemble parsing algorithm
    Input:  S^(arc)_DTP, S^(arc)_DGP, and S^(rel)_DGP
    Output: G, an approximate MSDAG
     1  r <- root_index(A_DTP)
     2  S^(rel)_DGP[root, :, :] <- -inf
     3  S^(rel)_DGP[root, r, r] <- +inf
     4  R <- argmax(S^(rel)_DGP) in R^(n x n)
     5  A_DTP <- MST(S^(arc)_DTP)
     6  G <- {}
     7  foreach arc (d, h) in A_DTP do
     8      G <- G ∪ {(d, h, R[d, h])}
     9  end
    10  A_DGP <- sorted_descend(SIGMOID(S^(arc)_DGP))
    11  foreach arc (d, h) in A_DGP do
    12      G_(d,h) <- G ∪ {(d, h, R[d, h])}
    13      if is_acyclic(G_(d,h)) then
    14          G <- G_(d,h)
    15      end
    16  end

    (A Python sketch of this greedy procedure follows below.)
  7. Results

    • Officially ranked 3rd according to Coarse ELAS F1 scores
    • Officially ranked 1st on the French treebank
  8. Results

    • On 13 languages, multilingual BERT outperforms the language-specific encoders
    • Exceptions are English, French, Finnish, and Italian
    • On 15 languages, the ensemble method outperforms DTP/DGP alone
  9. To be Improved

    • The tree constraint is not necessary.
    • Concatenation of all treebanks yields better performance.
  10. Conclusion

    • mBERT improves multilingual parsing
    • DGP helps the prediction of enhanced dependencies
    • Beyond the ensemble, a more advanced parsing algorithm is needed
  11. References

    • Straka, M., & Straková, J. (2017). Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (pp. 88-99).
    • Dozat, T., & Manning, C. D. (2016). Deep Biaffine Attention for Neural Dependency Parsing. arXiv preprint arXiv:1611.01734.
    • He, H., & Choi, J. (2020). Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT. In The Thirty-Third International FLAIRS Conference.
    • Kondratyuk, D. (2019). 75 Languages, 1 Model: Parsing Universal Dependencies Universally. arXiv preprint arXiv:1904.02099.