Improved Arabic Base Phrase Chunking with a new enriched POS tag set

IMPROVED ARABIC BASE PHRASE CHUNKING WITH A NEW ENRICHED POS TAG SET
December 3, 2015
Mona T. Diab, Center for Computational Learning Systems, Columbia University
Proceedings of the 5th Workshop on Important Unresolved Matters, pages 89–96, Prague, Czech Republic, June 2007. (c) 2007 Association for Computational Linguistics

Introduction
• Purpose
  - Develop an Arabic Base Phrase Chunker (BPC), i.e. a shallow syntactic parser
• Approach
  - A support vector machine (SVM) approach for both the POS tagging and BPC processes, with morphological features specific to Modern Standard Arabic and enriched POS tags
• Result
  - Improved state-of-the-art performance in BPC using a new part-of-speech (POS) tag set with 75 tags

Introduction
• What is BPC?
  - The process by which adjacent words are grouped together to form non-recursive chunks in a sentence
  - These chunks may be verb phrases, noun phrases, adjective phrases, etc.
• Example
  - Input sentence: I would eat red luscious apples on Sundays.
  - Chunked sentence: [I] NP [would eat] VP [red luscious apples] NP [on Sundays] PP.
• Some applications of BPC: information extraction, semantic role labelling

Introduction: IOB annotation scheme
• I – Inside of a chunk
• O – Outside the chunk
• B – Beginning of the chunk
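The IOB scheme can be illustrated on the English example sentence from the previous slide. The following is a minimal sketch, not the annotation tooling used in the paper; the function name `chunks_to_iob` and the (label, tokens) input format are assumptions made for illustration.

```python
from typing import List, Tuple

def chunks_to_iob(chunks: List[Tuple[str, List[str]]]) -> List[Tuple[str, str]]:
    """Convert (label, tokens) chunks into (token, IOB tag) pairs.

    A label of "O" marks tokens that are outside any chunk.
    """
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            if label == "O":
                tagged.append((token, "O"))
            else:
                # The first token of a chunk gets B-, the rest get I-.
                prefix = "B" if i == 0 else "I"
                tagged.append((token, f"{prefix}-{label}"))
    return tagged

# The chunked sentence from the slides, with the final period left outside.
sentence = [
    ("NP", ["I"]),
    ("VP", ["would", "eat"]),
    ("NP", ["red", "luscious", "apples"]),
    ("PP", ["on", "Sundays"]),
    ("O", ["."]),
]
iob = chunks_to_iob(sentence)
for token, tag in iob:
    print(f"{token}\t{tag}")
```

Each token receives exactly one tag, which is what lets chunking be treated as a per-token classification problem.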

Approach
• Modeling the BPC problem as a classification task
• Applied a supervised SVM algorithm, chosen for its capability to handle a large number of (overlapping) features
• POS tagging
  - In Arabic, words are marked for case, gender, number, definiteness, mood, person, voice, tense and other features.
  - The ATB (Arabic TreeBank) FULL tagset has 2000 tags.
  - A reduced tagset (RTS) contains 25 tags.
  - For this research, a new tagset (ERTS) of 75 tags was created.

Approach (2)
• POS tagger features
  - ERTS comprises 75 tags; the current system uses only 57 of them.
  - Lexical features: +/-4 character n-grams
  - +/-2 words around the focus word
  - The 2 previous tags
  - The kernel is a polynomial kernel of degree 2
  - One-vs-all approach for classification
• Base Phrase Chunking
  - 10 types of chunked phrases are designated (21 IOB tags): ADJP, ADVP, CONJP, INTJP, NP, PP, PREDP, PRTP, SBARP, VP.
  - The IOB chunk tags are I-ADJP, B-ADJP, etc., plus O.
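The POS tagger features listed above can be sketched as a feature extractor for one focus word. This is a hedged illustration, not the paper's implementation: it reads "+/-4 character n-grams" as prefixes and suffixes of up to 4 characters, which is an assumption, and the names `pos_features` and `PAD` are hypothetical.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical filler for positions outside the sentence

def pos_features(words: List[str], prev_tags: List[str], i: int) -> Dict[str, str]:
    """Features for the focus word words[i].

    prev_tags holds the tags already assigned to words[0..i-1].
    """
    feats: Dict[str, str] = {}
    word = words[i]
    # Character n-grams: prefixes and suffixes of length 1 to 4 (assumed reading).
    for n in range(1, 5):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    # Window of +/-2 words around the focus word.
    for offset in range(-2, 3):
        j = i + offset
        feats[f"word[{offset:+d}]"] = words[j] if 0 <= j < len(words) else PAD
    # The two previously assigned tags.
    for k in (1, 2):
        j = i - k
        feats[f"tag[-{k}]"] = prev_tags[j] if j >= 0 else PAD
    return feats

# Toy input in Buckwalter transliteration, using the tag shown on the slides.
words = ["msA'", "AljmEp"]
feats = pos_features(words, ["DNN"], 1)
```

The 21 IOB tags for chunking follow from the 10 phrase types: each type has a B- and an I- tag, plus the single O tag.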

Approach (3)
• The training data is derived from the ATB
• Chunklink software flattens the tree to a sequence of base (non-recursive) phrase chunks with their IOB labels
• Chunklink was tailored for English structure
• Chunklink's output was post-processed using linguistic knowledge of Arabic syntactic structures

Base Phrase Chunks (1)
1) IDAFA – this structure marks possession in Arabic
• Example: [msA' AljmEp] ‘night of Friday’ is IOB annotated as [msA' DNN B-NP, AljmEp DNNM I-NP]
2) NOUN-ADJ – nouns followed by adjectives (with agreement)
• Example: HSylp nhA}yp rsmyp ‘final official outcome’ is IOB annotated as [HSylp NNF B-NP, nhA}yp JJF I-NP, rsmyp JJF I-NP]
• Similarly, pronouns, interjections and prepositional phrases are modified

Base Phrase Chunks (2)
• The complete phrase types are
1. ADJP – adjectival phrase: qrybA jdA ‘very soon’
2. ADVP – adverbial phrase: sryEA ‘quickly’, lkn hA ‘but she’
3. CONJP – conjunctive phrase: w AlflsTynywn ‘and the Palestinians’
4. INTJP – interjective phrase: nEm ‘yes’, yA Axt ‘Oh sister’
5. NP – noun phrase: AlzfAf AljmAEy ‘the group wedding’, mlk AlArdn ‘king of Jordan’
6. PP – prepositional phrase: xlAl Hfl AlzfAf AljmAEy ‘during the group wedding party’
7. PREDP – predicative phrase: An ‘[is]’ followed by a noun phrase, e.g. An AlASlAH Aldyny mhmp Almjddyn ‘religious improvement is the reformers’ task’
8. PRTP – particles (negative etc.) that precede both nouns and verbs: lA sy~mA ‘not as long’
9. SBARP – subjunctive constructions: phrases that begin with a particle meaning ‘that’
10. VP – verb phrase: headed by a verb

Feature sets
1. The POS tagset
2. The presence or absence of explicit morphological features
• These features are varied, and 10 different experimental conditions are devised from
  a) the POS tag set, and
  b) either no explicit features (noFeat), all explicit features (allFeat), or some selective features
3. The BPC context, defined as a window of +/-2 tokens
4. The tags for the previous two tokens
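The chunker's feature sets can be sketched in the same style as a per-token feature extractor. This is only an illustration under assumptions: the function name `bpc_features`, the `PAD` symbol and the `gender` feature in the example are hypothetical, and passing `morph=None` stands in for the noFeat condition.

```python
from typing import Dict, List, Optional

PAD = "<PAD>"  # hypothetical filler for positions outside the sentence

def bpc_features(
    words: List[str],
    pos_tags: List[str],
    prev_chunk_tags: List[str],
    i: int,
    morph: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Features for chunking token i: a +/-2 token window of words and POS
    tags, optional explicit morphological features, and the IOB tags of the
    previous two tokens."""
    feats: Dict[str, str] = {}
    n = len(words)
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < n
        feats[f"word[{offset:+d}]"] = words[j] if inside else PAD
        feats[f"pos[{offset:+d}]"] = pos_tags[j] if inside else PAD
        # Explicit morphological features are added only when supplied.
        if morph is not None and inside:
            for name, value in morph[j].items():
                feats[f"{name}[{offset:+d}]"] = value
    for k in (1, 2):
        j = i - k
        feats[f"chunk[-{k}]"] = prev_chunk_tags[j] if j >= 0 else PAD
    return feats

# The NOUN-ADJ example from the slides.
words = ["HSylp", "nhA}yp", "rsmyp"]
pos = ["NNF", "JJF", "JJF"]
no_feat = bpc_features(words, pos, ["B-NP"], 1)
# Hypothetical explicit gender feature, for illustration only.
all_feat = bpc_features(words, pos, ["B-NP"], 1, morph=[{"gender": "F"}] * 3)
```

Note that a tag such as NNF or JJF already encodes some morphological information, so the explicit features can duplicate what the POS tag carries.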

Experiments and Results
• Data
  - Obtained from ATB1v3, ATB2v2 and ATB3v2 (news genre)
  - Development data: 2,304 sentences and 70,188 tokens
  - Training data: 18,970 sentences and 594,683 tokens
  - Test data: 2,337 sentences and 69,665 tokens
• Algorithm: SVM available from the YAMCHA tools
• Results (1): POS tagging
  - RTS: 25 tags
  - ERTS: 75 tags
  - Baseline: most frequent tag
  - The POS tagger outperforms the baseline
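A most-frequent-tag baseline of the kind named above can be sketched as follows. This is a minimal illustration on toy data, not the paper's baseline code; falling back to the overall most frequent tag for unseen words is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def train_baseline(
    tagged_sentences: List[List[Tuple[str, str]]]
) -> Tuple[Dict[str, str], str]:
    """Learn each word's most frequent tag, plus a default tag for unseen words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    lexicon = {w: counts.most_common(1)[0][0] for w, counts in per_word.items()}
    default_tag = overall.most_common(1)[0][0]  # assumed fallback for unknown words
    return lexicon, default_tag

def tag_baseline(words: List[str], lexicon: Dict[str, str], default_tag: str) -> List[str]:
    """Assign each word its most frequent training tag."""
    return [lexicon.get(w, default_tag) for w in words]

# Toy training data built from the slide's example words; not real ATB data.
train = [
    [("HSylp", "NNF"), ("nhA}yp", "JJF")],
    [("rsmyp", "JJF"), ("HSylp", "NNF")],
    [("nhA}yp", "JJF")],
]
lexicon, default_tag = train_baseline(train)
predicted = tag_baseline(["HSylp", "unseen"], lexicon, default_tag)
```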

Experiments and Results (2)
• Results (2): Base Phrase Chunking
  - The previous state-of-the-art result was Fβ=1 = 91.44%
  - The current results outperform the previous results
  - This is attributed to better-quality annotations
  - ERTS (75 tags)
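The Fβ=1 score is the harmonic mean of chunk-level precision and recall. The sketch below shows one common way to compute it from IOB sequences, counting a predicted chunk as correct only when its type and both boundaries match the gold chunk. The slides do not name the evaluation script, so this exact-match convention and the function names are assumptions.

```python
from typing import List, Set, Tuple

def iob_to_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end) chunk spans from an IOB tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if start is not None and (tag == "O" or starts_new):
            spans.add((label, start, i))
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
    return spans

def chunk_f(gold: List[str], pred: List[str], beta: float = 1.0):
    """Return chunk-level (precision, recall, F_beta)."""
    g, p = iob_to_spans(gold), iob_to_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Gold tags for the English example; the prediction splits the PP wrongly.
gold = ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "B-PP", "I-PP", "O"]
pred = ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "B-PP", "B-NP", "O"]
precision, recall, f1 = chunk_f(gold, pred)
```

In this toy case 3 of the 5 predicted chunks and 3 of the 4 gold chunks match, so a single boundary error lowers both precision and recall.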

Experiments and Results (3)
• All results yielded by the ERTS POS tagset outperform their counterparts using the RTS POS tagset.
• The ERTS-noFeat condition outperforms all other conditions in the experiments.
• Adding morphological features to the RTS POS tagset slightly improves performance.
• However, adding these features to the ERTS and FULL conditions does not help.
  - The added features may act as noise
  - The information already encoded in the POS tags was enough

Experiments and Results (4)
• Assessing the impact of features by phrase type
• Adding explicit morphological features to the base condition RTS yields better results
• The highest scores: NP 94.92, ADJP 73.16, INTJP 64.29

Conclusion
• Results are better than state-of-the-art published results
• Results suggest that choosing the POS tagset carefully has a significant impact on higher levels of syntactic processing.
• Adding morphological features increased performance when tagsets are reduced, but not when full tagsets are used.


Yemane
