Improved Arabic Base Phrase Chunking with a new enriched POS tag set

Yemane
December 03, 2015

Mona T. Diab
Center for Computational Learning Systems
Columbia University
[email protected]

Proceedings of the 5th Workshop on Important Unresolved Matters, pages 89–96, Prague, Czech Republic, June 2007.
(c) 2007 Association for Computational Linguistics

Transcript

  1. Improved Arabic Base Phrase Chunking with a New Enriched POS Tag Set

    December 3, 2015
    Mona T. Diab, Center for Computational Learning Systems, Columbia University
    Proceedings of the 5th Workshop on Important Unresolved Matters, pages 89–96, Prague, Czech Republic, June 2007. (c) 2007 Association for Computational Linguistics
  2. Introduction

    • Purpose
      • Develop an Arabic Base Phrase Chunker (BPC), or shallow syntactic parser
    • Approach
      • A support vector machine (SVM) approach for both the POS tagging and BPC processes, with morphological features specific to Modern Standard Arabic and enriched POS tags
    • Result
      • Improved state-of-the-art performance in BPC using a new part-of-speech (POS) tag set with 75 tags
  3. Introduction

    • What is BPC?
      • The process by which adjacent words are grouped together to form non-recursive chunks in a sentence
      • These chunks may be verb phrases, noun phrases, adjective phrases, etc.
    • Example
      • Input sentence: I would eat red luscious apples on Sundays.
      • Chunked sentence: [I] NP [would eat] VP [red luscious apples] NP [on Sundays] PP.
    • Some applications of BPC: information extraction, semantic role labelling
  4. Introduction: IOB annotation scheme

    • I – Inside of a chunk
    • O – Outside the chunk
    • B – Beginning of the chunk
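To make the IOB scheme concrete, the following minimal sketch converts the chunked English sentence from slide 3 into per-token IOB tags. The function name `chunks_to_iob` and the `(label, tokens)` input layout are illustrative assumptions, not part of the original work.

```python
def chunks_to_iob(chunks):
    """Convert (label, tokens) chunks into (token, IOB tag) pairs.

    A label of None marks tokens that lie outside any chunk.
    (Hypothetical helper, written only for illustration.)
    """
    tagged = []
    for label, tokens in chunks:
        for position, token in enumerate(tokens):
            if label is None:
                tagged.append((token, "O"))           # outside any chunk
            elif position == 0:
                tagged.append((token, "B-" + label))  # first token of a chunk
            else:
                tagged.append((token, "I-" + label))  # later token of the same chunk
    return tagged


# The chunked sentence from slide 3; the final period is outside all chunks.
sentence = [
    ("NP", ["I"]),
    ("VP", ["would", "eat"]),
    ("NP", ["red", "luscious", "apples"]),
    ("PP", ["on", "Sundays"]),
    (None, ["."]),
]
iob = chunks_to_iob(sentence)
for token, tag in iob:
    print(token, tag)
```

Each token receives exactly one tag, which is what lets chunking be treated as a per-token classification task.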
  5. Approach

    • Modeling the BPC problem as a classification task
    • Applied a supervised SVM algorithm, chosen for its capability to handle a large number of (overlapping) features
    • POS tagging
      • In Arabic, words are marked for case, gender, number, definiteness, mood, person, voice, tense and other features
      • The ATB (Arabic TreeBank) FULL tagset has 2000 tags
      • A reduced tagset (RTS) contains 25 tags
      • For this research, a new tagset (ERTS) of 75 tags was created
  6. Approach (2)

    • POS tagger features
      • ERTS comprises 75 tags. For the current system, only 57 tags are used.
      • Lexical features of +/-4 character n-grams
      • +/-2 words around the focus word
      • 2 previous tags
      • The kernel is a polynomial kernel of degree 2
      • One-vs-all approach for classification
    • Base Phrase Chunking
      • Designated 10 types of chunked phrases (21 IOB tags)
      • ADJP, ADVP, CONJP, INTJP, NP, PP, PREDP, PRTP, SBARP, VP
      • IOB chunk tags would be I-ADJP, B-ADJP, etc., plus O
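The POS tagger features listed above can be sketched as a per-token feature extractor. This is only an illustration: reading "+/-4 character n-grams" as prefixes and suffixes of up to four characters is an assumption, and the names `char_ngrams`, `pos_features`, and the `<PAD>` placeholder are hypothetical.

```python
def char_ngrams(word, max_n=4):
    """Prefix and suffix character n-grams of length 1..max_n (assumed reading)."""
    feats = {}
    for n in range(1, min(max_n, len(word)) + 1):
        feats["prefix%d" % n] = word[:n]
        feats["suffix%d" % n] = word[-n:]
    return feats


def pos_features(words, i, prev_tags, window=2):
    """Features for the focus word words[i], given the tags already assigned."""
    feats = char_ngrams(words[i])
    # Words in a +/-2 window around the focus word, padded at sentence edges.
    for offset in range(-window, window + 1):
        j = i + offset
        feats["w[%+d]" % offset] = words[j] if 0 <= j < len(words) else "<PAD>"
    # The two previously predicted tags.
    feats["t[-1]"] = prev_tags[-1] if len(prev_tags) >= 1 else "<PAD>"
    feats["t[-2]"] = prev_tags[-2] if len(prev_tags) >= 2 else "<PAD>"
    return feats


# Example using the Buckwalter-transliterated phrase and tag shown on slide 8.
words = ["msA'", "AljmEp"]
feats = pos_features(words, 1, ["DNN"])
print(feats)
```

In a real system each feature/value pair would be mapped to a sparse binary dimension before being passed to the SVM.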
  7. Approach (3)

    • The training data is derived from the ATB
    • Chunklink software flattens the tree to a sequence of base (non-recursive) phrase chunks with their IOB labels
    • Chunklink was tailored for English structure
    • Chunklink's output was post-processed using linguistic knowledge of Arabic syntactic structures
  8. Base Phrase Chunks (1)

    1) IDAFA - this structure marks possession in Arabic
      • Example: [msA' AljmEp] ‘night of Friday’ is IOB annotated as
      • [msA' DNN B-NP, AljmEp DNNM I-NP]
    2) NOUN-ADJ - nouns followed by adjectives (with agreement)
      • Example: HSylp nhA}yp rsmyp ‘final official outcome’
      • [HSylp NNF B-NP, nhA}yp JJF I-NP, rsmyp JJF I-NP]
    • Similarly, pronouns, interjections and prepositional phrases are modified
  9. Base Phrase Chunks (2)

    • The complete phrase types are
      1. ADJP: adjectival phrase, e.g. qrybA jdA ‘very soon’
      2. ADVP: adverbial phrase, e.g. sryEA ‘quickly’, lkn hA ‘but she’
      3. CONJP: conjunctive phrase, e.g. w AlflsTynywn ‘and the Palestinians’
      4. INTJP: interjective phrase, e.g. nEm ‘yes’, yA Axt ‘Oh sister’
      5. NP: noun phrase, e.g. AlzfAf AljmAEy ‘the group wedding’, mlk AlArdn ‘king of Jordan’
      6. PP: prepositional phrase, e.g. xlAl Hfl AlzfAf AljmAEy ‘during the group wedding party’
      7. PREDP: predicative phrase, An ‘[is]’ followed by a noun phrase, e.g. An AlASlAH Aldyny mhmp Almjddyn ‘religious improvement is the reformers’ task’
      8. PRTP: particles (negative etc.) that precede both nouns and verbs, e.g. lA sy~mA ‘not as long’
      9. SBARP: subjunctive constructions, phrases that begin with a particle meaning ‘that’
      10. VP: verb phrase, headed by a verb
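The count of 21 IOB tags quoted on slide 6 follows from these ten phrase types: each type has a B- and an I- variant, plus the single O tag (10 × 2 + 1 = 21). A small sketch, with variable names chosen only for illustration:

```python
# The ten phrase types listed on the slides.
PHRASE_TYPES = [
    "ADJP", "ADVP", "CONJP", "INTJP", "NP",
    "PP", "PREDP", "PRTP", "SBARP", "VP",
]

# One B- and one I- tag per phrase type, plus "O" for tokens outside any chunk.
IOB_TAGS = [
    prefix + "-" + phrase
    for phrase in PHRASE_TYPES
    for prefix in ("B", "I")
] + ["O"]

print(len(IOB_TAGS))
```

With a one-vs-all scheme, this tag inventory would correspond to one binary classifier per tag.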
  10. Feature sets

    1. The POS tagset
    2. The presence or absence of explicit morphological features
      • These features are varied and 10 different experimental conditions are devised by varying
        a) the POS tag set, and
        b) the explicit features: none (noFeat), all (allFeat), or a selected subset
    3. The BPC context, defined as a window of +/-2 tokens
    4. Tags for the previous two tokens
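As a rough illustration of how a noFeat condition might differ from an allFeat condition, the sketch below builds chunker features from a +/-2 token window and the two previous chunk tags, optionally adding explicit morphological attributes. The token layout (dicts with `word`, `pos`, `morph`), the function name `bpc_features`, and the `definite` attribute are assumptions made for this example, not the feature encoding used in the original experiments.

```python
def bpc_features(tokens, i, prev_chunk_tags, use_morph=False, window=2):
    """Features for tokens[i]; each token is a dict with 'word', 'pos', 'morph'."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # positions beyond the sentence contribute no features
        token = tokens[j]
        feats["w[%+d]" % offset] = token["word"]
        feats["pos[%+d]" % offset] = token["pos"]
        if use_morph:
            # allFeat-style condition: add explicit morphological attributes.
            for name, value in token.get("morph", {}).items():
                feats["%s[%+d]" % (name, offset)] = value
    # Chunk tags already assigned to the two preceding tokens.
    padded = ["<PAD>", "<PAD>"] + list(prev_chunk_tags)
    feats["c[-1]"] = padded[-1]
    feats["c[-2]"] = padded[-2]
    return feats


# Tokens from the IDAFA example; the 'morph' values are illustrative only.
tokens = [
    {"word": "msA'", "pos": "DNN", "morph": {"definite": "no"}},
    {"word": "AljmEp", "pos": "DNNM", "morph": {"definite": "yes"}},
]
no_feat = bpc_features(tokens, 1, ["B-NP"])
all_feat = bpc_features(tokens, 1, ["B-NP"], use_morph=True)
print(sorted(set(all_feat) - set(no_feat)))
```

The only difference between the two calls is the extra morphological keys, which mirrors the idea of holding the POS tag set fixed while toggling explicit features.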
  11. Experiments and Results

    • Data
      • Obtained from ATB1v3, ATB2v2 and ATB3v2 (news genre)
      • Development data: 2,304 sentences and 70,188 tokens
      • Training data: 18,970 sentences and 594,683 tokens
      • Test data: 2,337 sentences and 69,665 tokens
    • Algorithm: SVM, available from the YamCha tools
    • Results (1): POS tagging
      • RTS – 25 tags
      • ERTS – 75 tags
      • Baseline – most frequent tag
      • The POS tagger outperforms the baseline
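A most-frequent-tag baseline of the kind named above can be sketched in a few lines. The toy training data reuses the words and tags from slide 8; assigning unseen words the overall most frequent tag is an assumption, since the slides do not say how the baseline handles unknown words.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag, plus a fallback tag for unseen words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    lexicon = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    fallback = overall.most_common(1)[0][0]  # assumed handling of unknown words
    return lexicon, fallback


def tag_baseline(words, lexicon, fallback):
    """Tag each word with its most frequent training tag."""
    return [lexicon.get(word, fallback) for word in words]


# Toy training data built from the examples on slide 8.
train = [
    [("msA'", "DNN"), ("AljmEp", "DNNM")],
    [("HSylp", "NNF"), ("nhA}yp", "JJF"), ("rsmyp", "JJF")],
]
lexicon, fallback = train_baseline(train)
predicted = tag_baseline(["HSylp", "unseen_word"], lexicon, fallback)
print(predicted)
```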
  12. Experiments and Results (2)

    • Results (2): Base Phrase Chunking
      • The previous state-of-the-art result was F(β=1) = 91.44%
      • Current results outperform the previous results
      • Attributed to better quality annotations
      • ERTS (75 tags)
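F(β=1) is the harmonic mean of precision P and recall R, that is 2PR / (P + R). The sketch below computes it at the chunk level, counting a predicted chunk as correct only when both its span and its label match the gold chunk exactly. That exact-match convention is the usual one for chunking evaluation and is assumed here; the function names are illustrative.

```python
def extract_chunks(tags):
    """Return the set of (start, end, label) spans in an IOB tag sequence."""
    chunks = set()
    start, label = None, None
    # A trailing "O" sentinel closes any chunk still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if tag == "O" or starts_new:
            if label is not None:
                chunks.add((start, i, label))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return chunks


def f_beta1(gold_tags, predicted_tags):
    """Chunk-level F(β=1) = 2PR / (P + R)."""
    gold = extract_chunks(gold_tags)
    predicted = extract_chunks(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "B-PP", "I-PP", "O"]
pred = ["B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "B-NP", "B-PP", "I-PP", "O"]
score = f_beta1(gold, pred)
print(round(score, 4))
```

In this toy case the prediction splits one gold NP into two, giving 3 correct chunks out of 5 predicted and 4 gold, so P = 0.6, R = 0.75 and F(β=1) is 2/3.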
  13. Experiments and Results (3)

    • All results yielded by the ERTS POS tagset outperform their counterparts using the RTS POS tagset
    • The ERTS-noFeat condition outperforms all other conditions in the experiments
    • Adding morphological features to the RTS POS tagset slightly improves the performance
    • However, adding these features to the ERTS and FULL conditions does not help
      • The added features may be noise
      • The information encoded in the POS tags was enough
  14. Experiments and Results (4)

    Assessing the impact of features by phrase type
    • Adding explicit morphological features to the base condition RTS yields better results
    • The highest scores: NP 94.92, ADJP 73.16, INTJP 64.29
  15. Conclusion

    • Results are better than state-of-the-art published results
    • Results suggest that choosing the POS tagset carefully has a significant impact on higher levels of syntactic processing
    • Adding morphological features increased performance when the tagsets are reduced, but not when full tagsets are used