
SIMPLE SEMI-SUPERVISED POS TAGGING

Yemane
June 17, 2016

Karl Stratos
Michael Collins
Columbia University, New York, NY 10027, USA

Proceedings of NAACL-HLT 2015, pages 79–87, Denver, Colorado, May 31 – June 5, 2015.
© 2015 Association for Computational Linguistics


Transcript

  1. SIMPLE SEMI-SUPERVISED POS TAGGING
     June 17, 2016
     Karl Stratos, Michael Collins
     Columbia University, New York, NY 10027, USA
     Proceedings of NAACL-HLT 2015, pages 79–87, Denver, Colorado, May 31 – June 5, 2015.
     © 2015 Association for Computational Linguistics
  2. Overview
     • Purpose: reduce the level of supervision required to achieve state-of-the-art POS accuracy
     • Motivation
       • POS tags are almost deterministic
       • The Brown et al. word clustering method reveals the underlying POS tag information of words
     • Method: discriminative classifier + active learning
     • Results
       • Tagging accuracy of 93% with just 400 labeled words
       • Tagging accuracy of 97.03% with just 0.74% of the original training data (English dataset)
  3. Introduction
     • POS tagging refers to labeling words with parts of speech such as verbs, nouns, etc.
     • POS tagging may be performed with supervised, unsupervised, or semi-supervised learning approaches
     • Fully supervised POS tagging is considered a SOLVED problem
     • This is not the case for unsupervised POS tagging
     • Previous work is
       • complicated by varying assumptions and unclear evaluation metrics
       • not good enough to be practical (accuracy is low)
  4. Motivation 1: POS tags are almost deterministic
     • Some words, like "set", are ambiguous (verb? noun?)
     • Some are deterministic, e.g. "the" is always a determiner
     • Deterministic mapping f : w → t (see the sketch below)
       • count(w, t) = number of times word w is tagged as tag t
     • Accuracy based on this simple assumption
       • Coarse tags: 88.5%
       • Fine-grained tags: 92.22%
     • Model: restricted HMM with first-order sequence structure
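The most-frequent-tag baseline described on this slide is easy to make concrete. The sketch below is not the authors' code; it assumes a hypothetical list `train_pairs` of (word, tag) pairs, builds count(w, t), and maps each word to its most frequent training tag, backing off to the globally most frequent tag for unseen words.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(train_pairs):
    """Build the deterministic mapping f: w -> t from count(w, t)."""
    counts = defaultdict(Counter)        # counts[w][t] = count(w, t)
    tag_totals = Counter()
    for w, t in train_pairs:
        counts[w][t] += 1
        tag_totals[t] += 1
    fallback = tag_totals.most_common(1)[0][0]   # most frequent tag overall
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda w: best.get(w, fallback)

# Usage (hypothetical data): tag = train_most_frequent_tag(pairs); tag("the")
```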
  5. Motivation 2: the Brown et al. (1992) model
     • A sequence model often used for deriving lexical representations
       • Application: word clustering
       • Result: a hierarchy over word types (see the bit-string sketch below)
     Figure 1: representational schemes under the Brown model. (a) Bit-string representations: the path from the root to a word is encoded as a bit string.
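To illustrate how bit strings from the Brown hierarchy are typically used, here is a toy sketch. The `brown_bits` table is invented for illustration (it is not from the paper); the point is that prefixes of a word's bit string act as cluster identifiers at several granularities.

```python
# Toy Brown-style bit strings (hypothetical values, for illustration only).
brown_bits = {
    "the":  "0010",
    "a":    "0011",
    "walk": "110100",
    "run":  "110101",
}

def brown_prefix_features(word, prefix_lengths=(2, 4, 6)):
    """Prefixes of the bit string act as coarse-to-fine cluster ids."""
    bits = brown_bits.get(word)
    if bits is None:
        return {}
    feats = {"brown=" + bits: 1.0}
    for k in prefix_lengths:
        feats["brown_prefix%d=%s" % (k, bits[:k])] = 1.0
    return feats

# "walk" and "run" share the prefix "1101", so they fall in the same coarse cluster.
```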
  6. Motivation (continued)
     • A variant of CCA [Stratos et al. (2014)] recovered the clusters under the Brown model
     • Words are represented as m-dimensional vectors, where m = number of hidden states in the model
     • Real values can represent ambiguity (rough sketch below)
     • Figure 1 (b): Canonical Correlation Analysis (CCA) vector representations
     • Assumption: the hidden states in the Brown model can capture POS tags
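As a rough, hedged sketch of what CCA-style embeddings look like in code: the function below builds a word-context co-occurrence matrix from raw sentences, rescales it by the square roots of the marginal counts, and takes a truncated SVD. The exact algorithm of Stratos et al. (2014) differs in its preprocessing and estimation details, so treat this only as an approximation of the idea that each word gets an m-dimensional real-valued vector.

```python
import numpy as np

def cca_style_embeddings(sentences, m=50):
    """Rough sketch: SVD of a sqrt-count-normalized word/context co-occurrence matrix."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in (i - 1, i + 1):             # +/-1 context window
                if 0 <= j < len(s):
                    C[idx[w], idx[s[j]]] += 1.0
    # Scale rows/columns by inverse sqrt of marginal counts (CCA-like rescaling).
    row = C.sum(axis=1, keepdims=True) + 1e-8
    col = C.sum(axis=0, keepdims=True) + 1e-8
    Omega = C / np.sqrt(row) / np.sqrt(col)
    U, S, Vt = np.linalg.svd(Omega, full_matrices=False)
    m = min(m, U.shape[1])
    return {w: U[idx[w], :m] for w in vocab}
```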
  7. Method: the MINITAGGER framework
     • Uses an existing discriminative classifier (e.g. an SVM) to map a word's context to a POS tag
     • Allows learning from partially labeled sentences (active learning)
     • Training and tagging can be very fast
     • Features can be easily added
     (a minimal stand-in for this setup is sketched below)
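MiniTagger is released by the authors; the snippet below is only a minimal stand-in for the setup this slide describes, using scikit-learn's LinearSVC as the discriminative classifier. The `feature_fn` argument is a placeholder for the feature extraction sketched after the next slide.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_tagger(labeled, feature_fn):
    """labeled: list of (sentence, i, tag); feature_fn(sentence, i) -> feature dict."""
    X = [feature_fn(s, i) for s, i, _ in labeled]
    y = [t for _, _, t in labeled]
    vec = DictVectorizer()
    clf = LinearSVC()                      # linear SVM as the classifier
    clf.fit(vec.fit_transform(X), y)
    return vec, clf

def tag_word(vec, clf, sentence, i, feature_fn):
    """Predict the POS tag of the word at position i."""
    return clf.predict(vec.transform([feature_fn(sentence, i)]))[0]
```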
  8. Method: features
     Baseline (BASE): spelling features of (x, i)
     • Identities of x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2}
     • Prefixes and suffixes of x_i up to length 4
     • Whether x_i is capitalized, numeric, or non-alphanumeric
     Notation:
     • (x, i) = sentence-position pair
     • bit(x) = Brown bit string of x
     • cca(x) = m-dimensional CCA embedding of x
     (a possible rendering of these feature templates is sketched below)
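One possible rendering of these feature templates, again a sketch rather than the authors' exact implementation. The `brown_bits` and `cca_vectors` arguments are assumed lookup tables like the toy ones above (word to bit string, word to m-dimensional vector).

```python
def features(sentence, i, brown_bits=None, cca_vectors=None):
    """Spelling features of (x, i), optionally augmented with bit(x) and cca(x)."""
    feats = {}
    # Identities of x_{i-2}, ..., x_{i+2}
    for off in (-2, -1, 0, 1, 2):
        j = i + off
        w = sentence[j] if 0 <= j < len(sentence) else "<PAD>"
        feats["w[%d]=%s" % (off, w)] = 1.0
    x = sentence[i]
    # Prefixes and suffixes of x_i up to length 4
    for k in range(1, 5):
        feats["pre=" + x[:k]] = 1.0
        feats["suf=" + x[-k:]] = 1.0
    # Shape features
    feats["is_cap"] = float(x[0].isupper())
    feats["is_num"] = float(x.isdigit())
    feats["non_alnum"] = float(not x.isalnum())
    # Lexical representations: bit(x) and cca(x)
    if brown_bits and x in brown_bits:
        feats["bit=" + brown_bits[x]] = 1.0
    if cca_vectors is not None and x in cca_vectors:
        for d, v in enumerate(cca_vectors[x]):
            feats["cca%d" % d] = float(v)
    return feats
```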
  9. Method: sampling methods
     • Sampling: the method for selecting candidate words for labeling
       • Attempts to reduce the amount of training data
     • Active learning
       • Find the most informative words for labeling
       • Words whose predicted tag is least confident are selected for active labeling
       • Simple margin sampling (sketched below)
     • Random and frequent-word sampling
       • Random sampling: select M words uniformly at random
       • Frequent-word sampling: select random occurrences of the M most frequent word types
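One common way to realize the simple margin sampling mentioned above, assuming the `vec`/`clf` pair from the earlier classifier sketch: score every unlabeled (sentence, position) pair and keep the M pairs whose top two class scores are closest, i.e. the least confident predictions.

```python
import numpy as np

def margin_sample(vec, clf, unlabeled, feature_fn, M=100):
    """Return the M least-confident (sentence, i) pairs by decision margin."""
    X = vec.transform([feature_fn(s, i) for s, i in unlabeled])
    scores = clf.decision_function(X)     # shape (n, n_classes); assumes > 2 tag classes
    top2 = np.sort(scores, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]     # small margin = uncertain prediction
    order = np.argsort(margins)[:M]
    return [unlabeled[k] for k in order]
```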
  10. Method: process flow
      Components of the active-learning loop (from the slide diagram):
      • SVM model
      • Extract features (with lexical representations)
      • Find the least confident predictions
      • Active labeling
      • Pool of unlabeled text
      (the loop is sketched in code below)
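Putting the earlier sketches together, the process flow on this slide corresponds roughly to the loop below. It reuses `train_tagger` and `margin_sample` from the previous snippets, and the `oracle` callable standing in for the human annotator is hypothetical.

```python
def active_learning_loop(seed_labeled, unlabeled, feature_fn, oracle, rounds=10, M=100):
    """Train, find the least confident words, have them labeled, and repeat."""
    labeled = list(seed_labeled)          # (sentence, i, tag) triples
    pool = list(unlabeled)                # (sentence, i) pairs without tags
    for _ in range(rounds):
        vec, clf = train_tagger(labeled, feature_fn)              # SVM model
        queries = margin_sample(vec, clf, pool, feature_fn, M)    # least confident
        for s, i in queries:
            labeled.append((s, i, oracle(s, i)))                  # active labeling
            pool.remove((s, i))
    return train_tagger(labeled, feature_fn)
```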
  11. Experiments: setting
      • Languages: English, German, Spanish
      • Datasets: universal treebank
      • Tagsets: (1) 45 tags, (2) reduced to 12 tags
      • Brown representations derived from the following unlabeled data:
        • English: 772 million
        • German and Spanish: Google Ngram data (German 64 billion, Spanish 83 billion)
      • Bit-string derivation: Liang (2005), Stratos et al. (2014)
      • Word embeddings: 50-dimensional, derived with the CCA algorithm of Stratos et al. (2014)
      • Results compared with a CRF
  12. Experiments: minimum training size needed to reach the fully supervised baseline
      • EN12: > 97% accuracy with 0.74% of the original data
  13. Experiments: comparison with a CRF
      • English, 12-tag version: > 97% accuracy with 0.74% of the data
      • English, 45-tag version: > 96% accuracy with 0.81% of the data
  14. Conclusion
      • The work showed that the Brown model, often used for deriving lexical representations, is particularly appropriate for capturing POS tags
      • It reduced the amount of labeled data required for state-of-the-art POS tagging
      • It obtained an accurate POS tagger with less than 1% of the normally used amount of training data