
Programming Graphical Models for NLP

Mahak Gupta

July 13, 2015


Transcript

  1. Unsupervised Bilingual POS Tagging with Markov Random Fields (2011)
    Authors: Desai Chen, Chris Dyer, Shay B. Cohen, Noah A. Smith
    Presentation by: Mahak Gupta
  2. Goal
    We want to learn the linguistic structure of a language without any labelled data in that language. We take help from annotated data in another language and see how it can be used, cross-lingually, to infer the POS tags in the unannotated language.
  3. Goal - Continued
    [Diagram] Annotated data (English) + unlabelled data (Slovene, Bulgarian, etc.) --[unsupervised learning]--> labelled data in the corresponding language.
  4. Example
    Suppose we have two parallel sentences; our problem is to infer the POS tags of the words in these sentences.
    English: I/PRP do/VBP not/RB speak/VBP German/NN
    German:  Ich/PRP spreche/VBP kein/RB Deutsch/NN
    The alignment links not<->kein and speak<->spreche cross each other (cross-linking).
  5. Agenda for this Seminar
    1. Background
    2. Introduction - Concepts Used
    3. Model Used
    4. Optimisation Techniques
    5. Experiments & Results
    6. Conclusion
  6. Background
    • There has been previous work on inferring POS tags in a bilingual setting; see Snyder et al. (2008).
    • Model: directed models (HMMs).
    • Drawback: crossing word alignments were ignored.
    English: I/PRP am/VBP a/DT vegetarian/NN
    German:  Ich bin Vegetarier
  7. MRF
    A Markov random field (often abbreviated as MRF), Markov network, or undirected graphical model is a set of random variables whose dependency structure is described by an undirected graph.
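    Not stated on the slide, but as a reminder of the standard definition: an MRF over variables x with cliques C and clique potentials psi_c factorizes as

        p(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c), \qquad
        Z = \sum_{x} \prod_{c \in C} \psi_c(x_c)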
  8. Word Alignment
    The NLP task of identifying translation relationships among the words of different languages.
    Example: I do not speak German <-> Ich spreche kein Deutsch
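    A minimal sketch (mine, not from the slides) of how such an alignment can be represented in code, as a set of (source index, target index) pairs; the specific links are my illustration for this sentence pair:

        # Parallel sentence pair from the slide
        src = "I do not speak German".split()
        tgt = "Ich spreche kein Deutsch".split()

        # Hypothetical alignment; note the crossing links not<->kein, speak<->spreche
        alignment = {(0, 0), (3, 1), (2, 2), (4, 3)}

        for i, j in sorted(alignment):
            print(f"{src[i]} <-> {tgt[j]}")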
  9. Giza++ (Word Alignment Generation Tool)
    Sample 1 - train.en:
        I do not speak German.
        Road trip to Prague.
        I am a vegetarian.
    Sample 2 - train.fr:
        Je ne parle pas allemand.
        Road trip à Prague.
        Je suis végétarien.
    Command (Unix utility): ./GIZA++ train.en train.fr
    Outputs:
        en.vcb   - uniq_num, en_word, word_freq (e.g. 1, I, 2)
        fr.vcb   - uniq_num, fr_word, word_freq (e.g. 1, Je, 2)
        enfr.snt - word correspondences: uniq_num, Eng_word, French_word (e.g. 1, I, Je)
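    A minimal sketch (my own, not from the slides) of reading the three-column .vcb vocabulary format shown above; the file name and column layout are assumed to be as the slide describes them:

        def read_vcb(path):
            """Parse a vocabulary file with lines of: uniq_num word word_freq."""
            vocab = {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) != 3:
                        continue  # skip blank or malformed lines
                    uid, word, freq = parts
                    vocab[int(uid)] = (word, int(freq))
            return vocab

        # vocab = read_vcb("en.vcb")   # e.g. {1: ("I", 2), ...}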
  10. Model
    The model described in the paper is an MRF:
    ◦ Input: two parallel sentences S = <s1, s2, …, sn> and T = <t1, t2, …, tm>, e.g. "I do not speak German" = "Ich spreche kein Deutsch".
    ◦ A: the set of word alignments between the two word sequences. Tool used: Giza++ (e.g. (I, Ich) = 1, (speak, spreche) = 2, etc.).
    ◦ Output: X, Y, i.e. the sequences of POS tags for the corresponding sentence pair (X is already given; we need to infer Y): tags like NN, PRP, etc. that we are interested in.
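    To make the input/output concrete, a hypothetical container for one training instance (class and attribute names are mine, not the paper's):

        from typing import List, Set, Tuple

        class SentencePair:
            """One instance: sentences S and T, alignment set A, tag sequences X and Y."""
            def __init__(self, s: List[str], t: List[str],
                         alignments: Set[Tuple[int, int]]):
                self.s = s                   # source sentence S = <s1 ... sn>
                self.t = t                   # target sentence T = <t1 ... tm>
                self.a = alignments          # aligned index pairs (i, j)
                self.x: List[str] = []       # POS tags for S (given in this setup)
                self.y: List[str] = []       # POS tags for T (to be inferred)

        pair = SentencePair("I do not speak German".split(),
                            "Ich spreche kein Deutsch".split(),
                            {(0, 0), (3, 1), (2, 2), (4, 3)})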
  11. Model - Continued
    The (unnormalized) probability of this Markov model is given by a product of potentials (equation on the slide; a reconstruction follows below):
    Emit - emission potentials; Trans - transition potentials.
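    A plausible reconstruction of the slide's equation, following the paper's model; the cross-lingual factor (here named Cross) couples aligned tags, and the exact notation may differ from the slide:

        u(x, y | s, t, a) = \prod_{i=1}^{n} \mathrm{Emit}(x_i, s_i)
                            \cdot \prod_{i=1}^{n-1} \mathrm{Trans}(x_i, x_{i+1})
                            \cdot \prod_{j=1}^{m} \mathrm{Emit}(y_j, t_j)
                            \cdot \prod_{j=1}^{m-1} \mathrm{Trans}(y_j, y_{j+1})
                            \cdot \prod_{(i,j) \in a} \mathrm{Cross}(x_i, y_j)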
  12. Model - Continued
    The probability of the word sequences (this can also be called the likelihood) is obtained by summing the unnormalized probability of the Markov model (previous slide) over all tag sequences and dividing by a normalization constant (equation on the slide; a reconstruction follows below).
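    A plausible reconstruction, using the unnormalized score u from the previous slide, with Z(\theta) the normalization constant:

        p(s, t | a; \theta) = \frac{1}{Z(\theta)} \sum_{x, y} u(x, y | s, t, a)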
  13. Optimisation Techniques
    Objective: the best likelihood for the observed word sequences (and hence the best fit for the predicted sequences of POS tags). We need to optimise the weight vector to obtain the best likelihood.
    Optimisation: Maximum Likelihood Estimation vs. Contrastive Estimation.
  14. Maximum likelihood estimation (MLE)
    This is basically an optimisation problem; in class we have seen a family of algorithms for such optimisation, e.g. hill climbing, but they suffer from the "local maxima" problem. MLE = maximising the likelihood function; the log-likelihood is simply the log of the likelihood function.
    https://onlinecourses.science.psu.edu/stat504/node/27
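    In symbols (a standard statement, not copied from the slide), using the likelihood from slide 12:

        \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{(s, t, a)} \log p(s, t | a; \theta)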
  15. Contrastive Estimation
    CE "tweaks" the optimisation objective by introducing a neighbourhood function. In our case we define a neighbourhood function for sentence pairs, N(s, t), which maps a pair of sentences to a set of pairs of sentences. The normalization constant is tweaked: it is replaced by a sum over the neighbourhood.
    http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf
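    A plausible form of the CE objective (following Smith & Eisner, 2005; not verbatim from the slide), where the global normalizer is replaced by a sum over the neighbourhood:

        \hat{\theta}_{\mathrm{CE}} = \arg\max_{\theta} \sum_{(s, t, a)}
            \log \frac{\sum_{x, y} u(x, y | s, t, a)}
                      {\sum_{(s', t') \in N(s, t)} \sum_{x, y} u(x, y | s', t', a)}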
  16. MLE vs Contrastive Estimation
    • Both approaches were tried on the multilingual model.
    • The MRF model does not work well with MLE compared to the HMM: the weight vector is not normalized properly and the "local maxima" problem is not solved (though MLE works well with the HMM).
    • CE gave a better-normalized distribution of weights and solved the "local maxima" problem, hence it was chosen to optimise the objective.
  17. Experiments
    Data sets used:
    • The "1984" corpus: http://nl.ijs.si/ME/Vault/CD/docs/1984.html
    • Europarl: http://www.statmt.org/europarl (for verifying cross-linking)
    Testing set: 25% of the sentences. Word alignments are given by Giza++ (http://www.statmt.org/moses/giza/GIZA++.html).
    Languages: English, Slovene, Bulgarian, Serbian. Total combinations (bilingual model) = 4C2 = 6.
  18. Exp. 1 - Monolingual Experiments
    Objective: verify our model's suitability.
    1. Supervised learning:
    • Trained a monolingual supervised MRF model to compare with the results of supervised HMMs.
    • Our model vs. HMM: difference in accuracy less than 0.1%.
    2. Unsupervised learning:
        Language    Random  HMM   MRF
        Bulgarian   82.7    88.9  93.5
        English     56.2    90.7  87.0
        Serbian     83.4    85.1  89.3
        Slovene     84.7    87.4  94.5
  19. But what about cross-linking?
    • The 1984 dataset has only 5% of sentences with cross-linking.
    • The bilingual MRF model is therefore also applied to the Europarl corpus (tested on the French-English pair).
        Language   With cross-linking  Without cross-linking
        English    73.8                70.3
        French     56.0                59.2
  20. Conclusion
    • Used MRFs to study bilingual POS tagging and compared the results with HMMs.
    • The bilingual MRF model can also be used to model cross-linking in parallel sentences.
  21. References
    • Unsupervised Bilingual POS Tagging with Markov Random Fields (http://people.csail.mit.edu/desaic/pdf/chen+dyer+cohen+smith.unsup11.pdf)
    • Unsupervised Multilingual Learning for POS Tagging (http://www.aclweb.org/anthology/D08-1109.pdf)
    • Contrastive Estimation: Training Log-Linear Models on Unlabeled Data (http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf)
    • Machine Learning classes by Andrew Ng (coursera.org)