Programming Graphical Models for NLP

Unsupervised Bilingual POS Tagging with Markov Random Fields (2011)
Authors: Desai Chen, Chris Dyer, Shay B. Cohen, Noah A. Smith
Presentation by: Mahak Gupta

Goal
We want to learn the linguistic structure of a language without any labelled data in that language. We use annotated data in another language and see how it helps to infer the POS tags cross-lingually.

Goal - Continued
[Diagram: annotated (labelled) data in English plus unlabelled data in Sloviak/Bulgarian etc. go through unsupervised learning to give labelled data in the corresponding language]

Example
Suppose we have two parallel sentences. Our problem is to infer the POS tags of the words in these sentences.
English: I/PRP do/VBP not/RB speak/VBP German/NN
German: Ich/PRP spreche/VBP kein/RB Deutsch/NN
[Diagram: links between the aligned words, including cross linking]

Agenda for this Seminar
1. Background
2. Introduction - Concepts Used
3. Model Used
4. Optimisation Techniques
5. Experiments & Results
6. Conclusion

Background
• There has been previous work on inferring POS tags in a bilingual setting; see Snyder et al. (2008).
• Model - directed models, HMM
• Drawback - crossing word alignments were ignored
English: I/PRP am/VBP a/DT vegetarian/NN
German: Ich bin Vegetarier

MRF
A Markov random field (often abbreviated as MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph.

Word Alignment
An NLP task of identifying translation relationships among the words of sentences in different languages.
English: I do not speak German
German: Ich spreche kein Deutsch
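
One way to picture an alignment is as a set of index pairs. The sketch below is only an illustration: the links are written by hand for the example sentence pair (they are not the output of an alignment tool), and `crossing_links` is a hypothetical helper that reports which links cross each other.

```python
from itertools import combinations

english = ["I", "do", "not", "speak", "German"]
german = ["Ich", "spreche", "kein", "Deutsch"]

# Hand-written links (English index, German index); an assumption made
# for illustration, not produced by an alignment tool.
alignment = {(0, 0), (2, 2), (3, 1), (4, 3)}

def crossing_links(links):
    """Return the pairs of links that cross each other."""
    crossings = []
    for (i1, j1), (i2, j2) in combinations(sorted(links), 2):
        # Two links cross when source order and target order disagree.
        if (i1 - i2) * (j1 - j2) < 0:
            crossings.append(((i1, j1), (i2, j2)))
    return crossings

for i, j in sorted(alignment):
    print(english[i], "<->", german[j])
print("crossing pairs:", crossing_links(alignment))
```

Here the links not/kein and speak/spreche cross, which is the kind of configuration the rest of the talk refers to as cross linking.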

GIZA++ (Word Alignment Generation Tool)
Sample 1 - train.en
I do not speak German. Road trip to Prague. I am a Vegetarian
Sample 2 - train.fr
Je ne parle pas allemand. Road trip à Prague. Je suis végétarien
./GIZA++ train.en train.fr (Unix utility)
en.vcb
uniq_num  en_word  word_freq
1         I        2
fr.vcb
uniq_num  fr_word  word_freq
1         Je       2
enfr.snt
uniq_num  Eng_word  French_word
1         I         Je

Model
The model described in the paper is an MRF:
◦ Input: two parallel sentences S = <s1, s2, ..., sn> and T = <t1, t2, ..., tn>, e.g. "I do not speak German" = "Ich spreche kein Deutsch".
◦ A: the set of word alignments between these two word sequences. Tool used: GIZA++ (I, Ich = 1; speak, spreche = 2; etc.)
◦ Output: X, Y, i.e. the sequences of POS tags (NN, PRP, etc.) for the corresponding sentence pair (X is already given; we need to infer Y).

Model - Continued
The (unnormalized) probability of this Markov model is given by
[Equation from the slide is not preserved in the transcript]
Emit - emission probability
Trans - transition probability
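
As a hedged sketch only: assuming a log-linear parameterization with a weight vector w (the optimisation slide refers to a weight vector), the unnormalized score of a tagged, aligned sentence pair could be written as below. The feature functions f_emit, f_trans and f_align are hypothetical names introduced here for illustration; the exact form used on the slide is not recoverable from the transcript.

```latex
\tilde{p}_w(x, y, s, t \mid A) = \exp\Big(
    \sum_{i=1}^{|s|} \big[ w^\top f_{\mathrm{emit}}(s_i, x_i)
                         + w^\top f_{\mathrm{trans}}(x_{i-1}, x_i) \big]
  + \sum_{j=1}^{|t|} \big[ w^\top f_{\mathrm{emit}}(t_j, y_j)
                         + w^\top f_{\mathrm{trans}}(y_{j-1}, y_j) \big]
  + \sum_{(i,j) \in A} w^\top f_{\mathrm{align}}(x_i, y_j)
\Big)
```

The first two sums play the role of the Emit and Trans terms for each sentence, and the last sum ties together the tags of aligned words.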

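Model - Continued
The probability of the word sequences (this can also be called the likelihood) is given by
[Equation from the slide is not preserved in the transcript]
Annotations on the equation:
• Normalization constant
• Summation of the probability of the Markov model (previous slide)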
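
To make the roles of the summation and the normalization constant concrete, here is a brute-force toy in Python. It is a sketch under stated assumptions, not the paper's model: it is monolingual, the vocabulary, tag set and weights are hypothetical values picked by hand, and the constant Z sums only over sentences of the same length.

```python
import math
from itertools import product

TAGS = ("PRP", "VBP", "NN")
VOCAB = ("Ich", "spreche", "Deutsch")

# Hypothetical weights, picked by hand for illustration only.
EMIT = {("Ich", "PRP"): 2.0, ("spreche", "VBP"): 2.0, ("Deutsch", "NN"): 2.0}
TRANS = {("PRP", "VBP"): 1.0, ("VBP", "NN"): 1.0}

def unnormalized(words, tags):
    """exp(total emission + transition weight) of one tagged sentence."""
    score = sum(EMIT.get((w, t), 0.0) for w, t in zip(words, tags))
    score += sum(TRANS.get(pair, 0.0) for pair in zip(tags, tags[1:]))
    return math.exp(score)

def marginal(words):
    """Numerator: sum the unnormalized score over every tag sequence."""
    return sum(unnormalized(words, tags)
               for tags in product(TAGS, repeat=len(words)))

def likelihood(words):
    """Probability of the words: numerator divided by the constant Z."""
    z = sum(marginal(other) for other in product(VOCAB, repeat=len(words)))
    return marginal(words) / z

print(likelihood(("Ich", "spreche", "Deutsch")))
print(likelihood(("Deutsch", "spreche", "Ich")))
```

Enumerating every sentence to obtain Z is only feasible for a toy vocabulary like this one, which is one reason the choice of normalization matters in the optimisation step.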

Optimisation Techniques
Objective: the best likelihood for the word sequences (and hence the best fit for the predicted sequence of POS tags). We need to optimise the weight vector to obtain the best likelihood.
Optimisation
• Maximum Likelihood Estimation
• Contrastive Estimation

Maximum likelihood estimation (MLE)
This is basically an optimisation problem. In class we have seen a family of algorithms for such optimisation, e.g. the hill climbing algorithm, but they suffer from the problem of “local maxima”.
MLE = maximizing the likelihood function.
The log-likelihood is simply the log of the likelihood function.
https://onlinecourses.science.psu.edu/stat504/node/27
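
As a small illustration of maximizing a log-likelihood by hill climbing, the sketch below estimates the bias of a coin from hypothetical counts (7 heads, 3 tails). This toy objective has a single maximum, so hill climbing finds it; the point of the example is the mechanics, not the “local maxima” issue, which arises for harder objectives such as the one in the talk.

```python
import math

def log_likelihood(p, heads, tails):
    """Log of the likelihood p**heads * (1 - p)**tails."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

def hill_climb(heads, tails, p=0.5, step=0.1, tol=1e-6):
    """Move p towards higher log-likelihood; halve the step when stuck."""
    while step > tol:
        best = p
        for cand in (p - step, p + step):
            if 0.0 < cand < 1.0 and (
                log_likelihood(cand, heads, tails)
                > log_likelihood(best, heads, tails)
            ):
                best = cand
        if best == p:
            step /= 2.0   # no neighbour is better: search more finely
        else:
            p = best      # climb to the better neighbour
    return p

p_hat = hill_climb(heads=7, tails=3)
print(p_hat)
```

The result can be compared with the closed-form estimate heads / (heads + tails) for this simple model.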

Contrastive Estimation
CE is the process of “tweaking/manipulating” an optimization objective by introducing a neighbourhood function. In our case we define a neighbourhood function for sentences, N(s;t), which maps a pair of sentences to a set of pairs of sentences.
We tweak the normalization constant as part of the neighbourhood function.
http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf
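
A neighbourhood function can be sketched in a few lines. Smith and Eisner (2005) describe neighbourhoods built by transposing adjacent words; how the pair neighbourhood N(s;t) is built is not stated on the slide, so the version below, which pairs every neighbour of s with every neighbour of t, is an assumption made for illustration. The names `trans1` and `pair_neighbourhood` are hypothetical.

```python
def trans1(sentence):
    """The sentence itself plus every variant with one adjacent pair swapped."""
    words = tuple(sentence)
    neighbours = {words}
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighbours.add(tuple(swapped))
    return neighbours

def pair_neighbourhood(s, t):
    """Assumed N(s; t): each neighbour of s paired with each neighbour of t."""
    return {(ns, nt) for ns in trans1(s) for nt in trans1(t)}

s = "I do not speak German".split()
t = "Ich spreche kein Deutsch".split()
hood = pair_neighbourhood(s, t)
print(len(hood), "sentence pairs in the neighbourhood")
```

In contrastive estimation the normalization then runs over this small set of perturbed sentence pairs rather than over all possible sentences.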

MLE vs Contrastive Estimation
• Both approaches were modelled on the multilingual model.
• The MRF model doesn’t work as well with MLE as the HMM does. The weight vector is not normalized properly and the “local maxima” problem is not solved (though MLE works well with the HMM).
• CE gave a better-normalized distribution of weights, and hence it was chosen to optimise the equation. The “local maxima” problem is solved.

Experiments
Data sets used:
• http://nl.ijs.si/ME/Vault/CD/docs/1984.html
• http://www.statmt.org/europarl (for verifying crosslinking)
Testing set: 25% of sentences
Word alignments are given by GIZA++ (http://www.statmt.org/moses/giza/GIZA++.html)
Languages: English, Slovene, Bulgarian, Serbian
Total combinations (bilingual model) = 4C2 = 6
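
The count 4C2 = 6 can be checked by enumerating the language pairs directly. This is a minimal sketch using only the four language names listed on the slide.

```python
from itertools import combinations

languages = ["English", "Slovene", "Bulgarian", "Serbian"]

# Every unordered pair of distinct languages gives one bilingual model.
pairs = list(combinations(languages, 2))
for first, second in pairs:
    print(first, "-", second)
print("total:", len(pairs))
```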

Exp - 1 Monolingual Exp.
Objective: verify our model’s suitability
1. Supervised Learning:
• Trained a monolingual supervised MRF model to compare with the results of supervised HMMs.
• Our model vs HMM - difference in accuracy of less than 0.1%.
2. Unsupervised Learning:
Language   Random  HMM   MRF
Bulgarian  82.7    88.9  93.5
English    56.2    90.7  87.0
Sloviac    83.4    85.1  89.3
Slovene    84.7    87.4  94.5

Exp - 2 Bilingual Exp.

But what about crosslinking?
• The NYT 1984 dataset has crosslinking in only 5% of its sentences.
• The bilingual MRF model is applied to the Europarl corpus (tested on the French-English pair).
Language  With cross linking  W/o cross linking
English   73.8                70.3
French    56.0                59.2

Conclusion
• MRFs were used to study bilingual POS tagging, and the results were compared with HMMs.
• The bilingual MRF model can also be used to model cross linking in parallel sentences.

References
• Unsupervised Bilingual POS Tagging with Markov Random Fields (http://people.csail.mit.edu/desaic/pdf/chen+dyer+cohen+smith.unsup11.pdf)
• Unsupervised multilingual learning for POS tagging (http://www.aclweb.org/anthology/D08-1109.pdf)
• Contrastive Estimation: Training Log-Linear Models on Unlabeled Data (http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf)
• Classes by Andrew Ng on Machine Learning (coursera.org)

Thank you! Confusions? Questions, please.
