Slide 1

Unsupervised Bilingual POS Tagging with Markov Random Fields (2011)
Authors: Desai Chen, Chris Dyer, Shay B. Cohen, Noah A. Smith
Presentation by: Mahak Gupta

Slide 2

Goal

We want to learn the linguistic structure of a language without any labelled data in that language. We use annotated data in another language and see how it can help us infer POS tags in the target language cross-lingually.

Slide 3

Goal - Continued

[Diagram: annotated (labelled) data in English + unlabelled data in the target language (Slovene, Bulgarian, etc.) feed into unsupervised learning, producing labelled data in the corresponding language]

Slide 4

Example

Suppose we have two parallel sentences; our problem is to infer the POS tags of the words in these sentences.

[Figure: the parallel sentences "I do not speak German" / "Ich spreche kein Deutsch", each word labelled with a POS tag (PRP, VBP, RB, NN, ...), with cross-linking alignment edges between translated words]

Slide 5

Agenda for this Seminar

1. Background
2. Introduction - Concepts Used
3. Model Used
4. Optimisation Techniques
5. Experiments & Results
6. Conclusion

Slide 6

Background

● There has been previous work on inferring POS tags in a bilingual setting; refer to Snyder et al. (2008).
● Model - directed models (HMMs)
● Drawback - crossing word alignments are ignored

[Figure: the parallel sentences "I am a vegetarian" / "Ich bin Vegetarier" with POS tags (PRP, VBP, DT, NN) and their word alignments]

Slide 7

MRF

A Markov random field (often abbreviated as MRF), Markov network, or undirected graphical model is a set of random variables described by an undirected graph.
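
As a quick refresher (this is the standard textbook definition, not anything specific to this paper), the joint distribution of an MRF factorizes into potential functions over the cliques of the graph:

    % Standard MRF factorization; psi_C are non-negative clique potentials
    % and Z (the partition function) normalizes over all assignments.
    p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(x_C),
    \qquad
    Z = \sum_{x'} \prod_{C \in \mathcal{C}(G)} \psi_C(x'_C)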

Slide 8

Word Alignment

An NLP task of identifying translation relationships among the words of different languages.

Example: "I do not speak German" / "Ich spreche kein Deutsch"
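
For concreteness, an alignment can be represented simply as a set of index pairs. A minimal Python sketch for the slide's example (the particular pairs, with "do" left unaligned, are my own illustrative reading):

    # Alignment as (source_index, target_index) pairs.
    src = ["I", "do", "not", "speak", "German"]
    tgt = ["Ich", "spreche", "kein", "Deutsch"]

    # I-Ich, not-kein, speak-spreche, German-Deutsch; "do" has no German counterpart.
    alignment = {(0, 0), (2, 2), (3, 1), (4, 3)}

    for i, j in sorted(alignment):
        print(f"{src[i]} <-> {tgt[j]}")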

Slide 9

Giza++ (Word Alignment Generation Tool)

Sample 1 - train.en:
    I do not speak German.
    Road trip to Prague.
    I am a Vegetarian

Sample 2 - train.fr:
    Je ne parle pas allemand.
    Road trip à Prague.
    Je suis végétarien

Command (Unix utility): ./GIZA++ train.en train.fr

Outputs:
    en.vcb:   uniq_num  en_word  word_freq      (e.g. 1  I  2)
    fr.vcb:   uniq_num  fr_word  word_freq      (e.g. 1  Je  2)
    enfr.snt: uniq_num  Eng_word  French_word   (e.g. 1  I  Je)
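
A minimal sketch of the vocabulary-file step, assuming pre-tokenized input and the simplified file layout shown on this slide (real GIZA++ preprocessing uses its own tools such as plain2snt, and the actual formats differ in detail):

    from collections import Counter

    def write_vcb(path, sentences):
        # Simplified .vcb layout from the slide: uniq_num, word, word_freq.
        freq = Counter(w for s in sentences for w in s.split())
        with open(path, "w") as f:
            for i, word in enumerate(sorted(freq), start=1):
                f.write(f"{i} {word} {freq[word]}\n")

    en_sents = ["I do not speak German .", "Road trip to Prague .", "I am a Vegetarian"]
    fr_sents = ["Je ne parle pas allemand .", "Road trip à Prague .", "Je suis végétarien"]
    write_vcb("train.en.vcb", en_sents)
    write_vcb("train.fr.vcb", fr_sents)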

Slide 10

Model

The model described in the paper is an MRF:
○ Input: two parallel sentences S, T ("I do not speak German" = "Ich spreche kein Deutsch")
○ A: the set of word alignments between the two word sequences. Tool used: Giza++ ((I, Ich) = 1, (speak, spreche) = 2, etc.)
○ Output: X, Y, i.e. the sequences of POS tags for the corresponding sentence pair (X is already given; we need to infer Y) - the tags like NN, PRP, etc. that we are interested in
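
To make the setup concrete, a minimal sketch of one training instance (the class and field names are hypothetical, purely for illustration):

    from dataclasses import dataclass
    from typing import List, Optional, Set, Tuple

    @dataclass
    class BilingualInstance:
        s: List[str]                   # source sentence (e.g. English words)
        t: List[str]                   # target sentence (e.g. German words)
        a: Set[Tuple[int, int]]        # word alignments as (i, j) index pairs from Giza++
        x: List[str]                   # POS tags for s (observed)
        y: Optional[List[str]] = None  # POS tags for t (what we want to infer)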

Slide 11

Model - Continued

The (unnormalized) probability of this Markov model is given by a product of clique potentials:
Emit - emission potential
Trans - transition potential
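
The slide's formula was an image; here is a hedged reconstruction based on the Emit/Trans labels above and the paper's description. The cross-lingual align potential over aligned tag pairs is my reading of the paper, and the notation is illustrative:

    u(x, y, s, t, a) =
        \prod_i \mathrm{emit}(x_i, s_i) \prod_i \mathrm{trans}(x_{i-1}, x_i)   % source chain
        \prod_j \mathrm{emit}(y_j, t_j) \prod_j \mathrm{trans}(y_{j-1}, y_j)   % target chain
        \prod_{(i,j) \in a} \mathrm{align}(x_i, y_j)                           % aligned tag pairs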

Slide 12

Model - Continued

The probability of the word sequences (this can also be called the likelihood) is obtained by summing the unnormalized score of the Markov model (previous slide) over all tag sequences and dividing by a normalization constant.
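
In symbols (again a reconstruction consistent with the previous slide, not the paper's exact notation):

    p(s, t) = \frac{1}{Z} \sum_{x, y} u(x, y, s, t, a),
    \qquad
    Z = \sum_{s', t'} \sum_{x, y} u(x, y, s', t', a')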

Slide 13

Optimisation Techniques

Objective: the best likelihood for the observed word sequences (and, with it, the best fit of the predicted sequence of POS tags). We need to optimise the weight vector to maximise the likelihood.

Two optimisation approaches:
● Maximum Likelihood Estimation
● Contrastive Estimation

Slide 14

Maximum Likelihood Estimation (MLE)

This is basically an optimisation problem. In class we have seen a class of algorithms for such optimisation, i.e. hill-climbing algorithms, but they suffer from the problem of local maxima.

MLE = maximising the likelihood function. The log-likelihood is simply the log of the likelihood function.

https://onlinecourses.science.psu.edu/stat504/node/27
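
Written out (standard MLE applied to the likelihood from slide 12; the remark about the partition function is my addition):

    \hat{\theta} = \arg\max_\theta \sum_{(s,t)} \log p_\theta(s, t)
    % For this MRF, computing Z(theta) exactly is what makes MLE expensive.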

Slide 15

Contrastive Estimation

CE "tweaks" the optimisation objective by introducing a neighbourhood function. In our case we define a neighbourhood function for sentences, N(s, t), which maps a pair of sentences to a set of pairs of sentences. The normalization constant is replaced by a sum over this neighbourhood.

http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf
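
A sketch of the CE objective in the spirit of Smith and Eisner (2005), with the global normalizer replaced by the neighbourhood sum (notation again illustrative rather than the paper's):

    \hat{\theta} = \arg\max_\theta \sum_{(s,t)} \log
        \frac{\sum_{x,y} u_\theta(x, y, s, t, a)}
             {\sum_{(s',t') \in N(s,t)} \sum_{x,y} u_\theta(x, y, s', t', a')}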

Slide 16

MLE vs Contrastive Estimation

● Both approaches were tried on the multilingual model.
● The MRF model does not work well with MLE, unlike the HMM: the weight vector is not normalized properly and the local-maxima problem is not solved (though MLE works well with the HMM).
● CE produced a better-normalized distribution over the weights and mitigates the local-maxima problem, so CE is chosen to optimise the objective.

Slide 17

Experiments

Datasets used:
● http://nl.ijs.si/ME/Vault/CD/docs/1984.html
● http://www.statmt.org/europarl (for verifying crosslinking)

Testing set: 25% of the sentences.
Word alignments are produced by Giza++ (http://www.statmt.org/moses/giza/GIZA++.html).
Languages: English, Slovene, Bulgarian, Serbian. Total combinations (bilingual model) = 4C2 = 6.

Slide 18

Exp - 1 Monolingual Exp.

Objective: verify our model's suitability.

1. Supervised learning:
● Trained a monolingual supervised MRF model to compare against the results of supervised HMMs.
● Our model vs. HMM: difference in accuracy less than 0.1%.

2. Unsupervised learning:

Language   Random   HMM    MRF
Bulgarian  82.7     88.9   93.5
English    56.2     90.7   87.0
Serbian    83.4     85.1   89.3
Slovene    84.7     87.4   94.5

Slide 19

Exp - 2 Bilingual Exp.

Slide 20

But what about crosslinking?

● The 1984 dataset (MULTEXT-East) has only 5% of sentences with crosslinking.
● The MRF bilingual model is therefore also applied to the Europarl corpus (tested on the French-English pair).

Language  With crosslinking  Without crosslinking
English   73.8               70.3
French    56.0               59.2

Slide 21

Conclusion

● Used MRFs to study bilingual POS tagging and compared the results with HMMs.
● The bilingual MRF model can also be used to model crosslinking in parallel sentences.

Slide 22

References

● Unsupervised Bilingual POS Tagging with Markov Random Fields (http://people.csail.mit.edu/desaic/pdf/chen+dyer+cohen+smith.unsup11.pdf)
● Unsupervised Multilingual Learning for POS Tagging (http://www.aclweb.org/anthology/D08-1109.pdf)
● Contrastive Estimation: Training Log-Linear Models on Unlabeled Data (http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf)
● Machine Learning classes by Andrew Ng (coursera.org)

Slide 23

Thank You!!

Confusions? Questions, please!