State Models
Elias Ponvert, Jason Baldridge, Katrin Erk
Department of Linguistics
The University of Texas at Austin
Association for Computational Linguistics
19–24 June, 2011
Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34
2. Apply to new languages and domains
Særær man annær man mæþæn
[Tree diagram: labeled parse tree for "on Sunday , the brown bear sleeps"]
Getting these labels right AS WELL AS the structure of the tree is hard
[Tree diagram: bracketing of "on Sunday , the brown bear sleeps"]
So the task is to identify the structure alone
Learning operates from gold-standard parts-of-speech (POS) rather than raw text
[Diagram: POS sequence "P N , Det A N V" for "on Sunday , the brown bear sleeps"]
Klein & Manning 2003 CCM
Bod 2006a, 2006b
Klein & Manning 2005 DMV
Successors to DMV:
- Smith 2006, Smith & Cohen 2009, Headden et al 2009, Spitkovsky et al 2010ab, &c
J. Gao et al 2003, 2004
Seginer 2007
this work
segmentations
( the cat ) in ( the hat ) knows ( a lot ) about that
( the cat ) ( in the hat ) knows ( a lot ) ( about that )
( the cat in the hat ) knows ( a lot about that )
( the cat in the hat ) ( knows a lot about that )
( the cat in the hat ) ( knows a lot ) ( about that )
[Tree diagram: constituency tree for "The Cat in the hat knows a lot about that"; node labels in reading order: S NP D The N Cat PP P in NP D the N hat VP V knows NP D a N lot PP P about NP N that]
[Figure: Common Cover Links representation and the corresponding constituency tree for "the cat saw the red dog run"]
Seginer (2007 ACL; 2007 PhD UvA)
[Diagram: B/I/O tags over "the cat in the hat"; visible portion: ... in, B the, I hat]
B: Beginning of a constituent
I: Inside a constituent
O: Not inside a constituent
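To make the tag scheme concrete, here is a minimal sketch of how a B/I/O tag sequence can be turned back into chunk spans and a bracketed string. The function names `bio_to_chunks` and `bracket` are illustrative and not from the slides, and the handling of a stray I with no preceding B is an assumption.

```python
def bio_to_chunks(tags):
    """Return (start, end) spans, end exclusive, one per B I* run."""
    chunks = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new B closes any open chunk
                chunks.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # assumption: a stray I opens a chunk
                start = i
        else:                          # "O" (or any other tag) closes an open chunk
            if start is not None:
                chunks.append((start, i))
                start = None
    if start is not None:              # chunk running to the end of the sentence
        chunks.append((start, len(tags)))
    return chunks


def bracket(words, spans):
    """Render non-overlapping spans as a parenthesized string."""
    opens = {s for s, _ in spans}
    closes = {e - 1 for _, e in spans}
    out = []
    for i, word in enumerate(words):
        if i in opens:
            out.append("(")
        out.append(word)
        if i in closes:
            out.append(")")
    return " ".join(out)


words = "the cat in the hat".split()
tags = ["B", "I", "O", "B", "I"]
spans = bio_to_chunks(tags)
print(bracket(words, spans))
```

With the tags shown, the two chunks cover "the cat" and "the hat", and "in" stays outside any constituent.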
[Diagram: tags over "# on sunday , the brown bear sleeps #"; visible portion: I bear, STOP for each sentence boundary (#), STOP for the comma, O sleeps]
Hidden Markov Model
Probabilistic right linear grammar
[Diagrams: "the cat in the hat" tagged B I O B I, with the dependencies each model uses; a rule probability is shown as a product of two factors, P( ) = P( ) P( | ), whose arguments were drawn graphically]
Learning: expectation maximization (EM) via forward-backward (run to convergence)
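One way to write the two factorizations is sketched below. It assumes the usual formulation in which a tag t_i emits word w_i and then moves to tag t_{i+1}. The symbols t_i and w_i are notation introduced here for illustration, not taken from the slide.

```latex
% HMM: the next tag depends only on the current tag
P(w_i, t_{i+1} \mid t_i) = P(w_i \mid t_i)\, P(t_{i+1} \mid t_i)

% PRLG: the next tag also depends on the word just emitted
P(w_i, t_{i+1} \mid t_i) = P(w_i \mid t_i)\, P(t_{i+1} \mid t_i, w_i)

% Example with t_i = B, w_i = the, t_{i+1} = I under the PRLG:
P(\text{the}, I \mid B) = P(\text{the} \mid B)\, P(I \mid B, \text{the})
```

Under this reading, the only difference between the two models is whether the transition is conditioned on the emitted word.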
Decoding: Viterbi
Smoothing: additive smoothing on emissions
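The decoding and smoothing steps can be sketched as follows for the HMM case. This is a minimal illustration, not the authors' implementation: the tag set omits STOP, and the initial, transition, and emission-count values, the smoothing constant `alpha`, and all function names are made-up assumptions chosen only so the example runs.

```python
import math

TAGS = ["B", "I", "O"]


def smoothed_emission(counts, tag, word, vocab_size, alpha=0.1):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    tag_counts = counts.get(tag, {})
    total = sum(tag_counts.values())
    return (tag_counts.get(word, 0.0) + alpha) / (total + alpha * vocab_size)


def log_prob(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words, init, trans, emit_counts, vocab_size, alpha=0.1):
    """Return the most probable tag sequence for words under the toy HMM."""
    def emit(tag, word):
        return log_prob(smoothed_emission(emit_counts, tag, word, vocab_size, alpha))

    # delta[t][tag]: best log-probability of a tag path ending in tag at position t
    delta = [{tag: log_prob(init[tag]) + emit(tag, words[0]) for tag in TAGS}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: delta[t - 1][p] + log_prob(trans[p][tag]))
            delta[t][tag] = delta[t - 1][prev] + log_prob(trans[prev][tag]) + emit(tag, words[t])
            back[t][tag] = prev
    # Follow back-pointers from the best final tag
    path = [max(TAGS, key=lambda tag: delta[-1][tag])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters (assumptions, not estimates from any corpus)
INIT = {"B": 0.6, "I": 0.0, "O": 0.4}
TRANS = {
    "B": {"B": 0.0, "I": 1.0, "O": 0.0},
    "I": {"B": 0.2, "I": 0.3, "O": 0.5},
    "O": {"B": 0.6, "I": 0.0, "O": 0.4},
}
EMIT_COUNTS = {
    "B": {"the": 9, "a": 1},
    "I": {"cat": 4, "hat": 4},
    "O": {"in": 5, "knows": 5},
}
VOCAB_SIZE = 6  # the, a, cat, hat, in, knows

print(viterbi("the cat in the hat".split(), INIT, TRANS, EMIT_COUNTS, VOCAB_SIZE))
```

Working in log space avoids underflow on long sentences, and the additive term keeps unseen word/tag pairs from receiving zero emission probability.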
train / development / test splits
Precision and recall on matched constituents
Benchmark: CCL
Both get tokenization, punctuation, sentence boundaries
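As a rough sketch of the metric, precision and recall on matched constituents can be computed by comparing predicted and gold spans as sets. The exact matching criteria used in the evaluation (for instance, which spans are excluded) are not given here, so exact span identity is an assumption, and `precision_recall_f1` is an illustrative name.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted spans against gold spans by exact (start, end) match."""
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Spans are (start, end) word offsets, end exclusive, for one sentence.
# For corpus-level scores, a sentence id could be added to each tuple.
predicted = [(0, 2), (3, 5)]
gold = [(0, 2), (3, 5), (2, 5), (0, 5)]
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Here both predicted chunks are correct, but only half of the gold constituents are found, which is the typical profile of a chunker scored against full trees.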
[Figure, four animation steps: cascaded chunking of "there is no asbestos in our products now", built up over successive levels]
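The cascade idea can be sketched as a loop: chunk the sentence, replace each chunk with a single pseudoword, and chunk the shortened sequence again. The sketch below assumes the pseudoword is the most frequent word in the chunk. The `toy_chunker` lookup table is a hypothetical stand-in for a trained chunker, and the frequencies are made up.

```python
from collections import Counter


def cascade_spans(words, chunker, freq, max_levels=10):
    """Collect spans over the original sentence from repeated chunking."""
    # Each item is (token, start, end) with offsets into the original sentence
    items = [(w, i, i + 1) for i, w in enumerate(words)]
    found = []
    for _ in range(max_levels):
        spans = sorted(chunker([tok for tok, _, _ in items]))
        if not spans:
            break
        new_items, pos = [], 0
        for s, e in spans:
            new_items.extend(items[pos:s])
            group = items[s:e]
            # Assumption: the chunk is represented by its most frequent word
            pseudo = max(group, key=lambda it: freq[it[0]])[0]
            start, end = group[0][1], group[-1][2]
            found.append((start, end))
            new_items.append((pseudo, start, end))
            pos = e
        new_items.extend(items[pos:])
        items = new_items
    return found


# Hypothetical stand-in for a trained chunker: fixed outputs per token sequence
TOY_OUTPUT = {
    ("the", "cat", "in", "the", "hat"): [(0, 2), (3, 5)],
    ("the", "in", "the"): [(1, 3)],
    ("the", "the"): [(0, 2)],
}


def toy_chunker(tokens):
    return TOY_OUTPUT.get(tuple(tokens), [])


FREQ = Counter({"the": 100, "in": 50, "cat": 5, "hat": 5})  # made-up counts

print(cascade_spans("the cat in the hat".split(), toy_chunker, FREQ))
```

Each level only ever sees a flat token sequence, so the same kind of chunking model can be reused at every level of the cascade.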
[Figure: gold-standard tree vs. Cascaded PRLG output on Negra, constituents marked correct / incorrect]
Sentence: die csu tut das in bayern doch auch sehr erfolgreich
Gloss (start truncated): ... this | in: in | bayern: Bavaria | doch: nevertheless | auch: also | sehr: very | erfolgreich: successfully
Translation: Nevertheless, the CSU does this in Bavaria very successfully as well
[Figure: gold-standard tree vs. Cascaded PRLG output on Negra, constituents marked correct / incorrect]
Sentence: bei den windsors bleibt alles in der familie
Gloss (start truncated): ... stays | alles: everything | in: in | der: the | familie: family
Translation: With the Windsors everything stays in the family.
constituents is possible
A cascade of chunking models for raw text parsing has state-of-the-art results
(pseudoword) construction
Learning joint models rather than a cascade
Discussion of phrasal punctuation: the chunkers still do well without punctuation
Analysis of chunking and parsing Chinese
Error analysis
part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.
[Figure: gold-standard tree vs. Cascaded PRLG output on WSJ, constituents marked correct / incorrect]
Sentence: two share a house almost devoid of furniture
[Figure: gold-standard tree vs. Cascaded PRLG output on WSJ, constituents marked correct / incorrect]
Sentence: what is one to think of all this
HMM Emissions: WSJ (values in percent)
B, P(w|B) (start of column truncated): ... 8.7; to 6.5; ’s 2.8; in 1.9; mr. 1.8; its 1.6; of 1.4; an 1.4; and 1.4
I, P(w|I): % 1.8; million 1.6; be 1.3; company 0.9; year 0.8; market 0.7; billion 0.6; share 0.5; new 0.5; than 0.5
O, P(w|O): of 5.8; and 4.0; in 3.7; that 2.2; to 2.1; for 2.0; is 2.0; it 1.7; said 1.7; on 1.5
HMM Emissions: Negra (values in percent; English gloss in parentheses)
B, P(w|B) (start of column truncated): ... die (the) 12.2; den (the) 4.4; und (and) 3.3; im (in) 3.2; das (the) 2.9; des (the) 2.7; dem (the) 2.4; eine (a) 2.1; ein (a) 2.0
I, P(w|I): uhr (o’clock) 0.8; juni (June) 0.6; jahren (years) 0.4; prozent (percent) 0.4; mark (currency) 0.3; stadt (city) 0.3; 000 0.3; millionen (millions) 0.3; jahre (year) 0.3; frankfurter (Frankfurt) 0.3
O, P(w|O): in (in) 3.4; und (and) 2.7; mit (with) 1.7; für (for) 1.6; auf (on) 1.5; zu (to) 1.4; von (of) 1.3; sich (oneself) 1.3; ist (is) 1.3; nicht (not) 1.2
HMM Emissions: CTB (values in percent; English gloss in parentheses)
B, P(w|B) (start of column truncated): ... 14.3; 一 (one) 3.1; 和 (and) 1.1; 两 (two) 0.9; 这 (this) 0.8; 有 (have) 0.8; 经济 (economy) 0.7; 各 (each) 0.7; 全 (all) 0.7; 不 (no) 0.6
I, P(w|I): 的 (de) 3.9; 了 (perf. asp.) 2.2; 个 (ge, measure) 1.5; 年 (year) 1.3; 说 (say) 1.0; 中 (middle) 0.9; 上 (on, above) 0.9; 人 (person) 0.7; 大 (big) 0.7; 国 (country) 0.6
O, P(w|O): 在 (at, in) 3.4; 是 (is) 2.4; 中国 (China) 1.4; 也 (also) 1.2; 不 (no) 1.2; 对 (pair) 1.1; 和 (and) 1.0; 的 (de) 1.0; 将 (fut. tns.) 1.0; 有 (have) 1.0