Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Slides from my ACL talk on unsupervised (natural language) parsing and unsupervised partial parsing

Elias Ponvert

June 24, 2011

Transcript

  1. Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite

    State Models Elias Ponvert, Jason Baldridge, Katrin Erk Department of Linguistics The University of Texas at Austin Association for Computational Linguistics 19–24 June, 2011 Ponvert, Baldridge, Erk (UT Austin) Simple Unsupervised Grammar Induction ACL 2011 1 / 34
  2. Why unsupervised parsing? 1 Less reliance on annotated training Hello!

    2 Apply to new languages and domains Særær man annær man mæþæn
  3. Assumptions made in parser learning

    (S (PP (P on) (NP (N Sunday))) (, ,) (NP (Det the) (A brown) (N bear)) (VP (V sleeps)))
    Getting these labels right AS WELL AS the structure of the tree is hard
  4. Assumptions made in parser learning

    (P on) (N Sunday) (, ,) (Det the) (A brown) (N bear) (V sleeps)
    So the task is to identify the structure alone
  5. Assumptions made in parser learning

    on Sunday , the brown bear sleeps
    Learning operates from gold-standard parts-of-speech (POS) rather than raw text:
    P N , Det A N V
    From POS: Klein & Manning 2003 (CCM); Bod 2006a, 2006b; Klein & Manning 2005 (DMV); successors to DMV: Smith 2006, Smith & Cohen 2009, Headden et al. 2009, Spitkovsky et al. 2010ab, etc.
    From raw text: J. Gao et al. 2003, 2004; Seginer 2007; this work
  6. Unsupervised parsing: desiderata

    Raw text
    Standard NLP / extensible
    Scalable and fast
  7. A new approach: start from the bottom Unsupervised Partial Parsing

    = segmentation of (non-overlapping) multiword constituents
  8. Unsupervised segmentation of constituents leaves some room for interpretation

    Possible segmentations:
    ( the cat ) in ( the hat ) knows ( a lot ) about that
    ( the cat ) ( in the hat ) knows ( a lot ) ( about that )
    ( the cat in the hat ) knows ( a lot about that )
    ( the cat in the hat ) ( knows a lot about that )
    ( the cat in the hat ) ( knows a lot ) ( about that )
  9. Defining UPP by evaluation 1. Constituent chunks: non-hierarchical multiword constituents

    (S (NP (D The) (N Cat)) (PP (P in) (NP (D the) (N hat))) (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))
  10. Defining UPP by evaluation 2. Base NPs: non-recursive noun phrases

    (S (NP (D The) (N Cat)) (PP (P in) (NP (D the) (N hat))) (VP (V knows) (NP (D a) (N lot)) (PP (P about) (NP (N that)))))
  11. Multilingual data for direct evaluation: English WSJ, German Negra, Chinese CTB

                                     Sentences  Types  Tokens
    WSJ    Penn Treebank                 49K     44K      1M
    Negra  Negra German Corpus           21K     49K    300K
    CTB    Penn Chinese Treebank         19K     37K    430K
  12. Constituent chunks and NPs in the data

                     WSJ   Negra   CTB
    Chunks          203K     59K   92K
    NPs             172K     33K   56K
    Chunks ∩ NPs    161K     23K   43K
  13. The benchmark: CCL parser

    [Diagram: Common Cover Links representation vs. constituency tree for
    "the cat saw the red dog run"; link details not recoverable from the transcript]
    Seginer (2007 ACL; 2007 PhD UvA)
  14. Hypothesis Segmentation can be learned by generalizing on phrasal boundaries

  15. UPP as a tagging problem

    the cat in the hat
     B   I   O  B   I
    B  Beginning of a constituent
    I  Inside a constituent
    O  Not inside a constituent
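The B/I/O encoding can be made concrete with a small helper that recovers bracketed segments from a tag sequence (a minimal sketch; the function name and span representation are mine, not from the talk):

```python
def bio_to_segments(words, tags):
    """Convert a B/I/O tag sequence into (start, end) constituent spans.

    B opens a new multiword constituent, I continues it, O is outside.
    """
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close any open constituent
                segments.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                segments.append((start, i))
            start = None
    if start is not None:              # close a constituent at sentence end
        segments.append((start, len(tags)))
    return segments

words = ["the", "cat", "in", "the", "hat"]
tags  = ["B",   "I",   "O",  "B",   "I"]
bio_to_segments(words, tags)   # -> [(0, 2), (3, 5)], i.e. ( the cat ) in ( the hat )
```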
  16. Learning from boundaries

    #     the  cat  in   the  hat   #
    STOP   B    I    O    B    I   STOP
  17. Learning from punctuation

    #     on  sunday  ,    the  brown  bear  sleeps   #
    STOP   B    I    STOP   B     I     I      O     STOP
  18. UPP: Models

    the cat in the hat, tagged B I O B I
    Hidden Markov Model: P(w_i, t_i | t_{i-1}) ≈ P(t_i | t_{i-1}) P(w_i | t_i)
    Probabilistic right-linear grammar (PRLG): P(w_i, t_i | t_{i-1}) = P(t_i | t_{i-1}) P(w_i | t_i, t_{i-1})
    Learning: expectation maximization (EM) via forward-backward (run to convergence)
  19. UPP: Models

    Decoding: Viterbi
    Smoothing: additive smoothing on emissions
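The difference between the two models is where the emission is conditioned: the HMM emits a word given the current tag only, while the PRLG also conditions the emission on the previous tag. A minimal sketch of scoring one tagged sequence under either model (the probability tables below are toy values for illustration, not learned ones):

```python
def sequence_prob(words, tags, trans, emit, prlg=False):
    """Joint probability of a tagged sentence under an HMM or a PRLG.

    HMM:  P(w_i, t_i | t_{i-1}) = P(t_i | t_{i-1}) * P(w_i | t_i)
    PRLG: P(w_i, t_i | t_{i-1}) = P(t_i | t_{i-1}) * P(w_i | t_i, t_{i-1})
    The tag sequence is padded with a STOP boundary state.
    """
    padded = ["STOP"] + tags
    p = 1.0
    for i, w in enumerate(words):
        prev, cur = padded[i], padded[i + 1]
        p *= trans[(prev, cur)]                                 # transition
        p *= emit[(cur, prev, w)] if prlg else emit[(cur, w)]   # emission
    return p

# Toy tables for scoring "the cat" tagged B I under the HMM
trans = {("STOP", "B"): 0.9, ("B", "I"): 1.0}
emit_hmm = {("B", "the"): 0.2, ("I", "cat"): 0.05}
sequence_prob(["the", "cat"], ["B", "I"], trans, emit_hmm)  # 0.9 * 0.2 * 1.0 * 0.05
```

The PRLG's richer emission context is the only structural change; training (EM via forward-backward) and decoding (Viterbi) work the same way for both.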
  20. UPP: Constraints on sequences

    #     the  cat  in   the  hat   #
    STOP   B    I    O    B    I   STOP
    Allowed transitions among STOP, B, I, O: B must be followed by I (constituents are multiword); I may only follow B or I
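These constraints can be written down as an allowed-transition table; in Viterbi decoding, disallowed transitions simply get zero probability. A sketch of the table as I read it from the slide (B must be followed by I, so every constituent is multiword, and I may only continue an open constituent):

```python
# Allowed tag-to-tag transitions, with STOP standing in for sentence
# boundaries and phrasal punctuation.
ALLOWED = {
    "STOP": {"B", "O"},
    "B":    {"I"},                      # constituents are multiword
    "I":    {"B", "I", "O", "STOP"},
    "O":    {"B", "O", "STOP"},         # I may never follow O
}

def is_valid(tags):
    """Check a tag sequence (implicitly flanked by STOP) against ALLOWED."""
    seq = ["STOP"] + list(tags) + ["STOP"]
    return all(b in ALLOWED[a] for a, b in zip(seq, seq[1:]))

is_valid(["B", "I", "O", "B", "I"])  # ( the cat ) in ( the hat ) -> True
is_valid(["B", "O"])                 # a one-word "constituent"   -> False
```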
  21. UPP evaluation: Setup Evaluation by comparison to treebank data Standard

    train / development / test splits
    Precision and recall on matched constituents
    Benchmark: CCL
    Both get tokenization, punctuation, sentence boundaries
  22. UPP evaluation: Chunking (F-score)

    [Bar chart: chunking F-scores on CTB, Negra, and WSJ for CCL*, the HMM
    chunker, and the PRLG chunker; CCL* = CCL restricted to non-hierarchical
    constituents, i.e. first-level parsing output]
  23. UPP evaluation: Base NPs (F-score)

    [Bar chart: base-NP F-scores on CTB, Negra, and WSJ for CCL*, the HMM
    chunker, and the PRLG chunker; CCL* = CCL restricted to non-hierarchical
    constituents, i.e. first-level parsing output]
  24. UPP: Review Sequence models can generalize on indicators for phrasal

    boundaries
    Leads to improved unsupervised segmentation
  25. Question Are we limited to segmentation?
  26. Hypothesis Identification of higher level constituents can also be learned

    by generalizing on phrasal boundaries
  27. Cascaded UPP: 1 Segment raw text

    there is no asbestos in our products now
    → there ( is no asbestos ) in ( our products ) now
  28. Cascaded UPP: 2 Choose stand-ins for phrases

    ( is no asbestos ) → is    ( our products ) → our
    there ( is no asbestos ) in ( our products ) now → there is in our now
  29. Cascaded UPP: 3 Segment text + phrasal stand-ins

    [Diagram: the reduced sequence "there is in our now" is segmented again;
    the exact bracketing is not recoverable from the transcript]
  30. Cascaded UPP: 4 Choose stand-ins and repeat steps 3–4

    [Diagram: stand-ins replace the new chunks and segmentation repeats at the
    next level]
  31. Cascaded UPP: 5 Unwind to output tree

    [Diagram: stand-ins are expanded back into their chunks, yielding a nested
    bracketing over "there is no asbestos in our products now"]
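The cascade in steps 1–5 can be sketched as a loop: chunk, collapse each chunk into a single stand-in, and re-chunk until nothing merges; keeping each chunk's contents inside its stand-in makes the final "unwind" trivial. Here `toy_chunker` is a hypothetical deterministic rule standing in for the learned per-level models:

```python
def cascade(tokens, chunker):
    """Cascaded chunking. `chunker` returns non-overlapping (start, end)
    spans for one level; each chunk is collapsed into a tuple, which both
    acts as the stand-in and records the subtree for the final unwind."""
    seq = list(tokens)
    while len(seq) > 1:
        spans = sorted(chunker(seq))
        if not spans:                    # nothing merged: cascade converged
            break
        new_seq, i = [], 0
        for s, e in spans:
            new_seq.extend(seq[i:s])
            new_seq.append(tuple(seq[s:e]))   # collapse chunk to a stand-in
            i = e
        new_seq.extend(seq[i:])
        seq = new_seq
    return seq                           # a forest of nested tuples

def toy_chunker(seq):
    # Hypothetical rule for illustration only: chunk each ("the", X) pair.
    spans, i = [], 0
    while i < len(seq) - 1:
        if seq[i] == "the":
            spans.append((i, i + 2))
            i += 2
        else:
            i += 1
    return spans

cascade(["the", "cat", "in", "the", "hat"], toy_chunker)
# -> [('the', 'cat'), 'in', ('the', 'hat')]
```

With learned models at each level, higher levels would go on merging stand-ins into larger constituents before the tree is unwound.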
  32. Cascaded UPP: Review Separate models learned at each cascade level

    Models share hyper-parameters (smoothing etc.)
    Choice of pseudowords as phrasal stand-ins
    Pseudoword identification: corpus frequency
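Pseudoword identification by corpus frequency can be illustrated as choosing each chunk's most frequent word as its stand-in (one natural instantiation of a frequency-based choice; the paper's exact scheme may differ in details):

```python
from collections import Counter

def pseudoword(chunk, corpus_freq):
    """Stand-in for a chunk: its word with the highest corpus frequency."""
    return max(chunk, key=lambda w: corpus_freq[w])

# Toy corpus frequencies for illustration
freq = Counter("there is no asbestos in our products now "
               "there is a lot of it there".split())
pseudoword(["is", "no", "asbestos"], freq)   # "is" is the most frequent here
```

Frequent words make good stand-ins because the next cascade level has seen them often and can model their distribution reliably.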
  33. Cascaded UPP: Evaluation

    [Bar chart: all-constituent F-scores on CTB, Negra, and WSJ for CCL, the
    cascaded HMM, and the cascaded PRLG; cascade run to convergence]
  34. More example parses (Cascaded PRLG – Negra)

    die csu tut  das  in bayern  doch         auch sehr erfolgreich
    the CSU does this in Bavaria nevertheless also very successfully
    "Nevertheless, the CSU does this in Bavaria very successfully as well"
    [Tree diagram: predicted vs. gold-standard bracketing; correct and
    incorrect constituents marked]
  35. More example parses (Cascaded PRLG – Negra)

    bei  den windsors bleibt alles      in der familie
    with the Windsors stays  everything in the family
    "With the Windsors everything stays in the family."
    [Tree diagram: predicted vs. gold-standard bracketing; correct and
    incorrect constituents marked]
  36. More example parses (Cascaded PRLG – Negra)

    immer mehr anlagenteile  überaltern
    ever  more machine parts over-age
    "(With) more and more machine parts over-age"
    [Tree diagram: predicted vs. gold-standard bracketing; correct and
    incorrect constituents marked]
  37. What we’ve learned Unsupervised identification of base NPs and local

    constituents is possible
    A cascade of chunking models for raw text parsing has state-of-the-art results
  38. Future directions Improvements to the sequence models Better phrasal stand-in

    (pseudoword) construction
    Learning joint models rather than a cascade
  39. What’s in the paper Comparison to Klein & Manning’s CCM

    Discussion of phrasal punctuation (the chunkers still do well without punctuation)
    Analysis of chunking and parsing Chinese
    Error analysis
  40. Thanks! Contact: [email protected] Code: elias.ponvert.net/upparse This work is supported in

    part by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-10-1-0533. Support for Elias was also provided by the Mike Hogg Endowment Fellowship, the Office of Graduate Studies at The University of Texas at Austin.
  41. More example parses (Cascaded PRLG – WSJ)

    two share a house almost devoid of furniture
    [Tree diagrams: predicted vs. gold-standard bracketing; correct and
    incorrect constituents marked]
  42. More example parses (Cascaded PRLG – WSJ)

    what is one to think of all this
    [Tree diagrams: predicted vs. gold-standard bracketing; correct and
    incorrect constituents marked]
  43. Learning curves: Base NPs

    [Plots: base-NP F-score vs. number of training sentences (up to 40K) and
    vs. EM iteration; PRLG chunking model on WSJ]
  44. Learning curves: Base NPs

    [Plots: base-NP F-score vs. number of training sentences (up to 15K) and
    vs. EM iteration; PRLG chunking model on Negra]
  45. Learning curves: Base NPs

    [Plots: base-NP F-score vs. number of training sentences (up to 15K) and
    vs. EM iteration; PRLG chunking model on CTB]
  46. What are the models learning? HMM emissions: WSJ

    B: P(w|B)         I: P(w|I)           O: P(w|O)
    the      21.0     %         1.8       of     5.8
    a         8.7     million   1.6       and    4.0
    to        6.5     be        1.3       in     3.7
    's        2.8     company   0.9       that   2.2
    in        1.9     year      0.8       to     2.1
    mr.       1.8     market    0.7       for    2.0
    its       1.6     billion   0.6       is     2.0
    of        1.4     share     0.5       it     1.7
    an        1.4     new       0.5       said   1.7
    and       1.4     than      0.5       on     1.5
  47. What are the models learning? HMM emissions: Negra

    B: P(w|B)            I: P(w|I)                      O: P(w|O)
    der   the    13.0    uhr     o'clock      0.8       in     in       3.4
    die   the    12.2    juni    June         0.6       und    and      2.7
    den   the     4.4    jahren  years        0.4       mit    with     1.7
    und   and     3.3    prozent percent      0.4       für    for      1.6
    im    in      3.2    mark    (currency)   0.3       auf    on       1.5
    das   the     2.9    stadt   city         0.3       zu     to       1.4
    des   the     2.7    000                  0.3       von    of       1.3
    dem   the     2.4    millionen millions   0.3       sich   (refl.)  1.3
    eine  a       2.1    jahre   years        0.3       ist    is       1.3
    ein   a       2.0    frankfurter Frankfurt 0.3      nicht  not      1.2
  48. What are the models learning? HMM emissions: CTB

    B: P(w|B)              I: P(w|I)                  O: P(w|O)
    的   de, of    14.3    的   de            3.9     在    at, in     3.4
    一   one        3.1    了   (perf. asp.)  2.2     是    is         2.4
    和   and        1.1    个   ge (measure)  1.5     中国  China      1.4
    两   two        0.9    年   year          1.3     也    also       1.2
    这   this       0.8    说   say           1.0     不    no         1.2
    有   have       0.8    中   middle        0.9     对    pair       1.1
    经济 economy    0.7    上   on, above     0.9     和    and        1.0
    各   each       0.7    人   person        0.7     的    de         1.0
    全   all        0.7    大   big           0.7     将    fut. tns.  1.0
    不   no         0.6    国   country       0.6     有    have       1.0