Entity-Centric Coreference Resolution with Model Stacking
Kevin Clark and Christopher D. Manning (ACL-IJCNLP 2015)
(Tables are taken from the above-mentioned paper)
Presented by Mamoru Komachi <[email protected]>
ACL 2015 Reading Group @ Tokyo Institute of Technology, August 26th, 2015
- Entity-centric coreference systems build up coreference clusters incrementally (Raghunathan et al., 2010; Stoyanov and Eisner, 2012; Ma et al., 2014)
  Example: "Hillary Clinton files for divorce from Bill Clinton ahead of her campaign for the 2016 presidency. … Clinton is confident that her poll numbers will skyrocket once the divorce is final." (Does the second "Clinton" refer to Hillary or to Bill?)
- Two mention pair models: a classification model and a ranking model
- Generates cluster features for clusters of mentions
- Imitation learning
  - Assigns exact costs to actions based on coreference evaluation metrics
  - Uses the scores of the pairwise models to reduce the search space
- Mention pair models decide whether two mentions belong to the same coreference cluster
- Are the two mentions coreferent?
  - Classification model
- Which candidate antecedent best suits the mention?
  - Ranking model
  Example: "Bill arrived, but nobody saw him. I talked to him on the phone."
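A minimal sketch (not the paper's implementation) of the two strategies; score_pair is a hypothetical stand-in for any learned model that maps a (mention, candidate antecedent) pair to a coreference probability:

def classification_links(mention, candidates, score_pair, threshold=0.5):
    # Classification model: decide independently for each candidate
    # antecedent whether it is coreferent with the mention.
    return [c for c in candidates if score_pair(mention, c) > threshold]

def ranking_link(mention, candidates, score_pair):
    # Ranking model: pick the single best-suited antecedent,
    # or None if there is no candidate at all.
    if not candidates:
        return None
    return max(candidates, key=lambda c: score_pair(mention, c))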
- Distance features: distance between the two mentions in sentences or in number of intervening mentions
- Syntactic features: number of embedded NPs under a mention; POS tags of the first, last, and head word
- Semantic features: named entity type, speaker identification
- Rule-based features: exact and partial string matching
- Lexical features: the first, last, and head word of the current mention
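To make the feature groups concrete, here is an illustrative sketch; the Mention fields are hypothetical stand-ins for whatever the preprocessing pipeline provides, not the paper's actual feature set:

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sent_idx: int    # index of the sentence containing the mention
    head_word: str
    head_pos: str    # POS tag of the head word
    ner_type: str    # named entity type, e.g. "PERSON"

def pair_features(anaphor: Mention, antecedent: Mention) -> dict:
    return {
        "sent_distance": anaphor.sent_idx - antecedent.sent_idx,        # distance
        "head_pos_pair": (anaphor.head_pos, antecedent.head_pos),       # syntactic
        "ner_pair": (anaphor.ner_type, antecedent.ner_type),            # semantic
        "exact_match": anaphor.text.lower() == antecedent.text.lower(), # rule-based
        "head_words": (anaphor.head_word, antecedent.head_word),        # lexical
    }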
- Best-first clustering (Ng and Cardie, 2002)
  - Assigns as antecedent the most probable preceding mention classified as coreferent with the current mention
  - Relies only on local information
- Entity-centric model (this work)
  - Operates over pairs of clusters instead of pairs of mentions
  - Builds up coreference chains with agglomerative clustering, merging two clusters when it predicts they represent the same entity
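A hedged sketch of the agglomerative loop, assuming merge_score is a learned scorer over cluster pairs (the actual system selects merges with a trained policy rather than a fixed threshold):

def agglomerate(clusters, merge_score, threshold=0.5):
    # Greedy agglomerative clustering: repeatedly merge the
    # highest-scoring cluster pair while the model still predicts
    # that some pair of clusters represents the same entity.
    clusters = [frozenset(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(merge_score(a, b), a, b)
                 for i, a in enumerate(clusters)
                 for b in clusters[i + 1:]]
        score, a, b = max(pairs, key=lambda p: p[0])
        if score <= threshold:
            break  # no remaining pair looks like the same entity
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters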
- Coreference resolution is a sequential prediction problem in which future observations depend on previous actions
- Imitation learning (in this work, DAgger; Ross et al., 2011) is useful for this kind of problem (Argall et al., 2009)
- Training the agent on the gold labels alone assumes that all previous decisions were correct; this is problematic in coreference, where the error rate is quite high
- DAgger exposes the system at train time to states similar to the ones it will face at test time
- DAgger is an iterative algorithm that aggregates a dataset D consisting of states and the actions performed by the expert policy in those states
- β controls the probability of following the expert's policy versus the current learned policy (β decays exponentially as the iteration number increases)
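A compact sketch of the DAgger loop described above; expert_action, step, is_terminal, and fit_classifier are assumed interfaces, not the paper's actual API:

import random

def dagger(initial_states, expert_action, step, is_terminal,
           fit_classifier, n_iterations=10, beta_decay=0.5):
    dataset = []   # aggregated dataset D of (state, expert action) pairs
    policy = None
    for it in range(n_iterations):
        beta = beta_decay ** it  # P(follow expert); decays exponentially per iteration
        for state in initial_states:
            while not is_terminal(state):
                expert = expert_action(state)
                dataset.append((state, expert))   # D always records the expert's action
                if policy is None or random.random() < beta:
                    action = expert               # follow the expert policy
                else:
                    action = policy(state)        # follow the current learned policy
                state = step(state, action)
        policy = fit_classifier(dataset)          # retrain on the aggregated dataset
    return policy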
- Merging clusters influences the final score (the order of merge operations also matters)
- How will a particular local decision affect the final score of the coreference system?
- Problem: standard coreference metrics do not decompose over individual clusters
- Answer: roll out the remaining actions from the current state and score the completed clustering
- A(s): the set of actions that can be taken from state s
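Under the assumption that rollout(state) completes the clustering with some reference policy and b_cubed scores a finished clustering against the gold clusters, action costs can be sketched like this (illustrative, not the paper's exact cost definition):

def action_costs(state, actions, step, rollout, b_cubed, gold):
    # Because B^3 does not decompose over individual clusters, each
    # candidate action a in A(s) is judged by the final clustering
    # reached after taking it and rolling out to completion.
    scores = {a: b_cubed(rollout(step(state, a)), gold) for a in actions}
    best = max(scores.values())
    # Cost: how much worse this action's eventual score is than the
    # best available action's eventual score.
    return {a: best - s for a, s in scores.items()}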
- Cluster-pair features built from the pairwise coreference probabilities:
  - Minimum and maximum probability of coreference
  - Average probability and average log probability of coreference
  - Average probability and log probability of coreference restricted to mention pairs of a particular grammatical type (pronoun or not)
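A sketch of computing these aggregates, where pair_prob is the already-trained pairwise model (model stacking); restricting the statistics by grammatical type would simply filter the mention pairs first:

import math

def cluster_pair_features(cluster_a, cluster_b, pair_prob):
    # Aggregate the mention-pair probabilities between two clusters.
    probs = [pair_prob(m1, m2) for m1 in cluster_a for m2 in cluster_b]
    return {
        "min_prob": min(probs),
        "max_prob": max(probs),
        "avg_prob": sum(probs) / len(probs),
        "avg_log_prob": sum(math.log(p) for p in probs) / len(probs),
    }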
- Whether a preceding mention pair in the list of mention pairs has the same candidate anaphor as the current one
- The index of the current mention pair in the list divided by the size of the list (what percentage of the list have we seen so far?)
- …
- The entity-centric model doesn't rely on sparse lexical features; instead, it employs model stacking to exploit strong features (scores learned by the pairwise models)
- Dataset: OntoNotes (CoNLL-2012 shared task)
  - Training: 2,802 documents; development: 343; test: 345
  - Uses the provided pre-processing (parse trees, named entities, etc.)
- Common evaluation metrics
  - MUC, B³, CEAFE
  - CoNLL F1 (the average F1 score of the three metrics)
  - CoNLL scorer version 8.01
- Rule-based mention detection (Raghunathan et al., 2010)
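The CoNLL score is simply the unweighted mean of the three metric F1 scores; a one-liner makes the relationship explicit:

def conll_f1(muc_f1, b_cubed_f1, ceafe_f1):
    # CoNLL F1 = average of MUC, B^3, and CEAFE F1 scores
    return (muc_f1 + b_cubed_f1 + ceafe_f1) / 3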
- This work primarily optimizes for the B³ metric during training
- State-of-the-art systems use latent antecedents to learn scoring functions over mention pairs, but are trained to maximize global objective functions rather than a coreference evaluation metric
- Post-processing of mention pair and ranking models
  - Closest-first clustering (Soon et al., 2001)
  - Best-first clustering (Ng and Cardie, 2002)
- Global inference models
  - Integer linear programming (Denis and Baldridge, 2007; Finkel and Manning, 2008)
  - Graph partitioning (McCallum and Wellner, 2005; Nicolae and Nicolae, 2006)
  - Correlational clustering (McCallum and Wellner, 2003; Finley and Joachims, 2005)
- Non-local entity-level information
  - Cluster models (Luo et al., 2004; Yang et al., 2008; Rahman and Ng, 2011)
  - Joint inference (McCallum and Wellner, 2003; Culotta et al., 2006; Poon and Domingos, 2008; Haghighi and Klein, 2010)
- Learning trajectories of decisions
  - Imitation learning (Daumé et al., 2005; Ma et al., 2014)
  - Structured perceptron (Stoyanov and Eisner, 2012; Fernandes et al., 2012; Björkelund and Kuhn, 2014)
- The entity-centric model uses the scores produced by mention pair models as features (model stacking)
- Action costs are derived from standard coreference evaluation metrics
- Imitation learning can be used to learn how to build up coreference chains incrementally
- The proposed model outperforms the commonly used best-first method and the current state of the art