Slide 1


COLING 2012 review (eNLP version)
Mamoru Komachi
2012/12/17
Educational NLP research group, Computational Linguistics Lab
Nara Institute of Science and Technology, Japan

Slide 2


Disclaimer
• This is not a complete list, so please take a look at the paper list yourself!
• I haven't read the papers yet; these are my impressions from the presentations (oral, poster, demo) of the work. Please refer to the papers themselves if you are interested.

Slide 3


COLING 2012 ORAL
• Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
• Modeling ESL Word Choice Similarities by Representing Word Intensions and Extensions
• Problems in Evaluating Grammatical Error Detection Systems
• Mining Words in the Minds of Second Language Learners
• Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
• Robust, Lexicalized Native Language Identification
• Native Language Identification using Recurring N-grams

Slide 4


Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi and Yuji Matsumoto (NAIST, Japan)
Problem: Spelling errors and POS tags interact, but each task has been solved separately
Idea: Jointly perform spelling correction and POS tagging with a variable-length CRF (to deal with split/merge errors)
• The joint model outperforms the pipeline model
• Shorter outputs due to removal of delimiters

Slide 5


Modeling ESL Word Choice Similarities by Representing Word Intensions and Extensions
Huichao Xue and Rebecca Hwa (University of Pittsburgh, USA)
Problem: Constructing a confusion set for grammatical error correction often relies on a manually corrected learner corpus
Idea: Use only a native corpus to create confusion sets by applying relevance component analysis
• Better confusion sets can be learned from bilingual and native corpora
• The created confusion sets correlate well with real mistakes

Slide 6


Problems in Evaluating Grammatical Error Detection Systems
Martin Chodorow, Markus Dickinson, Ross Israel and Joel Tetreault (City University of New York, USA)
Problem: Many evaluation metrics have been used for grammatical error detection, but none of them addresses the issue of data skewness
Idea: Propose best practices (see the sketch below)
• Report raw frequencies (tp, fn, fp, tn)
  – Also report how you define true negatives
• Treat unit size (exact match/overlap) carefully
• Consider weighting the reliability of judgments
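As an illustration of why reporting the raw frequencies matters, here is a minimal sketch (not from the paper; all numbers are hypothetical) showing how precision, recall, F1 and accuracy follow from the four counts. With skewed data, accuracy stays high even when a detector finds almost no errors, which is exactly why the counts themselves should be reported.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute standard metrics from raw error-detection counts.

    tp: erroneous units correctly flagged
    fp: correct units wrongly flagged
    fn: erroneous units missed
    tn: correct units correctly left alone (how these are defined must be reported!)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical skewed data: only 50 of 1000 units contain an error.
# A weak detector that finds just 10 of them still reaches 95.5% accuracy.
print(detection_metrics(tp=10, fp=5, fn=40, tn=945))
```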

Slide 7


Mining Words in the Minds of Second Language Learners
Yo Ehara, Issei Sato, Hidekazu Oiwa and Hiroshi Nakagawa (University of Tokyo, Japan)
Problem: Though there are many studies on measuring the size of learners' vocabulary, few studies address what kind of words they know
Idea: Define a learner-specific word difficulty measure (see the sketch below)
• A theoretically sound and practically useful extension of previous models
• Able to obtain an interpretable weight vector
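As a rough intuition for a learner-specific difficulty measure (this is a generic Rasch-style sketch, not the authors' actual model; all names and numbers are illustrative), one can model the probability that a given learner knows a given word as a logistic function of a learner-ability term minus a word-difficulty term:

```python
import math

def p_known(ability, difficulty):
    # Probability that a learner with the given ability knows a word of the
    # given difficulty (simple Rasch-style logistic model; illustrative only).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values: a common word for a strong learner vs. a rare word for a weaker one.
print(p_known(ability=1.5, difficulty=0.3))
print(p_known(ability=-0.5, difficulty=2.0))
```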

Slide 8


Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
Joel Tetreault, Daniel Blanchard, Aoife Cahill and Martin Chodorow (ETS, USA)
Problem: Previous NLI work uses ICLE, but the corpus is highly skewed
Idea: Create a new balanced corpus (TOEFL11) and evaluate across corpora
• Many trends from previous work on ICLE generalize to other corpora
• Training on a large corpus and testing on a smaller one works well, but not vice versa
• Accuracy varies across proficiency levels

Slide 9


Robust, Lexicalized Native Language Identification
Julian Brooke and Graeme Hirst (University of Toronto, Canada)
Problem: Previous NLI research uses only small single corpora, which limits the use of lexical features
Idea: Extract an ESL corpus from Lang-8 to use lexical features and perform cross-corpus evaluation
• Shallow lexical features contribute much more than sophisticated syntactic features
• Domain adaptation gives improvement
• Evaluation on a single corpus may be questionable

Slide 10


Native Language Identification using Recurring n-grams
Serhiy Bykh and Detmar Meurers (Universitaet Tuebingen, Germany)
Problem: Since NLI is a relatively new field, features for the task are not well studied
Idea: Explore surface/Open-Class-POS/POS n-gram features and perform cross-corpus evaluation (see the sketch below)
• The finer-grained the features, the better the accuracy
• Features learned from ICLE generalize well to other corpora, unlike (Brooke and Hirst, 2011), which uses Lang-8 as a training corpus for NLI
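To make this kind of setup concrete, here is a minimal sketch of NLI framed as n-gram based text classification with scikit-learn. It is not any of the authors' systems: the example essays, L1 label codes, and model choice (logistic regression over word 1-2-gram counts, i.e. the simplest surface-feature setting) are all illustrative assumptions.

```python
# Minimal NLI-as-text-classification sketch (illustrative, not the authors' setup).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: learner essays paired with L1 labels.
essays = ["I am agree with this opinion ...",
          "He suggested me to go there ...",
          "In my country is very common ..."]
labels = ["ES", "JA", "IT"]  # illustrative L1 codes

# Surface word n-gram counts feeding a linear classifier; surface n-grams are
# the simplest of the feature types explored for NLI.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(essays, labels)

print(model.predict(["I am agree that it is very common ..."]))
```

Cross-corpus evaluation then simply means fitting such a model on one corpus (e.g. ICLE) and scoring it on essays drawn from a different corpus.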

Slide 11


COLING 2012 POSTERS
• The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
• Defining Syntax for Learner Language Annotation

Slide 12


The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto (NAIST, Japan)
Problem: Until recently, no large-scale ESL corpus has been publicly available for grammatical error correction
Idea: Extract an ESL corpus from the web and examine the effect of corpus size on grammatical error correction
• Phrase-based SMT trained on large-scale data is effective for preposition, article, and lexical choice errors
• Syntax and discourse information are needed for tense, agreement, and noun number errors

Slide 13


Defining Syntax for Learner Language Annotation
Marwa Ragheb and Markus Dickinson (Indiana University, USA)
Problem: Though POS annotation has been proposed for ESL language, annotating syntax for learner language is not well studied
Idea: Investigate multi-layered annotation (morphological dependencies, distributional dependencies, and subcategorization) for ESL texts
• Subcategorization seems preferable to the other two layers, since ESL texts are often hard to parse
• Open question: how can we generalize this framework to other non-canonical languages?

Slide 14


Summary
• Introduced eNLP-related papers presented at COLING 2012
• A lot of work on native language identification has been done (there will be a shared task on NLI at BEA-8, co-located with NAACL 2013)
• Cross-corpus scalability is important
• Future research should go beyond the surface and POS levels (semantic, syntactic and discourse information should be investigated)