COLING 2012 review

A review of papers presented at COLING 2012. I introduce papers related to educational applications of natural language processing.

Mamoru Komachi

December 20, 2012

Transcript

  1. COLING 2012 review (eNLP version)
    Mamoru Komachi
    2012/12/17
    Educational NLP research group, Computational Linguistics Lab, Nara Institute of Science and Technology, Japan
  2. Disclaimer
    • This is not a complete list, so please look through the paper list yourselves!
    • I haven’t read any of the papers yet. I will talk about my impressions from the presentations (oral, poster, demo) of the work. Please refer to the papers themselves if you are interested :)
  3. COLING 2012 ORAL
    • Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
    • Modeling ESL Word Choice Similarities by Representing Word Intensions and Extensions
    • Problems in Evaluating Grammatical Error Detection Systems
    • Mining Words in the Minds of Second Language Learners
    • Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
    • Robust, Lexicalized Native Language Identification
    • Native Language Identification using Recurring n-grams
  4. Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
    Keisuke Sakaguchi, Tomoya Mizumoto, Mamoru Komachi and Yuji Matsumoto (NAIST, Japan)
    Problem: Spelling errors and POS tagging errors often coincide, but each task has been solved separately
    Idea: Jointly perform spelling correction and POS tagging with a variable-length CRF (to deal with split/merge errors)
    • Joint model outperforms the pipeline model
    • Shorter outputs due to removal of delimiters
  5. Modeling ESL Word Choice Similarities By Representing Word Intensions and Extensions
    Huichao Xue and Rebecca Hwa (University of Pittsburgh, USA)
    Problem: Constructing a confusion set for grammatical error correction often relies on a manually corrected learner corpus
    Idea: Use only a native corpus to create confusion sets by applying relevance component analysis
    • Better confusion sets can be learned from a bilingual corpus and a native corpus
    • Created confusion sets correlate well with real mistakes
  6. Problems in Evaluating Grammatical Error Detection Systems
    Martin Chodorow, Markus Dickinson, Ross Israel and Joel Tetreault (City University of New York, USA)
    Problem: Many evaluation metrics have been used for grammatical error detection, but none of them addresses the issue of data skewness
    Idea: Propose best practices
    • Report raw frequencies (tp, fn, fp, tn)
      – Also report how you define true negatives
    • Treat unit size (exact match/overlap) carefully
    • Consider weighting the reliability of judgments
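
The advice on slide 6 to report raw frequencies can be illustrated with a minimal sketch. The counts are hypothetical (they are not taken from the paper), and the function name `detection_metrics` is an assumption introduced only for this illustration. With skewed data, a system that flags nothing still gets high accuracy, which is why the raw tp/fp/fn/tn counts are worth reporting alongside any derived score.

```python
def detection_metrics(tp, fp, fn, tn):
    """Derive common scores from raw error-detection counts.

    Hypothetical helper for illustration; not the paper's code.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical skewed data: 1000 tokens, only 50 of them are errors.
# A baseline that never flags an error:
baseline = detection_metrics(tp=0, fp=0, fn=50, tn=950)
# A system that finds 30 of the 50 errors with 20 false alarms:
system = detection_metrics(tp=30, fp=20, fn=20, tn=930)

for name, scores in (("baseline", baseline), ("system", system)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this made-up setting the two systems differ by only one point of accuracy, while precision, recall and F1 separate them clearly. Note that the true negative count depends on what is counted as a unit (token, sentence, span), which matches the slide's point about defining true negatives explicitly.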
  7. Mining Words in the Minds of Second Language Learners
    Yo Ehara, Issei Sato, Hidekazu Oiwa and Hiroshi Nakagawa (University of Tokyo, Japan)
    Problem: Though there are many studies on measuring the size of learners’ vocabulary, few studies address what kind of words they know
    Idea: Define a learner-specific word difficulty measure
    • Theoretically sound and practically useful extension to previous models
    • Able to obtain an interpretable weight vector
  8. Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
    Joel Tetreault, Daniel Blanchard, Aoife Cahill and Martin Chodorow (ETS, USA)
    Problem: Previous NLI work uses ICLE, but the corpus is highly skewed
    Idea: Create a new balanced corpus (TOEFL11) and evaluate across corpora
    • Many trends in previous work on ICLE generalize to other corpora
    • Training on a large corpus and testing on a smaller one works well, but not vice versa
    • Accuracy varies across proficiency levels
  9. Robust, Lexicalized Native Language Identification
    Julian Brooke and Graeme Hirst (University of Toronto, Canada)
    Problem: Previous NLI research uses only small single corpora, which limits the use of lexical features
    Idea: Extract an ESL corpus from Lang-8 to use lexical features and perform cross-corpus evaluation
    • Shallow lexical features contribute much more than sophisticated syntactic features
    • Domain adaptation gives improvement
    • Evaluation on a single corpus may be questionable
  10. Native Language Identification using Recurring n-grams
    Serhiy Bykh and Detmar Meurers (Universitaet Tuebingen, Germany)
    Problem: Since NLI is a new field, features for the task are not well studied
    Idea: Explore surface/Open-Class-POS/POS n-gram features and evaluate across corpora
    • The finer the features, the better the accuracy
    • Features learned from ICLE generalize well to other corpora, unlike (Brooke and Hirst, 2011), which uses Lang-8 as a training corpus for NLI
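
The n-gram features on slide 10 can be sketched roughly as follows. This is a minimal illustration under a stated assumption: "recurring" is taken here to mean an n-gram that occurs in at least two different training texts, which is my reading and is not confirmed by the slide. The function names and the toy documents are hypothetical. The same function works for surface or POS n-grams, depending on whether word tokens or POS tags are passed in.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recurring_ngrams(docs, max_n):
    """Collect n-grams (1 <= n <= max_n) found in at least two documents.

    Assumption: 'recurring' = document frequency >= 2.
    """
    doc_freq = Counter()
    for tokens in docs:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        # Count each n-gram once per document, however often it repeats.
        doc_freq.update(seen)
    return {gram for gram, count in doc_freq.items() if count >= 2}


# Toy, made-up "essays" (already tokenized).
docs = [
    ["i", "am", "student"],
    ["i", "am", "happy"],
    ["he", "is", "happy"],
]
features = recurring_ngrams(docs, max_n=2)
print(sorted(features))
```

Each text could then be represented as a binary vector over `features` and passed to any standard classifier; that step is omitted here.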
  11. COLING 2012 POSTERS
    • The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
    • Defining Syntax for Learner Language Annotation
  12. The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
    Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto (NAIST, Japan)
    Problem: Until recently, no large-scale ESL corpus has been publicly available for grammatical error correction
    Idea: Extract an ESL corpus from the web and see the effect of corpus size on grammatical error correction
    • Phrase-based SMT trained on large-scale data is effective for preposition, article and lexical choice errors
    • Syntax and discourse information is needed for tense, agreement and noun number errors
  13. Defining Syntax for Learner Language Annotation
    Marwa Ragheb and Markus Dickinson (Indiana University, USA)
    Problem: Though POS annotation has been proposed for ESL language, annotating syntax for learner language is not well studied
    Idea: Investigate multi-layered annotation (morphological dependencies, distributional dependencies, and subcategorization) for ESL texts
    • Subcategorization seems preferable to the other two layers, since ESL texts are often hard to parse
    • Open question: how can we generalize this framework to other non-canonical languages?
  14. Summary
    • Introduced eNLP-related papers presented at COLING 2012
    • A lot of work has been done on native language identification (there will be a shared task on NLI at BEA-8, co-located with NAACL 2013)
    • Cross-corpus scalability is important
    • Future research should go beyond the surface and POS level (semantic, syntactic and discourse information should be investigated)