
Evaluation of the Bible as a Resource for Cross-Language Information Retrieval

Yemane
July 12, 2016

Peter A. Chew, Steve J. Verzi, Travis L. Bauer and Jonathan T. McClain
Sandia National Laboratories
P.O. Box 5800, Albuquerque, NM 87185, USA

Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 68–74, Sydney, July 2006.
© 2006 Association for Computational Linguistics



Transcript

  1. EVALUATION OF THE BIBLE AS A RESOURCE FOR CROSS-LANGUAGE INFORMATION RETRIEVAL
     July 11, 2016
     Peter A. Chew, Steve J. Verzi, Travis L. Bauer and Jonathan T. McClain
     Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185, USA
     Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 68–74, Sydney, July 2006.
     © 2006 Association for Computational Linguistics
  2. Introduction
     • CLIR – Cross-Language Information Retrieval
     • Motivation – the Bible has translations in a very large number of languages
     • Purpose – an empirical evaluation of the Bible as a resource for cross-language information retrieval (CLIR)
     • Methodology – cross-language comparison
     • Result – demonstrated the usefulness of the Bible for this task
  3. Introduction
     • Project questions:
       • Which ideas in global discourse are the most popular?
       • How does the popularity of ideas change over time?
     • Aim – cluster documents harvested from the internet by their ideology
       • i.e., determine how ideologically aligned documents are, regardless of language
     • A parallel corpus is required for this task
  4. Parallel multilingual corpora: available alternatives
     • Cross-Language Evaluation Forum (CLEF) [Gonzalo 2001], www.clef-campaign.org
       • Based on news documents or governmental communications
       • Related CLIR work has been trained on the Hansard corpus
     • The BBC news website, http://news.bbc.co.uk – available in 34 languages
     • Limitations:
       • CLEF operates only on European languages
       • The Canadian Hansard corpus covers only English and French
       • Languages like Arabic are not represented
       • BBC articles may not have the same content across languages
  5. The Bible [Resnik, Olsen and Diab (1999)]
     • World's most translated book: 2,100+ languages
     • Easily available
     • Variety of styles
     • Great care is taken over the translations
     • Parallel alignment on a verse-by-verse basis
     • Its vocabulary appears to have a high rate of coverage (85%) of modern-day language
     • The Bible is relatively small
  6. Methods for Cross-Language Comparison
     • Implementation – Sandia Text Analysis Extensible Library (STANLEY)
     • IR based on the vector-space model
     • Term weighting based on log-entropy
     • Two cross-language comparison methods
     • The first method (Method 1) involves creating a separate textual model for each 'minimal unit'
       • A 'minimal unit' is a single verse or a group of verses
       • For language λ, a set of models (m_1,λ, m_2,λ, …, m_n,λ)
       • If the minimal unit is a verse, then n = 31,102 (n = number of verses)
     • (A sketch of the per-verse models and log-entropy weighting follows below.)
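
    The slides do not show STANLEY's code, so the following is only a minimal
    Python sketch of Method 1's building blocks: one term vector per verse,
    weighted by log-entropy. All function and variable names are illustrative
    assumptions, not the library's actual API.

        import math
        from collections import Counter

        def log_entropy_models(verses):
            """verses: one token list per 'minimal unit' (e.g. per verse).
            Returns one log-entropy-weighted term vector (dict) per verse."""
            n = len(verses)                     # e.g. 31,102 for the Bible
            tf = [Counter(v) for v in verses]   # local term frequencies
            gf = Counter()                      # global term frequencies
            for counts in tf:
                gf.update(counts)
            # Global weight g_t = 1 + sum_d (p * log p) / log n, p = tf/gf
            entropy = Counter()
            for counts in tf:
                for t, f in counts.items():
                    p = f / gf[t]
                    entropy[t] += p * math.log(p)
            g = {t: 1.0 + entropy[t] / math.log(n) for t in gf}
            # Local weight log(1 + tf), scaled by the global weight
            return [{t: g[t] * math.log(1 + f) for t, f in counts.items()}
                    for counts in tf]

    Run once per language: with verses as the minimal unit, this yields the
    set (m_1,λ, …, m_n,λ) of n = 31,102 models for each language λ.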
  7. Comparing documents across languages
     • To compare document d_i (e.g. English) with document d_j (e.g. Russian):
       • Treat the text of each document as a query against all of the models in its own language
       • d_i is evaluated against m_1,English, m_2,English, …, m_n,English to give sim_i,1, sim_i,2, …, sim_i,n, where sim_x,y represents the similarity of d_x to model m_y in the same language
       • d_j is evaluated similarly against the Russian models
     • The similarity between d_i and d_j is the cosine between these two similarity vectors (see the sketch below)
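
    Concretely, the comparison is cosine similarity applied twice: once
    within each language, then once across the two resulting vectors. A
    minimal sketch, reusing log_entropy_models from above; the helper names
    are assumptions:

        import math

        def cosine(u, v):
            """Cosine similarity of two sparse vectors stored as dicts."""
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        def similarity_vector(doc_vec, models):
            """Score one document against all n verse models of its language."""
            return [cosine(doc_vec, m) for m in models]

        def cross_language_similarity(d_i, models_i, d_j, models_j):
            s_i = similarity_vector(d_i, models_i)  # sim_i,1 ... sim_i,n
            s_j = similarity_vector(d_j, models_j)  # sim_j,1 ... sim_j,n
            # The verse index is the coordinate system shared across
            # languages, so the two n-dimensional vectors are comparable.
            return cosine(dict(enumerate(s_i)), dict(enumerate(s_j)))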
  8. Method 2
     • Build a single set of textual models for all translations
       • m_1 might represent a model based on the concatenation of one verse in English, Russian, Arabic, and so on
     • Advantage: content of a document in another language is not ignored, i.e. the input can be multilingual (see the sketch below)
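
    A sketch of how such concatenated models could be assembled, feeding into
    log_entropy_models from the earlier sketch; the bibles structure is an
    assumption for illustration:

        def concatenated_models(bibles):
            """bibles: dict mapping language -> list of token lists,
            aligned verse-by-verse across languages.
            Returns one merged token list per verse, spanning every language."""
            n = len(next(iter(bibles.values())))
            merged = [[] for _ in range(n)]
            for verses in bibles.values():
                for k, tokens in enumerate(verses):
                    merged[k].extend(tokens)
            return merged  # then weight with log_entropy_models(merged)

    Because each model mixes all languages, a query document can itself mix
    languages and still match the right verses.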
  9. Validation of the Bible as a resource for CLIR
     • Initial analysis
       • Build a matrix of all verses (31,102 × 31,102) for each language pair
       • The cell in row m and column n contains a number between 0 and 1 representing the similarity of verse m in one language to verse n in the other language
       • High diagonal values indicate good similarity: verse n in one language is most similar to verse n in the other, for all n (see the sketch below)
     • English–Russian similarity is lower, owing to Russian's inflectional morphology
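
    One simple way to quantify "high diagonal values" is the fraction of
    verses whose best cross-language match is the same verse. A minimal
    sketch; the metric and its name are an assumption, not a statistic
    reported in the paper:

        def diagonal_accuracy(sim):
            """sim[m][n]: similarity of verse m (language A) to verse n
            (language B), each value between 0 and 1.
            Returns the fraction of rows whose maximum lies on the
            diagonal, i.e. how often verse m's best match is itself."""
            hits = sum(1 for m, row in enumerate(sim)
                       if max(range(len(row)), key=row.__getitem__) == m)
            return hits / len(sim)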
  10. 1) Simple validation
     • The CLIR algorithm is:
       • trained on the entire Bible
       • validated on the FQS (2006) and RALI (2006) corpora
     • In 4 out of 5 cases, the engine identified the similar documents English→Spanish as well as Spanish→English
     • Mean average precision: 0.8
     • Recall: 1.0
  11. 2) Validation on a larger test set
     • On a larger test set there is more chance of wrong predictions
     • The CLIR engine is trained on the Bible and validated against the 114 suras of the Quran
     • A four-by-four-way test is performed using 4 languages
       • e.g. the correct Spanish document was retrieved for 60 of the 114 English documents
     • The results outperform previous research
  12. Discussion
     • The CLIR engine is language-independent and easily extensible by:
       • adding languages
       • adding corpora
     • Why has the Bible not been used more widely as a multilingual resource for research?
       • The domain is limited (religious)
         • But the Bible deals with universal human concerns (life, death, war, love, …)
       • The language is archaic
         • This issue concerns translation style, not content
     • Future work
       • Evaluate (statistically) the faithfulness of translations to the originals (Hebrew and Greek)
       • Experiment with morphemes, rather than words, as the minimal unit
       • Study the effect of homographic cognates on performance, e.g. French coin 'corner' vs. English 'coin'