
Evaluation of the Bible as a Resource for Cross-Language Information Retrieval

Yemane
July 12, 2016

Peter A. Chew, Steve J. Verzi, Travis L. Bauer and Jonathan T. McClain
Sandia National Laboratories
P.O. Box 5800, Albuquerque, NM 87185, USA

Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 68–74, Sydney, July 2006.
© 2006 Association for Computational Linguistics



Transcript

  1. EVALUATION OF THE BIBLE AS A RESOURCE FOR CROSS-LANGUAGE INFORMATION RETRIEVAL
     July 11, 2016
     Peter A. Chew, Steve J. Verzi, Travis L. Bauer and Jonathan T. McClain
     Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185, USA
     Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 68–74, Sydney, July 2006.
     © 2006 Association for Computational Linguistics
  2. Introduction
     • CLIR – Cross-Language Information Retrieval
     • Motivation – the Bible has translations in a very large number of languages
     • Purpose – an empirical evaluation of the Bible as a resource for cross-language information retrieval (CLIR)
     • Methodology – cross-language comparison
     • Result – demonstrated the usefulness of the Bible for this task
  3. Introduction
     • Project questions:
       • Which ideas in global discourse are the most popular?
       • How does the popularity of ideas change over time?
     • Aim – cluster documents harvested from the internet by their ideology
       • i.e., determine how ideologically aligned documents are, regardless of language
     • A parallel corpus is required for this task
  4. Parallel multilingual corpora: available alternatives
     • Cross-Language Evaluation Forum (CLEF) [Gonzalo 2001], www.clef-campaign.org
       • Based on news documents or governmental communications
       • Related CLIR work has been trained on the Hansard corpus
     • The BBC news website, http://news.bbc.co.uk – available in 34 languages
     • Limitations:
       • CLEF operates only on European languages
       • The Canadian Hansard corpus covers only English and French
       • Languages like Arabic are not represented
       • BBC articles may not have the same content across languages
  5. The Bible [Resnik, Olsen and Diab (1999)]
     • World's most translated book: 2,100+ languages
     • Easily available
     • Variety of styles
     • Great care is taken over the translations
     • Parallel alignment on a verse-by-verse basis
     • Its vocabulary appears to have a high rate of coverage (85%) of modern-day language
     • The Bible is relatively small
  6. Methods for Cross-Language Comparison
     • Implementation – Sandia Text Analysis Extensible Library (STANLEY)
     • IR based on the vector-space model
     • Term weighting based on log-entropy
     • Two cross-language comparison methods
     • The first method (Method 1) involves creating a separate textual model for each 'minimal unit'
       • A 'minimal unit' is a single verse or a group of verses
       • For language λ, a set of models (m_1,λ, m_2,λ, …, m_n,λ)
       • If the minimal unit is a verse, then n = 31,102 (n = number of verses)
     • (A sketch of the per-verse models and log-entropy weighting follows below.)
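
    The slides do not show STANLEY's code, so the following is only a minimal
    Python sketch of Method 1's building blocks: one term vector per verse,
    weighted by log-entropy. All function and variable names are illustrative
    assumptions, not the library's actual API.

        import math
        from collections import Counter

        def log_entropy_models(verses):
            """verses: one token list per 'minimal unit' (e.g. per verse).
            Returns one log-entropy-weighted term vector (dict) per verse."""
            n = len(verses)                     # e.g. 31,102 for the Bible
            tf = [Counter(v) for v in verses]   # local term frequencies
            gf = Counter()                      # global term frequencies
            for counts in tf:
                gf.update(counts)
            # Global weight g_t = 1 + sum_d (p * log p) / log n, p = tf/gf
            entropy = Counter()
            for counts in tf:
                for t, f in counts.items():
                    p = f / gf[t]
                    entropy[t] += p * math.log(p)
            g = {t: 1.0 + entropy[t] / math.log(n) for t in gf}
            # Local weight log(1 + tf), scaled by the global weight
            return [{t: g[t] * math.log(1 + f) for t, f in counts.items()}
                    for counts in tf]

    Run once per language: with verses as the minimal unit, this yields the
    set (m_1,λ, …, m_n,λ) of n = 31,102 models for each language λ.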
  7. Comparing documents across languages
     • To compare document d_i (e.g. English) with document d_j (e.g. Russian):
       • Treat the text of each document as a query against all of the models in its own language
       • d_i is evaluated against m_1,English, m_2,English, …, m_n,English to give sim_i,1, sim_i,2, …, sim_i,n, where sim_x,y represents the similarity of d_x to model m_y in the same language
       • d_j is evaluated similarly against the Russian models
     • The similarity between d_i and d_j is the cosine between these two similarity vectors (see the sketch below)
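
    Concretely, the comparison is cosine similarity applied twice: once
    within each language, then once across the two resulting vectors. A
    minimal sketch, reusing log_entropy_models from above; the helper names
    are assumptions:

        import math

        def cosine(u, v):
            """Cosine similarity of two sparse vectors stored as dicts."""
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        def similarity_vector(doc_vec, models):
            """Score one document against all n verse models of its language."""
            return [cosine(doc_vec, m) for m in models]

        def cross_language_similarity(d_i, models_i, d_j, models_j):
            s_i = similarity_vector(d_i, models_i)  # sim_i,1 ... sim_i,n
            s_j = similarity_vector(d_j, models_j)  # sim_j,1 ... sim_j,n
            # The verse index is the coordinate system shared across
            # languages, so the two n-dimensional vectors are comparable.
            return cosine(dict(enumerate(s_i)), dict(enumerate(s_j)))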
  8. Method 2
     • Build a single set of textual models for all translations
       • m_1 might represent a model based on the concatenation of one verse in English, Russian, Arabic, and so on
     • Advantage: content of a document in another language is not ignored, i.e. the input can be multilingual (see the sketch below)
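
    A sketch of how such concatenated models could be assembled, feeding into
    log_entropy_models from the earlier sketch; the bibles structure is an
    assumption for illustration:

        def concatenated_models(bibles):
            """bibles: dict mapping language -> list of token lists,
            aligned verse-by-verse across languages.
            Returns one merged token list per verse, spanning every language."""
            n = len(next(iter(bibles.values())))
            merged = [[] for _ in range(n)]
            for verses in bibles.values():
                for k, tokens in enumerate(verses):
                    merged[k].extend(tokens)
            return merged  # then weight with log_entropy_models(merged)

    Because each model mixes all languages, a query document can itself mix
    languages and still match the right verses.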
  9. Validation of the Bible as a resource for CLIR
     • Initial analysis
       • Build a matrix of all verses (31,102 × 31,102) for each language pair
       • The cell in row m and column n contains a number between 0 and 1 representing the similarity of verse m in one language to verse n in the other language
       • High diagonal values indicate good similarity: verse n in one language is most similar to verse n in the other, for all n (see the sketch below)
     • English–Russian similarity is lower, owing to Russian's inflectional morphology
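
    One simple way to quantify "high diagonal values" is the fraction of
    verses whose best cross-language match is the same verse. A minimal
    sketch; the metric and its name are an assumption, not a statistic
    reported in the paper:

        def diagonal_accuracy(sim):
            """sim[m][n]: similarity of verse m (language A) to verse n
            (language B), each value between 0 and 1.
            Returns the fraction of rows whose maximum lies on the
            diagonal, i.e. how often verse m's best match is itself."""
            hits = sum(1 for m, row in enumerate(sim)
                       if max(range(len(row)), key=row.__getitem__) == m)
            return hits / len(sim)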
  10. 1) Simple validation
     • The CLIR algorithm is:
       • trained on the entire Bible
       • validated on the FQS (2006) and RALI (2006) corpora
     • In 4 out of 5 cases, the engine identified the similar documents English→Spanish as well as Spanish→English
     • Mean average precision: 0.8
     • Recall: 1.0
  11. 2) Validation on a larger test set
     • On a larger test set there is more chance of wrong predictions
     • The CLIR engine is trained on the Bible and validated against the 114 suras of the Quran
     • A four-by-four-way test is performed using 4 languages
       • e.g. the correct Spanish document was retrieved for 60 of the 114 English documents
     • The results outperform previous research
  12. Discussion
     • The CLIR engine is language-independent and easily extensible by:
       • adding languages
       • adding corpora
     • Why has the Bible not been used more widely as a multilingual resource for research?
       • The domain is limited (religious)
         • But the Bible deals with universal human concerns (life, death, war, love, …)
       • The language is archaic
         • This issue concerns translation style, not content
     • Future work
       • Evaluate (statistically) the faithfulness of translations to the originals (Hebrew and Greek)
       • Experiment with morphemes, rather than words, as the minimal unit
       • Study the effect of homographic cognates on performance, e.g. French coin 'corner' vs. English 'coin'